cs.CV

[1] Graph-based Online Monitoring of Train Driver States via Facial and Skeletal Features

Olivia Nocentini,Marta Lagomarsino,Gokhan Solak,Younggeol Cho,Qiyi Tong,Marta Lorenzini,Arash Ajoudani

Main category: cs.CV

TL;DR: Proposes an online behavior-based monitoring system that uses a customized DGNN to classify train driver states; combining facial and skeletal features yields the highest accuracy.

Details Motivation: Traditional systems such as the dead-man switch offer limited functionality, so more advanced techniques are needed to improve railway safety. Method: A customized DGNN is used, with an ablation study comparing facial-only, skeletal-only, and combined features. Result: Combined features reach 80.88% accuracy in three-class classification and over 99% in binary classification; a new dataset is also introduced. Conclusion: The study advances railway safety monitoring through vision-based technology. Abstract: Driver fatigue poses a significant challenge to railway safety, with traditional systems like the dead-man switch offering limited and basic alertness checks. This study presents an online behavior-based monitoring system utilizing a customised Directed-Graph Neural Network (DGNN) to classify a train driver's state into three categories: alert, not alert, and pathological. To optimize input representations for the model, an ablation study was performed, comparing three feature configurations: skeletal-only, facial-only, and a combination of both. Experimental results show that combining facial and skeletal features yields the highest accuracy (80.88%) in the three-class model, outperforming models using only facial or skeletal features. Furthermore, this combination achieves over 99% accuracy in the binary alertness classification. Additionally, we introduced a novel dataset that, for the first time, incorporates simulated pathological conditions into train driver monitoring, broadening the scope for assessing risks related to fatigue and health. This work represents a step forward in enhancing railway safety through advanced online monitoring using vision-based technologies.

[2] OptiGait-LGBM: An Efficient Approach of Gait-based Person Re-identification in Non-Overlapping Regions

Md. Sakib Hassan Chowdhury,Md. Hafiz Ahamed,Bishowjit Paul,Sarafat Hussain Abhi,Abu Bakar Siddique,Md. Robius Sany

Main category: cs.CV

TL;DR: Proposes OptiGait-LGBM, a skeleton-model-based approach to gait recognition in unconstrained outdoor environments that outperforms existing methods in both accuracy and efficiency.

Details Motivation: Existing gait recognition systems degrade in unconstrained real-world scenes, and no dataset or method addresses outdoor environments, non-overlapping views, illumination changes, and computational efficiency at the same time. Method: A skeletal model extracts joint landmarks, from which a non-sequential dataset is built; the OptiGait-LGBM classification model is developed, and the RUET-GAIT benchmark dataset is introduced. Result: OptiGait-LGBM outperforms ensemble methods such as Random Forest and CatBoost in accuracy, memory usage, and training time. Conclusion: The method provides a low-cost, efficient, and memory-optimized gait recognition solution for real-world scenarios. Abstract: Gait recognition, known for its ability to identify individuals from a distance, has gained significant attention in recent times due to its non-intrusive verification. While video-based gait identification systems perform well on large public datasets, their performance drops when applied to real-world, unconstrained gait data due to various factors. Among these, uncontrolled outdoor environments, non-overlapping camera views, varying illumination, and computational efficiency are core challenges in gait-based authentication. Currently, no dataset addresses all these challenges simultaneously. In this paper, we propose an OptiGait-LGBM model capable of performing person re-identification under these constraints using a skeletal model approach, which helps mitigate inconsistencies in a person's appearance. The model constructs a dataset from landmark positions, minimizing memory usage by using non-sequential data. A benchmark dataset, RUET-GAIT, is introduced to represent uncontrolled gait sequences in complex outdoor environments. The process involves extracting skeletal joint landmarks, generating numerical datasets, and developing an OptiGait-LGBM gait classification model. Our aim is to address the aforementioned challenges with minimal computational cost compared to existing methods. A comparative analysis with ensemble techniques such as Random Forest and CatBoost demonstrates that the proposed approach outperforms them in terms of accuracy, memory usage, and training time. This method provides a novel, low-cost, and memory-efficient video-based gait recognition solution for real-world scenarios.

[3] SparseMeXT Unlocking the Potential of Sparse Representations for HD Map Construction

Anqing Jiang,Jinhao Chai,Yu Gao,Yiru Wang,Yuwen Heng,Zhigang Sun,Hao Sun,Zezhong Zhao,Li Sun,Jian Zhou,Lijuan Zhu,Shugong Xu,Hao Zhao

Main category: cs.CV

TL;DR: The paper proposes an optimized sparse representation method that, through a dedicated network architecture, a sparse-dense segmentation auxiliary task, and a denoising module, substantially improves HD map construction and surpasses traditional dense methods.

Details Motivation: Sparse representations are more efficient for HD map construction, but existing methods underperform because they lack tailored designs; this work aims to close that gap by improving sparse techniques. Method: A dedicated network architecture for sparse feature extraction is designed, a sparse-dense segmentation auxiliary task is introduced to exploit geometric and semantic cues, and a denoising module guided by physical priors refines the predictions. Result: State-of-the-art performance on the nuScenes dataset: SparseMeXt-Tiny reaches 55.5% mAP (32 fps), SparseMeXt-Base 65.2% mAP, and SparseMeXt-Large 68.9% mAP (over 20 fps). Conclusion: Sparse methods have great potential; they challenge the reliance on traditional dense representations and redefine the trade-off between efficiency and performance. Abstract: Recent advancements in high-definition (HD) map construction have demonstrated the effectiveness of dense representations, which heavily rely on computationally intensive bird's-eye view (BEV) features. While sparse representations offer a more efficient alternative by avoiding dense BEV processing, existing methods often lag behind due to the lack of tailored designs. These limitations have hindered the competitiveness of sparse representations in online HD map construction. In this work, we systematically revisit and enhance sparse representation techniques, identifying key architectural and algorithmic improvements that bridge the gap with, and ultimately surpass, dense approaches. We introduce a dedicated network architecture optimized for sparse map feature extraction, a sparse-dense segmentation auxiliary task to better leverage geometric and semantic cues, and a denoising module guided by physical priors to refine predictions. Through these enhancements, our method achieves state-of-the-art performance on the nuScenes dataset, significantly advancing HD map construction and centerline detection. Specifically, SparseMeXt-Tiny reaches a mean average precision (mAP) of 55.5% at 32 frames per second (fps), while SparseMeXt-Base attains 65.2% mAP. Scaling the backbone and decoder further, SparseMeXt-Large achieves an mAP of 68.9% at over 20 fps, establishing a new benchmark for sparse representations in HD map construction. These results underscore the untapped potential of sparse methods, challenging the conventional reliance on dense representations and redefining efficiency-performance trade-offs in the field.

[4] TUGS: Physics-based Compact Representation of Underwater Scenes by Tensorized Gaussian

Shijie Lian,Ziyi Zhang,Laurence Tianruo Yang,Mengyu Ren,Debin Liu,Hua Li

Main category: cs.CV

TL;DR: TUGS is an efficient underwater 3D scene reconstruction method that uses lightweight tensorized Gaussians and a physics-based module to handle complex underwater interactions while substantially reducing parameters and computational cost.

Details Motivation: Underwater 3D reconstruction is difficult because of the complex interplay between light propagation, the water medium, and object surfaces; existing methods cannot model it accurately and are computationally expensive. Method: TUGS uses lightweight tensorized higher-order Gaussians and a physics-based Adaptive Medium Estimation module to simulate underwater light attenuation and backscatter. Result: On real-world underwater datasets, TUGS delivers high reconstruction quality with fast rendering and low memory usage. Conclusion: TUGS is suited to memory-constrained underwater UAV applications and addresses the modeling and computational bottlenecks of existing methods. Abstract: Underwater 3D scene reconstruction is crucial for underwater robotic perception and navigation. However, the task is significantly challenged by the complex interplay between light propagation, water medium, and object surfaces, with existing methods unable to model their interactions accurately. Additionally, expensive training and rendering costs limit their practical application in underwater robotic systems. Therefore, we propose Tensorized Underwater Gaussian Splatting (TUGS), which can effectively solve the modeling challenges of the complex interactions between object geometries and water media while achieving significant parameter reduction. TUGS employs lightweight tensorized higher-order Gaussians with a physics-based underwater Adaptive Medium Estimation (AME) module, enabling accurate simulation of both light attenuation and backscatter effects in underwater environments. Compared to other NeRF-based and GS-based methods designed for underwater scenes, TUGS is able to render high-quality underwater images with faster rendering speeds and less memory usage. Extensive experiments on real-world underwater datasets have demonstrated that TUGS can efficiently achieve superior reconstruction quality using a limited number of parameters, making it particularly suitable for memory-constrained underwater UAV applications.

[5] Towards Understanding Deep Learning Model in Image Recognition via Coverage Test

Wenkai Li,Xiaoqi Li,Yingjie Mao,Yishun Wang

Main category: cs.CV

TL;DR: This paper studies the relationships and patterns among four coverage metrics for deep neural networks (DNNs), experimentally analyzes how model depth and configuration information relate to coverage, and proposes future research directions.

Details Motivation: As DNNs become widely used, their security testing has become a research focus, yet empirical studies of the different coverage metrics are lacking, particularly on the relationship between model depth and coverage. Method: DNNs with different architectures (LeNet, VGG, and ResNet) and 10 models ranging from 5 to 54 layers are selected, and experiments compare how depth and configuration information relate to the coverage metrics. Result: The experiments reveal how the four coverage metrics (primary functionality, boundary, hierarchy, and structural coverage) relate to model depth and configuration information, and examine the link between modified decision/condition coverage and dataset size. Conclusion: The study fills a gap in empirical research on DNN coverage metrics and proposes three future directions to advance DNN security testing. Abstract: Deep neural networks (DNNs) play a crucial role in the field of artificial intelligence, and their security-related testing has been a prominent research focus. By inputting test cases, the behavior of models is examined for anomalies, and coverage metrics are utilized to determine the extent of neurons covered by these test cases. With the widespread application and advancement of DNNs, different types of neural behaviors have garnered attention, leading to the emergence of various coverage metrics for neural networks. However, there is currently a lack of empirical research on these coverage metrics, specifically in analyzing the relationships and patterns between model depth, configuration information, and neural network coverage. This paper aims to investigate the relationships and patterns of four coverage metrics: primary functionality, boundary, hierarchy, and structural coverage. A series of empirical experiments were conducted, selecting LeNet, VGG, and ResNet as different DNN architectures, along with 10 models of varying depths ranging from 5 to 54 layers, to compare and study the relationships between different depths, configuration information, and various neural network coverage metrics. Additionally, an investigation was carried out on the relationships between modified decision/condition coverage and dataset size. Finally, three potential future directions are proposed to further contribute to the security testing of DNN models.
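
To make the idea of "the extent of neurons covered by these test cases" concrete, here is a minimal sketch of basic threshold-based neuron coverage. It is an illustration only: the abstract does not define its four metrics, so the function name `neuron_coverage`, the `threshold` parameter, and the input layout (one list of activations per test input) are all assumptions and not the paper's definitions.

```python
def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons activated above `threshold` by at least one test input.

    `activations` is a list with one entry per test input; each entry is a
    list of activation values, one per neuron (hypothetical layout).
    """
    if not activations:
        return 0.0
    num_neurons = len(activations[0])
    covered = [False] * num_neurons
    for row in activations:
        for i, value in enumerate(row):
            if value > threshold:
                covered[i] = True  # neuron i fired for at least one input
    return sum(covered) / num_neurons


# Two test inputs, three neurons.
acts = [[0.1, 0.0, 0.9],
        [0.0, 0.0, 0.2]]
print(neuron_coverage(acts, threshold=0.5))
print(neuron_coverage(acts, threshold=0.05))
```

Raising the threshold makes coverage stricter, which is one reason different coverage metrics can rank the same test suite differently.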

[6] Towards SFW sampling for diffusion models via external conditioning

Camilo Carvajal Reyes,Joaquín Fontbona,Felipe Tobar

Main category: cs.CV

TL;DR: This paper proposes an SFW sampler based on external multimodal models that guides score-based generative models (SBMs) toward safe content and away from inappropriate content.

Details Motivation: Although SBMs excel at image synthesis, they can be tricked into generating not-safe-for-work (NSFW) content. Existing methods rely on the model's own knowledge and require fine-tuning; this work explores external sources of conditioning to ensure safe outputs. Method: A Conditional Trajectory Correction step uses multimodal models (such as CLIP) to guide samples away from undesired regions, and supports user-defined NSFW classes. Result: Experiments on Stable Diffusion show that the SFW sampler effectively reduces the generation of inappropriate content with only a minor impact on image quality. Conclusion: The SFW sampler is suitable for aligned SBMs and demonstrates the potential of model-agnostic conditioning for preventing inappropriate content. Abstract: Score-based generative models (SBM), also known as diffusion models, are the de facto state of the art for image synthesis. Despite their unparalleled performance, SBMs have recently been in the spotlight for being tricked into creating not-safe-for-work (NSFW) content, such as violent images and non-consensual nudity. Current approaches that prevent unsafe generation are based on the models' own knowledge, and the majority of them require fine-tuning. This article explores the use of external sources for ensuring safe outputs in SBMs. Our safe-for-work (SFW) sampler implements a Conditional Trajectory Correction step that guides the samples away from undesired regions in the ambient space using multimodal models as the source of conditioning. Furthermore, using Contrastive Language Image Pre-training (CLIP), our method admits user-defined NSFW classes, which can vary in different settings. Our experiments on the text-to-image SBM Stable Diffusion validate that the proposed SFW sampler effectively reduces the generation of explicit content while being competitive with other fine-tuning-based approaches, as assessed via independent NSFW detectors. Moreover, we evaluate the impact of the SFW sampler on image quality and show that the proposed correction scheme comes at a minor cost with negligible effect on samples not needing correction. Our study confirms the suitability of the SFW sampler towards aligned SBM models and the potential of using model-agnostic conditioning for the prevention of unwanted images.

[7] Generative AI for Urban Planning: Synthesizing Satellite Imagery via Diffusion Models

Qingyi Wang,Yuebing Liang,Yunhan Zheng,Kaiyuan Xu,Jinhua Zhao,Shenhao Wang

Main category: cs.CV

TL;DR: The paper proposes a generative AI method based on Stable Diffusion and ControlNet that generates high-fidelity satellite imagery conditioned on land use descriptions and infrastructure information, addressing the difficulty existing methods have in producing practical designs at scale.

Details Motivation: Generative AI opens new opportunities for automated layout and flexible design exploration in urban planning, but existing methods struggle to produce large-scale, realistic, and practical designs. Method: A Stable Diffusion model is extended with ControlNet and combined with land use and constraint information from OpenStreetMap to generate conditioned satellite imagery. Result: On data from three U.S. cities, the model generates diverse and realistic urban landscapes, and quantitative and qualitative evaluations confirm its quality and robustness. Conclusion: The study sets a benchmark for controlled urban imagery generation and shows the potential of generative AI for improving planning workflows and public engagement. Abstract: Generative AI offers new opportunities for automating urban planning by creating site-specific urban layouts and enabling flexible design exploration. However, existing approaches often struggle to produce realistic and practical designs at scale. Therefore, we adapt a state-of-the-art Stable Diffusion model, extended with ControlNet, to generate high-fidelity satellite imagery conditioned on land use descriptions, infrastructure, and natural environments. To overcome data availability limitations, we spatially link satellite imagery with structured land use and constraint information from OpenStreetMap. Using data from three major U.S. cities, we demonstrate that the proposed diffusion model generates realistic and diverse urban landscapes by varying land-use configurations, road networks, and water bodies, facilitating cross-city learning and design diversity. We also systematically evaluate the impacts of varying language prompts and control imagery on the quality of satellite imagery generation. Our model achieves high FID and KID scores and demonstrates robustness across diverse urban contexts. Qualitative assessments from urban planners and the general public show that generated images align closely with design descriptions and constraints, and are often preferred over real images. This work establishes a benchmark for controlled urban imagery generation and highlights the potential of generative AI as a tool for enhancing planning workflows and public engagement.

[8] Crowd Scene Analysis using Deep Learning Techniques

Muhammad Junaid Asif

Main category: cs.CV

TL;DR: The paper proposes a model combining self-supervised training with a multi-column CNN to address the heavy annotation requirements and scene complexity of crowd counting, and a VGG19-based spatiotemporal model for anomaly detection; both perform well on several datasets.

Details Motivation: To address the need for large amounts of annotated data and the scene complexity (such as occlusion and non-uniform density) in crowd counting, as well as lighting, environmental conditions, and related issues in anomaly detection. Method: (1) Crowd counting: self-supervised training combined with a multi-column CNN (MCNN) to learn multi-scale features. (2) Anomaly detection: a VGG19-based spatiotemporal model in which a CNN extracts spatial features and LSTM blocks extract temporal features, refined with dense residual blocks. Result: The crowd counting model performs well on datasets such as ShanghaiTech and UCF-QNRF (measured by MAE and MSE); the anomaly detection model outperforms other state-of-the-art methods on the Hockey Fight and SCVD datasets. Conclusion: The proposed methods address key challenges in crowd counting and anomaly detection, and their advantage is validated on several public datasets. Abstract: Our research is focused on two main applications of crowd scene analysis: crowd counting and anomaly detection. In recent years, a large number of studies have been presented in the domain of crowd counting. We addressed two main challenges in this domain. (1) Deep learning models are data-hungry paradigms and always need a large amount of annotated data for training. Annotating such a large amount of data is a time-consuming and costly task. Self-supervised training is proposed to deal with this challenge. (2) MCNN consists of multiple columns of CNNs with different filter sizes. We present a novel approach based on a combination of self-supervised training and Multi-Column CNN. This enables the model to learn features at different levels and makes it effective in dealing with the challenges of occluded scenes, non-uniform density, complex backgrounds, and scale variation. The proposed model was evaluated on publicly available datasets such as ShanghaiTech and UCF-QNRF by means of MAE and MSE. A spatiotemporal model based on VGG19 is proposed for crowd anomaly detection, addressing challenges like lighting, environmental conditions, unexpected objects, and scalability. The model extracts spatial and temporal features, allowing it to be generalized to real-world scenes. Spatial features are learned using a CNN, while temporal features are learned using LSTM blocks. The model performs binary classification and can detect normal or abnormal behavior. The model's performance is improved by replacing fully connected layers with dense residual blocks. Experiments on the Hockey Fight dataset and the SCVD dataset show that our models outperform other state-of-the-art approaches.
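
The abstract reports crowd counting results "by means of MAE and MSE". As a small worked illustration, the sketch below computes both on per-image counts. One labeled assumption: in much of the crowd counting literature, "MSE" denotes the square root of the mean squared count error, and the code follows that convention; the abstract itself does not state which definition it uses. The function name `count_errors` is hypothetical.

```python
import math


def count_errors(predicted, actual):
    """Return (MAE, MSE) over per-image crowd counts.

    Assumption: "MSE" follows the common crowd-counting convention, i.e. the
    square root of the mean squared error between predicted and true counts.
    """
    diffs = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, mse


# Two images: predicted counts 10 and 20, true counts 12 and 17.
mae, mse = count_errors([10, 20], [12, 17])
print(mae, mse)
```

MAE reflects average accuracy, while the squared term penalizes large misses more heavily, so the second number is often read as a robustness indicator.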

[9] Generative AI for Autonomous Driving: Frontiers and Opportunities

Yuping Wang,Shuo Xing,Cui Can,Renjie Li,Hongyuan Hua,Kexin Tian,Zhaobin Mo,Xiangbo Gao,Keshu Wu,Sulong Zhou,Hengxu You,Juntong Peng,Junge Zhang,Zehao Wang,Rui Song,Mingxuan Yan,Walter Zimmer,Xingcheng Zhou,Peiran Li,Zhaohan Lu,Chia-Ju Chen,Yue Huang,Ryan A. Rossi,Lichao Sun,Hongkai Yu,Zhiwen Fan,Frank Hao Yang,Yuhao Kang,Ross Greer,Chenxi Liu,Eun Hak Lee,Xuan Di,Xinyue Ye,Liu Ren,Alois Knoll,Xiaopeng Li,Shuiwang Ji,Masayoshi Tomizuka,Marco Pavone,Tianbao Yang,Jing Du,Ming-Hsuan Yang,Hua Wei,Ziran Wang,Yang Zhou,Jiachen Li,Zhengzhong Tu

Main category: cs.CV

TL;DR: This survey reviews the use of generative artificial intelligence (GenAI) in autonomous driving, covering the principles of generative models, frontier applications, and open challenges, and offers a forward-looking reference for researchers and policymakers.

Details Motivation: To examine how GenAI can address major challenges in autonomous driving, especially reaching fully autonomous driving (Level 5). Method: Reviews the principles of modern generative models (such as VAEs, GANs, Diffusion Models, and LLMs) and their applications in autonomous driving. Result: Summarizes GenAI applications in image, LiDAR, and trajectory generation, among others, and proposes future research directions. Conclusion: GenAI offers transformative solutions for autonomous driving, but issues of generalization, safety, and ethics remain to be solved. Abstract: Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, particularly the pursuit of Level 5 autonomy. This survey delivers a comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack. We begin by distilling the principles and trade-offs of modern generative modeling, encompassing VAEs, GANs, Diffusion Models, and Large Language Models (LLMs). We then map their frontier applications in image, LiDAR, trajectory, occupancy, video generation as well as LLM-guided reasoning and decision making. We categorize practical applications, such as synthetic data workflows, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI. We identify key obstacles and possibilities such as comprehensive generalization across rare cases, evaluation and safety checks, budget-limited implementation, regulatory compliance, ethical concerns, and environmental effects, while proposing research plans across theoretical assurances, trust metrics, transport integration, and socio-technical influence. By unifying these threads, the survey provides a forward-looking reference for researchers, engineers, and policymakers navigating the convergence of generative AI and advanced autonomous mobility. An actively maintained repository of cited works is available at https://github.com/taco-group/GenAI4AD.

[10] Intelligent Road Anomaly Detection with Real-time Notification System for Enhanced Road Safety

Ali Almakhluk,Uthman Baroudi,Yasser El-Alfy

Main category: cs.CV

TL;DR: Develops a Raspberry Pi and deep learning based system that detects and classifies road damage (such as potholes and cracks) in real time and sends warnings through a cloud service to authorities and nearby vehicles to improve traffic safety.

Details Motivation: Road damage such as potholes and cracks is a common cause of traffic accidents, so a proactive detection and warning system is needed to reduce them. Method: The system combines a Raspberry Pi, a camera module, a deep learning model, and a cloud service to detect and classify road damage in real time and transmit the data. Result: The system detects and classifies road damage in real time and sends warning signals to authorities and nearby vehicles. Conclusion: Through proactive warnings and notifications, the solution is expected to improve road safety and reduce accidents caused by road damage. Abstract: This study aims to improve transportation safety, especially traffic safety. Road damage anomalies such as potholes and cracks have emerged as a significant and recurring cause of accidents. To tackle this problem and improve road safety, a comprehensive system has been developed to detect potholes and cracks (e.g. alligator, transverse, longitudinal), classify their sizes, and transmit this data to the cloud for appropriate action by authorities. The system also broadcasts warning signals to nearby vehicles if a severe anomaly is detected on the road. Moreover, the system can count road anomalies in real time. It is emulated using a Raspberry Pi, a camera module, a deep learning model, a laptop, and a cloud service. Deploying this solution aims to proactively enhance road safety by notifying relevant authorities and drivers about the presence of potholes and cracks so that they can take action, thereby mitigating potential accidents arising from this prevalent road hazard and leading to safer road conditions for the whole community.

[11] LightLab: Controlling Light Sources in Images with Diffusion Models

Nadav Magar,Amir Hertz,Eric Tabellion,Yael Pritch,Alex Rav-Acha,Ariel Shamir,Yedid Hoshen

Main category: cs.CV

TL;DR: Proposes a simple and effective diffusion-based method for fine-grained, parametric control over light sources in an image.

Details Motivation: Existing relighting methods either rely on multiple input views for inverse rendering or provide no explicit control over light changes. Method: A diffusion model is fine-tuned on a small set of real raw photograph pairs plus synthetically rendered images; the linearity of light is used to synthesize image pairs, and the model is trained for precise control over illumination changes. Result: The model provides explicit control over light intensity and color and outperforms existing methods in user preference tests. Conclusion: The method performs well at light editing and offers more flexible control. Abstract: We present a simple, yet effective diffusion-based method for fine-grained, parametric control over light sources in an image. Existing relighting methods either rely on multiple input views to perform inverse rendering at inference time, or fail to provide explicit control over light changes. Our method fine-tunes a diffusion model on a small set of real raw photograph pairs, supplemented by synthetically rendered images at scale, to elicit its photorealistic prior for relighting. We leverage the linearity of light to synthesize image pairs depicting controlled light changes of either a target light source or ambient illumination. Using this data and an appropriate fine-tuning scheme, we train a model for precise illumination changes with explicit control over light intensity and color. Lastly, we show how our method can achieve compelling light editing results, and outperforms existing methods based on user preference.
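
The "linearity of light" mentioned in the abstract can be illustrated with a short sketch: in linear (raw) color space, an image with a lamp on equals the ambient-only image plus the lamp's own contribution, so new intensities and tints can be synthesized from one on/off pair. This is a generic illustration under that assumption, not the paper's data pipeline; the function `relight`, its parameters, and the pixel layout (lists of linear RGB tuples) are hypothetical.

```python
def relight(img_off, img_on, intensity=1.0, color=(1.0, 1.0, 1.0), ambient=1.0):
    """Synthesize a relit image from a light-off / light-on pair.

    Assumes pixel values are linear RGB tuples (not gamma-encoded), so that
    img_on = img_off + light_contribution holds per channel.
    """
    result = []
    for off_px, on_px in zip(img_off, img_on):
        new_px = []
        for ch in range(3):
            # Contribution of the target light alone (clamped against noise).
            light = max(on_px[ch] - off_px[ch], 0.0)
            new_px.append(ambient * off_px[ch] + intensity * color[ch] * light)
        result.append(tuple(new_px))
    return result


off = [(0.1, 0.1, 0.1)]
on = [(0.5, 0.4, 0.3)]
print(relight(off, on, intensity=0.0))                         # lamp switched off
print(relight(off, on, intensity=2.0, color=(1.0, 0.0, 0.0)))  # brighter, red-tinted lamp
```

Setting `intensity=1.0` with a white `color` reproduces the original light-on image, which is a quick sanity check that the decomposition is consistent.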

[12] Optimizing Neuro-Fuzzy and Colonial Competition Algorithms for Skin Cancer Diagnosis in Dermatoscopic Images

Hamideh Khaleghpour,Brett McKinney

Main category: cs.CV

TL;DR: The paper proposes an AI diagnostic tool that combines image processing with machine learning algorithms (neuro-fuzzy and colonial competition methods) to distinguish benign from malignant skin lesions, reaching 94% accuracy.

Details Motivation: Skin cancer incidence is rising while public awareness is limited and clinical expertise is in short supply, creating an urgent need for advanced diagnostic aids. Method: Neuro-fuzzy and colonial competition methods are combined with image processing techniques to analyze dermoscopic images from the ISIC database. Result: 94% accuracy on a dataset of 560 images. Conclusion: The method shows potential for early melanoma detection and can support skin cancer diagnosis. Abstract: The rising incidence of skin cancer, coupled with limited public awareness and a shortfall in clinical expertise, underscores an urgent need for advanced diagnostic aids. Artificial Intelligence (AI) has emerged as a promising tool in this domain, particularly for distinguishing malignant from benign skin lesions. Leveraging publicly available datasets of skin lesions, researchers have been developing AI-based diagnostic solutions. However, the integration of such computer systems in clinical settings is still nascent. This study aims to bridge this gap by employing a fusion of image processing techniques and machine learning algorithms, specifically neuro-fuzzy and colonial competition approaches. Applied to dermoscopic images from the ISIC database, our method achieved a notable accuracy of 94% on a dataset of 560 images. These results underscore the potential of our approach in aiding clinicians in the early detection of melanoma, thereby contributing significantly to skin cancer diagnostics.

[13] Learning Cocoercive Conservative Denoisers via Helmholtz Decomposition for Poisson Inverse Problems

Deliang Wei,Peng Chen,Haobo Xu,Jiale Yao,Fang Li,Tieyong Zeng

Main category: cs.CV

TL;DR: Proposes a new CoCo denoiser that overcomes the limitations of PnP methods in Poisson inverse problems; combining Hamiltonian and spectral regularization improves denoising performance, and global convergence is proven.

Details Motivation: To address the performance problems of PnP methods in Poisson inverse problems, which stem from strong convexity or smoothness assumptions and the restriction to non-expansive denoisers. Method: A CoCo denoiser is proposed with a training strategy that combines Hamiltonian regularization and spectral regularization to ensure conservativeness and cocoerciveness; it is proven to be the proximal operator of a weakly convex function. Result: Experiments show that the method outperforms closely related methods in both visual quality and quantitative metrics. Conclusion: The CoCo denoiser gives PnP methods a more flexible, higher-performing solution suited to Poisson inverse problems. Abstract: Plug-and-play (PnP) methods with deep denoisers have shown impressive results in imaging problems. They typically require strong convexity or smoothness of the fidelity term and a (residual) non-expansive denoiser for convergence. These assumptions, however, are violated in Poisson inverse problems, and non-expansiveness can hinder denoising performance. To address these challenges, we propose a cocoercive conservative (CoCo) denoiser, which may be (residual) expansive, leading to improved denoising. By leveraging the generalized Helmholtz decomposition, we introduce a novel training strategy that combines Hamiltonian regularization to promote conservativeness and spectral regularization to ensure cocoerciveness. We prove that the CoCo denoiser is a proximal operator of a weakly convex function, enabling a restoration model with an implicit weakly convex prior. The global convergence of PnP methods to a stationary point of this restoration model is established. Extensive experimental results demonstrate that our approach outperforms closely related methods in both visual quality and quantitative metrics.

[14] Behind Maya: Building a Multilingual Vision Language Model

Nahid Alam,Karthik Reddy Kanjula,Surya Guthikonda,Timothy Chung,Bala Krishna S Vegesna,Abhipsha Das,Anthony Susevski,Ryan Sze-Yin Chan,S M Iftekhar Uddin,Shayekh Bin Islam,Roshan Santhosh,Snegha A,Drishti Sharma,Chen Liu,Isha Chaturvedi,Genta Indra Winata,Ashvanth. S,Snehanshu Mukherjee,Alham Fikri Aji

Main category: cs.CV

TL;DR: Maya is an open-source multilingual vision-language model that aims to address the shortcomings of existing models on low-resource languages and cultural diversity.

Details Motivation: Existing vision-language models perform well in widely spoken languages but fall short on low-resource languages and varied cultural contexts. Method: Building on the LLaVA pretraining dataset, a multilingual image-text dataset covering eight languages is constructed, and a multilingual image-text model supporting these languages is developed. Result: Maya improves cultural and linguistic comprehension in vision-language tasks. Conclusion: Maya offers an effective solution for multilingual and culturally diverse tasks. Abstract: In recent times, we have seen a rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages, but lack performance on low-resource languages and varied cultural contexts. To address these limitations, we introduce Maya, an open-source Multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.

[15] Differentiable Channel Selection in Self-Attention For Person Re-Identification

Yancheng Wang,Nebojsa Jojic,Yingzhen Yang

Main category: cs.CV

TL;DR: Proposes a new attention module, DCS-Attention, which improves the computation of attention weights through differentiable channel selection and significantly boosts person re-identification performance.

Details Motivation: Guided by the Information Bottleneck principle, the goal is to select the most informative channels to strengthen feature extraction. Method: The DCS-Attention module is designed to work with either fixed or learnable backbones (DCS-FB and DCS-DNAS), and an optimizable variational upper bound for the IB loss is derived. Result: On multiple person re-identification benchmarks, DCS-Attention significantly improves the prediction accuracy of DNNs. Conclusion: By learning key discriminative features, DCS-Attention performs strongly on person re-identification. Abstract: In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module features selection of informative channels in the computation of the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless integration with DNN training. Our DCS-Attention is compatible with either fixed neural network backbones or learnable backbones with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our DCS-Attention is motivated by the principle of Information Bottleneck (IB), and a novel variational upper bound for the IB loss, which can be optimized by SGD, is derived and incorporated into the training loss of the networks with the DCS-Attention modules. In this manner, a neural network with DCS-Attention modules is capable of selecting the most informative channels for feature extraction so that it enjoys state-of-the-art performance for the Re-ID task. Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy of DNNs for person Re-ID, which demonstrates the effectiveness of DCS-Attention in learning discriminative features critical to identifying person identities. The code of our work is available at https://github.com/Statistical-Deep-Learning/DCS-Attention.

Yangyi Chen,Hao Peng,Tong Zhang,Heng Ji

Main category: cs.CV

TL;DR: PRIOR is a vision-language pre-training method that prioritizes image-related tokens through differential weighting, reducing noise fitting and the risk of hallucination.

Details Motivation: In standard pre-training of large vision-language models (LVLMs), naive next-token prediction (NTP) fits noise and increases hallucination risk, because only a small subset of tokens relates directly to the visual content. Method: PRIOR uses a reference model (a text-only LLM) to assign a weight to each token and adjusts the loss according to token importance, prioritizing vision-related tokens. Result: In two LVLM settings, PRIOR achieves 19% and 8% average relative improvement, respectively, and shows better scaling behavior. Conclusion: Through differential weighting, PRIOR substantially improves the performance and scaling potential of vision-language models. Abstract: In standard large vision-language models (LVLMs) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, to weight each token based on its probability for LVLMs training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on the importance scores to adjust each token's loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data.
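
The token re-weighting idea can be sketched in a few lines. The abstract says only that tokens receiving lower probability from the text-only reference LLM should be prioritized; it does not give the weighting formula. The form used below, `(1 - p_ref) ** alpha` normalized to mean 1, is therefore an assumption chosen for illustration, and `weighted_ntp_loss` and `alpha` are hypothetical names.

```python
import math


def weighted_ntp_loss(model_probs, ref_probs, alpha=1.0):
    """Re-weighted next-token loss for a single caption.

    model_probs: probability the vision-language model gives each target token.
    ref_probs:   probability a text-only reference LLM gives the same tokens.
    Weighting (an assumption, not the paper's formula): (1 - p_ref) ** alpha,
    normalized so the weights average to 1.
    """
    raw = [(1.0 - p) ** alpha for p in ref_probs]
    n = len(raw)
    total = sum(raw)
    weights = [w * n / total for w in raw] if total > 0 else [1.0] * n
    token_losses = [-math.log(p) for p in model_probs]
    loss = sum(w * l for w, l in zip(weights, token_losses)) / n
    return loss, weights


# Token 0 is easy to guess from text alone; token 1 needs the image.
loss, weights = weighted_ntp_loss([0.8, 0.4], [0.9, 0.1])
print(weights)
print(loss)
```

With `alpha=0` every weight becomes 1 and the function reduces to the plain mean negative log-likelihood, that is, ordinary NTP.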

[17] Towards Adaptive Meta-Gradient Adversarial Examples for Visual Tracking

Wei-Long Tian,Peng Gao,Xiao Liu,Long Xu,Hamido Fujita,Hanan Aljuai,Mao-Li Wang

Main category: cs.CV

TL;DR: Proposes an adaptive meta-gradient adversarial attack (AMGA) method that exposes security vulnerabilities in visual trackers and significantly improves the transferability and attack effectiveness of adversarial examples.

Details Motivation: Security issues in deep learning models undermine the reliable use of visual tracking methods in real-world scenarios, so effective adversarial attacks are needed to reveal their vulnerabilities. Method: Multi-model ensembles and meta-learning strategies are combined with momentum mechanisms and Gaussian smoothing to optimize gradient directions and narrow the gap between white-box and black-box attacks. Result: On datasets such as OTB2015, LaSOT, and GOT-10k, AMGA significantly improves attack performance, transferability, and deception. Conclusion: AMGA provides an effective tool for security research on visual trackers; code and data are publicly available. Abstract: In recent years, visual tracking methods based on convolutional neural networks and Transformers have achieved remarkable performance and have been successfully applied in fields such as autonomous driving. However, the numerous security issues exposed by deep learning models have gradually affected the reliable application of visual tracking methods in real-world scenarios. Therefore, how to reveal the security vulnerabilities of existing visual trackers through effective adversarial attacks has become a critical problem that needs to be addressed. To this end, we propose an adaptive meta-gradient adversarial attack (AMGA) method for visual tracking. This method integrates multi-model ensembles and meta-learning strategies, combining momentum mechanisms and Gaussian smoothing, which can significantly enhance the transferability and attack effectiveness of adversarial examples. AMGA randomly selects models from a large model repository, constructs diverse tracking scenarios, and iteratively performs both white- and black-box adversarial attacks in each scenario, optimizing the gradient directions of each model. This paradigm minimizes the gap between white- and black-box adversarial attacks, thus achieving excellent attack performance in black-box scenarios. Extensive experimental results on large-scale datasets such as OTB2015, LaSOT, and GOT-10k demonstrate that AMGA significantly improves the attack performance, transferability, and deception of adversarial examples. Codes and data are available at https://github.com/pgao-lab/AMGA.
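
Two of the ingredients named in the abstract, model ensembles and momentum, can be illustrated with a generic momentum sign-gradient attack averaged over several models. This is not AMGA itself: the meta-learning loop, Gaussian smoothing, and tracking-specific losses are omitted, and the toy gradient functions, `momentum_ensemble_attack`, and its parameters are hypothetical stand-ins.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def momentum_ensemble_attack(x, grad_fns, eps=0.1, steps=5, mu=0.9):
    """Iterative sign-gradient attack with momentum over a model ensemble.

    x:        clean input as a flat list of floats.
    grad_fns: one function per model, mapping an input to the loss gradient
              with respect to that input (toy stand-ins for real trackers).
    """
    step = eps / steps
    x_adv = list(x)
    velocity = [0.0] * len(x)
    for _ in range(steps):
        # Average the gradients of all ensemble members.
        grads = [fn(x_adv) for fn in grad_fns]
        avg = [sum(col) / len(grads) for col in zip(*grads)]
        # Normalize by the L1 norm, then accumulate momentum.
        l1 = sum(abs(v) for v in avg) or 1.0
        velocity = [mu * v + a / l1 for v, a in zip(velocity, avg)]
        # Ascend the loss, then project back into the eps-ball around x.
        x_adv = [xa + step * sign(v) for xa, v in zip(x_adv, velocity)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Two toy "models" with constant gradients.
models = [lambda v: [1.0, -2.0], lambda v: [3.0, -1.0]]
print(momentum_ensemble_attack([0.0, 0.0], models))
```

Averaging over several models before taking the sign is a common way to reduce overfitting to any single model, which is the intuition behind the transferability gains that ensemble attacks aim for.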

[18] Multimodal Fusion of Glucose Monitoring and Food Imagery for Caloric Content Prediction

Adarsh Kumar

Main category: cs.CV

TL;DR: Proposes a multimodal deep learning framework that combines CGM data, demographic/microbiome data, and pre-meal food images to substantially improve caloric estimation accuracy.

Details Motivation: Accurate estimation of caloric intake is critical for managing Type 2 diabetes, but existing approaches such as CGM struggle to capture full nutritional information because of inter-individual and meal-specific variability. Method: Uses attention-based encoding and convolutional feature extraction for food images, multi-layer perceptrons for CGM and microbiome data, and a late fusion strategy for joint reasoning. Result: On a dataset of over 40 participants, the model achieves an RMSRE of 0.2544, outperforming the baseline models by over 50%. Conclusion: Multimodal sensing has the potential to improve automated dietary assessment tools for chronic disease management. Abstract: Effective dietary monitoring is critical for managing Type 2 diabetes, yet accurately estimating caloric intake remains a major challenge. While continuous glucose monitors (CGMs) offer valuable physiological data, they often fall short in capturing the full nutritional profile of meals due to inter-individual and meal-specific variability. In this work, we introduce a multimodal deep learning framework that jointly leverages CGM time-series data, demographic/microbiome data, and pre-meal food images to enhance caloric estimation. Our model utilizes attention-based encoding and convolutional feature extraction for meal imagery, and multi-layer perceptrons for CGM and microbiome data, followed by a late fusion strategy for joint reasoning. We evaluate our approach on a curated dataset of over 40 participants, incorporating synchronized CGM, demographic, and microbiome data and meal photographs with standardized caloric labels. Our model achieves a Root Mean Squared Relative Error (RMSRE) of 0.2544, outperforming the baseline models by over 50%. These findings demonstrate the potential of multimodal sensing to improve automated dietary assessment tools for chronic disease management.
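
The abstract reports RMSRE without defining it. The sketch below assumes the common definition, the square root of the mean of squared relative errors, which may differ in detail from the paper's; the function name `rmsre` is made up for this example.

```python
import math


def rmsre(predicted, actual):
    """Root mean squared relative error, assuming the common definition:
    sqrt(mean(((predicted - actual) / actual) ** 2)).
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, a in zip(predicted, actual):
        if a == 0:
            # Relative error is undefined for a zero reference value.
            raise ValueError("actual values must be non-zero")
        total += ((p - a) / a) ** 2
    return math.sqrt(total / len(actual))
```

Under this definition, an RMSRE of 0.2544 would mean the typical prediction is off by roughly a quarter of the true caloric value, with large relative misses weighted more heavily than small ones.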

[19] 2D-3D Attention and Entropy for Pose Robust 2D Facial Recognition

J. Brennan Peace,Shuowen Hu,Benjamin S. Riggan

Main category: cs.CV

TL;DR: Proposes a novel domain adaptive framework that lets 2D image representations infer properties of 3D point clouds, addressing the performance degradation caused by pose differences in facial recognition.

Details Motivation: Facial recognition performance degrades under large pose differences, so pose invariance needs to improve. Method: Uses a shared attention mapping and a joint entropy regularizing loss to strengthen the correlation between 2D and 3D representations. Result: Outperforms other methods on the FaceScape and ARL-VTF datasets, with markedly better pose invariance. Conclusion: The framework effectively addresses pose discrepancies and improves the robustness of facial recognition. Abstract: Despite recent advances in facial recognition, there remains a fundamental issue concerning degradations in performance due to substantial perspective (pose) differences between enrollment and query (probe) imagery. Therefore, we propose a novel domain adaptive framework to facilitate improved performances across large discrepancies in pose by enabling image-based (2D) representations to infer properties of inherently pose invariant point cloud (3D) representations. Specifically, our proposed framework achieves better pose invariance by using (1) a shared (joint) attention mapping to emphasize common patterns that are most correlated between 2D facial images and 3D facial data and (2) a joint entropy regularizing loss to promote better consistency, enhancing correlations among the intersecting 2D and 3D representations, by leveraging both attention maps. This framework is evaluated on FaceScape and ARL-VTF datasets, where it outperforms competitive methods by achieving profile (90°+) TAR @ 1% FAR improvements of at least 7.1% and 1.57%, respectively.

[20] OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions

Yuhang Wang,Abdulaziz Alhuraish,Shengming Yuan,Hao Zhou

Main category: cs.CV

TL;DR: OpenLKA is the first open, large-scale dataset for evaluating and improving Lane Keeping Assist (LKA) systems, containing 400 hours of driving data that cover a wide range of challenging scenarios.

Details Motivation: The real-world performance of existing LKA systems is underexplored because of proprietary systems and limited data access; OpenLKA aims to fill this gap. Method: Collects multimodal data through extensive road testing and community contributions, including CAN bus streams, high-resolution video, Openpilot outputs, and scene annotations. Result: The dataset provides a comprehensive platform for evaluating LKA performance, supporting the identification of safety-critical scenarios and the assessment of road infrastructure. Conclusion: OpenLKA is a valuable resource for studying the real-world performance of LKA systems and for assessing infrastructure readiness for autonomous driving. Abstract: Lane Keeping Assist (LKA) is widely adopted in modern vehicles, yet its real-world performance remains underexplored due to proprietary systems and limited data access. This paper presents OpenLKA, the first open, large-scale dataset for LKA evaluation and improvement. It includes 400 hours of driving data from 50+ production vehicle models, collected through extensive road testing in Tampa, Florida and global contributions from the Comma.ai driving community. The dataset spans a wide range of challenging scenarios, including complex road geometries, degraded lane markings, adverse weather, lighting conditions and surrounding traffic. The dataset is multimodal, comprising: i) full CAN bus streams, decoded using custom reverse-engineered DBC files to extract key LKA events (e.g., system disengagements, lane detection failures); ii) synchronized high-resolution dash-cam video; iii) real-time outputs from Openpilot, providing accurate estimates of road curvature and lane positioning; iv) enhanced scene annotations generated by Vision Language Models, describing lane visibility, pavement quality, weather, lighting, and traffic conditions. By integrating vehicle-internal signals with high-fidelity perception and rich semantic context, OpenLKA provides a comprehensive platform for benchmarking the real-world performance of production LKA systems, identifying safety-critical operational scenarios, and assessing the readiness of current road infrastructure for autonomous driving. The dataset is publicly available at: https://github.com/OpenLKA/OpenLKA.

[21] Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning

Dayong Liang,Changmeng Zheng,Zhiyuan Wen,Yi Cai,Xiao-Yong Wei,Qing Li

Main category: cs.CV

TL;DR: Proposes ISGR, a framework that strengthens the interactional reasoning of vision-language models (VLMs) through a dual-stream graph constructor, interaction queries, and a long-term memory reinforcement learning strategy, significantly improving performance on complex scene understanding tasks.

Details Motivation: Traditional scene graphs focus mainly on spatial relationships, which limits VLMs' ability to reason about interactions in complex visual scenes. Existing methods produce irrelevant relationship sets and lack persistent memory. Method: The ISGR framework comprises a dual-stream graph constructor (combining spatial relation extraction with interaction-aware captioning), interaction queries that activate VLMs' knowledge of object functionality, and a long-term memory reinforcement learning strategy. Result: Significantly outperforms baseline methods on interaction-heavy reasoning benchmarks, with particularly strong gains on complex scene understanding tasks. Conclusion: By strengthening interactional reasoning, ISGR offers an effective solution for complex scene understanding. Abstract: Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes. We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components. First, our dual-stream graph constructor combines SAM-powered spatial relation extraction with interaction-aware captioning to generate functionally salient scene graphs with spatial grounding. Second, we employ targeted interaction queries to activate VLMs' latent knowledge of object functionalities, converting passive recognition into active reasoning about how objects work together. Finally, we introduce a long-term memory reinforcement learning strategy with a specialized interaction-focused reward function that transforms transient patterns into long-term reasoning heuristics. Extensive experiments demonstrate that our approach significantly outperforms baseline methods on interaction-heavy reasoning benchmarks, with particularly strong improvements on complex scene understanding tasks. The source code can be accessed at https://github.com/open_upon_acceptance.

[22] Promoting SAM for Camouflaged Object Detection via Selective Key Point-based Guidance

Guoying Liang,Su Yang

Main category: cs.CV

TL;DR: Applies the Segment Anything Model (SAM) to Camouflaged Object Detection (COD), proposing a new framework that improves SAM's performance through point prompts; the authors describe it as the first application of a big model to COD.

Details Motivation: Explore the potential of big models for COD and show that SAM can be used for COD when prompted appropriately, avoiding the need to design specialized models from scratch. Method: Develops the Promotion Point Targeting Network (PPT-net) and a key point selection (KPS) algorithm, which use multi-scale features to predict the probability that an object is present and then guide SAM's segmentation contrastively. Result: Outperforms existing methods on 3 datasets under 6 metrics, validating SAM's effectiveness for COD. Conclusion: The study provides an off-the-shelf methodology for COD and shows the advantage of leveraging big models, with better performance than traditional methods and a less challenging task. Abstract: Big models have emerged as a new research paradigm that can be applied to various downstream tasks with only minor effort for domain adaptation. Correspondingly, this study tackles Camouflaged Object Detection (COD) leveraging the Segment Anything Model (SAM). Previous studies declared that SAM is not workable for COD, but this study reveals that SAM works if promoted properly, for which we devise a new framework to render point promotions: First, we develop the Promotion Point Targeting Network (PPT-net) to leverage multi-scale features in predicting the probabilities of the presence of camouflaged objects at given candidate points over the image. Then, we develop a key point selection (KPS) algorithm to deploy both positive and negative point promotions contrastively to SAM to guide the segmentation. This is the first work to apply a big model to COD, and it achieves plausible results experimentally over the existing methods on 3 datasets under 6 metrics. This study demonstrates an off-the-shelf methodology for COD by leveraging SAM, which gains advantage over designing professional models from scratch, not only in performance, but also in turning the problem into a less challenging task, that is, seeking informative but not exactly precise promotions.

[23] WSCIF: A Weakly-Supervised Color Intelligence Framework for Tactical Anomaly Detection in Surveillance Keyframes

Wei Meng

Main category: cs.CV

TL;DR: Proposes a lightweight, color-feature-based anomaly detection framework for surveillance video in resource-constrained and data-sensitive environments, aimed at quickly identifying potential threat events.

Details Motivation: Traditional deep learning models face unlabeled and non-exploitable data in high-risk security tasks, so a lightweight solution is needed. Method: Fuses unsupervised KMeans clustering with RGB channel histogram modeling to detect structural anomalies and color mutation signals in key frames. Result: Successfully identifies anomalous frames related to high-energy light sources, target presence, and reflective interference; the method is suited to tactical warning, suspicious object screening, and monitoring of drastic environmental changes. Conclusion: Color features are valuable as low-semantic battlefield signal carriers; future work will extend the approach with graph neural networks and temporal modeling. Abstract: The deployment of traditional deep learning models in high-risk security tasks in an unlabeled, data-non-exploitable video intelligence environment faces significant challenges. In this paper, we propose a lightweight anomaly detection framework based on color features for surveillance video clips in a high sensitivity tactical mission, aiming to quickly identify and interpret potential threat events under resource-constrained and data-sensitive conditions. The method fuses unsupervised KMeans clustering with RGB channel histogram modeling to achieve composite detection of structural anomalies and color mutation signals in key frames. The experiment takes an operation surveillance video occurring in an African country as a research sample, and successfully identifies multiple highly anomalous frames related to high-energy light sources, target presence, and reflective interference under the condition of no access to the original data. The results show that this method can be effectively used for tactical assassination warning, suspicious object screening and environmental drastic change monitoring with strong deployability and tactical interpretation value. The study emphasizes the importance of color features as low semantic battlefield signal carriers, and its battlefield intelligent perception capability will be further extended by combining graph neural networks and temporal modeling in the future.

[24] Beyond General Prompts: Automated Prompt Refinement using Contrastive Class Alignment Scores for Disambiguating Objects in Vision-Language Models

Lucas Choi,Ross Greer

Main category: cs.CV

TL;DR: Proposes an automated prompt refinement method based on the Contrastive Class Alignment Score (CCAS) to improve the object detection performance of vision-language models (VLMs).

Details Motivation: VLM performance depends heavily on prompt phrasing, so a method for automatically refining prompts is needed. Method: Generates diverse prompt candidates with a large language model and selects the best ones using CCAS, a semantic alignment score computed from sentence embeddings. Result: On challenging object categories, the method improves detection accuracy without additional training or labeled data. Conclusion: The method offers a scalable, model-agnostic prompt refinement pipeline for VLM-based detection systems. Abstract: Vision-language models (VLMs) offer flexible object detection through natural language prompts but suffer from performance variability depending on prompt phrasing. In this paper, we introduce a method for automated prompt refinement using a novel metric called the Contrastive Class Alignment Score (CCAS), which ranks prompts based on their semantic alignment with a target object class while penalizing similarity to confounding classes. Our method generates diverse prompt candidates via a large language model and filters them through CCAS, computed using prompt embeddings from a sentence transformer. We evaluate our approach on challenging object categories, demonstrating that our automatic selection of high-precision prompts improves object detection accuracy without the need for additional model training or labeled data. This scalable and model-agnostic pipeline offers a principled alternative to manual prompt engineering for VLM-based detection systems.
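
The abstract describes CCAS only in words: reward alignment with the target class and penalize similarity to confounding classes. The sketch below is one plausible reading, not the paper's formula. It assumes the score is the cosine similarity to the target embedding minus the mean cosine similarity to the confounder embeddings, and that embeddings are already available as plain lists of floats. The names `contrastive_score` and `rank_prompts` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_score(prompt_vec, target_vec, confounder_vecs):
    """Assumed scoring rule: target alignment minus mean confounder similarity."""
    alignment = cosine(prompt_vec, target_vec)
    penalty = sum(cosine(prompt_vec, c) for c in confounder_vecs) / len(confounder_vecs)
    return alignment - penalty


def rank_prompts(candidates, target_vec, confounder_vecs):
    """Return prompt strings sorted from highest to lowest score.

    candidates maps each prompt string to its embedding.
    """
    return sorted(
        candidates,
        key=lambda p: contrastive_score(candidates[p], target_vec, confounder_vecs),
        reverse=True,
    )
```

In practice the embeddings would come from a sentence encoder, and only the top-ranked prompts would be passed to the detector.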

[25] TopoDiT-3D: Topology-Aware Diffusion Transformer with Bottleneck Structure for 3D Point Cloud Generation

Zechao Guan,Feng Yan,Shuai Du,Lin Ma,Qingshan Liu

Main category: cs.CV

TL;DR: TopoDiT-3D is a topology-aware diffusion Transformer that uses a bottleneck structure and persistent homology to extract global topological information, significantly improving the quality and efficiency of 3D point cloud generation.

Details Motivation: Existing methods focus on local feature extraction and overlook global topological information such as voids, which is crucial for shape consistency and for capturing complex geometries. Method: Designs a bottleneck structure based on Perceiver Resampler that integrates topological information extracted through persistent homology and adaptively filters out redundant local features to improve training efficiency. Result: TopoDiT-3D outperforms existing methods in visual quality, diversity, and training efficiency, confirming the synergy between topological information and local feature learning. Conclusion: TopoDiT-3D demonstrates the importance of global topological information for 3D point cloud generation and suggests a new direction for future research. Abstract: Recent advancements in Diffusion Transformer (DiT) models have significantly improved 3D point cloud generation. However, existing methods primarily focus on local feature extraction while overlooking global topological information, such as voids, which are crucial for maintaining shape consistency and capturing complex geometries. To address this limitation, we propose TopoDiT-3D, a Topology-Aware Diffusion Transformer with a bottleneck structure for 3D point cloud generation. Specifically, we design the bottleneck structure utilizing Perceiver Resampler, which not only offers a mode to integrate topological information extracted through persistent homology into feature learning, but also adaptively filters out redundant local features to improve training efficiency. Experimental results demonstrate that TopoDiT-3D outperforms state-of-the-art models in visual quality, diversity, and training efficiency. Furthermore, TopoDiT-3D demonstrates the importance of rich topological information for 3D point cloud generation and its synergy with conventional local feature learning. Videos and code are available at https://github.com/Zechao-Guan/TopoDiT-3D.

[26] AMSnet 2.0: A Large AMS Database with AI Segmentation for Net Detection

Yichen Shi,Zhuofu Tao,Yuhao Gao,Li Huang,Hongyang Wang,Zhiping Yu,Ting-Jung Lin,Lei He

Main category: cs.CV

TL;DR: Proposes a novel segmentation-based net detection mechanism to address the limitations of multimodal large language models (MLLMs) in understanding circuit schematics, and expands the AMSnet dataset.

Details Motivation: Existing methods such as AMSnet rely on hard-coded heuristics, struggle with complex or noisy schematics, and lack high-quality training data. Method: Proposes a highly robust segmentation-based net detection mechanism that also recovers positional information, enabling digital reconstruction of schematics; expands the AMSnet dataset to create AMSnet 2.0. Result: AMSnet 2.0 contains 2,686 circuits with schematic images, Spectre-formatted netlists, OpenAccess digital schematics, and positional information for components and nets, compared with 792 circuits in the original AMSnet. Conclusion: The new method improves understanding of complex schematics, and the expanded dataset provides a richer resource for future research. Abstract: Current multimodal large language models (MLLMs) struggle to understand circuit schematics due to their limited recognition capabilities. This could be attributed to the lack of high-quality schematic-netlist training data. Existing work such as AMSnet applies schematic parsing to generate netlists. However, these methods rely on hard-coded heuristics and are difficult to apply to complex or noisy schematics. In this paper, we therefore propose a novel net detection mechanism based on segmentation with high robustness. The proposed method also recovers positional information, allowing digital reconstruction of schematics. We then expand the AMSnet dataset with schematic images from various sources and create AMSnet 2.0. AMSnet 2.0 contains 2,686 circuits with schematic images, Spectre-formatted netlists, OpenAccess digital schematics, and positional information for circuit components and nets, whereas AMSnet only includes 792 circuits with SPICE netlists but no digital schematics.

[27] DRRNet: Macro-Micro Feature Fusion and Dual Reverse Refinement for Camouflaged Object Detection

Jianlin Sun,Xiaolin Fang,Juwei Guan,Dongdong Gui,Teqi Wang,Tongxin Zhu

Main category: cs.CV

TL;DR: DRRNet is a four-stage architecture that fuses global and local features and applies reverse refinement, significantly improving camouflaged object detection performance.

Details Motivation: In camouflaged object detection, targets and backgrounds are highly similar in color, texture, and shape, and existing methods struggle to balance global semantics with local detail. Method: Uses a four-stage pipeline comprising global context feature extraction, local detail supplementation, dual-representation fusion, and a reverse refinement module. Result: Significantly outperforms existing methods on benchmark datasets. Conclusion: Through multi-scale feature fusion and reverse refinement, DRRNet effectively addresses the challenges of camouflaged object detection. Abstract: The core challenge in Camouflaged Object Detection (COD) lies in the indistinguishable similarity between targets and backgrounds in terms of color, texture, and shape. This causes existing methods to either lose edge details (such as hair-like fine structures) due to over-reliance on global semantic information or be disturbed by similar backgrounds (such as vegetation patterns) when relying solely on local features. We propose DRRNet, a four-stage architecture characterized by a "context-detail-fusion-refinement" pipeline to address these issues. Specifically, we introduce an Omni-Context Feature Extraction Module to capture global camouflage patterns and a Local Detail Extraction Module to supplement microstructural information for the full-scene context module. We then design a module for forming dual representations of scene understanding and structural awareness, which fuses panoramic features and local features across various scales. In the decoder, we also introduce a reverse refinement module that leverages spatial edge priors and frequency-domain noise suppression to perform a two-stage inverse refinement of the output. By applying two successive rounds of inverse refinement, the model effectively suppresses background interference and enhances the continuity of object boundaries. Experimental results demonstrate that DRRNet significantly outperforms state-of-the-art methods on benchmark datasets. Our code is available at https://github.com/jerrySunning/DRRNet.

[28] UniCAD: Efficient and Extendable Architecture for Multi-Task Computer-Aided Diagnosis System

Yitao Zhu,Yuan Yin,Zhenrong Shen,Zihao Zhao,Haiyu Song,Sheng Wang,Dinggang Shen,Qian Wang

Main category: cs.CV

TL;DR: UniCAD is a unified architecture for multi-task computer-aided diagnosis (CAD) built on pre-trained vision foundation models; a low-rank adaptation strategy and modular design sharply reduce the task-specific parameters required while supporting both 2D and 3D medical images.

Details Motivation: Address the lack of an open-source CAD platform in the medical imaging community and the high complexity and resource demands of pre-trained vision models. Method: Uses a low-rank adaptation strategy and a modular architecture that combines a frozen foundation model with plug-and-play expert modules. Result: Outperforms existing methods on 12 medical datasets while introducing only 0.17% trainable parameters. Conclusion: UniCAD provides an efficient, extendable open-source platform that promotes equity and efficiency in medical imaging research. Abstract: The growing complexity and scale of visual model pre-training have made developing and deploying multi-task computer-aided diagnosis (CAD) systems increasingly challenging and resource-intensive. Furthermore, the medical imaging community lacks an open-source CAD platform to enable the rapid creation of efficient and extendable diagnostic models. To address these issues, we propose UniCAD, a unified architecture that leverages the robust capabilities of pre-trained vision foundation models to seamlessly handle both 2D and 3D medical images while requiring only minimal task-specific parameters. UniCAD introduces two key innovations: (1) Efficiency: A low-rank adaptation strategy is employed to adapt a pre-trained visual model to the medical image domain, achieving performance on par with fully fine-tuned counterparts while introducing only 0.17% trainable parameters. (2) Plug-and-Play: A modular architecture that combines a frozen foundation model with multiple plug-and-play experts, enabling diverse tasks and seamless functionality expansion. Building on this unified CAD architecture, we establish an open-source platform where researchers can share and access lightweight CAD experts, fostering a more equitable and efficient research ecosystem. Comprehensive experiments across 12 diverse medical datasets demonstrate that UniCAD consistently outperforms existing methods in both accuracy and deployment efficiency. The source code and project page are available at https://mii-laboratory.github.io/UniCAD/.
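
To make the low-rank adaptation idea concrete, here is a minimal sketch of the generic LoRA formulation: the frozen weight `W` is left untouched and only a rank-`r` update `B @ A` is trained. This is not UniCAD's implementation. The function names, the plain list-of-lists matrices, and the layer sizes are assumptions for illustration, and the resulting fraction is not meant to reproduce the paper's 0.17% figure.

```python
def matvec(matrix, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def lora_forward(w_frozen, a, b, x, scale=1.0):
    """Compute W x + scale * B (A x); only a and b would be trained.

    w_frozen: d_out x d_in, a: r x d_in, b: d_out x r.
    """
    base = matvec(w_frozen, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]


def lora_trainable_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters as a fraction of the frozen layer's parameters."""
    return rank * (d_in + d_out) / (d_in * d_out)
```

For a hypothetical 768 x 768 layer with rank 4, `lora_trainable_fraction(768, 768, 4)` is 1/96, about 1% of that layer. The overall fraction for a full model depends on which layers are adapted and on the total parameter count.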

[29] Zero-shot Quantization: A Comprehensive Survey

Minjun Kim,Jaehyeon Choi,Jongkeun Lee,Wonjin Cho,U Kang

Main category: cs.CV

TL;DR: Surveys zero-shot quantization (ZSQ) methods, which remove traditional quantization's dependence on training data; categorizes and analyzes existing ZSQ methods and proposes future research directions.

Details Motivation: Traditional quantization methods need access to training data, which is impractical under privacy, security, or regulatory constraints. ZSQ quantizes without any real data, making it a promising solution. Method: First defines the ZSQ problem and its key challenges, then categorizes existing ZSQ methods by data generation strategy and analyzes their motivations, core ideas, and key takeaways. Result: Provides a comprehensive overview of ZSQ methods and, according to the authors, the first in-depth survey of the field. Conclusion: ZSQ is an effective approach to the data dependence problem; future research can address its remaining limitations and advance the field. Abstract: Network quantization has proven to be a powerful approach to reduce the memory and computational demands of deep learning models for deployment on resource-constrained devices. However, traditional quantization methods often rely on access to training data, which is impractical in many real-world scenarios due to privacy, security, or regulatory constraints. Zero-shot Quantization (ZSQ) emerges as a promising solution, achieving quantization without requiring any real data. In this paper, we provide a comprehensive overview of ZSQ methods and their recent advancements. First, we provide a formal definition of the ZSQ problem and highlight the key challenges. Then, we categorize the existing ZSQ methods into classes based on data generation strategies, and analyze their motivations, core ideas, and key takeaways. Lastly, we suggest future research directions to address the remaining limitations and advance the field of ZSQ. To the best of our knowledge, this paper is the first in-depth survey on ZSQ.

[30] PDE: Gene Effect Inspired Parameter Dynamic Evolution for Low-light Image Enhancement

Tong Li,Lizhi Wang,Hansen Feng,Lin Zhu,Hua Huang

Main category: cs.CV

TL;DR: Proposes parameter dynamic evolution (PDE), which simulates gene mutation and recombination to mitigate the "gene effect" in low-light image enhancement and thereby improve model performance.

Details Motivation: The authors observe that resetting certain parameters to random values can actually improve enhancement results, a phenomenon they call the gene effect. It limits model performance because random parameters sometimes outperform learned ones. Method: Proposes parameter dynamic evolution (PDE), which uses a parameter orthogonal generation technique to simulate gene recombination and mutation, adapting to different images and mitigating the gene effect. Result: Experiments validate the effectiveness of PDE in improving low-light image enhancement. Conclusion: By adjusting parameters dynamically, PDE addresses the gene effect and offers a new approach to low-light image enhancement. Abstract: Low-light image enhancement (LLIE) is a fundamental task in computational photography, aiming to improve illumination, reduce noise, and enhance image quality. While recent advancements focus on designing increasingly complex neural network models, we observe a peculiar phenomenon: resetting certain parameters to random values unexpectedly improves enhancement performance for some images. Drawing inspiration from biological genes, we term this phenomenon the gene effect. The gene effect limits enhancement performance, as even random parameters can sometimes outperform learned ones, preventing models from fully utilizing their capacity. In this paper, we investigate the reason and propose a solution. Based on our observations, we attribute the gene effect to static parameters, analogous to how fixed genetic configurations become maladaptive when environments change. Inspired by biological evolution, where adaptation to new environments relies on gene mutation and recombination, we propose parameter dynamic evolution (PDE) to adapt to different images and mitigate the gene effect. PDE employs a parameter orthogonal generation technique and the corresponding generated parameters to simulate gene recombination and gene mutation, respectively. Experiments validate the effectiveness of our techniques. The code will be released to the public.

[31] A Surrogate Model for the Forward Design of Multi-layered Metasurface-based Radar Absorbing Structures

Vineetha Joy,Aditya Anand,Nidhi,Anshuman Kumar,Amit Sethi,Hema Singh

Main category: cs.CV

TL;DR: Proposes a convolutional neural network (CNN) based surrogate model that rapidly predicts the electromagnetic response of multi-layered metasurface-based radar absorbing structures, significantly reducing computation time and design-space exploration.

Details Motivation: Conventional design of metasurface-based radar absorbing structures relies on full-wave simulation tools, which is computationally intensive and time-consuming and requires exploring large design spaces. Method: Uses a CNN architecture with a Huber loss function to predict reflection characteristics. Result: The model reaches a cosine similarity of 99.9% and a mean square error of 0.001 within 1000 training epochs, with a significant reduction in computation time and high predictive accuracy. Conclusion: The surrogate model is efficient and accurate, and is suited to the rapid design and optimization of metasurface-based radar absorbing structures. Abstract: Metasurface-based radar absorbing structures (RAS) are highly preferred for applications like stealth technology, electromagnetic (EM) shielding, etc. due to their capability to achieve frequency selective absorption characteristics with minimal thickness and reduced weight penalty. However, the conventional approach for the EM design and optimization of these structures relies on forward simulations, using full wave simulation tools, to predict the EM response of candidate meta atoms. This process is computationally intensive, extremely time consuming and requires exploration of large design spaces. To overcome this challenge, we propose a surrogate model that significantly accelerates the prediction of EM responses of multi-layered metasurface-based RAS. A convolutional neural network (CNN) based architecture with Huber loss function has been employed to estimate the reflection characteristics of the RAS model. The proposed model achieved a cosine similarity of 99.9% and a mean square error of 0.001 within 1000 epochs of training. The efficiency of the model has been established via full wave simulations as well as experiment, where it demonstrated significant reduction in computational time while maintaining high predictive accuracy.
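
The abstract names three quantities: Huber loss for training, and cosine similarity and mean square error for evaluation. The sketch below shows the standard textbook definitions of the first two applied to a pair of curves stored as plain lists. The threshold `delta=1.0` and the function names are assumptions; the paper's exact settings are not given in the abstract.

```python
import math


def huber_loss(predicted, target, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(predicted, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(target)


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero response curves."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
```

Cosine similarity measures whether the predicted curve has the same shape as the simulated one regardless of scale, so it is usually read alongside an absolute error such as MSE rather than on its own.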

[32] Zero-Shot Multi-modal Large Language Model v.s. Supervised Deep Learning: A Comparative Study on CT-Based Intracranial Hemorrhage Subtyping

Yinuo Wang,Yue Zeng,Kai Chen,Cai Meng,Chao Pan,Zhouping Tang

Main category: cs.CV

TL;DR: Compares multi-modal large language models (MLLMs) with traditional deep learning models on intracranial hemorrhage (ICH) classification and subtyping, finding that traditional models are more accurate while MLLMs offer advantages in interactivity and interpretability.

Details Motivation: Timely identification of intracranial hemorrhage is critical for prognosis and treatment decisions, but it is challenging because of low contrast and blurred boundaries. The study evaluates the potential of MLLMs for this task. Method: Using an RSNA dataset of 192 NCCT volumes, compares MLLMs including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet V2 with traditional models including ResNet50 and Vision Transformer on ICH classification and subtyping. Result: Traditional models outperform MLLMs across the board in both ICH binary classification and subtyping; for example, Gemini 2.0 Flash achieves a macro-averaged precision of 0.41 and an F1 score of 0.31. Conclusion: MLLMs excel at interaction but are less accurate than traditional models. Future work should refine the models to improve performance on three-dimensional medical image processing. Abstract: Introduction: Timely identification of intracranial hemorrhage (ICH) subtypes on non-contrast computed tomography is critical for prognosis prediction and therapeutic decision-making, yet remains challenging due to low contrast and blurring boundaries. This study evaluates the performance of zero-shot multi-modal large language models (MLLMs) compared to traditional deep learning methods in ICH binary classification and subtyping. Methods: We utilized a dataset provided by RSNA, comprising 192 NCCT volumes. The study compares various MLLMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet V2, with conventional deep learning models, including ResNet50 and Vision Transformer. Carefully crafted prompts were used to guide MLLMs in tasks such as ICH presence, subtype classification, localization, and volume estimation. Results: The results indicate that in the ICH binary classification task, traditional deep learning models outperform MLLMs comprehensively. For subtype classification, MLLMs also exhibit inferior performance compared to traditional deep learning models, with Gemini 2.0 Flash achieving a macro-averaged precision of 0.41 and a macro-averaged F1 score of 0.31. Conclusion: While MLLMs excel in interactive capabilities, their overall accuracy in ICH subtyping is inferior to deep networks. However, MLLMs enhance interpretability through language interactions, indicating potential in medical imaging analysis. Future efforts will focus on model refinement and developing more precise MLLMs to improve performance in three-dimensional medical image processing.
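
For readers unfamiliar with macro-averaging, the sketch below computes macro-averaged precision and F1 from two label lists using the standard definitions: per-class scores are computed first and then averaged with equal weight per class. The function name and the toy labels in the usage note are made up for illustration and are unrelated to the study's data.

```python
def macro_precision_f1(y_true, y_pred):
    """Return (macro precision, macro F1) over all labels seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    precisions, f1_scores = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Scores default to 0.0 when a class is never predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        f1_scores.append(f1)
    return sum(precisions) / len(labels), sum(f1_scores) / len(labels)
```

Calling `macro_precision_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])` gives a macro precision of 5/6 and a macro F1 of 11/15. Because every class counts equally, rare subtypes that a model handles poorly pull the macro scores down sharply.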

[33] Test-Time Augmentation for Pose-invariant Face Recognition

Jaemin Jung,Youngjoon Jang,Joon Son Chung

Main category: cs.CV

TL;DR: Proposes Pose-TTA, which augments head poses at test time to improve face recognition performance without additional training.

Details Motivation: Existing methods must be re-trained for each dataset, which consumes substantial resources. Pose-TTA avoids this by aligning poses directly at inference time. Method: Uses a portrait animator to transfer the source image's identity into the pose of a driving image, generating matching side-profile images, and applies a weighted feature aggregation strategy to reduce bias from synthetic data. Result: Across diverse datasets and pre-trained models, Pose-TTA consistently improves inference performance. Conclusion: Pose-TTA needs no retraining, is easy to integrate into existing pipelines, and effectively improves face recognition. Abstract: The goal of this paper is to enhance face recognition performance by augmenting head poses during the testing phase. Existing methods often rely on training on frontalised images or learning pose-invariant representations, yet both approaches typically require re-training and testing for each dataset, involving a substantial amount of effort. In contrast, this study proposes Pose-TTA, a novel approach that aligns faces at inference time without additional training. To achieve this, we employ a portrait animator that transfers the source image identity into the pose of a driving image. Instead of frontalising a side-profile face -- which can introduce distortion -- Pose-TTA generates matching side-profile images for comparison, thereby reducing identity information loss. Furthermore, we propose a weighted feature aggregation strategy to address any distortions or biases arising from the synthetic data, thus enhancing the reliability of the augmented images. Extensive experiments on diverse datasets and with various pre-trained face recognition models demonstrate that Pose-TTA consistently improves inference performance. Moreover, our method is straightforward to integrate into existing face recognition pipelines, as it requires no retraining or fine-tuning of the underlying recognition models.

[34] Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation

Guan Gui,Bin-Bin Gao,Jun Liu,Chengjie Wang,Yunsheng Wu

Main category: cs.CV

TL;DR: Proposes AnoGen, which uses a diffusion model guided by a few real anomaly samples to generate diverse, realistic anomaly data and improve anomaly detection models.

Details Motivation: Anomaly samples are scarce in industrial inspection, and anomalies synthesized by existing methods differ semantically from real ones, leading to weak detection performance. Method: Works in three stages: learn the anomaly distribution and inject it into an embedding; use the embedding and bounding boxes to guide a diffusion model to generate anomalies; train the model with a weakly-supervised method. Result: On the MVTec dataset, DRAEM and DesTSeg improve by 5.8% and 1.5% in AU-PR on the segmentation task, respectively. Conclusion: AnoGen generates realistic and diverse anomaly data and significantly improves anomaly detection models. Abstract: Anomaly detection is a practical and challenging task due to the scarcity of anomaly samples in industrial inspection. Some existing anomaly detection methods address this issue by synthesizing anomalies with noise or external data. However, there is always a large semantic gap between synthetic and real-world anomalies, resulting in weak performance in anomaly detection. To solve the problem, we propose a few-shot Anomaly-driven Generation (AnoGen) method, which guides the diffusion model to generate realistic and diverse anomalies with only a few real anomalies, thereby benefiting training anomaly detection models. Specifically, our work is divided into three stages. In the first stage, we learn the anomaly distribution based on a few given real anomalies and inject the learned knowledge into an embedding. In the second stage, we use the embedding and given bounding boxes to guide the diffusion model to generate realistic and diverse anomalies on specific objects (or textures). In the final stage, we propose a weakly-supervised anomaly detection method to train a more powerful model with generated anomalies. Our method builds upon DRAEM and DesTSeg as the foundation model and conducts experiments on the commonly used industrial anomaly detection dataset, MVTec. The experiments demonstrate that our generated anomalies effectively improve the model performance of both anomaly classification and segmentation tasks simultaneously, e.g., DRAEM and DesTSeg achieved a 5.8% and 1.5% improvement in AU-PR metric on segmentation task, respectively. The code and generated anomalous data are available at https://github.com/gaobb/AnoGen.

[35] Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt

Bin-Bin Gao

Main category: cs.CV

TL;DR: Proposes OneNIP, which uses just one normal image prompt to reconstruct normal features and restore anomalous ones efficiently, significantly improving unified anomaly detection performance.

Details Motivation: Existing self-attention reconstruction models for unified anomaly detection reconstruct anomalous features inaccurately, and because they operate in a low-resolution latent space, their anomaly segmentation is imprecise. Method: Proposes OneNIP, which uses one normal image prompt to reconstruct normal features and restore anomaly features; also introduces a supervised refiner that regresses reconstruction errors using real normal and synthesized anomalous images to improve pixel-level anomaly segmentation. Result: OneNIP outperforms previous methods on three industrial anomaly detection benchmarks: MVTec, BTAD, and VisA. Conclusion: OneNIP improves the efficiency and generalization of reconstruction models with a simple yet effective method, offering a new approach to unified anomaly detection. Abstract: Unsupervised reconstruction networks using self-attention transformers have achieved state-of-the-art performance for multi-class (unified) anomaly detection with a single model. However, these self-attention reconstruction models primarily operate on target features, which may result in perfect reconstruction for both normal and anomaly features due to high consistency with context, leading to failure in detecting anomalies. Additionally, these models often produce inaccurate anomaly segmentation due to performing reconstruction in a low spatial resolution latent space. To keep reconstruction models efficient while enhancing their generalization for unified anomaly detection, we propose a simple yet effective method that reconstructs normal features and restores anomaly features with just One Normal Image Prompt (OneNIP). In contrast to previous work, OneNIP makes it possible for the first time to reconstruct or restore anomalies with just one normal image prompt, effectively boosting unified anomaly detection performance. Furthermore, we propose a supervised refiner that regresses reconstruction errors by using both real normal and synthesized anomalous images, which significantly improves pixel-level anomaly segmentation. OneNIP outperforms previous methods on three industrial anomaly detection benchmarks: MVTec, BTAD, and VisA. The code and pre-trained models are available at https://github.com/gaobb/OneNIP.

[36] MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning

Bin-Bin Gao

Main category: cs.CV

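TL;DR: The paper proposes MetaUAS, a pure vision foundation model for universal visual anomaly segmentation. It unifies anomaly segmentation with change segmentation and trains on a synthetic dataset, without relying on language models or dedicated anomaly detection datasets.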

Details Motivation: Explore the potential of a pure vision foundation model as an alternative to vision-language models for universal visual anomaly segmentation, given that visual representations are inherently independent of language. Method: Proposes the MetaUAS framework, trained on synthetic image pairs, with a soft feature alignment module that handles geometric variations and enables anomaly segmentation from a single image prompt. Result: MetaUAS significantly outperforms existing methods on zero-shot, few-shot, and full-shot anomaly segmentation. Conclusion: MetaUAS is the first to achieve universal anomaly segmentation without language guidance; it is efficient and training-free. Abstract: Zero- and few-shot visual anomaly segmentation relies on powerful vision-language models that detect unseen anomalies using manually designed textual prompts. However, visual representations are inherently independent of language. In this paper, we explore the potential of a pure visual foundation model as an alternative to widely used vision-language models for universal visual anomaly segmentation. We present a novel paradigm that unifies anomaly segmentation into change segmentation. This paradigm enables us to leverage large-scale synthetic image pairs, featuring object-level and local region changes, derived from existing image datasets, which are independent of target anomaly datasets. We propose a one-prompt Meta-learning framework for Universal Anomaly Segmentation (MetaUAS) that is trained on this synthetic dataset and then generalizes well to segment any novel or unseen visual anomalies in the real world. To handle geometrical variations between prompt and query images, we propose a soft feature alignment module that bridges paired-image change perception and single-image semantic segmentation. This is the first work to achieve universal anomaly segmentation using a pure vision model without relying on special anomaly detection datasets and pre-trained vision-language models. Our method effectively and efficiently segments any anomalies with only one normal image prompt and is training-free, requiring no guidance from language. Our MetaUAS significantly outperforms previous zero-shot, few-shot, and even full-shot anomaly segmentation methods. The code and pre-trained models are available at https://github.com/gaobb/MetaUAS.

[37] Recent Advances in Medical Imaging Segmentation: A Survey

Fares Bougourzi,Abdenour Hadid

Main category: cs.CV

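TL;DR: This survey reviews recent advances in medical image segmentation, focusing on generative AI, few-shot learning, foundation models, and universal models. It summarizes theory, techniques, and applications and points out future research directions.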

Details Motivation: Medical image segmentation faces multiple challenges, including data accessibility, annotation complexity, and structural variability; existing models still fall short in generalization and domain adaptation, so more efficient solutions are needed. Method: Analyzes generative AI, few-shot learning, foundation models, and universal models, combining a review of theory and techniques with a discussion of their application to medical image segmentation. Result: The paper summarizes the strengths and limitations of these methods and proposes future research directions to improve the practicality and accessibility of segmentation models. Conclusion: The field still needs to solve generalization and domain adaptation; future research should focus on model efficiency and practical application. Abstract: Medical imaging is a cornerstone of modern healthcare, driving advancements in diagnosis, treatment planning, and patient care. Among its various tasks, segmentation remains one of the most challenging problems due to factors such as data accessibility, annotation complexity, structural variability, variation in medical imaging modalities, and privacy constraints. Despite recent progress, achieving robust generalization and domain adaptation remains a significant hurdle, particularly given the resource-intensive nature of some proposed models and their reliance on domain expertise. This survey explores cutting-edge advancements in medical image segmentation, focusing on methodologies such as Generative AI, Few-Shot Learning, Foundation Models, and Universal Models. These approaches offer promising solutions to longstanding challenges. We provide a comprehensive overview of the theoretical foundations, state-of-the-art techniques, and recent applications of these methods. Finally, we discuss inherent limitations, unresolved issues, and future research directions aimed at enhancing the practicality and accessibility of segmentation models in medical imaging. We are maintaining a GitHub repository (https://github.com/faresbougourzi/Awesome-DL-for-Medical-Imaging-Segmentation) to continue tracking and updating innovations in this field.

[38] Predicting butterfly species presence from satellite imagery using soft contrastive regularisation

Thijs L van der Plas,Stephen Law,Michael JO Pocock

Main category: cs.CV

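TL;DR: This paper presents a new dataset and method that use satellite imagery to predict butterfly species presence in the United Kingdom, and improves prediction accuracy with a contrastive regularisation loss.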

Details Motivation: As demand for scalable biodiversity monitoring grows, the wide availability and coverage of remote sensing data have made it a research focus. Traditional approaches concentrate on habitat monitoring, whereas newer methods try to predict multi-species presence directly from satellite images. Method: Experimentally optimises a ResNet-based model that predicts multi-species presence from 4-band satellite images, and develops a soft, supervised contrastive regularisation loss tailored to probabilistic labels. Result: The model outperforms the baseline in areas of high biodiversity, and the contrastive regularisation loss further improves prediction accuracy. Conclusion: The new dataset and the contrastive regularisation method provide effective tools for predicting species biodiversity from remote sensing data, supporting efficient biodiversity monitoring. Abstract: The growing demand for scalable biodiversity monitoring methods has fuelled interest in remote sensing data, due to its widespread availability and extensive coverage. Traditionally, the application of remote sensing to biodiversity research has focused on mapping and monitoring habitats, but with increasing availability of large-scale citizen-science wildlife observation data, recent methods have started to explore predicting multi-species presence directly from satellite images. This paper presents a new data set for predicting butterfly species presence from satellite data in the United Kingdom. We experimentally optimise a ResNet-based model to predict multi-species presence from 4-band satellite images, and find that this model outperforms the mean rate baseline, especially for locations with high species biodiversity. To improve performance, we develop a soft, supervised contrastive regularisation loss that is tailored to probabilistic labels (such as species-presence data), and demonstrate that this improves prediction accuracy. In summary, our new data set and contrastive regularisation method contribute to the open challenge of accurately predicting species biodiversity from remote sensing data, which is key for efficient biodiversity monitoring.
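
The soft contrastive regulariser described in the abstract above can be illustrated with a minimal sketch. The abstract does not state the loss formula, so the version below is an assumption made purely for illustration: it treats the cosine similarity between two locations' probabilistic species labels as a soft target and penalises the squared gap between that target and the cosine similarity of their embeddings. The names `cosine` and `soft_contrastive_reg` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def soft_contrastive_reg(embeddings, labels):
    """Mean squared gap between embedding similarity and label similarity.

    embeddings: one feature vector per location.
    labels: one vector of species-presence probabilities per location.
    Assumed formulation for illustration, not the paper's loss.
    """
    total, pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            target = cosine(labels[i], labels[j])        # soft "positive-ness" of the pair
            similarity = cosine(embeddings[i], embeddings[j])
            total += (similarity - target) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0
```

Under this assumed form, locations with similar species probabilities are pulled together in embedding space and dissimilar ones are pushed apart, with no hard positive or negative pairs required.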

[39] Neural Video Compression using 2D Gaussian Splatting

Lakshya Gupta,Imran N. Junejo

Main category: cs.CV

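TL;DR: The paper proposes a region-of-interest (ROI) based neural video compression model that uses 2D Gaussian Splatting, substantially speeding up encoding and targeting real-time video applications.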

Details Motivation: Traditional video coding standards (such as AVC and HEVC) rely on handcrafted features, while neural video codecs (NVC) offer higher compression efficiency through deep learning, but their computational demands limit real-time use. This paper aims to address that limitation. Method: Uses 2D Gaussian Splatting together with a content-aware initialization strategy and an inter-frame redundancy-reduction mechanism to design a new video codec. Result: Encoding time is sped up by 88% relative to the previous Gaussian-splatting-based image codec, yielding the first Gaussian-splatting-based neural video codec. Conclusion: The method offers an efficient solution for real-time video applications such as video conferencing and demonstrates the potential of neural video codecs. Abstract: The computer vision and image processing research community has been involved in standardizing video data communications for many decades, leading to standards such as AVC, HEVC, VVC, AV1, AV2, etc. However, recent groundbreaking works have focused on employing deep learning-based techniques to replace the traditional video codec pipeline to greater effect. Neural video codecs (NVC) create an end-to-end ML-based solution that does not rely on any handcrafted features (motion or edge-based) and have the ability to learn content-aware compression strategies, offering better adaptability and higher compression efficiency than traditional methods. This holds great potential not only for hardware design, but also for various video streaming platforms and applications, especially video conferencing applications such as MS-Teams or Zoom that have found extensive usage in classrooms and workplaces. However, their high computational demands currently limit their use in real-time applications like video conferencing. To address this, we propose a region-of-interest (ROI) based neural video compression model that leverages 2D Gaussian Splatting. Unlike traditional codecs, 2D Gaussian Splatting is capable of real-time decoding and can be optimized using fewer data points, requiring only thousands of Gaussians for decent quality outputs as opposed to millions in 3D scenes. In this work, we designed a video pipeline that speeds up the encoding time of the previous Gaussian splatting-based image codec by 88% by using a content-aware initialization strategy paired with a novel Gaussian inter-frame redundancy-reduction mechanism, enabling Gaussian splatting to be used for a video-codec solution, the first solution of its kind in this neural video codec space.
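
To make the idea of representing a frame with a set of 2D Gaussians concrete, here is a minimal sketch of the decoding (rendering) step. It is a simplification under stated assumptions, not the authors' codec: the Gaussians are isotropic and grayscale, their contributions are simply summed, and the function name `render_gaussians` and the `(cx, cy, sigma, amplitude)` parameter layout are hypothetical.

```python
import math


def render_gaussians(width, height, gaussians):
    """Render isotropic 2D Gaussians into a grayscale image (a list of rows).

    gaussians: iterable of (cx, cy, sigma, amplitude) tuples.
    Simplified illustration: real splatting codecs typically use
    anisotropic, coloured Gaussians and a GPU rasterizer.
    """
    image = [[0.0] * width for _ in range(height)]
    for cx, cy, sigma, amplitude in gaussians:
        for y in range(height):
            for x in range(width):
                squared_dist = (x - cx) ** 2 + (y - cy) ** 2
                # Each Gaussian adds a smooth blob centred at (cx, cy).
                image[y][x] += amplitude * math.exp(-squared_dist / (2.0 * sigma ** 2))
    return image


# One blob centred in a 5x5 frame.
frame = render_gaussians(5, 5, [(2, 2, 1.0, 0.8)])
```

In a representation like this, the encoder's job is to fit the Gaussian parameters to each frame, which is why a good initialization and the reuse of parameters across neighbouring frames can shorten encoding.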

[40] BioVFM-21M: Benchmarking and Scaling Self-Supervised Vision Foundation Models for Biomedical Image Analysis

Jiarun Liu,Hong-Yu Zhou,Weijian Huang,Hao Yang,Dongning Song,Tao Tan,Yong Liang,Shanshan Wang

Main category: cs.CV

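TL;DR: The study investigates the scaling behavior of medical vision foundation models, finds that scaling improves performance but that the benefit varies across tasks, and introduces the BioVFM-21M dataset and the BioVFM model.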

Details Motivation: Medical images differ markedly from natural data, and scaling behavior in the medical domain is not yet understood, so the key factors for developing medical vision foundation models at scale need to be explored. Method: Using self-supervised learning, studies scaling behavior across model sizes, training algorithms, data sizes, and imaging modalities, and builds the BioVFM-21M dataset. Result: BioVFM outperforms the previous best foundation models across 12 medical benchmarks; the benefit of scaling varies across tasks. Conclusion: Scaling is beneficial, but task characteristics, data diversity, pretraining methods, and computational efficiency still require careful consideration. Abstract: Scaling up model and data size has demonstrated impressive performance improvements over a wide range of tasks. Despite extensive studies on scaling behaviors for general-purpose tasks, medical images exhibit substantial differences from natural data. The key factors in developing medical vision foundation models at scale remain unclear due to the absence of an extensive understanding of scaling behavior in the medical domain. In this paper, we explored the scaling behavior across model sizes, training algorithms, data sizes, and imaging modalities in developing scalable medical vision foundation models by self-supervised learning. To support scalable pretraining, we introduce BioVFM-21M, a large-scale biomedical image dataset encompassing a wide range of biomedical image modalities and anatomies. We observed that scaling up does provide benefits, but that these vary across tasks. Additional analysis reveals several factors correlated with scaling benefits. Finally, we propose BioVFM, a large-scale medical vision foundation model pretrained on 21 million biomedical images, which outperforms the previous state-of-the-art foundation models across 12 medical benchmarks. Our results highlight that while scaling up is beneficial for pursuing better performance, task characteristics, data diversity, pretraining methods, and computational efficiency remain critical considerations for developing scalable medical foundation models.

[41] Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition

Muzammil Behzad

Main category: cs.CV

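TL;DR: MultiviewVLM is a vision-language model for unsupervised contrastive multiview representation learning, focused on learning facial emotions from 3D/4D data.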

Details Motivation: Existing methods for multiview emotion representation learning lack unsupervised alignment and contrastive learning strategies; MultiviewVLM aims to fill this gap. Method: The model combines pseudo-labels with textual prompts for implicit alignment, proposes a joint embedding space and a multiview contrastive learning strategy, and introduces a gradient-friendly loss function. Result: Experiments show that MultiviewVLM outperforms existing methods and adapts easily to real-world applications. Conclusion: MultiviewVLM offers an efficient and scalable solution for multiview emotion representation learning. Abstract: In this paper, we introduce MultiviewVLM, a vision-language model designed for unsupervised contrastive multiview representation learning of facial emotions from 3D/4D data. Our architecture integrates pseudo-labels derived from generated textual prompts to guide implicit alignment of emotional semantics. To capture shared information across multiple views, we propose a joint embedding space that aligns multiview representations without requiring explicit supervision. We further enhance the discriminability of our model through a novel multiview contrastive learning strategy that leverages stable positive-negative pair sampling. A gradient-friendly loss function is introduced to promote smoother and more stable convergence, and the model is optimized for distributed training to ensure scalability. Extensive experiments demonstrate that MultiviewVLM outperforms existing state-of-the-art methods and can be easily adapted to various real-world applications with minimal modifications.

[42] Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis

Bingxin Ke,Kevin Qu,Tianfu Wang,Nando Metzger,Shengyu Huang,Bo Li,Anton Obukhov,Konrad Schindler

Main category: cs.CV

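TL;DR: Marigold is a conditional generative model built on pretrained latent diffusion models such as Stable Diffusion. A fine-tuning protocol transfers their knowledge to dense image analysis tasks such as depth estimation and surface normal prediction, achieving state-of-the-art zero-shot generalization.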

Details Motivation: In data-scarce settings, the quality of pretrained models is crucial for transfer learning. Traditional approaches rely on image classification and self-supervised learning, while emerging text-to-image generative models (such as latent diffusion models) show strong visual understanding and open new possibilities for dense image analysis tasks. Method: Marigold fine-tunes a pretrained latent diffusion model (such as Stable Diffusion); it needs only small synthetic datasets and a few days of training on a single GPU to adapt to dense image analysis tasks, with minimal changes to the model architecture. Result: Marigold performs strongly on monocular depth estimation, surface normal prediction, and intrinsic decomposition, achieving state-of-the-art zero-shot generalization. Conclusion: Marigold shows the potential of extracting knowledge from generative models for dense image analysis and offers an efficient route to transfer learning in data-scarce settings. Abstract: The success of deep learning in computer vision over the past decade has hinged on large labeled datasets and strong pretrained models. In data-scarce settings, the quality of these pretrained models becomes crucial for effective transfer learning. Image classification and self-supervised learning have traditionally been the primary methods for pretraining CNNs and transformer-based architectures. Recently, the rise of text-to-image generative models, particularly those using denoising diffusion in a latent space, has introduced a new class of foundational models trained on massive, captioned image datasets. These models' ability to generate realistic images of unseen content suggests they possess a deep understanding of the visual world. In this work, we present Marigold, a family of conditional generative models and a fine-tuning protocol that extracts the knowledge from pretrained latent diffusion models like Stable Diffusion and adapts them for dense image analysis tasks, including monocular depth estimation, surface normals prediction, and intrinsic decomposition. Marigold requires minimal modification of the pre-trained latent diffusion model's architecture, trains with small synthetic datasets on a single GPU over a few days, and demonstrates state-of-the-art zero-shot generalization. Project page: https://marigoldcomputervision.github.io

[43] RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo

Jenny Schmalfuss,Victor Oei,Lukas Mehl,Madlen Bartsch,Shashank Agnihotri,Margret Keuper,Andrés Bruhn

Main category: cs.CV

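TL;DR: RobustSpring is a new dataset and benchmark for evaluating the robustness of optical flow, scene flow, and stereo models to image corruptions, addressing the fact that existing benchmarks focus only on model accuracy.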

Details Motivation: Existing benchmarks focus mainly on model accuracy and overlook robustness to real-world perturbations such as noise or weather. Method: Applies 20 different image corruptions (noise, blur, color changes, and others) to the high-resolution Spring dataset, producing 20,000 corrupted images, and proposes a new robustness metric. Result: Experiments show that accurate models are not necessarily robust and that robustness varies by corruption type. Conclusion: RobustSpring treats robustness as a first-class goal and aims to foster models that are both accurate and robust. Abstract: Standard benchmarks for optical flow, scene flow, and stereo vision algorithms generally focus on model accuracy rather than robustness to image corruptions like noise or rain. Hence, the resilience of models to such real-world perturbations is largely unquantified. To address this, we present RobustSpring, a comprehensive dataset and benchmark for evaluating robustness to image corruptions for optical flow, scene flow, and stereo models. RobustSpring applies 20 different image corruptions, including noise, blur, color changes, quality degradations, and weather distortions, in a time-, stereo-, and depth-consistent manner to the high-resolution Spring dataset, creating a suite of 20,000 corrupted images that reflect challenging conditions. RobustSpring enables comparisons of model robustness via a new corruption robustness metric. Integration with the Spring benchmark enables public two-axis evaluations of both accuracy and robustness. We benchmark a curated selection of initial models, observing that accurate models are not necessarily robust and that robustness varies widely by corruption type. RobustSpring is a new computer vision benchmark that treats robustness as a first-class citizen to foster models that combine accuracy with resilience. It will be available at https://spring-benchmark.org.
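
The abstract above does not define the corruption robustness metric, so the sketch below should be read as one plausible way to separate robustness from accuracy, not as RobustSpring's definition. It assumes robustness is measured by how far a model's flow prediction moves when the input is corrupted, compared with its own prediction on the clean input; no ground truth is involved. The function name `mean_prediction_shift` and the `(u, v)` flow layout are hypothetical.

```python
import math


def mean_prediction_shift(clean_flow, corrupted_flow):
    """Average Euclidean distance between per-pixel flow predictions.

    clean_flow, corrupted_flow: equal-length lists of (u, v) flow vectors
    predicted on the clean and on the corrupted version of the same frame.
    Lower values mean the prediction changed less under corruption.
    """
    if not clean_flow or len(clean_flow) != len(corrupted_flow):
        raise ValueError("flows must be non-empty and of equal length")
    total = sum(
        math.hypot(u_clean - u_corr, v_clean - v_corr)
        for (u_clean, v_clean), (u_corr, v_corr) in zip(clean_flow, corrupted_flow)
    )
    return total / len(clean_flow)
```

A measure of this kind can be reported per corruption type, which fits the abstract's observation that robustness varies widely by corruption.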

[44] MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment

Siyuan Yan,Xieji Li,Ming Hu,Yiwen Jiang,Zhen Yu,Zongyuan Ge

Main category: cs.CV

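TL;DR: MAKE is a multi-aspect knowledge-enhanced vision-language pretraining framework for zero-shot dermatological tasks. By decomposing clinical narratives, aligning at a fine-grained level, and weighting by diagnosis, it significantly outperforms existing VLP models.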

Details Motivation: Dermatological diagnosis requires combining visual features with clinical knowledge, but existing VLP methods are limited by text length constraints and the lack of structured text. Method: Proposes multi-aspect knowledge-enhanced contrastive learning, a fine-grained alignment mechanism, and a diagnosis-guided weighting scheme. Result: After pretraining on 403,563 image-text pairs, MAKE significantly outperforms existing VLP models on eight datasets. Conclusion: Through multi-aspect knowledge enhancement and fine-grained alignment, MAKE improves zero-shot performance in dermatological diagnosis. Abstract: Dermatological diagnosis represents a complex multimodal challenge that requires integrating visual features with specialized clinical knowledge. While vision-language pretraining (VLP) has advanced medical AI, its effectiveness in dermatology is limited by text length constraints and the lack of structured texts. In this paper, we introduce MAKE, a Multi-Aspect Knowledge-Enhanced vision-language pretraining framework for zero-shot dermatological tasks. Recognizing that comprehensive dermatological descriptions require multiple knowledge aspects that exceed standard text constraints, our framework introduces: (1) a multi-aspect contrastive learning strategy that decomposes clinical narratives into knowledge-enhanced sub-texts through large language models, (2) a fine-grained alignment mechanism that connects sub-captions with diagnostically relevant image features, and (3) a diagnosis-guided weighting scheme that adaptively prioritizes different sub-captions based on a clinical significance prior. Through pretraining on 403,563 dermatological image-text pairs collected from educational resources, MAKE significantly outperforms state-of-the-art VLP models on eight datasets across zero-shot skin disease classification, concept annotation, and cross-modal retrieval tasks. Our code will be made publicly available at https://github.com/SiyuanYan1/MAKE.

[45] Text-driven Motion Generation: Overview, Challenges and Directions

Ali Rida Sahili,Najett Neji,Hedi Tabia

Main category: cs.CV

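TL;DR: This paper surveys text-driven motion generation, covering traditional methods, a taxonomy of modern approaches (VAE-based, diffusion-based, and hybrid models), datasets, and evaluation methods, and points out future research directions.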

Details Motivation: Text-driven motion generation offers an intuitive and flexible way to control animated characters and is useful in areas such as virtual reality and gaming, but existing methods and challenges need a systematic review. Method: Categorizes modern methods from two perspectives, architecture (VAE-based, diffusion-based, hybrid) and motion representation (discrete versus continuous), and analyzes datasets and evaluation methods. Result: Summarizes current research progress and identifies key challenges and limitations. Conclusion: The survey offers a useful reference for researchers in language-driven motion synthesis and outlines future directions. Abstract: Text-driven motion generation offers a powerful and intuitive way to create human movements directly from natural language. By removing the need for predefined motion inputs, it provides a flexible and accessible approach to controlling animated characters. This makes it especially useful in areas like virtual reality, gaming, human-computer interaction, and robotics. In this review, we first revisit the traditional perspective on motion synthesis, where models focused on predicting future poses from observed initial sequences, often conditioned on action labels. We then provide a comprehensive and structured survey of modern text-to-motion generation approaches, categorizing them from two complementary perspectives: (i) architectural, dividing methods into VAE-based, diffusion-based, and hybrid models; and (ii) motion representation, distinguishing between discrete and continuous motion generation strategies. In addition, we explore the most widely used datasets, evaluation methods, and recent benchmarks that have shaped progress in this area. With this survey, we aim to capture where the field currently stands, bring attention to its key challenges and limitations, and highlight promising directions for future exploration. We hope this work offers a valuable starting point for researchers and practitioners working to push the boundaries of language-driven human motion synthesis.

[46] Examining Deployment and Refinement of the VIOLA-AI Intracranial Hemorrhage Model Using an Interactive NeoMedSys Platform

Qinghui Liu,Jon Nesvold,Hanna Raaum,Elakkyen Murugesu,Martin Røvang,Bradley J MacIntosh,Atle Bjørnerud,Karoline Skogen

Main category: cs.CV

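TL;DR: The study develops NeoMedSys, a radiology software platform for efficiently deploying and refining AI models, and evaluates its feasibility in real-world clinical settings. Through iterative refinement, the VIOLA-AI model's performance in intracranial hemorrhage detection improved markedly.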

Details Motivation: Clinical deployment of AI tools in radiology brings both challenges and opportunities; NeoMedSys aims to address them and improve the deployment efficiency and performance of AI models. Method: NeoMedSys integrates tools for deploying, testing, and optimizing AI models with a medical image viewer and an annotation system. The study was run at the largest emergency department in Norway and assessed VIOLA-AI's performance in intracranial hemorrhage detection. Result: Iterative refinement raised VIOLA-AI's sensitivity to 90.3%, specificity to 89.3%, and AUC to 0.949. Real-time radiologist feedback was essential to model refinement. Conclusion: The NeoMedSys platform effectively supports clinical deployment and refinement of AI models, markedly improves diagnostic performance, and demonstrates the value of real-time feedback and iterative improvement. Abstract: Background: There are many challenges and opportunities in the clinical deployment of AI tools in radiology. The current study describes a radiology software platform called NeoMedSys that can enable efficient deployment and refinement of AI models. We evaluated the feasibility and effectiveness of running NeoMedSys for three months in real-world clinical settings and focused on improving the performance of an in-house developed AI model (VIOLA-AI) designed for intracranial hemorrhage (ICH) detection. Methods: NeoMedSys integrates tools for deploying, testing, and optimizing AI models with a web-based medical image viewer, annotation system, and hospital-wide radiology information systems. A pragmatic investigation was deployed using clinical cases of patients presenting to the largest Emergency Department in Norway (site-1) with suspected traumatic brain injury (TBI) or patients with suspected stroke (site-2). We assessed ICH classification performance as VIOLA-AI encountered new data and underwent pre-planned model retraining. Performance metrics included sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUC). Results: NeoMedSys facilitated iterative improvements in the AI model, significantly enhancing its diagnostic accuracy. Automated bleed detection and segmentation were reviewed in near real-time to facilitate re-training VIOLA-AI. The iterative refinement process yielded a marked improvement in classification sensitivity, rising to 90.3% (from 79.2%), and specificity that reached 89.3% (from 80.7%). The bleed detection ROC analysis for the entire sample demonstrated a high area-under-the-curve (AUC) of 0.949 (from 0.873). Model refinement stages were associated with notable gains, highlighting the value of real-time radiologist feedback.
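
For readers less familiar with the metrics reported above, the following sketch shows how sensitivity, specificity, and accuracy are derived from confusion-matrix counts. The function name `classification_metrics` is hypothetical, and the counts in the usage line are made-up illustrative numbers, not data from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn: hemorrhage cases flagged / missed by the model.
    tn/fp: non-hemorrhage cases correctly passed / wrongly flagged.
    Assumes at least one positive and one negative case.
    """
    return {
        "sensitivity": tp / (tp + fn),                # share of true bleeds detected
        "specificity": tn / (tn + fp),                # share of non-bleeds correctly cleared
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # share of all cases classified correctly
    }


# Illustrative counts only (not from the paper).
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
```

Tracking these values after each retraining round is one straightforward way to observe the kind of stage-by-stage gains the abstract describes.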

[47] FedSaaS: Class-Consistency Federated Semantic Segmentation via Global Prototype Supervision and Local Adversarial Harmonization

Xiaoyang Yu,Xiaoming Wu,Xin Wang,Dongrun Li,Ming Yang,Peng Cheng

Main category: cs.CV

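TL;DR: The paper proposes FedSaaS, a new federated semantic segmentation framework that addresses class consistency through class exemplars and contrastive losses, markedly improving segmentation accuracy.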

Details Motivation: When handling heterogeneity, existing work overlooks fine-grained class relationships in the semantic space, which leads to ambiguous class representations. Method: Introduces class exemplars as the criterion for local and global class representations; on the server side, class prototypes are modeled to supervise the clients' global branch; on the client side, an adversarial mechanism harmonizes the global and local branches; multilevel contrastive losses are used throughout. Result: On several driving scene segmentation datasets, FedSaaS outperforms existing methods and markedly improves average segmentation accuracy. Conclusion: FedSaaS effectively addresses the class-consistency problem and offers a new approach to federated semantic segmentation. Abstract: Federated semantic segmentation enables pixel-level classification in images through collaborative learning while maintaining data privacy. However, existing research commonly overlooks the fine-grained class relationships within the semantic space when addressing heterogeneous problems, particularly domain shift. This oversight results in ambiguities between class representations. To overcome this challenge, we propose a novel federated segmentation framework that achieves class consistency, termed FedSaaS. Specifically, we introduce class exemplars as a criterion for both local- and global-level class representations. On the server side, the uploaded class exemplars are leveraged to model class prototypes, which supervise the global branch of clients, ensuring alignment with the global-level representation. On the client side, we incorporate an adversarial mechanism to harmonize contributions of global and local branches, leading to consistent output. Moreover, multilevel contrastive losses are employed on both sides to enforce consistency between two-level representations in the same semantic space. Extensive experiments on several driving scene segmentation datasets demonstrate that our framework outperforms state-of-the-art methods, significantly improving average segmentation accuracy and effectively addressing the class-consistency representation problem.
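
The server-side step, in which uploaded class exemplars are turned into class prototypes, can be sketched as follows. The abstract does not say how prototypes are modeled, so simple per-class averaging is assumed here for illustration; the function name `class_prototypes` and the dictionary layout are hypothetical.

```python
def class_prototypes(client_exemplars):
    """Average the exemplars uploaded by clients into one prototype per class.

    client_exemplars: list with one dict per client, mapping a class id to
    that client's exemplar feature vector for the class.
    Plain averaging is an assumption, not necessarily the paper's method.
    """
    sums, counts = {}, {}
    for exemplars in client_exemplars:
        for class_id, vector in exemplars.items():
            if class_id not in sums:
                sums[class_id] = [0.0] * len(vector)
                counts[class_id] = 0
            sums[class_id] = [s + v for s, v in zip(sums[class_id], vector)]
            counts[class_id] += 1
    # A class seen by only some clients is averaged over those clients alone.
    return {
        class_id: [s / counts[class_id] for s in total]
        for class_id, total in sums.items()
    }
```

Because only exemplar vectors leave each client, a scheme of this shape shares class-level information without exchanging raw images, in keeping with the federated setting.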

[48] FreeDriveRF: Monocular RGB Dynamic NeRF without Poses for Autonomous Driving via Point-Level Dynamic-Static Decoupling

Yue Wen,Liang Song,Yijia Liu,Siting Zhu,Yanzi Miao,Lijun Han,Hesheng Wang

Main category: cs.CV

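TL;DR: FreeDriveRF reconstructs dynamic driving scenes from RGB images alone, without pose inputs; it decouples dynamic and static parts through semantic supervision and uses optical flow to constrain dynamic modeling.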

Details Motivation: Existing methods depend on accurate pose inputs and multi-sensor data, which increases system complexity. Method: Decouples dynamic and static parts with semantic supervision, constrains dynamic modeling with optical flow, and uses estimated dynamic flow to constrain pose optimization. Result: Performs strongly on the KITTI and Waymo datasets. Conclusion: FreeDriveRF is efficient and accurate in dynamic scene modeling. Abstract: Dynamic scene reconstruction for autonomous driving enables vehicles to perceive and interpret complex scene changes more precisely. Dynamic Neural Radiance Fields (NeRFs) have recently shown promising capability in scene modeling. However, many existing methods rely heavily on accurate pose inputs and multi-sensor data, leading to increased system complexity. To address this, we propose FreeDriveRF, which reconstructs dynamic driving scenes using only sequential RGB images without requiring pose inputs. We innovatively decouple dynamic and static parts at the early sampling level using semantic supervision, mitigating image blurring and artifacts. To overcome the challenges posed by object motion and occlusion with a monocular camera, we introduce a warped ray-guided dynamic object rendering consistency loss, utilizing optical flow to better constrain the dynamic modeling process. Additionally, we incorporate estimated dynamic flow to constrain the pose optimization process, improving the stability and accuracy of unbounded scene reconstruction. Extensive experiments conducted on the KITTI and Waymo datasets demonstrate the superior performance of our method in dynamic scene modeling for autonomous driving.

[49] Sparse Point Cloud Patches Rendering via Splitting 2D Gaussians

Ma Changfeng,Bi Ran,Guo Jie,Wang Chongjun,Guo Yanwen

Main category: cs.CV

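TL;DR: Proposes a new point cloud rendering method that predicts 2D Gaussians from point clouds, without relying on categorical priors or dense point clouds and without additional refinement.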

Details Motivation: Existing learning-based methods depend on categorical priors, dense point clouds, or additional refinement, which limits their generalization and efficiency. Method: Uses two identical modules with an entire-patch architecture; normalizes and initializes the Gaussians from point cloud information (normals, colors, distances), then refines them with splitting decoders. Result: Experiments on multiple datasets confirm the method's superiority and generalization, reaching SOTA performance. Conclusion: The method generalizes directly to point clouds of different categories and needs no additional refinement at render time, retaining the benefits of 2D Gaussians. Abstract: Current learning-based methods predict NeRF or 3D Gaussians from point clouds to achieve photo-realistic rendering but still depend on categorical priors, dense point clouds, or additional refinements. Hence, we introduce a novel point cloud rendering method by predicting 2D Gaussians from point clouds. Our method incorporates two identical modules with an entire-patch architecture, enabling the network to be generalized to multiple datasets. The module normalizes and initializes the Gaussians utilizing the point cloud information including normals, colors and distances. Then, splitting decoders are employed to refine the initial Gaussians by duplicating them and predicting more accurate results, allowing our method to accommodate sparse point clouds effectively as well. Once trained, our approach exhibits direct generalization to point clouds across different categories. The predicted Gaussians are employed directly for rendering without additional refinement on the rendered images, retaining the benefits of 2D Gaussians. We conduct extensive experiments on various datasets, and the results demonstrate the superiority and generalization of our method, which achieves SOTA performance. The code is available at https://github.com/murcherful/GauPCRender.

[50] FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models

Hongyang Wang,Yichen Shi,Zhuofu Tao,Yuhao Gao,Liepiao Zhang,Xun Lin,Jun Feng,Xiaochen Yuan,Zitong Yu,Xiaochun Cao

Main category: cs.CV

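TL;DR: FaceShield is a multimodal large language model (MLLM) designed for face anti-spoofing (FAS); it can judge face authenticity, identify the spoofing type, explain its reasoning, and localize attack areas.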

Details Motivation: Existing FAS methods lack interpretability and reasoning ability, and there is no MLLM or dataset designed specifically for FAS. Method: Combines the original image with auxiliary information based on prior knowledge, using spoof-aware vision perception (SAVP) and a prompt-guided vision token masking (PVTM) strategy. Result: On three benchmark datasets, FaceShield significantly outperforms existing methods on coarse-grained classification, fine-grained classification, reasoning, and attack localization. Conclusion: FaceShield provides a universal and comprehensive solution for FAS; its datasets and code will be publicly released. Abstract: Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods approached this task as a classification problem, lacking interpretability and reasoning behind the predicted results. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision-making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for the FAS task. To address this gap, we propose FaceShield, an MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof-aware vision perception (SAVP) that incorporates both the original image and auxiliary information based on prior knowledge. We then use a prompt-guided vision token masking (PVTM) strategy to randomly mask vision tokens, thereby improving the model's generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse-grained classification, fine-grained classification, reasoning, and attack localization. Our instruction datasets, protocols, and codes will be released soon.

[51] MoRAL: Motion-aware Multi-Frame 4D Radar and LiDAR Fusion for Robust 3D Object Detection

Xiangyuan Peng,Yu Wang,Miao Tang,Bierzynski Kay,Lorenzo Servadei,Robert Wille

Main category: cs.CV

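TL;DR: MoRAL is a motion-aware multi-frame 4D radar and LiDAR fusion framework for robust 3D object detection; by compensating for inter-frame misalignment of radar point clouds and exploiting dynamic information, it markedly improves detection performance.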

Details Motivation: Existing methods neglect the inter-frame misalignment of radar point clouds caused by object motion and do not fully exploit the dynamic information from 4D radar, which hurts detection accuracy. Method: Proposes the MoRAL framework, comprising a Motion-aware Radar Encoder (MRE) that compensates for misalignment and a Motion Attention Gated Fusion (MAGF) module that integrates radar dynamic information to guide LiDAR features. Result: On the VoD dataset, MoRAL reaches an mAP of 73.30% in the entire area and 88.68% in the driving corridor, with markedly improved detection of pedestrians and cyclists. Conclusion: Through motion awareness and the fusion of dynamic information, MoRAL markedly improves the robustness and accuracy of 3D object detection. Abstract: Reliable autonomous driving systems require accurate detection of traffic participants. To this end, multi-modal fusion has emerged as an effective strategy. In particular, 4D radar and LiDAR fusion methods based on multi-frame radar point clouds have demonstrated effectiveness in bridging the point density gap. However, they often neglect radar point clouds' inter-frame misalignment caused by object movement during accumulation and do not fully exploit the object dynamic information from 4D radar. In this paper, we propose MoRAL, a motion-aware multi-frame 4D radar and LiDAR fusion framework for robust 3D object detection. First, a Motion-aware Radar Encoder (MRE) is designed to compensate for inter-frame radar misalignment from moving objects. Then, a Motion Attention Gated Fusion (MAGF) module integrates radar motion features to guide LiDAR features to focus on dynamic foreground objects. Extensive evaluations on the View-of-Delft (VoD) dataset demonstrate that MoRAL outperforms existing methods, achieving the highest mAP of 73.30% in the entire area and 88.68% in the driving corridor. Notably, our method also achieves the best AP of 69.67% for pedestrians in the entire area and 96.25% for cyclists in the driving corridor.

[52] Efficient LiDAR Reflectance Compression via Scanning Serialization

Jiahao Zhu,Kang You,Dandan Ding,Zhan Ma

Main category: cs.CV

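TL;DR: SerLiC is a serialization-based neural compression framework for efficiently compressing reflectance data in LiDAR point clouds; it achieves high compression ratios and fast processing through scan-order serialization and a Mamba model.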

Details Motivation: Reflectance attributes in LiDAR point clouds are essential for downstream tasks but remain underexplored in neural compression methods. Method: Serializes 3D LiDAR point clouds into 1D sequences in scan order, builds a contextual representation from the sensor scanning index, radial distance, and prior reflectance, and uses a Mamba model for efficient sequence modeling. Result: SerLiC achieves over 2x volume reduction, reduces compressed bits by up to 22% compared with the state-of-the-art method, and uses only 2% of its parameters. A lightweight version of SerLiC runs at more than 10 fps with only 111K parameters. Conclusion: SerLiC performs strongly in both compression efficiency and real-time capability and is suitable for practical applications. Abstract: Reflectance attributes in LiDAR point clouds provide essential information for downstream tasks but remain underexplored in neural compression methods. To address this, we introduce SerLiC, a serialization-based neural compression framework to fully exploit the intrinsic characteristics of LiDAR reflectance. SerLiC first transforms 3D LiDAR point clouds into 1D sequences via scan-order serialization, offering a device-centric perspective for reflectance analysis. Each point is then tokenized into a contextual representation comprising its sensor scanning index, radial distance, and prior reflectance, for effective dependency exploration. For efficient sequential modeling, Mamba is incorporated with a dual parallelization scheme, enabling simultaneous autoregressive dependency capture and fast processing. Extensive experiments demonstrate that SerLiC attains over 2x volume reduction against the original reflectance data, outperforming the state-of-the-art method by up to 22% reduction of compressed bits while using only 2% of its parameters. Moreover, a lightweight version of SerLiC achieves > 10 fps (frames per second) with just 111K parameters, which is attractive for real-world applications.
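
The serialization and tokenization steps described above can be sketched in a few lines. The ordering rule (laser by laser, then by azimuth) and the input layout `(x, y, z, reflectance, laser_id)` are assumptions made for illustration; the abstract only states that points are serialized in scan order and tokenized with the scanning index, radial distance, and prior reflectance. The function names are hypothetical.

```python
import math


def serialize_scan_order(points):
    """Order points laser by laser, then by azimuth within each laser.

    points: list of (x, y, z, reflectance, laser_id) tuples (assumed layout).
    """
    def scan_key(point):
        x, y, _z, _reflectance, laser_id = point
        azimuth = math.atan2(y, x) % (2.0 * math.pi)  # angle in [0, 2*pi)
        return (laser_id, azimuth)

    return sorted(points, key=scan_key)


def tokenize(sequence):
    """Build (laser_id, radial distance, previous reflectance) context tokens.

    Returns the tokens and the reflectance values a model would predict.
    """
    tokens, targets = [], []
    previous_reflectance = 0  # placeholder context for the first point
    for x, y, z, reflectance, laser_id in sequence:
        radial = math.sqrt(x * x + y * y + z * z)
        tokens.append((laser_id, radial, previous_reflectance))
        targets.append(reflectance)
        previous_reflectance = reflectance
    return tokens, targets
```

Once the cloud is a 1D sequence, each reflectance value can be predicted from its token and entropy-coded, which is where a sequence model such as Mamba would come in.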

[53] Endo-CLIP: Progressive Self-Supervised Pre-training on Raw Colonoscopy Records

Yili He,Yan Zhu,Peiyao Fu,Ruijie Yang,Tianyi Chen,Zhihua Wang,Quanlin Li,Pinghong Zhou,Xian Yang,Shuo Wang

Main category: cs.CV

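TL;DR: Endo-CLIP is a self-supervised framework that improves endoscopic image analysis through three stages (cleansing, attunement, and unification), significantly improving polyp detection and classification.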

Details Motivation: Endoscopic image analysis faces non-informative backgrounds, complex medical terminology, and ambiguous multi-lesion descriptions, which calls for better pre-training methods. Method: Endo-CLIP uses a three-stage framework: (1) removing background frames; (2) using large language models to extract clinical attributes for fine-grained contrastive learning; (3) applying patient-level cross-attention to resolve multi-polyp ambiguities. Result: Experiments show that Endo-CLIP significantly outperforms existing pre-training methods in zero-shot and few-shot polyp detection and classification. Conclusion: Endo-CLIP opens a path toward more accurate and clinically relevant endoscopic analysis. Abstract: Pre-training on image-text colonoscopy records offers substantial potential for improving endoscopic image analysis, but faces challenges including non-informative background images, complex medical terminology, and ambiguous multi-lesion descriptions. We introduce Endo-CLIP, a novel self-supervised framework that enhances Contrastive Language-Image Pre-training (CLIP) for this domain. Endo-CLIP's three-stage framework--cleansing, attunement, and unification--addresses these challenges by (1) removing background frames, (2) leveraging large language models to extract clinical attributes for fine-grained contrastive learning, and (3) employing patient-level cross-attention to resolve multi-polyp ambiguities. Extensive experiments demonstrate that Endo-CLIP significantly outperforms state-of-the-art pre-training methods in zero-shot and few-shot polyp detection and classification, paving the way for more accurate and clinically relevant endoscopic analysis.

[54] MrTrack: Register Mamba for Needle Tracking with Rapid Reciprocating Motion during Ultrasound-Guided Aspiration Biopsy

Yuelin Zhang,Qingpeng Ding,Long Lei,Yongxuan Feng,Raymond Shing-Yan Tang,Shing Shin Cheng

Main category: cs.CV

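TL;DR: MrTrack is a Mamba-based needle tracker for ultrasound-guided fine needle aspiration biopsy that handles rapid reciprocating motion through temporal context extraction and retrieval, significantly improving accuracy and efficiency.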

Details Motivation: In ultrasound-guided fine needle aspiration biopsy, rapid reciprocating motion degrades the performance of existing trackers, so a new approach is needed. Method: Proposes MrTrack, which uses a Mamba-based register mechanism to extract and retrieve temporal context, combined with a self-supervised loss that promotes feature diversity. Result: On both motorized and manual aspiration datasets, MrTrack outperforms existing methods in accuracy, robustness, and inference efficiency. Conclusion: MrTrack offers an efficient and accurate needle-tracking solution for ultrasound-guided biopsy. Abstract: Ultrasound-guided fine needle aspiration (FNA) biopsy is a common minimally invasive diagnostic procedure. However, an aspiration needle tracker addressing rapid reciprocating motion is still missing. MrTrack, an aspiration needle tracker with a Mamba-based register mechanism, is proposed. MrTrack leverages a Mamba-based register extractor to sequentially distill global context from each historical search map, storing these temporal cues in a register bank. The Mamba-based register retriever then retrieves temporal prompts from the register bank to provide external cues when current vision features are temporarily unusable due to rapid reciprocating motion and imaging degradation. A self-supervised register diversify loss is proposed to encourage feature diversity and dimension independence within the learned register, mitigating feature collapse. Comprehensive experiments conducted on both motorized and manual aspiration datasets demonstrate that MrTrack not only outperforms state-of-the-art trackers in accuracy and robustness but also achieves superior inference efficiency.

[55] Beyond Pixels: Leveraging the Language of Soccer to Improve Spatio-Temporal Action Detection in Broadcast Videos

Jeremie Ochin,Raphael Chekroun,Bogdan Stanciulescu,Sotiris Manitsaris

Main category: cs.CV

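TL;DR: This paper proposes a Transformer-based encoder-decoder model that improves spatio-temporal action detection (STAD) in soccer videos by combining game-state information with contextual understanding, reducing false positives and improving precision and recall.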

Details Motivation: Existing STAD methods lack contextual understanding when operated in the high-recall, low-precision regime, which produces many false positives. The paper addresses this through game-level reasoning and a sequence denoising task. Method: A Transformer encoder-decoder model processes sequences of noisy predictions together with game-state information, drawing on team dynamics and tactical regularities to generate denoised action sequences. Result: The method improves both precision and recall in low-confidence regimes, enabling more reliable event extraction. Conclusion: Through contextual modeling and team-level reasoning, the method improves STAD performance and complements existing pixel-based methods. Abstract: State-of-the-art spatio-temporal action detection (STAD) methods show promising results for extracting soccer events from broadcast videos. However, when operated in the high-recall, low-precision regime required for exhaustive event coverage in soccer analytics, their lack of contextual understanding becomes apparent: many false positives could be resolved by considering a broader sequence of actions and game-state information. In this work, we address this limitation by reasoning at the game level and improving STAD through the addition of a denoising sequence transduction task. Sequences of noisy, context-free player-centric predictions are processed alongside clean game state information using a Transformer-based encoder-decoder model. By modeling extended temporal context and reasoning jointly over team-level dynamics, our method leverages the "language of soccer" - its tactical regularities and inter-player dependencies - to generate "denoised" sequences of actions. This approach improves both precision and recall in low-confidence regimes, enabling more reliable event extraction from broadcast video and complementing existing pixel-based methods.

[56] A 2D Semantic-Aware Position Encoding for Vision Transformers

Xi Chen,Shiyang Zhou,Muqi Huang,Jiaxu Feng,Yun Xiong,Kun Zhou,Biao Yang,Yuhui Zhang,Huishuai Bao,Sijia Peng,Chuan Li,Feng Shi

Main category: cs.CV

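TL;DR: Proposes a new 2D semantic-aware position encoding method (SaPE²) that addresses the failure of conventional position encodings to capture semantic relationships in vision tasks.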

Details Motivation: Existing position encoding methods are largely borrowed from natural language processing and fail to capture semantic relationships between image patches, which limits model generalization and translation equivariance. Method: Proposes SaPE², which dynamically adapts position representations using local content rather than fixed linear relationships or spatial coordinates, adding semantic awareness. Result: SaPE² improves generalization across image resolutions and scales, improves translation equivariance, and better aggregates features of visually similar but spatially distant patches. Conclusion: SaPE² links position encoding with perceptual similarity in vision transformers and improves performance on computer vision tasks. Abstract: Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. However, existing position encoding techniques, which are largely borrowed from natural language processing, fail to effectively capture semantic-aware positional relationships between image patches. Traditional approaches like absolute position encoding and relative position encoding primarily focus on 1D linear position relationships, often neglecting the semantic similarity between distant yet contextually related patches. These limitations hinder model generalization, translation equivariance, and the ability to effectively handle repetitive or structured patterns in images. In this paper, we propose 2-Dimensional Semantic-Aware Position Encoding ($\text{SaPE}^2$), a novel position encoding method with semantic awareness that dynamically adapts position representations by leveraging local content instead of fixed linear position relationships or spatial coordinates. Our method enhances the model's ability to generalize across varying image resolutions and scales, improves translation equivariance, and better aggregates features for visually similar but spatially distant patches. By integrating $\text{SaPE}^2$ into vision transformers, we bridge the gap between position encoding and perceptual similarity, thereby improving performance on computer vision tasks.

[57] Denoising and Alignment: Rethinking Domain Generalization for Multimodal Face Anti-Spoofing

Yingjie Ma,Xun Lin,Zitong Yu,Xin Liu,Xiaochen Yuan,Weicheng Xie,Linlin Shen

Main category: cs.CV

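TL;DR: The paper proposes MMDA, a multimodal denoising and alignment framework that leverages CLIP's zero-shot generalization ability to significantly improve the generalization of cross-modal alignment, outperforming existing methods on multiple benchmark datasets.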

Details Motivation: Current multimodal face anti-spoofing methods generalize poorly, mainly because of modality-specific biases and domain shifts. Method: The MMDA framework uses the MD2A module to reduce the impact of domain and modality noise, the RS2 strategy to align multi-domain multimodal data, and the U-DSA module to improve the adaptability of representations. Result: On four benchmark datasets, MMDA outperforms existing methods in cross-domain generalization and multimodal detection accuracy. Conclusion: Through its denoising and alignment mechanisms, MMDA improves both the generalization and the detection accuracy of multimodal face anti-spoofing. Abstract: Face Anti-Spoofing (FAS) is essential for the security of facial recognition systems in diverse scenarios such as payment processing and surveillance. Current multimodal FAS methods often struggle with effective generalization, mainly due to modality-specific biases and domain shifts. To address these challenges, we introduce the Multimodal Denoising and Alignment (MMDA) framework. By leveraging the zero-shot generalization capability of CLIP, the MMDA framework effectively suppresses noise in multimodal data through denoising and alignment mechanisms, thereby significantly enhancing the generalization performance of cross-modal alignment. The Modality-Domain Joint Differential Attention (MD2A) module in MMDA concurrently mitigates the impacts of domain and modality noise by refining the attention mechanism based on extracted common noise features. Furthermore, the Representation Space Soft (RS2) Alignment strategy utilizes the pre-trained CLIP model to align multi-domain multimodal data into a generalized representation space in a flexible manner, preserving intricate representations and enhancing the model's adaptability to various unseen conditions. We also design a U-shaped Dual Space Adaptation (U-DSA) module to enhance the adaptability of representations while maintaining generalization performance. These improvements not only enhance the framework's generalization capabilities but also boost its ability to capture complex representations.
Our experimental results on four benchmark datasets under different evaluation protocols demonstrate that the MMDA framework outperforms existing state-of-the-art methods in terms of cross-domain generalization and multimodal detection accuracy. The code will be released soon.

[58] Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput

Bo Zhang,Shuo Li,Runhe Tian,Yang Yang,Jixin Tang,Jinhao Zhou,Lin Ma

Main category: cs.CV

TL;DR: Flash-VL 2B is a new approach to optimizing vision-language models (VLMs) for ultra-low latency and high throughput while maintaining accuracy.

Details Motivation: Real-time applications require VLMs that run efficiently in resource-constrained environments. Method: Combines architectural enhancements, optimized computational strategies, token compression, data curation, training schemes, and a novel image processing technique called implicit semantic stitching. Result: On 11 standard VLM benchmarks, Flash-VL 2B achieves state-of-the-art results in both speed and accuracy. Conclusion: Flash-VL 2B is a promising solution for resource-constrained and large-scale real-time applications. Abstract: In this paper, we introduce Flash-VL 2B, a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging advanced architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.

[59] Conformal Bounds on Full-Reference Image Quality for Imaging Inverse Problems

Jeffrey Wen,Rizwan Ahmad,Philip Schniter

Main category: cs.CV

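TL;DR: The paper combines conformal prediction with approximate posterior sampling to construct bounds on full-reference image quality (FRIQ) metrics without access to the true image, with the bounds guaranteed to hold up to a user-specified error probability.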

Details Motivation: In imaging inverse problems it is important to know how close the recovered image is to the true image, especially in safety-critical applications such as medical imaging. Because the true image is unknown, computing FRIQ metrics directly is not possible. Method: Combines conformal prediction with approximate posterior sampling to construct bounds on FRIQ metrics. Result: The approach is demonstrated on image denoising and accelerated magnetic resonance imaging (MRI) problems. Conclusion: The method provides reliable bounds on FRIQ metrics and is suited to safety-critical applications. Abstract: In imaging inverse problems, we would like to know how close the recovered image is to the true image in terms of full-reference image quality (FRIQ) metrics like PSNR, SSIM, LPIPS, etc. This is especially important in safety-critical applications like medical imaging, where knowing that, say, the SSIM was poor could potentially avoid a costly misdiagnosis. But since we don't know the true image, computing FRIQ is non-trivial. In this work, we combine conformal prediction with approximate posterior sampling to construct bounds on FRIQ that are guaranteed to hold up to a user-specified error probability. We demonstrate our approach on image denoising and accelerated magnetic resonance imaging (MRI) problems. Code is available at https://github.com/jwen307/quality_uq.
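
To make the idea of a bound that holds "up to a user-specified error probability" concrete, here is a minimal sketch of generic split conformal prediction applied to a quality metric such as PSNR. It is not the authors' implementation: the function name, the use of a single point estimate per image (for example, one derived from posterior samples), and the choice of nonconformity score (estimate minus true value) are all assumptions made for illustration.

```python
import math


def conformal_lower_bound(cal_estimates, cal_true, test_estimate, alpha):
    """Split-conformal lower bound on a higher-is-better metric (e.g. PSNR).

    cal_estimates: estimated metric for each calibration image (hypothetical).
    cal_true:      true metric for the same images (known during calibration).
    test_estimate: estimated metric for a new image whose truth is unknown.
    alpha:         tolerated probability that the true metric falls below the bound.
    """
    n = len(cal_true)
    # Nonconformity score: how much the estimate overshoots the truth.
    scores = sorted(e - t for e, t in zip(cal_estimates, cal_true))
    # Finite-sample corrected quantile index used in split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration samples for this alpha: only a trivial bound holds.
        return float("-inf")
    return test_estimate - scores[k - 1]


if __name__ == "__main__":
    est = [30, 31, 29, 32, 30, 28, 33, 31, 30]
    true = [29, 31, 27, 30, 30, 29, 30, 30, 29]
    print(conformal_lower_bound(est, true, test_estimate=30, alpha=0.5))
```

Under the usual exchangeability assumption between calibration and test samples, the true metric lies at or above the returned bound with probability at least 1 - alpha. A tighter estimator yields smaller scores and therefore a less conservative bound.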

[60] Contactless Cardiac Pulse Monitoring Using Event Cameras

Mohamed Moustafa,Joseph Lemley,Peter Corcoran

Main category: cs.CV

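TL;DR: The study uses event cameras and a CNN model to reconstruct the cardiac pulse signal contact-free from facial event streams; results show that event frames generated at higher frame rates outperform the standard camera.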

Details Motivation: To explore the potential of event cameras for contactless heart rate monitoring, exploiting their high dynamic range and temporal resolution. Method: A supervised CNN model extracts the cardiac signal from the event stream, and performance is evaluated by the accuracy of the calculated heart rate. Result: Models trained on event frames at 60 and 120 FPS outperform the standard camera, with RMSEs of 2.54 and 2.13 bpm, respectively. Conclusion: Event cameras show potential for remote heart rate monitoring, with better performance at higher frame rates. Abstract: Time event cameras are a novel technology for recording scene information at extremely low latency and with low power consumption. Event cameras output a stream of events that encapsulate pixel-level light intensity changes within the scene, capturing information with a higher dynamic range and temporal resolution than traditional cameras. This study investigates the contact-free reconstruction of an individual's cardiac pulse signal from time event recording of their face using a supervised convolutional neural network (CNN) model. An end-to-end model is trained to extract the cardiac signal from a two-dimensional representation of the event stream, with model performance evaluated based on the accuracy of the calculated heart rate. The experimental results confirm that physiological cardiac information in the facial region is effectively preserved within the event stream, showcasing the potential of this novel sensor for remote heart rate monitoring. The model trained on event frames achieves a root mean square error (RMSE) of 3.32 beats per minute (bpm) compared to the RMSE of 2.92 bpm achieved by the baseline model trained on standard camera frames. Furthermore, models trained on event frames generated at 60 and 120 FPS outperformed the 30 FPS standard camera results, achieving an RMSE of 2.54 and 2.13 bpm, respectively.

[61] Camera-Only 3D Panoptic Scene Completion for Autonomous Driving through Differentiable Object Shapes

Nicola Marinello,Simen Cassiman,Jonas Heylen,Marc Proesmans,Luc Van Gool

Main category: cs.CV

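TL;DR: This paper proposes a new 3D panoptic scene completion framework that extends existing 3D semantic scene completion models with an Object Module and a Panoptic Module, supporting path planning and decision-making for autonomous vehicles.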

Details Motivation: Autonomous vehicles need a complete map of their surroundings to plan and act, yet 3D panoptic scene completion is underexplored and new methods are needed. Method: Proposes a framework with an Object Module and a Panoptic Module that can easily be combined with existing 3D occupancy and scene completion methods and that learns object shapes from the annotations available in occupancy benchmarks. Result: The method distinguishes object instances within the same class and predicts occluded regions, giving path planning and decision-making more complete information about the environment. Conclusion: The framework offers a new solution for 3D panoptic scene completion, and the code is publicly available. Abstract: Autonomous vehicles need a complete map of their surroundings to plan and act. This has sparked research into the tasks of 3D occupancy prediction, 3D scene completion, and 3D panoptic scene completion, which predict a dense map of the ego vehicle's surroundings as a voxel grid. Scene completion extends occupancy prediction by predicting occluded regions of the voxel grid, and panoptic scene completion further extends this task by also distinguishing object instances within the same class; both aspects are crucial for path planning and decision-making. However, 3D panoptic scene completion is currently underexplored. This work introduces a novel framework for 3D panoptic scene completion that extends existing 3D semantic scene completion models. We propose an Object Module and Panoptic Module that can easily be integrated with 3D occupancy and scene completion methods presented in the literature. Our approach leverages the available annotations in occupancy benchmarks, allowing individual object shapes to be learned as a differentiable problem. The code is available at https://github.com/nicolamarinello/OffsetOcc.

[62] Using Foundation Models as Pseudo-Label Generators for Pre-Clinical 4D Cardiac CT Segmentation

Anne-Marie Rickmann,Stephanie L. Thorn,Shawn S. Ahn,Supum Lee,Selen Uman,Taras Lysyy,Rachel Burns,Nicole Guerrera,Francis G. Spinale,Jason A. Burdick,Albert J. Sinusas,James S. Duncan

Main category: cs.CV

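TL;DR: The paper investigates using foundation models to generate pseudo-labels for porcine cardiac CT and refines those labels iteratively through self-training, without any manually annotated data.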

Details Motivation: Porcine models are widely used in pre-clinical research because of their anatomical and physiological similarity to humans, but species differences make direct model transfer difficult. Foundation models perform well on human data, yet their applicability to pig data remains largely unexplored. Method: Proposes a simple self-training approach that uses foundation models to generate pseudo-labels and improves segmentation quality through iterative updates, with no manually annotated pig data. Result: Self-training not only improves segmentation accuracy but also smooths out temporal inconsistencies across consecutive frames. Conclusion: The results are encouraging but leave room for improvement, for example through more sophisticated self-training strategies or by exploring other foundation models and cardiac imaging technologies. Abstract: Cardiac image segmentation is an important step in many cardiac image analysis and modeling tasks such as motion tracking or simulations of cardiac mechanics. While deep learning has greatly advanced segmentation in clinical settings, there is limited work on pre-clinical imaging, notably in porcine models, which are often used due to their anatomical and physiological similarity to humans. However, differences between species create a domain shift that complicates direct model transfer from human to pig data. Recently, foundation models trained on large human datasets have shown promise for robust medical image segmentation; yet their applicability to porcine data remains largely unexplored. In this work, we investigate whether foundation models can generate sufficiently accurate pseudo-labels for pig cardiac CT and propose a simple self-training approach to iteratively refine these labels. Our method requires no manually annotated pig data, relying instead on iterative updates to improve segmentation quality. We demonstrate that this self-training process not only enhances segmentation accuracy but also smooths out temporal inconsistencies across consecutive frames. Although our results are encouraging, there remains room for improvement, for example by incorporating more sophisticated self-training strategies and by exploring additional foundation models and other cardiac imaging technologies.

[63] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen,Zhiyang Xu,Xichen Pan,Yushi Hu,Can Qin,Tom Goldstein,Lifu Huang,Tianyi Zhou,Saining Xie,Silvio Savarese,Le Xue,Caiming Xiong,Ran Xu

Main category: cs.CV

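TL;DR: This paper presents BLIP3-o, a unified multimodal model based on a diffusion transformer that combines image understanding and generation and, through its training strategy and a high-quality dataset, performs strongly on multiple benchmarks.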

Details Motivation: Model architectures and training recipes for multimodal models that unify image understanding and generation remain underexplored; this paper aims to fill that gap. Method: Uses a diffusion transformer to generate CLIP image features, proposes a sequential pretraining strategy (understanding first, then generation), and builds the high-quality instruction-tuning dataset BLIP3o-60k. Result: BLIP3-o performs strongly on both image understanding and generation tasks, with high training efficiency and good generative quality. Conclusion: BLIP3-o contributes a model design and training recipe for unified multimodal models, and the open-source release is intended to support future research. Abstract: Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks.
To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.

[64] Don't Forget your Inverse DDIM for Image Editing

Guillermo Gomez-Trenado,Pablo Mesejo,Oscar Cordón,Stéphane Lathuilière

Main category: cs.CV

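TL;DR: SAGE is a new image editing technique built on pre-trained diffusion models that uses a self-attention guidance mechanism to make editing more efficient, and it clearly outperforms existing methods.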

Details Motivation: Existing methods for editing real images with text-to-image models are either computationally intensive or produce poor reconstructions. Method: Builds on the DDIM algorithm and designs a new guidance mechanism from the self-attention layers of the diffusion U-Net to improve reconstruction of unedited regions. Result: SAGE performs well in quantitative and qualitative evaluations; in the user study all 47 surveyed users preferred SAGE, and it ranks first in 7 of 10 quantitative analyses. Conclusion: Through self-attention guidance, SAGE efficiently addresses the key challenges of image editing. Abstract: The field of text-to-image generation has undergone significant advancements with the introduction of diffusion models. Nevertheless, the challenge of editing real images persists, as most methods are either computationally intensive or produce poor reconstructions. This paper introduces SAGE (Self-Attention Guidance for image Editing) - a novel technique leveraging pre-trained diffusion models for image editing. SAGE builds upon the DDIM algorithm and incorporates a novel guidance mechanism utilizing the self-attention layers of the diffusion U-Net. This mechanism computes a reconstruction objective based on attention maps generated during the inverse DDIM process, enabling efficient reconstruction of unedited regions without the need to precisely reconstruct the entire input image. Thus, SAGE directly addresses the key challenges in image editing. The superiority of SAGE over other methods is demonstrated through quantitative and qualitative evaluations and confirmed by a statistically validated comprehensive user study, in which all 47 surveyed users preferred SAGE over competing methods. Additionally, SAGE ranks as the top-performing method in seven out of 10 quantitative analyses and secures second and third places in the remaining three.

[65] Variational Visual Question Answering

Tobias Jan Wieczorek,Nathalie Daun,Mohammad Emtiyaz Khan,Marcus Rohrbach

Main category: cs.CV

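TL;DR: Proposes a variational-learning approach to VQA (using IVON) that significantly improves the calibration and reliability of multimodal models, with larger gains under distribution shift.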

Details Motivation: Despite progress of multimodal models on VQA, their calibration and reliability in out-of-distribution (OOD) settings remain unresolved. Method: Replaces conventional AdamW fine-tuning with the variational algorithm IVON, which yields a posterior distribution over model parameters. Result: Experiments show that the approach reduces Expected Calibration Error by more than 50%, raises Coverage by 4% (at a fixed risk of 1%), and raises Coverage by 8% in the OOD setting. Conclusion: Variational learning is an effective way to improve the reliability of multimodal models. Abstract: Despite remarkable progress in multimodal models for Visual Question Answering (VQA), there remain major reliability concerns because the models can often be overconfident and miscalibrated, especially in out-of-distribution (OOD) settings. Plenty has been done to address such issues for unimodal models, but little work exists for multimodal cases. Here, we address unreliability in multimodal models by proposing a Variational VQA approach. Specifically, instead of fine-tuning vision-language models by using AdamW, we employ a recently proposed variational algorithm called IVON, which yields a posterior distribution over model parameters. Through extensive experiments, we show that our approach improves calibration and abstentions without sacrificing the accuracy of AdamW. For instance, compared to AdamW fine-tuning, we reduce Expected Calibration Error by more than 50% and raise Coverage by 4% vs. SOTA (for a fixed risk of 1%). In the presence of distribution shifts, the performance gain is even higher, achieving 8% Coverage (@ 1% risk) improvement vs. SOTA when 50% of test cases are OOD. Overall, we present variational learning as a viable option to enhance the reliability of multimodal models.
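
Since the headline result is stated in terms of Expected Calibration Error, a small sketch of the standard binned ECE estimator may help readers interpret it. This is the textbook definition rather than the paper's evaluation code; the function name, the equal-width binning with 10 bins, and the half-open bin convention are assumptions of this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen answer, each in [0, 1].
    correct:     booleans, True where the chosen answer was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # An overconfident model: 95% confidence but only half the answers are right.
    print(expected_calibration_error([0.95] * 4, [True, True, False, False]))
```

A value near zero means that stated confidence tracks observed accuracy; overconfident models, like the one in the usage example, produce a large gap in the high-confidence bins.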

[66] UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing

Yung-Hsuan Lai,Janek Ebbers,Yu-Chiang Frank Wang,François Germain,Michael Jeffrey Jones,Moitreya Chatterjee

Main category: cs.CV

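TL;DR: UWAV is a new weakly-supervised audio-visual video parsing method that accounts for the uncertainty of pseudo-labels and adds feature mixup regularization, significantly improving performance.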

Details Motivation: Existing methods ignore inter-segment dependencies and label prediction bias when generating pseudo-labels, which limits their performance. UWAV aims to address these issues. Method: UWAV introduces uncertainty-weighted pseudo-label generation and feature mixup regularization to improve model training. Result: UWAV outperforms existing methods on multiple metrics and datasets, demonstrating its effectiveness and generalizability. Conclusion: By improving pseudo-label generation and the training strategy, UWAV significantly improves performance on the weakly-supervised AVVP task. Abstract: Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, where only modality-agnostic, video-level labels are available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide model training. However, the absence of inter-segment dependencies when generating these pseudo-labels and the general bias towards predicting labels that are absent in a segment limit their performance. This work proposes a novel approach towards overcoming these weaknesses called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV). Additionally, our innovative approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature mixup based training regularization for improved training. Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.

cs.GR [Back]

[67] IntrinsicEdit: Precise generative image manipulation in intrinsic space

Linjie Lyu,Valentin Deschaintre,Yannick Hold-Geoffroy,Miloš Hašan,Jae Shin Yoon,Thomas Leimkühler,Christian Theobalt,Iliyan Georgiev

Main category: cs.GR

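TL;DR: Proposes a generative workflow built on the RGB-X diffusion framework that achieves pixel-precise editing through exact diffusion inversion and disentangled channel manipulation, with no additional data or fine-tuning.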

Details Motivation: The editing interfaces of existing generative diffusion models lack precise control and typically serve only a single task. Method: Builds on the RGB-X diffusion framework, operates in an intrinsic-image latent space, and combines exact diffusion inversion with disentangled channel manipulation. Result: Achieves state-of-the-art performance on complex images across tasks including color and texture adjustments, object insertion and removal, and global relighting. Conclusion: The method offers an efficient and precise solution for multi-task image editing. Abstract: Generative diffusion models have advanced image editing with high-quality results and intuitive interfaces such as prompts and semantic drawing. However, these interfaces lack precise control, and the associated methods typically specialize in a single editing task. We introduce a versatile, generative workflow that operates in an intrinsic-image latent space, enabling semantic, local manipulation with pixel precision for a range of editing operations. Building atop the RGB-X diffusion framework, we address key challenges of identity preservation and intrinsic-channel entanglement. By incorporating exact diffusion inversion and disentangled channel manipulation, we enable precise, efficient editing with automatic resolution of global illumination effects -- all without additional data collection or model fine-tuning. We demonstrate state-of-the-art performance across a variety of tasks on complex images, including color and texture adjustments, object insertion and removal, global relighting, and their combinations.

[68] Template-Guided Reconstruction of Pulmonary Segments with Neural Implicit Functions

Kangxian Xie,Yufei Zhu,Kaiming Kuang,Li Zhang,Hongwei Bran Li,Mingchen Gao,Jiancheng Yang

Main category: cs.GR

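TL;DR: Proposes a neural implicit function-based method for high-quality 3D pulmonary segment reconstruction, addressing the computational and resolution limits of conventional deep learning methods.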

Details Motivation: High-quality 3D reconstruction of pulmonary segments is crucial for segmentectomy and surgical planning in lung cancer, but existing methods fall short in resolution and computational cost. Method: Uses a neural implicit function to learn a 3D surface, achieving anatomy-aware, precise reconstruction by deforming a learnable template, and introduces two clinically relevant evaluation metrics. Result: The proposed method outperforms existing methods, and the Lung3D dataset was developed for benchmarking reconstruction algorithms. Conclusion: The method offers a new perspective on pulmonary segment reconstruction; code and data will be made public. Abstract: High-quality 3D reconstruction of pulmonary segments plays a crucial role in segmentectomy and surgical treatment planning for lung cancer. Due to the resolution requirement of the target reconstruction, conventional deep learning-based methods often suffer from computational resource constraints or limited granularity. Conversely, implicit modeling is favored due to its computational efficiency and continuous representation at any resolution. We propose a neural implicit function-based method to learn a 3D surface to achieve anatomy-aware, precise pulmonary segment reconstruction, represented as a shape by deforming a learnable template. Additionally, we introduce two clinically relevant evaluation metrics to assess the reconstruction comprehensively. Further, due to the absence of publicly available shape datasets to benchmark reconstruction algorithms, we developed a shape dataset named Lung3D, including the 3D models of 800 labeled pulmonary segments and the corresponding airways, arteries, veins, and intersegmental veins. We demonstrate that the proposed approach outperforms existing methods, providing a new perspective for pulmonary segment reconstruction. Code and data will be available at https://github.com/M3DV/ImPulSe.

[69] Position-Normal Manifold for Efficient Glint Rendering on High-Resolution Normal Maps

Liwen Wu,Fujun Luan,Miloš Hašan,Ravi Ramamoorthi

Main category: cs.GR

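TL;DR: Proposes a manifold-based formulation of the glint normal distribution function (NDF) that precisely captures surface normal distributions and is faster and simpler than existing methods.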

Details Motivation: The glints that specular objects produce under high-frequency lighting are hard to render with a conventional normal-mapped BRDF, so a more precise and efficient method is needed. Method: Uses a manifold-based formulation that turns NDF construction into a mesh intersection problem, avoiding complex numerical approximations, and accelerates evaluation with a mesh clustering hierarchy. Result: The new method produces a similar glinty appearance while running an order of magnitude faster than the baselines. Conclusion: The method is efficient and exact, and the framework also yields an analytical shadow-masking derivation for normal-mapped diffuse surfaces. Abstract: Detailed microstructures on specular objects often exhibit intriguing glinty patterns under high-frequency lighting, which is challenging to render using a conventional normal-mapped BRDF. In this paper, we present a manifold-based formulation of the glint normal distribution function (NDF) that precisely captures the surface normal distributions over queried footprints. The manifold-based formulation transfers the integration for the glint NDF construction to a problem of mesh intersections. Compared to previous works that rely on complex numerical approximations, our integral solution is exact and much simpler to compute, which also allows an easy adaptation of a mesh clustering hierarchy to accelerate the NDF evaluation of large footprints. Our performance and quality analysis shows that our NDF formulation achieves a similar glinty appearance compared to the baselines but is an order of magnitude faster. Within this framework, we further present a novel derivation of analytical shadow-masking for normal-mapped diffuse surfaces -- a component that is often ignored in previous works.

[70] Neural BRDF Importance Sampling by Reparameterization

Liwen Wu,Sai Bi,Zexiang Xu,Hao Tan,Kai Zhang,Fujun Luan,Haolin Lu,Ravi Ramamoorthi

Main category: cs.GR

TL;DR: Proposes a reparameterization-based method for neural BRDF importance sampling that improves rendering efficiency and flexibility.

Details Motivation: Importance sampling of neural BRDFs remains challenging; existing methods rely on invertible networks and multi-step inference, which limits efficiency and flexibility. Method: Uses reparameterization to turn the distribution learning task into a problem of identifying BRDF integral substitutions, removing the constraints of invertible networks and multi-step inference. Result: Achieves the best variance reduction in neural BRDF rendering while maintaining high inference speed. Conclusion: The method improves rendering quality while offering greater efficiency and flexibility. Abstract: Neural bidirectional reflectance distribution functions (BRDFs) have emerged as popular material representations for enhancing realism in physically-based rendering. Yet their importance sampling remains a significant challenge. In this paper, we introduce a reparameterization-based formulation of neural BRDF importance sampling that seamlessly integrates into the standard rendering pipeline with precise generation of BRDF samples. The reparameterization-based formulation transfers the distribution learning task to a problem of identifying BRDF integral substitutions. In contrast to previous methods that rely on invertible networks and multi-step inference to reconstruct BRDF distributions, our model removes these constraints, which offers greater flexibility and efficiency. Our variance and performance analysis demonstrates that our reparameterization method achieves the best variance reduction in neural BRDF renderings while maintaining high inference speeds compared to existing baselines.

[71] Procedural Low-Poly Terrain Generation with Terracing for Computer Games

Richard Tivolt

Main category: cs.GR

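TL;DR: Proposes a method for generating random low-poly terraced terrain, addressing the inability of the traditional vertex-grid approach to reproduce a low-poly look.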

Details Motivation: Terrain generated with the traditional vertex-grid approach is smooth but lacks the chaotic appearance of low-poly objects, making it unsuitable for the intended use. Method: Randomly generates terraced low-poly terrain and adds different biomes and vegetation. Result: Produces terrain with a low-poly look and richer environmental detail. Conclusion: The method generates low-poly terraced terrain that meets the stated goal. Abstract: In computer games, traditional procedural terrain generation relies on a grid of vertices, with each point representing terrain elevation. For each square in the grid, two triangles are created by connecting fixed vertex indices, resulting in a continuous 3D surface. While this method is efficient for modelling smooth terrain, the grid-like structure lacks the distinct, chaotic appearance of low-poly objects and is not suitable to be used for our purposes. The technique presented in this paper aims to solve the following problem: Generate random, low-poly looking terraced terrain with different biomes and add vegetation to create an interesting environment.
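
The traditional grid construction described in the abstract (two triangles per grid square from fixed vertex indices) can be sketched in a few lines of Python. The `terrace` helper, which snaps an elevation down to the nearest step, is only one simple way to get a stepped look and is an assumption of this example; the abstract does not describe how the paper itself builds its terraces.

```python
import math


def grid_triangles(rows, cols):
    """Index triples for a rows x cols vertex grid stored row by row.

    Each grid square is split into two triangles using fixed vertex offsets,
    as in the traditional approach the abstract describes.
    """
    triangles = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c                      # top-left vertex of the square
            triangles.append((i, i + 1, i + cols))
            triangles.append((i + 1, i + cols + 1, i + cols))
    return triangles


def terrace(height, step):
    """Snap an elevation down to a multiple of `step` (hypothetical helper)."""
    return math.floor(height / step) * step


if __name__ == "__main__":
    print(grid_triangles(2, 2))
    heights = [0.3, 1.2, 2.7, 0.9]
    print([terrace(h, 0.5) for h in heights])
```

Applying `terrace` to every vertex of the grid flattens the surface into plateaus, but the mesh connectivity stays regular, which is the limitation the abstract attributes to the grid-based method.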

[72] UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units

Huakun Liu,Hiroki Ota,Xin Wei,Yutaro Hirao,Monica Perusquia-Hernandez,Hideaki Uchiyama,Kiyoshi Kiyokawa

Main category: cs.GR

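TL;DR: UMotion is an uncertainty-driven framework that combines IMU and UWB sensors and uses an Unscented Kalman Filter (UKF) to refine 3D human pose and shape estimates in real time, addressing pose ambiguity and data drift.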

Details Motivation: Sparse wearable IMUs suffer from pose ambiguity, data drift, and limited adaptability in 3D human motion estimation, so a more robust solution is needed. Method: Proposes the UMotion framework, which combines IMU and UWB sensors and uses a UKF to fuse sensor data with human motion constraints in real time to refine the estimates. Result: Experiments show that UMotion stabilizes sensor data and improves pose accuracy over the state of the art. Conclusion: Through multi-sensor fusion and an uncertainty-driven approach, UMotion significantly improves the accuracy and robustness of 3D human pose estimation. Abstract: Sparse wearable inertial measurement units (IMUs) have gained popularity for estimating 3D human motion. However, challenges such as pose ambiguity, data drift, and limited adaptability to diverse bodies persist. To address these issues, we propose UMotion, an uncertainty-driven, online fusing-all state estimation framework for 3D human shape and pose estimation, supported by six integrated, body-worn ultra-wideband (UWB) distance sensors with IMUs. UWB sensors measure inter-node distances to infer spatial relationships, aiding in resolving pose ambiguities and body shape variations when combined with anthropometric data. Unfortunately, IMUs are prone to drift, and UWB sensors are affected by body occlusions. Consequently, we develop a tightly coupled Unscented Kalman Filter (UKF) framework that fuses uncertainties from sensor data and estimated human motion based on individual body shape. The UKF iteratively refines IMU and UWB measurements by aligning them with uncertain human motion constraints in real-time, producing optimal estimates for each. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of UMotion in stabilizing sensor data and the improvement over state of the art in pose accuracy.

cs.CL [Back]

[73] Human-AI Collaboration or Academic Misconduct? Measuring AI Use in Student Writing Through Stylometric Evidence

Eduardo Araujo Oliveira,Madhavi Mohoni,Sonsoles López-Pernas,Mohammed Saqr

Main category: cs.CL

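TL;DR: The study uses authorship verification techniques to quantify AI assistance in academic writing, with the aim of promoting transparency and student development.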

Details Motivation: As human-AI collaboration spreads in education, quantifying AI assistance has become an increasingly prominent challenge, which this study sets out to address. Method: The study has three stages: dataset selection and expansion, development of the authorship verification method, and systematic evaluation, using an adapted Feature Vector Difference method. Result: The enhanced classifier distinguishes student writing from AI-generated text and is robust across multiple scenarios. Conclusion: The study provides a transparent tool for academic writing in the AI era and supports academic integrity investigations. Abstract: As human-AI collaboration becomes increasingly prevalent in educational contexts, understanding and measuring the extent and nature of such interactions pose significant challenges. This research investigates the use of authorship verification (AV) techniques not as a punitive measure, but as a means to quantify AI assistance in academic writing, with a focus on promoting transparency, interpretability, and student development. Building on prior work, we structured our investigation into three stages: dataset selection and expansion, AV method development, and systematic evaluation. Using three datasets - including a public dataset (PAN-14) and two from University of Melbourne students from various courses - we expanded the data to include LLM-generated texts, totalling 1,889 documents and 540 authorship problems from 506 students. We developed an adapted Feature Vector Difference AV methodology to construct robust academic writing profiles for students, designed to capture meaningful, individual characteristics of their writing. The method's effectiveness was evaluated across multiple scenarios, including distinguishing between student-authored and LLM-generated texts and testing resilience against LLMs' attempts to mimic student writing styles. Results demonstrate the enhanced AV classifier's ability to identify stylometric discrepancies and measure human-AI collaboration at word and sentence levels while providing educators with a transparent tool to support academic integrity investigations. This work advances AV technology, offering actionable insights into the dynamics of academic writing in an AI-driven era.

[74] Clicking some of the silly options: Exploring Player Motivation in Static and Dynamic Educational Interactive Narratives

Daeun Hwang,Samuel Shields,Alex Calderwood,Shi Johnson-Bey,Michael Mateas,Noah Wardrip-Fruin,Edward F. Melcer

Main category: cs.CL

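TL;DR: Compares the effects of static and dynamic narrative educational games on learner motivation, finding that dynamic narrative can increase engagement but that pedagogical goals must be balanced against narrative dynamism.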

Details Motivation: To examine the effect of dynamic narrative on learner motivation and fill a gap in the research. Method: Develop two versions of an interactive narrative game (static and dynamic narrative) and compare their effects on player motivation. Result: Dynamic narrative can increase player engagement, but pedagogical goals must be balanced against narrative dynamism. Conclusion: Dynamic narrative has potential in educational games, but its design needs further refinement. Abstract: Motivation is an important factor underlying successful learning. Previous research has demonstrated the positive effects that static interactive narrative games can have on motivation. Concurrently, advances in AI have made dynamic and adaptive approaches to interactive narrative increasingly accessible. However, limited work has explored the impact that dynamic narratives can have on learner motivation. In this paper, we compare two versions of Academical, a choice-based educational interactive narrative game about research ethics. One version employs a traditional hand-authored branching plot (i.e., static narrative) while the other dynamically sequences plots during play (i.e., dynamic narrative). Results highlight the importance of responsive content and a variety of choices for player engagement, while also illustrating the challenge of balancing pedagogical goals with the dynamic aspects of narrative. We also discuss design implications that arise from these findings. Ultimately, this work provides initial steps to illuminate the emerging potential of AI-driven dynamic narrative in educational games.

[75] A suite of LMs comprehend puzzle statements as well as humans

Adele E Goldberg,Supantho Rakshit,Jennifer Hu,Kyle Mahowald

Main category: cs.CL

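TL;DR: The study finds that human comprehension accuracy when rereading is restricted (73%) falls below that of Falcon-180B-Chat (76%) and GPT-4 (81%), while GPT-o1 achieves perfect accuracy. Models and humans perform similarly on queries involving reciprocal actions, suggesting that model abilities have been underestimated.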

Details Motivation: To reassess how humans and large language models perform on language comprehension tasks, challenging the earlier assumption that models underperform humans. Method: Using the same stimuli, run a preregistered study comparing human performance with rereading allowed and with rereading restricted, and compare it with several language models (Falcon-180B-Chat, GPT-4, GPT-o1, and others). Result: Human accuracy drops significantly when rereading is restricted (73%), falling below several models. Models and humans perform similarly on reciprocal-action queries, and model abilities have been systematically underestimated. Conclusion: The study stresses the importance of experimental design and coding practices, and challenges the assumption that current models are weaker than humans at language comprehension. Abstract: Recent claims suggest that large language models (LMs) underperform humans in comprehending minimally complex English statements (Dentella et al., 2024). Here, we revisit those findings and argue that human performance was overestimated, while LLM abilities were underestimated. Using the same stimuli, we report a preregistered study comparing human responses in two conditions: one allowed rereading (replicating the original study), and one that restricted rereading (a more naturalistic comprehension test). Human accuracy dropped significantly when rereading was restricted (73%), falling below that of Falcon-180B-Chat (76%) and GPT-4 (81%). The newer GPT-o1 model achieves perfect accuracy. Results further show that both humans and models are disproportionately challenged by queries involving potentially reciprocal actions (e.g., kissing), suggesting shared pragmatic sensitivities rather than model-specific deficits. Additional analyses using Llama-2-70B log probabilities, a recoding of open-ended model responses, and grammaticality ratings of other sentences reveal systematic underestimation of model performance. We find that GPT-4o can align with either naive or expert grammaticality judgments, depending on prompt framing. These findings underscore the need for more careful experimental design and coding practices in LLM evaluation, and they challenge the assumption that current models are inherently weaker than humans at language comprehension.

[76] For GPT-4 as with Humans: Information Structure Predicts Acceptability of Long-Distance Dependencies

Nicole Cuneo,Eleanor Graves,Supantho Rakshit,Adele E. Goldberg

Main category: cs.CL

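TL;DR: The paper examines whether language models (LMs) can understand language and produce reliable metalinguistic judgments, and tests GPT-4 on information structure and syntactic relationships.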

Details Motivation: To study whether language models can understand natural language and produce metalinguistic judgments about it, and whether they capture subtle relationships between form and function. Method: Test GPT-4 on information structure and acceptability tasks (Studies 1a, 1b), then test causality in Study 2 by manipulating information structure. Result: GPT-4 shows reliable metalinguistic skill on zero-shot tasks and replicates the striking interaction between information structure and acceptability. Conclusion: The study reveals a tight relationship between GPT-4 generated English and natural English, and the link between information structure and syntax merits further exploration. Abstract: It remains debated how well any LM understands natural language or generates reliable metalinguistic judgments. Moreover, relatively little work has demonstrated that LMs can represent and respect subtle relationships between form and function proposed by linguists. We here focus on a particular such relationship established in recent work: English speakers' judgments about the information structure of canonical sentences predict independently collected acceptability ratings on corresponding 'long distance dependency' [LDD] constructions, across a wide array of base constructions and multiple types of LDDs. To determine whether any LM captures this relationship, we probe GPT-4 on the same tasks used with humans and new extensions. Results reveal reliable metalinguistic skill on the information structure and acceptability tasks, replicating a striking interaction between the two, despite the zero-shot, explicit nature of the tasks, and little to no chance of contamination [Studies 1a, 1b]. Study 2 manipulates the information structure of base sentences and confirms a causal relationship: increasing the prominence of a constituent in a context sentence increases the subsequent acceptability ratings on an LDD construction. The findings suggest a tight relationship between natural and GPT-4 generated English, and between information structure and syntax, which begs for further exploration.

[77] Atomic Consistency Preference Optimization for Long-Form Question Answering

Jingfeng Chen,Raghuveer Thirukovalluru,Junlin Wang,Kaiwei Luo,Bhuwan Dhingra

Main category: cs.CL

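TL;DR: ACPO is a self-supervised preference optimization method that uses atomic consistency signals to improve the factual accuracy of LLMs without external supervision.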

Details Motivation: To address factual errors in LLM output without depending on external models or knowledge bases. Method: Proposes ACPO, which uses atomic consistency signals across multiple responses to identify high- and low-quality data pairs for model alignment. Result: ACPO outperforms the supervised baseline FactAlign by 1.95 points on the LongFact and BioGen datasets. Conclusion: ACPO offers an efficient, scalable way to improve the factual reliability of LLMs. Abstract: Large Language Models (LLMs) frequently produce factoid hallucinations - plausible yet incorrect answers. A common mitigation strategy is model alignment, which improves factual accuracy by training on curated factual and non-factual pairs. However, this approach often relies on a stronger model (e.g., GPT-4) or an external knowledge base to assess factual correctness, which may not always be accessible. To address this, we propose Atomic Consistency Preference Optimization (ACPO), a self-supervised preference-tuning method that enhances factual accuracy without external supervision. ACPO leverages atomic consistency signals, i.e., the agreement of individual facts across multiple stochastic responses, to identify high- and low-quality data pairs for model alignment. By eliminating the need for costly GPT calls, ACPO provides a scalable and efficient approach to improving factoid question-answering. Despite being self-supervised, empirical results demonstrate that ACPO outperforms FactAlign, a strong supervised alignment baseline, by 1.95 points on the LongFact and BioGen datasets, highlighting its effectiveness in enhancing factual reliability without relying on external models or knowledge bases.
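
The idea of an atomic consistency signal (agreement of individual facts across several sampled responses) can be sketched roughly as below. This is an illustration under strong assumptions, not ACPO's actual pipeline: each response is assumed to be already split into a set of atomic fact strings, agreement is assumed to be exact string match, and the names `consistency_score` and `pick_preference_pair` are hypothetical.

```python
def consistency_score(facts, other_responses):
    """Fraction of a response's atomic facts that also appear in at least one other response."""
    if not facts:
        return 0.0
    supported = sum(1 for f in facts if any(f in other for other in other_responses))
    return supported / len(facts)

def pick_preference_pair(responses):
    """responses: list of sets of atomic facts, one set per sampled answer.

    Returns (index of most consistent, index of least consistent, all scores).
    """
    scores = []
    for i, facts in enumerate(responses):
        others = responses[:i] + responses[i + 1:]   # every sample except the current one
        scores.append(consistency_score(facts, others))
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return best, worst, scores
```

The highest- and lowest-scoring samples would then serve as the preferred and rejected sides of a preference pair; in practice the fact extraction and matching steps would need to be far more tolerant than exact string comparison.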

[78] A Comprehensive Analysis of Large Language Model Outputs: Similarity, Diversity, and Bias

Brandon Smith,Mohamed Reda Bouadjenek,Tahsin Alamgir Kheya,Phillip Dawson,Sunil Aryal

Main category: cs.CL

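TL;DR: The study compares 12 large language models (LLMs) on output similarity, diversity, and ethical behavior, finding that outputs from the same model are highly similar, that models differ markedly from one another, and that GPT-4 is distinctive and varied.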

Details Motivation: To examine how LLMs behave in terms of output similarity, diversity, and ethical standards, in order to guide future development and ethical evaluation. Method: Use 5,000 diverse prompts to generate about 3 million texts and compare the outputs of 12 LLMs, both proprietary and open-source. Result: Outputs from the same LLM are highly similar; models differ markedly (for example, WizardLM outputs are highly similar while GPT-4 is more varied); vocabulary and style differ clearly; some models are more balanced. Conclusion: The study sheds light on the behavior and diversity of LLM outputs and offers new insights for future development and ethical evaluation. Abstract: Large Language Models (LLMs) represent a major step toward artificial general intelligence, significantly advancing our ability to interact with technology. While LLMs perform well on Natural Language Processing tasks -- such as translation, generation, code writing, and summarization -- questions remain about their output similarity, variability, and ethical implications. For instance, how similar are texts generated by the same model? How does this compare across different models? And which models best uphold ethical standards? To investigate, we used 5,000 prompts spanning diverse tasks like generation, explanation, and rewriting. This resulted in approximately 3 million texts from 12 LLMs, including proprietary and open-source systems from OpenAI, Google, Microsoft, Meta, and Mistral. Key findings include: (1) outputs from the same LLM are more similar to each other than to human-written texts; (2) models like WizardLM-2-8x22b generate highly similar outputs, while GPT-4 produces more varied responses; (3) LLM writing styles differ significantly, with Llama 3 and Mistral showing higher similarity, and GPT-4 standing out for distinctiveness; (4) differences in vocabulary and tone underscore the linguistic uniqueness of LLM-generated content; (5) some LLMs demonstrate greater gender balance and reduced bias. These results offer new insights into the behavior and diversity of LLM outputs, helping guide future development and ethical evaluation.

[79] S-DAT: A Multilingual, GenAI-Driven Framework for Automated Divergent Thinking Assessment

Jennifer Haase,Paul H. P. Hanel,Sebastian Pokutta

Main category: cs.CL

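TL;DR: S-DAT is a multilingual framework that uses large language models and semantic distance to assess divergent thinking, addressing limitations of traditional creativity assessment.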

Details Motivation: Traditional creativity assessments are time-consuming, language-dependent, and subjective, which limits their scalability and cross-cultural applicability. Method: S-DAT combines large language models with multilingual embeddings and assesses divergent thinking by computing semantic distance. Result: S-DAT is shown to be robust and consistent across 11 languages and shows convergent validity with other divergent thinking measures. Conclusion: S-DAT provides a fairer, more comprehensive assessment tool for creativity research at global scale. Abstract: This paper introduces S-DAT (Synthetic-Divergent Association Task), a scalable, multilingual framework for automated assessment of divergent thinking (DT), a core component of human creativity. Traditional creativity assessments are often labor-intensive, language-specific, and reliant on subjective human ratings, limiting their scalability and cross-cultural applicability. In contrast, S-DAT leverages large language models and advanced multilingual embeddings to compute semantic distance -- a language-agnostic proxy for DT. We evaluate S-DAT across eleven diverse languages, including English, Spanish, German, Russian, Hindi, and Japanese (Kanji, Hiragana, Katakana), demonstrating robust and consistent scoring across linguistic contexts. Unlike prior DAT approaches, the S-DAT shows convergent validity with other DT measures and correct discriminant validity with convergent thinking. This cross-linguistic flexibility allows for more inclusive, global-scale creativity research, addressing key limitations of earlier approaches. S-DAT provides a powerful tool for fairer, more comprehensive evaluation of cognitive flexibility in diverse populations and can be freely accessed online: https://sdat.iol.zib.de/.
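
Semantic distance as a proxy for divergent thinking can be illustrated with a mean pairwise cosine distance over word embeddings. This is a minimal sketch under assumptions: the abstract does not state S-DAT's exact scoring formula, the embedding vectors are assumed to come from some multilingual embedding model that is not shown here, and the function names are hypothetical.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def mean_pairwise_distance(vectors):
    """Average cosine distance over all unordered pairs of embeddings (needs at least two)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

Under this scoring, a participant whose words lie far apart in embedding space receives a higher value than one who lists near-synonyms.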

[80] CEC-Zero: Chinese Error Correction Solution Based on LLM

Sophie Zhang,Zhiming Lin

Main category: cs.CL

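TL;DR: Proposes CEC-Zero, a reinforcement learning framework that enables large language models (LLMs) to learn to correct Chinese spelling errors autonomously, without external supervision. The method significantly improves model reliability and generalization.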

Details Motivation: Although LLMs perform well at Chinese text processing tasks such as spelling correction, their reliability and generalization remain limited. The paper aims to address this with a reinforcement learning framework. Method: Proposes the CEC-Zero framework, which combines reinforcement learning with the generative ability of LLMs so that the model learns to correct errors on its own, without annotated data or auxiliary models. Result: Experiments show that RL-enhanced LLMs achieve strong accuracy and cross-domain generalization and are suitable for practical Chinese NLP applications. Conclusion: CEC-Zero offers a scalable solution for Chinese text correction and sets a new paradigm for self-improving language models. Abstract: Recent advancements in large language models (LLMs) demonstrate exceptional Chinese text processing capabilities, particularly in Chinese Spelling Correction (CSC). While LLMs outperform traditional BERT-based models in accuracy and robustness, challenges persist in reliability and generalization. This paper proposes CEC-Zero, a novel reinforcement learning (RL) framework enabling LLMs to self-correct through autonomous error strategy learning without external supervision. By integrating RL with LLMs' generative power, the method eliminates dependency on annotated data or auxiliary models. Experiments reveal RL-enhanced LLMs achieve industry-viable accuracy and superior cross-domain generalization, offering a scalable solution for reliability optimization in Chinese NLP applications. This breakthrough facilitates LLM deployment in practical Chinese text correction scenarios while establishing a new paradigm for self-improving language models.

[81] How an unintended Side Effect of a Research Project led to Boosting the Power of UML

Ulrich Frank,Pierre Maier

Main category: cs.CL

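TL;DR: Introduces a new UML modeling tool that integrates class diagrams with object diagrams and supports object execution, suitable for teaching and research.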

Details Motivation: To improve on conventional UML tools, offering more flexible software architecture design and support for teaching. Method: Design and implement a new UML modeling tool that integrates class diagrams with object diagrams and supports object execution. Result: The tool has been used successfully in teaching and research and demonstrates the potential of multi-level architectures. Conclusion: The tool is a successful example of a research by-product and offers new ideas for software modeling and teaching. Abstract: This paper describes the design, implementation and use of a new UML modeling tool that represents a significant advance over conventional tools. Among other things, it allows the integration of class diagrams and object diagrams as well as the execution of objects. This not only enables new software architectures characterized by the integration of software with corresponding object models, but is also ideal for use in teaching, as it provides students with a particularly stimulating learning experience. A special feature of the project is that it has emerged from a long-standing international research project, which is aimed at a comprehensive multi-level architecture. The project is therefore an example of how research can lead to valuable results that arise as a side effect of other work.

[82] A Scalable Unsupervised Framework for multi-aspect labeling of Multilingual and Multi-Domain Review Data

Jiin Park,Misuk Kim

Main category: cs.CL

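TL;DR: Proposes a multilingual, scalable, unsupervised framework for cross-domain aspect detection, used for multi-aspect labeling of multilingual, multi-domain review data; experiments show the labels are high quality and suitable for training.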

Details Motivation: Existing studies are mostly limited to specific domains and languages, or depend on supervised learning that needs large amounts of labeled data, which restricts the breadth and efficiency of analysis. Method: Extract aspect category candidates through clustering, generate aspect-aware embedding vectors using negative sampling, and use pretrained language models to assess the effectiveness of the automatically generated labels. Result: Experiments show that the automatically generated labels are high quality, the models perform well, and the framework is consistent and scalable on large-scale data. Conclusion: The framework overcomes the limitations of supervised methods and suits multilingual, multi-domain settings; future work will explore automatic review summarization and AI agent integration to improve analysis efficiency. Abstract: Effectively analyzing online review data is essential across industries. However, many existing studies are limited to specific domains and languages or depend on supervised learning approaches that require large-scale labeled datasets. To address these limitations, we propose a multilingual, scalable, and unsupervised framework for cross-domain aspect detection. This framework is designed for multi-aspect labeling of multilingual and multi-domain review data. In this study, we apply automatic labeling to Korean and English review datasets spanning various domains and assess the quality of the generated labels through extensive experiments. Aspect category candidates are first extracted through clustering, and each review is then represented as an aspect-aware embedding vector using negative sampling. To evaluate the framework, we conduct multi-aspect labeling and fine-tune several pretrained language models to measure the effectiveness of the automatically generated labels. Results show that these models achieve high performance, demonstrating that the labels are suitable for training. Furthermore, comparisons with publicly available large language models highlight the framework's superior consistency and scalability when processing large-scale data. A human evaluation also confirms that the quality of the automatic labels is comparable to those created manually. This study demonstrates the potential of a robust multi-aspect labeling approach that overcomes limitations of supervised methods and is adaptable to multilingual, multi-domain environments.
Future research will explore automatic review summarization and the integration of artificial intelligence agents to further improve the efficiency and depth of review analysis.

[83] Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging

Hongjin Qian,Zheng Liu

Main category: cs.CL

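TL;DR: InForage is a reinforcement learning framework that augments LLMs with dynamic information retrieval to handle ambiguous and multi-step information needs in complex tasks.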

Details Motivation: Traditional static retrieval methods cannot meet the needs of complex tasks; dynamic, adaptive retrieval strategies are required. Method: Proposes the InForage framework, which uses reinforcement learning to optimize the retrieval process and rewards intermediate retrieval quality. Result: Outperforms baseline methods across multiple tasks, demonstrating the effectiveness of adaptive retrieval. Conclusion: InForage offers a practical route to building efficient, adaptive reasoning agents. Abstract: Augmenting large language models (LLMs) with external retrieval has become a standard method to address their inherent knowledge cutoff limitations. However, traditional retrieval-augmented generation methods employ static, pre-inference retrieval strategies, making them inadequate for complex tasks involving ambiguous, multi-step, or evolving information needs. Recent advances in test-time scaling techniques have demonstrated significant potential in enabling LLMs to dynamically interact with external tools, motivating the shift toward adaptive inference-time retrieval. Inspired by Information Foraging Theory (IFT), we propose InForage, a reinforcement learning framework that formalizes retrieval-augmented reasoning as a dynamic information-seeking process. Unlike existing approaches, InForage explicitly rewards intermediate retrieval quality, encouraging LLMs to iteratively gather and integrate information through adaptive search behaviors. To facilitate training, we construct a human-guided dataset capturing iterative search and reasoning trajectories for complex, real-world web tasks. Extensive evaluations across general question answering, multi-hop reasoning tasks, and a newly developed real-time web QA dataset demonstrate InForage's superior performance over baseline methods. These results highlight InForage's effectiveness in building robust, adaptive, and efficient reasoning agents.

[84] Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs

Jingcheng Niu,Xingdi Yuan,Tong Wang,Hamidreza Saghir,Amir H. Abdi

Main category: cs.CL

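TL;DR: The paper identifies a phenomenon called contextual entrainment: language models (LMs) assign significantly higher logits or probabilities to tokens that have already appeared in the input prompt, even random ones. The effect is modulated by semantic factors, and a differentiable-masking method identifies the responsible attention heads (entrainment heads); turning these heads off significantly attenuates the effect.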

Details Motivation: To show how language models become distracted by "irrelevant" contextual information when processing an input prompt, offering a new mechanistic perspective. Method: Identify entrainment heads through statistical analysis and differentiable masking, and verify that turning these heads off attenuates the entrainment effect. Result: Contextual entrainment is a mechanistic phenomenon modulated by semantic factors; turning off the entrainment heads significantly attenuates it. Conclusion: The discovery of contextual entrainment is a key step toward mechanistic analysis and mitigation of the distraction problem in language models. Abstract: We observe a novel phenomenon, contextual entrainment, across a wide range of language models (LMs) and prompt settings, providing a new mechanistic perspective on how LMs become distracted by "irrelevant" contextual information in the input prompt. Specifically, LMs assign significantly higher logits (or probabilities) to any tokens that have previously appeared in the context prompt, even for random tokens. This suggests that contextual entrainment is a mechanistic phenomenon, occurring independently of the relevance or semantic relation of the tokens to the question or the rest of the sentence. We find statistically significant evidence that the magnitude of contextual entrainment is influenced by semantic factors. Counterfactual prompts have a greater effect compared to factual ones, suggesting that while contextual entrainment is a mechanistic phenomenon, it is modulated by semantic factors. We hypothesise that there is a circuit of attention heads -- the entrainment heads -- that corresponds to the contextual entrainment phenomenon. Using a novel entrainment head discovery method based on differentiable masking, we identify these heads across various settings. When we "turn off" these heads, i.e., set their outputs to zero, the effect of contextual entrainment is significantly attenuated, causing the model to generate output that capitulates to what it would produce if no distracting context were provided. Our discovery of contextual entrainment, along with our investigation into LM distraction via the entrainment heads, marks a key step towards the mechanistic analysis and mitigation of the distraction problem.
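
"Turning off" a head by setting its output to zero can be pictured with the toy sketch below. It only illustrates zero-ablation on plain Python lists and assumes the per-head outputs are already available; it is not the paper's discovery method (differentiable masking), and the names `ablate_heads` and `combine_heads` are hypothetical.

```python
def ablate_heads(head_outputs, heads_to_zero):
    """Return per-head outputs with the selected heads replaced by zero vectors."""
    return [
        [0.0] * len(out) if i in heads_to_zero else list(out)
        for i, out in enumerate(head_outputs)
    ]

def combine_heads(head_outputs):
    """Concatenate per-head outputs, as happens before the attention output projection."""
    return [x for out in head_outputs for x in out]
```

In a real model the same effect is usually obtained by masking the relevant slice of the concatenated attention output before the output projection, so that the ablated heads contribute nothing downstream.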

[85] Qwen3 Technical Report

An Yang,Anfeng Li,Baosong Yang,Beichen Zhang,Binyuan Hui,Bo Zheng,Bowen Yu,Chang Gao,Chengen Huang,Chenxu Lv,Chujie Zheng,Dayiheng Liu,Fan Zhou,Fei Huang,Feng Hu,Hao Ge,Haoran Wei,Huan Lin,Jialong Tang,Jian Yang,Jianhong Tu,Jianwei Zhang,Jianxin Yang,Jiaxi Yang,Jing Zhou,Jingren Zhou,Junyang Lin,Kai Dang,Keqin Bao,Kexin Yang,Le Yu,Lianghao Deng,Mei Li,Mingfeng Xue,Mingze Li,Pei Zhang,Peng Wang,Qin Zhu,Rui Men,Ruize Gao,Shixuan Liu,Shuang Luo,Tianhao Li,Tianyi Tang,Wenbiao Yin,Xingzhang Ren,Xinyu Wang,Xinyu Zhang,Xuancheng Ren,Yang Fan,Yang Su,Yichang Zhang,Yinger Zhang,Yu Wan,Yuqiong Liu,Zekun Wang,Zeyu Cui,Zhenru Zhang,Zhipeng Zhou,Zihan Qiu

Main category: cs.CL

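TL;DR: Qwen3 is the latest version of the Qwen model family. It includes dense and MoE architectures, supports many languages, and integrates thinking and non-thinking modes in one framework, switching between them dynamically to optimize performance.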

Details Motivation: To improve the performance, efficiency, and multilingual capabilities of language models while removing the need to switch between models through a unified framework. Method: Uses dense and MoE architectures, a mechanism for switching dynamically between thinking and non-thinking modes, and a thinking budget mechanism for allocating computational resources. Result: Reaches state-of-the-art results on tasks such as code generation and mathematical reasoning, supports 119 languages and dialects, and outperforms its predecessor and comparable models. Conclusion: Through its design and efficient use of resources, Qwen3 significantly improves multilingual and reasoning capabilities, and its open release supports community development. Abstract: In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities.
To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.

[86] Multilingual Machine Translation with Quantum Encoder Decoder Attention-based Convolutional Variational Circuits

Subrit Dikshit,Ritu Tiwari,Priyank Jain

Main category: cs.CL

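TL;DR: QEDACVC is a quantum-computing-based encoder-decoder architecture for multilingual machine translation, proposed as an alternative to classical-computing models (such as GRU, LSTM, and BERT); it achieves 82% accuracy on the OPUS dataset.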

Details Motivation: To explore quantum computing for multilingual machine translation as an alternative to models based on classical computing. Method: Proposes a quantum encoder-decoder architecture that combines quantum convolution, quantum pooling, quantum variational circuits, and quantum attention, and that simulates and runs on quantum computing hardware. Result: On multilingual translation across English, French, German, and Hindi, QEDACVC reaches 82% accuracy on the OPUS dataset. Conclusion: QEDACVC shows the potential of quantum computing for multilingual machine translation and opens a direction for future research. Abstract: Cloud-based multilingual translation services like Google Translate and Microsoft Translator achieve state-of-the-art translation capabilities. These services inherently use large multilingual language models such as GRU, LSTM, BERT, GPT, T5, or similar encoder-decoder architectures with attention mechanisms as the backbone. Also, new age natural language systems, for instance ChatGPT and DeepSeek, have established huge potential in multiple tasks in natural language processing. At the same time, they also possess outstanding multilingual translation capabilities. However, these models use the classical computing realm as a backend. QEDACVC (Quantum Encoder Decoder Attention-based Convolutional Variational Circuits) is an alternate solution that explores the quantum computing realm instead of the classical computing realm to study and demonstrate multilingual machine translation. QEDACVC introduces the quantum encoder-decoder architecture that simulates and runs on quantum computing hardware via quantum convolution, quantum pooling, quantum variational circuit, and quantum attention as software alterations. QEDACVC achieves an Accuracy of 82% when trained on the OPUS dataset for English, French, German, and Hindi corpora for multilingual translations.

[87] PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning

Zongqian Li,Yixuan Su,Nigel Collier

Main category: cs.CL

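TL;DR: PT-MoE combines matrix decomposition with MoE routing for parameter-efficient fine-tuning, achieving cross-task consistency and generalization and outperforming PT and LoRA.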

Details Motivation: Existing PEFT methods show a mismatch between efficiency and performance; PT-MoE aims to resolve it through matrix decomposition and MoE routing. Method: Proposes the PT-MoE framework, which integrates matrix decomposition with MoE routing for efficient prompt tuning. Result: Across 17 datasets, PT-MoE performs best on QA and math tasks while using 25% fewer parameters than LoRA. Conclusion: PT-MoE demonstrates cross-task consistency and generalization and offers insights for future PEFT methods. Abstract: Parameter-efficient fine-tuning (PEFT) methods have shown promise in adapting large language models, yet existing approaches exhibit counter-intuitive phenomena: integrating router into prompt tuning (PT) increases training efficiency yet does not improve performance universally; parameter reduction through matrix decomposition can improve performance in specific domains. Motivated by these observations and the modular nature of PT, we propose PT-MoE, a novel framework that integrates matrix decomposition with mixture-of-experts (MoE) routing for efficient PT. Results across 17 datasets demonstrate that PT-MoE achieves state-of-the-art performance in both question answering (QA) and mathematical problem solving tasks, improving F1 score by 1.49 points over PT and 2.13 points over LoRA in QA tasks, while enhancing mathematical accuracy by 10.75 points over PT and 0.44 points over LoRA, all while using 25% fewer parameters than LoRA. Our analysis reveals that while PT methods generally excel in QA tasks and LoRA-based methods in math datasets, the integration of matrix decomposition and MoE in PT-MoE yields complementary benefits: decomposition enables efficient parameter sharing across experts while MoE provides dynamic adaptation, collectively enabling PT-MoE to demonstrate cross-task consistency and generalization abilities. These findings, along with ablation studies on routing mechanisms and architectural components, provide insights for future PEFT methods.

[88] WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models

Abdullah Mushtaq,Imran Taj,Rafay Naeem,Ibrahim Ghaznavi,Junaid Qadir

Main category: cs.CL

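TL;DR: The paper proposes WorldView-Bench, a benchmark that evaluates the Global Cultural Inclusivity (GCI) of LLMs through free-form generation, and finds that multi-agent systems (MAS) significantly increase cultural diversity.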

Details Motivation: Current LLM training and evaluation frameworks lean toward Western-centric views, leading to cultural homogenization and failing to reflect the plurality of global civilizations. Method: Building on the Multiplex Worldview, proposes two intervention strategies: contextually implemented and multi-agent-system-implemented Multiplex LLMs. Result: MAS-Implemented Multiplex LLMs raise Perspectives Distribution Score entropy from 13% to 94% and improve sentiment and cultural balance. Conclusion: Multiplex-aware AI evaluation can reduce cultural bias in LLMs and promote more inclusive, ethically aligned AI systems. Abstract: Large Language Models (LLMs) are predominantly trained and aligned in ways that reinforce Western-centric epistemologies and socio-cultural norms, leading to cultural homogenization and limiting their ability to reflect global civilizational plurality. Existing benchmarking frameworks fail to adequately capture this bias, as they rely on rigid, closed-form assessments that overlook the complexity of cultural inclusivity. To address this, we introduce WorldView-Bench, a benchmark designed to evaluate Global Cultural Inclusivity (GCI) in LLMs by analyzing their ability to accommodate diverse worldviews. Our approach is grounded in the Multiplex Worldview proposed by Senturk et al., which distinguishes between Uniplex models, reinforcing cultural homogenization, and Multiplex models, which integrate diverse perspectives. WorldView-Bench measures Cultural Polarization, the exclusion of alternative perspectives, through free-form generative evaluation rather than conventional categorical benchmarks. We implement applied multiplexity through two intervention strategies: (1) Contextually-Implemented Multiplex LLMs, where system prompts embed multiplexity principles, and (2) Multi-Agent System (MAS)-Implemented Multiplex LLMs, where multiple LLM agents representing distinct cultural perspectives collaboratively generate responses. Our results demonstrate a significant increase in Perspectives Distribution Score (PDS) entropy from 13% at baseline to 94% with MAS-Implemented Multiplex LLMs, alongside a shift toward positive sentiment (67.7%) and enhanced cultural balance.
These findings highlight the potential of multiplex-aware AI evaluation in mitigating cultural bias in LLMs, paving the way for more inclusive and ethically aligned AI systems.
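The abstract reports PDS entropy as a percentage but does not define it here. The following is a minimal sketch, assuming the score is Shannon entropy normalized by its maximum; the function name, the perspective labels, and the normalization are illustrative assumptions, not the paper's definition.

```python
import math
from collections import Counter


def normalized_entropy(labels, categories):
    """Shannon entropy of the label distribution, scaled to [0, 1].

    `labels` are the perspective tags found in a response (hypothetical),
    `categories` is the full set of perspectives that could appear.
    """
    counts = Counter(label for label in labels if label in categories)
    total = sum(counts.values())
    if total == 0 or len(categories) < 2:
        return 0.0
    entropy = 0.0
    for category in categories:
        p = counts.get(category, 0) / total
        if p > 0:
            entropy -= p * math.log(p)
    # Maximum entropy is log(k) for k equally likely categories.
    return entropy / math.log(len(categories))


perspectives = ["western", "east_asian", "islamic", "african"]  # hypothetical labels
one_sided = normalized_entropy(["western"] * 9 + ["islamic"], perspectives)
balanced = normalized_entropy(perspectives * 3, perspectives)
print(f"one-sided: {one_sided:.2f}, balanced: {balanced:.2f}")
```

Under this reading, a low score means the response is dominated by one perspective, and a score near 1 means the perspectives are represented about equally.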

eess.IV [Back]

[89] In-Context Learning for Label-Efficient Cancer Image Classification in Oncology

Mobina Shrestha,Bishwas Mandal,Vishal Mandal,Asis Shrestha

Main category: eess.IV

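TL;DR: The paper studies in-context learning (ICL) in oncology, which lets models adapt to new tasks from a few labeled examples without retraining, and evaluates four vision-language models on three oncology datasets.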

Details Motivation: The use of AI in oncology is limited by its dependence on large annotated datasets and model retraining; the study explores ICL as an alternative. Method: Four vision-language models (Paligemma, CLIP, ALIGN, GPT-4o) are evaluated with ICL on three oncology datasets (MHIST, PatchCamelyon, HAM10000). Result: All models improve significantly with few-shot prompting; GPT-4o reaches F1 scores of 0.81 and 0.60 on binary and multi-class classification, respectively, and open-source models also show competitive gains. Conclusion: ICL shows practical potential in oncology, especially for rare cancers and resource-limited settings, offering a viable option where fine-tuning is infeasible. Abstract: The application of AI in oncology has been limited by its reliance on large, annotated datasets and the need for retraining models for domain-specific diagnostic tasks. Taking heed of these limitations, we investigated in-context learning as a pragmatic alternative to model retraining by allowing models to adapt to new diagnostic tasks using only a few labeled examples at inference. Using four vision-language models (VLMs), namely Paligemma, CLIP, ALIGN, and GPT-4o, we evaluated performance across three oncology datasets: MHIST, PatchCamelyon, and HAM10000. To the best of our knowledge, this is the first study to compare the performance of multiple VLMs on different oncology classification tasks. Without any parameter updates, all models showed significant gains with few-shot prompting, with GPT-4o reaching an F1 score of 0.81 in binary classification and 0.60 in multi-class classification settings. While these results remain below the ceiling of fully fine-tuned systems, they highlight the potential of ICL to approximate task-specific behavior using only a handful of examples, reflecting how clinicians often reason from prior cases. Notably, open-source models like Paligemma and CLIP demonstrated competitive gains despite their smaller size, suggesting feasibility for deployment in compute-constrained clinical environments. Overall, these findings highlight the potential of ICL as a practical solution in oncology, particularly for rare cancers and resource-limited contexts where fine-tuning is infeasible and annotated data is difficult to obtain.

[90] Thoughts on Objectives of Sparse and Hierarchical Masked Image Model

Asahi Miyazaki,Tsuyoshi Okita

Main category: eess.IV

TL;DR: The paper proposes a new mask pattern, Mesh Mask-ed SparK, intended to improve the SparK model, and studies how the mask pattern affects pre-training.

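Details Motivation: Masked image modeling is a popular training objective, and the SparK model performs strongly among self-supervised learning models; the paper aims to improve it further through a better mask pattern. Method: The Mesh Mask-ed SparK model is proposed, which adopts a new mask pattern, and its effect is evaluated in pre-training. Result: The mask pattern is found to have a notable effect on pre-training performance. Conclusion: By optimizing the mask pattern, the Mesh Mask-ed SparK model improves on the SparK model. Abstract: Masked image modeling is one of the most popular training objectives. Recently, the SparK model has been proposed with superior performance among self-supervised learning models. This paper proposes a new mask pattern for the SparK model, introducing it as the Mesh Mask-ed SparK model. We report the effect on performance of the mask pattern used for image masking in pre-training.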

[91] Ultrasound Report Generation with Multimodal Large Language Models for Standardized Texts

Peixuan Ge,Tongkun Su,Faqin Lv,Baoliang Zhao,Peng Zhang,Chi Hong Wong,Liang Yao,Yu Sun,Zenan Wang,Pak Kin Wong,Ying Hu

Main category: eess.IV

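TL;DR: Proposes a unified framework for multi-organ, multilingual ultrasound report generation that uses fragment-based multilingual training and the standardized structure of reports to improve the accuracy and consistency of generated reports.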

Details Motivation: Ultrasound report generation is challenging because of image variability, operator dependence, and the lack of standardized datasets, so an automated solution is needed. Method: The framework integrates fragment-based multilingual training, aligns modular text fragments with diverse imaging data, uses a bilingual English-Chinese dataset, and improves text-image alignment by selectively unfreezing the vision transformer (ViT). Result: Compared with the previous KMVE method, BLEU improves by about 2%, ROUGE-L by about 3%, and CIDEr by about 15%, with significantly fewer missing or incorrect content errors. Conclusion: The framework shows potential for real-world clinical workflows and offers a scalable solution for multi-organ, multilingual ultrasound report generation. Abstract: Ultrasound (US) report generation is a challenging task due to the variability of US images, operator dependence, and the need for standardized text. Unlike X-ray and CT, US imaging lacks consistent datasets, making automation difficult. In this study, we propose a unified framework for multi-organ and multilingual US report generation, integrating fragment-based multilingual training and leveraging the standardized nature of US reports. By aligning modular text fragments with diverse imaging data and curating a bilingual English-Chinese dataset, the method achieves consistent and clinically accurate text generation across organ sites and languages. Fine-tuning with selective unfreezing of the vision transformer (ViT) further improves text-image alignment. Compared to the previous state-of-the-art KMVE method, our approach achieves relative gains of about 2% in BLEU scores, approximately 3% in ROUGE-L, and about 15% in CIDEr, while significantly reducing errors such as missing or incorrect content. By unifying multi-organ and multi-language report generation into a single, scalable framework, this work demonstrates strong potential for real-world clinical workflows.

[92] Total Variation-Based Image Decomposition and Denoising for Microscopy Images

Marco Corrias,Giada Franceschi,Michele Riva,Alberto Tampieri,Karin Föttinger,Ulrike Diebold,Thomas Pock,Cesare Franchini

Main category: eess.IV

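TL;DR: The paper presents a total variation (TV)-based method for decomposing and denoising microscopy images, evaluates TV-L1, Huber-ROF, and TGV-L1 across different cases, and releases the Python code as part of AiSurf.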

Details Motivation: Microscopy images are often affected by noise and unwanted signals, and demand for modern denoising and restoration methods is growing. Method: Images are restored either by extracting and subtracting unwanted signal components or by direct denoising; the performance of TV-L1, Huber-ROF, and TGV-L1 is evaluated. Result: Huber-ROF is the most flexible, TGV-L1 is best suited to denoising, and the method applies to several microscopy techniques. Conclusion: The method is broadly applicable to microscopy image processing, and the code is public so it can be integrated into experimental workflows. Abstract: Experimentally acquired microscopy images are unavoidably affected by the presence of noise and other unwanted signals, which degrade their quality and might hide relevant features. With the recent increase in image acquisition rate, modern denoising and restoration solutions become necessary. This study focuses on image decomposition and denoising of microscopy images through a workflow based on total variation (TV), addressing images obtained from various microscopy techniques, including atomic force microscopy (AFM), scanning tunneling microscopy (STM), and scanning electron microscopy (SEM). Our approach consists of restoring an image by extracting its unwanted signal components and subtracting them from the raw one, or by denoising it. We evaluate the performance of TV-$L^1$, Huber-ROF, and TGV-$L^1$ in achieving this goal in distinct study cases. Huber-ROF proved to be the most flexible one, while TGV-$L^1$ is the most suitable for denoising. Our results suggest a wider applicability of this method in microscopy, not restricted to STM, AFM, and SEM images. The Python code used for this study is publicly available as part of AiSurf. It is designed to be integrated into experimental workflows for image acquisition or can be used to denoise previously acquired images.

[93] Validation of Conformal Prediction in Cervical Atypia Classification

Misgina Tsighe Hagos,Antti Suutala,Dmitrii Bychkov,Hakan Kücükel,Joar von Bahr,Milda Poceviciute,Johan Lundin,Nina Linder,Claes Lundström

Main category: eess.IV

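TL;DR: The paper examines deep learning for cervical atypia classification, notes that such models are overconfident and express uncertainty poorly, and uses the conformal prediction framework to generate more reliable prediction sets. Validation against expert annotations shows that existing methods fall short in truthfulness and usefulness.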

Details Motivation: Address the overconfidence and poor uncertainty expression of deep learning models in cervical cancer classification, and improve the reliability and clinical usefulness of predictions. Method: Prediction sets are generated with the conformal prediction framework, and three conformal prediction approaches applied to three deep-learning models are comprehensively validated against expert annotations. Result: Conventional coverage-based evaluation overestimates performance, and the prediction sets produced by current conformal prediction methods align poorly with human labels. Conclusion: Conformal prediction has potential for expressing uncertainty and handling ambiguous data, but needs further refinement to align better with human annotations. Abstract: Deep learning based cervical cancer classification can potentially increase access to screening in low-resource regions. However, deep learning models are often overconfident and do not reliably reflect diagnostic uncertainty. Moreover, they are typically optimized to generate maximum-likelihood predictions, which fail to convey uncertainty or ambiguity in their results. Such challenges can be addressed using conformal prediction, a model-agnostic framework for generating prediction sets that contain likely classes for trained deep-learning models. The size of these prediction sets indicates model uncertainty, contracting as model confidence increases. However, existing conformal prediction evaluation primarily focuses on whether the prediction set includes or covers the true class, often overlooking the presence of extraneous classes. We argue that prediction sets should be truthful and valuable to end users, ensuring that the listed likely classes align with human expectations rather than being overly relaxed and including false positives or unlikely classes. In this study, we comprehensively validate conformal prediction sets using expert annotation sets collected from multiple annotators. We evaluate three conformal prediction approaches applied to three deep-learning models trained for cervical atypia classification. Our expert annotation-based analysis reveals that conventional coverage-based evaluations overestimate performance and that current conformal prediction methods often produce prediction sets that are not well aligned with human labels. Additionally, we explore the capabilities of the conformal prediction methods in identifying ambiguous and out-of-distribution data.
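To make the idea of a prediction set concrete, here is a minimal sketch of split conformal prediction with the common "1 minus true-class probability" nonconformity score. The abstract does not say which three conformal approaches were evaluated, so this is a generic illustration; the function names and the toy probabilities are assumptions.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold targeting (1 - alpha) coverage.

    cal_probs: list of per-class probability lists from a trained model.
    cal_labels: true class index for each calibration example.
    """
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points: include every class
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Return every class whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Toy calibration data (hypothetical three-class model outputs).
cal_probs = [[0.75, 0.25, 0.0], [0.25, 0.5, 0.25], [0.125, 0.0, 0.875]]
cal_labels = [0, 1, 2]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
print(prediction_set([0.5, 0.5, 0.0], threshold))
```

A confident model yields small sets and an uncertain one yields larger sets. Coverage-based evaluation only checks whether the true class is in the set, which is why the abstract argues for also checking the extra classes against expert labels.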

[94] BiECVC: Gated Diversification of Bidirectional Contexts for Learned Video Compression

Wei Jiang,Junru Li,Kai Zhang,Li Zhang

Main category: eess.IV

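TL;DR: BiECVC is a learned bidirectional video compression framework that uses diversified context modeling and adaptive gating to substantially improve compression performance, surpassing VTM 13.2 under the RA configuration for the first time.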

Details Motivation: Existing learned bidirectional video compression (BVC) methods lag behind forward-only methods, mainly because of limited context extraction and a lack of dynamic adaptability. Method: BiECVC combines local and non-local context modeling, using a linear attention mechanism and Bidirectional Context Gating to filter information dynamically. Result: Experiments show that BiECVC reduces bit-rate by 13.4% and 15.7% relative to VTM 13.2 under the RA configuration with intra periods of 32 and 64, respectively, surpassing VTM 13.2 across all standard test datasets for the first time. Conclusion: Through its context modeling and gating mechanism, BiECVC marks a breakthrough for bidirectional video compression. Abstract: Recent forward prediction-based learned video compression (LVC) methods have achieved impressive results, even surpassing VVC reference software VTM under the Low Delay B (LDB) configuration. In contrast, learned bidirectional video compression (BVC) remains underexplored and still lags behind its forward-only counterparts. This performance gap is mainly due to the limited ability to extract diverse and accurate contexts: most existing BVCs primarily exploit temporal motion while neglecting non-local correlations across frames. Moreover, they lack the adaptability to dynamically suppress harmful contexts arising from fast motion or occlusion. To tackle these challenges, we propose BiECVC, a BVC framework that incorporates diversified local and non-local context modeling along with adaptive context gating. For local context enhancement, BiECVC reuses high-quality features from lower layers and aligns them using decoded motion vectors without introducing extra motion overhead. To model non-local dependencies efficiently, we adopt a linear attention mechanism that balances performance and complexity. To further mitigate the impact of inaccurate context prediction, we introduce Bidirectional Context Gating, inspired by data-dependent decay in recent autoregressive language models, to dynamically filter contextual information based on conditional coding results. Extensive experiments demonstrate that BiECVC achieves state-of-the-art performance, reducing the bit-rate by 13.4% and 15.7% compared to VTM 13.2 under the Random Access (RA) configuration with intra periods of 32 and 64, respectively. To our knowledge, BiECVC is the first learned video codec to surpass VTM 13.2 RA across all standard test datasets.
Code will be available at https://github.com/JiangWeibeta/ECVC.

[95] Q-space Guided Collaborative Attention Translation Network for Flexible Diffusion-Weighted Images Synthesis

Pengli Zhu,Yingji Fu,Nanguang Chen,Anqi Qiu

Main category: eess.IV

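TL;DR: Proposes Q-CATN, a method that synthesizes multi-shell, high-angular-resolution DWI data from flexible q-space sampling, using structural MRI data and a collaborative attention mechanism that dynamically adjusts internal representations.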

Details Motivation: Overcome the limitations of existing methods for DWI synthesis under flexible q-space sampling while preserving anatomical fidelity. Method: A collaborative attention mechanism extracts complementary multi-modal information, and task-specific constraints are introduced to learn the intrinsic relationship between DWI signal distributions and q-space. Result: On the HCP dataset, Q-CATN outperforms existing methods in estimating parameter maps and fiber tracts while preserving fine details. Conclusion: Q-CATN is a flexible tool suited to clinical and research use, and its code is open source. Abstract: In this study, we propose a novel Q-space Guided Collaborative Attention Translation Network (Q-CATN) for multi-shell, high-angular resolution DWI (MS-HARDI) synthesis from flexible q-space sampling, leveraging the commonly acquired structural MRI data. Q-CATN employs a collaborative attention mechanism to effectively extract complementary information from multiple modalities and dynamically adjust its internal representations based on flexible q-space information, eliminating the need for fixed sampling schemes. Additionally, we introduce a range of task-specific constraints to preserve anatomical fidelity in DWI, enabling Q-CATN to accurately learn the intrinsic relationships between directional DWI signal distributions and q-space. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that Q-CATN outperforms existing methods, including 1D-qDL, 2D-qDL, MESC-SD, and QGAN, in estimating parameter maps and fiber tracts both quantitatively and qualitatively, while preserving fine-grained details. Notably, its ability to accommodate flexible q-space sampling highlights its potential as a promising toolkit for clinical and research applications. Our code is available at https://github.com/Idea89560041/Q-CATN.

[96] DCSNet: A Lightweight Knowledge Distillation-Based Model with Explainable AI for Lung Cancer Diagnosis from Histopathological Images

Sadman Sakib Alif,Nasim Anzum Promise,Fiaz Al Abid,Aniqua Nusrat Zereen

Main category: eess.IV

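TL;DR: The paper proposes DCSNet, a lightweight model based on knowledge distillation and explainable AI for lung cancer detection, aimed at reducing computational requirements and improving model transparency.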

Details Motivation: Lung cancer is a leading cause of cancer-related death worldwide, and early detection and accurate diagnosis are critical to survival. Deep learning models such as CNNs are effective but computationally expensive and opaque, which limits their use in resource-constrained settings. Method: Knowledge distillation transfers knowledge from a complex teacher model (such as ResNet50) to a lightweight student model (DCSNet), combined with explainable AI techniques for transparency. Eight CNNs are evaluated as teacher models. Result: The proposed DCSNet achieves high diagnostic performance in resource-constrained settings, and explainable AI techniques improve its transparency. Conclusion: The approach reduces computational cost and increases trust in the model, supporting the adoption of AI-driven diagnostic tools in healthcare. Abstract: Lung cancer is a leading cause of cancer-related deaths globally, where early detection and accurate diagnosis are critical for improving survival rates. While deep learning, particularly convolutional neural networks (CNNs), has revolutionized medical image analysis by detecting subtle patterns indicative of early-stage lung cancer, its adoption faces challenges. These models are often computationally expensive and require significant resources, making them unsuitable for resource-constrained environments. Additionally, their lack of transparency hinders trust and broader adoption in sensitive fields like healthcare. Knowledge distillation addresses these challenges by transferring knowledge from large, complex models (teachers) to smaller, lightweight models (students). We propose a knowledge distillation-based approach for lung cancer detection, incorporating explainable AI (XAI) techniques to enhance model transparency. Eight CNNs, including ResNet50, EfficientNetB0, EfficientNetB3, and VGG16, are evaluated as teacher models. We developed and trained a lightweight student model, Distilled Custom Student Network (DCSNet), using ResNet50 as the teacher. This approach not only ensures high diagnostic performance in resource-constrained settings but also addresses transparency concerns, facilitating the adoption of AI-driven diagnostic tools in healthcare.
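The abstract does not state which distillation objective DCSNet uses. As an illustration of the teacher-to-student transfer it describes, the sketch below implements the widely used temperature-softened loss (KL divergence to the teacher plus cross-entropy on the hard label); the `temperature` and `alpha` values and the toy logits are assumptions, not values from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term: how far the student's softened distribution is from the teacher's.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard-target term: ordinary cross-entropy against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard


teacher = [2.0, 1.0, 0.0]   # hypothetical teacher logits for one image
untrained = [0.0, 0.0, 0.0]  # hypothetical student logits before training
print(distillation_loss(untrained, teacher, label=0))
print(distillation_loss(teacher, teacher, label=0))
```

The `T^2` factor keeps the soft-target gradients on a scale comparable to the hard-label term as the temperature changes.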

[97] Spec2VolCAMU-Net: A Spectrogram-to-Volume Model for EEG-to-fMRI Reconstruction based on Multi-directional Time-Frequency Convolutional Attention Encoder and Vision-Mamba U-Net

Dongyi He,Shiyang Li,Bin Jiang,He Yan

Main category: eess.IV

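TL;DR: Spec2VolCAMU-Net is a lightweight spectrogram-to-volume generator that combines a Multi-directional Time-Frequency Convolutional Attention Encoder with a Vision-Mamba U-Net decoder to substantially improve EEG-to-fMRI generation quality.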

Details Motivation: High-resolution fMRI is costly and logistically difficult, whereas EEG is far easier to obtain. Existing EEG-to-fMRI generators have performance or efficiency problems and need improvement. Method: Spec2VolCAMU-Net combines a Multi-directional Time-Frequency Convolutional Attention Encoder with a Vision-Mamba U-Net decoder and is trained end-to-end with an SSI-MSE loss. Result: On three public benchmarks the model achieves the best SSIM, improving on previous best scores by 14.5%, 14.9%, and 16.9%, and competitive PSNR, with a 4.6% improvement on CN-EPFL. Conclusion: The model is lightweight and efficient, making it suitable for real-time use in clinical and research settings. Abstract: High-resolution functional magnetic resonance imaging (fMRI) is essential for mapping human brain activity; however, it remains costly and logistically challenging. If comparable volumes could be generated directly from widely available scalp electroencephalography (EEG), advanced neuroimaging would become significantly more accessible. Existing EEG-to-fMRI generators rely on plain CNNs that fail to capture cross-channel time-frequency cues or on heavy transformer/GAN decoders that strain memory and stability. We propose Spec2VolCAMU-Net, a lightweight spectrogram-to-volume generator that confronts these issues via a Multi-directional Time-Frequency Convolutional Attention Encoder, stacking temporal, spectral and joint convolutions with self-attention, and a Vision-Mamba U-Net decoder whose linear-time state-space blocks enable efficient long-range spatial modelling. Trained end-to-end with a hybrid SSI-MSE loss, Spec2VolCAMU-Net achieves state-of-the-art fidelity on three public benchmarks, recording SSIMs of 0.693 on NODDI, 0.725 on Oddball and 0.788 on CN-EPFL, representing improvements of 14.5%, 14.9%, and 16.9% respectively over previous best SSIM scores. Furthermore, it achieves competitive PSNR scores, particularly excelling on the CN-EPFL dataset with a 4.6% improvement over the previous best PSNR, thus striking a better balance in reconstruction quality. The proposed model is lightweight and efficient, making it suitable for real-time applications in clinical and research settings. The code is available at https://github.com/hdy6438/Spec2VolCAMU-Net.

[98] Meta-learning Slice-to-Volume Reconstruction in Fetal Brain MRI using Implicit Neural Representations

Maik Dannecker,Thomas Sanchez,Meritxell Bach Cuadra,Özgün Turgut,Anthony N. Price,Lucilio Cordero-Grande,Vanessa Kyriakopoulou,Joseph V. Hajnal,Daniel Rueckert

Main category: eess.IV

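TL;DR: Proposes a new SVR method based on implicit neural representations for fast, accurate MRI reconstruction, performing especially well under severe motion and image corruption.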

Details Motivation: Existing methods perform poorly under severe motion or image corruption, or require slice pre-alignment, which limits high-resolution reconstruction. Method: Implicit neural representations are used for motion correction, outlier handling, and super-resolution reconstruction, with task-specific priors initialized through self-supervised meta-learning. Result: In experiments on simulated and clinical MRI brain data, reconstruction quality improves, especially under severe motion, and reconstruction time falls by up to 50%. Conclusion: The method outperforms existing techniques under severe motion and image corruption, with higher reconstruction quality and efficiency. Abstract: High-resolution slice-to-volume reconstruction (SVR) from multiple motion-corrupted low-resolution 2D slices constitutes a critical step in image-based diagnostics of moving subjects, such as fetal brain Magnetic Resonance Imaging (MRI). Existing solutions struggle with image artifacts and severe subject motion or require slice pre-alignment to achieve satisfying reconstruction performance. We propose a novel SVR method to enable fast and accurate MRI reconstruction even in cases of severe image and motion corruption. Our approach performs motion correction, outlier handling, and super-resolution reconstruction with all operations being entirely based on implicit neural representations. The model can be initialized with task-specific priors through fully self-supervised meta-learning on either simulated or real-world data. In extensive experiments including over 480 reconstructions of simulated and clinical MRI brain data from different centers, we prove the utility of our method in cases of severe subject motion and image artifacts. Our results demonstrate improvements in reconstruction quality, especially in the presence of severe motion, compared to state-of-the-art methods, and up to 50% reduction in reconstruction time.

cs.AI [Back]

[99] Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora

Michael Majurski,Cynthia Matuszek

Main category: cs.AI

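TL;DR: Proposes a method for automatically constructing fact-based synthetic data evaluations of models, using language models to assess domain knowledge automatically and reducing the need for human annotation.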

Details Motivation: Building evaluation benchmarks by hand is time-consuming and cannot cover every domain, so an automated solution is needed. Method: Language models generate evaluation questions automatically from grounding documents (such as a textbook), supporting both multiple-choice and open-ended questions. Result: The synthetic data evaluation correlates well with human-curated questions (Spearman 0.96, Pearson 0.79), and Gemma3 models perform surprisingly strongly. Conclusion: The method is efficient and reliable and provides a new tool for language model evaluation. Abstract: Language Models (LMs) continue to advance, improving response quality and coherence. Given Internet-scale training datasets, LMs have likely encountered much of what users might ask them to generate in some form during their training. A plethora of evaluation benchmarks have been constructed to assess model quality, response appropriateness, and reasoning capabilities. However, the human effort required for benchmark construction is limited and being rapidly outpaced by the size and scope of the models under evaluation. Additionally, having humans build a benchmark for every possible domain of interest is impractical. Therefore, we propose a methodology for automating the construction of fact-based synthetic data model evaluations grounded in document populations. This work leverages those very same LMs to evaluate domain-specific knowledge automatically, using only grounding documents (e.g., a textbook) as input. This synthetic data benchmarking approach corresponds well with human curated questions with a Spearman ranking correlation of 0.96 and a benchmark evaluation Pearson accuracy correlation of 0.79. This novel tool supports generating both multiple choice and open-ended synthetic data questions to gain diagnostic insight of LM capability. We apply this methodology to evaluate model performance on a recent relevant arXiv preprint, discovering a surprisingly strong performance from Gemma3 models.
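The abstract reports both a Spearman ranking correlation and a Pearson accuracy correlation. The sketch below shows how the two differ, using only the standard library; the accuracy lists are made-up illustrative numbers, not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model accuracies on a human-written and a synthetic benchmark.
human_acc = [0.42, 0.55, 0.61, 0.78]
synthetic_acc = [0.30, 0.58, 0.60, 0.95]
print(f"Spearman: {spearman(human_acc, synthetic_acc):.2f}")
print(f"Pearson:  {pearson(human_acc, synthetic_acc):.2f}")
```

A high Spearman value with a lower Pearson value, as in the abstract, would mean the synthetic benchmark orders models almost the same way as the human one even though the accuracy values themselves are not linearly related.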

[100] Automated Meta Prompt Engineering for Alignment with the Theory of Mind

Aaron Baughman,Rahul Agarwal,Eduardo Morales,Gozde Akay

Main category: cs.AI

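TL;DR: Proposes a meta-prompting method that generates fluent text for complex tasks while jointly optimizing the similarity between a human's mental expectation and a large language model's (LLM) neural processing. Using agentic reinforcement learning, an LLM as a Judge (LLMaaJ) guides another LLM to generate content through in-context learning. Experiments show that human content reviewers' expectations were fully aligned with the AI-generated content 53.8% of the time.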

Details Motivation: Address the Theory of Mind (ToM) alignment problem by anticipating and incorporating human edits to improve the quality of LLM-generated text. Method: With agentic reinforcement learning, an LLMaaJ guides another LLM to generate content through in-context learning, and human mental expectations are measured. Result: Human reviewers' expectations were fully aligned with the AI-generated content 53.8% of the time, with an average iteration count of 4.38. Content quality improved by covering more of the tennis action. Conclusion: The method was deployed at the US Open 2024 and has since been used at other live sports and entertainment events. Abstract: We introduce a method of meta-prompting that jointly produces fluent text for complex tasks while optimizing the similarity of neural states between a human's mental expectation and a Large Language Model's (LLM) neural processing. A technique of agentic reinforcement learning is applied, in which an LLM as a Judge (LLMaaJ) teaches another LLM, through in-context learning, how to produce content by interpreting the intended and unintended generated text traits. To measure human mental beliefs around content production, users modify long form AI-generated text articles before publication at the US Open 2024 tennis Grand Slam. Now, an LLMaaJ can solve the Theory of Mind (ToM) alignment problem by anticipating and including human edits within the creation of text from an LLM. Throughout experimentation and by interpreting the results of a live production system, the expectations of human content reviewers were 100% aligned with the AI-generated content 53.8% of the time, with an average iteration count of 4.38. The geometric interpretation of content traits such as factualness, novelty, repetitiveness, and relevancy over a Hilbert vector space, which combines spatial volume (all trait importance) with vertex alignment (individual trait relevance), enabled the LLMaaJ to optimize on human ToM. This resulted in an increase in content quality by extending the coverage of tennis action. Our work, which was deployed at the US Open 2024, has been used across other live events within sports and entertainment.

[101] Improving the Reliability of LLMs: Combining CoT, RAG, Self-Consistency, and Self-Verification

Adarsh Kumar,Hwiyoon Kim,Jawahar Sai Nathani,Neil Roy

Main category: cs.AI

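TL;DR: The paper investigates how combining chain-of-thought (CoT) prompting with retrieval-augmented generation (RAG), together with self-consistency and self-verification strategies, can reduce hallucinations in large language models (LLMs) and improve factual accuracy.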

Details Motivation: When generating content for complex, open-ended tasks, LLMs tend to produce confident but incorrect or irrelevant information (hallucination), a key limitation on their use. CoT improves multistep reasoning but does not fully solve hallucination on its own. Method: CoT is combined with RAG, and self-consistency and self-verification strategies are applied, so that external knowledge sources and the model's own verification refine the output. Result: Baseline LLMs are compared with CoT, CoT+RAG, self-consistency, and self-verification; the results highlight how effective each method is at reducing hallucinations while preserving fluency and reasoning depth. Conclusion: The study identifies the most robust approach for minimizing hallucinations while maintaining output quality. Abstract: Hallucination, where large language models (LLMs) generate confident but incorrect or irrelevant information, remains a key limitation in their application to complex, open-ended tasks. Chain-of-thought (CoT) prompting has emerged as a promising method for improving multistep reasoning by guiding models through intermediate steps. However, CoT alone does not fully address the hallucination problem. In this work, we investigate how combining CoT with retrieval-augmented generation (RAG), as well as applying self-consistency and self-verification strategies, can reduce hallucinations and improve factual accuracy. By incorporating external knowledge sources during reasoning and enabling models to verify or revise their own outputs, we aim to generate more accurate and coherent responses. We present a comparative evaluation of baseline LLMs against CoT, CoT+RAG, self-consistency, and self-verification techniques. Our results highlight the effectiveness of each method and identify the most robust approach for minimizing hallucinations while preserving fluency and reasoning depth.
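Self-consistency is usually implemented as a majority vote over several independently sampled reasoning chains. The sketch below shows only the voting step, assuming the final answers have already been extracted from each sampled chain; the answer strings are hypothetical, and the sampling and extraction steps of the paper's pipeline are not shown.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent answer and its share of the votes.

    Answers are compared after trimming whitespace and lowercasing,
    which is a simplifying assumption about answer normalization.
    """
    normalized = [a.strip().lower() for a in sampled_answers if a.strip()]
    if not normalized:
        return None, 0.0
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Hypothetical final answers extracted from five sampled CoT chains.
samples = ["Canberra", "canberra ", "Sydney", "Canberra", "Melbourne"]
answer, agreement = self_consistent_answer(samples)
print(answer, agreement)
```

The agreement ratio can also serve as a rough confidence signal: a low ratio suggests the model's reasoning chains disagree and the answer deserves extra verification.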

[102] Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?

Anthony GX-Chen,Dongyan Lin,Mandana Samiei,Doina Precup,Blake A. Richards,Rob Fergus,Kenneth Marino

Main category: cs.AI

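TL;DR: When exploring and inferring causal relationships, language model (LM) agents favor common disjunctive relationships and systematically struggle with conjunctive ones; this bias appears in both models and human adults.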

Details Motivation: Investigate whether LMs can explore and understand the causal structure of the world, and whether systematic biases lead them to erroneous conclusions. Method: The "Blicket Test" paradigm from developmental psychology is used to evaluate how well LMs infer different causal relationships, and the effects of model family, size, and prompting strategy are analyzed. Result: LMs reliably infer disjunctive relationships but show a "disjunctive bias" against conjunctive ones, and performance declines as task complexity increases. LM inference patterns resemble those of human adults. Conclusion: A test-time sampling method is proposed that significantly reduces the disjunctive bias and moves LMs toward more scientific causal reasoning. Abstract: Language model (LM) agents are increasingly used as autonomous decision-makers who need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world -- key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs' ability to explore and infer causal relationships, using the well-established "Blicket Test" paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This "disjunctive bias" persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not children-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.
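The difference between disjunctive and conjunctive rules, and the idea of eliminating hypotheses that contradict observations, can be sketched as follows. This is a toy illustration that enumerates every hypothesis exhaustively; the paper's method samples hypotheses from the LM, which is not reproduced here. The rule definitions (disjunctive: any blicket activates the machine; conjunctive: all blickets in a set of at least two must be present) and the object names are assumptions.

```python
from itertools import combinations


def enumerate_hypotheses(objects):
    """All (rule, blicket set) pairs over the given objects."""
    hypotheses = []
    for size in range(1, len(objects) + 1):
        for blickets in combinations(objects, size):
            hypotheses.append(("disjunctive", frozenset(blickets)))
            if size >= 2:  # a conjunctive rule needs at least two blickets
                hypotheses.append(("conjunctive", frozenset(blickets)))
    return hypotheses


def predicts_activation(hypothesis, placed):
    """Would the machine light up with `placed` objects under this hypothesis?"""
    rule, blickets = hypothesis
    if rule == "disjunctive":
        return bool(blickets & placed)  # any single blicket suffices
    return blickets <= placed           # every blicket must be present


def eliminate(hypotheses, observations):
    """Keep only hypotheses consistent with every (placed, lit) observation."""
    return [
        h for h in hypotheses
        if all(predicts_activation(h, frozenset(placed)) == lit
               for placed, lit in observations)
    ]


observations = [({"A"}, False), ({"B"}, False), ({"A", "B"}, True), ({"C"}, False)]
remaining = eliminate(enumerate_hypotheses(["A", "B", "C"]), observations)
print(remaining)
```

With these four observations only the conjunctive hypothesis over A and B survives, which is the kind of conclusion a disjunctively biased reasoner tends to miss.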

physics.chem-ph [Back]

[103] EDBench: Large-Scale Electron Density Data for Molecular Modeling

Hongxin Xiang,Ke Li,Mingquan Liu,Zhixiang Cheng,Bin Yao,Wenjie Du,Jun Xia,Li Zeng,Xin Jin,Xiangxiang Zeng

Main category: physics.chem-ph

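TL;DR: The paper introduces EDBench, a large-scale, high-quality electron density dataset intended to advance learning at the electronic scale, and shows that learning-based methods can compute electron density efficiently and accurately.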

Details Motivation: Existing molecular machine learning force fields (MLFFs) generally ignore electron density (ED), which is key to understanding molecular force fields. Because computing ED relies on time-consuming first-principles density functional theory (DFT), the lack of large-scale ED data limits its use in MLFFs. Method: EDBench is built on PCQM4Mv2 and contains accurate ED data for 3.3 million molecules; a suite of ED-centric benchmark tasks (prediction, retrieval, and generation) is designed to evaluate models. Result: Experiments show that learning from EDBench is feasible and that learning-based methods can compute ED with high accuracy at much lower computational cost. Conclusion: EDBench lays a solid foundation for ED-driven drug discovery and materials science, and its data and benchmarks will be freely available. Abstract: Existing molecular machine learning force fields (MLFFs) generally focus on the learning of atoms, molecules, and simple quantum chemical properties (such as energy and force), but ignore the importance of electron density (ED) $\rho(r)$ in accurately understanding molecular force fields (MFFs). ED describes the probability of finding electrons at specific locations around atoms or molecules, which uniquely determines all ground state properties (such as energy, molecular structure, etc.) of interactive multi-particle systems according to the Hohenberg-Kohn theorem. However, the calculation of ED relies on the time-consuming first-principles density functional theory (DFT), which leads to the lack of large-scale ED data and limits its application in MLFFs. In this paper, we introduce EDBench, a large-scale, high-quality dataset of ED designed to advance learning-based research at the electronic scale. Built upon PCQM4Mv2, EDBench provides accurate ED data, covering 3.3 million molecules. To comprehensively evaluate the ability of models to understand and utilize electronic information, we design a suite of ED-centric benchmark tasks spanning prediction, retrieval, and generation. Our evaluation on several state-of-the-art methods demonstrates that learning from EDBench is not only feasible but also achieves high accuracy. Moreover, we show that learning-based methods can efficiently calculate ED with comparable precision while significantly reducing the computational cost relative to traditional DFT calculations.
All data and benchmarks from EDBench will be freely available, laying a robust foundation for ED-driven drug discovery and materials science.

cs.HC [Back]

[104] Performance Gains of LLMs With Humans in a World of LLMs Versus Humans

Lucas McCullum,Pelagie Ami Agassi,Leo Anthony Celi,Daniel K. Ebner,Chrystinne Oliveira Fernandes,Rachel S. Hicklen,Mkliwa Koumbia,Lisa Soleymani Lehmann,David Restrepo

Main category: cs.HC

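TL;DR: The paper argues that research should not simply compare LLMs with human experts but should study how to integrate LLMs safely into clinical settings and foster human-LLM collaboration.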

Details Motivation: Current research tends to pit LLMs against human experts, yet there are no norms for the safe use of LLMs in clinical settings, which may threaten patient safety. Method: The authors argue for developing strategies that let LLMs and humans work together efficiently rather than comparing them as adversaries. Result: The paper emphasizes the importance of human-LLM collaboration over competition. Conclusion: Future research should focus on strategies for the safe use of LLMs in clinical settings, toward a human-LLM symbiosis. Abstract: Currently, considerable research effort is devoted to comparing LLMs to a group of human experts, where the term "expert" is often ill-defined or variable at best, amid constantly updating LLM releases. Without proper safeguards in place, LLMs threaten to harm the established structure of safe delivery of patient care, which has been carefully developed throughout history to keep the safety of the patient at the forefront. A key driver of LLM innovation is community research efforts which, if they continue to operate under "humans versus LLMs" principles, will expedite this trend. Therefore, research efforts moving forward must focus on effectively characterizing the safe use of LLMs in clinical settings in ways that persist across the rapid development of novel LLMs. In this communication, we demonstrate that rather than comparing LLMs to humans, there is a need to develop strategies enabling efficient work of humans with LLMs in an almost symbiotic manner.

cs.SE [Back]

[105] Customizing a Large Language Model for VHDL Design of High-Performance Microprocessors

Nicolas Dupuis,Ravi Nair,Shyam Ramji,Sean McClintock,Nishant Chauhan,Priyanka Nagpal,Bart Blaner,Ken Valk,Leon Stok,Ruchir Puri

Main category: cs.SE

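TL;DR: The paper explores the potential of LLMs in hardware design, focusing on developing a model for explaining VHDL code, and shows how extended pretraining and expert evaluation improve model performance.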

Details Motivation: LLMs for Verilog design have received attention, but VHDL and the particular needs of high-performance processor design have been neglected. The paper aims to fill this gap. Method: An LLM for explaining VHDL code is developed; extended pretraining (EPT) and expert evaluation are used to improve the model, and an LLM-as-a-judge is introduced. Result: The expert evaluation rating rises from 43% for the base model to 69% for the EPT model; an instruction-tuned version is expected to reach 71%, and newer base models may push this to 85% and beyond. Conclusion: New methods and techniques, especially recent developments in generative AI, are expected to further improve the quality of hardware design LLMs. Abstract: The use of Large Language Models (LLMs) in hardware design has taken off in recent years, principally through its incorporation in tools that increase chip designer productivity. There has been considerable discussion about the use of LLMs in RTL specifications of chip designs, for which the two most popular languages are Verilog and VHDL. LLMs and their use in Verilog design have received significant attention due to the higher popularity of the language, but little attention so far has been given to VHDL despite its continued popularity in the industry. There has also been little discussion about the unique needs of organizations that engage in high-performance processor design, and techniques to deploy AI solutions in these settings. In this paper, we describe our journey in developing a Large Language Model (LLM) specifically for the purpose of explaining VHDL code, a task that has particular importance in an organization with decades of experience and assets in high-performance processor design. We show how we developed test sets specific to our needs and used them for evaluating models as we performed extended pretraining (EPT) of a base LLM. Expert evaluation of the code explanations produced by the EPT model increased to 69% compared to a base model rating of 43%. We further show how we developed an LLM-as-a-judge to gauge models similar to expert evaluators. This led us to derive and evaluate a host of new models, including an instruction-tuned version of the EPT model with an expected expert evaluator rating of 71%. Our experiments also indicate that with the potential use of newer base models, this rating can be pushed to 85% and beyond.
We conclude with a discussion on further improving the quality of hardware design LLMs using exciting new developments in the Generative AI world.

cs.MM [Back]

[106] Toward Accessible and Safe Live Streaming Using Distributed Content Filtering with MoQ

Andrew C. Freeman

Main category: cs.MM

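TL;DR: Proposes a real-time content moderation method for live video streams, built on the Media Over QUIC Transport protocol, that removes only the offending segments and lets analysis tasks be distributed transparently.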

Details Motivation: As live video grows in popularity, the need for real-time content moderation increases, but the latency constraints of live streaming rule out traditional offline analysis. Method: Extends the Media Over QUIC Transport protocol to support real-time content moderation, removing only the offending video segments and allowing analysis tasks to be transparently distributed to client devices. Result: Evaluated on light strobe removal for photosensitive viewers; client latency increases by only one group-of-pictures (GOP) duration. Conclusion: The approach achieves real-time moderation of live video while keeping the latency impact small. Abstract: Live video streaming is increasingly popular on social media platforms. With the growth of live streaming comes an increased need for robust content moderation to remove dangerous, illegal, or otherwise objectionable content. Whereas video on demand distribution enables offline content analysis, live streaming imposes restrictions on latency for both analysis and distribution. In this paper, we present extensions to the in-progress Media Over QUIC Transport protocol that enable real-time content moderation in one-to-many video live streams. Importantly, our solution removes only the video segments that contain objectionable content, allowing playback resumption as soon as the stream conforms to content policies again. Content analysis tasks may be transparently distributed to arbitrary client devices. We implement and evaluate our system in the context of light strobe removal for photosensitive viewers, finding that streaming clients experience an increased latency of only one group-of-pictures duration.

cs.RO [Back]

[107] Parameter-Efficient Fine-Tuning of Vision Foundation Model for Forest Floor Segmentation from UAV Imagery

Mohammad Wasil,Ahmad Drak,Brennan Penfold,Ludovico Scarton,Maximilian Johenneken,Alexander Asteroth,Sebastian Houben

Main category: cs.RO

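TL;DR: Adapts the Segment Anything Model (SAM) with parameter-efficient fine-tuning (PEFT) to segment forest floor objects such as tree stumps, vegetation, and woody debris, in support of UAV-based forest monitoring.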

Details Motivation: A detailed understanding of forest floor objects is difficult because of high natural variability, quickly changing environmental parameters, and ambiguous annotation definitions. Method: Applies parameter-efficient fine-tuning (PEFT) and adjusts SAM's mask decoder so that forest floor objects are segmented automatically, without manual prompting. Result: The adapter-based PEFT method achieves the highest mean intersection over union (mIoU), while Low-rank Adaptation (LoRA) offers a lightweight alternative for resource-constrained UAV platforms. Conclusion: The approach addresses the challenges of forest floor segmentation and gives UAV-based forest monitoring an efficient solution. Abstract: Unmanned Aerial Vehicles (UAVs) are increasingly used for reforestation and forest monitoring, including seed dispersal in hard-to-reach terrains. However, a detailed understanding of the forest floor remains a challenge due to high natural variability, quickly changing environmental parameters, and ambiguous annotations due to unclear definitions. To address this issue, we adapt the Segment Anything Model (SAM), a vision foundation model with strong generalization capabilities, to segment forest floor objects such as tree stumps, vegetation, and woody debris. To this end, we employ parameter-efficient fine-tuning (PEFT) to fine-tune a small subset of additional model parameters while keeping the original weights fixed. We adjust SAM's mask decoder to generate masks corresponding to our dataset categories, allowing for automatic segmentation without manual prompting. Our results show that the adapter-based PEFT method achieves the highest mean intersection over union (mIoU), while Low-rank Adaptation (LoRA), with fewer parameters, offers a lightweight alternative for resource-constrained UAV platforms.
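The mIoU metric named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are all assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for two flat sequences of class labels.

    Assumption: classes that appear in neither sequence are skipped rather
    than counted as IoU = 0 or 1; evaluation toolkits differ on this point.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical labels: 0 = stump, 1 = vegetation, 2 = woody debris.
pred = [0, 1, 1, 2, 2, 2]
target = [0, 1, 2, 2, 2, 1]
print(round(mean_iou(pred, target, 3), 4))
```

Per-class IoU here is 1, 1/3, and 1/2, so the mean is 11/18.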

[108] Multi-step manipulation task and motion planning guided by video demonstration

Kateryna Zorina,David Kovar,Mederic Fourmy,Florent Lamiraux,Nicolas Mansard,Justin Carpentier,Josef Sivic,Vladimir Petrik

Main category: cs.RO

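TL;DR: Uses instructional video to guide robots through complex multi-step task-and-motion planning, extending the RRT planner with contact states and 3D object poses extracted from the video to solve tasks with sequential dependencies.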

Details Motivation: Use instructional video to guide robots through complex tasks and handle the sequential dependencies that traditional planning algorithms struggle with. Method: Extends the RRT planner with contact states and 3D object poses extracted from the video, and designs a new benchmark to validate the approach. Result: Effectiveness is demonstrated on several robots, including the Franka Emika Panda and the KUKA KMR iiwa, and a trajectory refinement method is developed. Conclusion: Video-guided planning can solve complex tasks and shows generalization ability. Abstract: This work aims to leverage instructional video to solve complex multi-step task-and-motion planning tasks in robotics. Towards this goal, we propose an extension of the well-established Rapidly-Exploring Random Tree (RRT) planner, which simultaneously grows multiple trees around grasp and release states extracted from the guiding video. Our key novelty lies in combining contact states and 3D object poses extracted from the guiding video with a traditional planning algorithm that allows us to solve tasks with sequential dependencies, for example, if an object needs to be placed at a specific location to be grasped later. We also investigate the generalization capabilities of our approach to go beyond the scene depicted in the instructional video. To demonstrate the benefits of the proposed video-guided planning approach, we design a new benchmark with three challenging tasks: (i) 3D re-arrangement of multiple objects between a table and a shelf, (ii) multi-step transfer of an object through a tunnel, and (iii) transferring objects using a tray, similar to how a waiter transfers dishes. We demonstrate the effectiveness of our planning algorithm on several robots, including the Franka Emika Panda and the KUKA KMR iiwa. For a seamless transfer of the obtained plans to the real robot, we develop a trajectory refinement approach formulated as an optimal control problem (OCP).

[109] RT-cache: Efficient Robot Trajectory Retrieval System

Owen Kwon,Abraham George,Alison Bartsch,Amir Barati Farimani

Main category: cs.RO

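TL;DR: RT-cache stores and retrieves snippets of successful trajectories, substantially reducing robot inference latency and improving task completion speed and success rate.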

Details Motivation: Modern Vision-Language-Action models perform well across diverse tasks but have high inference cost and latency; RT-cache targets this problem. Method: Combines a Memory Builder with Trajectory Retrieval to build a large-scale trajectory memory and retrieve matching multi-step motion snippets in real time. Result: On the Open-X Embodiment Dataset and real-world data, RT-cache completes tasks faster and more successfully than a baseline without retrieval. Conclusion: RT-cache offers an efficient, data-driven solution for real-time robot manipulation. Abstract: This paper introduces RT-cache, a novel trajectory-memory pipeline that accelerates real-world robot inference by leveraging big-data retrieval and learning from experience. While modern Vision-Language-Action (VLA) models can handle diverse robotic tasks, they often incur high per-step inference costs, resulting in significant latency, sometimes minutes per task. In contrast, RT-cache stores a large-scale Memory of previously successful robot trajectories and retrieves relevant multistep motion snippets, drastically reducing inference overhead. By integrating a Memory Builder with a Trajectory Retrieval, we develop an efficient retrieval process that remains tractable even for extremely large datasets. RT-cache flexibly accumulates real-world experiences and replays them whenever the current scene matches past states, adapting quickly to new or unseen environments with only a few additional samples. Experiments on the Open-X Embodiment Dataset and other real-world data demonstrate that RT-cache completes tasks both faster and more successfully than a baseline lacking retrieval, suggesting a practical, data-driven solution for real-time manipulation.
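The store-and-replay idea in the abstract can be sketched as a nearest-neighbour lookup over scene embeddings. Everything below is hypothetical: the `TrajectoryMemory` class, the cosine-similarity match, the 0.9 threshold, and the toy two-dimensional embeddings are illustrative assumptions, and the abstract does not say how RT-cache embeds scenes or indexes its memory.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TrajectoryMemory:
    """Hypothetical cache of (scene embedding, multi-step action snippet) pairs."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed minimum similarity for a cache hit
        self.entries = []

    def add(self, embedding, snippet):
        self.entries.append((list(embedding), list(snippet)))

    def retrieve(self, query):
        """Return the best-matching snippet, or None to signal a cache miss."""
        best_score, best_snippet = -1.0, None
        for emb, snippet in self.entries:
            score = cosine(query, emb)
            if score > best_score:
                best_score, best_snippet = score, snippet
        return best_snippet if best_score >= self.threshold else None


memory = TrajectoryMemory()
memory.add([1.0, 0.0], ["reach", "grasp"])
memory.add([0.0, 1.0], ["lift"])
print(memory.retrieve([0.9, 0.1]))  # close to the first stored scene
print(memory.retrieve([1.0, 1.0]))  # no stored scene is similar enough
```

On a miss, a caller would fall back to the slower policy model; the linear scan shown here would need an approximate nearest-neighbour index to stay tractable at the dataset sizes the abstract mentions.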

[110] FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis

Yuxing Chen,Bowen Xiao,He Wang

Main category: cs.RO

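TL;DR: Presents a synthetic garment dataset for robotic garment folding, built with keypoint annotations and generative models to produce high-quality data. The proposed KG-DAgger method substantially improves model performance, reaching a 75% success rate in the real world.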

Details Motivation: Because garments are deformable, generating large amounts of high-quality data for robotic garment manipulation is very challenging. Method: Builds keypoint-based geometric garment templates, applies generative models to produce realistic textures, and trains folding policies via closed-loop imitation learning. Proposes KG-DAgger, which uses keypoints to generate demonstration data for recovering from failures. Result: KG-DAgger substantially improves performance, raising the real-world success rate by 25%, with a final success rate of 75%. Conclusion: Experiments validate the framework, which provides high-quality data and an efficient method for robotic garment folding. Abstract: Due to the deformability of garments, generating a large amount of high-quality data for robotic garment manipulation tasks is highly challenging. In this paper, we present a synthetic garment dataset that can be used for robotic garment folding. We begin by constructing geometric garment templates based on keypoints and applying generative models to generate realistic texture patterns. Leveraging these keypoint annotations, we generate folding demonstrations in simulation and train folding policies via closed-loop imitation learning. To improve robustness, we propose KG-DAgger, which uses a keypoint-based strategy to generate demonstration data for recovering from failures. KG-DAgger significantly improves the model performance, boosting the real-world success rate by 25%. After training with 15K trajectories (about 2M image-action pairs), the model achieves a 75% success rate in the real world. Experiments in both simulation and real-world settings validate the effectiveness of our proposed framework.

[111] TransDiffuser: End-to-end Trajectory Generation with Decorrelated Multi-modal Representation for Autonomous Driving

Xuefeng Jiang,Yuan Ma,Pengxiang Li,Leimeng Xu,Xin Wen,Kun Zhan,Zhongpu Xia,Peng Jia,XianPeng Lang,Sheng Sun

Main category: cs.RO

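TL;DR: TransDiffuser is a diffusion-based trajectory planning model for end-to-end autonomous driving that uses a multi-modal representation decorrelation mechanism to address mode collapse and outperforms existing methods.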

Details Motivation: Diffusion models have shown promise in many domains, but applying them to trajectory planning for autonomous driving remains a challenge. This paper explores that potential. Method: Proposes TransDiffuser, an encoder-decoder generative model that uses multi-modal conditional input and a representation decorrelation optimization mechanism to generate high-quality, diverse trajectories. Result: On the NAVSIM benchmark, TransDiffuser reaches a PDMS of 94.85, surpassing existing methods. Conclusion: TransDiffuser shows that diffusion models are effective for trajectory planning in autonomous driving, achieving high performance without anchor-based prior trajectories. Abstract: In recent years, diffusion models have shown their potential across diverse domains from vision generation to language modeling. Transferring their capabilities to modern autonomous driving systems has also emerged as a promising direction. In this work, we propose TransDiffuser, an encoder-decoder based generative trajectory planning model for end-to-end autonomous driving. The encoded scene information serves as the multi-modal conditional input of the denoising decoder. To tackle the mode collapse dilemma in generating high-quality diverse trajectories, we introduce a simple yet effective multi-modal representation decorrelation optimization mechanism during the training process. TransDiffuser achieves a PDMS of 94.85 on the NAVSIM benchmark, surpassing previous state-of-the-art methods without any anchor-based prior trajectories.

[112] APR-Transformer: Initial Pose Estimation for Localization in Complex Environments through Absolute Pose Regression

Srinivas Ravuri,Yuan Xu,Martin Ludwig Zehetner,Ketan Motlag,Sahin Albayrak

Main category: cs.RO

TL;DR: APR-Transformer is a deep neural network model that predicts absolute pose (3D position and orientation) and performs well in GNSS-denied environments.

Details Motivation: Precise initialization is critical for localization algorithms, and traditional methods perform poorly where GPS signals are unavailable. Method: Proposes the APR-Transformer model, which predicts absolute pose from image or LiDAR data. Result: Achieves state-of-the-art performance on several benchmark datasets and is validated for reliability on a real autonomous vehicle. Conclusion: APR-Transformer is practical and effective in complex environments, and the code is open source. Abstract: Precise initialization plays a critical role in the performance of localization algorithms, especially in the context of robotics, autonomous driving, and computer vision. Poor localization accuracy is often a consequence of inaccurate initial poses, particularly noticeable in GNSS-denied environments where GPS signals are primarily relied upon for initialization. Recent advances in leveraging deep neural networks for pose regression have led to significant improvements in both accuracy and robustness, especially in estimating complex spatial relationships and orientations. In this paper, we introduce APR-Transformer, a model architecture inspired by state-of-the-art methods, which predicts absolute pose (3D position and 3D orientation) using either image or LiDAR data. We demonstrate that our proposed method achieves state-of-the-art performance on established benchmark datasets such as the Radar Oxford Robot-Car and DeepLoc datasets. Furthermore, we extend our experiments to include our custom complex APR-BeIntelli dataset. Additionally, we validate the reliability of our approach in GNSS-denied environments by deploying the model in real-time on an autonomous test vehicle. This showcases the practical feasibility and effectiveness of our approach. The source code is available at: https://github.com/GT-ARC/APR-Transformer.

econ.GN [Back]

[113] Ornithologist: Towards Trustworthy "Reasoning" about Central Bank Communications

Dominic Zaun Eu Jones

Main category: econ.GN

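TL;DR: Ornithologist is a weakly-supervised text classification system that measures the hawkishness and dovishness of central bank text, guiding a large language model with human-authored decision trees to improve transparency and explainability.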

Details Motivation: Build a text classification system that is transparent, explainable, and accessible to non-experts, reduces hallucination risk, and transfers to other text classification problems. Method: Uses "taxonomy-guided reasoning", combining human-authored decision trees with a large language model for weakly-supervised classification. Result: Ornithologist's measurements of hawkishness and dovishness in RBA communication carry information about the future cash rate path and market expectations. Conclusion: Ornithologist is an efficient, transparent text classification tool suited to central bank text analysis, with potential for broader application. Abstract: I develop Ornithologist, a weakly-supervised textual classification system, and use it to measure the hawkishness and dovishness of central bank text. Ornithologist uses "taxonomy-guided reasoning", guiding a large language model with human-authored decision trees. This increases the transparency and explainability of the system and makes it accessible to non-experts. It also reduces hallucination risk. Since it requires less supervision than traditional classification systems, it can more easily be applied to other problems or sources of text (e.g. news) without much modification. Ornithologist measurements of hawkishness and dovishness of RBA communication carry information about the future of the cash rate path and of market expectations.

cs.LG [Back]

[114] The Geometry of Meaning: Perfect Spacetime Representations of Hierarchical Structures

Andres Anabalon,Hugo Garces,Julio Oliva,Jose Cifuentes

Main category: cs.LG

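TL;DR: Proposes a fast algorithm that embeds hierarchical structures in three-dimensional Minkowski spacetime, with data correlations encoded entirely in the causal structure. The method relies only on local hierarchical signals, needs no global symbolic structure, and is validated on the WordNet corpus.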

Details Motivation: Explore perfect representations of discrete data, such as hierarchies, in geometric space, and show that hierarchical relations among concepts and categories are geometric in nature. Method: Uses oriented token pairs as local hierarchical signals and encodes hierarchy through causal structure rather than distance, validated on WordNet. Result: Perfectly embeds the mammal sub-tree and the maximal unambiguous subset of WordNet (82,115 nouns), with the hierarchy fully encoded in the geometry. Conclusion: Discrete data appears to admit a perfect three-dimensional geometric representation with deep connections to general relativity and field theory, suggesting that hierarchical meaning is geometric. Abstract: We show that there is a fast algorithm that embeds hierarchical structures in three-dimensional Minkowski spacetime. The correlation of data ends up purely encoded in the causal structure. Our model relies solely on oriented token pairs -- local hierarchical signals -- with no access to global symbolic structure. We apply our method to the corpus of WordNet. We provide a perfect embedding of the mammal sub-tree including ambiguities (more than one hierarchy per node) in such a way that the hierarchical structures get completely codified in the geometry and exactly reproduce the ground-truth. We extend this to a perfect embedding of the maximal unambiguous subset of the WordNet with 82,115 noun tokens and a single hierarchy per token. We introduce a novel retrieval mechanism in which causality, not distance, governs hierarchical access. Our results seem to indicate that all discrete data has a perfect geometrical representation that is three-dimensional. The resulting embeddings are nearly conformally invariant, indicating deep connections with general relativity and field theory. These results suggest that concepts, categories, and their interrelations, namely hierarchical meaning itself, are geometric.
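A small sketch may help make "causality, not distance, governs hierarchical access" concrete. It assumes events are (t, x, y) points in 2+1-dimensional Minkowski spacetime with the speed of light set to 1, and that a descendant lies in or on the future light cone of its ancestor. The direction of the relation, the coordinates, and the function names are illustrative assumptions, not the paper's embedding algorithm.

```python
def in_future_cone(a, b):
    """True if event b lies in or on the future light cone of event a.

    Events are (t, x, y) tuples; the interval test is dt^2 >= dx^2 + dy^2
    together with dt > 0 (assumed sign convention).
    """
    dt = b[0] - a[0]
    dx = b[1] - a[1]
    dy = b[2] - a[2]
    return dt > 0 and dt * dt >= dx * dx + dy * dy


def descendants(root, events):
    """Names of all events causally reachable from `root`, in sorted order."""
    origin = events[root]
    return sorted(name for name, e in events.items() if in_future_cone(origin, e))


# Made-up coordinates chosen so that dog and cat fall inside mammal's cone.
events = {
    "mammal": (0.0, 0.0, 0.0),
    "dog": (2.0, 1.0, 0.0),
    "cat": (2.0, -1.0, 0.0),
    "oak": (1.0, 5.0, 0.0),
}
print(descendants("mammal", events))
print(in_future_cone(events["dog"], events["cat"]))
```

In this toy layout "dog" and "cat" are both retrieved from "mammal" even though they are spatially separated from each other, while "oak" is spacelike-separated from "mammal" and is excluded, and the two siblings are not causally related to one another.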

[115] An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits

Cody Steinmetz,Gavin Childress,Aaron Herbst,Gavin Jones,Jasdeep Singh,Eli Vang,Keagan Weinstock

Main category: cs.LG

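TL;DR: Uses RMS normalization and a gradual quantization schedule to stably fine-tune full-precision LLMs into ternary models, closing much of the accuracy gap to full precision.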

Details Motivation: Address the instability of ternary-quantized LLMs while reducing deployment cost. Method: Inserts RMS normalization before every linear projection and applies a gradual, layer-wise quantization schedule. Result: Matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. Conclusion: Normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical. Abstract: Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary (2-bit) regime yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. These results indicate that careful normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical.
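The two ingredients named in the abstract, RMS normalization before a linear projection and ternary weights, can be sketched in plain Python. The abstract does not give the quantizer, so the absmean scaling used below (divide by the mean absolute weight, round, clip to {-1, 0, 1}) is an assumption borrowed from the common 1.58-bit recipe. The function names are hypothetical, and the training-time straight-through estimator and layer-wise schedule are omitted.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """Scale x so its root-mean-square is about 1, then apply an optional gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]


def ternarize(weights):
    """Quantize a weight matrix to {-1, 0, 1} with one shared scale (assumed absmean)."""
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat) or 1.0
    q = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return q, scale


def ternary_linear(x, weights):
    """Bias-free linear layer: RMS-normalize the input, then use ternary weights."""
    q, scale = ternarize(weights)
    h = rms_norm(x)
    return [scale * sum(qi * hi for qi, hi in zip(row, h)) for row in q]


w = [[0.9, -0.05], [-1.1, 0.4]]
print(ternarize(w))
print(ternary_linear([3.0, 4.0], w))
```

Normalizing the input keeps activations at a predictable scale regardless of how coarse the weights are, which is one plausible reading of why the extra RMSNorm stabilizes fine-tuning.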

[116] ForeCite: Adapting Pre-Trained Language Models to Predict Future Citation Rates of Academic Papers

Gavin Hull,Alex Bihlo

Main category: cs.LG

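TL;DR: ForeCite is a framework built on pre-trained causal language models that predicts the future citation rates of academic papers and outperforms existing methods.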

Details Motivation: Predicting citation rates is a step toward automating research evaluation and accelerating scientific progress. Method: Adds a linear head to a pre-trained causal language model for a regression task that predicts the average monthly citation rate. Result: On a dataset of more than 900K biomedical papers, test correlation reaches ρ = 0.826, a 27-point improvement over the previous best method. Conclusion: ForeCite sets a new standard for forecasting the long-term influence of academic research and lays groundwork for automated evaluation of scientific contributions. Abstract: Predicting the future citation rates of academic papers is an important step toward the automation of research evaluation and the acceleration of scientific progress. We present ForeCite, a simple but powerful framework that augments pre-trained causal language models with a linear head for average monthly citation rate prediction. Adapting transformers for regression tasks, ForeCite achieves a test correlation of ρ = 0.826 on a curated dataset of 900K+ biomedical papers published between 2000 and 2024, a 27-point improvement over the previous state-of-the-art. Comprehensive scaling-law analysis reveals consistent gains across model sizes and data volumes, while temporal holdout experiments confirm practical robustness. Gradient-based saliency heatmaps suggest a potentially undue reliance on titles and abstract texts. These results establish a new state-of-the-art in forecasting the long-term influence of academic research and lay the groundwork for the automated, high-fidelity evaluation of scientific contributions.

[117] CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios

Raghav Garg,Kapil Sharma,Karan Gupta

Main category: cs.LG

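TL;DR: Introduces CXMArena, a synthetic benchmark dataset for evaluating the practical utility of AI in Customer Experience Management (CXM), addressing the lack of realism in existing benchmarks and the scarcity of data.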

Details Motivation: Existing benchmarks fall short when evaluating large language models (LLMs) for CXM: they lack realism and deep knowledge base integration. Method: Develops an LLM-powered pipeline that generates the synthetic dataset CXMArena, simulating a brand's CXM entities and using noise injection and validation to keep the data realistic. Result: Baseline experiments show that current models struggle on CXMArena tasks, with only 68% accuracy on article search and an F1 score of only 0.3 on knowledge base refinement. Conclusion: CXMArena provides a more realistic benchmark for evaluating AI in CXM and exposes the limits of current models, which call for more complex solutions. Abstract: Large Language Models (LLMs) hold immense potential for revolutionizing Customer Experience Management (CXM), particularly in contact center operations. However, evaluating their practical utility in complex operational environments is hindered by data scarcity (due to privacy concerns) and the limitations of current benchmarks. Existing benchmarks often lack realism, failing to incorporate deep knowledge base (KB) integration, real-world noise, or critical operational tasks beyond conversational fluency. To bridge this gap, we introduce CXMArena, a novel, large-scale synthetic benchmark dataset specifically designed for evaluating AI in operational CXM contexts. Given the diversity in possible contact center features, we have developed a scalable LLM-powered pipeline that simulates the brand's CXM entities that form the foundation of our datasets, such as knowledge articles including product specifications, issue taxonomies, and contact center conversations. The entities closely represent real-world distribution because of controlled noise injection (informed by domain experts) and rigorous automated validation. Building on this, we release CXMArena, which provides dedicated benchmarks targeting five important operational tasks: Knowledge Base Refinement, Intent Prediction, Agent Quality Adherence, Article Search, and Multi-turn RAG with Integrated Tools. Our baseline experiments underscore the benchmark's difficulty: even state-of-the-art embedding and generation models achieve only 68% accuracy on article search, while standard embedding methods yield a low F1 score of 0.3 for knowledge base refinement, highlighting significant challenges for current models necessitating complex pipelines and solutions over conventional techniques.

[118] Optimizing Urban Critical Green Space Development Using Machine Learning

Mohammad Ganjirad,Mahmoud Reza Delavar,Hossein Bagheri,Mohammad Mehdi Azizi

Main category: cs.LG

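TL;DR: Proposes a framework for prioritizing urban green space development in Tehran based on socio-economic, environmental, and sensitivity indices, combining machine learning models with microclimate simulation to validate it.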

Details Motivation: Prioritize urban green space development in Tehran and use multi-source data to put decisions on a more scientific footing. Method: Draws on sources such as Google Earth Engine and the WRF model, classifies vegetation cover with machine learning models (including Random Forest), and combines multiple indices to produce a prioritization map. Result: Random Forest performs best (above 94% in overall accuracy, recall, and F1-score); nightly land surface temperature and sensitive population are the key indices; microclimate simulation shows green roofs can lower air temperature by up to 0.67°C. Conclusion: The framework gives urban planners a tool for optimizing green space development decisions. Abstract: This paper presents a novel framework for prioritizing urban green space development in Tehran using diverse socio-economic, environmental, and sensitivity indices. The indices were derived from various sources including Google Earth Engine, air pollution measurements, municipal reports and the Weather Research & Forecasting (WRF) model. The WRF model was used to estimate the air temperature at a 1 km resolution due to insufficient meteorological stations, yielding RMSE and MAE values of 0.96°C and 0.92°C, respectively. After data preparation, several machine learning models were used for binary vegetation cover classification including XGBoost, LightGBM, Random Forest (RF) and Extra Trees. RF achieved the highest performance, exceeding 94% in Overall Accuracy, Recall, and F1-score. Then, the probability of areas lacking vegetation cover was assessed using socio-economic, environmental and sensitivity indices. This resulted in the RF generating an urban green space development prioritization map. Feature Importance Analysis revealed that the most significant indices were nightly land surface temperature (LST) and sensitive population. Finally, the framework performance was validated through microclimate simulation to assess the critical areas before and after the green space development by green roofs. The simulation demonstrated reducing air temperature by up to 0.67°C after utilizing the green roof technology in critical areas. As a result, this framework provides a valuable tool for urban planners to develop green spaces.

[119] GreenFactory: Ensembling Zero-Cost Proxies to Estimate Performance of Neural Networks

Gabriel Cortês,Nuno Lourenço,Paolo Romano,Penousal Machado

Main category: cs.LG

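TL;DR: GreenFactory is an ensemble of zero-cost proxies that uses a random forest regressor to predict model test accuracy directly, addressing the poor generalization and ranking-only output of earlier proxies.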

Details Motivation: Evaluating network performance in neural architecture search traditionally requires training and evaluating each network, which is slow and resource-intensive; existing zero-cost proxies generalize poorly and provide only relative rankings. Method: Proposes GreenFactory, which ensembles several zero-cost proxies and uses a random forest regressor to combine their strengths and directly predict test accuracy. Result: Evaluated on NATS-Bench, GreenFactory is robust across datasets with high Kendall correlations (for example, 0.907 on CIFAR-10 in the SSS search space). Conclusion: GreenFactory offers an efficient, reliable way to predict performance in neural architecture search across diverse scenarios. Abstract: Determining the performance of a Deep Neural Network during Neural Architecture Search processes is essential for identifying optimal architectures and hyperparameters. Traditionally, this process requires training and evaluation of each network, which is time-consuming and resource-intensive. Zero-cost proxies estimate performance without training, serving as an alternative to traditional training. However, recent proxies often lack generalization across diverse scenarios and provide only relative rankings rather than predicted accuracies. To address these limitations, we propose GreenFactory, an ensemble of zero-cost proxies that leverages a random forest regressor to combine multiple predictors' strengths and directly predict model test accuracy. We evaluate GreenFactory on NATS-Bench, achieving robust results across multiple datasets. Specifically, GreenFactory achieves high Kendall correlations on NATS-Bench-SSS, indicating substantial agreement between its predicted scores and actual performance: 0.907 for CIFAR-10, 0.945 for CIFAR-100, and 0.920 for ImageNet-16-120. Similarly, on NATS-Bench-TSS, we achieve correlations of 0.921 for CIFAR-10, 0.929 for CIFAR-100, and 0.908 for ImageNet-16-120, showcasing its reliability in both search spaces.
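The Kendall correlations reported above measure how well predicted accuracies preserve the ordering of actual accuracies. A minimal sketch of the tau-a variant follows; which variant the authors used, and how they handle ties, is not stated in the abstract, and the accuracy values below are made up for illustration.

```python
def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs.

    Tied pairs count as neither concordant nor discordant (assumed convention).
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical predicted vs. measured test accuracies for four architectures.
predicted = [0.91, 0.85, 0.88, 0.70]
actual = [0.93, 0.86, 0.84, 0.72]
print(round(kendall_tau(predicted, actual), 4))
```

Here five of the six pairs are ordered consistently and one is swapped, giving (5 - 1) / 6. A value near 0.9, as reported for NATS-Bench, means the large majority of architecture pairs are ranked the same way by prediction and ground truth.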

cs.SD [Back]

[120] DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis

Zeeshan Ahmad,Shudi Bao,Meng Chen

Main category: cs.SD

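TL;DR: Proposes DPN-GAN, a GAN based on a deformable periodic network, which introduces a periodic activation function and a multi-resolution generation module to improve the quality and robustness of generated audio.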

Details Motivation: Existing GAN models rely on bandwidth-limited mel-spectrograms, which limits the resolution of generated audio and leads to mode collapse. Method: Designs DPN-GAN, with a periodic ReLU activation function and a multi-resolution generation module based on deformable convolution, and improves the discriminator network. Result: Experiments show DPN-GAN outperforms existing GAN models across several datasets, producing higher-quality and more robust audio. Conclusion: DPN-GAN's architecture addresses resolution limits and mode collapse in audio generation and delivers superior performance. Abstract: In recent years, generative adversarial networks (GANs) have made significant progress in generating audio sequences. However, these models typically rely on bandwidth-limited mel-spectrograms, which constrain the resolution of generated audio sequences, and lead to mode collapse during conditional generation. To address this issue, we propose Deformable Periodic Network based GAN (DPN-GAN), a novel GAN architecture that incorporates a kernel-based periodic ReLU activation function to induce periodic bias in audio generation. This innovative approach enhances the model's ability to capture and reproduce intricate audio patterns. In particular, our proposed model features a DPN module for multi-resolution generation utilizing deformable convolution operations, allowing for adaptive receptive fields that improve the quality and fidelity of the synthetic audio. Additionally, we enhance the discriminator network using deformable convolution to better distinguish between real and generated samples, further refining the audio quality. We trained two versions of the model: DPN-GAN small (38.67M parameters) and DPN-GAN large (124M parameters). For evaluation, we use five different datasets, covering both speech synthesis and music generation tasks, to demonstrate the efficiency of the DPN-GAN. The experimental results demonstrate that DPN-GAN delivers superior performance on both out-of-distribution and noisy data, showcasing its robustness and adaptability. Trained across various datasets, DPN-GAN outperforms state-of-the-art GAN architectures on standard evaluation metrics, and exhibits increased robustness in synthesized audio.

cs.CR [Back]

[121] LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries

Zekun Wu,Seonglae Cho,Umar Mohammed,Cristian Munoz,Kleyton Costa,Xin Guan,Theo King,Ze Wang,Emre Kazim,Adriano Koshiyama

Main category: cs.CR

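TL;DR: LibVulnWatch is a graph-based agentic assessment framework that evaluates the risks of open-source AI libraries across security, licensing, maintenance, and related areas, and produces reproducible scores.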

Details Motivation: Open-source AI libraries are essential to modern AI systems, but their risks in security, licensing, maintenance, and related areas are underexamined and call for systematic assessment. Method: A graph framework built on LangGraph coordinates multiple specialized agents that extract, verify, and quantify risk from trusted sources, producing scores across five key domains. Result: Applied to 20 widely used libraries, the system covers up to 88% of OpenSSF Scorecard checks and uncovers up to 19 additional risks per library, including RCE vulnerabilities and missing SBOMs. Conclusion: By turning governance principles into verifiable metrics, LibVulnWatch provides a scalable, transparent mechanism for AI supply chain risk assessment and library selection. Abstract: Open-source AI libraries are foundational to modern AI systems but pose significant, underexamined risks across security, licensing, maintenance, supply chain integrity, and regulatory compliance. We present LibVulnWatch, a graph-based agentic assessment framework that performs deep, source-grounded evaluations of these libraries. Built on LangGraph, the system coordinates a directed acyclic graph of specialized agents to extract, verify, and quantify risk using evidence from trusted sources such as repositories, documentation, and vulnerability databases. LibVulnWatch generates reproducible, governance-aligned scores across five critical domains, publishing them to a public leaderboard for longitudinal ecosystem monitoring. Applied to 20 widely used libraries, including ML frameworks, LLM inference engines, and agent orchestration tools, our system covers up to 88% of OpenSSF Scorecard checks while uncovering up to 19 additional risks per library. These include critical Remote Code Execution (RCE) vulnerabilities, absent Software Bills of Materials (SBOMs), licensing constraints, undocumented telemetry, and widespread gaps in regulatory documentation and auditability. By translating high-level governance principles into practical, verifiable metrics, LibVulnWatch advances technical AI governance with a scalable, transparent mechanism for continuous supply chain risk assessment and informed library selection.

[122] Robustness Analysis against Adversarial Patch Attacks in Fully Unmanned Stores

Hyunsik Na,Wonho Lee,Seungdeok Roh,Sohee Park,Daeseon Choi

Main category: cs.CR

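TL;DR: Studies security vulnerabilities in AI-based automated checkout systems for unmanned stores, in particular the damage adversarial patch attacks do to object detection models, and proposes a new evaluation metric and defense strategies.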

Details Motivation: As unmanned stores spread, vulnerabilities in their AI systems, such as adversarial patch attacks, can lead to theft and inventory problems, so their impact and defenses need study. Method: Investigates three adversarial patch attacks (Hiding, Creating, and Altering), proposes a new color histogram similarity loss function and a bounding-box-based evaluation metric, and validates the attacks in both digital and physical environments. Result: Adversarial patch attacks severely disrupt object detection models in unmanned store settings, and shadow attacks raise attack success rates even without access to model parameters. Conclusion: Current defense mechanisms fall short in real-time detection; object detection models need to be made more robust and proactive defenses adopted against adversarial threats. Abstract: The advent of convenient and efficient fully unmanned stores equipped with artificial intelligence-based automated checkout systems marks a new era in retail. However, these systems have inherent artificial intelligence security vulnerabilities, which are exploited via adversarial patch attacks, particularly in physical environments. This study demonstrated that adversarial patches can severely disrupt object detection models used in unmanned stores, leading to issues such as theft, inventory discrepancies, and interference. We investigated three types of adversarial patch attacks -- Hiding, Creating, and Altering attacks -- and highlighted their effectiveness. We also introduce the novel color histogram similarity loss function by leveraging attacker knowledge of the color information of a target class object. Besides the traditional confusion-matrix-based attack success rate, we introduce a new bounding-boxes-based metric to analyze the practical impact of these attacks. Starting with attacks on object detection models trained on snack and fruit datasets in a digital environment, we evaluated the effectiveness of adversarial patches in a physical testbed that mimicked a real unmanned store with RGB cameras and realistic conditions. Furthermore, we assessed the robustness of these attacks in black-box scenarios, demonstrating that shadow attacks can enhance success rates of attacks even without direct access to model parameters. Our study underscores the necessity for robust defense strategies to protect unmanned stores from adversarial threats. Highlighting the limitations of the current defense mechanisms in real-time detection systems and discussing various proactive measures, we provide insights into improving the robustness of object detection models and fortifying unmanned retail environments against these attacks.

[123] Adaptive Security Policy Management in Cloud Environments Using Reinforcement Learning

Muhammad Saqib,Dipkumar Mehta,Fnu Yashu,Shubham Malhotra

Main category: cs.CR

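TL;DR: Proposes a reinforcement-learning-based framework for dynamic security policy management that addresses the shortcomings of static policies in cloud environments and improves intrusion detection rates and response times.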

Details Motivation: Security requirements in cloud environments such as AWS are complex and dynamic, and static security policies cannot keep up with evolving threats and elastic resources. Method: Uses deep reinforcement learning algorithms (such as Deep Q-Networks and Proximal Policy Optimization) and cloud telemetry data to adjust firewall rules and IAM policies dynamically. Result: The framework achieves a higher intrusion detection rate (92%, against 82% for static policies) and reduces incident detection and response times by 58%. Conclusion: The results support adaptive reinforcement learning as an effective approach to cloud security policy management. Abstract: The security of cloud environments, such as Amazon Web Services (AWS), is complex and dynamic. Static security policies have become inadequate as threats evolve and cloud resources exhibit elasticity [1]. This paper addresses the limitations of static policies by proposing a security policy management framework that uses reinforcement learning (RL) to adapt dynamically. Specifically, we employ deep reinforcement learning algorithms, including Deep Q-Networks and Proximal Policy Optimization, enabling the learning and continuous adjustment of controls such as firewall rules and Identity and Access Management (IAM) policies. The proposed RL-based solution leverages cloud telemetry data (AWS CloudTrail logs, network traffic data, threat intelligence feeds) to continuously refine security policies, maximizing threat mitigation and compliance while minimizing resource impact. Experimental results demonstrate that our adaptive RL-based framework significantly outperforms static policies, achieving higher intrusion detection rates (92% compared to 82% for static policies) and substantially reducing incident detection and response times by 58%. In addition, it maintains high conformity with security requirements and efficient resource usage. These findings validate the effectiveness of adaptive reinforcement learning approaches in improving cloud security policy management.
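The value update at the heart of Q-learning, which Deep Q-Networks approximate with a neural network, can be shown with a tabular sketch. The states, actions, reward, and hyperparameters below are invented for illustration; the abstract does not describe the paper's state encoding, action space, or reward design.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical policy actions and telemetry-derived states.
actions = ["allow", "block"]
q = defaultdict(float)

# Blocking traffic while a port scan is detected earns an assumed reward of 1.0.
print(q_update(q, "port_scan_detected", "block", 1.0, "normal_traffic", actions))
# The same transition seen again moves the estimate further toward the target.
print(q_update(q, "port_scan_detected", "block", 1.0, "normal_traffic", actions))
```

With all values starting at zero, the first update gives 0.5 and the second 0.75, showing how repeated experience pulls the estimate toward the observed return. A deep variant replaces the table with a network trained on the same temporal-difference target.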

cs.IR [Back]

[124] Focus, Merge, Rank: Improved Question Answering Based on Semi-structured Knowledge Bases

Derian Boer,Stephen Roth,Stefan Kramer

Main category: cs.IR

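TL;DR: FocusedRetriever is a multi-hop question answering framework built on semi-structured knowledge bases (SKBs) that combines structured and unstructured data; its modular design outperforms existing methods across multiple domains and performance metrics.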

Details Motivation: In real-world settings, machine learning models and interactive systems typically rely on either structured or unstructured data alone, whereas SKBs link the two and enable new strategies for knowledge access. Method: The framework integrates entity retrieval based on vector similarity search (VSS), LLM-based Cypher query generation, and pairwise re-ranking; it uses LLMs to extract relational facts and entity attributes, then refines answer candidates through node set joins and vector similarity search. Result: On the STaRK benchmark, FocusedRetriever's average first-hit rate exceeds that of the second-best method by 25.7%. Conclusion: FocusedRetriever demonstrates the potential of SKBs for multi-hop question answering and leaves room for further improvement, such as fine-tuning. Abstract: In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. However, most rely on only one of the two. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data, thereby enabling new strategies for knowledge access and use. In this work, we present FocusedRetriever, a modular SKB-based framework for multi-hop question answering. It integrates components (VSS-based entity search, LLM-based generation of Cypher queries, and pairwise re-ranking) in a way that enables it to outperform state-of-the-art methods across all three STaRK benchmark test sets, covering diverse domains and multiple performance metrics. The average first-hit rate exceeds that of the second-best method by 25.7%. FocusedRetriever leverages (1) the capacity of Large Language Models (LLMs) to extract relational facts and entity attributes from unstructured text, (2) node set joins to filter answer candidates based on these extracted triplets and constraints, (3) vector similarity search to retrieve and rank relevant unstructured content, and (4) the contextual capabilities of LLMs to finally rank the top-k answers. For generality, we only incorporate base LLMs in FocusedRetriever in our evaluation. However, our analysis of intermediate results highlights several opportunities for further upgrades, including fine-tuning. The source code is publicly available at https://github.com/kramerlab/FocusedRetriever .
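
Steps (2) and (3) of the abstract, filtering candidates with node set joins and then ranking them by vector similarity, can be sketched roughly as follows. This is an illustration under stated assumptions, not the FocusedRetriever implementation: the function names, the tiny graph, and the two-dimensional embeddings are all hypothetical, and the constraints are hard-coded here whereas the framework derives them with an LLM.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def join_candidates(edges, constraints):
    """Node set join: keep only head nodes that satisfy every (relation, tail) constraint.

    edges:       iterable of (head, relation, tail) triplets
    constraints: list of (relation, tail) pairs, assumed to come from an upstream step
    """
    candidate_sets = [
        {head for (head, rel, tail) in edges if rel == c_rel and tail == c_tail}
        for c_rel, c_tail in constraints
    ]
    return set.intersection(*candidate_sets) if candidate_sets else set()

def rank(candidates, embeddings, query_vec, k=3):
    """Order surviving candidates by similarity of their text embedding to the query."""
    ordered = sorted(
        candidates,
        key=lambda node: cosine(embeddings[node], query_vec),
        reverse=True,
    )
    return ordered[:k]

# Hypothetical toy knowledge base.
EDGES = [
    ("paper_a", "written_by", "alice"),
    ("paper_b", "written_by", "alice"),
    ("paper_c", "written_by", "bob"),
    ("paper_a", "topic", "retrieval"),
    ("paper_b", "topic", "retrieval"),
    ("paper_c", "topic", "retrieval"),
]
EMBEDDINGS = {"paper_a": [1.0, 0.0], "paper_b": [0.0, 1.0], "paper_c": [1.0, 1.0]}

if __name__ == "__main__":
    survivors = join_candidates(EDGES, [("written_by", "alice"), ("topic", "retrieval")])
    print(rank(survivors, EMBEDDINGS, query_vec=[0.9, 0.1]))
```

The join runs first so that the similarity search only has to order nodes that already satisfy the structural constraints. In this toy data, `paper_c` is excluded by the author constraint before any embedding is compared.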