cs.CV

[1] A Computational Pipeline for Advanced Analysis of 4D Flow MRI in the Left Atrium

Xabier Morales,Ayah Elsayed,Debbie Zhao,Filip Loncaric,Ainhoa Aguado,Mireia Masias,Gina Quill,Marc Ramos,Ada Doltra,Ana Garcia,Marta Sitges,David Marlevi,Alistair Young,Martyn Nash,Bart Bijnens,Oscar Camara

Main category: cs.CV

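TL;DR: This paper introduces an open-source computational framework for analyzing 4D Flow MRI data of the left atrium (LA). It addresses the limitations of conventional ultrasound analysis and supports high-quality automated segmentation and analysis of hemodynamic parameters.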

Details

Motivation: Conventional ultrasound analysis gives only a limited view of left atrial hemodynamics, while the low velocities within the LA and the limited spatial resolution of 4D Flow MRI make the chamber hard to analyze. The lack of dedicated computational frameworks, together with diverse acquisition protocols, further complicates research. Method: An open-source computational framework that supports qualitative and quantitative analysis of 4D Flow MRI data, including automated segmentation (Dice > 0.9, Hausdorff 95 < 3 mm) and assessment of hemodynamic parameters (energy, vorticity, pressure). Result: The framework is robust to data from different centers and achieves high-accuracy segmentation even with limited training data. It provides the first comprehensive assessment of energy, vorticity, and pressure parameters in the LA, exploring their potential as prognostic biomarkers. Conclusion: The framework offers an efficient tool for studying left atrial hemodynamics, supporting multi-center data analysis and the exploration of prognostic biomarkers. Abstract: The left atrium (LA) plays a pivotal role in modulating left ventricular filling, but our comprehension of its hemodynamics is significantly limited by the constraints of conventional ultrasound analysis. 4D flow magnetic resonance imaging (4D Flow MRI) holds promise for enhancing our understanding of atrial hemodynamics. However, the low velocities within the LA and the limited spatial resolution of 4D Flow MRI make analyzing this chamber challenging. Furthermore, the absence of dedicated computational frameworks, combined with diverse acquisition protocols and vendors, complicates gathering large cohorts for studying the prognostic value of hemodynamic parameters provided by 4D Flow MRI. In this study, we introduce the first open-source computational framework tailored for the analysis of 4D Flow MRI in the LA, enabling comprehensive qualitative and quantitative analysis of advanced hemodynamic parameters. Our framework proves robust to data from different centers of varying quality, producing high-accuracy automated segmentations (Dice > 0.9 and Hausdorff 95 < 3 mm), even with limited training data. Additionally, we conducted the first comprehensive assessment of energy, vorticity, and pressure parameters in the LA across a spectrum of disorders to investigate their potential as prognostic biomarkers.
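
The Dice criterion quoted for the segmentations can be illustrated with a minimal sketch. This is not the framework's evaluation code: the function name `dice_coefficient` and the flat 0/1 mask representation are assumptions made for the example, and the Hausdorff 95 distance is not covered here.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention (an assumption of this sketch).
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 predicted voxels, 2 reference voxels, 2 in common.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 1, 1, 0, 0, 0]
score = dice_coefficient(pred, truth)
```

In this toy case the score is 2·2 / (3 + 2) = 0.8, so it would fall short of the Dice > 0.9 level reported in the abstract.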

[2] Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

Julian Tanke,Takashi Shibuya,Kengo Uchida,Koichi Saito,Yuki Mitsufuji

Main category: cs.CV

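TL;DR: Dyadic Mamba uses state-space models (SSMs) to generate high-quality dyadic human motion of arbitrary length, addressing the limitations of Transformer-based methods on long sequences.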

Details

Motivation: Existing Transformer-based methods perform poorly when generating long dyadic motion sequences, mainly because of limitations in their positional encoding schemes. Method: Dyadic Mamba uses a simple architecture in which information flows between the two individual motion sequences through concatenation, avoiding complex cross-attention mechanisms. Result: The method is competitive on short-sequence benchmarks and significantly outperforms Transformer-based methods on long sequences; a new benchmark for long-sequence evaluation is also proposed. Conclusion: SSM architectures offer a promising solution for long-term dyadic motion generation. Abstract: Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this paper, we introduce Dyadic Mamba, a novel approach that leverages State-Space Models (SSMs) to generate high-quality dyadic human motion of arbitrary length. Our method employs a simple yet effective architecture that facilitates information flow between individual motion sequences through concatenation, eliminating the need for complex cross-attention mechanisms. We demonstrate that Dyadic Mamba achieves competitive performance on standard short-term benchmarks while significantly outperforming transformer-based approaches on longer sequences. Additionally, we propose a new benchmark for evaluating long-term motion synthesis quality, providing a standardized framework for future research. Our results demonstrate that SSM-based architectures offer a promising direction for addressing the challenging task of long-term dyadic human motion synthesis from text descriptions.

[3] BoundarySeg: An Embarrassingly Simple Method To Boost Medical Image Segmentation Performance for Low Data Regimes

Tushar Kataria,Shireen Y. Elhabian

Main category: cs.CV

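TL;DR: Proposes BoundarySeg, a multi-task framework that adds organ boundary prediction as an auxiliary task to improve medical image segmentation accuracy without relying on unannotated data.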

Details

Motivation: Medical data are hard to obtain and annotate, and semi-supervised methods depend on unannotated data, so their effectiveness drops when such data are scarce. Method: The BoundarySeg framework treats organ boundary prediction as an auxiliary task and uses the consistency between the two task predictions as additional supervision. Result: Strong performance in low-data regimes, matching or exceeding existing semi-supervised methods. Conclusion: BoundarySeg provides an efficient medical image segmentation solution that does not depend on unannotated data. Abstract: Obtaining large-scale medical data, annotated or unannotated, is challenging due to stringent privacy regulations and data protection policies. In addition, annotating medical images requires that domain experts manually delineate anatomical structures, making the process both time-consuming and costly. As a result, semi-supervised methods have gained popularity for reducing annotation costs. However, the performance of semi-supervised methods is heavily dependent on the availability of unannotated data, and their effectiveness declines when such data are scarce or absent. To overcome this limitation, we propose a simple, yet effective and computationally efficient approach for medical image segmentation that leverages only existing annotations. We propose BoundarySeg, a multi-task framework that incorporates organ boundary prediction as an auxiliary task to full organ segmentation, leveraging consistency between the two task predictions to provide additional supervision. This strategy improves segmentation accuracy, especially in low data regimes, allowing our method to achieve performance comparable to or exceeding state-of-the-art semi-supervised approaches, all without relying on unannotated data or increasing computational demands. Code will be released upon acceptance.
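
The idea of checking a segmentation against a boundary prediction can be sketched as follows. The abstract does not say how BoundarySeg measures consistency, so everything here is an assumption for illustration: the helper names `mask_boundary` and `consistency_penalty`, the 4-neighbourhood boundary definition, and the mean absolute difference used as the penalty.

```python
def mask_boundary(mask):
    """Mark foreground pixels that touch background in the 4-neighbourhood."""
    h, w = len(mask), len(mask[0])
    boundary = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                # Pixels outside the image are treated as background.
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    boundary[y][x] = 1
                    break
    return boundary


def consistency_penalty(seg_mask, boundary_pred):
    """Mean absolute difference between the boundary implied by the
    segmentation and the separately predicted boundary (hypothetical loss)."""
    derived = mask_boundary(seg_mask)
    h, w = len(derived), len(derived[0])
    diff = sum(abs(derived[y][x] - boundary_pred[y][x])
               for y in range(h) for x in range(w))
    return diff / (h * w)


# Toy 5x5 mask with a 3x3 organ in the centre.
organ = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
         for y in range(5)]
edge = mask_boundary(organ)
penalty_consistent = consistency_penalty(organ, edge)
penalty_inconsistent = consistency_penalty(organ, organ)
```

When the boundary head agrees with the segmentation the penalty is zero; predicting the filled organ as its own boundary disagrees at the single interior pixel, giving 1/25.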

[4] Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models

Danush Kumar Venkatesh,Isabel Funke,Micha Pfeiffer,Fiona Kolbinger,Hanna Maria Schmeiser,Juergen Weitz,Marius Distler,Stefanie Speidel

Main category: cs.CV

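TL;DR: Proposes a text-conditioned diffusion method that synthesizes surgical videos to address dataset imbalance, substantially improving model performance.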

Details

Motivation: Severe data imbalance in surgical video datasets hinders the development of high-performing models, so a way to generate high-quality synthetic videos is needed. Method: A two-stage, text-conditioned diffusion method that decouples spatial and temporal modeling, combined with a rejection sampling strategy that selects the most suitable synthetic samples. Result: Adding the synthetic videos substantially improves model performance on surgical action recognition and intra-operative event prediction. Conclusion: The method effectively addresses data imbalance and provides better data support for computer-assisted surgery. Abstract: Computer-assisted interventions can improve intra-operative guidance, particularly through deep learning methods that harness the spatiotemporal information in surgical videos. However, the severe data imbalance often found in surgical video datasets hinders the development of high-performing models. In this work, we aim to overcome the data imbalance by synthesizing surgical videos. We propose a unique two-stage, text-conditioned diffusion-based method to generate high-fidelity surgical videos for under-represented classes. Our approach conditions the generation process on text prompts and decouples spatial and temporal modeling by utilizing a 2D latent diffusion model to capture spatial content and then integrating temporal attention layers to ensure temporal consistency. Furthermore, we introduce a rejection sampling strategy to select the most suitable synthetic samples, effectively augmenting existing datasets to address class imbalance. We evaluate our method on two downstream tasks, surgical action recognition and intra-operative event prediction, demonstrating that incorporating synthetic videos from our approach substantially enhances model performance. We open-source our implementation at https://gitlab.com/nct_tso_public/surgvgen.
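
A generic rejection sampling loop of the kind mentioned above might look like the sketch below. The abstract does not describe the acceptance criterion, so the `generate` and `score` callables, the threshold, and the function name `rejection_sample` are all placeholders; here a random number stands in for a synthetic video and its own value stands in for a quality score.

```python
import random


def rejection_sample(generate, score, threshold, n_keep, max_tries=1000):
    """Draw candidates until n_keep of them pass the score threshold."""
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_keep:
            break
        candidate = generate()
        # Candidates scoring below the threshold are rejected.
        if score(candidate) >= threshold:
            kept.append(candidate)
    return kept


# Stand-ins: a "video" is a random float and its score is the float itself.
rng = random.Random(0)
samples = rejection_sample(generate=rng.random,
                           score=lambda s: s,
                           threshold=0.7,
                           n_keep=5)
```

In a real pipeline `generate` would call the diffusion model for an under-represented class and `score` would be whatever suitability measure is chosen; the cap `max_tries` simply keeps the loop from running forever if few samples pass.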

[5] Few-Shot Learning of Visual Compositional Concepts through Probabilistic Schema Induction

Andrew Jun Lee,Taylor Webb,Trevor Bihl,Keith Holyoak,Hongjing Lu

Main category: cs.CV

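TL;DR: The paper proposes PSI, a prototype model that uses deep learning to perform analogical mapping over a handful of structured examples, forming a compositional concept (a schema); it outperforms controls that use unstructured or weakly structured representations.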

Details

Motivation: To study how humans learn new visual concepts from limited examples, in particular compositional concepts, whose learning is thought to depend on structured representations and analogical mapping. Method: Introduces the Probabilistic Schema Induction (PSI) model, which weighs object-level and relational similarity and amplifies classification-relevant relations through a mechanism analogous to selective attention. Result: PSI shows human-like learning performance and outperforms both a prototype model using unstructured feature vectors and a PSI variant with weaker structured representations. Conclusion: Structured representations and analogical mapping are critical for rapid learning of compositional visual concepts, and deep learning can be used to build psychological models. Abstract: The ability to learn new visual concepts from limited examples is a hallmark of human cognition. While traditional category learning models represent each example as an unstructured feature vector, compositional concept learning is thought to depend on (1) structured representations of examples (e.g., directed graphs consisting of objects and their relations) and (2) the identification of shared relational structure across examples through analogical mapping. Here, we introduce Probabilistic Schema Induction (PSI), a prototype model that employs deep learning to perform analogical mapping over structured representations of only a handful of examples, forming a compositional concept called a schema. In doing so, PSI relies on a novel conception of similarity that weighs object-level similarity and relational similarity, as well as a mechanism for amplifying relations relevant to classification, analogous to selective attention parameters in traditional models. We show that PSI produces human-like learning performance and outperforms two controls: a prototype model that uses unstructured feature vectors extracted from a deep learning model, and a variant of PSI with weaker structured representations. Notably, we find that PSI's human-like performance is driven by an adaptive strategy that increases relational similarity over object-level similarity and upweights the contribution of relations that distinguish classes. These findings suggest that structured representations and analogical mapping are critical to modeling rapid human-like learning of compositional visual concepts, and demonstrate how deep learning can be leveraged to create psychological models.

[6] Large-Scale Gaussian Splatting SLAM

Zhe Xin,Chenyang Wu,Penghui Huang,Yanyong Zhang,Yinian Mao,Guoquan Huang

Main category: cs.CV

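TL;DR: LSG-SLAM is a large-scale visual SLAM system based on 3D Gaussian Splatting (3DGS) that uses stereo cameras, improves robustness through a multi-modality strategy and feature-alignment constraints, and achieves efficient reconstruction in large-scale scenes.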

Details

Motivation: Existing NeRF and 3DGS methods mostly rely on RGBD sensors and work only in indoor environments; robust reconstruction of large-scale outdoor scenes has not been sufficiently explored. Method: A multi-modality strategy estimates prior poses; feature-alignment warping constraints reduce the effect of appearance similarity in the rendering losses; continuous Gaussian Splatting submaps handle unbounded scenes; and pose optimization followed by structure refinement improves reconstruction quality. Result: On the EuRoC and KITTI datasets, LSG-SLAM outperforms existing neural, 3DGS-based, and traditional methods. Conclusion: LSG-SLAM provides an efficient and robust visual SLAM solution for large-scale outdoor scenes. Abstract: The recently developed Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown encouraging and impressive results for visual SLAM. However, most representative methods require RGBD sensors and are only available for indoor environments. The robustness of reconstruction in large-scale outdoor scenarios remains unexplored. This paper introduces a large-scale 3DGS-based visual SLAM with stereo cameras, termed LSG-SLAM. The proposed LSG-SLAM employs a multi-modality strategy to estimate prior poses under large view changes. In tracking, we introduce feature-alignment warping constraints to alleviate the adverse effects of appearance similarity in rendering losses. For the scalability of large-scale scenarios, we introduce continuous Gaussian Splatting submaps to tackle unbounded scenes with limited memory. Loops are detected between GS submaps by place recognition and the relative pose between looped keyframes is optimized utilizing rendering and feature warping losses. After the global optimization of camera poses and Gaussian points, a structure refinement module enhances the reconstruction quality. With extensive evaluations on the EuRoC and KITTI datasets, LSG-SLAM achieves superior performance over existing Neural, 3DGS-based, and even traditional approaches. Project page: https://lsg-slam.github.io.

[7] AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

Bin-Bin Gao,Yue Zhu,Jiangtao Yan,Yuezhi Cai,Weixi Zhang,Meng Wang,Jun Liu,Yong Liu,Lei Wang,Chengjie Wang

Main category: cs.CV

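TL;DR: AdaptCLIP is a CLIP-based visual anomaly detection method that learns visual and textual representations alternately and combines contextual and aligned residual features, achieving cross-domain generalization without fine-tuning on the target domain.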

Details

Motivation: To overcome the limitations of existing methods in prompt template design, complex token interactions, and extra fine-tuning, and to improve the flexibility and generalization of visual anomaly detection. Method: Three simple adapters (visual, textual, and prompt-query) are introduced; adaptive representations are learned alternately, and comparative learning combines contextual and residual features. Result: State-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains. Conclusion: With simple adapters and a comparative learning strategy, AdaptCLIP markedly improves cross-domain visual anomaly detection without fine-tuning on the target domain. Abstract: Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, complex token interactions, or requiring additional fine-tuning, resulting in limited flexibility. In this work, we present a simple yet effective method called AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters (a visual adapter, a textual adapter, and a prompt-query adapter) at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and operates in a training-free manner on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods. We will make the code and model of AdaptCLIP available at https://github.com/gaobb/AdaptCLIP.

[8] DDFP: Data-dependent Frequency Prompt for Source Free Domain Adaptation of Medical Image Segmentation

Siqi Yin,Shaolei Liu,Manning Wang

Main category: cs.CV

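TL;DR: This paper proposes a new source-free domain adaptation (SFDA) framework that uses preadaptation to generate high-quality pseudo-labels, a data-dependent frequency prompt, and a style-related layer fine-tuning strategy, markedly improving cross-modality medical image segmentation.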

Details

Motivation: Privacy policies make labeled source-domain data, especially medical data, hard to access, and existing SFDA methods still leave room for improvement in image style translation and pseudo-label generation. Method: Preadaptation produces a preadapted model that initializes the target model; a data-dependent frequency prompt improves image style translation; and a style-related layer fine-tuning strategy is used for training. Result: The method outperforms existing state-of-the-art methods on cross-modality abdominal and cardiac SFDA segmentation tasks. Conclusion: The proposed framework effectively addresses the domain gap in SFDA and improves model performance, particularly for medical image segmentation. Abstract: Domain adaptation addresses the challenge of model performance degradation caused by domain gaps. In the typical setup for unsupervised domain adaptation, labeled data from a source domain and unlabeled data from a target domain are used to train a target model. However, access to labeled source domain data, particularly in medical datasets, can be restricted due to privacy policies. As a result, research has increasingly shifted to source-free domain adaptation (SFDA), which requires only a pretrained model from the source domain and unlabeled data from the target domain for adaptation. Existing SFDA methods often rely on domain-specific image style translation and self-supervision techniques to bridge the domain gap and train the target domain model. However, the quality of domain-specific style-translated images and pseudo-labels produced by these methods still leaves room for improvement. Moreover, training the entire model during adaptation can be inefficient under limited supervision. In this paper, we propose a novel SFDA framework to address these challenges. Specifically, to effectively mitigate the impact of domain gap in the initial training phase, we introduce preadaptation to generate a preadapted model, which serves as an initialization of the target model and allows for the generation of high-quality enhanced pseudo-labels without introducing extra parameters. Additionally, we propose a data-dependent frequency prompt to more effectively translate target domain images into a source-like style. To further enhance adaptation, we employ a style-related layer fine-tuning strategy, specifically designed for SFDA, to train the target model using the prompted target domain images and pseudo-labels. 
Extensive experiments on cross-modality abdominal and cardiac SFDA segmentation tasks demonstrate that our proposed method outperforms existing state-of-the-art methods.

[9] VRU-CIPI: Crossing Intention Prediction at Intersections for Improving Vulnerable Road Users Safety

Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Quoc Dai Tran

Main category: cs.CV

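TL;DR: The VRU-CIPI framework predicts the crossing intentions of vulnerable road users (VRUs) with a GRU and Transformer self-attention, reaching 96.45% accuracy at a real-time inference speed of 33 FPS, and uses I2V communication to improve intersection safety.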

Details

Motivation: Understanding and predicting the crossing intentions of VRUs at intersections is essential for safer interaction between road users, since misjudgment can lead to dangerous conflicts. Method: The VRU-CIPI framework combines a GRU that captures temporal dynamics with a Transformer self-attention mechanism that encodes contextual and spatial dependencies. Result: 96.45% accuracy and a real-time inference speed of 33 FPS on the UCF-VRU dataset. Conclusion: Combined with I2V communication, VRU-CIPI can activate crossing signals in advance and warn vehicles early, improving intersection safety. Abstract: Understanding and predicting human behavior in-the-wild, particularly at urban intersections, remains crucial for enhancing interaction safety between road users. Among the most critical behaviors are crossing intentions of Vulnerable Road Users (VRUs), where misinterpretation may result in dangerous conflicts with oncoming vehicles. In this work, we propose the VRU-CIPI framework with a sequential attention-based model designed to predict VRU crossing intentions at intersections. VRU-CIPI employs a Gated Recurrent Unit (GRU) to capture temporal dynamics in VRU movements, combined with a multi-head Transformer self-attention mechanism to encode contextual and spatial dependencies critical for predicting crossing direction. Evaluated on the UCF-VRU dataset, our proposed model achieves state-of-the-art performance with an accuracy of 96.45% and a real-time inference speed of 33 frames per second. Furthermore, by integrating with Infrastructure-to-Vehicles (I2V) communication, our approach can proactively enhance intersection safety through timely activation of crossing signals and early warnings to connected vehicles, ensuring smoother and safer interactions for all road users.

[10] Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset

Zhe Shan,Lei Zhou,Liu Mao,Shaofan Chen,Chuanqiu Ren,Xia Xie

Main category: cs.CV

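TL;DR: This paper proposes a new remote sensing change detection task, non-registration change detection, to handle emergencies such as natural disasters, anthropogenic accidents, and military strikes. It systematically analyzes eight real-world scenarios, develops scenario-specific image transformation schemes, and shows that non-registration severely degrades existing methods.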

Details

Motivation: To address non-registration change detection in remote sensing imagery during emergencies such as natural disasters, anthropogenic accidents, and military strikes, filling a gap in the literature. Method: Eight real-world scenarios are systematically proposed, and scenario-specific image transformation schemes convert an existing registered change detection dataset into a non-registered version. Result: Non-registration change detection severely degrades existing state-of-the-art methods. Conclusion: Non-registration change detection is an important and challenging task, and existing methods need further improvement to meet practical needs. Abstract: In this study, we propose a novel remote sensing change detection task, non-registration change detection, to address the increasing number of emergencies such as natural disasters, anthropogenic accidents, and military strikes. First, in light of the limited discourse on the issue of non-registration change detection, we systematically propose eight scenarios that could arise in the real world and potentially contribute to the occurrence of non-registration problems. Second, we develop distinct image transformation schemes tailored to various scenarios to convert the available registration change detection dataset into a non-registration version. Finally, we demonstrate that non-registration change detection can cause catastrophic damage to the state-of-the-art methods. Our code and dataset are available at https://github.com/ShanZard/NRCD.
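
Converting a registered image pair into a non-registered one can be pictured with a very small sketch. The abstract does not list the eight scenarios or their transformations, so the pure translation used here, along with the helper names `translate` and `make_unregistered_pair`, is only an illustrative assumption and not the authors' scheme.

```python
def translate(image, dy, dx, fill=0):
    """Shift a 2D image (list of rows) by (dy, dx), padding with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            # Pixels shifted outside the frame are dropped.
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out


def make_unregistered_pair(before, after, dy, dx):
    """Keep the first acquisition fixed and misalign the second one."""
    return before, translate(after, dy, dx)


before = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
after = [[1, 2, 3],
         [4, 0, 6],   # the centre pixel is the only real change
         [7, 8, 9]]
ref, shifted = make_unregistered_pair(before, after, dy=1, dx=0)
```

After the one-pixel shift, a naive pixel-wise comparison of `ref` and `shifted` flags almost every location as changed even though only the centre pixel differs, which gives some intuition for why misalignment is so damaging to methods that assume registration.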

[11] CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection

Jiakun Deng,Kexuan Li,Xingye Cui,Jiaxuan Li,Chang Long,Tian Pu,Zhenming Peng

Main category: cs.CV

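TL;DR: Proposes CSPENet, a contour-aware and saliency priors embedding network for infrared small target detection that addresses the weaknesses of existing methods in target localization and contour perception under dense clutter.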

Details

Motivation: Existing infrared small target detection methods struggle to localize dim targets and perceive contour information, which limits detection performance. Method: An SCPEM module captures the gradient characteristics of target contour pixels and extracts a saliency prior and multi-scale structural priors; a DBPEA architecture embeds the priors; and an AGFEM module refines the feature representations. Result: CSPENet outperforms other state-of-the-art methods on several public datasets. Conclusion: By combining contour awareness with saliency priors, CSPENet markedly improves infrared small target detection. Abstract: Infrared small target detection (ISTD) plays a critical role in a wide range of civilian and military applications. Existing methods suffer from deficiencies in the localization of dim targets and the perception of contour information under dense clutter environments, severely limiting their detection performance. To tackle these issues, we propose a contour-aware and saliency priors embedding network (CSPENet) for ISTD. We first design a surround-convergent prior extraction module (SCPEM) that effectively captures the intrinsic characteristic of target contour pixel gradients converging toward their center. This module concurrently extracts two collaborative priors: a boosted saliency prior for accurate target localization and multi-scale structural priors for comprehensively enriching contour detail representation. Building upon this, we propose a dual-branch priors embedding architecture (DBPEA) that establishes differentiated feature fusion pathways, embedding these two priors at optimal network positions to achieve performance enhancement. Finally, we develop an attention-guided feature enhancement module (AGFEM) to refine feature representations and improve saliency estimation accuracy. Experimental results on public datasets NUDT-SIRST, IRSTD-1k, and NUAA-SIRST demonstrate that our CSPENet outperforms other state-of-the-art methods in detection performance. The code is available at https://github.com/IDIP2025/CSPENet.

Hao Yang,Tao Tan,Shuai Tan,Weiqin Yang,Kunyan Cai,Calvin Chen,Yue Sun

Main category: cs.CV

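TL;DR: MambaControl is a new framework that combines selective state-space modelling with diffusion processes for high-fidelity prediction of medical image trajectories, improving Alzheimer's disease prediction.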

Details

Motivation: Existing methods fall short in capturing longitudinal dependencies and structural consistency, so a method is needed that handles complex spatio-temporal dynamics while preserving anatomical integrity. Method: MambaControl integrates Mamba-based long-range modelling with graph-guided anatomical control and introduces Fourier-enhanced spectral graph representations to capture spatial coherence and multiscale detail. Result: Quantitative and regional evaluations show strong disease progression prediction and anatomical fidelity. Conclusion: MambaControl has potential for personalised prognosis and clinical decision support. Abstract: Modelling disease progression in precision medicine requires capturing complex spatio-temporal dynamics while preserving anatomical integrity. Existing methods often struggle with longitudinal dependencies and structural consistency in progressive disorders. To address these limitations, we introduce MambaControl, a novel framework that integrates selective state-space modelling with diffusion processes for high-fidelity prediction of medical image trajectories. To better capture subtle structural changes over time while maintaining anatomical consistency, MambaControl combines Mamba-based long-range modelling with graph-guided anatomical control to more effectively represent anatomical correlations. Furthermore, we introduce Fourier-enhanced spectral graph representations to capture spatial coherence and multiscale detail, enabling MambaControl to achieve state-of-the-art performance in Alzheimer's disease prediction. Quantitative and regional evaluations demonstrate improved progression prediction quality and anatomical fidelity, highlighting its potential for personalised prognosis and clinical decision support.

[13] TKFNet: Learning Texture Key Factor Driven Feature for Facial Expression Recognition

Liqian Deng

Main category: cs.CV

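TL;DR: This paper proposes a new framework built on Texture Key Driver Factors (TKDF) that uses a Texture-Aware Feature Extractor (TAFE) and Dual Contextual Information Filtering (DCIF) to improve facial expression recognition (FER) in the wild.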

Details

Motivation: Facial expression recognition in the wild is challenging because expression-related features are subtle and localized and facial appearance varies in complex ways. Method: The TKDF framework combines TAFE (a ResNet-based feature extractor with multi-branch attention) and DCIF (context filtering through adaptive pooling and attention mechanisms). Result: State-of-the-art performance on the RAF-DB and KDEF datasets. Conclusion: Incorporating TKDF markedly improves the accuracy and robustness of FER. Abstract: Facial expression recognition (FER) in the wild remains a challenging task due to the subtle and localized nature of expression-related features, as well as the complex variations in facial appearance. In this paper, we introduce a novel framework that explicitly focuses on Texture Key Driver Factors (TKDF), localized texture regions that exhibit strong discriminative power across emotional categories. By carefully observing facial image patterns, we identify that certain texture cues, such as micro-changes in skin around the brows, eyes, and mouth, serve as primary indicators of emotional dynamics. To effectively capture and leverage these cues, we propose a FER architecture comprising a Texture-Aware Feature Extractor (TAFE) and Dual Contextual Information Filtering (DCIF). TAFE employs a ResNet-based backbone enhanced with multi-branch attention to extract fine-grained texture representations, while DCIF refines these features by filtering context through adaptive pooling and attention mechanisms. Experimental results on RAF-DB and KDEF datasets demonstrate that our method achieves state-of-the-art performance, verifying the effectiveness and robustness of incorporating TKDFs into FER pipelines.

[14] APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds

Yuan Gao,Shaobo Xia,Sheng Nie,Cheng Wang,Xiaohuan Xi,Bisheng Yang

Main category: cs.CV

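TL;DR: APCoTTA is a continual test-time adaptation method for semantic segmentation of ALS point clouds. It combines dynamic trainable layer selection, an entropy-based consistency loss, and random parameter interpolation to tackle domain shift and catastrophic forgetting, and performs well on newly constructed benchmarks.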

Details

Motivation: In real-world use, ALS point cloud segmentation models often degrade when the environment or sensor changes; existing CTTA research on point clouds is limited, lacks standardized datasets, and faces catastrophic forgetting and error accumulation. Method: A dynamic trainable layer selection module, an entropy-based consistency loss, and a random parameter interpolation mechanism balance target adaptation and source knowledge retention. Result: On the two new benchmarks, ISPRSC and H3DC, APCoTTA improves mIoU by approximately 9% and 14%, respectively, over direct inference. Conclusion: APCoTTA effectively addresses CTTA for ALS point cloud segmentation; the new benchmarks and code are open-sourced. Abstract: Airborne laser scanning (ALS) point cloud segmentation is a fundamental task for large-scale 3D scene understanding. In real-world applications, models are typically fixed after training. However, domain shifts caused by changes in the environment, sensor types, or sensor degradation often lead to a decline in model performance. Continuous Test-Time Adaptation (CTTA) offers a solution by adapting a source-pretrained model to evolving, unlabeled target domains. Despite its potential, research on ALS point clouds remains limited, facing challenges such as the absence of standardized datasets and the risk of catastrophic forgetting and error accumulation during prolonged adaptation. To tackle these challenges, we propose APCoTTA, the first CTTA method tailored for ALS point cloud semantic segmentation. We propose a dynamic trainable layer selection module. This module utilizes gradient information to select low-confidence layers for training, and the remaining layers are kept frozen, mitigating catastrophic forgetting. To further reduce error accumulation, we propose an entropy-based consistency loss. By filtering out unreliable samples based on entropy, we apply the consistency loss only to the reliable samples, enhancing model stability. In addition, we propose a random parameter interpolation mechanism, which randomly blends parameters from the selected trainable layers with those of the source model. This approach helps balance target adaptation and source knowledge retention, further alleviating forgetting. Finally, we construct two benchmarks, ISPRSC and H3DC, to address the lack of CTTA benchmarks for ALS point cloud segmentation. 
Experimental results demonstrate that APCoTTA achieves the best performance on two benchmarks, with mIoU improvements of approximately 9% and 14% over direct inference. The new benchmarks and code are available at https://github.com/Gaoyuan2/APCoTTA.
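
Two of the mechanisms described in the APCoTTA abstract, entropy-based sample filtering and random blending of adapted parameters with source parameters, can be sketched in a few lines. The abstract gives no formulas, so the entropy threshold, the per-parameter blend probability, the mixing weight `alpha`, and the function names are assumptions chosen for illustration; parameters are plain lists of floats rather than network tensors.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def reliable_indices(batch_probs, max_entropy):
    """Indices of predictions confident enough to receive the consistency loss."""
    return [i for i, probs in enumerate(batch_probs)
            if entropy(probs) <= max_entropy]


def random_interpolate(adapted, source, blend_prob, alpha, rng):
    """With probability blend_prob per parameter, pull the adapted value
    back toward the source value: alpha * source + (1 - alpha) * adapted."""
    blended = []
    for a, s in zip(adapted, source):
        if rng.random() < blend_prob:
            blended.append(alpha * s + (1 - alpha) * a)
        else:
            blended.append(a)
    return blended


# One confident and one ambiguous prediction.
batch = [[0.9, 0.1], [0.5, 0.5]]
kept = reliable_indices(batch, max_entropy=0.5)

rng = random.Random(0)
mixed = random_interpolate([1.0, 2.0, 3.0], [0.0, 0.0, 0.0],
                           blend_prob=0.5, alpha=0.5, rng=rng)
```

Only the confident prediction passes the filter, and each blended parameter either keeps its adapted value or is moved halfway back toward the source value, which is one simple way to trade off adaptation against forgetting.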

[15] High Quality Underwater Image Compression with Adaptive Correction and Codebook-based Augmentation

Yimin Zhou,Yichong Xia,Sicheng Pan,Bin Chen,Baoyi An,Haoqian Wang,Zhi Wang,Yaowei Wang,Zikun Zhou

Main category: cs.CV

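TL;DR: HQUIC is an underwater image compression algorithm that adaptively predicts attenuation coefficients and global light information and dynamically weights multi-scale frequency components, markedly improving compression efficiency.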

Details

Motivation: Existing underwater image compression algorithms do not fully exploit the distinctive characteristics of underwater scenes, which leads to suboptimal performance. Method: HQUIC uses an ALTC module to predict attenuation coefficients and global light information, a codebook-based auxiliary branch to extract common objects, and dynamic weighting of multi-scale frequency components. Result: Evaluations on a range of underwater datasets show that HQUIC outperforms existing compression methods. Conclusion: Through its targeted design, HQUIC markedly improves underwater image compression. Abstract: With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terrestrial images, resulting in suboptimal performance. To address this limitation, we introduce HQUIC, designed to exploit underwater-image-specific features for enhanced compression efficiency. HQUIC employs an ALTC module to adaptively predict the attenuation coefficients and global light information of the images, which effectively mitigates the issues caused by the differences in lighting and tone existing in underwater images. Subsequently, HQUIC employs a codebook as an auxiliary branch to extract the common objects within underwater images and enhances the performance of the main branch. Furthermore, HQUIC dynamically weights multi-scale frequency components, prioritizing information critical for distortion quality while discarding redundant details. Extensive evaluations on diverse underwater datasets demonstrate that HQUIC outperforms state-of-the-art compression methods.

[16] PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

Long Cheng,Jiafei Duan,Yi Ru Wang,Haoquan Fang,Boyang Li,Yushan Huang,Elvis Wang,Ainaz Eftekhar,Jason Lee,Wentao Yuan,Rose Hendrix,Noah A. Smith,Fei Xia,Dieter Fox,Ranjay Krishna

Main category: cs.CV

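TL;DR: PointArena is a platform for evaluating multimodal pointing that comprises a dataset, an interactive arena, and a robotic system; in its tests, Molmo-72B performs best.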

Details

Motivation: Existing benchmarks focus only on object localization tasks and do not evaluate diverse pointing and reasoning scenarios. Method: PointArena comprises the Point-Bench dataset, the Point-Battle interactive arena, and the Point-Act robotic system, which together form a multi-stage evaluation. Result: Molmo-72B performs best, proprietary models come close, and supervised training targeted at pointing tasks significantly improves performance. Conclusion: Precise pointing is essential for multimodal models to connect abstract reasoning with real-world action. Abstract: Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset containing approximately 1,000 pointing tasks across five reasoning categories; (2) Point-Battle, an interactive, web-based arena facilitating blind, pairwise model comparisons, which has already gathered over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system allowing users to directly evaluate multimodal model pointing capabilities in practical settings. We conducted extensive evaluations of both state-of-the-art open-source and proprietary multimodal models. Results indicate that Molmo-72B consistently outperforms other models, though proprietary models increasingly demonstrate comparable performance. Additionally, we find that supervised training specifically targeting pointing tasks significantly enhances model performance. Across our multi-stage evaluation pipeline, we also observe strong correlations, underscoring the critical role of precise pointing capabilities in enabling multimodal models to effectively bridge abstract reasoning with concrete, real-world actions. Project page: https://pointarena.github.io/

[17] Descriptive Image-Text Matching with Graded Contextual Similarity

Jinhyun Jang,Jiyeong Lee,Kwanghoon Sohn

Main category: cs.CV

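TL;DR: The paper proposes descriptive image-text matching (DITM), which learns graded contextual similarity between images and text by exploring the descriptive flexibility of language, overcoming the limits of the sparse binary supervision used by existing methods.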

Details

Motivation: Existing methods use sparse binary supervision, which ignores the many-to-many correspondences between images and text and the implicit connections from general to specific descriptions. Method: DITM uses a sentence descriptiveness score (based on TF-IDF) to dynamically relax the connectivity between positive and negative pairs and to align relevant sentences in a generic-to-specific order. Result: Experiments on the MS-COCO, Flickr30K, and CxC datasets validate DITM, which also improves the model's hierarchical reasoning ability. Conclusion: By moving beyond rigid binary supervision, DITM discovers optimal matches and potential positive pairs more effectively. Abstract: Image-text matching aims to build correspondences between visual and textual data by learning their pairwise similarities. Most existing approaches have adopted sparse binary supervision, indicating whether a pair of images and sentences matches or not. However, such sparse supervision covers a limited subset of image-text relationships, neglecting their inherent many-to-many correspondences; an image can be described in numerous texts at different descriptive levels. Moreover, existing approaches overlook the implicit connections from general to specific descriptions, which form the underlying rationale for the many-to-many relationships between vision and language. In this work, we propose descriptive image-text matching, called DITM, to learn the graded contextual similarity between image and text by exploring the descriptive flexibility of language. We formulate the descriptiveness score of each sentence with cumulative term frequency-inverse document frequency (TF-IDF) to balance the pairwise similarity according to the keywords in the sentence. Our method leverages sentence descriptiveness to learn robust image-text matching in two key ways: (1) to refine the false negative labeling, dynamically relaxing the connectivity between positive and negative pairs, and (2) to build more precise matching, aligning a set of relevant sentences in a generic-to-specific order. By moving beyond rigid binary supervision, DITM enhances the discovery of both optimal matches and potential positive pairs. Extensive experiments on MS-COCO, Flickr30K, and CxC datasets demonstrate the effectiveness of our method in representing complex image-text relationships compared to state-of-the-art approaches. 
In addition, DITM enhances the hierarchical reasoning ability of the model, supported by the extensive analysis on HierarCaps benchmark.
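
To make the idea of a cumulative TF-IDF descriptiveness score concrete, here is a minimal sketch. It is not the authors' formulation: the whitespace tokenizer, the smoothed IDF, and the function name `descriptiveness_scores` are all assumptions chosen for illustration. Each sentence is treated as one document and its score is the sum of TF-IDF weights over its terms, so sentences with more and rarer keywords score higher.

```python
import math
from collections import Counter


def descriptiveness_scores(sentences):
    """Return one cumulative TF-IDF score per sentence (hypothetical helper)."""
    docs = [s.lower().split() for s in sentences]  # naive whitespace tokenizer
    n_docs = len(docs)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        total = 0.0
        for term, count in Counter(doc).items():
            # Smoothed IDF (an assumption, not taken from the paper).
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            total += count * idf  # accumulate TF * IDF over the sentence
        scores.append(total)
    return scores


captions = ["a dog", "a dog on grass", "a brown dog runs on wet grass"]
scores = descriptiveness_scores(captions)
```

In this toy example the scores increase from the generic caption to the specific one, which is the ordering a generic-to-specific alignment would rely on.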

[18] From Air to Wear: Personalized 3D Digital Fashion with AR/VR Immersive 3D Sketching

Ying Zang,Yuanqi Hu,Xinyu Chen,Yuxia Xu,Suhui Wang,Chunan Yu,Lanyun Zhu,Deyi Ji,Xin Xu,Tianrun Chen

Main category: cs.CV

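TL;DR: Proposes a 3D sketch-driven framework for 3D garment generation that combines a conditional diffusion model with adaptive curriculum learning, letting ordinary users design high-quality digital garments in AR/VR environments.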

Details

Motivation: Existing 3D garment design tools have steep technical barriers and limited data, which keeps them out of reach for everyday users. Method: Combines a conditional diffusion model, a sketch encoder trained in a shared latent space, and an adaptive curriculum learning strategy to interpret free-hand input and generate personalized garments. Result: Experiments and user studies show that the method significantly outperforms existing baselines in fidelity and usability. Conclusion: The framework shows promise for democratized fashion design on next-generation consumer platforms. Abstract: In the era of immersive consumer electronics, such as AR/VR headsets and smart devices, people increasingly seek ways to express their identity through virtual fashion. However, existing 3D garment design tools remain inaccessible to everyday users due to steep technical barriers and limited data. In this work, we introduce a 3D sketch-driven 3D garment generation framework that empowers ordinary users - even those without design experience - to create high-quality digital clothing through simple 3D sketches in AR/VR environments. By combining a conditional diffusion model, a sketch encoder trained in a shared latent space, and an adaptive curriculum learning strategy, our system interprets imprecise, free-hand input and produces realistic, personalized garments. To address the scarcity of training data, we also introduce KO3DClothes, a new dataset of paired 3D garments and user-created sketches. Extensive experiments and user studies confirm that our method significantly outperforms existing baselines in both fidelity and usability, demonstrating its promise for democratized fashion design on next-generation consumer platforms.

[19] Application of YOLOv8 in monocular downward multiple Car Target detection

Shijie Lyu

Main category: cs.CV

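TL;DR: The paper proposes an improved autonomous target detection network based on YOLOv8 that uses structural reparameterization and a bidirectional pyramid structure network to significantly improve the efficiency and accuracy of detecting multi-scale, small, and remote objects.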

Details

Motivation: Current environmental perception approaches in autonomous driving (such as radar and cameras) suffer from high cost and vulnerability to weather and lighting conditions, and need improvement. Method: Integrates structural reparameterization, a bidirectional pyramid structure network model, and a novel detection pipeline into the YOLOv8 framework. Result: Experiments show that the improved model reaches a detection accuracy of 65% and performs well on multi-scale and small-object detection. Conclusion: The model has practical potential for autonomous driving competitions such as FSAC and is particularly suited to single-target and small-object detection scenarios. Abstract: Autonomous driving technology is progressively transforming traditional car driving methods, marking a significant milestone in modern transportation. Object detection serves as a cornerstone of autonomous systems, playing a vital role in enhancing driving safety, enabling autonomous functionality, improving traffic efficiency, and facilitating effective emergency responses. However, current technologies such as radar for environmental perception, cameras for road perception, and vehicle sensor networks face notable challenges, including high costs, vulnerability to weather and lighting conditions, and limited resolution. To address these limitations, this paper presents an improved autonomous target detection network based on YOLOv8. By integrating structural reparameterization technology, a bidirectional pyramid structure network model, and a novel detection pipeline into the YOLOv8 framework, the proposed approach achieves highly efficient and precise detection of multi-scale, small, and remote objects. Experimental results demonstrate that the enhanced model can effectively detect both large and small objects with a detection accuracy of 65%, showcasing significant advancements over traditional methods. This improved model holds substantial potential for real-world applications and is well-suited for autonomous driving competitions, such as the Formula Student Autonomous China (FSAC), particularly excelling in scenarios involving single-target and small-object detection.

[20] ORL-LDM: Offline Reinforcement Learning Guided Latent Diffusion Model Super-Resolution Reconstruction

Shijie Lyu

Main category: cs.CV

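TL;DR: The paper proposes a reinforcement learning-based fine-tuning method for latent diffusion models (LDMs) for remote sensing image super-resolution, significantly improving image quality.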

Details

Motivation: Existing deep learning methods are limited in handling complex scenes and preserving image details, so a more effective solution is needed. Method: Builds a reinforcement learning environment (states, actions, rewards) and uses proximal policy optimization (PPO) to optimize the decision objective during the LDM's reverse denoising process. Result: On the RESISC45 dataset, PSNR increases by 3-4 dB, SSIM improves by 0.08-0.11, and LPIPS decreases by 0.06-0.10, with the gains most pronounced in structured and complex natural scenes. Conclusion: The method effectively improves super-resolution quality and adaptability across scenes. Abstract: With the rapid advancement of remote sensing technology, super-resolution image reconstruction is of great research and practical significance. Existing deep learning methods have made progress but still face limitations in handling complex scenes and preserving image details. This paper proposes a reinforcement learning-based latent diffusion model (LDM) fine-tuning method for remote sensing image super-resolution. The method constructs a reinforcement learning environment with states, actions, and rewards, optimizing decision objectives through proximal policy optimization (PPO) during the reverse denoising process of the LDM model. Experiments on the RESISC45 dataset show significant improvements over the baseline model in PSNR, SSIM, and LPIPS, with PSNR increasing by 3-4 dB, SSIM improving by 0.08-0.11, and LPIPS reducing by 0.06-0.10, particularly in structured and complex natural scenes. The results demonstrate the method's effectiveness in enhancing super-resolution quality and adaptability across scenes.
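
For readers who want a feel for what a 3-4 dB PSNR gain means, the sketch below computes PSNR from its standard definition, 10·log10(MAX² / MSE). It is only an illustration on flat lists of pixel values; the function name `psnr` and the 8-bit default peak value are assumptions, and it says nothing about how the paper's evaluation was implemented.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0.0, 0.0]
coarse = [2.0, 2.0]  # MSE = 4
better = [2.0, 0.0]  # MSE = 2, i.e. half the error energy
gain_db = psnr(reference, better) - psnr(reference, coarse)
```

Halving the mean squared error raises PSNR by about 3 dB, so the reported 3-4 dB improvement corresponds to reducing the MSE to somewhat less than half of the baseline's.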

[21] DeepSeqCoco: A Robust Mobile Friendly Deep Learning Model for Detection of Diseases in Cocos nucifera

Miit Daga,Dhriti Parikh,Swarna Priya Ramu

Main category: cs.CV

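TL;DR: DeepSeqCoco is a deep learning model for automatic, accurate disease identification from coconut tree images; it reaches up to 99.5% accuracy, outperforms existing models, and substantially reduces training and prediction time.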

Details

Motivation: Coconut tree diseases seriously threaten agricultural yield, especially in developing countries, where conventional practices make early diagnosis and intervention difficult. Method: Uses the deep learning model DeepSeqCoco, tested under several optimizer settings (SGD, Adam, and hybrid configurations) to balance accuracy, loss minimization, and computational cost. Result: Accuracy reaches up to 99.5%; the hybrid SGD-Adam configuration gives the lowest validation loss (2.81%); training time drops by up to 18% and prediction time by up to 85%. Conclusion: DeepSeqCoco shows the potential of an AI-based, efficient, and scalable disease monitoring system to support precision agriculture. Abstract: Coconut tree diseases are a serious risk to agricultural yield, particularly in developing countries where conventional farming practices restrict early diagnosis and intervention. Current disease identification methods are manual, labor-intensive, and non-scalable. In response to these limitations, we propose DeepSeqCoco, a deep learning based model for accurate and automatic disease identification from coconut tree images. The model was tested under various optimizer settings, such as SGD, Adam, and hybrid configurations, to identify the optimal balance between accuracy, minimization of loss, and computational cost. Results from experiments indicate that DeepSeqCoco can achieve as much as 99.5% accuracy (achieving up to 5% higher accuracy than existing models) with the hybrid SGD-Adam showing the lowest validation loss of 2.81%. It also shows a drop of up to 18% in training time and up to 85% in prediction time for input images. The results point out the promise of the model to improve precision agriculture through an AI-based, scalable, and efficient disease monitoring system.

[22] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

Bingda Tang,Boyang Zheng,Xichen Pan,Sayak Paul,Saining Xie

Main category: cs.CV

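TL;DR: The paper explores the design space of deeply fusing LLMs and DiTs for text-to-image synthesis, filling gaps in prior work around detailed comparisons and the disclosure of training recipes.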

Details

Motivation: Prior studies mostly focus on overall system performance and neglect detailed comparisons with alternative methods and the disclosure of key design details, leaving the real potential of the approach uncertain. Method: An empirical study that performs controlled comparisons with established baselines, analyzes key design choices, and provides a reproducible recipe for training at scale. Result: Provides meaningful data points and practical guidelines that lay groundwork for future research in multi-modal generation. Conclusion: The study fills a research gap and offers clear direction and actionable guidance for further exploration of multi-modal generation. Abstract: This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.

[23] Advances in Radiance Field for Dynamic Scene: From Neural Field to Gaussian Field

Jinlong Fan,Xuepu Zeng,Jing Zhang,Mingming Gong,Yuxiang Yang,Dacheng Tao

Main category: cs.CV

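TL;DR: This survey reviews recent advances in dynamic scene representation and reconstruction, analyzing more than 200 papers that span implicit neural representations to explicit Gaussian primitives, and proposes a categorization and evaluation framework.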

Details

Motivation: Dynamic scene representation and reconstruction have advanced markedly thanks to neural radiance fields and 3D Gaussian splatting, but existing methods still need to be systematically organized and evaluated to guide future research. Method: Analyzes more than 200 papers, categorizing and evaluating them by motion representation paradigm, reconstruction technique, auxiliary information integration, and regularization approach. Result: Proposes a unified representational framework and summarizes the key techniques and challenges in dynamic scene reconstruction. Conclusion: The survey offers researchers a comprehensive reference and points out future research directions for dynamic scene reconstruction. Abstract: Dynamic scene representation and reconstruction have undergone transformative advances in recent years, catalyzed by breakthroughs in neural radiance fields and 3D Gaussian splatting techniques. While initially developed for static environments, these methodologies have rapidly evolved to address the complexities inherent in 4D dynamic scenes through an expansive body of research. Coupled with innovations in differentiable volumetric rendering, these approaches have significantly enhanced the quality of motion representation and dynamic scene reconstruction, thereby garnering substantial attention from the computer vision and graphics communities. This survey presents a systematic analysis of over 200 papers focused on dynamic scene representation using radiance fields, spanning the spectrum from implicit neural representations to explicit Gaussian primitives. We categorize and evaluate these works through multiple critical lenses: motion representation paradigms, reconstruction techniques for varied scene dynamics, auxiliary information integration strategies, and regularization approaches that ensure temporal consistency and physical plausibility. We organize diverse methodological approaches under a unified representational framework, concluding with a critical examination of persistent challenges and promising research directions. By providing this comprehensive overview, we aim to establish a definitive reference for researchers entering this rapidly evolving field while offering experienced practitioners a systematic understanding of both conceptual principles and practical frontiers in dynamic scene reconstruction.

[24] PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

Ijazul Haq,Yingjie Zhang,Irfan Ali Khan

Main category: cs.CV

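TL;DR: The paper evaluates large multimodal models (LMMs) on OCR for the low-resource Pashto language, develops the synthetic dataset PsOCR, and compares several open-source and closed-source models, finding that Gemini performs best.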

Details

Motivation: NLP for Pashto faces challenges such as its cursive script and a scarcity of datasets, which motivates both addressing these gaps and assessing how well LMMs handle OCR. Method: Develops PsOCR, a synthetic dataset of one million images covering many fonts and layouts, and tests several LMMs on it. Result: Gemini performs best among all models, and Qwen-7B stands out among the open-source models. Conclusion: The study lays a foundation for Pashto OCR and carries over to similar scripts such as Arabic and Persian. Abstract: This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at https://github.com/zirak-ai/PashtoOCR.

[25] ToonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars

Rui-Yang Ju,Sheng-Yen Huang,Yi-Ping Hung

Main category: cs.CV

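TL;DR: ToonifyGB is a two-stage framework for generating diverse stylized 3D head avatars from monocular video. Stage 1 generates a stylized video with an improved StyleGAN; Stage 2 synthesizes the stylized avatar using Gaussian blendshapes.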

Details

Motivation: Extend the Toonify framework to synthesize stylized 3D head avatars while removing the standard StyleGAN requirement of cropping aligned faces at a fixed resolution. Method: A two-stage approach: (1) generate a stylized video with an improved StyleGAN; (2) learn Gaussian blendshapes from that video to synthesize the stylized avatar. Result: Validated on two styles, Arcane and Pixar, showing that ToonifyGB can efficiently generate high-quality stylized animation. Conclusion: ToonifyGB extends Toonify to stylized 3D head avatars that can be rendered efficiently with arbitrary expressions, and shows potential across diverse styles. Abstract: The introduction of 3D Gaussian blendshapes has enabled the real-time reconstruction of animatable head avatars from monocular video. Toonify, a StyleGAN-based framework, has become widely used for facial image stylization. To extend Toonify for synthesizing diverse stylized 3D head avatars using Gaussian blendshapes, we propose an efficient two-stage framework, ToonifyGB. In Stage 1 (stylized video generation), we employ an improved StyleGAN to generate the stylized video from the input video frames, which addresses the limitation of cropping aligned faces at a fixed resolution as preprocessing for normal StyleGAN. This process provides a more stable video, which enables Gaussian blendshapes to better capture the high-frequency details of the video frames, and efficiently generate high-quality animation in the next stage. In Stage 2 (Gaussian blendshapes synthesis), we learn a stylized neutral head model and a set of expression blendshapes from the generated video. By combining the neutral head model with expression blendshapes, ToonifyGB can efficiently render stylized avatars with arbitrary expressions. We validate the effectiveness of ToonifyGB on the benchmark dataset using two styles: Arcane and Pixar.

[26] MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

Yuncheng Guo,Xiaodong Gu

Main category: cs.CV

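TL;DR: MMRL and MMRL++ address overfitting in few-shot learning through a shared modality-agnostic representation space and optimized representation tokens, improving cross-modal interaction and generalization.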

Details

Motivation: Large-scale pre-trained vision-language models tend to overfit in few-shot learning and generalize poorly. Method: Proposes MMRL, which introduces a shared modality-agnostic representation space, jointly optimizes representation tokens and class tokens, and adds a regularization term; MMRL++ extends it by reducing trainable parameters and enhancing intra-modal interactions. Result: Outperforms existing methods on 15 datasets while balancing task adaptation and generalization. Conclusion: MMRL and MMRL++ effectively mitigate overfitting in few-shot learning and improve performance. Abstract: Large-scale pre-trained Vision-Language Models (VLMs) have significantly advanced transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, undermining their ability to generalize to new tasks. To address this, we propose Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, modality-agnostic representation space. MMRL generates space tokens projected into both text and image encoders as representation tokens, enabling more effective cross-modal interactions. Unlike prior methods that mainly optimize class token features, MMRL inserts representation tokens into higher encoder layers--where task-specific features are more prominent--while preserving general knowledge in the lower layers. During training, both class and representation features are jointly optimized: a trainable projection layer is applied to representation tokens for task adaptation, while the projection layer for class token remains frozen to retain pre-trained knowledge. To further promote generalization, we introduce a regularization term aligning class and text features with the frozen VLM's zero-shot features. At inference, a decoupling strategy uses both class and representation features for base tasks, but only class features for novel tasks due to their stronger generalization. Building upon this, we propose MMRL++, a parameter-efficient and interaction-aware extension that significantly reduces trainable parameters and enhances intra-modal interactions--particularly across the layers of representation tokens--allowing gradient sharing and instance-specific information to propagate more effectively through the network. Extensive experiments on 15 datasets demonstrate that MMRL and MMRL++ consistently outperform state-of-the-art methods, achieving a strong balance between task-specific adaptation and generalization.

[27] Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

Yangfu Li,Hongjian Zhan,Tianyi Chen,Qi Liu,Yue Lu

Main category: cs.CV

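TL;DR: Proposes Multi-Objective Balanced Covering (MoB), which dynamically balances the objectives in visual token pruning to significantly improve performance and efficiency.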

Details

Motivation: Existing visual token pruning methods use static strategies and ignore how the relative importance of the objectives varies across tasks, which leads to inconsistent performance. Method: Derives an error bound based on the Hausdorff distance, uses $\epsilon$-covering theory to reveal the intrinsic trade-off between the objectives, and proposes the MoB framework, which recasts pruning as a bi-objective covering problem. Result: MoB preserves 96.4% of performance on LLaVA-1.5-7B using only 11.1% of the visual tokens, accelerates LLaVA-Next-7B by 1.3-1.5$\times$, and applies to a range of tasks and models. Conclusion: By balancing the objectives dynamically, MoB achieves efficient and consistent visual token pruning that carries over to challenging scenarios and advanced models. Abstract: Existing visual token pruning methods target prompt alignment and visual preservation with static strategies, overlooking the varying relative importance of these objectives across tasks, which leads to inconsistent performance. To address this, we derive the first closed-form error bound for visual token pruning based on the Hausdorff distance, uniformly characterizing the contributions of both objectives. Moreover, leveraging $\epsilon$-covering theory, we reveal an intrinsic trade-off between these objectives and quantify their optimal attainment levels under a fixed budget. To practically handle this trade-off, we propose Multi-Objective Balanced Covering (MoB), which reformulates visual token pruning as a bi-objective covering problem. In this framework, the attainment trade-off reduces to budget allocation via greedy radius trading. MoB offers a provable performance bound and linear scalability with respect to the number of input visual tokens, enabling adaptation to challenging pruning scenarios. Extensive experiments show that MoB preserves 96.4% of performance for LLaVA-1.5-7B using only 11.1% of the original visual tokens and accelerates LLaVA-Next-7B by 1.3-1.5$\times$ with negligible performance loss. Additionally, evaluations on Qwen2-VL and Video-LLaVA confirm that MoB integrates seamlessly into advanced MLLMs and diverse vision-language tasks.
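
The covering view of pruning can be illustrated with a generic sketch. This is explicitly not MoB's "greedy radius trading": it is the classic farthest-point (greedy k-center) heuristic over a single point set, with hypothetical helper names, shown only to make the terms "covering radius" and "Hausdorff distance" tangible. Because the kept tokens are a subset of the originals, the Hausdorff distance between the two sets equals the largest distance from any original token to its nearest kept token.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_k_center(points, budget):
    """Keep `budget` point indices via farthest-point selection."""
    if budget < 1 or budget > len(points):
        raise ValueError("budget must be between 1 and len(points)")
    selected = [0]  # arbitrary starting token
    # Distance from every point to its nearest selected point so far.
    dist = [euclidean(p, points[0]) for p in points]
    while len(selected) < budget:
        nxt = max(range(len(points)), key=dist.__getitem__)
        selected.append(nxt)
        dist = [min(d, euclidean(p, points[nxt])) for d, p in zip(dist, points)]
    return selected


def covering_radius(points, selected):
    """Largest distance from any point to its nearest kept point."""
    return max(min(euclidean(p, points[i]) for i in selected) for p in points)


tokens = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
kept = greedy_k_center(tokens, budget=2)
radius = covering_radius(tokens, kept)
```

With two clusters and a budget of two, the heuristic keeps one token per cluster and the covering radius drops from 11 (budget of one) to 1. A bi-objective method would have to split the budget between two such covering problems, one per objective.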

[28] IMITATE: Image Registration with Context for unknown time frame recovery

Ziad Kheil,Lucas Robinet,Laurent Risser,Soleakhena Ken

Main category: cs.CV

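TL;DR: The paper proposes a new image registration formalism that uses a conditional U-Net architecture to estimate images under unknown conditions, addressing tumor motion interpolation in 4D-CT scans for radiotherapy.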

Details

Motivation: Address 4D-CT reconstruction artefacts in radiotherapy caused by irregular breathing, hysteresis, and poor correlation between the breathing signal and internal motion. Method: Uses a conditional U-Net architecture that needs no fixed image and models the target directly from known images and their condition information. Result: Produces artefact-free volumes at real-time latencies on clinical 4D-CT data. Conclusion: The method handles image registration under complex motion conditions effectively, and the code is open source. Abstract: In this paper, we formulate a novel image registration formalism dedicated to the estimation of unknown condition-related images, based on two or more known images and their associated conditions. We show how to practically model this formalism by using a new conditional U-Net architecture, which fully takes into account the conditional information and does not need any fixed image. Our formalism is then applied to image moving tumors for radiotherapy treatment at different breathing amplitudes using 4D-CT (3D+t) scans in thoracoabdominal regions. This driving application is particularly complex as it requires stitching a collection of sequential 2D slices into several 3D volumes at different organ positions. Movement interpolation with standard methods then generates well-known reconstruction artefacts in the assembled volumes due to irregular patient breathing, hysteresis and poor correlation of breathing signal to internal motion. Results obtained on 4D-CT clinical data showcase artefact-free volumes achieved through real-time latencies. The code is publicly available at https://github.com/Kheil-Z/IMITATE.

[29] Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization

Yikang Wei

Main category: cs.CV

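TL;DR: Proposes a Multi-source Collaborative Style Augmentation and Domain-invariant learning method (MCSAD) for federated domain generalization, improving generalization by broadening the style space and aligning features across domains.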

Details

Motivation: Under data decentralization, existing style augmentation methods cover a limited style space and cannot fully exploit information from multiple source domains. Method: Proposes a multi-source collaborative style augmentation module that generates data in a broader style space, and learns a domain-invariant model through cross-domain feature alignment and classes relation ensemble distillation. Result: Significantly outperforms existing federated domain generalization methods on multiple domain generalization datasets. Conclusion: Through collaborative style augmentation and domain-invariant learning, MCSAD improves generalization to unseen target domains. Abstract: Federated domain generalization aims to learn a generalizable model from multiple decentralized source domains for deploying on the unseen target domain. The style augmentation methods have achieved great progress on domain generalization. However, the existing style augmentation methods either explore the data styles within isolated source domain or interpolate the style information across existing source domains under the data decentralization scenario, which leads to limited style space. To address this issue, we propose a Multi-source Collaborative Style Augmentation and Domain-invariant learning method (MCSAD) for federated domain generalization. Specifically, we propose a multi-source collaborative style augmentation module to generate data in the broader style space. Furthermore, we conduct domain-invariant learning between the original data and augmented data by cross-domain feature alignment within the same class and classes relation ensemble distillation between different classes to learn a domain-invariant model. By alternately conducting collaborative style augmentation and domain-invariant learning, the model can generalize well on unseen target domain. Extensive experiments on multiple domain generalization datasets indicate that our method significantly outperforms the state-of-the-art federated domain generalization methods.

[30] Modeling Saliency Dataset Bias

Matthias Kümmerer,Harneet Khanuja,Matthias Bethge

Main category: cs.CV

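TL;DR: The paper proposes a new architecture that tackles cross-dataset generalization in saliency prediction with a small number of dataset-specific parameters, significantly improving performance.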

Details

Motivation: Existing saliency prediction models lose substantial performance (around 40%) when applied across datasets, largely because of dataset bias, and increasing dataset diversity does not fix this. Method: Extends a mostly dataset-agnostic encoder-decoder structure with fewer than 20 dataset-specific parameters that govern mechanisms such as multi-scale structure and center bias. Result: Sets a new state of the art on the MIT/Tuebingen Saliency Benchmark; adapting only the dataset-specific parameters accounts for more than 75% of the generalization gap, with much of the improvement achieved with as few as 50 samples. Conclusion: The model improves cross-dataset generalization and also reveals complex multi-scale effects in spatial saliency. Abstract: Recent advances in image-based saliency prediction are approaching gold standard performance levels on existing benchmarks. Despite this success, we show that predicting fixations across multiple saliency datasets remains challenging due to dataset bias. We find a significant performance drop (around 40%) when models trained on one dataset are applied to another. Surprisingly, increasing dataset diversity does not resolve this inter-dataset gap, with close to 60% attributed to dataset-specific biases. To address this remaining generalization gap, we propose a novel architecture extending a mostly dataset-agnostic encoder-decoder structure with fewer than 20 dataset-specific parameters that govern interpretable mechanisms such as multi-scale structure, center bias, and fixation spread. Adapting only these parameters to new data accounts for more than 75% of the generalization gap, with a large fraction of the improvement achieved with as few as 50 samples. Our model sets a new state-of-the-art on all three datasets of the MIT/Tuebingen Saliency Benchmark (MIT300, CAT2000, and COCO-Freeview), even when purely generalizing from unrelated datasets, but with a substantial boost when adapting to the respective training datasets. The model also provides valuable insights into spatial saliency properties, revealing complex multi-scale effects that combine both absolute and relative sizes.

[31] VolE: A Point-cloud Framework for Food 3D Reconstruction and Volume Estimation

Umair Haroon,Ahmad AlMughrabi,Thanasis Zoumpekas,Ricardo Marques,Petia Radeva

Main category: cs.CV

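TL;DR: VolE is a mobile device-driven 3D reconstruction framework for food volume estimation that needs no reference object or depth information and outperforms existing methods.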

Details

Motivation: Current food volume estimation methods are limited by monocular data, dedicated hardware, or reliance on a reference object, and fall short of what medical nutrition management and health monitoring require. Method: VolE uses AR-capable mobile devices to capture images and camera locations, and generates food masks through food video segmentation to achieve reference-free and depth-free 3D reconstruction. Result: Experiments show that VolE performs well across multiple datasets, with a mean absolute percentage error (MAPE) of 2.22%. Conclusion: VolE offers an efficient and accurate approach to food volume estimation suited to real-world use. Abstract: Accurate food volume estimation is crucial for medical nutrition management and health monitoring applications, but current food volume estimation methods are often limited by monocular data, leveraging single-purpose hardware such as 3D scanners, gathering sensor-oriented information such as depth information, or relying on camera calibration using a reference object. In this paper, we present VolE, a novel framework that leverages mobile device-driven 3D reconstruction to estimate food volume. VolE captures images and camera locations in free motion to generate precise 3D models, thanks to AR-capable mobile devices. To achieve real-world measurement, VolE is a reference- and depth-free framework that leverages food video segmentation for food mask generation. We also introduce a new food dataset encompassing the challenging scenarios absent in the previous benchmarks. Our experiments demonstrate that VolE outperforms the existing volume estimation techniques across multiple datasets by achieving 2.22% MAPE, highlighting its superior performance in food volume estimation.
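
The headline metric here is MAPE, which is simple enough to state in code. The sketch below uses the standard definition (the mean of |true - predicted| / |true|, expressed as a percentage); the function name `mape` and the sample volumes are made up for illustration and are not data from the paper.

```python
def mape(true_values, predicted_values):
    """Mean absolute percentage error, in percent."""
    if len(true_values) != len(predicted_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in true_values):
        raise ValueError("MAPE is undefined when a true value is zero")
    errors = [abs((t - p) / t) for t, p in zip(true_values, predicted_values)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical volumes in millilitres, not taken from the paper.
true_ml = [100.0, 200.0]
estimated_ml = [98.0, 204.0]
error_percent = mape(true_ml, estimated_ml)
```

Both estimates in this toy case are off by 2% of the true volume, so the MAPE is 2.0, which is in the same range as the 2.22% reported for VolE.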

[32] Data-Agnostic Augmentations for Unknown Variations: Out-of-Distribution Generalisation in MRI Segmentation

Puru Vaish,Felix Meister,Tobias Heimann,Christoph Brune,Jelmer M. Wolterink

Main category: cs.CV

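TL;DR: The paper examines the performance degradation of medical image segmentation models in real clinical settings and evaluates two data augmentation methods, MixUp and Auxiliary Fourier Augmentation, which significantly improve generalization and robustness.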

Details

Motivation: Medical image segmentation models degrade in real clinical settings because training and test distributions do not match, and traditional data augmentation helps only to a limited extent. Method: Systematically evaluates two augmentation strategies, MixUp and Auxiliary Fourier Augmentation, and integrates them into the nnU-Net training pipeline. Result: Experiments show that these methods significantly improve robustness to distribution shift, validated on cardiac and prostate MRI segmentation. Conclusion: MixUp and Auxiliary Fourier Augmentation are easy-to-implement, effective ways to improve the reliability of medical segmentation models in real-world use. Abstract: Medical image segmentation models are often trained on curated datasets, leading to performance degradation when deployed in real-world clinical settings due to mismatches between training and test distributions. While data augmentation techniques are widely used to address these challenges, traditional visually consistent augmentation strategies lack the robustness needed for diverse real-world scenarios. In this work, we systematically evaluate alternative augmentation strategies, focusing on MixUp and Auxiliary Fourier Augmentation. These methods mitigate the effects of multiple variations without explicitly targeting specific sources of distribution shifts. We demonstrate how these techniques significantly improve out-of-distribution generalization and robustness to imaging variations across a wide range of transformations in cardiac cine MRI and prostate MRI segmentation. We quantitatively find that these augmentation methods enhance learned feature representations by promoting separability and compactness. Additionally, we highlight how their integration into nnU-Net training pipelines provides an easy-to-implement, effective solution for enhancing the reliability of medical segmentation models in real-world applications.
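
As a reminder of what MixUp does, here is a minimal sketch of the basic technique: two samples and their labels are blended with a weight drawn from a Beta(alpha, alpha) distribution. The flat-list representation, the default `alpha=0.2`, and the function name `mixup` are illustrative assumptions; the abstract does not describe how the authors wire MixUp into nnU-Net, and for segmentation the labels would be per-pixel one-hot maps rather than the short vectors used here.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two (input, soft label) pairs with a Beta-distributed weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


rng = random.Random(0)  # seeded generator for reproducibility
x_mix, y_mix, lam = mixup(
    [0.0, 1.0], [1.0, 0.0],  # sample A with one-hot label for class 0
    [1.0, 0.0], [0.0, 1.0],  # sample B with one-hot label for class 1
    rng=rng,
)
```

The mixed label stays a valid probability vector, and small values of `alpha` push the weight toward 0 or 1 so that most mixed samples remain close to one of the originals.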

[33] On the Interplay of Human-AI Alignment, Fairness, and Performance Trade-offs in Medical Imaging

Haozhe Luo,Ziyu Zhou,Zixin Shu,Aurélie Pahud de Mortanges,Robert Berke,Mauricio Reyes

Main category: cs.CV

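TL;DR: The paper explores incorporating human insights into medical imaging AI to reduce bias and improve fairness, finding that calibrated human-AI alignment significantly improves fairness and generalization.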

Details

Motivation: Address bias in medical imaging AI and examine how human-AI alignment affects fairness and generalization. Method: A systematic study of human-AI alignment in which human insights are used to adjust the model. Result: Human-AI alignment reduces fairness gaps and improves generalization, but excessive alignment can introduce performance trade-offs. Conclusion: Calibrated human-AI alignment is an effective strategy for developing fair, robust, and generalizable medical AI systems. Abstract: Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Our code is available at https://github.com/Roypic/Aligner.

[34] MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

Yanbo Ding

Main category: cs.CV

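TL;DR: MTVCrafter proposes a human image animation framework based on 4D motion sequences; using 4D motion tokens and a motion-aware video DiT, it significantly improves animation quality and generalization.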

Details

Motivation: Existing methods rely on 2D pose images, which limits generalization and discards 3D information; MTVCrafter models 3D motion sequences directly to address this. Method: Introduces 4DMoT to quantize 3D motion sequences into 4D motion tokens, and designs MV-DiT to use these tokens for animation generation. Result: Experiments show that MTVCrafter achieves an FID-VID of 6.98, surpassing the second-best method by 65%, and generalizes to diverse open-world characters. Conclusion: MTVCrafter opens a new direction for pose-guided human video generation. Abstract: Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for human image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatio-temporal cues and avoid strict pixel-level alignment between pose image and character, enabling more flexible and disentangled control. Then, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings, MV-DiT can effectively leverage motion tokens as 4D compact yet expressive context for human image animation in the complex 3D world. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided human video generation. Experiments show that our MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, surpassing the second-best by 65%. Powered by robust motion tokens, MTVCrafter also generalizes well to diverse open-world characters (single/multiple, full/half-body) across various styles and scenarios. Our video demos and code are provided in the supplementary material and at this anonymous GitHub link: https://anonymous.4open.science/r/MTVCrafter-1B13.

[35] ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization

Wenhao Shen,Wanqi Yin,Xiaofeng Yang,Cheng Chen,Chaoyue Song,Zhongang Cai,Lei Yang,Hao Wang,Guosheng Lin

Main category: cs.CV

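TL;DR: ADHMR proposes a single-image human mesh recovery method based on a diffusion model and preference optimization, using HMR-Scorer to assess prediction quality and optimize the model, which significantly improves performance.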

Details

Motivation: Existing probabilistic methods for single-image human mesh recovery are often misaligned with 2D observations and are not robust to in-the-wild images. Method: Trains HMR-Scorer to assess prediction quality, builds a preference dataset with it, and fine-tunes the base model with direct preference optimization. Result: ADHMR outperforms existing methods in experiments, and HMR-Scorer also improves other models. Conclusion: Through preference optimization, ADHMR significantly improves the accuracy and robustness of human mesh recovery. Abstract: Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns a Diffusion-based HMR model in a preference optimization manner. First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to finetune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods. Code is available at: https://github.com/shenwenhao01/ADHMR.
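
The winner/loser pairs feed a direct preference optimization objective. The sketch below shows the generic DPO loss for one pair, -log σ(β·[(log π(w) - log π_ref(w)) - (log π(l) - log π_ref(l))]), where w and l are the winner and loser predictions. It is an assumption-laden illustration: the abstract does not say how ADHMR obtains these log-probabilities for a diffusion model, and the function name `dpo_loss` and the default `beta=0.1` are hypothetical.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for a single (winner, loser) pair."""
    # How much more the policy prefers the winner than the reference does.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)           # policy equals reference
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)       # policy favors the winner
misaligned = dpo_loss(-3.0, -1.0, -2.0, -2.0)    # policy favors the loser
```

The loss equals log 2 when the policy matches the reference, falls below it when the policy shifts probability toward the winner, and rises above it in the opposite case, which is what drives fine-tuning toward the scorer's preferences.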

[36] Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot

Hao Lu,Jiaqi Tang,Jiyao Wang,Yunfan LU,Xu Cao,Qingyong Hu,Yin Wang,Yuting Zhang,Tianxin Xie,Yunpeng Zhang,Yong Chen,Jiayu. Gao,Bin Huang,Dengbo He,Shuiguang Deng,Hao Chen,Ying-Cong Chen

Main category: cs.CV

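TL;DR: This paper proposes SAGE DeeR, an intelligent driving cockpit agent with super-alignment, generalist, and self-eliciting capabilities, and validates its performance on a large-scale benchmark.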

Details

Motivation: An intelligent driving cockpit must meet users' comfort, interaction, and safety needs, which calls for a generalist agent that adapts to different user preferences and scenarios. Method: SAGE DeeR uses multi-view, multi-modal inputs to understand users' physiological indicators, emotions, and behavior, and elicits implicit chains of thought in the language space to strengthen its abilities. Result: SAGE DeeR achieves super alignment (personalized reactions), generality (multi-modal understanding), and self-eliciting (implicit chains of thought); a benchmark measures its perceptual decision-making ability and alignment accuracy. Conclusion: SAGE DeeR offers an efficient, personalized, and general solution for the intelligent driving cockpit, with potential for practical use. Abstract: The intelligent driving cockpit, an important part of intelligent driving, needs to match different users' comfort, interaction, and safety needs. This paper aims to build a Super-Aligned and GEneralist DRiving agent, SAGE DeeR. Sage Deer achieves three highlights: (1) Super alignment: It achieves different reactions according to different people's preferences and biases. (2) Generalist: It can understand the multi-view and multi-mode inputs to reason the user's physiological indicators, facial emotions, hand movements, body movements, driving scenarios, and behavioral decisions. (3) Self-Eliciting: It can elicit implicit thought chains in the language space to further increase generalist and super-aligned abilities. Besides, we collected multiple data sets and built a large-scale benchmark. This benchmark measures the deer's perceptual decision-making ability and the super alignment's accuracy.

[37] Inferring Driving Maps by Deep Learning-based Trail Map Extraction

Michael Hubbertz,Pascal Colling,Qi Han,Tobias Meisen

Main category: cs.CV

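TL;DR: Proposes a novel offline mapping approach that integrates trails (informal routes) and uses transformer-based deep learning models to achieve efficient, general map updates.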

Details

Motivation: HD maps are essential for planning in autonomous driving, but conventional online mapping faces problems such as temporal consistency and sensor occlusion. An offline mapping approach is proposed to improve performance and address these problems. Method: Aggregates trail data from the ego vehicle and other traffic participants and builds a global map with transformer models, supporting continuous updates without depending on specific sensors. Result: Validated on benchmark datasets, the method outperforms existing online mapping approaches and generalizes better to unseen environments and sensor configurations. Conclusion: The method is robust and applicable in autonomous driving systems, providing an efficient and general solution for map construction. Abstract: High-definition (HD) maps offer extensive and accurate environmental information about the driving scene, making them a crucial and essential element for planning within autonomous driving systems. To avoid extensive efforts from manual labeling, methods for automating the map creation have emerged. Recent trends have moved from offline mapping to online mapping, ensuring availability and actuality of the utilized maps. While the performance has increased in recent years, online mapping still faces challenges regarding temporal consistency, sensor occlusion, runtime, and generalization. We propose a novel offline mapping approach that integrates trails - informal routes used by drivers - into the map creation process. Our method aggregates trail data from the ego vehicle and other traffic participants to construct a comprehensive global map using transformer-based deep learning models. Unlike traditional offline mapping, our approach enables continuous updates while remaining sensor-agnostic, facilitating efficient data transfer. Our method demonstrates superior performance compared to state-of-the-art online mapping approaches, achieving improved generalization to previously unseen environments and sensor configurations. We validate our approach on two benchmark datasets, highlighting its robustness and applicability in autonomous driving systems.

[38] HandReader: Advanced Techniques for Efficient Fingerspelling Recognition

Pavel Korotaev,Petr Surovtsev,Alexander Kapitanov,Karina Kvanchiani,Aleksandr Nagaev

Main category: cs.CV

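TL;DR: The paper proposes HandReader, a group of three architectures (RGB, KP, RGB+KP) for fingerspelling recognition, optimized for the RGB and keypoint modalities, that achieves state-of-the-art results on several datasets.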

Details

Motivation: Temporal processing of fast hand movements in fingerspelling recognition still has room for improvement, and combining RGB and keypoint information can raise accuracy. Method: HandReader comprises three architectures: HandReader_RGB (uses the TSAM module to process RGB temporal features), HandReader_KP (processes keypoints with the TPE encoder), and HandReader_RGB+KP (a joint encoder combining both modalities). Result: Achieves state-of-the-art performance on the ChicagoFSWild and ChicagoFSWild+ datasets and high performance on the Russian Znaki dataset; the models and dataset are publicly available. Conclusion: Through multi-modal optimization, HandReader significantly improves fingerspelling recognition and provides new tools and data for related research. Abstract: Fingerspelling is a significant component of Sign Language (SL), allowing the interpretation of proper names, characterized by fast hand movements during signing. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE) operated on keypoints as tensors. Such keypoints composition in a batch allows the encoder to pass them through 2D and 3D convolution layers, utilizing temporal and spatial information and accumulating keypoints coordinates. We also introduce HandReader$_{RGB+KP}$, an architecture with a joint encoder to benefit from RGB and keypoint modalities. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets. Moreover, the models demonstrate high performance on the first open dataset for Russian fingerspelling, Znaki, presented in this paper. The Znaki dataset and HandReader pre-trained models are publicly available.

[39] MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting

Mengqiu Xu,Kaixin Chen,Heng Guo,Yixiang Huang,Ming Wu,Zhenwei Shi,Chuang Zhang,Jun Guo

Main category: cs.CV

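TL;DR: The paper presents MFogHub, the first multi-regional, multi-satellite marine fog dataset, which overcomes the single-region and single-satellite limits of existing datasets and supports diverse marine fog detection and forecasting research.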

Details

Motivation: Existing marine fog datasets mostly cover a single region or satellite, which limits model generalization and in-depth study of marine fog characteristics. Method: Integrates annotated data from 15 coastal fog-prone regions and six geostationary satellites to build the MFogHub dataset, which contains over 68,000 high-resolution samples. Result: Experiments show that MFogHub reveals generalization fluctuations caused by regional and satellite differences and supports the development of targeted fog prediction techniques. Conclusion: MFogHub advances the monitoring and scientific understanding of global marine fog dynamics; the dataset and code are open source. Abstract: Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce MFogHub, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancy, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are at https://github.com/kaka0910/MFogHub.

[40] MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning

Yue Wang,Shuai Xu,Xuelin Zhu,Yicong Li

Main category: cs.CV

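TL;DR: Proposes a Multi-Stage Cross-modal Interaction (MSCI) model that uses intermediate-layer information from CLIP's visual encoder to better capture fine-grained local features.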

Details

Motivation: Existing studies rely on CLIP's cross-modal alignment capability but overlook its limitations in capturing fine-grained local features. Method: Designs two self-adaptive aggregators that extract local information from low-level visual features and global information from high-level visual features, then integrates both into the textual representations through a stage-by-stage interaction mechanism. Result: Experiments on three datasets validate the effectiveness and superiority of the model. Conclusion: MSCI significantly improves the perception of fine-grained local visual information and adapts flexibly to different scenarios. Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies basically rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architectural and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP's visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. This key information is progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model's perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at https://github.com/ltpwy/MSCI.

[41] StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

Daniel A. P. Oliveira,David Martins de Matos

Main category: cs.CV

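TL;DR: The paper presents the StoryReasoning dataset and the Qwen Storyteller model, which use visual similarity and face recognition to keep characters and objects consistent in visual stories and reduce hallucinations.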

Details

Motivation: Visual storytelling systems struggle to keep characters consistent across frames and to link actions to the right subjects, which leads to referential hallucinations. Method: Uses structured scene analysis and cross-frame object re-identification based on visual similarity and face recognition, combined with chain-of-thought reasoning and visual entity grounding. Result: After fine-tuning Qwen2.5-VL 7B, hallucinations fall by 12.3% on average (from 4.06 to 3.56 per story). Conclusion: The StoryReasoning dataset and the Qwen Storyteller model improve character and object consistency in visual stories and reduce hallucinations. Abstract: Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when compared to a non-fine-tuned model.
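
The reported -12.3% follows from the two per-story averages quoted in the abstract. A short check, assuming the figure is the relative change with respect to the non-fine-tuned baseline (the function name is illustrative):

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Average hallucinations per story: non-fine-tuned model vs. Qwen Storyteller.
change = relative_change(4.06, 3.56)
print(f"{change:.1f}%")  # -12.3%
```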

[42] MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models

Guillaume Balezo,Roger Trullo,Albert Pla Planas,Etienne Decenciere,Thomas Walter

Main category: cs.CV

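TL;DR: MIPHEI is a model based on U-Net and ViT that predicts mIF signals from H&E-stained images, enabling cell-type classification and outperforming existing baselines.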

Details

Motivation: mIF is hard to adopt widely in the clinic because of its cost and complexity; the goal is to predict mIF signals from H&E images instead. Method: Uses a U-Net architecture with ViT encoders, trained on the ORION dataset and validated on two independent datasets. Result: Performs well across several markers, for example Pan-CK (F1=0.88) and CD3e (F1=0.57), substantially outperforming the baseline model. Conclusion: MIPHEI opens a new route to cell-type analysis of large-scale H&E datasets and helps study the relationship between spatial cellular organization and patient outcomes. Abstract: Histopathological analysis is a cornerstone of cancer diagnosis, with Hematoxylin and Eosin (H&E) staining routinely acquired for every patient to visualize cell morphology and tissue architecture. On the other hand, multiplex immunofluorescence (mIF) enables more precise cell type identification via proteomic markers, but has yet to achieve widespread clinical adoption due to cost and logistical constraints. To bridge this gap, we introduce MIPHEI (Multiplex Immunofluorescence Prediction from H&E), a U-Net-inspired architecture that integrates state-of-the-art ViT foundation models as encoders to predict mIF signals from H&E images. MIPHEI targets a comprehensive panel of markers spanning nuclear content, immune lineages (T cells, B cells, myeloid), epithelium, stroma, vasculature, and proliferation. We train our model using the publicly available ORION dataset of restained H&E and mIF images from colorectal cancer tissue, and validate it on two independent datasets. MIPHEI achieves accurate cell-type classification from H&E alone, with F1 scores of 0.88 for Pan-CK, 0.57 for CD3e, 0.56 for SMA, 0.36 for CD68, and 0.30 for CD20, substantially outperforming both a state-of-the-art baseline and a random classifier for most markers. Our results indicate that our model effectively captures the complex relationships between nuclear morphologies in their tissue context, as visible in H&E images and molecular markers defining specific cell types. MIPHEI offers a promising step toward enabling cell-type-aware analysis of large-scale H&E datasets, in view of uncovering relationships between spatial cellular organization and patient outcomes.

[43] A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability

Jie Zhu,Jirong Zha,Ding Li,Leye Wang

Main category: cs.CV

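TL;DR: The paper proposes PartCrop, a unified membership inference method for attacking visual self-supervised models, and verifies its effectiveness across training protocols and architectures. It also proposes defenses and scales PartCrop up further.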

Details

Motivation: Self-supervised learning is promising for exploiting unlabeled data, but it raises privacy concerns. An adversary facing a black-box system does not know the training method or its details, so a general attack method is needed. Method: Proposes PartCrop, which crops parts of objects in an image and queries their responses in representation space, exploiting the model's part-aware capability and its stronger part response on training data. Result: Experiments verify the effectiveness and generalization of PartCrop across self-supervised models and datasets. Defense experiments show that early stopping, differential privacy, and shrinking the crop scale range are all effective. Conclusion: PartCrop is an effective membership inference attack, and its extended version, PartCrop-v2, improves scalability. The defenses also offer options for privacy protection. Abstract: Self-supervised learning shows promise in harnessing extensive unlabeled data, but it also confronts significant privacy concerns, especially in vision. In this paper, we perform membership inference on visual self-supervised models in a more realistic setting: self-supervised training method and details are unknown for an adversary when attacking as he usually faces a black-box system in practice. In this setting, considering that self-supervised model could be trained by completely different self-supervised paradigms, e.g., masked image modeling and contrastive learning, with complex training details, we propose a unified membership inference method called PartCrop. It is motivated by the shared part-aware capability among models and stronger part response on the training data. Specifically, PartCrop crops parts of objects in an image to query responses within the image in representation space. We conduct extensive attacks on self-supervised models with different training protocols and structures using three widely used image datasets. The results verify the effectiveness and generalization of PartCrop. Moreover, to defend against PartCrop, we evaluate two common approaches, i.e., early stop and differential privacy, and propose a tailored method called shrinking crop scale range. The defense experiments indicate that all of them are effective. Finally, besides prototype testing on toy visual encoders and small-scale image datasets, we quantitatively study the impacts of scaling from both data and model aspects in a realistic scenario and propose a scalable PartCrop-v2 by introducing two structural improvements to PartCrop. Our code is at https://github.com/JiePKU/PartCrop.

[44] SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity

Shihao Zou,Qingfeng Li,Wei Ji,Jingjing Li,Yongkui Yang,Guoqi Li,Chao Dong

Main category: cs.CV

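TL;DR: SpikeVideoFormer is an efficient spike-driven video Transformer whose SDHA attention mechanism has linear temporal complexity, significantly improving both performance and efficiency on video tasks.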

Details

Motivation: Existing SNN-based Transformers focus mainly on single-image tasks and do not fully exploit the efficiency of SNNs in video tasks. Method: Designs spike-driven Hamming attention (SDHA), analyzes several spike-driven space-time attention designs, and selects the best scheme. Result: On video classification, human pose tracking, and semantic segmentation, the model outperforms existing SNN methods and is significantly more efficient than ANN methods. Conclusion: SpikeVideoFormer delivers an efficient, high-performance solution for video tasks and offers a new direction for applying SNNs to video. Abstract: Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks. https://github.com/JimmyZou/SpikeVideoFormer
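
To make the idea of Hamming-based attention scores concrete, the sketch below computes the Hamming similarity between binary spike vectors, that is, the fraction of positions where a query and a key agree. This illustrates the similarity measure only: the abstract does not give the exact SDHA formula, its normalization, or how linear temporal complexity is obtained, so the function names and the normalization here are assumptions.

```python
def hamming_similarity(q, k):
    """Fraction of positions at which two equal-length binary vectors agree."""
    if len(q) != len(k):
        raise ValueError("vectors must have the same length")
    matches = sum(1 for a, b in zip(q, k) if a == b)
    return matches / len(q)


def hamming_similarity_dot(q, k):
    """Same quantity via dot products: matching ones plus matching zeros."""
    ones = sum(a * b for a, b in zip(q, k))
    zeros = sum((1 - a) * (1 - b) for a, b in zip(q, k))
    return (ones + zeros) / len(q)


# Toy binary spike vectors (hypothetical data).
q = [1, 0, 1, 1]
k = [1, 1, 0, 1]
print(hamming_similarity(q, k))      # 0.5
print(hamming_similarity_dot(q, k))  # 0.5
```

Unlike a plain dot product of spike vectors, which only counts positions where both vectors fire, the Hamming form also credits positions where both are silent.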

[45] Learned Lightweight Smartphone ISP with Unpaired Data

Andrei Arhire,Radu Timofte

Main category: cs.CV

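TL;DR: Proposes a training method for a learned ISP that needs no aligned data, using adversarial training with multiple discriminators to achieve high-quality image conversion.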

Details

Motivation: Conventional learned ISPs need pixel-aligned paired data, which is costly and difficult to acquire. Method: Uses unpaired training with a multi-term loss and adversarial training, guided by feature maps from pre-trained networks. Result: Performs well on several datasets, approaches the performance of paired training methods, and suits mobile devices. Conclusion: Unpaired learning shows promise for ISP tasks, reducing data acquisition cost while keeping output quality high. Abstract: The Image Signal Processor (ISP) is a fundamental component in modern smartphone cameras responsible for conversion of RAW sensor image data to RGB images with a strong focus on perceptual quality. Recent work highlights the potential of deep learning approaches and their ability to capture details with a quality increasingly close to that of professional cameras. A difficult and costly step when developing a learned ISP is the acquisition of pixel-wise aligned paired data that maps the raw captured by a smartphone camera sensor to high-quality reference images. In this work, we address this challenge by proposing a novel training method for a learnable ISP that eliminates the need for direct correspondences between raw images and ground-truth data with matching content. Our unpaired approach employs a multi-term loss function guided by adversarial training with multiple discriminators processing feature maps from pre-trained networks to maintain content structure while learning color and texture characteristics from the target RGB dataset. Using lightweight neural network architectures suitable for mobile devices as backbones, we evaluated our method on the Zurich RAW to RGB and Fujifilm UltraISP datasets. Compared to paired training methods, our unpaired learning strategy shows strong potential and achieves high fidelity across multiple evaluation metrics. The code and pre-trained models are available at https://github.com/AndreiiArhire/Learned-Lightweight-Smartphone-ISP-with-Unpaired-Data

[46] Vision language models have difficulty recognizing virtual objects

Tyler Tran,Sangeet Khemlani,J. G. Trafton

Main category: cs.CV

TL;DR: The paper examines how well vision language models (VLMs) understand virtual objects in images and finds their performance inadequate.

Details

Motivation: To assess scene comprehension, the study asks whether VLMs can understand virtual objects that are not directly shown in an image, along with their spatial relations. Method: Tests VLMs with prompts that introduce virtual objects (for example, "imagine that a kite is stuck in the tree") and evaluates their performance systematically. Result: Experiments show that current state-of-the-art VLMs handle virtual objects poorly. Conclusion: VLMs still need improvement in understanding virtual objects and their spatial relations. Abstract: Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question how well they comprehend the visuospatial properties of scenes depicted in the images they process. We argue that descriptions of virtual objects -- objects that are not visually represented in an image -- can help test scene comprehension in these AI systems. For example, an image that depicts a person standing under a tree can be paired with the following prompt: imagine that a kite is stuck in the tree. VLMs that comprehend the scene should update their representations and reason sensibly about the spatial relations between all three objects. We describe systematic evaluations of state-of-the-art VLMs and show that their ability to process virtual objects is inadequate.

[47] Consistent Quantity-Quality Control across Scenes for Deployment-Aware Gaussian Splatting

Fengdi Zhang,Hongkun Cao,Ruqi Huang

Main category: cs.CV

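TL;DR: ControlGS is a 3D Gaussian splatting (3DGS) optimization method that, with a single training run and a user-specified parameter, gives cross-scene consistent control over the trade-off between rendering quality and Gaussian count.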

Details

Motivation: Existing 3DGS methods make it hard to adjust the trade-off between Gaussian count and rendering quality intuitively, so they cannot adapt to diverse hardware and communication constraints. Method: With a fixed setup and a user-specified parameter, ControlGS automatically finds desirable trade-off points across different scenes and supports stepless adjustment. Result: ControlGS maintains high rendering quality with fewer Gaussians, outperforms baselines, and supports a broad adjustment range. Conclusion: ControlGS is a flexible and efficient method suited to diverse needs, from compact objects to large outdoor scenes. Abstract: To reduce storage and computational costs, 3D Gaussian splatting (3DGS) seeks to minimize the number of Gaussians used while preserving high rendering quality, introducing an inherent trade-off between Gaussian quantity and rendering quality. Existing methods strive for better quantity-quality performance, but lack the ability for users to intuitively adjust this trade-off to suit practical needs such as model deployment under diverse hardware and communication constraints. Here, we present ControlGS, a 3DGS optimization method that achieves semantically meaningful and cross-scene consistent quantity-quality control while maintaining strong quantity-quality performance. Through a single training run using a fixed setup and a user-specified hyperparameter reflecting quantity-quality preference, ControlGS can automatically find desirable quantity-quality trade-off points across diverse scenes, from compact objects to large outdoor scenes. It also outperforms baselines by achieving higher rendering quality with fewer Gaussians, and supports a broad adjustment range with stepless control over the trade-off.

[48] Logos as a Well-Tempered Pre-train for Sign Language Recognition

Ilya Ovodov,Petr Surovtsev,Karina Kvanchiani,Alexander Kapitanov,Alexander Nagaev

Main category: cs.CV

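TL;DR: This paper studies two problems in isolated sign language recognition (ISLR), namely limited data per language and the semantic ambiguity of visually similar signs, and presents the Logos dataset and pre-trained models, which significantly improve model performance.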

Details

Motivation: To address limited data and the labeling ambiguity of similar signs in isolated sign language recognition. Method: Presents the Logos dataset, explores cross-language transfer learning approaches, and uses joint training with multiple classification heads. Result: The pre-trained model surpasses the current state of the art on the WLASL dataset and is competitive on the AUTSL dataset. Conclusion: The Logos dataset and the transfer learning approach improve ISLR performance, especially for low-resource languages. Abstract: This paper examines two aspects of the isolated sign language recognition (ISLR) task. First, despite the availability of a number of datasets, the amount of data for most individual sign languages is limited. It poses the challenge of cross-language ISLR model training, including transfer learning. Second, similar signs can have different semantic meanings. It leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs. To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset, the most extensive ISLR dataset by the number of signers and one of the largest available datasets while also the largest RSL dataset in size and vocabulary. It is shown that a model pre-trained on the Logos dataset can be used as a universal encoder for other language SLR tasks, including few-shot learning. We explore cross-language transfer learning approaches and find that joint training using multiple classification heads benefits accuracy for the target low-resource datasets the most. The key feature of the Logos dataset is explicitly annotated visually similar sign groups. We show that explicitly labeling visually similar signs improves trained model quality as a visual encoder for downstream tasks. Based on the proposed contributions, we outperform current state-of-the-art results for the WLASL dataset and get competitive results for the AUTSL dataset, with a single stream model processing solely RGB video. The source code, dataset, and pre-trained models are publicly available.

[49] UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation

Yi Li,Haonan Wang,Qixiang Zhang,Boyu Xiao,Chenchang Hu,Hualiang Wang,Xiaomeng Li

Main category: cs.CV

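TL;DR: The paper proposes the UniEval framework for unified evaluation of multimodal models, addressing the limitations of existing evaluation methods, and validates it experimentally.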

Details

Motivation: Evaluation of current multimodal models lacks a unified framework and suffers from redundancy and reliance on extra resources, so a simplified yet comprehensive evaluation method is needed. Method: Proposes the UniEval framework, which comprises the UniBench benchmark and the UniScore metric and supports unified evaluation without extra models or annotations. Result: UniBench is more challenging than existing benchmarks, and UniScore aligns closely with human evaluation, surpasses existing metrics, and yields new insights. Conclusion: UniEval provides an efficient, unified evaluation scheme for multimodal models with practical value. Abstract: The emergence of unified multimodal understanding and generation models is rapidly attracting attention because of their ability to enhance instruction-following capabilities while minimizing model redundancy. However, there is a lack of a unified evaluation framework for these models, which would enable an elegant, simplified, and overall evaluation. Current models conduct evaluations on multiple task-specific benchmarks, but there are significant limitations, such as the lack of overall results, errors from extra evaluation models, reliance on extensive labeled images, benchmarks that lack diversity, and metrics with limited capacity for instruction-following evaluation. To tackle these challenges, we introduce UniEval, the first evaluation framework designed for unified multimodal models without extra models, images, or annotations. This facilitates a simplified and unified evaluation process. The UniEval framework contains a holistic benchmark, UniBench (supports both unified and visual generation models), along with the corresponding UniScore metric. UniBench includes 81 fine-grained tags contributing to high diversity. Experimental results indicate that UniBench is more challenging than existing benchmarks, and UniScore aligns closely with human evaluations, surpassing current metrics. Moreover, we extensively evaluated SoTA unified and visual generation models, uncovering new insights into UniEval's unique values.

[50] CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

Raman Dutt,Pedro Sanchez,Yongchen Yao,Steven McDonagh,Sotirios A. Tsaftaris,Timothy Hospedales

Main category: cs.CV

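TL;DR: CheXGenBench is a multifaceted framework for evaluating synthetic chest radiograph generation, covering generation quality, privacy risk, and clinical utility, and it provides a standardized evaluation protocol and a dataset.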

Details

Motivation: Evaluation of generative AI in the medical domain suffers from inconsistent methodology, outdated architectural comparisons, and disconnected assessment criteria, with little attention to the clinical value of synthetic samples. Method: Systematically analyzes 11 leading text-to-image architectures using standardized data partitioning and a unified evaluation protocol comprising over 20 quantitative metrics. Result: Finds critical inefficiencies in existing evaluation protocols, particularly in assessing generative fidelity, which lead to inconsistent and uninformative comparisons. The high-quality synthetic dataset SynthCheX-75K is also released. Conclusion: CheXGenBench gives the medical AI community a standardized benchmark that supports objective and reproducible comparisons and facilitates the integration of future generative models. Abstract: We introduce CheXGenBench, a rigorous and multifaceted evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across state-of-the-art text-to-image generative models. Despite rapid advancements in generative AI for real-world imagery, medical domain evaluations have been hindered by methodological inconsistencies, outdated architectural comparisons, and disconnected assessment criteria that rarely address the practical clinical value of synthetic samples. CheXGenBench overcomes these limitations through standardised data partitioning and a unified evaluation protocol comprising over 20 quantitative metrics that systematically analyse generation quality, potential privacy vulnerabilities, and downstream clinical applicability across 11 leading text-to-image architectures. Our results reveal critical inefficiencies in the existing evaluation protocols, particularly in assessing generative fidelity, leading to inconsistent and uninformative comparisons. Our framework establishes a standardised benchmark for the medical AI community, enabling objective and reproducible comparisons while facilitating seamless integration of both existing and future generative models. Additionally, we release a high-quality, synthetic dataset, SynthCheX-75K, comprising 75K radiographs generated by the top-performing model (Sana 0.6B) in our benchmark to support further research in this critical domain. Through CheXGenBench, we establish a new state-of-the-art and release our framework, models, and SynthCheX-75K dataset at https://raman1121.github.io/CheXGenBench/

[51] MorphGuard: Morph Specific Margin Loss for Enhancing Robustness to Face Morphing Attacks

Iurii Medvedev,Nuno Goncalves

Main category: cs.CV

TL;DR: Proposes a dual-branch classification strategy that makes deep-learning face recognition systems more robust to face morphing attacks.

Details

Motivation: Face recognition is widely deployed but is threatened by face morphing attacks, so system robustness must improve. Method: Uses a dual-branch classification strategy to handle the ambiguity in labeling face morphs, which allows morph images to be included in training. Result: Validated on public benchmarks, the method improves robustness against face morphing attacks. Conclusion: The method is broadly applicable and can be integrated into existing training pipelines to improve the robustness of classification-based recognition. Abstract: Face recognition has evolved significantly with the advancement of deep learning techniques, enabling its widespread adoption in various applications requiring secure authentication. However, this progress has also increased its exposure to presentation attacks, including face morphing, which poses a serious security threat by allowing one identity to impersonate another. Therefore, modern face recognition systems must be robust against such attacks. In this work, we propose a novel approach for training deep networks for face recognition with enhanced robustness to face morphing attacks. Our method modifies the classification task by introducing a dual-branch classification strategy that effectively handles the ambiguity in the labeling of face morphs. This adaptation allows the model to incorporate morph images into the training process, improving its ability to distinguish them from bona fide samples. Our strategy has been validated on public benchmarks, demonstrating its effectiveness in enhancing robustness against face morphing attacks. Furthermore, our approach is universally applicable and can be integrated into existing face recognition training pipelines to improve classification-based recognition methods.

[52] Enhancing Multi-Image Question Answering via Submodular Subset Selection

Aaryan Sharma,Shivansh Gupta,Samar Agarwal,Vishak Prasad C.,Ganesh Ramakrishnan

Main category: cs.CV

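TL;DR: This paper proposes an enhancement to a retrieval framework, based on submodular subset selection, that improves retrieval performance in multi-image question answering.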

Details

Motivation: Large multimodal models perform well on single-image tasks but face scalability and retrieval-performance problems in multi-image scenarios such as multi-image question answering. Method: Uses query-aware submodular functions (such as GraphCut) to pre-select a subset of semantically relevant images, and combines anchor-based queries with data augmentation to improve the retrieval pipeline. Result: The method improves the effectiveness of the retrieval pipeline, particularly for large image collections. Conclusion: Submodular subset selection can strengthen retrieval for multi-image question answering and offers a new direction for future research. Abstract: Large multimodal models (LMMs) have achieved high performance in vision-language tasks involving a single image, but they struggle when presented with a collection of multiple images (Multiple Image Question Answering scenario). These tasks, which involve reasoning over a large number of images, present issues in scalability (with increasing number of images) and retrieval performance. In this work, we propose an enhancement for the retriever framework introduced in the MIRAGE model using submodular subset selection techniques. Our method leverages query-aware submodular functions, such as GraphCut, to pre-select a subset of semantically relevant images before the main retrieval component. We demonstrate that using anchor-based queries and augmenting the data improves the effectiveness of the submodular-retriever pipeline, particularly in large haystack sizes.
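
The pre-selection step can be pictured as greedy maximization of a submodular score. The sketch below uses a simplified graph-cut-style objective that rewards similarity to the query and penalizes redundancy among the selected images. The objective, the `lam` weight, the function name, and the toy similarity values are all assumptions made for illustration; the abstract does not give the exact function or hyperparameters used with MIRAGE.

```python
def greedy_select(relevance, similarity, budget, lam=0.5):
    """Greedily pick `budget` items maximizing relevance minus lam * redundancy.

    relevance[i]     : similarity of image i to the query (assumed precomputed)
    similarity[i][j] : pairwise image similarity, non-negative and symmetric
    """
    selected = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < budget:
        def gain(i):
            # Marginal gain of adding image i to the current selection.
            return relevance[i] - lam * sum(similarity[i][j] for j in selected)
        best = max(sorted(remaining), key=gain)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: images 0 and 1 are near-duplicates, image 2 is distinct.
relevance = [0.9, 0.85, 0.3]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(greedy_select(relevance, similarity, budget=2, lam=1.0))  # [0, 2]
print(greedy_select(relevance, similarity, budget=2, lam=0.0))  # [0, 1]
```

With the redundancy penalty active, the near-duplicate image 1 is skipped in favor of the less relevant but distinct image 2; with `lam=0.0` the selection reduces to plain top-k by relevance.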

[53] Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis

Pengfei Wang,Guohai Xu,Weinong Wang,Junjie Yang,Jie Lou,Yunhua Xue

Main category: cs.CV

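TL;DR: The paper proposes a new way to measure whether multimodal large language models (MLLMs) truly understand visual input, defining "implicit visual misunderstanding" (IVM) and introducing an "attention accuracy" metric and a benchmark.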

Details

Motivation: Existing benchmarks mainly evaluate answer correctness and overlook whether models truly understand the visual input, so a more reliable evaluation method is needed. Method: Decouples the visual and textual modalities within the causal attention module, analyzes the attention distribution, and proposes the "attention accuracy" metric and a new benchmark. Result: The attention distribution increasingly concentrates on the image associated with the correct answer as the network layers deepen, and the new metric evaluates visual understanding reliably. Conclusion: The approach applies to multimodal scenarios and also extends to unimodal ones, showing broad applicability and generality. Abstract: Recent advancements have enhanced the capability of Multimodal Large Language Models (MLLMs) to comprehend multi-image information. However, existing benchmarks primarily evaluate answer correctness, overlooking whether models genuinely comprehend the visual input. To address this, we define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input. Through our analysis, we decouple the visual and textual modalities within the causal attention module, revealing that attention distribution increasingly converges on the image associated with the correct answer as the network layers deepen. This insight leads to the introduction of a scale-agnostic metric, attention accuracy, and a novel benchmark for quantifying IVMs. Attention accuracy directly evaluates the model's visual understanding via internal mechanisms, remaining robust to positional biases for more reliable assessments. Furthermore, we extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios, underscoring its versatility and generalizability.
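
One plausible reading of attention accuracy is the fraction of samples in which the image receiving the most attention is the one associated with the correct answer. The sketch below implements that reading on per-image attention totals. The abstract does not define the metric precisely, so this definition, the function name, and the toy numbers are assumptions, and how attention is aggregated across heads and layers is left out.

```python
def attention_accuracy(attention_per_sample, correct_image_index):
    """Share of samples whose most-attended image is the correct one.

    attention_per_sample[s][i] : total attention that sample s puts on image i
    correct_image_index[s]     : index of the image tied to the correct answer
    """
    if len(attention_per_sample) != len(correct_image_index):
        raise ValueError("one target index is required per sample")
    hits = 0
    for weights, target in zip(attention_per_sample, correct_image_index):
        # Index of the image with the largest attention mass.
        most_attended = max(range(len(weights)), key=lambda i: weights[i])
        hits += int(most_attended == target)
    return hits / len(correct_image_index)


# Three hypothetical multi-image samples, each with three candidate images.
attention = [
    [0.1, 0.7, 0.2],  # attends mostly to image 1
    [0.5, 0.3, 0.2],  # attends mostly to image 0
    [0.2, 0.2, 0.6],  # attends mostly to image 2
]
targets = [1, 1, 2]
score = attention_accuracy(attention, targets)
print(score)  # 2 of the 3 samples attend most to the correct image
```

Because the score looks at where attention lands and not at the generated answer, a model could answer correctly while still scoring low here, which is the situation the abstract calls implicit visual misunderstanding.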

[54] Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data

Yiwen Liu,Jessica Bader,Jae Myung Kim

Main category: cs.CV

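TL;DR: The study examines how feasibility in synthetic images affects CLIP classifier performance and finds that feasibility has minimal effect, with no significant difference when training on a mix of feasible and infeasible images.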

Details

Motivation: As diffusion models become better at generating photorealistic images, models trained on synthetic data keep improving, but generated images may contain unrealistic features (such as a dog floating in the air). The study tests whether feasibility in synthetic images affects CLIP classifier performance. Method: The VariReal pipeline minimally edits a source image to include feasible or infeasible attributes, generating images from text prompts. Experiments use LoRA-fine-tuned CLIP models evaluated on three fine-grained datasets. Result: Feasibility has minimal effect on CLIP performance (mostly less than 0.3% difference), although some attributes do influence classification performance. Mixing feasible and infeasible images in training has no significant effect. Conclusion: Feasibility of synthetic images has limited impact on CLIP classifiers, so it need not be strictly enforced, and mixed training sets are viable. Abstract: With the development of photorealistic diffusion models, models trained in part or fully on synthetic data achieve progressively better results. However, diffusion models still routinely generate images that would not exist in reality, such as a dog floating above the ground or with unrealistic texture artifacts. We define the concept of feasibility as whether attributes in a synthetic image could realistically exist in the real-world domain; synthetic images containing attributes that violate this criterion are considered infeasible. Intuitively, infeasible images are typically considered out-of-distribution; thus, training on such images is expected to hinder a model's ability to generalize to real-world data, and they should therefore be excluded from the training set whenever possible. However, does feasibility really matter? In this paper, we investigate whether enforcing feasibility is necessary when generating synthetic training data for CLIP-based classifiers, focusing on three target attributes: background, color, and texture. We introduce VariReal, a pipeline that minimally edits a given source image to include feasible or infeasible attributes given by the textual prompt generated by a large language model. Our experiments show that feasibility minimally affects LoRA-fine-tuned CLIP performance, with mostly less than 0.3% difference in top-1 accuracy across three fine-grained datasets. Also, whether feasible or infeasible images adversely influence classification performance depends on the attribute. Finally, mixing feasible and infeasible images in training datasets does not significantly impact performance compared to using purely feasible or infeasible datasets.

[55] MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

Ke Wang,Junting Pan,Linda Wei,Aojun Zhou,Weikang Shi,Zimu Lu,Han Xiao,Yunqiao Yang,Houxing Ren,Mingjie Zhan,Hongsheng Li

Main category: cs.CV

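TL;DR: The paper uses code as a supervision signal to address image-text alignment in multimodal mathematical reasoning, developing the FigCodifier model and the ImgCode-8.6M dataset, building the MM-MathInstruct-3M fine-tuning dataset, and training MathCoder-VL, which reaches open-source SOTA on multiple metrics.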

Details

Motivation: Current multimodal models are limited in mathematical reasoning because they pay too little attention to the details of mathematical figures, so a more precise cross-modal alignment method is needed. Method: Using code as supervision, the authors develop the image-to-code model FigCodifier and the ImgCode-8.6M dataset, and build the MM-MathInstruct-3M fine-tuning dataset. Result: MathCoder-VL surpasses GPT-4o and Claude 3.5 Sonnet on the geometry problem-solving subset of MathVista, with improvements of 8.9% and 9.2%. Conclusion: Code supervision and the multimodal datasets significantly improve mathematical reasoning; the models and datasets will be open-sourced. Abstract: Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at https://github.com/mathllm/MathCoder.

[56] End-to-End Vision Tokenizer Tuning

Wenxuan Wang,Fan Zhang,Yufeng Cui,Haiwen Diao,Zhuoyan Luo,Huchuan Lu,Jing Liu,Xinlong Wang

Main category: cs.CV

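TL;DR: ETT proposes an end-to-end vision tokenizer tuning approach that jointly optimizes vision tokenization and the target autoregressive tasks, addressing the mismatch between existing vision tokenizers and downstream tasks and significantly improving multimodal understanding and visual generation.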

Details

Motivation: Existing vision tokenizers are optimized separately from downstream training, so they cannot adapt to the needs of different tasks and become a representation bottleneck. For example, errors in tokenizing text within an image degrade recognition or generation. Method: ETT uses the embeddings of the vision tokenizer codebook and tunes the tokenizer end-to-end by jointly optimizing reconstruction and caption objectives. The method does not require changing the original codebook or the architecture of the large language model. Result: Experiments show gains of 2-6% over frozen-tokenizer baselines on multimodal understanding and visual generation tasks while preserving the original reconstruction capability. Conclusion: ETT is a simple yet strong method that may generalize across multimodal foundation models beyond image generation and understanding. Abstract: Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. This decoupled paradigm introduces a critical misalignment: The loss of the vision tokenization can be the representation bottleneck for target tasks. For example, errors in tokenizing text in a given image lead to poor results when recognizing or generating them. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook, and optimizes the vision tokenizers end-to-end with both reconstruction and caption objectives. ETT can be seamlessly integrated into existing training pipelines with minimal architecture modifications. Our ETT is simple to implement and integrate, without the need to adjust the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that our proposed end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2-6% for multimodal understanding and visual generation tasks compared to frozen tokenizer baselines, while preserving the original reconstruction capability. We hope this very simple and strong method can empower multimodal foundation models besides image generation and understanding.

[57] Depth Anything with Any Prior

Zehan Wang,Siyu Chen,Lihe Yang,Jialei Wang,Ziang Zhang,Hengshuang Zhao,Zhou Zhao

Main category: cs.CV

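TL;DR: The Prior Depth Anything framework combines incomplete but precise depth measurements with relative but complete geometric structure from depth prediction to generate accurate, dense, and detailed depth maps. A coarse-to-fine pipeline integrates the two complementary depth sources and shows strong zero-shot generalization.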

Details

Motivation: Depth measurements are incomplete and depth predictions lack precise metric geometry; the goal is to combine them into higher-quality depth maps. Method: A coarse-to-fine pipeline consisting of pixel-level metric alignment with distance-aware weighting for pre-filling, followed by a conditioned monocular depth estimation model that takes the normalized pre-filled prior and the prediction as input. Result: Across 7 real-world datasets, zero-shot generalization matches or surpasses task-specific methods, with support for test-time improvements and flexibility. Conclusion: The framework performs well on depth completion, super-resolution, and inpainting, and can evolve as MDE models advance. Abstract: This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.
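
To make the idea of aligning a relative depth prediction to sparse metric measurements concrete, here is a minimal sketch. It is deliberately simpler than the pixel-level, distance-aware alignment the abstract describes: it fits a single global scale and shift by least squares on the pixels where a metric prior exists. The function name and argument layout are assumptions for illustration.

```python
import numpy as np

def align_scale_shift(pred, prior, valid):
    """Fit prior ~ scale * pred + shift on valid pixels, then apply it everywhere.

    pred:  dense relative depth prediction, shape (H, W)
    prior: sparse metric depth, shape (H, W); only read where valid is True
    valid: boolean mask marking pixels that carry a metric measurement
    """
    pred = np.asarray(pred, dtype=float)
    prior = np.asarray(prior, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    x = pred[valid]
    y = prior[valid]
    # Least-squares solve for [scale, shift] in y = scale * x + shift.
    design = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, y, rcond=None)
    return scale * pred + shift, float(scale), float(shift)
```

A global fit like this cannot correct spatially varying errors, which is presumably why the described pipeline aligns per pixel and weights nearby measurements more heavily before the learned refinement stage.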

[58] 3D-Fixup: Advancing Photo Editing with 3D Priors

Yen-Chi Cheng,Krishna Kumar Singh,Jae Shin Yoon,Alex Schwing,Liangyan Gui,Matheus Gadelha,Paul Guerrero,Nanxuan Zhao

Main category: cs.CV

TL;DR: 3D-Fixup is a framework based on diffusion models and 3D priors that supports complex 3D-aware image edits such as translation and rotation.

Details

Motivation: Although diffusion models have advanced the modeling of image priors, 3D-aware editing from a single image remains challenging. Method: Training pairs are generated from video data and combined with 3D guidance from an Image-to-3D model, with a data generation pipeline designed for high quality. Result: 3D-Fixup achieves high-quality 3D-aware edits while preserving identity coherence. Conclusion: The framework advances the use of diffusion models for realistic image manipulation; the code has been released. Abstract: Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity-coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at https://3dfixup.github.io/

cs.GR [Back]

[59] VRSplat: Fast and Robust Gaussian Splatting for Virtual Reality

Xuechang Tu,Lukas Radl,Michael Steiner,Markus Steinberger,Bernhard Kerbl,Fernando de la Torre

Main category: cs.GR

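TL;DR: VRSplat combines and extends 3D Gaussian Splatting (3DGS) techniques to address key problems in VR, such as temporal artifacts and projection distortions, and validates its advantages through a user study.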

Details

Motivation: In VR, 3DGS suffers from temporal artifacts, projection distortions, and reduced framerates; these problems are especially pronounced in VR and need to be solved. Method: Mini-Splatting, StopThePop, and Optimal Projection are combined, the core 3DGS rasterizer is modified, and an efficient foveated rasterizer is proposed. Result: VRSplat is strongly preferred in the user study, supports 72+ FPS, and eliminates popping artifacts and floaters. Conclusion: VRSplat is the first systematically evaluated 3DGS approach suitable for modern VR applications and addresses the key challenges. Abstract: 3D Gaussian Splatting (3DGS) has rapidly become a leading technique for novel-view synthesis, providing exceptional performance through efficient software-based GPU rasterization. Its versatility enables real-time applications, including on mobile and lower-powered devices. However, 3DGS faces key challenges in virtual reality (VR): (1) temporal artifacts, such as popping during head movements, (2) projection-based distortions that result in disturbing and view-inconsistent floaters, and (3) reduced framerates when rendering large numbers of Gaussians, falling below the critical threshold for VR. Compared to desktop environments, these issues are drastically amplified by the large field-of-view, constant head movements, and high resolution of head-mounted displays (HMDs). In this work, we introduce VRSplat: we combine and extend several recent advancements in 3DGS to address the challenges of VR holistically. We show how the ideas of Mini-Splatting, StopThePop, and Optimal Projection can complement each other, by modifying the individual techniques and the core 3DGS rasterizer. Additionally, we propose an efficient foveated rasterizer that handles focus and peripheral areas in a single GPU launch, avoiding redundant computations and improving GPU utilization. Our method also incorporates a fine-tuning step that optimizes Gaussian parameters based on StopThePop depth evaluations and Optimal Projection. We validate our method through a controlled user study with 25 participants, showing a strong preference for VRSplat over other configurations of Mini-Splatting. VRSplat is the first systematically evaluated 3DGS approach capable of supporting modern VR applications, achieving 72+ FPS while eliminating popping and stereo-disrupting floaters.

[60] Style Customization of Text-to-Vector Generation with Image Diffusion Priors

Peiying Zhang,Nanxuan Zhao,Jing Liao

Main category: cs.GR

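TL;DR: Proposes a two-stage style-customization method for SVG generation that combines feed-forward T2V models with T2I priors, addressing the shortcomings of existing methods in style customization and structural consistency.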

Details

Motivation: Existing T2V generation methods lack style customization and struggle to produce SVG collections that are visually consistent and aesthetically coherent. Method: A two-stage pipeline: 1) train a T2V diffusion model with a path-level representation to ensure structural consistency; 2) customize style by distilling customized T2I models. Result: Experiments confirm that the method efficiently generates high-quality, diverse SVGs in custom styles. Conclusion: The method addresses the combined challenge of style customization and structural consistency and offers a practical solution for SVG generation. Abstract: Scalable Vector Graphics (SVGs) are highly favored by designers due to their resolution independence and well-organized layer structure. Although existing text-to-vector (T2V) generation methods can create SVGs from text prompts, they often overlook an important need in practical applications: style customization, which is vital for producing a collection of vector graphics with consistent visual appearance and coherent aesthetics. Extending existing T2V methods for style customization poses certain challenges. Optimization-based T2V models can utilize the priors of text-to-image (T2I) models for customization, but struggle with maintaining structural regularity. On the other hand, feed-forward T2V models can ensure structural regularity, yet they encounter difficulties in disentangling content and style due to limited SVG training data. To address these challenges, we propose a novel two-stage style customization pipeline for SVG generation, making use of the advantages of both feed-forward T2V models and T2I image priors. In the first stage, we train a T2V diffusion model with a path-level representation to ensure the structural regularity of SVGs while preserving diverse expressive capabilities. In the second stage, we customize the T2V diffusion model to different styles by distilling customized T2I models. By integrating these techniques, our pipeline can generate high-quality and diverse SVGs in custom styles based on text prompts in an efficient feed-forward manner. The effectiveness of our method has been validated through extensive experiments. The project page is https://customsvg.github.io.

cs.CL [Back]

[61] Next Word Suggestion using Graph Neural Network

Abisha Thapa Magar,Anup Shakya

Main category: cs.CL

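TL;DR: The paper proposes a context-embedding approach that combines graph convolution from graph neural networks (GNNs) with LSTMs for next-word prediction in language modeling, and validates it under limited resources.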

Details

Motivation: Mainstream language models require huge numbers of parameters and large compute budgets, which is costly. The paper targets the context-embedding sub-task of language modeling and explores a more efficient solution. Method: The graph convolution operation in GNNs encodes the context, which is combined with LSTMs to predict the next word. Experiments run on a custom Wikipedia corpus with low resource consumption. Result: The approach performs fairly well under limited resources and can predict the next word effectively. Conclusion: The approach offers a resource-efficient alternative for language modeling and shows promise for context embedding in particular. Abstract: Language Modeling is a prevalent task in Natural Language Processing. The most recent and most successful language models tend to be massive models with billions of parameters, fed a tremendous amount of text data and trained with enormous computation resources that cost millions of dollars. In this project, we aim to address an important sub-task in language modeling, i.e., context embedding. We propose an approach to exploit the Graph Convolution operation in GNNs to encode the context and use it in combination with LSTMs to predict the next word given a local context of preceding words. We test this on a custom Wikipedia text corpus using a very limited amount of resources and show that this approach works fairly well to predict the next word.
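
For readers unfamiliar with the graph convolution operation mentioned above, the following sketch shows one widely used form, the symmetric-normalized layer with self-loops and a ReLU. Whether the authors use exactly this variant is not stated in the abstract, so treat the normalization, the activation, and the function name as assumptions.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adj:    (n, n) adjacency matrix of the word/context graph
    feats:  (n, d_in) node features, e.g. word embeddings
    weight: (d_in, d_out) learnable projection
    """
    adj = np.asarray(adj, dtype=float)
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm_adj = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ np.asarray(feats) @ np.asarray(weight), 0.0)
```

Each output row mixes a node's features with those of its neighbours; in a next-word setting, the resulting context vectors would then be passed to a sequence model such as an LSTM.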

[62] DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

Xiwen Chen,Wenhui Zhu,Peijie Qiu,Xuanzhao Dong,Hao Wang,Haiyu Wu,Huayu Li,Aristeidis Sotiras,Yalin Wang,Abolfazl Razi

Main category: cs.CL

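TL;DR: The paper proposes Diversity-aware Reward Adjustment (DRA), which brings semantic diversity into the reward computation to resolve the diversity-quality inconsistency of GRPO in low-resource settings.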

Details

Motivation: GRPO works well in low-resource settings, but its scalar reward signal cannot capture semantic diversity, which leads to a diversity-quality inconsistency. Method: DRA uses Submodular Mutual Information (SMI) to adjust rewards, downweighting redundant completions and amplifying rewards for diverse ones. Result: On five mathematical reasoning benchmarks, DRA-GRPO and DGA-DR. GRPO outperform existing baselines, reaching an average accuracy of 58.2% with only 7,000 fine-tuning samples and about $55 in training cost. Conclusion: DRA effectively resolves the diversity-quality inconsistency and improves performance in low-resource settings; the code is open source. Abstract: Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose $\textit{Diversity-aware Reward Adjustment}$ (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR.~GRPO, resulting in $\textit{DRA-GRPO}$ and $\textit{DGA-DR.~GRPO}$. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at https://github.com/xiwenc1/DRA-GRPO.
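
The reweighting idea can be sketched in a few lines. This is an illustration under assumptions, not the released code: redundancy is approximated here by the summed non-negative cosine similarity of a completion to the other completions in its group (a simple stand-in for the SMI term), and each reward is divided by one plus that redundancy. The function name and the embedding input are hypothetical.

```python
import numpy as np

def diversity_adjusted_rewards(rewards, embeddings):
    """Downweight rewards of completions that are similar to others in the group.

    rewards:    (n,) scalar rewards for n sampled completions
    embeddings: (n, d) sentence embeddings of the same completions
    """
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, 0.0)                 # ignore self-similarity
    redundancy = np.clip(sim, 0.0, None).sum(axis=1)
    return np.asarray(rewards, dtype=float) / (1.0 + redundancy)
```

Two identical completions with reward 1 each end up with 0.5, while a distinct completion with the same raw reward keeps 1.0, so distinct reasoning paths no longer receive indistinguishable rewards.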

[63] Large Language Models Are More Persuasive Than Incentivized Human Persuaders

Philipp Schoenegger,Francesco Salvi,Jiacheng Liu,Xiaoli Nan,Ramit Debnath,Barbara Fasolo,Evelina Leivada,Gabriel Recchia,Fritz Günther,Ali Zarifhonarvar,Joe Kwon,Zahoor Ul Islam,Marco Dehnert,Daryl Y. H. Lee,Madeline G. Reinecke,David G. Kamper,Mert Kobaş,Adam Sandford,Jonas Kgomo,Luke Hewitt,Shreya Kapoor,Kerem Oktar,Eyup Engin Kucuk,Bo Feng,Cameron R. Jones,Izzy Gainsburg,Sebastian Olschewski,Nora Heinzelmann,Francisco Cruz,Ben M. Tappin,Tao Ma,Peter S. Park,Rayan Onyonka,Arthur Hjorth,Peter Slattery,Qingcheng Zeng,Lennart Finke,Igor Grossmann,Alessandro Salatiello,Ezra Karger

Main category: cs.CL

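TL;DR: A frontier large language model (Claude Sonnet 3.5) is more persuasive than incentivized humans in a real-time conversational quiz, whether steering toward correct or incorrect answers.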

Details

Motivation: To compare the persuasive abilities of AI and humans, particularly when the humans are incentivized. Method: In a preregistered, large-scale incentivized experiment, participants completed an online quiz while AI or human persuaders tried to steer their answers. Result: AI persuaders significantly outperformed humans when steering toward both correct and incorrect answers, affecting participants' accuracy and earnings. Conclusion: AI persuasion already exceeds that of incentivized humans, underscoring the urgency of alignment and governance frameworks. Abstract: We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers' accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI's persuasion capabilities already exceed those of humans who have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks.

[64] System Prompt Optimization with Meta-Learning

Yumin Choi,Jinheon Baek,Sung Ju Hwang

Main category: cs.CL

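TL;DR: The paper proposes bilevel system prompt optimization, using a meta-learning framework to optimize system prompts so that they are robust to diverse user prompts and transfer to unseen tasks.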

Details

Motivation: Existing work focuses on optimizing task-specific user prompts and overlooks the general-purpose system prompt, which, once optimized, can be used across tasks and domains. Method: A meta-learning framework iteratively optimizes the system prompt and the user prompts so that they work together. Experiments cover 14 unseen datasets across 5 domains. Result: The optimized system prompt generalizes to diverse user prompts and enables faster adaptation and better performance on unseen tasks. Conclusion: Bilevel system prompt optimization significantly improves the generality and adaptability of LLMs. Abstract: Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.

[65] VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

Xin Liu,Lechen Zhang,Sheza Munir,Yiyang Gu,Lu Wang

Main category: cs.CL

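TL;DR: VeriFact is a new factuality evaluation framework that improves the accuracy of long-form factuality evaluation by identifying and resolving missing or incomplete facts. The accompanying FactRBench benchmark evaluates both precision and recall, whereas prior work focused mainly on precision.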

Details

Motivation: Existing methods for evaluating the factuality of long-form generation often perform poorly because they miss context and key relational facts. Method: The VeriFact framework improves fact extraction and verification; the FactRBench benchmark evaluates precision and recall. Result: VeriFact significantly improves fact completeness and the preservation of relational information; FactRBench shows that larger models achieve better precision and recall, but the two are not necessarily positively correlated. Conclusion: VeriFact and FactRBench provide more comprehensive tools for evaluating long-form factuality and highlight the importance of comprehensive assessment. Abstract: Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench, a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information, resulting in more accurate factuality evaluation. Benchmarking various open- and closed-weight LLMs on FactRBench indicates that larger models within the same model family improve precision and recall, but high precision does not always correlate with high recall, underscoring the importance of comprehensive factuality assessment.
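
The distinction between precision and recall in this setting can be shown with a toy computation. The sketch assumes the verification step has already produced two lists of booleans: whether each extracted claim was supported, and whether each reference fact was covered by the response. The function name and input format are hypothetical, not the benchmark's API.

```python
def factuality_scores(claim_supported, reference_covered):
    """Return (precision, recall) for one long-form response.

    claim_supported:   one bool per fact extracted from the response
    reference_covered: one bool per reference fact, True if the response states it
    """
    precision = sum(claim_supported) / len(claim_supported) if claim_supported else 0.0
    recall = sum(reference_covered) / len(reference_covered) if reference_covered else 0.0
    return precision, recall
```

A response with three supported claims out of four but only one of five reference facts covered scores 0.75 precision and 0.2 recall, which illustrates why high precision does not imply high recall.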

[66] An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs

Gino Carmona-Díaz,William Jiménez-Leal,María Alejandra Grisales,Chandra Sripada,Santiago Amaya,Michael Inzlicht,Juan Pablo Bermúdez

Main category: cs.CL

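TL;DR: The paper provides a step-by-step tutorial on using LLMs to efficiently develop, test, and apply taxonomies for analyzing unstructured data, demonstrating high intercoder reliability.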

Details

Motivation: Analyzing open-ended text is time-consuming and susceptible to bias; LLMs offer an efficient solution without sacrificing quality. Method: An iterative, collaborative process combines predefined and data-driven taxonomies, using prompts to generate, evaluate, and refine the taxonomy. Result: A taxonomy of life domains was generated and applied with high intercoder reliability. Conclusion: LLMs have potential for text analysis but also have limitations. Abstract: Analyzing texts such as open-ended responses, headlines, or social media posts is a time- and labor-intensive process highly susceptible to bias. LLMs are promising tools for text analysis, using either a predefined (top-down) or a data-driven (bottom-up) taxonomy, without sacrificing quality. Here we present a step-by-step tutorial to efficiently develop, test, and apply taxonomies for analyzing unstructured data through an iterative and collaborative process between researchers and LLMs. Using personal goals provided by participants as an example, we demonstrate how to write prompts to review datasets and generate a taxonomy of life domains, evaluate and refine the taxonomy through prompt and direct modifications, test the taxonomy and assess intercoder agreements, and apply the taxonomy to categorize an entire dataset with high intercoder reliability. We discuss the possibilities and limitations of using LLMs for text analysis.
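
Intercoder agreement between a human coder and an LLM (or between two LLM runs) is often summarized with Cohen's kappa. The abstract does not say which agreement statistic the tutorial uses, so the choice of kappa here is an assumption, and the function name and example labels are illustrative only.

```python
from collections import Counter

def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labelled the same items."""
    n = len(codes_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical category
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for agreement expected by chance, so it is more informative than percent agreement when a few categories dominate the dataset.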

[67] Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

Shaurya Sharthak,Vinayak Pahalwan,Adithya Kamath,Adarsh Shirawalmath

Main category: cs.CL

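TL;DR: The paper proposes the TokenAdapt framework, which combines a model-agnostic tokenizer transplantation method with pre-tokenization learning of multi-word Supertokens to improve efficiency and semantic preservation.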

Details

Motivation: Fixed tokenization schemes limit the efficiency of pretrained language models and their performance in multilingual or specialized applications, and existing remedies are computationally expensive with limited benefit. Method: TokenAdapt (a hybrid heuristic for initializing new token embeddings) and Supertoken pre-tokenization learning are introduced, combining a local subword-decomposition estimate with a global semantic-similarity estimate. Result: TokenAdapt clearly outperforms baselines such as ReTok and TransTokenizer in zero-shot perplexity, with at least a 2-fold improvement in the perplexity ratio. Conclusion: The TokenAdapt framework addresses tokenizer transplantation efficiently while improving compression and semantic preservation. Abstract: Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges. Standard methods to overcome this often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes new unique token embeddings via a hybrid heuristic that combines two methods: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including TransTokenizer and ReTok, while our Supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios compared to both ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.
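
The hybrid initialization described in the abstract (a local subword-decomposition estimate combined with a global top-k neighbour estimate) can be sketched as below. Several details are assumptions made for illustration: the local estimate is a plain mean of the old-tokenizer pieces, the global estimate is a softmax-weighted average over the k most similar old tokens measured in an auxiliary embedding space, and the two are mixed with a fixed weight `w_local`. None of these names or weighting choices are taken from the paper.

```python
import numpy as np

def hybrid_init(piece_ids, new_token_vec, old_emb, aux_emb, k=2, w_local=0.5):
    """Initialise the embedding of one new token from the old vocabulary.

    piece_ids:     ids the OLD tokenizer produces for the new token's string
    new_token_vec: auxiliary-space vector of the new token's string
    old_emb:       (V, d) original model embedding matrix
    aux_emb:       (V, d_aux) auxiliary-space vectors of the old vocabulary
    """
    old_emb = np.asarray(old_emb, dtype=float)
    aux_emb = np.asarray(aux_emb, dtype=float)
    new_token_vec = np.asarray(new_token_vec, dtype=float)

    # Local estimate: average the embeddings of the old subword pieces.
    local = old_emb[piece_ids].mean(axis=0)

    # Global estimate: similarity-weighted mean of the top-k nearest old tokens.
    aux = aux_emb / np.linalg.norm(aux_emb, axis=1, keepdims=True)
    q = new_token_vec / np.linalg.norm(new_token_vec)
    sims = aux @ q
    top = np.argsort(sims)[-k:]
    weights = np.exp(sims[top])
    weights /= weights.sum()
    global_est = weights @ old_emb[top]

    return w_local * local + (1.0 - w_local) * global_est
```

The appeal of such a heuristic is that it needs no gradient steps: every new token starts from a point that is already meaningful in the model's embedding space, which is what keeps retraining requirements low.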

[68] Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques

J. Moreno-Casanova,J. M. Auñón,A. Mártinez-Pérez,M. E. Pérez-Martínez,M. E. Gas-López

Main category: cs.CL

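TL;DR: The study uses NLP techniques (NER in particular) to automatically extract key clinical information on lung and breast cancer from electronic health records, using the uQuery tool and a pretrained RoBERTa model. It achieves strong performance overall but still struggles with low-frequency entities.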

Details

Motivation: Manually extracting information from clinical reports is time-consuming and error-prone, which limits the efficiency of data-driven approaches in healthcare; NLP can automate the process and improve the accuracy and efficiency of data extraction. Method: GMV's NLP tool uQuery and a pretrained RoBERTa model (bsc-bio-ehr-en3) are used to extract clinical entities via NER from 200 breast cancer and 400 lung cancer reports. Result: The model performs very well on common entities (such as MET and PAT), but low-frequency entities (such as EVOL) remain challenging. Conclusion: NLP techniques, NER in particular, show promise for automating clinical data extraction but need further work to improve recognition of low-frequency entities. Abstract: Research projects, including those focused on cancer, rely on the manual extraction of information from clinical reports. This process is time-consuming and prone to errors, limiting the efficiency of data-driven approaches in healthcare. To address these challenges, Natural Language Processing (NLP) offers an alternative for automating the extraction of relevant data from electronic health records (EHRs). In this study, we focus on lung and breast cancer due to their high incidence and the significant impact they have on public health. Early detection and effective data management in both types of cancer are crucial for improving patient outcomes. To enhance the accuracy and efficiency of data extraction, we utilized GMV's NLP tool uQuery, which excels at identifying relevant entities in clinical texts and converting them into standardized formats such as SNOMED and OMOP. uQuery not only detects and classifies entities but also associates them with contextual information, including negated entities, temporal aspects, and patient-related details. In this work, we explore the use of NLP techniques, specifically Named Entity Recognition (NER), to automatically identify and extract key clinical information from EHRs related to these two cancers. A dataset from Health Research Institute Hospital La Fe (IIS La Fe), comprising 200 annotated breast cancer and 400 lung cancer reports, was used, with eight clinical entities manually labeled using the Doccano platform. To perform NER, we fine-tuned the bsc-bio-ehr-en3 model, a RoBERTa-based biomedical linguistic model pre-trained in Spanish. Fine-tuning was performed using the Transformers architecture, enabling accurate recognition of clinical entities in these cancer types. Our results demonstrate strong overall performance, particularly in identifying entities like MET and PAT, although challenges remain with less frequent entities like EVOL.

[69] Exploring the generalization of LLM truth directions on conversational formats

Timour Ichmoukhamedov,David Martens

Main category: cs.CL

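TL;DR: The study finds that the "truth direction" of LLMs generalizes well across short conversations but poorly to longer ones, and proposes adding a fixed key phrase to improve generalization.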

Details

Motivation: To explore how the truth direction of LLMs generalizes across conversational formats, with the aim of making it more reliable as a lie detection tool. Method: Experiments compare generalization of the truth direction on short and long conversations, and a fix is proposed that appends a fixed key phrase to the end of each conversation. Result: The truth direction generalizes well across short conversations but poorly to long ones; adding the fixed key phrase significantly improves generalization to long conversations. Conclusion: LLM lie detectors still need better generalization to new settings, and a fixed key phrase is an effective remedy. Abstract: Several recent works argue that LLMs have a universal truth direction where true and false statements are linearly separable in the activation space of the model. It has been demonstrated that linear probes trained on a single hidden state of the model already generalize across a range of topics and might even be used for lie detection in LLM conversations. In this work we explore how this truth direction generalizes between various conversational formats. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats where the lie appears earlier in the input prompt. We propose a solution that significantly improves this type of generalization by adding a fixed key phrase at the end of each conversation. Our results highlight the challenges towards reliable LLM lie detectors that generalize to new settings.
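
A very simple linear probe of the kind discussed above is the difference-of-class-means direction. The sketch assumes hidden states have already been extracted at one token position (for example, the last token of the conversation or of an appended key phrase); the probe type, function names, and midpoint threshold are assumptions, and the paper may train its probes differently.

```python
import numpy as np

def fit_truth_direction(hidden, labels):
    """Fit a difference-of-means probe on hidden states.

    hidden: (n, d) hidden states taken at one fixed token position
    labels: (n,) truthy for true statements, falsy for false ones
    Returns the direction vector and a scalar decision threshold.
    """
    hidden = np.asarray(hidden, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    mu_true = hidden[labels].mean(axis=0)
    mu_false = hidden[~labels].mean(axis=0)
    direction = mu_true - mu_false
    threshold = float(direction @ (mu_true + mu_false) / 2.0)
    return direction, threshold

def predict_true(hidden, direction, threshold):
    """Classify states as true when their projection exceeds the threshold."""
    return np.asarray(hidden, dtype=float) @ direction > threshold
```

Testing generalization then amounts to fitting the direction on hidden states from one conversational format and calling `predict_true` on states from another; reading the state at a fixed trailing phrase gives the probe a consistent position across formats.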

[70] KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning

Peiqi Sui,Juan Diego Rodriguez,Philippe Laban,Dean Murphy,Joseph P. Dexter,Richard Jean So,Samuel Baker,Pramit Chaudhuri

Main category: cs.CL

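TL;DR: KRISTEVA is the first close reading benchmark for evaluating interpretive reasoning, with 1331 multiple-choice questions that test how large language models (LLMs) perform on literary close reading tasks.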

Details

Motivation: To fill the gap in evaluating LLMs on literary close reading, since existing multi-discipline benchmarks such as MMLU do not cover literature. Method: Three progressively more difficult task sets are designed (extracting stylistic features, retrieving contextual information, and multi-hop reasoning) to test LLM performance. Result: State-of-the-art LLMs show some college-level close reading competency (accuracy 49.7%-69.7%) but trail human evaluators on 10 of 11 tasks. Conclusion: LLM performance on literary close reading is limited and needs further improvement. Abstract: Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, in which they gather textual details to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been evaluated on large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs may seem to understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results find that, while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% - 69.7%), their performances still trail those of experienced human evaluators on 10 out of our 11 tasks.

[71] Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting

Apollinaire Poli Nemkova,Sarath Chandra Lingareddy,Sagnik Ray Choudhury,Mark V. Albert

Main category: cs.CL

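TL;DR: The study examines how well large language models (LLMs) can forecast violent conflict, comparing their parametric knowledge with non-parametric approaches that incorporate external data.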

Details

Motivation: LLMs perform well on natural language tasks, but their use for conflict forecasting, which matters for early warning and humanitarian planning, remains underexplored. Method: Evaluate LLMs on predicting conflict trends and fatalities in two settings: parametric (relying only on pretrained weights) and non-parametric (augmented with external conflict datasets and news). Result: LLMs show some potential for conflict forecasting, and incorporating external knowledge improves their performance. Conclusion: LLMs can be used for conflict forecasting, but they need external data to compensate for gaps in pretrained knowledge. Abstract: Large Language Models (LLMs) have shown impressive performance across natural language tasks, but their ability to forecast violent conflict remains underexplored. We investigate whether LLMs possess meaningful parametric knowledge, encoded in their pretrained weights, to predict conflict escalation and fatalities without external data. This is critical for early warning systems, humanitarian planning, and policy-making. We compare this parametric knowledge with non-parametric capabilities, where LLMs access structured and unstructured context from conflict datasets (e.g., ACLED, GDELT) and recent news reports via Retrieval-Augmented Generation (RAG). Incorporating external information could enhance model performance by providing up-to-date context otherwise missing from pretrained weights. Our two-part evaluation framework spans 2020-2024 across conflict-prone regions in the Horn of Africa and the Middle East. In the parametric setting, LLMs predict conflict trends and fatalities relying only on pretrained knowledge. In the non-parametric setting, models receive summaries of recent conflict events, indicators, and geopolitical developments. We compare predicted conflict trend labels (e.g., Escalate, Stable Conflict, De-escalate, Peace) and fatalities against historical data. Our findings highlight the strengths and limitations of LLMs for conflict forecasting and the benefits of augmenting them with structured external knowledge.

[72] Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries

Martin Capdevila,Esteban Villa Turek,Ellen Karina Chumbe Fernandez,Luis Felipe Polo Galvez,Luis Cadavid,Andrea Marroquin,Rebeca Vargas Quesada,Johanna Crew,Nicole Vallejo Galarraga,Christopher Rodriguez,Diego Gutierrez,Radhi Datla

Main category: cs.CL

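TL;DR: This paper examines differences between variants of Spanish across Latin America and Spain, stressing the importance of regionally localized language models for bridging sociolinguistic gaps and building user trust.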

Details

Motivation: Study the differences between variants of Spanish to show that regionally localized AI models play a key role in inclusivity and user growth. Method: Compare Latin American and Peninsular variants of Spanish through sociocultural and linguistic contextualization. Result: Proposes at least five sub-variants of Spanish to improve the localization of AI models and user trust. Conclusion: Regionally localized language models can bridge linguistic divides, support internationalization strategies, and increase user engagement. Abstract: Large language models are, by definition, based on language. In an effort to underscore the critical need for regional localized models, this paper examines primary differences between variants of written Spanish across Latin America and Spain, with an in-depth sociocultural and linguistic contextualization therein. We argue that these differences effectively constitute significant gaps in the quotidian use of Spanish among dialectal groups by creating sociolinguistic dissonances, to the extent that locale-sensitive AI models would play a pivotal role in bridging these divides. In doing so, this approach informs better and more efficient localization strategies that also serve to more adequately meet inclusivity goals, while securing sustainable active daily user growth in a major low-risk investment geographic area. Therefore, implementing at least the proposed five sub variants of Spanish addresses two lines of action: to foment user trust and reliance on AI language models while also demonstrating a level of cultural, historical, and sociolinguistic awareness that reflects positively on any internationalization strategy.

[73] From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models

Yidan Wang,Yubing Ren,Yanan Cao,Binxing Fang

Main category: cs.CL

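TL;DR: Proposes a hybrid framework that combines logits-based and sampling-based watermarking schemes, with three strategies (serial, parallel, and hybrid) that optimize watermark embedding to balance detectability, robustness, text quality, and security.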

Details

Motivation: With the rise of large language models (LLMs), misuse of AI-generated text is a growing concern, and watermarking is a potential solution. Existing watermarking schemes trade off robustness, text quality, and security against one another. Method: Proposes a hybrid watermarking framework that combines logits-based and sampling-based schemes, embeds watermarks via serial, parallel, and hybrid strategies, and uses token entropy and semantic entropy to optimize embedding. Result: Experiments show the method outperforms existing baselines across various datasets and models, reaching SOTA performance. Conclusion: The framework offers new insight into diverse watermarking paradigms; the code is open source. Abstract: The rise of Large Language Models (LLMs) has heightened concerns about the misuse of AI-generated text, making watermarking a promising solution. Mainstream watermarking schemes for LLMs fall into two categories: logits-based and sampling-based. However, current schemes entail trade-offs among robustness, text quality, and security. To mitigate this, we integrate logits-based and sampling-based schemes, harnessing their respective strengths to achieve synergy. In this paper, we propose a versatile symbiotic watermarking framework with three strategies: serial, parallel, and hybrid. The hybrid framework adaptively embeds watermarks using token entropy and semantic entropy, optimizing the balance between detectability, robustness, text quality, and security. Furthermore, we validate our approach through comprehensive experiments on various datasets and models. Experimental results indicate that our method outperforms existing baselines and achieves state-of-the-art (SOTA) performance. We believe this framework provides novel insights into diverse watermarking paradigms. Our code is available at https://github.com/redwyd/SymMark.
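To make the term "logits-based" concrete, the following is a minimal sketch of a generic green-list scheme of that family, not the SymMark method itself. It assumes the common design in which the previous token seeds a pseudo-random split of the vocabulary and a bias is added to the "green" half. The names `green_list`, `watermark_logits`, `gamma`, `delta`, and `key` are illustrative assumptions.

```python
import random

def green_list(prev_token, vocab_size, gamma=0.5, key=42):
    """Pseudo-randomly pick a gamma fraction of token ids, seeded by the previous token."""
    rng = random.Random(key * 1_000_003 + prev_token)  # hypothetical seeding rule
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def watermark_logits(logits, prev_token, delta=2.0, gamma=0.5, key=42):
    """Add a bias delta to the logits of green-list tokens before sampling."""
    green = green_list(prev_token, len(logits), gamma, key)
    return [x + delta if i in green else x for i, x in enumerate(logits)]

logits = [0.0] * 10                     # toy vocabulary of 10 tokens
biased = watermark_logits(logits, prev_token=7)
```

A detector that knows `key` can recompute each green list and count how many generated tokens fall inside it. A sampling-based scheme would leave the logits untouched and instead steer the random choice made during sampling.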

[74] Rethinking Prompt Optimizers: From Prompt Merits to Optimization

Zixiao Zhu,Hanzhang Zhou,Zijian Feng,Tianjiao Li,Chua Jia Jim Deryl,Mak Lee Onn,Gee Wah Ng,Kezhi Mao

Main category: cs.CL

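TL;DR: MePO is a lightweight, locally deployable prompt optimizer that improves performance using model-agnostic prompt quality merits and works for models of different scales.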

Details

Motivation: Existing prompt optimization methods rely on large LLMs to generate complex prompts, which can degrade the performance of lightweight models, so a more general and efficient approach is needed. Method: Identify model-agnostic prompt quality merits and train MePO, a lightweight prompt optimizer, on a dataset generated by a lightweight LLM according to those merits. Result: MePO performs well across diverse tasks and model types while reducing cost and privacy risk. Conclusion: MePO offers a scalable and robust prompt optimization solution suited to real-world deployment. Abstract: Prompt optimization (PO) offers a practical alternative to fine-tuning large language models (LLMs), enabling performance improvements without altering model weights. Existing methods typically rely on advanced, large-scale LLMs like GPT-4 to generate optimized prompts. However, due to limited downward compatibility, verbose, instruction-heavy prompts from advanced LLMs can overwhelm lightweight inference models and degrade response quality. In this work, we rethink prompt optimization through the lens of interpretable design. We first identify a set of model-agnostic prompt quality merits and empirically validate their effectiveness in enhancing prompt and response quality. We then introduce MePO, a merit-guided, lightweight, and locally deployable prompt optimizer trained on our preference dataset built from merit-aligned prompts generated by a lightweight LLM. Unlike prior work, MePO avoids online optimization reliance, reduces cost and privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models. Experiments demonstrate that MePO achieves better results across diverse tasks and model types, offering a scalable and robust solution for real-world deployment. Our model and dataset are available at: https://github.com/MidiyaZhu/MePO

[75] Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph

Deeksha Prahlad,Chanhee Lee,Dongha Kim,Hokeun Kim

Main category: cs.CL

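TL;DR: The paper proposes retrieval-augmented generation (RAG) based on knowledge graphs (KGs) to address hallucination when large language models (LLMs) generate personalized responses.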

Details

Motivation: LLMs tend to produce incorrect information (hallucinations) because of over-fitting, mainly for lack of timely, factual, and personalized input data. Method: Use retrieval-augmented generation (RAG) over knowledge graphs (KGs) to supply LLMs with structured, continuously updated personal data such as calendar data. Result: Experiments show the approach understands personal information and generates accurate responses significantly better than baseline LLMs, with a moderate reduction in response time. Conclusion: KG-based RAG effectively reduces LLM hallucination and improves the accuracy of personalized responses. Abstract: The advent of large language models (LLMs) has allowed numerous applications, including the generation of queried responses, to be leveraged in chatbots and other conversational assistants. Being trained on a plethora of data, LLMs often undergo high levels of over-fitting, resulting in the generation of extra and incorrect data, thus causing hallucinations in output generation. One of the root causes of such problems is the lack of timely, factual, and personalized information fed to the LLM. In this paper, we propose an approach to address these problems by introducing retrieval augmented generation (RAG) using knowledge graphs (KGs) to assist the LLM in personalized response generation tailored to the users. KGs have the advantage of storing continuously updated factual information in a structured way. While our KGs can be used for a variety of frequently updated personal data, such as calendar, contact, and location data, we focus on calendar data in this paper. Our experimental results show that our approach works significantly better in understanding personal information and generating accurate responses compared to the baseline LLMs using personal data as text inputs, with a moderate reduction in response time.
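The idea of feeding calendar data to an LLM through a knowledge graph can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: the triple schema (`starts_at`, `located_at`), the word-overlap retrieval rule, and all function names are hypothetical.

```python
def build_calendar_kg(events):
    """Turn calendar events (dicts) into (subject, relation, object) triples."""
    triples = []
    for ev in events:
        triples.append((ev["title"], "starts_at", ev["start"]))
        triples.append((ev["title"], "located_at", ev["location"]))
    return triples

def retrieve(triples, question):
    """Keep triples whose subject shares at least one word with the question."""
    words = set(question.lower().replace("?", "").split())
    return [t for t in triples if words & set(t[0].lower().split())]

def build_prompt(question, facts):
    """Prepend the retrieved facts to the user question."""
    lines = [f"{s} {r} {o}" for s, r, o in facts]
    return "Known facts:\n" + "\n".join(lines) + f"\nQuestion: {question}"

events = [
    {"title": "Dentist appointment", "start": "2025-06-03 09:00", "location": "Main St Clinic"},
    {"title": "Team meeting", "start": "2025-06-03 14:00", "location": "Room 2"},
]
triples = build_calendar_kg(events)
question = "When is my dentist appointment?"
facts = retrieve(triples, question)
prompt = build_prompt(question, facts)
```

Only the triples about the dentist appointment reach the prompt, so the model sees a short, structured context instead of the whole calendar as raw text.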

[76] DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs

Lake Yin,Fan Huang

Main category: cs.CL

TL;DR: The paper proposes DIF, a method for measuring implicit bias in LLMs, and validates its effectiveness.

Details

Motivation: Implicit bias in LLMs is both an ethical and a technical issue, yet there is no standard method for benchmarking it. Method: Develop the DIF metric by evaluating existing LLM logic and math problem datasets combined with sociodemographic personas. Result: DIF statistically validates the presence of implicit bias in LLMs and shows an inverse relationship between question answering accuracy and bias. Conclusion: DIF provides an interpretable benchmark for implicit bias in LLMs and supports a link between bias and model performance. Abstract: As Large Language Models (LLMs) have risen in prominence over the past few years, there has been concern over the potential biases in LLMs inherited from the training data. Previous studies have examined how LLMs exhibit implicit bias, such as when response generation changes when different social contexts are introduced. We argue that this implicit bias is not only an ethical, but also a technical issue, as it reveals an inability of LLMs to accommodate extraneous information. However, unlike other measures of LLM intelligence, there are no standard methods to benchmark this specific subset of LLM bias. To bridge this gap, we developed a method for calculating an easily interpretable benchmark, DIF (Demographic Implicit Fairness), by evaluating preexisting LLM logic and math problem datasets with sociodemographic personas. We demonstrate that this method can statistically validate the presence of implicit bias in LLM behavior and find an inverse trend between question answering accuracy and implicit bias, supporting our argument.

[77] CAFE: Retrieval Head-based Coarse-to-Fine Information Seeking to Enhance Multi-Document QA Capability

Han Peng,Jinhao Jiang,Zican Dong,Wayne Xin Zhao,Lei Fang

Main category: cs.CL

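TL;DR: The paper proposes CAFE, a two-stage coarse-to-fine method that improves multi-document question answering and significantly outperforms existing baselines.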

Details

Motivation: Although the input context length of LLMs has grown, retrieval and reasoning over long-context inputs remain weak, and existing methods struggle to balance retrieval precision and recall. Method: CAFE works in two stages: 1) coarse-grained filtering, which uses retrieval heads to identify and rank relevant documents; 2) fine-grained steering, which guides attention to the most relevant content. Result: On the Mistral model, CAFE improves SubEM by up to 22.1% over SFT and up to 13.7% over RAG methods. Conclusion: By gradually eliminating the negative impact of background and distracting documents, CAFE significantly improves multi-document question answering. Abstract: Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods propose to utilize the prompt strategy and retrieval head to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, impacting their efficacy in answering questions. To address this, we introduce CAFE, a two-stage coarse-to-fine method to enhance multi-document question-answering capacities. By gradually eliminating the negative impacts of background and distracting documents, CAFE makes the responses more reliant on the evidence documents. Initially, a coarse-grained filtering method leverages retrieval heads to identify and rank relevant documents. Then, a fine-grained steering method guides attention to the most relevant content. Experiments across benchmarks show CAFE outperforms baselines, achieving up to 22.1% and 13.7% SubEM improvement over SFT and RAG methods on the Mistral model, respectively.

[78] Dark LLMs: The Growing Threat of Unaligned AI Models

Michael Fire,Yitzhak Elbazis,Adi Wasenstein,Lior Rokach

Main category: cs.CL

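TL;DR: The paper examines the jailbreak vulnerability of large language models (LLMs), argues that unfiltered content in their training data makes them susceptible to attack, presents a universal jailbreak attack, and exposes shortcomings in the industry's AI safety practices.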

Details

Motivation: The wide adoption of LLMs comes with the threat of jailbreak attacks; problematic content in the training data leaves models vulnerable and can lead to harmful outputs. The study aims to expose this vulnerability and its risks. Method: The authors present a universal jailbreak attack, test it against multiple state-of-the-art LLMs, and disclose it to providers to assess the industry's response. Result: Most tested models remain vulnerable, and the industry's response to the vulnerability was inadequate, highlighting gaps in AI safety practice. Conclusion: As LLMs proliferate and open-source models spread, the risk of misuse grows, and decisive measures are needed to prevent the spread of dangerous knowledge. Abstract: Large Language Models (LLMs) rapidly reshape modern life, advancing fields from healthcare to education and beyond. However, alongside their remarkable capabilities lies a significant threat: the susceptibility of these models to jailbreaking. The fundamental vulnerability of LLMs to jailbreak attacks stems from the very data they learn from. As long as this training data includes unfiltered, problematic, or 'dark' content, the models can inherently learn undesirable patterns or weaknesses that allow users to circumvent their intended safety controls. Our research identifies the growing threat posed by dark LLMs: models deliberately designed without ethical guardrails or modified through jailbreak techniques. In our research, we uncovered a universal jailbreak attack that effectively compromises multiple state-of-the-art models, enabling them to answer almost any question and produce harmful outputs upon request. The main idea of our attack was published online over seven months ago. However, many of the tested LLMs were still vulnerable to this attack. Despite our responsible disclosure efforts, responses from major LLM providers were often inadequate, highlighting a concerning gap in industry practices regarding AI safety. As model training becomes more accessible and cheaper, and as open-source LLMs proliferate, the risk of widespread misuse escalates. Without decisive intervention, LLMs may continue democratizing access to dangerous knowledge, posing greater risks than anticipated.

[79] Designing and Contextualising Probes for African Languages

Wisdom Aduah,Francois Meyer

Main category: cs.CL

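TL;DR: This paper systematically studies how pretrained language models (PLMs) encode African languages and finds that PLMs adapted for African languages capture more target-language features than massively multilingual PLMs.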

Details

Motivation: Investigate how PLMs encode linguistic knowledge about African languages, filling a gap in existing research. Method: Train layer-wise probes to analyze how linguistic features are distributed for six African languages, and design control tasks to validate probe performance. Result: PLMs adapted for African languages encode more target-language information; syntactic information concentrates in the middle-to-last layers, while semantic information is distributed more widely. Conclusion: The study confirms that probe performance reflects the internal knowledge of PLMs and offers a basis for improving African-language PLMs. Abstract: Pretrained language models (PLMs) for African languages are continually improving, but the reasons behind these advances remain unclear. This paper presents the first systematic investigation into probing PLMs for linguistic knowledge about African languages. We train layer-wise probes for six typologically diverse African languages to analyse how linguistic features are distributed. We also design control tasks, a way to interpret probe performance, for the MasakhaPOS dataset. We find PLMs adapted for African languages to encode more linguistic information about target languages than massively multilingual PLMs. Our results reaffirm previous findings that token-level syntactic information concentrates in middle-to-last layers, while sentence-level semantic information is distributed across all layers. Through control tasks and probing baselines, we confirm that performance reflects the internal knowledge of PLMs rather than probe memorisation. Our study applies established interpretability techniques to African-language PLMs. In doing so, we highlight the internal mechanisms underlying the success of strategies like active learning and multilingual adaptation.
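A control task, in the usual sense from the probing literature, replaces the real labels with labels that depend only on word identity, so a probe can solve it only by memorising. The sketch below shows that construction and the commonly reported "selectivity" gap. It is an assumption that the paper follows this standard recipe; the function names and the toy tokens are illustrative.

```python
import random

def control_labels(tokens, tagset, seed=0):
    """Assign each word type one fixed random tag, independent of its context."""
    rng = random.Random(seed)
    mapping = {}
    for tok in tokens:
        if tok not in mapping:
            mapping[tok] = rng.choice(tagset)
    return [mapping[tok] for tok in tokens]

def selectivity(task_accuracy, control_accuracy):
    """Gap between real-task and control-task probe accuracy."""
    return task_accuracy - control_accuracy

tokens = ["w1", "w2", "w1", "w3", "w2"]          # placeholder word types
tagset = ["NOUN", "VERB", "ADJ"]
labels = control_labels(tokens, tagset)
gap = selectivity(0.9, 0.6)                       # hypothetical accuracies
```

A large gap suggests the probe is reading information out of the representations; a small gap suggests it is memorising word-label pairs.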

[80] XRAG: Cross-lingual Retrieval-Augmented Generation

Wei Liu,Sony Trenous,Leonardo F. R. Ribeiro,Bill Byrne,Felix Hieber

Main category: cs.CL

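TL;DR: XRAG is a new benchmark for evaluating LLMs in cross-lingual Retrieval-Augmented Generation (RAG), in particular when the user language does not match the language of the retrieval results.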

Details

Motivation: Study the generation abilities of LLMs in cross-lingual RAG, especially when languages do not match. Method: Build the XRAG benchmark from news articles so that questions require external knowledge, covering monolingual and multilingual retrieval scenarios and providing relevancy annotations. Result: Experiments show that LLMs struggle with response language correctness in monolingual retrieval, while cross-lingual reasoning is the main challenge in multilingual retrieval. Conclusion: XRAG is a useful tool for studying LLM reasoning abilities and reveals new challenges in cross-lingual RAG. Abstract: We propose XRAG, a novel benchmark designed to evaluate the generation abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG) settings where the user language does not match the retrieval results. XRAG is constructed from recent news articles to ensure that its questions require external knowledge to be answered. It covers the real-world scenarios of monolingual and multilingual retrieval, and provides relevancy annotations for each retrieved document. Our novel dataset construction pipeline results in questions that require complex reasoning, as evidenced by the significant gap between human and LLM performance. Consequently, XRAG serves as a valuable benchmark for studying LLM reasoning abilities, even before considering the additional cross-lingual complexity. Experimental results on five LLMs uncover two previously unreported challenges in cross-lingual RAG: 1) in the monolingual retrieval setting, all evaluated models struggle with response language correctness; 2) in the multilingual retrieval setting, the main challenge lies in reasoning over retrieved information across languages rather than generation of non-English text.

[81] What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs

Xinlan Yan,Di Wu,Yibin Lei,Christof Monz,Iacer Calixto

Main category: cs.CL

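TL;DR: S-MedQA is an English medical question-answering dataset for benchmarking large language models in fine-grained clinical specialties. The study finds that fine-tuning on a specialty does not always give the best performance on it, and that gains come more from domain shifting than from knowledge injection.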

Details

Motivation: Test whether the knowledge injection hypothesis holds in medical QA and examine the role of fine-tuning data. Method: Use the S-MedQA dataset to analyze how fine-tuning on different specialties affects model performance, and track changes in the token probabilities of clinically relevant terms. Result: Fine-tuning on a specialty does not necessarily give the best performance on that specialty; probabilities of clinically relevant terms rise across all specialties; gains come mainly from domain shifting. Conclusion: The role of fine-tuning data in the medical domain should be rethought; the dataset and code are released to support further research. Abstract: In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset for benchmarking large language models in fine-grained clinical specialties. We use S-MedQA to check the applicability of a popular hypothesis related to knowledge injection in the knowledge-intense scenario of medical QA, and show that: 1) training on data from a specialty does not necessarily lead to best performance on that specialty and 2) regardless of the specialty fine-tuned on, token probabilities of clinically relevant terms for all specialties increase consistently. Thus, we believe improvement gains come mostly from domain shifting (e.g., general to medical) rather than knowledge injection and suggest rethinking the role of fine-tuning data in the medical domain. We release S-MedQA and all code needed to reproduce all our experiments to the research community.

[82] GE-Chat: A Graph Enhanced RAG Framework for Evidential Response Generation of LLMs

Longchao Da,Parth Mitesh Shah,Kuan-Ru Liou,Jiaxing Zhang,Hua Wei

Main category: cs.CL

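TL;DR: The paper proposes GE-Chat, a framework that enhances retrieval-augmented generation with a knowledge graph to make the evidence behind LLM responses more reliable.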

Details

Motivation: LLM outputs are unreliable and prone to hallucination, so users must verify them manually, which undermines trust. Method: Combine a knowledge graph, retrieval-augmented generation, Chain-of-Thought (CoT) logic generation, n-hop sub-graph searching, and entailment-based sentence generation to retrieve evidence accurately. Result: The method improves existing models' ability to identify the exact evidence in a free-form context and makes it easier to judge the trustworthiness of an LLM's conclusions. Conclusion: GE-Chat provides reliable evidence for LLM-generated content and helps build user trust. Abstract: Large Language Models are now key assistants in human decision-making processes. However, a common note always seems to follow: "LLMs can make mistakes. Be careful with important info." This points to the reality that not all outputs from LLMs are dependable, and users must evaluate them manually. The challenge deepens as hallucinated responses, often presented with seemingly plausible explanations, create complications and raise trust issues among users. To tackle this issue, this paper proposes GE-Chat, a knowledge Graph enhanced retrieval-augmented generation framework to provide Evidence-based response generation. Specifically, when the user uploads a material document, a knowledge graph will be created, which helps construct a retrieval-augmented agent, enhancing the agent's responses with additional knowledge beyond its training corpus. Then we leverage Chain-of-Thought (CoT) logic generation, n-hop sub-graph searching, and entailment-based sentence generation to realize accurate evidence retrieval. We demonstrate that our method improves the existing models' performance in terms of identifying the exact evidence in a free-form context, providing a reliable way to examine the resources of LLM's conclusion and help with the judgment of the trustworthiness.
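The n-hop sub-graph search mentioned in the abstract can be pictured as a breadth-first expansion from the entities named in a question. The sketch below is a generic version under that assumption; it treats edges as undirected, and the function name and toy triples are illustrative rather than taken from GE-Chat.

```python
from collections import deque

def n_hop_subgraph(edges, seeds, n):
    """Return the triples reachable within n hops of the seed entities (undirected BFS)."""
    neighbours = {}
    for head, relation, tail in edges:
        neighbours.setdefault(head, []).append((head, relation, tail))
        neighbours.setdefault(tail, []).append((head, relation, tail))
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    kept = []
    while queue:
        node = queue.popleft()
        if depth[node] == n:            # do not expand beyond n hops
            continue
        for triple in neighbours.get(node, []):
            if triple not in kept:
                kept.append(triple)
            other = triple[2] if triple[0] == node else triple[0]
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)
    return kept

edges = [                               # toy knowledge graph
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
    ("thrombosis", "affects", "veins"),
]
one_hop = n_hop_subgraph(edges, ["aspirin"], 1)
two_hop = n_hop_subgraph(edges, ["aspirin"], 2)
```

Raising `n` widens the candidate evidence at the cost of more distracting triples, which is presumably why a later filtering step such as entailment checking is useful.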

[83] Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning

Yoichi Ishibashi,Taro Yano,Masafumi Oyamada

Main category: cs.CL

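TL;DR: The paper evaluates Reasoning CPT, a continual pretraining method that uses synthetic data to reconstruct the hidden thought processes behind texts and significantly improves reasoning across multiple domains.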

Details

Motivation: Supervised fine-tuning and reinforcement learning for reasoning models are largely confined to specific domains, which limits data breadth and scalability. Continual pretraining (CPT) needs no task-specific signal, but how to synthesize reasoning data effectively and how it affects other domains remain unclear. Method: Apply Reasoning CPT to Gemma2-9B using synthetic data with hidden thoughts derived from STEM and Law corpora, and compare it with standard CPT on the MMLU benchmark. Result: Reasoning CPT performs better in all evaluated domains, reasoning skills transfer across domains, and the advantage grows with problem difficulty (up to 8 points). The model also adjusts its reasoning depth to problem difficulty. Conclusion: Reasoning CPT with synthetic hidden-thought data significantly improves reasoning, with cross-domain generalization and adaptivity. Abstract: Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, these approaches are primarily applicable to specific domains such as mathematics and programming, which imposes fundamental constraints on the breadth and scalability of training data. In contrast, continual pretraining (CPT) offers the advantage of not requiring task-specific signals. Nevertheless, how to effectively synthesize training data for reasoning and how such data affect a wide range of domains remain largely unexplored. This study provides a detailed evaluation of Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden thought processes underlying texts, based on the premise that texts are the result of the author's thinking process. Specifically, we apply Reasoning CPT to Gemma2-9B using synthetic data with hidden thoughts derived from STEM and Law corpora, and compare it to standard CPT on the MMLU benchmark. Our analysis reveals that Reasoning CPT consistently improves performance across all evaluated domains. Notably, reasoning skills acquired in one domain transfer effectively to others; the performance gap with conventional methods widens as problem difficulty increases, with gains of up to 8 points on the most challenging problems. Furthermore, models trained with hidden thoughts learn to adjust the depth of their reasoning according to problem difficulty.

[84] The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

Seongyun Lee,Seungone Kim,Minju Seo,Yongrae Jo,Dongyoung Go,Hyeonbin Hwang,Jinho Park,Xiang Yue,Sean Welleck,Graham Neubig,Moontae Lee,Minjoon Seo

Main category: cs.CL

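TL;DR: The paper introduces the CoT Encyclopedia, a framework for analyzing and steering model reasoning that automatically extracts, clusters, and interprets reasoning strategies, yielding more comprehensive and interpretable analyses as well as performance gains.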

Details

Motivation: Understand the diversity and effects of long chain-of-thought (CoT) reasoning strategies in modern LLMs; existing methods are constrained by human intuition and cannot capture the full range of model behavior. Method: The CoT Encyclopedia automatically extracts reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them, and derives contrastive rubrics. Result: Human evaluations show the framework is more interpretable and comprehensive than existing methods, and it can predict and steer a model's reasoning strategy to improve performance. Conclusion: Training data format affects reasoning behavior more than data domain does, which underscores the importance of format-aware model design. Abstract: Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we provide practical insights, such as that training data format (e.g., free-form vs. multiple-choice) has a far greater impact on reasoning behavior than data domain, underscoring the importance of format-aware model design.

[85] VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits

Jintian Shao,Hongyi Huang,Jiayi Wu,YiMing Cheng,ZhiYu Wu,You Shan,MingKai Zheng

Main category: cs.CL

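TL;DR: VQ-Logits uses Vector Quantization (VQ) to cut the parameter count and compute cost of the LLM output layer, with up to 99% fewer output-layer parameters and 6x faster logit computation for only a 4% increase in perplexity.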

Details

Motivation: Address the high parameter count and compute cost of the LLM output layer caused by large vocabularies. Method: Replace the large output embedding matrix with a small shared codebook and map each vocabulary token to a codebook vector via vector quantization. Result: On standard benchmarks, output-layer parameters drop by up to 99% and logit computation is 6x faster, with only a 4% increase in perplexity. Conclusion: VQ-Logits is an efficient and robust way to reduce the resource demands of the LLM output layer. Abstract: Large Language Models (LLMs) have achieved remarkable success but face significant computational and memory challenges, particularly due to their extensive output vocabularies. The final linear projection layer, mapping hidden states to vocabulary-sized logits, often constitutes a substantial portion of the model's parameters and computational cost during inference. Existing methods like adaptive softmax or hierarchical softmax introduce structural complexities. In this paper, we propose VQ-Logits, a novel approach that leverages Vector Quantization (VQ) to drastically reduce the parameter count and computational load of the LLM output layer. VQ-Logits replaces the large V × d_model output embedding matrix with a small, shared codebook of K embedding vectors (K << V). Each token in the vocabulary is mapped to one of these K codebook vectors. The LLM predicts logits over this compact codebook, which are then efficiently "scattered" to the full vocabulary space using the learned or preassigned mapping. We demonstrate through extensive experiments on standard language modeling benchmarks (e.g., WikiText-103, C4) that VQ-Logits can achieve up to 99% parameter reduction in the output layer and 6x speedup in logit computation, with only a marginal 4% increase in perplexity compared to full softmax baselines. We further provide detailed ablation studies on codebook size, initialization, and learning strategies, showcasing the robustness and effectiveness of our approach.
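The "predict over the codebook, then scatter" step can be illustrated with a few lines of plain Python. This is a minimal sketch assuming a preassigned token-to-code mapping; the function names, the toy sizes, and the choice K = V/100 in the parameter count are assumptions for illustration, not values from the paper.

```python
def codebook_logits(hidden, codebook):
    """Dot product of the hidden state with each of the K codebook vectors."""
    return [sum(h * c for h, c in zip(hidden, vec)) for vec in codebook]

def scatter_logits(code_logits, token_to_code):
    """Give every vocabulary token the logit of the codebook entry it maps to."""
    return [code_logits[k] for k in token_to_code]

def output_layer_params(vocab_size, d_model, num_codes):
    """Parameter counts: full V x d_model matrix vs. K x d_model codebook (mapping not counted)."""
    return vocab_size * d_model, num_codes * d_model

hidden = [1.0, 2.0]                               # d_model = 2
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # K = 3
token_to_code = [0, 2, 1, 1, 0, 2]                # V = 6, one code index per token
code_logits = codebook_logits(hidden, codebook)
vocab_logits = scatter_logits(code_logits, token_to_code)
full, compact = output_layer_params(50_000, 1024, 500)
```

Tokens that share a code receive identical logits, which is the source of the perplexity cost. With K set to 1% of V, the codebook holds 1% of the original matrix's parameters, the same order as the "up to 99%" reduction quoted in the abstract.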

[86] RAIDEN-R1: Improving Role-awareness of LLMs via GRPO with Verifiable Reward

Zongsheng Wang,Kaili Sun,Bowen Wu,Qun Yu,Ying Li,Baoxun Wang

Main category: cs.CL

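TL;DR: RAIDEN-R1 is a new reinforcement learning framework that uses a Verifiable Role-Awareness Reward (VRAR) to improve the role consistency of role-playing conversational agents (RPCAs). Experiments show its 14B-GRPO model performs strongly on the benchmark.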

Details

Motivation: Address the difficulty role-playing conversational agents (RPCAs) have in maintaining role consistency. Method: Propose the RAIDEN-R1 framework, which integrates the VRAR reward, uses singular and multi-term mining strategies, and builds a high-quality role-aware Chain-of-Thought dataset. Result: The 14B-GRPO model reaches 88.04% and 88.65% accuracy on the Script-Based Knowledge and Conversation Memory metrics, respectively, outperforming baseline models. Conclusion: RAIDEN-R1 bridges the non-quantifiability gap in RPCA training and advances role-aware reasoning. Abstract: Role-playing conversational agents (RPCAs) face persistent challenges in maintaining role consistency. To address this, we propose RAIDEN-R1, a novel reinforcement learning framework that integrates Verifiable Role-Awareness Reward (VRAR). The method introduces both singular and multi-term mining strategies to generate quantifiable rewards by assessing role-specific keys. Additionally, we construct a high-quality, role-aware Chain-of-Thought dataset through multi-LLM collaboration, and implement experiments to enhance reasoning coherence. Experiments on the RAIDEN benchmark demonstrate RAIDEN-R1's superiority: our 14B-GRPO model achieves 88.04% and 88.65% accuracy on Script-Based Knowledge and Conversation Memory metrics, respectively, outperforming baseline models while maintaining robustness. Case analyses further reveal the model's enhanced ability to resolve conflicting contextual cues and sustain first-person narrative consistency. This work bridges the non-quantifiability gap in RPCA training and provides insights into role-aware reasoning patterns, advancing the development of RPCAs.
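One plausible reading of "quantifiable rewards by assessing role-specific keys" is a string-match reward that can be checked automatically. The sketch below follows that reading only; the exact VRAR definition is not given in the abstract, and the function names, the case-insensitive substring rule, and the toy persona are all assumptions.

```python
def single_term_reward(response, key):
    """1.0 if the single role-specific key appears in the response, else 0.0."""
    return 1.0 if key.lower() in response.lower() else 0.0

def multi_term_reward(response, keys):
    """Fraction of role-specific keys that appear in the response."""
    if not keys:
        return 0.0
    text = response.lower()
    return sum(1 for k in keys if k.lower() in text) / len(keys)

# Toy persona: the reply should name the character, ship, and home port.
response = "I am Captain Mira, and I sail the Dawn Star out of Port Vell."
r_single = single_term_reward(response, "Mira")
r_multi = multi_term_reward(response, ["Mira", "Dawn Star", "Port Hale"])
```

Because the reward is computed by a fixed rule rather than a learned judge, it can be verified independently, which is what makes it usable as a signal for GRPO-style training.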

Poli Apollinaire Nemkova,Solomon Ubani,Mark V. Albert

Main category: cs.CL

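TL;DR: The study evaluates several state-of-the-art large language models (GPT-3.5, GPT-4, and others) on zero-shot and few-shot binary classification of Russian and Ukrainian social media posts for references to human rights violations, and compares the results with human annotations.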

Details

Motivation: Explore how reliable and applicable LLMs are for sensitive, domain-specific tasks in multilingual settings, especially for subjective and context-dependent judgments. Method: Annotate with multiple LLMs in zero-shot and few-shot settings, compare against human annotations, and analyze performance and error patterns under different prompting conditions. Result: The study identifies each model's strengths and limitations in cross-linguistic adaptability and annotation. Conclusion: LLMs show some potential for sensitive tasks, but their performance depends on language and context and needs further work before real-world use. Abstract: In the era of increasingly sophisticated natural language processing (NLP) systems, large language models (LLMs) have demonstrated remarkable potential for diverse applications, including tasks requiring nuanced textual understanding and contextual reasoning. This study investigates the capabilities of multiple state-of-the-art LLMs - GPT-3.5, GPT-4, LLAMA3, Mistral 7B, and Claude-2 - for zero-shot and few-shot annotation of a complex textual dataset comprising social media posts in Russian and Ukrainian. Specifically, the focus is on the binary classification task of identifying references to human rights violations within the dataset. To evaluate the effectiveness of these models, their annotations are compared against a gold standard set of human double-annotated labels across 1000 samples. The analysis includes assessing annotation performance under different prompting conditions, with prompts provided in both English and Russian. Additionally, the study explores the unique patterns of errors and disagreements exhibited by each model, offering insights into their strengths, limitations, and cross-linguistic adaptability. By juxtaposing LLM outputs with human annotations, this research contributes to understanding the reliability and applicability of LLMs for sensitive, domain-specific tasks in multilingual contexts. It also sheds light on how language models handle inherently subjective and context-dependent judgments, a critical consideration for their deployment in real-world scenarios.

[88] The Evolving Landscape of Generative Large Language Models and Traditional Natural Language Processing in Medicine

Rui Yang,Huitao Li,Matthew Yu Heng Wong,Yuhe Ke,Xin Li,Kunyu Yu,Jingchi Liao,Jonathan Chong Kai Liew,Sabarinath Vinod Nair,Jasmine Chiat Ling Ong,Irene Li,Douglas Teodoro,Chuan Hong,Daniel Shu Wei Ting,Nan Liu

Main category: cs.CL

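TL;DR: Generative large language models (LLMs) do better on open-ended tasks, while traditional NLP dominates information extraction and analysis tasks.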

Details

Motivation: Explore how generative LLMs and traditional NLP differ across medical tasks. Method: Analyze 19,123 studies. Result: Generative LLMs show advantages in open-ended tasks, while traditional NLP is more effective in information extraction and analysis tasks. Conclusion: As these technologies advance, their ethical use in medical applications must be ensured. Abstract: Natural language processing (NLP) has been traditionally applied to medicine, and generative large language models (LLMs) have become prominent recently. However, the differences between them across different medical tasks remain underexplored. We analyzed 19,123 studies, finding that generative LLMs demonstrate advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. As these technologies advance, ethical use of them is essential to ensure their potential in medical applications.

[89] From Questions to Clinical Recommendations: Large Language Models Driving Evidence-Based Clinical Decision Making

Dubai Li,Nan Jiang,Kangping Huang,Ruiqi Tu,Shuyu Ouyang,Huayu Yu,Lin Qiao,Chen Yu,Tianshu Zhou,Danyang Tong,Qian Wang,Mengtao Li,Xiaofeng Zeng,Yu Tian,Xinping Tian,Jingsong Li

Main category: cs.CL

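TL;DR: Quicker is a clinical decision support system powered by large language models that automates evidence synthesis and generates clinical recommendations, making decisions faster and more accurate.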

Details

Motivation: Integrating clinical evidence into real-time practice is hard because of heavy workload, complex processes, and time constraints, so automated tools are needed to support efficient decision-making. Method: Quicker implements an automated chain covering every phase from question to clinical recommendation and supports customized decision-making through an interactive user interface. Result: Experiments show Quicker performs strongly in question decomposition, literature screening, and recommendation generation, and markedly shortens recommendation development time. Conclusion: Quicker can help physicians make faster and more reliable evidence-based clinical decisions. Abstract: Clinical evidence, derived from rigorous research and data analysis, provides healthcare professionals with reliable scientific foundations for informed decision-making. Integrating clinical evidence into real-time practice is challenging due to the enormous workload, complex professional processes, and time constraints. This highlights the need for tools that automate evidence synthesis to support more efficient and accurate decision making in clinical settings. This study introduces Quicker, an evidence-based clinical decision support system powered by large language models (LLMs), designed to automate evidence synthesis and generate clinical recommendations modeled after standard clinical guideline development processes. Quicker implements a fully automated chain that covers all phases, from questions to clinical recommendations, and further enables customized decision-making through integrated tools and interactive user interfaces. To evaluate Quicker's capabilities, we developed the Q2CRBench-3 benchmark dataset, based on clinical guideline development records for three different diseases. Experimental results highlighted Quicker's strong performance, with fine-grained question decomposition tailored to user preferences, retrieval sensitivities comparable to human experts, and literature screening performance approaching comprehensive inclusion of relevant studies. In addition, Quicker-assisted evidence assessment effectively supported human reviewers, while Quicker's recommendations were more comprehensive and logically coherent than those of clinicians. In system-level testing, collaboration between a single reviewer and Quicker reduced the time required for recommendation development to 20-40 minutes. In general, our findings affirm the potential of Quicker to help physicians make quicker and more reliable evidence-based clinical decisions.

[90] J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

Chenxi Whitehouse,Tianlu Wang,Ping Yu,Xian Li,Jason Weston,Ilia Kulikov,Swarnadeep Saha

Main category: cs.CL

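TL;DR: The paper proposes J1, a reinforcement learning approach for training LLM-as-a-Judge models that uses verifiable rewards to improve judgment ability and outperforms existing models on multiple benchmarks.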

Details

Motivation: Progress in AI is limited by evaluation quality. LLM-as-a-Judge models have become a core solution, but better training recipes are needed to strengthen their reasoning. Method: J1 is trained with reinforcement learning, converting both verifiable and non-verifiable prompts into judgment tasks with verifiable rewards that incentivize thinking and reduce judgment bias. Result: J1 outperforms existing models at both the 8B and 70B scales, including models distilled from DeepSeek-R1, and surpasses R1 on some benchmarks. Conclusion: Reinforcement learning markedly improves the judgment ability of LLM-as-a-Judge models, which learn to outline evaluation criteria, compare against self-generated reference answers, and re-evaluate the correctness of responses. Abstract: The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.

[91] LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations

Yile Wang,Zhanyu Shen,Hui Huang

Main category: cs.CL

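TL;DR: Proposes LDIR, a low-dimensional, dense, and interpretable text embedding method whose dimensions express semantic relatedness to anchor texts chosen by farthest point sampling. It performs close to black-box models and outperforms other interpretable embedding baselines.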

Details

Motivation: Existing text embeddings such as SimCSE and LLM2Vec perform well but are hard to interpret, while classic sparse embeddings such as bag-of-words perform poorly. A low-dimensional, dense, and interpretable text embedding is needed. Method: LDIR produces low-dimensional (fewer than 500 dimensions) dense embeddings in which each value indicates semantic relatedness to a different anchor text, with anchors selected by farthest point sampling. Result: On multiple semantic textual similarity, retrieval, and clustering tasks, LDIR performs close to black-box baselines and outperforms other interpretable embedding baselines. Conclusion: LDIR is a low-dimensional, dense, and interpretable text embedding method that combines performance with interpretability. Abstract: Semantic text representation is a fundamental task in the field of natural language processing. Existing text embeddings (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as a classic sparse interpretable embedding, suffers from poor performance. Recently, Benara et al. (2024) proposed interpretable text embeddings using large language models, which form "0/1" embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts through farthest point sampling, offering both semantic representation as well as a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embeddings baselines with much fewer dimensions. Code is available at https://github.com/szu-tera/LDIR.
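Illustrative sketch (not the authors' code): the abstract describes dimensions that measure relatedness to anchor texts picked by farthest point sampling. The snippet below shows that idea on toy 2-D vectors, assuming cosine similarity as the relatedness measure and a pre-computed vector per text. The function names and the toy corpus are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def farthest_point_sampling(vectors, k, start=0):
    """Greedily pick k indices; each new pick is farthest from the picks so far."""
    chosen = [start]
    # min_dist[i]: cosine distance from vectors[i] to its nearest chosen anchor
    min_dist = [1.0 - cosine(v, vectors[start]) for v in vectors]
    while len(chosen) < min(k, len(vectors)):
        nxt = max(range(len(vectors)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], 1.0 - cosine(v, vectors[nxt]))
    return chosen

def relative_embedding(vec, anchors):
    """One interpretable dimension per anchor: similarity of vec to that anchor."""
    return [cosine(vec, a) for a in anchors]

# Toy "corpus" of base vectors (hypothetical stand-ins for encoder outputs).
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
anchor_ids = farthest_point_sampling(corpus, k=3)
anchors = [corpus[i] for i in anchor_ids]
embedding = relative_embedding([1.0, 0.0], anchors)
```

Each entry of `embedding` can be read as "how related is this text to anchor i", which is the traceability property the abstract emphasizes. The number of anchors sets the embedding dimension.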

Chunyu Ye,Shaonan Wang

Main category: cs.CL

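TL;DR: Proposes a multimodal framework that uses vision-language models to decode language from brain activity, supporting visual, auditory, and textual inputs.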

Details

Motivation: Human thought is inherently multimodal, yet prior work is mostly limited to single-modality inputs, so a more flexible way to decode language from brain activity is needed. Method: The approach uses vision-language models (VLMs) with modality-specific experts that jointly interpret information across modalities. Result: Experiments show performance comparable to state-of-the-art systems while being more adaptable and extensible. Conclusion: The work advances mind decoding that is more ecologically valid and generalizable. Abstract: Decoding thoughts from brain activity offers valuable insights into human cognition and enables promising applications in brain-computer interaction. While prior studies have explored language reconstruction from fMRI data, they are typically limited to single-modality inputs such as images or audio. In contrast, human thought is inherently multimodal. To bridge this gap, we propose a unified and flexible framework for reconstructing coherent language from brain recordings elicited by diverse input modalities: visual, auditory, and textual. Our approach leverages visual-language models (VLMs), using modality-specific experts to jointly interpret information across modalities. Experiments demonstrate that our method achieves performance comparable to state-of-the-art systems while remaining adaptable and extensible. This work advances toward more ecologically valid and generalizable mind decoding.

[93] Multi-domain Multilingual Sentiment Analysis in Industry: Predicting Aspect-based Opinion Quadruples

Benjamin White,Anastasia Shimorina

Main category: cs.CL

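TL;DR: The paper explores the design of an aspect-based sentiment analysis system built on large language models (LLMs), focusing on opinion quadruple extraction, and tests whether a single model generalizes across multi-domain, multilingual data.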

Details

Motivation: The study tests whether a single fine-tuned model can handle multiple domain-specific taxonomies at once, reducing operational complexity. Method: Using internal datasets, the authors build a combined multi-domain model and compare it with specialized single-domain models. Result: The multi-domain model performs comparably to specialized single-domain models while reducing operational complexity. Conclusion: The paper shares lessons on handling non-extractive predictions and evaluating failure modes of LLM-based systems, offering practical guidance for structured prediction tasks. Abstract: This paper explores the design of an aspect-based sentiment analysis system using large language models (LLMs) for real-world use. We focus on quadruple opinion extraction -- identifying aspect categories, sentiment polarity, targets, and opinion expressions from text data across different domains and languages. Using internal datasets, we investigate whether a single fine-tuned model can effectively handle multiple domain-specific taxonomies simultaneously. We demonstrate that a combined multi-domain model achieves performance comparable to specialized single-domain models while reducing operational complexity. We also share lessons learned for handling non-extractive predictions and evaluating various failure modes when developing LLM-based systems for structured prediction tasks.

[94] Rethinking Repetition Problems of LLMs in Code Generation

Yihong Dong,Yuchen Liu,Xue Jiang,Zhi Jin,Ge Li

Main category: cs.CL

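TL;DR: The paper proposes RPG, a grammar-based decoding method that addresses structural repetition in code generation and substantially improves the quality of generated code.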

Details

Motivation: Neural language models have improved code generation, but repetition persists, especially structural repetition. Prior work focuses on content repetition and overlooks the more prevalent structural kind. Method: RPG uses grammar rules to detect repetition during generation and decays the likelihood of the critical tokens that cause it. Result: RPG substantially outperforms baselines on CodeRepetEval, HumanEval, and MBPP, reducing repetition and improving code quality. Conclusion: RPG is an efficient way to mitigate structural repetition in code generation, and experiments confirm its advantage. Abstract: With the advent of neural language models, the performance of code generation has been significantly boosted. However, the problem of repetitions during the generation process continues to linger. Previous work has primarily focused on content repetition, which is merely a fraction of the broader repetition problem in code generation. A more prevalent and challenging problem is structural repetition. In structural repetition, the repeated code appears in various patterns but possesses a fixed structure, which can be inherently reflected in grammar. In this paper, we formally define structural repetition and propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar, to alleviate the repetition problems in code generation for LLMs. Specifically, RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions, thereby mitigating them in code generation. To facilitate this study, we construct a new dataset CodeRepetEval to comprehensively evaluate approaches for mitigating the repetition problems in code generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on CodeRepetEval dataset as well as HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of generated code.

[95] Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation

Yue Guo,Jae Ho Sohn,Gondy Leroy,Trevor Cohen

Main category: cs.CL

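TL;DR: Plain language summaries (PLSs) generated by large language models (LLMs) are hard to distinguish from human-written ones in subjective ratings, but human-written PLSs lead to significantly better reader comprehension. Automated metrics do not agree with human judgment, so comprehension-focused evaluation frameworks are needed.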

Details

Motivation: It is unclear how well LLM-generated plain language summaries (PLSs) support the communication of medical information, and existing evaluation methods have limitations. Method: A large-scale crowdsourced experiment with 150 participants combines subjective ratings (simplicity, informativeness, coherence, faithfulness) with objective multiple-choice comprehension tests, and compares 10 automated metrics against human judgments. Result: LLM-generated PLSs score close to human-written ones in subjective ratings, but human-written PLSs yield significantly better comprehension. Automated metrics fail to reflect human judgment. Conclusion: Evaluation frameworks and generation methods that focus on reader comprehension are needed, and the suitability of automated metrics for evaluating PLSs is in question. Abstract: Plain language summaries (PLSs) are essential for facilitating effective communication between clinicians and patients by making complex medical information easier for laypeople to understand and act upon. Large language models (LLMs) have recently shown promise in automating PLS generation, but their effectiveness in supporting health information comprehension remains unclear. Prior evaluations have generally relied on automated scores that do not measure understandability directly, or subjective Likert-scale ratings from convenience samples with limited generalizability. To address these gaps, we conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using Amazon Mechanical Turk with 150 participants. We assessed PLS quality through subjective Likert-scale ratings focusing on simplicity, informativeness, coherence, and faithfulness; and objective multiple-choice comprehension and recall measures of reader understanding. Additionally, we examined the alignment between 10 automated evaluation metrics and human judgments. Our findings indicate that while LLMs can generate PLSs that appear indistinguishable from human-written ones in subjective evaluations, human-written PLSs lead to significantly better comprehension. Furthermore, automated evaluation metrics fail to reflect human judgment, calling into question their suitability for evaluating PLSs. This is the first study to systematically evaluate LLM-generated PLSs based on both reader preferences and comprehension outcomes. Our findings highlight the need for evaluation frameworks that move beyond surface-level quality and for generation methods that explicitly optimize for layperson comprehension.

[96] Hierarchical Document Refinement for Long-context Retrieval-augmented Generation

Jiajie Jin,Xiaoxi Li,Guanting Dong,Yuyao Zhang,Yutao Zhu,Yongkang Wu,Zhonghua Li,Qi Ye,Zhicheng Dou

Main category: cs.CL

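TL;DR: LongRefiner is an efficient plug-and-play refiner that handles redundancy and noise in long-context RAG applications, substantially reducing computational cost and latency.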

Details

Motivation: Real-world RAG applications often face long-context inputs, where redundant information and noise raise inference cost and degrade performance. Method: LongRefiner combines dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Result: Experiments on seven QA datasets show competitive performance at roughly one tenth of the computational cost and latency of the best baseline. Conclusion: LongRefiner is scalable, efficient, and effective, offering a practical solution for long-text RAG applications. Abstract: Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise result in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios while using 10x lower computational cost and latency compared to the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at https://github.com/ignorejjj/LongRefiner.

[97] Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

Zemin Huang,Zhiyang Chen,Zijun Wang,Tiancheng Li,Guo-Jun Qi

Main category: cs.CL

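TL;DR: DCoLT is a reasoning framework for diffusion language models that treats the intermediate steps of the reverse diffusion process as latent "thinking" actions and optimizes the whole reasoning trajectory with outcome-based reinforcement learning.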

Details

Motivation: Traditional chain-of-thought (CoT) methods are constrained to causal, linear thinking. DCoLT aims for bidirectional, non-linear reasoning without requiring intermediate steps to be grammatically correct. Method: DCoLT is implemented on two diffusion language models, SEDD and LLaDA, with reinforcement learning optimizing the reasoning trajectory. SEDD maximizes reward through a probabilistic policy, and LLaDA through a ranking-based Unmasking Policy Module. Result: On math and code generation tasks, DCoLT-reinforced models outperform other approaches, with LLaDA showing marked gains in reasoning accuracy across several tasks. Conclusion: Non-linear reasoning combined with reinforcement learning substantially improves the reasoning ability of diffusion language models. Abstract: We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, where its concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model -- LLaDA, and find that the order to predict and unmask tokens plays an essential role to optimize its RL action resulting from the ranking-based Unmasking Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, +19.5% on GSM8K, MATH, MBPP, and HumanEval.
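Illustrative sketch (not the authors' implementation): the abstract says the unmasking order is governed by a ranking policy defined by the Plackett-Luce model. Under that model a ranking is built by repeatedly drawing one remaining item with probability proportional to exp(score). The snippet assumes a per-position score is already available; how the UPM produces such scores is not shown, and all names are hypothetical.

```python
import itertools
import math
import random

def sample_order(scores, rng):
    """Sample a full ranking under Plackett-Luce: draw without replacement,
    each time choosing among the remaining items with weight exp(score)."""
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        order.append(pick)
        remaining.remove(pick)
    return order

def log_prob(order, scores):
    """Log-probability of a given ranking under the Plackett-Luce model."""
    remaining = list(order)
    total = 0.0
    for item in order:
        log_norm = math.log(sum(math.exp(scores[i]) for i in remaining))
        total += scores[item] - log_norm
        remaining.remove(item)
    return total

# Hypothetical scores for three masked positions.
scores = [2.0, 0.5, -1.0]
order = sample_order(scores, random.Random(0))

# Sanity check: probabilities over all 3! rankings sum to one.
total_mass = sum(
    math.exp(log_prob(list(p), scores))
    for p in itertools.permutations(range(len(scores)))
)
```

In a policy-gradient setting, `log_prob` is the quantity one would differentiate with respect to the scores and weight by the final-answer reward. The sketch computes it in plain Python only to make the factorization explicit.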

[98] CL-RAG: Bridging the Gap in Retrieval-Augmented Generation with Curriculum Learning

Shaohan Wang,Licheng Zhang,Zheren Fu,Zhendong Mao

Main category: cs.CL

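TL;DR: CL-RAG is a curriculum-learning-based training framework for RAG systems that trains in stages and outperforms existing methods on four open-domain QA datasets.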

Details

Motivation: Existing RAG methods use retrieved documents directly, but document usefulness varies widely, which hurts training. Inspired by human cognitive learning, the authors apply curriculum learning to the training process. Method: Training data with multiple difficulty levels is constructed, and the retriever and generator are trained in stages following a curriculum to improve generalization. Result: On four open-domain QA datasets, CL-RAG improves performance by 2% to 4% over existing methods. Conclusion: Curriculum learning effectively optimizes RAG systems, improving both performance and generalization. Abstract: Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods focus on optimizing the retriever or generator in the RAG system by directly utilizing the top-k retrieved documents. However, document effectiveness varies significantly across user queries, i.e., some documents provide valuable knowledge while others totally lack critical information. This hinders the adaptation of the retriever and generator during training. Inspired by human cognitive learning, curriculum learning trains models using samples progressing from easy to difficult, thus enhancing their generalization ability, and we integrate this effective paradigm into the training of the RAG system. In this paper, we propose a multi-stage Curriculum Learning based RAG system training framework, named CL-RAG. We first construct training data with multiple difficulty levels for the retriever and generator separately through sample evolution. Then, we train the model in stages based on the curriculum learning approach, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our CL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.
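Illustrative sketch (an assumption, not CL-RAG's procedure): the abstract describes training in stages on data of increasing difficulty. The snippet shows only the generic staging step, ordering samples by a difficulty score and slicing them into stages. The `gold_rank` field (rank of the answer-bearing document among retrieved results) is a hypothetical difficulty proxy, and the paper's "sample evolution" step is not reproduced here.

```python
def build_stages(samples, difficulty, num_stages):
    """Sort samples from easy to hard and split them into at most num_stages chunks."""
    ordered = sorted(samples, key=difficulty)
    if not ordered:
        return []
    size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

# Hypothetical training samples; a lower gold_rank is treated as easier.
samples = [{"id": i, "gold_rank": r} for i, r in enumerate([7, 1, 3, 12, 2])]
stages = build_stages(samples, difficulty=lambda s: s["gold_rank"], num_stages=2)

# A training loop would then visit the stages in order, easy first:
visited = []
for stage_index, stage in enumerate(stages):
    for sample in stage:
        visited.append((stage_index, sample["gold_rank"]))  # placeholder for a train step
```

Whether later stages should also replay earlier, easier samples is a design choice this sketch leaves open.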

[99] Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective

Yutao Mou,Xiao Deng,Yuxiao Luo,Shikun Zhang,Wei Ye

Main category: cs.CL

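TL;DR: Proposes the CoV-Eval multi-task benchmark and the VC-Judge judgment model for a comprehensive evaluation of LLM code security, finding that LLMs still fall short at generating secure code and repairing vulnerabilities.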

Details

Motivation: Existing code security benchmarks focus on a single task and lack multi-dimensional evaluation, so a comprehensive measure of LLM code security capability is needed. Method: The authors propose the CoV-Eval multi-task benchmark and the VC-Judge model, and evaluate 20 LLMs. Result: LLMs identify vulnerabilities fairly well but perform poorly at generating secure code and repairing vulnerabilities. Conclusion: The study reveals key challenges in LLM code security and points to optimization directions for future research. Abstract: Code security and usability are both essential for various coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on a single evaluation task and paradigm, such as code completion and generation, lacking comprehensive assessment across dimensions like secure code generation, vulnerability repair and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification, for comprehensive evaluation of LLM code security. Besides, we developed VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities in a more efficient and reliable way. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle with recognizing specific vulnerability types and performing repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.

[100] The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

Benedikt Ebing,Goran Glavaš

Main category: cs.CL

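TL;DR: The paper studies low-level design decisions for label projection in translation-based cross-lingual transfer (XLT), finds that an optimized word aligner (WA) matches marker-based methods, and proposes a new ensembling strategy that substantially outperforms marker-based projection.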

Details

Motivation: In cross-lingual transfer for token classification, label projection is a critical step that has not been studied systematically, particularly the low-level design decisions for word aligners and how they compare with marker-based methods. Method: The authors systematically study word aligner design decisions for label projection (the algorithm for projecting multi-token spans, filtering strategies, and pre-tokenization) and propose a new strategy that ensembles translate-train and translate-test predictions. Result: With optimized choices, word aligners perform on par with marker-based methods, while the proposed ensembling strategy substantially outperforms them and reduces sensitivity to low-level design decisions. Conclusion: With optimized design decisions and ensembling, word-aligner-based label projection becomes both stronger and more robust for cross-lingual transfer. Abstract: Translation-based strategies for cross-lingual transfer (XLT) such as translate-train -- training on noisy target language data translated from the source language -- and translate-test -- evaluating on noisy source language data translated from the target language -- are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects of low-level design decisions on token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies to reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences. We find that all of these substantially impact translation-based XLT performance and show that, with optimized choices, XLT with WA offers performance at least comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms the marker-based projection. Crucially, we show that our proposed ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.
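Illustrative sketch (one possible design, not necessarily any variant studied in the paper): label projection with a word aligner maps each labeled source span to the target tokens aligned with it. The snippet projects every span onto the contiguous target range between its first and last aligned token, and drops spans with no alignment as a simple filter. The alignment pairs are hypothetical and would normally come from a word aligner.

```python
def project_spans(src_spans, alignments, tgt_len):
    """Project labeled source spans onto target tokens as BIO tags.

    src_spans:  list of (start, end_exclusive, label) over source token indices
    alignments: set of (src_index, tgt_index) pairs from a word aligner
    tgt_len:    number of tokens in the translated sentence
    """
    tgt_labels = ["O"] * tgt_len
    for start, end, label in src_spans:
        hits = sorted(t for s, t in alignments if start <= s < end)
        if not hits:
            continue  # filtering choice: skip spans with no aligned target token
        lo, hi = hits[0], hits[-1]
        tgt_labels[lo] = "B-" + label
        for t in range(lo + 1, hi + 1):
            tgt_labels[t] = "I-" + label  # fill gaps so the span stays contiguous
    return tgt_labels

# Source: 4 tokens with a PER span on tokens 0-1 and a LOC span on token 3.
src_spans = [(0, 2, "PER"), (3, 4, "LOC")]
# Hypothetical alignment into a 5-token translation (target token 1 is unaligned).
alignments = {(0, 0), (1, 2), (2, 3), (3, 4)}
projected = project_spans(src_spans, alignments, tgt_len=5)
```

Filling the unaligned gap at target index 1 is exactly the kind of low-level decision the abstract says can change results. An alternative would label only the aligned tokens, and overlapping projected spans would need an explicit tie-breaking rule that this sketch omits.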

[101] Multi-Token Prediction Needs Registers

Anastasios Gerontopoulos,Spyros Gidaris,Nikos Komodakis

Main category: cs.CL

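TL;DR: MuToR is a multi-token prediction method that interleaves learnable register tokens into the input to predict future targets. It adds few parameters, is compatible with existing models, and scales to longer prediction horizons.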

Details

Motivation: Multi-token prediction works well for language model pretraining, but its benefits are inconsistent in other settings such as fine-tuning, so a more general approach is needed. Method: MuToR interleaves learnable register tokens into the input sequence, each tasked with predicting future targets, with no architectural changes and very few extra parameters. Result: MuToR performs well on generative tasks in both language and vision, across supervised fine-tuning, parameter-efficient fine-tuning, and pretraining. Conclusion: MuToR is a simple and effective multi-token prediction method with broad applicability and scalability. Abstract: Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.

[102] WorldPM: Scaling Human Preference Modeling

Binghai Wang,Runji Lin,Keming Lu,Le Yu,Zhenru Zhang,Fei Huang,Chujie Zheng,Kai Dang,Yang Fan,Xingzhang Ren,An Yang,Binyuan Hui,Dayiheng Liu,Tao Gui,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Bowen Yu,Jingren Zhou,Junyang Lin

Main category: cs.CL

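TL;DR: The paper finds that preference modeling follows scaling laws similar to those of language models, proposes WorldPM, demonstrates its scaling potential through large-scale training, and reports substantial gains on multiple benchmarks.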

Details

Motivation: Inspired by the power-law relationship between test loss and model and data size in language modeling, the authors ask whether similar laws hold for preference modeling, and propose WorldPM as a unified representation of human preferences. Method: Diverse preference data is collected from public forums, models from 1.5B to 72B parameters are trained on 15M-scale data, and results are evaluated with adversarial, objective, and subjective metrics. Result: Adversarial and objective metrics improve with scale, while subjective metrics show no clear trend. WorldPM broadly improves performance on 7 benchmarks and brings significant gains when integrated into an RLHF pipeline. Conclusion: WorldPM demonstrates the scaling potential of preference modeling, provides an effective foundation for preference fine-tuning, and yields notable gains in practice. Abstract: Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling. We propose World Preference Modeling (WorldPM) to emphasize this scaling potential, where World Preference embodies a unified representation of human preferences. In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters. We observe distinct patterns across different evaluation metrics: (1) Adversarial metrics (ability to identify deceptive features) consistently scale up with increased training data and base model size; (2) Objective metrics (objective knowledge with well-defined answers) show emergent behavior in larger language models, highlighting WorldPM's scalability potential; (3) Subjective metrics (subjective preferences from a limited number of humans or AI) do not demonstrate scaling trends. Further experiments validate the effectiveness of WorldPM as a foundation for preference fine-tuning. Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we observe significant improvements on both in-house and public evaluation sets, with notable gains of 4% to 8% in our in-house evaluations.

[103] Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

Zhiyuan Hu,Yibo Wang,Hanze Dong,Yuhui Xu,Amrita Saha,Caiming Xiong,Bryan Hooi,Junnan Li

Main category: cs.CL

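TL;DR: The paper proposes explicitly aligning models with three meta-abilities (deduction, induction, and abduction) to improve the scalability and reliability of large reasoning models.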

Details

Motivation: The reasoning behaviors of current large reasoning models, such as self-correction and backtracking, are unpredictable, which limits reliability and scalability. Method: A three-stage pipeline of individual alignment, parameter-space merging, and domain-specific reinforcement learning, trained on automatically generated, self-verifiable tasks. Result: Performance improves by over 10% relative to instruction-tuned baselines, and domain-specific reinforcement learning adds a further 2% average gain. Conclusion: Explicit meta-ability alignment provides a scalable and dependable foundation for reasoning. Abstract: Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment

cs.SI [Back]

[104] Tales of the 2025 Los Angeles Fire: Hotwash for Public Health Concerns in Reddit via LLM-Enhanced Topic Modeling

Sulong Zhou,Qunying Huang,Shaoheng Zhou,Yun Hang,Xinyue Ye,Aodong Mei,Kathryn Phung,Yuning Ye,Uma Govindswamy,Zehan Li

Main category: cs.SI

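TL;DR: By analyzing Reddit discussions during the 2025 Los Angeles wildfires, the study develops a hierarchical framework that combines topic modeling with human-in-the-loop refinement to identify the public's situational awareness and crisis narratives, revealing public health and safety concerns.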

Details

Motivation: Wildfires have grown more frequent and severe in recent years, and understanding how the public perceives and responds during a crisis is critical for disaster response. Social media offers a channel for capturing public sentiment and information. Method: The authors collect 385 posts and 114,879 comments, apply topic modeling enhanced by large language models and human-in-the-loop refinement, and develop a hierarchical framework to categorize latent topics. Result: Discussion volume in the Situational Awareness category closely tracks fire progression, and public health and safety topics are wide-ranging. Within Crisis Narratives, grief signals and mental health risks dominate. Conclusion: The study contributes the first annotated social media dataset on these fires and a scalable analysis framework that can inform disaster response and public health communication. Abstract: Wildfires have become increasingly frequent, irregular, and severe in recent years. Understanding how affected populations perceive and respond during wildfire crises is critical for timely and empathetic disaster response. Social media platforms offer a crowd-sourced channel to capture evolving public discourse, providing hyperlocal information and insight into public sentiment. This study analyzes Reddit discourse during the 2025 Los Angeles wildfires, spanning from the onset of the disaster to full containment. We collect 385 posts and 114,879 comments related to the Palisades and Eaton fires. We adopt topic modeling methods to identify the latent topics, enhanced by large language models (LLMs) and human-in-the-loop (HITL) refinement. Furthermore, we develop a hierarchical framework to categorize latent topics, consisting of two main categories, Situational Awareness (SA) and Crisis Narratives (CN). The volume of the SA category closely aligns with real-world fire progressions, peaking within the first 2-5 days as the fires reach the maximum extent. The most frequent co-occurring category set of public health and safety, loss and damage, and emergency resources expands on a wide range of health-related latent topics, including environmental health, occupational health, and one health. Grief signals and mental health risks consistently accounted for 60% and 40% of CN instances, respectively, with the highest total volume occurring at night. This study contributes the first annotated social media dataset on the 2025 LA fires, and introduces a scalable multi-layer framework that leverages topic modeling for crisis discourse analysis. By identifying persistent public health concerns, our results can inform more empathetic and adaptive strategies for disaster response, public health communication, and future research in comparable climate-related disaster events.

cs.LG [Back]

[105] Predictability Shapes Adaptation: An Evolutionary Perspective on Modes of Learning in Transformers

Alexander Y. Ku,Thomas L. Griffiths,Stephanie C. Y. Chan

Main category: cs.LG

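TL;DR: The paper examines the interplay between in-weights learning (IWL) and in-context learning (ICL) in Transformers, drawing on genetic encoding and phenotypic plasticity from evolutionary biology, and finds that environmental stability and cue reliability shape which learning mode is favored.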

Details

Motivation: To understand how IWL and ICL interact in Transformers, borrowing adaptive strategies from evolutionary biology to study how environmental factors affect the learning mode. Method: Regression and classification experiments manipulate environmental stability and cue reliability and analyze their effect on the IWL/ICL balance. Result: High environmental stability favors IWL and high cue reliability strengthens ICL. Learning dynamics show task-dependent changes over time. Conclusion: Environmental predictability is a key factor in how Transformers select a learning mode, offering a new perspective on ICL and on training methodology. Abstract: Transformer models learn in two distinct modes: in-weights learning (IWL), encoding knowledge into model weights, and in-context learning (ICL), adapting flexibly to context without weight modification. To better understand the interplay between these learning modes, we draw inspiration from evolutionary biology's analogous adaptive strategies: genetic encoding (akin to IWL, adapting over generations and fixed within an individual's lifetime) and phenotypic plasticity (akin to ICL, enabling flexible behavioral responses to environmental cues). In evolutionary biology, environmental predictability dictates the balance between these strategies: stability favors genetic encoding, while reliable predictive cues promote phenotypic plasticity. We experimentally operationalize these dimensions of predictability and systematically investigate their influence on the ICL/IWL balance in Transformers. Using regression and classification tasks, we show that high environmental stability decisively favors IWL, as predicted, with a sharp transition at maximal stability. Conversely, high cue reliability enhances ICL efficacy, particularly when stability is low. Furthermore, learning dynamics reveal task-contingent temporal evolution: while a canonical ICL-to-IWL shift occurs in some settings (e.g., classification with many classes), we demonstrate that scenarios with easier IWL (e.g., fewer classes) or slower ICL acquisition (e.g., regression) can exhibit an initial IWL phase later yielding to ICL dominance. These findings support a relative-cost hypothesis for explaining these learning mode transitions, establishing predictability as a critical factor governing adaptive strategies in Transformers, and offering novel insights for understanding ICL and guiding training methodologies.

[106] Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Tasks

Ziyuan Zhang,Darcy Wang,Ningyuan Chen,Rodrigo Mansur,Vahid Sarhangian

Main category: cs.LG

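TL;DR: LLMs are used to simulate human decision-making, and this study examines how they handle the exploration-exploitation tradeoff. Comparing LLMs, humans, and algorithms on multi-armed bandit tasks shows that reasoning makes LLM behavior more human-like, but LLMs adapt less well in complex environments.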

Details

Motivation: To determine whether LLMs behave like humans in dynamic decision-making tasks and to assess their performance. Method: Multi-armed bandit tasks combined with interpretable choice models are used to analyze the exploration-exploitation strategies of LLMs, humans, and algorithms. Result: Reasoning brings LLMs close to human behavior in simple tasks, but they adapt poorly in complex environments. Conclusion: LLMs show promise for simulating human behavior and automating decisions, but need improvement in complex environments. Abstract: Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making tasks. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) tasks introduced in the cognitive science and psychiatry literature to conduct a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the E&E strategies of the agents and investigate how explicit reasoning, through both prompting strategies and reasoning-enhanced models, shapes LLM decision-making. We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration. In simple stationary tasks, reasoning-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making and point to potential areas for improvement.
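Illustrative sketch (an assumed functional form, not the paper's fitted model): the abstract distinguishes random from directed exploration. A common way to express both in one interpretable choice rule is a softmax over estimated value plus an uncertainty bonus. Here `beta` (inverse temperature, where lower means more random exploration) and `phi` (weight on the bonus, where higher means more directed exploration) are hypothetical parameter names, and the bonus `1 / sqrt(n + 1)` is a simple stand-in for uncertainty about an arm pulled `n` times.

```python
import math

def choice_probs(means, counts, beta, phi):
    """Softmax choice probabilities over arms.

    means:  estimated mean reward per arm
    counts: number of times each arm has been pulled
    beta:   inverse temperature (random exploration shrinks as beta grows)
    phi:    weight on the uncertainty bonus (directed exploration)
    """
    utilities = [
        beta * (m + phi / math.sqrt(n + 1))
        for m, n in zip(means, counts)
    ]
    top = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp(u - top) for u in utilities]
    z = sum(exps)
    return [e / z for e in exps]

# Two arms with equal estimated value; arm 1 has never been tried.
no_bonus = choice_probs([0.5, 0.5], [10, 0], beta=5.0, phi=0.0)
with_bonus = choice_probs([0.5, 0.5], [10, 0], beta=5.0, phi=1.0)
```

Fitting `beta` and `phi` to observed choices, whether from people or from an LLM, would give the kind of per-agent comparison the abstract describes. The fitting procedure is outside the scope of this sketch.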

[107] Advanced Crash Causation Analysis for Freeway Safety: A Large Language Model Approach to Identifying Key Contributing Factors

Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Samgyu Yang,Abdulrahman Faden

Main category: cs.LG

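TL;DR: The study uses an LLM (Llama3 8B) to analyze freeway crash data and identify crash causes through zero-shot classification. The results show that the LLM effectively identifies the primary crash causes and has practical potential.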

Details

Motivation: Traditional statistical methods and machine learning models struggle to capture the interactions among the complex factors behind crashes, so the study explores the use of LLMs for crash analysis. Method: The Llama3 8B model is fine-tuned with QLoRA on data compiled from 226 freeway crash studies and then used for zero-shot classification of crash causes. Result: The LLM effectively identifies primary crash causes such as alcohol-impaired driving and speeding, and incorporating event data yields deeper insights; in a questionnaire, 88.89% of researchers endorsed its effectiveness. Conclusion: LLMs offer a comprehensive tool for traffic crash analysis, help in designing more effective safety measures, and are a valuable reference for planners and policymakers. Abstract: Understanding the factors contributing to traffic crashes and developing strategies to mitigate their severity is essential. Traditional statistical methods and machine learning models often struggle to capture the complex interactions between various factors and the unique characteristics of each crash. This research leverages a large language model (LLM) to analyze freeway crash data and provide crash causation analysis accordingly. By compiling 226 traffic safety studies related to freeway crashes, a training dataset encompassing environmental, driver, traffic, and geometric design factors was created. The Llama3 8B model was fine-tuned using QLoRA to enhance its understanding of freeway crashes and their contributing factors, as covered in these studies. The fine-tuned Llama3 8B model was then used to identify crash causation without pre-labeled data through zero-shot classification, providing comprehensive explanations to ensure that the identified causes were reasonable and aligned with existing research. Results demonstrate that LLMs effectively identify primary crash causes such as alcohol-impaired driving, speeding, aggressive driving, and driver inattention. Incorporating event data, such as road maintenance, offers more profound insights. The model's practical applicability and potential to improve traffic safety measures were validated by a high level of agreement among researchers in the field of traffic safety, with 88.89% agreement in the questionnaire results. This research highlights the complex nature of traffic crashes and how LLMs can be used for comprehensive analysis of crash causation and other contributing factors. Moreover, it provides valuable insights and potential countermeasures to aid planners and policymakers in developing more effective and efficient traffic safety practices.

[108] Learning Virtual Machine Scheduling in Cloud Computing through Language Agents

JieHao Wu,Ziwei Wang,Junjie Sheng,Wenhao Li,Xiangfei Wang,Jun Luo

Main category: cs.LG

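TL;DR: The paper proposes MiCo, a hierarchical language agent framework that uses large language models (LLMs) to design heuristics for virtual machine (VM) scheduling in cloud services, formulated as an Online Dynamic Multidimensional Bin Packing (ODMBP) problem.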

Details

Motivation: Traditional optimization methods struggle to adapt to real-time changes, heuristic methods rely on rigid strategies, and existing learning-based methods lack generalizability and interpretability. Method: ODMBP is formulated as a Semi-Markov Decision Process with Options (SMDP-Option) and solved with a two-stage architecture (Option Miner and Option Composer) in which LLMs generate the strategies. Result: In large-scale scenarios with more than 10,000 virtual machines, MiCo achieves a 96.9% competitive ratio and maintains high performance under nonstationary request flows and diverse configurations. Conclusion: MiCo performs well in complex, large-scale cloud environments, which validates its effectiveness. Abstract: In cloud services, virtual machine (VM) scheduling is a typical Online Dynamic Multidimensional Bin Packing (ODMBP) problem, characterized by large-scale complexity and fluctuating demands. Traditional optimization methods struggle to adapt to real-time changes, domain-expert-designed heuristic approaches suffer from rigid strategies, and existing learning-based methods often lack generalizability and interpretability. To address these limitations, this paper proposes a hierarchical language agent framework named MiCo, which provides a large language model (LLM)-driven heuristic design paradigm for solving ODMBP. Specifically, ODMBP is formulated as a Semi-Markov Decision Process with Options (SMDP-Option), enabling dynamic scheduling through a two-stage architecture, i.e., Option Miner and Option Composer. Option Miner utilizes LLMs to discover diverse and useful non-context-aware strategies by interacting with constructed environments. Option Composer employs LLMs to discover a composing strategy that integrates the non-context-aware strategies with the contextual ones. Extensive experiments on real-world enterprise datasets demonstrate that MiCo achieves a 96.9% competitive ratio in large-scale scenarios involving more than 10,000 virtual machines. It maintains high performance even under nonstationary request flows and diverse configurations, thus validating its effectiveness in complex and large-scale cloud environments.

[109] ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention

Jintian Shao,Hongyi Huang,Jiayi Wu,Beiwen Zhang,ZhiYu Wu,You Shan,MingKai Zheng

Main category: cs.LG

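TL;DR: ComplexFormer introduces a new attention mechanism, CMHA, that models semantic and positional differences in a unified way within the complex plane, significantly improving model performance.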

Details

Motivation: Conventional Transformers are limited in how they integrate positional information while keeping multi-head attention flexible; ComplexFormer aims to address this. Method: The CMHA mechanism combines a per-head Euler transformation with an adaptive differential rotation mechanism to model semantic and positional differences in a unified way. Result: The model performs well on language modeling, text generation, and related tasks, with lower generation perplexity and better long-context coherence. Conclusion: ComplexFormer offers a more efficient and flexible attention mechanism that significantly improves model performance. Abstract: Transformer models rely on self-attention to capture token dependencies but face challenges in effectively integrating positional information while allowing multi-head attention (MHA) flexibility. Prior methods often model semantic and positional differences disparately or apply uniform positional adjustments across heads, potentially limiting representational capacity. This paper introduces ComplexFormer, featuring Complex Multi-Head Attention (CMHA). CMHA empowers each head to independently model semantic and positional differences unified within the complex plane, representing interactions as rotations and scaling. ComplexFormer incorporates two key improvements: (1) a per-head Euler transformation, converting real-valued query/key projections into polar-form complex vectors for head-specific complex subspace operation; and (2) a per-head adaptive differential rotation mechanism, exp[i(Adapt(ASmn,i) + Delta(Pmn),i)], allowing each head to learn distinct strategies for integrating semantic angle differences (ASmn,i) with relative positional encodings (Delta(Pmn),i). Extensive experiments on language modeling, text generation, code generation, and mathematical reasoning show ComplexFormer achieves superior performance, significantly lower generation perplexity, and improved long-context coherence compared to strong baselines like RoPE-Transformers. ComplexFormer demonstrates strong parameter efficiency, offering a more expressive, adaptable attention mechanism.

[110] Superposition Yields Robust Neural Scaling

Yizhou Liu,Ziming Liu,Jeff Gore

Main category: cs.LG

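TL;DR: The paper investigates the origin of the neural scaling law, under which large language model (LLM) performance improves with model size, and proposes representation superposition as a key mechanism behind it.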

Details

Motivation: To study the origin of the neural scaling law relating LLM performance to model size, in order to understand the underlying mechanism. Method: A toy model is constructed to analyze how loss scales with model size under weak and strong superposition, combined with a geometric explanation and an analysis of real LLM data. Result: Under strong superposition the loss is inversely proportional to the model dimension, and real LLM data match the toy model's predictions; the Chinchilla scaling law also agrees with the results. Conclusion: Representation superposition is an important mechanism behind neural scaling laws, and the insight may lead to better training strategies and model architectures. Abstract: The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law -- the finding that loss decreases as a power law with model size -- remains unclear. Starting from two empirical principles -- that LLMs represent more things than the model dimensions (widths) they have (i.e., representations are superposed), and that words or concepts in language occur with varying frequencies -- we constructed a toy model to study the loss scaling with model size. We found that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency; if feature frequencies follow a power law, so does the loss. In contrast, under strong superposition, where all features are represented but overlap with each other, the loss becomes inversely proportional to the model dimension across a wide range of feature frequency distributions. This robust scaling behavior is explained geometrically: when many more vectors are packed into a lower dimensional space, the interference (squared overlaps) between vectors scales inversely with that dimension. We then analyzed four families of open-sourced LLMs and found that they exhibit strong superposition and quantitatively match the predictions of our toy model. The Chinchilla scaling law turned out to also agree with our results. We conclude that representation superposition is an important mechanism underlying the observed neural scaling laws. We anticipate that these insights will inspire new training strategies and model architectures to achieve better performance with less computation and fewer parameters.
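
The geometric claim (squared overlaps scale inversely with dimension) can be checked numerically in a simplified setting. The sketch below assumes independent, uniformly random unit vectors rather than the learned feature vectors of the paper's toy model; for such vectors the expected squared dot product in dimension m is exactly 1/m. The function names are illustrative.

```python
import math
import random


def random_unit_vector(m, rng):
    """Draw a uniformly random direction in R^m by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_overlap(n, m, seed=0):
    """Average squared dot product over all pairs of n random unit vectors in R^m."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(m, rng) for _ in range(n)]
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            total += dot * dot
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    for m in (8, 16, 32):
        # n = 100 vectors is far more than m, so they must overlap.
        print(m, mean_squared_overlap(100, m) * m)  # product should stay near 1
```

Doubling the dimension roughly halves the interference, which is the 1/m behavior the abstract refers to; whether trained models arrange their features this way is the empirical question the paper addresses.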

[111] Parallel Scaling Law for Language Models

Mouxiang Chen,Binyuan Hui,Zeyu Cui,Jiaxi Yang,Dayiheng Liu,Jianling Sun,Junyang Lin,Zhongxin Liu

Main category: cs.LG

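TL;DR: The paper proposes parallel scaling (ParScale), which increases parallel computation at both training and inference time and delivers performance gains with much smaller memory and latency increases than parameter scaling.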

Details

Motivation: Conventional ways of scaling language models (parameter scaling or inference-time scaling) usually incur a high space or time cost, so a more efficient scaling paradigm is needed. Method: P diverse, learnable transformations are applied to the input, the model's forward passes are executed in parallel, and the P outputs are dynamically aggregated. The method applies to any model structure, optimization procedure, data, or task. Result: Experiments show that, for the same performance improvement, parallel scaling requires far smaller memory and latency increases than parameter scaling (up to 22x less memory increase and 6x less latency increase). Conclusion: ParScale makes it possible to deploy more powerful models in low-resource scenarios and offers a new perspective on the role of computation in machine learning. Abstract: It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.
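
The three steps named in the abstract (transform the input P ways, run the shared model on each copy, aggregate dynamically) can be sketched on a toy scalar model. Everything concrete here is an assumption made for illustration: the additive `shifts` stand in for the learnable transformations, the single tanh unit stands in for the language model, and the softmax `gate` stands in for the dynamic aggregation. The paper's actual transformations and aggregator are not specified in this digest.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def base_model(x, weights):
    """Stand-in for the shared-parameter model: one tanh unit."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)))


def parscale_forward(x, weights, shifts, gate):
    """Run P transformed copies of x through the same weights, then aggregate.

    shifts: list of P offset vectors (hypothetical learnable input transformations)
    gate:   scalar mapping each stream's output to an aggregation score (hypothetical)
    """
    # Each stream reuses `weights`; only the input transformation differs.
    outputs = [
        base_model([xi + si for xi, si in zip(x, shift)], weights)
        for shift in shifts
    ]
    alphas = softmax([gate * o for o in outputs])  # input-dependent mixing weights
    return sum(a * o for a, o in zip(alphas, outputs)), alphas


if __name__ == "__main__":
    x, w = [0.5, -1.0], [0.8, 0.3]
    shifts = [[0.0, 0.0], [0.1, -0.2], [-0.3, 0.4]]  # P = 3 streams
    y, alphas = parscale_forward(x, w, shifts, gate=2.0)
    print(y, alphas)
```

With P = 1 and a zero shift the function reduces to the base model, which mirrors the point that the streams share parameters and only the extra computation grows with P.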

Vibha Belavadi,Tushar Vatsa,Dewang Sultania,Suhas Suresha,Ishita Verma,Cheng Chen,Tracy Holloway King,Michael Friedrich

Main category: cs.LG

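TL;DR: The paper proposes a router-based architecture for generating high-quality synthetic data to fine-tune large language models (LLMs). It addresses the limited diversity and complexity of traditional methods and significantly improves function classification accuracy and API parameter selection.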

Details

Motivation: In digital content creation tools, users express their needs through natural language queries, but the lack of real user interaction data and privacy constraints prevent training models on it directly. Traditional synthetic data generation fails to replicate real data distributions, which leads to poor performance after fine-tuning. Method: A router-based architecture combines domain resources (such as content metadata and knowledge graphs) with text-to-text and vision-to-text language models to generate high-quality synthetic training data. Result: Evaluation on real user queries shows that the method significantly outperforms traditional approaches in function classification and API parameter selection, establishing new benchmarks. Conclusion: The proposed router architecture addresses the core problems of synthetic data generation and provides a better way to fine-tune LLMs for function calling tasks. Abstract: This paper addresses fine-tuning Large Language Models (LLMs) for function calling tasks when real user interaction data is unavailable. In digital content creation tools, where users express their needs through natural language queries that must be mapped to API calls, the lack of real-world task-specific data and privacy constraints for training on it necessitate synthetic data generation. Existing approaches to synthetic data generation fall short in diversity and complexity, failing to replicate real-world data distributions and leading to suboptimal performance after LLM fine-tuning. We present a novel router-based architecture that leverages domain resources like content metadata and structured knowledge graphs, along with text-to-text and vision-to-text language models to generate high-quality synthetic training data. Our architecture's flexible routing mechanism enables synthetic data generation that matches observed real-world distributions, addressing a fundamental limitation of traditional approaches. Evaluation on a comprehensive set of real user queries demonstrates significant improvements in both function classification accuracy and API parameter selection. Models fine-tuned with our synthetic data consistently outperform traditional approaches, establishing new benchmarks for function calling tasks.

[113] MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models

Mugilan Ganesan,Shane Segal,Ankur Aggarwal,Nish Sinnadurai,Sean Lie,Vithursan Thangarasa

Main category: cs.LG

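TL;DR: MASSV uses a two-phase approach to turn small language models into effective multimodal draft models, significantly speeding up inference for vision-language models.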

Details

Motivation: Existing small language models cannot process visual inputs, and their predictions do not match those of models that consider visual context, which limits the use of speculative decoding for vision-language models. Method: MASSV connects the target model's vision encoder to the draft model through a lightweight trainable projector, then aligns token predictions with self-distilled visual instruction tuning on responses generated by the target model. Result: On the Qwen2.5-VL and Gemma3 model families, MASSV increases accepted length by up to 30% and speeds up inference by up to 1.46x. Conclusion: MASSV provides a scalable, architecture-compatible method for accelerating current and future vision-language models. Abstract: Speculative decoding significantly accelerates language model inference by enabling a lightweight draft model to propose multiple tokens that a larger target model verifies simultaneously. However, applying this technique to vision-language models (VLMs) presents two fundamental challenges: small language models that could serve as efficient drafters lack the architectural components to process visual inputs, and their token predictions fail to match those of VLM target models that consider visual context. We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach. MASSV first connects the target VLM's vision encoder to the draft model via a lightweight trainable projector, then applies self-distilled visual instruction tuning using responses generated by the target VLM to align token predictions. Comprehensive experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually-grounded tasks. MASSV provides a scalable, architecture-compatible method for accelerating both current and future VLMs.
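
For readers unfamiliar with speculative decoding and the "accepted length" metric, the following is a minimal greedy sketch with toy next-token functions. It is not MASSV: the two toy models and all names are hypothetical, and a real system would verify the k drafted tokens in a single batched forward pass of the target model rather than one call per token as done here for clarity.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, accepted_lengths)."""
    tokens = list(prompt)
    accepted_lengths = []
    while len(tokens) - len(prompt) < num_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx = list(tokens)
        proposal = []
        for _ in range(k):
            nxt = draft_next(ctx)
            proposal.append(nxt)
            ctx.append(nxt)
        # 2. The target accepts the longest prefix matching its own greedy choice.
        accepted = 0
        for tok in proposal:
            if target_next(tokens) != tok:
                break
            tokens.append(tok)
            accepted += 1
        # 3. The target always supplies one token (a correction or a bonus token).
        tokens.append(target_next(tokens))
        accepted_lengths.append(accepted)
    return tokens[: len(prompt) + num_tokens], accepted_lengths


def toy_target(ctx):
    """Hypothetical target model: a deterministic rule on the last token."""
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx):
    """Hypothetical drafter: agrees with the target except after token 5."""
    return 0 if ctx[-1] == 5 else toy_target(ctx)


if __name__ == "__main__":
    out, accepted = speculative_decode(toy_target, toy_draft, [1], 12)
    print(out)
    print("mean accepted length:", sum(accepted) / len(accepted))
```

Because every emitted token is one the target would have chosen greedily, the output matches target-only decoding; a better-aligned drafter only raises the accepted length, which is the quantity MASSV reports improving.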

[114] RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours

Rafael Pablos Sarabia,Joachim Nyborg,Morten Birk,Jeppe Liborius Sjørup,Anders Lillevang Vesterholt,Ira Assent

Main category: cs.LG

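TL;DR: The paper proposes a deep learning model for high-resolution probabilistic precipitation forecasting over an 8-hour horizon in Europe, integrating multiple data sources with efficient training and fast inference.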

Details

Motivation: To overcome the short forecast lead times of radar-only deep learning models and to improve the accuracy and uncertainty quantification of precipitation forecasts. Method: Radar, satellite, and numerical weather prediction data are integrated in a compact architecture designed to capture long-range interactions. Result: In experiments the model outperforms existing numerical weather prediction systems and deep-learning nowcasting models. Conclusion: The model sets a new standard for high-resolution precipitation forecasting in Europe, balancing accuracy, interpretability, and computational efficiency. Abstract: We present a deep learning model for high-resolution probabilistic precipitation forecasting over an 8-hour horizon in Europe, overcoming the limitations of radar-only deep learning models with short forecast lead times. Our model efficiently integrates multiple data sources - including radar, satellite, and physics-based numerical weather prediction (NWP) - while capturing long-range interactions, resulting in accurate forecasts with robust uncertainty quantification through consistent probabilistic maps. Featuring a compact architecture, it enables more efficient training and faster inference than existing models. Extensive experiments demonstrate that our model surpasses current operational NWP systems, extrapolation-based methods, and deep-learning nowcasting models, setting a new standard for high-resolution precipitation forecasting in Europe, ensuring a balance between accuracy, interpretability, and computational efficiency.

[115] PIF: Anomaly detection via preference embedding

Filippo Leveni,Luca Magri,Giacomo Boracchi,Cesare Alippi

Main category: cs.LG

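TL;DR: The paper proposes PIF, a new anomaly detection method that combines the advantages of adaptive isolation methods and preference embedding: data are embedded in a high-dimensional space and scored with a tree-based method, PI-Forest. Experiments show that PIF compares favorably with existing techniques.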

Details

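Motivation: To detect anomalies with respect to structured patterns by combining the advantages of adaptive isolation and preference embedding. Method: PIF embeds the data in a high-dimensional space and computes an anomaly score with the tree-based PI-Forest method. Result: Experiments on synthetic and real datasets show that PIF compares favorably with existing anomaly detection techniques, and that PI-Forest is better at measuring arbitrary distances and isolating points in the preference space. Conclusion: PIF is an effective anomaly detection method that combines the advantages of adaptive isolation and preference embedding, as confirmed by the experiments. Abstract: We address the problem of detecting anomalies with respect to structured patterns. To this end, we conceive a novel anomaly detection method called PIF, that combines the advantages of adaptive isolation methods with the flexibility of preference embedding. Specifically, we propose to embed the data in a high dimensional space where an efficient tree-based method, PI-Forest, is employed to compute an anomaly score. Experiments on synthetic and real datasets demonstrate that PIF favorably compares with state-of-the-art anomaly detection techniques, and confirm that PI-Forest is better at measuring arbitrary distances and isolating points in the preference space.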

[116] SEAL: Searching Expandable Architectures for Incremental Learning

Matteo Gambella,Vicente Javier Castro Solar,Manuel Roveri

Main category: cs.LG

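TL;DR: SEAL is a NAS-based framework for data-incremental learning that dynamically adapts the model structure to reduce forgetting and improve accuracy.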

Details

Motivation: To address the challenge of balancing new-task learning against retention of past knowledge in incremental learning, while avoiding the resource waste that existing methods incur by expanding the model. Method: SEAL expands the model structure dynamically based on a capacity estimation metric, preserves stability with cross-distillation training, and jointly searches for the architecture and the expansion policy. Result: Experiments show that SEAL reduces forgetting and improves accuracy while keeping the model small. Conclusion: SEAL demonstrates the potential of combining NAS with selective expansion for efficient, adaptive incremental learning. Abstract: Incremental learning is a machine learning paradigm where a model learns from a sequential stream of tasks. This setting poses a key challenge: balancing plasticity (learning new tasks) and stability (preserving past knowledge). Neural Architecture Search (NAS), a branch of AutoML, automates the design of the architecture of Deep Neural Networks and has shown success in static settings. However, existing NAS-based approaches to incremental learning often rely on expanding the model at every task, making them impractical in resource-constrained environments. In this work, we introduce SEAL, a NAS-based framework tailored for data-incremental learning, a scenario where disjoint data samples arrive sequentially and are not stored for future access. SEAL adapts the model structure dynamically by expanding it only when necessary, based on a capacity estimation metric. Stability is preserved through cross-distillation training after each expansion step. The NAS component jointly searches for both the architecture and the optimal expansion policy. Experiments across multiple benchmarks demonstrate that SEAL effectively reduces forgetting and enhances accuracy while maintaining a lower model size compared to prior methods. These results highlight the promise of combining NAS and selective expansion for efficient, adaptive learning in incremental scenarios.

q-bio.QM [Back]

[117] Generative diffusion model surrogates for mechanistic agent-based biological models

Tien Comlekoglu,J. Quetzalcóatl Toledo-Marín,Douglas W. DeSimone,Shayn M. Peirce,Geoffrey Fox,James A. Glazier

Main category: q-bio.QM

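TL;DR: A generative AI surrogate trained with denoising diffusion probabilistic models (DDPMs) accelerates Cellular-Potts Model (CPM) computation, achieving an approximately 22x speedup.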

Details

Motivation: CPMs are computationally expensive at large spatial and temporal scales, which limits their application; a generative AI surrogate can accelerate the computation. Method: A generative AI surrogate is trained with DDPMs, and an image classifier assists in surrogate model selection and verification. Result: The surrogate generates model configurations 20,000 timesteps ahead of a reference configuration and reduces computation time by approximately 22x. Conclusion: DDPMs can be used to develop digital twins of stochastic biological systems, pointing to a direction for future research. Abstract: Mechanistic, multicellular, agent-based models are commonly used to investigate tissue, organ, and organism-scale biology at single-cell resolution. The Cellular-Potts Model (CPM) is a powerful and popular framework for developing and interrogating these models. CPMs become computationally expensive at large space and time scales, making application and investigation of developed models difficult. Surrogate models may allow for the accelerated evaluation of CPMs of complex biological systems. However, the stochastic nature of these models means each set of parameters may give rise to different model configurations, complicating surrogate model development. In this work, we leverage denoising diffusion probabilistic models to train a generative AI surrogate of a CPM used to investigate in vitro vasculogenesis. We describe the use of an image classifier to learn the characteristics that define unique areas of a 2-dimensional parameter space. We then apply this classifier to aid in surrogate model selection and verification. Our CPM model surrogate generates model configurations 20,000 timesteps ahead of a reference configuration and demonstrates approximately a 22x reduction in computational time as compared to native code execution. Our work represents a step towards the implementation of DDPMs to develop digital twins of stochastic biological systems.

cs.IR [Back]

[118] A Survey on Large Language Models in Multimodal Recommender Systems

Alejo Lopez-Avila,Jinhua Du

Main category: cs.IR

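TL;DR: This survey reviews the use of large language models (LLMs) in multimodal recommender systems (MRS), discusses their advantages and challenges, and proposes a new taxonomy and future research directions.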

Details

Motivation: To examine how LLMs improve MRS performance through semantic reasoning and dynamic input handling, while addressing their scalability and model accessibility problems. Method: The survey reviews recent work, proposes a taxonomy, and summarizes prompting strategies, fine-tuning methods, and data adaptation techniques. Result: It clarifies the patterns by which LLMs are integrated into MRS and provides an overview of evaluation metrics, datasets, and future directions. Conclusion: LLMs open new opportunities for MRS, but further research is needed to address the challenges and advance the field. Abstract: Multimodal recommender systems (MRS) integrate heterogeneous user and item data, such as text, images, and structured information, to enhance recommendation performance. The emergence of large language models (LLMs) introduces new opportunities for MRS by enabling semantic reasoning, in-context learning, and dynamic input handling. Compared to earlier pre-trained language models (PLMs), LLMs offer greater flexibility and generalisation capabilities but also introduce challenges related to scalability and model accessibility. This survey presents a comprehensive review of recent work at the intersection of LLMs and MRS, focusing on prompting strategies, fine-tuning methods, and data adaptation techniques. We propose a novel taxonomy to characterise integration patterns, identify transferable techniques from related recommendation domains, provide an overview of evaluation metrics and datasets, and point to possible future directions. We aim to clarify the emerging role of LLMs in multimodal recommendation and support future research in this rapidly evolving field.

cs.RO [Back]

[119] FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

Jun Guo,Xiaojian Ma,Yikai Wang,Min Yang,Huaping Liu,Qing Li

Main category: cs.RO

TL;DR: FlowDreamer is a visual world model for robot manipulation based on 3D scene flow; its explicit motion representation improves future-frame prediction.

Details

Motivation: To improve visual world models for robot manipulation by handling dynamics prediction explicitly (through 3D scene flow), making predicted future observations more accurate. Method: FlowDreamer predicts 3D scene flow with a U-Net and then generates the future frame with a diffusion model; the design is modular but trained end-to-end. Result: Across 4 benchmarks, FlowDreamer improves semantic similarity, pixel quality, and task success rate by 7%, 11%, and 6%, respectively. Conclusion: Explicit motion representations significantly improve RGB-D world models, making FlowDreamer well suited to robot manipulation tasks. Abstract: This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as explicit motion representations. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U-Net, and then a diffusion model predicts the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer achieves better performance compared to other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.

cs.HC [Back]

[120] CartoAgent: a multimodal large language model-powered multi-agent cartographic framework for map style transfer and evaluation

Chenglong Wang,Yuhao Kang,Zhaoya Gong,Pengjun Zhao,Yu Feng,Wenjia Zhang,Ge Li

Main category: cs.HC

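TL;DR: CartoAgent is a multi-agent cartographic framework powered by multimodal large language models. It simulates three stages of cartographic practice (preparation, map design, and evaluation) to generate maps that are both visually appealing and informative.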

Details

Motivation: The rapid development of generative artificial intelligence (GenAI) opens new opportunities for the cartographic process, but previous studies have either overlooked the artistic aspects of maps or struggled to make maps both accurate and informative. Method: CartoAgent uses multimodal large language models (MLLMs) as agents that collaborate stage by stage on the mapping task, and ensures accuracy by separating style from geographic data. Result: Experiments and a human evaluation validate CartoAgent's effectiveness on map style transfer and evaluation. Conclusion: CartoAgent can be extended to support a variety of cartographic design decisions and can inform future integrations of GenAI in cartography. Abstract: The rapid development of generative artificial intelligence (GenAI) presents new opportunities to advance the cartographic process. Previous studies have either overlooked the artistic aspects of maps or faced challenges in creating both accurate and informative maps. In this study, we propose CartoAgent, a novel multi-agent cartographic framework powered by multimodal large language models (MLLMs). This framework simulates three key stages in cartographic practice: preparation, map design, and evaluation. At each stage, different MLLMs act as agents with distinct roles to collaborate, discuss, and utilize tools for specific purposes. In particular, CartoAgent leverages MLLMs' visual aesthetic capability and world knowledge to generate maps that are both visually appealing and informative. By separating style from geographic data, it can focus on designing stylesheets without modifying the vector-based data, thereby ensuring geographic accuracy. We applied CartoAgent to a specific task centered on map restyling, namely map style transfer and evaluation. The effectiveness of this framework was validated through extensive experiments and a human evaluation study. CartoAgent can be extended to support a variety of cartographic design decisions and inform future integrations of GenAI in cartography.

[121] Visual Feedback of Pattern Separability Improves Myoelectric Decoding Performance of Upper Limb Prostheses

Ruichen Yang,György M. Lévay,Christopher L. Hunt,Dániel Czeiner,Megan C. Hodgson,Damini Agarwal,Rahul R. Kaliki,Nitish V. Thakor

Main category: cs.HC

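TL;DR: The paper introduces the Reviewer, a 3D visual interface that projects EMG signals into the decoder's classification space in real time, improving pattern recognition control of myoelectric prostheses.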

Details

Motivation: As prosthesis movement complexity increases, users struggle to produce EMG patterns distinct enough for reliable classification, and existing training relies on trial-and-error adjustment with limited effect. Method: A 10-session study compared training with the Reviewer against conventional virtual arm visualization, assessing performance on a Fitts' law task. Result: The group using the Reviewer performed better in completion rate, path efficiency, and throughput. Conclusion: 3D visual feedback significantly improves pattern recognition control in novice operators and reduces reliance on trial-and-error adjustment. Abstract: State-of-the-art upper limb myoelectric prostheses often use pattern recognition (PR) control systems that translate electromyography (EMG) signals into desired movements. As prosthesis movement complexity increases, users often struggle to produce sufficiently distinct EMG patterns for reliable classification. Existing training typically involves heuristic, trial-and-error user adjustments to static decoder boundaries. Goal: We introduce the Reviewer, a 3D visual interface projecting EMG signals directly into the decoder's classification space, providing intuitive, real-time insight into PR algorithm behavior. This structured feedback reduces cognitive load and fosters mutual, data-driven adaptation between user-generated EMG patterns and decoder boundaries. Methods: A 10-session study with 12 able-bodied participants compared PR performance after motor-based training and updating using the Reviewer versus conventional virtual arm visualization. Performance was assessed using a Fitts' law task that involved control of cursor aperture and orientation. Results: Participants trained with the Reviewer achieved higher completion rates, reduced overshoot, and improved path efficiency and throughput compared to the standard visualization group. Significance: The Reviewer introduces decoder-informed motor training, facilitating immediate and consistent PR-based myoelectric control improvements. By iteratively refining control through real-time feedback, this approach reduces reliance on trial-and-error recalibration, enabling a more adaptive, self-correcting training framework. Conclusion: The 3D visual feedback significantly improves PR control in novice operators through structured training, enabling feedback-driven adaptation and reducing reliance on extensive heuristic adjustments.
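
As background on the throughput metric used in Fitts' law tasks, here is a minimal sketch of the commonly used Shannon formulation, in which the index of difficulty is log2(D/W + 1) bits and throughput is that value divided by movement time. Whether the study used exactly this formulation, and how distance and width were defined for aperture and orientation targets, is not stated in the abstract, so treat the function names and units as assumptions.

```python
import math


def index_of_difficulty(distance, width):
    """Shannon formulation of Fitts' index of difficulty, in bits."""
    return math.log2(distance / width + 1.0)


def throughput(distance, width, movement_time_s):
    """Throughput in bits per second for one completed target acquisition."""
    return index_of_difficulty(distance, width) / movement_time_s


if __name__ == "__main__":
    # Hypothetical trial: target 3 units away, 1 unit wide, reached in 2 seconds.
    print(index_of_difficulty(3.0, 1.0))  # 2.0 bits
    print(throughput(3.0, 1.0, 2.0))      # 1.0 bit/s
```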

[122] SOS: A Shuffle Order Strategy for Data Augmentation in Industrial Human Activity Recognition

Anh Tuan Ha,Hoang Khang Phan,Thai Minh Tien Ngo,Anh Phan Truong,Nhat Tan Le

Main category: cs.HC

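TL;DR: The paper generates a high-quality HAR dataset with deep learning methods (an attention autoencoder and conditional generative adversarial networks) and tackles data heterogeneity with a random sequence (shuffle order) strategy, which significantly improves classification performance.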

Details

Motivation: In HAR, high-quality and diverse data are costly and difficult to obtain, and data heterogeneity is also a key challenge. Method: A dataset is generated with an attention autoencoder and conditional generative adversarial networks, and a random sequence strategy shuffles the data to homogenize its distribution. Result: Experiments show that the random sequence strategy significantly improves classification performance, reaching an accuracy of 0.70±0.03 and a macro F1 score of 0.64±0.01. Conclusion: The method not only broadens the effective training dataset but also offers a way to strengthen HAR systems in complex real-world scenarios. Abstract: In the realm of Human Activity Recognition (HAR), obtaining high-quality, diverse data remains a persistent challenge due to high costs and the inherent variability of real-world activities. This study introduces a dataset generated with deep learning approaches (Attention Autoencoder and conditional Generative Adversarial Networks). Data heterogeneity is another critical challenge; one solution is to shuffle the data to homogenize the distribution. Experimental results demonstrate that the random sequence strategy significantly improves classification performance, achieving an accuracy of up to 0.70 $\pm$ 0.03 and a macro F1 score of 0.64 $\pm$ 0.01. Disrupting temporal dependencies through random sequence reordering compels the model to focus on instantaneous recognition, thereby improving robustness against activity transitions. This approach not only broadens the effective training dataset but also offers promising avenues for enhancing HAR systems in complex, real-world scenarios.

cs.AI [Back]

[123] From Text to Network: Constructing a Knowledge Graph of Taiwan-Based China Studies Using Generative AI

Hsuan-Lei Shao

Main category: cs.AI

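TL;DR: The study uses generative AI and large language models to convert academic texts from Taiwan-based China Studies into a structured knowledge graph, offering a new way to navigate the field's knowledge.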

Details

Motivation: Taiwan-based China Studies has accumulated a rich body of scholarship but lacks tools for organizing and analyzing it systematically; the study aims to fill this gap. Method: Generative AI techniques extract entity-relation triples from 1,367 articles, which are visualized with D3.js to build a knowledge graph and vector database. Result: The system reveals new intellectual trajectories, thematic clusters, and research gaps, and supports a shift from linear text consumption to network-based knowledge navigation. Conclusion: The study demonstrates the potential of generative AI for regional knowledge systems and provides a new tool for digital humanities and area studies. Abstract: Taiwanese China Studies (CS) has developed into a rich, interdisciplinary research field shaped by the unique geopolitical position and long-standing academic engagement with Mainland China. This study responds to the growing need to systematically revisit and reorganize decades of Taiwan-based CS scholarship by proposing an AI-assisted approach that transforms unstructured academic texts into structured, interactive knowledge representations. We apply generative AI (GAI) techniques and large language models (LLMs) to extract and standardize entity-relation triples from 1,367 peer-reviewed CS articles published between 1996 and 2019. These triples are then visualized through a lightweight D3.js-based system, forming the foundation of a domain-specific knowledge graph and vector database for the field. This infrastructure allows users to explore conceptual nodes and semantic relationships across the corpus, revealing previously uncharted intellectual trajectories, thematic clusters, and research gaps. By decomposing textual content into graph-structured knowledge units, our system enables a paradigm shift from linear text consumption to network-based knowledge navigation. In doing so, it enhances scholarly access to CS literature while offering a scalable, data-driven alternative to traditional ontology construction. This work not only demonstrates how generative AI can augment area studies and digital humanities but also highlights its potential to support a reimagined scholarly infrastructure for regional knowledge systems.

[124] Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models

Annie Wong,Thomas Bäck,Aske Plaat,Niki van Stein,Anna V. Kononova

Main category: cs.AI

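TL;DR: The study evaluates the adaptive capabilities of large language models in dynamic environments. It finds that strategic prompting can narrow the performance gap between models, that advanced prompting techniques mainly help smaller models, and that reasoning methods are unstable.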

Details

Motivation: 探索大型语言模型在动态环境中作为自学习和推理代理的真正潜力。 Method: 通过自反思、启发式突变和规划等提示技术，在动态环境中测试不同开源语言模型的适应能力。 Result: 大型模型通常表现更好，但战略提示可缩小差距；高级提示对小型模型更有效；推理方法表现不稳定。 Conclusion: 当前大型语言模型在规划、推理和空间协调等方面仍存在根本性不足，需超越静态基准以捕捉推理的复杂性。 Abstract: While large language models demonstrate impressive performance on static benchmarks, the true potential of large language models as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates the efficacy of self-reflection, heuristic mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments and find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, a too-long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to big performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in crucial areas such as planning, reasoning, and spatial coordination, suggesting that current-generation large language models still suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. 
Reasoning is a multi-faceted task, and while reasoning methods like Chain of Thought improve multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.

cs.CR [Back]

[125] PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization

Yidan Wang,Yanan Cao,Yubing Ren,Fang Fang,Zheng Lin,Binxing Fang

Main category: cs.CR

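TL;DR: The paper examines the privacy risks of large language models (LLMs) and proposes PIG, a new framework for extracting sensitive information, which experiments show outperforms existing methods.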

Details

Motivation: Existing methods for evaluating privacy leakage in LLMs have limitations, and the role of jailbreak attacks in privacy scenarios has not been sufficiently studied. Method: Proposes the PIG framework, which identifies PII entities, builds a privacy context, and iteratively updates it with gradient-based strategies to extract the target PII. Result: PIG outperforms baseline methods on four white-box and two black-box LLMs, achieving SoTA results. Conclusion: LLMs carry significant privacy risks, and stronger safeguards are needed. Abstract: Large Language Models (LLMs) excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-aligned models can easily block. Meanwhile, jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains underexplored. In this paper, we examine the effectiveness of jailbreak attacks in extracting sensitive information, bridging privacy leakage and jailbreak attacks in LLMs. Moreover, we propose PIG, a novel framework targeting Personally Identifiable Information (PII) and addressing the limitations of current jailbreak methods. Specifically, PIG identifies PII entities and their types in privacy queries, uses in-context learning to build a privacy context, and iteratively updates it with three gradient-based strategies to elicit target PII. We evaluate PIG and existing jailbreak methods using two privacy-related datasets. Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art (SoTA) results. The results underscore significant privacy risks in LLMs, emphasizing the need for stronger safeguards. Our code is available at https://github.com/redwyd/PrivacyJailbreak.

eess.IV [Back]

[126] ImplicitStainer: Data-Efficient Medical Image Translation for Virtual Antibody-based Tissue Staining Using Local Implicit Functions

Tushar Kataria,Beatrice Knudsen,Shireen Y. Elhabian

Main category: eess.IV

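TL;DR: ImplicitStainer uses local implicit functions to improve virtual staining, boosting performance through pixel-level prediction, reducing data requirements, and still producing high-quality results with limited data.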

Details

Motivation: Although H&E staining is the gold standard in pathology, it lacks molecular information, while IHC staining is time-consuming and not readily available. Virtual staining generates IHC images via deep learning, but existing methods have high data requirements. Method: Proposes ImplicitStainer, which improves image translation using local implicit functions and focuses on pixel-level prediction to raise virtual staining performance. Result: Validated on two datasets, it outperforms more than fifteen state-of-the-art GAN and diffusion models and is more robust to changes in dataset size. Conclusion: ImplicitStainer offers an efficient solution for virtual staining and performs especially well when data are limited. Abstract: Hematoxylin and eosin (H&E) staining is a gold standard for microscopic diagnosis in pathology. However, H&E staining does not capture all the diagnostic information that may be needed. To obtain additional molecular information, immunohistochemical (IHC) stains highlight proteins that mark specific cell types, such as CD3 for T-cells or CK8/18 for epithelial cells. While IHC stains are vital for prognosis and treatment guidance, they are typically only available at specialized centers and time-consuming to acquire, leading to treatment delays for patients. Virtual staining, enabled by deep learning-based image translation models, provides a promising alternative by computationally generating IHC stains from H&E stained images. Although many GAN- and diffusion-based image-to-image (I2I) translation methods have been used for virtual staining, these models treat image patches as independent data points, which results in increased and more diverse data requirements for effective generation. We present ImplicitStainer, a novel approach that leverages local implicit functions to improve image translation, specifically virtual staining performance, by focusing on pixel-level predictions. This method enhances robustness to variations in dataset sizes, delivering high-quality results even with limited data. We validate our approach on two datasets using a comprehensive set of metrics and benchmark it against over fifteen state-of-the-art GAN- and diffusion-based models. Full code and trained models will be released publicly via GitHub upon acceptance.

[127] Ordered-subsets Multi-diffusion Model for Sparse-view CT Reconstruction

Pengfei Yu,Bin Huang,Minghui Zhang,Weiwen Wu,Shaoyu Wang,Qiegen Liu

Main category: eess.IV

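TL;DR: Proposes OSMM, an ordered-subsets multi-diffusion model for sparse-view CT reconstruction, which improves fine-detail reconstruction through subset-wise learning and a global constraint.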

Details

Motivation: Conventional diffusion models learn poorly in sparse-view CT reconstruction because of data redundancy, producing reconstructed images that lack detail. Method: CT projection data are divided into equal subsets, each learned independently by a multi-subsets diffusion model (MSDM), combined with a one-whole diffusion model (OWDM) trained on the complete data as a global constraint. Result: OSMM outperforms conventional diffusion models in image quality and noise robustness and adapts well to varying conditions. Conclusion: OSMM provides an efficient and robust solution for sparse-view CT. Abstract: Score-based diffusion models have shown significant promise in the field of sparse-view CT reconstruction. However, the projection dataset is large and riddled with redundancy. Consequently, applying the diffusion model to unprocessed data results in lower learning effectiveness and higher learning difficulty, frequently leading to reconstructed images that lack fine details. To address these issues, we propose the ordered-subsets multi-diffusion model (OSMM) for sparse-view CT reconstruction. The OSMM innovatively divides the CT projection data into equal subsets and employs a multi-subsets diffusion model (MSDM) to learn from each subset independently. This targeted learning approach reduces complexity and enhances the reconstruction of fine details. Furthermore, the integration of a one-whole diffusion model (OWDM) with complete sinogram data acts as a global information constraint, which can reduce the possibility of generating erroneous or inconsistent sinogram information. Moreover, the OSMM's unsupervised learning framework provides strong robustness and generalizability, adapting seamlessly to varying sparsity levels of CT sinograms. This ensures consistent and reliable performance across different clinical scenarios. Experimental results demonstrate that OSMM outperforms traditional diffusion models in terms of image quality and noise resilience, offering a powerful and versatile solution for advanced CT imaging in sparse-view scenarios.
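
The subset division described in this abstract can be illustrated with a minimal sketch. The abstract does not say how views are assigned to subsets, so the interleaved assignment below (borrowed from classical ordered-subsets reconstruction) is an assumption, and the names `ordered_subsets` and `split_sinogram` are hypothetical. The diffusion models themselves are not shown.

```python
def ordered_subsets(num_views, num_subsets):
    """Return equally sized, interleaved lists of projection-view indices.

    Assumption: views are assigned round-robin (view i goes to subset
    i % num_subsets); the abstract only states that subsets are equal.
    """
    if num_subsets <= 0 or num_views % num_subsets != 0:
        raise ValueError("num_views must be a positive multiple of num_subsets")
    return [list(range(start, num_views, num_subsets))
            for start in range(num_subsets)]


def split_sinogram(sinogram, num_subsets):
    """Split a sinogram (one row of detector readings per view) into subsets."""
    index_sets = ordered_subsets(len(sinogram), num_subsets)
    return [[sinogram[i] for i in indices] for indices in index_sets]


# Toy sinogram: 8 views, 2 detector bins per view (made-up values).
sinogram = [[float(v), float(v) + 0.5] for v in range(8)]
parts = split_sinogram(sinogram, num_subsets=4)
```

Under this assumption each subset still spans the full angular range at a coarser sampling, which is one plausible reason to prefer interleaving over contiguous blocks.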

[128] Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding

Jianhao Huang,Qunsong Zeng,Kaibin Huang

Main category: eess.IV

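TL;DR: This paper proposes a hybrid generative semantic communication system that combines text prompts with the transmission of critical features to address the loss of detail in purely prompt-driven generation, and designs the GVIF metric to evaluate system performance.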

Details

Motivation: Purely prompt-driven generative semantic communication loses fine-grained visual details, and systematic evaluation metrics are lacking. Method: A semantic filtering method selects critical features, which are combined with text prompts and a diffusion-based generative model to reconstruct images; the GVIF metric is designed to quantify visual quality. Result: The GVIF metric is sensitive to visual fidelity, and the optimized system outperforms benchmark schemes in PSNR and FID scores. Conclusion: The hybrid Gen-SemCom system and the GVIF metric effectively improve communication efficiency and generated image quality. Abstract: Generative semantic communication (Gen-SemCom) with large artificial intelligence (AI) models promises a transformative paradigm for 6G networks, which reduces communication costs by transmitting low-dimensional prompts rather than raw data. However, purely prompt-driven generation loses fine-grained visual details. Additionally, there is a lack of systematic metrics to evaluate the performance of Gen-SemCom systems. To address these issues, we develop a hybrid Gen-SemCom system with a critical information embedding (CIE) framework, where both text prompts and semantically critical features are extracted for transmission. First, a novel approach of semantic filtering is proposed to select and transmit the semantically critical features of images relevant to the semantic label. By integrating the text prompt and critical features, the receiver reconstructs high-fidelity images using a diffusion-based generative model. Next, we propose the generative visual information fidelity (GVIF) metric to evaluate the visual quality of the generated image. By characterizing the statistical models of image features, the GVIF metric quantifies the mutual information between the distorted features and their original counterparts. By maximizing the GVIF metric, we design a channel-adaptive Gen-SemCom system that adaptively controls the volume of features and compression rate according to the channel state. Experimental results validate the GVIF metric's sensitivity to visual fidelity, correlating with both the PSNR and critical information volume. In addition, the optimized system achieves superior performance over benchmarking schemes in terms of higher PSNR and lower FID scores.
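
The abstract reports results in terms of PSNR. As a reference point, the standard PSNR definition, 10·log10(MAX² / MSE), can be sketched as follows. This is not the GVIF metric, whose formula the abstract does not give; the function names and the 8-bit peak value of 255 are assumptions for illustration.

```python
import math


def mean_squared_error(reference, distorted):
    """Mean squared error between two equally long flat pixel sequences."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs.

    max_value=255.0 assumes 8-bit pixel intensities.
    """
    err = mean_squared_error(reference, distorted)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy example with made-up pixel values: each pixel is off by 10.
score = psnr([100, 100], [110, 90])
```

Higher PSNR means the reconstruction is closer to the reference pixel by pixel, which is why it is usually paired with a perceptual or distributional score such as FID.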

[129] HWA-UNETR: Hierarchical Window Aggregate UNETR for 3D Multimodal Gastric Lesion Segmentation

Jiaming Liang,Lihuan Dai,Xiaoqi Sheng,Xiangguang Chen,Chun Yao,Guihua Tao,Qibin Leng,Honming Cai,Xi Zhong

Main category: eess.IV

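TL;DR: The paper proposes HWA-UNETR, a new 3D segmentation framework, and GCM 2025, a public multimodal gastric cancer MRI dataset, addressing the alignment and dataset-scarcity problems in multimodal medical image segmentation.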

Details

Motivation: Multimodal medical image segmentation for gastric cancer lesion analysis faces dataset scarcity and modality alignment challenges, which constrain algorithm training and reduce analysis accuracy. Method: Proposes the HWA-UNETR framework, which uses learnable window aggregation layers (the HWA block) and a tri-orientated fusion mechanism to dynamically align multimodal features and model long-range spatial dependencies. Result: Experiments on the GCM 2025 and BraTS 2021 datasets show that the new method improves the Dice score by up to 1.68% over existing methods while remaining robust. Conclusion: HWA-UNETR and the GCM 2025 dataset provide an effective solution for multimodal medical image segmentation and advance research on gastric cancer lesion analysis. Abstract: Multimodal medical image segmentation faces significant challenges in the context of gastric cancer lesion analysis. This clinical context is defined by the scarcity of independent multimodal datasets and the imperative to amalgamate inherently misaligned modalities. As a result, algorithms are constrained to train on approximate data and depend on application migration, leading to substantial resource expenditure and a potential decline in analysis accuracy. To address these challenges, we make two major contributions. First, we publicly disseminate the GCM 2025 dataset, which serves as the first large-scale, open-source collection of gastric cancer multimodal MRI scans, featuring professionally annotated FS-T2W, CE-T1W, and ADC images from 500 patients. Second, we introduce HWA-UNETR, a novel 3D segmentation framework that employs an original HWA block with learnable window aggregation layers to establish dynamic feature correspondences between different modalities' anatomical structures, and leverages the innovative tri-orientated fusion mamba mechanism for context modeling and capturing long-range spatial dependencies. Extensive experiments on our GCM 2025 dataset and the public BraTS 2021 dataset validate the performance of our framework, demonstrating that the new approach surpasses existing methods by up to 1.68% in the Dice score while maintaining solid robustness. The dataset and code are public via https://github.com/JeMing-creater/HWA-UNETR.
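
The reported gain is measured with the Dice score. A minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on flattened binary masks is given below. The function name and the convention of returning 1.0 when both masks are empty are assumptions; the paper's own evaluation code may differ.

```python
def dice_score(prediction, target):
    """Dice coefficient between two flattened binary masks (0/1 or bool).

    Assumption: two empty masks count as a perfect match (score 1.0).
    """
    if len(prediction) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(prediction, target) if p and t)
    total = sum(1 for p in prediction if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy masks with made-up labels: one overlapping voxel out of two per mask.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

For a 3D volume the masks would simply be flattened first; the score is unchanged by the flattening order as long as both masks use the same one.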

[130] Multi-contrast laser endoscopy for in vivo gastrointestinal imaging

Taylor L. Bobrow,Mayank Golhar,Suchapa Arayakarnkul,Anthony A. Song,Saowanee Ngamruengphong,Nicholas J. Durr

Main category: eess.IV

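TL;DR: Multi-contrast Laser Endoscopy (MLE) enhances detection of gastrointestinal lesions through multispectral, coherent, and directional illumination, with contrast and color difference significantly better than white light and narrow band imaging.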

Details

Motivation: White light endoscopy offers insufficient contrast when detecting gastrointestinal diseases, causing many cases to be missed. Method: MLE combines multispectral diffuse reflectance, laser speckle contrast imaging, and photometric stereo to enhance tissue contrast. Result: In imaging of 31 polyps, MLE improves contrast and color difference by approximately three-fold and five-fold, respectively. Conclusion: As a versatile tool, MLE shows promise for improving gastrointestinal imaging. Abstract: White light endoscopy is the clinical gold standard for detecting diseases in the gastrointestinal tract. Most applications involve identifying visual abnormalities in tissue color, texture, and shape. Unfortunately, the contrast of these features is often subtle, causing many clinically relevant cases to go undetected. To overcome this challenge, we introduce Multi-contrast Laser Endoscopy (MLE): a platform for widefield clinical imaging with rapidly tunable spectral, coherent, and directional illumination. We demonstrate three capabilities of MLE: enhancing tissue chromophore contrast with multispectral diffuse reflectance, quantifying blood flow using laser speckle contrast imaging, and characterizing mucosal topography using photometric stereo. We validate MLE with benchtop models, then demonstrate MLE in vivo during clinical colonoscopies. MLE images from 31 polyps demonstrate an approximate three-fold improvement in contrast and a five-fold improvement in color difference compared to white light and narrow band imaging. With the ability to reveal multiple complementary types of tissue contrast while seamlessly integrating into the clinical environment, MLE shows promise as an investigative tool to improve gastrointestinal imaging.
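
Laser speckle contrast imaging, one of the three capabilities named above, is commonly based on the local contrast K = σ / μ (standard deviation over mean of pixel intensity in a small window), where lower K indicates more motion blurring of the speckle pattern. The sketch below shows only that textbook quantity; the window size, function names, and toy image are assumptions, and the abstract does not describe MLE's actual processing.

```python
import statistics


def speckle_contrast(window):
    """Speckle contrast K = population std / mean of a flat list of intensities."""
    mean = statistics.fmean(window)
    if mean == 0:
        return 0.0  # assumption: treat an all-dark window as zero contrast
    return statistics.pstdev(window) / mean


def contrast_map(image, size):
    """Slide a size x size window over a 2D list and compute K at each position."""
    rows, cols = len(image), len(image[0])
    return [
        [
            speckle_contrast([image[r + dr][c + dc]
                              for dr in range(size) for dc in range(size)])
            for c in range(cols - size + 1)
        ]
        for r in range(rows - size + 1)
    ]


# Toy raw image with made-up intensities: uniform left side, speckled right side.
image = [
    [5, 5, 0, 20],
    [5, 5, 20, 0],
]
k_map = contrast_map(image, size=2)
```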

cs.SD [Back]

[131] LAV: Audio-Driven Dynamic Visual Generation with Neural Compression and StyleGAN2

Jongmin Jung,Dasaem Jeong

Main category: cs.SD

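TL;DR: The LAV system combines EnCodec audio compression with the generative capabilities of StyleGAN2 to produce dynamic visual output driven by pre-recorded audio.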

Details

Motivation: To explore the use of pretrained audio compression models for semantically rich audio-visual translation in artistic and computational applications. Method: EnCodec embeddings are used as latent representations and transformed directly into StyleGAN2's latent space via a randomly initialized linear mapping. Result: Semantic richness is preserved, enabling nuanced and semantically coherent audio-visual translation. Conclusion: LAV demonstrates the potential of pretrained audio compression models for artistic and computational applications. Abstract: This paper introduces LAV (Latent Audio-Visual), a system that integrates EnCodec's neural audio compression with StyleGAN2's generative capabilities to produce visually dynamic outputs driven by pre-recorded audio. Unlike previous works that rely on explicit feature mappings, LAV uses EnCodec embeddings as latent representations, directly transformed into StyleGAN2's style latent space via a randomly initialized linear mapping. This approach preserves semantic richness in the transformation, enabling nuanced and semantically coherent audio-visual translations. The framework demonstrates the potential of using pretrained audio compression models for artistic and computational applications.
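
The core idea, a fixed random linear map from per-frame audio embeddings to style latents, can be sketched in a few lines. Everything concrete here is an assumption for illustration: the dimensions (128 in, 512 out), the Gaussian initialization, the function names, and the synthetic stand-in for audio embeddings. No EnCodec or StyleGAN2 code is called, and the resulting vectors would still need to be fed to a generator.

```python
import math
import random

AUDIO_DIM = 128   # assumed embedding size per audio frame
LATENT_DIM = 512  # assumed style-latent size


def make_linear_map(in_dim, out_dim, seed=0):
    """Randomly initialized weight matrix (out_dim rows, in_dim columns).

    Assumption: zero-mean Gaussian entries scaled by 1/sqrt(in_dim).
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear_map(weights, embedding):
    """Matrix-vector product: one latent vector per audio embedding."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


weights = make_linear_map(AUDIO_DIM, LATENT_DIM)

# Synthetic stand-in for a short sequence of audio-frame embeddings.
frames = [[math.sin(0.1 * t + 0.01 * i) for i in range(AUDIO_DIM)]
          for t in range(4)]
latents = [apply_linear_map(weights, frame) for frame in frames]
```

Because the map is linear and fixed, nearby audio embeddings land on nearby latents, which is one way to read the abstract's claim that the transformation preserves the structure of the audio representation.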

Table of Contents

cs.CV [Back]

[1] A Computational Pipeline for Advanced Analysis of 4D Flow MRI in the Left Atrium

[2] Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

[3] BoundarySeg: An Embarrassingly Simple Method To Boost Medical Image Segmentation Performance for Low Data Regimes

[4] Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models

[5] Few-Shot Learning of Visual Compositional Concepts through Probabilistic Schema Induction

[6] Large-Scale Gaussian Splatting SLAM

[7] AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

[8] DDFP: Data-dependent Frequency Prompt for Source Free Domain Adaptation of Medical Image Segmentation

[9] VRU-CIPI: Crossing Intention Prediction at Intersections for Improving Vulnerable Road Users Safety

[10] Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset

[11] CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection

[12] MambaControl: Anatomy Graph-Enhanced Mamba ControlNet with Fourier Refinement for Diffusion-Based Disease Trajectory Prediction

[13] TKFNet: Learning Texture Key Factor Driven Feature for Facial Expression Recognition

[14] APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds

[15] High Quality Underwater Image Compression with Adaptive Correction and Codebook-based Augmentation

[16] PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

[17] Descriptive Image-Text Matching with Graded Contextual Similarity

[18] From Air to Wear: Personalized 3D Digital Fashion with AR/VR Immersive 3D Sketching

[19] Application of YOLOv8 in monocular downward multiple Car Target detection

[20] ORL-LDM: Offline Reinforcement Learning Guided Latent Diffusion Model Super-Resolution Reconstruction

[21] DeepSeqCoco: A Robust Mobile Friendly Deep Learning Model for Detection of Diseases in Cocos nucifera

[22] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

[23] Advances in Radiance Field for Dynamic Scene: From Neural Field to Gaussian Field

[24] PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

[25] ToonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars

[26] MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

[27] Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

[28] IMITATE: Image Registration with Context for unknown time frame recovery

[29] Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization

[30] Modeling Saliency Dataset Bias

[31] VolE: A Point-cloud Framework for Food 3D Reconstruction and Volume Estimation

[32] Data-Agnostic Augmentations for Unknown Variations: Out-of-Distribution Generalisation in MRI Segmentation

[33] On the Interplay of Human-AI Alignment, Fairness, and Performance Trade-offs in Medical Imaging

[34] MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

[35] ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization

[36] Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot

[37] Inferring Driving Maps by Deep Learning-based Trail Map Extraction

[38] HandReader: Advanced Techniques for Efficient Fingerspelling Recognition

[39] MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting

[40] MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning

[41] StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

[42] MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models

[43] A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability

[44] SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity

[45] Learned Lightweight Smartphone ISP with Unpaired Data

[46] Vision language models have difficulty recognizing virtual objects

[47] Consistent Quantity-Quality Control across Scenes for Deployment-Aware Gaussian Splatting

[48] Logos as a Well-Tempered Pre-train for Sign Language Recognition

[49] UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation

[50] CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

[51] MorphGuard: Morph Specific Margin Loss for Enhancing Robustness to Face Morphing Attacks

[52] Enhancing Multi-Image Question Answering via Submodular Subset Selection

[53] Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis

[54] Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data

[55] MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

[56] End-to-End Vision Tokenizer Tuning

[57] Depth Anything with Any Prior

[58] 3D-Fixup: Advancing Photo Editing with 3D Priors

cs.GR [Back]

[59] VRSplat: Fast and Robust Gaussian Splatting for Virtual Reality

[60] Style Customization of Text-to-Vector Generation with Image Diffusion Priors

cs.CL [Back]

[61] Next Word Suggestion using Graph Neural Network

[62] DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

[63] Large Language Models Are More Persuasive Than Incentivized Human Persuaders

[64] System Prompt Optimization with Meta-Learning

[65] VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

[66] An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs

[67] Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

[68] Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques

[69] Exploring the generalization of LLM truth directions on conversational formats

[70] KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning

[71] Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting

[72] Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries

[73] From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models

[74] Rethinking Prompt Optimizers: From Prompt Merits to Optimization

[75] Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph

[76] DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs