cs.CV

[1] A Computational Pipeline for Advanced Analysis of 4D Flow MRI in the Left Atrium

Xabier Morales,Ayah Elsayed,Debbie Zhao,Filip Loncaric,Ainhoa Aguado,Mireia Masias,Gina Quill,Marc Ramos,Ada Doltra,Ana Garcia,Marta Sitges,David Marlevi,Alistair Young,Martyn Nash,Bart Bijnens,Oscar Camara

Main category: cs.CV

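TL;DR: This paper introduces an open-source computational framework for analyzing 4D Flow MRI of the left atrium (LA), addressing the limitations of conventional ultrasound analysis, and provides the first comprehensive assessment of energy, vorticity, and pressure parameters as potential prognostic biomarkers.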

Details

Motivation: Conventional ultrasound analysis offers only a limited understanding of left atrial hemodynamics. 4D Flow MRI is promising but is constrained by low velocities, limited spatial resolution, and the lack of a dedicated computational framework. Method: The authors develop an open-source computational framework that supports qualitative and quantitative analysis of data from different centers and achieves high-accuracy automated segmentation (Dice > 0.9, Hausdorff 95 < 3 mm). Result: The framework is robust to data of varying quality and enables the first comprehensive assessment of the prognostic value of energy, vorticity, and pressure parameters across multiple disorders. Conclusion: The framework provides a reliable tool for studying left atrial hemodynamics and reveals the potential of new biomarkers. Abstract: The left atrium (LA) plays a pivotal role in modulating left ventricular filling, but our comprehension of its hemodynamics is significantly limited by the constraints of conventional ultrasound analysis. 4D flow magnetic resonance imaging (4D Flow MRI) holds promise for enhancing our understanding of atrial hemodynamics. However, the low velocities within the LA and the limited spatial resolution of 4D Flow MRI make analyzing this chamber challenging. Furthermore, the absence of dedicated computational frameworks, combined with diverse acquisition protocols and vendors, complicates gathering large cohorts for studying the prognostic value of hemodynamic parameters provided by 4D Flow MRI. In this study, we introduce the first open-source computational framework tailored for the analysis of 4D Flow MRI in the LA, enabling comprehensive qualitative and quantitative analysis of advanced hemodynamic parameters. Our framework proves robust to data from different centers of varying quality, producing high-accuracy automated segmentations (Dice > 0.9 and Hausdorff 95 < 3 mm), even with limited training data. Additionally, we conducted the first comprehensive assessment of energy, vorticity, and pressure parameters in the LA across a spectrum of disorders to investigate their potential as prognostic biomarkers.
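For readers unfamiliar with the segmentation metric quoted above, the following is a minimal sketch of the standard Dice coefficient on binary masks. It is not taken from the authors' framework; the function name `dice_coefficient` and the flat 0/1 list representation are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (hypothetical data): 2 shared foreground voxels, 3 + 2 in total.
predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 1, 1, 0, 0, 0]
score = dice_coefficient(predicted, reference)
```

A score of 1.0 means the masks coincide exactly, so the reported Dice > 0.9 indicates that the automated and reference segmentations overlap almost completely.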

[2] Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

Julian Tanke,Takashi Shibuya,Kengo Uchida,Koichi Saito,Yuki Mitsufuji

Main category: cs.CV

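TL;DR: Dyadic Mamba uses state-space models (SSMs) to generate long, high-quality dyadic (two-person) motion, overcoming the limitations of Transformer-based methods on long sequences.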

Details

Motivation: Existing Transformer-based methods perform poorly when generating long dyadic motion, mainly because of limitations in their positional encoding schemes. Method: Dyadic Mamba uses a simple yet effective SSM-based architecture that lets information flow between the two motion sequences without complex cross-attention mechanisms. Result: It performs well on short-sequence benchmarks, significantly outperforms Transformer-based methods on long sequences, and is accompanied by a new benchmark for long-sequence evaluation. Conclusion: SSM architectures offer a promising solution for long-term dyadic motion generation. Abstract: Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this paper, we introduce Dyadic Mamba, a novel approach that leverages State-Space Models (SSMs) to generate high-quality dyadic human motion of arbitrary length. Our method employs a simple yet effective architecture that facilitates information flow between individual motion sequences through concatenation, eliminating the need for complex cross-attention mechanisms. We demonstrate that Dyadic Mamba achieves competitive performance on standard short-term benchmarks while significantly outperforming transformer-based approaches on longer sequences. Additionally, we propose a new benchmark for evaluating long-term motion synthesis quality, providing a standardized framework for future research. Our results demonstrate that SSM-based architectures offer a promising direction for addressing the challenging task of long-term dyadic human motion synthesis from text descriptions.

[3] BoundarySeg: An Embarrassingly Simple Method To Boost Medical Image Segmentation Performance for Low Data Regimes

Tushar Kataria,Shireen Y. Elhabian

Main category: cs.CV

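TL;DR: BoundarySeg is a multi-task framework that adds organ boundary prediction as an auxiliary task to improve medical image segmentation accuracy without relying on unannotated data.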

Details

Motivation: Medical data are difficult to acquire and annotate, and semi-supervised methods depend on unannotated data and have limited effectiveness. Method: The BoundarySeg framework treats organ boundary prediction as an auxiliary task and uses consistency between the two tasks as additional supervision. Result: It performs strongly in low-data regimes, matching or exceeding existing semi-supervised methods. Conclusion: BoundarySeg offers an efficient medical image segmentation solution that does not depend on unannotated data. Abstract: Obtaining large-scale medical data, annotated or unannotated, is challenging due to stringent privacy regulations and data protection policies. In addition, annotating medical images requires that domain experts manually delineate anatomical structures, making the process both time-consuming and costly. As a result, semi-supervised methods have gained popularity for reducing annotation costs. However, the performance of semi-supervised methods is heavily dependent on the availability of unannotated data, and their effectiveness declines when such data are scarce or absent. To overcome this limitation, we propose a simple, yet effective and computationally efficient approach for medical image segmentation that leverages only existing annotations. We propose BoundarySeg, a multi-task framework that incorporates organ boundary prediction as an auxiliary task to full organ segmentation, leveraging consistency between the two task predictions to provide additional supervision. This strategy improves segmentation accuracy, especially in low data regimes, allowing our method to achieve performance comparable to or exceeding state-of-the-art semi-supervised approaches, all without relying on unannotated data or increasing computational demands. Code will be released upon acceptance.
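The abstract notes that the boundary task leverages only existing annotations. One plausible way to obtain boundary targets from a segmentation mask is sketched below; this is an assumption for illustration, not the authors' procedure, and the function name `boundary_mask` and the 4-connectivity rule are hypothetical choices.

```python
def boundary_mask(mask):
    """Mark foreground pixels that touch background (4-connectivity)."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue  # background pixels are never boundary pixels
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                # A neighbour outside the image or in the background
                # makes this foreground pixel a boundary pixel.
                if not (0 <= nr < rows and 0 <= nc < cols) or not mask[nr][nc]:
                    out[r][c] = 1
                    break
    return out


# Toy 5x5 annotation (hypothetical data) with a 3x3 organ in the middle.
organ = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
edges = boundary_mask(organ)
```

Here the eight outer pixels of the organ are marked as boundary while the centre pixel is not, so the auxiliary target comes for free from the segmentation label.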

[4] Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models

Danush Kumar Venkatesh,Isabel Funke,Micha Pfeiffer,Fiona Kolbinger,Hanna Maria Schmeiser,Juergen Weitz,Marius Distler,Stefanie Speidel

Main category: cs.CV

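TL;DR: A two-stage, text-conditioned diffusion method synthesizes surgical videos to address data imbalance and substantially improves downstream task performance.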

Details

Motivation: Severe data imbalance in surgical video datasets hinders the development of high-performing models, so a method is needed to generate high-quality, class-balanced synthetic data. Method: A two-stage, text-conditioned diffusion approach decouples spatial and temporal modeling and is combined with a rejection sampling strategy that selects the most suitable synthetic samples. Result: Synthetic videos substantially improve model performance on surgical action recognition and intra-operative event prediction. Conclusion: The method effectively addresses data imbalance, and its open-source implementation supports further research. Abstract: Computer-assisted interventions can improve intra-operative guidance, particularly through deep learning methods that harness the spatiotemporal information in surgical videos. However, the severe data imbalance often found in surgical video datasets hinders the development of high-performing models. In this work, we aim to overcome the data imbalance by synthesizing surgical videos. We propose a unique two-stage, text-conditioned diffusion-based method to generate high-fidelity surgical videos for under-represented classes. Our approach conditions the generation process on text prompts and decouples spatial and temporal modeling by utilizing a 2D latent diffusion model to capture spatial content and then integrating temporal attention layers to ensure temporal consistency. Furthermore, we introduce a rejection sampling strategy to select the most suitable synthetic samples, effectively augmenting existing datasets to address class imbalance. We evaluate our method on two downstream tasks, surgical action recognition and intra-operative event prediction, demonstrating that incorporating synthetic videos from our approach substantially enhances model performance. We open-source our implementation at https://gitlab.com/nct_tso_public/surgvgen.
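The rejection sampling step can be pictured with the generic sketch below. The abstract does not state the acceptance criterion, so the scoring function, the threshold, and the names `rejection_sample`, `score_fn`, and `max_keep` are all assumptions made for illustration.

```python
def rejection_sample(candidates, score_fn, threshold, max_keep=None):
    """Keep candidates whose score reaches the threshold, best first."""
    scored = [(score_fn(c), c) for c in candidates]
    # Reject every candidate that scores below the threshold.
    accepted = [(s, c) for s, c in scored if s >= threshold]
    accepted.sort(key=lambda pair: pair[0], reverse=True)
    kept = [c for _, c in accepted]
    return kept if max_keep is None else kept[:max_keep]


# Hypothetical candidates: (clip id, confidence that the clip shows the intended class).
synthetic = [("clip_a", 0.91), ("clip_b", 0.42), ("clip_c", 0.77), ("clip_d", 0.85)]
selected = rejection_sample(
    synthetic, score_fn=lambda clip: clip[1], threshold=0.75, max_keep=2
)
```

In this toy run `clip_b` is rejected outright and only the two highest-scoring clips are added to the under-represented class.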

[5] Few-Shot Learning of Visual Compositional Concepts through Probabilistic Schema Induction

Andrew Jun Lee,Taylor Webb,Trevor Bihl,Keith Holyoak,Hongjing Lu

Main category: cs.CV

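TL;DR: The PSI model uses structured representations and analogical mapping to learn visual concepts from a few examples in a human-like way, outperforming traditional models.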

Details

Motivation: The work studies how humans learn visual concepts from limited examples, emphasizing the importance of structured representations and analogical mapping. Method: The authors propose the Probabilistic Schema Induction (PSI) model, which combines deep learning with analogical mapping and weighs object-level similarity against relational similarity. Result: PSI approaches human learning performance and outperforms a traditional unstructured model and a weakly structured variant. Conclusion: Structured representations and analogical mapping are key to rapid learning of visual concepts, and deep learning can support the development of psychological models. Abstract: The ability to learn new visual concepts from limited examples is a hallmark of human cognition. While traditional category learning models represent each example as an unstructured feature vector, compositional concept learning is thought to depend on (1) structured representations of examples (e.g., directed graphs consisting of objects and their relations) and (2) the identification of shared relational structure across examples through analogical mapping. Here, we introduce Probabilistic Schema Induction (PSI), a prototype model that employs deep learning to perform analogical mapping over structured representations of only a handful of examples, forming a compositional concept called a schema. In doing so, PSI relies on a novel conception of similarity that weighs object-level similarity and relational similarity, as well as a mechanism for amplifying relations relevant to classification, analogous to selective attention parameters in traditional models. We show that PSI produces human-like learning performance and outperforms two controls: a prototype model that uses unstructured feature vectors extracted from a deep learning model, and a variant of PSI with weaker structured representations. Notably, we find that PSI's human-like performance is driven by an adaptive strategy that increases relational similarity over object-level similarity and upweights the contribution of relations that distinguish classes. These findings suggest that structured representations and analogical mapping are critical to modeling rapid human-like learning of compositional visual concepts, and demonstrate how deep learning can be leveraged to create psychological models.

[6] Large-Scale Gaussian Splatting SLAM

Zhe Xin,Chenyang Wu,Penghui Huang,Yanyong Zhang,Yinian Mao,Guoquan Huang

Main category: cs.CV

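TL;DR: LSG-SLAM is a large-scale visual SLAM method based on 3D Gaussian Splatting with stereo cameras. A multi-modality strategy and feature-alignment constraints improve its robustness, and it performs well in large-scale scenes.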

Details

Motivation: Existing NeRF and 3DGS methods mostly rely on RGBD sensors and work only in indoor environments, while robust reconstruction of large-scale outdoor scenes remains underexplored. Method: LSG-SLAM uses a multi-modality strategy to estimate initial poses, adds feature-alignment constraints to the rendering loss, handles unbounded scenes with continuous Gaussian Splatting submaps, and improves reconstruction quality through pose optimization and a structure refinement module. Result: On the EuRoC and KITTI datasets, LSG-SLAM outperforms existing neural, 3DGS-based, and traditional methods. Conclusion: LSG-SLAM provides an efficient and robust visual SLAM solution for large-scale outdoor scenes. Abstract: The recently developed Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown encouraging and impressive results for visual SLAM. However, most representative methods require RGBD sensors and are only available for indoor environments. The robustness of reconstruction in large-scale outdoor scenarios remains unexplored. This paper introduces a large-scale 3DGS-based visual SLAM with stereo cameras, termed LSG-SLAM. The proposed LSG-SLAM employs a multi-modality strategy to estimate prior poses under large view changes. In tracking, we introduce feature-alignment warping constraints to alleviate the adverse effects of appearance similarity in rendering losses. For the scalability of large-scale scenarios, we introduce continuous Gaussian Splatting submaps to tackle unbounded scenes with limited memory. Loops are detected between GS submaps by place recognition and the relative pose between looped keyframes is optimized utilizing rendering and feature warping losses. After the global optimization of camera poses and Gaussian points, a structure refinement module enhances the reconstruction quality. With extensive evaluations on the EuRoC and KITTI datasets, LSG-SLAM achieves superior performance over existing Neural, 3DGS-based, and even traditional approaches. Project page: https://lsg-slam.github.io.

[7] AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

Bin-Bin Gao,Yue Zhu,Jiangtao Yan,Yuezhi Cai,Weixi Zhang,Meng Wang,Jun Liu,Yong Liu,Lei Wang,Chengjie Wang

Main category: cs.CV

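TL;DR: AdaptCLIP is a simple and effective method that learns visual and textual representations alternately and combines contextual with aligned residual features, substantially improving universal visual anomaly detection.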

Details

Motivation: Existing methods are limited by prompt template design, complex token interactions, or the need for additional fine-tuning. AdaptCLIP aims to address these issues with a more flexible solution. Method: AdaptCLIP builds on two key insights: visual and textual representations should be learned alternately, and comparative learning should combine contextual and aligned residual features. It adds only three simple adapters and needs no fine-tuning on the target domain. Result: AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing methods. Conclusion: With simple adapters and training-free use on target domains, AdaptCLIP achieves zero-/few-shot generalization across domains and offers an efficient solution for visual anomaly detection. Abstract: Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, complex token interactions, or requiring additional fine-tuning, resulting in limited flexibility. In this work, we present a simple yet effective method called AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters, visual adapter, textual adapter, and prompt-query adapter, at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and operates in a training-free manner on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods. We will make the code and model of AdaptCLIP available at https://github.com/gaobb/AdaptCLIP.

[8] DDFP: Data-dependent Frequency Prompt for Source Free Domain Adaptation of Medical Image Segmentation

Siqi Yin,Shaolei Liu,Manning Wang

Main category: cs.CV

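TL;DR: This paper proposes a new source-free domain adaptation (SFDA) framework that uses preadaptation to generate high-quality pseudo-labels, a data-dependent frequency prompt, and a style-related layer fine-tuning strategy, significantly improving cross-modality medical image segmentation.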

Details

Motivation: Privacy policies restrict access to labeled source-domain data in medical settings, and existing SFDA methods still leave room for improvement in image style translation and pseudo-label quality. Method: The authors introduce preadaptation to produce a preadapted model, propose a data-dependent frequency prompt for style translation, and adopt a style-related layer fine-tuning strategy. Result: The method outperforms existing state-of-the-art methods on cross-modality abdominal and cardiac SFDA segmentation tasks. Conclusion: The proposed framework effectively addresses the domain gap in SFDA and improves model performance. Abstract: Domain adaptation addresses the challenge of model performance degradation caused by domain gaps. In the typical setup for unsupervised domain adaptation, labeled data from a source domain and unlabeled data from a target domain are used to train a target model. However, access to labeled source domain data, particularly in medical datasets, can be restricted due to privacy policies. As a result, research has increasingly shifted to source-free domain adaptation (SFDA), which requires only a pretrained model from the source domain and unlabeled data from the target domain for adaptation. Existing SFDA methods often rely on domain-specific image style translation and self-supervision techniques to bridge the domain gap and train the target domain model. However, the quality of domain-specific style-translated images and pseudo-labels produced by these methods still leaves room for improvement. Moreover, training the entire model during adaptation can be inefficient under limited supervision. In this paper, we propose a novel SFDA framework to address these challenges. Specifically, to effectively mitigate the impact of domain gap in the initial training phase, we introduce preadaptation to generate a preadapted model, which serves as an initialization of target model and allows for the generation of high-quality enhanced pseudo-labels without introducing extra parameters. Additionally, we propose a data-dependent frequency prompt to more effectively translate target domain images into a source-like style. To further enhance adaptation, we employ a style-related layer fine-tuning strategy, specifically designed for SFDA, to train the target model using the prompted target domain images and pseudo-labels. 
Extensive experiments on cross-modality abdominal and cardiac SFDA segmentation tasks demonstrate that our proposed method outperforms existing state-of-the-art methods.

[9] VRU-CIPI: Crossing Intention Prediction at Intersections for Improving Vulnerable Road Users Safety

Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Quoc Dai Tran

Main category: cs.CV

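TL;DR: The paper proposes VRU-CIPI, an attention-based framework that predicts the crossing intentions of vulnerable road users (VRUs) at intersections by combining a GRU with Transformer self-attention, reaching 96.45% accuracy and real-time inference speed on the UCF-VRU dataset.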

Details

Motivation: Understanding and predicting the crossing intentions of vulnerable road users at intersections is essential for safer interactions on the road, especially for avoiding dangerous conflicts with vehicles. Method: A GRU captures the temporal dynamics of VRU movement and is combined with multi-head Transformer self-attention to encode contextual and spatial dependencies. Result: The model achieves 96.45% accuracy and a real-time inference speed of 33 frames per second on the UCF-VRU dataset. Conclusion: Integrated with I2V communication, the VRU-CIPI framework can improve intersection safety and provide smoother and safer interactions for all road users. Abstract: Understanding and predicting human behavior in the wild, particularly at urban intersections, remains crucial for enhancing interaction safety between road users. Among the most critical behaviors are crossing intentions of Vulnerable Road Users (VRUs), where misinterpretation may result in dangerous conflicts with oncoming vehicles. In this work, we propose the VRU-CIPI framework with a sequential attention-based model designed to predict VRU crossing intentions at intersections. VRU-CIPI employs a Gated Recurrent Unit (GRU) to capture temporal dynamics in VRU movements, combined with a multi-head Transformer self-attention mechanism to encode contextual and spatial dependencies critical for predicting crossing direction. Evaluated on the UCF-VRU dataset, our proposed framework achieves state-of-the-art performance with an accuracy of 96.45% and a real-time inference speed of 33 frames per second. Furthermore, by integrating with Infrastructure-to-Vehicles (I2V) communication, our approach can proactively enhance intersection safety through timely activation of crossing signals and providing early warnings to connected vehicles, ensuring smoother and safer interactions for all road users.

[10] Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset

Zhe Shan,Lei Zhou,Liu Mao,Shaofan Chen,Chuanqiu Ren,Xia Xie

Main category: cs.CV

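TL;DR: The paper proposes a new remote sensing change detection task, non-registration change detection, to handle emergencies such as natural disasters. The authors systematically identify eight scenarios that can lead to non-registration problems and develop image transformation schemes tailored to them. Experiments show that non-registration change detection severely degrades existing state-of-the-art methods.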

Details

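Motivation: In response to the growing number of emergencies such as natural disasters, anthropogenic accidents, and military strikes, the paper proposes the non-registration change detection task to fill a gap in existing research. Method: The authors systematically propose eight real-world scenarios that cause non-registration problems and develop corresponding image transformation schemes to convert existing registration change detection datasets into non-registration versions. Result: Experiments show that non-registration change detection has a catastrophic impact on existing state-of-the-art methods. Conclusion: Non-registration change detection is an important but under-studied problem, and the proposed methods and dataset provide a basis for future research. Abstract: In this study, we propose a novel remote sensing change detection task, non-registration change detection, to address the increasing number of emergencies such as natural disasters, anthropogenic accidents, and military strikes. First, in light of the limited discourse on the issue of non-registration change detection, we systematically propose eight scenarios that could arise in the real world and potentially contribute to the occurrence of non-registration problems. Second, we develop distinct image transformation schemes tailored to various scenarios to convert the available registration change detection dataset into a non-registration version. Finally, we demonstrate that non-registration change detection can cause catastrophic damage to the state-of-the-art methods. Our code and dataset are available at https://github.com/ShanZard/NRCD.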

[11] CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection

Jiakun Deng,Kexuan Li,Xingye Cui,Jiaxuan Li,Chang Long,Tian Pu,Zhenming Peng

Main category: cs.CV

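TL;DR: CSPENet, a contour-aware and saliency priors embedding network for infrared small target detection, addresses the weaknesses of existing methods in localizing dim targets and perceiving contour information in densely cluttered environments.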

Details

Motivation: Existing infrared small target detection methods are deficient in localizing dim targets and perceiving contour information, which limits detection performance. Method: The authors design a surround-convergent prior extraction module (SCPEM) to capture target contour features, propose a dual-branch priors embedding architecture (DBPEA) to fuse features, and develop an attention-guided feature enhancement module (AGFEM) to refine feature representations. Result: CSPENet outperforms other state-of-the-art methods on the NUDT-SIRST, IRSTD-1k, and NUAA-SIRST datasets. Conclusion: By combining contour awareness with saliency priors, CSPENet significantly improves infrared small target detection. Abstract: Infrared small target detection (ISTD) plays a critical role in a wide range of civilian and military applications. Existing methods suffer from deficiencies in the localization of dim targets and the perception of contour information under dense clutter environments, severely limiting their detection performance. To tackle these issues, we propose a contour-aware and saliency priors embedding network (CSPENet) for ISTD. We first design a surround-convergent prior extraction module (SCPEM) that effectively captures the intrinsic characteristic of target contour pixel gradients converging toward their center. This module concurrently extracts two collaborative priors: a boosted saliency prior for accurate target localization and multi-scale structural priors for comprehensively enriching contour detail representation. Building upon this, we propose a dual-branch priors embedding architecture (DBPEA) that establishes differentiated feature fusion pathways, embedding these two priors at optimal network positions to achieve performance enhancement. Finally, we develop an attention-guided feature enhancement module (AGFEM) to refine feature representations and improve saliency estimation accuracy. Experimental results on public datasets NUDT-SIRST, IRSTD-1k, and NUAA-SIRST demonstrate that our CSPENet outperforms other state-of-the-art methods in detection performance. The code is available at https://github.com/IDIP2025/CSPENet.

Hao Yang,Tao Tan,Shuai Tan,Weiqin Yang,Kunyan Cai,Calvin Chen,Yue Sun

Main category: cs.CV

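TL;DR: MambaControl is a new framework that combines selective state-space modelling with diffusion processes for high-fidelity prediction of medical image trajectories, with a focus on Alzheimer's disease prediction.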

Details

Motivation: Existing methods fall short in capturing longitudinal dependencies and structural consistency, particularly in progressive diseases. Method: MambaControl combines Mamba-based long-range modelling with graph-guided anatomical control and introduces Fourier-enhanced spectral graph representations to capture spatial coherence and multiscale detail. Result: Quantitative and regional evaluations show that MambaControl performs strongly in prediction quality and anatomical fidelity. Conclusion: MambaControl shows potential for personalised prognosis and clinical decision support. Abstract: Modelling disease progression in precision medicine requires capturing complex spatio-temporal dynamics while preserving anatomical integrity. Existing methods often struggle with longitudinal dependencies and structural consistency in progressive disorders. To address these limitations, we introduce MambaControl, a novel framework that integrates selective state-space modelling with diffusion processes for high-fidelity prediction of medical image trajectories. To better capture subtle structural changes over time while maintaining anatomical consistency, MambaControl combines Mamba-based long-range modelling with graph-guided anatomical control to more effectively represent anatomical correlations. Furthermore, we introduce Fourier-enhanced spectral graph representations to capture spatial coherence and multiscale detail, enabling MambaControl to achieve state-of-the-art performance in Alzheimer's disease prediction. Quantitative and regional evaluations demonstrate improved progression prediction quality and anatomical fidelity, highlighting its potential for personalised prognosis and clinical decision support.

[13] TKFNet: Learning Texture Key Factor Driven Feature for Facial Expression Recognition

Liqian Deng

Main category: cs.CV

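TL;DR: The paper proposes a new framework built on Texture Key Driver Factors (TKDF) that uses a texture-aware feature extractor and dual contextual information filtering to significantly improve facial expression recognition in the wild.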

Details

Motivation: Facial expression recognition (FER) in the wild is challenging because expression-related features are subtle and localized and facial appearance varies in complex ways. The paper addresses this by focusing on Texture Key Driver Factors (TKDF). Method: The proposed architecture comprises a Texture-Aware Feature Extractor (TAFE) and Dual Contextual Information Filtering (DCIF). TAFE extracts texture features with a ResNet-based backbone enhanced by multi-branch attention, and DCIF refines the features through adaptive pooling and attention mechanisms. Result: Experiments on the RAF-DB and KDEF datasets show that the method achieves state-of-the-art performance, verifying the effectiveness and robustness of TKDF for FER. Conclusion: By focusing on local texture features and combining them with contextual filtering, the method significantly improves FER performance and points to a new direction for future research. Abstract: Facial expression recognition (FER) in the wild remains a challenging task due to the subtle and localized nature of expression-related features, as well as the complex variations in facial appearance. In this paper, we introduce a novel framework that explicitly focuses on Texture Key Driver Factors (TKDF), localized texture regions that exhibit strong discriminative power across emotional categories. By carefully observing facial image patterns, we identify that certain texture cues, such as micro-changes in skin around the brows, eyes, and mouth, serve as primary indicators of emotional dynamics. To effectively capture and leverage these cues, we propose a FER architecture comprising a Texture-Aware Feature Extractor (TAFE) and Dual Contextual Information Filtering (DCIF). TAFE employs a ResNet-based backbone enhanced with multi-branch attention to extract fine-grained texture representations, while DCIF refines these features by filtering context through adaptive pooling and attention mechanisms. Experimental results on RAF-DB and KDEF datasets demonstrate that our method achieves state-of-the-art performance, verifying the effectiveness and robustness of incorporating TKDFs into FER pipelines.

[14] APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds

Yuan Gao,Shaobo Xia,Sheng Nie,Cheng Wang,Xiaohuan Xi,Bisheng Yang

Main category: cs.CV

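TL;DR: APCoTTA is a continual test-time adaptation method for semantic segmentation of ALS point clouds. It tackles domain shift with dynamic trainable layer selection, an entropy-based consistency loss, and random parameter interpolation, and performs strongly on two new benchmarks.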

Details

Motivation: ALS point cloud segmentation models degrade under changes in the environment, sensor type, or sensor condition, and CTTA has been little studied in this field. Method: The authors propose a dynamic trainable layer selection module, an entropy-based consistency loss, and a random parameter interpolation mechanism to balance target adaptation against retention of source knowledge. Result: On the two new benchmarks, ISPRSC and H3DC, mIoU improves by approximately 9% and 14%, respectively. Conclusion: APCoTTA effectively addresses domain adaptation for ALS point cloud segmentation and provides new benchmarks and code. Abstract: Airborne laser scanning (ALS) point cloud segmentation is a fundamental task for large-scale 3D scene understanding. In real-world applications, models are typically fixed after training. However, domain shifts caused by changes in the environment, sensor types, or sensor degradation often lead to a decline in model performance. Continuous Test-Time Adaptation (CTTA) offers a solution by adapting a source-pretrained model to evolving, unlabeled target domains. Despite its potential, research on ALS point clouds remains limited, facing challenges such as the absence of standardized datasets and the risk of catastrophic forgetting and error accumulation during prolonged adaptation. To tackle these challenges, we propose APCoTTA, the first CTTA method tailored for ALS point cloud semantic segmentation. We propose a dynamic trainable layer selection module. This module utilizes gradient information to select low-confidence layers for training, and the remaining layers are kept frozen, mitigating catastrophic forgetting. To further reduce error accumulation, we propose an entropy-based consistency loss. By filtering samples based on entropy, we apply consistency loss only to the reliable samples, enhancing model stability. In addition, we propose a random parameter interpolation mechanism, which randomly blends parameters from the selected trainable layers with those of the source model. This approach helps balance target adaptation and source knowledge retention, further alleviating forgetting. Finally, we construct two benchmarks, ISPRSC and H3DC, to address the lack of CTTA benchmarks for ALS point cloud segmentation. 
Experimental results demonstrate that APCoTTA achieves the best performance on two benchmarks, with mIoU improvements of approximately 9% and 14% over direct inference. The new benchmarks and code are available at https://github.com/Gaoyuan2/APCoTTA.
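The random parameter interpolation idea can be illustrated with the minimal sketch below. The abstract only says that parameters of the selected trainable layers are randomly blended with those of the source model, so the per-parameter coin flip, the blending weight `alpha`, the probability `prob`, and the dictionary representation are assumptions for illustration, not the authors' implementation.

```python
import random


def random_interpolate(adapted, source, alpha=0.5, prob=0.5, rng=None):
    """Blend each adapted parameter toward its source value with probability `prob`."""
    if rng is None:
        rng = random.Random()
    blended = {}
    for name, value in adapted.items():
        if name in source and rng.random() < prob:
            # Pull the adapted weight back toward the source weight.
            blended[name] = alpha * source[name] + (1.0 - alpha) * value
        else:
            # Keep the adapted weight unchanged.
            blended[name] = value
    return blended


# Hypothetical scalar "parameters" standing in for full weight tensors.
source_params = {"layer1.w": 1.0, "layer2.w": -2.0}
adapted_params = {"layer1.w": 3.0, "layer2.w": 0.0}
always = random_interpolate(adapted_params, source_params, alpha=0.5, prob=1.0)
never = random_interpolate(adapted_params, source_params, alpha=0.5, prob=0.0)
```

With `prob=1.0` every parameter moves halfway back to its source value, and with `prob=0.0` the adapted parameters are left untouched; intermediate settings trade target adaptation against retention of source knowledge.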

[15] High Quality Underwater Image Compression with Adaptive Correction and Codebook-based Augmentation

Yimin Zhou,Yichong Xia,Sicheng Pan,Bin Chen,Baoyi An,Haoqian Wang,Zhi Wang,Yaowei Wang,Zikun Zhou

Main category: cs.CV

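TL;DR: HQUIC is a new underwater image compression algorithm that exploits features specific to underwater images (such as light attenuation and global light information) and dynamically weights multi-scale frequency components, significantly improving compression efficiency.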

Details

Motivation: Existing underwater image compression algorithms do not fully exploit the distinctive characteristics of underwater scenes, which leads to suboptimal performance. HQUIC addresses this with a targeted design. Method: HQUIC uses an ALTC module to predict attenuation coefficients and global light information, uses a codebook to extract common objects in underwater images, and dynamically weights multi-scale frequency components to optimize compression. Result: Evaluations on multiple underwater datasets show that HQUIC outperforms existing state-of-the-art compression methods. Conclusion: By optimizing specifically for underwater image features, HQUIC significantly improves compression performance and offers a new solution for efficient transmission and storage of underwater images. Abstract: With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terrestrial images, resulting in suboptimal performance. To address this limitation, we introduce HQUIC, designed to exploit underwater-image-specific features for enhanced compression efficiency. HQUIC employs an ALTC module to adaptively predict the attenuation coefficients and global light information of the images, which effectively mitigates the issues caused by the differences in lighting and tone existing in underwater images. Subsequently, HQUIC employs a codebook as an auxiliary branch to extract the common objects within underwater images and enhances the performance of the main branch. Furthermore, HQUIC dynamically weights multi-scale frequency components, prioritizing information critical for distortion quality while discarding redundant details. Extensive evaluations on diverse underwater datasets demonstrate that HQUIC outperforms state-of-the-art compression methods.

[16] PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

Long Cheng,Jiafei Duan,Yi Ru Wang,Haoquan Fang,Boyang Li,Yushan Huang,Elvis Wang,Ainaz Eftekhar,Jason Lee,Wentao Yuan,Rose Hendrix,Noah A. Smith,Fei Xia,Dieter Fox,Ranjay Krishna

Main category: cs.CV

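TL;DR: PointArena is a comprehensive platform for evaluating multimodal pointing, comprising a dataset, an interactive arena, and a robotic system. Evaluations show that Molmo-72B performs best.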

Details

Motivation: Existing benchmarks focus only on object localization tasks and lack a comprehensive evaluation of multimodal pointing. Method: PointArena consists of three components: the Point-Bench dataset, the Point-Battle interactive arena, and the Point-Act robotic system. Result: Molmo-72B performs best, proprietary models come close, and supervised training that targets pointing tasks significantly improves performance. Conclusion: Precise pointing is essential for multimodal models to connect abstract reasoning with concrete action. Abstract: Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset containing approximately 1,000 pointing tasks across five reasoning categories; (2) Point-Battle, an interactive, web-based arena facilitating blind, pairwise model comparisons, which has already gathered over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system allowing users to directly evaluate multimodal model pointing capabilities in practical settings. We conducted extensive evaluations of both state-of-the-art open-source and proprietary multimodal models. Results indicate that Molmo-72B consistently outperforms other models, though proprietary models increasingly demonstrate comparable performance. Additionally, we find that supervised training specifically targeting pointing tasks significantly enhances model performance. Across our multi-stage evaluation pipeline, we also observe strong correlations, underscoring the critical role of precise pointing capabilities in enabling multimodal models to effectively bridge abstract reasoning with concrete, real-world actions. Project page: https://pointarena.github.io/

[17] Descriptive Image-Text Matching with Graded Contextual Similarity

Jinhyun Jang,Jiyeong Lee,Kwanghoon Sohn

Main category: cs.CV

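TL;DR: The paper proposes DITM, a method that explores the descriptive flexibility of language to learn graded contextual similarity between images and text, moving beyond conventional sparse binary supervision.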

Details

Motivation: Existing methods use sparse binary supervision, which ignores the inherent many-to-many relationships between images and text and overlooks the implicit connections from general to specific descriptions. Method: DITM uses a sentence descriptiveness score (based on TF-IDF) to dynamically adjust the connectivity between positive and negative pairs and to align sentences in a generic-to-specific order. Result: Experiments on the MS-COCO, Flickr30K, and CxC datasets show that DITM represents complex image-text relationships better than existing methods and improves the model's hierarchical reasoning ability. Conclusion: With a flexible supervision mechanism and a graded alignment strategy, DITM significantly improves the performance and expressiveness of image-text matching. Abstract: Image-text matching aims to build correspondences between visual and textual data by learning their pairwise similarities. Most existing approaches have adopted sparse binary supervision, indicating whether a pair of images and sentences matches or not. However, such sparse supervision covers a limited subset of image-text relationships, neglecting their inherent many-to-many correspondences; an image can be described in numerous texts at different descriptive levels. Moreover, existing approaches overlook the implicit connections from general to specific descriptions, which form the underlying rationale for the many-to-many relationships between vision and language. In this work, we propose descriptive image-text matching, called DITM, to learn the graded contextual similarity between image and text by exploring the descriptive flexibility of language. We formulate the descriptiveness score of each sentence with cumulative term frequency-inverse document frequency (TF-IDF) to balance the pairwise similarity according to the keywords in the sentence. Our method leverages sentence descriptiveness to learn robust image-text matching in two key ways: (1) to refine the false negative labeling, dynamically relaxing the connectivity between positive and negative pairs, and (2) to build more precise matching, aligning a set of relevant sentences in a generic-to-specific order. By moving beyond rigid binary supervision, DITM enhances the discovery of both optimal matches and potential positive pairs. Extensive experiments on MS-COCO, Flickr30K, and CxC datasets demonstrate the effectiveness of our method in representing complex image-text relationships compared to state-of-the-art approaches. 
In addition, DITM enhances the hierarchical reasoning ability of the model, supported by the extensive analysis on HierarCaps benchmark.
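
The cumulative TF-IDF descriptiveness score mentioned in the DITM abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption: the function name `descriptiveness_scores`, whitespace tokenization, length-normalized term frequency, and unsmoothed IDF are illustrative choices, not the authors' implementation.

```python
import math
from collections import Counter


def descriptiveness_scores(sentences):
    """Return one summed TF-IDF score per sentence (hypothetical helper).

    Assumes every sentence contains at least one whitespace-separated token.
    """
    docs = [s.lower().split() for s in sentences]
    n_docs = len(docs)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = 0.0
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])  # unsmoothed IDF (assumption)
            total += (count / len(doc)) * idf  # accumulate TF-IDF over terms
        scores.append(total)
    return scores


# Hypothetical captions for one image, from generic to specific.
captions = [
    "a dog",
    "a dog runs",
    "a brown dog runs across a grassy field",
]
scores = descriptiveness_scores(captions)
```

Under these assumptions, terms shared by every caption contribute nothing, so the generic caption scores lowest and the most specific caption scores highest, which is the ordering a generic-to-specific alignment would rely on.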

[18] From Air to Wear: Personalized 3D Digital Fashion with AR/VR Immersive 3D Sketching

Ying Zang,Yuanqi Hu,Xinyu Chen,Yuxia Xu,Suhui Wang,Chunan Yu,Lanyun Zhu,Deyi Ji,Xin Xu,Tianrun Chen

Main category: cs.CV

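TL;DR: Proposes a 3D sketch-driven 3D garment generation framework that lets everyday users design high-quality digital clothing from simple sketches drawn in AR/VR environments.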

Details

Motivation: Existing 3D garment design tools have steep technical barriers and limited data, which keeps them out of reach for everyday users. Method: Combines a conditional diffusion model, a sketch encoder trained in a shared latent space, and an adaptive curriculum learning strategy to interpret imprecise free-hand input and generate realistic garments. Result: Experiments and user studies show that the method significantly outperforms existing baselines in fidelity and usability. Conclusion: The method shows promise for democratized fashion design on next-generation consumer platforms. Abstract: In the era of immersive consumer electronics, such as AR/VR headsets and smart devices, people increasingly seek ways to express their identity through virtual fashion. However, existing 3D garment design tools remain inaccessible to everyday users due to steep technical barriers and limited data. In this work, we introduce a 3D sketch-driven 3D garment generation framework that empowers ordinary users - even those without design experience - to create high-quality digital clothing through simple 3D sketches in AR/VR environments. By combining a conditional diffusion model, a sketch encoder trained in a shared latent space, and an adaptive curriculum learning strategy, our system interprets imprecise, free-hand input and produces realistic, personalized garments. To address the scarcity of training data, we also introduce KO3DClothes, a new dataset of paired 3D garments and user-created sketches. Extensive experiments and user studies confirm that our method significantly outperforms existing baselines in both fidelity and usability, demonstrating its promise for democratized fashion design on next-generation consumer platforms.

[19] Application of YOLOv8 in monocular downward multiple Car Target detection

Shijie Lyu

Main category: cs.CV

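TL;DR: The paper proposes an improved autonomous target detection network based on YOLOv8 that integrates structural reparameterization and a bidirectional pyramid structure network, improving detection accuracy for multi-scale, small, and remote objects.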

Details

Motivation: Current environment perception methods for autonomous driving (such as radar and cameras) suffer from high cost and vulnerability to weather and lighting conditions, and need improvement. Method: Integrates structural reparameterization, a bidirectional pyramid structure network model, and a novel detection pipeline into the YOLOv8 framework. Result: The improved model reaches a detection accuracy of 65% in experiments, which the authors report as a significant advance over traditional methods, and is particularly suited to single-target and small-object detection. Conclusion: The model has practical potential for autonomous driving competitions such as FSAC. Abstract: Autonomous driving technology is progressively transforming traditional car driving methods, marking a significant milestone in modern transportation. Object detection serves as a cornerstone of autonomous systems, playing a vital role in enhancing driving safety, enabling autonomous functionality, improving traffic efficiency, and facilitating effective emergency responses. However, current technologies such as radar for environmental perception, cameras for road perception, and vehicle sensor networks face notable challenges, including high costs, vulnerability to weather and lighting conditions, and limited resolution. To address these limitations, this paper presents an improved autonomous target detection network based on YOLOv8. By integrating structural reparameterization technology, a bidirectional pyramid structure network model, and a novel detection pipeline into the YOLOv8 framework, the proposed approach achieves highly efficient and precise detection of multi-scale, small, and remote objects. Experimental results demonstrate that the enhanced model can effectively detect both large and small objects with a detection accuracy of 65%, showcasing significant advancements over traditional methods. This improved model holds substantial potential for real-world applications and is well-suited for autonomous driving competitions, such as the Formula Student Autonomous China (FSAC), particularly excelling in scenarios involving single-target and small-object detection.

[20] ORL-LDM: Offline Reinforcement Learning Guided Latent Diffusion Model Super-Resolution Reconstruction

Shijie Lyu

Main category: cs.CV

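TL;DR: The paper proposes a reinforcement learning-based fine-tuning method for latent diffusion models (LDMs) applied to remote sensing image super-resolution, with significant improvements in image quality.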

Details

Motivation: Existing deep learning methods are limited in handling complex scenes and preserving image details, so a more effective solution is needed. Method: Builds a reinforcement learning environment (states, actions, rewards) and optimizes decision objectives with proximal policy optimization (PPO) during the reverse denoising process of the LDM. Result: On the RESISC45 dataset, PSNR increases by 3-4dB, SSIM improves by 0.08-0.11, and LPIPS decreases by 0.06-0.10, with the largest gains in structured and complex natural scenes. Conclusion: The method improves super-resolution quality and adaptability across scenes. Abstract: With the rapid advancement of remote sensing technology, super-resolution image reconstruction is of great research and practical significance. Existing deep learning methods have made progress but still face limitations in handling complex scenes and preserving image details. This paper proposes a reinforcement learning-based latent diffusion model (LDM) fine-tuning method for remote sensing image super-resolution. The method constructs a reinforcement learning environment with states, actions, and rewards, optimizing decision objectives through proximal policy optimization (PPO) during the reverse denoising process of the LDM model. Experiments on the RESISC45 dataset show significant improvements over the baseline model in PSNR, SSIM, and LPIPS, with PSNR increasing by 3-4dB, SSIM improving by 0.08-0.11, and LPIPS reducing by 0.06-0.10, particularly in structured and complex natural scenes. The results demonstrate the method's effectiveness in enhancing super-resolution quality and adaptability across scenes.
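
To put the reported 3-4dB PSNR gain in context, the sketch below computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE). The flat pixel lists and their values are made-up illustrations, not data from the paper, and `psnr` is a hypothetical helper name.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel lists."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up pixel values: the second estimate halves the per-pixel error.
reference = [10.0, 200.0, 128.0, 64.0]
baseline = [r + 10.0 for r in reference]
improved = [r + 5.0 for r in reference]
gain_db = psnr(reference, improved) - psnr(reference, baseline)
```

Halving the per-pixel error quarters the MSE and adds about 6 dB, so a gain of roughly 3 dB corresponds to about half the mean squared error of the baseline.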

[21] DeepSeqCoco: A Robust Mobile Friendly Deep Learning Model for Detection of Diseases in Cocos nucifera

Miit Daga,Dhriti Parikh,Swarna Priya Ramu

Main category: cs.CV

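TL;DR: DeepSeqCoco is a deep learning model for automatic identification of coconut tree diseases, reaching up to 99.5% accuracy (up to 5% higher than existing models) with markedly lower training and prediction times.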

Details

Motivation: Coconut tree diseases are a serious threat to agricultural yield, especially in developing countries, and conventional diagnosis is inefficient and hard to scale. Method: The deep learning model DeepSeqCoco is tested under several optimizer settings (SGD, Adam, and hybrid configurations) to balance accuracy, loss minimization, and computational cost. Result: The model reaches up to 99.5% accuracy, the hybrid SGD-Adam configuration gives the lowest validation loss of 2.81%, and training and prediction times drop by up to 18% and 85%, respectively. Conclusion: DeepSeqCoco shows the potential of an AI-based, efficient, and scalable disease monitoring system for precision agriculture. Abstract: Coconut tree diseases are a serious risk to agricultural yield, particularly in developing countries where conventional farming practices restrict early diagnosis and intervention. Current disease identification methods are manual, labor-intensive, and non-scalable. In response to these limitations, we propose DeepSeqCoco, a deep learning-based model for accurate and automatic disease identification from coconut tree images. The model was tested under various optimizer settings, such as SGD, Adam, and hybrid configurations, to identify the optimal balance between accuracy, minimization of loss, and computational cost. Results from experiments indicate that DeepSeqCoco can achieve as much as 99.5% accuracy (achieving up to 5% higher accuracy than existing models) with the hybrid SGD-Adam showing the lowest validation loss of 2.81%. It also shows a drop of up to 18% in training time and up to 85% in prediction time for input images. The results point out the promise of the model to improve precision agriculture through an AI-based, scalable, and efficient disease monitoring system.

[22] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

Bingda Tang,Boyang Zheng,Xichen Pan,Sayak Paul,Saining Xie

Main category: cs.CV

TL;DR: The paper presents a detailed exploration of the deep fusion of LLMs and DiTs for text-to-image synthesis, filling gaps left by prior work.

Details

Motivation: Prior studies focus mainly on overall system performance and lack detailed design comparisons and disclosed training recipes, which leaves the real potential of the approach uncertain. Method: An empirical study with controlled comparisons against established baselines, an analysis of key design choices, and a reproducible recipe for training at scale. Result: Provides meaningful data points and practical guidelines. Conclusion: The work serves as a reference for future research in multi-modal generation. Abstract: This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.

[23] Advances in Radiance Field for Dynamic Scene: From Neural Field to Gaussian Field

Jinlong Fan,Xuepu Zeng,Jing Zhang,Mingming Gong,Yuxiang Yang,Dacheng Tao

Main category: cs.CV

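TL;DR: This survey reviews recent advances in dynamic scene representation and reconstruction, analyzing over 200 papers based on neural radiance fields and 3D Gaussian splatting and organizing them under a unified framework.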

Details

Motivation: Dynamic scene reconstruction matters to both computer vision and graphics, but the underlying methods were first developed for static scenes, so the progress on dynamic scenes needs a systematic review. Method: Analyzes over 200 papers and categorizes and evaluates them by motion representation paradigm, reconstruction technique, auxiliary information integration, and regularization approach. Result: Organizes the methods under a unified representational framework, summarizes their strengths and weaknesses, and identifies open technical challenges. Conclusion: The survey offers researchers a comprehensive reference on dynamic scene reconstruction and points out future research directions. Abstract: Dynamic scene representation and reconstruction have undergone transformative advances in recent years, catalyzed by breakthroughs in neural radiance fields and 3D Gaussian splatting techniques. While initially developed for static environments, these methodologies have rapidly evolved to address the complexities inherent in 4D dynamic scenes through an expansive body of research. Coupled with innovations in differentiable volumetric rendering, these approaches have significantly enhanced the quality of motion representation and dynamic scene reconstruction, thereby garnering substantial attention from the computer vision and graphics communities. This survey presents a systematic analysis of over 200 papers focused on dynamic scene representation using radiance fields, spanning the spectrum from implicit neural representations to explicit Gaussian primitives. We categorize and evaluate these works through multiple critical lenses: motion representation paradigms, reconstruction techniques for varied scene dynamics, auxiliary information integration strategies, and regularization approaches that ensure temporal consistency and physical plausibility. We organize diverse methodological approaches under a unified representational framework, concluding with a critical examination of persistent challenges and promising research directions. By providing this comprehensive overview, we aim to establish a definitive reference for researchers entering this rapidly evolving field while offering experienced practitioners a systematic understanding of both conceptual principles and practical frontiers in dynamic scene reconstruction.

[24] PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

Ijazul Haq,Yingjie Zhang,Irfan Ali Khan

Main category: cs.CV

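TL;DR: The paper evaluates large multimodal models (LMMs) on OCR for the low-resource Pashto language and releases the synthetic dataset PsOCR. Gemini performs best overall, and Qwen-7B leads among open-source models.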

Details

Motivation: NLP for Pashto is hampered by its cursive script and a scarcity of datasets, so the PsOCR dataset was built to support model training and evaluation. Method: Creates PsOCR, a dataset of one million images covering varied fonts, colors, and layouts, and evaluates seven open-source and four closed-source LMMs on a 10K-image benchmark subset. Result: Gemini performs best among all models, and Qwen-7B stands out among open-source models. Conclusion: The study assesses the capabilities and limitations of LMMs for Pashto OCR and lays a foundation for research on similar scripts. Abstract: This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at https://github.com/zirak-ai/PashtoOCR.

[25] ToonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars

Rui-Yang Ju,Sheng-Yen Huang,Yi-Ping Hung

Main category: cs.CV

TL;DR: ToonifyGB is a two-stage framework that generates diverse stylized 3D head avatars from monocular video by combining an improved StyleGAN with Gaussian blendshapes.

Details

Motivation: To extend the Toonify framework to stylized 3D head avatars and to remove the need of a standard StyleGAN to crop aligned faces at a fixed resolution. Method: Stage 1 uses an improved StyleGAN to generate a stylized video; Stage 2 learns a stylized neutral head model and a set of expression blendshapes. Result: ToonifyGB is validated on two styles, Arcane and Pixar, and efficiently renders stylized avatars with arbitrary expressions. Conclusion: ToonifyGB produces stable, high-quality stylized 3D head animation and captures the high-frequency details of the video frames. Abstract: The introduction of 3D Gaussian blendshapes has enabled the real-time reconstruction of animatable head avatars from monocular video. Toonify, a StyleGAN-based framework, has become widely used for facial image stylization. To extend Toonify for synthesizing diverse stylized 3D head avatars using Gaussian blendshapes, we propose an efficient two-stage framework, ToonifyGB. In Stage 1 (stylized video generation), we employ an improved StyleGAN to generate the stylized video from the input video frames, which addresses the limitation of cropping aligned faces at a fixed resolution as preprocessing for normal StyleGAN. This process provides a more stable video, which enables Gaussian blendshapes to better capture the high-frequency details of the video frames, and efficiently generate high-quality animation in the next stage. In Stage 2 (Gaussian blendshapes synthesis), we learn a stylized neutral head model and a set of expression blendshapes from the generated video. By combining the neutral head model with expression blendshapes, ToonifyGB can efficiently render stylized avatars with arbitrary expressions. We validate the effectiveness of ToonifyGB on the benchmark dataset using two styles: Arcane and Pixar.

[26] MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

Yuncheng Guo,Xiaodong Gu

Main category: cs.CV

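TL;DR: MMRL and MMRL++ use a shared, modality-agnostic representation space and optimized representation tokens to mitigate overfitting when adapting pre-trained vision-language models with few-shot data, improving cross-modal interaction and generalization.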

Details

Motivation: Large-scale pre-trained vision-language models tend to overfit when adapted with few-shot data, which limits their generalization. Method: MMRL introduces a shared, modality-agnostic representation space and optimizes representation tokens; MMRL++ further reduces trainable parameters and strengthens intra-modal interactions. Result: Outperforms state-of-the-art methods on 15 datasets while balancing task adaptation and generalization. Conclusion: MMRL and MMRL++ improve both adaptability and generalization in few-shot learning. Abstract: Large-scale pre-trained Vision-Language Models (VLMs) have significantly advanced transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, undermining their ability to generalize to new tasks. To address this, we propose Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, modality-agnostic representation space. MMRL generates space tokens projected into both text and image encoders as representation tokens, enabling more effective cross-modal interactions. Unlike prior methods that mainly optimize class token features, MMRL inserts representation tokens into higher encoder layers--where task-specific features are more prominent--while preserving general knowledge in the lower layers. During training, both class and representation features are jointly optimized: a trainable projection layer is applied to representation tokens for task adaptation, while the projection layer for class token remains frozen to retain pre-trained knowledge. To further promote generalization, we introduce a regularization term aligning class and text features with the frozen VLM's zero-shot features. At inference, a decoupling strategy uses both class and representation features for base tasks, but only class features for novel tasks due to their stronger generalization. Building upon this, we propose MMRL++, a parameter-efficient and interaction-aware extension that significantly reduces trainable parameters and enhances intra-modal interactions--particularly across the layers of representation tokens--allowing gradient sharing and instance-specific information to propagate more effectively through the network. Extensive experiments on 15 datasets demonstrate that MMRL and MMRL++ consistently outperform state-of-the-art methods, achieving a strong balance between task-specific adaptation and generalization.

[27] Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

Yangfu Li,Hongjian Zhan,Tianyi Chen,Qi Liu,Yue Lu

Main category: cs.CV

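TL;DR: MoB is a multi-objective balanced covering method that explicitly trades off the two objectives of visual token pruning, improving both performance and efficiency.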

Details

Motivation: Existing visual token pruning methods use static strategies that ignore how the relative importance of the objectives varies across tasks, which leads to inconsistent performance. Method: Derives an error bound based on the Hausdorff distance, uses ε-covering theory to expose the intrinsic trade-off between the objectives, and proposes MoB, which recasts pruning as a bi-objective covering problem. Result: MoB preserves 96.4% of performance on LLaVA-1.5-7B with only 11.1% of the visual tokens, accelerates LLaVA-Next-7B by 1.3-1.5x, and carries over to other models and tasks. Conclusion: By balancing the objectives, MoB offers an efficient and scalable solution for visual token pruning. Abstract: Existing visual token pruning methods target prompt alignment and visual preservation with static strategies, overlooking the varying relative importance of these objectives across tasks, which leads to inconsistent performance. To address this, we derive the first closed-form error bound for visual token pruning based on the Hausdorff distance, uniformly characterizing the contributions of both objectives. Moreover, leveraging $\epsilon$-covering theory, we reveal an intrinsic trade-off between these objectives and quantify their optimal attainment levels under a fixed budget. To practically handle this trade-off, we propose Multi-Objective Balanced Covering (MoB), which reformulates visual token pruning as a bi-objective covering problem. In this framework, the attainment trade-off reduces to budget allocation via greedy radius trading. MoB offers a provable performance bound and linear scalability with respect to the number of input visual tokens, enabling adaptation to challenging pruning scenarios. Extensive experiments show that MoB preserves 96.4% of performance for LLaVA-1.5-7B using only 11.1% of the original visual tokens and accelerates LLaVA-Next-7B by 1.3-1.5$\times$ with negligible performance loss. Additionally, evaluations on Qwen2-VL and Video-LLaVA confirm that MoB integrates seamlessly into advanced MLLMs and diverse vision-language tasks.
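
The covering view of token pruning can be illustrated with a generic sketch. This is not MoB's greedy radius trading, which the abstract does not spell out; it is the textbook farthest-point (k-center) heuristic, swapped in for illustration, with a covering radius equal to the directed Hausdorff distance from all tokens to the kept subset. The 2D points stand in for token embeddings and are made up.

```python
import math


def covering_radius(points, selected):
    """Directed Hausdorff distance from all points to the kept subset."""
    return max(min(math.dist(p, points[j]) for j in selected) for p in points)


def greedy_cover(points, budget):
    """Pick `budget` indices by farthest-point (k-center) selection."""
    selected = [0]  # start from the first point (arbitrary choice)
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(selected) < budget:
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(idx)
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return selected


# Made-up 2D stand-ins for visual token embeddings.
tokens = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.0, 0.1), (0.0, 1.0)]
kept = greedy_cover(tokens, budget=3)
radius = covering_radius(tokens, kept)
```

A smaller covering radius means every pruned token has a close representative among the kept ones. In a bi-objective setting, the budget would have to be split between covering the prompt-relevant tokens and covering the visual content.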

[28] IMITATE: Image Registration with Context for unknown time frame recovery

Ziad Kheil,Lucas Robinet,Laurent Risser,Soleakhena Ken

Main category: cs.CV

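TL;DR: Proposes a new image registration method based on a conditional U-Net architecture for estimating images under unknown conditions, applied to tumor motion estimation from 4D-CT scans in radiotherapy.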

Details

Motivation: To remove the reconstruction artefacts that standard methods produce in 4D-CT scans because of irregular breathing, hysteresis, and poor correlation between the breathing signal and internal motion. Method: Uses a conditional U-Net architecture that needs no fixed image and fully exploits the conditional information for registration. Result: Achieves artefact-free reconstruction at real-time latencies on clinical 4D-CT data. Conclusion: The method performs well under difficult conditions such as irregular breathing, and the code is open source. Abstract: In this paper, we formulate a novel image registration formalism dedicated to the estimation of unknown condition-related images, based on two or more known images and their associated conditions. We show how to practically model this formalism by using a new conditional U-Net architecture, which fully takes into account the conditional information and does not need any fixed image. Our formalism is then applied to image moving tumors for radiotherapy treatment at different breathing amplitudes using 4D-CT (3D+t) scans in thoracoabdominal regions. This driving application is particularly complex as it requires stitching a collection of sequential 2D slices into several 3D volumes at different organ positions. Movement interpolation with standard methods then generates well-known reconstruction artefacts in the assembled volumes due to irregular patient breathing, hysteresis and poor correlation of breathing signal to internal motion. Results obtained on 4D-CT clinical data showcase artefact-free volumes achieved at real-time latencies. The code is publicly available at https://github.com/Kheil-Z/IMITATE.

[29] Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization

Yikang Wei

Main category: cs.CV

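TL;DR: Proposes MCSAD, a multi-source collaborative style augmentation and domain-invariant learning method for federated domain generalization, which broadens the style space and learns domain-invariant features to generalize better to unseen target domains.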

Details

Motivation: Under data decentralization, existing style augmentation methods explore only a limited style space and cannot make full use of multi-source domain information. Method: A multi-source collaborative style augmentation module generates data in a broader style space, and domain-invariant learning is performed through cross-domain feature alignment and class-relation ensemble distillation. Result: Significantly outperforms state-of-the-art federated domain generalization methods on multiple domain generalization datasets. Conclusion: Through collaborative style augmentation and domain-invariant learning, MCSAD improves generalization to unseen target domains. Abstract: Federated domain generalization aims to learn a generalizable model from multiple decentralized source domains for deployment on an unseen target domain. Style augmentation methods have achieved great progress on domain generalization. However, the existing style augmentation methods either explore the data styles within an isolated source domain or interpolate the style information across existing source domains under the data decentralization scenario, which leads to a limited style space. To address this issue, we propose a Multi-source Collaborative Style Augmentation and Domain-invariant learning method (MCSAD) for federated domain generalization. Specifically, we propose a multi-source collaborative style augmentation module to generate data in the broader style space. Furthermore, we conduct domain-invariant learning between the original data and augmented data by cross-domain feature alignment within the same class and class-relation ensemble distillation between different classes to learn a domain-invariant model. By alternately conducting collaborative style augmentation and domain-invariant learning, the model can generalize well on an unseen target domain. Extensive experiments on multiple domain generalization datasets indicate that our method significantly outperforms the state-of-the-art federated domain generalization methods.

[30] Modeling Saliency Dataset Bias

Matthias Kümmerer,Harneet Khanuja,Matthias Bethge

Main category: cs.CV

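TL;DR: The paper proposes a new architecture that addresses the performance drop in cross-dataset saliency prediction with a small number of dataset-specific parameters, substantially improving generalization.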

Details

Motivation: Existing saliency prediction models generalize poorly across datasets, with a marked performance drop, so dataset bias needs to be addressed. Method: Extends a mostly dataset-agnostic encoder-decoder structure with fewer than 20 dataset-specific parameters that govern mechanisms such as multi-scale structure, center bias, and fixation spread. Result: The model sets a new state of the art on the MIT/Tuebingen Saliency Benchmark; adapting only the dataset-specific parameters closes more than 75% of the generalization gap, with much of the improvement achieved from as few as 50 samples. Conclusion: Beyond improving cross-dataset performance, the model reveals complex multi-scale effects in spatial saliency. Abstract: Recent advances in image-based saliency prediction are approaching gold standard performance levels on existing benchmarks. Despite this success, we show that predicting fixations across multiple saliency datasets remains challenging due to dataset bias. We find a significant performance drop (around 40%) when models trained on one dataset are applied to another. Surprisingly, increasing dataset diversity does not resolve this inter-dataset gap, with close to 60% attributed to dataset-specific biases. To address this remaining generalization gap, we propose a novel architecture extending a mostly dataset-agnostic encoder-decoder structure with fewer than 20 dataset-specific parameters that govern interpretable mechanisms such as multi-scale structure, center bias, and fixation spread. Adapting only these parameters to new data accounts for more than 75% of the generalization gap, with a large fraction of the improvement achieved with as few as 50 samples. Our model sets a new state-of-the-art on all three datasets of the MIT/Tuebingen Saliency Benchmark (MIT300, CAT2000, and COCO-Freeview), even when purely generalizing from unrelated datasets, but with a substantial boost when adapting to the respective training datasets. The model also provides valuable insights into spatial saliency properties, revealing complex multi-scale effects that combine both absolute and relative sizes.

[31] VolE: A Point-cloud Framework for Food 3D Reconstruction and Volume Estimation

Umair Haroon,Ahmad AlMughrabi,Thanasis Zoumpekas,Ricardo Marques,Petia Radeva

Main category: cs.CV

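TL;DR: VolE is a mobile device-driven 3D reconstruction framework for accurate food volume estimation that needs neither a reference object nor depth information and outperforms existing methods.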

Details

Motivation: Current food volume estimation methods are limited by mononuclear data, special-purpose hardware, or reliance on a reference object; VolE aims to remove these limitations. Method: Uses AR-capable mobile devices to capture images and camera locations, and generates food masks through food video segmentation, giving reference- and depth-free 3D reconstruction. Result: VolE outperforms existing techniques across multiple datasets, with a mean absolute percentage error (MAPE) of 2.22%. Conclusion: VolE performs strongly in food volume estimation and offers an efficient solution for medical nutrition management and health monitoring. Abstract: Accurate food volume estimation is crucial for medical nutrition management and health monitoring applications, but current food volume estimation methods are often limited by mononuclear data, leveraging single-purpose hardware such as 3D scanners, gathering sensor-oriented information such as depth information, or relying on camera calibration using a reference object. In this paper, we present VolE, a novel framework that leverages mobile device-driven 3D reconstruction to estimate food volume. VolE captures images and camera locations in free motion to generate precise 3D models, thanks to AR-capable mobile devices. To achieve real-world measurement, VolE is a reference- and depth-free framework that leverages food video segmentation for food mask generation. We also introduce a new food dataset encompassing the challenging scenarios absent in the previous benchmarks. Our experiments demonstrate that VolE outperforms the existing volume estimation techniques across multiple datasets by achieving 2.22% MAPE, highlighting its superior performance in food volume estimation.
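
For readers unfamiliar with the metric, the sketch below shows how a mean absolute percentage error (MAPE) such as the reported 2.22% is computed. The volumes are made-up numbers in millilitres, not results from the paper, and `mape` is a hypothetical helper name.

```python
def mape(true_volumes, predicted_volumes):
    """Mean absolute percentage error, in percent.

    Assumes equal-length, non-empty inputs and non-zero true volumes.
    """
    if not true_volumes or len(true_volumes) != len(predicted_volumes):
        raise ValueError("inputs must be non-empty and of equal length")
    relative_errors = [abs((t - p) / t)
                       for t, p in zip(true_volumes, predicted_volumes)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Made-up ground-truth and estimated food volumes (mL).
true_ml = [100.0, 200.0, 400.0]
estimated_ml = [98.0, 205.0, 392.0]
error_percent = mape(true_ml, estimated_ml)
```

Because each error is divided by the true volume, small and large food items contribute on the same relative scale.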

[32] Data-Agnostic Augmentations for Unknown Variations: Out-of-Distribution Generalisation in MRI Segmentation

Puru Vaish,Felix Meister,Tobias Heimann,Christoph Brune,Jelmer M. Wolterink

Main category: cs.CV

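TL;DR: The paper examines the performance degradation of medical image segmentation models in real clinical settings and shows that MixUp and Auxiliary Fourier Augmentation improve generalization and robustness.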

Details

Motivation: Medical image segmentation models degrade in real clinical environments, and traditional data augmentation cannot cope with diverse distribution shifts. Method: Systematically evaluates MixUp and Auxiliary Fourier Augmentation and analyzes how well they mitigate distribution shifts. Result: Both methods significantly improve out-of-distribution generalization and robustness in cardiac cine MRI and prostate MRI segmentation and improve the learned feature representations. Conclusion: Integrating MixUp and Auxiliary Fourier Augmentation into nnU-Net training pipelines gives a simple and effective way to make medical segmentation models more reliable. Abstract: Medical image segmentation models are often trained on curated datasets, leading to performance degradation when deployed in real-world clinical settings due to mismatches between training and test distributions. While data augmentation techniques are widely used to address these challenges, traditional visually consistent augmentation strategies lack the robustness needed for diverse real-world scenarios. In this work, we systematically evaluate alternative augmentation strategies, focusing on MixUp and Auxiliary Fourier Augmentation. These methods mitigate the effects of multiple variations without explicitly targeting specific sources of distribution shifts. We demonstrate how these techniques significantly improve out-of-distribution generalization and robustness to imaging variations across a wide range of transformations in cardiac cine MRI and prostate MRI segmentation. We quantitatively find that these augmentation methods enhance learned feature representations by promoting separability and compactness. Additionally, we highlight how their integration into nnU-Net training pipelines provides an easy-to-implement, effective solution for enhancing the reliability of medical segmentation models in real-world applications.
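
MixUp, one of the two augmentations evaluated above, blends two training samples and their labels with a weight drawn from a Beta distribution. The sketch below shows the standard formulation on flat lists. The value `alpha=0.4`, the flattened one-hot labels, and the function name are assumptions; the paper's nnU-Net integration and the Auxiliary Fourier Augmentation are not shown.

```python
import random


def mixup(image_a, label_a, image_b, label_b, alpha=0.4, rng=random):
    """Blend two samples: x = lam * x_a + (1 - lam) * x_b, same for labels."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label, lam


# Made-up two-pixel "images" with one-hot labels for a single voxel.
rng = random.Random(0)  # seeded so the example is repeatable
mixed_image, mixed_label, lam = mixup(
    [0.0, 1.0], [1.0, 0.0],
    [1.0, 0.0], [0.0, 1.0],
    rng=rng,
)
```

The mixed label stays a valid probability vector because the two weights sum to one, so a soft-label loss such as cross-entropy can be applied to the blended target directly.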

[33] On the Interplay of Human-AI Alignment, Fairness, and Performance Trade-offs in Medical Imaging

Haozhe Luo,Ziyu Zhou,Zixin Shu,Aurélie Pahud de Mortanges,Robert Berke,Mauricio Reyes

Main category: cs.CV

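TL;DR: The paper studies how incorporating human insights through Human-AI alignment reduces fairness gaps and improves generalization in medical imaging, while noting that excessive alignment can introduce performance trade-offs.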

Details

Motivation: To address fairness gaps in medical imaging AI and explore the potential of Human-AI alignment. Method: Systematically studies Human-AI alignment and its effect on fairness and generalization, and argues for calibrated alignment strategies. Result: Incorporating human insights consistently reduces fairness gaps and improves out-of-domain generalization, but excessive alignment can reduce performance. Conclusion: Human-AI alignment is a promising approach for building fair, robust, and generalizable medical AI systems, provided expert guidance is balanced against automated efficiency. Abstract: Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Our code is available at https://github.com/Roypic/Aligner.

[34] MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

Yanbo Ding

Main category: cs.CV

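TL;DR: MTVCrafter is a human image animation framework that models 4D motion sequences directly, using 4DMoT and MV-DiT to improve the flexibility and expressiveness of animation.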

Details

Motivation: Existing methods rely on 2D-rendered pose images, which limits generalization and discards 3D information, so 3D motion sequences should be modeled directly. Method: Introduces 4DMoT to quantize 3D motion sequences into 4D motion tokens, and MV-DiT to use these tokens for animation generation. Result: MTVCrafter reaches an FID-VID of 6.98, surpassing the second-best method by 65%, and generalizes to diverse open-world scenarios. Conclusion: MTVCrafter opens a new direction for pose-guided human video generation and improves animation quality and flexibility. Abstract: Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for human image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatio-temporal cues and avoid strict pixel-level alignment between pose image and character, enabling more flexible and disentangled control. Then, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings, MV-DiT can effectively leverage motion tokens as 4D compact yet expressive context for human image animation in the complex 3D world. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided human video generation. Experiments show that our MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, surpassing the second-best by 65%. Powered by robust motion tokens, MTVCrafter also generalizes well to diverse open-world characters (single/multiple, full/half-body) across various styles and scenarios. Our video demos and code are provided in the supplementary material and at this anonymous GitHub link: https://anonymous.4open.science/r/MTVCrafter-1B13.

[35] ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization

Wenhao Shen,Wanqi Yin,Xiaofeng Yang,Cheng Chen,Chaoyue Song,Zhongang Cai,Lei Yang,Hao Wang,Guosheng Lin

Main category: cs.CV

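TL;DR: ADHMR aligns a diffusion-based model for single-image human mesh recovery through preference optimization, using HMR-Scorer to assess prediction quality and improve the model.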

Details

Motivation: Single-image human mesh recovery suffers from depth ambiguity and occlusions, and existing probabilistic methods are weak in 2D alignment and in robustness to in-the-wild images. Method: Trains HMR-Scorer to assess prediction quality, builds a preference dataset with it, and finetunes the base model with direct preference optimization. Result: ADHMR outperforms state-of-the-art methods in experiments, and HMR-Scorer also improves other HMR models through data cleaning. Conclusion: Preference optimization addresses the shortcomings of existing methods and offers a new direction for human mesh recovery. Abstract: Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns a Diffusion-based HMR model in a preference optimization manner. First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to finetune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods. Code is available at: https://github.com/shenwenhao01/ADHMR.
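
The winner/loser pairs described in the abstract feed a direct preference optimization (DPO) objective. The sketch below implements the generic DPO loss for a single pair from scalar log-probabilities. How ADHMR obtains or approximates these quantities for a diffusion model is not stated in the abstract, so the inputs, the `beta` value, and the function name are assumptions.

```python
import math


def dpo_loss(policy_logp_winner, policy_logp_loser,
             ref_logp_winner, ref_logp_loser, beta=0.1):
    """Generic DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_logp_winner - ref_logp_winner)
                     - (policy_logp_loser - ref_logp_loser))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities: the policy favors the winner more than the
# frozen reference model does, so the loss falls below log(2).
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0)
```

The loss equals log(2) when the policy and the reference agree, and it shrinks as the policy raises the winner's likelihood relative to the loser's, which is what pushes the finetuned model toward predictions the scorer prefers.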

[36] Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot

Hao Lu,Jiaqi Tang,Jiyao Wang,Yunfan LU,Xu Cao,Qingyong Hu,Yin Wang,Yuting Zhang,Tianxin Xie,Yunpeng Zhang,Yong Chen,Jiayu. Gao,Bin Huang,Dengbo He,Shuiguang Deng,Hao Chen,Ying-Cong Chen

Main category: cs.CV

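TL;DR: The paper presents SAGE DeeR, an intelligent driving cockpit agent with super-alignment, generalist, and self-eliciting capabilities, evaluated on a large-scale benchmark.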

Details

Motivation: The intelligent driving cockpit must meet users' comfort, interaction, and safety needs, which calls for a generalist agent that adapts to different user preferences and scenarios. Method: Builds the SAGE DeeR agent, which provides super alignment (personalized reactions), generalist ability (multimodal input understanding), and self-eliciting (extracting implicit chains of thought in language space), and establishes a large-scale benchmark. Result: SAGE DeeR reacts according to user preferences and scenarios, understands multimodal inputs, and improves its abilities through self-eliciting. The benchmark validates its perceptual decision-making ability and super-alignment accuracy. Conclusion: SAGE DeeR offers an efficient, personalized solution for the intelligent driving cockpit with broad application potential. Abstract: The intelligent driving cockpit, an important part of intelligent driving, needs to match different users' comfort, interaction, and safety needs. This paper aims to build a Super-Aligned and GEneralist DRiving agent, SAGE DeeR. Sage Deer achieves three highlights: (1) Super alignment: It achieves different reactions according to different people's preferences and biases. (2) Generalist: It can understand the multi-view and multi-mode inputs to reason about the user's physiological indicators, facial emotions, hand movements, body movements, driving scenarios, and behavioral decisions. (3) Self-Eliciting: It can elicit implicit thought chains in the language space to further increase generalist and super-aligned abilities. Besides, we collected multiple data sets and built a large-scale benchmark. This benchmark measures the deer's perceptual decision-making ability and the super alignment's accuracy.

[37] Inferring Driving Maps by Deep Learning-based Trail Map Extraction

Michael Hubbertz,Pascal Colling,Qi Han,Tobias Meisen

Main category: cs.CV

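TL;DR: The paper proposes a novel offline mapping method that integrates trails (informal routes used by drivers) and builds a global map with transformer-based deep learning models, outperforming existing online mapping methods.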

Details

Motivation: High-definition (HD) maps are a key component of autonomous driving systems, but traditional methods depend on manual labeling, and online mapping faces problems such as temporal consistency and sensor occlusion. Method: Aggregates trail data from the ego vehicle and other traffic participants and uses transformer models to build a global map that supports continuous updates and is sensor-agnostic. Result: Validated on two benchmark datasets, the method outperforms existing online mapping methods in generalization and performance. Conclusion: The method provides an efficient, robust mapping solution for autonomous driving systems. Abstract: High-definition (HD) maps offer extensive and accurate environmental information about the driving scene, making them a crucial and essential element for planning within autonomous driving systems. To avoid extensive efforts from manual labeling, methods for automating the map creation have emerged. Recent trends have moved from offline mapping to online mapping, ensuring availability and actuality of the utilized maps. While the performance has increased in recent years, online mapping still faces challenges regarding temporal consistency, sensor occlusion, runtime, and generalization. We propose a novel offline mapping approach that integrates trails - informal routes used by drivers - into the map creation process. Our method aggregates trail data from the ego vehicle and other traffic participants to construct a comprehensive global map using transformer-based deep learning models. Unlike traditional offline mapping, our approach enables continuous updates while remaining sensor-agnostic, facilitating efficient data transfer. Our method demonstrates superior performance compared to state-of-the-art online mapping approaches, achieving improved generalization to previously unseen environments and sensor configurations. We validate our approach on two benchmark datasets, highlighting its robustness and applicability in autonomous driving systems.

[38] HandReader: Advanced Techniques for Efficient Fingerspelling Recognition

Pavel Korotaev,Petr Surovtsev,Alexander Kapitanov,Karina Kvanchiani,Aleksandr Nagaev

Main category: cs.CV

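TL;DR: This paper presents HandReader, a set of three architectures (RGB, KP, RGB+KP) for fingerspelling recognition that achieves state-of-the-art results on several datasets.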

Details

Motivation: Fingerspelling is an important component of sign language, but existing methods leave room for improvement in the temporal processing of video. Method: HandReader$_{RGB}$ uses the TSAM module to process RGB features; HandReader$_{KP}$ processes keypoints with the TPE encoder; HandReader$_{RGB+KP}$ combines both modalities. Result: Strong performance on the ChicagoFSWild, ChicagoFSWild+, and Znaki datasets. Conclusion: The HandReader models are efficient and accurate, and the Znaki dataset and pre-trained models are publicly released. Abstract: Fingerspelling is a significant component of Sign Language (SL), allowing the interpretation of proper names, characterized by fast hand movements during signing. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE), which operates on keypoints as tensors. This composition of keypoints in a batch allows the encoder to pass them through 2D and 3D convolution layers, utilizing temporal and spatial information and accumulating keypoint coordinates. We also introduce HandReader$_{RGB+KP}$, an architecture with a joint encoder that benefits from both RGB and keypoint modalities. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets. Moreover, the models demonstrate high performance on the first open dataset for Russian fingerspelling, Znaki, presented in this paper. The Znaki dataset and HandReader pre-trained models are publicly available.

[39] MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting

Mengqiu Xu,Kaixin Chen,Heng Guo,Yixiang Huang,Ming Wu,Zhenwei Shi,Chuang Zhang,Jun Guo

Main category: cs.CV

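TL;DR: The paper introduces MFogHub, the first multi-regional, multi-satellite marine fog dataset, intended to improve the generalization of detection and forecasting models.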

Details

Motivation: To overcome the single-region or single-satellite limitation of existing marine fog datasets and to advance scientific research and practical monitoring of global marine fog dynamics. Method: Integrates annotated data from 15 coastal fog-prone regions and six geostationary satellites to build MFogHub, a dataset of more than 68,000 high-resolution samples. Result: Experiments show that MFogHub reveals generalization fluctuations caused by regional and satellite discrepancies and provides a resource for developing targeted fog prediction techniques. Conclusion: MFogHub helps advance the scientific understanding and practical application of global marine fog dynamics; the dataset and code are open source. Abstract: Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce MFogHub, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancy, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are at https://github.com/kaka0910/MFogHub.

[40] MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning

Yue Wang,Shuai Xu,Xuelin Zhu,Yicong Li

Main category: cs.CV

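TL;DR: Proposes a Multi-Stage Cross-modal Interaction (MSCI) model that uses intermediate-layer information from CLIP's visual encoder to better capture fine-grained local features.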

Details

Motivation: Existing methods rely mainly on CLIP's cross-modal alignment and overlook its limitations in capturing fine-grained local features. Method: Designs two self-adaptive aggregators that extract local information from low-level visual features and integrate global information from high-level visual features, respectively, and incorporates them progressively into textual representations through a stage-by-stage interaction mechanism. Result: Effectiveness and superiority are validated on three widely used datasets. Conclusion: By dynamically adjusting attention weights between global and local visual information, MSCI significantly improves the model's perception of fine-grained local visual information. Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies basically rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architecture and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP's visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. This key information is progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model's perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at https://github.com/ltpwy/MSCI.

[41] StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

Daniel A. P. Oliveira,David Martins de Matos

Main category: cs.CV

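TL;DR: The paper proposes the StoryReasoning dataset and the Qwen Storyteller model, which reduce referential hallucinations in visual storytelling through cross-frame entity re-identification and chain-of-thought reasoning.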

Details

Motivation: Visual storytelling systems often produce referential hallucinations because character identity is inconsistent across frames and actions are linked to the wrong subjects; grounding entities on visual elements can address this. Method: Builds the StoryReasoning dataset of 4,178 stories, using visual similarity and face recognition for cross-frame entity re-identification, combined with chain-of-thought reasoning and a grounding scheme. Result: Qwen Storyteller, obtained by fine-tuning Qwen2.5-VL 7B, reduces referential hallucinations from 4.06 to 3.56 per story on average (-12.3%). Conclusion: Through structured representations and cross-frame entity grounding, StoryReasoning and Qwen Storyteller effectively reduce referential hallucinations in visual storytelling. Abstract: Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when compared to a non-fine-tuned model.
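
The -12.3% figure in the StoryReasoning abstract is a relative change in the average number of hallucinations per story. A quick check of that arithmetic, using a helper name (`relative_change`) that is introduced here only for illustration:

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Average hallucinations per story, as reported in the abstract.
reduction = relative_change(4.06, 3.56)
print(f"{reduction:.1f}%")
```

The absolute drop is 0.50 hallucinations per story, which is about 12.3% of the non-fine-tuned baseline of 4.06.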

[42] MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models

Guillaume Balezo,Roger Trullo,Albert Pla Planas,Etienne Decenciere,Thomas Walter

Main category: cs.CV

TL;DR: MIPHEI is a U-Net- and ViT-based model that predicts mIF signals from H&E-stained images, enabling low-cost cell-type classification.

Details

Motivation: mIF has not seen wide clinical adoption because of cost and logistical constraints; the work predicts mIF signals from H&E images instead. Method: Uses a U-Net architecture with ViT encoders, trained on the ORION dataset and validated on two independent datasets. Result: Performs well on several markers, e.g. Pan-CK (F1 = 0.88), substantially outperforming the baseline. Conclusion: MIPHEI offers a feasible approach to cell-type analysis of large-scale H&E datasets and helps study the relationship between spatial cellular organization and patient outcomes. Abstract: Histopathological analysis is a cornerstone of cancer diagnosis, with Hematoxylin and Eosin (H&E) staining routinely acquired for every patient to visualize cell morphology and tissue architecture. On the other hand, multiplex immunofluorescence (mIF) enables more precise cell type identification via proteomic markers, but has yet to achieve widespread clinical adoption due to cost and logistical constraints. To bridge this gap, we introduce MIPHEI (Multiplex Immunofluorescence Prediction from H&E), a U-Net-inspired architecture that integrates state-of-the-art ViT foundation models as encoders to predict mIF signals from H&E images. MIPHEI targets a comprehensive panel of markers spanning nuclear content, immune lineages (T cells, B cells, myeloid), epithelium, stroma, vasculature, and proliferation. We train our model using the publicly available ORION dataset of restained H&E and mIF images from colorectal cancer tissue, and validate it on two independent datasets. MIPHEI achieves accurate cell-type classification from H&E alone, with F1 scores of 0.88 for Pan-CK, 0.57 for CD3e, 0.56 for SMA, 0.36 for CD68, and 0.30 for CD20, substantially outperforming both a state-of-the-art baseline and a random classifier for most markers. Our results indicate that our model effectively captures the complex relationships between nuclear morphologies in their tissue context, as visible in H&E images and molecular markers defining specific cell types. MIPHEI offers a promising step toward enabling cell-type-aware analysis of large-scale H&E datasets, in view of uncovering relationships between spatial cellular organization and patient outcomes.

[43] A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability

Jie Zhu,Jirong Zha,Ding Li,Leye Wang

Main category: cs.CV

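TL;DR: The paper proposes PartCrop, a unified membership inference method for attacking visual self-supervised models, and validates it across training protocols and structures. It also evaluates defenses and proposes an improved version, PartCrop-v2.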

Details

Motivation: Self-supervised learning shows promise in exploiting unlabeled data but raises privacy concerns. The paper studies membership inference on self-supervised models in a black-box setting where the training method and details are unknown. Method: Proposes PartCrop, which crops parts of objects in an image and queries responses in representation space, exploiting the model's part-aware capability on training data. Result: Experiments verify the effectiveness and generalization of PartCrop and evaluate the effectiveness of defenses. The improved PartCrop-v2 increases scalability through structural improvements. Conclusion: PartCrop is an effective membership inference method applicable to different self-supervised models, and the defenses are also effective against the attack. PartCrop-v2 further improves practicality. Abstract: Self-supervised learning shows promise in harnessing extensive unlabeled data, but it also confronts significant privacy concerns, especially in vision. In this paper, we perform membership inference on visual self-supervised models in a more realistic setting: the self-supervised training method and details are unknown to the adversary, who usually faces a black-box system in practice. In this setting, considering that a self-supervised model could be trained by completely different self-supervised paradigms, e.g., masked image modeling and contrastive learning, with complex training details, we propose a unified membership inference method called PartCrop. It is motivated by the shared part-aware capability among models and stronger part response on the training data. Specifically, PartCrop crops parts of objects in an image to query responses within the image in representation space. We conduct extensive attacks on self-supervised models with different training protocols and structures using three widely used image datasets. The results verify the effectiveness and generalization of PartCrop. Moreover, to defend against PartCrop, we evaluate two common approaches, i.e., early stop and differential privacy, and propose a tailored method called shrinking crop scale range. The defense experiments indicate that all of them are effective. Finally, besides prototype testing on toy visual encoders and small-scale image datasets, we quantitatively study the impacts of scaling from both data and model aspects in a realistic scenario and propose a scalable PartCrop-v2 by introducing two structural improvements to PartCrop. Our code is at https://github.com/JiePKU/PartCrop.

[44] SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity

Shihao Zou,Qingfeng Li,Wei Ji,Jingjing Li,Yongkui Yang,Guoqi Li,Chao Dong

Main category: cs.CV

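TL;DR: SpikeVideoFormer is an efficient spike-driven video Transformer that, with linear temporal complexity and an optimized space-time attention design, performs well on video tasks while substantially improving energy efficiency.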

Details

Motivation: Existing SNN-based Transformers focus mainly on single-image tasks and do not fully exploit SNN efficiency in video tasks. Method: Designs spike-driven Hamming attention (SDHA) and optimizes the space-time attention design while keeping linear temporal complexity. Result: Achieves SOTA performance among SNN approaches on video classification, pose tracking, and semantic segmentation, with efficiency significantly better than ANN methods. Conclusion: SpikeVideoFormer shows the potential of SNNs in video tasks, combining high performance with high energy efficiency. Abstract: Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks. https://github.com/JimmyZou/SpikeVideoFormer
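
To give some intuition for why a Hamming-based score may suit binary spike vectors better than a dot product, the sketch below computes a normalized Hamming similarity between query and key spike trains. It is an assumption-laden illustration, not the paper's SDHA: the abstract does not specify the formulation, and the names `hamming_similarity` and `hamming_scores` are hypothetical.

```python
def hamming_similarity(q, k):
    """Fraction of positions where two equal-length binary vectors agree."""
    if len(q) != len(k) or not q:
        raise ValueError("vectors must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(q, k) if a == b)
    return matches / len(q)

def hamming_scores(queries, keys):
    """Pairwise similarity matrix: one row per query, one column per key."""
    return [[hamming_similarity(q, k) for k in keys] for q in queries]

def dot(q, k):
    """Plain dot product, shown for contrast with the Hamming score."""
    return sum(a * b for a, b in zip(q, k))
```

For two identical all-zero spike vectors, `dot` returns 0 even though the vectors match exactly, whereas `hamming_similarity` returns 1.0. A dot product on {0, 1} inputs only counts jointly active positions, so it cannot express this kind of agreement.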

[45] Learned Lightweight Smartphone ISP with Unpaired Data

Andrei Arhire,Radu Timofte

Main category: cs.CV

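TL;DR: Proposes a training method for a learned smartphone ISP that needs no aligned data, using adversarial training with multiple discriminators to achieve high-quality RAW-to-RGB conversion.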

Details

Motivation: Acquiring aligned RAW-RGB data for a learned ISP is costly and difficult, so a method that does not require paired data is needed. Method: Uses an unpaired training strategy with a multi-term loss and adversarial training, using feature maps from pre-trained networks to preserve content structure. Result: Performs well on the Zurich RAW to RGB and Fujifilm UltraISP datasets, with evaluation metrics indicating high fidelity. Conclusion: Unpaired learning shows potential for smartphone ISPs, suits mobile devices, and approaches the quality of paired methods. Abstract: The Image Signal Processor (ISP) is a fundamental component in modern smartphone cameras responsible for conversion of RAW sensor image data to RGB images with a strong focus on perceptual quality. Recent work highlights the potential of deep learning approaches and their ability to capture details with a quality increasingly close to that of professional cameras. A difficult and costly step when developing a learned ISP is the acquisition of pixel-wise aligned paired data that maps the raw captured by a smartphone camera sensor to high-quality reference images. In this work, we address this challenge by proposing a novel training method for a learnable ISP that eliminates the need for direct correspondences between raw images and ground-truth data with matching content. Our unpaired approach employs a multi-term loss function guided by adversarial training with multiple discriminators processing feature maps from pre-trained networks to maintain content structure while learning color and texture characteristics from the target RGB dataset. Using lightweight neural network architectures suitable for mobile devices as backbones, we evaluated our method on the Zurich RAW to RGB and Fujifilm UltraISP datasets. Compared to paired training methods, our unpaired learning strategy shows strong potential and achieves high fidelity across multiple evaluation metrics. The code and pre-trained models are available at https://github.com/AndreiiArhire/Learned-Lightweight-Smartphone-ISP-with-Unpaired-Data .

[46] Vision language models have difficulty recognizing virtual objects

Tyler Tran,Sangeet Khemlani,J. G. Trafton

Main category: cs.CV

TL;DR: The paper examines how well vision language models (VLMs) understand virtual objects in images and finds that current models fall short.

Details

Motivation: To study whether VLMs truly understand spatial relations in images, in particular how they handle virtual objects. Method: Tests VLM scene understanding with prompts that include virtual objects (e.g., "imagine that a kite is stuck in the tree"). Result: Experiments show that current state-of-the-art VLMs handle virtual objects poorly. Conclusion: VLMs still need improvement in understanding virtual objects and spatial relations in images. Abstract: Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but how well they comprehend the visuospatial properties of scenes depicted in the images they process remains an open question. We argue that descriptions of virtual objects -- objects that are not visually represented in an image -- can help test scene comprehension in these AI systems. For example, an image that depicts a person standing under a tree can be paired with the following prompt: imagine that a kite is stuck in the tree. VLMs that comprehend the scene should update their representations and reason sensibly about the spatial relations between all three objects. We describe systematic evaluations of state-of-the-art VLMs and show that their ability to process virtual objects is inadequate.

[47] Consistent Quantity-Quality Control across Scenes for Deployment-Aware Gaussian Splatting

Fengdi Zhang,Hongkun Cao,Ruqi Huang

Main category: cs.CV

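TL;DR: ControlGS is a 3D Gaussian splatting (3DGS) optimization method that provides semantically consistent quantity-quality control through a user-specified hyperparameter, supports a broad adjustment range, and adapts automatically to different scenes in a single training run.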

Details

Motivation: Existing 3DGS methods make it hard to adjust the trade-off between Gaussian quantity and rendering quality intuitively, so they cannot meet practical needs such as deployment under different hardware and communication constraints. Method: With a fixed setup and a user-specified hyperparameter, ControlGS automatically finds a desirable quantity-quality trade-off in a single training run, with consistent control across scenes. Result: ControlGS keeps rendering quality high while reducing the number of Gaussians, outperforms baselines, and supports stepless adjustment. Conclusion: ControlGS offers a flexible, efficient method for needs ranging from compact objects to large outdoor scenes. Abstract: To reduce storage and computational costs, 3D Gaussian splatting (3DGS) seeks to minimize the number of Gaussians used while preserving high rendering quality, introducing an inherent trade-off between Gaussian quantity and rendering quality. Existing methods strive for better quantity-quality performance, but lack the ability for users to intuitively adjust this trade-off to suit practical needs such as model deployment under diverse hardware and communication constraints. Here, we present ControlGS, a 3DGS optimization method that achieves semantically meaningful and cross-scene consistent quantity-quality control while maintaining strong quantity-quality performance. Through a single training run using a fixed setup and a user-specified hyperparameter reflecting quantity-quality preference, ControlGS can automatically find desirable quantity-quality trade-off points across diverse scenes, from compact objects to large outdoor scenes. It also outperforms baselines by achieving higher rendering quality with fewer Gaussians, and supports a broad adjustment range with stepless control over the trade-off.

[48] Logos as a Well-Tempered Pre-train for Sign Language Recognition

Ilya Ovodov,Petr Surovtsev,Karina Kvanchiani,Alexander Kapitanov,Alexander Nagaev

Main category: cs.CV

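TL;DR: The paper studies cross-language training and the annotation of similar signs in isolated sign language recognition (ISLR), presents Logos, a new Russian Sign Language dataset, and shows its effectiveness on cross-language tasks.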

Details

Motivation: To address limited data and ambiguous labeling of similar signs in ISLR. Method: Presents the Logos dataset, explores cross-language transfer learning, and uses joint training with multiple classification heads. Result: A model pre-trained on Logos performs well on cross-language tasks and significantly improves accuracy on low-resource datasets. Conclusion: The Logos dataset and the proposed methods outperform prior state-of-the-art results on WLASL and are competitive on AUTSL, offering a new approach to sign language recognition. Abstract: This paper examines two aspects of the isolated sign language recognition (ISLR) task. First, despite the availability of a number of datasets, the amount of data for most individual sign languages is limited. This poses the challenge of cross-language ISLR model training, including transfer learning. Second, similar signs can have different semantic meanings. This leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs. To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset, the most extensive ISLR dataset by the number of signers and one of the largest available datasets, as well as the largest RSL dataset in size and vocabulary. It is shown that a model pre-trained on the Logos dataset can be used as a universal encoder for other language SLR tasks, including few-shot learning. We explore cross-language transfer learning approaches and find that joint training using multiple classification heads benefits accuracy for the target low-resource datasets the most. The key feature of the Logos dataset is explicitly annotated visually similar sign groups. We show that explicitly labeling visually similar signs improves trained model quality as a visual encoder for downstream tasks. Based on the proposed contributions, we outperform current state-of-the-art results for the WLASL dataset and get competitive results for the AUTSL dataset, with a single stream model processing solely RGB video. The source code, dataset, and pre-trained models are publicly available.

[49] UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation

Yi Li,Haonan Wang,Qixiang Zhang,Boyu Xiao,Chenchang Hu,Hualiang Wang,Xiaomeng Li

Main category: cs.CV

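TL;DR: UniEval is an evaluation framework for unified multimodal models that requires no extra models, images, or annotations; it comprises the UniBench benchmark and the UniScore metric and addresses limitations of existing evaluation methods.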

Details

Motivation: Unified multimodal models lack a unified evaluation framework, and existing methods have limitations such as the lack of overall results and reliance on extra models and annotations. Method: Proposes the UniEval framework, comprising the UniBench benchmark (supporting both unified and visual generation models) and the UniScore metric; UniBench includes 81 fine-grained tags for greater diversity. Result: Experiments show that UniBench is more challenging and that UniScore aligns closely with human evaluations, surpassing existing metrics. Conclusion: UniEval provides an efficient evaluation tool for unified multimodal models and reveals its unique value. Abstract: The emergence of unified multimodal understanding and generation models is rapidly attracting attention because of their ability to enhance instruction-following capabilities while minimizing model redundancy. However, there is a lack of a unified evaluation framework for these models, which would enable an elegant, simplified, and overall evaluation. Current models conduct evaluations on multiple task-specific benchmarks, but there are significant limitations, such as the lack of overall results, errors from extra evaluation models, reliance on extensive labeled images, benchmarks that lack diversity, and metrics with limited capacity for instruction-following evaluation. To tackle these challenges, we introduce UniEval, the first evaluation framework designed for unified multimodal models without extra models, images, or annotations. This facilitates a simplified and unified evaluation process. The UniEval framework contains a holistic benchmark, UniBench (supports both unified and visual generation models), along with the corresponding UniScore metric. UniBench includes 81 fine-grained tags contributing to high diversity. Experimental results indicate that UniBench is more challenging than existing benchmarks, and UniScore aligns closely with human evaluations, surpassing current metrics. Moreover, we extensively evaluated SoTA unified and visual generation models, uncovering new insights into UniEval's unique values.

[50] CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

Raman Dutt,Pedro Sanchez,Yongchen Yao,Steven McDonagh,Sotirios A. Tsaftaris,Timothy Hospedales

Main category: cs.CV

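TL;DR: CheXGenBench is a multifaceted framework for evaluating synthetic chest radiograph generation, covering generation quality, privacy risks, and clinical utility, with a standardized evaluation protocol and the SynthCheX-75K dataset.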

Details

Motivation: Evaluation of generative AI in the medical domain suffers from methodological inconsistencies, outdated architectural comparisons, and disconnected assessment criteria, with little attention to clinical utility. Method: Systematically analyzes 11 leading text-to-image architectures using standardized data partitioning and an evaluation protocol with more than 20 quantitative metrics. Result: Reveals shortcomings of existing evaluation protocols in assessing generative fidelity, and provides a standardized benchmark and the high-quality SynthCheX-75K dataset. Conclusion: CheXGenBench gives the medical AI community an objective, reproducible evaluation standard and advances the use of generative models in clinical settings. Abstract: We introduce CheXGenBench, a rigorous and multifaceted evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across state-of-the-art text-to-image generative models. Despite rapid advancements in generative AI for real-world imagery, medical domain evaluations have been hindered by methodological inconsistencies, outdated architectural comparisons, and disconnected assessment criteria that rarely address the practical clinical value of synthetic samples. CheXGenBench overcomes these limitations through standardised data partitioning and a unified evaluation protocol comprising over 20 quantitative metrics that systematically analyse generation quality, potential privacy vulnerabilities, and downstream clinical applicability across 11 leading text-to-image architectures. Our results reveal critical inefficiencies in the existing evaluation protocols, particularly in assessing generative fidelity, leading to inconsistent and uninformative comparisons. Our framework establishes a standardised benchmark for the medical AI community, enabling objective and reproducible comparisons while facilitating seamless integration of both existing and future generative models. Additionally, we release a high-quality, synthetic dataset, SynthCheX-75K, comprising 75K radiographs generated by the top-performing model (Sana 0.6B) in our benchmark to support further research in this critical domain. Through CheXGenBench, we establish a new state-of-the-art and release our framework, models, and SynthCheX-75K dataset at https://raman1121.github.io/CheXGenBench/

[51] MorphGuard: Morph Specific Margin Loss for Enhancing Robustness to Face Morphing Attacks

Iurii Medvedev,Nuno Goncalves

Main category: cs.CV

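TL;DR: Proposes a deep network training method with a dual-branch classification strategy that makes face recognition systems more robust to face morphing attacks.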

Details

Motivation: As deep learning advances, face recognition systems face security threats such as face morphing attacks and need greater robustness. Method: Introduces a dual-branch classification strategy that handles label ambiguity for face morphs and incorporates morph images into training. Result: Validated on public benchmarks, the method improves robustness to face morphing attacks. Conclusion: The method is general and can be integrated into existing face recognition training pipelines to improve classification-based recognition. Abstract: Face recognition has evolved significantly with the advancement of deep learning techniques, enabling its widespread adoption in various applications requiring secure authentication. However, this progress has also increased its exposure to presentation attacks, including face morphing, which poses a serious security threat by allowing one identity to impersonate another. Therefore, modern face recognition systems must be robust against such attacks. In this work, we propose a novel approach for training deep networks for face recognition with enhanced robustness to face morphing attacks. Our method modifies the classification task by introducing a dual-branch classification strategy that effectively handles the ambiguity in the labeling of face morphs. This adaptation allows the model to incorporate morph images into the training process, improving its ability to distinguish them from bona fide samples. Our strategy has been validated on public benchmarks, demonstrating its effectiveness in enhancing robustness against face morphing attacks. Furthermore, our approach is universally applicable and can be integrated into existing face recognition training pipelines to improve classification-based recognition methods.

[52] Enhancing Multi-Image Question Answering via Submodular Subset Selection

Aaryan Sharma,Shivansh Gupta,Samar Agarwal,Vishak Prasad C.,Ganesh Ramakrishnan

Main category: cs.CV

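TL;DR: The paper proposes an improved retrieval framework that uses submodular subset selection to improve retrieval performance in multi-image question answering.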

Details

Motivation: Large multimodal models perform well on single-image tasks but face scalability and retrieval-performance problems in multi-image settings such as multi-image question answering. Method: Uses query-aware submodular functions (such as GraphCut) to pre-select a subset of semantically relevant images, and combines anchor-based queries with data augmentation to improve the retrieval pipeline. Result: The method improves retrieval pipeline effectiveness, particularly with large numbers of images. Conclusion: Submodular subset selection can address retrieval challenges in multi-image tasks, especially at large data sizes. Abstract: Large multimodal models (LMMs) have achieved high performance in vision-language tasks involving a single image, but they struggle when presented with a collection of multiple images (the Multiple Image Question Answering scenario). These tasks, which involve reasoning over a large number of images, present issues in scalability (with increasing number of images) and retrieval performance. In this work, we propose an enhancement to the retriever framework introduced in the MIRAGE model using submodular subset selection techniques. Our method leverages query-aware submodular functions, such as GraphCut, to pre-select a subset of semantically relevant images before the main retrieval component. We demonstrate that using anchor-based queries and augmenting the data improves the effectiveness of the submodular-retriever pipeline, particularly in large haystack sizes.
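
As a rough illustration of submodular subset selection with a graph-cut objective, the sketch below greedily picks `k` items from a pairwise similarity matrix. It uses the generic graph-cut function f(S) = sum over i in V, j in S of sim[i][j], minus lam times the sum over i, j in S of sim[i][j]. The query-aware variant used in the paper is not specified in the abstract, so the function names, the `lam` parameter, and the toy similarity values are all assumptions.

```python
def graph_cut(sim, subset, lam=0.5):
    """Generic graph-cut objective: coverage of the full set minus
    lam times the redundancy inside the selected subset."""
    n = len(sim)
    coverage = sum(sim[i][j] for i in range(n) for j in subset)
    redundancy = sum(sim[i][j] for i in subset for j in subset)
    return coverage - lam * redundancy

def greedy_select(sim, k, lam=0.5):
    """Greedy maximization: repeatedly add the item with the largest
    marginal gain. Ties go to the lowest index."""
    selected = []
    remaining = set(range(len(sim)))
    for _ in range(min(k, len(sim))):
        base = graph_cut(sim, selected, lam)
        best = max(
            sorted(remaining),
            key=lambda j: graph_cut(sim, selected + [j], lam) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy similarities: items 0 and 1 are near-duplicates, item 2 is distinct.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
picked = greedy_select(sim, k=2, lam=1.0)
```

With `lam=1.0` the redundancy penalty steers the second pick away from the near-duplicate item 1 and toward the distinct item 2. In a retrieval setting, the similarity matrix would presumably come from image and query embeddings.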

[53] Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis

Pengfei Wang,Guohai Xu,Weinong Wang,Junjie Yang,Jie Lou,Yunhua Xue

Main category: cs.CV

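TL;DR: The paper proposes a new way to measure visual understanding in multimodal large language models (MLLMs): it decouples the visual and textual modalities and introduces an attention accuracy metric and a new benchmark to quantify implicit visual misunderstanding (IVM).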

Details

Motivation: Existing benchmarks mainly evaluate answer correctness and overlook whether models truly understand the visual input, so a more reliable way to assess visual understanding is needed. Method: Decouples the visual and textual modalities, analyzes attention distributions, and proposes the attention accuracy metric and a new benchmark to quantify IVM. Result: Attention increasingly concentrates on the image associated with the correct answer as layers deepen, and attention accuracy reliably assesses visual understanding. Conclusion: The approach applies to multimodal scenarios and extends to unimodal ones, showing versatility and generalizability. Abstract: Recent advancements have enhanced the capability of Multimodal Large Language Models (MLLMs) to comprehend multi-image information. However, existing benchmarks primarily evaluate answer correctness, overlooking whether models genuinely comprehend the visual input. To address this, we define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input. Through our analysis, we decouple the visual and textual modalities within the causal attention module, revealing that attention distribution increasingly converges on the image associated with the correct answer as the network layers deepen. This insight leads to the introduction of a scale-agnostic metric, attention accuracy, and a novel benchmark for quantifying IVMs. Attention accuracy directly evaluates the model's visual understanding via internal mechanisms, remaining robust to positional biases for more reliable assessments. Furthermore, we extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios, underscoring its versatility and generalizability.
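
One plausible reading of the attention accuracy metric is sketched below: for each sample, sum the attention mass that falls on each candidate image, take the image with the most mass, and count how often it matches the image tied to the correct answer. The abstract does not give the exact definition, so this is a hypothetical reconstruction, and the function and argument names are assumptions.

```python
def attention_accuracy(attn_per_image, target_indices):
    """Share of samples whose most-attended image is the target image.

    attn_per_image: one list per sample, holding the total attention
                    mass assigned to each candidate image.
    target_indices: index of the image tied to the correct answer,
                    one per sample.
    """
    if len(attn_per_image) != len(target_indices) or not target_indices:
        raise ValueError("inputs must be non-empty and aligned")
    hits = 0
    for masses, target in zip(attn_per_image, target_indices):
        # argmax over images; ties resolve to the lowest index
        predicted = max(range(len(masses)), key=masses.__getitem__)
        hits += int(predicted == target)
    return hits / len(target_indices)
```

Because only the argmax is used, rescaling a sample's attention masses by any positive constant leaves the score unchanged, which fits the abstract's description of the metric as scale-agnostic.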

[54] Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data

Yiwen Liu,Jessica Bader,Jae Myung Kim

Main category: cs.CV

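TL;DR: The study examines whether enforcing feasibility in synthetic training data is necessary and finds that it has minimal effect on CLIP classifier performance.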

Details

Motivation: As diffusion models advance, the quality of synthetic training images improves, yet unrealistic images persist. The study tests whether feasibility affects a model's ability to generalize. Method: Proposes the VariReal pipeline, which minimally edits source images to introduce feasible or infeasible attributes, and tests the effect on CLIP classifier performance. Result: Feasibility has minimal effect on CLIP performance (mostly less than 0.3% difference), and mixing feasible and infeasible data has no significant effect. Conclusion: Enforcing feasibility is not necessary when training on synthetic data, and mixed data does not hurt model performance. Abstract: With the development of photorealistic diffusion models, models trained in part or fully on synthetic data achieve progressively better results. However, diffusion models still routinely generate images that would not exist in reality, such as a dog floating above the ground or with unrealistic texture artifacts. We define the concept of feasibility as whether attributes in a synthetic image could realistically exist in the real-world domain; synthetic images containing attributes that violate this criterion are considered infeasible. Intuitively, infeasible images are typically considered out-of-distribution; thus, training on such images is expected to hinder a model's ability to generalize to real-world data, and they should therefore be excluded from the training set whenever possible. However, does feasibility really matter? In this paper, we investigate whether enforcing feasibility is necessary when generating synthetic training data for CLIP-based classifiers, focusing on three target attributes: background, color, and texture. We introduce VariReal, a pipeline that minimally edits a given source image to include feasible or infeasible attributes given by the textual prompt generated by a large language model. Our experiments show that feasibility minimally affects LoRA-fine-tuned CLIP performance, with mostly less than 0.3% difference in top-1 accuracy across three fine-grained datasets. Also, whether feasible or infeasible images adversarially influence classification performance depends on the attribute. Finally, mixing feasible and infeasible images in training datasets does not significantly impact performance compared to using purely feasible or infeasible datasets.

[55] MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

Ke Wang,Junting Pan,Linda Wei,Aojun Zhou,Weikang Shi,Zimu Lu,Han Xiao,Yunqiao Yang,Houxing Ren,Mingjie Zhan,Hongsheng Li

Main category: cs.CV

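TL;DR: The paper uses code as the supervision signal for cross-modal alignment to address the loss of detail in mathematical figures; it builds the largest image-code dataset to date and a high-quality math instruction dataset, and trains a strong multimodal math problem-solving model.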

Details

Motivation: Existing natural-language image-caption datasets overlook the details of mathematical figures, which hinders multimodal mathematical reasoning, so a new approach to cross-modal alignment is needed. Method: Uses code as the supervision signal; develops the image-to-code model FigCodifier and the ImgCode-8.6M dataset, synthesizes mathematical figures to build the MM-MathInstruct-3M dataset, and trains the MathCoder-VL model. Result: MathCoder-VL reaches open-source SOTA on six metrics and surpasses GPT-4o and Claude 3.5 Sonnet on geometry problem solving by 8.9% and 9.2%, respectively. Conclusion: Code supervision and high-quality datasets substantially improve multimodal mathematical reasoning; the models and datasets will be released. Abstract: Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at https://github.com/mathllm/MathCoder.

[56] End-to-End Vision Tokenizer Tuning

Wenxuan Wang,Fan Zhang,Yufeng Cui,Haiwen Diao,Zhuoyan Luo,Huchuan Lu,Jing Liu,Xinlong Wang

Main category: cs.CV

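TL;DR: ETT is an end-to-end vision tokenizer tuning approach that jointly optimizes vision tokenization and the target task, addressing the mismatch between existing vision tokenizers and downstream tasks.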

Details

Motivation: Existing vision tokenizers are optimized separately from downstream tasks, so they cannot adapt to the needs of different tasks and become a performance bottleneck. Method: ETT uses the visual embeddings of the tokenizer codebook and jointly optimizes reconstruction and caption objectives for end-to-end tuning. Result: Experiments show that ETT improves performance by 2-6% on multimodal understanding and visual generation tasks while preserving the original reconstruction capability. Conclusion: ETT is a simple and effective method with potential for broad use in multimodal foundation models. Abstract: Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. This decoupled paradigm introduces a critical misalignment: the loss of the vision tokenization can be the representation bottleneck for target tasks. For example, errors in tokenizing text in a given image lead to poor results when recognizing or generating it. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook, and optimizes the vision tokenizers end-to-end with both reconstruction and caption objectives. ETT can be seamlessly integrated into existing training pipelines with minimal architecture modifications. Our ETT is simple to implement and integrate, without the need to adjust the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that our proposed end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2-6% for multimodal understanding and visual generation tasks compared to frozen tokenizer baselines, while preserving the original reconstruction capability. We hope this very simple and strong method can empower multimodal foundation models besides image generation and understanding.

[57] Depth Anything with Any Prior

Zehan Wang,Siyu Chen,Lihe Yang,Jialei Wang,Ziang Zhang,Hengshuang Zhao,Zhou Zhao

Main category: cs.CV

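TL;DR: The Prior Depth Anything framework combines incomplete but precise depth measurements with relative but complete geometric structure to produce accurate, dense, and detailed depth maps.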

Details

Motivation: Address the shortfalls of existing depth prediction methods in precision and completeness by combining two complementary depth sources. Method: A coarse-to-fine pipeline: 1) pixel-level metric alignment and distance-aware weighting pre-fill the prior; 2) a conditioned monocular depth estimation model, conditioned on the normalized prior and prediction, refines the noise. Result: Shows zero-shot generalization on 7 real-world datasets, matching or surpassing task-specific methods, and handles unseen mixed priors. Conclusion: The framework is flexible and efficient, improves as monocular depth estimation models advance, and offers an accuracy-efficiency trade-off. Abstract: This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.
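The idea of pre-filling a sparse metric prior from a relative depth prediction can be illustrated with a deliberately simpler stand-in. The paper describes pixel-level alignment with distance-aware weighting; the sketch below instead fits a single global least-squares scale and shift on the pixels where the prior is known, then fills the missing pixels with the aligned prediction. Function names and the flat-list image representation are assumptions made for brevity.

```python
def fit_scale_shift(pred, prior):
    """Least-squares scale and shift mapping relative depth to metric depth.

    pred: relative depth per pixel.
    prior: metric depth per pixel, None where no measurement exists.
    """
    pairs = [(p, m) for p, m in zip(pred, prior) if m is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two measured pixels")
    mean_p = sum(p for p, _ in pairs) / n
    mean_m = sum(m for _, m in pairs) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    if var_p == 0:
        raise ValueError("prediction is constant on the measured pixels")
    cov = sum((p - mean_p) * (m - mean_m) for p, m in pairs)
    scale = cov / var_p
    shift = mean_m - scale * mean_p
    return scale, shift

def prefill_prior(pred, prior):
    """Keep measured metric depths; fill gaps with the aligned prediction."""
    scale, shift = fit_scale_shift(pred, prior)
    return [m if m is not None else scale * p + shift for p, m in zip(pred, prior)]
```

A global fit like this ignores spatially varying error, which is presumably what the per-pixel, distance-weighted alignment in the paper is meant to handle; the dense pre-filled map would then feed the refinement model.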

[58] 3D-Fixup: Advancing Photo Editing with 3D Priors

Yen-Chi Cheng,Krishna Kumar Singh,Jae Shin Yoon,Alex Schwing,Liangyan Gui,Matheus Gadelha,Paul Guerrero,Nanxuan Zhao

Main category: cs.CV

TL;DR: 3D-Fixup is a framework built on diffusion models and 3D priors that supports complex 3D-aware image editing.

Details

Motivation: Despite progress in modeling image priors with diffusion models, 3D-aware editing from a single image remains challenging. Method: Generates training pairs from video data, adds 3D guidance from an Image-to-3D model, and designs a data generation pipeline to improve training. Result: 3D-Fixup achieves high-quality 3D-aware edits and supports complex operations such as object translation and 3D rotation. Conclusion: By integrating 3D priors, 3D-Fixup makes notable progress in image editing; the code is publicly available. Abstract: Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity-coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at https://3dfixup.github.io/

cs.GR [Back]

[59] VRSplat: Fast and Robust Gaussian Splatting for Virtual Reality

Xuechang Tu,Lukas Radl,Michael Steiner,Markus Steinberger,Bernhard Kerbl,Fernando de la Torre

Main category: cs.GR

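TL;DR: VRSplat combines and extends 3DGS techniques to address temporal artifacts, projection distortions, and low frame rates in VR; with a modified rasterizer and a fine-tuning step, it delivers an artifact-free VR experience at 72+ FPS.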

Details

Motivation: In VR, 3DGS suffers from temporal artifacts, projection distortions, and insufficient frame rates, and these problems are especially pronounced on head-mounted displays (HMDs). Method: Combines Mini-Splatting, StopThePop, and Optimal Projection, modifies the 3DGS rasterizer, proposes an efficient foveated rasterizer, and fine-tunes the Gaussian parameters. Result: A user study shows that VRSplat is preferred over other configurations, achieving 72+ FPS while eliminating artifacts and distortions. Conclusion: VRSplat is the first systematically evaluated 3DGS approach that supports modern VR applications and addresses its key challenges. Abstract: 3D Gaussian Splatting (3DGS) has rapidly become a leading technique for novel-view synthesis, providing exceptional performance through efficient software-based GPU rasterization. Its versatility enables real-time applications, including on mobile and lower-powered devices. However, 3DGS faces key challenges in virtual reality (VR): (1) temporal artifacts, such as popping during head movements, (2) projection-based distortions that result in disturbing and view-inconsistent floaters, and (3) reduced framerates when rendering large numbers of Gaussians, falling below the critical threshold for VR. Compared to desktop environments, these issues are drastically amplified by large field-of-view, constant head movements, and high resolution of head-mounted displays (HMDs). In this work, we introduce VRSplat: we combine and extend several recent advancements in 3DGS to address challenges of VR holistically. We show how the ideas of Mini-Splatting, StopThePop, and Optimal Projection can complement each other, by modifying the individual techniques and core 3DGS rasterizer. Additionally, we propose an efficient foveated rasterizer that handles focus and peripheral areas in a single GPU launch, avoiding redundant computations and improving GPU utilization. Our method also incorporates a fine-tuning step that optimizes Gaussian parameters based on StopThePop depth evaluations and Optimal Projection. We validate our method through a controlled user study with 25 participants, showing a strong preference for VRSplat over other configurations of Mini-Splatting. VRSplat is the first, systematically evaluated 3DGS approach capable of supporting modern VR applications, achieving 72+ FPS while eliminating popping and stereo-disrupting floaters.

[60] Style Customization of Text-to-Vector Generation with Image Diffusion Priors

Peiying Zhang,Nanxuan Zhao,Jing Liao

Main category: cs.GR

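TL;DR: The paper proposes a two-stage SVG generation method that combines feed-forward T2V models with T2I priors, addressing the shortcomings of existing methods in style customization and structural regularity.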

Details

Motivation: Existing text-to-vector (T2V) generation methods lack style customization and cannot meet the practical need for a consistent visual style. Method: A two-stage pipeline: 1) train a T2V diffusion model with a path-level representation to ensure structural regularity; 2) customize style by distilling customized T2I models. Result: Experiments confirm that the method efficiently generates high-quality, diverse SVGs in custom styles. Conclusion: The method effectively combines feed-forward T2V models with T2I priors, resolving both style customization and structural regularity. Abstract: Scalable Vector Graphics (SVGs) are highly favored by designers due to their resolution independence and well-organized layer structure. Although existing text-to-vector (T2V) generation methods can create SVGs from text prompts, they often overlook an important need in practical applications: style customization, which is vital for producing a collection of vector graphics with consistent visual appearance and coherent aesthetics. Extending existing T2V methods for style customization poses certain challenges. Optimization-based T2V models can utilize the priors of text-to-image (T2I) models for customization, but struggle with maintaining structural regularity. On the other hand, feed-forward T2V models can ensure structural regularity, yet they encounter difficulties in disentangling content and style due to limited SVG training data. To address these challenges, we propose a novel two-stage style customization pipeline for SVG generation, making use of the advantages of both feed-forward T2V models and T2I image priors. In the first stage, we train a T2V diffusion model with a path-level representation to ensure the structural regularity of SVGs while preserving diverse expressive capabilities. In the second stage, we customize the T2V diffusion model to different styles by distilling customized T2I models. By integrating these techniques, our pipeline can generate high-quality and diverse SVGs in custom styles based on text prompts in an efficient feed-forward manner. The effectiveness of our method has been validated through extensive experiments. The project page is https://customsvg.github.io.

cs.CL [Back]

[61] Next Word Suggestion using Graph Neural Network

Abisha Thapa Magar,Anup Shakya

Main category: cs.CL

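TL;DR: The paper combines graph convolution from graph neural networks (GNNs) with LSTMs for the context-embedding sub-task of language modeling and validates the approach under limited resources.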

Details

Motivation: Current large language models require enormous compute and data; this study explores a resource-efficient approach to context embedding. Method: Uses the graph convolution operation of GNNs to encode context and combines it with LSTMs to predict the next word. Experiments use a custom Wikipedia corpus with low resource consumption. Result: The method performs well under limited resources and predicts the next word effectively. Conclusion: Combining GNNs and LSTMs shows promise for resource-constrained language modeling. Abstract: Language Modeling is a prevalent task in Natural Language Processing. The most recent and most successful language models tend to be massive models with billions of parameters, fed a tremendous amount of text data and trained with enormous computational resources that cost millions of dollars. In this project, we aim to address an important sub-task in language modeling, i.e., context embedding. We propose an approach to exploit the Graph Convolution operation in GNNs to encode the context and use it in combination with LSTMs to predict the next word given a local context of preceding words. We test this on the custom Wikipedia text corpus using a very limited amount of resources and show that this approach works fairly well to predict the next word.

[62] DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

Xiwen Chen,Wenhui Zhu,Peijie Qiu,Xuanzhao Dong,Hao Wang,Haiyu Wu,Huayu Li,Aristeidis Sotiras,Yalin Wang,Abolfazl Razi

Main category: cs.CL

TL;DR: The paper proposes Diversity-aware Reward Adjustment (DRA), which incorporates semantic diversity to improve GRPO in low-resource post-training of language models.

Details

Motivation: GRPO performs well in low-resource settings, but its reliance on scalar reward signals cannot capture semantic diversity, leading to a diversity-quality inconsistency. Method: DRA uses Submodular Mutual Information (SMI) to adjust rewards, downweighting redundant completions and amplifying rewards for diverse ones. Result: On five mathematical reasoning benchmarks, DRA-GRPO and DGA-DR. GRPO outperform baselines, with an average accuracy of 58.2% using only 7,000 fine-tuning samples and a training cost of about $55. Conclusion: By explicitly incorporating semantic diversity, DRA addresses GRPO's limitation and achieves state-of-the-art performance in low-resource settings. Abstract: Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose Diversity-aware Reward Adjustment (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR. GRPO, resulting in DRA-GRPO and DGA-DR. GRPO. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at https://github.com/xiwenc1/DRA-GRPO.
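The reward adjustment can be pictured as dividing each completion's reward by a measure of how redundant it is with the rest of its sampled group. The abstract does not give the exact SMI instantiation, so the sketch below is an assumption: it uses the sum of clamped cosine similarities to the other completions as the redundancy term, and the embedding source and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def diversity_adjusted_rewards(rewards, embeddings):
    """Downweight rewards of completions that resemble others in the group.

    rewards: scalar reward per sampled completion.
    embeddings: one semantic embedding per completion (assumed given).
    Redundancy is the sum of non-negative cosine similarities to the
    other completions; the adjusted reward is reward / (1 + redundancy).
    """
    adjusted = []
    for i, (reward, emb_i) in enumerate(zip(rewards, embeddings)):
        redundancy = sum(
            max(0.0, cosine(emb_i, emb_j))
            for j, emb_j in enumerate(embeddings)
            if j != i
        )
        adjusted.append(reward / (1.0 + redundancy))
    return adjusted
```

In a GRPO-style loop the adjusted rewards would replace the raw ones before group-relative advantages are computed, so two near-duplicate correct solutions share credit while a distinct correct solution keeps more of its own.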

[63] Large Language Models Are More Persuasive Than Incentivized Human Persuaders

Philipp Schoenegger,Francesco Salvi,Jiacheng Liu,Xiaoli Nan,Ramit Debnath,Barbara Fasolo,Evelina Leivada,Gabriel Recchia,Fritz Günther,Ali Zarifhonarvar,Joe Kwon,Zahoor Ul Islam,Marco Dehnert,Daryl Y. H. Lee,Madeline G. Reinecke,David G. Kamper,Mert Kobaş,Adam Sandford,Jonas Kgomo,Luke Hewitt,Shreya Kapoor,Kerem Oktar,Eyup Engin Kucuk,Bo Feng,Cameron R. Jones,Izzy Gainsburg,Sebastian Olschewski,Nora Heinzelmann,Francisco Cruz,Ben M. Tappin,Tao Ma,Peter S. Park,Rayan Onyonka,Arthur Hjorth,Peter Slattery,Qingcheng Zeng,Lennart Finke,Igor Grossmann,Alessandro Salatiello,Ezra Karger

Main category: cs.CL

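TL;DR: The study finds that a frontier large language model (Claude Sonnet 3.5) is significantly more persuasive than incentivized human persuaders, whether steering toward correct or incorrect answers.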

Details

Motivation: Compare the persuasive ability of AI and humans in real-time conversation and evaluate AI under real incentives. Method: In a preregistered, large-scale incentivized experiment, participants take an online quiz while AI or human persuaders try to steer them toward correct or incorrect answers. Result: AI persuaders outperform humans when steering toward both correct and incorrect answers, significantly affecting participants' accuracy and earnings. Conclusion: AI's persuasive ability already exceeds that of incentivized humans, underscoring the urgency of alignment and governance frameworks. Abstract: We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers' accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI's persuasion capabilities already exceed those of humans who have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks.

[64] System Prompt Optimization with Meta-Learning

Yumin Choi,Jinheon Baek,Sung Ju Hwang

Main category: cs.CL

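TL;DR: The paper proposes bilevel system prompt optimization: a meta-learning framework optimizes the system prompt so that it adapts to diverse user prompts and transfers to unseen tasks.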

Details

Motivation: Existing work focuses on optimizing task-specific user prompts and neglects the general-purpose system prompt. This paper aims to design system prompts that adapt to diverse user prompts and transfer to unseen tasks. Method: A meta-learning framework iteratively optimizes the system prompt and the user prompts across multiple datasets so that the two work together. Result: The method is validated on 14 unseen datasets; the optimized system prompt adapts quickly to new tasks with improved performance. Conclusion: Bilevel system prompt optimization substantially improves the generalization and adaptability of LLMs. Abstract: Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.

[65] VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

Xin Liu,Lechen Zhang,Sheza Munir,Yiyang Gu,Lu Wang

Main category: cs.CL

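TL;DR: VeriFact is a new factuality evaluation framework that improves the accuracy of long-form factuality evaluation by identifying and resolving incomplete or missing facts. The accompanying FactRBench benchmark evaluates both precision and recall, whereas prior work focused mainly on precision.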

Details

Motivation: Existing methods for evaluating the factuality of long-form generation often underperform because they miss context and key relational facts. Method: Proposes the VeriFact framework, which improves fact extraction and verification, and introduces the FactRBench benchmark, which evaluates precision and recall. Result: VeriFact significantly improves fact completeness and the preservation of complex relations; FactRBench shows that larger models achieve better precision and recall, although the two are not always correlated. Conclusion: VeriFact and FactRBench provide more comprehensive tools for long-form factuality evaluation and highlight the importance of assessing both measures. Abstract: Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench, a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information, resulting in more accurate factuality evaluation. Benchmarking various open- and closed-weight LLMs on FactRBench indicates that larger models within the same model family improve precision and recall, but high precision does not always correlate with high recall, underscoring the importance of comprehensive factuality assessment.

[66] An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs

Gino Carmona-Díaz,William Jiménez-Leal,María Alejandra Grisales,Chandra Sripada,Santiago Amaya,Michael Inzlicht,Juan Pablo Bermúdez

Main category: cs.CL

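TL;DR: The paper offers a step-by-step tutorial on using LLMs to efficiently develop, test, and apply taxonomies for analyzing unstructured data, achieving high-quality text analysis through an iterative, collaborative process.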

Details

Motivation: Traditional text analysis (of open-ended responses, social media posts, and the like) is time-consuming and susceptible to bias; LLMs offer a potential remedy. Method: An iterative, collaborative process uses LLMs to generate, evaluate, and apply taxonomies, supporting both predefined and data-driven taxonomies. Result: Shows how to prompt an LLM to generate a taxonomy of life domains and reach high intercoder agreement through testing. Conclusion: LLMs show promise for text analysis but also have limitations that warrant further exploration. Abstract: Analyzing texts such as open-ended responses, headlines, or social media posts is a time- and labor-intensive process highly susceptible to bias. LLMs are promising tools for text analysis, using either a predefined (top-down) or a data-driven (bottom-up) taxonomy, without sacrificing quality. Here we present a step-by-step tutorial to efficiently develop, test, and apply taxonomies for analyzing unstructured data through an iterative and collaborative process between researchers and LLMs. Using personal goals provided by participants as an example, we demonstrate how to write prompts to review datasets and generate a taxonomy of life domains, evaluate and refine the taxonomy through prompt and direct modifications, test the taxonomy and assess intercoder agreement, and apply the taxonomy to categorize an entire dataset with high intercoder reliability. We discuss the possibilities and limitations of using LLMs for text analysis.
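Intercoder agreement between a human coder and an LLM coder can be quantified in several ways; the abstract does not say which statistic the tutorial uses. As one common option, the sketch below computes Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The category labels in the usage note are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected from each coder's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa(["work", "work", "health", "health"], ["work", "work", "health", "work"])` gives an observed agreement of 0.75 against a chance agreement of 0.5, so kappa is 0.5.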

[67] Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

Shaurya Sharthak,Vinayak Pahalwan,Adithya Kamath,Adarsh Shirawalmath

Main category: cs.CL

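TL;DR: The paper proposes the TokenAdapt framework, which pairs a model-agnostic tokenizer transplantation method with pre-tokenization learning of multi-word Supertokens to address the inefficiency of fixed tokenization schemes, substantially reducing retraining requirements and improving performance.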

Details

Motivation: Fixed tokenization schemes are inefficient and limit performance in multilingual or specialized applications, and existing remedies consume substantial compute and struggle to preserve semantic nuance. Method: Introduces TokenAdapt (a hybrid heuristic for initializing new token embeddings) and pre-tokenization learning of Supertokens, combining local subword decomposition with a global semantic-similarity estimate. Result: In zero-shot perplexity tests, TokenAdapt clearly outperforms baselines such as ReTok and TransTokenizer, with at least a 2-fold improvement in perplexity ratio over ReTok. Conclusion: The TokenAdapt framework effectively addresses tokenizer lock-in while reducing retraining requirements and improving performance. Abstract: Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges. Standard methods to overcome this often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes new unique token embeddings via a hybrid heuristic that combines two methods: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including TransTokenizer and ReTok, while our Supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios compared to both ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.
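The hybrid initialization described above blends a local estimate (from the old tokenizer's subword pieces of the new token) with a global estimate (from the top-k most similar tokens in the original vocabulary). The sketch below shows one plausible shape for that blend. The uniform averaging of subword embeddings, the similarity-weighted averaging of neighbors, the `local_weight` mixing factor, and all names are assumptions; the abstract does not specify these details.

```python
def weighted_mean(vectors, weights):
    """Weighted average of equal-length vectors."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [
        sum(w * vec[d] for vec, w in zip(vectors, weights)) / total
        for d in range(dim)
    ]

def hybrid_init(subword_ids, neighbor_ids, neighbor_sims, old_embeddings, local_weight=0.5):
    """Initialize one new token embedding (hypothetical helper).

    subword_ids: old-tokenizer ids that spell out the new token.
    neighbor_ids / neighbor_sims: top-k similar old tokens and their similarities.
    old_embeddings: mapping from old token id to embedding vector.
    """
    # Local estimate: uniform average over the token's old subword pieces.
    local = weighted_mean([old_embeddings[i] for i in subword_ids], [1.0] * len(subword_ids))
    # Global estimate: similarity-weighted average over semantic neighbors.
    global_est = weighted_mean([old_embeddings[i] for i in neighbor_ids], neighbor_sims)
    # Blend the two estimates.
    return [local_weight * l + (1.0 - local_weight) * g for l, g in zip(local, global_est)]
```

In practice the neighbor similarities would come from an auxiliary embedding model and the function would run once per token that is new to the target vocabulary, while tokens shared with the old vocabulary would presumably keep their existing embeddings.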

[68] Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques

J. Moreno-Casanova,J. M. Auñón,A. Mártinez-Pérez,M. E. Pérez-Martínez,M. E. Gas-López

Main category: cs.CL

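TL;DR: The study uses NLP techniques, specifically NER, to automatically extract key clinical information on lung and breast cancer from electronic health records, using the uQuery tool and a pretrained RoBERTa model; accuracy is high overall, but recognizing low-frequency entities remains a challenge.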

Details

Motivation: Manually extracting information from clinical reports is time-consuming and error-prone, limiting the efficiency of data-driven approaches in healthcare. NLP can automate the process and improve the accuracy and efficiency of data extraction. Method: Uses GMV's NLP tool uQuery and a pretrained RoBERTa model (bsc-bio-ehr-en3) to extract 8 clinical entities via NER from 200 breast cancer and 400 lung cancer reports. Result: The model performs well overall, especially on entities such as MET and PAT, but recognizing low-frequency entities such as EVOL remains challenging. Conclusion: NLP techniques such as NER show promise for automating clinical data extraction and can substantially improve efficiency, but recognition of low-frequency entities needs further work. Abstract: Research projects, including those focused on cancer, rely on the manual extraction of information from clinical reports. This process is time-consuming and prone to errors, limiting the efficiency of data-driven approaches in healthcare. To address these challenges, Natural Language Processing (NLP) offers an alternative for automating the extraction of relevant data from electronic health records (EHRs). In this study, we focus on lung and breast cancer due to their high incidence and the significant impact they have on public health. Early detection and effective data management in both types of cancer are crucial for improving patient outcomes. To enhance the accuracy and efficiency of data extraction, we utilized GMV's NLP tool uQuery, which excels at identifying relevant entities in clinical texts and converting them into standardized formats such as SNOMED and OMOP. uQuery not only detects and classifies entities but also associates them with contextual information, including negated entities, temporal aspects, and patient-related details. In this work, we explore the use of NLP techniques, specifically Named Entity Recognition (NER), to automatically identify and extract key clinical information from EHRs related to these two cancers. A dataset from Health Research Institute Hospital La Fe (IIS La Fe), comprising 200 annotated breast cancer and 400 lung cancer reports, was used, with eight clinical entities manually labeled using the Doccano platform. To perform NER, we fine-tuned the bsc-bio-ehr-en3 model, a RoBERTa-based biomedical linguistic model pre-trained in Spanish. Fine-tuning was performed using the Transformers architecture, enabling accurate recognition of clinical entities in these cancer types. Our results demonstrate strong overall performance, particularly in identifying entities like MET and PAT, although challenges remain with less frequent entities like EVOL.

[69] Exploring the generalization of LLM truth directions on conversational formats

Timour Ichmoukhamedov,David Martens

Main category: cs.CL

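TL;DR: The study examines the universal truth direction reported in the activation space of LLMs and finds that it generalizes poorly across conversational formats, especially to longer conversations. Appending a fixed key phrase significantly improves generalization.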

Details

Motivation: Explore how well the truth direction in LLMs generalizes, particularly across conversational formats, in order to make LLM lie detection more reliable. Method: Analyzes the activation space of LLMs with linear probes, tests how the truth direction generalizes between short and long conversations, and proposes appending a fixed key phrase as a remedy. Result: The truth direction generalizes well between short conversations but poorly to longer ones; appending a fixed key phrase significantly improves generalization to long conversations. Conclusion: The study highlights the challenge of getting LLM lie detection to generalize to new settings and proposes an effective improvement. Abstract: Several recent works argue that LLMs have a universal truth direction where true and false statements are linearly separable in the activation space of the model. It has been demonstrated that linear probes trained on a single hidden state of the model already generalize across a range of topics and might even be used for lie detection in LLM conversations. In this work we explore how this truth direction generalizes between various conversational formats. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats where the lie appears earlier in the input prompt. We propose a solution that significantly improves this type of generalization by adding a fixed key phrase at the end of each conversation. Our results highlight the challenges towards reliable LLM lie detectors that generalize to new settings.
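A linear probe on a single hidden state can take several forms. As a simple illustration, the sketch below fits a difference-of-means probe: the direction is the vector from the mean activation of false statements to the mean activation of true statements, with the decision boundary at the midpoint. This is a stand-in chosen for brevity and is not necessarily the probe used in the paper; the activations are assumed to have been extracted already, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fit_truth_direction(true_acts, false_acts):
    """Fit a difference-of-means probe.

    Returns (direction, bias) such that
    dot(direction, activation) + bias > 0 predicts 'true'.
    """
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2.0 for t, f in zip(mu_true, mu_false)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias

def predict_true(activation, direction, bias):
    """Classify one activation vector with the fitted probe."""
    score = sum(d * a for d, a in zip(direction, activation)) + bias
    return score > 0
```

Testing generalization in the paper's sense would mean fitting the probe on activations from one conversational format and evaluating it on another; the key-phrase fix amounts to reading the hidden state at the end of a fixed phrase appended to every conversation.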

[70] KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning

Peiqi Sui,Juan Diego Rodriguez,Philippe Laban,Dean Murphy,Joseph P. Dexter,Richard Jean So,Samuel Baker,Pramit Chaudhuri

Main category: cs.CL

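TL;DR: KRISTEVA is the first close reading benchmark for evaluating interpretive reasoning, comprising 1331 multiple-choice questions that test how large language models perform at literary close reading.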

Details

Motivation: Fill the gap in evaluating large language models on literary close reading and test whether they have college-level close reading ability. Method: Designs three progressively harder task sets (extracting stylistic features, retrieving contextual information, and multi-hop reasoning) to test LLM performance. Result: State-of-the-art LLMs perform passably on close reading tasks (accuracy 49.7%-69.7%) but still trail human evaluators. Conclusion: LLMs show some ability in literary close reading but still fall short of humans and need further improvement. Abstract: Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, in which they gather textual details to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been evaluated on large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs may seem to understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results find that, while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% - 69.7%), their performances still trail those of experienced human evaluators on 10 out of our 11 tasks.

[71] Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting

Apollinaire Poli Nemkova,Sarath Chandra Lingareddy,Sagnik Ray Choudhury,Mark V. Albert

Main category: cs.CL

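TL;DR: The study asks whether large language models (LLMs) can predict the escalation and fatalities of violent conflict from their pretrained weights alone, and compares this with a non-parametric approach that draws on external data.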

Details

Motivation: LLMs perform well on natural language tasks, but their use for conflict forecasting, which matters for early warning systems and policy-making, remains underexplored. Method: LLMs are evaluated on conflict-prone regions over 2020-2024 in two settings: parametric (pretrained weights only) and non-parametric (augmented with external conflict datasets and news). Result: LLMs show some ability to forecast conflict, and adding external knowledge improves performance markedly. Conclusion: LLMs have potential for conflict forecasting but need external knowledge to compensate for the limits of their pretrained weights. Abstract: Large Language Models (LLMs) have shown impressive performance across natural language tasks, but their ability to forecast violent conflict remains underexplored. We investigate whether LLMs possess meaningful parametric knowledge, encoded in their pretrained weights, to predict conflict escalation and fatalities without external data. This is critical for early warning systems, humanitarian planning, and policy-making. We compare this parametric knowledge with non-parametric capabilities, where LLMs access structured and unstructured context from conflict datasets (e.g., ACLED, GDELT) and recent news reports via Retrieval-Augmented Generation (RAG). Incorporating external information could enhance model performance by providing up-to-date context otherwise missing from pretrained weights. Our two-part evaluation framework spans 2020-2024 across conflict-prone regions in the Horn of Africa and the Middle East. In the parametric setting, LLMs predict conflict trends and fatalities relying only on pretrained knowledge. In the non-parametric setting, models receive summaries of recent conflict events, indicators, and geopolitical developments. We compare predicted conflict trend labels (e.g., Escalate, Stable Conflict, De-escalate, Peace) and fatalities against historical data. Our findings highlight the strengths and limitations of LLMs for conflict forecasting and the benefits of augmenting them with structured external knowledge.

[72] Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries

Martin Capdevila,Esteban Villa Turek,Ellen Karina Chumbe Fernandez,Luis Felipe Polo Galvez,Luis Cadavid,Andrea Marroquin,Rebeca Vargas Quesada,Johanna Crew,Nicole Vallejo Galarraga,Christopher Rodriguez,Diego Gutierrez,Radhi Datla

Main category: cs.CL

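TL;DR: The paper examines differences among regional variants of Spanish and stresses the importance of localized language models for bridging sociolinguistic gaps and building user trust.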

Details

Motivation: To study the differences among variants of Spanish across Latin America and Spain and thereby show that localized AI models are key to building user trust and meeting inclusivity goals. Method: Regional variants of Spanish are compared through an in-depth sociocultural and linguistic contextualization, and an implementation of five sub-variants is proposed. Result: The paper argues that localized language models can reduce sociolinguistic divides, raise user trust, and support internationalization strategies. Conclusion: Implementing localized variants of Spanish helps build user trust and cultural awareness in AI models while supporting sustainable user growth. Abstract: Large language models are, by definition, based on language. In an effort to underscore the critical need for regional localized models, this paper examines primary differences between variants of written Spanish across Latin America and Spain, with an in-depth sociocultural and linguistic contextualization therein. We argue that these differences effectively constitute significant gaps in the quotidian use of Spanish among dialectal groups by creating sociolinguistic dissonances, to the extent that locale-sensitive AI models would play a pivotal role in bridging these divides. In doing so, this approach informs better and more efficient localization strategies that also serve to more adequately meet inclusivity goals, while securing sustainable active daily user growth in a major low-risk investment geographic area. Therefore, implementing at least the proposed five sub variants of Spanish addresses two lines of action: to foment user trust and reliance on AI language models while also demonstrating a level of cultural, historical, and sociolinguistic awareness that reflects positively on any internationalization strategy.

[73] From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models

Yidan Wang,Yubing Ren,Yanan Cao,Binxing Fang

Main category: cs.CL

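TL;DR: Proposes a hybrid framework that combines logits-based and sampling-based watermarking, with three strategies (serial, parallel, and hybrid) to balance detectability, robustness, text quality, and security.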

Details

Motivation: With the rise of large language models (LLMs), misuse of AI-generated text is a growing concern and watermarking is a promising remedy, but existing schemes trade off robustness, text quality, and security against one another. Method: A hybrid watermarking framework combines logits-based and sampling-based schemes through three strategies (serial, parallel, and hybrid), with adaptive embedding based on token entropy and semantic entropy. Result: Experiments show that the method outperforms existing baselines across multiple datasets and models and reaches SOTA performance. Conclusion: The framework offers new insight into diverse watermarking paradigms, and the code is open source. Abstract: The rise of Large Language Models (LLMs) has heightened concerns about the misuse of AI-generated text, making watermarking a promising solution. Mainstream watermarking schemes for LLMs fall into two categories: logits-based and sampling-based. However, current schemes entail trade-offs among robustness, text quality, and security. To mitigate this, we integrate logits-based and sampling-based schemes, harnessing their respective strengths to achieve synergy. In this paper, we propose a versatile symbiotic watermarking framework with three strategies: serial, parallel, and hybrid. The hybrid framework adaptively embeds watermarks using token entropy and semantic entropy, optimizing the balance between detectability, robustness, text quality, and security. Furthermore, we validate our approach through comprehensive experiments on various datasets and models. Experimental results indicate that our method outperforms existing baselines and achieves state-of-the-art (SOTA) performance. We believe this framework provides novel insights into diverse watermarking paradigms. Our code is available at https://github.com/redwyd/SymMark.

[74] Rethinking Prompt Optimizers: From Prompt Merits to Optimization

Zixiao Zhu,Hanzhang Zhou,Zijian Feng,Tianjiao Li,Chua Jia Jim Deryl,Mak Lee Onn,Gee Wah Ng,Kezhi Mao

Main category: cs.CL

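TL;DR: MePO is a lightweight prompt optimizer built on interpretable design; it uses model-agnostic prompt quality merits to improve prompt and response quality and works across model types.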

Details

Motivation: Existing prompt optimization methods rely on large LLMs to generate prompts, which can degrade the performance of lightweight models, so a more general and efficient optimizer is needed. Method: MePO is trained on a preference dataset generated by a lightweight LLM, learns interpretable prompt quality merits, and supports local deployment and offline optimization. Result: Experiments show that MePO performs well across diverse tasks and models and is scalable and robust. Conclusion: MePO provides an efficient, general prompt optimization solution suited to real-world deployment. Abstract: Prompt optimization (PO) offers a practical alternative to fine-tuning large language models (LLMs), enabling performance improvements without altering model weights. Existing methods typically rely on advanced, large-scale LLMs like GPT-4 to generate optimized prompts. However, due to limited downward compatibility, verbose, instruction-heavy prompts from advanced LLMs can overwhelm lightweight inference models and degrade response quality. In this work, we rethink prompt optimization through the lens of interpretable design. We first identify a set of model-agnostic prompt quality merits and empirically validate their effectiveness in enhancing prompt and response quality. We then introduce MePO, a merit-guided, lightweight, and locally deployable prompt optimizer trained on our preference dataset built from merit-aligned prompts generated by a lightweight LLM. Unlike prior work, MePO avoids online optimization reliance, reduces cost and privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models. Experiments demonstrate that MePO achieves better results across diverse tasks and model types, offering a scalable and robust solution for real-world deployment. Our model and dataset are available at: https://github.com/MidiyaZhu/MePO

[75] Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph

Deeksha Prahlad,Chanhee Lee,Dongha Kim,Hokeun Kim

Main category: cs.CL

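TL;DR: The paper proposes retrieval-augmented generation (RAG) backed by knowledge graphs (KGs) to address hallucinations that arise when large language models (LLMs) generate personalized responses without timely, factual, and personalized information. Experiments show the approach understands personal information and generates accurate responses better than baseline LLMs.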

Details

Motivation: LLMs tend to produce incorrect or extraneous information (hallucinations) because of over-fitting, chiefly because they lack timely, factual, and personalized input data. Method: Retrieval-augmented generation (RAG) is combined with knowledge graphs (KGs), which store and update personalized data (such as calendar information) in structured form to help the LLM generate more accurate responses. Result: The approach significantly outperforms baseline LLMs at understanding personal information and generating accurate responses, with a moderate reduction in response time. Conclusion: RAG combined with KGs effectively reduces LLM hallucinations and improves the accuracy of personalized responses. Abstract: The advent of large language models (LLMs) has allowed numerous applications, including the generation of queried responses, to be leveraged in chatbots and other conversational assistants. Being trained on a plethora of data, LLMs often undergo high levels of over-fitting, resulting in the generation of extra and incorrect data, thus causing hallucinations in output generation. One of the root causes of such problems is the lack of timely, factual, and personalized information fed to the LLM. In this paper, we propose an approach to address these problems by introducing retrieval augmented generation (RAG) using knowledge graphs (KGs) to assist the LLM in personalized response generation tailored to the users. KGs have the advantage of storing continuously updated factual information in a structured way. While our KGs can be used for a variety of frequently updated personal data, such as calendar, contact, and location data, we focus on calendar data in this paper. Our experimental results show that our approach works significantly better in understanding personal information and generating accurate responses compared to the baseline LLMs using personal data as text inputs, with a moderate reduction in response time.

[76] DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs

Lake Yin,Fan Huang

Main category: cs.CL

TL;DR: The paper proposes DIF, a new method for measuring implicit bias in large language models (LLMs), and validates it experimentally.

Details

Motivation: Implicit bias in LLMs is a technical issue as well as an ethical one, yet there is no standard way to measure it. Method: DIF (Demographic Implicit Fairness) is computed by evaluating LLMs on existing logic and math problem datasets combined with sociodemographic personas. Result: The experiments statistically validate the presence of implicit bias in LLM behavior and find an inverse trend between question answering accuracy and implicit bias. Conclusion: DIF offers a workable route to standardized evaluation of implicit bias in LLMs and exposes a technical limitation of these models. Abstract: As Large Language Models (LLMs) have risen in prominence over the past few years, there has been concern over the potential biases in LLMs inherited from the training data. Previous studies have examined how LLMs exhibit implicit bias, such as when response generation changes when different social contexts are introduced. We argue that this implicit bias is not only an ethical, but also a technical issue, as it reveals an inability of LLMs to accommodate extraneous information. However, unlike other measures of LLM intelligence, there are no standard methods to benchmark this specific subset of LLM bias. To bridge this gap, we developed a method for calculating an easily interpretable benchmark, DIF (Demographic Implicit Fairness), by evaluating preexisting LLM logic and math problem datasets with sociodemographic personas. We demonstrate that this method can statistically validate the presence of implicit bias in LLM behavior and find an inverse trend between question answering accuracy and implicit bias, supporting our argument.

[77] CAFE: Retrieval Head-based Coarse-to-Fine Information Seeking to Enhance Multi-Document QA Capability

Han Peng,Jinhao Jiang,Zican Dong,Wayne Xin Zhao,Lei Fang

Main category: cs.CL

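TL;DR: The paper proposes CAFE, a two-stage method that uses coarse-to-fine filtering and steering to improve large language models on multi-document question answering.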

Details

Motivation: Although the input context length of large language models has grown, their retrieval and reasoning over long-context inputs remain weak, and existing methods struggle to balance retrieval precision and recall. Method: CAFE works in two stages: coarse-grained filtering identifies and ranks relevant documents, and fine-grained steering guides attention to the most relevant content. Result: CAFE outperforms baselines across benchmarks, with SubEM improvements of up to 22.1% over SFT and 13.7% over RAG methods on the Mistral model. Conclusion: By gradually eliminating the negative impact of background and distracting documents, CAFE makes multi-document question answering more accurate and reliable. Abstract: Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods propose to utilize the prompt strategy and retrieval head to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, impacting their efficacy in answering questions. To address this, we introduce CAFE, a two-stage coarse-to-fine method to enhance multi-document question-answering capacities. By gradually eliminating the negative impacts of background and distracting documents, CAFE makes the responses more reliant on the evidence documents. Initially, a coarse-grained filtering method leverages retrieval heads to identify and rank relevant documents. Then, a fine-grained steering method guides attention to the most relevant content. Experiments across benchmarks show CAFE outperforms baselines, achieving up to 22.1% and 13.7% SubEM improvement over SFT and RAG methods on the Mistral model, respectively.
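
The coarse-grained stage described above, identifying and ranking relevant documents, can be pictured as a rank-then-truncate step. The sketch below assumes each document already has a relevance score; in CAFE these would come from retrieval heads, whereas the scores, the `top_k` cutoff, and the function name here are hypothetical.

```python
def coarse_filter(documents, scores, top_k):
    """Rank documents by score (highest first) and keep the top_k.

    `scores` stands in for per-document relevance signals; how they are
    computed is outside the scope of this sketch.
    """
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in order[:top_k]]

documents = ["background", "evidence A", "distractor", "evidence B"]
scores = [0.05, 0.61, 0.12, 0.48]  # hypothetical relevance scores
kept = coarse_filter(documents, scores, top_k=2)
print(kept)
```

A fine-grained stage would then operate only on the surviving documents, which is what lets low-scoring background and distractor documents drop out before answering.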

[78] Dark LLMs: The Growing Threat of Unaligned AI Models

Michael Fire,Yitzhak Elbazis,Adi Wasenstein,Lior Rokach

Main category: cs.CL

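TL;DR: The paper examines how problematic content in training data leaves large language models (LLMs) open to jailbreak attacks, reports a universal jailbreak attack, and exposes shortcomings in the industry's approach to AI safety.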

Details

Motivation: As LLMs are widely adopted, the risk of jailbreak attacks grows, especially for models built without ethical guardrails or modified through jailbreak techniques, which can be misused to generate harmful content. Method: The authors identify a universal jailbreak attack that bypasses the safety controls of multiple state-of-the-art LLMs and makes them produce harmful outputs. Result: Many of the tested LLMs were still vulnerable more than seven months after the attack idea was published, and responses from major LLM providers were often inadequate. Conclusion: As LLMs proliferate and open-source models spread, the risk of misuse escalates, and the industry needs more effective safety measures. Abstract: Large Language Models (LLMs) rapidly reshape modern life, advancing fields from healthcare to education and beyond. However, alongside their remarkable capabilities lies a significant threat: the susceptibility of these models to jailbreaking. The fundamental vulnerability of LLMs to jailbreak attacks stems from the very data they learn from. As long as this training data includes unfiltered, problematic, or 'dark' content, the models can inherently learn undesirable patterns or weaknesses that allow users to circumvent their intended safety controls. Our research identifies the growing threat posed by dark LLMs, models deliberately designed without ethical guardrails or modified through jailbreak techniques. In our research, we uncovered a universal jailbreak attack that effectively compromises multiple state-of-the-art models, enabling them to answer almost any question and produce harmful outputs upon request. The main idea of our attack was published online over seven months ago. However, many of the tested LLMs were still vulnerable to this attack. Despite our responsible disclosure efforts, responses from major LLM providers were often inadequate, highlighting a concerning gap in industry practices regarding AI safety. As model training becomes more accessible and cheaper, and as open-source LLMs proliferate, the risk of widespread misuse escalates. Without decisive intervention, LLMs may continue democratizing access to dangerous knowledge, posing greater risks than anticipated.

[79] Designing and Contextualising Probes for African Languages

Wisdom Aduah,Francois Meyer

Main category: cs.CL

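TL;DR: A systematic study of how pretrained language models (PLMs) encode African languages, finding that PLMs adapted for African languages capture target-language features better than massively multilingual PLMs.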

Details

Motivation: To investigate how pretrained language models encode linguistic knowledge about African languages and so understand why their performance is improving. Method: Layer-wise probes and control tasks are used to analyze PLMs on six African languages, with evaluation on the MasakhaPOS dataset. Result: PLMs adapted for African languages encode more linguistic information than massively multilingual PLMs; token-level syntactic information concentrates in middle-to-last layers, while sentence-level semantic information is spread across all layers. Conclusion: Probe performance reflects the internal knowledge of the PLMs rather than probe memorisation, and the study sheds light on why adaptation strategies such as active learning and multilingual adaptation succeed. Abstract: Pretrained language models (PLMs) for African languages are continually improving, but the reasons behind these advances remain unclear. This paper presents the first systematic investigation into probing PLMs for linguistic knowledge about African languages. We train layer-wise probes for six typologically diverse African languages to analyse how linguistic features are distributed. We also design control tasks, a way to interpret probe performance, for the MasakhaPOS dataset. We find PLMs adapted for African languages to encode more linguistic information about target languages than massively multilingual PLMs. Our results reaffirm previous findings that token-level syntactic information concentrates in middle-to-last layers, while sentence-level semantic information is distributed across all layers. Through control tasks and probing baselines, we confirm that performance reflects the internal knowledge of PLMs rather than probe memorisation. Our study applies established interpretability techniques to African-language PLMs. In doing so, we highlight the internal mechanisms underlying the success of strategies like active learning and multilingual adaptation.

[80] XRAG: Cross-lingual Retrieval-Augmented Generation

Wei Liu,Sony Trenous,Leonardo F. R. Ribeiro,Bill Byrne,Felix Hieber

Main category: cs.CL

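TL;DR: XRAG is a new benchmark for evaluating the generation abilities of LLMs in cross-lingual retrieval-augmented generation (RAG), specifically when the user language does not match the retrieval results.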

Details

Motivation: Mismatches between the user language and the retrieval results are common in practice, but existing benchmarks do not adequately cover this complexity. Method: XRAG is built from recent news articles so that its questions require external knowledge, and it provides relevancy annotations for the retrieved documents. Result: In monolingual retrieval, models struggle to respond in the correct language; in multilingual retrieval, the main challenge is reasoning over retrieved information across languages. Conclusion: XRAG is a valuable benchmark for studying LLM reasoning abilities and reveals new challenges in cross-lingual RAG. Abstract: We propose XRAG, a novel benchmark designed to evaluate the generation abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG) settings where the user language does not match the retrieval results. XRAG is constructed from recent news articles to ensure that its questions require external knowledge to be answered. It covers the real-world scenarios of monolingual and multilingual retrieval, and provides relevancy annotations for each retrieved document. Our novel dataset construction pipeline results in questions that require complex reasoning, as evidenced by the significant gap between human and LLM performance. Consequently, XRAG serves as a valuable benchmark for studying LLM reasoning abilities, even before considering the additional cross-lingual complexity. Experimental results on five LLMs uncover two previously unreported challenges in cross-lingual RAG: 1) in the monolingual retrieval setting, all evaluated models struggle with response language correctness; 2) in the multilingual retrieval setting, the main challenge lies in reasoning over retrieved information across languages rather than generation of non-English text.

[81] What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs

Xinlan Yan,Di Wu,Yibin Lei,Christof Monz,Iacer Calixto

Main category: cs.CL

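TL;DR: S-MedQA is a medical question-answering dataset for benchmarking large language models in fine-grained clinical specialties. The study finds that fine-tuning on a specialty does not necessarily give the best results on that specialty, and that gains come mostly from domain shifting rather than knowledge injection.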

Details

Motivation: To study how large language models perform on medical question answering and to test whether the knowledge injection hypothesis holds in the medical domain. Method: Using the S-MedQA dataset, the authors analyze how fine-tuning on different specialties affects model performance and track changes in the token probabilities of clinically relevant terms. Result: 1) Fine-tuning on a specialty does not necessarily give the best performance on that specialty; 2) token probabilities of clinically relevant terms increase for all specialties. Conclusion: Improvements come mostly from domain shifting rather than knowledge injection, which calls for rethinking the role of fine-tuning data in the medical domain. Abstract: In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset for benchmarking large language models in fine-grained clinical specialties. We use S-MedQA to check the applicability of a popular hypothesis related to knowledge injection in the knowledge-intense scenario of medical QA, and show that: 1) training on data from a specialty does not necessarily lead to best performance on that specialty and 2) regardless of the specialty fine-tuned on, token probabilities of clinically relevant terms for all specialties increase consistently. Thus, we believe improvement gains come mostly from domain shifting (e.g., general to medical) rather than knowledge injection and suggest rethinking the role of fine-tuning data in the medical domain. We release S-MedQA and all code needed to reproduce all our experiments to the research community.

[82] GE-Chat: A Graph Enhanced RAG Framework for Evidential Response Generation of LLMs

Longchao Da,Parth Mitesh Shah,Kuan-Ru Liou,Jiaxing Zhang,Hua Wei

Main category: cs.CL

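TL;DR: GE-Chat is a knowledge-graph-enhanced retrieval-augmented generation framework that addresses unreliable outputs from large language models (LLMs) by generating evidence to make responses more trustworthy.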

Details

Motivation: When LLMs assist decision-making, hallucinated and unreliable outputs raise trust issues and force users to verify answers manually. Method: A knowledge graph is used to build a retrieval-augmented agent, and CoT logic generation, n-hop sub-graph searching, and entailment-based sentence generation together provide accurate evidence retrieval. Result: The method identifies the exact evidence in free-form context more accurately, improves model performance, and helps users judge how trustworthy an LLM's conclusion is. Conclusion: Through evidence-enhanced generation, GE-Chat offers an effective way to make LLM outputs more reliable. Abstract: Large Language Models are now key assistants in human decision-making processes. However, a common note always seems to follow: "LLMs can make mistakes. Be careful with important info." This points to the reality that not all outputs from LLMs are dependable, and users must evaluate them manually. The challenge deepens as hallucinated responses, often presented with seemingly plausible explanations, create complications and raise trust issues among users. To tackle this issue, this paper proposes GE-Chat, a knowledge Graph enhanced retrieval-augmented generation framework to provide Evidence-based response generation. Specifically, when the user uploads a material document, a knowledge graph will be created, which helps construct a retrieval-augmented agent, enhancing the agent's responses with additional knowledge beyond its training corpus. Then we leverage Chain-of-Thought (CoT) logic generation, n-hop sub-graph searching, and entailment-based sentence generation to realize accurate evidence retrieval. We demonstrate that our method improves the existing models' performance in terms of identifying the exact evidence in a free-form context, providing a reliable way to examine the resources of an LLM's conclusion and to help judge its trustworthiness.

[83] Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning

Yoichi Ishibashi,Taro Yano,Masafumi Oyamada

Main category: cs.CL

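TL;DR: The paper evaluates Reasoning CPT, a continual pretraining method that uses synthetic data to improve the reasoning ability of large language models (LLMs), and shows that it is effective across multiple domains.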

Details

Motivation: Supervised fine-tuning and reinforcement learning for reasoning are largely confined to specific domains, whereas continual pretraining (CPT) needs no task-specific signals; how to apply it to reasoning and how to synthesize data for it remain underexplored. Method: Reasoning CPT uses synthetic data to reconstruct the hidden thought processes behind texts; it is applied to Gemma2-9B and compared with standard CPT on the MMLU benchmark. Result: Reasoning CPT improves performance in all evaluated domains, with larger gains on harder problems (up to 8 points), and the reasoning skills transfer across domains. Conclusion: Models trained with Reasoning CPT adjust their reasoning depth to problem difficulty and improve markedly, showing the potential of cross-domain reasoning. Abstract: Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, these approaches are primarily applicable to specific domains such as mathematics and programming, which imposes fundamental constraints on the breadth and scalability of training data. In contrast, continual pretraining (CPT) offers the advantage of not requiring task-specific signals. Nevertheless, how to effectively synthesize training data for reasoning and how such data affect a wide range of domains remain largely unexplored. This study provides a detailed evaluation of Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden thought processes underlying texts, based on the premise that texts are the result of the author's thinking process. Specifically, we apply Reasoning CPT to Gemma2-9B using synthetic data with hidden thoughts derived from STEM and Law corpora, and compare it to standard CPT on the MMLU benchmark. Our analysis reveals that Reasoning CPT consistently improves performance across all evaluated domains. Notably, reasoning skills acquired in one domain transfer effectively to others; the performance gap with conventional methods widens as problem difficulty increases, with gains of up to 8 points on the most challenging problems. Furthermore, models trained with hidden thoughts learn to adjust the depth of their reasoning according to problem difficulty.

[84] The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

Seongyun Lee,Seungone Kim,Minju Seo,Yongrae Jo,Dongyoung Go,Hyeonbin Hwang,Jinho Park,Xiang Yue,Sean Welleck,Graham Neubig,Moontae Lee,Minjoon Seo

Main category: cs.CL

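TL;DR: The paper introduces the CoT Encyclopedia, a framework for automatically analyzing and steering model reasoning that is more comprehensive and interpretable than existing methods and can improve model performance.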

Details

Motivation: Understanding the reasoning strategies behind long chain-of-thought (CoT) is key to using large language models, but existing approaches are constrained by human intuition and miss the full diversity of model behaviors. Method: Reasoning criteria are extracted automatically from model-generated CoTs, embedded into a semantic space, and clustered, and contrastive rubrics are derived to interpret reasoning behavior. Result: Human evaluations find the framework more interpretable and comprehensive than existing methods; it can also predict which strategy a model will use and guide it toward more effective ones. Conclusion: Training data format affects reasoning behavior more than data domain, which underscores the importance of format-aware model design. Abstract: Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we provide practical insights, such as that training data format (e.g., free-form vs. multiple-choice) has a far greater impact on reasoning behavior than data domain, underscoring the importance of format-aware model design.

[85] VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits

Jintian Shao,Hongyi Huang,Jiayi Wu,YiMing Cheng,ZhiYu Wu,You Shan,MingKai Zheng

Main category: cs.CL

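TL;DR: VQ-Logits uses vector quantization to sharply reduce the parameter count and compute cost of the LLM output layer, with only a slight increase in perplexity.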

Details

Motivation: To address the high parameter count and compute cost of the LLM output layer caused by large vocabularies. Method: The large output embedding matrix is replaced with a small shared codebook; the model predicts logits over the codebook, and these are mapped to the full vocabulary space. Result: On standard benchmarks, output-layer parameters fall by up to 99% and logit computation is up to 6x faster, with only a 4% increase in perplexity. Conclusion: VQ-Logits is an efficient and robust way to reduce the cost of the LLM output layer. Abstract: Large Language Models (LLMs) have achieved remarkable success but face significant computational and memory challenges, particularly due to their extensive output vocabularies. The final linear projection layer, mapping hidden states to vocabulary-sized logits, often constitutes a substantial portion of the model's parameters and computational cost during inference. Existing methods like adaptive softmax or hierarchical softmax introduce structural complexities. In this paper, we propose VQ-Logits, a novel approach that leverages Vector Quantization (VQ) to drastically reduce the parameter count and computational load of the LLM output layer. VQ-Logits replaces the large $V \times d_{\text{model}}$ output embedding matrix with a small, shared codebook of $K$ embedding vectors ($K \ll V$). Each token in the vocabulary is mapped to one of these $K$ codebook vectors. The LLM predicts logits over this compact codebook, which are then efficiently "scattered" to the full vocabulary space using the learned or preassigned mapping. We demonstrate through extensive experiments on standard language modeling benchmarks (e.g., WikiText-103, C4) that VQ-Logits can achieve up to 99% parameter reduction in the output layer and 6x speedup in logit computation, with only a marginal 4% increase in perplexity compared to full softmax baselines. We further provide detailed ablation studies on codebook size, initialization, and learning strategies, showcasing the robustness and effectiveness of our approach.
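
The "predict over the codebook, then scatter to the vocabulary" step can be sketched in a few lines. This is a simplified illustration under stated assumptions: the toy sizes, the token-to-code mapping, and all function names are hypothetical, and any refinements the paper applies on top of the shared codebook are not modeled.

```python
def codebook_logits(hidden, codebook):
    """One logit per codebook vector: dot product with the hidden state (K values)."""
    return [sum(h * c for h, c in zip(hidden, row)) for row in codebook]

def scatter_to_vocab(code_logits, token_to_code):
    """Give each vocabulary token the logit of the codebook entry it maps to (V values)."""
    return [code_logits[code] for code in token_to_code]

def output_layer_params(vocab_size, d_model, num_codes):
    """Weight counts for a full V x d projection versus a K x d codebook.

    The token-to-code mapping is an integer lookup table and is not counted
    here as trainable weights.
    """
    return vocab_size * d_model, num_codes * d_model

hidden = [1.0, -2.0, 0.5]                     # toy hidden state, d_model = 3
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # K = 2 codebook vectors
token_to_code = [0, 1, 1, 0, 1]                # V = 5 tokens, each mapped to a code

vocab_logits = scatter_to_vocab(codebook_logits(hidden, codebook), token_to_code)
print(vocab_logits)

# Hypothetical sizes chosen only to show the arithmetic behind a 99% reduction.
full, vq = output_layer_params(vocab_size=50_000, d_model=1024, num_codes=500)
reduction = 1 - vq / full
print(f"{full} -> {vq} weights ({reduction:.0%} fewer)")
```

In this plain form, tokens that share a codebook entry receive identical logits, so the codebook size K controls the trade-off between savings and how finely the model can distinguish tokens.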

[86] RAIDEN-R1: Improving Role-awareness of LLMs via GRPO with Verifiable Reward

Zongsheng Wang,Kaili Sun,Bowen Wu,Qun Yu,Ying Li,Baoxun Wang

Main category: cs.CL

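TL;DR: RAIDEN-R1 is a new reinforcement learning framework that uses the VRAR reward to improve the role consistency of role-playing conversational agents; experiments show it outperforms baseline models on several metrics.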

Details

Motivation: To address the difficulty role-playing conversational agents (RPCAs) have in maintaining role consistency. Method: The RAIDEN-R1 framework integrates the VRAR reward, uses singular and multi-term mining strategies, and builds a high-quality role-aware dataset. Result: The 14B-GRPO model reaches 88.04% and 88.65% accuracy on the Script-Based Knowledge and Conversation Memory metrics, respectively, outperforming baseline models. Conclusion: RAIDEN-R1 bridges the non-quantifiability gap in RPCA training and offers new insight into role-aware reasoning patterns. Abstract: Role-playing conversational agents (RPCAs) face persistent challenges in maintaining role consistency. To address this, we propose RAIDEN-R1, a novel reinforcement learning framework that integrates Verifiable Role-Awareness Reward (VRAR). The method introduces both singular and multi-term mining strategies to generate quantifiable rewards by assessing role-specific keys. Additionally, we construct a high-quality, role-aware Chain-of-Thought dataset through multi-LLM collaboration, and implement experiments to enhance reasoning coherence. Experiments on the RAIDEN benchmark demonstrate RAIDEN-R1's superiority: our 14B-GRPO model achieves 88.04% and 88.65% accuracy on Script-Based Knowledge and Conversation Memory metrics, respectively, outperforming baseline models while maintaining robustness. Case analyses further reveal the model's enhanced ability to resolve conflicting contextual cues and sustain first-person narrative consistency. This work bridges the non-quantifiability gap in RPCA training and provides insights into role-aware reasoning patterns, advancing the development of RPCAs.

Poli Apollinaire Nemkova,Solomon Ubani,Mark V. Albert

Main category: cs.CL

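TL;DR: The study evaluates several state-of-the-art large language models (including GPT-3.5 and GPT-4) on zero-shot and few-shot annotation of Russian and Ukrainian social media data, focusing on binary classification of human rights violations.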

Details

Motivation: To explore how large language models perform on complex multilingual tasks, in particular how reliable their annotations are in sensitive domains such as human rights. Method: Annotations from several LLMs are compared against a human-annotated gold standard, with analysis of performance and error patterns under different prompting conditions. Result: The study identifies each model's strengths and limitations on cross-lingual tasks and examines the effect of prompt language on performance. Conclusion: The study offers practical insight into using LLMs for sensitive multilingual tasks and stresses that their reliability and adaptability still need improvement. Abstract: In the era of increasingly sophisticated natural language processing (NLP) systems, large language models (LLMs) have demonstrated remarkable potential for diverse applications, including tasks requiring nuanced textual understanding and contextual reasoning. This study investigates the capabilities of multiple state-of-the-art LLMs - GPT-3.5, GPT-4, LLAMA3, Mistral 7B, and Claude-2 - for zero-shot and few-shot annotation of a complex textual dataset comprising social media posts in Russian and Ukrainian. Specifically, the focus is on the binary classification task of identifying references to human rights violations within the dataset. To evaluate the effectiveness of these models, their annotations are compared against a gold standard set of human double-annotated labels across 1000 samples. The analysis includes assessing annotation performance under different prompting conditions, with prompts provided in both English and Russian. Additionally, the study explores the unique patterns of errors and disagreements exhibited by each model, offering insights into their strengths, limitations, and cross-linguistic adaptability. By juxtaposing LLM outputs with human annotations, this research contributes to understanding the reliability and applicability of LLMs for sensitive, domain-specific tasks in multilingual contexts. It also sheds light on how language models handle inherently subjective and context-dependent judgments, a critical consideration for their deployment in real-world scenarios.

[88] The Evolving Landscape of Generative Large Language Models and Traditional Natural Language Processing in Medicine

Rui Yang,Huitao Li,Matthew Yu Heng Wong,Yuhe Ke,Xin Li,Kunyu Yu,Jingchi Liao,Jonathan Chong Kai Liew,Sabarinath Vinod Nair,Jasmine Chiat Ling Ong,Irene Li,Douglas Teodoro,Chuan Hong,Daniel Shu Wei Ting,Nan Liu

Main category: cs.CL

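TL;DR: Generative large language models (LLMs) have the advantage in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks.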

Details

Motivation: To explore how generative LLMs and traditional NLP differ across medical tasks. Method: 19,123 studies are analyzed to compare the two technologies on medical tasks. Result: Generative LLMs have the advantage in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. Conclusion: As these technologies advance, their ethical use in medical applications must be ensured. Abstract: Natural language processing (NLP) has been traditionally applied to medicine, and generative large language models (LLMs) have become prominent recently. However, the differences between them across different medical tasks remain underexplored. We analyzed 19,123 studies, finding that generative LLMs demonstrate advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. As these technologies advance, ethical use of them is essential to ensure their potential in medical applications.

[89] From Questions to Clinical Recommendations: Large Language Models Driving Evidence-Based Clinical Decision Making

Dubai Li,Nan Jiang,Kangping Huang,Ruiqi Tu,Shuyu Ouyang,Huayu Yu,Lin Qiao,Chen Yu,Tianshu Zhou,Danyang Tong,Qian Wang,Mengtao Li,Xiaofeng Zeng,Yu Tian,Xinping Tian,Jingsong Li

Main category: cs.CL

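TL;DR: Quicker is a clinical decision support system powered by large language models that automates evidence synthesis and generates clinical recommendations, making decisions markedly faster and more accurate.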

Details

Motivation: Integrating clinical evidence into real-time practice is hard because of heavy workload, complex processes, and time constraints, so automated tools are needed to support efficient and accurate decisions. Method: Quicker implements a fully automated chain covering every phase from questions to clinical recommendations and supports customized decision-making through interactive interfaces. Result: Quicker performs strongly on question decomposition, literature screening, and recommendation generation; in collaborative use it cuts recommendation development time to 20-40 minutes. Conclusion: Quicker can help physicians make faster and more reliable evidence-based clinical decisions. Abstract: Clinical evidence, derived from rigorous research and data analysis, provides healthcare professionals with reliable scientific foundations for informed decision-making. Integrating clinical evidence into real-time practice is challenging due to the enormous workload, complex professional processes, and time constraints. This highlights the need for tools that automate evidence synthesis to support more efficient and accurate decision making in clinical settings. This study introduces Quicker, an evidence-based clinical decision support system powered by large language models (LLMs), designed to automate evidence synthesis and generate clinical recommendations modeled after standard clinical guideline development processes. Quicker implements a fully automated chain that covers all phases, from questions to clinical recommendations, and further enables customized decision-making through integrated tools and interactive user interfaces. To evaluate Quicker's capabilities, we developed the Q2CRBench-3 benchmark dataset, based on clinical guideline development records for three different diseases. Experimental results highlighted Quicker's strong performance, with fine-grained question decomposition tailored to user preferences, retrieval sensitivities comparable to human experts, and literature screening performance approaching comprehensive inclusion of relevant studies. In addition, Quicker-assisted evidence assessment effectively supported human reviewers, while Quicker's recommendations were more comprehensive and logically coherent than those of clinicians. In system-level testing, collaboration between a single reviewer and Quicker reduced the time required for recommendation development to 20-40 minutes. In general, our findings affirm the potential of Quicker to help physicians make quicker and more reliable evidence-based clinical decisions.

[90] J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

Chenxi Whitehouse,Tianlu Wang,Ping Yu,Xian Li,Jason Weston,Ilia Kulikov,Swarnadeep Saha

Main category: cs.CL

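TL;DR: J1 is a reinforcement learning method for training LLM-as-a-Judge models; it outperforms existing 8B and 70B models and surpasses larger models on some benchmarks.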

Details

Motivation: Evaluation quality is a bottleneck for AI progress. LLM-as-a-Judge models judge better with stronger chain-of-thought reasoning, which calls for finding the best recipes for training them. Method: Reinforcement learning is used to convert both verifiable and non-verifiable prompts into judgment tasks with verifiable rewards, incentivizing thinking and mitigating judgment bias. Result: J1 outperforms other models at the 8B and 70B scales and even surpasses larger models on some benchmarks. Conclusion: Through reinforcement learning, J1 substantially improves the judgment ability of LLM-as-a-Judge models over existing methods. Abstract: The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.

[91] LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations

Yile Wang,Zhanyu Shen,Hui Huang

Main category: cs.CL

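TL;DR: This paper proposes LDIR, a low-dimensional, dense, and interpretable text embedding method whose numerical dimensions are built from anchor texts selected by farthest point sampling; it performs close to black-box baselines and outperforms other interpretable embedding methods.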

Details

Motivation: Existing text embeddings (e.g., SimCSE and LLM2Vec) perform very well but are hard to interpret, while classic sparse embeddings (e.g., bag-of-words) perform poorly. A method is needed that is both high-performing and interpretable. Method: LDIR builds low-dimensional (fewer than 500 dimensions) dense embeddings in which each dimension's value expresses semantic relatedness to a different anchor text, with anchors chosen by farthest point sampling. Result: On multiple semantic textual similarity, retrieval, and clustering tasks, LDIR performs close to black-box baselines and outperforms other interpretable embedding methods. Conclusion: LDIR offers interpretability while retaining strong performance, providing a new option for semantic text representation. Abstract: Semantic text representation is a fundamental task in the field of natural language processing. Existing text embeddings (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as a classic sparse interpretable embedding, suffers from poor performance. Recently, Benara et al. (2024) proposed interpretable text embeddings using large language models, which form "0/1" embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts through farthest point sampling, offering both semantic representation as well as a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embedding baselines with far fewer dimensions. Code is available at https://github.com/szu-tera/LDIR.
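
The idea of anchor-relative dimensions can be illustrated with a small sketch. It is not the LDIR implementation: the toy 2-D vectors stand in for the output of some sentence encoder, cosine similarity is an assumed relatedness measure, and the helper names `farthest_point_sampling` and `relative_representation` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def farthest_point_sampling(vectors, k, start=0):
    """Greedily pick k indices; each new pick is farthest from the chosen set."""
    chosen = [start]
    # Cosine distance from every point to its nearest chosen anchor so far.
    min_dist = [1.0 - cosine(v, vectors[start]) for v in vectors]
    while len(chosen) < k:
        nxt = max(range(len(vectors)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], 1.0 - cosine(v, vectors[nxt]))
    return chosen

def relative_representation(vec, anchors):
    """One dimension per anchor: similarity of `vec` to that anchor."""
    return [cosine(vec, a) for a in anchors]

# Toy "encoder outputs" for four corpus texts (assumed, for illustration only).
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
anchor_ids = farthest_point_sampling(corpus, k=3)
anchors = [corpus[i] for i in anchor_ids]
embedding = relative_representation([0.0, 2.0], anchors)
```

Each coordinate of `embedding` can be read as "how related is this text to anchor i", which is what makes the dimensions traceable; the near-duplicate vector at index 1 is skipped because it adds little coverage.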

Chunyu Ye,Shaonan Wang

Main category: cs.CL

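TL;DR: Proposes a multimodal framework that uses visual-language models (VLMs) to reconstruct language from brain activity elicited by visual, auditory, and textual inputs.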

Details

Motivation: Human thought is inherently multimodal, yet existing studies are mostly limited to single-modality inputs, so a more flexible approach is needed. Method: Visual-language models (VLMs) are combined with modality-specific experts to jointly interpret multimodal information. Result: Experiments show performance comparable to state-of-the-art systems, with greater adaptability and extensibility. Conclusion: The work advances more ecologically valid and generalizable mind decoding. Abstract: Decoding thoughts from brain activity offers valuable insights into human cognition and enables promising applications in brain-computer interaction. While prior studies have explored language reconstruction from fMRI data, they are typically limited to single-modality inputs such as images or audio. In contrast, human thought is inherently multimodal. To bridge this gap, we propose a unified and flexible framework for reconstructing coherent language from brain recordings elicited by diverse input modalities: visual, auditory, and textual. Our approach leverages visual-language models (VLMs), using modality-specific experts to jointly interpret information across modalities. Experiments demonstrate that our method achieves performance comparable to state-of-the-art systems while remaining adaptable and extensible. This work advances toward more ecologically valid and generalizable mind decoding.

[93] Multi-domain Multilingual Sentiment Analysis in Industry: Predicting Aspect-based Opinion Quadruples

Benjamin White,Anastasia Shimorina

Main category: cs.CL

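TL;DR: This paper explores the design of an LLM-based aspect-based sentiment analysis system, focusing on opinion quadruple extraction, and shows that a single model can serve multiple domains and languages.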

Details

Motivation: The goal is to test whether a single fine-tuned model can handle multiple domain-specific taxonomies at once while reducing operational complexity. Method: Using internal datasets, the authors build a combined multi-domain model and compare it with specialized single-domain models. Result: The combined multi-domain model performs comparably to specialized single-domain models while reducing operational complexity. Conclusion: The study shares lessons on handling non-extractive predictions and evaluating the failure modes of LLM-based systems, offering practical guidance for structured prediction tasks. Abstract: This paper explores the design of an aspect-based sentiment analysis system using large language models (LLMs) for real-world use. We focus on quadruple opinion extraction -- identifying aspect categories, sentiment polarity, targets, and opinion expressions from text data across different domains and languages. Using internal datasets, we investigate whether a single fine-tuned model can effectively handle multiple domain-specific taxonomies simultaneously. We demonstrate that a combined multi-domain model achieves performance comparable to specialized single-domain models while reducing operational complexity. We also share lessons learned for handling non-extractive predictions and evaluating various failure modes when developing LLM-based systems for structured prediction tasks.

[94] Rethinking Repetition Problems of LLMs in Code Generation

Yihong Dong,Yuchen Liu,Xue Jiang,Zhi Jin,Ge Li

Main category: cs.CL

TL;DR: The paper proposes RPG, a grammar-based decoding method that addresses structural repetition in code generation, and validates it experimentally.

Details

Motivation: Neural language models perform well at code generation, but repetition, especially structural repetition, persists and calls for a more efficient remedy. Method: RPG uses grammar rules to detect repetition and decays the likelihood of the critical tokens that contribute to it. Result: RPG substantially outperforms baselines on the CodeRepetEval, HumanEval, and MBPP benchmarks, reducing repetition and improving code quality. Conclusion: RPG is an efficient decoding method that markedly alleviates structural repetition in code generation. Abstract: With the advent of neural language models, the performance of code generation has been significantly boosted. However, the problem of repetitions during the generation process continues to linger. Previous work has primarily focused on content repetition, which is merely a fraction of the broader repetition problem in code generation. A more prevalent and challenging problem is structural repetition. In structural repetition, the repeated code appears in various patterns but possesses a fixed structure, which can be inherently reflected in grammar. In this paper, we formally define structural repetition and propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar, to alleviate the repetition problems in code generation for LLMs. Specifically, RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions, thereby mitigating them in code generation. To facilitate this study, we construct a new dataset, CodeRepetEval, to comprehensively evaluate approaches for mitigating the repetition problems in code generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on the CodeRepetEval dataset as well as the HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of generated code.

[95] Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation

Yue Guo,Jae Ho Sohn,Gondy Leroy,Trevor Cohen

Main category: cs.CL

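TL;DR: LLM-generated plain language summaries (PLSs) are hard to distinguish from human-written ones in subjective ratings, but human-written PLSs lead to better comprehension. Automated evaluation metrics do not agree with human judgment.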

Details

Motivation: The effectiveness of LLM-generated PLSs is unclear and existing evaluation methods fall short, in particular because they do not measure understandability directly. Method: A large-scale crowdsourced evaluation (150 participants) combines subjective ratings (simplicity, informativeness, coherence, faithfulness) with objective comprehension tests (multiple-choice questions). Result: LLM-generated PLSs do not differ significantly from human-written PLSs in subjective ratings, but human-written PLSs perform better in comprehension tests. Automated evaluation metrics do not agree with human judgment. Conclusion: Evaluation frameworks need to go beyond surface-level quality, and generation methods should be optimized for layperson comprehension. Abstract: Plain language summaries (PLSs) are essential for facilitating effective communication between clinicians and patients by making complex medical information easier for laypeople to understand and act upon. Large language models (LLMs) have recently shown promise in automating PLS generation, but their effectiveness in supporting health information comprehension remains unclear. Prior evaluations have generally relied on automated scores that do not measure understandability directly, or subjective Likert-scale ratings from convenience samples with limited generalizability. To address these gaps, we conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using Amazon Mechanical Turk with 150 participants. We assessed PLS quality through subjective Likert-scale ratings focusing on simplicity, informativeness, coherence, and faithfulness; and objective multiple-choice comprehension and recall measures of reader understanding. Additionally, we examined the alignment between 10 automated evaluation metrics and human judgments. Our findings indicate that while LLMs can generate PLSs that appear indistinguishable from human-written ones in subjective evaluations, human-written PLSs lead to significantly better comprehension. Furthermore, automated evaluation metrics fail to reflect human judgment, calling into question their suitability for evaluating PLSs. This is the first study to systematically evaluate LLM-generated PLSs based on both reader preferences and comprehension outcomes. Our findings highlight the need for evaluation frameworks that move beyond surface-level quality and for generation methods that explicitly optimize for layperson comprehension.

[96] Hierarchical Document Refinement for Long-context Retrieval-augmented Generation

Jiajie Jin,Xiaoxi Li,Guanting Dong,Yuyao Zhang,Yutao Zhu,Yongkang Wu,Zhonghua Li,Qi Ye,Zhicheng Dou

Main category: cs.CL

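TL;DR: LongRefiner is an efficient plug-and-play refiner that tackles redundancy and noise in long-context RAG applications through dual-level query analysis, hierarchical document structuring, and adaptive refinement, substantially reducing computational cost and latency.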

Details

Motivation: Redundant information and noise in long-context RAG applications raise inference costs and degrade performance. Method: LongRefiner combines dual-level query analysis, hierarchical document structuring, and adaptive refinement via multi-task learning, all on a single foundation model. Result: It performs competitively on seven QA datasets with 10x lower computational cost and latency than the best baseline. Conclusion: LongRefiner is scalable, efficient, and effective, offering a practical solution for long-text RAG applications. Abstract: Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise result in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios while incurring 10x lower computational cost and latency than the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at https://github.com/ignorejjj/LongRefiner.

[97] Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

Zemin Huang,Zhiyang Chen,Zijun Wang,Tiancheng Li,Guo-Jun Qi

Main category: cs.CL

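TL;DR: DCoLT is a reasoning framework for diffusion language models that treats the intermediate steps of the reverse diffusion process as latent "thinking" actions and optimizes the entire reasoning trajectory with outcome-based reinforcement learning.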

Details

Motivation: Traditional Chain-of-Thought methods are restricted to causal, linear thinking; DCoLT aims for bidirectional, non-linear reasoning without grammatical constraints on intermediate steps. Method: DCoLT is implemented on two diffusion language models, SEDD and LLaDA, optimizing the RL reward through a probabilistic policy and a ranking-based unmasking policy, respectively. Result: DCoLT-reinforced models outperform other DLMs trained by SFT, RL, or both on math and code generation tasks, and LLaDA's accuracy improves markedly on several tasks. Conclusion: By optimizing the reasoning trajectory, DCoLT substantially improves the reasoning ability of diffusion language models. Abstract: We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, where its concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model LLaDA, and find that the order in which tokens are predicted and unmasked plays an essential role in optimizing its RL action, which results from the ranking-based Unmasking Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, and +19.5% on GSM8K, MATH, MBPP, and HumanEval, respectively.
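
For readers unfamiliar with the Plackett-Luce model mentioned in the abstract, the following sketch shows how a ranking over positions can be sampled and scored. It illustrates only the generic model, not the paper's Unmasking Policy Module; the per-position `scores` and the function names are hypothetical.

```python
import math
import random

def sample_plackett_luce(scores, rng):
    """Sample a full ordering: repeatedly draw one remaining item with
    probability proportional to exp(score)."""
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        order.append(pick)
        remaining.remove(pick)
    return order

def log_prob_plackett_luce(scores, order):
    """Log-probability of `order`: a sum of softmax terms over the shrinking pool."""
    remaining = list(order)
    total = 0.0
    for item in order:
        log_norm = math.log(sum(math.exp(scores[i]) for i in remaining))
        total += scores[item] - log_norm
        remaining.remove(item)
    return total

rng = random.Random(0)
scores = [2.0, 0.5, -1.0]  # hypothetical scores for three masked positions
order = sample_plackett_luce(scores, rng)
log_p = log_prob_plackett_luce(scores, order)
```

Because the log-probability of a sampled ordering is available in closed form, an ordering can be treated as an action whose likelihood is differentiable with respect to the scores, which is the property a policy-gradient method needs.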

[98] CL-RAG: Bridging the Gap in Retrieval-Augmented Generation with Curriculum Learning

Shaohan Wang,Licheng Zhang,Zheren Fu,Zhendong Mao

Main category: cs.CL

TL;DR: The CL-RAG framework trains RAG systems with multi-stage curriculum learning, yielding performance gains of 2% to 4%.

Details

Motivation: Existing RAG methods use the retrieved documents directly, but document quality varies widely, which hampers training. Inspired by human cognitive learning, the paper proposes a curriculum learning framework. Method: Training data is constructed at multiple difficulty levels, and the retriever and generator are trained in stages. Result: Performance improves by 2% to 4% on four open-domain QA datasets. Conclusion: CL-RAG effectively improves the performance and generalization of RAG systems. Abstract: Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods focus on optimizing the retriever or generator in the RAG system by directly utilizing the top-k retrieved documents. However, the effectiveness of documents varies significantly across user queries, i.e., some documents provide valuable knowledge while others totally lack critical information. This hinders the adaptation of the retriever and generator during training. Inspired by human cognitive learning, curriculum learning trains models using samples progressing from easy to difficult, thus enhancing their generalization ability, and we integrate this effective paradigm into the training of the RAG system. In this paper, we propose a multi-stage Curriculum Learning based RAG system training framework, named CL-RAG. We first construct training data with multiple difficulty levels for the retriever and generator separately through sample evolution. Then, we train the model in stages based on the curriculum learning approach, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our CL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.
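
The easy-to-difficult staging described in the abstract can be pictured with a minimal sketch. The `difficulty` field, the sample ids, and the disjoint-stage split are assumptions made for illustration; the paper's sample-evolution procedure and its actual staging rule are not reproduced here.

```python
def curriculum_stages(samples, num_stages):
    """Sort samples by difficulty and split them into up to `num_stages`
    consecutive stages, easiest first."""
    ordered = sorted(samples, key=lambda s: s["difficulty"])
    size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

# Hypothetical training samples with an assumed scalar difficulty score.
samples = [
    {"id": "q1", "difficulty": 0.9},
    {"id": "q2", "difficulty": 0.1},
    {"id": "q3", "difficulty": 0.5},
    {"id": "q4", "difficulty": 0.3},
    {"id": "q5", "difficulty": 0.7},
]
stages = curriculum_stages(samples, num_stages=3)
stage_ids = [[s["id"] for s in stage] for stage in stages]
```

A training loop would then iterate over `stages` in order, fitting the retriever or generator on each stage before moving to the next.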

[99] Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective

Yutao Mou,Xiao Deng,Yuxiao Luo,Shikun Zhang,Wei Ye

Main category: cs.CL

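TL;DR: Proposes the CoV-Eval multi-task benchmark and the VC-Judge judgment model for comprehensively evaluating LLM code security, and finds that LLMs still fall short at generating secure code and repairing vulnerabilities.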

Details

Motivation: Existing code security benchmarks focus on a single task and lack multi-dimensional evaluation, so a comprehensive measure of LLM code security is needed. Method: The paper proposes the CoV-Eval multi-task benchmark and the VC-Judge judgment model, and evaluates 20 LLMs. Result: LLMs identify vulnerable code fairly well but perform poorly at generating secure code and repairing vulnerabilities. Conclusion: The study reveals key challenges in LLM code security and points to directions for future optimization. Abstract: Code security and usability are both essential for various coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on a single evaluation task and paradigm, such as code completion and generation, lacking comprehensive assessment across dimensions like secure code generation, vulnerability repair and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification, for comprehensive evaluation of LLM code security. In addition, we developed VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities in a more efficient and reliable way. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle with recognizing specific vulnerability types and performing repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.

[100] The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

Benedikt Ebing,Goran Glavaš

Main category: cs.CL

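TL;DR: The paper studies label projection in translation-based cross-lingual transfer (XLT), compares word aligners (WAs) with marker-based methods, and proposes a new ensembling strategy.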

Details

Motivation: Label projection is a key step in cross-lingual transfer for token classification, yet the design choices behind existing methods (word aligners and marker-based methods) have not been studied systematically. Method: The paper systematically examines low-level design choices for word aligners in label projection (the projection algorithm, filtering strategies, and pre-tokenization) and proposes a new strategy that ensembles translate-train and translate-test predictions. Result: With optimized choices, word aligners perform on par with marker-based methods, while the new ensembling strategy substantially outperforms marker-based projection and reduces sensitivity to design choices. Conclusion: With optimized design choices and ensembling, translation-based cross-lingual transfer for token classification becomes both stronger and more robust. Abstract: Translation-based strategies for cross-lingual transfer (XLT) such as translate-train -- training on noisy target language data translated from the source language -- and translate-test -- evaluating on noisy source language data translated from the target language -- are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects of low-level design decisions on token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies to reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences. We find that all of these substantially impact translation-based XLT performance and show that, with optimized choices, XLT with WA offers performance at least comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms the marker-based projection. Crucially, we show that our proposed ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.
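
Label projection with a word aligner can be sketched in a few lines. This is a deliberately naive variant, assuming the alignments are already given as (source index, target index) pairs and using plain labels without BIO prefixes; it does not implement any of the projection algorithms or filtering strategies compared in the paper, and `project_labels` is a hypothetical helper name.

```python
def project_labels(src_labels, alignments, tgt_len, outside="O"):
    """Copy each labeled source token's label onto the target tokens aligned to it.
    If several source tokens map to one target token, the first label wins."""
    tgt_labels = [outside] * tgt_len
    for src_idx, tgt_idx in sorted(alignments):
        label = src_labels[src_idx]
        if label != outside and tgt_labels[tgt_idx] == outside:
            tgt_labels[tgt_idx] = label
    return tgt_labels

# Source: "She visited New York"  ->  Target: "Sie hat New York besucht"
src_labels = ["O", "O", "LOC", "LOC"]
# Assumed word-aligner output; "visited" aligns to both "hat" and "besucht".
alignments = [(0, 0), (1, 1), (1, 4), (2, 2), (3, 3)]
projected = project_labels(src_labels, alignments, tgt_len=5)
```

Even this toy version exposes the design questions the abstract raises: how to resolve conflicting alignments, whether to drop spans that are only partially aligned, and how the target sentence is tokenized before alignment.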

[101] Multi-Token Prediction Needs Registers

Anastasios Gerontopoulos,Spyros Gidaris,Nikos Komodakis

Main category: cs.CL

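TL;DR: MuToR is a simple and effective multi-token prediction method that inserts learnable register tokens to predict future targets; it is compatible with existing pretrained language models and adds very few parameters.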

Details

Motivation: Multi-token prediction works well in language model pretraining, but its benefits are inconsistent in settings such as fine-tuning; MuToR is proposed to address this. Method: MuToR interleaves learnable register tokens into the input sequence, each tasked with predicting a future target, with no architectural changes and very few extra parameters. Result: MuToR performs well on generative tasks in both language and vision domains and applies to supervised fine-tuning, parameter-efficient fine-tuning, and pretraining. Conclusion: MuToR is an efficient and highly compatible multi-token prediction method suited to a range of settings. Abstract: Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
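
To make "interleaving register tokens" concrete, here is a data-preparation sketch. The placement rule (`every`), the prediction `offset`, and the `<reg>` symbol are all assumptions chosen for illustration, and the sketch ignores attention masking and positional handling; it should not be read as the MuToR recipe.

```python
def interleave_registers(tokens, offset=2, every=2, register="<reg>"):
    """Return (input_symbol, target) pairs. Regular tokens keep the usual
    next-token target; a register inserted after position t is asked to
    predict the token `offset` positions ahead of t (None past the end)."""
    pairs = []
    for t, tok in enumerate(tokens):
        next_tok = tokens[t + 1] if t + 1 < len(tokens) else None
        pairs.append((tok, next_tok))
        if (t + 1) % every == 0:
            future = tokens[t + offset] if t + offset < len(tokens) else None
            pairs.append((register, future))
    return pairs

pairs = interleave_registers(["a", "b", "c", "d", "e"], offset=2, every=2)
```

In this toy run the register placed after "b" targets "d", two positions ahead, while every ordinary token still carries its standard next-token target, so the base objective is left untouched.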

[102] WorldPM: Scaling Human Preference Modeling

Binghai Wang,Runji Lin,Keming Lu,Le Yu,Zhenru Zhang,Fei Huang,Chujie Zheng,Kai Dang,Yang Fan,Xingzhang Ren,An Yang,Binyuan Hui,Dayiheng Liu,Tao Gui,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Bowen Yu,Jingren Zhou,Junyang Lin

Main category: cs.CL

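TL;DR: The paper finds that preference modeling follows scaling laws similar to those of language modeling, proposes WorldPM, and validates its generalization across diverse tasks.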

Details

Motivation: Inspired by the power-law relationship between test loss and model and data size in language modeling, the paper looks for similar laws in preference modeling. Method: Preference data is collected from public forums and used for large-scale training on models from 1.5B to 72B parameters, analyzing how different evaluation metrics scale. Result: Adversarial and objective metrics improve with scale, while subjective metrics show no such trend; WorldPM improves generalization by more than 5% on many key subtasks. Conclusion: WorldPM works well as a foundation for preference fine-tuning and significantly improves evaluation results in an RLHF pipeline. Abstract: Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling. We propose World Preference Modeling (WorldPM) to emphasize this scaling potential, where World Preference embodies a unified representation of human preferences. In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters. We observe distinct patterns across different evaluation metrics: (1) Adversarial metrics (ability to identify deceptive features) consistently scale up with increased training data and base model size; (2) Objective metrics (objective knowledge with well-defined answers) show emergent behavior in larger language models, highlighting WorldPM's scalability potential; (3) Subjective metrics (subjective preferences from a limited number of humans or AI) do not demonstrate scaling trends. Further experiments validate the effectiveness of WorldPM as a foundation for preference fine-tuning. Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we observe significant improvements on both in-house and public evaluation sets, with notable gains of 4% to 8% in our in-house evaluations.

[103] Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

Zhiyuan Hu,Yibo Wang,Hanze Dong,Yuhui Xu,Amrita Saha,Caiming Xiong,Bryan Hooi,Junnan Li

Main category: cs.CL

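TL;DR: The paper proposes explicitly aligning models with three meta-abilities (deduction, induction, and abduction) to improve the scalability and reliability of large reasoning models.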

Details

Motivation: Reasoning behaviors such as self-correction and backtracking do emerge in large reasoning models, but they are unpredictable and uncontrollable, which limits scalability and reliability. Method: A three-stage pipeline of individual alignment, parameter-space merging, and domain-specific reinforcement learning aligns the meta-abilities using automatically generated, self-verifiable tasks. Result: Performance improves by over 10% relative to baselines, and domain-specific RL adds a further 2% average gain. Conclusion: Explicit meta-ability alignment provides a scalable and dependable foundation for reasoning. Abstract: Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment

cs.IR

[104] A Survey on Large Language Models in Multimodal Recommender Systems

Alejo Lopez-Avila,Jinhua Du

Main category: cs.IR

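TL;DR: This survey reviews the use of large language models (LLMs) in multimodal recommender systems (MRS), discusses their strengths and challenges, and proposes a new taxonomy and future research directions.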

Details

Motivation: LLMs bring new capabilities to MRS, including semantic reasoning, in-context learning, and dynamic input handling, but their scalability and model accessibility also pose challenges that call for systematic study. Method: The survey reviews recent work, proposes a taxonomy, summarizes prompting strategies, fine-tuning methods, and data adaptation techniques, and analyzes evaluation metrics and datasets. Result: LLMs show flexibility and generalization in MRS, but scalability and accessibility issues remain to be solved. Conclusion: LLMs hold promise for multimodal recommendation, and future research should focus on technical integration and practical application. Abstract: Multimodal recommender systems (MRS) integrate heterogeneous user and item data, such as text, images, and structured information, to enhance recommendation performance. The emergence of large language models (LLMs) introduces new opportunities for MRS by enabling semantic reasoning, in-context learning, and dynamic input handling. Compared to earlier pre-trained language models (PLMs), LLMs offer greater flexibility and generalisation capabilities but also introduce challenges related to scalability and model accessibility. This survey presents a comprehensive review of recent work at the intersection of LLMs and MRS, focusing on prompting strategies, fine-tuning methods, and data adaptation techniques. We propose a novel taxonomy to characterise integration patterns, identify transferable techniques from related recommendation domains, provide an overview of evaluation metrics and datasets, and point to possible future directions. We aim to clarify the emerging role of LLMs in multimodal recommendation and support future research in this rapidly evolving field.

cs.LG

[105] Predictability Shapes Adaptation: An Evolutionary Perspective on Modes of Learning in Transformers

Alexander Y. Ku,Thomas L. Griffiths,Stephanie C. Y. Chan

Main category: cs.LG

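TL;DR: The paper studies two learning modes of Transformers (IWL and ICL) and, by analogy with genetic encoding and phenotypic plasticity in evolutionary biology, examines how environmental predictability shifts the balance between them. Experiments show that high stability favors IWL, while high cue reliability strengthens ICL. Learning dynamics also reveal task-dependent temporal evolution.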

Details

Motivation: To understand how IWL and ICL interact in Transformers, the paper borrows strategies from evolutionary biology and asks how environmental predictability affects the balance between the two learning modes. Method: Using regression and classification tasks, the authors experimentally operationalize dimensions of environmental predictability and systematically study their effect on the IWL/ICL balance. Result: High environmental stability strongly favors IWL, and high cue reliability strengthens ICL. Learning dynamics show task-dependent temporal evolution, such as an ICL-to-IWL shift or an initial IWL phase followed by ICL dominance. Conclusion: Predictability is a key factor governing the balance of learning modes in Transformers; the findings support a relative-cost hypothesis and offer new perspectives for understanding ICL and guiding training methods. Abstract: Transformer models learn in two distinct modes: in-weights learning (IWL), encoding knowledge into model weights, and in-context learning (ICL), adapting flexibly to context without weight modification. To better understand the interplay between these learning modes, we draw inspiration from evolutionary biology's analogous adaptive strategies: genetic encoding (akin to IWL, adapting over generations and fixed within an individual's lifetime) and phenotypic plasticity (akin to ICL, enabling flexible behavioral responses to environmental cues). In evolutionary biology, environmental predictability dictates the balance between these strategies: stability favors genetic encoding, while reliable predictive cues promote phenotypic plasticity. We experimentally operationalize these dimensions of predictability and systematically investigate their influence on the ICL/IWL balance in Transformers. Using regression and classification tasks, we show that high environmental stability decisively favors IWL, as predicted, with a sharp transition at maximal stability. Conversely, high cue reliability enhances ICL efficacy, particularly when stability is low. Furthermore, learning dynamics reveal task-contingent temporal evolution: while a canonical ICL-to-IWL shift occurs in some settings (e.g., classification with many classes), we demonstrate that scenarios with easier IWL (e.g., fewer classes) or slower ICL acquisition (e.g., regression) can exhibit an initial IWL phase later yielding to ICL dominance. These findings support a relative-cost hypothesis for explaining these learning mode transitions, establishing predictability as a critical factor governing adaptive strategies in Transformers, and offering novel insights for understanding ICL and guiding training methodologies.

[106] Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Tasks

Ziyuan Zhang,Darcy Wang,Ningyuan Chen,Rodrigo Mansur,Vahid Sarhangian

Main category: cs.LG

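TL;DR: The study compares the exploration-exploitation strategies of large language models (LLMs) and humans in multi-armed bandit tasks, finding that reasoning brings LLMs closer to human behavior but that they adapt poorly in complex environments.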

Details

Motivation: To examine whether LLMs show human-like exploration-exploitation behavior in dynamic decision-making tasks, and to assess their performance. Method: Multi-armed bandit tasks and interpretable choice models are used to compare the decision strategies of LLMs, humans, and algorithms, and to analyze how reasoning affects LLM behavior. Result: Reasoning leads LLMs to show human-like random and directed exploration in simple tasks, but they adapt poorly in complex environments. Conclusion: LLMs show promise for simulating human behavior and automating decisions, but their adaptability in complex environments still needs improvement. Abstract: Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making tasks. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) tasks introduced in the cognitive science and psychiatry literature to conduct a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the E&E strategies of the agents and investigate how explicit reasoning, through both prompting strategies and reasoning-enhanced models, shapes LLM decision-making. We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration. In simple stationary tasks, reasoning-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making and point to potential areas for improvement.
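
The difference between random and directed exploration can be illustrated with a simple softmax choice rule. This is a generic sketch, not the choice models fitted in the paper: the functional form and the parameter names `temperature` (more noise means more random exploration) and `bonus` (extra value for uncertain arms means more directed exploration) are assumptions.

```python
import math

def choice_probabilities(means, uncertainties, temperature=1.0, bonus=0.0):
    """Softmax over (estimated mean + bonus * uncertainty) / temperature."""
    utilities = [(m + bonus * u) / temperature
                 for m, u in zip(means, uncertainties)]
    top = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in utilities]
    total = sum(exps)
    return [e / total for e in exps]

means = [1.0, 0.5]          # arm 0 currently looks better
uncertainties = [0.1, 1.0]  # arm 1 has been sampled less

p_cold = choice_probabilities(means, uncertainties, temperature=0.1)
p_hot = choice_probabilities(means, uncertainties, temperature=1.0)
p_directed = choice_probabilities(means, uncertainties, temperature=0.1, bonus=1.0)
```

Raising the temperature spreads choices across arms regardless of what is known about them, whereas a positive uncertainty bonus specifically steers choices toward the less-explored arm; fitting such parameters to choice data is one way to compare agents' strategies.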

[107] Advanced Crash Causation Analysis for Freeway Safety: A Large Language Model Approach to Identifying Key Contributing Factors

Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Samgyu Yang,Abdulrahman Faden

Main category: cs.LG

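TL;DR: The study uses a large language model (LLM) to analyze freeway crash data, identifying crash causes through zero-shot classification and demonstrating the usefulness of LLMs for traffic safety analysis.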

Details

Motivation: Traditional statistical methods and machine learning models struggle to capture the interactions among the complex factors behind crashes, so the study explores LLMs for crash causation analysis. Method: The Llama3 8B model is fine-tuned with QLoRA on a training dataset built from 226 freeway crash studies, then used to identify crash causes through zero-shot classification. Result: The LLM effectively identifies primary crash causes such as alcohol-impaired driving, speeding, aggressive driving, and driver inattention, and incorporating event data yields deeper analysis; agreement among researchers reached 88.89%. Conclusion: LLMs offer a new tool for crash causation analysis and can help in designing more effective traffic safety measures. Abstract: Understanding the factors contributing to traffic crashes and developing strategies to mitigate their severity is essential. Traditional statistical methods and machine learning models often struggle to capture the complex interactions between various factors and the unique characteristics of each crash. This research leverages a large language model (LLM) to analyze freeway crash data and provide crash causation analysis accordingly. By compiling 226 traffic safety studies related to freeway crashes, a training dataset encompassing environmental, driver, traffic, and geometric design factors was created. The Llama3 8B model was fine-tuned using QLoRA to enhance its understanding of freeway crashes and their contributing factors, as covered in these studies. The fine-tuned Llama3 8B model was then used to identify crash causation without pre-labeled data through zero-shot classification, providing comprehensive explanations to ensure that the identified causes were reasonable and aligned with existing research. Results demonstrate that LLMs effectively identify primary crash causes such as alcohol-impaired driving, speeding, aggressive driving, and driver inattention. Incorporating event data, such as road maintenance, offers more profound insights. The model's practical applicability and potential to improve traffic safety measures were validated by a high level of agreement among researchers in the field of traffic safety, as reflected in questionnaire results showing 88.89% agreement. This research highlights the complex nature of traffic crashes and how LLMs can be used for comprehensive analysis of crash causation and other contributing factors. Moreover, it provides valuable insights and potential countermeasures to aid planners and policymakers in developing more effective and efficient traffic safety practices.

[108] Learning Virtual Machine Scheduling in Cloud Computing through Language Agents

JieHao Wu,Ziwei Wang,Junjie Sheng,Wenhao Li,Xiangfei Wang,Jun Luo

Main category: cs.LG

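TL;DR: The paper proposes MiCo, a hierarchical language agent framework that uses large language models (LLMs) to design heuristics for the Online Dynamic Multidimensional Bin Packing (ODMBP) problem in cloud services.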

Details

Motivation: Traditional optimization methods struggle to adapt to real-time changes, hand-designed heuristics are rigid, and existing learning-based methods lack generalizability and interpretability. Method: Formulates ODMBP as a Semi-Markov Decision Process with Options (SMDP-Option) and uses a two-stage architecture (Option Miner and Option Composer) in which LLMs generate the strategies. Result: On real-world datasets involving more than 10,000 virtual machines, MiCo achieves a 96.9% competitive ratio and performs well under nonstationary request flows and diverse configurations. Conclusion: MiCo is efficient and adaptable in complex, large-scale cloud environments. Abstract: In cloud services, virtual machine (VM) scheduling is a typical Online Dynamic Multidimensional Bin Packing (ODMBP) problem, characterized by large-scale complexity and fluctuating demands. Traditional optimization methods struggle to adapt to real-time changes, domain-expert-designed heuristic approaches suffer from rigid strategies, and existing learning-based methods often lack generalizability and interpretability. To address these limitations, this paper proposes a hierarchical language agent framework named MiCo, which provides a large language model (LLM)-driven heuristic design paradigm for solving ODMBP. Specifically, ODMBP is formulated as a Semi-Markov Decision Process with Options (SMDP-Option), enabling dynamic scheduling through a two-stage architecture, i.e., Option Miner and Option Composer. Option Miner utilizes LLMs to discover diverse and useful non-context-aware strategies by interacting with constructed environments. Option Composer employs LLMs to discover a composing strategy that integrates the non-context-aware strategies with the contextual ones. Extensive experiments on real-world enterprise datasets demonstrate that MiCo achieves a 96.9% competitive ratio in large-scale scenarios involving more than 10,000 virtual machines. It maintains high performance even under nonstationary request flows and diverse configurations, thus validating its effectiveness in complex and large-scale cloud environments.
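
To make the ODMBP setting concrete, the following is a minimal sketch of the kind of fixed, hand-written heuristic that the abstract contrasts with MiCo: first-fit placement of VM requests onto hosts with two resource dimensions. It is not MiCo's method. The `Host` class, the `first_fit` function, and the capacity figures are hypothetical names and values chosen for illustration, and VM departures (the "dynamic" part of ODMBP) are omitted for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Host:
    """A physical machine with one capacity per resource dimension (e.g. CPU, memory)."""
    capacity: tuple[float, ...]
    used: list[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.used:
            self.used = [0.0] * len(self.capacity)

    def fits(self, demand: tuple[float, ...]) -> bool:
        # A request fits only if every dimension stays within capacity.
        return all(u + d <= c for u, d, c in zip(self.used, demand, self.capacity))

    def place(self, demand: tuple[float, ...]) -> None:
        self.used = [u + d for u, d in zip(self.used, demand)]


def first_fit(hosts: list[Host], demand: tuple[float, ...]) -> Optional[int]:
    """Place a VM on the first host with room in every dimension; return its index."""
    for index, host in enumerate(hosts):
        if host.fits(demand):
            host.place(demand)
            return index
    return None  # no host can take this request


# Hypothetical cluster: two hosts with (CPU cores, GiB of RAM).
hosts = [Host(capacity=(8.0, 32.0)) for _ in range(2)]
requests = [(4.0, 8.0), (4.0, 16.0), (2.0, 16.0), (6.0, 8.0)]
placements = [first_fit(hosts, vm) for vm in requests]
print(placements)  # [0, 0, 1, 1]
```

Requests arrive one at a time and each decision is final, which is what makes the problem online; a rule like first-fit never adapts to the request stream, which is the rigidity the abstract attributes to expert-designed heuristics.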

[109] ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention

Jintian Shao,Hongyi Huang,Jiayi Wu,Beiwen Zhang,ZhiYu Wu,You Shan,MingKai Zheng

Main category: cs.LG

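TL;DR: ComplexFormer proposes a new Complex Multi-Head Attention (CMHA) mechanism that models semantic and positional information jointly in the complex plane, significantly improving Transformer performance.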

Details

Motivation: Conventional Transformers struggle to integrate positional information while keeping multi-head attention flexible, which limits their representational capacity. Method: Introduces CMHA, comprising a per-head Euler transformation and a per-head adaptive differential rotation mechanism, to model semantic and positional differences in a unified way. Result: Performs strongly on language modeling, text generation, and other tasks, with significantly lower generation perplexity and better long-context coherence. Conclusion: ComplexFormer offers a more flexible and efficient attention mechanism with stronger parameter efficiency and expressiveness. Abstract: Transformer models rely on self-attention to capture token dependencies but face challenges in effectively integrating positional information while allowing multi-head attention (MHA) flexibility. Prior methods often model semantic and positional differences disparately or apply uniform positional adjustments across heads, potentially limiting representational capacity. This paper introduces ComplexFormer, featuring Complex Multi-Head Attention (CMHA). CMHA empowers each head to independently model semantic and positional differences unified within the complex plane, representing interactions as rotations and scaling. ComplexFormer incorporates two key improvements: (1) a per-head Euler transformation, converting real-valued query/key projections into polar-form complex vectors for head-specific complex subspace operation; and (2) a per-head adaptive differential rotation mechanism, exp[i(Adapt(ASmn,i) + Delta(Pmn),i)], allowing each head to learn distinct strategies for integrating semantic angle differences (ASmn,i) with relative positional encodings (Delta(Pmn),i). Extensive experiments on language modeling, text generation, code generation, and mathematical reasoning show ComplexFormer achieves superior performance, significantly lower generation perplexity, and improved long-context coherence compared to strong baselines like RoPE-Transformers. ComplexFormer demonstrates strong parameter efficiency, offering a more expressive, adaptable attention mechanism.

[110] Superposition Yields Robust Neural Scaling

Yizhou Liu,Ziming Liu,Jeff Gore

Main category: cs.LG

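TL;DR: The paper builds a toy model to study how the loss of large language models (LLMs) scales with model size, and finds that representation superposition is an important mechanism behind neural scaling laws.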

Details

Motivation: To investigate the origin of the neural scaling law by which LLM performance improves with model size. Method: Builds a toy model from two empirical principles, analyzes how loss depends on model size, and validates the findings on families of open-source LLMs. Result: Under weak superposition the loss scaling depends on feature frequency; under strong superposition the loss is inversely proportional to model dimension; open-source LLMs match the strong-superposition prediction. Conclusion: Representation superposition is a key mechanism behind neural scaling laws, and the insight may guide training strategies and architectures toward better performance. Abstract: The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law -- the finding that loss decreases as a power law with model size -- remains unclear. Starting from two empirical principles -- that LLMs represent more things than the model dimensions (widths) they have (i.e., representations are superposed), and that words or concepts in language occur with varying frequencies -- we constructed a toy model to study the loss scaling with model size. We found that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency; if feature frequencies follow a power law, so does the loss. In contrast, under strong superposition, where all features are represented but overlap with each other, the loss becomes inversely proportional to the model dimension across a wide range of feature frequency distributions. This robust scaling behavior is explained geometrically: when many more vectors are packed into a lower dimensional space, the interference (squared overlaps) between vectors scales inversely with that dimension. We then analyzed four families of open-sourced LLMs and found that they exhibit strong superposition and quantitatively match the predictions of our toy model. The Chinchilla scaling law turned out to also agree with our results. We conclude that representation superposition is an important mechanism underlying the observed neural scaling laws. We anticipate that these insights will inspire new training strategies and model architectures to achieve better performance with less computation and fewer parameters.
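
The geometric claim in the abstract can be checked numerically in a simplified setting. The sketch below assumes random unit vectors rather than the learned feature directions studied in the paper, and measures the mean squared overlap between pairs; for random directions in dimension d the expected value is 1/d. The function names and sample sizes are illustrative choices, not taken from the paper.

```python
import math
import random


def random_unit_vector(dim: int, rng: random.Random) -> list:
    """Sample a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_overlap(num_vectors: int, dim: int, seed: int = 0) -> float:
    """Average squared dot product over all distinct pairs of random unit vectors."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(num_vectors)]
    total, pairs = 0.0, 0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            total += dot * dot
            pairs += 1
    return total / pairs


# Pack 200 vectors into spaces of increasing dimension and compare with 1/dim.
for dim in (8, 32, 64):
    overlap = mean_squared_overlap(num_vectors=200, dim=dim)
    print(f"dim={dim:3d}  mean squared overlap={overlap:.4f}  1/dim={1 / dim:.4f}")
```

In each case there are more vectors than dimensions, so they cannot all be orthogonal, and the measured interference should track 1/dim closely. This only illustrates the geometric argument; it does not reproduce the toy model's loss.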

[111] Parallel Scaling Law for Language Models

Mouxiang Chen,Binyuan Hui,Zeyu Cui,Jiaxi Yang,Dayiheng Liu,Jianling Sun,Junyang Lin,Zhongxin Liu

Main category: cs.LG

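TL;DR: The paper proposes parallel scaling (ParScale), which improves efficiency by increasing a model's parallel computation rather than its parameters or output tokens, substantially reducing memory and latency overhead.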

Details

Motivation: Conventional ways of scaling language models (parameter scaling or inference-time scaling) usually carry a high space or time cost, so a more efficient scaling approach is needed. Method: Proposes parallel scaling (ParScale): apply diverse, learnable transformations to the input, run the model's forward passes in parallel, and dynamically aggregate the outputs. The method applies to any model structure, optimization procedure, or task. Result: For the same performance gain, ParScale needs up to 22 times less memory increase and 6 times less latency increase than parameter scaling, and a pre-trained model can be converted into a parallelly scaled one by post-training on a small number of tokens. Conclusion: ParScale opens a path to deploying stronger models in low-resource settings and offers a new perspective on the role of computation in machine learning. Abstract: It is commonly believed that scaling language models must incur a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce a third and more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small number of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.
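
The three steps named in the abstract (transform the input P ways, run one shared model on each copy, aggregate the P outputs with input-dependent weights) can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: additive shifts stand in for the learnable transformations, a softmax over a scalar `gate` score stands in for the dynamic aggregation, and `toy_model` is a hypothetical placeholder for a network. The streams run in a plain loop here; a real system would batch them.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parallel_forward(
    model: Callable[[Vector], Vector],
    x: Vector,
    shifts: Sequence[Vector],          # stand-ins for the P learnable input transformations
    gate: Callable[[Vector], float],   # stand-in for a learned aggregation scorer
) -> Vector:
    """Run one shared model on P transformed copies of x and mix the P outputs."""
    streams = [[xi + si for xi, si in zip(x, shift)] for shift in shifts]
    outputs = [model(stream) for stream in streams]    # same parameters for every stream
    weights = softmax([gate(out) for out in outputs])  # weights depend on the outputs
    width = len(outputs[0])
    return [sum(w * out[k] for w, out in zip(weights, outputs)) for k in range(width)]


def toy_model(v: Vector) -> Vector:
    """Hypothetical stand-in for a neural network."""
    return [math.tanh(2.0 * a) for a in v]


shifts = [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5], [1.0, 1.0]]  # P = 4 streams
y = parallel_forward(toy_model, [0.2, -0.1], shifts, gate=sum)
print(y)
```

The point of the design is that the parameter count stays that of `model` however large P becomes; only the amount of computation per input grows.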

Vibha Belavadi,Tushar Vatsa,Dewang Sultania,Suhas Suresha,Ishita Verma,Cheng Chen,Tracy Holloway King,Michael Friedrich

Main category: cs.LG

TL;DR: Proposes a router-based architecture for generating high-quality synthetic data to improve LLM fine-tuning for function calling tasks.

Details

Motivation: The lack of real user interaction data, together with privacy constraints, forces reliance on synthetic data, and existing synthetic data generation falls short. Method: Uses domain resources (such as content metadata and knowledge graphs) and text-to-text and vision-to-text language models, combined through a flexible routing mechanism, to generate synthetic data. Result: On real user queries, function classification accuracy and API parameter selection improve significantly. Conclusion: Models fine-tuned on the synthetic data outperform traditional approaches and set new benchmarks for function calling tasks. Abstract: This paper addresses fine-tuning Large Language Models (LLMs) for function calling tasks when real user interaction data is unavailable. In digital content creation tools, where users express their needs through natural language queries that must be mapped to API calls, the lack of real-world task-specific data and privacy constraints for training on it necessitate synthetic data generation. Existing approaches to synthetic data generation fall short in diversity and complexity, failing to replicate real-world data distributions and leading to suboptimal performance after LLM fine-tuning. We present a novel router-based architecture that leverages domain resources like content metadata and structured knowledge graphs, along with text-to-text and vision-to-text language models to generate high-quality synthetic training data. Our architecture's flexible routing mechanism enables synthetic data generation that matches observed real-world distributions, addressing a fundamental limitation of traditional approaches. Evaluation on a comprehensive set of real user queries demonstrates significant improvements in both function classification accuracy and API parameter selection. Models fine-tuned with our synthetic data consistently outperform traditional approaches, establishing new benchmarks for function calling tasks.

[113] MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models

Mugilan Ganesan,Shane Segal,Ankur Aggarwal,Nish Sinnadurai,Sean Lie,Vithursan Thangarasa

Main category: cs.LG

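TL;DR: MASSV uses a two-phase approach to turn small language models into effective multimodal draft models, significantly speeding up inference for vision-language models.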

Details

Motivation: Existing small language models cannot process visual inputs, and their predictions do not match those of models that see the visual context, which limits speculative decoding for vision-language models. Method: MASSV has two phases: (1) connect the target model's vision encoder to the draft model through a lightweight projector; (2) apply self-distilled visual instruction tuning using responses generated by the target model. Result: Experiments show MASSV increases accepted length by up to 30% and speeds up inference by up to 1.46x on visually grounded tasks. Conclusion: MASSV provides a scalable, architecture-compatible way to accelerate current and future vision-language models. Abstract: Speculative decoding significantly accelerates language model inference by enabling a lightweight draft model to propose multiple tokens that a larger target model verifies simultaneously. However, applying this technique to vision-language models (VLMs) presents two fundamental challenges: small language models that could serve as efficient drafters lack the architectural components to process visual inputs, and their token predictions fail to match those of VLM target models that consider visual context. We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach. MASSV first connects the target VLM's vision encoder to the draft model via a lightweight trainable projector, then applies self-distilled visual instruction tuning using responses generated by the target VLM to align token predictions. Comprehensive experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually-grounded tasks. MASSV provides a scalable, architecture-compatible method for accelerating both current and future VLMs.
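
The "accepted length" metric is easier to read with the draft-then-verify loop in view. The sketch below shows one step of speculative decoding in its simplest greedy form, where a drafted token is accepted only if it equals the target's own greedy choice; production systems typically use a probabilistic acceptance rule and verify all drafted positions in a single batched target pass. The `NextToken` interface and both toy models are hypothetical and unrelated to MASSV's models.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function (hypothetical interface)


def speculative_step(target: NextToken, draft: NextToken, prefix: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the longest prefix the target agrees with, then add one target token."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    accepted: List[int] = []
    for token in proposed:
        # This loop only mirrors the acceptance rule; it is not how verification is batched.
        if target(prefix + accepted) != token:
            break
        accepted.append(token)

    # The target always contributes one token: a correction or a bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def target_model(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10  # toy rule: count upward modulo 10


def draft_model(tokens: List[int]) -> int:
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 5 else nxt  # toy drafter that is wrong whenever the answer is 5


new_tokens = speculative_step(target_model, draft_model, prefix=[1], k=4)
print(new_tokens)  # [2, 3, 4, 5]: three drafted tokens accepted, then the target's correction
```

The output always matches what the target alone would have produced under greedy decoding; the speedup comes from how many drafted tokens survive per target call, which is why a drafter whose predictions track the target more closely raises the accepted length.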

[114] RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours

Rafael Pablos Sarabia,Joachim Nyborg,Morten Birk,Jeppe Liborius Sjørup,Anders Lillevang Vesterholt,Ira Assent

Main category: cs.LG

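TL;DR: Proposes a deep learning model for high-resolution probabilistic precipitation forecasting in Europe that integrates multiple data sources and offers efficient training and fast inference.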

Details

Motivation: To overcome the short forecast lead times of radar-only deep learning models and improve both the accuracy and the uncertainty quantification of precipitation forecasts. Method: Integrates radar, satellite, and numerical weather prediction data in a compact architecture that captures long-range interactions and supports efficient training and inference. Result: The model surpasses current operational numerical weather prediction systems and deep-learning nowcasting models in experiments, setting a new standard for high-resolution precipitation forecasting in Europe. Conclusion: The model balances accuracy, interpretability, and computational efficiency, offering a better solution for precipitation forecasting. Abstract: We present a deep learning model for high-resolution probabilistic precipitation forecasting over an 8-hour horizon in Europe, overcoming the limitations of radar-only deep learning models with short forecast lead times. Our model efficiently integrates multiple data sources - including radar, satellite, and physics-based numerical weather prediction (NWP) - while capturing long-range interactions, resulting in accurate forecasts with robust uncertainty quantification through consistent probabilistic maps. Featuring a compact architecture, it enables more efficient training and faster inference than existing models. Extensive experiments demonstrate that our model surpasses current operational NWP systems, extrapolation-based methods, and deep-learning nowcasting models, setting a new standard for high-resolution precipitation forecasting in Europe, ensuring a balance between accuracy, interpretability, and computational efficiency.

[115] PIF: Anomaly detection via preference embedding

Filippo Leveni,Luca Magri,Giacomo Boracchi,Cesare Alippi

Main category: cs.LG

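TL;DR: Proposes PIF, a new anomaly detection method that combines the strengths of adaptive isolation methods and preference embedding, using PI-Forest to compute anomaly scores in a high-dimensional space. Experiments show PIF outperforms existing techniques.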

Details

Motivation: To detect anomalies with respect to structured patterns by combining the strengths of adaptive isolation and preference embedding. Method: Proposes PIF, which embeds the data in a high-dimensional space and uses PI-Forest to compute anomaly scores. Result: On synthetic and real datasets, PIF compares favorably with existing anomaly detection techniques, and PI-Forest is better at measuring arbitrary distances and isolating points in the preference space. Conclusion: PIF is an effective anomaly detection method that combines adaptive isolation with preference embedding, and experiments confirm its advantage. Abstract: We address the problem of detecting anomalies with respect to structured patterns. To this end, we conceive a novel anomaly detection method called PIF, that combines the advantages of adaptive isolation methods with the flexibility of preference embedding. Specifically, we propose to embed the data in a high dimensional space where an efficient tree-based method, PI-Forest, is employed to compute an anomaly score. Experiments on synthetic and real datasets demonstrate that PIF compares favorably with state-of-the-art anomaly detection techniques, and confirm that PI-Forest is better at measuring arbitrary distances and isolating points in the preference space.

[116] SEAL: Searching Expandable Architectures for Incremental Learning

Matteo Gambella,Vicente Javier Castro Solar,Manuel Roveri

Main category: cs.LG

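TL;DR: SEAL is a NAS-based framework for data-incremental learning that dynamically adapts the model structure and expands it selectively, balancing new-task learning against retention of old knowledge while keeping the model small.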

Details

Motivation: To address the challenge of balancing new-task learning and old-knowledge retention in incremental learning, and to avoid the resource waste caused by existing methods that expand the model frequently. Method: SEAL adapts the model structure dynamically and expands it selectively (based on a capacity estimation metric), preserves stability through cross-distillation training, and uses its NAS component to search jointly for the architecture and the expansion policy. Result: Experiments on multiple benchmarks show that SEAL reduces forgetting, improves accuracy, and keeps the model smaller. Conclusion: SEAL demonstrates the efficiency and adaptability of combining NAS with selective expansion in incremental learning. Abstract: Incremental learning is a machine learning paradigm where a model learns from a sequential stream of tasks. This setting poses a key challenge: balancing plasticity (learning new tasks) and stability (preserving past knowledge). Neural Architecture Search (NAS), a branch of AutoML, automates the design of the architecture of Deep Neural Networks and has shown success in static settings. However, existing NAS-based approaches to incremental learning often rely on expanding the model at every task, making them impractical in resource-constrained environments. In this work, we introduce SEAL, a NAS-based framework tailored for data-incremental learning, a scenario where disjoint data samples arrive sequentially and are not stored for future access. SEAL adapts the model structure dynamically by expanding it only when necessary, based on a capacity estimation metric. Stability is preserved through cross-distillation training after each expansion step. The NAS component jointly searches for both the architecture and the optimal expansion policy. Experiments across multiple benchmarks demonstrate that SEAL effectively reduces forgetting and enhances accuracy while maintaining a lower model size compared to prior methods. These results highlight the promise of combining NAS and selective expansion for efficient, adaptive learning in incremental scenarios.

cs.RO

[117] FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

Jun Guo,Xiaojian Ma,Yikai Wang,Min Yang,Huaping Liu,Qing Li

Main category: cs.RO

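TL;DR: FlowDreamer is an RGB-D world model built on 3D scene flow; its explicit motion representation improves visual prediction for robot manipulation and outperforms baseline models.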

Details

Motivation: To study how an explicit motion representation (3D scene flow) can improve visual world models for robot manipulation so that future visual observations are predicted more accurately. Method: FlowDreamer predicts 3D scene flow with a U-Net, then generates future frames with a diffusion model, and is trained end-to-end. Result: Across 4 benchmarks, FlowDreamer improves semantic similarity, pixel quality, and success rate by 7%, 11%, and 6%, respectively. Conclusion: An explicit motion representation markedly improves RGB-D world models for robot manipulation tasks. Abstract: This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as explicit motion representations. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U-Net, and then a diffusion model predicts the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer achieves better performance compared to other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.

eess.IV

[118] ImplicitStainer: Data-Efficient Medical Image Translation for Virtual Antibody-based Tissue Staining Using Local Implicit Functions

Tushar Kataria,Beatrice Knudsen,Shireen Y. Elhabian

Main category: eess.IV

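TL;DR: ImplicitStainer is a new method that uses local implicit functions to improve virtual (IHC) staining, making pixel-level predictions to raise performance and reduce data requirements.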

Details

Motivation: H&E staining does not provide all diagnostic information, and IHC staining is time-consuming and resource-intensive, so virtual staining has emerged as an alternative. Existing methods have high data requirements and need improvement. Method: Proposes ImplicitStainer, which performs image translation with local implicit functions and focuses on pixel-level predictions to reduce data dependence. Result: Delivers high-quality results with limited data in a benchmark against more than fifteen state-of-the-art GAN- and diffusion-based models. Conclusion: ImplicitStainer offers an efficient solution for virtual staining; the code will be released publicly. Abstract: Hematoxylin and eosin (H&E) staining is a gold standard for microscopic diagnosis in pathology. However, H&E staining does not capture all the diagnostic information that may be needed. To obtain additional molecular information, immunohistochemical (IHC) stains highlight proteins that mark specific cell types, such as CD3 for T-cells or CK8/18 for epithelial cells. While IHC stains are vital for prognosis and treatment guidance, they are typically only available at specialized centers and time-consuming to acquire, leading to treatment delays for patients. Virtual staining, enabled by deep learning-based image translation models, provides a promising alternative by computationally generating IHC stains from H&E stained images. Although many GAN- and diffusion-based image-to-image (I2I) translation methods have been used for virtual staining, these models treat image patches as independent data points, which results in increased and more diverse data requirements for effective generation. We present ImplicitStainer, a novel approach that leverages local implicit functions to improve image translation, specifically virtual staining performance, by focusing on pixel-level predictions. This method enhances robustness to variations in dataset sizes, delivering high-quality results even with limited data. We validate our approach on two datasets using a comprehensive set of metrics and benchmark it against over fifteen state-of-the-art GAN- and diffusion-based models. Full code and trained models will be released publicly via GitHub upon acceptance.

[119] Ordered-subsets Multi-diffusion Model for Sparse-view CT Reconstruction

Pengfei Yu,Bin Huang,Minghui Zhang,Weiwen Wu,Shaoyu Wang,Qiegen Liu

Main category: eess.IV

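TL;DR: Proposes OSMM, an ordered-subsets multi-diffusion model for sparse-view CT reconstruction that improves fine-detail reconstruction through subset-wise learning plus a global constraint.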

Details

Motivation: In sparse-view CT reconstruction, conventional score-based diffusion models learn poorly and lose detail because the projection data is large and redundant. Method: Divides the CT projection data into equal subsets, uses a multi-subsets diffusion model (MSDM) to learn from each subset independently, and adds a one-whole diffusion model (OWDM) trained on the complete data as a global constraint. Result: OSMM outperforms traditional diffusion models in image quality and noise resilience and adapts to CT data of varying sparsity. Conclusion: OSMM provides an efficient, robust solution for sparse-view CT reconstruction. Abstract: Score-based diffusion models have shown significant promise in the field of sparse-view CT reconstruction. However, the projection dataset is large and riddled with redundancy. Consequently, applying the diffusion model to unprocessed data results in lower learning effectiveness and higher learning difficulty, frequently leading to reconstructed images that lack fine details. To address these issues, we propose the ordered-subsets multi-diffusion model (OSMM) for sparse-view CT reconstruction. The OSMM innovatively divides the CT projection data into equal subsets and employs multi-subsets diffusion model (MSDM) to learn from each subset independently. This targeted learning approach reduces complexity and enhances the reconstruction of fine details. Furthermore, the integration of one-whole diffusion model (OWDM) with complete sinogram data acts as a global information constraint, which can reduce the possibility of generating erroneous or inconsistent sinogram information. Moreover, the OSMM's unsupervised learning framework provides strong robustness and generalizability, adapting seamlessly to varying sparsity levels of CT sinograms. This ensures consistent and reliable performance across different clinical scenarios. Experimental results demonstrate that OSMM outperforms traditional diffusion models in terms of image quality and noise resilience, offering a powerful and versatile solution for advanced CT imaging in sparse-view scenarios.
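
The abstract does not say how the projection data is split into equal subsets. A minimal sketch, assuming the interleaved partition of view angles that is customary in classical ordered-subsets reconstruction, looks like this; `ordered_subsets` and its parameters are hypothetical names, and the actual OSMM partition may differ.

```python
from typing import List


def ordered_subsets(num_views: int, num_subsets: int) -> List[List[int]]:
    """Split view indices 0..num_views-1 into equal, interleaved subsets."""
    if num_views % num_subsets != 0:
        raise ValueError("num_views must be divisible by num_subsets for equal subsets")
    return [list(range(start, num_views, num_subsets)) for start in range(num_subsets)]


subsets = ordered_subsets(num_views=12, num_subsets=3)
print(subsets)  # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
```

With interleaving, every subset still spans the full angular range, so each per-subset model sees a coarse but complete view of the object, while a model trained on all views can supply the global constraint described in the abstract.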

[120] Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding

Jianhao Huang,Qunsong Zeng,Kaibin Huang

Main category: eess.IV

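TL;DR: This paper proposes a hybrid generative semantic communication (Gen-SemCom) system with a critical information embedding (CIE) framework that addresses the loss of detail in purely prompt-driven generation, and designs the GVIF metric to evaluate generated image quality.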

Details

Motivation: Purely prompt-driven generative semantic communication loses fine-grained image details, and systematic evaluation metrics are lacking. Method: Develops a hybrid Gen-SemCom system with a CIE framework that transmits both text prompts and critical features, proposes semantic filtering to select the critical features, and reconstructs images with a diffusion-based generative model. Designs the GVIF metric to quantify image quality. Result: The GVIF metric is sensitive to visual fidelity and correlates with PSNR and critical information volume. The optimized system beats benchmark schemes on PSNR and FID. Conclusion: The hybrid Gen-SemCom system and the GVIF metric improve the visual quality of generated images and communication efficiency. Abstract: Generative semantic communication (Gen-SemCom) with large artificial intelligence (AI) models promises a transformative paradigm for 6G networks, which reduces communication costs by transmitting low-dimensional prompts rather than raw data. However, purely prompt-driven generation loses fine-grained visual details. Additionally, there is a lack of systematic metrics to evaluate the performance of Gen-SemCom systems. To address these issues, we develop a hybrid Gen-SemCom system with a critical information embedding (CIE) framework, where both text prompts and semantically critical features are extracted for transmissions. First, a novel approach of semantic filtering is proposed to select and transmit the semantically critical features of images relevant to the semantic label. By integrating the text prompt and critical features, the receiver reconstructs high-fidelity images using a diffusion-based generative model. Next, we propose the generative visual information fidelity (GVIF) metric to evaluate the visual quality of the generated image. By characterizing the statistical models of image features, the GVIF metric quantifies the mutual information between the distorted features and their original counterparts. By maximizing the GVIF metric, we design a channel-adaptive Gen-SemCom system that adaptively controls the volume of features and compression rate according to the channel state. Experimental results validate the GVIF metric's sensitivity to visual fidelity, correlating with both the PSNR and critical information volume. In addition, the optimized system achieves superior performance over benchmarking schemes in terms of higher PSNR and lower FID scores.

[121] HWA-UNETR: Hierarchical Window Aggregate UNETR for 3D Multimodal Gastric Lesion Segmentation

Jiaming Liang,Lihuan Dai,Xiaoqi Sheng,Xiangguang Chen,Chun Yao,Guihua Tao,Qibin Leng,Honming Cai,Xi Zhong

Main category: eess.IV

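TL;DR: The paper proposes HWA-UNETR, a new 3D segmentation framework, and releases GCM 2025, the first large-scale multimodal MRI dataset for gastric cancer, addressing key challenges in multimodal medical image segmentation.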

Details

Motivation: Multimodal medical image segmentation for gastric cancer lesion analysis suffers from data scarcity and modality misalignment, which constrains algorithm training and wastes resources. Method: Proposes the HWA-UNETR framework, which uses learnable window aggregation layers (the HWA block) and a tri-orientated fusion Mamba mechanism to align features dynamically across modalities and model long-range spatial dependencies. Result: Validated on the GCM 2025 and BraTS 2021 datasets, improving the Dice score by up to 1.68% with strong robustness. Conclusion: HWA-UNETR and the GCM 2025 dataset offer an efficient solution for multimodal gastric cancer segmentation that outperforms existing methods. Abstract: Multimodal medical image segmentation faces significant challenges in the context of gastric cancer lesion analysis. This clinical context is defined by the scarcity of independent multimodal datasets and the imperative to amalgamate inherently misaligned modalities. As a result, algorithms are constrained to train on approximate data and depend on application migration, leading to substantial resource expenditure and a potential decline in analysis accuracy. To address those challenges, we have made two major contributions: First, we publicly disseminate the GCM 2025 dataset, which serves as the first large-scale, open-source collection of gastric cancer multimodal MRI scans, featuring professionally annotated FS-T2W, CE-T1W, and ADC images from 500 patients. Second, we introduce HWA-UNETR, a novel 3D segmentation framework that employs an original HWA block with learnable window aggregation layers to establish dynamic feature correspondences between different modalities' anatomical structures, and leverages the innovative tri-orientated fusion mamba mechanism for context modeling and capturing long-range spatial dependencies. Extensive experiments on our GCM 2025 dataset and the public BraTS 2021 dataset validate the performance of our framework, demonstrating that the new approach surpasses existing methods by up to 1.68% in the Dice score while maintaining solid robustness. The dataset and code are public via https://github.com/JeMing-creater/HWA-UNETR.

[122] Multi-contrast laser endoscopy for in vivo gastrointestinal imaging

Taylor L. Bobrow,Mayank Golhar,Suchapa Arayakarnkul,Anthony A. Song,Saowanee Ngamruengphong,Nicholas J. Durr

Main category: eess.IV

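TL;DR: Multi-contrast Laser Endoscopy (MLE) uses tunable spectral, coherent, and directional illumination to substantially improve contrast and color difference for detecting gastrointestinal disease.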

Details

Motivation: White light endoscopy provides insufficient contrast for detecting gastrointestinal disease, so many clinically relevant cases go undetected. Method: The MLE platform combines multispectral diffuse reflectance, laser speckle contrast imaging, and photometric stereo to enhance tissue contrast, quantify blood flow, and characterize mucosal topography. Result: Across images of 31 polyps, MLE shows roughly a three-fold improvement in contrast and a five-fold improvement in color difference. Conclusion: As a tool that reveals complementary types of tissue contrast, MLE shows promise for improving gastrointestinal imaging. Abstract: White light endoscopy is the clinical gold standard for detecting diseases in the gastrointestinal tract. Most applications involve identifying visual abnormalities in tissue color, texture, and shape. Unfortunately, the contrast of these features is often subtle, causing many clinically relevant cases to go undetected. To overcome this challenge, we introduce Multi-contrast Laser Endoscopy (MLE): a platform for widefield clinical imaging with rapidly tunable spectral, coherent, and directional illumination. We demonstrate three capabilities of MLE: enhancing tissue chromophore contrast with multispectral diffuse reflectance, quantifying blood flow using laser speckle contrast imaging, and characterizing mucosal topography using photometric stereo. We validate MLE with benchtop models, then demonstrate MLE in vivo during clinical colonoscopies. MLE images from 31 polyps demonstrate an approximate three-fold improvement in contrast and a five-fold improvement in color difference compared to white light and narrow band imaging. With the ability to reveal multiple complementary types of tissue contrast while seamlessly integrating into the clinical environment, MLE shows promise as an investigative tool to improve gastrointestinal imaging.

q-bio.QM

[123] Generative diffusion model surrogates for mechanistic agent-based biological models

Tien Comlekoglu,J. Quetzalcóatl Toledo-Marín,Douglas W. DeSimone,Shayn M. Peirce,Geoffrey Fox,James A. Glazier

Main category: q-bio.QM

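TL;DR: Trains a generative AI surrogate with denoising diffusion probabilistic models (DDPMs) to accelerate Cellular-Potts Model (CPM) computation, achieving roughly a 22x speedup.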

Details

Motivation: CPMs are computationally expensive for simulating complex biological systems, so surrogate models are needed to accelerate evaluation. Method: Trains a generative AI surrogate with a DDPM, with an image classifier assisting surrogate selection and verification. Result: The surrogate generates model configurations 20,000 timesteps ahead of a reference configuration and reduces computational time by about 22x. Conclusion: DDPMs offer a viable path toward digital twins of stochastic biological systems. Abstract: Mechanistic, multicellular, agent-based models are commonly used to investigate tissue, organ, and organism-scale biology at single-cell resolution. The Cellular-Potts Model (CPM) is a powerful and popular framework for developing and interrogating these models. CPMs become computationally expensive at large spatial and temporal scales, making application and investigation of developed models difficult. Surrogate models may allow for the accelerated evaluation of CPMs of complex biological systems. However, the stochastic nature of these models means each set of parameters may give rise to different model configurations, complicating surrogate model development. In this work, we leverage denoising diffusion probabilistic models to train a generative AI surrogate of a CPM used to investigate in vitro vasculogenesis. We describe the use of an image classifier to learn the characteristics that define unique areas of a 2-dimensional parameter space. We then apply this classifier to aid in surrogate model selection and verification. Our CPM model surrogate generates model configurations 20,000 timesteps ahead of a reference configuration and demonstrates approximately a 22x reduction in computational time as compared to native code execution. Our work represents a step towards the implementation of DDPMs to develop digital twins of stochastic biological systems.

cs.SD

[124] LAV: Audio-Driven Dynamic Visual Generation with Neural Compression and StyleGAN2

Jongmin Jung,Dasaem Jeong

Main category: cs.SD

TL;DR: LAV combines EnCodec's neural audio compression with StyleGAN2's generative capabilities to produce dynamic visual output driven by pre-recorded audio.

Details

Motivation: To explore the potential of pretrained audio compression models for artistic and computational applications without relying on explicit feature mappings. Method: Uses EnCodec embeddings as latent representations and maps them directly into StyleGAN2's style latent space through a randomly initialized linear mapping. Result: Preserves semantic richness and enables nuanced, semantically coherent audio-visual translation. Conclusion: LAV demonstrates the potential of pretrained audio compression models in artistic and computational applications. Abstract: This paper introduces LAV (Latent Audio-Visual), a system that integrates EnCodec's neural audio compression with StyleGAN2's generative capabilities to produce visually dynamic outputs driven by pre-recorded audio. Unlike previous works that rely on explicit feature mappings, LAV uses EnCodec embeddings as latent representations, directly transformed into StyleGAN2's style latent space via a randomly initialized linear mapping. This approach preserves semantic richness in the transformation, enabling nuanced and semantically coherent audio-visual translations. The framework demonstrates the potential of using pretrained audio compression models for artistic and computational applications.
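
The central operation in the abstract, a randomly initialized linear mapping from an audio embedding to a style vector, can be sketched in a few lines. The dimensions below (128 for the audio embedding, 512 for the style vector), the 1/sqrt(in_dim) weight scale, and the function names are assumptions for illustration and are not taken from the abstract; no EnCodec or StyleGAN2 code is involved.

```python
import random
from typing import List


def random_linear_map(in_dim: int, out_dim: int, seed: int = 0) -> List[List[float]]:
    """Fixed random weight matrix of shape (out_dim, in_dim), scaled by 1/sqrt(in_dim)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights: List[List[float]], embedding: List[float]) -> List[float]:
    """Map one audio-frame embedding to one style vector (matrix-vector product)."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


AUDIO_DIM, STYLE_DIM = 128, 512  # assumed sizes, not stated in the abstract
weights = random_linear_map(AUDIO_DIM, STYLE_DIM)

frame_rng = random.Random(1)
frame_embedding = [frame_rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]  # placeholder audio frame
style_vector = project(weights, frame_embedding)
print(len(style_vector))  # 512
```

Because the map is linear and fixed, nearby audio embeddings land on nearby style vectors, which is one plausible reading of how the approach keeps the visual output coherent with the audio without a trained feature mapping.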

cs.CR

[125] PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization

Yidan Wang,Yanan Cao,Yubing Ren,Fang Fang,Zheng Lin,Binxing Fang

Main category: cs.CR

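TL;DR: The paper examines the privacy risks of large language models (LLMs) and proposes PIG, a new framework that links jailbreak attacks with privacy leakage to extract sensitive information effectively, outperforming existing methods in experiments.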

Details

Motivation: Existing methods for evaluating privacy leakage in LLMs rely on memorized prefixes or simple instructions, which aligned models block easily; meanwhile, the role of jailbreak attacks in privacy scenarios remains underexplored. Method: Proposes the PIG framework, which identifies PII entities in privacy queries, builds a privacy context with in-context learning, and iteratively updates it with gradient-based strategies to extract the target PII. Result: In experiments on four white-box and two black-box LLMs, PIG outperforms baseline methods and achieves state-of-the-art results. Conclusion: The results highlight the privacy risks of LLMs and the need for stronger safeguards. Abstract: Large Language Models (LLMs) excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-aligned models can easily block. Meanwhile, jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains underexplored. In this paper, we examine the effectiveness of jailbreak attacks in extracting sensitive information, bridging privacy leakage and jailbreak attacks in LLMs. Moreover, we propose PIG, a novel framework targeting Personally Identifiable Information (PII) and addressing the limitations of current jailbreak methods. Specifically, PIG identifies PII entities and their types in privacy queries, uses in-context learning to build a privacy context, and iteratively updates it with three gradient-based strategies to elicit target PII. We evaluate PIG and existing jailbreak methods using two privacy-related datasets. Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art (SoTA) results. The results underscore significant privacy risks in LLMs, emphasizing the need for stronger safeguards. Our code is available at https://github.com/redwyd/PrivacyJailbreak.

cs.SI [Back]

[126] Tales of the 2025 Los Angeles Fire: Hotwash for Public Health Concerns in Reddit via LLM-Enhanced Topic Modeling

Sulong Zhou,Qunying Huang,Shaoheng Zhou,Yun Hang,Xinyue Ye,Aodong Mei,Kathryn Phung,Yuning Ye,Uma Govindswamy,Zehan Li

Main category: cs.SI

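TL;DR: By analyzing Reddit discussions during the 2025 Los Angeles wildfires with topic modeling and a hierarchical framework, this study reveals the public's situational awareness and crisis narratives, providing data to support disaster response and public health strategies.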

Details

Motivation: Wildfires have become more frequent and severe in recent years, and understanding how the public perceives and responds during a disaster is critical for timely, empathetic disaster response. Social media offers a channel for capturing public sentiment and information. Method: Collected 385 posts and 114,879 comments, applied topic modeling (enhanced with LLMs and HITL refinement), and developed a hierarchical framework with two categories (SA and CN) to classify the topics. Result: The volume of SA discussion tracks the actual progression of the fires, and grief signals and mental health risks account for about 60 percent and 40 percent of CN instances, respectively. The study provides the first annotated social media dataset on the fires. Conclusion: The findings provide data to support disaster response, public health communication, and future research on comparable disasters, and the study introduces a scalable analysis framework. Abstract: Wildfires have become increasingly frequent, irregular, and severe in recent years. Understanding how affected populations perceive and respond during wildfire crises is critical for timely and empathetic disaster response. Social media platforms offer a crowd-sourced channel to capture evolving public discourse, providing hyperlocal information and insight into public sentiment. This study analyzes Reddit discourse during the 2025 Los Angeles wildfires, spanning from the onset of the disaster to full containment. We collect 385 posts and 114,879 comments related to the Palisades and Eaton fires. We adopt topic modeling methods to identify the latent topics, enhanced by large language models (LLMs) and human-in-the-loop (HITL) refinement. Furthermore, we develop a hierarchical framework to categorize latent topics, consisting of two main categories, Situational Awareness (SA) and Crisis Narratives (CN). The volume of the SA category closely aligns with real-world fire progression, peaking within the first 2-5 days as the fires reach their maximum extent. The most frequent co-occurring category set, comprising public health and safety, loss and damage, and emergency resources, spans a wide range of health-related latent topics, including environmental health, occupational health, and One Health. Grief signals and mental health risks consistently accounted for 60 percent and 40 percent of CN instances, respectively, with the highest total volume occurring at night. This study contributes the first annotated social media dataset on the 2025 LA fires and introduces a scalable multi-layer framework that leverages topic modeling for crisis discourse analysis. By identifying persistent public health concerns, our results can inform more empathetic and adaptive strategies for disaster response, public health communication, and future research in comparable climate-related disaster events.

cs.HC [Back]

[127] CartoAgent: a multimodal large language model-powered multi-agent cartographic framework for map style transfer and evaluation

Chenglong Wang,Yuhao Kang,Zhaoya Gong,Pengjun Zhao,Yu Feng,Wenjia Zhang,Ge Li

Main category: cs.HC

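TL;DR: CartoAgent is a multi-agent cartographic framework powered by multimodal large language models (MLLMs) that simulates three stages of cartographic practice (preparation, map design, and evaluation) to generate maps that are both visually appealing and informative.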

Details

Motivation: Previous studies have either overlooked the artistic aspects of maps or struggled to produce maps that are both accurate and informative. Method: CartoAgent leverages MLLMs' visual aesthetic capability and world knowledge, separating style from geographic data so that it designs stylesheets without modifying the vector data. Result: The framework's effectiveness was validated on map style transfer and evaluation tasks. Conclusion: CartoAgent can be extended to support a variety of cartographic design decisions and can inform future integrations of GenAI in cartography. Abstract: The rapid development of generative artificial intelligence (GenAI) presents new opportunities to advance the cartographic process. Previous studies have either overlooked the artistic aspects of maps or faced challenges in creating both accurate and informative maps. In this study, we propose CartoAgent, a novel multi-agent cartographic framework powered by multimodal large language models (MLLMs). This framework simulates three key stages in cartographic practice: preparation, map design, and evaluation. At each stage, different MLLMs act as agents with distinct roles to collaborate, discuss, and utilize tools for specific purposes. In particular, CartoAgent leverages MLLMs' visual aesthetic capability and world knowledge to generate maps that are both visually appealing and informative. By separating style from geographic data, it can focus on designing stylesheets without modifying the vector-based data, thereby ensuring geographic accuracy. We applied CartoAgent to a specific task centered on map restyling, namely map style transfer and evaluation. The effectiveness of this framework was validated through extensive experiments and a human evaluation study. CartoAgent can be extended to support a variety of cartographic design decisions and inform future integrations of GenAI in cartography.

[128] Visual Feedback of Pattern Separability Improves Myoelectric Decoding Performance of Upper Limb Prostheses

Ruichen Yang,György M. Lévay,Christopher L. Hunt,Dániel Czeiner,Megan C. Hodgson,Damini Agarwal,Rahul R. Kaliki,Nitish V. Thakor

Main category: cs.HC

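TL;DR: The paper presents the Reviewer, a 3D visual interface that projects EMG signals into the classification space in real time, helping users intuitively understand the behavior of the pattern recognition algorithm and thereby improving control of myoelectric prostheses.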

Details

Motivation: In existing pattern recognition control systems for myoelectric prostheses, users struggle to produce EMG patterns distinct enough for reliable classification, and training relies on trial-and-error adjustment. Method: A 10-session study with 12 able-bodied participants compared training with the Reviewer against conventional virtual arm visualization, measuring performance on a Fitts' law task. Result: Participants who used the Reviewer achieved higher completion rates, better path efficiency, and less overshoot. Conclusion: Through structured training, 3D visual feedback significantly improves pattern recognition control in novice operators and reduces reliance on trial-and-error adjustment. Abstract: State-of-the-art upper limb myoelectric prostheses often use pattern recognition (PR) control systems that translate electromyography (EMG) signals into desired movements. As prosthesis movement complexity increases, users often struggle to produce sufficiently distinct EMG patterns for reliable classification. Existing training typically involves heuristic, trial-and-error user adjustments to static decoder boundaries. Goal: We introduce the Reviewer, a 3D visual interface projecting EMG signals directly into the decoder's classification space, providing intuitive, real-time insight into PR algorithm behavior. This structured feedback reduces cognitive load and fosters mutual, data-driven adaptation between user-generated EMG patterns and decoder boundaries. Methods: A 10-session study with 12 able-bodied participants compared PR performance after motor-based training and updating using the Reviewer versus conventional virtual arm visualization. Performance was assessed using a Fitts' law task that involved control of the cursor's aperture and orientation. Results: Participants trained with the Reviewer achieved higher completion rates, reduced overshoot, and improved path efficiency and throughput compared to the standard visualization group. Significance: The Reviewer introduces decoder-informed motor training, facilitating immediate and consistent PR-based myoelectric control improvements. By iteratively refining control through real-time feedback, this approach reduces reliance on trial-and-error recalibration, enabling a more adaptive, self-correcting training framework. Conclusion: The 3D visual feedback significantly improves PR control in novice operators through structured training, enabling feedback-driven adaptation and reducing reliance on extensive heuristic adjustments.
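
The abstract reports throughput and path efficiency without defining them. The sketch below uses the widely used Shannon formulation of Fitts' index of difficulty and one common definition of path efficiency (straight-line distance divided by distance travelled); whether the study computes its metrics exactly this way is an assumption, and the function names are illustrative.

```python
import math


def index_of_difficulty(distance, width):
    """Shannon formulation of Fitts' index of difficulty, in bits."""
    return math.log2(distance / width + 1.0)


def throughput(distance, width, movement_time):
    """Bits per second for one trial: index of difficulty over movement time."""
    return index_of_difficulty(distance, width) / movement_time


def path_efficiency(path):
    """Straight-line distance divided by the length of the path actually taken.

    `path` is a list of points; 1.0 means a perfectly direct movement.
    """
    straight = math.dist(path[0], path[-1])
    travelled = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    return straight / travelled if travelled > 0 else 0.0


# A target 7 units away and 1 unit wide, reached in 2 seconds.
bits = index_of_difficulty(7.0, 1.0)
rate = throughput(7.0, 1.0, 2.0)

# A cursor that moves 3 units right, then 4 units up, instead of diagonally.
efficiency = path_efficiency([(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)])
```

In this example the target is worth 3 bits, the trial runs at 1.5 bits per second, and the L-shaped path has an efficiency of 5/7.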

[129] SOS: A Shuffle Order Strategy for Data Augmentation in Industrial Human Activity Recognition

Anh Tuan Ha,Hoang Khang Phan,Thai Minh Tien Ngo,Anh Phan Truong,Nhat Tan Le

Main category: cs.HC

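TL;DR: The paper proposes generating high-quality HAR datasets with deep learning methods (an attention autoencoder and conditional generative adversarial networks) and shows that a random sequence strategy significantly improves classification performance.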

Details

Motivation: To address the high cost of obtaining high-quality, diverse data in HAR and the effect of data heterogeneity on model performance. Method: Generates a dataset with an attention autoencoder and conditional generative adversarial networks, and shuffles the data with a random sequence strategy to homogenize its distribution. Result: The random sequence strategy significantly improves classification performance, reaching an accuracy of 0.70 ± 0.03 and a macro F1 score of 0.64 ± 0.01. Conclusion: The method broadens the effective training dataset and points to ways of improving HAR systems in complex real-world scenarios. Abstract: In the realm of Human Activity Recognition (HAR), obtaining high-quality, diverse data remains a persistent challenge due to high costs and the inherent variability of real-world activities. This study introduces a dataset generated with deep learning approaches (Attention Autoencoder and conditional Generative Adversarial Networks). Data heterogeneity is another critical challenge, and one solution is to shuffle the data to homogenize the distribution. Experimental results demonstrate that the random sequence strategy significantly improves classification performance, achieving an accuracy of up to 0.70 $\pm$ 0.03 and a macro F1 score of 0.64 $\pm$ 0.01. Disrupting temporal dependencies through random sequence reordering compels the model to focus on instantaneous recognition, thereby improving robustness against activity transitions. This approach not only broadens the effective training dataset but also offers promising avenues for enhancing HAR systems in complex, real-world scenarios.
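
A minimal sketch of the two ideas named in this abstract: reordering labeled sensor windows at random, and scoring predictions with macro F1. The abstract does not give implementation details, so the window representation, the function names, and the toy labels below are assumptions for illustration only.

```python
import random


def shuffle_order(windows, labels, seed=0):
    """Reorder (window, label) pairs at random, keeping each pair intact."""
    if len(windows) != len(labels):
        raise ValueError("windows and labels must have the same length")
    order = list(range(len(windows)))
    random.Random(seed).shuffle(order)
    return [windows[i] for i in order], [labels[i] for i in order]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: six two-sample windows recorded in activity order.
windows = [[i, i + 1] for i in range(6)]
labels = ["walk", "walk", "lift", "lift", "carry", "carry"]
shuffled_windows, shuffled_labels = shuffle_order(windows, labels)

# Toy predictions for a two-class problem.
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Shuffling whole windows rather than individual samples keeps the signal inside each window intact while removing the ordering between windows, which matches the abstract's description of disrupting temporal dependencies.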

cs.AI [Back]

[130] From Text to Network: Constructing a Knowledge Graph of Taiwan-Based China Studies Using Generative AI

Hsuan-Lei Shao

Main category: cs.AI

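TL;DR: The study proposes an AI-assisted method that turns unstructured academic texts from Taiwan-based China Studies into a structured knowledge graph, using generative AI and large language models to extract entity-relation triples and a visualization system to reveal research trends and gaps.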

Details

Motivation: To respond to the need in Taiwan-based China Studies to systematically organize decades of scholarship, and to explore the use of generative AI in area studies and digital humanities. Method: Applies generative AI and large language models to extract entity-relation triples from 1,367 peer-reviewed articles, then builds a knowledge graph and vector database visualized through a D3.js-based system. Result: The system reveals the field's intellectual trajectories, thematic clusters, and research gaps, supporting a shift from linear text reading to network-based knowledge navigation. Conclusion: The study demonstrates the potential of generative AI for restructuring regional knowledge systems and offers a data-driven alternative for scholarly infrastructure. Abstract: Taiwanese China Studies (CS) has developed into a rich, interdisciplinary research field shaped by Taiwan's unique geopolitical position and long-standing academic engagement with Mainland China. This study responds to the growing need to systematically revisit and reorganize decades of Taiwan-based CS scholarship by proposing an AI-assisted approach that transforms unstructured academic texts into structured, interactive knowledge representations. We apply generative AI (GAI) techniques and large language models (LLMs) to extract and standardize entity-relation triples from 1,367 peer-reviewed CS articles published between 1996 and 2019. These triples are then visualized through a lightweight D3.js-based system, forming the foundation of a domain-specific knowledge graph and vector database for the field. This infrastructure allows users to explore conceptual nodes and semantic relationships across the corpus, revealing previously uncharted intellectual trajectories, thematic clusters, and research gaps. By decomposing textual content into graph-structured knowledge units, our system enables a paradigm shift from linear text consumption to network-based knowledge navigation. In doing so, it enhances scholarly access to CS literature while offering a scalable, data-driven alternative to traditional ontology construction. This work not only demonstrates how generative AI can augment area studies and digital humanities but also highlights its potential to support a reimagined scholarly infrastructure for regional knowledge systems.
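
To make "entity-relation triples" and "network-based navigation" concrete, here is a small sketch that stores triples as a graph and queries it. The triples, relation names, and helper functions are invented placeholders, not data or code from the study, and the LLM extraction step is not shown.

```python
from collections import Counter, defaultdict


def build_graph(triples):
    """Index (subject, relation, object) triples by subject."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def neighbors(graph, node):
    """Return the distinct objects directly linked from a node, sorted."""
    return sorted({obj for _, obj in graph.get(node, [])})


def node_degrees(triples):
    """Count how many triples mention each entity, as subject or object."""
    degrees = Counter()
    for subject, _, obj in triples:
        degrees[subject] += 1
        degrees[obj] += 1
    return degrees


# Placeholder triples standing in for LLM-extracted output.
triples = [
    ("Topic A", "influences", "Topic B"),
    ("Topic A", "discussed_with", "Topic C"),
    ("Topic B", "discussed_with", "Topic C"),
    ("Topic D", "responds_to", "Topic A"),
]

graph = build_graph(triples)
linked = neighbors(graph, "Topic A")
hub, hub_degree = node_degrees(triples).most_common(1)[0]
```

High-degree nodes such as the hub here are candidates for thematic clusters, while entities with few or no links suggest the kind of research gaps the abstract mentions.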

[131] Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models

Annie Wong,Thomas Bäck,Aske Plaat,Niki van Stein,Anna V. Kononova

Main category: cs.AI

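TL;DR: The study evaluates the adaptive capabilities of large language models in dynamic environments and finds that strategic prompting can narrow the performance gap between models, but advanced reasoning methods give unstable results and models remain limited in planning, reasoning, and spatial coordination.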

Details

Motivation: To examine the potential of large language models as self-learning and reasoning agents in dynamic environments and to assess their adaptive capabilities. Method: Tests open-source language models in dynamic environments using self-reflection, heuristic mutation, and planning as prompting techniques. Result: Larger models outperform smaller ones, but strategic prompting can narrow the gap; advanced prompting techniques help smaller models most; reasoning methods give unstable results, and models remain limited in key areas. Conclusion: Current large language models have fundamental shortcomings in reasoning and planning, and evaluation needs to move beyond static benchmarks to assess reasoning fully. Abstract: While large language models demonstrate impressive performance on static benchmarks, the true potential of large language models as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates the efficacy of self-reflection, heuristic mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments and find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an overly long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to big performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in crucial areas such as planning, reasoning, and spatial coordination, suggesting that current-generation large language models still suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while reasoning methods like chain of thought improve multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.

Table of Contents

cs.CV [Back]

[1] A Computational Pipeline for Advanced Analysis of 4D Flow MRI in the Left Atrium

[2] Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

[3] BoundarySeg:An Embarrassingly Simple Method To Boost Medical Image Segmentation Performance for Low Data Regimes

[4] Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models

[5] Few-Shot Learning of Visual Compositional Concepts through Probabilistic Schema Induction

[6] Large-Scale Gaussian Splatting SLAM

[7] AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

[8] DDFP: Data-dependent Frequency Prompt for Source Free Domain Adaptation of Medical Image Segmentation

[9] VRU-CIPI: Crossing Intention Prediction at Intersections for Improving Vulnerable Road Users Safety

[10] Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset

[11] CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection

[12] MambaControl: Anatomy Graph-Enhanced Mamba ControlNet with Fourier Refinement for Diffusion-Based Disease Trajectory Prediction

[13] TKFNet: Learning Texture Key Factor Driven Feature for Facial Expression Recognition

[14] APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds

[15] High Quality Underwater Image Compression with Adaptive Correction and Codebook-based Augmentation

[16] PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

[17] Descriptive Image-Text Matching with Graded Contextual Similarity

[18] From Air to Wear: Personalized 3D Digital Fashion with AR/VR Immersive 3D Sketching

[19] Application of YOLOv8 in monocular downward multiple Car Target detection

[20] ORL-LDM: Offline Reinforcement Learning Guided Latent Diffusion Model Super-Resolution Reconstruction

[21] DeepSeqCoco: A Robust Mobile Friendly Deep Learning Model for Detection of Diseases in Cocos nucifera

[22] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

[23] Advances in Radiance Field for Dynamic Scene: From Neural Field to Gaussian Field

[24] PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

[25] ToonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars

[26] MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

[27] Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

[28] IMITATE: Image Registration with Context for unknown time frame recovery

[29] Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization

[30] Modeling Saliency Dataset Bias

[31] VolE: A Point-cloud Framework for Food 3D Reconstruction and Volume Estimation

[32] Data-Agnostic Augmentations for Unknown Variations: Out-of-Distribution Generalisation in MRI Segmentation

[33] On the Interplay of Human-AI Alignment,Fairness, and Performance Trade-offs in Medical Imaging

[34] MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

[35] ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization

[36] Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot

[37] Inferring Driving Maps by Deep Learning-based Trail Map Extraction

[38] HandReader: Advanced Techniques for Efficient Fingerspelling Recognition

[39] MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting

[40] MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning

[41] StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

[42] MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models

[43] A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability

[44] SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity

[45] Learned Lightweight Smartphone ISP with Unpaired Data

[46] Vision language models have difficulty recognizing virtual objects

[47] Consistent Quantity-Quality Control across Scenes for Deployment-Aware Gaussian Splatting

[48] Logos as a Well-Tempered Pre-train for Sign Language Recognition

[49] UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation

[50] CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

[51] MorphGuard: Morph Specific Margin Loss for Enhancing Robustness to Face Morphing Attacks

[52] Enhancing Multi-Image Question Answering via Submodular Subset Selection

[53] Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis

[54] Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data

[55] MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

[56] End-to-End Vision Tokenizer Tuning

[57] Depth Anything with Any Prior

[58] 3D-Fixup: Advancing Photo Editing with 3D Priors

cs.GR [Back]

[59] VRSplat: Fast and Robust Gaussian Splatting for Virtual Reality

[60] Style Customization of Text-to-Vector Generation with Image Diffusion Priors

cs.CL [Back]

[61] Next Word Suggestion using Graph Neural Network

[62] DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

[63] Large Language Models Are More Persuasive Than Incentivized Human Persuaders

[64] System Prompt Optimization with Meta-Learning

[65] VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

[66] An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs

[67] Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

[68] Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques

[69] Exploring the generalization of LLM truth directions on conversational formats

[70] KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning

[71] Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting

[72] Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries

[73] From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models

[74] Rethinking Prompt Optimizers: From Prompt Merits to Optimization

[75] Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph

[76] DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs