cs.CV

[1] A Computational Pipeline for Advanced Analysis of 4D Flow MRI in the Left Atrium

Xabier Morales,Ayah Elsayed,Debbie Zhao,Filip Loncaric,Ainhoa Aguado,Mireia Masias,Gina Quill,Marc Ramos,Ada Doltra,Ana Garcia,Marta Sitges,David Marlevi,Alistair Young,Martyn Nash,Bart Bijnens,Oscar Camara

Main category: cs.CV

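TL;DR: This study presents the first open-source computational framework for analyzing 4D Flow MRI of the left atrium, overcoming the limitations of conventional ultrasound analysis and investigating the potential of hemodynamic parameters as prognostic biomarkers.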

Details

Motivation: Conventional ultrasound analysis offers limited insight into left atrial hemodynamics. 4D Flow MRI is promising, but it is constrained by low flow velocities and limited spatial resolution, and no dedicated computational framework exists.

Method: An open-source computational framework is developed that produces high-accuracy automated segmentations on multi-center data (Dice > 0.9, Hausdorff 95 < 3 mm) and supports the first comprehensive assessment of energy, vorticity, and pressure parameters in the left atrium.

Result: The framework is robust to data of varying quality from different centers, and the hemodynamic parameters are assessed across a spectrum of disorders for their potential as prognostic biomarkers.

Conclusion: The open-source framework provides a reliable tool for studying left atrial hemodynamics and points to the clinical value of new biomarkers.

Abstract: The left atrium (LA) plays a pivotal role in modulating left ventricular filling, but our comprehension of its hemodynamics is significantly limited by the constraints of conventional ultrasound analysis. 4D flow magnetic resonance imaging (4D Flow MRI) holds promise for enhancing our understanding of atrial hemodynamics. However, the low velocities within the LA and the limited spatial resolution of 4D Flow MRI make analyzing this chamber challenging. Furthermore, the absence of dedicated computational frameworks, combined with diverse acquisition protocols and vendors, complicates gathering large cohorts for studying the prognostic value of hemodynamic parameters provided by 4D Flow MRI. In this study, we introduce the first open-source computational framework tailored for the analysis of 4D Flow MRI in the LA, enabling comprehensive qualitative and quantitative analysis of advanced hemodynamic parameters. Our framework proves robust to data from different centers of varying quality, producing high-accuracy automated segmentations (Dice > 0.9 and Hausdorff 95 < 3 mm), even with limited training data. Additionally, we conducted the first comprehensive assessment of energy, vorticity, and pressure parameters in the LA across a spectrum of disorders to investigate their potential as prognostic biomarkers.
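For readers unfamiliar with the Dice score quoted in this entry, the following is a minimal sketch of how the overlap metric is commonly computed on binary masks. It is not taken from the paper's framework: the function name `dice_coefficient` and the flat 0/1 list representation are assumptions made for illustration only.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as flat 0/1 sequences.

    Illustrative helper (hypothetical name), not the paper's implementation.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground count across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [0, 1, 1, 1, 0, 0]   # predicted segmentation (3 foreground voxels)
truth = [0, 1, 1, 0, 0, 0]  # reference segmentation (2 foreground voxels)
print(round(dice_coefficient(pred, truth), 3))  # 0.8
```

A Dice value above 0.9, as reported in the abstract, means the predicted and reference masks share more than 90% of their combined foreground under this definition.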

[2] Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

Julian Tanke,Takashi Shibuya,Kengo Uchida,Koichi Saito,Yuki Mitsufuji

Main category: cs.CV

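TL;DR: Dyadic Mamba uses state-space models (SSMs) to generate high-quality dyadic human motion of arbitrary length, addressing the limitations of transformer-based methods on long sequences.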

Details

Motivation: Existing transformer-based methods perform poorly when generating long dyadic motion sequences, mainly because of limitations in their positional encoding schemes.

Method: Dyadic Mamba uses a simple architecture in which information flows between the two individual motion sequences through concatenation, avoiding complex cross-attention mechanisms.

Result: The method is competitive on standard short-term benchmarks and significantly outperforms transformer-based approaches on longer sequences. A new benchmark for long-term motion synthesis is also proposed.

Conclusion: SSM-based architectures offer a promising direction for long-term dyadic human motion synthesis from text.

Abstract: Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this paper, we introduce Dyadic Mamba, a novel approach that leverages State-Space Models (SSMs) to generate high-quality dyadic human motion of arbitrary length. Our method employs a simple yet effective architecture that facilitates information flow between individual motion sequences through concatenation, eliminating the need for complex cross-attention mechanisms. We demonstrate that Dyadic Mamba achieves competitive performance on standard short-term benchmarks while significantly outperforming transformer-based approaches on longer sequences. Additionally, we propose a new benchmark for evaluating long-term motion synthesis quality, providing a standardized framework for future research. Our results demonstrate that SSM-based architectures offer a promising direction for addressing the challenging task of long-term dyadic human motion synthesis from text descriptions.

[3] BoundarySeg: An Embarrassingly Simple Method To Boost Medical Image Segmentation Performance for Low Data Regimes

Tushar Kataria,Shireen Y. Elhabian

Main category: cs.CV

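TL;DR: BoundarySeg is a medical image segmentation method that uses only existing annotations: a multi-task framework adds organ boundary prediction to improve segmentation accuracy without any unannotated data.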

Details

Motivation: Medical data are hard to obtain and annotate, and semi-supervised methods depend on unannotated data, so their effectiveness declines when such data are scarce.

Method: BoundarySeg adds organ boundary prediction as an auxiliary task and uses the consistency between the two task predictions as additional supervision.

Result: In low-data regimes, performance is comparable to or exceeds that of state-of-the-art semi-supervised methods.

Conclusion: BoundarySeg offers an efficient medical image segmentation approach that does not rely on unannotated data.

Abstract: Obtaining large-scale medical data, annotated or unannotated, is challenging due to stringent privacy regulations and data protection policies. In addition, annotating medical images requires that domain experts manually delineate anatomical structures, making the process both time-consuming and costly. As a result, semi-supervised methods have gained popularity for reducing annotation costs. However, the performance of semi-supervised methods is heavily dependent on the availability of unannotated data, and their effectiveness declines when such data are scarce or absent. To overcome this limitation, we propose a simple, yet effective and computationally efficient approach for medical image segmentation that leverages only existing annotations. We propose BoundarySeg, a multi-task framework that incorporates organ boundary prediction as an auxiliary task to full organ segmentation, leveraging consistency between the two task predictions to provide additional supervision. This strategy improves segmentation accuracy, especially in low data regimes, allowing our method to achieve performance comparable to or exceeding state-of-the-art semi-supervised approaches, all without relying on unannotated data or increasing computational demands. Code will be released upon acceptance.
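The idea of checking a boundary prediction against a segmentation prediction can be illustrated with a small sketch. The abstract does not specify how BoundarySeg measures consistency, so everything below is an assumption: the 4-neighbour boundary rule, the mismatch-fraction penalty, and the names `boundary_from_mask` and `consistency_penalty` are hypothetical and chosen only to make the concept concrete.

```python
def boundary_from_mask(mask):
    """Return the boundary of a binary 2D mask (list of rows of 0/1).

    Assumed rule: a foreground pixel is on the boundary if any of its
    4-neighbours is background or lies outside the image.
    """
    h, w = len(mask), len(mask[0])
    boundary = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < h and 0 <= nx < w)
                if outside or not mask[ny][nx]:
                    boundary[y][x] = 1
                    break
    return boundary


def consistency_penalty(seg_pred, boundary_pred):
    """Fraction of pixels where the predicted boundary disagrees with the
    boundary implied by the predicted segmentation (hypothetical measure)."""
    implied = boundary_from_mask(seg_pred)
    total = len(seg_pred) * len(seg_pred[0])
    mismatches = sum(
        1
        for row_i, row_b in zip(implied, boundary_pred)
        for i, b in zip(row_i, row_b)
        if i != b
    )
    return mismatches / total


seg = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
ring = boundary_from_mask(seg)          # 8 boundary pixels around the centre
print(consistency_penalty(seg, ring))   # 0.0
```

In a trained network the two predictions would be soft probability maps and the penalty would need to be differentiable; the binary version here only shows what "consistent" means for the two tasks.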

[4] Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models

Danush Kumar Venkatesh,Isabel Funke,Micha Pfeiffer,Fiona Kolbinger,Hanna Maria Schmeiser,Juergen Weitz,Marius Distler,Stefanie Speidel

Main category: cs.CV

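TL;DR: A two-stage, text-conditioned diffusion method generates high-fidelity surgical videos to address data imbalance. A rejection sampling strategy selects synthetic samples to augment existing datasets, substantially improving downstream task performance.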

Details

Motivation: Surgical video datasets suffer from severe data imbalance, which hinders the development of high-performing models, so synthetic surgical videos are needed to address the problem.

Method: A two-stage, text-conditioned diffusion approach combines a 2D latent diffusion model with temporal attention layers to generate high-fidelity surgical videos, and a rejection sampling strategy selects the most suitable synthetic samples.

Result: On two downstream tasks, surgical action recognition and intra-operative event prediction, adding the synthetic videos substantially improves model performance.

Conclusion: The method mitigates data imbalance in surgical video datasets, and the open-source implementation supports further research.

Abstract: Computer-assisted interventions can improve intra-operative guidance, particularly through deep learning methods that harness the spatiotemporal information in surgical videos. However, the severe data imbalance often found in surgical video datasets hinders the development of high-performing models. In this work, we aim to overcome the data imbalance by synthesizing surgical videos. We propose a unique two-stage, text-conditioned diffusion-based method to generate high-fidelity surgical videos for under-represented classes. Our approach conditions the generation process on text prompts and decouples spatial and temporal modeling by utilizing a 2D latent diffusion model to capture spatial content and then integrating temporal attention layers to ensure temporal consistency. Furthermore, we introduce a rejection sampling strategy to select the most suitable synthetic samples, effectively augmenting existing datasets to address class imbalance. We evaluate our method on two downstream tasks, surgical action recognition and intra-operative event prediction, demonstrating that incorporating synthetic videos from our approach substantially enhances model performance. We open-source our implementation at https://gitlab.com/nct_tso_public/surgvgen.

[5] Few-Shot Learning of Visual Compositional Concepts through Probabilistic Schema Induction

Andrew Jun Lee,Taylor Webb,Trevor Bihl,Keith Holyoak,Hongjing Lu

Main category: cs.CV

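TL;DR: The paper proposes Probabilistic Schema Induction (PSI), a prototype model that uses deep learning to perform analogical mapping over a handful of structured examples and form compositional concepts. PSI produces human-like learning performance and outperforms the control models.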

Details

Motivation: Humans can quickly learn new visual concepts from limited examples, whereas traditional models typically use unstructured feature vectors and cannot capture the structured representations and analogical mapping involved in compositional concept learning.

Method: PSI combines deep learning with structured representations. It performs analogical mapping by weighing object-level similarity against relational similarity and by amplifying relations relevant to classification.

Result: PSI produces human-like learning performance and outperforms both a prototype model that uses unstructured feature vectors and a variant with weaker structured representations.

Conclusion: Structured representations and analogical mapping are critical to rapid learning of compositional visual concepts, and deep learning can be used to build psychological models.

Abstract: The ability to learn new visual concepts from limited examples is a hallmark of human cognition. While traditional category learning models represent each example as an unstructured feature vector, compositional concept learning is thought to depend on (1) structured representations of examples (e.g., directed graphs consisting of objects and their relations) and (2) the identification of shared relational structure across examples through analogical mapping. Here, we introduce Probabilistic Schema Induction (PSI), a prototype model that employs deep learning to perform analogical mapping over structured representations of only a handful of examples, forming a compositional concept called a schema. In doing so, PSI relies on a novel conception of similarity that weighs object-level similarity and relational similarity, as well as a mechanism for amplifying relations relevant to classification, analogous to selective attention parameters in traditional models. We show that PSI produces human-like learning performance and outperforms two controls: a prototype model that uses unstructured feature vectors extracted from a deep learning model, and a variant of PSI with weaker structured representations. Notably, we find that PSI's human-like performance is driven by an adaptive strategy that increases relational similarity over object-level similarity and upweights the contribution of relations that distinguish classes. These findings suggest that structured representations and analogical mapping are critical to modeling rapid human-like learning of compositional visual concepts, and demonstrate how deep learning can be leveraged to create psychological models.

[6] Large-Scale Gaussian Splatting SLAM

Zhe Xin,Chenyang Wu,Penghui Huang,Yanyong Zhang,Yinian Mao,Guoquan Huang

Main category: cs.CV

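TL;DR: LSG-SLAM is a large-scale visual SLAM method based on 3D Gaussian Splatting (3DGS) with stereo cameras. A multi-modality strategy and feature-alignment constraints improve robustness, and submaps allow reconstruction of large-scale scenes with limited memory.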

Details

Motivation: Existing NeRF- and 3DGS-based visual SLAM methods usually require RGBD sensors and work only indoors. Robust reconstruction in large-scale outdoor scenes remains underexplored.

Method: LSG-SLAM estimates prior poses with a multi-modality strategy, introduces feature-alignment warping constraints to reduce the adverse effects of appearance similarity in rendering losses, and handles unbounded scenes with continuous Gaussian Splatting submaps. Pose optimization and a structure refinement module improve reconstruction quality.

Result: On the EuRoC and KITTI datasets, LSG-SLAM outperforms existing neural, 3DGS-based, and traditional approaches.

Conclusion: LSG-SLAM provides an efficient and robust visual SLAM solution for large-scale outdoor scenes.

Abstract: The recently developed Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown encouraging and impressive results for visual SLAM. However, most representative methods require RGBD sensors and are only available for indoor environments. The robustness of reconstruction in large-scale outdoor scenarios remains unexplored. This paper introduces a large-scale 3DGS-based visual SLAM with stereo cameras, termed LSG-SLAM. The proposed LSG-SLAM employs a multi-modality strategy to estimate prior poses under large view changes. In tracking, we introduce feature-alignment warping constraints to alleviate the adverse effects of appearance similarity in rendering losses. For the scalability of large-scale scenarios, we introduce continuous Gaussian Splatting submaps to tackle unbounded scenes with limited memory. Loops are detected between GS submaps by place recognition and the relative pose between looped keyframes is optimized utilizing rendering and feature warping losses. After the global optimization of camera poses and Gaussian points, a structure refinement module enhances the reconstruction quality. With extensive evaluations on the EuRoC and KITTI datasets, LSG-SLAM achieves superior performance over existing Neural, 3DGS-based, and even traditional approaches. Project page: https://lsg-slam.github.io.

[7] AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

Bin-Bin Gao,Yue Zhu,Jiangtao Yan,Yuezhi Cai,Weixi Zhang,Meng Wang,Jun Liu,Yong Liu,Lei Wang,Chengjie Wang

Main category: cs.CV

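TL;DR: AdaptCLIP is a CLIP-based visual anomaly detection method that learns visual and textual representations alternately and combines contextual and aligned residual features, enabling zero-/few-shot generalization across domains.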

Details

Motivation: Existing methods are limited by prompt template design, complex token interactions, or the need for additional fine-tuning. The goal is to improve the flexibility and performance of visual anomaly detection.

Method: AdaptCLIP adds three simple adapters (visual, textual, and prompt-query) that learn visual and textual representations alternately and combine contextual and aligned residual features.

Result: The method achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing methods.

Conclusion: AdaptCLIP is a simple and effective method that supports zero-/few-shot generalization across domains without additional training on the target domain.

Abstract: Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, complex token interactions, or requiring additional fine-tuning, resulting in limited flexibility. In this work, we present a simple yet effective method called AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters (a visual adapter, a textual adapter, and a prompt-query adapter) at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and is training-free on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods. We will make the code and model of AdaptCLIP available at https://github.com/gaobb/AdaptCLIP.

[8] DDFP: Data-dependent Frequency Prompt for Source Free Domain Adaptation of Medical Image Segmentation

Siqi Yin,Shaolei Liu,Manning Wang

Main category: cs.CV

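TL;DR: A new source-free domain adaptation (SFDA) framework combines preadaptation for high-quality pseudo-labels, a data-dependent frequency prompt, and a style-related layer fine-tuning strategy, outperforming state-of-the-art methods on cross-modality medical image segmentation.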

Details

Motivation: In source-free domain adaptation, pseudo-label quality is limited and training the whole model is inefficient, particularly when privacy policies restrict access to medical source data.

Method: Preadaptation produces a preadapted model, a data-dependent frequency prompt translates image style, and a style-related layer fine-tuning strategy trains the target model.

Result: On cross-modality abdominal and cardiac segmentation tasks, the method outperforms existing state-of-the-art methods.

Conclusion: The proposed method improves source-free domain adaptation and has application potential in medical image segmentation.

Abstract: Domain adaptation addresses the challenge of model performance degradation caused by domain gaps. In the typical setup for unsupervised domain adaptation, labeled data from a source domain and unlabeled data from a target domain are used to train a target model. However, access to labeled source domain data, particularly in medical datasets, can be restricted due to privacy policies. As a result, research has increasingly shifted to source-free domain adaptation (SFDA), which requires only a pretrained model from the source domain and unlabeled data from the target domain for adaptation. Existing SFDA methods often rely on domain-specific image style translation and self-supervision techniques to bridge the domain gap and train the target domain model. However, the quality of domain-specific style-translated images and pseudo-labels produced by these methods still leaves room for improvement. Moreover, training the entire model during adaptation can be inefficient under limited supervision. In this paper, we propose a novel SFDA framework to address these challenges. Specifically, to effectively mitigate the impact of domain gap in the initial training phase, we introduce preadaptation to generate a preadapted model, which serves as an initialization of target model and allows for the generation of high-quality enhanced pseudo-labels without introducing extra parameters. Additionally, we propose a data-dependent frequency prompt to more effectively translate target domain images into a source-like style. To further enhance adaptation, we employ a style-related layer fine-tuning strategy, specifically designed for SFDA, to train the target model using the prompted target domain images and pseudo-labels. Extensive experiments on cross-modality abdominal and cardiac SFDA segmentation tasks demonstrate that our proposed method outperforms existing state-of-the-art methods.

[9] VRU-CIPI: Crossing Intention Prediction at Intersections for Improving Vulnerable Road Users Safety

Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Quoc Dai Tran

Main category: cs.CV

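TL;DR: The VRU-CIPI framework predicts the crossing intentions of vulnerable road users (VRUs) with a GRU and Transformer self-attention, reaching 96.45% accuracy at a real-time inference speed of 33 FPS. Combined with I2V communication, it can improve intersection safety.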

Details

Motivation: Understanding and predicting the crossing intentions of VRUs at intersections reduces conflicts with vehicles and improves safety.

Method: A GRU captures the temporal dynamics of VRU movement, and a Transformer self-attention mechanism encodes contextual and spatial dependencies.

Result: The model reaches 96.45% accuracy on the UCF-VRU dataset with a real-time inference speed of 33 FPS.

Conclusion: VRU-CIPI predicts VRU intentions effectively and, combined with I2V communication, can improve intersection safety.

Abstract: Understanding and predicting human behavior in the wild, particularly at urban intersections, remains crucial for enhancing interaction safety between road users. Among the most critical behaviors are crossing intentions of Vulnerable Road Users (VRUs), where misinterpretation may result in dangerous conflicts with oncoming vehicles. In this work, we propose the VRU-CIPI framework with a sequential attention-based model designed to predict VRU crossing intentions at intersections. VRU-CIPI employs a Gated Recurrent Unit (GRU) to capture temporal dynamics in VRU movements, combined with a multi-head Transformer self-attention mechanism to encode contextual and spatial dependencies critical for predicting crossing direction. Evaluated on the UCF-VRU dataset, our proposed framework achieves state-of-the-art performance with an accuracy of 96.45% and a real-time inference speed of 33 frames per second. Furthermore, by integrating with Infrastructure-to-Vehicles (I2V) communication, our approach can proactively enhance intersection safety through timely activation of crossing signals and providing early warnings to connected vehicles, ensuring smoother and safer interactions for all road users.

[10] Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset

Zhe Shan,Lei Zhou,Liu Mao,Shaofan Chen,Chuanqiu Ren,Xia Xie

Main category: cs.CV

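TL;DR: This paper proposes a new remote sensing change detection task, non-registration change detection, motivated by emergencies such as natural disasters, anthropogenic accidents, and military strikes. The authors systematically describe eight real-world scenarios that can lead to non-registration problems and develop scenario-specific image transformation schemes that convert existing registered datasets into non-registered versions. Experiments show that non-registration severely degrades state-of-the-art methods.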

Details

Motivation: Emergencies such as natural disasters, anthropogenic accidents, and military strikes call for change detection on images that are not registered.

Method: Eight real-world scenarios are systematically described, and image transformation schemes are developed to convert a registered dataset into a non-registered version.

Result: Non-registration change detection severely degrades existing state-of-the-art methods.

Conclusion: Non-registration change detection is an important and challenging task that requires further research.

Abstract: In this study, we propose a novel remote sensing change detection task, non-registration change detection, to address the increasing number of emergencies such as natural disasters, anthropogenic accidents, and military strikes. First, in light of the limited discourse on the issue of non-registration change detection, we systematically propose eight scenarios that could arise in the real world and potentially contribute to the occurrence of non-registration problems. Second, we develop distinct image transformation schemes tailored to various scenarios to convert the available registration change detection dataset into a non-registration version. Finally, we demonstrate that non-registration change detection can cause catastrophic damage to the state-of-the-art methods. Our code and dataset are available at https://github.com/ShanZard/NRCD.

[11] CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection

Jiakun Deng,Kexuan Li,Xingye Cui,Jiaxuan Li,Chang Long,Tian Pu,Zhenming Peng

Main category: cs.CV

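TL;DR: A contour-aware and saliency priors embedding network (CSPENet) is proposed for infrared small target detection, addressing the shortcomings of existing methods in target localization and contour perception under dense clutter.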

Details

Motivation: Existing methods localize dim targets poorly and perceive contour information poorly under dense clutter, which limits detection performance.

Method: The SCPEM module captures the gradient characteristics of target contours and extracts a saliency prior and multi-scale structural priors. The DBPEA architecture embeds these priors, and the AGFEM module refines the feature representations.

Result: On the NUDT-SIRST, IRSTD-1k, and NUAA-SIRST datasets, CSPENet outperforms other state-of-the-art methods.

Conclusion: Contour awareness and prior embedding improve infrared small target detection performance.

Abstract: Infrared small target detection (ISTD) plays a critical role in a wide range of civilian and military applications. Existing methods suffer from deficiencies in the localization of dim targets and the perception of contour information under dense clutter environments, severely limiting their detection performance. To tackle these issues, we propose a contour-aware and saliency priors embedding network (CSPENet) for ISTD. We first design a surround-convergent prior extraction module (SCPEM) that effectively captures the intrinsic characteristic of target contour pixel gradients converging toward their center. This module concurrently extracts two collaborative priors: a boosted saliency prior for accurate target localization and multi-scale structural priors for comprehensively enriching contour detail representation. Building upon this, we propose a dual-branch priors embedding architecture (DBPEA) that establishes differentiated feature fusion pathways, embedding these two priors at optimal network positions to achieve performance enhancement. Finally, we develop an attention-guided feature enhancement module (AGFEM) to refine feature representations and improve saliency estimation accuracy. Experimental results on public datasets NUDT-SIRST, IRSTD-1k, and NUAA-SIRST demonstrate that our CSPENet outperforms other state-of-the-art methods in detection performance. The code is available at https://github.com/IDIP2025/CSPENet.

Hao Yang,Tao Tan,Shuai Tan,Weiqin Yang,Kunyan Cai,Calvin Chen,Yue Sun

Main category: cs.CV

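TL;DR: MambaControl is a new framework that combines selective state-space modelling with diffusion processes for high-fidelity prediction of medical image trajectories, achieving state-of-the-art performance in Alzheimer's disease prediction.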

Details

Motivation: Existing methods struggle to capture complex spatio-temporal dynamics while keeping anatomical structure consistent, especially in progressive disorders.

Method: Mamba-based long-range modelling is combined with graph-guided anatomical control, and Fourier-enhanced spectral graph representations capture spatial coherence and multiscale detail.

Result: The framework achieves state-of-the-art performance in Alzheimer's disease prediction. Quantitative and regional evaluations show improved prediction quality and anatomical fidelity.

Conclusion: MambaControl shows potential for personalised prognosis and clinical decision support.

Abstract: Modelling disease progression in precision medicine requires capturing complex spatio-temporal dynamics while preserving anatomical integrity. Existing methods often struggle with longitudinal dependencies and structural consistency in progressive disorders. To address these limitations, we introduce MambaControl, a novel framework that integrates selective state-space modelling with diffusion processes for high-fidelity prediction of medical image trajectories. To better capture subtle structural changes over time while maintaining anatomical consistency, MambaControl combines Mamba-based long-range modelling with graph-guided anatomical control to more effectively represent anatomical correlations. Furthermore, we introduce Fourier-enhanced spectral graph representations to capture spatial coherence and multiscale detail, enabling MambaControl to achieve state-of-the-art performance in Alzheimer's disease prediction. Quantitative and regional evaluations demonstrate improved progression prediction quality and anatomical fidelity, highlighting its potential for personalised prognosis and clinical decision support.

[13] TKFNet: Learning Texture Key Factor Driven Feature for Facial Expression Recognition

Liqian Deng

Main category: cs.CV

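TL;DR: This paper proposes a facial expression recognition (FER) framework based on Texture Key Driver Factors (TKDF), which improves recognition through localized texture features and Dual Contextual Information Filtering (DCIF).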

Details

Motivation: Facial expression recognition in the wild is challenging because expression-related features are subtle and localized and facial appearance varies in complex ways.

Method: The architecture comprises a Texture-Aware Feature Extractor (TAFE) and Dual Contextual Information Filtering (DCIF). TAFE uses a ResNet-based backbone enhanced with multi-branch attention to extract texture features, and DCIF refines them through adaptive pooling and attention mechanisms.

Result: The method achieves state-of-the-art performance on the RAF-DB and KDEF datasets, supporting the effectiveness and robustness of TKDF.

Conclusion: Incorporating TKDF improves FER performance and offers a new approach to expression recognition in complex environments.

Abstract: Facial expression recognition (FER) in the wild remains a challenging task due to the subtle and localized nature of expression-related features, as well as the complex variations in facial appearance. In this paper, we introduce a novel framework that explicitly focuses on Texture Key Driver Factors (TKDF), localized texture regions that exhibit strong discriminative power across emotional categories. By carefully observing facial image patterns, we identify that certain texture cues, such as micro-changes in skin around the brows, eyes, and mouth, serve as primary indicators of emotional dynamics. To effectively capture and leverage these cues, we propose a FER architecture comprising a Texture-Aware Feature Extractor (TAFE) and Dual Contextual Information Filtering (DCIF). TAFE employs a ResNet-based backbone enhanced with multi-branch attention to extract fine-grained texture representations, while DCIF refines these features by filtering context through adaptive pooling and attention mechanisms. Experimental results on RAF-DB and KDEF datasets demonstrate that our method achieves state-of-the-art performance, verifying the effectiveness and robustness of incorporating TKDFs into FER pipelines.

[14] APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds

Yuan Gao,Shaobo Xia,Sheng Nie,Cheng Wang,Xiaohuan Xi,Bisheng Yang

Main category: cs.CV

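TL;DR: APCoTTA is a continual test-time adaptation method for semantic segmentation of ALS point clouds. Dynamic trainable layer selection, an entropy-based consistency loss, and random parameter interpolation address domain shift and catastrophic forgetting, yielding clear gains on two new benchmarks.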

Details

Motivation: In real-world use, ALS point cloud segmentation models degrade when the environment or sensor changes, yet CTTA research on ALS data is limited and standardized datasets are lacking.

Method: A dynamic trainable layer selection module, an entropy-based consistency loss, and a random parameter interpolation mechanism balance target adaptation against retention of source knowledge.

Result: On the two new benchmarks, ISPRSC and H3DC, APCoTTA improves mIoU by approximately 9% and 14%, respectively, over direct inference.

Conclusion: APCoTTA addresses domain adaptation for ALS point cloud segmentation and provides new benchmarks and open-source code.

Abstract: Airborne laser scanning (ALS) point cloud segmentation is a fundamental task for large-scale 3D scene understanding. In real-world applications, models are typically fixed after training. However, domain shifts caused by changes in the environment, sensor types, or sensor degradation often lead to a decline in model performance. Continuous Test-Time Adaptation (CTTA) offers a solution by adapting a source-pretrained model to evolving, unlabeled target domains. Despite its potential, research on ALS point clouds remains limited, facing challenges such as the absence of standardized datasets and the risk of catastrophic forgetting and error accumulation during prolonged adaptation. To tackle these challenges, we propose APCoTTA, the first CTTA method tailored for ALS point cloud semantic segmentation. We propose a dynamic trainable layer selection module. This module utilizes gradient information to select low-confidence layers for training, and the remaining layers are kept frozen, mitigating catastrophic forgetting. To further reduce error accumulation, we propose an entropy-based consistency loss. By filtering out unreliable samples based on entropy, we apply the consistency loss only to the reliable samples, enhancing model stability. In addition, we propose a random parameter interpolation mechanism, which randomly blends parameters from the selected trainable layers with those of the source model. This approach helps balance target adaptation and source knowledge retention, further alleviating forgetting. Finally, we construct two benchmarks, ISPRSC and H3DC, to address the lack of CTTA benchmarks for ALS point cloud segmentation. Experimental results demonstrate that APCoTTA achieves the best performance on two benchmarks, with mIoU improvements of approximately 9% and 14% over direct inference. The new benchmarks and code are available at https://github.com/Gaoyuan2/APCoTTA.
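The random parameter interpolation described in this abstract can be pictured with a small sketch. The abstract gives no formula, so the scheme below is an assumption rather than the authors' method: each adapted parameter is, with probability `restore_prob`, replaced by a weighted blend of its source and adapted values. The names `random_interpolate`, `restore_prob`, and `alpha` are hypothetical.

```python
import random


def random_interpolate(adapted, source, restore_prob=0.1, alpha=0.5, rng=None):
    """Randomly blend adapted parameters with source parameters.

    Assumed scheme (not specified in the abstract): each parameter is,
    with probability `restore_prob`, replaced by
    alpha * source + (1 - alpha) * adapted; otherwise it is left unchanged.
    """
    if len(adapted) != len(source):
        raise ValueError("parameter lists must have the same length")
    rng = rng or random.Random()
    blended = []
    for a, s in zip(adapted, source):
        if rng.random() < restore_prob:
            blended.append(alpha * s + (1.0 - alpha) * a)
        else:
            blended.append(a)
    return blended


source_weights = [0.0, 0.0, 0.0, 0.0]
adapted_weights = [2.0, 2.0, 2.0, 2.0]

# restore_prob=1.0 blends every parameter; alpha=0.5 lands halfway.
print(random_interpolate(adapted_weights, source_weights, restore_prob=1.0))
# [1.0, 1.0, 1.0, 1.0]
```

Pulling a random subset of weights back toward the source model after each update is one way to limit drift during prolonged adaptation, which matches the stated goal of balancing target adaptation and source knowledge retention.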

[15] High Quality Underwater Image Compression with Adaptive Correction and Codebook-based Augmentation

Yimin Zhou,Yichong Xia,Sicheng Pan,Bin Chen,Baoyi An,Haoqian Wang,Zhi Wang,Yaowei Wang,Zikun Zhou

Main category: cs.CV

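TL;DR: HQUIC is a new underwater image compression method that adaptively predicts attenuation coefficients and global light information and dynamically weights multi-scale frequency components to improve compression efficiency.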

Details

Motivation: Existing underwater image compression algorithms do not fully exploit the characteristics specific to underwater scenes, which leads to suboptimal performance.

Method: HQUIC uses an ALTC module to adaptively predict attenuation coefficients and global light information, a codebook-based auxiliary branch to extract common objects, and dynamic weighting of multi-scale frequency components.

Result: Evaluations on diverse underwater datasets show that HQUIC outperforms existing compression methods.

Conclusion: A design tailored to underwater imagery improves underwater image compression performance.

Abstract: With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terrestrial images, resulting in suboptimal performance. To address this limitation, we introduce HQUIC, designed to exploit underwater-image-specific features for enhanced compression efficiency. HQUIC employs an ALTC module to adaptively predict the attenuation coefficients and global light information of the images, which effectively mitigates the issues caused by the differences in lighting and tone existing in underwater images. Subsequently, HQUIC employs a codebook as an auxiliary branch to extract the common objects within underwater images and enhances the performance of the main branch. Furthermore, HQUIC dynamically weights multi-scale frequency components, prioritizing information critical for distortion quality while discarding redundant details. Extensive evaluations on diverse underwater datasets demonstrate that HQUIC outperforms state-of-the-art compression methods.

[16] PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

Long Cheng,Jiafei Duan,Yi Ru Wang,Haoquan Fang,Boyang Li,Yushan Huang,Elvis Wang,Ainaz Eftekhar,Jason Lee,Wentao Yuan,Rose Hendrix,Noah A. Smith,Fei Xia,Dieter Fox,Ranjay Krishna

Main category: cs.CV

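TL;DR: The paper introduces PointArena, a platform for evaluating multimodal pointing that comprises a dataset, an interactive comparison arena, and a robotic system. Experiments show that Molmo-72B performs best and that training targeted at pointing tasks significantly improves performance.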

Details

Motivation: Pointing is a fundamental mechanism for grounding language in visual scenes, but existing benchmarks focus only on referential object localization and lack diversity.

Method: The PointArena platform comprises the Point-Bench dataset, the Point-Battle interactive comparison arena, and the Point-Act robotic system.

Result: Molmo-72B performs best, proprietary models come close, and training targeted at pointing tasks significantly improves performance.

Conclusion: Precise pointing is critical for multimodal models to connect abstract reasoning with real-world action, and PointArena provides a comprehensive evaluation tool.

Abstract: Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset containing approximately 1,000 pointing tasks across five reasoning categories; (2) Point-Battle, an interactive, web-based arena facilitating blind, pairwise model comparisons, which has already gathered over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system allowing users to directly evaluate multimodal model pointing capabilities in practical settings. We conducted extensive evaluations of both state-of-the-art open-source and proprietary multimodal models. Results indicate that Molmo-72B consistently outperforms other models, though proprietary models increasingly demonstrate comparable performance. Additionally, we find that supervised training specifically targeting pointing tasks significantly enhances model performance. Across our multi-stage evaluation pipeline, we also observe strong correlations, underscoring the critical role of precise pointing capabilities in enabling multimodal models to effectively bridge abstract reasoning with concrete, real-world actions. Project page: https://pointarena.github.io/

[17] Descriptive Image-Text Matching with Graded Contextual Similarity

Jinhyun Jang,Jiyeong Lee,Kwanghoon Sohn

Main category: cs.CV

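TL;DR: The paper proposes descriptive image-text matching (DITM), which learns graded contextual similarity between images and text by exploring the descriptive flexibility of language, addressing the limitations of existing sparse binary supervision.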

Details

Motivation: Existing methods use sparse binary supervision that covers only a limited subset of image-text relationships, neglecting their inherent many-to-many correspondences and the implicit connections from general to specific descriptions. Method: DITM computes a descriptiveness score for each sentence (based on cumulative TF-IDF), dynamically relaxes the connectivity between positive and negative pairs, and aligns relevant sentences in a generic-to-specific order. Result: Experiments on MS-COCO, Flickr30K, and CxC show that DITM represents complex image-text relationships more effectively and improves the model's hierarchical reasoning ability on the HierarCaps benchmark. Conclusion: By moving beyond rigid binary supervision, DITM improves the discovery of both optimal matches and potential positive pairs, offering a more flexible approach to image-text matching. Abstract: Image-text matching aims to build correspondences between visual and textual data by learning their pairwise similarities. Most existing approaches have adopted sparse binary supervision, indicating whether a pair of images and sentences matches or not. However, such sparse supervision covers a limited subset of image-text relationships, neglecting their inherent many-to-many correspondences; an image can be described in numerous texts at different descriptive levels. Moreover, existing approaches overlook the implicit connections from general to specific descriptions, which form the underlying rationale for the many-to-many relationships between vision and language. In this work, we propose descriptive image-text matching, called DITM, to learn the graded contextual similarity between image and text by exploring the descriptive flexibility of language. We formulate the descriptiveness score of each sentence with cumulative term frequency-inverse document frequency (TF-IDF) to balance the pairwise similarity according to the keywords in the sentence. Our method leverages sentence descriptiveness to learn robust image-text matching in two key ways: (1) to refine the false negative labeling, dynamically relaxing the connectivity between positive and negative pairs, and (2) to build more precise matching, aligning a set of relevant sentences in a generic-to-specific order. By moving beyond rigid binary supervision, DITM enhances the discovery of both optimal matches and potential positive pairs. Extensive experiments on MS-COCO, Flickr30K, and CxC datasets demonstrate the effectiveness of our method in representing complex image-text relationships compared to state-of-the-art approaches. In addition, DITM enhances the hierarchical reasoning ability of the model, supported by extensive analysis on the HierarCaps benchmark.
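The cumulative TF-IDF descriptiveness score mentioned in the abstract can be illustrated with a minimal sketch. The whitespace tokenization, the smoothed IDF, and the function name `descriptiveness_scores` are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
from collections import Counter

def descriptiveness_scores(captions):
    """Return one cumulative TF-IDF score per caption (hypothetical formulation)."""
    docs = [caption.lower().split() for caption in captions]
    n_docs = len(docs)
    # Document frequency: in how many captions each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_counts = Counter(doc)
        total = 0.0
        for term, count in term_counts.items():
            # Smoothed IDF (an assumption): rare terms get a larger weight.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            total += count * idf  # accumulate over every term in the caption
        scores.append(total)
    return scores

captions = ["a dog", "a dog runs", "a brown dog runs on the beach"]
scores = descriptiveness_scores(captions)
```

Under this formulation, captions with more and rarer keywords receive higher scores, which matches the generic-to-specific ordering the abstract describes.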

[18] From Air to Wear: Personalized 3D Digital Fashion with AR/VR Immersive 3D Sketching

Ying Zang,Yuanqi Hu,Xinyu Chen,Yuxia Xu,Suhui Wang,Chunan Yu,Lanyun Zhu,Deyi Ji,Xin Xu,Tianrun Chen

Main category: cs.CV

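TL;DR: The paper proposes a 3D sketch-driven 3D garment generation framework that lowers the technical barrier to virtual garment design, letting ordinary users create high-quality digital clothing.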

Details

Motivation: With AR/VR devices becoming widespread, users want to express themselves through virtual fashion, but existing 3D garment design tools have steep technical barriers and limited data. Method: The system combines a conditional diffusion model, a sketch encoder trained in a shared latent space, and an adaptive curriculum learning strategy to interpret imprecise free-hand input and generate realistic, personalized garments. Result: Experiments and user studies show that the method significantly outperforms existing baselines in fidelity and usability. Conclusion: The framework is promising for democratized fashion design on next-generation consumer platforms. Abstract: In the era of immersive consumer electronics, such as AR/VR headsets and smart devices, people increasingly seek ways to express their identity through virtual fashion. However, existing 3D garment design tools remain inaccessible to everyday users due to steep technical barriers and limited data. In this work, we introduce a 3D sketch-driven 3D garment generation framework that empowers ordinary users - even those without design experience - to create high-quality digital clothing through simple 3D sketches in AR/VR environments. By combining a conditional diffusion model, a sketch encoder trained in a shared latent space, and an adaptive curriculum learning strategy, our system interprets imprecise, free-hand input and produces realistic, personalized garments. To address the scarcity of training data, we also introduce KO3DClothes, a new dataset of paired 3D garments and user-created sketches. Extensive experiments and user studies confirm that our method significantly outperforms existing baselines in both fidelity and usability, demonstrating its promise for democratized fashion design on next-generation consumer platforms.

[19] Application of YOLOv8 in monocular downward multiple Car Target detection

Shijie Lyu

Main category: cs.CV

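TL;DR: The paper proposes an improved object detection network based on YOLOv8 that combines structural reparameterization with a bidirectional pyramid structure network model, significantly improving the efficiency and accuracy of detecting multi-scale, small, and remote objects.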

Details

Motivation: Current object detection approaches in autonomous driving suffer from high costs, vulnerability to weather and lighting conditions, and limited resolution. Method: Structural reparameterization, a bidirectional pyramid structure network model, and a novel detection pipeline are integrated into the YOLOv8 framework. Result: The improved model reaches a detection accuracy of 65% and performs well on multi-scale and small-object detection. Conclusion: The model has substantial potential for real-world applications such as autonomous driving competitions, particularly in single-target and small-object detection scenarios. Abstract: Autonomous driving technology is progressively transforming traditional car driving methods, marking a significant milestone in modern transportation. Object detection serves as a cornerstone of autonomous systems, playing a vital role in enhancing driving safety, enabling autonomous functionality, improving traffic efficiency, and facilitating effective emergency responses. However, current technologies such as radar for environmental perception, cameras for road perception, and vehicle sensor networks face notable challenges, including high costs, vulnerability to weather and lighting conditions, and limited resolution. To address these limitations, this paper presents an improved autonomous target detection network based on YOLOv8. By integrating structural reparameterization technology, a bidirectional pyramid structure network model, and a novel detection pipeline into the YOLOv8 framework, the proposed approach achieves highly efficient and precise detection of multi-scale, small, and remote objects. Experimental results demonstrate that the enhanced model can effectively detect both large and small objects with a detection accuracy of 65%, showcasing significant advancements over traditional methods. This improved model holds substantial potential for real-world applications and is well-suited for autonomous driving competitions, such as the Formula Student Autonomous China (FSAC), particularly excelling in scenarios involving single-target and small-object detection.

[20] ORL-LDM: Offline Reinforcement Learning Guided Latent Diffusion Model Super-Resolution Reconstruction

Shijie Lyu

Main category: cs.CV

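TL;DR: The paper proposes a reinforcement learning-based fine-tuning method for latent diffusion models (LDMs) for remote sensing image super-resolution, significantly improving image quality.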

Details

Motivation: Existing deep learning methods are limited in handling complex scenes and preserving image details, so a more effective solution is needed. Method: A reinforcement learning environment (states, actions, rewards) is constructed, and proximal policy optimization (PPO) is used to optimize the decision objectives during the LDM's reverse denoising process. Result: On the RESISC45 dataset, PSNR increases by 3-4 dB, SSIM improves by 0.08-0.11, and LPIPS decreases by 0.06-0.10, with the largest gains in structured and complex scenes. Conclusion: The method effectively improves super-resolution quality and adaptability across scenes. Abstract: With the rapid advancement of remote sensing technology, super-resolution image reconstruction is of great research and practical significance. Existing deep learning methods have made progress but still face limitations in handling complex scenes and preserving image details. This paper proposes a reinforcement learning-based latent diffusion model (LDM) fine-tuning method for remote sensing image super-resolution. The method constructs a reinforcement learning environment with states, actions, and rewards, optimizing decision objectives through proximal policy optimization (PPO) during the reverse denoising process of the LDM. Experiments on the RESISC45 dataset show significant improvements over the baseline model in PSNR, SSIM, and LPIPS, with PSNR increasing by 3-4 dB, SSIM improving by 0.08-0.11, and LPIPS reducing by 0.06-0.10, particularly in structured and complex natural scenes. The results demonstrate the method's effectiveness in enhancing super-resolution quality and adaptability across scenes.
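To put the reported PSNR gain in context, the sketch below computes PSNR from the mean squared error using the standard definition, PSNR = 10 log10(MAX^2 / MSE). Representing images as flat pixel sequences and the function name `psnr` are simplifications assumed here, not taken from the paper.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)

baseline = psnr([0, 0, 0, 0], [2, 2, 2, 2])   # MSE = 4
improved = psnr([0, 0, 0, 0], [2, 2, 0, 0])   # MSE = 2
gain_db = improved - baseline                  # 10 * log10(2), about 3.01 dB
```

Because PSNR is logarithmic in the MSE, a gain of about 3 dB corresponds to roughly halving the mean squared error.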

[21] DeepSeqCoco: A Robust Mobile Friendly Deep Learning Model for Detection of Diseases in Cocos nucifera

Miit Daga,Dhriti Parikh,Swarna Priya Ramu

Main category: cs.CV

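TL;DR: DeepSeqCoco is a deep learning-based model for automatic and accurate disease identification from coconut tree images. It reaches up to 99.5% accuracy, outperforming existing models, and substantially reduces training and prediction time.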

Details

Motivation: Coconut tree diseases pose a serious threat to agricultural yield, especially in developing countries, where conventional practices make early diagnosis and intervention difficult. Method: The deep learning model DeepSeqCoco is tested under several optimizer settings (SGD, Adam, and hybrid configurations) to balance accuracy, loss minimization, and computational cost. Result: The model reaches 99.5% accuracy; the hybrid SGD-Adam configuration gives the lowest validation loss (2.81%), and training and prediction times drop by up to 18% and 85%, respectively. Conclusion: DeepSeqCoco demonstrates the potential of AI in precision agriculture, providing a scalable and efficient disease monitoring system. Abstract: Coconut tree diseases are a serious risk to agricultural yield, particularly in developing countries where conventional farming practices restrict early diagnosis and intervention. Current disease identification methods are manual, labor-intensive, and non-scalable. In response to these limitations, we propose DeepSeqCoco, a deep learning-based model for accurate and automatic disease identification from coconut tree images. The model was tested under various optimizer settings, such as SGD, Adam, and hybrid configurations, to identify the optimal balance between accuracy, minimization of loss, and computational cost. Results from experiments indicate that DeepSeqCoco can achieve as much as 99.5% accuracy (up to 5% higher than existing models), with the hybrid SGD-Adam configuration showing the lowest validation loss of 2.81%. It also shows a drop of up to 18% in training time and up to 85% in prediction time for input images. The results point to the promise of the model to improve precision agriculture through an AI-based, scalable, and efficient disease monitoring system.

[22] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

Bingda Tang,Boyang Zheng,Xichen Pan,Sayak Paul,Saining Xie

Main category: cs.CV

TL;DR: The paper explores in detail the design space of deeply fusing LLMs and DiTs for text-to-image synthesis, filling a gap in existing research.

Details

Motivation: Existing studies focus mostly on overall system performance and leave design details and training recipes undisclosed, so the real potential of the approach is unclear. Method: An empirical study with controlled comparisons analyzes key design choices and provides a reproducible recipe for training at scale. Result: The study offers meaningful data points and practical guidelines that lay a foundation for multi-modal generation research. Conclusion: The paper fills a research gap and provides a clear reference for future work on multi-modal generation. Abstract: This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.

[23] Advances in Radiance Field for Dynamic Scene: From Neural Field to Gaussian Field

Jinlong Fan,Xuepu Zeng,Jing Zhang,Mingming Gong,Yuxiang Yang,Dacheng Tao

Main category: cs.CV

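TL;DR: This survey reviews recent advances in dynamic scene representation and reconstruction, analyzing more than 200 papers based on neural radiance fields and 3D Gaussian splatting and proposing a unified taxonomy.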

Details

Motivation: Dynamic scene representation and reconstruction have advanced markedly in recent years, but a systematic survey is lacking. This paper aims to fill that gap and give researchers and practitioners a comprehensive reference. Method: More than 200 papers are analyzed, categorized, and evaluated in terms of motion representation paradigms, reconstruction techniques, auxiliary information integration, and regularization approaches. Result: The survey organizes existing methods under a unified representational framework, summarizes their strengths and weaknesses, and identifies future research directions. Conclusion: The paper provides a systematic survey of dynamic scene reconstruction, clarifying current challenges and future potential. Abstract: Dynamic scene representation and reconstruction have undergone transformative advances in recent years, catalyzed by breakthroughs in neural radiance fields and 3D Gaussian splatting techniques. While initially developed for static environments, these methodologies have rapidly evolved to address the complexities inherent in 4D dynamic scenes through an expansive body of research. Coupled with innovations in differentiable volumetric rendering, these approaches have significantly enhanced the quality of motion representation and dynamic scene reconstruction, thereby garnering substantial attention from the computer vision and graphics communities. This survey presents a systematic analysis of over 200 papers focused on dynamic scene representation using radiance fields, spanning the spectrum from implicit neural representations to explicit Gaussian primitives. We categorize and evaluate these works through multiple critical lenses: motion representation paradigms, reconstruction techniques for varied scene dynamics, auxiliary information integration strategies, and regularization approaches that ensure temporal consistency and physical plausibility. We organize diverse methodological approaches under a unified representational framework, concluding with a critical examination of persistent challenges and promising research directions. By providing this comprehensive overview, we aim to establish a definitive reference for researchers entering this rapidly evolving field while offering experienced practitioners a systematic understanding of both conceptual principles and practical frontiers in dynamic scene reconstruction.

[24] PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

Ijazul Haq,Yingjie Zhang,Irfan Ali Khan

Main category: cs.CV

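TL;DR: The paper evaluates large multimodal models (LMMs) on OCR for the low-resource Pashto language, develops a synthetic dataset, PsOCR, and compares open-source and closed-source models.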

Details

Motivation: NLP for Pashto faces challenges such as a cursive script and scarce datasets, so the performance of LMMs on OCR needs to be evaluated and improved. Method: A synthetic dataset, PsOCR, of one million images covering many fonts and layouts is developed, and 11 LMMs are tested. Result: Gemini performs best among all models, while Qwen-7B stands out among open-source models. Conclusion: The study provides a benchmark for Pashto OCR and lays a foundation for research on similar scripts such as Arabic, Persian, and Urdu. Abstract: This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at https://github.com/zirak-ai/PashtoOCR.

[25] ToonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars

Rui-Yang Ju,Sheng-Yen Huang,Yi-Ping Hung

Main category: cs.CV

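TL;DR: ToonifyGB is a two-stage framework that generates diverse stylized 3D head avatars from monocular video by combining an improved StyleGAN with 3D Gaussian blendshapes.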

Details

Motivation: The goal is to extend the Toonify framework to synthesize stylized 3D head avatars and to remove the limitation of standard StyleGAN, which requires cropping aligned faces at a fixed resolution. Method: Stage 1 uses an improved StyleGAN to generate a stylized video; Stage 2 learns a stylized neutral head model and expression blendshapes from that video. Result: ToonifyGB is validated on the Arcane and Pixar styles, showing efficient, high-quality animation generation. Conclusion: ToonifyGB can efficiently render stylized avatars with arbitrary expressions and is suited to real-time applications. Abstract: The introduction of 3D Gaussian blendshapes has enabled the real-time reconstruction of animatable head avatars from monocular video. Toonify, a StyleGAN-based framework, has become widely used for facial image stylization. To extend Toonify for synthesizing diverse stylized 3D head avatars using Gaussian blendshapes, we propose an efficient two-stage framework, ToonifyGB. In Stage 1 (stylized video generation), we employ an improved StyleGAN to generate the stylized video from the input video frames, which addresses the limitation of cropping aligned faces at a fixed resolution as preprocessing for normal StyleGAN. This process provides a more stable video, which enables Gaussian blendshapes to better capture the high-frequency details of the video frames, and efficiently generate high-quality animation in the next stage. In Stage 2 (Gaussian blendshapes synthesis), we learn a stylized neutral head model and a set of expression blendshapes from the generated video. By combining the neutral head model with expression blendshapes, ToonifyGB can efficiently render stylized avatars with arbitrary expressions. We validate the effectiveness of ToonifyGB on the benchmark dataset using two styles: Arcane and Pixar.

[26] MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

Yuncheng Guo,Xiaodong Gu

Main category: cs.CV

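TL;DR: The paper proposes MMRL and MMRL++, which use a shared modality-agnostic representation space and jointly optimize representation and class features to reduce overfitting of vision-language models on few-shot data and improve generalization.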

Details

Motivation: Large-scale pre-trained vision-language models overfit easily when adapted with few-shot data, which limits their generalization. Method: MMRL introduces a modality-agnostic representation space, projects space tokens into both the text and image encoders, and jointly optimizes class and representation features; MMRL++ further reduces parameters and enhances intra-modal interactions. Result: On 15 datasets, MMRL and MMRL++ both outperform existing methods, balancing task adaptation and generalization. Conclusion: By optimizing the representation space and feature interactions, MMRL and MMRL++ significantly improve generalization in few-shot learning. Abstract: Large-scale pre-trained Vision-Language Models (VLMs) have significantly advanced transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, undermining their ability to generalize to new tasks. To address this, we propose Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, modality-agnostic representation space. MMRL generates space tokens projected into both text and image encoders as representation tokens, enabling more effective cross-modal interactions. Unlike prior methods that mainly optimize class token features, MMRL inserts representation tokens into higher encoder layers--where task-specific features are more prominent--while preserving general knowledge in the lower layers. During training, both class and representation features are jointly optimized: a trainable projection layer is applied to representation tokens for task adaptation, while the projection layer for class token remains frozen to retain pre-trained knowledge. To further promote generalization, we introduce a regularization term aligning class and text features with the frozen VLM's zero-shot features. At inference, a decoupling strategy uses both class and representation features for base tasks, but only class features for novel tasks due to their stronger generalization. Building upon this, we propose MMRL++, a parameter-efficient and interaction-aware extension that significantly reduces trainable parameters and enhances intra-modal interactions--particularly across the layers of representation tokens--allowing gradient sharing and instance-specific information to propagate more effectively through the network. Extensive experiments on 15 datasets demonstrate that MMRL and MMRL++ consistently outperform state-of-the-art methods, achieving a strong balance between task-specific adaptation and generalization.

[27] Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

Yangfu Li,Hongjian Zhan,Tianyi Chen,Qi Liu,Yue Lu

Main category: cs.CV

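TL;DR: The paper proposes MoB, a dynamic visual token pruning method that quantifies the trade-off between objectives and optimizes budget allocation, significantly improving performance and efficiency.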

Details

Motivation: Existing visual token pruning methods use static strategies and overlook how the relative importance of their objectives varies across tasks, which leads to inconsistent performance. Method: An error bound is derived from the Hausdorff distance, ε-covering theory is used to reveal the trade-off between objectives, and the MoB framework recasts pruning as a bi-objective covering problem. Result: MoB preserves 96.4% of performance on LLaVA-1.5-7B using only 11.1% of the tokens and accelerates LLaVA-Next-7B by 1.3-1.5x. Conclusion: By balancing the objectives dynamically, MoB achieves efficient visual token pruning with stable performance across a range of vision-language tasks. Abstract: Existing visual token pruning methods target prompt alignment and visual preservation with static strategies, overlooking the varying relative importance of these objectives across tasks, which leads to inconsistent performance. To address this, we derive the first closed-form error bound for visual token pruning based on the Hausdorff distance, uniformly characterizing the contributions of both objectives. Moreover, leveraging $\epsilon$-covering theory, we reveal an intrinsic trade-off between these objectives and quantify their optimal attainment levels under a fixed budget. To practically handle this trade-off, we propose Multi-Objective Balanced Covering (MoB), which reformulates visual token pruning as a bi-objective covering problem. In this framework, the attainment trade-off reduces to budget allocation via greedy radius trading. MoB offers a provable performance bound and linear scalability with respect to the number of input visual tokens, enabling adaptation to challenging pruning scenarios. Extensive experiments show that MoB preserves 96.4% of performance for LLaVA-1.5-7B using only 11.1% of the original visual tokens and accelerates LLaVA-Next-7B by 1.3-1.5$\times$ with negligible performance loss. Additionally, evaluations on Qwen2-VL and Video-LLaVA confirm that MoB integrates seamlessly into advanced MLLMs and diverse vision-language tasks.
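The covering view of pruning can be illustrated with a generic sketch: the retained tokens form an ε-cover of the full set when every original token lies within ε of some retained one, and the smallest such ε is the directed Hausdorff distance from the full set to the retained subset. The farthest-point selection below is a standard k-center heuristic used here as a stand-in; it is not MoB's greedy radius trading, and the function names are hypothetical.

```python
import math

def covering_radius(points, kept):
    """Directed Hausdorff distance: worst-case distance from a point to its nearest kept point."""
    return max(min(math.dist(p, q) for q in kept) for p in points)

def greedy_cover(points, budget):
    """Farthest-point (k-center) selection under a fixed budget of kept points."""
    kept = [points[0]]
    # nearest[i] tracks the distance from points[i] to its closest kept point.
    nearest = [math.dist(p, kept[0]) for p in points]
    while len(kept) < min(budget, len(points)):
        # Keep the point that is currently covered worst.
        idx = max(range(len(points)), key=nearest.__getitem__)
        kept.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return kept

tokens = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
kept = greedy_cover(tokens, budget=2)
radius = covering_radius(tokens, kept)
```

A larger budget can only shrink the covering radius, which is the budget-versus-attainment trade-off the abstract refers to, shown here for a single objective.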

[28] IMITATE: Image Registration with Context for unknown time frame recovery

Ziad Kheil,Lucas Robinet,Laurent Risser,Soleakhena Ken

Main category: cs.CV

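TL;DR: The paper proposes an image registration method based on a conditional U-Net architecture for estimating images under unknown conditions, achieving artefact-free reconstruction of tumor motion in 4D-CT scans.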

Details

Motivation: Standard methods produce artefacts in 4D-CT scans because of irregular breathing, hysteresis, and poor correlation between the breathing signal and internal motion. Method: A conditional U-Net architecture performs registration using the conditional information in full, without needing a fixed image. Result: Artefact-free volumes are reconstructed on clinical 4D-CT data with real-time latencies. Conclusion: The method performs well under complex conditions such as tumor motion in radiotherapy, and the code is open source. Abstract: In this paper, we formulate a novel image registration formalism dedicated to the estimation of unknown condition-related images, based on two or more known images and their associated conditions. We show how to practically model this formalism by using a new conditional U-Net architecture, which fully takes into account the conditional information and does not need any fixed image. Our formalism is then applied to image moving tumors for radiotherapy treatment at different breathing amplitudes using 4D-CT (3D+t) scans in thoracoabdominal regions. This driving application is particularly complex as it requires stitching a collection of sequential 2D slices into several 3D volumes at different organ positions. Movement interpolation with standard methods then generates well-known reconstruction artefacts in the assembled volumes due to irregular patient breathing, hysteresis and poor correlation of breathing signal to internal motion. Results obtained on 4D-CT clinical data showcase artefact-free volumes achieved through real-time latencies. The code is publicly available at https://github.com/Kheil-Z/IMITATE.

[29] Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization

Yikang Wei

Main category: cs.CV

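TL;DR: The paper proposes a multi-source collaborative style augmentation and domain-invariant learning method (MCSAD) for federated domain generalization. It generates data in a broader style space and aligns features across domains, significantly improving generalization to unseen target domains.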

Details

Motivation: Under data decentralization, existing style augmentation methods cover a limited style space and cannot make full use of information from multiple source domains. Method: A multi-source collaborative style augmentation module generates data in a broader style space, and a domain-invariant model is learned through cross-domain feature alignment and classes relation ensemble distillation. Result: The method significantly outperforms existing federated domain generalization methods on multiple domain generalization datasets. Conclusion: Through collaborative style augmentation and domain-invariant learning, MCSAD effectively improves generalization to unseen target domains. Abstract: Federated domain generalization aims to learn a generalizable model from multiple decentralized source domains for deploying on the unseen target domain. The style augmentation methods have achieved great progress on domain generalization. However, the existing style augmentation methods either explore the data styles within an isolated source domain or interpolate the style information across existing source domains under the data decentralization scenario, which leads to limited style space. To address this issue, we propose a Multi-source Collaborative Style Augmentation and Domain-invariant learning method (MCSAD) for federated domain generalization. Specifically, we propose a multi-source collaborative style augmentation module to generate data in the broader style space. Furthermore, we conduct domain-invariant learning between the original data and augmented data by cross-domain feature alignment within the same class and classes relation ensemble distillation between different classes to learn a domain-invariant model. By alternately conducting collaborative style augmentation and domain-invariant learning, the model can generalize well on the unseen target domain. Extensive experiments on multiple domain generalization datasets indicate that our method significantly outperforms the state-of-the-art federated domain generalization methods.

[30] Modeling Saliency Dataset Bias

Matthias Kümmerer,Harneet Khanuja,Matthias Bethge

Main category: cs.CV

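TL;DR: The paper proposes a new architecture that addresses cross-dataset generalization in saliency prediction with a small number of dataset-specific parameters, significantly improving performance.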

Details

Motivation: Existing saliency prediction models generalize poorly across datasets, with a marked drop in performance, so dataset bias needs to be addressed. Method: An encoder-decoder structure is extended with fewer than 20 dataset-specific parameters that control mechanisms such as multi-scale structure, center bias, and fixation spread. Result: The model sets a new state of the art on the MIT/Tuebingen Saliency Benchmark; adapting only the dataset-specific parameters closes more than 75% of the generalization gap, with much of the improvement achieved with as few as 50 samples. Conclusion: The new architecture effectively addresses cross-dataset generalization and reveals complex multi-scale effects in spatial saliency. Abstract: Recent advances in image-based saliency prediction are approaching gold standard performance levels on existing benchmarks. Despite this success, we show that predicting fixations across multiple saliency datasets remains challenging due to dataset bias. We find a significant performance drop (around 40%) when models trained on one dataset are applied to another. Surprisingly, increasing dataset diversity does not resolve this inter-dataset gap, with close to 60% attributed to dataset-specific biases. To address this remaining generalization gap, we propose a novel architecture extending a mostly dataset-agnostic encoder-decoder structure with fewer than 20 dataset-specific parameters that govern interpretable mechanisms such as multi-scale structure, center bias, and fixation spread. Adapting only these parameters to new data accounts for more than 75% of the generalization gap, with a large fraction of the improvement achieved with as few as 50 samples. Our model sets a new state-of-the-art on all three datasets of the MIT/Tuebingen Saliency Benchmark (MIT300, CAT2000, and COCO-Freeview), even when purely generalizing from unrelated datasets, but with a substantial boost when adapting to the respective training datasets. The model also provides valuable insights into spatial saliency properties, revealing complex multi-scale effects that combine both absolute and relative sizes.

[31] VolE: A Point-cloud Framework for Food 3D Reconstruction and Volume Estimation

Umair Haroon,Ahmad AlMughrabi,Thanasis Zoumpekas,Ricardo Marques,Petia Radeva

Main category: cs.CV

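TL;DR: VolE is a mobile device-driven 3D reconstruction framework for accurate food volume estimation that needs no reference object or depth information and outperforms existing methods.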

Details

Motivation: Current food volume estimation methods are limited by mononuclear data, single-purpose hardware, or reliance on a reference object; VolE aims to remove these limitations. Method: A mobile device captures images and camera locations, AR capabilities are used to generate a 3D model, and food video segmentation produces the food masks. Result: VolE achieves a MAPE of 2.22% across multiple datasets, outperforming existing techniques. Conclusion: VolE performs well in food volume estimation and offers an efficient solution for medical nutrition management and health monitoring. Abstract: Accurate food volume estimation is crucial for medical nutrition management and health monitoring applications, but current food volume estimation methods are often limited by mononuclear data, leveraging single-purpose hardware such as 3D scanners, gathering sensor-oriented information such as depth information, or relying on camera calibration using a reference object. In this paper, we present VolE, a novel framework that leverages mobile device-driven 3D reconstruction to estimate food volume. VolE captures images and camera locations in free motion to generate precise 3D models, thanks to AR-capable mobile devices. To achieve real-world measurement, VolE is a reference- and depth-free framework that leverages food video segmentation for food mask generation. We also introduce a new food dataset encompassing the challenging scenarios absent in the previous benchmarks. Our experiments demonstrate that VolE outperforms the existing volume estimation techniques across multiple datasets by achieving 2.22% MAPE, highlighting its superior performance in food volume estimation.
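The MAPE metric quoted above is the mean absolute percentage error between ground-truth and estimated values. A minimal sketch of the standard definition follows; the sample volumes are made-up illustration values, not data from the paper.

```python
def mape(true_values, predicted_values):
    """Mean absolute percentage error, in percent."""
    if len(true_values) != len(predicted_values) or not true_values:
        raise ValueError("inputs must be non-empty and the same length")
    if any(t == 0 for t in true_values):
        raise ValueError("MAPE is undefined when a true value is zero")
    # Relative error of each estimate with respect to its ground truth.
    errors = [abs(p - t) / abs(t) for t, p in zip(true_values, predicted_values)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical ground-truth and estimated volumes in millilitres.
true_ml = [100.0, 200.0]
estimated_ml = [98.0, 204.0]
error_percent = mape(true_ml, estimated_ml)  # each estimate is off by 2%
```

Because each error is normalized by the true value, MAPE weights small and large food items equally, which makes it convenient for comparing volume estimates across very different portion sizes.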

[32] Data-Agnostic Augmentations for Unknown Variations: Out-of-Distribution Generalisation in MRI Segmentation

Puru Vaish,Felix Meister,Tobias Heimann,Christoph Brune,Jelmer M. Wolterink

Main category: cs.CV

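TL;DR: The paper examines the performance degradation of medical image segmentation models in real clinical settings and evaluates MixUp and Auxiliary Fourier Augmentation, which significantly improve generalization and robustness.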

Details

Motivation: Medical image segmentation models degrade in real clinical environments, and traditional data augmentation is not enough to handle diverse distribution shifts. Method: MixUp and Auxiliary Fourier Augmentation are evaluated systematically, with an analysis of how well they mitigate distribution shifts. Result: Both methods significantly improve generalization and robustness in cardiac MRI and prostate MRI segmentation and improve the learned feature representations. Conclusion: MixUp and Auxiliary Fourier Augmentation are simple, effective ways to improve the reliability of medical segmentation models in real-world settings. Abstract: Medical image segmentation models are often trained on curated datasets, leading to performance degradation when deployed in real-world clinical settings due to mismatches between training and test distributions. While data augmentation techniques are widely used to address these challenges, traditional visually consistent augmentation strategies lack the robustness needed for diverse real-world scenarios. In this work, we systematically evaluate alternative augmentation strategies, focusing on MixUp and Auxiliary Fourier Augmentation. These methods mitigate the effects of multiple variations without explicitly targeting specific sources of distribution shifts. We demonstrate how these techniques significantly improve out-of-distribution generalization and robustness to imaging variations across a wide range of transformations in cardiac cine MRI and prostate MRI segmentation. We quantitatively find that these augmentation methods enhance learned feature representations by promoting separability and compactness. Additionally, we highlight how their integration into nnU-Net training pipelines provides an easy-to-implement, effective solution for enhancing the reliability of medical segmentation models in real-world applications.
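MixUp, one of the two augmentations evaluated, blends pairs of training samples and their labels with a weight drawn from a Beta distribution. The sketch below shows the basic operation on flat lists; treating images and one-hot label maps as flat lists, the default `alpha`, and the function name are assumptions for illustration and do not reflect the nnU-Net integration described in the paper.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two samples and their labels with a Beta(alpha, alpha) mixing weight."""
    lam = rng.betavariate(alpha, alpha)
    # The same weight is applied to the inputs and to the (soft) labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)  # seeded only to make the example repeatable
image_a, label_a = [1.0, 1.0], [1.0, 0.0]   # hypothetical sample of class 0
image_b, label_b = [0.0, 0.0], [0.0, 1.0]   # hypothetical sample of class 1
mixed_image, mixed_label, lam = mixup(image_a, label_a, image_b, label_b, rng=rng)
```

With a small `alpha`, the Beta distribution concentrates near 0 and 1, so most mixed samples stay close to one of the two originals while still producing soft labels.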

[33] On the Interplay of Human-AI Alignment, Fairness, and Performance Trade-offs in Medical Imaging

Haozhe Luo,Ziyu Zhou,Zixin Shu,Aurélie Pahud de Mortanges,Robert Berke,Mauricio Reyes

Main category: cs.CV

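TL;DR: The study explores incorporating human insights to reduce AI bias in medical imaging and finds that moderate alignment improves fairness and generalization, though it must be balanced against performance.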

Details

Motivation: The work addresses bias in medical imaging AI and examines how Human-AI alignment affects fairness and generalization. Method: Human-AI alignment in medical imaging is studied systematically, with an analysis of its effect on fairness and generalization. Result: Moderate Human-AI alignment reduces fairness gaps and improves generalization, but excessive alignment can degrade performance. Conclusion: Human-AI alignment is an effective approach to building fair, robust, and generalizable medical AI systems, provided expert guidance is balanced with automated efficiency. Abstract: Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Our code is available at https://github.com/Roypic/Aligner.

[34] MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

Yanbo Ding

Main category: cs.CV

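TL;DR: MTVCrafter is a human image animation framework based on 4D motion sequences. Using 4DMoT and MV-DiT, it provides more flexible and robust motion control and significantly improves performance.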

Details

Motivation: Existing methods rely on 2D-rendered pose images, which limits generalization and discards 3D information; MTVCrafter aims to solve this. Method: 4DMoT quantizes 3D motion sequences into 4D motion tokens, and MV-DiT uses motion attention with 4D positional encodings to generate the animation. Result: MTVCrafter achieves an FID-VID of 6.98, surpassing the second-best method by 65%, and performs well across diverse open-world scenarios. Conclusion: MTVCrafter opens a new direction for human video generation and shows the potential of 4D motion modeling. Abstract: Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for human image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatio-temporal cues and avoid strict pixel-level alignment between pose image and character, enabling more flexible and disentangled control. Then, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings, MV-DiT can effectively leverage motion tokens as 4D compact yet expressive context for human image animation in the complex 3D world. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided human video generation. Experiments show that our MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, surpassing the second-best by 65%. Powered by robust motion tokens, MTVCrafter also generalizes well to diverse open-world characters (single/multiple, full/half-body) across various styles and scenarios. Our video demos and code are provided in the supplementary material and at this anonymous GitHub link: https://anonymous.4open.science/r/MTVCrafter-1B13.

[35] ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization

Wenhao Shen,Wanqi Yin,Xiaofeng Yang,Cheng Chen,Chaoyue Song,Zhongang Cai,Lei Yang,Hao Wang,Guosheng Lin

Main category: cs.CV

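TL;DR: ADHMR is a human mesh recovery method based on a diffusion model and preference optimization. It uses HMR-Scorer to assess prediction quality and optimize the model, significantly improving performance.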

Details

Motivation: Existing probabilistic methods for single-image human mesh recovery are often misaligned with 2D observations and are not robust to in-the-wild images. Method: HMR-Scorer is trained to assess prediction quality, a preference dataset is built with it, and the base model is fine-tuned with direct preference optimization. Result: ADHMR outperforms existing methods in experiments, and HMR-Scorer can also improve other HMR models through data cleaning. Conclusion: Through preference optimization and data cleaning, ADHMR significantly improves the accuracy and robustness of human mesh recovery. Abstract: Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns a Diffusion-based HMR model in a preference optimization manner. First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to finetune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods. Code is available at: https://github.com/shenwenhao01/ADHMR.

[36] Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot

Hao Lu,Jiaqi Tang,Jiyao Wang,Yunfan LU,Xu Cao,Qingyong Hu,Yin Wang,Yuting Zhang,Tianxin Xie,Yunpeng Zhang,Yong Chen,Jiayu. Gao,Bin Huang,Dengbo He,Shuiguang Deng,Hao Chen,Ying-Cong Chen

Main category: cs.CV

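TL;DR: This paper proposes SAGE DeeR, an intelligent driving cockpit agent with super-alignment, generalist, and self-eliciting capabilities, and validates its performance on a large-scale benchmark.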

Details

Motivation: An intelligent driving cockpit must meet different users' comfort, interaction, and safety needs, which calls for an agent that can adapt to diverse requirements. Method: Builds the SAGE DeeR agent with super alignment (personalized reactions), generalist ability (understanding multimodal inputs), and self-eliciting (eliciting implicit chains of thought in the language space), and establishes a large-scale benchmark. Result: SAGE DeeR reacts according to user preferences and scenarios, understands multimodal inputs, and improves its performance through self-eliciting. Conclusion: SAGE DeeR shows strong adaptability and performance in the intelligent driving cockpit and offers new ideas for future intelligent driving systems. Abstract: The intelligent driving cockpit, an important part of intelligent driving, needs to match different users' comfort, interaction, and safety needs. This paper aims to build a Super-Aligned and GEneralist DRiving agent, SAGE DeeR. Sage Deer achieves three highlights: (1) Super alignment: It achieves different reactions according to different people's preferences and biases. (2) Generalist: It can understand the multi-view and multi-mode inputs to reason the user's physiological indicators, facial emotions, hand movements, body movements, driving scenarios, and behavioral decisions. (3) Self-Eliciting: It can elicit implicit thought chains in the language space to further increase generalist and super-aligned abilities. Besides, we collected multiple data sets and built a large-scale benchmark. This benchmark measures the deer's perceptual decision-making ability and the super alignment's accuracy.

[37] Inferring Driving Maps by Deep Learning-based Trail Map Extraction

Michael Hubbertz,Pascal Colling,Qi Han,Tobias Meisen

Main category: cs.CV

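TL;DR: The paper proposes a novel offline mapping approach that integrates trail data (informal routes) and uses Transformer-based deep learning models to achieve efficient, sensor-agnostic map updates.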

Details

Motivation: High-definition (HD) maps are crucial for autonomous driving systems, but existing online mapping methods face challenges such as temporal consistency and sensor occlusion. Method: Aggregates trail data from the ego vehicle and other traffic participants and builds a global map with Transformer models, supporting continuous updates while remaining sensor-agnostic. Result: Validated on benchmark datasets, the method outperforms existing online mapping approaches and generalizes better. Conclusion: The method offers an efficient and robust mapping solution for autonomous driving systems. Abstract: High-definition (HD) maps offer extensive and accurate environmental information about the driving scene, making them a crucial and essential element for planning within autonomous driving systems. To avoid extensive efforts from manual labeling, methods for automating the map creation have emerged. Recent trends have moved from offline mapping to online mapping, ensuring availability and actuality of the utilized maps. While the performance has increased in recent years, online mapping still faces challenges regarding temporal consistency, sensor occlusion, runtime, and generalization. We propose a novel offline mapping approach that integrates trails - informal routes used by drivers - into the map creation process. Our method aggregates trail data from the ego vehicle and other traffic participants to construct a comprehensive global map using transformer-based deep learning models. Unlike traditional offline mapping, our approach enables continuous updates while remaining sensor-agnostic, facilitating efficient data transfer. Our method demonstrates superior performance compared to state-of-the-art online mapping approaches, achieving improved generalization to previously unseen environments and sensor configurations. We validate our approach on two benchmark datasets, highlighting its robustness and applicability in autonomous driving systems.

[38] HandReader: Advanced Techniques for Efficient Fingerspelling Recognition

Pavel Korotaev,Petr Surovtsev,Alexander Kapitanov,Karina Kvanchiani,Aleksandr Nagaev

Main category: cs.CV

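TL;DR: This paper presents HandReader, a group of architectures for fingerspelling recognition comprising three models (RGB, KP, RGB+KP) that achieve state-of-the-art results on multiple datasets, and releases Znaki, a new Russian fingerspelling dataset.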

Details

Motivation: Fingerspelling is an important component of sign language, but existing methods, which focus on the temporal dimension of videos, still leave room for improving accuracy. Method: HandReader comprises three architectures: (1) HandReader$_{RGB}$ uses the TSAM module to process RGB features; (2) HandReader$_{KP}$ builds on the TPE encoder to process keypoints; (3) HandReader$_{RGB+KP}$ combines the RGB and keypoint modalities. Result: State-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets, and high performance on Znaki. Conclusion: The HandReader models perform strongly on fingerspelling recognition, and the new Znaki dataset and pre-trained models are released. Abstract: Fingerspelling is a significant component of Sign Language (SL), allowing the interpretation of proper names, characterized by fast hand movements during signing. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE) operated on keypoints as tensors. Such keypoints composition in a batch allows the encoder to pass them through 2D and 3D convolution layers, utilizing temporal and spatial information and accumulating keypoints coordinates. We also introduce HandReader$_{RGB+KP}$, an architecture with a joint encoder to benefit from RGB and keypoint modalities. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets. Moreover, the models demonstrate high performance on the first open dataset for Russian fingerspelling, Znaki, presented in this paper. The Znaki dataset and HandReader pre-trained models are publicly available.

[39] MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting

Mengqiu Xu,Kaixin Chen,Heng Guo,Yixiang Huang,Ming Wu,Zhenwei Shi,Chuang Zhang,Jun Guo

Main category: cs.CV

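TL;DR: The paper introduces MFogHub, the first multi-regional, multi-satellite marine fog dataset, intended to improve deep learning models for marine fog detection and forecasting.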

Details

Motivation: Existing datasets are limited to a single region or satellite, which restricts model evaluation across diverse conditions and hinders deeper study of marine fog characteristics. Method: Integrates over 68,000 high-resolution samples from 15 coastal fog-prone regions and six geostationary satellites to build the MFogHub dataset. Result: Experiments show that MFogHub reveals generalization fluctuations caused by regional and satellite differences and supports the development of targeted fog prediction techniques. Conclusion: MFogHub aims to advance the monitoring and scientific understanding of global marine fog dynamics; the dataset and code are open source. Abstract: Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce MFogHub, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancy, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are at https://github.com/kaka0910/MFogHub.

[40] MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning

Yue Wang,Shuai Xu,Xuelin Zhu,Yicong Li

Main category: cs.CV

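TL;DR: The paper proposes a Multi-Stage Cross-modal Interaction (MSCI) model that uses intermediate-layer information from CLIP's visual encoder to better capture fine-grained local features, improving compositional zero-shot learning (CZSL).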

Details

Motivation: Existing studies rely on CLIP's cross-modal alignment but overlook its limitations in capturing fine-grained local features. Method: Designs two self-adaptive aggregators that extract local information from low-level visual features and global information from high-level visual features, and incorporates them into textual representations through a stage-by-stage interaction mechanism. Result: Experiments on three widely used datasets validate the model's effectiveness and superiority. Conclusion: By dynamically adjusting the attention weights between global and local visual information, MSCI significantly improves the model's perception of fine-grained local features. Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies basically rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architectural and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP's visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. This key information is progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model's perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at https://github.com/ltpwy/MSCI.

[41] StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

Daniel A. P. Oliveira,David Martins de Matos

Main category: cs.CV

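TL;DR: The paper proposes the StoryReasoning dataset and the Qwen Storyteller model, which use visual similarity and face recognition to keep characters and objects consistent in visual storytelling and to reduce hallucinations.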

Details

Motivation: Visual storytelling systems struggle to keep character identity consistent across frames and to link actions to the right subjects, which leads to referential hallucinations. Method: Uses the StoryReasoning dataset (4,178 stories, 52,016 movie images), combining cross-frame object re-identification, chain-of-thought reasoning, and visual entity linking, to fine-tune Qwen2.5-VL 7B. Result: The fine-tuned Qwen Storyteller model reduces the average number of hallucinations per story from 4.06 to 3.56 (-12.3%). Conclusion: Visual entity linking and multi-frame relationship modeling significantly improve the consistency and accuracy of visual stories. Abstract: Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when compared to a non-fine-tuned model.

[42] MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models

Guillaume Balezo,Roger Trullo,Albert Pla Planas,Etienne Decenciere,Thomas Walter

Main category: cs.CV

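TL;DR: MIPHEI is a model based on U-Net and ViT that predicts mIF signals from H&E-stained images, enabling cell-type classification and outperforming existing methods.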

Details

Motivation: mIF has not reached widespread clinical use because of its cost and complexity; predicting mIF signals from H&E images would enable low-cost cell-type analysis. Method: Uses a U-Net architecture with ViT encoders, trained on the ORION dataset and validated on two independent datasets. Result: The model performs well on several markers, such as Pan-CK (F1 = 0.88), and substantially outperforms the baseline. Conclusion: MIPHEI offers a new route to cell-type analysis of large-scale H&E datasets and helps study the relationship between spatial cellular organization and patient outcomes. Abstract: Histopathological analysis is a cornerstone of cancer diagnosis, with Hematoxylin and Eosin (H&E) staining routinely acquired for every patient to visualize cell morphology and tissue architecture. On the other hand, multiplex immunofluorescence (mIF) enables more precise cell type identification via proteomic markers, but has yet to achieve widespread clinical adoption due to cost and logistical constraints. To bridge this gap, we introduce MIPHEI (Multiplex Immunofluorescence Prediction from H&E), a U-Net-inspired architecture that integrates state-of-the-art ViT foundation models as encoders to predict mIF signals from H&E images. MIPHEI targets a comprehensive panel of markers spanning nuclear content, immune lineages (T cells, B cells, myeloid), epithelium, stroma, vasculature, and proliferation. We train our model using the publicly available ORION dataset of restained H&E and mIF images from colorectal cancer tissue, and validate it on two independent datasets. MIPHEI achieves accurate cell-type classification from H&E alone, with F1 scores of 0.88 for Pan-CK, 0.57 for CD3e, 0.56 for SMA, 0.36 for CD68, and 0.30 for CD20, substantially outperforming both a state-of-the-art baseline and a random classifier for most markers. Our results indicate that our model effectively captures the complex relationships between nuclear morphologies in their tissue context, as visible in H&E images and molecular markers defining specific cell types. MIPHEI offers a promising step toward enabling cell-type-aware analysis of large-scale H&E datasets, in view of uncovering relationships between spatial cellular organization and patient outcomes.

[43] A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability

Jie Zhu,Jirong Zha,Ding Li,Leye Wang

Main category: cs.CV

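TL;DR: The paper proposes PartCrop, a unified membership inference method for attacking visual self-supervised models, verifies its effectiveness across training protocols and architectures, and also proposes and evaluates defenses.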

Details

Motivation: Self-supervised learning shows promise in exploiting unlabeled data but raises privacy concerns, especially in vision. An adversary usually faces a black-box system without knowledge of the training method or its details, so a general attack method is needed. Method: Proposes PartCrop, which crops parts of objects in an image and queries their responses in representation space, exploiting the model's stronger part-aware response on training data. Result: Experiments show that PartCrop is effective on self-supervised models with different training protocols and structures, and that three defenses (early stopping, differential privacy, and shrinking the crop scale range) are all effective. Conclusion: PartCrop is a general membership inference attack applicable to different self-supervised models, with effective defense strategies proposed alongside it; a scalable PartCrop-v2 is obtained through further improvements. Abstract: Self-supervised learning shows promise in harnessing extensive unlabeled data, but it also confronts significant privacy concerns, especially in vision. In this paper, we perform membership inference on visual self-supervised models in a more realistic setting: self-supervised training method and details are unknown for an adversary when attacking as he usually faces a black-box system in practice. In this setting, considering that a self-supervised model could be trained by completely different self-supervised paradigms, e.g., masked image modeling and contrastive learning, with complex training details, we propose a unified membership inference method called PartCrop. It is motivated by the shared part-aware capability among models and stronger part response on the training data. Specifically, PartCrop crops parts of objects in an image to query responses within the image in representation space. We conduct extensive attacks on self-supervised models with different training protocols and structures using three widely used image datasets. The results verify the effectiveness and generalization of PartCrop. Moreover, to defend against PartCrop, we evaluate two common approaches, i.e., early stop and differential privacy, and propose a tailored method called shrinking crop scale range. The defense experiments indicate that all of them are effective. Finally, besides prototype testing on toy visual encoders and small-scale image datasets, we quantitatively study the impacts of scaling from both data and model aspects in a realistic scenario and propose a scalable PartCrop-v2 by introducing two structural improvements to PartCrop. Our code is at https://github.com/JiePKU/PartCrop.

[44] SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity

Shihao Zou,Qingfeng Li,Wei Ji,Jingjing Li,Yongkui Yang,Guoqi Li,Chao Dong

Main category: cs.CV

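TL;DR: SpikeVideoFormer is an efficient spike-driven video Transformer whose linear-temporal-complexity design delivers strong performance on video tasks while significantly improving energy efficiency.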

Details

Motivation: Existing SNN-based Transformers focus mainly on single-image tasks and do not fully exploit the efficiency of SNNs on video tasks. Method: Designs spike-driven Hamming attention (SDHA) and identifies a space-time attention scheme that keeps temporal complexity linear. Result: Achieves state-of-the-art performance among SNN approaches on video classification, pose tracking, and semantic segmentation, with efficiency well above ANN-based methods. Conclusion: SpikeVideoFormer demonstrates the potential of SNNs for video tasks, combining high performance with high efficiency. Abstract: Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks. https://github.com/JimmyZou/SpikeVideoFormer
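
The abstract does not give the SDHA formula, so the following is only a sketch of why a Hamming-based score suits binary spike vectors. It compares a plain dot product with a normalized Hamming similarity; the function names and example spike trains are hypothetical.

```python
def dot_score(q, k):
    # Dot product of two binary spike vectors: counts positions where both fire.
    return sum(a * b for a, b in zip(q, k))


def hamming_similarity(q, k):
    """Fraction of positions where two equal-length binary vectors agree.

    Equals 1 - (Hamming distance / length). Illustrative only; this is not
    the paper's SDHA definition.
    """
    if len(q) != len(k) or not q:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(1 for a, b in zip(q, k) if a == b) / len(q)


# Two silent vectors are identical, yet the dot product scores them 0.
silent = [0, 0, 0, 0]
q, k = [1, 0, 1, 1], [1, 1, 1, 0]
scores = (dot_score(silent, silent), hamming_similarity(silent, silent),
          dot_score(q, k), hamming_similarity(q, k))
```

The dot product ignores agreement on non-firing positions, while the Hamming-based score treats matching zeros and matching ones alike.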

[45] Learned Lightweight Smartphone ISP with Unpaired Data

Andrei Arhire,Radu Timofte

Main category: cs.CV

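TL;DR: The paper proposes a training method that needs no paired data for learning a smartphone ISP, achieving high-quality image conversion through a multi-term loss and adversarial training.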

Details

Motivation: Acquiring pixel-aligned paired data is costly and difficult when developing a learned ISP, so a method that does not need paired data is required. Method: Uses unpaired training that combines a multi-term loss with adversarial training, guided by feature maps from pre-trained networks. Result: Strong results on the Zurich RAW to RGB and Fujifilm UltraISP datasets, with evaluation metrics indicating high fidelity. Conclusion: Unpaired learning shows potential for smartphone ISPs and can deliver high-quality image conversion efficiently. Abstract: The Image Signal Processor (ISP) is a fundamental component in modern smartphone cameras responsible for conversion of RAW sensor image data to RGB images with a strong focus on perceptual quality. Recent work highlights the potential of deep learning approaches and their ability to capture details with a quality increasingly close to that of professional cameras. A difficult and costly step when developing a learned ISP is the acquisition of pixel-wise aligned paired data that maps the raw captured by a smartphone camera sensor to high-quality reference images. In this work, we address this challenge by proposing a novel training method for a learnable ISP that eliminates the need for direct correspondences between raw images and ground-truth data with matching content. Our unpaired approach employs a multi-term loss function guided by adversarial training with multiple discriminators processing feature maps from pre-trained networks to maintain content structure while learning color and texture characteristics from the target RGB dataset. Using lightweight neural network architectures suitable for mobile devices as backbones, we evaluated our method on the Zurich RAW to RGB and Fujifilm UltraISP datasets. Compared to paired training methods, our unpaired learning strategy shows strong potential and achieves high fidelity across multiple evaluation metrics. The code and pre-trained models are available at https://github.com/AndreiiArhire/Learned-Lightweight-Smartphone-ISP-with-Unpaired-Data

[46] Vision language models have difficulty recognizing virtual objects

Tyler Tran,Sangeet Khemlani,J. G. Trafton

Main category: cs.CV

TL;DR: The paper examines how well vision language models (VLMs) understand virtual objects in images and finds their performance inadequate.

Details

Motivation: Investigates whether VLMs can understand and reason about the spatial relations of virtual objects that are not directly depicted in an image, as a test of scene comprehension. Method: Evaluates VLMs systematically with prompts that introduce virtual objects (e.g., "imagine that a kite is stuck in the tree"). Result: Current state-of-the-art VLMs handle virtual objects poorly, failing to update their scene representations and reason about spatial relations. Conclusion: VLMs still fall short in understanding virtual objects and complex spatial relations and need further improvement. Abstract: Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question about how well they comprehend the visuospatial properties of scenes depicted in the images they process. We argue that descriptions of virtual objects -- objects that are not visually represented in an image -- can help test scene comprehension in these AI systems. For example, an image that depicts a person standing under a tree can be paired with the following prompt: imagine that a kite is stuck in the tree. VLMs that comprehend the scene should update their representations and reason sensibly about the spatial relations between all three objects. We describe systematic evaluations of state-of-the-art VLMs and show that their ability to process virtual objects is inadequate.

[47] Consistent Quantity-Quality Control across Scenes for Deployment-Aware Gaussian Splatting

Fengdi Zhang,Hongkun Cao,Ruqi Huang

Main category: cs.CV

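TL;DR: ControlGS is a 3D Gaussian splatting optimization method that lets users intuitively adjust the trade-off between the number of Gaussians and rendering quality across diverse scenes.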

Details

Motivation: Existing methods do not let users intuitively adjust the trade-off between Gaussian quantity and rendering quality, so they cannot adapt to practical needs under differing hardware and communication constraints. Method: With a single training run and a user-specified hyperparameter, ControlGS automatically finds desirable quantity-quality trade-off points across different scenes. Result: ControlGS maintains high rendering quality with fewer Gaussians and supports a broad adjustment range with stepless control. Conclusion: ControlGS performs well across diverse scenes, outperforms baselines, and offers flexible control over the quantity-quality trade-off. Abstract: To reduce storage and computational costs, 3D Gaussian splatting (3DGS) seeks to minimize the number of Gaussians used while preserving high rendering quality, introducing an inherent trade-off between Gaussian quantity and rendering quality. Existing methods strive for better quantity-quality performance, but lack the ability for users to intuitively adjust this trade-off to suit practical needs such as model deployment under diverse hardware and communication constraints. Here, we present ControlGS, a 3DGS optimization method that achieves semantically meaningful and cross-scene consistent quantity-quality control while maintaining strong quantity-quality performance. Through a single training run using a fixed setup and a user-specified hyperparameter reflecting quantity-quality preference, ControlGS can automatically find desirable quantity-quality trade-off points across diverse scenes, from compact objects to large outdoor scenes. It also outperforms baselines by achieving higher rendering quality with fewer Gaussians, and supports a broad adjustment range with stepless control over the trade-off.

[48] Logos as a Well-Tempered Pre-train for Sign Language Recognition

Ilya Ovodov,Petr Surovtsev,Karina Kvanchiani,Alexander Kapitanov,Alexander Nagaev

Main category: cs.CV

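TL;DR: The paper studies two problems in isolated sign language recognition (ISLR), limited data per language and semantic ambiguity among similar signs, and presents the Logos dataset and training methods to address them.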

Details

Motivation: Addresses limited per-language data and labeling ambiguity for similar signs in ISLR. Method: Presents the Logos dataset, explores cross-language transfer learning and joint training with multiple classification heads, and annotates groups of visually similar signs. Result: Logos is the largest RSL dataset; models pre-trained on it transfer to other sign languages and achieve strong results on the WLASL and AUTSL datasets. Conclusion: The Logos dataset and its annotation approach significantly improve model performance and provide an effective solution for cross-language ISLR. Abstract: This paper examines two aspects of the isolated sign language recognition (ISLR) task. First, despite the availability of a number of datasets, the amount of data for most individual sign languages is limited. It poses the challenge of cross-language ISLR model training, including transfer learning. Second, similar signs can have different semantic meanings. It leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs. To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset, the most extensive ISLR dataset by the number of signers and one of the largest available datasets while also the largest RSL dataset in size and vocabulary. It is shown that a model pre-trained on the Logos dataset can be used as a universal encoder for other language SLR tasks, including few-shot learning. We explore cross-language transfer learning approaches and find that joint training using multiple classification heads benefits accuracy for the target low-resource datasets the most. The key feature of the Logos dataset is explicitly annotated visually similar sign groups. We show that explicitly labeling visually similar signs improves trained model quality as a visual encoder for downstream tasks. Based on the proposed contributions, we outperform current state-of-the-art results for the WLASL dataset and get competitive results for the AUTSL dataset, with a single stream model processing solely RGB video. The source code, dataset, and pre-trained models are publicly available.

[49] UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation

Yi Li,Haonan Wang,Qixiang Zhang,Boyu Xiao,Chenchang Hu,Hualiang Wang,Xiaomeng Li

Main category: cs.CV

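TL;DR: The paper proposes UniEval, a framework for unified evaluation of multimodal models that addresses limitations of existing evaluation methods, such as the lack of overall results and reliance on extra models.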

Details

Motivation: Existing evaluation of unified multimodal models lacks a unified framework and depends on extra resources, so a simple and efficient evaluation scheme is needed. Method: Proposes the UniEval framework, comprising the UniBench benchmark and the UniScore metric, which supports both unified and visual generation models without extra models or annotations. Result: UniBench is more challenging than existing benchmarks, and UniScore aligns closely with human evaluation and surpasses existing metrics. Conclusion: UniEval provides an efficient, unified way to evaluate multimodal models and reveals its distinct value. Abstract: The emergence of unified multimodal understanding and generation models is rapidly attracting attention because of their ability to enhance instruction-following capabilities while minimizing model redundancy. However, there is a lack of a unified evaluation framework for these models, which would enable an elegant, simplified, and overall evaluation. Current models conduct evaluations on multiple task-specific benchmarks, but there are significant limitations, such as the lack of overall results, errors from extra evaluation models, reliance on extensive labeled images, benchmarks that lack diversity, and metrics with limited capacity for instruction-following evaluation. To tackle these challenges, we introduce UniEval, the first evaluation framework designed for unified multimodal models without extra models, images, or annotations. This facilitates a simplified and unified evaluation process. The UniEval framework contains a holistic benchmark, UniBench (supports both unified and visual generation models), along with the corresponding UniScore metric. UniBench includes 81 fine-grained tags contributing to high diversity. Experimental results indicate that UniBench is more challenging than existing benchmarks, and UniScore aligns closely with human evaluations, surpassing current metrics. Moreover, we extensively evaluated SoTA unified and visual generation models, uncovering new insights into UniEval's unique values.

[50] CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

Raman Dutt,Pedro Sanchez,Yongchen Yao,Steven McDonagh,Sotirios A. Tsaftaris,Timothy Hospedales

Main category: cs.CV

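TL;DR: CheXGenBench is a multifaceted framework for evaluating synthetic chest radiograph generation, covering generation quality, privacy risk, and clinical utility, and addressing shortcomings of existing evaluation methods.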

Details

Motivation: Evaluation of generative AI in the medical domain suffers from inconsistent methodology, outdated architectural comparisons, and disconnected assessment criteria, with little attention to clinical utility. Method: Uses standardised data partitioning and a unified evaluation protocol with more than 20 quantitative metrics to analyse generation quality, privacy vulnerabilities, and clinical applicability across 11 leading text-to-image architectures. Result: Reveals inefficiencies in existing evaluation protocols for assessing generative fidelity, establishes a standardised benchmark, and releases the high-quality synthetic dataset SynthCheX-75K. Conclusion: CheXGenBench gives the medical AI community a standardised benchmark that supports objective comparison and future research; the framework, models, and dataset are publicly released. Abstract: We introduce CheXGenBench, a rigorous and multifaceted evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across state-of-the-art text-to-image generative models. Despite rapid advancements in generative AI for real-world imagery, medical domain evaluations have been hindered by methodological inconsistencies, outdated architectural comparisons, and disconnected assessment criteria that rarely address the practical clinical value of synthetic samples. CheXGenBench overcomes these limitations through standardised data partitioning and a unified evaluation protocol comprising over 20 quantitative metrics that systematically analyse generation quality, potential privacy vulnerabilities, and downstream clinical applicability across 11 leading text-to-image architectures. Our results reveal critical inefficiencies in the existing evaluation protocols, particularly in assessing generative fidelity, leading to inconsistent and uninformative comparisons. Our framework establishes a standardised benchmark for the medical AI community, enabling objective and reproducible comparisons while facilitating seamless integration of both existing and future generative models. Additionally, we release a high-quality, synthetic dataset, SynthCheX-75K, comprising 75K radiographs generated by the top-performing model (Sana 0.6B) in our benchmark to support further research in this critical domain. Through CheXGenBench, we establish a new state-of-the-art and release our framework, models, and SynthCheX-75K dataset at https://raman1121.github.io/CheXGenBench/

[51] MorphGuard: Morph Specific Margin Loss for Enhancing Robustness to Face Morphing Attacks

Iurii Medvedev,Nuno Goncalves

Main category: cs.CV

TL;DR: Proposes a new dual-branch classification strategy that makes face recognition systems more robust to face morphing attacks.

Details

Motivation: As deep learning has advanced, face recognition systems have become exposed to security threats such as face morphing attacks, and their robustness needs to improve. Method: Uses a dual-branch classification strategy to handle the label ambiguity of face morphs and incorporate them into training. Result: Validation on public benchmarks shows the method is effective and markedly improves resistance to face morphing attacks. Conclusion: The method is universally applicable and can be integrated into existing face recognition training pipelines to improve classification-based recognition methods. Abstract: Face recognition has evolved significantly with the advancement of deep learning techniques, enabling its widespread adoption in various applications requiring secure authentication. However, this progress has also increased its exposure to presentation attacks, including face morphing, which poses a serious security threat by allowing one identity to impersonate another. Therefore, modern face recognition systems must be robust against such attacks. In this work, we propose a novel approach for training deep networks for face recognition with enhanced robustness to face morphing attacks. Our method modifies the classification task by introducing a dual-branch classification strategy that effectively handles the ambiguity in the labeling of face morphs. This adaptation allows the model to incorporate morph images into the training process, improving its ability to distinguish them from bona fide samples. Our strategy has been validated on public benchmarks, demonstrating its effectiveness in enhancing robustness against face morphing attacks. Furthermore, our approach is universally applicable and can be integrated into existing face recognition training pipelines to improve classification-based recognition methods.

[52] Enhancing Multi-Image Question Answering via Submodular Subset Selection

Aaryan Sharma,Shivansh Gupta,Samar Agarwal,Vishak Prasad C.,Ganesh Ramakrishnan

Main category: cs.CV

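TL;DR: Proposes an enhancement to a retrieval framework, based on submodular subset selection, to address scalability and retrieval performance in multi-image question answering.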

Details

Motivation: Multimodal models perform well on single-image tasks but face scalability and retrieval challenges in multi-image settings such as multi-image question answering. Method: Uses query-aware submodular functions (such as GraphCut) to pre-select a subset of semantically relevant images, and improves the retrieval pipeline with anchor-based queries and data augmentation. Result: In settings with many images, the effectiveness of the submodular-retriever framework improves markedly. Conclusion: Submodular subset selection can effectively improve retrieval performance for multi-image question answering. Abstract: Large multimodal models (LMMs) have achieved high performance in vision-language tasks involving a single image, but they struggle when presented with a collection of multiple images (Multiple Image Question Answering scenario). These tasks, which involve reasoning over a large number of images, present issues in scalability (with increasing number of images) and retrieval performance. In this work, we propose an enhancement for the retriever framework introduced in the MIRAGE model using submodular subset selection techniques. Our method leverages query-aware submodular functions, such as GraphCut, to pre-select a subset of semantically relevant images before the main retrieval component. We demonstrate that using anchor-based queries and augmenting the data improves the effectiveness of the submodular-retriever pipeline, particularly in large haystack sizes.
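
The pre-selection step described in this entry can be sketched as greedy maximization of a graph-cut-style objective: relevance to the query minus a penalty for redundancy among selected images. This is a simplified illustration, not the paper's exact objective; the embeddings, the weight `lam`, and the function names are assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_select(image_embs, query_emb, k, lam=0.5):
    """Greedily pick k images with high query relevance and low redundancy.

    Objective (illustrative): sum of query similarities of selected images
    minus lam times the sum of pairwise similarities among them.
    """
    relevance = [cosine(e, query_emb) for e in image_embs]
    selected = []
    while len(selected) < min(k, len(image_embs)):
        best, best_gain = None, -math.inf
        for i, emb in enumerate(image_embs):
            if i in selected:
                continue
            # Marginal gain of adding image i to the current selection.
            redundancy = sum(cosine(emb, image_embs[j]) for j in selected)
            gain = relevance[i] - lam * redundancy
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected


# Images 0 and 1 are near-duplicates; image 2 is different but equally relevant.
image_embs = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
picked = greedy_select(image_embs, query_emb=[1.0, 1.0], k=2)
```

When similarities are non-negative, each image's marginal gain can only shrink as the selection grows, which is the diminishing-returns property that makes greedy selection a natural fit here.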

[53] Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis

Pengfei Wang,Guohai Xu,Weinong Wang,Junjie Yang,Jie Lou,Yunhua Xue

Main category: cs.CV

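TL;DR: The paper proposes a new way to measure visual understanding in multimodal large language models (MLLMs): it defines "implicit visual misunderstanding" (IVM) and introduces an "attention accuracy" metric to assess more reliably whether a model truly understands its visual input.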

Details

Motivation: Existing benchmarks mainly evaluate answer correctness and overlook whether the model truly understands the visual input, so a new method is needed to quantify implicit visual misunderstanding. Method: Decouples the visual and textual modalities within the causal attention module, analyzes the attention distribution, and introduces the "attention accuracy" metric together with a new benchmark. Result: The attention distribution increasingly concentrates on the image associated with the correct answer as network layers deepen, and the new metric reliably assesses visual understanding. Conclusion: The method applies not only to multimodal scenarios but also extends to unimodal ones, showing broad applicability and generality. Abstract: Recent advancements have enhanced the capability of Multimodal Large Language Models (MLLMs) to comprehend multi-image information. However, existing benchmarks primarily evaluate answer correctness, overlooking whether models genuinely comprehend the visual input. To address this, we define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input. Through our analysis, we decouple the visual and textual modalities within the causal attention module, revealing that attention distribution increasingly converges on the image associated with the correct answer as the network layers deepen. This insight leads to the introduction of a scale-agnostic metric, attention accuracy, and a novel benchmark for quantifying IVMs. Attention accuracy directly evaluates the model's visual understanding via internal mechanisms, remaining robust to positional biases for more reliable assessments. Furthermore, we extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios, underscoring its versatility and generalizability.
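
The abstract does not define attention accuracy precisely. One plausible reading, sketched below purely as an assumption, is the fraction of samples for which the image receiving the most attention mass is the one tied to the correct answer; the input layout and the function name are hypothetical.

```python
def attention_accuracy(attention_per_image, target_image):
    """Fraction of samples whose most-attended image is the target image.

    attention_per_image: one list per sample, holding the attention mass
    assigned to each candidate image (assumed already aggregated).
    target_image: index of the image associated with the correct answer.
    This is an assumed reading of the metric, not the paper's definition.
    """
    hits = 0
    for masses, target in zip(attention_per_image, target_image):
        most_attended = max(range(len(masses)), key=masses.__getitem__)
        hits += most_attended == target
    return hits / len(target_image)


# Three hypothetical samples with three candidate images each.
attention = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
targets = [1, 2, 2]
acc = attention_accuracy(attention, targets)
```

Because only the argmax matters, rescaling all attention values within a sample leaves the score unchanged, which is consistent with the metric being described as scale-agnostic.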

[54] Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data

Yiwen Liu,Jessica Bader,Jae Myung Kim

Main category: cs.CV

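TL;DR: The study examines how image feasibility affects CLIP classifier performance when generating synthetic training data; it finds that feasibility has minimal effect and that mixing feasible and infeasible images does not significantly change performance.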

Details

Motivation: As diffusion models improve, models trained on synthetic data perform better, but generated images can contain unrealistic features (such as a dog floating in the air). The study tests whether image feasibility is essential for training CLIP classifiers. Method: The VariReal pipeline minimally edits source images to include feasible or infeasible attributes, and LoRA-fine-tuned CLIP is evaluated on three fine-grained datasets. Result: Feasibility has minimal effect on CLIP performance (mostly less than 0.3% difference in top-1 accuracy), and mixing feasible and infeasible images does not significantly affect performance. Conclusion: Image feasibility has limited impact on CLIP classifier performance, and mixing feasible and infeasible images in training is acceptable. Abstract: With the development of photorealistic diffusion models, models trained in part or fully on synthetic data achieve progressively better results. However, diffusion models still routinely generate images that would not exist in reality, such as a dog floating above the ground or with unrealistic texture artifacts. We define the concept of feasibility as whether attributes in a synthetic image could realistically exist in the real-world domain; synthetic images containing attributes that violate this criterion are considered infeasible. Intuitively, infeasible images are typically considered out-of-distribution; thus, training on such images is expected to hinder a model's ability to generalize to real-world data, and they should therefore be excluded from the training set whenever possible. However, does feasibility really matter? In this paper, we investigate whether enforcing feasibility is necessary when generating synthetic training data for CLIP-based classifiers, focusing on three target attributes: background, color, and texture. We introduce VariReal, a pipeline that minimally edits a given source image to include feasible or infeasible attributes given by the textual prompt generated by a large language model. Our experiments show that feasibility minimally affects LoRA-fine-tuned CLIP performance, with mostly less than 0.3% difference in top-1 accuracy across three fine-grained datasets. Also, the attribute matters on whether the feasible/infeasible images adversarially influence the classification performance. Finally, mixing feasible and infeasible images in training datasets does not significantly impact performance compared to using purely feasible or infeasible datasets.

[55] MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

Ke Wang,Junting Pan,Linda Wei,Aojun Zhou,Weikang Shi,Zimu Lu,Han Xiao,Yunqiao Yang,Houxing Ren,Mingjie Zhan,Hongsheng Li

Main category: cs.CV

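TL;DR: The paper uses code as the supervision signal for cross-modal alignment, develops the FigCodifier model and the ImgCode-8.6M dataset, builds the MM-MathInstruct-3M dataset, and trains MathCoder-VL, which performs strongly on mathematical problem solving.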

Details

Motivation: Existing natural-language image-caption datasets overlook the details of mathematical figures, which limits large multimodal models in mathematical reasoning. Method: Code is used as the supervision signal; the authors develop the image-to-code model FigCodifier and the ImgCode-8.6M dataset, and build the multimodal math instruction dataset MM-MathInstruct-3M. Result: MathCoder-VL reaches open-source SOTA on all six metrics and surpasses GPT-4o and Claude 3.5 Sonnet on the geometry problem-solving subset of MathVista. Conclusion: Code supervision and the constructed multimodal datasets substantially improve multimodal mathematical reasoning. Abstract: Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at https://github.com/mathllm/MathCoder.

[56] End-to-End Vision Tokenizer Tuning

Wenxuan Wang,Fan Zhang,Yufeng Cui,Haiwen Diao,Zhuoyan Luo,Huchuan Lu,Jing Liu,Xinlong Wang

Main category: cs.CV

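TL;DR: ETT is an end-to-end vision tokenizer tuning method that jointly optimizes vision tokenization and the target autoregressive tasks, significantly improving performance on multimodal understanding and visual generation.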

Details

Motivation: Existing vision tokenization separates tokenizer optimization from downstream training, so the visual tokens cannot adapt to the needs of different tasks and become a performance bottleneck. Method: ETT uses the visual embeddings of the tokenizer codebook and optimizes the vision tokenizer end-to-end with both reconstruction and caption objectives, without changing the original codebooks or LLM architectures. Result: Experiments show gains of 2-6% on multimodal understanding and visual generation tasks while preserving the original reconstruction capability. Conclusion: ETT is a simple and effective method that could help advance multimodal foundation models. Abstract: Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. This decoupled paradigm introduces a critical misalignment: The loss of the vision tokenization can be the representation bottleneck for target tasks. For example, errors in tokenizing text in a given image lead to poor results when recognizing or generating them. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook, and optimizes the vision tokenizers end-to-end with both reconstruction and caption objectives. ETT can be seamlessly integrated into existing training pipelines with minimal architecture modifications. Our ETT is simple to implement and integrate, without the need to adjust the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that our proposed end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2-6% for multimodal understanding and visual generation tasks compared to frozen tokenizer baselines, while preserving the original reconstruction capability. We hope this very simple and strong method can empower multimodal foundation models besides image generation and understanding.

[57] Depth Anything with Any Prior

Zehan Wang,Siyu Chen,Lihe Yang,Jialei Wang,Ziang Zhang,Hengshuang Zhao,Zhou Zhao

Main category: cs.CV

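TL;DR: The Prior Depth Anything framework combines incomplete but precise depth measurements with relative but complete geometric structure from depth prediction to produce accurate, dense, and detailed depth maps.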

Details

Motivation: Existing depth prediction methods fall short in either accuracy or completeness; combining two complementary depth sources improves generalization. Method: A coarse-to-fine pipeline: (1) pre-fill the priors using pixel-level metric alignment and distance-aware weighting; (2) train a conditioned monocular depth estimation model to refine the noise in the priors. Result: The model shows zero-shot generalization across 7 real-world datasets, matches or surpasses task-specific methods, and handles unseen mixed priors. Conclusion: The framework is flexible and efficient and improves as MDE models advance, offering a new approach to depth prediction. Abstract: This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.
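
As background for the alignment idea, the sketch below fits a single global scale and shift that maps a relative depth prediction onto sparse metric measurements by least squares. This is a deliberately simpler baseline than the pixel-level, distance-weighted alignment the abstract describes; the function names and the flat-list depth representation are assumptions, and the affine relation between prediction and metric depth is itself an assumption.

```python
def fit_scale_shift(pred, metric):
    """Least-squares fit of metric ~= scale * pred + shift (closed form)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_m = sum(metric) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("prediction is constant on the measured pixels")
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pred, metric))
    scale = cov / var_p
    shift = mean_m - scale * mean_p
    return scale, shift


def align_depth(pred_map, sparse_metric):
    """Map a relative depth map to metric units using sparse measurements.

    pred_map      : flat list of relative depth values, one per pixel
    sparse_metric : dict {pixel_index: metric_depth} for measured pixels
    """
    idx = sorted(sparse_metric)
    scale, shift = fit_scale_shift(
        [pred_map[i] for i in idx], [sparse_metric[i] for i in idx]
    )
    return [scale * p + shift for p in pred_map]


pred_map = [0.0, 0.25, 0.5, 1.0]           # dense but only relative
sparse_metric = {0: 1.0, 2: 2.0, 3: 3.0}   # precise but incomplete (metres)
dense_metric = align_depth(pred_map, sparse_metric)
```

Pixel 1 has no measurement, yet it receives a metric value because the dense prediction supplies the structure and the sparse measurements supply the scale.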

[58] 3D-Fixup: Advancing Photo Editing with 3D Priors

Yen-Chi Cheng,Krishna Kumar Singh,Jae Shin Yoon,Alex Schwing,Liangyan Gui,Matheus Gadelha,Paul Guerrero,Nanxuan Zhao

Main category: cs.CV

TL;DR: 3D-Fixup is a framework built on diffusion models and 3D priors that supports complex 3D-aware image edits such as translation and rotation.

Details

Motivation: Despite progress in modeling image priors with diffusion models, 3D-aware editing from a single image remains challenging. Method: Training pairs are generated from video data, a diffusion model is combined with 3D guidance from an Image-to-3D model, and a data generation pipeline is designed to ensure high-quality 3D guidance. Result: 3D-Fixup supports complex, identity-coherent 3D-aware edits with high-quality results. Conclusion: Integrating 3D priors advances the use of diffusion models for realistic image manipulation. Abstract: Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at https://3dfixup.github.io/

cs.GR [Back]

[59] VRSplat: Fast and Robust Gaussian Splatting for Virtual Reality

Xuechang Tu,Lukas Radl,Michael Steiner,Markus Steinberger,Bernhard Kerbl,Fernando de la Torre

Main category: cs.GR

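TL;DR: VRSplat combines and extends 3DGS techniques to address temporal artifacts, projection distortions, and low framerates in VR; by modifying the core techniques and adding an efficient foveated rasterizer, it reaches 72+ FPS while eliminating visual artifacts.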

Details

Motivation: In VR, 3DGS suffers from temporal artifacts, projection distortions, and insufficient framerates; head-mounted displays amplify these problems, so a holistic solution is needed. Method: Mini-Splatting, StopThePop, and Optimal Projection are combined; the 3DGS rasterizer is modified; an efficient foveated rasterizer is proposed; and Gaussian parameters are optimized in a fine-tuning step. Result: A user study shows a strong preference for VRSplat over other configurations; it reaches 72+ FPS and eliminates the visual artifacts. Conclusion: VRSplat is the first systematically evaluated 3DGS approach that supports modern VR applications and addresses these key challenges. Abstract: 3D Gaussian Splatting (3DGS) has rapidly become a leading technique for novel-view synthesis, providing exceptional performance through efficient software-based GPU rasterization. Its versatility enables real-time applications, including on mobile and lower-powered devices. However, 3DGS faces key challenges in virtual reality (VR): (1) temporal artifacts, such as popping during head movements, (2) projection-based distortions that result in disturbing and view-inconsistent floaters, and (3) reduced framerates when rendering large numbers of Gaussians, falling below the critical threshold for VR. Compared to desktop environments, these issues are drastically amplified by large field-of-view, constant head movements, and high resolution of head-mounted displays (HMDs). In this work, we introduce VRSplat: we combine and extend several recent advancements in 3DGS to address challenges of VR holistically. We show how the ideas of Mini-Splatting, StopThePop, and Optimal Projection can complement each other, by modifying the individual techniques and core 3DGS rasterizer. Additionally, we propose an efficient foveated rasterizer that handles focus and peripheral areas in a single GPU launch, avoiding redundant computations and improving GPU utilization. Our method also incorporates a fine-tuning step that optimizes Gaussian parameters based on StopThePop depth evaluations and Optimal Projection. We validate our method through a controlled user study with 25 participants, showing a strong preference for VRSplat over other configurations of Mini-Splatting. VRSplat is the first systematically evaluated 3DGS approach capable of supporting modern VR applications, achieving 72+ FPS while eliminating popping and stereo-disrupting floaters.

[60] Style Customization of Text-to-Vector Generation with Image Diffusion Priors

Peiying Zhang,Nanxuan Zhao,Jing Liao

Main category: cs.GR

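TL;DR: The paper proposes a two-stage style customization method for SVG generation that combines a feed-forward T2V model with T2I priors, addressing the weaknesses of existing methods in style customization and structural regularity.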

Details

Motivation: Existing text-to-vector (T2V) generation methods lack style customization and cannot meet the practical need for a consistent visual style. Method: A two-stage pipeline: (1) train a T2V diffusion model with a path-level representation to ensure structural regularity; (2) customize it to different styles by distilling customized T2I models. Result: Experiments confirm that the method efficiently generates high-quality SVGs in a consistent style. Conclusion: The method outperforms existing techniques in style customization and structural regularity and is suitable for practical use. Abstract: Scalable Vector Graphics (SVGs) are highly favored by designers due to their resolution independence and well-organized layer structure. Although existing text-to-vector (T2V) generation methods can create SVGs from text prompts, they often overlook an important need in practical applications: style customization, which is vital for producing a collection of vector graphics with consistent visual appearance and coherent aesthetics. Extending existing T2V methods for style customization poses certain challenges. Optimization-based T2V models can utilize the priors of text-to-image (T2I) models for customization, but struggle with maintaining structural regularity. On the other hand, feed-forward T2V models can ensure structural regularity, yet they encounter difficulties in disentangling content and style due to limited SVG training data. To address these challenges, we propose a novel two-stage style customization pipeline for SVG generation, making use of the advantages of both feed-forward T2V models and T2I image priors. In the first stage, we train a T2V diffusion model with a path-level representation to ensure the structural regularity of SVGs while preserving diverse expressive capabilities. In the second stage, we customize the T2V diffusion model to different styles by distilling customized T2I models. By integrating these techniques, our pipeline can generate high-quality and diverse SVGs in custom styles based on text prompts in an efficient feed-forward manner. The effectiveness of our method has been validated through extensive experiments. The project page is https://customsvg.github.io.

cs.CL [Back]

[61] Next Word Suggestion using Graph Neural Network

Abisha Thapa Magar,Anup Shakya

Main category: cs.CL

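TL;DR: The paper combines the graph convolution operation from graph neural networks (GNNs) with LSTMs for the context-embedding sub-task of language modeling and shows that the approach works with limited resources.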

Details

Motivation: Today's leading language models require huge parameter counts and compute; this work targets the context-embedding sub-task of language modeling in search of a more efficient solution. Method: The graph convolution operation in GNNs encodes the context and is combined with LSTMs to predict the next word; experiments use a custom Wikipedia text corpus. Result: With limited resources, the approach predicts the next word fairly well. Conclusion: The method offers a resource-efficient approach to context embedding for language modeling. Abstract: Language Modeling is a prevalent task in Natural Language Processing. The currently existing most recent and most successful language models often tend to build a massive model with billions of parameters, feed in a tremendous amount of text data, and train with enormous computation resources which require millions of dollars. In this project, we aim to address an important sub-task in language modeling, i.e., context embedding. We propose an approach to exploit the Graph Convolution operation in GNNs to encode the context and use it in coalition with LSTMs to predict the next word given a local context of preceding words. We test this on the custom Wikipedia text corpus using a very limited amount of resources and show that this approach works fairly well to predict the next word.

[62] DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

Xiwen Chen,Wenhui Zhu,Peijie Qiu,Xuanzhao Dong,Hao Wang,Haiyu Wu,Huayu Li,Aristeidis Sotiras,Yalin Wang,Abolfazl Razi

Main category: cs.CL

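TL;DR: The paper proposes Diversity-aware Reward Adjustment (DRA) to address the diversity-quality inconsistency that arises in GRPO-based language model post-training because scalar reward signals ignore semantic diversity. DRA incorporates semantic diversity into the reward computation and improves model performance.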

Details

Motivation: GRPO performs well in low-resource settings, but the scalar reward signals it relies on cannot capture semantic diversity, which leads to a diversity-quality inconsistency. Method: DRA uses Submodular Mutual Information (SMI) to adjust rewards, downweighting redundant completions and amplifying rewards for diverse ones, thereby balancing exploration and exploitation. Result: On five mathematical reasoning benchmarks, DRA-GRPO and DGA-DR. GRPO perform strongly, with an average accuracy of 58.2% using only 7,000 fine-tuning samples and a training cost of about $55. Conclusion: By explicitly incorporating semantic diversity, DRA addresses this limitation of GRPO and achieves state-of-the-art performance in a low-resource setting. Abstract: Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose $\textit{Diversity-aware Reward Adjustment}$ (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR.~GRPO, resulting in $\textit{DRA-GRPO}$ and $\textit{DGA-DR.~GRPO}$. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at https://github.com/xiwenc1/DRA-GRPO.
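
The abstract does not give the adjustment formula, so the following is only a sketch of the general idea under stated assumptions: each completion's redundancy is taken to be the sum of its similarities to the other sampled completions (a graph-cut-style stand-in for the SMI term), and the reward is divided by one plus that redundancy. The function name and the requirement that similarities lie in [0, 1] are assumptions.

```python
def diversity_adjusted_rewards(rewards, sim):
    """Downweight rewards of completions that resemble the rest of the group.

    rewards : list of scalar rewards, one per sampled completion
    sim     : sim[i][j] in [0, 1], semantic similarity between completions
              i and j (e.g. from sentence embeddings; assumed precomputed)
    """
    adjusted = []
    for i, reward in enumerate(rewards):
        # Redundancy of completion i with respect to all other completions.
        redundancy = sum(sim[i][j] for j in range(len(rewards)) if j != i)
        adjusted.append(reward / (1.0 + redundancy))
    return adjusted


# Three correct completions with identical raw rewards: 0 and 1 follow
# nearly the same reasoning path, 2 takes a different one.
rewards = [1.0, 1.0, 1.0]
sim = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
adjusted = diversity_adjusted_rewards(rewards, sim)
```

After adjustment the distinct completion keeps its full reward while the two near-duplicates are scaled down, so completions that were indistinguishable by raw reward become separable before group-relative advantages are computed.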

[63] Large Language Models Are More Persuasive Than Incentivized Human Persuaders

Philipp Schoenegger,Francesco Salvi,Jiacheng Liu,Xiaoli Nan,Ramit Debnath,Barbara Fasolo,Evelina Leivada,Gabriel Recchia,Fritz Günther,Ali Zarifhonarvar,Joe Kwon,Zahoor Ul Islam,Marco Dehnert,Daryl Y. H. Lee,Madeline G. Reinecke,David G. Kamper,Mert Kobaş,Adam Sandford,Jonas Kgomo,Luke Hewitt,Shreya Kapoor,Kerem Oktar,Eyup Engin Kucuk,Bo Feng,Cameron R. Jones,Izzy Gainsburg,Sebastian Olschewski,Nora Heinzelmann,Francisco Cruz,Ben M. Tappin,Tao Ma,Peter S. Park,Rayan Onyonka,Arthur Hjorth,Peter Slattery,Qingcheng Zeng,Lennart Finke,Igor Grossmann,Alessandro Salatiello,Ezra Karger

Main category: cs.CL

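TL;DR: The study finds that a frontier large language model (Claude Sonnet 3.5) is significantly more persuasive than incentivized humans, whether steering people toward correct or incorrect answers.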

Details

Motivation: To compare the persuasive capabilities of AI and humans in real-time conversation and examine AI's potential influence. Method: In a preregistered, large-scale incentivized experiment, participants complete an online quiz while AI or human persuaders try to steer their answers. Result: AI persuaders do better when steering toward both correct and incorrect answers, significantly affecting participants' accuracy and earnings. Conclusion: AI's persuasive capabilities already exceed those of incentivized humans, underscoring the urgency of alignment and governance frameworks. Abstract: We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers' accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI's persuasion capabilities already exceed those of humans that have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks.

[64] System Prompt Optimization with Meta-Learning

Yumin Choi,Jinheon Baek,Sung Ju Hwang

Main category: cs.CL

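TL;DR: The paper proposes bilevel system prompt optimization: a meta-learning framework optimizes the system prompt so that it is robust to diverse user prompts and transfers to unseen tasks.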

Details

Motivation: Existing prompt optimization work focuses on task-specific user prompts and overlooks the system prompt, which applies across tasks and domains. Method: A meta-learning framework optimizes the system prompt over many user prompts and datasets while iteratively updating the user prompts to keep the two in synergy. Result: On 14 unseen datasets, the optimized system prompt generalizes effectively and reduces the number of test-time optimization steps needed for user prompts. Conclusion: The optimized system prompt generalizes across tasks and domains, improving performance while lowering adaptation cost. Abstract: Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.

[65] VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

Xin Liu,Lechen Zhang,Sheza Munir,Yiyang Gu,Lu Wang

Main category: cs.CL

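TL;DR: VeriFact is a new factuality evaluation framework that improves the accuracy of long-form factuality evaluation by identifying and resolving incomplete or missing facts. The accompanying FactRBench benchmark evaluates both precision and recall, whereas prior work focuses mainly on precision.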

Details

Motivation: Large language models (LLMs) excel at generating long-form text, but evaluating its factuality remains hard, especially given complex inter-sentence dependencies. Existing methods often miss key context and relational facts. Method: The VeriFact framework improves evaluation accuracy by extracting and verifying facts; the FactRBench benchmark evaluates both precision and recall. Result: VeriFact significantly improves fact completeness and preserves complex relational information; FactRBench shows that larger models within the same family perform better, but high precision does not necessarily imply high recall. Conclusion: VeriFact and FactRBench provide a more comprehensive approach to long-form factuality evaluation and highlight the importance of measuring both precision and recall. Abstract: Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench, a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information, resulting in more accurate factuality evaluation. Benchmarking various open- and close-weight LLMs on FactRBench indicates that larger models within the same model family improve precision and recall, but high precision does not always correlate with high recall, underscoring the importance of comprehensive factuality assessment.
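
To illustrate why precision alone can mislead, here is a small sketch of fact-level precision and recall. It assumes that an upstream verifier has already labelled each extracted fact as supported or not, and each reference fact as covered by the response or not; the function name and the boolean-list inputs are hypothetical and do not reflect FactRBench's actual interface.

```python
def fact_precision_recall(extracted_supported, reference_covered):
    """Compute fact-level precision and recall for one response.

    extracted_supported : list of bools, one per fact extracted from the
                          response (True if the fact was verified)
    reference_covered   : list of bools, one per reference fact
                          (True if the response states that fact)
    """
    precision = (
        sum(extracted_supported) / len(extracted_supported)
        if extracted_supported else 0.0
    )
    recall = (
        sum(reference_covered) / len(reference_covered)
        if reference_covered else 0.0
    )
    return precision, recall


# A short, cautious answer: what it says is mostly right, but it omits
# half of the facts a complete answer would contain.
precision, recall = fact_precision_recall(
    extracted_supported=[True, True, False, True],
    reference_covered=[True, False],
)
```

A response can score well on precision by saying little, which is the gap the reference fact sets are meant to expose.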

[66] An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs

Gino Carmona-Díaz,William Jiménez-Leal,María Alejandra Grisales,Chandra Sripada,Santiago Amaya,Michael Inzlicht,Juan Pablo Bermúdez

Main category: cs.CL

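TL;DR: The paper presents a step-by-step tutorial for using LLMs to efficiently develop, test, and apply taxonomies for analyzing unstructured data, achieving high-quality results through an iterative, collaborative process.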

Details

Motivation: Analyzing open-ended text (responses, headlines, social media posts) is time-consuming and prone to bias; LLMs offer a new tool for high-quality text analysis. Method: Through an iterative, collaborative process, researchers and LLMs develop, test, and apply taxonomies, using either predefined or data-driven categories. Result: The tutorial shows how to generate, evaluate, and apply a taxonomy and achieves high intercoder reliability. Conclusion: LLMs are promising for text analysis but also have limitations. Abstract: Analyzing texts such as open-ended responses, headlines, or social media posts is a time- and labor-intensive process highly susceptible to bias. LLMs are promising tools for text analysis, using either a predefined (top-down) or a data-driven (bottom-up) taxonomy, without sacrificing quality. Here we present a step-by-step tutorial to efficiently develop, test, and apply taxonomies for analyzing unstructured data through an iterative and collaborative process between researchers and LLMs. Using personal goals provided by participants as an example, we demonstrate how to write prompts to review datasets and generate a taxonomy of life domains, evaluate and refine the taxonomy through prompt and direct modifications, test the taxonomy and assess intercoder agreements, and apply the taxonomy to categorize an entire dataset with high intercoder reliability. We discuss the possibilities and limitations of using LLMs for text analysis.

[67] Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

Shaurya Sharthak,Vinayak Pahalwan,Adithya Kamath,Adarsh Shirawalmath

Main category: cs.CL

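TL;DR: The paper proposes the TokenAdapt framework, which combines a model-agnostic tokenizer transplantation method with pre-tokenization learning of multi-word Supertokens to overcome the efficiency and performance limits of fixed tokenization schemes.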

Details

Motivation: Fixed tokenization schemes make multilingual and specialized applications inefficient, and existing remedies are computationally expensive and struggle to preserve semantics. Method: The paper introduces TokenAdapt (which initializes new tokens by combining local subword decomposition with global semantic similarity) and pre-tokenization learning of Supertokens. Result: TokenAdapt markedly outperforms the baselines and lowers perplexity, and Supertokens improve compression. Conclusion: The TokenAdapt framework effectively addresses tokenizer transplantation and outperforms existing methods. Abstract: Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges. Standard methods to overcome this often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes new unique token embeddings via a hybrid heuristic that combines two methods: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including TransTokenizer and ReTok, while our Supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios compared to both ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.
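
The hybrid initialization described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the local estimate is taken as the plain mean of the old embeddings of the subword pieces, the global estimate as a similarity-weighted mean over the top-k neighbours, and the two are blended with a hypothetical weight `w_local`. How the pieces and neighbours are found (old tokenizer, auxiliary embedding model) is assumed to happen upstream.

```python
def weighted_mean(vectors, weights=None):
    """Component-wise (weighted) mean of equal-length vectors."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[d] for w, v in zip(weights, vectors)) / total
        for d in range(dim)
    ]


def hybrid_init(piece_ids, neighbour_ids, neighbour_sims, old_emb, w_local=0.5):
    """Initialize the embedding of a token that is new to the vocabulary.

    piece_ids      : ids the OLD tokenizer splits the new token into
    neighbour_ids  : ids of the top-k most similar old-vocabulary tokens
    neighbour_sims : their similarity scores (used as weights)
    old_emb        : mapping from old token id to embedding vector
    w_local        : blend weight between the two estimates (assumption)
    """
    local = weighted_mean([old_emb[i] for i in piece_ids])
    global_ = weighted_mean([old_emb[i] for i in neighbour_ids], neighbour_sims)
    return [w_local * a + (1.0 - w_local) * b for a, b in zip(local, global_)]


old_emb = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
new_vec = hybrid_init(
    piece_ids=[0, 1],        # new token decomposes into old tokens 0 and 1
    neighbour_ids=[2],       # old token 2 is its nearest semantic neighbour
    neighbour_sims=[0.8],
    old_emb=old_emb,
)
```

The local term keeps the new embedding close to what the model already composes from subwords, while the global term pulls it toward tokens with similar meaning.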

[68] Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques

J. Moreno-Casanova,J. M. Auñón,A. Mártinez-Pérez,M. E. Pérez-Martínez,M. E. Gas-López

Main category: cs.CL

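TL;DR: The study uses NLP techniques, in particular named entity recognition (NER), to automatically extract clinical information about lung and breast cancer from electronic health records, improving the efficiency and accuracy of data extraction.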

Details

Motivation: Manually extracting information from clinical reports is slow and error-prone, which limits data-driven approaches in healthcare. NLP can automate this process, especially for high-incidence lung and breast cancer. Method: GMV's NLP tool uQuery and the RoBERTa-based bsc-bio-ehr-en3 model are used for NER, recognizing and standardizing entities in 200 breast cancer and 400 lung cancer reports. Result: Overall performance is strong, particularly for entities such as MET and PAT, but less frequent entities such as EVOL remain challenging. Conclusion: NLP can improve the efficiency and accuracy of clinical data extraction, supporting early detection and management of lung and breast cancer. Abstract: Research projects, including those focused on cancer, rely on the manual extraction of information from clinical reports. This process is time-consuming and prone to errors, limiting the efficiency of data-driven approaches in healthcare. To address these challenges, Natural Language Processing (NLP) offers an alternative for automating the extraction of relevant data from electronic health records (EHRs). In this study, we focus on lung and breast cancer due to their high incidence and the significant impact they have on public health. Early detection and effective data management in both types of cancer are crucial for improving patient outcomes. To enhance the accuracy and efficiency of data extraction, we utilized GMV's NLP tool uQuery, which excels at identifying relevant entities in clinical texts and converting them into standardized formats such as SNOMED and OMOP. uQuery not only detects and classifies entities but also associates them with contextual information, including negated entities, temporal aspects, and patient-related details. In this work, we explore the use of NLP techniques, specifically Named Entity Recognition (NER), to automatically identify and extract key clinical information from EHRs related to these two cancers. A dataset from Health Research Institute Hospital La Fe (IIS La Fe), comprising 200 annotated breast cancer and 400 lung cancer reports, was used, with eight clinical entities manually labeled using the Doccano platform. To perform NER, we fine-tuned the bsc-bio-ehr-en3 model, a RoBERTa-based biomedical linguistic model pre-trained in Spanish. Fine-tuning was performed using the Transformers architecture, enabling accurate recognition of clinical entities in these cancer types. Our results demonstrate strong overall performance, particularly in identifying entities like MET and PAT, although challenges remain with less frequent entities like EVOL.

[69] Exploring the generalization of LLM truth directions on conversational formats

Timour Ichmoukhamedov,David Martens

Main category: cs.CL

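TL;DR: The study finds that the universal truth direction reported in LLMs generalizes poorly across conversational formats, especially to long conversations in which the lie appears early. Appending a fixed key phrase significantly improves generalization.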

Details

Motivation: To explore how well the truth direction in LLMs generalizes across conversational formats, with the aim of more reliable LLM lie detection. Method: Linear probes are used to analyze the truth direction in LLM hidden states; generalization is tested on short and long conversations, and a fixed key phrase is added to improve it. Result: The truth direction generalizes well between short conversations but poorly to longer ones; adding a fixed key phrase significantly improves generalization. Conclusion: Generalizing LLM lie detection to new settings remains challenging, although specific techniques can partly address it. Abstract: Several recent works argue that LLMs have a universal truth direction where true and false statements are linearly separable in the activation space of the model. It has been demonstrated that linear probes trained on a single hidden state of the model already generalize across a range of topics and might even be used for lie detection in LLM conversations. In this work we explore how this truth direction generalizes between various conversational formats. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats where the lie appears earlier in the input prompt. We propose a solution that significantly improves this type of generalization by adding a fixed key phrase at the end of each conversation. Our results highlight the challenges towards reliable LLM lie detectors that generalize to new settings.
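
For readers unfamiliar with linear probing, the sketch below fits a very simple probe, the difference of class means, on hidden-state vectors and classifies a new activation by which side of the midpoint hyperplane it falls on. The abstract does not say which probe the authors train, so this particular choice is an assumption, and the toy 2-D activations stand in for real model hidden states.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_truth_direction(true_acts, false_acts):
    """Difference-of-means probe: returns (direction, bias).

    true_acts / false_acts : hidden states recorded for true and false
    statements (in practice, one vector per statement, taken at a fixed
    layer and token position).
    """
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2.0 for t, f in zip(mu_true, mu_false)]
    bias = -dot(direction, midpoint)
    return direction, bias


def predict_true(activation, direction, bias):
    """Classify an activation as 'true' if it lies on the true side."""
    return dot(activation, direction) + bias > 0.0


true_acts = [[1.0, 0.0], [3.0, 0.0]]
false_acts = [[-1.0, 0.0], [-3.0, 0.0]]
direction, bias = fit_truth_direction(true_acts, false_acts)
```

The generalization question in the paper then amounts to whether a direction fitted on one conversational format still separates activations recorded from another; the key-phrase fix changes where in the prompt the activation is read, not the probe itself.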

[70] KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning

Peiqi Sui,Juan Diego Rodriguez,Philippe Laban,Dean Murphy,Joseph P. Dexter,Richard Jean So,Samuel Baker,Pramit Chaudhuri

Main category: cs.CL

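TL;DR: KRISTEVA is the first close reading benchmark for evaluating interpretive reasoning; it contains 1331 multiple-choice questions that test how well LLMs perform literary close reading.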

Details

Motivation: To fill the gap in evaluating LLMs on literary close reading and test whether they have college-level close reading ability. Method: Three progressively harder task sets are designed: extracting stylistic features, retrieving contextual information, and multi-hop reasoning. Result: Current LLMs show some close reading competency (accuracy 49.7%-69.7%) but still trail human evaluators. Conclusion: LLMs remain limited at literary close reading and need further improvement to approach human level. Abstract: Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, in which they gather textual details to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been evaluated on large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs may seem to understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results find that, while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% - 69.7%), their performances still trail those of experienced human evaluators on 10 out of our 11 tasks.

[71] Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting

Apollinaire Poli Nemkova,Sarath Chandra Lingareddy,Sagnik Ray Choudhury,Mark V. Albert

Main category: cs.CL

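TL;DR: The study examines how well large language models (LLMs) can forecast violent conflict, comparing their parametric knowledge with their non-parametric capabilities and assessing how external information affects performance.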

Details

Motivation: LLMs perform well on natural language tasks, but their ability to forecast conflict remains underexplored, even though it matters for early warning systems and humanitarian planning. Method: A two-part evaluation framework (parametric and non-parametric settings) compares LLM predictions of conflict trends and fatalities, with augmentation from external data (e.g., ACLED, GDELT). Result: LLMs show potential for conflict forecasting, and adding external knowledge is reported to improve performance. Conclusion: LLMs show strengths in conflict forecasting but need external information to make up for gaps in their pretrained knowledge. Abstract: Large Language Models (LLMs) have shown impressive performance across natural language tasks, but their ability to forecast violent conflict remains underexplored. We investigate whether LLMs possess meaningful parametric knowledge, encoded in their pretrained weights, to predict conflict escalation and fatalities without external data. This is critical for early warning systems, humanitarian planning, and policy-making. We compare this parametric knowledge with non-parametric capabilities, where LLMs access structured and unstructured context from conflict datasets (e.g., ACLED, GDELT) and recent news reports via Retrieval-Augmented Generation (RAG). Incorporating external information could enhance model performance by providing up-to-date context otherwise missing from pretrained weights. Our two-part evaluation framework spans 2020-2024 across conflict-prone regions in the Horn of Africa and the Middle East. In the parametric setting, LLMs predict conflict trends and fatalities relying only on pretrained knowledge. In the non-parametric setting, models receive summaries of recent conflict events, indicators, and geopolitical developments. We compare predicted conflict trend labels (e.g., Escalate, Stable Conflict, De-escalate, Peace) and fatalities against historical data. Our findings highlight the strengths and limitations of LLMs for conflict forecasting and the benefits of augmenting them with structured external knowledge.

[72] Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries

Martin Capdevila,Esteban Villa Turek,Ellen Karina Chumbe Fernandez,Luis Felipe Polo Galvez,Luis Cadavid,Andrea Marroquin,Rebeca Vargas Quesada,Johanna Crew,Nicole Vallejo Galarraga,Christopher Rodriguez,Diego Gutierrez,Radhi Datla

Main category: cs.CL

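TL;DR: The paper examines differences between variants of Spanish across Latin America and Spain and argues that regionally localized language models are important for bridging sociolinguistic gaps and building user trust.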

Details

Motivation: To show the significant differences between variants of Spanish and explain why regionally localized AI models are key to inclusivity and user growth. Method: An in-depth sociocultural and linguistic contextualization that compares Spanish variants in Latin America and Spain. Result: The paper argues that localized language models can reduce sociolinguistic dissonance and improve user trust and the effectiveness of internationalization strategies. Conclusion: It proposes implementing at least five sub-variants of Spanish to foster user trust and reliance and to demonstrate cultural awareness in support of internationalization goals. Abstract: Large language models are, by definition, based on language. In an effort to underscore the critical need for regional localized models, this paper examines primary differences between variants of written Spanish across Latin America and Spain, with an in-depth sociocultural and linguistic contextualization therein. We argue that these differences effectively constitute significant gaps in the quotidian use of Spanish among dialectal groups by creating sociolinguistic dissonances, to the extent that locale-sensitive AI models would play a pivotal role in bridging these divides. In doing so, this approach informs better and more efficient localization strategies that also serve to more adequately meet inclusivity goals, while securing sustainable active daily user growth in a major low-risk investment geographic area. Therefore, implementing at least the proposed five sub variants of Spanish addresses two lines of action: to foment user trust and reliance on AI language models while also demonstrating a level of cultural, historical, and sociolinguistic awareness that reflects positively on any internationalization strategy.

[73] From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models

Yidan Wang,Yubing Ren,Yanan Cao,Binxing Fang

Main category: cs.CL

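TL;DR: The paper proposes a hybrid framework that combines logits-based and sampling-based watermarking, using three strategies to balance detectability, robustness, text quality, and security; experiments show it outperforms existing baselines.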

Details

Motivation: With the rise of large language models (LLMs), misuse of AI-generated text is a growing concern, and watermarking is a promising solution. Existing schemes, however, trade off robustness, text quality, and security. Method: A framework that integrates logits-based and sampling-based watermarking with serial, parallel, and hybrid strategies; the hybrid strategy embeds watermarks adaptively using token entropy and semantic entropy. Result: Experiments on various datasets and models show that the method outperforms existing baselines and reaches state-of-the-art (SOTA) performance. Conclusion: The framework offers new insights into diverse watermarking paradigms, and the code is open source. Abstract: The rise of Large Language Models (LLMs) has heightened concerns about the misuse of AI-generated text, making watermarking a promising solution. Mainstream watermarking schemes for LLMs fall into two categories: logits-based and sampling-based. However, current schemes entail trade-offs among robustness, text quality, and security. To mitigate this, we integrate logits-based and sampling-based schemes, harnessing their respective strengths to achieve synergy. In this paper, we propose a versatile symbiotic watermarking framework with three strategies: serial, parallel, and hybrid. The hybrid framework adaptively embeds watermarks using token entropy and semantic entropy, optimizing the balance between detectability, robustness, text quality, and security. Furthermore, we validate our approach through comprehensive experiments on various datasets and models. Experimental results indicate that our method outperforms existing baselines and achieves state-of-the-art (SOTA) performance. We believe this framework provides novel insights into diverse watermarking paradigms. Our code is available at https://github.com/redwyd/SymMark.

[74] Rethinking Prompt Optimizers: From Prompt Merits to Optimization

Zixiao Zhu,Hanzhang Zhou,Zijian Feng,Tianjiao Li,Chua Jia Jim Deryl,Mak Lee Onn,Gee Wah Ng,Kezhi Mao

Main category: cs.CL

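TL;DR: MePO is a lightweight prompt optimizer grounded in interpretable design; it uses model-agnostic prompt quality merits to improve performance and works with inference models of different scales.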

Details

Motivation: Existing prompt optimization methods rely on large LLMs to generate complex prompts, which may be incompatible with lightweight models and degrade their performance. Method: Model-agnostic prompt quality merits are identified, and MePO is trained on a preference dataset generated by a lightweight LLM. Result: MePO performs well across diverse tasks and model types while reducing cost and privacy concerns. Conclusion: MePO offers a scalable and robust solution for real-world deployment. Abstract: Prompt optimization (PO) offers a practical alternative to fine-tuning large language models (LLMs), enabling performance improvements without altering model weights. Existing methods typically rely on advanced, large-scale LLMs like GPT-4 to generate optimized prompts. However, due to limited downward compatibility, verbose, instruction-heavy prompts from advanced LLMs can overwhelm lightweight inference models and degrade response quality. In this work, we rethink prompt optimization through the lens of interpretable design. We first identify a set of model-agnostic prompt quality merits and empirically validate their effectiveness in enhancing prompt and response quality. We then introduce MePO, a merit-guided, lightweight, and locally deployable prompt optimizer trained on our preference dataset built from merit-aligned prompts generated by a lightweight LLM. Unlike prior work, MePO avoids online optimization reliance, reduces cost and privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models. Experiments demonstrate that MePO achieves better results across diverse tasks and model types, offering a scalable and robust solution for real-world deployment. Our model and dataset are available at: https://github.com/MidiyaZhu/MePO

[75] Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph

Deeksha Prahlad,Chanhee Lee,Dongha Kim,Hokeun Kim

Main category: cs.CL

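TL;DR: The paper proposes retrieval augmented generation (RAG) based on knowledge graphs to address hallucination when large language models (LLMs) generate personalized responses; experiments show that it outperforms baseline models in accuracy and response time.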

Details

Motivation: LLMs are prone to hallucination caused by over-fitting when generating responses; a main cause is the lack of timely, factual, and personalized input. Method: RAG with knowledge graphs (KGs) draws on structured, continuously updated personal data (such as calendar data) to help the LLM generate personalized responses. Result: The approach significantly outperforms baseline LLMs in understanding personal information and generating accurate responses, with a moderate reduction in response time. Conclusion: Introducing knowledge graphs effectively addresses LLM hallucination and offers a more reliable way to generate personalized responses. Abstract: The advent of large language models (LLMs) has allowed numerous applications, including the generation of queried responses, to be leveraged in chatbots and other conversational assistants. Being trained on a plethora of data, LLMs often undergo high levels of over-fitting, resulting in the generation of extra and incorrect data, thus causing hallucinations in output generation. One of the root causes of such problems is the lack of timely, factual, and personalized information fed to the LLM. In this paper, we propose an approach to address these problems by introducing retrieval augmented generation (RAG) using knowledge graphs (KGs) to assist the LLM in personalized response generation tailored to the users. KGs have the advantage of storing continuously updated factual information in a structured way. While our KGs can be used for a variety of frequently updated personal data, such as calendar, contact, and location data, we focus on calendar data in this paper. Our experimental results show that our approach works significantly better in understanding personal information and generating accurate responses compared to the baseline LLMs using personal data as text inputs, with a moderate reduction in response time.

[76] DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs

Lake Yin,Fan Huang

Main category: cs.CL

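TL;DR: The paper proposes DIF, a method for measuring implicit bias in large language models (LLMs), validates it experimentally, and finds that implicit bias is inversely related to answer accuracy.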

Details

Motivation: Implicit bias in LLMs is both an ethical issue and a sign of a technical shortcoming, yet no standardized evaluation method exists. Method: DIF (Demographic Implicit Fairness) is computed by evaluating existing LLM logic and math problem datasets combined with sociodemographic personas. Result: Experiments show that DIF can statistically validate implicit bias in LLMs and reveal an inverse trend between answer accuracy and implicit bias. Conclusion: DIF provides a standardized way to evaluate implicit bias in LLMs and supports the view that implicit bias harms model performance. Abstract: As Large Language Models (LLMs) have risen in prominence over the past few years, there has been concern over the potential biases in LLMs inherited from the training data. Previous studies have examined how LLMs exhibit implicit bias, such as when response generation changes when different social contexts are introduced. We argue that this implicit bias is not only an ethical, but also a technical issue, as it reveals an inability of LLMs to accommodate extraneous information. However, unlike other measures of LLM intelligence, there are no standard methods to benchmark this specific subset of LLM bias. To bridge this gap, we developed a method for calculating an easily interpretable benchmark, DIF (Demographic Implicit Fairness), by evaluating preexisting LLM logic and math problem datasets with sociodemographic personas. We demonstrate that this method can statistically validate the presence of implicit bias in LLM behavior and find an inverse trend between question answering accuracy and implicit bias, supporting our argument.

[77] CAFE: Retrieval Head-based Coarse-to-Fine Information Seeking to Enhance Multi-Document QA Capability

Han Peng,Jinhao Jiang,Zican Dong,Wayne Xin Zhao,Lei Fang

Main category: cs.CL

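TL;DR: The paper proposes CAFE, a two-stage coarse-to-fine method for improving large language models on multi-document question answering; by gradually eliminating the influence of background and distracting documents, it significantly improves answer accuracy.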

Details

Motivation: Although LLM input context lengths have grown, retrieval and reasoning over long-context inputs remain weak. Existing methods struggle to balance retrieval precision and recall, which hurts question answering. Method: CAFE has two stages: 1) coarse-grained filtering, which uses retrieval heads to identify and rank relevant documents; 2) fine-grained steering, which guides attention to the most relevant content. Result: CAFE outperforms baselines across benchmarks, with SubEM improvements of up to 22.1% and 13.7% over SFT and RAG methods, respectively, on the Mistral model. Conclusion: By progressively refining document retrieval and attention allocation, CAFE significantly improves multi-document QA and offers an effective approach to retrieval and reasoning over long-context inputs. Abstract: Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods propose to utilize the prompt strategy and retrieval head to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, impacting their efficacy in answering questions. To address this, we introduce CAFE, a two-stage coarse-to-fine method to enhance multi-document question-answering capacities. By gradually eliminating the negative impacts of background and distracting documents, CAFE makes the responses more reliant on the evidence documents. Initially, a coarse-grained filtering method leverages retrieval heads to identify and rank relevant documents. Then, a fine-grained steering method guides attention to the most relevant content. Experiments across benchmarks show CAFE outperforms baselines, achieving up to 22.1% and 13.7% SubEM improvement over SFT and RAG methods on the Mistral model, respectively.

[78] Dark LLMs: The Growing Threat of Unaligned AI Models

Michael Fire,Yitzhak Elbazis,Adi Wasenstein,Lior Rokach

Main category: cs.CL

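TL;DR: The paper examines the threat of jailbreak attacks on large language models (LLMs), traces the vulnerability to their training data, and presents a universal jailbreak attack; testing shows that several state-of-the-art models remain vulnerable.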

Details

Motivation: As LLMs are widely deployed, their safety problems grow more prominent; jailbreak attacks in particular can lead models to output harmful content, creating societal risk. Method: By analyzing vulnerabilities rooted in training data, the study designs a universal jailbreak attack and tests the vulnerability of several state-of-the-art models. Result: Several LLMs remain vulnerable, and the industry response to AI safety is inadequate. Conclusion: Without decisive action, the risk of LLM misuse will grow; the authors call for stronger AI safety practices. Abstract: Large Language Models (LLMs) rapidly reshape modern life, advancing fields from healthcare to education and beyond. However, alongside their remarkable capabilities lies a significant threat: the susceptibility of these models to jailbreaking. The fundamental vulnerability of LLMs to jailbreak attacks stems from the very data they learn from. As long as this training data includes unfiltered, problematic, or 'dark' content, the models can inherently learn undesirable patterns or weaknesses that allow users to circumvent their intended safety controls. Our research identifies the growing threat posed by dark LLMs, models deliberately designed without ethical guardrails or modified through jailbreak techniques. In our research, we uncovered a universal jailbreak attack that effectively compromises multiple state-of-the-art models, enabling them to answer almost any question and produce harmful outputs upon request. The main idea of our attack was published online over seven months ago. However, many of the tested LLMs were still vulnerable to this attack. Despite our responsible disclosure efforts, responses from major LLM providers were often inadequate, highlighting a concerning gap in industry practices regarding AI safety. As model training becomes more accessible and cheaper, and as open-source LLMs proliferate, the risk of widespread misuse escalates. Without decisive intervention, LLMs may continue democratizing access to dangerous knowledge, posing greater risks than anticipated.

[79] Designing and Contextualising Probes for African Languages

Wisdom Aduah,Francois Meyer

Main category: cs.CL

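TL;DR: The paper systematically studies how linguistic knowledge is distributed in pretrained language models (PLMs) for African languages and finds that PLMs adapted for African languages capture more target-language features than massively multilingual PLMs.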

Details

Motivation: To investigate why PLMs for African languages are improving and to fill the gap in research on their internal mechanisms. Method: Layer-wise probes are trained to analyze the distribution of linguistic features in six African languages, and control tasks are designed to interpret probe performance. Result: Adapted PLMs encode more linguistic information about target languages; token-level syntactic information concentrates in middle-to-last layers, while sentence-level semantic information is distributed across all layers. Conclusion: The study confirms that probe performance reflects the internal knowledge of PLMs and provides theoretical support for strategies to improve African-language PLMs. Abstract: Pretrained language models (PLMs) for African languages are continually improving, but the reasons behind these advances remain unclear. This paper presents the first systematic investigation into probing PLMs for linguistic knowledge about African languages. We train layer-wise probes for six typologically diverse African languages to analyse how linguistic features are distributed. We also design control tasks, a way to interpret probe performance, for the MasakhaPOS dataset. We find PLMs adapted for African languages to encode more linguistic information about target languages than massively multilingual PLMs. Our results reaffirm previous findings that token-level syntactic information concentrates in middle-to-last layers, while sentence-level semantic information is distributed across all layers. Through control tasks and probing baselines, we confirm that performance reflects the internal knowledge of PLMs rather than probe memorisation. Our study applies established interpretability techniques to African-language PLMs. In doing so, we highlight the internal mechanisms underlying the success of strategies like active learning and multilingual adaptation.

[80] XRAG: Cross-lingual Retrieval-Augmented Generation

Wei Liu,Sony Trenous,Leonardo F. R. Ribeiro,Bill Byrne,Felix Hieber

Main category: cs.CL

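TL;DR: XRAG is a new benchmark for evaluating the generation abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG), particularly when the user language does not match the retrieved results.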

Details

Motivation: To address the evaluation of LLM generation abilities in cross-lingual RAG, especially under language mismatch and complex reasoning demands. Method: The XRAG dataset is built from news articles, covers monolingual and multilingual retrieval scenarios, and provides relevancy annotations; five LLMs are evaluated experimentally. Result: Two previously unreported challenges emerge: in monolingual retrieval, models struggle with response language correctness; in multilingual retrieval, the main challenge is reasoning over information across languages. Conclusion: XRAG is a valuable benchmark for studying LLM reasoning abilities, especially given cross-lingual complexity. Abstract: We propose XRAG, a novel benchmark designed to evaluate the generation abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG) settings where the user language does not match the retrieval results. XRAG is constructed from recent news articles to ensure that its questions require external knowledge to be answered. It covers the real-world scenarios of monolingual and multilingual retrieval, and provides relevancy annotations for each retrieved document. Our novel dataset construction pipeline results in questions that require complex reasoning, as evidenced by the significant gap between human and LLM performance. Consequently, XRAG serves as a valuable benchmark for studying LLM reasoning abilities, even before considering the additional cross-lingual complexity. Experimental results on five LLMs uncover two previously unreported challenges in cross-lingual RAG: 1) in the monolingual retrieval setting, all evaluated models struggle with response language correctness; 2) in the multilingual retrieval setting, the main challenge lies in reasoning over retrieved information across languages rather than generation of non-English text.

[81] What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs

Xinlan Yan,Di Wu,Yibin Lei,Christof Monz,Iacer Calixto

Main category: cs.CL

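TL;DR: S-MedQA is a medical question-answering dataset for evaluating large language models in fine-grained clinical specialties; the study finds that training on specialty data does not necessarily yield the best performance and that domain shifting matters more than knowledge injection.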

Details

Motivation: To study LLM performance on medical question answering and test whether the knowledge injection hypothesis holds in the medical domain. Method: Using S-MedQA, the study analyzes how training on data from different specialties affects model performance and tracks changes in the token probabilities of clinically relevant terms. Result: Training on specialty data does not necessarily yield the best performance; domain shifting matters more for performance gains. Conclusion: The authors suggest rethinking the role of fine-tuning data in the medical domain, emphasize domain shifting, and release the dataset and code. Abstract: In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset for benchmarking large language models in fine-grained clinical specialties. We use S-MedQA to check the applicability of a popular hypothesis related to knowledge injection in the knowledge-intense scenario of medical QA, and show that: 1) training on data from a specialty does not necessarily lead to best performance on that specialty and 2) regardless of the specialty fine-tuned on, token probabilities of clinically relevant terms for all specialties increase consistently. Thus, we believe improvement gains come mostly from domain shifting (e.g., general to medical) rather than knowledge injection and suggest rethinking the role of fine-tuning data in the medical domain. We release S-MedQA and all code needed to reproduce all our experiments to the research community.

[82] GE-Chat: A Graph Enhanced RAG Framework for Evidential Response Generation of LLMs

Longchao Da,Parth Mitesh Shah,Kuan-Ru Liou,Jiaxing Zhang,Hua Wei

Main category: cs.CL

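TL;DR: The GE-Chat framework enhances retrieval-augmented generation with a knowledge graph to provide evidence-based responses and improve the reliability of LLM outputs.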

Details

Motivation: LLM outputs can be unreliable and hallucinated, and users must evaluate them manually, so response trustworthiness needs to improve. Method: Combines a knowledge graph, retrieval-augmented generation, Chain-of-Thought (CoT) logic generation, n-hop sub-graph searching, and entailment-based sentence generation to achieve accurate evidence retrieval. Result: The method accurately identifies evidence in free-form context, improves model performance, and helps judge the trustworthiness of LLM conclusions. Conclusion: GE-Chat provides reliable evidence support for LLM outputs and strengthens user trust. Abstract: Large Language Models are now key assistants in human decision-making processes. However, a common note always seems to follow: "LLMs can make mistakes. Be careful with important info." This points to the reality that not all outputs from LLMs are dependable, and users must evaluate them manually. The challenge deepens as hallucinated responses, often presented with seemingly plausible explanations, create complications and raise trust issues among users. To tackle this issue, this paper proposes GE-Chat, a knowledge Graph enhanced retrieval-augmented generation framework to provide Evidence-based response generation. Specifically, when the user uploads a material document, a knowledge graph will be created, which helps construct a retrieval-augmented agent, enhancing the agent's responses with additional knowledge beyond its training corpus. Then we leverage Chain-of-Thought (CoT) logic generation, n-hop sub-graph searching, and entailment-based sentence generation to realize accurate evidence retrieval. We demonstrate that our method improves the existing models' performance in terms of identifying the exact evidence in a free-form context, providing a reliable way to examine the resources of an LLM's conclusion and help with the judgment of its trustworthiness.

[83] Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning

Yoichi Ishibashi,Taro Yano,Masafumi Oyamada

Main category: cs.CL

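TL;DR: The paper evaluates Reasoning CPT, a form of continual pretraining on synthetic data; compared with standard CPT, it performs better on cross-domain reasoning tasks, with especially large gains on hard problems.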

Details

Motivation: Supervised fine-tuning and reinforcement learning for reasoning apply mainly to specific domains (such as mathematics and programming), which limits the breadth and scalability of training data. Reasoning CPT uses synthetic data to model the author's thinking process in an attempt to address this. Method: Reasoning CPT uses synthetic data (derived from STEM and Law corpora) to reconstruct the hidden thought processes behind texts; it is applied to Gemma2-9B and compared with standard CPT on the MMLU benchmark. Result: Reasoning CPT improves performance across all evaluated domains, reasoning skills transfer across domains, and gains reach up to 8 points on the hardest problems. Models also learn to adjust reasoning depth to problem difficulty. Conclusion: Reasoning CPT uses synthetic data to improve cross-domain reasoning, particularly on complex tasks, which points to its potential and scalability. Abstract: Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, these approaches are primarily applicable to specific domains such as mathematics and programming, which imposes fundamental constraints on the breadth and scalability of training data. In contrast, continual pretraining (CPT) offers the advantage of not requiring task-specific signals. Nevertheless, how to effectively synthesize training data for reasoning and how such data affect a wide range of domains remain largely unexplored. This study provides a detailed evaluation of Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden thought processes underlying texts, based on the premise that texts are the result of the author's thinking process. Specifically, we apply Reasoning CPT to Gemma2-9B using synthetic data with hidden thoughts derived from STEM and Law corpora, and compare it to standard CPT on the MMLU benchmark. Our analysis reveals that Reasoning CPT consistently improves performance across all evaluated domains. Notably, reasoning skills acquired in one domain transfer effectively to others; the performance gap with conventional methods widens as problem difficulty increases, with gains of up to 8 points on the most challenging problems. Furthermore, models trained with hidden thoughts learn to adjust the depth of their reasoning according to problem difficulty.

[84] The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

Seongyun Lee,Seungone Kim,Minju Seo,Yongrae Jo,Dongyoung Go,Hyeonbin Hwang,Jinho Park,Xiang Yue,Sean Welleck,Graham Neubig,Moontae Lee,Minjoon Seo

Main category: cs.CL

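TL;DR: The paper introduces the CoT Encyclopedia, a framework for automatically analyzing and steering the reasoning behavior of language models; it is more comprehensive and interpretable than existing methods and can improve model performance.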

Details

Motivation: Understanding of the long chain-of-thought reasoning strategies of language models remains limited; existing methods depend on human intuition and fail to capture the diversity of model behaviors. Method: The CoT Encyclopedia automatically extracts reasoning criteria, embeds and clusters them, and derives contrastive rubrics. Result: The framework is more comprehensive and interpretable than existing methods and can predict and steer model reasoning strategies. Conclusion: Training data format strongly affects reasoning behavior, which underscores the importance of format-aware model design. Abstract: Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we provide practical insights, such as that training data format (e.g., free-form vs. multiple-choice) has a far greater impact on reasoning behavior than data domain, underscoring the importance of format-aware model design.

[85] VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits

Jintian Shao,Hongyi Huang,Jiayi Wu,YiMing Cheng,ZhiYu Wu,You Shan,MingKai Zheng

Main category: cs.CL

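TL;DR: VQ-Logits uses vector quantization to sharply reduce the parameter count and compute cost of the LLM output layer; experiments show up to 99% parameter reduction and a 6x speedup in logit computation, with only a 4% increase in perplexity.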

Details

Motivation: The output layer of large language models is costly in parameters and computation, and existing methods add structural complexity. Method: Vector quantization replaces the large output embedding matrix with a small shared codebook; logits are predicted over the compact codebook and then expanded to the full vocabulary space through a mapping. Result: Up to 99% parameter reduction and a 6x speedup in logit computation, with only a 4% increase in perplexity. Conclusion: VQ-Logits is an efficient and robust method that substantially reduces the resource requirements of the LLM output layer. Abstract: Large Language Models (LLMs) have achieved remarkable success but face significant computational and memory challenges, particularly due to their extensive output vocabularies. The final linear projection layer, mapping hidden states to vocabulary-sized logits, often constitutes a substantial portion of the model's parameters and computational cost during inference. Existing methods like adaptive softmax or hierarchical softmax introduce structural complexities. In this paper, we propose VQ-Logits, a novel approach that leverages Vector Quantization (VQ) to drastically reduce the parameter count and computational load of the LLM output layer. VQ-Logits replaces the large $V \times d_{\text{model}}$ output embedding matrix with a small, shared codebook of $K$ embedding vectors ($K \ll V$). Each token in the vocabulary is mapped to one of these $K$ codebook vectors. The LLM predicts logits over this compact codebook, which are then efficiently "scattered" to the full vocabulary space using the learned or preassigned mapping. We demonstrate through extensive experiments on standard language modeling benchmarks (e.g., WikiText-103, C4) that VQ-Logits can achieve up to 99% parameter reduction in the output layer and 6x speedup in logit computation, with only a marginal 4% increase in perplexity compared to full softmax baselines. We further provide detailed ablation studies on codebook size, initialization, and learning strategies, showcasing the robustness and effectiveness of our approach.
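The codebook-and-scatter step described in the VQ-Logits abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sizes, the random codebook, the round-robin `token_to_code` mapping, and the helper names `vq_logits` and `output_layer_params` are all hypothetical, and in this plain form every token that shares a codebook entry receives the same logit.

```python
import random

def vq_logits(hidden, codebook, token_to_code):
    """Compute K codebook logits, then scatter them to all V tokens."""
    # K dot products instead of V: one logit per codebook vector.
    code_logits = [sum(h * c for h, c in zip(hidden, vec)) for vec in codebook]
    # "Scatter": each token inherits the logit of its assigned codebook entry.
    return [code_logits[k] for k in token_to_code]

def output_layer_params(vocab_size, d_model, num_codes):
    """Weight counts for a full V x d_model projection vs. a K x d_model codebook."""
    full = vocab_size * d_model
    compact = num_codes * d_model  # the token-to-code map stores indices, not weights
    return full, compact

random.seed(0)
V, D, K = 12, 4, 3  # toy sizes, chosen only for illustration
codebook = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
token_to_code = [t % K for t in range(V)]  # assumed preassigned mapping
hidden = [random.gauss(0, 1) for _ in range(D)]

logits = vq_logits(hidden, codebook, token_to_code)
full, compact = output_layer_params(50_000, 1024, 500)
print(len(logits), full, compact, 1 - compact / full)
```

Because both matrices share the factor d_model, the weight reduction is 1 - K/V, so a codebook one hundredth the size of the vocabulary corresponds to the 99% figure quoted above (ignoring the storage for the index map).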

[86] RAIDEN-R1: Improving Role-awareness of LLMs via GRPO with Verifiable Reward

Zongsheng Wang,Kaili Sun,Bowen Wu,Qun Yu,Ying Li,Baoxun Wang

Main category: cs.CL

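TL;DR: RAIDEN-R1 is a new reinforcement learning framework that uses a Verifiable Role-Awareness Reward (VRAR) to improve the role consistency of role-playing conversational agents (RPCAs).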

Details

Motivation: To address the challenge that role-playing conversational agents face in maintaining role consistency. Method: Integrates the VRAR reward, uses singular and multi-term mining strategies to assess role-specific keys, and builds a high-quality role-aware Chain-of-Thought dataset. Result: The 14B-GRPO model reaches 88.04% and 88.65% accuracy on the Script-Based Knowledge and Conversation Memory metrics, respectively, outperforming baseline models. Conclusion: RAIDEN-R1 bridges the non-quantifiability gap in RPCA training and advances the development of role-aware reasoning patterns. Abstract: Role-playing conversational agents (RPCAs) face persistent challenges in maintaining role consistency. To address this, we propose RAIDEN-R1, a novel reinforcement learning framework that integrates Verifiable Role-Awareness Reward (VRAR). The method introduces both singular and multi-term mining strategies to generate quantifiable rewards by assessing role-specific keys. Additionally, we construct a high-quality, role-aware Chain-of-Thought dataset through multi-LLM collaboration, and implement experiments to enhance reasoning coherence. Experiments on the RAIDEN benchmark demonstrate RAIDEN-R1's superiority: our 14B-GRPO model achieves 88.04% and 88.65% accuracy on Script-Based Knowledge and Conversation Memory metrics, respectively, outperforming baseline models while maintaining robustness. Case analyses further reveal the model's enhanced ability to resolve conflicting contextual cues and sustain first-person narrative consistency. This work bridges the non-quantifiability gap in RPCA training and provides insights into role-aware reasoning patterns, advancing the development of RPCAs.

Poli Apollinaire Nemkova,Solomon Ubani,Mark V. Albert

Main category: cs.CL

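TL;DR: The study examines how well several state-of-the-art large language models (such as GPT-3.5 and GPT-4) perform zero-shot and few-shot annotation of human rights violations in Russian and Ukrainian social media posts, and compares their output with human annotations.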

Details

Motivation: To assess the reliability and applicability of large language models for sensitive, domain-specific tasks in multilingual contexts, especially where judgments are highly subjective. Method: Several LLMs perform zero-shot and few-shot annotation; their output is compared with human annotations, and performance and error patterns are analyzed under different prompting conditions. Result: The study identifies each model's strengths and weaknesses in cross-linguistic adaptability and annotation. Conclusion: LLMs show some potential for sensitive, domain-specific tasks, but their handling of subjective, context-dependent judgments calls for caution. Abstract: In the era of increasingly sophisticated natural language processing (NLP) systems, large language models (LLMs) have demonstrated remarkable potential for diverse applications, including tasks requiring nuanced textual understanding and contextual reasoning. This study investigates the capabilities of multiple state-of-the-art LLMs - GPT-3.5, GPT-4, LLAMA3, Mistral 7B, and Claude-2 - for zero-shot and few-shot annotation of a complex textual dataset comprising social media posts in Russian and Ukrainian. Specifically, the focus is on the binary classification task of identifying references to human rights violations within the dataset. To evaluate the effectiveness of these models, their annotations are compared against a gold standard set of human double-annotated labels across 1000 samples. The analysis includes assessing annotation performance under different prompting conditions, with prompts provided in both English and Russian. Additionally, the study explores the unique patterns of errors and disagreements exhibited by each model, offering insights into their strengths, limitations, and cross-linguistic adaptability. By juxtaposing LLM outputs with human annotations, this research contributes to understanding the reliability and applicability of LLMs for sensitive, domain-specific tasks in multilingual contexts. It also sheds light on how language models handle inherently subjective and context-dependent judgments, a critical consideration for their deployment in real-world scenarios.

[88] The Evolving Landscape of Generative Large Language Models and Traditional Natural Language Processing in Medicine

Rui Yang,Huitao Li,Matthew Yu Heng Wong,Yuhe Ke,Xin Li,Kunyu Yu,Jingchi Liao,Jonathan Chong Kai Liew,Sabarinath Vinod Nair,Jasmine Chiat Ling Ong,Irene Li,Douglas Teodoro,Chuan Hong,Daniel Shu Wei Ting,Nan Liu

Main category: cs.CL

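TL;DR: Generative large language models (LLMs) show advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks.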

Details

Motivation: To examine how generative LLMs and traditional NLP differ across medical tasks. Method: An analysis of 19,123 studies. Result: Generative LLMs show advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. Conclusion: As these technologies advance, using them ethically is essential to realizing their potential in medical applications. Abstract: Natural language processing (NLP) has been traditionally applied to medicine, and generative large language models (LLMs) have become prominent recently. However, the differences between them across different medical tasks remain underexplored. We analyzed 19,123 studies, finding that generative LLMs demonstrate advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. As these technologies advance, ethical use of them is essential to ensure their potential in medical applications.

[89] From Questions to Clinical Recommendations: Large Language Models Driving Evidence-Based Clinical Decision Making

Dubai Li,Nan Jiang,Kangping Huang,Ruiqi Tu,Shuyu Ouyang,Huayu Yu,Lin Qiao,Chen Yu,Tianshu Zhou,Danyang Tong,Qian Wang,Mengtao Li,Xiaofeng Zeng,Yu Tian,Xinping Tian,Jingsong Li

Main category: cs.CL

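TL;DR: Quicker is a clinical decision support system powered by large language models, designed to automate evidence synthesis and generate clinical recommendations in support of more efficient and accurate decision-making.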

Details

Motivation: Integrating research evidence into clinical practice is hampered by heavy workload, complex processes, and time constraints, so automated tool support is needed. Method: Quicker implements a fully automated chain from questions to clinical recommendations and supports customized decision-making. It is evaluated on the Q2CRBench-3 benchmark dataset. Result: Quicker performs strongly in question decomposition, literature screening, and the logical coherence of its recommendations; collaboration between a single reviewer and Quicker cuts recommendation development time to 20-40 minutes. Conclusion: Quicker can help physicians make quicker and more reliable evidence-based clinical decisions. Abstract: Clinical evidence, derived from rigorous research and data analysis, provides healthcare professionals with reliable scientific foundations for informed decision-making. Integrating clinical evidence into real-time practice is challenging due to the enormous workload, complex professional processes, and time constraints. This highlights the need for tools that automate evidence synthesis to support more efficient and accurate decision making in clinical settings. This study introduces Quicker, an evidence-based clinical decision support system powered by large language models (LLMs), designed to automate evidence synthesis and generate clinical recommendations modeled after standard clinical guideline development processes. Quicker implements a fully automated chain that covers all phases, from questions to clinical recommendations, and further enables customized decision-making through integrated tools and interactive user interfaces. To evaluate Quicker's capabilities, we developed the Q2CRBench-3 benchmark dataset, based on clinical guideline development records for three different diseases. Experimental results highlighted Quicker's strong performance, with fine-grained question decomposition tailored to user preferences, retrieval sensitivities comparable to human experts, and literature screening performance approaching comprehensive inclusion of relevant studies. In addition, Quicker-assisted evidence assessment effectively supported human reviewers, while Quicker's recommendations were more comprehensive and logically coherent than those of clinicians. In system-level testing, collaboration between a single reviewer and Quicker reduced the time required for recommendation development to 20-40 minutes. In general, our findings affirm the potential of Quicker to help physicians make quicker and more reliable evidence-based clinical decisions.

[90] J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

Chenxi Whitehouse,Tianlu Wang,Ping Yu,Xian Li,Jason Weston,Ilia Kulikov,Swarnadeep Saha

Main category: cs.CL

TL;DR: J1 is a reinforcement learning method for training LLM-as-a-Judge models; it outperforms existing 8B and 70B models, including models distilled from DeepSeek-R1.

Details

Motivation: Evaluation quality is a bottleneck for AI progress, which calls for stronger LLM-as-a-Judge models and better recipes for training them to think. Method: J1 converts both verifiable and non-verifiable prompts into judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. Result: J1 outperforms other models at the 8B and 70B scales and even surpasses R1 on some benchmarks. Conclusion: By optimizing the training recipe and reward design, J1 markedly improves the model's judgment ability. Abstract: The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.

[91] LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations

Yile Wang,Zhanyu Shen,Hui Huang

Main category: cs.CL

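TL;DR: Proposes LDIR, a low-dimensional, dense, and interpretable text embedding method whose anchor texts are chosen by farthest point sampling and whose values indicate semantic relatedness to those anchors; its performance is close to that of black-box baseline models.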

Details

Motivation: Existing text embeddings (e.g., SimCSE and LLM2Vec) perform well but are hard to interpret, while classic sparse embeddings such as bag-of-words perform poorly. A method is needed that is both high-performing and interpretable. Method: LDIR selects anchor texts by farthest point sampling and builds low-dimensional (fewer than 500 dimensions), dense, and interpretable text embeddings whose values indicate semantic relatedness to the anchors. Result: On multiple semantic textual similarity, retrieval, and clustering tasks, LDIR performs close to black-box baseline models and outperforms other interpretable embedding baselines. Conclusion: LDIR achieves strong performance and interpretability at low dimensionality, offering a new option for text embedding. Abstract: Semantic text representation is a fundamental task in the field of natural language processing. Existing text embeddings (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as a classic sparse interpretable embedding, suffers from poor performance. Recently, Benara et al. (2024) proposed interpretable text embeddings using large language models, which form "0/1" embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts through farthest point sampling, offering both semantic representation as well as a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embeddings baselines with far fewer dimensions. Code is available at https://github.com/szu-tera/LDIR.
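
The abstract describes two steps, anchor selection by farthest point sampling and a per-anchor relatedness score. The sketch below illustrates that idea under stated assumptions: the vectors stand in for the output of some text encoder, cosine similarity is assumed as the relatedness measure, and all function names are hypothetical. It is not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def farthest_point_sampling(points: List[Vector], k: int, start: int = 0) -> List[int]:
    """Greedily pick k indices; each new pick is farthest from those already chosen."""
    chosen = [start]
    # Distance (1 - cosine) from every point to its nearest chosen anchor.
    nearest = [1.0 - cosine(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], 1.0 - cosine(p, points[nxt]))
    return chosen


def relative_embedding(x: Vector, anchors: List[Vector]) -> List[float]:
    """One dimension per anchor: the similarity of x to that anchor."""
    return [cosine(x, a) for a in anchors]


# Toy encoder outputs for a five-text corpus (hypothetical values).
corpus = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, 0.0)]
anchor_ids = farthest_point_sampling(corpus, k=3)
anchors = [corpus[i] for i in anchor_ids]
embedding = relative_embedding((0.8, 0.2), anchors)
```

Each dimension of `embedding` can be read directly as "how related is this text to anchor i", which is the traceability the abstract refers to.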

Chunyu Ye,Shaonan Wang

Main category: cs.CL

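TL;DR: Proposes a multimodal framework that uses vision-language models to reconstruct language from brain activity and supports visual, auditory, and textual inputs.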

Details

Motivation: Human thought is multimodal, yet prior work is mostly limited to single-modality inputs, so a more flexible approach to decoding brain activity is needed. Method: The framework uses visual-language models (VLMs) together with modality-specific experts to jointly interpret information across modalities. Result: Experiments show performance comparable to state-of-the-art systems while remaining adaptable and extensible. Conclusion: The work advances toward more ecologically valid and generalizable mind decoding. Abstract: Decoding thoughts from brain activity offers valuable insights into human cognition and enables promising applications in brain-computer interaction. While prior studies have explored language reconstruction from fMRI data, they are typically limited to single-modality inputs such as images or audio. In contrast, human thought is inherently multimodal. To bridge this gap, we propose a unified and flexible framework for reconstructing coherent language from brain recordings elicited by diverse input modalities: visual, auditory, and textual. Our approach leverages visual-language models (VLMs), using modality-specific experts to jointly interpret information across modalities. Experiments demonstrate that our method achieves performance comparable to state-of-the-art systems while remaining adaptable and extensible. This work advances toward more ecologically valid and generalizable mind decoding.

[93] Multi-domain Multilingual Sentiment Analysis in Industry: Predicting Aspect-based Opinion Quadruples

Benjamin White,Anastasia Shimorina

Main category: cs.CL

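TL;DR: The paper explores the design of an aspect-based sentiment analysis system built on large language models (LLMs), focusing on opinion quadruple extraction and testing whether a single model generalizes across domains and languages.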

Details

Motivation: To find out whether a single model can handle multiple domain-specific taxonomies at once and thereby reduce operational complexity. Method: Using internal datasets, the authors train a combined multi-domain model and compare it with specialized single-domain models. Result: The combined multi-domain model performs comparably to the specialized single-domain models while reducing operational complexity. Conclusion: The paper shares lessons learned on handling non-extractive predictions and evaluating failure modes, offering practical guidance for LLM-based structured prediction tasks. Abstract: This paper explores the design of an aspect-based sentiment analysis system using large language models (LLMs) for real-world use. We focus on quadruple opinion extraction -- identifying aspect categories, sentiment polarity, targets, and opinion expressions from text data across different domains and languages. Using internal datasets, we investigate whether a single fine-tuned model can effectively handle multiple domain-specific taxonomies simultaneously. We demonstrate that a combined multi-domain model achieves performance comparable to specialized single-domain models while reducing operational complexity. We also share lessons learned for handling non-extractive predictions and evaluating various failure modes when developing LLM-based systems for structured prediction tasks.

[94] Rethinking Repetition Problems of LLMs in Code Generation

Yihong Dong,Yuchen Liu,Xue Jiang,Zhi Jin,Ge Li

Main category: cs.CL

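TL;DR: The paper proposes RPG, a grammar-based decoding method that addresses structural repetition in code generation and substantially improves the quality of generated code.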

Details

Motivation: Neural language models still suffer from repetition in code generation, especially structural repetition, whereas prior work has focused mainly on content repetition. Method: RPG uses grammar rules to identify repetition and decays the likelihood of the critical tokens that contribute to it. Result: RPG outperforms baseline methods on the CodeRepetEval, HumanEval, and MBPP benchmarks and effectively reduces repetition. Conclusion: RPG is an efficient decoding method that markedly alleviates repetition in code generation. Abstract: With the advent of neural language models, the performance of code generation has been significantly boosted. However, the problem of repetitions during the generation process continues to linger. Previous work has primarily focused on content repetition, which is merely a fraction of the broader repetition problem in code generation. A more prevalent and challenging problem is structural repetition. In structural repetition, the repeated code appears in various patterns but possesses a fixed structure, which can be inherently reflected in grammar. In this paper, we formally define structural repetition and propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar, to alleviate the repetition problems in code generation for LLMs. Specifically, RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions, thereby mitigating them in code generation. To facilitate this study, we construct a new dataset CodeRepetEval to comprehensively evaluate approaches for mitigating the repetition problems in code generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on the CodeRepetEval dataset as well as HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of generated code.
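
To make the idea of structural repetition concrete, here is a toy sketch. It is a loose stand-in, not RPG itself: a regex-based line "signature" replaces the paper's grammar rules, a token is treated as critical when it could start another copy of the repeated shape, and the threshold and decay values are arbitrary assumptions.

```python
import keyword
import math
import re
from typing import Dict, List


def signature(code: str) -> str:
    """Keep the shape of a code fragment: names become ID, integers become NUM."""
    shaped = re.sub(
        r"[A-Za-z_]\w*",
        lambda m: m.group(0) if keyword.iskeyword(m.group(0)) else "ID",
        code,
    )
    return re.sub(r"\b\d+\b", "NUM", shaped).strip()


def trailing_run(lines: List[str]) -> int:
    """Number of consecutive lines at the end that share one signature."""
    if not lines:
        return 0
    last = signature(lines[-1])
    run = 0
    for line in reversed(lines):
        if signature(line) != last:
            break
        run += 1
    return run


def decay_repeats(
    logprobs: Dict[str, float], lines: List[str], threshold: int = 3, decay: float = 0.5
) -> Dict[str, float]:
    """Lower the log-probability of tokens that would start another repeated line."""
    run = trailing_run(lines)
    parts = signature(lines[-1]).split() if lines else []
    if run < threshold or not parts:
        return dict(logprobs)
    # Multiply the probability by `decay` once per repetition beyond the threshold.
    penalty = (run - threshold + 1) * math.log(decay)
    return {
        tok: lp + penalty if signature(tok) == parts[0] else lp
        for tok, lp in logprobs.items()
    }


generated = ["def f():", "    a1 = 1", "    a2 = 2", "    a3 = 3"]
adjusted = decay_repeats({"a4": -0.1, "return": -2.5}, generated)
```

The three assignments differ in content but share the shape `ID = NUM`, so the candidate `a4` is penalized while `return` is left untouched.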

[95] Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation

Yue Guo,Jae Ho Sohn,Gondy Leroy,Trevor Cohen

Main category: cs.CL

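TL;DR: LLM-generated plain language summaries (PLSs) are hard to distinguish from human-written ones in subjective ratings, but human-written PLSs lead to significantly better comprehension. Automated evaluation metrics do not agree with human judgment.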

Details

Motivation: To clarify how effective LLM-generated PLSs really are and to address the limitations of automated evaluation metrics. Method: A large-scale crowdsourced evaluation with 150 participants compares LLM-generated and human-written PLSs using both subjective ratings and objective comprehension tests. Result: LLM-generated PLSs score well in subjective ratings, but human-written PLSs yield better comprehension; automated metrics do not match human judgment. Conclusion: Evaluation frameworks and generation methods that focus on comprehension are needed. Abstract: Plain language summaries (PLSs) are essential for facilitating effective communication between clinicians and patients by making complex medical information easier for laypeople to understand and act upon. Large language models (LLMs) have recently shown promise in automating PLS generation, but their effectiveness in supporting health information comprehension remains unclear. Prior evaluations have generally relied on automated scores that do not measure understandability directly, or subjective Likert-scale ratings from convenience samples with limited generalizability. To address these gaps, we conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using Amazon Mechanical Turk with 150 participants. We assessed PLS quality through subjective Likert-scale ratings focusing on simplicity, informativeness, coherence, and faithfulness; and objective multiple-choice comprehension and recall measures of reader understanding. Additionally, we examined the alignment between 10 automated evaluation metrics and human judgments. Our findings indicate that while LLMs can generate PLSs that appear indistinguishable from human-written ones in subjective evaluations, human-written PLSs lead to significantly better comprehension. Furthermore, automated evaluation metrics fail to reflect human judgment, calling into question their suitability for evaluating PLSs. This is the first study to systematically evaluate LLM-generated PLSs based on both reader preferences and comprehension outcomes. Our findings highlight the need for evaluation frameworks that move beyond surface-level quality and for generation methods that explicitly optimize for layperson comprehension.

[96] Hierarchical Document Refinement for Long-context Retrieval-augmented Generation

Jiajie Jin,Xiaoxi Li,Guanting Dong,Yuyao Zhang,Yutao Zhu,Yongkang Wu,Zhonghua Li,Qi Ye,Zhicheng Dou

Main category: cs.CL

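TL;DR: LongRefiner is an efficient refiner for long-context RAG that combines dual-level query analysis, hierarchical document structuring, and adaptive refinement to cut computational cost substantially while keeping performance competitive.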

Details

Motivation: Redundant information and noise in long-context RAG applications raise inference costs and degrade performance. Method: Dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning. Result: Competitive performance on seven QA datasets with 10x lower computational cost and latency than the best baseline. Conclusion: LongRefiner is efficient, scalable, and practical for real-world long-text RAG applications. Abstract: Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise result in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios while using 10x lower computational cost and latency compared to the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at https://github.com/ignorejjj/LongRefiner.

[97] Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

Zemin Huang,Zhiyang Chen,Zijun Wang,Tiancheng Li,Guo-Jun Qi

Main category: cs.CL

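TL;DR: DCoLT is a reasoning framework for diffusion language models that treats the intermediate steps of the reverse diffusion process as latent "thinking" actions and optimizes the whole reasoning trajectory with reinforcement learning to improve the correctness of the final answer.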

Details

Motivation: Traditional Chain-of-Thought (CoT) methods follow a causal, linear reasoning process, whereas DCoLT allows bidirectional, non-linear reasoning without requiring grammatically correct intermediate steps, with the aim of improving reasoning ability. Method: DCoLT is implemented on two diffusion language models (SEDD and LLaDA) and optimizes reasoning trajectories with reinforcement learning. For SEDD, a probabilistic policy maximizes the reward; for LLaDA, a ranking-based Unmasking Policy Module optimizes the unmasking order. Result: DCoLT-reinforced diffusion language models outperform other methods on math and code generation tasks, and LLaDA gains substantial accuracy on several tasks (e.g., +9.8% on GSM8K). Conclusion: Through non-linear reasoning and reinforcement learning, DCoLT markedly improves the reasoning ability of diffusion language models, particularly on complex tasks. Abstract: We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, where its concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model -- LLaDA, and find that the order to predict and unmask tokens plays an essential role to optimize its RL action resulting from the ranking-based Unmasking Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, +19.5% on GSM8K, MATH, MBPP, and HumanEval.
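
The abstract says the unmasking order is governed by a Plackett-Luce model. As a reference point, the sketch below computes the standard Plackett-Luce log-probability of an ordering from per-item scores. How DCoLT obtains those scores from the model is not stated here, so the score values and the function name are assumptions.

```python
import math
from itertools import permutations
from typing import Sequence


def plackett_luce_log_prob(scores: Sequence[float], order: Sequence[int]) -> float:
    """Log-probability of picking items in `order`, given one real-valued score per item."""
    remaining = list(range(len(scores)))
    log_prob = 0.0
    for item in order:
        # Softmax over the items that have not been picked yet.
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob


# Hypothetical scores for three masked positions.
scores = [2.0, 0.5, -1.0]
orders = list(permutations(range(3)))
total = sum(math.exp(plackett_luce_log_prob(scores, o)) for o in orders)
best = max(orders, key=lambda o: plackett_luce_log_prob(scores, o))
```

Passing a shorter `order` gives the probability of a top-k prefix, which corresponds to choosing which few tokens to unmask at one step. A log-probability of this form is what a policy-gradient method would differentiate.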

[98] CL-RAG: Bridging the Gap in Retrieval-Augmented Generation with Curriculum Learning

Shaohan Wang,Licheng Zhang,Zheren Fu,Zhendong Mao

Main category: cs.CL

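TL;DR: The CL-RAG framework trains RAG systems with multi-stage curriculum learning, yielding gains of 2% to 4% on four open-domain QA datasets.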

Details

Motivation: Existing RAG methods use retrieved documents directly, but document quality varies widely, which hampers training. Method: Build training data at multiple difficulty levels and train the retriever and generator in stages. Result: Performance gains of 2% to 4% on four open-domain QA datasets. Conclusion: CL-RAG effectively improves the performance and generalization of RAG systems. Abstract: Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods focus on optimizing the retriever or generator in the RAG system by directly utilizing the top-k retrieved documents. However, the effectiveness of the documents varies significantly across user queries, i.e., some documents provide valuable knowledge while others totally lack critical information. This hinders the retriever's and generator's adaptation during training. Inspired by human cognitive learning, curriculum learning trains models using samples progressing from easy to difficult, thus enhancing their generalization ability, and we integrate this effective paradigm into the training of the RAG system. In this paper, we propose a multi-stage Curriculum Learning based RAG system training framework, named CL-RAG. We first construct training data with multiple difficulty levels for the retriever and generator separately through sample evolution. Then, we train the model in stages based on the curriculum learning approach, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our CL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.
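
The staged, easy-to-difficult schedule can be sketched as follows. This is only an illustration of curriculum ordering: the difficulty levels are assumed to be given as integers, and whether CL-RAG's stages are disjoint or cumulative is not stated in the abstract, so disjoint stages are an assumption.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, int]  # (sample id, difficulty level; higher means harder)


def curriculum_stages(samples: List[Sample]) -> List[List[str]]:
    """Group samples by difficulty and return the groups from easiest to hardest."""
    by_level: Dict[int, List[str]] = {}
    for sample_id, level in samples:
        by_level.setdefault(level, []).append(sample_id)
    return [by_level[level] for level in sorted(by_level)]


# Hypothetical training samples with assigned difficulty levels.
data = [("q3", 2), ("q1", 0), ("q4", 1), ("q2", 0)]
stages = curriculum_stages(data)
```

A training loop would then iterate over `stages` in order and run one training phase per stage.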

[99] Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective

Yutao Mou,Xiao Deng,Yuxiao Luo,Shikun Zhang,Wei Ye

Main category: cs.CL

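TL;DR: Proposes the multi-task benchmark CoV-Eval and the judgment model VC-Judge for a comprehensive evaluation of LLM code security, and finds that LLMs still fall short at generating secure code and repairing vulnerabilities.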

Details

Motivation: Existing code security benchmarks cover a single task and lack multi-dimensional evaluation, so a more comprehensive tool for testing security performance is needed. Method: The authors propose the multi-task benchmark CoV-Eval (code completion, vulnerability repair, and more) and the VC-Judge model, and evaluate 20 LLMs. Result: Most LLMs identify vulnerable code well but are weaker at generating secure code and performing repairs, and they struggle to recognize specific vulnerability types. Conclusion: The study reveals key challenges in LLM code security and points to optimization directions for future research. Abstract: Code security and usability are both essential for various coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on a single evaluation task and paradigm, such as code completion and generation, lacking comprehensive assessment across dimensions like secure code generation, vulnerability repair and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification, for comprehensive evaluation of LLM code security. Besides, we developed VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities in a more efficient and reliable way. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle with recognizing specific vulnerability types and performing repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.

[100] The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

Benedikt Ebing,Goran Glavaš

Main category: cs.CL

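TL;DR: The paper studies label projection in translation-based cross-lingual transfer (XLT), systematically optimizes the low-level design decisions of word aligners (WAs), and proposes a new ensembling strategy that substantially improves performance.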

Details

Motivation: Label projection is the key challenge in cross-lingual transfer for token classification, yet the low-level design decisions of existing approaches (word aligners and marker-based methods) have not been studied systematically. Method: The authors systematically examine the low-level design decisions of word aligners in label projection (projection algorithm, filtering strategies, and pre-tokenization) and propose a new strategy that ensembles translate-train and translate-test predictions. Result: With optimized choices, word aligners perform on par with marker-based methods, and the new ensembling strategy substantially outperforms marker-based projection while reducing sensitivity to low-level design decisions. Conclusion: Optimizing word aligner design and adding ensembling makes cross-lingual transfer for token classification both more accurate and more robust. Abstract: Translation-based strategies for cross-lingual transfer (XLT) such as translate-train -- training on noisy target language data translated from the source language -- and translate-test -- evaluating on noisy source language data translated from the target language -- are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects of low-level design decisions on token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies to reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences. We find that all of these substantially impact translation-based XLT performance and show that, with optimized choices, XLT with WA offers performance at least comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms the marker-based projection. Crucially, we show that our proposed ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.
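
Label projection with a word aligner can be illustrated with a small sketch. It shows one possible span-projection rule (cover everything between the first and last aligned target token) and one possible filter (drop spans with no aligned token). Both are assumptions chosen for illustration and are exactly the kind of low-level decisions the paper compares; the alignment pairs are hand-written rather than produced by a real aligner.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end inclusive, label) over source tokens


def project_spans(
    spans: List[Span], alignment: Set[Tuple[int, int]], target_len: int
) -> List[str]:
    """Project labeled source spans onto a translated sentence as BIO tags."""
    tags = ["O"] * target_len
    for start, end, label in spans:
        # Target tokens aligned to any source token inside the span.
        hits = sorted(t for s, t in alignment if start <= s <= end)
        if not hits:
            continue  # Filtering choice: drop spans with no aligned target token.
        lo, hi = hits[0], hits[-1]
        tags[lo] = f"B-{label}"
        for t in range(lo + 1, hi + 1):
            tags[t] = f"I-{label}"
    return tags


# Source: "Angela Merkel visited Paris"
# Target: "Paris wurde von Angela Merkel besucht"
source_spans = [(0, 1, "PER"), (3, 3, "LOC")]
word_alignment = {(0, 3), (1, 4), (2, 5), (3, 0)}  # (source index, target index)
projected = project_spans(source_spans, word_alignment, target_len=6)
```

Even in this tiny example the word order changes between languages, which is why the projection rule and the filtering strategy matter.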

[101] Multi-Token Prediction Needs Registers

Anastasios Gerontopoulos,Spyros Gidaris,Nikos Komodakis

Main category: cs.CL

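TL;DR: MuToR is a multi-token prediction method that interleaves learnable register tokens into the input sequence; it is suited to supervised fine-tuning as well as pretraining and other settings.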

Details

Motivation: Multi-token prediction works well in pretraining, but its benefits are inconsistent in settings such as fine-tuning, so a more general approach is needed. Method: MuToR interleaves learnable register tokens into the input sequence, each tasked with predicting future targets, with only a negligible number of additional parameters and no architectural changes. Result: MuToR performs well on generative tasks in both language and vision domains across supervised fine-tuning, parameter-efficient fine-tuning, and pretraining. Conclusion: MuToR is a simple and effective multi-token prediction method that is compatible with existing models and supports scalable prediction horizons. Abstract: Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
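
The following sketch shows what interleaving register tokens with future targets might look like at the data level. It is an assumption-laden illustration: the register placement (one after each token), the single fixed offset, and the `<reg>` placeholder are hypothetical, and the attention masking that MuToR may need is not modeled.

```python
from typing import List, Optional, Tuple

REGISTER = "<reg>"  # hypothetical placeholder for a learnable register token


def interleave_registers(
    tokens: List[str], offset: int
) -> Tuple[List[str], List[Optional[str]]]:
    """Insert a register after each token; return the new inputs and per-position targets."""
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for i, tok in enumerate(tokens):
        inputs.append(tok)
        # Regular tokens keep the usual next-token target.
        targets.append(tokens[i + 1] if i + 1 < len(tokens) else None)
        if i + offset < len(tokens):
            inputs.append(REGISTER)
            # The register placed after token i predicts the token `offset` steps ahead.
            targets.append(tokens[i + offset])
    return inputs, targets


inputs, targets = interleave_registers(["a", "b", "c", "d"], offset=2)
```

Because regular positions still carry plain next-token targets, the base objective is unchanged, which matches the abstract's point about staying aligned with next-token pretraining.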

[102] WorldPM: Scaling Human Preference Modeling

Binghai Wang,Runji Lin,Keming Lu,Le Yu,Zhenru Zhang,Fei Huang,Chujie Zheng,Kai Dang,Yang Fan,Xingzhang Ren,An Yang,Binyuan Hui,Dayiheng Liu,Tao Gui,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Bowen Yu,Jingren Zhou,Junyang Lin

Main category: cs.CL

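TL;DR: The paper finds scaling laws in preference modeling similar to those in language modeling, proposes WorldPM, examines its behavior across several kinds of evaluation metrics, and shows that it is effective as a foundation for preference fine-tuning.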

Details

Motivation: Inspired by the power-law relation between test loss and model and dataset size in language modeling, the authors ask whether similar laws hold in preference modeling. Method: They collect preference data from public forums, train models ranging from 1.5B to 72B parameters, and analyze performance on different evaluation metrics. Result: Adversarial and objective metrics improve with scale, while subjective metrics show no scaling trend; WorldPM yields clear gains on multiple benchmarks. Conclusion: As a foundation for preference fine-tuning, WorldPM has broad potential, especially for improving generalization. Abstract: Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling. We propose World Preference Modeling (WorldPM) to emphasize this scaling potential, where World Preference embodies a unified representation of human preferences. In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters. We observe distinct patterns across different evaluation metrics: (1) Adversarial metrics (ability to identify deceptive features) consistently scale up with increased training data and base model size; (2) Objective metrics (objective knowledge with well-defined answers) show emergent behavior in larger language models, highlighting WorldPM's scalability potential; (3) Subjective metrics (subjective preferences from a limited number of humans or AI) do not demonstrate scaling trends. Further experiments validate the effectiveness of WorldPM as a foundation for preference fine-tuning. Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we observe significant improvements on both in-house and public evaluation sets, with notable gains of 4% to 8% in our in-house evaluations.

[103] Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

Zhiyuan Hu,Yibo Wang,Hanze Dong,Yuhui Xu,Amrita Saha,Caiming Xiong,Bryan Hooi,Junnan Li

Main category: cs.CL

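TL;DR: The paper improves the scalability and reliability of large reasoning models (LRMs) by explicitly aligning them with three meta-abilities (deduction, induction, and abduction) instead of relying on coincidental "aha moments".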

Details

Motivation: Large reasoning models do exhibit behaviors such as self-correction, backtracking, and verification, but their timing and consistency are unpredictable, which limits reliability and scalability. Method: A three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) aligns the meta-abilities using automatically generated, self-verifiable tasks. Result: Performance improves by over 10% relative to instruction-tuned baselines, and domain-specific reinforcement learning raises the performance ceiling by a further 2%. Conclusion: Explicit meta-ability alignment provides a scalable and dependable foundation for reasoning. Abstract: Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment

cs.RO [Back]

[104] FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

Jun Guo,Xiaojian Ma,Yikai Wang,Min Yang,Huaping Liu,Qing Li

Main category: cs.RO

TL;DR: FlowDreamer is an RGB-D world model based on 3D scene flow that uses an explicit motion representation to improve visual prediction for robot manipulation.

Details

Motivation: To improve visual world models for robot manipulation by handling dynamics prediction explicitly rather than implicitly, and thereby raise prediction accuracy. Method: FlowDreamer predicts 3D scene flow with a U-Net and then generates the future frame with a diffusion model; the whole system is trained end-to-end. Result: On 4 benchmarks, FlowDreamer outperforms baseline models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate. Conclusion: Explicit motion representations substantially improve RGB-D world models for robot manipulation tasks. Abstract: This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as explicit motion representations. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U-Net, and then a diffusion model predicts the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer achieves better performance compared to other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.

cs.SD [Back]

[105] LAV: Audio-Driven Dynamic Visual Generation with Neural Compression and StyleGAN2

Jongmin Jung,Dasaem Jeong

Main category: cs.SD

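TL;DR: LAV combines EnCodec's audio compression with StyleGAN2's generative capabilities to produce dynamic visual output driven by pre-recorded audio, using a latent-to-latent mapping that preserves semantic richness.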

Details

Motivation: To explore the potential of pretrained audio compression models for artistic and computational applications and to enable more nuanced audio-visual translation. Method: EnCodec embeddings serve as latent representations and are transformed directly into StyleGAN2's style latent space by a randomly initialized linear mapping. Result: The system produces nuanced and semantically coherent audio-visual translations, illustrating the advantage of mapping between latent spaces. Conclusion: LAV offers a new approach to audio-driven visual generation and shows how broadly pretrained models can be applied. Abstract: This paper introduces LAV (Latent Audio-Visual), a system that integrates EnCodec's neural audio compression with StyleGAN2's generative capabilities to produce visually dynamic outputs driven by pre-recorded audio. Unlike previous works that rely on explicit feature mappings, LAV uses EnCodec embeddings as latent representations, directly transformed into StyleGAN2's style latent space via a randomly initialized linear mapping. This approach preserves semantic richness in the transformation, enabling nuanced and semantically coherent audio-visual translations. The framework demonstrates the potential of using pretrained audio compression models for artistic and computational applications.

cs.IR [Back]

[106] A Survey on Large Language Models in Multimodal Recommender Systems

Alejo Lopez-Avila,Jinhua Du

Main category: cs.IR

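TL;DR: The paper surveys the use of large language models (LLMs) in multimodal recommender systems (MRS), discusses their advantages and challenges, and proposes a new taxonomy to organize the relevant techniques.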

Details

Motivation: To examine how LLMs can improve MRS through semantic reasoning, in-context learning, and dynamic input handling, while addressing the scalability and model accessibility challenges they introduce. Method: The survey reviews recent work, proposes a taxonomy, summarizes prompting strategies, fine-tuning methods, and data adaptation techniques, and identifies transferable techniques from related recommendation domains. Result: It clarifies the emerging role of LLMs in multimodal recommendation, gives an overview of evaluation metrics and datasets, and points to future research directions. Conclusion: The paper aims to support future research in this rapidly evolving field by providing a clear framework for applying LLMs in MRS. Abstract: Multimodal recommender systems (MRS) integrate heterogeneous user and item data, such as text, images, and structured information, to enhance recommendation performance. The emergence of large language models (LLMs) introduces new opportunities for MRS by enabling semantic reasoning, in-context learning, and dynamic input handling. Compared to earlier pre-trained language models (PLMs), LLMs offer greater flexibility and generalisation capabilities but also introduce challenges related to scalability and model accessibility. This survey presents a comprehensive review of recent work at the intersection of LLMs and MRS, focusing on prompting strategies, fine-tuning methods, and data adaptation techniques. We propose a novel taxonomy to characterise integration patterns, identify transferable techniques from related recommendation domains, provide an overview of evaluation metrics and datasets, and point to possible future directions. We aim to clarify the emerging role of LLMs in multimodal recommendation and support future research in this rapidly evolving field.

cs.CR [Back]

[107] PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization

Yidan Wang,Yanan Cao,Yubing Ren,Fang Fang,Zheng Lin,Binxing Fang

Main category: cs.CR

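TL;DR: The paper examines how effective jailbreak attacks are at extracting sensitive information from LLMs and proposes PIG, a new framework that identifies and extracts personally identifiable information (PII) and outperforms baseline attack methods.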

Details

Motivation: LLMs pose privacy risks, existing evaluation methods are easily blocked by well-aligned models, and the role of jailbreak attacks in privacy scenarios is underexplored. Method: The PIG framework identifies PII entities, builds a privacy context, and iteratively updates it with three gradient-based strategies to elicit the target PII. Result: Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art results. Conclusion: LLMs carry significant privacy risks and need stronger safeguards. Abstract: Large Language Models (LLMs) excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-aligned models can easily block. Meanwhile, jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains underexplored. In this paper, we examine the effectiveness of jailbreak attacks in extracting sensitive information, bridging privacy leakage and jailbreak attacks in LLMs. Moreover, we propose PIG, a novel framework targeting Personally Identifiable Information (PII) and addressing the limitations of current jailbreak methods. Specifically, PIG identifies PII entities and their types in privacy queries, uses in-context learning to build a privacy context, and iteratively updates it with three gradient-based strategies to elicit target PII. We evaluate PIG and existing jailbreak methods using two privacy-related datasets. Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art (SoTA) results. The results underscore significant privacy risks in LLMs, emphasizing the need for stronger safeguards. Our code is available at https://github.com/redwyd/PrivacyJailbreak.

eess.IV [Back]

[108] ImplicitStainer: Data-Efficient Medical Image Translation for Virtual Antibody-based Tissue Staining Using Local Implicit Functions

Tushar Kataria,Beatrice Knudsen,Shireen Y. Elhabian

Main category: eess.IV

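TL;DR: The paper proposes ImplicitStainer, a method that uses local implicit functions to improve image translation, specifically virtual staining; its pixel-level predictions reduce data requirements and improve generation quality.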

Details

Motivation: H&E staining is the gold standard for diagnosis in pathology but lacks molecular information. IHC staining provides complementary information, yet it is time-consuming to acquire and available only at specialized centers. Virtual staining generates IHC images with deep learning, but existing methods need large amounts of data and have limited effectiveness. Method: ImplicitStainer uses local implicit functions for pixel-level prediction, reducing data requirements and improving virtual staining quality. Result: The method is validated on two datasets and benchmarked against more than fifteen existing GAN and diffusion models, performing well particularly when data are limited. Conclusion: ImplicitStainer offers an efficient, data-frugal approach to virtual staining that could speed up pathological diagnosis. Code and models will be released publicly. Abstract: Hematoxylin and eosin (H&E) staining is a gold standard for microscopic diagnosis in pathology. However, H&E staining does not capture all the diagnostic information that may be needed. To obtain additional molecular information, immunohistochemical (IHC) stains highlight proteins that mark specific cell types, such as CD3 for T-cells or CK8/18 for epithelial cells. While IHC stains are vital for prognosis and treatment guidance, they are typically only available at specialized centers and time-consuming to acquire, leading to treatment delays for patients. Virtual staining, enabled by deep learning-based image translation models, provides a promising alternative by computationally generating IHC stains from H&E stained images. Although many GAN- and diffusion-based image-to-image (I2I) translation methods have been used for virtual staining, these models treat image patches as independent data points, which results in increased and more diverse data requirements for effective generation. We present ImplicitStainer, a novel approach that leverages local implicit functions to improve image translation, specifically virtual staining performance, by focusing on pixel-level predictions. This method enhances robustness to variations in dataset sizes, delivering high-quality results even with limited data. We validate our approach on two datasets using a comprehensive set of metrics and benchmark it against over fifteen state-of-the-art GAN- and diffusion-based models. Full code and trained models will be released publicly via GitHub upon acceptance.

[109] Ordered-subsets Multi-diffusion Model for Sparse-view CT Reconstruction

Pengfei Yu,Bin Huang,Minghui Zhang,Weiwen Wu,Shaoyu Wang,Qiegen Liu

Main category: eess.IV

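TL;DR: Proposes OSMM, an ordered-subsets multi-diffusion model for sparse-view CT reconstruction that improves detail reconstruction and robustness through subset-wise learning and a global constraint.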

Details

Motivation: Conventional diffusion models learn poorly in sparse-view CT reconstruction because of data redundancy, producing reconstructions that lack fine detail. Method: Splits the CT projection data into equal subsets, learns each subset independently with a multi-subsets diffusion model, and adds a global diffusion model trained on the complete data as a constraint. Result: OSMM outperforms conventional diffusion models in image quality and noise robustness. Conclusion: OSMM offers an efficient, robust solution for sparse-view CT. Abstract: Score-based diffusion models have shown significant promise in the field of sparse-view CT reconstruction. However, the projection dataset is large and riddled with redundancy. Consequently, applying the diffusion model to unprocessed data results in lower learning effectiveness and higher learning difficulty, frequently leading to reconstructed images that lack fine details. To address these issues, we propose the ordered-subsets multi-diffusion model (OSMM) for sparse-view CT reconstruction. The OSMM innovatively divides the CT projection data into equal subsets and employs multi-subsets diffusion model (MSDM) to learn from each subset independently. This targeted learning approach reduces complexity and enhances the reconstruction of fine details. Furthermore, the integration of one-whole diffusion model (OWDM) with complete sinogram data acts as a global information constraint, which can reduce the possibility of generating erroneous or inconsistent sinogram information. Moreover, the OSMM's unsupervised learning framework provides strong robustness and generalizability, adapting seamlessly to varying sparsity levels of CT sinograms. This ensures consistent and reliable performance across different clinical scenarios. Experimental results demonstrate that OSMM outperforms traditional diffusion models in terms of image quality and noise resilience, offering a powerful and versatile solution for advanced CT imaging in sparse-view scenarios.
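The subset split described in the abstract can be illustrated with a small sketch. It assumes the interleaved partition that is conventional in ordered-subsets reconstruction; the abstract only says the projection data is divided into equal subsets, so the interleaving and the name `ordered_subsets` are illustrative assumptions, not the authors' implementation.

```python
def ordered_subsets(num_views, num_subsets):
    """Split view indices 0..num_views-1 into equally sized, interleaved subsets.

    Assumption: interleaving (every num_subsets-th view) is used so that each
    subset still covers the full angular range.
    """
    if num_views % num_subsets != 0:
        raise ValueError("num_views must be divisible by num_subsets")
    return [list(range(start, num_views, num_subsets))
            for start in range(num_subsets)]


# Example: 12 projection views split into 3 subsets of 4 views each.
subsets = ordered_subsets(num_views=12, num_subsets=3)
for index, views in enumerate(subsets):
    print(f"subset {index}: views {views}")
```

Each subset would then feed its own model, with a model over all views acting as the global constraint the abstract describes.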

[110] Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding

Jianhao Huang,Qunsong Zeng,Kaibin Huang

Main category: eess.IV

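TL;DR: Proposes a hybrid generative semantic communication system that transmits both text prompts and critical features, addressing the loss of detail in purely prompt-driven generation, and designs the GVIF metric to evaluate visual quality.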

Details

Motivation: Purely prompt-driven generative semantic communication loses detail, and metrics for evaluating system performance are lacking. Method: Develops the CIE framework, which combines text prompts with transmission of critical features and reconstructs images with a diffusion model, and proposes the GVIF metric to quantify visual quality. Result: The GVIF metric is sensitive to visual fidelity, and the optimized system outperforms benchmark schemes on PSNR and FID. Conclusion: The hybrid Gen-SemCom system and the GVIF metric effectively improve communication performance and visual quality. Abstract: Generative semantic communication (Gen-SemCom) with large artificial intelligence (AI) models promises a transformative paradigm for 6G networks, which reduces communication costs by transmitting low-dimensional prompts rather than raw data. However, purely prompt-driven generation loses fine-grained visual details. Additionally, there is a lack of systematic metrics to evaluate the performance of Gen-SemCom systems. To address these issues, we develop a hybrid Gen-SemCom system with a critical information embedding (CIE) framework, where both text prompts and semantically critical features are extracted for transmissions. First, a novel approach of semantic filtering is proposed to select and transmit the semantically critical features of images relevant to semantic label. By integrating the text prompt and critical features, the receiver reconstructs high-fidelity images using a diffusion-based generative model. Next, we propose the generative visual information fidelity (GVIF) metric to evaluate the visual quality of the generated image. By characterizing the statistical models of image features, the GVIF metric quantifies the mutual information between the distorted features and their original counterparts. By maximizing the GVIF metric, we design a channel-adaptive Gen-SemCom system that adaptively controls the volume of features and compression rate according to the channel state. Experimental results validate the GVIF metric's sensitivity to visual fidelity, correlating with both the PSNR and critical information volume. In addition, the optimized system achieves superior performance over benchmarking schemes in terms of higher PSNR and lower FID scores.

[111] HWA-UNETR: Hierarchical Window Aggregate UNETR for 3D Multimodal Gastric Lesion Segmentation

Jiaming Liang,Lihuan Dai,Xiaoqi Sheng,Xiangguang Chen,Chun Yao,Guihua Tao,Qibin Leng,Honming Cai,Xi Zhong

Main category: eess.IV

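TL;DR: The paper proposes HWA-UNETR, a new multimodal medical image segmentation method, and releases GCM 2025, the first large-scale open-source multimodal MRI dataset for gastric cancer, addressing data scarcity and modality alignment.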

Details

Motivation: Multimodal medical image segmentation for gastric cancer lesion analysis faces data scarcity and modality misalignment, which constrain algorithm training and waste resources. Method: Proposes the HWA-UNETR framework, which uses learnable window aggregation layers (the HWA block) to dynamically align multimodal features and a tri-orientated fusion mechanism to model long-range spatial dependencies. Result: Validated on the GCM 2025 and BraTS 2021 datasets, it improves the Dice score over existing methods by up to 1.68% while remaining robust. Conclusion: HWA-UNETR and the GCM 2025 dataset provide an effective solution for multimodal gastric cancer image segmentation; the code and dataset are open source. Abstract: Multimodal medical image segmentation faces significant challenges in the context of gastric cancer lesion analysis. This clinical context is defined by the scarcity of independent multimodal datasets and the imperative to amalgamate inherently misaligned modalities. As a result, algorithms are constrained to train on approximate data and depend on application migration, leading to substantial resource expenditure and a potential decline in analysis accuracy. To address those challenges, we have made two major contributions: First, we publicly disseminate the GCM 2025 dataset, which serves as the first large-scale, open-source collection of gastric cancer multimodal MRI scans, featuring professionally annotated FS-T2W, CE-T1W, and ADC images from 500 patients. Second, we introduce HWA-UNETR, a novel 3D segmentation framework that employs an original HWA block with learnable window aggregation layers to establish dynamic feature correspondences between different modalities' anatomical structures, and leverages the innovative tri-orientated fusion mamba mechanism for context modeling and capturing long-range spatial dependencies. Extensive experiments on our GCM 2025 dataset and the public BraTS 2021 dataset validate the performance of our framework, demonstrating that the new approach surpasses existing methods by up to 1.68% in the Dice score while maintaining solid robustness. The dataset and code are public via https://github.com/JeMing-creater/HWA-UNETR.
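For readers unfamiliar with the Dice score reported above, here is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), on flat binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's evaluation code may differ.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# One overlapping voxel, two foreground voxels in each mask.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
print(example)
```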

[112] Multi-contrast laser endoscopy for in vivo gastrointestinal imaging

Taylor L. Bobrow,Mayank Golhar,Suchapa Arayakarnkul,Anthony A. Song,Saowanee Ngamruengphong,Nicholas J. Durr

Main category: eess.IV

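TL;DR: Multi-contrast Laser Endoscopy (MLE) enhances gastrointestinal disease detection with multispectral, coherent, and directional illumination, achieving markedly better contrast and color difference than white light and narrow band imaging.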

Details

Motivation: White light endoscopy offers insufficient contrast for detecting gastrointestinal disease, so many cases are missed. Method: The MLE platform combines multispectral diffuse reflectance, laser speckle contrast imaging, and photometric stereo to enhance tissue contrast. Result: Across 31 polyps, MLE showed roughly a three-fold improvement in contrast and a five-fold improvement in color difference. Conclusion: MLE shows promise as an investigative tool for improving gastrointestinal imaging. Abstract: White light endoscopy is the clinical gold standard for detecting diseases in the gastrointestinal tract. Most applications involve identifying visual abnormalities in tissue color, texture, and shape. Unfortunately, the contrast of these features is often subtle, causing many clinically relevant cases to go undetected. To overcome this challenge, we introduce Multi-contrast Laser Endoscopy (MLE): a platform for widefield clinical imaging with rapidly tunable spectral, coherent, and directional illumination. We demonstrate three capabilities of MLE: enhancing tissue chromophore contrast with multispectral diffuse reflectance, quantifying blood flow using laser speckle contrast imaging, and characterizing mucosal topography using photometric stereo. We validate MLE with benchtop models, then demonstrate MLE in vivo during clinical colonoscopies. MLE images from 31 polyps demonstrate an approximate three-fold improvement in contrast and a five-fold improvement in color difference compared to white light and narrow band imaging. With the ability to reveal multiple complementary types of tissue contrast while seamlessly integrating into the clinical environment, MLE shows promise as an investigative tool to improve gastrointestinal imaging.

cs.HC [Back]

[113] CartoAgent: a multimodal large language model-powered multi-agent cartographic framework for map style transfer and evaluation

Chenglong Wang,Yuhao Kang,Zhaoya Gong,Pengjun Zhao,Yu Feng,Wenjia Zhang,Ge Li

Main category: cs.HC

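TL;DR: CartoAgent is a multi-agent cartographic framework powered by multimodal large language models (MLLMs) that simulates three stages of cartographic practice (preparation, design, and evaluation) to generate maps that are both visually appealing and informative.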

Details

Motivation: The rapid development of generative artificial intelligence (GenAI) opens new opportunities for cartography, but previous studies either overlooked the artistic aspects of maps or struggled to make maps both accurate and informative. Method: CartoAgent uses MLLMs' visual aesthetic capability and world knowledge; by separating style from geographic data, it designs stylesheets without modifying the vector data, ensuring geographic accuracy. Result: The framework's effectiveness was validated on map style transfer and evaluation tasks through experiments and a human evaluation study. Conclusion: CartoAgent can be extended to support a variety of cartographic design decisions and can inform future integration of GenAI in cartography. Abstract: The rapid development of generative artificial intelligence (GenAI) presents new opportunities to advance the cartographic process. Previous studies have either overlooked the artistic aspects of maps or faced challenges in creating both accurate and informative maps. In this study, we propose CartoAgent, a novel multi-agent cartographic framework powered by multimodal large language models (MLLMs). This framework simulates three key stages in cartographic practice: preparation, map design, and evaluation. At each stage, different MLLMs act as agents with distinct roles to collaborate, discuss, and utilize tools for specific purposes. In particular, CartoAgent leverages MLLMs' visual aesthetic capability and world knowledge to generate maps that are both visually appealing and informative. By separating style from geographic data, it can focus on designing stylesheets without modifying the vector-based data, thereby ensuring geographic accuracy. We applied CartoAgent to a specific task centered on map restyling, namely map style transfer and evaluation. The effectiveness of this framework was validated through extensive experiments and a human evaluation study. CartoAgent can be extended to support a variety of cartographic design decisions and inform future integrations of GenAI in cartography.

[114] Visual Feedback of Pattern Separability Improves Myoelectric Decoding Performance of Upper Limb Prostheses

Ruichen Yang,György M. Lévay,Christopher L. Hunt,Dániel Czeiner,Megan C. Hodgson,Damini Agarwal,Rahul R. Kaliki,Nitish V. Thakor

Main category: cs.HC

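TL;DR: The paper presents the Reviewer, a 3D visual interface for improving pattern recognition control of myoelectric prostheses, which uses real-time feedback to improve the interaction between user and decoder.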

Details

Motivation: As prosthesis movements grow more complex, users struggle to produce sufficiently distinct EMG patterns, and existing training relies on inefficient trial-and-error adjustment. Method: A 10-session study compared training with the Reviewer against conventional virtual arm visual feedback, using metrics including completion rate and path efficiency. Result: Participants using the Reviewer performed better, with higher completion rates, better path efficiency, and less overshoot. Conclusion: 3D visual feedback significantly improved pattern recognition control in novice operators, reduced reliance on trial-and-error adjustment, and enabled feedback-driven adaptive training. Abstract: State-of-the-art upper limb myoelectric prostheses often use pattern recognition (PR) control systems that translate electromyography (EMG) signals into desired movements. As prosthesis movement complexity increases, users often struggle to produce sufficiently distinct EMG patterns for reliable classification. Existing training typically involves heuristic, trial-and-error user adjustments to static decoder boundaries. Goal: We introduce the Reviewer, a 3D visual interface projecting EMG signals directly into the decoder's classification space, providing intuitive, real-time insight into PR algorithm behavior. This structured feedback reduces cognitive load and fosters mutual, data-driven adaptation between user-generated EMG patterns and decoder boundaries. Methods: A 10-session study with 12 able-bodied participants compared PR performance after motor-based training and updating using the Reviewer versus conventional virtual arm visualization. Performance was assessed using a Fitts's law task that involved the aperture of the cursor and the control of orientation. Results: Participants trained with the Reviewer achieved higher completion rates, reduced overshoot, and improved path efficiency and throughput compared to the standard visualization group. Significance: The Reviewer introduces decoder-informed motor training, facilitating immediate and consistent PR-based myoelectric control improvements. By iteratively refining control through real-time feedback, this approach reduces reliance on trial-and-error recalibration, enabling a more adaptive, self-correcting training framework.
Conclusion: The 3D visual feedback significantly improves PR control in novice operators through structured training, enabling feedback-driven adaptation and reducing reliance on extensive heuristic adjustments.

[115] SOS: A Shuffle Order Strategy for Data Augmentation in Industrial Human Activity Recognition

Anh Tuan Ha,Hoang Khang Phan,Thai Minh Tien Ngo,Anh Phan Truong,Nhat Tan Le

Main category: cs.HC

TL;DR: 论文提出了一种通过深度学习方法（注意力自编码器和条件生成对抗网络）生成高质量HAR数据集的方法，并通过随机序列策略提高分类性能。

Details

Motivation: Addresses the difficulty of obtaining high-quality, diverse data in HAR and the problem of data heterogeneity. Method: Generates a dataset with an attention autoencoder and conditional generative adversarial networks, and shuffles the data with a random sequence strategy to homogenize the distribution. Result: The random sequence strategy significantly improves classification performance, reaching an accuracy of up to 0.70±0.03 and a macro F1 score of 0.64±0.01. Conclusion: The approach broadens the effective training dataset and offers directions for improving HAR systems in complex real-world scenarios. Abstract: In the realm of Human Activity Recognition (HAR), obtaining high-quality, varied data is still a persistent challenge due to high costs and the inherent variability of real-world activities. This study introduces a dataset generated with deep learning approaches (Attention Autoencoder and conditional Generative Adversarial Networks). Data heterogeneity is another critical challenge; one solution is to shuffle the data to homogenize the distribution. Experimental results demonstrate that the random sequence strategy significantly improves classification performance, achieving an accuracy of up to 0.70 $\pm$ 0.03 and a macro F1 score of 0.64 $\pm$ 0.01. Disrupting temporal dependencies through random sequence reordering compels the model to focus on instantaneous recognition, thereby improving robustness against activity transitions. This approach not only broadens the effective training dataset but also offers promising avenues for enhancing HAR systems in complex, real-world scenarios.
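The random sequence reordering mentioned in the abstract can be sketched as a shuffle over sensor windows that keeps each window paired with its label. This is only an illustration under assumed data shapes (a list of windows and a parallel list of labels); the function name `shuffle_order` and the example data are hypothetical and not taken from the paper.

```python
import random


def shuffle_order(windows, labels, seed=None):
    """Reorder windows at random while keeping each window aligned with its label."""
    if len(windows) != len(labels):
        raise ValueError("windows and labels must have the same length")
    order = list(range(len(windows)))
    random.Random(seed).shuffle(order)  # seeded generator for reproducibility
    return [windows[i] for i in order], [labels[i] for i in order]


# Hypothetical data: four short sensor windows with activity labels.
windows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
labels = ["walk", "walk", "lift", "lift"]
shuffled_windows, shuffled_labels = shuffle_order(windows, labels, seed=0)
```

Shuffling at the window level removes the temporal order between windows but leaves the samples inside each window untouched.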

cs.LG [Back]

[116] Predictability Shapes Adaptation: An Evolutionary Perspective on Modes of Learning in Transformers

Alexander Y. Ku,Thomas L. Griffiths,Stephanie C. Y. Chan

Main category: cs.LG

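TL;DR: The paper examines two learning modes in Transformers (IWL and ICL) and their analogy to evolutionary biology, studies how environmental predictability affects the choice of learning mode, and reveals task-dependent learning dynamics.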

Details

Motivation: To understand the interplay between IWL and ICL in Transformers, and their analogy to genetic encoding and phenotypic plasticity in evolutionary biology, in order to explore how environmental predictability shapes the choice of learning mode. Method: Operationalizes environmental stability and cue reliability in regression and classification experiments and systematically studies their effect on the IWL/ICL balance. Result: High environmental stability strongly favors IWL, while high cue reliability enhances ICL; learning dynamics show task-dependent temporal evolution, supporting a relative-cost hypothesis. Conclusion: Predictability is a key factor in learning mode selection in Transformers, and the findings offer new perspectives for understanding ICL and improving training methods. Abstract: Transformer models learn in two distinct modes: in-weights learning (IWL), encoding knowledge into model weights, and in-context learning (ICL), adapting flexibly to context without weight modification. To better understand the interplay between these learning modes, we draw inspiration from evolutionary biology's analogous adaptive strategies: genetic encoding (akin to IWL, adapting over generations and fixed within an individual's lifetime) and phenotypic plasticity (akin to ICL, enabling flexible behavioral responses to environmental cues). In evolutionary biology, environmental predictability dictates the balance between these strategies: stability favors genetic encoding, while reliable predictive cues promote phenotypic plasticity. We experimentally operationalize these dimensions of predictability and systematically investigate their influence on the ICL/IWL balance in Transformers. Using regression and classification tasks, we show that high environmental stability decisively favors IWL, as predicted, with a sharp transition at maximal stability. Conversely, high cue reliability enhances ICL efficacy, particularly when stability is low. Furthermore, learning dynamics reveal task-contingent temporal evolution: while a canonical ICL-to-IWL shift occurs in some settings (e.g., classification with many classes), we demonstrate that scenarios with easier IWL (e.g., fewer classes) or slower ICL acquisition (e.g., regression) can exhibit an initial IWL phase later yielding to ICL dominance.
These findings support a relative-cost hypothesis for explaining these learning mode transitions, establishing predictability as a critical factor governing adaptive strategies in Transformers, and offering novel insights for understanding ICL and guiding training methodologies.

[117] Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Tasks

Ziyuan Zhang,Darcy Wang,Ningyuan Chen,Rodrigo Mansur,Vahid Sarhangian

Main category: cs.LG

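TL;DR: The study compares the exploration-exploitation strategies of large language models (LLMs) and humans in multi-armed bandit tasks, finding that reasoning makes LLM behavior more human-like but that LLMs adapt less well in complex environments.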

Details

Motivation: To examine whether LLMs behave like humans in dynamic decision-making tasks and to evaluate their performance. Method: Uses multi-armed bandit tasks and interpretable choice models to compare the exploration-exploitation strategies of LLMs, humans, and algorithms, and analyzes the effect of reasoning. Result: Reasoning brings LLMs closer to human behavior, but they adapt less well in complex environments, especially in directed exploration. Conclusion: LLMs show potential for simulating human behavior and automating decisions, but their adaptability in complex environments needs improvement. Abstract: Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making tasks. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) tasks introduced in the cognitive science and psychiatry literature to conduct a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the E&E strategies of the agents and investigate how explicit reasoning, through both prompting strategies and reasoning-enhanced models, shapes LLM decision-making. We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration. In simple stationary tasks, reasoning-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making and point to potential areas of improvements.
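To make the distinction between random and directed exploration concrete, the sketch below uses a generic softmax choice rule with an uncertainty bonus. It is an assumed, textbook-style formulation rather than the specific choice model fitted in the paper: `bonus_weight` stands in for directed exploration (preferring uncertain arms) and `temperature` for random exploration (choice noise). All names are hypothetical.

```python
import math


def choice_probabilities(means, uncertainties, bonus_weight, temperature):
    """Softmax over (estimated mean + bonus_weight * uncertainty) / temperature."""
    scores = [(m + bonus_weight * u) / temperature
              for m, u in zip(means, uncertainties)]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two arms with equal estimated reward; the second arm is less well known.
probs = choice_probabilities(means=[0.5, 0.5], uncertainties=[0.1, 0.4],
                             bonus_weight=1.0, temperature=0.2)
print(probs)
```

With a positive `bonus_weight` the less-explored arm is chosen more often; raising `temperature` pushes the probabilities toward uniform regardless of the estimates.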

[118] Advanced Crash Causation Analysis for Freeway Safety: A Large Language Model Approach to Identifying Key Contributing Factors

Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Samgyu Yang,Abdulrahman Faden

Main category: cs.LG

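TL;DR: The study uses an LLM (Llama3 8B) to analyze freeway crash data and identify crash causes through zero-shot classification; it validates the model's effectiveness at identifying primary factors such as alcohol-impaired driving and speeding and demonstrates its practical potential for traffic safety.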

Details

Motivation: Traditional statistical methods and machine learning models struggle to capture the interactions among the complex factors in crashes, so the study explores LLMs for more comprehensive analysis. Method: Fine-tunes Llama3 8B with QLoRA on a dataset compiled from 226 studies and identifies crash causes through zero-shot classification. Result: The model effectively identifies primary crash causes such as alcohol-impaired driving and speeding, with 88.89% agreement in a questionnaire of traffic safety researchers. Conclusion: LLMs offer a new tool for comprehensive crash analysis and provide policymakers with useful insights and potential countermeasures. Abstract: Understanding the factors contributing to traffic crashes and developing strategies to mitigate their severity is essential. Traditional statistical methods and machine learning models often struggle to capture the complex interactions between various factors and the unique characteristics of each crash. This research leverages a large language model (LLM) to analyze freeway crash data and provide crash causation analysis accordingly. By compiling 226 traffic safety studies related to freeway crashes, a training dataset encompassing environmental, driver, traffic, and geometric design factors was created. The Llama3 8B model was fine-tuned using QLoRA to enhance its understanding of freeway crashes and their contributing factors, as covered in these studies. The fine-tuned Llama3 8B model was then used to identify crash causation without pre-labeled data through zero-shot classification, providing comprehensive explanations to ensure that the identified causes were reasonable and aligned with existing research. Results demonstrate that LLMs effectively identify primary crash causes such as alcohol-impaired driving, speeding, aggressive driving, and driver inattention. Incorporating event data, such as road maintenance, offers more profound insights. The model's practical applicability and potential to improve traffic safety measures were validated by a high level of agreement among researchers in the field of traffic safety, as reflected in questionnaire results (88.89%). This research highlights the complex nature of traffic crashes and how LLMs can be used for comprehensive analysis of crash causation and other contributing factors.
Moreover, it provides valuable insights and potential countermeasures to aid planners and policymakers in developing more effective and efficient traffic safety practices.

[119] Learning Virtual Machine Scheduling in Cloud Computing through Language Agents

JieHao Wu,Ziwei Wang,Junjie Sheng,Wenhao Li,Xiangfei Wang,Jun Luo

Main category: cs.LG

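TL;DR: Proposes MiCo, a hierarchical language agent framework that uses large language models (LLMs) to design heuristics for the Online Dynamic Multidimensional Bin Packing (ODMBP) problem in cloud services.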

Details

Motivation: Traditional optimization methods struggle to adapt to real-time changes, heuristics have rigid strategies, and existing learning-based methods lack generalizability and interpretability. Method: Formulates ODMBP as a Semi-Markov Decision Process with Options (SMDP-Option) and uses a two-stage architecture (Option Miner and Option Composer) in which LLMs generate and compose strategies. Result: In scenarios with more than 10,000 virtual machines, MiCo achieves a 96.9% competitive ratio and maintains high performance under nonstationary request flows and diverse configurations. Conclusion: MiCo performs well in complex, large-scale cloud environments, validating its effectiveness. Abstract: In cloud services, virtual machine (VM) scheduling is a typical Online Dynamic Multidimensional Bin Packing (ODMBP) problem, characterized by large-scale complexity and fluctuating demands. Traditional optimization methods struggle to adapt to real-time changes, domain-expert-designed heuristic approaches suffer from rigid strategies, and existing learning-based methods often lack generalizability and interpretability. To address these limitations, this paper proposes a hierarchical language agent framework named MiCo, which provides a large language model (LLM)-driven heuristic design paradigm for solving ODMBP. Specifically, ODMBP is formulated as a Semi-Markov Decision Process with Options (SMDP-Option), enabling dynamic scheduling through a two-stage architecture, i.e., Option Miner and Option Composer. Option Miner utilizes LLMs to discover diverse and useful non-context-aware strategies by interacting with constructed environments. Option Composer employs LLMs to discover a composing strategy that integrates the non-context-aware strategies with the contextual ones. Extensive experiments on real-world enterprise datasets demonstrate that MiCo achieves a 96.9% competitive ratio in large-scale scenarios involving more than 10,000 virtual machines. It maintains high performance even under nonstationary request flows and diverse configurations, thus validating its effectiveness in complex and large-scale cloud environments.
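As background for the bin packing framing, the sketch below shows the kind of fixed, hand-written heuristic that the abstract contrasts with LLM-designed strategies: online first-fit over hosts with several resource dimensions. It is a simplified illustration that ignores VM departures (the "dynamic" part of ODMBP) and is not part of MiCo; the function name and the example capacities are assumptions.

```python
def first_fit(requests, capacity):
    """Place each request on the first host with room in every resource dimension.

    requests: iterable of tuples of resource demands, e.g. (cpu, memory_gb).
    capacity: tuple with the capacity of one (identical) host.
    Returns the host index chosen for each request and the number of hosts used.
    """
    hosts = []      # remaining capacity of each opened host
    placement = []  # host index assigned to each request, in arrival order
    for demand in requests:
        if any(d > c for d, c in zip(demand, capacity)):
            raise ValueError(f"request {demand} exceeds host capacity")
        for index, free in enumerate(hosts):
            if all(d <= f for d, f in zip(demand, free)):
                hosts[index] = tuple(f - d for f, d in zip(free, demand))
                placement.append(index)
                break
        else:
            # No opened host fits, so open a new one.
            hosts.append(tuple(c - d for c, d in zip(capacity, demand)))
            placement.append(len(hosts) - 1)
    return placement, len(hosts)


# Hypothetical hosts with 8 CPUs and 16 GB of memory each.
placement, hosts_used = first_fit([(4, 8), (4, 4), (2, 8), (4, 4)], capacity=(8, 16))
print(placement, hosts_used)
```

In this example the third request does not fit on the first host because its CPUs are exhausted even though memory remains, which is the multidimensional effect that makes the problem harder than one-dimensional packing.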

[120] ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention

Jintian Shao,Hongyi Huang,Jiayi Wu,Beiwen Zhang,ZhiYu Wu,You Shan,MingKai Zheng

Main category: cs.LG

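TL;DR: ComplexFormer proposes CMHA, a new multi-head attention mechanism that models semantic and positional differences in a unified way in the complex plane, significantly improving model performance.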

Details

Motivation: Conventional Transformers struggle to unify semantic and positional information in multi-head attention, which limits representational capacity. Method: Introduces CMHA, comprising a per-head Euler transformation and a per-head adaptive differential rotation mechanism, so that each head independently models semantic and positional differences. Result: Strong performance on language modeling, text generation, and other tasks, with lower perplexity and better long-context coherence. Conclusion: ComplexFormer offers a more flexible and efficient attention mechanism that outperforms existing methods. Abstract: Transformer models rely on self-attention to capture token dependencies but face challenges in effectively integrating positional information while allowing multi-head attention (MHA) flexibility. Prior methods often model semantic and positional differences disparately or apply uniform positional adjustments across heads, potentially limiting representational capacity. This paper introduces ComplexFormer, featuring Complex Multi-Head Attention (CMHA). CMHA empowers each head to independently model semantic and positional differences unified within the complex plane, representing interactions as rotations and scaling. ComplexFormer incorporates two key improvements: (1) a per-head Euler transformation, converting real-valued query/key projections into polar-form complex vectors for head-specific complex subspace operation; and (2) a per-head adaptive differential rotation mechanism, exp[i(Adapt(ASmn,i) + Delta(Pmn),i)], allowing each head to learn distinct strategies for integrating semantic angle differences (ASmn,i) with relative positional encodings (Delta(Pmn),i). Extensive experiments on language modeling, text generation, code generation, and mathematical reasoning show ComplexFormer achieves superior performance, significantly lower generation perplexity, and improved long-context coherence compared to strong baselines like RoPE-Transformers. ComplexFormer demonstrates strong parameter efficiency, offering a more expressive, adaptable attention mechanism.

[121] Superposition Yields Robust Neural Scaling

Yizhou liu,Ziming Liu,Jeff Gore

Main category: cs.LG

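TL;DR: The paper studies neural scaling laws, under which the performance of large language models (LLMs) improves with model size, and finds that representation superposition is a key mechanism behind them.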

Details

Motivation: To explore the origin of neural scaling laws, that is, why larger models perform better. Method: Builds a toy model from two empirical principles, studies how loss varies with model size, and analyzes open-source LLMs. Result: Under weak superposition the loss depends on feature frequency; under strong superposition the loss is inversely proportional to model dimension, and open-source LLMs match the strong-superposition predictions. Conclusion: Representation superposition is an important mechanism behind neural scaling laws; future training strategies and architectures could exploit it to reduce compute and parameter requirements. Abstract: The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law -- the finding that loss decreases as a power law with model size -- remains unclear. Starting from two empirical principles -- that LLMs represent more things than the model dimensions (widths) they have (i.e., representations are superposed), and that words or concepts in language occur with varying frequencies -- we constructed a toy model to study the loss scaling with model size. We found that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency; if feature frequencies follow a power law, so does the loss. In contrast, under strong superposition, where all features are represented but overlap with each other, the loss becomes inversely proportional to the model dimension across a wide range of feature frequency distributions. This robust scaling behavior is explained geometrically: when many more vectors are packed into a lower dimensional space, the interference (squared overlaps) between vectors scales inversely with that dimension. We then analyzed four families of open-sourced LLMs and found that they exhibit strong superposition and quantitatively match the predictions of our toy model. The Chinchilla scaling law turned out to also agree with our results. We conclude that representation superposition is an important mechanism underlying the observed neural scaling laws. We anticipate that these insights will inspire new training strategies and model architectures to achieve better performance with less computation and fewer parameters.
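The geometric statement in the abstract, that squared overlaps between many vectors packed into a lower-dimensional space scale inversely with the dimension, can be checked numerically with random unit vectors, for which the expected squared dot product is exactly 1/m in dimension m. This is an illustrative sketch, not the authors' toy model: random directions stand in for learned feature vectors, and all names are hypothetical.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_overlap(num_vectors, dim, seed=0):
    """Average squared dot product over all distinct pairs of random unit vectors."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(num_vectors)]
    total, pairs = 0.0, 0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            total += dot * dot
            pairs += 1
    return total / pairs


# 200 vectors packed into 8 and 32 dimensions: dim * overlap should be near 1.
for dim in (8, 32):
    overlap = mean_squared_overlap(num_vectors=200, dim=dim)
    print(dim, overlap, dim * overlap)
```

Quadrupling the dimension from 8 to 32 cuts the mean squared overlap by roughly a factor of four, which is the 1/m behavior the abstract ties to the loss under strong superposition.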

[122] Parallel Scaling Law for Language Models

Mouxiang Chen,Binyuan Hui,Zeyu Cui,Jiaxi Yang,Dayiheng Liu,Jianling Sun,Junyang Lin,Zhongxin Liu

Main category: cs.LG

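TL;DR: The paper proposes ParScale, a new way to scale parallel computation that processes inputs in parallel and dynamically aggregates the outputs, significantly improving inference efficiency while reducing memory and latency overhead.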

Details

Motivation: Language models are usually scaled by adding parameters or output tokens, which incurs significant space or time costs. The paper seeks a more efficient scaling paradigm. Method: Introduces parallel scaling (ParScale), which processes transformed inputs in parallel and dynamically aggregates the outputs, and validates a new scaling law. Result: For the same performance improvement, ParScale needs up to 22× less memory increase and 6× less latency increase than parameter scaling, and an off-the-shelf model can be converted by post-training on a small number of tokens. Conclusion: ParScale offers a new route to deploying more powerful models in low-resource settings and an alternative view of the role of computation in machine learning. Abstract: It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.
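The three steps named in the abstract (transform the input $P$ ways, run the same model on each, aggregate the $P$ outputs with dynamic weights) can be sketched on a scalar toy model. Everything below is an assumption for illustration: the real method uses learnable transformations and a learned aggregation inside a language model, whereas here the transforms, the scoring function, and the model are hypothetical plain functions, and the streams run sequentially.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parallel_forward(model, transforms, score, x):
    """Run one shared model on P transformed inputs and combine the outputs.

    model: the shared function whose parameters are reused by every stream.
    transforms: list of P input transformations (hypothetical stand-ins).
    score: maps each output to a scalar; softmax of the scores gives the weights.
    """
    outputs = [model(t(x)) for t in transforms]  # P streams, run one by one here
    weights = softmax([score(o) for o in outputs])
    combined = sum(w * o for w, o in zip(weights, outputs))
    return combined, weights


# Toy setup: a quadratic "model", two input transforms, and uniform scoring.
result, weights = parallel_forward(
    model=lambda v: v * v,
    transforms=[lambda v: v, lambda v: v + 1.0],
    score=lambda o: 0.0,
    x=2.0,
)
print(result, weights)
```

The point of the design is that the streams share one set of parameters, so adding streams increases computation without adding a second copy of the model.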

Vibha Belavadi,Tushar Vatsa,Dewang Sultania,Suhas Suresha,Ishita Verma,Cheng Chen,Tracy Holloway King,Michael Friedrich

Main category: cs.LG

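TL;DR: Proposes a router-based architecture for generating high-quality synthetic data to fine-tune large language models (LLMs), addressing the shortage of real user data and significantly improving the accuracy of function classification and API parameter selection.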

Details

Motivation: In digital content creation tools, users express their needs through natural language queries, but the lack of real task data and privacy constraints make synthetic data generation necessary. Existing methods fall short in diversity and complexity and fail to reproduce real data distributions. Method: Uses a router-based architecture that combines domain resources (such as content metadata and knowledge graphs) with text-to-text and vision-to-text language models to generate high-quality synthetic data. Result: Evaluation on real user queries shows significant improvements in function classification accuracy and API parameter selection; the fine-tuned models outperform traditional approaches. Conclusion: The method addresses the limitations of traditional synthetic data generation and sets new benchmarks for function calling tasks. Abstract: This paper addresses fine-tuning Large Language Models (LLMs) for function calling tasks when real user interaction data is unavailable. In digital content creation tools, where users express their needs through natural language queries that must be mapped to API calls, the lack of real-world task-specific data and privacy constraints for training on it necessitate synthetic data generation. Existing approaches to synthetic data generation fall short in diversity and complexity, failing to replicate real-world data distributions and leading to suboptimal performance after LLM fine-tuning. We present a novel router-based architecture that leverages domain resources like content metadata and structured knowledge graphs, along with text-to-text and vision-to-text language models to generate high-quality synthetic training data. Our architecture's flexible routing mechanism enables synthetic data generation that matches observed real-world distributions, addressing a fundamental limitation of traditional approaches. Evaluation on a comprehensive set of real user queries demonstrates significant improvements in both function classification accuracy and API parameter selection. Models fine-tuned with our synthetic data consistently outperform traditional approaches, establishing new benchmarks for function calling tasks.

[124] MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models

Mugilan Ganesan,Shane Segal,Ankur Aggarwal,Nish Sinnadurai,Sean Lie,Vithursan Thangarasa

Main category: cs.LG

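TL;DR: MASSV uses a two-phase approach to turn small language models into efficient multimodal draft models, significantly accelerating inference for vision-language models.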

Details

Motivation: Existing small language models cannot process visual inputs, and their predictions do not match those of vision-language models; both problems must be solved to improve inference efficiency. Method: MASSV connects the target model's vision encoder to the draft model through a lightweight projector, then aligns predictions with self-distilled visual instruction tuning on responses generated by the target model. Result: Experiments show MASSV increases accepted length by up to 30% and speeds up inference by up to 1.46x. Conclusion: MASSV provides a scalable, compatible method for accelerating current and future vision-language models. Abstract: Speculative decoding significantly accelerates language model inference by enabling a lightweight draft model to propose multiple tokens that a larger target model verifies simultaneously. However, applying this technique to vision-language models (VLMs) presents two fundamental challenges: small language models that could serve as efficient drafters lack the architectural components to process visual inputs, and their token predictions fail to match those of VLM target models that consider visual context. We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach. MASSV first connects the target VLM's vision encoder to the draft model via a lightweight trainable projector, then applies self-distilled visual instruction tuning using responses generated by the target VLM to align token predictions. Comprehensive experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually-grounded tasks. MASSV provides a scalable, architecture-compatible method for accelerating both current and future VLMs.
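The "accepted length" metric in the abstract counts how many drafted tokens the target model keeps per verification step. The sketch below shows the simplest, greedy form of that check, assuming the target's own greedy token at each drafted position is already available from one verification pass. General speculative decoding accepts or rejects tokens probabilistically; the greedy rule, the function name, and the example tokens here are simplifying assumptions.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of a drafted token block.

    draft_tokens: tokens proposed by the draft model.
    target_tokens: the target model's greedy token at each drafted position,
        plus one extra token for the position after the last draft token.
    Returns (accepted_count, emitted_tokens).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    accepted = 0
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break  # first disagreement ends the accepted prefix
        accepted += 1
    # Keep the accepted prefix, then the target's own token at the next position.
    emitted = list(draft_tokens[:accepted]) + [target_tokens[accepted]]
    return accepted, emitted


# Hypothetical example: the draft diverges from the target at the third token.
accepted, emitted = verify_draft(
    draft_tokens=["a", "red", "car", "drives"],
    target_tokens=["a", "red", "bus", "stops", "now"],
)
print(accepted, emitted)
```

A drafter whose predictions track the target more closely yields longer accepted prefixes, so each expensive target pass emits more tokens; that is the quantity MASSV's alignment phase aims to raise.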

[125] RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours

Rafael Pablos Sarabia,Joachim Nyborg,Morten Birk,Jeppe Liborius Sjørup,Anders Lillevang Vesterholt,Ira Assent

Main category: cs.LG

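TL;DR: Proposes a deep learning model for high-resolution probabilistic precipitation forecasting over an 8-hour horizon in Europe; it integrates radar, satellite, and numerical weather prediction data and outperforms existing methods.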

Details

Motivation: Overcome the short forecast lead times of radar-only deep learning models by integrating multiple data sources to improve forecast accuracy and uncertainty quantification. Method: Combines radar, satellite, and numerical weather prediction data in a compact architecture designed for efficient training and fast inference. Result: In experiments, the model surpasses operational numerical weather prediction systems, extrapolation-based methods, and deep learning nowcasting models, setting a new standard for high-resolution precipitation forecasting in Europe. Conclusion: The model balances accuracy, interpretability, and computational efficiency, setting a new benchmark for precipitation forecasting. Abstract: We present a deep learning model for high-resolution probabilistic precipitation forecasting over an 8-hour horizon in Europe, overcoming the limitations of radar-only deep learning models with short forecast lead times. Our model efficiently integrates multiple data sources - including radar, satellite, and physics-based numerical weather prediction (NWP) - while capturing long-range interactions, resulting in accurate forecasts with robust uncertainty quantification through consistent probabilistic maps. Featuring a compact architecture, it enables more efficient training and faster inference than existing models. Extensive experiments demonstrate that our model surpasses current operational NWP systems, extrapolation-based methods, and deep-learning nowcasting models, setting a new standard for high-resolution precipitation forecasting in Europe, ensuring a balance between accuracy, interpretability, and computational efficiency.

[126] PIF: Anomaly detection via preference embedding

Filippo Leveni,Luca Magri,Giacomo Boracchi,Cesare Alippi

Main category: cs.LG

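TL;DR: Proposes PIF, a new anomaly detection method that combines the advantages of adaptive isolation methods with the flexibility of preference embedding: data are embedded in a high-dimensional space, where the tree-based PI-Forest computes an anomaly score. Experiments show it compares favorably with existing techniques.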

Details

Motivation: Detect anomalies with respect to structured patterns by combining the strengths of two families of methods. Method: PIF embeds the data in a high-dimensional space and uses PI-Forest, a tree-based method, to compute an anomaly score. Result: PIF compares favorably with state-of-the-art techniques on synthetic and real datasets, and PI-Forest is better at measuring arbitrary distances and isolating points. Conclusion: PIF is effective, and PI-Forest performs well in the preference space. Abstract: We address the problem of detecting anomalies with respect to structured patterns. To this end, we conceive a novel anomaly detection method called PIF that combines the advantages of adaptive isolation methods with the flexibility of preference embedding. Specifically, we propose to embed the data in a high dimensional space where an efficient tree-based method, PI-Forest, is employed to compute an anomaly score. Experiments on synthetic and real datasets demonstrate that PIF favorably compares with state-of-the-art anomaly detection techniques, and confirm that PI-Forest is better at measuring arbitrary distances and isolating points in the preference space.
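
For readers unfamiliar with isolation-based scoring, the sketch below shows the standard isolation forest idea (points that are separated by few random splits are likely anomalies). It is not PI-Forest: the abstract does not describe PI-Forest's splitting scheme or the preference embedding in enough detail to reproduce them, so this example uses plain axis-aligned splits on raw coordinates. All function names and the demo data are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def c(n: int) -> float:
    """Average path length of an unsuccessful binary-search-tree lookup
    over n points; used to normalize isolation depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # Euler-Mascheroni constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def path_length(x: Point, points: Sequence[Point], rng: random.Random,
                depth: int = 0, max_depth: int = 10) -> float:
    """Follow x through random axis-aligned splits until it is isolated."""
    if len(points) <= 1 or depth >= max_depth:
        return depth + c(len(points))
    dim = rng.randrange(len(x))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split further along this dimension
        return depth + c(len(points))
    split = rng.uniform(lo, hi)
    # Keep only the points that fall on the same side of the split as x.
    same_side = [p for p in points if (p[dim] < split) == (x[dim] < split)]
    return path_length(x, same_side, rng, depth + 1, max_depth)


def anomaly_score(x: Point, data: List[Point], n_trees: int = 100,
                  sample_size: int = 32, seed: int = 0) -> float:
    """Score in (0, 1]; values closer to 1 indicate easier isolation."""
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    total = 0.0
    for _ in range(n_trees):
        sample = rng.sample(data, size)
        total += path_length(x, sample, rng)
    return 2.0 ** (-(total / n_trees) / c(size))


# Made-up demo data: a 6x6 grid of inliers plus one distant point.
demo = [(0.1 * i, 0.1 * j) for i in range(6) for j in range(6)] + [(50.0, 50.0)]
outlier_score = anomaly_score((50.0, 50.0), demo, sample_size=64)
inlier_score = anomaly_score((0.2, 0.3), demo, sample_size=64)
```

For brevity the sketch draws a fresh random partition per query instead of storing trees; a practical implementation would build the forest once and reuse it for every point.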

[127] SEAL: Searching Expandable Architectures for Incremental Learning

Matteo Gambella,Vicente Javier Castro Solar,Manuel Roveri

Main category: cs.LG

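TL;DR: SEAL is a NAS-based framework for data-incremental learning that adapts the model structure dynamically and expands it only when necessary, reducing forgetting and improving accuracy.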

Details

Motivation: Address the challenge of balancing the learning of new tasks against the retention of past knowledge in incremental learning, while avoiding the resource waste of existing NAS methods that expand the model at every task. Method: SEAL adapts the model structure dynamically using a capacity estimation metric, expands only when necessary, and preserves stability through cross-distillation training; the NAS component searches for both the architecture and the optimal expansion policy. Result: Experiments across multiple benchmarks show that SEAL reduces forgetting and improves accuracy while keeping the model small. Conclusion: By combining NAS with selective expansion, SEAL offers an efficient, adaptive solution for incremental learning. Abstract: Incremental learning is a machine learning paradigm where a model learns from a sequential stream of tasks. This setting poses a key challenge: balancing plasticity (learning new tasks) and stability (preserving past knowledge). Neural Architecture Search (NAS), a branch of AutoML, automates the design of the architecture of Deep Neural Networks and has shown success in static settings. However, existing NAS-based approaches to incremental learning often rely on expanding the model at every task, making them impractical in resource-constrained environments. In this work, we introduce SEAL, a NAS-based framework tailored for data-incremental learning, a scenario where disjoint data samples arrive sequentially and are not stored for future access. SEAL adapts the model structure dynamically by expanding it only when necessary, based on a capacity estimation metric. Stability is preserved through cross-distillation training after each expansion step. The NAS component jointly searches for both the architecture and the optimal expansion policy. Experiments across multiple benchmarks demonstrate that SEAL effectively reduces forgetting and enhances accuracy while maintaining a smaller model size compared to prior methods. These results highlight the promise of combining NAS and selective expansion for efficient, adaptive learning in incremental scenarios.

q-bio.QM [Back]

[128] Generative diffusion model surrogates for mechanistic agent-based biological models

Tien Comlekoglu,J. Quetzalcóatl Toledo-Marín,Douglas W. DeSimone,Shayn M. Peirce,Geoffrey Fox,James A. Glazier

Main category: q-bio.QM

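TL;DR: Uses denoising diffusion probabilistic models (DDPMs) to train a generative AI surrogate that accelerates computation of a Cellular-Potts Model (CPM), achieving roughly a 22x speedup.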

Details

Motivation: CPMs are computationally expensive for complex biological systems, and their stochastic nature means a single parameter set can produce different configurations, which complicates surrogate model development. Method: An image classifier learns the characteristics of regions of a 2-dimensional parameter space and is combined with a DDPM to train a generative AI surrogate. Result: The surrogate generates configurations 20,000 timesteps ahead of a reference configuration and reduces computational time by approximately 22x. Conclusion: DDPMs offer a feasible path toward digital twins of stochastic biological systems. Abstract: Mechanistic, multicellular, agent-based models are commonly used to investigate tissue, organ, and organism-scale biology at single-cell resolution. The Cellular-Potts Model (CPM) is a powerful and popular framework for developing and interrogating these models. CPMs become computationally expensive at large space and time scales, making application and investigation of developed models difficult. Surrogate models may allow for the accelerated evaluation of CPMs of complex biological systems. However, the stochastic nature of these models means each set of parameters may give rise to different model configurations, complicating surrogate model development. In this work, we leverage denoising diffusion probabilistic models to train a generative AI surrogate of a CPM used to investigate \textit{in vitro} vasculogenesis. We describe the use of an image classifier to learn the characteristics that define unique areas of a 2-dimensional parameter space. We then apply this classifier to aid in surrogate model selection and verification. Our CPM model surrogate generates model configurations 20,000 timesteps ahead of a reference configuration and demonstrates approximately a 22x reduction in computational time as compared to native code execution. Our work represents a step towards the implementation of DDPMs to develop digital twins of stochastic biological systems.

cs.AI [Back]

[129] From Text to Network: Constructing a Knowledge Graph of Taiwan-Based China Studies Using Generative AI

Hsuan-Lei Shao

Main category: cs.AI

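TL;DR: The study uses generative AI and large language models to turn academic texts from Taiwan-based China Studies into a structured knowledge graph that supports interactive knowledge exploration.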

Details

Motivation: Respond to the need to systematically reorganize decades of scholarship in Taiwan-based China Studies, using AI to make that knowledge easier to access. Method: Applies generative AI and large language models to extract entity-relation triples from 1,367 peer-reviewed articles and visualizes them with D3.js. Result: Builds a domain-specific knowledge graph and vector database that reveal previously uncharted research trajectories and thematic clusters. Conclusion: Generative AI can augment area studies and offers a data-driven alternative for scholarly infrastructure. Abstract: Taiwanese China Studies (CS) has developed into a rich, interdisciplinary research field shaped by the unique geopolitical position and long-standing academic engagement with Mainland China. This study responds to the growing need to systematically revisit and reorganize decades of Taiwan-based CS scholarship by proposing an AI-assisted approach that transforms unstructured academic texts into structured, interactive knowledge representations. We apply generative AI (GAI) techniques and large language models (LLMs) to extract and standardize entity-relation triples from 1,367 peer-reviewed CS articles published between 1996 and 2019. These triples are then visualized through a lightweight D3.js-based system, forming the foundation of a domain-specific knowledge graph and vector database for the field. This infrastructure allows users to explore conceptual nodes and semantic relationships across the corpus, revealing previously uncharted intellectual trajectories, thematic clusters, and research gaps. By decomposing textual content into graph-structured knowledge units, our system enables a paradigm shift from linear text consumption to network-based knowledge navigation. In doing so, it enhances scholarly access to CS literature while offering a scalable, data-driven alternative to traditional ontology construction. This work not only demonstrates how generative AI can augment area studies and digital humanities but also highlights its potential to support a reimagined scholarly infrastructure for regional knowledge systems.
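
To make the notion of entity-relation triples concrete, the sketch below turns a handful of (head, relation, tail) triples into an adjacency structure that can be navigated node by node. The triples, relation names, and helper functions are invented for illustration and are not taken from the paper's corpus or pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Hypothetical triples, standing in for what an LLM extraction step might emit.
EXAMPLE_TRIPLES: List[Triple] = [
    ("cross-strait relations", "discussed_in", "article_001"),
    ("cross-strait relations", "related_to", "economic integration"),
    ("economic integration", "discussed_in", "article_002"),
]


def build_graph(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group triples by head entity into an adjacency list."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)


def neighbors(graph: Dict[str, List[Tuple[str, str]]], node: str) -> List[str]:
    """Return the distinct entities reachable from node in one hop, sorted."""
    return sorted({tail for _, tail in graph.get(node, [])})


graph = build_graph(EXAMPLE_TRIPLES)
linked = neighbors(graph, "cross-strait relations")
```

An adjacency list like this maps directly onto the node-and-link input that graph visualization front ends typically expect, which is one reason triples are a convenient intermediate format.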

[130] Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models

Annie Wong,Thomas Bäck,Aske Plaat,Niki van Stein,Anna V. Kononova

Main category: cs.AI

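TL;DR: The study finds that large language models remain limited in their ability to adapt in dynamic environments; strategic prompting can narrow the performance gap between models, but advanced reasoning methods yield unstable results.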

Details

Motivation: Assess the true potential of large language models as self-learning and reasoning agents in dynamic environments. Method: Tests the adaptability of various open-source language models in dynamic environments using self-reflection, heuristic mutation, and planning as prompting techniques. Result: Larger models outperform smaller ones, but strategic prompting can narrow the gap; advanced prompting techniques help smaller models more and offer limited gains for large models; reasoning methods produce unstable results. Conclusion: Current large language models still have fundamental shortcomings in planning, reasoning, and spatial coordination, and evaluation must move beyond static benchmarks to assess reasoning fully. Abstract: While large language models demonstrate impressive performance on static benchmarks, the true potential of large language models as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates the efficacy of self-reflection, heuristic mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments and find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an overly long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to large performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in crucial areas such as planning, reasoning, and spatial coordination, suggesting that current-generation large language models still suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone.
Reasoning is a multi-faceted task, and while reasoning methods like chain of thought improve multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.

cs.SI [Back]

[131] Tales of the 2025 Los Angeles Fire: Hotwash for Public Health Concerns in Reddit via LLM-Enhanced Topic Modeling

Sulong Zhou,Qunying Huang,Shaoheng Zhou,Yun Hang,Xinyue Ye,Aodong Mei,Kathryn Phung,Yuning Ye,Uma Govindswamy,Zehan Li

Main category: cs.SI

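TL;DR: The study analyzes Reddit discussions during the 2025 Los Angeles wildfires, using topic modeling and a hierarchical framework to identify how the public perceived and responded to the disaster, with a focus on situational awareness and crisis narratives.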

Details

Motivation: Understand how affected populations perceive and respond during wildfire crises in order to support more timely and empathetic disaster response. Method: Collects 385 posts and 114,879 comments and analyzes them with topic modeling (enhanced by LLMs and human-in-the-loop refinement) and a hierarchical framework with two categories, Situational Awareness (SA) and Crisis Narratives (CN). Result: The SA category closely tracks fire progression; within the CN category, grief signals account for 60% of instances and mental health risks for 40%. Conclusion: The study contributes the first annotated social media dataset on the fires and a scalable multi-layer framework, providing a basis for disaster response and public health communication. Abstract: Wildfires have become increasingly frequent, irregular, and severe in recent years. Understanding how affected populations perceive and respond during wildfire crises is critical for timely and empathetic disaster response. Social media platforms offer a crowd-sourced channel to capture evolving public discourse, providing hyperlocal information and insight into public sentiment. This study analyzes Reddit discourse during the 2025 Los Angeles wildfires, spanning from the onset of the disaster to full containment. We collect 385 posts and 114,879 comments related to the Palisades and Eaton fires. We adopt topic modeling methods to identify the latent topics, enhanced by large language models (LLMs) and human-in-the-loop (HITL) refinement. Furthermore, we develop a hierarchical framework to categorize latent topics, consisting of two main categories, Situational Awareness (SA) and Crisis Narratives (CN). The volume of the SA category closely aligns with real-world fire progressions, peaking within the first 2-5 days as the fires reach the maximum extent. The most frequent co-occurring category set of public health and safety, loss and damage, and emergency resources expands on a wide range of health-related latent topics, including environmental health, occupational health, and one health. Grief signals and mental health risks consistently accounted for 60 percent and 40 percent of CN instances, respectively, with the highest total volume occurring at night. This study contributes the first annotated social media dataset on the 2025 LA fires, and introduces a scalable multi-layer framework that leverages topic modeling for crisis discourse analysis.
By identifying persistent public health concerns, our results can inform more empathetic and adaptive strategies for disaster response, public health communication, and future research in comparable climate-related disaster events.

Table of Contents

cs.CV [Back]

[1] A Computational Pipeline for Advanced Analysis of 4D Flow MRI in the Left Atrium

[2] Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

[3] BoundarySeg:An Embarrassingly Simple Method To Boost Medical Image Segmentation Performance for Low Data Regimes

[4] Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models

[5] Few-Shot Learning of Visual Compositional Concepts through Probabilistic Schema Induction

[6] Large-Scale Gaussian Splatting SLAM

[7] AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

[8] DDFP: Data-dependent Frequency Prompt for Source Free Domain Adaptation of Medical Image Segmentation

[9] VRU-CIPI: Crossing Intention Prediction at Intersections for Improving Vulnerable Road Users Safety

[10] Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset

[11] CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection

[12] MambaControl: Anatomy Graph-Enhanced Mamba ControlNet with Fourier Refinement for Diffusion-Based Disease Trajectory Prediction

[13] TKFNet: Learning Texture Key Factor Driven Feature for Facial Expression Recognition

[14] APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds

[15] High Quality Underwater Image Compression with Adaptive Correction and Codebook-based Augmentation

[16] PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

[17] Descriptive Image-Text Matching with Graded Contextual Similarity

[18] From Air to Wear: Personalized 3D Digital Fashion with AR/VR Immersive 3D Sketching

[19] Application of YOLOv8 in monocular downward multiple Car Target detection

[20] ORL-LDM: Offline Reinforcement Learning Guided Latent Diffusion Model Super-Resolution Reconstruction

[21] DeepSeqCoco: A Robust Mobile Friendly Deep Learning Model for Detection of Diseases in Cocos nucifera

[22] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

[23] Advances in Radiance Field for Dynamic Scene: From Neural Field to Gaussian Field

[24] PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

[25] ToonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars

[26] MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

[27] Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

[28] IMITATE: Image Registration with Context for unknown time frame recovery

[29] Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization

[30] Modeling Saliency Dataset Bias

[31] VolE: A Point-cloud Framework for Food 3D Reconstruction and Volume Estimation

[32] Data-Agnostic Augmentations for Unknown Variations: Out-of-Distribution Generalisation in MRI Segmentation

[33] On the Interplay of Human-AI Alignment,Fairness, and Performance Trade-offs in Medical Imaging

[34] MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

[35] ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization

[36] Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot

[37] Inferring Driving Maps by Deep Learning-based Trail Map Extraction

[38] HandReader: Advanced Techniques for Efficient Fingerspelling Recognition

[39] MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting

[40] MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning

[41] StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

[42] MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models

[43] A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability

[44] SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity

[45] Learned Lightweight Smartphone ISP with Unpaired Data

[46] Vision language models have difficulty recognizing virtual objects

[47] Consistent Quantity-Quality Control across Scenes for Deployment-Aware Gaussian Splatting

[48] Logos as a Well-Tempered Pre-train for Sign Language Recognition

[49] UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation

[50] CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

[51] MorphGuard: Morph Specific Margin Loss for Enhancing Robustness to Face Morphing Attacks

[52] Enhancing Multi-Image Question Answering via Submodular Subset Selection

[53] Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis

[54] Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data

[55] MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

[56] End-to-End Vision Tokenizer Tuning

[57] Depth Anything with Any Prior

[58] 3D-Fixup: Advancing Photo Editing with 3D Priors

cs.GR [Back]

[59] VRSplat: Fast and Robust Gaussian Splatting for Virtual Reality

[60] Style Customization of Text-to-Vector Generation with Image Diffusion Priors

cs.CL [Back]

[61] Next Word Suggestion using Graph Neural Network

[62] DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

[63] Large Language Models Are More Persuasive Than Incentivized Human Persuaders

[64] System Prompt Optimization with Meta-Learning

[65] VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

[66] An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs

[67] Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

[68] Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques

[69] Exploring the generalization of LLM truth directions on conversational formats

[70] KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning

[71] Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting

[72] Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries

[73] From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models

[74] Rethinking Prompt Optimizers: From Prompt Merits to Optimization

[75] Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph

[76] DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs