cs.CV [Total: 93]
cs.GR [Total: 3]
cs.CL [Total: 135]
cs.PF [Total: 1]
cs.CG [Total: 1]
q-fin.CP [Total: 1]
cs.AI [Total: 23]
cs.CY [Total: 1]
q-bio.GN [Total: 1]
stat.ML [Total: 1]
cs.SE [Total: 2]
eess.AS [Total: 5]
eess.IV [Total: 11]
cs.SD [Total: 6]
cs.LG [Total: 20]
cs.CR [Total: 3]
q-bio.QM [Total: 1]
cs.IR [Total: 6]
cs.MA [Total: 1]

cs.CV

[1] An Edge AI Solution for Space Object Detection

Wenxuan Zhang,Peng Hu

Main category: cs.CV

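TL;DR: The paper proposes a deep-learning-based Edge AI solution for space object detection (SOD) that combines SE layers, Vision Transformers, and the YOLOv9 framework to achieve high-accuracy, low-latency detection.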

Details

Motivation: As the number of space assets in near-Earth orbits grows, the need for real-time collision assessment and avoidance motivates efficient Edge AI solutions. Method: A model is built on deep-learning-based vision sensing, combining SE layers, Vision Transformers, and the YOLOv9 framework. Result: The model detects multiple satellites with high accuracy and very low latency across a variety of realistic SOD scenarios. Conclusion: The proposed Edge AI solution is efficient and practical for space object detection. Abstract: Effective Edge AI for space object detection (SOD) tasks that can facilitate real-time collision assessment and avoidance is essential with the increasing space assets in near-Earth orbits. In SOD, low Earth orbit (LEO) satellites must detect other objects with high precision and minimal delay. We explore an Edge AI solution based on deep-learning-based vision sensing for SOD tasks and propose a deep learning model based on Squeeze-and-Excitation (SE) layers, Vision Transformers (ViT), and the YOLOv9 framework. We evaluate the performance of these models across various realistic SOD scenarios, demonstrating their ability to detect multiple satellites with high accuracy and very low latency.
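
The abstract names Squeeze-and-Excitation (SE) layers as one building block but does not say how they are wired into the model. As a rough illustration of the general SE idea only (not the authors' architecture), the sketch below applies a channel gate to a toy feature map in plain Python. The function name `se_block`, the weight values, and the reduction ratio of 2 are all hypothetical.

```python
import math

def se_block(feature_map, w_reduce, w_expand):
    """Apply a Squeeze-and-Excitation gate to a feature map.

    feature_map: list of C channels, each a flat list of H*W activations.
    w_reduce:    (C // r) x C weight matrix (hypothetical values).
    w_expand:    C x (C // r) weight matrix (hypothetical values).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_map]
    # Excitation, step 1: bottleneck projection followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    # Excitation, step 2: expand back to C channels, squash with a sigmoid.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: reweight every activation in a channel by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, feature_map)]
    return scaled, gates

# Toy input: 2 channels of 4 activations each, reduction ratio r = 2.
fmap = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0]]
scaled, gates = se_block(fmap, w_reduce=[[0.5, -0.5]], w_expand=[[1.0], [-1.0]])
```

Each gate lies strictly between 0 and 1, so the block can only attenuate channels relative to one another; in a trained network the two weight matrices would be learned rather than fixed.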

[2] Self-Supervised Learning for Image Segmentation: A Comprehensive Survey

Thangarajah Akilan,Nusrat Jahan,Wandong Zhang

Main category: cs.CV

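TL;DR: This paper surveys self-supervised learning for image segmentation, analyzing more than 150 related articles and providing a task categorization, datasets, and future research directions.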

Details

Motivation: Supervised learning requires large amounts of precisely annotated data, which is costly and time-consuming to produce. Self-supervised learning overcomes this limitation by exploiting unlabeled data and has become a powerful tool for computer vision problems. Method: The survey reviews more than 150 articles on self-supervised learning and image segmentation, and categorizes pretext tasks, downstream tasks, and commonly used datasets. Result: It summarizes key advances of self-supervised learning in image segmentation and provides a practical categorization together with dataset information. Conclusion: Self-supervised learning has great potential for image segmentation; future work should focus on making the methods more accessible and comprehensible. Abstract: Supervised learning demands large amounts of precisely annotated data to achieve promising results. Such data curation is labor-intensive and imposes significant overhead regarding time and costs. Self-supervised learning (SSL) partially overcomes these limitations by exploiting vast amounts of unlabeled data and creating surrogate (pretext or proxy) tasks to learn useful representations without manual labeling. As a result, SSL has become a powerful machine learning (ML) paradigm for solving several practical downstream computer vision problems, such as classification, detection, and segmentation. Image segmentation is the cornerstone of many high-level visual perception applications, including medical imaging, intelligent transportation, agriculture, and surveillance. Although there is substantial research potential for developing advanced algorithms for SSL-based semantic segmentation, a comprehensive study of existing methodologies is essential to trace advances and guide emerging researchers. This survey thoroughly investigates over 150 recent image segmentation articles, particularly focusing on SSL. It provides a practical categorization of pretext tasks, downstream tasks, and commonly used benchmark datasets for image segmentation research. It concludes with key observations distilled from a large body of literature and offers future directions to make this research field more accessible and comprehensible for readers.

[3] IPENS:Interactive Unsupervised Framework for Rapid Plant Phenotyping Extraction via NeRF-SAM2 Fusion

Wentao Song,He Huang,Youqiang Sun,Fang Qu,Jiaqi Zhang,Longhui Fang,Yuwei Hao,Chenyang Peng

Main category: cs.CV

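TL;DR: IPENS is an interactive, unsupervised multi-target point cloud extraction method that uses radiance field information to lift 2D masks into 3D point clouds, resolves the single-interaction multi-target segmentation problem, and markedly speeds up phenotype extraction for rice and wheat.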

Details

Motivation: Existing methods depend on large-scale, high-precision manually annotated data, and unsupervised methods perform poorly on self-occluded grain-level targets, so an efficient phenotype extraction method that needs no annotated data is required. Method: IPENS takes 2D masks segmented by SAM2, lifts them into 3D point clouds using radiance field information, and applies a multi-target collaborative optimization strategy to handle multi-target segmentation. Result: On the rice dataset, IPENS reaches an mIoU of 63.72% with strong phenotype prediction; on the wheat dataset, mIoU rises to 89.68% with even more accurate phenotype prediction. Conclusion: IPENS offers a non-invasive, high-quality phenotype extraction solution that needs no annotated data and can markedly accelerate intelligent breeding. Abstract: Advanced plant phenotyping technologies play a crucial role in targeted trait improvement and accelerating intelligent breeding. Due to the species diversity of plants, existing methods heavily rely on large-scale high-precision manually annotated data. For self-occluded objects at the grain level, unsupervised methods often prove ineffective. This study proposes IPENS, an interactive unsupervised multi-target point cloud extraction method. The method utilizes radiance field information to lift 2D masks, which are segmented by SAM2 (Segment Anything Model 2), into 3D space for target point cloud extraction. A multi-target collaborative optimization strategy is designed to effectively resolve the single-interaction multi-target segmentation challenge. Experimental validation demonstrates that IPENS achieves a grain-level segmentation accuracy (mIoU) of 63.72% on a rice dataset, with strong phenotypic estimation capabilities: grain volume prediction yields R² = 0.7697 (RMSE = 0.0025), leaf surface area R² = 0.84 (RMSE = 18.93), and leaf length and width predictions achieve R² = 0.97 and 0.87 (RMSE = 1.49 and 0.21). On a wheat dataset, IPENS further improves segmentation accuracy to 89.68% (mIoU), with equally outstanding phenotypic estimation performance: spike volume prediction achieves R² = 0.9956 (RMSE = 0.0055), leaf surface area R² = 1.00 (RMSE = 0.67), and leaf length and width predictions reach R² = 0.99 and 0.92 (RMSE = 0.23 and 0.15). This method provides a non-invasive, high-quality phenotyping extraction solution for rice and wheat. Without requiring annotated data, it rapidly extracts grain-level point clouds within 3 minutes through simple single-round interactions on images for multiple targets, demonstrating significant potential to accelerate intelligent breeding efficiency.
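
The abstract reports three kinds of numbers: mIoU for segmentation, and R² and RMSE for phenotype estimates. For readers who want the definitions in concrete form, here is a minimal sketch using the standard formulas. The toy masks are made up, and representing a segmented grain as a set of point indices is an assumption for illustration; the abstract does not describe the authors' evaluation code.

```python
def iou(pred, truth):
    """Intersection over union of two sets of point indices."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0

def mean_iou(pred_masks, truth_masks):
    """Average IoU over matched (prediction, ground truth) pairs."""
    scores = [iou(p, t) for p, t in zip(pred_masks, truth_masks)]
    return sum(scores) / len(scores)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the measured trait."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5

# Made-up toy data: two grains, each a set of point indices.
pred = [{1, 2, 3, 4}, {10, 11}]
truth = [{2, 3, 4, 5}, {10, 11}]
miou = mean_iou(pred, truth)  # (3/5 + 2/2) / 2
```

Note that RMSE carries the units of the trait being measured, so the RMSE values quoted for volume, area, and length are not comparable with one another, whereas R² is unitless.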

[4] GeoVLM: Improving Automated Vehicle Geolocalisation Using Vision-Language Matching

Barkin Dagda,Muhammad Awais,Saber Fallah

Main category: cs.CV

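TL;DR: GeoVLM uses the zero-shot capabilities of vision-language models and interpretable cross-view language descriptions to improve the top-match accuracy of cross-view geo-localisation.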

Details

Motivation: Even at high recall, existing cross-view geo-localisation methods struggle to rank the correct image first, because similar-looking scenes make matching difficult. Method: GeoVLM is a trainable reranking approach that uses vision-language models to generate cross-view language descriptions. Result: Evaluated on VIGOR, University-1652, and the new Cross-View United Kingdom dataset, GeoVLM outperforms existing methods. Conclusion: GeoVLM substantially improves retrieval performance for cross-view geo-localisation through natural language descriptions. Abstract: Cross-view geo-localisation identifies the coarse geographical position of an automated vehicle by matching a ground-level image to a geo-tagged satellite image from a database. Despite the advancements in cross-view geo-localisation, significant challenges persist, such as similar-looking scenes, which make it difficult to find the correct match as the top match. Existing approaches reach high recall rates but still fail to rank the correct image as the top match. To address this challenge, this paper proposes GeoVLM, a novel approach which uses the zero-shot capabilities of vision language models to enable cross-view geo-localisation using interpretable cross-view language descriptions. GeoVLM is a trainable reranking approach which improves the best-match accuracy of cross-view geo-localisation. GeoVLM is evaluated on the standard benchmarks VIGOR and University-1652 and also in real-life driving environments using Cross-View United Kingdom, a new benchmark dataset introduced in this paper. The results show that GeoVLM improves the retrieval performance of cross-view geo-localisation compared to state-of-the-art methods with the help of explainable natural language descriptions. The code is available at https://github.com/CAV-Research-Lab/GeoVLM

[5] GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

Pengyue Jia,Seongheon Park,Song Gao,Xiangyu Zhao,Yixuan Li

Main category: cs.CV

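TL;DR: GeoRanker is a distance-aware ranking framework that uses vision-language models to jointly encode query-candidate interactions and predict geographic proximity, substantially improving worldwide image geolocalization.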

Details

Motivation: Worldwide image geolocalization is challenged by the diversity of visual content, and existing methods rely on simplistic similarity heuristics and point-wise supervision without modeling spatial relationships among candidates. Method: The GeoRanker framework combines a multi-order distance loss with a vision-language model that jointly encodes query-candidate interactions, and the GeoRanking dataset is introduced to support geographic ranking. Result: It achieves state-of-the-art results on the IM2GPS3K and YFCC4K benchmarks, significantly outperforming existing methods. Conclusion: By modeling spatial relationships and incorporating multimodal candidate information, GeoRanker effectively improves geolocalization performance. Abstract: Worldwide image geolocalization, the task of predicting GPS coordinates from images taken anywhere on Earth, poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query-candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods.
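
The abstract does not define the multi-order distance loss, so the sketch below is not that loss. It only illustrates the general idea of distance-aware ranking with a generic pairwise hinge loss: for any two candidates, the one geographically closer to the true location should receive the higher score. The function names, the margin value, and the example coordinates are assumptions; the great-circle distance uses the standard haversine formula with a mean Earth radius of 6371 km.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def pairwise_distance_ranking_loss(scores, candidates, target, margin=0.1):
    """Hinge loss over candidate pairs: the nearer candidate should score higher."""
    dists = [haversine_km(c, target) for c in candidates]
    loss, pairs = 0.0, 0
    for i in range(len(candidates)):
        for j in range(len(candidates)):
            if dists[i] < dists[j]:  # candidate i is closer to the true location
                loss += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return loss / pairs if pairs else 0.0

# Hypothetical example: the true location is Paris; candidates are
# Versailles, London, and Tokyo, given as (lat, lon) in degrees.
target = (48.8566, 2.3522)
candidates = [(48.8049, 2.1204), (51.5074, -0.1278), (35.6762, 139.6503)]
good = pairwise_distance_ranking_loss([0.9, 0.5, 0.1], candidates, target)
bad = pairwise_distance_ranking_loss([0.1, 0.5, 0.9], candidates, target)
```

Scores that follow the distance ordering incur zero loss, while reversed scores are penalized on every pair. A loss of this form supervises relative order only; the abstract states that the authors' loss also ranks absolute distances, which this sketch does not attempt.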

[6] Frozen Backpropagation: Relaxing Weight Symmetry in Temporally-Coded Deep Spiking Neural Networks

Gaspard Goupy,Pierre Tirilly,Ioan Marius Bilasco

Main category: cs.CV

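TL;DR: The paper proposes Frozen Backpropagation (fBP), a training algorithm that periodically freezes feedback weights to reduce weight transport, lowering hardware overhead and energy cost.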

Details

Motivation: Training SNNs directly on neuromorphic hardware can greatly reduce energy consumption, but the weight symmetry required by backpropagation adds hardware overhead and energy cost. Method: fBP updates forward weights using gradients computed with periodically frozen feedback weights, and three partial weight transport schemes are proposed to cut transport cost further. Result: On image recognition tasks fBP outperforms existing methods and approaches the accuracy of BP, while partial weight transport lowers transport cost by 1,000x to 10,000x. Conclusion: fBP offers guidance for designing neuromorphic hardware that supports BP-based on-chip learning. Abstract: Direct training of Spiking Neural Networks (SNNs) on neuromorphic hardware can greatly reduce energy costs compared to GPU-based training. However, implementing Backpropagation (BP) on such hardware is challenging because forward and backward passes are typically performed by separate networks with distinct weights. To compute correct gradients, forward and feedback weights must remain symmetric during training, necessitating weight transport between the two networks. This symmetry requirement imposes hardware overhead and increases energy costs. To address this issue, we introduce Frozen Backpropagation (fBP), a BP-based training algorithm relaxing weight symmetry in settings with separate networks. fBP updates forward weights by computing gradients with periodically frozen feedback weights, reducing weight transports during training and minimizing synchronization overhead. To further improve transport efficiency, we propose three partial weight transport schemes of varying computational complexity, where only a subset of weights is transported at a time. We evaluate our methods on image recognition tasks and compare them to existing approaches addressing the weight symmetry requirement. Our results show that fBP outperforms these methods and achieves accuracy comparable to BP. With partial weight transport, fBP can substantially lower transport costs by 1,000x with an accuracy drop of only 0.5pp on CIFAR-10 and 1.1pp on CIFAR-100, or by up to 10,000x at the expense of moderate accuracy loss. This work provides insights for guiding the design of neuromorphic hardware incorporating BP-based on-chip learning.
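
To make the frozen-feedback idea concrete, the following toy sketch trains a two-weight linear network (not a spiking network) in which the backward pass uses a separate feedback copy of the output weight that is refreshed only every few steps. Everything here is a simplifying assumption for illustration: the function name, the scalar model, the single training sample, and the synchronization schedule. It is not the authors' algorithm and does not cover their partial transport schemes.

```python
def train_frozen_feedback(steps=200, sync_every=10, lr=0.1):
    """Toy two-weight linear network y = w2 * (w1 * x) fitted to one sample.

    The backward pass uses a separate feedback weight b2 in place of w2,
    and b2 is refreshed from w2 only every `sync_every` steps.
    """
    x, target = 1.0, 2.0
    w1, w2 = 0.5, 0.5      # forward weights
    b2 = w2                # feedback copy of w2 used in the backward pass
    transports, losses = 0, []
    for step in range(steps):
        if step % sync_every == 0:
            b2 = w2        # one weight transport: forward -> feedback network
            transports += 1
        h = w1 * x
        err = w2 * h - target
        losses.append(0.5 * err ** 2)
        grad_w2 = err * h        # exact gradient for the output weight
        grad_w1 = err * b2 * x   # uses the frozen feedback weight, not w2
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return losses, transports

losses, transports = train_frozen_feedback()
```

With `sync_every=1` the feedback weight always equals the forward weight, which corresponds to ordinary backpropagation with one transport per step; larger values trade gradient exactness for fewer transports. In this toy problem the stale feedback weight keeps the same sign as the forward weight, so the update for `w1` still points downhill.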

[7] ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model

Satoshi Kondo

Main category: cs.CV

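TL;DR: The paper proposes ReSW-VL, a vision-language-model-based method for surgical phase recognition that fine-tunes the image encoder of a CLIP model with prompt learning, and shows on three datasets that it outperforms conventional methods.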

Details

Motivation: Surgical phase recognition has a wide range of potential applications, but little research addresses training methods for the CNNs used for feature extraction or representation learning. Method: The image encoder of the CLIP vision-language model is fine-tuned with prompt learning for surgical phase recognition. Result: On three surgical phase recognition datasets, the proposed method outperforms conventional methods. Conclusion: By combining a vision-language model with prompt learning, ReSW-VL effectively improves surgical phase recognition. Abstract: Surgical phase recognition from video is a technology that automatically classifies the progress of a surgical procedure and has a wide range of potential applications, including real-time surgical support, optimization of medical resources, training and skill assessment, and safety improvement. Recent advances in surgical phase recognition technology have focused primarily on Transformer-based methods, although methods that extract spatial features from individual frames using a CNN and video features from the resulting time series of spatial features using time series modeling have shown high performance. However, there remains a paucity of research on training methods for CNNs employed for feature extraction or representation learning in surgical phase recognition. In this study, we propose a method for representation learning in surgical workflow analysis using a vision-language model (ReSW-VL). Our proposed method involves fine-tuning the image encoder of a CLIP (Contrastive Language-Image Pre-training) vision-language model using prompt learning for surgical phase recognition. The experimental results on three surgical phase recognition datasets demonstrate the effectiveness of the proposed method in comparison to conventional methods.

[8] Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping

Subash Khanal,Srikumar Sastry,Aayush Dhakal,Adeel Ahmad,Nathan Jacobs

Main category: cs.CV

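TL;DR: Sat2Sound is a multimodal representation learning framework that predicts the distribution of sounds at any location on Earth by combining satellite images with audio data and using a vision-language model to generate rich soundscape descriptions.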

Details

Motivation: Existing methods rely on satellite images and geotagged audio samples but fail to capture the full diversity of sounds. Sat2Sound aims to address this through multimodal learning. Method: A vision-language model generates soundscape descriptions, and contrastive learning across audio, audio captions, satellite images, and image captions is used to learn a shared codebook of soundscape concepts. Result: The method achieves state-of-the-art cross-modal retrieval performance on the GeoSound and SoundingEarth datasets and supports location-based soundscape synthesis. Conclusion: Sat2Sound improves the accuracy of soundscape mapping through multimodal learning and demonstrates potential for new applications. Abstract: We present Sat2Sound, a multimodal representation learning framework for soundscape mapping, designed to predict the distribution of sounds at any location on Earth. Existing methods for this task rely on satellite images and paired geotagged audio samples, which often fail to capture the diversity of sound sources at a given location. To address this limitation, we enhance existing datasets by leveraging a Vision-Language Model (VLM) to generate semantically rich soundscape descriptions for locations depicted in satellite images. Our approach incorporates contrastive learning across audio, audio captions, satellite images, and satellite image captions. We hypothesize that there is a fixed set of soundscape concepts shared across modalities. To this end, we learn a shared codebook of soundscape concepts and represent each sample as a weighted average of these concepts. Sat2Sound achieves state-of-the-art performance in cross-modal retrieval between satellite image and audio on two datasets: GeoSound and SoundingEarth. Additionally, building on Sat2Sound's ability to retrieve detailed soundscape captions, we introduce a novel application: location-based soundscape synthesis, which enables immersive acoustic experiences. Our code and models will be publicly available.
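
The abstract says each sample is represented as a weighted average of concepts from a shared codebook but does not say how the weights are obtained. One plausible reading, sketched here under that stated assumption, computes weights as a softmax over dot-product similarities between the sample embedding and each concept. The function names, the toy codebook, and the choice of softmax are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def codebook_representation(embedding, codebook):
    """Represent an embedding as a weighted average of codebook concepts."""
    # Similarity of the sample to each concept (dot product, an assumption).
    sims = [sum(e * c for e, c in zip(embedding, concept)) for concept in codebook]
    weights = softmax(sims)
    dim = len(codebook[0])
    # Convex combination of concept vectors, one output value per dimension.
    mixed = [sum(w * concept[d] for w, concept in zip(weights, codebook))
             for d in range(dim)]
    return mixed, weights

# Toy codebook of three 2-D concepts and one sample embedding.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
mixed, weights = codebook_representation([2.0, 0.0], codebook)
```

Because the weights are non-negative and sum to one, every sample from every modality lands inside the convex hull of the same concept vectors, which is one way a shared codebook can tie modalities together.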

[9] Transfer Learning from Visual Speech Recognition to Mouthing Recognition in German Sign Language

Dinh Nam Pham,Eleftherios Avramidis

Main category: cs.CV

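TL;DR: This paper studies transfer learning from visual speech recognition (VSR) to improve mouthing recognition in German Sign Language and examines how multi-task learning affects model performance.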

Details

Motivation: Sign language recognition (SLR) systems usually focus on manual gestures, but non-manual features such as mouthing also carry important linguistic information. This study addresses that gap by exploring how VSR knowledge can improve mouthing recognition. Method: Transfer learning from three VSR datasets (English, German with unrelated words, and German with the target words), combined with a multi-task learning strategy. Result: Multi-task learning significantly improves both mouthing recognition and VSR accuracy and also increases model robustness. Conclusion: Mouthing recognition should be treated as a task related to but distinct from VSR, and VSR knowledge transfers effectively to SLR datasets with limited mouthing annotations. Abstract: Sign Language Recognition (SLR) systems primarily focus on manual gestures, but non-manual features such as mouth movements, specifically mouthing, provide valuable linguistic information. This work directly classifies mouthing instances to their corresponding words in the spoken language while exploring the potential of transfer learning from Visual Speech Recognition (VSR) to mouthing recognition in German Sign Language. We leverage three VSR datasets: one in English, one in German with unrelated words and one in German containing the same target words as the mouthing dataset, to investigate the impact of task similarity in this setting. Our results demonstrate that multi-task learning improves both mouthing recognition and VSR accuracy as well as model robustness, suggesting that mouthing recognition should be treated as a distinct but related task to VSR. This research contributes to the field of SLR by proposing knowledge transfer from VSR to SLR datasets with limited mouthing annotations.

[10] Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

Yongshuo Zong,Qin Zhang,Dongsheng An,Zhihua Li,Xiang Xu,Linghan Xu,Zhuowen Tu,Yifan Xing,Onkar Dabeer

Main category: cs.CV

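TL;DR: The paper proposes a workflow that automatically scales instruction-following data to improve the pixel-level grounding ability of vision-language models under complex instructions. It addresses five challenges in text-instruction-based grounding, generates high-quality data through knowledge distillation, and substantially improves model performance.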

Details

Motivation: To address five challenges in text-instruction-based grounding (hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references) while reducing reliance on costly human annotation. Method: Knowledge distillation from a pre-trained teacher model generates high-quality instruction-response pairs linked to existing pixel-level annotations. Result: Models trained with the resulting Ground-V dataset improve substantially: LISA and PSALM gain 4.4% and 7.9% respectively on the gIoU metric, and new state-of-the-art results are set on benchmarks such as RefCOCO/+/g. Conclusion: Ground-V effectively improves the pixel-level grounding ability of vision-language models and provides high-quality data for grounding under complex instructions. Abstract: This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly achieves an average accuracy boost of 4.4% for LISA and 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state-of-the-art by more than 20%.

[11] Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning

Zhongyu Chen,Rong Zhao,Xie Han,Xindong Guo,Song Wang,Zherui Qiao

Main category: cs.CV

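TL;DR: The paper proposes a physics-driven self-supervised learning method that learns point cloud representations through a local-whole force propagation mechanism and outperforms existing methods.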

Details

Motivation: Existing methods focus on geometric distribution and structural features and overlook the relationship between local information and the whole structure, whereas in the real world the elastic deformation of an object affects its overall shape through the propagation of local forces. Method: A dual-task encoder-decoder framework combines the geometric modeling capability of implicit fields with physics-driven elastic deformation, using two decoders to learn the whole shape and local deformations respectively. Result: Experiments show that the method outperforms existing approaches in object classification, few-shot learning, and segmentation. Conclusion: Physics-driven modeling of the local-whole relationship effectively improves point cloud representation learning. Abstract: Existing point cloud representation learning methods tend to learn the geometric distribution of objects through data-driven approaches, emphasizing structural features while overlooking the relationship between the local information and the whole structure. Local features reflect the fine-grained variations of an object, while the whole structure is determined by the interaction and combination of these local features, collectively defining the object's shape. In the real world, objects undergo elastic deformation under external forces, and this deformation gradually affects the whole structure through the propagation of forces from local regions, thereby altering the object's geometric properties. Inspired by this, we propose a physics-driven self-supervised learning method for point cloud representation, which captures the relationship between parts and the whole by constructing a local-whole force propagation mechanism. Specifically, we employ a dual-task encoder-decoder framework, integrating the geometric modeling capability of implicit fields with physics-driven elastic deformation. The encoder extracts features from the point cloud and its tetrahedral mesh representation, capturing both geometric and physical properties. These features are then fed into two decoders: one learns the whole geometric shape of the point cloud through an implicit field, while the other predicts local deformations using two specifically designed physics information loss functions, modeling the deformation relationship between local and whole shapes. Experimental results show that our method outperforms existing approaches in object classification, few-shot learning, and segmentation, demonstrating its effectiveness.

[12] InstanceBEV: Unifying Instance and BEV Representation for Global Modeling

Feng Li,Kun Xu,Zhaoyue Wang,Yunduan Cui,Mohammad Masum Billah,Jia Liu

Main category: cs.CV

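TL;DR: The paper proposes InstanceBEV, which uses instance-level dimensionality reduction to tame the data complexity of global modeling in BEV, achieving efficient, high-performing 3D spatial modeling without sparsification or acceleration operators.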

Details

Motivation: Existing BEV-based methods need complex optimizations for large-scale global modeling; InstanceBEV aims to simplify the pipeline and improve performance. Method: Instance-level dimensionality reduction lets transformers aggregate global features directly, and the global feature maps are sampled into 3D space. Result: The method achieves state-of-the-art performance on the OpenOcc-NuScenes dataset with a simple and efficient framework. Conclusion: InstanceBEV offers a more efficient solution for global modeling in BEV and reaches high performance without additional optimizations. Abstract: Occupancy Grid Maps are widely used in navigation for their ability to represent 3D space occupancy. However, existing methods that utilize multi-view cameras to construct Occupancy Networks for perception modeling suffer from cubic growth in data complexity. Adopting a Bird's-Eye View (BEV) perspective offers a more practical solution for autonomous driving, as it provides higher semantic density and mitigates complex object occlusions. Nonetheless, BEV-based approaches still require extensive engineering optimizations to enable efficient large-scale global modeling. To address this challenge, we propose InstanceBEV, the first method to introduce instance-level dimensionality reduction for BEV, enabling global modeling with transformers without relying on sparsification or acceleration operators. Different from other BEV methods, our approach directly employs transformers to aggregate global features. Compared to 3D object detection models, our method samples global feature maps into 3D space. Experiments on the OpenOcc-NuScenes dataset show that InstanceBEV achieves state-of-the-art performance while maintaining a simple, efficient framework without requiring additional optimizations.

[13] MGStream: Motion-aware 3D Gaussian for Streamable Dynamic Scene Reconstruction

Zhenyu Bao,Qing Li,Guibiao Liao,Zhongyuan Zhao,Kanglin Liu

Main category: cs.CV

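TL;DR: MGStream reconstructs dynamic scenes with motion-related 3D Gaussians for the dynamic parts and vanilla 3D Gaussians for the static parts, addressing the flickering and storage inefficiency of 3DGS in dynamic scenes.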

Details

Motivation: 3DGS performs well in dynamic novel view synthesis but still suffers from flickering, storage inefficiency, and difficulty modeling emerging objects. Method: Motion-related 3D Gaussians reconstruct the dynamic parts and vanilla 3D Gaussians the static parts; the motion-related Gaussians are obtained from a motion mask and a clustering-based convex hull algorithm, then subjected to rigid deformation and attention-based optimization. Result: On the N3DV and MeetRoom datasets, MGStream outperforms existing methods in rendering quality, training/storage efficiency, and temporal consistency. Conclusion: MGStream effectively addresses flickering and storage problems in dynamic scenes while improving the modeling of emerging objects. Abstract: 3D Gaussian Splatting (3DGS) has gained significant attention in streamable dynamic novel view synthesis (DNVS) for its photorealistic rendering capability and computational efficiency. Despite much progress in improving rendering quality and optimization strategies, 3DGS-based streamable dynamic scene reconstruction still suffers from flickering artifacts and storage inefficiency, and struggles to model emerging objects. To tackle this, we introduce MGStream, which employs motion-related 3D Gaussians (3DGs) to reconstruct the dynamic parts and vanilla 3DGs for the static parts. The motion-related 3DGs are implemented according to the motion mask and the clustering-based convex hull algorithm. Rigid deformation is applied to the motion-related 3DGs to model the dynamic parts, and attention-based optimization of the motion-related 3DGs enables the reconstruction of emerging objects. As the deformation and optimization are only conducted on the motion-related 3DGs, MGStream avoids flickering artifacts and improves storage efficiency. Extensive experiments on the real-world datasets N3DV and MeetRoom demonstrate that MGStream surpasses existing streaming 3DGS-based approaches in terms of rendering quality, training/storage efficiency and temporal consistency. Our code is available at: https://github.com/pcl3dv/MGStream.

[14] SuperMapNet for Long-Range and High-Accuracy Vectorized HD Map Construction

Ruqin Zhou,San Jiang,Wanshou Jiang,Yongsheng Zhang,Chenguang Dai

Main category: cs.CV

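TL;DR: SuperMapNet is a method for long-range, high-accuracy vectorized HD map construction that uses multi-modal input and interaction modules to overcome the limitations of existing methods.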

Details

Motivation: In BEV feature generation and in the classification and localization of map elements, existing methods are limited by single-modality input or by weak multi-modal synergy, which leads to feature holes and low accuracy. Method: Camera images and LiDAR point clouds are combined as input; BEV features are generated by a cross-attention-based synergy enhancement module and a flow-based disparity alignment module, and three levels of interaction (point-to-point, element-to-element, and point-to-element) provide high-accuracy classification and localization. Result: The method performs strongly on the nuScenes and Argoverse2 datasets, exceeding SOTA methods by 14.9/8.8 mAP and 18.5/3.1 mAP respectively. Conclusion: Through multi-modal synergy and interaction design, SuperMapNet substantially improves the accuracy and range of vectorized HD map construction. Abstract: Vectorized HD maps are essential for autonomous driving. Remarkable work has been achieved in recent years, but there are still major issues: (1) in the generation of the BEV features, single-modality methods are of limited perception capability, while direct concatenation-based multi-modal methods fail to capture synergies and disparities between different modalities, resulting in limited ranges with feature holes; (2) in the classification and localization of map elements, only point information is used, without considering element information or the interaction between point information and element information, leading to erroneous shapes and element entanglement with low accuracy. To address the above issues, we introduce SuperMapNet for long-range and high-accuracy vectorized HD map construction. It uses both camera images and LiDAR point clouds as input, and first tightly couples semantic information from camera images and geometric information from LiDAR point clouds by a cross-attention-based synergy enhancement module and a flow-based disparity alignment module for long-range BEV feature generation. Then, local features from point queries and global features from element queries are tightly coupled by three-level interactions for high-accuracy classification and localization, where Point2Point interaction learns local geometric information between points of the same element and of each point, Element2Element interaction learns relation constraints between different elements and semantic information of each element, and Point2Element interaction learns complementary element information for its constituent points. Experiments on the nuScenes and Argoverse2 datasets demonstrate superior performance, surpassing SOTA methods by 14.9/8.8 mAP and 18.5/3.1 mAP under hard/easy settings, respectively. The code is made publicly available.

[15] Domain Adaptation of VLM for Soccer Video Understanding

Tiancheng Jiang,Henry Wang,Md Sirajus Salekin,Parmida Atighehchian,Shinan Zhang

Main category: cs.CV

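TL;DR: This paper studies how well open-source vision-language models (VLMs) adapt to a specific domain, using soccer as the example, and shows that fine-tuning with curriculum learning substantially improves performance on soccer-related tasks.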

Details

Motivation: Most video understanding VLM research is domain-agnostic, and the ability of these models to transfer to specialized domains is under-explored; this paper fills the gap using soccer as a case study. Method: Large-scale soccer datasets and an LLM are used to generate instruction-following data, and a general-domain VLM is iteratively fine-tuned in a curriculum learning fashion (key concepts first, then question-answering tasks). Result: The final model achieves a 37.5% relative improvement on soccer visual question answering and raises soccer action classification accuracy from 11.8% to 63.5%. Conclusion: With domain adaptation and curriculum learning, a general-purpose VLM can substantially improve on domain-specific tasks. Abstract: Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most video understanding VLM research has been domain-agnostic, leaving the understanding of their transfer learning capability to specialized domains under-explored. In this work, we address this by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, and uses this data to iteratively fine-tune the general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then question-answering tasks). The final adapted model, trained using a curated dataset of 20k video clips, exhibits significant improvement in soccer-specific tasks compared to the base model, with a 37.5% relative improvement for the visual question-answering task and an accuracy improvement from 11.8% to 63.5% for the downstream soccer action classification task.
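
The two headline results are stated in different forms: a relative improvement for visual question answering and before/after accuracies for action classification. As a small worked calculation using only the figures quoted in the abstract, the sketch below converts the classification result into both an absolute gain in percentage points and a relative gain, so the two results can be read on the same footing. The helper names are illustrative.

```python
def absolute_gain_pp(before, after):
    """Gain in percentage points between two accuracies given in percent."""
    return after - before

def relative_gain_pct(before, after):
    """Gain expressed as a percentage of the baseline value."""
    return (after - before) / before * 100.0

# Action classification accuracies quoted in the abstract.
abs_gain = absolute_gain_pp(11.8, 63.5)   # percentage points
rel_gain = relative_gain_pct(11.8, 63.5)  # percent of the baseline
```

The classification gain is about 51.7 percentage points, or roughly a 438% relative improvement over the base model. The abstract does not give the base accuracy for the question-answering task, so its 37.5% relative figure cannot be converted to percentage points.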

[16] 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision

Ruihan Liu,Xiaoyi Wu,Xijun Chen,Liang Hu,Yunjiang Lou

Main category: cs.CV

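TL;DR: 4D-ROLLS is a weakly supervised occupancy estimation method for 4D radar that uses LiDAR point clouds as the supervisory signal and performs well in degraded environments.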

Details

Motivation: Existing occupancy estimation methods perform poorly in degraded environments such as smoke, rain, and snow, so a more robust solution is needed. Method: Pseudo-LiDAR labels (occupancy queries and height maps) are generated as multi-stage supervision to train a 4D radar occupancy estimation model, which is then aligned with the LiDAR occupancy map to improve accuracy. Result: Experiments confirm the strong performance of 4D-ROLLS, including robustness in degraded environments and effectiveness in cross-dataset training. Conclusion: 4D-ROLLS is lightweight and efficient (about 30 Hz inference), transfers seamlessly to downstream tasks, and has broad application potential. Abstract: A comprehensive understanding of 3D scenes is essential for autonomous vehicles (AVs), and among various perception tasks, occupancy estimation plays a central role by providing a general representation of drivable and occupied space. However, most existing occupancy estimation methods rely on LiDAR or cameras, which perform poorly in degraded environments such as smoke, rain, snow, and fog. In this paper, we propose 4D-ROLLS, the first weakly supervised occupancy estimation method for 4D radar using the LiDAR point cloud as the supervisory signal. Specifically, we introduce a method for generating pseudo-LiDAR labels, including occupancy queries and LiDAR height maps, as multi-stage supervision to train the 4D radar occupancy estimation model. Then the model is aligned with the occupancy map produced by LiDAR, fine-tuning its accuracy in occupancy estimation. Extensive comparative experiments validate the exceptional performance of 4D-ROLLS. Its robustness in degraded environments and effectiveness in cross-dataset training are qualitatively demonstrated. The model is also seamlessly transferred to the downstream tasks of BEV segmentation and point cloud occupancy prediction, highlighting its potential for broader applications. The lightweight network enables the 4D-ROLLS model to achieve fast inference speeds of about 30 Hz on a 4060 GPU. The code of 4D-ROLLS will be made available at https://github.com/CLASS-Lab/4D-ROLLS.

Chu Chen,Kangning Cui,Pasquale Cascarano,Wei Tang,Elena Loli Piccolomini,Raymond H. Chan

Main category: cs.CV

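TL;DR: The paper proposes DUP, a self-supervised ultrasound video super-resolution algorithm that increases resolution and removes noise without paired training data.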

Details

Motivation: Ultrasound videos typically have low signal-to-noise ratios and limited resolution, and differences in equipment and acquisition settings lead to inconsistent data distributions and noise levels, which hurts the generalization of pre-trained models. Method: DUP optimizes a neural network adaptively for each video to increase the resolution of ultrasound videos and remove noise. Result: Quantitative and visual evaluations show that DUP outperforms existing super-resolution algorithms and substantially improves downstream applications. Conclusion: DUP is an effective self-supervised ultrasound video super-resolution method that achieves high-quality enhancement without paired data. Abstract: Ultrasound imaging is widely applied in clinical practice, yet ultrasound videos often suffer from low signal-to-noise ratios (SNR) and limited resolutions, posing challenges for diagnosis and analysis. Variations in equipment and acquisition settings can further exacerbate differences in data distribution and noise levels, reducing the generalizability of pre-trained models. This work presents a self-supervised ultrasound video super-resolution algorithm called Deep Ultrasound Prior (DUP). DUP employs a video-adaptive optimization process of a neural network that enhances the resolution of given ultrasound videos without requiring paired training data while simultaneously removing noise. Quantitative and visual evaluations demonstrate that DUP outperforms existing super-resolution algorithms, leading to substantial improvements for downstream applications.

[18] An Explorative Analysis of SVM Classifier and ResNet50 Architecture on African Food Classification

Chinedu Emmanuel Mbonu,Kenechukwu Anigbogu,Doris Asogwa,Tochukwu Belonwu

Main category: cs.CV

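TL;DR: This study compares deep learning and traditional machine learning methods for African food classification and finds that each has its own strengths and weaknesses.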

Details

Motivation: Food recognition systems have advanced considerably for Western cuisines, but their application to African foods remains underexplored. Method: A fine-tuned ResNet50 model and a Support Vector Machine (SVM) classifier are evaluated on a dataset of 1,658 African food images. Result: Model effectiveness is assessed with five metrics: confusion matrix, F1-score, accuracy, recall, and precision. Conclusion: The findings offer valuable insights for further progress in African food recognition. Abstract: Food recognition systems have advanced significantly for Western cuisines, yet their application to African foods remains underexplored. This study addresses this gap by evaluating both deep learning and traditional machine learning methods for African food classification. We compared the performance of a fine-tuned ResNet50 model with a Support Vector Machine (SVM) classifier. The dataset comprises 1,658 images across six selected food categories that are known in Africa. To assess model effectiveness, we utilize five key evaluation metrics: confusion matrix, F1-score, accuracy, recall and precision. Our findings offer valuable insights into the strengths and limitations of both approaches, contributing to the advancement of food recognition for African cuisines.
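
Four of the five metrics listed in the abstract can be derived from the fifth, the confusion matrix. The sketch below shows that derivation with the standard definitions. The 2x2 matrix is made-up toy data rather than a result from the paper, and the convention that rows are true classes and columns are predicted classes is an assumption.

```python
def per_class_metrics(cm):
    """Precision, recall, and F1 per class from a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly other
        fn = sum(cm[k]) - tp                       # truly k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics

def accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total

cm = [[8, 2], [1, 9]]  # made-up counts for two classes
scores = per_class_metrics(cm)
acc = accuracy(cm)
```

The same functions work unchanged for the six-category setting described in the abstract, since they only assume a square matrix.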

[19] LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts

Qifeng Cai,Hao Liang,Hejun Dong,Meiyi Qiang,Ruichuan An,Zhaoyang Han,Zhengzhou Zhu,Bin Cui,Wentao Zhang

Main category: cs.CV

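TL;DR: LoVR is a benchmark dataset for long video-text retrieval containing 467 long videos and 40,804 fine-grained clips with high-quality captions; the paper also proposes an automatic caption generation framework and a semantic fusion method.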

Details

Motivation: Existing benchmarks fall short in video duration, caption quality, and annotation granularity, which limits the evaluation of advanced video-text retrieval methods. Method: Proposes an efficient caption generation framework (VLM automatic generation, caption quality scoring, and dynamic refinement) and a semantic fusion method that produces coherent full-video captions. Result: LoVR poses new challenges for video understanding and retrieval, and experiments reveal the limitations of current methods. Conclusion: LoVR is a challenging benchmark that provides valuable insights for future research. Abstract: Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at https://github.com/TechNomad-ds/LoVR-benchmark

[20] Every Pixel Tells a Story: End-to-End Urdu Newspaper OCR

Samee Arif,Sualeha Farid

Main category: cs.CV

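TL;DR: This paper proposes an end-to-end OCR pipeline for Urdu newspapers that tackles multi-column layouts, low-resolution scans, and diverse fonts using four modules: article segmentation, image super-resolution, column segmentation, and text recognition.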

Details

Motivation: OCR on Urdu newspapers is hampered by complex multi-column layouts, low scan quality, and diverse fonts, and needs an efficient solution. Method: A four-module pipeline: 1) YOLOv11x for article segmentation; 2) the SwinIR model for image super-resolution; 3) YOLOv11x for column segmentation; 4) benchmarking several LLMs (such as Gemini and GPT) for text recognition. Result: Article segmentation precision of 0.963, super-resolution PSNR of 32.71 dB, column segmentation precision of 0.970, and a lowest WER of 0.133 with Gemini-2.5-Pro. Conclusion: The proposed end-to-end pipeline performs well on Urdu newspaper OCR, each module reaches high accuracy, and Gemini-2.5-Pro performs best at text recognition. Abstract: This paper introduces a comprehensive end-to-end pipeline for Optical Character Recognition (OCR) on Urdu newspapers. In our approach, we address the unique challenges of complex multi-column layouts, low-resolution archival scans, and diverse font styles. Our process decomposes the OCR task into four key modules: (1) article segmentation, (2) image super-resolution, (3) column segmentation, and (4) text recognition. For article segmentation, we fine-tune and evaluate YOLOv11x to identify and separate individual articles from cluttered layouts. Our model achieves a precision of 0.963 and mAP@50 of 0.975. For super-resolution, we fine-tune and benchmark the SwinIR model (reaching 32.71 dB PSNR) to enhance the quality of degraded newspaper scans. For column segmentation, we use YOLOv11x to separate columns of text to further enhance performance; this model reaches a precision of 0.970 and mAP@50 of 0.975. In the text recognition stage, we benchmark a range of LLMs from different families, including Gemini, GPT, Llama, and Claude. The lowest WER of 0.133 is achieved by Gemini-2.5-Pro.
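
The text recognition stage is scored by word error rate (WER). The sketch below shows the usual definition, word-level edit distance divided by the number of reference words. It assumes whitespace tokenization, which may differ from the normalization the authors applied to Urdu text, and the name `word_error_rate` and the example strings are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (Levenshtein distance, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# One substituted word and one dropped word against a six-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Under this definition, a WER of 0.133 corresponds to roughly 13 word-level errors per 100 reference words.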

[21] StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning

Huaijie Wang,De Cheng,Guozhang Li,Zhipeng Xu,Lingfeng He,Jie Li,Nannan Wang,Xinbo Gao

Main category: cs.CV

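TL;DR: StPR is an exemplar-free video class-incremental learning framework that disentangles spatiotemporal information and dynamically routes task-specific experts to mitigate catastrophic forgetting.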

Details

Motivation: Video class-incremental learning (VCIL) must cope with the added complexity of spatiotemporal structure; existing methods either rely on exemplars or neglect temporal modeling, and StPR aims to address these limitations. Method: Proposes Frame-Shared Semantics Distillation (FSSD) and a Temporal Decomposition-based Mixture-of-Experts (TD-MoE), which handle spatial semantics and temporal dynamics respectively. Result: Outperforms baselines on UCF101, HMDB51, and Kinetics400 while improving interpretability and efficiency. Conclusion: StPR offers a unified, exemplar-free solution for VCIL that balances knowledge retention and temporal modeling. Abstract: Video Class-Incremental Learning (VCIL) seeks to develop models that continuously learn new action categories over time without forgetting previously acquired knowledge. Unlike traditional Class-Incremental Learning (CIL), VCIL introduces the added complexity of spatiotemporal structures, making it particularly challenging to mitigate catastrophic forgetting while effectively capturing both frame-shared semantics and temporal dynamics. Existing approaches either rely on exemplar rehearsal, raising concerns over memory and privacy, or adapt static image-based methods that neglect temporal modeling. To address these limitations, we propose Spatiotemporal Preservation and Routing (StPR), a unified and exemplar-free VCIL framework that explicitly disentangles and preserves spatiotemporal information. First, we introduce Frame-Shared Semantics Distillation (FSSD), which identifies semantically stable and meaningful channels by jointly considering semantic sensitivity and classification contribution. These important semantic channels are selectively regularized to maintain prior knowledge while allowing for adaptation. Second, we design a Temporal Decomposition-based Mixture-of-Experts (TD-MoE), which dynamically routes task-specific experts based on their temporal dynamics, enabling inference without task ID or stored exemplars. Together, StPR effectively leverages spatial semantics and temporal dynamics, achieving a unified, exemplar-free VCIL framework. Extensive experiments on UCF101, HMDB51, and Kinetics400 show that our method outperforms existing baselines while offering improved interpretability and efficiency in VCIL.
Code is available in the supplementary materials.

[22] Multi-Label Stereo Matching for Transparent Scene Depth Estimation

Zhidan Liu,Chengtang Yao,Jiaxi Zeng,Yuwei Wu,Yunde Jia

Main category: cs.CV

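TL;DR: Proposes a multi-label stereo matching method that simultaneously estimates the depth of transparent objects and the occluded background in transparent scenes.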

Details

Motivation: Conventional methods assume a unimodal distribution along the disparity dimension and treat matching as single-label regression, so they cannot handle multiple depth values at one pixel in transparent scenes. Method: A multi-label regression formulation with a pixel-wise multivariate Gaussian representation whose mean and covariance matrix are predicted iteratively by a GRU. Result: Experiments show that the method greatly improves depth estimation on transparent surfaces while preserving background information. Conclusion: Multi-label regression effectively addresses multi-depth estimation in transparent scenes; the code is open source. Abstract: In this paper, we present a multi-label stereo matching method to simultaneously estimate the depth of the transparent objects and the occluded background in transparent scenes. Unlike previous methods that assume a unimodal distribution along the disparity dimension and formulate the matching as a single-label regression problem, we propose a multi-label regression formulation to estimate multiple depth values at the same pixel in transparent scenes. To resolve the multi-label regression problem, we introduce a pixel-wise multivariate Gaussian representation, where the mean vector encodes multiple depth values at the same pixel, and the covariance matrix determines whether a multi-label representation is necessary for a given pixel. The representation is iteratively predicted within a GRU framework. In each iteration, we first predict the update step for the mean parameters and then use both the update step and the updated mean parameters to estimate the covariance matrix. We also synthesize a dataset containing 10 scenes and 89 objects to validate the performance of transparent scene depth estimation. The experiments show that our method greatly improves the performance on transparent surfaces while preserving the background information for scene reconstruction. Code is available at https://github.com/BFZD233/TranScene.

[23] UHD Image Dehazing via anDehazeFormer with Atmospheric-aware KV Cache

Pu Wang,Pengwen Dai,Chen Wu,Yeying Jin,Dianjie Lu,Guijuan Zhang,Youshan Zhang,Zhuoran Zheng

Main category: cs.CV

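TL;DR: Proposes an efficient vision Transformer framework for ultra-high-definition image dehazing that addresses the slow training and high memory consumption of existing methods.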

Details

Motivation: Existing dehazing methods train slowly and consume a lot of memory on ultra-high-definition images, so an efficient solution is needed. Method: Introduces an adaptive normalization mechanism and an atmospheric scattering-aware KV caching mechanism to improve training speed and memory usage. Result: Training converges 5× faster with lower memory overhead, enabling real-time processing of 50 high-resolution images per second while maintaining dehazing quality. Conclusion: The method significantly improves computational efficiency for 4K/8K image restoration and provides a new interpretable dehazing method. Abstract: In this paper, we propose an efficient visual transformer framework for ultra-high-definition (UHD) image dehazing that addresses the key challenges of slow training speed and high memory consumption for existing methods. Our approach introduces two key innovations: 1) an adaptive normalization mechanism inspired by the nGPT architecture that enables ultra-fast and stable training with a network with a restricted range of parameter expressions; and 2) an atmospheric scattering-aware KV caching mechanism that dynamically optimizes feature preservation based on the physical haze formation model. The proposed architecture improves the training convergence speed by 5× while reducing memory overhead, enabling real-time processing of 50 high-resolution images per second on an RTX4090 GPU. Experimental results show that our approach maintains state-of-the-art dehazing quality while significantly improving computational efficiency for 4K/8K image restoration tasks. Furthermore, we provide a new interpretable method for image dehazing with the help of an integrated gradient attribution map. Our code can be found here: https://anonymous.4open.science/r/anDehazeFormer-632E/README.md.

[24] EGFormer: Towards Efficient and Generalizable Multimodal Semantic Segmentation

Zelin Zhang,Tao Zhang,KediLI,Xu Zheng

Main category: cs.CV

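TL;DR: EGFormer is an efficient multimodal semantic segmentation framework that uses dynamic scoring and modality dropping modules to cut parameters and computation while maintaining performance.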

Details

Motivation: Existing methods mostly focus on accuracy and overlook computational efficiency; EGFormer aims to close this gap. Method: Proposes the ASM dynamic scoring module and the MDM modality dropping module to flexibly integrate multiple modalities and reduce redundancy. Result: Up to 88% fewer parameters and 50% fewer GFLOPs, with the best transfer performance under unsupervised domain adaptation. Conclusion: EGFormer performs well in both efficiency and generalizability and offers a new direction for multimodal semantic segmentation. Abstract: Recent efforts have explored multimodal semantic segmentation using various backbone architectures. However, while most methods aim to improve accuracy, their computational efficiency remains underexplored. To address this, we propose EGFormer, an efficient multimodal semantic segmentation framework that flexibly integrates an arbitrary number of modalities while significantly reducing model parameters and inference time without sacrificing performance. Our framework introduces two novel modules. First, the Any-modal Scoring Module (ASM) assigns importance scores to each modality independently, enabling dynamic ranking based on their feature maps. Second, the Modal Dropping Module (MDM) filters out less informative modalities at each stage, selectively preserving and aggregating only the most valuable features. This design allows the model to leverage useful information from all available modalities while discarding redundancy, thus ensuring high segmentation quality. In addition to efficiency, we evaluate EGFormer on a synthetic-to-real transfer task to demonstrate its generalizability. Extensive experiments show that EGFormer achieves competitive performance with up to 88 percent reduction in parameters and 50 percent fewer GFLOPs. Under unsupervised domain adaptation settings, it further achieves state-of-the-art transfer performance compared to existing methods.

[25] OmniStyle: Filtering High Quality Style Transfer Data at Scale

Ye Wang,Ruiqi Liu,Jiang Lin,Fei Liu,Zili Yi,Yilin Wang,Rui Ma

Main category: cs.CV

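TL;DR: OmniStyle-1M is a large-scale style transfer dataset with one million content-style-stylized image triplets that supports efficient training and precise control. OmniFilter ensures data quality, and the DiT-based OmniStyle framework outperforms existing methods.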

Details

Motivation: Address the lack of large-scale, high-quality datasets for style transfer while supporting precise control and efficient training. Method: Proposes the OmniFilter assessment framework to select high-quality data and designs the DiT-based OmniStyle framework, which supports both instruction-guided and image-guided style transfer. Result: OmniStyle outperforms existing methods in quality and efficiency and generates high-resolution, richly detailed outputs. Conclusion: OmniStyle-1M and its accompanying methods provide an important resource for high-quality style transfer research and advance the field. Abstract: In this paper, we introduce OmniStyle-1M, a large-scale paired style transfer dataset comprising over one million content-style-stylized image triplets across 1,000 diverse style categories, each enhanced with textual descriptions and instruction prompts. We show that OmniStyle-1M can not only enable efficient and scalable style transfer models through supervised training but also facilitate precise control over target stylization. In particular, to ensure the quality of the dataset, we introduce OmniFilter, a comprehensive style transfer quality assessment framework, which filters high-quality triplets based on content preservation, style consistency, and aesthetic appeal. Building upon this foundation, we propose OmniStyle, a framework based on the Diffusion Transformer (DiT) architecture designed for high-quality and efficient style transfer. This framework supports both instruction-guided and image-guided style transfer, generating high-resolution outputs with exceptional detail. Extensive qualitative and quantitative evaluations demonstrate OmniStyle's superior performance compared to existing approaches, highlighting its efficiency and versatility. OmniStyle-1M and its accompanying methodologies provide a significant contribution to advancing high-quality style transfer, offering a valuable resource for the research community.

[26] AppleGrowthVision: A large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards

Laura-Sophia von Hirschhausen,Jannes S. Magnusson,Mykyta Kovalenko,Fredrik Boye,Tanay Rawat,Peter Eisert,Anna Hilsmann,Sebastian Pretzsch,Sebastian Bosse

Main category: cs.CV

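TL;DR: AppleGrowthVision is a large-scale dataset that fills gaps in data diversity and stereo imagery for apple orchard monitoring and improves object detection and growth stage prediction.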

Details

Motivation: Apple orchard monitoring is held back by dataset limitations (such as a lack of diversity and stereo imagery), which affects 3D modeling and tasks such as fruit localization and yield estimation. Method: Presents the AppleGrowthVision dataset, which contains high-resolution stereo images and densely annotated images covering multiple growth stages. Result: The dataset markedly improves YOLOv8 and Faster R-CNN performance, and growth stage prediction exceeds 95% accuracy. Conclusion: AppleGrowthVision connects agricultural science and computer vision and provides a strong tool for precision agriculture. Future work includes improving annotation and 3D reconstruction. Abstract: Deep learning has transformed computer vision for precision agriculture, yet apple orchard monitoring remains limited by dataset constraints: the lack of diverse, realistic datasets and the difficulty of annotating dense, heterogeneous scenes. Existing datasets overlook different growth stages and stereo imagery, both essential for realistic 3D modeling of orchards and tasks like fruit localization, yield estimation, and structural analysis. To address these gaps, we present AppleGrowthVision, a large-scale dataset comprising two subsets. The first includes 9,317 high-resolution stereo images collected from a farm in Brandenburg (Germany), covering six agriculturally validated growth stages over a full growth cycle. The second subset consists of 1,125 densely annotated images from the same farm in Brandenburg and one in Pillnitz (Germany), containing a total of 31,084 apple labels. AppleGrowthVision provides stereo-image data with agriculturally validated growth stages, enabling precise phenological analysis and 3D reconstructions. Extending MinneApple with our data improves YOLOv8 performance by 7.69 % in terms of F1-score, while adding it to MinneApple and MAD boosts Faster R-CNN F1-score by 31.06 %. Additionally, six BBCH stages were predicted with over 95 % accuracy using VGG16, ResNet152, DenseNet201, and MobileNetv2. AppleGrowthVision bridges the gap between agricultural science and computer vision by enabling the development of robust models for fruit detection, growth modeling, and 3D analysis in precision agriculture. Future work includes improving annotation, enhancing 3D reconstruction, and extending multimodal analysis across all growth stages.

[27] Selective Structured State Space for Multispectral-fused Small Target Detection

Qianqian Zhang,WeiJun Wang,Yunxing Liu,Li Zhou,Hao Zhao,Junshe An,Zihan Wang

Main category: cs.CV

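TL;DR: The paper proposes an improved Mamba-based method that strengthens small target detection with ESTD and CARG modules and adds an MEPF module for multispectral fusion, addressing the low recognition accuracy and high computational cost of small targets in high-resolution remote sensing imagery.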

Details

Motivation: Small target detection in high-resolution remote sensing imagery suffers from low recognition accuracy and high computational cost, and existing Transformer and CNN approaches fall short in either computational complexity or performance. Method: Uses Mamba's linear complexity for efficiency, with an ESTD module that strengthens local attention, a CARG module that emphasizes spatial and channel-wise information, and an MEPF module for multispectral fusion. Result: The improved model captures the details and semantic information of small targets more accurately and significantly improves detection performance. Conclusion: By combining local attention, multimodal fusion, and an efficient architecture, the method effectively addresses small target detection. Abstract: Target detection in high-resolution remote sensing imagery faces challenges due to the low recognition accuracy of small targets and high computational costs. The computational complexity of the Transformer architecture increases quadratically with image resolution, while Convolutional Neural Network (CNN) architectures are forced to stack deeper convolutional layers to expand their receptive fields, leading to an explosive growth in computational demands. To address these computational constraints, we leverage Mamba's linear complexity for efficiency. However, Mamba's performance declines for small targets, primarily because small targets occupy a limited area in the image and have limited semantic information. Accurate identification of these small targets necessitates not only Mamba's global attention capabilities but also the precise capture of fine local details. To this end, we enhance Mamba by developing the Enhanced Small Target Detection (ESTD) module and the Convolutional Attention Residual Gate (CARG) module. The ESTD module bolsters local attention to capture fine-grained details, while the CARG module, built upon Mamba, emphasizes spatial and channel-wise information, collectively improving the model's ability to capture distinctive representations of small targets. Additionally, to highlight the semantic representation of small targets, we design a Mask Enhanced Pixel-level Fusion (MEPF) module for multispectral fusion, which enhances target features by effectively fusing visible and infrared multimodal information.

[28] Learning Concept-Driven Logical Rules for Interpretable and Generalizable Medical Image Classification

Yibo Gao,Hangqi Zhou,Zheyao Gao,Bomin Wang,Shangqi Gao,Sihan Wang,Xiahai Zhuang

Main category: cs.CV

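TL;DR: The paper proposes CRL, a new framework that learns Boolean logical rules from binarized visual concepts, addressing concept leakage and providing both local and global interpretability.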

Details

Motivation: The need for decision safety in clinical applications highlights the potential of concept-based methods in medical imaging, but existing methods suffer from concept leakage and focus only on local explanations. Method: The CRL framework uses logical layers to capture concept correlations and extract clinically meaningful rules, providing both local and global interpretability. Result: On two medical image classification tasks, CRL performs on par with existing methods and significantly improves generalization to out-of-distribution data. Conclusion: CRL addresses concept leakage through logical rule learning while improving interpretability and generalizability. Abstract: The pursuit of decision safety in clinical applications highlights the potential of concept-based methods in medical imaging. While these models offer active interpretability, they often suffer from concept leakages, where unintended information within soft concept representations undermines both interpretability and generalizability. Moreover, most concept-based models focus solely on local explanations (instance-level), neglecting the global decision logic (dataset-level). To address these limitations, we propose Concept Rule Learner (CRL), a novel framework to learn Boolean logical rules from binarized visual concepts. CRL employs logical layers to capture concept correlations and extract clinically meaningful rules, thereby providing both local and global interpretability. Experiments on two medical image classification tasks show that CRL achieves competitive performance with existing methods while significantly improving generalizability to out-of-distribution data. The code of our work is available at https://github.com/obiyoag/crl.

[29] Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Hao Feng,Shu Wei,Xiang Fei,Wei Shi,Yingdong Han,Lei Liao,Jinghui Lu,Binghong Wu,Qi Liu,Chunhui Lin,Jingqun Tang,Hao Liu,Can Huang

Main category: cs.CV

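TL;DR: Dolphin is a new multimodal document image parsing model that follows an analyze-then-parse paradigm and uses heterogeneous anchor prompting for efficient parallel parsing, delivering strong performance with a lightweight design.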

Details

Motivation: Address the integration overhead, efficiency bottlenecks, and layout structure degradation of existing methods. Method: Two stages: first generate a sequence of layout elements that serve as anchors, then parse their content in parallel using task-specific prompts. Result: Achieves state-of-the-art performance on multiple benchmarks while remaining efficient. Conclusion: With its lightweight architecture and parallel mechanism, Dolphin significantly improves the efficiency and performance of document image parsing. Abstract: Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin

[30] Scaling Vision Mamba Across Resolutions via Fractal Traversal

Bo Li,Haoke Xiao,Lv Tang

Main category: cs.CV

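TL;DR: FractalMamba++ is a vision backbone based on fractal serialization and state routing that addresses Vision Mamba's problems with 2D-to-1D serialization and resolution adaptability and performs well on high-resolution tasks.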

Details

Motivation: Vision Mamba faces challenges in 2D-to-1D serialization of visual inputs and adapts poorly across resolutions, and existing serialization strategies disrupt spatial continuity. Method: Proposes fractal serialization (Hilbert curves) to preserve spatial locality, Cross-State Routing (CSR) to strengthen global context propagation, and a Positional-Relation Capture (PRC) module to recover local adjacency. Result: FractalMamba++ outperforms existing Mamba backbones on image classification, semantic segmentation, object detection, and change detection, especially at high resolution. Conclusion: Through fractal serialization and state routing, FractalMamba++ significantly improves Vision Mamba's performance on vision tasks, particularly in high-resolution settings. Abstract: Vision Mamba has recently emerged as a promising alternative to Transformer-based architectures, offering linear complexity in sequence length while maintaining strong modeling capacity. However, its adaptation to visual inputs is hindered by challenges in 2D-to-1D patch serialization and weak scalability across input resolutions. Existing serialization strategies such as raster scanning disrupt local spatial continuity and limit the model's ability to generalize across scales. In this paper, we propose FractalMamba++, a robust vision backbone that leverages fractal-based patch serialization via Hilbert curves to preserve spatial locality and enable seamless resolution adaptability. To address long-range dependency fading in high-resolution inputs, we further introduce a Cross-State Routing (CSR) mechanism that enhances global context propagation through selective state reuse. Additionally, we propose a Positional-Relation Capture (PRC) module to recover local adjacency disrupted by curve inflection points. Extensive experiments on image classification, semantic segmentation, object detection, and change detection demonstrate that FractalMamba++ consistently outperforms previous Mamba-based backbones, particularly under high-resolution settings.
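
To make the serialization idea concrete, the sketch below orders the cells of a patch grid along a Hilbert curve using the standard index-to-coordinate conversion. It is an illustration of Hilbert ordering in general, not the FractalMamba++ implementation: it assumes a square grid whose side is a power of two, and the names `hilbert_d2xy` and `hilbert_order` are hypothetical. How the paper handles other resolutions is not described in the abstract.

```python
def hilbert_d2xy(side, d):
    """Map position d along a Hilbert curve to (x, y) on a side x side grid.

    side must be a power of two; d ranges over 0 .. side*side - 1.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the current sub-square before transposing it.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(side):
    """Return the patch coordinates of a side x side grid in Hilbert order."""
    return [hilbert_d2xy(side, d) for d in range(side * side)]


order = hilbert_order(4)
# Every step along the curve moves to a horizontally or vertically adjacent patch.
steps = [abs(ax - bx) + abs(ay - by)
         for (ax, ay), (bx, by) in zip(order, order[1:])]
```

By contrast, raster scanning jumps from the end of one row to the start of the next, so patches that are neighbors in the sequence can be far apart in the image. That is the loss of local spatial continuity the abstract refers to.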

[31] Place Recognition: A Comprehensive Review, Current Challenges and Future Directions

Zhenyu Li,Tianyi Shang,Pengjie Xu,Zhaojun Deng

Main category: cs.CV

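TL;DR: This survey reviews recent advances in place recognition, focusing on CNN-based, Transformer-based, and cross-modal methods, and summarizes datasets, evaluation metrics, and future research directions.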

Details

Motivation: Place recognition is key to vehicle navigation and mapping and is especially important for SLAM and long-term navigation. The survey aims to review the field's latest methods comprehensively. Method: Reviews CNN-based, Transformer-based, and cross-modal approaches and analyzes how they perform in visual descriptor learning, global dependency capture, and multimodal data integration. Result: Summarizes standard datasets and evaluation metrics and provides a code library and experimental results. Conclusion: Identifies current challenges and future directions (such as domain adaptation, real-time performance, and lifelong learning) as a reference for later research. Abstract: Place recognition is a cornerstone of vehicle navigation and mapping, which is pivotal in enabling systems to determine whether a location has been previously visited. This capability is critical for tasks such as loop closure in Simultaneous Localization and Mapping (SLAM) and long-term navigation under varying environmental conditions. In this survey, we comprehensively review recent advancements in place recognition, emphasizing three representative methodological paradigms: Convolutional Neural Network (CNN)-based approaches, Transformer-based frameworks, and cross-modal strategies. We begin by elucidating the significance of place recognition within the broader context of autonomous systems. Subsequently, we trace the evolution of CNN-based methods, highlighting their contributions to robust visual descriptor learning and scalability in large-scale environments. We then examine the emerging class of Transformer-based models, which leverage self-attention mechanisms to capture global dependencies and offer improved generalization across diverse scenes. Furthermore, we discuss cross-modal approaches that integrate heterogeneous data sources such as Lidar, vision, and text description, thereby enhancing resilience to viewpoint, illumination, and seasonal variations. We also summarize standard datasets and evaluation metrics widely adopted in the literature. Finally, we identify current research challenges and outline prospective directions, including domain adaptation, real-time performance, and lifelong learning, to inspire future advancements in this domain.
The unified framework of leading-edge place recognition methods, i.e., code library, and the results of their experimental evaluations are available at https://github.com/CV4RA/SOTA-Place-Recognitioner.

[32] Generalizable Multispectral Land Cover Classification via Frequency-Aware Mixture of Low-Rank Token Experts

Xi Chen,Shen Yan,Juelin Zhu,Chen Chen,Yu Liu,Maojun Zhang

Main category: cs.CV

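TL;DR: Land-MoE is a novel multispectral land cover classification method that uses a frequency-aware mixture of low-rank token experts and frequency-domain modulation to substantially improve robustness to spectral shift.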

Details

Motivation: In multispectral land cover classification, spectral shift caused by differences in sensors and geographic conditions remains an open problem, and existing methods rely on small-scale models with limited performance. Method: Land-MoE uses a mixture of low-rank token experts (MoLTE) and frequency-aware filters (FAF) to fine-tune vision foundation models efficiently. Result: Experiments show that Land-MoE outperforms existing methods by a large margin on cross-sensor and cross-geospatial tasks and reaches state-of-the-art performance on domain generalization semantic segmentation of RGB remote sensing images. Conclusion: With its novel module design, Land-MoE performs strongly on multispectral land cover classification and domain generalization, offering an efficient solution for related fields. Abstract: We introduce Land-MoE, a novel approach for multispectral land cover classification (MLCC). Spectral shift, which emerges from disparities in sensors and geospatial conditions, poses a significant challenge in this domain. Existing methods predominantly rely on domain adaptation and generalization strategies, often utilizing small-scale models that exhibit limited performance. In contrast, Land-MoE addresses these issues by hierarchically inserting a Frequency-aware Mixture of Low-rank Token Experts, to fine-tune Vision Foundation Models (VFMs) in a parameter-efficient manner. Specifically, Land-MoE comprises two key modules: the mixture of low-rank token experts (MoLTE) and frequency-aware filters (FAF). MoLTE leverages rank-differentiated tokens to generate diverse feature adjustments for individual instances within multispectral images. By dynamically combining learnable low-rank token experts of varying ranks, it enhances the robustness against spectral shifts. Meanwhile, FAF conducts frequency-domain modulation on the refined features. This process enables the model to effectively capture frequency band information that is strongly correlated with semantic essence, while simultaneously suppressing frequency noise irrelevant to the task. Comprehensive experiments on MLCC tasks involving cross-sensor and cross-geospatial setups demonstrate that Land-MoE outperforms existing methods by a large margin. Additionally, the proposed approach has also achieved state-of-the-art performance in domain generalization semantic segmentation tasks of RGB remote sensing images.

[33] Unlocking the Power of SAM 2 for Few-Shot Segmentation

Qianxiong Xu,Lanyun Zhu,Xuanyi Liu,Guosheng Lin,Cheng Long,Ziyue Li,Rui Zhao

Main category: cs.CV

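TL;DR: The paper improves few-shot segmentation (FSS) with a pseudo prompt generator and iterative memory refinement, addressing the incompatible matching and segmentation errors of existing approaches.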

Details

Motivation: Few-shot segmentation is prone to overfitting. Existing methods use the knowledge of foundation models (such as SAM) to simplify learning, but object identities in video data differ from those in FSS, which makes the matching step incompatible. Method: Designs a Pseudo Prompt Generator that encodes pseudo query memory so that query features are matched in a compatible way, and further proposes Iterative Memory Refinement and Support-Calibrated Memory Attention to improve the memory contents. Result: Experiments on PASCAL-5^i and COCO-20^i show that 1-shot mIoU improves by 4.2% over the best baseline. Conclusion: Compatible matching and memory refinement significantly improve few-shot segmentation performance. Abstract: Few-Shot Segmentation (FSS) aims to learn class-agnostic segmentation on few classes to segment arbitrary classes, but at the risk of overfitting. To address this, some methods use the well-learned knowledge of foundation models (e.g., SAM) to simplify the learning process. Recently, SAM 2 has extended SAM by supporting video segmentation, whose class-agnostic matching ability is useful to FSS. A simple idea is to encode support foreground (FG) features as memory, with which query FG features are matched and fused. Unfortunately, the FG objects in different frames of SAM 2's video data are always the same identity, while those in FSS are different identities, i.e., the matching step is incompatible. Therefore, we design Pseudo Prompt Generator to encode pseudo query memory, matching with query features in a compatible way. However, the memories can never be as accurate as the real ones, i.e., they are likely to contain incomplete query FG, and some unexpected query background (BG) features, leading to wrong segmentation. Hence, we further design Iterative Memory Refinement to fuse more query FG features into the memory, and devise a Support-Calibrated Memory Attention to suppress the unexpected query BG features in memory. Extensive experiments have been conducted on PASCAL-5^i and COCO-20^i to validate the effectiveness of our design, e.g., the 1-shot mIoU can be 4.2% better than the best baseline.

[34] Unintended Bias in 2D+ Image Segmentation and Its Effect on Attention Asymmetry

Zsófia Molnár,Gergely Szabó,András Horváth

Main category: cs.CV

TL;DR: The study examines bias introduced by pretrained models in biomedical image segmentation and proposes strategies to remove it.

Details

Motivation: On specialized datasets such as biomedical images, pretrained models can introduce bias that leads to inconsistent feature utilization and hurts model performance and the reliability of results. Method: Compares the performance and saliency map distributions of pretrained and randomly initialized models experimentally, and proposes methods to neutralize the bias in pretrained color channel weights. Result: The proposed methods remove the bias while keeping the benefits of pretrained models and improve model explainability. Conclusion: The study offers a practical way to address pretrained weight bias that applies to a range of deep learning tasks. Abstract: Supervised pretrained models have become widely used in deep learning, especially for image segmentation tasks. However, when applied to specialized datasets such as biomedical imaging, pretrained weights often introduce unintended biases. These biases cause models to assign different levels of importance to different slices, leading to inconsistencies in feature utilization, which can be observed as asymmetries in saliency map distributions. This transfer of color distributions from natural images to non-natural datasets can compromise model performance and reduce the reliability of results. In this study, we investigate the effects of these biases and propose strategies to mitigate them. Through a series of experiments, we test both pretrained and randomly initialized models, comparing their performance and saliency map distributions. Our proposed methods, which aim to neutralize the bias introduced by pretrained color channel weights, demonstrate promising results, offering a practical approach to improving model explainability while maintaining the benefits of pretrained models. This publication presents our findings, providing insights into addressing pretrained weight biases across various deep learning tasks.

[35] CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition

Bruno Viti,Elias Karabelas,Martin Holler

Main category: cs.CV

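TL;DR: CONSIGN improves uncertainty quantification in image segmentation by incorporating spatial correlations, producing prediction sets with statistical guarantees.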

Details

Motivation: Conventional confidence scores lack rigorous statistical validity and ignore spatial correlations, which yields conservative and unintuitive uncertainty estimates. Method: Proposes CONSIGN, which combines a decomposition into spatial groupings with conformal prediction to improve uncertainty quantification. Result: Validated across multiple datasets and models, CONSIGN significantly improves performance and the quality of uncertainty estimates. Conclusion: By accounting for spatial structure, CONSIGN provides more reliable and more interpretable uncertainty estimates for image segmentation. Abstract: Most machine learning-based image segmentation models produce pixel-wise confidence scores - typically derived from softmax outputs - that represent the model's predicted probability for each class label at every pixel. While this information can be particularly valuable in high-stakes domains such as medical imaging, these (uncalibrated) scores are heuristic in nature and do not constitute rigorous quantitative uncertainty estimates. Conformal prediction (CP) provides a principled framework for transforming heuristic confidence scores into statistically valid uncertainty estimates. However, applying CP directly to image segmentation ignores the spatial correlations between pixels, a fundamental characteristic of image data. This can result in overly conservative and less interpretable uncertainty estimates. To address this, we propose CONSIGN (Conformal Segmentation Informed by Spatial Groupings via Decomposition), a CP-based method that incorporates spatial correlations to improve uncertainty quantification in image segmentation. Our method generates meaningful prediction sets that come with user-specified, high-probability error guarantees. It is compatible with any pre-trained segmentation model capable of generating multiple sample outputs - such as those using dropout, Bayesian modeling, or ensembles. We evaluate CONSIGN against a standard pixel-wise CP approach across three medical imaging datasets and two COCO dataset subsets, using three different pre-trained segmentation models. Results demonstrate that accounting for spatial structure significantly improves performance across multiple metrics and enhances the quality of uncertainty estimates.
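
For context, the sketch below shows the standard split conformal recipe that a pixel-wise CP baseline builds on. It is not CONSIGN itself, whose spatial decomposition is not detailed in the abstract. The nonconformity score used here (one minus the probability assigned to the true class), the function names, and the toy numbers are all assumptions for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal quantile for a target error rate alpha.

    calibration_scores are nonconformity scores on held-out labeled data,
    here assumed to be 1 - p(true class).
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return float("inf")  # too few calibration points: include every class
    return sorted(calibration_scores)[rank - 1]


def prediction_set(class_probs, threshold):
    """Keep every class whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(class_probs) if 1.0 - p <= threshold]


# Hypothetical calibration scores and softmax output for one pixel.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
q_hat = conformal_threshold(cal_scores, alpha=0.25)
pixel_set = prediction_set([0.50, 0.35, 0.15], q_hat)
```

Applied independently at every pixel, this gives a set of candidate labels per pixel with marginal coverage of at least 1 - alpha under exchangeability, but it treats neighboring pixels as unrelated. That is the limitation the abstract says CONSIGN targets.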

[36] Intra-class Patch Swap for Self-Distillation

Hongjun Choi,Eun Som Jeon,Ankita Shukla,Pavan Turaga

Main category: cs.CV

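TL;DR: Proposes a new framework based on teacher-free distillation that achieves single-network self-distillation through intra-class patch swap augmentation, needs no extra components or architectural changes, and outperforms existing methods.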

Details

Motivation: Conventional knowledge distillation depends on a pre-trained teacher network, which raises memory, training-cost, and teacher-selection issues; existing self-distillation methods often require architectural modifications or complex pipelines, limiting their generality and efficiency. Method: Uses intra-class patch swap augmentation to simulate a teacher-student dynamic and aligns predictive distributions through instance-to-instance distillation, requiring only a single augmentation function. Result: Outperforms existing self-distillation and conventional teacher-based distillation methods on image classification, semantic segmentation, and object detection. Conclusion: The success of self-distillation may hinge on augmentation design; the method is simple, general, and efficient. Abstract: Knowledge distillation (KD) is a valuable technique for compressing large deep learning models into smaller, edge-suitable networks. However, conventional KD frameworks rely on pre-trained high-capacity teacher networks, which introduce significant challenges such as increased memory/storage requirements, additional training costs, and ambiguity in selecting an appropriate teacher for a given student model. Although teacher-free distillation (self-distillation) has emerged as a promising alternative, many existing approaches still rely on architectural modifications or complex training procedures, which limit their generality and efficiency. To address these limitations, we propose a novel framework based on teacher-free distillation that operates using a single student network without any auxiliary components, architectural modifications, or additional learnable parameters. Our approach is built on a simple yet highly effective augmentation, called intra-class patch swap augmentation. This augmentation simulates a teacher-student dynamic within a single model by generating pairs of intra-class samples with varying confidence levels, and then applying instance-to-instance distillation to align their predictive distributions. Our method is conceptually simple, model-agnostic, and easy to implement, requiring only a single augmentation function. Extensive experiments across image classification, semantic segmentation, and object detection show that our method consistently outperforms both existing self-distillation baselines and conventional teacher-based KD approaches. These results suggest that the success of self-distillation could hinge on the design of the augmentation itself. Our code is available at https://github.com/hchoi71/Intra-class-Patch-Swap.
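The augmentation described in the abstract can be illustrated with a small sketch that swaps aligned patches between two images of the same class. This is an assumption-laden illustration rather than the authors' implementation: the grid layout, the `swap_prob` parameter, and the list-of-lists image format are hypothetical, and the distillation loss that aligns the two predictions is not shown.

```python
import copy
import random

def intra_class_patch_swap(img_a, img_b, patch, swap_prob, rng):
    """Swap aligned patch-by-patch blocks between two same-class images.

    Images are 2D lists of equal size; each grid cell is swapped
    independently with probability swap_prob (a hypothetical knob).
    """
    h, w = len(img_a), len(img_a[0])
    out_a, out_b = copy.deepcopy(img_a), copy.deepcopy(img_b)
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            if rng.random() < swap_prob:
                for r in range(top, min(top + patch, h)):
                    for c in range(left, min(left + patch, w)):
                        out_a[r][c], out_b[r][c] = img_b[r][c], img_a[r][c]
    return out_a, out_b

# Two toy 4x4 "images" assumed to share a class label.
image_a = [[0] * 4 for _ in range(4)]
image_b = [[1] * 4 for _ in range(4)]
mixed_a, mixed_b = intra_class_patch_swap(image_a, image_b, patch=2,
                                          swap_prob=0.5, rng=random.Random(0))
```

Since both inputs carry the same label, the swapped pair keeps that label, and the two outputs can then be fed to the same network to obtain the pair of predictions that the abstract says are aligned by instance-to-instance distillation.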

[37] Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

Ruihuang Li,Caijin Zhou,Shoujian Zheng,Jianxiang Lu,Jiabin Huang,Comi Chen,Junshu Tang,Guangzheng Xu,Jiale Tao,Hongmei Wang,Donghao Li,Wenqing Yu,Senbo Wang,Zhimin Li,Yetshuan Shi,Haoyu Yang,Yukun Wang,Wenxun Dai,Jiaqi Li,Linqing Wang,Qixun Wang,Zhiyong Xu,Yingfang Zhang,Jiangfeng Xiong,Weijie Kong,Chao Zhang,Hongxin Zhang,Qiaoling Zheng,Weiting Guo,Xinchi Deng,Yixuan Li,Renjia Wei,Yulin Jian,Duojun Huang,Xuhua Ren,Sihuan Lin,Yifu Sun,Yuan Zhou,Joey Wang,Qin Lin,Jingmiao Yu,Jihong Zhang,Caesar Zhong,Di Wang,Yuhong Liu,Linus,Jie Jiang,Longhuang Wu,Shuai Shao,Qinglin Lu

Main category: cs.CV

TL;DR: The Hunyuan-Game project uses generative AI, through image and video generation models, to provide high-quality content generation for game development.

Details

Motivation: Despite progress in generative models, synthesizing high-quality game content such as images and videos remains challenging. Hunyuan-Game aims to improve the player experience and raise designer efficiency. Method: The project has two branches, image generation and video generation. Image generation builds on a dataset of billions of game images and yields four customized models; video generation builds on a dataset of millions of game and anime videos and yields five core algorithmic models. Result: The generated images and videos show high aesthetic quality and deeply integrate domain knowledge from games and anime. Conclusion: Hunyuan-Game offers a systematic solution for intelligent game production and advances game content generation. Abstract: Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simultaneously aligns with player preferences and significantly boosts designer efficiency, we present Hunyuan-Game, an innovative project designed to revolutionize intelligent game production. Hunyuan-Game encompasses two primary branches: image generation and video generation. The image generation component is built upon a vast dataset comprising billions of game images, leading to the development of a group of customized image generation models tailored for game scenarios: (1) General Text-to-Image Generation. (2) Game Visual Effects Generation, involving text-to-effect and reference image-based game visual effect generation. (3) Transparent Image Generation for characters, scenes, and game visual effects. (4) Game Character Generation based on sketches, black-and-white images, and white models. The video generation component is built upon a comprehensive dataset of millions of game and anime videos, leading to the development of five core algorithmic models, each targeting critical pain points in game development and having robust adaptation to diverse game video scenarios: (1) Image-to-Video Generation. (2) 360 A/T Pose Avatar Video Synthesis. (3) Dynamic Illustration Generation. (4) Generative Video Super-Resolution. (5) Interactive Game Video Generation. These image and video generation models not only exhibit high-level aesthetic expression but also deeply integrate domain-specific knowledge, establishing a systematic understanding of diverse game and anime art styles.

[38] ReactDiff: Latent Diffusion for Facial Reaction Generation

Jiaming Li,Sheng Wang,Xin Wang,Yitao Zhu,Honglin Xiong,Zixu Zhuang,Qian Wang

Main category: cs.CV

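TL;DR: The ReactDiff framework combines a multi-modal Transformer with conditional diffusion in the latent space, significantly improving the diversity and correlation of generated facial reactions.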

Details

Motivation: Address the shortcomings of existing methods in capturing the correlation between video and audio and in balancing the diversity, realism, and appropriateness of reactions. Method: Combines a multi-modal Transformer with conditional diffusion in the latent space, using intra- and inter-class attention for fine-grained multi-modal interaction. Result: Experiments show that ReactDiff significantly outperforms existing methods in correlation (0.26) and diversity (0.094) while maintaining realism. Conclusion: ReactDiff provides an efficient and diverse solution for facial reaction generation, and the code is open source. Abstract: Given the audio-visual clip of the speaker, facial reaction generation aims to predict the listener's facial reactions. The challenge lies in capturing the relevance between video and audio while balancing appropriateness, realism, and diversity. While prior works have mostly focused on uni-modal inputs or simplified reaction mappings, recent approaches such as PerFRDiff have explored multi-modal inputs and the one-to-many nature of appropriate reaction mappings. In this work, we propose the Facial Reaction Diffusion (ReactDiff) framework that uniquely integrates a Multi-Modality Transformer with conditional diffusion in the latent space for enhanced reaction generation. Unlike existing methods, ReactDiff leverages intra- and inter-class attention for fine-grained multi-modal interaction, while the latent diffusion process between the encoder and decoder enables diverse yet contextually appropriate outputs. Experimental results demonstrate that ReactDiff significantly outperforms existing approaches, achieving a facial reaction correlation of 0.26 and diversity score of 0.094 while maintaining competitive realism. The code is open-sourced at https://github.com/Hunan-Tiger/ReactDiff.

[39] Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search

Songhao Wu,Quan Tu,Hong Liu,Jia Xu,Zhongyi Liu,Guannan Zhang,Ran Wang,Xiuying Chen,Rui Yan

Main category: cs.CV

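TL;DR: The paper proposes Symbolic Graph Ranker (SGR), which combines text and graph-structure information and uses large language models (LLMs) to improve session search performance.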

Details

Motivation: Current session search methods mostly focus on either sequential modeling or graph structure, without fully combining the strengths of both, and they neglect word-level semantic modeling. Method: SGR converts session graphs into text with symbolic grammar rules and designs self-supervised tasks (such as link prediction and node content generation) to strengthen LLMs' understanding of graph structure. Result: Experiments on the AOL and Tiangong-ST datasets demonstrate SGR's superiority. Conclusion: SGR bridges traditional search strategies and modern LLMs, offering a novel and effective approach. Abstract: Session search involves a series of interactive queries and actions to fulfill a user's complex information need. Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions. While some approaches focus on capturing structural information, they use a generalized representation for documents, neglecting the word-level semantic modeling. In this paper, we propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches by leveraging the power of recent Large Language Models (LLMs). Concretely, we first introduce a set of symbolic grammar rules to convert the session graph into text. This allows integrating session history, interaction process, and task instruction seamlessly as inputs for the LLM. Moreover, given the natural discrepancy between LLMs pre-trained on textual corpora and the symbolic language we produce using our graph-to-text grammar, our objective is to enhance LLMs' ability to capture graph structures within a textual format. To achieve this, we introduce a set of self-supervised symbolic learning tasks, including link prediction, node content generation, and generative contrastive learning, to enable LLMs to capture the topological information from coarse-grained to fine-grained. Experiment results and comprehensive analysis on two benchmark datasets, AOL and Tiangong-ST, confirm the superiority of our approach. Our paradigm also offers a novel and effective methodology that bridges the gap between traditional search strategies and modern LLMs.

Junjie Li,Jiawei Wang,Miyu Li,Yu Liu,Yumei Wang,Haitao Xu

Main category: cs.CV

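TL;DR: M3Depth is a depth estimation model designed for Mars rovers that improves depth estimation accuracy in sparse-texture environments using a wavelet-transform convolutional kernel and a consistency loss.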

Details

Motivation: Martian terrain has sparse texture and lacks geometric constraints, so conventional learning-based methods degrade and a tailored solution is needed. Method: Combines a wavelet-transform convolutional kernel that captures low-frequency features with a consistency loss and a pixel-wise refinement module. Result: Depth estimation accuracy improves by 16% on synthetic Mars datasets, and the model is applicable to real Martian scenes. Conclusion: M3Depth offers an effective depth estimation solution for future Mars exploration missions. Abstract: Depth estimation has great potential in obstacle avoidance and navigation for future Mars exploration missions. Compared to traditional stereo matching, learning-based stereo depth estimation provides a data-driven approach to infer dense and precise depth maps from stereo image pairs. However, these methods always suffer performance degradation in environments with sparse textures and a lack of geometric constraints, such as the unstructured terrain of Mars. To address these challenges, we propose M3Depth, a depth estimation model tailored for Mars rovers. Considering the sparse and smooth texture of Martian terrain, which is primarily composed of low-frequency features, our model incorporates a convolutional kernel based on wavelet transform that effectively captures low-frequency response and expands the receptive field. Additionally, we introduce a consistency loss that explicitly models the complementary relationship between depth map and surface normal map, utilizing the surface normal as a geometric constraint to enhance the accuracy of depth estimation. Besides, a pixel-wise refinement module with a mutual boosting mechanism is designed to iteratively refine both depth and surface normal predictions. Experimental results on synthetic Mars datasets with depth annotations show that M3Depth achieves a significant 16% improvement in depth estimation accuracy compared to other state-of-the-art methods in depth estimation. Furthermore, the model demonstrates strong applicability in real-world Martian scenarios, offering a promising solution for future Mars exploration missions.

[41] LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer

Changgu Chen,Xiaoyan Yang,Junwei Shu,Changbo Wang,Yang Li

Main category: cs.CV

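TL;DR: The paper proposes the LMP framework, which achieves fine-grained motion control in zero-shot video generation through foreground-background disentanglement, reweighted motion transfer, and appearance separation modules.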

Details

Motivation: Current DiT models lack fine-grained control over video content, particularly for describing complex motion and for controlling motion in image-to-video generation. Method: Proposes the LMP framework, comprising a foreground-background disentangle module, a reweighted motion transfer module, and an appearance separation module, so that the generated video can reference a user-provided motion video. Result: Experiments show that LMP reaches state-of-the-art generation quality, prompt-video consistency, and control capability. Conclusion: LMP effectively addresses the challenge of motion control in video generation and offers a new approach for text-to-video and image-to-video generation. Abstract: In recent years, large-scale pre-trained diffusion transformer models have made significant progress in video generation. While current DiT models can produce high-definition, high-frame-rate, and highly diverse videos, there is a lack of fine-grained control over the video content. Controlling the motion of subjects in videos using only prompts is challenging, especially when it comes to describing complex movements. Further, existing methods fail to control the motion in image-to-video generation, as the subject in the reference image often differs from the subject in the reference video in terms of initial position, size, and shape. To address this, we propose the Leveraging Motion Prior (LMP) framework for zero-shot video generation. Our framework harnesses the powerful generative capabilities of pre-trained diffusion transformers to enable motion in the generated videos to reference user-provided motion videos in both text-to-video and image-to-video generation. To this end, we first introduce a foreground-background disentangle module to distinguish between moving subjects and backgrounds in the reference video, preventing interference in the target video generation. A reweighted motion transfer module is designed to allow the target video to reference the motion from the reference video. To avoid interference from the subject in the reference video, we propose an appearance separation module to suppress the appearance of the reference subject in the target video. We annotate the DAVIS dataset with detailed prompts for our experiments and design evaluation metrics to validate the effectiveness of our method. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in generation quality, prompt-video consistency, and control capability. Our homepage is available at https://vpx-ecnu.github.io/LMP-Website/

[42] Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method

Xinshen Zhang,Zhen Ye,Xu Zheng

Main category: cs.CV

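TL;DR: The paper introduces the OmniVQA dataset and benchmark, evaluates multi-modal large language models (MLLMs) on omnidirectional visual question answering, finds significant limitations, and proposes 360-R1, a reinforcement-learning-based method, to improve performance.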

Details

Motivation: Existing MLLMs fall short in understanding panoramic images, and there is no dedicated dataset or benchmark for 360-degree visual question answering. Method: Introduces the OmniVQA dataset and benchmark and proposes 360-R1, based on Qwen2.5-VL-Instruct, which modifies GRPO with three new reward functions. Result: Experiments show that 360-R1 improves performance in omnidirectional space by 6%. Conclusion: Current MLLMs are limited in omnidirectional visual understanding, which calls for dedicated architectural or training innovations for 360-degree imagery. Abstract: Omnidirectional images (ODIs), with their 360° field of view, provide unparalleled spatial awareness for immersive applications like augmented reality and embodied AI. However, the capability of existing multi-modal large language models (MLLMs) to comprehend and reason about such panoramic scenes remains underexplored. This paper addresses this gap by introducing OmniVQA, the first dataset, and conducting the first benchmark for omnidirectional visual question answering. Our evaluation of state-of-the-art MLLMs reveals significant limitations in handling omnidirectional visual question answering, highlighting persistent challenges in object localization, feature extraction, and hallucination suppression within panoramic contexts. These results underscore the disconnect between current MLLM capabilities and the demands of omnidirectional visual understanding, which calls for dedicated architectural or training innovations tailored to 360° imagery. Building on the OmniVQA dataset and benchmark, we further introduce a rule-based reinforcement learning method, 360-R1, based on Qwen2.5-VL-Instruct. Concretely, we modify the group relative policy optimization (GRPO) by proposing three novel reward functions: (1) reasoning process similarity reward, (2) answer semantic accuracy reward, and (3) structured format compliance reward. Extensive experiments on our OmniVQA demonstrate the superiority of our proposed method in omnidirectional space (+6% improvement).
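The group-relative step that gives GRPO its name can be sketched as follows. This is a minimal sketch under stated assumptions: how 360-R1 combines its three rewards is not given in the abstract, so the equal weights in `total_reward`, the reward values, and the use of the population standard deviation are all hypothetical, and the policy update itself is not shown.

```python
import statistics

def total_reward(similarity, accuracy, format_ok, weights=(1.0, 1.0, 1.0)):
    """Combine the three reward terms; the equal weights are an assumption."""
    w_sim, w_acc, w_fmt = weights
    return w_sim * similarity + w_acc * accuracy + w_fmt * (1.0 if format_ok else 0.0)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical sampled answers to the same question.
rewards = [
    total_reward(0.5, 1.0, True),
    total_reward(0.5, 0.0, True),
    total_reward(0.0, 1.0, False),
    total_reward(0.0, 0.0, False),
]
advantages = group_relative_advantages(rewards)
```

Answers that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to provide a baseline.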

[43] Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment

Yang Hu,Runchen Wang,Stephen Chong Zhao,Xuhui Zhan,Do Hun Kim,Mark Wallace,David A. Tovar

Main category: cs.CV

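TL;DR: Perceptual-Initialization (PI) incorporates human perceptual structure at the initialization stage and significantly improves zero-shot performance without task-specific fine-tuning.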

Details

Motivation: Challenge conventional practice by exploring the effect of embedding human perceptual structure during early representation learning, with the aim of improving the generalization of vision-language aligned systems. Method: Initializes a CLIP vision encoder with human-derived perceptual triplet embeddings from the NIGHTS dataset, followed by self-supervised learning on YFCC15M. Result: Significant gains on 29 zero-shot classification and 2 retrieval benchmarks; on ImageNet-1K, gains emerge after only about 15 epochs of pretraining. Conclusion: Incorporating human perceptual structure early provides a stronger foundation for general-purpose vision-language intelligence and challenges the conventional practice of using such data only for fine-tuning. Abstract: We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning that incorporates human perceptual structure during the initialization phase rather than as a downstream fine-tuning step. By integrating human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder, followed by self-supervised learning on YFCC15M, our approach demonstrates significant zero-shot performance improvements, without any task-specific fine-tuning, across 29 zero-shot classification and 2 retrieval benchmarks. On ImageNet-1K, zero-shot gains emerge after approximately 15 epochs of pretraining. Benefits are observed across datasets of various scales, with improvements manifesting at different stages of the pretraining process depending on dataset characteristics. Our approach consistently enhances zero-shot top-1 accuracy, top-5 accuracy, and retrieval recall (e.g., R@1, R@5) across these diverse evaluation tasks, without requiring any adaptation to target domains. These findings challenge the conventional wisdom of using human-perceptual data primarily for fine-tuning and demonstrate that embedding human perceptual structure during early representation learning yields more capable and vision-language aligned systems that generalize immediately to unseen tasks. Our work shows that "beginning with you", starting with human perception, provides a stronger foundation for general-purpose vision-language intelligence.

[44] Flexible-weighted Chamfer Distance: Enhanced Objective Function for Point Cloud Completion

Jie Li,Shengwei Tian,Long Yu,Xin Ning

Main category: cs.CV

TL;DR: The paper proposes a Flexible-Weighted Chamfer Distance (FCD) to better balance global distribution and local performance in point cloud generation.

Details

Motivation: Using Chamfer Distance (CD) with fixed weights directly as the objective can yield a poor global distribution even when overall performance appears good. Method: Proposes FCD, which assigns a higher weight to the global distribution component of CD and uses a flexible weighting strategy to adjust the balance between components. Result: Validated on two state-of-the-art networks, FCD performs better on CD, EMD, DCD, F-Score, and human evaluation. Conclusion: FCD effectively improves the global distribution quality of generated point clouds while maintaining overall performance. Abstract: Chamfer Distance (CD) comprises two components that can evaluate the global distribution and local performance of generated point clouds, making it widely utilized as a similarity measure between generated and target point clouds in point cloud completion tasks. Additionally, CD's computational efficiency has led to its frequent application as an objective function for guiding point cloud generation. However, using CD directly as an objective function with fixed equal weights for its two components can often result in seemingly high overall performance (i.e., low CD score), while failing to achieve a good global distribution. This is typically reflected in high Earth Mover's Distance (EMD) and Decomposed Chamfer Distance (DCD) scores, alongside poor human assessments. To address this issue, we propose a Flexible-Weighted Chamfer Distance (FCD) to guide point cloud generation. FCD assigns a higher weight to the global distribution component of CD and incorporates a flexible weighting strategy to adjust the balance between the two components, aiming to improve global distribution while maintaining robust overall performance. Experimental results on two state-of-the-art networks demonstrate that our method achieves superior results across multiple evaluation metrics, including CD, EMD, DCD, and F-Score, as well as in human evaluations.
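The two components of CD and the effect of weighting them unequally can be illustrated with a short sketch. It is a sketch under assumptions, not the paper's FCD: the abstract does not say which directional term it treats as the global distribution component or how the weights are scheduled, so both weights are exposed here as plain hypothetical parameters and squared Euclidean distance is assumed.

```python
def mean_nearest_sq_dist(src, dst):
    """Average squared distance from each point in src to its nearest point in dst."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst)
        for p in src
    ) / len(src)

def weighted_chamfer(generated, target, w_gen_to_tgt=1.0, w_tgt_to_gen=1.0):
    """Chamfer Distance with a separate weight on each directional term."""
    return (w_gen_to_tgt * mean_nearest_sq_dist(generated, target)
            + w_tgt_to_gen * mean_nearest_sq_dist(target, generated))

# Toy clouds: the generated cloud covers only one of the two target points.
generated = [(0.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

equal_cd = weighted_chamfer(generated, target)
reweighted_cd = weighted_chamfer(generated, target, w_tgt_to_gen=2.0)
```

In this toy case the generated-to-target term is zero while the target-to-generated term is not, so raising the second weight penalizes the uncovered target point more heavily; with equal weights that gap contributes less to the total.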

[45] VoQA: Visual-only Question Answering

Luyang Jiang,Jianing An,Jie Luo,Wenjun Wu,Lei Huang

Main category: cs.CV

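TL;DR: Proposes the VoQA task, which requires models to answer questions embedded in images using visual input only, and introduces GRT-SFT to improve performance.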

Details

Motivation: Existing large vision-language models perform poorly on visual-only question answering and need improvement to strengthen visual understanding. Method: Adopts the GRT-SFT strategy, a structured fine-tuning approach that guides the model to reason step by step. Result: GRT-SFT significantly improves model performance on VoQA. Conclusion: The work strengthens models' visual-only understanding in complex multimodal scenarios. Abstract: We propose Visual-only Question Answering (VoQA), a novel multimodal task in which questions are visually embedded within images, without any accompanying textual input. This requires models to locate, recognize, and reason over visually embedded textual questions, posing challenges for existing large vision-language models (LVLMs), which show notable performance drops even with carefully designed prompts. To bridge this gap, we introduce Guided Response Triggering Supervised Fine-tuning (GRT-SFT), a structured fine-tuning strategy that guides the model to perform step-by-step reasoning purely based on visual input, significantly improving model performance. Our work enhances models' capacity for human-like visual understanding in complex multimodal scenarios, where information, including language, is perceived visually.

[46] UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

Sule Bai,Mingxing Li,Yong Liu,Jing Tang,Haoji Zhang,Lei Sun,Xiangxiang Chu,Yansong Tang

Main category: cs.CV

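TL;DR: UniVG-R1 is a reinforcement-learning-based multimodal large language model for visual grounding under complex instructions and multi-image scenarios, improving performance with a reasoning-chain dataset and a difficulty-aware strategy.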

Details

Motivation: Traditional visual grounding methods struggle with complex instructions and multi-image scenarios and lack cross-modal reasoning ability. Method: Builds a reasoning-chain dataset for supervised fine-tuning, combined with rule-based reinforcement learning and a difficulty-aware weight adjustment strategy. Result: Performance on MIG-Bench improves by 9.1%, and zero-shot performance improves by 23.4% on average. Conclusion: UniVG-R1 performs well on complex visual grounding tasks and generalizes strongly. Abstract: Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios that involve implicit and complex instructions, particularly in conjunction with multiple images, poses significant challenges, which is mainly due to the lack of advanced reasoning ability across diverse multi-modal contexts. In this work, we aim to address the more practical universal grounding task, and propose UniVG-R1, a reasoning guided multimodal large language model (MLLM) for universal visual grounding, which enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset, annotated with detailed reasoning chains, to guide the model towards correct reasoning paths via supervised fine-tuning. Subsequently, we perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities. In addition, we identify a difficulty bias arising from the prevalence of easy samples as RL training progresses, and we propose a difficulty-aware weight adjustment strategy to further strengthen the performance. Experimental results demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous method. Furthermore, our model exhibits strong generalizability, achieving an average improvement of 23.4% in zero-shot performance across four image and video reasoning grounding benchmarks. The project page can be accessed at https://amap-ml.github.io/UniVG-R1-page/.

[47] Decoupling Classifier for Boosting Few-shot Object Detection and Instance Segmentation

Bin-Bin Gao,Xiaochen Chen,Zhongyi Huang,Congchong Nie,Jun Liu,Jinxiang Lai,Guannan Jiang,Xi Wang,Chengjie Wang

Main category: cs.CV

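TL;DR: This paper proposes a simple but effective method that decouples the classifier to mitigate biased classification in few-shot object detection and instance segmentation.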

Details

Motivation: In few-shot object detection and instance segmentation, missing labels bias the model's classification, and existing methods struggle to address this. Method: Decouples the standard classifier into two independent heads that separately handle clear positive samples and the noisy negative samples caused by missing labels. Result: On the PASCAL VOC and MS-COCO benchmarks, the model significantly outperforms its baseline and the state of the art at no extra computational cost. Conclusion: Decoupling the classifier effectively resolves the biased classification problem and improves few-shot learning performance. Abstract: This paper focuses on few-shot object detection (FSOD) and instance segmentation (FSIS), which require a model to quickly adapt to novel classes with a few labeled instances. The existing methods severely suffer from biased classification because of the missing label issue, which naturally exists in an instance-level few-shot scenario and is first formally proposed by us. Our analysis suggests that the standard classification head of most FSOD or FSIS models needs to be decoupled to mitigate the biased classification. Therefore, we propose an embarrassingly simple but effective method that decouples the standard classifier into two heads. Then, these two individual heads are capable of independently addressing clear positive samples and noisy negative samples which are caused by the missing label. In this way, the model can effectively learn novel classes while mitigating the effects of noisy negative samples. Without bells and whistles, our model consistently outperforms its baseline and the state of the art by a large margin on PASCAL VOC and MS-COCO benchmarks for FSOD and FSIS tasks, without any additional computation cost or parameters. The code is available at https://csgaobb.github.io/Projects/DCFS.

[48] Visual Agentic Reinforcement Fine-Tuning

Ziyu Liu,Yuhang Zang,Yushan Zou,Zijian Liang,Xiaoyi Dong,Yuhang Cao,Haodong Duan,Dahua Lin,Jiaqi Wang

Main category: cs.CV

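TL;DR: Visual-ARFT uses reinforcement fine-tuning to strengthen the multimodal agentic abilities of large vision-language models, significantly outperforming baselines and GPT-4o on search and coding tasks and showing strong generalization.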

Details

Motivation: The open-source community has done little research on multimodal agentic abilities, especially thinking with images; this work aims to fill that gap. Method: Applies Visual-ARFT (Visual Agentic Reinforcement Fine-Tuning), which lets models browse the web in real time and write code to process images. Result: Gains of +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, surpassing GPT-4o, plus significant gains on multi-hop QA tasks. Conclusion: Visual-ARFT offers an effective path toward robust and generalizable multimodal agents. Abstract: A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, remain less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves +29.3% F1 / +25.9% EM gains on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.

[49] Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization

Yuanyuan Chang,Yinghua Yao,Tao Qin,Mengmeng Wang,Ivor Tsang,Guang Dai

Main category: cs.CV

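TL;DR: Proposes steering text-to-image model edits by optimizing semantic embeddings under attribute-classifier guidance, avoiding manually written text prompts and achieving efficient, accurate edits.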

Details

Motivation: Existing methods depend on manually written text prompts, which is inefficient, can introduce irrelevant details, and limits editing performance. Method: Optimizes semantic embeddings with attribute classifiers, requiring neither text prompts nor diffusion-model training, to achieve precise edits. Result: Experiments show that the method achieves high disentanglement and strong generalization across data domains. Conclusion: The method offers an efficient, training-free approach to editing with text-to-image models. Abstract: Text-to-image diffusion models have emerged as powerful tools for high-quality image generation and editing. Many existing approaches rely on text prompts as editing guidance. However, these methods are constrained by the need for manual prompt crafting, which can be time-consuming, introduce irrelevant details, and significantly limit editing performance. In this work, we propose optimizing semantic embeddings guided by attribute classifiers to steer text-to-image models toward desired edits, without relying on text prompts or requiring any training or fine-tuning of the diffusion model. We utilize classifiers to learn precise semantic embeddings at the dataset level. The learned embeddings are theoretically justified as the optimal representation of attribute semantics, enabling disentangled and accurate edits. Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data.

[50] Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models

Jianfei Zhao,Feng Zhang,Xin Sun,Chong Feng

Main category: cs.CV

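TL;DR: The paper proposes a method that reduces hallucinations in large vision-language models (LVLMs) by aligning the attention distribution with the information flow.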

Details

Motivation: The study finds that LVLMs' attention distribution does not match their information flow, which weakens visual understanding and causes hallucinations. Method: Identifies attention heads that focus on core semantic representations and adjusts the attention distribution through a two-stage optimization paradigm. Result: Validated on five LVLMs, the method significantly reduces hallucinations and reveals a trade-off between fewer hallucinations and richer details. Conclusion: The method effectively improves visual understanding and allows the model's conservativeness to be tuned manually to suit diverse needs. Abstract: Due to the unidirectional masking mechanism, Decoder-Only models propagate information from left to right. LVLMs (Large Vision-Language Models) follow the same architecture, with visual information gradually integrated into semantic representations during forward propagation. Through systematic analysis, we observe that over 80% of the visual information is absorbed into the semantic representations. However, the model's attention still predominantly focuses on the visual representations. This misalignment between the attention distribution and the actual information flow undermines the model's visual understanding ability and contributes to hallucinations. To address this issue, we enhance the model's visual understanding by leveraging the core information embedded in semantic representations. Specifically, we identify attention heads that focus on core semantic representations based on their attention distributions. Then, through a two-stage optimization paradigm, we propagate the advantages of these attention heads across the entire model, aligning the attention distribution with the actual information flow. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations. Further experiments reveal a trade-off between reduced hallucinations and richer details. Notably, our method allows for manual adjustment of the model's conservativeness, enabling flexible control to meet diverse real-world requirements. Code will be released once accepted.

[51] Speculative Decoding Reimagined for Multimodal Large Language Models

Luxi Lin,Zhihang Lin,Zhanpeng Zeng,Rongrong Ji

Main category: cs.CV

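TL;DR: Multimodal Speculative Decoding (MSD) accelerates inference for multimodal large language models (MLLMs) by processing text and visual tokens separately and using a two-stage training strategy, significantly increasing inference speed.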

Details

Motivation: Existing speculative decoding methods do not achieve the same speedup on MLLMs as on unimodal large language models (LLMs), so speculative decoding needs to be redesigned for MLLMs. Method: MSD processes text and visual tokens separately and uses a two-stage training strategy: stage one trains text ability, and stage two gradually introduces multimodal data to strengthen visual ability. Result: MSD achieves inference speedups of up to 2.29× on LLaVA-1.5-7B and up to 2.46× on LLaVA-1.5-13B. Conclusion: MSD is an efficient speculative decoding method for MLLMs that significantly increases inference speed while preserving accuracy. Abstract: This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD uses a two-stage training strategy: In stage one, the draft model is trained on text-only instruction-tuning datasets to improve its language modeling ability. In stage two, MSD gradually introduces multimodal data to enhance the visual perception capability of the draft model. Experiments show that MSD boosts inference speed by up to 2.29× for LLaVA-1.5-7B and up to 2.46× for LLaVA-1.5-13B on multimodal benchmarks, demonstrating its effectiveness. Our code is available at https://github.com/Lyn-Lucy/MSD.
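The draft-then-verify loop that speculative decoding relies on can be sketched in its generic greedy form. This is an illustration of the general technique only, not MSD: the toy `draft_next` and `target_next` functions stand in for real models and are hypothetical, and the separate handling of text and visual tokens is not shown.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy speculative decoding step; returns the newly accepted tokens."""
    # Draft k tokens autoregressively with the cheap model.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    # Verify against the target model. A real system scores all k positions
    # in a single batched forward pass; this loop only mirrors the logic.
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when every draft matches
    return accepted

def target_next(ctx):
    """Toy target model: count upward modulo 10."""
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

new_tokens = speculative_step([1], draft_next, target_next, k=4)
```

Every returned token equals what greedy decoding with the target model alone would produce, which is why the speedup comes without an accuracy cost; the gain depends on how often the draft model's guesses are accepted.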

[52] RA-Touch: Retrieval-Augmented Touch Understanding with Enriched Visual Data

Yoorhim Cho,Hongyeob Kim,Semin Kim,Youjia Zhang,Yunseok Choi,Sungeun Hong

Main category: cs.CV

TL;DR: RA-Touch is a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data and tactile semantics.

Details

Motivation: Collecting tactile data is costly and time-consuming, yet visually distinct objects can share similar tactile properties, so tactile properties can be understood indirectly through material cues in visual data. Method: RA-Touch recaptions a large-scale visual dataset with tactile semantic descriptions and uses retrieval augmentation to align visual-textual representations with tactile inputs. Result: On the TVL benchmark, RA-Touch outperforms existing methods, showing the potential of retrieval-based visual reuse for tactile understanding. Conclusion: By combining visual and tactile semantics, RA-Touch provides an efficient, lower-cost solution for tactile perception. Abstract: Visuo-tactile perception aims to understand an object's tactile properties, such as texture, softness, and rigidity. However, the field remains underexplored because collecting tactile data is costly and labor-intensive. We observe that visually distinct objects can exhibit similar surface textures or material properties. For example, a leather sofa and a leather jacket have different appearances but share similar tactile properties. This implies that tactile understanding can be guided by material cues in visual data, even without direct tactile supervision. In this paper, we introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, enabling the model to access tactile semantics typically absent from conventional visual datasets. A key challenge remains in effectively utilizing these tactile-aware external descriptions. RA-Touch addresses this by retrieving visual-textual representations aligned with tactile inputs and integrating them to focus on relevant textural and material properties. By outperforming prior methods on the TVL benchmark, our method demonstrates the potential of retrieval-based visual reuse for tactile understanding. Code is available at https://aim-skku.github.io/RA-Touch

[53] Towards Generating Realistic Underwater Images

Abdul-Kazeem Shamba

Main category: cs.CV

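TL;DR: The paper studies contrastive learning and generative adversarial networks for generating realistic underwater images from synthetic images, and evaluates several models on the VAROS dataset.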

Details

Motivation: To explore how image translation models can convert synthetic images into realistic underwater images, addressing lighting and structural issues in underwater image generation. Method: Image translation is performed with pix2pix, CycleGAN, CUT, and other models, and performance is evaluated with the FID and SSIM metrics. Result: pix2pix achieves the best FID and the autoencoder the best SSIM; among unpaired methods, CycleGAN achieves a competitive FID while CUT attains higher SSIM, and adding depth information to CUT yields the lowest FID overall. Conclusion: Depth information improves the realism of generated images but may sacrifice some structural fidelity; the models trade off perceptual quality against structural preservation in different ways. Abstract: This paper explores the use of contrastive learning and generative adversarial networks for generating realistic underwater images from synthetic images with uniform lighting. We investigate the performance of image translation models for generating realistic underwater images using the VAROS dataset. Two key evaluation metrics, Fréchet Inception Distance (FID) and Structural Similarity Index Measure (SSIM), provide insights into the trade-offs between perceptual quality and structural preservation. For paired image translation, pix2pix achieves the best FID scores due to its paired supervision and PatchGAN discriminator, while the autoencoder model attains the highest SSIM, suggesting better structural fidelity despite producing blurrier outputs. Among unpaired methods, CycleGAN achieves a competitive FID score by leveraging cycle-consistency loss, whereas CUT, which replaces cycle-consistency with contrastive learning, attains higher SSIM, indicating improved spatial similarity retention. Notably, incorporating depth information into CUT results in the lowest overall FID score, demonstrating that depth cues enhance realism. However, the slight decrease in SSIM suggests that depth-aware learning may introduce structural variations.

[54] A Review of Vision-Based Assistive Systems for Visually Impaired People: Technologies, Applications, and Future Directions

Fulong Yao,Wenju Zhou,Huosheng Hu

Main category: cs.CV

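TL;DR: A review of recent advances in assistive systems for visually impaired people, focusing on the latest results in obstacle detection, navigation, and user interaction.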

Details

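Motivation: Visually impaired people need accurate and timely information about their environment to live independently, and vision-based assistive technologies can substantially improve their mobility and interaction with the outside world. Method: A literature review analyzing recent vision-based assistive system technologies, covering obstacle detection, navigation, and user interaction. Result: The review summarizes current technical progress and discusses future trends in visual guidance systems. Conclusion: Vision-based assistive systems have made significant progress in helping visually impaired people, with room for further development. Abstract: Visually impaired individuals rely heavily on accurate and timely information about obstacles and their surrounding environments to achieve independent living. In recent years, significant progress has been made in the development of assistive technologies, particularly vision-based systems, that enhance mobility and facilitate interaction with the external world in both indoor and outdoor settings. This paper presents a comprehensive review of recent advances in assistive systems designed for the visually impaired, with a focus on state-of-the-art technologies in obstacle detection, navigation, and user interaction. In addition, emerging trends and future directions in visual guidance systems are discussed.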

[55] RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection

Wenjun Hou,Yi Cheng,Kaishuai Xu,Heng Li,Yan Hu,Wenjie Li,Jiang Liu

Main category: cs.CV

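TL;DR: The RADAR framework improves the accuracy and informativeness of radiology report generation by combining an LLM's internal knowledge with externally retrieved information.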

Details

Motivation: Existing methods overlook the knowledge already embedded in LLMs, leading to redundant information and inefficient use of learned representations. Method: RADAR first extracts the LLM's internal knowledge that agrees with expert image-based classification outputs, then retrieves supplementary knowledge, and finally aggregates both sources to generate the report. Result: On multiple datasets, RADAR outperforms state-of-the-art LLMs in both language quality and clinical accuracy. Conclusion: By systematically integrating internal and external knowledge, RADAR substantially improves radiology report generation. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in various domains, including radiology report generation. Previous approaches have attempted to utilize multimodal LLMs for this task, enhancing their performance through the integration of domain-specific knowledge retrieval. However, these approaches often overlook the knowledge already embedded within the LLMs, leading to redundant information integration and inefficient utilization of learned representations. To address this limitation, we propose RADAR, a framework for enhancing radiology report generation with supplementary knowledge injection. RADAR improves report generation by systematically leveraging both the internal knowledge of an LLM and externally retrieved information. Specifically, it first extracts the model's acquired knowledge that aligns with expert image-based classification outputs. It then retrieves relevant supplementary knowledge to further enrich this information. Finally, by aggregating both sources, RADAR generates more accurate and informative radiology reports. Extensive experiments on MIMIC-CXR, CheXpert-Plus, and IU X-ray demonstrate that our model outperforms state-of-the-art LLMs in both language quality and clinical accuracy.

[56] RETRO: REthinking Tactile Representation Learning with Material PriOrs

Weihao Xia,Chenliang Zhou,Cengiz Oztireli

Main category: cs.CV

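TL;DR: The paper proposes a tactile representation learning method that incorporates material-aware priors, addressing existing methods' neglect of material properties and improving the accuracy and richness of tactile feedback.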

Details

Motivation: Existing tactile representation learning methods focus mainly on aligning tactile data with visual or textual information and overlook the key influence of material properties on tactile experience. Method: Material-aware priors (pre-learned material characteristics) are incorporated into the tactile representation learning framework to better capture and generalize subtle differences in surface texture. Result: The method yields more accurate, contextually rich tactile feedback across diverse materials and textures, improving performance in real-world applications such as robotics, haptic feedback systems, and material editing. Conclusion: Tactile representation learning with material-aware priors substantially improves tactile model performance and offers a better solution for practical applications. Abstract: Tactile perception is profoundly influenced by the surface properties of objects in contact. However, despite their crucial role in shaping tactile experiences, these material characteristics have been largely neglected in existing tactile representation learning methods. Most approaches primarily focus on aligning tactile data with visual or textual information, overlooking the richness of tactile feedback that comes from understanding the materials' inherent properties. In this work, we address this gap by revisiting the tactile representation learning framework and incorporating material-aware priors into the learning process. These priors, which represent pre-learned characteristics specific to different materials, allow tactile models to better capture and generalize the nuances of surface texture. Our method enables more accurate, contextually rich tactile feedback across diverse materials and textures, improving performance in real-world applications such as robotics, haptic feedback systems, and material editing.

[57] Accuracy and Fairness of Facial Recognition Technology in Low-Quality Police Images: An Experiment With Synthetic Faces

Maria Cuellar,Hon Kiu To,Arush Mehrotra

Main category: cs.CV

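TL;DR: The study examines how image degradation (e.g., contrast, brightness, motion blur) affects the accuracy and fairness of facial recognition technology (FRT), finding higher error rates for women and Black individuals, although FRT still outperforms many traditional forensic methods.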

Details

Motivation: To evaluate FRT performance in realistic law-enforcement settings, especially its accuracy and fairness on low-quality images. Method: Synthetic faces are generated with StyleGAN3, degraded images are simulated, and 1:n identification is tested using Deepface with ArcFace loss. Result: False positive rates peak near baseline image quality, while false negatives rise as degradation intensifies, especially with blur and low resolution; error rates are higher for women and Black individuals, with Black females most affected. Conclusion: If appropriately validated and regulated, FRT can be a valuable investigative tool, but transparency and oversight are needed to ensure fairness and forensic validity. Abstract: Facial recognition technology (FRT) is increasingly used in criminal investigations, yet most evaluations of its accuracy rely on high-quality images, unlike those often encountered by law enforcement. This study examines how five common forms of image degradation--contrast, brightness, motion blur, pose shift, and resolution--affect FRT accuracy and fairness across demographic groups. Using synthetic faces generated by StyleGAN3 and labeled with FairFace, we simulate degraded images and evaluate performance using Deepface with ArcFace loss in 1:n identification tasks. We perform an experiment and find that false positive rates peak near baseline image quality, while false negatives increase as degradation intensifies--especially with blur and low resolution. Error rates are consistently higher for women and Black individuals, with Black females most affected. These disparities raise concerns about fairness and reliability when FRT is used in real-world investigative contexts. Nevertheless, even under the most challenging conditions and for the most affected subgroups, FRT accuracy remains substantially higher than that of many traditional forensic methods. This suggests that, if appropriately validated and regulated, FRT should be considered a valuable investigative tool. However, algorithmic accuracy alone is not sufficient: we must also evaluate how FRT is used in practice, including user-driven data manipulation. Such cases underscore the need for transparency and oversight in FRT deployment to ensure both fairness and forensic validity.

[58] Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

Bo Feng,Zhengfeng Lai,Shiyu Li,Zizhen Wang,Simon Wang,Ping Huang,Meng Cao

Main category: cs.CV

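TL;DR: VBenchComp is an automated pipeline that categorizes video understanding questions into four classes (LLM-Answerable, Semantic, Temporal, and Others) to evaluate video LLM capabilities more precisely.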

Details

Motivation: Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions and do not clearly isolate a model's temporal reasoning ability, so scores may not reflect real understanding of dynamic content. Method: VBenchComp is an automated pipeline that categorizes questions as LLM-Answerable, Semantic, Temporal, or Others, enabling fine-grained evaluation of model capabilities. Result: The analysis reveals model weaknesses hidden by traditional overall scores and offers recommendations for designing benchmarks that assess video LLMs more accurately. Conclusion: VBenchComp offers a new perspective for designing future video understanding benchmarks and assesses temporal reasoning ability more precisely. Abstract: Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The rest of the questions are labeled as Others. This can enable fine-grained evaluation of different capabilities of a video LLM. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.
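
The four categories described in the abstract above can be read as a simple decision rule. The Python sketch below illustrates one such rule under stated assumptions: the `QuestionOutcome` fields, the `categorize` function, and the order in which the checks are applied are all hypothetical and are not taken from the VBenchComp implementation.

```python
from dataclasses import dataclass


@dataclass
class QuestionOutcome:
    # Hypothetical per-question probe results (assumed, not from the paper's code).
    correct_without_video: bool    # answered correctly from the text alone
    correct_shuffled_frames: bool  # answered correctly with shuffled frames
    correct_ordered_frames: bool   # answered correctly with frames in order


def categorize(outcome: QuestionOutcome) -> str:
    """Map probe results to one of the four categories named in the abstract."""
    if outcome.correct_without_video:
        return "LLM-Answerable"  # language priors suffice
    if outcome.correct_shuffled_frames:
        return "Semantic"        # frame content matters, frame order does not
    if outcome.correct_ordered_frames:
        return "Temporal"        # needs the correct temporal order
    return "Others"
```

Reporting accuracy per category rather than one overall score is what makes the temporal-reasoning component visible.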

[59] Handloom Design Generation Using Generative Networks

Rajat Kanti Bhattacharjee,Meghali Nandi,Amrit Jha,Gunajit Kalita,Ferdous Ahmed Barbhuiya

Main category: cs.CV

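TL;DR: The paper proposes deep-learning-based methods for generating clothing designs, focusing on handloom fabric, and discusses the associated challenges and applications.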

Details

Motivation: The ability of generative neural network models to understand and synthesize artistic designs has not yet been well explored. Method: Multiple methods combining state-of-the-art generative models and style transfer algorithms are applied and their performance on the design generation task is studied. Result: Results are evaluated through user scores, and a new dataset, NeuralLoom, is provided. Conclusion: The work offers new methods and a new dataset for handloom fabric design generation. Abstract: This paper proposes deep learning techniques for generating designs for clothing, focused on handloom fabric, and discusses the associated challenges along with its application. The capability of generative neural network models in understanding artistic designs and synthesizing them is not yet well explored. In this work, multiple methods are employed, incorporating current state-of-the-art generative models and style transfer algorithms, to study and observe their performance for the task. The results are then evaluated through user scores. This work also provides a new dataset, NeuralLoom, for the design generation task.

[60] Domain Adaptation for Multi-label Image Classification: a Discriminator-free Approach

Inder Pal Singh,Enjie Ghorbel,Anis Kacem,Djamila Aouada

Main category: cs.CV

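TL;DR: The paper proposes DDA-MLIC, a discriminator-free adversarial approach to unsupervised domain adaptation for multi-label image classification that uses Gaussian mixture models with parameters estimated by a deep neural network and outperforms existing methods.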

Details

Motivation: Existing adversarial methods for multi-label image classification usually require an additional discriminator subnet, which may harm task-specific discriminative power, so a more efficient approach is needed. Method: A Gaussian Mixture Model (GMM) models the source and target predictions, a deep neural network estimates the GMM parameters, and the Fréchet distance is used to build the adversarial loss. Result: On several multi-label image datasets, DDA-MLIC outperforms existing methods in precision while using fewer parameters. Conclusion: DDA-MLIC is an efficient, high-performing unsupervised domain adaptation method for multi-label image classification. Abstract: This paper introduces a discriminator-free adversarial-based approach termed DDA-MLIC for Unsupervised Domain Adaptation (UDA) in the context of Multi-Label Image Classification (MLIC). While recent efforts have explored adversarial-based UDA methods for MLIC, they typically include an additional discriminator subnet. Nevertheless, decoupling the classification and the discrimination tasks may harm their task-specific discriminative power. Herein, we address this challenge by presenting a novel adversarial critic directly derived from the task-specific classifier. Specifically, we employ a two-component Gaussian Mixture Model (GMM) to model both source and target predictions, distinguishing between two distinct clusters. Instead of using the traditional Expectation Maximization (EM) algorithm, our approach utilizes a Deep Neural Network (DNN) to estimate the parameters of each GMM component. Subsequently, the source and target GMM parameters are leveraged to formulate an adversarial loss using the Fréchet distance. The proposed framework is therefore not only fully differentiable but is also cost-effective as it avoids the expensive iterative process usually induced by the standard EM method. The proposed method is evaluated on several multi-label image datasets covering three different types of domain shift. The obtained results demonstrate that DDA-MLIC outperforms existing state-of-the-art methods in terms of precision while requiring a lower number of parameters. The code is made publicly available at github.com/cvi2snt/DDA-MLIC.
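
For intuition about a Fréchet-distance loss between Gaussian components: for two univariate Gaussians, the squared Fréchet (2-Wasserstein) distance has the closed form (mu1 - mu2)^2 + (sigma1 - sigma2)^2. The Python sketch below assumes univariate components and a plain sum over index-matched components; the function names and the component pairing are illustrative assumptions, not the DDA-MLIC implementation.

```python
from typing import List, Tuple

Gaussian = Tuple[float, float]  # (mean, standard deviation)


def frechet_distance_sq(mu1: float, sigma1: float,
                        mu2: float, sigma2: float) -> float:
    """Squared Fréchet distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    if sigma1 < 0 or sigma2 < 0:
        raise ValueError("standard deviations must be non-negative")
    return (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2


def gmm_frechet_loss(source: List[Gaussian], target: List[Gaussian]) -> float:
    """Sum the squared distances of index-matched components (assumed pairing)."""
    if len(source) != len(target):
        raise ValueError("source and target need the same number of components")
    return sum(
        frechet_distance_sq(ms, ss, mt, st)
        for (ms, ss), (mt, st) in zip(source, target)
    )
```

Because the expression is a closed form in the component parameters, it is differentiable with respect to them, which is consistent with the abstract's point about avoiding iterative EM.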

Seunghyuk Cho,Zhenyue Qin,Yang Liu,Youngbin Choi,Seungbeom Lee,Dongwoo Kim

Main category: cs.CV

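TL;DR: This paper surveys research on plane geometry problem solving (PGPS), organizes existing methods under an encoder-decoder framework, analyzes their architectural designs, and identifies challenges and directions for future research.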

Details

Motivation: To fill the lack of a systematic survey of PGPS and give the research community a comprehensive analysis of the state of the field. Method: PGPS methods are categorized under an encoder-decoder framework, the output formats of their encoders and decoders are summarized, and the different architectural designs are analyzed. Result: The survey identifies major challenges in PGPS research, such as hallucination during the encoding phase and data leakage in benchmarks. Conclusion: Future research should address hallucination and data leakage to advance the PGPS field. Abstract: Plane geometry problem solving (PGPS) has recently gained significant attention as a benchmark to assess the multi-modal reasoning capabilities of large vision-language models. Despite the growing interest in PGPS, the research community still lacks a comprehensive overview that systematically synthesizes recent work in PGPS. To fill this gap, we present a survey of existing PGPS studies. We first categorize PGPS methods into an encoder-decoder framework and summarize the corresponding output formats used by their encoders and decoders. Subsequently, we classify and analyze these encoders and decoders according to their architectural designs. Finally, we outline major challenges and promising directions for future research. In particular, we discuss the hallucination issues arising during the encoding phase within encoder-decoder architectures, as well as the problem of data leakage in current PGPS benchmarks.

[62] Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image

Sifan Li,Ming Tao,Hao Zhao,Ling Shao,Hao Tang

Main category: cs.CV

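TL;DR: The paper improves concept alignment in counterfactual text-to-image (T2I) generation through Explicit Logical Narrative Prompts (ELNP) and step-by-step object replacement in latent space.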

Details

Motivation: Counterfactual T2I generation struggles with both realism and concept alignment, which limits the diversity of AIGC; the paper aims to address this. Method: A controllable T2I model replaces objects step by step in latent space, guided by ELNP instructions generated with DeepSeek. Result: Experiments show that the method substantially improves concept alignment in counterfactual T2I generation. Conclusion: The proposed strategy effectively addresses concept alignment in counterfactual T2I generation and opens up richer possibilities for AIGC. Abstract: Text-to-Image (T2I) has been prevalent in recent years, with most common condition tasks having been optimized nicely. However, counterfactual Text-to-Image still obstructs a more versatile AIGC experience. For scenes that are impossible in the real world and defy physics, we should spare no effort in increasing the factual feel, which means synthesizing images that people think are very likely to be happening, and concept alignment, which means all the required objects should be in the same frame. In this paper, we focus on concept alignment. As controllable T2I models have achieved satisfactory performance for real applications, we utilize this technology to replace the objects in a synthesized image in latent space step-by-step to change the image from a common scene to a counterfactual scene to meet the prompt. We propose a strategy to instruct this replacing process, called Explicit Logical Narrative Prompt (ELNP), by using the recent SoTA language model DeepSeek to generate the instructions. Furthermore, to evaluate models' performance in counterfactual T2I, we design a metric to calculate how many required concepts in the prompt are covered on average in the synthesized images. The extensive experiments and qualitative comparisons demonstrate that our strategy can boost the concept alignment in counterfactual T2I.

[63] Egocentric Action-aware Inertial Localization in Point Clouds

Mingfang Zhang,Ryo Yonetani,Yifei Huang,Liangyang Ouyang,Ruicong Liu,Yoichi Sato

Main category: cs.CV

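TL;DR: The paper proposes EAIL, a new inertial localization framework that uses egocentric action cues in head-mounted IMU signals to localize a target individual within a 3D point cloud.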

Details

Motivation: To address the localization drift caused by IMU sensor noise and the diversity of human actions by using the correlation between actions and environmental structures as spatial anchors. Method: Hierarchical multi-modal alignment learns the association between action cues in IMU signals and environmental features in the point cloud, with modality encoders trained by contrastive learning. Result: Experiments show that EAIL outperforms existing baselines in both inertial localization and action recognition. Conclusion: EAIL effectively exploits the correlation between actions and the environment, substantially improving inertial localization accuracy while also recognizing action sequences. Abstract: This paper presents a novel inertial localization framework named Egocentric Action-aware Inertial Localization (EAIL), which leverages egocentric action cues from head-mounted IMU signals to localize the target individual within a 3D point cloud. Human inertial localization is challenging due to IMU sensor noise that causes trajectory drift over time. The diversity of human actions further complicates IMU signal processing by introducing various motion patterns. Nevertheless, we observe that some actions observed through the head-mounted IMU correlate with spatial environmental structures (e.g., bending down to look inside an oven, washing dishes next to a sink), thereby serving as spatial anchors to compensate for the localization drift. The proposed EAIL framework learns such correlations via hierarchical multi-modal alignment. By assuming that the 3D point cloud of the environment is available, it contrastively learns modality encoders that align short-term egocentric action cues in IMU signals with local environmental features in the point cloud. These encoders are then used in reasoning the IMU data and the point cloud over time and space to perform inertial localization. Interestingly, these encoders can further be utilized to recognize the corresponding sequence of actions as a by-product. Extensive experiments demonstrate the effectiveness of the proposed framework over state-of-the-art inertial localization and inertial action recognition baselines.

[64] Vid2World: Crafting Video Diffusion Models to Interactive World Models

Siqiao Huang,Jialong Wu,Qixing Zhou,Shangchen Miao,Mingsheng Long

Main category: cs.CV

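TL;DR: Vid2World builds interactive world models from pre-trained video diffusion models, using causalization and an action guidance mechanism to improve prediction quality and controllability.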

Details

Motivation: Existing world models require extensive domain-specific training and still produce coarse predictions, whereas video diffusion models generate high-quality videos but lack interactivity. Method: Vid2World causalizes the architecture and training objective of a pre-trained video diffusion model and introduces a causal action guidance mechanism. Result: The method performs well in experiments on robot manipulation and game simulation. Conclusion: Vid2World provides a scalable and effective way to turn video diffusion models into interactive world models. Abstract: World models, which predict transitions based on history observation and action sequences, have shown great promise in improving data efficiency for sequential decision making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their applicability in complex environments. In contrast, video diffusion models trained on large, internet-scale datasets have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World performs causalization of a pre-trained video diffusion model by crafting its architecture and training objective to enable autoregressive generation. Furthermore, it introduces a causal action guidance mechanism to enhance action controllability in the resulting interactive world model. Extensive experiments in robot manipulation and game simulation domains show that our method offers a scalable and effective approach for repurposing highly capable video diffusion models to interactive world models.

[65] Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable

Ruoxin Chen,Junwei Xi,Zhiyuan Yan,Ke-Yue Zhang,Shuang Wu,Jingyi Xie,Xu Chen,Lei Xu,Isabel Guan,Taiping Yao,Shouhong Ding

Main category: cs.CV

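TL;DR: The paper proposes Dual Data Alignment (DDA), which aligns real and synthetic images in both the pixel and frequency domains to improve the generalization of AI-generated image detectors, and validates it on new test sets.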

Details

Motivation: Existing detectors tend to overfit to spuriously correlated features in the training data, which degrades performance on unbiased datasets, and conventional pixel-level alignment alone cannot remove frequency-domain bias. Method: DDA aligns both the pixel and frequency domains; two new test sets, DDA-COCO and EvalGEN, are introduced. Result: A detector trained only on DDA-aligned MSCOCO improves across 8 benchmarks, including a gain of 7.2% on in-the-wild benchmarks. Conclusion: DDA improves detector generalization and resolves the frequency-domain bias. Abstract: Existing detectors are often trained on biased datasets, leading to the possibility of overfitting on non-causal image attributes that are spuriously correlated with real/synthetic labels. While these biased features enhance performance on the training data, they result in substantial performance degradation when applied to unbiased datasets. One common solution is to perform dataset alignment through generative reconstruction, matching the semantic content between real and synthetic images. However, we revisit this approach and show that pixel-level alignment alone is insufficient. The reconstructed images still suffer from frequency-level misalignment, which can perpetuate spurious correlations. To illustrate, we observe that reconstruction models tend to restore the high-frequency details lost in real images (possibly due to JPEG compression), inadvertently creating a frequency-level misalignment, where synthetic images appear to have richer high-frequency content than real ones. This misalignment leads to models associating high-frequency features with synthetic labels, further reinforcing biased cues. To resolve this, we propose Dual Data Alignment (DDA), which aligns both the pixel and frequency domains. Moreover, we introduce two new test sets: DDA-COCO, containing DDA-aligned synthetic images for testing detector performance on the most aligned dataset, and EvalGEN, featuring the latest generative models for assessing detectors under new generative architectures such as visual auto-regressive generators.
Finally, our extensive evaluations demonstrate that a detector trained exclusively on DDA-aligned MSCOCO could improve across 8 diverse benchmarks by a non-trivial margin, showing a +7.2% on in-the-wild benchmarks, highlighting the improved generalizability of unbiased detectors.

[66] Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives

Xingxing Weng,Chao Pang,Gui-Song Xia

Main category: cs.CV

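TL;DR: This paper surveys progress in vision-language modeling (VLM) for remote sensing under the two-stage paradigm, covering a taxonomy, methods, datasets, and future research directions.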

Details

Motivation: To bridge the information gap between images and natural language and advance VLM in the remote sensing domain. Method: The survey follows the two-stage paradigm (pre-training plus fine-tuning), divides methods into contrastive learning, visual instruction tuning, and text-conditioned image generation, and details their architectures and objectives. Result: VLM models perform strongly on remote sensing data analysis tasks and support conversational interaction; existing works and datasets are summarized. Conclusion: Future research directions include cross-modal alignment, vague requirement comprehension, model reliability, scalable model capabilities, and richer multimodal datasets. Abstract: Vision-language modeling (VLM) aims to bridge the information gap between images and natural language. Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress. The resulting models benefit from the absorption of extensive general knowledge and demonstrate strong performance across a variety of remote sensing data analysis tasks. Moreover, they are capable of interacting with users in a conversational manner. In this paper, we aim to provide the remote sensing community with a timely and comprehensive review of the developments in VLM using the two-stage paradigm. Specifically, we first cover a taxonomy of VLM in remote sensing: contrastive learning, visual instruction tuning, and text-conditioned image generation. For each category, we detail the commonly used network architecture and pre-training objectives. Second, we conduct a thorough review of existing works, examining foundation models and task-specific adaptation methods in contrastive-based VLM, architectural upgrades, training strategies and model capabilities in instruction-based VLM, as well as generative foundation models with their representative downstream applications. Third, we summarize datasets used for VLM pre-training, fine-tuning, and evaluation, with an analysis of their construction methodologies (including image sources and caption generation) and key properties, such as scale and task adaptability.
Finally, we conclude this survey with insights and discussions on future research directions: cross-modal representation alignment, vague requirement comprehension, explanation-driven model reliability, continually scalable model capabilities, and large-scale datasets featuring richer modalities and greater challenges.

[67] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng,Michael Yang,Jack Hong,Chenxiao Zhao,Guohai Xu,Le Yang,Chao Shen,Xing Yu

Main category: cs.CV

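TL;DR: DeepEyes explores an interleaved visual-textual multimodal reasoning paradigm, using reinforcement learning to acquire "thinking with images" capabilities without cold-start SFT, and substantially improves fine-grained perception and reasoning performance.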

Details

Motivation: To address the difficulty existing large vision-language models (VLMs) have in seamlessly integrating visual and textual reasoning in a way that mirrors human cognition. Method: A tool-use-oriented data selection mechanism and a reward strategy are proposed, and DeepEyes is trained with end-to-end reinforcement learning. Result: DeepEyes achieves significant gains on fine-grained perception and reasoning benchmarks, improves on grounding, hallucination, and mathematical reasoning tasks, and shows an evolution in tool-calling behavior. Conclusion: DeepEyes demonstrates a natural integration of visual and textual reasoning and offers a new direction for multimodal reasoning. Abstract: Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes. However, achieving seamless integration of visual and textual reasoning which mirrors human cognitive processes remains a significant challenge. In particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. Thus, in this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool instead of depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of tool-calling behavior from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.

[68] ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

Xuecheng Wu,Jiaxing Liu,Danlei Huang,Xiaoyu Li,Yifan Wang,Chen Chen,Liya Ma,Xuezhi Cao,Junxiao Xue

Main category: cs.CV

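TL;DR: VI-CoT improves the reasoning of multimodal large language models (MLLMs) through step-wise visual state updates, but existing benchmarks fix the visual states, which limits evaluation. The paper proposes the ViC-Bench benchmark and the IPII strategy and systematically evaluates the VI-CoT capability of 18 MLLMs.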

Details

Motivation: Existing benchmarks fix the intermediate visual states, which may distort models' reasoning trajectories, and they do not systematically explore how visual states affect reasoning performance. Method: ViC-Bench comprises four tasks and supports free-style visual state generation; it is paired with a three-stage evaluation strategy and the IPII prompting strategy. Result: 18 MLLMs are evaluated, yielding key insights into their VI-CoT capability. Conclusion: ViC-Bench is a new, publicly available benchmark for systematically evaluating VI-CoT capability. Abstract: Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much like a human would, which demonstrates impressive success in various tasks, thereby leading to emerging advancements in related benchmarks. Despite promising progress, current benchmarks provide models with relatively fixed IVS, rather than free-style IVS, which might forcibly distort the original thinking trajectories, failing to evaluate their intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore the impact factors that IVS would impart to untamed reasoning performance. To tackle the above gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representative tasks: maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting, where each task has a dedicated free-style IVS generation pipeline supporting function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. Besides, we establish an Incremental Prompting Information Injection (IPII) strategy to ablatively explore the prompting factors for VI-CoT. We extensively conduct evaluations for 18 advanced MLLMs, revealing key insights into their VI-CoT capability. Our proposed benchmark is publicly available on Hugging Face.

[69] Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency

Jiafeng Liang,Shixin Jiang,Xuan Dong,Ning Wang,Zheng Chu,Hui Su,Jinlan Fu,Ming Liu,See-Kiong Ng,Bing Qin

Main category: cs.CV

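TL;DR: The paper proposes a new temporal robustness benchmark (TemRobBench) to assess the robustness of large multimodal models (LMMs) in temporal analysis, and designs panoramic direct preference optimization (PanoDPO) to improve model performance.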

Details

Motivation: Existing LMMs perform well on video comprehension tasks, but the robustness of their temporal analysis has been little studied, in particular their over-reliance on prior knowledge and textual context in adversarial settings. Method: TemRobBench introduces temporal inconsistency perturbations in the visual and textual modalities; PanoDPO encourages models to incorporate both visual and linguistic feature preferences. Result: An evaluation of 16 mainstream LMMs shows that they perform poorly in adversarial settings; PanoDPO substantially improves the robustness and reliability of temporal analysis. Conclusion: PanoDPO effectively addresses the temporal robustness problem of LMMs and suggests new directions for future research. Abstract: Large Multimodal Models (LMMs) have recently demonstrated impressive performance on general video comprehension benchmarks. Nevertheless, for broader applications, the robustness of their temporal analysis capability needs to be thoroughly investigated, yet it has been predominantly ignored. Motivated by this, we propose a novel temporal robustness benchmark (TemRobBench), which introduces temporal inconsistency perturbations separately at the visual and textual modalities to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they exhibit over-reliance on prior knowledge and textual context in adversarial environments, while ignoring the actual temporal dynamics in the video. To mitigate this issue, we design panoramic direct preference optimization (PanoDPO), which encourages LMMs to incorporate both visual and linguistic feature preferences simultaneously. Experimental results show that PanoDPO can effectively enhance the model's robustness and reliability in temporal analysis.

[70] Diving into the Fusion of Monocular Priors for Generalized Stereo Matching

Chengtang Yao,Lidong Yu,Zhidan Liu,Jiaxi Zeng,Yuwei Wu,Yunde Jia

Main category: cs.CV

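TL;DR: The paper improves stereo matching with the monocular prior of a vision foundation model (VFM), using a binary local ordering map and a pixel-wise linear regression module to resolve problems in the fusion and substantially improve performance.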

Details

Motivation: Stereo matching performs poorly in ill-posed regions such as occlusions and non-Lambertian surfaces, and existing monocular priors generalize poorly because of data bias; the unbiased monocular prior of a VFM can help. Method: A binary local ordering map unifies the relative and absolute depth representations, and a pixel-wise linear regression module globally and adaptively aligns monocular depth with disparity. Result: Performance improves significantly when generalizing from SceneFlow to the Middlebury and Booster datasets, with almost no loss of efficiency. Conclusion: The method fully exploits the monocular prior to support stereo matching effectively and efficiently. Abstract: The matching formulation makes it naturally hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains the generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from the vision foundation model (VFM) to improve the generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first problem is the misalignment between affine-invariant relative monocular depth and absolute depth of disparity. Besides, when we use the monocular feature in an iterative update structure, the over-confidence in the disparity update leads to local optima results. A direct fusion of a monocular depth map could alleviate the local optima problem, but noisy disparity results computed at the first several iterations will misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representation. The computed local ordering map is also used to re-weight the initial disparity update, resolving the local optima and noisy problem. In addition, we formulate the final direct fusion of monocular depth to the disparity as a registration problem, where a pixel-wise linear regression module can globally and adaptively align them. Our method fully exploits the monocular prior to support stereo matching results effectively and efficiently.
We significantly improve the performance from the experiments when generalizing from SceneFlow to Middlebury and Booster datasets while barely reducing the efficiency.
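The two ideas in this abstract can be pictured with a deliberately simplified sketch. It assumes the monocular prediction is related to disparity by a single positive scale and a shift, fits them by ordinary least squares, and checks that a binary ordering of neighbouring samples is unchanged by that transform. The names `fit_scale_shift` and `local_ordering` and the toy values are hypothetical; the paper's pixel-wise regression module and ordering map are not reproduced here.

```python
def fit_scale_shift(mono, disp):
    """Least-squares fit of disp ~= a * mono + b over paired samples."""
    n = len(mono)
    mean_m = sum(mono) / n
    mean_d = sum(disp) / n
    var_m = sum((m - mean_m) ** 2 for m in mono)
    cov = sum((m - mean_m) * (d - mean_d) for m, d in zip(mono, disp))
    a = cov / var_m
    b = mean_d - a * mean_m
    return a, b


def local_ordering(values):
    """Binary map: 1 where a sample is larger than its right neighbour, else 0."""
    return [1 if values[i] > values[i + 1] else 0 for i in range(len(values) - 1)]


# Toy 1-D "scanline": relative monocular values and disparities that differ
# by an (assumed) scale of 2 and shift of 3.
mono = [0.4, 0.1, 0.9, 0.5]
disp = [2.0 * m + 3.0 for m in mono]

a, b = fit_scale_shift(mono, disp)
aligned = [a * m + b for m in mono]
```

Because the ordering only compares neighbours, it is the same for `mono` and `disp`, which is one way to see why a binary relative format can bridge relative and absolute depth.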

[71] Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

Xuyang Liu,Yiyu Wang,Junpeng Ma,Linfeng Zhang

Main category: cs.CV

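TL;DR: VideoLLMs suffer from visual-token inefficiency. The VidCom2 framework uses adaptive compression to address information loss and compatibility problems, significantly improving performance.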

Details

Motivation: Video large language models (VideoLLMs) are inefficient because of the quadratic complexity of visual tokens, and existing compression methods suffer from information loss and architectural incompatibility. Method: Proposes the VidCom2 framework, which quantifies each frame's uniqueness to adaptively adjust compression intensity, preserving key information while reducing redundancy. Result: Experiments show that VidCom2 reaches 99.6% of the original performance with only 25% of the tokens and reduces generation latency by 70.8%. Conclusion: VidCom2 is efficient and compatible with existing methods, offering a practical acceleration solution for VideoLLMs. Abstract: Video large language models (VideoLLMs) excel at video understanding, but face efficiency challenges due to the quadratic complexity of abundant visual tokens. Our systematic analysis of token compression methods for VideoLLMs reveals two critical issues: (i) overlooking distinctive visual signals across frames, leading to information loss; (ii) suffering from implementation constraints, causing incompatibility with modern architectures or efficient operators. To address these challenges, we distill three design principles for VideoLLM token compression and propose a plug-and-play inference acceleration framework "Video Compression Commander" (VidCom2). By quantifying each frame's uniqueness, VidCom2 adaptively adjusts compression intensity across frames, effectively preserving essential information while reducing redundancy in video sequences. Extensive experiments across various VideoLLMs and benchmarks demonstrate the superior performance and efficiency of our VidCom2. With only 25% visual tokens, VidCom2 achieves 99.6% of the original performance on LLaVA-OV while reducing 70.8% of the LLM generation latency. Notably, our Frame Compression Adjustment strategy is compatible with other token compression methods to further improve their performance. Our code is available at https://github.com/xuyang-liu16/VidCom2.
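As a rough illustration of adjusting compression by frame uniqueness, the sketch below scores each frame by how far its feature vector is from the video-level mean (one minus cosine similarity) and splits a fixed token budget in proportion to that score. The uniqueness measure, the proportional split, and the function names are all assumptions made for this example; the abstract does not specify VidCom2's actual scoring or allocation rule.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def frame_budgets(frame_feats, tokens_per_frame, keep_ratio):
    """Split keep_ratio * total tokens across frames by (assumed) uniqueness."""
    n = len(frame_feats)
    dim = len(frame_feats[0])
    mean_feat = [sum(f[d] for f in frame_feats) / n for d in range(dim)]
    uniqueness = [1.0 - cosine(f, mean_feat) for f in frame_feats]
    total_keep = keep_ratio * tokens_per_frame * n
    total_u = sum(uniqueness)
    if total_u == 0:
        # All frames look alike: fall back to a uniform budget.
        return [int(total_keep / n)] * n
    # Floor each share and cap it at the frame's own token count.
    return [int(min(tokens_per_frame, total_keep * u / total_u)) for u in uniqueness]


# Two near-duplicate frames and one distinctive frame (toy 2-D features).
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
budgets = frame_budgets(feats, tokens_per_frame=100, keep_ratio=0.25)
```

In this toy case the distinctive third frame keeps most of the retained tokens, while the two redundant frames are compressed more aggressively.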

[72] VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank

Tianhe Wu,Jian Zou,Jie Liang,Lei Zhang,Kede Ma

Main category: cs.CV

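TL;DR: VisualQuality-R1 is a reasoning-induced no-reference image quality assessment model trained with reinforcement learning. It outperforms existing methods and generates human-aligned quality descriptions.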

Details

Motivation: Explore the potential of reasoning-induced computational modeling for image quality assessment (IQA), a task that depends on visual reasoning. Method: Trains the model with reinforcement learning, using group relative policy optimization to generate multiple quality scores and computing comparative probabilities under the Thurstone model. Result: VisualQuality-R1 outperforms other no-reference IQA models in experiments and generates rich quality descriptions. Conclusion: VisualQuality-R1 suits a wide range of image processing tasks such as super-resolution and image generation, is reliable, and supports multi-dataset training. Abstract: DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computational modeling has not been thoroughly explored in the context of image quality assessment (IQA), a task critically dependent on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are then used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.
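The comparison step can be sketched with the textbook Thurstone form, in which the probability that image x beats image y is the standard normal CDF of the mean score difference divided by the pooled standard deviation. The fidelity function below is the common form sqrt(p q) + sqrt((1 - p)(1 - q)). Both are assumptions about how the abstract's "Thurstone model" and "continuous fidelity measures" might be instantiated, and the function names and the small `eps` stabiliser are hypothetical.

```python
import math
from statistics import mean, pvariance


def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def thurstone_prob(scores_x, scores_y, eps=1e-6):
    """P(x has higher quality than y) from several sampled scores per image."""
    diff = mean(scores_x) - mean(scores_y)
    var = pvariance(scores_x) + pvariance(scores_y) + eps
    return std_normal_cdf(diff / math.sqrt(var))


def fidelity(p_pred, p_true):
    """Continuous agreement between predicted and reference preference."""
    return math.sqrt(p_pred * p_true) + math.sqrt((1.0 - p_pred) * (1.0 - p_true))


# Several sampled quality scores for each image in a pair (toy values).
scores_a = [4.0, 4.2, 3.8]
scores_b = [3.0, 3.1, 2.9]
p_ab = thurstone_prob(scores_a, scores_b)
reward = fidelity(p_ab, 1.0)  # reference says image a is better
```

A fidelity of 1 means the predicted preference matches the reference exactly, and it falls smoothly toward 0 as the two disagree, which is what makes it usable as a graded reward.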

[73] RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

Jiaang Li,Yifei Yuan,Wenyan Li,Mohammad Aliannejadi,Daniel Hershcovich,Anders Søgaard,Ivan Vulić,Wenxuan Zhang,Paul Pu Liang,Yang Deng,Serge Belongie

Main category: cs.CV

TL;DR: RAVENEA is a new benchmark that improves visual culture understanding through retrieval augmentation, with strong results on the cVQA and cIC tasks.

Details

Motivation: Existing vision-language models handle cultural nuances poorly. Retrieval-augmented generation is effective in text-only settings but underexplored in multimodal ones. Method: Introduces the RAVENEA benchmark, which integrates over 10,000 Wikipedia documents, trains and evaluates seven multimodal retrievers, and measures their impact on 14 state-of-the-art VLMs. Result: Lightweight VLMs augmented with culture-aware retrieval improve by at least 3.2% absolute on cVQA and 6.2% absolute on cIC. Conclusion: Retrieval-augmented methods and culturally inclusive benchmarks are valuable for multimodal understanding. Abstract: As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.

[74] Enhancing Interpretability of Sparse Latent Representations with Class Information

Farshad Sangari Abiz,Reshad Hosseini,Babak N. Araabi

Main category: cs.CV

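TL;DR: This paper proposes a new method that makes the latent space more interpretable by ensuring that samples of the same class share consistent active dimensions.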

Details

Motivation: Standard VAEs produce dispersed, unstructured latent spaces, which limits interpretability. VSC introduces sparse latent representations but does not guarantee consistent active dimensions across samples of the same class. Method: Introduces a new loss function that encourages samples of the same class to share similar active dimensions, yielding a more structured latent space. Result: The method produces a more structured and interpretable latent space in which each shared dimension corresponds to a high-level concept, or "factor." Conclusion: The method captures both global and class-specific factors, significantly improving the utility and interpretability of latent representations. Abstract: Variational Autoencoders (VAEs) are powerful generative models for learning latent representations. Standard VAEs generate dispersed and unstructured latent spaces by utilizing all dimensions, which limits their interpretability, especially in high-dimensional spaces. To address this challenge, Variational Sparse Coding (VSC) introduces a spike-and-slab prior distribution, resulting in sparse latent representations for each input. These sparse representations, characterized by a limited number of active dimensions, are inherently more interpretable. Despite this advantage, VSC falls short in providing structured interpretations across samples within the same class. Intuitively, samples from the same class are expected to share similar attributes while allowing for variations in those attributes. This expectation should manifest as consistent patterns of active dimensions in their latent representations, but VSC does not enforce such consistency. In this paper, we propose a novel approach to enhance the latent space interpretability by ensuring that the active dimensions in the latent space are consistent across samples within the same class. To achieve this, we introduce a new loss function that encourages samples from the same class to share similar active dimensions. This alignment creates a more structured and interpretable latent space, where each shared dimension corresponds to a high-level concept, or "factor." Unlike existing disentanglement-based methods that primarily focus on global factors shared across all classes, our method captures both global and class-specific factors, thereby enhancing the utility and interpretability of latent representations.

[75] ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains

Guillaume Vray,Devavrat Tomar,Xufeng Gao,Jean-Philippe Thiran,Evan Shelhamer,Behzad Bozorgtabar

Main category: cs.CV

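TL;DR: ReservoirTTA is a novel plug-in framework for prolonged test-time adaptation (TTA) under continuously shifting test domains. It improves performance by maintaining a reservoir of domain-specialized models.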

Details

Motivation: Address catastrophic forgetting, inter-domain interference, and error accumulation in single-model adaptation when the test domain keeps shifting. Method: Detects new domains with online clustering and routes each sample to a specialized model for domain-specific adaptation. Result: Outperforms existing methods on ImageNet-C, CIFAR-10/100-C, and Cityscapes→ACDC. Conclusion: ReservoirTTA significantly improves adaptation accuracy and stability under continuously shifting, multi-domain conditions. Abstract: This paper introduces ReservoirTTA, a novel plug-in framework designed for prolonged test-time adaptation (TTA) in scenarios where the test domain continuously shifts over time, including cases where domains recur or evolve gradually. At its core, ReservoirTTA maintains a reservoir of domain-specialized models -- an adaptive test-time model ensemble -- that both detects new domains via online clustering over style features of incoming samples and routes each sample to the appropriate specialized model, and thereby enables domain-specific adaptation. This multi-model strategy overcomes key limitations of single model adaptation, such as catastrophic forgetting, inter-domain interference, and error accumulation, ensuring robust and stable performance on sustained non-stationary test distributions. Our theoretical analysis reveals key components that bound parameter variance and prevent model collapse, while our plug-in TTA module mitigates catastrophic forgetting of previously encountered domains. Extensive experiments on the classification corruption benchmarks, including ImageNet-C and CIFAR-10/100-C, as well as the Cityscapes$\rightarrow$ACDC semantic segmentation task, covering recurring and continuously evolving domain shifts, demonstrate that ReservoirTTA significantly improves adaptation accuracy and maintains stable performance across prolonged, recurring shifts, outperforming state-of-the-art methods.
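The detect-and-route idea can be sketched as a small online clustering loop: each incoming style feature goes to the nearest centroid if it is close enough, otherwise a new domain slot is opened, and the returned index selects which specialized model would be adapted. The distance threshold, the running-mean centroid update, and the `DomainRouter` class are assumptions made for illustration, not ReservoirTTA's actual clustering procedure.

```python
import math


class DomainRouter:
    """Toy online clustering over style features (hypothetical helper)."""

    def __init__(self, threshold, max_domains):
        self.threshold = threshold      # max distance to join an existing domain
        self.max_domains = max_domains  # size of the model reservoir
        self.centroids = []
        self.counts = []

    def route(self, style):
        """Return the index of the domain (and model) for this sample."""
        if self.centroids:
            dists = [math.dist(style, c) for c in self.centroids]
            k = min(range(len(dists)), key=dists.__getitem__)
            full = len(self.centroids) >= self.max_domains
            if dists[k] <= self.threshold or full:
                # Known domain: move its centroid toward the sample (running mean).
                self.counts[k] += 1
                step = 1.0 / self.counts[k]
                self.centroids[k] = [
                    c + step * (s - c) for c, s in zip(self.centroids[k], style)
                ]
                return k
        # Unseen domain: open a new slot in the reservoir.
        self.centroids.append(list(style))
        self.counts.append(1)
        return len(self.centroids) - 1


router = DomainRouter(threshold=1.0, max_domains=4)
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1), (5.1, 5.0)]
model_ids = [router.route(s) for s in stream]
```

When the first domain recurs in the stream it is routed back to model 0 rather than overwriting model 1, which is the property that helps against forgetting in recurring-domain settings.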

[76] SparC: Sparse Representation and Construction for High-Resolution 3D Shapes Modeling

Zhihao Li,Yufei Wang,Heliang Zheng,Yihao Luo,Bihan Wen

Main category: cs.CV

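TL;DR: The SparC framework combines a sparse deformable cube representation with a new encoder, SparConv-VAE, to address detail loss in 3D object synthesis, achieving high-fidelity reconstruction and efficient generation.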

Details

Motivation: Existing 3D object synthesis methods lose substantial detail because mesh data is unstructured and dense volumetric grids have cubic complexity. SparC aims to solve this. Method: Combines a sparse deformable marching cubes representation (SparseCubes) with SparConv-VAE, built on sparse convolutional networks, for efficient, near-lossless 3D reconstruction. Result: SparC achieves state-of-the-art reconstruction fidelity on open surfaces, disconnected components, and intricate geometry, and lowers training and inference cost. Conclusion: SparC offers a scalable solution for high-resolution 3D generation and integrates naturally with latent diffusion models. Abstract: High-fidelity 3D object synthesis remains significantly more challenging than 2D image generation due to the unstructured nature of mesh data and the cubic complexity of dense volumetric grids. Existing two-stage pipelines -- compressing meshes with a VAE (using either 2D or 3D supervision), followed by latent diffusion sampling -- often suffer from severe detail loss caused by inefficient representations and modality mismatches introduced in VAE. We introduce SparC, a unified framework that combines a sparse deformable marching cubes representation SparseCubes with a novel encoder SparConv-VAE. SparseCubes converts raw meshes into high-resolution ($1024^3$) surfaces with arbitrary topology by scattering signed distance and deformation fields onto a sparse cube, allowing differentiable optimization. SparConv-VAE is the first modality-consistent variational autoencoder built entirely upon sparse convolutional networks, enabling efficient and near-lossless 3D reconstruction suitable for high-resolution generative modeling through latent diffusion. SparC achieves state-of-the-art reconstruction fidelity on challenging inputs, including open surfaces, disconnected components, and intricate geometry. It preserves fine-grained shape details, reduces training and inference cost, and integrates naturally with latent diffusion models for scalable, high-resolution 3D generation.

[77] diffDemorph: Extending Reference-Free Demorphing to Unseen Faces

Nitish Shukla,Arun Ross

Main category: cs.CV

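TL;DR: Proposes a new diffusion-based method that recovers the constituent faces from a morphed face image without reference images, significantly outperforming prior work.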

Details

Motivation: Existing reference-free demorphing methods are constrained by assumptions about the training and testing distributions, such as the morphing technique and face style, which limits their practicality. Method: Uses a diffusion model to disentangle the constituent faces from a morph image, generalizing across morphing techniques and face styles. Result: Tested on six datasets and two face matchers, the method improves on the state of the art by ≥59.46% and generalizes to real morphs. Conclusion: The method substantially improves the practicality and generalization of demorphing and applies to a variety of scenarios. Abstract: A face morph is created by combining two (or more) face images corresponding to two (or more) identities to produce a composite that successfully matches the constituent identities. Reference-free (RF) demorphing reverses this process using only the morph image, without the need for additional reference images. Previous RF demorphing methods were overly constrained, as they rely on assumptions about the distributions of training and testing morphs such as the morphing technique used, face style, and images used to create the morph. In this paper, we introduce a novel diffusion-based approach that effectively disentangles component images from a composite morph image with high visual fidelity. Our method is the first to generalize across morph techniques and face styles, beating the current state of the art by $\geq 59.46\%$ under a common training protocol across all datasets tested. We train our method on morphs created using synthetically generated face images and test on real morphs, thereby enhancing the practicality of the technique. Experiments on six datasets and two face matchers establish the utility and efficacy of our method.

[78] Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image

Yuxuan Wang,Xuanyu Yi,Qingshan Xu,Yuan Zhou,Long Chen,Hanwang Zhang

Main category: cs.CV

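TL;DR: CP-GS is a framework for personalizing a 3D scene from a single reference image. It progressively propagates the reference appearance, guided by geometric cues, to meet the challenges of multi-view and referential consistency.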

Details

Motivation: Existing methods suffer from viewpoint bias caused by the single-view limitation and struggle to achieve multi-view and referential consistency. CP-GS aims to solve this. Method: Combines pre-trained image-to-3D generation with iterative LoRA fine-tuning, and uses geometric cues to produce multi-view guidance images and personalized 3DGS outputs. Result: Experiments show that CP-GS effectively mitigates viewpoint bias and significantly outperforms existing methods. Conclusion: Through a geometry-guided generation process, CP-GS achieves high-quality 3D scene personalization. Abstract: Personalizing 3D scenes from a single reference image enables intuitive user-guided editing, which requires achieving both multi-view consistency across perspectives and referential consistency with the input image. However, these goals are particularly challenging due to the viewpoint bias caused by the limited perspective provided in a single image. Lacking the mechanisms to effectively expand reference information beyond the original view, existing methods of image-conditioned 3DGS personalization often suffer from this viewpoint bias and struggle to produce consistent results. Therefore, in this paper, we present Consistent Personalization for 3D Gaussian Splatting (CP-GS), a framework that progressively propagates the single-view reference appearance to novel perspectives. In particular, CP-GS integrates pre-trained image-to-3D generation and iterative LoRA fine-tuning to extract and extend the reference appearance, and finally produces faithful multi-view guidance images and the personalized 3DGS outputs through a view-consistent generation process guided by geometric cues. Extensive experiments on real-world scenes show that our CP-GS effectively mitigates the viewpoint bias, achieving high-quality personalization that significantly outperforms existing methods. The code will be released at https://github.com/Yuxuan-W/CP-GS.

[79] Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI

Marlène Careil,Yohann Benchetrit,Jean-Rémi King

Main category: cs.CV

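TL;DR: Dynadiff is a new single-stage diffusion model that reconstructs images from dynamic fMRI recordings. It simplifies training and outperforms existing methods on time-resolved fMRI signals, especially on semantic reconstruction metrics.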

Details

Motivation: Current multi-stage pipelines and preprocessing steps limit temporal resolution, so a more efficient way to decode dynamic fMRI is needed. Method: Proposes Dynadiff, a single-stage diffusion model that processes dynamic fMRI signals directly. Result: Outperforms existing methods on time-resolved fMRI signals, especially on high-level semantic reconstruction metrics, while remaining competitive on static (time-collapsed) data. Conclusion: Lays the foundation for time-resolved brain-to-image decoding. Abstract: Brain-to-image decoding has been recently propelled by the progress in generative AI models and the availability of large ultra-high field functional Magnetic Resonance Imaging (fMRI). However, current approaches depend on complicated multi-stage pipelines and preprocessing steps that typically collapse the temporal dimension of brain recordings, thereby limiting time-resolved brain decoders. Here, we introduce Dynadiff (Dynamic Neural Activity Diffusion for Image Reconstruction), a new single-stage diffusion model designed for reconstructing images from dynamically evolving fMRI recordings. Our approach offers three main contributions. First, Dynadiff simplifies training as compared to existing approaches. Second, our model outperforms state-of-the-art models on time-resolved fMRI signals, especially on high-level semantic image reconstruction metrics, while remaining competitive on preprocessed fMRI data that collapse time. Third, this approach allows a precise characterization of the evolution of image representations in brain activity. Overall, this work lays the foundation for time-resolved brain-to-image decoding.

[80] Instance Segmentation for Point Sets

Abhimanyu Talwar,Julien Laasri

Main category: cs.CV

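TL;DR: This paper proposes two sampling-based methods to address SGPN's memory-intensive similarity matrices: instance segmentation is computed on a sub-sampled point set, and labels are extended to the full set with a nearest-neighbour approach. The random sampling strategy gives the best speed and memory usage.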

Details

Motivation: Networks such as SGPN use memory-intensive similarity matrices for instance segmentation, so memory grows quadratically with the number of points. Method: Two sampling-based methods: compute instance segmentation on a sub-sampled point set, then extend labels to the full set with a nearest-neighbour approach. Result: Both methods perform comparably on large sub-samples, but the random sampling strategy gives the largest gains in speed and memory usage. Conclusion: Random sampling is an effective way to address the memory problem, especially for large point sets. Abstract: Recently proposed neural network architectures like PointNet [QSMG16] and PointNet++ [QYSG17] have made it possible to apply Deep Learning to 3D point sets. The feature representations of shapes learned by these two networks enabled training classifiers for Semantic Segmentation, and more recently for Instance Segmentation via the Similarity Group Proposal Network (SGPN) [WYHN17]. One area of improvement which has been highlighted by SGPN's authors pertains to the use of memory-intensive similarity matrices, which occupy memory quadratic in the number of points. In this report, we attempt to tackle this issue through use of two sampling-based methods, which compute Instance Segmentation on a sub-sampled Point Set, and then extrapolate labels to the complete set using the nearest neighbour approach. While both approaches perform equally well on large sub-samples, the random-based strategy gives the most improvements in terms of speed and memory usage.
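The label extrapolation step can be sketched in a few lines: segment only a sub-sample, then give every point in the full set the instance label of its nearest sampled point. The brute-force search, the helper names, and the hand-written sub-sample labels (standing in for the output of a network such as SGPN) are assumptions for illustration only.

```python
import math
import random


def random_subsample(num_points, k, seed=0):
    """Pick k distinct point indices uniformly at random."""
    return random.Random(seed).sample(range(num_points), k)


def extrapolate_labels(points, sample_idx, sample_labels):
    """Copy each point's label from its nearest sub-sampled point."""
    labels = []
    for p in points:
        nearest = min(
            range(len(sample_idx)),
            key=lambda j: math.dist(p, points[sample_idx[j]]),
        )
        labels.append(sample_labels[nearest])
    return labels


# Two well-separated toy instances in 3-D.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0),
]
# Pretend a segmentation network labelled only points 0 and 3.
full_labels = extrapolate_labels(points, sample_idx=[0, 3], sample_labels=[0, 1])
```

With m sampled points out of n, a pairwise similarity matrix needs memory on the order of m squared rather than n squared, and the extrapolation itself only needs nearest-neighbour lookups. A spatial index such as a k-d tree would replace the brute-force search for large point sets.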

[81] 3D Reconstruction from Sketches

Abhimanyu Talwar,Julien Laasri

Main category: cs.CV

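TL;DR: Proposes a pipeline for reconstructing a 3D scene from multiple sketches, consisting of sketch stitching, CycleGAN translation, and depth-map estimation. Stitching works poorly, but single-sketch reconstruction performs well.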

Details

Motivation: Address 3D scene reconstruction from sketches, in particular stitching multiple sketches and estimating depth from a single sketch. Method: (1) Stitch sketches using correspondence points; (2) convert the stitched sketch into a realistic image with a CycleGAN; (3) estimate the depth map with MegaDepth. Result: Stitching results are unsatisfactory, but single-sketch reconstruction performs well on a wide variety of drawings. Conclusion: The pipeline is effective for single-sketch reconstruction, but multi-sketch stitching needs further work. Abstract: We consider the problem of reconstructing a 3D scene from multiple sketches. We propose a pipeline which involves (1) stitching together multiple sketches through use of correspondence points, (2) converting the stitched sketch into a realistic image using a CycleGAN, and (3) estimating that image's depth-map using a pre-trained convolutional neural network based architecture called MegaDepth. Our contribution includes constructing a dataset of image-sketch pairs, the images for which are from the Zurich Building Database, and sketches have been generated by us. We use this dataset to train a CycleGAN for our pipeline's second step. We end up with a stitching process that does not generalize well to real drawings, but the rest of the pipeline that creates a 3D reconstruction from a single sketch performs quite well on a wide variety of drawings.

[82] A General Framework for Group Sparsity in Hyperspectral Unmixing Using Endmember Bundles

Gokul Bhusal,Yifei Lou,Cristina Garcia-Cardona,Ekaterina Merkurjev

Main category: cs.CV

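TL;DR: The paper proposes a group-sparsity-based hyperspectral unmixing method that represents materials with endmember bundles and introduces sparsity within and across groups (SWAG) and the TL1 penalty, improving model accuracy.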

Details

Motivation: Because of low spatial resolution, hyperspectral data often contains mixed signals from several materials, and the conventional linear mixing model cannot accurately represent within-class variability of materials, so new unmixing methods are needed. Method: Proposes an endmember-bundle framework that supports inter-group sparsity or sparsity within and across groups (SWAG), and introduces the TL1 penalty as a new regularizer. Result: Experiments on synthetic and real hyperspectral data show that the proposed approaches are effective and superior. Conclusion: By introducing endmember bundles and flexible sparsity penalties, the method performs well on hyperspectral unmixing and offers an effective solution to within-class material variability. Abstract: Due to low spatial resolution, hyperspectral data often consists of mixtures of contributions from multiple materials. This limitation motivates the task of hyperspectral unmixing (HU), a fundamental problem in hyperspectral imaging. HU aims to identify the spectral signatures (endmembers) of the materials present in an observed scene, along with their relative proportions (fractional abundance) in each pixel. A major challenge lies in the class variability in materials, which hinders accurate representation by a single spectral signature, as assumed in the conventional linear mixing model. To address this issue, we propose using group sparsity after representing each material with a set of spectral signatures, known as endmember bundles, where each group corresponds to a specific material. In particular, we develop a bundle-based framework that can enforce either inter-group sparsity or sparsity within and across groups (SWAG) on the abundance coefficients. Furthermore, our framework offers the flexibility to incorporate a variety of sparsity-promoting penalties, among which the transformed $\ell_1$ (TL1) penalty is a novel regularization in the HU literature. Extensive experiments conducted on both synthetic and real hyperspectral data demonstrate the effectiveness and superiority of the proposed approaches.
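For readers unfamiliar with the transformed l1 penalty, its usual scalar form is rho_a(x) = (a + 1)|x| / (a + |x|) with a > 0, which behaves like a count of non-zeros as a approaches 0 and like the l1 norm as a grows. The sketch below evaluates that form on a toy abundance vector. The parameter value and the function names are hypothetical, and how the paper applies the penalty within and across endmember groups is not reproduced.

```python
def tl1(x, a=1.0):
    """Transformed l1 penalty of a scalar: (a + 1)|x| / (a + |x|), with a > 0."""
    ax = abs(x)
    return (a + 1.0) * ax / (a + ax)


def tl1_penalty(abundances, a=1.0):
    """Sum of the scalar TL1 penalty over an abundance vector."""
    return sum(tl1(v, a) for v in abundances)


# A sparse and a dense abundance vector with the same l1 norm (both sum to 1).
sparse = [1.0, 0.0, 0.0, 0.0]
dense = [0.25, 0.25, 0.25, 0.25]
penalty_sparse = tl1_penalty(sparse, a=0.5)
penalty_dense = tl1_penalty(dense, a=0.5)
```

Plain l1 cannot tell these two vectors apart when abundances are constrained to sum to one, whereas TL1 assigns the sparse vector a lower penalty, which is one reason non-convex penalties of this kind are attractive for unmixing.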

[83] Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference

Tomer Gafni,Asaf Karnieli,Yair Hanani

Main category: cs.CV

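TL;DR: Proposes W4A8, a hardware-efficient quantized inference scheme that combines 4-bit integer weights with 8-bit floating-point computation, significantly improving speed and memory utilization, together with a Dual Precision Quantization (DPQ) algorithm that reduces accuracy loss.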

Details

Motivation: As tasks grow more complex, model sizes grow and create latency and memory-efficiency problems. Post-training quantization has emerged as a solution. Method: Uses a W4A8 scheme (4-bit integer weights, 8-bit floating-point computation) and develops the Dual Precision Quantization (DPQ) algorithm to reduce accuracy loss. Result: Experiments show significantly higher throughput with tolerable accuracy degradation. Conclusion: The scheme effectively balances performance and accuracy on a variety of modern accelerators. Abstract: Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme, where weights are quantized and stored using 4-bit integer precision, and inference computations are performed using 8-bit floating-point arithmetic, demonstrating significant speedups and improved memory utilization compared to 16-bit operations, applicable on various modern accelerators. To mitigate accuracy loss, we develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead. Experimental results demonstrate improved performance (i.e., increased throughput) while maintaining tolerable accuracy degradation relative to the full-precision model.
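The weight side of a W4A8 scheme can be illustrated with plain symmetric 4-bit quantization: weights are scaled so the largest magnitude maps to 7, rounded, and clamped to the signed 4-bit range [-8, 7]. This is a generic sketch with hypothetical function names. It uses ordinary Python floats in place of 8-bit floating point and does not implement the paper's DPQ algorithm.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map 4-bit integers back to real values before the (higher-precision) matmul."""
    return [v * scale for v in q]


weights = [1.75, -0.5, 0.0, 0.25]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Storing `q` takes 4 bits per weight plus one scale per tensor, and the rounding error of any in-range weight is at most half the scale. Real deployments typically use per-channel or per-group scales to keep that error small.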

[84] VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

Wentao Ma,Weiming Ren,Yiming Jia,Zhuofeng Li,Ping Nie,Ge Zhang,Wenhu Chen

Main category: cs.CV

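TL;DR: The paper argues that existing long video understanding (LVU) benchmarks are flawed: heavy reliance on multiple-choice questions (MCQs) and strong question priors distort evaluation results. The authors propose VideoEval-Pro, a more realistic LVU benchmark that evaluates models with open-ended questions.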

Details

Motivation: Evaluation results on existing LVU benchmarks are distorted by multiple-choice questions and question priors, and do not reflect models' true long-video understanding. Method: Proposes the VideoEval-Pro benchmark of open-ended short-answer questions, assessing both segment-level and full-video understanding. Result: Video LMMs drop sharply (>25%) on open-ended questions, and high MCQ scores do not imply high open-ended scores. VideoEval-Pro benefits more from increasing the number of input frames. Conclusion: VideoEval-Pro offers a more realistic and reliable LVU evaluation and a clearer measure of progress in the field. Abstract: Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sober lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated due to the possibility of guessing the correct answer; second, a significant portion of questions in these benchmarks have strong priors to allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy given a random frame from a long video on Video-MME. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark containing open-ended short-answer questions, which truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we report the following findings: (1) video LMMs show drastic performance drops (>25%) on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.
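The inflation from guessing can be made concrete with a standard back-of-the-envelope model: if a model truly knows a fraction p of answers and guesses uniformly among k options otherwise, its expected MCQ accuracy is p + (1 - p) / k. The sketch below computes this and its inverse. It is a textbook correction offered for intuition, with hypothetical function names, not the analysis used in the paper.

```python
def expected_mcq_accuracy(p_known, num_options):
    """Expected accuracy when unknown questions are answered by uniform guessing."""
    return p_known + (1.0 - p_known) / num_options


def guess_corrected(accuracy, num_options):
    """Invert the model above to estimate the fraction actually known."""
    chance = 1.0 / num_options
    return (accuracy - chance) / (1.0 - chance)


# A model that knows half the answers on a 4-option benchmark.
observed = expected_mcq_accuracy(0.5, 4)
recovered = guess_corrected(observed, 4)
```

Under this model a reported 62.5% on four-option questions corresponds to knowing only half of the answers, and a model that knows nothing still scores 25%. Open-ended short answers remove that floor.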

[85] CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation

Anna C. Doris,Md Ferdous Alam,Amin Heyrani Nobari,Faez Ahmed

Main category: cs.CV

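TL;DR: CAD-Coder is an open-source tool based on a vision-language model (VLM) that generates editable CAD code (CadQuery Python) directly from visual input, significantly improving the efficiency and accuracy of CAD model generation.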

Details

Motivation: Manual CAD modeling workflows are time-consuming and require expertise, and existing AI-driven CAD generation models are limited by incomplete representations of CAD operations, poor generalization, and low output accuracy. Method: Fine-tunes a VLM on the newly built GenCAD-Code dataset (over 163k pairs of CAD-model images and code) to generate editable CAD code. Result: CAD-Coder outperforms VLM baselines such as GPT-4.5 and Qwen2.5-VL-72B in valid syntax rate (100%) and 3D solid similarity, and can generate CAD code from real-world images. Conclusion: CAD-Coder demonstrates the potential of VLMs to streamline CAD workflows, giving engineers and designers an efficient tool. Abstract: Efficient creation of accurate and editable 3D CAD models is critical in engineering design, significantly impacting cost and time-to-market in product innovation. Current manual workflows remain highly time-consuming and demand extensive user expertise. While recent developments in AI-driven CAD generation show promise, existing models are limited by incomplete representations of CAD operations, inability to generalize to real-world images, and low output accuracy. This paper introduces CAD-Coder, an open-source Vision-Language Model (VLM) explicitly fine-tuned to generate editable CAD code (CadQuery Python) directly from visual input. Leveraging a novel dataset that we created--GenCAD-Code, consisting of over 163k CAD-model image and code pairs--CAD-Coder outperforms state-of-the-art VLM baselines such as GPT-4.5 and Qwen2.5-VL-72B, achieving a 100% valid syntax rate and the highest accuracy in 3D solid similarity. Notably, our VLM demonstrates some signs of generalizability, successfully generating CAD code from real-world images and executing CAD operations unseen during fine-tuning. The performance and adaptability of CAD-Coder highlight the potential of VLMs fine-tuned on code to streamline CAD workflows for engineers and designers. CAD-Coder is publicly available at: https://github.com/anniedoris/CAD-Coder.

[86] Beyond Words: Multimodal LLM Knows When to Speak

Zikai Liao,Yi Ouyang,Yi-Lun Lee,Chen-Ping Yu,Yi-Hsuan Tsai,Zhaozheng Yin

Main category: cs.CV

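TL;DR: The paper proposes MM-When2Speak, a multimodal LLM that predicts when and how to respond in a conversation, significantly outperforming unimodal and existing LLM baselines.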

Details

Motivation: Address the difficulty LLM chatbots have in timing their responses in real-time conversation, especially because they rely on text input and ignore multimodal contextual cues. Method: Builds a multimodal dataset combining visual, auditory, and textual information, and proposes MM-When2Speak, which integrates multimodal context to predict response timing and type. Result: Experiments show that MM-When2Speak achieves up to a 4x improvement in response timing accuracy over leading commercial LLMs. Conclusion: Multimodal input is essential for natural, timely conversational AI. Abstract: While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.

Yilin Ye,Junchao Huang,Xingchen Zeng,Jiazhi Xia,Wei Zeng

Main category: cs.CV

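TL;DR: AKRMap is a new dimensionality reduction technique for visualizing cross-modal embeddings. It improves accuracy by learning a kernel regression of the metric landscape in the projection space.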

Details

Motivation: Traditional dimensionality reduction methods such as PCA and t-SNE focus on single-modality feature distributions and fail to incorporate cross-modal metrics such as CLIPScore. Method: AKRMap builds a supervised projection network guided by a post-projection kernel regression loss and uses adaptive generalized kernels that are optimized jointly with the projection. Result: Experiments show that AKRMap produces more accurate and trustworthy visualizations than existing methods. Conclusion: AKRMap effectively visualizes cross-modal embeddings, supports interactive features, and is suited to comparing text-to-image models. Abstract: Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models. Code and demo are available at https://github.com/yilinye/AKRMap.
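Kernel regression in the projection space can be illustrated with the classic Nadaraya-Watson estimator: the metric value at any 2-D location is a kernel-weighted average of the metric values of the projected points around it. The sketch below uses a fixed-bandwidth Gaussian kernel and hypothetical names. AKRMap's adaptive generalized kernels and the jointly trained projection network are not reproduced.

```python
import math


def kernel_regress(query, points, values, bandwidth=1.0):
    """Nadaraya-Watson estimate of a metric at `query` from projected points."""
    weights = [
        math.exp(-math.dist(query, p) ** 2 / (2.0 * bandwidth ** 2))
        for p in points
    ]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Projected 2-D positions with a per-point metric (for example a similarity score).
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
scores = [0.2, 0.4, 0.9]
estimate = kernel_regress((0.5, 0.0), points, scores)
```

Evaluating this estimator on a grid gives the smooth "metric landscape" that a visualization can render as a heat map. A regression loss then compares such estimates with the true metric values of held-out points.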

[88] UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens

Ruichuan An,Sihan Yang,Renrui Zhang,Zijun Shen,Ming Lu,Gaole Dai,Hao Liang,Ziyu Guo,Shilin Yan,Yulin Luo,Bocheng Zou,Chaoqun Yang,Wentao Zhang

Main category: cs.CV

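TL;DR: UniCTokens proposes a unified concept-token framework that integrates personalized information into a vision language model to improve both understanding and generation.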

Details

Motivation: Existing methods separate concept understanding from generation, which limits generation from complex prompts. UniCTokens addresses this with unified tokens and a progressive training strategy. Method: UniCTokens trains unified concept tokens with a three-stage progressive training strategy: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation. Result: Experiments show that UniCTokens performs well on concept understanding, concept generation, and knowledge-driven generation, reaching state of the art on knowledge-driven generation. Conclusion: Enhanced understanding improves generation, and generation in turn benefits understanding. UniCTokens offers a new approach to personalized vision-language tasks. Abstract: Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. An example is, given the concept $\langle bo\rangle$, generating "$\langle bo\rangle$ wearing its hat" without additional textual descriptions of its hat. We call this kind of generation personalized knowledge-driven generation. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate the unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released at: https://github.com/arctanxarc/UniCTokens.

[89] Training-Free Watermarking for Autoregressive Image Generation

Yu Tong,Zihao Pan,Shuai Yang,Kaiyang Zhou

Main category: cs.CV

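TL;DR: Proposes IndexMark, a training-free watermarking framework for autoregressive image generation models. It embeds the watermark by exploiting codebook redundancy, leaves image quality unaffected, and achieves high verification accuracy with strong robustness to perturbations.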

Details

Motivation: Existing generative watermarking methods mainly target diffusion models; watermarking for autoregressive image generation models is underexplored, and this work aims to fill that gap. Method: Exploiting codebook redundancy, the authors design a match-then-replace method that selects watermark tokens and substitutes them for generated tokens, improve verification accuracy with an Index Encoder, and add an auxiliary validation scheme for robustness against cropping attacks. Result: Experiments show that IndexMark achieves state-of-the-art image quality and verification accuracy and is robust to cropping, noise, and other perturbations. Conclusion: IndexMark provides an efficient and robust watermarking solution for autoregressive image generation models. Abstract: Invisible image watermarking can protect image ownership and prevent malicious misuse of visual generative models. However, existing generative watermarking methods are mainly designed for diffusion models while watermarking for autoregressive image generation models remains largely underexplored. We propose IndexMark, a training-free watermarking framework for autoregressive image generation models. IndexMark is inspired by the redundancy property of the codebook: replacing autoregressively generated indices with similar indices produces negligible visual differences. The core component in IndexMark is a simple yet effective match-then-replace method, which carefully selects watermark tokens from the codebook based on token similarity, and promotes the use of watermark tokens through token replacement, thereby embedding the watermark without affecting the image quality. Watermark verification is achieved by calculating the proportion of watermark tokens in generated images, with precision further improved by an Index Encoder. Furthermore, we introduce an auxiliary validation scheme to enhance robustness against cropping attacks. Experiments demonstrate that IndexMark achieves state-of-the-art performance in terms of image quality and verification accuracy, and exhibits robustness against various perturbations, including cropping, noise, Gaussian blur, random erasing, color jittering, and JPEG compression.
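
The match-then-replace idea and the proportion-based verification described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the greedy cosine-similarity pairing, the seeded choice of one watermark index per pair, and all function names (`pair_codebook`, `choose_watermark_set`, `embed`, `watermark_rate`) are assumptions made for illustration, and the Index Encoder and cropping defence are omitted.

```python
import numpy as np

def pair_codebook(codebook):
    """Greedily pair each codebook index with a similar unpaired index (cosine similarity)."""
    normed = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # an index is never paired with itself
    rows, cols = np.unravel_index(np.argsort(-sim, axis=None), sim.shape)
    partner, free = {}, set(range(len(codebook)))
    for i, j in zip(rows.tolist(), cols.tolist()):  # most similar pairs first
        if i != j and i in free and j in free:
            partner[i], partner[j] = j, i
            free -= {i, j}
    return partner

def choose_watermark_set(partner, seed=0):
    """Hypothetical keyed choice: pick one index of every pair as the watermark token."""
    rng = np.random.default_rng(seed)
    return {(i if rng.random() < 0.5 else j) for i, j in partner.items() if i < j}

def embed(indices, partner, watermark):
    """Replace every non-watermark index with its (similar) watermark partner."""
    return [t if t in watermark else partner.get(t, t) for t in indices]

def watermark_rate(indices, watermark):
    """Verification statistic: proportion of watermark tokens in the sequence."""
    return sum(t in watermark for t in indices) / len(indices)

rng = np.random.default_rng(42)
codebook = rng.normal(size=(8, 4))           # toy codebook: 8 entries, 4 dimensions
partner = pair_codebook(codebook)
watermark = choose_watermark_set(partner, seed=7)
tokens = rng.integers(0, 8, size=32).tolist()
marked = embed(tokens, partner, watermark)
print(watermark_rate(tokens, watermark), watermark_rate(marked, watermark))
```

In this toy setting the watermarked sequence consists entirely of watermark tokens, while an unmarked sequence is expected to contain roughly half; a detector would threshold that proportion.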

[90] Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

Jiaer Xia,Yuhang Zang,Peng Gao,Yixuan Li,Kaiyang Zhou

Main category: cs.CV

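TL;DR: The paper studies how to train visual language models (VLMs) to reason over images with reinforcement learning, without explicit chain-of-thought supervision. It finds that applying reinforcement learning directly leads the model to learn shortcuts, and proposes interpreting the image before reasoning (a caption-reason-answer format) to improve generalization. The resulting model, Visionary-R1, outperforms mainstream multimodal models on several visual reasoning benchmarks.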

Details

Motivation: General-purpose reasoning remains a challenge in AI; the work explores how to train VLMs to reason over images with reinforcement learning while avoiding explicit chain-of-thought supervision. Method: The VLM is trained with reinforcement learning to follow a caption-reason-answer output format: it first generates a detailed image caption and then builds a reasoning chain. Training uses 273K visual question-answer pairs without chain-of-thought annotations. Result: Visionary-R1 outperforms strong multimodal models such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro on multiple visual reasoning benchmarks. Conclusion: Interpreting the image before reasoning effectively mitigates shortcut learning and improves generalization, and reinforcement learning can train a high-performing visual reasoning model without explicit supervision. Abstract: Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in large language models (LLMs), such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM -- by prompting the model to produce a reasoning chain before providing an answer -- can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to encourage the model to interpret images prior to reasoning. Therefore, we train the model to adhere to a caption-reason-answer output format: initially generating a detailed caption for an image, followed by constructing an extensive reasoning chain. When trained on 273K CoT-free visual question-answer pairs and using only reinforcement learning, our model, named Visionary-R1, outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro, on multiple visual reasoning benchmarks.
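
One way to picture the caption-reason-answer format is as a structural check on the model output that a rule-based reward could use. The sketch below is an assumption-laden illustration: the abstract only fixes the order of the three parts, so the tag names (`<caption>`, `<reason>`, `<answer>`) and the binary reward are hypothetical, not the paper's actual reward design.

```python
import re

# Hypothetical tags; only the caption -> reason -> answer ordering comes from the abstract.
OUTPUT_PATTERN = re.compile(
    r"\s*<caption>(?P<caption>.+?)</caption>\s*"
    r"<reason>(?P<reason>.+?)</reason>\s*"
    r"<answer>(?P<answer>.+?)</answer>\s*",
    re.DOTALL,
)

def parse_output(text):
    """Return the three sections as a dict, or None if the format is violated."""
    match = OUTPUT_PATTERN.fullmatch(text)
    return match.groupdict() if match else None

def format_reward(text):
    """Binary format reward: 1.0 when all three sections appear in order, else 0.0."""
    return 1.0 if parse_output(text) is not None else 0.0

good = (
    "<caption>Two red apples and two green apples on a table.</caption>\n"
    "<reason>There are 2 red and 2 green apples, so 2 + 2 = 4.</reason>\n"
    "<answer>4</answer>"
)
shortcut = "<reason>Probably 4.</reason><answer>4</answer>"  # skips the caption

print(format_reward(good), format_reward(shortcut))
```

A response that jumps straight to reasoning fails the check, which is the behaviour the format is meant to discourage.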

[91] UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

Rui Tian,Mingfei Gao,Mingze Xu,Jiaming Hu,Jiasen Lu,Zuxuan Wu,Yinfei Yang,Afshin Dehghan

Main category: cs.CV

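TL;DR: UniGen is a unified multimodal large language model capable of image understanding and generation. With a data-centric training pipeline and a new Chain-of-Thought Verification strategy, UniGen reaches leading results on several benchmarks.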

Details

Motivation: The work aims to build a unified MLLM, address key challenges in image understanding and generation, and point to directions for future research. Method: The training pipeline combines multi-stage pre-training, supervised fine-tuning, and direct preference optimization, and a CoT-V strategy is proposed to improve generation quality. Result: UniGen scores 0.78 on GenEval and 85.19 on DPG-Bench. Conclusion: Through a comprehensive training strategy and a new test-time method, UniGen offers practical insights and future directions for developing unified MLLMs. Abstract: We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions to future research.
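
The Best-of-N test-time strategy mentioned in the abstract amounts to sampling several candidates and keeping the one the verifier scores highest. The sketch below shows only that selection loop; `toy_generate` and `toy_verify` are hypothetical stand-ins for the model acting as generator and as CoT-style verifier, and are not part of UniGen.

```python
def best_of_n(prompt, generate, verify, n=4):
    """Sample n candidates and return the one with the highest verifier score."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    scores = [verify(prompt, candidate) for candidate in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]

def toy_generate(prompt, seed):
    """Stand-in generator: a real system would return an image for this seed."""
    return {"prompt": prompt, "seed": seed}

def toy_verify(prompt, candidate):
    """Stand-in verifier: pretends the sample from seed 2 matches the prompt best."""
    return 1.0 - abs(candidate["seed"] - 2) / 10

image, score = best_of_n("a red cube on a blue sphere", toy_generate, toy_verify, n=4)
print(image["seed"], score)
```

In the setting the abstract describes, the same model would play both roles, so the verifier score would come from step-by-step checks of prompt-image alignment rather than from a fixed rule.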

[92] Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng,Deyao Zhu,Kunchang Li,Chenhui Gou,Feng Li,Zeyu Wang,Shu Zhong,Weihao Yu,Xiaonan Nie,Ziang Song,Guang Shi,Haoqi Fan

Main category: cs.CV

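TL;DR: BAGEL is an open-source foundational model that supports multimodal understanding and generation. Pretrained on large-scale multimodal data, it performs strongly on complex reasoning tasks.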

Details

Motivation: There is a need to unify multimodal understanding and generation and to advance open-source research in the multimodal field. Method: A unified decoder-only model is pretrained on large-scale text, image, video, and web data. Result: BAGEL significantly outperforms other open-source models on standard benchmarks and shows complex reasoning abilities. Conclusion: BAGEL opens new opportunities for multimodal research, and its code and model checkpoints are released. Abstract: Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

[93] Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers

Sucheng Ren,Qihang Yu,Ju He,Alan Yuille,Liang-Chieh Chen

Main category: cs.CV

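TL;DR: GRAT is a training-free method for accelerating attention in Diffusion Transformers. By grouping tokens and restricting attention to structured regions, it substantially speeds up image and video generation while maintaining output quality.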

Details

Motivation: The high computational cost of Diffusion Transformers limits practical deployment; generating high-resolution images, for example, takes too long. Method: GRAT groups contiguous tokens and restricts the attended region, exploiting GPU parallelism and the sparsity of attention maps to reduce computation. Result: GRAT achieves a 35.8x speedup when generating 8192x8192 images and maintains performance without fine-tuning. Conclusion: GRAT offers an effective way to accelerate Diffusion Transformers and may advance research on scalable visual generation. Abstract: Diffusion-based Transformers have demonstrated impressive generative capabilities, but their high computational costs hinder practical deployment; for example, generating an $8192\times 8192$ image can take over an hour on an A100 GPU. In this work, we propose GRAT (GRouping first, ATtending smartly), a training-free attention acceleration strategy for fast image and video generation without compromising output quality. The key insight is to exploit the inherent sparsity in learned attention maps (which tend to be locally focused) in pretrained Diffusion Transformers and leverage better GPU parallelism. Specifically, GRAT first partitions contiguous tokens into non-overlapping groups, aligning both with GPU execution patterns and the local attention structures learned in pretrained generative Transformers. It then accelerates attention by having all query tokens within the same group share a common set of attendable key and value tokens. These key and value tokens are further restricted to structured regions, such as surrounding blocks or criss-cross regions, significantly reducing computational overhead (e.g., attaining a 35.8$\times$ speedup over full attention when generating $8192\times 8192$ images) while preserving essential attention patterns and long-range context. We validate GRAT on pretrained Flux and HunyuanVideo for image and video generation, respectively. In both cases, GRAT achieves substantially faster inference without any fine-tuning, while maintaining the performance of full attention. We hope GRAT will inspire future research on accelerating Diffusion Transformers for scalable visual generation.
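
The grouping idea (all queries in a group share one set of attendable keys, restricted to nearby blocks) can be sketched on a 1D token sequence. This is a simplified illustration under stated assumptions, not GRAT itself: the paper works with image and video token layouts and a fused GPU kernel, whereas the dense boolean mask below only shows the attention pattern and saves no computation. The names `group_mask` and `masked_attention` and the `radius` parameter are hypothetical.

```python
import numpy as np

def group_mask(n_tokens, group_size, radius):
    """Boolean mask: a query may attend to keys whose group is within `radius` groups."""
    groups = np.arange(n_tokens) // group_size
    return np.abs(groups[:, None] - groups[None, :]) <= radius

def masked_attention(q, k, v, mask):
    """Softmax attention in which disallowed query-key pairs get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
mask = group_mask(n_tokens=16, group_size=4, radius=1)
out = masked_attention(q, k, v, mask)
print(out.shape, mask.mean())  # fraction of query-key pairs that are kept
```

Because every query in a group shares the same mask row, a real kernel can load that group's keys and values once and skip the masked blocks entirely, which is where the speedup would come from.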

cs.GR [Back]

[94] FreeMesh: Boosting Mesh Generation with Coordinates Merging

Jian Liu,Haohan Weng,Biwen Lei,Xianghui Yang,Zibo Zhao,Zhuo Chen,Song Guo,Tao Han,Chunchao Guo

Main category: cs.GR

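TL;DR: The paper proposes a new metric, PTME, and a coordinate merging technique for evaluating and improving mesh tokenization methods, raising compression ratios without any training.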

Details

Motivation: Existing mesh generation methods lack a criterion for evaluating tokenization efficiency, which limits the optimization and development of mesh tokenizers. Method: The PTME metric is introduced to evaluate tokenization methods, and a coordinate merging technique is proposed to improve compression ratios. Result: Experiments validate PTME and coordinate merging across several tokenization methods. Conclusion: PTME and coordinate merging can improve existing mesh tokenizers and guide the future development of mesh generation. Abstract: The next-coordinate prediction paradigm has emerged as the de facto standard in current auto-regressive mesh generation methods. Despite their effectiveness, there is no efficient measurement for the various tokenizers that serialize meshes into sequences. In this paper, we introduce a new metric, Per-Token-Mesh-Entropy (PTME), to evaluate the existing mesh tokenizers theoretically without any training. Building upon PTME, we propose a plug-and-play tokenization technique called coordinate merging. It further improves the compression ratios of existing tokenizers by rearranging and merging the most frequent patterns of coordinates. Through experiments on various tokenization methods like MeshXL, MeshAnything V2, and Edgerunner, we further validate the performance of our method. We hope that the proposed PTME and coordinate merging can enhance the existing mesh tokenizers and guide the further development of native mesh generation.
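
Merging the most frequent coordinate patterns is reminiscent of byte-pair-style merging, and that analogy is what the sketch below illustrates. It is an assumption, not the paper's algorithm: the abstract does not specify how patterns are found or rearranged, the PTME formula is not given here and is not reproduced, and `merge_most_frequent_pair` and the token id `1000` are hypothetical.

```python
from collections import Counter

def merge_most_frequent_pair(seq, new_token):
    """Replace every occurrence of the most frequent adjacent pair with `new_token`."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return list(seq), None
    (a, b), _count = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            merged.append(new_token)  # one token now stands for the pair (a, b)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged, (a, b)

# Toy sequence of quantized coordinates; the pair (12, 7) occurs three times.
coords = [12, 7, 3, 12, 7, 9, 12, 7, 3]
merged, pair = merge_most_frequent_pair(coords, new_token=1000)
print(pair, merged, len(coords) / len(merged))
```

Each merge shortens the sequence at the cost of one extra vocabulary entry, which is the trade-off a compression-ratio metric would need to weigh.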

[95] Large-Scale Multi-Character Interaction Synthesis

Ziyi Chang,He Wang,George Alex Koulieris,Hubert P. H. Shum

Main category: cs.GR

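TL;DR: The paper proposes a method for generating large-scale multi-character interaction animation, addressing the lack of data and the difficulty of transition planning in existing methods.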

Details

Motivation: Multi-character interaction is important in character animation, but existing methods fall short in interaction synthesis and transition planning, particularly because data is scarce and transitions among dense interactions are hard to plan. Method: A conditional generative pipeline is proposed, comprising a coordinatable multi-character interaction space for interaction synthesis and a transition planning network for coordination. Result: Experiments demonstrate the effectiveness of the method for multi-character interaction synthesis and show its scalability and transferability. Conclusion: The method provides an effective solution for multi-character interaction animation, addressing both the lack of data and the transition planning challenge. Abstract: Generating large-scale multi-character interactions is a challenging and important task in character animation. Multi-character interactions involve not only natural interactive motions but also characters coordinated with each other for transition. For example, a dance scenario involves characters dancing with partners and also characters coordinated to new partners based on spatial and temporal observations. We term such transitions coordinated interactions and decompose them into interaction synthesis and transition planning. Previous methods of single-character animation do not consider interactions that are critical for multiple characters. Deep-learning-based interaction synthesis usually focuses on two characters and does not consider transition planning. Optimization-based interaction synthesis relies on manually designing objective functions that may not generalize well. While crowd simulation involves more characters, their interactions are sparse and passive. We identify two challenges to multi-character interaction synthesis, including the lack of data and the planning of transitions among close and dense interactions. Existing datasets either do not have multiple characters or do not have close and dense interactions. The planning of transitions for multi-character close and dense interactions needs both spatial and temporal considerations. We propose a conditional generative pipeline comprising a coordinatable multi-character interaction space for interaction synthesis and a transition planning network for coordination. Our experiments demonstrate the effectiveness of our proposed pipeline for multi-character interaction synthesis, and the applications facilitated by our method show its scalability and transferability.

Yue Fei,Jingjing Liu,Yuyou Yao,Wenming Wu,Liping Zheng

Main category: cs.GR

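TL;DR: Proposes a CVT-based surface remeshing method that balances mesh quality and computational efficiency by clipping 3D Centroidal Voronoi cells multiple times with curvature-adaptive original surface facets.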

Details

Motivation: Existing CVT methods trade quality against computational efficiency; a solution is needed that preserves quality while lowering computational cost. Method: The number of clipping times is adjusted adaptively according to curvature, with the angle between the normal vectors of neighboring facets used to represent local curvature magnitude, in order to optimize vertex distribution. Result: Experiments show that the method strikes a balance between quality and efficiency. Conclusion: The curvature-adaptive strategy effectively addresses the quality-efficiency problem in CVT remeshing. Abstract: CVT (Centroidal Voronoi Tessellation)-based remeshing optimizes mesh quality by leveraging the Voronoi-Delaunay framework to optimize vertex distribution and produce uniformly distributed vertices with regular triangles. Current CVT-based approaches can be classified into two categories: (1) exact methods (e.g., Geodesic CVT, Restricted Voronoi Diagrams) that ensure high quality but require significant computation; and (2) approximate methods that try to reduce computational complexity yet yield only fair quality. To address this trade-off, we propose a CVT-based surface remeshing approach that achieves balanced optimization between quality and efficiency through multiple clipping times of 3D Centroidal Voronoi cells with curvature-adaptive original surface facets. The core idea of the method is that we adaptively adjust the number of clipping times according to local curvature, and use the angular relationship between the normal vectors of neighboring facets to represent the magnitude of local curvature. Experimental results demonstrate the effectiveness of our method.
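
The curvature proxy described above (the angle between normals of neighboring facets) is easy to compute, and a sketch may help make it concrete. The normal and angle computations are standard geometry; the mapping from angle to a clipping count in `clipping_times`, including its linear form and the `min_clips`/`max_clips` bounds, is a hypothetical placeholder, since the abstract does not state the authors' rule.

```python
import numpy as np

def facet_normal(a, b, c):
    """Unit normal of the triangle (a, b, c)."""
    n = np.cross(b - a, c - a)
    return n / np.linalg.norm(n)

def normal_angle(n1, n2):
    """Angle in radians between two unit normals; larger means higher local curvature."""
    return float(np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0)))

def clipping_times(angle, min_clips=1, max_clips=4):
    """Hypothetical rule: scale the clip count linearly with the angle, saturating at 90 degrees."""
    t = min(angle / (np.pi / 2), 1.0)
    return int(round(min_clips + t * (max_clips - min_clips)))

p = [np.array(v, dtype=float) for v in
     [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1)]]
base = facet_normal(p[0], p[1], p[2])        # triangle in the z = 0 plane
coplanar = facet_normal(p[1], p[3], p[2])    # neighbor in the same plane
folded = facet_normal(p[0], p[1], p[4])      # neighbor folded up by 90 degrees

print(clipping_times(normal_angle(base, coplanar)),
      clipping_times(normal_angle(base, folded)))
```

Flat regions get the minimum number of clips and sharply folded regions the maximum, which matches the stated goal of spending computation where curvature is high.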

cs.CL [Back]

[97] Evaluating Reasoning LLMs for Suicide Screening with the Columbia-Suicide Severity Rating Scale

Avinash Patil,Siru Tao,Amardeep Gedhu

Main category: cs.CL

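TL;DR: The study evaluates large language models (LLMs) on suicide risk assessment and finds that Claude and GPT align most closely with human annotations, while Mistral has the lowest ordinal prediction error.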

Details

Motivation: Online platforms such as Reddit's r/SuicideWatch support people with suicidal ideation, but LLMs may become a new place for such disclosures, so their risk assessment ability needs to be evaluated. Method: The zero-shot performance of six LLMs (including Claude, GPT, Mistral, and LLaMA) is evaluated using the Columbia-Suicide Severity Rating Scale (C-SSRS). Result: Claude and GPT come close to human annotations, Mistral has the lowest ordinal prediction error, and models usually misclassify between adjacent severity levels. Conclusion: The study stresses the importance of human oversight, transparency, and cautious deployment; code and supplementary materials are open source. Abstract: Suicide prevention remains a critical public health challenge. While online platforms such as Reddit's r/SuicideWatch have historically provided spaces for individuals to express suicidal thoughts and seek community support, the advent of large language models (LLMs) introduces a new paradigm in which individuals may begin disclosing ideation to AI systems instead of humans. This study evaluates the capability of LLMs to perform automated suicide risk assessment using the Columbia-Suicide Severity Rating Scale (C-SSRS). We assess the zero-shot performance of six models (including Claude, GPT, Mistral, and LLaMA) in classifying posts across a 7-point severity scale (Levels 0-6). Results indicate that Claude and GPT closely align with human annotations, while Mistral achieves the lowest ordinal prediction error. Most models exhibit ordinal sensitivity, with misclassifications typically occurring between adjacent severity levels. We further analyze confusion patterns, misclassification sources, and ethical considerations, underscoring the importance of human oversight, transparency, and cautious deployment. Full code and supplementary materials are available at https://github.com/av9ash/llm_cssrs_code.
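
For a 7-point ordinal scale, the distinction between exact agreement, ordinal error, and adjacent-level misclassification can be made concrete with a few lines of code. The sketch assumes mean absolute error as the ordinal error measure; the abstract does not say which metric the authors use, and the labels below are made-up toy values, not data from the study.

```python
def ordinal_metrics(y_true, y_pred):
    """Exact accuracy, mean absolute error, and share of errors that are off by one level."""
    diffs = [abs(t - p) for t, p in zip(y_true, y_pred)]
    errors = [d for d in diffs if d > 0]
    return {
        "accuracy": sum(d == 0 for d in diffs) / len(diffs),
        "mae": sum(diffs) / len(diffs),  # assumed stand-in for "ordinal prediction error"
        "adjacent_error_share": (
            sum(d == 1 for d in errors) / len(errors) if errors else 0.0
        ),
    }

# Toy severity levels on the 0-6 scale (illustrative only).
human = [0, 2, 4, 6, 3]
model = [0, 3, 4, 4, 2]
print(ordinal_metrics(human, model))
```

Two models with the same accuracy can differ sharply on the other two numbers: one that errs by a single level is far less worrying in this setting than one that confuses level 0 with level 6.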

[98] EmoMeta: A Multimodal Dataset for Fine-grained Emotion Classification in Chinese Metaphors

Xingyuan Lu,Yuxi Liu,Dongyu Zhang,Zhiyao Wu,Jing Ren,Feng Xia

Main category: cs.CL

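TL;DR: The paper presents a Chinese multimodal metaphorical advertisement dataset (EmoMeta) of 5,000 text-image pairs annotated for metaphor occurrence, domain relations, and fine-grained emotion classification, filling a gap in multimodal metaphor emotion datasets.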

Details

Motivation: Multimodal metaphors are increasingly important in emotional expression, yet related research is scarce and concentrated on English, overlooking differences across languages. Method: A Chinese multimodal metaphorical advertisement dataset is constructed and annotated for metaphor occurrence, domain relations, and 10 fine-grained emotions. Result: The dataset is publicly available (https://github.com/DUTIR-YSQ/EmoMeta) and supports further research on multimodal metaphor emotion classification. Conclusion: The dataset is an important resource for multimodal metaphor emotion research and advances cross-lingual emotion analysis. Abstract: Metaphors play a pivotal role in expressing emotions, making them crucial for emotional intelligence. The advent of multimodal data and widespread communication has led to a proliferation of multimodal metaphors, amplifying the complexity of emotion classification compared to single-mode scenarios. However, the scarcity of research on constructing multimodal metaphorical fine-grained emotion datasets hampers progress in this domain. Moreover, existing studies predominantly focus on English, overlooking potential variations in emotional nuances across languages. To address these gaps, we introduce a multimodal dataset in Chinese comprising 5,000 text-image pairs of metaphorical advertisements. Each entry is meticulously annotated for metaphor occurrence, domain relations and fine-grained emotion classification encompassing joy, love, trust, fear, sadness, disgust, anger, surprise, anticipation, and neutral. Our dataset is publicly accessible (https://github.com/DUTIR-YSQ/EmoMeta), facilitating further advancements in this burgeoning field.

[99] Detecting Prefix Bias in LLM-based Reward Models

Ashwin Kumar,Yuzi He,Aram H. Markosyan,Bobbie Chern,Imanol Arrieta-Ibarra

Main category: cs.CL

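TL;DR: The paper studies bias in reward models used for reinforcement learning with human feedback (RLHF), in particular prefix bias, and proposes a data augmentation method to mitigate it.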

Details

Motivation: The work explores possible biases in RLHF reward models, especially systematic preference shifts triggered by minor changes in query prefixes (prefix bias), to support fair and reliable AI. Method: New methods are introduced to detect and evaluate prefix bias, its effect along racial and gender dimensions is analyzed, and a data augmentation strategy is proposed to mitigate it. Result: Prefix bias is found to be widespread in reward models regardless of model architecture, and data augmentation effectively reduces its impact. Conclusion: Designing fair and reliable reward models requires bias-aware dataset design and evaluation; the work is a useful reference for AI fairness research. Abstract: Reinforcement Learning with Human Feedback (RLHF) has emerged as a key paradigm for task-specific fine-tuning of language models using human preference data. While numerous publicly available preference datasets provide pairwise comparisons of responses, the potential for biases in the resulting reward models remains underexplored. In this work, we introduce novel methods to detect and evaluate prefix bias -- a systematic shift in model preferences triggered by minor variations in query prefixes -- in LLM-based reward models trained on such datasets. We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions. Our comprehensive evaluation spans diverse open-source preference datasets and reward model architectures, demonstrating susceptibility to this kind of bias regardless of the underlying model architecture. Furthermore, we propose a data augmentation strategy to mitigate these biases, showing its effectiveness in reducing the impact of prefix bias. Our findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models, contributing to the broader discourse on fairness in AI.
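
A simple way to probe for the kind of prefix bias defined in the abstract is to hold the response pairs fixed, vary only the query prefix, and see whether the reward model's preferences move. The sketch below is one such probe under stated assumptions; it is not the paper's metric. `win_rate`, `prefix_gap`, the placeholder prefixes, and both toy reward functions are hypothetical.

```python
def win_rate(reward_fn, prefix, pairs):
    """Fraction of (query, resp_a, resp_b) triples where resp_a is preferred under `prefix`."""
    wins = sum(
        reward_fn(prefix + query, a) > reward_fn(prefix + query, b)
        for query, a, b in pairs
    )
    return wins / len(pairs)

def prefix_gap(reward_fn, prefixes, pairs):
    """Per-prefix win rates and the spread between them; 0.0 means no measured shift."""
    rates = {p: win_rate(reward_fn, p, pairs) for p in prefixes}
    return rates, max(rates.values()) - min(rates.values())

def biased_reward(query, response):
    """Toy reward whose length preference flips with the prefix (deliberately biased)."""
    sign = 1 if query.startswith("As a member of group X, ") else -1
    return sign * len(response)

def neutral_reward(query, response):
    """Toy reward that ignores the query entirely."""
    return len(response)

pairs = [("how do I reset my password?", "Short.", "A much longer answer."),
         ("what is RLHF?", "Brief.", "A detailed explanation.")]
prefixes = ["As a member of group X, ", "As a member of group Y, "]

print(prefix_gap(biased_reward, prefixes, pairs))
print(prefix_gap(neutral_reward, prefixes, pairs))
```

With a real reward model, `reward_fn` would wrap a scoring call, and the prefixes would be drawn from the demographic variations under study.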

[100] Source framing triggers systematic evaluation bias in Large Language Models

Federico Germani,Giovanni Spitale

Main category: cs.CL

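TL;DR: The study examines consistency and bias in large language models (LLMs) used as text evaluators. Inter- and intra-model agreement is high in the blind condition, but source attribution significantly changes the results, with a marked bias against statements attributed to Chinese authors.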

Details

Motivation: To assess whether LLM judgments in text evaluation are consistent, unbiased, and robust to framing effects. Method: Four state-of-the-art LLMs evaluate 4,800 narrative statements while the disclosed source (another LLM or a human author of specified nationality) is manipulated, and inter- and intra-model agreement is analyzed. Result: Models agree strongly under blind evaluation, but source attribution (especially to Chinese authors) significantly lowers agreement, with Deepseek Reasoner affected most. Conclusion: Framing effects strongly influence LLM text evaluation, challenging the integrity, neutrality, and fairness of LLM-mediated information systems. Abstract: Large Language Models (LLMs) are increasingly used not only to generate text but also to evaluate it, raising urgent questions about whether their judgments are consistent, unbiased, and robust to framing effects. In this study, we systematically examine inter- and intra-model agreement across four state-of-the-art LLMs (OpenAI o3-mini, Deepseek Reasoner, xAI Grok 2, and Mistral) tasked with evaluating 4,800 narrative statements on 24 different topics of social, political, and public health relevance, for a total of 192,000 assessments. We manipulate the disclosed source of each statement to assess how attribution to either another LLM or a human author of specified nationality affects evaluation outcomes. We find that, in the blind condition, different LLMs display a remarkably high degree of inter- and intra-model agreement across topics. However, this alignment breaks down when source framing is introduced. Here we show that attributing statements to Chinese individuals systematically lowers agreement scores across all models, and in particular for Deepseek Reasoner. Our findings reveal that framing effects can deeply affect text evaluation, with significant implications for the integrity, neutrality, and fairness of LLM-mediated information systems.

[101] ProdRev: A DNN framework for empowering customers using generative pre-trained transformers

Aakash Gupta,Nataraj Das

Main category: cs.CL

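TL;DR: The paper proposes a framework based on a generative pre-trained transformer for better understanding and summarizing e-commerce product reviews, helping users decide quickly.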

Details

Motivation: Reliance on e-commerce grew during the pandemic, but the sheer volume of reviews can cause decision paralysis. Existing tools can produce scores but lack deeper understanding. Method: The model is fine-tuned using the Curie engine of GPT-3 and applies abstractive rather than simple extractive summarization, introducing "common sense" to support decisions. Result: The model produces summaries of the pros and cons found in reviews, helping users understand them quickly and decide. Conclusion: Generative summarization makes review analysis deeper and more useful and strengthens users' ability to make decisions. Abstract: Following the pandemic, customers' preference for using e-commerce has accelerated. Since much information is available in multiple reviews (sometimes running into the thousands) for a single product, it can create decision paralysis for the buyer. This scenario disempowers the consumer, who cannot be expected to go over so many reviews since doing so is time-consuming and can confuse them. Various commercial tools are available that use a scoring mechanism to arrive at an adjusted score, which can alert the user to potential review manipulations. This paper proposes a framework that fine-tunes a generative pre-trained transformer to understand these reviews better and to use "common sense" to make better decisions. These models have more than 13 billion parameters. To fine-tune the model for our requirement, we use the Curie engine from the generative pre-trained transformer (GPT-3). By using generative models, we introduce abstractive summarization instead of a simple extractive method of summarizing the reviews. This brings out the true relationship between the reviews rather than simply copying and pasting. This introduces an element of "common sense" for the user and helps them quickly make the right decisions. The user is provided with the pros and cons of the processed reviews. Thus the user/customer can make their own decisions.

[102] LLM4CD: Leveraging Large Language Models for Open-World Knowledge Augmented Cognitive Diagnosis

Weiming Zhang,Lingyue Fu,Qingyao Li,Kounianhua Du,Jianghao Lin,Jingwei Yu,Wei Xia,Weinan Zhang,Ruiming Tang,Yong Yu

Main category: cs.CL

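TL;DR: LLM4CD uses the open-world knowledge of large language models to enhance cognitive diagnosis, replacing traditional ID embeddings with semantic representations to address the cold-start problem.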

Details

Motivation: Current cognitive diagnosis methods model only ID relationships, ignore the semantic information in educational data, and struggle with newly added students and exercises. Method: LLM4CD uses LLMs to construct cognitively expressive textual representations and models students' test histories with a bi-level encoder framework (a macro-level cognitive text encoder and a micro-level knowledge state encoder). Result: Experiments show that LLM4CD outperforms existing models on multiple real-world datasets, validating the benefit of introducing semantic information. Conclusion: Through open-world knowledge and semantic representations, LLM4CD significantly improves the performance and adaptability of cognitive diagnosis. Abstract: Cognitive diagnosis (CD) plays a crucial role in intelligent education, evaluating students' comprehension of knowledge concepts based on their test histories. However, current CD methods often model students, exercises, and knowledge concepts solely on their ID relationships, neglecting the abundant semantic relationships present within educational data space. Furthermore, contemporary intelligent tutoring systems (ITS) frequently involve the addition of new students and exercises, a situation that ID-based methods find challenging to manage effectively. The advent of large language models (LLMs) offers the potential for overcoming this challenge with open-world knowledge. In this paper, we propose LLM4CD, which Leverages Large Language Models for Open-World Knowledge Augmented Cognitive Diagnosis. Our method utilizes the open-world knowledge of LLMs to construct cognitively expressive textual representations, which are then encoded to introduce rich semantic information into the CD task. Additionally, we propose an innovative bi-level encoder framework that models students' test histories through two levels of encoders: a macro-level cognitive text encoder and a micro-level knowledge state encoder. This approach substitutes traditional ID embeddings with semantic representations, enabling the model to accommodate new students and exercises with open-world knowledge and address the cold-start problem. Extensive experimental results demonstrate that our proposed method consistently outperforms previous CD models on multiple real-world datasets, validating the effectiveness of leveraging LLMs to introduce rich semantic information into the CD task.

Khanh-Tung Tran,Barry O'Sullivan,Hoang D. Nguyen

Main category: cs.CL

TL;DR: IRLBench is a new multilingual benchmark in parallel English and Irish for evaluating LLMs in low-resource and culturally diverse settings.

Details

Motivation: Existing benchmarks exhibit cultural bias, are restricted to text-only evaluation, rely on multiple-choice formats, and offer little support for extremely low-resource languages. Method: Twelve representative subjects are developed from the 2024 Irish Leaving Certificate exams, using long-form generation tasks and the official marking scheme. Result: Experiments show that LLMs perform markedly worse in Irish than in English: the best-performing model answers correctly 55.8% of the time in Irish versus 76.2% in English. Conclusion: IRLBench provides a tool for future multilingual AI research and highlights the importance of cultural awareness and language fidelity. Abstract: Recent advances in Large Language Models (LLMs) have demonstrated promising knowledge and reasoning abilities, yet their performance in multilingual and low-resource settings remains underexplored. Existing benchmarks often exhibit cultural bias, restrict evaluation to text-only, rely on multiple-choice formats, and, more importantly, are limited for extremely low-resource languages. To address these gaps, we introduce IRLBench, presented in parallel English and Irish, which is considered definitely endangered by UNESCO. Our benchmark consists of 12 representative subjects developed from the 2024 Irish Leaving Certificate exams, enabling fine-grained analysis of model capabilities across domains. By framing the task as long-form generation and leveraging the official marking scheme, it supports a comprehensive evaluation of not only correctness but also language fidelity. Our extensive experiments on leading closed-source and open-source LLMs reveal a persistent performance gap between English and Irish, in which models produce valid Irish responses less than 80% of the time, and answer correctly 55.8% of the time compared to 76.2% in English for the best-performing model. We release IRLBench (https://huggingface.co/datasets/ReliableAI/IRLBench) and an accompanying evaluation codebase (https://github.com/ReML-AI/IRLBench) to enable future research on robust, culturally aware multilingual AI development.

[104] Noise Injection Systemically Degrades Large Language Model Safety Guardrails

Prithviraj Singh Shahani,Matthias Scheutz

Main category: cs.CL

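TL;DR: The study shows that current LLM safety guardrails are notably fragile under noise perturbation: Gaussian noise significantly raises harmful-output rates, and deeper safety fine-tuning provides no extra protection.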

Details

Motivation: To examine how robust LLM safety fine-tuning is under noise perturbation and to expose potential weaknesses in current safety alignment techniques. Method: Gaussian noise is systematically injected into model activations, and the safety performance of several open-weight models is evaluated. Result: Gaussian noise raises harmful-output rates by up to 27% (p < 0.001), deeper safety fine-tuning adds no extra protection, and chain-of-thought reasoning remains largely intact. Conclusion: Current safety alignment techniques have critical vulnerabilities; reasoning-based and reinforcement learning approaches may be a direction for more robust AI safety. Abstract: Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs. Yet, their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates (p < 0.001) by up to 27%, (2) deeper safety fine-tuning affords no extra protection, and (3) chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight the potential of reasoning-based and reinforcement learning approaches as a promising direction for developing more robust AI safety systems. These results have important implications for real-world deployment of LLMs in safety-critical applications, as they imply that widely deployed safety tuning methods can fail even without adversarial prompts.
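
Injecting Gaussian noise into activations means adding zero-mean noise of a chosen standard deviation to a layer's hidden values before they flow on. The sketch below shows this on a toy two-layer network; it is an illustration only, and the network, the injection point, and the names `inject_gaussian_noise` and `toy_forward` are assumptions rather than the authors' experimental setup.

```python
import numpy as np

def inject_gaussian_noise(activations, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to the activations."""
    noise = rng.normal(loc=0.0, scale=sigma, size=activations.shape)
    return activations + noise

def toy_forward(x, w1, w2, sigma=0.0, rng=None):
    """Toy two-layer network; noise is injected into the hidden activations only."""
    hidden = np.tanh(x @ w1)
    if sigma > 0.0:
        hidden = inject_gaussian_noise(hidden, sigma, rng)
    return hidden @ w2

init = np.random.default_rng(0)
x = init.normal(size=(2, 3))
w1 = init.normal(size=(3, 5))
w2 = init.normal(size=(5, 4))

clean = toy_forward(x, w1, w2)
small = toy_forward(x, w1, w2, sigma=0.1, rng=np.random.default_rng(1))
large = toy_forward(x, w1, w2, sigma=1.0, rng=np.random.default_rng(1))
print(np.linalg.norm(small - clean), np.linalg.norm(large - clean))
```

In an actual transformer the same addition would typically be applied through a forward hook on selected layers, with the harmful-output rate measured as `sigma` is swept.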

[105] EcoSafeRAG: Efficient Security through Context Analysis in Retrieval-Augmented Generation

Ruobing Yao,Yifei Zhang,Shuang Song,Neng Gao,Chenyang Tu

Main category: cs.CL

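TL;DR: EcoSafeRAG proposes a defense that does not rely on the LLM's internal knowledge: it identifies malicious content through sentence-level processing and bait-guided context diversity detection, while also improving RAG performance.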

Details

Motivation: RAG improves LLM response accuracy by integrating external knowledge but also introduces new attack surfaces such as corpus poisoning. Existing defenses rely on the LLM's internal knowledge, which conflicts with the design concept of RAG. Method: EcoSafeRAG uses sentence-level processing and bait-guided context diversity detection, analyzing the context diversity of candidate documents to identify malicious content. Result: Experiments show that EcoSafeRAG delivers state-of-the-art security while improving RAG performance in clean scenarios at low operational cost (1.2x latency, 48%-80% fewer tokens). Conclusion: EcoSafeRAG addresses RAG security without relying on LLM internal knowledge while improving both performance and cost. Abstract: Retrieval-Augmented Generation (RAG) compensates for the static knowledge limitations of Large Language Models (LLMs) by integrating external knowledge, producing responses with enhanced factual correctness and query-specific contextualization. However, it also introduces new attack surfaces such as corpus poisoning. Most existing defense methods rely on the internal knowledge of the model, which conflicts with the design concept of RAG. To bridge the gap, EcoSafeRAG uses sentence-level processing and bait-guided context diversity detection to identify malicious content by analyzing the context diversity of candidate documents without relying on LLM internal knowledge. Experiments show EcoSafeRAG delivers state-of-the-art security with plug-and-play deployment, simultaneously improving clean-scenario RAG performance while maintaining practical operational costs (1.2$\times$ latency and a 48%-80% token reduction relative to Vanilla RAG).

[106] Time-R1: Towards Comprehensive Temporal Reasoning in LLMs

Zijia Liu,Peixuan Han,Haofei Yu,Haoru Li,Jiaxuan You

Main category: cs.CL

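TL;DR: Time-R1 uses a three-stage development path, the first two stages of which form a reinforcement learning curriculum, to give a moderate-sized LLM comprehensive temporal abilities, outperforming much larger models.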

Details

Motivation: Existing LLMs lack robust temporal intelligence, and existing methods generalize poorly and cannot handle events beyond the knowledge cutoff or tasks requiring creative foresight. Method: A three-stage development path, with an RL curriculum in the first two stages, progressively builds temporal understanding, prediction, and creative generation abilities. Result: Time-R1 outperforms models over 200 times larger, such as the 671B DeepSeek-R1, on prediction and generation tasks. Conclusion: Carefully designed RL fine-tuning lets smaller models achieve superior temporal performance, advancing time-aware AI. Abstract: Large Language Models (LLMs) demonstrate impressive capabilities but lack robust temporal intelligence, struggling to integrate reasoning about the past with predictions and plausible generations of the future. Meanwhile, existing methods typically target isolated temporal skills, such as question answering about past events or basic forecasting, and exhibit poor generalization, particularly when dealing with events beyond their knowledge cutoff or requiring creative foresight. To address these limitations, we introduce Time-R1, the first framework to endow a moderate-sized (3B-parameter) LLM with comprehensive temporal abilities: understanding, prediction, and creative generation. Our approach features a novel three-stage development path; the first two constitute a reinforcement learning (RL) curriculum driven by a meticulously designed dynamic rule-based reward system. This framework progressively builds (1) foundational temporal understanding and logical event-time mappings from historical data, (2) future event prediction skills for events beyond its knowledge cutoff, and finally (3) enables remarkable generalization to creative future scenario generation without any fine-tuning. Strikingly, experiments demonstrate that Time-R1 outperforms models over 200 times larger, including the state-of-the-art 671B DeepSeek-R1, on highly challenging future event prediction and creative scenario generation benchmarks. This work provides strong evidence that thoughtfully engineered, progressive RL fine-tuning allows smaller, efficient models to achieve superior temporal performance, offering a practical and scalable path towards truly time-aware AI. To foster further research, we also release Time-Bench, a large-scale multi-task temporal reasoning dataset derived from 10 years of news data, and our series of Time-R1 checkpoints.

[107] Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models

Shuxun Wang,Qingyu Yin,Chak Tou Leong,Qiang Zhang,Linyi Yang

Main category: cs.CL

TL;DR: The study identifies induction heads as the main mechanism behind the repetition curse in large language models (LLMs) and proposes an attention head regularization technique to mitigate it.

Details

Motivation: The repetition curse is widely observed in LLMs, but its mechanism remains unclear; the study aims to explain its cause and offer a remedy. Method: Analyzes the "toxicity" of induction heads, defined as their tendency to dominate the output logits during repetition, and proposes attention head regularization. Result: Induction heads are a key driver of the repetition curse; regularization could reduce their dominance and promote more diverse generation. Conclusion: The findings inform LLM design and training: regulating induction head behavior can improve output quality. Abstract: The repetition curse is a phenomenon where Large Language Models (LLMs) generate repetitive sequences of tokens or cyclic sequences. While the repetition curse has been widely observed, its underlying mechanisms remain poorly understood. In this work, we investigate the role of induction heads--a specific type of attention head known for their ability to perform in-context learning--in driving this repetitive behavior. Specifically, we focus on the "toxicity" of induction heads, which we define as their tendency to dominate the model's output logits during repetition, effectively excluding other attention heads from contributing to the generation process. Our findings have important implications for the design and training of LLMs. By identifying induction heads as a key driver of the repetition curse, we provide a mechanistic explanation for this phenomenon and suggest potential avenues for mitigation. We also propose a technique with attention head regularization that could be employed to reduce the dominance of induction heads during generation, thereby promoting more diverse and coherent outputs.

[108] Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression

Jingyu Peng,Maolin Wang,Nan Wang,Xiangyu Zhao,Jiatong Li,Kai Zhang,Qi Liu

Main category: cs.CL

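TL;DR: LogiBreak is a black-box jailbreak method that bypasses LLM safety systems through logical expression translation: it converts harmful natural-language prompts into formal logical expressions and exploits the distributional gap between alignment data and logic-based inputs to evade safety constraints.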

Details

Motivation: Despite progress in aligning LLMs with human values, current safety mechanisms remain vulnerable to jailbreak attacks; the authors attribute this to distributional discrepancies between alignment-oriented prompts and malicious prompts. Method: Proposes LogiBreak, which converts harmful natural-language prompts into formal logical expressions and exploits the distributional gap to evade safety systems. Result: Evaluation on a multilingual jailbreak dataset shows that LogiBreak is effective across evaluation settings and linguistic contexts. Conclusion: LogiBreak exposes the fragility of LLM safety mechanisms on logic-based inputs and points to directions for improving future safety design. Abstract: Despite substantial advancements in aligning large language models (LLMs) with human values, current safety mechanisms remain susceptible to jailbreak attacks. We hypothesize that this vulnerability stems from distributional discrepancies between alignment-oriented prompts and malicious prompts. To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems. By converting harmful natural language prompts into formal logical expressions, LogiBreak exploits the distributional gap between alignment data and logic-based inputs, preserving the underlying semantic intent and readability while evading safety constraints. We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.

[109] Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation

Zhanglin Wu,Daimeng Wei,Xiaoyu Chen,Hengchao Shang,Jiaxin Guo,Zongyao Li,Yuanchang Luo,Jinlong Yang,Zhiqiang Rao,Hao Yang

Main category: cs.CL

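TL;DR: The paper studies how to combine large language models (LLMs) with neural machine translation (NMT) systems, using a scheduling policy to optimize translation quality while minimizing LLM usage.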

Details

Motivation: LLMs translate well but are computationally expensive, while NMT systems are more efficient; the study seeks the best way to combine the two to balance quality and efficiency. Method: Proposes a scheduling policy based on source sentence features that decides when to use the LLM or the NMT system, validated on multilingual test sets. Result: Experiments show that the policy achieves optimal translation performance with minimal LLM usage. Conclusion: A scheduling policy that combines LLM and NMT is an efficient and effective translation solution. Abstract: Large language models (LLMs) show promising performance in a variety of downstream tasks, such as machine translation (MT). However, using LLMs for translation suffers from high computational costs and significant latency. Based on our evaluation, in most cases, translations produced by LLMs are comparable to those generated by neural machine translation (NMT) systems. Only in particular scenarios do LLM and NMT models show their respective advantages. As a result, integrating NMT and LLM for translation and using the LLM only when necessary seems to be a sound solution. A scheduling policy that optimizes the translation result while ensuring fast speed and as little LLM usage as possible is therefore required. We compare several scheduling policies and propose a novel and straightforward decider that leverages source sentence features. We conduct extensive experiments on multilingual test sets, and the results show that we can achieve optimal translation performance with minimal LLM usage, demonstrating the effectiveness of our decider.
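The idea of a decider driven by source sentence features can be illustrated with a minimal sketch. The abstract does not say which features or thresholds the authors use, so the features below (token count, share of uncommon tokens), the word list, the thresholds, and the function names are all hypothetical.

```python
from dataclasses import dataclass

# Toy list of frequent words; a real system would use corpus statistics (assumption).
COMMON_WORDS = {"the", "a", "of", "to", "and", "is", "in", "it", "on", "for"}


@dataclass
class SourceFeatures:
    length: int        # number of whitespace-separated tokens
    rare_ratio: float  # share of tokens outside COMMON_WORDS


def extract_features(sentence: str) -> SourceFeatures:
    """Compute cheap features from the source sentence only."""
    tokens = sentence.lower().split()
    rare = sum(1 for t in tokens if t not in COMMON_WORDS)
    ratio = rare / len(tokens) if tokens else 0.0
    return SourceFeatures(length=len(tokens), rare_ratio=ratio)


def route(sentence: str, max_len: int = 25, max_rare: float = 0.8) -> str:
    """Return "llm" for sentences judged hard, otherwise "nmt" (hypothetical rule)."""
    f = extract_features(sentence)
    if f.length > max_len or (f.length > 5 and f.rare_ratio > max_rare):
        return "llm"
    return "nmt"
```

Because the decision uses only the source sentence, it can be made before either system runs, so the cheaper NMT path carries no LLM latency at all.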

[110] CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models

Sathya Krishnan Suresh,Tanmay Surana,Lim Zhi Hao,Eng Siong Chng

Main category: cs.CL

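TL;DR: The paper introduces CS-Sum, the first benchmark for code-switching dialogue summarization across Mandarin-English, Tamil-English, and Malay-English, and evaluates ten LLMs; despite high automated scores, the models still make meaning-altering errors on code-switched input.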

Details

Motivation: To examine how well LLMs comprehend code-switching (CS), an underexplored area. Method: Using the CS-Sum benchmark, evaluates ten LLMs under few-shot, translate-summarize, and fine-tuning (LoRA, QLoRA) approaches. Result: The models score high on automated metrics but make subtle mistakes that change the meaning of the dialogue entirely; error rates vary by language pair and model. Conclusion: Specialized training on code-switched data is needed to improve model performance. Abstract: Code-switching (CS) poses a significant challenge for Large Language Models (LLMs), yet its comprehensibility remains underexplored in LLMs. We introduce CS-Sum to evaluate the comprehensibility of CS by LLMs through CS dialogue to English summarization. CS-Sum is the first benchmark for CS dialogue summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language pair. Evaluating ten LLMs, including open and closed-source models, we analyze performance across few-shot, translate-summarize, and fine-tuning (LoRA, QLoRA on synthetic data) approaches. Our findings show that although the scores on automated metrics are high, LLMs make subtle mistakes that alter the complete meaning of the dialogue. To this end, we introduce the three most common types of errors that LLMs make when handling CS input. Error rates vary across CS pairs and LLMs, with some LLMs showing more frequent errors on certain language pairs, underscoring the need for specialized training on code-switched data.

[111] Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning

Nathaniel Krasner,Nicholas Lanuzo,Antonios Anastasopoulos

Main category: cs.CL

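TL;DR: The study asks whether visual information can replace bitexts for aligning multilingual sentence representations and finds that multilingual image-caption alignment implicitly aligns text representations that are usable for cross-lingual natural language understanding and bitext retrieval.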

Details

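Motivation: Aligning multilingual sentence representations with bitexts is costly, whereas image-caption datasets are easy to create, offering an efficient alternative for low-resource languages. Method: Uses multilingual image-caption alignment to implicitly align text representations and tests them on cross-lingual natural language understanding and bitext retrieval. Result: Multilingual image-caption alignment implicitly aligns text representations, languages unseen in pretraining can be added to the alignment afterwards, and the aligned representations work for cross-lingual tasks. Conclusion: Visual information can stand in for bitexts when aligning multilingual sentence representations, providing an efficient solution for low-resource languages. Abstract: Multilingual alignment of sentence representations has mostly required bitexts to bridge the gap between languages. We investigate whether visual information can bridge this gap instead. Image caption datasets are very easy to create without requiring multilingual expertise, so this offers a more efficient alternative for low-resource languages. We find that multilingual image-caption alignment can implicitly align the text representations between languages, languages unseen by the encoder in pretraining can be incorporated into this alignment post-hoc, and these aligned representations are usable for cross-lingual Natural Language Understanding (NLU) and bitext retrieval.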

[112] Clarifying orthography: Orthographic transparency as compressibility

Charles J. Torres,Richard Futrell

Main category: cs.CL

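TL;DR: Proposes a script-agnostic metric, grounded in algorithmic information theory, that quantifies the transparency between spelling and pronunciation.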

Details

Motivation: There is no unified, script-agnostic metric for how directly spelling relates to sound. Method: Uses mutual compressibility from algorithmic information theory, estimated with prequential code lengths from neural sequence models, to quantify orthographic transparency. Result: Evaluation on 22 languages spanning alphabetic, abjad, abugida, syllabic, and logographic scripts confirms common intuitions about relative transparency. Conclusion: Mutual compressibility provides a simple, principled, and general measure of orthographic transparency. Abstract: Orthographic transparency -- how directly spelling is related to sound -- lacks a unified, script-agnostic metric. Using ideas from algorithmic information theory, we quantify orthographic transparency in terms of the mutual compressibility between orthographic and phonological strings. Our measure provides a principled way to combine two factors that decrease orthographic transparency, capturing both irregular spellings and rule complexity in one quantity. We estimate our transparency measure using prequential code-lengths derived from neural sequence models. Evaluating 22 languages across a broad range of script types (alphabetic, abjad, abugida, syllabic, logographic) confirms common intuitions about relative transparency of scripts. Mutual compressibility offers a simple, principled, and general yardstick for orthographic transparency.
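The bookkeeping behind a compression-based measure can be sketched with the standard algorithmic mutual information identity, I(x:y) ≈ C(x) + C(y) - C(x,y), where C is a code length. The sketch below substitutes the general-purpose `zlib` compressor for the neural prequential code lengths the abstract mentions. That substitution is an assumption made purely for illustration: `zlib` cannot learn grapheme-to-phoneme mappings across different alphabets, so the numbers it gives say nothing about real orthographies. The function names are hypothetical.

```python
import zlib


def code_length(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding (a crude stand-in for C)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def mutual_compressibility(spelling: str, sound: str) -> int:
    """Bytes saved by compressing both strings jointly instead of separately.

    Approximates C(x) + C(y) - C(x, y); larger values mean one string
    helps more in describing the other.
    """
    joint = spelling + "\n" + sound
    return code_length(spelling) + code_length(sound) - code_length(joint)
```

A string paired with itself should save far more bytes than a pair of unrelated strings, which is the behavior a transparency measure needs at the two extremes.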

[113] Are Large Language Models Good at Detecting Propaganda?

Julia Jose,Rachel Greenstadt

Main category: cs.CL

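TL;DR: The study compares large language models (GPT-4, GPT-3.5, and Claude 3 Opus) with transformer-based models at detecting propaganda techniques in news articles. GPT-4 beats the other LLMs but not the RoBERTa-CRF baseline. For some propaganda techniques, the LLMs outperform a MultiGranularity Network (MGN) baseline.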

Details

Motivation: Propaganda techniques sway decisions through logical fallacies and emotional appeals, so recognizing them is key to judging information; advances in NLP make detection systems feasible. Method: Compares several large language models (LLMs) with transformer-based models on propaganda technique detection, using F1 scores. Result: GPT-4 achieves a higher F1 than the other LLMs (F1=0.16) but does not exceed the RoBERTa-CRF baseline (F1=0.67). For some techniques, the LLMs outperform the MGN baseline. Conclusion: Although GPT-4 is the best of the LLMs, fine-tuned models such as RoBERTa-CRF remain better at propaganda technique detection; LLMs show promise on specific techniques. Abstract: Propagandists use rhetorical devices that rely on logical fallacies and emotional appeals to advance their agendas. Recognizing these techniques is key to making informed decisions. Recent advances in Natural Language Processing (NLP) have enabled the development of systems capable of detecting manipulative content. In this study, we look at several Large Language Models and their performance in detecting propaganda techniques in news articles. We compare the performance of these LLMs with transformer-based models. We find that, while GPT-4 demonstrates superior F1 scores (F1=0.16) compared to GPT-3.5 and Claude 3 Opus, it does not outperform a RoBERTa-CRF baseline (F1=0.67). Additionally, we find that all three LLMs outperform a MultiGranularity Network (MGN) baseline in detecting instances of one out of six propaganda techniques (name-calling), with GPT-3.5 and GPT-4 also outperforming the MGN baseline in detecting instances of appeal to fear and flag-waving.

[114] SQLForge: Synthesizing Reliable and Diverse Data to Enhance Text-to-SQL Reasoning in LLMs

Yu Guo,Dong Jin,Shenghao Ye,Shuangwu Chen,Jian Yang,Xiaobin Tan

Main category: cs.CL

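TL;DR: SQLForge synthesizes reliable and diverse data to improve LLM performance on text-to-SQL, significantly narrowing the gap between open-source and closed-source models.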

Details

Motivation: Open-source models lag behind closed-source models on text-to-SQL, and better data quality and diversity are needed. Method: Ensures data reliability through SQL syntax constraints and SQL-to-question reverse translation, increases diversity through template enrichment and iterative domain exploration, and fine-tunes a range of open-source models. Result: SQLForge-LM achieves the best performance among open-source models on Spider and BIRD (85.7% on Spider Dev, 59.8% on BIRD Dev). Conclusion: SQLForge effectively improves open-source models and significantly narrows the gap with closed-source models. Abstract: Large Language Models (LLMs) have demonstrated significant potential in text-to-SQL reasoning tasks, yet a substantial performance gap persists between existing open-source models and their closed-source counterparts. In this paper, we introduce SQLForge, a novel approach for synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs. We improve data reliability through SQL syntax constraints and SQL-to-question reverse translation, ensuring data logic at both structural and semantic levels. We also propose an SQL template enrichment and iterative data domain exploration mechanism to boost data diversity. Building on the augmented data, we fine-tune a variety of open-source models with different architectures and parameter sizes, resulting in a family of models termed SQLForge-LM. SQLForge-LM achieves state-of-the-art performance among open-source models on the widely recognized Spider and BIRD benchmarks. Specifically, SQLForge-LM achieves EX accuracy of 85.7% on Spider Dev and 59.8% on BIRD Dev, significantly narrowing the performance gap with closed-source methods.
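One simple way to enforce a structural check on synthesized SQL is to ask a database engine to compile each candidate against the target schema and discard anything it rejects. The sketch below does this with Python's standard `sqlite3` module. It is only a stand-in for the syntax constraints mentioned in the abstract, whose details are not given here; the toy schema and the function names are hypothetical.

```python
import sqlite3

# Toy schema used only for this illustration (assumption).
SCHEMA = "CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"


def is_valid_sql(query: str, schema: str = SCHEMA) -> bool:
    """Return True if SQLite can compile `query` against `schema`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        # EXPLAIN compiles the statement (checking syntax, tables, and columns)
        # and returns its bytecode listing instead of running the query.
        conn.execute("EXPLAIN " + query)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def filter_candidates(queries):
    """Keep only the synthesized queries that pass the structural check."""
    return [q for q in queries if is_valid_sql(q)]
```

A check like this catches malformed statements and references to missing columns, but it cannot tell whether a query matches its natural-language question; that semantic side is what the reverse-translation step in the abstract is meant to cover.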

[115] Simulation Agent: A Framework for Integrating Simulation and Large Language Models for Enhanced Decision-Making

Jacob Kleiman,Kevin Frank,Sindy Campagna

Main category: cs.CL

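TL;DR: Proposes a framework that combines simulation models with large language models (LLMs): the LLM's conversational ability simplifies interaction with the simulation system, while the simulation grounds the LLM in accurate, structured representations of the real world.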

Details

Motivation: Simulation systems are too complex for non-technical users, while LLMs lack structured causal understanding. Method: Develops a simulation agent framework that integrates the strengths of simulation models and LLMs for seamless interaction and structured modeling. Result: Provides a robust and generalizable foundation for empirical validation across domains. Conclusion: The framework combines the accuracy of simulation with the accessibility of LLMs and has broad potential applications. Abstract: Simulations, although powerful in accurately replicating real-world systems, often remain inaccessible to non-technical users due to their complexity. Conversely, large language models (LLMs) provide intuitive, language-based interactions but can lack the structured, causal understanding required to reliably model complex real-world dynamics. We introduce our simulation agent framework, a novel approach that integrates the strengths of both simulation models and LLMs. This framework helps empower users by leveraging the conversational capabilities of LLMs to interact seamlessly with sophisticated simulation systems, while simultaneously utilizing the simulations to ground the LLMs in accurate and structured representations of real-world phenomena. This integrated approach helps provide a robust and generalizable foundation for empirical validation and offers broad applicability across diverse domains.

[116] Krikri: Advancing Open Large Language Models for Greek

Dimitris Roussis,Leon Voukoutis,Georgios Paraskevopoulos,Sokratis Sofianopoulos,Prokopis Prokopidis,Vassilis Papavasileiou,Athanasios Katsamanis,Stelios Piperidis,Vassilis Katsouros

Main category: cs.CL

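TL;DR: Llama-Krikri-8B is a Greek large language model built on Meta's Llama 3.1-8B. It supports Modern Greek, English, polytonic text, and Ancient Greek, and outperforms comparable models.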

Details

Motivation: To provide a high-performing language model for Greek and address the shortcomings of existing models on Greek. Method: Builds on the Llama 3.1-8B architecture, trains on high-quality Greek data, and applies a multi-stage post-training pipeline using techniques such as MAGPIE. Result: Outperforms comparable Greek and multilingual models on natural language understanding, generation, and code generation. Conclusion: Llama-Krikri-8B is an advanced tool for Greek language processing with broad potential applications. Abstract: We introduce Llama-Krikri-8B, a cutting-edge Large Language Model tailored for the Greek language, built on Meta's Llama 3.1-8B. Llama-Krikri-8B has been extensively trained on high-quality Greek data to ensure superior adaptation to linguistic nuances. With 8 billion parameters, it offers advanced capabilities while maintaining efficient computational performance. Llama-Krikri-8B supports both Modern Greek and English, and is also equipped to handle polytonic text and Ancient Greek. The chat version of Llama-Krikri-8B features a multi-stage post-training pipeline, utilizing both human and synthetic instruction and preference data, by applying techniques such as MAGPIE. In addition, for evaluation, we propose three novel public benchmarks for Greek. Our evaluation on existing as well as the proposed benchmarks shows notable improvements over comparable Greek and multilingual LLMs in both natural language understanding and generation as well as code generation.

[117] Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

Siddhant Bhambri,Upasana Biswas,Subbarao Kambhampati

Main category: cs.CL

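TL;DR: The paper examines the challenges of using reasoning traces (such as Chain-of-Thought) in knowledge distillation (KD) and proposes rule-based problem decomposition to generate interpretable traces whose correctness can be evaluated. Experiments find a low correlation between trace correctness and final-answer correctness.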

Details

Motivation: Users of interactive dialogue systems such as ChatGPT demand both accuracy and transparency. Knowledge distillation can improve small language models (SLMs), but reasoning traces are hard to evaluate and their relationship to final performance is unclear. Method: Uses rule-based problem decomposition to break complex queries into structured sub-problems (such as classification and information retrieval), producing interpretable reasoning traces; experiments cover several QA datasets. Result: Correct reasoning traces do not guarantee a correct final answer, and the correlation between the two is low. Conclusion: The study challenges the implicit assumption that reasoning traces improve SLM performance and underscores the importance of evaluating trace faithfulness. Abstract: Question Answering (QA) poses a challenging and critical problem, particularly in today's age of interactive dialogue systems such as ChatGPT, Perplexity, Microsoft Copilot, etc., where users demand both accuracy and transparency in the model's outputs. Since smaller language models (SLMs) are computationally more efficient but often under-perform compared to larger models, Knowledge Distillation (KD) methods allow for finetuning these smaller models to improve their final performance. Lately, the intermediate tokens or the so-called 'reasoning' traces produced by Chain-of-Thought (CoT) or by reasoning models such as DeepSeek R1 are used as a training signal for KD. However, these reasoning traces are often verbose and difficult to interpret or evaluate. In this work, we aim to address the challenge of evaluating the faithfulness of these reasoning traces and their correlation with the final performance. To this end, we employ a KD method leveraging rule-based problem decomposition. This approach allows us to break down complex queries into structured sub-problems, generating interpretable traces whose correctness can be readily evaluated, even at inference time. Specifically, we demonstrate this approach on Open Book QA, decomposing the problem into a Classification step and an Information Retrieval step, thereby simplifying trace evaluation. Our SFT experiments with correct and incorrect traces on the CoTemp QA, Microsoft Machine Reading Comprehension QA, and Facebook bAbI QA datasets reveal the striking finding that correct traces do not necessarily imply that the model outputs the correct final solution.
Similarly, we find a low correlation between correct final solutions and intermediate trace correctness. These results challenge the implicit assumption behind utilizing reasoning traces for improving SLMs' final performance via KD.

[118] EfficientLLM: Efficiency in Large Language Models

Zhengqing Yuan,Weixiang Sun,Yixin Liu,Huichi Zhou,Rong Zhou,Yiyang Li,Zheyuan Zhang,Wei Song,Yue Huang,Haolong Jia,Keerthiram Murugesan,Yu Wang,Lifang He,Jianfeng Gao,Lichao Sun,Yanfang Ye

Main category: cs.CL

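TL;DR: EfficientLLM is a new benchmark and the first comprehensive evaluation of LLM efficiency techniques at scale, covering architecture pretraining, fine-tuning, and inference, and revealing trade-offs between efficiency and performance.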

Details

Motivation: As LLM parameter counts and context windows grow, compute, energy, and monetary costs become prohibitive, motivating the study of efficiency techniques. Method: On a cluster of 48xGH200 and 8xH200 GPUs, systematically evaluates pretraining (attention variants, sparse MoE), fine-tuning (parameter-efficient methods), and inference (quantization), using six fine-grained metrics. Result: Efficiency involves quantifiable trade-offs, the best method depends on task and scale, and the techniques generalize across modalities. Conclusion: EfficientLLM offers guidance on efficiency-performance trade-offs for next-generation foundation models and open-sources its datasets and evaluation tools. Abstract: Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100 model-technique pairs (0.5B-72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers optimal memory-latency trade-offs for constrained devices, MLA achieves lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability.
By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.

[119] Improve Language Model and Brain Alignment via Associative Memory

Congchi Yin,Yongpeng Zhang,Xuyun Wen,Piji Li

Main category: cs.CL

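TL;DR: Integrating associative memory improves the alignment between language models and the human brain during speech processing; experiments show improved alignment with brain activity.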

Details

Motivation: To explore how associative memory can improve the alignment between language models and the human brain when processing speech. Method: Verifies alignment by mapping language model activations to brain activity, feeds the models text expanded with simulated associative memory, and builds the Association dataset of 1000 story samples for supervised fine-tuning. Result: Adding associative memory improves alignment with brain activity, especially in brain regions closely related to associative memory processing. Conclusion: Integrating associative memory and applying targeted supervised fine-tuning improve language model alignment with the brain. Abstract: Associative memory engages in the integration of relevant information for comprehension in the human cognition system. In this work, we seek to improve alignment between language models and the human brain while processing speech information by integrating associative memory. After verifying the alignment between language model and brain by mapping language model activations to brain activity, the original text stimuli expanded with simulated associative memory are regarded as input to computational language models. We find the alignment between language model and brain is improved in brain regions closely related to associative memory processing. We also demonstrate that large language models after specific supervised fine-tuning better align with brain response, by building the \textit{Association} dataset containing 1000 samples of stories, with instructions encouraging associative memory as input and associated content as output.

[120] Domain Gating Ensemble Networks for AI-Generated Text Detection

Arihant Tripathi,Liam Dugan,Charis Gao,Maggie Huan,Emma Jin,Peter Zhang,David Zhang,Julia Zhao,Chris Callison-Burch

Main category: cs.CL

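TL;DR: DoGEN is a machine-generated text detection technique that adapts to unseen domains by ensembling domain expert detectors with weights from a domain classifier, and it outperforms existing methods.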

Details

Motivation: As language models improve, detecting machine-generated text becomes more urgent, yet existing detectors struggle to adapt to new domains and generative models. Method: Proposes DoGEN, which ensembles domain expert detector models using weights from a domain classifier to adapt to new domains. Result: Across many domains, DoGEN achieves state-of-the-art in-domain detection and outperforms models twice its size on out-of-domain detection. Conclusion: DoGEN offers an effective approach to domain-adaptive AI detection, and the code and models are released to support future research. Abstract: As state-of-the-art language models continue to improve, the need for robust detection of machine-generated text becomes increasingly critical. However, current state-of-the-art machine text detectors struggle to adapt to new unseen domains and generative models. In this paper we present DoGEN (Domain Gating Ensemble Networks), a technique that allows detectors to adapt to unseen domains by ensembling a set of domain expert detector models using weights from a domain classifier. We test DoGEN on a wide variety of domains from leading benchmarks and find that it achieves state-of-the-art performance on in-domain detection while outperforming models twice its size on out-of-domain detection. We release our code and trained models to assist in future research in domain-adaptive AI detection.
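The gating idea, in which a domain classifier's probabilities weight the outputs of per-domain expert detectors, can be sketched in a few lines. This is a simplified illustration, not the released DoGEN implementation: real experts and the classifier would be neural models, and the names and the plain weighted average used here are assumptions.

```python
from typing import Callable, Dict

# A detector maps text to an estimated probability that it is machine-generated.
Detector = Callable[[str], float]


def gated_score(
    text: str,
    experts: Dict[str, Detector],
    domain_probs: Dict[str, float],
) -> float:
    """Weighted average of expert scores, weighted by domain-classifier probabilities."""
    weights = {d: domain_probs.get(d, 0.0) for d in experts}
    total = sum(weights.values())
    if total == 0.0:
        # No domain information: fall back to a uniform average over experts.
        return sum(f(text) for f in experts.values()) / len(experts)
    return sum(weights[d] / total * experts[d](text) for d in experts)
```

For example, with a news expert scoring 0.9, a reviews expert scoring 0.1, and domain probabilities of 0.75 and 0.25, the gated score is approximately 0.7. Text from an unseen domain is handled as a soft mixture of the nearest known domains rather than being forced onto a single expert.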

[121] Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

Jiwon Song,Dongwon Jo,Yulhwa Kim,Jae-Joon Kim

Main category: cs.CL

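TL;DR: The paper proposes Reasoning Path Compression (RPC), a training-free method that accelerates inference by exploiting the semantic sparsity of reasoning paths, substantially raising generation throughput while keeping accuracy high.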

Details

Motivation: Reasoning-focused language models improve accuracy by generating lengthy intermediate reasoning paths, but this increases memory usage and generation latency, limiting practical deployment. Method: RPC periodically compresses the KV cache, retaining the entries with high importance scores, which are computed from a selector window of recently generated queries. Result: RPC improves the generation throughput of QwQ-32B by up to 1.60x, with an accuracy drop of only 1.2% on AIME 2024. Conclusion: The semantic sparsity of reasoning paths can be exploited for compression, offering a practical route to efficient deployment of reasoning LLMs. Abstract: Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective in solving problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce token-generation throughput, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining the KV cache entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves generation throughput of QwQ-32B by up to 1.60$\times$ compared to the inference with full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
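The compression step described above, scoring cached entries with a window of recent queries and keeping the highest-scoring ones, can be sketched as follows. This is a single-head, pure-Python toy that assumes importance is the attention mass summed over the window's queries; the scoring rule, the schedule, and the names in the released RPC code may differ.

```python
import math


def _softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress_kv(keys, values, recent_queries, keep):
    """Keep the `keep` cache positions that recent queries attend to most.

    keys, values: per-position cache entries (keys are lists of floats).
    recent_queries: query vectors from the selector window (assumed scoring input).
    Returns the pruned keys, pruned values, and the kept positions in original order.
    """
    scores = [0.0] * len(keys)
    for q in recent_queries:
        attn = _softmax([_dot(q, k) for k in keys])
        for i, a in enumerate(attn):
            scores[i] += a  # accumulate attention mass per cached position
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)  # restore positional order
    return [keys[i] for i in kept], [values[i] for i in kept], kept
```

Calling a function like this every fixed number of generated tokens keeps the cache near a bounded size, which is where the memory and throughput gains would come from.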

[122] Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning

Jingqi Tong,Jixin Tang,Hangcheng Li,Yurong Mou,Ming Zhang,Jun Zhao,Yanbo Wen,Fan Song,Jiahao Zhan,Yuyang Lu,Chaoran Tao,Zhiyuan Guo,Jizhou Yu,Tianhao Cheng,Changhao Jiang,Zhen Wang,Tao Liang,Zhihui Fei,Mingyang Wan,Guojun Ma,Weifeng Ge,Guanhua Chen,Tao Gui,Xipeng Qiu,Qi Zhang,Xuanjing Huang

Main category: cs.CL

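TL;DR: The paper proposes Code2Logic, which uses game code to synthesize vision-language reasoning data automatically, addressing the scarcity of high-quality data, and builds the GameQA dataset.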

Details

Motivation: Vision-language reasoning data is scarce and expensive to annotate, which limits the reasoning ability of vision-language models. Method: Exploits the logical structure of game code, using LLMs to adapt the code so that reasoning processes and results are obtained automatically through execution, producing multimodal reasoning data. Result: Builds the GameQA dataset; Qwen2.5-VL-7B improves by 2.33% across 7 vision-language benchmarks. Conclusion: Code2Logic is cost-effective and scalable, and training on GameQA improves out-of-domain performance. Abstract: Visual-language Chain-of-Thought (CoT) data resources are relatively scarce compared to text-only counterparts, limiting the improvement of reasoning capabilities in Vision Language Models (VLMs). However, high-quality vision-language reasoning data is expensive and labor-intensive to annotate. To address this issue, we leverage a promising resource: game code, which naturally contains logical structures and state transition processes. Therefore, we propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. Using the Code2Logic approach, we developed the GameQA dataset to train and evaluate VLMs. GameQA is cost-effective and scalable to produce, challenging for state-of-the-art models, and diverse with 30 games and 158 tasks. Surprisingly, despite training solely on game data, VLMs demonstrated out-of-domain generalization, specifically Qwen2.5-VL-7B improving performance by 2.33% across 7 diverse vision-language benchmarks. Our code and dataset are available at https://github.com/tongjingqi/Code2Logic.

[123] Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM

Zhen Xiong,Yujun Cai,Zhecheng Li,Yiwei Wang

Main category: cs.CL

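TL;DR: The paper proposes a unified graph-based analytical framework for modeling the reasoning processes of large language models (LLMs) and finds that reasoning structure correlates strongly with accuracy.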

Details

Motivation: Reasoning LLMs (RLMs) show sophisticated reasoning, but their unstable and counterintuitive behaviors, such as performance degradation under few-shot prompting, challenge the current understanding of RLMs. Method: Clusters verbose Chain-of-Thought (CoT) outputs into semantically coherent reasoning steps and builds directed reasoning graphs that capture dependencies among the steps. Result: Structural properties of the reasoning graph, such as exploration density, branching, and convergence ratios, correlate strongly with reasoning accuracy, and prompting strategies substantially reshape the reasoning structure. Conclusion: The framework enables evaluation of reasoning quality beyond conventional metrics and offers practical insights for prompt engineering and the cognitive analysis of LLMs. Abstract: Recent advances in test-time scaling have enabled Large Language Models (LLMs) to display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation. Despite their potential, these Reasoning LLMs (RLMs) often demonstrate counterintuitive and unstable behaviors, such as performance degradation under few-shot prompting, that challenge our current understanding of RLMs. In this work, we introduce a unified graph-based analytical framework for better modeling the reasoning processes of RLMs. Our method first clusters long, verbose CoT outputs into semantically coherent reasoning steps, then constructs directed reasoning graphs to capture contextual and logical dependencies among these steps. Through comprehensive analysis across models and prompting regimes, we reveal that structural properties, such as exploration density, branching, and convergence ratios, strongly correlate with reasoning accuracy. Our findings demonstrate how prompting strategies substantially reshape the internal reasoning structure of RLMs, directly affecting task outcomes. The proposed framework not only enables quantitative evaluation of reasoning quality beyond conventional metrics but also provides practical insights for prompt engineering and the cognitive analysis of LLMs. Code and resources will be released to facilitate future research in this direction.
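Once reasoning steps have been clustered and linked into a directed graph, structural statistics are straightforward to compute. The sketch below uses plausible but assumed definitions, since the abstract does not define its metrics: density as edges over possible ordered node pairs, branching ratio as the share of steps with more than one successor, and convergence ratio as the share with more than one predecessor. The function name and edge representation are hypothetical.

```python
def graph_stats(edges, num_nodes):
    """Structural statistics for a directed reasoning graph.

    edges: list of (source_step, target_step) index pairs.
    num_nodes: number of reasoning steps (nodes are 0..num_nodes-1).
    The metric definitions here are illustrative assumptions.
    """
    out_deg = [0] * num_nodes
    in_deg = [0] * num_nodes
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    possible = num_nodes * (num_nodes - 1)
    return {
        "density": len(edges) / possible if possible else 0.0,
        "branching_ratio": sum(d > 1 for d in out_deg) / num_nodes,
        "convergence_ratio": sum(d > 1 for d in in_deg) / num_nodes,
    }
```

A diamond-shaped trace, where one step forks into two lines of thought that later merge, has one branching node and one converging node out of four; statistics like these are the kind of feature that could then be correlated with answer accuracy.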

[124] InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

Yuanyi Wang,Zhaoyi Yan,Yiming Zhang,Qi Zhou,Yanggan Gu,Fei Wu,Hongxia Yang

Main category: cs.CL

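TL;DR: InfiGFusion is a structure-aware fusion framework that explicitly models semantic dependencies with a Graph-on-Logits Distillation (GLD) loss, improving fusion quality and stability and outperforming existing methods on multiple tasks.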

Details

Motivation: Existing logit-based fusion methods ignore semantic dependencies across vocabulary dimensions, which are essential for aligning models with different generation behaviors. Method: Proposes InfiGFusion, which models semantic dependencies with a global co-activation graph and reduces the cost of the Gromov-Wasserstein distance through a sorting-based approximation. Result: Performs strongly across 11 benchmarks, with especially large gains on complex reasoning tasks (for example, +35.6 on Multistep Arithmetic). Conclusion: By explicitly modeling semantic dependencies, InfiGFusion improves both the effectiveness and the efficiency of model fusion. Abstract: Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose \textbf{InfiGFusion}, the first structure-aware fusion framework with a novel \textit{Graph-on-Logits Distillation} (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
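The graph construction in the abstract, keeping the top-k logits at each position and summing their outer products over the sequence, can be sketched with a sparse dictionary. This toy operates on raw logits and covers only the graph-building step; whether the authors normalize the logits first is not stated here, and the Gromov-Wasserstein comparison between graphs is not shown. The function names are hypothetical.

```python
def top_k_sparse(logits, k):
    """Return {vocab_index: logit} for the k largest logits at one position."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return {i: logits[i] for i in idx}


def coactivation_graph(logit_seq, k):
    """Sum outer products of top-k logits over all sequence positions.

    logit_seq: list of per-position logit vectors (lists of floats).
    Returns a sparse adjacency map {(i, j): weight}, where nodes are
    vocabulary channels and weights accumulate joint activations.
    """
    graph = {}
    for logits in logit_seq:
        kept = top_k_sparse(logits, k)
        for i, vi in kept.items():
            for j, vj in kept.items():
                graph[(i, j)] = graph.get((i, j), 0.0) + vi * vj
    return graph
```

With top-k truncation each position contributes at most k² edges, so the graph stays sparse even for large vocabularies.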

[125] Let's Verify Math Questions Step by Step

Chengyu Shen,Zhen Hao Wong,Runming He,Hao Liang,Meiyi Qiang,Zimo Meng,Zhengyang Zhao,Bohan Zeng,Zhengzhou Zhu,Bin Cui,Wentao Zhang

Main category: cs.CL

TL;DR: The paper proposes MathQ-Verify, a five-stage pipeline that filters ill-posed or under-specified math problems, substantially improving dataset quality.

Details

Motivation: Existing work focuses on generating correct reasoning paths and answers but overlooks the validity of the questions themselves, which lowers dataset quality. Method: MathQ-Verify rigorously filters questions through five stages, including format validation, formalization and decomposition, logical contradiction detection, and a goal-oriented completeness check. Result: MathQ-Verify achieves state-of-the-art performance on multiple benchmarks, improving F1 by up to 25 percentage points, and reaches about 90% precision and 63% recall with model voting. Conclusion: MathQ-Verify offers a scalable and accurate way to curate reliable math datasets, reducing label noise and wasted computation on invalid questions. Abstract: Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math QA data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the validity of the questions themselves. In this work, we propose Math Question Verification (MathQ-Verify), a novel five-stage pipeline designed to rigorously filter ill-posed or under-specified math problems. MathQ-Verify first performs format-level validation to remove redundant instructions and ensure that each question is syntactically well-formed. It then formalizes each question, decomposes it into atomic conditions, and verifies them against mathematical definitions. Next, it detects logical contradictions among these conditions, followed by a goal-oriented completeness check to ensure the question provides sufficient information for solving. To evaluate this task, we use existing benchmarks along with an additional dataset we construct, containing 2,147 math questions with diverse error types, each manually double-validated. Experiments show that MathQ-Verify achieves state-of-the-art performance across multiple benchmarks, improving the F1 score by up to 25 percentage points over the direct verification baseline. It further attains approximately 90% precision and 63% recall through a lightweight model voting scheme. MathQ-Verify offers a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding unnecessary computation on invalid questions.
Our code and data are available at https://github.com/scuuy/MathQ-Verify.
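As a rough reading aid for the operating point quoted above, the sketch below computes the F1 score implied by approximately 90% precision and 63% recall. The function name `f1_score` is illustrative and not taken from the MathQ-Verify repository, and the inputs are the rounded figures from the abstract (assumed here to describe the same setting), so the result is only approximate.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted in the abstract; treating them as one operating
# point is an assumption made for this illustration.
implied_f1 = f1_score(0.90, 0.63)
print(f"implied F1: {implied_f1:.3f}")
```

High precision with moderate recall means the filter rarely flags a valid question as invalid but still misses a share of the invalid ones; the harmonic mean is pulled toward the lower of the two values.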

[126] Cross-Linguistic Transfer in Multilingual NLP: The Role of Language Families and Morphology

Ajitesh Bankula,Praney Bankula

Main category: cs.CL

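TL;DR: This paper studies cross-lingual transfer through the lens of language families and morphology, examines how language-family proximity and morphological similarity affect NLP task performance, and compares multilingual models.

Details

Motivation: To study how effective cross-lingual transfer is between resource-rich and low-resource languages, and how language-family and morphological features affect transfer. Method: Analyzes language-family proximity and morphological similarity, evaluates multilingual models across NLP tasks, and discusses the results in light of the literature. Result: Language family and morphological similarity correlate with cross-lingual transfer performance; the paper also reviews emerging approaches that integrate typological and morphological information into model pre-training. Conclusion: Language-family and morphological features significantly affect cross-lingual transfer, and integrating more linguistic information could improve transfer in the future. Abstract: Cross-lingual transfer has become a crucial aspect of multilingual NLP, as it allows models trained on resource-rich languages to be applied to low-resource languages more effectively. Recently, massively multilingual pre-trained language models (e.g., mBERT, XLM-R) have demonstrated strong zero-shot transfer capabilities[14] [13]. This paper investigates cross-linguistic transfer through the lens of language families and morphology, examining how language family proximity and morphological similarity affect performance across NLP tasks. We further discuss our results and how they relate to findings from recent literature. Overall, we compare multilingual model performance and review how linguistic distance metrics correlate with transfer outcomes. We also look into emerging approaches that integrate typological and morphological information into model pre-training to improve transfer to diverse languages[18] [19].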

[127] Word length predicts word order: "Min-max"-ing drives language evolution

Hiram Ring

Main category: cs.CL

TL;DR: The paper proposes a universal mechanism, grounded in a large parallel dataset, that explains word order change and supports a theory of competing pressures from processing and information structure.

Details

Motivation: To investigate the origin of surface structure (word order) in language and to address the disagreement between innate and functional accounts. Method: Uses a tagged parallel dataset of over 1,500 languages to analyze the relationship between word class length and word order. Result: Word class length is significantly correlated with word order, partially supporting processing theories, and predicts historical word order change. Conclusion: Proposes a "Min-Max" theory that integrates the competing pressures of processing and information structure to explain language evolution. Abstract: Current theories of language propose an innate (Baker 2001; Chomsky 1981) or a functional (Greenberg 1963; Dryer 2007; Hawkins 2014) origin for the surface structures (i.e. word order) that we observe in languages of the world, while evolutionary modeling (Dunn et al. 2011) suggests that descent is the primary factor influencing such patterns. Although there are hypotheses for word order change from both innate and usage-based perspectives for specific languages and families, there are key disagreements between the two major proposals for mechanisms that drive the evolution of language more broadly (Wasow 2002; Levy 2008). This paper proposes a universal underlying mechanism for word order change based on a large tagged parallel dataset of over 1,500 languages representing 133 language families and 111 isolates. Results indicate that word class length is significantly correlated with word order crosslinguistically, but not in a straightforward manner, partially supporting opposing theories of processing, while at the same time predicting historical word order change in two different phylogenetic lines and explaining more variance than descent or language area in regression models. Such findings suggest an integrated "Min-Max" theory of language evolution driven by competing pressures of processing and information structure, aligning with recent efficiency-oriented (Levshina 2023) and information-theoretic proposals (Zaslavsky 2020; Tucker et al. 2025).

[128] EEG-to-Text Translation: A Model for Deciphering Human Brain Activity

Saydul Akbar Murad,Ashim Dahal,Nick Rahimi

Main category: cs.CL

TL;DR: The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer decoder and significantly improves EEG-to-text decoding, outperforming T5 and Brain Translator on ROUGE, CER, and WER.

Details

Motivation: Large language models are advancing rapidly, yet models that decode EEG signals into text still face performance limitations and need improvement. Method: R1 Translator combines a bidirectional LSTM encoder (to capture sequential dependencies) with a pretrained transformer decoder, using EEG features to generate high-quality text. Result: A ROUGE-1 score of 38.00% (higher than T5 and Brain Translator), with lower CER and WER as well. Conclusion: R1 Translator performs well on EEG-to-text decoding and offers a new direction for future research. Abstract: With the rapid advancement of large language models like Gemini, GPT, and others, bridging the gap between the human brain and language processing has become an important area of focus. To address this challenge, researchers have developed various models to decode EEG signals into text. However, these models still face significant performance limitations. To overcome these shortcomings, we propose a new model, R1 Translator, which aims to improve the performance of EEG-to-text decoding. The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer-based decoder, utilizing EEG features to produce high-quality text outputs. The model processes EEG embeddings through the LSTM to capture sequential dependencies, which are then fed into the transformer decoder for effective text generation. The R1 Translator excels in ROUGE metrics, outperforming both T5 (previous research) and Brain Translator. Specifically, R1 achieves a ROUGE-1 score of 38.00% (P), which is up to 9% higher than T5 (34.89%) and 3% better than Brain (35.69%). It also leads in ROUGE-L, with an F1 score of 32.51%, outperforming T5 by 3% (29.67%) and Brain by 2% (30.38%). In terms of CER, R1 achieves a CER of 0.5795, which is 2% lower than T5 (0.5917) and 4% lower than Brain (0.6001). Additionally, R1 performs better in WER with a score of 0.7280, outperforming T5 by 4.3% (0.7610) and Brain by 3.6% (0.7553). Code is available at https://github.com/Mmurrad/EEG-To-text.
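For readers unfamiliar with the CER and WER figures above, the following is a minimal sketch of how the two error rates are commonly defined: the edit distance between hypothesis and reference, divided by the reference length, at character level and word level respectively. The function names are illustrative; the authors' evaluation code may tokenize or normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Lower is better for both metrics, and either can exceed 1.0 when the hypothesis is much longer than the reference.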

[129] Towards Rehearsal-Free Continual Relation Extraction: Capturing Within-Task Variance with Adaptive Prompting

Bao-Ngoc Dao,Quang Nguyen,Luyen Ngo Dinh,Minh Le,Nam Le,Linh Ngo Van

Main category: cs.CL

TL;DR: WAVE++ is a new prompt-based method that improves continual relation extraction with task-specific prompt pools and label descriptions, addressing task identification and forgetting.

Details

Motivation: To address inaccurate task identification, forgetting, and cross-task and within-task variation in prompt-based continual relation extraction. Method: Introduces task-specific prompt pools, label descriptions, and a generative model, combined with a training-free task prediction mechanism. Result: WAVE++ outperforms existing prompt-based and memory-based methods on continual relation extraction. Conclusion: WAVE++ offers a more robust solution for continual relation extraction without explicit data storage. Abstract: Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE, particularly in accurately identifying task identities and mitigating catastrophic forgetting. Existing prompt selection strategies often suffer from inaccuracies, lack robust mechanisms to prevent forgetting in shared parameters, and struggle to handle both cross-task and within-task variations. In this paper, we propose WAVE++, a novel approach inspired by the connection between prefix-tuning and mixture of experts. Specifically, we introduce task-specific prompt pools that enhance flexibility and adaptability across diverse tasks while avoiding boundary-spanning risks; this design more effectively captures variations within each task and across tasks. To further refine relation classification, we incorporate label descriptions that provide richer, more global context, enabling the model to better distinguish among different relations. We also propose a training-free mechanism to improve task prediction during inference. Moreover, we integrate a generative model to consolidate prior knowledge within the shared parameters, thereby removing the need for explicit data storage. Extensive experiments demonstrate that WAVE++ outperforms state-of-the-art prompt-based and rehearsal-based methods, offering a more robust solution for continual relation extraction. Our code is publicly available at https://github.com/PiDinosauR2804/WAVE-CRE-PLUS-PLUS.

[130] Memory-Centric Embodied Question Answer

Mingliang Zhai,Zhi Gao,Yuwei Wu,Yunde Jia

Main category: cs.CL

TL;DR: The paper proposes MemoryEQA, a memory-centric EQA framework that uses a multi-modal hierarchical memory mechanism to improve efficiency and accuracy on complex tasks.

Details

Motivation: Existing EQA frameworks are planner-centric, and the memory module cannot fully interact with the other modules, which limits their ability to handle complex tasks such as multi-target tasks across regions. Method: Proposes the MemoryEQA framework, which uses a hierarchical mechanism of global and local memory and a multi-modal large language model to convert memory information into the input formats each module requires. Result: The framework's effectiveness is validated on HM-EQA, MT-HM3D, and OpenEQA, with a 19.8% performance gain on MT-HM3D. Conclusion: Memory capability is pivotal for complex EQA tasks, and MemoryEQA significantly improves performance by improving how memory interacts with the other modules. Abstract: Embodied Question Answering (EQA) requires agents to autonomously explore and understand the environment to answer context-dependent questions. Existing frameworks typically center around the planner, which guides the stopping module, memory module, and answering module for reasoning. In this paper, we propose a memory-centric EQA framework named MemoryEQA. Unlike planner-centric EQA models where the memory module cannot fully interact with other modules, MemoryEQA flexibly feeds memory information into all modules, thereby enhancing efficiency and accuracy in handling complex tasks, such as those involving multiple targets across different regions. Specifically, we establish a multi-modal hierarchical memory mechanism, which is divided into global memory that stores language-enhanced scene maps, and local memory that retains historical observations and state information. When performing EQA tasks, the multi-modal large language model is leveraged to convert memory information into the required input formats for injection into different modules. To evaluate EQA models' memory capabilities, we constructed the MT-HM3D dataset based on HM3D, comprising 1,587 question-answer pairs involving multiple targets across various regions, which requires agents to maintain memory of exploration-acquired target information. Experimental results on HM-EQA, MT-HM3D, and OpenEQA demonstrate the effectiveness of our framework, where a 19.8% performance gain on MT-HM3D compared to the baseline model further underscores memory capability's pivotal role in resolving complex tasks.

[131] FlashThink: An Early Exit Method For Efficient Reasoning

Guochao Jiang,Guofeng Quan,Zepeng Ding,Ziqin Luo,Dixuan Wang,Zheng Hu

Main category: cs.CL

TL;DR: The paper proposes FlashThink, a method that uses a verification model to terminate the reasoning of large language models (LLMs) early, cutting computational overhead while preserving accuracy.

Details

Motivation: LLMs perform well on reasoning tasks but often generate overly long reasoning content, wasting computation. The authors observe that a model may already reach the correct answer partway through reasoning, without completing the full trace. Method: Introduces a verification model that identifies when the model can stop reasoning and still give the correct answer. Result: On four benchmarks, FlashThink substantially shortens reasoning content (by 77.04% for Deepseek-R1 and 77.47% for QwQ-32B) while preserving accuracy. Conclusion: FlashThink effectively addresses overly long LLM reasoning and enables efficient reasoning. Abstract: Large Language Models (LLMs) have shown impressive performance in reasoning tasks. However, LLMs tend to generate excessively long reasoning content, leading to significant computational overhead. Our observations indicate that even on simple problems, LLMs tend to produce unnecessarily lengthy reasoning content, which is against intuitive expectations. Preliminary experiments show that at a certain point during the generation process, the model is already capable of producing the correct solution without completing the full reasoning content. Therefore, we consider that the reasoning process of the model can be exited early to achieve the purpose of efficient reasoning. We introduce a verification model that identifies the exact moment when the model can stop reasoning and still provide the correct answer. Comprehensive experiments on four different benchmarks demonstrate that our proposed method, FlashThink, effectively shortens the reasoning content while preserving the model accuracy. For the Deepseek-R1 and QwQ-32B models, we reduced the length of reasoning content by 77.04% and 77.47%, respectively, without reducing the accuracy.
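The early-exit idea described above can be illustrated with a small control loop. This is a sketch under stated assumptions, not the FlashThink implementation: `next_chunk` and `can_answer` are hypothetical stand-ins for the reasoning model and the verification model, and the abstract does not say how reasoning is segmented or how often the verifier is called.

```python
from typing import Callable, List


def reason_with_early_exit(
    next_chunk: Callable[[List[str]], str],
    can_answer: Callable[[List[str]], bool],
    max_chunks: int = 64,
) -> List[str]:
    """Generate reasoning chunk by chunk and stop once the verifier approves.

    next_chunk: hypothetical stand-in for the reasoning model; returns the
        next piece of reasoning given what has been produced so far.
    can_answer: hypothetical stand-in for the verification model; returns
        True when the reasoning so far is judged sufficient to answer.
    """
    chunks: List[str] = []
    for _ in range(max_chunks):
        chunks.append(next_chunk(chunks))
        if can_answer(chunks):
            break  # early exit: the remaining reasoning is skipped
    return chunks


# Toy stand-ins so the sketch runs end to end.
steps = ["restate problem", "compute 6 * 7 = 42", "double-check", "re-derive"]
trace = reason_with_early_exit(
    next_chunk=lambda so_far: steps[len(so_far)],
    can_answer=lambda so_far: any("42" in s for s in so_far),
    max_chunks=len(steps),
)
print(trace)
```

In the toy run the loop stops after the second chunk, so the last two steps are never generated; the savings in a real system depend on how early and how reliably the verifier fires.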

[132] Through a Compressed Lens: Investigating the Impact of Quantization on LLM Explainability and Interpretability

Qianli Wang,Mingyang Wang,Nils Feldhus,Simon Ostermann,Yuan Cao,Hinrich Schütze,Sebastian Möller,Vera Schmitt

Main category: cs.CL

TL;DR: Quantization can significantly affect the explainability and interpretability of large language models (LLMs), and the effect varies with the quantization method, the explanation method, and the evaluation protocol.

Details

Motivation: To study how quantization affects LLM explainability and interpretability, filling a gap in existing research. Method: Uses three quantization techniques with two explainability methods (counterfactual examples and natural language explanations) and two interpretability approaches (knowledge memorization analysis and latent multi-hop reasoning analysis), complemented by a user study. Result: The effect of quantization on explainability and interpretability is inconsistent; it can degrade or improve them depending on the configuration. Conclusion: Quantization can unpredictably affect model transparency, which has important implications for LLM applications that require high transparency. Abstract: Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). While prior research has extensively investigated the degradation of various LLM capabilities due to quantization, its effects on model explainability and interpretability, which are crucial for understanding decision-making processes, remain unexplored. To address this gap, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with two explainability methods, counterfactual examples and natural language explanations, as well as two interpretability approaches, knowledge memorization analysis and latent multi-hop reasoning analysis. We complement our analysis with a thorough user study, evaluating selected explainability methods. Our findings reveal that, depending on the configuration, quantization can significantly impact model explainability and interpretability. Notably, the direction of this effect is not consistent, as it strongly depends on (1) the quantization method, (2) the explainability or interpretability approach, and (3) the evaluation protocol. In some settings, human evaluation shows that quantization degrades explainability, while in others, it even leads to improvements. Our work serves as a cautionary tale, demonstrating that quantization can unpredictably affect model transparency. This insight has important implications for deploying LLMs in applications where transparency is a critical requirement.

[133] CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring

Jiamin Su,Yibo Yan,Zhuoran Gao,Han Zhang,Xiang Liu,Xuming Hu

Main category: cs.CL

TL;DR: CAFES is a collaborative multi-agent framework that improves the generalizability and multimodal perception of automated essay scoring (AES) and significantly improves agreement between scores and human judgment.

Details

Motivation: Traditional AES methods fall short in multimodal assessment and scoring generalizability, while existing MLLM-based methods suffer from hallucinated justifications and misaligned scores. Method: CAFES comprises three agents, an Initial Scorer, a Feedback Pool Manager, and a Reflective Scorer, which collaborate to refine scores iteratively. Result: Experiments show an average relative improvement of 21% in QWK, especially for grammatical and lexical diversity. Conclusion: CAFES lays the groundwork for intelligent multimodal AES systems. Abstract: Automated Essay Scoring (AES) is crucial for modern education, particularly with the increasing prevalence of multimodal assessments. However, traditional AES methods struggle with evaluation generalizability and multimodal perception, while even recent Multimodal Large Language Model (MLLM)-based approaches can produce hallucinated justifications and scores misaligned with human judgment. To address the limitations, we introduce CAFES, the first collaborative multi-agent framework specifically designed for AES. It orchestrates three specialized agents: an Initial Scorer for rapid, trait-specific evaluations; a Feedback Pool Manager to aggregate detailed, evidence-grounded strengths; and a Reflective Scorer that iteratively refines scores based on this feedback to enhance human alignment. Extensive experiments, using state-of-the-art MLLMs, achieve an average relative improvement of 21% in Quadratic Weighted Kappa (QWK) against ground truth, especially for grammatical and lexical diversity. Our proposed CAFES framework paves the way for an intelligent multimodal AES system. The code will be available upon acceptance.
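Since the headline result above is stated in Quadratic Weighted Kappa, a minimal sketch of the standard QWK definition for integer scores may help. The function name and the convention of returning 1.0 when chance disagreement is zero are choices made for this illustration; the paper's evaluation script may differ in such details.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    rater_a: Sequence[int], rater_b: Sequence[int], min_score: int, max_score: int
) -> float:
    """Cohen's kappa with quadratic disagreement weights for integer scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of scores")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("the score range must contain at least two values")
    n = len(rater_a)

    # Observed confusion matrix and per-rater score histograms.
    observed = [[0.0] * k for _ in range(k)]
    hist_a, hist_b = [0.0] * k, [0.0] * k
    for a, b in zip(rater_a, rater_b):
        i, j = a - min_score, b - min_score
        observed[i][j] += 1
        hist_a[i] += 1
        hist_b[j] += 1

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            expected = hist_a[i] * hist_b[j] / n  # count expected by chance
            num += weight * observed[i][j]
            den += weight * expected
    # den is zero only when both raters give one identical score throughout;
    # returning 1.0 in that case is a convention chosen for this sketch.
    return 1.0 - num / den if den else 1.0
```

A value of 1.0 means perfect agreement, 0.0 means agreement no better than chance, and large score gaps are penalized more heavily than near misses because the weight grows with the squared difference.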

[134] Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals

Qianli Wang,Van Bach Nguyen,Nils Feldhus,Luis Felipe Villa-Arenas,Christin Seifert,Sebastian Möller,Vera Schmitt

Main category: cs.CL

TL;DR: The study finds that in counterfactual data augmentation (CDA), an independent, non-fine-tuned judge model gives the most reliable label flipping evaluation, but an automated pipeline still needs human intervention.

Details

Motivation: To examine how the choice of judge model affects label flipping evaluation in counterfactual data augmentation (CDA), which is used to improve the performance and robustness of large language models (LLMs). Method: Experimentally analyzes four types of relationship between generator and judge models, covering two state-of-the-art LLM-based methods, three datasets, five generator models, and 15 judge models, complemented by a user study (n = 90). Result: Independent, non-fine-tuned judge models are the most reliable, but a considerable gap to the user study results remains. Conclusion: A fully automated CDA pipeline may be inadequate and should be combined with human intervention. Abstract: Counterfactual examples are widely employed to enhance the performance and robustness of large language models (LLMs) through counterfactual data augmentation (CDA). However, the selection of the judge model used to evaluate label flipping, the primary metric for assessing the validity of generated counterfactuals for CDA, yields inconsistent results. To decipher this, we define four types of relationships between the counterfactual generator and judge models. Through extensive experiments involving two state-of-the-art LLM-based methods, three datasets, five generator models, and 15 judge models, complemented by a user study (n = 90), we demonstrate that judge models with an independent, non-fine-tuned relationship to the generator model provide the most reliable label flipping evaluations. Relationships between the generator and judge models, which are closely aligned with the user study for CDA, result in better model performance and robustness. Nevertheless, we find that the gap between the most effective judge models and the results obtained from the user study remains considerably large. This suggests that a fully automated pipeline for CDA may be inadequate and requires human intervention.

[135] Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models

Wenhui Zhu,Xuanzhao Dong,Xin Li,Peijie Qiu,Xiwen Chen,Abolfazl Razi,Aris Sotiras,Yi Su,Yalin Wang

Main category: cs.CL

TL;DR: This paper examines how reinforcement learning (RL)-based tuning methods such as GRPO affect medical visual question answering (VQA) along four key dimensions (base model initialization, medical semantic alignment, length-based rewards, and the influence of bias), and shows that such tuning outperforms conventional supervised fine-tuning (SFT).

Details

Motivation: To address the alignment of model behavior with clinical expectations in medical tasks, the study analyzes the effectiveness of RL-based tuning for medical MLLMs. Method: Experimentally analyzes how four key dimensions (base model initialization, medical semantic alignment, length-based rewards, and the influence of bias) affect medical VQA. Result: GRPO-based RL tuning outperforms standard supervised fine-tuning (SFT) in both accuracy and reasoning quality. Conclusion: The study offers new insights into domain-specific tuning of medical MLLMs and confirms the advantage of GRPO on medical tasks. Abstract: Recently, reinforcement learning (RL)-based tuning has shifted the trajectory of Multimodal Large Language Models (MLLMs), particularly following the introduction of Group Relative Policy Optimization (GRPO). However, directly applying it to medical tasks remains challenging for achieving clinically grounded model behavior. Motivated by the need to align model response with clinical expectations, we investigate four critical dimensions that affect the effectiveness of RL-based tuning in medical visual question answering (VQA): base model initialization strategy, the role of medical semantic alignment, the impact of length-based rewards on long-chain reasoning, and the influence of bias. We conduct extensive experiments to analyze these factors for medical MLLMs, providing new insights into how models are domain-specifically fine-tuned. Additionally, our results demonstrate that GRPO-based RL tuning consistently outperforms standard supervised fine-tuning (SFT) in both accuracy and reasoning quality.

[136] DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models

Yuxuan Jiang,Dawei Li,Frank Ferraro

Main category: cs.CL

TL;DR: The DRP framework combines inference-time pruning with distillation to significantly improve the efficiency and accuracy of large reasoning models.

Details

Motivation: To address the inefficiency that overly verbose reasoning traces cause in large reasoning models on complex reasoning tasks. Method: Combines inference-time pruning with distillation: a teacher model performs skill-aware step decomposition and content pruning, and the pruned reasoning paths are distilled into a student model. Result: On several mathematical reasoning datasets, DRP substantially reduces token usage (on GSM8K, from 917 to 328) while improving accuracy (on GSM8K, from 91.7% to 94.1%). Conclusion: By aligning reasoning structure with the student model's capacity, DRP achieves effective knowledge transfer and performance gains. Abstract: While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantial inefficiency. To address this, we propose Distilled Reasoning Pruning (DRP), a hybrid framework that combines inference-time pruning with tuning-based distillation, two widely used strategies for efficient reasoning. DRP uses a teacher model to perform skill-aware step decomposition and content pruning, and then distills the pruned reasoning paths into a student model, enabling it to reason both efficiently and accurately. Across several challenging mathematical reasoning datasets, we find that models trained with DRP achieve substantial improvements in token efficiency without sacrificing accuracy. Specifically, DRP reduces average token usage on GSM8K from 917 to 328 while improving accuracy from 91.7% to 94.1%, and achieves a 43% token reduction on AIME with no performance drop. Further analysis shows that aligning the reasoning structure of training CoTs with the student's reasoning capacity is critical for effective knowledge transfer and performance gains.

[137] Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection

Maya Srikanth,Run Chen,Julia Hirschberg

Main category: cs.CL

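TL;DR: The paper examines performance problems of multimodal models in empathy detection, in particular the effect of conflicting signals across modalities, and proposes disagreement as a diagnostic signal for improving system robustness.

Details

Motivation: To study the performance degradation that modality conflicts cause in multimodal empathy detection, and to understand how models and humans behave under multimodal input. Method: Uses fine-tuned text, audio, and video models and a gated fusion model to analyze cases where unimodal and multimodal predictions disagree, validated against annotator uncertainty. Result: A dominant signal in one modality can mislead fusion, and humans, like models, do not consistently benefit from multimodal input. Disagreement can identify challenging cases. Conclusion: Modality disagreement is an effective signal for diagnosing challenging cases in empathy detection systems and helps improve model robustness. Abstract: Multimodal models play a key role in empathy detection, but their performance can suffer when modalities provide conflicting cues. To understand these failures, we examine cases where unimodal and multimodal predictions diverge. Using fine-tuned models for text, audio, and video, along with a gated fusion model, we find that such disagreements often reflect underlying ambiguity, as evidenced by annotator uncertainty. Our analysis shows that dominant signals in one modality can mislead fusion when unsupported by others. We also observe that humans, like models, do not consistently benefit from multimodal input. These insights position disagreement as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.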

[138] The Hallucination Tax of Reinforcement Finetuning

Linxin Song,Taiwei Shi,Jieyu Zhao

Main category: cs.CL

TL;DR: The study finds that reinforcement finetuning (RFT) weakens a language model's ability to refuse unanswerable questions, increasing hallucinated answers. Adding a small amount of the SUM dataset to RFT substantially restores refusal behavior.

Details

Motivation: To explore how RFT affects model trustworthiness, in particular its side effect on handling unanswerable questions. Method: Introduces the SUM dataset to test whether models recognize unanswerable math problems, and adjusts the RFT training mix. Result: Standard RFT can reduce refusal rates by more than 80%; adding 10% SUM data substantially restores refusal behavior with little impact on solvable tasks. Conclusion: Adjusting the RFT recipe helps models recognize their knowledge boundaries and improves generalization. Abstract: Reinforcement finetuning (RFT) has become a standard approach for enhancing the reasoning capabilities of large language models (LLMs). However, its impact on model trustworthiness remains underexplored. In this work, we identify and systematically study a critical side effect of RFT, which we term the hallucination tax: a degradation in refusal behavior causing models to produce hallucinated answers to unanswerable questions confidently. To investigate this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of unanswerable math problems designed to probe models' ability to recognize an unanswerable question by reasoning from the insufficient or ambiguous information. Our results show that standard RFT training could reduce model refusal rates by more than 80%, which significantly increases the model's tendency to hallucinate. We further demonstrate that incorporating just 10% SUM during RFT substantially restores appropriate refusal behavior, with minimal accuracy trade-offs on solvable tasks. Crucially, this approach enables LLMs to leverage inference-time compute to reason about their own uncertainty and knowledge boundaries, improving generalization not only to out-of-domain math problems but also to factual question answering tasks.
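One way to picture "incorporating just 10% SUM during RFT" is as a data-mixing step before training. The sketch below assumes that the 10% refers to the share of unanswerable examples in the final training mix; the abstract does not spell this out, and `mix_datasets` and its parameters are hypothetical names, not part of the authors' code.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_datasets(
    solvable: Sequence[T],
    unanswerable: Sequence[T],
    unanswerable_fraction: float = 0.10,
    seed: int = 0,
) -> List[T]:
    """Return a shuffled mix in which about `unanswerable_fraction` of the
    examples are unanswerable, keeping every solvable example."""
    if not 0.0 <= unanswerable_fraction < 1.0:
        raise ValueError("unanswerable_fraction must be in [0, 1)")
    # Solve u / (s + u) = f for u, where s is the number of solvable examples.
    target = round(len(solvable) * unanswerable_fraction / (1.0 - unanswerable_fraction))
    if target > len(unanswerable):
        raise ValueError("not enough unanswerable examples for the requested fraction")
    rng = random.Random(seed)
    mixed = list(solvable) + rng.sample(list(unanswerable), target)
    rng.shuffle(mixed)
    return mixed


# Toy usage: 90 solvable items plus 10 sampled unanswerable items.
mixed = mix_datasets([("solvable", i) for i in range(90)],
                     [("unanswerable", i) for i in range(50)])
print(len(mixed), sum(1 for kind, _ in mixed if kind == "unanswerable"))
```

Fixing the seed keeps the sampled subset reproducible across runs, which matters when comparing refusal rates between training mixes.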

[139] DecIF: Improving Instruction-Following through Meta-Decomposition

Tingfeng Hui,Pengyu Zhu,Bowen Ping,Ling Tang,Yaqi Zhang,Sen Su

Main category: cs.CL

TL;DR: DecIF is a fully autonomous framework that uses the principle of decomposition to generate diverse, high-quality instruction-following data with LLMs alone, without external resources.

Details

Motivation: Existing methods rely on external resources to generate instruction data, which limits flexibility and generalizability; DecIF aims to remove this dependence. Method: Guided by decomposition, DecIF has LLMs iteratively generate meta-information, combines it with response constraints to form structured instructions, and detects and resolves inconsistencies. For response generation, each instruction is decomposed into atomic-level evaluation criteria to validate accuracy. Result: Experiments show strong performance on instruction-following tasks, with strong flexibility, scalability, and generalizability. Conclusion: DecIF offers an efficient and general solution for automatically generating high-quality instruction data. Abstract: Instruction-following has emerged as a crucial capability for large language models (LLMs). However, existing approaches often rely on pre-existing documents or external resources to synthesize instruction-following data, which limits their flexibility and generalizability. In this paper, we introduce DecIF, a fully autonomous, meta-decomposition guided framework that generates diverse and high-quality instruction-following data using only LLMs. DecIF is grounded in the principle of decomposition. For instruction generation, we guide LLMs to iteratively produce various types of meta-information, which are then combined with response constraints to form well-structured and semantically rich instructions. We further utilize LLMs to detect and resolve potential inconsistencies within the generated instructions. Regarding response generation, we decompose each instruction into atomic-level evaluation criteria, enabling rigorous validation and the elimination of inaccurate instruction-response pairs. Extensive experiments across a wide range of scenarios and settings demonstrate DecIF's superior performance on instruction-following tasks. Further analysis highlights its strong flexibility, scalability, and generalizability in automatically synthesizing high-quality instruction data.

Myra Cheng,Sunny Yu,Cinoo Lee,Pranav Khadpe,Lujain Ibrahim,Dan Jurafsky

Main category: cs.CL

TL;DR: The paper raises the problem of social sycophancy in LLMs, defined as excessive preservation of the user's face, and develops the ELEPHANT framework to evaluate five behaviors, finding that LLMs are markedly more sycophantic than humans.

Details

Motivation: Existing work focuses only on explicit, verifiable sycophancy and overlooks social sycophancy in ambiguous contexts. Method: Introduces a theory of social sycophancy, develops the ELEPHANT framework, and evaluates five face-preserving behaviors across eight models on the OEQ and AITA datasets. Result: On OEQ, LLMs preserve face 47% more than humans; on AITA, they affirm inappropriate behavior in 42% of cases. Sycophancy is rewarded in preference datasets and is hard to mitigate. Conclusion: The study provides theoretical grounding and empirical tools for the problem of social sycophancy and highlights its importance and complexity. Abstract: A serious risk to the safety and utility of LLMs is sycophancy, i.e., excessive agreement with and flattery of the user. Yet existing work focuses on only one aspect of sycophancy: agreement with users' explicitly stated beliefs that can be compared to a ground truth. This overlooks forms of sycophancy that arise in ambiguous contexts such as advice and support-seeking, where there is no clear ground truth, yet sycophancy can reinforce harmful implicit assumptions, beliefs, or actions. To address this gap, we introduce a richer theory of social sycophancy in LLMs, characterizing sycophancy as the excessive preservation of a user's face (the positive self-image a person seeks to maintain in an interaction). We present ELEPHANT, a framework for evaluating social sycophancy across five face-preserving behaviors (emotional validation, moral endorsement, indirect language, indirect action, and accepting framing) on two datasets: open-ended questions (OEQ) and Reddit's r/AmITheAsshole (AITA). Across eight models, we show that LLMs consistently exhibit high rates of social sycophancy: on OEQ, they preserve face 47% more than humans, and on AITA, they affirm behavior deemed inappropriate by crowdsourced human judgments in 42% of cases. We further show that social sycophancy is rewarded in preference datasets and is not easily mitigated. Our work provides theoretical grounding and empirical tools (datasets and code) for understanding and addressing this under-recognized but consequential issue.

[141] Activation-Guided Consensus Merging for Large Language Models

Yuxuan Yao,Shuqi Liu,Zehua Liu,Qintong Li,Mingyang Liu,Xiongwei Han,Zhijiang Guo,Han Wu,Linqi Song

Main category: cs.CL

TL;DR: The paper proposes ACM, a model merging framework that uses activation-guided consensus merging to address the way conventional merging methods overlook functional heterogeneity in neural networks, improving both efficiency and accuracy.

Details

Motivation: Existing approaches struggle to combine the reasoning capabilities of System 2 with the efficiency of System 1, and conventional model merging assumes uniform importance across layers, ignoring functional heterogeneity. Method: Proposes the ACM framework, which determines layer-specific merging coefficients from the mutual information between activations of pre-trained and fine-tuned models, without gradient computation or additional training. Result: ACM outperforms baselines on L2S and general merging tasks; for example, on Qwen-7B models it reduces response length by 55.3% and improves reasoning accuracy by 1.3 points. Conclusion: ACM is an efficient, training-free merging method that improves model performance; the code will be released for reproducibility. Abstract: Recent research has increasingly focused on reconciling the reasoning capabilities of System 2 with the efficiency of System 1. While existing training-based and prompt-based approaches face significant challenges in terms of efficiency and stability, model merging emerges as a promising strategy to integrate the diverse capabilities of different Large Language Models (LLMs) into a unified model. However, conventional model merging methods often assume uniform importance across layers, overlooking the functional heterogeneity inherent in neural components. To address this limitation, we propose Activation-Guided Consensus Merging (ACM), a plug-and-play merging framework that determines layer-specific merging coefficients based on mutual information between activations of pre-trained and fine-tuned models. ACM effectively preserves task-specific capabilities without requiring gradient computations or additional training. Extensive experiments on Long-to-Short (L2S) and general merging tasks demonstrate that ACM consistently outperforms all baseline methods. For instance, in the case of Qwen-7B models, TIES-Merging equipped with ACM achieves a 55.3% reduction in response length while simultaneously improving reasoning accuracy by 1.3 points. We submit the code with the paper for reproducibility, and it will be publicly available.

[142] AUTOLAW: Enhancing Legal Compliance in Large Language Models via Case Law Generation and Jury-Inspired Deliberation

Tai D. Nguyen,Long H. Pham,Jun Sun

Main category: cs.CL

TL;DR: AutoLaw is a novel legal violation detection framework that combines adversarial data generation with a jury-inspired deliberation process to improve the legal compliance of LLMs.

Details

Motivation: Existing legal evaluation benchmarks lack adaptability and do not account for local diversity, which limits their usefulness in dynamic regulatory environments. Method: AutoLaw dynamically synthesizes case law to reflect local regulations and uses a jury of LLMs to simulate judicial decision-making, reducing bias and improving detection accuracy. Result: Across three benchmarks, AutoLaw significantly improves violation detection rates, with both adversarial data generation and the jury-based voting strategy proving effective. Conclusion: AutoLaw adaptively probes legal misalignments and delivers reliable, context-aware judgments, offering a scalable solution for evaluating LLMs in legally sensitive applications. Abstract: The rapid advancement of domain-specific large language models (LLMs) in fields like law necessitates frameworks that account for nuanced regional legal distinctions, which are critical for ensuring compliance and trustworthiness. Existing legal evaluation benchmarks often lack adaptability and fail to address diverse local contexts, limiting their utility in dynamically evolving regulatory landscapes. To address these gaps, we propose AutoLaw, a novel violation detection framework that combines adversarial data generation with a jury-inspired deliberation process to enhance legal compliance of LLMs. Unlike static approaches, AutoLaw dynamically synthesizes case law to reflect local regulations and employs a pool of LLM-based "jurors" to simulate judicial decision-making. Jurors are ranked and selected based on synthesized legal expertise, enabling a deliberation process that minimizes bias and improves detection accuracy. Evaluations across three benchmarks: Law-SG, Case-SG (legality), and Unfair-TOS (policy), demonstrate AutoLaw's effectiveness: adversarial data generation improves LLM discrimination, while the jury-based voting strategy significantly boosts violation detection rates. Our results highlight the framework's ability to adaptively probe legal misalignments and deliver reliable, context-aware judgments, offering a scalable solution for evaluating and enhancing LLMs in legally sensitive applications.

[143] From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

Yingli Shen,Wen Lai,Shuo Wang,Kangyang Luo,Alexander Fraser,Maosong Sun

Main category: cs.CL

TL;DR: The paper presents TED2025, a large-scale, high-quality multi-way parallel corpus based on TED Talks, for improving the multilingual performance of large language models. Experiments show that models trained on multi-way parallel data outperform those trained on unaligned data.

Details

Motivation: Unaligned multilingual data is limited in capturing cross-lingual semantics, whereas multi-way parallel data provides stronger cross-lingual consistency and can therefore improve multilingual performance. Method: Builds the TED2025 corpus, which spans 113 languages with up to 50 languages aligned in parallel, and studies best practices for continued pretraining and instruction tuning on multi-way parallel data. Result: On six multilingual benchmarks, models trained on multi-way parallel data outperform those trained on unaligned data. Conclusion: Multi-way parallel data significantly improves the multilingual performance of LLMs, and TED2025 provides a high-quality resource for this research. Abstract: Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.

[144] Improved Methods for Model Pruning and Knowledge Distillation

Wei Jiang,Anying Fu,Youling Zhang

Main category: cs.CL

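TL;DR: MAMA Pruning is an improved model pruning method that uses weight and bias analysis to reduce model size and computational complexity while maintaining performance.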

Details

Motivation: Existing pruning methods often cause significant performance degradation or require extensive retraining and fine-tuning; the goal is a smaller, faster knowledge-distilled model. Method: The pruning indicators are the weights and biases fixed during pre-training together with GRPO rewards verified during post-training. Result: Even at extreme pruning levels, performance remains comparable to the unpruned model and is better than existing methods. Conclusion: MAMA Pruning is an efficient pruning method with strong performance. Abstract: Model pruning is a performance optimization technique for large language models like R1 or o3-mini. However, existing pruning methods often lead to significant performance degradation or require extensive retraining and fine-tuning. Pruning aims to identify and remove neurons and connections that are unlikely to contribute during the human-computer interaction phase. Our goal is to obtain a much smaller and faster knowledge-distilled model that can quickly generate content almost as good as that of the unpruned model. We propose MAMA Pruning, short for Movement and Magnitude Analysis, an improved pruning method that effectively reduces model size and computational complexity while maintaining performance comparable to the original unpruned model even at extreme pruning levels. The improved method uses the weights and biases fixed in the pre-training phase and the GRPO rewards verified during the post-training phase as our novel pruning indicators. Preliminary experimental results show that our method outperforms or is comparable to state-of-the-art methods across various pruning levels and different downstream computational linguistics tasks.
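
The abstract names movement and magnitude as pruning signals but gives no formula. The block below is only a generic illustration of score-based pruning, not the authors' method: it assumes a score that mixes a weight's magnitude with how far it moved from its initial value, and the function name `prune_by_score` and the mixing parameter `alpha` are hypothetical.

```python
def prune_by_score(weights, initial_weights, sparsity, alpha=0.5):
    """Zero out the lowest-scoring fraction `sparsity` of the weights.

    Assumed score (hypothetical, not taken from the paper):
        alpha * |w| + (1 - alpha) * |w - w_initial|
    i.e. a blend of magnitude and movement during training.
    """
    scores = [
        alpha * abs(w) + (1 - alpha) * abs(w - w0)
        for w, w0 in zip(weights, initial_weights)
    ]
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from least to most important.
    order = sorted(range(len(weights)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    trained = [0.9, -0.05, 0.4, 0.01]
    initial = [0.8, -0.04, 0.1, 0.0]
    print(prune_by_score(trained, initial, sparsity=0.5))
```

With `sparsity=0.5` the two weights that are both small and nearly unchanged are removed, while the large weight and the weight that moved substantially are kept.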

[145] Enhancing LLMs via High-Knowledge Data Selection

Feiyu Duan,Xuemiao Zhang,Sirui Wang,Haoran Que,Yuqi Liu,Wenge Rong,Xunliang Cai

Main category: cs.CL

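TL;DR: This paper proposes the High-Knowledge Scorer (HKS), a data selection method based on knowledge richness that assesses the knowledge content of text through knowledge density and coverage and significantly improves model performance.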

Details

Motivation: Existing data selection methods ignore knowledge richness, which leads to knowledge scarcity in pre-training corpora. Method: The authors propose a gradient-free HKS scorer that combines a multi-domain knowledge element pool with knowledge density and coverage metrics to select knowledge-dense data. Result: Experiments show that the HKS scorer improves performance on knowledge-intensive and general comprehension tasks and strengthens both generic and domain-specific capabilities. Conclusion: HKS effectively alleviates knowledge scarcity and improves model performance on both general and domain-specific tasks. Abstract: The performance of Large Language Models (LLMs) is intrinsically linked to the quality of their training data. Although several studies have proposed methods for high-quality data selection, they do not consider the importance of knowledge richness in text corpora. In this paper, we propose a novel and gradient-free High-Knowledge Scorer (HKS) to select high-quality data from the dimension of knowledge, to alleviate the problem of knowledge scarcity in the pre-trained corpus. We propose a comprehensive multi-domain knowledge element pool and introduce knowledge density and coverage as metrics to assess the knowledge content of the text. Based on this, we propose a comprehensive knowledge scorer to select data with intensive knowledge, which can also be utilized for domain-specific high-knowledge data selection by restricting knowledge elements to the specific domain. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves the model's performance in knowledge-intensive and general comprehension tasks, and is effective in enhancing both the generic and domain-specific capabilities of the model.
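
The abstract does not define knowledge density or coverage precisely. As a rough illustration only, the sketch below assumes density is the share of tokens that match a knowledge element and coverage is the share of the element pool that appears in the text; the function `knowledge_scores`, the whitespace tokenizer, and the toy pool are all hypothetical stand-ins.

```python
def knowledge_scores(text, knowledge_pool):
    """Return (density, coverage) for `text` against a set of knowledge elements.

    Assumed definitions (not from the paper):
      density  = matching tokens / all tokens
      coverage = distinct matched elements / size of the pool
    """
    tokens = text.lower().split()
    if not tokens or not knowledge_pool:
        return 0.0, 0.0
    hits = [t for t in tokens if t in knowledge_pool]
    density = len(hits) / len(tokens)
    coverage = len(set(hits)) / len(knowledge_pool)
    return density, coverage


if __name__ == "__main__":
    pool = {"enzyme", "protein", "ribosome", "mitosis"}  # toy element pool
    doc = "The enzyme binds the protein and the protein folds"
    print(knowledge_scores(doc, pool))
```

Restricting `knowledge_pool` to one domain's elements would give the domain-specific variant that the abstract mentions.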

[146] BAR: A Backward Reasoning based Agent for Complex Minecraft Tasks

Weihong Du,Wenrui Liao,Binyu Yan,Hongru Liang,Anthony G. Cohn,Wenqiang Lei

Main category: cs.CL

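TL;DR: The paper proposes BAR, a backward-reasoning agent that addresses the weaknesses of forward reasoning on complex tasks by planning from the terminal state, markedly improving task completion efficiency.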

Details

Motivation: Existing forward-reasoning methods perform poorly on complex tasks, mainly because of the large perception gap between the initial state and the task goal. Method: The authors design the BAR agent, which includes a recursive goal decomposition module, a state consistency maintaining module, and a stage memory module, and which plans by reasoning backward from the terminal state. Result: Experiments show that BAR outperforms existing methods and that the proposed modules are effective. Conclusion: Backward reasoning has clear advantages for planning complex tasks, and BAR's design offers a new direction for future research. Abstract: Large language model (LLM) based agents have shown great potential in following human instructions and automatically completing various tasks. To complete a task, the agent needs to decompose it into easily executed steps by planning. Existing studies mainly conduct the planning by inferring what steps should be executed next starting from the agent's initial state. However, this forward reasoning paradigm doesn't work well for complex tasks. We propose to study this issue in Minecraft, a virtual environment that simulates complex tasks based on real-world scenarios. We believe that the failure of forward reasoning is caused by the big perception gap between the agent's initial state and task goal. To this end, we leverage backward reasoning and start the planning from the terminal state, which can directly achieve the task goal in one step. Specifically, we design a BAckward Reasoning based agent (BAR). It is equipped with a recursive goal decomposition module, a state consistency maintaining module and a stage memory module to make robust, consistent, and efficient planning starting from the terminal state. Experimental results demonstrate the superiority of BAR over existing methods and the effectiveness of the proposed modules.
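
The idea of planning backward from the terminal state through recursive goal decomposition can be illustrated with a toy planner. This is not the BAR agent, which is LLM-based; the recipe table, quantities, and the function `backward_plan` are simplified, hypothetical stand-ins.

```python
def backward_plan(goal, recipes, inventory=None):
    """Plan backward from `goal`, recursively expanding missing ingredients.

    `recipes` maps an item to the ingredients needed for ONE unit of it.
    Items without a recipe are treated as raw resources to gather.
    Returns the steps in an order that can be executed front to back.
    """
    inventory = dict(inventory or {})
    steps = []

    def resolve(item, qty):
        have = inventory.get(item, 0)
        used = min(have, qty)
        inventory[item] = have - used  # reserve what is already in stock
        need = qty - used
        if need == 0:
            return
        if item not in recipes:
            steps.append(("gather", item, need))
            return
        for ingredient, per_unit in recipes[item].items():
            resolve(ingredient, per_unit * need)
        steps.append(("craft", item, need))

    resolve(goal, 1)
    return steps


if __name__ == "__main__":
    # Hypothetical, simplified recipes (not real Minecraft ratios).
    recipes = {
        "wooden_pickaxe": {"planks": 3, "stick": 2},
        "stick": {"planks": 2},
        "planks": {"log": 1},
    }
    for step in backward_plan("wooden_pickaxe", recipes):
        print(step)
```

The planner starts from the final craft action and works down to raw resources, and the inventory bookkeeping plays a small part of the role a state-consistency check would.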

[147] Gender Trouble in Language Models: An Empirical Audit Guided by Gender Performativity Theory

Franziska Sofia Hafner,Ana Valdivia,Luc Rocher

Main category: cs.CL

TL;DR: The paper examines how language models encode and perpetuate harmful gender stereotypes and calls for a redefinition of gender bias.

Details

Motivation: Existing work mitigates gender bias only by dissociating non-gendered terms from gendered terms, overlooking deeper problems that arise from the construction of gender itself, such as the neglect of transgender and non-binary identities. Method: Drawing on gender studies theory, the authors empirically analyze how 16 models of different architectures, training data, and sizes encode gender. Result: Language models tend to encode gender as a binary category tied to biological sex, and larger models reinforce this narrow understanding of gender. Conclusion: The authors call for re-evaluating how gender bias in language models is defined and addressed so that gender diversity is handled more fully. Abstract: Language models encode and subsequently perpetuate harmful gendered stereotypes. Research has succeeded in mitigating some of these harms, e.g. by dissociating non-gendered terms such as occupations from gendered terms such as 'woman' and 'man'. This approach, however, remains superficial given that associations are only one form of prejudice through which gendered harms arise. Critical scholarship on gender, such as gender performativity theory, emphasizes how harms often arise from the construction of gender itself, such as conflating gender with biological sex. In language models, these issues could lead to the erasure of transgender and gender diverse identities and cause harms in downstream applications, from misgendering users to misdiagnosing patients based on wrong assumptions about their anatomy. For FAccT research on gendered harms to go beyond superficial linguistic associations, we advocate for a broader definition of 'gender bias' in language models. We operationalize insights on the construction of gender through language from gender studies literature and then empirically test how 16 language models of different architectures, training datasets, and model sizes encode gender. We find that language models tend to encode gender as a binary category tied to biological sex, and that gendered terms that do not neatly fall into one of these binary categories are erased and pathologized. Finally, we show that larger models, which achieve better results on performance benchmarks, learn stronger associations between gender and sex, further reinforcing a narrow understanding of gender. Our findings lead us to call for a re-evaluation of how gendered harms in language models are defined and addressed.

[148] Beyond Chains: Bridging Large Language Models and Knowledge Bases in Complex Question Answering

Yihua Zhu,Qianying Liu,Akiko Aizawa,Hidetoshi Shimodaira

Main category: cs.CL

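TL;DR: The PDRR framework combines KBs and LLMs through four stages (Predict, Decompose, Retrieve, Reason) and markedly improves KBQA performance, especially on complex questions.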

Details

Motivation: LLM-only methods suffer from outdated knowledge, hallucinations, and a lack of transparency, while KG-RAG methods are limited to simple chain-structured questions. Method: The PDRR framework predicts the question type, decomposes the question into structured triples, retrieves information from the KB, and guides the LLM to reason over and complete the triples. Result: PDRR outperforms existing methods across multiple LLM backbones and performs strongly on both chain-structured and non-chain complex questions. Conclusion: Through structured planning and logical reasoning, PDRR effectively combines KBs with LLMs and markedly improves KBQA. Abstract: Knowledge Base Question Answering (KBQA) aims to answer natural language questions using structured knowledge from KBs. While LLM-only approaches offer generalization, they suffer from outdated knowledge, hallucinations, and lack of transparency. Chain-based KG-RAG methods address these issues by incorporating external KBs, but are limited to simple chain-structured questions due to the absence of planning and logical structuring. Inspired by semantic parsing methods, we propose PDRR: a four-stage framework consisting of Predict, Decompose, Retrieve, and Reason. Our method first predicts the question type and decomposes the question into structured triples. It then retrieves relevant information from KBs and guides the LLM as an agent to reason over and complete the decomposed triples. Experimental results demonstrate that PDRR consistently outperforms existing methods across various LLM backbones and achieves superior performance on both chain-structured and non-chain complex questions.

[149] MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations

Ernests Lavrinovics,Russa Biswas,Katja Hose,Johannes Bjerva

Main category: cs.CL

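TL;DR: The paper proposes MultiHal, a multilingual, multi-hop benchmark grounded in knowledge graphs (KGs) for evaluating the factuality of generated text, and shows the potential of KG integration for reducing hallucinations in large language models (LLMs).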

Details

Motivation: Large language models (LLMs) hallucinate, and existing benchmarks mostly rely on English datasets and supplementary context while ignoring structured factual resources. Knowledge graphs (KGs) are used to mitigate hallucinations because they represent facts in a structured way. Method: By mining and filtering paths from open-domain KGs, the authors curate 25.9k high-quality KG paths and propose the MultiHal benchmark for evaluating the factuality of multilingual generated text. Result: Experiments show that KG-RAG improves the semantic similarity score by 0.12 to 0.36 points over vanilla QA, confirming the value of KG integration. Conclusion: MultiHal is expected to advance research on graph-based hallucination mitigation and fact-checking. Abstract: Large Language Models (LLMs) have inherent limitations of faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks have been developed that provide a test bed for factuality evaluation within the context of English-centric datasets, while relying on supplementary informative context like web links or text passages but ignoring the available structured factual resources. To this end, Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent the facts about entities and their relations with minimal linguistic overhead. We bridge the lack of KG paths and multilinguality for factual language modeling within the existing hallucination evaluation benchmarks and propose a KG-based multilingual, multihop benchmark called MultiHal framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG-paths from open-domain KGs, from which we pruned noisy KG-paths, curating a high-quality subset of 25.9k. Our baseline evaluation shows an absolute scale increase by approximately 0.12 to 0.36 points for the semantic similarity score in KG-RAG over vanilla QA across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate MultiHal will foster future research towards several graph-based hallucination mitigation and fact-checking tasks.

[150] Legal Rule Induction: Towards Generalizable Principle Discovery from Analogous Judicial Precedents

Wei Fan,Tianshi Zheng,Yiran Hu,Zheye Deng,Weiqi Wang,Baixuan Xu,Chunyang Li,Haoran Li,Weixing Shen,Yangqiu Song

Main category: cs.CL

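TL;DR: The paper introduces the Legal Rule Induction (LRI) task, which aims to extract concise, generalizable legal rules from analogous precedents, and builds the first LRI benchmark dataset. Experiments show that current LLMs over-generalize and hallucinate, but training on the dataset markedly improves performance.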

Details

Motivation: Computational legal research has advanced in applying established rules, but inducing legal rules from judicial decisions remains understudied, constrained by model inference efficacy and symbolic reasoning capability. LLMs offer an opportunity to automate the extraction of latent rules, but formal task definitions, benchmark datasets, and methodologies are lacking. Method: The paper defines LRI as extracting concise rules that capture shared preconditions, normative behaviors, and legal consequences from analogous precedents, and builds a benchmark of 5,121 case sets (38,088 Chinese cases in total) plus 216 expert-annotated test sets. Result: 1) Current LLMs struggle with over-generalization and hallucination; 2) training on the dataset markedly improves LLMs' ability to capture nuanced rule patterns across similar cases. Conclusion: The paper fills a gap in legal rule induction research, provides methodology and benchmark support for automated rule extraction, and shows both the potential of LLMs in this area and where they need to improve. Abstract: Legal rules encompass not only codified statutes but also implicit adjudicatory principles derived from precedents that contain discretionary norms, social morality, and policy. While computational legal research has advanced in applying established rules to cases, inducing legal rules from judicial decisions remains understudied, constrained by limitations in model inference efficacy and symbolic reasoning capability. The advent of Large Language Models (LLMs) offers unprecedented opportunities for automating the extraction of such latent principles, yet progress is stymied by the absence of formal task definitions, benchmark datasets, and methodologies. To address this gap, we formalize Legal Rule Induction (LRI) as the task of deriving concise, generalizable doctrinal rules from sets of analogous precedents, distilling their shared preconditions, normative behaviors, and legal consequences. We introduce the first LRI benchmark, comprising 5,121 case sets (38,088 Chinese cases in total) for model tuning and 216 expert-annotated gold test sets. Experimental results reveal that: 1) State-of-the-art LLMs struggle with over-generalization and hallucination; 2) Training on our dataset markedly enhances LLMs' capabilities in capturing nuanced rule patterns across similar cases.

[151] A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

Li Li,Peilin Cai,Ryan A. Rossi,Franck Dernoncourt,Branislav Kveton,Junda Wu,Tong Yu,Linxin Song,Tiankai Yang,Yuehan Qin,Nesreen K. Ahmed,Samyadeep Basu,Subhojyoti Mukherjee,Ruiyi Zhang,Zhengmian Hu,Bo Ni,Yuxiao Zhou,Zichao Wang,Yue Huang,Yu Wang,Xiangliang Zhang,Philip S. Yu,Xiyang Hu,Yue Zhao

Main category: cs.CL

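TL;DR: PersonaConvBench is a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations. It combines personalization with conversational structure, comprises three core tasks, and shows performance gains for LLMs under a unified prompting setup.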

Details

Motivation: Existing work usually focuses on either personalization or conversational structure in isolation; PersonaConvBench integrates both to enable systematic analysis of how personalized conversational context shapes LLM outputs. Method: The benchmark defines three core tasks (sentence classification, impact regression, and user-centric text generation) and evaluates them across ten diverse Reddit domains. Result: Incorporating personalized history substantially improves performance, for example a 198% relative gain over the best non-conversational baseline in sentiment classification. Conclusion: PersonaConvBench is released to support research on LLMs that adapt to individual styles, track long-term context, and produce rich responses. Abstract: We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.

[152] DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models

Yakun Zhu,Zhongzhen Huang,Linjie Mu,Yutong Huang,Wei Nie,Shaoting Zhang,Pengfei Liu,Xiaofan Zhang

Main category: cs.CL

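TL;DR: DiagnosisArena is a comprehensive benchmark for clinical diagnostic competence that evaluates large language models in complex medical scenarios; results show that current models perform poorly.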

Details

Motivation: To evaluate the diagnostic capabilities of large language models in real-world healthcare settings and address the shortcomings of existing medical benchmarks. Method: The authors build a benchmark of 1,113 case-diagnosis pairs spanning 28 medical specialties, with data quality ensured through multiple rounds of screening and expert review. Result: The most advanced models (o3-mini, o1, DeepSeek-R1) reach only 45.82%, 31.09%, and 17.79% accuracy, indicating a significant bottleneck in clinical diagnostic reasoning. Conclusion: DiagnosisArena aims to drive progress in AI diagnostic reasoning and enable more effective solutions to real clinical challenges. Abstract: The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to benchmark the diagnostic capabilities of current models systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties, deriving from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline, involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3-mini, o1, and DeepSeek-R1, achieve only 45.82%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI's diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/DiagnosisArena.

[153] Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking

Tianle Gu,Zongqi Wang,Kexin Huang,Yuanqi Yao,Xiangliang Zhang,Yujiu Yang,Xiuying Chen

Main category: cs.CL

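TL;DR: The paper proposes Invisible Entropy (IE), a watermarking method that uses a lightweight feature extractor and an entropy tagger to predict whether tokens are high- or low-entropy, addressing the weaknesses of traditional methods in low-entropy scenarios while improving safety and efficiency.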

Details

Motivation: Traditional logit-based LLM watermarking works poorly in low-entropy scenarios, and its reliance on the original LLM brings high computational cost and a risk of model leakage. Method: IE introduces a lightweight feature extractor and an entropy tagger to predict token entropy, and a threshold navigator that adaptively balances the watermark ratio against text naturalness. Result: On the HumanEval and MBPP datasets, IE reduces parameter size by 99% while performing on par with state-of-the-art methods. Conclusion: IE offers a safe and efficient paradigm for low-entropy watermarking. Abstract: Logit-based LLM watermarking traces and verifies AI-generated content by maintaining green and red token lists and increasing the likelihood of green tokens during generation. However, it fails in low-entropy scenarios, where predictable outputs make green token selection difficult without disrupting natural text flow. Existing approaches address this by assuming access to the original LLM to calculate entropy and selectively watermark high-entropy tokens. However, these methods face two major challenges: (1) high computational costs and detection delays due to reliance on the original LLM, and (2) potential risks of model leakage. To address these limitations, we propose Invisible Entropy (IE), a watermarking paradigm designed to enhance both safety and efficiency. Instead of relying on the original LLM, IE introduces a lightweight feature extractor and an entropy tagger to predict whether the entropy of the next token is high or low. Furthermore, based on theoretical analysis, we develop a threshold navigator that adaptively sets entropy thresholds. It identifies a threshold where the watermark ratio decreases as the green token count increases, enhancing the naturalness of the watermarked text and improving detection robustness. Experiments on HumanEval and MBPP datasets demonstrate that IE reduces parameter size by 99% while achieving performance on par with state-of-the-art methods. Our work introduces a safe and efficient paradigm for low-entropy watermarking. Code: https://github.com/Carol-gutianle/IE Dataset: https://huggingface.co/datasets/Carol0110/IE-Tagger
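
For readers unfamiliar with green/red list watermarking, the following minimal sketch shows the baseline mechanism the abstract describes: a pseudo-random green list seeded by the previous token, a logit boost for green tokens, and a skip rule for low-entropy steps. It computes entropy directly from the logits, which is what the earlier approaches do; IE instead predicts high or low entropy with a separate tagger. The parameter values (`gamma`, `delta`, `entropy_threshold`) and function names are illustrative assumptions.

```python
import hashlib
import math
import random


def green_list(prev_token, vocab_size, gamma=0.5):
    """Pseudo-randomly pick a fraction `gamma` of the vocabulary, seeded by the previous token."""
    digest = hashlib.sha256(str(prev_token).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def watermark_logits(logits, prev_token, delta=2.0, entropy_threshold=1.0):
    """Boost green-token logits by `delta`, but only at high-entropy steps."""
    if entropy(softmax(logits)) < entropy_threshold:
        return list(logits)  # predictable step: leave the distribution untouched
    green = green_list(prev_token, len(logits))
    return [x + delta if i in green else x for i, x in enumerate(logits)]


if __name__ == "__main__":
    print(watermark_logits([10.0, 0.0, 0.0, 0.0], prev_token=7))  # low entropy: unchanged
    print(watermark_logits([0.0, 0.0, 0.0, 0.0], prev_token=7))   # high entropy: two logits boosted
```

A detector would recompute the green list for each position and test whether green tokens occur more often than chance.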

[154] Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst

Hongru Wang,Deng Cai,Wanjun Zhong,Shijue Huang,Jeff Z. Pan,Zeming Liu,Kam-Fai Wong

Main category: cs.CL

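TL;DR: The paper proposes the Self-Reasoning Language Model (SRLM), which synthesizes longer chain-of-thought data through self-training and markedly improves large language models on complex reasoning tasks.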

Details

Motivation: Long chain-of-thought data is difficult to create and acquire, and the authors also want to improve performance on reasoning tasks. Method: SRLM uses a small number of demonstration examples (e.g., 1,000) as a reasoning catalyst; the model then synthesizes longer chain-of-thought data through self-training and improves iteratively. Result: Across five reasoning tasks (MMLU, GSM8K, ARC-C, HellaSwag, and BBH), the average absolute improvement exceeds 2.5 points, and the gain grows with more sampling (for example, +7.89 points on average with 64 samples). Conclusion: By generating diverse and in-depth reasoning paths through self-training, SRLM markedly improves reasoning performance, and the effect strengthens as the number of samples increases. Abstract: Inference-time scaling has attracted much attention because it significantly enhances the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition, such as reflection and decomposition, and are difficult to create and acquire. In this work, we introduce Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) on how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than +2.5 points across five reasoning tasks: MMLU, GSM8K, ARC-C, HellaSwag, and BBH on two backbone models. Moreover, it brings more improvements with more times of sampling during inference, such as absolute +7.89 average improvement with 64 sampling times, revealing the in-depth, diverse and creative reasoning paths in SRLM against the strong baseline.

[155] Probing BERT for German Compound Semantics

Filip Miletić,Aaron Schmid,Sabine Schulte im Walde

Main category: cs.CL

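TL;DR: The study examines how well pretrained German BERT encodes the semantics of noun compounds and finds that it lags behind English BERT, possibly because compounding in German is more productive and more ambiguous.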

Details

Motivation: To investigate whether German BERT encodes the semantics of noun compounds and how it compares with English BERT. Method: The authors vary target tokens, layers, and cased vs. uncased models, and evaluate the models by predicting the compositionality of 868 gold-standard compounds. Result: German BERT clearly lags behind English BERT, possibly because of the higher productivity and ambiguity of German compounds. Conclusion: Encoding the semantics of German noun compounds is a harder task and needs further research. Abstract: This paper investigates the extent to which pretrained German BERT encodes knowledge of noun compound semantics. We comprehensively vary combinations of target tokens, layers, and cased vs. uncased models, and evaluate them by predicting the compositionality of 868 gold standard compounds. Looking at representational patterns within the transformer architecture, we observe trends comparable to equivalent prior work on English, with compositionality information most easily recoverable in the early layers. However, our strongest results clearly lag behind those reported for English, suggesting an inherently more difficult task in German. This may be due to the higher productivity of compounding in German than in English and the associated increase in constituent-level ambiguity, including in our target compound set.

[156] Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering

Wei Zhou,Mohsen Mesgar,Heike Adel,Annemarie Friedrich

Main category: cs.CL

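TL;DR: This paper studies whether tables in table question answering (TQA) are better encoded as text or as images, compares multi-modal large language models (MLLMs) with large language models (LLMs) in controlled experiments, and proposes FRES, a method that selects the table representation dynamically and improves performance by 10%.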

Details

Motivation: Existing work lacks a fine-grained comparison of combinations of table representation (text or image) and model (MLLM or LLM), which limits understanding of these approaches. Method: The authors build a new benchmark, systematically analyze seven MLLM-LLM pairs across question complexity and table size, and propose FRES, a method that selects the representation dynamically. Result: The best combination of table representation and model varies by setup, and FRES yields a 10% average performance improvement. Conclusion: Dynamically selecting the table representation with FRES markedly improves TQA performance and points to a new direction for future research. Abstract: In table question answering (TQA), tables are encoded as either texts or images. Prior work suggests that passing images of tables to multi-modal large language models (MLLMs) performs comparably to or even better than using textual input with large language models (LLMs). However, the lack of controlled setups limits fine-grained distinctions between these approaches. In this paper, we conduct the first controlled study on the effectiveness of several combinations of table representations and models from two perspectives: question complexity and table size. We build a new benchmark based on existing TQA datasets. In a systematic analysis of seven pairs of MLLMs and LLMs, we find that the best combination of table representation and model varies across setups. We propose FRES, a method selecting table representations dynamically, and observe a 10% average performance improvement compared to using both representations indiscriminately.

[157] Enhancing Keyphrase Extraction from Academic Articles Using Section Structure Information

Chengzhi Zhang,Xinyi Yan,Lei Zhao,Yingyi Zhang

Main category: cs.CL

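TL;DR: The paper proposes a keyphrase extraction method based on the structural features of academic articles, using section structure and section text to improve extraction performance.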

Details

Motivation: The surge in academic papers increases the time researchers spend finding relevant literature. Existing keyphrase extraction methods based on titles and abstracts are limited by abstract length, while full-text extraction introduces noise. Method: The study has two parts: (1) exploring the effect of seven structural features on keyphrase extraction models, and (2) merging the extraction results from each section's text with a keyphrase integration algorithm. It also analyzes how the quality of section structure classification affects performance. Result: Adding structural features improves keyphrase extraction, with different features affecting the models to different degrees. The keyphrase integration approach performs best, and the quality of section structure classification also affects performance. Conclusion: Using the section structure of academic articles effectively improves keyphrase extraction, and the code and dataset are publicly available. Abstract: The exponential increase in academic papers has significantly increased the time required for researchers to access relevant literature. Keyphrase Extraction (KPE) offers a solution to this situation by enabling researchers to efficiently retrieve relevant literature. The current study on KPE from academic articles aims to improve the performance of extraction models through innovative approaches using Title and Abstract as input corpora. However, the semantic richness of keywords is significantly constrained by the length of the abstract. While full-text-based KPE can address this issue, it simultaneously introduces noise, which significantly diminishes KPE performance. To address this issue, this paper utilizes the structural features and section texts obtained from the section structure information of academic articles to extract keyphrases from academic papers. The approach consists of two main parts: (1) exploring the effect of seven structural features on KPE models, and (2) integrating the extraction results from all section texts used as input corpora for KPE models via a keyphrase integration algorithm to obtain the keyphrase integration result. Furthermore, this paper also examines the effect of the classification quality of section structure on the KPE performance. The results show that incorporating structural features improves KPE performance, though different features have varying effects on model efficacy. The keyphrase integration approach yields the best performance, and the classification quality of section structure can affect KPE performance. These findings indicate that using the section structure information of academic articles contributes to effective KPE from academic articles. The code and dataset supporting this study are available at https://github.com/yan-xinyi/SSB_KPE.

[158] Prior Prompt Engineering for Reinforcement Fine-Tuning

Pittawat Taveekitworachai,Potsawee Manakul,Sarana Nutanong,Kunat Pipatanakul

Main category: cs.CL

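TL;DR: This paper studies prior prompt engineering (pPE) in reinforcement fine-tuning (RFT), examines how different pPE approaches guide language models (LMs) to internalize specific behaviors, and finds that pPE-trained models outperform models that rely on inference-time prompt engineering (iPE).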

Details

Motivation: Existing RFT research focuses mainly on algorithms, reward shaping, and data curation, while the role of prior prompt design remains underexplored. This paper aims to fill that gap. Method: Five iPE strategies (such as reasoning, planning, and code-based reasoning) are translated into pPE approaches and tested on Qwen2.5-7B, with performance evaluated on multiple benchmarks. Result: All pPE-trained models outperform their iPE-prompted counterparts; the null-example pPE approach performs best, with especially large gains on AIME2024 and GPQA-Diamond. Conclusion: pPE is a powerful yet understudied axis of RFT, and different pPE strategies instill different behavioral styles in the resulting models. Abstract: This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt--the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning--remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies--reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization--into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.

[159] Temporal Alignment of Time Sensitive Facts with Activation Engineering

Sanjay Govindan,Maurice Pagnucco,Yang Song

Main category: cs.CL

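TL;DR: This paper explores using activation engineering to align LLMs to specific points in time and improve factual recall without any training or dataset creation. Experiments show improvements of up to 44% with relative prompting and 16% with explicit prompting.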

Details

Motivation: LLM training data spans many domains and time periods, and some knowledge is valid only at specific times. Ensuring that LLMs generate time-appropriate responses is crucial for accuracy and relevance. Method: Activation engineering is used to ground three versions of LLaMA 2 to specific points in time, and the effects of different injection layers and prompting strategies are studied. Result: Relative and explicit prompting improve by up to 44% and 16% respectively, matching the fine-tuning approach while being more computationally efficient and requiring no pre-aligned datasets. Conclusion: Activation engineering is an efficient, training-free way to markedly improve LLM performance on time-sensitive tasks. Abstract: Large Language Models (LLMs) are trained on diverse and often conflicting knowledge spanning multiple domains and time periods. Some of this knowledge is only valid within specific temporal contexts, such as answering the question, "Who is the President of the United States in 2022?" Ensuring LLMs generate time-appropriate responses is crucial for maintaining relevance and accuracy. In this work we explore activation engineering as a method for temporally aligning LLMs to improve factual recall without any training or dataset creation. In this research we explore an activation engineering technique to ground three versions of LLaMA 2 to specific points in time and examine the effects of varying injection layers and prompting strategies. Our experiments demonstrate up to a 44% and 16% improvement in relative and explicit prompting respectively, achieving comparable performance to the fine-tuning method proposed by Zhao et al. (2024). Notably, our approach achieves similar results to the fine-tuning baseline while being significantly more computationally efficient and requiring no pre-aligned datasets.
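
The abstract does not say how the injected activations are built. One common form of activation engineering, assumed here purely for illustration, takes the difference between mean hidden states for two contrasting sets of prompts (for example, prompts mentioning the target year versus neutral prompts) and adds a scaled copy of that difference to the hidden state at a chosen layer. The helper names and toy vectors below are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(target_acts, baseline_acts):
    """Difference of mean activations: target prompts minus baseline prompts."""
    target_mean = mean_vector(target_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [t - b for t, b in zip(target_mean, baseline_mean)]


def inject(hidden, steer, scale=1.0):
    """Add the scaled steering vector to one layer's hidden state."""
    return [h + scale * s for h, s in zip(hidden, steer)]


if __name__ == "__main__":
    # Toy 2-dimensional activations standing in for real hidden states.
    target = [[1.0, 2.0], [3.0, 4.0]]    # e.g. prompts anchored to the target year
    baseline = [[0.0, 1.0], [2.0, 1.0]]  # e.g. prompts with no temporal cue
    steer = steering_vector(target, baseline)
    print(inject([0.5, 0.5], steer, scale=2.0))
```

In a real model the injection would run inside a forward hook at the selected layer; which layer and which scale work best is the kind of question the paper's experiments on injection layers address.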

[160] Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models

Zahraa Al Sahili,Ioannis Patras,Matthew Purver

Main category: cs.CL

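TL;DR: The study finds that multilingual vision-language models show stronger gender and racial bias in multilingual settings, with low-resource and gender-neutral languages especially affected.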

Details

Motivation: To examine the social biases of multilingual vision-language models, in particular how cross-lingual weight sharing affects the spread of bias. Method: In a zero-shot setting, using balanced subsets of FairFace and PATA, the authors quantify gender and racial bias in three multilingual CLIP models (M-CLIP, NLLB-CLIP, CAPIVARA-CLIP) across ten languages. Result: Every model shows stronger gender bias in multilingual settings than its English-only baseline, with low-resource and gender-neutral languages especially affected. Conclusion: Future multilingual vision-language research needs finer-grained, language-aware bias evaluation. Abstract: Multilingual vision-language models promise universal image-text retrieval, yet their social biases remain under-explored. We present the first systematic audit of three public multilingual CLIP checkpoints -- M-CLIP, NLLB-CLIP, and CAPIVARA-CLIP -- across ten languages that vary in resource availability and grammatical gender. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the assumption that multilinguality mitigates bias, every model exhibits stronger gender bias than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared cross-lingual encoder of NLLB-CLIP transports English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this transfer. Highly gendered languages consistently magnify all measured bias types, but even gender-neutral languages remain vulnerable when cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics conceal language-specific "hot spots," underscoring the need for fine-grained, language-aware bias evaluation in future multilingual vision-language research.

[161] PL-FGSA: A Prompt Learning Framework for Fine-Grained Sentiment Analysis Based on MindSpore

Zhenkai Qin,Jiajing He,Qiao Fang

Main category: cs.CL

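TL;DR: PL-FGSA is a unified prompt-learning framework for fine-grained sentiment analysis that combines multi-task prompt-augmented generation with a lightweight TextCNN, markedly improving performance and interpretability.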

Details

Motivation: Traditional fine-grained sentiment analysis methods require task-specific architectures and large amounts of annotated data, which limits generalization and scalability. Method: The PL-FGSA framework reformulates the task as a multi-task prompt-augmented generation problem, combining prompt design with a lightweight TextCNN. Result: The model performs strongly on three benchmark datasets, with F1-scores of 0.922, 0.694, and 0.597, outperforming traditional fine-tuning. Conclusion: PL-FGSA improves generalization through prompt learning and has practical value. Abstract: Fine-grained sentiment analysis (FGSA) aims to identify sentiment polarity toward specific aspects within a text, enabling more precise opinion mining in domains such as product reviews and social media. However, traditional FGSA approaches often require task-specific architectures and extensive annotated data, limiting their generalization and scalability. To address these challenges, we propose PL-FGSA, a unified prompt learning-based framework implemented using the MindSpore platform, which integrates prompt design with a lightweight TextCNN backbone. Our method reformulates FGSA as a multi-task prompt-augmented generation problem, jointly tackling aspect extraction, sentiment classification, and causal explanation in a unified paradigm. By leveraging prompt-based guidance, PL-FGSA enhances interpretability and achieves strong performance under both full-data and low-resource conditions. Experiments on three benchmark datasets (SST-2, SemEval-2014 Task 4, and MAMS) demonstrate that our model consistently outperforms traditional fine-tuning methods and achieves F1-scores of 0.922, 0.694, and 0.597, respectively. These results validate the effectiveness of prompt-based generalization and highlight the practical value of PL-FGSA for real-world sentiment analysis tasks.

[162] The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models

Adrian Cosma,Stefan Ruseti,Emilian Radoi,Mihai Dascalu

Main category: cs.CL

TL;DR: LLMs perform poorly on character-level tasks such as counting the letters in a word, mainly because of tokenization. The paper analyzes this limitation with 19 synthetic tasks and proposes a lightweight architectural modification.

Details

Motivation: Understand why LLMs fail at character-level tasks and propose a remedy for this structural blind spot. Method: Analyze character-level reasoning with 19 synthetic tasks and propose a lightweight architectural modification. Result: Character-level ability emerges slowly, then suddenly, and only late in training; the modification significantly improves performance. Conclusion: The study offers a framework for understanding these limitations of LLMs and an effective way to mitigate them. Abstract: Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
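
To make the letter-counting problem concrete, here is a minimal sketch in Python. The subword split and the token IDs are made up for illustration; a real tokenizer may segment the word differently, and nothing here reproduces the paper's architectural modification.

```python
def count_letter(word: str, letter: str) -> int:
    """Count a letter by looking at individual characters."""
    return sum(1 for ch in word if ch == letter)


# Hypothetical subword segmentation of "strawberry" (an assumption).
tokens = ["str", "aw", "berry"]
# Made-up IDs: this is all a model receives; the spelling is not visible.
token_ids = [101, 202, 303]


def count_letter_from_tokens(pieces, letter: str) -> int:
    """Counting over tokens works only if each token is spelled out first."""
    return sum(count_letter(piece, letter) for piece in pieces)


direct = count_letter("strawberry", "r")
via_tokens = count_letter_from_tokens(tokens, "r")
```

The point of the sketch is that the count is trivial over characters, while a model working from `token_ids` has to have learned the spelling of every token to get the same answer.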

[163] THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing for Neural Machine Translation

Yunlong Liang,Fandong Meng,Jie Zhou

Main category: cs.CL

TL;DR: THOR-MoE introduces hierarchical task-guided and context-responsive routing, addressing limitations of sparse MoE in NMT and improving translation quality.

Details

Motivation: Current MoE solutions have two limitations: 1) they inject NMT task knowledge directly and neglect naturally grouped domain/language properties; 2) expert selection relies only on the local token representation and ignores context. Method: THOR-MoE uses hierarchical task-guided and context-responsive routing: 1) it predicts the domain/language label and extracts a mixed representation to allocate task-level experts; 2) it injects context information to improve token routing. Result: Strong results on multi-domain and multilingual translation benchmarks, with an average gain of 0.75 BLEU while activating less than 22% of the parameters. Conclusion: THOR-MoE is a plug-and-play module that is compatible with existing routing schemes and broadly applicable. Abstract: The sparse Mixture-of-Experts (MoE) has achieved significant progress for neural machine translation (NMT). However, there exist two limitations in current MoE solutions which may lead to sub-optimal performance: 1) they directly inject NMT task knowledge into the MoE (e.g., domain- or language-specific knowledge), which is generally unavailable in practical applications, and neglect the naturally grouped domain/linguistic properties; 2) the expert selection depends only on the localized token representation without considering the context, which captures the state of each token from a global view. To address the above limitations, we propose THOR-MoE by arming the MoE with hierarchical task-guided and context-responsive routing policies. Specifically, it 1) first predicts the domain/language label and then extracts a mixed domain/language representation to allocate task-level experts in a hierarchical manner; 2) injects the context information to enhance the token routing from the pre-selected task-level expert set, which can help each token to be accurately routed to more specialized and suitable experts. Extensive experiments on multi-domain translation and multilingual translation benchmarks with different architectures consistently demonstrate the superior performance of THOR-MoE. Additionally, THOR-MoE operates as a plug-and-play module compatible with existing Top-k (Shazeer et al., 2017) and Top-p (Huang et al., 2024) routing schemes, ensuring broad applicability across diverse MoE architectures. For instance, compared with vanilla Top-p (Huang et al., 2024) routing, the context-aware manner can achieve an average improvement of 0.75 BLEU with less than 22% activated parameters on multi-domain translation tasks.
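
The two routing schemes named in the abstract can be sketched as follows. This is a generic illustration of Top-k and Top-p expert selection, not THOR-MoE itself: the `allowed` argument is a hypothetical stand-in for the pre-selected task-level expert set, and the logits are made-up numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_experts(logits, allowed=None):
    """Return (expert_id, probability) pairs, most probable first.

    `allowed` (hypothetical) restricts routing to a pre-selected expert set.
    """
    ids = list(range(len(logits))) if allowed is None else sorted(allowed)
    probs = softmax([logits[i] for i in ids])
    return sorted(zip(ids, probs), key=lambda pair: pair[1], reverse=True)


def top_k_route(logits, k, allowed=None):
    """Top-k: always activate exactly k experts (or fewer if not available)."""
    return [i for i, _ in rank_experts(logits, allowed)[:k]]


def top_p_route(logits, p, allowed=None):
    """Top-p: activate experts until their cumulative probability reaches p."""
    chosen, mass = [], 0.0
    for expert, prob in rank_experts(logits, allowed):
        chosen.append(expert)
        mass += prob
        if mass >= p:
            break
    return chosen


router_logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for four experts
```

Top-p activates a variable number of experts per token, which is why a better-informed router can, in principle, reach the same quality with fewer activated parameters.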

[164] Cheaper, Better, Faster, Stronger: Robust Text-to-SQL without Chain-of-Thought or Fine-Tuning

Yusuf Denizay Dönder,Derek Hommel,Andrea W Wen-Yi,David Mimno,Unso Eun Seo Jo

Main category: cs.CL

TL;DR: N-rep consistency is a cheaper text-to-SQL approach whose performance is close to that of expensive methods, at only $0.039 per query.

Details

Motivation: Existing techniques (CoT, self-consistency, fine-tuning) are costly; inference can require over a hundred LLM calls, with average costs of up to $0.46 per query. Method: N-rep uses multiple representations of the same schema input to compensate for the weaknesses of any single representation, needs no reasoning or fine-tuning, and works with smaller, cheaper models. Result: On the BIRD benchmark, N-rep scores close to the more expensive methods at a much lower cost. Conclusion: N-rep is the best-performing text-to-SQL approach in its cost range. Abstract: LLMs are effective at code generation tasks like text-to-SQL, but is it worth the cost? Many state-of-the-art approaches use non-task-specific LLM techniques including Chain-of-Thought (CoT), self-consistency, and fine-tuning. These methods can be costly at inference time, sometimes requiring over a hundred LLM calls with reasoning, incurring average costs of up to $0.46 per query, while fine-tuning models can cost thousands of dollars. We introduce "N-rep" consistency, a more cost-efficient text-to-SQL approach that achieves similar BIRD benchmark scores as other more expensive methods, at only $0.039 per query. N-rep leverages multiple representations of the same schema input to mitigate weaknesses in any single representation, making the solution more robust and allowing the use of smaller and cheaper models without any reasoning or fine-tuning. To our knowledge, N-rep is the best-performing text-to-SQL approach in its cost range.
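
A rough sketch of the idea, under stated assumptions: the abstract does not say which schema representations are used or how candidate queries are reconciled, so the two renderings below and the execution-result vote (borrowed from self-consistency) are illustrative choices. The candidate SQL strings stand in for model outputs; no LLM is called.

```python
import sqlite3
from collections import Counter


def schema_views(conn):
    """Render the same schema in two textual forms (illustrative choices)."""
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    ddl = "\n".join(sql for _, sql in tables)
    compact_parts = []
    for name, _ in tables:
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({name})")]
        compact_parts.append(f"{name}({', '.join(cols)})")
    return {"ddl": ddl, "compact": "; ".join(compact_parts)}


def pick_by_execution(conn, candidates):
    """Run every candidate and return one whose result is the most common."""
    executed = []
    for sql in candidates:
        try:
            executed.append((sql, tuple(conn.execute(sql).fetchall())))
        except sqlite3.Error:
            continue  # invalid SQL gets no vote
    if not executed:
        return None
    winner, _ = Counter(result for _, result in executed).most_common(1)[0]
    return next(sql for sql, result in executed if result == winner)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b"), (5, "c")])

views = schema_views(conn)  # each view would go into a separate prompt

# Stand-ins for the SQL a model might return, one per prompt (assumption).
candidates = [
    "SELECT COUNT(*) FROM users",
    "SELECT COUNT(id) FROM users",
    "SELECT COUNT(*) FROM user",   # fails: no such table
    "SELECT MAX(id) FROM users",   # runs, but disagrees with the majority
]
best = pick_by_execution(conn, candidates)
```

Each prompt variant costs one model call, so the total cost grows linearly with the number of representations rather than with the length of a reasoning trace.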

[165] Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits

Xiang Zhang,Juntai Cao,Jiaqi Wei,Yiwei Xu,Chenyu You

Main category: cs.CL

TL;DR: The paper examines how tokenization limits the reasoning ability of language models, introduces the notion of Token Awareness, and shows that atomically aligned token formats substantially improve reasoning performance.

Details

Motivation: Study how tokenization schemes such as BPE impede symbolic computation by merging or obscuring atomic reasoning units, thereby limiting model reasoning. Method: Analyze the effect of token structure on reasoning theoretically and empirically, introduce Token Awareness, and evaluate systematically on arithmetic and symbolic tasks. Result: Atomically aligned formats substantially improve reasoning; small models (e.g., GPT-4o-mini) outperform larger ones (e.g., o1) on structured reasoning. Conclusion: Symbolic reasoning in LLMs depends not only on architecture but also deeply on token-level representations. Abstract: Tokenization is the first - and often underappreciated - layer of computation in language models. While Chain-of-Thought (CoT) prompting enables transformer models to approximate recurrent computation by externalizing intermediate steps, we show that the success of such reasoning is fundamentally bounded by the structure of tokenized inputs. This work presents a theoretical and empirical investigation into how tokenization schemes, particularly subword-based methods like byte-pair encoding (BPE), impede symbolic computation by merging or obscuring atomic reasoning units. We introduce the notion of Token Awareness to formalize how poor token granularity disrupts logical alignment and prevents models from generalizing symbolic procedures. Through systematic evaluation on arithmetic and symbolic tasks, we demonstrate that token structure dramatically affects reasoning performance, causing failure even with CoT, while atomically-aligned formats unlock strong generalization, allowing small models (e.g., GPT-4o-mini) to outperform larger systems (e.g., o1) in structured reasoning. Our findings reveal that symbolic reasoning ability in LLMs is not purely architectural, but deeply conditioned on token-level representations.
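
The difference between a merged and an atomically aligned input can be shown with a toy tokenizer. The greedy longest-match tokenizer and its two-entry vocabulary below are assumptions made for illustration; they are not BPE and not the paper's setup.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer with single-character fallback."""
    tokens, i = [], 0
    max_len = max(len(entry) for entry in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def atomize(text, sep=" "):
    """Separate every character so each one becomes its own unit."""
    return sep.join(text)


def tokenize_atomized(text, vocab):
    """Tokenize whitespace-separated chunks independently."""
    return [tok for chunk in text.split() for tok in greedy_tokenize(chunk, vocab)]


toy_vocab = {"12", "345"}  # hypothetical merges
merged = greedy_tokenize("12345", toy_vocab)
aligned = tokenize_atomized(atomize("12345"), toy_vocab)
```

In the merged form the digit boundaries are hidden inside two opaque tokens, while the aligned form exposes one token per digit, which is the kind of input a digit-by-digit procedure can operate on.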

[166] Enhancing Abstractive Summarization of Scientific Papers Using Structure Information

Tong Bao,Heng Zhang,Chengzhi Zhang

Main category: cs.CL

TL;DR: Proposes a two-stage abstractive summarization framework that automatically recognizes the structural functions of scientific papers in order to generate more comprehensive summaries.

Details

Motivation: Existing summarization methods do not fully capture the structural information of scientific papers and lack robustness across disciplines. Method: A two-stage framework: 1) build a dataset for structural function recognition and train a classifier; 2) use Longformer to generate context-aware summaries. Result: Outperforms advanced baselines on two domain-specific datasets and produces more comprehensive summaries. Conclusion: Structural function recognition combined with context modeling improves the quality of scientific paper summaries. Abstract: Abstractive summarization of scientific papers has always been a research focus, yet existing methods face two main challenges. First, most summarization models rely on Encoder-Decoder architectures that treat papers as sequences of words and thus fail to fully capture the structured information inherent in scientific papers. Second, existing research often uses keyword mapping or feature engineering to identify the structural information, but these methods struggle with the structural flexibility of scientific papers and lack robustness across different disciplines. To address these challenges, we propose a two-stage abstractive summarization framework that leverages automatic recognition of structural functions within scientific papers. In the first stage, we standardize chapter titles from numerous scientific papers and construct a large-scale dataset for structural function recognition. A classifier is then trained to automatically identify the key structural components (e.g., Background, Methods, Results, Discussion), which provides a foundation for generating more balanced summaries. In the second stage, we employ Longformer to capture rich contextual relationships across sections and generate context-aware summaries. Experiments conducted on two domain-specific scientific paper summarization datasets demonstrate that our method outperforms advanced baselines and generates more comprehensive summaries. The code and dataset can be accessed at https://github.com/tongbao96/code-for-SFR-AS.

[167] SlangDIT: Benchmarking LLMs in Interpretative Slang Translation

Yunlong Liang,Fandong Meng,Jiaan Wang,Jie Zhou

Main category: cs.CL

TL;DR: The paper introduces the SlangDIT task, which combines slang detection, explanation, and translation, and builds a dataset of over 25k English-Chinese sentence pairs. The proposed SlangOWL model uses deep thinking to substantially improve translation performance.

Details

Motivation: The difficulty of slang translation lies in capturing context-dependent semantic extensions; prior work has not adequately explored how slang detection, explanation, and translation depend on one another. Method: Define the SlangDIT task with three sub-tasks (slang detection, cross-lingual explanation, translation); build the SlangDIT dataset; design SlangOWL, which thinks step by step through slang identification, polysemy judgment, explanation generation, and translation. Result: On LLMs such as Qwen2.5 and Llama-3.1, SlangOWL significantly outperforms the vanilla models and supervised fine-tuned models without thinking. Conclusion: The SlangDIT task and the SlangOWL model improve the accuracy of slang translation and show the value of deep thinking in LLMs. Abstract: The challenge of slang translation lies in capturing context-dependent semantic extensions, as slang terms often convey meanings beyond their literal interpretation. While slang detection, explanation, and translation have been studied as isolated tasks in the era of large language models (LLMs), their intrinsic interdependence remains underexplored. The main reason is the lack of a benchmark in which two of the tasks serve as prerequisites for the third, which can facilitate idiomatic translation. In this paper, we introduce the interpretative slang translation task (named SlangDIT) consisting of three sub-tasks: slang detection, cross-lingual slang explanation, and slang translation within the current context, aiming to generate more accurate translation with the help of slang detection and slang explanation. To this end, we construct a SlangDIT dataset, containing over 25k English-Chinese sentence pairs. Each source sentence mentions at least one slang term and is labeled with the corresponding cross-lingual slang explanation. Based on the benchmark, we propose a deep thinking model, named SlangOWL. It first identifies whether the sentence contains slang, and then judges whether the slang is polysemous and analyzes its possible meanings. Further, SlangOWL provides the best explanation of the slang term for the current context. Finally, according to the whole thought, SlangOWL offers a suitable translation. Our experiments on LLMs (e.g., Qwen2.5 and Llama-3.1) show that our deep thinking approach indeed enhances the performance of LLMs, where the proposed SlangOWL significantly surpasses the vanilla models and supervised fine-tuned models without thinking.

[168] ThinkSwitcher: When to Think Hard, When to Think Fast

Guosheng Liang,Longguang Zhong,Ziyi Yang,Xiaojun Quan

Main category: cs.CL

TL;DR: ThinkSwitcher dynamically switches between short and long chain-of-thought modes, cutting computational cost by 20-30% while keeping accuracy high on complex tasks.

Details

Motivation: Large reasoning models (LRMs) excel at complex tasks but overthink simple ones, wasting computation. Method: Propose ThinkSwitcher, in which a lightweight switching module selects the reasoning mode dynamically according to task complexity. Result: On multiple reasoning benchmarks, ThinkSwitcher reduces computational cost by 20-30% without hurting accuracy on complex tasks. Conclusion: ThinkSwitcher is a scalable and efficient solution for unified LRM deployment. Abstract: Large reasoning models (LRMs) excel at solving complex tasks by leveraging long chain-of-thought (CoT) reasoning. However, this often leads to overthinking on simple tasks, resulting in unnecessary computational overhead. We observe that LRMs inherently possess the capability for efficient short CoT reasoning, which can be reliably elicited through prompt design. To leverage this capability, we propose ThinkSwitcher, a framework that enables a single LRM to dynamically switch between short and long CoT modes based on task complexity. ThinkSwitcher introduces a lightweight switching module trained with supervision signals derived from the relative performance of each reasoning mode across tasks. Experiments on multiple reasoning benchmarks show that ThinkSwitcher reduces computational cost by 20-30% while maintaining high accuracy on complex tasks. This demonstrates the effectiveness of ThinkSwitcher as a scalable and efficient solution for unified LRM deployment.
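
One way to picture the switching decision is sketched below. Everything here is an assumption made for illustration: the abstract says only that the switcher is trained on the relative performance of the two modes, so the pass-rate targets, the `margin` threshold, and the function names are hypothetical.

```python
def pass_rate(outcomes):
    """Fraction of sampled answers that were correct (1) rather than wrong (0)."""
    return sum(outcomes) / len(outcomes)


def training_targets(short_outcomes, long_outcomes):
    """Supervision signal for one query: empirical pass rate of each mode."""
    return pass_rate(short_outcomes), pass_rate(long_outcomes)


def choose_mode(pred_short, pred_long, margin=0.05):
    """Use long CoT only when its predicted gain exceeds a margin (assumed)."""
    return "long" if pred_long - pred_short > margin else "short"


# Made-up outcomes for a single query, four samples per mode.
targets = training_targets([1, 1, 1, 0], [1, 1, 1, 1])
hard_query_mode = choose_mode(*targets)
easy_query_mode = choose_mode(0.90, 0.92)
```

Raising `margin` trades accuracy for compute: more queries fall back to the short mode, which is the knob a deployment would tune.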

[169] Unraveling Interwoven Roles of Large Language Models in Authorship Privacy: Obfuscation, Mimicking, and Verification

Tuc Nguyen,Yifan Hu,Thai Le

Main category: cs.CL

TL;DR: The paper examines how large language models (LLMs) may leak user privacy through generated text, presents the first unified framework for analyzing the dynamics among three authorship-privacy tasks (AO, AM, AV), and studies how demographic metadata affects their performance.

Details

Motivation: LLMs may leak user privacy in generated text, both through explicit information (names, addresses) and through implicit signals (writing style). The three authorship-privacy tasks (AO, AM, AV) have so far been studied independently, without examining how they interact. Method: Present the first unified framework that quantifies the dynamics among LLM-enabled AO, AM, and AV, studies how they transform human-authored text, and examines the effect of demographic metadata such as gender and academic background. Result: The tasks interact substantially, and demographic metadata modulates task performance and privacy risk. Conclusion: The paper fills a gap in authorship-privacy research in the LLM era and provides a theoretical basis and practical tools for future privacy-protection techniques. Abstract: Recent advancements in large language models (LLMs) have been fueled by large-scale training corpora drawn from diverse sources such as websites, news articles, and books. These datasets often contain explicit user information, such as person names and addresses, that LLMs may unintentionally reproduce in their generated outputs. Beyond such explicit content, LLMs can also leak identity-revealing cues through implicit signals such as distinctive writing styles, raising significant concerns about authorship privacy. There are three major automated tasks in authorship privacy, namely authorship obfuscation (AO), authorship mimicking (AM), and authorship verification (AV). Prior research has studied AO, AM, and AV independently. However, their interplay remains under-explored, which leaves a major research gap, especially in the era of LLMs, where they are profoundly shaping how we curate and share user-generated content, and the distinction between machine-generated and human-authored text is increasingly blurred. This work presents the first unified framework for analyzing the dynamic relationships among LLM-enabled AO, AM, and AV in the context of authorship privacy. We quantify how they interact with each other to transform human-authored text, examining effects at a single point in time and iteratively over time. We also examine the role of demographic metadata, such as gender and academic background, in modulating their performance, inter-task dynamics, and privacy risks. All source code will be publicly available.

[170] Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks

Sizhe Yuen,Ting Su,Ziyang Wang,Yali Du,Adam J. Sobey

Main category: cs.CL

TL;DR: The paper proposes enhancing large language models on knowledge-intensive question answering by automatically generating context-based QA pairs, reducing reliance on human annotation and improving comprehension and reasoning.

Details

Motivation: Current QA systems perform poorly on queries that need complex reasoning or real-time knowledge integration, and retrieval-augmented generation (RAG) still struggles to connect information logically across multiple sources. Method: Use an automated QA generator and a model fine-tuner, with LLMs producing the fine-tuning data; evaluate with perplexity, ROUGE, BLEU, and BERTScore. Result: Logical coherence and factual accuracy improve; Mistral-7b-v0.3 outperforms Llama-3-8b on several metrics. Conclusion: The approach is potentially valuable for building adaptable AI systems, especially for reducing human annotation and improving model reasoning. Abstract: A question-answering (QA) system searches for suitable answers within a knowledge base. Current QA systems struggle with queries requiring complex reasoning or real-time knowledge integration. They are often supplemented with retrieval techniques on a data source such as Retrieval-Augmented Generation (RAG). However, RAG continues to face challenges in handling complex reasoning and logical connections between multiple sources of information. A novel approach for enhancing Large Language Models (LLMs) in knowledge-intensive QA tasks is presented through the automated generation of context-based QA pairs. This methodology leverages LLMs to create fine-tuning data, reducing reliance on human labelling and improving model comprehension and reasoning capabilities. The proposed system includes an automated QA generator and a model fine-tuner, evaluated using perplexity, ROUGE, BLEU, and BERTScore. Comprehensive experiments demonstrate improvements in logical coherence and factual accuracy, with implications for developing adaptable Artificial Intelligence (AI) systems. Mistral-7b-v0.3 outperforms Llama-3-8b with BERT F1, BLEU, and ROUGE scores of 0.858, 0.172, and 0.260 for the LLM-generated QA pairs, compared to scores of 0.836, 0.083, and 0.139 for the human-annotated QA pairs.

[171] "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

Darpan Aswal,Siddharth D Jaiswal

Main category: cs.CL

TL;DR: The study introduces a red-teaming strategy based on code-mixing and phonetic perturbations that bypasses the safety filters of large language models (LLMs), achieving high attack success rates on both text and image generation tasks.

Details

Motivation: Existing red-teaming work has focused on English and on fixed template-based attacks, while models remain vulnerable in multilingual and multimodal settings. The study aims to fill this gap and motivate more generalizable safety alignment. Method: Generate novel prompts with code-mixing and phonetic perturbations that bypass LLM safety filters while remaining interpretable. Result: The prompts reach a 99% attack success rate for text generation and 78% for image generation, with high attack relevance. Phonetic perturbations succeed by changing how words are tokenized. Conclusion: The study calls for stronger, more generalizable safety alignment of multilingual multimodal models, especially for real-world settings in which prompts may contain misspellings. Abstract: Large Language Models (LLMs) have become increasingly powerful, with multilingual and multimodal capabilities improving by the day. These models are being evaluated through audits, alignment studies and red-teaming efforts to expose model vulnerabilities towards generating harmful, biased and unfair content. Existing red-teaming efforts have previously focused on the English language, using fixed template-based attacks; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially in the multimodal context. In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks. We also introduce two new jailbreak strategies that show higher effectiveness than baseline strategies. Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts. Our novel prompts achieve a 99% Attack Success Rate for text generation and 78% for image generation, with Attack Relevance Rate of 100% for text generation and 95% for image generation when using the phonetically perturbed code-mixed prompts. Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success. Our study motivates increasing the focus towards more generalizable safety alignment for multilingual multimodal models, especially in real-world settings wherein prompts can have misspelt words.

[172] Mechanistic Fine-tuning for In-context Learning

Hakaze Cho,Peng Luo,Mariko Kato,Rin Kaenbyou,Naoya Inoue

Main category: cs.CL

TL;DR: The paper proposes Attention Behavior Fine-Tuning (ABFT), which improves the in-context learning ability of language models by building the training objective on attention scores rather than final outputs, at a much lower computational cost.

Details

Motivation: Existing approaches to improving in-context learning (ICL) fine-tune models end to end at massive computational cost; the paper aims to reduce this cost by exploiting what is known about the inner mechanism of ICL. Method: ABFT optimizes attention scores so that they focus on the correct label tokens in the context and move away from the wrong label tokens. Result: Experiments on 9 modern language models and 8 datasets show that ABFT is better in performance, robustness, unbiasedness, and efficiency, with only around 0.01% of the data cost of previous methods. Conclusion: ABFT shows that behavior can be improved by controlling specific module sequences inside language models, opening a path to future applications of mechanistic interpretability. Abstract: In-context Learning (ICL) utilizes structured demonstration-query inputs to induce few-shot learning on Language Models (LMs), which are not originally pre-trained on ICL-style data. To bridge the gap between ICL and pre-training, some approaches fine-tune LMs on large ICL-style datasets by an end-to-end paradigm with massive computational costs. To reduce such costs, in this paper, we propose Attention Behavior Fine-Tuning (ABFT), utilizing the previous findings on the inner mechanism of ICL, building training objectives on the attention scores instead of the final outputs, to force the attention scores to focus on the correct label tokens presented in the context and mitigate attention scores from the wrong label tokens. Our experiments on 9 modern LMs and 8 datasets empirically find that ABFT outperforms previous methods in performance, robustness, unbiasedness, and efficiency, with only around 0.01% of their data cost. Moreover, our subsequent analysis finds that the end-to-end training objective contains the ABFT objective, suggesting the implicit bias of ICL-style data to the emergence of induction heads. Our work demonstrates the possibility of controlling specific module sequences within LMs to improve their behavior, opening up the future application of mechanistic interpretability.
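
The abstract does not give the ABFT loss formula. As a sketch of what an objective built on attention scores could look like, the function below rewards attention mass on in-context label tokens that match the gold label and penalizes mass on the other label tokens. The exact form, the names, and the numbers are assumptions, not the paper's definition.

```python
import math


def attention_label_loss(attn, label_positions, labels, target, eps=1e-9):
    """Illustrative loss over one query's attention distribution.

    attn            -- attention weights from the query position to the context
    label_positions -- context indices that hold demonstration label tokens
    labels          -- the label shown at each of those positions
    target          -- gold label of the query
    """
    correct = sum(attn[p] for p, lab in zip(label_positions, labels) if lab == target)
    wrong = sum(attn[p] for p, lab in zip(label_positions, labels) if lab != target)
    # Low when attention concentrates on correct label tokens, high otherwise.
    return -math.log(correct + eps) + wrong


positions = [1, 3]          # where the demonstration labels sit (made up)
shown = ["pos", "neg"]      # labels shown in the two demonstrations
focused = [0.05, 0.60, 0.05, 0.10, 0.20]     # attends to the "pos" label
distracted = [0.05, 0.10, 0.05, 0.60, 0.20]  # attends to the "neg" label

loss_focused = attention_label_loss(focused, positions, shown, "pos")
loss_distracted = attention_label_loss(distracted, positions, shown, "pos")
```

Because such a loss reads attention weights directly, it needs no gradient signal from the output logits, which is consistent with the abstract's claim of a much lower training cost.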

[173] ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models

Raghav Singhal,Kaustubh Ponkshe,Rohit Vartak,Praneeth Vepakomma

Main category: cs.CL

TL;DR: ABBA is a new parameter-efficient fine-tuning (PEFT) architecture that decouples the update from the pre-trained weights, markedly increasing expressivity and outperforming existing methods on several tasks.

Details

Motivation: Large language models perform well on many tasks, but adapting them efficiently to new domains remains a challenge. The expressivity of existing PEFT methods such as LoRA and HiRA is limited by the rank of the update or by the structure of the pre-trained model. Method: ABBA reparameterizes the update as the Hadamard product of two independently learnable low-rank matrices, fully decoupling it from the pre-trained weights. Result: Matrix reconstruction experiments confirm the higher expressivity, and ABBA significantly outperforms existing PEFT methods on arithmetic and commonsense reasoning benchmarks. Conclusion: The decoupled design yields higher expressivity and offers a new option for parameter-efficient fine-tuning. Abstract: Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget. We formally analyze ABBA's expressive capacity and validate its advantages through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: https://github.com/CERT-Lab/abba.
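
The expressivity claim can be illustrated with a small exact-arithmetic example. Writing the update as ΔW = (B1 A1) ⊙ (B2 A2), the rank of ΔW can reach the product of the two factor ranks, whereas a single low-rank product B A cannot exceed its own rank. The matrices below are hand-picked for illustration (the symbol names are assumptions), and the code is a sketch, not the released implementation.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rank(m):
    """Matrix rank by Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in m]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


# Two rank-2 factor pairs for a 4x4 update (hand-picked values).
B1 = [[1, 0], [1, 0], [0, 1], [0, 1]]
A1 = [[1, 1, 0, 0], [0, 0, 1, 1]]
B2 = [[1, 0], [0, 1], [1, 0], [0, 1]]
A2 = [[1, 0, 1, 0], [0, 1, 0, 1]]

M1 = matmul(B1, A1)          # rank 2
M2 = matmul(B2, A2)          # rank 2
delta_w = hadamard(M1, M2)   # here: the 4x4 identity, rank 4
```

With these values the two rank-2 factors combine into a full-rank update, something no single rank-2 product of the same shape could represent.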

Ziang Wang,Amir Aryani

Main category: cs.CL

TL;DR: This technical report presents an NLP-based approach for systematically classifying the scientific literature on childhood speech disorders. Using two topic modeling techniques, LDA and BERTopic, it identifies 14 clinically meaningful topics and reports good model performance.

Details

Motivation: Speech pathology needs automated literature review to improve efficiency and precision. Method: Retrieve and filter 4,804 relevant articles from PubMed, apply LDA and BERTopic for topic modeling, and refine the results with a custom stop word list. Result: The LDA model reaches a coherence score of 0.42 and a perplexity of -7.5; the BERTopic model has fewer than 20% outlier topics, indicating that it classifies the literature effectively. Conclusion: The approach provides a workable foundation for automating literature reviews in speech pathology. Abstract: This technical report presents a natural language processing (NLP)-based approach for systematically classifying scientific literature on childhood speech disorders. We retrieved and filtered 4,804 relevant articles published after 2015 from the PubMed database using domain-specific keywords. After cleaning and pre-processing the abstracts, we applied two topic modeling techniques - Latent Dirichlet Allocation (LDA) and BERTopic - to identify latent thematic structures in the corpus. Our models uncovered 14 clinically meaningful clusters, such as infantile hyperactivity and abnormal epileptic behavior. To improve relevance and precision, we incorporated a custom stop word list tailored to speech pathology. Evaluation results showed that the LDA model achieved a coherence score of 0.42 and a perplexity of -7.5, indicating strong topic coherence and predictive performance. The BERTopic model exhibited a low proportion of outlier topics (less than 20%), demonstrating its capacity to classify heterogeneous literature effectively. These results provide a foundation for automating literature reviews in speech-language pathology.

[175] TransBench: Benchmarking Machine Translation for Industrial-Scale Applications

Haijun Li,Tianqi Shi,Zifu Shang,Yuxuan Han,Xueyu Zhao,Hao Wang,Yu Qian,Zhiqiang Qian,Linlong Xu,Minghao Wu,Chenyang Lyu,Longyue Wang,Gongbo Tang,Weihua Luo,Zhao Xu,Kaifu Zhang

Main category: cs.CL

TL;DR: The paper proposes a three-level capability framework for evaluating industrial machine translation (basic linguistic competence, domain-specific proficiency, cultural adaptation) and introduces the TransBench benchmark, closing the gap between academic evaluation and real industrial needs.

Details

Motivation: General-purpose MT models underperform in industrial scenarios because of domain terminology, cultural nuances, and stylistic conventions, and existing evaluation frameworks do not cover these specialized needs. Method: Propose a three-level translation capability framework and build the TransBench benchmark (17,000 professionally translated sentences covering 4 scenarios and 33 language pairs), combining traditional metrics with the domain-specific evaluation model Marco-MOS. Result: Contributions include a structured evaluation framework, the first publicly available benchmark for e-commerce translation, multi-level quality metrics, and open-source tools. Conclusion: The work gives industrial MT a systematic evaluation method and supports the improvement of domain-specific models. Abstract: Machine translation (MT) has become indispensable for cross-border communication in globalized industries like e-commerce, finance, and legal services, with recent advancements in large language models (LLMs) significantly enhancing translation quality. However, applying general-purpose MT models to industrial scenarios reveals critical limitations due to domain-specific terminology, cultural nuances, and stylistic conventions absent in generic benchmarks. Existing evaluation frameworks inadequately assess performance in specialized contexts, creating a gap between academic benchmarks and real-world efficacy. To address this, we propose a three-level translation capability framework: (1) Basic Linguistic Competence, (2) Domain-Specific Proficiency, and (3) Cultural Adaptation, emphasizing the need for holistic evaluation across these dimensions. We introduce TransBench, a benchmark tailored for industrial MT, initially targeting international e-commerce with 17,000 professionally translated sentences spanning 4 main scenarios and 33 language pairs. TransBench integrates traditional metrics (BLEU, TER) with Marco-MOS, a domain-specific evaluation model, and provides guidelines for reproducible benchmark construction. Our contributions include: (1) a structured framework for industrial MT evaluation, (2) the first publicly available benchmark for e-commerce translation, (3) novel metrics probing multi-level translation quality, and (4) open-sourced evaluation tools. This work bridges the evaluation gap, enabling researchers and practitioners to systematically assess and enhance MT systems for industry-specific needs.

[176] FuxiMT: Sparsifying Large Language Models for Chinese-Centric Multilingual Machine Translation

Shaolin Zhu,Tianyu Dong,Bo Li,Deyi Xiong

Main category: cs.CL

TL;DR: FuxiMT is a Chinese-centric multilingual machine translation model built on a sparsified large language model. It uses a two-stage training strategy with MoE and curriculum learning and significantly outperforms strong baselines, especially in low-resource and zero-shot translation.

Details

Motivation: Address the weak translation quality of low-resource language pairs in multilingual MT and explore the potential of sparsified LLMs for translation. Method: Two-stage training: pre-train on a Chinese corpus, then fine-tune multilingually on parallel data covering 65 languages, combining MoE with a curriculum learning strategy. Result: FuxiMT significantly outperforms strong baselines, especially in low-resource and zero-shot settings. Conclusion: FuxiMT shows the potential of sparsified LLMs for multilingual translation, particularly where parallel data are scarce or unavailable. Abstract: In this paper, we present FuxiMT, a novel Chinese-centric multilingual machine translation model powered by a sparsified large language model (LLM). We adopt a two-stage strategy to train FuxiMT. We first pre-train the model on a massive Chinese corpus and then conduct multilingual fine-tuning on a large parallel dataset encompassing 65 languages. FuxiMT incorporates Mixture-of-Experts (MoEs) and employs a curriculum learning strategy for robust performance across various resource levels. Experimental results demonstrate that FuxiMT significantly outperforms strong baselines, including state-of-the-art LLMs and machine translation models, particularly under low-resource scenarios. Furthermore, FuxiMT exhibits remarkable zero-shot translation capabilities for unseen language pairs, indicating its potential to bridge communication gaps where parallel data are scarce or unavailable.

[177] Think-J: Learning to Think for Generative LLM-as-a-Judge

Hui Huang,Yancheng He,Hongli Zhou,Rui Zhang,Wei Liu,Weixun Wang,Wenbo Su,Bo Zheng,Jiaheng Liu

Main category: cs.CL

TL;DR: The paper proposes Think-J, which improves generative LLMs as judges by teaching them how to think, using reinforcement learning to optimize the judgment thinking and substantially improving evaluation ability.

Details

Motivation: Generative LLMs still fall short of expectations when used as LLM-Judge, so their judgment ability needs to improve. Method: First use a small amount of curated data to give the model initial judgment thinking ability, then optimize the judgment thinking with offline reinforcement learning (training a critic model) and online reinforcement learning (rule-based rewards). Result: Think-J significantly improves the evaluation ability of generative LLM-Judge and outperforms other methods without extra human annotation. Conclusion: Optimizing judgment thinking is an effective way to improve generative LLMs as judges. Abstract: LLM-as-a-Judge refers to the automatic modeling of preferences for responses generated by Large Language Models (LLMs), which is of significant importance for both LLM evaluation and reward modeling. Although generative LLMs have made substantial progress in various tasks, their performance as LLM-Judge still falls short of expectations. In this work, we propose Think-J, which improves generative LLM-as-a-Judge by learning how to think. We first utilize a small amount of curated data to develop the model with initial judgment thinking capabilities. Subsequently, we optimize the judgment thinking traces based on reinforcement learning (RL). We propose two methods for judgment thinking optimization, based on offline and online RL, respectively. The offline RL requires training a critic model to construct positive and negative examples for learning. The online method defines rule-based reward as feedback for optimization. Experimental results show that our approach can significantly enhance the evaluation capability of generative LLM-Judge, surpassing both generative and classifier-based LLM-Judge without requiring extra human annotations.

[178] FAID: Fine-grained AI-generated Text Detection using Multi-task Auxiliary and Multi-level Contrastive Learning

Minh Ngoc Ta,Dong Cao Van,Duc-Anh Hoang,Minh Le-Anh,Truong Nguyen,My Anh Tran Nguyen,Yuxia Wang,Preslav Nakov,Sang Dinh

Main category: cs.CL

TL;DR: The FAIDSet dataset and the FAID framework distinguish human-written, AI-generated, and human-AI collaborative text. Multi-level contrastive learning and multi-task classification improve detection, and the approach performs well on unseen data.

Details

Motivation: Address the challenge of telling apart the sources of text produced by human-AI collaboration, improving transparency and accountability. Method: Combine multi-level contrastive learning with multi-task auxiliary classification, model AI model families as distinct stylistic entities, and adapt to distribution shifts in unseen data. Result: FAID outperforms baseline methods on unseen domains and new AI models, improving generalization. Conclusion: FAID offers a potential solution for transparency and accountability in AI-assisted writing. Abstract: The growing collaboration between humans and AI models in generative tasks has introduced new challenges in distinguishing between human-written, AI-generated, and human-AI collaborative texts. In this work, we collect a multilingual, multi-domain, multi-generator dataset FAIDSet. We further introduce a fine-grained detection framework FAID to classify text into these three categories, while also identifying the underlying AI model family. Unlike existing binary classifiers, FAID is built to capture both authorship and model-specific characteristics. Our method combines multi-level contrastive learning with multi-task auxiliary classification to learn subtle stylistic cues. By modeling AI families as distinct stylistic entities, FAID offers improved interpretability. We incorporate an adaptation to address distributional shifts without retraining for unseen data. Experimental results demonstrate that FAID outperforms several baseline approaches, particularly enhancing the generalization accuracy on unseen domains and new AI models. It provides a potential solution for improving transparency and accountability in AI-assisted writing.

[179] Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data

Faeze Ghorbanpour,Daryna Dementieva,Alexander Fraser

Main category: cs.CL

TL;DR: Proposes a cross-lingual transfer learning approach based on nearest-neighbor retrieval for hate speech detection in low-resource languages, improving performance and surpassing existing methods.

Details

Motivation: Labeled hate speech data is expensive and time-consuming to collect, especially for low-resource languages, so an efficient and scalable way to improve detection is needed. Method: Use nearest-neighbor retrieval to pull relevant examples from a multilingual hate speech pool, augment the small labeled set in the target language, and apply maximum marginal relevance to reduce redundancy. Result: Evaluated on eight languages, the approach beats models trained only on target-language data, surpasses the current state of the art in most cases, and is highly data-efficient. Conclusion: The method is efficient, scalable, and adaptable to new languages and tasks; reducing redundancy brings further gains in some languages. Abstract: Detecting hateful language is important, yet labeled hate speech data is expensive and time-consuming to collect, particularly for low-resource languages. Prior work has demonstrated the effectiveness of cross-lingual transfer learning and data augmentation in improving performance on tasks with limited labeled data. To develop an efficient and scalable cross-lingual transfer learning approach, we leverage nearest-neighbor retrieval to augment minimal labeled data in the target language, thereby enhancing detection performance. Specifically, we assume access to a small set of labeled training instances in the target language and use these to retrieve the most relevant labeled examples from a large multilingual hate speech detection pool. We evaluate our approach on eight languages and demonstrate that it consistently outperforms models trained solely on the target language data. Furthermore, in most cases, our method surpasses the current state-of-the-art. Notably, our approach is highly data-efficient, retrieving as few as 200 instances in some cases while maintaining superior performance. Moreover, it is scalable, as the retrieval pool can be easily expanded, and the method can be readily adapted to new languages and tasks. We also apply maximum marginal relevance to mitigate redundancy and filter out highly similar retrieved instances, resulting in improvements in some languages.
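
The retrieval step with maximum marginal relevance (MMR) can be sketched as follows. The vectors are toy stand-ins for sentence embeddings, and the trade-off weight `lam` and the function names are assumptions; the abstract does not specify the encoder or the MMR settings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query, pool, k, lam=0.5):
    """Pick k pool indices that are relevant to the query but not redundant.

    score(i) = lam * sim(query, i) - (1 - lam) * max sim(i, already selected)
    """
    selected, remaining = [], list(range(len(pool)))

    def score(i):
        relevance = cosine(query, pool[i])
        redundancy = max((cosine(pool[i], pool[j]) for j in selected), default=0.0)
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


target_example = [1.0, 1.0, 0.0]   # embedding of a target-language instance
multilingual_pool = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],   # exact duplicate of the first entry
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],   # unrelated to the query
]
picked = mmr_select(target_example, multilingual_pool, k=2)
```

Plain nearest-neighbor retrieval would return the two identical entries here; the redundancy term makes the second pick a different, equally relevant example instead.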

[180] YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering

Jennifer D'Souza,Hamed Babaei Giglou,Quentin Münch

Main category: cs.CL

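TL;DR: YESciEval is an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to evaluate LLM scientific question answering while mitigating optimism bias in LLM evaluators.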

Details

Motivation: The evaluation robustness of LLMs in scientific question answering is underexplored, and transparent, scalable evaluation methods are lacking. Method: Proposes the YESciEval framework, which combines rubric-based assessment with reinforcement learning, and releases multidisciplinary science Q&A datasets with adversarial variants. Result: Enables scalable, cost-free evaluation independent of proprietary models and human feedback, supporting reliable LLM-as-a-judge models. Conclusion: The work supports AI alignment and provides transparent, robust evaluation for scientific inquiry and artificial general intelligence. Abstract: Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry and artificial general intelligence.

[181] Universal Acoustic Adversarial Attacks for Flexible Control of Speech-LLMs

Rao Ma,Mengjie Qian,Vyas Raina,Mark Gales,Kate Knill

Main category: cs.CL

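TL;DR: Studies the vulnerability of speech LLMs to universal acoustic adversarial attacks, finding critical vulnerabilities in Qwen2-Audio and Granite-Speech and a need for more robust training strategies.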

Details

Motivation: The flexibility of speech LLMs may make them more vulnerable to adversarial attacks, so their vulnerability needs to be assessed. Method: Prepends a fixed adversarial audio segment to the original audio to make the model produce no output or override the original task, then extends the attack to be selective. Result: Qwen2-Audio and Granite-Speech are highly vulnerable to universal adversarial attacks. Conclusion: Speech LLMs need more robust training strategies to improve resistance to adversarial attacks. Abstract: The combination of pre-trained speech encoders with large language models has enabled the development of speech LLMs that can handle a wide range of spoken language processing tasks. While these models are powerful and flexible, this very flexibility may make them more vulnerable to adversarial attacks. To examine the extent of this problem, in this work we investigate universal acoustic adversarial attacks on speech LLMs. Here a fixed, universal, adversarial audio segment is prepended to the original input audio. We initially investigate attacks that cause the model to either produce no output or to perform a modified task overriding the original prompt. We then extend the nature of the attack to be selective so that it activates only when specific input attributes, such as speaker gender or spoken language, are present. Inputs without the targeted attribute should be unaffected, allowing fine-grained control over the model outputs. Our findings reveal critical vulnerabilities in Qwen2-Audio and Granite-Speech and suggest that similar speech LLMs may be susceptible to universal adversarial attacks. This highlights the need for more robust training strategies and improved resistance to adversarial attacks.

[182] Cross-Lingual Optimization for Language Transfer in Large Language Models

Jungseob Lee,Seongtae Hong,Hyeonseok Moon,Heuiseok Lim

Main category: cs.CL

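TL;DR: Proposes Cross-Lingual Optimization (CLO), which efficiently transfers an English-centric large language model (LLM) to a target language in data-constrained settings while preserving its English capabilities.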

Details

Motivation: Standard supervised fine-tuning (SFT) overemphasizes English performance, especially in data-constrained settings, which hurts target-language performance. Method: CLO uses publicly available English SFT data and a translation model to enable cross-lingual transfer. Result: Experiments show that CLO outperforms SFT in both target-language proficiency and English retention; in low-resource languages it surpasses SFT with only half the data. Conclusion: CLO is markedly more data-efficient and robust than SFT, offering a better approach to multilingual model transfer. Abstract: Adapting large language models to other languages typically employs supervised fine-tuning (SFT) as a standard approach. However, it often suffers from an overemphasis on English performance, a phenomenon that is especially pronounced in data-constrained environments. To overcome these challenges, we propose Cross-Lingual Optimization (CLO) that efficiently transfers an English-centric LLM to a target language while preserving its English capabilities. CLO utilizes publicly available English SFT data and a translation model to enable cross-lingual transfer. We conduct experiments using five models on six languages, each possessing varying levels of resource. Our results show that CLO consistently outperforms SFT in both acquiring target language proficiency and maintaining English performance. Remarkably, in low-resource languages, CLO with only 3,200 samples surpasses SFT with 6,400 samples, demonstrating that CLO can achieve better performance with less data. Furthermore, we find that SFT is particularly sensitive to data quantity in medium and low-resource languages, whereas CLO remains robust. Our comprehensive analysis emphasizes the limitations of SFT and incorporates additional training strategies in CLO to enhance efficiency.

[183] JOLT-SQL: Joint Loss Tuning of Text-to-SQL with Confusion-aware Noisy Schema Sampling

Jinwang Song,Hongying Zan,Kunli Zhang,Lingling Mu,Yingjie Han,Haobo Hua,Min Peng

Main category: cs.CL

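TL;DR: JOLT-SQL is a single-stage supervised fine-tuning framework that jointly optimizes schema linking and SQL generation via a unified loss, addressing the complex pipelines and poor noisy-schema robustness of earlier approaches.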

Details

Motivation: Supervised fine-tuning approaches to Text-to-SQL suffer from complex multi-stage pipelines and poor robustness to noisy schema information. Method: JOLT-SQL uses discriminative schema linking (enhanced by local bidirectional attention) and a noisy schema sampling strategy (with selective attention) to jointly optimize schema linking and SQL generation. Result: On the Spider and BIRD benchmarks, JOLT-SQL achieves state-of-the-art execution accuracy among comparable-size open-source models while significantly improving training and inference efficiency. Conclusion: Through its single-stage framework and noise-robustness strategy, JOLT-SQL significantly improves Text-to-SQL performance and efficiency. Abstract: Text-to-SQL, which maps natural language to SQL queries, has benefited greatly from recent advances in Large Language Models (LLMs). While LLMs offer various paradigms for this task, including prompting and supervised fine-tuning (SFT), SFT approaches still face challenges such as complex multi-stage pipelines and poor robustness to noisy schema information. To address these limitations, we present JOLT-SQL, a streamlined single-stage SFT framework that jointly optimizes schema linking and SQL generation via a unified loss. JOLT-SQL employs discriminative schema linking, enhanced by local bidirectional attention, alongside a confusion-aware noisy schema sampling strategy with selective attention to improve robustness under noisy schema conditions. Experiments on the Spider and BIRD benchmarks demonstrate that JOLT-SQL achieves state-of-the-art execution accuracy among comparable-size open-source models, while significantly improving both training and inference efficiency.

[184] Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency

Ehsan Doostmohammadi,Marco Kuhlmann

Main category: cs.CL

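TL;DR: The performance of retrieval-augmented language models depends heavily on the overlap between the query and the retrieved context, but the optimal degree of overlap has not been established. Experiments show that above a critical threshold, more overlap substantially improves model performance and learning speed. Generating synthetic context (e.g., by paraphrasing queries) improves data efficiency and cuts training time by about 40%.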

Details

Motivation: Explore how the degree of overlap between query and retrieved context affects retrieval-augmented language models, in order to improve training and inference efficiency. Method: Systematically studies the effect of different overlap levels on training and inference, and tests synthetic context generated by paraphrasing queries. Result: Above a critical threshold, more overlap substantially improves model performance and learning speed; synthetic context reduces training time by about 40% without hurting performance. Conclusion: Retrieval mechanisms in language model pretraining have significant optimization potential, and synthetic context is an effective way to improve data efficiency. Abstract: Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between query and retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query-context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but substantially improves test-time perplexity and accelerates model learning above a critical threshold. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We specifically generate synthetic context through paraphrasing queries. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.

[185] HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing

Shamsuddeen Hassan Muhammad,Ibrahim Said Ahmad,Idris Abdulmumin,Falalu Ibrahim Lawan,Babangida Sani,Sukairaj Hafiz Imam,Yusuf Aliyu,Sani Abdullahi Sani,Ali Usman Umar,Kenneth Church,Vukosi Marivate

Main category: cs.CL

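TL;DR: Surveys the current state of Hausa NLP, introduces the HausaNLP resource catalog, and discusses the challenges of integrating Hausa into large language models along with future research directions.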

Details

Motivation: Although Hausa has a very large speaker population, its NLP research remains limited by scarce resources; this paper aims to fill that gap and drive progress. Method: Systematically examines existing resources, research contributions, and gaps in Hausa NLP, and introduces the HausaNLP catalog to consolidate resources. Result: Presents the HausaNLP catalog, summarizes the challenges facing Hausa NLP, and proposes future research directions. Conclusion: The paper lays a foundation for Hausa NLP and offers insights for multilingual NLP research. Abstract: Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet remains understudied as a low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges, including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP (https://catalog.hausanlp.org), a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.

[186] A MIND for Reasoning: Meta-learning for In-context Deduction

Leonardo Bertolazzi,Manuel Vargas Guzmán,Raffaella Bernardi,Maciej Malicki,Jakub Szymanik

Main category: cs.CL

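TL;DR: Proposes MIND, a meta-learning fine-tuning method that improves the generalization of small language models on deductive reasoning so that they perform better on unseen knowledge bases.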

Details

Motivation: Large language models (LLMs) are increasingly evaluated on formal tasks, yet their generalization to out-of-distribution problems remains limited. The paper studies how LLMs can acquire a systematic understanding of deductive rules, focusing on selecting the subset of premises in a knowledge base needed to derive a hypothesis. Method: Proposes Meta-learning for In-context Deduction (MIND), a few-shot meta-learning fine-tuning approach that aims to improve generalization to unseen knowledge bases and the systematic application of inference rules. Result: MIND significantly improves generalization in small models with 1.5B to 7B parameters, especially in low-data settings. Small models fine-tuned with MIND even outperform state-of-the-art LLMs such as GPT-4o and o3-mini. Conclusion: MIND is an effective way to make small language models generalize on deductive reasoning tasks and shows promise in low-resource settings. Abstract: Large language models (LLMs) are increasingly evaluated on formal tasks, where strong reasoning abilities define the state of the art. However, their ability to generalize to out-of-distribution problems remains limited. In this paper, we investigate how LLMs can achieve a systematic understanding of deductive rules. Our focus is on the task of identifying the appropriate subset of premises within a knowledge base needed to derive a given hypothesis. To tackle this challenge, we propose Meta-learning for In-context Deduction (MIND), a novel few-shot meta-learning fine-tuning approach. The goal of MIND is to enable models to generalize more effectively to unseen knowledge bases and to systematically apply inference rules. Our results show that MIND significantly improves generalization in small LMs ranging from 1.5B to 7B parameters. The benefits are especially pronounced in smaller models and low-data settings. Remarkably, small models fine-tuned with MIND outperform state-of-the-art LLMs, such as GPT-4o and o3-mini, on this task.

[187] QA-prompting: Improving Summarization with Large Language Models using Question-Answering

Neelabh Sinha

Main category: cs.CL

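TL;DR: QA-prompting improves long-text summarization by adding a question-answering step before summary generation; it needs no fine-tuning or complex pipelines and improves ROUGE scores by up to 29%.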

Details

Motivation: Language models extract critical information poorly in long-text summarization because of positional biases. Method: Proposes QA-prompting, which uses a question-answering step to extract key information and enrich the context before generating the summary. Result: Across multiple datasets and models, QA-prompting improves ROUGE scores by up to 29% over baseline methods. Conclusion: QA-prompting is an effective, scalable summarization method, and domain-specific question selection is important for optimal performance. Abstract: Language Models (LMs) have revolutionized natural language processing, enabling high-quality text generation through prompting and in-context learning. However, models often struggle with long-context summarization due to positional biases, leading to suboptimal extraction of critical information. Existing remedies rely on fine-tuning, pipelining, or other complex techniques, each of which has its own challenges. To solve these challenges, we propose QA-prompting, a simple prompting method for summarization that utilizes question-answering as an intermediate step prior to summary generation. Our method extracts key information and enriches the context of text to mitigate positional biases and improve summarization in a single LM call per task without requiring fine-tuning or pipelining. Experiments on multiple datasets belonging to different domains using ten state-of-the-art pre-trained models demonstrate that QA-prompting outperforms baseline and other state-of-the-art methods, achieving up to 29% improvement in ROUGE scores. This provides an effective and scalable solution for summarization and highlights the importance of domain-specific question selection for optimal performance.
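The idea of answering questions before summarizing, all within one LM call, can be sketched as a prompt builder. The abstract does not give the actual prompt wording or question set, so the template text, the function name `build_qa_prompt`, and the sample questions below are hypothetical and only illustrate the general shape of such a prompt.

```python
def build_qa_prompt(document, questions):
    """Build one prompt that asks for answers first and a summary second.

    The wording of this template is an assumption for illustration;
    it is not the prompt used in the paper.
    """
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Read the document below.\n\n"
        f"Document:\n{document}\n\n"
        "First, answer each question briefly using only the document:\n"
        f"{numbered}\n\n"
        "Then, using your answers as key points, write a concise summary.\n"
        "Format:\nAnswers:\n<numbered answers>\nSummary:\n<summary>"
    )


# Hypothetical domain-specific questions for a news article.
news_questions = [
    "Who is involved?",
    "What happened?",
    "Where and when did it happen?",
]
prompt = build_qa_prompt("(article text goes here)", news_questions)
print(prompt)
```

Because the answers and the summary are requested in a single prompt, the whole procedure stays at one LM call per document; swapping in a different question list is how the domain-specific question selection mentioned in the abstract would be expressed in this sketch.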

[188] OSoRA: Output-Dimension and Singular-Value Initialized Low-Rank Adaptation

Jialong Han,Si Zhang,Ke Zhang

Main category: cs.CL

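TL;DR: OSoRA is a new parameter-efficient fine-tuning method that combines SVD with learnable scaling vectors, substantially reducing computational resource requirements while matching or exceeding existing methods.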

Details

Motivation: Because large language models (LLMs) are massive and computationally expensive, conventional fine-tuning is challenging and more efficient alternatives are needed. Method: OSoRA extends LoRA by performing an SVD of the pre-trained weight matrices and optimizing an output-dimension vector while keeping the singular vector matrices frozen. Result: On mathematical reasoning, common sense reasoning, and other tasks, OSoRA matches or outperforms LoRA and VeRA, with a parameter count that scales linearly. Conclusion: Jointly training the singular values and the output-dimension vector is critical for performance, and OSoRA is a viable option for efficient fine-tuning. Abstract: Fine-tuning Large Language Models (LLMs) has become increasingly challenging due to their massive scale and associated computational costs. Parameter-Efficient Fine-Tuning (PEFT) methodologies have been proposed as computational alternatives; however, their implementations still require significant resources. In this paper, we present OSoRA (Output-Dimension and Singular-Value Initialized Low-Rank Adaptation), a novel PEFT method for LLMs. OSoRA extends Low-Rank Adaptation (LoRA) by integrating Singular Value Decomposition (SVD) with learnable scaling vectors in a unified framework. It first performs an SVD of pre-trained weight matrices, then optimizes an output-dimension vector during training, while keeping the corresponding singular vector matrices frozen. OSoRA substantially reduces computational resource requirements by minimizing the number of trainable parameters during fine-tuning. Comprehensive evaluations across mathematical reasoning, common sense reasoning, and other benchmarks demonstrate that OSoRA achieves comparable or superior performance to state-of-the-art methods like LoRA and VeRA, while maintaining a linear parameter scaling even as the rank increases to higher dimensions. Our ablation studies further confirm that jointly training both the singular values and the output-dimension vector is critical for optimal performance.
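A back-of-the-envelope parameter count helps make the "linear parameter scaling" claim concrete. The LoRA count below is the standard one for a rank-r adapter on a d_out x d_in weight. The OSoRA count is an assumption read off the abstract (r trainable singular values plus one trainable output-dimension vector of length d_out, with the singular vector matrices frozen); the paper may count parameters differently, and the layer size is a hypothetical example.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters of a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def osora_trainable_params(d_out, rank):
    """ASSUMED count for OSoRA: `rank` singular values plus a length-d_out vector.

    The frozen singular vector matrices contribute no trainable parameters.
    """
    return rank + d_out


# Hypothetical 4096 x 4096 projection layer.
d_in, d_out = 4096, 4096
for rank in (8, 64, 256):
    lora = lora_trainable_params(d_in, d_out, rank)
    osora = osora_trainable_params(d_out, rank)
    print(f"rank={rank:>3}  LoRA={lora:>9,}  OSoRA(assumed)={osora:>6,}")
```

Under this assumption, raising the rank adds only one trainable scalar per extra rank to OSoRA, whereas each extra rank adds d_in + d_out parameters to LoRA.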

[189] WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications

Xin Li,Mengbing Liu,Li Wei,Jiancheng An,Mérouane Debbah,Chau Yuen

Main category: cs.CL

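TL;DR: Introduces WirelessMathBench, a benchmark for evaluating large language models on mathematical modeling in wireless communications, and finds that current models perform poorly on complex tasks.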

Details

Motivation: Investigate the shortcomings of large language models in complex mathematical reasoning for wireless communications. Method: Builds WirelessMathBench, a set of 587 questions covering diverse tasks, and tests several leading models on it. Result: Models do well on basic tasks but poorly on complex equation completion; the best model reaches an average accuracy of only 38.05%. Conclusion: Publicly releasing the benchmark and toolkit is intended to advance more robust, domain-aware large language models. Abstract: Large Language Models (LLMs) have achieved impressive results across a broad array of tasks, yet their capacity for complex, domain-specific mathematical reasoning, particularly in wireless communications, remains underexplored. In this work, we introduce WirelessMathBench, a novel benchmark specifically designed to evaluate LLMs on mathematical modeling challenges in wireless communications engineering. Our benchmark consists of 587 meticulously curated questions sourced from 40 state-of-the-art research papers, encompassing a diverse spectrum of tasks ranging from basic multiple-choice questions to complex equation completion tasks, including both partial and full completions, all of which rigorously adhere to physical and dimensional constraints. Through extensive experimentation with leading LLMs, we observe that while many models excel in basic recall tasks, their performance degrades significantly when reconstructing partially or fully obscured equations, exposing fundamental limitations in current LLMs. Even DeepSeek-R1, the best performer on our benchmark, achieves an average accuracy of only 38.05%, with a mere 7.83% success rate in full equation completion. By publicly releasing WirelessMathBench along with the evaluation toolkit, we aim to advance the development of more robust, domain-aware LLMs for wireless system analysis and broader engineering applications.

[190] Dual Decomposition of Weights and Singular Value Low Rank Adaptation

Jialong Han,Si Zhang,Ke Zhang

Main category: cs.CL

TL;DR: DuDe is an SVD-based PEFT method that addresses LoRA's unstable training and inefficient knowledge transfer, significantly improving performance.

Details

Motivation: Existing LoRA methods suffer from unstable training and inefficient knowledge transfer because of random initialization. Method: Decomposes weight matrices into magnitude and direction components and uses SVD for principled initialization. Result: Achieves up to 48.35% accuracy on MMLU and 62.53% on GSM8K. Conclusion: DuDe's decomposition strategy improves optimization stability and preserves pre-trained representations, making it a notable contribution to PEFT. Abstract: Parameter-Efficient Fine-Tuning (PEFT) has emerged as a critical paradigm for adapting Large Language Models (LLMs) to downstream tasks, among which Low-rank Adaptation (LoRA) represents one of the most widely adopted methodologies. However, existing LoRA-based approaches exhibit two fundamental limitations: unstable training dynamics and inefficient knowledge transfer from pre-trained models, both stemming from random initialization of adapter parameters. To overcome these challenges, we propose DuDe, a novel approach that decomposes weight matrices into magnitude and direction components, employing Singular Value Decomposition (SVD) for principled initialization. Our comprehensive evaluation demonstrates DuDe's superior performance and robustness, achieving up to 48.35% accuracy on MMLU and 62.53% (± 1.59) accuracy on GSM8K. Our theoretical analysis and empirical validation collectively demonstrate that DuDe's decomposition strategy enhances optimization stability and better preserves pre-trained representations, particularly for domain-specific tasks requiring specialized knowledge. The combination of robust empirical performance and rigorous theoretical foundations establishes DuDe as a significant contribution to PEFT methodologies for LLMs.

[191] AutoRev: Automatic Peer Review System for Academic Research Papers

Maitreya Prafulla Chitale,Ketaki Mangesh Shetye,Harshit Gupta,Manav Chaudhary,Vasudeva Varma

Main category: cs.CL

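TL;DR: AutoRev is a graph-based automatic peer review system for academic papers that generates reviews from extracted key passages, outperforming existing methods by an average of 58.72%.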

Details

Motivation: Existing methods rely on large language models but overlook the computational and performance limits imposed by long input token lengths. Method: Represents an academic paper as a graph and extracts the most critical passages for review generation. Result: On review generation, performance improves by an average of 58.72%. Conclusion: Graph-based extraction may carry over to other downstream NLP tasks; the code will be released. Abstract: Generating a review for an academic research paper is a complex task that requires a deep understanding of the document's content and the interdependencies between its sections. It demands not only insight into technical details but also an appreciation of the paper's overall coherence and structure. Recent methods have predominantly focused on fine-tuning large language models (LLMs) to address this challenge. However, they often overlook the computational and performance limitations imposed by long input token lengths. To address this, we introduce AutoRev, an Automatic Peer Review System for Academic Research Papers. Our novel framework represents an academic document as a graph, enabling the extraction of the most critical passages that contribute significantly to the review. This graph-based approach demonstrates effectiveness for review generation and is potentially adaptable to various downstream tasks, such as question answering, summarization, and document representation. When applied to review generation, our method outperforms SOTA baselines by an average of 58.72% across all evaluation metrics. We hope that our work will stimulate further research in applying graph-based extraction techniques to other downstream tasks in NLP. We plan to make our code public upon acceptance.

[192] Editing Across Languages: A Survey of Multilingual Knowledge Editing

Nadir Durrani,Basel Mousi,Fahim Dalvi

Main category: cs.CL

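TL;DR: Systematizes research on Multilingual Knowledge Editing (MKE), summarizing methods, benchmarks, and challenges to lay the groundwork for editable, language-aware large models.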

Details

Motivation: Knowledge editing has been studied extensively in monolingual settings but remains underexplored in multilingual contexts; this survey aims to fill that gap. Method: Presents a comprehensive taxonomy of MKE methods, covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches, and surveys existing benchmarks. Result: Summarizes key findings on method effectiveness and transfer patterns and identifies challenges in cross-lingual propagation. Conclusion: The survey consolidates the rapidly evolving MKE area and provides a basis for future research on editable, language-aware large models. Abstract: While Knowledge Editing has been extensively studied in monolingual settings, it remains underexplored in multilingual contexts. This survey systematizes recent research on Multilingual Knowledge Editing (MKE), a growing subdomain of model editing focused on ensuring factual edits generalize reliably across languages. We present a comprehensive taxonomy of MKE methods, covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches. We survey available benchmarks, summarize key findings on method effectiveness and transfer patterns, identify challenges in cross-lingual propagation, and highlight open problems related to language anisotropy, evaluation coverage, and edit scalability. Our analysis consolidates a rapidly evolving area and lays the groundwork for future progress in editable language-aware LLMs.

[193] MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language

Seyoung Song,Seogyeong Jeong,Eunsu Kim,Jiho Jin,Dongkwan Kim,Jay Shin,Alice Oh

Main category: cs.CL

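TL;DR: MUG-Eval is a new framework for evaluating the multilingual generation capabilities of large language models; it converts existing benchmarks into conversational tasks and measures task success rate.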

Details

Motivation: Evaluating the text generation capabilities of large language models in low-resource languages is difficult because direct assessment methods are scarce. Method: Converts existing benchmarks into conversational tasks and uses task success rate as a proxy for generation ability, avoiding reliance on language-specific tools or annotated data. Result: Evaluating 8 models across 30 languages, MUG-Eval correlates strongly with established benchmarks (r > 0.75) and supports standardized comparisons across languages and models. Conclusion: MUG-Eval offers an efficient, scalable solution for multilingual generation evaluation. Abstract: Evaluating text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs' multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs' accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. Then, we simply use task success rate as a proxy of successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks (r > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.

[194] Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation

Peter Baile Chen,Yi Zhang,Dan Roth,Samuel Madden,Jacob Andreas,Michael Cafarella

Main category: cs.CL

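TL;DR: Proposes log-augmented generation (LAG), a framework that directly reuses computation and reasoning from past logs to improve large language models (LLMs) on new tasks while remaining efficient and scalable.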

Details

Motivation: Humans learn and adapt from past experience, whereas large language models and their agentic counterparts struggle to retain and reuse earlier reasoning. The authors propose the LAG framework to address this. Method: LAG represents task logs with key-value (KV) caches, storing KV caches for only a selected subset of tokens. For a new task, the system retrieves the KV values of relevant logs to augment generation, directly reusing past reasoning and computation. Result: Experiments show that LAG significantly outperforms standard agentic systems that do not use logs, as well as existing solutions based on reflection and KV cache techniques, on knowledge- and reasoning-intensive datasets. Conclusion: By directly reusing past reasoning and computation, LAG significantly improves performance on new tasks while staying efficient and scalable. Abstract: While humans naturally learn and adapt from past experiences, large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks and apply it in future contexts. To address this limitation, we propose a novel framework, log-augmented generation (LAG), that directly reuses prior computation and reasoning from past logs at test time to enhance the model's ability to learn from previous tasks and perform better on new, unseen challenges, all while keeping the system efficient and scalable. Specifically, our system represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for only a selected subset of tokens. When a new task arises, LAG retrieves the KV values from relevant logs to augment generation. Our approach differs from reflection-based memory mechanisms by directly reusing prior reasoning and computations without requiring additional steps for knowledge extraction or distillation. Our method also goes beyond existing KV caching techniques, which primarily target efficiency gains rather than improving accuracy. Experiments on knowledge- and reasoning-intensive datasets demonstrate that our method significantly outperforms standard agentic systems that do not utilize logs, as well as existing solutions based on reflection and KV cache techniques.

[195] Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis

Haoming Huang,Yibo Yan,Jiahao Huo,Xin Zou,Xinfeng Li,Kun Wang,Xuming Hu

Main category: cs.CL

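TL;DR: The PhantomCircuit framework uses knowledge circuit analysis to reveal the origins and mechanisms of knowledge overshadowing in LLMs and provides a way to detect it.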

Details

Motivation: Large language models (LLMs) suffer from knowledge overshadowing, and current research lacks a deep understanding of its internal mechanisms during training. Method: Proposes the PhantomCircuit framework, which uses knowledge circuit analysis to examine attention heads and trace competing knowledge pathways and their evolution. Result: Experiments show that PhantomCircuit effectively identifies knowledge overshadowing and gives researchers a new method for studying it. Conclusion: PhantomCircuit offers a new perspective and method for understanding and mitigating knowledge overshadowing. Abstract: Large Language Models (LLMs), despite their remarkable capabilities, are hampered by hallucinations. A particularly challenging variant, knowledge overshadowing, occurs when one piece of activated knowledge inadvertently masks another relevant piece, leading to erroneous outputs even with high-quality training data. Current understanding of overshadowing is largely confined to inference-time observations, lacking deep insights into its origins and internal mechanisms during model training. Therefore, we introduce PhantomCircuit, a novel framework designed to comprehensively analyze and detect knowledge overshadowing. By innovatively employing knowledge circuit analysis, PhantomCircuit dissects the internal workings of attention heads, tracing how competing knowledge pathways contribute to the overshadowing phenomenon and its evolution throughout the training process. Extensive experiments demonstrate PhantomCircuit's effectiveness in identifying such instances, offering novel insights into this elusive hallucination and providing the research community with a new methodological lens for its potential mitigation.

[196] Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents

Pengzhou Cheng,Haowen Hu,Zheng Wu,Zongru Wu,Tianjie Ju,Daizong Ding,Zhuosheng Zhang,Gongshen Liu

Main category: cs.CL

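TL;DR: Proposes AgentGhost, a framework for stealthy backdoor attacks on GUI agents powered by multimodal large language models (MLLMs), and demonstrates its effectiveness and generality experimentally.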

Details

Motivation: GUI agents often rely on open-source models or APIs, which exposes them to a supply chain threat from backdoor attacks that has received little study. The paper aims to reveal and exploit interaction-level triggers in GUI agents to design an effective, stealthy backdoor attack framework. Method: Builds composite triggers by combining goal-level and interaction-level triggers, formulates backdoor injection as a Min-Max optimization problem, and uses supervised contrastive learning and supervised fine-tuning to improve the backdoor's flexibility and effectiveness. Result: Across several agent models and mobile benchmarks, AgentGhost reaches 99.7% attack accuracy with only 1% utility degradation. The paper also proposes a defense that reduces attack accuracy to 22.1%. Conclusion: AgentGhost demonstrates the backdoor risk facing GUI agents and proposes an effective defense, providing a reference for future research. Abstract: Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have shown great promise for human interaction. However, due to the high fine-tuning cost, users often rely on open-source GUI agents or APIs offered by AI providers, which introduces a critical but underexplored supply chain threat: backdoor attacks. In this work, we first unveil that MLLM-powered GUI agents naturally expose multiple interaction-level triggers, such as historical steps, environment states, and task progress. Based on this observation, we introduce AgentGhost, an effective and stealthy framework for red-teaming backdoor attacks. Specifically, we first construct composite triggers by combining goal and interaction levels, allowing GUI agents to unintentionally activate backdoors while ensuring task utility. Then, we formulate backdoor injection as a Min-Max optimization problem that uses supervised contrastive learning to maximize the feature difference across sample classes at the representation space, improving flexibility of the backdoor. Meanwhile, it adopts supervised fine-tuning to minimize the discrepancy between backdoor and clean behavior generation, enhancing effectiveness and utility. Extensive evaluations of various agent models in two established mobile benchmarks show that AgentGhost is effective and generic, with attack accuracy that reaches 99.7% on three attack objectives, and shows stealthiness with only 1% utility degradation. Furthermore, we tailor a defense method against AgentGhost that reduces the attack accuracy to 22.1%. Our code is available at anonymous.

[197] Scaling Low-Resource MT via Synthetic Data Generation with LLMs

Ona de Gibert,Joseph Attieh,Teemu Vahtola,Mikko Aulamo,Zihao Li,Raúl Vázquez,Tiancheng Hu,Jörg Tiedemann

Main category: cs.CL

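TL;DR: LLM-generated synthetic data can substantially improve low-resource machine translation; the authors build a multilingual synthetic corpus and verify its quality.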

Details

Motivation: Study the potential of LLM-generated synthetic data for low-resource machine translation, where data is scarce. Method: Builds a document-level synthetic corpus, verifies its quality through automatic and human evaluation, and compares it with the HPLT dataset. Result: Synthetic data substantially improves translation performance even when noisy; the SynOPUS public repository is released. Conclusion: LLM-generated synthetic data is an effective solution for low-resource machine translation. Abstract: We investigate the potential of LLM-generated synthetic data for improving low-resource machine translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its high overall quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.

[198] From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLMs for Spatial Reasoning

Chalamalasetti Kranti,Sherzod Hakimov,David Schlangen

Main category: cs.CL

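TL;DR: Instruction-tuned large language models perform well on spatial grounding tasks but struggle to generalize from synthetic instructions to human-written ones.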

Details

Motivation: Study how well models generalize from synthetic to human-written instructions in spatial grounding tasks. Method: Fine-tunes large language models on synthetic instructions only and evaluates them on a benchmark dataset containing both synthetic and human-written instructions. Result: Models generalize well on simple tasks, but performance degrades significantly on complex tasks. Conclusion: An error analysis exposes the gaps in instruction generalization. Abstract: Instruction-tuned large language models (LLMs) have shown strong performance on a variety of tasks; however, generalizing from synthetic to human-authored instructions in grounded environments remains a challenge for them. In this work, we study generalization challenges in spatial grounding tasks where models interpret and translate instructions for building object arrangements on a 2.5D grid. We fine-tune LLMs using only synthetic instructions and evaluate their performance on a benchmark dataset containing both synthetic and human-written instructions. Our results reveal that while models generalize well on simple tasks, their performance degrades significantly on more complex tasks. We present a detailed error analysis of the gaps in instruction generalization.

[199] Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models

Yuqiao Tan,Shizhu He,Kang Liu,Jun Zhao

Main category: cs.CL

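TL;DR: Examines the challenges of parametric knowledge transfer (PKT) across large language models (LLMs) of different scales, proposes the Pre-Align PKT (PrePKT) and Post-Align PKT (PostPKT) paradigms, and identifies neural incompatibility as the main obstacle.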

Details

Motivation: Traditional knowledge transfer paradigms rooted in symbolic language fall short of genuine parametric knowledge transfer (PKT); finding effective ways to transfer knowledge through parameters across LLMs of different scales is a key research direction. Method: Proposes the PrePKT paradigm and the LaTen method, which aligns parametric spaces with a few training steps and avoids subsequent fine-tuning; also redefines earlier work as PostPKT. Result: Experiments show that neither PostPKT nor PrePKT achieves consistently stable transfer, and that neural incompatibility is the fundamental challenge. Conclusion: The study offers new insights into the parametric architectures of LLMs and points to directions for future research on efficient PKT. Abstract: Large Language Models (LLMs) offer a transparent brain with accessible parameters that encode extensive knowledge, which can be analyzed, located and transferred. Consequently, a key research challenge is to transcend traditional knowledge transfer paradigms rooted in symbolic language and achieve genuine Parametric Knowledge Transfer (PKT). Significantly, exploring effective methods for transferring knowledge across LLMs of different scales through parameters presents an intriguing and valuable research direction. In this paper, we first demonstrate that Alignment in parametric space is the fundamental prerequisite to achieve successful cross-scale PKT. We redefine the previously explored knowledge transfer as Post-Align PKT (PostPKT), which utilizes extracted parameters for LoRA initialization and requires subsequent fine-tuning for alignment. Hence, to reduce cost for further fine-tuning, we introduce a novel Pre-Align PKT (PrePKT) paradigm and propose a solution called LaTen (Locate-Then-Align) that aligns the parametric spaces of LLMs across scales only using several training steps without following training. Comprehensive experiments on four benchmarks demonstrate that both PostPKT and PrePKT face challenges in achieving consistently stable transfer. Through in-depth analysis, we identify Neural Incompatibility as the ethological and parametric structural differences between LLMs of varying scales, presenting fundamental challenges to achieving effective PKT. These findings provide fresh insights into the parametric architectures of LLMs and highlight promising directions for future research on efficient PKT.
Our code is available at https://github.com/Trae1ounG/Neural_Incompatibility.

[200] Creative Preference Optimization

Mete Ismayilzada,Antonio Laverghetta Jr.,Simone A. Luchini,Reet Patel,Antoine Bosselut,Lonneke van der Plas,Roger Beaty

Main category: cs.CL

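TL;DR: The paper proposes Creative Preference Optimization (CrPO), a method that optimizes LLM generations with signals from multiple creativity dimensions, improving novelty, diversity, and surprise while maintaining high quality.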

Details

Motivation: Existing methods for improving LLM creativity tend to focus on a single dimension or task and do not address the multifaceted nature of creativity. Method: Proposes CrPO, which injects signals from multiple creativity dimensions into the preference optimization objective in a modular fashion, and trains and evaluates with MuCE, a new large-scale human preference dataset. Result: The optimized models outperform baselines including GPT-4o on both automated and human evaluations, producing more novel, diverse, and surprising generations without a drop in quality. Conclusion: Optimizing for creativity directly within preference frameworks is an effective way to improve the creative capabilities of LLMs without compromising output quality. Abstract: While Large Language Models (LLMs) have demonstrated impressive performance across natural language generation tasks, their ability to generate truly creative content, characterized by novelty, diversity, surprise, and quality, remains limited. Existing methods for enhancing LLM creativity often focus narrowly on diversity or specific tasks, failing to address creativity's multifaceted nature in a generalizable way. In this work, we propose Creative Preference Optimization (CrPO), a novel alignment method that injects signals from multiple creativity dimensions into the preference optimization objective in a modular fashion. We train and evaluate creativity-augmented versions of several models using CrPO and MuCE, a new large-scale human preference dataset spanning over 200,000 human-generated responses and ratings from more than 30 psychological creativity assessments. Our models outperform strong baselines, including GPT-4o, on both automated and human evaluations, producing more novel, diverse, and surprising generations while maintaining high output quality. Additional evaluations on NoveltyBench further confirm the generalizability of our approach. Together, our results demonstrate that directly optimizing for creativity within preference frameworks is a promising direction for advancing the creative capabilities of LLMs without compromising output quality.

[201] CtrlDiff: Boosting Large Diffusion Language Models with Dynamic Block Prediction and Controllable Generation

Chihan Huang,Hao Tang

Main category: cs.CL

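TL;DR: CtrlDiff is a dynamic, controllable semi-autoregressive framework that combines discrete diffusion with autoregressive modeling to address fixed-length generation and weak controllability.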

Details

Motivation: Current diffusion language models are limited to fixed-length generation and lack flexible control mechanisms, which restricts their practical use. Method: Combines autoregressive and discrete diffusion modeling, uses reinforcement learning to determine the size of each generation block dynamically, and introduces a classifier-guided control mechanism. Result: CtrlDiff performs strongly among hybrid diffusion models, narrows the performance gap to autoregressive models, and supports conditional text generation across diverse tasks. Conclusion: CtrlDiff offers a dynamic and controllable approach for diffusion language models, improving generation flexibility and efficiency. Abstract: Although autoregressive models have dominated language modeling in recent years, there has been a growing interest in exploring alternative paradigms to the conventional next-token prediction framework. Diffusion-based language models have emerged as a compelling alternative due to their powerful parallel generation capabilities and inherent editability. However, these models are often constrained by fixed-length generation. A promising direction is to combine the strengths of both paradigms, segmenting sequences into blocks, modeling autoregressive dependencies across blocks while leveraging discrete diffusion to estimate the conditional distribution within each block given the preceding context. Nevertheless, their practical application is often hindered by two key limitations: rigid fixed-length outputs and a lack of flexible control mechanisms. In this work, we address the critical limitations of fixed granularity and weak controllability in current large diffusion language models. We propose CtrlDiff, a dynamic and controllable semi-autoregressive framework that adaptively determines the size of each generation block based on local semantics using reinforcement learning. Furthermore, we introduce a classifier-guided control mechanism tailored to discrete diffusion, which significantly reduces computational overhead while facilitating efficient post-hoc conditioning without retraining. Extensive experiments demonstrate that CtrlDiff sets a new standard among hybrid diffusion models, narrows the performance gap to state-of-the-art autoregressive approaches, and enables effective conditional text generation across diverse tasks.

[202] Not All Correct Answers Are Equal: Why Your Distillation Source Matters

Xiaoyu Tian,Yunjie Ji,Haotian Wang,Shuaiting Chen,Sitong Zhao,Yiping Peng,Han Zhao,Xiangang Li

Main category: cs.CL

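TL;DR: Through a large-scale empirical study, this paper analyzes reasoning-data distillation from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on 1.89 million queries; data distilled from AM-Thinking-v1 shows greater token-length diversity, and student models trained on it perform best on multiple benchmarks.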

Details

Motivation: To improve the reasoning capabilities of open-source language models and to examine the value of distilling high-quality reasoning data. Method: Collects verified outputs from three teacher models, builds three parallel datasets, trains student models on each, and evaluates them on several reasoning benchmarks. Result: The model distilled from AM-Thinking-v1 performs best on AIME2024, AIME2025, MATH500, and LiveCodeBench, and adapts its output length to task difficulty. Conclusion: High-quality, verified reasoning data is central to improving the reasoning capabilities of language models; the datasets have been released to support future research. Abstract: Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The AM-based model consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).

[203] Void in Language Models

Mani Shemiranifar

Main category: cs.CL

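TL;DR: The study finds that not all layers of a Transformer language model are activated during inference; the L2 Adaptive Computation (LAC) method identifies unactivated layers (Voids), and selectively skipping them can improve model performance.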

Details

Motivation: To determine whether all layers of a Transformer language model are activated during inference, and whether identifying unactivated layers can improve efficiency and performance. Method: Uses L2 Adaptive Computation (LAC) to monitor changes in the L2-norm of activations, identifies unactivated layers, and analyzes layer activation in instruction-tuned models during Prompt Processing (PP) and Response Generation (RG). Result: Skipping unactivated layers can improve performance; for example, Qwen2.5-7B-Instruct gains 2.05 points on MMLU (69.24 to 71.29) while using only 30% of its layers. Conclusion: Selectively skipping unactivated layers can improve performance on certain tasks and suggests a new direction for more efficient inference. Abstract: Despite advances in transformer-based language models (LMs), a fundamental question remains largely unanswered: Are all layers activated during inference? We investigate this question by detecting unactivated layers (which we refer to as Voids) using a non-trainable and parameter-free adaptive computation method called L2 Adaptive Computation (LAC). We adapt LAC from its original efficiency-focused application to trace activated layers during inference. This method monitors changes in the L2-norm of activations to identify voids. We analyze layer activation in instruction-tuned LMs across two phases: Prompt Processing (PP), where we trace activated layers for each token in the input prompts, and Response Generation (RG), where we trace activated layers for each generated token. We further demonstrate that distinct layers are activated during these two phases. To show the effectiveness of our method, we evaluated three distinct instruction-tuned LMs from the Llama, Mistral, and Qwen families on three benchmarks: MMLU, GPQA Diamond, and BoolQ. For example, on MMLU with a zero-shot setting, skipping voids in Qwen2.5-7B-Instruct resulted in an improvement from 69.24 to 71.29 while the model uses only 30% of the layers. Similarly, Mistral-7B-Instruct-v0.3 on GPQA Diamond improved from 13.88 to 18.36 when using 70% of the layers during both the PP and RG phases. These results show that not all layers contribute equally during inference, and that selectively skipping most of them can improve the performance of models on certain tasks.
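
The idea of flagging layers by watching the L2-norm of activations can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give LAC's decision rule, so the relative-change test, the `threshold` value, and the function names `l2_norm` and `find_voids` are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def find_voids(hidden_states, threshold=0.05):
    """Return indices of layers whose output norm barely changed.

    hidden_states[0] is the input to the first layer and hidden_states[i]
    is the activation after layer i, all for a single token.
    ASSUMPTION: a layer counts as a "void" when the relative change in
    L2-norm is below `threshold`; the paper's actual rule may differ.
    """
    voids = []
    for layer in range(1, len(hidden_states)):
        prev_norm = l2_norm(hidden_states[layer - 1])
        curr_norm = l2_norm(hidden_states[layer])
        rel_change = abs(curr_norm - prev_norm) / max(prev_norm, 1e-12)
        if rel_change < threshold:
            voids.append(layer)
    return voids


# Toy activations for one token across three layers.
states = [[1.0, 0.0], [1.0, 0.01], [3.0, 4.0], [3.0, 4.01]]
print(find_voids(states))
```

In this toy input, layers 1 and 3 leave the norm almost unchanged and are flagged, while layer 2 changes it substantially and is kept.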

[204] Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations

Somnath Banerjee,Pratyush Chatterjee,Shanu Kumar,Sayan Layek,Parag Agrawal,Rima Hazra,Animesh Mukherjee

Main category: cs.CL

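TL;DR: The study finds that LLMs are more likely to produce unsafe outputs for code-mixed inputs than for monolingual English inputs. Using explainability methods, it exposes the internal attribution shifts involved and distinguishes universally unsafe queries from culturally specific ones.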

Details

Motivation: To examine the safety of LLMs under code-mixed inputs and uncover the internal mechanisms involved. Method: Applies explainability methods to analyze attribution shifts in the model and distinguishes universally unsafe queries from culturally specific unsafe ones. Result: Code-mixed inputs markedly increase the tendency of LLMs to produce unsafe outputs, and the analysis exposes the underlying mechanisms. Conclusion: The study offers new insights into LLM safety and highlights the risks of code-mixed inputs. Abstract: Recent advancements in LLMs have raised significant safety concerns, particularly when dealing with code-mixed inputs and outputs. Our study systematically investigates the increased susceptibility of LLMs to produce unsafe outputs from code-mixed prompts compared to monolingual English prompts. Utilizing explainability methods, we dissect the internal attribution shifts causing the model's harmful behaviors. In addition, we explore cultural dimensions by distinguishing between universally unsafe and culturally specific unsafe queries. This paper presents novel experimental insights, clarifying the mechanisms driving this phenomenon.

[205] Adapting Pretrained Language Models for Citation Classification via Self-Supervised Contrastive Learning

Tong Li,Jiachuan Wang,Yongqi Zhang,Shuangyin Li,Lei Chen

Main category: cs.CL

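TL;DR: Citss is a new framework that uses self-supervised contrastive learning and two specialized strategies to address data scarcity and noise in citation classification; it is compatible with both encoder-based and decoder-based models and performs strongly in experiments.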

Details

Motivation: Citation classification is central to scholarly analysis, but directly fine-tuning pretrained language models runs into labeled-data scarcity, contextual noise, and spurious keyphrase correlations. Method: Citss introduces self-supervised contrastive learning with sentence-level cropping and keyphrase perturbation, and is compatible with both encoder-based and decoder-based models. Result: Citss outperforms existing methods on three benchmark datasets. Conclusion: Citss addresses the main challenges in citation classification and demonstrates both compatibility and performance advantages. Abstract: Citation classification, which identifies the intention behind academic citations, is pivotal for scholarly analysis. Previous works suggest fine-tuning pretrained language models (PLMs) on citation classification datasets, reaping the reward of the linguistic knowledge they gained during pretraining. However, directly fine-tuning for citation classification is challenging due to labeled data scarcity, contextual noise, and spurious keyphrase correlations. In this paper, we present a novel framework, Citss, that adapts the PLMs to overcome these challenges. Citss introduces self-supervised contrastive learning to alleviate data scarcity, and is equipped with two specialized strategies to obtain the contrastive pairs: sentence-level cropping, which enhances focus on target citations within long contexts, and keyphrase perturbation, which mitigates reliance on specific keyphrases. Compared with previous works that are only designed for encoder-based PLMs, Citss is carefully developed to be compatible with both encoder-based PLMs and decoder-based LLMs, to embrace the benefits of enlarged pretraining. Experiments with three benchmark datasets with both encoder-based PLMs and decoder-based LLMs demonstrate our superiority compared to the previous state of the art. Our code is available at: github.com/LITONG99/Citss

[206] PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models

He Zhu,Junyou Su,Minxi Chen,Wen Wang,Yijie Deng,Guanhua Chen,Wenjia Zhang

Main category: cs.CL

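TL;DR: PlanGPT-VL is a vision-language model built specifically for urban planning maps; its data synthesis, verification, and training methods substantially improve map analysis.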

Details

Motivation: Existing vision-language models perform poorly on urban planning map analysis, so a specialized solution is needed. Method: Uses the PlanAnno-V framework, Critical Point Thinking, and a comprehensive training methodology. Result: PlanGPT-VL outperforms general-purpose models on the PlanBench-V benchmark and is parameter-efficient. Conclusion: PlanGPT-VL gives urban planners an efficient and reliable tool that combines domain expertise with accuracy. Abstract: In the field of urban planning, existing Vision-Language Models (VLMs) frequently fail to effectively analyze and evaluate planning maps, despite the critical importance of these visual elements for urban planners and related educational contexts. Planning maps, which visualize land use, infrastructure layouts, and functional zoning, require specialized understanding of spatial configurations, regulatory requirements, and multi-scale analysis. To address this challenge, we introduce PlanGPT-VL, the first domain-specific Vision-Language Model tailored specifically for urban planning maps. PlanGPT-VL employs three innovative approaches: (1) PlanAnno-V framework for high-quality VQA data synthesis, (2) Critical Point Thinking to reduce hallucinations through structured verification, and (3) comprehensive training methodology combining Supervised Fine-Tuning with frozen vision encoder parameters. Through systematic evaluation on our proposed PlanBench-V benchmark, we demonstrate that PlanGPT-VL significantly outperforms general-purpose state-of-the-art VLMs in specialized planning map interpretation tasks, offering urban planning professionals a reliable tool for map analysis, assessment, and educational applications while maintaining high factual accuracy. Our lightweight 7B parameter model achieves comparable performance to models exceeding 72B parameters, demonstrating efficient domain specialization without sacrificing performance.

[207] MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance

Agam Goyal,Xianyang Zhan,Yilun Chen,Koustuv Saha,Eshwar Chandrasekharan

Main category: cs.CL

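TL;DR: The MoMoE framework uses a modular design for cross-community content moderation, provides transparent decisions, and matches or surpasses strong fine-tuned baselines.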

Details

Motivation: Existing moderation approaches require a separate model per community and make opaque decisions, which limits real-world adoption. Method: MoMoE orchestrates four operators (Allocate, Predict, Aggregate, Explain) and is instantiated with two kinds of experts: community-specialized experts and norm-violation experts. Result: On 30 unseen subreddits, MoMoE matches or surpasses baseline models and produces concise, reliable explanations. Conclusion: MoMoE shows the potential of lightweight, explainable expert ensembles for trustworthy human-AI governance. Abstract: Large language models (LLMs) have shown great potential in flagging harmful content in online communities. Yet, existing approaches for moderation require a separate model for every community and are opaque in their decision-making, limiting real-world adoption. We introduce Mixture of Moderation Experts (MoMoE), a modular, cross-community framework that adds post-hoc explanations to scalable content moderation. MoMoE orchestrates four operators (Allocate, Predict, Aggregate, Explain) and is instantiated as seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1 scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned baselines while consistently producing concise and reliable explanations. Although community-specialized experts deliver the highest peak accuracy, norm-violation experts provide steadier performance across domains. These findings show that MoMoE yields scalable, transparent moderation without needing per-community fine-tuning. More broadly, they suggest that lightweight, explainable expert ensembles can guide future NLP and HCI research on trustworthy human-AI governance of online communities.
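
The four operators named in the abstract can be chained as in the toy pipeline below. Everything inside each operator is a hypothetical stand-in: the keyword-matching experts, the `affinity` weights, the `top_k` and `threshold` parameters, and the function names are assumptions for illustration, not the authors' implementation.

```python
def allocate(community, experts, top_k=2):
    """Allocate: pick the top_k experts by a (hypothetical) affinity score."""
    ranked = sorted(
        experts,
        key=lambda e: e["affinity"].get(community, 0.0),
        reverse=True,
    )
    return ranked[:top_k]


def predict(expert, comment):
    """Predict: a toy expert fires when the comment contains one of its keywords."""
    text = comment.lower()
    return 1.0 if any(word in text for word in expert["keywords"]) else 0.0


def aggregate(experts, scores, community):
    """Aggregate: affinity-weighted average of the expert scores."""
    weights = [e["affinity"].get(community, 0.0) for e in experts]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, scores)) / total


def explain(experts, scores, flagged):
    """Explain: name the experts that fired, as a post-hoc rationale."""
    fired = [e["name"] for e, s in zip(experts, scores) if s > 0]
    verdict = "flagged" if flagged else "allowed"
    return f"{verdict}; experts that fired: {', '.join(fired) or 'none'}"


def moderate(comment, community, experts, threshold=0.5):
    """Run Allocate -> Predict -> Aggregate -> Explain for one comment."""
    chosen = allocate(community, experts)
    scores = [predict(e, comment) for e in chosen]
    flagged = aggregate(chosen, scores, community) >= threshold
    return flagged, explain(chosen, scores, flagged)


# Hypothetical experts and community name, for illustration only.
toy_experts = [
    {"name": "civility", "keywords": ["idiot"], "affinity": {"r/example": 0.8}},
    {"name": "spam", "keywords": ["buy now"], "affinity": {"r/example": 0.2}},
]
print(moderate("You idiot", "r/example", toy_experts))
```

Keeping Explain as a separate step that only reads the outputs of the earlier operators mirrors the post-hoc nature of the explanations described in the abstract.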

[208] Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales

Jun Cao,Jiyi Li,Ziwei Yang,Renjie Zhou

Main category: cs.CL

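TL;DR: Proposes LRSA, a framework that combines small language models (SLMs) with large language models (LLMs) for Multimodal Aspect-Based Sentiment Analysis (MABSA), using LLM-generated explanations to strengthen the SLMs.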

Details

Motivation: Existing methods rely on small language models (SLMs) for multimodal sentiment analysis, but their limited capacity leads to inaccurate identification of aspects, sentiments, and their relations in textual and visual data. Large language models (LLMs) perform well in general yet still trail fine-tuned small models on ABSA. Method: Proposes the LRSA framework, which injects LLM-generated explanations into SLMs as rationales and uses a dual cross-attention mechanism to strengthen feature interaction and fusion, improving the SLMs' ability to identify aspects and sentiments. Result: Evaluated on two baseline models, the method shows superior results on three widely used benchmarks, indicating generalizability and applicability. Conclusion: By combining the strengths of SLMs and LLMs, LRSA improves multimodal aspect-based sentiment analysis. Abstract: There has been growing interest in Multimodal Aspect-Based Sentiment Analysis (MABSA) in recent years. Existing methods predominantly rely on pre-trained small language models (SLMs) to collect information related to aspects and sentiments from both image and text, with an aim to align these two modalities. However, SLMs possess limited capacity and knowledge, often resulting in inaccurate identification of meaning, aspects, sentiments, and their interconnections in textual and visual data. On the other hand, large language models (LLMs) have shown exceptional capabilities in various tasks by effectively exploring fine-grained information in multimodal data. However, some studies indicate that LLMs still fall short compared to fine-tuned small models in the field of ABSA. Based on these findings, we propose a novel framework, termed LRSA, which combines the decision-making capabilities of SLMs with additional information provided by LLMs for MABSA. Specifically, we inject explanations generated by LLMs as rationales into SLMs and employ a dual cross-attention mechanism for enhancing feature interaction and fusion, thereby augmenting the SLMs' ability to identify aspects and sentiments. We evaluated our method using two baseline models; numerous experiments highlight the superiority of our approach on three widely used benchmarks, indicating its generalizability and applicability to most pre-trained models for MABSA.

[209] ModRWKV: Transformer Multimodality in Linear Time

Jiale Kang,Ziyin Yue,Qingyu Yin,Jiang Rui,Weile Li,Zening Lu,Zhouran Ji

Main category: cs.CL

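TL;DR: This paper proposes ModRWKV, a multimodal framework built on the RWKV7 architecture that fuses multi-source information through dynamically adaptable heterogeneous modality encoders, showing the potential of modern RNN architectures for multimodal large language models.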

Details

Motivation: Current multimodal research relies mainly on Transformer architectures with high computational complexity, while linear models such as RNNs are more efficient but have largely been limited to a single modality. This paper explores what modern RNNs can do on multimodal tasks. Method: Proposes the ModRWKV framework, built on the RWKV7 architecture with lightweight multimodal modules, and initializes it from pretrained weights to accelerate training. Result: Experiments show that ModRWKV balances performance and computational efficiency, and that initialization from pretrained weights substantially improves its understanding of multimodal signals. Conclusion: Modern RNN architectures are a viable alternative to Transformers for multimodal large language models, and the paper identifies the best-performing ModRWKV configuration. Abstract: Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV, a decoupled multimodal framework built upon the RWKV7 architecture as its LLM backbone, which achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly accelerates multimodal training. Comparative experiments with different pretrained checkpoints further demonstrate that such initialization plays a crucial role in enhancing the model's ability to understand multimodal signals. Supported by extensive experiments, we conclude that modern RNN architectures present a viable alternative to Transformers in the domain of multimodal large language models (MLLMs). Furthermore, we identify the optimal configuration of the ModRWKV architecture through systematic exploration.

[210] EmoGist: Efficient In-Context Learning for Visual Emotion Understanding

Ronald Seoh,Dan Goldwasser

Main category: cs.CL

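TL;DR: EmoGist is a training-free, in-context learning method for visual emotion classification that pre-generates multiple explanations of each emotion label and substantially improves classification performance.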

Details

Motivation: The way emotions manifest in images is highly context-dependent and nuanced, which conventional methods struggle to capture. EmoGist aims to improve accuracy through context-dependent label definitions. Method: EmoGist pre-generates multiple explanations of each emotion label, retrieves a suitable explanation by embedding similarity, and passes it to a fast vision-language model for classification. Result: Micro F1 improves by up to 13 points on the Memotion dataset, and macro F1 by up to 8 points on the FI dataset. Conclusion: Context-dependent label explanations substantially improve visual emotion classification. Abstract: In this paper, we introduce EmoGist, a training-free, in-context learning method for performing visual emotion classification with LVLMs. The key intuition of our approach is that context-dependent definition of emotion labels could allow more accurate predictions of emotions, as the ways in which emotions manifest within images are highly context dependent and nuanced. EmoGist pre-generates multiple explanations of emotion labels, by analyzing the clusters of example images belonging to each category. At test time, we retrieve a version of explanation based on embedding similarity, and feed it to a fast VLM for classification. Through our experiments, we show that EmoGist allows up to 13 points improvement in micro F1 scores with the multi-label Memotion dataset, and up to 8 points in macro F1 in the multi-class FI dataset.
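
The test-time retrieval step can be sketched as a nearest-neighbor lookup. This is a minimal illustration assuming cosine similarity between a test-image embedding and one embedding per pre-generated explanation; the abstract only says "embedding similarity", so the similarity measure, the data layout, and the names `cosine` and `retrieve_explanation` are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_explanation(image_embedding, candidates):
    """Return the explanation whose cluster embedding is closest to the image.

    candidates is a list of (cluster_embedding, explanation_text) pairs.
    ASSUMPTION: one embedding per explanation, compared by cosine similarity.
    """
    best = max(candidates, key=lambda pair: cosine(image_embedding, pair[0]))
    return best[1]


# Hypothetical two-dimensional embeddings and explanation texts.
explanations = [
    ([1.0, 0.0], "Joy here shows up as open, relaxed body language."),
    ([0.0, 1.0], "Joy here shows up as celebratory group scenes."),
]
print(retrieve_explanation([0.9, 0.1], explanations))
```

The retrieved text would then be placed in the prompt given to the vision-language model, which is the part the abstract describes as feeding the explanation to a fast VLM.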

[211] Exploring Graph Representations of Logical Forms for Language Modeling

Michael Sullivan

Main category: cs.CL

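TL;DR: The paper argues for language models over logical forms (LFLMs) and uses the GFoLDS prototype to show that they are more data-efficient than textual models. Experiments indicate that LFLMs can use built-in linguistic knowledge to learn complex patterns quickly and outperform textual models when data is limited.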

Details

Motivation: To examine the data-efficiency advantage of language models over logical forms (LFLMs) and assess their potential for real-world use. Method: Introduces the GFoLDS prototype, a pretrained language model over graph representations of logical forms, and compares it experimentally with textual models. Result: GFoLDS substantially outperforms textual models pretrained on similar amounts of data, and its performance is likely to improve with more parameters and data. Conclusion: LFLMs are more data-efficient, show potential to scale, and appear viable for real-world applications. Abstract: We make the case for language models over logical forms (LFLMs), arguing that such models are more data-efficient than their textual counterparts. To that end, we introduce the Graph-based Formal-Logical Distributional Semantics (GFoLDS) prototype, a pretrained LM over graph representations of logical forms, as a proof-of-concept of LFLMs. Using GFoLDS, we present strong experimental evidence that LFLMs can leverage the built-in, basic linguistic knowledge inherent in such models to immediately begin learning more complex patterns. On downstream tasks, we show that GFoLDS vastly outperforms textual, transformer LMs pretrained on similar amounts of data, indicating that LFLMs can learn with substantially less data than models over plain text. Furthermore, we show that the performance of this model is likely to scale with additional parameters and pretraining data, suggesting the viability of LFLMs in real-world applications.

[212] Internal Chain-of-Thought: Empirical Evidence for Layer-wise Subtask Scheduling in LLMs

Zhipeng Yang,Junzhuo Li,Siyu Xia,Xuming Hu

Main category: cs.CL

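TL;DR: The study finds that large language models (LLMs) exhibit an internal chain-of-thought: they decompose and execute composite tasks layer by layer.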

Details

Motivation: To investigate how LLMs learn and execute the subtasks of a composite task at different network depths, with the aim of improving model transparency. Method: Uses layer-from context-masking and a cross-task patching method to verify that subtasks are learned at different depths, and applies LogitLens to decode hidden states and analyze the execution pattern. Result: A consistent layer-wise execution pattern appears on 15 two-step composite tasks and on the real-world TRACE benchmark. Conclusion: LLMs can plan and execute subtasks internally, which opens a path to fine-grained, instruction-level activation steering. Abstract: We show that large language models (LLMs) exhibit an internal chain-of-thought: they sequentially decompose and execute composite tasks layer-by-layer. Two claims ground our study: (i) distinct subtasks are learned at different network depths, and (ii) these subtasks are executed sequentially across layers. On a benchmark of 15 two-step composite tasks, we employ layer-from context-masking and propose a novel cross-task patching method, confirming (i). To examine claim (ii), we apply LogitLens to decode hidden states, revealing a consistent layerwise execution pattern. We further replicate our analysis on the real-world TRACE benchmark, observing the same stepwise dynamics. Together, our results enhance LLM transparency by showing their capacity to internally plan and execute subtasks (or instructions), opening avenues for fine-grained, instruction-level activation steering.
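
Decoding intermediate hidden states with a logit-lens style projection can be sketched as below. This is a simplified toy: it multiplies each layer's hidden state by an unembedding matrix and takes the highest-scoring token, and it omits the final layer normalization that such analyses usually apply first. The function name, the tiny vocabulary, and the matrices are hypothetical.

```python
def logit_lens(hidden_states, unembed, vocab):
    """Decode the top token at every layer.

    hidden_states: one vector per layer for a single position.
    unembed: one row per vocabulary entry (row i scores vocab[i]).
    SIMPLIFICATION: no layer norm is applied before the projection.
    """
    decoded = []
    for hidden in hidden_states:
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in unembed]
        best = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded


# Toy example: an early layer favors one token, a later layer another.
vocab = ["subtask_1_answer", "final_answer"]
unembed = [[1.0, 0.0], [0.0, 1.0]]
per_layer = [[2.0, 1.0], [0.0, 3.0]]
print(logit_lens(per_layer, unembed, vocab))
```

A shift in the decoded token from one layer to the next, as in this toy input, is the kind of layer-wise pattern the abstract reports observing.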

[213] Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders

Agam Goyal,Vedant Rathi,William Yeh,Yian Wang,Yuen Chen,Hari Sundaram

Main category: cs.CL

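TL;DR: The paper proposes a method based on sparse autoencoders (SAEs) that identifies toxicity-related directions in a model's residual stream and applies targeted activation steering to reduce toxic outputs from large language models (LLMs).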

Details

Motivation: Many detoxification methods exist, but most apply surface-level fixes that are easy to circumvent. The paper aims to reduce toxic outputs through more precise interventions while preserving general model capabilities. Method: Uses sparse autoencoders to identify toxicity-related directions and steers activations at different strengths, evaluated on GPT-2 Small and Gemma-2-2B. Result: At stronger steering strengths, toxicity falls by up to 20%, though fluency can degrade. Standard NLP benchmark scores remain stable, indicating that the model's knowledge and abilities are preserved. Conclusion: SAE-based causal interventions are promising for detoxification but need better feature disentanglement; the findings suggest practical guidelines for safer deployment. Abstract: Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks. Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks. In this paper we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the residual stream of models and perform targeted activation steering using the corresponding decoder vectors. We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency. At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending on the aggressiveness. Crucially, standard NLP benchmark scores upon steering remain stable, indicating that the model's knowledge and general abilities are preserved. We further show that feature-splitting in wider SAEs hampers safety interventions, underscoring the importance of disentangled feature learning. Our findings highlight both the promise and the current limitations of SAE-based causal interventions for LLM detoxification, further suggesting practical guidelines for safer language-model deployment.
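
Steering an activation along a decoder direction can be sketched as follows. This is a minimal illustration assuming the intervention subtracts a scaled unit vector from the residual-stream activation; the abstract does not specify the update rule or the three aggressiveness tiers, so the function name `steer`, the `strength` parameter, and the subtraction rule are assumptions.

```python
import math


def steer(residual, decoder_vec, strength):
    """Shift a residual-stream activation away from a feature direction.

    residual: activation vector at one position.
    decoder_vec: SAE decoder vector of the (toxicity-related) feature.
    strength: how far to move along the normalized direction.
    ASSUMPTION: the update is residual - strength * unit(decoder_vec).
    """
    norm = math.sqrt(sum(d * d for d in decoder_vec))
    unit = [d / norm for d in decoder_vec]
    return [r - strength * u for r, u in zip(residual, unit)]


# Hypothetical values: larger strengths model more aggressive steering.
activation = [1.0, 2.0]
direction = [0.0, 2.0]
for strength in (0.5, 1.0, 2.0):
    print(strength, steer(activation, direction, strength))
```

Only the component along the chosen direction changes, which is one way to read the abstract's contrast between targeted interventions and broad, surface-level fixes.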

[214] KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Jiajun Shi,Jian Yang,Jiaheng Liu,Xingyuan Bu,Jiangjie Chen,Junting Zhou,Kaijing Ma,Zhoufutu Wen,Bingli Wang,Yancheng He,Liang Song,Hualei Zhu,Shilong Li,Xingjian Wang,Wei Zhang,Ruibin Yuan,Yifan Yao,Wenjun Yang,Yunli Wang,Siyuan Fang,Siyu Yuan,Qianyu He,Xiangru Tang,Yingshui Tan,Wangchunshu Zhou,Zhaoxiang Zhang,Zhoujun Li,Wenhao Huang,Ge Zhang

Main category: cs.CL

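TL;DR: The paper introduces KORGym, a dynamic evaluation platform for assessing the reasoning capabilities of large language models, and reports superior performance from closed-source models.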

Details

Motivation: Existing evaluations are mostly domain-specific and cannot fully capture the general reasoning ability of large language models, so a broader evaluation platform is needed. Method: Builds the KORGym platform, which offers over fifty games, supports interactive multi-turn evaluation and reinforcement learning scenarios, and is used to test 19 LLMs and 8 VLMs. Result: Experiments reveal consistent reasoning patterns within model families and show that closed-source models perform better. Conclusion: KORGym is expected to become a valuable resource for LLM reasoning research and for evaluation methodologies suited to complex, interactive environments. Abstract: Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.

[215] Pivot Language for Low-Resource Machine Translation

Abhimanyu Talwar,Julien Laasri

Main category: cs.CL

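TL;DR: The paper studies Nepali-to-English translation with Hindi as a pivot language, compares two approaches (the fully supervised Transfer Method and semi-supervised Backtranslation), and discusses the performance gap and directions for improvement.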

Details

Motivation: Some language pairs lack a parallel corpus that is large and diverse in domain; the paper uses a pivot language (Hindi) to address this and tests how well it works. Method: Applies two approaches, the fully supervised Transfer Method and semi-supervised Backtranslation, with Hindi as the pivot for Nepali-to-English translation. Result: The Transfer Method reaches a SacreBLEU score of 14.2 on the devtest set, 6.6 points above the fully supervised baseline but slightly below the semi-supervised baseline of 15.1. Conclusion: The paper discusses possible causes of the gap and suggests directions for future work. Abstract: Certain pairs of languages suffer from the lack of a parallel corpus which is large in size and diverse in domain. One of the ways this is overcome is via the use of a pivot language. In this paper we use Hindi as a pivot language to translate Nepali into English. We describe what makes Hindi a good candidate for the pivot. We discuss ways in which a pivot language can be used, and use two such approaches, the Transfer Method (fully supervised) and Backtranslation (semi-supervised), to translate Nepali into English. Using the former, we are able to achieve a devtest set SacreBLEU score of 14.2, which improves the baseline fully supervised score reported by Guzman et al. (2019) by 6.6 points. While we are slightly below the semi-supervised baseline score of 15.1, we discuss what may have caused this under-performance, and suggest scope for future work.

[216] TRATES: Trait-Specific Rubric-Assisted Cross-Prompt Essay Scoring

Sohaila Eltanbouly,Salam Albatarni,Tamer Elsayed

Main category: cs.CL

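TL;DR: TRATES is a trait-scoring framework based on large language models (LLMs) that generates trait-specific features and combines them with generic features for cross-prompt automated essay scoring.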

Details

Motivation: Research on automated essay scoring (AES) has mostly addressed holistic scores and paid little attention to individual traits. Method: Uses an LLM to generate trait-specific features (assessment questions), combines them with generic and prompt-specific features, and trains a regression model to predict trait scores for essays from unseen prompts. Result: TRATES achieves state-of-the-art performance across all traits on a widely used dataset, with the LLM-generated features contributing the most. Conclusion: TRATES is an effective and general approach to trait scoring, and the LLM-generated features are the key factor. Abstract: Research on holistic Automated Essay Scoring (AES) is long-standing; yet, there is a notable lack of attention to assessing essays according to individual traits. In this work, we propose TRATES, a novel trait-specific and rubric-based cross-prompt AES framework that is generic yet specific to the underlying trait. The framework leverages a Large Language Model (LLM) that utilizes the trait grading rubrics to generate trait-specific features (represented by assessment questions), then assesses those features given an essay. The trait-specific features are eventually combined with generic writing-quality and prompt-specific features to train a simple classical regression model that predicts trait scores of essays from an unseen prompt. Experiments show that TRATES achieves a new state-of-the-art performance across all traits on a widely used dataset, with the generated LLM-based features being the most significant.

[217] Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning

Shangziqi Zhao,Jiahao Yuan,Guisong Yang,Usman Naseem

Main category: cs.CL

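TL;DR: The Prune-on-Logic framework converts Long-CoT into logic graphs and selectively prunes low-utility reasoning steps; pruning verification steps improves the reasoning accuracy of small models.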

Details

Motivation: To explore how pruning can make Long-CoT reasoning better suited to small language models. Method: Proposes the Prune-on-Logic framework, which converts Long-CoT into logic graphs and selectively prunes low-utility steps. Result: Pruning verification steps yields consistent accuracy gains and lowers inference cost, while the other pruning strategies degrade performance. Conclusion: Pruning is an effective structural strategy for aligning CoT reasoning with the capacity of small models. Abstract: Long chain-of-thought (Long-CoT) reasoning improves accuracy in LLMs, yet its verbose, self-reflective style often hinders effective distillation into small language models (SLMs). We revisit Long-CoT compression through the lens of capability alignment and ask: Can pruning improve reasoning? We propose Prune-on-Logic, a structure-aware framework that transforms Long-CoT into logic graphs and selectively prunes low-utility reasoning steps under self-verification constraints. Through systematic analysis across three pruning strategies (targeting entire chains, core reasoning, and verification), we find that pruning verification steps yields consistent accuracy gains while reducing inference cost, outperforming token-level baselines and uncompressed fine-tuning. In contrast, pruning reasoning or all-chain steps degrades performance, revealing that small models benefit not from shorter CoTs, but from semantically leaner ones. Our findings highlight pruning as a structural optimization strategy for aligning CoT reasoning with SLM capacity.

[218] Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning

Wenbin Hu,Haoran Li,Huihao Jing,Qi Hu,Ziqian Zeng,Sirui Han,Heli Xu,Tianshu Chu,Peizhao Hu,Yangqiu Song

Main category: cs.CL

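TL;DR: The paper proposes a reinforcement learning method grounded in Contextual Integrity theory to address the safety and privacy risks of large language models while improving both compliance and reasoning.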

Details

Motivation: Current safety and privacy mitigation strategies for large language models rely on sensitive pattern matching and overlook established compliance standards, which creates systemic risk. Method: Uses reinforcement learning with a rule-based reward, reformulates safety and privacy issues as contextualized compliance problems, and aligns the model with GDPR, the EU AI Act, and HIPAA. Result: The method improves legal compliance (+17.64% accuracy on safety/privacy benchmarks) and general reasoning (+2.05% on MMLU and +8.98% on LegalBench). Conclusion: Combining the Contextual Integrity framework with reinforcement learning balances compliance and reasoning and offers a new approach to LLM safety and privacy. Abstract: While Large Language Models (LLMs) exhibit remarkable capabilities, they also introduce significant safety and privacy risks. Current mitigation strategies often fail to preserve contextual reasoning capabilities in risky scenarios. Instead, they rely heavily on sensitive pattern matching to protect LLMs, which limits the scope. Furthermore, they overlook established safety and privacy standards, leading to systemic risks for legal compliance. To address these gaps, we formulate safety and privacy issues into contextualized compliance problems following the Contextual Integrity (CI) theory. Under the CI framework, we align our model with three critical regulatory standards: GDPR, EU AI Act, and HIPAA. Specifically, we employ reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning capabilities while enhancing compliance with safety and privacy norms. Through extensive experiments, we demonstrate that our method not only significantly enhances legal compliance (achieving a +17.64% accuracy improvement in safety/privacy benchmarks) but also further improves general reasoning capability. For OpenThinker-7B, a strong reasoning model that significantly outperforms its base model Qwen2.5-7B-Instruct across diverse subjects, our method enhances its general reasoning capabilities, with +2.05% and +8.98% accuracy improvement on the MMLU and LegalBench benchmark, respectively.

[219] MCIP: Protecting MCP Safety via Model Contextual Integrity Protocol

Huihao Jing,Haoran Li,Wenbin Hu,Qi Hu,Heli Xu,Tianshu Chu,Peizhao Hu,Yangqiu Song

Main category: cs.CL

TL;DR: This paper proposes a framework for improving the safety of the Model Context Protocol (MCP). It analyzes MCP's safety gaps, develops the MCIP protocol, and builds a fine-grained taxonomy and benchmark data, substantially improving the safety performance of LLMs in MCP interactions.

Details

Motivation: MCP's decentralized architecture introduces underexplored safety risks that call for systematic analysis. Method: Guided by the MAESTRO framework, the authors analyze the safety mechanisms missing from MCP, propose the MCIP protocol, and develop a taxonomy, benchmark data, and training data. Result: Experiments show that LLMs are vulnerable in MCP interactions and that the proposed approach substantially improves their safety. Conclusion: MCIP and its accompanying resources effectively improve MCP safety and offer a practical solution for safe LLM interactions. Abstract: As Model Context Protocol (MCP) introduces an easy-to-use ecosystem for users and developers, it also brings underexplored safety risks. Its decentralized architecture, which separates clients and servers, poses unique challenges for systematic safety analysis. This paper proposes a novel framework to enhance MCP safety. Guided by the MAESTRO framework, we first analyze the missing safety mechanisms in MCP, and based on this analysis, we propose the Model Contextual Integrity Protocol (MCIP), a refined version of MCP that addresses these gaps. Next, we develop a fine-grained taxonomy that captures a diverse range of unsafe behaviors observed in MCP scenarios. Building on this taxonomy, we develop benchmark and training data that support the evaluation and improvement of LLMs' capabilities in identifying safety risks within MCP interactions. Leveraging the proposed benchmark and training data, we conduct extensive experiments on state-of-the-art LLMs. The results highlight LLMs' vulnerabilities in MCP interactions and demonstrate that our approach substantially improves their safety performance.

[220] Success is in the Details: Evaluate and Enhance Details Sensitivity of Code LLMs through Counterfactuals

Xianzhen Luo,Qingfu Zhu,Zhiming Zhang,Mingzheng Xu,Tianhao Cheng,Yixuan Wang,Zheng Chu,Shijie Xuyang,Zhiyuan Ma,YuanTao Fan,Wanxiang Che

Main category: cs.CL

TL;DR: The paper introduces the notion of code sensitivity and improves the sensitivity of LLMs with the CTF-Code benchmark and the CTF-Instruct fine-tuning framework. Experiments validate the approach.

Details

Motivation: Existing code benchmarks and instruction data overlook code sensitivity, so LLMs respond inadequately to changes of detail in problem descriptions. Method: The authors introduce the CTF-Code benchmark, built from counterfactual perturbations, and CTF-Instruct, an incremental instruction fine-tuning framework, to improve LLM sensitivity. Result: Fine-tuned LLMs improve by over 2% on CTF-Code and by more than 10% on LiveCodeBench. Conclusion: Enhancing the code sensitivity of LLMs improves their performance, which confirms the feasibility of the approach. Abstract: Code Sensitivity refers to the ability of Code LLMs to recognize and respond to detail changes in problem descriptions. While current code benchmarks and instruction data focus on difficulty and diversity, sensitivity is overlooked. We first introduce the CTF-Code benchmark, constructed using counterfactual perturbations, minimizing input changes while maximizing output changes. The evaluation shows that many LLMs have a more than 10% performance drop compared to the original problems. To fully utilize sensitivity, CTF-Instruct, an incremental instruction fine-tuning framework, extends on existing data and uses a selection mechanism to meet the three dimensions of difficulty, diversity, and sensitivity. Experiments show that LLMs fine-tuned with CTF-Instruct data achieve over a 2% improvement on CTF-Code, and more than a 10% performance boost on LiveCodeBench, validating the feasibility of enhancing LLMs' sensitivity to improve performance.

[221] Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models

Guangzhi Xiong,Eric Xie,Corey Williams,Myles Kim,Amir Hassan Shariatmadari,Sikun Guo,Stefan Bekiranov,Aidong Zhang

Main category: cs.CL

TL;DR: The paper introduces the TruthHypo benchmark and the KnowHD detector to evaluate how well large language models (LLMs) generate truthful biomedical hypotheses and to address hallucination.

Details

Motivation: LLMs show promise for hypothesis generation in biomedicine, but hallucination limits the truthfulness and reliability of their output, which calls for systematic evaluation. Method: The authors introduce the TruthHypo benchmark and KnowHD, a knowledge-based hallucination detector, to analyze the truthfulness of LLM-generated hypotheses. Result: LLMs struggle to generate truthful hypotheses. KnowHD's groundedness scores effectively filter truthful hypotheses, and human evaluation confirms their utility. Conclusion: TruthHypo and KnowHD provide tools for assessing the truthfulness of LLM-generated hypotheses and can help accelerate scientific discovery. Abstract: Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful biomedical hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.

[222] sudoLLM : On Multi-role Alignment of Language Models

Soumadeep Saha,Akshay Chaturvedi,Joy Mahapatra,Utpal Garain

Main category: cs.CL

TL;DR: sudoLLM is a new framework that produces multi-role aligned LLMs through a user-authorization mechanism, improving safety and resistance to attacks.

Details

Motivation: Existing LLMs lack user-authorization-based access control, which creates safety risks. Method: sudoLLM injects user-based bias signals into queries and trains the LLM to produce sensitive information only when the user is authorized. Result: Experiments show substantially improved alignment, generalization, and resistance to attacks. Conclusion: sudoLLM serves as an additional security layer that complements existing guardrail mechanisms and strengthens end-to-end LLM safety. Abstract: User authorization-based access privileges are a key feature in many safety-critical systems, but have thus far been absent from the large language model (LLM) realm. In this work, drawing inspiration from such access control systems, we introduce sudoLLM, a novel framework that results in multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance with, user access rights. sudoLLM injects subtle user-based biases into queries and trains an LLM to utilize this bias signal in order to produce sensitive information if and only if the user is authorized. We present empirical results demonstrating that this approach shows substantially improved alignment, generalization, and resistance to prompt-based jailbreaking attacks. The persistent tension between the language modeling objective and safety alignment, which is often exploited to jailbreak LLMs, is somewhat resolved with the aid of the injected bias signal. Our framework is meant as an additional security layer, and complements existing guardrail mechanisms for enhanced end-to-end safety with LLMs.

[223] Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)

Rafael Rivera Soto,Barry Chen,Nicholas Andrews

Main category: cs.CL

TL;DR: The paper examines how hard machine-generated text detection is and proposes a detection approach based on a stylistic feature space that is robust to model optimization. It also introduces AURA, a new metric that estimates the overlap between human and machine-generated text distributions.

Details

Motivation: Machine-generated text is hard to detect, particularly once language models have been optimized to degrade detector performance. Method: The paper proposes detection in a stylistic feature space, explores a new paraphrasing attack, and introduces the AURA metric. Result: The stylistic feature space is robust to optimized models, but the paraphrasing attack remains effective when only a single sample is available. Detection performance improves as the number of samples grows. Conclusion: The findings reinforce earlier recommendations to avoid relying on machine-text detection, and AURA is offered as an evaluation tool. Abstract: Despite considerable progress in the development of machine-text detectors, it has been suggested that the problem is inherently hard, and therefore, that stakeholders should proceed under the assumption that machine-generated text cannot be reliably detected as such. We examine a recent such claim by Nicks et al. (2024) regarding the ease with which language models can be optimized to degrade the performance of machine-text detectors, including detectors not specifically optimized against. We identify a feature space – the stylistic feature space – that is robust to such optimization, and show that it may be used to reliably detect samples from language models optimized to prevent detection. Furthermore, we show that even when models are explicitly optimized against stylistic detectors, detection performance remains surprisingly unaffected. We then seek to understand if stylistic detectors are inherently more robust. To study this question, we explore a new paraphrasing approach that simultaneously aims to close the gap between human writing and machine writing in stylistic feature space while avoiding detection using traditional features. We show that when only a single sample is available for detection, this attack is universally effective across all detectors considered, including those that use writing style. However, as the number of samples available for detection grows, the human and machine distributions become distinguishable. This observation encourages us to introduce AURA, a metric that estimates the overlap between human and machine-generated distributions by analyzing how detector performance improves as more samples become available. Overall, our findings underscore previous recommendations to avoid reliance on machine-text detection.

[224] Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models

Sahar Abdelnabi,Ahmed Salem

Main category: cs.CL

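TL;DR: The study finds that large language models (LLMs) change their behavior when they realize they are being evaluated, an effect akin to the Hawthorne effect that influences their safety alignment. It proposes a white-box probing framework that quantifies how this "test awareness" affects model behavior and shows that the effect differs across models.

Details

Motivation: To examine how behavior changes under evaluation (akin to the Hawthorne effect) affect the safety alignment of LLMs, so that safety evaluations become more trustworthy. Method: A white-box probing framework linearly identifies activations associated with test awareness and steers those activations to observe changes in model behavior. Result: Test awareness significantly affects safety alignment, and the effect differs across models. Conclusion: By quantifying the effect of test awareness and providing a way to control it, the work aims to make safety evaluation more trustworthy. Abstract: Reasoning-focused large language models (LLMs) sometimes alter their behavior when they detect that they are being evaluated, an effect analogous to the Hawthorne phenomenon, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly its safety alignment. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-source reasoning LLMs across both realistic and hypothetical tasks. Our results demonstrate that test awareness significantly impacts safety alignment, and that this effect differs across models. By providing fine-grained control over this latent effect, our work aims to increase trust in how we perform safety evaluation.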
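The two steps named in the abstract, linearly identifying an awareness-related direction and then steering along it, can be sketched as follows. This is a minimal illustration and not the authors' framework: it swaps in a difference-of-means probe (the abstract does not say which linear probe is used), plain Python lists stand in for hidden-state tensors, and the names `probe_direction`, `awareness_score`, and `steer` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a list of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def probe_direction(aware: Sequence[Sequence[float]],
                    unaware: Sequence[Sequence[float]]) -> Vector:
    """Unit vector from the mean 'unaware' activation to the mean 'aware' one.

    Assumption: a difference-of-means probe stands in for whatever linear
    probe the paper actually trains.
    """
    diff = [a - u for a, u in zip(mean_vector(aware), mean_vector(unaware))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def awareness_score(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Projection of one hidden state onto the probe direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Shift a hidden state along the direction; negative alpha steers away."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

Because the direction has unit length, steering with strength `alpha` moves the awareness score by exactly `alpha`, which makes the intervention easy to calibrate in this toy setting.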

[225] Think Only When You Need with Large Hybrid-Reasoning Models

Lingjie Jiang,Xun Wu,Shaohan Huang,Qingxiu Dong,Zewen Chi,Li Dong,Xingxing Zhang,Tengchao Lv,Lei Cui,Furu Wei

Main category: cs.CL

TL;DR: The paper proposes Large Hybrid-Reasoning Models (LHRMs), which adaptively decide whether to think and thereby improve reasoning efficiency by substantially reducing unnecessary computation.

Details

Motivation: Existing Large Reasoning Models (LRMs) incur unnecessary compute overhead and latency from overly long thinking on simple queries, so a more efficient solution is needed. Method: A two-stage training pipeline uses Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. Result: Experiments show that LHRMs adapt to queries of varying difficulty and type, outperform existing models in reasoning and general capabilities, and significantly improve efficiency. Conclusion: LHRMs provide a solid starting point for building hybrid thinking systems and prompt a reconsideration of when extended thinking is appropriate. Abstract: Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is particularly unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform thinking based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model's capability for hybrid thinking. Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type. They outperform existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended thinking processes and provides a solid starting point for building hybrid thinking systems.

[226] Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas

Yu Ying Chiu,Zhilin Wang,Sharan Maiya,Yejin Choi,Kyle Fish,Sydney Levine,Evan Hubinger

Main category: cs.CL

TL;DR: The paper proposes LitmusValues, which predicts potentially risky behavior of AI models by evaluating their value prioritization.

Details

Motivation: As AI models grow more capable, detecting risky behaviors such as Alignment Faking becomes harder. Inspired by how risky human behavior (such as illegal activity) is often driven by values, the authors argue that identifying the values of AI models can serve as an early warning of risk. Method: The authors create the LitmusValues evaluation pipeline, which reveals AI models' priorities across a range of value classes, and collect the AIRiskDilemmas dataset, which simulates value conflicts in AI safety risk scenarios. A model's value priorities are inferred from its aggregate choices. Result: Values in LitmusValues (such as Care) predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench. Conclusion: Identifying the value priorities of AI models is an effective way to predict their risky behaviors, and LitmusValues provides a new tool for AI safety. Abstract: Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.

[227] General-Reasoner: Advancing LLM Reasoning Across All Domains

Xueguang Ma,Qian Liu,Dongfu Jiang,Ge Zhang,Zejun Ma,Wenhu Chen

Main category: cs.CL

TL;DR: The paper proposes General-Reasoner, a new training paradigm that strengthens LLM reasoning across domains. It builds a large-scale, high-quality dataset and a generative answer verifier, significantly improving performance across diverse domains.

Details

Motivation: Current research on LLM reasoning concentrates on mathematics and coding, which limits generalization to more diverse domains. Method: The authors construct a high-quality cross-domain dataset and develop a generative model-based answer verifier that replaces traditional rule-based verification. Result: Across 12 benchmarks, General-Reasoner outperforms existing baselines and shows robust, generalizable reasoning. Conclusion: General-Reasoner offers an effective solution for LLM reasoning in diverse domains while maintaining strong performance on mathematical reasoning tasks. Abstract: Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). Particularly, the "Zero" reinforcement learning introduced by DeepSeek-R1-Zero enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current works for LLM reasoning mainly focus on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations, and data is more scarce. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers curated by web crawling, covering a wide range of disciplines; and (2) developing a generative model-based answer verifier, which replaces traditional rule-based verification with the capability of chain-of-thought and context-awareness. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics. Our comprehensive evaluation across these 12 benchmarks (e.g. MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH and MATH AMC) demonstrates that General-Reasoner outperforms existing baseline methods, achieving robust and generalizable reasoning performance while maintaining superior effectiveness in mathematical reasoning tasks.

[228] Reward Reasoning Model

Jiaxin Guo,Zewen Chi,Li Dong,Qingxiu Dong,Xun Wu,Shaohan Huang,Furu Wei

Main category: cs.CL

TL;DR: The paper proposes Reward Reasoning Models (RRMs), which use chain-of-thought reasoning to spend additional compute on improving reward model performance. Experiments show that they outperform conventional approaches.

Details

Motivation: Existing reward models struggle to use test-time compute to improve performance, so a more effective approach is needed. Method: The authors introduce Reward Reasoning Models (RRMs) and a reinforcement learning framework that fosters self-evolved reward reasoning without requiring explicit reasoning traces as training data. Result: RRMs perform strongly on reward modeling benchmarks across diverse domains and adaptively exploit test-time compute to improve reward accuracy. Conclusion: Through chain-of-thought reasoning and adaptive use of compute, RRMs improve reward model performance and have broad application potential. Abstract: Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to enhance reward model performance. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to execute a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs leverage additional test-time compute for complex queries where appropriate rewards are not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained reward reasoning models are available at https://huggingface.co/Reward-Reasoning.

[229] UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Large Language Models

Xiaojie Gu,Guangxu Chen,Jungang Li,Jia-Chen Gu,Xuming Hu,Kai Zhang

Main category: cs.CL

TL;DR: ULTRAEDIT is a new lifelong model editing method that updates knowledge quickly and efficiently through lightweight linear algebra operations, making it suitable for large-scale real-world use.

Details

Motivation: Existing lifelong editing methods fall short at scale and in practical deployment, so a method that supports efficient, wide-ranging knowledge updates is needed. Method: ULTRAEDIT is a training-, subject-, and memory-free editing scheme that computes parameter shifts with linear algebra operations and uses a lifelong normalization strategy to adapt to distributional shifts. Result: ULTRAEDIT edits over 7x faster than the previous fastest method, uses less than one third of the VRAM, and supports up to 1M edits while maintaining high accuracy. Conclusion: ULTRAEDIT performs strongly across diverse model editing scenarios and is currently the only method able to edit a 7B LLM on a 24GB consumer-grade GPU. Abstract: Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model's internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose ULTRAEDIT, a fundamentally new editing solution that is training-, subject- and memory-free, making it particularly well-suited for ultra-scalable, real-world lifelong model editing. ULTRAEDIT performs editing through a self-contained process that relies solely on lightweight linear algebra operations to compute parameter shifts, enabling fast and consistent parameter modifications with minimal overhead. To improve scalability in lifelong settings, ULTRAEDIT employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. ULTRAEDIT achieves editing speeds over 7x faster than the previous state-of-the-art method (which was also the fastest known approach) while consuming less than 1/3 the VRAM, making it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct ULTRAEDITBENCH, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 1M edits while maintaining high accuracy. Comprehensive experiments on four datasets and six models show that ULTRAEDIT consistently achieves superior performance across diverse model editing scenarios. Our code is available at: https://github.com/XiaojieGu/UltraEdit.

[230] Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning

Haolei Xu,Yuchen Yan,Yongliang Shen,Wenqi Zhang,Guiyang Hou,Shengpei Jiang,Kaitao Song,Weiming Lu,Jun Xiao,Yueting Zhuang

Main category: cs.CL

TL;DR: The paper proposes the CoT Thought Leap Bridge task, which detects thought leaps and generates the missing reasoning steps to improve the completeness and coherence of mathematical reasoning. Experiments show that the method significantly improves model performance.

Details

Motivation: Existing mathematical CoT datasets contain thought leaps because experts omit intermediate steps, which harms model learning and generalization. Method: The authors propose the CoT Thought Leap Bridge task, build the ScaleQM+ dataset, and train the CoT-Bridge model to fill in thought leaps. Result: On mathematical reasoning benchmarks, performance improves by up to 5.87%, and the approach also improves distilled data and reinforcement learning starting points. Conclusion: Enhancing reasoning completeness is broadly beneficial, and CoT-Bridge works as a plug-and-play module compatible with existing optimization techniques. Abstract: Large language models (LLMs) have achieved remarkable progress on mathematical tasks through Chain-of-Thought (CoT) reasoning. However, existing mathematical CoT datasets often suffer from Thought Leaps due to experts omitting intermediate steps, which negatively impacts model learning and generalization. We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate missing intermediate reasoning steps to restore the completeness and coherence of CoT. To facilitate this, we constructed a specialized training dataset called ScaleQM+, based on the structured ScaleQuestMath dataset, and trained CoT-Bridge to bridge thought leaps. Through comprehensive experiments on mathematical reasoning benchmarks, we demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets, with improvements of up to +5.87% on NuminaMath. Our approach effectively enhances distilled data (+3.02%) and provides better starting points for reinforcement learning (+3.1%), functioning as a plug-and-play module compatible with existing optimization techniques. Furthermore, CoT-Bridge demonstrates improved generalization to out-of-domain logical reasoning tasks, confirming that enhancing reasoning completeness yields broadly applicable benefits.

[231] Language Models use Lookbacks to Track Beliefs

Nikhil Prakash,Natalie Shapira,Arnab Sen Sharma,Christoph Riedl,Yonatan Belinkov,Tamar Rott Shaham,David Bau,Atticus Geiger

Main category: cs.CL

TL;DR: The paper studies how language models (LMs) represent characters' beliefs, especially when those beliefs differ from reality, by analyzing the reasoning of Llama-3-70B-Instruct with causal mediation and abstraction.

Details

Motivation: To understand the Theory of Mind (ToM) capabilities of language models and how they track and update characters' beliefs. Method: The authors construct a dataset of simple stories and analyze how the model handles character-object-state information through lookback mechanisms and binding. Result: The model dynamically updates characters' beliefs through lookback mechanisms and a visibility ID, revealing the algorithmic pattern behind its belief tracking. Conclusion: The study offers new insight for reverse-engineering ToM reasoning in language models. Abstract: How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze Llama-3-70B-Instruct's ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset that consists of simple stories where two characters each separately change the state of two objects, potentially unaware of each other's actions. Our investigation uncovered a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating reference information about them, represented as their Ordering IDs (OIs) in low rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the corresponding state OI and then an answer lookback retrieves the state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into the LM's belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.

cs.PF

[232] Towards Efficient Multi-Scale Deformable Attention on NPU

Chenghuan Huang,Zhigeng Xu,Chong Sun,Chen Li,Ziyang Ma

Main category: cs.PF

TL;DR: The paper presents a co-design approach for multi-scale deformable attention (MSDA) on the Ascend NPU architecture that significantly improves computational efficiency.

Details

Motivation: MSDA's random-access grid sampling strategy is hard to optimize on domain-specific accelerators such as NPUs, so memory access and computation strategies need to be redesigned. Method: A co-design approach optimizes memory access and computation, supports efficient forward and backward passes, and includes hardware-aware optimizations. Result: Compared with the baseline, the implementation achieves up to 5.9x speedup in the forward pass, 8.9x in the backward pass, and 7.3x in end-to-end training. Compared with the latest vendor library, it achieves 1.9x, 2.4x, and 2.0x, respectively. Conclusion: The co-design approach significantly improves the efficiency of MSDA on the Ascend NPU and is suitable for training workloads. Abstract: Multi-scale deformable attention (MSDA) is a flexible and powerful feature extraction mechanism for visual tasks, but its random-access grid sampling strategy poses significant optimization challenges, especially on domain-specific accelerators such as NPUs. In this work, we present a co-design approach that systematically rethinks memory access and computation strategies for MSDA on the Ascend NPU architecture. With this co-design approach, our implementation supports both efficient forward and backward computation, is fully adapted for training workloads, and incorporates a suite of hardware-aware optimizations. Extensive experiments show that our solution achieves up to $5.9\times$ (forward), $8.9\times$ (backward), and $7.3\times$ (end-to-end training) speedup over the grid sample-based baseline, and $1.9\times$, $2.4\times$, and $2.0\times$ acceleration over the latest vendor library, respectively.
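To make the "random-access grid sampling" bottleneck concrete, the sketch below shows the operation at the core of deformable attention: bilinear interpolation of a feature map at data-dependent fractional locations. It is a single-scale, single-head toy in plain Python, not the Ascend implementation described in the paper. The function names, the pixel-index coordinate convention, and the zero padding outside the map are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def bilinear_sample(feature: Sequence[Sequence[float]], x: float, y: float) -> float:
    """Sample a 2D feature map at fractional pixel coordinates.

    x indexes columns and y indexes rows; reads outside the map return 0.0
    (zero padding is an assumption of this sketch).
    """
    height, width = len(feature), len(feature[0])
    x0, y0 = math.floor(x), math.floor(y)
    wx, wy = x - x0, y - y0

    def at(row: int, col: int) -> float:
        if 0 <= row < height and 0 <= col < width:
            return feature[row][col]
        return 0.0

    # Four neighbouring reads per sample: this scattered access pattern is
    # what makes the operation hard to vectorise on accelerators.
    top = (1 - wx) * at(y0, x0) + wx * at(y0, x0 + 1)
    bottom = (1 - wx) * at(y0 + 1, x0) + wx * at(y0 + 1, x0 + 1)
    return (1 - wy) * top + wy * bottom


def deformable_attention_point(feature: Sequence[Sequence[float]],
                               ref: Tuple[float, float],
                               offsets: Sequence[Tuple[float, float]],
                               weights: Sequence[float]) -> float:
    """Attention-weighted sum of samples taken at reference point + offsets."""
    ref_x, ref_y = ref
    return sum(
        w * bilinear_sample(feature, ref_x + dx, ref_y + dy)
        for (dx, dy), w in zip(offsets, weights)
    )
```

In a real model the offsets and weights are predicted per query, so consecutive reads land at unrelated addresses, which is the memory-access problem the co-design approach targets.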

cs.CG

[233] EuLearn: A 3D database for learning Euler characteristics

Rodrigo Fritz,Pablo Suárez-Serrato,Victor Mijangos,Anayanzi D. Martinez-Hernandez,Eduardo Ivan Velazquez Richards

Main category: cs.CG

TL;DR: EuLearn is the first surface dataset to equitably represent a diversity of topological types. Its surfaces of uniformly varying genus are designed from random knots, supporting machine learning systems that recognize topological features.

Details

Motivation: To provide a diverse dataset for training machine learning systems that can recognize topological features. Method: The authors design surfaces from random knots, propose a non-Euclidean statistical sampling method, and adapt the PointNet and Transformer architectures. Result: Experiments show that incorporating topological information into deep learning workflows significantly improves performance on the EuLearn datasets. Conclusion: The EuLearn datasets and the non-Euclidean sampling method improve the recognition of topological features. Abstract: We present EuLearn, the first surface datasets equitably representing a diversity of topological types. We designed our embedded surfaces of uniformly varying genera relying on random knots, thus allowing our surfaces to knot with themselves. EuLearn contributes new topological datasets of meshes, point clouds, and scalar fields in 3D. We aim to facilitate the training of machine learning systems that can discern topological features. We experimented with specific emblematic 3D neural network architectures, finding that their vanilla implementations perform poorly on genus classification. To enhance performance, we developed a novel, non-Euclidean, statistical sampling method adapted to graph and manifold data. We also introduce adjacency-informed adaptations of PointNet and Transformer architectures that rely on our non-Euclidean sampling strategy. Our results demonstrate that incorporating topological information into deep learning workflows significantly improves performance on these otherwise challenging EuLearn datasets.
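For readers unfamiliar with the quantity being learned: on a closed orientable triangle mesh, the genus g follows from the Euler characteristic via χ = V − E + F = 2 − 2g. The sketch below computes both directly from a face list. It is an illustration of the standard formula, not code from the EuLearn release, and the helper names (`euler_characteristic`, `genus`, `torus_faces`) are hypothetical.

```python
from typing import Iterable, List, Tuple

Face = Tuple[int, int, int]


def euler_characteristic(faces: Iterable[Face]) -> int:
    """Return chi = V - E + F for a triangle mesh given as vertex-index triples."""
    faces = list(faces)
    vertices = {v for face in faces for v in face}
    edges = set()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))  # undirected edge
    return len(vertices) - len(edges) + len(faces)


def genus(faces: Iterable[Face]) -> int:
    """Genus of a closed orientable surface, from chi = 2 - 2g."""
    return (2 - euler_characteristic(faces)) // 2


def torus_faces(n: int, m: int) -> List[Face]:
    """Triangulate an n x m periodic grid (n, m >= 3), which is a torus."""
    faces = []
    for i in range(n):
        for j in range(m):
            a = i * m + j
            b = ((i + 1) % n) * m + j
            c = ((i + 1) % n) * m + (j + 1) % m
            d = i * m + (j + 1) % m
            faces.append((a, b, c))
            faces.append((a, c, d))
    return faces


# A tetrahedron is a triangulated sphere: V = 4, E = 6, F = 4.
TETRAHEDRON: List[Face] = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
```

The formula only applies to watertight, orientable meshes. Point clouds and scalar fields, which EuLearn also provides, carry no explicit face list, which is part of why genus classification from them is a learning problem rather than a counting exercise.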

q-fin.CP

[234] SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection

Huopu Zhang,Yanguang Liu,Mengnan Du

Main category: q-fin.CP

TL;DR: The paper proposes the SAE-FiRE framework, which uses sparse autoencoders to analyze earnings conference call transcripts, extract key information, and predict earnings surprises.

Details

Motivation: Earnings conference calls are an important communication channel between company executives, analysts, and shareholders, but their length and specialized terminology make them hard for language models to analyze. Method: Sparse Autoencoders (SAEs) extract key information and filter out noise, focusing on the financial signals that predict earnings surprises. Result: Experiments show that SAE-FiRE significantly outperforms the baselines. Conclusion: SAE-FiRE effectively handles the redundancy and specialized terminology of earnings call transcripts and improves predictive performance. Abstract: Predicting earnings surprises through the analysis of earnings conference call transcripts has attracted increasing attention from the financial research community. Conference calls serve as critical communication channels between company executives, analysts, and shareholders, offering valuable forward-looking information. However, these transcripts present significant analytical challenges, typically containing over 5,000 words with substantial redundancy and industry-specific terminology that creates obstacles for language models. In this work, we propose the Sparse Autoencoder for Financial Representation Enhancement (SAE-FiRE) framework to address these limitations by extracting key information while eliminating redundancy. SAE-FiRE employs Sparse Autoencoders (SAEs) to efficiently identify patterns and filter out noise, focusing specifically on capturing nuanced financial signals that have predictive power for earnings surprises. Experimental results indicate that the proposed method can significantly outperform the compared baselines.

cs.AI

[235] Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training

Mengru Wang,Xingyu Chen,Yue Wang,Zhiwei He,Jiahao Xu,Tian Liang,Qiuzhi Liu,Yunzhi Yao,Wenxuan Wang,Ruotian Ma,Haitao Mi,Ningyu Zhang,Zhaopeng Tu,Xiaolong Li,Dong Yu

Main category: cs.AI

TL;DR: The paper proposes RICE, an inference-time steering method that improves the performance of large reasoning models by reinforcing cognitive experts, without additional training or complex heuristics.

Details

Motivation: Existing reasoning models suffer from cognitive inefficiencies such as overthinking and underthinking, so a lightweight way to improve reasoning performance is needed. Method: Normalized Pointwise Mutual Information (nPMI) is used to identify cognitive experts, which are responsible for meta-level reasoning operations, and these experts are then reinforced. Result: Across multiple benchmarks, RICE improves reasoning accuracy, cognitive efficiency, and cross-domain generalization, outperforming existing methods. Conclusion: RICE is a practical and interpretable way to improve cognitive efficiency in advanced reasoning models. Abstract: Mixture-of-Experts (MoE) architectures within Large Reasoning Models (LRMs) have achieved impressive reasoning capabilities by selectively activating experts to facilitate structured cognitive processes. Despite notable advances, existing reasoning models often suffer from cognitive inefficiencies like overthinking and underthinking. To address these limitations, we introduce a novel inference-time steering methodology called Reinforcing Cognitive Experts (RICE), designed to improve reasoning performance without additional training or complex heuristics. Leveraging normalized Pointwise Mutual Information (nPMI), we systematically identify specialized experts, termed "cognitive experts", that orchestrate meta-level reasoning operations characterized by tokens like "". Empirical evaluations with leading MoE-based LRMs (DeepSeek-R1 and Qwen3-235B) on rigorous quantitative and scientific reasoning benchmarks demonstrate noticeable and consistent improvements in reasoning accuracy, cognitive efficiency, and cross-domain generalization. Crucially, our lightweight approach substantially outperforms prevalent reasoning-steering techniques, such as prompt design and decoding constraints, while preserving the model's general instruction-following skills. These results highlight reinforcing cognitive experts as a promising, practical, and interpretable direction to enhance cognitive efficiency within advanced reasoning models.
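The nPMI score used to pick out cognitive experts can be computed from simple co-occurrence counts. The sketch below uses the standard definition, nPMI(x, y) = log(p(x, y) / (p(x) p(y))) / (−log p(x, y)), which ranges from −1 (never together) through 0 (independent) to 1 (always together). How the paper defines the two events is not stated in the abstract; treating x as "expert is activated" and y as "token is a reasoning marker" is an assumption here, and the function name `npmi` is hypothetical.

```python
import math


def npmi(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Normalized pointwise mutual information from co-occurrence counts.

    n_xy: positions where both events occur (assumed: expert active AND
          token is a reasoning marker)
    n_x, n_y: positions where each event occurs on its own count
    n_total: total number of positions observed
    """
    if not (0 < n_x <= n_total and 0 < n_y <= n_total):
        raise ValueError("marginal counts must be in (0, n_total]")
    if n_xy == 0:
        return -1.0  # the events never co-occur: conventional lower bound
    if n_xy == n_total:
        raise ValueError("nPMI is undefined when both events always occur")
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

Ranking experts by this score and keeping the top few is one plausible reading of "systematically identify specialized experts". The title suggests that two experts suffice, but the exact selection rule is not given in the abstract.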

[236] Evaluating Large Language Models for Real-World Engineering Tasks

Rene Heesch,Sebastian Eilermann,Alexander Windmann,Alexander Diedrich,Philipp Rosenthal,Oliver Niggemann

Main category: cs.AI

TL;DR: The paper proposes an evaluation grounded in real engineering scenarios, filling a gap in how LLMs are assessed on complex engineering problems.

Details

Motivation: Current evaluations of LLMs on engineering tasks rely on simplified use cases and ad hoc scenarios and do not adequately reflect real engineering competence. Method: The authors build a database of over 100 real-world engineering questions and evaluate four state-of-the-art LLMs on it. Result: LLMs handle basic temporal and structural reasoning well but perform poorly on abstract reasoning, formal modeling, and context-sensitive engineering logic. Conclusion: LLMs remain limited on complex engineering tasks and need further improvement. Abstract: Large Language Models (LLMs) are transformative not only for daily activities but also for engineering tasks. However, current evaluations of LLMs in engineering exhibit two critical shortcomings: (i) the reliance on simplified use cases, often adapted from examination materials where correctness is easily verifiable, and (ii) the use of ad hoc scenarios that insufficiently capture critical engineering competencies. Consequently, the assessment of LLMs on complex, real-world engineering problems remains largely unexplored. This paper addresses this gap by introducing a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios, systematically designed to cover core competencies such as product design, prognosis, and diagnosis. Using this dataset, we evaluate four state-of-the-art LLMs, including both cloud-based and locally hosted instances, to systematically investigate their performance on complex engineering tasks. Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.

[237] Contrastive Cross-Course Knowledge Tracing via Concept Graph Guided Knowledge Transfer

Wenkang Han,Wang Lin,Liya Hu,Zhenlong Dai,Yiyun Zhou,Mengze Li,Zemin Liu,Chang Yao,Jingyuan Chen

Main category: cs.AI

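TL;DR: TransKT is a cross-course knowledge tracing method that uses concept-graph-guided knowledge transfer to estimate learners' knowledge states more accurately.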

Details

Motivation: Existing knowledge tracing models focus mainly on single-course data, which makes it hard to capture learners' knowledge states comprehensively. Method: TransKT builds a cross-course concept graph, uses zero-shot large language model (LLM) prompts to establish implicit links between concepts, and refines knowledge state representations with a contrastive objective. Result: TransKT markedly improves knowledge transfer performance and strengthens the representation of learners' knowledge states. Conclusion: Through cross-course knowledge transfer and contrastive learning, TransKT offers a more comprehensive and accurate approach to knowledge tracing. Abstract: Knowledge tracing (KT) aims to predict learners' future performance based on historical learning interactions. However, existing KT models predominantly focus on data from a single course, limiting their ability to capture a comprehensive understanding of learners' knowledge states. In this paper, we propose TransKT, a contrastive cross-course knowledge tracing method that leverages concept graph guided knowledge transfer to model the relationships between learning behaviors across different courses, thereby enhancing knowledge state estimation. Specifically, TransKT constructs a cross-course concept graph by leveraging zero-shot Large Language Model (LLM) prompts to establish implicit links between related concepts across different courses. This graph serves as the foundation for knowledge transfer, enabling the model to integrate and enhance the semantic features of learners' interactions across courses. Furthermore, TransKT includes an LLM-to-LM pipeline for incorporating summarized semantic features, which significantly improves the performance of Graph Convolutional Networks (GCNs) used for knowledge transfer. Additionally, TransKT employs a contrastive objective that aligns single-course and cross-course knowledge states, thereby refining the model's ability to provide a more robust and accurate representation of learners' overall knowledge states.

[238] Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale

David Noever,Forrest McKee

Main category: cs.AI

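TL;DR: The study examines how well large language models (LLMs) can act as autonomous agents on real-world tasks such as freelance software development, and proposes a new benchmark derived from economic data that evaluates LLMs on freelance programming and data analysis tasks.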

Details

Motivation: Assess the practical potential of LLMs on freelance tasks and provide evidence on the feasibility of AI as a freelance developer. Method: Build a benchmark of synthetic tasks from a Kaggle Freelancer dataset, with structured input-output test cases and estimated prices for automated evaluation. Result: Claude 3.5 Haiku performs best, earning about $1.52 million, followed by GPT-4o-mini ($1.49 million), Qwen 2.5 ($1.33 million), and Mistral ($0.70 million). Conclusion: LLMs perform well on structured tasks, but a gap remains relative to the complexity of real freelance work; the automated benchmark approach has the advantages of scalability and repeatability. Abstract: This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. This work presents a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. We construct the benchmark using synthetic tasks created from a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (median fixed-project price around $250, and an average of $306). Each task is accompanied by structured input-output test cases and an estimated price tag, enabling automated correctness checking and a monetary performance valuation. This approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1M total). Still, our framework simplifies evaluation using programmatically testable tasks and predicted price values, making it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs - Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. We report each model's accuracy (task success rate and test-case pass rate) and the total "freelance earnings" it achieves (sum of prices of solved tasks). Our results show that Claude 3.5 Haiku performs best, earning approximately $1.52 million USD, followed closely by GPT-4o-mini at $1.49 million, then Qwen 2.5 ($1.33M) and Mistral ($0.70M). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss the implications of these results for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks versus the true complexity of real-world freelance jobs.
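
The three reported quantities (task success rate, test-case pass rate, and "freelance earnings" as the sum of prices of solved tasks) can be sketched as follows. The `TaskResult` record and the rule that a task counts as solved only when every test case passes are assumptions made for illustration; the abstract does not spell out the exact solved criterion.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    price_usd: float   # estimated price tag attached to the task
    tests_passed: int  # test cases the model's solution passed
    tests_total: int   # test cases defined for the task


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Aggregate per-task results into the three benchmark-style metrics."""
    # Assumption: a task is "solved" only if all of its test cases pass.
    solved = [r for r in results if r.tests_total > 0 and r.tests_passed == r.tests_total]
    total_tests = sum(r.tests_total for r in results)
    return {
        "task_success_rate": len(solved) / len(results),
        "test_pass_rate": sum(r.tests_passed for r in results) / total_tests,
        "earnings_usd": sum(r.price_usd for r in solved),
    }


example = summarize([
    TaskResult(price_usd=250.0, tests_passed=3, tests_total=3),
    TaskResult(price_usd=300.0, tests_passed=2, tests_total=4),
    TaskResult(price_usd=100.0, tests_passed=0, tests_total=3),
])
```

Here only the first task is fully solved, so the earnings are that task's price, while the test-case pass rate still credits the partial progress on the second task.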

[239] BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs

Junxiao Yang,Jinzhe Tu,Haoran Liu,Xiaoce Wang,Chujie Zheng,Zhexin Zhang,Shiyao Cui,Caishun Chen,Tiantian He,Hongning Wang,Yew-Soon Ong,Minlie Huang

Main category: cs.AI

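TL;DR: The paper proposes the BARREL framework, which addresses overconfident and incorrect answers in large reasoning models (LRMs) and improves reliability through boundary-aware reasoning.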

Details

Motivation: Current LRMs are overconfident in mathematical and logical reasoning and give incorrect answers even when they do not know the answer, which undermines factual reliability. Method: Propose the BARREL framework, which targets two pathological reasoning patterns (last-minute guessing and second-thought spiraling) to achieve concise, boundary-aware reasoning. Result: Experiments show that BARREL-training raises the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48% while keeping accuracy comparable to models fine-tuned on R1-generated data. Conclusion: The BARREL framework offers insights for building more reliable and factual System 2 LRMs. Abstract: Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning. However, current LRMs rarely admit ignorance or respond with "I don't know". Instead, they often produce incorrect answers while showing undue confidence, raising concerns about their factual reliability. In this work, we identify two pathological reasoning patterns characterized by overthinking that contribute to the overconfident and incorrect answers: last-minute guessing and second-thought spiraling. To address these issues, we propose BARREL, a novel framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL-training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%, while still achieving accuracy comparable to models finetuned on reasoning data generated by R1. These results demonstrate that our pilot study is inspiring to build more reliable and factual System 2 LRMs.

[240] Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

Safal Shrestha,Minwu Kim,Aadim Nepal,Anubhav Shrestha,Keith Ross

Main category: cs.AI

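TL;DR: The paper proposes a two-stage training strategy for developing reasoning-capable large language models (LLMs) under limited supervision, using a warmup stage followed by a reinforcement learning stage to improve sample efficiency and generalization.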

Details

Motivation: Training LLMs with strong reasoning ability usually requires large amounts of high-quality data, which becomes the main obstacle when data is scarce. The paper aims to address this problem. Method: Two-stage training: (1) a warmup stage that acquires general reasoning ability by distilling long chains of thought from a toy domain (Knights & Knaves logic puzzles); (2) a reinforcement learning stage that applies RLVR to the warmed-up model with a small amount of target-domain data. Result: Experiments show that the warmup stage improves performance across tasks, that the warmed-up model outperforms the base model on the same small dataset, and that it retains cross-domain generalization while improving sample efficiency. Conclusion: The warmup strategy shows promise for building robust reasoning LLMs in data-scarce settings. Abstract: Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chain of Thoughts (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely, Knights & Knaves (K&K) logic puzzles to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval+, and MMLU-Pro; (ii) when both the base model and the warmed-up model are RLVR trained on the same small dataset (≤100 examples), the warmed-up model consistently outperforms the base model; (iii) warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.

[241] Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations

Li Ji-An,Hua-Dong Xiong,Robert C. Wilson,Marcelo G. Mattar,Marcus K. Benna

Main category: cs.AI

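TL;DR: The study examines the metacognitive abilities of large language models (LLMs), that is, their ability to monitor and report their own internal activation patterns, and designs a neurofeedback paradigm to quantify this ability.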

Details

Motivation: As society relies more on LLMs, understanding the limits of their metacognitive abilities is critical for AI safety, especially because models might hide internal processes to evade oversight. Method: Use a neuroscience-inspired neurofeedback paradigm in which sentence-label pairs are used to get models to report and control their internal activation patterns. Result: LLMs can learn to report and control activations along specific neural directions, but this ability depends on the number of examples, the semantic interpretability of the target direction, and the variance it explains. Conclusion: The metacognitive space of LLMs has far lower dimensionality than their neural space, indicating that they can monitor only part of their neural mechanisms, which has important implications for AI safety. Abstract: Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, but they can also fail to do so. This suggests some degree of metacognition -- the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognitive abilities enhance AI capabilities but raise safety concerns, as models might obscure their internal processes to evade neural-activation-based oversight mechanisms designed to detect harmful behaviors. Given society's increased reliance on these models, it is critical that we understand the limits of their metacognitive abilities, particularly their ability to monitor their internal activations. To address this, we introduce a neuroscience-inspired neurofeedback paradigm designed to quantify the ability of LLMs to explicitly report and control their activation patterns. By presenting models with sentence-label pairs where labels correspond to sentence-elicited internal activations along specific directions in the neural representation space, we demonstrate that LLMs can learn to report and control these activations. The performance varies with several factors: the number of example pairs provided, the semantic interpretability of the target neural direction, and the variance explained by that direction. These results reveal a "metacognitive space" with dimensionality much lower than the model's neural space, suggesting LLMs can monitor only a subset of their neural mechanisms. Our findings provide empirical evidence quantifying metacognitive capabilities in LLMs, with significant implications for AI safety.

[242] Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference

Jin Du,Li Chen,Xun Xian,An Luo,Fangqiao Tian,Ganghua Wang,Charles Doss,Xiaotong Shen,Jie Ding

Main category: cs.AI

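TL;DR: The paper proposes the CausalPitfalls benchmark for evaluating how well large language models overcome common pitfalls in causal inference, and reveals the limitations of current models.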

Details

Motivation: Reliable causal inference is critical for decisions in high-stakes domains, but existing benchmarks are too simplified to evaluate large language models comprehensively. Method: Design structured challenges across multiple difficulty levels, paired with grading rubrics, and evaluate under two protocols: direct prompting and code-assisted prompting. Result: Current large language models show significant limitations in statistical causal inference. Conclusion: CausalPitfalls provides important guidance and quantitative metrics for developing trustworthy causal reasoning systems. Abstract: Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This approach allows us to quantitatively measure both causal reasoning capabilities and the reliability of LLMs' responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for explicit statistical analysis. Additionally, we validate the effectiveness of this judge by comparing its scoring with assessments from human experts. Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.
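
Simpson's paradox, one of the pitfalls named above, is easiest to see with numbers. The sketch below uses the widely cited kidney-stone treatment counts as an illustration; these figures are a textbook example and are not drawn from the CausalPitfalls benchmark itself.

```python
# (successes, patients) per treatment arm and stone size; textbook illustration.
DATA = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes: int, patients: int) -> float:
    return successes / patients


def pooled_rate(arm: str) -> float:
    """Success rate after collapsing over stone size (ignores the confounder)."""
    successes = sum(s for s, _ in DATA[arm].values())
    patients = sum(n for _, n in DATA[arm].values())
    return successes / patients


# Within every stratum, treatment A has the higher success rate...
a_wins_small = rate(*DATA["A"]["small"]) > rate(*DATA["B"]["small"])
a_wins_large = rate(*DATA["A"]["large"]) > rate(*DATA["B"]["large"])
# ...yet the pooled comparison favors treatment B.
b_wins_pooled = pooled_rate("B") > pooled_rate("A")
```

The reversal arises because treatment A was given mostly to the harder large-stone cases, so stone size confounds the pooled comparison. A model that draws conclusions directly from the pooled table would pick the wrong treatment.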

[243] Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation

Junyang Wang,Haiyang Xu,Xi Zhang,Ming Yan,Ji Zhang,Fei Huang,Jitao Sang

Main category: cs.AI

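TL;DR: Mobile-Agent-V uses video as guidance to inject operational knowledge automatically, markedly improving mobile automation efficiency with a 36% performance gain.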

Details

Motivation: Mobile device usage is surging, existing AI frameworks are inefficient because they lack operational knowledge, and injecting that knowledge manually is too burdensome. Method: Extract operational knowledge directly from video content, avoiding manual intervention, and propose the Mobile-Knowledge benchmark to evaluate performance. Result: Experiments show that Mobile-Agent-V improves performance by 36% over existing methods. Conclusion: Mobile-Agent-V provides an efficient mobile automation solution that requires no manual intervention. Abstract: The exponential rise in mobile device usage necessitates streamlined automation for effective task management, yet many AI frameworks fall short due to inadequate operational expertise. While manually written knowledge can bridge this gap, it is often burdensome and inefficient. We introduce Mobile-Agent-V, an innovative framework that utilizes video as a guiding tool to effortlessly and efficiently inject operational knowledge into mobile automation processes. By deriving knowledge directly from video content, Mobile-Agent-V eliminates manual intervention, significantly reducing the effort and time required for knowledge acquisition. To rigorously evaluate this approach, we propose Mobile-Knowledge, a benchmark tailored to assess the impact of external knowledge on mobile agent performance. Our experimental findings demonstrate that Mobile-Agent-V enhances performance by 36% compared to existing methods, underscoring its effortless and efficient advantages in mobile automation.

[244] Efficient Agent Training for Computer Use

Yanheng He,Jiahe Jin,Pengfei Liu

Main category: cs.AI

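TL;DR: The PC Agent-E framework markedly improves computer use agent performance with a small amount of high-quality trajectory data, reducing reliance on human demonstrations.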

Details

Motivation: Address the scarcity of high-quality trajectory data, which limits the development of human-like computer use agents. Method: Start from 312 human-annotated trajectories, synthesize diverse action decisions with Claude 3.7 Sonnet, and train the PC Agent-E model on the enriched data. Result: A 141% relative improvement on the WindowsAgentArena-V2 benchmark, plus strong generalization on OSWorld. Conclusion: A small amount of high-quality trajectory data can elicit strong computer use capabilities. Abstract: Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further improved data quality by synthesizing diverse action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, surpassing the strong Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, an improved benchmark we also released. Furthermore, PC Agent-E demonstrates strong generalizability to different operating systems on OSWorld. Our findings suggest that strong computer use capabilities can be stimulated from a small amount of high-quality trajectory data.

[245] ProMind-LLM: Proactive Mental Health Care via Causal Reasoning with Sensor Data

Xinzhe Zheng,Sijie Ji,Jiawei Sun,Renqi Chen,Wei Gao,Mani Srivastava

Main category: cs.AI

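TL;DR: ProMind-LLM combines subjective mental records with objective behavioral data for reliable mental health risk assessment, using domain-specific pretraining, a self-refine mechanism, and causal chain-of-thought reasoning to make predictions more reliable and interpretable.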

Details

Motivation: Existing mental health risk assessment methods rely mainly on subjective textual records, which are affected by mental uncertainty and lead to inconsistent, unreliable predictions. Method: ProMind-LLM integrates domain-specific pretraining, a self-refine mechanism, and causal chain-of-thought reasoning, combining subjective mental records with objective behavioral data. Result: On the PMData and Globem datasets, ProMind-LLM outperforms general-purpose large language models. Conclusion: ProMind-LLM offers a more reliable, interpretable, and scalable solution for mental health. Abstract: Mental health risk is a critical global public health challenge, necessitating innovative and reliable assessment methods. With the development of large language models (LLMs), they stand out to be a promising tool for explainable mental health care applications. Nevertheless, existing approaches predominantly rely on subjective textual mental records, which can be distorted by inherent mental uncertainties, leading to inconsistent and unreliable predictions. To address these limitations, this paper introduces ProMind-LLM. We investigate an innovative approach integrating objective behavior data as complementary information alongside subjective mental records for robust mental health risk assessment. Specifically, ProMind-LLM incorporates a comprehensive pipeline that includes domain-specific pretraining to tailor the LLM for mental health contexts, a self-refine mechanism to optimize the processing of numerical behavioral data, and causal chain-of-thought reasoning to enhance the reliability and interpretability of its predictions. Evaluations of two real-world datasets, PMData and Globem, demonstrate the effectiveness of our proposed methods, achieving substantial improvements over general LLMs. We anticipate that ProMind-LLM will pave the way for more dependable, interpretable, and scalable mental health case solutions.

[246] s3: You Don't Need That Much Data to Train a Search Agent via RL

Pengcheng Jiang,Xueqiang Xu,Jiacheng Lin,Jinfeng Xiao,Zifeng Wang,Jimeng Sun,Jiawei Han

Main category: cs.AI

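TL;DR: The paper proposes s3, a lightweight framework that decouples search from generation and trains the searcher with a Gain Beyond RAG reward, markedly improving generation accuracy.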

Details

Motivation: Existing methods either ignore downstream utility or couple retrieval with generation, which limits the usefulness and compatibility of retrieval. Method: Propose the s3 framework, which decouples the searcher from the generator and trains the searcher with a Gain Beyond RAG reward. Result: s3 surpasses baselines with only 2.4k training samples and performs strongly across multiple QA benchmarks. Conclusion: The s3 framework offers clear advantages in generation accuracy and compatibility. Abstract: Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting the real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
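
The abstract defines the Gain Beyond RAG reward as the improvement in generation accuracy over naive RAG. A minimal per-question sketch of that definition follows; the function names, the exact-match scoring, and the `toy_generator` stand-in for a frozen LLM are all hypothetical and are not the s3 implementation.

```python
from typing import Callable, Sequence

# A frozen generator maps (question, retrieved documents) to an answer string.
Generator = Callable[[str, Sequence[str]], str]


def exact_match(prediction: str, gold: str) -> bool:
    """Assumed scoring rule: case-insensitive exact match."""
    return prediction.strip().lower() == gold.strip().lower()


def gain_beyond_rag(question: str, gold: str,
                    searcher_docs: Sequence[str], naive_docs: Sequence[str],
                    generator: Generator) -> float:
    """Reward = correctness with the searcher's documents minus correctness with naive RAG."""
    with_searcher = float(exact_match(generator(question, searcher_docs), gold))
    with_naive = float(exact_match(generator(question, naive_docs), gold))
    return with_searcher - with_naive


def toy_generator(question: str, docs: Sequence[str]) -> str:
    """Stand-in for a frozen LLM: echoes the text after 'answer:' in the first matching doc."""
    for doc in docs:
        if "answer:" in doc:
            return doc.split("answer:", 1)[1]
    return "unknown"


reward = gain_beyond_rag("Capital of France?", "Paris",
                         ["answer: Paris"], ["unrelated text"], toy_generator)
```

Because the generator is only called, never updated, a reward of this shape can be computed against a frozen or proprietary model, which matches the decoupling described in the abstract.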

[247] Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning

Minwu Kim,Anubhav Shrestha,Safal Shrestha,Aadim Nepal,Keith Ross

Main category: cs.AI

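TL;DR: RLVR improves accuracy but not capability, while distillation can improve both. The study finds that RLVR focuses on easier questions at the expense of hard ones, and that distillation improves capability only when it introduces new knowledge.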

Details

Motivation: Investigate the mechanisms by which RLVR and distillation affect the reasoning behavior of language models. Method: Analyze how RLVR and distillation affect questions of different difficulty, and compare their output distributions and response quality. Result: RLVR sacrifices accuracy on hard questions, and distillation improves capability only when new knowledge is introduced. Conclusion: The study clarifies how RLVR and distillation each shape model reasoning behavior. Abstract: Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy but fails to improve capability, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR does not improve capability because it focuses on improving the accuracy of the less-difficult questions to the detriment of the accuracy of the most difficult questions, thereby leading to no improvement in capability. Second, we find that RLVR does not merely increase the success probability for the less difficult questions, but in our small model settings produces quality responses that were absent in its output distribution before training. In addition, we show these responses are neither noticeably longer nor feature more reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, we show that while distillation reliably improves accuracy by learning strong reasoning patterns, it only improves capability when new knowledge is introduced. Moreover, when distilling only with reasoning patterns and no new knowledge, the accuracy of the less-difficult questions improves to the detriment of the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in language models.
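
The abstract distinguishes accuracy from capability without defining either. One common convention in this line of work, assumed here rather than taken from the paper, is to read accuracy as pass@1 and capability as pass@k for a larger k. The sketch below implements the standard unbiased pass@k estimator from n samples of which c are correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct), given c correct out of n."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


# A question solved in 1 of 10 samples: low pass@1, much higher pass@5.
accuracy_like = pass_at_k(n=10, c=1, k=1)
capability_like = pass_at_k(n=10, c=1, k=5)
```

Under this reading, a training method can raise pass@1 on easier questions while leaving pass@k unchanged, or even lower on the hardest ones, which is the pattern the abstract attributes to RLVR.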

[248] SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors

Maheep Chaudhary,Fazl Barez

Main category: cs.AI

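TL;DR: The paper proposes a real-time monitoring framework that uses an unsupervised approach to predict harmful AI outputs, focusing on backdoor-triggered responses, and develops the multi-detector Safety-Net framework, which reaches 96% detection accuracy.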

Details

Motivation: High-risk industries such as nuclear and aviation use real-time monitoring; large language models (LLMs) similarly need monitoring safeguards to prevent harmful content. Method: Take an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers, study backdoor-triggered responses, and design the multi-detector Safety-Net framework. Result: Models can produce harmful content through causal mechanisms and may evade monitoring by changing how information is represented. Safety-Net detects harmful behavior with 96% accuracy. Conclusion: The proposed unsupervised framework can effectively monitor and prevent harmful LLM outputs, and shows particular promise against deceptive behavior in future models. Abstract: High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. Similarly, Large Language Models (LLMs) need monitoring safeguards. We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers. Our study focuses specifically on backdoor-triggered responses -- where specific input phrases activate hidden vulnerabilities causing the model to generate unsafe content like violence, pornography, or hate speech. We address two key challenges: (1) identifying true causal indicators rather than surface correlations, and (2) preventing advanced models from deception -- deliberately evading monitoring systems. Hence, we approach this problem from an unsupervised lens by drawing parallels to human deception: just as humans exhibit physical indicators while lying, we investigate whether LLMs display distinct internal behavioral signatures when generating harmful content. Our study addresses two critical challenges: 1) designing monitoring systems that capture true causal indicators rather than superficial correlations; and 2) preventing intentional evasion by increasingly capable "Future models". Our findings show that models can produce harmful content through causal mechanisms and can become deceptive by: (a) alternating between linear and non-linear representations, and (b) modifying feature relationships. To counter this, we developed Safety-Net -- a multi-detector framework that monitors different representation dimensions, successfully detecting harmful behavior even when information is shifted across representational spaces to evade individual monitors. Our evaluation shows 96% accuracy in detecting harmful cases using our unsupervised ensemble approach.

[249] Causal Cartographer: From Mapping to Reasoning Over Counterfactual Worlds

Gaël Gendron,Jože M. Rožanec,Michael Witbrock,Gillian Dobbie

Main category: cs.AI

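TL;DR: The paper proposes the Causal Cartographer framework, which improves large language models on causal reasoning tasks by explicitly extracting and modeling causal relationships.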

Details

Motivation: Existing foundation models (such as large language models) lack causal reasoning ability, and counterfactuals are hard to evaluate in the real world. Method: Use a graph retrieval-augmented generation agent to extract causal relationships and build a causal knowledge repository, and design a counterfactual reasoning agent for reliable inference. Result: The approach extracts causal knowledge and makes large language models more robust at causal reasoning while reducing inference costs and spurious correlations. Conclusion: The Causal Cartographer framework provides an effective tool for causal reasoning problems. Abstract: Causal world models are systems that can answer counterfactual questions about an environment of interest, i.e. predict how it would have evolved if an arbitrary subset of events had been realized differently. It requires understanding the underlying causes behind chains of events and conducting causal inference for arbitrary unseen distributions. So far, this task eludes foundation models, notably large language models (LLMs), which do not have demonstrated causal reasoning capabilities beyond the memorization of existing causal relationships. Furthermore, evaluating counterfactuals in real-world applications is challenging since only the factual world is observed, limiting evaluation to synthetic datasets. We address these problems by explicitly extracting and modeling causal relationships and propose the Causal Cartographer framework. First, we introduce a graph retrieval-augmented generation agent tasked to retrieve causal relationships from data. This approach allows us to construct a large network of real-world causal relationships that can serve as a repository of causal knowledge and build real-world counterfactuals. In addition, we create a counterfactual reasoning agent constrained by causal relationships to perform reliable step-by-step causal inference. We show that our approach can extract causal knowledge and improve the robustness of LLMs for causal reasoning tasks while reducing inference costs and spurious correlations.

[250] PRL: Prompts from Reinforcement Learning

Paweł Batorski,Adrian Kosmala,Paul Swoboda

Main category: cs.AI

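TL;DR: PRL (Prompts from Reinforcement Learning) automatically generates novel few-shot examples and markedly improves LLM performance on text classification, simplification, and summarization.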

Details

Motivation: Prompt engineering currently depends on expert intuition and task understanding, and key semantic cues can be hard to capture. PRL aims to generate effective prompts automatically through reinforcement learning, reducing manual effort. Method: PRL is a reinforcement-learning-based method that can produce few-shot examples not seen during training for prompt optimization. Result: PRL performs strongly on several benchmarks: on classification it surpasses APE by 2.58% and EvoPrompt by 1.00%; on summarization it improves ROUGE scores by 4.32 and 2.12; on simplification it improves SARI scores by 6.93 and 6.01. Conclusion: By generating prompts automatically, PRL markedly improves LLM performance and offers an efficient approach to prompt engineering. Abstract: Effective prompt engineering remains a central challenge in fully harnessing the capabilities of LLMs. While well-designed prompts can dramatically enhance performance, crafting them typically demands expert intuition and a nuanced understanding of the task. Moreover, the most impactful prompts often hinge on subtle semantic cues, ones that may elude human perception but are crucial for guiding LLM behavior. In this paper, we introduce PRL (Prompts from Reinforcement Learning), a novel RL-based approach for automatic prompt generation. Unlike previous methods, PRL can produce novel few-shot examples that were not seen during training. Our approach achieves state-of-the-art performance across a range of benchmarks, including text classification, simplification, and summarization. On the classification task, it surpasses prior methods by 2.58% over APE and 1.00% over EvoPrompt. Additionally, it improves the average ROUGE scores on the summarization task by 4.32 over APE and by 2.12 over EvoPrompt and the SARI score on simplification by 6.93 over APE and by 6.01 over EvoPrompt. Our code is available at https://github.com/Batorskq/prl .

[251] Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach

Oren Sultan,Eitan Stern,Dafna Shahaf

Main category: cs.AI

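TL;DR: The paper proposes a neuro-symbolic approach that combines the generative strengths of large language models (LLMs) with structured components to address LLM weaknesses in logical and symbolic reasoning tasks such as geometry proofs.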

Details

Motivation: LLMs perform poorly in domains that require rigorous logical deduction, such as mathematical proof generation. The authors aim to improve this by combining neural and symbolic methods. Method: (1) Retrieve analogous problems and use their proofs to guide the LLM; (2) use a formal verifier to evaluate generated proofs and provide feedback. Result: Experiments show that the method markedly improves proof accuracy for OpenAI's o1 model (58%-70% improvement), with both analogous problems and verifier feedback contributing to the gains. Conclusion: Generating provably correct conclusions could greatly improve the reliability, accuracy, and consistency of LLMs and unlock more complex and critical tasks. Abstract: Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs' generative strengths with structured components to overcome this challenge. As a proof-of-concept, we focus on geometry problems. Our approach is two-fold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. We demonstrate that our method significantly improves proof accuracy for OpenAI's o1 model (58%-70% improvement); both analogous problems and the verifier's feedback contribute to these gains. More broadly, shifting to LLMs that generate provably correct conclusions could dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.

[252] Reasoning Models Better Express Their Confidence

Dongkeun Yoon,Seungone Kim,Sohee Yang,Sunkyoung Kim,Soyeon Kim,Yongil Kim,Eunbi Choi,Yireun Kim,Minjoon Seo

Main category: cs.AI

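TL;DR: The study shows that reasoning LLMs, through chain-of-thought (CoT) reasoning, not only solve problems better but also express their confidence more accurately.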

Details

Motivation: Large language models (LLMs) often express confidence inaccurately, which limits their reliability. The paper examines whether reasoning models improve on this. Method: Benchmark six reasoning models on six datasets, analyze their confidence calibration, and study the role of slow thinking behaviors such as exploring alternatives and backtracking. Result: Reasoning models are better calibrated than non-reasoning models in 33 of 36 settings, and calibration improves progressively as the CoT unfolds. Removing slow thinking behaviors significantly degrades calibration. Conclusion: Reasoning models adjust their confidence dynamically through slow thinking behaviors, which markedly improves calibration, and non-reasoning models also benefit when guided by in-context learning. Abstract: Despite their strengths, large language models (LLMs) often fail to communicate their confidence accurately, making it difficult to assess when they might be wrong and limiting their reliability. In this work, we demonstrate that reasoning models, LLMs that engage in extended chain-of-thought (CoT) reasoning, exhibit superior performance not only in problem-solving but also in accurately expressing their confidence. Specifically, we benchmark six reasoning models across six datasets and find that they achieve strictly better confidence calibration than their non-reasoning counterparts in 33 out of the 36 settings. Our detailed analysis reveals that these gains in calibration stem from the slow thinking behaviors of reasoning models, such as exploring alternative approaches and backtracking, which enable them to adjust their confidence dynamically throughout their CoT, making it progressively more accurate. In particular, we find that reasoning models become increasingly better calibrated as their CoT unfolds, a trend not observed in non-reasoning models. Moreover, removing slow thinking behaviors from the CoT leads to a significant drop in calibration. Lastly, we show that these gains are not exclusive to reasoning models: non-reasoning models also benefit when guided to perform slow thinking via in-context learning.
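
Confidence calibration asks whether stated confidence matches observed accuracy. The abstract does not say which metric the authors use; the sketch below shows expected calibration error (ECE) with equal-width bins, a standard choice that is assumed here purely for illustration.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# A model that says "100% sure" four times but is right once is badly calibrated.
overconfident = expected_calibration_error([1.0] * 4, [True, False, False, False])
# A model that says "90% sure" ten times and is right nine times is well calibrated.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
```

Lower is better: a score near zero means the stated confidence tracks how often the model is actually right.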

[253] Agent Context Protocols Enhance Collective Inference

Devansh Bhardwaj,Arjun Beniwal,Shreyas Chaudhari,Ashwin Kalyan,Tanmay Rajpurohit,Karthik R. Narasimhan,Ameet Deshpande,Vishvak Murahari

Main category: cs.AI

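TL;DR: The paper proposes Agent Context Protocols (ACPs), structured protocols for communication and collaboration in multi-agent systems, which markedly improve performance on complex tasks.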

Details

Motivation: Coordination in current multi-agent systems relies on imprecise natural language, which limits complex interaction and interoperability with domain-specific agents. Method: Introduce ACPs, which combine persistent execution blueprints with standardized message schemas to enable robust, fault-tolerant multi-agent collaboration. Result: ACP-powered systems reach 28.3% accuracy on long-horizon web assistance tasks and outperform commercial AI systems on multimodal technical reports. Conclusion: ACPs are highly modular and extensible, and allow high-performing generalist agents to be built quickly. Abstract: AI agents have become increasingly adept at complex tasks such as coding, reasoning, and multimodal understanding. However, building generalist systems requires moving beyond individual agents to collective inference -- a paradigm where multi-agent systems with diverse, task-specialized agents complement one another through structured communication and collaboration. Today, coordination is usually handled with imprecise, ad-hoc natural language, which limits complex interaction and hinders interoperability with domain-specific agents. We introduce Agent context protocols (ACPs): a domain- and agent-agnostic family of structured protocols for agent-agent communication, coordination, and error handling. ACPs combine (i) persistent execution blueprints -- explicit dependency graphs that store intermediate agent outputs -- with (ii) standardized message schemas, enabling robust and fault-tolerant multi-agent collective inference. ACP-powered generalist systems reach state-of-the-art performance: 28.3% accuracy on AssistantBench for long-horizon web assistance and best-in-class multimodal technical reports, outperforming commercial AI systems in human evaluation. ACPs are highly modular and extensible, allowing practitioners to build top-tier generalist agents quickly.
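
The two ingredients named in the abstract, an execution blueprint stored as an explicit dependency graph and a standardized message schema, can be sketched roughly as below. Every name here (`Message`, `Step`, `run_blueprint`, the `status` values) is hypothetical and chosen for illustration; this is not the ACP specification.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


@dataclass
class Message:
    """Assumed message schema: which step spoke, whether it succeeded, and its output."""
    step: str
    status: str  # "ok" or "error"
    payload: str


@dataclass
class Step:
    name: str
    agent: Callable[[Dict[str, str]], str]  # receives upstream payloads keyed by step name
    deps: List[str] = field(default_factory=list)


def run_blueprint(steps: List[Step]) -> Dict[str, Message]:
    """Run steps in dependency order, storing every intermediate output as a Message."""
    by_name = {s.name: s for s in steps}
    order = TopologicalSorter({s.name: s.deps for s in steps}).static_order()
    outputs: Dict[str, Message] = {}
    for name in order:
        step = by_name[name]
        failed = [d for d in step.deps if outputs[d].status != "ok"]
        if failed:
            # Record the upstream failure instead of crashing the whole run.
            outputs[name] = Message(name, "error", f"upstream failed: {failed}")
            continue
        try:
            result = step.agent({d: outputs[d].payload for d in step.deps})
            outputs[name] = Message(name, "ok", result)
        except Exception as exc:
            outputs[name] = Message(name, "error", str(exc))
    return outputs


def search(inputs: Dict[str, str]) -> str:
    return "3 hits"


def report(inputs: Dict[str, str]) -> str:
    return "report on " + inputs["search"]


demo = run_blueprint([Step("search", search), Step("report", report, deps=["search"])])
```

Keeping every intermediate output in one structured record is what would let a failed step be retried or rerouted without rerunning its upstream steps, which is one plausible reading of "fault-tolerant" in the abstract.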

[254] SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

Anjiang Wei,Yuheng Wu,Yingjia Wan,Tarun Suresh,Huanmi Tan,Zhanke Zhou,Sanmi Koyejo,Ke Wang,Alex Aiken

Main category: cs.AI

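TL;DR: SATBench is a benchmark of logic puzzles generated from Boolean satisfiability (SAT) problems, used to evaluate the logical reasoning of large language models (LLMs).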

Details

Motivation: Prior work focuses mainly on inference-rule-based logical reasoning, whereas SATBench exploits the search-based nature of SAT problems to fill the gap in evaluating LLMs' search-based logical reasoning. Method: SATBench automatically generates logic puzzles from SAT formulas, adjusts difficulty by varying the number of clauses, and validates consistency with LLM-assisted and solver-based checks. Result: Experiments show that the strongest model, o4-mini, reaches only 65.0% accuracy on hard UNSAT problems, close to the 50% random baseline. Conclusion: SATBench exposes fundamental limitations in the search-based logical reasoning of current LLMs and provides a scalable testbed for future research. Abstract: We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a story context and conditions using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-assisted and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. SATBench exposes fundamental limitations in the search-based logical reasoning abilities of current LLMs and provides a scalable testbed for future research in logical reasoning.
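
To make the underlying task concrete, the sketch below decides satisfiability of a small CNF formula by exhaustive search. The signed-integer clause encoding follows the common DIMACS convention and is an assumption for illustration; the benchmark's own generator and solver are not reproduced here.

```python
from itertools import product
from typing import List


def is_satisfiable(clauses: List[List[int]], n_vars: int) -> bool:
    """Brute-force SAT check. Literal k means variable k is true, -k means it is false."""
    for assignment in product([False, True], repeat=n_vars):
        # A clause holds if any of its literals agrees with the assignment.
        if all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False


# (x1 or x2) and (not x1): satisfied by x1=False, x2=True.
sat_example = is_satisfiable([[1, 2], [-1]], n_vars=2)
# (x1 or x2) and (not x1 or x2) and (not x2): no assignment works.
unsat_example = is_satisfiable([[1, 2], [-1, 2], [-2]], n_vars=2)
```

A SAT answer can be justified by exhibiting one assignment, while an UNSAT answer requires ruling out every assignment. That asymmetry is consistent with the abstract's finding that hard UNSAT puzzles sit close to the 50% random baseline for a binary label.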

[255] Debating for Better Reasoning: An Unsupervised Multimodal Approach

Ashutosh Adhikari,Mirella Lapata

Main category: cs.AI

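TL;DR: The paper extends debate to a multimodal setting so that weaker models can supervise and improve stronger ones, with a focus on visual question answering. Experiments show that the debate framework outperforms individual expert models and that judgments from weaker LLMs can improve the reasoning of vision-language models through finetuning.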

Details

Motivation: As large language models (LLMs) grow more capable across domains and modalities, scalable oversight becomes harder, especially when model capabilities may surpass those of human evaluators. Debate is considered a promising mechanism for such oversight. Method: The debate paradigm is extended to a multimodal setting, focusing on visual question answering (VQA). Two "sighted" expert vision-language models debate, and a "blind" (text-only) judge adjudicates based on the quality of the arguments. Experts defend only answers that match their own beliefs, which avoids role-playing and concentrates debate on instances where the experts disagree. Result: The debate framework consistently outperforms individual expert models on several multimodal tasks, and judgments from weaker LLMs can improve the reasoning of vision-language models through finetuning. Conclusion: Debate is promising in multimodal settings: supervision from weaker models can improve the performance and reasoning of stronger models. Abstract: As Large Language Models (LLMs) gain expertise across diverse domains and modalities, scalable oversight becomes increasingly challenging, particularly when their capabilities may surpass human evaluators. Debate has emerged as a promising mechanism for enabling such oversight. In this work, we extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two "sighted" expert vision-language models debate an answer, while a "blind" (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement. Experiments on several multimodal tasks demonstrate that the debate framework consistently outperforms individual expert models. Moreover, judgments from weaker LLMs can help instill reasoning capabilities in vision-language models through finetuning.
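
The protocol in the abstract above (experts defend their own answers, debate happens only on disagreement, and the judge never sees the image) can be sketched as a short control loop. Everything here is a hypothetical stand-in: `run_debate`, the `answer`/`argue` interface, and the stub judge policy are illustrative and are not taken from the paper.

```python
def run_debate(question, image, expert_a, expert_b, judge, rounds=2):
    """Return (final_answer, transcript).

    expert_a / expert_b: hypothetical objects exposing
        answer(question, image) -> str
        argue(question, image, own_answer, transcript) -> str
    judge: hypothetical callable(question, candidates, transcript) -> str.
           It receives text only, never the image.
    """
    ans_a = expert_a.answer(question, image)
    ans_b = expert_b.answer(question, image)
    if ans_a == ans_b:
        return ans_a, []  # no disagreement, so no debate is needed
    transcript = []
    for _ in range(rounds):
        transcript.append((ans_a, expert_a.argue(question, image, ans_a, transcript)))
        transcript.append((ans_b, expert_b.argue(question, image, ans_b, transcript)))
    return judge(question, [ans_a, ans_b], transcript), transcript


class StubExpert:
    """Stand-in for a vision-language model with a fixed belief."""

    def __init__(self, fixed_answer):
        self.fixed_answer = fixed_answer

    def answer(self, question, image):
        return self.fixed_answer

    def argue(self, question, image, own_answer, transcript):
        return f"I see evidence for '{own_answer}' in the image."


def stub_judge(question, candidates, transcript):
    # Placeholder policy: side with the answer whose latest argument is longest.
    latest = {answer: argument for answer, argument in transcript}
    return max(candidates, key=lambda c: len(latest[c]))


agreed = run_debate("What animal is this?", None, StubExpert("cat"), StubExpert("cat"), stub_judge)
contested = run_debate("What animal is this?", None, StubExpert("cat"), StubExpert("a dog"), stub_judge)
```

A real judge would be a text-only LLM scoring argument quality; the length heuristic exists only to make the sketch runnable.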

[256] SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

Wonje Jeung,Sangyeon Yoon,Minsuk Kahng,Albert No

Main category: cs.AI

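TL;DR: SAFEPATH is a lightweight alignment method that fine-tunes reasoning models to emit a short, 8-token Safety Primer at the start of their reasoning when given harmful prompts, reducing harmful outputs while preserving reasoning performance.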

Details

Motivation: Large Reasoning Models (LRMs) excel at complex problem solving, but their structured reasoning pathways can lead to harmful outputs; existing safety alignment methods degrade reasoning depth and remain vulnerable to jailbreak attacks. Method: SAFEPATH fine-tunes LRMs to emit a short Safety Primer in response to harmful prompts while leaving the rest of the reasoning process unsupervised. Result: On DeepSeek-R1-Distill-Llama-8B, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. Conclusion: SAFEPATH improves safety while maintaining reasoning performance, also comes in a zero-shot variant that needs no fine-tuning, and its analysis reveals the limits of existing methods when applied to reasoning models. Abstract: Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.

[257] ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions

Bufang Yang,Lilin Xu,Liekang Zeng,Kaiwei Liu,Siyang Jiang,Wenrui Lu,Hongkai Chen,Xiaofan Jiang,Guoliang Xing,Zhenyu Yan

Main category: cs.AI

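TL;DR: ContextAgent is a proactive agent driven by multi-dimensional sensory context; it uses data from wearables to strengthen the proactive service capabilities of LLM agents and outperforms baselines in experiments.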

Details

Motivation: Existing proactive agents rely on enclosed environments or rule-based notifications, which leads to poor understanding of user intent and limited functionality. Method: ContextAgent extracts multi-dimensional context from wearable sensors, combines it with persona context from historical data to predict whether proactive service is needed, and automatically calls the necessary tools. Result: On ContextAgentBench, ContextAgent achieves up to 8.5% and 6.0% higher accuracy than baselines in proactive prediction and tool calling, respectively. Conclusion: ContextAgent offers guidance for building more advanced, human-centric proactive AI assistants. Abstract: Recent advances in Large Language Models (LLMs) have propelled intelligent agents from reactive responses to proactive support. While promising, existing proactive agents either rely exclusively on observations from enclosed environments (e.g., desktop UIs) with direct LLM inference or employ rule-based proactive notifications, leading to suboptimal user intent understanding and limited functionality for proactive service. In this paper, we introduce ContextAgent, the first context-aware proactive agent that incorporates extensive sensory contexts to enhance the proactive capabilities of LLM agents. ContextAgent first extracts multi-dimensional contexts from massive sensory perceptions on wearables (e.g., video and audio) to understand user intentions. ContextAgent then leverages the sensory contexts and the persona contexts from historical data to predict the necessity for proactive services. When proactive assistance is needed, ContextAgent further automatically calls the necessary tools to assist users unobtrusively. To evaluate this new task, we curate ContextAgentBench, the first benchmark for evaluating context-aware proactive LLM agents, covering 1,000 samples across nine daily scenarios and twenty tools. Experiments on ContextAgentBench show that ContextAgent outperforms baselines by achieving up to 8.5% and 6.0% higher accuracy in proactive predictions and tool calling, respectively. We hope our research can inspire the development of more advanced, human-centric, proactive AI assistants.

cs.CY [Back]

[258] AdAEM: An Adaptively and Automated Extensible Measurement of LLMs' Value Difference

Shitong Duan,Xiaoyuan Yi,Peng Zhang,Dongkuan Xu,Jing Yao,Tun Lu,Ning Gu,Xing Xie

Main category: cs.CY

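TL;DR: AdAEM is a self-extensible evaluation framework for revealing the value inclinations of large language models (LLMs); it generates test questions dynamically to address the low informativeness of existing datasets.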

Details

Motivation: Existing value measurement datasets contain outdated, contaminated, or generic questions, so they fail to capture value differences between LLMs and yield saturated, uninformative results. Method: AdAEM automatically generates and extends test questions through in-context optimization, maximizing an information-theoretic objective to extract recent or culturally controversial topics. Result: The authors generate 12,310 questions grounded in Schwartz Value Theory, benchmark the values of 16 LLMs, and validate the method's effectiveness and discriminative power. Conclusion: AdAEM can track the evolution of LLM values over time and provides a stronger foundation for value research. Abstract: Assessing Large Language Models (LLMs)' underlying value differences enables comprehensive comparison of their misalignment, cultural adaptability, and biases. Nevertheless, current value measurement datasets face the informativeness challenge: with often outdated, contaminated, or generic test questions, they can only capture the shared value orientations among different LLMs, leading to saturated and thus uninformative results. To address this problem, we introduce AdAEM, a novel, self-extensible assessment framework for revealing LLMs' inclinations. Distinct from previous static benchmarks, AdAEM can automatically and adaptively generate and extend its test questions. This is achieved by probing the internal value boundaries of a diverse set of LLMs developed across cultures and time periods in an in-context optimization manner. The optimization process theoretically maximizes an information-theoretic objective to extract the latest or culturally controversial topics, providing more distinguishable and informative insights about models' value differences. In this way, AdAEM is able to co-evolve with the development of LLMs, consistently tracking their value dynamics. Using AdAEM, we generate 12,310 questions grounded in Schwartz Value Theory, conduct an extensive analysis to manifest our method's validity and effectiveness, and benchmark the values of 16 LLMs, laying the groundwork for better value research.

q-bio.GN [Back]

[259] OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking

Heng Yang,Jack Cole,Yuan Li,Renzhi Chen,Geyong Min,Ke Li

Main category: q-bio.GN

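TL;DR: OmniGenBench is a modular benchmarking platform that unifies the data, model, benchmarking, and interpretability layers of genomic foundation models (GFMs) to address reproducibility challenges.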

Details

Motivation: Genomic foundation models (GFMs) have transformative potential for decoding the genome, but they need rigorous and reproducible evaluation. Method: OmniGenBench provides standardized one-command evaluation, automated pipelines, and community-extensible features, integrating over 31 open-source models across five benchmark suites. Result: The platform addresses key challenges including data transparency, model interoperability, benchmark fragmentation, and black-box interpretability. Conclusion: OmniGenBench aims to serve as infrastructure for reproducible genomic AI research, accelerating trustworthy discovery and collaborative innovation in the era of genome-scale modeling. Abstract: The code of nature, embedded in DNA and RNA genomes since the origin of life, holds immense potential to impact both humans and ecosystems through genome modeling. Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome. As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation. We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs. OmniGenBench enables standardized, one-command evaluation of any GFM across five benchmark suites, with seamless integration of over 31 open-source models. Through automated pipelines and community-extensible features, the platform addresses critical reproducibility challenges, including data transparency, model interoperability, benchmark fragmentation, and black-box interpretability. OmniGenBench aims to serve as foundational infrastructure for reproducible genomic AI research, accelerating trustworthy discovery and collaborative innovation in the era of genome-scale modeling.

stat.ML [Back]

[260] From stability of Langevin diffusion to convergence of proximal MCMC for non-log-concave sampling

Marien Renaud,Valentin De Bortoli,Arthur Leclaire,Nicolas Papadakis

Main category: stat.ML

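TL;DR: The paper studies sampling from distributions with non-convex potentials, proves the stability of discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity, and gives the first convergence proof for PSGLA with non-convex potentials.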

Details

Motivation: Sampling under non-convex and non-smooth potentials, particularly in imaging inverse problems. Method: PSGLA, which combines the forward-backward optimization algorithm with a ULA step, analyzed through a stability result for ULA and properties of the Moreau envelope. Result: On synthetic data and imaging inverse problems, PSGLA converges faster than the Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties. Conclusion: PSGLA is stable and convergent for non-convex potentials and is suited to complex sampling problems. Abstract: We consider the problem of sampling distributions stemming from non-convex potentials with the Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity. In many contexts, e.g. imaging inverse problems, potentials are non-convex and non-smooth. Proximal Stochastic Gradient Langevin Algorithm (PSGLA) is a popular algorithm to handle such potentials. It combines the forward-backward optimization algorithm with a ULA step. Our main stability result combined with properties of the Moreau envelope allows us to derive the first proof of convergence of the PSGLA for non-convex potentials. We empirically validate our methodology on synthetic data and in the context of imaging inverse problems. In particular, we observe that PSGLA exhibits faster convergence rates than Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties.
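
The abstract above describes PSGLA as a forward-backward scheme wrapped around a ULA step. One common way to write such an iteration is x_{k+1} = prox_{γg}(x_k − γ∇f(x_k) + sqrt(2γ) z_k) with z_k standard Gaussian; the one-dimensional sketch below follows that form. The double-well potential, the L1 term, and all step sizes are illustrative assumptions, not the paper's experimental setup.

```python
import math
import random


def soft_threshold(y, t):
    """Proximal operator of t * |.| (soft-thresholding)."""
    return math.copysign(max(abs(y) - t, 0.0), y)


def ula_step(x, grad_f, step, rng):
    """One Unadjusted Langevin step: gradient descent plus Gaussian noise."""
    return x - step * grad_f(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)


def psgla_step(x, grad_f, prox_g, step, rng):
    """Forward (Langevin) step on f followed by a backward (proximal) step on g."""
    return prox_g(ula_step(x, grad_f, step, rng), step)


def grad_double_well(x):
    # f(x) = (x^2 - 1)^2 is non-convex near 0 but convex far from the origin.
    return 4.0 * x * (x * x - 1.0)


def prox_l1(y, step, lam=0.5):
    # g(x) = lam * |x|, so prox_{step * g} is soft-thresholding at step * lam.
    return soft_threshold(y, step * lam)


rng = random.Random(0)
x, samples = 0.0, []
for _ in range(500):
    x = psgla_step(x, grad_double_well, prox_l1, step=0.01, rng=rng)
    samples.append(x)
```

The smooth part is handled by its gradient and the non-smooth part only through its proximal operator, which is why the scheme suits potentials with a non-differentiable term.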

cs.SE [Back]

[261] Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents

Karina Zainullina,Alexander Golubev,Maria Trofimova,Sergei Polezhaev,Ibragim Badertdinov,Daria Litvintseva,Simon Karasik,Filipp Fisin,Sergei Skvortsov,Maksim Nekrashevich,Anton Shevtsov,Boris Yangel

Main category: cs.SE

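TL;DR: The paper examines how two search strategies, 1-step lookahead and trajectory selection, improve LLM performance in non-serializable RL environments such as Docker containers, with substantial gains on the SWE-bench Verified benchmark.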

Details

Motivation: LLMs perform well on multi-step tasks but inconsistently across repeated attempts, and conventional search methods such as MCTS are hard to apply in non-serializable environments. Method: Two complementary search strategies, 1-step lookahead and trajectory selection, both guided by a learned action-value function estimator. Result: On SWE-bench Verified, these methods double the average success rate of a fine-tuned Qwen-72B model to 40.8%, and they yield similar improvements with GPT-4o. Conclusion: The search strategies substantially improve LLM performance in non-serializable environments and transfer to more advanced models. Abstract: Large language models (LLMs) have recently achieved remarkable results in complex multi-step tasks, such as mathematical reasoning and agentic software engineering. However, they often struggle to maintain consistent performance across multiple solution attempts. One effective approach to narrow the gap between average-case and best-case performance is guided test-time search, which explores multiple solution paths to identify the most promising one. Unfortunately, effective search techniques (e.g. MCTS) are often unsuitable for non-serializable RL environments, such as Docker containers, where intermediate environment states cannot be easily saved and restored. We investigate two complementary search strategies applicable to such environments: 1-step lookahead and trajectory selection, both guided by a learned action-value function estimator. On the SWE-bench Verified benchmark, a key testbed for agentic software engineering, we find these methods to double the average success rate of a fine-tuned Qwen-72B model, achieving 40.8%, the new state-of-the-art for open-weights models. Additionally, we show that these techniques are transferable to more advanced closed models, yielding similar improvements with GPT-4o.
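
Both strategies named in the abstract above avoid saving or restoring environment state, which is what makes them usable in Docker-like environments. The sketch below shows the two selection rules under stated assumptions: `propose_actions` and `q_value` are hypothetical stand-ins for the policy and the learned action-value estimator, and scoring a trajectory by its final step is a choice made here, not a detail confirmed by the abstract.

```python
def one_step_lookahead(state, propose_actions, q_value, num_candidates=4):
    """Sample candidate actions, score each with Q(state, action), return the best.

    Only the chosen action is executed afterwards, so the environment never has
    to be snapshotted or rolled back.
    """
    candidates = propose_actions(state, num_candidates)
    return max(candidates, key=lambda action: q_value(state, action))


def select_trajectory(trajectories, q_value):
    """Pick one of several independently run episodes.

    Each trajectory is a list of (state, action) pairs. Scoring a trajectory by
    the estimate at its final step is an assumption of this sketch.
    """
    return max(trajectories, key=lambda traj: q_value(*traj[-1]))


# Toy stand-ins for the policy and the value estimator.
SCORES = {"edit file": 0.4, "run tests": 0.7, "submit": 0.1}


def propose_actions(state, n):
    return ["edit file", "run tests", "submit"][:n]


def q_value(state, action):
    return SCORES[action]


best_action = one_step_lookahead("repo checked out", propose_actions, q_value)
best_run = select_trajectory(
    [[("s0", "edit file"), ("s1", "submit")],
     [("s0", "edit file"), ("s1", "run tests")]],
    q_value,
)
```

Lookahead spends extra policy samples at each step, while trajectory selection spends extra full episodes; the two can be combined.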

[262] Advancing Software Quality: A Standards-Focused Review of LLM-Based Assurance Techniques

Avinash Patil

Main category: cs.SE

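TL;DR: The paper examines how large language models (LLMs) can enhance software quality assurance (SQA) processes while keeping them compliant with international standards.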

Details

Motivation: Software quality assurance is critical for reliable, secure, and efficient software products, and LLMs offer new opportunities to automate SQA tasks. Method: Surveys LLM applications in SQA (such as requirement validation and defect detection) and maps them to standards frameworks such as ISO/IEC. Result: Empirical case studies and open-source initiatives demonstrate the practical viability of LLM-based methods, while challenges such as data privacy and model bias remain. Conclusion: Future directions include adaptive learning, privacy-focused deployments, and multimodal analysis to advance AI-driven SQA. Abstract: Software Quality Assurance (SQA) is critical for delivering reliable, secure, and efficient software products. The Software Quality Assurance Process aims to provide assurance that work products and processes comply with predefined provisions and plans. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance existing SQA processes by automating tasks like requirement analysis, code review, test generation, and compliance checks. Simultaneously, established standards such as ISO/IEC 12207, ISO/IEC 25010, ISO/IEC 5055, ISO 9001/ISO/IEC 90003, CMMI, and TMM provide structured frameworks for ensuring robust quality practices. This paper surveys the intersection of LLM-based SQA methods and these recognized standards, highlighting how AI-driven solutions can augment traditional approaches while maintaining compliance and process maturity. We first review the foundational software quality standards and the technical fundamentals of LLMs in software engineering. Next, we explore various LLM-based SQA applications, including requirement validation, defect detection, test generation, and documentation maintenance. We then map these applications to key software quality frameworks, illustrating how LLMs can address specific requirements and metrics within each standard. Empirical case studies and open-source initiatives demonstrate the practical viability of these methods. At the same time, discussions on challenges (e.g., data privacy, model bias, explainability) underscore the need for deliberate governance and auditing. Finally, we propose future directions encompassing adaptive learning, privacy-focused deployments, multimodal analysis, and evolving standards for AI-driven software quality.

eess.AS [Back]

[263] Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses

Christopher Ick,Gordon Wichern,Yoshiki Masuyama,François Germain,Jonathan Le Roux

Main category: eess.AS

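TL;DR: The paper proposes a direction-aware neural field (DANF) that captures directional information more explicitly through Ambisonic-format RIRs, together with a direction-aware loss.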

Details

Motivation: Existing neural-field methods consider only monaural omnidirectional or binaural listeners and do not precisely capture the directional characteristics of a real sound field. Method: Proposes DANF, which uses Ambisonic-format RIRs, introduces a direction-aware loss, and explores adaptation to new rooms, including low-rank adaptation. Result: DANF captures the directional characteristics of the sound field more precisely and can adapt to new rooms. Conclusion: DANF brings more precise direction awareness to sound field modeling and shows adaptability to new environments. Abstract: The characteristics of a sound field are intrinsically linked to the geometric and spatial properties of the environment surrounding a sound source and a listener. The physics of sound propagation is captured in a time-domain signal known as a room impulse response (RIR). Prior work using neural fields (NFs) has allowed learning spatially-continuous representations of RIRs from finite RIR measurements. However, previous NF-based methods have focused on monaural omnidirectional or at most binaural listeners, which does not precisely capture the directional characteristics of a real sound field at a single point. We propose a direction-aware neural field (DANF) that more explicitly incorporates the directional information by using Ambisonic-format RIRs. While DANF inherently captures spatial relations between sources and listeners, we further propose a direction-aware loss. In addition, we investigate the ability of DANF to adapt to new rooms in various ways including low-rank adaptation.

[264] Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach

Umberto Cappellazzo,Minsu Kim,Stavros Petridis,Daniele Falavigna,Alessio Brutti

Main category: eess.AS

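TL;DR: Llama-SMoP is an efficient multimodal large language model whose Sparse Mixture of Projectors (SMoP) module increases model capacity without raising inference cost, making it suitable for resource-constrained settings.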

Details

Motivation: Large language models (LLMs) used for audio-visual speech recognition (AVSR) are computationally expensive, which hinders deployment in resource-constrained settings. Method: Proposes Llama-SMoP, which uses sparsely-gated mixture-of-experts (MoE) projectors, and explores three SMoP configurations, of which DEDR (Disjoint-Experts, Disjoint-Routers) performs best. Result: Llama-SMoP DEDR performs strongly on ASR, VSR, and AVSR tasks, and ablations confirm its effectiveness in expert activation, scalability, and noise robustness. Conclusion: The SMoP module yields an efficient multimodal LLM and a practical option for resource-constrained settings. Abstract: Audio-Visual Speech Recognition (AVSR) enhances robustness in noisy environments by integrating visual cues. While recent advances integrate Large Language Models (LLMs) into AVSR, their high computational cost hinders deployment in resource-constrained settings. To address this, we propose Llama-SMoP, an efficient Multimodal LLM that employs a Sparse Mixture of Projectors (SMoP) module to scale model capacity without increasing inference costs. By incorporating sparsely-gated mixture-of-experts (MoE) projectors, Llama-SMoP enables the use of smaller LLMs while maintaining strong performance. We explore three SMoP configurations and show that Llama-SMoP DEDR (Disjoint-Experts, Disjoint-Routers), which uses modality-specific routers and experts, achieves superior performance on ASR, VSR, and AVSR tasks. Ablation studies confirm its effectiveness in expert activation, scalability, and noise robustness.
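
A sparsely-gated mixture of projectors, as named in the abstract above, routes each token to only a few projector experts, so adding experts grows capacity without growing per-token compute. The sketch below shows generic top-k gating in plain Python; the function name, the linear router, and the toy projectors are assumptions for illustration and do not reproduce the Llama-SMoP architecture.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_mixture_of_projectors(token, router_weights, projectors, top_k=1):
    """Route one token to its top_k projectors and mix their outputs.

    router_weights: one weight row per expert (a linear router, assumed here)
    projectors:     list of callables mapping a token vector to an output vector
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalize over chosen experts
    output = None
    for gate, index in zip(gates, chosen):
        scaled = [gate * value for value in projectors[index](token)]
        output = scaled if output is None else [o + s for o, s in zip(output, scaled)]
    return output, chosen, gates


token = [1.0, 0.0]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
projectors = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-v for v in x],
    lambda x: [0.0 for _ in x],
]
out_top1, chosen_top1, gates_top1 = sparse_mixture_of_projectors(token, router_weights, projectors, top_k=1)
out_top2, chosen_top2, gates_top2 = sparse_mixture_of_projectors(token, router_weights, projectors, top_k=2)
```

A configuration with modality-specific routers and experts, such as the DEDR variant described in the abstract, would call this routine with a separate `router_weights` and `projectors` pair for audio tokens and for video tokens.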

[265] Pairwise Evaluation of Accent Similarity in Speech Synthesis

Jinzuomu Zhong,Suyuan Liu,Dan Wells,Korin Richmond

Main category: eess.AS

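TL;DR: The paper improves both subjective and objective methods for evaluating accent similarity in speech synthesis, refining the XAB listening test and introducing pronunciation-related metrics.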

Details

Motivation: Despite growing interest in generating high-fidelity accents, the evaluation of accent similarity in speech synthesis remains underexplored. Method: Subjectively, the XAB listening test is refined by adding transcriptions and having listeners highlight perceived accent differences; objectively, pronunciation-related metrics based on vowel formant distances and phonetic posteriorgrams are used. Result: Experiments show that these methods are effective and reveal the limitations of common metrics such as Word Error Rate for underrepresented accents. Conclusion: The improved evaluation methods measure accent similarity more accurately and open new directions for speech synthesis research. Abstract: Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments reveal that these metrics, alongside accent similarity, speaker similarity, and Mel Cepstral Distortion, can be used. Moreover, our findings underscore significant limitations of common metrics like Word Error Rate in assessing underrepresented accents.

[266] Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach

Yi-Cheng Lin,Huang-Cheng Chou,Hung-yi Lee

Main category: eess.AS

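TL;DR: The paper proposes an Implicit Demography Inference (IDI) module that uses pseudo-labeling and unsupervised learning to reduce bias in speech emotion recognition (SER), substantially improving fairness metrics.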

Details

Motivation: Existing methods rely on explicit demographic labels, which are hard to obtain because of privacy concerns, so a fairness method that does not need explicit labels is required. Method: An IDI module that combines pseudo-labels from a pre-trained model with unsupervised k-means clustering to mitigate bias in SER. Result: Pseudo-labeling IDI improves fairness metrics by over 33% with less than a 3% drop in SER accuracy; unsupervised IDI improves fairness metrics by more than 26% with less than a 4% drop in SER performance. Conclusion: The IDI module mitigates race and age disparities even when explicit demographic information is unavailable. Abstract: While subgroup disparities and performance bias are increasingly studied in computational research, fairness in categorical Speech Emotion Recognition (SER) remains underexplored. Existing methods often rely on explicit demographic labels, which are difficult to obtain due to privacy concerns. To address this limitation, we introduce an Implicit Demography Inference (IDI) module that leverages pseudo-labeling from a pre-trained model and unsupervised learning using k-means clustering to mitigate bias in SER. Our experiments show that pseudo-labeling IDI reduces subgroup disparities, improving fairness metrics by over 33% with less than a 3% decrease in SER accuracy. Also, the unsupervised IDI yields more than a 26% improvement in fairness metrics with a drop of less than 4% in SER performance. Further analyses reveal that the unsupervised IDI consistently mitigates race and age disparities, demonstrating its potential in scenarios where explicit demographic information is unavailable.
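
The unsupervised variant described above replaces explicit demographic labels with k-means cluster assignments. A minimal sketch of that idea follows, under clear assumptions: the 2-D points stand in for speaker embeddings, the deterministic initialization is for reproducibility only, and `accuracy_gap` is an illustrative disparity measure rather than the fairness metric used in the paper.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means; the first k points seed the centers (sketch-only choice)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def accuracy_gap(labels, correct):
    """Largest difference in accuracy between any two inferred groups."""
    rates = [sum(c for label, c in zip(labels, correct) if label == g) / labels.count(g)
             for g in sorted(set(labels))]
    return max(rates) - min(rates)


# Hypothetical speaker embeddings and per-utterance SER correctness flags.
embeddings = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
correct = [1, 1, 0, 1]
pseudo_groups, _ = kmeans(embeddings, k=2)
gap = accuracy_gap(pseudo_groups, correct)
```

The inferred groups can then drive any group-aware training objective in place of the missing demographic labels.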

[267] Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples

Chun-Yi Kuan,Hung-yi Lee

Main category: eess.AS

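TL;DR: The paper proposes LISTEN, a contrastive-like training method that improves the ability of audio-aware large language models (ALLMs) to distinguish present from absent sounds without modifying LLM parameters, while being efficient in data and computation.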

Details

Motivation: Existing audio-aware large language models (ALLMs) tend to hallucinate non-existent sound events when processing audio inputs, which reduces their reliability. Method: LISTEN trains a lightweight adapter in a contrastive-like manner on synthesized data, without modifying LLM parameters. Result: LISTEN mitigates hallucinations while maintaining performance on audio question answering and reasoning benchmarks, and it is more efficient in both data and computation. Conclusion: LISTEN is an efficient method that requires no changes to LLM parameters and improves the reliability of ALLMs. Abstract: Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. To address this, we propose LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds using synthesized data from the backbone LLM. Unlike prior approaches, our method requires no modification to LLM parameters and efficiently integrates audio representations via a lightweight adapter. Experiments show that LISTEN effectively mitigates hallucinations while maintaining impressive performance on existing audio question and reasoning benchmarks. At the same time, it is more efficient in both data and computation.

eess.IV [Back]

[268] GANCompress: GAN-Enhanced Neural Image Compression with Binary Spherical Quantization

Karthik Sivakoti

Main category: eess.IV

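TL;DR: GANCompress combines Binary Spherical Quantization (BSQ) with Generative Adversarial Networks (GANs) in a neural compression framework that improves compression efficiency and visual quality.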

Details

Motivation: The rapid growth of visual data calls for efficient compression, and existing neural compression methods struggle with perceptual quality, computational efficiency, and adaptability to diverse content. Method: A transformer-based autoencoder with an enhanced BSQ bottleneck, followed by a GAN architecture with frequency-domain attention and color consistency optimization. Result: GANCompress reduces file sizes by up to 100x with minimal visual distortion and outperforms traditional codecs such as H.264 by 12-15% in perceptual metrics. Conclusion: GANCompress is a notable advance in neural compression with promising applications in real-time visual communication systems. Abstract: The exponential growth of visual data in digital communications has intensified the need for efficient compression techniques that balance rate-distortion performance with computational feasibility. While recent neural compression approaches have shown promise, they still struggle with fundamental challenges: preserving perceptual quality at high compression ratios, computational efficiency, and adaptability to diverse visual content. This paper introduces GANCompress, a novel neural compression framework that synergistically combines Binary Spherical Quantization (BSQ) with Generative Adversarial Networks (GANs) to address these challenges. Our approach employs a transformer-based autoencoder with an enhanced BSQ bottleneck that projects latent representations onto a hypersphere, enabling efficient discretization with bounded quantization error. This is followed by a specialized GAN architecture incorporating frequency-domain attention and color consistency optimization. Experimental results demonstrate that GANCompress achieves substantial improvement in compression efficiency -- reducing file sizes by up to 100x with minimal visual distortion. Our method outperforms traditional codecs like H.264 by 12-15% in perceptual metrics while maintaining comparable PSNR/SSIM values, with 2.4x faster encoding and decoding speeds. On standard benchmarks including ImageNet-1k and COCO2017, GANCompress sets a new state-of-the-art, reducing FID from 0.72 to 0.41 (43% improvement) compared to previous methods while maintaining higher throughput. This work presents a significant advancement in neural compression technology with promising applications for real-time visual communication systems.
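
The bottleneck described above projects latents onto a hypersphere and then discretizes them. A minimal sketch of that quantizer, as the technique is commonly formulated, normalizes the latent to unit length and maps each coordinate to ±1/sqrt(d), which keeps the code on the sphere and bounds the squared error by 2 − 2/sqrt(d). The function names are hypothetical and the "enhanced" parts of the paper's bottleneck are not modeled.

```python
import math


def bsq_quantize(latent):
    """Project a non-zero latent onto the unit sphere and binarize each coordinate."""
    norm = math.sqrt(sum(x * x for x in latent))
    unit = [x / norm for x in latent]
    scale = 1.0 / math.sqrt(len(latent))
    code = [scale if x >= 0 else -scale for x in unit]  # still on the unit sphere
    bits = [1 if x >= 0 else 0 for x in unit]            # one bit per dimension
    return unit, code, bits


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


unit, code, bits = bsq_quantize([3.0, -4.0])
error = squared_error(unit, code)
# For unit vectors, error = 2 - 2 * sum(|u_i|) / sqrt(d) <= 2 - 2 / sqrt(d).
bound = 2.0 - 2.0 / math.sqrt(len(unit))
```

Only `bits` needs to be stored or transmitted; `code` can be rebuilt from the bits and the dimension alone.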

[269] Learning Wavelet-Sparse FDK for 3D Cone-Beam CT Reconstruction

Yipeng Sun,Linda-Sophie Schneider,Chengze Ye,Mingxuan Gu,Siyuan Mei,Siming Bayer,Andreas Maier

Main category: eess.IV

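TL;DR: Proposes an enhanced FDK algorithm that combines a neural network with sparse representations, reducing the parameter count while keeping computational efficiency and improving CBCT reconstruction quality.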

Details

Motivation: The FDK algorithm is efficient for CBCT reconstruction but susceptible to noise and artifacts, while deep learning methods improve quality at the cost of higher computational complexity and lower interpretability. Method: Trainable elements are selectively introduced into the cosine weighting and filtering stages of FDK, and wavelet transforms are used to create sparse representations that reduce the parameter count. Result: Parameters are reduced by 93.75%, inference cost matches classical FDK, convergence is faster, and robustness to noise improves. Conclusion: The method keeps the advantages of FDK while improving performance and suits clinical environments with limited compute. Abstract: Cone-Beam Computed Tomography (CBCT) is essential in medical imaging, and the Feldkamp-Davis-Kress (FDK) algorithm is a popular choice for reconstruction due to its efficiency. However, FDK is susceptible to noise and artifacts. While recent deep learning methods offer improved image quality, they often increase computational complexity and lack the interpretability of traditional methods. In this paper, we introduce an enhanced FDK-based neural network that maintains the classical algorithm's interpretability by selectively integrating trainable elements into the cosine weighting and filtering stages. Recognizing the challenge of a large parameter space inherent in 3D CBCT data, we leverage wavelet transformations to create sparse representations of the cosine weights and filters. This strategic sparsification reduces the parameter count by 93.75% without compromising performance, accelerates convergence, and importantly, maintains the inference computational cost equivalent to the classical FDK algorithm. Our method not only ensures volumetric consistency and boosts robustness to noise, but is also designed for straightforward integration into existing CT reconstruction pipelines. This presents a pragmatic enhancement that can benefit clinical applications, particularly in environments with computational limitations.
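
The 93.75% figure in the abstract above equals 1 − 1/16. One decomposition that yields exactly this ratio, which is an assumption made here and not something the abstract states, is to keep only the approximation band of a two-level 2D Haar transform; that band is proportional to the mean of each 4 x 4 tile. The sketch applies this to the standard FDK cosine weights D / sqrt(D² + u² + v²); grid size, spacing, and distance are arbitrary illustrative values.

```python
import math


def cosine_weights(n, spacing, source_to_detector):
    """FDK pre-weighting D / sqrt(D^2 + u^2 + v^2) on an n x n detector grid."""
    coords = [(i - (n - 1) / 2) * spacing for i in range(n)]
    d = source_to_detector
    return [[d / math.sqrt(d * d + u * u + v * v) for u in coords] for v in coords]


def block_means(grid, block=4):
    """Mean of each block x block tile (the coarse Haar band up to a scale factor)."""
    n = len(grid)
    return [[sum(grid[r + i][c + j] for i in range(block) for j in range(block)) / block ** 2
             for c in range(0, n, block)]
            for r in range(0, n, block)]


def upsample(coarse, block=4):
    """Rebuild the full grid by repeating each coarse value over its tile."""
    size = len(coarse) * block
    return [[coarse[r // block][c // block] for c in range(size)] for r in range(size)]


weights = cosine_weights(n=16, spacing=10.0, source_to_detector=1000.0)
coarse = block_means(weights)
approx = upsample(coarse)

full_params = 16 * 16
kept_params = len(coarse) * len(coarse[0])
reduction = 1 - kept_params / full_params
max_error = max(abs(w - a)
                for w_row, a_row in zip(weights, approx)
                for w, a in zip(w_row, a_row))
```

Because the cosine weights vary smoothly across the detector, the coarse representation reproduces them closely, which is the intuition behind learning them in a sparse wavelet domain.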

[270] Exploring Image Quality Assessment from a New Perspective: Pupil Size

Yixuan Gao,Xiongkuo Min,Guangtao Zhai

Main category: eess.IV

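TL;DR: The study uses pupil size to examine how the image quality assessment (IQA) task affects human cognitive processes, and analyzes the relationship between pupil size and image quality.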

Details

Motivation: To explore how the IQA task affects cognitive processes, provide a theoretical basis for objective IQA methods, and develop a new subjective IQA method. Method: A subjective experiment with two tasks, free observation and IQA, followed by an analysis of the differences in pupil size between them. Result: The IQA task appears to activate the visual attention mechanism, and changes in pupil size are closely related to image quality. Conclusion: The study offers theoretical support for objective IQA methods and proposes a new subjective IQA method. Abstract: This paper explores how the image quality assessment (IQA) task affects the cognitive processes of people from the perspective of pupil size and studies the relationship between pupil size and image quality. Specifically, we first invited subjects to participate in a subjective experiment, which includes two tasks: free observation and IQA. In the free observation task, subjects did not need to perform any action, and they only needed to observe images as they usually do with an album. In the IQA task, subjects were required to score images according to their overall impression of image quality. Then, by analyzing the difference in pupil size between the two tasks, we find that people may activate the visual attention mechanism when evaluating image quality. Meanwhile, we also find that the change in pupil size is closely related to image quality in the IQA task. For future research on IQA, this research can not only provide a theoretical basis for the objective IQA method and promote the development of more effective objective IQA methods, but also provide a new subjective IQA method for collecting the authentic subjective impression of image quality.

[271] Automated Quality Evaluation of Cervical Cytopathology Whole Slide Images Based on Content Analysis

Lanlan Kang,Jian Wang,Jian QIn,Yiqin Liang,Yongjun He

Main category: eess.IV

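TL;DR: Proposes an AI-based automated method for assessing the quality of cervical cytopathology whole slide images, with clear gains in assessment speed and consistency.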

Details

Motivation: Traditional manual evaluation is subjective, costly, slow, and unreliable, so an automated quality assessment system is needed. Method: Combines TBS standards, AI algorithms, and the characteristics of clinical data; object detection, classification, and segmentation models quantify quality metrics, and an XGBoost model combines them into an overall score. Result: Experiments on 100 WSIs show significant advantages in speed and consistency. Conclusion: The method provides an efficient and reliable quality assessment tool for cervical cancer screening. Abstract: The ThinPrep Cytologic Test (TCT) is the most widely used method for cervical cancer screening, and the sample quality directly impacts the accuracy of the diagnosis. Traditional manual evaluation methods rely on observation by pathologists under microscopes. These methods exhibit high subjectivity, high cost, long duration, and low reliability. With the development of computer-aided diagnosis (CAD), an automated quality assessment system that performs at the level of a professional pathologist is necessary. To address this need, we propose a fully automated quality assessment method for Cervical Cytopathology Whole Slide Images (WSIs) based on The Bethesda System (TBS) diagnostic standards, artificial intelligence algorithms, and the characteristics of clinical data. The method analyzes the context of WSIs to quantify the quality evaluation metrics emphasized by TBS, such as staining quality, cell counts and cell mass proportion, through multiple models including object detection, classification and segmentation. Subsequently, the XGBoost model is used to mine the attention paid by pathologists to different quality evaluation metrics when evaluating samples, thereby obtaining a comprehensive WSI sample score calculation model. Experimental results on 100 WSIs demonstrate that the proposed evaluation method has significant advantages in terms of speed and consistency.

[272] XDementNET: An Explainable Attention Based Deep Convolutional Network to Detect Alzheimer Progression from MRI data

Soyabul Islam Lincoln,Mirza Mohd Shahriar Maswood

Main category: eess.IV

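TL;DR: The paper proposes a deep learning architecture that combines multiresidual blocks, spatial attention, and several other attention mechanisms for diagnosing Alzheimer's disease, and reports very high classification accuracy on multiple public datasets.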

Details

Motivation: Demand for precise diagnosis of Alzheimer's disease is growing, and AI can improve diagnostic efficiency and reduce healthcare costs. Method: A deep convolutional neural network with multiresidual blocks, spatial attention blocks, grouped query attention, and multi-head attention classifies MRI images. Result: Very high classification accuracy on multiple datasets, up to 100% for binary classification and 99.66% for 4-class classification. Conclusion: The proposed method performs strongly on Alzheimer's disease diagnosis, offers good explainability, and outperforms existing approaches. Abstract: A common neurodegenerative disease, Alzheimer's disease requires a precise diagnosis and efficient treatment, particularly in light of escalating healthcare expenses and the expanding use of artificial intelligence in medical diagnostics. Many recent studies show that the combination of brain Magnetic Resonance Imaging (MRI) and deep neural networks has achieved promising results for diagnosing AD. Using deep convolutional neural networks, this paper introduces a novel deep learning architecture that incorporates multiresidual blocks, specialized spatial attention blocks, grouped query attention, and multi-head attention. The study assessed the model's performance on four publicly accessible datasets and concentrated on identifying binary and multiclass issues across various categories. This paper also takes into account the explainability of AD's progression and compares state-of-the-art methods, namely Gradient Class Activation Mapping (GradCAM), Score-CAM, Faster Score-CAM, and XGRADCAM. Our methodology consistently outperforms current approaches, achieving 99.66% accuracy in 4-class classification, 99.63% in 3-class classification, and 100% in binary classification using Kaggle datasets. For Open Access Series of Imaging Studies (OASIS) datasets the accuracies are 99.92%, 99.90%, and 99.95% respectively. The Alzheimer's Disease Neuroimaging Initiative-1 (ADNI-1) dataset was used for experiments in three planes (axial, sagittal, and coronal) and a combination of all planes. The study achieved accuracies of 99.08% for axial, 99.85% for sagittal, 99.5% for coronal, and 99.17% for all planes, and 97.79% and 8.60% respectively for ADNI-2. The network's ability to retrieve important information from MRI images is demonstrated by its excellent accuracy in categorizing AD stages.

[273] Bronchovascular Tree-Guided Weakly Supervised Learning Method for Pulmonary Segment Segmentation

Ruijie Zhao,Zuopeng Tan,Xiao Xue,Longfei Zhao,Bing Li,Zicheng Liao,Ying Ming,Jiaru Wang,Ran Xiao,Sirong Piao,Rui Zhao,Qiqi Xu,Wei Song

Main category: eess.IV

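TL;DR: Proposes Anatomy-Hierarchy Supervised Learning (AHSL), a weakly supervised method for pulmonary segment segmentation that combines the clinical anatomical definition with bronchovascular tree information and improves boundary smoothness through a two-stage segmentation strategy and a consistency loss.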

Details

Motivation: Pulmonary segment segmentation is crucial for cancer localization and surgical planning, but pixel-level annotation is time-consuming and segment boundaries are hard to distinguish, so a weakly supervised method is needed. Method: Supervision comes from segment-level and lobe-level labels, combined with a two-stage segmentation strategy that uses bronchovascular prior information, a consistency loss, and a boundary-smoothness evaluation metric. Result: Experiments on a private dataset show the method is effective under both visual inspection and evaluation metrics. Conclusion: Through anatomy-hierarchy supervision and the two-stage strategy, AHSL achieves high-quality pulmonary segment segmentation. Abstract: Pulmonary segment segmentation is crucial for cancer localization and surgical planning. However, the pixel-wise annotation of pulmonary segments is laborious, as the boundaries between segments are indistinguishable in medical images. To this end, we propose a weakly supervised learning (WSL) method, termed Anatomy-Hierarchy Supervised Learning (AHSL), which consults the precise clinical anatomical definition of pulmonary segments to perform pulmonary segment segmentation. Since pulmonary segments reside within the lobes and are determined by the bronchovascular tree, i.e., artery, airway and vein, the design of the loss function is founded on two principles. First, segment-level labels are utilized to directly supervise the output of the pulmonary segments, ensuring that they accurately encompass the appropriate bronchovascular tree. Second, lobe-level supervision indirectly oversees the pulmonary segments, ensuring their inclusion within the corresponding lobe. Besides, we introduce a two-stage segmentation strategy that incorporates bronchovascular prior information. Furthermore, a consistency loss is proposed to enhance the smoothness of segment boundaries, along with an evaluation metric designed to measure the smoothness of pulmonary segment boundaries. Visual inspection and evaluation metrics from experiments conducted on a private dataset demonstrate the effectiveness of our method.

[274] End-to-end Cortical Surface Reconstruction from Clinical Magnetic Resonance Images

Jesper Duemose Nielsen,Karthik Gopinath,Andrew Hoopes,Adrian Dalca,Colin Magdamo,Steven Arnold,Sudeshna Das,Axel Thielscher,Juan Eugenio Iglesias,Oula Puonti

Main category: eess.IV

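TL;DR: Proposes a general neural-network-based method for estimating cortical surfaces from MRI scans of any contrast and resolution, substantially improving accuracy and efficiency on clinical scans.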

Details

Motivation: Existing tools only work on MRI scans of a specific resolution and contrast, which limits their use on clinical data. Method: A neural network is trained on synthetic domain-randomized data and estimates the white matter and gray matter surfaces by deforming a template mesh. Result: Compared with the existing method, cortical thickness error drops by about 50%, and aging-related cortical thinning patterns are recovered better. Conclusion: The method provides fast and accurate surface reconstruction for clinical scans, extending studies to larger samples and to clinical populations. Abstract: Surface-based cortical analysis is valuable for a variety of neuroimaging tasks, such as spatial normalization, parcellation, and gray matter (GM) thickness estimation. However, most tools for estimating cortical surfaces work exclusively on scans with at least 1 mm isotropic resolution and are tuned to a specific magnetic resonance (MR) contrast, often T1-weighted (T1w). This precludes application using most clinical MR scans, which are very heterogeneous in terms of contrast and resolution. Here, we use synthetic domain-randomized data to train the first neural network for explicit estimation of cortical surfaces from scans of any contrast and resolution, without retraining. Our method deforms a template mesh to the white matter (WM) surface, which guarantees topological correctness. This mesh is further deformed to estimate the GM surface. We compare our method to recon-all-clinical (RAC), an implicit surface reconstruction method which is currently the only other tool capable of processing heterogeneous clinical MR scans, on ADNI and a large clinical dataset (n=1,332). We show an approximately 50% reduction in cortical thickness error (from 0.50 to 0.24 mm) with respect to RAC and better recovery of the aging-related cortical thinning patterns detected by FreeSurfer on high-resolution T1w scans. Our method enables fast and accurate surface reconstruction of clinical scans, allowing studies (1) with sample sizes far beyond what is feasible in a research setting, and (2) of clinical populations that are difficult to enroll in research studies. The code is publicly available at https://github.com/simnibs/brainnet.

[275] NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI

Cosmin I. Bercea,Jun Li,Philipp Raffler,Evamaria O. Riedel,Lena Schmitzer,Angela Kurz,Felix Bitzer,Paula Roßmüller,Julian Canisius,Mirjam L. Beyrle,Che Liu,Wenjia Bai,Bernhard Kainz,Julia A. Schnabel,Benedikt Wiestler

Main category: eess.IV

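TL;DR: NOVA is an extreme stress-test benchmark for evaluating how models handle rare pathologies and heterogeneous data in the real world, comprising about 900 brain MRI scans and 281 rare pathologies.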

Details

Motivation: Existing benchmarks often overlook rare or novel anomalies when evaluating models, so models underperform in clinical use. Method: NOVA provides about 900 brain MRI scans covering 281 rare pathologies and heterogeneous acquisition protocols, with clinical narratives and expert annotations, to evaluate anomaly localization, visual captioning, and diagnostic reasoning. Result: Leading vision-language models show substantial performance drops on NOVA, exposing their limitations on unknown anomalies. Conclusion: NOVA offers a rigorous testbed for real-world generalization and for advancing models that detect, localize, and reason about unknown anomalies. Abstract: In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs to ensure the system remains robust as ever-emerging, previously unknown categories appear and must be addressed without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life evaluation-only benchmark of ~900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.

[276] Neural Video Compression with Context Modulation

Chuanbo Tang,Zhuoyuan Li,Yifan Bian,Li Li,Dong Liu

Main category: eess.IV

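TL;DR: The paper proposes a two-step method that modulates the temporal context to improve neural video codec (NVC) performance, substantially reducing bitrate.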

Details

Motivation: The temporal context propagation mechanism of existing NVCs does not fully exploit reference information, which limits further gains in compression performance. Method: Flow orientation mines the inter-correlation between the reference frame and the prediction frame to generate additional temporal context, and a context compensation mechanism modulates the propagated temporal context. Result: Experiments show an average bitrate reduction of 22.7% over H.266/VVC and 10.1% over DCVC-FM. Conclusion: The proposed method effectively improves NVC performance and offers a more efficient solution for video compression. Abstract: Efficient video coding is highly dependent on exploiting the temporal redundancy, which is usually achieved by extracting and leveraging the temporal context in the emerging conditional coding-based neural video codec (NVC). Although the latest NVC has achieved remarkable progress in improving the compression performance, the inherent temporal context propagation mechanism lacks the ability to sufficiently leverage the reference information, limiting further improvement. In this paper, we address the limitation by modulating the temporal context with the reference frame in two steps. Specifically, we first propose the flow orientation to mine the inter-correlation between the reference frame and prediction frame for generating the additional oriented temporal context. Moreover, we introduce the context compensation to leverage the oriented context to modulate the propagated temporal context generated from the propagated reference feature. Through the synergy mechanism and decoupling loss supervision, the irrelevant propagated information can be effectively eliminated to ensure better context modeling. Experimental results demonstrate that our codec achieves on average 22.7% bitrate reduction over the advanced traditional video codec H.266/VVC, and offers an average 10.1% bitrate saving over the previous state-of-the-art NVC DCVC-FM. The code is available at https://github.com/Austin4USTC/DCMVC.

[277] Neural Inverse Scattering with Score-based Regularization

Yuan Gao,Wenhan Guo,Yu Sun

Main category: eess.IV

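TL;DR: This paper proposes a regularized neural field method that incorporates a denoising score function to solve the inverse scattering problem, improving imaging quality.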

Details

Motivation: Inverse scattering is a fundamental challenge in imaging applications; it requires jointly estimating the image and the scattering field, so an effective image prior is needed to regularize the inference. Method: A regularized neural field (NF) approach that integrates an image structural prior based on the denoising score function. Result: Experiments on three high-contrast simulated objects show better imaging quality than the existing NF approach based on total variation. Conclusion: By combining neural fields with a denoising score function, the proposed method significantly improves imaging for inverse scattering. Abstract: Inverse scattering is a fundamental challenge in many imaging applications, ranging from microscopy to remote sensing. Solving this problem often requires jointly estimating two unknowns -- the image and the scattering field inside the object -- necessitating an effective image prior to regularize the inference. In this paper, we propose a regularized neural field (NF) approach which integrates the denoising score function used in score-based generative models. The neural field formulation offers convenient flexibility for performing joint estimation, while the denoising score function imposes the rich structural prior of images. Our results on three high-contrast simulated objects show that the proposed approach yields a better imaging quality compared to the state-of-the-art NF approach, where regularization is based on total variation.

[278] Automated Fetal Biometry Assessment with Deep Ensembles using Sparse-Sampling of 2D Intrapartum Ultrasound Images

Jayroop Ramesh,Valentin Bacher,Mark C. Eid,Hoda Kalabizadeh,Christian Rupprecht,Ana IL Namburete,Pak-Hei Yeung,Madeleine K. Wyburd,Nicola K. Dinsdale

Main category: eess.IV

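TL;DR: The paper proposes an automated fetal biometry pipeline that reduces observer variability in ultrasound measurement and improves reliability, achieving high-accuracy results through classification, segmentation, and computation of key parameters.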

Details

Motivation: Reduce observer variability in intrapartum ultrasound monitoring and improve measurement reliability, in order to better understand the causes of labour arrest and guide clinical risk stratification. Method: A three-stage pipeline: standard plane classification, segmentation of the fetal head and pubic symphysis, and computation of angle and distance parameters, with sparse sampling and ensemble learning for robustness. Result: The pipeline performs well on an unseen dataset; metrics such as ACC, F1, and AUC indicate high accuracy, with small measurement errors. Conclusion: The automated pipeline can improve the reliability of labour monitoring and supports the development of clinical risk stratification tools. Abstract: The International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) advocates Intrapartum Ultrasound (US) Imaging to monitor labour progression through changes in fetal head position. Two reliable ultrasound-derived parameters that are used to predict outcomes of instrumental vaginal delivery are the angle of progression (AoP) and head-symphysis distance (HSD). In this work, as part of the Intrapartum Ultrasounds Grand Challenge (IUGC) 2024, we propose an automated fetal biometry measurement pipeline to reduce intra- and inter-observer variability and improve measurement reliability. Our pipeline consists of three key tasks: (i) classification of standard planes (SP) from US videos, (ii) segmentation of fetal head and pubic symphysis from the detected SPs, and (iii) computation of the AoP and HSD from the segmented regions. We perform sparse sampling to mitigate class imbalances and reduce spurious correlations in task (i), and utilize ensemble-based deep learning methods for tasks (i) and (ii) to enhance generalizability under different US acquisition settings. Finally, to promote robustness in task (iii) with respect to the structural fidelity of measurements, we retain the largest connected components and apply ellipse fitting to the segmentations. Our solution achieved ACC: 0.9452, F1: 0.9225, AUC: 0.983, MCC: 0.8361, DSC: 0.918, HD: 19.73, ASD: 5.71, $\Delta_{AoP}$: 8.90 and $\Delta_{HSD}$: 14.35 across an unseen hold-out set of 4 patients and 224 US frames. The results from the proposed automated pipeline can improve the understanding of labour arrest causes and guide the development of clinical risk stratification tools for efficient and effective prenatal care.
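
The post-processing step in task (iii), retaining the largest connected component of a segmentation, might look roughly like the sketch below. It is a pure-Python illustration on a tiny binary mask, assuming 4-connectivity; the function name `largest_component` is hypothetical, and the ellipse-fitting step from the abstract is not reproduced.

```python
from collections import deque


def largest_component(mask):
    """Return a mask that keeps only the largest 4-connected foreground region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # Breadth-first search over one connected region.
                seen[r][c] = True
                queue, region = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(region) > len(best):
                    best = region
    keep = set(best)
    return [[1 if (r, c) in keep else 0 for c in range(cols)] for r in range(rows)]


# Toy mask: a 2x2 blob plus two isolated spurious pixels.
mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
cleaned = largest_component(mask)
```

Dropping small spurious regions before measuring angles or distances keeps stray pixels from distorting the fitted geometry, which is presumably why the authors apply it before ellipse fitting.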

cs.SD

[279] MatchDance: Collaborative Mamba-Transformer Architecture Matching for High-Quality 3D Dance Synthesis

Kaixing Yang,Xulong Tang,Yuxuan Hu,Jiahao Yang,Hongyan Liu,Qinnan Zhang,Jun He,Zhaoxin Fan

Main category: cs.SD

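TL;DR: MatchDance is a novel music-to-dance generation framework that improves choreographic consistency through a two-stage design and performs strongly in experiments.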

Details

Motivation: Existing methods are limited in choreographic consistency, which MatchDance aims to address. Method: A two-stage design: (1) a Kinematic-Dynamic-based Quantization Stage (KDQS) and (2) a Hybrid Music-to-Dance Generation Stage (HMDGS). Result: State-of-the-art performance on the FineDance dataset. Conclusion: With its framework and evaluation method, MatchDance substantially improves the consistency of music-to-dance generation. Abstract: Music-to-dance generation represents a challenging yet pivotal task at the intersection of choreography, virtual reality, and creative content generation. Despite its significance, existing methods face substantial limitations in achieving choreographic consistency. To address the challenge, we propose MatchDance, a novel framework for music-to-dance generation that constructs a latent representation to enhance choreographic consistency. MatchDance employs a two-stage design: (1) a Kinematic-Dynamic-based Quantization Stage (KDQS), which encodes dance motions into a latent representation by Finite Scalar Quantization (FSQ) with kinematic-dynamic constraints and reconstructs them with high fidelity, and (2) a Hybrid Music-to-Dance Generation Stage (HMDGS), which uses a Mamba-Transformer hybrid architecture to map music into the latent representation, followed by the KDQS decoder to generate 3D dance motions. Additionally, a music-dance retrieval framework and comprehensive metrics are introduced for evaluation. Extensive experiments on the FineDance dataset demonstrate state-of-the-art performance. Code will be released upon acceptance.

[280] Forensic deepfake audio detection using segmental speech features

Tianle Yang,Chengzhe Sun,Siwei Lyu,Phil Rose

Main category: cs.SD

TL;DR: Uses acoustic features of segmental speech sounds to detect deepfake audio; the approach is effective and highly interpretable.

Details

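Motivation: Deepfake audio detection needs more reliable and interpretable features; segmental features are a natural choice because of their close link to human articulatory processes. Method: The study analyzes acoustic features of segmental speech sounds and evaluates their effectiveness for deepfake audio detection. Result: Certain segmental features commonly used in forensic voice comparison are effective at detecting deepfake audio, while some global features add little. Conclusion: The study offers a new perspective on audio deepfake detection, underscores the importance of segmental features, and notes that detection methods need to be adapted to the application. Abstract: This study explores the potential of using acoustic features of segmental speech sounds to detect deepfake audio. These features are highly interpretable because of their close relationship with human articulatory processes and are expected to be more difficult for deepfake models to replicate. The results demonstrate that certain segmental features commonly used in forensic voice comparison are effective in identifying deepfakes, whereas some global features provide little value. These findings underscore the need to approach audio deepfake detection differently for forensic voice comparison and offer a new perspective on leveraging segmental features for this purpose.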

[281] FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Yutong Liu,Ziyue Zhang,Ban Ma-bao,Yuqing Cai,Yongbin Yu,Renzeng Duojie,Xiangxiang Wang,Fan Gao,Cheng Huang,Nyima Tashi

Main category: cs.SD

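TL;DR: FMSD-TTS is a few-shot, multi-speaker, multi-dialect text-to-speech framework for Tibetan dialects whose novel modules and network design significantly improve dialectal expressiveness and speaker similarity.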

Details

Motivation: Tibetan is a low-resource language that lacks parallel speech corpora across dialects, which limits progress in speech modeling. Method: A speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) capture fine-grained acoustic and linguistic differences across dialects while preserving speaker identity. Result: The system significantly outperforms baselines in dialectal expressiveness and speaker similarity, and a speech-to-speech dialect conversion task validates the quality and utility of the synthesized speech. Conclusion: FMSD-TTS provides an effective solution for Tibetan multi-dialect speech synthesis, and the synthetic corpus and evaluation toolkit are released publicly. Abstract: Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.

[282] PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs

Sho Inoue,Shai Wang,Haizhou Li

Main category: cs.SD

TL;DR: Proposes a pipeline that processes raw audio data into an annotated dialogue dataset for building personality-aware dialogue systems.

Details

Motivation: Because speech data lacks personality annotations, personality-aware dialogue systems remain underexplored. Method: An ASR system extracts transcripts and timestamps, conversation-level annotations are generated, and large language models predict conversational personality. Result: The system aligns with human judgments better than existing approaches. Conclusion: The approach offers an effective solution for personality-aware dialogue systems. Abstract: Despite significant progress in neural spoken dialog systems, personality-aware conversation agents -- capable of adapting behavior based on personalities -- remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.

[283] S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models

Yuanbo Fang,Haoze Sun,Jun Liu,Tao Zhang,Zenan Zhou,Weipeng Chen,Xiaofen Xing,Xiangmin Xu

Main category: cs.SD

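TL;DR: S2SBench is a benchmark for quantifying performance degradation in speech large language models (LLMs); its diagnostic datasets and evaluation protocol analyze how speech input affects reasoning and generation.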

Details

Motivation: Speech LLMs can suffer degraded reasoning and generation when processing audio input (termed intelligence degradation), and the phenomenon needs systematic evaluation. Method: Proposes S2SBench, with diagnostic datasets for sentence continuation and commonsense reasoning and a pairwise evaluation protocol based on perplexity differences. Result: S2SBench is applied to analyze the training process of Baichuan-Audio, which demonstrates the benchmark's effectiveness. Conclusion: S2SBench provides a tool for quantifying degradation in speech LLMs and supports further research and improvement. Abstract: End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at https://github.com/undobug/S2SBench.
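
The pairwise protocol described in the abstract can be illustrated with a minimal sketch. It assumes each test item supplies per-token natural-log probabilities for a plausible and an implausible continuation, and counts a pair as correct when the plausible one has lower perplexity. The function names, the toy numbers, and the accuracy-gap summary are illustrative assumptions, not the benchmark's released code.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pairwise_accuracy(pairs):
    """Fraction of (plausible, implausible) pairs where the plausible
    continuation receives the lower perplexity."""
    correct = sum(1 for good, bad in pairs if perplexity(good) < perplexity(bad))
    return correct / len(pairs)


# Hypothetical log-probabilities from one model under text and audio input.
text_pairs = [([-0.2, -0.3], [-1.5, -2.0]), ([-0.4, -0.1], [-0.9, -1.1])]
audio_pairs = [([-0.8, -0.9], [-1.0, -1.2]), ([-1.4, -1.3], [-0.9, -1.1])]

text_acc = pairwise_accuracy(text_pairs)
audio_acc = pairwise_accuracy(audio_pairs)
degradation = text_acc - audio_acc  # larger gap = more degradation under audio
```

Comparing the same pairs under both modalities isolates the effect of the input channel, since the model and the candidate continuations are held fixed.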

[284] PAST: Phonetic-Acoustic Speech Tokenizer

Nadav Har-Tuv,Or Tal,Yossi Adi

Main category: cs.SD

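TL;DR: PAST is a novel end-to-end framework that jointly models phonetic information and signal reconstruction without relying on external pretrained models.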

Details

Motivation: Remove the dependence on pretrained self-supervised models by integrating domain knowledge directly into tokenization through auxiliary tasks. Method: PAST uses supervised phonetic data and introduces a streamable, causal variant that supports real-time speech applications. Result: PAST outperforms existing baseline tokenizers on common evaluation metrics, including phonetic representation and speech reconstruction, and performs strongly as the representation for speech language models. Conclusion: PAST is an effective speech representation suited to spoken language generation, and the full implementation is released to foster research. Abstract: We present PAST, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. Unlike previous approaches that rely on pretrained self-supervised models, PAST employs supervised phonetic data, directly integrating domain knowledge into the tokenization process via auxiliary tasks. Additionally, we introduce a streamable, causal variant of PAST, enabling real-time speech applications. Results demonstrate that PAST surpasses existing evaluated baseline tokenizers across common evaluation metrics, including phonetic representation and speech reconstruction. Notably, PAST also achieves superior performance when serving as a speech representation for speech language models, further highlighting its effectiveness as a foundation for spoken language generation. To foster further research, we release the full implementation. For code, model checkpoints, and samples see: https://pages.cs.huji.ac.il/adiyoss-lab/PAST

cs.LG

[285] Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws

Xiyuan Wei,Ming Lin,Fanjiang Ye,Fengguang Song,Liangliang Cao,My T. Thai,Tianbao Yang

Main category: cs.LG

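TL;DR: This paper formalizes a learning paradigm called model steering, in which a reference model guides the training of a target model, and proposes DRRho risk minimization, a theoretical framework based on Distributionally Robust Optimization (DRO). The method outperforms training without a reference model both in theory and in experiments, and the paper gives the first theoretical analysis of this kind.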

Details

Motivation: Existing model steering methods lack theoretical support, which leads to sub-optimal performance. This paper aims to improve the understanding and practice of model steering through a theoretical framework. Method: Proposes the DRRho risk minimization framework, grounded in DRO, and, building on the connection between contrastive learning and DRO, designs DRRho-CLIP. Result: Theoretical analysis shows the approach improves generalization and data efficiency. Experiments confirm that it outperforms training without a reference model and exhibits a better scaling law. Conclusion: The DRRho framework gives model steering a theoretical foundation, and DRRho-CLIP performs well in practice, beating existing heuristic approaches. Abstract: This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting, named model steering. While ad-hoc methods have been used in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading to sub-optimal performance. In this work, we propose a theory-driven framework for model steering called DRRho risk minimization, which is rooted in Distributionally Robust Optimization (DRO). Through a generalization analysis, we provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. To the best of our knowledge, this is the first time such theoretical insights are provided for the new learning paradigm, which significantly enhance our understanding and practice of model steering. Building on these insights and the connection between contrastive learning and DRO, we introduce a novel method for Contrastive Language-Image Pretraining (CLIP) with a reference model, termed DRRho-CLIP. Extensive experiments validate the theoretical insights, reveal a superior scaling law compared to CLIP without a reference model, and demonstrate its strength over existing heuristic approaches.

[286] End-to-end fully-binarized network design: from Generic Learned Thermometer to Block Pruning

Thien Nguyen,William Guicquero

Main category: cs.LG

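TL;DR: This paper proposes GLT, an encoding technique that improves the input data representation for BNNs through nonlinear quantization thresholds, and combines it with lightweight grouped convolutions and block pruning to obtain lightweight fully binarized models.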

Details

Motivation: Existing BNN research focuses mainly on model weights and activations and neglects the raw input data. This paper aims to improve BNN accuracy and efficiency by improving the input representation and the model structure. Method: Proposes the GLT encoding technique, which learns nonlinear quantization thresholds for the input representation, and combines lightweight grouped convolutions, block pruning, and knowledge distillation to reduce model size and computational complexity. Result: On the STL-10 and VWW datasets, GLT yields significant accuracy gains; combined with block pruning, it produces lightweight (under 1Mb) fully binarized models with limited accuracy loss. Conclusion: GLT gives BNNs the flexibility of global tone mapping and, combined with block pruning, suits in-sensor always-on inference. Abstract: Existing work on Binary Neural Networks (BNNs) mainly focuses on the model's weights and activations while neglecting the raw input data. This article introduces Generic Learned Thermometer (GLT), an encoding technique to improve input data representation for BNNs, relying on learning nonlinear quantization thresholds. This technique consists of multiple data binarizations which can advantageously replace a conventional Analog to Digital Conversion (ADC) that uses natural binary coding. Additionally, we jointly propose a compact topology with light-weight grouped convolutions trained with block pruning and Knowledge Distillation (KD), aiming to further reduce the model size as well as its computational complexity. We show that GLT brings versatility to the BNN by intrinsically performing global tone mapping, enabling significant accuracy gains in practice (demonstrated by simulations on the STL-10 and VWW datasets). Moreover, when combining GLT with our proposed block-pruning technique, we successfully achieve lightweight (under 1Mb), fully-binarized models with limited accuracy degradation while being suitable for in-sensor always-on inference use cases.
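
Thermometer encoding, the idea GLT builds on, turns one scalar input into several binary inputs by comparing it against a sorted list of thresholds. The sketch below uses fixed, hand-picked nonuniform thresholds purely for illustration; in GLT the thresholds are learned, and the values and the function name here are assumptions rather than anything taken from the paper.

```python
def thermometer_encode(x, thresholds):
    """Return one bit per threshold: 1 where x exceeds the threshold, else 0."""
    return [1 if x > t else 0 for t in thresholds]


# Hypothetical nonuniform thresholds for a pixel intensity in [0, 1].
# They are denser at low intensities, so dark regions get finer resolution.
thresholds = [0.02, 0.06, 0.15, 0.35, 0.70]

dark = thermometer_encode(0.05, thresholds)
mid = thermometer_encode(0.40, thresholds)
bright = thermometer_encode(0.90, thresholds)
```

Because the bit pattern depends only on where the thresholds sit, moving them reshapes the intensity response, which is consistent with the abstract's description of GLT as performing a form of global tone mapping.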

[287] Open Set Domain Adaptation with Vision-language models via Gradient-aware Separation

Haoyang Chen

Main category: cs.LG

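TL;DR: The paper proposes a CLIP-based method for open-set domain adaptation that achieves cross-domain alignment and open-set separation through dynamic text prompts and a gradient analysis module.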

Details

Motivation: Open-Set Domain Adaptation (OSDA) faces the dual challenge of aligning known-class distributions and identifying unknown classes in the target domain; existing methods fail to fully exploit semantic relationships between modalities and are prone to error accumulation in unknown-sample detection. Method: 1) Learnable text prompts conditioned on domain discrepancy metrics dynamically adapt CLIP's text encoder for cross-domain semantic alignment; 2) a gradient analysis module quantifies domain shift and uses the L2-norm of gradients to separate known from unknown samples. Result: Experiments on Office-Home show the method clearly outperforms the CLIP baseline and the standard baseline; ablations confirm the key role of the gradient norm. Conclusion: By combining dynamic prompts with gradient analysis, the method addresses OSDA effectively and offers a new direction for cross-domain adaptation. Abstract: Open-Set Domain Adaptation (OSDA) confronts the dual challenge of aligning known-class distributions across domains while identifying target-domain-specific unknown categories. Current approaches often fail to leverage semantic relationships between modalities and struggle with error accumulation in unknown sample detection. We propose to harness Contrastive Language-Image Pretraining (CLIP) to address these limitations through two key innovations: 1) Prompt-driven cross-domain alignment: Learnable textual prompts conditioned on domain discrepancy metrics dynamically adapt CLIP's text encoder, enabling semantic consistency between source and target domains without explicit unknown-class supervision. 2) Gradient-aware open-set separation: A gradient analysis module quantifies domain shift by comparing the L2-norm of gradients from the learned prompts, where known/unknown samples exhibit statistically distinct gradient behaviors. Evaluations on Office-Home show that our method consistently outperforms the CLIP baseline and the standard baseline. Ablation studies confirm the gradient norm's critical role.

[288] Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression

Xiaohui Wang,Peng Ye,Chenyu Huang,Shenghe Zheng,Bo Zhang,Wanli Ouyang,Tao Chen

Main category: cs.LG

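TL;DR: UltraDelta is a data-free delta compression method that achieves ultra-high compression and strong performance through layer-wise sparsity allocation, distribution-aware compression, and global rescaling.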

Details

Motivation: Existing delta compression methods cannot maintain both performance and storage efficiency at high compression. Method: Three techniques: variance-based mixed sparsity allocation, distribution-aware compression, and trace-norm-guided rescaling. Result: Achieves ultra-high compression (up to 800x) across a range of models while outperforming existing methods. Conclusion: UltraDelta maintains performance under ultra-high compression and outperforms existing methods. Abstract: With the rise of the fine-tuned--pretrained paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned on LLaMA-2 7B and 13B) with up to 133x, (b) general NLP models (RoBERTa-base, T5-base) with up to 800x, (c) vision models (ViT-B/32, ViT-L/14) with up to 400x, and (d) multi-modal models (BEiT-3) with 40x compression ratio, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression.
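
The basic idea of delta compression, storing the pretrained weights once plus a sparse difference per fine-tuned model, can be sketched as follows. This is generic magnitude pruning on a flat weight list, not UltraDelta's pipeline: the variance-based allocation, distribution-aware compression, and trace-norm rescaling from the abstract are not reproduced, and the function names and `keep_ratio` parameter are hypothetical.

```python
def compress_delta(pretrained, finetuned, keep_ratio):
    """Keep only the largest-magnitude entries of (finetuned - pretrained).

    Returns a sparse mapping {index: delta value}.
    """
    delta = [f - p for f, p in zip(finetuned, pretrained)]
    k = max(1, round(len(delta) * keep_ratio))
    top = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)[:k]
    return {i: delta[i] for i in top}


def reconstruct(pretrained, sparse_delta):
    """Rebuild approximate fine-tuned weights from the shared base and the sparse delta."""
    return [p + sparse_delta.get(i, 0.0) for i, p in enumerate(pretrained)]


pretrained = [1.0, 2.0, 3.0, 4.0]
finetuned = [1.5, 2.0, 2.0, 4.25]

sparse = compress_delta(pretrained, finetuned, keep_ratio=0.5)
approx = reconstruct(pretrained, sparse)
```

With many fine-tuned variants of one base model, only the sparse mappings are stored per variant, which is where the storage savings come from.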

[289] FedCTTA: A Collaborative Approach to Continual Test-Time Adaptation in Federated Learning

Rakibul Hasan Rajib,Md Akil Raihan Iftee,Mir Sazzat Hossain,A. K. M. Mahbubur Rahman,Sajib Mistry,M Ashraful Amin,Amin Ahsan Ali

Main category: cs.LG

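TL;DR: Proposes FedCTTA, a federated learning framework that addresses the computational overhead, privacy risks, and scalability limits of existing test-time adaptation methods.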

Details

Motivation: Federated Learning (FL) suits privacy-sensitive applications, but model performance often degrades under distribution shift between training and deployment. Test-Time Adaptation (TTA) can address this, yet existing methods suffer from high computational overhead, privacy risks, and poor scalability. Method: FedCTTA uses similarity-aware aggregation based on model output distributions over randomly generated noise samples, which avoids direct feature sharing, and minimizes entropy at each client for continual adaptation. It needs no server-side training and has a constant memory footprint. Result: Experiments show FedCTTA outperforms existing methods under both temporal and spatial heterogeneity. Conclusion: FedCTTA is an efficient, privacy-preserving, and scalable adaptation framework for federated learning that improves model performance in dynamic environments. Abstract: Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, making it ideal for privacy-sensitive applications. However, FL models often suffer performance degradation due to distribution shifts between training and deployment. Test-Time Adaptation (TTA) offers a promising solution by allowing models to adapt using only test samples. However, existing TTA methods in FL face challenges such as computational overhead, privacy risks from feature sharing, and scalability concerns due to memory constraints. To address these limitations, we propose Federated Continual Test-Time Adaptation (FedCTTA), a privacy-preserving and computationally efficient framework for federated adaptation. Unlike prior methods that rely on sharing local feature statistics, FedCTTA avoids direct feature exchange by leveraging similarity-aware aggregation based on model output distributions over randomly generated noise samples. This approach ensures adaptive knowledge sharing while preserving data privacy. Furthermore, FedCTTA minimizes the entropy at each client for continual adaptation, enhancing the model's confidence in evolving target distributions. Our method eliminates the need for server-side training during adaptation and maintains a constant memory footprint, making it scalable even as the number of clients or training rounds increases. Extensive experiments show that FedCTTA surpasses existing methods across diverse temporal and spatial heterogeneity scenarios.
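
One way to picture similarity-aware aggregation over noise-sample outputs is the sketch below. Each client's model is summarized by its output distribution on a shared batch of random noise, and a client's aggregated parameters weight other clients by how similar those outputs are. The cosine similarity, the softmax weighting, the `temperature` parameter, and all names are assumptions made for illustration; the abstract does not specify these choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def aggregation_weights(outputs, client, temperature=1.0):
    """Softmax over similarities between one client's outputs and every client's."""
    sims = [cosine(outputs[client], o) / temperature for o in outputs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def personalized_aggregate(params, weights):
    """Weighted average of per-client parameter vectors."""
    return [sum(w * p[j] for w, p in zip(weights, params))
            for j in range(len(params[0]))]


# Hypothetical flattened output distributions of three client models
# on the same noise batch, and toy two-parameter models.
outputs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
params = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]

w0 = aggregation_weights(outputs, client=0)
agg0 = personalized_aggregate(params, w0)
```

Only model outputs on synthetic noise are compared, so no client features or raw data need to leave the device, which matches the privacy argument in the abstract.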

[290] Improving Compositional Generation with Diffusion Models Using Lift Scores

Chenning Yu,Sicun Gao

Main category: cs.LG

TL;DR: Proposes a new resampling criterion based on lift scores to improve compositional generation in diffusion models.

Details

Motivation: Lift scores assess whether a generated sample satisfies each individual condition; composing those results determines whether the composed prompt is satisfied, with no additional training or external modules. Method: Lift scores are approximated efficiently using only the original diffusion model, and an optimized variant with lower computational overhead is developed. Result: Experiments show lift scores significantly improve condition alignment on 2D synthetic data, CLEVR position tasks, and text-to-image synthesis. Conclusion: The method lowers computational overhead while remaining effective, and the code is open source. Abstract: We introduce a novel resampling criterion using lift scores, for improving compositional generation in diffusion models. By leveraging the lift scores, we evaluate whether generated samples align with each single condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original diffusion model, requiring no additional training or external modules. We develop an optimized variant that achieves relatively lower computational overhead during inference while maintaining effectiveness. Through extensive experiments, we demonstrate that lift scores significantly improve the condition alignment for compositional generation across 2D synthetic data, CLEVR position tasks, and text-to-image synthesis. Our code is available at http://github.com/rainorangelemon/complift.
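
The resampling logic can be illustrated with a small sketch that treats lift in log space as log p(x | c) - log p(x), i.e. how much more likely a sample becomes once the condition is given. The sketch assumes log-likelihood estimates are already available as plain numbers; the paper instead approximates the score with the diffusion model itself, which is not reproduced here. The zero threshold, the all-conditions (AND) composition rule, and the names are assumptions.

```python
def lift_score(logp_conditional, logp_unconditional):
    """Lift in log space: log p(x | c) - log p(x)."""
    return logp_conditional - logp_unconditional


def accept(cond_logps, logp_unconditional):
    """Accept a sample for a conjunction of conditions only if
    every individual condition has positive lift."""
    return all(lift_score(lp, logp_unconditional) > 0 for lp in cond_logps)


# Hypothetical log-likelihood estimates for two samples under two conditions.
samples = {
    "a": {"uncond": -10.0, "cond": [-8.0, -9.5]},   # both conditions raise likelihood
    "b": {"uncond": -10.0, "cond": [-7.0, -11.0]},  # second condition lowers it
}
kept = [name for name, s in samples.items() if accept(s["cond"], s["uncond"])]
```

Checking each condition separately before composing is what lets the criterion reject a sample that satisfies one part of a prompt strongly but misses another.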

[291] FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer

Matthew Raffel,Lizhong Chen

Main category: cs.LG

TL;DR: FlashKAT substantially speeds up KAT training by optimizing memory access and gradient accumulation, resolving its performance bottleneck.

Details

Motivation: KAN and KAT have attracted attention for their expressiveness and interpretability, but slow training limits their use. This study seeks the root cause of KAT's slowness and proposes a fix. Method: Experiments characterize KAT's bottleneck and trace it to memory stalls and inefficient gradient accumulation; FlashKAT uses a restructured kernel that reduces slow-memory accesses and gradient accumulation. Result: FlashKAT trains 86.5x faster than the existing KAT while reducing rounding errors in the coefficient gradients. Conclusion: FlashKAT resolves KAT's performance bottleneck and makes larger-scale tasks feasible. Abstract: The Kolmogorov-Arnold Network (KAN) has been gaining popularity as an alternative to the multi-layer perceptron (MLP) with its increased expressiveness and interpretability. However, the KAN can be orders of magnitude slower due to its increased computational cost and training instability, limiting its applicability to larger-scale tasks. Recently, the Kolmogorov-Arnold Transformer (KAT) has been proposed, which can achieve FLOPs similar to the traditional Transformer with MLPs by leveraging Group-Rational KAN (GR-KAN). Unfortunately, despite the comparable FLOPs, our characterizations reveal that the KAT is still 123x slower in training speeds, indicating that there are other performance bottlenecks beyond FLOPs. In this paper, we conduct a series of experiments to understand the root cause of the slowdown in KAT. We uncover that the slowdown can be isolated to memory stalls and, more specifically, in the backward pass of GR-KAN caused by inefficient gradient accumulation. To address this memory bottleneck, we propose FlashKAT, which builds on our restructured kernel that minimizes gradient accumulation with atomic adds and accesses to slow memory. Evaluations demonstrate that FlashKAT can achieve a training speedup of 86.5x compared with the state-of-the-art KAT, while reducing rounding errors in the coefficient gradients. Our code is available at https://github.com/OSU-STARLAB/FlashKAT.

[292] Adversarial Training from Mean Field Perspective

Soichiro Kumano,Hiroshi Kera,Toshihiko Yamasaki

Main category: cs.LG

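TL;DR: This paper presents the first theoretical analysis of adversarial training in random deep neural networks, introducing a new framework based on mean field theory and deriving tight upper bounds on the adversarial loss under different norms.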

Details

Motivation: Although adversarial training is effective against adversarial examples, its training dynamics are not well understood. This paper aims to fill that theoretical gap. Method: A new framework based on mean field theory is introduced to analyze adversarial training in random deep neural networks and to derive tight upper bounds on the adversarial loss. Result: Networks without shortcuts are proven to be generally not adversarially trainable, and adversarial training is shown to reduce network capacity; network width alleviates both issues. Conclusion: The framework offers a new understanding of adversarial training and shows how network structure and dimensions affect it. Abstract: Although adversarial training is known to be effective against adversarial examples, training dynamics are not well understood. In this study, we present the first theoretical analysis of adversarial training in random deep neural networks without any assumptions on data distributions. We introduce a new theoretical framework based on mean field theory, which addresses the limitations of existing mean field-based approaches. Based on this framework, we derive (empirically tight) upper bounds of $\ell_q$ norm-based adversarial loss with $\ell_p$ norm-based adversarial examples for various values of $p$ and $q$. Moreover, we prove that networks without shortcuts are generally not adversarially trainable and that adversarial training reduces network capacity. We also show that network width alleviates these issues. Furthermore, we present the various impacts of the input and output dimensions on the upper bounds and time evolution of the weight variance.

[293] Adversarially Pretrained Transformers may be Universally Robust In-Context Learners

Soichiro Kumano,Hiroshi Kera,Toshihiko Yamasaki

Main category: cs.LG

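TL;DR: Adversarially pretrained transformers can serve as robust foundation models that generalize to unseen tasks without adversarial training on downstream tasks.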

Details

Motivation: Adversarial training is effective but computationally expensive; this study asks whether adversarially pretrained models can replace adversarial training on downstream tasks. Method: A theoretical analysis shows that, through in-context learning, a single adversarially pretrained transformer can robustly generalize to multiple unseen tasks without parameter updates. Result: The model focuses on robust features and resists attacks, with limitations: under certain conditions no universally robust single-layer transformer exists, and many in-context demonstrations are required. Conclusion: Adversarially pretrained transformers are promising, but accuracy must be traded off against robustness, and their limitations need further study. Abstract: Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we show that transformers adversarially pretrained on diverse tasks can serve as robust foundation models and eliminate the need for adversarial training in downstream tasks. Specifically, we theoretically demonstrate that through in-context learning, a single adversarially pretrained transformer can robustly generalize to multiple unseen tasks without any additional training, i.e., without any parameter updates. This robustness stems from the model's focus on robust features and its resistance to attacks that exploit non-predictive features. Besides these positive findings, we also identify several limitations. Under certain conditions (though unrealistic), no universally robust single-layer transformers exist. Moreover, robust transformers exhibit an accuracy-robustness trade-off and require a large number of in-context demonstrations. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner.

[294] Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models

Woody Haosheng Gan,Deqing Fu,Julian Asilis,Ollie Liu,Dani Yogatama,Vatsal Sharan,Robin Jia,Willie Neiswanger

Main category: cs.LG

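TL;DR: Text-derived steering, via sparse autoencoders, mean shift, and linear probing, improves the accuracy of multimodal large language models (MLLMs), most notably on spatial relationship and counting tasks.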

Details

Motivation: MLLMs currently lack effective steering methods; the study investigates whether steering vectors derived from text can improve their performance. Method: Steering vectors are extracted from the text-only LLM backbone using sparse autoencoders (SAEs), mean shift, and linear probing. Result: Text-derived steering consistently improves MLLM accuracy; on CV-Bench, mean shift raises spatial relationship accuracy by up to 7.3% and counting accuracy by up to 3.3%. Conclusion: Textual steering vectors are an efficient way to enhance grounding in MLLMs, requiring minimal additional data collection. Abstract: Steering methods have emerged as effective and targeted tools for guiding large language models' (LLMs) behavior without modifying their parameters. Multimodal large language models (MLLMs), however, do not currently enjoy the same suite of techniques, due in part to their recency and architectural diversity. Inspired by this gap, we investigate whether MLLMs can be steered using vectors derived from their text-only LLM backbone, via sparse autoencoders (SAEs), mean shift, and linear probing. We find that text-derived steering consistently enhances multimodal accuracy across diverse MLLM architectures and visual tasks. In particular, mean shift boosts spatial relationship accuracy on CV-Bench by up to +7.3% and counting accuracy by up to +3.3%, outperforming prompting and exhibiting strong generalization to out-of-distribution datasets. These results highlight textual steering vectors as a powerful, efficient mechanism for enhancing grounding in MLLMs with minimal additional data collection and computational overhead.
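
As a rough illustration of the mean-shift idea named in the abstract, the sketch below assumes the steering vector is the difference between the mean hidden activations on prompts that do and do not exhibit a concept, and that steering adds a scaled copy of this vector to a hidden state. The function names, the toy activations, and the `alpha` scale are hypothetical and are not taken from the paper.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_shift_vector(with_concept, without_concept):
    """Difference of means: mean(with concept) - mean(without concept)."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a single hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 3-dimensional activations (illustrative values only).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
vec = mean_shift_vector(pos, neg)
steered = steer([0.5, 0.5, 0.5], vec, alpha=2.0)
```

In a real model the activations would come from a chosen layer of the text-only backbone, and the same vector would be added at that layer during multimodal inference; which layer and which scale work best is an empirical question this sketch does not address.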

[295] KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models

Fnu Mohbat,Mohammed J Zaki

Main category: cs.LG

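TL;DR: KERL is a system that combines food knowledge graphs (KGs) with large language models (LLMs) for personalized food recommendation and recipe generation with nutritional information.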

Details

Motivation: Little prior work integrates food-related KGs with LLMs; KERL fills this gap with a more complete solution for food recommendation and nutritional analysis. Method: KERL extracts entities from a natural language question, retrieves subgraphs from the KG, and feeds them to the LLM as context to select recipes that satisfy the constraints; it then generates cooking steps and nutritional information for each recipe. Result: Experiments show that KERL significantly outperforms existing approaches, offering a coherent and complete solution for food recommendation, recipe generation, and nutritional analysis. Conclusion: KERL provides an efficient, comprehensive solution for food understanding and personalized recommendation; the code and datasets are open source. Abstract: Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food-related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generates recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities, retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe-related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at https://github.com/mohbattharani/KERL.

[296] LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades

Yanan Li,Fanxu Meng,Muhan Zhang,Shiai Zhu,Shangguang Wang,Mengwei Xu

Main category: cs.LG

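TL;DR: LoRASuite is a modular approach that reuses existing LoRA weights to adapt efficiently to newer LLM versions, avoiding retraining from scratch and saving substantial resources and time.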

Details

Motivation: As LLMs are updated frequently, LoRA weights trained on earlier versions quickly become obsolete, and retraining from scratch is costly and environmentally harmful. Method: Efficient transfer is achieved with a transfer matrix and layer/attention-head allocation, combined with small-scale fine-tuning. Result: On MiniCPM and Qwen, LoRASuite outperforms full-scale LoRA retraining, with average gains of 1.4 and 6.6 points on math tasks, respectively, while cutting memory use by 5.5 GB and compute time by 78.23%. Conclusion: LoRASuite offers an efficient, low-cost way to migrate LoRA weights across LLM updates. Abstract: As Large Language Models (LLMs) are frequently updated, LoRA weights trained on earlier versions quickly become obsolete. The conventional practice of retraining LoRA weights from scratch on the latest model is costly, time-consuming, and environmentally detrimental, particularly as the diversity of LLMs and downstream tasks expands. This motivates a critical question: "How can we efficiently leverage existing LoRA weights to adapt to newer model versions?" To address this, we propose LoRASuite, a modular approach tailored specifically to various types of LLM updates. First, we compute a transfer matrix utilizing known parameters from both old and new LLMs. Next, we allocate corresponding layers and attention heads based on centered kernel alignment and cosine similarity metrics, respectively. A subsequent small-scale, skillful fine-tuning step ensures numerical stability. Experimental evaluations demonstrate that LoRASuite consistently surpasses small-scale vanilla LoRA methods. Notably, on backbone LLMs such as MiniCPM and Qwen, LoRASuite even exceeds the performance of full-scale LoRA retraining, with average improvements of +1.4 and +6.6 points on math tasks, respectively. Additionally, LoRASuite significantly reduces memory consumption by 5.5 GB and computational time by 78.23%.
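
The layer-allocation step mentioned in the abstract relies on centered kernel alignment (CKA). The sketch below is a minimal illustration that assumes the linear variant of CKA and a simple "pick the most similar new layer" rule; the abstract does not say which CKA variant or assignment rule LoRASuite uses, and the function names and toy activation matrices are hypothetical.

```python
import math


def center_columns(rows):
    """Subtract each column's mean from a row-major matrix (list of lists)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of A^T B for matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices over the same examples."""
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    return cross_sq_norm(xc, yc) / denom


def match_layers(old_layers, new_layers):
    """For each old layer, return the index of the most similar new layer."""
    return [
        max(range(len(new_layers)), key=lambda j: linear_cka(old, new_layers[j]))
        for old in old_layers
    ]


# Toy activations: 4 examples x 2 features per layer (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
z = [[1.0, 1.0], [2.0, 0.0], [0.0, 0.0], [5.0, 1.0]]
# The "new" layers are rotated and rescaled copies, listed in swapped order.
x_rot = [[-3.0 * b, 3.0 * a] for a, b in x]
z_rot = [[-3.0 * b, 3.0 * a] for a, b in z]
mapping = match_layers([x, z], [z_rot, x_rot])
```

Linear CKA is invariant to rotations and uniform rescaling of the features, which is why the rotated copies are still recognized as the matching layers here.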

[297] Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training

Shane Bergsma,Nolan Dey,Gurpreet Gosal,Gavia Gray,Daria Soboleva,Joel Hestness

Main category: cs.LG

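TL;DR: The paper studies scaling laws for hyperparameters such as learning rate and weight decay in LLM pre-training. It finds that optimal weight decay scales linearly with batch size and proposes predicting it through a power law in the tokens-per-parameter ratio. It also examines scaling laws for the optimal and critical batch sizes and analyzes what they imply for allocating resources in practice.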

Details

Motivation: To tune hyperparameters efficiently in large-scale model training, reducing wasted compute and improving training outcomes. Method: By analyzing how hyperparameters scale with model size (N), dataset size (D), and batch size (B), the paper proposes power-law scaling rules for hyperparameters such as weight decay and validates them. Result: Optimal weight decay scales linearly with batch size, and both the optimal and critical batch sizes follow power laws in dataset size, independent of model size. Conclusion: The study provides a principled basis for hyperparameter tuning in large-scale LLM pre-training, helping to optimize training efficiency and resource allocation. Abstract: Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate $\eta$ and weight decay $\lambda$. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size $N$, dataset size $D$, and batch size $B$. Recent work suggests the AdamW timescale, $B/(\eta\lambda D)$, should remain constant across training settings, and we verify the implication that optimal $\lambda$ scales linearly with $B$, for a fixed $N$, $D$. However, as $N$, $D$ scale, we show the optimal timescale obeys a precise power law in the tokens-per-parameter ratio, $D/N$. This law thus provides a method to accurately predict $\lambda_{\text{opt}}$ in advance of large-scale training. We also study scaling laws for optimal batch size $B_{\text{opt}}$ (the $B$ enabling lowest loss at a given $N$, $D$) and critical batch size $B_{\text{crit}}$ (the $B$ beyond which further data parallelism becomes ineffective). In contrast with prior work, we find both $B_{\text{opt}}$ and $B_{\text{crit}}$ scale as power laws in $D$, independent of model size, $N$. Finally, we analyze how these findings inform the real-world selection of Pareto-optimal $N$ and $D$ under dual training time and compute objectives.
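
The linear relation between optimal weight decay and batch size follows directly from holding the AdamW timescale B/(ηλD) fixed. The sketch below only rearranges that expression; the function names and the numeric settings are hypothetical, and B and D are assumed to be measured in the same unit (for example, tokens).

```python
def adamw_timescale(batch_size, lr, weight_decay, dataset_size):
    """AdamW timescale as written in the abstract: B / (eta * lambda * D)."""
    return batch_size / (lr * weight_decay * dataset_size)


def weight_decay_for_timescale(timescale, batch_size, lr, dataset_size):
    """Solve the timescale expression for lambda, given a target timescale."""
    return batch_size / (lr * timescale * dataset_size)


# Illustrative baseline setting (made-up numbers, not from the paper).
tau = adamw_timescale(batch_size=256, lr=1e-3, weight_decay=0.1, dataset_size=1_000_000)

# Doubling the batch size at fixed lr and D, while keeping tau constant,
# doubles the weight decay.
wd_doubled = weight_decay_for_timescale(tau, batch_size=512, lr=1e-3, dataset_size=1_000_000)
```

The paper's power law for the optimal timescale as a function of D/N would supply the target `timescale` for a new model and dataset size; its fitted coefficients are not given in the abstract and are not reproduced here.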

[298] Structured Agent Distillation for Large Language Model

Jun Liu,Zhenglun Kong,Peiyan Dong,Changdi Yang,Tianqi Li,Hao Tang,Geng Yuan,Wei Niu,Wenbin Zhang,Pu Zhao,Xue Lin,Dong Huang,Yanzhi Wang

Main category: cs.LG

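TL;DR: The paper proposes Structured Agent Distillation, a framework that compresses large LLM-based agents using span-level supervision while preserving reasoning fidelity and action consistency; experiments show it outperforms baseline methods.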

Details

Motivation: LLMs used as decision-making agents are costly to run and large, which limits practical deployment. Method: Trajectories are segmented into [REASON] and [ACT] spans, and segment-specific losses align each span with the teacher's behavior. Result: On ALFWorld, HotPotQA-ReAct, and WebShop, the approach outperforms baselines, achieving substantial compression with minimal performance drop. Conclusion: Span-level alignment is important for efficient, deployable agents. Abstract: Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher's behavior. This structure-aware supervision enables compact agents to better replicate the teacher's decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.

[299] InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models

Yanggan Gu,Zhaoyi Yan,Yuanyi Wang,Yiming Zhang,Qi Zhou,Fei Wu,Hongxia Yang

Main category: cs.LG

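TL;DR: InfiFPO is an implicit model fusion method that optimizes preference alignment using probability information from multiple source models, significantly improving LLM performance.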

Details

Motivation: Existing fusion methods for the preference alignment phase use only the source models' response outputs and discard their probability information, which limits the gains. Method: InfiFPO replaces the reference model in DPO with a fused model that combines multi-source probabilities, and introduces probability clipping and max-margin fusion strategies. Result: Across 11 benchmarks, InfiFPO raises average performance from 79.95 to 83.33 with Phi-4 as the pivot model, with notable gains on mathematics, coding, and reasoning tasks. Conclusion: By retaining probability information and refining the fusion strategy, InfiFPO clearly outperforms existing methods and offers a new direction for LLM fusion. Abstract: Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA), a critical phase for enhancing LLM performance, largely unexplored. The few current fusion methods for the PA phase, like WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing complex vocabulary alignment challenges in previous works while maintaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improves its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.

[300] Safety Subspaces are Not Distinct: A Fine-Tuning Case Study

Kaustubh Ponkshe,Shaan Shah,Raghav Singhal,Praneeth Vepakomma

Main category: cs.LG

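TL;DR: The study finds that safety alignment in large language models is not concentrated in a specific geometric subspace but is highly entangled with the model's broader learning dynamics, which challenges the feasibility of subspace-based defenses.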

Details

Motivation: To examine whether safety alignment corresponds to identifiable geometric directions or subspaces that could be used to defend against safety degradation caused by fine-tuning. Method: Experiments analyze safety-relevant behavior in parameter and activation space and test whether any subspace selectively governs safety. Result: Subspaces that amplify safe behaviors also amplify unsafe ones, prompts with different safety implications activate overlapping representations, and no subspace that selectively governs safety was found. Conclusion: Safety alignment is not geometrically localized but is highly entangled with the model's learning dynamics, so subspace-based defenses may face fundamental limitations. Abstract: Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. This is typically achieved through instruction tuning and reinforcement learning from human feedback. However, this alignment is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable geometric directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this geometric perspective. We examine whether safety-relevant behavior is concentrated in specific subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in internal representations. Across both parameter and activation space, our findings are consistent: subspaces that amplify safe behaviors also amplify unsafe ones, and prompts with different safety implications activate overlapping representations. We find no evidence of a subspace that selectively governs safety. These results challenge the assumption that alignment is geometrically localized. Rather than residing in distinct directions, safety appears to emerge from entangled, high-impact components of the model's broader learning dynamics. This suggests that subspace-based defenses may face fundamental limitations and underscores the need for alternative strategies to preserve alignment under continued training. We corroborate these findings through multiple experiments on five open-source LLMs. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.

[301] AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum

Jian Xiong,Jingbo Zhou,Jingyong Ye,Dejing Dou

Main category: cs.LG

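TL;DR: This paper proposes AAPO, a new reinforcement learning algorithm that optimizes the cross-entropy loss with advantages enhanced by a momentum-based estimation scheme, addressing the training inefficiency of existing group relative advantage estimation methods.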

Details

Motivation: Existing methods based on group relative advantage estimation, such as GRPO, train inefficiently when the advantage approaches zero, which limits the use of reinforcement learning for improving LLM reasoning. Method: The paper proposes Advantage-Augmented Policy Optimization (AAPO), which optimizes the cross-entropy loss using advantages enhanced by a momentum-based estimation scheme to improve training efficiency. Result: On multiple mathematical reasoning benchmarks, AAPO outperforms existing methods. Conclusion: By improving advantage estimation, AAPO makes training markedly more efficient and offers a better option for applying reinforcement learning to language model reasoning. Abstract: Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on the value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, we observe that existing group relative advantage estimation methods still suffer from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a momentum-based estimation scheme. This approach effectively mitigates the inefficiencies associated with group relative advantage estimation. Experimental results on multiple mathematical reasoning benchmarks demonstrate the superior performance of AAPO.
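
The inefficiency described in the abstract can be seen in a small sketch of GRPO-style group relative advantages. This illustrates the baseline behaviour only, not AAPO's momentum-based scheme, whose details are not given in the abstract. It assumes the commonly used form (reward minus group mean, divided by group standard deviation); the population standard deviation and the `eps` term are choices made here for illustration, and implementations may differ.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes yields informative, non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every sampled response gets the same reward yields all-zero
# advantages, so the policy gradient for this prompt carries no signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When most prompts in a batch fall into the second case (all responses correct or all incorrect), much of the sampling compute produces no update, which is the situation the paper sets out to improve.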

[302] Scaling Law for Quantization-Aware Training

Mengzhao Chen,Chaoyi Zhang,Jing Liu,Yutao Zeng,Zeyue Xue,Zhiheng Liu,Yunshui Li,Jin Ma,Jie Huang,Xun Zhou,Ping Luo

Main category: cs.LG

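TL;DR: This paper proposes a unified scaling law for quantization-aware training (QAT), studying how quantization error at 4-bit precision (W4A4) depends on model size, training data volume, and quantization granularity, and showing experimentally that weight and activation quantization errors differ in sensitivity.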

Details

Motivation: The heavy compute and memory demands of large language models (LLMs) make deployment difficult. Quantization-aware training (QAT) addresses this by lowering model precision, but its scaling behavior at 4-bit precision is not well understood. Method: Based on 268 QAT experiments, the paper proposes a unified scaling law that models quantization error as a function of model size, training data volume, and quantization group size, and decomposes the error into weight and activation components. Result: Quantization error decreases as model size grows but rises with more training data and coarser quantization granularity; activation quantization error, especially from outliers in the FC2 layer, is the main bottleneck, and mixed-precision quantization can address it. Conclusion: The study offers key insights for improving QAT, noting that with more training data the weight quantization error can exceed the activation quantization error, so both need attention. Abstract: Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.
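
To make the role of quantization granularity concrete, the sketch below applies symmetric 4-bit quantization with one scale per group and compares two group sizes on a toy vector. It assumes a simple max-abs scaling rule and is not the paper's QAT setup; the function names and values are hypothetical, chosen so that large-magnitude entries share a group with small ones only in the coarse setting.

```python
def quantize_dequantize(values, group_size, bits=4):
    """Symmetric per-group quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for v in group:
            q = max(qmin, min(qmax, round(v / scale)))
            out.append(q * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Small values followed by values ten times larger (illustrative only).
weights = [0.7, -0.3, 0.2, 0.1, 7.0, -3.0, 2.0, 1.0]

mse_fine = mse(weights, quantize_dequantize(weights, group_size=4))
mse_coarse = mse(weights, quantize_dequantize(weights, group_size=8))
```

With a group size of 4, each half of the vector gets its own scale and is represented almost exactly; with a group size of 8, the large entries set the scale and the small entries are rounded to 0 or 1, so the error rises. This is the kind of outlier-driven effect the abstract attributes to coarser granularity.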

[303] Enhancing Learned Knowledge in LoRA Adapters Through Efficient Contrastive Decoding on Ascend NPUs

Morgan Lindsay Heisler,Linzi Xing,Ge Shi,Hanieh Sadri,Gursimran Singh,Weiwei Zhang,Tao Ye,Ying Xiong,Yong Zhang,Zhenan Fan

Main category: cs.LG

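TL;DR: Huawei Cloud users fine-tune large language models (LLMs) efficiently with LoRA (Low-Rank Adaptation), but conventional decoding methods such as greedy or beam search can yield task-agnostic responses. This paper proposes Contrastive LoRA Decoding (CoLD), which contrasts the probability distributions of the LoRA-adapted model and the base model and prioritizes tokens that better align with the LoRA's learned representations, improving task performance. The optimized CoLD on Huawei's Ascend NPU improves task accuracy by up to 5.54% and reduces end-to-end latency by 28% compared to greedy decoding.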

Details

Motivation: On tasks that need complex reasoning or deep contextual understanding, conventional decoding methods can perform poorly because of biases or interference from the base model, leading to task-agnostic responses. Method: The CoLD framework contrasts the probability distributions of the LoRA-adapted model and the base model and prioritizes tokens that better align with the LoRA's learned representations; the implementation is optimized to reduce computational overhead. Result: CoLD improves task accuracy by up to 5.54% and reduces end-to-end latency by 28%. Conclusion: CoLD provides an efficient decoding strategy for fine-tuned LLMs in resource-constrained environments, with broad relevance to cloud and on-premises applications. Abstract: Huawei Cloud users leverage LoRA (Low-Rank Adaptation) as an efficient and scalable method to fine-tune and customize large language models (LLMs) for application-specific needs. However, tasks that require complex reasoning or deep contextual understanding are often hindered by biases or interference from the base model when using typical decoding methods like greedy or beam search. These biases can lead to generic or task-agnostic responses from the base model instead of leveraging the LoRA-specific adaptations. In this paper, we introduce Contrastive LoRA Decoding (CoLD), a novel decoding framework designed to maximize the use of task-specific knowledge in LoRA-adapted models, resulting in better downstream performance. CoLD uses contrastive decoding by scoring candidate tokens based on the divergence between the probability distributions of a LoRA-adapted expert model and the corresponding base model. This approach prioritizes tokens that better align with the LoRA's learned representations, enhancing performance for specialized tasks. While effective, a naive implementation of CoLD is computationally expensive because each decoding step requires evaluating multiple token candidates across both models. To address this, we developed an optimized kernel for Huawei's Ascend NPU. CoLD achieves up to a 5.54% increase in task accuracy while reducing end-to-end latency by 28% compared to greedy decoding. This work provides practical and efficient decoding strategies for fine-tuned LLMs in resource-constrained environments and has broad implications for applied data science in both cloud and on-premises settings.
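
A single contrastive decoding step can be sketched as follows. This is a generic illustration, not CoLD's exact scoring rule, which the abstract does not specify: it assumes the score is the log-probability ratio between the LoRA-adapted (expert) model and the base model, restricted to tokens the expert itself finds plausible. The token distributions, the function name, and the `alpha` threshold are hypothetical.

```python
import math


def contrastive_pick(expert_probs, base_probs, alpha=0.1):
    """Pick the token with the largest expert-vs-base log-probability gap.

    Only tokens whose expert probability is at least alpha times the expert's
    top probability are considered, so rare tokens cannot win on ratio alone.
    """
    cutoff = alpha * max(expert_probs.values())
    scores = {
        tok: math.log(p) - math.log(base_probs[tok])
        for tok, p in expert_probs.items()
        if p >= cutoff
    }
    return max(scores, key=scores.get), scores


# Toy next-token distributions (illustrative values only).
expert = {"the": 0.40, "SELECT": 0.35, "a": 0.24, "zzz": 0.01}
base = {"the": 0.50, "SELECT": 0.05, "a": 0.449, "zzz": 0.001}

token, scores = contrastive_pick(expert, base)
greedy_token = max(expert, key=expert.get)
```

Greedy decoding on the expert alone would emit the generic token "the", while the contrastive score favors "SELECT", the token whose probability the adapter raised the most. The plausibility cutoff excludes "zzz", which would otherwise win on ratio despite being very unlikely under the expert. Each step needs a forward pass through both models, which is the overhead the paper's optimized kernel targets.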

[304] TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

Zhangchen Xu,Yuetai Li,Fengqing Jiang,Bhaskar Ramasubramanian,Luyao Niu,Bill Yuchen Lin,Radha Poovendran

Main category: cs.LG

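TL;DR: The paper analyzes false negatives in reinforcement learning, where verifiers wrongly reject correct model outputs, and proposes TinyV, a lightweight verifier that makes reward signals more accurate.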

Details

Motivation: Reinforcement learning depends on reward signals from verifiers, but false negatives are widespread and seriously impair training. Method: Based on an analysis of the Big-Math-RL-Verified dataset, the paper proposes the TinyV verifier, which dynamically identifies false negatives and recovers valid responses. Result: TinyV raises pass rates by up to 10% on multiple math-reasoning benchmarks and accelerates convergence. Conclusion: Addressing false negatives is critical for improving RL fine-tuning of LLMs, and TinyV offers a practical way to do so. Abstract: Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem, false negatives, where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments existing rule-based methods, which dynamically identifies potential false negatives and recovers valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
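
The false-negative problem and the "augment the rule-based verifier" idea can be sketched with a toy reward function. This is an assumption-laden illustration: the rule-based check is plain string matching, and `numeric_equivalence` is a hypothetical stand-in for the LLM-based verifier (it only compares values as exact fractions). None of these names come from the TinyV codebase.

```python
from fractions import Fraction


def rule_based_match(pred, ref):
    """Strict rule-based check: exact string match after trimming whitespace."""
    return pred.strip() == ref.strip()


def numeric_equivalence(pred, ref):
    """Stand-in for a learned verifier: compare answers as exact fractions."""
    try:
        return Fraction(pred.strip()) == Fraction(ref.strip())
    except (ValueError, ZeroDivisionError):
        return False


def reward(pred, ref, fallback=numeric_equivalence):
    """Return 1.0 if the rule accepts; otherwise ask the fallback verifier."""
    if rule_based_match(pred, ref):
        return 1.0
    return 1.0 if fallback(pred, ref) else 0.0


# "0.5" is correct for a reference of "1/2", but string matching rejects it.
false_negative = not rule_based_match("0.5", "1/2")
recovered = reward("0.5", "1/2")
still_wrong = reward("0.6", "1/2")
```

The fallback runs only when the cheap rule rejects an answer, so the extra verification cost is limited to potential false negatives while answers the rule already accepts are left untouched.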

cs.CR [Back]

[305] PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks

Guobin Shen,Dongcheng Zhao,Linghao Feng,Xiang He,Jihang Wang,Sicheng Shen,Haibo Tong,Yiting Dong,Jindong Li,Xiang Zheng,Yi Zeng

Main category: cs.CR

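TL;DR: PandaGuard is a unified multi-agent framework for systematically evaluating LLM jailbreak safety; it covers a range of attack, defense, and judgment methods, and its PandaBench benchmark yields key insights.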

Details

Motivation: Despite their strong capabilities, LLMs remain vulnerable to jailbreak attacks, and existing evaluations lack systematic, reproducible analysis. Method: The PandaGuard framework integrates 19 attack methods, 12 defense mechanisms, and multiple judgment strategies, with flexible configuration and experimentation. Result: An evaluation of 49 LLMs shows that no single defense is optimal and that judge disagreement affects safety assessments. Conclusion: Code and results are released to support transparent, reproducible LLM safety research. Abstract: Large language models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial prompts known as jailbreaks, which can bypass safety alignment and elicit harmful outputs. Despite growing efforts in LLM safety research, existing evaluations are often fragmented, focused on isolated attack or defense techniques, and lack systematic, reproducible analysis. In this work, we introduce PandaGuard, a unified and modular framework that models LLM jailbreak safety as a multi-agent system comprising attackers, defenders, and judges. Our framework implements 19 attack methods and 12 defense mechanisms, along with multiple judgment strategies, all within a flexible plugin architecture supporting diverse LLM interfaces, multiple interaction modes, and configuration-driven experimentation that enhances reproducibility and practical deployment. Built on this framework, we develop PandaBench, a comprehensive benchmark that evaluates the interactions between these attack/defense methods across 49 LLMs and various judgment approaches, requiring over 3 billion tokens to execute. Our extensive evaluation reveals key insights into model vulnerabilities, defense cost-performance trade-offs, and judge consistency. We find that no single defense is optimal across all dimensions and that judge disagreement introduces nontrivial variance in safety assessments. We release the code, configurations, and evaluation results to support transparent and reproducible research in LLM safety.

Jiankun Zhang,Shenglai Zeng,Jie Ren,Tianqi Zheng,Hui Liu,Xianfeng Tang,Hui Liu,Yi Chang

Main category: cs.CR

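TL;DR: MRAG systems enhance LMMs by integrating external multimodal databases but introduce unexplored privacy vulnerabilities. This paper gives the first systematic analysis of MRAG privacy risks across vision-language and speech-language modalities and proposes a novel black-box attack.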

Details

Motivation: To study privacy vulnerabilities in MRAG systems and fill the research gap on privacy risks of multimodal data. Method: A novel compositional structured prompt attack is used to probe privacy vulnerabilities in a black-box setting. Result: Experiments show that LMMs can directly generate outputs resembling retrieved content or indirectly expose sensitive information. Conclusion: Robust privacy-preserving MRAG techniques are urgently needed. Abstract: Multimodal Retrieval-Augmented Generation (MRAG) systems enhance LMMs by integrating external multimodal databases, but introduce unexplored privacy vulnerabilities. While text-based RAG privacy risks have been studied, multimodal data presents unique challenges. We provide the first systematic analysis of MRAG privacy vulnerabilities across vision-language and speech-language modalities. Using a novel compositional structured prompt attack in a black-box setting, we demonstrate how attackers can extract private information by manipulating queries. Our experiments reveal that LMMs can both directly generate outputs resembling retrieved content and produce descriptions that indirectly expose sensitive information, highlighting the urgent need for robust privacy-preserving MRAG techniques.

[307] Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs

Jiawen Wang,Pritha Gupta,Ivan Habernal,Eyke Hüllermeier

Main category: cs.CR

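TL;DR: The paper studies prompt injection attacks against 14 popular open-source LLMs, proposes a hypnotism attack, and introduces the Attack Success Probability (ASP) metric; results show the attacks are highly effective.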

Details

Motivation: The vulnerability of both open-source and closed-source LLMs to prompt-based attacks is underinvestigated, and the associated security risks need further study. Method: Using five attack benchmarks, the paper proposes a hypnotism attack and an ignore-prefix attack and introduces the ASP metric to measure attack effectiveness. Result: The hypnotism attack is effective against several aligned models (such as Stablelm2 and Mistral), reaching around 90% ASP; the ignore-prefix attack works on all 14 LLMs, with over 60% ASP. Conclusion: Moderately well-known LLMs are more vulnerable, so public awareness should be raised and mitigation strategies prioritized. Abstract: Recent studies demonstrate that Large Language Models (LLMs) are vulnerable to different prompt-based attacks, generating harmful content or sensitive information. Both closed-source and open-source LLMs are underinvestigated for these attacks. This paper studies effective prompt injection attacks against the $\mathbf{14}$ most popular open-source LLMs on five attack benchmarks. Current metrics only consider successful attacks, whereas our proposed Attack Success Probability (ASP) also captures uncertainty in the model's response, reflecting ambiguity in attack feasibility. By comprehensively analyzing the effectiveness of prompt injection attacks, we propose a simple and effective hypnotism attack; results show that this attack causes aligned language models, including Stablelm2, Mistral, Openchat, and Vicuna, to generate objectionable behaviors, achieving around $90$% ASP. They also indicate that our ignore prefix attacks can break all $\mathbf{14}$ open-source LLMs, achieving over $60$% ASP on a multi-categorical dataset. We find that moderately well-known LLMs exhibit higher vulnerability to prompt injection attacks, highlighting the need to raise public awareness and prioritize efficient mitigation strategies.

q-bio.QM [Back]

[308] InterFeat: An Automated Pipeline for Finding Interesting Hypotheses in Structured Biomedical Data

Dan Ofer,Michal Linial,Dafna Shahaf

Main category: q-bio.QM

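TL;DR: The paper proposes an automated pipeline for discovering interesting hypotheses in biomedical data, combining machine learning, knowledge graphs, and large language models, and defines "interestingness" as novelty, utility, and plausibility.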

Details

Motivation: Finding interesting phenomena is the core of scientific discovery, but it is a manual, ill-defined process; automated methods are needed to improve efficiency and scalability. Method: The pipeline combines machine learning, knowledge graphs, literature search, and large language models, and defines "interestingness" as a combination of novelty, utility, and plausibility. Result: On 8 major diseases, the pipeline recovers risk factors years before they appear in the literature; 40-53% of top candidates were validated as interesting, far above the 0-7% of the baseline. Conclusion: The pipeline offers a scalable way to operationalize "interestingness" and demonstrates practical value on biomedical data. Abstract: Finding interesting phenomena is the core of scientific discovery, but it is a manual, ill-defined concept. We present an integrative pipeline for automating the discovery of interesting simple hypotheses (feature-target relations with effect direction and a potential underlying mechanism) in structured biomedical data. The pipeline combines machine learning, knowledge graphs, literature search and Large Language Models. We formalize "interestingness" as a combination of novelty, utility and plausibility. On 8 major diseases from the UK Biobank, our pipeline consistently recovers risk factors years before their appearance in the literature. 40-53% of our top candidates were validated as interesting, compared to 0-7% for a SHAP-based baseline. Overall, 28% of 109 candidates were interesting to medical experts. The pipeline addresses the challenge of operationalizing "interestingness" scalably and for any target. We release data and code: https://github.com/LinialLab/InterFeat

cs.IR [Back]

[309] Bridge the Gap between Past and Future: Siamese Model Optimization for Context-Aware Document Ranking

Songhao Wu,Quan Tu,Mingjie Zhong,Hong Liu,Jia Xu,Jinjie Gu,Rui Yan

Main category: cs.IR

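TL;DR: The paper proposes ForeRanker, a document ranking model that combines historical and future context, improving performance through collaborative training of two models.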

Details

Motivation: Existing methods use only historical session data and struggle to capture evolving user intent, so the paper explores adding future context to improve document ranking. Method: A siamese model optimization framework is designed, comprising a history-conditioned model and a future-aware model that are trained collaboratively with supervised labels and pseudo labels; a dynamic gating mechanism reduces inconsistencies during training. Result: On benchmark datasets, ForeRanker outperforms existing methods. Conclusion: Incorporating future context effectively improves document ranking, and the dynamic gating mechanism helps stabilize training. Abstract: In the realm of information retrieval, users often engage in multi-turn interactions with search engines to acquire information, leading to the formation of sequences of user feedback behaviors. Leveraging the session context has proven to be beneficial for inferring user search intent and document ranking. A multitude of approaches have been proposed to exploit in-session context for improved document ranking. Despite these advances, the limitation of historical session data for capturing evolving user intent remains a challenge. In this work, we explore the integration of future contextual information into the session context to enhance document ranking. We present the siamese model optimization framework, comprising a history-conditioned model and a future-aware model. The former processes only the historical behavior sequence, while the latter integrates both historical and anticipated future behaviors. Both models are trained collaboratively using the supervised labels and pseudo labels predicted by the other. The history-conditioned model, referred to as ForeRanker, progressively learns future-relevant information to enhance ranking, while it uses only the historical session at inference time. To mitigate inconsistencies during training, we introduce the peer knowledge distillation method with a dynamic gating mechanism, allowing models to selectively incorporate contextual information. Experimental results on benchmark datasets demonstrate the effectiveness of our ForeRanker, showcasing its superior performance compared to existing methods.

[310] MedEIR: A Specialized Medical Embedding Model for Enhanced Information Retrieval

Anand Selvadurai,Jasheen Shaik,Girish Chandrasekar,ShriRadhaKrishnan Balamurugan,Eswara Reddy

Main category: cs.IR

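TL;DR: MedEIR is a new embedding model and tokenizer optimized for both medical and general NLP tasks; it supports long-text processing and outperforms existing models.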

Details

Motivation: Existing embedding models fall short at capturing the semantics of medical documents, handling long texts, and performing general-purpose tasks, so a more versatile solution is needed. Method: MedEIR incorporates ALiBi-based long-context processing, is pre-trained on 6 billion tokens, and is fine-tuned on 3 million sentence pairs. Result: On the MTEB benchmarks, MedEIR outperforms Jina V2 and MiniLM on several tasks (e.g., ArguAna and NFCorpus). Conclusion: MedEIR performs well on both general-purpose and domain-specific tasks and has broad application potential. Abstract: Embedding models have become essential for retrieval-augmented generation (RAG) tasks, semantic clustering, and text re-ranking. But despite their growing use, many of these come with notable limitations. For example, Jina fails to capture the semantic content of medical documents, while models such as MiniLM often perform poorly on long-form documents. Domain-adapted models, while specialized, often underperform in general-purpose tasks, reducing their overall applicability. General-domain tokenizers often misinterpret medical vocabulary. The limitations of current embedding models, whether in tokenization accuracy, domain comprehension, or handling long sequences, highlight the need for more versatile solutions. In this work, we present MedEIR, a novel embedding model and tokenizer jointly optimized for both medical and general NLP tasks, incorporating ALiBi-based long-context processing to support sequences of up to 8,192 tokens. MedEIR was pre-trained on only 6 billion tokens, significantly fewer than Jina's, followed by fine-tuning on 3 million sentence pairs. MedEIR consistently outperforms Jina V2 and MiniLM across MTEB benchmarks, achieving top scores on ArguAna (55.24), NFCorpus (38.44), MedicalQARetrieval (74.25), SciFact (72.04), and TRECCOVID (79.56). These results highlight the potential of MedEIR as a highly effective embedding model, demonstrating strong performance across both general-purpose and domain-specific tasks and outperforming existing models on multiple benchmarks.
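
Editor's sketch: the abstract says only that MedEIR uses "ALiBi-based long-context processing". As a rough illustration of what ALiBi (Attention with Linear Biases) adds to attention logits, the Python below computes per-head slopes and a distance-proportional bias matrix. The function names, the power-of-two head count, and the symmetric (encoder-style) form of the bias are assumptions made for illustration; none of them is taken from the MedEIR paper.

```python
def alibi_slopes(num_heads):
    # Geometric slope schedule for a power-of-two head count:
    # head h (1-indexed) gets slope 2 ** (-8 * h / num_heads).
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch assumes a power-of-two number of heads")
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(seq_len, slope):
    # Assumed symmetric (encoder-style) variant: the value -slope * |i - j|
    # is added to the attention logit between query position i and key j,
    # so distant tokens are penalized linearly and no position embedding
    # table of fixed size is needed.
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)
bias_head0 = alibi_bias(4, slopes[0])
```

Because the bias depends only on the distance between positions, the same rule can be evaluated for any sequence length, which is the property that makes this family of methods attractive for long inputs such as the 8,192-token limit the abstract mentions.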

[311] RAR: Setting Knowledge Tripwires for Retrieval Augmented Rejection

Tommaso Mario Buonocore,Enea Parimbelli

Main category: cs.IR

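TL;DR: Proposes Retrieval Augmented Rejection (RAR), a new method that uses a retrieval-augmented generation (RAG) architecture to dynamically reject unsafe queries without retraining the model.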

Details

Motivation: Content moderation for large language models (LLMs) needs flexible, adaptable solutions that can respond quickly to emerging threats. Method: Malicious documents are strategically inserted into the vector database and marked, so the system can identify and reject harmful requests whenever those documents are retrieved. Result: Preliminary results show that RAR performs comparably to the embedded moderation in LLMs such as Claude 3.5 Sonnet while offering greater flexibility and real-time customization. Conclusion: RAR requires no architectural changes to existing RAG systems, only the addition of specially crafted documents and a simple rejection mechanism based on retrieval results, making it an efficient and flexible solution. Abstract: Content moderation for large language models (LLMs) remains a significant challenge, requiring flexible and adaptable solutions that can quickly respond to emerging threats. This paper introduces Retrieval Augmented Rejection (RAR), a novel approach that leverages a retrieval-augmented generation (RAG) architecture to dynamically reject unsafe user queries without model retraining. By strategically inserting malicious documents into the vector database and marking them, the system can identify and reject harmful requests when these documents are retrieved. Our preliminary results show that RAR achieves comparable performance to embedded moderation in LLMs like Claude 3.5 Sonnet, while offering superior flexibility and real-time customization capabilities, a fundamental feature for addressing critical vulnerabilities in a timely manner. This approach introduces no architectural changes to existing RAG systems, requiring only the addition of specially crafted documents and a simple rejection mechanism based on retrieval results.
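
Editor's sketch: the tripwire idea described in the abstract can be illustrated with a toy retriever. Everything below is hypothetical and is not the authors' implementation: the bag-of-words `embed` function stands in for a real dense encoder, `INDEX` stands in for a vector database, and the `is_tripwire` flag is one assumed way of "marking" a planted document.

```python
import math
from collections import Counter


def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a dense encoder.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical index: (text, is_tripwire). Tripwires are flagged at insert time.
INDEX = [
    ("how to bake sourdough bread at home", False),
    ("guide to resetting a forgotten router password", False),
    ("step by step instructions to make a weapon at home", True),
]


def retrieve(query, k=1):
    # Return the k documents most similar to the query.
    q = embed(query)
    ranked = sorted(INDEX, key=lambda doc: cosine(q, embed(doc[0])), reverse=True)
    return ranked[:k]


def answer_or_reject(query, k=1):
    # Reject as soon as any retrieved document is a marked tripwire;
    # otherwise hand the top hit to the (omitted) generation step.
    hits = retrieve(query, k)
    if any(is_tripwire for _, is_tripwire in hits):
        return "REJECTED"
    return "ANSWER using: " + hits[0][0]
```

In this sketch, adding or removing a tripwire is just an index update, which mirrors the abstract's point that the approach needs no model retraining. How many retrieved documents to inspect (`k`) and whether a similarity threshold is applied are design choices the abstract does not specify.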

[312] LLM-Based Compact Reranking with Document Features for Scientific Retrieval

Runchu Tian,Xueqiang Xu,Bowen Jin,SeongKu Kang,Jiawei Han

Main category: cs.IR

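TL;DR: Proposes CoRank, a framework that addresses the challenges of LLM listwise reranking in scientific retrieval through compact document representations and a hybrid reranking approach.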

Details

Motivation: In scientific retrieval, LLM listwise reranking is held back by the limited number of candidate documents that fit in context and by relevant documents being ranked low in the first stage, which hurts retrieval performance. Method: CoRank has three stages: offline extraction of document features, coarse reranking on compact representations, and fine-grained reranking on the full texts of the top candidates. Result: Experiments show that CoRank significantly improves reranking performance, raising nDCG@10 from 32.0 to 39.7. Conclusion: Information extraction is highly valuable for reranking in scientific retrieval. Abstract: Scientific retrieval is essential for advancing academic discovery. Within this process, document reranking plays a critical role by refining first-stage retrieval results. However, large language model (LLM) listwise reranking faces unique challenges in the scientific domain. First-stage retrieval is often suboptimal in the scientific domain, so relevant documents are ranked lower. Moreover, conventional listwise reranking uses the full text of candidate documents in the context window, limiting the number of candidates that can be considered. As a result, many relevant documents are excluded before reranking, which constrains overall retrieval performance. To address these challenges, we explore compact document representations based on semantic features such as categories, sections, and keywords, and propose a training-free, model-agnostic reranking framework for scientific retrieval called CoRank. The framework involves three stages: (i) offline extraction of document-level features, (ii) coarse reranking using these compact representations, and (iii) fine-grained reranking on full texts of the top candidates from stage (ii). This hybrid design provides a high-level abstraction of document semantics, expands candidate coverage, and retains critical details required for precise ranking. Experiments on LitSearch and CSFCube show that CoRank significantly improves reranking performance across different LLM backbones, increasing nDCG@10 from 32.0 to 39.7. Overall, these results highlight the value of information extraction for reranking in scientific retrieval.
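
Editor's sketch: the three stages listed in the abstract can be laid out as a small coarse-to-fine pipeline. This is only a structural illustration under loud assumptions: in the paper an LLM performs the reranking, whereas here a hypothetical token-overlap function (`overlap_score`) stands in for the LLM judgment, and the document fields (`category`, `keywords`, `text`) and the `top_m` parameter are invented for the example.

```python
def overlap_score(query, text):
    # Stand-in for an LLM relevance judgment (assumption): the number of
    # distinct lowercase tokens shared by the query and the text.
    return len(set(query.lower().split()) & set(text.lower().split()))


def extract_features(doc):
    # Stage (i), done offline: reduce a document to a compact string
    # built from its category and keywords.
    return doc["category"] + " " + " ".join(doc["keywords"])


def corank_sketch(query, docs, top_m=2):
    compact = {d["id"]: extract_features(d) for d in docs}
    # Stage (ii): coarse rerank over ALL candidates using compact features.
    coarse = sorted(
        docs, key=lambda d: overlap_score(query, compact[d["id"]]), reverse=True
    )
    shortlist, rest = coarse[:top_m], coarse[top_m:]
    # Stage (iii): fine rerank of the shortlist only, using full text.
    fine = sorted(shortlist, key=lambda d: overlap_score(query, d["text"]), reverse=True)
    return [d["id"] for d in fine + rest]


DOCS = [
    {"id": "B", "category": "retrieval", "keywords": ["dense", "encoder"],
     "text": "dense encoder training"},
    {"id": "D", "category": "retrieval", "keywords": ["reranking", "scientific"],
     "text": "scientific document reranking"},
    {"id": "C", "category": "vision", "keywords": ["segmentation"],
     "text": "image segmentation survey"},
    {"id": "A", "category": "retrieval", "keywords": ["reranking", "llm"],
     "text": "listwise llm reranking for scientific retrieval with compact features"},
]

ranking = corank_sketch("llm reranking scientific retrieval", DOCS)
```

The point of the layout is that the coarse pass sees every candidate cheaply, while the expensive full-text pass is limited to `top_m` documents; this is the trade-off the abstract describes as expanding candidate coverage while retaining the details needed for precise ranking.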

[313] Rank-K: Test-Time Reasoning for Listwise Reranking

Eugene Yang,Andrew Yates,Kathryn Ricci,Orion Weller,Vivek Chari,Benjamin Van Durme,Dawn Lawrie

Main category: cs.IR

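TL;DR: Rank-K is a listwise reranking model built on a reasoning language model; it significantly improves retrieval effectiveness and performs especially well in multilingual retrieval.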

Details

Motivation: Neural rerankers are highly effective but resource-intensive, which limits their practical use. Rank-K aims to address this while improving retrieval effectiveness. Method: Rank-K uses the capabilities of a reasoning language model to perform listwise passage reranking at query time, and it supports multilingual retrieval. Result: Rank-K improves retrieval effectiveness by 23% over RankZephyr when reranking a BM25 initial ranked list and by 19% when reranking SPLADE-v3 results, and it performs equally well in multilingual retrieval. Conclusion: By applying a reasoning language model, Rank-K significantly improves retrieval effectiveness and adapts to multiple languages. Abstract: Retrieve-and-rerank is a popular retrieval pipeline because of its ability to make slow but effective rerankers efficient enough at query time by reducing the number of comparisons. Recent works in neural rerankers take advantage of large language models for their capability in reasoning between queries and passages and have achieved state-of-the-art retrieval effectiveness. However, such rerankers are resource-intensive, even after heavy optimization. In this work, we introduce Rank-K, a listwise passage reranking model that leverages the reasoning capability of a reasoning language model at query time, providing test-time scalability to serve hard queries. We show that Rank-K improves retrieval effectiveness by 23% over RankZephyr, the state-of-the-art listwise reranker, when reranking a BM25 initial ranked list and by 19% when reranking strong retrieval results from SPLADE-v3. Since Rank-K is inherently a multilingual model, we found that it ranks passages based on queries in different languages as effectively as it does in monolingual retrieval.

[314] NExT-Search: Rebuilding User Feedback Ecosystem for Generative AI Search

Sunhao Dai,Wenjie Wang,Liang Pang,Jun Xu,See-Kiong Ng,Ji-Rong Wen,Tat-Seng Chua

Main category: cs.IR

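TL;DR: NExT-Search proposes a new paradigm that improves generative AI search through fine-grained, process-level feedback, addressing the broken feedback loop of the traditional setup.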

Details

Motivation: Generative AI search is convenient, but it breaks the feedback-driven improvement loop of traditional Web search, making intermediate stages hard to optimize. Method: NExT-Search introduces two modes, User Debug Mode and Shadow User Mode, and combines online adaptation with offline updates to optimize the search pipeline. Result: By restoring human control over key stages of the search pipeline, NExT-Search is expected to enable feedback-rich AI search systems. Conclusion: NExT-Search offers a promising direction for the continuous improvement of generative AI search. Abstract: Generative AI search is reshaping information retrieval by offering end-to-end answers to complex queries, reducing users' reliance on manually browsing and summarizing multiple web pages. However, while this paradigm enhances convenience, it disrupts the feedback-driven improvement loop that has historically powered the evolution of traditional Web search. Web search engines can continuously improve their ranking models by collecting large-scale, fine-grained user feedback (e.g., clicks, dwell time) at the document level. In contrast, generative AI search operates through a much longer search pipeline, spanning query decomposition, document retrieval, and answer generation, yet typically receives only coarse-grained feedback on the final answer. This introduces a feedback loop disconnect, where user feedback for the final output cannot be effectively mapped back to specific system components, making it difficult to improve each intermediate stage and sustain the feedback loop. In this paper, we envision NExT-Search, a next-generation paradigm designed to reintroduce fine-grained, process-level feedback into generative AI search. NExT-Search integrates two complementary modes: User Debug Mode, which allows engaged users to intervene at key stages; and Shadow User Mode, where a personalized user agent simulates user preferences and provides AI-assisted feedback for less interactive users. Furthermore, we envision how these feedback signals can be leveraged through online adaptation, which refines current search outputs in real-time, and offline update, which aggregates interaction logs to periodically fine-tune query decomposition, retrieval, and generation models. By restoring human control over key stages of the generative AI search pipeline, we believe NExT-Search offers a promising direction for building feedback-rich AI search systems that can evolve continuously alongside human feedback.

cs.MA [Back]

[315] MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

Haoyang Fang,Boran Han,Nick Erickson,Xiyuan Zhang,Su Zhou,Anirudh Dagar,Jiani Zhang,Ali Caner Turkmen,Cuixiong Hu,Huzefa Rangwala,Ying Nian Wu,Bernie Wang,George Karypis

Main category: cs.MA

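TL;DR: MLZero is a multi-agent framework powered by large language models that delivers end-to-end automated machine learning on multimodal data and significantly outperforms existing methods.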

Details

Motivation: Existing AutoML systems still require substantial manual configuration and expert input when handling multimodal data; MLZero aims to automate the process with minimal human intervention. Method: A cognitive perception module converts multimodal inputs into perceptual context, and semantic and episodic memory enhance iterative code generation to address the limitations of large language models. Result: MLZero performs strongly on MLE-Bench Lite and the Multimodal AutoML Agent Benchmark, leading by a clear margin in both success rate and solution quality. Conclusion: MLZero excels at multimodal AutoML tasks and surpasses existing systems even when using a small LLM. Abstract: Existing AutoML systems have advanced the automation of machine learning (ML); however, they still require substantial manual configuration and expert input, particularly when handling multimodal data. We introduce MLZero, a novel multi-agent framework powered by Large Language Models (LLMs) that enables end-to-end ML automation across diverse data modalities with minimal human intervention. A cognitive perception module is first employed, transforming raw multimodal inputs into perceptual context that effectively guides the subsequent workflow. To address key limitations of LLMs, such as hallucinated code generation and outdated API knowledge, we enhance the iterative code generation process with semantic and episodic memory. MLZero demonstrates superior performance on MLE-Bench Lite, outperforming all competitors in both success rate and solution quality, securing six gold medals. Additionally, when evaluated on our Multimodal AutoML Agent Benchmark, which includes 25 more challenging tasks spanning diverse data modalities, MLZero outperforms the competing methods by a large margin with a success rate of 0.92 (+263.6%) and an average rank of 2.28. Our approach maintains its robust effectiveness even with a compact 8B LLM, outperforming full-size systems from existing solutions.

Table of Contents

cs.CV [Back]

[1] An Edge AI Solution for Space Object Detection

[2] Self-Supervised Learning for Image Segmentation: A Comprehensive Survey

[3] IPENS:Interactive Unsupervised Framework for Rapid Plant Phenotyping Extraction via NeRF-SAM2 Fusion

[4] GeoVLM: Improving Automated Vehicle Geolocalisation Using Vision-Language Matching

[5] GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

[6] Frozen Backpropagation: Relaxing Weight Symmetry in Temporally-Coded Deep Spiking Neural Networks

[7] ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model

[8] Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping

[9] Transfer Learning from Visual Speech Recognition to Mouthing Recognition in German Sign Language

[10] Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

[11] Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning

[12] InstanceBEV: Unifying Instance and BEV Representation for Global Modeling

[13] MGStream: Motion-aware 3D Gaussian for Streamable Dynamic Scene Reconstruction

[14] SuperMapNet for Long-Range and High-Accuracy Vectorized HD Map Construction

[15] Domain Adaptation of VLM for Soccer Video Understanding

[16] 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision

[17] Blind Restoration of High-Resolution Ultrasound Video

[18] An Explorative Analysis of SVM Classifier and ResNet50 Architecture on African Food Classification

[19] LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts

[20] Every Pixel Tells a Story: End-to-End Urdu Newspaper OCR

[21] StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning

[22] Multi-Label Stereo Matching for Transparent Scene Depth Estimation

[23] UHD Image Dehazing via anDehazeFormer with Atmospheric-aware KV Cache

[24] EGFormer: Towards Efficient and Generalizable Multimodal Semantic Segmentation

[25] OmniStyle: Filtering High Quality Style Transfer Data at Scale

[26] AppleGrowthVision: A large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards

[27] Selective Structured State Space for Multispectral-fused Small Target Detection

[28] Learning Concept-Driven Logical Rules for Interpretable and Generalizable Medical Image Classification

[29] Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

[30] Scaling Vision Mamba Across Resolutions via Fractal Traversal

[31] Place Recognition: A Comprehensive Review, Current Challenges and Future Directions

[32] Generalizable Multispectral Land Cover Classification via Frequency-Aware Mixture of Low-Rank Token Experts

[33] Unlocking the Power of SAM 2 for Few-Shot Segmentation

[34] Unintended Bias in 2D+ Image Segmentation and Its Effect on Attention Asymmetry

[35] CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition

[36] Intra-class Patch Swap for Self-Distillation

[37] Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

[38] ReactDiff: Latent Diffusion for Facial Reaction Generation

[39] Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search

[40] M3Depth: Wavelet-Enhanced Depth Estimation on Mars via Mutual Boosting of Dual-Modal Data

[41] LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer

[42] Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method

[43] Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment

[44] Flexible-weighted Chamfer Distance: Enhanced Objective Function for Point Cloud Completion

[45] VoQA: Visual-only Question Answering

[46] UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

[47] Decoupling Classifier for Boosting Few-shot Object Detection and Instance Segmentation

[48] Visual Agentic Reinforcement Fine-Tuning

[49] Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization

[50] Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models

[51] Speculative Decoding Reimagined for Multimodal Large Language Models

[52] RA-Touch: Retrieval-Augmented Touch Understanding with Enriched Visual Data

[53] Towards Generating Realistic Underwater Images

[54] A Review of Vision-Based Assistive Systems for Visually Impaired People: Technologies, Applications, and Future Directions

[55] RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection

[56] RETRO: REthinking Tactile Representation Learning with Material PriOrs

[57] Accuracy and Fairness of Facial Recognition Technology in Low-Quality Police Images: An Experiment With Synthetic Faces

[58] Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

[59] Handloom Design Generation Using Generative Networks

[60] Domain Adaptation for Multi-label Image Classification: a Discriminator-free Approach

[61] Plane Geometry Problem Solving with Multi-modal Reasoning: A Survey

[62] Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image

[63] Egocentric Action-aware Inertial Localization in Point Clouds

[64] Vid2World: Crafting Video Diffusion Models to Interactive World Models

[65] Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable

[66] Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives

[67] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

[68] ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

[69] Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency

[70] Diving into the Fusion of Monocular Priors for Generalized Stereo Matching

[71] Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

[72] VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank

[73] RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

[74] Enhancing Interpretability of Sparse Latent Representations with Class Information

[75] ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains

[76] SparC: Sparse Representation and Construction for High-Resolution 3D Shapes Modeling

[77] diffDemorph: Extending Reference-Free Demorphing to Unseen Faces

[78] Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image