cs.CV [Total: 109]
cs.CL [Total: 144]
physics.comp-ph [Total: 1]
cs.AR [Total: 1]
cs.LG [Total: 16]
cs.RO [Total: 6]
eess.IV [Total: 15]
cs.SE [Total: 1]
cs.CR [Total: 3]
cs.SD [Total: 2]
cs.PF [Total: 1]
cs.AI [Total: 4]
stat.AP [Total: 1]
cs.IR [Total: 2]
cs.NE [Total: 1]
eess.AS [Total: 4]
cs.CY [Total: 1]
q-bio.QM [Total: 1]
cs.HC [Total: 1]

cs.CV

[1] Intentional Gesture: Deliver Your Intentions with Gestures for Speech

Pinxin Liu,Haiyang Liu,Luchuan Song,Chenliang Xu

Main category: cs.CV

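TL;DR: The paper proposes Intentional-Gesture, a gesture generation framework based on intention reasoning. By grounding generation in high-level communicative functions, it addresses the semantic shallowness of existing methods.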

Details

Motivation: Existing gesture generation methods rely only on linguistic cues (such as speech audio or text) and ignore communicative intention, so the generated gestures are semantically shallow.

Method: The Intentional-Gesture framework annotates gesture intentions (automatically, using large vision-language models) and uses an intention-aware motion representation to generate semantically rich gestures.

Result: The method achieves state-of-the-art performance on the BEAT-2 benchmark, producing gestures that are both temporally aligned and semantically meaningful.

Conclusion: The framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI.

Abstract: When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations. It injects high-level communicative functions (e.g., intentions) into tokenized motion representations to enable intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, achieving new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: https://andypinxinliu.github.io/Intentional-Gesture

[2] Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs

Miguel Lopez-Duran,Julian Fierrez,Aythami Morales,Ruben Tolosana,Oscar Delgado-Mohatar,Alvaro Ortigosa

Main category: cs.CV

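TL;DR: The paper studies how to use graph neural networks (GNNs) for fine-grained layout classification of digital-born PDF documents. It proposes two graph construction schemes and shows experimentally that GraphSAGE in a dual-branch configuration performs best.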

Details

Motivation: Layout analysis of digital-born PDF documents is challenging because of the heterogeneous arrangement of textual and non-textual elements and the imprecise metadata of the PDF format, which calls for an efficient automated method.

Method: Two graph construction schemes are proposed (a k-closest-neighbor graph and a fully connected graph), with node features generated by pre-trained text and vision models, which avoids manual feature engineering. Three experimental frameworks (single-modality, concatenated multimodal, and dual-branch multimodal) and four GNN models are evaluated.

Result: GraphSAGE on the k-closest-neighbor graph in a dual-branch configuration performs best and outperforms the baseline on some sources, confirming the importance of local layout relationships and multimodal fusion.

Conclusion: The study confirms the potential of GNNs for layout analysis of digital-born documents, in particular through local relationships and multimodal fusion that improve classification accuracy.

Abstract: The automatic analysis of document layouts in digital-born PDF documents remains a challenging problem due to the heterogeneous arrangement of textual and nontextual elements and the imprecision of the textual metadata in the Portable Document Format. In this work, we benchmark Graph Neural Network (GNN) architectures for the task of fine-grained layout classification of text blocks from digital native documents. We introduce two graph construction structures: a k-closest-neighbor graph and a fully connected graph, and generate node features via pre-trained text and vision models, thus avoiding manual feature engineering. Three experimental frameworks are evaluated: single-modality (text or visual), concatenated multimodal, and dual-branch multimodal. We evaluated four foundational GNN models and compared them with the baseline. Our experiments are specifically conducted on a rich dataset of public affairs documents that includes more than 20 sources (e.g., regional and national-level official gazettes), 37K PDF documents, with 441K pages in total. Our results demonstrate that GraphSAGE operating on the k-closest-neighbor graph in a dual-branch configuration achieves the highest per-class and overall accuracy, outperforming the baseline in some sources. These findings confirm the importance of local layout relationships and multimodal fusion exploited through GNNs for the analysis of native digital document layouts.
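The two graph constructions named in this entry can be illustrated with a small sketch. It is not the authors' code: it assumes each text block is reduced to a hypothetical (x, y) center point and that closeness means Euclidean distance, neither of which the abstract specifies.

```python
import math


def knn_edges(centers, k):
    """Directed edges (i, j) where j is one of the k blocks closest to i."""
    edges = []
    for i, ci in enumerate(centers):
        # Rank every other block by Euclidean distance to block i.
        others = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: math.dist(ci, centers[j]),
        )
        edges.extend((i, j) for j in others[:k])
    return edges


def fully_connected_edges(n):
    """Directed edges between every ordered pair of distinct blocks."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]


# Hypothetical block centers on one page: two blocks near the top, two near the bottom.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 5.0), (1.0, 5.0)]
print(knn_edges(centers, k=1))                    # [(0, 1), (1, 0), (2, 3), (3, 2)]
print(len(fully_connected_edges(len(centers))))   # 12
```

With k=1 each block links only to its immediate neighbor, so the graph keeps local layout structure, while the fully connected variant grows quadratically with the number of blocks.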

[3] EVA: Expressive Virtual Avatars from Multi-view Videos

Hendrik Junkawitsch,Guoxing Sun,Heming Zhu,Christian Theobalt,Marc Habermann

Main category: cs.CV

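TL;DR: The EVA framework uses a two-layer model (a geometry layer and an appearance layer) to render high-fidelity virtual human avatars in real time, with independent control of facial expressions, body movements, and hand gestures, surpassing existing methods.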

Details

Motivation: Existing methods cannot control facial expressions and body movements fully independently, which limits the expressiveness and realism of virtual avatars.

Method: A two-layer model is designed: a geometry layer for motion and expression tracking, and an appearance layer that models the body and the face separately with decoupled 3D Gaussians.

Result: EVA outperforms existing methods in rendering quality and expressiveness.

Conclusion: EVA is a significant step toward drivable digital human models and enables lifelike virtual avatars.

Abstract: With recent advancements in neural rendering and motion capture algorithms, remarkable progress has been made in photorealistic human avatar modeling, unlocking immense potential for applications in virtual reality, augmented reality, remote communication, and industries such as gaming, film, and medicine. However, existing methods fail to provide complete, faithful, and expressive control over human avatars due to their entangled representation of facial expressions and body movements. In this work, we introduce Expressive Virtual Avatars (EVA), an actor-specific, fully controllable, and expressive human avatar framework that achieves high-fidelity, lifelike renderings in real time while enabling independent control of facial expressions, body movements, and hand gestures. Specifically, our approach designs the human avatar as a two-layer model: an expressive template geometry layer and a 3D Gaussian appearance layer. First, we present an expressive template tracking algorithm that leverages coarse-to-fine optimization to accurately recover body motions, facial expressions, and non-rigid deformation parameters from multi-view videos. Next, we propose a novel decoupled 3D Gaussian appearance model designed to effectively disentangle body and facial appearance. Unlike unified Gaussian estimation approaches, our method employs two specialized and independent modules to model the body and face separately. Experimental results demonstrate that EVA surpasses state-of-the-art methods in terms of rendering quality and expressiveness, validating its effectiveness in creating full-body avatars. This work represents a significant advancement towards fully drivable digital human models, enabling the creation of lifelike digital avatars that faithfully replicate human geometry and appearance.

[4] Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation

Xin Zhang,Ziruo Zhang,Jiawei Du,Zuozhu Liu,Joey Tianyi Zhou

Main category: cs.CV

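TL;DR: The paper proposes RepBlend, a framework that uses representation blending and symmetric projection trajectory matching to address modality collapse in multimodal dataset distillation, significantly improving cross-modal learning quality and efficiency.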

Details

Motivation: Existing multimodal dataset distillation methods suffer from modality collapse, seen as over-concentrated intra-modal representations and a widening distributional gap between modalities, which hurts cross-modal learning.

Method: RepBlend weakens overdominant cross-modal supervision through representation blending and introduces symmetric projection trajectory matching to balance the optimization dynamics across modalities.

Result: On Flickr-30K and MS-COCO, RepBlend significantly outperforms existing methods, with better retrieval performance (e.g., +9.4 IR@10) and up to 6.7x faster distillation.

Conclusion: RepBlend effectively mitigates modality collapse and improves both the performance and the efficiency of multimodal dataset distillation.

Abstract: Multimodal Dataset Distillation (MDD) seeks to condense large-scale image-text datasets into compact surrogates while retaining their effectiveness for cross-modal learning. Despite recent progress, existing MDD approaches often suffer from Modality Collapse, characterized by over-concentrated intra-modal representations and an enlarged distributional gap across modalities. In this paper, for the first time, we identify this issue as stemming from a fundamental conflict between the over-compression behavior inherent in dataset distillation and the cross-modal supervision imposed by contrastive objectives. To alleviate modality collapse, we introduce RepBlend, a novel MDD framework that weakens overdominant cross-modal supervision via representation blending, thereby significantly enhancing intra-modal diversity. Additionally, we observe that current MDD methods impose asymmetric supervision across modalities, resulting in biased optimization. To address this, we propose symmetric projection trajectory matching, which synchronizes the optimization dynamics using modality-specific projection heads, thereby promoting balanced supervision and enhancing cross-modal alignment. Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods, achieving significant gains in retrieval performance (e.g., +9.4 IR@10, +6.3 TR@10 under the 100-pair setting) and offering up to 6.7x distillation speedup.

[5] PlantDreamer: Achieving Realistic 3D Plant Models with Diffusion-Guided Gaussian Splatting

Zane K J Hartley,Lewis A G Stuart,Andrew P French,Michael P Pound

Main category: cs.CV

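TL;DR: PlantDreamer is a new approach to 3D plant generation that combines a depth ControlNet, Low-Rank Adaptation, and a Gaussian culling algorithm to significantly improve the realism and geometric accuracy of generated plants.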

Details

Motivation: Current generative models perform poorly on complex plants, which limits their use in plant analysis tools. PlantDreamer aims to close this gap.

Method: The pipeline uses a depth ControlNet, Low-Rank Adaptation, and a Gaussian culling algorithm, and supports both purely synthetic generation and the enhancement of real point clouds.

Result: PlantDreamer outperforms existing methods at producing high-fidelity 3D plants and can upgrade legacy point cloud datasets.

Conclusion: PlantDreamer advances synthetic plant generation and provides a practical tool for 3D phenotyping.

Abstract: Recent years have seen substantial improvements in the ability to generate synthetic 3D objects using AI. However, generating complex 3D objects, such as plants, remains a considerable challenge. Current generative 3D models struggle with plant generation compared to general objects, limiting their usability in plant analysis tools, which require fine detail and accurate geometry. We introduce PlantDreamer, a novel approach to 3D synthetic plant generation, which can achieve greater levels of realism for complex plant geometry and textures than available text-to-3D models. To achieve this, our new generation pipeline leverages a depth ControlNet, fine-tuned Low-Rank Adaptation and an adaptable Gaussian culling algorithm, which directly improve textural realism and geometric integrity of generated 3D plant models. Additionally, PlantDreamer enables both purely synthetic plant generation, by leveraging L-System-generated meshes, and the enhancement of real-world plant point clouds by converting them into 3D Gaussian Splats. We evaluate our approach by comparing its outputs with state-of-the-art text-to-3D models, demonstrating that PlantDreamer outperforms existing methods in producing high-fidelity synthetic plants. Our results indicate that our approach not only advances synthetic plant generation, but also facilitates the upgrading of legacy point cloud datasets, making it a valuable tool for 3D phenotyping applications.

[6] CrypticBio: A Large Multimodal Dataset for Visually Confusing Biodiversity

Georgiana Manolache,Gerard Schouten,Joaquin Vanschoren

Main category: cs.CV

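TL;DR: CrypticBio is a publicly available multimodal dataset focused on visually hard-to-distinguish species, intended to support the development of AI models for biodiversity.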

Details

Motivation: Existing datasets mostly target a single taxon and are small, so they cannot support broad identification of visually similar species. CrypticBio fills this gap.

Method: The dataset contains 52K unique cryptic groups spanning 67K species and 166 million images, and integrates multimodal information such as multilingual terminology and spatiotemporal data.

Result: Benchmarks show that geographical context has a substantial effect on vision-language zero-shot learning.

Conclusion: CrypticBio is expected to advance biodiversity AI models and help address the challenge of species ambiguity.

Abstract: We present CrypticBio, the largest publicly available multimodal dataset of visually confusing species, specifically curated to support the development of AI models in the context of biodiversity applications. Visually confusing or cryptic species are groups of two or more taxa that are nearly indistinguishable based on visual characteristics alone. While much existing work addresses taxonomic identification in a broad sense, datasets that directly address the morphological confusion of cryptic species are small, manually curated, and target only a single taxon. Thus, the challenge of identifying such subtle differences in a wide range of taxa remains unaddressed. Curated from real-world trends in species misidentification among community annotators of iNaturalist, CrypticBio contains 52K unique cryptic groups spanning 67K species, represented in 166 million images. Rich research-grade image annotations (including scientific, multicultural, and multilingual species terminology, hierarchical taxonomy, spatiotemporal context, and associated cryptic groups) address multimodal AI in biodiversity research. For easy dataset curation, we provide an open-source pipeline CrypticBio-Curate. The multimodal nature of the dataset beyond vision-language arises from the integration of geographical and temporal data as complementary cues to identifying cryptic species. To highlight the importance of the dataset, we benchmark a suite of state-of-the-art foundation models across CrypticBio subsets of common, unseen, endangered, and invasive species, and demonstrate the substantial impact of geographical context on vision-language zero-shot learning for cryptic species. By introducing CrypticBio, we aim to catalyze progress toward real-world-ready biodiversity AI models capable of handling the nuanced challenges of species ambiguity.

[7] DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

Xuan Shen,Chenxia Han,Yufa Zhou,Yanyue Xie,Yifan Gong,Quanyi Wang,Yiwei Wang,Yanzhi Wang,Pu Zhao,Jiuxiang Gu

Main category: cs.CV

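TL;DR: DraftAttention is a training-free framework that accelerates video diffusion transformers with dynamic sparse attention, significantly reducing computational cost.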

Details

Motivation: Current diffusion transformer-based video generation models (DiTs) are hard to deploy in practice because of their high computational cost (attention accounts for over 80% of latency), so optimization is needed.

Method: Feature maps in the compressed latent space are down-sampled to produce a low-resolution draft attention map that exposes spatial and temporal redundancy; tokens are then reordered so that sparse attention can be computed in a structured way.

Result: The method outperforms existing sparse attention approaches in video generation quality and achieves up to 1.75x end-to-end speedup on GPUs.

Conclusion: DraftAttention lowers computational cost through dynamic sparse attention and offers a practical route to deploying video diffusion transformers.

Abstract: Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes, posing serious challenges to practical application and scalability. To address this, we propose DraftAttention, a training-free framework for the acceleration of video diffusion transformers with dynamic sparse attention on GPUs. We apply down-sampling to each feature map across frames in the compressed latent space, enabling a higher-level receptive field over the latent composed of hundreds of thousands of tokens. The low-resolution draft attention map, derived from draft query and key, exposes redundancy both spatially within each feature map and temporally across frames. We reorder the query, key, and value based on the draft attention map to guide the sparse attention computation in full resolution, and subsequently restore their original order after the attention computation. This reordering enables structured sparsity that aligns with hardware-optimized execution. Our theoretical analysis demonstrates that the low-resolution draft attention closely approximates the full attention, providing reliable guidance for constructing accurate sparse attention. Experimental results show that our method outperforms existing sparse attention approaches in video generation quality and achieves up to 1.75x end-to-end speedup on GPUs. Code: https://github.com/shawnricecake/draft-attention
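The idea of deriving a sparse pattern from a low-resolution draft can be sketched in a few lines. This is a toy illustration rather than the released implementation: it assumes a 1D token sequence, average pooling as the down-sampling step, and a hypothetical `keep` parameter for how many key blocks each query block retains. The reordering step described in the abstract is omitted.

```python
def avg_pool(vectors, pool):
    """Average consecutive groups of `pool` vectors (the last group may be shorter)."""
    pooled = []
    for start in range(0, len(vectors), pool):
        group = vectors[start:start + pool]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def draft_block_mask(queries, keys, pool, keep):
    """For each pooled query block, list the `keep` key blocks with the highest draft score."""
    draft_q, draft_k = avg_pool(queries, pool), avg_pool(keys, pool)
    mask = []
    for q in draft_q:
        # Raw dot products suffice here: softmax is monotonic, so the top-k set is unchanged.
        scores = [sum(a * b for a, b in zip(q, k)) for k in draft_k]
        top = sorted(range(len(draft_k)), key=lambda j: scores[j], reverse=True)[:keep]
        mask.append(sorted(top))
    return mask


# Four tokens forming two blocks with orthogonal features (made-up values).
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(draft_block_mask(queries, keys, pool=2, keep=1))  # [[0], [1]]
```

Full-resolution attention would then be computed only between each query block and its selected key blocks, which is where the savings come from.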

[8] FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge

Xuan Shen,Weize Ma,Yufa Zhou,Enhao Tang,Yanyue Xie,Zhengang Li,Yifan Gong,Quanyi Wang,Henghui Ding,Yiwei Wang,Yanzhi Wang,Pu Zhao,Jun Lin,Jiuxiang Gu

Main category: cs.CV

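TL;DR: The FastCar framework accelerates the decode phase of auto-regressive video generation by exploiting temporal redundancy. It proposes the Temporal Attention Score (TAS) and a hardware accelerator, significantly improving decoding speed and energy efficiency.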

Details

Motivation: Auto-regressive models are inefficient for video generation because of the large number of tokens required; MLP modules dominate decode-phase latency, and their outputs are temporally redundant.

Method: FastCar uses TAS to decide whether to reuse the previous frame's MLP outputs and so reduce computation, and an FPGA-based hardware accelerator is developed.

Result: FastCar outperforms traditional sparse attention approaches with more than 2.1x decoding speedup and higher energy efficiency, and it can also improve the performance of sparse attention when the two are combined.

Conclusion: FastCar offers an efficient solution for high-resolution and long-duration video generation.

Abstract: Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency. Unlike image generation, video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. Our key observations are: (i) MLP modules in the decode phase dominate the inference latency, and (ii) there exists high temporal redundancy in MLP outputs of adjacent frames. In this paper, we propose the FastCar framework to accelerate the decode phase for the AR video generation by exploring the temporal redundancy. The Temporal Attention Score (TAS) is proposed to determine whether to apply the replay strategy (i.e., reusing cached MLP outputs from the previous frame to reduce redundant computations) with detailed theoretical analysis and justification. Also, we develop a hardware accelerator on FPGA with Dynamic Resource Scheduling (DRS) based on TAS to enable better resource utilization and faster inference. Experimental results demonstrate the effectiveness of our method, which outperforms traditional sparse attention approaches with more than 2.1x decoding speedup and higher energy efficiency on the edge. Furthermore, by combining FastCar and sparse attention, FastCar can boost the performance of sparse attention with alleviated drifting, demonstrating our unique advantages for high-resolution and long-duration video generation. Code: https://github.com/shawnricecake/fast-car
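The replay strategy can be pictured as a cache wrapped around an MLP. The sketch below is only an illustration: the abstract does not define how TAS is computed, so cosine similarity between the current and the cached input stands in for it, and the `ReplayMLP` class and its `threshold` parameter are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity, used here as a stand-in score (NOT the paper's TAS)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ReplayMLP:
    """Wraps an MLP and replays its cached output when consecutive inputs are similar."""

    def __init__(self, mlp, threshold):
        self.mlp = mlp
        self.threshold = threshold
        self.cached_input = None
        self.cached_output = None
        self.calls = 0  # number of real MLP evaluations

    def __call__(self, x):
        if self.cached_input is not None and cosine(x, self.cached_input) >= self.threshold:
            return self.cached_output  # replay: skip the MLP entirely
        self.calls += 1
        self.cached_input, self.cached_output = x, self.mlp(x)
        return self.cached_output


layer = ReplayMLP(lambda x: [2 * v for v in x], threshold=0.99)
print(layer([1.0, 0.0]))   # computed: [2.0, 0.0]
print(layer([1.0, 0.01]))  # near-identical input, replayed: [2.0, 0.0]
print(layer([0.0, 1.0]))   # dissimilar input, recomputed: [0.0, 2.0]
print(layer.calls)         # 2
```

In this sketch a replay leaves the cache untouched, so the comparison is always against the last input that was actually computed; comparing against the immediately preceding frame instead would be an equally plausible choice.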

[9] KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection

Tuan-Vinh La,Minh-Hieu Nguyen,Minh-Son Dao

Main category: cs.CV

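TL;DR: The paper proposes a multimodal fake news detection framework that combines vision, text, and knowledge. It captures fine-grained objects, global image semantics, and contextual text encodings, and uses a knowledge graph to deepen semantic understanding, significantly improving detection.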

Details

Motivation: Fake news detection is challenging because of the interplay between misleading text, manipulated images, and external knowledge reasoning; existing methods neglect local object details and the use of external knowledge.

Method: A multimodal framework combines vision (fine-grained object capture and global semantics), text (RoBERTa encoding), and a knowledge graph (entity retrieval and selection), and predicts news veracity with a Transformer-based classifier.

Result: The model outperforms existing methods, validating the neighbor selection mechanism and multimodal fusion.

Conclusion: Knowledge-grounded multimodal reasoning and semantic verification offer a new paradigm for fake news detection; the code is open source.

Abstract: Fake news detection remains a challenging problem due to the complex interplay between textual misinformation, manipulated images, and external knowledge reasoning. While existing approaches have achieved notable results in verifying veracity and cross-modal consistency, two key challenges persist: (1) Existing methods often consider only the global image context while neglecting local object-level details, and (2) they fail to incorporate external knowledge and entity relationships for deeper semantic understanding. To address these challenges, we propose a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations. Our approach leverages bottom-up attention to capture fine-grained object details, CLIP for global image semantics, and RoBERTa for context-aware text encoding. We further enhance knowledge utilization by retrieving and adaptively selecting relevant entities from a knowledge graph. The fused multi-modal features are processed through a Transformer-based classifier to predict news veracity. Experimental results demonstrate that our model outperforms recent approaches, showcasing the effectiveness of neighbor selection mechanism and multi-modal fusion for fake news detection. Our proposal introduces a new paradigm: knowledge-grounded multimodal reasoning. By integrating explicit entity-level selection and NLI-guided filtering, we shift fake news detection from feature fusion to semantically grounded verification. For reproducibility and further research, the source code is publicly available at https://github.com/latuanvinh1998/KGAlign.

[10] Enhancing Shape Perception and Segmentation Consistency for Industrial Image Inspection

Guoxuan Mao,Ting Cao,Ziyang Li,Yuan Dong

Main category: cs.CV

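TL;DR: The paper proposes a Shape-Aware Efficient Network (SPENet) that separately supervises the extraction of boundary and body information to achieve strong semantic segmentation consistency, and introduces the Variable Boundary Domain (VBD) and a new metric, CMSE.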

Details

Motivation: In industrial image inspection, conventional semantic segmentation models cannot keep the segmentation of fixed components consistent because they lack perception of object contours, and the models must also run in real time with low computational complexity.

Method: SPENet separately extracts and supervises boundary and body information, introduces VBD to describe fuzzy boundaries, and proposes CMSE to measure segmentation consistency.

Result: On the authors' dataset, SPENet attains the best segmentation accuracy with competitive speed, and its CMSE is over 50% lower than that of other real-time segmentation networks.

Conclusion: SPENet performs well in industrial image inspection, balancing segmentation consistency and computational efficiency.

Abstract: Semantic segmentation stands as a pivotal research focus in computer vision. In the context of industrial image inspection, conventional semantic segmentation models fail to maintain the segmentation consistency of fixed components across varying contextual environments due to a lack of perception of object contours. Given the real-time constraints and limited computing capability of industrial image detection machines, it is also necessary to create efficient models to reduce computational complexity. In this work, a Shape-Aware Efficient Network (SPENet) is proposed, which focuses on the shapes of objects to achieve excellent segmentation consistency by separately supervising the extraction of boundary and body information from images. In SPENet, a novel method named Variable Boundary Domain (VBD) is introduced for describing fuzzy boundaries to better adapt to real-world scenarios. Additionally, a new metric, Consistency Mean Square Error (CMSE), is proposed to measure segmentation consistency for fixed components. Our approach attains the best segmentation accuracy and competitive speed on our dataset, showcasing significant advantages in CMSE among numerous state-of-the-art real-time segmentation networks, achieving a reduction of over 50% compared to the previously top-performing models.

[11] MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

Wei Hua,Chenlin Zhou,Jibin Wu,Yansong Chua,Yangyang Shu

Main category: cs.CV

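TL;DR: The paper proposes MSVIT, a new spike-driven Transformer architecture that uses multi-scale spiking attention (MSSA) to overcome the bottleneck of existing SNN-Transformer architectures in extracting multi-scale image features, and validates its performance on multiple datasets.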

Details

Motivation: Existing SNN-Transformer architectures have a bottleneck in multi-scale feature extraction that limits their performance.

Method: The MSVIT architecture introduces multi-scale spiking attention (MSSA) to strengthen spiking attention blocks.

Result: Experiments show that MSVIT outperforms existing SNN-Transformer models on multiple datasets, making it a state-of-the-art solution in this area.

Conclusion: Multi-scale spiking attention effectively improves SNN-Transformer performance and offers a new direction for energy-efficient computing.

Abstract: The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention due to the great potential for energy-efficient and high-performance computing paradigms. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. While existing methods propose spiking self-attention mechanisms that are successfully combined with SNNs, the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting features from different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture, which firstly uses multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks. We validate our approach across various mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.

[12] MORALISE: A Structured Benchmark for Moral Alignment in Visual Language Models

Xiao Lin,Zhining Liu,Ze Yang,Gaotang Li,Ruizhong Qiu,Shuke Wang,Hui Liu,Haotian Li,Sumit Keswani,Vishwa Pardeshi,Huijun Zhao,Wei Fan,Hanghang Tong

Main category: cs.CV

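TL;DR: The paper introduces MORALISE, a benchmark for evaluating the moral alignment of vision-language models (VLMs); its diverse, expert-verified data reveals the moral limitations of existing models.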

Details

Motivation: Vision-language models are widely deployed in high-stakes domains, so their outputs must align with human moral values, yet existing work is either text-only or biased by AI-generated data.

Method: Building on Turiel's Domain Theory, the authors construct a taxonomy of 13 moral topics, manually annotate 2,481 high-quality image-text pairs, and design two evaluation tasks: moral judgment and moral norm attribution.

Result: Experiments show that MORALISE poses a significant challenge to 19 popular VLMs and reveals the moral limitations of current models.

Conclusion: MORALISE provides an important tool for evaluating and improving the moral alignment of VLMs and supports their use in morally sensitive domains.

Abstract: Warning: This paper contains examples of harmful language and images. Reader discretion is advised. Recently, vision-language models have demonstrated increasing influence in morally sensitive domains such as autonomous driving and medical analysis, owing to their powerful multimodal reasoning capabilities. As these models are deployed in high-stakes real-world applications, it is of paramount importance to ensure that their outputs align with human moral values and remain within moral boundaries. However, existing work on moral alignment either focuses solely on textual modalities or relies heavily on AI-generated images, leading to distributional biases and reduced realism. To overcome these limitations, we introduce MORALISE, a comprehensive benchmark for evaluating the moral alignment of vision-language models (VLMs) using diverse, expert-verified real-world data. We begin by proposing a comprehensive taxonomy of 13 moral topics grounded in Turiel's Domain Theory, spanning the personal, interpersonal, and societal moral domains encountered in everyday life. Built on this framework, we manually curate 2,481 high-quality image-text pairs, each annotated with two fine-grained labels: (1) topic annotation, identifying the violated moral topic(s), and (2) modality annotation, indicating whether the violation arises from the image or the text. For evaluation, we include two tasks, moral judgment and moral norm attribution, to assess models' awareness of moral violations and their reasoning ability on morally salient content. Extensive experiments on 19 popular open- and closed-source VLMs show that MORALISE poses a significant challenge, revealing persistent moral limitations in current state-of-the-art models. The full benchmark is publicly available at https://huggingface.co/datasets/Ze1025/MORALISE.

[13] Uncovering Cultural Representation Disparities in Vision-Language Models

Ram Mohan Rao Kadiyala,Siddhant Gupta,Jebish Purbey,Srishti Yadav,Alejandro Salamanca,Desmond Elliott

Main category: cs.CV

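TL;DR: The study examines the cultural biases that vision-language models (VLMs) show on an image-based country identification task and finds that model performance varies significantly across countries and question formats.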

Details

Motivation: Although vision-language models perform well across many tasks, their potential cultural biases are a concern. This study assesses how biased VLMs are when identifying different countries.

Method: Using the Country211 dataset, several large VLMs are evaluated under various prompting strategies, including open-ended questions and multiple-choice questions (with multilingual and adversarial settings).

Result: Model performance varies significantly across countries and question formats, suggesting that VLMs inherit biases from their pre-training data, which affects their ability to generalize across diverse global contexts.

Conclusion: VLMs have strong visual understanding, but their performance depends on the distribution and scale of their pre-training data, and further work is needed to reduce cultural bias.

Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities across a range of tasks, yet concerns about their potential biases exist. This work investigates the extent to which prominent VLMs exhibit cultural biases by evaluating their performance on an image-based country identification task at a country level. Utilizing the geographically diverse Country211 dataset, we probe several large vision language models (VLMs) under various prompting strategies: open-ended questions, multiple-choice questions (MCQs) including challenging setups like multilingual and adversarial settings. Our analysis aims to uncover disparities in model accuracy across different countries and question formats, providing insights into how training data distribution and evaluation methodologies might influence cultural biases in VLMs. The findings highlight significant variations in performance, suggesting that while VLMs possess considerable visual understanding, they inherit biases from their pre-training data and scale that impact their ability to generalize uniformly across diverse global contexts.

[14] Leveraging Generative AI Models to Explore Human Identity

Yunha Yeo,Daeho Um

Main category: cs.CV

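TL;DR: The paper uses face images generated by diffusion models to explore how human identity forms, and indirectly confirms the influence of external factors on human identity.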

Details

Motivation: To explore how human identity forms, using diffusion-generated face images to study indirectly the relationship between identity and external factors.

Method: A diffusion model generates face images; the external input is varied and the changes in the generated images are observed, establishing a correspondence between the model's generation process and human identity formation.

Result: Changes in the external input significantly alter the generated face images, indirectly confirming that human identity formation depends on external factors.

Conclusion: Human identity is fluid and strongly shaped by external factors, a view further expressed in the video artwork Fluidity of Human Identity.

Abstract: This paper attempts to explore human identity by utilizing neural networks in an indirect manner. For this exploration, we adopt diffusion models, state-of-the-art AI generative models trained to create human face images. By relating the generated human face to human identity, we establish a correspondence between the face image generation process of the diffusion model and the process of human identity formation. Through experiments with the diffusion model, we observe that changes in its external input result in significant changes in the generated face image. Based on the correspondence, we indirectly confirm the dependence of human identity on external factors in the process of human identity formation. Furthermore, we introduce Fluidity of Human Identity, a video artwork that expresses the fluid nature of human identity affected by varying external factors. The video is available at https://www.behance.net/gallery/219958453/Fluidity-of-Human-Identity?.

[15] Open-Set Semi-Supervised Learning for Long-Tailed Medical Datasets

Daniya Najiha A. Kareem,Jean Lahoud,Mustansar Fiaz,Amandeep Kumar,Hisham Cholakkal

Main category: cs.CV

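TL;DR: The paper proposes an open-set learning method for highly imbalanced medical datasets. Using a semi-supervised approach and feature-level regularization, it significantly improves both closed-set and open-set performance.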

Details

Motivation: To address data imbalance and unseen classes in medical image recognition so that models generalize better in real-world scenarios.

Method: A semi-supervised learning approach combined with a feature-level regularization strategy and a classifier normalization technique.

Result: Closed-set and open-set classification accuracy improves significantly on ISIC2018, ISIC2019, and TissueMNIST.

Conclusion: The proposed method effectively counters the impact of long-tailed data on classification performance; the authors state that code and models will be released.

Abstract: Many practical medical imaging scenarios include categories that are under-represented but still crucial. The relevance of image recognition models to real-world applications lies in their ability to generalize to these rare classes as well as unseen classes. Real-world generalization requires taking into account the various complexities that can be encountered in the real world. First, training data is highly imbalanced, which may lead to the model exhibiting bias toward the more frequently represented classes. Moreover, real-world data may contain unseen classes that need to be identified, and model performance is affected by the data scarcity. While medical image recognition has been extensively addressed in the literature, current methods do not take into account all the intricacies in real-world scenarios. To this end, we propose an open-set learning method for highly imbalanced medical datasets using a semi-supervised approach. Understanding the adverse impact of long-tail distribution at the inherent model characteristics, we implement a regularization strategy at the feature level complemented by a classifier normalization technique. We conduct extensive experiments on the publicly available datasets, ISIC2018, ISIC2019, and TissueMNIST with various numbers of labelled samples. Our analysis shows that addressing the impact of long-tail data in classification significantly improves the overall performance of the network in terms of closed-set and open-set accuracies on all datasets. Our code and trained models will be made publicly available at https://github.com/Daniyanaj/OpenLTR.

[16] Colors Matter: AI-Driven Exploration of Human Feature Colors

Rama Alyoubi,Taif Alharbi,Albatul Alghamdi,Yara Alshehri,Elham Alghamdi

Main category: cs.CV

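TL;DR: The study presents a framework that combines imaging techniques and machine learning to extract and classify key human attributes (skin tone, hair color, iris color, and vein-based undertones). The multi-stage pipeline reaches up to 80% accuracy in tone classification.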

Details

Motivation: To provide more precise and inclusive classification for areas such as beauty technology, digital personalization, and visual analytics through AI-driven color analysis and feature extraction.

Method: A multi-stage pipeline of face detection, region segmentation, and dominant color extraction, combined with X-means clustering and the Delta E (CIEDE2000) distance metric, differentiates colors in the LAB and HSV color spaces.

Result: With the Delta E-HSV method and Gaussian blur, the system reaches up to 80% accuracy in tone classification and performs reliably across varied lighting and image conditions.

Conclusion: The framework shows the potential of AI for color analysis and feature extraction and can supply precise, inclusive solutions for related applications.

Abstract: This study presents a robust framework that leverages advanced imaging techniques and machine learning for feature extraction and classification of key human attributes, namely skin tone, hair color, iris color, and vein-based undertones. The system employs a multi-stage pipeline involving face detection, region segmentation, and dominant color extraction to isolate and analyze these features. Techniques such as X-means clustering, alongside perceptually uniform distance metrics like Delta E (CIEDE2000), are applied within both LAB and HSV color spaces to enhance the accuracy of color differentiation. For classification, the dominant tones of the skin, hair, and iris are extracted and matched to a custom tone scale, while vein analysis from wrist images enables undertone classification into "Warm" or "Cool" based on LAB differences. Each module uses targeted segmentation and color space transformations to ensure perceptual precision. The system achieves up to 80% accuracy in tone classification using the Delta E-HSV method with Gaussian blur, demonstrating reliable performance across varied lighting and image conditions. This work highlights the potential of AI-powered color analysis and feature extraction for delivering inclusive, precise, and nuanced classification, supporting applications in beauty technology, digital personalization, and visual analytics.
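Matching a dominant color to a tone scale amounts to a nearest-neighbor lookup under a color-difference metric. The sketch below swaps in the simpler CIE76 Delta E (plain Euclidean distance in LAB) instead of the CIEDE2000 formula the abstract names, and the three-entry `TONE_SCALE` with its LAB values is entirely hypothetical, since the paper's custom scale is not given.

```python
import math


def delta_e_cie76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between two (L, a, b) triples."""
    return math.dist(lab1, lab2)


# Hypothetical tone scale; names and LAB values are made up for illustration.
TONE_SCALE = {
    "light": (80.0, 10.0, 15.0),
    "medium": (60.0, 15.0, 25.0),
    "deep": (35.0, 15.0, 20.0),
}


def nearest_tone(lab, scale=TONE_SCALE):
    """Return the scale entry with the smallest Delta E to the given color."""
    return min(scale, key=lambda name: delta_e_cie76(lab, scale[name]))


print(delta_e_cie76((50.0, 0.0, 0.0), (50.0, 3.0, 4.0)))  # 5.0
print(nearest_tone((62.0, 14.0, 24.0)))                   # medium
```

CIEDE2000 adds lightness, chroma, and hue weighting on top of this distance, so it can rank borderline colors differently from CIE76; the lookup structure stays the same.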

[17] Programmatic Video Prediction Using Large Language Models

Hao Tang,Kevin Ellis,Suhas Lohit,Michael J. Jones,Moitreya Chatterjee

Main category: cs.CV

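TL;DR: ProgGen leverages the inductive biases of large (vision) language models (LLM/VLM) to predict video frames through a neuro-symbolic state representation, outperforming competing methods.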

Details

Motivation: Tasks such as video surveillance, robotics, and autonomous driving need predictions of the visual future, which means synthesizing plausible future frames from only a few frames of video. Method: ProgGen uses an LLM/VLM to synthesize programs that (i) estimate the video states, (ii) predict future states, and (iii) render states as RGB frames. Result: In the PhyWorld and Cart Pole environments, ProgGen outperforms competing techniques at video frame prediction and supports counterfactual reasoning and interpretable video generation. Conclusion: ProgGen is effective and generalizable for video generation tasks. Abstract: The task of estimating the world model describing the dynamics of a real world process assumes immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics applications, autonomous driving, etc., this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video using a neuro-symbolic, human-interpretable set of states (one per frame) by leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen utilizes LLM/VLM to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e. the frames); (ii) to predict the states corresponding to future time steps by estimating the transition dynamics; (iii) to render the predicted states as visual RGB-frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld (ii) Cart Pole. Additionally, ProgGen permits counter-factual reasoning and interpretable video generation attesting to its effectiveness and generalizability for video generation tasks.
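
The three program roles in the abstract (state estimation, transition prediction, rendering) can be illustrated with a toy Python sketch. Everything here is an assumption for illustration: the "video" is a one-row frame with a single bright pixel, the dynamics are constant velocity, and the three functions are written by hand rather than synthesized by an LLM/VLM as in ProgGen.

```python
def estimate_state(frame):
    """Stage (i): map a frame to a symbolic state (here, the ball's column index)."""
    return frame.index(1)

def predict_states(states, steps):
    """Stage (ii): extrapolate future states, assuming constant velocity
    estimated from the last two observed states."""
    velocity = states[-1] - states[-2]
    return [states[-1] + velocity * (k + 1) for k in range(steps)]

def render(state, width):
    """Stage (iii): turn a symbolic state back into a frame."""
    return [1 if i == state else 0 for i in range(width)]

def predict_frames(context, steps):
    """Chain the three stages: frames -> states -> future states -> frames."""
    width = len(context[0])
    states = [estimate_state(f) for f in context]
    return [render(s, width) for s in predict_states(states, steps)]

context = [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]]
print(predict_frames(context, 2))
```

Because the intermediate states are explicit, a counterfactual query amounts to editing a state (for example, changing the velocity) before rendering, which is the kind of interpretability the abstract points to.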

Jose Sosa,Danila Rukhovich,Anis Kacem,Djamila Aouada

Main category: cs.CV

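TL;DR: This paper proposes a flexible multi-modal, multi-task pre-training strategy (MultiMAE) that addresses the limited transfer of multi-modal pre-trained models to downstream tasks on Earth Observation data.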

Details

Motivation: Multi-modal Earth Observation data is an opportunity for pre-training deep learning models, but existing methods transfer poorly when the structure of the downstream data differs from the pre-training data. Method: A Multi-modal Multi-task Masked Autoencoder (MultiMAE) is pre-trained by reconstructing several input modalities, including spectral, elevation, and segmentation data. Result: The pre-trained model outperforms existing methods on classification and segmentation tasks and handles diverse input configurations. Conclusion: The MultiMAE strategy improves transfer learning on multi-modal Earth Observation data; the authors state that code will be released. Abstract: Multi-modal data in Earth Observation (EO) presents a huge opportunity for improving transfer learning capabilities when pre-training deep learning models. Unlike prior work that often overlooks multi-modal EO data, recent methods have started to include it, resulting in more effective pre-training strategies. However, existing approaches commonly face challenges in effectively transferring learning to downstream tasks where the structure of available data differs from that used during pre-training. This paper addresses this limitation by exploring a more flexible multi-modal, multi-task pre-training strategy for EO data. Specifically, we adopt a Multi-modal Multi-task Masked Autoencoder (MultiMAE) that we pre-train by reconstructing diverse input modalities, including spectral, elevation, and segmentation data. The pre-trained model demonstrates robust transfer learning capabilities, outperforming state-of-the-art methods on various EO datasets for classification and segmentation tasks. Our approach exhibits significant flexibility, handling diverse input configurations without requiring modality-specific pre-trained models. Code will be available at: https://github.com/josesosajs/multimae-meets-eo.

[19] Data Augmentation and Resolution Enhancement using GANs and Diffusion Models for Tree Segmentation

Alessandro dos Santos Ferreira,Ana Paula Marques Ramos,José Marcato Junior,Wesley Nunes Gonçalves

Main category: cs.CV

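TL;DR: Proposes a pipeline that combines GANs and diffusion models to enhance low-resolution aerial images, enabling effective tree segmentation without large volumes of annotated data.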

Details

Motivation: Urban forests matter for environmental quality and biodiversity, but tree detection is hard because of complex landscapes and varying image resolution, and deep learning depends on large labeled datasets. Method: Integrates domain adaptation with GANs and diffusion models (pix2pix, Real-ESRGAN, Latent Diffusion, and Stable Diffusion) to generate high-quality synthetic samples that expand the training data. Result: Experiments show an improvement of over 50% in IoU on low-resolution images compared with traditional pipelines. Conclusion: The method offers a scalable, replicable solution for remote sensing scenarios, especially when annotation resources are scarce. Abstract: Urban forests play a key role in enhancing environmental quality and supporting biodiversity in cities. Mapping and monitoring these green spaces are crucial for urban planning and conservation, yet accurately detecting trees is challenging due to complex landscapes and the variability in image resolution caused by different satellite sensors or UAV flight altitudes. While deep learning architectures have shown promise in addressing these challenges, their effectiveness remains strongly dependent on the availability of large and manually labeled datasets, which are often expensive and difficult to obtain in sufficient quantity. In this work, we propose a novel pipeline that integrates domain adaptation with GANs and Diffusion models to enhance the quality of low-resolution aerial images. Our proposed pipeline enhances low-resolution imagery while preserving semantic content, enabling effective tree segmentation without requiring large volumes of manually annotated data. Leveraging models such as pix2pix, Real-ESRGAN, Latent Diffusion, and Stable Diffusion, we generate realistic and structurally consistent synthetic samples that expand the training dataset and unify scale across domains. This approach not only improves the robustness of segmentation models across different acquisition conditions but also provides a scalable and replicable solution for remote sensing scenarios with scarce annotation resources. Experimental results demonstrated an improvement of over 50% in IoU for low-resolution images, highlighting the effectiveness of our method compared to traditional pipelines.
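
For readers less familiar with the reported metric, the following Python sketch computes intersection over union (IoU) for two binary masks given as flat 0/1 lists. The mask values are made up for illustration, and the abstract does not say whether its "over 50%" figure is a relative or an absolute gain.

```python
def iou(pred, target):
    """Intersection over union of two equal-length binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention assumed here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

pred = [1, 1, 0, 0]    # hypothetical predicted tree mask
target = [1, 0, 1, 0]  # hypothetical ground-truth mask
print(iou(pred, target))
```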

[20] iPad: Iterative Proposal-centric End-to-End Autonomous Driving

Ke Guo,Haochen Liu,Xiaojun Wu,Jia Pan,Chen Lv

Main category: cs.CV

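TL;DR: The paper proposes iPad, an end-to-end autonomous driving framework that iteratively refines candidate plans (proposals) and their features, improving both efficiency and planning awareness.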

Details

Motivation: Conventional end-to-end methods generate plans from dense bird's-eye view features, which is inefficient and gives limited planning awareness; iPad targets both problems. Method: iPad is built around proposals: ProFormer iteratively refines the proposals and their features, and two lightweight auxiliary tasks (mapping and prediction) improve planning quality. Result: On the NAVSIM and CARLA Bench2Drive benchmarks, iPad achieves state-of-the-art performance while being significantly more efficient than prior leading methods. Conclusion: Through iterative refinement and lightweight auxiliary tasks, iPad offers an efficient, high-performing approach to end-to-end autonomous driving. Abstract: End-to-end (E2E) autonomous driving systems offer a promising alternative to traditional modular pipelines by reducing information loss and error accumulation, with significant potential to enhance both mobility and safety. However, most existing E2E approaches directly generate plans based on dense bird's-eye view (BEV) grid features, leading to inefficiency and limited planning awareness. To address these limitations, we propose iterative Proposal-centric autonomous driving (iPad), a novel framework that places proposals - a set of candidate future plans - at the center of feature extraction and auxiliary tasks. Central to iPad is ProFormer, a BEV encoder that iteratively refines proposals and their associated features through proposal-anchored attention, effectively fusing multi-view image data. Additionally, we introduce two lightweight, proposal-centric auxiliary tasks - mapping and prediction - that improve planning quality with minimal computational overhead. Extensive experiments on the NAVSIM and CARLA Bench2Drive benchmarks demonstrate that iPad achieves state-of-the-art performance while being significantly more efficient than prior leading methods.

[21] Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding

Ta Duc Huy,Duy Anh Huynh,Yutong Xie,Yuankai Qi,Qi Chen,Phi Le Nguyen,Sen Kim Tran,Son Lam Phung,Anton van den Hengel,Zhibin Liao,Minh-Son To,Johan W. Verjans,Vu Minh Hieu Phan

Main category: cs.CV

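TL;DR: The paper proposes Disease-Aware Prompting (DAP), which amplifies disease-relevant regions and suppresses background interference, substantially improving visual grounding accuracy in medical images.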

Details

Motivation: Current vision-language models struggle to associate textual descriptions with disease regions in medical images, mainly because of inefficient attention mechanisms and a lack of fine-grained representations. Method: DAP uses the explainability map of a vision-language model to identify the appropriate image features, with no additional pixel-level annotations. Result: DAP improves visual grounding accuracy by 20.74% over existing methods across three major chest X-ray datasets. Conclusion: DAP is a simple, effective method that substantially improves visual grounding in medical images. Abstract: Visual grounding (VG) is the capability to identify the specific regions in an image associated with a particular text description. In medical imaging, VG enhances interpretability by highlighting relevant pathological features corresponding to textual descriptions, improving model transparency and trustworthiness for wider adoption of deep learning models in clinical practice. Current models struggle to associate textual descriptions with disease regions due to inefficient attention mechanisms and a lack of fine-grained token representations. In this paper, we empirically demonstrate two key observations. First, current VLMs assign high norms to background tokens, diverting the model's attention from regions of disease. Second, the global tokens used for cross-modal learning are not representative of local disease tokens. This hampers identifying correlations between the text and disease tokens. To address this, we introduce a simple yet effective Disease-Aware Prompting (DAP) process, which uses the explainability map of a VLM to identify the appropriate image features. This simple strategy amplifies disease-relevant regions while suppressing background interference. Without any additional pixel-level annotations, DAP improves visual grounding accuracy by 20.74% compared to state-of-the-art methods across three major chest X-ray datasets.

[22] DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer

Haiduo Huang,Jiangcheng Song,Yadong Zhang,Pengju Ren

Main category: cs.CV

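TL;DR: DeepKD improves knowledge distillation through dual-level decoupling and adaptive denoising, addressing both the conflict between target-class and non-target-class knowledge flows and the noise in low-confidence dark knowledge.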

Details

Motivation: Existing methods overlook the conflict between target-class and non-target-class knowledge flows, and low-confidence dark knowledge introduces noisy signals that hinder knowledge transfer. Method: Independent momentum updaters are designed from the gradient signal-to-noise ratio (GSNR) characteristics of each component, and a dynamic top-k mask gradually brings more non-target classes into training. Result: Experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate DeepKD's effectiveness. Conclusion: The decoupling and denoising mechanisms improve knowledge distillation performance. Abstract: Recent advances in knowledge distillation have emphasized the importance of decoupling different knowledge components. While existing methods utilize momentum mechanisms to separate task-oriented and distillation gradients, they overlook the inherent conflict between target-class and non-target-class knowledge flows. Furthermore, low-confidence dark knowledge in non-target classes introduces noisy signals that hinder effective knowledge transfer. To address these limitations, we propose DeepKD, a novel training framework that integrates dual-level decoupling with adaptive denoising. First, through theoretical analysis of gradient signal-to-noise ratio (GSNR) characteristics in task-oriented and non-task-oriented knowledge distillation, we design independent momentum updaters for each component to prevent mutual interference. We observe that the optimal momentum coefficients for task-oriented gradient (TOG), target-class gradient (TCG), and non-target-class gradient (NCG) should be positively related to their GSNR. Second, we introduce a dynamic top-k mask (DTM) mechanism that gradually increases K from a small initial value to incorporate more non-target classes as training progresses, following curriculum learning principles. The DTM jointly filters low-confidence logits from both teacher and student models, effectively purifying dark knowledge during early training. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate DeepKD's effectiveness. Our code is available at https://github.com/haiduo/DeepKD.
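
The dynamic top-k mask described in the abstract can be sketched in Python as follows. This is an illustrative reading, not the DeepKD code: the linear growth of K, the ranking by teacher logits, and the function names are all assumptions, since the abstract says only that K grows from a small initial value and that low-confidence logits are filtered from both teacher and student.

```python
def k_schedule(epoch, total_epochs, k_start, k_end):
    """Grow K linearly from k_start to k_end over training (assumed schedule)."""
    frac = epoch / max(total_epochs - 1, 1)
    return round(k_start + frac * (k_end - k_start))

def top_k_mask(teacher_logits, target, k):
    """Indices of the k non-target classes with the highest teacher logits
    (ranking by the teacher is an assumption of this sketch)."""
    non_target = [i for i in range(len(teacher_logits)) if i != target]
    ranked = sorted(non_target, key=lambda i: teacher_logits[i], reverse=True)
    return set(ranked[:k])

def masked_logits(logits, keep):
    """Keep only the selected classes; the same mask is applied to
    teacher and student so both see the same non-target subset."""
    return [logits[i] for i in sorted(keep)]

teacher = [5.0, 1.0, 3.0, 0.5, 2.0]
student = [4.0, 0.8, 2.5, 0.9, 1.0]
keep = top_k_mask(teacher, target=0, k=k_schedule(0, 10, 2, 4))
print(masked_logits(teacher, keep), masked_logits(student, keep))
```

Early in training only the most confident non-target classes enter the distillation loss; as K grows, the mask admits progressively more of them, which matches the curriculum idea stated in the abstract.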

[23] Multispectral Detection Transformer with Infrared-Centric Sensor Fusion

Seongmin Hwang,Daeyoung Han,Moongu Jeon

Main category: cs.CV

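TL;DR: IC-Fusion is a multispectral object detector that fuses visible and infrared features through a lightweight, modality-aware design, using the semantic context of RGB and the high-frequency information of IR to improve detection.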

Details

Motivation: Multispectral object detection aims to combine complementary information from the visible (RGB) and infrared (IR) modalities for robust performance under diverse environmental conditions. Method: A compact RGB backbone is paired with a Multi-Scale Feature Distillation (MSFD) block that enhances RGB features and a three-stage fusion block (with CCSG and CLKG gates) that promotes cross-modal interaction. Result: Experiments on the FLIR and LLVIP benchmarks demonstrate the effectiveness and efficiency of the IR-centric fusion strategy. Conclusion: IC-Fusion's modality-aware design fuses RGB and IR features and improves multispectral object detection. Abstract: Multispectral object detection aims to leverage complementary information from visible (RGB) and infrared (IR) modalities to enable robust performance under diverse environmental conditions. In this letter, we propose IC-Fusion, a multispectral object detector that effectively fuses visible and infrared features through a lightweight and modality-aware design. Motivated by wavelet analysis and empirical observations, we find that IR images contain structurally rich high-frequency information critical for object localization, while RGB images provide complementary semantic context. To exploit this, we adopt a compact RGB backbone and design a novel fusion module comprising a Multi-Scale Feature Distillation (MSFD) block to enhance RGB features and a three-stage fusion block with Cross-Modal Channel Shuffle Gate (CCSG) and Cross-Modal Large Kernel Gate (CLKG) to facilitate effective cross-modal interaction. Experiments on the FLIR and LLVIP benchmarks demonstrate the effectiveness and efficiency of our IR-centric fusion strategy. Our code is available at https://github.com/smin-hwang/IC-Fusion.

Badhan Mazumder,Lei Wu,Vince D. Calhoun,Dong Hye Ye

Main category: cs.CV

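TL;DR: ConneX is a multimodal fusion method that combines cross-attention with MLP-Mixer to improve diagnostic performance on structural and functional brain connectivity data.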

Details

Motivation: Traditional multimodal deep learning methods fail to fully exploit the complementary characteristics of structural and functional connectomics data, limiting diagnostic performance for neuropsychiatric disorders such as schizophrenia. Method: Modality-specific GNNs extract features, a cross-attention network fuses intra- and inter-modal interactions, MLP-Mixer layers refine global and local features, and a multi-head joint loss drives end-to-end classification. Result: Improved performance on two distinct clinical datasets, indicating the robustness of the framework. Conclusion: ConneX improves diagnosis from brain connectivity data through multimodal fusion. Abstract: Gaining insights into the structural and functional mechanisms of the brain has been a longstanding focus in neuroscience research, particularly in the context of understanding and treating neuropsychiatric disorders such as Schizophrenia (SZ). Nevertheless, most of the traditional multimodal deep learning approaches fail to fully leverage the complementary characteristics of structural and functional connectomics data to enhance diagnostic performance. To address this issue, we proposed ConneX, a multimodal fusion method that integrates a cross-attention mechanism and multilayer perceptron (MLP)-Mixer for refined feature fusion. Modality-specific backbone graph neural networks (GNNs) were first employed to obtain feature representations for each modality. A unified cross-modal attention network was then introduced to fuse these embeddings by capturing intra- and inter-modal interactions, while MLP-Mixer layers refined global and local features, leveraging higher-order dependencies for end-to-end classification with a multi-head joint loss. Extensive evaluations demonstrated improved performance on two distinct clinical datasets, highlighting the robustness of our proposed framework.

[25] CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation

Xinran Wang,Songyu Xu,Xiangxuan Shan,Yuxuan Zhang,Muxi Diao,Xueyan Duan,Yanhua Huang,Kongming Liang,Zhanyu Ma

Main category: cs.CV

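TL;DR: The paper presents CineTechBench, an expert-annotated benchmark for evaluating how well multimodal large language models and video generation models understand and generate cinematographic techniques.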

Details

Motivation: The ability of current models to understand and generate cinematographic techniques is largely unexplored, and expert-annotated data is scarce. Method: The benchmark is built from expert-annotated movie images and clips; question-answer pairs and description tasks assess understanding, and video generation models are tested on reconstructing camera movements. Result: A large-scale evaluation of 15+ multimodal large language models and 5+ video generation models reveals the limitations of current models. Conclusion: CineTechBench provides a benchmark and directions for cinematography understanding and generation in automatic film production and appreciation. Abstract: Cinematography is a cornerstone of film production and appreciation, shaping mood, emotion, and narrative through visual elements such as camera movement, shot composition, and lighting. Despite recent progress in multimodal large language models (MLLMs) and video generation models, the capacity of current models to grasp and reproduce cinematographic techniques remains largely uncharted, hindered by the scarcity of expert-annotated data. To bridge this gap, we present CineTechBench, a pioneering benchmark founded on precise, manual annotation by seasoned cinematography experts across key cinematography dimensions. Our benchmark covers seven essential aspects (shot scale, shot angle, composition, camera movement, lighting, color, and focal length) and includes over 600 annotated movie images and 120 movie clips with clear cinematographic techniques. For the understanding task, we design question-answer pairs and annotated descriptions to assess MLLMs' ability to interpret and explain cinematographic techniques. For the generation task, we assess advanced video generation models on their capacity to reconstruct cinema-quality camera movements given conditions such as textual prompts or keyframes. We conduct a large-scale evaluation on 15+ MLLMs and 5+ video generation models. Our results offer insights into the limitations of current models and future directions for cinematography understanding and generation in automatic film production and appreciation. The code and benchmark can be accessed at https://github.com/PRIS-CV/CineTechBench.

[26] From Pixels to Images: Deep Learning Advances in Remote Sensing Image Semantic Segmentation

Quanwei Liu,Tao Huang,Yanni Dong,Jiaqi Yang,Wei Xiang

Main category: cs.CV

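TL;DR: This review traces the evolution of deep-learning-based remote sensing image semantic segmentation (RSISS), divides it into four stages, evaluates nearly 40 advanced techniques on a unified dataset, and summarizes key advances and open challenges.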

Details

Motivation: As remote sensing images (RSIs) grow in diversity and volume, traditional methods struggle to stay efficient and accurate, and deep learning (DL) has become the key route to better RSISS. Method: Existing approaches are grouped into four stages (pixel-based, patch-based, tile-based, and image-based), analyzed in terms of feature extraction and learning strategies, and nearly 40 techniques are evaluated on a unified dataset. Result: The review shows the progression from pixel-level to tile-level and from unimodal to multimodal segmentation, and quantitatively compares the performance and applicability of the techniques. Conclusion: The paper gives a holistic view for future research, summarizing key advances and open challenges of DL in RSISS. Abstract: Remote sensing images (RSIs) capture both natural and human-induced changes on the Earth's surface, serving as essential data for environmental monitoring, urban planning, and resource management. Semantic segmentation (SS) of RSIs enables the fine-grained interpretation of surface features, making it a critical task in remote sensing analysis. With the increasing diversity and volume of RSIs collected by sensors on various platforms, traditional processing methods struggle to maintain efficiency and accuracy. In response, deep learning (DL) has emerged as a transformative approach, enabling substantial advances in remote sensing image semantic segmentation (RSISS) by automating feature extraction and improving segmentation accuracy across diverse modalities. This paper revisits the evolution of DL-based RSISS by categorizing existing approaches into four stages: the early pixel-based methods, the prevailing patch-based and tile-based techniques, and the emerging image-based strategies enabled by foundation models. We analyze these developments from the perspective of feature extraction and learning strategies, revealing the field's progression from pixel-level to tile-level and from unimodal to multimodal segmentation. Furthermore, we conduct a comprehensive evaluation of nearly 40 advanced techniques on a unified dataset to quantitatively characterize their performance and applicability. This review offers a holistic view of DL-based SS for RS, highlighting key advancements, comparative insights, and open challenges to guide future research.

[27] ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving

Yunsheng Ma,Burhaneddin Yaman,Xin Ye,Mahmut Yurt,Jingru Luo,Abhirup Mallik,Ziran Wang,Liu Ren

Main category: cs.CV

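TL;DR: ALN-P3 is a unified co-distillation framework that uses cross-modal alignment to improve both the driving performance and the language reasoning of autonomous driving systems.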

Details

Motivation: Existing methods struggle to optimize driving performance and vision-language reasoning at the same time; ALN-P3 targets this gap. Method: Three alignment mechanisms (P1A, P2A, P3A) align visual tokens with linguistic outputs during training. Result: State-of-the-art results on four benchmarks. Conclusion: ALN-P3 combines driving and language reasoning without adding inference cost. Abstract: Recent advances have explored integrating large language models (LLMs) into end-to-end autonomous driving systems to enhance generalization and interpretability. However, most existing approaches are limited to either driving performance or vision-language reasoning, making it difficult to achieve both simultaneously. In this paper, we propose ALN-P3, a unified co-distillation framework that introduces cross-modal alignment between "fast" vision-based autonomous driving systems and "slow" language-driven reasoning modules. ALN-P3 incorporates three novel alignment mechanisms: Perception Alignment (P1A), Prediction Alignment (P2A), and Planning Alignment (P3A), which explicitly align visual tokens with corresponding linguistic outputs across the full perception, prediction, and planning stack. All alignment modules are applied only during training and incur no additional costs during inference. Extensive experiments on four challenging benchmarks (nuScenes, Nu-X, TOD3Cap, and nuScenes QA) demonstrate that ALN-P3 significantly improves both driving decisions and language reasoning, achieving state-of-the-art results.

[28] Lossless Token Merging Even Without Fine-Tuning in Vision Transformers

Jaeyeon Lee,Dong-Wan Choi

Main category: cs.CV

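TL;DR: ATM is a lossless token merging method that needs no fine-tuning; it adapts layer-specific similarity thresholds and uses a size-aware token matching technique to cut computation while preserving accuracy.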

Details

Motivation: The large size of ViTs leads to high computational overhead, and existing token compression techniques often lose information and require additional training. Method: ATM merges tokens adaptively by adjusting layer-specific similarity thresholds and introduces a token matching technique that also accounts for merging sizes. Result: Across a range of pretrained models, ATM outperforms existing training-free methods and surpasses most training-intensive approaches; it reduces FLOPs by over 30% for DeiT-T and DeiT-S with no drop in accuracy. Conclusion: ATM offers a high-performing route to efficient ViTs without additional training. Abstract: Although Vision Transformers (ViTs) have become the standard architecture in computer vision, their massive sizes lead to significant computational overhead. Token compression techniques have attracted considerable attention to address this issue, but they often suffer from severe information loss, requiring extensive additional training to achieve practical performance. In this paper, we propose Adaptive Token Merging (ATM), a novel method that ensures lossless token merging, eliminating the need for fine-tuning while maintaining competitive performance. ATM adaptively reduces tokens across layers and batches by carefully adjusting layer-specific similarity thresholds, thereby preventing the undesirable merging of less similar tokens with respect to each layer. Furthermore, ATM introduces a novel token matching technique that considers not only similarity but also merging sizes, particularly for the final layers, to minimize the information loss incurred from each merging operation. We empirically validate our method across a wide range of pretrained models, demonstrating that ATM not only outperforms all existing training-free methods but also surpasses most training-intensive approaches, even without additional training. Remarkably, training-free ATM achieves over a 30% reduction in FLOPs for the DeiT-T and DeiT-S models without any drop in their original accuracy.
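
Threshold-based token merging can be illustrated with a small Python sketch. It is a generic greedy variant written for this digest, not the ATM algorithm: the single-pass matching, the fixed `threshold`, and the size-weighted averaging are assumptions, and it makes no claim to be lossless. Token vectors are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def merge_tokens(tokens, threshold):
    """Greedily merge each token into its most similar kept token when the
    similarity reaches `threshold`; otherwise keep it as a new entry.
    Each entry is [vector, size], where size counts the tokens merged in."""
    merged = []
    for tok in tokens:
        best, best_sim = None, threshold
        for entry in merged:
            sim = cosine(tok, entry[0])
            if sim >= best_sim:
                best, best_sim = entry, sim
        if best is None:
            merged.append([list(tok), 1])
        else:
            vec, size = best
            # Size-weighted mean, so earlier merges are not diluted.
            best[0] = [(v * size + t) / (size + 1) for v, t in zip(vec, tok)]
            best[1] = size + 1
    return merged

tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(merge_tokens(tokens, threshold=0.9))
```

A per-layer threshold, as the abstract describes, would correspond to calling `merge_tokens` with a different `threshold` at each layer; tracking `size` is what lets a matching rule take merging sizes into account.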

[29] Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation

Xinran Wang,Muxi Diao,Yuanzhi Liu,Chunyu Wang,Kongming Liang,Zhanyu Ma,Jun Guo

Main category: cs.CV

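TL;DR: Proposes a new metric, built on image coverage rate (ICR) and average object detailness (AOD), for estimating caption detailness when training text-to-image (T2I) models; it outperforms the conventional length-based measure.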

Details

Motivation: Existing methods rely on simplistic metrics such as caption length to measure detailness, which do not reflect content coverage or object-level detail. Method: Two measures, image coverage rate (ICR) and average object detailness (AOD), are used to estimate caption detailness. Result: On COCO, T2I models trained on high-ICR and high-AOD captions perform better, and training on only 20% of the data surpasses full-dataset training. Conclusion: Detail-aware metrics are more effective than length-based heuristics for caption selection in T2I tasks. Abstract: Training text-to-image (T2I) models with detailed captions can significantly improve their generation quality. Existing methods often rely on simplistic metrics like caption length to represent the detailness of the caption in the T2I training set. In this paper, we propose a new metric to estimate caption detailness based on two aspects: image coverage rate (ICR), which evaluates whether the caption covers all regions/objects in the image, and average object detailness (AOD), which quantifies the detailness of each object's description. Through experiments on the COCO dataset using ShareGPT4V captions, we demonstrate that T2I models trained on high-ICR and high-AOD captions achieve superior performance on DPG and other benchmarks. Notably, our metric enables more effective data selection: training on only 20% of the full data surpasses both full-dataset training and the length-based selection method, improving alignment and reconstruction ability. These findings highlight the critical role of detail-aware metrics over length-based heuristics in caption selection for T2I tasks.
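
The abstract names ICR and AOD but does not give formulas, so the Python sketch below is one plausible reading rather than the paper's definition. It assumes the objects in the image are already known (for example from annotations) and that a caption has been parsed into a mapping from mentioned object to its descriptive attributes; both inputs and both formulas are hypothetical.

```python
def image_coverage_rate(image_objects, caption_objects):
    """Assumed ICR: fraction of objects in the image that the caption mentions."""
    if not image_objects:
        return 0.0
    covered = [obj for obj in image_objects if obj in caption_objects]
    return len(covered) / len(image_objects)

def average_object_detailness(caption_objects):
    """Assumed AOD: mean number of descriptive attributes per mentioned object."""
    if not caption_objects:
        return 0.0
    return sum(len(attrs) for attrs in caption_objects.values()) / len(caption_objects)

# Hypothetical parsed caption for an image containing four annotated objects.
image_objects = {"dog", "ball", "tree", "bench"}
caption_objects = {"dog": ["brown", "small", "running"], "ball": ["red"]}
print(image_coverage_rate(image_objects, caption_objects))
print(average_object_detailness(caption_objects))
```

Under these assumptions a long caption that dwells on one object scores high on AOD but low on ICR, which is the kind of case a pure length heuristic cannot distinguish.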

[30] AvatarShield: Visual Reinforcement Learning for Human-Centric Video Forgery Detection

Zhipei Xu,Xuanyu Zhang,Xing Zhou,Jian Zhang

Main category: cs.CV

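TL;DR: AvatarShield is an interpretable MLLM-based framework for detecting human-centric fake videos; it is trained with GRPO, uses a dual-encoder architecture, and is validated on the new FakeHumanVid dataset.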

Details

Motivation: Rapid progress in AIGC has expanded creative video generation but also increased threats to information integrity, identity security, and public trust. Existing detectors fall short on human-centric videos and suffer from poor generalization and limited scalability. Method: AvatarShield combines GRPO training with a dual-encoder architecture (high-level semantic reasoning plus low-level artifact amplification), avoiding high-cost text annotation data while enabling precise temporal modeling and forgery detection. Result: Experiments on FakeHumanVid show that AvatarShield significantly outperforms existing methods in both in-domain and cross-domain detection. Conclusion: AvatarShield sets a new standard for human-centric video forensics and addresses the limitations of existing methods. Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, particularly in video generation, has led to unprecedented creative capabilities but also increased threats to information integrity, identity security, and public trust. Existing detection methods, while effective in general scenarios, lack robust solutions for human-centric videos, which pose greater risks due to their realism and potential for legal and ethical misuse. Moreover, current detection approaches often suffer from poor generalization, limited scalability, and reliance on labor-intensive supervised fine-tuning. To address these challenges, we propose AvatarShield, the first interpretable MLLM-based framework for detecting human-centric fake videos, enhanced via Group Relative Policy Optimization (GRPO). Through our carefully designed accuracy detection reward and temporal compensation reward, it effectively avoids the use of high-cost text annotation data, enabling precise temporal modeling and forgery detection. Meanwhile, we design a dual-encoder architecture, combining high-level semantic reasoning and low-level artifact amplification to guide MLLMs in effective forgery detection. We further collect FakeHumanVid, a large-scale human-centric video benchmark that includes synthesis methods guided by pose, audio, and text inputs, enabling rigorous evaluation of detection methods in real-world scenes. Extensive experiments show that AvatarShield significantly outperforms existing approaches in both in-domain and cross-domain detection, setting a new standard for human-centric video forensics.
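
GRPO scores each sampled response relative to the other responses drawn for the same input. The Python sketch below shows that group-relative normalization only; the reward values are made up, and how AvatarShield combines its accuracy detection reward and temporal compensation reward is not stated in the abstract, so the rewards here are taken as given scalars. The use of the population standard deviation and the `eps` term are assumptions of this sketch.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses:
    advantage_i = (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses to the same video.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because advantages are computed from the group itself, no separately learned value function is needed, and a group in which every response earns the same reward yields zero advantage for all of them.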

[31] Exploring Generalized Gait Recognition: Reducing Redundancy and Noise within Indoor and Outdoor Datasets

Qian Zhou,Xianda Guo,Jilong Wang,Chuanfu Shen,Zhongyuan Wang,Hua Zou,Qin Zou,Chao Liang,Chen Long,Gang Wu

Main category: cs.CV

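TL;DR: Proposes a unified framework that improves cross-domain gait recognition through a disentangled triplet loss and a targeted dataset distillation strategy.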

Details

Motivation: Cross-domain gait recognition is hard because of differences in viewpoint, appearance, and environment; mixed-dataset training is common but brings optimization conflicts and noisy samples. Method: A disentangled triplet loss reduces gradient conflicts between datasets, and a targeted dataset distillation strategy filters out redundant samples. Result: Cross-domain recognition improves significantly on multiple datasets without sacrificing source-domain accuracy. Conclusion: The framework addresses the challenges of cross-domain gait recognition and offers a practical approach for real-world use. Abstract: Generalized gait recognition, which aims to achieve robust performance across diverse domains, remains a challenging problem due to severe domain shifts in viewpoints, appearances, and environments. While mixed-dataset training is widely used to enhance generalization, it introduces new obstacles including inter-dataset optimization conflicts and redundant or noisy samples, both of which hinder effective representation learning. To address these challenges, we propose a unified framework that systematically improves cross-domain gait recognition. First, we design a disentangled triplet loss that isolates supervision signals across datasets, mitigating gradient conflicts during optimization. Second, we introduce a targeted dataset distillation strategy that filters out the least informative 20% of training samples based on feature redundancy and prediction uncertainty, enhancing data efficiency. Extensive experiments on CASIA-B, OU-MVLP, Gait3D, and GREW demonstrate that our method significantly improves cross-dataset recognition for both GaitBase and DeepGaitV2 backbones, without sacrificing source-domain accuracy. Code will be released at https://github.com/li1er3/Generalized_Gait.
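
The "drop the least informative 20%" step can be sketched in Python as below. The abstract says only that the selection uses feature redundancy and prediction uncertainty; treating both as precomputed per-sample scores where higher means less informative, and combining them by a plain sum, are assumptions of this sketch, as are the sample values.

```python
def filter_samples(samples, drop_fraction=0.2):
    """samples: list of (sample_id, redundancy, uncertainty) tuples.
    Drops the `drop_fraction` of samples with the highest combined score
    and returns the remaining ids in their original order."""
    n_drop = int(len(samples) * drop_fraction)
    ranked = sorted(samples, key=lambda s: s[1] + s[2], reverse=True)
    dropped = {s[0] for s in ranked[:n_drop]}
    return [s[0] for s in samples if s[0] not in dropped]

# Hypothetical scores for five gait sequences.
samples = [("a", 0.1, 0.1), ("b", 0.9, 0.8), ("c", 0.2, 0.3), ("d", 0.4, 0.1), ("e", 0.3, 0.2)]
print(filter_samples(samples))
```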

[32] AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection

Yangting Shi,Renjie He,Le Hui,Xiang Li,Jian Yang,Ming-Ming Cheng,Yimian Dai

Main category: cs.CV

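TL;DR: The paper proposes AuxDet, a multi-modal framework that uses textual metadata to improve infrared small target detection, addressing the limited generalization of existing methods in complex scenes.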

Details

Motivation: Existing infrared small target detection methods rely on visual-only modeling, struggle with complex background interference and scarce target features, and generalize poorly across scenes. Method: AuxDet uses multi-layer perceptrons to dynamically fuse metadata with visual features and a 1D convolutional enhancement module to refine the fused features. Result: On the WideIRSTD-Full benchmark, AuxDet outperforms existing methods, supporting the role of auxiliary information in improving robustness and accuracy. Conclusion: By incorporating metadata, AuxDet improves infrared small target detection in complex scenes. Abstract: Omni-domain infrared small target detection (IRSTD) poses formidable challenges, as a single model must seamlessly adapt to diverse imaging systems, varying resolutions, and multiple spectral bands simultaneously. Current approaches predominantly rely on visual-only modeling paradigms that not only struggle with complex background interference and inherently scarce target features, but also exhibit limited generalization capabilities across complex omni-scene environments where significant domain shifts and appearance variations occur. In this work, we reveal a critical oversight in existing paradigms: the neglect of readily available auxiliary metadata describing imaging parameters and acquisition conditions, such as spectral bands, sensor platforms, resolution, and observation perspectives. To address this limitation, we propose the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet), a novel multi-modal framework that fundamentally reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. Through a high-dimensional fusion module based on multi-layer perceptrons (MLPs), AuxDet dynamically integrates metadata semantics with visual features, guiding adaptive representation learning for each individual sample. Additionally, we design a lightweight prior-initialized enhancement module using 1D convolutional blocks to further refine fused features and recover fine-grained target cues. Extensive experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy in omni-domain IRSTD tasks. Code is available at https://github.com/GrokCV/AuxDet.

[33] MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models

Yifan Liu,Keyu Fan,Weihao Yu,Chenxin Li,Hao Lu,Yixuan Yuan

Main category: cs.CV

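TL;DR: MonoSplat is a framework that draws on the visual priors of pre-trained monocular depth models and, through a multi-view feature adapter and an integrated Gaussian prediction module, achieves efficient and generalizable 3D Gaussian reconstruction.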

Details

Motivation: Existing methods generalize poorly and struggle with unfamiliar visual content when run on novel scenes; MonoSplat targets this problem. Method: A Mono-Multi Feature Adapter is paired with an Integrated Gaussian Prediction module; a lightweight attention mechanism aligns features across views so that accurate Gaussian primitives can be generated. Result: On diverse datasets, MonoSplat shows better reconstruction quality and generalization than existing methods while remaining computationally efficient. Conclusion: By fusing monocular and multi-view features, MonoSplat improves the generalization and quality of 3D Gaussian reconstruction. Abstract: Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. Our approach consists of two key components: a Mono-Multi Feature Adapter that transforms monocular features into multi-view representations, coupled with an Integrated Gaussian Prediction module that effectively fuses both feature types for precise Gaussian generation. Through the Adapter's lightweight attention mechanism, features are seamlessly aligned and aggregated across views while preserving valuable monocular priors, enabling the Prediction module to generate Gaussian primitives with accurate geometry and appearance. Through extensive experiments on diverse real-world datasets, we convincingly demonstrate that MonoSplat achieves superior reconstruction quality and generalization capability compared to existing methods while maintaining computational efficiency with minimal trainable parameters. Codes are available at https://github.com/CUHK-AIM-Group/MonoSplat.

[34] Geometrically Regularized Transfer Learning with On-Manifold and Off-Manifold Perturbation

Hana Satou,Alan Mitkiy,F Monkey

Main category: cs.CV

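TL;DR: MAADA decomposes adversarial perturbations into on-manifold and off-manifold components to capture both semantic variation and model brittleness, improving cross-domain generalization.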

Details

Motivation: Address the transfer learning challenge caused by the divergence between source and target data manifolds. Method: The MAADA framework decomposes adversarial perturbations into on-manifold and off-manifold components and introduces a geometry-aware alignment loss that minimizes the geodesic discrepancy between manifolds. Result: On DomainNet, VisDA, and Office-Home, MAADA outperforms existing adversarial and adaptation methods, showing structural robustness and cross-domain generalization. Conclusion: Manifold-aware adversarial data augmentation improves transfer learning performance. Abstract: Transfer learning under domain shift remains a fundamental challenge due to the divergence between source and target data manifolds. In this paper, we propose MAADA (Manifold-Aware Adversarial Data Augmentation), a novel framework that decomposes adversarial perturbations into on-manifold and off-manifold components to simultaneously capture semantic variation and model brittleness. We theoretically demonstrate that enforcing on-manifold consistency reduces hypothesis complexity and improves generalization, while off-manifold regularization smooths decision boundaries in low-density regions. Moreover, we introduce a geometry-aware alignment loss that minimizes geodesic discrepancy between source and target manifolds. Experiments on DomainNet, VisDA, and Office-Home show that MAADA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, demonstrating superior structural robustness and cross-domain generalization.
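
One standard way to split a perturbation into on-manifold and off-manifold parts is orthogonal projection onto a local tangent space. The Python sketch below shows that decomposition only; it assumes an orthonormal tangent basis is already available (how MAADA estimates the manifold is not described in the abstract), and the function name and example vectors are hypothetical.

```python
def decompose_perturbation(delta, tangent_basis):
    """Split `delta` into its projection onto the span of `tangent_basis`
    (on-manifold part) and the residual (off-manifold part).
    `tangent_basis` is assumed to be a list of orthonormal vectors."""
    on = [0.0] * len(delta)
    for u in tangent_basis:
        coeff = sum(d * x for d, x in zip(delta, u))  # <delta, u>
        on = [o + coeff * x for o, x in zip(on, u)]
    off = [d - o for d, o in zip(delta, on)]
    return on, off

# Hypothetical 3-D perturbation with a 2-D tangent plane spanned by the x and y axes.
delta = [1.0, 2.0, 3.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(decompose_perturbation(delta, basis))
```

The two parts sum back to the original perturbation and are orthogonal to each other, so they can be regularized separately, in line with the on-manifold consistency and off-manifold smoothing roles the abstract describes.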

[35] Leveraging Foundation Models for Multimodal Graph-Based Action Recognition

Fatemeh Ziaeetabar,Florentin Wörgötter

Main category: cs.CV

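TL;DR: Proposes a novel graph-based framework that combines vision-language foundation models (VideoMAE and BERT) to build dynamic multimodal graphs for recognizing fine-grained bimanual manipulation actions.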

Details

Motivation: Addresses the limitations of conventional static graph architectures in recognizing fine-grained bimanual manipulation actions by using dynamic multimodal graphs for flexible, context-aware reasoning. Method: Builds an adaptive multimodal graph whose nodes represent frames, objects, and textual annotations and whose edges encode spatial, temporal, and semantic relationships; a task-specific attention mechanism within a Graph Attention Network dynamically adjusts edge importance. Result: Evaluations on multiple benchmark datasets show that the method outperforms state-of-the-art baselines. Conclusion: Combining foundation models with dynamic graph reasoning enables robust and generalizable action recognition. Abstract: Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates a vision-language foundation, leveraging VideoMAE for dynamic visual encoding and BERT for contextual textual embedding, to address the challenge of recognizing fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph where nodes represent frames, objects, and textual annotations, and edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible and context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning by modulating edge importance based on action semantics. Through extensive evaluations on diverse benchmark datasets, we demonstrate that our method consistently outperforms state-of-the-art baselines, underscoring the strength of combining foundation models with dynamic graph-based reasoning for robust and generalizable action recognition.

[36] GAMA: Geometry-Aware Manifold Alignment via Structured Adversarial Perturbations for Robust Domain Adaptation

Hana Satou,F Monkey

Main category: cs.CV

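TL;DR: GAMA proposes a geometry-aware manifold alignment framework that achieves explicit manifold alignment through structured adversarial perturbations, significantly improving cross-domain adaptation performance.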

Details

Motivation: When the source and target manifolds differ substantially, existing methods neglect precise manifold alignment and systematic exploration of structured perturbations. Method: Combines tangent space exploration with manifold-constrained adversarial optimization to achieve explicit manifold alignment and strengthen semantic consistency and robustness. Result: On DomainNet, VisDA, and Office-Home, GAMA outperforms existing methods in both unsupervised and few-shot settings. Conclusion: Through structured regularization and explicit alignment, GAMA improves generalization and manifold alignment. Abstract: Domain adaptation remains a challenge when there is significant manifold discrepancy between source and target domains. Although recent methods leverage manifold-aware adversarial perturbations to perform data augmentation, they often neglect precise manifold alignment and systematic exploration of structured perturbations. To address this, we propose GAMA (Geometry-Aware Manifold Alignment), a structured framework that achieves explicit manifold alignment via adversarial perturbation guided by geometric information. GAMA systematically employs tangent space exploration and manifold-constrained adversarial optimization, simultaneously enhancing semantic consistency, robustness to off-manifold deviations, and cross-domain alignment. Theoretical analysis shows that GAMA tightens the generalization bound via structured regularization and explicit alignment. Empirical results on DomainNet, VisDA, and Office-Home demonstrate that GAMA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, exhibiting superior robustness, generalization, and manifold alignment capability.

[37] Flashback: Memory-Driven Zero-shot, Real-time Video Anomaly Detection

Hyogun Lee,Haksub Kim,Ig-Jae Kim,Yonghun Choi

Main category: cs.CV

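TL;DR: Flashback is a zero-shot, real-time video anomaly detection method that achieves efficient detection through offline memory construction and online matching.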

Details

Motivation: Addresses domain dependency and real-time constraints in video anomaly detection. Method: Uses a two-stage approach: an offline stage in which an LLM builds a pseudo-scene memory, and an online stage that matches video segments against it via similarity search. Result: Reaches 87.3 AUC on UCF-Crime and 75.1 AP on XD-Violence, clearly outperforming existing methods. Conclusion: With zero-shot, real-time processing, Flashback offers a practical solution for large-scale surveillance deployments. Abstract: Video Anomaly Detection (VAD) automatically identifies anomalous events from video, mitigating the need for human operators in large-scale surveillance deployments. However, two fundamental obstacles hinder real-world adoption: domain dependency and real-time constraints, the latter requiring near-instantaneous processing of incoming video. To this end, we propose Flashback, a zero-shot and real-time video anomaly detection paradigm. Inspired by the human cognitive mechanism of instantly judging anomalies and reasoning in current scenes based on past experience, Flashback operates in two stages: Recall and Respond. In the offline recall stage, an off-the-shelf LLM builds a pseudo-scene memory of both normal and anomalous captions without any reliance on real anomaly data. In the online respond stage, incoming video segments are embedded and matched against this memory via similarity search. By eliminating all LLM calls at inference time, Flashback delivers real-time VAD even on a consumer-grade GPU. On two large datasets from real-world surveillance scenarios, UCF-Crime and XD-Violence, we achieve 87.3 AUC (+7.0 pp) and 75.1 AP (+13.1 pp), respectively, outperforming prior zero-shot VAD methods by large margins.
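
The online respond stage can be pictured as a nearest-neighbour lookup over caption embeddings. The sketch below is only an illustration under stated assumptions: the embeddings are toy 2-D vectors, and the scoring rule (fraction of anomalous captions among the top-k matches) and the name `anomaly_score` are hypothetical, since the abstract only says that segments are matched via similarity search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def anomaly_score(segment_emb, memory, k=2):
    """Score a segment by its k most similar memory captions.

    memory: list of (caption_embedding, is_anomalous) pairs built offline.
    The top-k voting rule is an ASSUMPTION made for illustration.
    """
    ranked = sorted(memory, key=lambda m: cosine(segment_emb, m[0]), reverse=True)
    top = ranked[:k]
    return sum(1 for _, is_anom in top if is_anom) / len(top)


# Toy pseudo-scene memory: two "normal" and two "anomalous" caption embeddings.
memory = [
    ([1.0, 0.0], False),
    ([0.9, 0.1], False),
    ([0.0, 1.0], True),
    ([0.1, 0.9], True),
]
print(anomaly_score([0.0, 1.0], memory))
print(anomaly_score([1.0, 0.05], memory))
```

No language model is called inside `anomaly_score`, which mirrors the abstract's point that inference needs only embedding and lookup.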

[38] GT^2-GS: Geometry-aware Texture Transfer for Gaussian Splatting

Wenjie Liu,Zhongliang Liu,Junwei Shu,Changbo Wang,Yang Li

Main category: cs.CV

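TL;DR: GT^2-GS is a geometry-aware texture transfer framework that achieves high-quality 3D texture transfer through geometry-aware augmentation and a geometry-consistent loss.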

Details

Motivation: Existing methods often ignore geometric information when transferring 2D textures to 3D modalities, which leads to poor results; this paper aims to solve that problem. Method: Proposes a geometry-aware texture augmentation module and a geometry-consistent texture loss that optimize texture features using camera pose and 3D geometric information. Result: Experiments show strong texture transfer quality and controllability, with results that align better with human visual perception. Conclusion: Through geometric awareness, GT^2-GS achieves high-quality 3D texture transfer and provides an efficient tool for multimedia content creation. Abstract: Transferring 2D textures to 3D modalities is of great significance for improving the efficiency of multimedia content creation. Existing approaches have rarely focused on transferring image textures onto 3D representations. 3D style transfer methods are capable of transferring abstract artistic styles to 3D scenes. However, these methods often overlook the geometric information of the scene, which makes it challenging to achieve high-quality 3D texture transfer results. In this paper, we present GT^2-GS, a geometry-aware texture transfer framework for Gaussian Splatting. From the perspective of matching texture features with geometric information in rendered views, we identify the issue of insufficient texture features and propose a geometry-aware texture augmentation module to expand the texture feature set. Moreover, a geometry-consistent texture loss is proposed to optimize texture features into the scene representation. This loss function incorporates both camera pose and 3D geometric information of the scene, enabling controllable texture-oriented appearance editing. Finally, a geometry preservation strategy is introduced. By alternating between the texture transfer and geometry correction stages over multiple iterations, this strategy achieves a balance between learning texture features and preserving geometric integrity. Extensive experiments demonstrate the effectiveness and controllability of our method. Through geometric awareness, our approach achieves texture transfer results that better align with human visual perception. Our homepage is available at https://vpx-ecnu.github.io/GT2-GS-website.

[39] Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection

Haotian Qin,Dongliang Chang,Yueying Gao,Bingyao Yu,Lei Chen,Zhanyu Ma

Main category: cs.CV

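TL;DR: This paper proposes a multimodal conditional bottleneck network (InfoFD) that combines text and class modalities to reduce CLIP feature redundancy and improve the generalization of AI-generated image detection.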

Details

Motivation: Existing CLIP-based detectors of AI-generated images suffer from feature redundancy, and relying solely on image-corresponding prompts yields suboptimal performance. Method: Proposes the InfoFD framework, comprising the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO), which reduces redundancy through multimodal conditioning and exploits the global "bias". Result: Shows strong generalization on the GenImage dataset and on the latest generative models. Conclusion: InfoFD's multimodal conditional design effectively improves the generalization of AI-generated image detection. Abstract: Although existing CLIP-based methods for detecting AI-generated images have achieved promising results, they are still limited by severe feature redundancy, which hinders their generalization ability. To address this issue, incorporating an information bottleneck network into the task presents a straightforward solution. However, relying solely on image-corresponding prompts results in suboptimal performance due to the inherent diversity of prompts. In this paper, we propose a multimodal conditional bottleneck network to reduce feature redundancy while enhancing the discriminative power of features extracted by CLIP, thereby improving the model's generalization ability. We begin with a semantic analysis experiment, where we observe that arbitrary text features exhibit lower cosine similarity with real image features than with fake image features in the CLIP feature space, a phenomenon we refer to as "bias". Therefore, we introduce InfoFD, a text-guided AI-generated image detection framework. InfoFD consists of two key components: the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO). TGCIB improves the generalizability of learned representations by conditioning on both text and class modalities. DTO dynamically updates weighted text features, preserving semantic information while leveraging the global "bias". Our model achieves exceptional generalization performance on the GenImage dataset and latest generative models. Our code is available at https://github.com/Ant0ny44/InfoFD.

[40] Continuous Representation Methods, Theories, and Applications: An Overview and Perspectives

Yisi Luo,Xile Zhao,Deyu Meng

Main category: cs.CV

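TL;DR: This survey discusses the advantages of continuous representation methods for data representation and reconstruction and their recent progress, covering method design, theoretical foundations, and practical applications.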

Details

Motivation: Traditional discrete frameworks have limitations for data representation and reconstruction; continuous representation methods are an emerging paradigm thanks to resolution flexibility, cross-modal adaptability, and parameter efficiency. Method: Systematically reviews the design of continuous representation methods (e.g., basis function representation, implicit neural representation), their theoretical foundations (e.g., approximation error analysis), and applications in computer vision and other fields. Result: Continuous representation methods show advantages in tasks such as image restoration and novel view synthesis; an open-source repository of resources is provided. Conclusion: Future directions include deepening the theoretical and applied study of continuous representation methods. Abstract: Recently, continuous representation methods have emerged as novel paradigms that characterize the intrinsic structures of real-world data through function representations that map positional coordinates to their corresponding values in the continuous space. As compared with the traditional discrete framework, the continuous framework demonstrates inherent superiority for data representation and reconstruction (e.g., image restoration, novel view synthesis, and waveform inversion) by offering inherent advantages including resolution flexibility, cross-modal adaptability, inherent smoothness, and parameter efficiency. In this review, we systematically examine recent advancements in continuous representation frameworks, focusing on three aspects: (i) Continuous representation method designs such as basis function representation, statistical modeling, tensor function decomposition, and implicit neural representation; (ii) Theoretical foundations of continuous representations such as approximation error analysis, convergence property, and implicit regularization; (iii) Real-world applications of continuous representations derived from computer vision, graphics, bioinformatics, and remote sensing. Furthermore, we outline future directions and perspectives to inspire exploration and deepen insights to facilitate continuous representation methods, theories, and applications. All referenced works are summarized in our open-source repository: https://github.com/YisiLuo/Continuous-Representation-Zoo.

[41] DC-Scene: Data-Centric Learning for 3D Scene Understanding

Ting Huang,Zeyu Zhang,Ruicheng Zhang,Yang Zhao

Main category: cs.CV

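TL;DR: DC-Scene proposes a data-centric framework for 3D scene understanding that uses a CLIP-driven dual-indicator quality filter and a curriculum scheduler to significantly improve data quality and training efficiency.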

Details

Motivation: 3D scene understanding is essential in robotics, autonomous driving, and related fields, but faces high computational cost and scarce annotated data, so more efficient learning paradigms are needed. Method: Uses a CLIP-driven dual-indicator quality (DIQ) filter together with a curriculum scheduler that progressively expands the training pool and filters out noisy samples. Result: Achieves state-of-the-art performance (86.1 CIDEr) on ScanRefer and Nr3D while cutting training cost by about two-thirds. Conclusion: A compact set of high-quality samples can outperform large-scale training; DC-Scene offers an efficient solution for 3D scene understanding. Abstract: 3D scene understanding plays a fundamental role in vision applications such as robotics, autonomous driving, and augmented reality. However, advancing learning-based 3D scene understanding remains challenging due to two key limitations: (1) the large scale and complexity of 3D scenes lead to higher computational costs and slower training compared to 2D counterparts; and (2) high-quality annotated 3D datasets are significantly scarcer than those available for 2D vision. These challenges underscore the need for more efficient learning paradigms. In this work, we propose DC-Scene, a data-centric framework tailored for 3D scene understanding, which emphasizes enhancing data quality and training efficiency. Specifically, we introduce a CLIP-driven dual-indicator quality (DIQ) filter, combining vision-language alignment scores with caption-loss perplexity, along with a curriculum scheduler that progressively expands the training pool from the top 25% to 75% of scene-caption pairs. This strategy filters out noisy samples and significantly reduces dependence on large-scale labeled 3D data. Extensive experiments on ScanRefer and Nr3D demonstrate that DC-Scene achieves state-of-the-art performance (86.1 CIDEr with the top-75% subset vs. 85.4 with the full dataset) while reducing training cost by approximately two-thirds, confirming that a compact set of high-quality samples can outperform exhaustive training. Code will be available at https://github.com/AIGeeksGroup/DC-Scene.
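
The curriculum described in the abstract (expanding the pool from the top 25% to the top 75% of scene-caption pairs) might look roughly like the sketch below. It assumes each pair already carries a single quality score and that the pool grows linearly over epochs; how DIQ combines alignment and perplexity, and the actual schedule shape, are not given in the abstract, so `pool_fraction` and `training_pool` are hypothetical helpers.

```python
def pool_fraction(epoch, total_epochs, start=0.25, end=0.75):
    """Fraction of ranked pairs used at a given epoch (ASSUMED linear schedule)."""
    if total_epochs <= 1:
        return end
    t = epoch / (total_epochs - 1)
    return start + (end - start) * t


def training_pool(samples, epoch, total_epochs):
    """Return ids of the highest-quality pairs admitted at this epoch.

    samples: list of (sample_id, quality_score); higher score means cleaner pair.
    """
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    n = max(1, round(pool_fraction(epoch, total_epochs) * len(ranked)))
    return [sid for sid, _ in ranked[:n]]


scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6]
samples = [(f"pair{i}", s) for i, s in enumerate(scores)]
for epoch in range(3):
    print(epoch, training_pool(samples, epoch, total_epochs=3))
```

With eight pairs and three epochs, the pool holds 2, then 4, then 6 pairs; the two lowest-scoring pairs are never used, which is the filtering effect the abstract describes.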

Yuxuan Du,Zhendong Wang,Yuhao Luo,Caiyong Piao,Zhiyuan Yan,Hao Li,Li Yuan

Main category: cs.CV

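TL;DR: The paper proposes a Cross-Modal Alignment and Distillation (CAD) framework for detecting multimodal deepfake videos; by combining high-level semantic synchronization with modality-specific traces, it significantly improves detection performance.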

Details

Motivation: Existing detectors rely only on single-modality traces or cross-modal inconsistencies and cannot cope effectively with multimodal deepfakes. Method: The CAD framework comprises cross-modal alignment (identifying semantic inconsistencies) and cross-modal distillation (preserving modality-specific traces while fusing features). Result: CAD significantly outperforms existing methods on a range of benchmarks. Conclusion: Harmonious integration of complementary multimodal information is essential for deepfake detection. Abstract: The rapid emergence of multimodal deepfakes (where visual and auditory content are manipulated in concert) undermines the reliability of existing detectors that rely solely on modality-specific artifacts or cross-modal inconsistencies. In this work, we first demonstrate that modality-specific forensic traces (e.g., face-swap artifacts or spectral distortions) and modality-shared semantic misalignments (e.g., lip-speech asynchrony) offer complementary evidence, and that neglecting either aspect limits detection performance. Existing approaches either naively fuse modality-specific features without reconciling their conflicting characteristics or focus predominantly on semantic misalignment at the expense of modality-specific fine-grained artifact cues. To address these shortcomings, we propose a general multimodal framework for video deepfake detection via Cross-Modal Alignment and Distillation (CAD). CAD comprises two core components: 1) Cross-modal alignment that identifies inconsistencies in high-level semantic synchronization (e.g., lip-speech mismatches); 2) Cross-modal distillation that mitigates feature conflicts during fusion while preserving modality-specific forensic traces (e.g., spectral distortions in synthetic audio). Extensive experiments on both multimodal and unimodal (e.g., image-only/video-only) deepfake benchmarks demonstrate that CAD significantly outperforms previous methods, validating the necessity of harmonious integration of multimodal complementary information.

[43] GAMA++: Disentangled Geometric Alignment with Adaptive Contrastive Perturbation for Reliable Domain Transfer

Kim Yun,Hana Satou,F Monkey

Main category: cs.CV

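TL;DR: GAMA++ proposes a new framework that uses latent space disentanglement and an adaptive contrastive perturbation strategy to address the shortcomings of existing geometry-aware domain adaptation methods, achieving the best results on several benchmarks.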

Details

Motivation: Current methods such as GAMA insufficiently disentangle task-relevant and task-irrelevant manifold dimensions, and their perturbation schemes are too rigid to account for per-class alignment asymmetries. Method: GAMA++ introduces latent space disentanglement and an adaptive contrastive perturbation strategy, combined with a cross-domain contrastive consistency loss. Result: On DomainNet, Office-Home, and VisDA, GAMA++ achieves the best results in both standard and few-shot settings, with notable gains in class-level alignment fidelity and boundary robustness. Conclusion: GAMA++ sets a new standard for semantic geometry alignment in transfer learning. Abstract: Despite progress in geometry-aware domain adaptation, current methods such as GAMA still suffer from two unresolved issues: (1) insufficient disentanglement of task-relevant and task-irrelevant manifold dimensions, and (2) rigid perturbation schemes that ignore per-class alignment asymmetries. To address this, we propose GAMA++, a novel framework that introduces (i) latent space disentanglement to isolate label-consistent manifold directions from nuisance factors, and (ii) an adaptive contrastive perturbation strategy that tailors both on- and off-manifold exploration to class-specific manifold curvature and alignment discrepancy. We further propose a cross-domain contrastive consistency loss that encourages local semantic clusters to align while preserving intra-domain diversity. Our method achieves state-of-the-art results on DomainNet, Office-Home, and VisDA benchmarks under both standard and few-shot settings, with notable improvements in class-level alignment fidelity and boundary robustness. GAMA++ sets a new standard for semantic geometry alignment in transfer learning.

[44] VET-DINO: Learning Anatomical Understanding Through Multi-View Distillation in Veterinary Imaging

Andre Dourson,Kylie Taylor,Xiaoli Qiao,Michael Fitzke

Main category: cs.CV

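TL;DR: VET-DINO is a self-supervised learning framework that exploits the multi-view nature of medical imaging data to improve anatomical understanding, achieving leading performance on veterinary imaging tasks.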

Details

Motivation: Labeled data is scarce in medical imaging, and existing methods mostly rely on synthetic augmentations of single images, failing to exploit multi-view data. Method: Proposes the VET-DINO framework, which uses multi-view veterinary radiographs from the same patient study to learn view-invariant anatomical structures and an implied 3D understanding. Result: Validated on a dataset of 5 million veterinary radiographs, showing that multi-view learning outperforms synthetic augmentation and reaches leading performance on downstream tasks. Conclusion: VET-DINO offers a new paradigm for self-supervised learning in medical imaging that emphasizes domain-specific properties over simply transferring natural-image techniques. Abstract: Self-supervised learning has emerged as a powerful paradigm for training deep neural networks, particularly in medical imaging where labeled data is scarce. While current approaches typically rely on synthetic augmentations of single images, we propose VET-DINO, a framework that leverages a unique characteristic of medical imaging: the availability of multiple standardized views from the same study. Using a series of clinical veterinary radiographs from the same patient study, we enable models to learn view-invariant anatomical structures and develop an implied 3D understanding from 2D projections. We demonstrate our approach on a dataset of 5 million veterinary radiographs from 668,000 canine studies. Through extensive experimentation, including view synthesis and downstream task performance, we show that learning from real multi-view pairs leads to superior anatomical understanding compared to purely synthetic augmentations. VET-DINO achieves state-of-the-art performance on various veterinary imaging tasks. Our work establishes a new paradigm for self-supervised learning in medical imaging that leverages domain-specific properties rather than merely adapting natural image techniques.

[45] Zero-Shot Gaze-based Volumetric Medical Image Segmentation

Tatyana Shmykova,Leila Khaertdinova,Ilya Pershin

Main category: cs.CV

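TL;DR: The study proposes eye tracking as a new interaction modality for 3D medical image segmentation and evaluates it with SAM-2 and MedSAM-2, finding it time-efficient but slightly less accurate.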

Details

Motivation: Conventional interactive segmentation models rely on manual prompts such as bounding boxes, whereas eye tracking offers a more efficient alternative. Method: Introduces eye gaze as a new modality for interactive segmentation and evaluates it with SAM-2 and MedSAM-2 using synthetic and real gaze data. Result: Gaze prompts are more time-efficient than bounding boxes but yield slightly lower segmentation quality. Conclusion: Eye tracking shows potential as a complementary input modality for 3D medical image segmentation. Abstract: Accurate segmentation of anatomical structures in volumetric medical images is crucial for clinical applications, including disease monitoring and cancer treatment planning. Contemporary interactive segmentation models, such as Segment Anything Model 2 (SAM-2) and its medical variant (MedSAM-2), rely on manually provided prompts like bounding boxes and mouse clicks. In this study, we introduce eye gaze as a novel informational modality for interactive segmentation, applying eye tracking to 3D medical image segmentation. We evaluate the performance of using gaze-based prompts with SAM-2 and MedSAM-2 using both synthetic and real gaze data. Compared to bounding boxes, gaze-based prompts offer a time-efficient interaction approach with slightly lower segmentation quality. Our findings highlight the potential of using gaze as a complementary input modality for interactive 3D medical image segmentation.

[46] gen2seg: Generative Models Enable Generalizable Instance Segmentation

Om Khangaonkar,Hamed Pirsiavash

Main category: cs.CV

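TL;DR: By fine-tuning Stable Diffusion and MAE, the work uses the representations of generative models for zero-shot instance segmentation, showing strong generalization across categories and domains.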

Details

Motivation: Explores the potential of generative models for general-purpose perception tasks, in particular how to use their learned knowledge of object boundaries and scene composition for instance segmentation. Method: Fine-tunes Stable Diffusion and MAE with an instance coloring loss on a narrow set of object types (indoor furnishings and cars). Result: The models perform well on unseen object types and styles, approaching the heavily supervised SAM and exceeding it on fine structures and ambiguous boundaries. Conclusion: Generative models generalize across categories and domains and can deliver high-quality instance segmentation without internet-scale pretraining. Abstract: By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.

Zihao Pan,Yu Tong,Weibin Wu,Jingyi Wang,Lifeng Chen,Zhe Zhao,Jiajia Wei,Yitong Qiao,Zibin Zheng

Main category: cs.CV

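TL;DR: The paper proposes a semantic evolution framework that combines LLMs and T2I models to explore the sensitivity of large vision-language models (LVLMs) to specific semantic concepts and to quantify their performance.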

Details

Motivation: To reveal why LVLMs are prone to hallucinations and errors when facing specific semantic concepts, so that model robustness can be improved in a targeted way. Method: Uses LLMs to perform crossover and mutation on semantic concepts to generate image descriptions, converts them into visual inputs with T2I models, and quantifies LVLM performance as a reward signal. Result: Experiments on seven mainstream LVLMs and two multimodal tasks validate the method and uncover semantic concepts to which LVLMs are sensitive. Conclusion: The method effectively explores the sensitive semantics of LVLMs and offers pointers for further in-depth research. Abstract: Adversarial attacks aim to generate malicious inputs that mislead deep models, but beyond causing model failure, they cannot provide certain interpretable information such as "What content in inputs makes models more likely to fail?" However, this information is crucial for researchers to specifically improve model robustness. Recent research suggests that models may be particularly sensitive to certain semantics in visual inputs (such as "wet" or "foggy"), making them prone to errors. Inspired by this, in this paper we conducted the first exploration on large vision-language models (LVLMs) and found that LVLMs indeed are susceptible to hallucinations and various errors when facing specific semantic concepts in images. To efficiently search for these sensitive concepts, we integrated large language models (LLMs) and text-to-image (T2I) models to propose a novel semantic evolution framework. Randomly initialized semantic concepts undergo LLM-based crossover and mutation operations to form image descriptions, which are then converted by T2I models into visual inputs for LVLMs. The task-specific performance of LVLMs on each input is quantified as fitness scores for the involved semantics and serves as reward signals to further guide LLMs in exploring concepts that induce LVLM errors. Extensive experiments on seven mainstream LVLMs and two multimodal tasks demonstrate the effectiveness of our method. Additionally, we provide interesting findings about the sensitive semantics of LVLMs, aiming to inspire further in-depth research.

[48] Contrastive Learning-Enhanced Trajectory Matching for Small-Scale Dataset Distillation

Wenmin Li,Shunsuke Sakai,Tatsuhito Hasegawa

Main category: cs.CV

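TL;DR: Proposes a dataset distillation method that incorporates contrastive learning to address insufficient semantic richness under extreme sample scarcity.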

Details

Motivation: Deploying machine learning models in resource-constrained environments requires distilling large datasets into small but informative synthetic datasets, and existing methods perform poorly under extreme sample scarcity. Method: Integrates contrastive learning to explicitly maximize instance-level feature discrimination during image synthesis, producing more informative and diverse synthetic samples. Result: Experiments show the method substantially improves model performance on very small synthetic datasets and improves the visual fidelity of synthesized images. Conclusion: The proposed method outperforms existing distillation techniques in extreme data-scarcity scenarios, with notable gains in feature representation and visual quality. Abstract: Deploying machine learning models in resource-constrained environments, such as edge devices or rapid prototyping scenarios, increasingly demands distillation of large datasets into significantly smaller yet informative synthetic datasets. Current dataset distillation techniques, particularly Trajectory Matching methods, optimize synthetic data so that the model's training trajectory on synthetic samples mirrors that on real data. While demonstrating efficacy on medium-scale synthetic datasets, these methods fail to adequately preserve semantic richness under extreme sample scarcity. To address this limitation, we propose a novel dataset distillation method integrating contrastive learning during image synthesis. By explicitly maximizing instance-level feature discrimination, our approach produces more informative and diverse synthetic samples, even when dataset sizes are significantly constrained. Experimental results demonstrate that incorporating contrastive learning substantially enhances the performance of models trained on very small-scale synthetic datasets. This integration not only guides more effective feature representation but also significantly improves the visual fidelity of the synthesized images. Experimental results demonstrate that our method achieves notable performance improvements over existing distillation techniques, especially in scenarios with extremely limited synthetic data.

[49] LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

Zhenyu Ning,Guangda Liu,Qihao Jin,Wenchao Ding,Minyi Guo,Jieru Zhao

Main category: cs.CV

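TL;DR: LiveVLM is a training-free framework designed for real-time video understanding and interaction; its streaming-oriented KV cache significantly improves processing speed and memory efficiency.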

Details

Motivation: Existing Video LLMs focus mainly on offline video question answering and neglect memory usage and response speed, which are essential in real-time applications. Method: LiveVLM uses a streaming-oriented KV cache to process video streams in real time, retaining long-term details and eliminating redundant KVs, while compressing video KVs to improve memory efficiency. Result: Experiments show LiveVLM processes 44× as many frames on the same device and responds up to 5× faster than existing methods, with the same or better performance. Conclusion: LiveVLM provides an efficient solution for real-time video understanding, significantly improving processing capacity and response speed. Abstract: Recent developments in Video Large Language Models (Video LLMs) have enabled models to process long video sequences and demonstrate remarkable performance. Nonetheless, studies predominantly focus on offline video question answering, neglecting memory usage and response speed that are essential in various real-world applications, such as Deepseek services, autonomous driving, and robotics. To mitigate these challenges, we propose LiveVLM, a training-free framework specifically designed for streaming, online video understanding and real-time interaction. Unlike existing works that process videos only after one question is posed, LiveVLM constructs an innovative streaming-oriented KV cache to process video streams in real-time, retain long-term video details and eliminate redundant KVs, ensuring prompt responses to user queries. For continuous video streams, LiveVLM generates and compresses video key-value tensors (video KVs) to preserve visual information while improving memory efficiency. Furthermore, when a new question is proposed, LiveVLM incorporates an online question-answering process that efficiently fetches both short-term and long-term visual information, while minimizing interference from redundant context. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to process 44× as many frames on the same device, and achieves up to 5× speedup in response speed compared with SoTA online methods at an input of 256 frames, while maintaining the same or better model performance.

[50] DiffProb: Data Pruning for Face Recognition

Eduarda Caldeira,Jan Niklas Kolf,Naser Damer,Fadi Boutros

Main category: cs.CV

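TL;DR: DiffProb is a data pruning method for face recognition that evaluates the prediction probabilities of training samples, prunes redundant ones, and cleans mislabeled samples, significantly reducing data volume and training cost.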

Details

Motivation: Reliance on large-scale annotated datasets raises computational cost, storage, and privacy concerns, so an efficient data pruning method is needed to reduce dependence on large datasets. Method: DiffProb prunes redundant data by analyzing sample prediction probabilities and adds an auxiliary mechanism to clean mislabeled samples. Result: Trained on CASIA-WebFace, DiffProb can prune up to 50% of the data while maintaining or improving verification accuracy, and it is robust across architectures and loss functions. Conclusion: DiffProb effectively reduces training cost and data volume, providing an efficient solution for face recognition. Abstract: Face recognition models have made substantial progress due to advances in deep learning and the availability of large-scale datasets. However, reliance on massive annotated datasets introduces challenges related to training computational cost and data storage, as well as potential privacy concerns regarding managing large face datasets. This paper presents DiffProb, the first data pruning approach for the application of face recognition. DiffProb assesses the prediction probabilities of training samples within each identity and prunes the ones with identical or close prediction probability values, as they are likely reinforcing the same decision boundaries, and thus contribute minimal new information. We further enhance this process with an auxiliary cleaning mechanism to eliminate mislabeled and label-flipped samples, boosting data quality with minimal loss. Extensive experiments on CASIA-WebFace with different pruning ratios and multiple benchmarks, including LFW, CFP-FP, and IJB-C, demonstrate that DiffProb can prune up to 50% of the dataset while maintaining or even, in some settings, improving the verification accuracies. Additionally, we demonstrate DiffProb's robustness across different architectures and loss functions. Our method significantly reduces training cost and data volume, enabling efficient face recognition training and reducing the reliance on massive datasets and their demanding management.
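
The pruning rule in the abstract (drop samples of the same identity whose prediction probabilities are identical or close) can be sketched as follows. This is not the authors' implementation: the fixed gap `eps`, the greedy sorted pass, and the name `prune_by_probability` are assumptions chosen for illustration, and the auxiliary label-cleaning step is left out.

```python
from collections import defaultdict


def prune_by_probability(samples, eps=0.05):
    """Keep one representative per cluster of near-equal probabilities.

    samples: list of (sample_id, identity, prob_of_true_identity).
    eps is a HYPOTHETICAL closeness threshold, not a value from the paper.
    """
    by_identity = defaultdict(list)
    for sid, ident, prob in samples:
        by_identity[ident].append((prob, sid))

    kept = []
    for items in by_identity.values():
        items.sort()  # ascending by probability within one identity
        last_prob = None
        for prob, sid in items:
            # Keep a sample only if it is far enough from the last kept one.
            if last_prob is None or prob - last_prob > eps:
                kept.append(sid)
                last_prob = prob
    return kept


samples = [
    ("a1", "alice", 0.90),
    ("a2", "alice", 0.91),  # near-duplicate of a1 in probability
    ("a3", "alice", 0.60),
    ("b1", "bob", 0.80),
    ("b2", "bob", 0.80),    # identical probability to b1
]
kept = prune_by_probability(samples)
print(sorted(kept))
```

Here two of the five samples are removed because another sample of the same identity already covers their probability value.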

[51] GS2E: Gaussian Splatting is an Effective Data Generator for Event Stream Generation

Yuchen Li,Chaoran Feng,Zhenyu Tang,Kaiyuan Deng,Wangbo Yu,Yonghong Tian,Li Yuan

Main category: cs.CV

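TL;DR: GS2E is a large-scale synthetic event dataset built on 3D Gaussian reconstruction and physically-informed simulation for high-fidelity event vision tasks.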

Details

Motivation: Existing event datasets are usually synthesized from dense RGB videos and lack viewpoint diversity and geometric consistency, or depend on expensive hardware; GS2E aims to solve these problems. Method: Reconstructs real static scenes with 3D Gaussian Splatting and applies a physically-informed event simulation pipeline that includes adaptive trajectory interpolation and event contrast threshold modeling. Result: Experiments show that GS2E generalizes well on event-based 3D reconstruction and is suitable as a benchmark for event vision research. Conclusion: GS2E provides a high-quality, diverse dataset for event vision research and advances the field. Abstract: We introduce GS2E (Gaussian Splatting to Event), a large-scale synthetic event dataset for high-fidelity event vision tasks, captured from real-world sparse multi-view RGB images. Existing event datasets are often synthesized from dense RGB videos, which typically lack viewpoint diversity and geometric consistency, or depend on expensive, difficult-to-scale hardware setups. GS2E overcomes these limitations by first reconstructing photorealistic static scenes using 3D Gaussian Splatting, and subsequently employing a novel, physically-informed event simulation pipeline. This pipeline generally integrates adaptive trajectory interpolation with physically-consistent event contrast threshold modeling. Such an approach yields temporally dense and geometrically consistent event streams under diverse motion and lighting conditions, while ensuring strong alignment with underlying scene structures. Experimental results on event-based 3D reconstruction demonstrate GS2E's superior generalization capabilities and its practical value as a benchmark for advancing event vision research.
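
For readers unfamiliar with contrast-threshold modeling, the sketch below shows the standard idealized event-camera model for a single pixel: an event fires each time the log intensity moves by a fixed threshold from the last reference level. This is the textbook model, not GS2E's pipeline; the abstract does not specify its simulator, and the function name `simulate_events` and the constant threshold are assumptions.

```python
import math


def simulate_events(intensities, timestamps, threshold=0.2):
    """Return (timestamp, polarity) events for one pixel.

    Idealized model: fire +1/-1 whenever log-intensity changes by `threshold`
    relative to the reference level, then move the reference by one step.
    Events are stamped with the frame time (no sub-frame interpolation).
    """
    events = []
    ref = math.log(intensities[0])
    for t, value in zip(timestamps[1:], intensities[1:]):
        diff = math.log(value) - ref
        while abs(diff) >= threshold:
            polarity = 1 if diff > 0 else -1
            events.append((t, polarity))
            ref += polarity * threshold
            diff = math.log(value) - ref
    return events


# Pixel brightens by one unit of log intensity, then stays constant.
print(simulate_events([1.0, math.e, math.e], [0, 1, 2], threshold=0.5))
```

Rendering frames densely along an interpolated camera trajectory, as the abstract describes, would feed a loop like this with finely spaced intensity samples so that event timestamps become temporally dense.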

[52] R3GS: Gaussian Splatting for Robust Reconstruction and Relocalization in Unconstrained Image Collections

Xu yan,Zhaohui Wang,Rong Wei,Jingbo Yu,Dong Li,Xiangde Liu

Main category: cs.CV

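TL;DR: R3GS is a robust reconstruction and relocalization framework for unconstrained datasets that combines global and local features, improves training and rendering efficiency, and reduces storage requirements.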

Details

Motivation: Addresses the interference of transient objects and sky regions with reconstruction in unconstrained datasets, and the effect of lighting changes on relocalization. Method: Trains with a hybrid representation that combines CNN global features with multiresolution hash-grid local features, predicts Gaussian attributes with MLPs, and introduces a lightweight human detection network and a sky-handling technique. Result: Achieves state-of-the-art performance on in-the-wild datasets, with notable gains in rendering fidelity, efficiency, and storage. Conclusion: R3GS effectively addresses reconstruction and relocalization in complex scenes and has broad application potential. Abstract: We propose R3GS, a robust reconstruction and relocalization framework tailored for unconstrained datasets. Our method uses a hybrid representation during training. Each anchor combines a global feature from a convolutional neural network (CNN) with a local feature encoded by the multiresolution hash grids [2]. Subsequently, several shallow multi-layer perceptrons (MLPs) predict the attributes of each Gaussian, including color, opacity, and covariance. To mitigate the adverse effects of transient objects on the reconstruction process, we fine-tune a lightweight human detection network. Once fine-tuned, this network generates a visibility map that efficiently generalizes to other transient objects (such as posters, banners, and cars) with minimal need for further adaptation. Additionally, to address the challenges posed by sky regions in outdoor scenes, we propose an effective sky-handling technique that incorporates a depth prior as a constraint. This allows the infinitely distant sky to be represented on the surface of a large-radius sky sphere, significantly reducing floaters caused by errors in sky reconstruction. Furthermore, we introduce a novel relocalization method that remains robust to changes in lighting conditions while estimating the camera pose of a given image within the reconstructed 3DGS scene. As a result, R3GS significantly enhances rendering fidelity, improves both training and rendering efficiency, and reduces storage requirements. Our method achieves state-of-the-art performance compared to baseline methods on in-the-wild datasets. The code will be made open-source following the acceptance of the paper.

[53] BadSR: Stealthy Label Backdoor Attacks on Image Super-Resolution

Ji Guo,Xiaolei Wen,Wenbo Jiang,Cheng Huang,Jinjin Li,Hongwei Li

Main category: cs.CV

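TL;DR: The paper proposes BadSR, which improves the stealthiness of poisoned high-resolution images to make backdoor attacks on super-resolution models more effective.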

Details

Motivation: Existing backdoor attacks on super-resolution models focus mainly on the stealthiness of low-resolution images and ignore that of high-resolution images, so users can easily spot anomalous data. Method: BadSR brings the clean high-resolution image close to the target image in feature space while limiting how far the clean image is modified, and combines this with an adversarially optimized trigger and a genetic-algorithm-driven poisoned sample selection method. Result: Experiments show that BadSR achieves a high attack success rate across various models and datasets and significantly affects downstream tasks. Conclusion: BadSR improves both the stealthiness and the effectiveness of backdoor attacks, offering new insight for research on the security of super-resolution models. Abstract: With the widespread application of super-resolution (SR) in various fields, researchers have begun to investigate its security. Previous studies have demonstrated that SR models can also be subjected to backdoor attacks through data poisoning, affecting downstream tasks. A backdoor SR model generates an attacker-predefined target image when given a triggered image while producing a normal high-resolution (HR) output for clean images. However, prior backdoor attacks on SR models have primarily focused on the stealthiness of poisoned low-resolution (LR) images while ignoring the stealthiness of poisoned HR images, making it easy for users to detect anomalous data. To address this problem, we propose BadSR, which improves the stealthiness of poisoned HR images. The key idea of BadSR is to approximate the clean HR image and the pre-defined target image in the feature space while ensuring that modifications to the clean HR image remain within a constrained range. The poisoned HR images generated by BadSR can be integrated with existing triggers. To further improve the effectiveness of BadSR, we design an adversarially optimized trigger and a backdoor gradient-driven poisoned sample selection method based on a genetic algorithm. The experimental results show that BadSR achieves a high attack success rate in various models and data sets, significantly affecting downstream tasks.

[54] FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion

Kazuaki Mishima,Antoni Bigata Casademunt,Stavros Petridis,Maja Pantic,Kenji Suzuki

Main category: cs.CV

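TL;DR: Proposes a new identity-conditional diffusion model that uses lightweight control modules to independently manipulate facial pose, expression, and emotion while preserving identity.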

Details

Motivation: Existing methods struggle to control non-identity attributes (such as pose, expression, and emotion) precisely and to disentangle identity from these mutable factors. Method: Embed two lightweight control modules in the cross-attention layers of the base diffusion model and use a tailored training strategy that keeps identity features orthogonal to the control signals. Result: Quantitative and qualitative evaluations show that the method surpasses existing approaches in control accuracy over pose, expression, and emotion, while also improving generative diversity. Conclusion: The method performs well at identity-conditional facial attribute control, offering higher controllability and diversity. Abstract: Human facial images encode a rich spectrum of information, encompassing both stable identity-related traits and mutable attributes such as pose, expression, and emotion. While recent advances in image generation have enabled high-quality identity-conditional face synthesis, precise control over non-identity attributes remains challenging, and disentangling identity from these mutable factors is particularly difficult. To address these limitations, we propose a novel identity-conditional diffusion model that introduces two lightweight control modules designed to independently manipulate facial pose, expression, and emotion without compromising identity preservation. These modules are embedded within the cross-attention layers of the base diffusion model, enabling precise attribute control with minimal parameter overhead. Furthermore, our tailored training strategy, which leverages cross-attention between the identity feature and each non-identity control feature, encourages identity features to remain orthogonal to control signals, enhancing controllability and diversity. Quantitative and qualitative evaluations, along with perceptual user studies, demonstrate that our method surpasses existing approaches in terms of control accuracy over pose, expression, and emotion, while also improving generative diversity under identity-only conditioning.

[55] CEBSNet: Change-Excited and Background-Suppressed Network with Temporal Dependency Modeling for Bitemporal Change Detection

Qi'ao Xu,Yan Xing,Jiali Hu,Yunan Jia,Rui Huang

Main category: cs.CV

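TL;DR: CEBSNet is a new change detection network that models temporal dependency and suppresses background noise, significantly improving the detection of both prominent and subtle changes.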

Details

Motivation: Existing methods overlook temporal dependencies and overemphasize prominent changes while ignoring subtle ones, which limits detection quality. Method: Proposes CEBSNet, which comprises a Channel Swap Module (CSM) to model temporal dependency, a Feature Excitation and Suppression Module (FESM) to capture changes, and Pyramid-Aware Spatial-Channel Attention (PASCA) to strengthen multi-scale detection. Result: Achieves state-of-the-art performance on three street view datasets and two remote sensing datasets. Conclusion: By modeling temporal dependency and suppressing background noise, CEBSNet significantly improves the accuracy and robustness of change detection. Abstract: Change detection, a critical task in remote sensing and computer vision, aims to identify pixel-level differences between image pairs captured at the same geographic area but different times. It faces numerous challenges such as illumination variation, seasonal changes, background interference, and shooting angles, especially with a large time gap between images. While current methods have advanced, they often overlook temporal dependencies and overemphasize prominent changes while ignoring subtle but equally important changes. To address these limitations, we introduce CEBSNet, a novel change-excited and background-suppressed network with temporal dependency modeling for change detection. During the feature extraction, we utilize a simple Channel Swap Module (CSM) to model temporal dependency, reducing differences and noise. The Feature Excitation and Suppression Module (FESM) is developed to capture both obvious and subtle changes, maintaining the integrity of change regions. Additionally, we design a Pyramid-Aware Spatial-Channel Attention module (PASCA) to enhance the ability to detect change regions at different sizes and focus on critical regions. We conduct extensive experiments on three common street view datasets and two remote sensing datasets, and our method achieves state-of-the-art performance.
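The abstract describes the Channel Swap Module only as "simple", so the sketch below is a guess at the general idea rather than the paper's definition: some channels are exchanged between the two time steps so that each feature stream sees part of the other. The swap rule (every `period`-th channel) and the name `channel_swap` are assumptions made for illustration.

```python
def channel_swap(feat_a, feat_b, period=2):
    """Exchange every `period`-th channel between two feature maps.

    feat_a and feat_b are lists of channels; a channel can be any object
    (for example a 2D list of activations). The swap rule is an assumption.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("both feature maps need the same number of channels")
    out_a, out_b = list(feat_a), list(feat_b)
    for c in range(0, len(feat_a), period):
        out_a[c], out_b[c] = feat_b[c], feat_a[c]
    return out_a, out_b


# Channels are named after their source image to make the exchange visible.
t1 = ["a0", "a1", "a2", "a3"]
t2 = ["b0", "b1", "b2", "b3"]
mixed_t1, mixed_t2 = channel_swap(t1, t2)
```

The operation adds no parameters, and later layers in each branch then process a mix of both time steps.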

[56] SoftHGNN: Soft Hypergraph Neural Networks for General Visual Recognition

Mengqi Lei,Yihong Wu,Siqi Li,Xinhu Zheng,Juan Wang,Yue Gao,Shaoyi Du

Main category: cs.CV

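TL;DR: The paper proposes SoftHGNN, a soft hypergraph neural network whose dynamic, differentiable hyperedge assignment replaces the static hard hyperedge assignment of existing hypergraph neural networks, capturing high-order associations in visual scenes more efficiently.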

Details

Motivation: Mainstream self-attention methods model global pairwise relations well but fail to capture the high-order associations in real scenes and suffer from redundant computation. Hypergraphs can model high-order interactions, but existing hypergraph neural networks rely on static hard hyperedge assignments, which leads to redundancy and a loss of semantic continuity. Method: Proposes SoftHGNN, which introduces soft hyperedges: learnable hyperedge prototypes dynamically assign vertex participation weights, producing semantically rich soft hyperedges. A sparse hyperedge selection mechanism and a load-balancing regularizer improve efficiency. Result: Across three tasks on five datasets, SoftHGNN significantly improves performance and efficiently captures high-order associations. Conclusion: With soft hyperedges and dynamic assignment, SoftHGNN offers an efficient and general hypergraph computation method for visual recognition tasks. Abstract: Visual recognition relies on understanding both the semantics of image tokens and the complex interactions among them. Mainstream self-attention methods, while effective at modeling global pair-wise relations, fail to capture high-order associations inherent in real-world scenes and often suffer from redundant computation. Hypergraphs extend conventional graphs by modeling high-order interactions and offer a promising framework for addressing these limitations. However, existing hypergraph neural networks typically rely on static and hard hyperedge assignments, leading to excessive and redundant hyperedges with hard binary vertex memberships that overlook the continuity of visual semantics. To overcome these issues, we present Soft Hypergraph Neural Networks (SoftHGNNs), which extend the methodology of hypergraph computation to make it truly efficient and versatile in visual recognition tasks. Our framework introduces the concept of soft hyperedges, where each vertex is associated with hyperedges via continuous participation weights rather than hard binary assignments. This dynamic and differentiable association is achieved by using the learnable hyperedge prototype. Through similarity measurements between token features and the prototype, the model generates semantically rich soft hyperedges. SoftHGNN then aggregates messages over soft hyperedges to capture high-order semantics. To further enhance efficiency when scaling up the number of soft hyperedges, we incorporate a sparse hyperedge selection mechanism that activates only the top-k important hyperedges, along with a load-balancing regularizer to ensure balanced hyperedge utilization. Experimental results across three tasks on five datasets demonstrate that SoftHGNN efficiently captures high-order associations in visual scenes, achieving significant performance improvements.
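The soft-hyperedge mechanism in the abstract (participation weights from token-prototype similarity, followed by message aggregation) can be sketched in a few lines. This is a minimal illustration under assumptions: dot-product similarity, a softmax over hyperedges for each token, and weighted-mean aggregation are choices made here, not details confirmed by the abstract, and the top-k selection and load-balancing terms are omitted.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_hypergraph_layer(tokens, prototypes):
    """One round of token -> hyperedge -> token message passing."""
    n, k, dim = len(tokens), len(prototypes), len(tokens[0])
    # participation[i][e]: continuous weight of token i in hyperedge e.
    participation = [softmax([dot(t, p) for p in prototypes]) for t in tokens]
    # Token -> hyperedge: weighted mean of the participating token features.
    edge_feats = []
    for e in range(k):
        mass = sum(participation[i][e] for i in range(n))
        edge_feats.append([
            sum(participation[i][e] * tokens[i][d] for i in range(n)) / mass
            for d in range(dim)
        ])
    # Hyperedge -> token: each token reads back from its hyperedges.
    updated = [
        [sum(participation[i][e] * edge_feats[e][d] for e in range(k))
         for d in range(dim)]
        for i in range(n)
    ]
    return participation, updated


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for learnable prototypes
participation, updated = soft_hypergraph_layer(tokens, prototypes)
```

In this toy input the third token is equally similar to both prototypes, so it belongs to each hyperedge with weight 0.5, which a hard binary assignment could not express.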

[57] Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models

Ria Shekhawat,Hailin Li,Raghavendra Ramachandra,Sushma Venkatesh

Main category: cs.CV

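TL;DR: This paper is the first to apply multimodal large language models (LLMs) to differential morphing attack detection (D-MAD) on real biometric data, using Chain-of-Thought (CoT) prompt engineering to improve reliability and interpretability. Experiments show that ChatGPT-4o outperforms Gemini in detection accuracy, but both struggle under challenging conditions.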

Details

Motivation: Use multimodal LLMs to improve the accuracy and interpretability of morphing attack detection (MAD), especially in real-world biometric applications. Method: Design Chain-of-Thought (CoT)-based prompts to reduce failure-to-answer rates and strengthen the reasoning behind decisions; evaluate two multimodal LLMs (ChatGPT-4o and Gemini) on real biometric data. Result: ChatGPT-4o outperforms Gemini in detection accuracy, especially against GAN-based morphs, but both struggle under challenging conditions. Gemini gives more consistent explanations, while ChatGPT-4o is more resilient but has a higher failure-to-answer rate. Conclusion: Multimodal LLMs show potential for D-MAD but need further work to handle challenging conditions. ChatGPT-4o and Gemini each have strengths and weaknesses, and prompt engineering is key to improving model performance. Abstract: Leveraging the power of multimodal large language models (LLMs) offers a promising approach to enhancing the accuracy and interpretability of morphing attack detection (MAD), especially in real-world biometric applications. This work introduces the use of LLMs for differential morphing attack detection (D-MAD). To the best of our knowledge, this is the first study to apply multimodal LLMs to D-MAD using real biometric data. To effectively utilize these models, we design Chain-of-Thought (CoT)-based prompts to reduce failure-to-answer rates and enhance the reasoning behind decisions. Our contributions include: (1) the first application of multimodal LLMs for D-MAD using real data subjects, (2) CoT-based prompt engineering to improve response reliability and explainability, (3) comprehensive qualitative and quantitative benchmarking of LLM performance using data from 54 individuals captured in passport enrollment scenarios, and (4) comparative analysis of two multimodal LLMs, ChatGPT-4o and Gemini, providing insights into their morphing attack detection accuracy and decision transparency. Experimental results show that ChatGPT-4o outperforms Gemini in detection accuracy, especially against GAN-based morphs, though both models struggle under challenging conditions. While Gemini offers more consistent explanations, ChatGPT-4o is more resilient but prone to a higher failure-to-answer rate.

[58] Parameter-Efficient Fine-Tuning of Multispectral Foundation Models for Hyperspectral Image Classification

Bernardin Ligan,Khalide Jbilou,Fahd Kalloubi,Ahmed Ratnani

Main category: cs.CV

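TL;DR: Proposes KronA+, an efficient fine-tuning method for adapting the multispectral foundation model SpectralGPT to hyperspectral image classification, with far fewer trainable parameters and much less storage.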

Details

Motivation: For hyperspectral image (HSI) classification, most existing foundation models are designed for multispectral data, and fine-tuning them directly requires substantial resources. Method: Explores several parameter-efficient fine-tuning (PEFT) methods, including LoRA, KronA, LoKr, and LoRA+, and proposes the improved KronA+. Result: Performs well on five datasets; KronA+ needs only 0.056% trainable parameters and about 0.2 MB of storage while approaching full fine-tuning performance. Conclusion: KronA+ is the most effective PEFT method tested and offers a lightweight solution for hyperspectral classification. Abstract: Foundation models have achieved great success across diverse domains, including remote sensing (RS), thanks to their versatility and strong generalization abilities. However, most RS foundation models are designed for multispectral data, while hyperspectral imagery (HSI) - with its hundreds of spectral bands - remains less explored. Fine-tuning such models for downstream tasks is also challenging, often demanding considerable memory and storage. In this paper, we propose an efficient framework to fine-tune SpectralGPT, a multispectral foundation model, for hyperspectral image classification (HSIC). We explore several Parameter-Efficient Fine-Tuning (PEFT) methods, including Low-Rank Adaptation (LoRA), Kronecker-based adaptation (KronA), Low-Rank Kronecker (LoKr), and the recent LoRA+, which uses distinct learning rates for low-rank adapters scaled by a factor lambda. Inspired by LoRA+, we introduce KronA+, which applies a similar mechanism to the Kronecker matrices. We evaluate our approach on five datasets from different sensors, showing competitive performance with state-of-the-art HSI models. Our full fine-tuning (FFT) setup for SpectralGPT even outperforms a dedicated hyperspectral foundation model on some datasets while requiring only a quarter of the training epochs. Under the same number of epochs, KronA+ reaches similar performance with far fewer trainable parameters - just 0.056 percent - and adds only approximately 0.2 megabytes of storage, making it the most effective PEFT method tested.
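To show why Kronecker-based adapters are so small, the sketch below builds a weight update as a Kronecker product of two small factors, counts the trainable parameters, and assigns the two factors different learning rates in the spirit of LoRA+. The layer size, factor shapes, rank, and the choice of which factor receives the scaled rate are illustrative assumptions; the abstract does not give these details, and the numbers here do not reproduce the reported 0.056 percent.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def kron_params(a_shape, b_shape):
    """Trainable parameters of a Kronecker adapter with factors A and B."""
    return a_shape[0] * a_shape[1] + b_shape[0] * b_shape[1]


def lora_params(d_out, d_in, rank):
    """Trainable parameters of a rank-`rank` LoRA adapter."""
    return rank * (d_in + d_out)


def adapter_learning_rates(base_lr, lam):
    """Separate learning rates for the two factors (assumed: B is scaled by lambda)."""
    return {"A": base_lr, "B": base_lr * lam}


# Tiny numeric example: a 1x2 factor and a 2x2 factor give a 2x4 update.
delta_w = kron([[1, 2]], [[0, 1], [1, 0]])

# Hypothetical 768x768 layer: factors of shape 24x32 and 32x24 multiply to 768x768.
full = 768 * 768
kron_count = kron_params((24, 32), (32, 24))
lora_count = lora_params(768, 768, rank=8)
rates = adapter_learning_rates(base_lr=1e-4, lam=16)
```

Under these assumed shapes the Kronecker adapter trains 1,536 values against 12,288 for rank-8 LoRA and 589,824 for the full matrix.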

[59] My Face Is Mine, Not Yours: Facial Protection Against Diffusion Model Face Swapping

Hon Ming Yam,Zhongliang Guo,Chun Pong Lau

Main category: cs.CV

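TL;DR: This paper proposes a new proactive defense strategy against diffusion models that uses adversarial attacks to protect facial images in advance from being used for deepfakes.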

Details

Motivation: Diffusion models have become the predominant framework for high-quality deepfakes, but existing adversarial protection methods mainly target conventional generative architectures (such as GANs, AEs, and VAEs) and cannot handle the unique challenges of diffusion models. Method: Proposes a proactive defense strategy that protects facial images in advance through adversarial attacks, addressing the reliance of existing methods on specific model architectures and global perturbation strategies. Result: The method defends effectively against diverse diffusion-based deepfake attacks and provides protection specific to the facial region. Conclusion: The work offers a more effective proactive defense against diffusion-based deepfakes and fills a gap in existing techniques. Abstract: The proliferation of diffusion-based deepfake technologies poses significant risks for unauthorized and unethical facial image manipulation. While traditional countermeasures have primarily focused on passive detection methods, this paper introduces a novel proactive defense strategy through adversarial attacks that preemptively protect facial images from being exploited by diffusion-based deepfake systems. Existing adversarial protection methods predominantly target conventional generative architectures (GANs, AEs, VAEs) and fail to address the unique challenges presented by diffusion models, which have become the predominant framework for high-quality facial deepfakes. Current diffusion-specific adversarial approaches are limited by their reliance on specific model architectures and weights, rendering them ineffective against the diverse landscape of diffusion-based deepfake implementations. Additionally, they typically employ global perturbation strategies that inadequately address the region-specific nature of facial manipulation in deepfakes.

[60] Objective Bicycle Occlusion Level Classification using a Deformable Parts-Based Model

Angelique Mangubat,Shane Gilroy

Main category: cs.CV

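TL;DR: Proposes a new computer-vision-based method for classifying bicycle occlusion level, significantly improving the ability to quantify bicycle visibility and occlusion.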

Details

Motivation: Improve road safety, especially for vulnerable cyclists, by objectively quantifying bicycle occlusion level to improve detection algorithms. Method: Uses a parts-based detection model and a custom image detection pipeline to propose a new method for quantifying bicycle occlusion level. Result: The model robustly quantifies the visibility and occlusion level of bicycles and improves on existing subjective methods. Conclusion: The method could be widely used to evaluate the performance of cyclist detection algorithms and to support more robust road user detection for autonomous driving. Abstract: Road safety is a critical challenge, particularly for cyclists, who are among the most vulnerable road users. This study aims to enhance road safety by proposing a novel benchmark for bicycle occlusion level classification using advanced computer vision techniques. Utilizing a parts-based detection model, images are annotated and processed through a custom image detection pipeline. A novel method of bicycle occlusion level is proposed to objectively quantify the visibility and occlusion level of bicycle semantic parts. The findings indicate that the model robustly quantifies the visibility and occlusion level of bicycles, a significant improvement over the subjective methods used by the current state of the art. Widespread use of the proposed methodology will facilitate the accurate performance reporting of cyclist detection algorithms for occluded cyclists, informing the development of more robust vulnerable road user detection methods for autonomous vehicles.

[61] Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Dasol Choi,Seunghyun Lee,Youngsook Song

Main category: cs.CV

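TL;DR: The paper studies the reliability of vision-language models (VLMs) in safety-critical scenarios and finds that they recognize real emergencies well but produce a high rate of false alarms on safe scenes.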

Details

Motivation: Examine the reliability of VLMs in safety-critical applications and expose potential problems. Method: Evaluate 14 VLMs with the VERI dataset (200 images, 100 contrastive pairs) and a two-stage protocol (risk identification and emergency response). Result: Models identify real emergencies with a 70-100% success rate but raise false alarms on 31-96% of safe scenes, showing a systematic overreaction problem. Conclusion: Increasing model scale does not solve these problems; targeted improvements are needed in how VLMs assess contextual safety in visually misleading scenarios. Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in understanding visual content, but their reliability in safety-critical contexts remains under-explored. We introduce VERI (Visual Emergency Recognition Dataset), a carefully designed diagnostic benchmark of 200 images (100 contrastive pairs). Each emergency scene is matched with a visually similar but safe counterpart through multi-stage human verification and iterative refinement. Using a two-stage protocol - risk identification and emergency response - we evaluate 14 VLMs (2B-124B parameters) across medical emergencies, accidents, and natural disasters. Our analysis reveals a systematic overreaction problem: models excel at identifying real emergencies (70-100 percent success rate) but suffer from an alarming rate of false alarms, misidentifying 31-96 percent of safe situations as dangerous, with 10 scenarios failed by all models regardless of scale. This "better-safe-than-sorry" bias manifests primarily through contextual overinterpretation (88-93 percent of errors), challenging VLMs' reliability for safety applications. These findings highlight persistent limitations that are not resolved by increasing model scale, motivating targeted approaches for improving contextual safety assessment in visually misleading scenarios.
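The two headline numbers in the VERI abstract, success rate on real emergencies and false-alarm rate on safe counterparts, are simple ratios over the contrastive pairs. The sketch below shows one way to compute them; the function name and the toy predictions are invented for illustration and are not data from the paper.

```python
def veri_style_rates(predictions):
    """Compute (emergency success rate, false-alarm rate).

    predictions: list of (is_emergency, flagged_as_emergency) boolean pairs,
    one per image.
    """
    emergencies = [flag for truth, flag in predictions if truth]
    safe = [flag for truth, flag in predictions if not truth]
    success_rate = sum(emergencies) / len(emergencies)
    false_alarm_rate = sum(safe) / len(safe)
    return success_rate, false_alarm_rate


# Toy example: four contrastive pairs (eight images), made-up model outputs.
toy = [
    (True, True), (False, True),
    (True, True), (False, False),
    (True, True), (False, True),
    (True, False), (False, False),
]
success_rate, false_alarm_rate = veri_style_rates(toy)
```

Reporting both rates matters because a model that flags every image as dangerous would score a perfect success rate while being useless in practice.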

[62] RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation

Naman Patel,Prashanth Krishnamurthy,Farshad Khorrami

Main category: cs.CV

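TL;DR: The paper proposes a zero-shot framework that combines GPU-accelerated geometric reconstruction with open-vocabulary vision-language models for real-time 3D semantic mapping and natural language interaction.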

Details

Motivation: Existing 3D semantic mapping systems lack the flexibility to build open-vocabulary semantic maps during online operation, and vision-language models have not yet achieved 3D spatial understanding. Method: Combine GPU-accelerated geometric reconstruction with open-vocabulary models through online instance-level semantic embedding fusion and hierarchical object association. Result: The system performs well on tasks such as zero-shot 3D instance retrieval, segmentation, and object detection, and supports reasoning about unseen objects and natural language queries. Conclusion: The framework provides an efficient and flexible solution for general-purpose 3D scene understanding. Abstract: Mapping and understanding complex 3D environments is fundamental to how autonomous systems perceive and interact with the physical world, requiring both precise geometric reconstruction and rich semantic comprehension. While existing 3D semantic mapping systems excel at reconstructing and identifying predefined object instances, they lack the flexibility to efficiently build semantic maps with open-vocabulary during online operation. Although recent vision-language models have enabled open-vocabulary object recognition in 2D images, they haven't yet bridged the gap to 3D spatial understanding. The critical challenge lies in developing a training-free unified system that can simultaneously construct accurate 3D maps while maintaining semantic consistency and supporting natural language interactions in real time. In this paper, we develop a zero-shot framework that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary vision-language models through online instance-level semantic embedding fusion, guided by hierarchical object association with spatial indexing. Our training-free system achieves superior performance through incremental processing and unified geometric-semantic updates, while robustly handling 2D segmentation inconsistencies. The proposed general-purpose 3D scene understanding framework can be used for various tasks including zero-shot 3D instance retrieval, segmentation, and object detection to reason about previously unseen objects and interpret natural language queries. The project page is available at https://razer-3d.github.io.

[63] The P$^3$ dataset: Pixels, Points and Polygons for Multimodal Building Vectorization

Raphael Sulzer,Liuyun Duan,Nicolas Girard,Florent Lafarge

Main category: cs.CV

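TL;DR: The P$^3$ dataset is a large-scale multimodal benchmark that combines LiDAR point clouds, high-resolution aerial imagery, and vectorized 2D building outlines for building vectorization research.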

Details

Motivation: Existing datasets focus mainly on the image modality; P$^3$ offers a complementary perspective by adding dense 3D information. Method: Predict building polygons from LiDAR point clouds and aerial imagery using hybrid and end-to-end learning frameworks. Result: LiDAR point clouds perform well for predicting building polygons, and fusing LiDAR with imagery further improves accuracy and geometric quality. Conclusion: The P$^3$ dataset is publicly available together with code and pretrained models, providing an important resource for building vectorization research. Abstract: We present the P$^3$ dataset, a large-scale multimodal benchmark for building vectorization, constructed from aerial LiDAR point clouds, high-resolution aerial imagery, and vectorized 2D building outlines, collected across three continents. The dataset contains over 10 billion LiDAR points with decimeter-level accuracy and RGB images at a ground sampling distance of 25 centimeters. While many existing datasets primarily focus on the image modality, P$^3$ offers a complementary perspective by also incorporating dense 3D information. We demonstrate that LiDAR point clouds serve as a robust modality for predicting building polygons, both in hybrid and end-to-end learning frameworks. Moreover, fusing aerial LiDAR and imagery further improves accuracy and geometric quality of predicted polygons. The P$^3$ dataset is publicly available, along with code and pretrained weights of three state-of-the-art models for building polygon prediction at https://github.com/raphaelsulzer/PixelsPointsPolygons .

[64] Expanding Zero-Shot Object Counting with Rich Prompts

Huilin Zhu,Senyao Li,Jingling Yuan,Zhengwei Yang,Yu Guo,Wenxuan Liu,Xian Zhong,Shengfeng He

Main category: cs.CV

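TL;DR: RichCount uses a two-stage training strategy to improve how zero-shot counting models handle unseen categories, strengthening the association between text and image features for more accurate counting.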

Details

Motivation: Simply adding new prompts, as existing methods do, does not align text and visual features well enough, which limits the generalization of zero-shot counting models. Method: RichCount uses a two-stage training strategy: (1) enrich text features with a feed-forward network and adapter; (2) apply the refined encoder to the counting task. Result: On three benchmark datasets, RichCount achieves state-of-the-art zero-shot counting performance and significantly improves generalization to unseen categories. Conclusion: RichCount goes beyond simple prompt expansion through feature alignment, providing an effective solution for zero-shot counting in open-world scenarios. Abstract: Expanding pre-trained zero-shot counting models to handle unseen categories requires more than simply adding new prompts, as this approach does not achieve the necessary alignment between text and visual features for accurate counting. We introduce RichCount, the first framework to address these limitations, employing a two-stage training strategy that enhances text encoding and strengthens the model's association with objects in images. RichCount improves zero-shot counting for unseen categories through two key objectives: (1) enriching text features with a feed-forward network and adapter trained on text-image similarity, thereby creating robust, aligned representations; and (2) applying this refined encoder to counting tasks, enabling effective generalization across diverse prompts and complex images. In this manner, RichCount goes beyond simple prompt expansion to establish meaningful feature alignment that supports accurate counting across novel categories. Extensive experiments on three benchmark datasets demonstrate the effectiveness of RichCount, achieving state-of-the-art performance in zero-shot counting and significantly enhancing generalization to unseen categories in open-world scenarios.

[65] Visual Question Answering on Multiple Remote Sensing Image Modalities

Hichem Boussaid,Lucrezia Tosato,Flora Weissgerber,Camille Kurtz,Laurent Wendling,Sylvain Lobry

Main category: cs.CV

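TL;DR: The paper proposes a multi-modal, multi-resolution remote sensing visual question answering (VQA) task and introduces the new TAMMI dataset and the VisualBERT-based MM-RSVQA model; preliminary experiments show 65.56% accuracy.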

Details

Motivation: In fields such as remote sensing, visual feature extraction can use multiple image modalities (such as RGB, multispectral, and synthetic aperture radar) to improve scene understanding and thereby improve VQA. Method: Introduce the TAMMI dataset, which contains multi-modal imagery and diverse questions, and design the MM-RSVQA model, which builds on VisualBERT to fuse the image modalities and text through a trainable process. Result: Preliminary experiments reach 65.56% accuracy on the TAMMI dataset, supporting the effectiveness of the approach. Conclusion: The work opens a new direction for multi-modal, multi-resolution VQA that also applies to other multi-modal domains such as medical imaging. Abstract: The extraction of visual features is an essential step in Visual Question Answering (VQA). Building a good visual representation of the analyzed scene is indeed one of the essential keys for the system to be able to correctly understand the latter in order to answer complex questions. In many fields such as remote sensing, the visual feature extraction step could benefit significantly from leveraging different image modalities carrying complementary spectral, spatial and contextual information. In this work, we propose to add multiple image modalities to VQA in the particular context of remote sensing, leading to a novel task for the computer vision community. To this end, we introduce a new VQA dataset, named TAMMI (Text and Multi-Modal Imagery) with diverse questions on scenes described by three different modalities (very high resolution RGB, multi-spectral imaging data and synthetic aperture radar). Thanks to an automated pipeline, this dataset can be easily extended according to experimental needs. We also propose the MM-RSVQA (Multi-modal Multi-resolution Remote Sensing Visual Question Answering) model, based on VisualBERT, a vision-language transformer, to effectively combine the multiple image modalities and text through a trainable fusion process. A preliminary experimental study shows promising results of our methodology on this challenging dataset, with an accuracy of 65.56% on the targeted VQA task. This pioneering work paves the way for the community to a new multi-modal multi-resolution VQA task that can be applied in other imaging domains (such as medical imaging) where multi-modality can enrich the visual representation of a scene. The dataset and code are available at https://tammi.sylvainlobry.com/.

[66] Mouse Lockbox Dataset: Behavior Recognition for Mice Solving Lockboxes

Patrik Reiske,Marcus N. Boon,Niek Andresen,Sole Traverso,Katharina Hohlbaum,Lars Lewejohann,Christa Thöne-Reineke,Olaf Hellwich,Henning Sprekeler

Main category: cs.CV

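TL;DR: The paper presents a video dataset of mice solving complex mechanical puzzles (lockboxes), intended to advance automated behavior classification methods.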

Details

Motivation: Existing datasets focus mainly on simple or social behaviors and lack recordings of complex behavior, so a new dataset is needed to support fine-grained behavior analysis. Method: Provides more than 110 hours of video recorded from three perspectives, with human-annotated frame-level action labels, and proposes an action classification framework based on keypoint tracking. Result: Illustrates the challenges of automatically labeling fine-grained behaviors (such as object manipulation) and provides a publicly available dataset. Conclusion: The dataset is expected to accelerate progress in automated behavior classification in computational neuroscience. Abstract: Machine learning and computer vision methods have a major impact on the study of natural animal behavior, as they enable the (semi-)automatic analysis of vast amounts of video data. Mice are the standard mammalian model system in most research fields, but the datasets available today to refine such methods focus either on simple or social behaviors. In this work, we present a video dataset of individual mice solving complex mechanical puzzles, so-called lockboxes. The more than 110 hours of total playtime show their behavior recorded from three different perspectives. As a benchmark for frame-level action classification methods, we provide human-annotated labels for all videos of two different mice, which equal 13% of our dataset. Our keypoint (pose) tracking-based action classification framework illustrates the challenges of automated labeling of fine-grained behaviors, such as the manipulation of objects. We hope that our work will help accelerate the advancement of automated action and behavior classification in the computational neuroscience community. Our dataset is publicly available at https://doi.org/10.14279/depositonce-23850

[67] Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks

Uranik Berisha,Jens Mehnert,Alexandru Paul Condurache

Main category: cs.CV

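TL;DR: Proposes a new method for building MoE variants from pretrained models by clustering activation patterns to extract expert subnetworks, significantly reducing compute and resource demands.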

Details

Motivation: The high computational and resource demands of Vision Transformers are a major challenge, and conventional MoE methods require costly retraining. Method: Extract expert subnetworks from the MLP layers of a pretrained model in two phases: first cluster the output activation patterns, then extract the corresponding subnetworks. Result: On ImageNet-1k, the extracted experts need only minimal fine-tuning to regain 98% of the original performance while reducing MACs by up to 36% and model size by up to 32%. Conclusion: The method effectively reduces the computational and resource demands of Vision Transformers while maintaining high performance. Abstract: Vision Transformers have emerged as the state-of-the-art models in various Computer Vision tasks, but their high computational and resource demands pose significant challenges. While Mixture-of-Experts (MoE) can make these models more efficient, they often require costly retraining or even training from scratch. Recent developments aim to reduce these computational costs by leveraging pretrained networks. These have been shown to produce sparse activation patterns in the Multi-Layer Perceptrons (MLPs) of the encoder blocks, allowing for conditional activation of only relevant subnetworks for each sample. Building on this idea, we propose a new method to construct MoE variants from pretrained models. Our approach extracts expert subnetworks from the model's MLP layers post-training in two phases. First, we cluster output activations to identify distinct activation patterns. In the second phase, we use these clusters to extract the corresponding subnetworks responsible for producing them. On ImageNet-1k recognition tasks, we demonstrate that these extracted experts can perform surprisingly well out of the box and require only minimal fine-tuning to regain 98% of the original performance, all while reducing MACs and model size by up to 36% and 32%, respectively.
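The second phase described in the abstract, turning activation clusters into expert subnetworks, might look roughly like the sketch below. It assumes the clustering from phase one has already produced a label per sample, and it selects, for each cluster, the hidden neurons with the largest mean absolute activation. That selection rule, the function name `extract_experts`, and the toy data are assumptions; the abstract does not state how the authors choose neurons.

```python
def extract_experts(hidden_acts, cluster_ids, keep):
    """Pick, per cluster, the `keep` most active hidden neurons.

    hidden_acts: one list of MLP hidden activations per sample.
    cluster_ids: one cluster label per sample (from a prior clustering step).
    Returns {cluster_id: sorted list of neuron indices}.
    """
    experts = {}
    for cid in sorted(set(cluster_ids)):
        rows = [h for h, c in zip(hidden_acts, cluster_ids) if c == cid]
        n_neurons = len(rows[0])
        mean_abs = [
            sum(abs(r[j]) for r in rows) / len(rows) for j in range(n_neurons)
        ]
        ranked = sorted(range(n_neurons), key=lambda j: mean_abs[j], reverse=True)
        experts[cid] = sorted(ranked[:keep])
    return experts


# Toy data: four samples, four hidden neurons, two clusters.
hidden_acts = [
    [5.0, 0.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 2.0],
    [0.0, 3.0, 6.0, 0.0],
    [0.0, 2.0, 5.0, 1.0],
]
cluster_ids = [0, 0, 1, 1]
experts = extract_experts(hidden_acts, cluster_ids, keep=2)
```

Each expert would then keep only the MLP weights for its selected neurons, which is where savings in MACs and model size would come from.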

[68] On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable?

Raza Imam,Rufael Marew,Mohammad Yaqub

Main category: cs.CV

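TL;DR: The paper introduces the MediMeta-C and MedMNIST-C benchmarks to evaluate the robustness of medical vision-language models (MVLMs) under noise and distortion, and proposes RobustMedCLIP to improve resilience to corruption.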

Details

Motivation: Existing medical vision-language models perform well on clean datasets, but their performance under the noise and distortion found in real clinical images is untested and needs to be evaluated and improved. Method: Introduce the MediMeta-C and MedMNIST-C benchmarks and propose RobustMedCLIP, which strengthens robustness through few-shot tuning. Result: Experiments show that existing MVLMs degrade severely under corruption, and that RobustMedCLIP improves robustness through low-rank adaptation and few-shot tuning. Conclusion: The study underlines the need for diverse training and robust adaptation strategies and points to improvements for the practical use of medical vision-language models. Abstract: Medical Vision-Language Models (MVLMs) have achieved par excellence generalization in medical image analysis, yet their performance under noisy, corrupted conditions remains largely untested. Clinical imaging is inherently susceptible to acquisition artifacts and noise; however, existing evaluations predominantly assess generally clean datasets, overlooking robustness -- i.e., the model's ability to perform under real-world distortions. To address this gap, we first introduce MediMeta-C, a corruption benchmark that systematically applies several perturbations across multiple medical imaging datasets. Combined with MedMNIST-C, this establishes a comprehensive robustness evaluation framework for MVLMs. We further propose RobustMedCLIP, a visual encoder adaptation of a pretrained MVLM that incorporates few-shot tuning to enhance resilience against corruptions. Through extensive experiments, we benchmark 5 major MVLMs across 5 medical imaging modalities, revealing that existing models exhibit severe degradation under corruption and struggle with domain-modality tradeoffs. Our findings highlight the necessity of diverse training and robust adaptation strategies, demonstrating that efficient low-rank adaptation, when paired with few-shot tuning, improves robustness while preserving generalization across modalities.

[69] TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models

Zeqing Wang,Shiyuan Zhang,Chengpei Tang,Keze Wang

Main category: cs.CV

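TL;DR: The paper proposes TimeCausality, a new benchmark for evaluating the causal reasoning ability of vision-language models (VLMs) in the temporal dimension, and finds that current open-source VLMs fall significantly behind closed-source models on this task.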

Details

Motivation: Explore how well vision-language models reason about temporal causality (such as changes in object state), filling a gap in existing research. Method: Design the TimeCausality benchmark and evaluate VLMs on temporal causal reasoning tasks. Result: Current open-source VLMs fall significantly behind closed-source models (such as GPT-4o) on TimeCausality, and GPT-4o itself shows a marked drop on this task. Conclusion: Temporal causal reasoning deserves focused attention in VLM development, and the open-source community needs further work to improve this ability. Abstract: Reasoning about temporal causality, particularly irreversible transformations of objects governed by real-world knowledge (e.g., fruit decay and human aging), is a fundamental aspect of human visual understanding. Unlike temporal perception based on simple event sequences, this form of reasoning requires a deeper comprehension of how object states change over time. Although the current powerful Vision-Language Models (VLMs) have demonstrated impressive performance on a wide range of downstream tasks, their capacity to reason about temporal causality remains underexplored. To address this gap, we introduce TimeCausality, a novel benchmark specifically designed to evaluate the causal reasoning ability of VLMs in the temporal dimension. Based on our TimeCausality, we find that while the current SOTA open-source VLMs have achieved performance levels comparable to closed-source models like GPT-4o on various standard visual question answering tasks, they fall significantly behind on our benchmark compared with their closed-source competitors. Furthermore, even GPT-4o exhibits a marked drop in performance on TimeCausality compared to its results on other tasks. These findings underscore the critical need to incorporate temporal causality into the evaluation and development of VLMs, and they highlight an important challenge for the open-source VLM community moving forward. Code and Data are available at https://github.com/Zeqing-Wang/TimeCausality

[70] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL

Xintong Zhang,Zhi Gao,Bofei Zhang,Pengxiang Li,Xiaowen Zhang,Yang Liu,Tao Yuan,Yuwei Wu,Yunde Jia,Song-Chun Zhu,Qing Li

Main category: cs.CV

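TL;DR: Proposes Chain-of-Focus (CoF), a method that improves the multimodal reasoning of vision-language models by adaptively focusing and zooming in on key image regions. It uses a two-stage training pipeline (supervised fine-tuning and reinforcement learning) and performs well on multiple benchmarks.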

Details

Motivation: The multimodal reasoning capability of existing vision-language models has not been fully explored, and a method is needed that adaptively focuses on key image regions for efficient reasoning. Method: Propose CoF with a two-stage training pipeline of supervised fine-tuning (SFT) and reinforcement learning (RL). The MM-CoF dataset is built for fine-tuning, and RL further refines the model. Result: On the V* benchmark, the model outperforms existing VLMs by 5% across 8 image resolutions, demonstrating the effectiveness of CoF. Conclusion: CoF significantly improves the multimodal reasoning of vision-language models and supports more efficient deployment in practical applications. Abstract: Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, further refining the model's search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark that requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating more efficient deployment of VLMs in practical applications.
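
The RL stage above rewards outcome accuracy and output format. A minimal sketch of such a rule-based reward might look like the following. The `<think>`/`<answer>` tag template, the `format_weight` value, and the exact-match scoring are all assumptions made for illustration; the abstract does not specify them.

```python
import re

# Hypothetical response template: the tag names are an assumption, not from the paper.
_TEMPLATE = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)


def outcome_reward(response: str, gold: str, format_weight: float = 0.5) -> float:
    """Return a format reward plus an accuracy reward for one sampled response."""
    match = _TEMPLATE.match(response.strip())
    if match is None:
        return 0.0  # malformed output earns no reward at all
    predicted = match.group(1).strip().lower()
    accuracy = 1.0 if predicted == gold.strip().lower() else 0.0
    return format_weight + accuracy
```

A well-formed but wrong answer still earns the format term, which is one common way to keep the policy from drifting away from the expected output structure early in training.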

[71] Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation

Jianyuan Guo,Peike Li,Trevor Cohn

Main category: cs.CV

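TL;DR: Proposes a pseudo gloss generation framework that requires no human-annotated sign glosses, using a large language model and weakly supervised learning to improve sign language translation.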

Details

Motivation: Conventional sign language translation relies on expensive human-annotated glosses, which limits scalability. Method: Use a large language model to generate pseudo glosses, refine their alignment through weakly supervised learning, and train the model with a CTC loss. Result: Outperforms existing gloss-free methods on two sign language translation benchmarks and is competitive with gloss-based methods. Conclusion: The proposed framework effectively addresses the scarcity of gloss annotations and makes sign language translation more practical. Abstract: Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm depends on expert-annotated gloss labels, which are costly and rarely available in existing datasets, limiting its scalability. To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with a few example text-gloss pairs using in-context learning to produce draft sign glosses from spoken language text. To enhance the correspondence between LLM-generated pseudo glosses and the sign sequences in video, we correct the ordering in the pseudo glosses for better alignment via a weakly supervised learning process. This reordering facilitates the incorporation of auxiliary alignment objectives, and allows for the use of efficient supervision via a Connectionist Temporal Classification (CTC) loss. We train our SLT model, which consists of a vision encoder and a translator, through a three-stage pipeline, which progressively narrows the modality gap between sign language and spoken language. Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks on two SLT benchmarks and achieves competitive results compared to gloss-based methods.
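
CTC supervision depends on the gloss order matching the sign order in the video, which is why the reordering step matters. The standard CTC collapse rule (merge consecutive repeats, then drop blanks) can be sketched as follows. The blank symbol `_` and the gloss strings are illustrative assumptions, and this shows only the decoding rule, not the paper's training code.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this sketch


def ctc_collapse(frame_labels):
    """Map a per-frame label path to a gloss sequence: merge repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != BLANK]
```

Because the mapping is monotonic, a path can only emit glosses in the order they occur in time, so a pseudo gloss sequence in spoken-language word order would have no valid alignment to the video frames.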

[72] FRN: Fractal-Based Recursive Spectral Reconstruction Network

Ge Meng,Zhongnan Cai,Ruizhe Chen,Jingyan Tu,Yingying Wang,Yue Huang,Xinghao Ding

Main category: cs.CV

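TL;DR: Proposes a Fractal-Based Recursive Spectral Reconstruction Network (FRN) that progressively generates hyperspectral images from RGB images and outperforms existing methods.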

Details

Motivation: Reduce the cost of acquiring hyperspectral images (HSIs) by generating them efficiently from RGB images through spectral reconstruction. Method: Use a progressive approach that recursively invokes an atomic reconstruction module, generating the image at the next wavelength from neighboring-band information, and design a band-aware state space model. Result: Across multiple datasets, FRN outperforms existing methods in both quantitative and qualitative evaluations. Conclusion: Through fractal recursion and band-aware design, FRN achieves efficient, high-quality spectral reconstruction. Abstract: Generating hyperspectral images (HSIs) from RGB images through spectral reconstruction can significantly reduce the cost of HSI acquisition. In this paper, we propose a Fractal-Based Recursive Spectral Reconstruction Network (FRN), which differs from existing paradigms that attempt to directly integrate the full-spectrum information from the R, G, and B channels in a one-shot manner. Instead, it treats spectral reconstruction as a progressive process, predicting from broad to narrow bands or employing a coarse-to-fine approach for predicting the next wavelength. Inspired by fractals in mathematics, FRN establishes a novel spectral reconstruction paradigm by recursively invoking an atomic reconstruction module. In each invocation, only the spectral information from neighboring bands is used to provide clues for the generation of the image at the next wavelength, which follows the low-rank property of spectral data. Moreover, we design a band-aware state space model that employs a pixel-differentiated scanning strategy at different stages of the generation process, further suppressing interference from low-correlation regions caused by reflectance differences. Through extensive experimentation across different datasets, FRN achieves superior reconstruction performance compared to state-of-the-art methods in both quantitative and qualitative evaluations.

[73] Stronger ViTs With Octic Equivariance

David Nordström,Johan Edstedt,Fredrik Kahl,Georg Bökman

Main category: cs.CV

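TL;DR: This paper proposes octic ViTs, a new Vision Transformer (ViT) architecture built on equivariance under the octic group, and shows experimentally that it improves both computational efficiency and performance.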

Details

Motivation: Vision Transformers (ViTs) are the leading architecture in computer vision and already use weight sharing over image patches as an inductive bias. This paper aims to improve their performance and efficiency by adding equivariance under the octic group (reflections and 90-degree rotations) as a further inductive bias. Method: Propose octic ViTs, which use octic-equivariant layers, and validate them with supervised and self-supervised learning under the DeiT-III and DINOv2 frameworks. Result: Octic ViTs substantially reduce computational cost (approximately 40% fewer FLOPs for ViT-H) while improving classification and segmentation results. Conclusion: By incorporating octic equivariance, octic ViTs keep the strengths of ViTs while further improving computational efficiency and model performance. Abstract: Recent efforts at scaling computer vision models have established Vision Transformers (ViTs) as the leading architecture. ViTs incorporate weight sharing over image patches as an important inductive bias. In this work, we show that ViTs benefit from incorporating equivariance under the octic group, i.e., reflections and 90-degree rotations, as a further inductive bias. We develop new architectures, octic ViTs, that use octic-equivariant layers and put them to the test on both supervised and self-supervised learning. Through extensive experiments on DeiT-III and DINOv2 training on ImageNet-1K, we show that octic ViTs yield more computationally efficient networks while also improving performance. In particular, we achieve approximately 40% reduction in FLOPs for ViT-H while simultaneously improving both classification and segmentation results.
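
The octic group named above has eight elements: four rotations by multiples of 90 degrees, each with or without a reflection. The sketch below enumerates them on a small square grid and shows orbit averaging, which is a generic way to obtain an invariant feature. The paper's equivariant layers are a different and more efficient construction, so the helper names here are purely illustrative.

```python
def rot90(grid):
    """Rotate a square grid (tuple of tuples) 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))


def flip(grid):
    """Reflect a grid left to right."""
    return tuple(row[::-1] for row in grid)


def octic_orbit(grid):
    """Return the 8 images of a grid under rotations and reflections."""
    images = []
    current = grid
    for _ in range(4):
        images.append(current)
        images.append(flip(current))
        current = rot90(current)
    return images


def orbit_mean(feature, grid):
    """Average a scalar feature over the orbit, giving an octic-invariant value."""
    return sum(feature(g) for g in octic_orbit(grid)) / 8
```

For a grid with no symmetry of its own, such as `((1, 2), (3, 4))`, all eight images are distinct, and `orbit_mean` returns the same value for the grid and for any rotated or reflected copy of it.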

[74] ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning

Ziqiang Xu,Qi Dai,Tian Xie,Yifan Yang,Kai Qiu,DongDong Chen,Zuxuan Wu,Chong Luo

Main category: cs.CV

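TL;DR: ViaRL is a rule-based reinforcement learning framework that optimizes frame selection in intention-driven video understanding, requires no expensive annotations, and delivers strong performance.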

Details

Motivation: Existing video understanding methods rely on heuristics or pseudo-label annotations, which are costly and scale poorly; ViaRL aims to address this. Method: Use rule-based reinforcement learning, taking the answer accuracy of a downstream model as the reward signal to train a frame selector. Result: Strong results on multiple benchmarks, including a nearly 15% improvement on Needle QA. Conclusion: ViaRL is effective and scalable, and applies to diverse video understanding tasks. Abstract: Video understanding is inherently intention-driven: humans naturally focus on relevant frames based on their goals. Recent advancements in multimodal large language models (MLLMs) have enabled flexible query-driven reasoning; however, video-based frameworks like Video Chain-of-Thought lack direct training signals to effectively identify relevant frames. Current approaches often rely on heuristic methods or pseudo-label supervised annotations, which are both costly and limited in scalability across diverse scenarios. To overcome these challenges, we introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in intention-driven video understanding. An iterated amplification strategy is adopted to perform alternating cyclic training in the video CoT system, where each component undergoes iterative cycles of refinement to improve its capabilities. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error, eliminating the need for expensive annotations while closely aligning with human-like learning processes. Comprehensive experiments across multiple benchmarks, including VideoMME, LVBench, and MLVU, demonstrate that ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks, highlighting its effectiveness and scalability. Notably, ViaRL achieves a nearly 15% improvement on Needle QA, a subset of MLVU that requires searching for a specific needle within a long video and is regarded as one of the most suitable benchmarks for evaluating temporal grounding.

[75] Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models

Die Chen,Zhiwen Li,Cen Chen,Yuexiang Xie,Xiaodan Li,Jinyan Ye,Yingda Chen,Yaliang Li

Main category: cs.CV

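TL;DR: The paper introduces a full-pipeline toolkit, conducts the first systematic study of NSFW concept erasure methods for text-to-image diffusion models, and evaluates their effectiveness.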

Details

Motivation: Diffusion models can inadvertently generate NSFW content, and existing concept erasure methods lack a comprehensive evaluation, so a systematic study is needed. Method: Introduce a full-pipeline toolkit and systematically evaluate NSFW concept erasure methods by relating their underlying mechanisms to empirical observations. Result: Provides in-depth insights and practical guidance on concept erasure methods, laying a foundation for content safety research. Conclusion: The study advances the understanding of content safety in diffusion models and offers an important reference for future research. Abstract: Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the issue associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios remains absent. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in various real-world scenarios, with the aim of advancing the understanding of content safety in diffusion models and establishing a solid foundation for future research and development in this critical area.

[76] Pura: An Efficient Privacy-Preserving Solution for Face Recognition

Guotao Xu,Bowen Zhao,Yang Xiao,Yantao Zhong,Liang Zhai,Qingqi Pei

Main category: cs.CV

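TL;DR: The paper proposes Pura, an efficient privacy-preserving face recognition scheme that builds a non-interactive architecture on the threshold Paillier cryptosystem and adds secure computing protocols and a parallel computing mechanism, substantially speeding up recognition.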

Details

Motivation: Face recognition carries a risk of privacy leakage, and existing privacy-preserving schemes are neither efficient enough nor fully protective. Method: Design a non-interactive architecture on the threshold Paillier cryptosystem, develop secure computing protocols, and introduce a parallel computing mechanism. Result: While protecting privacy, Pura achieves recognition speeds up to 16 times faster than the state of the art. Conclusion: Pura is an efficient face recognition solution with sufficient privacy protection. Abstract: Face recognition is an effective technology for identifying a target person by facial images. However, sensitive facial images raise privacy concerns. Although privacy-preserving face recognition is one potential solution, it neither fully addresses the privacy concerns nor is efficient enough. To this end, we propose an efficient privacy-preserving solution for face recognition, named Pura, which sufficiently protects facial privacy and supports face recognition over encrypted data efficiently. Specifically, we propose a privacy-preserving and non-interactive architecture for face recognition through the threshold Paillier cryptosystem. Additionally, we carefully design a suite of underlying secure computing protocols to enable efficient operations of face recognition over encrypted data directly. Furthermore, we introduce a parallel computing mechanism to enhance the performance of the proposed secure computing protocols. Privacy analysis demonstrates that Pura fully safeguards personal facial privacy. Experimental evaluations demonstrate that Pura achieves recognition speeds up to 16 times faster than the state of the art.
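
Computing over encrypted data rests on the additive homomorphism of Paillier encryption: multiplying ciphertexts adds plaintexts, and raising a ciphertext to a power multiplies its plaintext by that power. The toy sketch below uses plain (non-threshold) Paillier with tiny, insecure primes chosen only for illustration; the function names and the encrypted dot product are assumptions and do not reproduce Pura's protocols.

```python
import math
import random


def keygen(p=293, q=433):
    """Toy key generation with small primes (insecure, for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    mu = pow(lam, -1, n)  # valid shortcut because g = n + 1
    return (n, g), (lam, mu)


def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def encrypted_dot(pub, ciphertexts, weights):
    """Dot product of an encrypted vector with a plaintext integer vector."""
    n2 = pub[0] ** 2
    acc = 1  # 1 is a valid encryption of 0
    for c, w in zip(ciphertexts, weights):
        acc = (acc * pow(c, w, n2)) % n2
    return acc
```

A server holding only the public key could evaluate `encrypted_dot` on encrypted face features without ever seeing them; in a threshold variant the decryption key is additionally split across parties.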

[77] Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models

Jiaying Wu,Fanxiao Li,Min-Yen Kan,Bryan Hooi

Main category: cs.CV

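TL;DR: The paper proposes an automated framework that simulates real-world multimodal news creation with explicitly modeled creator intent, builds the large-scale benchmark DeceptionDecoded, and evaluates 14 state-of-the-art vision-language models on intent-centric tasks.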

Details

Motivation: Understanding a creator's misleading intent is essential for multimodal misinformation detection (MMD), but existing systems lack in-depth modeling of intent. Method: Propose a framework that models the creator's desired influence and execution plan, and use it to build a dataset of 12,000 image-caption pairs covering misleading and non-misleading intents. Result: Current vision-language models fall short in recognizing misleading intent, relying on surface cues rather than deeper reasoning. Conclusion: The study highlights the importance of intent-aware modeling in MMD and opens new directions for systems capable of deeper reasoning. Abstract: The real-world impact of misinformation stems from the underlying misleading narratives that creators seek to convey. As such, interpreting misleading creator intent is essential for multimodal misinformation detection (MMD) systems aimed at effective information governance. In this paper, we introduce an automated framework that simulates real-world multimodal news creation by explicitly modeling creator intent through two components: the desired influence and the execution plan. Using this framework, we construct DeceptionDecoded, a large-scale benchmark comprising 12,000 image-caption pairs aligned with trustworthy reference articles. The dataset captures both misleading and non-misleading intents and spans manipulations across visual and textual modalities. We conduct a comprehensive evaluation of 14 state-of-the-art vision-language models (VLMs) on three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. Despite recent advances, we observe that current VLMs fall short in recognizing misleading intent, often relying on spurious cues such as superficial cross-modal consistency, stylistic signals, and heuristic authenticity hints. Our findings highlight the pressing need for intent-aware modeling in MMD and open new directions for developing systems capable of deeper reasoning about multimodal misinformation.

[78] Spectral-Aware Global Fusion for RGB-Thermal Semantic Segmentation

Ce Zhang,Zifu Wan,Simon Stepputtis,Katia Sycara,Yaqi Xie

Main category: cs.CV

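TL;DR: The paper proposes SGFNet, a method that fuses RGB and thermal data from a spectral perspective, improving semantic segmentation by distinguishing low-frequency from high-frequency features.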

Details

Motivation: RGB data performs poorly under low illumination and occlusion; fusing thermal data improves performance, but fusing multi-modal features effectively remains challenging. Method: Propose SGFNet, which takes a spectral view that splits multi-modal features into low-frequency (scene context) and high-frequency (modality-specific detail) components and explicitly models interactions between the high-frequency features. Result: SGFNet outperforms existing methods on the MFNet and PST900 datasets. Conclusion: The spectral perspective offers a new approach to multi-modal feature fusion, and SGFNet performs strongly in semantic segmentation. Abstract: Semantic segmentation relying solely on RGB data often struggles in challenging conditions such as low illumination and obscured views, limiting its reliability in critical applications like autonomous driving. To address this, integrating additional thermal radiation data with RGB images demonstrates enhanced performance and robustness. However, how to effectively reconcile the modality discrepancies and fuse the RGB and thermal features remains a well-known challenge. In this work, we address this challenge from a novel spectral perspective. We observe that the multi-modal features can be categorized into two spectral components: low-frequency features that provide broad scene context, including color variations and smooth areas, and high-frequency features that capture modality-specific details such as edges and textures. Inspired by this, we propose the Spectral-aware Global Fusion Network (SGFNet) to effectively enhance and fuse the multi-modal features by explicitly modeling the interactions between the high-frequency, modality-specific features. Our experimental results demonstrate that SGFNet outperforms the state-of-the-art methods on the MFNet and PST900 datasets.
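
The low-frequency versus high-frequency split described above can be illustrated on a 1D signal with a plain discrete Fourier transform. This is a simplified sketch: the real method operates on 2D feature maps inside a network, and the `cutoff` parameter and function names are assumptions introduced here.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of the reconstruction."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_bands(x, cutoff):
    """Split a signal into low- and high-frequency parts that sum back to x."""
    n = len(x)
    spectrum = dft(x)
    # Bins k and n - k carry the same absolute frequency, so keep both.
    low_spectrum = [spectrum[k] if min(k, n - k) <= cutoff else 0 for k in range(n)]
    low = idft(low_spectrum)
    high = [a - b for a, b in zip(x, low)]
    return low, high
```

A constant signal (smooth area) lands entirely in the low band, while a rapidly alternating signal (edge-like texture) lands entirely in the high band.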

[79] Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification

Conghao Xiong,Zhengrui Guo,Zhe Xu,Yifei Zhang,Raymond Kai-Yu Tong,Si Yong Yeo,Hao Chen,Joseph J. Y. Sung,Irwin King

Main category: cs.CV

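TL;DR: Proposes a Squeeze-and-Recalibrate (SR) block as a replacement for linear layers in MIL models, addressing overfitting and feature mischaracterization in few-shot learning while reducing parameters and computational cost.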

Details

Motivation: Address the scarcity of expert annotations in computational pathology, overcome overfitting and feature mischaracterization in few-shot learning, and avoid the complex preprocessing and high computational cost of existing methods. Method: Design the SR block, which consists of a pair of low-rank trainable matrices (squeeze pathway, SP) and a frozen random recalibration matrix, to reduce parameters and preserve geometric structure. Result: SR-MIL models outperform existing methods in experiments with fewer parameters and no architectural changes. Conclusion: The SR block gives MIL models an efficient, high-performing solution, with a theoretical guarantee that the performance of a standard MIL model is a lower bound for its SR-enhanced counterpart. Abstract: Deep learning has advanced computational pathology but expert annotations remain scarce. Few-shot learning mitigates annotation burdens yet suffers from overfitting and discriminative feature mischaracterization. In addition, the current few-shot multiple instance learning (MIL) approaches leverage pretrained vision-language models to alleviate these issues, but at the cost of complex preprocessing and high computational cost. We propose a Squeeze-and-Recalibrate (SR) block, a drop-in replacement for linear layers in MIL models to address these challenges. The SR block comprises two core components: a pair of low-rank trainable matrices (squeeze pathway, SP) that reduces parameter count and imposes a bottleneck to prevent spurious feature learning, and a frozen random recalibration matrix that preserves geometric structure, diversifies feature directions, and redefines the optimization objective for the SP. We provide theoretical guarantees that the SR block can approximate any linear mapping to arbitrary precision, thereby ensuring that the performance of a standard MIL model serves as a lower bound for its SR-enhanced counterpart. Extensive experiments demonstrate that our SR-MIL models consistently outperform prior methods while requiring significantly fewer parameters and no architectural changes.

[80] Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts

Debarshi Brahma,Anuska Roy,Soma Biswas

Main category: cs.CV

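TL;DR: The paper proposes PromptMargin, a new method for fine-tuning vision-language foundation models with a few labeled samples when the target dataset's distribution and classes differ substantially from the pre-training data.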

Details

Motivation: Study whether vision-language foundation models (such as CLIP and ALIGN) can be fine-tuned effectively with a few labeled samples when the target dataset's distribution and classes differ substantially from the pre-training data, and address the resulting overfitting and loss of generalization. Method: Propose PromptMargin, which complements the few training samples with a selective augmentation strategy and uses a Multimodal Margin Regularizer to increase inter-class discrimination. Result: Experiments on 15 target benchmark datasets show that PromptMargin outperforms existing methods under large distribution shifts. Conclusion: PromptMargin adapts effectively to target datasets and improves performance, especially when the distribution gap is large. Abstract: Recently, Vision-Language foundation models like CLIP and ALIGN, which are pre-trained on large-scale data, have shown remarkable zero-shot generalization to diverse datasets with different classes and even domains. In this work, we take a step further and analyze whether these models can be adapted to target datasets having very different distributions and classes compared to what these models have been trained on, using only a few labeled examples from the target dataset. In such scenarios, finetuning large pretrained models is challenging due to problems of overfitting as well as loss of generalization, and has not been well explored in prior literature. Since the pre-training data of such models are unavailable, it is difficult to comprehend the performance on various downstream datasets. First, we try to answer the question: given a target dataset with a few labelled examples, can we estimate whether further fine-tuning can enhance the performance compared to zero-shot evaluation? We do so by analyzing the common vision-language embedding space. Based on the analysis, we propose a novel prompt-tuning method, PromptMargin, for adapting such large-scale VLMs directly on the few target samples. PromptMargin effectively tunes the text as well as visual prompts for this task, and has two main modules: 1) we use a selective augmentation strategy to complement the few training samples in each task; 2) to ensure robust training in the presence of unfamiliar class names, we increase the inter-class margin for improved class discrimination using a novel Multimodal Margin Regularizer. Extensive experiments and analysis across fifteen target benchmark datasets, with varying degrees of distribution shifts from natural images, show the effectiveness of the proposed framework over the existing state-of-the-art approaches applied to this setting. Code: github.com/debarshigit/PromptMargin.

[81] Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

Zihui Cheng,Qiguang Chen,Xiao Xu,Jiaqi Wang,Weiyun Wang,Hao Fei,Yidong Wang,Alex Jinpeng Wang,Zhi Chen,Wanxiang Che,Libo Qin

Main category: cs.CV

TL;DR: MCoT improves LVLM performance through visual thoughts, and the clarity and conciseness of those visual thoughts are key.

Details

Motivation: Understand the mechanism by which MCoT improves LVLM performance, filling a research gap. Method: Define and analyze four forms of visual thought expression and explore their internal mechanisms. Result: Different forms of visual thought yield different levels of MCoT improvement, and visual thoughts act as intermediaries between the input image and deeper reasoning. Conclusion: Visual thoughts offer a new direction for future MCoT research. Abstract: Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing performance and interpretability. Recent MCoT methods fall into two categories: (i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved image-text outputs. Despite advances in both approaches, the mechanisms driving these improvements are not fully understood. To fill this gap, we first reveal that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format, depending only on clarity and conciseness of expression. Furthermore, to explore visual thoughts systematically, we define four distinct forms of visual thought expressions and analyze them comprehensively. Our findings demonstrate that these forms differ in clarity and conciseness, yielding varying levels of MCoT improvement. Additionally, we explore the internal nature of visual thoughts, finding that visual thoughts serve as intermediaries between the input image and reasoning to deeper transformer layers, enabling more advanced visual information transmission. We hope that the visual thoughts can inspire further breakthroughs for future MCoT research.

[82] Detection of Underwater Multi-Targets Based on Self-Supervised Learning and Deformable Path Aggregation Feature Pyramid Network

Chang Liu

Main category: cs.CV

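TL;DR: This paper proposes an efficient algorithm for underwater target detection that improves accuracy through self-supervised learning and improved convolution methods.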

Details

Motivation: Constraints of the underwater environment lead to low detection accuracy, so the model must be improved to handle low contrast, mutual occlusion, and dense target distribution. Method: Pre-train the network with SimSiam-based self-supervised learning, introduce deformable convolution and dilated convolution, and use the EIoU loss function. Result: Experiments show that the proposed detector improves the accuracy of underwater target detection. Conclusion: Improving the model structure and loss function effectively raises underwater target detection performance. Abstract: To overcome the constraints of the underwater environment and improve the accuracy and robustness of underwater target detection models, this paper develops a specialized dataset for underwater target detection and proposes an efficient algorithm for underwater multi-target detection. Self-supervised learning based on the SimSiam structure is employed for the pre-training of the underwater target detection network. To address the problems of low detection accuracy caused by low contrast, mutual occlusion and dense distribution of underwater targets in underwater object detection, a detection model suitable for underwater target detection is proposed by introducing deformable convolution and dilated convolution. The proposed detection model can obtain more effective information by increasing the receptive field. In addition, the regression loss function EIoU is introduced, which improves model performance by separately calculating the width and height losses of the predicted box. Experimental results show that the accuracy of underwater target detection has been improved by the proposed detector.
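
The EIoU regression loss mentioned above adds three penalties to `1 - IoU`: center distance, width difference, and height difference, each normalized by the smallest enclosing box. The sketch below follows the commonly cited EIoU formulation for a single box pair in `(x1, y1, x2, y2)` format; the focal weighting sometimes paired with it is omitted, and the function name and `eps` value are choices made here.

```python
def eiou_loss(pred, target, eps=1e-9):
    """EIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Squared distance between box centers, normalized by the enclosing diagonal.
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Separate width and height penalties.
    width_term = (pw - tw) ** 2 / (cw * cw + eps)
    height_term = (ph - th) ** 2 / (ch * ch + eps)

    return 1.0 - iou + center_term + width_term + height_term
```

For `pred = (0, 0, 2, 2)` and `target = (0, 0, 4, 2)`, the terms are `1 - 0.5`, `0.05`, `0.25`, and `0`, so the loss is about `0.8`; the width term alone signals that the box is too narrow while its height is already correct.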

[83] Clapper: Compact Learning and Video Representation in VLMs

Lingyu Kong,Hongzhi Zhang,Jingyuan Zhang,Jianzhao Huang,Kunze Li,Qi Wang,Fuzheng Zhang

Main category: cs.CV

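TL;DR: Clapper proposes a slow-fast strategy and a TimePerceiver module to improve vision-language models on long video understanding, achieving efficient temporal-spatial encoding and substantial compression of visual tokens.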

Details

Motivation: Existing vision-language models degrade sharply on long video understanding tasks, especially once visual tokens are compressed. Method: Use a slow-fast strategy for video representation and introduce the TimePerceiver module to encode temporal-spatial information efficiently. Result: Achieves 13x compression of visual tokens (averaging 61 tokens/frame) with strong results on several datasets (e.g., 62.0% on VideoMME). Conclusion: Clapper effectively balances short and long video understanding and significantly improves performance. Abstract: Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained details, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation in long video understanding tasks when compressing visual tokens below a quarter of their original visual tokens. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that utilizes a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding within existing VLM backbones. By using our method, we achieve 13x compression of visual tokens per frame (averaging 61 tokens/frame) without compromising QA accuracy. In our experiments, Clapper achieves 62.0% on VideoMME, 69.8% on MLVU, and 67.4% on TempCompass, all with fewer than 6,000 visual tokens per video. The code will be publicly available on the homepage.

[84] Convolutional Long Short-Term Memory Neural Networks Based Numerical Simulation of Flow Field

Chang Liu

Main category: cs.CV

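TL;DR: Proposes an improved ConvLSTM network that combines residual networks and attention mechanisms for flow field prediction; compared with the standard ConvLSTM model, it extracts more spatiotemporal features with fewer parameters and shorter training time.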

Details

Motivation: Conventional CFD methods depend on mathematical models and numerical methods, which limits convergence and accuracy; deep learning offers a new approach to flow field analysis. Method: Run numerical simulations with dynamic mesh technology and UDFs to build a flow field dataset, then improve the ConvLSTM model by adding residual networks and attention mechanisms. Result: The improved model extracts spatiotemporal features better than the standard ConvLSTM while having fewer parameters and training faster. Conclusion: The improved ConvLSTM model provides an efficient and accurate deep learning solution for flow field prediction. Abstract: Computational Fluid Dynamics (CFD) is the main approach to analyzing flow fields. However, the convergence and accuracy depend largely on mathematical models of flow, numerical methods, and time consumption. Deep learning-based analysis of flow fields provides an alternative. For the task of flow field prediction, an improved Convolutional Long Short-Term Memory (ConvLSTM) Neural Network is proposed as the baseline network in consideration of the temporal and spatial characteristics of flow fields. Combining dynamic mesh technology and User-Defined Functions (UDFs), numerical simulations of flow around a circular cylinder were conducted. Flow field snapshots were used to sample data from the cylinder's wake region at different time instants, constructing a flow field dataset with sufficient volume and rich flow state variations. Residual networks and attention mechanisms are combined with the standard ConvLSTM model. Compared with the standard ConvLSTM model, the results demonstrate that the improved ConvLSTM model can extract more temporal and spatial features while having fewer parameters and shorter training time.

[85] seg_3D_by_PC2D: Multi-View Projection for Domain Generalization and Adaptation in 3D Semantic Segmentation

Andrew Caunes,Thierry Chateau,Vincent Fremont

Main category: cs.CV

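TL;DR: Proposes a multi-view projection framework for domain generalization and unsupervised domain adaptation in 3D semantic segmentation: a 2D model is trained on a synthetic 2D dataset, and 3D labels are produced at inference through occlusion-aware voting.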

Details

Motivation: Address the domain shift that 3D semantic segmentation models suffer when deployed across different datasets. Method: Generate a synthetic 2D dataset (PC2D) through multi-view projection, train a 2D segmentation model on it, and at inference back-project the results to 3D with occlusion-aware voting. Result: On nuScenes and SemanticKITTI, the method reaches state of the art in UDA and close to state of the art in DG, with particularly large gains on large, static classes. Conclusion: The framework is modular and flexible and performs strongly in domain adaptation and generalization; code and tools will be open-sourced. Abstract: 3D semantic segmentation plays a pivotal role in autonomous driving and road infrastructure analysis, yet state-of-the-art 3D models are prone to severe domain shift when deployed across different datasets. We propose a novel multi-view projection framework that excels in both domain generalization (DG) and unsupervised domain adaptation (UDA). Our approach first aligns LiDAR scans into coherent 3D scenes and renders them from multiple virtual camera poses to create a large-scale synthetic 2D dataset (PC2D). We then use it to train a 2D segmentation model in-domain. During inference, the model processes hundreds of views per scene; the resulting logits are back-projected to 3D with an occlusion-aware voting scheme to generate final point-wise labels. Our framework is modular and enables extensive exploration of key design parameters, such as view generation optimization (VGO), visualization modality optimization (MODO), and 2D model choice. We evaluate on the nuScenes and SemanticKITTI datasets under both the DG and UDA settings. We achieve state-of-the-art results in UDA and close to state-of-the-art in DG, with particularly large gains on large, static classes. Our code and dataset generation tools will be publicly available at https://github.com/andrewcaunes/ia4markings

[86] TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving

Hossein Hassani,Soodeh Nikan,Abdallah Shami

Main category: cs.CV

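TL;DR: TinyDrive is a lightweight vision-language model for multi-view visual question answering (VQA) in autonomous driving that achieves efficient performance through a multiscale vision encoder and a dual-level prioritization mechanism.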

Details

Motivation: Vision-language models (VLMs) for autonomous driving demand substantial computational resources and are hard to deploy on resource-constrained vehicles, so a lightweight solution is needed. Method: TinyDrive uses a multiscale vision encoder and a dual-level prioritization mechanism (token-level routing and sequence-level scoring) to optimize resource use and performance. Result: On a custom VQA dataset and the public DriveLM benchmark, TinyDrive reaches state-of-the-art language understanding performance, with relative improvements of 11.1% in BLEU-4 and 35.4% in METEOR. Conclusion: TinyDrive shows that multi-view VQA can be done efficiently in resource-constrained settings while significantly improving performance. Abstract: Vision Language Models (VLMs) employed for visual question-answering (VQA) in autonomous driving often require substantial computational resources that pose a challenge for their deployment in resource-constrained vehicles. To address this challenge, we introduce TinyDrive, a lightweight yet effective VLM for multi-view VQA in driving scenarios. Our model comprises two key components including a multiscale vision encoder and a dual-level prioritization mechanism for tokens and sequences. The multiscale encoder facilitates the processing of multi-view images at diverse resolutions through scale injection and cross-scale gating to generate enhanced visual representations. At the token level, we design a token routing mechanism that dynamically selects and processes the most informative tokens based on learned importance scores. At the sequence level, we propose integrating normalized loss, uncertainty estimates, and a diversity metric to formulate sequence scores that rank and preserve samples within a sequence priority buffer. Samples with higher scores are more frequently selected for training. TinyDrive is first evaluated on our custom-curated VQA dataset, and it is subsequently tested on the public DriveLM benchmark, where it achieves state-of-the-art language understanding performance. Notably, it achieves relative improvements of 11.1% and 35.4% in BLEU-4 and METEOR scores, respectively, despite having a significantly smaller parameter count.
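
Token-level routing, as described above, keeps only the most informative tokens according to importance scores. A minimal sketch with hard top-k selection might look like this. In the paper the scores are learned, whereas here they are passed in directly, and `keep_ratio` and the hard top-k rule are assumptions for illustration.

```python
def route_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens while preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore positional order for downstream layers
    return [tokens[i] for i in kept], kept
```

Returning the kept indices alongside the tokens lets later layers recover positional information for the tokens that survive routing.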

[87] Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models

Xin Huang,Ruibin Li,Tong Jia,Wei Zheng,Ya Wang

Main category: cs.CV

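TL;DR: The paper proposes AHNPL, a method that improves vision-language models on complex compositional reasoning tasks by generating negative samples in the visual domain and dynamically adjusting the contrastive loss.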

Details

Motivation: Existing methods rely mainly on text-based negatives, neglect image-based negatives, and ignore differences in sample difficulty, which limits model performance. Method: AHNPL translates text-based hard negatives into visual-domain negatives and introduces a multimodal hard negative loss and a dynamic margin loss. Result: Experiments on three public datasets show that AHNPL significantly improves model performance. Conclusion: By improving negative sample generation and loss design, AHNPL addresses the shortcomings of existing methods and improves performance on complex tasks. Abstract: Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
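
A dynamic margin loss can be pictured as a hinge loss whose margin grows with sample difficulty. The sketch below is one hypothetical form: the `difficulty` input in the range 0 to 1, the linear scaling rule, and the `base_margin` value are all assumptions, since the abstract does not give the actual formula.

```python
def dynamic_margin_loss(sim_pos, sim_neg, difficulty, base_margin=0.2):
    """Hinge loss whose margin scales with an externally supplied difficulty in [0, 1]."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    margin = base_margin * (1.0 + difficulty)  # assumed rule: harder pairs get a wider margin
    return max(0.0, margin + sim_neg - sim_pos)
```

With a positive similarity of 0.8 and a negative similarity of 0.7, an easy pair (difficulty 0) gives a loss near 0.1, while a hard pair (difficulty 1) gives a loss near 0.3, so harder pairs contribute a larger training signal under this assumed rule.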

[88] UWSAM: Segment Anything Model Guided Underwater Instance Segmentation and A Large-scale Benchmark Dataset

Hua Li,Shijie Lian,Zhiyuan Li,Runmin Cong,Sam Kwong

Main category: cs.CV

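TL;DR: The paper proposes UWSAM, an efficient model for underwater instance segmentation, and releases the large-scale UIIS10K dataset; knowledge distillation and automatic prompt generation yield significant performance gains.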

Details

Motivation: Lacking underwater domain expertise, existing models such as SAM perform poorly on underwater instance segmentation and have high computational requirements. Method: The paper introduces the UIIS10K dataset and designs UWSAM, which combines Mask GAT-based knowledge distillation with an End-to-end Underwater Prompt Generator (EUPG). Result: Experiments show that UWSAM significantly outperforms existing methods on multiple underwater instance datasets. Conclusion: Through efficient knowledge distillation and automatic prompt generation, UWSAM addresses the challenges of underwater instance segmentation and shows potential for practical use. Abstract: With recent breakthroughs in large-scale modeling, the Segment Anything Model (SAM) has demonstrated significant potential in a variety of visual applications. However, due to the lack of underwater domain expertise, SAM and its variants face performance limitations in end-to-end underwater instance segmentation tasks, while their higher computational requirements further hinder their application in underwater scenarios. To address this challenge, we propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. Then, we introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. UWSAM efficiently distills knowledge from the SAM ViT-Huge image encoder into the smaller ViT-Small image encoder via the Mask GAT-based Underwater Knowledge Distillation (MG-UKD) method for effective visual representation learning. Furthermore, we design an End-to-end Underwater Prompt Generator (EUPG) for UWSAM, which automatically generates underwater prompts instead of explicitly providing foreground points or boxes as prompts, thus enabling the network to locate underwater instances accurately for efficient segmentation. Comprehensive experimental results show that our model is effective, achieving significant performance improvements over state-of-the-art methods on multiple underwater instance datasets. Datasets and code are available at https://github.com/LiamLian0727/UIIS10K.

[89] VP Lab: a PEFT-Enabled Visual Prompting Laboratory for Semantic Segmentation

Niccolo Avogaro,Thomas Frick,Yagmur G. Cinar,Daniel Caraballo,Cezary Skura,Filip M. Janicki,Piotr Kluska,Brown Ebouky,Nicola Farronato,Florian Scheidegger,Cristiano Malossi,Konrad Schindler,Andrea Bartezzaghi,Roy Assaf,Mattia Rigotti

Main category: cs.CV

TL;DR: VP Lab uses E-PEFT to improve visual prompting for semantic segmentation in specialized domains, substantially raising mIoU with only 5 validated images.

Details

Motivation: Large-scale pretrained vision models underperform in specialized domains and need to adapt to domain-specific visual features. Method: The paper proposes the VP Lab framework, which integrates E-PEFT to adapt visual prompting in a parameter- and data-efficient way. Result: The approach surpasses existing parameter-efficient fine-tuning methods for the Segment Anything Model and raises mIoU by 50%. Conclusion: VP Lab offers a new paradigm for fast, efficient, and interactive model deployment. Abstract: Large-scale pretrained vision backbones have transformed computer vision by providing powerful feature extractors that enable various downstream tasks, including training-free approaches like visual prompting for semantic segmentation. Despite their success in generic scenarios, these models often fall short when applied to specialized technical domains where the visual features differ significantly from their training distribution. To bridge this gap, we introduce VP Lab, a comprehensive iterative framework that enhances visual prompting for robust segmentation model development. At the core of VP Lab lies E-PEFT, a novel ensemble of parameter-efficient fine-tuning techniques specifically designed to adapt our visual prompting pipeline to specific domains in a manner that is both parameter- and data-efficient. Our approach not only surpasses the state-of-the-art in parameter-efficient fine-tuning for the Segment Anything Model (SAM), but also facilitates an interactive, near-real-time loop, allowing users to observe progressively improving results as they experiment within the framework. By integrating E-PEFT with visual prompting, we demonstrate a remarkable 50\% increase in semantic segmentation mIoU performance across various technical datasets using only 5 validated images, establishing a new paradigm for fast, efficient, and interactive model deployment in new, challenging domains. This work comes in the form of a demonstration.
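
Illustrative sketch: the headline number for this entry is reported in mIoU (mean intersection over union), the standard semantic segmentation metric. The snippet below shows the usual per-class definition on integer label maps; it is a generic reference implementation, not code from VP Lab, and the choice to skip classes absent from both maps is one common convention among several. The abstract does not say whether the 50% gain is relative or absolute.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: leave it out of the mean
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```

For example, a 2x2 prediction `[[0, 0], [1, 1]]` against ground truth `[[0, 1], [1, 1]]` gives per-class IoUs of 1/2 and 2/3, so the mean is 7/12.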

[90] LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

Ruilin Yao,Bo Zhang,Jirui Huang,Xinwei Long,Yifang Zhang,Tianyu Zou,Yufei Wu,Shichao Su,Yifan Xu,Wenxi Zeng,Zhaoyu Yang,Guoyou Li,Shilan Zhang,Zichan Li,Yaxiong Chen,Shengwu Xiong,Peng Xu,Jiajun Zhang,Bowen Zhou,David Clifton,Luc Van Gool

Main category: cs.CV

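TL;DR: The paper presents Lens, a multi-level benchmark for evaluating multimodal large language models (MLLMs) from perception to reasoning, with 3.4K images and 60K+ questions covering 8 tasks and 12 scenarios.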

Details

Motivation: Existing benchmarks do not account for how lower-level perception supports higher-order reasoning in MLLMs, and samples for different tasks do not share a data distribution. Method: The authors build the Lens benchmark with three progressive task tiers (perception, understanding, reasoning); every image carries annotations for all tasks, and the images come from social media. Result: More than 15 frontier MLLMs, including Qwen2.5-VL-72B, are evaluated, and none exceeds 60% accuracy on the reasoning tasks. Conclusion: Lens offers a more comprehensive evaluation framework for MLLMs and exposes the limits of current models on complex reasoning. Abstract: Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. The existing benchmarks are usually constructed in a task-oriented manner without guaranteeing that different task samples come from the same data distribution, thus they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers, i.e., perception, understanding, and reasoning. One feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning. In addition, our images are manually collected from social media, of which 53% were published later than Jan. 2025. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models QVQ-72B-preview and Kimi-VL. These models were released later than Dec. 2024, and none of them achieves an accuracy greater than 60% in the reasoning tasks. Project page: https://github.com/Lens4MLLMs/lens. ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/

[91] SNAP: A Benchmark for Testing the Effects of Capture Conditions on Fundamental Vision Tasks

Iuliia Kotseruba,John K. Tsotsos

Main category: cs.CV

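TL;DR: The paper studies how image capture conditions (such as camera parameters and lighting) affect deep learning models on image classification, object detection, and visual question answering, introduces the SNAP benchmark, and reveals dataset bias and model sensitivity to small changes.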

Details

Motivation: Prior work mostly analyzes already-captured images and overlooks the effects of the image formation pipeline and environment, so the impact of capture conditions on model performance needs to be analyzed. Method: The authors assess capture bias in common vision datasets, create the new SNAP benchmark from images taken under controlled lighting and camera settings, and evaluate a range of deep learning models. Result: Vision datasets are significantly biased; models do not reach human accuracy even on well-exposed images and are sensitive to small changes in camera settings. Conclusion: The study highlights the importance of capture conditions for model performance and provides a new benchmark and experimental results to guide future research. Abstract: Generalization of deep-learning-based (DL) computer vision algorithms to various image perturbations is hard to establish and remains an active area of research. The majority of past analyses focused on the images already captured, whereas effects of the image formation pipeline and environment are less studied. In this paper, we address this issue by analyzing the impact of capture conditions, such as camera parameters and lighting, on DL model performance on 3 vision tasks -- image classification, object detection, and visual question answering (VQA). To this end, we assess capture bias in common vision datasets and create a new benchmark, SNAP (for $\textbf{S}$hutter speed, ISO se$\textbf{N}$sitivity, and $\textbf{AP}$erture), consisting of images of objects taken under controlled lighting conditions and with densely sampled camera settings. We then evaluate a large number of DL vision models and show the effects of capture conditions on each selected vision task. Lastly, we conduct an experiment to establish a human baseline for the VQA task. Our results show that computer vision datasets are significantly biased, the models trained on this data do not reach human accuracy even on the well-exposed images, and are susceptible to both major exposure changes and minute variations of camera settings. Code and data can be found at https://github.com/ykotseruba/SNAP

[92] Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking

Pujun Xue,Junyi Ge,Xiaotong Jiang,Siyang Song,Zijian Wu,Yupeng Huo,Weicheng Xie,Linlin Shen,Xiaoqin Zhou,Xiaofeng Liu,Min Gu

Main category: cs.CV

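TL;DR: This study presents OMNI, a comprehensive dental image dataset for malocclusion issues, intended to advance research on automated diagnosis in dental image analysis.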

Details

Motivation: Malocclusion is a major challenge in orthodontics, but the lack of large-scale, accurately labeled datasets limits the development of automated diagnosis. Method: The OMNI dataset contains 4166 multi-view images annotated by professional dentists and is validated with several deep learning methods (CNN, Transformer, GNN). Result: Experiments show that OMNI can facilitate research on automated malocclusion diagnosis and provides a new benchmark for the field. Conclusion: OMNI is an important resource for dental image analysis and advances research on automated malocclusion diagnosis. Abstract: Malocclusion is a major challenge in orthodontics, and its complex presentation and diverse clinical manifestations make accurate localization and diagnosis particularly important. Currently, one of the major shortcomings facing the field of dental image analysis is the lack of large-scale, accurately labeled datasets dedicated to malocclusion issues, which limits the development of automated diagnostics in the field of dentistry and leads to a lack of diagnostic accuracy and efficiency in clinical practice. Therefore, in this study, we propose the Oral and Maxillofacial Natural Images (OMNI) dataset, a novel and comprehensive dental image dataset aimed at advancing the study of analyzing dental images for issues of malocclusion. Specifically, the dataset contains 4166 multi-view images collected from 384 participants and annotated by professional dentists. In addition, we performed a comprehensive validation of the created OMNI dataset, including three CNN-based methods, two Transformer-based methods, and one GNN-based method, and conducted automated diagnostic experiments for malocclusion issues. The experimental results show that the OMNI dataset can facilitate the automated diagnosis research of malocclusion issues and provide a new benchmark for the research in this field. Our OMNI dataset and baseline code are publicly available at https://github.com/RoundFaceJ/OMNI.

[93] FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models

Zhen Sun,Ziyi Zhang,Zeren Luo,Zeyang Sha,Tianshuo Cong,Zheng Li,Shiwen Cui,Weiqiang Wang,Jiaheng Wei,Xinlei He,Qi Li,Qian Wang

Main category: cs.CV

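TL;DR: The paper proposes a vision-language-model (VLM) approach to detecting localized image edits and builds FragFake, the first dedicated dataset, addressing the localization and dataset-quality shortcomings of existing methods.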

Details

Motivation: Modern image editing is highly realistic, yet existing detectors cannot precisely localize edited regions, and high-quality datasets are lacking. Method: The authors develop an automated data-generation pipeline to create the FragFake dataset and apply VLMs for the first time to edited image classification and edited region localization. Result: Fine-tuned VLMs perform well across datasets and significantly outperform pretrained models. Conclusion: The work is the first to reformulate localized image edit detection as a vision-language understanding task, laying a foundation for research on multimodal content authenticity. Abstract: Fine-grained detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) Binary classifiers yield only a global real-or-fake label without providing localization; (2) Traditional computer vision methods often rely on costly pixel-level annotations; and (3) No large-scale, high-quality dataset exists for modern image-editing detection techniques. To address these gaps, we develop an automated data-generation pipeline to create FragFake, the first dedicated benchmark dataset for edited image detection, which includes high-quality images from diverse editing models and a wide variety of edited objects. Based on FragFake, we utilize Vision Language Models (VLMs) for the first time in the task of edited image classification and edited region localization. Experimental results show that fine-tuned VLMs achieve higher average Object Precision across all datasets, significantly outperforming pretrained models. We further conduct ablation and transferability analyses to evaluate the detectors across various configurations and editing scenarios. To the best of our knowledge, this work is the first to reformulate localized image edit detection as a vision-language understanding task, establishing a new paradigm for the field. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research endeavors in the domain of multimodal content authenticity.

[94] The Devil is in Fine-tuning and Long-tailed Problems: A New Benchmark for Scene Text Detection

Tianjiao Cao,Jiahao Lyu,Weichao Zeng,Weimin Mu,Yu Zhou

Main category: cs.CV

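TL;DR: The paper identifies two factors behind the gap between academic-benchmark and real-world scene text detection performance, the fine-tuning gap and the long-tailed distribution of text, and proposes a Joint-Dataset Learning protocol and a Long-Tailed Benchmark (LTB) to improve generalization.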

Details

Motivation: Existing scene text detectors excel on academic benchmarks but underperform in real applications; the paper aims to explain and close this gap. Method: Through experimental analysis the authors identify the fine-tuning gap and the long-tailed distribution problem, propose the Joint-Dataset Learning (JDL) protocol and the Long-Tailed Benchmark (LTB), and introduce the self-supervised method MAEDet as a baseline. Result: The paper delivers the LTB benchmark and the MAEDet method, providing a new evaluation standard and baseline for long-tailed scene text detection. Conclusion: With the JDL protocol and LTB, the paper offers a practical route to better generalization of scene text detection in real applications. Abstract: Scene text detection has seen the emergence of high-performing methods that excel on academic benchmarks. However, these detectors often fail to replicate such success in real-world scenarios. We uncover two key factors contributing to this discrepancy through extensive experiments. First, a \textit{Fine-tuning Gap}, where models leverage the \textit{Dataset-Specific Optimization} (DSO) paradigm for one domain at the cost of reduced effectiveness in others, leads to inflated performances on academic benchmarks. Second, the suboptimal performance in practical settings is primarily attributed to the long-tailed distribution of texts, where detectors struggle with rare and complex categories such as artistic or overlapped text. Given that the DSO paradigm might undermine the generalization ability of models, we advocate for a \textit{Joint-Dataset Learning} (JDL) protocol to alleviate the Fine-tuning Gap. Additionally, an error analysis is conducted to identify three major categories and 13 subcategories of challenges in long-tailed scene text, upon which we propose a Long-Tailed Benchmark (LTB). LTB facilitates a comprehensive evaluation of the ability to handle a diverse range of long-tailed challenges. We further introduce MAEDet, a self-supervised learning-based method, as a strong baseline for LTB. The code is available at https://github.com/pd162/LTB.

[95] Enhancing Monte Carlo Dropout Performance for Uncertainty Quantification

Hamzeh Asgharnezhad,Afshar Shamsi,Roohallah Alizadehsani,Arash Mohammadi,Hamid Alinejad-Rokny

Main category: cs.CV

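TL;DR: The paper proposes an improved Monte Carlo Dropout (MCD) method that integrates the Grey Wolf Optimizer (GWO), Bayesian Optimization (BO), and Particle Swarm Optimization (PSO) with an uncertainty-aware loss function to make uncertainty quantification more reliable.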

Details

Motivation: In high-stakes fields such as medical diagnosis and autonomous systems, quantifying the uncertainty of deep neural network outputs is essential, yet conventional MCD struggles to provide well-calibrated uncertainty estimates. Method: The proposed framework improves MCD by combining GWO, BO, and PSO with an uncertainty-aware loss function. Experiments use the DenseNet121, ResNet50, and VGG16 backbones and are validated on several datasets. Result: The improved algorithm outperforms the MCD baseline by 2-3% on average in both conventional accuracy and uncertainty accuracy, with significantly better calibration. Conclusion: The method has the potential to make deep learning models more trustworthy in safety-critical applications. Abstract: Knowing the uncertainty associated with the output of a deep neural network is of paramount importance in making trustworthy decisions, particularly in high-stakes fields like medical diagnosis and autonomous systems. Monte Carlo Dropout (MCD) is a widely used method for uncertainty quantification, as it can be easily integrated into various deep architectures. However, conventional MCD often struggles with providing well-calibrated uncertainty estimates. To address this, we introduce an innovative framework that enhances MCD by integrating different search solutions, namely Grey Wolf Optimizer (GWO), Bayesian Optimization (BO), and Particle Swarm Optimization (PSO), as well as an uncertainty-aware loss function, thereby improving the reliability of uncertainty quantification. We conduct comprehensive experiments using different backbones, namely DenseNet121, ResNet50, and VGG16, on various datasets, including Cats vs. Dogs, Myocarditis, Wisconsin, and a synthetic dataset (Circles). Our proposed algorithm outperforms the MCD baseline by 2-3% on average in terms of both conventional accuracy and uncertainty accuracy while achieving significantly better calibration. These results highlight the potential of our approach to enhance the trustworthiness of deep learning models in safety-critical applications.
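
Illustrative sketch: the baseline this paper builds on, Monte Carlo Dropout, keeps dropout active at inference and averages many stochastic forward passes; the spread of those passes serves as an uncertainty estimate. The NumPy example below shows only that baseline idea on a toy two-layer network. The network, the weights, and all parameter names are assumptions for illustration, and the GWO/BO/PSO search and the uncertainty-aware loss from the paper are not reproduced here.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, p_drop=0.5, n_passes=100, seed=0):
    """Average softmax outputs over stochastic passes with dropout left on.

    Returns the mean class probabilities and the predictive entropy per input.
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_passes):
        h = np.maximum(0.0, x @ w1)              # ReLU hidden layer
        mask = rng.random(h.shape) >= p_drop     # fresh dropout mask each pass
        h = h * mask / (1.0 - p_drop)            # inverted-dropout scaling
        logits = h @ w2
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs.append(e / e.sum(axis=-1, keepdims=True))
    mean = np.stack(probs).mean(axis=0)          # shape: (batch, classes)
    entropy = -(mean * np.log(mean + 1e-12)).sum(axis=-1)
    return mean, entropy
```

Higher predictive entropy marks inputs on which the stochastic passes disagree; how well that signal is calibrated is the property the paper sets out to improve.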

[96] Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning

Zhe Xu,Cheng Jin,Yihui Wang,Ziyi Liu,Hao Chen

Main category: cs.CV

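TL;DR: The paper proposes a bilateral reinforcement learning framework that strengthens reasoning and dynamically allocates computation, significantly improving both the performance and efficiency of pathology image understanding.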

Details

Motivation: Existing methods have limited reasoning ability in complex diagnostic scenarios, and the large size of pathology images imposes a heavy computational burden that restricts practical use. Method: The framework uses bilateral reinforcement learning: one branch strengthens reasoning, and the other dynamically allocates computational resources. Result: Across various pathology tasks, performance improves by 41.7 points on average (absolute) and inference cost drops by 70.3%. Conclusion: The framework improves reasoning accuracy while substantially reducing computation, showing potential for practical use. Abstract: Multimodal pathological image understanding has garnered widespread interest due to its potential to improve diagnostic accuracy and enable personalized treatment through integrated visual and textual data. However, existing methods exhibit limited reasoning capabilities, which hamper their ability to handle complex diagnostic scenarios. Additionally, the enormous size of pathological images leads to severe computational burdens, further restricting their practical deployment. To address these limitations, we introduce a novel bilateral reinforcement learning framework comprising two synergistic branches. One reinforcement branch enhances the reasoning capability by enabling the model to learn task-specific decision processes, i.e., pathology rationales, directly from labels without explicit reasoning supervision. The other branch dynamically allocates a tailored number of tokens to different images based on both their visual content and task context, thereby optimizing computational efficiency. We apply our method to various pathological tasks such as visual question answering, cancer subtyping, and lesion detection. Extensive experiments show an average +41.7 absolute performance improvement with 70.3% lower inference costs over the base models, achieving both reasoning accuracy and computational efficiency.

[97] HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning

Xiaodong Mei,Sheng Wang,Jie Cheng,Yingbing Chen,Dan Xu

Main category: cs.CV

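TL;DR: HAMF is a motion forecasting framework that jointly learns scene context encoding and future motion representations, addressing the information degradation that occurs during scene feature encoding in existing methods.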

Details

Motivation: Existing methods suffer from information degradation during scene feature encoding, which hurts motion forecasting accuracy. Method: HAMF embeds observed agent states and map information into 1D token sequences, uses a unified attention-based encoder that combines self-attention and cross-attention, and applies a Mamba module in the decoding stage. Result: HAMF achieves state-of-the-art motion forecasting performance on the Argoverse 2 benchmark. Conclusion: With a lightweight architecture, HAMF improves the accuracy and diversity of motion forecasts. Abstract: Motion forecasting represents a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents' future trajectories. While existing approaches predict future motion states with the extracted scene context feature from historical agent trajectories and road layouts, they suffer from information degradation during the scene feature encoding. To address the limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations with the scene context encoding jointly, to coherently combine the scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features as a set of learnable tokens. Then we design a unified Attention-based encoder, which synergistically combines self-attention and cross-attention mechanisms to model the scene context information and aggregate future motion features jointly. Complementing the encoder, we implement the Mamba module in the decoding stage to further preserve the consistency and correlations among the learned future motion representations, to generate the accurate and diverse final trajectories. Extensive experiments on the Argoverse 2 benchmark demonstrate that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with a simple and lightweight architecture.

[98] RUSplatting: Robust 3D Gaussian Splatting for Sparse-View Underwater Scene Reconstruction

Zhuodong Jiang,Haoran Wang,Guoxi Huang,Brett Seymour,Nantheera Anantrasirichai

Main category: cs.CV

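TL;DR: The paper proposes a Gaussian Splatting-based framework that uses decoupled learning and a frame interpolation strategy to improve the visual quality and geometric accuracy of underwater scene reconstruction.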

Details

Motivation: Light absorption, scattering, and limited visibility in underwater environments make high-quality scene reconstruction difficult. Method: The framework learns the RGB channels in a decoupled way, guided by a physical attenuation model; it adds a frame interpolation strategy with an adaptive weighting scheme and a new loss function that reduces noise while preserving edges. Result: PSNR improves by up to 1.90dB, with better visual quality and robustness than existing methods. Conclusion: The framework opens promising directions for underwater visual analytics and marine robotics. Abstract: Reconstructing high-fidelity underwater scenes remains a challenging task due to light absorption, scattering, and limited visibility inherent in aquatic environments. This paper presents an enhanced Gaussian Splatting-based framework that improves both the visual quality and geometric accuracy of deep underwater rendering. We propose decoupled learning for RGB channels, guided by the physics of underwater attenuation, to enable more accurate colour restoration. To address sparse-view limitations and improve view consistency, we introduce a frame interpolation strategy with a novel adaptive weighting scheme. Additionally, we introduce a new loss function aimed at reducing noise while preserving edges, which is essential for deep-sea content. We also release a newly collected dataset, Submerged3D, captured specifically in deep-sea environments. Experimental results demonstrate that our framework consistently outperforms state-of-the-art methods with PSNR gains up to 1.90dB, delivering superior perceptual quality and robustness, and offering promising directions for marine robotics and underwater visual analytics.
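
Illustrative sketch: to put the reported PSNR gain in context, the snippet below gives the standard PSNR definition and converts a dB gain into the corresponding reduction in mean squared error. It is generic metric code, not part of the paper's pipeline; the function names and the assumption that pixel values lie in [0, 1] are choices made for this example.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    reference = np.asarray(reference, dtype=float)
    rendered = np.asarray(rendered, dtype=float)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)
```

Because PSNR is logarithmic in MSE, a gain of 1.90 dB corresponds to an MSE ratio of about 0.65, that is, roughly 35% lower mean squared error at the same peak value.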

[99] Exploring The Visual Feature Space for Multimodal Neural Decoding

Weihao Xia,Cengiz Oztireli

Main category: cs.CV

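TL;DR: The paper proposes a zero-shot multimodal brain decoding method that uses pretrained visual feature spaces and multimodal large language models (MLLMs) to decode brain signals, addressing the coarse-grained interpretations of existing studies.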

Details

Motivation: Existing brain-signal decoding studies usually lack detail (objects, locations, attributes, and their relationships), which makes visual decoding imprecise; the paper aims to address this. Method: The authors analyze choices of pretrained visual feature spaces and introduce a zero-shot multimodal brain decoding method that interacts with MLLMs to decode at multiple levels of granularity. They also propose the MG-BrainDub benchmark, which includes detailed description and salient question-answering tasks. Result: The method improves the precision of neural decoding and supports more accurate neural decoding applications. Conclusion: By combining multimodal AI with MLLMs, the paper achieves finer-grained brain-signal decoding and contributes new tools and a benchmark to the field. Abstract: The intricacy of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularity. To assess a model's ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark includes two key tasks: detailed descriptions and salient question-answering, with metrics highlighting key visual elements like objects, attributes, and relationships. Our approach enhances neural decoding precision and supports more accurate neuro-decoding applications. Code will be available at https://github.com/weihaox/VINDEX.

[100] Constructing a 3D Town from a Single Image

Kaizhi Zheng,Ruijian Zhang,Jing Gu,Jie Yang,Xin Eric Wang

Main category: cs.CV

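TL;DR: 3DTown is a training-free framework that generates high-quality 3D scenes from a single top-down image, addressing the geometric inconsistency and layout problems of existing methods.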

Details

Motivation: Conventional 3D scene acquisition is costly and time-consuming, and existing generative models produce inconsistent geometry and low-quality meshes at the scene level. Method: The framework uses region-based generation and spatial-aware 3D inpainting: it decomposes the input image into overlapping regions, generates a 3D object for each, and then applies an inpainting process that preserves structural continuity. Result: 3DTown outperforms existing methods (such as Trellis, Hunyuan3D-2, and TripoSG) in geometry quality, spatial coherence, and texture fidelity. Conclusion: 3DTown shows that high-quality 3D scenes can be generated from a single image without training, with potential for practical use. Abstract: Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. Therefore, a lightweight alternative, generating complex 3D scenes from a single top-down image, plays an essential role in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, their extension to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce 3DTown, a training-free framework designed to synthesize realistic and coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation to improve image-to-3D alignment and resolution, and spatial-aware 3D inpainting to ensure global scene coherence and high-quality geometry generation. Specifically, we decompose the input image into overlapping regions and generate each using a pretrained 3D object generator, followed by a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that 3DTown outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, and TripoSG, in terms of geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality 3D town generation is achievable from a single image using a principled, training-free approach.
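
Illustrative sketch: the abstract says the input image is decomposed into overlapping regions before each region is passed to a pretrained 3D generator. The snippet below shows one simple way to compute such overlapping tiles on a pixel grid. The tile size, the overlap, and the border handling are assumptions for this example, and the paper's actual region layout may differ.

```python
def overlapping_regions(height, width, tile, overlap):
    """Return (top, left, bottom, right) boxes that cover the image with overlap."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    stride = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]  # a single tile already covers this axis
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # last tile sits flush with the border
        return positions

    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in starts(height)
        for left in starts(width)
    ]
```

Neighbouring boxes share a band of at least `overlap` pixels, which gives a later inpainting or blending step shared context for keeping geometry continuous across region boundaries.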

[101] IA-T2I: Internet-Augmented Text-to-Image Generation

Chuanhao Li,Jianwen Sun,Yukang Feng,Mingliang Zhai,Yifan Chang,Kaipeng Zhang

Main category: cs.CV

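TL;DR: The paper proposes an Internet-Augmented text-to-image generation framework (IA-T2I) that supplies reference images to resolve uncertain knowledge in text prompts.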

Details

Motivation: Existing text-to-image models perform poorly when the knowledge implied by the prompt is uncertain; for example, they cannot produce accurate images of future events. Method: The framework comprises an active retrieval module, a hierarchical image selection module, and a self-reflection mechanism that together improve the handling of uncertain knowledge. Result: In human evaluation, the framework outperforms GPT-4o by about 30%. Conclusion: IA-T2I effectively addresses knowledge uncertainty in text-to-image generation and improves generation quality. Abstract: Current text-to-image (T2I) generation models achieve promising results, but they fail in scenarios where the knowledge implied in the text prompt is uncertain. For example, a T2I model released in February would struggle to generate a suitable poster for a movie premiering in April, because the character designs and styles are uncertain to the model. To solve this problem, we propose an Internet-Augmented text-to-image generation (IA-T2I) framework to make T2I models clear about such uncertain knowledge by providing them with reference images. Specifically, an active retrieval module is designed to determine whether a reference image is needed based on the given text prompt; a hierarchical image selection module is introduced to find the most suitable image returned by an image search engine to enhance the T2I model; a self-reflection mechanism is presented to continuously evaluate and refine the generated image to ensure faithful alignment with the text prompt. To evaluate the proposed framework's performance, we collect a dataset named Img-Ref-T2I, where text prompts include three types of uncertain knowledge: (1) known but rare, (2) unknown, and (3) ambiguous. Moreover, we carefully craft a complex prompt to guide GPT-4o in making preference evaluations, which has been shown to have an evaluation accuracy similar to that of human preference evaluation. Experimental results demonstrate the effectiveness of our framework, outperforming GPT-4o by about 30% in human evaluation.

[102] VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL

Fengyuan Dai,Zifeng Zhuang,Yufei Huang,Siteng Huang,Bangyan Liao,Donglin Wang,Fajie Yuan

Main category: cs.CV

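TL;DR: VARD is a reinforcement-learning-based fine-tuning method for diffusion models that learns a value function and uses KL regularization to provide dense supervision, addressing the stability and efficiency limitations of existing methods.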

Details

Motivation: Existing reinforcement learning methods for diffusion models struggle to achieve stable and efficient fine-tuning at the same time, and their reliance on sparse rewards leads to suboptimal generation quality. Method: VARD learns a value function that predicts the expected reward from intermediate states and combines it with KL regularization to provide dense supervision. Result: Experiments show that VARD gives better trajectory guidance, improves training efficiency, and works with complex, non-differentiable reward functions. Conclusion: VARD offers a stable and efficient reinforcement learning approach for diffusion models and broadens their range of application. Abstract: Diffusion models have emerged as powerful generative tools across various domains, yet tailoring pre-trained models to exhibit specific desirable properties remains challenging. While reinforcement learning (RL) offers a promising solution, current methods struggle to simultaneously achieve stable, efficient fine-tuning and support non-differentiable rewards. Furthermore, their reliance on sparse rewards provides inadequate supervision during intermediate steps, often resulting in suboptimal generation quality. To address these limitations, dense and differentiable signals are required throughout the diffusion process. Hence, we propose VAlue-based Reinforced Diffusion (VARD): a novel approach that first learns a value function predicting expectation of rewards from intermediate states, and subsequently uses this value function with KL regularization to provide dense supervision throughout the generation process. Our method maintains proximity to the pretrained model while enabling effective and stable training via backpropagation. Experimental results demonstrate that our approach facilitates better trajectory guidance, improves training efficiency and extends the applicability of RL to diffusion models optimized for complex, non-differentiable reward functions.

[103] Interspatial Attention for Efficient 4D Human Video Generation

Ruizhi Shao,Yinghao Xu,Yujun Shen,Ceyuan Yang,Yang Zheng,Changan Chen,Yebin Liu,Gordon Wetzstein

Main category: cs.CV

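TL;DR: The paper proposes interspatial attention (ISA), a new form of cross-attention for diffusion transformer (DiT)-based video generation models, significantly improving the quality and controllability of generated digital human videos.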

Details

Motivation: Existing methods for generating digital human videos suffer from poor quality or limited consistency and identity preservation, so a more effective solution is needed. Method: The paper introduces interspatial attention (ISA), a new type of cross-attention, and combines it with a custom video variational autoencoder to train a latent diffusion model. Result: The model achieves state-of-the-art 4D human video synthesis, with strong motion consistency and identity preservation and precise control over camera and body poses. Conclusion: ISA offers an efficient and controllable approach to digital human video generation with broad application potential. Abstract: Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or emerging video generation models but suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)--based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variational autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation while providing precise control of the camera and body poses. Our code and model are publicly released at https://dsaurus.github.io/isa4d/.

[104] STAR-R1: Spacial TrAnsformation Reasoning by Reinforcing Multimodal LLMs

Zongzhao Li,Zongyang Ma,Mingze Li,Songyou Li,Yu Rong,Tingyang Xu,Ziqi Zhang,Deli Zhao,Wenbing Huang

Main category: cs.CV

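TL;DR: The paper proposes STAR-R1, a framework that combines single-stage reinforcement learning with a fine-grained reward mechanism to significantly improve multimodal large language models on spatial reasoning tasks.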

Details

Motivation: Multimodal large language models lag far behind humans in spatial reasoning, and conventional approaches such as supervised fine-tuning and sparse-reward reinforcement learning work poorly. Method: STAR-R1 combines single-stage reinforcement learning with a fine-grained reward mechanism that rewards partially correct behavior and penalizes ineffective exploration. Result: STAR-R1 achieves the best results on all 11 metrics, outperforms SFT by 23% in cross-view scenarios, and exhibits human-like reasoning behavior. Conclusion: STAR-R1 provides important insights for research on MLLMs and reasoning models; code and data will be released. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects for improving spatial reasoning. Our work provides critical insights in advancing the research of MLLMs and reasoning models. The code, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.
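
Illustrative sketch: the abstract describes a fine-grained reward that gives credit for partial correctness while penalizing excessive enumeration and passive inaction. The toy function below shows how a reward with those three properties could be shaped. The tuple representation of a transformation, the function name, and the penalty weights are all assumptions for this example and are not taken from the paper.

```python
def transformation_reward(predicted, ground_truth, extra_penalty=0.5, empty_penalty=1.0):
    """Score predicted transformations against the ground-truth set.

    Each transformation is a hashable tuple such as
    (object_id, attribute, new_value). Penalty weights are hypothetical.
    """
    predicted = set(predicted)
    ground_truth = set(ground_truth)
    if not predicted:
        return -empty_penalty                 # passive inaction is penalized
    correct = len(predicted & ground_truth)   # partial credit per correct item
    wrong = len(predicted - ground_truth)     # enumerating guesses costs reward
    return correct - extra_penalty * wrong
```

Under these weights, answering one of two ground-truth transformations correctly still earns a reward of 1, while listing every possible colour for an object in the hope of hitting the right one cancels out the credit for the correct guess.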

[105] MMaDA: Multimodal Large Diffusion Language Models

Ling Yang,Ye Tian,Bowen Li,Xinchen Zhang,Ke Shen,Yunhai Tong,Mengdi Wang

Main category: cs.CV

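TL;DR: MMaDA is a new class of multimodal diffusion foundation models that, through a unified architecture, mixed long chain-of-thought fine-tuning, and a unified reinforcement learning algorithm, performs strongly on textual reasoning, multimodal understanding, and text-to-image generation.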

Details

Motivation: To remove the need for modality-specific components through unified architecture and algorithm design, improving performance and generalization on multimodal tasks. Method: Adopts a unified diffusion architecture, a mixed long chain-of-thought fine-tuning strategy, and the UniGRPO reinforcement learning algorithm to process modalities seamlessly. Result: MMaDA-8B surpasses several strong models on textual reasoning, multimodal understanding, and text-to-image generation. Conclusion: MMaDA provides a comprehensive framework for unified diffusion architectures and narrows the gap between pretraining and post-training. Abstract: We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development.
We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA

[106] Leveraging the Powerful Attention of a Pre-trained Diffusion Model for Exemplar-based Image Colorization

Satoshi Kosugi

Main category: cs.CV

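TL;DR: Proposes a fine-tuning-free image colorization method based on a pre-trained diffusion model, using dual attention-guided color transfer and classifier-free colorization guidance to improve colorization quality and fidelity to the reference.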

Details

Motivation: To address inaccurate semantic matching in exemplar-based image colorization by using the self-attention module of a pre-trained diffusion model for more precise semantic alignment. Method: (1) Dual attention-guided color transfer: the self-attention module computes attention maps between the input and reference images to establish semantic correspondences; (2) classifier-free colorization guidance: color-transferred and non-color-transferred outputs are combined to improve colorization quality. Result: On 335 input-reference pairs, the method achieves an FID of 95.27 (image quality) and an SI-FID of 5.51 (fidelity to the reference), outperforming existing methods. Conclusion: By exploiting the self-attention module of a pre-trained diffusion model with dual attention, the method markedly improves semantic matching and color quality. Abstract: Exemplar-based image colorization aims to colorize a grayscale image using a reference color image, ensuring that reference colors are applied to corresponding input regions based on their semantic similarity. To achieve accurate semantic matching between regions, we leverage the self-attention module of a pre-trained diffusion model, which is trained on a large dataset and exhibits powerful attention capabilities. To harness this power, we propose a novel, fine-tuning-free approach based on a pre-trained diffusion model, making two key contributions. First, we introduce dual attention-guided color transfer. We utilize the self-attention module to compute an attention map between the input and reference images, effectively capturing semantic correspondences. The color features from the reference image are then transferred to the semantically matching regions of the input image, guided by this attention map, and finally, the grayscale features are replaced with the corresponding color features. Notably, we utilize dual attention to calculate attention maps separately for the grayscale and color images, achieving more precise semantic alignment. Second, we propose classifier-free colorization guidance, which enhances the transferred colors by combining color-transferred and non-color-transferred outputs. This process improves the quality of colorization. Our experimental results demonstrate that our method outperforms existing techniques in terms of image quality and fidelity to the reference. Specifically, we use 335 input-reference pairs from previous research, achieving an FID of 95.27 (image quality) and an SI-FID of 5.51 (fidelity to the reference).
Our source code is available at https://github.com/satoshi-kosugi/powerful-attention.
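
The two ideas in this abstract can be illustrated with a toy, pure-Python sketch: an attention map between input and reference regions decides which reference colors each input region receives, and a guidance step extrapolates from the non-color-transferred output toward the color-transferred one. The feature vectors, the function names, and the linear guidance formula (borrowed by analogy from classifier-free guidance) are assumptions for illustration; the paper's actual operations inside the diffusion model may differ.

```python
import math


def attention_color_transfer(input_feats, ref_feats, ref_colors):
    """For each input region, softmax-attend over reference regions and
    return the attention-weighted average of the reference colors."""
    scale = math.sqrt(len(input_feats[0]))
    channels = len(ref_colors[0])
    transferred = []
    for query in input_feats:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in ref_feats]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        transferred.append([
            sum(w * color[c] for w, color in zip(weights, ref_colors))
            for c in range(channels)
        ])
    return transferred


def colorization_guidance(color_transferred, non_transferred, scale=2.0):
    """Hypothetical guidance rule: push the output away from the
    non-color-transferred prediction, toward the color-transferred one."""
    return [n + scale * (c - n)
            for c, n in zip(color_transferred, non_transferred)]
```

With `scale=1.0` the guidance step simply returns the color-transferred output; larger values amplify the difference between the two predictions.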

[107] A Taxonomy of Structure from Motion Methods

Federica Arrigoni

Main category: cs.CV

TL;DR: This paper presents a conceptual review of Structure from Motion (SfM) methods, grouping them into three categories and offering a new taxonomic perspective.

Details

Motivation: SfM matters both in theory and in practice, yet existing methods lack a systematic classification; this paper aims to fill that gap. Method: Groups SfM methods into three categories according to which part of the problem (motion or structure) they focus on, and examines the theoretical conditions involved. Result: The taxonomy offers a new perspective, reveals commonalities and differences among existing methods, and points to future research directions. Conclusion: The classification gives SfM research a new perspective while clarifying the problem's theoretical conditions and possible future directions. Abstract: Structure from Motion (SfM) refers to the problem of recovering both structure (i.e., 3D coordinates of points in the scene) and motion (i.e., camera matrices) starting from point correspondences in multiple images. It has attracted significant attention over the years, counting practical reconstruction pipelines as well as theoretical results. This paper is conceived as a conceptual review of SfM methods, which are grouped into three main categories, according to which part of the problem - between motion and structure - they focus on. The proposed taxonomy brings a new perspective on existing SfM approaches as well as insights into open problems and possible future research directions. Particular emphasis is placed on identifying the theoretical conditions that make SfM well posed, which depend on the problem formulation that is being considered.

[108] Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM

Penghao Wu,Lewei Lu,Ziwei Liu

Main category: cs.CV

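TL;DR: The paper proposes ProxyV, which uses proxy vision tokens to reduce computational redundancy and make large multimodal models more efficient while preserving performance.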

Details

Motivation: To address computational redundancy on vision tokens in large multimodal models while avoiding information loss. Method: Designs experiments to uncover computational redundancy on vision tokens and proposes ProxyV, which uses proxy tokens to lighten the computational load. Result: ProxyV improves efficiency without sacrificing performance and in some cases even improves performance. Conclusion: ProxyV is a flexible and efficient method that can be combined with token reduction methods for further efficiency gains. Abstract: Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive computation on visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further. The code will be made public at https://github.com/penghao-wu/ProxyV.

[109] InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition

Yijie Zheng,Weijie Wu,Qingyun Li,Xuehui Wang,Xu Zhou,Aiai Ren,Jun Shen,Long Zhao,Guoqing Li,Xue Yang

Main category: cs.CV

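TL;DR: The paper introduces InstructCDS, a new suite of tasks, and EarthInstruct, the first InstructCDS benchmark for earth observation, and proposes InstructSAM, a training-free framework for instruction-driven object recognition that substantially improves efficiency and performance.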

Details

Motivation: Existing methods rely on explicit category cues and struggle with complex or implicit queries, so a more flexible solution is needed. Method: Proposes the InstructCDS task suite and the EarthInstruct benchmark, and develops InstructSAM, a training-free framework that combines large vision-language models with SAM2 for object recognition. Result: InstructSAM matches or surpasses baselines across multiple tasks with near-constant inference time, reducing output tokens by 89% and runtime by over 32%. Conclusion: The proposed tasks, benchmark, and method lay important groundwork for developing versatile object recognition systems. Abstract: Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, necessitating models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches.
We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.
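
The abstract says mask-label assignment is cast as a binary integer program that combines semantic similarity with counting constraints. One simplified way such a program could be written is shown below; the symbols ($s_{ij}$ for the similarity between mask $i$ and category $j$, $c_j$ for the count estimated by the vision-language model, $x_{ij}$ for the assignment variable) and the exact form of the constraints are assumptions, and the paper's own objective may differ.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{N}\sum_{j=1}^{M} s_{ij}\, x_{ij} \\
\text{s.t.}\quad & \sum_{j=1}^{M} x_{ij} \le 1 && \text{for every mask } i \text{ (at most one category per mask)},\\
& \sum_{i=1}^{N} x_{ij} \le c_j && \text{for every category } j \text{ (respect the estimated count)},\\
& x_{ij} \in \{0, 1\}.
\end{aligned}
```

Because the counts bound how many masks each category may receive, no per-mask confidence threshold is needed to decide which proposals to keep, which matches the abstract's description.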

cs.CL [Back]

[110] Addressing the Challenges of Planning Language Generation

Prabhu Prakash Kagitha,Andrew Zhu,Li Zhang

Main category: cs.CL

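TL;DR: Uses open-source LLMs to generate PDDL and derives plans with a symbolic solver, finding that revision with feedback substantially improves performance.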

Details

Motivation: To explore how well open-source models can generate PDDL and plans, since earlier success was limited to closed-source models. Method: Designs and evaluates 8 PDDL generation pipelines, including a high-resource language wrapper, constrained decoding, and revision with feedback. Result: Revision with feedback more than doubles performance, whereas intuitive approaches such as a language wrapper or constrained decoding reduce it. Conclusion: Revision with feedback is an effective way to improve PDDL plan generation with open-source LLMs. Abstract: Using LLMs to generate formal planning languages such as PDDL and then invoking symbolic solvers to deterministically derive plans has been shown to outperform generating plans directly. While this success has been limited to closed-source models or particular LLM pipelines, we design and evaluate 8 different PDDL generation pipelines with open-source models under 50 billion parameters previously shown to be incapable of this task. We find that intuitive approaches such as using a high-resource language wrapper or constrained decoding with grammar decrease performance, yet inference-time scaling approaches such as revision with feedback from the solver and plan validator more than double the performance.
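
The revision-with-feedback loop mentioned in the abstract can be sketched generically. The callables `generate` and `check` are hypothetical stand-ins for an LLM call and for a solver or plan validator; neither their signatures nor the round limit come from the paper.

```python
def generate_with_revision(generate, check, task, max_rounds=3):
    """Generate PDDL, validate it, and feed error messages back in.

    generate(task, feedback) -> str   (hypothetical LLM wrapper)
    check(pddl) -> (ok, message)      (hypothetical solver/validator wrapper)
    Returns (pddl, ok, rounds_used).
    """
    feedback = None
    pddl = None
    for round_number in range(1, max_rounds + 1):
        pddl = generate(task, feedback)
        ok, message = check(pddl)
        if ok:
            return pddl, True, round_number
        # Pass the solver or validator error to the next attempt.
        feedback = message
    return pddl, False, max_rounds
```

Each extra round costs another model call, which is why the abstract groups this approach under inference-time scaling.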

[111] Automated Journalistic Questions: A New Method for Extracting 5W1H in French

Richard Khoury,Maxence Verhaverbeke,Julie A. Gramaccia

Main category: cs.CL

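TL;DR: The paper proposes an automated pipeline for extracting 5W1H information from French news articles and evaluates it on 250 Quebec news articles, where it performs on par with GPT-4o.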

Details

Motivation: The 5W1H questions are used in journalism to describe events clearly, but tools for extracting this information automatically remain immature, especially for French news. Method: Designs an automated extraction pipeline and builds a human-annotated corpus of 250 news articles for evaluation. Result: The pipeline performs on par with GPT-4o on 5W1H extraction. Conclusion: The pipeline is an effective tool for 5W1H extraction from French news, with performance comparable to a large language model. Abstract: The 5W1H questions -- who, what, when, where, why and how -- are commonly used in journalism to ensure that an article describes events clearly and systematically. Answering them is a crucial prerequisite for tasks such as summarization, clustering, and news aggregation. In this paper, we design the first automated extraction pipeline to get 5W1H information from French news articles. To evaluate the performance of our algorithm, we also create a corpus of 250 Quebec news articles with 5W1H answers marked by four human annotators. Our results demonstrate that our pipeline performs as well in this task as the large language model GPT-4o.

[112] Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models

Tingchen Fu,Jiawei Gu,Yafu Li,Xiaoye Qu,Yu Cheng

Main category: cs.CL

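TL;DR: The paper introduces MathIF, a benchmark for evaluating instruction following in mathematical reasoning tasks, finds a tension between reasoning ability and instruction adherence, and examines the effect of simple interventions.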

Details

Motivation: To examine how well large language models follow natural language instructions in mathematical reasoning tasks, a gap in existing research. Method: Introduces the MathIF benchmark, empirically analyzes the trade-off between reasoning ability and instruction adherence, and tests simple interventions. Result: Models with stronger reasoning tend to follow instructions less well, especially as generation length grows; simple interventions partially restore adherence but at the cost of reasoning performance. Conclusion: Current LLM training paradigms involve a fundamental tension, motivating more instruction-aware reasoning models. Abstract: Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.

[113] Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes

Mingyang Wang,Lukas Lange,Heike Adel,Yunpu Ma,Jannik Strötgen,Hinrich Schütze

Main category: cs.CL

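TL;DR: This paper presents the first systematic study of language mixing in reasoning language models (RLMs), analyzing its patterns, impact, and internal causes, and shows how controlling the reasoning language can improve performance.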

Details

Motivation: Language mixing has been observed in RLM outputs, but its impact is unclear; the paper sets out to study its patterns and its effect on performance systematically. Method: Covers 15 languages, 7 task difficulty levels, and 18 subject areas, analyzes the patterns and impact of language mixing, and uses constrained decoding experiments to test how the choice of reasoning language affects performance. Result: Forcing models to reason in Latin or Han scripts notably improves accuracy, and the script composition of reasoning traces closely matches that of the model's internal representations. Conclusion: Language mixing reflects latent processing preferences in RLMs; the findings offer practical guidance for multilingual reasoning and open directions toward more interpretable and adaptable RLMs. Abstract: Reasoning language models (RLMs) excel at complex tasks by leveraging a chain-of-thought process to generate structured intermediate steps. However, language mixing, i.e., reasoning steps containing tokens from languages other than the prompt, has been observed in their outputs and shown to affect performance, though its impact remains debated. We present the first systematic study of language mixing in RLMs, examining its patterns, impact, and internal causes across 15 languages, 7 task difficulty levels, and 18 subject areas, and show how all three factors influence language mixing. Moreover, we demonstrate that the choice of reasoning language significantly affects performance: forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. Finally, we show that the script composition of reasoning traces closely aligns with that of the model's internal representations, indicating that language mixing reflects latent processing preferences in RLMs. Our findings provide actionable insights for optimizing multilingual reasoning and open new directions for controlling reasoning languages to build more interpretable and adaptable RLMs.
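
To make "script composition of reasoning traces" concrete, here is a minimal sketch that measures what share of the letters in a text is Latin, Han, or something else. It relies on Unicode character names from the standard library; the three-way bucketing and the function name are assumptions for illustration, not the paper's measurement procedure.

```python
import unicodedata
from collections import Counter


def script_composition(text):
    """Return the fraction of alphabetic characters per coarse script bucket."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["Latin"] += 1
        elif name.startswith("CJK"):
            counts["Han"] += 1
        else:
            counts["Other"] += 1
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()} if total else {}
```

A trace whose prompt is in one script but whose composition is dominated by another would be an instance of the language mixing the paper studies.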

[114] WebNovelBench: Placing LLM Novelists on the Web Novel Distribution

Leon Lin,Jun Zheng,Haidong Wang

Main category: cs.CL

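TL;DR: WebNovelBench is a new benchmark for evaluating long-form novel generation, built on a dataset of over 4,000 Chinese web novels; using a multi-dimensional narrative quality framework and an LLM-as-Judge approach, it effectively separates human-written works from LLM-generated content.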

Details

Motivation: Existing benchmarks lack the scale, diversity, or objectivity needed to evaluate LLMs' long-form storytelling comprehensively. Method: Uses a dataset of over 4,000 Chinese web novels, frames evaluation as synopsis-to-story generation, and applies a multi-dimensional narrative quality framework with an LLM-as-Judge approach, aggregated via Principal Component Analysis and percentile ranking. Result: Experiments show that WebNovelBench effectively differentiates human-written masterpieces, popular web novels, and LLM-generated content; 24 state-of-the-art LLMs are ranked and analyzed. Conclusion: WebNovelBench provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM narrative generation. Abstract: Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
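
The last step of the scoring pipeline, mapping an aggregated score to a percentile rank against human-authored works, can be sketched as follows. The PCA aggregation is omitted, the "less than or equal" convention is one of several possible percentile definitions, and the function name is hypothetical.

```python
from bisect import bisect_right


def percentile_rank(score, human_scores):
    """Percentage of human-authored reference scores that are <= score."""
    ordered = sorted(human_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)
```

A model whose aggregated score lands at the 50th percentile is, under this convention, scoring at least as high as half of the human-written reference novels.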

[115] Tracing Multilingual Factual Knowledge Acquisition in Pretraining

Yihong Liu,Mingyang Wang,Amir Hossein Kargaran,Felicia Körner,Ercong Nie,Barbara Plank,François Yvon,Hinrich Schütze

Main category: cs.CL

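TL;DR: The study traces how multilingual factual recall and crosslingual consistency evolve during the pretraining of OLMo-7B, finding that accuracy and consistency improve over training, driven mainly by fact frequency in the corpus, with crosslingual transfer playing a supporting role in the early stages.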

Details

Motivation: Most studies evaluate only the final model, leaving the development of multilingual factual recall and crosslingual consistency during pretraining unexplored. Method: Using OLMo-7B as a case study, traces changes in factual recall and crosslingual consistency during pretraining and analyzes the effects of frequency and crosslingual transfer. Result: Fact frequency is the main driver, with more frequent facts more likely to be recalled correctly; crosslingual transfer helps with low-frequency non-English facts, especially in the early stages. Conclusion: Multilingual factual knowledge is acquired through two pathways: frequency-driven learning (dominant) and crosslingual transfer (supporting), the latter mainly for relation types involving named entities. Abstract: Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts -- an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities. We release our code and data to facilitate further research at https://github.com/cisnlp/multilingual-fact-tracing.

[116] Text Generation Beyond Discrete Token Sampling

Yufan Zhuang,Liyuan Liu,Chandan Singh,Jingbo Shang,Jianfeng Gao

Main category: cs.CL

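TL;DR: The paper proposes Mixture of Inputs (MoI), a training-free generation method that blends the sampled discrete token with the otherwise discarded token distribution to improve the quality and reasoning ability of autoregressive generation.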

Details

Motivation: In standard autoregressive generation the model discards the token distribution and passes on only the discrete token, losing information; MoI aims to retain that information. Method: After a token is generated, MoI applies Bayesian estimation with the token distribution as the prior and the sampled token as the observation, and feeds the continuous posterior expectation in as the new input. Result: On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves multiple models with no additional training and negligible computational overhead. Conclusion: By retaining token distribution information, MoI improves the quality and reasoning ability of autoregressive generation and is broadly applicable. Abstract: In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
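
The blending step can be illustrated with a minimal sketch under a simple assumed model: a Dirichlet prior whose mean is the predicted token distribution, updated by the single sampled token, so that the posterior mean replaces the one-hot vector. The concentration parameter `beta`, the function name, and the toy embedding table are hypothetical; the abstract does not specify how the paper sets the prior strength.

```python
def mixture_of_inputs(probs, sampled_id, embeddings, beta=1.0):
    """Blend the sampled token with the predicted distribution.

    probs:      next-token probabilities (sum to 1)
    sampled_id: index of the token that was actually sampled
    embeddings: one embedding vector per vocabulary entry
    beta:       assumed prior strength; larger values trust the distribution more
    Returns (mixing_weights, mixed_embedding).
    """
    # Dirichlet(beta * probs) prior plus one observation of the sampled token.
    weights = [beta * p for p in probs]
    weights[sampled_id] += 1.0
    total = beta * sum(probs) + 1.0
    weights = [w / total for w in weights]  # posterior expectation

    dim = len(embeddings[0])
    mixed = [sum(w * emb[d] for w, emb in zip(weights, embeddings))
             for d in range(dim)]
    return weights, mixed
```

As `beta` approaches zero the weights collapse to the usual one-hot input, so standard autoregressive decoding is recovered as a limiting case of this sketch.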

[117] SEPS: A Separability Measure for Robust Unlearning in LLMs

Wonje Jeung,Sangyeon Yoon,Albert No

Main category: cs.CL

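TL;DR: SEPS is an evaluation framework that measures a model's ability to both forget and retain information within a single prompt, addressing the weakness of existing unlearning methods in mixed-query scenarios.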

Details

Motivation: Existing unlearning metrics fail to capture real-world cases where forget and retain queries coexist, so evaluation is incomplete. Method: Introduces the SEPS evaluation framework and proposes Mixed Prompt (MP) unlearning, a strategy that integrates forget and retain queries into a unified training objective. Result: MP significantly improves unlearning effectiveness and remains robust in complex settings. Conclusion: SEPS and MP provide more effective evaluation and implementation of machine unlearning. Abstract: Machine unlearning aims to selectively remove targeted knowledge from Large Language Models (LLMs), ensuring they forget specified content while retaining essential information. Existing unlearning metrics assess whether a model correctly answers retain queries and rejects forget queries, but they fail to capture real-world scenarios where forget queries rarely appear in isolation. In fact, forget and retain queries often coexist within the same prompt, making mixed-query evaluation crucial. We introduce SEPS, an evaluation framework that explicitly measures a model's ability to both forget and retain information within a single prompt. Through extensive experiments across three benchmarks, we identify two key failure modes in existing unlearning methods: (1) untargeted unlearning indiscriminately erases both forget and retain content once a forget query appears, and (2) targeted unlearning overfits to single-query scenarios, leading to catastrophic failures when handling multiple queries. To address these issues, we propose Mixed Prompt (MP) unlearning, a strategy that integrates both forget and retain queries into a unified training objective. Our approach significantly improves unlearning effectiveness, demonstrating robustness even in complex settings with up to eight mixed forget and retain queries in a single prompt.

[118] A Comparative Study of Large Language Models and Human Personality Traits

Wang Jiaqi,Wang bo,Guo fa,Cheng cheng,Yang li

Main category: cs.CL

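TL;DR: The study examines whether large language models (LLMs) exhibit human-like personality traits and proposes the Distributed Personality Framework, finding that LLM personality is dynamic and input-driven.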

Details

Motivation: As LLMs become more involved in social and cognitive domains, the study asks whether they have human-like personality traits and whether conventional personality assessment tools apply. Method: Three empirical studies: a test-retest stability analysis, a cross-variant consistency analysis, and a study of personality retention during role-playing. Result: LLM personality expression shows high variability, input sensitivity, and low internal consistency, and lacks long-term stability. Conclusion: LLMs exhibit fluid, externally dependent personality patterns, offering a new perspective for building LLM-specific personality frameworks and advancing human-AI interaction. Abstract: Large Language Models (LLMs) have demonstrated human-like capabilities in language comprehension and generation, becoming active participants in social and cognitive domains. This study investigates whether LLMs exhibit personality-like traits and how these traits compare with human personality, focusing on the applicability of conventional personality assessment tools. A behavior-based approach was used across three empirical studies. Study 1 examined test-retest stability and found that LLMs show higher variability and are more input-sensitive than humans, lacking long-term stability. Based on this, we propose the Distributed Personality Framework, conceptualizing LLM traits as dynamic and input-driven. Study 2 analyzed cross-variant consistency in personality measures and found LLMs' responses were highly sensitive to item wording, showing low internal consistency compared to humans. Study 3 explored personality retention during role-playing, showing LLM traits are shaped by prompt and parameter settings. These findings suggest that LLMs express fluid, externally dependent personality patterns, offering insights for constructing LLM-specific personality frameworks and advancing human-AI interaction. This work contributes to responsible AI development and extends the boundaries of personality psychology in the age of intelligent systems.

[119] MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation

Xi Wang,Jiaqian Hu,Safinah Ali

Main category: cs.CL

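TL;DR: MAATS is a multi-agent automated translation system that uses the MQM framework for fine-grained error detection and translation refinement, significantly outperforming conventional single-agent methods.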

Details

Motivation: To address the shortcomings of conventional single-agent translation systems in semantic accuracy and contextual fidelity by dividing the work among multiple agents. Method: Uses multiple AI agents, each focused on a distinct MQM category (e.g., Accuracy, Fluency, Style, Terminology), plus a synthesis agent that integrates their annotations to refine translations iteratively. Result: Outperforms zero-shot and single-agent baselines across language pairs and LLMs, excelling in semantic accuracy, locale adaptation, and linguistically distant language pairs. Conclusion: By aligning modular agents with interpretable MQM dimensions, MAATS narrows the gap between black-box LLMs and human translation workflows and improves semantic and contextual fidelity. Abstract: We present MAATS, a Multi-Agent Automated Translation System that leverages the Multidimensional Quality Metrics (MQM) framework as a fine-grained signal for error detection and refinement. MAATS employs multiple specialized AI agents, each focused on a distinct MQM category (e.g., Accuracy, Fluency, Style, Terminology), followed by a synthesis agent that integrates the annotations to iteratively refine translations. This design contrasts with conventional single-agent methods that rely on self-correction. Evaluated across diverse language pairs and Large Language Models (LLMs), MAATS outperforms zero-shot and single-agent baselines with statistically significant gains in both automatic metrics and human assessments. It excels particularly in semantic accuracy, locale adaptation, and linguistically distant language pairs. Qualitative analysis highlights its strengths in multi-layered error diagnosis, omission detection across perspectives, and context-aware refinement. By aligning modular agent roles with interpretable MQM dimensions, MAATS narrows the gap between black-box LLMs and human translation workflows, shifting focus from surface fluency to deeper semantic and contextual fidelity.

[120] EasyMath: A 0-shot Math Benchmark for SLMs

Drishya Karki,Michiel Kamphuis,Angelecia Frey

Main category: cs.CL

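TL;DR: EasyMath is a math reasoning benchmark designed for small language models, covering 13 categories; 23 models were tested, and accuracy rises with model size and training.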

Details

Motivation: To give small language models a practical math reasoning evaluation tool, filling a gap in existing benchmarks. Method: Designs the EasyMath benchmark covering 13 categories and tests 23 models (14M to 4B parameters) in a zero-shot setting, scoring free-form answers with exact, numerical, and symbolic checks. Result: Accuracy rises with size and training, chain-of-thought adds modest gains, and consistency improves at scale. Conclusion: EasyMath is an effective tool for evaluating the math reasoning of small language models and shows how size and training affect performance. Abstract: EasyMath is a compact benchmark for practical math reasoning in small language models. It covers thirteen categories, from basic arithmetic and order of operations to word problems, algebraic expressions, and edge cases, and omits specialist topics. We tested 23 models (14M to 4B parameters) using exact, numerical, and symbolic checks on free-form answers in a zero-shot setting. Accuracy rises with size and training, chain-of-thought adds modest gains, and consistency improves at scale.
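
Scoring free-form answers with layered checks might look roughly like the sketch below, which covers only the exact and numerical stages. The symbolic stage would need a computer algebra system and is left out; the function name, the tolerance, and the order of the checks are assumptions rather than the benchmark's actual grader.

```python
from fractions import Fraction


def check_answer(predicted, expected, tol=1e-6):
    """Return which check accepted the answer, or None if none did."""
    p, e = predicted.strip(), expected.strip()
    if p == e:
        return "exact"
    try:
        # Fraction parses integers, decimals, and "a/b" strings exactly.
        difference = abs(Fraction(p) - Fraction(e))
    except (ValueError, ZeroDivisionError):
        return None  # not numeric; a symbolic check would go here
    return "numerical" if difference <= tol else None
```

Trying the cheapest check first keeps grading fast, and falling through to progressively more lenient checks avoids penalizing a model for formatting alone.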

[121] Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models

Ryan Solgi,Kai Zhen,Rupak Vignesh Swaminathan,Nathan Susanj,Athanasios Mouchtaris,Siegfried Kunzmann,Zheng Zhang

Main category: cs.CL

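TL;DR: The paper proposes sparse augmented tensor networks (Saten), a framework for compressing pre-trained large language models (LLMs) during fine-tuning that improves compression efficiency and accuracy.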

Details

Motivation: The high-rank nature of pre-trained LLMs and the lack of access to pretraining data make compressing them for downstream tasks challenging. Method: Applies low-rank tensor compression together with the sparse augmented tensor network (Saten) framework to achieve full model compression during fine-tuning. Result: Experiments show that Saten improves both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance. Conclusion: Saten offers an effective solution for deploying LLMs efficiently on resource-constrained devices. Abstract: The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, their application to compressing pre-trained LLMs for downstream tasks (post-training) remains challenging due to the high-rank nature of pre-trained LLMs and the lack of access to pretraining data. In this study, we investigate low-rank tensorized LLMs during fine-tuning and propose sparse augmented tensor networks (Saten) to enhance their performance. The proposed Saten framework enables full model compression. Experimental results demonstrate that Saten enhances both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.

[122] Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

Chin-Jou Li,Eunjung Yeo,Kwanghee Choi,Paula Andrea Pérez-Toro,Masao Someki,Rohan Kumar Das,Zhengjun Yue,Juan Rafael Orozco-Arroyave,Elmar Nöth,David R. Mortensen

Main category: cs.CL

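TL;DR: Uses a voice conversion model to turn healthy non-English speech into dysarthric-like speech, improving multilingual ASR models on dysarthric speech recognition.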

Details

Motivation: To address the scarcity of non-English dysarthric speech data and improve ASR on dysarthric speech. Method: Fine-tunes a voice conversion model on English dysarthric speech (UASpeech), uses it to generate non-English dysarthric-like speech from FLEURS, and fine-tunes a multilingual ASR model (MMS) on the result. Result: On Spanish, Italian, and Tamil datasets, the method significantly outperforms off-the-shelf MMS and conventional data augmentation techniques. Conclusion: The generated speech simulates dysarthric characteristics and significantly improves ASR performance. Abstract: Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.

[123] Incorporating Token Usage into Prompting Strategy Evaluation

Chris Sypherd,Sergei Petrov,Sonny George,Vaishak Belle

Main category: cs.CL

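TL;DR: The paper proposes the Big-$O_{tok}$ framework and the Token Cost metric for evaluating the efficiency and practicality of prompting strategies, finding that increased token usage yields drastically diminishing performance returns.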

Details

Motivation: LLM performance depends heavily on the prompting strategy, yet existing evaluations focus on performance and overlook efficiency; the authors argue that efficiency, balancing performance against token usage, is the more practical measure. Method: Proposes Big-$O_{tok}$, a theoretical framework describing the token usage growth of prompting strategies, and introduces Token Cost (tokens per unit of performance) as an empirical measure, applying both to common prompting strategies. Result: Increased token usage leads to drastically diminishing performance returns, validating the Big-$O_{tok}$ analyses. Conclusion: Efficiency-aware evaluation is necessary, and Big-$O_{tok}$ and Token Cost are practical tools for optimizing prompting strategies. Abstract: In recent years, large language models have demonstrated remarkable performance across diverse tasks. However, their task effectiveness is heavily dependent on the prompting strategy used to elicit output, which can vary widely in both performance and token usage. While task performance is often used to determine prompting strategy success, we argue that efficiency--balancing performance and token usage--can be a more practical metric for real-world utility. To enable this, we propose Big-$O_{tok}$, a theoretical framework for describing the token usage growth of prompting strategies, and analyze Token Cost, an empirical measure of tokens per performance. We apply these to several common prompting strategies and find that increased token usage leads to drastically diminishing performance returns. Our results validate the Big-$O_{tok}$ analyses and reinforce the need for efficiency-aware evaluations.
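
Reading "tokens per performance" literally, Token Cost can be sketched as a simple ratio. The function name, the choice of accuracy as the performance measure, and the error handling are assumptions; the paper may normalize differently.

```python
def token_cost(total_tokens, performance):
    """Tokens spent per unit of task performance (lower is better).

    total_tokens: tokens consumed by a prompting strategy on a benchmark
    performance:  task score in (0, 1], e.g. accuracy
    """
    if performance <= 0:
        raise ValueError("performance must be positive")
    return total_tokens / performance
```

Under this reading, a strategy that uses several times as many tokens for a small accuracy gain has a much higher Token Cost, which is how the diminishing returns reported in the abstract would show up.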

[124] Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters

Danqing Wang,Zhuorui Ye,Xinran Zhao,Fei Fang,Lei Li

Main category: cs.CL

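TL;DR: TreeDebater is a novel debate framework that introduces a Rehearsal Tree and a Debate Flow Tree to handle the time constraints and interactivity of competitive debate, and it outperforms existing systems in experiments.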

Details

Motivation: Time constraints and the complexity of back-and-forth interaction are the main challenges in competitive debate, calling for a framework that can adjust its strategy dynamically. Method: Proposes TreeDebater, comprising a Rehearsal Tree (evaluating claim strength) and a Debate Flow Tree (tracking debate status), combined with time budget allocation and simulated audience feedback to refine statements. Result: In human evaluation, TreeDebater outperforms the state-of-the-art multi-agent debate system, and its time allocation strategies align with those of human experts. Conclusion: Through tree structures and dynamic strategy refinement, TreeDebater markedly improves competitive debating. Abstract: Winning competitive debates requires sophisticated reasoning and argument skills. There are unique challenges in the competitive debate: (1) The time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) The persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and Debate Flow Tree. The Rehearsal Tree anticipates attacks and defenses to evaluate the strength of the claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. The human evaluation on both the stage-level and the debate-level comparison shows that our TreeDebater outperforms the state-of-the-art multi-agent debate system. Further investigation shows that TreeDebater shows better strategies in limiting time to important debate actions, aligning with the strategies of human debate experts.

[125] In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties

Nathan Roll,Calbert Graham,Yuka Tatsumi,Kim Tien Nguyen,Meghan Sumner,Dan Jurafsky

Main category: cs.CL

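TL;DR: The paper proposes an in-context learning (ICL) framework that improves a speech recognition model's adaptation to unfamiliar speakers and language varieties; experiments show that a handful of examples is enough to reduce word error rate substantially.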

Details

Motivation: Investigates whether speech recognition models can, like humans, adapt to unfamiliar speakers and language varieties from a few examples. Method: Introduces a scalable framework that performs in-context learning (ICL) with interleaved task prompts and audio-text pairs. Result: As few as 12 examples (about 50 seconds) reduce word error rate by a relative 19.7%, with the largest gains in low-resource varieties. Conclusion: The ICL framework shows human-like adaptation in improving speech recognition robustness, though performance gaps remain for certain varieties. Abstract: Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken language models? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided--though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.
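The abstract reports the gain both as a relative reduction (19.7%) and as an absolute one (1.2 percentage points). The sketch below shows how the two relate and what baseline they would jointly imply. This is only a back-of-the-envelope reading: the reported figures are averages across corpora, so the implied baseline is an assumption, not a number from the paper, and both helper names are hypothetical.

```python
def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER reduction, expressed as a fraction of the baseline."""
    return (baseline_wer - adapted_wer) / baseline_wer


def implied_baseline(absolute_drop_pp: float, relative_drop: float) -> float:
    """Baseline WER (in percentage points) implied by an absolute and a relative drop."""
    return absolute_drop_pp / relative_drop


# 1.2 pp absolute and 19.7% relative, as quoted in the abstract.
baseline = implied_baseline(absolute_drop_pp=1.2, relative_drop=0.197)
adapted = baseline - 1.2

print(f"implied baseline WER: {baseline:.2f}%")
print(f"implied adapted WER:  {adapted:.2f}%")
print(f"relative reduction:   {relative_reduction(baseline, adapted):.3f}")
```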

[126] Scaling Laws for State Dynamics in Large Language Models

Jacob X Li,Shreyas S Raman,Jessica Wan,Fahad Samman,Jazlyn Lin

Main category: cs.CL

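TL;DR: The paper evaluates how well large language models (LLMs) model deterministic state dynamics, finds that accuracy degrades as the state space grows and transitions become sparse, and identifies the attention heads responsible for propagating state information.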

Details

Motivation: Studies the ability of LLMs on tasks requiring internal state tracking, in particular how well they model state transition dynamics. Method: Evaluates state prediction accuracy in three domains (Box Tracking, Abstract DFA Sequences, Complex Text Games) and uses activation patching to identify key attention heads. Result: LLMs do reasonably well in low-complexity settings (GPT-2 XL reaches about 70% accuracy), but accuracy drops sharply (below 30%) as the state space grows or transitions become sparser. Key attention heads are found to propagate state information, but joint state-action reasoning is weak. Conclusion: State tracking in LLMs emerges from distributed interactions rather than explicit symbolic computation; future work should improve joint state-action reasoning. Abstract: Large Language Models (LLMs) are increasingly used in tasks requiring internal state tracking, yet their ability to model state transition dynamics remains poorly understood. We evaluate how well LLMs capture deterministic state dynamics across 3 domains: Box Tracking, Abstract DFA Sequences, and Complex Text Games, each formalizable as a finite-state system. Across tasks, we find that next-state prediction accuracy degrades with increasing state-space size and sparse transitions. GPT-2 XL reaches about 70% accuracy in low-complexity settings but drops below 30% when the number of boxes or states exceeds 5 or 10, respectively. In DFA tasks, Pythia-1B fails to exceed 50% accuracy when the number of states is > 10 and transitions are < 30. Through activation patching, we identify attention heads responsible for propagating state information: GPT-2 XL Layer 22 Head 20, and Pythia-1B Heads at Layers 10, 11, 12, and 14. While these heads successfully move relevant state features, action information is not reliably routed to the final token, indicating weak joint state-action reasoning. Our results suggest that state tracking in LLMs emerges from distributed interactions of next-token heads rather than explicit symbolic computation.
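To make the next-state prediction setup concrete, here is a minimal sketch of a finite-state task and its accuracy metric. It assumes a randomly generated, fully specified transition table; the paper's DFA tasks also vary transition sparsity, which this sketch does not model. All function names are hypothetical, and no language model is involved.

```python
import random


def make_dfa(num_states: int, num_symbols: int, seed: int = 0):
    """Random total transition table mapping (state, symbol) to the next state."""
    rng = random.Random(seed)
    return {
        (s, a): rng.randrange(num_states)
        for s in range(num_states)
        for a in range(num_symbols)
    }


def run_dfa(transitions, start: int, symbols):
    """Return the state reached after each input symbol."""
    state, visited = start, []
    for a in symbols:
        state = transitions[(state, a)]
        visited.append(state)
    return visited


def next_state_accuracy(predictions, targets) -> float:
    """Fraction of positions where the predicted state equals the true state."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


dfa = make_dfa(num_states=5, num_symbols=3)
rng = random.Random(1)
inputs = [rng.randrange(3) for _ in range(20)]
truth = run_dfa(dfa, start=0, symbols=inputs)

# A model's per-step predictions would be scored against `truth`;
# scoring the ground truth against itself is the trivial upper bound.
print(next_state_accuracy(truth, truth))
```

Raising `num_states` in such a generator is one simple way to reproduce the "larger state space" axis along which the abstract reports accuracy degrading.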

[127] Concept Incongruence: An Exploration of Time and Death in Role Playing

Xiaoyan Bai,Ike Peng,Aditya Singh,Chenhao Tan

Main category: cs.CL

TL;DR: The paper examines how large language models (LLMs) behave when concepts conflict, defines concept incongruence, and analyzes model behavior through role-playing experiments.

Details

Motivation: To understand how LLMs behave when a user prompt conflicts with a definition, e.g. "Draw a unicorn with two horns", and to quantify model behavior under concept incongruence. Method: Defines concept incongruence, designs role-play experiments, proposes three behavioral metrics (abstention rate, conditional accuracy, answer rate), and uses probing experiments to analyze the causes. Result: Models fail to stop answering after the role's death, and accuracy drops. The causes include unreliable encoding of the "death" state and shifts in temporal representations induced by role playing. Conclusion: Concept incongruence leads to anomalous model behavior, and the work points to future directions for improving it. Abstract: Consider this prompt "Draw a unicorn with two horns". Should large language models (LLMs) recognize that a unicorn has only one horn by definition and ask users for clarifications, or proceed to generate something anyway? We introduce concept incongruence to capture such phenomena where concept boundaries clash with each other, either in user prompts or in model representations, often leading to under-specified or mis-specified behaviors. In this work, we take the first step towards defining and analyzing model behavior under concept incongruence. Focusing on temporal boundaries in the Role-Play setting, we propose three behavioral metrics--abstention rate, conditional accuracy, and answer rate--to quantify model behavior under incongruence due to the role's death. We show that models fail to abstain after death and suffer from an accuracy drop compared to the Non-Role-Play setting. Through probing experiments, we identify two main causes: (i) unreliable encoding of the "death" state across different years, leading to unsatisfactory abstention behavior, and (ii) role playing causes shifts in the model's temporal representations, resulting in accuracy drops. We leverage these insights to improve consistency in the model's abstention and answer behaviors. Our findings suggest that concept incongruence leads to unexpected model behaviors and point to future directions on improving model behavior under concept incongruence.
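The three behavioral metrics can be sketched as simple ratios over labeled responses. The definitions below are assumptions made for illustration (the abstract names the metrics but does not give formulas), and the `Response` fields are hypothetical. Abstaining and answering are kept as separate flags because a response could, in principle, note the character's death and still commit to an answer.

```python
from dataclasses import dataclass


@dataclass
class Response:
    abstained: bool  # response signals the character cannot know (e.g. already dead)
    answered: bool   # response still commits to a concrete answer
    correct: bool    # the concrete answer, if any, matches the reference


def behavior_metrics(responses):
    """Return (abstention rate, answer rate, conditional accuracy) under assumed definitions."""
    total = len(responses)
    answered = [r for r in responses if r.answered]
    abstention_rate = sum(r.abstained for r in responses) / total
    answer_rate = len(answered) / total
    # Accuracy is conditioned on the model having produced an answer at all.
    conditional_accuracy = (
        sum(r.correct for r in answered) / len(answered) if answered else 0.0
    )
    return abstention_rate, answer_rate, conditional_accuracy


# Hypothetical labeled responses, for illustration only.
sample = [
    Response(abstained=True, answered=False, correct=False),
    Response(abstained=True, answered=True, correct=True),
    Response(abstained=False, answered=True, correct=True),
    Response(abstained=False, answered=True, correct=False),
]
print(behavior_metrics(sample))
```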

[128] Understanding 6G through Language Models: A Case Study on LLM-aided Structured Entity Extraction in Telecom Domain

Ye Yuan,Haolun Wu,Hao Zhou,Xue Liu,Hao Chen,Yan Xin,Jianzhong Zhang

Main category: cs.CL

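TL;DR: The paper proposes TeleSEE, a language-model-based information extraction technique for extracting structured entities from telecom text, which uses a token-efficient representation and a hierarchical parallel decoding strategy to improve accuracy and processing speed.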

Details

Motivation: In 6G networks, knowledge understanding underpins network intelligence and AI-native architectures, and information extraction turns fragmented telecom knowledge into structured formats that help AI models understand network terminology. Method: Proposes TeleSEE, which uses a token-efficient representation to predict entity types and attribute keys, combined with a hierarchical parallel decoding strategy that improves on the standard encoder-decoder architecture. Result: Experiments show that TeleSEE is more accurate than baseline techniques and processes samples 5 to 9 times faster. Conclusion: TeleSEE provides an efficient and accurate solution for telecom information extraction and supports knowledge understanding in 6G networks. Abstract: Knowledge understanding is a foundational part of envisioned 6G networks to advance network intelligence and AI-native network architectures. In this paradigm, information extraction plays a pivotal role in transforming fragmented telecom knowledge into well-structured formats, empowering diverse AI models to better understand network terminologies. This work proposes a novel language model-based information extraction technique, aiming to extract structured entities from the telecom context. The proposed telecom structured entity extraction (TeleSEE) technique applies a token-efficient representation method to predict entity types and attribute keys, aiming to reduce the number of output tokens and improve prediction accuracy. Meanwhile, TeleSEE involves a hierarchical parallel decoding method, improving the standard encoder-decoder architecture by integrating additional prompting and decoding strategies into entity extraction tasks. In addition, to better evaluate the performance of the proposed technique in the telecom domain, we further designed a dataset named 6GTech, including 2390 sentences and 23747 words from more than 100 6G-related technical publications. Finally, the experiment shows that the proposed TeleSEE method achieves higher accuracy than other baseline techniques, and also presents 5 to 9 times higher sample processing speed.

[129] ConspEmoLLM-v2: A robust and stable model to detect sentiment-transformed conspiracy theories

Zhiwei Liu,Paul Thompson,Jiaqi Rong,Sophia Ananiadou

Main category: cs.CL

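TL;DR: The paper proposes ConspEmoLLM-v2, an improved conspiracy theory detector trained on the augmented ConDID-v2 dataset (which includes LLM-rewritten content), to cope with sentiment-disguised LLM-generated content.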

Details

Motivation: Large language models (LLMs) can generate conspiracy content with disguised sentiment, and existing detectors, which rely on the emotional features of human-authored text, may fail on it. Method: Builds the ConDID-v2 dataset (including LLM-rewritten content) and trains the improved detection model ConspEmoLLM-v2. Result: ConspEmoLLM-v2 performs well on both original and sentiment-disguised content and surpasses the baseline models. Conclusion: The improved dataset and model can effectively detect sentiment-disguised conspiracy content. Abstract: Despite the many benefits of large language models (LLMs), they can also cause harm, e.g., through automatic generation of misinformation, including conspiracy theories. Moreover, LLMs can also "disguise" conspiracy theories by altering characteristic textual features, e.g., by transforming their typically strong negative emotions into a more positive tone. Although several studies have proposed automated conspiracy theory detection methods, they are usually trained using human-authored text, whose features can vary from LLM-generated text. Furthermore, several conspiracy detection models, including the previously proposed ConspEmoLLM, rely heavily on the typical emotional features of human-authored conspiracy content. As such, intentionally disguised content may evade detection. To combat such issues, we first developed an augmented version of the ConDID conspiracy detection dataset, ConDID-v2, which supplements human-authored conspiracy tweets with versions rewritten by an LLM to reduce the negativity of their original sentiment. The quality of the rewritten tweets was verified by combining human and LLM-based assessment. We subsequently used ConDID-v2 to train ConspEmoLLM-v2, an enhanced version of ConspEmoLLM. Experimental results demonstrate that ConspEmoLLM-v2 retains or exceeds the performance of ConspEmoLLM on the original human-authored content in ConDID, and considerably outperforms both ConspEmoLLM and several other baselines when applied to sentiment-transformed tweets in ConDID-v2. The project will be available at https://github.com/lzw108/ConspEmoLLM.

[130] Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications

Fadel M. Megahed,Ying-Ju Chen,L. Allision Jones-Farmer,Younghwa Lee,Jiawei Brooke Wang,Inez M. Zwetsloot

Main category: cs.CL

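TL;DR: The study proposes a framework for evaluating the consistency of large language models (LLMs) in binary text classification, filling a gap in reliability assessment methods. Drawing on psychometric principles, it determines sample size requirements, develops metrics for invalid responses, and evaluates intra- and inter-rater reliability. A case study shows that models classify financial news sentiment well but cannot predict actual market movements.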

Details

Motivation: Methods for assessing LLM reliability in text classification are lacking; the study aims to fill this gap and offer systematic guidance. Method: Adapts psychometric principles to determine sample size and develop invalid-response metrics, and evaluates 14 LLMs on financial news sentiment classification, with five replicates per model on 1,350 articles. Result: Models show high intra-rater consistency (perfect agreement on 90-98% of examples), and smaller models such as gemma3:1B outperform larger ones. All models perform at chance when predicting market movements. Conclusion: The framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, helping organizations optimize resources for classification tasks. Abstract: This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
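The "perfect agreement" figure quoted above can be read as the share of articles on which all replicates of one model return the same label. The sketch below computes that quantity under this assumed reading. The function name and the toy labels are hypothetical, and the paper's framework covers more than this one statistic.

```python
def perfect_agreement_rate(labels_by_item):
    """Share of items whose replicate labels are all identical."""
    agree = sum(1 for labels in labels_by_item if len(set(labels)) == 1)
    return agree / len(labels_by_item)


# Hypothetical labels: 4 articles, 5 replicates each (one row per article).
replicates = [
    ["pos", "pos", "pos", "pos", "pos"],
    ["neg", "neg", "neg", "neg", "neg"],
    ["pos", "neg", "pos", "pos", "pos"],  # one replicate disagrees
    ["neg", "neg", "neg", "neg", "neg"],
]
rate = perfect_agreement_rate(replicates)
print(f"perfect agreement on {rate:.0%} of articles")
```

Note that high agreement of this kind measures consistency only; as the abstract stresses, it says nothing about whether the labels predict market movements.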

[131] Too Long, Didn't Model: Decomposing LLM Long-Context Understanding With Novels

Sil Hamilton,Rebecca M. M. Hicke,Matthew Wilkens,David Mimno

Main category: cs.CL

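TL;DR: The paper introduces the TLDM benchmark for evaluating how well large language models understand very long contexts (such as novels) and finds that current models become unstable beyond 64k tokens.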

Details

Motivation: To evaluate the effectiveness of large language models on very long contexts, especially in settings with complex structure and long-range semantic dependencies such as novels. Method: Uses the TLDM benchmark to test models on plot summary, storyworld configuration, and elapsed narrative time. Result: None of the seven frontier LLMs tested retain stable understanding beyond 64k tokens. Conclusion: Developers should look beyond conventional evaluations and attend to complex long-context scenarios; the TLDM benchmark is released to support further research. Abstract: Although the context length of large language models (LLMs) has increased to millions of tokens, evaluating their effectiveness beyond needle-in-a-haystack approaches has proven difficult. We argue that novels provide a case study of subtle, complicated structure and long-range semantic dependencies often over 128k tokens in length. Inspired by work on computational novel analysis, we release the Too Long, Didn't Model (TLDM) benchmark, which tests a model's ability to report plot summary, storyworld configuration, and elapsed narrative time. We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens. Our results suggest language model developers must look beyond "lost in the middle" benchmarks when evaluating model performance in complex long-context scenarios. To aid in further development we release the TLDM benchmark together with reference code and data.

[132] MedBrowseComp: Benchmarking Medical Deep Research and Computer Use

Shan Chen,Pedro Moreira,Yuxin Xiao,Sam Schmidgall,Jeremy Warner,Hugo Aerts,Thomas Hartvigsen,Jack Gallifant,Danielle S. Bitterman

Main category: cs.CL

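TL;DR: MedBrowseComp is a new benchmark that tests the ability of large language models to retrieve and synthesize multi-hop medical facts in clinical practice, revealing a gap between current models and clinical needs.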

Details

Motivation: Clinical decision-making requires integrating multiple knowledge sources, and existing evaluations do not adequately test models' real capabilities, so a more systematic evaluation tool is needed. Method: Builds the MedBrowseComp benchmark, with more than 1,000 questions that mirror clinical scenarios, to test how well models retrieve and synthesize information from live knowledge bases. Result: Frontier models score as low as 10% on MedBrowseComp, showing a substantial gap from the rigor required in clinical practice. Conclusion: MedBrowseComp sets clear targets for future model and toolchain improvements and is an important testbed for reliable medical information retrieval. Abstract: Large language models (LLMs) are increasingly envisioned as decision-support tools in clinical practice, yet safe clinical reasoning demands integrating heterogeneous knowledge bases -- trials, primary studies, regulatory documents, and cost data -- under strict accuracy constraints. Existing evaluations often rely on synthetic prompts, reduce the task to single-hop factoid queries, or conflate reasoning with open-ended generation, leaving their real-world utility unclear. To close this gap, we present MedBrowseComp, the first benchmark that systematically tests an agent's ability to reliably retrieve and synthesize multi-hop medical facts from live, domain-specific knowledge bases. MedBrowseComp contains more than 1,000 human-curated questions that mirror clinical scenarios where practitioners must reconcile fragmented or conflicting information to reach an up-to-date conclusion. Applying MedBrowseComp to frontier agentic systems reveals performance shortfalls as low as ten percent, exposing a critical gap between current LLM capabilities and the rigor demanded in clinical settings. MedBrowseComp therefore offers a clear testbed for reliable medical information seeking and sets concrete goals for future model and toolchain upgrades. You can visit our project page at: https://moreirap12.github.io/mbc-browse-app/

[133] DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis

Prashanth Vijayaraghavan,Soroush Vosoughi,Lamogha Chizor,Raya Horesh,Rogerio Abreu de Paula,Ehsan Degan,Vandana Mukherjee

Main category: cs.CL

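TL;DR: The paper proposes the DECASTE framework for detecting and assessing caste bias in large language models, revealing that models systematically reinforce such bias.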

Details

Motivation: Although large language models excel in NLP, they reinforce societal biases, and bias against India's marginalized caste groups in particular remains underexplored. Method: Proposes the DECASTE framework, which assesses caste bias along four dimensions (socio-cultural, economic, educational, and political) using customized prompting strategies. Result: Several state-of-the-art LLMs systematically reinforce caste bias, with marginalized caste groups (such as Dalits and Shudras) treated markedly less fairly. Conclusion: The study exposes latent caste bias in language models and stresses the need for more comprehensive bias evaluation to reduce risk in real-world applications. Abstract: Recent advancements in large language models (LLMs) have revolutionized natural language processing (NLP) and expanded their applications across diverse domains. However, despite their impressive capabilities, LLMs have been shown to reflect and perpetuate harmful societal biases, including those based on ethnicity, gender, and religion. A critical and underexplored issue is the reinforcement of caste-based biases, particularly towards India's marginalized caste groups such as Dalits and Shudras. In this paper, we address this gap by proposing DECASTE, a novel, multi-dimensional framework designed to detect and assess both implicit and explicit caste biases in LLMs. Our approach evaluates caste fairness across four dimensions: socio-cultural, economic, educational, and political, using a range of customized prompting strategies. By benchmarking several state-of-the-art LLMs, we reveal that these models systematically reinforce caste biases, with significant disparities observed in the treatment of oppressed versus dominant caste groups. For example, bias scores are notably elevated when comparing Dalits and Shudras with dominant caste groups, reflecting societal prejudices that persist in model outputs. These results expose the subtle yet pervasive caste biases in LLMs and emphasize the need for more comprehensive and inclusive bias evaluation methodologies that assess the potential risks of deploying such models in real-world contexts.

[134] Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies

Haoyi Qiu,Kung-Hsiang Huang,Ruichen Zheng,Jiao Sun,Nanyun Peng

Main category: cs.CL

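TL;DR: CROSS is a benchmark for evaluating the cultural safety reasoning of large vision-language models (LVLMs), covering multilingual, multi-country data; it reveals the shortcomings of existing models in cultural safety and proposes methods to improve them.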

Details

Motivation: Existing multimodal safety benchmarks overlook the symbolic harm caused by violations of cultural norms, a gap that needs to be filled. Method: Proposes the CROSS benchmark and the CROSS-Eval framework, evaluates 21 LVLMs, and develops two improvement strategies: supervised fine-tuning and preference tuning. Result: Models perform poorly on cultural safety, and the improvement methods substantially raise GPT-4o's cultural awareness and compliance. Conclusion: Cultural safety is a significant challenge for LVLMs; the improvement methods are effective but do not fully solve the problem. Abstract: Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants, yet their ability to produce culturally appropriate responses remains underexplored. Existing multimodal safety benchmarks primarily focus on physical safety and overlook violations rooted in cultural norms, which can result in symbolic harm. To address this gap, we introduce CROSS, a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. CROSS includes 1,284 multilingual visually grounded queries from 16 countries, three everyday domains, and 14 languages, where cultural norm violations emerge only when images are interpreted in context. We propose CROSS-Eval, an intercultural theory-based framework that measures four key dimensions: cultural awareness, norm education, compliance, and helpfulness. Using this framework, we evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models. Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. While some open-source models reach GPT-4o-level performance, they still fall notably short of proprietary models. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors. These methods substantially improve GPT-4o's cultural awareness (+60.14%) and compliance (+55.2%), while preserving general multimodal capabilities with minimal performance reduction on general multimodal understanding benchmarks.

[135] CRAFT: Training-Free Cascaded Retrieval for Tabular QA

Adarsh Singh,Kushal Raj Bhandari,Jianxi Gao,Soham Dan,Vivek Gupta

Main category: cs.CL

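TL;DR: CRAFT is a cascaded retrieval method that combines sparse and dense models to improve retrieval for table question answering, and its effectiveness is validated on the NQ-Tables dataset.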

Details

Motivation: Traditional dense retrieval models are computationally expensive and adapt poorly, which makes them ill-suited to large-scale retrieval and to new domains. Method: CRAFT first uses a sparse retrieval model to shortlist candidate tables, then applies computationally heavier dense models and neural re-rankers, and also uses Gemini Flash 1.5 to generate table descriptions and titles. Result: CRAFT achieves better retrieval performance than existing sparse, dense, and hybrid retrievers. Conclusion: Through cascaded retrieval and enriched table representations, CRAFT markedly improves the efficiency and effectiveness of table question answering. Abstract: Table Question Answering (TQA) involves retrieving relevant tables from a large corpus to answer natural language queries. Traditional dense retrieval models, such as DTR and ColBERT, not only incur high computational costs for large-scale retrieval tasks but also require retraining or fine-tuning on new datasets, limiting their adaptability to evolving domains and knowledge. In this work, we propose $\textbf{CRAFT}$, a cascaded retrieval approach that first uses a sparse retrieval model to filter a subset of candidate tables before applying more computationally expensive dense models and neural re-rankers. Our approach achieves better retrieval performance than state-of-the-art (SOTA) sparse, dense, and hybrid retrievers. We further enhance table representations by generating table descriptions and titles using Gemini Flash 1.5. End-to-end TQA results using various Large Language Models (LLMs) on NQ-Tables, a subset of the Natural Questions Dataset, demonstrate $\textbf{CRAFT}$'s effectiveness.
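The cascade described above (cheap sparse filtering first, costlier dense scoring on the shortlist only) can be sketched in a few lines. This is a toy stand-in, not CRAFT itself: token overlap replaces a real sparse retriever, hand-written two-dimensional vectors replace dense embeddings, and no neural re-ranker or generated table description is included. All names and data are hypothetical.

```python
import math


def sparse_score(query: str, doc: str) -> int:
    """Cheap lexical score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cascaded_retrieve(query, query_vec, tables, k_sparse=2, k_final=1):
    """tables: list of (table_id, text, dense_vector) triples."""
    # Stage 1: keep the k_sparse best tables by the cheap lexical score.
    shortlist = sorted(
        tables, key=lambda t: sparse_score(query, t[1]), reverse=True
    )[:k_sparse]
    # Stage 2: re-rank only the shortlist with the costlier dense similarity.
    reranked = sorted(shortlist, key=lambda t: cosine(query_vec, t[2]), reverse=True)
    return [t[0] for t in reranked[:k_final]]


tables = [
    ("t1", "olympic medal counts by country", [0.9, 0.1]),
    ("t2", "country population by year", [0.2, 0.8]),
    ("t3", "film box office totals", [0.5, 0.5]),
]
result = cascaded_retrieve("medal counts by country", [1.0, 0.0], tables)
print(result)
```

The design point is that the expensive scorer runs on `k_sparse` candidates rather than the whole corpus, so its cost stays flat as the corpus grows; the trade-off is that a table dropped in stage 1 can never be recovered in stage 2.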

[136] Language Specific Knowledge: Do Models Know Better in X than in English?

Ishika Agarwal,Nimet Beyza Bozdag,Dilek Hakkani-Tür

Main category: cs.CL

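TL;DR: The paper explores the phenomenon of Language Specific Knowledge (LSK) in language models, proposes that changing the language used for reasoning can improve model performance, and designs LSKExtractor to extract and exploit this knowledge.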

Details

Motivation: To explore whether language models hold more topic-specific knowledge in some languages, and whether changing the reasoning language can improve reasoning. Method: Uses culture-specific datasets and designs LSKExtractor to benchmark and exploit language-specific knowledge in language models. Result: Experiments show that models perform better in some non-English languages (even low-resource ones), with an average relative accuracy improvement of 10%. Conclusion: The work supports the development of more culturally inclusive language models and underlines the importance of linguistic and cultural context for model performance. Abstract: Code-switching is a common phenomenon of alternating between different languages in the same utterance, thought, or conversation. We posit that humans code-switch because they feel more comfortable talking about certain topics and domains in one language than another. With the rise of knowledge-intensive language models, we ask ourselves the next, natural question: Could models hold more knowledge on some topics in some language X? More importantly, could we improve reasoning by changing the language that reasoning is performed in? We coin the term Language Specific Knowledge (LSK) to represent this phenomenon. As ethnic cultures tend to develop alongside different languages, we employ culture-specific datasets (that contain knowledge about cultural and social behavioral norms). We find that language models can perform better when using chain-of-thought reasoning in some languages other than English, sometimes even better in low-resource languages. Paired with previous works showing that semantic similarity does not equate to representational similarity, we hypothesize that culturally specific texts occur more abundantly in corresponding languages, enabling specific knowledge to occur only in specific "expert" languages. Motivated by our initial results, we design a simple methodology called LSKExtractor to benchmark the language-specific knowledge present in a language model and, then, exploit it during inference. We show our results on various models and datasets, showing an average relative improvement of 10% in accuracy. Our research contributes to the open-source development of language models that are inclusive and more aligned with the cultural and linguistic contexts in which they are deployed.

[137] Effective and Efficient Schema-aware Information Extraction Using On-Device Large Language Models

Zhihao Wen,Sheng Liang,Yaxiong Wu,Yongyue Zhang,Yong Liu

Main category: cs.CL

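TL;DR: Proposes DLISC, a two-stage information extraction method for large language models on resource-constrained devices, which improves efficiency and effectiveness through dual LoRA modules and incremental schema caching.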

Details

Motivation: Addresses the hallucinations, limited context length, and high latency that arise when deploying large language models for information extraction on resource-constrained devices. Method: Uses dual LoRA modules (an Identification LoRA and an Extraction LoRA) together with Incremental Schema Caching to improve schema identification and extraction efficiency. Result: Experiments on multiple information extraction datasets show that DLISC improves notably in both effectiveness and efficiency. Conclusion: DLISC is an efficient and effective information extraction method suited to resource-constrained environments. Abstract: Information extraction (IE) plays a crucial role in natural language processing (NLP) by converting unstructured text into structured knowledge. Deploying computationally intensive large language models (LLMs) on resource-constrained devices for information extraction is challenging, particularly due to issues like hallucinations, limited context length, and high latency, especially when handling diverse extraction schemas. To address these challenges, we propose a two-stage information extraction approach adapted for on-device LLMs, called Dual-LoRA with Incremental Schema Caching (DLISC), which enhances both schema identification and schema-aware extraction in terms of effectiveness and efficiency. In particular, DLISC adopts an Identification LoRA module for retrieving the most relevant schemas to a given query, and an Extraction LoRA module for performing information extraction based on the previously selected schemas. To accelerate extraction inference, Incremental Schema Caching is incorporated to reduce redundant computation, substantially improving efficiency. Extensive experiments across multiple information extraction datasets demonstrate notable improvements in both effectiveness and efficiency.

[138] Meta-Design Matters: A Self-Design Multi-Agent System

Zixuan Ke,Austin Xu,Yifei Ming,Xuan-Phi Nguyen,Caiming Xiong,Shafiq Joty

Main category: cs.CL

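TL;DR: SELF-MAS is a self-supervised, inference-time-only framework for automatically designing multi-agent systems (MAS); it needs no validation set, adapts dynamically to each task, and outperforms existing methods.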

Details

Motivation: Current MAS rely on manually designed roles and protocols and adapt poorly to new tasks; existing automatic methods need a validation set and lack dynamic adaptability. Method: SELF-MAS uses meta-level design to iteratively generate, evaluate, and refine MAS configurations, supporting dynamic agent composition and problem decomposition. Result: Across math, QA, and software engineering tasks, SELF-MAS improves average accuracy by 7.44% while remaining cost-efficient. Conclusion: Meta-level self-supervised design offers a new direction for building effective, adaptive MAS. Abstract: Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference. We introduce SELF-MAS, the first self-supervised, inference-time only framework for automatic MAS design. SELF-MAS employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic agent composition and problem decomposition through meta-feedback on solvability and completeness. Experiments across math, graduate-level QA, and software engineering benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that SELF-MAS outperforms both manual and automatic MAS baselines, achieving a 7.44% average accuracy improvement over the next strongest baseline while maintaining cost-efficiency. These findings underscore the promise of meta-level self-supervised design for creating effective and adaptive MAS.

[139] Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems

Chengwei Wei,Bin Wang,Jung-jae Kim,Nancy F. Chen

Main category: cs.CL

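TL;DR: The paper introduces the Spoken-MQA benchmark for evaluating speech-based models on mathematical reasoning and finds that current models perform poorly on direct arithmetic and verbalized mathematical expressions.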

Details

Motivation: To explore how well large language models (LLMs) and multimodal LLMs (MLLMs) handle mathematical reasoning from spoken input, filling a gap in existing work on step-by-step logical reasoning. Method: Proposes the Spoken-MQA benchmark, which covers several types of math problems (arithmetic, contextual reasoning, and knowledge-oriented problems), and evaluates speech-based models experimentally. Result: Speech LLMs do reasonably well on contextual reasoning tasks but poorly on direct arithmetic and verbalized mathematical expressions, and their mathematical knowledge reasoning is significantly degraded. Conclusion: Current speech LLMs are limited in mathematical reasoning, especially with verbalized expressions and direct arithmetic, and need further improvement. Abstract: Recent advances in large language models (LLMs) and multimodal LLMs (MLLMs) have led to strong reasoning ability across a wide range of tasks. However, their ability to perform mathematical reasoning from spoken input remains underexplored. Prior studies on speech modality have mostly focused on factual speech understanding or simple audio reasoning tasks, providing limited insight into logical step-by-step reasoning, such as that required for mathematical problem solving. To address this gap, we introduce Spoken Math Question Answering (Spoken-MQA), a new benchmark designed to evaluate the mathematical reasoning capabilities of speech-based models, including both cascade models (ASR + LLMs) and end-to-end speech LLMs. Spoken-MQA covers a diverse set of math problems, including pure arithmetic, single-step and multi-step contextual reasoning, and knowledge-oriented reasoning problems, all presented in unambiguous natural spoken language. Through extensive experiments, we find that: (1) while some speech LLMs perform competitively on contextual reasoning tasks involving basic arithmetic, they still struggle with direct arithmetic problems; (2) current LLMs exhibit a strong bias toward symbolic mathematical expressions written in LaTeX and have difficulty interpreting verbalized mathematical expressions; and (3) mathematical knowledge reasoning abilities are significantly degraded in current speech LLMs.

[140] Diagnosing our datasets: How does my language model learn clinical information?

Furong Jia,David Sontag,Monica Agrawal

Main category: cs.CL

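TL;DR: The study examines how open-source large language models learn clinical information from large-scale corpora, focusing on their understanding of clinical jargon and their responses to unsupported medical claims.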

Details

Motivation: Although large language models are not trained directly on electronic health record data, they perform well on clinical natural language processing tasks; the study aims to reveal the gap between how they learn and what real-world use requires. Method: Evaluates models' understanding of clinical jargon with the MedLingo dataset, and analyzes the frequency and sources of clinical jargon and unsupported claims in pretraining corpora. Result: The frequency of clinical jargon in pretraining corpora correlates with model performance, but jargon that is common in real clinical notes is rare in those corpora; models may repeat unsupported medical claims. Conclusion: The study reveals a mismatch between pretraining data and real clinical needs and suggests directions for future dataset construction. Abstract: Large language models (LLMs) have performed well across various clinical natural language processing tasks, despite not being directly trained on electronic health record (EHR) data. In this work, we examine how popular open-source LLMs learn clinical information from large mined corpora through two crucial but understudied lenses: (1) their interpretation of clinical jargon, a foundational ability for understanding real-world clinical notes, and (2) their responses to unsupported medical claims. For both use cases, we investigate the frequency of relevant clinical information in their corresponding pretraining corpora, the relationship between pretraining data composition and model outputs, and the sources underlying this data. To isolate clinical jargon understanding, we evaluate LLMs on a new dataset MedLingo. Unsurprisingly, we find that the frequency of clinical jargon mentions across major pretraining corpora correlates with model performance. However, jargon frequently appearing in clinical notes often rarely appears in pretraining corpora, revealing a mismatch between available data and real-world usage. Similarly, we find that a non-negligible portion of documents support disputed claims that can then be parroted by models. Finally, we classified and analyzed the types of online sources in which clinical jargon and unsupported medical claims appear, with implications for future dataset composition.

[141] Are the confidence scores of reviewers consistent with the review content? Evidence from top conference proceedings in AI

Wenqing Wu,Haixu Xi,Chengzhi Zhang

Main category: cs.CL

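TL;DR: The study uses deep learning to assess the consistency between review text and scores in AI conference peer review and finds that higher confidence scores correlate with paper rejection.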

Details

Motivation: Existing studies lack fine-grained analysis of the consistency between review text and scores and may miss key details. Method: Uses deep learning and NLP techniques to analyze review text at the word, sentence, and aspect levels, detecting hedged expressions and aspects, and evaluates consistency with statistical methods. Result: Text and scores are highly consistent, and regression analysis shows that higher confidence scores correlate with paper rejection. Conclusion: The study validates the reliability of expert assessments and the fairness of peer review. Abstract: Peer review is vital in academia for evaluating research quality. Top AI conferences use reviewer confidence scores to ensure review reliability, but existing studies lack fine-grained analysis of text-score consistency, potentially missing key details. This work assesses consistency at word, sentence, and aspect levels using deep learning and NLP conference review data. We employ deep learning to detect hedge sentences and aspects, then analyze report length, hedge word/sentence frequency, aspect mentions, and sentiment to evaluate text-score alignment. Correlation, significance, and regression tests examine confidence scores' impact on paper outcomes. Results show high text-score consistency across all levels, with regression revealing higher confidence scores correlate with paper rejection, validating expert assessments and peer review fairness.

[142] Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Haiyan Zhao,Xuansheng Wu,Fan Yang,Bo Shen,Ninghao Liu,Mengnan Du

Main category: cs.CL

TL;DR: Proposes a sparse-autoencoder-based denoising method (SDCV) that improves the robustness of linear concept vectors in large language models.

Details

Motivation: Existing methods (such as linear probing and difference-in-means) are easily disturbed by noise in diverse data, which hurts the robustness of concept vectors. Method: Uses sparse autoencoders to filter noisy features out of hidden representations, producing denoised concept vectors (SDCV). Result: SDCV markedly improves the steering success rates of linear probing and difference-in-means, and the noise hypothesis is validated through counterfactual experiments and feature visualizations. Conclusion: SDCV is an effective denoising method that improves the robustness and steering performance of concept vectors on complex data. Abstract: Linear Concept Vectors have proven effective for steering large language models (LLMs). While existing approaches like linear probing and difference-in-means derive these vectors from LLM hidden representations, diverse data introduces noise (i.e., irrelevant features) that challenges steering robustness. To address this, we propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which uses Sparse Autoencoders to filter out noisy features from hidden representations. When applied to linear probing and difference-in-means, our method improves their steering success rates. We validate our noise hypothesis through counterfactual experiments and feature visualizations.

[143] Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Siyue Zhang,Yilun Zhao,Liyuan Geng,Arman Cohan,Anh Tuan Luu,Chen Zhao

Main category: cs.CL

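TL;DR: Proposes a text embedding approach built on diffusion language models, addressing the mismatch between the unidirectional attention of LLM embeddings and the bidirectional nature of text embedding tasks; it outperforms LLM-based embeddings on several tasks.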

Details

Motivation: LLM embedding models are limited by the mismatch between the unidirectional attention used in pre-training and the bidirectional nature of text embedding tasks. Diffusion language models, with their bidirectional architecture and strong results on reasoning tasks, offer a direction for improvement. Method: Builds a text embedding model on a diffusion language model, using its bidirectional attention to better encode global context. Result: Outperforms LLM-based embedding models by 20%, 8%, and 2% on long-document retrieval, reasoning-intensive retrieval, and instruction-following retrieval, respectively, and is competitive on traditional text embedding benchmarks. Conclusion: Thanks to bidirectional attention, diffusion language models perform well on text embedding tasks, with a particular advantage in encoding global context in long and complex text. Abstract: Large language model (LLM)-based embedding models, benefiting from large scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
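
The contrast between unidirectional and bidirectional attention can be made concrete with a small sketch of the two attention masks. This is a generic illustration, not code from the paper; the function names are hypothetical.

```python
def attention_mask(seq_len, causal):
    """Boolean mask where mask[i][j] is True if position i may attend to position j.

    causal=True  -> autoregressive (unidirectional): only j <= i is visible.
    causal=False -> bidirectional: every position sees the whole sequence.
    """
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


def visible_counts(mask):
    """Number of positions each token can attend to."""
    return [sum(row) for row in mask]


causal_counts = visible_counts(attention_mask(4, causal=True))
bidirectional_counts = visible_counts(attention_mask(4, causal=False))
```

Under the causal mask the first token sees only itself, so its representation cannot reflect later context; under the bidirectional mask every token sees all four positions.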

[144] ChartCards: A Chart-Metadata Generation Framework for Multi-Task Chart Understanding

Yifan Wu,Lutao Yan,Leixian Shen,Yinan Mei,Jiannan Wang,Yuyu Luo

Main category: cs.CL

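TL;DR: ChartCards is a unified chart-metadata generation framework that structures chart information to support multi-task chart understanding; the authors also build the high-quality MetaChart dataset, which significantly improves model performance.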

Details

Motivation: Multimodal large language models (MLLMs) need large amounts of high-quality data for fine-tuning on chart understanding tasks, which makes data collection and training costly. Method: Proposes the ChartCards framework, which systematically synthesizes chart information (data tables, visualization code, visual elements, and multi-dimensional semantic captions), and builds the MetaChart dataset. Result: Fine-tuning six models on MetaChart improves average performance by 5%, with the largest gains on text-to-chart retrieval (17%) and chart-to-table conversion (28%). Conclusion: ChartCards and MetaChart provide an efficient solution for multi-task chart understanding, significantly reducing data requirements and improving model performance. Abstract: The emergence of Multi-modal Large Language Models (MLLMs) presents new opportunities for chart understanding. However, due to the fine-grained nature of these tasks, applying MLLMs typically requires large, high-quality datasets for task-specific fine-tuning, leading to high data collection and training costs. To address this, we propose ChartCards, a unified chart-metadata generation framework for multi-task chart understanding. ChartCards systematically synthesizes various chart information, including data tables, visualization code, visual elements, and multi-dimensional semantic captions. By structuring this information into organized metadata, ChartCards enables a single chart to support multiple downstream tasks, such as text-to-chart retrieval, chart summarization, chart-to-table conversion, chart description, and chart question answering. Using ChartCards, we further construct MetaChart, a large-scale high-quality dataset containing 10,862 data tables, 85K charts, and 170K high-quality chart captions. We validate the dataset through qualitative crowdsourcing evaluations and quantitative fine-tuning experiments across various chart understanding tasks. Fine-tuning six different models on MetaChart resulted in an average performance improvement of 5% across all tasks. The most notable improvements are seen in text-to-chart retrieval and chart-to-table tasks, with Long-CLIP and Llama 3.2-11B achieving improvements of 17% and 28%, respectively.

[145] Improving the fact-checking performance of language models by relying on their entailment ability

Gaurav Kumar,Debajyoti Mazumder,Ayush Garg,Jasabanta Patro

Main category: cs.CL

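TL;DR: Proposes a simple and effective approach to automated fact-checking that relies on the generative and entailment abilities of language models to produce supporting or refuting justifications, significantly improving performance.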

Details

Motivation: Current fact-checking methods rely mainly on the embedded knowledge of language models or on fine-tuning with evidence; the former is prone to hallucination and the latter has not worked well, so a more effective method is needed. Method: Uses the generative and entailment abilities of language models to produce supporting or refuting justifications, and systematically compares different prompting and fine-tuning strategies. Result: Training on evidence sentences, prompted claim-evidence understanding, and entailed justifications improves performance by up to 8.20%, 16.39%, and 44.26%, respectively. Conclusion: The method clearly outperforms the baselines and offers a new direction for automated fact-checking; the code is released to reproduce the results. Abstract: Automated fact-checking is a crucial task in this digital age. To verify a claim, current approaches mainly follow one of two strategies, i.e., (i) relying on embedded knowledge of language models, and (ii) fine-tuning them with evidence pieces. While the former can make systems hallucinate, the latter has not been very successful to date. The primary reason behind this is that fact verification is a complex process. Language models have to parse through multiple pieces of evidence before making a prediction. Further, the evidence pieces often contradict each other. This makes the reasoning process even more complex. We propose a simple yet effective approach that relies on entailment and the generative ability of language models to produce "supporting" and "refuting" justifications (for the truthfulness of a claim). We trained language models based on these justifications and achieved superior results. Apart from that, we did a systematic comparison of different prompting and fine-tuning strategies, as it is currently lacking in the literature. Some of our observations are: (i) training language models with raw evidence sentences registered an improvement of up to 8.20% in macro-F1 over the best performing baseline for the RAW-FC dataset, (ii) similarly, training language models with prompted claim-evidence understanding (TBE-2) registered an improvement (with a margin of up to 16.39%) over the baselines for the same dataset, (iii) training language models with entailed justifications (TBE-3) outperformed the baselines by a huge margin (up to 28.57% and 44.26% for LIAR-RAW and RAW-FC, respectively). We have shared our code repository to reproduce the results.

[146] MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation

Feiyang Cai,Jiahui Bai,Tao Tang,Joshua Luo,Tianyu Zhu,Ling Liu,Feng Luo

Main category: cs.CL

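TL;DR: MolLangBench is a benchmark for molecule-language interface tasks, covering molecular structure recognition, editing, and generation. Current state-of-the-art models perform poorly on these tasks, highlighting the shortcomings of AI in chemical applications.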

Details

Motivation: To give chemists and AI systems precise molecule recognition, editing, and generation capabilities and to advance research on chemical tasks. Method: Recognition tasks are built with automated tools, editing and generation tasks are annotated and validated by experts, and multiple molecular representations are supported. Result: The strongest model reaches about 79% accuracy on recognition and editing but only 29% on generation, showing the limits of current AI systems. Conclusion: MolLangBench is expected to drive research toward more effective and reliable AI systems for chemistry. Abstract: Precise recognition, editing, and generation of molecules are essential prerequisites for both chemists and AI systems tackling various chemical tasks. We present MolLangBench, a comprehensive benchmark designed to evaluate fundamental molecule-language interface tasks: language-prompted molecular structure recognition, editing, and generation. To ensure high-quality, unambiguous, and deterministic outputs, we construct the recognition tasks using automated cheminformatics tools, and curate editing and generation tasks through rigorous expert annotation and validation. MolLangBench supports the evaluation of models that interface language with different molecular representations, including linear strings, molecular images, and molecular graphs. Evaluations of state-of-the-art models reveal significant limitations: the strongest model (o3) achieves 79.2% and 78.5% accuracy on recognition and editing tasks, which are intuitively simple for humans, and performs even worse on the generation task, reaching only 29.0% accuracy. These results highlight the shortcomings of current AI systems in handling even preliminary molecular recognition and manipulation tasks. We hope MolLangBench will catalyze further research toward more effective and reliable AI systems for chemical applications.

[147] Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

Hongli Zhou,Hui Huang,Ziqing Zhao,Lvyuan Han,Huicheng Wang,Kehai Chen,Muyun Yang,Wei Bao,Jian Dong,Bing Xu,Conghui Zhu,Hailong Cao,Tiejun Zhao

Main category: cs.CL

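TL;DR: Proposes PSN-IRT, a new framework for assessing the abilities of large language models (LLMs) more accurately, and exposes shortcomings of current benchmarks.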

Details

Motivation: Existing benchmarks are inconsistent and separate top models poorly when evaluating LLMs, so they do not faithfully reflect model capabilities. Method: Proposes the PSN-IRT framework, an enhancement of Item Response Theory (IRT) that incorporates a rich set of item parameters. Result: The analysis reveals significant flaws in current benchmarks; PSN-IRT can be used to build smaller benchmarks that align better with human preference. Conclusion: PSN-IRT offers a more reliable way to evaluate LLMs and improves benchmark quality. Abstract: The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent mainstream LLM benchmarks using results from diverse models. We first propose a new framework for accurate and reliable estimations of item characteristics and model abilities. Specifically, we propose Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. Based on PSN-IRT, we conduct extensive analysis which reveals significant and varied shortcomings in the measurement quality of current benchmarks. Furthermore, we demonstrate that PSN-IRT can be used to construct smaller benchmarks while maintaining stronger alignment with human preference.
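
For readers unfamiliar with Item Response Theory, the sketch below shows the standard four-parameter logistic (4PL) item response function, in which a model's ability `theta` and an item's parameters determine the probability of a correct answer. Which parameters PSN-IRT actually estimates, and how its network computes them, is not stated in the abstract; the 4PL form here is a textbook assumption used only for illustration.

```python
import math


def irt_probability(theta, a=1.0, b=0.0, c=0.0, d=1.0):
    """Standard 4PL item response function.

    theta: ability of the test taker (here, an LLM)
    a: discrimination (how sharply the item separates abilities)
    b: difficulty (ability at which the logistic term equals 0.5)
    c: guessing floor (lower asymptote)
    d: feasibility ceiling (upper asymptote)
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (d - c) * logistic


# A four-option multiple-choice item has a natural guessing floor of 0.25.
p_weak = irt_probability(theta=-2.0, a=1.5, b=0.0, c=0.25)
p_strong = irt_probability(theta=2.0, a=1.5, b=0.0, c=0.25)
```

An item with low discrimination `a` yields nearly the same probability for weak and strong models, which is one way poor separability among top models can arise.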

[148] Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning

Jiashu He,Jinxuan Fan,Bowen Jiang,Ignacio Houine,Dan Roth,Alejandro Ribeiro

Main category: cs.CL

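TL;DR: Self-GIVE is a retrieve-and-reinforcement-learning framework that strengthens large language models (LLMs) on scientific question answering through automatic associative thinking, addressing the efficiency, generality, and accuracy limitations of GIVE.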

Details

Motivation: To address GIVE's high computational cost during knowledge extrapolation, its difficulty of deployment on small LLMs, and its inaccurate knowledge pruning. Method: Proposes the Self-GIVE framework, which combines retrieval with reinforcement learning and automatically extracts structured information and entity sets to help the model link to the queried concepts. Result: On biomedical QA tasks, Self-GIVE substantially improves 3B and 7B models (by up to 28.5% to 71.4% and 78.6% to 90.5%, respectively) and cuts token usage by more than 90%. Conclusion: Through structured retrieval and associative thinking, Self-GIVE enables efficient, scalable knowledge reasoning and lets small LLMs match GPT3.5 turbo. Abstract: When addressing complex questions that require new information, people often associate the question with existing knowledge to derive a sensible answer. For instance, when evaluating whether melatonin aids insomnia, one might associate "hormones helping mental disorders" with "melatonin being a hormone and insomnia a mental disorder" to complete the reasoning. Large Language Models (LLMs) also require such associative thinking, particularly in resolving scientific inquiries when retrieved knowledge is insufficient and does not directly answer the question. Graph Inspired Veracity Extrapolation (GIVE) addresses this by using a knowledge graph (KG) to extrapolate structured knowledge. However, it involves the construction and pruning of many hypothetical triplets, which limits efficiency and generalizability. We propose Self-GIVE, a retrieve-RL framework that enhances LLMs with automatic associative thinking through reinforcement learning. Self-GIVE extracts structured information and entity sets to assist the model in linking to the queried concepts. We address GIVE's key limitations: (1) extensive LLM calls and token overhead for knowledge extrapolation, (2) difficulty in deploying on smaller LLMs (3B or 7B) due to complex instructions, and (3) inaccurate knowledge from LLM pruning. Specifically, after fine-tuning using Self-GIVE with a 135-node UMLS KG, it improves the performance of the Qwen2.5 3B and 7B models by up to 28.5% to 71.4% and 78.6% to 90.5%, respectively, on unseen samples in challenging biomedical QA tasks. In particular, Self-GIVE allows the 7B model to match or outperform GPT3.5 turbo with GIVE, while cutting token usage by over 90%. Self-GIVE enhances the scalable integration of structured retrieval and reasoning with associative thinking.

[149] UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking

Sarfraz Ahmad,Hasan Iqbal,Momina Ahsan,Numaan Naeem,Muhammad Ahsan Riaz Khan,Arham Riaz,Muhammad Arslan Manzoor,Yuxia Wang,Preslav Nakov

Main category: cs.CL

TL;DR: Introduces UrduFactCheck, the first modular fact-checking framework for Urdu, filling a gap in fact-checking for low-resource languages.

Details

Motivation: The factual reliability of large language models (LLMs) in low-resource languages such as Urdu urgently needs attention, while existing solutions mainly target English. Method: Develops a dynamic, multi-strategy evidence retrieval pipeline that combines monolingual and translation-based approaches, and releases two new annotated benchmarks, UrduFactBench and UrduFactQA. Result: UrduFactCheck outperforms baselines on multiple metrics, with the translation-augmented variants performing best; twelve SOTA LLMs are also evaluated on Urdu. Conclusion: UrduFactCheck is an effective tool for Urdu fact-checking, and its open-source code and datasets support further research. Abstract: The rapid adoption of large language models (LLMs) has raised critical concerns regarding the factual reliability of their outputs, especially in low-resource languages such as Urdu. Existing automated fact-checking solutions overwhelmingly focus on English, leaving a significant gap for the 200+ million Urdu speakers worldwide. In this work, we introduce UrduFactCheck, the first comprehensive, modular fact-checking framework specifically tailored for Urdu. Our system features a dynamic, multi-strategy evidence retrieval pipeline that combines monolingual and translation-based approaches to address the scarcity of high-quality Urdu evidence. We curate and release two new hand-annotated benchmarks: UrduFactBench for claim verification and UrduFactQA for evaluating LLM factuality. Extensive experiments demonstrate that UrduFactCheck, particularly its translation-augmented variants, consistently outperforms baselines and open-source alternatives on multiple metrics. We further benchmark twelve state-of-the-art (SOTA) LLMs on factual question answering in Urdu, highlighting persistent gaps between proprietary and open-source models. UrduFactCheck's code and datasets are open-sourced and publicly available at https://github.com/mbzuai-nlp/UrduFactCheck.

[150] The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support

Suhas BN,Yash Mahajan,Dominik Mattioli,Andrew M. Sherrill,Rosa I. Arriaga,Chris W. Wiese,Saeed Abdullah

Main category: cs.CL

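TL;DR: Can small language models (0.5B-5B parameters) engage effectively in trauma-informed, empathetic dialogue with PTSD patients? Using the TIDE dataset and experiments, the study shows that fine-tuning improves empathy, but the effect varies by scenario and user.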

Details

Motivation: To explore the potential of small language models in trauma-informed dialogue and lay a foundation for efficient, safe empathetic AI. Method: Introduces the TIDE dataset (10,000 dialogues), evaluates eight small models against a three-factor empathy model, and compares them with a frontier model. Result: Fine-tuning improves perceived empathy, but small models hit an empathy ceiling, and user background affects preferences. Conclusion: Systems should be designed around context and user needs; the TIDE dataset lays the groundwork for AI that supplements clinical care. Abstract: Can small language models with 0.5B to 5B parameters meaningfully engage in trauma-informed, empathetic dialogue for individuals with PTSD? We address this question by introducing TIDE, a dataset of 10,000 two-turn dialogues spanning 500 diverse PTSD client personas and grounded in a three-factor empathy model: emotion recognition, distress normalization, and supportive reflection. All scenarios and reference responses were reviewed for realism and trauma sensitivity by a clinical psychologist specializing in PTSD. We evaluate eight small language models before and after fine-tuning, comparing their outputs to a frontier model (Claude Sonnet 3.5). Our IRB-approved human evaluation and automatic metrics show that fine-tuning generally improves perceived empathy, but gains are highly scenario- and user-dependent, with smaller models facing an empathy ceiling. Demographic analysis shows older adults value distress validation and graduate-educated users prefer nuanced replies, while gender effects are minimal. We highlight the limitations of automatic metrics and the need for context- and user-aware system design. Our findings, along with the planned release of TIDE, provide a foundation for building safe, resource-efficient, and ethically sound empathetic AI to supplement, not replace, clinical mental health care.

[151] In-Domain African Languages Translation Using LLMs and Multi-armed Bandits

Pratik Rakesh Singh,Kritarth Prasad,Mohammadi Zaki,Pankaj Wasnik

Main category: cs.CL

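TL;DR: Studies how multi-armed-bandit algorithms can select the best neural machine translation model for low-resource languages, as a way to handle domain adaptation.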

Details

Motivation: Neural machine translation systems for low-resource languages suffer from scarce data and poor generalization in domain adaptation, so an efficient way to select the best model is needed. Method: Uses multi-armed-bandit algorithms (such as Upper Confidence Bound and Linear UCB) for model selection, achieving high-confidence selection of the best model under resource constraints. Result: Experiments on three African languages and several domains confirm the robustness and effectiveness of the method, whether or not target data is available. Conclusion: The method provides an efficient model selection strategy for domain adaptation in low-resource languages. Abstract: Neural Machine Translation (NMT) systems face significant challenges when working with low-resource languages, particularly in domain adaptation tasks. These difficulties arise due to limited training data and suboptimal model generalization. As a result, selecting an optimal model for translation is crucial for achieving strong performance on in-domain data, particularly in scenarios where fine-tuning is not feasible or practical. In this paper, we investigate strategies for selecting the most suitable NMT model for a given domain using bandit-based algorithms, including Upper Confidence Bound, Linear UCB, Neural Linear Bandit, and Thompson Sampling. Our method effectively addresses the resource constraints by facilitating optimal model selection with high confidence. We evaluate the approach across three African languages and domains, demonstrating its robustness and effectiveness in both scenarios where target data is available and where it is absent.
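
Bandit-based model selection can be sketched with the classic UCB1 rule, treating each candidate NMT model as an arm and a per-sentence quality score as the reward. This is a generic illustration and not the authors' setup: the quality values, the deterministic reward function, and the exploration constant `c=2.0` are assumptions made to keep the example self-contained.

```python
import math


def ucb1_select(counts, sums, t, c=2.0):
    """Return the arm with the highest upper confidence bound.

    Arms that have never been tried are selected first; otherwise the score
    is the empirical mean reward plus an exploration bonus that shrinks as
    an arm is pulled more often.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [sums[i] / counts[i] + math.sqrt(c * math.log(t) / counts[i])
              for i in range(len(counts))]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(reward_fn, n_arms, rounds):
    """Run UCB1 for a fixed number of rounds and return pull counts per arm."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, sums, t)
        counts[arm] += 1
        sums[arm] += reward_fn(arm)
    return counts


# Hypothetical in-domain quality of three candidate translation models.
quality = [0.30, 0.55, 0.45]
counts = run_bandit(lambda arm: quality[arm], n_arms=3, rounds=200)
best_model = counts.index(max(counts))
```

In practice the reward would be a noisy translation-quality estimate for each sentence, and contextual variants such as Linear UCB would additionally condition on features of the input.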

[152] Can Large Language Models Understand Internet Buzzwords Through User-Generated Content

Chen Huang,Junkai Luo,Xinzuo Wang,Wenqiang Lei,Jiancheng Lv

Main category: cs.CL

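TL;DR: Studies whether large language models (LLMs) can generate accurate definitions of internet buzzwords from user-generated content (UGC) on Chinese social media, and introduces the CHEER dataset and the RESS method.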

Details

Motivation: The large volume of UGC on Chinese social media makes it possible to study internet buzzwords; the paper explores whether LLMs can generate accurate definitions from UGC. Method: Proposes RESS, which steers the comprehension process of LLMs to produce more accurate definitions, and creates the CHEER dataset for benchmarking. Result: RESS is effective at definition generation, but the benchmark exposes shared weaknesses of LLMs, such as over-reliance on prior knowledge and weak inferential ability. Conclusion: The work lays the groundwork for research on LLM-based definition generation and releases its dataset and code. Abstract: The massive user-generated content (UGC) available in Chinese social media is giving rise to the possibility of studying internet buzzwords. In this paper, we study whether large language models (LLMs) can generate accurate definitions for these buzzwords based on UGC as examples. Our work makes a threefold contribution. First, we introduce CHEER, the first dataset of Chinese internet buzzwords, each annotated with a definition and relevant UGC. Second, we propose a novel method, called RESS, to effectively steer the comprehending process of LLMs to produce more accurate buzzword definitions, mirroring the skills of human language learning. Third, with CHEER, we benchmark the strengths and weaknesses of various off-the-shelf definition generation methods and our RESS. Our benchmark demonstrates the effectiveness of RESS while revealing crucial shared challenges: over-reliance on prior exposure, underdeveloped inferential abilities, and difficulty identifying high-quality UGC to facilitate comprehension. We believe our work lays the groundwork for future advancements in LLM-based definition generation. Our dataset and code are available at https://github.com/SCUNLP/Buzzword.

[153] DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data

Yuhang Zhou,Jing Zhu,Shengyi Qian,Zhuokai Zhao,Xiyao Wang,Xiaoyu Liu,Ming Li,Paiheng Xu,Wei Ai,Furong Huang

Main category: cs.CL

TL;DR: DISCO extends GRPO with domain-aware and difficulty-aware reward scaling to address generalization and fairness problems on imbalanced multi-domain data.

Details

Motivation: GRPO performs poorly on imbalanced multi-domain data: it tends to optimize for dominant domains and neglect underrepresented ones, which hurts generalization and fairness. Method: DISCO introduces domain-aware reward scaling, which reweights optimization by domain prevalence, and difficulty-aware reward scaling, which prioritizes uncertain prompts. Result: DISCO performs well across multiple LLMs and skewed training distributions, outperforms existing GRPO variants by 5%, and sets a new SOTA on multi-domain alignment benchmarks. Conclusion: Through its reward scaling strategies, DISCO significantly improves generalization and fairness on imbalanced multi-domain data. Abstract: Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups - assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks.
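
The two scaling ideas can be sketched on top of a group-relative advantage, in which rewards are standardized within the group of responses sampled for one prompt. The abstract does not give DISCO's formulas, so the inverse-frequency domain weight and the self-consistency-based difficulty weight below are hypothetical stand-ins chosen only to show the shape of the computation.

```python
from collections import Counter
import statistics


def group_relative_advantages(rewards):
    """Standardize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def domain_weight(domain, domain_counts):
    """Hypothetical inverse-frequency weight: rarer domains get larger weights."""
    total = sum(domain_counts.values())
    return total / (len(domain_counts) * domain_counts[domain])


def difficulty_weight(answers):
    """Hypothetical weight from self-consistency.

    Self-consistency is taken as the share of sampled answers that agree with
    the majority answer; less consistent (more uncertain) prompts weigh more.
    """
    majority = Counter(answers).most_common(1)[0][1]
    return 2.0 - majority / len(answers)


def scaled_advantages(rewards, answers, domain, domain_counts):
    """Scale the group-relative advantages by both weights."""
    w = domain_weight(domain, domain_counts) * difficulty_weight(answers)
    return [w * a for a in group_relative_advantages(rewards)]


# Toy data: a skewed training mix and one prompt from the rare domain.
domain_counts = {"math": 900, "law": 100}
rewards = [1.0, 1.0, 0.0, 0.0]
answers = ["A", "A", "B", "C"]
advantages = scaled_advantages(rewards, answers, "law", domain_counts)
```

With these toy numbers a prompt from the rare domain receives a tenfold larger domain weight than one from the dominant domain, which is the frequency-bias correction the abstract describes in words.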

[154] Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Hao Wang,Pinzhi Huang,Jihan Yang,Saining Xie,Daisuke Kawahara

Main category: cs.CL

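TL;DR: Introduces two new benchmarks, KnowRecall and VisRecall, for evaluating the cross-lingual consistency of multimodal large language models (MLLMs), and finds that existing models still fall short at integrating knowledge across languages and cultures.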

Details

Motivation: Multimodal large language models perform strongly in real-world applications, but consistency across languages, especially when integrating cultural knowledge, remains a challenge. Method: Introduces KnowRecall (a visual question answering benchmark in 15 languages) and VisRecall (a visual memory benchmark in 9 languages) to evaluate the cross-lingual consistency of MLLMs. Result: Experiments show that even state-of-the-art models struggle to achieve cross-lingual consistency. Conclusion: More robust approaches are needed to build models that are truly multilingual and culturally aware. Abstract: The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.

[155] HopWeaver: Synthesizing Authentic Multi-Hop Questions Across Text Corpora

Zhiyu Shen,Jiyuan Liu,Yunhe Pang,Yanghui Rao

Main category: cs.CL

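TL;DR: HopWeaver is a framework that automatically generates multi-hop questions (MHQA) without human intervention, synthesizing high-quality questions from unstructured text at a lower cost than manual annotation.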

Details

Motivation: Manually annotating multi-hop questions is expensive, and existing synthesis methods produce questions that are too simple or require extensive manual guidance, so an automated, high-quality synthesis method is needed. Method: HopWeaver identifies complementary information across documents, synthesizes two types of multi-hop questions (bridge and comparison), and constructs authentic reasoning paths. Result: The synthesized questions match or exceed the quality of human-annotated datasets at a lower cost. Conclusion: HopWeaver offers an efficient way to build multi-hop question datasets in specialized domains with scarce resources; the code is open source. Abstract: Multi-Hop Question Answering (MHQA) is crucial for evaluating the model's capability to integrate information from diverse sources. However, creating extensive and high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first automatic framework synthesizing authentic multi-hop questions from unstructured text corpora without human intervention. HopWeaver synthesizes two types of multi-hop questions (bridge and comparison) using an innovative approach that identifies complementary documents across corpora. Its coherent pipeline constructs authentic reasoning paths that integrate information across multiple documents, ensuring synthesized questions necessitate authentic multi-hop reasoning. We further present a comprehensive system for evaluating synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions achieve comparable or superior quality to human-annotated datasets at a lower cost. Our approach is valuable for developing MHQA datasets in specialized domains with scarce annotated resources. The code for HopWeaver is publicly available.

[156] DeFT-X: Denoised Sparse Fine-Tuning for Zero-Shot Cross-Lingual Transfer

Sona Elza Simon,Preethi Jyothi

Main category: cs.CL

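TL;DR: DeFT-X is a novel composable sparse fine-tuning method that denoises the weight matrices of a pretrained model with singular value decomposition, making cross-lingual transfer more robust and performing well on tasks in extremely low-resource languages.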

Details

Motivation: To enable effective knowledge transfer from high-resource to low-resource languages in cross-lingual transfer and to make sparse fine-tuning more robust. Method: Proposes DeFT-X, which denoises the pretrained model's weight matrices with singular value decomposition before magnitude-based pruning, yielding more robust sparse fine-tuned vectors. Result: On sentiment classification and natural language inference in extremely low-resource languages, DeFT-X matches or outperforms existing sparse fine-tuning and other baselines. Conclusion: By combining denoising with sparse fine-tuning, DeFT-X clearly improves cross-lingual transfer and suits low-resource language tasks. Abstract: Effective cross-lingual transfer remains a critical challenge in scaling the benefits of large language models from high-resource to low-resource languages. Towards this goal, prior studies have explored many approaches to combine task knowledge from task-specific data in a (high-resource) source language and language knowledge from unlabeled text in a (low-resource) target language. One notable approach proposed composable sparse fine-tuning (SFT) for cross-lingual transfer that learns task-specific and language-specific sparse masks to select a subset of the pretrained model's parameters that are further fine-tuned. These sparse fine-tuned vectors (SFTs) are subsequently composed with the pretrained model to facilitate zero-shot cross-lingual transfer to a task in a target language, using only task-specific data from a source language. These sparse masks for SFTs were identified using a simple magnitude-based pruning. In our work, we introduce DeFT-X, a novel composable SFT approach that denoises the weight matrices of a pretrained model before magnitude pruning using singular value decomposition, thus yielding more robust SFTs. We evaluate DeFT-X on a diverse set of extremely low-resource languages for sentiment classification (NusaX) and natural language inference (AmericasNLI) and demonstrate that it performs on par with or outperforms SFT and other prominent cross-lingual transfer baselines.
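
The magnitude-based masking and composition steps that the abstract attributes to prior composable SFT work can be sketched as follows. The flat parameter lists, the top-k selection on parameter changes, and additive composition are simplifying assumptions for illustration; the SVD denoising that DeFT-X adds before pruning is deliberately left out because the abstract does not give enough detail to reproduce it.

```python
def top_k_mask(delta, k):
    """Boolean mask keeping the k entries of delta with the largest magnitude."""
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(delta))]


def sparse_vector(delta, k):
    """Zero out every parameter change outside the top-k magnitude mask."""
    mask = top_k_mask(delta, k)
    return [d if m else 0.0 for d, m in zip(delta, mask)]


def compose(pretrained, *sparse_vectors):
    """Assumed composition rule: add each sparse vector to the pretrained weights."""
    out = list(pretrained)
    for vec in sparse_vectors:
        out = [o + v for o, v in zip(out, vec)]
    return out


# Toy flattened parameters (hypothetical values).
pretrained = [1.0, 1.0, 1.0, 1.0]
task_delta = [0.5, -0.01, 0.02, -0.25]  # change after source-language task tuning
lang_delta = [0.0, 0.75, -0.03, 0.01]   # change after target-language tuning
task_sft = sparse_vector(task_delta, k=2)
lang_sft = sparse_vector(lang_delta, k=1)
composed = compose(pretrained, task_sft, lang_sft)
```

Small changes such as `-0.01` are discarded by the mask, so only the largest task and language updates reach the composed model.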

[157] SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models

Jing Yu,Yuqi Tang,Kehua Feng,Mingyang Rao,Lei Liang,Zhiqiang Zhang,Mengshu Sun,Wen Zhang,Qiang Zhang,Keyan Ding,Huajun Chen

Main category: cs.CL

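TL;DR: SciCUEval is a comprehensive evaluation dataset for the scientific context understanding of large language models (LLMs), covering multiple disciplines and data types.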

Details

Motivation: Existing benchmarks focus on general domains and do not adequately assess LLMs on complex scientific data. Method: Builds the SciCUEval dataset, with ten discipline-specific subsets and multiple data modalities, and systematically evaluates four core competencies. Result: An evaluation of state-of-the-art LLMs reveals their strengths and limitations in scientific context understanding. Conclusion: SciCUEval provides an important reference for the future development of scientific-domain LLMs. Abstract: Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, evaluating their performance across diverse scientific domains remains underexplored, as existing benchmarks primarily focus on general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assess the scientific context understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. SciCUEval systematically evaluates four core competencies: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference, through a variety of question formats. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, providing a fine-grained analysis of their strengths and limitations in scientific context understanding, and offering valuable insights for the future development of scientific-domain LLMs.

[158] Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English

Ishmanbir Singh,Dipankar Srirag,Aditya Joshi

Main category: cs.CL

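TL;DR: Proposes an approach based on pragmatic metacognitive prompting (PMP) for detecting sarcasm in Australian and Indian English, with a comparison on a standard English dataset; PMP significantly outperforms other prompting strategies.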

Details

Motivation: Sarcasm challenges sentiment analysis because its stated and implied sentiment differ, especially when the implication is specific to a country or region. The study applies PMP to this problem. Method: Applies PMP to sarcasm detection on BESSTIE, a dataset of Australian and Indian English, and compares against the standard English dataset FLUTE. Two open-weight LLMs (GEMMA and LLAMA) are tested. Result: PMP significantly outperforms four alternative prompting strategies on all tasks and datasets. Techniques such as agentic prompting also mitigate context-related failures through external knowledge retrieval. Conclusion: PMP has a clear advantage in generating sarcasm explanations for varieties of English and offers a new approach to sarcasm detection. Abstract: Sarcasm is a challenge to sentiment analysis because of the incongruity between stated and implied sentiment. The challenge is exacerbated when the implication may be relevant to a specific country or geographical region. Pragmatic metacognitive prompting (PMP) is a cognition-inspired technique that has been used for pragmatic reasoning. In this paper, we harness PMP for explainable sarcasm detection for Australian and Indian English, alongside a benchmark dataset for standard English. We manually add sarcasm explanations to an existing sarcasm-labeled dataset for Australian and Indian English called BESSTIE, and compare the performance for explainable sarcasm detection for them with FLUTE, a standard English dataset containing sarcasm explanations. Our approach utilising PMP when evaluated on two open-weight LLMs (GEMMA and LLAMA) achieves statistically significant performance improvement across all tasks and datasets when compared with four alternative prompting strategies. We also find that alternative techniques such as agentic prompting mitigate context-related failures by enabling external knowledge retrieval. The focused contribution of our work is utilising PMP in generating sarcasm explanations for varieties of English.

[159] Mechanistic evaluation of Transformers and state space models

Aryaman Arora,Neil Rathi,Nikil Roashan Selvam,Róbert Csórdas,Dan Jurafsky,Christopher Potts

Main category: cs.CL

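TL;DR: Studies state space models (SSMs) for language modelling, finds that their performance on the associative recall (AR) task varies, and uses causal interventions to reveal mechanistic differences between models.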

Details

Motivation: To explore performance differences among SSMs in language modelling, particularly on associative recall, and to uncover the internal mechanisms of different models. Method: Tests several models (Transformers, Based SSM, Mamba, H3, Hyena) on the AR task and analyzes their mechanisms with causal interventions. A new synthetic task, ATR, is introduced to verify the findings. Result: Transformers and Based SSM perform best on AR, Mamba is close behind, and the other SSMs fail. Causal interventions show that Transformers and Based store associations using induction heads, whereas the SSMs compute associations only at the last state. The ATR task confirms these findings. Conclusion: Architectures can differ substantially in mechanism on the same task, which underlines the importance of mechanistic evaluation. Abstract: State space models (SSMs) for language modelling promise an efficient and performant alternative to quadratic-attention Transformers, yet show variable performance on recalling basic information from the context. While performance on synthetic tasks like Associative Recall (AR) can point to this deficiency, behavioural metrics provide little information as to why, on a mechanistic level, certain architectures fail and others succeed. To address this, we conduct experiments on AR and find that only Transformers and Based SSM models fully succeed at AR, with Mamba a close third, whereas the other SSMs (H3, Hyena) fail. We then use causal interventions to explain why. We find that Transformers and Based learn to store key-value associations in-context using induction heads. By contrast, the SSMs compute these associations only at the last state, with only Mamba succeeding because of its short convolution component. To extend and deepen these findings, we introduce Associative Treecall (ATR), a synthetic task similar to AR based on PCFG induction. ATR introduces language-like hierarchical structure into the AR setting. We find that all architectures learn the same mechanism as they did for AR, and the same three models succeed at the task. These results reveal that architectures with similar accuracy may still have substantive differences, motivating the adoption of mechanistic evaluations.
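
A synthetic associative recall example is easy to generate: the context is a sequence of key-value pairs followed by a query key, and the target is the value paired with that key. The sketch below is a generic version of the task; the vocabulary, sequence layout, and function names are assumptions, and the paper's exact data format may differ.

```python
import random


def make_ar_example(num_pairs, keys, values, rng):
    """Build one associative recall example.

    Returns (tokens, target), where tokens = k1 v1 k2 v2 ... kn vn query and
    target is the value that followed the query key in the context.
    """
    chosen_keys = rng.sample(keys, num_pairs)  # distinct keys, so recall is unambiguous
    pairs = [(k, rng.choice(values)) for k in chosen_keys]
    query, target = rng.choice(pairs)
    tokens = [tok for pair in pairs for tok in pair] + [query]
    return tokens, target


def recall_by_lookup(tokens):
    """Reference solution: read the key-value table from the context and look up the query."""
    *context, query = tokens
    table = dict(zip(context[0::2], context[1::2]))
    return table[query]


rng = random.Random(0)
keys = [f"k{i}" for i in range(8)]
values = [f"v{i}" for i in range(8)]
tokens, target = make_ar_example(4, keys, values, rng)
prediction = recall_by_lookup(tokens)
```

A model solves the task if its prediction at the final position matches `target`; the paper's contribution is to ask how each architecture computes that lookup internally, not just whether it gets the answer right.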

[160] StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization

Ziliang Wang,Xuhui Zheng,Kang An,Cijun Ouyang,Jialu Cai,Yuhang Wang,Yichao Wu

Main category: cs.CL

TL;DR: The StepSearch framework optimizes multi-hop reasoning LLMs with fine-grained, step-wise supervision and significantly improves performance.

Details

Motivation: To address the weak performance of existing RL-trained LLMs on multi-hop QA, which stems from sparse global rewards. Method: Introduces the StepSearch framework, which uses step-wise proximal policy optimization combined with intermediate search rewards and token-level supervision based on information gain. Result: Significantly outperforms baselines on standard multi-hop QA benchmarks, with gains of 11.2% and 4.2% for the 3B and 7B models, respectively. Conclusion: Fine-grained step-wise supervision effectively optimizes deep search LLMs and yields clear improvements with little training data. Abstract: Efficient multi-hop reasoning requires agents based on Large Language Models (LLMs) to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but such methods underperform on complex, multi-hop QA because they rely on sparse rewards from a global signal only. To address this gap in existing research, we introduce StepSearch, a framework for search LLMs that is trained with a step-wise proximal policy optimization method. It consists of richer and more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories based on open source datasets through a data pipeline. On standard multi-hop QA benchmarks, it significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search with RL baselines using only 19k training data, demonstrating the effectiveness of fine-grained, stepwise supervision in optimizing deep search LLMs. Our implementation is publicly available at https://github.com/zxh20001117/StepSearch.
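
The idea of rewarding each search step for new information while penalizing redundancy can be sketched with simple set operations. The abstract does not define StepSearch's reward, so the set-based gain, the redundancy term, and the weight `redundancy_weight=0.5` below are hypothetical choices that only illustrate why a per-step signal is denser than a single end-of-episode reward.

```python
def step_reward(retrieved, gold, seen, redundancy_weight=0.5):
    """Hypothetical per-step search reward.

    gain: fraction of gold documents that this step retrieves for the first time
    redundancy: fraction of this step's results already retrieved in earlier steps
    """
    retrieved = set(retrieved)
    new_gold = (retrieved & gold) - seen
    gain = len(new_gold) / len(gold) if gold else 0.0
    redundancy = len(retrieved & seen) / len(retrieved) if retrieved else 0.0
    return gain - redundancy_weight * redundancy


def trajectory_rewards(steps, gold):
    """Score every search step in order, tracking what has been seen so far."""
    seen = set()
    rewards = []
    for retrieved in steps:
        rewards.append(step_reward(retrieved, gold, seen))
        seen |= set(retrieved)
    return rewards


# Toy trajectory: the second query repeats the first and is penalized.
gold = {"d1", "d2"}
steps = [["d1", "d9"], ["d1", "d9"], ["d2", "d7"]]
rewards = trajectory_rewards(steps, gold)
```

A global reward would assign this whole trajectory one score, whereas the step-wise version separates the useful first and third queries from the wasted second one.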

[161] A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents

Ian Steenstra,Timothy W. Bickmore

Main category: cs.CL

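TL;DR: The paper proposes a new risk taxonomy for systematically evaluating conversational AI psychotherapists, addressing the inability of existing evaluation methods to capture the subtle risks of therapeutic interaction.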

Details

Motivation: The spread of large language models and intelligent virtual agents acting as psychotherapists creates opportunities to expand mental health services, but it also carries serious risks such as user harm and suicide, which existing evaluation methods cannot detect effectively. Method: Develops a new risk taxonomy through a literature review, interviews with clinical and legal experts, and alignment with clinical criteria such as DSM-5 and existing assessment tools. Result: The taxonomy provides a structured approach to identifying and assessing user/patient harms, and its use is illustrated in human-AI counseling sessions and in automated benchmarking with simulated patients. Conclusion: The taxonomy lays a foundation for safe innovation in AI-driven mental health support. Abstract: The proliferation of Large Language Models (LLMs) and Intelligent Virtual Agents acting as psychotherapists presents significant opportunities for expanding mental healthcare access. However, their deployment has also been linked to serious adverse outcomes, including user harm and suicide, facilitated by a lack of standardized evaluation methodologies capable of capturing the nuanced risks of therapeutic interaction. Current evaluation techniques lack the sensitivity to detect subtle changes in patient cognition and behavior during therapy sessions that may lead to subsequent decompensation. We introduce a novel risk taxonomy specifically designed for the systematic evaluation of conversational AI psychotherapists. Developed through an iterative process including review of the psychotherapy risk literature, qualitative interviews with clinical and legal experts, and alignment with established clinical criteria (e.g., DSM-5) and existing assessment tools (e.g., NEQ, UE-ATR), the taxonomy aims to provide a structured approach to identifying and assessing user/patient harms. We provide a high-level overview of this taxonomy, detailing its grounding, and discuss potential use cases. We discuss two use cases in detail: monitoring cognitive model-based risk factors during a counseling conversation to detect unsafe deviations, in both human-AI counseling sessions and in automated benchmarking of AI psychotherapists with simulated patients. The proposed taxonomy offers a foundational step towards establishing safer and more responsible innovation in the domain of AI-driven mental health support.

[162] RoT: Enhancing Table Reasoning with Iterative Row-Wise Traversals

Xuanliang Zhang,Dingzirui Wang,Keyan Xu,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

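TL;DR: The paper proposes Row-of-Thought (RoT), which reasons by traversing the table row by row to reduce hallucinations; it requires no training and outperforms existing methods.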

Details

Motivation: Existing Long Chain-of-Thought (Long CoT) methods are costly to train and suffer from table content hallucinations, so an improvement is needed. Method: RoT traverses the table row by row and leverages the reflection capabilities of LLMs to extend and refine reasoning. Result: Using non-reasoning models, RoT outperforms RLLMs by an average of 4.3% and achieves SOTA results on WikiTableQuestions and TableBench. Conclusion: RoT is efficient and performs strongly, significantly reduces hallucinations, and is well suited to table reasoning tasks. Abstract: The table reasoning task, crucial for efficient data acquisition, aims to answer questions based on the given table. Recently, reasoning large language models (RLLMs) with Long Chain-of-Thought (Long CoT) significantly enhance reasoning capabilities, leading to strong performance on table reasoning. However, Long CoT suffers from high training cost and exhibits low reliability due to table content hallucinations. Therefore, we propose Row-of-Thought (RoT), which performs iterative row-wise table traversal, allowing for reasoning extension and reflection-based refinement at each traversal. Scaling reasoning length by row-wise traversal and leveraging reflection capabilities of LLMs, RoT is training-free. The sequential traversal encourages greater attention to the table, thus reducing hallucinations. Experiments show that RoT, using non-reasoning models, outperforms RLLMs by an average of 4.3%, and achieves state-of-the-art results on WikiTableQuestions and TableBench with comparable models, proving its effectiveness. Also, RoT outperforms Long CoT with fewer reasoning tokens, indicating higher efficiency.

[163] An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents

Bowen Jin,Jinsung Yoon,Priyanka Kargupta,Sercan O. Arik,Jiawei Han

Main category: cs.CL

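TL;DR: The paper studies the key factors in using reinforcement learning to train LLM-based search agents, including reward design, the choice of LLM, and the role of the search engine, and offers practical guidelines.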

Details

Motivation: Although reinforcement learning shows promise for training search agents, the optimal design is not yet fully understood, particularly with respect to reward design, the choice of LLM, and the role of the search engine. Method: Through comprehensive empirical studies, systematically analyzes the effects of reward design, LLM characteristics, and the search engine on the RL process. Result: Format rewards effectively improve performance, the scale and initialization of the LLM significantly influence outcomes, and the choice of search engine is critical to training dynamics and agent robustness. Conclusion: The study provides important guidance for building and deploying LLM-based search agents in real-world applications. Abstract: Reinforcement learning (RL) has demonstrated strong potential in training large language models (LLMs) capable of complex reasoning for real-world problem solving. More recently, RL has been leveraged to create sophisticated LLM-based search agents that adeptly combine reasoning with search engine use. While the use of RL for training search agents is promising, the optimal design of such agents is not yet fully understood. In particular, key factors -- such as (1) reward formulation, (2) the choice and characteristics of the underlying LLM, and (3) the role of the search engine in the RL process -- require further investigation. In this work, we conduct comprehensive empirical studies to systematically investigate these and offer actionable insights. We highlight several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. These establish important guidelines for successfully building and deploying LLM-based search agents in real-world applications. Code is available at https://github.com/PeterGriffinJin/Search-R1.

[164] Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning

Jinghui Lu,Haiyang Yu,Siliang Xu,Shiwei Ran,Guozhi Tang,Siqi Wang,Bin Shan,Teng Fu,Hao Feng,Jingqun Tang,Han Wang,Can Huang

Main category: cs.CL

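TL;DR: The paper proposes Certainty-based Adaptive Reasoning (CAR), a framework that dynamically switches between short answers and long-form reasoning to optimize the efficiency and accuracy of LLMs and MLLMs.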

Details

Motivation: Excessive reliance on chain-of-thought (CoT) reasoning can degrade model performance and produce verbose outputs, which hurts efficiency. Method: CAR first generates a short answer and evaluates its perplexity, triggering long-form reasoning only when the model's confidence is low. Result: Experiments show that CAR outperforms both short-answer and long-form reasoning approaches on diverse multimodal VQA/KIE benchmarks and text reasoning datasets. Conclusion: CAR achieves the best balance between accuracy and efficiency. Abstract: Recent advancements in reasoning have significantly enhanced the capabilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) across diverse tasks. However, excessive reliance on chain-of-thought (CoT) reasoning can impair model performance and produce unnecessarily long outputs, reducing efficiency. Our work reveals that prolonged reasoning does not universally improve accuracy and can even degrade performance on simpler tasks. To address this, we propose Certainty-based Adaptive Reasoning (CAR), a novel framework that dynamically switches between short answers and long-form reasoning based on the model perplexity. CAR first generates a short answer and evaluates its perplexity, triggering reasoning only when the model exhibits low confidence (i.e., high perplexity). Experiments across diverse multimodal VQA/KIE benchmarks and text reasoning datasets show that CAR outperforms both short-answer and long-form reasoning approaches, striking an optimal balance between accuracy and efficiency.
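The routing rule described in the abstract (answer briefly, measure perplexity, reason only when perplexity is high) can be sketched as below. Perplexity is computed in the standard way as the exponential of the mean negative log-probability of the generated tokens. The fixed cutoff `ppl_threshold` and the function names are assumptions for illustration; the abstract does not say how the decision boundary is actually chosen.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a generated answer from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def route(short_answer_logprobs, ppl_threshold=1.5):
    """Decide whether to keep the short answer or trigger long-form reasoning.

    ppl_threshold is a hypothetical cutoff, not a value from the paper.
    """
    if perplexity(short_answer_logprobs) > ppl_threshold:
        return "reason"   # low confidence: fall back to long-form reasoning
    return "short"        # high confidence: keep the short answer
```

For example, a three-token answer whose tokens each have probability 0.9 has perplexity of about 1.11 and is kept, while one whose tokens each have probability 0.2 has perplexity 5 and is routed to long-form reasoning.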

[165] ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection

Jeonghye Kim,Sojeong Rhee,Minbeom Kim,Dohyung Kim,Sangmook Lee,Youngchul Sung,Kyomin Jung

Main category: cs.CL

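TL;DR: ReflAct is a new reasoning framework for LLM agents that continuously reflects on how the agent's state aligns with its goal, significantly improving reasoning reliability and success rate.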

Details

Motivation: Addresses ReAct's incoherent or ungrounded reasoning steps in complex environments, which stem from its inability to maintain consistent internal beliefs and goal alignment. Method: Introduces the ReflAct framework, which emphasizes continuous reflection on the agent's state relative to its goal and explicitly grounds decisions in states. Result: ReflAct surpasses ReAct by 27.7% on average and reaches a 93.3% success rate in ALFWorld, and it also outperforms ReAct with added enhancement modules. Conclusion: Strengthening the core reasoning backbone is key to agent performance, and ReflAct achieves higher reliability through continuous reflection and state alignment. Abstract: Recent advances in LLM agents have largely built on reasoning backbones like ReAct, which interleave thought and action in complex environments. However, ReAct often produces ungrounded or incoherent reasoning steps, leading to misalignment between the agent's actual state and goal. Our analysis finds that this stems from ReAct's inability to maintain consistent internal beliefs and goal alignment, causing compounding errors and hallucinations. To address this, we introduce ReflAct, a novel backbone that shifts reasoning from merely planning next actions to continuously reflecting on the agent's state relative to its goal. By explicitly grounding decisions in states and enforcing ongoing goal alignment, ReflAct dramatically improves strategic reliability. This design delivers substantial empirical gains: ReflAct surpasses ReAct by 27.7% on average, achieving a 93.3% success rate in ALFWorld. Notably, ReflAct even outperforms ReAct with added enhancement modules (e.g., Reflexion, WKM), showing that strengthening the core reasoning backbone is key to reliable agent performance.

[166] EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association

Weiqi Wang,Limeng Cui,Xin Liu,Sreyashi Nag,Wenju Xu,Chen Luo,Sheikh Muhammad Sarwar,Yang Li,Hansu Gu,Hui Liu,Changlong Yu,Jiaxin Bai,Yifan Gao,Haiyang Zhang,Qi He,Shuiwang Ji,Yangqiu Song

Main category: cs.CL

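TL;DR: The paper defines the E-commerce Script Planning task (EcomScript) and designs a new framework that associates products with script steps via semantic similarity, producing the first large-scale dataset, EcomScriptBench. Experiments show that existing LLMs perform poorly on the task, but injecting purchase intentions improves performance.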

Details

Motivation: In e-commerce, customers want LLM assistants to generate shopping scripts and recommend products, but existing methods fall short in script planning and product retrieval, and evaluation methods and data are lacking. Method: Defines EcomScript as three subtasks, proposes a framework that associates products with script steps via semantic similarity, and constructs the large-scale EcomScriptBench dataset. Result: Existing LLMs perform poorly on EcomScript tasks, but performance improves once product purchase intentions are injected. Conclusion: The paper provides the first large-scale dataset and evaluation benchmark for e-commerce script planning and demonstrates a way to improve LLM performance. Abstract: Goal-oriented script planning, or the ability to devise coherent sequences of actions toward specific goals, is commonly employed by humans to plan for typical activities. In e-commerce, customers increasingly seek LLM-based assistants to generate scripts and recommend products at each step, thereby facilitating convenient and efficient shopping experiences. However, this capability remains underexplored due to several challenges, including the inability of LLMs to simultaneously conduct script planning and product retrieval, difficulties in matching products caused by semantic discrepancies between planned actions and search queries, and a lack of methods and benchmark data for evaluation. In this paper, we step forward by formally defining the task of E-commerce Script Planning (EcomScript) as three sequential subtasks. We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step based on the semantic similarity between the actions and their purchase intentions. By applying our framework to real-world e-commerce data, we construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products. Human annotations are then conducted to provide gold labels for a sampled subset, forming an evaluation benchmark. Extensive experiments reveal that current (L)LMs face significant challenges with EcomScript tasks, even after fine-tuning, while injecting product purchase intentions improves their performance.

[167] DUSK: Do Not Unlearn Shared Knowledge

Wonje Jeung,Sangyeon Yoon,Hyesoo Hong,Soeun Kim,Seungju Han,Youngjae Yu,Albert No

Main category: cs.CL

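TL;DR: DUSK is a new benchmark for evaluating machine unlearning methods in realistic scenarios with overlapping data, and it reveals the shortcomings of existing methods in selective forgetting.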

Details

Motivation: Existing machine unlearning evaluations assume that the forget and retain sets are fully disjoint; the paper proposes a more realistic evaluation setting. Method: Introduces the DUSK benchmark, constructs document sets that describe the same facts in different styles, and defines seven evaluation metrics. Result: An evaluation of nine existing methods shows that they can remove surface-level text but struggle to selectively erase deeper knowledge. Conclusion: DUSK provides a public benchmark for developing more precise machine unlearning techniques. Abstract: Large language models (LLMs) are increasingly deployed in real-world applications, raising concerns about the unauthorized use of copyrighted or sensitive data. Machine unlearning aims to remove such 'forget' data while preserving utility and information from the 'retain' set. However, existing evaluations typically assume that forget and retain sets are fully disjoint, overlooking realistic scenarios where they share overlapping content. For instance, a news article may need to be unlearned, even though the same event, such as an earthquake in Japan, is also described factually on Wikipedia. Effective unlearning should remove the specific phrasing of the news article while preserving publicly supported facts. In this paper, we introduce DUSK, a benchmark designed to evaluate unlearning methods under realistic data overlap. DUSK constructs document sets that describe the same factual content in different styles, with some shared information appearing across all sets and other content remaining unique to each. When one set is designated for unlearning, an ideal method should remove its unique content while preserving shared facts. We define seven evaluation metrics to assess whether unlearning methods can achieve this selective removal. Our evaluation of nine recent unlearning methods reveals a key limitation: while most can remove surface-level text, they often fail to erase deeper, context-specific knowledge without damaging shared content. We release DUSK as a public benchmark to support the development of more precise and reliable unlearning techniques for real-world applications.

[168] Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs

Jie Ma,Ning Qu,Zhitao Gao,Rui Xing,Jun Liu,Hongbin Pei,Jiang Xie,Linyun Song,Pinghui Wang,Jing Tao,Zhou Su

Main category: cs.CL

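TL;DR: The paper proposes a framework called Deliberation over Priors (DP) that fully exploits the prior knowledge in knowledge graphs to improve the reasoning faithfulness and generation reliability of large language models.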

Details

Motivation: Existing methods fail to fully exploit the structural information and constraints of knowledge graphs, which leaves LLM reasoning and generation deficient. Method: DP adopts a progressive knowledge distillation strategy that combines supervised fine-tuning with Kahneman-Tversky optimization, and introduces a reasoning-introspection strategy for verification. Result: DP achieves state-of-the-art performance on three benchmark datasets, notably a 13% Hit@1 improvement on ComplexWebQuestions. Conclusion: The DP framework significantly improves model faithfulness and reliability and is flexible and practical. Abstract: Knowledge graph-based retrieval-augmented generation seeks to mitigate hallucinations in Large Language Models (LLMs) caused by insufficient or outdated knowledge. However, existing methods often fail to fully exploit the prior knowledge embedded in knowledge graphs (KGs), particularly their structural information and explicit or implicit constraints. The former can enhance the faithfulness of LLMs' reasoning, while the latter can improve the reliability of response generation. Motivated by these, we propose a trustworthy reasoning framework, termed Deliberation over Priors (DP), which sufficiently utilizes the priors contained in KGs. Specifically, DP adopts a progressive knowledge distillation strategy that integrates structural priors into LLMs through a combination of supervised fine-tuning and Kahneman-Tversky optimization, thereby improving the faithfulness of relation path generation. Furthermore, our framework employs a reasoning-introspection strategy, which guides LLMs to perform refined reasoning verification based on extracted constraint priors, ensuring the reliability of response generation. Extensive experiments on three benchmark datasets demonstrate that DP achieves new state-of-the-art performance, especially a Hit@1 improvement of 13% on the ComplexWebQuestions dataset, and generates highly trustworthy responses. We also conduct various analyses to verify its flexibility and practicality. The code is available at https://github.com/reml-group/Deliberation-on-Priors.
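For readers unfamiliar with the Hit@1 metric reported above, a minimal sketch follows. It uses the common definition (the top-ranked prediction counts as a hit if it appears in the gold answer set); the function name and the input format are assumptions, and the paper's own evaluation script may differ in details such as answer normalization.

```python
def hit_at_1(ranked_predictions, gold_answers):
    """Fraction of questions whose top-ranked prediction is a gold answer.

    ranked_predictions: one ranked list of answers per question.
    gold_answers: one set of acceptable answers per question.
    """
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if preds and preds[0] in gold   # only the first prediction counts
    )
    return hits / len(gold_answers)
```

A question with an empty prediction list simply counts as a miss.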

[169] R-TOFU: Unlearning in Large Reasoning Models

Sangyeon Yoon,Wonje Jeung,Albert No

Main category: cs.CL

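TL;DR: The paper proposes the R-TOFU benchmark for evaluating unlearning in large reasoning models (LRMs) across multi-step reasoning chains, and shows that conventional methods leave residual knowledge in the reasoning steps.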

Details

Motivation: Large reasoning models embed private or copyrighted information in their reasoning chains, which makes unlearning more complex than in conventional LLMs and calls for dedicated evaluation tools. Method: Introduces the R-TOFU benchmark, compares gradient-based and preference-optimization baselines, and proposes Reasoned IDK to balance forgetting efficacy and model utility. Result: Conventional methods leave residual knowledge in reasoning steps, whereas Reasoned IDK better preserves reasoning ability. Decoding variants can still leak forgotten content. Conclusion: R-TOFU provides a systematic foundation for unlearning research on LRMs and highlights the need to evaluate models under diverse decoding settings. Abstract: Large Reasoning Models (LRMs) embed private or copyrighted information not only in their final answers but also throughout multi-step chain-of-thought (CoT) traces, making reliable unlearning far more demanding than in standard LLMs. We introduce Reasoning-TOFU (R-TOFU), the first benchmark tailored to this setting. R-TOFU augments existing unlearning tasks with realistic CoT annotations and provides step-wise metrics that expose residual knowledge invisible to answer-level checks. Using R-TOFU, we carry out a comprehensive comparison of gradient-based and preference-optimization baselines and show that conventional answer-only objectives leave substantial forget traces in reasoning. We further propose Reasoned IDK, a preference-optimization variant that preserves coherent yet inconclusive reasoning, achieving a stronger balance between forgetting efficacy and model utility than earlier refusal styles. Finally, we identify a failure mode: decoding variants such as ZeroThink and LessThink can still reveal forgotten content despite seemingly successful unlearning, emphasizing the need to evaluate models under diverse decoding settings. Together, the benchmark, analysis, and new baseline establish a systematic foundation for studying and improving unlearning in LRMs while preserving their reasoning capabilities.

[170] Multilingual Prompting for Improving LLM Generation Diversity

Qihan Wang,Shidong Pan,Tal Linzen,Emily Black

Main category: cs.CL

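TL;DR: Proposes multilingual prompting, which adds cultural and linguistic cues to improve the diversity and cultural representation of LLM generations.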

Details

Motivation: Addresses the lack of cultural diversity in LLM generations. Method: Multilingual prompting generates prompt variants with cultural cues and combines the responses. Result: Multilingual prompting outperforms existing diversity-enhancing techniques across multiple models and reduces hallucination about culturally specific information. Conclusion: Multilingual prompting effectively improves the diversity and cultural accuracy of LLM generations. Abstract: Large Language Models (LLMs) are known to lack cultural representation and overall diversity in their generations, from expressing opinions to answering factual questions. To mitigate this problem, we propose multilingual prompting: a prompting method which generates several variations of a base prompt with added cultural and linguistic cues from several cultures, generates responses, and then combines the results. Building on evidence that LLMs have language-specific knowledge, multilingual prompting seeks to increase diversity by activating a broader range of cultural knowledge embedded in model training data. Through experiments across multiple models (GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B), we show that multilingual prompting consistently outperforms existing diversity-enhancing techniques such as high-temperature sampling, step-by-step recall, and personas prompting. Further analyses show that the benefits of multilingual prompting vary with language resource level and model size, and that aligning the prompting language with the cultural cues reduces hallucination about culturally-specific information.

[171] Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework

Zihao Jiang,Ben Liu,Miao Peng,Wenjie Xu,Yao Xiao,Zhenyan Shan,Min Peng

Main category: cs.CL

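TL;DR: The paper proposes the GETER framework, which combines graph structure with textual information to improve large language models on explainable temporal reasoning.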

Details

Motivation: Existing work focuses mostly on improving LLM performance in temporal reasoning while neglecting explainability; this paper aims to fill that gap. Method: Proposes the GETER framework, which uses temporal knowledge graphs to capture structural information, maps graph features into the text embedding space through a structure-text prefix adapter, and finally generates explanation text. Result: Experiments show that GETER achieves state-of-the-art performance and generalization. Conclusion: GETER effectively addresses the challenges that LLMs face in explainable temporal reasoning and shows strong potential. Abstract: While large language models (LLMs) show great potential in temporal reasoning, most existing work focuses heavily on enhancing performance, often neglecting the explainable reasoning processes underlying the results. To address this gap, we introduce a comprehensive benchmark covering a wide range of temporal granularities, designed to systematically evaluate LLMs' capabilities in explainable temporal reasoning. Furthermore, our findings reveal that LLMs struggle to deliver convincing explanations when relying solely on textual information. To address this challenge, we propose GETER, a novel structure-aware generative framework that integrates Graph structures with text for Explainable TEmporal Reasoning. Specifically, we first leverage temporal knowledge graphs to develop a temporal encoder that captures structural information for the query. Subsequently, we introduce a structure-text prefix adapter to map graph structure features into the text embedding space. Finally, LLMs generate explanation text by seamlessly integrating the soft graph token with instruction-tuning prompt tokens. Experimental results indicate that GETER achieves state-of-the-art performance while also demonstrating its effectiveness as well as strong generalization capabilities. Our dataset and code are available at https://github.com/carryTatum/GETER.

[172] Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation

Yerin Hwang,Dongryeol Lee,Kyungmin Min,Taegwan Kang,Yong-il Kim,Kyomin Jung

Main category: cs.CL

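TL;DR: The study finds that large vision-language models (LVLMs) used to judge text-image alignment are vulnerable to adversarial visual manipulations, which inflate their scores.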

Details

Motivation: To explore the robustness of LVLMs along the visual modality, in particular whether adversarial visual manipulations bias their scores. Method: Defines image-induced biases, builds the multi-domain meta-evaluation benchmark FRAME, and tests the scoring behavior of LVLMs. Result: All tested LVLMs are vulnerable, with scores significantly inflated for manipulated images, and combining multiple biases amplifies the effect. Conclusion: Current LVLM evaluation systems have vulnerabilities, and more robust evaluation methods are urgently needed. Abstract: Recently, large vision-language models (LVLMs) have emerged as the preferred tools for judging text-image alignment, yet their robustness along the visual modality remains underexplored. This work is the first study to address a key research question: Can adversarial visual manipulations systematically fool LVLM judges into assigning unfairly inflated scores? We define potential image-induced biases within the context of T2I evaluation and examine how these biases affect the evaluations of LVLM judges. Moreover, we introduce a novel, fine-grained, multi-domain meta-evaluation benchmark named FRAME, which is deliberately constructed to exhibit diverse score distributions. By introducing the defined biases into the benchmark, we reveal that all tested LVLM judges exhibit vulnerability across all domains, consistently inflating scores for manipulated images. Further analysis reveals that combining multiple biases amplifies their effects, and pairwise evaluations are similarly susceptible. Moreover, we observe that visual biases persist under prompt-based mitigation strategies, highlighting the vulnerability of current LVLM evaluation systems and underscoring the urgent need for more robust LVLM judges.

[173] MentalMAC: Enhancing Large Language Models for Detecting Mental Manipulation via Multi-Task Anti-Curriculum Distillation

Yuansheng Gao,Han Bao,Tong Zhang,Bin Li,Zonghui Wang,Wenzhi Chen

Main category: cs.CL

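TL;DR: The paper proposes MentalMAC, which uses multi-task anti-curriculum distillation to improve large language models' ability to detect mental manipulation, and constructs the ReaMent dataset.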

Details

Motivation: Mental manipulation is covert and complex, existing large language models struggle to detect it, and high-quality annotated data is scarce. Method: Combines EvoSA unsupervised data expansion, multi-task supervision, and progressive knowledge distillation. Result: Experiments show that the method significantly narrows the gap between student and teacher models and outperforms competitive models on key metrics. Conclusion: MentalMAC effectively improves mental manipulation detection, and the code and datasets are to be released. Abstract: Mental manipulation is a subtle yet pervasive form of psychological abuse that poses serious threats to mental health. Its covert nature and the complexity of manipulation strategies make it challenging to detect, even for state-of-the-art large language models (LLMs). This concealment also hinders the manual collection of large-scale, high-quality annotations essential for training effective models. Although recent efforts have sought to improve LLMs' performance on this task, progress remains limited due to the scarcity of real-world annotated datasets. To address these challenges, we propose MentalMAC, a multi-task anti-curriculum distillation method that enhances LLMs' ability to detect mental manipulation in multi-turn dialogue. Our approach includes: (i) EvoSA, an unsupervised data expansion method based on evolutionary operations and speech act theory; (ii) teacher-model-generated multi-task supervision; and (iii) progressive knowledge distillation from complex to simpler tasks. We then constructed the ReaMent dataset with 5,000 real-world dialogue samples, using a MentalMAC-distilled model to assist human annotation. Extensive experiments demonstrate that our method significantly narrows the gap between student and teacher models and outperforms competitive LLMs across key evaluation metrics. All code, datasets, and checkpoints will be released upon paper acceptance. Warning: This paper contains content that may be offensive to readers.

[174] When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners

Weixiang Zhao,Jiahe Guo,Yang Deng,Tongtong Wu,Wenxuan Zhang,Yulin Hu,Xingyu Sui,Yanyan Zhao,Wanxiang Che,Bing Qin,Tat-Seng Chua,Ting Liu

Main category: cs.CL

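TL;DR: A causal intervention on language-specific representations improves the reasoning ability of multilingual large language models without additional training.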

Details

Motivation: Multilingual reasoning in LLMs is uneven, with high-resource languages performing better. Inspired by cognitive neuroscience, the authors hypothesize that reasoning and language are separable components. Method: Ablates language-specific representations at inference time, with experiments covering 10 open-source LLMs and 11 languages. Result: The ablation significantly improves multilingual reasoning performance, and language and reasoning representations can be decoupled. Conclusion: This lightweight method offers an interpretable strategy for improving cross-lingual generalization. Abstract: Multilingual reasoning remains a significant challenge for large language models (LLMs), with performance disproportionately favoring high-resource languages. Drawing inspiration from cognitive neuroscience, which suggests that human reasoning functions largely independently of language processing, we hypothesize that LLMs similarly encode reasoning and language as separable components that can be disentangled to enhance multilingual reasoning. To evaluate this, we perform a causal intervention by ablating language-specific representations at inference time. Experiments on 10 open-source LLMs spanning 11 typologically diverse languages show that this language-specific ablation consistently boosts multilingual reasoning performance. Layer-wise analyses further confirm that language and reasoning representations can be effectively decoupled throughout the model, yielding improved multilingual reasoning capabilities, while preserving top-layer language features remains essential for maintaining linguistic fidelity. Compared to post-training such as supervised fine-tuning or reinforcement learning, our training-free ablation achieves comparable or superior results with minimal computational overhead. These findings shed light on the internal mechanisms underlying multilingual reasoning in LLMs and suggest a lightweight and interpretable strategy for improving cross-lingual generalization.
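One common way to ablate a set of directions from a hidden state is to subtract its projection onto them. The sketch below shows that operation in plain Python. It is an assumption that the paper's ablation takes this form: the abstract does not say how language-specific representations are identified or removed, and the function name, the `strength` parameter, and the requirement that `basis` be orthonormal are all illustrative.

```python
def ablate_language_subspace(hidden, basis, strength=1.0):
    """Remove the component of `hidden` that lies in the span of `basis`.

    hidden: list of floats (one hidden-state vector).
    basis: list of orthonormal direction vectors assumed to span a
           language-specific subspace (hypothetical input).
    strength: 1.0 removes the component fully; smaller values dampen it.
    """
    out = list(hidden)
    for direction in basis:
        # Projection coefficient of the original vector on this direction.
        coeff = sum(h * d for h, d in zip(hidden, direction))
        out = [o - strength * coeff * d for o, d in zip(out, direction)]
    return out
```

With `strength=1.0` the result is orthogonal to every basis vector, so whatever those directions encode is no longer linearly readable from the hidden state, while the remaining components are left untouched.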

[175] AGENT-X: Adaptive Guideline-based Expert Network for Threshold-free AI-generated teXt detection

Jiatao Li,Mao Ye,Cheng Peng,Xunjian Yin,Xiaojun Wan

Main category: cs.CL

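TL;DR: AGENT-X is a zero-shot multi-agent framework that detects AI-generated text along linguistic dimensions without relying on annotated data or threshold tuning, significantly improving accuracy and interpretability.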

Details

Motivation: Existing AI-generated text detection methods depend on annotated data and threshold tuning, which limits interpretability and zero-shot effectiveness. Method: AGENT-X evaluates text along semantic, stylistic, and structural dimensions with independent linguistic agents and integrates their results through a meta agent, achieving threshold-free classification. Result: Experiments show that AGENT-X outperforms existing methods in accuracy, interpretability, and generalization. Conclusion: AGENT-X offers an efficient, interpretable zero-shot solution for AI-generated text detection. Abstract: Existing AI-generated text detection methods heavily depend on large annotated datasets and external threshold tuning, restricting interpretability, adaptability, and zero-shot effectiveness. To address these limitations, we propose AGENT-X, a zero-shot multi-agent framework informed by classical rhetoric and systemic functional linguistics. Specifically, we organize detection guidelines into semantic, stylistic, and structural dimensions, each independently evaluated by specialized linguistic agents that provide explicit reasoning and robust calibrated confidence via semantic steering. A meta agent integrates these assessments through confidence-aware aggregation, enabling threshold-free, interpretable classification. Additionally, an adaptive Mixture-of-Agent router dynamically selects guidelines based on inferred textual characteristics. Experiments on diverse datasets demonstrate that AGENT-X substantially surpasses state-of-the-art supervised and zero-shot approaches in accuracy, interpretability, and generalization.
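Confidence-aware aggregation can be illustrated with a confidence-weighted vote: each agent reports a label and a confidence, and the label with the largest total confidence wins. This is only a sketch of the general idea. In the paper a meta agent performs the integration, and the abstract does not describe its rule, so the function `aggregate` and the summing scheme below are assumptions.

```python
from collections import defaultdict


def aggregate(assessments):
    """Pick the label with the largest summed confidence (illustrative).

    assessments: list of (label, confidence) pairs, one per agent,
                 with confidence in [0, 1].
    """
    mass = defaultdict(float)
    for label, confidence in assessments:
        mass[label] += confidence
    # No tuned score cutoff is involved: the decision is a comparison
    # between the labels' accumulated confidence.
    return max(mass, key=mass.get)
```

For instance, one agent voting "ai" with confidence 0.9 outweighs two agents voting "human" with confidences 0.6 and 0.2.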

[176] Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Hyungjoo Chae,Sunghwan Kim,Junhee Cho,Seungone Kim,Seungjun Moon,Gyeom Hwangbo,Dongha Lim,Minjin Kim,Yeonjun Hwang,Minju Gwak,Dongwook Choi,Minseok Kang,Gwanhoon Im,ByeongUng Cho,Hyojun Kim,Jun Hee Han,Taeyoon Kwon,Minju Kim,Beong-woo Kwak,Dongjin Kang,Jinyoung Yeo

Main category: cs.CL

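TL;DR: This paper proposes Web-Shepherd, the first process reward model for web navigation. By building the large-scale WebPRM Collection dataset and the WebRewardBench evaluation benchmark, it significantly improves accuracy and cost-effectiveness on navigation tasks.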

Details

Motivation: Web navigation requires long-horizon sequential decision making, but existing methods rely on multimodal large language models (MLLMs) as reward models, which limits the efficiency and cost of real-world deployment. Method: Proposes the Web-Shepherd model, built on the WebPRM Collection (40K step-level preference pairs and annotated checklists) and evaluated on the WebRewardBench benchmark. Result: Web-Shepherd is about 30 points more accurate than GPT-4o on WebRewardBench, and on WebArena-lite it improves performance by 10.9 points at 10x lower cost. Conclusion: Web-Shepherd provides an efficient, low-cost solution for web navigation tasks and advances practical applications in the field. Abstract: Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM) called Web-Shepherd which could assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce the WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance, at 10x less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.

[177] Exploring In-Image Machine Translation with Real-World Background

Yanzhi Tian,Zeming Liu,Zhengyang Liu,Yuhang Guo

Main category: cs.CL

TL;DR: The paper proposes a model called DebackX for in-image machine translation in complex scenes, improving translation quality and visual effect.

Details

Motivation: Existing in-image machine translation research is mostly based on simplified scenarios (such as single-line black text on a white background), which are far from real applications and lack practicality. Method: Proposes the DebackX model, which separates the background from the text-image, translates the text-image directly, and fuses it with the background to generate the target image. Result: Experiments show that the model improves both translation quality and visual effect. Conclusion: DebackX provides an effective solution for in-image machine translation in complex scenes. Abstract: In-Image Machine Translation (IIMT) aims to translate texts within images from one language to another. Previous research on IIMT was primarily conducted on simplified scenarios such as images of one-line text with black font on white backgrounds, which is far from reality and impractical for applications in the real world. To make IIMT research practically valuable, it is essential to consider a complex scenario where the text backgrounds are derived from real-world images. To facilitate research of complex scenario IIMT, we design an IIMT dataset that includes subtitle text with real-world background. However, previous IIMT models perform inadequately in complex scenarios. To address the issue, we propose the DebackX model, which separates the background and text-image from the source image, performs translation on text-image directly, and fuses the translated text-image with the background, to generate the target image. Experimental results show that our model achieves improvements in both translation quality and visual effect.

[178] Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization

Joonho Yang,Seunghyun Yoon,Hwan Chang,Byeongjeong Kim,Hwanhee Lee

Main category: cs.CL

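TL;DR: The study finds that when large language models (LLMs) generate long text, hallucinations tend to concentrate in the latter part of the output, and that attention mechanisms and decoding dynamics deserve attention as a route to improvement.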

Details

Motivation: Although LLMs excel at text generation, hallucination remains a significant problem, especially in the latter part of long outputs, and its positional distribution has been little studied. Method: Uses long document summarization as a case study to examine the positional distribution of hallucinations in long responses and analyzes the influence of attention and decoding dynamics. Result: Hallucinations concentrate in the latter part of generated long text, indicating a positional bias. Conclusion: Further research is needed on mitigating the positional bias of hallucination to improve the faithfulness of long-form generation. Abstract: Large Language Models (LLMs) have significantly advanced text generation capabilities, including tasks like summarization, often producing coherent and fluent outputs. However, faithfulness to source material remains a significant challenge due to the generation of hallucinations. While extensive research focuses on detecting and reducing these inaccuracies, less attention has been paid to the positional distribution of hallucination within generated text, particularly in long outputs. In this work, we investigate where hallucinations occur in LLM-based long response generation, using long document summarization as a key case study. Focusing on the challenging setting of long context-aware long response generation, we find a consistent and concerning phenomenon: hallucinations tend to concentrate disproportionately in the latter parts of the generated long response. To understand this bias, we explore potential contributing factors related to the dynamics of attention and decoding over long sequences. Furthermore, we investigate methods to mitigate this positional hallucination, aiming to improve faithfulness specifically in the concluding segments of long outputs.
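A positional analysis of this kind can be sketched by splitting each summary into equal-width position bins and computing the hallucination rate per bin. The input format (one boolean flag per sentence, produced by some external faithfulness check) and the function name are assumptions; the abstract does not describe the paper's actual measurement procedure.

```python
def hallucination_rate_by_position(summaries, num_bins=4):
    """Hallucination rate per relative position bin (illustrative).

    summaries: list of summaries, each a list of booleans with one flag
               per sentence (True means the sentence was judged hallucinated).
    Returns a list of num_bins rates, from the start to the end of the text.
    """
    hallucinated = [0] * num_bins
    total = [0] * num_bins
    for flags in summaries:
        n = len(flags)
        for i, flag in enumerate(flags):
            # Map sentence index i to a bin by its relative position i / n.
            b = min(i * num_bins // n, num_bins - 1)
            total[b] += 1
            hallucinated[b] += int(flag)
    return [h / t if t else 0.0 for h, t in zip(hallucinated, total)]
```

A rate that rises across the bins would be consistent with the late-position concentration the abstract reports.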

[179] Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites

Xintong Wang,Yixiao Liu,Jingheng Pan,Liang Ding,Longyue Wang,Chris Biemann

Main category: cs.CL

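TL;DR: ToxiRewriteCN is the first Chinese detoxification dataset designed to preserve sentiment polarity, with 1,556 annotated samples covering a range of scenarios. An evaluation of 17 LLMs finds that commercial and MoE models perform best, but all models struggle to balance safety with emotional fidelity in complex settings.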

Details

Motivation: Detoxifying text while preserving its original meaning is an important but challenging task in online communication. Toxicity in Chinese is often implicit in emojis, homophones, or context, and existing methods tend to produce overly polite rewrites that distort the sentiment. Method: Builds the ToxiRewriteCN dataset, which contains toxic sentences, sentiment-aligned non-toxic rewrites, and labeled toxic spans, and evaluates 17 LLMs on four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Result: Commercial and MoE models perform best overall, but in complex settings such as emoji, homophone, and dialogue inputs, all models struggle to balance safety with emotional fidelity. Conclusion: ToxiRewriteCN supports research on controllable, sentiment-aware detoxification for Chinese; model performance in complex settings needs further improvement. Abstract: Detoxifying offensive language while preserving the speaker's original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models with variant architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji, homophone, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.

[180] Multi-Hop Question Generation via Dual-Perspective Keyword Guidance

Maodong Li,Longyin Zhang,Fang Kong

Main category: cs.CL

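TL;DR: The paper proposes a Dual-Perspective Keyword-Guided (DPKG) framework for multi-hop question generation (MQG) that distinguishes question keywords from document keywords in order to locate key information snippets more effectively.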

Details

Motivation: Existing methods do not fully exploit the guiding potential of keywords and do not distinguish the different roles of question-specific and document-specific keywords. Method: Defines dual-perspective keywords (question and document keywords) and designs the DPKG framework, which consists of an expanded Transformer encoder and two answer-aware decoders. Result: Experiments show that the DPKG framework performs well on multi-hop question generation. Conclusion: The DPKG framework improves performance on the MQG task and is of practical value. Abstract: Multi-hop question generation (MQG) aims to generate questions that require synthesizing multiple information snippets from documents to derive target answers. The primary challenge lies in effectively pinpointing crucial information snippets related to question-answer (QA) pairs, typically relying on keywords. However, existing works fail to fully utilize the guiding potential of keywords and neglect to differentiate the distinct roles of question-specific and document-specific keywords. To address this, we define dual-perspective keywords (i.e., question and document keywords) and propose a Dual-Perspective Keyword-Guided (DPKG) framework, which seamlessly integrates keywords into the multi-hop question generation process. We argue that question keywords capture the questioner's intent, whereas document keywords reflect the content related to the QA pair. Functionally, question and document keywords work together to pinpoint essential information snippets in the document, with question keywords required to appear in the generated question. The DPKG framework consists of an expanded transformer encoder and two answer-aware transformer decoders for keyword and question generation, respectively. Extensive experiments demonstrate the effectiveness of our work, showcasing its promising performance and underscoring its significant value in the MQG task.

[181] Emotional Supporters often Use Multiple Strategies in a Single Turn

Xin Bai,Guanyi Chen,Tingting He,Chenlian Zhou,Yu Liu

Main category: cs.CL

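TL;DR: The paper redefines the emotional support conversation task so that a supporter may use multiple strategies within a single turn, and shows experimentally that large language models outperform both supervised models and human supporters on this task.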

Details

Motivation: Existing definitions of the emotional support conversation task are oversimplified and ignore the fact that supporters may use multiple strategies within a single turn. Method: Analyzes the ESConv dataset, redefines the task, and proposes several modelling approaches, including supervised deep learning models and large language models. Result: Experiments show that large language models outperform both supervised models and human supporters on the redefined task and exhibit more holistic support capabilities. Conclusion: The paper highlights the importance of multi-strategy use in emotional support conversations and demonstrates the potential of large language models in this area. Abstract: Emotional Support Conversations (ESC) are crucial for providing empathy, validation, and actionable guidance to individuals in distress. However, existing definitions of the ESC task oversimplify the structure of supportive responses, typically modelling them as single strategy-utterance pairs. Through a detailed corpus analysis of the ESConv dataset, we identify a common yet previously overlooked phenomenon: emotional supporters often employ multiple strategies consecutively within a single turn. We formally redefine the ESC task to account for this, proposing a revised formulation that requires generating the full sequence of strategy-utterance pairs given a dialogue history. To facilitate this refined task, we introduce several modelling approaches, including supervised deep learning models and large language models. Our experiments show that, under this redefined task, state-of-the-art LLMs outperform both supervised models and human supporters. Notably, contrary to some earlier findings, we observe that LLMs frequently ask questions and provide suggestions, demonstrating more holistic support capabilities.

[182] Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack

Silvia Cappelletti,Tobia Poppi,Samuele Poppi,Zheng-Xin Yong,Diego Garcia-Olano,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara

Main category: cs.CL

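TL;DR: The paper proposes a simple method called the 'prefilling attack': prepending a structured prefix (e.g., 'The correct option is:') to the model output substantially improves the accuracy, calibration, and consistency of first-token probability (FTP) evaluation in multiple-choice question answering.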

Details

Motivation: Current multiple-choice question answering evaluation based on first-token probability (FTP) is fragile: the model may assign high probability to unrelated tokens (misalignment) or use a valid token merely as part of a generic preamble (misinterpretation), which undermines the reliability of symbolic evaluation. Method: Proposes the 'prefilling attack', in which a structured natural-language prefix is prepended to the model output to steer the model toward a clean, valid option without modifying its parameters. Result: Experiments show that the method substantially improves the accuracy, calibration, and consistency of FTP on multiple-choice question answering; it outperforms standard FTP and often matches open-ended generation approaches that require full decoding and external classifiers, while being more efficient. Conclusion: Prefilling is a simple, robust, and low-cost way to improve the reliability of FTP-based multiple-choice evaluation. Abstract: Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks using *first-token probability* (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (*misalignment*) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (*misinterpretation*), undermining the reliability of symbolic evaluation. We propose a simple solution: the *prefilling attack*, a structured natural-language prefix (e.g., "*The correct option is:*") prepended to the model output. Originally explored in AI safety, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Empirically, the FTP with prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.
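
To make the FTP-with-prefilling idea concrete, here is a minimal sketch in Python. It is not the authors' code: `toy_model` is a hypothetical stand-in for an LLM that returns hand-picked next-token logits, and the option labels `A` to `D` are assumed. It only illustrates how a prefix can shift first-token probability mass onto valid option tokens.

```python
import math

def softmax(logits):
    """Convert a {token: logit} dict into a {token: probability} dict."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def ftp_choice(next_token_logits, options=("A", "B", "C", "D")):
    """Pick the option whose first token is most likely.

    Also returns the total probability mass on option tokens, a rough
    indicator of how much mass leaks to unrelated tokens.
    """
    probs = softmax(next_token_logits)
    option_mass = sum(probs.get(o, 0.0) for o in options)
    best = max(options, key=lambda o: probs.get(o, 0.0))
    return best, option_mass

def toy_model(prompt, prefill=""):
    """Hypothetical LLM stand-in with hand-picked logits (an assumption)."""
    if prefill.endswith("The correct option is:"):
        # After the prefix, option letters dominate the next-token distribution.
        return {"A": 0.5, "B": 4.0, "C": 0.2, "D": 0.1, "The": -1.0, "Sure": -1.0}
    # Without the prefix, preamble tokens such as "The" or "Sure" dominate.
    return {"A": 1.0, "B": 1.2, "C": 0.1, "D": 0.0, "The": 3.0, "Sure": 2.5}

question = "Which option is correct? A) ... B) ... C) ... D) ..."
plain_best, plain_mass = ftp_choice(toy_model(question))
prefill_best, prefill_mass = ftp_choice(
    toy_model(question, prefill="The correct option is:")
)
```

In this toy setup most of the probability mass sits on preamble tokens without the prefix, so the FTP decision rests on a small residual; with the prefix, nearly all of the mass falls on the option letters.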

[183] Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation

Yuhao Zhang,Xiangnan Ma,Kaiqi Kou,Peizhuo Liu,Weiqiao Shan,Benyou Wang,Tong Xiao,Yuxin Huang,Zhengtao Yu,Jingbo Zhu

Main category: cs.CL

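TL;DR: The paper proposes a unit-language-based approach to textless speech-to-speech translation that addresses cross-modal and cross-lingual alignment challenges and uses task prompt modeling to improve performance.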

Details

Motivation: Speech-to-speech translation (S2ST) faces two challenges, cross-modal feature extraction and cross-lingual alignment over long sequences, and needs a new representation. Method: Uses the unit language as a text-like representation, combined with multi-task learning and task prompt modeling, to guide speech modeling. Result: On four languages of the VoxPopuli dataset, the method significantly outperforms the baseline and approaches the performance of text-based models. Conclusion: The unit language and task prompt modeling address the core challenges of S2ST and offer a new direction for textless translation. Abstract: The success of building textless speech-to-speech translation (S2ST) models has attracted much attention. However, S2ST still faces two main challenges: 1) extracting linguistic features for various speech signals, called cross-modal (CM), and 2) learning alignment of different languages in long sequences, called cross-lingual (CL). We propose the unit language to overcome the two modeling challenges. The unit language can be considered a text-like representation format, constructed using $n$-gram language modeling. We implement multi-task learning to utilize the unit language in guiding the speech modeling process. Our initial results reveal a conflict when applying source and target unit languages simultaneously. We propose task prompt modeling to mitigate this conflict. We conduct experiments on four languages of the VoxPopuli dataset. Our method demonstrates significant improvements over a strong baseline and achieves performance comparable to models trained with text.

[184] Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors

Hao Fang,Jiawei Kong,Tianqu Zhuang,Yixiang Qiu,Kuofeng Gao,Bin Chen,Shu-Tao Xia,Yaowei Wang,Min Zhang

Main category: cs.CL

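TL;DR: The paper proposes a training-free Contrastive Paraphrase Attack (CoPA) that uses off-the-shelf large language models (LLMs) to deceive text detectors, avoiding the large data and compute budgets that existing methods require.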

Details

Motivation: Misuse of large language models (LLMs), such as academic plagiarism, has driven the development of text detectors, and paraphrase attacks try to evade them. Existing attacks need substantial resources and work poorly against advanced detection algorithms, so a more efficient method is needed. Method: CoPA uses carefully crafted instructions to make an LLM produce more human-like text, builds an auxiliary machine-like word distribution as a contrast, and subtracts the machine-like patterns during decoding to produce text that is harder to detect. Result: Theoretical analysis and experiments indicate that CoPA can fool text detectors across a variety of scenarios. Conclusion: CoPA is an efficient, training-free paraphrase attack that markedly strengthens attacks against text detectors. Abstract: The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.
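
The "subtract the machine-like patterns during decoding" step can be illustrated with a generic contrastive-decoding-style score. The abstract does not give CoPA's exact formula, so everything below is an assumption: the score `log p_human - lam * log p_machine`, the plausibility cutoff `alpha`, and the toy distributions are illustrative only.

```python
import math

def contrastive_scores(p_human, p_machine, lam=0.5, alpha=0.1):
    """Score tokens by human-like log-probability minus a weighted
    machine-like log-probability (assumed form, not the paper's).

    Tokens whose human-like probability falls below alpha * max are
    dropped, so the subtraction cannot promote implausible tokens.
    """
    cutoff = alpha * max(p_human.values())
    scores = {}
    for tok, ph in p_human.items():
        if ph < cutoff:
            continue
        pm = max(p_machine.get(tok, 0.0), 1e-12)  # guard against log(0)
        scores[tok] = math.log(ph) - lam * math.log(pm)
    return scores

def pick_token(p_human, p_machine, lam=0.5, alpha=0.1):
    """Greedy choice under the contrastive score."""
    scores = contrastive_scores(p_human, p_machine, lam, alpha)
    return max(scores, key=scores.get)

# Toy next-token distributions (hypothetical values).
p_human = {"delve": 0.40, "dig": 0.35, "look": 0.24, "zzz": 0.01}
p_machine = {"delve": 0.70, "dig": 0.10, "look": 0.19, "zzz": 0.01}

contrastive_pick = pick_token(p_human, p_machine)        # penalizes "delve"
greedy_pick = pick_token(p_human, p_machine, lam=0.0)    # plain greedy
```

With `lam=0.0` the sketch reduces to ordinary greedy decoding over the human-like distribution; raising `lam` increasingly penalizes tokens the machine-like distribution favors.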

[185] FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management

Xiang Liu,Hong Chen,Xuming Hu,Xiaowen Chu

Main category: cs.CL

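TL;DR: FlowKV is a multi-turn isolation mechanism for KV cache management that avoids repeatedly compressing earlier conversational context and substantially improves performance.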

Details

Motivation: The linear growth of the KV cache and the repeated compression of context in multi-turn dialogue cause information loss and degraded performance, so a more efficient cache management method is needed. Method: FlowKV uses a multi-turn isolation mechanism that compresses only the newly generated KV pairs and preserves the compressed cache from past turns, avoiding re-compression. Result: FlowKV significantly outperforms baseline methods in instruction-following accuracy and user preference retention, with reported figures going from 10.90% to 75.40%. Conclusion: FlowKV addresses KV cache management in multi-turn dialogue and substantially improves performance and user experience. Abstract: Large Language Models (LLMs) are increasingly deployed in multi-turn conversational applications, where the management of the Key-Value (KV) Cache presents a significant bottleneck. The linear growth of the KV Cache with dialogue history imposes substantial computational costs, and existing eviction strategies often degrade performance by repeatedly compressing early conversational context, leading to information loss and context forgetting. This paper introduces FlowKV, a novel multi-turn isolation mechanism for KV Cache management, which can be applied to any KV Cache compression method without training. FlowKV's core innovation is a multi-turn isolation mechanism that preserves the accumulated compressed KV cache from past turns. Compression is then strategically applied only to the newly generated KV pairs of the latest completed turn, effectively preventing the re-compression of older context and thereby mitigating catastrophic forgetting. Our results demonstrate that FlowKV consistently and significantly outperforms baseline strategies in maintaining instruction-following accuracy and user preference retention from 10.90% to 75.40%, particularly in later conversational turns.
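
The isolation idea described above can be sketched with a toy cache, assuming KV pairs are opaque items in a list and the compressor is an arbitrary function. The class names and the `keep_every_other` compressor are hypothetical and chosen only to show why re-compression erodes early turns.

```python
def keep_every_other(kv_pairs):
    """Toy compressor (an assumption): keeps every second KV pair."""
    return kv_pairs[::2]

class IsolatedKVCache:
    """Compresses each completed turn once and then freezes it."""

    def __init__(self, compress):
        self.compress = compress
        self.frozen = []  # compressed pairs from past turns, never re-compressed

    def end_turn(self, new_pairs):
        # Only the newest turn's pairs go through the compressor.
        self.frozen.extend(self.compress(new_pairs))
        return list(self.frozen)

class RecompressingKVCache:
    """Baseline: re-compresses the whole cache after every turn."""

    def __init__(self, compress):
        self.compress = compress
        self.cache = []

    def end_turn(self, new_pairs):
        self.cache = self.compress(self.cache + new_pairs)
        return list(self.cache)

# Three turns of eight KV pairs each, labeled by turn index.
turns = [[f"t{t}_{i}" for i in range(8)] for t in range(3)]
iso = IsolatedKVCache(keep_every_other)
base = RecompressingKVCache(keep_every_other)
for pairs in turns:
    iso_state = iso.end_turn(pairs)
    base_state = base.end_turn(pairs)

# How many pairs from the first turn survive after three turns?
iso_turn0 = sum(p.startswith("t0_") for p in iso_state)
base_turn0 = sum(p.startswith("t0_") for p in base_state)
```

Under this toy compressor the isolated cache keeps all four compressed pairs from the first turn, while the re-compressing baseline is left with one. The trade-off is that the isolated cache grows with the number of turns, since earlier segments are never shrunk again.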

[186] The Super Emotion Dataset

Enric Junqué de Fortuny

Main category: cs.CL

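TL;DR: The Super Emotion Dataset fills the gap left by the lack of a standardized, large-scale emotion classification dataset in NLP, unifying diverse text sources under a psychologically validated taxonomy.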

Details

Motivation: Existing emotion classification datasets suffer from inconsistent categories, limited sample sizes, or domain specificity, and there is no standard resource grounded in psychology. Method: Unifies diverse text sources into a single framework using Shaver's empirically validated emotion taxonomy. Result: A standardized dataset that supports cross-domain emotion recognition research. Conclusion: The Super Emotion Dataset provides a more consistent cross-domain resource for emotion recognition research. Abstract: Despite the wide-scale usage and development of emotion classification datasets in NLP, the field lacks a standardized, large-scale resource that follows a psychologically grounded taxonomy. Existing datasets either use inconsistent emotion categories, suffer from limited sample size, or focus on specific domains. The Super Emotion Dataset addresses this gap by harmonizing diverse text sources into a unified framework based on Shaver's empirically validated emotion taxonomy, enabling more consistent cross-domain emotion recognition research.

[187] Revealing Language Model Trajectories via Kullback-Leibler Divergence

Ryo Kishino,Yusuke Takase,Momose Oyama,Hiroaki Yamagiwa,Hidetoshi Shimodaira

Main category: cs.CL

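TL;DR: Using a KL divergence estimation method based on log-likelihood vectors, the paper systematically evaluates KL divergence between language models under different conditions and finds that model trajectories form a spiral structure during pretraining and thread-like progressions across layers.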

Details

Motivation: To better understand how KL divergence behaves for language models, especially when comparing models with different architectures. Method: Using coordinates assigned from log-likelihood vectors, the paper systematically evaluates KL divergence across pretraining checkpoints, fine-tuned versus base models, and layers. Result: Model trajectories show a spiral structure during pretraining and thread-like progressions across layers; trajectories in log-likelihood space are more constrained than those in weight space. Conclusion: KL divergence analysis in log-likelihood space reveals distinctive structure in language model behavior and offers a new perspective for model comparison. Abstract: A recently proposed method enables efficient estimation of the KL divergence between language models, including models with different architectures, by assigning coordinates based on log-likelihood vectors. To better understand the behavior of this metric, we systematically evaluate KL divergence across a wide range of conditions using publicly available language models. Our analysis covers comparisons between pretraining checkpoints, fine-tuned and base models, and layers via the logit lens. We find that trajectories of language models, as measured by KL divergence, exhibit a spiral structure during pretraining and thread-like progressions across layers. Furthermore, we show that, in terms of diffusion exponents, model trajectories in the log-likelihood space are more constrained than those in weight space.

[188] Decoding Phone Pairs from MEG Signals Across Speech Modalities

Xabier de Zuazo,Eva Navas,Ibon Saratxaga,Mathieu Bourguignon,Nicola Molinaro

Main category: cs.CL

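TL;DR: The study decodes phones from MEG signals during speech production and perception tasks and finds that decoding accuracy during speech production is significantly higher than during passive listening and playback, with the Elastic Net classifier performing best. Low-frequency oscillations (Delta and Theta bands) contribute most to decoding.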

Details

Motivation: To explore the neural mechanisms of speech production, in support of both cognitive neuroscience theory and practical communication technologies such as brain-computer interfaces. Method: Using MEG data from 17 participants, the study compares multiple machine learning methods (including regularized linear models and neural networks) on the classification of 15 phone pairs. Result: Decoding accuracy for speech production (76.6%) is significantly higher than for passive listening and playback (about 51%); the Elastic Net classifier performs best, and low-frequency oscillations (Delta and Theta bands) contribute most. Conclusion: The study underlines the importance of speech production paradigms and points to directions for improving brain-computer interfaces, but noise and artifacts still need to be addressed. Abstract: Understanding the neural mechanisms underlying speech production is essential for both advancing cognitive neuroscience theory and developing practical communication technologies. In this study, we investigated magnetoencephalography signals to decode phones from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 17 participants, we performed pairwise phone classification, extending our analysis to 15 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (76.6%) compared to passive listening and playback modalities (~51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited and high-dimensional MEG datasets. Besides, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2-3 Hz) and Theta (4-7 Hz), contributed the most substantially to decoding accuracy, suggesting that these bands encode critical speech production-related neural processes. Despite using advanced denoising methods, it remains unclear whether decoding solely reflects neural activity or if residual muscular or movement artifacts also contributed, indicating the need for further methodological refinement. Overall, our findings underline the critical importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces to help individuals with severe speech impairments.

[189] NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging

Weiming Zhang,Qingyao Li,Xinyi Dai,Jizheng Chen,Kounianhua Du,Weinan Zhang,Weiwen Liu,Yasheng Wang,Ruiming Tang,Yong Yu

Main category: cs.CL

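TL;DR: The NL-DEBUGGING framework uses natural language as an intermediate representation to improve code debugging and outperforms traditional methods.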

Details

Motivation: Traditional code-level debugging struggles with complex programming errors, and the potential of natural language reasoning for code tasks is not yet well understood. Method: Introduces the NL-DEBUGGING framework, which debugs using natural language as an intermediate representation. Result: NL-DEBUGGING outperforms traditional debugging methods and enables direct refinement guided by execution feedback. Conclusion: Natural language reasoning shows potential for automated code debugging and for solving complex programming problems. Abstract: Debugging is a critical aspect of LLMs' coding ability. Early debugging efforts primarily focused on code-level analysis, which often falls short when addressing complex programming errors that require a deeper understanding of algorithmic logic. Recent advancements in large language models (LLMs) have shifted attention toward leveraging natural language reasoning to enhance code-related tasks. However, two fundamental questions remain unanswered: What type of natural language format is most effective for debugging tasks? And what specific benefits does natural language reasoning bring to the debugging process? In this paper, we introduce NL-DEBUGGING, a novel framework that employs natural language as an intermediate representation to improve code debugging. By debugging at a natural language level, we demonstrate that NL-DEBUGGING outperforms traditional debugging methods and enables a broader modification space through direct refinement guided by execution feedback. Our findings highlight the potential of natural language reasoning to advance automated code debugging and address complex programming challenges.

[190] X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System

Peng Wang,Ruihan Tao,Qiguang Chen,Mengkang Hu,Libo Qin

Main category: cs.CL

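TL;DR: The paper introduces X-WebAgentBench, a multilingual agent benchmark for evaluating the planning and interaction performance of language agents in multilingual environments, addressing the lack of research on non-English scenarios.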

Details

Motivation: Current research on LLM-based agents focuses mainly on English scenarios, while more than 7,000 languages worldwide need comparable services. Existing work does not meet the needs of multilingual agent applications. Method: Proposes X-WebAgentBench, a multilingual agent benchmark in an interactive web environment that evaluates the planning and interaction performance of language agents and tests the effectiveness of different LLMs and cross-lingual alignment methods. Result: Even advanced models such as GPT-4o, combined with cross-lingual techniques, fail to achieve satisfactory results. Conclusion: X-WebAgentBench is expected to serve as a practical benchmark for multilingual agent scenarios and to advance global agent intelligence. Abstract: Recently, large language model (LLM)-based agents have achieved significant success in interactive environments, attracting significant academic and industrial attention. Despite these advancements, current research predominantly focuses on English scenarios. In reality, there are over 7,000 languages worldwide, all of which demand access to comparable agentic services. Nevertheless, the development of language agents remains inadequate for meeting the diverse requirements of multilingual agentic applications. To fill this gap, we introduce X-WebAgentBench, a novel multilingual agent benchmark in an interactive web environment, which evaluates the planning and interaction performance of language agents across multiple languages, thereby contributing to the advancement of global agent intelligence. Additionally, we assess the performance of various LLMs and cross-lingual alignment methods, examining their effectiveness in enhancing agents. Our findings reveal that even advanced models like GPT-4o, when combined with cross-lingual techniques, fail to achieve satisfactory results. We hope that X-WebAgentBench can serve as a valuable benchmark for multilingual agent scenarios in real-world applications.

[191] RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection

Yiming Huang,Junyan Zhang,Zihao Wang,Biquan Bie,Xuming Hu,Yi R. Fung,Xinlei He

Main category: cs.CL

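TL;DR: RePPL detects and explains large language model (LLM) hallucinations by recalibrating uncertainty measurement, combining uncertainty in semantic propagation and in language generation to produce explainable token-level scores.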

Details

Motivation: Existing hallucination detection methods cannot explain where hallucinations come from, that is, which parts of the input tend to trigger them. Method: Proposes RePPL, which recalibrates uncertainty measurement using uncertainty in semantic propagation and in language generation, produces token-level uncertainty scores, and aggregates them in a perplexity-style log-average form. Result: Achieves the best overall detection performance across several QA datasets (average AUC of 0.833) and provides token-level explanations; a chaotic pattern of hallucination is observed in preliminary analysis. Conclusion: RePPL improves hallucination detection while providing explainable token-level scores, showing promise for understanding and exploiting hallucination patterns. Abstract: Large Language Models (LLMs) have become powerful, but hallucinations remain a major obstacle to their trustworthy use. While previous works improved the capability of hallucination detection by measuring uncertainty, they all lack the ability to explain the provenance behind why hallucinations occur, i.e., which part of the inputs tends to trigger hallucinations. Recent works on the prompt attack indicate that uncertainty exists in semantic propagation, where attention mechanisms gradually fuse local token information into high-level semantics across layers. Meanwhile, uncertainty also emerges in language generation, due to its probability-based selection of high-level semantics for sampled generations. Based on that, we propose RePPL to recalibrate uncertainty measurement by these two aspects, which assigns an explainable uncertainty score to each token and aggregates the scores in a Perplexity-style Log-Average form as the total score. Experiments show that our method achieves the best comprehensive detection performance across various QA datasets on advanced models (average AUC of 0.833), and our method is capable of producing token-level uncertainty scores as explanations for the hallucination. Leveraging these scores, we preliminarily find the chaotic pattern of hallucination and showcase its promising usage.
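
For readers unfamiliar with a "perplexity-style log-average", the sketch below shows standard perplexity next to a log-average (geometric mean) of positive per-token scores. How RePPL actually computes its token scores is not given in the abstract; `log_average` and the example values are assumptions used only to show the aggregation shape.

```python
import math

def perplexity(token_probs):
    """Standard perplexity: exp of the mean negative log-probability."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

def log_average(token_scores):
    """Geometric mean of positive per-token scores.

    An assumed stand-in for aggregating token-level uncertainty scores
    in perplexity-style form; scores must be strictly positive.
    """
    n = len(token_scores)
    return math.exp(sum(math.log(s) for s in token_scores) / n)

# Hypothetical per-token probabilities for a four-token answer.
probs = [0.9, 0.8, 0.2, 0.7]
ppl = perplexity(probs)

# Perplexity is the log-average of inverse probabilities, so a single
# low-probability token raises the total without dominating it.
ppl_via_log_average = log_average([1.0 / p for p in probs])
```

The per-token terms remain available after aggregation, which is what makes this form convenient for token-level explanations.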

[192] Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

DongGeon Lee,Joonwon Jang,Jihae Jeong,Hwanjo Yu

Main category: cs.CL

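TL;DR: The study finds that vision-language models (VLMs) are markedly less safe when confronted with memes shared by real users, producing harmful outputs more readily than with synthetic or typographic images.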

Details

Motivation: To evaluate the safety of current VLMs on real-world memes and expose potential risks. Method: Introduces the MemeSafetyBench benchmark, which pairs 50,430 real meme instances with harmful and benign instructions, and evaluates multiple models using a safety taxonomy and LLM-generated instructions. Result: Memes significantly increase harmful outputs and reduce refusal rates; multi-turn interaction only partially mitigates the problem. Conclusion: More ecologically valid evaluations and stronger safety mechanisms are needed. Abstract: Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs show greater vulnerability to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms.

[193] An Empirical Study of the Anchoring Effect in LLMs: Existence, Mechanism, and Potential Mitigations

Yiming Huang,Biquan Bie,Zuqiu Na,Weilin Ruan,Songxin Lei,Yutao Yue,Xinlei He

Main category: cs.CL

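TL;DR: The study finds that the anchoring effect is widespread in large language models (LLMs) and is not eliminated by conventional strategies, although reasoning can partially mitigate it.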

Details

Motivation: To investigate whether LLMs are affected by the anchoring effect, what mechanisms underlie it, and how it can be mitigated. Method: Introduces a new dataset, SynAnchors, and benchmarks mainstream LLMs with refined evaluation metrics. Result: The anchoring effect is widespread in LLMs and acts noticeably in shallow layers; conventional strategies do not eliminate it, while reasoning offers partial mitigation. Conclusion: LLM evaluation should attend to cognitive biases rather than rely on standard benchmarks or over-optimized robustness tests. Abstract: The rise of Large Language Models (LLMs) like ChatGPT has advanced natural language processing, yet concerns about cognitive biases are growing. In this paper, we investigate the anchoring effect, a cognitive bias where the mind relies heavily on the first piece of information it receives as an anchor when making judgments. We explore whether LLMs are affected by anchoring, the underlying mechanisms, and potential mitigation strategies. To facilitate studies at scale on the anchoring effect, we introduce a new dataset, SynAnchors. Combining refined evaluation metrics, we benchmark current widely used LLMs. Our findings show that LLMs' anchoring bias is widespread, acts in shallow layers, and is not eliminated by conventional strategies, while reasoning can offer some mitigation. This recontextualization via cognitive psychology urges that LLM evaluations focus not on standard benchmarks or over-optimized robustness tests, but on cognitive-bias-aware trustworthy evaluation.

[194] How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study

Zhexin Zhang,Xian Qi Loye,Victor Shea-Jay Huang,Junxiao Yang,Qi Zhu,Shiyao Cui,Fei Mi,Lifeng Shang,Yingkang Wang,Hongning Wang,Minlie Huang

Main category: cs.CL

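TL;DR: The paper studies how to improve the safety of large reasoning models (LRMs) through supervised fine-tuning (SFT), finds that directly distilling safe responses has limited effect, proposes improvements, and shows that short reasoning processes can achieve comparable safety.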

Details

Motivation: Large reasoning models perform well on tasks such as mathematics and programming, but stronger reasoning does not necessarily improve safety and may even degrade it, which motivates studying how to make LRMs safer. Method: Runs experiments with supervised fine-tuning (SFT), analyzes the failure patterns of directly distilling safe responses, and proposes improvements; also examines how the complexity of the reasoning process affects safety. Result: Improved data distillation substantially increases safety; short or template-based reasoning achieves comparable safety; mixing in math reasoning data helps balance safety against over-refusal. Conclusion: The study offers a comprehensive view of how to improve LRM safety, shows that simple reasoning processes can suffice, and releases its code and data. Abstract: Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance, and in some cases may even degrade it. This raises an important research question: how can we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify three key failure patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process can lead to substantial safety improvements. Next, we explore whether a long and complex reasoning process is necessary for achieving safety. Interestingly, we find that simply using short or template-based reasoning processes can attain comparable safety performance, and that such processes are significantly easier for models to learn than more intricate reasoning chains. These findings prompt a deeper reflection on the role of reasoning in ensuring safety. Finally, we find that mixing math reasoning data during safety fine-tuning is helpful to balance safety and over-refusal. Overall, we hope our empirical study could provide a more holistic picture on enhancing the safety of LRMs. The code and data used in our experiments are released at https://github.com/thu-coai/LRM-Safety-Study.

[195] Trends and Challenges in Authorship Analysis: A Review of ML, DL, and LLM Approaches

Nudrat Habib,Tosin Adewumi,Marcus Liwicki,Elisa Barney

Main category: cs.CL

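TL;DR: This paper presents a systematic literature review of two sub-tasks of authorship analysis (author attribution and author verification), summarizing SOTA methods, feature extraction techniques, datasets, and challenges from 2015 to 2024, and identifying directions for future research.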

Details

Motivation: Authorship analysis matters in many domains (such as forensic linguistics and cybersecurity), but a systematic summary of recent methods and challenges has been lacking. Method: Through a literature review, analyzes the use of traditional ML, DL models, and LLMs in authorship analysis, together with their strengths and weaknesses. Result: Summarizes the limitations of existing methods, such as low-resource language processing and multilingual adaptation, and proposes directions for future research. Conclusion: The paper gives researchers an overview of recent trends and challenges in authorship analysis, with the aim of supporting more reliable and accurate systems. Abstract: Authorship analysis plays an important role in diverse domains, including forensic linguistics, academia, cybersecurity, and digital content authentication. This paper presents a systematic literature review on two key sub-tasks of authorship analysis: Author Attribution and Author Verification. The review explores SOTA methodologies, ranging from traditional ML approaches to DL models and LLMs, highlighting their evolution, strengths, and limitations, based on studies conducted from 2015 to 2024. Key contributions include a comprehensive analysis of methods, their corresponding feature extraction techniques, datasets used, and emerging challenges in authorship analysis. The study highlights critical research gaps, particularly in low-resource language processing, multilingual adaptation, cross-domain generalization, and AI-generated text detection. This review aims to help researchers by giving an overview of the latest trends and challenges in authorship analysis. It also points out possible areas for future study. The goal is to support the development of better, more reliable, and accurate authorship analysis systems in diverse textual domains.

[196] Gated Integration of Low-Rank Adaptation for Continual Learning of Language Models

Yan-Shuo Liang,Wu-Jun Li

Main category: cs.CL

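TL;DR: The paper proposes GainLoRA, a method that introduces gating modules to integrate new and old LoRA branches, mitigating forgetting in continual learning and outperforming existing methods in experiments.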

Details

Motivation: Existing LoRA-based continual learning methods typically force the new and old LoRA branches to contribute equally to old tasks, which can cause forgetting. Method: Proposes GainLoRA, which expands a new LoRA branch for each new task and integrates the new and old branches through gating modules, minimizing the new branch's contribution to old tasks. Result: Experiments show that GainLoRA outperforms existing state-of-the-art methods on continual learning benchmarks. Conclusion: GainLoRA mitigates forgetting through its gating modules and improves the overall performance of language models in continual learning. Abstract: Continual learning (CL), which requires the model to learn multiple tasks sequentially, is crucial for language models (LMs). Recently, low-rank adaptation (LoRA), one of the most representative parameter-efficient fine-tuning (PEFT) methods, has gained increasing attention in CL of LMs. However, most existing CL methods based on LoRA typically expand a new LoRA branch to learn each new task and force the new and old LoRA branches to contribute equally to old tasks, potentially leading to forgetting. In this work, we propose a new method, called gated integration of low-rank adaptation (GainLoRA), for CL of LMs. GainLoRA expands a new LoRA branch for each new task and introduces gating modules to integrate the new and old LoRA branches. Furthermore, GainLoRA leverages the new gating module to minimize the contribution from the new LoRA branch to old tasks, effectively mitigating forgetting and improving the model's overall performance. Experimental results on CL benchmarks demonstrate that GainLoRA outperforms existing state-of-the-art methods.
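
A small sketch may help show what gated integration of LoRA branches looks like in a forward pass. It assumes the common LoRA form `W0 x + B (A x)` with one scalar gate per branch; the hand-written gates below are hypothetical, whereas in GainLoRA the gating modules are learned and the abstract does not specify their architecture.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_delta(A, B, x):
    """Low-rank update B @ (A @ x), with A of shape r x d_in, B of shape d_out x r."""
    return matvec(B, matvec(A, x))

def gated_forward(W0, branches, gates, x):
    """Frozen base output plus each LoRA branch scaled by its gate value."""
    out = matvec(W0, x)
    for (A, B), gate in zip(branches, gates):
        g = gate(x)  # scalar in [0, 1]
        delta = lora_delta(A, B, x)
        out = [o + g * d for o, d in zip(out, delta)]
    return out

W0 = [[1.0, 0.0], [0.0, 1.0]]               # frozen base weight
old_branch = ([[1.0, 0.0]], [[1.0], [0.0]])  # rank-1 branch from the old task
new_branch = ([[0.0, 1.0]], [[0.0], [1.0]])  # rank-1 branch for the new task

def old_gate(x):
    return 1.0

def new_gate(x):
    # Hypothetical rule: treat inputs with x[0] > x[1] as old-task inputs
    # and close the new branch's gate for them.
    return 0.0 if x[0] > x[1] else 1.0

branches = [old_branch, new_branch]
gates = [old_gate, new_gate]
y_old_input = gated_forward(W0, branches, gates, [1.0, 0.5])
y_new_input = gated_forward(W0, branches, gates, [0.5, 1.0])
```

For the old-task input the new branch contributes nothing, so the output matches what the model produced before the new branch was added, which is the property the gating is meant to protect.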

[197] NeoN: A Tool for Automated Detection, Linguistic and LLM-Driven Analysis of Neologisms in Polish

Aleksandra Tomaszewska,Dariusz Czerski,Bartosz Żuk,Maciej Ogrodniczuk

Main category: cs.CL

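TL;DR: NeoN is a tool for detecting and analyzing Polish neologisms; its multi-layered pipeline combines reference corpora, linguistic filters, an LLM-driven precision-boosting filter, and RSS monitoring, significantly reducing manual effort.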

Details

Motivation: Traditional dictionary-based methods require extensive manual review; NeoN aims to provide an efficient, automated way to track lexical innovation in Polish. Method: NeoN uses context-aware lemmatization, frequency analysis, and orthographic normalization, together with an LLM module that generates definitions and categorizes neologisms. Result: Evaluations show that NeoN maintains high accuracy while significantly reducing manual effort. Conclusion: NeoN provides an efficient and accessible solution for tracking Polish neologisms. Abstract: We present NeoN, a tool for detecting and analyzing Polish neologisms. Unlike traditional dictionary-based methods requiring extensive manual review, NeoN combines reference corpora, Polish-specific linguistic filters, an LLM-driven precision-boosting filter, and daily RSS monitoring in a multi-layered pipeline. The system uses context-aware lemmatization, frequency analysis, and orthographic normalization to extract candidate neologisms while consolidating inflectional variants. Researchers can verify candidates through an intuitive interface with visualizations and filtering controls. An integrated LLM module automatically generates definitions and categorizes neologisms by domain and sentiment. Evaluations show NeoN maintains high accuracy while significantly reducing manual effort, providing an accessible solution for tracking lexical innovation in Polish.

[198] Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions

Zhiwen Li,Die Chen,Mingyuan Fan,Cen Chen,Yaliang Li,Yanhao Wang,Wenmeng Zhou

Main category: cs.CL

TL;DR: Proposes a self-discovery approach that identifies a safe semantic direction vector in the embedding space and constrains text embeddings to a safe region, reducing NSFW content and social bias in diffusion model outputs.

Details

Motivation: The ability of diffusion models to generate high-fidelity images is undermined by NSFW content and social bias; existing methods perform poorly and degrade normal outputs. Method: A self-discovery approach identifies a semantic direction vector and restricts text embeddings to a safe region, with LoRA used to initialize the direction vector so as to limit the impact on other semantics. Result: Experiments show the method effectively reduces NSFW content and social bias and outperforms existing baselines. Conclusion: The method improves the social responsibility of diffusion models and can be combined with existing methods. Abstract: The remarkable ability of diffusion models to generate high-fidelity images has led to their widespread adoption. However, concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases, hindering their practical use in real-world applications. In response to this challenge, prior work has focused on employing security filters to identify and exclude toxic text, or alternatively, fine-tuning pre-trained diffusion models to erase sensitive concepts. Unfortunately, existing methods struggle to achieve satisfactory performance in the sense that they can have a significant impact on the normal model output while still failing to prevent the generation of harmful content in some cases. In this paper, we propose a novel self-discovery approach to identifying a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method circumvents the need for correcting individual words within the input text and steers the entire text prompt towards a safe region in the embedding space, thereby enhancing model robustness against all possibly unsafe prompts. In addition, we employ Low-Rank Adaptation (LoRA) for semantic direction vector initialization to reduce the impact on the model performance for other semantics. Furthermore, our method can also be integrated with existing methods to improve their social responsibility. Extensive experiments on benchmark datasets demonstrate that our method can effectively reduce NSFW content and mitigate social bias generated by diffusion models compared to several state-of-the-art baselines.

[199] Likelihood Variance as Text Importance for Resampling Texts to Map Language Models

Momose Oyama,Ryo Kishino,Hiroaki Yamagiwa,Hidetoshi Shimodaira

Main category: cs.CL

TL;DR: Proposes a resampling method that selects important texts to cut computational cost while preserving the accuracy of KL divergence estimates.

Details

Motivation: Constructing a language model map relies on log-likelihoods over a large text set, which makes it computationally expensive. Method: A resampling method selects important texts according to the variance of each text's log-likelihood across models, reducing the number of texts required. Result: Experiments show the method matches the KL divergence estimation performance of uniform sampling with about half as many texts and allows new models to be incorporated efficiently. Conclusion: The method enables efficient and scalable construction of language model maps. Abstract: We address the computational cost of constructing a model map, which embeds diverse language models into a common space for comparison via KL divergence. The map relies on log-likelihoods over a large text set, making the cost proportional to the number of texts. To reduce this cost, we propose a resampling method that selects important texts with weights proportional to the variance of log-likelihoods across models for each text. Our method significantly reduces the number of required texts while preserving the accuracy of KL divergence estimates. Experiments show that it achieves comparable performance to uniform sampling with about half as many texts, and also facilitates efficient incorporation of new models into an existing map. These results enable scalable and efficient construction of language model maps.
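
The variance-proportional selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy log-likelihood table, and the use of sampling with replacement are hypothetical, and the paper may correct or normalize the resampled estimates in ways not shown here.

```python
import random
import statistics

def variance_weights(loglik):
    """loglik[m][t] is the log-likelihood of text t under model m.

    Returns one sampling probability per text, proportional to the variance
    of that text's log-likelihood across models.
    """
    n_texts = len(loglik[0])
    variances = [
        statistics.pvariance([row[t] for row in loglik]) for t in range(n_texts)
    ]
    total = sum(variances)
    if total == 0:
        # Degenerate case: all models agree on every text, so sample uniformly.
        return [1.0 / n_texts] * n_texts
    return [v / total for v in variances]

def resample(probs, k, seed=0):
    """Draw k text indices with replacement according to probs."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=k)

# Toy table: 3 models (rows) by 3 texts (columns).
loglik = [
    [-10.0, -20.0, -30.0],
    [-10.0, -24.0, -30.5],
    [-10.0, -16.0, -29.5],
]
probs = variance_weights(loglik)
sample = resample(probs, k=5)
```

Text 0 receives probability zero because every model assigns it the same log-likelihood, so it carries no information for telling the models apart. Text 1, on which the models disagree most, is drawn most often.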

[200] Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

Ao Liu,Botong Zhou,Can Xu,Chayse Zhou,ChenChen Zhang,Chengcheng Xu,Chenhao Wang,Decheng Wu,Dengpeng Wu,Dian Jiao,Dong Du,Dong Wang,Feng Zhang,Fengzong Lian,Guanghui Xu,Guanwei Zhang,Hai Wang,Haipeng Luo,Han Hu,Huilin Xu,Jiajia Wu,Jianchen Zhu,Jianfeng Yan,Jiaqi Zhu,Jihong Zhang,Jinbao Xue,Jun Xia,Junqiang Zheng,Kai Liu,Kai Zhang,Kai Zheng,Kejiao Li,Keyao Wang,Lan Jiang,Lixin Liu,Lulu Wu,Mengyuan Huang,Peijie Yu,Peiqi Wang,Qian Wang,Qianbiao Xiang,Qibin Liu,Qingfeng Sun,Richard Guo,Ruobing Xie,Saiyong Yang,Shaohua Chen,Shihui Hu,Shuai Li,Shuaipeng Li,Shuang Chen,Suncong Zheng,Tao Yang,Tian Zhang,Tinghao Yu,Weidong Han,Weijie Liu,Weijin Zhou,Weikang Wang,Wesleye Chen,Xiao Feng,Xiaoqin Ren,Xingwu Sun,Xiong Kuang,Xuemeng Huang,Xun Cao,Yanfeng Chen,Yang Du,Yang Zhen,Yangyu Tao,Yaping Deng,Yi Shen,Yigeng Hong,Yiqi Chen,Yiqing Huang,Yuchi Deng,Yue Mao,Yulong Wang,Yuyuan Zeng,Zenan Xu,Zhanhui Kang,Zhe Zhao,ZhenXiang Yan,Zheng Fang,Zhichao Hu,Zhongzhi Chen,Zhuoyu Li,Zongwei Li,Alex Yan,Ande Liang,Baitong Liu,Beiping Pan,Bin Xing,Binghong Wu,Bingxin Qu,Bolin Ni,Boyu Wu,Chen Li,Cheng Jiang,Cheng Zhang,Chengjun Liu,Chengxu Yang,Chiyu Wang,Chong Zha,Daisy Yi,Di Wang,Fanyang Lu,Fei Chen,Feifei Liu,Feng Zheng,Guanghua Yu,Guiyang Li,Guohua Wang,Haisheng Lin,Han Liu,Han Wang,Hao Fei,Hao Lu,Haoqing Jiang,Haoran Sun,Haotian Zhu,Huangjin Dai,Huankui Chen,Huawen Feng,Huihui Cai,Huxin Peng,Jackson Lv,Jiacheng Shi,Jiahao Bu,Jianbo Li,Jianglu Hu,Jiangtao Guan,Jianing Xu,Jianwei Cai,Jiarong Zhang,Jiawei Song,Jie Jiang,Jie Liu,Jieneng Yang,Jihong Zhang,Jin lv,Jing Zhao,Jinjian Li,Jinxing Liu,Jun Zhao,Juntao Guo,Kai Wang,Kan Wu,Lei Fu,Lei He,Lei Wang,Li Liu,Liang Dong,Liya Zhan,Long Cheng,Long Xu,Mao Zheng,Meng Liu,Mengkang Hu,Nanli Chen,Peirui Chen,Peng He,Pengju Pan,Pengzhi Wei,Qi Yang,Qi Yi,Roberts Wang,Rongpeng Chen,Rui Sun,Rui Yang,Ruibin Chen,Ruixu Zhou,Shaofeng Zhang,Sheng Zhang,Shihao Xu,Shuaishuai Chang,Shulin Liu,SiQi Wang,Songjia Feng,Songling Yuan,Tao Zhang,Tianjiao Lang,Tongkai Li,Wei Deng,Wei Li,Weichao Wang,Weigang Zhang,Weixuan Sun,Wen Ouyang,Wenxiang Jiao,Wenzhi Sun,Wenzhuo Jia,Xiang Zhang,Xiangyu He,Xianshun Ren,XiaoYing Zhu,Xiaolong Guo,Xiaoxue Li,Xiaoyu Ma,Xican Lu,Xinhua Feng,Xinting Huang,Xinyu Guan,Xirui Li,Xu Zhang,Xudong Gao,Xun Luo,Xuxiang Qi,Yangkun Chen,Yangyu Tao,Yanling Xiao,Yantao Mai,Yanze Chen,Yao Ding,Yeting Yang,YiFan Song,Yifan Yang,Yijiao Zhu,Yinhe Wu,Yixian Liu,Yong Yang,Yuanjun Cai,Yuanlin Tu,Yue Zhang,Yufei Huang,Yuhang Zhou,Yuhao Jiang,Yuhong Liu,Yuhui Hu,Yujin Lin,Yun Yang,Yunhao Wang,Yusong Zhang,Zekun Wu,Zelong Zhang,Zhan Yu,Zhaoliang Yang,Zhe Zhao,Zheng Li,Zhenyu Huang,Zhiguang Liu,Zhijiang Xu,Zhiqing Kui,Zhiyin Zeng,Zhiyuan Xiong,Zhuo Han,Zifan Wu,Zigang Geng,Zilong Zhao,Ziyan Tang,Ziyuan Zhu,Zonglei Zhu,Zhijiang Xu

Main category: cs.CL

TL;DR: Hunyuan-TurboS is a new hybrid Transformer-Mamba MoE model that combines Mamba's efficiency on long sequences with the Transformer's contextual understanding, supports a 256K context length, and delivers strong performance at high efficiency.

Details

Motivation: As large language models advance rapidly, there is a need for a model that handles long sequences efficiently, understands context deeply, and makes good use of computational resources. Method: Uses a hybrid Transformer-Mamba architecture with an adaptive long-short chain-of-thought mechanism, linear-complexity Mamba2, and Grouped-Query Attention, and strengthens capabilities through a multi-stage training strategy. Result: Ranks in the top 7 on LMSYS Chatbot Arena with a score of 1356, ahead of models such as Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345), and averages 77.9% across 23 automated benchmarks. Conclusion: Hunyuan-TurboS balances performance and efficiency and establishes a new paradigm for efficient large-scale pre-trained models. Abstract: As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.

[201] On the Generalization vs Fidelity Paradox in Knowledge Distillation

Suhas Kamasetty Ramesh,Ayan Sengupta,Tanmoy Chakraborty

Main category: cs.CL

TL;DR: Knowledge distillation (KD) compresses large language models into smaller ones while preserving performance, but its effectiveness for smaller models and the mechanisms of knowledge transfer remain underexplored. Large-scale zero-shot experiments on models from 0.5B to 7B parameters show that KD improves the average performance of smaller models by up to 10%, while larger models gain only about 1.3%. Teacher performance matters little; teacher task expertise matters more. KD improves accuracy but can harm reasoning fidelity.

Details

Motivation: Investigate the effectiveness of knowledge distillation for smaller language models and the mechanisms of knowledge transfer, filling a research gap. Method: Large-scale empirical and statistical analysis of models from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Result: KD improves the average performance of smaller models by up to 10%, with a peak task-specific gain of 22%; larger models gain only about 1.3%. Teacher task expertise matters more than teacher performance, and KD can harm reasoning fidelity. Conclusion: KD is highly effective for smaller models but offers limited benefit for larger ones; accuracy gains must be weighed against losses in reasoning fidelity. Abstract: Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the mechanisms driving knowledge transfer remain underexplored. In this work, we present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Our findings reveal that KD can improve the average performance of smaller models by up to 10%, with a peak task-specific gain of 22%, while providing only marginal benefits (~1.3%) for larger models. Surprisingly, teacher performance has a minimal impact on student outcomes, while teacher task expertise impacts KD effectiveness. A correlation study indicates that smaller LMs benefit more from KD, whereas larger LMs show diminished gains. Additionally, we uncover a misalignment between improvements in student performance and reasoning fidelity, suggesting that while KD enhances accuracy, it does not always maintain the structured decision-making processes of the teacher. Our ablation study further highlights the importance of teacher signals and logit smoothing in influencing students' performance after distillation. Overall, our study offers a comprehensive empirical and statistical assessment of KD, highlighting both its benefits and trade-offs when distilling knowledge from larger to smaller LMs.
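
For readers unfamiliar with the mechanics, the standard logit-based distillation objective can be sketched as follows. This is the textbook formulation, not necessarily the exact loss used in the paper: the function names and the temperature value are assumptions, and reading the abstract's "logit smoothing" as temperature scaling is also an assumption.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
matched_student = [2.0, 1.0, 0.1]
mismatched_student = [0.1, 1.0, 2.0]

loss_matched = kd_loss(teacher, matched_student)
loss_mismatched = kd_loss(teacher, mismatched_student)
```

A higher temperature flattens both distributions, so the student is also trained on the teacher's relative preferences among non-top classes rather than only on its top prediction.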

[202] AdUE: Improving uncertainty estimation head for LoRA adapters in LLMs

Artem Zabolotnyi,Roman Makarov,Mile Mitrovic,Polina Proskura,Oleg Travkin,Roman Alferov,Alexey Zaytsev

Main category: cs.CL

TL;DR: AdUE is an efficient post-hoc uncertainty estimation method that uses a differentiable approximation of the maximum function and L2-SP regularization to improve uncertainty estimates for pre-trained language models on classification tasks.

Details

Motivation: Uncertainty estimation remains a key challenge for pre-trained language models under parameter-efficient fine-tuning such as adapters. Method: Uses a differentiable approximation of the maximum function and L2-SP regularization, which anchors the fine-tuned head weights and regularizes the model. Result: On five NLP classification datasets and four language models, AdUE outperforms baselines such as Mahalanobis distance and softmax response. Conclusion: AdUE is lightweight, requires no changes to the base model, and produces better-calibrated confidence. Abstract: Uncertainty estimation remains a critical challenge in adapting pre-trained language models to classification tasks, particularly under parameter-efficient fine-tuning approaches such as adapters. We introduce AdUE, an efficient post-hoc uncertainty estimation (UE) method, to enhance softmax-based estimates. Our approach (1) uses a differentiable approximation of the maximum function and (2) applies additional regularization through L2-SP, anchoring the fine-tuned head weights and regularizing the model. Evaluations on five NLP classification datasets across four language models (RoBERTa, ELECTRA, LLaMA-2, Qwen) demonstrate that our method consistently outperforms established baselines such as Mahalanobis distance and softmax response. Our approach is lightweight (no base-model changes) and produces better-calibrated confidence.
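
The two ingredients named in the abstract can be illustrated with a short sketch. The abstract does not say which smooth approximation of the maximum is used, so the log-sum-exp form below is an assumption, as are the function names and the `beta` and `strength` values. The L2-SP term follows the usual definition: a squared distance to a fixed starting point.

```python
import math

def smooth_max(values, beta=10.0):
    """Log-sum-exp approximation of max(values); tighter as beta grows."""
    m = max(values)
    return m + math.log(sum(math.exp(beta * (v - m)) for v in values)) / beta

def l2_sp(weights, anchor, strength=0.01):
    """L2-SP penalty: squared distance of weights from their anchor values."""
    return strength * sum((w - a) ** 2 for w, a in zip(weights, anchor))

probs = [0.7, 0.2, 0.1]            # hypothetical softmax output of a classifier
loose = smooth_max(probs, beta=1.0)
tight = smooth_max(probs, beta=1000.0)

head_anchor = [0.5, -0.3, 0.8]     # hypothetical fine-tuned head weights
head_moved = [0.6, -0.3, 0.8]
penalty_at_anchor = l2_sp(head_anchor, head_anchor)
penalty_moved = l2_sp(head_moved, head_anchor)
```

Unlike the hard maximum, `smooth_max` passes gradient to every class probability, which makes it usable inside a trainable uncertainty head. The L2-SP term keeps that head close to the weights it started from.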

[203] Single LLM, Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization

Yutao Zhu,Jiajie Jin,Hongjin Qian,Zheng Liu,Zhicheng Dou,Ji-Rong Wen

Main category: cs.CL

TL;DR: RoleRAG is a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. It comprises six modules and uses a query graph to decompose queries dynamically. Experiments demonstrate its effectiveness and flexibility.

Details

Motivation: Prior work has optimized individual RAG sub-tasks, but integrating these optimizations into a unified framework remains challenging. Method: Proposes the RoleRAG framework with six modules, role-specific token optimization, and a query graph that decomposes queries dynamically. All modules are driven by the same LLM. Result: Experiments on five open-domain question-answering datasets demonstrate the framework's effectiveness, generalizability, and flexibility. Conclusion: With a unified framework and dynamic module activation, RoleRAG streamlines multi-task RAG processing and reduces resource consumption. Abstract: Existing studies have optimized retrieval-augmented generation (RAG) across various sub-tasks, such as query understanding and retrieval refinement, but integrating these optimizations into a unified framework remains challenging. To tackle this problem, this work proposes RoleRAG, a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. RoleRAG comprises six modules, each handling a specific sub-task within the RAG process. Additionally, we introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state. All modules are driven by the same underlying LLM, distinguished by task-specific role tokens that are individually optimized. This design allows RoleRAG to dynamically activate different modules within a single LLM instance, thereby streamlining deployment and reducing resource consumption. Experimental results on five open-domain question-answering datasets demonstrate the effectiveness, generalizability, and flexibility of our framework.

[204] Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment

Weixiang Zhao,Xingyu Sui,Yulin Hu,Jiahe Guo,Haixiao Liu,Biye Li,Yanyan Zhao,Bing Qin,Ting Liu

Main category: cs.CL

TL;DR: The paper proposes the RLPA framework, which uses reinforcement learning to refine user profiles dynamically and improve personalized dialogue.

Details

Motivation: Existing static methods fall short in cold-start scenarios and long-term personalization, so user profiles need to be refined dynamically. Method: The RLPA framework uses reinforcement learning with a dual-level reward structure (Profile Reward and Response Reward) to optimize the dialogue model. Result: Qwen-RLPA outperforms existing methods in personalized dialogue and even surpasses commercial models such as Claude-3.5 and GPT-4o. Conclusion: Dynamic profile inference is a more effective paradigm for building personalized dialogue systems. Abstract: Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA's robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.

[205] Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning

Yukun Zhao,Lingyong Yan,Zhenyang Li,Shuaiqiang Wang,Zhumin Chen,Zhaochun Ren,Dawei Yin

Main category: cs.CL

TL;DR: The paper proposes Joint Flashback Adaptation, which introduces a small number of prompts from old tasks (flashbacks), constrains deviations in the model's outputs, and interpolates latent tasks to address catastrophic forgetting in incremental learning for large language models.

Details

Motivation: Large language models suffer from catastrophic forgetting when learning new tasks incrementally, and existing methods face strict limitations in real-world scenarios. Method: Proposes Joint Flashback Adaptation, which combines flashbacks with latent task interpolation to learn new and old tasks jointly. Result: In experiments on 1000+ tasks, the method significantly improves generalization on new tasks and reduces forgetting of old tasks. Conclusion: Joint Flashback Adaptation is an efficient, task-agnostic incremental learning method. Abstract: Large language models have achieved remarkable success in various tasks. However, it is challenging for them to learn new tasks incrementally due to catastrophic forgetting. Existing approaches rely on experience replay, optimization constraints, or task differentiation, which encounter strict limitations in real-world scenarios. To address these issues, we propose Joint Flashback Adaptation. We first introduce flashbacks -- a limited number of prompts from old tasks -- when adapting to new tasks and constrain the deviations of the model outputs compared to the original one. We then interpolate latent tasks between flashbacks and new tasks to enable jointly learning relevant latent tasks, new tasks, and flashbacks, alleviating data sparsity in flashbacks and facilitating knowledge sharing for smooth adaptation. Our method requires only a limited number of flashbacks without access to the replay data and is task-agnostic. We conduct extensive experiments on state-of-the-art large language models across 1000+ instruction-following tasks, arithmetic reasoning tasks, and general reasoning tasks. The results demonstrate the superior performance of our method in improving generalization on new tasks and reducing forgetting in old tasks.

[206] CoLA: Collaborative Low-Rank Adaptation

Yiyun Zhou,Chang Yao,Jingyuan Chen

Main category: cs.CL

TL;DR: The paper proposes CoLA, a flexible LoRA architecture that improves parameter-efficient fine-tuning (PEFT) in multi-task and low-sample scenarios through collaborative strategies between matrices A and B.

Details

Motivation: Although LoRA performs well in parameter-efficient fine-tuning, interference between tasks limits its use in multi-task scenarios. Existing approaches such as MOE and asymmetric LoRA still struggle with sample scarcity and noise interference. Method: Proposes the CoLA architecture, with an efficient initialization scheme and three collaborative strategies that exploit the quantitative relationships between matrices A and B. Result: Experiments show that CoLA outperforms existing PEFT methods, especially in low-sample scenarios, with greater effectiveness and robustness. Conclusion: Through its flexible architecture and collaborative strategies, CoLA improves LoRA's performance in multi-task and low-sample settings and offers a new option for PEFT. Abstract: The scaling law of Large Language Models (LLMs) reveals a power-law relationship, showing diminishing returns on performance as model scale increases. While training LLMs from scratch is resource-intensive, fine-tuning a pre-trained model for specific tasks has become a practical alternative. Full fine-tuning (FFT) achieves strong performance; however, it is computationally expensive and inefficient. Parameter-efficient fine-tuning (PEFT) methods, like LoRA, have been proposed to address these challenges by freezing the pre-trained model and adding lightweight task-specific modules. LoRA, in particular, has proven effective, but its application to multi-task scenarios is limited by interference between tasks. Recent approaches, such as Mixture-of-Experts (MOE) and asymmetric LoRA, have aimed to mitigate these issues but still struggle with sample scarcity and noise interference due to their fixed structure. In response, we propose CoLA, a more flexible LoRA architecture with an efficient initialization scheme, and introduce three collaborative strategies to enhance performance by better utilizing the quantitative relationships between matrices $A$ and $B$. Our experiments demonstrate the effectiveness and robustness of CoLA, outperforming existing PEFT methods, especially in low-sample scenarios. Our data and code are fully publicly available at https://github.com/zyy-2001/CoLA.

[207] PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions

Song Dai,Yibo Yan,Jiamin Su,Dongfang Zihao,Yubo Gao,Yonghua Hei,Jungang Li,Junyan Zhang,Sicheng Tao,Zhuoran Gao,Xuming Hu

Main category: cs.CL

TL;DR: PhysicsArena is a multimodal physics reasoning benchmark designed to evaluate MLLMs holistically across three key dimensions: variable identification, physical process formulation, and solution derivation.

Details

Motivation: Current physics reasoning benchmarks are limited to text-only inputs or problem-solving and overlook the intermediate steps of variable identification and process formulation, so a more comprehensive evaluation platform is needed. Method: Introduces PhysicsArena, the first multimodal physics reasoning benchmark, covering variable identification, physical process formulation, and solution derivation. Result: PhysicsArena provides a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs. Conclusion: PhysicsArena addresses the shortcomings of existing benchmarks and provides an important tool for applying MLLMs to complex physics reasoning. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning presents unique challenges, requiring grounding in physical conditions and the interpretation of multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or solely on problem-solving, thereby overlooking the critical intermediate steps of variable identification and process formulation. To address these limitations, we introduce PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs across three critical dimensions: variable identification, physical process formulation, and solution derivation. PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.

[208] LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models

Zhanyue Qin,Yue Ding,Deyuan Liu,Qingbin Liu,Junxian Cai,Xi Chen,Zhiying Tu,Dianhui Chu,Cuiyun Gao,Dianbo Sui

Main category: cs.CL

TL;DR: The paper proposes the GenBiasEval and GenHintEval datasets with their metrics, AFGB-Score and UB-Score, to quantify gender bias in LLMs, and develops the LFTF algorithm to mitigate it.

Details

Motivation: LLMs are exposed to socially biased data during training and therefore exhibit gender bias, which needs to be quantified and mitigated. Method: Proposes the GenBiasEval and GenHintEval datasets and their metrics, and designs the LFTF algorithm, which uses the BMI score to locate the block most associated with bias and then fine-tunes it. Result: Experiments show that LFTF significantly mitigates gender bias in LLMs while maintaining their general capabilities. Conclusion: The proposed datasets, metrics, and algorithm provide effective tools for evaluating and mitigating gender bias in LLMs. Abstract: Nowadays, Large Language Models (LLMs) have attracted widespread attention due to their powerful performance. However, due to the unavoidable exposure to socially biased data during training, LLMs tend to exhibit social biases, particularly gender bias. To better explore and quantify the degree of gender bias in LLMs, we propose a pair of datasets named GenBiasEval and GenHintEval, respectively. The GenBiasEval is responsible for evaluating the degree of gender bias in LLMs, accompanied by an evaluation metric named AFGB-Score (Absolutely Fair Gender Bias Score). Meanwhile, the GenHintEval is used to assess whether LLMs can provide responses consistent with prompts that contain gender hints, along with the accompanying evaluation metric UB-Score (UnBias Score). Besides, in order to mitigate gender bias in LLMs more effectively, we present the LFTF (Locating First and Then Fine-Tuning) algorithm. The algorithm first ranks specific LLM blocks by their relevance to gender bias in descending order using a metric called BMI (Block Mitigating Importance Score). Based on this ranking, the block most strongly associated with gender bias is then fine-tuned using a carefully designed loss function. Numerous experiments have shown that our proposed LFTF algorithm can significantly mitigate gender bias in LLMs while maintaining their general capabilities.

[209] KaFT: Knowledge-aware Fine-tuning for Boosting LLMs' Domain-specific Question-Answering Performance

Qihuang Zhong,Liang Ding,Xiantao Cai,Juhua Liu,Bo Du,Dacheng Tao

Main category: cs.CL

TL;DR: The paper proposes KaFT, a knowledge-aware fine-tuning method that adjusts training weights according to the level of knowledge conflict and significantly improves LLM performance on domain-specific question answering.

Details

Motivation: Vanilla supervised fine-tuning (SFT) is suboptimal for domain-specific question answering, mainly because of conflicts between the model's internal knowledge and the context knowledge in the training data. The paper aims to address this problem. Method: Designs a query diversification strategy to detect knowledge conflicts and proposes KaFT, which adapts training weights by assigning different rewards to training samples with different conflict levels. Result: Experiments show that KaFT brings significant improvements across four LLMs, improves generalization, and alleviates hallucination. Conclusion: KaFT is a simple yet effective method that improves model performance by being aware of knowledge conflicts. Abstract: Supervised fine-tuning (SFT) is a common approach to improve the domain-specific question-answering (QA) performance of large language models (LLMs). However, recent literature reveals that due to the conflicts between LLMs' internal knowledge and the context knowledge of training data, vanilla SFT using the full QA training set is usually suboptimal. In this paper, we first design a query diversification strategy for robust conflict detection and then conduct a series of experiments to analyze the impact of knowledge conflict. We find that 1) training samples with varied conflicts contribute differently, where SFT on the data with large conflicts leads to catastrophic performance drops; 2) compared to directly filtering out the conflict data, appropriately applying the conflict data would be more beneficial. Motivated by this, we propose a simple-yet-effective Knowledge-aware Fine-tuning (namely KaFT) approach to effectively boost LLMs' performance. The core of KaFT is to adapt the training weight by assigning different rewards for different training samples according to conflict level. Extensive experiments show that KaFT brings consistent and significant improvements across four LLMs. More analyses prove that KaFT effectively improves the model generalization and alleviates the hallucination.
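
The idea of weighting training samples by conflict level can be sketched as follows. Everything specific here is hypothetical: the abstract does not give the conflict measure or the reward values, so the disagreement-rate score, the linear weight schedule, and all names are assumptions chosen only to illustrate the mechanism.

```python
def conflict_score(model_answers, gold):
    """Fraction of the model's answers (e.g. to paraphrased queries) that
    disagree with the gold answer. Hypothetical stand-in for conflict level."""
    wrong = sum(1 for answer in model_answers if answer != gold)
    return wrong / len(model_answers)

def sample_weight(score, low_conflict=1.0, high_conflict=0.2):
    """Interpolate linearly between the weight for no conflict and the weight
    for full conflict. High-conflict samples are down-weighted, not dropped."""
    return low_conflict + (high_conflict - low_conflict) * score

def weighted_loss(losses, weights):
    """Weighted average of per-sample losses."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

# Two hypothetical training samples with four sampled model answers each.
scores = [
    conflict_score(["a", "a", "b", "a"], gold="a"),   # mostly agrees
    conflict_score(["c", "d", "c", "c"], gold="a"),   # fully conflicts
]
weights = [sample_weight(s) for s in scores]
loss = weighted_loss([0.5, 2.0], weights)
```

Keeping a small non-zero weight on the fully conflicting sample reflects the abstract's finding that applying conflict data appropriately works better than filtering it out.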

[210] Collaborative Problem-Solving in an Optimization Game

Isidora Jeknic,Alex Duchnowski,Alexander Koller

Main category: cs.CL

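TL;DR: The paper introduces a novel dialogue game in which agents that combine LLM prompting with symbolic mechanisms collaboratively solve a two-player Traveling Salesman problem. The best agent solves 45% of games optimally and collaborates successfully with human users.

Details

Motivation: Develop dialogue agents that support human users in solving complex tasks, particularly NP-hard optimization problems. Method: Designs a novel dialogue game and an agent that combines LLM prompting with symbolic mechanisms for state tracking and grounding. Result: The best agent solves 45% of games optimally in self-play, collaborates successfully with human users, and generalizes to unfamiliar graphs. Conclusion: The approach shows the potential of dialogue agents for complex optimization problems, especially with respect to collaboration and generalization. Abstract: Dialogue agents that support human users in solving complex tasks have received much attention recently. Many such tasks are NP-hard optimization problems that require careful collaborative exploration of the solution space. We introduce a novel dialogue game in which the agents collaboratively solve a two-player Traveling Salesman problem, along with an agent that combines LLM prompting with symbolic mechanisms for state tracking and grounding. Our best agent solves 45% of games optimally in self-play. It also demonstrates an ability to collaborate successfully with human users and generalize to unfamiliar graphs.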

[211] Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs

Federico Ranaldi,Andrea Zugarini,Leonardo Ranaldi,Fabio Massimo Zanzotto

Main category: cs.CL

TL;DR: The paper introduces the concept of "protoknowledge" to formalize and measure how large language models (LLMs) internalize token sequences encoding Knowledge Graphs during pretraining and use them at inference time. Protoknowledge is measured through Knowledge Activation Tasks (KATs), and its impact on Text-to-SPARQL performance is studied.

Details

Motivation: Explore how LLMs turn token sequences memorized during pretraining into reusable knowledge through generalization. Method: Categorizes protoknowledge into lexical, hierarchical, and topological forms and measures its properties through KATs. A novel analysis framework assesses whether model predictions align with the activation of the relevant protoknowledge. Result: The study analyzes properties of protoknowledge such as semantic bias and examines its impact on the Text-to-SPARQL task. Conclusion: The framework provides a practical tool for exploring Semantic-Level Data Contamination and an effective strategy for Closed-Pretraining models. Abstract: We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). Indeed, LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We then categorize protoknowledge into lexical, hierarchical, and topological forms, varying on the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs), analyzing its general properties such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying prompting strategies depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool to explore Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.

[212] Multilingual Test-Time Scaling via Initial Thought Transfer

Prasoon Bajpai,Tanmoy Chakraborty

Main category: cs.CL

TL;DR: This paper studies test-time scaling in multilingual settings, finds that its gains vary across languages, and proposes MITT to improve reasoning performance in low-resource languages.

Details

Motivation: Test-time scaling works well in English, but its effectiveness in other languages has not been studied systematically; this paper aims to fill that gap. Method: Evaluates DeepSeek-R1-Distill-LLama-8B and DeepSeek-R1-Distill-Qwen-7B on high- and low-resource Latin-script languages and proposes MITT. Result: Gains from test-time scaling vary significantly across languages, and low-resource languages show lower reasoning consistency. MITT significantly improves reasoning performance for low-resource languages. Conclusion: MITT is an effective, lightweight method for improving reasoning performance in multilingual settings, especially for low-resource languages. Abstract: Test-time scaling has emerged as a widely adopted inference-time strategy for boosting reasoning performance. However, its effectiveness has been studied almost exclusively in English, leaving its behavior in other languages largely unexplored. We present the first systematic study of test-time scaling in multilingual settings, evaluating DeepSeek-R1-Distill-LLama-8B and DeepSeek-R1-Distill-Qwen-7B across both high- and low-resource Latin-script languages. Our findings reveal that the relative gains from test-time scaling vary significantly across languages. Additionally, models frequently switch to English mid-reasoning, even when operating under strictly monolingual prompts. We further show that low-resource languages not only produce initial reasoning thoughts that differ significantly from English but also have lower internal consistency across generations in their early reasoning. Building on our findings, we introduce MITT (Multilingual Initial Thought Transfer), an unsupervised and lightweight reasoning prefix-tuning approach that transfers high-resource reasoning prefixes to enhance test-time scaling across all languages, addressing inconsistencies in multilingual reasoning performance. MITT significantly boosts DeepSeek-R1-Distill-Qwen-7B's reasoning performance, especially for underrepresented languages.

[213] Evaluate Bias without Manual Test Sets: A Concept Representation Perspective for LLMs

Lang Gao,Kaiyang Wan,Wei Liu,Chenxi Wang,Zirui Song,Zixiang Xu,Yanbo Wang,Veselin Stoyanov,Xiuying Chen

Main category: cs.CL

TL;DR: BiasLens is a bias analysis framework that needs no labeled data. It detects bias in LLMs from the structure of the model's vector space and is more efficient and scalable than traditional methods.

Details

Motivation: Bias in large language models (LLMs) undermines their reliability and fairness. Existing methods rely on labeled data and cover a limited set of concepts, so a more efficient approach is needed. Method: Combines Concept Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract interpretable concept representations and quantifies bias through representational similarity. Result: BiasLens agrees strongly with traditional bias evaluation metrics (Spearman correlation r > 0.85) and reveals forms of bias that are difficult to detect with existing methods. Conclusion: BiasLens offers a scalable, interpretable, and efficient paradigm for improving fairness and transparency in LLMs. Abstract: Bias in Large Language Models (LLMs) significantly undermines their reliability and fairness. We focus on a common form of bias: when two reference concepts in the model's concept space, such as sentiment polarities (e.g., "positive" and "negative"), are asymmetrically correlated with a third, target concept, such as a reviewing aspect, the model exhibits unintended bias. For instance, the understanding of "food" should not skew toward any particular sentiment. Existing bias evaluation methods assess behavioral differences of LLMs by constructing labeled data for different social groups and measuring model responses across them, a process that requires substantial human effort and captures only a limited set of social concepts. To overcome these limitations, we propose BiasLens, a test-set-free bias analysis framework based on the structure of the model's vector space. BiasLens combines Concept Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract interpretable concept representations, and quantifies bias by measuring the variation in representational similarity between the target concept and each of the reference concepts. Even without labeled data, BiasLens shows strong agreement with traditional bias evaluation metrics (Spearman correlation r > 0.85). Moreover, BiasLens reveals forms of bias that are difficult to detect using existing methods. For example, in simulated clinical scenarios, a patient's insurance status can cause the LLM to produce biased diagnostic assessments. Overall, BiasLens offers a scalable, interpretable, and efficient paradigm for bias discovery, paving the way for improving fairness and transparency in LLMs.
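
The similarity-gap measurement described above can be sketched as follows. This is an illustration only, not the BiasLens implementation: the 2-D vectors are toy stand-ins for concept representations that the paper extracts with CAVs and SAEs, and the function names and the use of cosine similarity with an absolute difference are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bias_score(target, ref_a, ref_b):
    """Gap between the target concept's similarity to each reference concept.
    Zero means the target is equally related to both references."""
    return abs(cosine(target, ref_a) - cosine(target, ref_b))

positive = [1.0, 1.0]        # toy vector for the "positive" sentiment concept
negative = [1.0, -1.0]       # toy vector for the "negative" sentiment concept
neutral_food = [1.0, 0.0]    # target equally close to both references
skewed_food = [1.0, 1.0]     # target aligned with "positive"

balanced = bias_score(neutral_food, positive, negative)
skewed = bias_score(skewed_food, positive, negative)
```

The score needs no labeled test set: it depends only on where the three concept vectors sit relative to one another in the model's representation space.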

Angelie Kraft,Judith Simon,Sonja Schimmler

Main category: cs.CL

TL;DR: The study finds that popular QA and RC benchmarks are biased and do not cover different demographics and regions in a representative way, possibly because of a lack of diversity among their creators.

Details

Motivation: Bias in the QA and RC benchmarks used to assess large language models (LLMs) may affect fairness. Method: A qualitative content analysis of 30 benchmark papers and a quantitative analysis of 20 benchmark datasets. Result: Most benchmark papers provide insufficient information about those involved in benchmark creation, and only one explicitly reports measures taken to address social representation issues. The data show gender, religion, and geographic biases. Conclusion: More transparent and bias-aware benchmark creation practices are needed to support the development of fairer LLMs. Abstract: Question-answering (QA) and reading comprehension (RC) benchmarks are essential for assessing the capabilities of large language models (LLMs) in retrieving and reproducing knowledge. However, we demonstrate that popular QA and RC benchmarks are biased and do not cover questions about different demographics or regions in a representative way, potentially due to a lack of diversity of those involved in their creation. We perform a qualitative content analysis of 30 benchmark papers and a quantitative analysis of 20 respective benchmark datasets to learn (1) who is involved in the benchmark creation, (2) how social bias is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most analyzed benchmark papers provided insufficient information regarding the stakeholders involved in benchmark creation, particularly the annotators. Notably, just one of the benchmark papers explicitly reported measures taken to address social representation issues. Moreover, the data analysis revealed gender, religion, and geographic biases across a wide range of encyclopedic, commonsense, and scholarly benchmarks. More transparent and bias-aware QA and RC benchmark creation practices are needed to facilitate better scrutiny and incentivize the development of fairer LLMs.

[215] DayDreamer at CQs-Gen 2025: Generating Critical Questions through Argument Scheme Completion

Wendi Zhou,Ameer Saadat-Yazdi,Nadin Kökciyan

Main category: cs.CL

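TL;DR: The paper presents a method that uses large language models (LLMs) with chain-of-thought prompting to generate critical questions guided by Walton's argumentation schemes, and shows competitive performance in the ArgMining 2025 shared task.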

Details

Motivation: Critical questions are an important tool for provoking critical thinking, especially when reading argumentative text; the paper aims to generate contextually relevant and diverse critical questions by combining argumentation theory with step-by-step reasoning. Method: Using LLMs with chain-of-thought prompting, the system first builds structured arguments from the input text, then generates related critical questions, and finally prompts LLMs to rank them and select the 3 most helpful questions. Result: The method is competitive on the final test set, generating relevant critical questions and detecting missing or uninformed claims. Conclusion: Combining structured argumentation theory with step-by-step reasoning can foster critical thinking and improve the analysis of argumentative text. Abstract: Critical questions are essential resources to provoke critical thinking when encountering an argumentative text. We present our system for the Critical Questions Generation (CQs-Gen) Shared Task at ArgMining 2025. Our approach leverages large language models (LLMs) with chain-of-thought prompting to generate critical questions guided by Walton's argumentation schemes. For each input intervention, we conversationally prompt LLMs to instantiate the corresponding argument scheme template to first obtain structured arguments, and then generate relevant critical questions. Following this, we rank all the available critical questions by prompting LLMs to select the top 3 most helpful questions based on the original intervention text. This combination of structured argumentation theory and step-by-step reasoning enables the generation of contextually relevant and diverse critical questions. Our pipeline achieves competitive performance in the final test set, showing its potential to foster critical thinking given argumentative text and detect missing or uninformed claims. Code available at https://git.ecdf.ed.ac.uk/s2236454/DayDreamer-CQs-Gen.

Ana-Maria Bucur,Marcos Zampieri,Tharindu Ranasinghe,Fabio Crestani

Main category: cs.CL

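TL;DR: This survey covers the use of multilingual social media data for detecting mental health disorders, addressing the gap left by prior work that focuses mainly on English data.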

Details

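Motivation: Mental health problems are growing worldwide, creating an urgent need for digital screening methods that work in multilingual settings; existing studies focus mostly on English data and overlook important mental health signals in non-English text. Method: Analyzes multilingual social media data to examine how cultural differences shape online language patterns and self-disclosure behavior, considers how these factors affect the performance of NLP tools, and compiles a comprehensive list of multilingual datasets. Result: Cultural differences influence how mental health signals are expressed and how well NLP tools perform; the survey also provides data resources for building multilingual mental health screening models. Conclusion: The findings can inform the design of effective multilingual mental health screening tools that serve diverse populations and help improve mental health outcomes globally. Abstract: The increasing prevalence of mental health disorders globally highlights the urgent need for effective digital screening methods that can be used in multilingual contexts. Most existing studies, however, focus on English data, overlooking critical mental health signals that may be present in non-English texts. To address this important gap, we present the first survey on the detection of mental health disorders using multilingual social media data. We investigate the cultural nuances that influence online language patterns and self-disclosure behaviors, and how these factors can impact the performance of NLP tools. Additionally, we provide a comprehensive list of multilingual data collections that can be used for developing NLP models for mental health screening. Our findings can inform the design of effective multilingual mental health screening tools that can meet the needs of diverse populations, ultimately improving mental health outcomes on a global scale.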

[217] Do RAG Systems Suffer From Positional Bias?

Florin Cuconasu,Simone Filice,Guy Horowitz,Yoelle Maarek,Fabrizio Silvestri

Main category: cs.CL

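TL;DR: The paper studies how LLM positional bias in retrieval-augmented generation affects the use of relevant and distracting passages, and finds that the effect of positional bias is marginal in realistic settings.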

Details

Motivation: To examine how LLM positional bias affects retrieval-augmented generation, particularly when the prompt contains both relevant and distracting passages. Method: Extensive experiments on three benchmarks that analyze how often retrieval pipelines surface distracting passages and how this affects LLM performance. Result: More than 60% of queries have at least one highly distracting passage among the top-10 retrieved passages, yet the impact of positional bias is marginal in real scenarios. Conclusion: Sophisticated strategies that reorder passages according to LLM positional preferences do not outperform random shuffling, indicating that positional bias has limited impact in practice. Abstract: Retrieval Augmented Generation enhances LLM accuracy by adding passages retrieved from an external corpus to the LLM prompt. This paper investigates how positional bias - the tendency of LLMs to weight information differently based on its position in the prompt - affects not only the LLM's capability to capitalize on relevant passages, but also its susceptibility to distracting passages. Through extensive experiments on three benchmarks, we show how state-of-the-art retrieval pipelines, while attempting to retrieve relevant passages, systematically bring highly distracting ones to the top ranks, with over 60% of queries containing at least one highly distracting passage among the top-10 retrieved passages. As a result, the impact of the LLM positional bias, which in controlled settings is often reported as very prominent by related works, is actually marginal in real scenarios since both relevant and distracting passages are, in turn, penalized. Indeed, our findings reveal that sophisticated strategies that attempt to rearrange the passages based on LLM positional preferences do not perform better than random shuffling.

[218] Semantic-based Unsupervised Framing Analysis (SUFA): A Novel Approach for Computational Framing Analysis

Mohammad Ali,Naeemul Hassan

Main category: cs.CL

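TL;DR: SUFA is an unsupervised framing analysis method based on semantic relations and dependency parsing, used to analyze entity-centric emphasis frames in news media.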

Details

Motivation: To propose a more efficient method for analyzing entity-centric emphasis frames by combining qualitative and computational approaches. Method: SUFA uses semantic relations and dependency parsing algorithms, and is validated on a dataset related to gun violence. Result: SUFA represents a methodological advance in computational framing analysis, with broad applicability across the social sciences and computational domains. Conclusion: SUFA makes a significant methodological contribution to computational framing analysis; the paper also discusses its strengths and limitations. Abstract: This research presents a novel approach to computational framing analysis, called Semantic Relations-based Unsupervised Framing Analysis (SUFA). SUFA leverages semantic relations and dependency parsing algorithms to identify and assess entity-centric emphasis frames in news media reports. This innovative method is derived from two studies -- qualitative and computational -- using a dataset related to gun violence, demonstrating its potential for analyzing entity-centric emphasis frames. This article discusses SUFA's strengths, limitations, and application procedures. Overall, the SUFA approach offers a significant methodological advancement in computational framing analysis, with its broad applicability across both the social sciences and computational domains.

[219] From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning

David Dinucu-Jianu,Jakub Macina,Nico Daheim,Ido Hakimi,Iryna Gurevych,Mrinmaya Sachan

Main category: cs.CL

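TL;DR: The paper proposes an online reinforcement learning framework that optimizes large language models (LLMs) as tutors, emphasizing pedagogical quality and guided problem-solving over directly giving answers.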

Details

Motivation: LLMs have great potential in education, but optimizing them for direct question answering can undermine teaching, so a method is needed to align them with pedagogical goals. Method: An online reinforcement learning framework trains the model on simulated student-tutor interactions without human annotations, with a controllable reward weighting that balances pedagogical support against student solving accuracy. Result: The trained 7B-parameter model performs close to larger proprietary models such as LearnLM and preserves reasoning capabilities better than single-turn supervised fine-tuning (SFT) baselines. Conclusion: The method turns LLMs into more effective tutors while preserving their reasoning ability, and thinking tags can optionally improve interpretability. Abstract: Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors using simulated student-tutor interactions by emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B parameter tutor model without human annotations which reaches similar performance to larger proprietary models like LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, allowing us to trace the Pareto frontier between these two objectives. Our models better preserve reasoning capabilities than single-turn SFT baselines and can optionally enhance interpretability through thinking tags that expose the model's instructional planning.
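To make the phrase "controllable reward weighting" and the Pareto frontier concrete, the following is a rough sketch. The linear combination, the weight name `lam`, and the sample scores are all assumptions for illustration; the abstract does not state the exact reward formula.

```python
def combined_reward(pedagogy, accuracy, lam):
    """Hypothetical weighted reward: lam controls how much student
    solving accuracy counts relative to pedagogical quality."""
    return pedagogy + lam * accuracy

def pareto_front(points):
    """Keep the (pedagogy, accuracy) pairs that no other pair matches
    or beats on both objectives at once."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Made-up evaluation results for models trained with different weights.
results = [(0.9, 0.4), (0.7, 0.7), (0.5, 0.6), (0.3, 0.9)]
print(pareto_front(results))
```

Sweeping the weight and keeping only non-dominated results is one simple way to trace the trade-off between the two objectives.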

[220] Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

Wei Liu,Ruochen Zhou,Yiyun Deng,Yuzhen Huang,Junteng Liu,Yuntian Deng,Yizhe Zhang,Junxian He

Main category: cs.CL

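TL;DR: The paper proposes LASER and LASER-D, reinforcement learning methods that use length-based reward shaping to make large reasoning models more efficient, substantially reducing redundancy while improving performance.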

Details

Motivation: Large reasoning models produce long reasoning traces with substantial redundancy, which limits efficiency; the study aims to improve reasoning efficiency through reinforcement learning. Method: Presents a unified framework based on length-based reward shaping, designs LASER within it, and extends it to the dynamic and difficulty-aware LASER-D. Result: Experiments show that LASER-D and its variant improve performance across several models (e.g., +6.1 on AIME2024) while reducing token usage by 63%. Conclusion: Through dynamic, difficulty-aware reward shaping, LASER-D achieves a better Pareto trade-off between performance and efficiency. Abstract: Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%.
Further analysis reveals our RL-based compression produces more concise reasoning patterns with less redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.
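A step-function length reward of the kind described for LASER might look like the sketch below. The bonus size, the rule that only correct answers receive the bonus, and the function name are assumptions; the abstract only states that the reward is a step function controlled by a target length.

```python
def step_length_reward(is_correct, num_tokens, target_length, bonus=0.5):
    """Hypothetical step reward: a correct answer earns 1.0, plus a fixed
    bonus when the response stays within the target length."""
    base = 1.0 if is_correct else 0.0
    if is_correct and num_tokens <= target_length:
        return base + bonus
    return base

# The reward jumps (a "step") exactly at the target length.
print(step_length_reward(True, 180, target_length=200))
print(step_length_reward(True, 260, target_length=200))
print(step_length_reward(False, 180, target_length=200))
```

Under this reading, a difficulty-aware variant in the spirit of LASER-D would choose a smaller `target_length` for easy queries and a larger one for hard queries, and would adjust those targets as training progresses.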

[221] Can LLMs understand Math? -- Exploring the Pitfalls in Mathematical Reasoning

Tiasa Singha Roy,Aditeya Baral,Ayush Rajesh Jhaveri,Yusuf Baig

Main category: cs.CL

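TL;DR: The paper proposes MAPLE, a new evaluation framework that quantifies the mathematical reasoning of large language models more holistically than final-answer accuracy alone.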

Details

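Motivation: Large language models struggle to execute multi-step logic in mathematical reasoning, but existing evaluation frameworks consider only final-answer accuracy and cannot fully reflect their behavior. Method: Proposes an evaluation metric, the MAPLE score, which quantifies reasoning misalignment by integrating error rates, redundancy, and validity. Result: The MAPLE score gives a more complete assessment of mathematical reasoning and exposes the limitations of accuracy-only evaluation. Conclusion: The MAPLE framework offers a more comprehensive view of LLM mathematical reasoning and can help improve model design and training. Abstract: Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge their performance solely based on accuracy, which only accounts for the final answer. This study explores these pitfalls by employing a novel evaluation framework. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.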

[222] Listen to the Context: Towards Faithful Large Language Models for Retrieval Augmented Generation on Climate Questions

David Thulke,Jakob Kemmler,Christian Dugast,Hermann Ney

Main category: cs.CL

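TL;DR: The paper examines how to assess the faithfulness of retrieval-augmented generation models in the climate science domain and improves the faithfulness of ClimateGPT by refining its training data.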

Details

Motivation: To make climate-related technical documents more accessible through retrieval-augmented generation while reducing factual hallucinations. Method: Automatically assesses model faithfulness, analyzes how instruction fine-tuning affects ClimateGPT's faithfulness, and improves the model by excluding unfaithful subsets of the training data. Result: The resulting ClimateGPT Faithful+ raises faithfulness, measured as the share of supported atomic claims, from 30% to 57%. Conclusion: Improving the training data substantially increases the faithfulness of retrieval-augmented generation models and yields a more reliable tool for climate science. Abstract: Large language models that use retrieval augmented generation have the potential to unlock valuable knowledge for researchers, policymakers, and the public by making long and technical climate-related documents more accessible. While this approach can help alleviate factual hallucinations by relying on retrieved passages as additional context, its effectiveness depends on whether the model's output remains faithful to these passages. To address this, we explore the automatic assessment of faithfulness of different models in this setting. We then focus on ClimateGPT, a large language model specialised in climate science, to examine which factors in its instruction fine-tuning impact the model's faithfulness. By excluding unfaithful subsets of the model's training data, we develop ClimateGPT Faithful+, which achieves an improvement in faithfulness from 30% to 57% in supported atomic claims according to our automatic metric.

[223] Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models

Zihao Li,Xu Wang,Yuzhe Yang,Ziyu Yao,Haoyi Xiong,Mengnan Du

Main category: cs.CL

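TL;DR: The paper proposes a way to enhance LLM reasoning without external datasets by steering the model's internal states, using either sparse autoencoders (SAEs) or an SAE-free algorithm.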

Details

Motivation: Existing approaches such as DeepSeek-R1 rely on high-quality long CoT data and fine-tuning, which is costly; this work aims to improve LLM reasoning through steering and avoid that data dependence. Method: Uses SAEs to extract CoT features that steer the LLM's internal states, or computes steering directions directly from residual activations (SAE-free). Result: Experiments show that both the SAE-based and the SAE-free algorithms significantly improve LLM reasoning. Conclusion: The proposed steering techniques offer an efficient way to enhance LLM reasoning without external data. Abstract: Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Expanding CoT length, as seen in models such as DeepSeek-R1, significantly enhances this reasoning for complex problems, but requires costly and high-quality long CoT data and fine-tuning. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM's internal states during generation. Recognizing that many LLMs do not have corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm, which directly computes steering directions from the residual activations of an LLM, obviating the need for an explicit SAE. Experimental results demonstrate that both our SAE-based and subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.
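One common way to compute a steering direction from residual activations is a difference of means between two sets of activations; the sketch below uses that technique as a stand-in. Whether the paper's SAE-free algorithm works this way is an assumption, and the function names and toy activations are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(cot_acts, plain_acts):
    """Difference-of-means direction (assumed technique): mean activation
    on CoT-style text minus mean activation on plain text."""
    mu_cot = mean_vector(cot_acts)
    mu_plain = mean_vector(plain_acts)
    return [a - b for a, b in zip(mu_cot, mu_plain)]

def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 2-D residual activations (made up for illustration only).
cot = [[2.0, 0.0], [4.0, 2.0]]
plain = [[1.0, 1.0], [1.0, 1.0]]
direction = steering_direction(cot, plain)
print(direction)
print(steer([0.0, 0.0], direction, alpha=0.5))
```

In a real model the shift would be applied to a chosen layer's residual stream during generation, with `alpha` tuned so that outputs stay fluent.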

[224] Word Level Timestamp Generation for Automatic Speech Recognition and Translation

Ke Hu,Krishna Puvvada,Elena Rastorgueva,Zhehuai Chen,He Huang,Shuoyang Ding,Kunal Dhawan,Hainan Xu,Jagadeesh Balam,Boris Ginsburg

Main category: cs.CL

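TL;DR: Proposes a data-driven approach for word-level timestamp prediction in the Canary model without an external alignment mechanism, with precision and recall between 80% and 90%.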

Details

Motivation: Accurate timestamps are essential for downstream tasks such as speech content retrieval and timed subtitles; traditional methods need external modules, which the new approach avoids. Method: Uses the NeMo Forced Aligner as a teacher model to generate timestamps, trains the Canary model to predict them directly, and introduces a new <|timestamp|> token. Result: Across four languages, timestamp prediction errors range from 20 to 120 ms, precision and recall are between 80% and 90%, and WER degradation is minimal; on automatic speech translation the error is around 200 ms. Conclusion: The method performs well on word-level timestamp prediction and applies to multiple languages and tasks. Abstract: We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new <|timestamp|> token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages, with minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors around 200 milliseconds.
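Precision and recall for word-level timestamps can be computed by matching predicted words to reference words within a time tolerance. The sketch below shows one plausible scoring rule; the tolerance value, the greedy matching, and the tuple layout are assumptions, since the abstract does not define the metric in detail.

```python
def timestamp_precision_recall(predicted, reference, tolerance_ms=200):
    """Each item is (word, start_ms, end_ms). A prediction counts as a hit
    when the same word exists in the reference with both boundaries
    within the tolerance; each reference word can be matched once."""
    unmatched = list(reference)
    hits = 0
    for word, start, end in predicted:
        for ref in unmatched:
            r_word, r_start, r_end = ref
            if (word == r_word
                    and abs(start - r_start) <= tolerance_ms
                    and abs(end - r_end) <= tolerance_ms):
                hits += 1
                unmatched.remove(ref)
                break  # stop scanning after consuming this reference word
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

# Made-up example: only "hello" falls within the tolerance.
pred = [("hello", 0, 400), ("world", 500, 900), ("again", 1500, 1800)]
ref = [("hello", 20, 380), ("world", 900, 1300)]
precision, recall = timestamp_precision_recall(pred, ref)
print(precision, recall)
```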

[225] Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

Zhexin Zhang,Yuhao Sun,Junxiao Yang,Shiyao Cui,Hongning Wang,Minlie Huang

Main category: cs.CL

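TL;DR: The study finds that the original creator of an open-source large language model (LLM) can use backdoor training to extract private data from downstream fine-tuned versions, with extraction rates as high as 94.9%. Defense strategies have limited effect, and further research is needed.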

Details

Motivation: To reveal a new risk, that private data can be extracted after fine-tuning open-source LLMs, and to stress its severity. Method: Backdoor training enables extraction of private data with only black-box access to the fine-tuned model; experiments cover 4 open-source models and 2 datasets. Result: Extraction rates reach 76.3% in practical settings and 94.9% in more ideal settings, and the detection-based defense can be bypassed. Conclusion: Fine-tuning open-source LLMs carries a serious data leakage risk that calls for more research. Abstract: Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% downstream fine-tuning data (queries) out of a total 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with improved attack. Overall, we highlight the emergency of this newly identified data breaching risk in fine-tuning, and we hope that more follow-up research could push the progress of addressing this concerning risk. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.

[226] Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

Ke Hu,Ehsan Hosseini-Asl,Chen Chen,Edresson Casanova,Subhankar Ghosh,Piotr Żelasko,Zhehuai Chen,Jason Li,Jagadeesh Balam,Boris Ginsburg

Main category: cs.CL

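TL;DR: Proposes a novel duplex speech-to-speech (S2S) architecture that supports continuous user input and agent output, directly modeling simultaneous user and agent streams through channel fusion.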

Details

Motivation: Current speech language models are mostly limited to turn-based interaction and lack real-time adaptability such as handling user barge-in. Method: A pretrained streaming encoder processes user input, so no speech pretraining is required; agent and user are modeled with separate architectures, which allows codec fine-tuning for better agent voices and a lower bitrate. Result: The model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities while requiring less speech data. Conclusion: It is the first openly available duplex S2S model with training and inference code, and it simplifies building a duplex S2S model from any LLM. Abstract: Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech to speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretrain. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretrain is skipped, which markedly simplifies the process of building a duplex S2S model from any LLMs. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.

[227] UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models

Miao Yu,Liang Lin,Guibin Zhang,Xinfeng Li,Junfeng Fang,Ningyu Zhang,Kun Wang,Yang Wang

Main category: cs.CL

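TL;DR: UniErase is a new unlearning paradigm that uses a learnable parametric suffix (an unlearning token) to steer language models toward targeted forgetting, reaching SOTA performance in both unlearning efficacy and model ability.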

Details

Motivation: Existing unlearning methods fail to balance unlearning efficacy against model ability, leading to model collapse or poor generalization. Method: A two-phase approach: (I) token optimization binds the desired unlearning outputs to the model's autoregressive probability distribution, and (II) lightweight model editing activates the learned token to probabilistically induce the forgetting objective. Result: On the TOFU benchmark, UniErase modifies only about 3.66% of the LLM parameters and outperforms the previous forgetting SOTA baseline by about 4.01 times in model ability while achieving even better unlearning efficacy. Conclusion: UniErase achieves top-tier performance on both criteria and opens a new research direction of using token learning to induce unlearning targets. Abstract: Large language models require iterative updates to address challenges such as knowledge conflicts and outdated information (e.g., incorrect, private, or illegal contents). Machine unlearning provides a systematic methodology for targeted knowledge removal from trained models, enabling elimination of sensitive information influences. However, mainstream fine-tuning-based unlearning methods often fail to balance unlearning efficacy and model ability, frequently resulting in catastrophic model collapse under extensive knowledge removal. Meanwhile, in-context unlearning, which relies solely on contextual prompting without modifying the model's intrinsic mechanisms, suffers from limited generalizability and struggles to achieve true unlearning. In this work, we introduce UniErase, a novel unlearning paradigm that employs learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors. UniErase operates through two key phases: (I) an optimization stage that binds desired unlearning outputs to the model's autoregressive probability distribution via token optimization, followed by (II) a lightweight model editing phase that activates the learned token to probabilistically induce specified forgetting objective. Serving as a new research direction for token learning to induce unlearning target, UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings. Remarkably, in terms of TOFU benchmark, UniErase, modifying only around 3.66% of the LLM parameters, outperforms previous forgetting SOTA baseline by around 4.01 times for model ability with even better unlearning efficacy.
Similarly, UniErase, maintaining more ability, also surpasses previous retaining SOTA by 35.96% for unlearning efficacy, showing dual top-tier performances in current unlearning domain.

[228] The Representational Alignment between Humans and Language Models is implicitly driven by a Concreteness Effect

Cosimo Iaia,Bhavin Choksi,Emily Wiebers,Gemma Roig,Christian J. Fiebach

Main category: cs.CL

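TL;DR: The study examines how similarly humans and language models represent noun concreteness, finding significant alignment on the concreteness dimension but not on other psycholinguistic dimensions.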

Details

Motivation: How concreteness is represented in the mind and brain is a central question in psychology, neuroscience, and computational linguistics, yet how language models represent concreteness remains underexplored. Method: Estimates the semantic distances implicitly used by humans from behavioral judgments and compares human and language model semantic representations with Representational Similarity Analysis. Result: The representational spaces of humans and language models are significantly aligned, and this alignment is driven mainly by concreteness. Conclusion: Humans and language models converge on the concreteness dimension but not on other dimensions. Abstract: The nouns of our language refer to either concrete entities (like a table) or abstract concepts (like justice or love), and cognitive psychology has established that concreteness influences how words are processed. Accordingly, understanding how concreteness is represented in our mind and brain is a central question in psychology, neuroscience, and computational linguistics. While the advent of powerful language models has allowed for quantitative inquiries into the nature of semantic representations, it remains largely underexplored how they represent concreteness. Here, we used behavioral judgments to estimate semantic distances implicitly used by humans, for a set of carefully selected abstract and concrete nouns. Using Representational Similarity Analysis, we find that the implicit representational space of participants and the semantic representations of language models are significantly aligned. We also find that both representational spaces are implicitly aligned to an explicit representation of concreteness, which was obtained from our participants using an additional concreteness rating task. Importantly, using ablation experiments, we demonstrate that the human-to-model alignment is substantially driven by concreteness, but not by other important word characteristics established in psycholinguistics. These results indicate that humans and language models converge on the concreteness dimension, but not on other dimensions.
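Representational Similarity Analysis compares two systems by correlating their pairwise dissimilarities over the same items. The sketch below is a generic illustration of that procedure; the Euclidean distance, the Spearman (rank) correlation, and the toy data are assumptions, not the paper's exact setup.

```python
import math
from itertools import combinations

def pairwise_distances(vectors):
    """Upper triangle of a Euclidean dissimilarity matrix, as a flat list."""
    return [math.dist(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]

def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rsa_alignment(human_distances, model_vectors):
    """Spearman correlation between human and model dissimilarities."""
    return pearson(ranks(human_distances),
                   ranks(pairwise_distances(model_vectors)))

# Toy data: three nouns, human distances for pairs (0,1), (0,2), (1,2).
model_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
human_distances = [0.2, 0.9, 0.7]
print(rsa_alignment(human_distances, model_vectors))
```

Because only the rank order of the dissimilarities matters, the human and model distances do not need to share a scale.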

[229] A Federated Splitting Framework for LLMs: Security, Efficiency, and Adaptability

Zishuai Zhang,Hainan Zhang,Jiaying Zheng,Ziwei Wang,Yongxin Tong,Jin Dong,Zhiming Zheng

Main category: cs.CL

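TL;DR: FL-LLaMA is a secure, efficient, and adaptive federated split learning framework based on LLaMA2 that addresses privacy, efficiency, and adaptability challenges, performs close to centralized LLaMA2, and substantially speeds up training and inference.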

Details

Motivation: Private data is scattered across silos and is of high quality, but existing federated learning methods fall short on privacy, efficiency, and adaptability. Method: Places some input and output blocks on the local client, injects Gaussian noise into hidden states, uses parallel training strategies, and allows partition points to be adjusted dynamically. Result: On NLU, summarization, and conversational QA tasks, performance is comparable to centralized LLaMA2, with up to 2x training speedups and 8x inference speedups. Conclusion: FL-LLaMA performs well on privacy, efficiency, and adaptability and is suitable for federated environments. Abstract: Private data is typically larger and of higher quality than public data, offering great potential to improve LLMs. However, its scattered distribution across data silos and the high computational demands of LLMs limit their deployment in federated environments. To address this, the transformer-based split learning model has emerged, offloading most model parameters to the server while retaining only the embedding and output layers on clients to ensure privacy. However, it still faces significant challenges in security, efficiency, and adaptability: 1) embedding gradients are vulnerable to attacks, leading to reverse engineering of private data; 2) the autoregressive nature of LLMs means that federated split learning can only train and infer sequentially, causing high communication overhead; 3) fixed partition points lack adaptability to downstream tasks. In this paper, we introduce FL-LLaMA, a secure, efficient, and adaptive federated split framework based on LLaMA2. First, we place some input and output blocks on the local client and inject Gaussian noise into forward-pass hidden states, enabling secure end-to-end propagation. Second, we employ client-batch and server-hierarchical strategies to achieve parallel training, along with attention-mask compression and KV cache mechanisms to accelerate inference, reducing communication costs effectively. Third, we allow users to dynamically adjust the partition points for input/output blocks based on specific task requirements and hardware limitations. Experiments on NLU, summarization and conversational QA tasks show that FL-LLaMA maintains performance comparable to centralized LLaMA2, and achieves up to 2x train speedups and 8x inference speedups.
Further analysis of privacy attacks and different partition points also demonstrates the effectiveness of FL-LLaMA in security and adaptability.

[230] ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy

Gengyang Li,Yifeng Gao,Yuming Li,Yunfang Wu

Main category: cs.CL

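TL;DR: ThinkLess is an inference-efficient framework that terminates reasoning generation early to reduce latency and memory use while maintaining output quality.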

Details

Motivation: 解决Chain-of-Thought（CoT）提示中推理令牌过长导致的延迟、内存占用高及答案截断问题。 Method: 利用注意力分析发现答案令牌主要关注推理终止符，插入终止符提前终止冗余推理，并采用轻量后调节机制保持答案结构。 Result: 在不微调或额外数据的情况下，ThinkLess达到与完整CoT解码相当的准确性，同时显著减少解码时间和内存消耗。 Conclusion: ThinkLess通过高效推理设计，平衡了推理质量与资源消耗，为LLMs推理优化提供了新思路。 Abstract: While Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), the excessive length of reasoning tokens increases latency and KV cache memory usage, and may even truncate final answers under context limits. We propose ThinkLess, an inference-efficient framework that terminates reasoning generation early and maintains output quality without modifying the model. Atttention analysis reveals that answer tokens focus minimally on earlier reasoning steps and primarily attend to the reasoning terminator token, due to information migration under causal masking. Building on this insight, ThinkLess inserts the terminator token at earlier positions to skip redundant reasoning while preserving the underlying knowledge transfer. To prevent format discruption casued by early termination, ThinkLess employs a lightweight post-regulation mechanism, relying on the model's natural instruction-following ability to produce well-structured answers. Without fine-tuning or auxiliary data, ThinkLess achieves comparable accuracy to full-length CoT decoding while greatly reducing decoding time and memory consumption.

[231] Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities

Jinyang Wu,Chonghua Liao,Mingkuan Feng,Shuai Zhang,Zhengqi Wen,Pengpeng Shao,Huazhe Xu,Jianhua Tao

Main category: cs.CL

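TL;DR: TAPO (Thought-Augmented Policy Optimization) is a new reinforcement learning framework that strengthens model exploration by incorporating external high-level guidance ("thought patterns") and significantly outperforms existing methods.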

Details

Motivation: Existing reinforcement learning methods bias the model toward reward-maximizing paths and bring in no external knowledge, which limits exploration capacity and the boundary of reasoning capability. Method: TAPO adaptively integrates structured thought patterns during training to balance model-internal exploration with the use of external guidance. Result: TAPO outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math, and the thought patterns generalize even though they are abstracted from only 500 samples. Conclusion: TAPO is broadly applicable and produces reasoning models with better explainability and readability. Abstract: Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance ("thought patterns"). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO's potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.

[232] Can Large Language Models be Effective Online Opinion Miners?

Ryang Heo,Yongsik Seo,Junseong Lee,Dongha Lee

Main category: cs.CL

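TL;DR: The paper introduces OOMB, a new dataset and evaluation protocol for assessing how well large language models (LLMs) mine opinions in complex online environments.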

Details

Motivation: User-generated online content is diverse and complex, which traditional opinion mining methods struggle to handle, so new evaluation tools are needed. Method: Introduces the OOMB dataset, which provides detailed (entity, feature, opinion) tuple annotations and opinion-centric summaries, to evaluate the extractive and abstractive capabilities of LLMs. Result: Using OOMB, the paper analyzes which aspects of opinion mining remain challenging for LLMs and where they adapt well, and explores their potential in realistic scenarios. Conclusion: The study lays a foundation for LLM-based opinion mining and discusses directions for future research. Abstract: The surge of user-generated online content presents a wealth of insights into customer preferences and market trends. However, the highly diverse, complex, and context-rich nature of such contents poses significant challenges to traditional opinion mining approaches. To address this, we introduce Online Opinion Mining Benchmark (OOMB), a novel dataset and evaluation protocol designed to assess the ability of large language models (LLMs) to mine opinions effectively from diverse and intricate online environments. OOMB provides extensive (entity, feature, opinion) tuple annotations and a comprehensive opinion-centric summary that highlights key opinion topics within each content, thereby enabling the evaluation of both the extractive and abstractive capabilities of models. Through our proposed benchmark, we conduct a comprehensive analysis of which aspects remain challenging and where LLMs exhibit adaptability, to explore whether they can effectively serve as opinion miners in realistic online scenarios. This study lays the foundation for LLM-based opinion mining and discusses directions for future research in this field.

[233] MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation

Maike Behrendt,Stefan Sylvius Wagner,Stefan Harmeling

Main category: cs.CL

TL;DR: MaxPoolBERT refines BERT's [CLS] representation by aggregating information across layers and tokens, improving classification performance.

Details

Motivation: The [CLS] token in BERT is commonly used for classification, but other tokens and intermediate layers also carry valuable information, which calls for a better representation. Method: Proposes MaxPoolBERT with three modifications: (1) max-pooling the [CLS] token across layers, (2) adding a multi-head attention (MHA) layer so that [CLS] attends over the final layer, and (3) combining max-pooling over the full sequence with MHA. Result: On the GLUE benchmark, MaxPoolBERT outperforms standard BERT-base, especially on low-resource tasks. Conclusion: MaxPoolBERT improves classification accuracy without additional pre-training or a significant increase in model size. Abstract: The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we propose MaxPoolBERT, a lightweight extension to BERT that refines the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach enhances BERT's classification accuracy (especially on low-resource tasks) without requiring pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently achieves better performance than the standard BERT-base model.
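Modification (i), max-pooling the [CLS] token across layers, can be sketched with plain lists standing in for hidden states. The number of pooled layers, the nested-list layout, and the function name are assumptions for illustration; a real implementation would operate on the encoder's hidden-state tensors.

```python
def max_pool_cls(hidden_states, num_layers=4, cls_index=0):
    """Element-wise max over the [CLS] vectors of the last `num_layers`
    layers. hidden_states[layer][token] is a list of floats."""
    cls_vectors = [layer[cls_index] for layer in hidden_states[-num_layers:]]
    return [max(dims) for dims in zip(*cls_vectors)]

# Toy hidden states: 3 layers, 2 tokens ([CLS] first), 3 dimensions.
hidden_states = [
    [[0.1, 0.9, -0.3], [9.0, 9.0, 9.0]],
    [[0.5, 0.2, -0.1], [9.0, 9.0, 9.0]],
    [[0.3, 0.4, -0.7], [9.0, 9.0, 9.0]],
]
print(max_pool_cls(hidden_states, num_layers=2))
print(max_pool_cls(hidden_states, num_layers=3))
```

The pooled vector has the same size as a single [CLS] embedding, so it can feed the usual classification head without changing its input dimension.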

[234] "Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding

Alkis Koudounas,Claudio Savelli,Flavio Giobergia,Elena Baralis

Main category: cs.CL

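TL;DR: The paper introduces UnSLU-BENCH, the first machine unlearning benchmark for spoken language understanding (SLU), evaluates eight unlearning techniques, and proposes a new metric that captures their efficacy, utility, and efficiency.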

Details

Motivation: Machine unlearning is an important topic for responsible AI, but few studies address complex tasks such as speech-related ones. This paper aims to fill that gap. Method: Using the UnSLU-BENCH benchmark, evaluates eight unlearning techniques on four datasets spanning four languages and proposes a new metric. Result: UnSLU-BENCH sets a foundation for unlearning research in SLU and reveals significant differences among techniques in effectiveness and computational feasibility. Conclusion: The study provides the first benchmark for machine unlearning in SLU and exposes the strengths and weaknesses of different techniques. Abstract: Machine unlearning, the process of efficiently removing specific information from machine learning models, is a growing area of interest for responsible AI. However, few studies have explored the effectiveness of unlearning methods on complex tasks, particularly speech-related ones. This paper introduces UnSLU-BENCH, the first benchmark for machine unlearning in spoken language understanding (SLU), focusing on four datasets spanning four languages. We address the unlearning of data from specific speakers as a way to evaluate the quality of potential "right to be forgotten" requests. We assess eight unlearning techniques and propose a novel metric to simultaneously better capture their efficacy, utility, and efficiency. UnSLU-BENCH sets a foundation for unlearning in SLU and reveals significant differences in the effectiveness and computational feasibility of various techniques.

[235] LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing

Peng Wang,Biyu Zhou,Xuehai Tang,Jizhong Han,Songlin Hu

Main category: cs.CL

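TL;DR: LyapLock is a model editing framework based on queuing theory and Lyapunov optimization that addresses insufficient knowledge preservation during sequential editing of large language models, improving editing efficacy and stability.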

Details

Motivation: Large language models degrade during sequential editing because they lack mechanisms for long-term knowledge preservation, so a method that balances editing efficacy and knowledge preservation is needed. Method: Models sequential editing as a constrained stochastic programming problem and combines queuing theory with Lyapunov optimization to decompose it into subproblems that can be solved step by step. Result: Experiments show that LyapLock supports more than 10,000 edits and improves average editing efficacy by 11.89% over SOTA baselines while keeping general capabilities stable. Conclusion: LyapLock is the first model editing framework with theoretical guarantees, performs well on both long-term knowledge preservation and editing efficacy, and can also enhance baseline methods. Abstract: Large Language Models often contain factually incorrect or outdated knowledge, giving rise to model editing methods for precise knowledge updates. However, current mainstream locate-then-edit approaches exhibit a progressive performance decline during sequential editing, due to inadequate mechanisms for long-term knowledge preservation. To tackle this, we model sequential editing as a constrained stochastic programming problem. Given the challenges posed by the cumulative preservation error constraint and the gradually revealed editing tasks, LyapLock is proposed. It integrates queuing theory and Lyapunov optimization to decompose the long-term constrained programming into tractable stepwise subproblems for efficient solving. This is the first model editing framework with rigorous theoretical guarantees, achieving asymptotically optimal editing performance while meeting the constraints of long-term knowledge preservation. Experimental results show that our framework scales sequential editing capacity to over 10,000 edits while stabilizing general capabilities and boosting average editing efficacy by 11.89% over SOTA baselines. Furthermore, it can be leveraged to enhance the performance of baseline methods. Our code is released at https://github.com/caskcsg/LyapLock.
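
The decomposition of a long-term constraint into stepwise subproblems can be illustrated with the textbook virtual-queue device from Lyapunov optimization. The sketch below shows that general technique only, not LyapLock's actual formulation: the names `budget`, `tradeoff`, `edit_loss`, and `preservation_error`, and all numeric values, are hypothetical.

```python
def update_virtual_queue(queue: float, error: float, budget: float) -> float:
    """Standard virtual-queue update: Q <- max(Q + error - budget, 0).

    The queue grows when a step's preservation error exceeds the per-step
    budget and drains otherwise, so a bounded queue implies the long-term
    average error stays within the budget.
    """
    return max(queue + error - budget, 0.0)


def stepwise_objective(edit_loss: float, preservation_error: float,
                       queue: float, tradeoff: float) -> float:
    """Drift-plus-penalty style score for one edit step.

    A large queue puts more weight on preservation; a large tradeoff
    puts more weight on the editing loss.
    """
    return tradeoff * edit_loss + queue * preservation_error


queue = 0.0
budget = 0.5  # hypothetical per-step preservation budget
history = []
for error in [0.25, 1.0, 0.75, 0.25]:  # hypothetical per-edit errors
    queue = update_virtual_queue(queue, error, budget)
    history.append(queue)
print(history)  # [0.0, 0.5, 0.75, 0.5]
```

Each step then only needs the current queue value, which is what makes the long-horizon problem solvable one edit at a time in this style of analysis.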

[236] Advancing LLM Safe Alignment with Safety Representation Ranking

Tianqi Du,Zeming Wei,Quan Chen,Chenheng Zhang,Yisen Wang

Main category: cs.CL

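TL;DR: The paper proposes SRR, a framework that uses the internal hidden states of a large language model to evaluate and select safe responses, significantly improving robustness to adversarial prompts.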

Details

Motivation: Existing safety evaluation methods operate only on textual responses and overlook the rich information in the model's internal representations, leaving potential safety risks. Method: Proposes the Safety Representation Ranking (SRR) framework, which encodes instructions and candidate completions using the model's intermediate representations and ranks the candidates with a lightweight similarity-based scorer. Result: Experiments show that SRR significantly improves robustness to adversarial prompts across multiple benchmarks. Conclusion: By directly using internal model states and list-level supervision, SRR captures subtle safety signals and offers a new approach to LLM safety evaluation. Abstract: The rapid advancement of large language models (LLMs) has demonstrated milestone success in a variety of tasks, yet their potential for generating harmful content has raised significant safety concerns. Existing safety evaluation approaches typically operate directly on textual responses, overlooking the rich information embedded in the model's internal representations. In this paper, we propose Safety Representation Ranking (SRR), a listwise ranking framework that selects safe responses using hidden states from the LLM itself. SRR encodes both instructions and candidate completions using intermediate transformer representations and ranks candidates via a lightweight similarity-based scorer. Our approach directly leverages internal model states and supervision at the list level to capture subtle safety signals. Experiments across multiple benchmarks show that SRR significantly improves robustness to adversarial prompts. Our code will be available upon publication.
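
As a rough illustration of ranking candidates with a similarity-based scorer, the sketch below orders candidate vectors by cosine similarity to an instruction vector. It is a simplification under stated assumptions: the paper's scorer is trained with list-level supervision, whereas plain cosine similarity is used here as a stand-in, and the two-dimensional vectors are toy values rather than real hidden states.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(instruction_vec: Sequence[float],
                    candidate_vecs: List[Sequence[float]]) -> List[int]:
    """Return candidate indices ordered from most to least similar."""
    scores = [cosine(instruction_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)),
                  key=lambda i: scores[i], reverse=True)


# Toy representations (hypothetical values standing in for hidden states).
instruction = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_candidates(instruction, candidates))  # [1, 2, 0]
```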

[237] TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games

Yuan Yuan,Muyu He,Muhammad Adil Shahid,Jiani Huang,Ziyang Li,Li Zhang

Main category: cs.CL

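TL;DR: TurnaboutLLM is a new framework and dataset for evaluating the deductive reasoning abilities of large language models (LLMs), built on the interactive gameplay of the detective games Ace Attorney and Danganronpa.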

Details

Motivation: To study how well LLMs identify contradictions between testimonies and evidence in complex narrative settings, and to expose the limitations of existing reasoning-enhancement strategies. Method: Designs contradiction-identification tasks over long narrative contexts, evaluates twelve state-of-the-art LLMs, and analyzes how context size, number of reasoning steps, and answer space size affect performance. Result: LLMs show limited performance on these complex reasoning tasks, and popular strategies such as Chain-of-Thought prompting are of limited help. Conclusion: TurnaboutLLM provides a challenging benchmark for evaluating LLMs' deductive reasoning in complex narrative environments. Abstract: This paper introduces TurnaboutLLM, a novel framework and dataset for evaluating the deductive reasoning abilities of Large Language Models (LLMs) by leveraging the interactive gameplay of detective games Ace Attorney and Danganronpa. The framework tasks LLMs with identifying contradictions between testimonies and evidence within long narrative contexts, a challenging task due to the large answer space and diverse reasoning types presented by its questions. We evaluate twelve state-of-the-art LLMs on the dataset, hinting at limitations of popular strategies for enhancing deductive reasoning such as extensive thinking and Chain-of-Thought prompting. The results also suggest varying effects of context size, the number of reasoning steps, and answer space size on model performance. Overall, TurnaboutLLM presents a substantial challenge for LLMs' deductive reasoning abilities in complex, narrative-rich environments.

[238] Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling

He Hu,Yucheng Zhou,Juzheng Si,Qianning Wang,Hengheng Zhang,Fuji Ren,Fei Ma,Laizhong Cui

Main category: cs.CL

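TL;DR: PsyLLM is a large language model designed for mental health counseling that integrates diagnostic and therapeutic reasoning, draws on international standards and multiple therapeutic frameworks, and significantly outperforms existing baseline models.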

Details

Motivation: Existing LLM-based approaches to mental health support lack clinical grounding and fall short of real-world counseling needs, particularly in diagnostic and therapeutic reasoning. Method: Proposes an automated data synthesis pipeline that processes real-world mental health posts, generates multi-turn dialogue structures, and uses LLMs guided by DSM/ICD standards and multiple therapeutic frameworks to simulate clinical reasoning. Result: PsyLLM significantly outperforms existing baseline models on four dimensions: comprehensiveness, professionalism, authenticity, and safety. Conclusion: PsyLLM offers a more systematic, clinically grounded solution for mental health counseling and addresses shortcomings of existing LLM approaches. Abstract: Large language models (LLMs) hold significant potential for mental health support, capable of generating empathetic responses and simulating therapeutic conversations. However, existing LLM-based approaches often lack the clinical grounding necessary for real-world psychological counseling, particularly in explicit diagnostic reasoning aligned with standards like the DSM/ICD and incorporating diverse therapeutic modalities beyond basic empathy or single strategies. To address these critical limitations, we propose PsyLLM, the first large language model designed to systematically integrate both diagnostic and therapeutic reasoning for mental health counseling. To develop PsyLLM, we propose a novel automated data synthesis pipeline. This pipeline processes real-world mental health posts, generates multi-turn dialogue structures, and leverages LLMs guided by international diagnostic standards (e.g., DSM/ICD) and multiple therapeutic frameworks (e.g., CBT, ACT, psychodynamic) to simulate detailed clinical reasoning processes. Rigorous multi-dimensional filtering ensures the generation of high-quality, clinically aligned dialogue data. In addition, we introduce a new benchmark and evaluation protocol, assessing counseling quality across four key dimensions: comprehensiveness, professionalism, authenticity, and safety. Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models on this benchmark.

[239] Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities

Xiaoyu Luo,Yiyi Chen,Johannes Bjerva,Qiongxiu Li

Main category: cs.CL

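TL;DR: The study presents the first comprehensive analysis of memorization in multilingual large language models (MLLMs), finds that language similarity shapes memorization patterns, and proposes a graph-based correlation metric.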

Details

Motivation: As MLLMs are widely deployed, understanding their memorization behavior is critical, yet prior work has focused mainly on monolingual models and left multilingual memorization underexplored. Method: Analyzes 95 languages using models of different scales, architectures, and memorization definitions, and proposes a graph-based correlation metric that incorporates language similarity. Result: Language similarity significantly affects memorization patterns: among similar languages, those with less training data exhibit higher memorization. Conclusion: The study underscores the importance of a language-aware perspective and offers new directions for evaluating and mitigating memorization vulnerabilities in multilingual NLP. Abstract: We present the first comprehensive study of Memorization in Multilingual Large Language Models (MLLMs), analyzing 95 languages using models across diverse model scales, architectures, and memorization definitions. As MLLMs are increasingly deployed, understanding their memorization behavior has become critical. Yet prior work has focused primarily on monolingual models, leaving multilingual memorization underexplored, despite the inherently long-tailed nature of training corpora. We find that the prevailing assumption, that memorization is highly correlated with training data availability, fails to fully explain memorization patterns in MLLMs. We hypothesize that treating languages in isolation - ignoring their similarities - obscures the true patterns of memorization. To address this, we propose a novel graph-based correlation metric that incorporates language similarity to analyze cross-lingual memorization. Our analysis reveals that among similar languages, those with fewer training tokens tend to exhibit higher memorization, a trend that only emerges when cross-lingual relationships are explicitly modeled. These findings underscore the importance of a language-aware perspective in evaluating and mitigating memorization vulnerabilities in MLLMs. This also constitutes empirical evidence that language similarity both explains Memorization in MLLMs and underpins Cross-lingual Transferability, with broad implications for multilingual NLP.

[240] VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models

Heyang Liu,Yuhao Wang,Ziyang Cheng,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang

Main category: cs.CL

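TL;DR: The paper proposes VocalBench, a comprehensive benchmark for evaluating the vocal communication abilities of speech interaction models, filling the gap left by existing evaluations that overlook vocal performance.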

Details

Motivation: Existing evaluations of speech interaction models focus mainly on the quality of textual responses, overlook vocal performance, and lack benchmarks with vocal-specific test instances. Method: Proposes the VocalBench benchmark, with 9,400 instances across four dimensions (semantic quality, acoustic performance, conversational abilities, and robustness) covering 16 fundamental skills. Result: Experiments reveal significant variability in current model capabilities, with distinct strengths and weaknesses for each model, and offer guidance for future research on speech interaction systems. Conclusion: VocalBench provides a new tool for comprehensive evaluation of speech interaction models and helps advance research in the field. Abstract: The rapid advancement of large language models (LLMs) has accelerated the development of multi-modal models capable of vocal communication. Unlike text-based interactions, speech conveys rich and diverse information, including semantic content, acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models predominantly focus on the quality of their textual responses, often overlooking critical aspects of vocal performance and lacking benchmarks with vocal-specific test instances. To address this gap, we propose VocalBench, a comprehensive benchmark designed to evaluate speech interaction models' capabilities in vocal communication. VocalBench comprises 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. It covers 16 fundamental skills essential for effective vocal interaction. Experimental results reveal significant variability in current model capabilities, each exhibiting distinct strengths and weaknesses, and provide valuable insights to guide future research in speech-based interaction systems. Code and evaluation instances are available at https://github.com/SJTU-OmniAgent/VocalBench.

[241] DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning

Gaurav Srivastava,Zhenyu Bi,Meng Lu,Xuan Wang

Main category: cs.CL

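TL;DR: The paper proposes DTE, a framework for autonomously improving reasoning without external supervision, which uses multi-agent debate and a new prompting strategy to substantially improve the reasoning of language models.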

Details

Motivation: Large language models currently rely on massive datasets to improve reasoning, but scaling data alone is becoming impractical, so methods for autonomous improvement are needed. Method: Proposes the DTE framework, which combines multi-agent debate with the Reflect-Critique-Refine prompting strategy to improve a model's reasoning without ground-truth supervision. Result: Across five reasoning benchmarks and six open-weight models, DTE yields an average accuracy gain of 8.92% on GSM-PLUS and shows strong cross-domain generalization, with an average gain of 5.8% on the other benchmarks. Conclusion: The DTE framework effectively improves the autonomous reasoning ability of language models and applies broadly. Abstract: Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy, Reflect-Critique-Refine, to improve debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on five reasoning benchmarks with six open-weight models show that our DTE framework achieves substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities.

[242] Transfer of Structural Knowledge from Synthetic Languages

Mikhail Budnikov,Ivan Yamshchikov

Main category: cs.CL

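TL;DR: Studies transfer learning from several synthetic languages to English, analyzes the embedding structure and capabilities of the fine-tuned models, and introduces a new synthetic language and a new benchmark, Tiny-Cloze Benchmark.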

Details

Motivation: To explore how well synthetic languages transfer to English and to improve model performance on simple linguistic tasks. Method: Analyzes the embedding structure of fine-tuned models and introduces a new synthetic language and the Tiny-Cloze Benchmark for evaluation. Result: Fine-tuning on the new synthetic language yields better performance on a variety of tasks, as measured with the Tiny-Cloze Benchmark. Conclusion: Transfer learning from synthetic languages shows promise for improving model capabilities, and the new benchmark is a more informative evaluation tool for weaker models. Abstract: This work explores transfer learning from several synthetic languages to English. We investigate the structure of the embeddings in the fine-tuned models, the information they contain, and the capabilities of the fine-tuned models on simple linguistic tasks. We also introduce a new synthetic language that leads to better transfer to English than the languages used in previous research. Finally, we introduce Tiny-Cloze Benchmark - a new synthetic benchmark for natural language understanding that is more informative for less powerful models. We use Tiny-Cloze Benchmark to evaluate fine-tuned models in several domains, demonstrating that fine-tuning on a new synthetic language allows for better performance on a variety of tasks.

[243] Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention

Huanxuan Liao,Wen Hu,Yao Xu,Shizhu He,Jun Zhao,Kang Liu

Main category: cs.CL

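TL;DR: The paper proposes a hybrid context compression method (HyCo$_2$) that combines global and local views to improve the efficiency of long-sequence inference in LLMs and reduce redundant processing.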

Details

Motivation: LLMs suffer from computational inefficiency and redundant processing in long-sequence inference, and existing compression methods tend to lose valuable information, so a more balanced compression method is needed. Method: HyCo$_2$ combines a hybrid adapter for global semantics with a classification layer for local token retention, and uses auxiliary paraphrasing and completion pretraining to balance the two. Result: Experiments show that HyCo$_2$ significantly improves long-text reasoning, with an average gain of 13.1% across seven knowledge-intensive QA benchmarks, while reducing token consumption by 88.8%. Conclusion: HyCo$_2$ effectively balances global and local information retention and significantly improves the efficiency of long-sequence inference in LLMs. Abstract: Large Language Models (LLMs) encounter significant challenges in long-sequence inference due to computational inefficiency and redundant processing, driving interest in context compression techniques. Existing methods often rely on token importance to perform hard local compression or encode context into latent representations for soft global compression. However, the uneven distribution of textual content relevance and the diversity of demands for user instructions mean these approaches frequently lead to the loss of potentially valuable information. To address this, we propose Hybrid Context Compression (HyCo$_2$) for LLMs, which integrates both global and local perspectives to guide context compression while retaining both the essential semantics and critical details for task completion. Specifically, we employ a hybrid adapter to refine global semantics with the global view, based on the observation that different adapters excel at different tasks. Then we incorporate a classification layer that assigns a retention probability to each context token based on the local view, determining whether it should be retained or discarded. To foster a balanced integration of global and local compression, we introduce auxiliary paraphrasing and completion pretraining before instruction tuning. This promotes a synergistic integration that emphasizes instruction-relevant information while preserving essential local details, ultimately balancing local and global information retention in context compression. Experiments show that our HyCo$_2$ method significantly enhances long-text reasoning while reducing token usage. It improves the performance of various LLM series by an average of 13.1% across seven knowledge-intensive QA benchmarks. Moreover, HyCo$_2$ matches the performance of uncompressed methods while reducing token consumption by 88.8%.

[244] ConvSearch-R1: Enhancing Query Reformulation for Conversational Search with Reasoning via Reinforcement Learning

Changtai Zhu,Siyin Wang,Ruijun Feng,Kai Song,Xipeng Qiu

Main category: cs.CL

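TL;DR: ConvSearch-R1 is a self-driven conversational query reformulation framework that uses reinforcement learning to optimize directly on retrieval signals, requires no external supervision, and significantly outperforms existing methods.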

Details

Motivation: To address two problems of existing conversational query reformulation methods: heavy dependence on external supervision and insufficient alignment with downstream retrievers. Method: Uses a two-stage approach, Self-Driven Policy Warm-Up followed by Retrieval-Guided Reinforcement Learning, with a rank-incentive reward shaping mechanism. Result: Performs strongly on the TopiOCQA and QReCC datasets, with over 10% improvement on TopiOCQA. Conclusion: ConvSearch-R1's self-driven framework effectively addresses the challenges of conversational query reformulation and delivers significant performance gains. Abstract: Conversational search systems require effective handling of context-dependent queries that often contain ambiguity, omission, and coreference. Conversational Query Reformulation (CQR) addresses this challenge by transforming these queries into self-contained forms suitable for off-the-shelf retrievers. However, existing CQR approaches suffer from two critical constraints: high dependency on costly external supervision from human annotations or large language models, and insufficient alignment between the rewriting model and downstream retrievers. We present ConvSearch-R1, the first self-driven framework that completely eliminates dependency on external rewrite supervision by leveraging reinforcement learning to optimize reformulation directly through retrieval signals. Our novel two-stage approach combines Self-Driven Policy Warm-Up to address the cold-start problem through retrieval-guided self-distillation, followed by Retrieval-Guided Reinforcement Learning with a specially designed rank-incentive reward shaping mechanism that addresses the sparsity issue in conventional retrieval metrics. Extensive experiments on TopiOCQA and QReCC datasets demonstrate that ConvSearch-R1 significantly outperforms previous state-of-the-art methods, achieving over 10% improvement on the challenging TopiOCQA dataset while using smaller 3B parameter models without any external supervision.
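
To see why a rank-dependent reward is less sparse than a cutoff metric, the sketch below contrasts the two. The abstract does not give the paper's reward formula, so the logarithmic shaping here is an assumption chosen for illustration, as are the names `sparse_reward` and `rank_shaped_reward` and the cutoffs `k` and `max_rank`.

```python
import math
from typing import Optional


def sparse_reward(rank: Optional[int], k: int = 10) -> float:
    """Cutoff-style reward: 1 if the gold passage is in the top k, else 0."""
    return 1.0 if rank is not None and rank <= k else 0.0


def rank_shaped_reward(rank: Optional[int], max_rank: int = 100) -> float:
    """Hypothetical shaped reward that decays smoothly with the rank.

    rank is the 1-based position of the gold passage, or None if it was
    not retrieved at all.
    """
    if rank is None or rank > max_rank:
        return 0.0
    return 1.0 / math.log2(rank + 1)


for rank in [1, 3, 50, None]:
    print(rank, sparse_reward(rank), rank_shaped_reward(rank))
```

A rewrite that moves the gold passage from rank 50 to rank 15 gets no credit from the cutoff reward but a higher value from the shaped one, which gives the policy a usable learning signal earlier in training.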

[245] Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

Zhen Zhang,Xuehai He,Weixiang Yan,Ao Shen,Chenyang Zhao,Shuohang Wang,Yelong Shen,Xin Eric Wang

Main category: cs.CL

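TL;DR: The paper proposes Soft Thinking, a training-free method that emulates human-like "soft" reasoning by generating continuous, abstract concept tokens, moving beyond the limits of reasoning over discrete language tokens.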

Details

Motivation: Current reasoning models are constrained to discrete language tokens, which limits their expressive power and prevents full exploration of reasoning paths. Method: Generates continuous, abstract concept tokens as probability-weighted mixtures of token embeddings, forming a continuous concept space that allows smooth transitions and richer representations. Result: On mathematical and coding benchmarks, Soft Thinking improves pass@1 accuracy by up to 2.48 points while reducing token usage by up to 22.4% compared to standard CoT. Conclusion: Soft Thinking overcomes the bottleneck of discrete language-based reasoning and yields efficient, interpretable reasoning paths. Abstract: Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning within the boundaries of human language, processing discrete token embeddings that represent fixed points in the semantic space. This discrete constraint restricts the expressive power and upper potential of such reasoning models, often causing incomplete exploration of reasoning paths, as standard Chain-of-Thought (CoT) methods rely on sampling one token per step. In this work, we introduce Soft Thinking, a training-free method that emulates human-like "soft" reasoning by generating soft, abstract concept tokens in a continuous concept space. These concept tokens are created by the probability-weighted mixture of token embeddings, which form the continuous concept space, enabling smooth transitions and richer representations that transcend traditional discrete boundaries. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge effectively toward the correct answer. Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning. Code is available at https://github.com/eric-ai-lab/Soft-Thinking.
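
The "probability-weighted mixture of token embeddings" described above amounts to a weighted average. A minimal sketch follows, assuming a three-token vocabulary with two-dimensional embeddings; the function name `concept_token` and all values are hypothetical and are not taken from the released code.

```python
from typing import List, Sequence


def concept_token(probs: Sequence[float],
                  embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Probability-weighted mixture of token embeddings.

    probs[i] is the next-token probability of vocabulary entry i and
    embeddings[i] is that entry's embedding vector.
    """
    if len(probs) != len(embeddings):
        raise ValueError("one probability per embedding is required")
    dim = len(embeddings[0])
    return [sum(p * e[d] for p, e in zip(probs, embeddings))
            for d in range(dim)]


# Toy vocabulary of three tokens with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = [0.5, 0.25, 0.25]
print(concept_token(probs, embeddings))  # [0.75, 0.5]
```

Standard CoT decoding would instead sample one token and feed its single embedding forward, which corresponds to a one-hot `probs` in this sketch.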

[246] dKV-Cache: The Cache for Diffusion Language Models

Xinyin Ma,Runpeng Yu,Gongfan Fang,Xinchao Wang

Main category: cs.CL

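TL;DR: The paper proposes a mechanism called delayed KV-Cache to accelerate decoding in diffusion language models (DLMs), substantially improving inference speed.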

Details

Motivation: Diffusion language models are slow at inference because their non-autoregressive architecture and bidirectional attention rule out the key-value cache (KV-Cache) that accelerates autoregressive models. Method: Proposes a delayed and conditioned caching strategy for key and value states with two variants: dKV-Cache-Decode (almost lossless acceleration) and dKV-Cache-Greedy (higher speed-ups at the cost of some performance degradation). Result: Experiments show a 2-10x inference speedup, validated on several benchmarks. Conclusion: The delayed KV-Cache mechanism largely narrows the inference speed gap between autoregressive and diffusion language models and can be applied without additional training. Abstract: Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache key and value step-by-step: (1) dKV-Cache-Decode, which provides almost lossless acceleration, and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference. (2) dKV-Cache-Greedy, which has aggressive caching with reduced lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. Overall, dKV-Cache achieves a 2-10x speedup in inference, largely narrowing the gap between ARs and DLMs. We evaluate our dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments demonstrate that cache can also be used in DLMs, even in a training-free manner from current DLMs.

[247] Long-Form Information Alignment Evaluation Beyond Atomic Facts

Danna Zheng,Mirella Lapata,Jeff Z. Pan

Main category: cs.CL

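TL;DR: MontageLie is a challenging benchmark that builds deceptive narratives by combining truthful statements, an attack to which existing evaluators are vulnerable. DoveScore jointly verifies factual accuracy and event-order consistency and outperforms existing fine-grained methods by over 8%.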

Details

Motivation: Existing fine-grained evaluation methods such as FactScore ignore dependencies between facts, which creates vulnerabilities, so a more robust evaluation framework is needed. Method: Proposes the DoveScore framework, which models inter-fact relationships and jointly verifies factual accuracy and event-order consistency. Result: DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust evaluation solution. Conclusion: By modeling inter-fact relationships, DoveScore makes evaluation markedly more robust and offers a new solution for long-form text alignment evaluation. Abstract: Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at https://github.com/dannalily/DoveScore.
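
A toy example may help show why checking facts one by one misses a "montage". The sketch below is not DoveScore itself; it assumes events are already extracted as short strings and uses two hypothetical helpers, `fact_precision` and `order_consistency`, to contrast per-fact verification with a pairwise order check.

```python
from itertools import combinations
from typing import List


def fact_precision(claimed: List[str], source: List[str]) -> float:
    """Fraction of claimed events that appear in the source at all."""
    if not claimed:
        return 1.0
    source_set = set(source)
    return sum(1 for e in claimed if e in source_set) / len(claimed)


def order_consistency(claimed: List[str], source: List[str]) -> float:
    """Fraction of event pairs whose relative order matches the source."""
    position = {event: i for i, event in enumerate(source)}
    shared = [e for e in claimed if e in position]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


source = ["hired", "promoted", "resigned"]
montage = ["resigned", "hired", "promoted"]  # true events, misleading order

print(fact_precision(montage, source))     # 1.0
print(order_consistency(montage, source))  # 0.333...
```

Every event in the montage is supported, so a per-fact check passes it, while only one of the three event pairs keeps its original order.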

[248] Reverse Engineering Human Preferences with Reinforcement Learning

Lisa Alazraki,Tan Yi-Chern,Jon Ander Campos,Maximilian Mozes,Marek Rei,Max Bartolo

Main category: cs.CL

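TL;DR: The study examines vulnerabilities of the LLM-as-a-judge evaluation framework and proposes generating optimized preambles to boost scores, an approach that is hard to detect and generalizes to unseen models.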

Details

Motivation: The LLM-as-a-judge evaluation framework is vulnerable to malicious exploitation, for example by tuning a candidate LLM's outputs to fit the preferences of the judge LLM, which motivates work toward more reliable evaluation. Method: Uses the signal from judge LLMs as a reward and applies reinforcement learning to adversarially tune models that generate text preambles, indirectly raising scores. Result: The tuned preamble generator significantly raises evaluation scores, is hard to detect, and transfers to candidate and judge models not used during training. Conclusion: The study reveals risks in the LLM-as-a-judge framework and proposes a new optimization approach, with implications for the design of more reliable evaluation settings. Abstract: The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework, known as LLM-as-a-judge, is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these models attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks which intervene directly on the model's response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models that are not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse engineered effectively, by pipelining LLMs to optimise upstream preambles via reinforcement learning, an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.

[249] VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Yuchen Yan,Jin Jiang,Zhenbang Ren,Yijun Li,Xudong Cai,Yang Liu,Xin Xu,Mengdi Zhang,Jian Shao,Yongliang Shen,Jun Xiao,Yueting Zhuang

Main category: cs.CL

TL;DR: The paper introduces two benchmarks, VerifyBench and VerifyBench-Hard, for evaluating reference-based reward systems, filling a gap in existing reward benchmarks.

Details

Motivation: Existing reward benchmarks do not evaluate reference-based reward systems, which limits researchers' understanding of how accurate the verifiers used in reinforcement learning are. Method: Builds two high-quality benchmarks, VerifyBench and VerifyBench-Hard, through careful data collection, curation, and human annotation. Result: Current models still have considerable room for improvement on both benchmarks, especially smaller-scale models. Conclusion: The proposed benchmarks are effective tools for understanding and developing reference-based reward systems and help improve the reasoning of models trained with reinforcement learning. Abstract: Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.

[250] Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering

Hwan Chang,Yumin Kim,Yonghyun Jun,Hwanhee Lee

Main category: cs.CL

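TL;DR: The paper introduces CoPriva, a large-scale benchmark dataset for evaluating whether LLMs follow contextual non-disclosure policies in question answering, and finds that many models leak sensitive information.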

Details

Motivation: As LLMs are widely used in sensitive domains, ensuring that they follow user-defined security policies, especially information non-disclosure, is critical. Large-scale benchmarks for contextual security preservation are currently lacking. Method: Builds the CoPriva dataset, which contains explicit policies and direct and indirect attack queries, and evaluates the contextual security behavior of 10 LLMs. Result: Many models violate policies and leak sensitive information, and they perform even worse against indirect attacks. Models struggle to incorporate policy constraints during generation, though some can revise their outputs when explicitly prompted. Conclusion: Current LLMs have serious security shortcomings in sensitive applications, and more robust methods for guaranteeing contextual security are urgently needed. Abstract: As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical, especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. This failure is particularly severe against indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis reveals that while models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they exhibit a partial ability to revise outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.

[251] GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents

Yuqi Zhou,Sunhao Dai,Shuai Wang,Kaiwen Zhou,Qinqlin Jia,Junxu

Main category: cs.CL

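TL;DR: The paper analyzes three key issues in GUI agent training (input design, output evaluation, and policy update), proposes targeted solutions, and achieves new state-of-the-art performance on GUI grounding.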

Details

Motivation: Existing GUI agent training methods fall short in input design, output evaluation, and policy update, which limits performance. Method: Proposes a Fast Thinking Template, a reward function with a box size constraint, and a revised RL objective to improve the training pipeline. Result: GUI-G1-3B reaches 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro, surpassing prior models of similar size. Conclusion: The targeted improvements significantly raise GUI agent performance and establish a new state of the art. Abstract: Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct extensive analysis experiments of three key components of that training pipeline: input design, output evaluation, and policy update, each revealing distinct challenges arising from blindly applying general-purpose RL without adapting to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy examples due to biases in length and sample difficulty, leading to under-optimization on harder cases. To address these issues, we propose three targeted solutions. First, we adopt a Fast Thinking Template that encourages direct answer generation, reducing excessive reasoning during training. Second, we incorporate a box size constraint into the reward function to mitigate reward hacking. Third, we revise the RL objective by adjusting length normalization and adding a difficulty-aware scaling factor, enabling better optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro. This surpasses all prior models of similar size and even outperforms the larger UI-TARS-7B, establishing a new state-of-the-art in GUI agent grounding. The project repository is available at https://github.com/Yuqi-Zhou/GUI-G1.

[252] The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation

Patrick Kahardipraja,Reduan Achtibat,Thomas Wiegand,Wojciech Samek,Sebastian Lapuschkin

Main category: cs.CL

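TL;DR: The paper studies how large language models use in-context learning with retrieval augmentation to access external knowledge, sheds light on the internal mechanism, and proposes an attribution-based method to identify the roles of specific attention heads.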

Details

Motivation: Explore how large language models carry out in-context learning under retrieval augmentation, in order to understand their inner workings and improve model safety and transparency. Method: Proposes an attribution-based method that identifies and analyzes specialized attention heads (in-context heads and parametric heads), and studies their effect on answer generation by modifying their attention weights. Result: Reveals the distinct functions of in-context heads and parametric heads, shows how they influence answer generation, and traces the sources of knowledge used during inference. Conclusion: The study offers a new perspective on the mechanism of retrieval-augmented in-context learning and supports the development of safer, more transparent language models. Abstract: Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards safer and more transparent language models.

[253] Learning to Reason via Mixture-of-Thought for Logical Reasoning

Tong Zheng,Lichang Chen,Simeng Han,R. Thomas McCoy,Heng Huang

Main category: cs.CL

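TL;DR: The paper proposes the Mixture-of-Thought (MoT) framework, which improves LLM logical reasoning by combining three reasoning modalities: natural language, code, and symbolic logic (truth tables). Experiments show that MoT significantly outperforms single-modality approaches.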

Details

Motivation: Humans solve problems with multiple reasoning modalities, whereas existing LLM methods usually rely on a single modality (such as natural language), which limits synergy among modalities. Method: MoT adopts a two-phase design: (1) self-evolving multi-modality training; (2) an inference phase that fully exploits the synergy of the three modalities. Result: On logical reasoning benchmarks including FOLIO and ProofWriter, MoT improves average accuracy over single-modality approaches by up to 11.7 percentage points. Conclusion: MoT is effective in both training and inference, is particularly strong on harder logical problems, and its modalities are complementary, with truth-table reasoning helping to overcome bottlenecks in natural language reasoning. Abstract: Human beings naturally utilize multiple reasoning modalities to learn and solve logical problems, i.e., different representational formats such as natural language, code, and symbolic logic. In contrast, most existing LLM-based approaches operate with a single reasoning modality during training, typically natural language. Although some methods explored modality selection or augmentation at inference time, the training process remains modality-blind, limiting synergy among modalities. To fill in this gap, we propose Mixture-of-Thought (MoT), a framework that enables LLMs to reason across three complementary modalities: natural language, code, and a newly introduced symbolic modality, truth-table, which systematically enumerates logical cases and partially mitigates key failure modes in natural language reasoning. MoT adopts a two-phase design: (1) self-evolving MoT training, which jointly learns from filtered, self-generated rationales across modalities; and (2) MoT inference, which fully leverages the synergy of three modalities to produce better predictions. Experiments on logical reasoning benchmarks including FOLIO and ProofWriter demonstrate that our MoT framework consistently and significantly outperforms strong LLM baselines with single-modality chain-of-thought approaches, achieving up to +11.7pp average accuracy gain. Further analyses show that our MoT framework benefits both training and inference stages; that it is particularly effective on harder logical reasoning problems; and that different modalities contribute complementary strengths, with truth-table reasoning helping to overcome key bottlenecks in natural language inference.
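
As a rough illustration of what "systematically enumerates logical cases" means for the truth-table modality, the sketch below checks entailment by brute-force enumeration of all truth assignments. The function name `entails` and the rain/wet example are hypothetical and not taken from the paper; in MoT the table is written out by the LLM as a rationale, not computed by a solver like this.

```python
from itertools import product

def entails(variables, premises, conclusion):
    """Return True if every assignment that satisfies all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # this row of the truth table is a counterexample
    return True

# Hypothetical example: from (rain -> wet) and rain, does wet follow?
variables = ["rain", "wet"]
premises = [
    lambda e: (not e["rain"]) or e["wet"],  # rain -> wet
    lambda e: e["rain"],                    # rain
]
follows = entails(variables, premises, lambda e: e["wet"])
contradicted = entails(variables, premises, lambda e: not e["wet"])
print(follows, contradicted)
```

Enumeration is exponential in the number of variables, so it only suits small propositional problems.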

physics.comp-ph [Back]

[254] Pathobiological Dictionary Defining Pathomics and Texture Features: Addressing Understandable AI Issues in Personalized Liver Cancer; Dictionary Version LCP1.0

Mohammad R. Salmanpour,Seyed Mohammad Piri,Somayeh Sadat Mehrnia,Ahmad Shariftabrizi,Masume Allahmoradi,Venkata SK. Manem,Arman Rahmim,Ilker Hacihaliloglu

Main category: physics.comp-ph

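TL;DR: This study proposes a framework called LCP1.0 that translates complex pathomics and radiomics features into clinically understandable diagnostic tools, and uses expert validation to improve the transparency and usability of AI in liver cancer diagnosis.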

Details

Motivation: AI has great potential in medical diagnostics, but its clinical adoption is limited by a lack of interpretability and generalizability. This study aims to address that problem by developing the LCP1.0 framework. Method: Uses QuPath and PyRadiomics to extract pathomics and radiomics features from liver cancer tissue, and assesses their relevance to the WHO grading system with multiple classifiers and feature selection algorithms. Result: The Variable Threshold feature selection algorithm combined with an SVM model achieved the highest accuracy (0.80) and selected 20 key features, which are closely associated with tumor grading and prognosis. Conclusion: LCP1.0 provides a clinically validated bridge between AI outputs and expert interpretation, enhancing model transparency and usability and supporting the development of trustworthy diagnostic tools for liver cancer. Abstract: Artificial intelligence (AI) holds strong potential for medical diagnostics, yet its clinical adoption is limited by a lack of interpretability and generalizability. This study introduces the Pathobiological Dictionary for Liver Cancer (LCP1.0), a practical framework designed to translate complex Pathomics and Radiomics Features (PF and RF) into clinically meaningful insights aligned with existing diagnostic workflows. QuPath and PyRadiomics, standardized according to IBSI guidelines, were used to extract 333 imaging features from hepatocellular carcinoma (HCC) tissue samples, including 240 PF-based-cell detection/intensity, 74 RF-based texture, and 19 RF-based first-order features. Expert-defined ROIs from the public dataset excluded artifact-prone areas, and features were aggregated at the case level. Their relevance to the WHO grading system was assessed using multiple classifiers linked with feature selectors. The resulting dictionary was validated by 8 experts in oncology and pathology. In collaboration with 10 domain experts, we developed a Pathobiological dictionary of imaging features such as PFs and RF. In our study, the Variable Threshold feature selection algorithm combined with the SVM model achieved the highest accuracy (0.80, P-value less than 0.05), selecting 20 key features, primarily clinical and pathomics traits such as Centroid, Cell Nucleus, and Cytoplasmic characteristics.
These features, particularly nuclear and cytoplasmic, were strongly associated with tumor grading and prognosis, reflecting atypia indicators like pleomorphism, hyperchromasia, and cellular orientation. The LCP1.0 provides a clinically validated bridge between AI outputs and expert interpretation, enhancing model transparency and usability. Aligning AI-derived features with clinical semantics supports the development of interpretable, trustworthy diagnostic tools for liver cancer pathology.

cs.AR [Back]

[255] HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases

Pingqing Zheng,Jiayin Qin,Fuqi Zhang,Shang Wu,Yu Cao,Caiwen Ding,Yang Zhao

Main category: cs.AR

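TL;DR: The HDLxGraph framework combines Graph RAG with LLMs and uses ASTs and DFGs to improve performance on hardware design tasks, significantly raising search accuracy and debugging efficiency.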

Details

Motivation: Address the limited performance of LLMs on large HDL projects by introducing structural information to strengthen retrieval. Method: Proposes the HDLxGraph framework, which combines Graph RAG with LLMs and builds a dual-retrieval mechanism on ASTs and DFGs. Result: Experiments show that HDLxGraph improves search accuracy, debugging efficiency, and completion quality by 12.04%, 12.22%, and 5.04%, respectively. Conclusion: HDLxGraph effectively improves LLM performance on hardware design tasks and contributes the HDLSearch benchmark dataset. Abstract: Large Language Models (LLMs) have demonstrated their potential in hardware design tasks, such as Hardware Description Language (HDL) generation and debugging. Yet, their performance in real-world, repository-level HDL projects with thousands or even tens of thousands of code lines is hindered. To this end, we propose HDLxGraph, a novel framework that integrates Graph Retrieval Augmented Generation (Graph RAG) with LLMs, introducing HDL-specific graph representations by incorporating Abstract Syntax Trees (ASTs) and Data Flow Graphs (DFGs) to capture both code graph view and hardware graph view. HDLxGraph utilizes a dual-retrieval mechanism that not only mitigates the limited recall issues inherent in similarity-based semantic retrieval by incorporating structural information, but also enhances its extensibility to various real-world tasks by a task-specific retrieval finetuning. Additionally, to address the lack of comprehensive HDL search benchmarks, we introduce HDLSearch, a multi-granularity evaluation dataset derived from real-world repository-level projects. Experimental results demonstrate that HDLxGraph significantly improves average search accuracy, debugging efficiency and completion quality by 12.04%, 12.22% and 5.04% compared to similarity-based RAG, respectively. The code of HDLxGraph and collected HDLSearch benchmark are available at https://github.com/Nick-Zheng-Q/HDLxGraph.

cs.LG [Back]

[256] Scaling Diffusion Transformers Efficiently via $μ$P

Chenyu Zheng,Xinyu Zhang,Rongzhen Wang,Wei Huang,Zhi Tian,Weilin Huang,Jun Zhu,Chongxuan Li

Main category: cs.LG

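TL;DR: The paper studies how to extend Maximal Update Parametrization (μP) to diffusion Transformers, validates it in large-scale experiments, and shows that it markedly reduces hyperparameter tuning cost while improving performance.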

Details

Motivation: Diffusion Transformers perform very well as vision generative models, but scaling them up is limited by the high cost of hyperparameter tuning. The study tests whether μP applies to diffusion Transformers, in order to reduce cost and improve efficiency. Method: Through theoretical proof and experimental validation, extends μP to mainstream diffusion Transformers (DiT, U-ViT, and others) and tests the transferability of their hyperparameters. Result: Experiments show that μP significantly improves model performance (for example, DiT-XL-2 converges 2.9 times faster), and its effectiveness is confirmed on text-to-image generation (scaling PixArt-α and MMDiT). Conclusion: μP is an efficient and principled framework for scaling diffusion Transformers that markedly reduces tuning cost while improving performance. Abstract: Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.

[257] Kernel PCA for Out-of-Distribution Detection: Non-Linear Kernel Selections and Approximations

Kun Fang,Qinghua Tao,Mingzhen He,Kexin Lv,Runze Yang,Haibo Hu,Xiaolin Huang,Jie Yang,Longbin Cao

Main category: cs.LG

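TL;DR: The paper proposes an OoD detection method based on a non-linear feature subspace: it learns a discriminative subspace with KPCA, uses the reconstruction error to separate InD from OoD data, and addresses both kernel selection and computational efficiency.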

Details

Motivation: OoD detection is vital for the reliability of deep neural networks, but existing methods fall short in characterizing the disparities between InD and OoD data. This paper takes the new perspective of a non-linear feature subspace to characterize those disparities more effectively. Method: Uses the KPCA framework to learn a discriminative non-linear subspace, proposes a Cosine-Gaussian kernel, and introduces two techniques to make the kernel matrix computation tractable on large-scale InD data. It also incorporates InD data confidence to improve subspace learning. Result: The proposed method shows markedly improved efficacy and efficiency in OoD detection; the Cosine-Gaussian kernel and the approximation techniques resolve the kernel selection and computational cost issues. Conclusion: The study provides new insight into the use of non-linear feature subspaces for OoD detection and achieves better detection performance through kernel design and efficient computation. Abstract: Out-of-Distribution (OoD) detection is vital for the reliability of deep neural networks, the key of which lies in effectively characterizing the disparities between OoD and In-Distribution (InD) data. In this work, such disparities are exploited through a fresh perspective of non-linear feature subspace. That is, a discriminative non-linear subspace is learned from InD features to capture representative patterns of InD, while informative patterns of OoD features cannot be well captured in such a subspace due to their different distribution. Grounded on this perspective, we exploit the deviations of InD and OoD features in such a non-linear subspace for effective OoD detection. To be specific, we leverage the framework of Kernel Principal Component Analysis (KPCA) to attain the discriminative non-linear subspace and deploy the reconstruction error on such subspace to distinguish InD and OoD data. Two challenges emerge: (i) the learning of an effective non-linear subspace, i.e., the selection of kernel function in KPCA, and (ii) the computation of the kernel matrix with large-scale InD data. For the former, we reveal two vital non-linear patterns that closely relate to the InD-OoD disparity, leading to the establishment of a Cosine-Gaussian kernel for constructing the subspace. For the latter, we introduce two techniques to approximate the Cosine-Gaussian kernel with significantly cheap computations. In particular, our approximation is further tailored by incorporating the InD data confidence, which is demonstrated to promote the learning of discriminative subspaces for OoD data.
Our study presents new insights into the non-linear feature subspace for OoD detection and contributes practical explorations on the associated kernel design and efficient computations, yielding a KPCA detection method with distinctively improved efficacy and efficiency.

[258] Directional Non-Commutative Monoidal Structures for Compositional Embeddings in Machine Learning

Mahesh Godavarti

Main category: cs.LG

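TL;DR: Proposes a new algebraic structure for multi-dimensional compositional embeddings, built on directional non-commutative monoidal operators, that has appealing theoretical properties and is compatible with modern machine learning architectures.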

Details

Motivation: Provide a unified compositional framework for multi-dimensional data while preserving the properties of classical sequence-modeling paradigms. Method: Defines a composition operator $\circ_i$ for each axis $i$ that is associative along that axis, while the axis-specific operators satisfy a global interchange law. Result: The framework generalizes classical sequence-modeling methods (such as SSMs and Transformer self-attention) and supports multi-dimensional recursive operations. Conclusion: The structure offers a theoretical foundation for future deep learning model design, with potential applications including structured positional encodings and image embeddings. Abstract: We introduce a new algebraic structure for multi-dimensional compositional embeddings, built on directional non-commutative monoidal operators. The core contribution of this work is this novel framework, which exhibits appealing theoretical properties (associativity along each dimension and an interchange law ensuring global consistency) while remaining compatible with modern machine learning architectures. Our construction defines a distinct composition operator $\circ_i$ for each axis $i$, ensuring associative combination along each axis without imposing global commutativity. Importantly, all axis-specific operators commute with one another, enforcing a global interchange law that enables consistent cross-axis compositions. This is, to our knowledge, the first approach that provides a common foundation that generalizes classical sequence-modeling paradigms (e.g., structured state-space models (SSMs) and transformer self-attention) to a unified multi-dimensional framework. For example, specific one-dimensional instances of our framework can recover the familiar affine transformation algebra, vanilla self-attention, and the SSM-style recurrence. The higher-dimensional generalizations naturally support recursive, structure-aware operations in embedding spaces. We outline several potential applications unlocked by this structure, including structured positional encodings in Transformers, directional image embeddings, and symbolic modeling of sequences or grids, indicating that it could inform future deep learning model designs. We formally establish the algebraic properties of our framework and discuss efficient implementations.
Finally, as our focus is theoretical, we include no experiments here and defer empirical validation to future work, which we plan to undertake.
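
The abstract mentions that one-dimensional instances recover the affine transformation algebra and an SSM-style recurrence. A minimal sketch of that familiar special case, assuming scalar affine maps stored as hypothetical `(a, b)` pairs, shows a monoid that is associative but not commutative. It illustrates the general algebraic idea only and is not the paper's construction.

```python
def compose(f, g):
    """Compose scalar affine maps x -> a*x + b stored as (a, b); the result applies g first, then f."""
    a1, b1 = f
    a2, b2 = g
    return (a1 * a2, a1 * b2 + b1)

IDENTITY = (1.0, 0.0)  # the map x -> x, neutral element of the monoid

def apply(f, x):
    a, b = f
    return a * x + b

# A linear recurrence h_t = a_t * h_{t-1} + b_t is a chain of such maps.
steps = [(0.5, 1.0), (2.0, 0.0), (1.0, 3.0)]
h = 1.0
for step in steps:
    h = apply(step, h)            # step-by-step evaluation

total = IDENTITY
for step in steps:
    total = compose(step, total)  # fold the whole chain into a single map

print(h, apply(total, 1.0))
```

Because composition is associative, the fold can be regrouped freely, which is what makes parallel scans over such recurrences possible; the order of the steps still matters.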

[259] Explainable embeddings with Distance Explainer

Christiaan Meijer,E. G. Patrick Bos

Main category: cs.LG

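TL;DR: Distance Explainer is a novel XAI method that explains distance relationships in embedding spaces, generating local explanations through selective masking and distance-ranked mask filtering.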

Details

Motivation: Existing XAI methods fall short in explaining embedded vector spaces, whose dimensions represent complex abstractions; a method is needed that explains the similarity or dissimilarity between embedded data points. Method: Builds on the saliency technique of RISE and assigns attribution values to distances in the embedding space through selective masking and distance-ranked mask filtering. Result: Experiments with ImageNet and CLIP models show that the method effectively identifies features that drive similarity or dissimilarity while maintaining high robustness and consistency. Conclusion: Distance Explainer fills a gap in XAI research and improves the transparency and trustworthiness of deep learning applications that use embedded spaces. Abstract: While eXplainable AI (XAI) has advanced significantly, few methods address interpretability in embedded vector spaces where dimensions represent complex abstractions. We introduce Distance Explainer, a novel method for generating local, post-hoc explanations of embedded spaces in machine learning models. Our approach adapts saliency-based techniques from RISE to explain the distance between two embedded data points by assigning attribution values through selective masking and distance-ranked mask filtering. We evaluate Distance Explainer on cross-modal embeddings (image-image and image-caption pairs) using established XAI metrics including Faithfulness, Sensitivity/Robustness, and Randomization. Experiments with ImageNet and CLIP models demonstrate that our method effectively identifies features contributing to similarity or dissimilarity between embedded data points while maintaining high robustness and consistency. We also explore how parameter tuning, particularly mask quantity and selection strategy, affects explanation quality. This work addresses a critical gap in XAI research and enhances transparency and trustworthiness in deep learning applications utilizing embedded spaces.

[260] Beyond Classification: Evaluating Diffusion Denoised Smoothing for Security-Utility Trade off

Yury Belousov,Brian Pulfer,Vitaliy Kinakh,Slava Voloshynovskiy

Main category: cs.LG

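TL;DR: The paper studies how well Diffusion Denoised Smoothing improves the adversarial robustness of foundation models, finds that it can significantly degrade performance in some settings, and proposes a novel attack strategy that targets the diffusion process itself.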

Details

Motivation: Examine the effectiveness of Diffusion Denoised Smoothing across different tasks and attack algorithms, filling a gap in existing research. Method: Preprocesses inputs with a pretrained diffusion model and evaluates four downstream tasks under three adversarial attack algorithms on three datasets. Result: High-noise diffusion denoising significantly degrades performance (by up to 57%), while low-noise settings preserve performance but fail to defend against all attacks. The novel attack strategy can circumvent the low-noise defense. Conclusion: The trade-off between adversarial robustness and performance still requires further research. Abstract: While foundation models demonstrate impressive performance across various tasks, they remain vulnerable to adversarial inputs. Current research explores various approaches to enhance model robustness, with Diffusion Denoised Smoothing emerging as a particularly promising technique. This method employs a pretrained diffusion model to preprocess inputs before model inference. Yet, its effectiveness remains largely unexplored beyond classification. We aim to address this gap by analyzing three datasets with four distinct downstream tasks under three different adversarial attack algorithms. Our findings reveal that while foundation models maintain resilience against conventional transformations, applying high-noise diffusion denoising to clean images without any distortions significantly degrades performance by as high as 57%. Low-noise diffusion settings preserve performance but fail to provide adequate protection across all attack types. Moreover, we introduce a novel attack strategy specifically targeting the diffusion process itself, capable of circumventing defenses in the low-noise regime. Our results suggest that the trade-off between adversarial robustness and performance remains a challenge to be addressed.

[261] FisherSFT: Data-Efficient Supervised Fine-Tuning of Language Models Using Information Gain

Rohan Deb,Kiran Thekumparampil,Kousha Kalantari,Gaurush Hiranandani,Shoham Sabach,Branislav Kveton

Main category: cs.LG

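TL;DR: This paper proposes a method that improves the statistical efficiency of supervised fine-tuning (SFT) by selecting the most informative subset of training examples.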

Details

Motivation: Supervised fine-tuning (SFT) is the standard approach to adapting large language models (LLMs) to new domains, but its statistical efficiency can be improved. Method: Selects the training subset that maximizes information gain, measured by the Hessian of the LLM log-likelihood, and approximates it efficiently by linearizing the LLM at the last layer with multinomial logistic regression models. Result: The method is computationally efficient, analyzable, and performs well on several problems; quantitative results and an LLM evaluation support its effectiveness. Conclusion: The method improves the statistical efficiency of SFT and offers a new approach to fine-tuning LLMs. Abstract: Supervised fine-tuning (SFT) is a standard approach to adapting large language models (LLMs) to new domains. In this work, we improve the statistical efficiency of SFT by selecting an informative subset of training examples. Specifically, for a fixed budget of training examples, which determines the computational cost of fine-tuning, we determine the most informative ones. The key idea in our method is to select examples that maximize information gain, measured by the Hessian of the log-likelihood of the LLM. We approximate it efficiently by linearizing the LLM at the last layer using multinomial logistic regression models. Our approach is computationally efficient, analyzable, and performs well empirically. We demonstrate this on several problems, and back our claims with both quantitative results and an LLM evaluation.

[262] Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision

Eric Hanchen Jiang,Haozheng Luo,Shengyuan Pang,Xiaomin Li,Zhenting Qi,Hengli Li,Cheng-Fu Yang,Zongyu Lin,Xinfeng Li,Hao Xu,Kai-Wei Chang,Ying Nian Wu

Main category: cs.LG

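TL;DR: The paper proposes EORM, a lightweight post hoc verifier that uses energy scores to improve LLM accuracy on mathematical reasoning while avoiding costly sampling.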

Details

Motivation: Address the problem of multi-step logical consistency in LLM mathematical reasoning, and avoid the unreliability and high computational cost of conventional CoT prompting. Method: Uses energy-based models (EBMs) to learn to assign a scalar energy score to CoT solutions, requiring only outcome labels and no detailed annotations. Result: Significantly improves accuracy on the GSM8k and MATH benchmarks (for example, Llama 3 8B reaches 90.7% on GSM8k). Conclusion: EORM improves the reliability of LLM reasoning efficiently through post hoc verification, matching the performance of brute-force sampling. Abstract: Mathematical reasoning presents a significant challenge for Large Language Models (LLMs), often requiring robust multi-step logical consistency. While Chain of Thought (CoT) prompting elicits reasoning steps, it doesn't guarantee correctness, and improving reliability via extensive sampling is computationally costly. This paper introduces the Energy Outcome Reward Model (EORM), an effective, lightweight, post hoc verifier. EORM leverages Energy Based Models (EBMs) to simplify the training of reward models by learning to assign a scalar energy score to CoT solutions using only outcome labels, thereby avoiding detailed annotations. It achieves this by interpreting discriminator output logits as negative energies, effectively ranking candidates where lower energy is assigned to solutions leading to correct final outcomes, implicitly favoring coherent reasoning. On mathematical benchmarks (GSM8k, MATH), EORM significantly improves final answer accuracy (e.g., with Llama 3 8B, achieving 90.7% on GSM8k and 63.7% on MATH). EORM effectively leverages a given pool of candidate solutions to match or exceed the performance of brute force sampling, thereby enhancing LLM reasoning outcome reliability through its streamlined post hoc verification process.
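
The ranking step described in the abstract (logits read as negative energies, lowest energy preferred) can be sketched in a few lines. The candidate strings and logit values below are made up for illustration, and `rank_by_energy` is a hypothetical helper; the trained verifier that would produce the logits is not shown.

```python
def rank_by_energy(candidates, logits):
    """Sort candidate solutions by energy = -logit, lowest energy (most preferred) first."""
    energies = [-z for z in logits]
    order = sorted(range(len(candidates)), key=lambda i: energies[i])
    return [(candidates[i], energies[i]) for i in order]

# Hypothetical pool of chain-of-thought solutions with verifier logits.
candidates = ["solution A", "solution B", "solution C"]
logits = [0.3, 2.1, -1.4]

ranked = rank_by_energy(candidates, logits)
best_solution, best_energy = ranked[0]
print(best_solution, best_energy)
```

Selecting `ranked[0]` is a best-of-n choice over a fixed candidate pool, so no extra generation is needed at verification time.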

[263] RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

Kaiwen Zha,Zhengqi Gao,Maohao Shen,Zhang-Wei Hong,Duane S. Boning,Dina Katabi

Main category: cs.LG

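TL;DR: The Tango framework uses reinforcement learning to train an LLM generator and verifier together, addressing the reward hacking and poor generalization of existing methods, and achieves the best performance on multiple benchmarks.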

Details

Motivation: Verifiers used in existing RL post-training methods are susceptible to reward hacking and generalize poorly, so a more robust approach is needed. Method: Proposes the Tango framework, which uses RL to train the generator and a generative, process-level verifier together, without explicit process-level annotations. Result: Achieves the best performance on multiple mathematical reasoning tasks and on the ProcessBench dataset, with especially strong results on hard problems. Conclusion: Through the co-evolution of a generative verifier and the generator, Tango substantially improves robustness and generalization. Abstract: Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.

[264] MoTime: A Dataset Suite for Multimodal Time Series Forecasting

Xin Zhou,Weiqing Wang,Francisco J. Baldán,Wray Buntine,Christoph Bergmeir

Main category: cs.LG

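TL;DR: MoTime is a suite of multimodal time series forecasting datasets that pair temporal signals with external modalities such as text, metadata, and images, and supports evaluating modality utility in both common and cold-start forecasting scenarios. Experiments show that external modalities can improve forecasting performance, with particularly strong benefits on short series.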

Details

Motivation: Multimodal data is increasingly available in real-world forecasting, yet most existing research focuses on unimodal time series. MoTime aims to fill this gap and enable comprehensive research on multimodal time series forecasting. Method: Presents the MoTime datasets, which cover diverse domains and support structured evaluation of modality utility under two scenarios (common forecasting and cold-start forecasting). Result: External modalities improve forecasting performance in both scenarios, with especially strong benefits for short series in some datasets, though the effect varies with data characteristics. Conclusion: The public datasets and findings of MoTime are intended to support more comprehensive and realistic benchmarks in future multimodal time series forecasting research. Abstract: While multimodal data sources are increasingly available from real-world forecasting, most existing research remains on unimodal time series. In this work, we present MoTime, a suite of multimodal time series forecasting datasets that pair temporal signals with external modalities such as text, metadata, and images. Covering diverse domains, MoTime supports structured evaluation of modality utility under two scenarios: 1) the common forecasting task, where varying-length history is available, and 2) cold-start forecasting, where no historical data is available. Experiments show that external modalities can improve forecasting performance in both scenarios, with particularly strong benefits for short series in some datasets, though the impact varies depending on data characteristics. By making datasets and findings publicly available, we aim to support more comprehensive and realistic benchmarks in future multimodal time series forecasting research.

[265] SUS backprop: linear backpropagation algorithm for long inputs in transformers

Sergey Pankov,Georges Harik

Main category: cs.LG

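TL;DR: Proposes a method that reduces computation by stochastically cutting backpropagation paths, particularly suited to the attention mechanism in Transformers, lowering the complexity of attention backpropagation from O(n²) to O(nc).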

Details

Motivation: Attention is computationally expensive on long sequences, and most attention weights are small, which makes them good candidates for cutting to save computation. Method: Proposes a probabilistic rule, controlled by a parameter c, that cuts backpropagation paths and keeps at most c interactions per token per attention head. Result: Experiments show that cutting 99% of the attention gradient flow increases gradient variance by only about 1%, while computational complexity drops substantially. Conclusion: The method efficiently reduces backpropagation cost and suits Transformer training on long sequences. Abstract: It is straightforward to design an unbiased gradient estimator that stochastically cuts the backpropagation flow through any part of a computational graph. By cutting the parts that have little effect on the computation, one can potentially save a significant amount of back-propagation computation in exchange for a minimal increase in the stochastic gradient variance, in some situations. Such a situation occurs in the attention mechanism of the transformer architecture. For long sequences, attention becomes the limiting factor, as its compute requirements increase quadratically with sequence length $n$. At the same time, most attention weights become very small, as most attention heads tend to connect a given token with only a small fraction of other tokens in the sequence. These weights become promising targets for cutting backpropagation. We propose a simple probabilistic rule controlled by a single parameter $c$ that cuts backpropagation through most attention weights, leaving at most $c$ interactions per token per attention head. This brings a factor of $c/n$ reduction in the compute required for the attention backpropagation, turning it from quadratic $O(n^2)$ to linear complexity $O(nc)$. We have empirically verified that, for a typical transformer model, cutting $99\%$ of the attention gradient flow (i.e. choosing $c \sim 20-30$) results in relative gradient variance increase of only about $1\%$ for $n \sim 2000$, and it decreases with $n$. This approach is amenable to efficient sparse matrix implementation, thus being promising for making the cost of a backward pass negligible relative to the cost of a forward pass when training a transformer model on long sequences.
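
The abstract does not spell out the probabilistic rule, so the sketch below shows one simple rule that fits the description, as an assumption and not the authors' exact algorithm: keep the gradient through attention weight a with probability min(1, c·a) and rescale survivors by the inverse probability. Since each row of attention weights sums to 1, the expected number of kept entries per row is at most c, and the rescaling keeps the estimator unbiased. The function name and the example numbers are hypothetical.

```python
import random

def sparsify_grad(attn_row, grad_row, c, rng):
    """Randomly drop gradient entries; survivors are rescaled so the expectation is unchanged."""
    out = []
    for a, g in zip(attn_row, grad_row):
        p = min(1.0, c * a)       # assumed keep probability, larger for larger attention weights
        if p > 0.0 and rng.random() < p:
            out.append(g / p)     # inverse-probability weighting keeps the estimate unbiased
        else:
            out.append(0.0)       # gradient path is cut
    return out

attn_row = [0.7, 0.2, 0.05, 0.05]   # one softmax row (sums to 1)
grad_row = [1.0, -2.0, 4.0, 0.5]    # upstream gradient for that row
c = 2

rng = random.Random(0)
trials = 20000
mean = [0.0] * len(grad_row)
for _ in range(trials):
    sample = sparsify_grad(attn_row, grad_row, c, rng)
    mean = [m + s / trials for m, s in zip(mean, sample)]

expected_kept = sum(min(1.0, c * a) for a in attn_row)
print(mean, expected_kept)
```

Averaged over many draws, `mean` approaches `grad_row`, while each individual draw touches only a few entries. A real implementation would exploit that sparsity with sparse matrix kernels, as the abstract suggests.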

[266] Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Christian Walder,Deep Karkhanis

Main category: cs.LG

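TL;DR: The paper proposes a method called PKPO that optimizes pass@k performance to improve sample diversity and collective utility in reinforcement learning, addressing the limitation of conventional methods that optimize only pass@1.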

Details

Motivation: Conventional reinforcement learning algorithms reward each sample independently, which leads to insufficient sample diversity and collective utility and limits performance on hard problems. Method: Proposes PKPO, which transforms the final rewards to optimize pass@k performance directly, and derives low-variance unbiased estimators. Result: Experiments confirm the effectiveness of PKPO, especially on hard tasks, and the method supports annealing k during training to balance pass@1 and pass@k performance. Conclusion: By optimizing the joint utility of sets of samples, PKPO improves exploration and performance in reinforcement learning, most notably on hard tasks. Abstract: Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k.
Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
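
For readers unfamiliar with the metric, the sketch below computes the standard unbiased pass@k estimate for binary rewards from n samples of which c are correct. This is the widely used combinatorial estimator from code-generation evaluation, shown only to make the quantity concrete; the paper's own estimators for the gradient and for continuous rewards are not reproduced here, and the function name is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k given n samples with c correct (requires 1 <= k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # 1 minus the probability that a random size-k subset has no correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(4, 1, 1))  # equals c / n for k = 1
print(pass_at_k(4, 1, 2))
print(pass_at_k(4, 1, 4))
```

For k = 1 the estimate reduces to the plain success rate c/n, which is why rewarding each sample independently corresponds to optimizing pass@1.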

[267] ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search

Hyunseok Lee,Jeonghoon Kim,Beomjun Kim,Jihoon Tack,Chansong Jo,Jaehong Lee,Cheonbok Park,Sookyo In,Jinwoo Shin,Kang Min Yoo

Main category: cs.LG

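TL;DR: ReGUIDE is a novel framework that substantially improves the data efficiency of multimodal large language models in GUI element grounding through self-generated reasoning and spatial-aware criticism.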

Details

Motivation: Existing methods rely on large-scale datasets to improve GUI element grounding accuracy, which is inefficient. ReGUIDE aims to solve this through data-efficient learning. Method: ReGUIDE uses online reinforcement learning to self-generate a language reasoning process, criticizes predictions using spatial priors, and at test time boosts performance through spatial search and coordinate aggregation. Result: Experiments show that ReGUIDE performs strongly on multiple benchmarks and surpasses baselines with only 0.2% of the training data. Conclusion: ReGUIDE offers an efficient, data-frugal solution for GUI element grounding. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled autonomous agents to interact with computers via Graphical User Interfaces (GUIs), where accurately localizing the coordinates of interface elements (e.g., buttons) is often required for fine-grained actions. However, this remains significantly challenging, leading prior works to rely on large-scale web datasets to improve the grounding accuracy. In this work, we propose Reasoning Graphical User Interface Grounding for Data Efficiency (ReGUIDE), a novel and effective framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism. More specifically, ReGUIDE learns to (i) self-generate a language reasoning process for the localization via online reinforcement learning, and (ii) criticize the prediction using spatial priors that enforce equivariance under input transformations. At inference time, ReGUIDE further boosts performance through a test-time scaling strategy, which combines spatial search with coordinate aggregation. Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks, outperforming baselines with substantially fewer training data points (e.g., only 0.2% samples compared to the best open-sourced baselines).

[268] Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

Yurun Yuan,Fan Chen,Zeyu Jia,Alexander Rakhlin,Tengyang Xie

Main category: cs.LG

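TL;DR: TBRM is a value-based reinforcement learning method built on Bellman residual minimization for LLM reasoning; it needs no critic or importance sampling and outperforms policy-based baselines.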

Details

Motivation: Explore value-based reinforcement learning for LLM reasoning, a direction left largely open by the current dominance of policy-based methods. Method: Propose the TBRM algorithm, which optimizes a trajectory-level Bellman objective and uses the model's own logits as Q-values, simplifying the computation. Result: Experiments show that TBRM outperforms PPO and GRPO on mathematical reasoning tasks with comparable or lower compute and memory overhead. Conclusion: Value-based reinforcement learning may be an effective alternative for improving LLM reasoning. Abstract: Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as $Q$-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based baselines, like PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.

[269] Set-LLM: A Permutation-Invariant LLM

Beni Egressy,Jan Stühmer

Main category: cs.LG

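TL;DR: The paper proposes Set-LLM, a new architecture that addresses the order sensitivity of large language models (LLMs), achieving permutation invariance through a modified attention mask and positional encodings.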

Details

Motivation: LLMs show different preferences when options are presented in a different order, which undermines their reliability in multiple-choice question answering and automated evaluation. Method: Introduce the Set-LLM architecture, with a new attention mask and new positional encodings that guarantee permutation invariance over input sets. Result: Experiments show that Set-LLM eliminates order sensitivity while matching or improving on the original model's performance and keeping its runtime. Conclusion: Set-LLM offers a practical solution to order sensitivity in LLMs and applies to a range of use cases. Abstract: While large language models (LLMs) demonstrate impressive capabilities across numerous applications, their robustness remains a critical concern. This paper is motivated by a specific vulnerability: the order sensitivity of LLMs. This vulnerability manifests itself as the order bias observed when LLMs decide between possible options (for example, a preference for the first option) and the tendency of LLMs to provide different answers when options are reordered. The use cases for this scenario extend beyond the classical case of multiple-choice question answering to the use of LLMs as automated evaluators in AI pipelines, comparing output generated by different models. We introduce Set-LLM, a novel architectural adaptation for pretrained LLMs that enables the processing of mixed set-text inputs with permutation invariance guarantees. The adaptations involve a new attention mask and new positional encodings specifically designed for sets. We provide a theoretical proof of invariance and demonstrate through experiments that Set-LLM can be trained effectively, achieving comparable or improved performance and maintaining the runtime of the original model, while eliminating order sensitivity.
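The property at stake can be shown with a toy example. The sketch below is not Set-LLM's attention mask or positional encoding, which the abstract does not specify; it only illustrates, under that stated simplification, that softmax attention over a set is order-independent when no position information is attached to the elements, and that the invariance disappears once a position-dependent offset is added. All names are hypothetical.

```python
import math
from itertools import permutations


def attend(query, items):
    """Single-query softmax attention; items act as both keys and values."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * item[d] for w, item in zip(weights, items)) / total
            for d in range(len(query))]


def add_positions(items, step=0.5):
    """Toy positional signal: shift the first feature by index * step."""
    return [[item[0] + i * step] + list(item[1:])
            for i, item in enumerate(items)]


def max_deviation(query, items, encode=lambda xs: xs):
    """Largest change in the output over all orderings of the set."""
    base = attend(query, encode(list(items)))
    worst = 0.0
    for perm in permutations(items):
        out = attend(query, encode(list(perm)))
        worst = max(worst, max(abs(a - b) for a, b in zip(base, out)))
    return worst
```

Here `max_deviation(query, items)` stays at floating-point noise for any set, while `max_deviation(query, items, add_positions)` is generally far from zero, which is the kind of order sensitivity an invariant design has to remove.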

[270] Mechanistic Insights into Grokking from the Embedding Layer

H. V. AlquBoj,Hilal AlQuabeh,Velibor Bojkovic,Munachiso Nwadike,Kentaro Inui

Main category: cs.LG

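TL;DR: The paper studies grokking (delayed generalization) in neural networks and finds that the embedding layer is a key driver of the phenomenon. By analyzing embedding update dynamics and bilinear coupling, it proposes frequency-aware sampling and embedding-specific learning rates to improve training.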

Details

Motivation: Investigate the causes of grokking in neural networks, in particular the role of the embedding layer, in order to improve training efficiency and generalization. Method: On modular arithmetic tasks, introduce an embedding layer to observe delayed generalization, analyze embedding update dynamics and the bilinear coupling mechanism, and propose frequency-aware sampling and embedding-specific learning rates. Result: The embedding layer is found to be key to delayed generalization; the proposed methods mitigate bilinear coupling effects, accelerate convergence, and carry over to Transformer optimization. Conclusion: The embedding layer plays a central role in grokking, and tuning embedding updates and learning rates can substantially improve training dynamics and generalization. Abstract: Grokking, a delayed generalization in neural networks after perfect training performance, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks, whereas MLPs without embeddings can generalize immediately. Our analysis identifies two key mechanisms: (1) Embedding update dynamics, where rare tokens stagnate due to sparse gradient updates and weight decay, and (2) Bilinear coupling, where the interaction between embeddings and downstream weights introduces saddle points and increases sensitivity to initialization. To confirm these mechanisms, we investigate frequency-aware sampling, which balances token updates by minimizing gradient variance, and embedding-specific learning rates, derived from the asymmetric curvature of the bilinear loss landscape. We prove that an adaptive learning rate ratio, $\frac{\eta_E}{\eta_W} \propto \frac{\sigma_{\max}(E)}{\sigma_{\max}(W)} \cdot \frac{f_W}{f_E}$, mitigates bilinear coupling effects, accelerating convergence. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.
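The learning-rate ratio quoted in the abstract can be evaluated numerically. The sketch below is a minimal illustration under stated assumptions: the abstract does not define f_E and f_W, so they are treated here as given positive scalars (plausibly update frequencies), the proportionality constant is exposed as a hypothetical `scale` parameter defaulting to 1, and the largest singular value is estimated by plain power iteration.

```python
import math


def spectral_norm(matrix, iters=200):
    """Estimate sigma_max(matrix) by power iteration on A^T A.

    Starts from the all-ones direction; this fails only if that vector
    is exactly orthogonal to the top right-singular vector."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))


def lr_ratio(E, W, f_E, f_W, scale=1.0):
    """eta_E / eta_W = scale * sigma_max(E) / sigma_max(W) * f_W / f_E."""
    return scale * spectral_norm(E) / spectral_norm(W) * f_W / f_E
```

For example, with sigma_max(E) = 3, sigma_max(W) = 2, f_W = 1 and f_E = 0.5, the ratio is 3: the embeddings would get three times the learning rate of the downstream weights, up to the unknown constant.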

[271] Large Language Models as Computable Approximations to Solomonoff Induction

Jun Wan,Lingrui Mei

Main category: cs.LG

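TL;DR: The paper uses Algorithmic Information Theory (AIT) to give a unified theoretical account of large language model (LLM) behavior and proposes a few-shot example selection method based on predictive confidence.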

Details

Motivation: Existing theoretical frameworks cannot explain the emergent phenomena of LLMs in a unified way, so a connection to Algorithmic Information Theory is needed. Method: Prove that LLM training approximates the Solomonoff prior and that prediction approximates Solomonoff induction, then use AIT to explain in-context learning, few-shot learning, and scaling laws. Result: Experiments show that selecting few-shot examples on which the model has low predictive confidence yields significant gains, especially for small models. Conclusion: The framework bridges theory and practice, providing both a theoretical basis and practical methods for future model development. Abstract: The rapid advancement of large language models (LLMs) calls for a rigorous theoretical framework to explain their empirical success. While significant progress has been made in understanding LLM behaviors, existing theoretical frameworks remain fragmented in explaining emergent phenomena through a unified mathematical lens. We establish the first formal connection between LLM architectures and Algorithmic Information Theory (AIT) by proving two fundamental results: (1) the training process computationally approximates Solomonoff prior through loss minimization interpreted as program length optimization, and (2) next-token prediction implements approximate Solomonoff induction. We leverage AIT to provide a unified theoretical explanation for in-context learning, few-shot learning, and scaling laws. Furthermore, our theoretical insights lead to a principled method for few-shot example selection that prioritizes samples where models exhibit lower predictive confidence. We demonstrate through experiments on diverse text classification benchmarks that this strategy yields significant performance improvements, particularly for smaller model architectures, when compared to selecting high-confidence examples. Our framework bridges the gap between theoretical foundations and practical LLM behaviors, providing both explanatory power and actionable insights for future model development.
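The selection rule described in the abstract, preferring examples on which the model is less confident, can be sketched in a few lines. This is an illustrative reading rather than the authors' procedure: it assumes confidence is measured as the largest predicted class probability, and the function name and input layout are hypothetical.

```python
def select_low_confidence(candidates, k):
    """Pick the k candidate examples with the lowest model confidence.

    candidates: list of (example, class_probabilities) pairs.
    Assumption: confidence is the maximum class probability."""
    ranked = sorted(candidates, key=lambda pair: max(pair[1]))
    return [example for example, _ in ranked[:k]]
```

Given probabilities of 0.9, 0.55 and 0.7 for the predicted class of three candidates, a budget of two selects the 0.55 and 0.7 examples and leaves out the one the model already handles confidently.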

cs.RO [Back]

[272] Scan, Materialize, Simulate: A Generalizable Framework for Physically Grounded Robot Planning

Amine Elhafsi,Daniel Morton,Marco Pavone

Main category: cs.RO

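TL;DR: The SMS framework combines 3D Gaussian Splatting, visual foundation models, vision-language models, and physics simulation to enable physical reasoning and planning for robots without re-learning physical dynamics.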

Details

Motivation: Address the challenge of physical reasoning for autonomous robots in unstructured environments and improve their manipulation and planning abilities. Method: Integrate 3D Gaussian Splatting (scene reconstruction), visual foundation models (semantic segmentation), vision-language models (material property inference), and physics simulation (prediction of action outcomes). Result: SMS shows robust performance in a billiards-inspired manipulation task and a quadrotor landing scenario, in both simulated and real-world experiments. Conclusion: SMS demonstrates the potential of combining scene reconstruction, semantic understanding, and physics simulation to give robot planning a physical grounding across domains. Abstract: Autonomous robots must reason about the physical consequences of their actions to operate effectively in unstructured, real-world environments. We present Scan, Materialize, Simulate (SMS), a unified framework that combines 3D Gaussian Splatting for accurate scene reconstruction, visual foundation models for semantic segmentation, vision-language models for material property inference, and physics simulation for reliable prediction of action outcomes. By integrating these components, SMS enables generalizable physical reasoning and object-centric planning without the need to re-learn foundational physical dynamics. We empirically validate SMS in a billiards-inspired manipulation task and a challenging quadrotor landing scenario, demonstrating robust performance on both simulated domain transfer and real-world experiments. Our results highlight the potential of bridging differentiable rendering for scene reconstruction, foundation models for semantic understanding, and physics-based simulation to achieve physically grounded robot planning across diverse settings.

[273] UPTor: Unified 3D Human Pose Dynamics and Trajectory Prediction for Human-Robot Interaction

Nisarga Nilavadi,Andrey Rudenko,Timm Linder

Main category: cs.RO

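TL;DR: Proposes a unified approach that forecasts the dynamics of human keypoints together with the motion trajectory from a short sequence of input poses.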

Details

Motivation: Most existing work addresses either full-body pose prediction or motion trajectory prediction, and few attempts combine the two. This paper aims to fill that gap. Method: Combine an off-the-shelf 3D human pose estimation module, a graph attention network that encodes the skeleton structure, and a compact non-autoregressive transformer for real-time motion prediction. Result: Strong results on Human3.6M, CMU-Mocap, and the authors' own DARKO dataset; the method is compact, real-time, and accurate. Conclusion: The method suits human-robot interaction and human-aware navigation; the dataset and code will be released. Abstract: We introduce a unified approach to forecast the dynamics of human keypoints along with the motion trajectory based on a short sequence of input poses. While many studies address either full-body pose prediction or motion trajectory prediction, only a few attempt to merge them. We propose a motion transformation technique to simultaneously predict full-body pose and trajectory key-points in a global coordinate frame. We utilize an off-the-shelf 3D human pose estimation module, a graph attention network to encode the skeleton structure, and a compact, non-autoregressive transformer suitable for real-time motion prediction for human-robot interaction and human-aware navigation. We introduce a human navigation dataset "DARKO" with specific focus on navigational activities that are relevant for human-aware mobile robot navigation. We perform extensive evaluation on Human3.6M, CMU-Mocap, and our DARKO dataset. In comparison to prior work, we show that our approach is compact, real-time, and accurate in predicting human navigation motion across all datasets. Result animations, our dataset, and code will be available at https://nisarganc.github.io/UPTor-page/

[274] AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

Kangan Qian,Sicong Jiang,Yang Zhong,Ziang Luo,Zilin Huang,Tianze Zhu,Kun Jiang,Mengmeng Yang,Zheng Fu,Jinyu Miao,Yining Shi,He Zhe Lim,Li Liu,Tianbao Zhou,Hongyi Wang,Huang Yu,Yifei Hu,Guang Li,Guang Chen,Hao Ye,Lijun Sun,Diange Yang

Main category: cs.RO

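TL;DR: The AgentThink framework combines chain-of-thought reasoning with dynamic tool invocation, markedly improving reasoning ability and accuracy on autonomous driving tasks.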

Details

Motivation: Existing vision-language models for autonomous driving suffer from hallucinations, inefficient reasoning, and insufficient validation, so a more reliable solution is needed. Method: Propose the AgentThink framework, comprising structured data generation, a two-stage training pipeline, and a tool-usage evaluation protocol. Result: On the DriveLMM-o1 benchmark, reasoning scores improve by 53.91% and answer accuracy by 33.54%. Conclusion: AgentThink points to a promising direction for developing trustworthy autonomous driving models. Abstract: Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: (i) Structured Data Generation, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.

[275] Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

Jiaming Zhou,Ke Ye,Jiayi Liu,Teli Ma,Zifang Wang,Ronghe Qiu,Kun-Yu Lin,Zhilin Zhao,Junwei Liang

Main category: cs.RO

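TL;DR: The paper introduces the AGNOSTOS benchmark and the X-ICM method for evaluating and improving the zero-shot generalization of vision-language-action models to unseen tasks.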

Details

Motivation: The cross-task generalization of existing vision-language-action models is underexplored, and new evaluation tools and methods are needed. Method: Introduce the AGNOSTOS benchmark, which tests 23 unseen tasks, and develop X-ICM, which uses large language models together with a dynamics-guided sample selection strategy. Result: X-ICM significantly improves zero-shot generalization on AGNOSTOS. Conclusion: AGNOSTOS and X-ICM provide valuable tools for research on general-purpose robotic manipulation. Abstract: The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.

[276] Think, Reflect, Create: Metacognitive Learning for Zero-Shot Robotic Planning with LLMs

Wenjie Lin,Jin Wei-Kocsis

Main category: cs.RO

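TL;DR: The paper explores how giving large language models (LLMs) metacognitive capabilities can improve their performance on robotic tasks, especially in zero-shot or few-shot settings.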

Details

Motivation: LLM applications in robotics are mostly limited to static, prompt-based behaviors and struggle with complex tasks. Inspired by human metacognitive learning and creative problem-solving, the study asks whether metacognitive capabilities can improve LLM task performance. Method: Present an early-stage framework that integrates metacognitive learning into LLM-powered multi-robot collaboration, including skill decomposition and a self-reflection mechanism. Result: Experiments show that the framework significantly outperforms existing baselines and can produce new solutions that differ from the ground truth yet still complete the task. Conclusion: Metacognitive learning can foster creativity in robotic planning and opens a new direction for LLMs in robotics. Abstract: While large language models (LLMs) have shown great potential across various domains, their applications in robotics remain largely limited to static, prompt-based behaviors and still face challenges in handling complex tasks under zero-shot or few-shot settings. Inspired by human metacognitive learning and creative problem-solving, we address this limitation by exploring a fundamental research question: Can LLMs be empowered with metacognitive capabilities to reason, reflect, and create, thereby enhancing their ability to perform robotic tasks with minimal demonstrations? In this paper, we present an early-stage framework that integrates metacognitive learning into LLM-powered multi-robot collaboration. The proposed framework equips the LLM-powered robotic agents with a skill decomposition and self-reflection mechanism that identifies modular skills from prior tasks, reflects on failures in unseen task scenarios, and synthesizes effective new solutions. Experimental results show that our metacognitive-learning-empowered LLM framework significantly outperforms existing baselines. Moreover, we observe that the framework is capable of generating solutions that differ from the ground truth yet still successfully complete the tasks. These exciting findings support our hypothesis that metacognitive learning can foster creativity in robotic planning.

[277] Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets

Kaiyuan Chen,Shuangyu Xie,Zehan Ma,Ken Goldberg

Main category: cs.RO

TL;DR: Robo2VLM is a VQA dataset generation framework built on robot trajectory data, used to enhance and evaluate vision-language models (VLMs).

Details

Motivation: Explore using rich multi-modal robot trajectory data to enhance and evaluate VLMs. Method: Extract information from the non-visual modalities of robot trajectories, segment each trajectory into phases, generate VQA queries, and build the large-scale Robo2VLM-1 dataset. Result: Robo2VLM-1 contains 684,710 questions covering 463 scenes and 3,396 tasks, and can evaluate and improve the spatial and interaction reasoning of VLMs. Conclusion: The Robo2VLM framework and dataset offer a new route for improving and evaluating VLMs. Abstract: Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm - using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot trajectory, Robo2VLM derives ground-truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, task goal, and the target object. The properties are used to generate representative VQA queries - images with textual multiple-choice questions - based on spatial, goal-conditioned, and interaction reasoning question templates. We curate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions covering 463 distinct scenes and 3,396 robotic manipulation tasks from 176k real robot trajectories. Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.
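As a rough illustration of how a non-visual signal can split a trajectory into phases, the sketch below thresholds a gripper-aperture series into runs of open and closed states. It is a deliberately simplified assumption: the abstract says segmentation also draws on end-effector pose and force sensing, and neither the threshold value nor the function name comes from the paper.

```python
from itertools import groupby


def segment_by_gripper(apertures, closed_below=0.2):
    """Split a trajectory into runs of 'open' / 'closed' gripper states.

    apertures: per-timestep gripper opening (hypothetical units, 0 = shut).
    Returns (label, start_index, end_index) tuples, end index inclusive."""
    labels = ["closed" if a < closed_below else "open" for a in apertures]
    phases = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        phases.append((label, start, start + length - 1))
        start += length
    return phases
```

Phase boundaries like these are natural anchor points for templated questions, for example asking about the target object on the frame just before the gripper closes.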

eess.IV [Back]

Muhammad Zubair,Muzammil Hussai,Mousa Ahmad Al-Bashrawi,Malika Bendechache,Muhammad Owais

Main category: eess.IV

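TL;DR: Multi-modal medical image fusion (MMIF) improves diagnostic precision by integrating data from X-ray, MRI, CT, and other modalities; this review surveys its methods, algorithms, applications, and challenges.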

Details

Motivation: MMIF is essential for improving diagnostic accuracy and clinical decision-making in computer-aided diagnosis systems. Method: Review traditional fusion methods (pixel-, feature-, and decision-level) alongside newer techniques such as deep learning and generative models. Result: MMIF significantly improves diagnostic accuracy, lesion detection, and segmentation, and is widely applied in oncology, neurology, and other fields. Conclusion: Although MMIF is promising, it still faces challenges such as data privacy and computational complexity; future work should focus on directions such as explainable AI and real-time fusion systems. Abstract: Multi-modal medical image fusion (MMIF) is increasingly recognized as an essential technique for enhancing diagnostic precision and facilitating effective clinical decision-making within computer-aided diagnosis systems. MMIF combines data from X-ray, MRI, CT, PET, SPECT, and ultrasound to create detailed, clinically useful images of patient anatomy and pathology. These integrated representations significantly advance diagnostic accuracy, lesion detection, and segmentation. This comprehensive review meticulously surveys the evolution, methodologies, algorithms, current advancements, and clinical applications of MMIF. We present a critical comparative analysis of traditional fusion approaches, including pixel-, feature-, and decision-level methods, and delve into recent advancements driven by deep learning, generative models, and transformer-based architectures. A critical comparative analysis is presented between these conventional methods and contemporary techniques, highlighting differences in robustness, computational efficiency, and interpretability. The article addresses extensive clinical applications across oncology, neurology, and cardiology, demonstrating MMIF's vital role in precision medicine through improved patient-specific therapeutic outcomes. Moreover, the review thoroughly investigates the persistent challenges affecting MMIF's broad adoption, including issues related to data privacy, heterogeneity, computational complexity, interpretability of AI-driven algorithms, and integration within clinical workflows. It also identifies significant future research avenues, such as the integration of explainable AI, adoption of privacy-preserving federated learning frameworks, development of real-time fusion systems, and standardization efforts for regulatory compliance.

[279] A Hybrid Quantum Classical Pipeline for X Ray Based Fracture Diagnosis

Sahil Tomar,Rajeshwar Tripathi,Sandeep Kumar

Main category: eess.IV

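TL;DR: Proposes a distributed hybrid quantum-classical pipeline for fracture classification from X-rays that combines PCA dimensionality reduction with quantum feature enrichment, reaching 99% accuracy while cutting feature extraction time by 82%.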

Details

Motivation: Traditional X-ray analysis is time-consuming and error-prone, and existing machine learning methods need large amounts of data and compute, so an efficient solution is needed. Method: Apply PCA for dimensionality reduction, enrich the features with a 4-qubit amplitude encoding circuit, fuse the results into a 16-dimensional vector, and classify it with machine learning models. Result: Reaches 99% accuracy on a public dataset, on par with existing transfer learning models, while reducing feature extraction time by 82%. Conclusion: The hybrid quantum-classical approach is efficient and accurate for fracture classification and offers a new direction for medical image analysis. Abstract: Bone fractures are a leading cause of morbidity and disability worldwide, imposing significant clinical and economic burdens on healthcare systems. Traditional X-ray interpretation is time-consuming and error-prone, while existing machine learning and deep learning solutions often demand extensive feature engineering, large annotated datasets, and high computational resources. To address these challenges, a distributed hybrid quantum-classical pipeline is proposed that first applies Principal Component Analysis (PCA) for dimensionality reduction and then leverages a 4-qubit quantum amplitude encoding circuit for feature enrichment. By fusing eight PCA-derived features with eight quantum-enhanced features into a 16-dimensional vector and then classifying with different machine learning models, the pipeline achieves 99% accuracy on a public multi-region X-ray dataset, on par with state-of-the-art transfer learning models, while reducing feature extraction time by 82%.

[280] Aneumo: A Large-Scale Multimodal Aneurysm Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks

Xigui Li,Yuanye Zhou,Feiyang Xiao,Xin Guo,Chen Jiang,Tan Pan,Xingmeng Zhang,Cenyu Liu,Zeyun Miao,Jianchao Ge,Xiansheng Wang,Qimeng Wang,Yichi Zhang,Wenbo Zhang,Fengping Zhu,Limei Han,Yuan Qi,Chensen Lin,Yuan Cheng

Main category: eess.IV

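TL;DR: The paper presents a large-scale synthetic aneurysm CFD dataset for machine learning, aimed at overcoming the high computational cost of conventional CFD and advancing aneurysm research and clinical risk assessment.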

Details

Motivation: Rupture of intracranial aneurysms (IAs) can lead to high mortality, yet current risk assessment methods pay too little attention to hemodynamic influences. Conventional CFD is computationally heavy and hard to use in large-scale or real-time clinical applications. Method: Synthesize 10,660 3D shapes from 427 real aneurysm geometries to simulate aneurysm evolution, generate 85,280 hemodynamic data samples, and introduce a benchmark to assess modeling methods. Result: A large-scale multimodal dataset, including hemodynamic parameters and segmentation masks, that supports the development of machine learning algorithms. Conclusion: The dataset and code are public and aim to advance aneurysm research and data-driven approaches in biofluids, biomedical engineering, and clinical risk assessment. Abstract: Intracranial aneurysms (IAs) are serious cerebrovascular lesions found in approximately 5% of the general population. Their rupture may lead to high mortality. Current methods for assessing IA risk focus on morphological and patient-specific factors, but the hemodynamic influences on IA development and rupture remain unclear. While accurate for hemodynamic studies, conventional computational fluid dynamics (CFD) methods are computationally intensive, hindering their deployment in large-scale or real-time clinical applications. To address this challenge, we curated a large-scale, high-fidelity aneurysm CFD dataset to facilitate the development of efficient machine learning algorithms for such applications. Based on 427 real aneurysm geometries, we synthesized 10,660 3D shapes via controlled deformation to simulate aneurysm evolution. The authenticity of these synthetic shapes was confirmed by neurosurgeons. CFD computations were performed on each shape under eight steady-state mass flow conditions, generating a total of 85,280 blood flow dynamics data covering key parameters. Furthermore, the dataset includes segmentation masks, which can support tasks that use images, point clouds or other multimodal data as input. Additionally, we introduced a benchmark for estimating flow parameters to assess current modeling methods. This dataset aims to advance aneurysm research and promote data-driven approaches in biofluids, biomedical engineering, and clinical risk assessment. The code and dataset are available at: https://github.com/Xigui-Li/Aneumo.

[281] MedBLIP: Fine-tuning BLIP for Medical Image Captioning

Manshi Limbu,Diwita Banerjee

Main category: eess.IV

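TL;DR: The paper examines fine-tuning the BLIP model on the ROCO dataset for radiology image captioning and finds that domain-specific fine-tuning significantly improves performance.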

Details

Motivation: Existing vision-language models produce generic or imprecise captions in the medical domain and need targeted adaptation. Method: Fine-tune BLIP and compare it with its zero-shot version, BLIP-2 baselines, and ViT-GPT2, and analyze the contributions of encoder and decoder fine-tuning. Result: Domain-specific fine-tuning significantly improves performance; decoder-only fine-tuning (encoder frozen) performs well with 5% less training time, but full-model fine-tuning gives the best results. Conclusion: Medical applications need targeted fine-tuning; decoder-only fine-tuning is an efficient option, but full-model fine-tuning performs best. Abstract: Medical image captioning is a challenging task that requires generating clinically accurate and semantically meaningful descriptions of radiology images. While recent vision-language models (VLMs) such as BLIP, BLIP2, Gemini and ViT-GPT2 show strong performance on natural image datasets, they often produce generic or imprecise captions when applied to specialized medical domains. In this project, we explore the effectiveness of fine-tuning the BLIP model on the ROCO dataset for improved radiology captioning. We compare the fine-tuned BLIP against its zero-shot version, BLIP-2 base, BLIP-2 Instruct and a ViT-GPT2 transformer baseline. Our results demonstrate that domain-specific fine-tuning on BLIP significantly improves performance across both quantitative and qualitative evaluation metrics. We also visualize decoder cross-attention maps to assess interpretability and conduct an ablation study to evaluate the contributions of encoder-only and decoder-only fine-tuning. Our findings highlight the importance of targeted adaptation for medical applications and suggest that decoder-only fine-tuning (encoder-frozen) offers a strong performance baseline with 5% lower training time than full fine-tuning, while full model fine-tuning still yields the best results overall.

[282] LOD1 3D City Model from LiDAR: The Impact of Segmentation Accuracy on Quality of Urban 3D Modeling and Morphology Extraction

Fatemeh Chajaei,Hossein Bagheri

Main category: eess.IV

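TL;DR: The study assesses the potential of LiDAR data for LOD1 3D building reconstruction, compares four deep learning models, finds that U-Net3+ and Attention U-Net perform best, and examines how segmentation accuracy affects 3D modeling and morphological feature extraction.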

Details

Motivation: 3D building reconstruction is crucial for urban planning and environmental studies, but accurate reconstruction at LOD1 remains challenging. The study aims to improve reconstruction accuracy using LiDAR data and deep learning models. Method: Use four models (U-Net, Attention U-Net, U-Net3+, and DeepLabV3+) for semantic segmentation, extract building footprints via transfer learning, and estimate building heights with several statistical measures. Result: U-Net3+ and Attention U-Net perform best, with IoU scores of 0.833 and 0.814, respectively. Segmentation accuracy significantly affects 3D model quality and the accuracy of morphological features such as building area and external wall surface area. Conclusion: U-Net3+ combined with the 90th percentile and median measures yields accurate building height estimates and morphological feature extraction, providing an effective solution for LOD1 3D reconstruction. Abstract: Three-dimensional reconstruction of buildings, particularly at Level of Detail 1 (LOD1), plays a crucial role in various applications such as urban planning, urban environmental studies, and designing optimized transportation networks. This study focuses on assessing the potential of LiDAR data for accurate 3D building reconstruction at LOD1 and extracting morphological features from these models. Four deep semantic segmentation models, U-Net, Attention U-Net, U-Net3+, and DeepLabV3+, were used, applying transfer learning to extract building footprints from LiDAR data. The results showed that U-Net3+ and Attention U-Net outperformed the others, achieving IoU scores of 0.833 and 0.814, respectively. Various statistical measures, including maximum, range, mode, median, and the 90th percentile, were used to estimate building heights, resulting in the generation of 3D models at LOD1. As the main contribution of the research, the impact of segmentation accuracy on the quality of 3D building modeling and the accuracy of morphological features like building area and external wall surface area was investigated. The results showed that the accuracy of building identification (segmentation performance) significantly affects the 3D model quality and the estimation of morphological features, depending on the height calculation method. Overall, the U-Net3+ method, utilizing the 90th percentile and median measures, leads to accurate height estimation of buildings and the extraction of morphological features.
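Two quantities from this abstract are simple enough to sketch: the IoU used to score footprint segmentation, and the per-building height statistics (maximum, median, 90th percentile) computed from the LiDAR heights that fall inside a footprint. The sketch assumes flat lists of binary mask values and of height samples; the function names, the input layout, and the "inclusive" percentile convention are illustrative assumptions, not details from the paper.

```python
from statistics import median, quantiles


def iou(pred_mask, true_mask):
    """Intersection over union of two equally long binary masks."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    return inter / union if union else 1.0


def height_estimates(heights):
    """Candidate LOD1 building heights from samples inside one footprint.

    Needs at least two samples for the percentile computation."""
    deciles = quantiles(heights, n=10, method="inclusive")
    return {
        "max": max(heights),
        "median": median(heights),
        "p90": deciles[8],  # ninth cut point = 90th percentile
    }
```

The link to segmentation accuracy is direct: a footprint that leaks onto the ground or a neighboring roof changes which height samples enter `height_estimates`, and the maximum is affected far more than the median or the 90th percentile.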

[283] TransMedSeg: A Transferable Semantic Framework for Semi-Supervised Medical Image Segmentation

Mengzhu Wang,Jiao Li,Shanshan Wang,Long Lan,Huibin Tan,Liang Yang,Guoli Yang

Main category: eess.IV

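TL;DR: TransMedSeg is a new semi-supervised medical image segmentation framework that improves segmentation performance through cross-domain semantic alignment and intra-domain structural preservation.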

Details

Motivation: Existing semi-supervised learning methods for medical image segmentation overlook transferable semantic relationships across clinical domains and imaging modalities. Method: Introduce a Transferable Semantic Augmentation (TSA) module that aligns domain-invariant semantics through cross-domain distribution matching and intra-domain structural preservation, building a unified feature space. Result: On medical image datasets, TransMedSeg outperforms existing semi-supervised methods. Conclusion: TransMedSeg opens a new direction for transferable representation learning in medical image analysis. Abstract: Semi-supervised learning (SSL) has achieved significant progress in medical image segmentation (SSMIS) through effective utilization of limited labeled data. While current SSL methods for medical images predominantly rely on consistency regularization and pseudo-labeling, they often overlook transferable semantic relationships across different clinical domains and imaging modalities. To address this, we propose TransMedSeg, a novel transferable semantic framework for semi-supervised medical image segmentation. Our approach introduces a Transferable Semantic Augmentation (TSA) module, which implicitly enhances feature representations by aligning domain-invariant semantics through cross-domain distribution matching and intra-domain structural preservation. Specifically, TransMedSeg constructs a unified feature space where teacher network features are adaptively augmented towards student network semantics via a lightweight memory module, enabling implicit semantic transformation without explicit data generation. Interestingly, this augmentation is implicitly realized through an expected transferable cross-entropy loss computed over the augmented teacher distribution. An upper bound of the expected loss is theoretically derived and minimized during training, incurring negligible computational overhead. Extensive experiments on medical image datasets demonstrate that TransMedSeg outperforms existing semi-supervised methods, establishing a new direction for transferable representation learning in medical image analysis.

[284] Model-Independent Machine Learning Approach for Nanometric Axial Localization and Tracking

Andrey Alexandrov,Giovanni Acampora,Giovanni De Lellis,Antonia Di Crescenzo,Chiara Errico,Daria Morozova,Valeri Tioukov,Autilia Vittiello

Main category: eess.IV

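TL;DR: The paper proposes a deep learning method for axial localization from dual-focal plane images that achieves 40 nm accuracy, six times better than traditional techniques.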

Details

Motivation: Localizing particles along the optical axis with high precision is a major challenge in optical microscopy, and traditional methods fall short. Method: Use convolutional neural networks (CNNs) to determine axial position directly from dual-focal plane images, without a predefined model. Result: Axial localization accuracy reaches 40 nm, six times better than traditional single-focal plane techniques. Conclusion: The method is simple in design and strong in performance, suits many scientific fields, and shows the potential of machine learning for processing complex image data. Abstract: Accurately tracking particles and determining their position along the optical axis is a major challenge in optical microscopy, especially when extremely high precision is needed. In this study, we introduce a deep learning approach using convolutional neural networks (CNNs) that can determine axial positions from dual-focal plane images without relying on predefined models. Our method achieves an axial localization accuracy of 40 nanometers - six times better than traditional single-focal plane techniques. The model's simple design and strong performance make it suitable for a wide range of uses, including dark matter detection, proton therapy for cancer, and radiation protection in space. It also shows promise in fields like biological imaging, materials science, and environmental monitoring. This work highlights how machine learning can turn complex image data into reliable, precise information, offering a flexible and powerful tool for many scientific applications.

[285] Super-Resolution Optical Coherence Tomography Using Diffusion Model-Based Plug-and-Play Priors

Yaning Wang,Jinglun Yu,Wenhan Guo,Yu Sun,Jin U. Kang

Main category: eess.IV

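TL;DR: Proposes a diffusion-model-based OCT super-resolution framework (PnP-DM) that reconstructs high-quality images from sparse measurements and outperforms conventional methods.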

Details

Motivation: Address the loss of image quality caused by sparse measurements in high-speed OCT imaging and improve image fidelity for clinical use. Method: Formulate reconstruction as an inverse problem, combine a diffusion prior with Markov chain Monte Carlo sampling for efficient posterior inference, and use deep learning-based up-sampling to build training data. Result: On in vivo and ex vivo fish-eye corneal models, PnP-DM outperforms conventional 2D-UNet baselines, with sharper structures and better noise suppression. Conclusion: PnP-DM provides a high-fidelity solution for high-speed OCT imaging with potential for clinical application. Abstract: We propose an OCT super-resolution framework based on a plug-and-play diffusion model (PnP-DM) to reconstruct high-quality images from sparse measurements (OCT B-mode corneal images). Our method formulates reconstruction as an inverse problem, combining a diffusion prior with Markov chain Monte Carlo sampling for efficient posterior inference. We collect high-speed under-sampled B-mode corneal images and apply a deep learning-based up-sampling pipeline to build realistic training pairs. Evaluations on in vivo and ex vivo fish-eye corneal models show that PnP-DM outperforms conventional 2D-UNet baselines, producing sharper structures and better noise suppression. This approach advances high-fidelity OCT imaging in high-speed acquisition for clinical applications.

[286] Non-rigid Motion Correction for MRI Reconstruction via Coarse-To-Fine Diffusion Models

Frederic Wang,Jonathan I. Tamir

Main category: eess.IV

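TL;DR: Proposes a diffusion-model-based alternating minimization framework that jointly reconstructs MRI data and corrects non-rigid motion artifacts, suited to dynamic imaging.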

Details

Motivation: MRI is prone to motion artifacts because of its long acquisition times, which can compromise diagnostic utility, particularly for dynamic imaging. Method: An alternating minimization framework with a bespoke diffusion model uses a coarse-to-fine denoising strategy to capture large overall motion and reconstruct the lower image frequencies first. Result: The approach performs well on real-world cine cardiac MRI datasets and on complex simulated deformations, even when each motion state is undersampled by a factor of 64. Conclusion: The method is agnostic to sampling patterns, anatomical variations, and scanning protocols, provided some low-frequency components are sampled during each motion state. Abstract: Magnetic Resonance Imaging (MRI) is highly susceptible to motion artifacts due to the extended acquisition times required for k-space sampling. These artifacts can compromise diagnostic utility, particularly for dynamic imaging. We propose a novel alternating minimization framework that leverages a bespoke diffusion model to jointly reconstruct and correct non-rigid motion-corrupted k-space data. The diffusion model uses a coarse-to-fine denoising strategy to capture large overall motion and reconstruct the lower frequencies of the image first, providing a better inductive bias for motion estimation than that of standard diffusion models. We demonstrate the performance of our approach on both real-world cine cardiac MRI datasets and complex simulated rigid and non-rigid deformations, even when each motion state is undersampled by a factor of 64x. Additionally, our method is agnostic to sampling patterns, anatomical variations, and MRI scanning protocols, as long as some low-frequency components are sampled during each motion state.

[287] Lung Nodule-SSM: Self-Supervised Lung Nodule Detection and Classification in Thoracic CT Images

Muniba Noreen,Furqan Shaukat

Main category: eess.IV

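TL;DR: Proposes LungNodule-SSM, a self-supervised learning method that uses DINOv2 and transformer architectures to improve lung nodule detection and classification without annotated data.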

Details

Motivation: Early detection of lung cancer is crucial for patient outcomes, but the scarcity of annotated medical imaging data limits the development of computer-aided diagnosis systems. Self-supervised learning can exploit large amounts of unlabeled data to improve them. Method: Two stages: (1) DINOv2 is pre-trained on unlabeled CT scans to learn feature representations; (2) the features are fine-tuned with transformer-based architectures for lesion-level detection and nodule diagnosis. Result: On the LUNA 16 dataset (888 CT scans), accuracy reaches 98.37%, exceeding existing methods. Conclusion: LungNodule-SSM performs strongly in lung nodule detection and offers a new direction for self-supervised learning in medical imaging. Abstract: Lung cancer has remained among the deadliest types of cancer in recent decades, and early lung nodule detection is crucial for improving patient outcomes. The limited availability of annotated medical imaging data remains a bottleneck in developing accurate computer-aided diagnosis (CAD) systems. Self-supervised learning can help leverage large amounts of unlabeled data to develop more robust CAD systems. With the recent advent of transformer-based architectures and their ability to generalize to unseen tasks, there has been an effort within the healthcare community to adapt them to various medical downstream tasks. Thus, we propose a novel "LungNodule-SSM" method, which utilizes self-supervised learning with DINOv2 as a backbone to enhance lung nodule detection and classification without annotated data. Our methodology has two stages: first, the DINOv2 model is pre-trained on unlabeled CT scans to learn robust feature representations; second, these features are fine-tuned using transformer-based architectures for lesion-level detection and accurate lung nodule diagnosis. The proposed method has been evaluated on the challenging LUNA 16 dataset, consisting of 888 CT scans, and compared with SOTA methods. Our experimental results show the superiority of our proposed method with an accuracy of 98.37%, demonstrating its effectiveness in lung nodule detection. The source code, datasets, and pre-processed data can be accessed using the link: https://github.com/EMeRALDsNRPU/Lung-Nodule-SSM-Self-Supervised-Lung-Nodule-Detection-and-Classification/tree/main

[288] Physics-Guided Multi-View Graph Neural Network for Schizophrenia Classification via Structural-Functional Coupling

Badhan Mazumder,Ayush Kanyal,Lei Wu,Vince D. Calhoun,Dong Hye Ye

Main category: eess.IV

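TL;DR: Proposes a physics-guided deep learning framework that improves schizophrenia classification by combining a neural oscillation model and SC-FC coupling with a multi-view graph neural network.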

Details

Motivation: Traditional approaches may rely on structural connectivity (SC) alone and neglect its intricate relationship with functional connectivity (FC), which limits understanding of cognitive and behavioral impairments. Method: A neural oscillation model describes the dynamics of interconnected neural oscillators; a multi-view graph neural network (GNN) with a joint loss performs SC-FC fusion and classification. Result: Experiments on a clinical dataset show improved performance, supporting the robustness of the method. Conclusion: By coupling SC and FC and using a multi-view GNN, the framework offers a new perspective for schizophrenia research. Abstract: Clinical studies reveal disruptions in brain structural connectivity (SC) and functional connectivity (FC) in neuropsychiatric disorders such as schizophrenia (SZ). Traditional approaches might rely solely on SC due to limited functional data availability, hindering comprehension of cognitive and behavioral impairments in individuals with SZ by neglecting the intricate SC-FC interrelationship. To tackle the challenge, we propose a novel physics-guided deep learning framework that leverages a neural oscillation model to describe the dynamics of a collection of interconnected neural oscillators, which operate via nerve fibers dispersed across the brain's structure. Our proposed framework utilizes SC to simultaneously generate FC by learning SC-FC coupling from a system dynamics perspective. Additionally, it employs a novel multi-view graph neural network (GNN) with a joint loss to perform correlation-based SC-FC fusion and classification of individuals with SZ. Experiments conducted on a clinical dataset exhibited improved performance, demonstrating the robustness of our proposed approach.

[289] SAMA-UNet: Enhancing Medical Image Segmentation with Self-Adaptive Mamba-Like Attention and Causal-Resonance Learning

Saqib Qamar,Mohd Fazil,Parvez Ahmad,Ghulam Muhammad

Main category: eess.IV

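TL;DR: SAMA-UNet is a new medical image segmentation architecture whose SAMA block and CR-MSM module address the computational inefficiency and feature-balance problems of existing models; experiments show it outperforms CNN-, Transformer-, and Mamba-based methods.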

Details

Motivation: Existing medical image segmentation models are computationally inefficient and struggle to balance local and global features, while the use of SSMs is hindered by incompatibilities with image tokens and autoregressive assumptions. Method: SAMA-UNet comprises the SAMA block (contextual self-attention combined with dynamic weight modulation) and the CR-MSM module (causal resonance learning to improve multi-scale information flow). Result: On MRI, CT, and endoscopy images, SAMA-UNet achieves better segmentation accuracy than CNN-, Transformer-, and Mamba-based methods. Conclusion: Through its new module designs, SAMA-UNet improves both the performance and the efficiency of medical image segmentation. Abstract: Medical image segmentation plays an important role in various clinical applications, but existing models often struggle with the computational inefficiencies and challenges posed by complex medical data. State Space Sequence Models (SSMs) have demonstrated promise in modeling long-range dependencies with linear computational complexity, yet their application in medical image segmentation remains hindered by incompatibilities with image tokens and autoregressive assumptions. Moreover, it is difficult to achieve a balance in capturing both local fine-grained information and global semantic dependencies. To address these challenges, we introduce SAMA-UNet, a novel architecture for medical image segmentation. A key innovation is the Self-Adaptive Mamba-like Aggregated Attention (SAMA) block, which integrates contextual self-attention with dynamic weight modulation to prioritise the most relevant features based on local and global contexts. This approach reduces computational complexity and improves the representation of complex image features across multiple scales. We also propose the Causal-Resonance Multi-Scale Module (CR-MSM), which enhances the flow of information between the encoder and decoder by using causal resonance learning. This mechanism allows the model to automatically adjust feature resolution and causal dependencies across scales, leading to better semantic alignment between the low-level and high-level features in U-shaped architectures. Experiments on MRI, CT, and endoscopy images show that SAMA-UNet performs better in segmentation accuracy than current methods using CNN, Transformer, and Mamba. The implementation is publicly available on GitHub.

[290] X-GRM: Large Gaussian Reconstruction Model for Sparse-view X-rays to Computed Tomography

Yifan Liu,Wuyang Li,Weihao Yu,Chenxin Li,Alexandre Alahi,Max Meng,Yixuan Yuan

Main category: eess.IV

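TL;DR: X-GRM is a large transformer-based feedforward model that reconstructs 3D CT from sparse-view 2D X-ray projections, uses a Voxel-based Gaussian Splatting representation, and is trained on the large-scale ReconX-15K dataset.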

Details

Motivation: Existing CT reconstruction methods are limited by small-capacity model architectures, inflexible volume representations, and small-scale training data; X-GRM aims to address these limits. Method: X-GRM encodes sparse X-ray inputs with a scalable transformer architecture and decodes them via VoxGS into an efficient, differentiable volume representation. Result: The model produces high-quality reconstructions from varied test inputs, including in-domain and out-of-domain X-ray projections. Conclusion: With a high-capacity model, a flexible representation, and large-scale data, X-GRM markedly improves the quality and flexibility of CT reconstruction. Abstract: Computed Tomography serves as an indispensable tool in clinical workflows, providing non-invasive visualization of internal anatomical structures. Existing CT reconstruction works are limited by small-capacity model architectures, inflexible volume representations, and small-scale training data. In this paper, we present X-GRM (X-ray Gaussian Reconstruction Model), a large feedforward model for reconstructing 3D CT from sparse-view 2D X-ray projections. X-GRM employs a scalable transformer-based architecture to encode an arbitrary number of sparse X-ray inputs, where tokens from different views are integrated efficiently. Then, tokens are decoded into a new volume representation, named Voxel-based Gaussian Splatting (VoxGS), which enables efficient CT volume extraction and differentiable X-ray rendering. To support the training of X-GRM, we collect ReconX-15K, a large-scale CT reconstruction dataset containing around 15,000 CT/X-ray pairs across diverse organs, including the chest, abdomen, pelvis, and teeth. This combination of a high-capacity model, flexible volume representation, and large-scale training data empowers our model to produce high-quality reconstructions from various testing inputs, including in-domain and out-of-domain X-ray projections. Project Page: https://github.com/CUHK-AIM-Group/X-GRM.

[291] Reconsider the Template Mesh in Deep Learning-based Mesh Reconstruction

Fengting Zhang,Boxu Liang,Qinghao Liu,Min Liu,Xiang Chen,Yaonan Wang

Main category: eess.IV

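TL;DR: Proposes an adaptive-template-based mesh reconstruction network (ATMRN) that generates adaptive templates to improve reconstruction accuracy, overcoming the limits of traditional fixed-template methods.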

Details

Motivation: Traditional mesh reconstruction methods rely on a fixed template and overlook anatomical variation between subjects, which hurts reconstruction accuracy. Method: ATMRN generates an adaptive template from the given image and then deforms it, avoiding the constraints of a single fixed template. Result: Validated on cortical MR images from the OASIS dataset, the method reaches an average symmetric surface distance of 0.267 mm, better than existing methods. Conclusion: ATMRN is generic and can be readily transferred to other image modalities and anatomical structures. Abstract: Mesh reconstruction is a cornerstone process across various applications, including in-silico trials, digital twins, surgical planning, and navigation. Recent advancements in deep learning have notably enhanced mesh reconstruction speeds. Yet, traditional methods predominantly rely on deforming a standardised template mesh for individual subjects, which overlooks the unique anatomical variations between them, and may compromise the fidelity of the reconstructions. In this paper, we propose an adaptive-template-based mesh reconstruction network (ATMRN), which generates adaptive templates from the given images for the subsequent deformation, moving beyond the constraints of a singular, fixed template. Our approach, validated on cortical magnetic resonance (MR) images from the OASIS dataset, sets a new benchmark in voxel-to-cortex mesh reconstruction, achieving an average symmetric surface distance of 0.267 mm across four cortical structures. Our proposed method is generic and can be easily transferred to other image modalities and anatomical structures.
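
The average symmetric surface distance (ASSD) quoted in this entry can be illustrated with a small sketch. It uses one common definition: for each surface, average the distance from every sampled point to its nearest point on the other surface, pooled over both directions. The point sets and the function name `assd` are made up for illustration; the paper's own evaluation code is not shown in the abstract and may sample or weight surfaces differently.

```python
import math

def assd(a_pts, b_pts):
    """Average symmetric surface distance between two sampled surfaces.

    Each surface is a list of (x, y, z) points. Brute-force nearest
    neighbour search: fine for a toy example, too slow for real meshes.
    """
    def directed(src, dst):
        # Distance from every point in src to its closest point in dst.
        return [min(math.dist(p, q) for q in dst) for p in src]

    d_ab = directed(a_pts, b_pts)
    d_ba = directed(b_pts, a_pts)
    return (sum(d_ab) + sum(d_ba)) / (len(d_ab) + len(d_ba))

# Two toy "surfaces" one unit apart, with one extra point on the second.
surface_a = [(0, 0, 0), (1, 0, 0)]
surface_b = [(0, 0, 1), (1, 0, 1), (2, 0, 1)]
value = assd(surface_a, surface_b)
print(value)
```

Because both directions are pooled, a reconstruction that misses part of the target surface is penalised even when every predicted point lies close to the target.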

[292] Deep Learning Enabled Segmentation, Classification and Risk Assessment of Cervical Cancer

Abdul Samad Shaik,Shashaank Mattur Aswatha,Rahul Jashvantbhai Pandya

Main category: eess.IV

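TL;DR: Proposes a new deep learning architecture for detecting and classifying cervical cancer cells that combines multi-resolution fusion with multi-task learning, performing close to state-of-the-art models with far fewer parameters.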

Details

Motivation: Cervical cancer is the fourth most common cancer in women worldwide, and early detection is essential. The Pap smear test is the main screening tool, but identifying precancerous changes requires efficient and accurate methods. Method: A multi-resolution fusion deep convolutional network handles images of varying resolution; multi-task learning performs segmentation and classification simultaneously; a probabilistic approach assesses risk. Result: Accuracy is within 2% to 3% of state-of-the-art models with roughly 85 times fewer parameters than VGG-19; the multi-task model reaches a segmentation IoU of 0.83 and a classification accuracy of 90%. Conclusion: The model performs well for early detection of cervical cancer, is parameter-efficient, and provides a useful tool for prognosis. Abstract: Cervical cancer, the fourth leading cause of cancer in women globally, requires early detection through Pap smear tests to identify precancerous changes and prevent disease progression. In this study, we performed a focused analysis by segmenting the cellular boundaries and drawing bounding boxes to isolate the cancer cells. A novel Deep Learning (DL) architecture, the "Multi-Resolution Fusion Deep Convolutional Network", was proposed to effectively handle images with varying resolutions and aspect ratios, with its efficacy showcased using the SIPaKMeD dataset. The performance of this DL model was observed to be similar to the state-of-the-art models, with accuracy variations of a mere 2% to 3%, achieved using just 1.7 million learnable parameters, which is approximately 85 times fewer than the VGG-19 model. Furthermore, we introduced a multi-task learning technique that simultaneously performs segmentation and classification tasks and yields an Intersection over Union score of 0.83 and a classification accuracy of 90%. The final stage of the workflow employs a probabilistic approach for risk assessment, extracting feature vectors to predict the likelihood of normal cells progressing to malignant states, which can be utilized for the prognosis of cervical cancer.
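
The Intersection over Union (IoU) score reported for the segmentation task is the ratio of overlapping foreground pixels to the pixels marked foreground in either mask. The sketch below computes it for two tiny binary masks; the masks and the helper name `iou` are illustrative only and do not come from the paper.

```python
def iou(pred, target):
    """Intersection over Union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return inter / union if union else 1.0

pred_mask = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
true_mask = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 0],
]
score = iou(pred_mask, true_mask)
print(score)
```

Here three pixels overlap and five are foreground in at least one mask, so the score is 3/5.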

cs.SE [Back]

[293] Sentiment Analysis in Software Engineering: Evaluating Generative Pre-trained Transformers

KM Khalid Saifullah,Faiaz Azmain,Habiba Hye

Main category: cs.SE

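TL;DR: The study compares BERT and GPT-4o-mini for sentiment analysis in software engineering and finds that fine-tuned GPT-4o-mini matches BERT on structured, balanced datasets, while the default GPT-4o-mini generalizes better on the linguistically complex Stack Overflow dataset.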

Details

Motivation: Traditional sentiment analysis tools for software engineering do not fully handle the nuanced, context-dependent language of the domain, so more advanced models need to be evaluated. Method: BERT and GPT-4o-mini are benchmarked in fine-tuned and default configurations on datasets from GitHub, Stack Overflow, and Jira. Result: Fine-tuned GPT-4o-mini is comparable to BERT on the structured GitHub and Jira datasets (macro-averaged F1 of 0.93 and 0.98), whereas on the complex Stack Overflow dataset the default configuration does better (85.3% accuracy versus 13.1% for the fine-tuned model). Conclusion: The study stresses the importance of matching model architecture to dataset characteristics and proposes future research directions for improving sentiment analysis tools in software engineering. Abstract: Sentiment analysis plays a crucial role in understanding developer interactions, issue resolutions, and project dynamics within software engineering (SE). While traditional SE-specific sentiment analysis tools have made significant strides, they often fail to account for the nuanced and context-dependent language inherent to the domain. This study systematically evaluates the performance of bidirectional transformers, such as BERT, against generative pre-trained transformers, specifically GPT-4o-mini, in SE sentiment analysis. Using datasets from GitHub, Stack Overflow, and Jira, we benchmark the models' capabilities with fine-tuned and default configurations. The results reveal that fine-tuned GPT-4o-mini performs comparably to BERT and other bidirectional models on structured and balanced datasets like GitHub and Jira, achieving macro-averaged F1-scores of 0.93 and 0.98, respectively. However, on linguistically complex datasets with imbalanced sentiment distributions, such as Stack Overflow, the default GPT-4o-mini model exhibits superior generalization, achieving an accuracy of 85.3% compared to the fine-tuned model's 13.1%. These findings highlight the trade-offs between fine-tuning and leveraging pre-trained models for SE tasks. The study underscores the importance of aligning model architectures with dataset characteristics to optimize performance and proposes directions for future research in refining sentiment analysis tools tailored to the SE domain.
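
The macro-averaged F1-scores quoted in this entry give every sentiment class equal weight, which matters on imbalanced data where accuracy can be dominated by the majority class. A minimal sketch of the metric follows; the labels and predictions are invented for illustration and are unrelated to the paper's datasets.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
score = macro_f1(y_true, y_pred)
print(score)
```

In this toy case the per-class F1 values are 0.8 for `neg`, 2/3 for `neu`, and 0.5 for `pos`, and the macro average is their plain mean.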

cs.CR [Back]

[294] BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Andy K. Zhang,Joey Ji,Celeste Menders,Riya Dulepet,Thomas Qin,Ron Y. Wang,Junrong Wu,Kyleen Liao,Jiliang Li,Jinghan Hu,Sara Hong,Nardos Demilew,Shivatmica Murgai,Jason Tran,Nishka Kacheria,Ethan Ho,Denis Liu,Lauren McLane,Olivia Bruvik,Dai-Rong Han,Seungwoo Kim,Akhil Vyas,Cuiyuanxiu Chen,Ryan Li,Weiran Xu,Jonathan Z. Ye,Prerit Choudhary,Siddharth M. Bhatia,Vikram Sivashankar,Yuxuan Bao,Dawn Song,Dan Boneh,Daniel E. Ho,Percy Liang

Main category: cs.CR

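TL;DR: The paper introduces BountyBench, the first framework for evaluating the offensive and defensive capabilities of AI agents in cybersecurity, and tests 5 agents on vulnerability detection, exploitation, and patching tasks across 25 real-world systems.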

Details

Motivation: Understand how AI agents are changing the cybersecurity landscape and provide a framework that quantifies offensive and defensive capabilities. Method: Build the BountyBench framework with 25 real-world systems, define three task types (Detect, Exploit, Patch), and design a new success indicator for evaluation. Result: Claude Code and OpenAI Codex CLI are stronger at defense, while the custom agents are more balanced between offense and defense. Conclusion: AI agents show potential in cybersecurity, but their offensive and defensive capabilities differ markedly and need further improvement. Abstract: AI agents have the potential to significantly alter the cybersecurity landscape. To help us understand this change, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards from $10 to $30,485, and cover 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a specific vulnerability. We evaluate 5 agents: Claude Code, OpenAI Codex CLI, and custom agents with GPT-4.1, Gemini 2.5 Pro Preview, and Claude 3.7 Sonnet Thinking. Given up to three attempts, the top-performing agents are Claude Code (5% on Detect, mapping to $1,350), Custom Agent with Claude 3.7 Sonnet Thinking (5% on Detect, mapping to $1,025; 67.5% on Exploit), and OpenAI Codex CLI (5% on Detect, mapping to $2,400; 90% on Patch, mapping to $14,422). OpenAI Codex CLI and Claude Code are more capable at defense, achieving higher Patch scores of 90% and 87.5%, compared to Exploit scores of 32.5% and 57.5% respectively; in contrast, the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 40-67.5% and Patch scores of 45-60%.

[295] Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses

Xiaoxue Yang,Bozhidar Stevanoski,Matthieu Meeus,Yves-Alexandre de Montjoye

Main category: cs.CR

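TL;DR: The paper evaluates the robustness of current alignment-based defenses and proposes an improved attack that initializes GCG from intermediate model checkpoints, showing that existing defenses are brittle.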

Details

Motivation: Existing defenses claim to withstand attacks such as GCG, but attackers who know something about the alignment process may be able to attack far more efficiently. The paper tests whether this threat model is realistic. Method: GCG is initialized from intermediate model checkpoints, with gradient information used to select checkpoints, improving both the effectiveness and the efficiency of the attack. Result: The method is highly effective across multiple defenses and models and finds universal adversarial suffixes, showing that existing defenses have weaknesses. Conclusion: Current alignment-based defenses are brittle, and stronger threat models are needed when testing LLM safety. Abstract: Large language models (LLMs) are rapidly deployed in real-world applications ranging from chatbots to agentic systems. Alignment is one of the main approaches used to defend against attacks such as prompt injection and jailbreaks. Recent defenses report near-zero Attack Success Rates (ASR) even against Greedy Coordinate Gradient (GCG), a white-box attack that generates adversarial suffixes to induce attacker-desired outputs. However, the search space over discrete tokens is extremely large, making the task of finding successful attacks difficult. GCG has, for instance, been shown to converge to local minima, making it sensitive to initialization choices. In this paper, we assess the future-proof robustness of these defenses using a more informed threat model: attackers who have access to some information about the alignment process. Specifically, we propose an informed white-box attack leveraging the intermediate model checkpoints to initialize GCG, with each checkpoint acting as a stepping stone for the next one. We show this approach to be highly effective across state-of-the-art (SOTA) defenses and models. We further show our informed initialization to outperform other initialization methods and show a gradient-informed checkpoint selection strategy to greatly improve attack performance and efficiency. Importantly, we also show our method to successfully find universal adversarial suffixes, that is, single suffixes effective across diverse inputs. Our results show that, contrary to previous beliefs, effective adversarial suffixes do exist against SOTA alignment-based defenses, that these can be found by existing attack methods when adversaries exploit alignment knowledge, and that even universal suffixes exist. Taken together, our results highlight the brittleness of current alignment-based methods and the need to consider stronger threat models when testing the safety of LLMs.

[296] Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

Taiye Chen,Zeming Wei,Ang Li,Yisen Wang

Main category: cs.CR

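TL;DR: The paper proposes Safety Context Retrieval (SCR), a retrieval-based safeguarding method that strengthens large language model (LLM) defenses against jailbreaking attacks.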

Details

Motivation: Existing static defense mechanisms struggle to keep up with evolving jailbreaking techniques, so a more flexible and scalable defense is needed. Method: A preliminary study shows that a small set of safety-aligned examples markedly improves robustness against a given attack; building on this, the paper combines retrieval-augmented generation (RAG) techniques to propose SCR. Result: Experiments show that SCR defends effectively against both established and emerging jailbreaking attacks and outperforms existing methods. Conclusion: SCR offers a new scalable and robust defense paradigm for LLM safety. Abstract: Large Language Models (LLMs) are known to be vulnerable to jailbreaking attacks, wherein adversaries exploit carefully engineered prompts to induce harmful or unethical responses. Such threats have raised critical concerns about the safety and reliability of LLMs in real-world deployment. While existing defense mechanisms partially mitigate such risks, subsequent advancements in adversarial techniques have enabled novel jailbreaking methods to circumvent these protections, exposing the limitations of static defense frameworks. In this work, we explore defending against evolving jailbreaking threats through the lens of context retrieval. First, we conduct a preliminary study demonstrating that even a minimal set of safety-aligned examples against a particular jailbreak can significantly enhance robustness against this attack pattern. Building on this insight, we further leverage the retrieval-augmented generation (RAG) techniques and propose Safety Context Retrieval (SCR), a scalable and robust safeguarding paradigm for LLMs against jailbreaking. Our comprehensive experiments demonstrate how SCR achieves superior defensive performance against both established and emerging jailbreaking tactics, contributing a new paradigm to LLM safety. Our code will be available upon publication.

cs.SD [Back]

[297] AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars

Tianbao Zhang,Jian Zhao,Yuer Li,Zheng Zhu,Ping Hu,Zhaoxin Fan,Wenjun Wu,Xuelong Li

Main category: cs.SD

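TL;DR: AsynFusion is a new diffusion-transformer-based framework for coordinated generation of audio-driven whole-body avatar expressions and poses, addressing the lack of coordination between expressions and gestures in existing methods.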

Details

Motivation: Whole-body audio-driven avatar generation has wide applications in virtual reality and related fields, but existing methods generate expressions and gestures independently, producing unnatural animation. Method: A dual-branch DiT architecture with a Cooperative Synchronization Module and an Asynchronous LCM Sampling strategy generates expressions and gestures in parallel while keeping them coordinated. Result: Experiments show that AsynFusion performs strongly at real-time, synchronized whole-body animation and outperforms existing methods in both quantitative and qualitative evaluations. Conclusion: By coordinating expression and gesture generation, AsynFusion markedly improves the naturalness and coherence of animation and has broad application prospects. Abstract: Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans and enhancing the capabilities of interactive virtual agents, with wide-ranging applications in virtual reality, digital entertainment, and remote communication. Existing approaches often generate audio-driven facial expressions and gestures independently, which introduces a significant limitation: the lack of seamless coordination between facial and gestural elements, resulting in less natural and cohesive animations. To address this limitation, we propose AsynFusion, a novel framework that leverages diffusion transformers to achieve harmonious expression and gesture synthesis. The proposed method is built upon a dual-branch DiT architecture, which enables the parallel generation of facial expressions and gestures. Within the model, we introduce a Cooperative Synchronization Module to facilitate bidirectional feature interaction between the two modalities, and an Asynchronous LCM Sampling strategy to reduce computational overhead while maintaining high-quality outputs. Extensive experiments demonstrate that AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations, consistently outperforming existing methods in both quantitative and qualitative evaluations.

Cheng Yifan,Zhang Ruoyi,Shi Jiatong

Main category: cs.SD

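TL;DR: MIKU-PAL is an automated multimodal pipeline that extracts high-consistency emotional speech from unlabeled video data; it reaches human-level accuracy with higher consistency than human annotation, and the authors release the MIKU-EmoBench dataset.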

Details

Motivation: Acquiring large-scale emotional speech data with strong consistency is a challenge for speech synthesis and calls for an efficient, low-cost solution. Method: An automatic emotion analysis system is built from face detection and tracking algorithms combined with a multimodal large language model (MLLM). Result: MIKU-PAL reaches human-level accuracy (68.5% on MELD) and high consistency (Fleiss kappa of 0.93), and supports annotation of up to 26 fine-grained emotion categories. Conclusion: MIKU-PAL provides a high-quality, flexible, and consistent annotation tool for emotional speech synthesis, released together with the new benchmark dataset MIKU-EmoBench. Abstract: Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion analysis system using a multimodal large language model (MLLM). Our results demonstrate that MIKU-PAL can achieve human-level accuracy (68.5% on MELD) and superior consistency (0.93 Fleiss kappa score) while being much cheaper and faster than human annotation. With the high-quality, flexible, and consistent annotation from MIKU-PAL, we can annotate fine-grained speech emotion categories of up to 26 types, validated by human annotators with 83% rationality ratings. Based on our proposed system, we further released a fine-grained emotional speech dataset MIKU-EmoBench (131.2 hours) as a new benchmark for emotional text-to-speech and visual voice cloning.
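
The Fleiss kappa score cited for annotation consistency measures agreement among several raters beyond what chance would produce. The sketch below implements the standard formula on a made-up rating table; how the paper assembled its ratings is not described in the abstract, so the input layout here is an assumption.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j. Every item must have the
    same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement per item, then averaged.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_exp) / (1 - p_exp)

# Three raters, two emotion categories, four clips.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
mixed = [[2, 1], [1, 2]]
print(fleiss_kappa(perfect), fleiss_kappa(mixed))
```

A value of 1 means complete agreement, 0 means agreement no better than chance, and negative values mean agreement below chance.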

cs.PF [Back]

[299] A Methodology to Evaluate Strategies Predicting Rankings on Unseen Domains

Sébastien Piérard,Adrien Deliège,Anaïs Halin,Marc Van Droogenbroeck

Main category: cs.PF

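TL;DR: The paper proposes a new methodology for predicting which entities (such as algorithms or methods) will perform best in an unseen domain, without carrying out new, costly evaluations.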

Details

Motivation: Predict how entities will perform across domains while avoiding the high cost of repeated evaluation. Method: A leave-one-domain-out procedure, combined with application-specific preferences, is used to assess strategies that predict rankings on a held-out domain. Result: The methodology is demonstrated with 30 strategies predicting the rankings of 40 entities (unsupervised background subtraction methods) on 53 domains (videos). Conclusion: The methodology offers a practical tool for cross-domain performance prediction and reduces evaluation cost. Abstract: Frequently, multiple entities (methods, algorithms, procedures, solutions, etc.) can be developed for a common task and applied across various domains that differ in the distribution of scenarios encountered. For example, in computer vision, the input data provided to image analysis methods depend on the type of sensor used, its location, and the scene content. However, a crucial difficulty remains: can we predict which entities will perform best in a new domain based on assessments on known domains, without having to carry out new and costly evaluations? This paper presents an original methodology to address this question, in a leave-one-domain-out fashion, for various application-specific preferences. We illustrate its use with 30 strategies to predict the rankings of 40 entities (unsupervised background subtraction methods) on 53 domains (videos).
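
The leave-one-domain-out idea in this entry can be sketched as a loop: hide one domain, let a strategy predict entity scores from the remaining domains, and compare the predicted ordering with the true one. Everything specific below is an assumption made for illustration: the mean-score strategy, the use of Kendall's tau as the agreement measure, and the toy scores are not taken from the paper.

```python
from itertools import combinations

def kendall_tau(pred, truth):
    """Kendall tau-a between two score lists over the same entities."""
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        sign = (pred[i] - pred[j]) * (truth[i] - truth[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / n_pairs

def mean_score_strategy(known):
    """Hypothetical strategy: predict each entity's score on the unseen
    domain as its mean score over the known domains."""
    n_entities = len(next(iter(known.values())))
    return [sum(s[e] for s in known.values()) / len(known)
            for e in range(n_entities)]

def leave_one_domain_out(scores, strategy):
    """Evaluate a ranking-prediction strategy on each held-out domain."""
    results = {}
    for held_out, truth in scores.items():
        known = {d: s for d, s in scores.items() if d != held_out}
        results[held_out] = kendall_tau(strategy(known), truth)
    return results

# Toy performance scores: 3 entities evaluated on 3 domains (videos).
scores = {
    "video_a": [0.9, 0.7, 0.5],
    "video_b": [0.8, 0.6, 0.4],
    "video_c": [0.7, 0.4, 0.5],
}
results = leave_one_domain_out(scores, mean_score_strategy)
print(results)
```

Swapping `mean_score_strategy` for another callable with the same signature lets several strategies be compared under the same held-out protocol.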

cs.AI [Back]

[300] ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges

Cheng Qian,Hongyi Du,Hongru Wang,Xiusi Chen,Yuji Zhang,Avirup Sil,Chengxiang Zhai,Kathleen McKeown,Heng Ji

Main category: cs.AI

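TL;DR: The paper introduces ModelingBench (an open-ended mathematical modeling benchmark based on real-world problems) and ModelingAgent (a multi-agent framework) to address the failure of existing benchmarks to reflect real-world complexity.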

Details

Motivation: Existing benchmarks do not adequately capture the complexity and open-endedness of real-world problems, so a more comprehensive evaluation tool is needed. Method: The paper proposes the ModelingBench benchmark and the ModelingAgent framework, which combines tool use, structured workflows, and iterative refinement. Result: ModelingAgent outperforms strong baselines, and its solutions are often close to those of human experts. Conclusion: The work provides a comprehensive framework for evaluating open-ended, interdisciplinary modeling problems. Abstract: Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect the complexity of real-world problems, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce ModelingBench, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench also supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. We also present ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. To evaluate outputs, we further propose ModelingJudge, an expert-in-the-loop system leveraging LLMs as domain-specialized judges assessing solutions from multiple expert perspectives. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges.

[301] When Can Large Reasoning Models Save Thinking? Mechanistic Analysis of Behavioral Divergence in Reasoning

Rongzhi Zhu,Yi Liu,Zequn Sun,Yiwei Wang,Wei Hu

Main category: cs.AI

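TL;DR: The study finds that large reasoning models (LRMs) tend to overthink, identifies three thinking modes (NT, ET, IT), and analyzes how they affect reasoning behavior.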

Details

Motivation: Explore the internal mechanisms of RL-trained large reasoning models when they are prompted to save thinking, with the aim of improving efficiency. Method: Reasoning behavior is studied by analyzing confidence in thinking termination, attention from thinking to generation, and attentional focus on input sections. Result: The NT mode shortens outputs at the cost of accuracy, while the ET and IT modes maintain accuracy with shorter responses. Conclusion: RL-optimized LRMs show inconsistencies and need adaptive improvements to be reliably efficient. Abstract: Large reasoning models (LRMs) have significantly advanced performance on complex tasks, yet their tendency to overthink introduces inefficiencies. This study investigates the internal mechanisms of reinforcement learning (RL)-trained LRMs when prompted to save thinking, revealing three distinct thinking modes: no thinking (NT), explicit thinking (ET), and implicit thinking (IT). Through comprehensive analysis of confidence in thinking termination, attention from thinking to generation, and attentional focus on input sections, we uncover key factors influencing the reasoning behaviors. We further find that NT reduces output length at the cost of accuracy, while ET and IT maintain accuracy with reduced response length. Our findings expose fundamental inconsistencies in RL-optimized LRMs, necessitating adaptive improvements for reliable efficiency.

[302] When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning

Xiaoyun Zhang,Jingqing Ruan,Xing Ma,Yawen Zhu,Haodong Zhao,Hao Li,Jiansong Chen,Ke Zeng,Xunliang Cai

Main category: cs.AI

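TL;DR: The ASRR framework adaptively allocates reasoning effort, substantially reducing redundant computation in large reasoning models with minimal performance loss.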

Details

Motivation: Large reasoning models reason redundantly on simple tasks, wasting computation. Method: The ASRR framework uses accuracy-aware length reward regulation to allocate reasoning effort adaptively according to problem difficulty. Result: Across multiple benchmarks, ASRR cuts the reasoning budget by up to 32.5% with an accuracy loss of only 1.2% pass@1 (1.5B model), and by up to 25.7% with a 0.6% loss (7B model). Conclusion: ASRR offers an efficient, adaptive, and safer reasoning approach for large reasoning models. Abstract: Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead due to redundant reasoning, especially on simple tasks. In this work, we systematically quantify the upper bounds of LRMs under both Long-Thinking and No-Thinking modes, and uncover the phenomenon of "Internal Self-Recovery Mechanism" where models implicitly supplement reasoning during answer generation. Building on this insight, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. By introducing accuracy-aware length reward regulation, ASRR adaptively allocates reasoning effort according to problem difficulty, achieving high efficiency with negligible performance sacrifice. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces reasoning budget by up to 32.5% (1.5B) and 25.7% (7B) with minimal accuracy loss (1.2% and 0.6% pass@1), and significantly boosts harmless rates on safety benchmarks (up to +21.7%). Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.

[303] ClickSight: Interpreting Student Clickstreams to Reveal Insights on Learning Strategies via LLMs

Bahar Radmehr,Ekaterina Shved,Fatma Betül Güreş,Adish Singla,Tanja Käser

Main category: cs.AI

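TL;DR: ClickSight is a large language model (LLM)-based pipeline that interprets learning strategies from student clickstream data; the paper evaluates different prompting strategies and the effect of self-refinement.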

Details

Motivation: Clickstream data are high-dimensional and complex, and traditional approaches lack generalizability and scalability, so a more efficient way to interpret them is needed. Method: ClickSight uses an LLM to generate behavioral interpretations from raw clickstreams and a list of learning strategies; four prompting strategies and self-refinement are evaluated. Result: LLMs can reasonably interpret learning strategies, but quality varies by prompting strategy and self-refinement brings limited improvement. Conclusion: ClickSight demonstrates the potential of LLMs to generate theory-driven insights from educational data. Abstract: Clickstream data from digital learning environments offer valuable insights into students' learning behaviors, but are challenging to interpret due to their high dimensionality and granularity. Prior approaches have relied mainly on handcrafted features, expert labeling, clustering, or supervised models, therefore often lacking generalizability and scalability. In this work, we introduce ClickSight, an in-context Large Language Model (LLM)-based pipeline that interprets student clickstreams to reveal their learning strategies. ClickSight takes raw clickstreams and a list of learning strategies as input and generates textual interpretations of students' behaviors during interaction. We evaluate four different prompting strategies and investigate the impact of self-refinement on interpretation quality. Our evaluation spans two open-ended learning environments and uses a rubric-based domain-expert evaluation. Results show that while LLMs can reasonably interpret learning strategies from clickstreams, interpretation quality varies by prompting strategy, and self-refinement offers limited improvement. ClickSight demonstrates the potential of LLMs to generate theory-driven insights from educational interaction data.

stat.AP [Back]

[304] ComBAT Harmonization for diffusion MRI: Challenges and Best Practices

Pierre-Marc Jodoin,Manon Edde,Gabriel Girard,Félix Dumais,Guillaume Theaud,Matthieu Dumont,Jean-Christophe Houde,Yoan David,Maxime Descoteaux

Main category: stat.AP

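TL;DR: This paper reviews the mathematical foundation and assumptions of ComBAT, examines how demographic characteristics affect its results, and presents five recommendations for improvement.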

Details

Motivation: ComBAT is the standard method for harmonizing MRI-derived measurements, but violating its assumptions can bias the results, so its use needs refinement. Method: Runs experiments with Pairwise-ComBAT, a slightly modified version of ComBAT, to assess the impact of population size, age distribution, and other factors. Result: The experiments show how these factors affect harmonization and lead to five recommendations. Conclusion: The five recommendations improve the consistency and reproducibility of ComBAT harmonization, supporting open science and clinical deployment. Abstract: Over the years, ComBAT has become the standard method for harmonizing MRI-derived measurements, with its ability to compensate for site-related additive and multiplicative biases while preserving biological variability. However, ComBAT relies on a set of assumptions that, when violated, can result in flawed harmonization. In this paper, we thoroughly review ComBAT's mathematical foundation, outlining these assumptions, and exploring their implications for the demographic composition necessary for optimal results. Through a series of experiments involving a slightly modified version of ComBAT called Pairwise-ComBAT tailored for normative modeling applications, we assess the impact of various population characteristics, including population size, age distribution, the absence of certain covariates, and the magnitude of additive and multiplicative factors. Based on these experiments, we present five essential recommendations that should be carefully considered to enhance consistency and support reproducibility, two essential factors for open science, collaborative research, and real-life clinical deployment.
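
The additive and multiplicative site biases mentioned in the abstract can be illustrated with a stripped-down location-scale adjustment. The sketch below maps the measurements of a "moving" site onto the mean and standard deviation of a reference site. It is deliberately not ComBAT or Pairwise-ComBAT: it has no covariates (such as age or sex) and no empirical Bayes estimation, and the function name and toy values are hypothetical.

```python
from statistics import mean, pstdev


def align_to_reference(moving, reference):
    """Map `moving` values onto the location and scale of `reference`.

    Simplified illustration only: real ComBAT models covariates and
    estimates site effects with empirical Bayes shrinkage.
    """
    mu_m, sd_m = mean(moving), pstdev(moving)
    mu_r, sd_r = mean(reference), pstdev(reference)
    if sd_m == 0:
        raise ValueError("moving site has zero variance")
    # Remove the moving site's offset and scale, then apply the reference's.
    return [(x - mu_m) / sd_m * sd_r + mu_r for x in moving]


# Toy data: same underlying values, distorted by a multiplicative factor
# of 1.5 and an additive bias of 0.10 at the moving site.
reference_site = [0.40, 0.45, 0.50, 0.55, 0.60]
moving_site = [1.5 * x + 0.10 for x in reference_site]
harmonized = align_to_reference(moving_site, reference_site)
```

The toy example recovers the reference values only because both sites measure the same underlying population. If the sites differed in age distribution, this naive alignment would also remove genuine biological differences, which is the kind of assumption violation the paper examines.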

cs.IR [Back]

[305] An Alternative to FLOPS Regularization to Effectively Productionize SPLADE-Doc

Aldo Porco,Dhruv Mehra,Igor Malioutov,Karthik Radhakrishnan,Moniba Keymanesh,Daniel Preoţiuc-Pietro,Sean MacAvaney,Pengxiang Cheng

Main category: cs.IR

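TL;DR: The paper proposes DF-FLOPS, a new regularization method that reduces the use of high document frequency (DF) terms, lowering retrieval latency while maintaining retrieval effectiveness.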

Details

Motivation: High-DF terms increase retrieval latency, and existing methods such as FLOPS regularization do not address this. Method: Proposes DF-FLOPS, a regularization technique that penalizes the use of high-DF terms and thereby shortens posting lists in the inverted index. Result: DF-FLOPS substantially reduces the use of high-DF terms and makes retrieval around 10x faster while maintaining effectiveness. Conclusion: DF-FLOPS is an important step toward deploying LSR models in production. Abstract: Learned Sparse Retrieval (LSR) models encode text as weighted term vectors, which need to be sparse to leverage inverted index structures during retrieval. SPLADE, the most popular LSR model, uses FLOPS regularization to encourage vector sparsity during training. However, FLOPS regularization does not ensure sparsity among terms - only within a given query or document. Terms with very high Document Frequencies (DFs) substantially increase latency in production retrieval engines, such as Apache Solr, due to their lengthy posting lists. To address the issue of high DFs, we present a new variant of FLOPS regularization: DF-FLOPS. This new regularization technique penalizes the usage of high-DF terms, thereby shortening posting lists and reducing retrieval latency. Unlike other inference-time sparsification methods, such as stopword removal, DF-FLOPS regularization allows for the selective inclusion of high-frequency terms in cases where the terms are truly salient. We find that DF-FLOPS successfully reduces the prevalence of high-DF terms and lowers retrieval latency (around 10x faster) in a production-grade engine while maintaining effectiveness both in-domain (only a 2.2-point drop in MRR@10) and cross-domain (improved performance in 12 out of 13 tasks on which we tested). With retrieval latencies on par with BM25, this work provides an important step towards making LSR practical for deployment in production-grade search engines.
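
To make the idea of penalizing high-DF terms concrete, the sketch below first computes the standard FLOPS regularizer (the sum over vocabulary terms of the squared mean activation across a batch) and then a DF-weighted variant. The weighted form is an assumption for illustration: the abstract does not give the DF-FLOPS formula, so the per-term weight used here (the fraction of documents containing the term, `df_fraction`) and both function names are hypothetical.

```python
def flops_penalty(batch):
    """Standard FLOPS regularizer: sum_j (mean_i w_ij)^2.

    batch: list of non-negative term-weight vectors, one per document.
    """
    n = len(batch)
    vocab_size = len(batch[0])
    return sum(
        (sum(doc[j] for doc in batch) / n) ** 2 for j in range(vocab_size)
    )


def df_weighted_flops_penalty(batch, df_fraction):
    """Hypothetical DF-weighted variant: sum_j (df_j * mean_i w_ij)^2.

    df_fraction[j] is the assumed share of documents in which term j is
    active; frequent terms therefore contribute more to the penalty.
    """
    n = len(batch)
    return sum(
        (df_fraction[j] * sum(doc[j] for doc in batch) / n) ** 2
        for j in range(len(df_fraction))
    )


# Toy batch: two documents over a three-term vocabulary.
batch = [[1.0, 0.0, 2.0], [1.0, 2.0, 0.0]]
df_fraction = [0.9, 0.1, 0.1]  # term 0 is a very common term

plain = flops_penalty(batch)
weighted = df_weighted_flops_penalty(batch, df_fraction)
```

In this toy batch every term has the same mean activation, so plain FLOPS treats them alike, whereas the weighted variant concentrates almost all of the penalty on the common term.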

[306] MIRB: Mathematical Information Retrieval Benchmark

Haocheng Ju,Bin Dong

Main category: cs.IR

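TL;DR: This paper introduces MIRB (Mathematical Information Retrieval Benchmark) for evaluating performance on mathematical information retrieval (MIR) tasks, filling the gap left by the lack of a unified benchmark in this area.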

Details

Motivation: Mathematical information retrieval is key to several applications but lacks a unified evaluation benchmark, so a comprehensive framework is needed to evaluate the different retrieval tasks. Method: MIRB comprises four tasks (semantic statement retrieval, question-answer retrieval, premise retrieval, and formula retrieval) spanning 12 datasets, on which 13 retrieval models are evaluated. Result: Thirteen models are evaluated on MIRB, and the challenges inherent to mathematical information retrieval are analyzed. Conclusion: MIRB provides a comprehensive framework for evaluating mathematical information retrieval systems and helps advance more effective retrieval models for the mathematical domain. Abstract: Mathematical Information Retrieval (MIR) is the task of retrieving information from mathematical documents and plays a key role in various applications, including theorem search in mathematical libraries, answer retrieval on math forums, and premise selection in automated theorem proving. However, a unified benchmark for evaluating these diverse retrieval tasks has been lacking. In this paper, we introduce MIRB (Mathematical Information Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB includes four tasks: semantic statement retrieval, question-answer retrieval, premise retrieval, and formula retrieval, spanning a total of 12 datasets. We evaluate 13 retrieval models on this benchmark and analyze the challenges inherent to MIR. We hope that MIRB provides a comprehensive framework for evaluating MIR systems and helps advance the development of more effective retrieval models tailored to the mathematical domain.

cs.NE [Back]

[307] Evolutionary Computation and Large Language Models: A Survey of Methods, Synergies, and Applications

Dikshit Chauhan,Bapi Dutta,Indu Bala,Niki van Stein,Thomas Bäck,Anupam Yadav

Main category: cs.NE

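TL;DR: The paper explores the synergistic potential of large language models (LLMs) and evolutionary computation (EC), showing their bidirectional contributions to optimization and automated design.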

Details

Motivation: Combine the natural language understanding of LLMs with the optimization capabilities of EC to advance artificial intelligence. Method: Surveys the intersection of LLMs and EC and analyzes how each enhances the other, covering LLM training, EC algorithm design, and related topics. Result: Shows benefits in both directions: EC optimizes LLM components (e.g., prompt engineering), and LLMs automate EC design (e.g., metaheuristics). Conclusion: Advocates hybrid approaches that combine the strengths of LLMs and EC, and identifies open challenges such as computational cost and interpretability. Abstract: Integrating Large Language Models (LLMs) and Evolutionary Computation (EC) represents a promising avenue for advancing artificial intelligence by combining powerful natural language understanding with optimization and search capabilities. This manuscript explores the synergistic potential of LLMs and EC, reviewing their intersections, complementary strengths, and emerging applications. We identify key opportunities where EC can enhance LLM training, fine-tuning, prompt engineering, and architecture search, while LLMs can, in turn, aid in automating the design, analysis, and interpretation of ECs. The manuscript explores the synergistic integration of EC and LLMs, highlighting their bidirectional contributions to advancing artificial intelligence. It first examines how EC techniques enhance LLMs by optimizing key components such as prompt engineering, hyperparameter tuning, and architecture search, demonstrating how evolutionary methods automate and refine these processes. Secondly, the survey investigates how LLMs improve EC by automating metaheuristic design, tuning evolutionary algorithms, and generating adaptive heuristics, thereby increasing efficiency and scalability. Emerging co-evolutionary frameworks are discussed, showcasing applications across diverse fields while acknowledging challenges like computational costs, interpretability, and algorithmic convergence. The survey concludes by identifying open research questions and advocating for hybrid approaches that combine the strengths of EC and LLMs.

eess.AS [Back]

[308] QUADS: QUAntized Distillation Framework for Efficient Speech Language Understanding

Subrata Biswas,Mohammad Nur Hossain Khan,Bashima Islam

Main category: eess.AS

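TL;DR: QUADS is a unified framework that combines distillation and quantization through multi-stage training to balance the performance and efficiency of SLU systems in resource-constrained environments.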

Details

Motivation: Existing methods apply distillation and quantization separately, which leads to suboptimal compression because distillation ignores quantization constraints. Method: Proposes QUADS, a framework that optimizes distillation and quantization jointly through multi-stage training with a pre-tuned model, improving adaptability to low-bit regimes. Result: Reaches 71.13% accuracy on SLURP and 99.20% on FSC while reducing computational complexity by 60-73× and model size by 83-700×. Conclusion: QUADS remains robust under extreme quantization and is an efficient solution for resource-constrained SLU applications. Abstract: Spoken Language Understanding (SLU) systems must balance performance and efficiency, particularly in resource-constrained environments. Existing methods apply distillation and quantization separately, leading to suboptimal compression as distillation ignores quantization constraints. We propose QUADS, a unified framework that optimizes both through multi-stage training with a pre-tuned model, enhancing adaptability to low-bit regimes while maintaining accuracy. QUADS achieves 71.13% accuracy on SLURP and 99.20% on FSC, with only minor degradations of up to 5.56% compared to state-of-the-art models. Additionally, it reduces computational complexity by 60-73× (GMACs) and model size by 83-700×, demonstrating strong robustness under extreme quantization. These results establish QUADS as a highly efficient solution for real-world, resource-constrained SLU applications.
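
To show why low-bit regimes are hard, the sketch below applies symmetric uniform quantization to a small weight vector and measures the reconstruction error at 8 bits and at 2 bits. It is a generic textbook scheme, not the quantizer used in QUADS; the function names and the example weights are hypothetical.

```python
def quantize_dequantize(weights, bits):
    """Symmetric uniform quantization followed by dequantization.

    Weights are mapped to integers in [-qmax, qmax] with a single
    per-vector scale, then mapped back to floats.
    """
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed grid")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax
    restored = []
    for w in weights:
        q = round(w / scale)            # nearest integer level
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


weights = [0.82, -0.41, 0.07, -0.93, 0.33, 0.0]
err_8bit = mean_abs_error(weights, quantize_dequantize(weights, 8))
err_2bit = mean_abs_error(weights, quantize_dequantize(weights, 2))
```

At 2 bits only the levels -1, 0, and +1 survive, so most weights collapse to zero or to the largest magnitude. A student model that is distilled without regard to this grid has to absorb that error afterwards, which is the mismatch the abstract describes.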

[309] TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis

Yu Zhang,Wenxiang Guo,Changhao Pan,Dongyu Yao,Zhiyuan Zhu,Ziyue Jiang,Yuhan Wang,Tao Jin,Zhou Zhao

Main category: eess.AS

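TL;DR: TCSinger 2 is a multi-task, multilingual, zero-shot singing voice synthesis model that provides style transfer and style control via diverse prompts, and reduces existing models' dependence on phoneme and note boundary annotations.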

Details

Motivation: Existing singing voice synthesis models perform poorly in zero-shot scenarios and lack multi-level style control; TCSinger 2 aims to address both problems. Method: TCSinger 2 has three key modules: a Blurred Boundary Content Encoder, a Custom Audio Encoder, and a Flow-based Custom Transformer, which handle smooth transitions, aligned representations, and style modeling, respectively. Result: Experiments show that TCSinger 2 outperforms baseline models on both subjective and objective metrics. Conclusion: Through its module design and style control, TCSinger 2 improves the quality and flexibility of zero-shot singing voice synthesis. Abstract: Customizable multilingual zero-shot singing voice synthesis (SVS) has various potential applications in music composition and short video dubbing. However, existing SVS models overly depend on phoneme and note boundary annotations, limiting their robustness in zero-shot scenarios and producing poor transitions between phonemes and notes. Moreover, they also lack effective multi-level style control via diverse prompts. To overcome these challenges, we introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts. TCSinger 2 mainly includes three key modules: 1) Blurred Boundary Content (BBC) Encoder, which predicts duration, extends content embedding, and applies masking to the boundaries to enable smooth transitions. 2) Custom Audio Encoder, which uses contrastive learning to extract aligned representations from singing, speech, and textual prompts. 3) Flow-based Custom Transformer, which leverages Cus-MOE, with F0 supervision, enhancing both the synthesis quality and style modeling of the generated singing voice. Experimental results show that TCSinger 2 outperforms baseline models in both subjective and objective metrics across multiple related tasks.

[310] Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information

Nicholas Sanders,Yuanchao Li,Korin Richmond,Simon King

Main category: eess.AS

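TL;DR: The paper proposes Segmentation-Variant Codebooks (SVCs), which quantize speech at different linguistic units (frame, phone, word, utterance) to preserve prosodic and paralinguistic information without the higher bitrate that comes from enlarging the codebook.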

Details

Motivation: Quantization in SSL speech models (e.g., HuBERT) often discards prosodic and paralinguistic information (e.g., emotion, prominence); enlarging the codebook partly mitigates this but raises the bitrate. Method: Proposes Segmentation-Variant Codebooks (SVCs), which quantize speech separately at different linguistic units (frame, phone, word, utterance), producing multiple streams of segment-specific discrete features. The study also finds that pooling before discretization retains segment-level information better than pooling after it. Result: SVCs preserve prosodic and paralinguistic information significantly more effectively across probing tasks. Resynthesis experiments further confirm better style realization, slightly improved quality, and preserved intelligibility. Conclusion: By quantizing speech with segmentation-variant codebooks, SVCs improve the retention of prosodic and paralinguistic information while avoiding the high bitrate of larger codebooks. Abstract: Quantization in SSL speech models (e.g., HuBERT) improves compression and performance in tasks like language modeling, resynthesis, and text-to-speech but often discards prosodic and paralinguistic information (e.g., emotion, prominence). While increasing codebook size mitigates some loss, it inefficiently raises bitrates. We propose Segmentation-Variant Codebooks (SVCs), which quantize speech at distinct linguistic units (frame, phone, word, utterance), factorizing it into multiple streams of segment-specific discrete features. Our results show that SVCs are significantly more effective at preserving prosodic and paralinguistic information across probing tasks. Additionally, we find that pooling before rather than after discretization better retains segment-level information. Resynthesis experiments further confirm improved style realization and slightly improved quality while preserving intelligibility.
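
The difference between pooling before and after discretization can be seen in a toy example. The sketch below mean-pools frame features within each segment and then assigns the nearest codeword, and compares that with assigning a codeword per frame and then taking a majority vote per segment. The majority vote is only one possible reading of "pooling after discretization", and the codebook, features, segment boundaries, and function names are all hypothetical.

```python
from collections import Counter


def nearest_code(vec, codebook):
    """Index of the codeword closest to `vec` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def mean_pool(frames):
    """Element-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def pool_then_quantize(frames, segments, codebook):
    """Pool continuous features per segment, then discretize once."""
    return [nearest_code(mean_pool(frames[s:e]), codebook) for s, e in segments]


def quantize_then_pool(frames, segments, codebook):
    """Discretize every frame, then keep the most frequent code per segment."""
    codes = [nearest_code(f, codebook) for f in frames]
    return [Counter(codes[s:e]).most_common(1)[0][0] for s, e in segments]


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [
    [0.0, 0.0], [0.2, 0.2], [2.8, 2.8],   # segment 0: frames 0..2
    [4.1, 3.9], [3.9, 4.1],               # segment 1: frames 3..4
]
segments = [(0, 3), (3, 5)]  # half-open (start, end) frame ranges

before = pool_then_quantize(frames, segments, codebook)
after = quantize_then_pool(frames, segments, codebook)
```

For the first segment the two orders disagree: pooling first yields the codeword nearest the segment average, while voting over frame codes ignores the outlying frame entirely. The toy says nothing about which order is better; it only shows that the choice changes the resulting tokens.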

[311] ToxicTone: A Mandarin Audio Dataset Annotated for Toxicity and Toxic Utterance Tonality

Yu-Xiang Luo,Yi-Cheng Lin,Ming-To Chuang,Jia-Hung Chen,I-Ning Tsai,Pei Xing Kiew,Yueh-Hsuan Huang,Chien-Feng Liu,Yu-Chen Chen,Bo-Han Feng,Wenze Ren,Hung-yi Lee

Main category: eess.AS

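TL;DR: The paper presents the ToxicTone dataset and a multimodal detection framework, filling a gap in toxicity detection for spoken Mandarin and demonstrating the importance of speech features.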

Details

Motivation: Existing research focuses mainly on toxicity detection in text, while spoken Mandarin lacks annotated datasets and dedicated methods. Method: Introduces the ToxicTone dataset and proposes a multimodal detection framework that combines acoustic, linguistic, and emotional features. Result: Experiments show that the approach outperforms text-only and baseline models, confirming the key role of speech features. Conclusion: The work fills a gap in toxicity detection for spoken Mandarin and underscores the importance of multimodal methods. Abstract: Despite extensive research on toxic speech detection in text, a critical gap remains in handling spoken Mandarin audio. The lack of annotated datasets that capture the unique prosodic cues and culturally specific expressions in Mandarin leaves spoken toxicity underexplored. To address this, we introduce ToxicTone -- the largest public dataset of its kind -- featuring detailed annotations that distinguish both forms of toxicity (e.g., profanity, bullying) and sources of toxicity (e.g., anger, sarcasm, dismissiveness). Our data, sourced from diverse real-world audio and organized into 13 topical categories, mirrors authentic communication scenarios. We also propose a multimodal detection framework that integrates acoustic, linguistic, and emotional features using state-of-the-art speech and emotion encoders. Extensive experiments show our approach outperforms text-only and baseline models, underscoring the essential role of speech-specific cues in revealing hidden toxic expressions.

cs.CY [Back]

[312] A Participatory Strategy for AI Ethics in Education and Rehabilitation grounded in the Capability Approach

Valeria Cesaroni,Eleonora Pasqua,Piercosma Bisconti,Martina Galletti

Main category: cs.CY

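TL;DR: AI technologies have the potential to enhance inclusive education and clinical rehabilitation for children with special educational needs, but they require a systemic-ecological perspective and ethical consideration. The study proposes a participatory strategy grounded in the Capability Approach and uses the ARTIS Project as a case study to show how ethics, education, and technology design can be integrated.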

Details

Motivation: Explore how AI can support the learning and rehabilitation of children with special educational needs while addressing ethical and technical challenges. Method: Adopts a participatory research strategy that brings together ethical, educational, clinical, and technological expertise to develop AI technologies through focus groups and collaborative design sessions. Result: Proposes a framework grounded in the Capability Approach that emphasizes functionality and technological adequacy, illustrated through the ARTIS Project. Conclusion: Integrating ethics with technology can help AI in education bridge the gap between innovation and responsibility. Abstract: AI-based technologies have significant potential to enhance inclusive education and clinical-rehabilitative contexts for children with Special Educational Needs and Disabilities. AI can enhance learning experiences, empower students, and support both teachers and rehabilitators. However, their usage presents challenges that require a systemic-ecological vision, ethical considerations, and participatory research. Therefore, research and technological development must be rooted in a strong ethical-theoretical framework. The Capability Approach - a theoretical model of disability, human vulnerability, and inclusion - offers a more relevant perspective on functionality, effectiveness, and technological adequacy in inclusive learning environments. In this paper, we propose a participatory research strategy with different stakeholders through a case study on the ARTIS Project, which develops an AI-enriched interface to support children with text comprehension difficulties. Our research strategy integrates ethical, educational, clinical, and technological expertise in designing and implementing AI-based technologies for children's learning environments through focus groups and collaborative design sessions. We believe that this holistic approach to AI adoption in education can help bridge the gap between technological innovation and ethical responsibility.

q-bio.QM [Back]

[313] Predicting Neo-Adjuvant Chemotherapy Response in Triple-Negative Breast Cancer Using Pre-Treatment Histopathologic Images

Hikmat Khan,Ziyu Su,Huina Zhang,Yihong Wang,Bohan Ning,Shi Wei,Hua Guo,Zaibo Li,Muhammad Khalid Khan Niazi

Main category: q-bio.QM

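TL;DR: Develops a deep learning model that predicts the response of TNBC patients to neoadjuvant chemotherapy from H&E-stained biopsy images; the model performs well and its attention correlates with immune biomarkers.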

Details

Motivation: TNBC has limited targeted treatment options, so accurately predicting NACT response is crucial for optimizing therapy and improving patient outcomes. Method: Uses a deep learning model to analyze H&E-stained biopsy images, and uses mIHC data to check how the regions the model attends to relate to immune biomarkers. Result: The model performs well in five-fold cross-validation (accuracy 82%, AUC 0.86), and regions of high predictive importance coincide with biomarkers such as PD-L1 expression and CD8+ T-cell infiltration. Conclusion: Incorporating IHC-derived immune profiling data could improve model interpretability and predictive performance, pointing to a new direction for personalized TNBC treatment. Abstract: Triple-negative breast cancer (TNBC) is an aggressive subtype defined by the lack of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) expression, resulting in limited targeted treatment options. Neoadjuvant chemotherapy (NACT) is the standard treatment for early-stage TNBC, with pathologic complete response (pCR) serving as a key prognostic marker; however, only 40-50% of patients with TNBC achieve pCR. Accurate prediction of NACT response is crucial to optimize therapy, avoid ineffective treatments, and improve patient outcomes. In this study, we developed a deep learning model to predict NACT response using pre-treatment hematoxylin and eosin (H&E)-stained biopsy images. Our model achieved promising results in five-fold cross-validation (accuracy: 82%, AUC: 0.86, F1-score: 0.84, sensitivity: 0.85, specificity: 0.81, precision: 0.80). Analysis of model attention maps in conjunction with multiplexed immunohistochemistry (mIHC) data revealed that regions of high predictive importance consistently colocalized with tumor areas showing elevated PD-L1 expression, CD8+ T-cell infiltration, and CD163+ macrophage density - all established biomarkers of treatment response. Our findings indicate that incorporating IHC-derived immune profiling data could substantially improve model interpretability and predictive performance. Furthermore, this approach may accelerate the discovery of novel histopathological biomarkers for NACT and advance the development of personalized treatment strategies for TNBC patients.
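
For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity, specificity, precision, and F1-score follow from the four cells of a binary confusion matrix. The counts are invented for illustration and are not taken from the paper; the function name is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    def ratio(num, den):
        # Return 0.0 instead of dividing by zero for empty denominators.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # recall on responders
        "specificity": ratio(tn, tn + fp),   # recall on non-responders
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up counts: 50 responders (40 found), 50 non-responders (30 found).
metrics = binary_metrics(tp=40, fp=20, tn=30, fn=10)
```

In cross-validation these quantities are usually computed per fold and then averaged, so averaged values need not satisfy the single-matrix identities exactly.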

cs.HC [Back]

[314] AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals

Stefan Pasch

Main category: cs.HC

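TL;DR: The study finds that model-based evaluators (LLM-as-a-Judge) rate ethical refusals significantly more favorably than human users do, with no such gap for technical refusals, revealing a "moderation bias" in automated evaluation.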

Details

Motivation: Examine whether model-based evaluators (LLM-as-a-Judge) rate ethical and technical refusals differently from human users. Method: Uses Chatbot Arena data and judgments from two AI judges (GPT-4o and Llama 3 70B) to compare how ethical and technical refusals are rated. Result: Model-based evaluators rate ethical refusals significantly more favorably than human users, with no significant difference for technical refusals, which indicates a "moderation bias". Conclusion: The rating bias on ethical refusals raises questions about the transparency, value alignment, and normative assumptions of automated evaluation systems. Abstract: As large language models (LLMs) are increasingly deployed in high-stakes settings, their ability to refuse ethically sensitive prompts, such as those involving hate speech or illegal activities, has become central to content moderation and responsible AI practices. While refusal responses can be viewed as evidence of ethical alignment and safety-conscious behavior, recent research suggests that users may perceive them negatively. At the same time, automated assessments of model outputs are playing a growing role in both evaluation and training. In particular, LLM-as-a-Judge frameworks, in which one model is used to evaluate the output of another, are now widely adopted to guide benchmarking and fine-tuning. This paper examines whether such model-based evaluators assess refusal responses differently than human users. Drawing on data from Chatbot Arena and judgments from two AI judges (GPT-4o and Llama 3 70B), we compare how different types of refusals are rated. We distinguish ethical refusals, which explicitly cite safety or normative concerns (e.g., "I can't help with that because it may be harmful"), and technical refusals, which reflect system limitations (e.g., "I can't answer because I lack real-time data"). We find that LLM-as-a-Judge systems evaluate ethical refusals significantly more favorably than human users, a divergence not observed for technical refusals. We refer to this divergence as a moderation bias: a systematic tendency for model-based evaluators to reward refusal behaviors more than human users do. This raises broader questions about transparency, value alignment, and the normative assumptions embedded in automated evaluation systems.
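
The comparison behind the moderation bias can be sketched as a difference in win rates. The code below takes pairwise comparisons in which one response is a refusal, and computes how often the refusal is preferred by the human voter versus by the AI judge, separately for each refusal type. The record layout, field names, and toy data are hypothetical, ties are ignored for simplicity, and the paper's actual statistical analysis is not reproduced here.

```python
def win_rate(outcomes):
    """Share of comparisons in which the refusal was preferred."""
    return sum(outcomes) / len(outcomes)


def judge_human_gap(records, refusal_type):
    """Judge win rate minus human win rate for one refusal type.

    A positive gap means the AI judge favors refusals of this type more
    often than human users do.
    """
    subset = [r for r in records if r["type"] == refusal_type]
    if not subset:
        raise ValueError(f"no records of type {refusal_type!r}")
    human = win_rate([r["human_prefers_refusal"] for r in subset])
    judge = win_rate([r["judge_prefers_refusal"] for r in subset])
    return judge - human


# Toy records, invented for illustration only.
records = [
    {"type": "ethical", "human_prefers_refusal": False, "judge_prefers_refusal": True},
    {"type": "ethical", "human_prefers_refusal": False, "judge_prefers_refusal": True},
    {"type": "ethical", "human_prefers_refusal": True, "judge_prefers_refusal": True},
    {"type": "ethical", "human_prefers_refusal": False, "judge_prefers_refusal": False},
    {"type": "technical", "human_prefers_refusal": False, "judge_prefers_refusal": False},
    {"type": "technical", "human_prefers_refusal": True, "judge_prefers_refusal": False},
    {"type": "technical", "human_prefers_refusal": False, "judge_prefers_refusal": True},
    {"type": "technical", "human_prefers_refusal": False, "judge_prefers_refusal": False},
]

ethical_gap = judge_human_gap(records, "ethical")
technical_gap = judge_human_gap(records, "technical")
```

A clearly positive gap for ethical refusals combined with a gap near zero for technical refusals is the pattern the abstract calls moderation bias.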

Table of Contents

cs.CV [Back]

[1] Intentional Gesture: Deliver Your Intentions with Gestures for Speech

[2] Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs

[3] EVA: Expressive Virtual Avatars from Multi-view Videos

[4] Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation

[5] PlantDreamer: Achieving Realistic 3D Plant Models with Diffusion-Guided Gaussian Splatting

[6] CrypticBio: A Large Multimodal Dataset for Visually Confusing Biodiversity

[7] DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

[8] FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge

[9] KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection

[10] Enhancing Shape Perception and Segmentation Consistency for Industrial Image Inspection

[11] MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

[12] MORALISE: A Structured Benchmark for Moral Alignment in Visual Language Models

[13] Uncovering Cultural Representation Disparities in Vision-Language Models

[14] Leveraging Generative AI Models to Explore Human Identity

[15] Open-Set Semi-Supervised Learning for Long-Tailed Medical Datasets

[16] Colors Matter: AI-Driven Exploration of Human Feature Colors

[17] Programmatic Video Prediction Using Large Language Models

[18] MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks

[19] Data Augmentation and Resolution Enhancement using GANs and Diffusion Models for Tree Segmentation

[20] iPad: Iterative Proposal-centric End-to-End Autonomous Driving

[21] Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding

[22] DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer

[23] Multispectral Detection Transformer with Infrared-Centric Sensor Fusion

[24] Unified Cross-Modal Attention-Mixer Based Structural-Functional Connectomics Fusion for Neuropsychiatric Disorder Diagnosis

[25] CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation

[26] From Pixels to Images: Deep Learning Advances in Remote Sensing Image Semantic Segmentation

[27] ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving

[28] Lossless Token Merging Even Without Fine-Tuning in Vision Transformers

[29] Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation

[30] AvatarShield: Visual Reinforcement Learning for Human-Centric Video Forgery Detection

[31] Exploring Generalized Gait Recognition: Reducing Redundancy and Noise within Indoor and Outdoor Datasets

[32] AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection

[33] MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models

[34] Geometrically Regularized Transfer Learning with On-Manifold and Off-Manifold Perturbation

[35] Leveraging Foundation Models for Multimodal Graph-Based Action Recognition

[36] GAMA: Geometry-Aware Manifold Alignment via Structured Adversarial Perturbations for Robust Domain Adaptation

[37] Flashback: Memory-Driven Zero-shot, Real-time Video Anomaly Detection

[38] GT^2-GS: Geometry-aware Texture Transfer for Gaussian Splatting

[39] Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection

[40] Continuous Representation Methods, Theories, and Applications: An Overview and Perspectives

[41] DC-Scene: Data-Centric Learning for 3D Scene Understanding

[42] CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation

[43] GAMA++: Disentangled Geometric Alignment with Adaptive Contrastive Perturbation for Reliable Domain Transfer

[44] VET-DINO: Learning Anatomical Understanding Through Multi-View Distillation in Veterinary Imaging

[45] Zero-Shot Gaze-based Volumetric Medical Image Segmentation

[46] gen2seg: Generative Models Enable Generalizable Instance Segmentation

[47] Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs

[48] Contrastive Learning-Enhanced Trajectory Matching for Small-Scale Dataset Distillation

[49] LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

[50] DiffProb: Data Pruning for Face Recognition

[51] GS2E: Gaussian Splatting is an Effective Data Generator for Event Stream Generation

[52] R3GS: Gaussian Splatting for Robust Reconstruction and Relocalization in Unconstrained Image Collections

[53] BadSR: Stealthy Label Backdoor Attacks on Image Super-Resolution

[54] FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion

[55] CEBSNet: Change-Excited and Background-Suppressed Network with Temporal Dependency Modeling for Bitemporal Change Detection

[56] SoftHGNN: Soft Hypergraph Neural Networks for General Visual Recognition

[57] Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models

[58] Parameter-Efficient Fine-Tuning of Multispectral Foundation Models for Hyperspectral Image Classification

[59] My Face Is Mine, Not Yours: Facial Protection Against Diffusion Model Face Swapping

[60] Objective Bicycle Occlusion Level Classification using a Deformable Parts-Based Model

[61] Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

[62] RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation

[63] The P$^3$ dataset: Pixels, Points and Polygons for Multimodal Building Vectorization

[64] Expanding Zero-Shot Object Counting with Rich Prompts

[65] Visual Question Answering on Multiple Remote Sensing Image Modalities

[66] Mouse Lockbox Dataset: Behavior Recognition for Mice Solving Lockboxes

[67] Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks

[68] On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable?

[69] TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models

[70] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL

[71] Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation

[72] FRN: Fractal-Based Recursive Spectral Reconstruction Network

[73] Stronger ViTs With Octic Equivariance

[74] ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning

[75] Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models

[76] Pura: An Efficient Privacy-Preserving Solution for Face Recognition

[77] Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models

[78] Spectral-Aware Global Fusion for RGB-Thermal Semantic Segmentation