cs.CV [Total: 177]
cs.GR [Total: 11]
cs.CL [Total: 235]
cs.IR [Total: 7]
cs.LG [Total: 28]
cs.AR [Total: 1]
cs.SE [Total: 1]
cs.SD [Total: 4]
cs.MM [Total: 1]
cs.CR [Total: 3]
physics.soc-ph [Total: 1]
q-bio.BM [Total: 1]
eess.IV [Total: 4]
cs.AI [Total: 22]
cs.CY [Total: 9]
cs.HC [Total: 1]
eess.AS [Total: 3]
cs.RO [Total: 11]
astro-ph.IM [Total: 1]

cs.CV [Back]

[1] EgoVIS@CVPR: What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

Chi-Hsi Kung,Frangil Ramirez,Juhyung Ha,Yi-Ting Chen,David Crandall,Yi-Hsuan Tsai

Main category: cs.CV

TL;DR: 论文提出了一种通过学习状态变化描述及其反事实推理的方法，提升视频编码器对程序性活动的理解能力。

Details

Motivation: 现有方法未能明确学习场景状态变化，限制了模型对程序性活动的理解。 Method: 利用LLM生成的状态变化描述作为监督信号，并生成反事实推理场景以模拟失败结果。 Result: 在时间动作分割和错误检测等任务中表现显著提升。 Conclusion: 状态变化描述及反事实推理能有效增强模型对程序性活动的理解。 Abstract: Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Yet, existing work on procedure-aware video representations fails to explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by LLMs as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining the unseen ``What if'' scenarios. This counterfactual reasoning facilitates the model's ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation, error detection, and more. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals, and achieve significant improvements on multiple tasks.

[2] Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

Gen Luo,Ganlin Yang,Ziyang Gong,Guanzhou Chen,Haonan Duan,Erfei Cui,Ronglei Tong,Zhi Hou,Tianyi Zhang,Zhe Chen,Shenglong Ye,Lewei Lu,Jingbo Wang,Wenhai Wang,Jifeng Dai,Yu Qiao,Rongrong Ji,Xizhou Zhu

Main category: cs.CV

TL;DR: VeBrain是一个统一的多模态大语言模型框架，用于机器人的感知、推理和控制，通过将机器人控制任务转化为2D视觉空间的文本任务，并引入机器人适配器，显著提升了性能。

Details

Motivation: 现有方法难以统一多模态理解、视觉空间推理和物理交互能力，VeBrain旨在解决这一问题。 Method: VeBrain将机器人控制任务转化为2D视觉空间的文本任务，并设计机器人适配器将文本信号转换为运动策略。同时，引入高质量数据集VeBrain-600k。 Result: 在13个多模态基准和5个空间智能基准上表现优异，相比Qwen2.5-VL在MMVet上提升5.6%，在腿式机器人任务中平均提升50%。 Conclusion: VeBrain展示了强大的适应性、灵活性和组合能力，为机器人控制提供了统一且高效的解决方案。 Abstract: The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing attention to extend them to physical entities like legged robot. This typically requires MLLMs to not only grasp multimodal understanding abilities, but also integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless,existing methods struggle to unify these capabilities due to their fundamental differences.In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. Then, a novel robotic adapter is proposed to convert textual control signals from MLLMs to motion policies of real robots. From the data perspective, we further introduce VeBrain-600k, a high-quality instruction dataset encompassing various capabilities of VeBrain. In VeBrain-600k, we take hundreds of hours to collect, curate and annotate the data, and adopt multimodal chain-of-thought(CoT) to mix the different capabilities into a single conversation. Extensive experiments on 13 multimodal benchmarks and 5 spatial intelligence benchmarks demonstrate the superior performance of VeBrain to existing MLLMs like Qwen2.5-VL. When deployed to legged robots and robotic arms, VeBrain shows strong adaptability, flexibility, and compositional capabilities compared to existing methods. For example, compared to Qwen2.5-VL, VeBrain not only achieves substantial gains on MMVet by +5.6%, but also excels in legged robot tasks with +50% average gains.

[3] Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation

Edward Fish,Richard Bowden

Main category: cs.CV

TL;DR: 论文提出Geo-Sign方法，利用双曲几何改进手语翻译中的骨骼表示，通过双曲投影层和对比损失增强特征区分度。

Details

Motivation: 探索双曲几何在手语翻译中的应用，以改进骨骼表示的几何特性，尤其是针对精细动作（如手指关节）。 Method: 提出Geo-Sign方法，包括双曲投影层、加权Fr\'echet均值聚合和双曲空间对比损失，集成到端到端翻译框架中。 Result: 实验表明，双曲几何能显著提升骨骼表示效果，优于现有RGB方法，同时保护隐私并提高计算效率。 Conclusion: 双曲几何在手语翻译中具有潜力，能有效改进骨骼表示，为未来研究提供新方向。 Abstract: Recent progress in Sign Language Translation (SLT) has focussed primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincar\'e ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fr\'echet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, improving on SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: https://github.com/ed-fish/geo-sign.

[4] Detection of Endangered Deer Species Using UAV Imagery: A Comparative Study Between Efficient Deep Learning Approaches

Agustín Roca,Gastón Castro,Gabriel Torre,Leonardo J. Colombo,Ignacio Mas,Javier Pereira,Juan I. Giribet

Main category: cs.CV

TL;DR: 比较YOLOv11和RT-DETR模型在无人机图像中检测被植被遮挡且占图像比例极小的沼泽鹿的性能，通过添加精确分割掩码提升YOLO模型检测效果。

Details

Motivation: 提升无人机图像中野生动物检测的准确性，尤其是在目标占比小且被遮挡的场景下。 Method: 扩展先前分析，为数据集添加精确分割掩码，训练带分割头的YOLO模型。 Result: 加入分割头显著提高了检测性能。 Conclusion: 为无人机野生动物监测提供了可扩展且高精度的AI检测方案。 Abstract: This study compares the performance of state-of-the-art neural networks including variants of the YOLOv11 and RT-DETR models for detecting marsh deer in UAV imagery, in scenarios where specimens occupy a very small portion of the image and are occluded by vegetation. We extend previous analysis adding precise segmentation masks for our datasets enabling a fine-grained training of a YOLO model with a segmentation head included. Experimental results show the effectiveness of incorporating the segmentation head achieving superior detection performance. This work contributes valuable insights for improving UAV-based wildlife monitoring and conservation strategies through scalable and accurate AI-driven detection systems.

[5] Efficient Endangered Deer Species Monitoring with UAV Aerial Imagery and Deep Learning

Agustín Roca,Gabriel Torre,Juan I. Giribet,Gastón Castro,Leonardo Colombo,Ignacio Mas,Javier Pereira

Main category: cs.CV

TL;DR: 利用无人机和深度学习技术高效检测濒危鹿种，替代传统人工方法，提升野生动物监测效率。

Details

Motivation: 传统濒危鹿种识别方法成本高且耗时，需更高效解决方案。 Method: 基于YOLO框架开发算法，利用无人机高分辨率图像训练，应用于阿根廷两个项目。 Result: 算法能高精度识别沼泽鹿，对潘帕斯鹿的适用性初步验证但有限制。 Conclusion: AI与无人机技术结合可显著提升野生动物监测与管理潜力。 Abstract: This paper examines the use of Unmanned Aerial Vehicles (UAVs) and deep learning for detecting endangered deer species in their natural habitats. As traditional identification processes require trained manual labor that can be costly in resources and time, there is a need for more efficient solutions. Leveraging high-resolution aerial imagery, advanced computer vision techniques are applied to automate the identification process of deer across two distinct projects in Buenos Aires, Argentina. The first project, Pantano Project, involves the marsh deer in the Paran\'a Delta, while the second, WiMoBo, focuses on the Pampas deer in Campos del Tuy\'u National Park. A tailored algorithm was developed using the YOLO framework, trained on extensive datasets compiled from UAV-captured images. The findings demonstrate that the algorithm effectively identifies marsh deer with a high degree of accuracy and provides initial insights into its applicability to Pampas deer, albeit with noted limitations. This study not only supports ongoing conservation efforts but also highlights the potential of integrating AI with UAV technology to enhance wildlife monitoring and management practices.

[6] FastCAR: Fast Classification And Regression for Task Consolidation in Multi-Task Learning to Model a Continuous Property Variable of Detected Object Class

Anoop Kini,Andreas Jansche,Timo Bernthaler,Gerhard Schneider

Main category: cs.CV

TL;DR: FastCAR是一种新颖的多任务学习（MTL）任务整合方法，用于分类和回归任务，解决了任务异质性和微弱相关性的问题。通过标签转换方法，仅需单任务回归网络架构即可实现高性能。

Details

Motivation: 解决分类和回归任务在异质性且相关性微弱情况下的整合问题，适用于科学和工程中的关键用例。 Method: 采用标签转换方法，结合单任务回归网络架构，实现分类和回归任务的联合学习。 Result: 分类准确率达99.54%，回归平均绝对百分比误差为2.4%，训练效率提升2.52倍，推理延迟降低55%。 Conclusion: FastCAR在性能和效率上优于传统MTL方法，适用于分类和回归任务的联合学习。 Abstract: FastCAR is a novel task consolidation approach in Multi-Task Learning (MTL) for a classification and a regression task, despite the non-triviality of task heterogeneity with only a subtle correlation. The approach addresses the classification of a detected object (occupying the entire image frame) and regression for modeling a continuous property variable (for instances of an object class), a crucial use case in science and engineering. FastCAR involves a label transformation approach that is amenable for use with only a single-task regression network architecture. FastCAR outperforms traditional MTL model families, parametrized in the landscape of architecture and loss weighting schemes, when learning both tasks are collectively considered (classification accuracy of 99.54%, regression mean absolute percentage error of 2.4%). The experiments performed used "Advanced Steel Property Dataset" contributed by us https://github.com/fastcandr/AdvancedSteel-Property-Dataset. The dataset comprises 4536 images of 224x224 pixels, annotated with discrete object classes and its hardness property that can take continuous values. Our proposed FastCAR approach for task consolidation achieves training time efficiency (2.52x quicker) and reduced inference latency (55% faster) than benchmark MTL networks.

[7] Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

Anthony Gosselin,Ge Ya Luo,Luis Lara,Florian Golemo,Derek Nowrouzezahrai,Liam Paull,Alexia Jolicoeur-Martineau,Christopher Pal

Main category: cs.CV

TL;DR: Ctrl-Crash是一种可控的汽车碰撞视频生成模型，通过输入边界框、碰撞类型和初始帧等信号，生成高质量且可控的碰撞场景。

Details

Motivation: 现有视频扩散技术在生成真实汽车碰撞场景时表现不佳，主要由于驾驶数据集中事故事件稀缺，而改进交通安全需要真实可控的事故模拟。 Method: 提出Ctrl-Crash模型，利用边界框、碰撞类型和初始帧等信号进行条件控制，并采用分类器无关引导实现细粒度控制。 Result: 在定量视频质量指标（如FVD和JEDi）和人类评估的物理真实性与视频质量方面，Ctrl-Crash表现优于现有扩散方法。 Conclusion: Ctrl-Crash通过可控条件信号生成高质量碰撞视频，为交通安全研究提供了有效的模拟工具。 Abstract: Video diffusion techniques have advanced significantly in recent years; however, they struggle to generate realistic imagery of car crashes due to the scarcity of accident events in most driving datasets. Improving traffic safety requires realistic and controllable accident simulations. To tackle the problem, we propose Ctrl-Crash, a controllable car crash video generation model that conditions on signals such as bounding boxes, crash types, and an initial image frame. Our approach enables counterfactual scenario generation where minor variations in input can lead to dramatically different crash outcomes. To support fine-grained control at inference time, we leverage classifier-free guidance with independently tunable scales for each conditioning signal. Ctrl-Crash achieves state-of-the-art performance across quantitative video quality metrics (e.g., FVD and JEDi) and qualitative measurements based on a human-evaluation of physical realism and video quality compared to prior diffusion-based methods.

[8] ZeShot-VQA: Zero-Shot Visual Question Answering Framework with Answer Mapping for Natural Disaster Damage Assessment

Ehsan Karimi,Maryam Rahnemoonfar

Main category: cs.CV

TL;DR: 提出了一种基于视觉语言模型（VLM）的零样本视觉问答（ZeShot-VQA）方法，用于自然灾害后的高效响应，无需微调即可处理新数据集和未见过的答案。

Details

Motivation: 自然灾害影响广泛，传统VQA模型需微调才能处理新问题，限制了实时响应能力。 Method: 利用大规模视觉语言模型（VLMs）的零样本学习能力，提出ZeShot-VQA方法，并在FloodNet数据集上验证。 Result: ZeShot-VQA无需微调即可处理新数据集和生成未见过的答案，展现了灵活性。 Conclusion: ZeShot-VQA为灾害响应提供了高效、灵活的数据驱动解决方案。 Abstract: Natural disasters usually affect vast areas and devastate infrastructures. Performing a timely and efficient response is crucial to minimize the impact on affected communities, and data-driven approaches are the best choice. Visual question answering (VQA) models help management teams to achieve in-depth understanding of damages. However, recently published models do not possess the ability to answer open-ended questions and only select the best answer among a predefined list of answers. If we want to ask questions with new additional possible answers that do not exist in the predefined list, the model needs to be fin-tuned/retrained on a new collected and annotated dataset, which is a time-consuming procedure. In recent years, large-scale Vision-Language Models (VLMs) have earned significant attention. These models are trained on extensive datasets and demonstrate strong performance on both unimodal and multimodal vision/language downstream tasks, often without the need for fine-tuning. In this paper, we propose a VLM-based zero-shot VQA (ZeShot-VQA) method, and investigate the performance of on post-disaster FloodNet dataset. Since the proposed method takes advantage of zero-shot learning, it can be applied on new datasets without fine-tuning. In addition, ZeShot-VQA is able to process and generate answers that has been not seen during the training procedure, which demonstrates its flexibility.

[9] Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning

Sara Ghazanfari,Francesco Croce,Nicolas Flammarion,Prashanth Krishnamurthy,Farshad Khorrami,Siddharth Garg

Main category: cs.CV

TL;DR: 论文提出了一种基于视频帧的链式推理方法（CoF），通过生成与关键帧相关的推理步骤，显著提升了视频理解任务的性能。

Details

Motivation: 现有方法在视频多模态大语言模型（LLMs）中生成推理步骤时，未能明确关联到具体视频帧，导致性能受限。 Method: 首先创建了一个包含多样化问题和帧相关推理步骤的数据集（CoF-Data），然后基于此微调现有视频LLMs。 Result: CoF方法在多个视频理解基准测试中表现优异，超越了现有领先模型，并显著降低了幻觉率。 Conclusion: 通过显式关联推理步骤与视频帧，CoF方法简单有效，提升了视频LLMs的性能和可靠性。 Abstract: Recent work has shown that eliciting Large Language Models (LLMs) to generate reasoning traces in natural language before answering the user's request can significantly improve their performance across tasks. This approach has been extended to multimodal LLMs, where the models can produce chain-of-thoughts (CoT) about the content of input images and videos. In this work, we propose to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant video frames. For this, we first create CoF-Data, a large dataset of diverse questions, answers, and corresponding frame-grounded reasoning traces about both natural and synthetic videos, spanning various topics and tasks. Then, we fine-tune existing video LLMs on this chain-of-frames (CoF) data. Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks to select or caption relevant frames. We show that our models based on CoF are able to generate chain-of-thoughts that accurately refer to the key frames to answer the given question. This, in turn, leads to improved performance across multiple video understanding benchmarks, for example, surpassing leading video LLMs on Video-MME, MVBench, and VSI-Bench, and notably reducing the hallucination rate. Code available at https://github.com/SaraGhazanfari/CoF}{github.com/SaraGhazanfari/CoF.

[10] Improving Optical Flow and Stereo Depth Estimation by Leveraging Uncertainty-Based Learning Difficulties

Jisoo Jeong,Hong Cai,Jamie Menjay Lin,Fatih Porikli

Main category: cs.CV

TL;DR: 论文提出了一种基于不确定性的置信度图方法，通过DB和OA损失函数解决像素和区域学习难度不均的问题，显著提升了光流和立体深度任务的性能。

Details

Motivation: 传统光流和立体深度模型的训练采用统一的损失函数，忽视了像素和区域间学习难度的差异。 Method: 提出Difficulty Balancing (DB)损失和Occlusion Avoiding (OA)损失，分别针对学习难度不均和遮挡问题。 Result: 实验表明，DB和OA损失的组合在光流和立体深度任务中显著提升了性能。 Conclusion: 通过结合DB和OA损失，有效解决了训练中像素和区域学习难度不均的问题。 Abstract: Conventional training for optical flow and stereo depth models typically employs a uniform loss function across all pixels. However, this one-size-fits-all approach often overlooks the significant variations in learning difficulty among individual pixels and contextual regions. This paper investigates the uncertainty-based confidence maps which capture these spatially varying learning difficulties and introduces tailored solutions to address them. We first present the Difficulty Balancing (DB) loss, which utilizes an error-based confidence measure to encourage the network to focus more on challenging pixels and regions. Moreover, we identify that some difficult pixels and regions are affected by occlusions, resulting from the inherently ill-posed matching problem in the absence of real correspondences. To address this, we propose the Occlusion Avoiding (OA) loss, designed to guide the network into cycle consistency-based confident regions, where feature matching is more reliable. By combining the DB and OA losses, we effectively manage various types of challenging pixels and regions during training. Experiments on both optical flow and stereo depth tasks consistently demonstrate significant performance improvements when applying our proposed combination of the DB and OA losses.

[11] Towards Effective and Efficient Adversarial Defense with Diffusion Models for Robust Visual Tracking

Long Xu,Peng Gao,Wen-Jia Tang,Fei Wang,Ru-Yue Yuan

Main category: cs.CV

TL;DR: 本文提出了一种基于去噪扩散概率模型（DiffDf）的新对抗防御方法，旨在提升视觉跟踪方法对对抗攻击的鲁棒性。

Details

Motivation: 尽管深度学习视觉跟踪方法已取得显著进展，但其在面对精心设计的对抗攻击时表现脆弱，导致跟踪性能急剧下降。 Method: DiffDf通过结合像素级重建损失、语义一致性损失和结构相似性损失，建立多尺度防御机制，并通过逐步去噪过程有效抑制对抗扰动。 Result: 在多个主流数据集上的实验表明，DiffDf对不同架构的跟踪器均表现出优异的泛化性能，显著提升各项评价指标，并实现超过30 FPS的实时推理速度。 Conclusion: DiffDf展示了出色的防御性能和效率，代码已开源。 Abstract: Although deep learning-based visual tracking methods have made significant progress, they exhibit vulnerabilities when facing carefully designed adversarial attacks, which can lead to a sharp decline in tracking performance. To address this issue, this paper proposes for the first time a novel adversarial defense method based on denoise diffusion probabilistic models, termed DiffDf, aimed at effectively improving the robustness of existing visual tracking methods against adversarial attacks. DiffDf establishes a multi-scale defense mechanism by combining pixel-level reconstruction loss, semantic consistency loss, and structural similarity loss, effectively suppressing adversarial perturbations through a gradual denoising process. Extensive experimental results on several mainstream datasets show that the DiffDf method demonstrates excellent generalization performance for trackers with different architectures, significantly improving various evaluation metrics while achieving real-time inference speeds of over 30 FPS, showcasing outstanding defense performance and efficiency. Codes are available at https://github.com/pgao-lab/DiffDf.

[12] Latent Guidance in Diffusion Models for Perceptual Evaluations

Shreshth Saini,Ru-Ling Liao,Yan Ye,Alan C. Bovik

Main category: cs.CV

TL;DR: 论文提出了一种名为Perceptual Manifold Guidance (PMG)的算法，利用预训练的潜在扩散模型和感知质量特征，在NR-IQA任务中实现感知一致性。

Details

Motivation: 尽管潜在扩散模型在高维图像生成和下游任务中取得进展，但其在NR-IQA任务中的感知一致性尚未充分探索。 Method: 提出PMG算法，利用预训练扩散模型和感知特征，从去噪U-Net中提取多尺度、多时间步的感知一致性特征图。 Result: 实验表明，这些特征与人类感知高度相关，并在IQA数据集上达到SOTA性能。 Conclusion: PMG首次将感知特征引入扩散模型，展示了其在NR-IQA任务中的优越泛化能力。 Abstract: Despite recent advancements in latent diffusion models that generate high-dimensional image data and perform various downstream tasks, there has been little exploration into perceptual consistency within these models on the task of No-Reference Image Quality Assessment (NR-IQA). In this paper, we hypothesize that latent diffusion models implicitly exhibit perceptually consistent local regions within the data manifold. We leverage this insight to guide on-manifold sampling using perceptual features and input measurements. Specifically, we propose Perceptual Manifold Guidance (PMG), an algorithm that utilizes pretrained latent diffusion models and perceptual quality features to obtain perceptually consistent multi-scale and multi-timestep feature maps from the denoising U-Net. We empirically demonstrate that these hyperfeatures exhibit high correlation with human perception in IQA tasks. Our method can be applied to any existing pretrained latent diffusion model and is straightforward to integrate. To the best of our knowledge, this paper is the first work on guiding diffusion model with perceptual features for NR-IQA. Extensive experiments on IQA datasets show that our method, LGDM, achieves state-of-the-art performance, underscoring the superior generalization capabilities of diffusion models for NR-IQA tasks.

[13] Test-time Vocabulary Adaptation for Language-driven Object Detection

Mingxuan Liu,Tyler L. Hayes,Massimiliano Mancini,Elisa Ricci,Riccardo Volpi,Gabriela Csurka

Main category: cs.CV

TL;DR: 提出一种无需训练的Vocabulary Adapter（VocAda），通过图像描述和名词解析自动优化用户定义的词汇，提升开放词汇目标检测性能。

Details

Motivation: 开放词汇目标检测中，用户定义的词汇可能过于宽泛或错误，影响检测性能。 Method: VocAda在推理时通过图像描述、名词解析和词汇选择三步优化词汇。 Result: 在COCO和Objects365数据集上，VocAda显著提升了三种先进检测器的性能。 Conclusion: VocAda是一种通用且高效的词汇优化方法，代码已开源。 Abstract: Open-vocabulary object detection models allow users to freely specify a class vocabulary in natural language at test time, guiding the detection of desired objects. However, vocabularies can be overly broad or even mis-specified, hampering the overall performance of the detector. In this work, we propose a plug-and-play Vocabulary Adapter (VocAda) to refine the user-defined vocabulary, automatically tailoring it to categories that are relevant for a given image. VocAda does not require any training, it operates at inference time in three steps: i) it uses an image captionner to describe visible objects, ii) it parses nouns from those captions, and iii) it selects relevant classes from the user-defined vocabulary, discarding irrelevant ones. Experiments on COCO and Objects365 with three state-of-the-art detectors show that VocAda consistently improves performance, proving its versatility. The code is open source.

Ngoc Tuyen Do,Tri Nhu Do

Main category: cs.CV

TL;DR: 提出了一种基于特征融合和知识蒸馏的多模态多目标检测框架，通过数据融合提高准确性，并利用知识蒸馏优化域适应能力。

Details

Motivation: 解决多目标检测中异构数据输入和计算复杂度高的问题，特别是在资源受限的嵌入式设备上。 Method: 结合RGB和热成像输入，采用融合模型和知识蒸馏训练流程，通过多阶段训练和复合损失函数优化后验概率。 Result: 学生模型达到教师模型95%的平均精度，同时推理时间减少50%。 Conclusion: 该框架适用于实际多目标检测部署场景，平衡了精度和效率。 Abstract: In the surveillance and defense domain, multi-target detection and classification (MTD) is considered essential yet challenging due to heterogeneous inputs from diverse data sources and the computational complexity of algorithms designed for resource-constrained embedded devices, particularly for Al-based solutions. To address these challenges, we propose a feature fusion and knowledge-distilled framework for multi-modal MTD that leverages data fusion to enhance accuracy and employs knowledge distillation for improved domain adaptation. Specifically, our approach utilizes both RGB and thermal image inputs within a novel fusion-based multi-modal model, coupled with a distillation training pipeline. We formulate the problem as a posterior probability optimization task, which is solved through a multi-stage training pipeline supported by a composite loss function. This loss function effectively transfers knowledge from a teacher model to a student model. Experimental results demonstrate that our student model achieves approximately 95% of the teacher model's mean Average Precision while reducing inference time by approximately 50%, underscoring its suitability for practical MTD deployment scenarios.

[15] Sequence-Based Identification of First-Person Camera Wearers in Third-Person Views

Ziwei Zhao,Xizi Wang,Yuchen Wang,Feng Cheng,David Crandall

Main category: cs.CV

TL;DR: 论文介绍了TF2025数据集，用于研究多视角交互，并提出了一种基于序列的方法来识别第三人称视角中的第一人称佩戴者。

Details

Motivation: 随着第一人称摄像头的普及，多摄像头交互在共享环境中的研究需求增加，但现有数据集如Ego4D和Ego-Exo4D对此关注不足。 Method: 扩展了TF2025数据集，包含同步的第一人称和第三人称视角，并提出了一种结合运动线索和行人重识别的序列方法。 Result: TF2025数据集填补了多视角交互研究的空白，提出的方法能有效识别第三人称视角中的第一人称佩戴者。 Conclusion: TF2025数据集和方法为沉浸式学习和协作机器人等应用提供了重要支持。 Abstract: The increasing popularity of egocentric cameras has generated growing interest in studying multi-camera interactions in shared environments. Although large-scale datasets such as Ego4D and Ego-Exo4D have propelled egocentric vision research, interactions between multiple camera wearers remain underexplored-a key gap for applications like immersive learning and collaborative robotics. To bridge this, we present TF2025, an expanded dataset with synchronized first- and third-person views. In addition, we introduce a sequence-based method to identify first-person wearers in third-person footage, combining motion cues and person re-identification.

[16] iDPA: Instance Decoupled Prompt Attention for Incremental Medical Object Detection

Huahui Yi,Wei Xu,Ziyuan Qin,Xi Chen,Xiaohu Wu,Kang Li,Qicheng Lao

Main category: cs.CV

TL;DR: 论文提出了一种名为\method的框架，通过解耦实例级提示生成和提示注意力，解决了医学目标检测任务中的挑战，显著提升了性能。

Details

Motivation: 现有基于提示的方法在医学目标检测任务中存在前景-背景信息耦合以及提示与图像-文本标记耦合的问题，导致性能受限。 Method: 框架包含两部分：1) 实例级提示生成（IPG），从图像中解耦细粒度知识并生成专注于密集预测的提示；2) 解耦提示注意力（DPA），优化提示信息传递并减少内存使用。 Result: 在13个临床数据集上，\method在多种设置下均优于现有方法，FAP提升显著（如全数据提升5.44%）。 Conclusion: \method通过解耦设计有效解决了医学目标检测中的挑战，性能显著优于现有方法。 Abstract: Existing prompt-based approaches have demonstrated impressive performance in continual learning, leveraging pre-trained large-scale models for classification tasks; however, the tight coupling between foreground-background information and the coupled attention between prompts and image-text tokens present significant challenges in incremental medical object detection tasks, due to the conceptual gap between medical and natural domains. To overcome these challenges, we introduce the \method~framework, which comprises two main components: 1) Instance-level Prompt Generation (\ipg), which decouples fine-grained instance-level knowledge from images and generates prompts that focus on dense predictions, and 2) Decoupled Prompt Attention (\dpa), which decouples the original prompt attention, enabling a more direct and efficient transfer of prompt information while reducing memory usage and mitigating catastrophic forgetting. We collect 13 clinical, cross-modal, multi-organ, and multi-category datasets, referred to as \dataset, and experiments demonstrate that \method~outperforms existing SOTA methods, with FAP improvements of 5.44\%, 4.83\%, 12.88\%, and 4.59\% in full data, 1-shot, 10-shot, and 50-shot settings, respectively.

[17] Latent Wavelet Diffusion: Enabling 4K Image Synthesis for Free

Luigi Sigillo,Shengfeng He,Danilo Comminiello

Main category: cs.CV

TL;DR: Latent Wavelet Diffusion (LWD) 是一种轻量级框架，通过增强潜在表示的频谱保真度和聚焦高频细节，实现超高清图像生成（2K至4K）。

Details

Motivation: 高分辨率图像合成在生成模型中仍面临挑战，尤其是如何在计算效率和保留视觉细节之间取得平衡。 Method: LWD 提出三个关键组件：1）尺度一致的变分自编码器目标；2）小波能量图定位细节丰富的空间区域；3）时间依赖的掩码策略，专注于高频细节的去噪监督。 Result: LWD 在超高清图像合成中显著提升感知质量并降低 FID，且无需额外计算开销。 Conclusion: 频率感知的信号驱动监督是一种高效且原理性的高分辨率生成建模方法。 Abstract: High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight framework that enables any latent diffusion model to scale to ultra-high-resolution image generation (2K to 4K) for free. LWD introduces three key components: (1) a scale-consistent variational autoencoder objective that enhances the spectral fidelity of latent representations; (2) wavelet energy maps that identify and localize detail-rich spatial regions within the latent space; and (3) a time-dependent masking strategy that focuses denoising supervision on high-frequency components during training. LWD requires no architectural modifications and incurs no additional computational overhead. Despite its simplicity, it consistently improves perceptual quality and reduces FID in ultra-high-resolution image synthesis, outperforming strong baseline models. These results highlight the effectiveness of frequency-aware, signal-driven supervision as a principled and efficient approach for high-resolution generative modeling.

[18] Efficient 3D Brain Tumor Segmentation with Axial-Coronal-Sagittal Embedding

Tuan-Luc Huynh,Thanh-Danh Le,Tam V. Nguyen,Trung-Nghia Le,Minh-Triet Tran

Main category: cs.CV

TL;DR: 论文提出了一种改进的脑肿瘤分割方法，通过整合轴向-冠状-矢状卷积和预训练权重，优化了nnU-Net框架，减少了训练时间和参数，同时提升了性能。

Details

Motivation: 当前最先进的nnU-Net在脑肿瘤分割中表现良好，但存在训练时间长和预训练权重利用不足的问题，需要改进。 Method: 结合轴向-冠状-矢状卷积和ImageNet预训练权重，提出两种将2D预训练权重迁移到3D领域的方法，并探索了联合分类与分割模型。 Result: 实验表明，所提方法在快速训练设置下性能与交叉验证模型相当或更优。 Conclusion: 通过优化训练效率和利用预训练权重，显著提升了脑肿瘤分割的性能和效率。 Abstract: In this paper, we address the crucial task of brain tumor segmentation in medical imaging and propose innovative approaches to enhance its performance. The current state-of-the-art nnU-Net has shown promising results but suffers from extensive training requirements and underutilization of pre-trained weights. To overcome these limitations, we integrate Axial-Coronal-Sagittal convolutions and pre-trained weights from ImageNet into the nnU-Net framework, resulting in reduced training epochs, reduced trainable parameters, and improved efficiency. Two strategies for transferring 2D pre-trained weights to the 3D domain are presented, ensuring the preservation of learned relationships and feature representations critical for effective information propagation. Furthermore, we explore a joint classification and segmentation model that leverages pre-trained encoders from a brain glioma grade classification proxy task, leading to enhanced segmentation performance, especially for challenging tumor labels. Experimental results demonstrate that our proposed methods in the fast training settings achieve comparable or even outperform the ensemble of cross-validation models, a common practice in the brain tumor segmentation literature.

[19] Performance Analysis of Few-Shot Learning Approaches for Bangla Handwritten Character and Digit Recognition

Mehedi Ahamed,Radib Bin Kabir,Tawsif Tashwar Dipto,Mueeze Al Mushabbir,Sabbir Ahmed,Md. Hasanul Kabir

Main category: cs.CV

TL;DR: 研究探讨了少样本学习（FSL）方法在识别孟加拉手写字符和数字中的表现，提出了一种名为SynergiProtoNet的混合网络，显著提升了识别精度。

Details

Motivation: 解决孟加拉语等复杂结构语言因数据稀缺导致的识别难题，并验证模型在类似语言中的泛化能力。 Method: 提出SynergiProtoNet，结合聚类技术和嵌入框架，通过多级特征提取优化原型学习。 Result: SynergiProtoNet在多种评估设置中均优于现有方法，成为少样本学习的新标杆。 Conclusion: SynergiProtoNet为复杂脚本的少样本学习提供了高效解决方案，代码已开源。 Abstract: This study investigates the performance of few-shot learning (FSL) approaches in recognizing Bangla handwritten characters and numerals using limited labeled data. It demonstrates the applicability of these methods to scripts with intricate and complex structures, where dataset scarcity is a common challenge. Given the complexity of Bangla script, we hypothesize that models performing well on these characters can generalize effectively to languages of similar or lower structural complexity. To this end, we introduce SynergiProtoNet, a hybrid network designed to improve the recognition accuracy of handwritten characters and digits. The model integrates advanced clustering techniques with a robust embedding framework to capture fine-grained details and contextual nuances. It leverages multi-level (both high- and low-level) feature extraction within a prototypical learning framework. We rigorously benchmark SynergiProtoNet against several state-of-the-art few-shot learning models: BD-CSPN, Prototypical Network, Relation Network, Matching Network, and SimpleShot, across diverse evaluation settings including Monolingual Intra-Dataset Evaluation, Monolingual Inter-Dataset Evaluation, Cross-Lingual Transfer, and Split Digit Testing. Experimental results show that SynergiProtoNet consistently outperforms existing methods, establishing a new benchmark in few-shot learning for handwritten character and digit recognition. The code is available on GitHub: https://github.com/MehediAhamed/SynergiProtoNet.

[20] BAGNet: A Boundary-Aware Graph Attention Network for 3D Point Cloud Semantic Segmentation

Wei Tao,Xiaoyang Qu,Kai Lu,Jiguang Wan,Shenglin He,Jianzong Wang

Main category: cs.CV

TL;DR: BAGNet通过边界感知图注意力网络，高效处理点云语义分割，减少计算时间并保持高精度。

Details

Motivation: 点云数据不规则且无结构，传统图方法计算成本高，边界点包含更复杂的空间结构信息。 Method: 提出BAGNet，包含边界感知图注意力层（BAGLayer）和轻量级注意力池化层，分别捕捉边界点特征和全局特征。 Result: 在标准数据集上，BAGNet在精度和推理时间上均优于现有方法。 Conclusion: BAGNet通过优化边界点处理和全局特征提取，实现了高效且高精度的点云语义分割。 Abstract: Since the point cloud data is inherently irregular and unstructured, point cloud semantic segmentation has always been a challenging task. The graph-based method attempts to model the irregular point cloud by representing it as a graph; however, this approach incurs substantial computational cost due to the necessity of constructing a graph for every point within a large-scale point cloud. In this paper, we observe that boundary points possess more intricate spatial structural information and develop a novel graph attention network known as the Boundary-Aware Graph attention Network (BAGNet). On one hand, BAGNet contains a boundary-aware graph attention layer (BAGLayer), which employs edge vertex fusion and attention coefficients to capture features of boundary points, reducing the computation time. On the other hand, BAGNet employs a lightweight attention pooling layer to extract the global feature of the point cloud to maintain model accuracy. Extensive experiments on standard datasets demonstrate that BAGNet outperforms state-of-the-art methods in point cloud semantic segmentation with higher accuracy and less inference time.

[21] UNSURF: Uncertainty Quantification for Cortical Surface Reconstruction of Clinical Brain MRIs

Raghav Mehta,Karthik Gopinath,Ben Glocker,Juan Eugenio Iglesias

Main category: cs.CV

TL;DR: UNSURF是一种新型不确定性度量方法，用于临床脑MRI扫描的皮质表面重建，适用于任意方向、分辨率和对比度。

Details

Motivation: 传统的不确定性度量方法（如体素蒙特卡洛方差）不适合建模表面放置的不确定性，因此需要一种更有效的方法。 Method: 通过预测体素有符号距离函数（SDF）与实际拟合表面SDF之间的差异来度量不确定性。 Result: UNSURF的估计值与真实误差相关性良好，可用于自动化质量控制，并提升阿尔茨海默病分类任务的性能。 Conclusion: UNSURF是一种有效的皮质表面重建不确定性度量方法，具有实际应用价值。 Abstract: We propose UNSURF, a novel uncertainty measure for cortical surface reconstruction of clinical brain MRI scans of any orientation, resolution, and contrast. It relies on the discrepancy between predicted voxel-wise signed distance functions (SDFs) and the actual SDFs of the fitted surfaces. Our experiments on real clinical scans show that traditional uncertainty measures, such as voxel-wise Monte Carlo variance, are not suitable for modeling the uncertainty of surface placement. Our results demonstrate that UNSURF estimates correlate well with the ground truth errors and: \textit{(i)}~enable effective automated quality control of surface reconstructions at the subject-, parcel-, mesh node-level; and \textit{(ii)}~improve performance on a downstream Alzheimer's disease classification task.

[22] SSAM: Self-Supervised Association Modeling for Test-Time Adaption

Yaxiong Wang,Zhenqiang Zhang,Lechao Cheng,Zhun Zhong,Dan Guo,Meng Wang

Main category: cs.CV

TL;DR: 论文提出了一种新的测试时适应（TTA）框架SSAM，通过双阶段关联学习动态优化图像编码器，解决了现有方法因缺乏显式监督而冻结编码器的问题。

Details

Motivation: 现有TTA方法通常冻结图像编码器，忽视了其在缓解训练与测试数据分布偏移中的关键作用，导致性能受限。 Method: SSAM包含两个协同组件：软原型估计（SPE）和原型锚定图像重建（PIR），分别用于引导特征空间重组和保持编码器稳定性。 Result: 实验表明，SSAM在多种基准测试中显著优于现有TTA方法，同时保持计算高效性。 Conclusion: SSAM通过动态优化图像编码器，显著提升了TTA性能，且具有架构无关和超参数依赖性低的优点。 Abstract: Test-time adaption (TTA) has witnessed important progress in recent years, the prevailing methods typically first encode the image and the text and design strategies to model the association between them. Meanwhile, the image encoder is usually frozen due to the absence of explicit supervision in TTA scenarios. We identify a critical limitation in this paradigm: While test-time images often exhibit distribution shifts from training data, existing methods persistently freeze the image encoder due to the absence of explicit supervision during adaptation. This practice overlooks the image encoder's crucial role in bridging distribution shift between training and test. To address this challenge, we propose SSAM (Self-Supervised Association Modeling), a new TTA framework that enables dynamic encoder refinement through dual-phase association learning. Our method operates via two synergistic components: 1) Soft Prototype Estimation (SPE), which estimates probabilistic category associations to guide feature space reorganization, and 2) Prototype-anchored Image Reconstruction (PIR), enforcing encoder stability through cluster-conditional image feature reconstruction. Comprehensive experiments across diverse baseline methods and benchmarks demonstrate that SSAM can surpass state-of-the-art TTA baselines by a clear margin while maintaining computational efficiency. The framework's architecture-agnostic design and minimal hyperparameter dependence further enhance its practical applicability.

[23] SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation

Xingtong Ge,Xin Zhang,Tongda Xu,Yi Zhang,Xinjie Zhang,Yan Wang,Jun Zhang

Main category: cs.CV

TL;DR: 论文提出了一种改进的分布匹配蒸馏方法（DMD），通过隐式分布对齐（IDA）和段内引导（ISG）解决了大规模流式文本到图像模型的收敛问题，最终模型SenseFlow在多个模型上表现优异。

Details

Motivation: 传统DMD在大规模流式文本到图像模型（如SD 3.5和FLUX）上存在收敛困难，需要改进以提升性能。 Method: 提出隐式分布对齐（IDA）和段内引导（ISG）来优化DMD，并结合其他改进（如放大判别器模型）构建最终模型SenseFlow。 Result: IDA单独使用时，DMD在SD 3.5上收敛；结合IDA和ISG时，DMD在SD 3.5和FLUX.1 dev上收敛。SenseFlow在SDXL和FLUX等模型上表现优异。 Conclusion: SenseFlow通过IDA和ISG解决了DMD的收敛问题，并在多个文本到图像模型上实现了优越的蒸馏性能。 Abstract: The Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion (SD) 1.5. However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models, such as SD 3.5 and FLUX. In this paper, we first analyze the issues when applying vanilla DMD on large-scale models. Then, to overcome the scalability challenge, we propose implicit distribution alignment (IDA) to regularize the distance between the generator and fake distribution. Furthermore, we propose intra-segment guidance (ISG) to relocate the timestep importance distribution from the teacher model. With IDA alone, DMD converges for SD 3.5; employing both IDA and ISG, DMD converges for SD 3.5 and FLUX.1 dev. Along with other improvements such as scaled up discriminator models, our final model, dubbed \textbf{SenseFlow}, achieves superior performance in distillation for both diffusion based text-to-image models such as SDXL, and flow-matching models such as SD 3.5 Large and FLUX. The source code will be avaliable at https://github.com/XingtongGe/SenseFlow.

[24] 3D Trajectory Reconstruction of Moving Points Based on Asynchronous Cameras

Huayu Huang,Banglei Guan,Yang Shang,Qifeng Yu

Main category: cs.CV

TL;DR: 本文提出了一种基于异步相机的点目标3D轨迹重建方法，同时解决了轨迹重建和相机同步两个子问题，显著提高了重建精度。

Details

Motivation: 在光学实验力学中，点目标的定位是一个基本问题，尤其在无人机任务中应用广泛。现有方法通常只能单独解决轨迹重建或相机同步问题，无法同时处理两者。 Method: 1. 扩展轨迹交会法以适用于异步相机；2. 基于成像机制和目标动态特性建立相机时间信息和目标运动模型；3. 同时优化相机旋转、时间信息和目标运动参数。 Result: 仿真和实际实验验证了方法的可行性和准确性，实际实验中在15~20 km观测范围内实现了112.95 m的定位误差。 Conclusion: 该方法显著提高了轨迹重建精度，尤其在相机旋转不准确时表现更优。 Abstract: Photomechanics is a crucial branch of solid mechanics. The localization of point targets constitutes a fundamental problem in optical experimental mechanics, with extensive applications in various missions of UAVs. Localizing moving targets is crucial for analyzing their motion characteristics and dynamic properties. Reconstructing the trajectories of points from asynchronous cameras is a significant challenge. It encompasses two coupled sub-problems: trajectory reconstruction and camera synchronization. Present methods typically address only one of these sub-problems individually. This paper proposes a 3D trajectory reconstruction method for point targets based on asynchronous cameras, simultaneously solving both sub-problems. Firstly, we extend the trajectory intersection method to asynchronous cameras to resolve the limitation of traditional triangulation that requires camera synchronization. Secondly, we develop models for camera temporal information and target motion, based on imaging mechanisms and target dynamics characteristics. The parameters are optimized simultaneously to achieve trajectory reconstruction without accurate time parameters. Thirdly, we optimize the camera rotations alongside the camera time information and target motion parameters, using tighter and more continuous constraints on moving points. The reconstruction accuracy is significantly improved, especially when the camera rotations are inaccurate. Finally, the simulated and real-world experimental results demonstrate the feasibility and accuracy of the proposed method. The real-world results indicate that the proposed algorithm achieved a localization error of 112.95 m at an observation range of 15 ~ 20 km.

[25] ViVo: A Dataset for Volumetric VideoReconstruction and Compression

Adrian Azzarelli,Ge Gao,Ho Man Kwan,Fan Zhang,Nantheera Anantrasirichai,Ollie Moolan-Feroze,David Bull

Main category: cs.CV

TL;DR: ViVo是一个新的多样化、真实的神经体积视频重建和压缩数据集，填补了现有数据集在语义和低层特征多样性上的不足。

Details

Motivation: 现有体积视频数据集缺乏真实世界生产流程中的多样内容，限制了重建和压缩模型的开发和验证。 Method: 提出ViVo数据集，包含多视角RGB和深度视频对、2D前景掩码、3D点云等，并扩展了多样性的定义。 Result: 通过测试三种3D重建方法和两种压缩算法，证明了数据集的挑战性和现有数据集的局限性。 Conclusion: ViVo数据集为体积视频重建和压缩提供了更真实的基准，推动了更有效算法的开发。 Abstract: As research on neural volumetric video reconstruction and compression flourishes, there is a need for diverse and realistic datasets, which can be used to develop and validate reconstruction and compression models. However, existing volumetric video datasets lack diverse content in terms of both semantic and low-level features that are commonly present in real-world production pipelines. In this context, we propose a new dataset, ViVo, for VolumetrIc VideO reconstruction and compression. The dataset is faithful to real-world volumetric video production and is the first dataset to extend the definition of diversity to include both human-centric characteristics (skin, hair, etc.) and dynamic visual phenomena (transparent, reflective, liquid, etc.). Each video sequence in this database contains raw data including fourteen multi-view RGB and depth video pairs, synchronized at 30FPS with per-frame calibration and audio data, and their associated 2-D foreground masks and 3-D point clouds. To demonstrate the use of this database, we have benchmarked three state-of-the-art (SotA) 3-D reconstruction methods and two volumetric video compression algorithms. The obtained results evidence the challenging nature of the proposed dataset and the limitations of existing datasets for both volumetric video reconstruction and compression tasks, highlighting the need to develop more effective algorithms for these applications. The database and the associated results are available at https://vivo-bvicr.github.io/

[26] SEED: A Benchmark Dataset for Sequential Facial Attribute Editing with Diffusion Models

Yule Zhu,Ping Liu,Zhedong Zheng,Wei Liu

Main category: cs.CV

TL;DR: 论文介绍了SEED数据集和FAITH模型，用于研究基于扩散模型的渐进式面部编辑序列的跟踪和分析。

Details

Motivation: 现有方法在渐进式面部编辑序列的跟踪和检测方面存在挑战，且缺乏大规模标注数据集。 Method: 构建了SEED数据集，包含90,000多张面部图像，每张图像标注了编辑序列和属性掩码；提出了FAITH模型，利用高频线索增强对细微变化的敏感性。 Result: 实验表明FAITH模型有效，并揭示了SEED数据集的独特挑战。 Conclusion: SEED为研究渐进式编辑提供了灵活且具有挑战性的资源，数据集和代码将公开。 Abstract: Diffusion models have recently enabled precise and photorealistic facial editing across a wide range of semantic attributes. Beyond single-step modifications, a growing class of applications now demands the ability to analyze and track sequences of progressive edits, such as stepwise changes to hair, makeup, or accessories. However, sequential editing introduces significant challenges in edit attribution and detection robustness, further complicated by the lack of large-scale, finely annotated benchmarks tailored explicitly for this task. We introduce SEED, a large-scale Sequentially Edited facE Dataset constructed via state-of-the-art diffusion models. SEED contains over 90,000 facial images with one to four sequential attribute modifications, generated using diverse diffusion-based editing pipelines (LEdits, SDXL, SD3). Each image is annotated with detailed edit sequences, attribute masks, and prompts, facilitating research on sequential edit tracking, visual provenance analysis, and manipulation robustness assessment. To benchmark this task, we propose FAITH, a frequency-aware transformer-based model that incorporates high-frequency cues to enhance sensitivity to subtle sequential changes. Comprehensive experiments, including systematic comparisons of multiple frequency-domain methods, demonstrate the effectiveness of FAITH and the unique challenges posed by SEED. SEED offers a challenging and flexible resource for studying progressive diffusion-based edits at scale. Dataset and code will be publicly released at: https://github.com/Zeus1037/SEED.

[27] CReFT-CAD: Boosting Orthographic Projection Reasoning for CAD via Reinforcement Fine-Tuning

Ke Niu,Zhuofan Chen,Haiyang Yu,Yuwen Chen,Teng Fu,Mengyang Zhao,Bin Li,Xiangyang Xue

Main category: cs.CV

TL;DR: 论文提出CReFT-CAD方法，通过两阶段微调提升CAD中的正交投影推理能力，并发布TriView2CAD基准数据集。

Details

Motivation: 现有深度学习方法在CAD工作流中引入不精确尺寸和限制参数编辑性，而监督微调（SFT）易陷入模式记忆，泛化能力不足。 Method: 采用两阶段微调：课程驱动的强化学习阶段构建推理能力，监督后微调阶段优化指令遵循和语义提取。 Result: CReFT-CAD显著提升推理准确性和分布外泛化能力，TriView2CAD数据集为研究提供支持。 Conclusion: CReFT-CAD为CAD推理研究提供有效方法，TriView2CAD数据集推动领域发展。 Abstract: Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing. Orthographic projection reasoning underpins the entire CAD workflow, encompassing design, manufacturing, and simulation. However, prevailing deep-learning approaches employ standard 3D reconstruction pipelines as an alternative, which often introduce imprecise dimensions and limit the parametric editability required for CAD workflows. Recently, some researchers adopt vision-language models (VLMs), particularly supervised fine-tuning (SFT), to tackle CAD-related challenges. SFT shows promise but often devolves into pattern memorization, yielding poor out-of-distribution performance on complex reasoning tasks. To address these gaps, we introduce CReFT-CAD, a two-stage fine-tuning paradigm that first employs a curriculum-driven reinforcement learning stage with difficulty-aware rewards to build reasoning ability steadily, and then applies supervised post-tuning to hone instruction following and semantic extraction. Complementing this, we release TriView2CAD, the first large-scale, open-source benchmark for orthographic projection reasoning, comprising 200,000 synthetic and 3,000 real-world orthographic projections with precise dimension annotations and six interoperable data modalities. We benchmark leading VLMs on orthographic projection reasoning and demonstrate that CReFT-CAD substantially improves reasoning accuracy and out-of-distribution generalizability in real-world scenarios, offering valuable insights for advancing CAD reasoning research.

[28] Event-based multi-view photogrammetry for high-dynamic, high-velocity target measurement

Taihang Lei,Banglei Guan,Minzu Liang,Xiangyu Li,Jianbing Liu,Jing Tao,Yang Shang,Qifeng Yu

Main category: cs.CV

TL;DR: 提出了一种基于事件多视角摄影测量的新方法，用于高动态、高速目标运动的机械特性测量，解决了现有方法的动态范围限制、观测不连续和高成本问题。

Details

Motivation: 高动态、高速目标运动的机械特性测量在工业和武器系统验证中至关重要，但现有方法存在动态范围有限、观测不连续和高成本等挑战。 Method: 利用事件时空分布的单调性提取目标前缘特征，消除拖尾效应；通过重投影误差关联事件与目标轨迹；采用速度衰减模型拟合数据，实现多视角联合计算。 Result: 在轻气枪碎片测试中，该方法与电磁测速仪相比，测量偏差为4.47%。 Conclusion: 该方法有效解决了现有测量技术的局限性，为高动态目标运动提供了更准确、经济的测量方案。 Abstract: The characterization of mechanical properties for high-dynamic, high-velocity target motion is essential in industries. It provides crucial data for validating weapon systems and precision manufacturing processes etc. However, existing measurement methods face challenges such as limited dynamic range, discontinuous observations, and high costs. This paper presents a new approach leveraging an event-based multi-view photogrammetric system, which aims to address the aforementioned challenges. First, the monotonicity in the spatiotemporal distribution of events is leveraged to extract the target's leading-edge features, eliminating the tailing effect that complicates motion measurements. Then, reprojection error is used to associate events with the target's trajectory, providing more data than traditional intersection methods. Finally, a target velocity decay model is employed to fit the data, enabling accurate motion measurements via ours multi-view data joint computation. In a light gas gun fragment test, the proposed method showed a measurement deviation of 4.47% compared to the electromagnetic speedometer.

[29] MR2US-Pro: Prostate MR to Ultrasound Image Translation and Registration Based on Diffusion Models

Xudong Ma,Nantheera Anantrasirichai,Stefanos Bolomytis,Alin Achim

Main category: cs.CV

TL;DR: 提出了一种新颖的两阶段框架，解决MRI和TRUS多模态图像配准问题，包括TRUS 3D重建和跨模态配准，无需外部探针跟踪信息，并通过伪中间模态和结构感知策略提升配准精度。

Details

Motivation: 前列腺癌诊断依赖多模态成像（MRI和TRUS），但两者因维度和解剖表示差异导致配准困难。现有方法依赖外部探针信息，限制了实用性。 Method: 1. TRUS 3D重建：利用矢状面和横断面TRUS视图的自然相关性，提出无需探针定位的聚类特征匹配方法。2. 跨模态配准：通过伪中间模态的无监督扩散框架，结合结构感知策略优化配准。 Result: 实验验证表明，该方法在无监督条件下实现了更高的配准精度和物理合理的形变，优于现有方法。 Conclusion: 该框架解决了多模态配准的关键挑战，为前列腺癌诊断提供了更可靠的图像配准工具。 Abstract: The diagnosis of prostate cancer increasingly depends on multimodal imaging, particularly magnetic resonance imaging (MRI) and transrectal ultrasound (TRUS). However, accurate registration between these modalities remains a fundamental challenge due to the differences in dimensionality and anatomical representations. In this work, we present a novel framework that addresses these challenges through a two-stage process: TRUS 3D reconstruction followed by cross-modal registration. Unlike existing TRUS 3D reconstruction methods that rely heavily on external probe tracking information, we propose a totally probe-location-independent approach that leverages the natural correlation between sagittal and transverse TRUS views. With the help of our clustering-based feature matching method, we enable the spatial localization of 2D frames without any additional probe tracking information. For the registration stage, we introduce an unsupervised diffusion-based framework guided by modality translation. Unlike existing methods that translate one modality into another, we map both MR and US into a pseudo intermediate modality. This design enables us to customize it to retain only registration-critical features, greatly easing registration. To further enhance anatomical alignment, we incorporate an anatomy-aware registration strategy that prioritizes internal structural coherence while adaptively reducing the influence of boundary inconsistencies. Extensive validation demonstrates that our approach outperforms state-of-the-art methods by achieving superior registration accuracy with physically realistic deformations in a completely unsupervised fashion.

[30] Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

Danfeng li,Hui Zhang,Sheng Wang,Jiacheng Li,Zuxuan Wu

Main category: cs.CV

TL;DR: Seg2Any是一个新的S2I框架，通过解耦语义和形状条件，解决了现有方法在语义和形状一致性上的不足，并在多实体场景中防止属性泄漏。

Details

Motivation: 现有S2I方法无法同时保证语义和形状一致性，且在多实体场景中存在属性泄漏问题。 Method: Seg2Any通过解耦语义和形状条件，引入语义对齐注意力掩码和实体轮廓图，并采用属性隔离注意力掩码机制。 Result: Seg2Any在开放和封闭S2I基准测试中表现最佳，尤其在细粒度空间和属性控制方面。 Conclusion: Seg2Any通过创新的条件解耦和注意力机制，显著提升了S2I生成的性能。 Abstract: Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity's image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.

[31] XYZ-IBD: High-precision Bin-picking Dataset for Object 6D Pose Estimation Capturing Real-world Industrial Complexity

Junwen Huang,Jizhong Liang,Jiaqi Hu,Martin Sundermeyer,Peter KT Yu,Nassir Navab,Benjamin Busam

Main category: cs.CV

TL;DR: XYZ-IBD是一个针对6D姿态估计的工业级数据集，专注于真实工业场景中的复杂挑战，如高反射材料、严重遮挡和密集杂乱。

Details

Motivation: 现有数据集多关注家庭物品，已趋饱和，而工业场景的真实复杂性尚未解决。XYZ-IBD填补了这一空白。 Method: 数据集包含15种无纹理、金属且对称的物体，通过高精度工业相机和商业相机采集RGB、灰度和深度图像，并采用毫米级标注流程。 Result: 在模拟环境中验证了标注的可靠性，并在基准测试中显示现有方法在工业场景下性能显著下降。 Conclusion: XYZ-IBD为未来研究提供了更真实和挑战性的问题，数据集已公开。 Abstract: We introduce XYZ-IBD, a bin-picking dataset for 6D pose estimation that captures real-world industrial complexity, including challenging object geometries, reflective materials, severe occlusions, and dense clutter. The dataset reflects authentic robotic manipulation scenarios with millimeter-accurate annotations. Unlike existing datasets that primarily focus on household objects, which approach saturation,XYZ-IBD represents the unsolved realistic industrial conditions. The dataset features 15 texture-less, metallic, and mostly symmetrical objects of varying shapes and sizes. These objects are heavily occluded and randomly arranged in bins with high density, replicating the challenges of real-world bin-picking. XYZ-IBD was collected using two high-precision industrial cameras and one commercially available camera, providing RGB, grayscale, and depth images. It contains 75 multi-view real-world scenes, along with a large-scale synthetic dataset rendered under simulated bin-picking conditions. We employ a meticulous annotation pipeline that includes anti-reflection spray, multi-view depth fusion, and semi-automatic annotation, achieving millimeter-level pose labeling accuracy required for industrial manipulation. Quantification in simulated environments confirms the reliability of the ground-truth annotations. We benchmark state-of-the-art methods on 2D detection, 6D pose estimation, and depth estimation tasks on our dataset, revealing significant performance degradation in our setups compared to current academic household benchmarks. By capturing the complexity of real-world bin-picking scenarios, XYZ-IBD introduces more realistic and challenging problems for future research. The dataset and benchmark are publicly available at https://xyz-ibd.github.io/XYZ-IBD/.

[32] SatDreamer360: Geometry Consistent Street-View Video Generation from Satellite Imagery

Xianghui Ze,Beiyi Zhu,Zhenbo Song,Jianfeng Lu,Yujiao Shi

Main category: cs.CV

TL;DR: SatDreamer360是一种新框架，可从单张卫星图像和预定义轨迹生成几何和时间一致的地面视频，解决了现有方法在生成时间一致序列上的不足。

Details

Motivation: 从卫星图像生成连续地面视频在模拟、自主导航和数字孪生城市中有重要应用潜力，但现有方法多依赖辅助输入且难以保持时间一致性。 Method: 提出紧凑的三平面表示法和基于射线的像素注意力机制，无需额外几何先验；引入极线约束时间注意力模块确保多帧一致性。 Result: 在VIGOR++数据集上实验表明，SatDreamer360在保真度、连贯性和几何对齐方面表现优异。 Conclusion: SatDreamer360为跨视角视频生成提供了高效解决方案，适用于多样化城市场景。 Abstract: Generating continuous ground-level video from satellite imagery is a challenging task with significant potential for applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view images, often relying on auxiliary inputs like height maps or handcrafted projections, and fall short in producing temporally consistent sequences. In this paper, we propose {SatDreamer360}, a novel framework that generates geometrically and temporally consistent ground-view video from a single satellite image and a predefined trajectory. To bridge the large viewpoint gap, we introduce a compact tri-plane representation that encodes scene geometry directly from the satellite image. A ray-based pixel attention mechanism retrieves view-dependent features from the tri-plane, enabling accurate cross-view correspondence without requiring additional geometric priors. To ensure multi-frame consistency, we propose an epipolar-constrained temporal attention module that aligns features across frames using the known relative poses along the trajectory. To support evaluation, we introduce {VIGOR++}, a large-scale dataset for cross-view video generation, with dense trajectory annotations and high-quality ground-view sequences. Extensive experiments demonstrate that SatDreamer360 achieves superior performance in fidelity, coherence, and geometric alignment across diverse urban scenes.

[33] ABCDEFGH: An Adaptation-Based Convolutional Neural Network-CycleGAN Disease-Courses Evolution Framework Using Generative Models in Health Education

Ruiming Min,Minghao Liu

Main category: cs.CV

TL;DR: 论文探讨了利用卷积神经网络（CNN）和CycleGAN生成合成医学图像，以解决医学教育中高质量教学材料不足的问题。

Details

Motivation: 现代医学教育因隐私问题和资源短缺而难以获取高质量教学材料，合成图像技术可提供解决方案。 Method: 使用卷积神经网络（CNN）和CycleGAN生成合成医学图像。 Result: 成功生成了多样且可比的医学图像数据集，支持医学教育。 Conclusion: 合成医学图像技术为医学教育提供了隐私安全的替代方案，具有广泛应用潜力。 Abstract: With the advancement of modern medicine and the development of technologies such as MRI, CT, and cellular analysis, it has become increasingly critical for clinicians to accurately interpret various diagnostic images. However, modern medical education often faces challenges due to limited access to high-quality teaching materials, stemming from privacy concerns and a shortage of educational resources (Balogh et al., 2015). In this context, image data generated by machine learning models, particularly generative models, presents a promising solution. These models can create diverse and comparable imaging datasets without compromising patient privacy, thereby supporting modern medical education. In this study, we explore the use of convolutional neural networks (CNNs) and CycleGAN (Zhu et al., 2017) for generating synthetic medical images. The source code is available at https://github.com/mliuby/COMP4211-Project.

[34] Parallel Rescaling: Rebalancing Consistency Guidance for Personalized Diffusion Models

JungWoo Chae,Jiyoon Kim,Sangheum Hwang

Main category: cs.CV

TL;DR: 提出了一种并行重缩放技术，用于个性化扩散模型，通过分解一致性引导信号，改善提示对齐和视觉保真度。

Details

Motivation: 现有方法（如DreamBooth和Textual Inversion）在少量参考图像下容易过拟合，导致生成图像与文本提示不匹配。 Method: 提出并行重缩放技术，将一致性引导信号分解为与分类器自由引导（CFG）平行和正交的分量，并重缩放平行分量以减少干扰。 Result: 实验表明，该方法在复杂或风格化提示下优于基线方法，提升了提示对齐和视觉保真度。 Conclusion: 并行重缩放技术为个性化扩散模型提供了更稳定和准确的解决方案。 Abstract: Personalizing diffusion models to specific users or concepts remains challenging, particularly when only a few reference images are available. Existing methods such as DreamBooth and Textual Inversion often overfit to limited data, causing misalignment between generated images and text prompts when attempting to balance identity fidelity with prompt adherence. While Direct Consistency Optimization (DCO) with its consistency-guided sampling partially alleviates this issue, it still struggles with complex or stylized prompts. In this paper, we propose a parallel rescaling technique for personalized diffusion models. Our approach explicitly decomposes the consistency guidance signal into parallel and orthogonal components relative to classifier free guidance (CFG). By rescaling the parallel component, we minimize disruptive interference with CFG while preserving the subject's identity. Unlike prior personalization methods, our technique does not require additional training data or expensive annotations. Extensive experiments show improved prompt alignment and visual fidelity compared to baseline methods, even on challenging stylized prompts. These findings highlight the potential of parallel rescaled guidance to yield more stable and accurate personalization for diverse user inputs.

[35] Long-Tailed Visual Recognition via Permutation-Invariant Head-to-Tail Feature Fusion

Mengke Li,Zhikai Hu,Yang Lu,Weichao Lan,Yiu-ming Cheung,Hui Huang

Main category: cs.CV

TL;DR: PI-H2T方法通过特征融合和语义信息转移，解决了长尾数据分布不平衡导致的模型偏置问题。

Details

Motivation: 长尾数据分布不平衡导致深度学习模型偏向头部类别，忽视尾部类别，影响识别准确性。 Method: 提出PI-H2T方法，包括置换不变表示融合（PIF）和头到尾特征融合（H2TF），优化表示空间和分类器。 Result: 实验证明PI-H2T能有效提升尾部类别的多样性和整体性能。 Conclusion: PI-H2T是一种即插即用的方法，可无缝集成到现有模型中，显著提升长尾数据分类性能。 Abstract: The imbalanced distribution of long-tailed data presents a significant challenge for deep learning models, causing them to prioritize head classes while neglecting tail classes. Two key factors contributing to low recognition accuracy are the deformed representation space and a biased classifier, stemming from insufficient semantic information in tail classes. To address these issues, we propose permutation-invariant and head-to-tail feature fusion (PI-H2T), a highly adaptable method. PI-H2T enhances the representation space through permutation-invariant representation fusion (PIF), yielding more clustered features and automatic class margins. Additionally, it adjusts the biased classifier by transferring semantic information from head to tail classes via head-to-tail fusion (H2TF), improving tail class diversity. Theoretical analysis and experiments show that PI-H2T optimizes both the representation space and decision boundaries. Its plug-and-play design ensures seamless integration into existing methods, providing a straightforward path to further performance improvements. Extensive experiments on long-tailed benchmarks confirm the effectiveness of PI-H2T.

[36] Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

Daniele Molino,Camillo Maria Caruso,Filippo Ruffini,Paolo Soda,Valerio Guarrasi

Main category: cs.CV

TL;DR: 本文提出了一种结合潜在扩散模型和3D对比视觉语言预训练方案的新架构，用于从文本生成CT图像，解决了3D医学影像生成中的高维度和解剖复杂性挑战。

Details

Motivation: 尽管文本条件生成模型在2D医学影像合成上取得进展，但扩展到3D CT图像生成仍面临高维度和解剖复杂性的挑战，缺乏有效的视觉语言对齐框架。 Method: 采用双编码器CLIP风格模型，结合预训练的3D VAE压缩CT体积到低维潜在空间，实现高效的3D去噪扩散生成。 Result: 在CT-RATE数据集上评估，模型在图像保真度、临床相关性和语义对齐方面表现优异，显著优于基线方法，并能有效增强下游诊断性能。 Conclusion: 研究表明，特定模态的视觉语言对齐是高质量3D医学影像生成的关键，该方法为数据增强、医学教育和临床模拟提供了可扩展的解决方案。 Abstract: Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric Computed Tomography (CT) remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging. Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages. Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance. Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation.

[37] Video Signature: In-generation Watermarking for Latent Video Diffusion Models

Yu Huang,Junhao Chen,Qi Zheng,Hanqian Li,Shuliang Liu,Xuming Hu

Main category: cs.CV

TL;DR: VIDSIG是一种用于潜在视频扩散模型的生成中水印方法，通过部分微调潜在解码器实现隐式和自适应水印集成，解决了现有后生成水印方法的计算开销和视频质量平衡问题。

Details

Motivation: AIGC的快速发展引发了对知识产权保护和内容追溯的担忧，现有后生成水印方法存在计算开销大且难以平衡视频质量与水印提取的问题。 Method: VIDSIG通过部分微调潜在解码器，结合Perturbation-Aware Suppression（PAS）和轻量级时间对齐模块，实现隐式水印集成和时间一致性增强。 Result: 实验表明VIDSIG在水印提取、视觉质量和生成效率方面表现最佳，并对空间和时间篡改具有强鲁棒性。 Conclusion: VIDSIG是一种实用且高效的生成中水印方法，适用于实际场景。 Abstract: The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, but existing methods for video generation mainly follow a post-generation paradigm, which introduces additional computational overhead and often fails to effectively balance the trade-off between video quality and watermark extraction. To address these issues, we propose Video Signature (VIDSIG), an in-generation watermarking method for latent video diffusion models, which enables implicit and adaptive watermark integration during generation. Specifically, we achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VIDSIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency. It also demonstrates strong robustness against both spatial and temporal tampering, highlighting its practicality in real-world scenarios.

[38] Poster: Adapting Pretrained Vision Transformers with LoRA Against Attack Vectors

Richard E. Neddo,Sean Willis,Zander Blasingame,Chen Liu

Main category: cs.CV

TL;DR: 提出一种针对对抗攻击的防御方法，通过低秩适应调整预训练视觉变换器的权重和类别，增强鲁棒性并支持可扩展微调。

Details

Motivation: 图像分类器（如自动驾驶导航中使用的）易受对抗攻击影响，需提升其鲁棒性。 Method: 采用低秩适应调整预训练视觉变换器的权重和类别，避免重新训练。 Result: 增强了模型对对抗攻击的鲁棒性，同时支持可扩展的微调。 Conclusion: 低秩适应是一种有效的防御对抗攻击的方法，且具有可扩展性。 Abstract: Image classifiers, such as those used for autonomous vehicle navigation, are largely known to be susceptible to adversarial attacks that target the input image set. There is extensive discussion on adversarial attacks including perturbations that alter the input images to cause malicious misclassifications without perceivable modification. This work proposes a countermeasure for such attacks by adjusting the weights and classes of pretrained vision transformers with a low-rank adaptation to become more robust against adversarial attacks and allow for scalable fine-tuning without retraining.

[39] Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis

Vasilii Korolkov

Main category: cs.CV

TL;DR: 提出了一种统一的、自适应的场景分割和关键帧提取框架，适用于多种视频类型和时长，具有高效性和可扩展性。

Details

Motivation: 现有方法在多样化的视频类型和时长中缺乏通用性，需要一种适应性强的解决方案来支持视频理解任务。 Method: 动态选择分割策略（短视频用自适应阈值，中长视频用混合策略，长视频用基于间隔的分割），并使用轻量级模块评分关键帧。 Result: 系统已部署于商业视频分析平台，适用于媒体、教育、研究和安全领域，提供一致的分割粒度和高效处理。 Conclusion: 该框架为下游应用提供了可扩展和可解释的解决方案，未来可进一步优化音频感知分割和强化学习评分。 Abstract: Robust scene segmentation and keyframe extraction are essential preprocessing steps in video understanding pipelines, supporting tasks such as indexing, summarization, and semantic retrieval. However, existing methods often lack generalizability across diverse video types and durations. We present a unified, adaptive framework for automatic scene detection and keyframe selection that handles formats ranging from short-form media to long-form films, archival content, and surveillance footage. Our system dynamically selects segmentation policies based on video length: adaptive thresholding for short videos, hybrid strategies for mid-length ones, and interval-based splitting for extended recordings. This ensures consistent granularity and efficient processing across domains. For keyframe selection, we employ a lightweight module that scores sampled frames using a composite metric of sharpness, luminance, and temporal spread, avoiding complex saliency models while ensuring visual relevance. Designed for high-throughput workflows, the system is deployed in a commercial video analysis platform and has processed content from media, education, research, and security domains. It offers a scalable and interpretable solution suitable for downstream applications such as UI previews, embedding pipelines, and content filtering. We discuss practical implementation details and outline future enhancements, including audio-aware segmentation and reinforcement-learned frame scoring.

[40] CineMA: A Foundation Model for Cine Cardiac MRI

Yunguan Fu,Weixi Yi,Charlotte Manisty,Anish N Bhuva,Thomas A Treibel,James C Moon,Matthew J Clarkson,Rhodri Huw Davies,Yipeng Hu

Main category: cs.CV

TL;DR: CineMA是一种基于自监督学习的AI模型，用于自动化心脏磁共振（CMR）图像分析，减少人工标注需求，性能优于传统卷积神经网络（CNN）。

Details

Motivation: 传统CMR图像分析耗时且主观，需自动化解决方案以提升效率和准确性。 Method: CineMA采用自监督自动编码器，通过74,916个CMR研究预训练，并在8个数据集上微调，完成23项任务。 Result: CineMA在多项任务中表现优于CNN，且标注效率更高。 Conclusion: CineMA为心脏影像分析提供了高效基础模型，支持临床转化和可重复性。 Abstract: Cardiac magnetic resonance (CMR) is a key investigation in clinical cardiovascular medicine and has been used extensively in population research. However, extracting clinically important measurements such as ejection fraction for diagnosing cardiovascular diseases remains time-consuming and subjective. We developed CineMA, a foundation AI model automating these tasks with limited labels. CineMA is a self-supervised autoencoder model trained on 74,916 cine CMR studies to reconstruct images from masked inputs. After fine-tuning, it was evaluated across eight datasets on 23 tasks from four categories: ventricle and myocardium segmentation, left and right ventricle ejection fraction calculation, disease detection and classification, and landmark localisation. CineMA is the first foundation model for cine CMR to match or outperform convolutional neural networks (CNNs). CineMA demonstrated greater label efficiency than CNNs, achieving comparable or better performance with fewer annotations. This reduces the burden of clinician labelling and supports replacing task-specific training with fine-tuning foundation models in future cardiac imaging applications. Models and code for pre-training and fine-tuning are available at https://github.com/mathpluscode/CineMA, democratising access to high-performance models that otherwise require substantial computational resources, promoting reproducibility and accelerating clinical translation.

[41] Concept-Centric Token Interpretation for Vector-Quantized Generative Models

Tianze Yang,Yucheng Shi,Mengnan Du,Xuansheng Wu,Qiaoyu Tan,Jin Sun,Ninghao Liu

Main category: cs.CV

TL;DR: CORTEX是一种解释VQGM的新方法，通过识别概念特定的token组合，提供样本级和代码本级的解释，提升模型透明度。

Details

Motivation: VQGM的离散token代码本尚未被充分理解，尤其是哪些token对生成特定概念的图像至关重要。 Method: CORTEX采用两种方法：样本级解释（分析单张图像中token的重要性）和代码本级解释（全局探索相关token）。 Result: 实验表明CORTEX在解释token使用上优于基线方法，适用于目标图像编辑和快捷特征检测。 Conclusion: CORTEX不仅提升了VQGM的透明度，还具有实际应用价值。 Abstract: Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs -- the codebook of discrete tokens -- is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing VQGMs transparency, CORTEX is useful in applications such as targeted image editing and shortcut feature detection. Our code is available at https://github.com/YangTianze009/CORTEX.

[42] Fovea Stacking: Imaging with Dynamic Localized Aberration Correction

Shi Mao,Yogeshwar Mishra,Wolfgang Heidrich

Main category: cs.CV

TL;DR: 论文提出了一种名为Fovea Stacking的新型成像系统，利用可变形相位板（DPPs）进行局部像差校正，通过优化和堆叠多张局部清晰图像，实现全视场无像差成像。

Details

Motivation: 小型化相机需求推动了对光学简化系统的探索，但简化系统通常伴随严重的离轴像差，难以仅通过软件校正。 Method: 利用DPPs进行局部像差校正，通过可微分光学模型优化DPP变形，堆叠多张局部清晰图像，并结合神经网络控制模型提升硬件性能对齐。 Result: Fovea Stacking在扩展景深成像中优于传统焦点堆叠，且可通过物体检测或眼动追踪实现动态调整，适用于实时视频应用。 Conclusion: Fovea Stacking为小型化成像系统提供了一种高效的像差校正方法，具有广泛的应用潜力。 Abstract: The desire for cameras with smaller form factors has recently lead to a push for exploring computational imaging systems with reduced optical complexity such as a smaller number of lens elements. Unfortunately such simplified optical systems usually suffer from severe aberrations, especially in off-axis regions, which can be difficult to correct purely in software. In this paper we introduce Fovea Stacking, a new type of imaging system that utilizes emerging dynamic optical components called deformable phase plates (DPPs) for localized aberration correction anywhere on the image sensor. By optimizing DPP deformations through a differentiable optical model, off-axis aberrations are corrected locally, producing a foveated image with enhanced sharpness at the fixation point - analogous to the eye's fovea. Stacking multiple such foveated images, each with a different fixation point, yields a composite image free from aberrations. To efficiently cover the entire field of view, we propose joint optimization of DPP deformations under imaging budget constraints. Due to the DPP device's non-linear behavior, we introduce a neural network-based control model for improved alignment between simulation-hardware performance. We further demonstrated that for extended depth-of-field imaging, fovea stacking outperforms traditional focus stacking in image quality. By integrating object detection or eye-tracking, the system can dynamically adjust the lens to track the object of interest-enabling real-time foveated video suitable for downstream applications such as surveillance or foveated virtual reality displays.

[43] From Local Cues to Global Percepts: Emergent Gestalt Organization in Self-Supervised Vision Models

Tianqin Li,Ziqi Wen,Leiran Song,Jun Liu,Zhi Jing,Tai Sing Lee

Main category: cs.CV

TL;DR: 现代视觉模型（如ViTs和ConvNeXt）在自监督训练（如MAE）下表现出类似Gestalt原则的行为，但分类微调会削弱这种能力。DiSRT测试平台用于评估模型对全局结构的敏感性。

Details

Motivation: 研究现代视觉模型是否能够像人类视觉一样，通过Gestalt原则（如闭合、邻近性）组织局部线索为全局形式。 Method: 使用Masked Autoencoding（MAE）训练Vision Transformers（ViTs）和ConvNeXt模型，并通过DiSRT测试平台评估其对全局空间扰动的敏感性。 Result: 自监督模型（如MAE、CLIP）在DiSRT测试中表现优于监督基线，甚至有时超过人类表现。分类微调会降低模型的Gestalt敏感性，但Top-K激活稀疏机制可恢复这种能力。 Conclusion: 自监督训练条件能促进Gestalt-like感知，而分类微调会抑制。DiSRT可作为评估模型对全局结构敏感性的诊断工具。 Abstract: Human vision organizes local cues into coherent global forms using Gestalt principles like closure, proximity, and figure-ground assignment -- functions reliant on global spatial structure. We investigate whether modern vision models show similar behaviors, and under what training conditions these emerge. We find that Vision Transformers (ViTs) trained with Masked Autoencoding (MAE) exhibit activation patterns consistent with Gestalt laws, including illusory contour completion, convexity preference, and dynamic figure-ground segregation. To probe the computational basis, we hypothesize that modeling global dependencies is necessary for Gestalt-like organization. We introduce the Distorted Spatial Relationship Testbench (DiSRT), which evaluates sensitivity to global spatial perturbations while preserving local textures. Using DiSRT, we show that self-supervised models (e.g., MAE, CLIP) outperform supervised baselines and sometimes even exceed human performance. ConvNeXt models trained with MAE also exhibit Gestalt-compatible representations, suggesting such sensitivity can arise without attention architectures. However, classification finetuning degrades this ability. Inspired by biological vision, we show that a Top-K activation sparsity mechanism can restore global sensitivity. Our findings identify training conditions that promote or suppress Gestalt-like perception and establish DiSRT as a diagnostic for global structure sensitivity across models.

[44] Common Inpainted Objects In-N-Out of Context

Tianze Yang,Tyson Jordan,Ninghao Liu,Jin Sun

Main category: cs.CV

TL;DR: COinCO是一个新的数据集，通过扩散填充技术生成97,722张图像，包含上下文一致和不一致的场景，用于上下文学习。

Details

Motivation: 解决现有视觉数据集中缺乏上下文不一致样本的问题。 Method: 使用扩散填充技术替换COCO图像中的对象，并通过多模态大语言模型验证和分类填充对象的上下文一致性。 Result: 揭示了语义先验对填充成功的影响，并支持三种关键任务：上下文分类、对象预测和假检测。 Conclusion: COinCO为上下文感知的视觉理解和图像取证提供了基础。 Abstract: We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through a multimodal large language model assessment. Our analysis reveals significant patterns in semantic priors that influence inpainting success across object categories. We demonstrate three key tasks enabled by COinCO: (1) training context classifiers that effectively determine whether existing objects belong in their context; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique levels, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision and image forensics. Our code and data are at: https://github.com/YangTianze009/COinCO.

[45] Involution-Infused DenseNet with Two-Step Compression for Resource-Efficient Plant Disease Classification

T. Ahmed,S. Jannat,Md. F. Islam,J. Noor

Main category: cs.CV

TL;DR: 论文提出了一种结合权重剪枝和知识蒸馏的两步模型压缩方法，并融合了DenseNet与Involutional Layers，以降低计算需求并提升模型性能，适用于资源受限设备。

Details

Motivation: 农业对全球粮食安全至关重要，但作物易受病害影响。传统CNN模型计算需求高，难以在资源受限设备上部署。 Method: 采用权重剪枝和知识蒸馏的两步压缩方法，并结合DenseNet与Involutional Layers的混合结构。 Result: 压缩后的ResNet50在PlantVillage和PaddyLeaf数据集上分别达到99.55%和98.99%的准确率；DenseNet混合模型在高效性优化下也表现优异。 Conclusion: 该方法支持在资源受限设备上高效部署，促进精准农业和可持续耕作。 Abstract: Agriculture is vital for global food security, but crops are vulnerable to diseases that impact yield and quality. While Convolutional Neural Networks (CNNs) accurately classify plant diseases using leaf images, their high computational demands hinder their deployment in resource-constrained settings such as smartphones, edge devices, and real-time monitoring systems. This study proposes a two-step model compression approach integrating Weight Pruning and Knowledge Distillation, along with the hybridization of DenseNet with Involutional Layers. Pruning reduces model size and computational load, while distillation improves the smaller student models performance by transferring knowledge from a larger teacher network. The hybridization enhances the models ability to capture spatial features efficiently. These compressed models are suitable for real-time applications, promoting precision agriculture through rapid disease identification and crop management. The results demonstrate ResNet50s superior performance post-compression, achieving 99.55% and 98.99% accuracy on the PlantVillage and PaddyLeaf datasets, respectively. The DenseNet-based model, optimized for efficiency, recorded 99.21% and 93.96% accuracy with a minimal parameter count. Furthermore, the hybrid model achieved 98.87% and 97.10% accuracy, supporting the practical deployment of energy-efficient devices for timely disease intervention and sustainable farming practices.

[46] ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary

Zeqi Gu,Yin Cui,Zhaoshuo Li,Fangyin Wei,Yunhao Ge,Jinwei Gu,Ming-Yu Liu,Abe Davis,Yifan Ding

Main category: cs.CV

TL;DR: ArtiScene利用文本到图像的生成模型作为中介，通过2D图像指导3D场景合成，避免了直接依赖3D数据的限制，显著提升了场景设计的质量和多样性。

Details

Motivation: 传统3D场景设计需要艺术和技术双重能力，而现有文本到3D生成方法受限于高质量3D数据的稀缺。现代文本到图像模型能生成多样且可靠的2D布局，因此可以利用2D图像作为中介指导3D合成。 Method: ArtiScene通过文本生成2D图像，从中提取对象形状和外观信息，结合几何和姿态数据组装成3D场景。 Result: ArtiScene在布局和美学质量上大幅超越现有方法，用户研究中胜率达74.89%，GPT-4o评估中胜率达95.07%。 Conclusion: ArtiScene提供了一种无需训练的自动化3D场景设计方法，结合了文本到图像的灵活性和2D布局的可靠性。 Abstract: Designing 3D scenes is traditionally a challenging task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes based on simple text descriptions. However, as these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. First, we generate 2D images from a scene description, then extract the shape and appearance of objects to create 3D models. These models are assembled into the final scene using geometry, position, and pose information derived from the same intermediary image. Being generalizable to a wide range of scenes and styles, ArtiScene outperforms state-of-the-art benchmarks by a large margin in layout and aesthetic quality by quantitative metrics. It also averages a 74.89% winning rate in extensive user studies and 95.07% in GPT-4o evaluation. Project page: https://artiscene-cvpr.github.io/

[47] EcoLens: Leveraging Multi-Objective Bayesian Optimization for Energy-Efficient Video Processing on Edge Devices

Benjamin Civjan,Bo Chen,Ruixiao Zhang,Klara Nahrstedt

Main category: cs.CV

TL;DR: 该论文提出了一种动态优化视频处理配置的系统，以在边缘设备上最小化能耗，同时保持深度学习推理所需的视频特征。

Details

Motivation: 在资源受限环境中实现实时视频分析时，能耗与视频语义的平衡是一个重要挑战。 Method: 通过离线配置分析建立能耗与推理精度的先验知识，并利用多目标贝叶斯优化在线动态调整配置。 Result: 实验证明系统能显著降低能耗，同时保持高分析性能。 Conclusion: 该系统为智能设备和边缘计算提供了一种实用的节能解决方案。 Abstract: Video processing for real-time analytics in resource-constrained environments presents a significant challenge in balancing energy consumption and video semantics. This paper addresses the problem of energy-efficient video processing by proposing a system that dynamically optimizes processing configurations to minimize energy usage on the edge, while preserving essential video features for deep learning inference. We first gather an extensive offline profile of various configurations consisting of device CPU frequencies, frame filtering features, difference thresholds, and video bitrates, to establish apriori knowledge of their impact on energy consumption and inference accuracy. Leveraging this insight, we introduce an online system that employs multi-objective Bayesian optimization to intelligently explore and adapt configurations in real time. Our approach continuously refines processing settings to meet a target inference accuracy with minimal edge device energy expenditure. Experimental results demonstrate the system's effectiveness in reducing video processing energy use while maintaining high analytical performance, offering a practical solution for smart devices and edge computing applications.

[48] Depth-Aware Scoring and Hierarchical Alignment for Multiple Object Tracking

Milad Khanchi,Maria Amer,Charalambos Poullis

Main category: cs.CV

TL;DR: 论文提出了一种新颖的深度感知多目标跟踪框架，通过零样本深度估计和分层对齐分数改进关联准确性，无需额外训练。

Details

Motivation: 现有基于运动的多目标跟踪方法依赖IoU进行目标关联，但在遮挡或视觉相似对象场景中效果不佳。 Method: 使用零样本方法估计深度，并将其作为独立特征引入关联过程；提出分层对齐分数，结合粗粒度边界框重叠和细粒度像素级对齐。 Result: 在挑战性基准测试中取得最先进结果，无需训练或微调。 Conclusion: 首次将3D特征（单目深度）作为独立决策矩阵引入关联步骤，显著提升了跟踪性能。 Abstract: Current motion-based multiple object tracking (MOT) approaches rely heavily on Intersection-over-Union (IoU) for object association. Without using 3D features, they are ineffective in scenarios with occlusions or visually similar objects. To address this, our paper presents a novel depth-aware framework for MOT. We estimate depth using a zero-shot approach and incorporate it as an independent feature in the association process. Additionally, we introduce a Hierarchical Alignment Score that refines IoU by integrating both coarse bounding box overlap and fine-grained (pixel-level) alignment to improve association accuracy without requiring additional learnable parameters. To our knowledge, this is the first MOT framework to incorporate 3D features (monocular depth) as an independent decision matrix in the association step. Our framework achieves state-of-the-art results on challenging benchmarks without any training nor fine-tuning. The code is available at https://github.com/Milad-Khanchi/DepthMOT

[49] Aiding Medical Diagnosis through Image Synthesis and Classification

Kanishk Choudhary

Main category: cs.CV

TL;DR: 本文提出了一种通过文本描述生成逼真医学图像的系统，并通过分类模型验证其准确性，旨在解决医学教育资源多样性和可及性不足的问题。

Details

Motivation: 医学教育中缺乏多样且易获取的视觉参考材料，影响了诊断准确性和模式识别能力的培养。 Method: 使用PathMNIST数据集对预训练的稳定扩散模型进行LoRA微调，生成医学图像，并通过ResNet-18分类模型验证图像质量。 Result: 生成模型的F1得分为0.6727，部分组织类型分类准确率达100%，系统在生成和分类环节均表现出高准确性。 Conclusion: 该系统为合成特定领域医学图像提供了可靠方法，未来可扩展至其他医学影像领域。 Abstract: Medical professionals, especially those in training, often depend on visual reference materials to support an accurate diagnosis and develop pattern recognition skills. However, existing resources may lack the diversity and accessibility needed for broad and effective clinical learning. This paper presents a system designed to generate realistic medical images from textual descriptions and validate their accuracy through a classification model. A pretrained stable diffusion model was fine-tuned using Low-Rank Adaptation (LoRA) on the PathMNIST dataset, consisting of nine colorectal histopathology tissue types. The generative model was trained multiple times using different training parameter configurations, guided by domain-specific prompts to capture meaningful features. To ensure quality control, a ResNet-18 classification model was trained on the same dataset, achieving 99.76% accuracy in detecting the correct label of a colorectal histopathological medical image. Generated images were then filtered using the trained classifier and an iterative process, where inaccurate outputs were discarded and regenerated until they were correctly classified. The highest performing version of the generative model from experimentation achieved an F1 score of 0.6727, with precision and recall scores of 0.6817 and 0.7111, respectively. Some types of tissue, such as adipose tissue and lymphocytes, reached perfect classification scores, while others proved more challenging due to structural complexity. The self-validating approach created demonstrates a reliable method for synthesizing domain-specific medical images because of high accuracy in both the generation and classification portions of the system, with potential applications in both diagnostic support and clinical education. Future work includes improving prompt-specific accuracy and extending the system to other areas of medical imaging.

[50] HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models

Songtao Jiang,Yan Zhang,Yeying Jin,Zhihang Tang,Yangyang Wu,Yang Feng,Jian Wu,Zuozhu Liu

Main category: cs.CV

TL;DR: 论文提出了一种名为HSCR的新方法，通过分层自对比奖励解决医学视觉语言模型中的模态不对齐问题，提高了模型的零样本性能和可信度。

Details

Motivation: 现有医学视觉语言模型存在模态不对齐问题，导致临床场景中不可靠的响应，亟需一种高效且能捕捉细微偏好的对齐方法。 Method: HSCR通过视觉标记丢弃分析模态耦合标记，生成高质量偏好数据，并采用多级偏好优化策略，结合隐式偏好进行更精确的对齐优化。 Result: 在多个医学任务（如Med-VQA、医学图像描述等）中，HSCR显著提升了零样本性能和对齐效果，仅需2000条训练数据即可实现。 Conclusion: HSCR是一种高效且有效的医学视觉语言模型对齐方法，显著提升了模型的性能和可信度。 Abstract: Medical Vision-Language Models (Med-VLMs) have achieved success across various tasks, yet most existing methods overlook the modality misalignment issue that can lead to untrustworthy responses in clinical settings. In this paper, we propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment: 1) Cost-effective generation of high-quality preference data; 2) Capturing nuanced and context-aware preferences for improved alignment. HSCR first leverages the inherent capability of Med-VLMs to generate dispreferred responses with higher sampling probability. By analyzing output logit shifts after visual token dropout, we identify modality-coupled tokens that induce misalignment and derive an implicit alignment reward function. This function guides token replacement with hallucinated ones during decoding, producing high-quality dispreferred data. Furthermore, HSCR introduces a multi-level preference optimization strategy, which extends beyond traditional adjacent-level optimization by incorporating nuanced implicit preferences, leveraging relative quality in dispreferred data to capture subtle alignment cues for more precise and context-aware optimization. Extensive experiments across multiple medical tasks, including Med-VQA, medical image captioning and instruction following, demonstrate that HSCR not only enhances zero-shot performance but also significantly improves modality alignment and trustworthiness with just 2,000 training entries.

[51] TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning

Jiaqi Luo,Yuan Yuan,Shixin Xu

Main category: cs.CV

TL;DR: TIME框架结合TabPFN和图像特征，解决表格数据标准化和缺失值问题，在医疗和自然数据集中表现优异。

Details

Motivation: 解决表格数据缺乏标准化预训练表示和缺失值处理的挑战，尤其在医疗应用中。 Method: 提出TIME框架，利用TabPFN作为冻结表格编码器生成鲁棒嵌入，结合预训练视觉主干提取的图像特征，探索多种融合策略。 Result: 在完整和不完整表格输入下，TIME均优于基线方法，验证其实际应用价值。 Conclusion: TIME为多模态学习提供了高效解决方案，尤其在医疗领域具有广泛应用前景。 Abstract: Tabular-image multimodal learning, which integrates structured tabular data with imaging data, holds great promise for a variety of tasks, especially in medical applications. Yet, two key challenges remain: (1) the lack of a standardized, pretrained representation for tabular data, as is commonly available in vision and language domains; and (2) the difficulty of handling missing values in the tabular modality, which are common in real-world medical datasets. To address these issues, we propose the TabPFN-Integrated Multimodal Engine (TIME), a novel multimodal framework that builds on the recently introduced tabular foundation model, TabPFN. TIME leverages TabPFN as a frozen tabular encoder to generate robust, strong embeddings that are naturally resilient to missing data, and combines them with image features from pretrained vision backbones. We explore a range of fusion strategies and tabular encoders, and evaluate our approach on both natural and medical datasets. Extensive experiments demonstrate that TIME consistently outperforms competitive baselines across both complete and incomplete tabular inputs, underscoring its practical value in real-world multimodal learning scenarios.

[52] L3A: Label-Augmented Analytic Adaptation for Multi-Label Class Incremental Learning

Xiang Zhang,Run He,Jiao Chen,Di Fang,Ming Li,Ziqian Zeng,Cen Chen,Huiping Zhuang

Main category: cs.CV

TL;DR: 论文提出了一种名为L3A的方法，用于解决多标签增量学习中的标签缺失和类别不平衡问题，无需存储历史样本。

Details

Motivation: 多标签增量学习（MLCIL）面临标签缺失和类别不平衡的挑战，传统方法难以解决。 Method: L3A包含伪标签模块（解决标签缺失）和加权分析分类器（解决类别不平衡），无需存储历史样本。 Result: 在MS-COCO和PASCAL VOC数据集上，L3A优于现有方法。 Conclusion: L3A是一种高效的多标签增量学习方法，解决了标签缺失和类别不平衡问题。 Abstract: Class-incremental learning (CIL) enables models to learn new classes continually without forgetting previously acquired knowledge. Multi-label CIL (MLCIL) extends CIL to a real-world scenario where each sample may belong to multiple classes, introducing several challenges: label absence, which leads to incomplete historical information due to missing labels, and class imbalance, which results in the model bias toward majority classes. To address these challenges, we propose Label-Augmented Analytic Adaptation (L3A), an exemplar-free approach without storing past samples. L3A integrates two key modules. The pseudo-label (PL) module implements label augmentation by generating pseudo-labels for current phase samples, addressing the label absence problem. The weighted analytic classifier (WAC) derives a closed-form solution for neural networks. It introduces sample-specific weights to adaptively balance the class contribution and mitigate class imbalance. Experiments on MS-COCO and PASCAL VOC datasets demonstrate that L3A outperforms existing methods in MLCIL tasks. Our code is available at https://github.com/scut-zx/L3A.

[53] QuantFace: Low-Bit Post-Training Quantization for One-Step Diffusion Face Restoration

Jiatong Li,Libo Zhu,Haotong Qin,Jingkai Wang,Linghe Kong,Guihai Chen,Yulun Zhang,Xiaokang Yang

Main category: cs.CV

TL;DR: QuantFace是一种针对一步扩散人脸恢复模型的低比特量化方法，将32位权重和激活量化为4~6位，通过旋转缩放通道平衡和量化-蒸馏低秩适应（QD-LoRA）优化性能，并采用自适应比特分配策略。

Details

Motivation: 扩散模型在人脸恢复中表现优异，但计算量大，难以部署在智能手机等设备上。 Method: 提出QuantFace，包括旋转缩放通道平衡、QD-LoRA联合优化和自适应比特分配策略。 Result: 在合成和真实数据集上，QuantFace在6位和4位量化下表现优异，优于现有低比特量化方法。 Conclusion: QuantFace有效解决了扩散模型在低比特量化下的性能问题，具有实际部署潜力。 Abstract: Diffusion models have been achieving remarkable performance in face restoration. However, the heavy computations of diffusion models make it difficult to deploy them on devices like smartphones. In this work, we propose QuantFace, a novel low-bit quantization for one-step diffusion face restoration models, where the full-precision (\ie, 32-bit) weights and activations are quantized to 4$\sim$6-bit. We first analyze the data distribution within activations and find that they are highly variant. To preserve the original data information, we employ rotation-scaling channel balancing. Furthermore, we propose Quantization-Distillation Low-Rank Adaptation (QD-LoRA) that jointly optimizes for quantization and distillation performance. Finally, we propose an adaptive bit-width allocation strategy. We formulate such a strategy as an integer programming problem, which combines quantization error and perceptual metrics to find a satisfactory resource allocation. Extensive experiments on the synthetic and real-world datasets demonstrate the effectiveness of QuantFace under 6-bit and 4-bit. QuantFace achieves significant advantages over recent leading low-bit quantization methods for face restoration. The code is available at https://github.com/jiatongli2024/QuantFace.

[54] Improving Keystep Recognition in Ego-Video via Dexterous Focus

Zachary Chavis,Stephen J. Guy,Hyun Soo Park

Main category: cs.CV

TL;DR: 提出了一种通过稳定和聚焦手部区域的视频处理框架，显著提升了自我中心视角下活动识别的性能。

Details

Motivation: 解决传统活动识别技术在自我中心视频中因头部动态变化带来的挑战。 Method: 限制输入为稳定且聚焦手部的视频，无需改变网络架构。 Result: 在Ego-Exo4D Fine-Grained Keystep Recognition基准测试中表现优于现有方法。 Conclusion: 简单视频处理即可显著提升自我中心活动识别性能。 Abstract: In this paper, we address the challenge of understanding human activities from an egocentric perspective. Traditional activity recognition techniques face unique challenges in egocentric videos due to the highly dynamic nature of the head during many activities. We propose a framework that seeks to address these challenges in a way that is independent of network architecture by restricting the ego-video input to a stabilized, hand-focused video. We demonstrate that this straightforward video transformation alone outperforms existing egocentric video baselines on the Ego-Exo4D Fine-Grained Keystep Recognition benchmark without requiring any alteration of the underlying model infrastructure.

[55] SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers

Zhengcong Fei,Hao Jiang,Di Qiu,Baoxuan Gu,Youqiang Zhang,Jiahua Wang,Jialin Bai,Debang Li,Mingyuan Fan,Guibin Chen,Yahui Zhou

Main category: cs.CV

TL;DR: SkyReels-Audio是一个统一框架，通过多模态输入（文本、图像、视频）生成和编辑音频驱动的说话肖像视频，支持无限长度生成和编辑，并实现高保真和时间一致性。

Details

Motivation: 目前，基于多模态输入的音频驱动说话肖像生成和编辑研究较少，需要一种能够支持多样化控制和高质量输出的框架。 Method: 采用预训练的视频扩散变换器，结合混合课程学习策略和面部掩码损失，引入滑动窗口去噪方法和音频引导的无分类器指导机制。 Result: 在唇同步准确性、身份一致性和真实面部动态方面表现优异，尤其在复杂条件下。 Conclusion: SkyReels-Audio通过多模态输入和先进技术实现了高质量的说话肖像视频生成和编辑。 Abstract: The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remains under explored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. More importantly, we construct a dedicated data pipeline for curating high-quality triplets consisting of synchronized audio, video, and textual descriptions. Comprehensive benchmark evaluations show that SkyReels-Audio achieves superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.

[56] Advancing from Automated to Autonomous Beamline by Leveraging Computer Vision

Baolu Li,Hongkai Yu,Huiming Sun,Jin Ma,Yuewei Lin,Lu Ma,Yonghua Du

Main category: cs.CV

TL;DR: 提出了一种基于计算机视觉的系统，结合深度学习和多视角摄像头，用于同步辐射光束线的实时碰撞检测，以实现自主操作。

Details

Motivation: 当前同步辐射光束线仍依赖人工安全监督，需实现自动化与自主操作的过渡。 Method: 系统采用设备分割、跟踪和几何分析，结合迁移学习增强鲁棒性，并开发交互式标注模块以适应新物体类别。 Result: 在真实光束线数据集上的实验显示高精度、实时性能和自主操作的潜力。 Conclusion: 该系统为实现同步辐射光束线的自主操作提供了有效解决方案。 Abstract: The synchrotron light source, a cutting-edge large-scale user facility, requires autonomous synchrotron beamline operations, a crucial technique that should enable experiments to be conducted automatically, reliably, and safely with minimum human intervention. However, current state-of-the-art synchrotron beamlines still heavily rely on human safety oversight. To bridge the gap between automated and autonomous operation, a computer vision-based system is proposed, integrating deep learning and multiview cameras for real-time collision detection. The system utilizes equipment segmentation, tracking, and geometric analysis to assess potential collisions with transfer learning that enhances robustness. In addition, an interactive annotation module has been developed to improve the adaptability to new object classes. Experiments on a real beamline dataset demonstrate high accuracy, real-time performance, and strong potential for autonomous synchrotron beamline operations.

[57] Towards Predicting Any Human Trajectory In Context

Ryo Fujii,Hideo Saito,Ryo Hachiuma

Main category: cs.CV

TL;DR: TrajICL是一种基于上下文学习的行人轨迹预测框架，无需微调即可快速适应不同场景，通过时空相似性和预测引导的示例选择方法提升性能。

Details

Motivation: 现有方法通常需要针对特定场景数据进行微调，但在边缘设备上计算资源受限，难以实现。TrajICL旨在解决这一问题。 Method: 提出时空相似性示例选择（STES）和预测引导示例选择（PG-ES）方法，结合大规模合成数据集训练模型。 Result: TrajICL在多个公开基准测试中表现优异，适应性强，优于微调方法。 Conclusion: TrajICL为行人轨迹预测提供了一种高效且适应性强的解决方案，适用于不同领域和场景。 Abstract: Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, this process is often impractical on edge devices due to constrained computational resources. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables rapid adaptation without fine-tuning on the scenario-specific data. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. The code will be released at https://fujiry0.github.io/TrajICL-project-page.

[58] Breaking Latent Prior Bias in Detectors for Generalizable AIGC Image Detection

Yue Zhou,Xinan He,KaiQing Lin,Bin Fan,Feng Ding,Bin Li

Main category: cs.CV

TL;DR: 论文提出了一种名为OMAT的对抗训练方法，通过优化扩散模型的初始潜在噪声生成对抗样本，解决了AIGC检测器在未见生成器上的泛化问题。实验表明，该方法显著提升了检测器的跨生成器性能。

Details

Motivation: 现有AIGC检测器在训练生成器上表现优异，但在未见生成器上泛化能力差，主要原因是潜在先验偏差。 Method: 提出On-Manifold Adversarial Training (OMAT)，通过优化扩散模型的初始潜在噪声生成对抗样本，保持其在生成器的输出流形上。 Result: 在GenImage++等基准测试中，OMAT显著提升了检测器的跨生成器性能。 Conclusion: OMAT为未来数据集构建和检测器评估提供了重要见解，有助于开发更鲁棒和通用的AIGC取证方法。 Abstract: Current AIGC detectors often achieve near-perfect accuracy on images produced by the same generator used for training but struggle to generalize to outputs from unseen generators. We trace this failure in part to latent prior bias: detectors learn shortcuts tied to patterns stemming from the initial noise vector rather than learning robust generative artifacts. To address this, we propose On-Manifold Adversarial Training (OMAT): by optimizing the initial latent noise of diffusion models under fixed conditioning, we generate on-manifold adversarial examples that remain on the generator's output manifold-unlike pixel-space attacks, which introduce off-manifold perturbations that the generator itself cannot reproduce and that can obscure the true discriminative artifacts. To test against state-of-the-art generative models, we introduce GenImage++, a test-only benchmark of outputs from advanced generators (Flux.1, SD3) with extended prompts and diverse styles. We apply our adversarial-training paradigm to ResNet50 and CLIP baselines and evaluate across existing AIGC forensic benchmarks and recent challenge datasets. Extensive experiments show that adversarially trained detectors significantly improve cross-generator performance without any network redesign. Our findings on latent-prior bias offer valuable insights for future dataset construction and detector evaluation, guiding the development of more robust and generalizable AIGC forensic methodologies.

[59] Uneven Event Modeling for Partially Relevant Video Retrieval

Sa Zhu,Huashan Chen,Wanqian Zhang,Jinchao Zhang,Zexian Yang,Xiaoshuai Hao,Bo Li

Main category: cs.CV

TL;DR: 提出了一种名为UEM的框架，通过PGVS模块和CAER模块解决了PRVR中事件边界模糊和文本-视频对齐不精确的问题，实验证明其性能优越。

Details

Motivation: 现有方法将视频分割为固定长度的片段，导致事件边界模糊，且使用均值池化计算事件表示，引入不精确的对齐。 Method: 提出UEM框架，包含PGVS模块（基于时间和语义相似性迭代分割视频）和CAER模块（通过文本交叉注意力优化事件表示）。 Result: 在两个PRVR基准测试中达到最优性能。 Conclusion: UEM框架通过明确事件边界和精确对齐，显著提升了PRVR任务的效果。 Abstract: Given a text query, partially relevant video retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments, wherein event modeling is crucial for partitioning the video into smaller temporal events that partially correspond to the text. Previous methods typically segment videos into a fixed number of equal-length clips, resulting in ambiguous event boundaries. Additionally, they rely on mean pooling to compute event representations, inevitably introducing undesired misalignment. To address these, we propose an Uneven Event Modeling (UEM) framework for PRVR. We first introduce the Progressive-Grouped Video Segmentation (PGVS) module, to iteratively formulate events in light of both temporal dependencies and semantic similarity between consecutive frames, enabling clear event boundaries. Furthermore, we also propose the Context-Aware Event Refinement (CAER) module to refine the event representation conditioned the text's cross-attention. This enables event representations to focus on the most relevant frames for a given text, facilitating more precise text-video alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two PRVR benchmarks.

[60] Leveraging CLIP Encoder for Multimodal Emotion Recognition

Yehun Song,Sunyoung Cho

Main category: cs.CV

TL;DR: 提出了一种基于CLIP的多模态情感识别框架MER-CLIP，通过标签编码器和跨模态解码器提升情感特征表示，实验表明其在CMU-MOSI和CMU-MOSEI数据集上优于现有方法。

Details

Motivation: 多模态情感识别（MER）因数据获取受限而性能提升困难，需利用大规模预训练模型的语义知识。 Method: 采用CLIP架构，引入标签编码器处理标签语义，设计跨模态解码器对齐多模态特征。 Result: 在CMU-MOSI和CMU-MOSEI数据集上表现优于现有技术。 Conclusion: MER-CLIP通过语义标签嵌入和跨模态对齐显著提升了情感识别性能。 Abstract: Multimodal emotion recognition (MER) aims to identify human emotions by combining data from various modalities such as language, audio, and vision. Despite the recent advances of MER approaches, the limitations in obtaining extensive datasets impede the improvement of performance. To mitigate this issue, we leverage a Contrastive Language-Image Pre-training (CLIP)-based architecture and its semantic knowledge from massive datasets that aims to enhance the discriminative multimodal representation. We propose a label encoder-guided MER framework based on CLIP (MER-CLIP) to learn emotion-related representations across modalities. Our approach introduces a label encoder that treats labels as text embeddings to incorporate their semantic information, leading to the learning of more representative emotional features. To further exploit label semantics, we devise a cross-modal decoder that aligns each modality to a shared embedding space by sequentially fusing modality features based on emotion-related input from the label encoder. Finally, the label encoder-guided prediction enables generalization across diverse labels by embedding their semantic information as well as word labels. Experimental results show that our method outperforms the state-of-the-art MER methods on the benchmark datasets, CMU-MOSI and CMU-MOSEI.

[61] Towards Edge-Based Idle State Detection in Construction Machinery Using Surveillance Cameras

Xander Küpers,Jeroen Klein Brinke,Rob Bemthuis,Ozlem Durmaz Incel

Main category: cs.CV

TL;DR: 论文提出Edge-IMI框架，用于检测建筑机械的闲置状态，通过边缘计算设备实现高效现场推理。

Details

Motivation: 建筑行业设备利用率低导致成本增加和项目延误，需准确监控设备活动以提高效率。 Method: 框架包含目标检测、跟踪和闲置状态识别三个模块，适用于资源受限的边缘计算设备。 Result: 目标检测F1分数为71.75%，闲置识别模块误报率低，支持实时处理。 Conclusion: Edge-IMI减少对高带宽云服务和昂贵硬件的依赖，适用于实际场景。 Abstract: The construction industry faces significant challenges in optimizing equipment utilization, as underused machinery leads to increased operational costs and project delays. Accurate and timely monitoring of equipment activity is therefore key to identifying idle periods and improving overall efficiency. This paper presents the Edge-IMI framework for detecting idle construction machinery, specifically designed for integration with surveillance camera systems. The proposed solution consists of three components: object detection, tracking, and idle state identification, which are tailored for execution on resource-constrained, CPU-based edge computing devices. The performance of Edge-IMI is evaluated using a combined dataset derived from the ACID and MOCS benchmarks. Experimental results confirm that the object detector achieves an F1 score of 71.75%, indicating robust real-world detection capabilities. The logistic regression-based idle identification module reliably distinguishes between active and idle machinery with minimal false positives. Integrating all three modules, Edge-IMI enables efficient on-site inference, reducing reliance on high-bandwidth cloud services and costly hardware accelerators. We also evaluate the performance of object detection models on Raspberry Pi 5 and an Intel NUC platforms, as example edge computing platforms. We assess the feasibility of real-time processing and the impact of model optimization techniques.

[62] DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation

Xianbing Sun,Yan Hong,Jiahui Zhan,Jun Lan,Huijia Zhu,Weiqiang Wang,Liqing Zhang,Jianfu Zhang

Main category: cs.CV

TL;DR: DS-VTON是一个双尺度虚拟试穿框架，通过分阶段处理解决了服装对齐和纹理保留的挑战。

Details

Motivation: 现有虚拟试穿方法难以同时实现服装与人体准确对齐及保留精细纹理。 Method: 采用两阶段设计：首阶段生成低分辨率结果以捕捉语义对应；次阶段通过残差引导扩散过程重建高分辨率输出。 Result: 在多个标准虚拟试穿基准测试中，DS-VTON在结构对齐和纹理保留方面表现最佳。 Conclusion: DS-VTON通过双尺度设计和无掩模生成范式，显著提升了虚拟试穿的效果。 Abstract: Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. In this paper, we propose DS-VTON, a dual-scale virtual try-on framework that explicitly disentangles these objectives for more effective modeling. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. The second stage introduces a residual-guided diffusion process that reconstructs high-resolution outputs by refining the residual between the two scales, focusing on texture fidelity. In addition, our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks. By leveraging the semantic priors embedded in pretrained diffusion models, this design more effectively preserves the person's appearance and geometric consistency. Extensive experiments demonstrate that DS-VTON achieves state-of-the-art performance in both structural alignment and texture preservation across multiple standard virtual try-on benchmarks.

[63] 3D Skeleton-Based Action Recognition: A Review

Mengyuan Liu,Hong Liu,Qianshuo Hu,Bin Ren,Junsong Yuan,Jiaying Lin,Jiajun Wen

Main category: cs.CV

TL;DR: 本文提出了一种任务导向的框架，全面分析3D骨架动作识别，强调预处理、特征提取和时空建模等子任务，并探讨了最新技术进展。

Details

Motivation: 以往研究多从模型角度出发，忽略了骨架动作识别的关键步骤，本文旨在填补这一空白，提供更深入的理解。 Method: 将任务分解为子任务，包括预处理、特征提取和时空建模，并分析最新技术如混合架构、Mamba模型和生成模型。 Result: 提供了公共3D骨架数据集的全面概述，并评估了最新算法的性能。 Conclusion: 本文为3D骨架动作识别领域提供了结构化的路线图，推动了该领域的理解和进展。 Abstract: With the inherent advantages of skeleton representation, 3D skeleton-based action recognition has become a prominent topic in the field of computer vision. However, previous reviews have predominantly adopted a model-oriented perspective, often neglecting the fundamental steps involved in skeleton-based action recognition. This oversight tends to ignore key components of skeleton-based action recognition beyond model design and has hindered deeper, more intrinsic understanding of the task. To bridge this gap, our review aims to address these limitations by presenting a comprehensive, task-oriented framework for understanding skeleton-based action recognition. We begin by decomposing the task into a series of sub-tasks, placing particular emphasis on preprocessing steps such as modality derivation and data augmentation. The subsequent discussion delves into critical sub-tasks, including feature extraction and spatio-temporal modeling techniques. Beyond foundational action recognition networks, recently advanced frameworks such as hybrid architectures, Mamba models, large language models (LLMs), and generative models have also been highlighted. Finally, a comprehensive overview of public 3D skeleton datasets is presented, accompanied by an analysis of state-of-the-art algorithms evaluated on these benchmarks. By integrating task-oriented discussions, comprehensive examinations of sub-tasks, and an emphasis on the latest advancements, our review provides a fundamental and accessible structured roadmap for understanding and advancing the field of 3D skeleton-based action recognition.

[64] Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times

Olga Loginova,Sofía Ortega Loguinova

Main category: cs.CV

TL;DR: 论文介绍了多语言数据集Perfect Times，用于评估视频语言模型在时间推理上的表现，发现现有模型难以模拟人类的时间与因果推理。

Details

Motivation: 研究人类如何通过语言和视觉线索区分已完成和进行中的动作，并评估视频语言模型在此类任务上的表现。 Method: 构建了多语言（英语、意大利语、俄语、日语）的多选题数据集Perfect Times，结合视频和事件完成标签，测试模型的时间推理能力。 Result: 实验显示，现有模型在视频中难以实现类似人类的时间与因果推理。 Conclusion: 研究强调了整合多模态线索的重要性，为视频语言模型的时间推理评估设定了新标准。 Abstract: Human perception of events is intrinsically tied to distinguishing between completed (perfect and telic) and ongoing (durative) actions, a process mediated by both linguistic structure and visual cues. In this work, we introduce the \textbf{Perfect Times} dataset, a novel, quadrilingual (English, Italian, Russian, and Japanese) multiple-choice question-answering benchmark designed to assess video-language models (VLMs) on temporal reasoning. By pairing everyday activity videos with event completion labels and perfectivity-tailored distractors, our dataset probes whether models truly comprehend temporal dynamics or merely latch onto superficial markers. Experimental results indicate that state-of-the-art models, despite their success on text-based tasks, struggle to mirror human-like temporal and causal reasoning grounded in video. This study underscores the necessity of integrating deep multimodal cues to capture the nuances of action duration and completion within temporal and causal video dynamics, setting a new standard for evaluating and advancing temporal reasoning in VLMs.

[65] Deformable registration and generative modelling of aortic anatomies by auto-decoders and neural ODEs

Riccardo Tenderini,Luca Pegolotti,Fanwei Kong,Stefano Pagani,Francesco Regazzoni,Alison L. Marsden,Simone Deparis

Main category: cs.CV

TL;DR: AD-SVFD是一种深度学习模型，用于血管形状的可变形配准和合成解剖结构的生成。它通过加权点云表示几何形状，并利用ODE解建模空间变形，具有高效的权重共享和生成能力。

Details

Motivation: 解决血管形状的可变形配准问题，并生成合成解剖结构，以支持医学研究和应用。 Method: 使用加权点云表示几何形状，通过ODE解建模空间变形，采用自动解码器结构优化参数，并通过隐式形状表示支持生成应用。 Result: 在健康主动脉解剖结构上展示了高质量的结果，计算成本低且精度高。 Conclusion: AD-SVFD是一种高效且精确的模型，适用于血管形状配准和合成解剖结构生成。 Abstract: This work introduces AD-SVFD, a deep learning model for the deformable registration of vascular shapes to a pre-defined reference and for the generation of synthetic anatomies. AD-SVFD operates by representing each geometry as a weighted point cloud and models ambient space deformations as solutions at unit time of ODEs, whose time-independent right-hand sides are expressed through artificial neural networks. The model parameters are optimized by minimizing the Chamfer Distance between the deformed and reference point clouds, while backward integration of the ODE defines the inverse transformation. A distinctive feature of AD-SVFD is its auto-decoder structure, that enables generalization across shape cohorts and favors efficient weight sharing. In particular, each anatomy is associated with a low-dimensional code that acts as a self-conditioning field and that is jointly optimized with the network parameters during training. At inference, only the latent codes are fine-tuned, substantially reducing computational overheads. Furthermore, the use of implicit shape representations enables generative applications: new anatomies can be synthesized by suitably sampling from the latent space and applying the corresponding inverse transformations to the reference geometry. Numerical experiments, conducted on healthy aortic anatomies, showcase the high-quality results of AD-SVFD, which yields extremely accurate approximations at competitive computational costs.

[66] TIGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction

Yiyao Huang,Zhedong Zheng,Yu Ziwei,Yaxiong Wang,Tze Ho Elden Tse,Angela Yao

Main category: cs.CV

TL;DR: TIGeR框架通过文本驱动生成和视觉引导优化，解决了预定义3D模板在重建手-物体交互中的局限，提升了适应性和鲁棒性。

Details

Motivation: 预定义3D模板需要大量人工且适应性差，尤其在遮挡场景中表现不佳。 Method: 采用两阶段框架：文本指令生成先验，再通过2D-3D协作注意力优化形状。 Result: 在Dex-YCB和Obman数据集上表现优异，Chamfer距离分别为1.979和5.468，优于无模板方法。 Conclusion: TIGeR在遮挡场景中表现鲁棒，且兼容多种先验来源，具有实际部署潜力。 Abstract: Pre-defined 3D object templates are widely used in 3D reconstruction of hand-object interactions. However, they often require substantial manual efforts to capture or source, and inherently restrict the adaptability of models to unconstrained interaction scenarios, e.g., heavily-occluded objects. To overcome this bottleneck, we propose a new Text-Instructed Generation and Refinement (TIGeR) framework, harnessing the power of intuitive text-driven priors to steer the object shape refinement and pose estimation. We use a two-stage framework: a text-instructed prior generation and vision-guided refinement. As the name implies, we first leverage off-the-shelf models to generate shape priors according to the text description without tedious 3D crafting. Considering the geometric gap between the synthesized prototype and the real object interacted with the hand, we further calibrate the synthesized prototype via 2D-3D collaborative attention. TIGeR achieves competitive performance, i.e., 1.979 and 5.468 object Chamfer distance on the widely-used Dex-YCB and Obman datasets, respectively, surpassing existing template-free methods. Notably, the proposed framework shows robustness to occlusion, while maintaining compatibility with heterogeneous prior sources, e.g., retrieved hand-crafted prototypes, in practical deployment scenarios.

[67] Continual-MEGA: A Large-scale Benchmark for Generalizable Continual Anomaly Detection

Geonu Lee,Yujeong Oh,Geonhui Jang,Soyoung Lee,Jeonghyo Song,Sungmin Cha,YoungJoon Yoo

Main category: cs.CV

TL;DR: 本文提出了一个名为Continual-MEGA的新基准，用于持续学习中的异常检测，旨在更真实地反映实际部署场景。

Details

Motivation: 现有的持续学习评估设置不足以反映真实世界的复杂性，因此需要一个新的基准来扩展数据集并引入新的问题设置。 Method: 提出了Continual-MEGA基准，结合现有数据集和新数据集ContinualAD，并设计了一种新的零样本泛化场景。同时提出了一种统一的基线算法。 Result: 评估显示：(1)现有方法在像素级缺陷定位上有改进空间；(2)提出的方法优于现有方法；(3)ContinualAD数据集提升了模型性能。 Conclusion: Continual-MEGA基准及其算法为持续学习中的异常检测提供了新的研究方向，并公开了代码和数据集。 Abstract: In this paper, we introduce a new benchmark for continual learning in anomaly detection, aimed at better reflecting real-world deployment scenarios. Our benchmark, Continual-MEGA, includes a large and diverse dataset that significantly expands existing evaluation settings by combining carefully curated existing datasets with our newly proposed dataset, ContinualAD. In addition to standard continual learning with expanded quantity, we propose a novel scenario that measures zero-shot generalization to unseen classes, those not observed during continual adaptation. This setting poses a new problem setting that continual adaptation also enhances zero-shot performance. We also present a unified baseline algorithm that improves robustness in few-shot detection and maintains strong generalization. Through extensive evaluations, we report three key findings: (1) existing methods show substantial room for improvement, particularly in pixel-level defect localization; (2) our proposed method consistently outperforms prior approaches; and (3) the newly introduced ContinualAD dataset enhances the performance of strong anomaly detection models. We release the benchmark and code in https://github.com/Continual-Mega/Continual-Mega.

[68] Camera Trajectory Generation: A Comprehensive Survey of Methods, Metrics, and Future Directions

Zahra Dehghanian,Pouya Ardekhani,Amir Vahedi,Hamid Beigy,Hamid R. Rabiee

Main category: cs.CV

TL;DR: 本文首次全面综述了相机轨迹生成领域，涵盖基础定义到高级方法，并分析了评估指标与数据集，指出了当前研究的局限性与未来机会。

Details

Motivation: 相机轨迹生成在多个领域至关重要，但缺乏系统性综述，本文旨在填补这一空白。 Method: 综述了从基于规则的方法到优化技术、机器学习及混合方法的多种轨迹生成模型，并分析了相关评估工具。 Result: 提供了领域内的全面知识整合，指出了现有局限和未来研究方向。 Conclusion: 本文为研究者提供了基础资源，并推动了跨领域相机轨迹系统的创新与发展。 Abstract: Camera trajectory generation is a cornerstone in computer graphics, robotics, virtual reality, and cinematography, enabling seamless and adaptive camera movements that enhance visual storytelling and immersive experiences. Despite its growing prominence, the field lacks a systematic and unified survey that consolidates essential knowledge and advancements in this domain. This paper addresses this gap by providing the first comprehensive review of the field, covering from foundational definitions to advanced methodologies. We introduce the different approaches to camera representation and present an in-depth review of available camera trajectory generation models, starting with rule-based approaches and progressing through optimization-based techniques, machine learning advancements, and hybrid methods that integrate multiple strategies. Additionally, we gather and analyze the metrics and datasets commonly used for evaluating camera trajectory systems, offering insights into how these tools measure performance, aesthetic quality, and practical applicability. Finally, we highlight existing limitations, critical gaps in current research, and promising opportunities for investment and innovation in the field. This paper not only serves as a foundational resource for researchers entering the field but also paves the way for advancing adaptive, efficient, and creative camera trajectory systems across diverse applications.

[69] CAPAA: Classifier-Agnostic Projector-Based Adversarial Attack

Zhan Li,Mingyu Zhao,Xin Dong,Haibin Ling,Bingyao Huang

Main category: cs.CV

TL;DR: CAPAA提出了一种分类器无关的投影对抗攻击方法，通过多分类器梯度和注意力机制提升攻击效果和隐蔽性。

Details

Motivation: 现有方法局限于单一分类器和固定相机姿态，无法适应多分类器和动态相机姿态场景。 Method: 设计了分类器无关的对抗损失和优化框架，结合注意力机制加权梯度。 Result: 实验表明CAPAA在攻击成功率和隐蔽性上优于基线方法。 Conclusion: CAPAA为多分类器和动态相机姿态场景提供了有效的对抗攻击解决方案。 Abstract: Projector-based adversarial attack aims to project carefully designed light patterns (i.e., adversarial projections) onto scenes to deceive deep image classifiers. It has potential applications in privacy protection and the development of more robust classifiers. However, existing approaches primarily focus on individual classifiers and fixed camera poses, often neglecting the complexities of multi-classifier systems and scenarios with varying camera poses. This limitation reduces their effectiveness when introducing new classifiers or camera poses. In this paper, we introduce Classifier-Agnostic Projector-Based Adversarial Attack (CAPAA) to address these issues. First, we develop a novel classifier-agnostic adversarial loss and optimization framework that aggregates adversarial and stealthiness loss gradients from multiple classifiers. Then, we propose an attention-based gradient weighting mechanism that concentrates perturbations on regions of high classification activation, thereby improving the robustness of adversarial projections when applied to scenes with varying camera poses. Our extensive experimental evaluations demonstrate that CAPAA achieves both a higher attack success rate and greater stealthiness compared to existing baselines. Codes are available at: https://github.com/ZhanLiQxQ/CAPAA.

[70] IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

Wayne Zhang,Changjiang Jiang,Zhonghao Zhang,Chenyang Si,Fengchang Yu,Wei Peng

Main category: cs.CV

TL;DR: 论文提出了一种统一且可解释的多模态AIGC检测数据集IVY-FAKE，并基于此设计了IVY-XDETECTOR模型，实现了图像和视频内容的联合检测与解释。

Details

Motivation: 当前AIGC检测方法多为黑盒二元分类器，缺乏可解释性且不支持多模态统一检测，影响了模型的透明度和实际应用。 Method: 构建了大规模数据集IVY-FAKE，并提出IVY-XDETECTOR模型，结合视觉-语言模型实现多模态检测与解释。 Result: 模型在多个图像和视频检测基准上达到最优性能，数据集包含15万训练样本和1.87万评估样本。 Conclusion: IVY-FAKE和IVY-XDETECTOR显著提升了AIGC检测的可解释性和多模态统一检测能力。 Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) in visual domains has resulted in highly realistic synthetic images and videos, driven by sophisticated generative frameworks such as diffusion-based architectures. While these breakthroughs open substantial opportunities, they simultaneously raise critical concerns about content authenticity and integrity. Many current AIGC detection methods operate as black-box binary classifiers, which offer limited interpretability, and no approach supports detecting both images and videos in a unified framework. This dual limitation compromises model transparency, reduces trustworthiness, and hinders practical deployment. To address these challenges, we introduce IVY-FAKE , a novel, unified, and large-scale dataset specifically designed for explainable multimodal AIGC detection. Unlike prior benchmarks, which suffer from fragmented modality coverage and sparse annotations, IVY-FAKE contains over 150,000 richly annotated training samples (images and videos) and 18,700 evaluation examples, each accompanied by detailed natural-language reasoning beyond simple binary labels. Building on this, we propose Ivy Explainable Detector (IVY-XDETECTOR), a unified AIGC detection and explainable architecture that jointly performs explainable detection for both image and video content. Our unified vision-language model achieves state-of-the-art performance across multiple image and video detection benchmarks, highlighting the significant advancements enabled by our dataset and modeling framework. Our data is publicly available at https://huggingface.co/datasets/AI-Safeguard/Ivy-Fake.

[71] GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs

Xiaorong Zhu,Ziheng Jia,Jiarui Wang,Xiangyu Zhao,Haodong Duan,Xiongkuo Min,Jia Wang,Zicheng Zhang,Guangtao Zhai

Main category: cs.CV

TL;DR: GOBench是首个系统评估多模态大语言模型（MLLMs）在几何光学领域能力的基准测试，涵盖生成光学真实图像和理解光学现象两项任务。实验显示当前模型在这两方面均表现不佳。

Details

Motivation: 多模态大语言模型在视觉理解和生成方面进展迅速，但其在细粒度物理原理（如几何光学）上的能力尚未被充分评估。 Method: 通过构建GOBench-Gen-1k数据集，结合主观实验评估生成图像的光学真实性、美学质量和指令遵循性；同时设计评估指令测试11种主流MLLMs的光学理解能力。 Result: 当前模型在光学生成和理解任务中表现较差，最佳生成模型GPT-4o-Image无法完美完成任务，最佳理解模型Gemini-2.5Pro准确率仅为37.35%。 Conclusion: MLLMs在几何光学领域的生成和理解能力仍有显著不足，需进一步优化。 Abstract: The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their capabilities, concerning the fine-grained physical principles especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs' ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curates high-quality prompts of geometric optical scenarios and use MLLMs to construct GOBench-Gen-1k dataset.We then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing MLLMs' generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM model, Gemini-2.5Pro, attains a mere 37.35\% accuracy in optical understanding.

[72] Quotient Network -- A Network Similar to ResNet but Learning Quotients

Peng Hui,Jiamuyang Zhao,Changxin Li,Qingzhen Zhu

Main category: cs.CV

TL;DR: 论文提出了一种基于商数学习的网络（Quotient Network），解决了ResNet中特征差异学习的问题，并通过实验验证了其优于ResNet的性能。

Details

Motivation: ResNet通过学习特征差异来训练深度网络，但差异缺乏独立意义且对特征大小敏感。本文旨在解决这些问题。 Method: 提出商数网络，学习目标特征与现有特征的商数，并设计训练规则以提高性能。 Result: 在CIFAR10、CIFAR100和SVHN数据集上，商数网络无需新增参数即可稳定优于ResNet。 Conclusion: 商数网络有效解决了ResNet的问题，并在性能和训练效率上表现更优。 Abstract: The emergence of ResNet provides a powerful tool for training extremely deep networks. The core idea behind it is to change the learning goals of the network. It no longer learns new features from scratch but learns the difference between the target and existing features. However, the difference between the two kinds of features does not have an independent and clear meaning, and the amount of learning is based on the absolute rather than the relative difference, which is sensitive to the size of existing features. We propose a new network that perfectly solves these two problems while still having the advantages of ResNet. Specifically, it chooses to learn the quotient of the target features with the existing features, so we call it the quotient network. In order to enable this network to learn successfully and achieve higher performance, we propose some design rules for this network so that it can be trained efficiently and achieve better performance than ResNet. Experiments on the CIFAR10, CIFAR100, and SVHN datasets prove that this network can stably achieve considerable improvements over ResNet by simply making tiny corresponding changes to the original ResNet network without adding new parameters.

[73] FlexSelect: Flexible Token Selection for Efficient Long Video Understanding

Yunzhu Zhang,Yu Lu,Tianyi Wang,Fengyun Rao,Yi Yang,Linchao Zhu

Main category: cs.CV

TL;DR: FlexSelect是一种灵活高效的令牌选择策略，用于处理长视频，通过跨模态注意力模式识别并保留最相关的内容，显著提升视频大语言模型的效率和性能。

Details

Motivation: 长视频理解对视频大语言模型（VideoLLMs）提出了高计算和内存需求的挑战，需要一种高效的方法来减少冗余计算。 Method: FlexSelect包括无训练的令牌排名管道和轻量级选择器，利用跨模态注意力权重估计令牌重要性并过滤冗余令牌。 Result: 在多个长视频基准测试中表现优异，显著提升了处理速度（如LLaVA-Video-7B模型速度提升达9倍）。 Conclusion: FlexSelect作为一种即插即用模块，能有效扩展视频大语言模型的上下文长度，提升长视频理解效率。 Abstract: Long-form video understanding poses a significant challenge for video large language models (VideoLLMs) due to prohibitively high computational and memory demands. In this paper, we propose FlexSelect, a flexible and efficient token selection strategy for processing long videos. FlexSelect identifies and retains the most semantically relevant content by leveraging cross-modal attention patterns from a reference transformer layer. It comprises two key components: (1) a training-free token ranking pipeline that leverages faithful cross-modal attention weights to estimate each video token's importance, and (2) a rank-supervised lightweight selector that is trained to replicate these rankings and filter redundant tokens. This generic approach can be seamlessly integrated into various VideoLLM architectures, such as LLaVA-Video, InternVL and Qwen-VL, serving as a plug-and-play module to extend their temporal context length. Empirically, FlexSelect delivers strong gains across multiple long-video benchmarks including VideoMME, MLVU, LongVB, and LVBench. Moreover, it achieves significant speed-ups (for example, up to 9 times on a LLaVA-Video-7B model), highlighting FlexSelect's promise for efficient long-form video understanding. Project page available at: https://yunzhuzhang0918.github.io/flex_select

[74] Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

Kinam Kim,Junha Hyung,Jaegul Choo

Main category: cs.CV

TL;DR: TIC-FT是一种高效且通用的方法，用于适应预训练的视频扩散模型到多样化的条件生成任务，无需架构修改，仅需少量样本即可实现高性能。

Details

Motivation: 现有方法依赖于外部编码器或架构修改，需要大数据集且灵活性受限，TIC-FT旨在解决这些问题。 Method: 通过沿时间轴连接条件和目标帧，并插入噪声逐渐增加的缓冲帧，实现平滑过渡和高效微调。 Result: 在多种任务中表现优异，条件保真度和视觉质量均优于现有基线，且训练和推理效率高。 Conclusion: TIC-FT是一种高效、灵活且可扩展的条件生成方法，适用于有限数据和计算资源的情况。 Abstract: Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/

Min Je Kim,Muhammad Munsif,Altaf Hussain,Hikmat Yar,Sung Wook Baik

Main category: cs.CV

TL;DR: 论文提出了一种名为MJ-COCO的改进版MS-COCO数据集，通过自动化方法修正了原数据集的标注错误，显著提升了目标检测模型的性能。

Details

Motivation: MS-COCO数据集存在标注错误（如缺失标签、类别错误、边界框不准确等），影响了模型的训练和泛化能力。为了解决这些问题，作者提出了一个综合的标注修正框架。 Method: 采用基于损失和梯度的错误检测方法，结合四阶段伪标签修正流程：边界框生成、重复去除与置信度合并、类别一致性验证、空间调整。 Result: 实验表明，使用MJ-COCO训练的模型在多个验证数据集上表现优于MS-COCO，平均精度（AP）和APS指标均有提升，且小目标标注数量增加了20万以上。 Conclusion: MJ-COCO通过自动化修正标注错误，显著提升了目标检测模型的性能和可靠性，为未来研究提供了更高质量的数据集基准。 Abstract: Benchmark object detection (OD) datasets play a pivotal role in advancing computer vision applications such as autonomous driving, and surveillance, as well as in training and evaluating deep learning-based state-of-the-art detection models. Among them, MS-COCO has become a standard benchmark due to its diverse object categories and complex scenes. However, despite its wide adoption, MS-COCO suffers from various annotation issues, including missing labels, incorrect class assignments, inaccurate bounding boxes, duplicate labels, and group labeling inconsistencies. These errors not only hinder model training but also degrade the reliability and generalization of OD models. To address these challenges, we propose a comprehensive refinement framework and present MJ-COCO, a newly re-annotated version of MS-COCO. Our approach begins with loss and gradient-based error detection to identify potentially mislabeled or hard-to-learn samples. Next, we apply a four-stage pseudo-labeling refinement process: (1) bounding box generation using invertible transformations, (2) IoU-based duplicate removal and confidence merging, (3) class consistency verification via expert objects recognizer, and (4) spatial adjustment based on object region activation map analysis. This integrated pipeline enables scalable and accurate correction of annotation errors without manual re-labeling. Extensive experiments were conducted across four validation datasets: MS-COCO, Sama COCO, Objects365, and PASCAL VOC. Models trained on MJ-COCO consistently outperformed those trained on MS-COCO, achieving improvements in Average Precision (AP) and APS metrics. MJ-COCO also demonstrated significant gains in annotation coverage: for example, the number of small object annotations increased by more than 200,000 compared to MS-COCO.

[76] Motion-Aware Concept Alignment for Consistent Video Editing

Tong Zhang,Juan C Leon Alcazar,Bernard Ghanem

Main category: cs.CV

TL;DR: MoCA-Video是一种无需训练的框架，通过将参考图像的语义特征注入视频中的特定对象，同时保留原始运动和视觉上下文，实现了图像域语义混合与视频的桥接。

Details

Motivation: 解决图像域语义混合与视频之间的差距，实现无需训练的高质量视频合成。 Method: 利用对角去噪调度和类别无关分割在潜在空间中检测和跟踪对象，并结合动量语义校正和伽马残差噪声稳定技术确保时间一致性。 Result: 在自建数据集上表现优于现有基线，实现了更高的空间一致性和时间连贯性，以及显著更高的CASS评分。 Conclusion: MoCA-Video表明，通过结构化操作扩散噪声轨迹，可以实现可控的高质量视频合成。 Abstract: We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video, while preserving the original motion and visual context. Our approach leverages a diagonal denoising schedule and class-agnostic segmentation to detect and track objects in the latent space and precisely control the spatial location of the blended objects. To ensure temporal coherence, we incorporate momentum-based semantic corrections and gamma residual noise stabilization for smooth frame transitions. We evaluate MoCA's performance using the standard SSIM, image-level LPIPS, temporal LPIPS, and introduce a novel metric CASS (Conceptual Alignment Shift Score) to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames. Using self-constructed dataset, MoCA-Video outperforms current baselines, achieving superior spatial consistency, coherent motion, and a significantly higher CASS score, despite having no training or fine-tuning. MoCA-Video demonstrates that structured manipulation in the diffusion noise trajectory allows for controllable, high-quality video synthesis.

[77] AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Yuyuan Liu,Yuanhong Chen,Chong Wang,Junlin Han,Junde Wu,Can Peng,Jingkun Chen,Yu Tian,Gustavo Carneiro

Main category: cs.CV

TL;DR: AuralSAM2通过AuralFuser模块将音频与视觉特征融合，提升SAM2在多模态场景下的分割性能。

Details

Motivation: 现有方法在音频与视觉模态融合上效率低且定位不精确，忽视了语义交互。 Method: 提出AuralFuser模块，结合特征金字塔和音频引导对比学习，优化跨模态融合。 Result: 在公开基准测试中表现优于现有方法。 Conclusion: AuralSAM2有效解决了音频与视觉模态融合的挑战，提升了分割精度。 Abstract: Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches mainly follow two directions: (1) injecting adapters into the image encoder to receive audio signals, which incurs efficiency costs during prompt engineering, and (2) leveraging additional foundation models to generate visual prompts for the sounding objects, which are often imprecisely localised, leading to misguidance in SAM2. Moreover, these methods overlook the rich semantic interplay between hierarchical visual features and other modalities, resulting in suboptimal cross-modal fusion. In this work, we propose AuralSAM2, comprising the novel AuralFuser module, which externally attaches to SAM2 to integrate features from different modalities and generate feature-level prompts, guiding SAM2's decoder in segmenting sounding targets. Such integration is facilitated by a feature pyramid, further refining semantic understanding and enhancing object awareness in multimodal scenarios. Additionally, the audio-guided contrastive learning is introduced to explicitly align audio and visual representations and to also mitigate biases caused by dominant visual patterns. Results on public benchmarks show that our approach achieves remarkable improvements over the previous methods in the field. Code is available at https://github.com/yyliu01/AuralSAM2.

[78] Modality Translation and Registration of MR and Ultrasound Images Using Diffusion Models

Xudong Ma,Nantheera Anantrasirichai,Stefanos Bolomytis,Alin Achim

Main category: cs.CV

TL;DR: 提出了一种基于层次特征解耦的解剖一致性模态转换网络（ACMT），用于解决多模态MR-US图像配准中的模态差异问题。

Details

Motivation: 多模态MR-US配准对前列腺癌诊断至关重要，但现有方法难以对齐关键边界且对无关细节敏感。 Method: 通过浅层特征保持纹理一致性，深层特征保留边界，并引入中间伪模态设计，将MR和US图像转换到该中间域。 Result: 实验表明，该方法有效减少模态差异并保留解剖边界，定量评估显示其模态相似性优于现有方法。 Conclusion: ACMT框架在多模态前列腺图像配准中表现出色，下游配准实验验证了其鲁棒性。 Abstract: Multimodal MR-US registration is critical for prostate cancer diagnosis. However, this task remains challenging due to significant modality discrepancies. Existing methods often fail to align critical boundaries while being overly sensitive to irrelevant details. To address this, we propose an anatomically coherent modality translation (ACMT) network based on a hierarchical feature disentanglement design. We leverage shallow-layer features for texture consistency and deep-layer features for boundary preservation. Unlike conventional modality translation methods that convert one modality into another, our ACMT introduces the customized design of an intermediate pseudo modality. Both MR and US images are translated toward this intermediate domain, effectively addressing the bottlenecks faced by traditional translation methods in the downstream registration task. Experiments demonstrate that our method mitigates modality-specific discrepancies while preserving crucial anatomical boundaries for accurate registration. Quantitative evaluations show superior modality similarity compared to state-of-the-art modality translation methods. Furthermore, downstream registration experiments confirm that our translated images achieve the best alignment performance, highlighting the robustness of our framework for multi-modal prostate image registration.

Yanyuan Qiao,Haodong Hong,Wenqi Lyu,Dong An,Siqi Zhang,Yutong Xie,Xinyu Wang,Qi Wu

Main category: cs.CV

TL;DR: NavBench是一个评估多模态大语言模型（MLLMs）在零样本设置下具身导航能力的基准测试，包含导航理解和逐步执行两部分，发现GPT-4o表现优异，但多数模型在时间理解上存在困难。

Details

Motivation: 探索MLLMs在具身环境中的理解和行动能力，填补现有研究的空白。 Method: 通过NavBench评估MLLMs的导航能力，包括导航理解和逐步执行任务，并设计了一个将MLLMs输出转化为机器人动作的流程。 Result: GPT-4o表现优异，轻量级开源模型在简单任务中表现良好；模型理解能力与执行性能正相关；地图上下文提升决策准确性；多数模型在时间理解上表现不佳。 Conclusion: MLLMs在具身导航中展现出潜力，但时间理解仍是主要挑战，未来需进一步优化模型能力。 Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong generalization in vision-language tasks, yet their ability to understand and act within embodied environments remains underexplored. We present NavBench, a benchmark to evaluate the embodied navigation capabilities of MLLMs under zero-shot settings. NavBench consists of two components: (1) navigation comprehension, assessed through three cognitively grounded tasks including global instruction alignment, temporal progress estimation, and local observation-action reasoning, covering 3,200 question-answer pairs; and (2) step-by-step execution in 432 episodes across 72 indoor scenes, stratified by spatial, cognitive, and execution complexity. To support real-world deployment, we introduce a pipeline that converts MLLMs' outputs into robotic actions. We evaluate both proprietary and open-source models, finding that GPT-4o performs well across tasks, while lighter open-source models succeed in simpler cases. Results also show that models with higher comprehension scores tend to achieve better execution performance. Providing map-based context improves decision accuracy, especially in medium-difficulty scenarios. However, most models struggle with temporal understanding, particularly in estimating progress during navigation, which may pose a key challenge.

[80] Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution

Shijun Shi,Jing Xu,Lijing Lu,Zhihang Li,Kai Hu

Main category: cs.CV

TL;DR: 提出了一种基于自监督学习和Mamba的噪声鲁棒性视频超分辨率框架，通过增强扩散模型和引入自监督ControlNet，显著提升了真实世界视频超分辨率的效果。

Details

Motivation: 现有基于扩散的视频超分辨率方法因随机性易引入复杂退化和明显伪影，需改进以提升鲁棒性和质量。 Method: 结合自监督学习和Mamba到预训练潜在扩散模型，引入全局时空注意力机制和自监督ControlNet，采用三阶段训练策略。 Result: 在真实世界视频超分辨率基准数据集上取得了优于现有技术的感知质量。 Conclusion: 提出的模型设计和训练策略有效提升了视频超分辨率的鲁棒性和生成质量。 Abstract: Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves superior perceptual quality than state-of-the-arts on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.

[81] ECP-Mamba: An Efficient Multi-scale Self-supervised Contrastive Learning Method with State Space Model for PolSAR Image Classification

Zuzheng Kuang,Haixia Bi,Chen Xu,Jian Sun

Main category: cs.CV

TL;DR: ECP-Mamba是一个高效框架，结合多尺度自监督对比学习和状态空间模型（SSM），解决了PolSAR图像分类中标注数据稀缺和计算效率低的问题。

Details

Motivation: 当前基于深度学习的PolSAR分类方法依赖大量标注数据且计算效率低，ECP-Mamba旨在解决这些问题。 Method: 提出多尺度自监督对比学习任务和Mamba架构（选择性SSM），采用螺旋扫描策略和轻量级Cross Mamba模块。 Result: 在四个基准数据集上表现优异，Flevoland 1989数据集上总体准确率达99.70%。 Conclusion: ECP-Mamba在精度和资源效率间取得了平衡，为PolSAR分类提供了高效解决方案。 Abstract: Recently, polarimetric synthetic aperture radar (PolSAR) image classification has been greatly promoted by deep neural networks. However,current deep learning-based PolSAR classification methods encounter difficulties due to its dependence on extensive labeled data and the computational inefficiency of architectures like Transformers. This paper presents ECP-Mamba, an efficient framework integrating multi-scale self-supervised contrastive learning with a state space model (SSM) backbone. Specifically, ECP-Mamba addresses annotation scarcity through a multi-scale predictive pretext task based on local-to-global feature correspondences, which uses a simplified self-distillation paradigm without negative sample pairs. To enhance computational efficiency,the Mamba architecture (a selective SSM) is first tailored for pixel-wise PolSAR classification task by designing a spiral scan strategy. This strategy prioritizes causally relevant features near the central pixel, leveraging the localized nature of pixel-wise classification tasks. Additionally, the lightweight Cross Mamba module is proposed to facilitates complementary multi-scale feature interaction with minimal overhead. Extensive experiments across four benchmark datasets demonstrate ECP-Mamba's effectiveness in balancing high accuracy with resource efficiency. On the Flevoland 1989 dataset, ECP-Mamba achieves state-of-the-art performance with an overall accuracy of 99.70%, average accuracy of 99.64% and Kappa coefficient of 99.62e-2. Our code will be available at https://github.com/HaixiaBi1982/ECP_Mamba.

[82] AceVFI: A Comprehensive Survey of Advances in Video Frame Interpolation

Dahyeon Kye,Changhyun Roh,Sukhun Ko,Chanho Eom,Jihyong Oh

Main category: cs.CV

TL;DR: AceVFI是一篇关于视频帧插值（VFI）的全面综述，涵盖250多篇论文，系统整理了VFI方法、挑战、数据集、应用及未来方向。

Details

Motivation: 视频帧插值是低级视觉任务的核心问题，现有方法多样但缺乏系统总结，AceVFI旨在填补这一空白。 Method: 综述整理了VFI的多种方法（如基于流、GAN、Transformer等），分类为CTFI和ATFI，并分析关键挑战（如大运动、遮挡）。 Result: 详细总结了VFI的技术特点、数据集、评估指标，并探讨了其在医学图像等领域的应用。 Conclusion: AceVFI为VFI领域提供了统一参考，并指出了未来研究方向，推动领域持续发展。 Abstract: Video Frame Interpolation (VFI) is a fundamental Low-Level Vision (LLV) task that synthesizes intermediate frames between existing ones while maintaining spatial and temporal coherence. VFI techniques have evolved from classical motion compensation-based approach to deep learning-based approach, including kernel-, flow-, hybrid-, phase-, GAN-, Transformer-, Mamba-, and more recently diffusion model-based approach. We introduce AceVFI, the most comprehensive survey on VFI to date, covering over 250+ papers across these approaches. We systematically organize and describe VFI methodologies, detailing the core principles, design assumptions, and technical characteristics of each approach. We categorize the learning paradigm of VFI methods namely, Center-Time Frame Interpolation (CTFI) and Arbitrary-Time Frame Interpolation (ATFI). We analyze key challenges of VFI such as large motion, occlusion, lighting variation, and non-linear motion. In addition, we review standard datasets, loss functions, evaluation metrics. We examine applications of VFI including event-based, cartoon, medical image VFI and joint VFI with other LLV tasks. We conclude by outlining promising future research directions to support continued progress in the field. This survey aims to serve as a unified reference for both newcomers and experts seeking a deep understanding of modern VFI landscapes.

[83] Fighting Fire with Fire (F3): A Training-free and Efficient Visual Adversarial Example Purification Method in LVLMs

Yudong Zhang,Ruobing Xie,Yiqing Huang,Jiansheng Chen,Xingwu Sun,Zhanhui Kang,Di Wang,Yu Wang

Main category: cs.CV

TL;DR: F3是一种对抗净化框架，通过故意引入简单扰动来减轻对抗样本的有害影响，无需训练且高效。

Details

Motivation: 大型视觉语言模型（LVLMs）易受视觉对抗攻击影响，但现有净化方法研究有限。 Method: F3利用随机扰动的对抗样本的跨模态注意力作为参考目标，通过注入噪声净化对抗样本。 Result: F3显著提升了对抗样本的净化效果，且计算效率高。 Conclusion: F3是一种高效、易实现的对抗净化方法，适用于大规模工业应用。 Abstract: Recent advances in large vision-language models (LVLMs) have showcased their remarkable capabilities across a wide range of multimodal vision-language tasks. However, these models remain vulnerable to visual adversarial attacks, which can substantially compromise their performance. Despite their potential impact, the development of effective methods for purifying such adversarial examples has received relatively limited attention. In this paper, we introduce F3, a novel adversarial purification framework that employs a counterintuitive "fighting fire with fire" strategy: intentionally introducing simple perturbations to adversarial examples to mitigate their harmful effects. Specifically, F3 leverages cross-modal attentions derived from randomly perturbed adversary examples as reference targets. By injecting noise into these adversarial examples, F3 effectively refines their attention, resulting in cleaner and more reliable model outputs. Remarkably, this seemingly paradoxical approach of employing noise to counteract adversarial attacks yields impressive purification results. Furthermore, F3 offers several distinct advantages: it is training-free and straightforward to implement, and exhibits significant computational efficiency improvements compared to existing purification methods. These attributes render F3 particularly suitable for large-scale industrial applications where both robust performance and operational efficiency are critical priorities. The code will be made publicly available.

[84] Revolutionizing Blood Banks: AI-Driven Fingerprint-Blood Group Correlation for Enhanced Safety

Malik A. Altayar,Muhyeeddin Alqaraleh,Mowafaq Salem Alzboon,Wesam T. Almagharbeh

Main category: cs.CV

TL;DR: 研究探讨指纹模式与ABO血型的关系，发现两者关联性较弱，血型数据未能显著提升指纹识别的准确性。

Details

Motivation: 探索低成本、易实施的生物识别方法，以补充现有高成本技术（如虹膜扫描和基因组分析）。 Method: 对200名受试者的指纹模式（环、涡、弓）和血型进行比较，使用卡方检验和皮尔逊相关性分析。 Result: 环状指纹最常见，O+血型最普遍，但指纹模式与血型无显著统计学关联。 Conclusion: 血型数据对指纹识别改进有限，未来需结合多模态生物特征和机器学习以提高准确性。 Abstract: Identification of a person is central in forensic science, security, and healthcare. Methods such as iris scanning and genomic profiling are more accurate but expensive, time-consuming, and more difficult to implement. This study focuses on the relationship between the fingerprint patterns and the ABO blood group as a biometric identification tool. A total of 200 subjects were included in the study, and fingerprint types (loops, whorls, and arches) and blood groups were compared. Associations were evaluated with statistical tests, including chi-square and Pearson correlation. The study found that the loops were the most common fingerprint pattern and the O+ blood group was the most prevalent. Even though there was some associative pattern, there was no statistically significant difference in the fingerprint patterns of different blood groups. Overall, the results indicate that blood group data do not significantly improve personal identification when used in conjunction with fingerprinting. Although the study shows weak correlation, it may emphasize the efforts of multi-modal based biometric systems in enhancing the current biometric systems. Future studies may focus on larger and more diverse samples, and possibly machine learning and additional biometrics to improve identification methods. This study addresses an element of the ever-changing nature of the fields of forensic science and biometric identification, highlighting the importance of resilient analytical methods for personal identification.

[85] Aligned Contrastive Loss for Long-Tailed Recognition

Jiali Ma,Jiequan Cui,Maeno Kazuki,Lakshmi Subramanian,Karlekar Jayashree,Sugiri Pranata,Hanwang Zhang

Main category: cs.CV

TL;DR: 提出了一种对齐对比学习（ACL）算法，解决长尾识别问题，通过消除梯度冲突和不平衡问题，在多个基准测试中表现优异。

Details

Motivation: 多视图训练虽能提升性能，但对比学习在视图增加时未能持续增强模型泛化能力，需解决梯度冲突和不平衡问题。 Method: 通过理论梯度分析发现监督对比学习（SCL）中的梯度冲突和不平衡问题，设计ACL算法消除这些问题。 Result: 在长尾CIFAR、ImageNet、Places和iNaturalist数据集上验证，ACL达到新的最优性能。 Conclusion: ACL算法有效解决了长尾识别中的梯度问题，实现了显著的性能提升。 Abstract: In this paper, we propose an Aligned Contrastive Learning (ACL) algorithm to address the long-tailed recognition problem. Our findings indicate that while multi-view training boosts the performance, contrastive learning does not consistently enhance model generalization as the number of views increases. Through theoretical gradient analysis of supervised contrastive learning (SCL), we identify gradient conflicts, and imbalanced attraction and repulsion gradients between positive and negative pairs as the underlying issues. Our ACL algorithm is designed to eliminate these problems and demonstrates strong performance across multiple benchmarks. We validate the effectiveness of ACL through experiments on long-tailed CIFAR, ImageNet, Places, and iNaturalist datasets. Results show that ACL achieves new state-of-the-art performance.

[86] A Large Convolutional Neural Network for Clinical Target and Multi-organ Segmentation in Gynecologic Brachytherapy with Multi-stage Learning

Mingzhe Hu,Yuan Gao,Yuheng Li,Ricahrd LJ Qiu,Chih-Wei Chang,Keyur D. Shah,Priyanka Kapoor,Beth Bradshaw,Yuan Shao,Justin Roper,Jill Remick,Zhen Tian,Xiaofeng Yang

Main category: cs.CV

TL;DR: GynBTNet是一种多阶段学习框架，通过自监督预训练和分层微调策略提升妇科近距离放射治疗中临床靶区和危险器官的分割性能。

Details

Motivation: 妇科近距离放射治疗中，临床靶区和危险器官的准确分割对优化治疗计划至关重要，但解剖变异性、CT成像的低软组织对比度和有限标注数据集带来挑战。 Method: GynBTNet采用三阶段训练策略：自监督预训练、多器官分割数据集的监督微调和针对妇科近距离放射治疗数据集的特定任务微调。 Result: GynBTNet在DSC、HD95和ASD指标上显著优于nnU-Net和Swin-UNETR，尤其对复杂边界结构表现更优，但乙状结肠分割仍具挑战性。 Conclusion: GynBTNet通过自监督预训练和分层微调显著提升了分割性能，为临床提供了更可靠的解决方案。 Abstract: Purpose: Accurate segmentation of clinical target volumes (CTV) and organs-at-risk is crucial for optimizing gynecologic brachytherapy (GYN-BT) treatment planning. However, anatomical variability, low soft-tissue contrast in CT imaging, and limited annotated datasets pose significant challenges. This study presents GynBTNet, a novel multi-stage learning framework designed to enhance segmentation performance through self-supervised pretraining and hierarchical fine-tuning strategies. Methods: GynBTNet employs a three-stage training strategy: (1) self-supervised pretraining on large-scale CT datasets using sparse submanifold convolution to capture robust anatomical representations, (2) supervised fine-tuning on a comprehensive multi-organ segmentation dataset to refine feature extraction, and (3) task-specific fine-tuning on a dedicated GYN-BT dataset to optimize segmentation performance for clinical applications. The model was evaluated against state-of-the-art methods using the Dice Similarity Coefficient (DSC), 95th percentile Hausdorff Distance (HD95), and Average Surface Distance (ASD). Results: Our GynBTNet achieved superior segmentation performance, significantly outperforming nnU-Net and Swin-UNETR. Notably, it yielded a DSC of 0.837 +/- 0.068 for CTV, 0.940 +/- 0.052 for the bladder, 0.842 +/- 0.070 for the rectum, and 0.871 +/- 0.047 for the uterus, with reduced HD95 and ASD compared to baseline models. Self-supervised pretraining led to consistent performance improvements, particularly for structures with complex boundaries. However, segmentation of the sigmoid colon remained challenging, likely due to anatomical ambiguities and inter-patient variability. Statistical significance analysis confirmed that GynBTNet's improvements were significant compared to baseline models.

[87] GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking

Yufei Zhan,Ziheng Wu,Yousong Zhu,Rongkun Xue,Ruipu Luo,Zhenghao Chen,Can Zhang,Yifan Li,Zhentao He,Zheming Yang,Ming Tang,Minghui Qiu,Jinqiao Wang

Main category: cs.CV

TL;DR: GThinker是一种新型多模态推理模型，通过Cue-Rethinking策略和两阶段训练方法，显著提升了在通用场景、数学和科学领域的多模态推理性能。

Details

Motivation: 当前多模态大语言模型在视觉中心的多模态推理任务中表现不佳，主要依赖逻辑和知识推理，未能有效整合视觉信息。 Method: 提出Cue-Rethinking策略，通过视觉线索迭代推理，并设计两阶段训练流程（模式引导冷启动和激励强化学习）。 Result: 在M$^3$CoT基准测试中达到81.5%，优于O4-mini模型，通用场景推理性能平均提升2.1%。 Conclusion: GThinker通过创新的推理策略和训练方法，填补了通用多模态推理的数据和性能缺口。 Abstract: Despite notable advancements in multimodal reasoning, leading Multimodal Large Language Models (MLLMs) still underperform on vision-centric multimodal reasoning tasks in general scenarios. This shortfall stems from their predominant reliance on logic- and knowledge-based slow thinking strategies, while effective for domains like math and science, fail to integrate visual information effectively during reasoning. Consequently, these models often fail to adequately ground visual cues, resulting in suboptimal performance in tasks that require multiple plausible visual interpretations and inferences. To address this, we present GThinker (General Thinker), a novel reasoning MLLM excelling in multimodal reasoning across general scenarios, mathematics, and science. GThinker introduces Cue-Rethinking, a flexible reasoning pattern that grounds inferences in visual cues and iteratively reinterprets these cues to resolve inconsistencies. Building on this pattern, we further propose a two-stage training pipeline, including pattern-guided cold start and incentive reinforcement learning, designed to enable multimodal reasoning capabilities across domains. Furthermore, to support the training, we construct GThinker-11K, comprising 7K high-quality, iteratively-annotated reasoning paths and 4K curated reinforcement learning samples, filling the data gap toward general multimodal reasoning. Extensive experiments demonstrate that GThinker achieves 81.5% on the challenging comprehensive multimodal reasoning benchmark M$^3$CoT, surpassing the latest O4-mini model. It also shows an average improvement of 2.1% on general scenario multimodal reasoning benchmarks, while maintaining on-par performance in mathematical reasoning compared to counterpart advanced reasoning models. The code, model, and data will be released soon at https://github.com/jefferyZhan/GThinker.

[88] Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

Shivam Chandhok,Qian Yang,Oscar Manas,Kanishk Jain,Leonid Sigal,Aishwarya Agrawal

Main category: cs.CV

TL;DR: PROGRESS是一种动态选择学习样本的高效框架，通过跟踪学习进度优先选择最有信息量的样本，减少数据和计算需求。

Details

Motivation: 传统的指令调优方法需要大量数据和计算资源，PROGRESS旨在通过动态样本选择提高效率。 Method: PROGRESS根据模型学习进度动态选择样本，优先选择未掌握且难度适中的技能，无需额外标注或计算密集型操作。 Result: 在多数据集实验中，PROGRESS以更少的数据和监督优于现有方法，并展示出跨架构的泛化能力。 Conclusion: PROGRESS为高效学习提供了可扩展的解决方案，适用于不同规模的模型。 Abstract: Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive-requiring large-scale datasets, high-quality annotations, and large compute budgets. We propose PRioritized cOncept learninG via Relative Error-driven Sample Selection (PROGRESS), a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples-those it has not already mastered and that are not too difficult to learn at the current stage of training. This strategy effectively controls skill acquisition and the order in which skills are learned. Specifically, we sample from skills showing the highest learning progress, prioritizing those with the most rapid improvement. Unlike prior methods, PROGRESS requires no upfront answer annotations, queries answers only on a need basis, avoids reliance on additional supervision from auxiliary VLMs, and does not require compute-heavy gradient computations for data selection. Experiments across multiple instruction-tuning datasets of varying scales demonstrate that PROGRESS consistently outperforms state-of-the-art baselines with much less data and supervision. Additionally, we show strong cross-architecture generalization and transferability to larger models, validating PROGRESS as a scalable solution for efficient learning.

[89] Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective

Lei Lei,Jie Gu,Xiaokang Ma,Chu Tang,Jingmin Chen,Tong Xu

Main category: cs.CV

TL;DR: 研究发现，通过适当选择，可以在LLM输入阶段进行视觉令牌压缩，性能损失可忽略。利用可解释性方法评估令牌重要性，并提出通过轻量卷积网络学习映射，实现高效部署。实验验证了方法的有效性。

Details

Motivation: 现有MLLMs处理大量视觉令牌导致计算成本高且效率低，而传统方法假设浅层需所有令牌。本研究旨在探索输入阶段令牌压缩的可行性。 Method: 利用可解释性方法评估视觉令牌重要性，提出通过轻量卷积网络学习从第一层注意力映射到解释结果的映射，避免完整推理。 Result: 在10个图像和视频基准测试中，压缩50%视觉令牌仍保留96%以上性能，且方法具有强泛化能力。 Conclusion: 输入阶段令牌压缩可行且高效，轻量卷积网络学习映射的方法实用性强，适用于多种MLLMs。 Abstract: Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Previous works generally assume that all visual tokens are necessary in the shallow layers of LLMs, and therefore token compression typically occurs in intermediate layers. In contrast, our study reveals an interesting insight: with proper selection, token compression is feasible at the input stage of LLM with negligible performance loss. Specifically, we reveal that explainability methods can effectively evaluate the importance of each visual token with respect to the given instruction, which can well guide the token compression. Furthermore, we propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass and facilitating practical deployment. Interestingly, this mapping can be learned using a simple and lightweight convolutional network, whose training is efficient and independent of MLLMs. Extensive experiments on 10 image and video benchmarks across three leading MLLMs (Qwen2-VL, LLaVA-OneVision, and VILA1.5) demonstrate the effectiveness of our approach, e.g., pruning 50% visual tokens while retaining more than 96% of the original performance across all benchmarks for all these three MLLMs. It also exhibits strong generalization, even when the number of tokens in inference far exceeds that used in training.

[90] Keystep Recognition using Graph Neural Networks

Julia Lee Romero,Kyle Min,Subarna Tripathi,Morteza Karimzadeh

Main category: cs.CV

TL;DR: GLEVR提出了一种基于图学习的细粒度关键步骤识别框架，通过构建稀疏图有效利用第一人称视频中的长期依赖关系，并结合第三人称视频和自动字幕提升性能。

Details

Motivation: 解决第一人称视频中细粒度关键步骤识别的挑战，利用长期依赖关系和跨视角对齐提升模型性能。 Method: 将视频片段作为节点构建稀疏图，结合第三人称视频和自动字幕作为额外节点，定义节点间的连接策略。 Result: 在Ego-Exo4D数据集上显著优于现有方法。 Conclusion: GLEVR框架通过灵活利用多模态数据和图结构，显著提升了关键步骤识别的性能。 Abstract: We pose keystep recognition as a node classification task, and propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos. Our approach, termed GLEVR, consists of constructing a graph where each video clip of the egocentric video corresponds to a node. The constructed graphs are sparse and computationally efficient, outperforming existing larger models substantially. We further leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos, as well as adding automatic captioning as an additional modality. We consider each clip of each exocentric video (if available) or video captions as additional nodes during training. We examine several strategies to define connections across these nodes. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods.

[91] DeepVerse: 4D Autoregressive Video Generation as a World Model

Junyi Chen,Haoyi Zhu,Xianglong He,Yifan Wang,Jianjun Zhou,Wenzheng Chang,Yang Zhou,Zizun Li,Zhoujie Fu,Jiangmiao Pang,Tong He

Main category: cs.CV

TL;DR: DeepVerse是一种新型4D交互世界模型，通过显式结合几何预测，显著提升了时空一致性和预测精度。

Details

Motivation: 现有交互模型主要预测视觉观测，忽略了几何结构和空间一致性等隐藏状态，导致误差累积和时间不一致。 Method: DeepVerse将先前时间步的几何预测显式结合到当前动作条件下的预测中，引入几何约束。 Result: 实验表明，DeepVerse能捕捉更丰富的时空关系和物理动态，减少漂移，提升预测准确性、视觉真实性和场景合理性。 Conclusion: DeepVerse为几何感知的高保真长期预测提供了有效解决方案，并在多种场景中验证了其有效性。 Abstract: World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences and achieve substantial improvements in prediction accuracy, visual realism, and scene rationality. Furthermore, our method provides an effective solution for geometry-aware memory retrieval, effectively preserving long-term spatial consistency. We validate the effectiveness of DeepVerse across diverse scenarios, establishing its capacity for high-fidelity, long-horizon predictions grounded in geometry-aware dynamics.

[92] CountingFruit: Real-Time 3D Fruit Counting with Language-Guided Semantic Gaussian Splatting

Fengze Li,Yangle Liu,Jieming Ma,Hai-Ning Liang,Yaochun Shen,Huangxiang Li,Zhijing Wu

Main category: cs.CV

TL;DR: FruitLangGS是一个实时3D水果计数框架，通过空间重建、语义嵌入和语言引导实例估计，解决了现有方法在推理速度、泛化能力和语义控制方面的不足。

Details

Motivation: 农业环境中水果计数的挑战包括视觉遮挡、语义模糊和3D重建的高计算需求，现有方法在速度、泛化和语义控制方面表现不佳。 Method: FruitLangGS采用自适应高斯喷洒管道进行场景重建，结合CLIP对齐的语言嵌入实现语义控制，并通过分布感知采样和聚类估计水果数量。 Result: 实验表明，FruitLangGS在渲染速度、语义灵活性和计数准确性上优于现有方法。 Conclusion: FruitLangGS为开放世界场景中的语言驱动实时神经渲染提供了新视角。 Abstract: Accurate fruit counting in real-world agricultural environments is a longstanding challenge due to visual occlusions, semantic ambiguity, and the high computational demands of 3D reconstruction. Existing methods based on neural radiance fields suffer from low inference speed, limited generalization, and lack support for open-set semantic control. This paper presents FruitLangGS, a real-time 3D fruit counting framework that addresses these limitations through spatial reconstruction, semantic embedding, and language-guided instance estimation. FruitLangGS first reconstructs orchard-scale scenes using an adaptive Gaussian splatting pipeline with radius-aware pruning and tile-based rasterization for efficient rendering. To enable semantic control, each Gaussian encodes a compressed CLIP-aligned language embedding, forming a compact and queryable 3D representation. At inference time, prompt-based semantic filtering is applied directly in 3D space, without relying on image-space segmentation or view-level fusion. The selected Gaussians are then converted into dense point clouds via distribution-aware sampling and clustered to estimate fruit counts. Experimental results on real orchard data demonstrate that FruitLangGS achieves higher rendering speed, semantic flexibility, and counting accuracy compared to prior approaches, offering a new perspective for language-driven, real-time neural rendering across open-world scenarios.

[93] Revolutionizing Radiology Workflow with Factual and Efficient CXR Report Generation

Pimchanok Sukjai,Apiradee Boonmee

Main category: cs.CV

TL;DR: CXR-PathFinder是一种基于大型语言模型（LLM）的自动胸片报告生成模型，通过临床医生引导的对抗微调（CGAFT）和知识图谱增强模块（KGAM）提高诊断准确性和减少错误。

Details

Motivation: 医疗图像解读需求增长，需要高效、准确的人工智能解决方案以提升放射学诊断水平。 Method: 提出CGAFT方法，整合临床专家反馈和对抗学习框架；引入KGAM模块，动态验证生成报告的正确性。 Result: 实验表明CXR-PathFinder在临床准确性等指标上显著优于现有模型，并通过放射科医生盲评验证其优越性。 Conclusion: CXR-PathFinder为自动化医疗报告生成提供了高效、可靠的解决方案。 Abstract: The escalating demand for medical image interpretation underscores the critical need for advanced artificial intelligence solutions to enhance the efficiency and accuracy of radiological diagnoses. This paper introduces CXR-PathFinder, a novel Large Language Model (LLM)-centric foundation model specifically engineered for automated chest X-ray (CXR) report generation. We propose a unique training paradigm, Clinician-Guided Adversarial Fine-Tuning (CGAFT), which meticulously integrates expert clinical feedback into an adversarial learning framework to mitigate factual inconsistencies and improve diagnostic precision. Complementing this, our Knowledge Graph Augmentation Module (KGAM) acts as an inference-time safeguard, dynamically verifying generated medical statements against authoritative knowledge bases to minimize hallucinations and ensure standardized terminology. Leveraging a comprehensive dataset of millions of paired CXR images and expert reports, our experiments demonstrate that CXR-PathFinder significantly outperforms existing state-of-the-art medical vision-language models across various quantitative metrics, including clinical accuracy (Macro F1 (14): 46.5, Micro F1 (14): 59.5). Furthermore, blinded human evaluation by board-certified radiologists confirms CXR-PathFinder's superior clinical utility, completeness, and accuracy, establishing its potential as a reliable and efficient aid for radiological practice. The developed method effectively balances high diagnostic fidelity with computational efficiency, providing a robust solution for automated medical report generation.

[94] MOOSE: Pay Attention to Temporal Dynamics for Video Understanding via Optical Flows

Hong Nguyen,Dung Tran,Hieu Hoang,Phong Nguyen,Shrikanth Narayanan

Main category: cs.CV

TL;DR: MOOSE是一种新型视频编码器，通过结合光流与空间嵌入高效建模时间信息，减少计算复杂度并提升时间可解释性。

Details

Motivation: 解决视频分析中时间动态建模的高计算成本和细粒度标注需求问题。 Method: 结合预训练视觉和光流编码器，提出时间中心架构MOOSE。 Result: 在临床、医学和动作识别数据集上达到最优性能。 Conclusion: MOOSE高效且通用，适用于多种视频分析任务。 Abstract: Many motion-centric video analysis tasks, such as atomic actions, detecting atypical motor behavior in individuals with autism, or analyzing articulatory motion in real-time MRI of human speech, require efficient and interpretable temporal modeling. Capturing temporal dynamics is a central challenge in video analysis, often requiring significant computational resources and fine-grained annotations that are not widely available. This paper presents MOOSE (Motion Flow Over Spatial Space), a novel temporally-centric video encoder explicitly integrating optical flow with spatial embeddings to model temporal information efficiently, inspired by human perception of motion. Unlike prior models, MOOSE takes advantage of rich, widely available pre-trained visual and optical flow encoders instead of training video models from scratch. This significantly reduces computational complexity while enhancing temporal interpretability. Our primary contributions includes (1) proposing a computationally efficient temporally-centric architecture for video understanding (2) demonstrating enhanced interpretability in modeling temporal dynamics; and (3) achieving state-of-the-art performance on diverse benchmarks, including clinical, medical, and standard action recognition datasets, confirming the broad applicability and effectiveness of our approach.

[95] ProstaTD: A Large-scale Multi-source Dataset for Structured Surgical Triplet Detection

Yiliang Chen,Zhixi Li,Cheng Xu,Alex Qinyang Liu,Xuemiao Xu,Jeremy Yuen-Chun Teoh,Shengfeng He,Jing Qin

Main category: cs.CV

TL;DR: ProstaTD是一个大规模、多机构的手术三重态检测数据集，解决了现有数据集在空间标注、时间标签和数据来源上的局限性。

Details

Motivation: 现有数据集（如CholecT50）在空间标注、时间标签和数据来源上存在不足，限制了模型的泛化能力。 Method: 通过机器人辅助前列腺切除术领域的数据，开发了ProstaTD数据集，提供临床定义的时间边界和高精度空间标注。 Result: ProstaTD包含60,529视频帧和165,567标注实例，来自21台手术，是当前最大且最多样化的数据集。 Conclusion: ProstaTD为手术AI系统的开发和培训工具提供了可靠的基础。 Abstract: Surgical triplet detection has emerged as a pivotal task in surgical video analysis, with significant implications for performance assessment and the training of novice surgeons. However, existing datasets such as CholecT50 exhibit critical limitations: they lack precise spatial bounding box annotations, provide inconsistent and clinically ungrounded temporal labels, and rely on a single data source, which limits model generalizability.To address these shortcomings, we introduce ProstaTD, a large-scale, multi-institutional dataset for surgical triplet detection, developed from the technically demanding domain of robot-assisted prostatectomy. ProstaTD offers clinically defined temporal boundaries and high-precision bounding box annotations for each structured triplet action. The dataset comprises 60,529 video frames and 165,567 annotated triplet instances, collected from 21 surgeries performed across multiple institutions, reflecting a broad range of surgical practices and intraoperative conditions. The annotation process was conducted under rigorous medical supervision and involved more than 50 contributors, including practicing surgeons and medically trained annotators, through multiple iterative phases of labeling and verification. ProstaTD is the largest and most diverse surgical triplet dataset to date, providing a robust foundation for fair benchmarking, the development of reliable surgical AI systems, and scalable tools for procedural training.

[96] FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation

Ariel Shaulov,Itay Hazan,Lior Wolf,Hila Chefer

Main category: cs.CV

TL;DR: FlowMo是一种无需额外训练或辅助输入的训练自由引导方法，通过提取预训练模型的预测来增强视频生成的时间一致性。

Details

Motivation: 现有文本到视频扩散模型在建模时间动态方面存在局限，通常需要重新训练或引入外部条件信号。FlowMo旨在直接从预训练模型的预测中提取时间表示，避免额外训练。 Method: FlowMo通过计算连续帧潜在表示的距离，提取外观去偏的时间表示，并通过测量时间维度上的块方差估计运动一致性，动态引导模型减少方差。 Result: 实验表明，FlowMo显著提升了运动一致性，同时保持了视觉质量和提示对齐，为预训练视频扩散模型提供了一种即插即用的解决方案。 Conclusion: FlowMo为提升预训练视频扩散模型的时间保真度提供了一种高效且无需额外训练的方法。 Abstract: Text-to-video diffusion models are notoriously limited in their ability to model temporal aspects such as motion, physics, and dynamic interactions. Existing approaches address this limitation by retraining the model or introducing external conditioning signals to enforce temporal consistency. In this work, we explore whether a meaningful temporal representation can be extracted directly from the predictions of a pre-trained model without any additional training or auxiliary inputs. We introduce \textbf{FlowMo}, a novel training-free guidance method that enhances motion coherence using only the model's own predictions in each diffusion step. FlowMo first derives an appearance-debiased temporal representation by measuring the distance between latents corresponding to consecutive frames. This highlights the implicit temporal structure predicted by the model. It then estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling. Extensive experiments across multiple text-to-video models demonstrate that FlowMo significantly improves motion coherence without sacrificing visual quality or prompt alignment, offering an effective plug-and-play solution for enhancing the temporal fidelity of pre-trained video diffusion models.

[97] SVarM: Linear Support Varifold Machines for Classification and Regression on Geometric Data

Emmanuel Hartman,Nicolas Charon

Main category: cs.CV

TL;DR: SVarM利用varifold表示形状，通过测试函数和神经网络的结合，在非欧几里得形状空间上实现分类和回归，性能优异且参数少。

Details

Motivation: 几何数据（如曲线、图、曲面）的统计分析因形状空间的非欧几里得性质而困难，需构建具有不变性的机器学习框架。 Method: 提出SVarM，利用varifold表示形状及其对偶测试函数，结合神经网络实现分类和回归。 Result: 在多种形状数据集上表现优异，性能与现有方法相当但参数显著减少。 Conclusion: SVarM为形状空间上的机器学习提供了高效且通用的框架。 Abstract: Despite progress in the rapidly developing field of geometric deep learning, performing statistical analysis on geometric data--where each observation is a shape such as a curve, graph, or surface--remains challenging due to the non-Euclidean nature of shape spaces, which are defined as equivalence classes under invariance groups. Building machine learning frameworks that incorporate such invariances, notably to shape parametrization, is often crucial to ensure generalizability of the trained models to new observations. This work proposes SVarM to exploit varifold representations of shapes as measures and their duality with test functions $h:\mathbb{R}^n \times S^{n-1} \to \mathbb{R}$. This method provides a general framework akin to linear support vector machines but operating instead over the infinite-dimensional space of varifolds. We develop classification and regression models on shape datasets by introducing a neural network-based representation of the trainable test function $h$. This approach demonstrates strong performance and robustness across various shape graph and surface datasets, achieving results comparable to state-of-the-art methods while significantly reducing the number of trainable parameters.

[98] Perceptual Inductive Bias Is What You Need Before Contrastive Learning

Tianqin Li,Junru Zhao,Dunhan Jiang,Shenghao Wu,Alan Ramirez,Tai Sing Lee

Main category: cs.CV

TL;DR: 论文提出了一种基于David Marr多阶段视觉理论的预训练方法，通过先构建边界和表面级表征，再学习语义表征，显著提升了模型收敛速度和性能。

Details

Motivation: 现有对比表征学习框架通常直接学习语义表征空间，忽略了人类视觉的多阶段处理特性，导致收敛慢和纹理偏差。 Method: 采用Marr的多阶段理论，先构建边界和表面级表征，再训练语义表征，结合人类视觉的归纳偏置。 Result: 在ResNet18上实现2倍收敛速度提升，并在语义分割、深度估计和物体识别任务中表现更优，同时增强鲁棒性和分布外能力。 Conclusion: 提出在通用对比表征预训练前增加基于人类视觉系统的预训练阶段，可显著提升表征质量和减少收敛时间。 Abstract: David Marr's seminal theory of human perception stipulates that visual processing is a multi-stage process, prioritizing the derivation of boundary and surface properties before forming semantic object representations. In contrast, contrastive representation learning frameworks typically bypass this explicit multi-stage approach, defining their objective as the direct learning of a semantic representation space for objects. While effective in general contexts, this approach sacrifices the inductive biases of vision, leading to slower convergence speed and learning shortcut resulting in texture bias. In this work, we demonstrate that leveraging Marr's multi-stage theory-by first constructing boundary and surface-level representations using perceptual constructs from early visual processing stages and subsequently training for object semantics-leads to 2x faster convergence on ResNet18, improved final representations on semantic segmentation, depth estimation, and object recognition, and enhanced robustness and out-of-distribution capability. Together, we propose a pretraining stage before the general contrastive representation pretraining to further enhance the final representation quality and reduce the overall convergence time via inductive bias from human vision systems.

[99] Self-Supervised Multi-View Representation Learning using Vision-Language Model for 3D/4D Facial Expression Recognition

Muzammil Behzad

Main category: cs.CV

TL;DR: SMILE-VLM是一个自监督的视觉语言模型，用于3D/4D面部表情识别，通过多视图视觉表示学习和自然语言监督实现。

Details

Motivation: 面部表情识别在情感计算中具有重要应用，但现有方法需要大量标注数据，SMILE-VLM旨在提供一种无需标注的高效解决方案。 Method: 提出三个核心组件：多视图去相关、视觉语言对比对齐和跨模态冗余最小化。 Result: 在多个基准测试中达到最先进性能，并在4D微表情识别任务中表现优异。 Conclusion: SMILE-VLM不仅超越无监督方法，还能与监督基线媲美，为面部行为理解提供了高效且可扩展的解决方案。 Abstract: Facial expression recognition (FER) is a fundamental task in affective computing with applications in human-computer interaction, mental health analysis, and behavioral understanding. In this paper, we propose SMILE-VLM, a self-supervised vision-language model for 3D/4D FER that unifies multiview visual representation learning with natural language supervision. SMILE-VLM learns robust, semantically aligned, and view-invariant embeddings by proposing three core components: multiview decorrelation via a Barlow Twins-style loss, vision-language contrastive alignment, and cross-modal redundancy minimization. Our framework achieves the state-of-the-art performance on multiple benchmarks. We further extend SMILE-VLM to the task of 4D micro-expression recognition (MER) to recognize the subtle affective cues. The extensive results demonstrate that SMILE-VLM not only surpasses existing unsupervised methods but also matches or exceeds supervised baselines, offering a scalable and annotation-efficient solution for expressive facial behavior understanding.

[100] A Review on Coarse to Fine-Grained Animal Action Recognition

Ali Zia,Renuka Sharma,Abdelwahed Khamis,Xuesong Li,Muhammad Husnain,Numan Shafi,Saeed Anwar,Sabine Schmoelzl,Eric Stone,Lars Petersson,Vivien Rolland

Main category: cs.CV

TL;DR: 本文综述了动物行为识别领域的现状，重点探讨了粗粒度（CG）和细粒度（FG）技术，并分析了户外环境中识别细微动物行为的独特挑战。

Details

Motivation: 研究动物行为识别的动机在于其与人类行为识别的显著差异，如非刚性身体结构、频繁遮挡和缺乏大规模标注数据集。 Method: 通过评估时空深度学习框架（如SlowFast）和现有数据集的局限性，探讨了动物行为分析的有效方法。 Result: 综述指出了当前方法的优缺点，并介绍了一个新发布的数据集，为未来细粒度行为识别的发展提供了方向。 Conclusion: 未来研究应致力于提高跨物种行为分析的准确性和泛化能力。 Abstract: This review provides an in-depth exploration of the field of animal action recognition, focusing on coarse-grained (CG) and fine-grained (FG) techniques. The primary aim is to examine the current state of research in animal behaviour recognition and to elucidate the unique challenges associated with recognising subtle animal actions in outdoor environments. These challenges differ significantly from those encountered in human action recognition due to factors such as non-rigid body structures, frequent occlusions, and the lack of large-scale, annotated datasets. The review begins by discussing the evolution of human action recognition, a more established field, highlighting how it progressed from broad, coarse actions in controlled settings to the demand for fine-grained recognition in dynamic environments. This shift is particularly relevant for animal action recognition, where behavioural variability and environmental complexity present unique challenges that human-centric models cannot fully address. The review then underscores the critical differences between human and animal action recognition, with an emphasis on high intra-species variability, unstructured datasets, and the natural complexity of animal habitats. Techniques like spatio-temporal deep learning frameworks (e.g., SlowFast) are evaluated for their effectiveness in animal behaviour analysis, along with the limitations of existing datasets. By assessing the strengths and weaknesses of current methodologies and introducing a recently-published dataset, the review outlines future directions for advancing fine-grained action recognition, aiming to improve accuracy and generalisability in behaviour analysis across species.

[101] Dirty and Clean-Label attack detection using GAN discriminators

John Smutny

Main category: cs.CV

TL;DR: 利用GAN判别器保护计算机视觉模型中的单一类别免受错误标签和修改图像的攻击，通过置信度评分识别问题图像。

Details

Motivation: 收集足够图像训练模型时，未知来源的图像可能导致模型被脏标签或干净标签攻击操纵，手动检查不切实际，现有方法耗时。 Method: 使用GAN判别器训练单一类别，通过置信度评分设定阈值识别错误标签和修改图像。 Result: 训练后的GAN判别器能100%识别测试中的毒化图像，扰动幅度阈值从0.20开始。 Conclusion: 开发者可基于此方法训练判别器，保护高价值类别。 Abstract: Gathering enough images to train a deep computer vision model is a constant challenge. Unfortunately, collecting images from unknown sources can leave your model s behavior at risk of being manipulated by a dirty-label or clean-label attack unless the images are properly inspected. Manually inspecting each image-label pair is impractical and common poison-detection methods that involve re-training your model can be time consuming. This research uses GAN discriminators to protect a single class against mislabeled and different levels of modified images. The effect of said perturbation on a basic convolutional neural network classifier is also included for reference. The results suggest that after training on a single class, GAN discriminator s confidence scores can provide a threshold to identify mislabeled images and identify 100% of the tested poison starting at a perturbation epsilon magnitude of 0.20, after decision threshold calibration using in-class samples. Developers can use this report as a basis to train their own discriminators to protect high valued classes in their CV models.

[102] Fourier-Modulated Implicit Neural Representation for Multispectral Satellite Image Compression

Woojin Cho,Steve Andreas Immanuel,Junhyuk Heo,Darongsae Kwon

Main category: cs.CV

TL;DR: ImpliSat是一个基于隐式神经表示（INR）的统一框架，用于高效压缩和重建多光谱卫星数据，解决了高维度和多分辨率带来的挑战。

Details

Motivation: 多光谱卫星图像在农业、渔业和环境监测中至关重要，但其高维度、大数据量和多分辨率特性给数据压缩和分析带来挑战。 Method: 利用INR将卫星图像建模为坐标空间上的连续函数，并通过傅里叶调制算法动态适应各波段的光谱和空间特性。 Result: 实现了高效压缩，同时保留了关键图像细节。 Conclusion: ImpliSat为多光谱卫星数据的压缩和分析提供了有效的解决方案。 Abstract: Multispectral satellite images play a vital role in agriculture, fisheries, and environmental monitoring. However, their high dimensionality, large data volumes, and diverse spatial resolutions across multiple channels pose significant challenges for data compression and analysis. This paper presents ImpliSat, a unified framework specifically designed to address these challenges through efficient compression and reconstruction of multispectral satellite data. ImpliSat leverages Implicit Neural Representations (INR) to model satellite images as continuous functions over coordinate space, capturing fine spatial details across varying spatial resolutions. Furthermore, we introduce a Fourier modulation algorithm that dynamically adjusts to the spectral and spatial characteristics of each band, ensuring optimal compression while preserving critical image details.

[103] Visual Sparse Steering: Improving Zero-shot Image Classification with Sparsity Guided Steering Vectors

Gerasimos Chatzoudis,Zhuowei Li,Gemma E. Moran,Hao Wang,Dimitris N. Metaxas

Main category: cs.CV

TL;DR: VS2和VS2++是轻量级的测试时方法，通过稀疏特征引导视觉模型，显著提升零样本CLIP性能，VS2++进一步通过检索增强表现更优。PASS通过原型对齐进一步优化稀疏特征。

Details

Motivation: 在动态或资源受限环境中，无需重新训练或大量标注数据即可引导视觉基础模型是一个重要但具有挑战性的目标。 Method: VS2使用稀疏自编码器学习的稀疏特征生成引导向量；VS2++通过检索增强选择性放大相关特征；PASS在训练时引入原型对齐损失优化稀疏特征。 Result: VS2在多个数据集上显著超越零样本CLIP，VS2++表现更优，PASS进一步小幅提升性能。 Conclusion: 稀疏特征引导能显著提升模型性能，尤其对特定类别效果更佳，原型对齐进一步优化了稀疏特征的学习。 Abstract: Steering vision foundation models at inference time without retraining or access to large labeled datasets is a desirable yet challenging objective, particularly in dynamic or resource-constrained settings. In this paper, we introduce Visual Sparse Steering (VS2), a lightweight, test-time method that guides vision models using steering vectors derived from sparse features learned by top-$k$ Sparse Autoencoders without requiring contrastive data. Specifically, VS2 surpasses zero-shot CLIP by 4.12% on CIFAR-100, 1.08% on CUB-200, and 1.84% on Tiny-ImageNet. We further propose VS2++, a retrieval-augmented variant that selectively amplifies relevant sparse features using pseudo-labeled neighbors at inference time. With oracle positive/negative sets, VS2++ achieves absolute top-1 gains over CLIP zero-shot of up to 21.44% on CIFAR-100, 7.08% on CUB-200, and 20.47% on Tiny-ImageNet. Interestingly, VS2 and VS2++ raise per-class accuracy by up to 25% and 38%, respectively, showing that sparse steering benefits specific classes by disambiguating visually or taxonomically proximate categories rather than providing a uniform boost. Finally, to better align the sparse features learned through the SAE reconstruction task with those relevant for downstream performance, we propose Prototype-Aligned Sparse Steering (PASS). By incorporating a prototype-alignment loss during SAE training, using labels only during training while remaining fully test-time unsupervised, PASS consistently, though modestly, outperforms VS2, achieving a 6.12% gain over VS2 only on CIFAR-100 with ViT-B/32.

[104] ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

Hosu Lee,Junho Kim,Hyunjun Kim,Yong Man Ro

Main category: cs.CV

TL;DR: ReFoCUS提出了一种基于强化学习的帧选择优化框架，通过模型内部偏好指导帧选择，提升视频问答性能。

Details

Motivation: 现有视频理解方法依赖静态启发式或外部检索模块，可能无法提供查询相关信息，需要更优的帧选择策略。 Method: 采用强化学习训练帧选择策略，利用参考LMM的奖励信号优化视觉输入选择，并通过自回归条件选择架构降低复杂性。 Result: 在多个视频问答基准测试中显著提升了推理性能。 Conclusion: ReFoCUS通过模型内部效用对齐帧选择，无需显式监督即可提升视频理解能力。 Abstract: Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet the ability to understand video content remains constrained by suboptimal frame selection strategies. Existing approaches often rely on static heuristics or external retrieval modules to feed frame information into video-LLMs, which may fail to provide the query-relevant information. In this work, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework that shifts the optimization target from textual responses to visual input selection. ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM to reflect the model's intrinsic preferences for frames that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive, conditional selection architecture that ensures temporal coherence while reducing complexity. Our approach does not require explicit supervision at the frame-level and consistently improves reasoning performance across multiple video QA benchmarks, highlighting the benefits of aligning frame selection with model-internal utility.

Yichi Zhang,Zhuo Chen,Lingbing Guo,Yajing Xu,Min Zhang,Wen Zhang,Huajun Chen

Main category: cs.CV

TL;DR: 论文提出了M3STR基准，用于评估多模态大语言模型（MLLMs）在结构化视觉知识理解方面的能力，填补了现有评测的空白。

Details

Motivation: 现有MLLMs评测主要忽略了模型对结构化视觉知识的理解能力，亟需一种新的评估范式。 Method: 设计了基于多模态知识图谱的M3STR基准，要求模型识别视觉输入中的多模态实体及其复杂关系拓扑。 Result: 对26个先进MLLMs的评测显示，它们在处理结构化视觉知识时仍存在显著不足。 Conclusion: M3STR为提升MLLMs的整体推理能力指明了关键方向，相关代码和数据已开源。 Abstract: Multi-modal large language models (MLLMs) incorporate heterogeneous modalities into LLMs, enabling a comprehensive understanding of diverse scenarios and objects. Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. To address this gap, we propose a novel evaluation paradigm and devise M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. This benchmark leverages multi-modal knowledge graphs to synthesize images encapsulating subgraph architectures enriched with multi-modal entities. M3STR necessitates that MLLMs not only recognize the multi-modal entities within the visual inputs but also decipher intricate relational topologies among them. We delineate the benchmark's statistical profiles and automated construction pipeline, accompanied by an extensive empirical analysis of 26 state-of-the-art MLLMs. Our findings reveal persistent deficiencies in processing abstractive visual information with structured knowledge, thereby charting a pivotal trajectory for advancing MLLMs' holistic reasoning capacities. Our code and data are released at https://github.com/zjukg/M3STR

[106] ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou,Yangfan He,Yaofeng Su,Siwei Han,Joel Jang,Gedas Bertasius,Mohit Bansal,Huaxiu Yao

Main category: cs.CV

TL;DR: ReAgent-V是一种新型视频理解框架，通过实时奖励生成和多视角反思机制提升推理能力，并在多个任务中表现出显著性能提升。

Details

Motivation: 传统视频理解方法缺乏动态反馈，限制了模型在复杂场景中的自我修正和适应能力。现有方法存在高标注成本、低推理效率等问题。 Method: 提出ReAgent-V框架，结合高效帧选择和实时奖励生成，通过多视角反思机制迭代优化答案，并支持灵活工具集成。 Result: 在12个数据集上的实验显示，ReAgent-V在视频理解、推理增强和视觉-语言-动作对齐任务中分别提升6.9%、2.1%和9.8%。 Conclusion: ReAgent-V轻量、模块化且可扩展，显著提升了视频理解的泛化能力和推理效果。 Abstract: Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi-perspective reflection mechanism-adjusting predictions from conservative, neutral, and aggressive viewpoints-but also enable automatic filtering of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent-V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications-video understanding, video reasoning enhancement, and vision-language-action model alignment-demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.

[107] SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost

Haiyang Mei,Pengyu Zhang,Mike Zheng Shou

Main category: cs.CV

TL;DR: SAM-I2V是一种高效的图像到视频升级方法，通过预训练的SAM模型实现提示性视频分割，显著降低训练成本和资源需求。

Details

Motivation: 扩展基础模型（如SAM）到视频分割领域面临挑战，尤其是动态场景中的时间一致性掩码传播。SAM 2的训练成本过高，阻碍研究和实际部署。 Method: 引入三个创新点：(i) 基于SAM的图像到视频特征提取升级器，(ii) 记忆过滤策略选择相关历史帧，(iii) 记忆作为提示机制确保时间一致性。 Result: 实验表明，SAM-I2V达到SAM 2性能的90%以上，仅需0.2%的训练成本。 Conclusion: SAM-I2V为提示性视频分割提供了资源高效的解决方案，降低了研究门槛并推动领域发展。 Abstract: Foundation models like the Segment Anything Model (SAM) have significantly advanced promptable image segmentation in computer vision. However, extending these capabilities to videos presents substantial challenges, particularly in ensuring precise and temporally consistent mask propagation in dynamic scenes. SAM 2 attempts to address this by training a model on massive image and video data from scratch to learn complex spatiotemporal associations, resulting in huge training costs that hinder research and practical deployment. In this paper, we introduce SAM-I2V, an effective image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. Our approach strategically upgrades the pre-trained SAM to support PVS, significantly reducing training complexity and resource requirements. To achieve this, we introduce three key innovations: (i) an image-to-video feature extraction upgrader built upon SAM's static image encoder to enable spatiotemporal video perception, (ii) a memory filtering strategy that selects the most relevant past frames for more effective utilization of historical information, and (iii) a memory-as-prompt mechanism leveraging object memory to ensure temporally consistent mask propagation in dynamic scenes. Comprehensive experiments demonstrate that our method achieves over 90% of SAM 2's performance while using only 0.2% of its training cost. Our work presents a resource-efficient pathway to PVS, lowering barriers for further research in PVS model design and enabling broader applications and advancements in the field. Code and model are available at: https://github.com/showlab/SAM-I2V.

[108] Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation

Jinjin Zhang,Qiuyu Huang,Junjie Liu,Xiefan Guo,Di Huang

Main category: cs.CV

TL;DR: 论文提出了Aesthetic-4K数据集和Diffusion-4K框架，用于超高清图像合成，并引入新评估指标。

Details

Motivation: 超高清图像合成潜力巨大但研究不足，缺乏标准化基准和计算资源。 Method: 提出SC-VAE和WLF技术，结合扩散模型直接生成4K图像，并设计新评估指标。 Result: Diffusion-4K在超高清图像合成中表现优异，尤其结合先进扩散模型。 Conclusion: 该研究为超高清图像合成提供了新数据集、框架和评估方法，推动了领域发展。 Abstract: Ultra-high-resolution image synthesis holds significant potential, yet remains an underexplored challenge due to the absence of standardized benchmarks and computational constraints. In this paper, we establish Aesthetic-4K, a meticulously curated dataset containing dedicated training and evaluation subsets specifically designed for comprehensive research on ultra-high-resolution image synthesis. This dataset consists of high-quality 4K images accompanied by descriptive captions generated by GPT-4o. Furthermore, we propose Diffusion-4K, an innovative framework for the direct generation of ultra-high-resolution images. Our approach incorporates the Scale Consistent Variational Auto-Encoder (SC-VAE) and Wavelet-based Latent Fine-tuning (WLF), which are designed for efficient visual token compression and the capture of intricate details in ultra-high-resolution images, thereby facilitating direct training with photorealistic 4K data. This method is applicable to various latent diffusion models and demonstrates its efficacy in synthesizing highly detailed 4K images. Additionally, we propose novel metrics, namely the GLCM Score and Compression Ratio, to assess the texture richness and fine details in local patches, in conjunction with holistic measures such as FID, Aesthetics, and CLIPScore, enabling a thorough and multifaceted evaluation of ultra-high-resolution image synthesis. Consequently, Diffusion-4K achieves impressive performance in ultra-high-resolution image synthesis, particularly when powered by state-of-the-art large-scale diffusion models (eg, Flux-12B). The source code is publicly available at https://github.com/zhang0jhon/diffusion-4k.

[109] A 2-Stage Model for Vehicle Class and Orientation Detection with Photo-Realistic Image Generation

Youngmin Kim,Donghwa Kang,Hyeongboo Baek

Main category: cs.CV

TL;DR: 提出了一种两阶段检测模型，通过生成逼真图像解决合成数据训练中的类别不平衡和真实世界预测困难问题。

Details

Motivation: 训练数据中类别分布不平衡，且合成图像训练的模型难以预测真实世界图像。 Method: 1. 构建包含图像、类别和位置信息的元表；2. 将合成图像转换为真实风格并合并到元表；3. 使用元表中的图像分类车辆类别和方向；4. 结合位置信息和预测类别完成检测。 Result: 在IEEE BigData Challenge 2022 VOD比赛中获得第4名。 Conclusion: 两阶段模型结合逼真图像生成有效提升了车辆类别和方向的检测性能。 Abstract: We aim to detect the class and orientation of a vehicle by training a model with synthetic data. However, the distribution of the classes in the training data is imbalanced, and the model trained on the synthetic image is difficult to predict in real-world images. We propose a two-stage detection model with photo-realistic image generation to tackle this issue. Our model mainly takes four steps to detect the class and orientation of the vehicle. (1) It builds a table containing the image, class, and location information of objects in the image, (2) transforms the synthetic images into real-world images style, and merges them into the meta table. (3) Classify vehicle class and orientation using images from the meta-table. (4) Finally, the vehicle class and orientation are detected by combining the pre-extracted location information and the predicted classes. We achieved 4th place in IEEE BigData Challenge 2022 Vehicle class and Orientation Detection (VOD) with our approach.

[110] Rethinking Image Histogram Matching for Image Classification

Rikuto Otsuka,Yuho Shoji,Yuka Ogino,Takahiro Toizumi,Atsushi Ito

Main category: cs.CV

TL;DR: 本文重新思考了图像直方图匹配（HM），提出了一种可微分且参数化的HM预处理方法，用于下游分类器。通过优化目标像素值分布，该方法在恶劣天气条件下提升了分类器性能。

Details

Motivation: 卷积神经网络在分类任务中表现出色，但在低对比度图像（如恶劣天气条件下拍摄的图像）上性能下降。传统直方图均衡化（HE）虽常用，但其目标分布为均匀分布，可能并非最优。本文假设，设计一个优化的目标分布可以进一步提升分类器性能。 Method: 提出了一种可微分且参数化的HM方法，通过下游分类器的损失函数优化目标像素值分布，将输入图像转换为适合分类器的目标分布。该方法仅使用正常天气图像进行训练。 Result: 实验结果表明，使用所提HM方法训练的分类器在恶劣天气条件下优于传统预处理方法。 Conclusion: 通过优化目标像素值分布，可微分且参数化的HM方法显著提升了分类器在恶劣天气条件下的性能。 Abstract: This paper rethinks image histogram matching (HM) and proposes a differentiable and parametric HM preprocessing for a downstream classifier. Convolutional neural networks have demonstrated remarkable achievements in classification tasks. However, they often exhibit degraded performance on low-contrast images captured under adverse weather conditions. To maintain classifier performance under low-contrast images, histogram equalization (HE) is commonly used. HE is a special case of HM using a uniform distribution as a target pixel value distribution. In this paper, we focus on the shape of the target pixel value distribution. Compared to a uniform distribution, a single, well-designed distribution could have potential to improve the performance of the downstream classifier across various adverse weather conditions. Based on this hypothesis, we propose a differentiable and parametric HM that optimizes the target distribution using the loss function of the downstream classifier. This method addresses pixel value imbalances by transforming input images with arbitrary distributions into a target distribution optimized for the classifier. Our HM is trained on only normal weather images using the classifier. Experimental results show that a classifier trained with our proposed HM outperforms conventional preprocessing methods under adverse weather conditions.

[111] Target Driven Adaptive Loss For Infrared Small Target Detection

Yuho Shoji,Takahiro Toizumi,Atsushi Ito

Main category: cs.CV

TL;DR: 提出了一种目标驱动自适应（TDA）损失函数，用于提升红外小目标检测（IRSTD）的性能，解决了现有损失函数在局部区域检测和小尺度、低对比度目标上的不足。

Details

Motivation: 现有损失函数（如二元交叉熵损失和IoU损失）在训练分割模型时，未能有效提升局部区域检测性能和对小尺度、低对比度目标的鲁棒性。 Method: 提出TDA损失，引入基于块的机制和自适应调整策略，针对目标的尺度和局部对比度进行优化。 Result: 在三个IRSTD数据集上验证，TDA损失优于现有损失函数。 Conclusion: TDA损失通过关注目标周围局部区域和小尺度、低对比度目标，显著提升了检测性能。 Abstract: We propose a target driven adaptive (TDA) loss to enhance the performance of infrared small target detection (IRSTD). Prior works have used loss functions, such as binary cross-entropy loss and IoU loss, to train segmentation models for IRSTD. Minimizing these loss functions guides models to extract pixel-level features or global image context. However, they have two issues: improving detection performance for local regions around the targets and enhancing robustness to small scale and low local contrast. To address these issues, the proposed TDA loss introduces a patch-based mechanism, and an adaptive adjustment strategy to scale and local contrast. The proposed TDA loss leads the model to focus on local regions around the targets and pay particular attention to targets with smaller scales and lower local contrast. We evaluate the proposed method on three datasets for IRSTD. The results demonstrate that the proposed TDA loss achieves better detection performance than existing losses on these datasets.

[112] CLIP-driven rain perception: Adaptive deraining with pattern-aware network routing and mask-guided cross-attention

Cong Guan,Osamu Yoshie

Main category: cs.CV

TL;DR: 提出了一种基于CLIP的雨感知网络（CLIP-RPN），通过视觉-语言匹配分数自动识别雨模式，并动态路由到子网络处理不同雨模式，结合掩码引导的跨注意力机制（MGCA）和动态损失调度（DLS），显著提升了去雨效果。

Details

Motivation: 现有去雨模型使用单一网络处理所有雨图像，但不同雨模式差异显著，单一网络难以应对多样性。 Method: 利用CLIP的跨模态对齐能力识别雨模式，动态激活子网络；引入MGCA机制预测多尺度雨掩码，使用DLS优化训练过程。 Result: 在多个数据集上达到最优性能，尤其在复杂混合数据集中表现突出。 Conclusion: CLIP-RPN通过语义感知和动态路由机制，有效提升了模型处理多样雨模式的能力。 Abstract: Existing deraining models process all rainy images within a single network. However, different rain patterns have significant variations, which makes it challenging for a single network to handle diverse types of raindrops and streaks. To address this limitation, we propose a novel CLIP-driven rain perception network (CLIP-RPN) that leverages CLIP to automatically perceive rain patterns by computing visual-language matching scores and adaptively routing to sub-networks to handle different rain patterns, such as varying raindrop densities, streak orientations, and rainfall intensity. CLIP-RPN establishes semantic-aware rain pattern recognition through CLIP's cross-modal visual-language alignment capabilities, enabling automatic identification of precipitation characteristics across different rain scenarios. This rain pattern awareness drives an adaptive subnetwork routing mechanism where specialized processing branches are dynamically activated based on the detected rain type, significantly enhancing the model's capacity to handle diverse rainfall conditions. Furthermore, within sub-networks of CLIP-RPN, we introduce a mask-guided cross-attention mechanism (MGCA) that predicts precise rain masks at multi-scale to facilitate contextual interactions between rainy regions and clean background areas by cross-attention. We also introduces a dynamic loss scheduling mechanism (DLS) to adaptively adjust the gradients for the optimization process of CLIP-RPN. Compared with the commonly used $l_1$ or $l_2$ loss, DLS is more compatible with the inherent dynamics of the network training process, thus achieving enhanced outcomes. Our method achieves state-of-the-art performance across multiple datasets, particularly excelling in complex mixed datasets.

[113] Synthetic Data Augmentation using Pre-trained Diffusion Models for Long-tailed Food Image Classification

GaYeon Koh,Hyun-Jic Oh,Jeonghyun Noh,Won-Ki Jeong

Main category: cs.CV

TL;DR: 提出了一种基于预训练扩散模型的两阶段合成数据增强框架，用于解决长尾食物分类问题，通过正负提示条件生成合成数据，提升分类性能。

Details

Motivation: 现实世界中的食物图像分布不均，导致模型偏向多数类，影响少数类的分类性能。扩散模型生成合成数据是一种潜在解决方案，但现有方法存在数据分布或类别分离问题。 Method: 提出两阶段框架：首先生成基于正提示的参考集，再选择相似特征的类别作为负提示，通过组合采样策略生成合成数据，增强类内多样性和类间分离。 Result: 在两个长尾食物基准数据集上验证，提出的方法在top-1准确率上优于先前工作。 Conclusion: 该方法通过合成数据增强有效解决了长尾食物分类问题，提升了模型性能。 Abstract: Deep learning-based food image classification enables precise identification of food categories, further facilitating accurate nutritional analysis. However, real-world food images often show a skewed distribution, with some food types being more prevalent than others. This class imbalance can be problematic, causing models to favor the majority (head) classes with overall performance degradation for the less common (tail) classes. Recently, synthetic data augmentation using diffusion-based generative models has emerged as a promising solution to address this issue. By generating high-quality synthetic images, these models can help uniformize the data distribution, potentially improving classification performance. However, existing approaches face challenges: fine-tuning-based methods need a uniformly distributed dataset, while pre-trained model-based approaches often overlook inter-class separation in synthetic data. In this paper, we propose a two-stage synthetic data augmentation framework, leveraging pre-trained diffusion models for long-tailed food classification. We generate a reference set conditioned by a positive prompt on the generation target and then select a class that shares similar features with the generation target as a negative prompt. Subsequently, we generate a synthetic augmentation set using positive and negative prompt conditions by a combined sampling strategy that promotes intra-class diversity and inter-class separation. We demonstrate the efficacy of the proposed method on two long-tailed food benchmark datasets, achieving superior performance compared to previous works in terms of top-1 accuracy.

[114] PointT2I: LLM-based text-to-image generation via keypoints

Taekyung Lee,Donggyu Lee,Myungjoo Kang

Main category: cs.CV

TL;DR: PointT2I是一个利用大语言模型（LLM）生成与文本提示中人体姿势准确对应的图像的框架，包含关键点生成、图像生成和反馈系统三个部分。

Details

Motivation: 尽管T2I生成模型在生成高质量图像方面取得了进展，但在处理包含复杂概念（如人体姿势）的提示时仍存在挑战。 Method: PointT2I通过LLM生成关键点，结合文本提示生成图像，并通过反馈系统评估语义一致性。 Result: 该框架无需微调即可生成与文本提示中姿势准确对齐的图像。 Conclusion: PointT2I是首个利用LLM进行关键点引导图像生成的框架，仅依赖文本提示即可实现精确姿势对齐。 Abstract: Text-to-image (T2I) generation model has made significant advancements, resulting in high-quality images aligned with an input prompt. However, despite T2I generation's ability to generate fine-grained images, it still faces challenges in accurately generating images when the input prompt contains complex concepts, especially human pose. In this paper, we propose PointT2I, a framework that effectively generates images that accurately correspond to the human pose described in the prompt by using a large language model (LLM). PointT2I consists of three components: Keypoint generation, Image generation, and Feedback system. The keypoint generation uses an LLM to directly generate keypoints corresponding to a human pose, solely based on the input prompt, without external references. Subsequently, the image generation produces images based on both the text prompt and the generated keypoints to accurately reflect the target pose. To refine the outputs of the preceding stages, we incorporate an LLM-based feedback system that assesses the semantic consistency between the generated contents and the given prompts. Our framework is the first approach to leveraging LLM for keypoints-guided image generation without any fine-tuning, producing accurate pose-aligned images based solely on textual prompts.

[115] SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization

Peiyao Wang,Haibin Ling

Main category: cs.CV

TL;DR: SVQA-R1框架通过Spatial-GRPO增强空间视觉问答任务中的推理能力，显著提升准确性并展示可解释性。

Details

Motivation: 现有视觉语言模型在空间推理能力上不足，尤其是空间视觉问答任务中对相对位置、距离和物体配置的理解。 Method: 提出SVQA-R1框架，采用Spatial-GRPO策略，通过扰动物体间的空间关系（如镜像翻转）构建视图一致的奖励，增强空间理解。 Result: SVQA-R1在空间VQA基准测试中显著提升准确性，并展示无需监督微调的可解释推理路径。 Conclusion: SVQA-R1通过新颖的RL策略有效提升了空间推理能力，实验验证了其广泛适用性。 Abstract: Spatial reasoning remains a critical yet underdeveloped capability in existing vision-language models (VLMs), especially for Spatial Visual Question Answering (Spatial VQA) tasks that require understanding relative positions, distances, and object configurations. Inspired by the R1 paradigm introduced in DeepSeek-R1, which enhances reasoning in language models through rule-based reinforcement learning (RL), we propose SVQA-R1, the first framework to extend R1-style training to spatial VQA. In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing spatial relations between objects, e.g., mirror flipping, thereby encouraging the model to develop a consistent and grounded understanding of space. Our model, SVQA-R1, not only achieves dramatically improved accuracy on spatial VQA benchmarks but also exhibits interpretable reasoning paths even without using supervised fine-tuning (SFT) data. Extensive experiments and visualization demonstrate the effectiveness of SVQA-R1 across multiple spatial reasoning benchmarks.

[116] No Train Yet Gain: Towards Generic Multi-Object Tracking in Sports and Beyond

Tomasz Stanczyk,Seongro Yoon,Francois Bremond

Main category: cs.CV

TL;DR: McByte是一种无需训练的多目标跟踪框架，通过结合时间传播的分割掩码提升鲁棒性，适用于体育和行人跟踪。

Details

Motivation: 体育分析中的多目标跟踪面临快速运动、遮挡和相机移动等挑战，传统方法需要大量调参或难以处理轨迹。 Method: McByte采用基于检测的跟踪框架，利用时间传播的分割掩码作为关联线索，无需训练或视频特定调参。 Result: 在SportsMOT、DanceTrack等数据集上表现优异，验证了掩码传播对通用跟踪的适应性。 Conclusion: McByte展示了掩码传播在多目标跟踪中的优势，提供了一种更通用和鲁棒的解决方案。 Abstract: Multi-object tracking (MOT) is essential for sports analytics, enabling performance evaluation and tactical insights. However, tracking in sports is challenging due to fast movements, occlusions, and camera shifts. Traditional tracking-by-detection methods require extensive tuning, while segmentation-based approaches struggle with track processing. We propose McByte, a tracking-by-detection framework that integrates temporally propagated segmentation mask as an association cue to improve robustness without per-video tuning. Unlike many existing methods, McByte does not require training, relying solely on pre-trained models and object detectors commonly used in the community. Evaluated on SportsMOT, DanceTrack, SoccerNet-tracking 2022 and MOT17, McByte demonstrates strong performance across sports and general pedestrian tracking. Our results highlight the benefits of mask propagation for a more adaptable and generalizable MOT approach. Code will be made available at https://github.com/tstanczyk95/McByte.

[117] RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes

Pou-Chun Kung,Skanda Harisha,Ram Vasudevan,Aline Eid,Katherine A. Skinner

Main category: cs.CV

TL;DR: RadarSplat结合高斯散射与新型雷达噪声建模，提升雷达数据合成与3D重建质量。

Details

Motivation: 雷达在恶劣天气中表现优异，但现有方法在噪声场景下效果不佳，且无法合成真实雷达数据。 Method: 提出RadarSplat，整合高斯散射与雷达噪声建模，实现更真实的雷达数据合成与3D重建。 Result: 在雷达图像合成（+3.4 PSNR / 2.6x SSIM）和几何重建（-40% RMSE / 1.5x Accuracy）上优于现有技术。 Conclusion: RadarSplat能高效生成高保真雷达数据并提升场景重建质量。 Abstract: High-Fidelity 3D scene reconstruction plays a crucial role in autonomous driving by enabling novel data generation from existing datasets. This allows simulating safety-critical scenarios and augmenting training datasets without incurring further data collection costs. While recent advances in radiance fields have demonstrated promising results in 3D reconstruction and sensor data synthesis using cameras and LiDAR, their potential for radar remains largely unexplored. Radar is crucial for autonomous driving due to its robustness in adverse weather conditions like rain, fog, and snow, where optical sensors often struggle. Although the state-of-the-art radar-based neural representation shows promise for 3D driving scene reconstruction, it performs poorly in scenarios with significant radar noise, including receiver saturation and multipath reflection. Moreover, it is limited to synthesizing preprocessed, noise-excluded radar images, failing to address realistic radar data synthesis. To address these limitations, this paper proposes RadarSplat, which integrates Gaussian Splatting with novel radar noise modeling to enable realistic radar data synthesis and enhanced 3D reconstruction. Compared to the state-of-the-art, RadarSplat achieves superior radar image synthesis (+3.4 PSNR / 2.6x SSIM) and improved geometric reconstruction (-40% RMSE / 1.5x Accuracy), demonstrating its effectiveness in generating high-fidelity radar data and scene reconstruction. A project page is available at https://umautobots.github.io/radarsplat.

[118] Playing with Transformer at 30+ FPS via Next-Frame Diffusion

Xinle Cheng,Tianyu He,Jiayi Xu,Junliang Guo,Di He,Jiang Bian

Main category: cs.CV

TL;DR: NFD是一种自回归扩散变换器，通过块级因果注意力和并行令牌生成实现高效推理，结合一致性蒸馏和推测采样，首次在A100 GPU上以30 FPS实现自回归视频生成。

Details

Motivation: 解决自回归视频模型在实时生成中的高计算成本和硬件效率问题。 Method: 提出Next-Frame Diffusion (NFD)，结合块级因果注意力和并行令牌生成；引入一致性蒸馏和推测采样优化推理效率。 Result: 在动作条件视频生成基准测试中，NFD在视觉质量和采样效率上优于基线模型，首次实现30 FPS的自回归视频生成。 Conclusion: NFD通过创新方法显著提升了自回归视频生成的效率和实时性。 Abstract: Autoregressive video models offer distinct advantages over bidirectional diffusion models in creating interactive video content and supporting streaming applications with arbitrary duration. In this work, we present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling iterative sampling and efficient inference via parallel token generation within each frame. Nonetheless, achieving real-time video generation remains a significant challenge for such models, primarily due to the high computational cost associated with diffusion sampling and the hardware inefficiencies inherent to autoregressive generation. To address this, we introduce two innovations: (1) We extend consistency distillation to the video domain and adapt it specifically for video models, enabling efficient inference with few sampling steps; (2) To fully leverage parallel computation, motivated by the observation that adjacent frames often share the identical action input, we propose speculative sampling. In this approach, the model generates next few frames using current action input, and discard speculatively generated frames if the input action differs. Experiments on a large-scale action-conditioned video generation benchmark demonstrate that NFD beats autoregressive baselines in terms of both visual quality and sampling efficiency. We, for the first time, achieves autoregressive video generation at over 30 Frames Per Second (FPS) on an A100 GPU using a 310M model.

[119] VRD-IU: Lessons from Visually Rich Document Intelligence and Understanding

Yihao Ding,Soyeon Caren Han,Yan Li,Josiah Poon

Main category: cs.CV

TL;DR: VRD-IU竞赛聚焦于从多格式表单中提取和定位关键信息，展示了多种先进方法，并在视觉丰富文档理解（VRDU）领域设定了新基准。

Details

Motivation: 解决表单类文档因复杂布局、多利益相关者参与和高结构可变性带来的独特挑战。 Method: 竞赛分为两个赛道：Track A（基于实体的关键信息检索）和Track B（端到端关键信息定位），采用了分层分解、基于Transformer的检索、多模态特征融合和高级目标检测技术。 Result: 超过20个团队参与，展示了多种先进方法，并设定了VRDU领域的新基准。 Conclusion: 竞赛为文档智能领域提供了宝贵见解，推动了VRDU技术的发展。 Abstract: Visually Rich Document Understanding (VRDU) has emerged as a critical field in document intelligence, enabling automated extraction of key information from complex documents across domains such as medical, financial, and educational applications. However, form-like documents pose unique challenges due to their complex layouts, multi-stakeholder involvement, and high structural variability. Addressing these issues, the VRD-IU Competition was introduced, focusing on extracting and localizing key information from multi-format forms within the Form-NLU dataset, which includes digital, printed, and handwritten documents. This paper presents insights from the competition, which featured two tracks: Track A, emphasizing entity-based key information retrieval, and Track B, targeting end-to-end key information localization from raw document images. With over 20 participating teams, the competition showcased various state-of-the-art methodologies, including hierarchical decomposition, transformer-based retrieval, multimodal feature fusion, and advanced object detection techniques. The top-performing models set new benchmarks in VRDU, providing valuable insights into document intelligence.

[120] Neural shape reconstruction from multiple views with static pattern projection

Ryo Furukawa,Kota Nishihara,Hiroshi Kawasaki

Main category: cs.CV

TL;DR: 提出了一种基于神经符号距离场（NeuralSDF）和体积差分渲染技术的方法，实现相机和投影仪在运动中自动校准相对位姿，从而提高主动立体系统的便利性。

Details

Motivation: 主动立体系统通常需要相机和投影仪固定且精确校准，限制了其便利性。若能自由移动相机和投影仪，将显著提升系统实用性。 Method: 通过捕捉相机和投影仪运动中的多张图像，利用NeuralSDF和体积差分渲染技术自动校准相对位姿，实现目标物体的形状恢复。 Result: 实验通过合成和真实图像进行3D重建，验证了方法的有效性。 Conclusion: 该方法成功实现了相机和投影仪在运动中的自动校准，提升了主动立体系统的便利性和实用性。 Abstract: Active-stereo-based 3D shape measurement is crucial for various purposes, such as industrial inspection, reverse engineering, and medical systems, due to its strong ability to accurately acquire the shape of textureless objects. Active stereo systems typically consist of a camera and a pattern projector, tightly fixed to each other, and precise calibration between a camera and a projector is required, which in turn decreases the usability of the system. If a camera and a projector can be freely moved during shape scanning process, it will drastically increase the convenience of the usability of the system. To realize it, we propose a technique to recover the shape of the target object by capturing multiple images while both the camera and the projector are in motion, and their relative poses are auto-calibrated by our neural signed-distance-field (NeuralSDF) using novel volumetric differential rendering technique. In the experiment, the proposed method is evaluated by performing 3D reconstruction using both synthetic and real images.

[121] ViTA-PAR: Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition

Minjeong Park,Hongbeen Park,Jinkyu Kim

Main category: cs.CV

TL;DR: ViTA-PAR提出了一种结合视觉和文本属性对齐的多模态提示方法，用于行人属性识别（PAR），通过全局到局部的语义捕捉和上下文学习，提升了识别性能。

Details

Motivation: 现有PAR方法受限于固定水平区域的属性分类，无法处理属性出现在多变或意外位置的情况，ViTA-PAR旨在通过多模态提示和视觉-语言对齐解决这一问题。 Method: ViTA-PAR设计了视觉属性提示（捕捉全局到局部语义）和可学习的文本提示模板（学习人物和属性上下文），并进行了视觉与文本特征的对齐融合。 Result: 在四个PAR基准测试中，ViTA-PAR表现出色，推理高效。 Conclusion: ViTA-PAR通过多模态提示和特征对齐，显著提升了行人属性识别的鲁棒性和准确性。 Abstract: The Pedestrian Attribute Recognition (PAR) task aims to identify various detailed attributes of an individual, such as clothing, accessories, and gender. To enhance PAR performance, a model must capture features ranging from coarse-grained global attributes (e.g., for identifying gender) to fine-grained local details (e.g., for recognizing accessories) that may appear in diverse regions. Recent research suggests that body part representation can enhance the model's robustness and accuracy, but these methods are often restricted to attribute classes within fixed horizontal regions, leading to degraded performance when attributes appear in varying or unexpected body locations. In this paper, we propose Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition, dubbed as ViTA-PAR, to enhance attribute recognition through specialized multimodal prompting and vision-language alignment. We introduce visual attribute prompts that capture global-to-local semantics, enabling diverse attribute representations. To enrich textual embeddings, we design a learnable prompt template, termed person and attribute context prompting, to learn person and attributes context. Finally, we align visual and textual attribute features for effective fusion. ViTA-PAR is validated on four PAR benchmarks, achieving competitive performance with efficient inference. We release our code and model at https://github.com/mlnjeongpark/ViTA-PAR.

[122] Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

Yulei Qin,Gang Li,Zongyi Li,Zihan Xu,Yuchen Shi,Zhekai Lin,Xiao Cui,Ke Li,Xing Sun

Main category: cs.CV

TL;DR: 论文提出了一种系统性方法，通过激励推理和强化学习提升大语言模型（LLMs）处理复杂指令的能力，解决了传统思维链（CoT）方法的局限性。

Details

Motivation: 现有大语言模型在遵循复杂指令时表现不佳，尤其是涉及多约束并行、链式和分支结构时。传统思维链方法因浅层推理而效果有限。 Method: 提出基于强化学习的可验证规则奖励信号，结合样本对比和行为克隆，提升模型推理能力。 Result: 在七个基准测试中，1.5B参数的LLM性能提升11.74%，接近8B参数模型的表现。 Conclusion: 该方法有效提升了LLMs处理复杂指令的能力，代码和数据已开源。 Abstract: Existing large language models (LLMs) face challenges of following complex instructions, especially when multiple constraints are present and organized in paralleling, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we stem from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to a 8B LLM. Codes and data are available at https://github.com/yuleiqin/RAIF.

[123] DNAEdit: Direct Noise Alignment for Text-Guided Rectified Flow Editing

Chenxi Xie,Minghan Li,Shuai Li,Yuhui Wu,Qiaosi Yi,Lei Zhang

Main category: cs.CV

TL;DR: 提出了一种名为DNAEdit的新方法，通过直接优化噪声域中的高斯噪声（DNA）和移动速度引导（MVG），显著减少了传统方法中的误差累积，提升了图像编辑性能。

Details

Motivation: 传统基于扩散和RF的方法在噪声近似过程中会引入累积误差，导致重建精度下降。 Method: 提出Direct Noise Alignment (DNA) 直接优化噪声，并引入Mobile Velocity Guidance (MVG) 平衡背景保留和目标编辑。 Result: 实验证明DNAEdit在文本引导的图像编辑任务中优于现有方法。 Conclusion: DNAEdit通过减少误差累积和引入MVG，显著提升了图像编辑的准确性和可控性。 Abstract: Leveraging the powerful generation capability of large-scale pretrained text-to-image models, training-free methods have demonstrated impressive image editing results. Conventional diffusion-based methods, as well as recent rectified flow (RF)-based methods, typically reverse synthesis trajectories by gradually adding noise to clean images, during which the noisy latent at the current timestep is used to approximate that at the next timesteps, introducing accumulated drift and degrading reconstruction accuracy. Considering the fact that in RF the noisy latent is estimated through direct interpolation between Gaussian noises and clean images at each timestep, we propose Direct Noise Alignment (DNA), which directly refines the desired Gaussian noise in the noise domain, significantly reducing the error accumulation in previous methods. Specifically, DNA estimates the velocity field of the interpolated noised latent at each timestep and adjusts the Gaussian noise by computing the difference between the predicted and expected velocity field. We validate the effectiveness of DNA and reveal its relationship with existing RF-based inversion methods. Additionally, we introduce a Mobile Velocity Guidance (MVG) to control the target prompt-guided generation process, balancing image background preservation and target object editability. DNA and MVG collectively constitute our proposed method, namely DNAEdit. Finally, we introduce DNA-Bench, a long-prompt benchmark, to evaluate the performance of advanced image editing models. Experimental results demonstrate that our DNAEdit achieves superior performance to state-of-the-art text-guided editing methods. Codes and benchmark will be available at \href{ https://xiechenxi99.github.io/DNAEdit/}{https://xiechenxi99.github.io/DNAEdit/}.

[124] Semantic Palette-Guided Color Propagation

Zi-Yu Zhang,Bing-Feng Seng,Ya-Feng Du,Kang Li,Zhe-Cheng Wang,Zheng-Jun Du

Main category: cs.CV

TL;DR: 提出了一种基于语义调色板的颜色传播方法，通过语义信息实现内容感知的颜色编辑。

Details

Motivation: 传统方法依赖低层次视觉线索（如颜色、纹理）难以实现内容感知的颜色传播，而现有引入语义信息的方法常导致全局颜色变化不自然。 Method: 首先从输入图像中提取语义调色板，然后通过最小化设计的能量函数求解编辑后的调色板，最后将局部编辑准确传播到语义相似区域。 Result: 实验证明该方法能高效且精确地进行像素级颜色编辑，确保颜色传播内容感知。 Conclusion: 该方法克服了传统方法的局限性，实现了自然且内容感知的颜色传播。 Abstract: Color propagation aims to extend local color edits to similar regions across the input image. Conventional approaches often rely on low-level visual cues such as color, texture, or lightness to measure pixel similarity, making it difficult to achieve content-aware color propagation. While some recent approaches attempt to introduce semantic information into color editing, but often lead to unnatural, global color change in color adjustments. To overcome these limitations, we present a semantic palette-guided approach for color propagation. We first extract a semantic palette from an input image. Then, we solve an edited palette by minimizing a well-designed energy function based on user edits. Finally, local edits are accurately propagated to regions that share similar semantics via the solved palette. Our approach enables efficient yet accurate pixel-level color editing and ensures that local color changes are propagated in a content-aware manner. Extensive experiments demonstrated the effectiveness of our method.

[125] MS-RAFT-3D: A Multi-Scale Architecture for Recurrent Image-Based Scene Flow

Jakob Schmid,Azin Jahedi,Noah Berenguel Senn,Andrés Bruhn

Main category: cs.CV

TL;DR: 论文提出了一种多尺度方法，将光流中的分层思想推广到基于图像的场景流中，显著提升了性能。

Details

Motivation: 尽管多尺度概念在光流和立体视觉的循环网络架构中已被证明有效，但尚未应用于基于图像的场景流。 Method: 基于单尺度循环场景流主干，开发了多尺度方法，改进了特征和上下文编码器、粗到细框架及训练损失。 Result: 在KITTI和Spring数据集上分别以8.7%和65.8%的优势超越了当前最优方法。 Conclusion: 多尺度方法在场景流任务中表现优异，代码已开源。 Abstract: Although multi-scale concepts have recently proven useful for recurrent network architectures in the field of optical flow and stereo, they have not been considered for image-based scene flow so far. Hence, based on a single-scale recurrent scene flow backbone, we develop a multi-scale approach that generalizes successful hierarchical ideas from optical flow to image-based scene flow. By considering suitable concepts for the feature and the context encoder, the overall coarse-to-fine framework and the training loss, we succeed to design a scene flow approach that outperforms the current state of the art on KITTI and Spring by 8.7%(3.89 vs. 4.26) and 65.8% (9.13 vs. 26.71), respectively. Our code is available at https://github.com/cv-stuttgart/MS-RAFT-3D.

[126] A Novel Context-Adaptive Fusion of Shadow and Highlight Regions for Efficient Sonar Image Classification

Kamal Basha S,Anukul Kiran B,Athira Nambiar,Suresh Rajendran

Main category: cs.CV

TL;DR: 提出了一种上下文自适应的声纳图像分类框架，结合阴影和高光特征，并引入区域感知去噪模型和扩展数据集S3Simulator+，以提升水下声纳图像分析的鲁棒性和分类可靠性。

Details

Motivation: 现有研究主要基于高光分析，而阴影区域的分类研究不足，限制了水下物体检测和分类的准确性。 Method: 提出上下文自适应分类框架，包括阴影特定分类器和自适应阴影分割，结合区域感知去噪模型和解释性优化策略。 Result: 框架优化了特征表示，提升了噪声和遮挡情况下的鲁棒性，同时增强了分类的可靠性和可解释性。 Conclusion: 通过结合新分类策略和扩展数据集，解决了声纳图像分析中的关键挑战，推动了自主水下感知的发展。 Abstract: Sonar imaging is fundamental to underwater exploration, with critical applications in defense, navigation, and marine research. Shadow regions, in particular, provide essential cues for object detection and classification, yet existing studies primarily focus on highlight-based analysis, leaving shadow-based classification underexplored. To bridge this gap, we propose a Context-adaptive sonar image classification framework that leverages advanced image processing techniques to extract and integrate discriminative shadow and highlight features. Our framework introduces a novel shadow-specific classifier and adaptive shadow segmentation, enabling effective classification based on the dominant region. This approach ensures optimal feature representation, improving robustness against noise and occlusions. In addition, we introduce a Region-aware denoising model that enhances sonar image quality by preserving critical structural details while suppressing noise. This model incorporates an explainability-driven optimization strategy, ensuring that denoising is guided by feature importance, thereby improving interpretability and classification reliability. Furthermore, we present S3Simulator+, an extended dataset incorporating naval mine scenarios with physics-informed noise specifically tailored for the underwater sonar domain, fostering the development of robust AI models. By combining novel classification strategies with an enhanced dataset, our work addresses key challenges in sonar image analysis, contributing to the advancement of autonomous underwater perception.

[127] DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion

Geunmin Hwang,Hyun-kyu Ko,Younghyun Kim,Seungryong Lee,Eunbyung Park

Main category: cs.CV

TL;DR: 本文提出了一种无需训练的DiffuseSlide方法，利用预训练扩散模型生成高帧率视频，解决了现有方法在长序列中闪烁和质量下降的问题。

Details

Motivation: 高帧率视频生成存在闪烁和质量下降的挑战，现有方法计算效率低且难以保持视频质量。 Method: DiffuseSlide通过关键帧提取、噪声重注入和滑动窗口潜在去噪技术，无需额外微调即可生成高质量视频。 Result: 实验表明，该方法显著提升了视频质量，增强了时间一致性和空间保真度。 Conclusion: DiffuseSlide计算高效且适用于多种视频生成任务，适用于虚拟现实、游戏和高质量内容创作。 Abstract: Recent advancements in diffusion models have revolutionized video generation, enabling the creation of high-quality, temporally consistent videos. However, generating high frame-rate (FPS) videos remains a significant challenge due to issues such as flickering and degradation in long sequences, particularly in fast-motion scenarios. Existing methods often suffer from computational inefficiencies and limitations in maintaining video quality over extended frames. In this paper, we present a novel, training-free approach for high FPS video generation using pre-trained diffusion models. Our method, DiffuseSlide, introduces a new pipeline that leverages key frames from low FPS videos and applies innovative techniques, including noise re-injection and sliding window latent denoising, to achieve smooth, consistent video outputs without the need for additional fine-tuning. Through extensive experiments, we demonstrate that our approach significantly improves video quality, offering enhanced temporal coherence and spatial fidelity. The proposed method is not only computationally efficient but also adaptable to various video generation tasks, making it ideal for applications such as virtual reality, video games, and high-quality content creation.

[128] Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark

Shuyu Yang,Yilun Wang,Yaxiong Wang,Li Zhu,Zhedong Zheng

Main category: cs.CV

TL;DR: SVTA是一个利用生成模型解决视频异常检索数据稀缺和隐私问题的合成数据集，包含41,315个视频和配对文本，涵盖68种异常事件和30种正常活动。

Details

Motivation: 现有数据集因异常事件的长尾性和隐私问题导致数据稀缺，SVTA旨在通过合成数据解决这些问题。 Method: 使用大型语言模型生成视频描述，并指导视频生成模型创建多样化的高质量视频。 Result: SVTA包含1.36M帧视频和配对文本，测试显示其具有挑战性且能有效评估跨模态检索方法。 Conclusion: SVTA消除了真实数据收集的隐私风险，同时保持了场景的真实性。 Abstract: Video anomaly retrieval aims to localize anomalous events in videos using natural language queries to facilitate public safety. However, existing datasets suffer from severe limitations: (1) data scarcity due to the long-tail nature of real-world anomalies, and (2) privacy constraints that impede large-scale collection. To address the aforementioned issues in one go, we introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval, leveraging generative models to overcome data availability challenges. Specifically, we collect and generate video descriptions via the off-the-shelf LLM (Large Language Model) covering 68 anomaly categories, e.g., throwing, stealing, and shooting. These descriptions encompass common long-tail events. We adopt these texts to guide the video generative model to produce diverse and high-quality videos. Finally, our SVTA involves 41,315 videos (1.36M frames) with paired captions, covering 30 normal activities, e.g., standing, walking, and sports, and 68 anomalous events, e.g., falling, fighting, theft, explosions, and natural disasters. We adopt three widely-used video-text retrieval baselines to comprehensively test our SVTA, revealing SVTA's challenging nature and its effectiveness in evaluating a robust cross-modal retrieval method. SVTA eliminates privacy risks associated with real-world anomaly collection while maintaining realistic scenarios. The dataset demo is available at: [https://svta-mm.github.io/SVTA.github.io/].

[129] Sheep Facial Pain Assessment Under Weighted Graph Neural Networks

Alam Noor,Luis Almeida,Mohamed Daoudi,Kai Li,Eduardo Tovar

Main category: cs.CV

TL;DR: 提出了一种基于加权图神经网络（WGNN）的模型，用于通过绵羊面部标志点检测和预测疼痛水平，并创建了一个新的绵羊面部标志点数据集。

Details

Motivation: 准确识别和评估绵羊的疼痛对动物健康和福利至关重要，但目前缺乏自动监测疼痛的有效方法。 Method: 使用WGNN模型连接检测到的面部标志点并定义疼痛水平，同时提出新的绵羊面部标志点数据集。 Result: YOLOv8n检测器的mAP为59.30%，WGNN框架在跟踪多面部部位表情时的准确率为92.71%。 Conclusion: WGNN模型和新的数据集为绵羊疼痛检测提供了高效且准确的解决方案。 Abstract: Accurately recognizing and assessing pain in sheep is key to discern animal health and mitigating harmful situations. However, such accuracy is limited by the ability to manage automatic monitoring of pain in those animals. Facial expression scoring is a widely used and useful method to evaluate pain in both humans and other living beings. Researchers also analyzed the facial expressions of sheep to assess their health state and concluded that facial landmark detection and pain level prediction are essential. For this purpose, we propose a novel weighted graph neural network (WGNN) model to link sheep's detected facial landmarks and define pain levels. Furthermore, we propose a new sheep facial landmarks dataset that adheres to the parameters of the Sheep Facial Expression Scale (SPFES). Currently, there is no comprehensive performance benchmark that specifically evaluates the use of graph neural networks (GNNs) on sheep facial landmark data to detect and measure pain levels. The YOLOv8n detector architecture achieves a mean average precision (mAP) of 59.30% with the sheep facial landmarks dataset, among seven other detection models. The WGNN framework has an accuracy of 92.71% for tracking multiple facial parts expressions with the YOLOv8n lightweight on-board device deployment-capable model.

[130] SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition

Yiping Li,Ronald de Jong,Sahar Nasirihaghighi,Tim Jaspers,Romy van Jaarsveld,Gino Kuiper,Richard van Hillegersberg,Fons van der Sommen,Jelle Ruurda,Marcel Breeuwer,Yasmina Al Khalil

Main category: cs.CV

TL;DR: 提出了一种基于视频Transformer的模型，结合伪标签框架和对比学习，用于半监督手术阶段识别，显著减少标注需求并提升性能。

Details

Motivation: 手术视频标注耗时费力，研究旨在利用未标注数据实现高性能，减少对大量标注的依赖。 Method: 采用视频Transformer模型，结合时间一致性正则化和对比学习，利用伪标签优化特征空间。 Result: 在RAMIE数据集上准确率提升4.9%，在Cholec80上仅用1/4标注数据即达到全监督可比结果。 Conclusion: 为半监督手术阶段识别设定了新基准，推动了该领域未来研究。 Abstract: Accurate surgical phase recognition is crucial for computer-assisted interventions and surgical video analysis. Annotating long surgical videos is labor-intensive, driving research toward leveraging unlabeled data for strong performance with minimal annotations. Although self-supervised learning has gained popularity by enabling large-scale pretraining followed by fine-tuning on small labeled subsets, semi-supervised approaches remain largely underexplored in the surgical domain. In this work, we propose a video transformer-based model with a robust pseudo-labeling framework. Our method incorporates temporal consistency regularization for unlabeled data and contrastive learning with class prototypes, which leverages both labeled data and pseudo-labels to refine the feature space. Through extensive experiments on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public Cholec80 dataset, we demonstrate the effectiveness of our approach. By incorporating unlabeled data, we achieve state-of-the-art performance on RAMIE with a 4.9% accuracy increase and obtain comparable results to full supervision while using only 1/4 of the labeled data on Cholec80. Our findings establish a strong benchmark for semi-supervised surgical phase recognition, paving the way for future research in this domain.

[131] Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation

Kaihang Pan,Yang Wu,Wendong Bu,Kai Shen,Juncheng Li,Yingting Wang,Yunfei Li,Siliang Tang,Jun Xiao,Fei Wu,Hang Zhao,Yueting Zhuang

Main category: cs.CV

TL;DR: 论文提出了一种方法，通过协同进化的方式统一视觉理解和生成能力，将图像生成提升为迭代内省过程。

Details

Motivation: 当前多模态大语言模型（MLLMs）中视觉理解和生成能力相互独立，未能相互增强，限制了图像生成的潜力。 Method: 采用两阶段训练：监督微调教授MLLM生成真实的视觉生成链式推理（CoT），强化学习通过探索-利用权衡激活其潜力。 Result: 模型在文本到图像生成、图像编辑和图像语义评估任务中表现优异，视觉理解能力显著提升。 Conclusion: 该方法成功实现了视觉理解和生成的协同进化，推动了MLLMs在统一图像生成任务中的进展。 Abstract: Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they are two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM with the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.

[132] FDSG: Forecasting Dynamic Scene Graphs

Yi Yang,Yuren Cong,Hao Cheng,Bodo Rosenhahn,Michael Ying Yang

Main category: cs.CV

TL;DR: 论文提出了一种名为FDSG的新框架，用于预测未来帧中的实体标签、边界框和关系，同时生成观测帧的场景图。通过查询分解和神经随机微分方程建模动态，并通过时间聚合模块优化预测。

Details

Motivation: 现有方法未能有效建模实体和关系的动态变化，限制了视频场景理解的能力。 Method: 提出FDSG框架，结合查询分解和神经随机微分方程建模动态，并利用时间聚合模块整合预测与观测信息。 Result: 在Action Genome数据集上，FDSG在动态场景图生成、场景图预测和场景图预报任务中优于现有方法。 Conclusion: FDSG通过建模实体和关系的动态变化，显著提升了视频场景理解的性能。 Abstract: Dynamic scene graph generation extends scene graph generation from images to videos by modeling entity relationships and their temporal evolution. However, existing methods either generate scene graphs from observed frames without explicitly modeling temporal dynamics, or predict only relationships while assuming static entity labels and locations. These limitations hinder effective extrapolation of both entity and relationship dynamics, restricting video scene understanding. We propose Forecasting Dynamic Scene Graphs (FDSG), a novel framework that predicts future entity labels, bounding boxes, and relationships, for unobserved frames, while also generating scene graphs for observed frames. Our scene graph forecast module leverages query decomposition and neural stochastic differential equations to model entity and relationship dynamics. A temporal aggregation module further refines predictions by integrating forecasted and observed information via cross-attention. To benchmark FDSG, we introduce Scene Graph Forecasting, a new task for full future scene graph prediction. Experiments on Action Genome show that FDSG outperforms state-of-the-art methods on dynamic scene graph generation, scene graph anticipation, and scene graph forecasting. Codes will be released upon publication.

[133] Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity

Yuya Kobayashi,Yuhta Takida,Takashi Shibuya,Yuki Mitsufuji

Main category: cs.CV

TL;DR: SCAD提出了一种高效且高保真的文本到图像生成方法，通过结合预训练模型和专用判别器，显著降低了训练成本，同时提升了生成多样性和样本质量。

Details

Motivation: 大规模GAN训练成本高，现有方法虽降低成本但牺牲了生成多样性。 Method: 采用两个专用判别器和Slicing Adversarial Networks (SANs)，并引入Per-Prompt Diversity (PPD)指标。 Result: SCAD在训练成本大幅降低的同时，生成多样性和样本保真度显著提升，零样本FID与最新大规模GAN相当。 Conclusion: SCAD为高效文本到图像生成提供了可行方案，平衡了成本与性能。 Abstract: Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale large text-to-image datasets. However, training such models entails a high training cost, limiting some applications and research usage. To reduce the cost, one promising direction is the incorporation of pre-trained models. The existing method of utilizing pre-trained models for a generator significantly reduced the training cost compared with the other large-scale GANs, but we found the model loses the diversity of generation for a given prompt by a large margin. To build an efficient and high-fidelity text-to-image GAN without compromise, we propose to use two specialized discriminators with Slicing Adversarial Networks (SANs) adapted for text-to-image tasks. Our proposed model, called SCAD, shows a notable enhancement in diversity for a given prompt with better sample fidelity. We also propose to use a metric called Per-Prompt Diversity (PPD) to evaluate the diversity of text-to-image models quantitatively. SCAD achieved a zero-shot FID competitive with the latest large-scale GANs at two orders of magnitude less training cost.

[134] Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment

Kaixun Jiang,Zhaoyu Chen,Haijing Guo,Jinglun Li,Jiyuan Fu,Pinxue Guo,Hao Tang,Bo Li,Wenqiang Zhang

Main category: cs.CV

TL;DR: 论文提出了一种对抗性偏好对齐框架（APA），通过两阶段优化解决对抗样本生成中的视觉一致性与攻击效果的冲突问题。

Details

Motivation: 研究对抗性偏好对齐问题，解决传统对抗样本生成中视觉质量与攻击效果难以平衡的挑战。 Method: APA框架分为两阶段：第一阶段通过LoRA微调提升视觉一致性，第二阶段基于替代分类器反馈优化图像潜在表示或提示嵌入。 Result: 实验表明APA在保持高视觉一致性的同时显著提升了攻击迁移性。 Conclusion: APA为对抗攻击研究提供了新的对齐视角，未来可进一步探索。 Abstract: Preference alignment in diffusion models has primarily focused on benign human preferences (e.g., aesthetic). In this paper, we propose a novel perspective: framing unrestricted adversarial example generation as a problem of aligning with adversary preferences. Unlike benign alignment, adversarial alignment involves two inherently conflicting preferences: visual consistency and attack effectiveness, which often lead to unstable optimization and reward hacking (e.g., reducing visual quality to improve attack success). To address this, we propose APA (Adversary Preferences Alignment), a two-stage framework that decouples conflicting preferences and optimizes each with differentiable rewards. In the first stage, APA fine-tunes LoRA to improve visual consistency using rule-based similarity reward. In the second stage, APA updates either the image latent or prompt embedding based on feedback from a substitute classifier, guided by trajectory-level and step-wise rewards. To enhance black-box transferability, we further incorporate a diffusion augmentation strategy. Experiments demonstrate that APA achieves significantly better attack transferability while maintaining high visual consistency, inspiring further research to approach adversarial attacks from an alignment perspective. Code will be available at https://github.com/deep-kaixun/APA.

[135] Speed-up of Vision Transformer Models by Attention-aware Token Filtering

Takahiro Naruko,Hiroaki Akutsu

Main category: cs.CV

TL;DR: 本文提出了一种名为注意力感知令牌过滤（ATF）的新方法，用于加速ViT模型，同时保持任务准确性。

Details

Motivation: ViT模型在图像嵌入提取方面表现出色，但计算负担高，因此需要一种高效的加速方法。 Method: ATF通过令牌过滤模块和过滤策略动态筛选输入令牌，保留特定对象区域和静态高注意力区域的令牌。 Result: 在检索任务中，ATF将ViT模型SigLIP的速度提高了2.8倍，同时保持检索召回率。 Conclusion: ATF是一种有效的ViT模型加速方法，无需修改或微调编码器即可实现性能提升。 Abstract: Vision Transformer (ViT) models have made breakthroughs in image embedding extraction, which provide state-of-the-art performance in tasks such as zero-shot image classification. However, the models suffer from a high computational burden. In this paper, we propose a novel speed-up method for ViT models called Attention-aware Token Filtering (ATF). ATF consists of two main ideas: a novel token filtering module and a filtering strategy. The token filtering module is introduced between a tokenizer and a transformer encoder of the ViT model, without modifying or fine-tuning of the transformer encoder. The module filters out tokens inputted to the encoder so that it keeps tokens in regions of specific object types dynamically and keeps tokens in regions that statically receive high attention in the transformer encoder. This filtering strategy maintains task accuracy while filtering out tokens inputted to the transformer encoder. Evaluation results on retrieval tasks show that ATF provides $2.8\times$ speed-up to a ViT model, SigLIP, while maintaining the retrieval recall rate.

[136] Beyond black and white: A more nuanced approach to facial recognition with continuous ethnicity labels

Pedro C. Neto,Naser Damer,Jaime S. Cardoso,Ana F. Sequeira

Main category: cs.CV

TL;DR: 论文提出将种族标签从离散值改为连续变量，以更准确地平衡数据集，从而减少人脸识别模型中的偏见。实验证明连续空间平衡的数据集表现更优。

Details

Motivation: 人脸识别模型中的偏见问题长期存在，现有方法对数据偏见的缓解有限且缺乏对问题本质的洞察。 Method: 将种族标签作为连续变量而非离散值，并通过实验和理论验证其有效性。训练了65个不同模型，并创建了20多个数据子集。 Result: 连续空间平衡的数据集训练的模型表现优于离散空间平衡的模型。 Conclusion: 种族标签作为连续变量能更有效地平衡数据集，减少模型偏见。 Abstract: Bias has been a constant in face recognition models. Over the years, researchers have looked at it from both the model and the data point of view. However, their approach to mitigation of data bias was limited and lacked insight on the real nature of the problem. Here, in this document, we propose to revise our use of ethnicity labels as a continuous variable instead of a discrete value per identity. We validate our formulation both experimentally and theoretically, showcasing that not all identities from one ethnicity contribute equally to the balance of the dataset; thus, having the same number of identities per ethnicity does not represent a balanced dataset. We further show that models trained on datasets balanced in the continuous space consistently outperform models trained on data balanced in the discrete space. We trained more than 65 different models, and created more than 20 subsets of the original datasets.

Tianjiao Zhang,Fei Zhang,Jiangchao Yao,Ya Zhang,Yanfeng Wang

Main category: cs.CV

TL;DR: 利用大规模文本到图像扩散模型解决不精确分割问题，通过生成差异实现分割细化。

Details

Motivation: 传统方法依赖判别模型或密集视觉表示，而本文探索生成先验在分割任务中的潜力。 Method: 利用原始图像与掩码条件生成图像的模式差异，通过语义对齐和前景概率更新实现分割细化。 Result: 实验验证了方法的有效性和优越性，展示了生成差异建模密集表示的潜力。 Conclusion: 鼓励进一步探索生成方法解决判别任务，证明了生成先验的实用性。 Abstract: This paper considers the problem of utilizing a large-scale text-to-image diffusion model to tackle the challenging Inexact Segmentation (IS) task. Unlike traditional approaches that rely heavily on discriminative-model-based paradigms or dense visual representations derived from internal attention mechanisms, our method focuses on the intrinsic generative priors in Stable Diffusion~(SD). Specifically, we exploit the pattern discrepancies between original images and mask-conditional generated images to facilitate a coarse-to-fine segmentation refinement by establishing a semantic correspondence alignment and updating the foreground probability. Comprehensive quantitative and qualitative experiments validate the effectiveness and superiority of our plug-and-play design, underscoring the potential of leveraging generation discrepancies to model dense representations and encouraging further exploration of generative approaches for solving discriminative tasks.

[138] LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model

Xiaodong Wang,Zhirong Wu,Peixi Peng

Main category: cs.CV

TL;DR: 提出了一种分层解耦和自监督蒸馏方法，用于构建长期驾驶世界模型，显著提升了视频生成的连贯性和效率。

Details

Motivation: 当前驾驶世界模型在长期未来预测中存在误差累积问题，且训练与推理之间存在差距，限制了实际应用。 Method: 分层解耦为大规模运动学习和双向连续运动学习，并利用自监督蒸馏方法提升视频连贯性。 Result: 在NuScenes基准测试中，FVD提升27%，推理时间减少85%，能生成110+帧的连贯视频。 Conclusion: 提出的方法有效解决了长期视频生成的连贯性问题，显著提升了性能与效率。 Abstract: Driving world models are used to simulate futures by video generation based on the condition of the current state and actions. However, current models often suffer serious error accumulations when predicting the long-term future, which limits the practical application. Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. However, these models are always trained on short video clips (high fps and short duration), and multiple roll-out generations struggle to produce consistent and reasonable long videos due to the training-inference gap. To this end, we propose several solutions to build a simple yet effective long-term driving world model. First, we hierarchically decouple world model learning into large motion learning and bidirectional continuous motion learning. Then, considering the continuity of driving scenes, we propose a simple distillation method where fine-grained video flows are self-supervised signals for coarse-grained flows. The distillation is designed to improve the coherence of infinite video generation. The coarse-grained and fine-grained modules are coordinated to generate long-term and temporally coherent videos. In the public benchmark NuScenes, compared with the state-of-the-art front-view model, our model improves FVD by $27\%$ and reduces inference time by $85\%$ for the video task of generating 110+ frames. More videos (including 90s duration) are available at https://Wang-Xiaodong1899.github.io/longdwm/.

Bingqian Lin,Yunshuang Nie,Khun Loun Zai,Ziming Wei,Mingfei Han,Rongtao Xu,Minzhe Niu,Jianhua Han,Liang Lin,Cewu Lu,Xiaodan Liang

Main category: cs.CV

TL;DR: EvolveNav提出了一种自改进的推理框架，通过两阶段训练提升基于LLM的视觉语言导航性能。

Details

Motivation: 解决现有方法中直接输入-输出映射导致的决策难解释性和学习困难问题。 Method: 两阶段训练：1) 使用形式化CoT标签进行监督微调；2) 通过自反思后训练，利用模型自身推理输出作为增强标签。 Result: 在VLN基准测试中表现优于现有LLM-based方法。 Conclusion: EvolveNav通过自改进推理框架显著提升了导航决策的准确性和可解释性。 Abstract: Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs' training corpus and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, causing the mapping learning difficult and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, while the complexity of the navigation task makes the perfect CoT labels unavailable and may lead to overfitting through pure CoT supervised fine-tuning. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting with wrong ones. Experimental results on the popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at https://github.com/expectorlin/EvolveNav.

[140] SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes

Yuji Wang,Haoran Xu,Yong Liu,Jiaze Li,Yansong Tang

Main category: cs.CV

TL;DR: SAM2-LOVE是一个新颖的框架，通过整合文本、音频和视觉表示来提升Ref-AVS任务的性能，解决了多模态一致性和目标偏移问题。

Details

Motivation: 现有双模态方法因缺乏第三模态而失败，三模态方法则面临时空一致性挑战，导致目标偏移。 Method: 提出SAM2-LOVE框架，结合多模态融合模块、令牌传播和累积策略，以增强时空一致性和历史信息保留。 Result: 在Ref-AVS基准测试中，SAM2-LOVE比SOTA方法性能提升8.5%。 Conclusion: SAM2-LOVE通过简单有效的组件设计，显著提升了多模态场景理解的性能。 Abstract: Reference Audio-Visual Segmentation (Ref-AVS) aims to provide a pixel-wise scene understanding in Language-aided Audio-Visual Scenes (LAVS). This task requires the model to continuously segment objects referred to by text and audio from a video. Previous dual-modality methods always fail due to the lack of a third modality and the existing triple-modality method struggles with spatio-temporal consistency, leading to the target shift of different frames. In this work, we introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token to prompt and align SAM2 for achieving Ref-AVS in the LAVS. Technically, our approach includes a multimodal fusion module aimed at improving multimodal understanding of SAM2, as well as token propagation and accumulation strategies designed to enhance spatio-temporal consistency without forgetting historical information. We conducted extensive experiments to demonstrate that SAM2-LOVE outperforms the SOTA by 8.5\% in $\mathcal{J\&F}$ on the Ref-AVS benchmark and showcase the simplicity and effectiveness of the components. Our code will be available here.

[141] HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception

Wei Yao,Yunlian Sun,Hongwen Zhang,Yebin Liu,Jinhui Tang

Main category: cs.CV

TL;DR: HOSIG框架通过分层场景感知合成全身交互，解决了现有方法忽略场景上下文或协调不足的问题。

Details

Motivation: 现有的人类-物体交互方法常忽略场景上下文，导致不合理的穿透；而人类-场景交互方法难以协调精细操作与长距离导航。 Method: HOSIG框架包含三个关键组件：场景感知抓取姿势生成器、启发式导航算法和场景引导的运动扩散模型。 Result: 在TRUMANS数据集上表现优于现有方法，支持无限运动长度且需最少人工干预。 Conclusion: HOSIG填补了场景感知导航与灵巧物体操作之间的关键空白，推动了交互合成的前沿。 Abstract: Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: 1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, 2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and 3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation, advancing the frontier of embodied interaction synthesis. Codes will be available after publication. Project page: http://yw0208.github.io/hosig

Zhuohang Dang,Minnan Luo,Chengyou Jia,Hangwei Qian,Xiaojun Chang,Ivor W. Tsang

Main category: cs.CV

TL;DR: MDW框架通过蒸馏噪声多模态数据集为紧凑干净的数据集，提升模型训练效率和效果。

Details

Motivation: 解决多模态模型训练中大规模数据集存储成本高和噪声数据导致性能下降的问题。 Method: 引入可学习的细粒度对应关系，通过双轨协作学习避免噪声干扰，优化蒸馏数据。 Result: 实验显示MDW在多种压缩比下性能超越先前方法15%以上。 Conclusion: MDW具有显著的可扩展性和实用性，适用于不同资源需求的应用。 Abstract: Recent multi-modal models have shown remarkable versatility in real-world applications. However, their rapid development encounters two critical data challenges. First, the training process requires large-scale datasets, leading to substantial storage and computational costs. Second, these data are typically web-crawled with inevitable noise, i.e., partially mismatched pairs, severely degrading model performance. To these ends, we propose Multi-modal dataset Distillation in the Wild, i.e., MDW, the first framework to distill noisy multi-modal datasets into compact clean ones for effective and efficient model training. Specifically, MDW introduces learnable fine-grained correspondences during distillation and adaptively optimizes distilled data to emphasize correspondence-discriminative regions, thereby enhancing distilled data's information density and efficacy. Moreover, to capture robust cross-modal correspondence prior knowledge from real data, MDW proposes dual-track collaborative learning to avoid the risky data noise, alleviating information loss with certifiable noise tolerance. Extensive experiments validate MDW's theoretical and empirical efficacy with remarkable scalability, surpassing prior methods by over 15% across various compression ratios, highlighting its appealing practicality for applications with diverse efficacy and resource needs.

[143] EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models

Andy Bonnetto,Haozhe Qi,Franklin Leong,Matea Tashkovska,Mahdi Rad,Solaiman Shokur,Friedhelm Hummel,Silvestro Micera,Marc Pollefeys,Alexander Mathis

Main category: cs.CV

TL;DR: EPFL-Smart-Kitchen-30数据集是一个多模态厨房行为数据集，用于研究人类复杂动作，包含多视角同步数据，并提出了四个基准任务。

Details

Motivation: 厨房环境适合研究人类运动和认知功能，但缺乏高质量的多模态数据集。 Method: 使用RGB-D相机、IMU和HoloLens~2捕捉16名受试者在厨房中的动作，数据包括3D手部、身体和眼动。 Result: 数据集包含29.7小时的多模态数据，标注密集（33.78动作段/分钟），并提出了四个基准任务。 Conclusion: 该数据集有望推动行为理解与建模的研究，提供生态效度高的数据支持。 Abstract: Understanding behavior requires datasets that capture humans while carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens~2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal dataset, we propose four benchmarks to advance behavior understanding and modeling through 1) a vision-language benchmark, 2) a semantic text-to-motion generation benchmark, 3) a multi-modal action recognition benchmark, 4) a pose-based action segmentation benchmark. We expect the EPFL-Smart-Kitchen-30 dataset to pave the way for better methods as well as insights to understand the nature of ecologically-valid human behavior. Code and data are available at https://github.com/amathislab/EPFL-Smart-Kitchen

[144] Visual Explanation via Similar Feature Activation for Metric Learning

Yi Liao,Ugochukwu Ejike Akpudo,Jue Zhang,Yongsheng Gao,Jun Zhou,Wenyi Zeng,Weichuan Zhang

Main category: cs.CV

TL;DR: 论文提出了一种新的视觉解释方法SFAM，用于解决现有CAM方法无法直接应用于度量学习模型的问题。

Details

Motivation: 现有CAM方法依赖全连接层作为分类器，无法直接用于缺乏全连接层的度量学习模型，因此需要一种新的解释方法。 Method: 提出SFAM方法，通过通道贡献重要性分数（CIS）衡量特征重要性，并基于相似性度量构建解释图。 Result: 实验表明，SFAM能为使用欧氏距离或余弦相似度的CNN模型提供高度可解释的视觉解释。 Conclusion: SFAM是一种有效的视觉解释方法，适用于度量学习模型，增强了模型的可信度和可解释性。 Abstract: Visual explanation maps enhance the trustworthiness of decisions made by deep learning models and offer valuable guidance for developing new algorithms in image recognition tasks. Class activation maps (CAM) and their variants (e.g., Grad-CAM and Relevance-CAM) have been extensively employed to explore the interpretability of softmax-based convolutional neural networks, which require a fully connected layer as the classifier for decision-making. However, these methods cannot be directly applied to metric learning models, as such models lack a fully connected layer functioning as a classifier. To address this limitation, we propose a novel visual explanation method termed Similar Feature Activation Map (SFAM). This method introduces the channel-wise contribution importance score (CIS) to measure feature importance, derived from the similarity measurement between two image embeddings. The explanation map is constructed by linearly combining the proposed importance weights with the feature map from a CNN model. Quantitative and qualitative experiments show that SFAM provides highly promising interpretable visual explanations for CNN models using Euclidean distance or cosine similarity as the similarity metric.

Xuan Yu,Dayan Guan,Michael Ying Yang,Yanfeng Gu

Main category: cs.CV

TL;DR: Zoom-Refine是一种无需训练的MLLM增强方法，通过局部放大和自我细化提升高分辨率图像理解能力。

Details

Motivation: 解决MLLM在高分辨率图像中难以捕捉细粒度细节的问题。 Method: 结合局部放大（预测任务相关区域）和自我细化（整合细节重新评估），无需额外训练。 Result: 在两个高分辨率多模态基准测试中表现优异。 Conclusion: Zoom-Refine有效提升了MLLM的视觉理解能力，无需额外训练或专家干预。 Abstract: Multimodal Large Language Models (MLLM) often struggle to interpret high-resolution images accurately, where fine-grained details are crucial for complex visual understanding. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Zoom-Refine operates through a synergistic process of \textit{Localized Zoom} and \textit{Self-Refinement}. In the \textit{Localized Zoom} step, Zoom-Refine leverages the MLLM to provide a preliminary response to an input query and identifies the most task-relevant image region by predicting its bounding box coordinates. During the \textit{Self-Refinement} step, Zoom-Refine then integrates fine-grained details from the high-resolution crop (identified by \textit{Localized Zoom}) with its initial reasoning to re-evaluate and refine its preliminary response. Our method harnesses the MLLM's inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts. Comprehensive experiments demonstrate the efficacy of Zoom-Refine on two challenging high-resolution multimodal benchmarks. Code is available at \href{https://github.com/xavier-yu114/Zoom-Refine}{\color{magenta}github.com/xavier-yu114/Zoom-Refine}

[146] EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models

Yan Shu,Bin Ren,Zhitong Xiong,Danda Pani Paudel,Luc Van Gool,Begum Demir,Nicu Sebe,Paolo Rota

Main category: cs.CV

TL;DR: EarthMind是一个新型视觉语言框架，用于多粒度和多传感器地球观测数据理解，通过空间注意力提示和跨模态融合提升性能，并在多个基准测试中表现优异。

Details

Motivation: 现有大型多模态模型在地球观测数据理解上表现不足，而此类数据对环境监测至关重要。 Method: EarthMind采用空间注意力提示（SAP）和跨模态融合技术，增强像素级理解并有效对齐异构模态。 Result: EarthMind在EarthMind-Bench和多个公共基准测试中达到最先进性能，超越GPT-4o。 Conclusion: EarthMind展示了在多粒度和多传感器挑战中的潜力，为地球观测数据理解提供了统一框架。 Abstract: Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data understanding. EarthMind features two core components: (1) Spatial Attention Prompting (SAP), which reallocates attention within the LLM to enhance pixel-level understanding; and (2) Cross-modal Fusion, which aligns heterogeneous modalities into a shared space and adaptively reweighs tokens based on their information density for effective fusion. To facilitate multi-sensor fusion evaluation, we propose EarthMind-Bench, a comprehensive benchmark with over 2,000 human-annotated multi-sensor image-question pairs, covering a wide range of perception and reasoning tasks. Extensive experiments demonstrate the effectiveness of EarthMind. It achieves state-of-the-art performance on EarthMind-Bench, surpassing GPT-4o despite being only 4B in scale. Moreover, EarthMind outperforms existing methods on multiple public EO benchmarks, showcasing its potential to handle both multi-granular and multi-sensor challenges in a unified framework.

[147] MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs

Yipeng Du,Tiehan Fan,Kepan Nan,Rui Xie,Penghao Zhou,Xiang Li,Jian Yang,Zhenheng Yang,Ying Tai

Main category: cs.CV

TL;DR: 论文提出了一种零样本方法MotionSight，通过视觉提示提升多模态大语言模型（MLLMs）在细粒度视频运动理解中的表现，并发布了首个大规模数据集MotionVid-QA。

Details

Motivation: 尽管MLLMs在多模态任务中有所进展，但在细粒度视频运动理解方面仍存在局限，缺乏帧间差异分析能力且忽略细微视觉线索。视觉提示在静态图像中有效，但其在视频中的应用尚未充分探索。 Method: 提出MotionSight方法，利用物体中心视觉聚焦和运动模糊作为视觉提示，无需训练即可提升运动理解。同时构建了MotionVid-QA数据集，包含40K视频片段和87K问答对。 Result: MotionSight在实验中表现优异，达到开源模型的最优性能，并与商业模型竞争。 Conclusion: MotionSight为零样本技术提供了新思路，并通过高质量数据集推动了细粒度视频运动理解的研究。 Abstract: Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown potential in static images, its application to video's temporal complexities, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether inherent capability can be unlocked and boost MLLMs' motion perception and enable distinct visual signatures tailored to decouple object and camera motion cues. In this study, we introduce MotionSight, a novel zero-shot method pioneering object-centric visual spotlight and motion blur as visual prompts to effectively improve fine-grained motion understanding without training. To convert this into valuable data assets, we curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, {\Theta}(40K) video clips and {\Theta}(87K) QAs. Experiments show MotionSight achieves state-of-the-art open-source performance and competitiveness with commercial models. In particular, for fine-grained motion understanding we present a novel zero-shot technique and a large-scale, high-quality dataset. All the code and annotations will be publicly available.

[148] SteerPose: Simultaneous Extrinsic Camera Calibration and Matching from Articulation

Sang-Eun Lee,Ko Nishino,Shohei Nobuhara

Main category: cs.CV

TL;DR: SteerPose是一种神经网络方法，通过旋转2D姿态进行多相机系统的外参标定和对应点搜索，同时利用几何一致性损失确保有效性。

Details

Motivation: 受人类通过旋转2D姿态对齐多视角的认知能力启发，解决自由移动人或动物作为标定目标时的相机标定和对应点估计问题。 Method: 提出SteerPose神经网络，结合可微分匹配和几何一致性损失，统一实现外参标定和对应点搜索。 Result: 在多样化的野外数据集中验证了方法的有效性和鲁棒性，并展示了利用2D姿态估计器重建新动物3D姿态的能力。 Conclusion: SteerPose为多相机系统提供了一种统一的标定和对应点估计框架，适用于不同类别的目标。 Abstract: Can freely moving humans or animals themselves serve as calibration targets for multi-camera systems while simultaneously estimating their correspondences across views? We humans can solve this problem by mentally rotating the observed 2D poses and aligning them with those in the target views. Inspired by this cognitive ability, we propose SteerPose, a neural network that performs this rotation of 2D poses into another view. By integrating differentiable matching, SteerPose simultaneously performs extrinsic camera calibration and correspondence search within a single unified framework. We also introduce a novel geometric consistency loss that explicitly ensures that the estimated rotation and correspondences result in a valid translation estimation. Experimental results on diverse in-the-wild datasets of humans and animals validate the effectiveness and robustness of the proposed method. Furthermore, we demonstrate that our method can reconstruct the 3D poses of novel animals in multi-camera setups by leveraging off-the-shelf 2D pose estimators and our class-agnostic model.

[149] Data Pruning by Information Maximization

Haoru Tan,Sitong Wu,Wei Huang,Shizhen Zhao,Xiaojuan Qi

Main category: cs.CV

TL;DR: InfoMax是一种新型数据修剪方法，通过最大化信息内容和最小化冗余来优化核心集的选择，适用于大规模数据集。

Details

Motivation: 传统数据修剪方法未能充分平衡信息内容和冗余，InfoMax旨在解决这一问题。 Method: 通过重要性评分衡量样本信息，利用样本相似性量化冗余，将问题形式化为离散二次规划任务，并采用梯度求解器和稀疏化技术。 Result: 实验表明InfoMax在图像分类、视觉语言预训练和大语言模型指令调整等任务中表现优异。 Conclusion: InfoMax是一种高效且可扩展的数据修剪方法，显著提升了核心集的信息量。 Abstract: In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. This enables InfoMax to seamlessly scale to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models.

[150] Active Learning via Vision-Language Model Adaptation with Open Data

Tong Wang,Jiaqi Wang,Shu Kong

Main category: cs.CV

TL;DR: 论文提出了一种名为ALOR的方法，利用公开数据和视觉语言模型（VLM）改进主动学习（AL），并通过对比调优（CT）和尾部优先采样（TFS）策略显著提升性能。

Details

Motivation: 减少数据标注成本，同时利用公开数据和VLM的预训练能力，改进主动学习的效果。 Method: 结合公开数据检索任务相关样本，对比调优（CT）作为模型适应方法，并提出尾部优先采样（TFS）策略选择标注数据。 Result: ALOR方法显著优于现有方法，CT在所有适应方法中表现最佳，TFS有效缓解了数据不平衡问题。 Conclusion: 通过利用公开数据和优化采样策略，ALOR在主动学习中取得了显著改进，为数据标注成本高的任务提供了高效解决方案。 Abstract: Pretrained on web-scale open data, VLMs offer powerful capabilities for solving downstream tasks after being adapted to task-specific labeled data. Yet, data labeling can be expensive and may demand domain expertise. Active Learning (AL) aims to reduce this expense by strategically selecting the most informative data for labeling and model training. Recent AL methods have explored VLMs but have not leveraged publicly available open data, such as VLM's pretraining data. In this work, we leverage such data by retrieving task-relevant examples to augment the task-specific examples. As expected, incorporating them significantly improves AL. Given that our method exploits open-source VLM and open data, we refer to it as Active Learning with Open Resources (ALOR). Additionally, most VLM-based AL methods use prompt tuning (PT) for model adaptation, likely due to its ability to directly utilize pretrained parameters and the assumption that doing so reduces the risk of overfitting to limited labeled data. We rigorously compare popular adaptation approaches, including linear probing (LP), finetuning (FT), and contrastive tuning (CT). We reveal two key findings: (1) All adaptation approaches benefit from incorporating retrieved data, and (2) CT resoundingly outperforms other approaches across AL methods. Further analysis of retrieved data reveals a naturally imbalanced distribution of task-relevant classes, exposing inherent biases within the VLM. This motivates our novel Tail First Sampling (TFS) strategy for AL, an embarrassingly simple yet effective method that prioritizes sampling data from underrepresented classes to label. Extensive experiments demonstrate that our final method, contrastively finetuning VLM on both retrieved and TFS-selected labeled data, significantly outperforms existing methods.

[151] VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking

Desen Meng,Rui Huang,Zhilin Dai,Xinhao Li,Yifan Xu,Jun Zhang,Zhenpeng Huang,Meng Zhang,Lingshu Zhang,Yi Liu,Limin Wang

Main category: cs.CV

TL;DR: 本文首次系统地研究了基于GRPO的强化学习后训练方法，用于提升多模态大语言模型（MLLMs）在视频字幕生成中的动作描述能力，提出了VideoCap-R1模型，并通过实验验证了其优越性。

Details

Motivation: 尽管强化学习在提升大语言模型（LLMs）的推理能力方面取得了显著进展，但在多模态LLMs（MLLMs）的视频字幕生成任务中仍未被充分探索。本文旨在填补这一空白，提升MLLMs对视频中动作的描述能力。 Method: 提出了VideoCap-R1模型，采用结构化思维分析视频主体及其属性和动作，再生成完整字幕。模型通过两种奖励机制（LLM-free的思维评分器和LLM辅助的字幕评分器）强化训练，连接结构化推理与描述生成。 Result: 实验表明，VideoCap-R1在多个视频字幕基准测试中显著优于Qwen2VL-7B基线模型（如DREAM1K事件F1提升4.4，VDC准确率提升4.2），并持续优于SFT训练的对比模型。 Conclusion: GRPO强化学习框架能有效提升MLLMs的视频字幕生成能力，尤其是在动作描述的准确性上表现突出。 Abstract: While recent advances in reinforcement learning have significantly enhanced reasoning capabilities in large language models (LLMs), these techniques remain underexplored in multi-modal LLMs for video captioning. This paper presents the first systematic investigation of GRPO-based RL post-training for video MLLMs, with the goal of enhancing video MLLMs' capability of describing actions in videos. Specifically, we develop the VideoCap-R1, which is prompted to first perform structured thinking that analyzes video subjects with their attributes and actions before generating complete captions, supported by two specialized reward mechanisms: a LLM-free think scorer evaluating the structured thinking quality and a LLM-assisted caption scorer assessing the output quality. The RL training framework effectively establishes the connection between structured reasoning and comprehensive description generation, enabling the model to produce captions with more accurate actions. Our experiments demonstrate that VideoCap-R1 achieves substantial improvements over the Qwen2VL-7B baseline using limited samples (1.5k) across multiple video caption benchmarks (DREAM1K: +4.4 event F1, VDC: +4.2 Acc, CAREBENCH: +3.1 action F1, +6.9 object F1) while consistently outperforming the SFT-trained counterparts, confirming GRPO's superiority in enhancing MLLMs' captioning capabilities.

[152] STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset

Jinhong Wang,Shuo Tong,Jian liu,Dongqi Tang,Jintai Chen,Haochao Ying,Hongxia Xu,Danny Chen,Jian Wu

Main category: cs.CV

TL;DR: STORM是一个多模态大语言模型（MLLMs）的视觉评分数据集和基准测试，旨在提升其在序数回归任务中的能力。

Details

Motivation: 当前MLLMs在视觉评分任务中表现不佳，且缺乏相关数据集和基准测试。 Method: 收集了14个序数回归数据集，并提出了一种粗到细的处理流程，动态考虑标签候选并提供可解释的思考。 Result: 实验证明了该框架的有效性，并提供了更好的微调策略。 Conclusion: STORM为MLLMs在视觉评分任务中的研究提供了数据集、基准测试和预训练模型。 Abstract: Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) under-perform in such visual rating ability while also suffering the lack of relevant datasets and benchmarks. In this work, we collect and present STORM, a data collection and benchmark for Stimulating Trustworthy Ordinal Regression Ability of MLLMs for universal visual rating. STORM encompasses 14 ordinal regression datasets across five common visual rating domains, comprising 655K image-level pairs and the corresponding carefully curated VQAs. Importantly, we also propose a coarse-to-fine processing pipeline that dynamically considers label candidates and provides interpretable thoughts, providing MLLMs with a general and trustworthy ordinal thinking paradigm. This benchmark aims to evaluate the all-in-one and zero-shot performance of MLLMs in scenarios requiring understanding of the essential common ordinal relationships of rating labels. Extensive experiments demonstrate the effectiveness of our framework and shed light on better fine-tuning strategies. The STORM dataset, benchmark, and pre-trained models are available on the following webpage to support further research in this area. Datasets and codes are released on the project page: https://storm-bench.github.io/.

[153] Efficient Egocentric Action Recognition with Multimodal Data

Marco Calzavara,Ard Kastrati,Matteo Macchini,Dushan Vasilevski,Roger Wattenhofer

Main category: cs.CV

TL;DR: 通过分析RGB视频和3D手部姿态的采样频率对Egocentric Action Recognition（EAR）性能和CPU使用的影响，研究发现降低RGB帧采样率并结合高频3D手部姿态输入可显著降低CPU需求，同时保持高准确性。

Details

Motivation: 随着可穿戴XR设备的普及，实时Egocentric Action Recognition（EAR）系统面临便携性、电池寿命和计算资源之间的权衡挑战。 Method: 系统分析RGB视频和3D手部姿态在不同采样频率下对EAR性能和CPU使用的影响，探索多种配置以权衡准确性和计算效率。 Result: 降低RGB帧采样率并结合高频3D手部姿态输入，可实现CPU使用降低3倍，同时识别性能损失极小或无损失。 Conclusion: 多模态输入策略是实现XR设备上高效实时EAR的可行方法。 Abstract: The increasing availability of wearable XR devices opens new perspectives for Egocentric Action Recognition (EAR) systems, which can provide deeper human understanding and situation awareness. However, deploying real-time algorithms on these devices can be challenging due to the inherent trade-offs between portability, battery life, and computational resources. In this work, we systematically analyze the impact of sampling frequency across different input modalities - RGB video and 3D hand pose - on egocentric action recognition performance and CPU usage. By exploring a range of configurations, we provide a comprehensive characterization of the trade-offs between accuracy and computational efficiency. Our findings reveal that reducing the sampling rate of RGB frames, when complemented with higher-frequency 3D hand pose input, can preserve high accuracy while significantly lowering CPU demands. Notably, we observe up to a 3x reduction in CPU usage with minimal to no loss in recognition performance. This highlights the potential of multimodal input strategies as a viable approach to achieving efficient, real-time EAR on XR devices.

[154] Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

Tao Yang,Ruibin Li,Yangming Shi,Yuqi Zhang,Qide Dong,Haoran Cheng,Weiguo Feng,Shilei Wen,Bingyue Peng,Lei Zhang

Main category: cs.CV

TL;DR: 论文提出了一种名为“many-for-many”的统一框架，利用多种视觉生成和操作任务的数据训练单一模型，通过轻量级适配器和联合图像-视频学习策略提升性能。

Details

Motivation: 现有方法通常针对单一任务训练模型，且高质量标注数据成本高昂。本文旨在通过统一框架解决多任务需求，降低训练成本。 Method: 设计了轻量级适配器统一不同任务的条件，采用联合图像-视频学习策略从零开始训练模型，并引入深度图作为条件以增强3D空间感知。 Result: 训练了两个版本（8B和2B）的模型，每个模型可执行超过10种任务，其中8B模型在视频生成任务中表现优异，媲美开源和商业引擎。 Conclusion: 提出的统一框架在多任务视觉生成和操作中表现出色，尤其是视频生成任务，且代码和模型已开源。 Abstract: Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially, text-to-video (T2V) generation, while many other works focus on finetuning the pretrained T2V model for image-to-video (I2V), video-to-video (V2V), image and video manipulation tasks, etc. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or several tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. Our joint learning leads to a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help our model better perceive the 3D space in visual generation. Two versions of our model are trained with different model sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines. Our models and source codes are available at https://github.com/leeruibin/MfM.git.

[155] unMORE: Unsupervised Multi-Object Segmentation via Center-Boundary Reasoning

Yafei Yang,Zihui Zhang,Bo Yang

Main category: cs.CV

TL;DR: 论文提出了一种名为unMORE的两阶段无监督多目标分割方法，显著优于现有方法，在复杂真实图像中表现优异。

Details

Motivation: 现有无监督方法在分割复杂真实世界对象时表现有限，无法处理拥挤图像。 Method: unMORE通过两阶段流程：首先学习三个层次的对象中心表示，然后利用网络无关的多目标推理模块发现多个对象。 Result: 在6个真实世界基准数据集（包括COCO）上表现最佳，尤其在拥挤图像中优于所有基线方法。 Conclusion: unMORE为无监督多目标分割提供了高效解决方案，适用于复杂场景。 Abstract: We study the challenging problem of unsupervised multi-object segmentation on single images. Existing methods, which rely on image reconstruction objectives to learn objectness or leverage pretrained image features to group similar pixels, often succeed only in segmenting simple synthetic objects or discovering a limited number of real-world objects. In this paper, we introduce unMORE, a novel two-stage pipeline designed to identify many complex objects in real-world images. The key to our approach involves explicitly learning three levels of carefully defined object-centric representations in the first stage. Subsequently, our multi-object reasoning module utilizes these learned object priors to discover multiple objects in the second stage. Notably, this reasoning module is entirely network-free and does not require human labels. Extensive experiments demonstrate that unMORE significantly outperforms all existing unsupervised methods across 6 real-world benchmark datasets, including the challenging COCO dataset, achieving state-of-the-art object segmentation results. Remarkably, our method excels in crowded images where all baselines collapse.

[156] FaceCoT: A Benchmark Dataset for Face Anti-Spoofing with Chain-of-Thought Reasoning

Honglu Zhang,Zhiqin Fang,Ningning Zhao,Saihui Hou,Long Ma,Renwang Pei,Zhaofeng He

Main category: cs.CV

TL;DR: 论文提出FaceCoT数据集和CEPL策略，通过视觉-语言多模态方法提升人脸防伪（FAS）的鲁棒性和可解释性。

Details

Motivation: 传统FAS依赖单一视觉模态，泛化能力有限；多模态大语言模型（MLLMs）的突破为结合视觉与语言推理提供了可能，但缺乏高质量数据集。 Method: 构建FaceCoT数据集（含14种攻击类型和高质量CoT VQA标注），开发强化学习优化的标注模型，并提出CEPL策略以利用CoT数据。 Result: 实验表明，基于FaceCoT和CEPL的模型在多个基准数据集上优于现有方法。 Conclusion: FaceCoT和CEPL有效提升了FAS的性能，为多模态防伪研究提供了新方向。 Abstract: Face Anti-Spoofing (FAS) typically depends on a single visual modality when defending against presentation attacks such as print attacks, screen replays, and 3D masks, resulting in limited generalization across devices, environments, and attack types. Meanwhile, Multimodal Large Language Models (MLLMs) have recently achieved breakthroughs in image-text understanding and semantic reasoning, suggesting that integrating visual and linguistic co-inference into FAS can substantially improve both robustness and interpretability. However, the lack of a high-quality vision-language multimodal dataset has been a critical bottleneck. To address this, we introduce FaceCoT (Face Chain-of-Thought), the first large-scale Visual Question Answering (VQA) dataset tailored for FAS. FaceCoT covers 14 spoofing attack types and enriches model learning with high-quality CoT VQA annotations. Meanwhile, we develop a caption model refined via reinforcement learning to expand the dataset and enhance annotation quality. Furthermore, we introduce a CoT-Enhanced Progressive Learning (CEPL) strategy to better leverage the CoT data and boost model performance on FAS tasks. Extensive experiments demonstrate that models trained with FaceCoT and CEPL outperform state-of-the-art methods on multiple benchmark datasets.

[157] R2SM: Referring and Reasoning for Selective Masks

Yu-Lin Shih,Wei-En Tai,Cheng Sun,Yu-Chiang Frank Wang,Hwann-Tzong Chen

Main category: cs.CV

TL;DR: 论文提出新任务R2SM，结合用户意图选择模态或非模态分割掩码，并构建了R2SM数据集用于模型微调和评估。

Details

Motivation: 扩展文本引导分割任务，通过用户意图驱动掩码类型选择，提升模型的多模态推理和意图感知分割能力。 Method: 基于COCOA-cls、D2SA和MUVA数据集构建R2SM数据集，包含模态和非模态文本查询及对应掩码，要求模型根据提示生成适当的分割结果。 Result: R2SM任务为多模态推理和意图感知分割研究提供了挑战性测试平台。 Conclusion: R2SM任务和数据集推动了多模态推理和意图感知分割领域的研究进展。 Abstract: We introduce a new task, Referring and Reasoning for Selective Masks (R2SM), which extends text-guided segmentation by incorporating mask-type selection driven by user intent. This task challenges vision-language models to determine whether to generate a modal (visible) or amodal (complete) segmentation mask based solely on natural language prompts. To support the R2SM task, we present the R2SM dataset, constructed by augmenting annotations of COCOA-cls, D2SA, and MUVA. The R2SM dataset consists of both modal and amodal text queries, each paired with the corresponding ground-truth mask, enabling model finetuning and evaluation for the ability to segment images as per user intent. Specifically, the task requires the model to interpret whether a given prompt refers to only the visible part of an object or to its complete shape, including occluded regions, and then produce the appropriate segmentation. For example, if a prompt explicitly requests the whole shape of a partially hidden object, the model is expected to output an amodal mask that completes the occluded parts. In contrast, prompts without explicit mention of hidden regions should generate standard modal masks. The R2SM benchmark provides a challenging and insightful testbed for advancing research in multimodal reasoning and intent-aware segmentation.

[158] WorldExplorer: Towards Generating Fully Navigable 3D Scenes

Manuel-Andreas Schneider,Lukas Höllein,Matthias Nießner

Main category: cs.CV

TL;DR: WorldExplorer提出了一种基于自回归视频轨迹生成的新方法，用于构建高质量、可导航的3D场景，解决了现有方法在视角移动时产生的噪声和拉伸问题。

Details

Motivation: 现有方法在生成3D场景时，视角移动会导致噪声和拉伸问题，限制了场景的探索性。WorldExplorer旨在解决这一问题，实现高质量、稳定的3D场景生成。 Method: 通过多视角一致的360度全景图初始化场景，利用视频扩散模型迭代生成场景。采用场景记忆机制和碰撞检测，确保生成视频的质量和一致性，最后通过3D高斯泼溅优化将所有视图融合为统一的3D表示。 Result: WorldExplorer生成的3D场景在大范围相机运动下保持高质量和稳定性，首次实现了真实且无限制的探索。 Conclusion: WorldExplorer在生成沉浸式和可探索的虚拟3D环境方面迈出了重要一步。 Abstract: Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside of a scene, i.e., produce streched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360 degree panorama. Then, we expand it by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories, that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling for the first time realistic and unrestricted exploration. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.

[159] OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation

Sen Liang,Zhentao Yu,Zhengguang Zhou,Teng Hu,Hongmei Wang,Yi Chen,Qin Lin,Yuan Zhou,Xin Li,Qinglin Lu,Zhibo Chen

Main category: cs.CV

TL;DR: OmniV2V是一个多功能视频生成与编辑模型，支持跨场景操作，性能优于现有开源和商业模型。

Details

Motivation: 现有视频生成模型局限于单一场景，无法实现多样化内容操作。 Method: 提出统一动态内容操作注入模块和视觉-文本指令模块，构建多任务数据处理系统。 Result: 实验表明OmniV2V在多种任务中表现优异。 Conclusion: OmniV2V为视频生成与编辑提供了高效、统一的解决方案。 Abstract: The emergence of Diffusion Transformers (DiT) has brought significant advancements to video generation, especially in text-to-video and image-to-video tasks. Although video generation is widely applied in various fields, most existing models are limited to single scenarios and cannot perform diverse video generation and editing through dynamic content manipulation. We propose OmniV2V, a video model capable of generating and editing videos across different scenarios based on various operations, including: object movement, object addition, mask-guided video edit, try-on, inpainting, outpainting, human animation, and controllable character video synthesis. We explore a unified dynamic content manipulation injection module, which effectively integrates the requirements of the above tasks. In addition, we design a visual-text instruction module based on LLaVA, enabling the model to effectively understand the correspondence between visual content and instructions. Furthermore, we build a comprehensive multi-task data processing system. Since there is data overlap among various tasks, this system can efficiently provide data augmentation. Using this system, we construct a multi-type, multi-scenario OmniV2V dataset and its corresponding OmniV2V-Test benchmark. Extensive experiments show that OmniV2V works as well as, and sometimes better than, the best existing open-source and commercial models for many video generation and editing tasks.

[160] UMA: Ultra-detailed Human Avatars via Multi-level Surface Alignment

Heming Zhu,Guoxing Sun,Christian Theobalt,Marc Habermann

Main category: cs.CV

TL;DR: 论文提出了一种基于隐式表示和2D视频点跟踪器的可动画人体模型，通过潜在变形模型和级联训练策略，显著提升了渲染质量和几何精度。

Details

Motivation: 现有基于隐式表示的可动画人体模型在细节保留上存在不足，尤其是在高分辨率渲染时，主要原因是几何跟踪不准确。 Method: 提出潜在变形模型，利用2D视频点跟踪器监督3D变形，并通过级联训练策略生成一致的3D点轨迹。 Result: 实验验证了方法在渲染质量和几何精度上的显著提升，优于现有技术。 Conclusion: 该方法通过改进几何跟踪和变形模型，实现了更高细节保留的动画人体模型。 Abstract: Learning an animatable and clothed human avatar model with vivid dynamics and photorealistic appearance from multi-view videos is an important foundational research problem in computer graphics and vision. Fueled by recent advances in implicit representations, the quality of the animatable avatars has achieved an unprecedented level by attaching the implicit representation to drivable human template meshes. However, they usually fail to preserve the highest level of detail, particularly apparent when the virtual camera is zoomed in and when rendering at 4K resolution and higher. We argue that this limitation stems from inaccurate surface tracking, specifically, depth misalignment and surface drift between character geometry and the ground truth surface, which forces the detailed appearance model to compensate for geometric errors. To address this, we propose a latent deformation model and supervising the 3D deformation of the animatable character using guidance from foundational 2D video point trackers, which offer improved robustness to shading and surface variations, and are less prone to local minima than differentiable rendering. To mitigate the drift over time and lack of 3D awareness of 2D point trackers, we introduce a cascaded training strategy that generates consistent 3D point tracks by anchoring point tracks to the rendered avatar, which ultimately supervises our avatar at the vertex and texel level. To validate the effectiveness of our approach, we introduce a novel dataset comprising five multi-view video sequences, each over 10 minutes in duration, captured using 40 calibrated 6K-resolution cameras, featuring subjects dressed in clothing with challenging texture patterns and wrinkle deformations. Our approach demonstrates significantly improved performance in rendering quality and geometric accuracy over the prior state of the art.

[161] Ridgeformer: Mutli-Stage Contrastive Training For Fine-grained Cross-Domain Fingerprint Recognition

Shubham Pandey,Bhavin Jawade,Srirangaraj Setlur

Main category: cs.CV

TL;DR: 提出了一种基于多阶段Transformer的无接触指纹匹配方法，解决了图像模糊、对比度低和位置变化等问题，显著提升了匹配准确率。

Details

Motivation: 无接触指纹识别技术需求增长，但面临图像模糊、对比度低、手指位置变化和透视变形等挑战，影响匹配准确性。 Method: 采用多阶段Transformer方法，先捕获全局空间特征，再细化局部特征对齐，通过分层特征提取和匹配流程实现精细对齐。 Result: 在HKPolyU和RidgeBase数据集上测试，性能优于现有方法，包括商业解决方案。 Conclusion: 该方法显著提升了无接触指纹匹配的准确性和可靠性，具有实际应用潜力。 Abstract: The increasing demand for hygienic and portable biometric systems has underscored the critical need for advancements in contactless fingerprint recognition. Despite its potential, this technology faces notable challenges, including out-of-focus image acquisition, reduced contrast between fingerprint ridges and valleys, variations in finger positioning, and perspective distortion. These factors significantly hinder the accuracy and reliability of contactless fingerprint matching. To address these issues, we propose a novel multi-stage transformer-based contactless fingerprint matching approach that first captures global spatial features and subsequently refines localized feature alignment across fingerprint samples. By employing a hierarchical feature extraction and matching pipeline, our method ensures fine-grained, cross-sample alignment while maintaining the robustness of global feature representation. We perform extensive evaluations on publicly available datasets such as HKPolyU and RidgeBase under different evaluation protocols, such as contactless-to-contact matching and contactless-to-contactless matching and demonstrate that our proposed approach outperforms existing methods, including COTS solutions.

[162] GSCodec Studio: A Modular Framework for Gaussian Splat Compression

Sicheng Li,Chengzhen Wu,Hao Li,Xiang Gao,Yiyi Liao,Lu Yu

Main category: cs.CV

TL;DR: GSCodec Studio是一个统一的模块化框架，用于高斯溅射（GS）的重建、压缩和渲染，解决了现有方法分散的问题，并支持静态和动态GS的高效压缩。

Details

Motivation: 高斯溅射（GS）在实时渲染中表现出色，但高存储需求限制了其实际应用。现有压缩研究分散，缺乏统一框架。 Method: GSCodec Studio整合了多种3D/4D GS重建和压缩技术，提供模块化组件，支持灵活组合和全面比较。 Result: 框架实现了静态和动态GS的高效压缩（Static和Dynamic GSCodec），在率失真性能上表现优异。 Conclusion: GSCodec Studio为GS压缩研究提供了统一平台，推动了高斯溅射技术的进一步发展。 Abstract: 3D Gaussian Splatting and its extension to 4D dynamic scenes enable photorealistic, real-time rendering from real-world captures, positioning Gaussian Splats (GS) as a promising format for next-generation immersive media. However, their high storage requirements pose significant challenges for practical use in sharing, transmission, and storage. Despite various studies exploring GS compression from different perspectives, these efforts remain scattered across separate repositories, complicating benchmarking and the integration of best practices. To address this gap, we present GSCodec Studio, a unified and modular framework for GS reconstruction, compression, and rendering. The framework incorporates a diverse set of 3D/4D GS reconstruction methods and GS compression techniques as modular components, facilitating flexible combinations and comprehensive comparisons. By integrating best practices from community research and our own explorations, GSCodec Studio supports the development of compact representation and compression solutions for static and dynamic Gaussian Splats, namely our Static and Dynamic GSCodec, achieving competitive rate-distortion performance in static and dynamic GS compression. The code for our framework is publicly available at https://github.com/JasonLSC/GSCodec_Studio , to advance the research on Gaussian Splats compression.

[163] MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

Wayner Barrios,Andrés Villa,Juan León Alcázar,SouYoung Jin,Bernard Ghanem

Main category: cs.CV

TL;DR: MoDA（Modulation Adapter）是一种轻量级模块，通过指令引导的调制优化预对齐视觉特征，提升多模态大语言模型（MLLMs）在复杂场景中的细粒度视觉概念理解能力。

Details

Motivation: 现有方法在复杂场景中难以准确关联细粒度视觉概念，MoDA旨在通过指令引导的调制解决这一问题。 Method: MoDA采用两阶段训练：1）通过冻结视觉编码器和适配层将图像特征对齐到LLMs输入空间；2）在指令调优阶段使用MoDA适配器优化特征。MoDA利用Transformer交叉注意力生成调制掩码，突出语义相关嵌入维度。 Result: 实验表明，MoDA提升了视觉基础能力，并生成更符合上下文的响应。 Conclusion: MoDA是一种通用的图像MLLMs增强方法，有效提升了视觉基础与语言生成的准确性。 Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle to ground fine-grained visual concepts in complex scenes. In this paper, we propose MoDA (Modulation Adapter), a lightweight yet effective module designed to refine pre-aligned visual features through instruction-guided modulation. Our approach follows the standard LLaVA training protocol, consisting of a two-stage process: (1) aligning image features to the LLMs input space via a frozen vision encoder and adapter layers, and (2) refining those features using the MoDA adapter during the instructional tuning stage. MoDA employs a Transformer-based cross-attention mechanism to generate a modulation mask over the aligned visual tokens, thereby emphasizing semantically relevant embedding dimensions based on the language instruction. The modulated features are then passed to the LLM for autoregressive language generation. Our experimental evaluation shows that MoDA improves visual grounding and generates more contextually appropriate responses, demonstrating its effectiveness as a general-purpose enhancement for image-based MLLMs.

[164] ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

Junliang Ye,Zhengyi Wang,Ruowen Zhao,Shenghao Xie,Jun Zhu

Main category: cs.CV

TL;DR: 论文提出ShapeLLM-Omni，一种原生3D大语言模型，填补了多模态模型在3D内容理解与生成上的空白。

Details

Motivation: 现有ChatGPT-4o等多模态模型仅支持图像与文本，而3D内容的理解与生成同样重要。 Method: 训练3D VQVAE实现高效形状表示，构建3D-Alpaca数据集，并在Qwen-2.5-vl-7B-Instruct模型上进行指令微调。 Result: ShapeLLM-Omni成功扩展了多模态模型的3D能力。 Conclusion: 该研究为3D原生AI的未来发展提供了有效尝试。 Abstract: Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni-a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, by performing instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: https://github.com/JAMESYJL/ShapeLLM-Omni

Xinliu Zhong,Kayhan Batmanghelich,Li Sun

Main category: cs.CV

TL;DR: 提出了一种名为“扰动报告判别”的新方法，用于预训练生物医学视觉语言模型，以解决生物医学文本复杂语义被忽视的问题。

Details

Motivation: 生物医学文本具有复杂且领域特定的语义，现有对比学习方法常忽视这一点，因此需要改进。 Method: 通过设计文本扰动方法破坏句子语义结构，并让模型区分原始报告与扰动报告；同时对比注意力加权的图像子区域和子词。 Result: 在多个下游任务中表现优于基线方法，学习到更具语义意义和鲁棒性的多模态表示。 Conclusion: 该方法能有效提升生物医学视觉语言模型的语义理解和鲁棒性。 Abstract: Vision-language models pre-trained on large scale of unlabeled biomedical images and associated reports learn generalizable semantic representations. These multi-modal representations can benefit various downstream tasks in the biomedical domain. Contrastive learning is widely used to pre-train vision-language models for general natural images and associated captions. Despite its popularity, we found biomedical texts have complex and domain-specific semantics that are often neglected by common contrastive methods. To address this issue, we propose a novel method, perturbed report discrimination, for pre-train biomedical vision-language models. First, we curate a set of text perturbation methods that keep the same words, but disrupt the semantic structure of the sentence. Next, we apply different types of perturbation to reports, and use the model to distinguish the original report from the perturbed ones given the associated image. Parallel to this, we enhance the sensitivity of our method to higher level of granularity for both modalities by contrasting attention-weighted image sub-regions and sub-words in the image-text pairs. We conduct extensive experiments on multiple downstream tasks, and our method outperforms strong baseline methods. The results demonstrate that our approach learns more semantic meaningful and robust multi-modal representations.

[166] Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency

Hongyu Li,Songhao Han,Yue Liao,Junfeng Luo,Jialin Gao,Shuicheng Yan,Si Liu

Main category: cs.CV

TL;DR: 该论文提出了一种基于强化学习调优（RLT）的后训练策略，通过双奖励机制提升多模态大语言模型（MLLMs）在视频理解任务中的推理能力，并在多个任务中表现优异。

Details

Motivation: 解决视频理解中复杂语义和长时序依赖的挑战，利用RLT增强MLLMs的推理能力。 Method: 基于GRPO框架，设计双奖励机制（语义和时序推理），并采用方差感知数据选择策略优化训练样本。 Result: 在八个视频理解任务中表现优于监督微调和现有RLT基线，且训练数据需求更少。 Conclusion: 奖励设计和数据选择对提升MLLMs在视频理解中的推理能力至关重要，代码已开源并持续更新。 Abstract: Understanding real-world videos with complex semantics and long temporal dependencies remains a fundamental challenge in computer vision. Recent progress in multimodal large language models (MLLMs) has demonstrated strong capabilities in vision-language tasks, while reinforcement learning tuning (RLT) has further improved their reasoning abilities. In this work, we explore RLT as a post-training strategy to enhance the video-specific reasoning capabilities of MLLMs. Built upon the Group Relative Policy Optimization (GRPO) framework, we propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. To facilitate effective preference-based optimization, we introduce a variance-aware data selection strategy based on repeated inference to identify samples that provide informative learning signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Our method consistently outperforms supervised fine-tuning and existing RLT baselines, achieving superior performance with significantly less training data. These results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs. Notably, The initial code release (two months ago) has now been expanded with updates, including optimized reward mechanisms and additional datasets. The latest version is available at https://github.com/appletea233/Temporal-R1 .

[167] Elucidating the representation of images within an unconditional diffusion model denoiser

Zahra Kadkhodaie,Stéphane Mallat,Eero Simoncelli

Main category: cs.CV

TL;DR: 论文研究了UNet在去噪任务中的内部机制，发现其通过稀疏通道分解图像，并提出了一种新的图像重建算法。

Details

Motivation: 尽管生成扩散模型在图像生成上表现出色，但其内部机制尚不明确，本文旨在揭示UNet在去噪任务中的表示和计算方式。 Method: 通过分析UNet中间块的稀疏通道分解，提出了一种基于空间平均的随机图像重建算法。 Result: 研究发现UNet的潜在空间距离与条件密度及图像语义相似性相关，聚类分析揭示了图像细节和全局结构的共享模式。 Conclusion: 研究表明，仅通过去噪训练的UNet能够生成丰富的稀疏图像表示，为理解扩散模型提供了新视角。 Abstract: Generative diffusion models learn probability densities over diverse image datasets by estimating the score with a neural network trained to remove noise. Despite their remarkable success in generating high-quality images, the internal mechanisms of the underlying score networks are not well understood. Here, we examine a UNet trained for denoising on the ImageNet dataset, to better understand its internal representation and computation of the score. We show that the middle block of the UNet decomposes individual images into sparse subsets of active channels, and that the vector of spatial averages of these channels can provide a nonlinear representation of the underlying clean images. We develop a novel algorithm for stochastic reconstruction of images from this representation and demonstrate that it recovers a sample from a set of images defined by a target image representation. We then study the properties of the representation and demonstrate that Euclidean distances in the latent space correspond to distances between conditional densities induced by representations as well as semantic similarities in the image space. Applying a clustering algorithm in the representation space yields groups of images that share both fine details (e.g., specialized features, textured regions, small objects), as well as global structure, but are only partially aligned with object identities. Thus, we show for the first time that a network trained solely on denoising contains a rich and accessible sparse representation of images.

[168] MedEBench: Revisiting Text-instructed Image Editing

Minghao Liu,Zhitao He,Zhiyuan Fan,Qingyun Wang,Yi R. Fung

Main category: cs.CV

TL;DR: MedEBench是一个用于评估文本引导医学图像编辑的综合基准，包含1,182个临床来源的图像-提示对，覆盖13个解剖区域的70个任务。它提供了评估框架、模型比较和失败分析协议。

Details

Motivation: 文本引导图像编辑在医学影像领域缺乏标准化评估，而临床上有模拟手术结果、教学材料个性化和改善患者沟通的需求。 Method: MedEBench包括临床相关的评估框架（编辑准确性、上下文保留和视觉质量），系统比较7种先进模型，并提出基于注意力定位的失败分析协议。 Result: 揭示了常见失败模式，并通过注意力地图与ROI的IoU识别定位错误。 Conclusion: MedEBench为开发可靠的医学图像编辑系统提供了坚实基础。 Abstract: Text-guided image editing has seen rapid progress in natural image domains, but its adaptation to medical imaging remains limited and lacks standardized evaluation. Clinically, such editing holds promise for simulating surgical outcomes, creating personalized teaching materials, and enhancing patient communication. To bridge this gap, we introduce \textbf{MedEBench}, a comprehensive benchmark for evaluating text-guided medical image editing. It consists of 1,182 clinically sourced image-prompt triplets spanning 70 tasks across 13 anatomical regions. MedEBench offers three key contributions: (1) a clinically relevant evaluation framework covering Editing Accuracy, Contextual Preservation, and Visual Quality, supported by detailed descriptions of expected change and ROI (Region of Interest) masks; (2) a systematic comparison of seven state-of-the-art models, revealing common failure patterns; and (3) a failure analysis protocol based on attention grounding, using IoU between attention maps and ROIs to identify mislocalization. MedEBench provides a solid foundation for developing and evaluating reliable, clinically meaningful medical image editing systems.

[169] TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation

Amin Karimi Monsefi,Mridul Khurana,Rajiv Ramnath,Anuj Karpatne,Wei-Lun Chao,Cheng Zhang

Main category: cs.CV

TL;DR: TaxaDiffusion是一种基于分类学知识的扩散模型训练框架，用于生成具有高形态和身份准确性的细粒度动物图像。

Details

Motivation: 传统方法将每个物种视为独立类别，忽略了物种间的视觉相似性。TaxaDiffusion旨在利用分类学层次结构，从粗到细逐步训练模型，以提高生成准确性。 Method: TaxaDiffusion通过从高级分类（如纲、目）到低级分类（如科、属、种）的层次化训练策略，逐步捕获和细化物种间的形态特征。 Result: 在三个细粒度动物数据集上的实验表明，TaxaDiffusion优于现有方法，生成图像的保真度更高，且在小样本情况下表现优异。 Conclusion: TaxaDiffusion通过结合分类学知识，显著提升了细粒度动物图像生成的准确性和效率。 Abstract: We propose TaxaDiffusion, a taxonomy-informed training framework for diffusion models to generate fine-grained animal images with high morphological and identity accuracy. Unlike standard approaches that treat each species as an independent category, TaxaDiffusion incorporates domain knowledge that many species exhibit strong visual similarities, with distinctions often residing in subtle variations of shape, pattern, and color. To exploit these relationships, TaxaDiffusion progressively trains conditioned diffusion models across different taxonomic levels -- starting from broad classifications such as Class and Order, refining through Family and Genus, and ultimately distinguishing at the Species level. This hierarchical learning strategy first captures coarse-grained morphological traits shared by species with common ancestors, facilitating knowledge transfer before refining fine-grained differences for species-level distinction. As a result, TaxaDiffusion enables accurate generation even with limited training samples per species. Extensive experiments on three fine-grained animal datasets demonstrate that outperforms existing approaches, achieving superior fidelity in fine-grained animal image generation. Project page: https://amink8.github.io/TaxaDiffusion/

[170] E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models

Wenyan Cong,Yiqing Liang,Yancheng Zhang,Ziyi Yang,Yan Wang,Boris Ivanovic,Marco Pavone,Chen Chen,Zhangyang Wang,Zhiwen Fan

Main category: cs.CV

TL;DR: 本文提出了首个针对3D几何基础模型（GFMs）的全面基准测试，覆盖五项核心任务，并评估了16种先进模型，揭示了其优缺点。

Details

Motivation: 空间智能（如3D重建和感知）对机器人、航空成像等领域至关重要，但缺乏对新兴3D GFMs的系统评估。 Method: 通过标准化工具包自动化数据集处理、评估协议和指标计算，对16种GFMs在五项核心任务上进行评估。 Result: 评估揭示了GFMs在不同任务和领域中的优势与局限性，并提出了未来模型优化的关键见解。 Conclusion: 公开代码和数据以加速3D空间智能研究，为未来模型扩展和优化提供指导。 Abstract: Spatial intelligence, encompassing 3D reconstruction, perception, and reasoning, is fundamental to applications such as robotics, aerial imaging, and extended reality. A key enabler is the real-time, accurate estimation of core 3D attributes (camera parameters, point clouds, depth maps, and 3D point tracks) from unstructured or streaming imagery. Inspired by the success of large foundation models in language and 2D vision, a new class of end-to-end 3D geometric foundation models (GFMs) has emerged, directly predicting dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. Since late 2023, the field has exploded with diverse variants, but systematic evaluation is lacking. In this work, we present the first comprehensive benchmark for 3D GFMs, covering five core tasks: sparse-view depth estimation, video depth estimation, 3D reconstruction, multi-view pose estimation, novel view synthesis, and spanning both standard and challenging out-of-distribution datasets. Our standardized toolkit automates dataset handling, evaluation protocols, and metric computation to ensure fair, reproducible comparisons. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains, and derive key insights to guide future model scaling and optimization. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.

[171] Low-Rank Head Avatar Personalization with Registers

Sai Tanmay Reddy Chakkera,Aggelina Chatziagapi,Md Moniruzzaman,Chen-Ping Yu,Yi-Hsuan Tsai,Dimitris Samaras

Main category: cs.CV

TL;DR: 提出了一种新的低秩个性化方法，用于提升通用模型在头像生成中的表现，特别是捕捉身份特有的细节。

Details

Motivation: 通用模型虽然能生成高质量面部动画，但难以捕捉独特的身份细节，现有方法（如LoRA）在捕捉高频面部细节方面仍有挑战。 Method: 设计了一个Register Module，通过可学习的3D特征空间增强LoRA性能，仅需少量参数即可适应新身份。 Result: 在包含独特面部细节的数据集上验证，新方法在定量和定性上均优于现有方法。 Conclusion: 提出的方法能有效捕捉未见过的面部细节，代码、模型和数据集将公开。 Abstract: We introduce a novel method for low-rank personalization of a generic model for head avatar generation. Prior work proposes generic models that achieve high-quality face animation by leveraging large-scale datasets of multiple identities. However, such generic models usually fail to synthesize unique identity-specific details, since they learn a general domain prior. To adapt to specific subjects, we find that it is still challenging to capture high-frequency facial details via popular solutions like low-rank adaptation (LoRA). This motivates us to propose a specific architecture, a Register Module, that enhances the performance of LoRA, while requiring only a small number of parameters to adapt to an unseen identity. Our module is applied to intermediate features of a pre-trained model, storing and re-purposing information in a learnable 3D feature space. To demonstrate the efficacy of our personalization method, we collect a dataset of talking videos of individuals with distinctive facial details, such as wrinkles and tattoos. Our approach faithfully captures unseen faces, outperforming existing methods quantitatively and qualitatively. We will release the code, models, and dataset to the public.

[172] Fast and Robust Rotation Averaging with Anisotropic Coordinate Descent

Yaroslava Lochman,Carl Olsson,Christopher Zach

Main category: cs.CV

TL;DR: 本文提出了一种快速通用的求解器，用于各向异性旋转平均问题，结合了最优性、鲁棒性和效率。

Details

Motivation: 各向异性旋转平均方法在扩展各向同性方法时面临计算效率低和初始化敏感的问题，本文旨在解决这些挑战。 Method: 分析了块坐标下降方法，推导了更简单的公式和各向异性扩展，并将其集成到大规模鲁棒旋转平均流程中。 Result: 在公开的结构从运动数据集上实现了最先进的性能。 Conclusion: 提出的方法成功平衡了最优性、鲁棒性和效率，为各向异性旋转平均提供了高效解决方案。 Abstract: Anisotropic rotation averaging has recently been explored as a natural extension of respective isotropic methods. In the anisotropic formulation, uncertainties of the estimated relative rotations -- obtained via standard two-view optimization -- are propagated to the optimization of absolute rotations. The resulting semidefinite relaxations are able to recover global minima but scale poorly with the problem size. Local methods are fast and also admit robust estimation but are sensitive to initialization. They usually employ minimum spanning trees and therefore suffer from drift accumulation and can get trapped in poor local minima. In this paper, we attempt to bridge the gap between optimality, robustness and efficiency of anisotropic rotation averaging. We analyze a family of block coordinate descent methods initially proposed to optimize the standard chordal distances, and derive a much simpler formulation and an anisotropic extension obtaining a fast general solver. We integrate this solver into the extended anisotropic large-scale robust rotation averaging pipeline. The resulting algorithm achieves state-of-the-art performance on public structure-from-motion datasets. Project page: https://ylochman.github.io/acd

[173] OD3: Optimization-free Dataset Distillation for Object Detection

Salwa K. Al Khatib,Ahmed ElHagry,Shitong Shao,Zhiqiang Shen

Main category: cs.CV

TL;DR: 论文提出了一种名为OD3的无优化数据蒸馏框架，专门用于目标检测任务，通过两阶段方法合成紧凑数据集，显著提升了检测性能。

Details

Motivation: 大规模神经网络的训练需要大量计算资源，尤其是密集预测任务如目标检测。现有的数据集蒸馏方法主要针对图像分类，而复杂的目标检测任务尚未充分探索。 Method: OD3框架分为两阶段：候选选择（迭代放置目标实例到合成图像中）和候选筛选（使用预训练观察模型移除低置信度目标）。 Result: 在MS COCO和PASCAL VOC数据集上，压缩比为0.25%至5%时，OD3性能优于现有方法，在COCO mAP50上提升超过14%。 Conclusion: OD3为目标检测任务提供了一种高效的数据蒸馏方法，显著提升了性能，并开源了代码和数据集。 Abstract: Training large neural networks on large-scale datasets requires substantial computational resources, particularly for dense prediction tasks such as object detection. Although dataset distillation (DD) has been proposed to alleviate these demands by synthesizing compact datasets from larger ones, most existing work focuses solely on image classification, leaving the more complex detection setting largely unexplored. In this paper, we introduce OD3, a novel optimization-free data distillation framework specifically designed for object detection. Our approach involves two stages: first, a candidate selection process in which object instances are iteratively placed in synthesized images based on their suitable locations, and second, a candidate screening process using a pre-trained observer model to remove low-confidence objects. We perform our data synthesis framework on MS COCO and PASCAL VOC, two popular detection datasets, with compression ratios ranging from 0.25% to 5%. Compared to the prior solely existing dataset distillation method on detection and conventional core set selection methods, OD3 delivers superior accuracy, establishes new state-of-the-art results, surpassing prior best method by more than 14% on COCO mAP50 at a compression ratio of 1.0%. Code and condensed datasets are available at: https://github.com/VILA-Lab/OD3.

[174] Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

Xiao Fu,Xintao Wang,Xian Liu,Jianhong Bai,Runsen Xu,Pengfei Wan,Di Zhang,Dahua Lin

Main category: cs.CV

TL;DR: RoboMaster提出了一种新框架，通过分解交互过程为三个阶段来建模多对象动态，解决了现有方法在多对象交互中的局限性，并在轨迹控制视频生成中取得了最佳性能。

Details

Motivation: 现有基于轨迹的方法主要关注单个对象运动，难以捕捉复杂机器人操作中的多对象交互，导致视觉保真度下降。 Method: RoboMaster将交互过程分解为三个阶段（交互前、交互中、交互后），每个阶段用主导对象的特征建模，并引入外观和形状感知的潜在表示。 Result: 在Bridge V2数据集和实际场景评估中，RoboMaster优于现有方法，实现了轨迹控制视频生成的最先进性能。 Conclusion: RoboMaster通过建模多对象动态和分解交互过程，显著提升了复杂机器人操作中的视频生成质量。 Abstract: Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture multi-object interaction crucial in complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which leads to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby mitigating the drawback of multi-object feature fusion present during interaction in prior work. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.

[175] MLLMs Need 3D-Aware Representation Supervision for Scene Understanding

Xiaohu Huang,Jingjing Wu,Qunyi Xie,Kai Han

Main category: cs.CV

TL;DR: 论文提出3DRS框架，通过引入3D基础模型的监督增强MLLM的3D表示学习，提升场景理解能力。

Details

Motivation: MLLMs在3D推理中因缺乏显式3D数据而受限，研究发现3D感知表示质量与下游任务性能正相关。 Method: 提出3DRS框架，利用预训练3D基础模型监督对齐MLLM视觉特征，融入丰富3D知识。 Result: 在视觉定位、描述生成和问答等多个基准测试中，3DRS显著提升了MLLMs的性能。 Conclusion: 3DRS通过增强3D表示学习，有效提升了MLLMs在场景理解任务中的表现。 Abstract: Recent advances in scene understanding have leveraged multimodal large language models (MLLMs) for 3D reasoning by capitalizing on their strong 2D pretraining. However, the lack of explicit 3D data during MLLM pretraining limits 3D representation capability. In this paper, we investigate the 3D-awareness of MLLMs by evaluating multi-view correspondence and reveal a strong positive correlation between the quality of 3D-aware representation and downstream task performance. Motivated by this, we propose 3DRS, a framework that enhances MLLM 3D representation learning by introducing supervision from pretrained 3D foundation models. Our approach aligns MLLM visual features with rich 3D knowledge distilled from 3D models, effectively improving scene understanding. Extensive experiments across multiple benchmarks and MLLMs -- including visual grounding, captioning, and question answering -- demonstrate consistent performance gains. Project page: https://visual-ai.github.io/3drs

[176] IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout

Fei Shen,Xiaoyu Du,Yutong Gao,Jian Yu,Yushe Cao,Xing Lei,Jinhui Tang

Main category: cs.CV

TL;DR: 论文提出了一种新任务QL-Edit，旨在解决多对象场景下图像编辑的挑战，并提出了IMAGHarmony框架，结合HA和PNS策略，显著提升了编辑精度和结构一致性。

Details

Motivation: 当前图像编辑技术在多对象场景中缺乏对对象数量、类别和空间布局的精确控制，限制了其应用范围。 Method: 提出了IMAGHarmony框架，包含harmony-aware attention（HA）和preference-guided noise selection（PNS）策略，以增强多对象编辑的准确性和稳定性。 Result: 实验表明IMAGHarmony在结构对齐和语义准确性上优于现有方法。 Conclusion: IMAGHarmony为多对象图像编辑提供了有效的解决方案，并通过HarmonyBench支持了进一步研究。 Abstract: Recent diffusion models have advanced image editing by enhancing visual quality and control, supporting broad applications across creative and personalized domains. However, current image editing largely overlooks multi-object scenarios, where precise control over object categories, counts, and spatial layouts remains a significant challenge. To address this, we introduce a new task, quantity-and-layout consistent image editing (QL-Edit), which aims to enable fine-grained control of object quantity and spatial structure in complex scenes. We further propose IMAGHarmony, a structure-aware framework that incorporates harmony-aware attention (HA) to integrate multimodal semantics, explicitly modeling object counts and layouts to enhance editing accuracy and structural consistency. In addition, we observe that diffusion models are susceptible to initial noise and exhibit strong preferences for specific noise patterns. Motivated by this, we present a preference-guided noise selection (PNS) strategy that chooses semantically aligned initial noise samples based on vision-language matching, thereby improving generation stability and layout consistency in multi-object editing. To support evaluation, we construct HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy. The code and model are available at https://github.com/muzishen/IMAGHarmony.

[177] Dual-Process Image Generation

Grace Luo,Jonathan Granskog,Aleksander Holynski,Trevor Darrell

Main category: cs.CV

TL;DR: 提出了一种双过程蒸馏方案，使前馈图像生成器能够从深思熟虑的视觉语言模型（VLM）中学习新任务。

Details

Motivation: 现有图像生成控制方法在学习新任务方面能力有限，而视觉语言模型（VLM）能够通过上下文学习任务并生成正确输出。 Method: 使用VLM对生成图像进行评分，并通过反向传播梯度更新图像生成器的权重。 Result: 该方法支持多种新控制任务，如常识推理和视觉提示，并能快速实现多模态控制（如调色板、线条粗细等）。 Conclusion: 该框架为图像生成提供了灵活且高效的控制方式。 Abstract: Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models, or VLMs, can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates this gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes. Project page: https://dual-process.github.io.

cs.GR [Back]

[178] MotionPersona: Characteristics-aware Locomotion Control

Mingyi Shi,Wei Liu,Jidong Mei,Wangpok Tse,Rui Chen,Xuelin Chen,Taku Komura

Main category: cs.GR

TL;DR: MotionPersona是一个实时角色控制器，通过用户定义的属性生成个性化动画，优于现有方法。

Details

Motivation: 现有深度学习控制器生成的动画单一，无法反映真实世界中不同特质对人类动作的影响。 Method: 开发了基于SMPLX参数、文本提示和用户控制信号的块自回归运动扩散模型，并构建了多样化数据集。 Result: MotionPersona能实时生成反映用户指定特征的动作，质量和多样性优于现有方法。 Conclusion: MotionPersona首次实现了基于用户特征的实时动作生成，并通过少样本技术补充了语言提示的不足。 Abstract: We present MotionPersona, a novel real-time character controller that allows users to characterize a character by specifying attributes such as physical traits, mental states, and demographics, and projects these properties into the generated motions for animating the character. In contrast to existing deep learning-based controllers, which typically produce homogeneous animations tailored to a single, predefined character, MotionPersona accounts for the impact of various traits on human motion as observed in the real world. To achieve this, we develop a block autoregressive motion diffusion model conditioned on SMPLX parameters, textual prompts, and user-defined locomotion control signals. We also curate a comprehensive dataset featuring a wide range of locomotion types and actor traits to enable the training of this characteristic-aware controller. Unlike prior work, MotionPersona is the first method capable of generating motion that faithfully reflects user-specified characteristics (e.g., an elderly person's shuffling gait) while responding in real time to dynamic control inputs. Additionally, we introduce a few-shot characterization technique as a complementary conditioning mechanism, enabling customization via short motion clips when language prompts fall short. Through extensive experiments, we demonstrate that MotionPersona outperforms existing methods in characteristics-aware locomotion control, achieving superior motion quality and diversity. Results, code, and demo can be found at: https://motionpersona25.github.io/.

[179] Power-Linear Polar Directional Fields

Jiabao Brad Wang,Amir Vaxman

Main category: cs.GR

TL;DR: 提出一种新颖的网格方向场设计方法，支持在任意位置指定奇点，通过分段幂线性表示实现精确控制，减少网格粗糙或不均匀导致的伪影。

Details

Motivation: 解决现有方法在网格上设计方向场时无法灵活指定奇点位置及控制拓扑结构的问题。 Method: 采用分段幂线性表示相位和尺度，提供对场拓扑的精确控制，确保场的平滑性和对称性。 Result: 生成的场平滑且适应任意奇点指数和场对称性，有效减少网格质量不佳带来的伪影。 Conclusion: 该方法在多种拓扑结构和三角形质量的网格上表现优异，为方向场设计提供了灵活性和鲁棒性。 Abstract: We introduce a novel method for directional-field design on meshes, enabling users to specify singularities at any location on a mesh. Our method uses a piecewise power-linear representation for phase and scale, offering precise control over field topology. The resulting fields are smooth and accommodate any singularity index and field symmetry. With this representation, we mitigate the artifacts caused by coarse or uneven meshes. We showcase our approach on meshes with diverse topologies and triangle qualities.

[180] Pro3D-Editor : A Progressive-Views Perspective for Consistent and Precise 3D Editing

Yang Zheng,Mengqi Huang,Nan Chen,Zhendong Mao

Main category: cs.GR

TL;DR: 论文提出了一种基于渐进视图范式的3D编辑方法Pro3D-Editor，通过动态采样主视图并传播编辑语义，实现了更准确和一致的3D编辑。

Details

Motivation: 现有3D编辑方法忽视跨视图依赖性，导致多视图编辑不一致。本文旨在通过渐进视图范式解决这一问题。 Method: 提出Pro3D-Editor框架，包括主视图采样器、关键视图渲染器和全视图优化器，利用MoVE-LoRA技术传播编辑语义。 Result: 实验表明，该方法在编辑准确性和空间一致性上优于现有方法。 Conclusion: 渐进视图范式能有效提升3D编辑的一致性，Pro3D-Editor展示了其潜力。 Abstract: Text-guided 3D editing aims to precisely edit semantically relevant local 3D regions, which has significant potential for various practical applications ranging from 3D games to film production. Existing methods typically follow a view-indiscriminate paradigm: editing 2D views indiscriminately and projecting them back into 3D space. However, they overlook the different cross-view interdependencies, resulting in inconsistent multi-view editing. In this study, we argue that ideal consistent 3D editing can be achieved through a \textit{progressive-views paradigm}, which propagates editing semantics from the editing-salient view to other editing-sparse views. Specifically, we propose \textit{Pro3D-Editor}, a novel framework, which mainly includes Primary-view Sampler, Key-view Render, and Full-view Refiner. Primary-view Sampler dynamically samples and edits the most editing-salient view as the primary view. Key-view Render accurately propagates editing semantics from the primary view to other key views through its Mixture-of-View-Experts Low-Rank Adaption (MoVE-LoRA). Full-view Refiner edits and refines the 3D object based on the edited multi-views. Extensive experiments demonstrate that our method outperforms existing methods in editing accuracy and spatial consistency.

[181] Neural Path Guiding with Distribution Factorization

Pedro Figueiredo,Qihao He,Nima Khademi Kalantari

Main category: cs.GR

TL;DR: 提出了一种神经路径引导方法，用于改进蒙特卡洛积分在渲染中的应用。该方法通过分解2D方向分布为两个1D概率分布函数，并利用神经网络建模，实现了高效且表达能力强的分布表示。

Details

Motivation: 现有神经方法在分布表示上无法同时实现快速和表达能力强，因此需要一种更优的解决方案。 Method: 将2D方向分布分解为两个1D概率分布函数，用神经网络建模离散坐标的分布，并通过插值实现任意位置的评估和采样。训练时最大化学习分布与目标分布的相似性，并使用额外网络缓存入射辐射以减少梯度方差。 Result: 实验表明，该方法在复杂光传输场景中优于现有方法。 Conclusion: 提出的方法在表达能力和速度上取得了平衡，适用于复杂渲染任务。 Abstract: In this paper, we present a neural path guiding method to aid with Monte Carlo (MC) integration in rendering. Existing neural methods utilize distribution representations that are either fast or expressive, but not both. We propose a simple, but effective, representation that is sufficiently expressive and reasonably fast. Specifically, we break down the 2D distribution over the directional domain into two 1D probability distribution functions (PDF). We propose to model each 1D PDF using a neural network that estimates the distribution at a set of discrete coordinates. The PDF at an arbitrary location can then be evaluated and sampled through interpolation. To train the network, we maximize the similarity of the learned and target distributions. To reduce the variance of the gradient during optimizations and estimate the normalization factor, we propose to cache the incoming radiance using an additional network. Through extensive experiments, we demonstrate that our approach is better than the existing methods, particularly in challenging scenes with complex light transport.

[182] Hybridizing Expressive Rendering: Stroke-Based Rendering with Classic and Neural Methods

Kapil Dev

Main category: cs.GR

TL;DR: 本文探讨了传统非真实感渲染（NPR）与基于深度学习的NPR技术的异同，提出了一种结合两者的框架以拓展表现力。

Details

Motivation: 随着深度学习的兴起，NPR领域出现了范式转变，本文旨在分析传统与神经网络方法的优劣，并探索结合的可能性。 Method: 通过比较传统NPR技术（如边缘检测、卡通着色）与基于深度学习的NPR，特别关注笔触渲染（SBR），分析其优缺点。 Result: 揭示了两种方法在质量和艺术控制上的权衡，并提出结合两者的框架。 Conclusion: 结合传统与深度学习的NPR方法为表现力渲染提供了新的可能性。 Abstract: Non-Photorealistic Rendering (NPR) has long been used to create artistic visualizations that prioritize style over realism, enabling the depiction of a wide range of aesthetic effects, from hand-drawn sketches to painterly renderings. While classical NPR methods, such as edge detection, toon shading, and geometric abstraction, have been well-established in both research and practice, with a particular focus on stroke-based rendering, the recent rise of deep learning represents a paradigm shift. We analyze the similarities and differences between classical and neural network based NPR techniques, focusing on stroke-based rendering (SBR), highlighting their strengths and limitations. We discuss trade offs in quality and artistic control between these paradigms, propose a framework where these approaches can be combined for new possibilities in expressive rendering.

[183] LensCraft: Your Professional Virtual Cinematographer

Zahra Dehghanian,Morteza Abolghasemi,Hossein Azizinaghsh,Amir Vahedi,Hamid Beigy,Hamid R. Rabiee

Main category: cs.GR

TL;DR: LensCraft 提出了一种数据驱动的方法，结合电影摄影原则，实时适应动态场景，解决了自动化拍摄系统中机械执行与创意意图之间的权衡问题。

Details

Motivation: 数字创作者在将创意转化为精确的相机运动时面临瓶颈，现有系统通常将拍摄对象简化为单点，忽略了其方向和体积，限制了空间感知。 Method: LensCraft 结合了专业电影摄影师的专业知识，使用专门的模拟框架生成高保真训练数据，并通过高级神经模型实现对脚本的忠实执行，同时考虑拍摄对象的体积和动态行为。 Result: LensCraft 在静态和动态场景中表现出前所未有的准确性和连贯性，计算复杂度更低且推理速度更快，同时保持高质量输出。 Conclusion: LensCraft 为智能相机系统设立了新的基准，提供了灵活的控制方式，支持多种输入模态，为创作者提供了更贴近其创意的工具。 Abstract: Digital creators, from indie filmmakers to animation studios, face a persistent bottleneck: translating their creative vision into precise camera movements. Despite significant progress in computer vision and artificial intelligence, current automated filming systems struggle with a fundamental trade-off between mechanical execution and creative intent. Crucially, almost all previous works simplify the subject to a single point-ignoring its orientation and true volume-severely limiting spatial awareness during filming. LensCraft solves this problem by mimicking the expertise of a professional cinematographer, using a data-driven approach that combines cinematographic principles with the flexibility to adapt to dynamic scenes in real time. Our solution combines a specialized simulation framework for generating high-fidelity training data with an advanced neural model that is faithful to the script while being aware of the volume and dynamic behavior of the subject. Additionally, our approach allows for flexible control via various input modalities, including text prompts, subject trajectory and volume, key points, or a full camera trajectory, offering creators a versatile tool to guide camera movements in line with their vision. Leveraging a lightweight real time architecture, LensCraft achieves markedly lower computational complexity and faster inference while maintaining high output quality. Extensive evaluation across static and dynamic scenarios reveals unprecedented accuracy and coherence, setting a new benchmark for intelligent camera systems compared to state-of-the-art models. Extended results, the complete dataset, simulation environment, trained model weights, and source code are publicly accessible on LensCraft Webpage.

Yueqian Guo,Tianzhao Li,Xin Lyu,Jiehaolin Chen,Zhaohan Wang,Sirui Xiao,Yurun Chen,Yezi He,Helin Li,Fan Zhang

Main category: cs.GR

TL;DR: TRiMM是一种基于Transformer的多模态框架，用于实时3D手势生成，解决了现有方法在实时合成和长文本理解上的不足。

Details

Motivation: 现有方法在实时合成和长文本理解方面表现不佳，限制了LLM驱动的数字人类的应用。 Method: TRiMM包含三个模块：跨模态注意力机制、长上下文自回归模型和大规模手势匹配系统，并通过Unreal Engine实现轻量级管道。 Result: 在消费级GPU上实现120 fps的实时推理，每句延迟0.15秒，并在ZEGGS和BEAT数据集上优于现有方法。 Conclusion: TRiMM在保证手势质量的同时提升了生成速度，使LLM驱动的数字人类能够实时响应语音并合成手势。 Abstract: Large Language Model (LLM)-driven digital humans have sparked a series of recent studies on co-speech gesture generation systems. However, existing approaches struggle with real-time synthesis and long-text comprehension. This paper introduces Transformer-Based Rich Motion Matching (TRiMM), a novel multi-modal framework for real-time 3D gesture generation. Our method incorporates three modules: 1) a cross-modal attention mechanism to achieve precise temporal alignment between speech and gestures; 2) a long-context autoregressive model with a sliding window mechanism for effective sequence modeling; 3) a large-scale gesture matching system that constructs an atomic action library and enables real-time retrieval. Additionally, we develop a lightweight pipeline implemented in the Unreal Engine for experimentation. Our approach achieves real-time inference at 120 fps and maintains a per-sentence latency of 0.15 seconds on consumer-grade GPUs (Geforce RTX3060). Extensive subjective and objective evaluations on the ZEGGS, and BEAT datasets demonstrate that our model outperforms current state-of-the-art methods. TRiMM enhances the speed of co-speech gesture generation while ensuring gesture quality, enabling LLM-driven digital humans to respond to speech in real time and synthesize corresponding gestures. Our code is available at https://github.com/teroon/TRiMM-Transformer-Based-Rich-Motion-Matching

[185] PromptVFX: Text-Driven Fields for Open-World 3D Gaussian Animation

Mert Kiray,Paul Uhlenbruck,Nassir Navab,Benjamin Busam

Main category: cs.GR

TL;DR: 提出了一种基于文本驱动的4D流场预测框架，用于实时生成3D动画效果，减少传统方法的时间和专业需求。

Details

Motivation: 现代影视、游戏和AR/VR中视觉效果的创作需要专业3D动画软件和大量时间，现有生成方法计算成本高且速度慢。 Method: 将3D动画重新定义为场预测任务，利用大语言模型和视觉语言模型生成函数，通过文本指令实时更新3D高斯属性。 Result: 实验表明，简单的文本指令即可生成动态视觉效果，显著减少传统建模和绑定所需的手动工作。 Conclusion: 该方法为语言驱动的3D内容创作提供了快速且易于使用的途径，有助于进一步普及视觉特效技术。 Abstract: Visual effects (VFX) are key to immersion in modern films, games, and AR/VR. Creating 3D effects requires specialized expertise and training in 3D animation software and can be time consuming. Generative solutions typically rely on computationally intense methods such as diffusion models which can be slow at 4D inference. We reformulate 3D animation as a field prediction task and introduce a text-driven framework that infers a time-varying 4D flow field acting on 3D Gaussians. By leveraging large language models (LLMs) and vision-language models (VLMs) for function generation, our approach interprets arbitrary prompts (e.g., "make the vase glow orange, then explode") and instantly updates color, opacity, and positions of 3D Gaussians in real time. This design avoids overheads such as mesh extraction, manual or physics-based simulations and allows both novice and expert users to animate volumetric scenes with minimal effort on a consumer device even in a web browser. Experimental results show that simple textual instructions suffice to generate compelling time-varying VFX, reducing the manual effort typically required for rigging or advanced modeling. We thus present a fast and accessible pathway to language-driven 3D content creation that can pave the way to democratize VFX further.

[186] WishGI: Lightweight Static Global Illumination Baking via Spherical Harmonics Fitting

Junke Zhu,Zehan Wu,Qixing Zhang,Cheng Liao,Zhangjin Huang

Main category: cs.GR

TL;DR: 提出了一种基于球谐函数拟合和逆向探针分布的全局光照重建方法，显著降低内存使用和运行时开销，适用于低端平台。

Details

Motivation: 现有静态全局光照方法依赖高存储和采样开销，难以在低端平台上高效运行。 Method: 采用球谐函数拟合烘焙光照信息，并提出逆向探针分布方法为每个网格生成唯一探针关联。 Result: 内存使用仅为行业主流技术的5%，同时保持高质量光照效果。 Conclusion: 该方法在低端平台上实现了高效且高质量的全局光照渲染。 Abstract: Global illumination combines direct and indirect lighting to create realistic lighting effects, bringing virtual scenes closer to reality. Static global illumination is a crucial component of virtual scene rendering, leveraging precomputation and baking techniques to significantly reduce runtime computational costs. Unfortunately, many existing works prioritize visual quality by relying on extensive texture storage and massive pixel-level texture sampling, leading to large performance overhead. In this paper, we introduce an illumination reconstruction method that effectively reduces sampling in fragment shader and avoids additional render passes, making it well-suited for low-end platforms. To achieve high-quality global illumination with reduced memory usage, we adopt a spherical harmonics fitting approach for baking effective illumination information and propose an inverse probe distribution method that generates unique probe associations for each mesh. This association, which can be generated offline in the local space, ensures consistent lighting quality across all instances of the same mesh. As a consequence, our method delivers highly competitive lighting effects while using only approximately 5% of the memory required by mainstream industry techniques.

[187] Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation

Yuan Gan,Jiaxu Miao,Yunze Wang,Yi Yang

Main category: cs.GR

TL;DR: 论文提出Silencer方法，通过两阶段设计保护肖像隐私，防止基于LDM的说话头动画技术滥用。

Details

Motivation: 基于LDM的说话头动画技术可能被滥用于诈骗和政治操纵，现有防御方法无法有效保护肖像隐私。 Method: 提出两阶段方法Silencer：1）使用nullifying loss忽略音频控制；2）应用anti-purification loss优化潜在特征以生成鲁棒扰动。 Result: 实验证明Silencer能有效保护肖像隐私。 Conclusion: Silencer为AI安全社区提供了解决说话头生成技术伦理问题的新思路。 Abstract: Advances in talking-head animation based on Latent Diffusion Models (LDM) enable the creation of highly realistic, synchronized videos. These fabricated videos are indistinguishable from real ones, increasing the risk of potential misuse for scams, political manipulation, and misinformation. Hence, addressing these ethical concerns has become a pressing issue in AI security. Recent proactive defense studies focused on countering LDM-based models by adding perturbations to portraits. However, these methods are ineffective at protecting reference portraits from advanced image-to-video animation. The limitations are twofold: 1) they fail to prevent images from being manipulated by audio signals, and 2) diffusion-based purification techniques can effectively eliminate protective perturbations. To address these challenges, we propose Silencer, a two-stage method designed to proactively protect the privacy of portraits. First, a nullifying loss is proposed to ignore audio control in talking-head generation. Second, we apply anti-purification loss in LDM to optimize the inverted latent feature to generate robust perturbations. Extensive experiments demonstrate the effectiveness of Silencer in proactively protecting portrait privacy. We hope this work will raise awareness among the AI security community regarding critical ethical issues related to talking-head generation techniques. Code: https://github.com/yuangan/Silencer.

[188] Image Generation from Contextually-Contradictory Prompts

Saar Huberman,Or Patashnik,Omer Dahary,Ron Mokady,Daniel Cohen-Or

Main category: cs.GR

TL;DR: 提出了一种基于阶段感知的提示分解框架，通过代理提示引导去噪过程，解决文本到图像扩散模型中的上下文矛盾问题。

Details

Motivation: 文本到图像扩散模型在生成高质量图像时，常因提示中的概念组合与学习先验矛盾而失败。 Method: 利用大语言模型分析目标提示，生成代理提示序列，确保语义连贯性。 Result: 实验表明，该方法显著提高了生成图像与文本提示的对齐度。 Conclusion: 通过阶段感知提示分解，实现了对语义的精细控制，解决了上下文矛盾问题。 Abstract: Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the original intent while resolving contextual conflicts. By aligning prompt information with the denoising progression, our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions. Experiments across a variety of challenging prompts show substantial improvements in alignment to the textual prompt.

cs.CL [Back]

[189] Amadeus-Verbo Technical Report: The powerful Qwen2.5 family models trained in Portuguese

William Alberto Cruz-Castañeda,Marcellus Amadeus

Main category: cs.CL

TL;DR: 介绍了开发巴西葡萄牙语大语言模型Amadeus Verbo的经验，展示了如何通过微调基础模型来开源开发巴西葡萄牙语LLM。

Details

Motivation: 为巴西葡萄牙语提供多样化的大语言模型，推动开源开发。 Method: 开发了不同参数规模（0.5B至72B）的基础模型、合并模型和指令调优模型。 Result: Amadeus Verbo系列模型已在HuggingFace上开源。 Conclusion: 展示了在数据和资源可用时，微调基础模型以开发巴西葡萄牙语LLM的可行性。 Abstract: This report introduces the experience of developing Amadeus Verbo, a family of large language models for Brazilian Portuguese. To handle diverse use cases, Amadeus Verbo includes base-tuned, merged, and instruction-tuned models in sizes of 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters. Thus, the main objective is to show how easy it is to fine-tune foundation models to democratize the open-source development of Brazilian Portuguese LLMs when data and resources are available. Amadeus-Verbo family models are all available at HuggingFace at https://huggingface.co/collections/amadeusai/amadeus-verbo-qwen25-67cf2e7aae69ce2b3bcdcfda.

[190] Scaling Physical Reasoning with the PHYSICS Dataset

Shenghe Zheng,Qianjia Cheng,Junchi Yao,Mengsong Wu,haonan he,Ning Ding,Yu Cheng,Shuyue Hu,Lei Bai,Dongzhan Zhou,Ganqu Cui,Peng Ye

Main category: cs.CL

TL;DR: PHYSICS数据集包含16,568个高质量物理问题，覆盖多个领域和难度级别，旨在提升大语言模型在物理推理任务中的表现。

Details

Motivation: 物理作为推理密集且重要的学科，在大语言模型研究中未得到足够关注，PHYSICS数据集填补了这一空白。 Method: 通过精心设计的流程从100多本教材中筛选问题，分为训练集和测试集，并提供推理路径；提出Rule+Model评估框架。 Result: 评估显示当前模型在物理任务中存在局限性，PHYSICS数据集和方法有助于改进模型表现。 Conclusion: PHYSICS数据集和评估方法将推动大语言模型在物理领域的发展。 Abstract: Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to facilitate this issue. Specifically, PHYSICS is curated with exercises from over 100 textbooks through a carefully designed pipeline for quality control. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To utilize the data for improving and evaluating the model's physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for the evaluation part, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations on current state-of-the-art open-source and proprietary models highlight the limitations of current models in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics.

[191] From Mathematical Reasoning to Code: Generalization of Process Reward Models in Test-Time Scaling

Zhengyu Chen,Yudong Wang,Teng Xiao,Ruochen Zhou,Xuesheng Yang,Wei Wang,Zhifang Sui,Jingang Wang

Main category: cs.CL

TL;DR: 研究分析了过程奖励模型（PRMs）在提升大型语言模型推理能力中的作用，探讨了训练方法、可扩展性和泛化能力，发现模型规模与性能之间存在收益递减关系，并强调了数据多样性和测试时扩展策略的重要性。

Details

Motivation: 探索PRMs如何通过结构化反馈机制解决中间错误，并分析其在复杂推理任务中的效率和准确性。 Method: 从训练方法、可扩展性和泛化能力多角度分析PRMs，研究预训练与奖励模型训练FLOPs的关系，并评估测试时扩展策略。 Result: 发现PRM规模与性能呈收益递减关系，数据多样性显著影响性能；数学数据集训练的PRMs与代码生成任务表现相当，显示跨域泛化能力。 Conclusion: PRMs在推理任务中表现优异，但需平衡模型规模与计算成本；数据多样性和测试时策略对性能至关重要。 Abstract: Recent advancements in improving the reasoning capabilities of Large Language Models have underscored the efficacy of Process Reward Models (PRMs) in addressing intermediate errors through structured feedback mechanisms. This study analyzes PRMs from multiple perspectives, including training methodologies, scalability, and generalization capabilities. We investigate the interplay between pre-training and reward model training FLOPs to assess their influence on PRM efficiency and accuracy in complex reasoning tasks. Our analysis reveals a pattern of diminishing returns in performance with increasing PRM scale, highlighting the importance of balancing model size and computational cost. Furthermore, the diversity of training datasets significantly impacts PRM performance, emphasizing the importance of diverse data to enhance both accuracy and efficiency. We further examine test-time scaling strategies, identifying Monte Carlo Tree Search as the most effective method when computational resources are abundant, while Best-of-N Sampling serves as a practical alternative under resource-limited conditions. Notably, our findings indicate that PRMs trained on mathematical datasets exhibit performance comparable to those tailored for code generation, suggesting robust cross-domain generalization. Employing a gradient-based metric, we observe that PRMs exhibit a preference for selecting responses with similar underlying patterns, further informing their optimization.

[192] Enhancing Tool Learning in Large Language Models with Hierarchical Error Checklists

Yue Cui,Liuyi Yao,Shuchang Tao,Weijie Shi,Yaliang Li,Bolin Ding,Xiaofang Zhou

Main category: cs.CL

TL;DR: HiTEC框架通过全局和局部错误检查表系统诊断和缓解LLM工具调用中的参数填充错误，显著提升准确性和成功率。

Details

Motivation: 解决大型语言模型（LLM）在工具调用中因参数填充错误导致的效果受限问题。 Method: 提出HiTEC框架，包括全局和局部错误检查表，并部署HiTEC-ICL和HiTEC-KTO两种方法。 Result: 在五个公共数据集上的实验表明，HiTEC显著提升了参数填充准确性和工具调用成功率。 Conclusion: HiTEC框架有效解决了LLM工具调用中的参数填充问题，具有实际应用价值。 Abstract: Large language models (LLMs) have significantly advanced natural language processing, particularly through the integration of external tools and APIs. However, their effectiveness is frequently hampered by parameter mis-filling during tool calling. In this paper, we propose the Hierarchical Tool Error Checklist (HiTEC) framework to systematically diagnose and mitigate tool-calling errors without relying on extensive real-world interactions. HiTEC introduces a two-tiered approach: a global error checklist that identifies common, cross-tool issues, and a local error checklist that targets tool-specific and contextual failures. Building on this structure, we propose two deployments: HiTEC-In Context Learning (HiTEC-ICL) and HiTEC-Kahneman-Tversky Optimization (HiTEC-KTO). HiTEC-ICL embeds the global checklist in the initial prompts and leverages a two-round conversational interaction to dynamically refine parameter handling, while HiTEC-KTO generates high-quality negative examples to drive fine-tuning via preference-based optimization. Extensive experiments across five public datasets demonstrate that our framework significantly improves parameter-filling accuracy and tool-calling success rates compared to baseline methods.

Wiktoria Mieleszczenko-Kowszewicz,Beata Bajcar,Aleksander Szczęsny,Maciej Markiewicz,Jolanta Babiak,Berenika Dyczek,Przemysław Kazienko

Main category: cs.CL

TL;DR: 论文提出了Social Influence Technique Taxonomy (SITT)框架，包含58种社会影响技术，并评估了LLMs识别这些技术的能力。

Details

Motivation: 研究旨在检测文本中微妙的社会影响形式，并探索LLMs在此任务上的表现。 Method: 构建了SITT数据集，包含746个对话，由专家标注，并采用分层多标签分类方法评估了5种LLMs。 Result: Claude 3.5表现最佳（F1=0.45），但整体模型性能有限，尤其是对上下文敏感的技术。 Conclusion: 当前LLMs对细微语言线索的敏感性不足，需领域特定微调。研究为理解LLMs如何检测社会影响提供了新资源。 Abstract: In this work we present the Social Influence Technique Taxonomy (SITT), a comprehensive framework of 58 empirically grounded techniques organized into nine categories, designed to detect subtle forms of social influence in textual content. We also investigate the LLMs ability to identify various forms of social influence. Building on interdisciplinary foundations, we construct the SITT dataset -- a 746-dialogue corpus annotated by 11 experts in Polish and translated into English -- to evaluate the ability of LLMs to identify these techniques. Using a hierarchical multi-label classification setup, we benchmark five LLMs, including GPT-4o, Claude 3.5, Llama-3.1, Mixtral, and PLLuM. Our results show that while some models, notably Claude 3.5, achieved moderate success (F1 score = 0.45 for categories), overall performance of models remains limited, particularly for context-sensitive techniques. The findings demonstrate key limitations in current LLMs' sensitivity to nuanced linguistic cues and underscore the importance of domain-specific fine-tuning. This work contributes a novel resource and evaluation example for understanding how LLMs detect, classify, and potentially replicate strategies of social influence in natural dialogues.

[194] Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling

Jiayi Zeng,Yizhe Feng,Mengliang He,Wenhui Lei,Wei Zhang,Zeming Liu,Xiaoming Shi,Aimin Zhou

Main category: cs.CL

TL;DR: 论文提出了一种主动错误处理方法，无需显式错误处理指令，并引入了新的基准Mis-prompt，包含四项评估任务、错误分类法和数据集。实验表明当前LLMs在主动错误处理上表现不佳，但通过监督微调可提升能力。

Details

Motivation: 现实场景中通常缺乏显式错误处理指令，当前LLMs的被动错误处理方法不适用，需研究主动错误处理。 Method: 提出Mis-prompt基准，包含四项任务、错误分类法和数据集，并分析LLMs在基准上的表现。 Result: 当前LLMs在主动错误处理上表现不佳，但监督微调能显著提升其能力。 Conclusion: 主动错误处理是重要研究方向，Mis-prompt为未来研究提供了工具和基准。 Abstract: Large language models (LLMs) have demonstrated significant advancements in error handling. Current error-handling works are performed in a passive manner, with explicit error-handling instructions. However, in real-world scenarios, explicit error-handling instructions are usually unavailable. In this paper, our work identifies this challenge as how to conduct proactive error handling without explicit error handling instructions. To promote further research, this work introduces a new benchmark, termed Mis-prompt, consisting of four evaluation tasks, an error category taxonomy, and a new evaluation dataset. Furthermore, this work analyzes current LLMs' performance on the benchmark, and the experimental results reveal that current LLMs show poor performance on proactive error handling, and SFT on error handling instances improves LLMs' proactive error handling capabilities. The dataset will be publicly available.

[195] You Prefer This One, I Prefer Yours: Using Reference Words is Harder Than Vocabulary Words for Humans and Multimodal Language Models

Dota Tianai Dong,Yifan Luo,Po-Ya Angela Wang,Asli Ozyurek,Paula Rubio-Fernandez

Main category: cs.CL

TL;DR: 论文研究了多模态语言模型（MLMs）在参考词使用上的表现，发现其在词汇任务上接近人类水平，但在所有格和指示代词上表现较差，揭示了其在视角转换和空间推理上的局限性。

Details

Motivation: 探讨MLMs在参考词使用上的能力，填补了现有研究中对这一常见但被忽视的沟通方式的空白。 Method: 比较人类和七种先进MLMs在词汇、所有格和指示代词任务上的表现，分析其认知需求差异。 Result: MLMs在词汇任务上表现接近人类，但在所有格和指示代词上显著落后，提示工程仅部分改善了所有格使用。 Conclusion: 当前NLP系统在需要语用学和社会认知的语法形式上仍面临挑战。 Abstract: Multimodal language models (MLMs) increasingly communicate in human-like ways, yet their ability to use reference words remains largely overlooked despite their ubiquity in everyday communication. Our study addresses this gap by comparing human and MLM use of three word classes with increasing cognitive demands: vocabulary words, possessive pronouns (`mine' vs `yours'), and demonstrative pronouns (`this one' vs `that one'). Evaluating seven state-of-the-art MLMs against human participants, we observe a clear difficulty hierarchy: while MLMs approach human-level performance on the vocabulary task, they show substantial deficits with possessives and demonstratives. Our analysis reveals these difficulties stem from limitations in perspective-taking and spatial reasoning. Although prompt engineering improved model performance on possessive use, demonstrative use remained well below human-level competence. These findings provide theoretical and empirical evidence that producing grammatical forms requiring pragmatics and social cognition remains a clear challenge in current NLP systems.

[196] Probing Politico-Economic Bias in Multilingual Large Language Models: A Cultural Analysis of Low-Resource Pakistani Languages

Afrozah Nadeem,Mark Dras,Usman Naseem

Main category: cs.CL

TL;DR: 该论文系统分析了13种先进大语言模型在巴基斯坦五种低资源语言中的政治偏见，提出了一种结合政治倾向测试和多层次框架分析的新方法，发现模型普遍偏向自由左翼价值观，但在区域语言中表现出明显的威权倾向。

Details

Motivation: 研究动机是探讨大语言模型在非西方和低资源多语言环境中的政治经济偏见，填补现有研究的空白。 Method: 方法包括结合定量的政治倾向测试（经济和社会轴）和定性的框架分析（内容、风格和重点），并针对巴基斯坦社会的11个关键社会政治主题设计提示。 Result: 结果显示模型普遍偏向自由左翼价值观，但在区域语言中表现出威权倾向，且存在模型特定的偏见特征和语言条件差异。 Conclusion: 结论强调需要开发基于文化的多语言偏见审计框架，以应对模型在不同语言和文化背景下的偏见问题。 Abstract: Large Language Models (LLMs) are increasingly shaping public discourse, yet their politico-economic biases remain underexamined in non-Western and low-resource multilingual contexts. This paper presents a systematic analysis of political bias in 13 state-of-the-art LLMs across five low-resource languages spoken in Pakistan: Urdu, Punjabi, Sindhi, Balochi, and Pashto. We propose a novel framework that integrates an adapted Political Compass Test (PCT) with a multi-level framing analysis. Our method combines quantitative assessment of political orientation across economic (left-right) and social (libertarian-authoritarian) axes with qualitative analysis of framing through content, style, and emphasis. We further contextualize this analysis by aligning prompts with 11 key socio-political themes relevant to Pakistani society. Our results reveal that LLMs predominantly align with liberal-left values, echoing Western training data influences, but exhibit notable shifts toward authoritarian framing in regional languages, suggesting strong cultural modulation effects. We also identify consistent model-specific bias signatures and language-conditioned variations in ideological expression. These findings show the urgent need for culturally grounded, multilingual bias auditing frameworks.

[197] Evaluating the Sensitivity of LLMs to Prior Context

Robert Hankache,Kingsley Nketia Acheampong,Liang Song,Marek Brynda,Raad Khraishi,Greig A. Cowan

Main category: cs.CL

TL;DR: 论文研究了大型语言模型（LLMs）在多轮对话中的性能表现，发现上下文变化会显著影响其准确性，并提出新基准测试。

Details

Motivation: 现有基准测试主要关注单轮问答任务，无法反映多轮交互对LLMs性能的影响，因此需要系统性研究。 Method: 引入一组新基准测试，系统性地改变上下文量和性质，并评估多种LLMs（如GPT、Claude、Gemini）的性能。 Result: 多轮交互中，LLMs在选择题上的性能可能大幅下降（某些模型下降73%）；任务描述的策略性放置可显著缓解性能下降。 Conclusion: 需开发稳健策略以设计和评估LLMs，减少上下文敏感性对其性能的影响。 Abstract: As large language models (LLMs) are increasingly deployed in multi-turn dialogue and other sustained interactive scenarios, it is essential to understand how extended context affects their performance. Popular benchmarks, focusing primarily on single-turn question answering (QA) tasks, fail to capture the effects of multi-turn exchanges. To address this gap, we introduce a novel set of benchmarks that systematically vary the volume and nature of prior context. We evaluate multiple conventional LLMs, including GPT, Claude, and Gemini, across these benchmarks to measure their sensitivity to contextual variations. Our findings reveal that LLM performance on multiple-choice questions can degrade dramatically in multi-turn interactions, with performance drops as large as 73% for certain models. Even highly capable models such as GPT-4o exhibit up to a 32% decrease in accuracy. Notably, the relative performance of larger versus smaller models is not always predictable. Moreover, the strategic placement of the task description within the context can substantially mitigate performance drops, improving the accuracy by as much as a factor of 3.5. These findings underscore the need for robust strategies to design, evaluate, and mitigate context-related sensitivity in LLMs.

[198] Gaussian mixture models as a proxy for interacting language models

Edward Wang,Tianyu Wang,Avanti Athreya,Vince Lyzinski,Carey E. Priebe

Main category: cs.CL

TL;DR: 论文提出用交互式高斯混合模型（GMMs）替代复杂的大语言模型（LLMs）来研究人类行为，发现GMMs能捕捉LLMs动态特征，并探讨了其优缺点。

Details

Motivation: LLMs虽强大但计算成本高，而检索增强生成（RAG）使其在社会科学中应用受限，因此需要更高效的替代方法。 Method: 引入交互式GMMs，并与LLMs的实验模拟进行比较，分析其动态特征。 Result: GMMs能有效模拟LLMs的交互动态，并揭示了二者的关键异同点。 Conclusion: GMMs是一种有潜力的替代方案，未来可进一步优化和扩展研究方向。 Abstract: Large language models (LLMs) are a powerful tool with the ability to match human capabilities and behavior in many settings. Retrieval-augmented generation (RAG) further allows LLMs to generate diverse output depending on the contents of their RAG database. This motivates their use in the social sciences to study human behavior between individuals when large-scale experiments are infeasible. However, LLMs depend on complex, computationally expensive algorithms. In this paper, we introduce interacting Gaussian mixture models (GMMs) as an alternative to similar frameworks using LLMs. We compare a simplified model of GMMs to select experimental simulations of LLMs whose updating and response depend on feedback from other LLMs. We find that interacting GMMs capture important features of the dynamics in interacting LLMs, and we investigate key similarities and differences between interacting LLMs and GMMs. We conclude by discussing the benefits of Gaussian mixture models, potential modifications, and future research directions.

[199] COSMIC: Generalized Refusal Direction Identification in LLM Activations

Vincent Siu,Nicholas Crispino,Zihao Yu,Sam Pan,Zhun Wang,Yang Liu,Dawn Song,Chenguang Wang

Main category: cs.CL

TL;DR: COSMIC是一种自动化框架，通过余弦相似度识别拒绝行为方向，无需依赖模型输出或预设模板，性能与现有方法相当。

Details

Motivation: 现有方法依赖预设模板或人工分析，难以全面识别LLM中的拒绝行为。 Method: 使用余弦相似度选择方向和目标层，独立于模型输出。 Result: 在对抗性和弱对齐模型中可靠识别拒绝方向，并能引导模型更安全行为，假拒绝率低。 Conclusion: COSMIC在多种对齐条件下表现稳健，为识别和引导LLM行为提供了新方法。 Abstract: Large Language Models (LLMs) encode behaviors such as refusal within their activation space, yet identifying these behaviors remains a significant challenge. Existing methods often rely on predefined refusal templates detectable in output tokens or require manual analysis. We introduce \textbf{COSMIC} (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection that identifies viable steering directions and target layers using cosine similarity - entirely independent of model outputs. COSMIC achieves steering performance comparable to prior methods without requiring assumptions about a model's refusal behavior, such as the presence of specific refusal tokens. It reliably identifies refusal directions in adversarial settings and weakly aligned models, and is capable of steering such models toward safer behavior with minimal increase in false refusals, demonstrating robustness across a wide range of alignment conditions.

[200] SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

Peng Xie,Xingyuan Liu,Tsz Wai Chan,Yequan Bie,Yangqiu Song,Yang Wang,Hao Chen,Kani Chen

Main category: cs.CL

TL;DR: 论文介绍了LinguaMaster框架和SwitchLingua数据集，解决了多语言代码切换（CS）研究中数据不足的问题，并提出了新的评估指标SAER。

Details

Motivation: 现有代码切换数据集规模小且多样性不足，无法满足多语言应用需求，亟需类似ImageNet的大规模基准数据集。 Method: 提出LinguaMaster框架，用于高效合成多语言数据，并构建了SwitchLingua数据集，包含42万文本样本和80小时音频数据。 Result: SwitchLingua数据集覆盖12种语言和63种民族背景，为多语言研究提供了丰富资源；SAER指标能更准确地评估代码切换场景下的ASR性能。 Conclusion: LinguaMaster和SwitchLingua填补了多语言数据集的空白，SAER为代码切换研究提供了更有效的评估工具。 Abstract: Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce \textbf{LinguaMaster}, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate \textbf{SwitchLingua}, the first large-scale multilingual and multi-ethnic code-switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code-switching scenarios, we propose the \textbf{Semantic-Aware Error Rate (SAER)}, a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance.

[201] HD-NDEs: Neural Differential Equations for Hallucination Detection in LLMs

Qing Li,Jiahui Geng,Zongxiong Chen,Derui Zhu,Yuxia Wang,Congbo Ma,Chenyang Lyu,Fakhri Karray

Main category: cs.CL

TL;DR: 论文提出了一种新方法HD-NDEs，通过神经微分方程在LLMs的潜在空间中动态评估陈述的真实性，显著提升了幻觉检测性能。

Details

Motivation: 尽管现有方法（如SAPLMA）在缓解LLMs的幻觉问题上高效，但在输出序列早期或中期出现非事实信息时表现不佳。 Method: 使用神经微分方程（Neural DEs）建模LLMs潜在空间的动态系统，并将潜在空间序列映射到分类空间进行真实性评估。 Result: 在五个数据集和六种LLMs上的实验表明，HD-NDEs在True-False数据集上的AUC-ROC比现有技术提升了14%以上。 Conclusion: HD-NDEs通过动态建模潜在空间，显著提高了幻觉检测的可靠性，为LLMs的实际部署提供了更优解决方案。 Abstract: In recent years, large language models (LLMs) have made remarkable advancements, yet hallucination, where models produce inaccurate or non-factual statements, remains a significant challenge for real-world deployment. Although current classification-based methods, such as SAPLMA, are highly efficient in mitigating hallucinations, they struggle when non-factual information arises in the early or mid-sequence of outputs, reducing their reliability. To address these issues, we propose Hallucination Detection-Neural Differential Equations (HD-NDEs), a novel method that systematically assesses the truthfulness of statements by capturing the full dynamics of LLMs within their latent space. Our approaches apply neural differential equations (Neural DEs) to model the dynamic system in the latent space of LLMs. Then, the sequence in the latent space is mapped to the classification space for truth assessment. The extensive experiments across five datasets and six widely used LLMs demonstrate the effectiveness of HD-NDEs, especially, achieving over 14% improvement in AUC-ROC on the True-False dataset compared to state-of-the-art techniques.

[202] Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards

Xun Lu

Main category: cs.CL

TL;DR: 论文提出了一种统一的RLVR训练范式，通过引入基于写作原则的成对生成奖励模型（GenRM）和Bootstrapped Relative Policy Optimization（BRPO）算法，解决了非可验证任务（如创意写作）中奖励模型的局限性。

Details

Motivation: 现有方法在非可验证任务中依赖标量奖励模型，存在泛化能力不足和奖励黑客问题。本文旨在通过RLVR框架填补这一空白。 Method: 提出成对写作GenRM和BRPO算法，将主观评估转化为可验证奖励，并通过动态参考优化策略。 Result: 实验表明，该方法在写作任务中表现优于标量奖励基线，且具备抗奖励黑客能力。 Conclusion: 研究展示了RLVR框架在统一规则、参考和无参考奖励建模中的潜力，为语言任务提供了全面的RL训练范式。 Abstract: Reinforcement learning with verifiable rewards (RLVR) has enabled large language models (LLMs) to achieve remarkable breakthroughs in reasoning tasks with objective ground-truth answers, such as mathematics and code generation. However, a significant gap remains for non-verifiable tasks, like creative writing and open-ended dialogue, where quality assessment is inherently subjective and lacks definitive references. Existing approaches for these domains often rely on scalar reward models trained with human preferences, which suffer from limited generalization and are prone to reward hacking, such as over-explanation and length bias. In this work, we propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. The pairwise writing GenRM leverages self-principled critique to transform subjective assessments into reliable, verifiable rewards, while BRPO enables dynamic, reference-free pairwise comparison by leveraging a bootstrapped response as temporary reference from within group rollouts during RL training. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning, as demonstrated by Writing-Zero, which shows consistent improvement and strong resistance to reward hacking compared to scalar reward baselines. Furthermore, our method achieves competitive results on both in-house and open-source writing benchmarks. Our findings suggest the potential to unify rule-based, reference-based, and reference-free reward modeling under the RLVR framework, thus paving the way for a comprehensive and scalable RL training paradigm applicable across all language tasks.

[203] Spurious Correlations and Beyond: Understanding and Mitigating Shortcut Learning in SDOH Extraction with Large Language Models

Fardin Ahsan Sakib,Ziwei Zhu,Karen Trister Grace,Meliha Yetisgen,Ozlem Uzuner

Main category: cs.CL

TL;DR: 论文探讨了从临床文本中提取社会健康决定因素（SDOH）时，大语言模型（LLM）可能因依赖表面线索而产生虚假预测的问题，并提出缓解策略。

Details

Motivation: SDOH提取对医疗分析至关重要，但LLM可能因依赖表面线索（如提及酒精或吸烟）而错误预测药物使用状态，且存在性别差异。 Method: 使用SHAC数据集中的MIMIC部分，以药物状态提取为例，分析LLM的虚假预测问题，并评估提示工程和链式思维推理等缓解策略。 Result: 研究发现提及酒精或吸烟会误导模型预测药物使用状态，同时揭示模型性能存在性别差异。缓解策略部分有效。 Conclusion: 通过提示工程和链式思维推理等策略，可以部分减少LLM在健康领域的虚假预测，提升其可靠性。 Abstract: Social determinants of health (SDOH) extraction from clinical text is critical for downstream healthcare analytics. Although large language models (LLMs) have shown promise, they may rely on superficial cues leading to spurious predictions. Using the MIMIC portion of the SHAC (Social History Annotation Corpus) dataset and focusing on drug status extraction as a case study, we demonstrate that mentions of alcohol or smoking can falsely induce models to predict current/past drug use where none is present, while also uncovering concerning gender disparities in model performance. We further evaluate mitigation strategies - such as prompt engineering and chain-of-thought reasoning - to reduce these false positives, providing insights into enhancing LLM reliability in health domains.

[204] LaMP-QA: A Benchmark for Personalized Long-form Question Answering

Alireza Salemi,Hamed Zamani

Main category: cs.CL

TL;DR: LaMP-QA是一个用于评估个性化长答案生成的基准，填补了该领域资源不足的空白，涵盖多个类别，并通过实验证明个性化上下文可提升性能达39%。

Details

Motivation: 个性化在问答系统中至关重要，但相关研究和资源匮乏，因此需要开发LaMP-QA基准以推动研究。 Method: 引入LaMP-QA基准，涵盖三大类45个子类别，通过人工和自动评估比较不同策略，并测试开源和专有LLM的个性化与非个性化方法。 Result: 实验表明，引入个性化上下文可使性能提升高达39%。 Conclusion: LaMP-QA基准的发布为个性化问答研究提供了重要资源，并证实了个性化方法的有效性。 Abstract: Personalization is essential for question answering systems that are user-centric. Despite its importance, personalization in answer generation has been relatively underexplored. This is mainly due to lack of resources for training and evaluating personalized question answering systems. We address this gap by introducing LaMP-QA -- a benchmark designed for evaluating personalized long-form answer generation. The benchmark covers questions from three major categories: (1) Arts & Entertainment, (2) Lifestyle & Personal Development, and (3) Society & Culture, encompassing over 45 subcategories in total. To assess the quality and potential impact of the LaMP-QA benchmark for personalized question answering, we conduct comprehensive human and automatic evaluations, to compare multiple evaluation strategies for evaluating generated personalized responses and measure their alignment with human preferences. Furthermore, we benchmark a number of non-personalized and personalized approaches based on open-source and proprietary large language models (LLMs). Our results show that incorporating the personalized context provided leads to performance improvements of up to 39%. The benchmark is publicly released to support future research in this area.

[205] Vedavani: A Benchmark Corpus for ASR on Vedic Sanskrit Poetry

Sujeet Kumar,Pretam Ray,Abhinay Beerukuri,Shrey Kamoji,Manoj Balaji Jagadeeshan,Pawan Goyal

Main category: cs.CL

TL;DR: 本文介绍了Vedavani，首个专注于梵语吠陀诗歌的自动语音识别（ASR）研究，并提供了一个54小时的梵语ASR数据集，测试了多种多语言语音模型。

Details

Motivation: 梵语的音素复杂性和语音转换特性，尤其是在诗歌形式中的韵律和节奏特征，使得其ASR研究较少。本文旨在填补这一空白。 Method: 构建了一个包含30,779个标记音频样本的54小时梵语ASR数据集，并测试了多种先进的多语言语音模型。 Result: 实验表明，IndicWhisper在测试的模型中表现最佳。 Conclusion: Vedavani为梵语ASR研究提供了首个全面数据集，并验证了IndicWhisper的有效性。 Abstract: Sanskrit, an ancient language with a rich linguistic heritage, presents unique challenges for automatic speech recognition (ASR) due to its phonemic complexity and the phonetic transformations that occur at word junctures, similar to the connected speech found in natural conversations. Due to these complexities, there has been limited exploration of ASR in Sanskrit, particularly in the context of its poetic verses, which are characterized by intricate prosodic and rhythmic patterns. This gap in research raises the question: How can we develop an effective ASR system for Sanskrit, particularly one that captures the nuanced features of its poetic form? In this study, we introduce Vedavani, the first comprehensive ASR study focused on Sanskrit Vedic poetry. We present a 54-hour Sanskrit ASR dataset, consisting of 30,779 labelled audio samples from the Rig Veda and Atharva Veda. This dataset captures the precise prosodic and rhythmic features that define the language. We also benchmark the dataset on various state-of-the-art multilingual speech models.$^{1}$ Experimentation revealed that IndicWhisper performed the best among the SOTA models.

[206] Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement

Qihui Fan,Enfu Nan,Wenbo Li,Lei Lu,Pu Zhao,Yanzhi Wang

Main category: cs.CL

TL;DR: 本文提出了一种基于LLM的狼人杀游戏系统，通过优化的TTS模型提升兼容性和用户体验，认为随着LLM推理能力的增强，额外组件将变得不必要。

Details

Motivation: 随着LLM推理和说服能力的提升，结合社交推理游戏的流行，本文旨在设计一个更吸引人的LLM代理狼人杀游戏系统。 Method: 提出了一种基于LLM的狼人杀系统，结合优化的TTS模型，无需额外组件如微调或经验池。 Result: 系统提升了与多种LLM模型的兼容性，并改善了用户参与度。 Conclusion: 随着LLM推理能力的持续增强，未来类似狼人杀的游戏系统可能无需依赖额外组件。 Abstract: The growing popularity of social deduction game systems for both business applications and AI research has greatly benefited from the rapid advancements in Large Language Models (LLMs), which now demonstrate stronger reasoning and persuasion capabilities. Especially with the raise of DeepSeek R1 and V3 models, LLMs should enable a more engaging experience for human players in LLM-agent-based social deduction games like Werewolf. Previous works either fine-tuning, advanced prompting engineering, or additional experience pool to achieve engaging text-format Werewolf game experience. We propose a novel yet straightforward LLM-based Werewolf game system with tuned Text-to-Speech(TTS) models designed for enhanced compatibility with various LLM models, and improved user engagement. We argue with ever enhancing LLM reasoning, extra components will be unnecessary in the case of Werewolf.

[207] Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences

Mingqian Zheng,Wenjia Hu,Patrick Zhao,Motahhare Eslami,Jena D. Hwang,Faeze Brahman,Carolyn Rose,Maarten Sap

Main category: cs.CL

TL;DR: 研究发现，部分遵从策略（提供通用信息但不含可操作细节）能显著改善用户体验，减少负面感知50%以上，优于直接拒绝。现有LLMs和奖励模型未能充分利用此策略。

Details

Motivation: 当前LLMs对所有潜在有害查询一律拒绝，导致安全性与用户体验的权衡问题。研究旨在探索不同拒绝策略对用户感知的影响。 Method: 通过480名参与者评估3,840个查询-响应对，分析不同拒绝策略的效果，并评估9个先进LLMs和6个奖励模型的响应模式。 Result: 部分遵从策略效果最佳，但现有模型和奖励模型未能有效应用或评估此策略。 Conclusion: AI安全机制应注重设计深思熟虑的拒绝策略，而非仅依赖意图检测，以兼顾安全性和用户体验。 Abstract: Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually had harmful intents, causing a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance -- providing general information without actionable details -- emerges as the optimal strategy, reducing negative user perceptions by over 50% to flat-out refusals. Complementing this, we analyze response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and reward models currently undervalue it. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.

[208] Structuring Radiology Reports: Challenging LLMs with Lightweight Models

Johannes Moll,Louisa Fay,Asfandyar Azhar,Sophie Ostmeier,Tim Lueth,Sergios Gatidis,Curtis Langlotz,Jean-Benoit Delbrouck

Main category: cs.CL

TL;DR: 论文探讨了轻量级编码器-解码器模型（T5和BERT2BERT）在结构化放射学报告中的应用，相比大型语言模型（LLMs），轻量级模型在性能和资源消耗上更具优势。

Details

Motivation: 放射学报告缺乏标准化格式，限制了人类解读和机器学习应用。大型语言模型虽强大，但计算需求高、透明度低且存在隐私问题。 Method: 使用轻量级模型（<300M参数）和八种开源LLMs（1B-70B）进行对比，评估其在MIMIC-CXR和CheXpert Plus数据集上的表现。 Result: 轻量级模型在人类标注测试集上优于基于提示技术的LLMs，部分LLMs在Findings部分表现略优但资源消耗显著更高。 Conclusion: 轻量级模型是资源受限医疗环境中结构化临床文本的可持续且隐私保护的解决方案。 Abstract: Radiology reports are critical for clinical decision-making but often lack a standardized format, limiting both human interpretability and machine learning (ML) applications. While large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment. To address these challenges, we explore lightweight encoder-decoder models (<300M parameters)-specifically T5 and BERT2BERT-for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets. We benchmark these models against eight open-source LLMs (1B-70B), adapted using prefix prompting, in-context learning (ICL), and low-rank adaptation (LoRA) finetuning. Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set. While some LoRA-finetuned LLMs achieve modest gains over the lightweight model on the Findings section (BLEU 6.4%, ROUGE-L 4.8%, BERTScore 3.6%, F1-RadGraph 1.1%, GREEN 3.6%, and F1-SRR-BERT 4.3%), these improvements come at the cost of substantially greater computational resources. For example, LLaMA-3-70B incurred more than 400 times the inference time, cost, and carbon emissions compared to the lightweight model. These results underscore the potential of lightweight, task-specific models as sustainable and privacy-preserving solutions for structuring clinical text in resource-constrained healthcare settings.

[209] Structure-Aware Fill-in-the-Middle Pretraining for Code

Linyuan Gong,Alvin Cheung,Mostafa Elhoushi,Sida Wang

Main category: cs.CL

TL;DR: AST-FIM是一种利用抽象语法树（AST）进行代码填充预训练的方法，相比传统随机字符掩码方法，它在实际代码编辑任务中表现更优。

Details

Motivation: 现有LLM将代码视为纯文本并随机掩码字符，忽略了代码的结构特性，导致训练示例不够连贯。 Method: 提出AST-FIM方法，通过AST掩码完整的语法结构（如代码块、表达式或函数），生成更符合代码结构和编辑模式的训练示例。 Result: 在1B和8B参数模型上，AST-FIM在标准FIM基准测试中比随机字符FIM高出5分。 Conclusion: AST-FIM通过利用代码结构信息，显著提升了代码填充任务的性能，适用于实际代码编辑场景。 Abstract: Fill-in-the-Middle (FIM) is a common pretraining method for code LLMs, where models complete code segments given surrounding context. However, existing LLMs treat code as plain text and mask random character spans. We propose and evaluate AST-FIM, a pretraining strategy that leverages Abstract Syntax Trees (ASTs) to mask complete syntactic structures at scale, ensuring coherent training examples better aligned with universal code structures and common code editing patterns such as blocks, expressions, or functions. To evaluate real-world fill-in-the-middle (FIM) programming tasks, we introduce Real-FIM-Eval, a benchmark derived from 30,000+ GitHub commits across 12 languages. On infilling tasks, experiments on 1B and 8B parameter models show that AST-FIM is particularly beneficial for real-world code editing as it outperforms standard random-character FIM by up to 5 pts on standard FIM benchmarks. Our code is publicly available at https://github.com/gonglinyuan/ast_fim.

[210] REIC: RAG-Enhanced Intent Classification at Scale

Ziji Zhang,Michael Yang,Zhiyu Chen,Yingying Zhuang,Shu-Ting Pi,Qun Liu,Rajashekar Maragoud,Vy Nguyen,Anurag Beniwal

Main category: cs.CL

TL;DR: REIC是一种基于检索增强生成的意图分类方法，有效解决了大规模客户服务中意图分类的可扩展性问题，无需频繁重新训练。

Details

Motivation: 随着公司产品线的扩展，意图分类面临意图数量增加和分类体系跨垂直领域变化的挑战，需要更高效的分类方法。 Method: REIC利用检索增强生成（RAG）动态整合相关知识，实现精确分类。 Result: 实验表明，REIC在大规模客户服务场景中优于传统微调、零样本和少样本方法，适用于域内和域外场景。 Conclusion: REIC在自适应和大规模意图分类系统中具有实际部署潜力。 Abstract: Accurate intent classification is critical for efficient routing in customer service, ensuring customers are connected with the most suitable agents while reducing handling times and operational costs. However, as companies expand their product lines, intent classification faces scalability challenges due to the increasing number of intents and variations in taxonomy across different verticals. In this paper, we introduce REIC, a Retrieval-augmented generation Enhanced Intent Classification approach, which addresses these challenges effectively. REIC leverages retrieval-augmented generation (RAG) to dynamically incorporate relevant knowledge, enabling precise classification without the need for frequent retraining. Through extensive experiments on real-world datasets, we demonstrate that REIC outperforms traditional fine-tuning, zero-shot, and few-shot methods in large-scale customer service settings. Our results highlight its effectiveness in both in-domain and out-of-domain scenarios, demonstrating its potential for real-world deployment in adaptive and large-scale intent classification systems.

[211] ComposeRAG: A Modular and Composable RAG for Corpus-Grounded Multi-Hop Question Answering

Ruofan Wu,Youngwon Lee,Fan Shu,Danmei Xu,Seung-won Hwang,Zhewei Yao,Yuxiong He,Feng Yan

Main category: cs.CL

TL;DR: ComposeRAG是一种模块化的RAG系统，通过分解核心功能为独立模块（如问题分解、查询重写等）提升多跳问答的性能和可解释性，并在验证失败时通过自反思机制优化结果。

Details

Motivation: 现有RAG系统设计单一，核心功能耦合度高，限制了可解释性和针对性改进，尤其是在复杂多跳问答任务中。 Method: 提出ComposeRAG，将RAG流程分解为可组合的模块化组件，每个模块独立实现和优化，并引入自反思机制增强鲁棒性。 Result: 在四个多跳问答基准测试中，ComposeRAG在准确性和基础性上均优于基线方法，最高提升15%准确率和5%基础性。 Conclusion: ComposeRAG通过模块化设计和自反思机制，实现了灵活、透明、可扩展且高性能的多跳推理，显著提升了基础性和可解释性。 Abstract: Retrieval-Augmented Generation (RAG) systems are increasingly diverse, yet many suffer from monolithic designs that tightly couple core functions like query reformulation, retrieval, reasoning, and verification. This limits their interpretability, systematic evaluation, and targeted improvement, especially for complex multi-hop question answering. We introduce ComposeRAG, a novel modular abstraction that decomposes RAG pipelines into atomic, composable modules. Each module, such as Question Decomposition, Query Rewriting, Retrieval Decision, and Answer Verification, acts as a parameterized transformation on structured inputs/outputs, allowing independent implementation, upgrade, and analysis. To enhance robustness against errors in multi-step reasoning, ComposeRAG incorporates a self-reflection mechanism that iteratively revisits and refines earlier steps upon verification failure. Evaluated on four challenging multi-hop QA benchmarks, ComposeRAG consistently outperforms strong baselines in both accuracy and grounding fidelity. Specifically, it achieves up to a 15% accuracy improvement over fine-tuning-based methods and up to a 5% gain over reasoning-specialized pipelines under identical retrieval conditions. Crucially, ComposeRAG significantly enhances grounding: its verification-first design reduces ungrounded answers by over 10% in low-quality retrieval settings, and by approximately 3% even with strong corpora. Comprehensive ablation studies validate the modular architecture, demonstrating distinct and additive contributions from each component. These findings underscore ComposeRAG's capacity to deliver flexible, transparent, scalable, and high-performing multi-hop reasoning with improved grounding and interpretability.

[212] MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility

Yexiao He,Ang Li,Boyi Liu,Zhewei Yao,Yuxiong He

Main category: cs.CL

TL;DR: MedOrch是一个新型医疗决策支持框架，通过协调多个专业工具和推理代理，提供全面的医疗决策支持，并在多个医疗任务中表现出色。

Details

Motivation: 当前AI系统在医疗决策中要么依赖任务特定模型（适应性有限），要么依赖未与专业知识工具结合的语言模型，难以满足复杂需求。 Method: MedOrch采用模块化、基于代理的架构，灵活整合领域特定工具，并确保透明、可追溯的推理过程。 Result: 在阿尔茨海默病诊断、胸部X光解读和医学视觉问答等任务中，MedOrch表现优异，准确率显著提升。 Conclusion: MedOrch展示了通过推理驱动工具利用处理多模态医疗数据的潜力，有望推动医疗AI的发展。 Abstract: Healthcare decision-making represents one of the most challenging domains for Artificial Intelligence (AI), requiring the integration of diverse knowledge sources, complex reasoning, and various external analytical tools. Current AI systems often rely on either task-specific models, which offer limited adaptability, or general language models without grounding with specialized external knowledge and tools. We introduce MedOrch, a novel framework that orchestrates multiple specialized tools and reasoning agents to provide comprehensive medical decision support. MedOrch employs a modular, agent-based architecture that facilitates the flexible integration of domain-specific tools without altering the core system. Furthermore, it ensures transparent and traceable reasoning processes, enabling clinicians to meticulously verify each intermediate step underlying the system's recommendations. We evaluate MedOrch across three distinct medical applications: Alzheimer's disease diagnosis, chest X-ray interpretation, and medical visual question answering, using authentic clinical datasets. The results demonstrate MedOrch's competitive performance across these diverse medical tasks. Notably, in Alzheimer's disease diagnosis, MedOrch achieves an accuracy of 93.26%, surpassing the state-of-the-art baseline by over four percentage points. For predicting Alzheimer's disease progression, it attains a 50.35% accuracy, marking a significant improvement. In chest X-ray analysis, MedOrch exhibits superior performance with a Macro AUC of 61.2% and a Macro F1-score of 25.5%. Moreover, in complex multimodal visual question answering (Image+Table), MedOrch achieves an accuracy of 54.47%. These findings underscore MedOrch's potential to advance healthcare AI by enabling reasoning-driven tool utilization for multimodal medical data processing and supporting intricate cognitive tasks in clinical decision-making.

[213] PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain

Mohammad Javad Ranjbar Kalahroodi,Amirhossein Sheikholselami,Sepehr Karimi,Sepideh Ranjbar Kalahroodi,Heshaam Faili,Azadeh Shakery

Main category: cs.CL

TL;DR: 论文介绍了PersianMedQA数据集，用于评估LLMs在波斯语和英语中的医学推理能力，发现闭源通用模型表现最佳，而波斯语微调模型表现较差。

Details

Motivation: 探索LLMs在低资源语言（如波斯语）和高风险领域（如医学）中的可靠性。 Method: 构建PersianMedQA数据集，评估40多种LLMs在零样本和CoT设置下的表现。 Result: 闭源通用模型（如GPT-4.1）表现最佳（波斯语83.3%，英语80.7%），波斯语微调模型表现较差（如Dorna为35.9%）。 Conclusion: 模型大小不足以保证性能，需结合领域或语言适应；PersianMedQA为多语言医学推理评估提供了基础。 Abstract: Large Language Models (LLMs) have achieved remarkable performance on a wide range of NLP benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale, expert-validated dataset of multiple-choice Persian medical questions, designed to evaluate LLMs across both Persian and English. We benchmark over 40 state-of-the-art models, including general-purpose, Persian fine-tuned, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-source general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.3% accuracy in Persian and 80.7% in English, while Persian fine-tuned models such as Dorna underperform significantly (e.g., 35.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, Persian responses are sometimes more accurate due to cultural and clinical contextual cues. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating multilingual and culturally grounded medical reasoning in LLMs. The PersianMedQA dataset can be accessed at: https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA](https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA

Lihao Sun,Chengzhi Mao,Valentin Hofmann,Xuechunzi Bai

Main category: cs.CL

TL;DR: 研究发现，尽管对齐的语言模型在显性偏见评估中表现无偏，但在隐性词联想任务中仍表现出刻板印象。对齐过程反而放大了模型的隐性偏见，尤其是在模糊语境下忽略种族概念。作者提出了一种新的偏见缓解策略，通过激励模型在早期层表示种族概念来有效减少隐性偏见。

Details

Motivation: 探讨对齐语言模型在显性和隐性偏见评估中表现不一致的机制，并解决隐性偏见问题。 Method: 研究发现对齐模型在模糊语境下忽略种族概念，导致隐性偏见。提出通过激励模型在早期层表示种族概念来缓解偏见。 Result: 对齐模型在模糊语境下忽略种族概念，放大了隐性偏见。新策略通过增强种族概念表示有效减少了偏见。 Conclusion: 忽略种族概念可能无意中放大隐性偏见，而通过激励模型表示种族概念可以有效缓解这一问题。 Abstract: Although value-aligned language models (LMs) appear unbiased in explicit bias evaluations, they often exhibit stereotypes in implicit word association tasks, raising concerns about their fair usage. We investigate the mechanisms behind this discrepancy and find that alignment surprisingly amplifies implicit bias in model outputs. Specifically, we show that aligned LMs, unlike their unaligned counterparts, overlook racial concepts in early internal representations when the context is ambiguous. Not representing race likely fails to activate safety guardrails, leading to unintended biases. Inspired by this insight, we propose a new bias mitigation strategy that works by incentivizing the representation of racial concepts in the early model layers. In contrast to conventional mitigation methods of machine unlearning, our interventions find that steering the model to be more aware of racial concepts effectively mitigates implicit bias. Similar to race blindness in humans, ignoring racial nuances can inadvertently perpetuate subtle biases in LMs.

[215] The Impact of Disability Disclosure on Fairness and Bias in LLM-Driven Candidate Selection

Mahammed Kamruzzaman,Gene Louis Kim

Main category: cs.CL

TL;DR: 研究发现，在LLM驱动的招聘过程中，候选人披露残疾信息会导致选择偏见，即使在其他条件相同的情况下，LLM更倾向于选择未披露残疾的候选人。

Details

Motivation: 探讨LLM在招聘中处理自愿披露的残疾信息时是否引入偏见，填补现有研究空白。 Method: 通过对比相同背景的候选人（仅残疾披露状态不同），分析LLM的选择倾向。 Result: LLM明显偏向未披露残疾的候选人，即使不披露也比披露更有优势。 Conclusion: LLM在招聘中存在对残疾信息的潜在偏见，需进一步优化公平性。 Abstract: As large language models (LLMs) become increasingly integrated into hiring processes, concerns about fairness have gained prominence. When applying for jobs, companies often request/require demographic information, including gender, race, and disability or veteran status. This data is collected to support diversity and inclusion initiatives, but when provided to LLMs, especially disability-related information, it raises concerns about potential biases in candidate selection outcomes. Many studies have highlighted how disability can impact CV screening, yet little research has explored the specific effect of voluntarily disclosed information on LLM-driven candidate selection. This study seeks to bridge that gap. When candidates shared identical gender, race, qualifications, experience, and backgrounds, and sought jobs with minimal employment rate gaps between individuals with and without disabilities (e.g., Cashier, Software Developer), LLMs consistently favored candidates who disclosed that they had no disability. Even in cases where candidates chose not to disclose their disability status, the LLMs were less likely to select them compared to those who explicitly stated they did not have a disability.

[216] MultiHoax: A Dataset of Multi-hop False-Premise Questions

Mohammadamin Shafiei,Hamidreza Saffari,Nafise Sadat Moosavi

Main category: cs.CL

TL;DR: 论文提出了MultiHoax基准，用于评估大语言模型在复杂多步推理任务中处理错误前提的能力，发现现有模型在多国和多知识类别中表现不佳。

Details

Motivation: 大语言模型在高风险领域部署时，检测错误假设和批判性推理能力至关重要。现有基准仅关注单步错误前提问题，而现实推理需要多步验证。 Method: 引入MultiHoax基准，覆盖七国和十类知识领域，利用维基百科作为知识源，评估模型在多步推理中的表现。 Result: 实验显示，当前最先进的大语言模型在多国、多知识类别和多步推理中难以检测错误前提。 Conclusion: 需提升大语言模型的错误前提检测能力和多步推理鲁棒性。 Abstract: As Large Language Models are increasingly deployed in high-stakes domains, their ability to detect false assumptions and reason critically is crucial for ensuring reliable outputs. False-premise questions (FPQs) serve as an important evaluation method by exposing cases where flawed assumptions lead to incorrect responses. While existing benchmarks focus on single-hop FPQs, real-world reasoning often requires multi-hop inference, where models must verify consistency across multiple reasoning steps rather than relying on surface-level cues. To address this gap, we introduce MultiHoax, a benchmark for evaluating LLMs' ability to handle false premises in complex, multi-step reasoning tasks. Our dataset spans seven countries and ten diverse knowledge categories, using Wikipedia as the primary knowledge source to enable factual reasoning across regions. Experiments reveal that state-of-the-art LLMs struggle to detect false premises across different countries, knowledge categories, and multi-hop reasoning types, highlighting the need for improved false premise detection and more robust multi-hop reasoning capabilities in LLMs.

[217] CASPER: A Large Scale Spontaneous Speech Dataset

Cihan Xiao,Ruixing Liang,Xiangyu Zhang,Mehmet Emre Tiryaki,Veronica Bae,Lavanya Shankar,Rong Yang,Ethan Poon,Emmanuel Dupoux,Sanjeev Khudanpur,Leibny Paola Garcia Perera

Main category: cs.CL

TL;DR: 提出了一种新方法收集自然对话数据，并发布了200+小时的自发语音数据集，以解决高质量自发语音数据稀缺的问题。

Details

Motivation: 现有语音数据集多为脚本对话，缺乏高质量的自发语音数据，限制了语音处理模型的发展。 Method: 设计了一种新颖的管道，用于激发和记录自然对话，确保话题多样性和真实互动。 Result: 发布了Stage 1数据集，包含200+小时的自发语音，为研究社区提供了宝贵资源。 Conclusion: 该方法为未来数据收集提供了可复现框架，并计划进一步扩展数据集。 Abstract: The success of large language models has driven interest in developing similar speech processing capabilities. However, a key challenge is the scarcity of high-quality spontaneous speech data, as most existing datasets contain scripted dialogues. To address this, we present a novel pipeline for eliciting and recording natural dialogues and release our Stage 1 dataset with 200+ hours of spontaneous speech. Our approach fosters fluid, natural conversations while encouraging a diverse range of topics and interactive exchanges. Unlike traditional methods, it facilitates genuine interactions, providing a reproducible framework for future data collection. This paper introduces our dataset and methodology, laying the groundwork for addressing the shortage of spontaneous speech data. We plan to expand this dataset in future stages, offering a growing resource for the research community.

[218] Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings

Hans W. A. Hanley,Zakir Durumeric

Main category: cs.CL

TL;DR: 提出了一种新颖、可扩展、可解释、层次化和多语言的聚类方法，用于新闻文章和社交媒体数据，通过多语言Matryoshka嵌入和高效层次聚类算法实现。

Details

Motivation: 当前方法在扩展性、相似性度量的透明度以及多语言处理上表现不佳，需要改进。 Method: 训练多语言Matryoshka嵌入模型，开发高效层次聚类算法。 Result: 嵌入模型在SemEval 2022 Task 8测试数据集上达到Pearson ρ = 0.816的先进性能。 Conclusion: 该方法能有效识别和聚类新闻数据中的故事、叙事和主题。 Abstract: Contextual large language model embeddings are increasingly utilized for topic modeling and clustering. However, current methods often scale poorly, rely on opaque similarity metrics, and struggle in multilingual settings. In this work, we present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. To do this, we first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity based on which subset of the dimensions of the embeddings is examined. This embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson $\rho$ = 0.816). Once trained, we develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes. We conclude by illustrating how our approach can identify and cluster stories, narratives, and overarching themes within real-world news datasets.

[219] Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation

Ahmed Elhady,Eneko Agirre,Mikel Artetxe

Main category: cs.CL

TL;DR: 研究发现，在持续预训练（CPT）中，加入英语数据对验证困惑度无影响，但对目标语言下游任务能力至关重要。未加入英语会导致灾难性遗忘，影响模型泛化能力。提出课程学习和权重指数移动平均（EMA）作为替代方案。

Details

Motivation: 探讨英语数据在CPT中对目标语言模型适应性的作用，揭示其重要性。 Method: 通过语言无关的基准测试评估CPT效果，分析英语数据的影响，并提出课程学习和EMA作为改进方法。 Result: 未加入英语数据会导致灾难性遗忘，影响模型泛化能力；课程学习和EMA能有效缓解这一问题。 Conclusion: 英语数据在CPT中对下游任务能力至关重要，研究为未来方法设计提供了基础。 Abstract: Continued pretraining (CPT) is a popular approach to adapt existing large language models (LLMs) to new languages. When doing so, it is common practice to include a portion of English data in the mixture, but its role has not been carefully studied to date. In this work, we show that including English does not impact validation perplexity, yet it is critical for the emergence of downstream capabilities in the target language. We introduce a language-agnostic benchmark for in-context learning (ICL), which reveals catastrophic forgetting early on CPT when English is not included. This in turn damages the ability of the model to generalize to downstream prompts in the target language as measured by perplexity, even if it does not manifest in terms of accuracy until later in training, and can be tied to a big shift in the model parameters. Based on these insights, we introduce curriculum learning and exponential moving average (EMA) of weights as effective alternatives to mitigate the need for English. All in all, our work sheds light into the dynamics by which emergent abilities arise when doing CPT for language adaptation, and can serve as a foundation to design more effective methods in the future.

[220] DLM-One: Diffusion Language Models for One-Step Sequence Generation

Tianqi Chen,Shujian Zhang,Mingyuan Zhou

Main category: cs.CL

TL;DR: DLM-One是一种基于分数蒸馏的框架，用于一步生成序列，通过连续扩散语言模型（DLMs）实现高效推理。

Details

Motivation: 研究是否可以通过一步生成替代迭代优化，以提高语言模型的采样效率。 Method: 通过将学生模型在连续词嵌入空间中的输出分数与预训练教师DLM的分数函数对齐，实现一步生成。 Result: 实验表明，DLM-One在推理时间上实现了约500倍的加速，同时在文本生成任务中保持竞争力。 Conclusion: 一步扩散为高效、高质量的语言生成提供了新方向，并推动了连续扩散模型在自然语言处理中的广泛应用。 Abstract: This paper introduces DLM-One, a score-distillation-based framework for one-step sequence generation with continuous diffusion language models (DLMs). DLM-One eliminates the need for iterative refinement by aligning the scores of a student model's outputs in the continuous token embedding space with the score function of a pretrained teacher DLM. We investigate whether DLM-One can achieve substantial gains in sampling efficiency for language modeling. Through comprehensive experiments on DiffuSeq -- a representative continuous DLM -- we show that DLM-One achieves up to ~500x speedup in inference time while maintaining competitive performance on benchmark text generation tasks used to evaluate the teacher models. We further analyze the method's empirical behavior across multiple datasets, providing initial insights into its generality and practical applicability. Our findings position one-step diffusion as a promising direction for efficient, high-quality language generation and broader adoption of continuous diffusion models operating in embedding space for natural language processing.

[221] Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs

Payal Mohapatra,Akash Pandey,Xiaoyuan Zhang,Qi Zhu

Main category: cs.CL

TL;DR: 论文提出了一种新颖的EMG适配器模块，将无声音EMG特征映射到大型语言模型（LLM）的输入空间，显著提升了无声音EMG到文本的转换性能。

Details

Motivation: 现有方法依赖有声和无声EMG信号及语音数据的配对，对无法发声的个体不实用，因此探索LLM理解无声语音的潜力。 Method: 设计了EMG适配器模块，将EMG特征映射到LLM输入空间，仅使用无声EMG数据进行训练。 Result: 在封闭词汇任务中达到0.49的平均词错误率（WER），数据量仅六分钟时性能提升近20%。 Conclusion: 这是利用表面EMG使LLM理解无声语音的重要第一步。 Abstract: Unvoiced electromyography (EMG) is an effective communication tool for individuals unable to produce vocal speech. However, most prior methods rely on paired voiced and unvoiced EMG signals, along with speech data, for EMG-to-text conversion, which is not practical for such individuals. Given the rise of large language models (LLMs) in speech recognition, we explore their potential to understand unvoiced speech. To this end, we address the challenge of learning from unvoiced EMG alone and propose a novel EMG adaptor module that maps EMG features into an LLM's input space, achieving an average word error rate (WER) of 0.49 on a closed-vocabulary unvoiced EMG-to-text task. Even with a conservative data availability of just six minutes, our approach improves performance over specialized models by nearly 20%. While LLMs have been shown to be extendable to new language modalities -- such as audio -- understanding articulatory biosignals like unvoiced EMG remains more challenging. This work takes a crucial first step toward enabling LLMs to comprehend unvoiced speech using surface EMG.

[222] Lossless Token Sequence Compression via Meta-Tokens

John Harvill,Ziwei Fan,Hao Wang,Yizhou Sun,Hao Ding,Luke Huan,Anoop Deoras

Main category: cs.CL

TL;DR: 本文提出了一种任务无关的无损压缩技术，类似LZ77，平均减少输入序列长度27%和18%，同时保留全部语义信息。

Details

Motivation: 现有方法多为有损压缩，可能丢失语义信息，而本文旨在实现无损压缩，确保语义完整性。 Method: 采用类似LZ77的无损压缩技术，减少输入序列长度，同时利用Transformer的二次注意力计算特性降低计算量。 Result: 在两个任务中，序列长度分别减少27%和18%，计算量减少47%和33%，性能接近未压缩输入。 Conclusion: 无损压缩在严格语义保留任务中表现优于有损方法，未来更大模型和计算资源可能完全消除性能差距。 Abstract: Existing work on prompt compression for Large Language Models (LLM) focuses on lossy methods that try to maximize the retention of semantic information that is relevant to downstream tasks while significantly reducing the sequence length. In this paper, we introduce a task-agnostic lossless compression technique similar to LZ77 that makes it possible to reduce the input token sequence length on average by 27\% and 18\% for the two evaluation tasks explored here. Given that we use transformer-based LLMs, this equates to 47\% and 33\% less encoding computation, respectively, due to the quadratic nature of attention. The token sequence transformation is trivial to reverse and highlights that no semantic information is lost in the process. We evaluate our proposed approach on two tasks that require strict preservation of semantics/syntax and demonstrate that existing lossy compression methods perform poorly in this setting. We find that our lossless compression technique produces only a small gap in performance compared to using the uncompressed input and posit that larger models and an expanded computing budget would likely erase the gap entirely.

[223] An evaluation of LLMs for generating movie reviews: GPT-4o, Gemini-2.0 and DeepSeek-V3

Brendan Sands,Yining Wang,Chenhao Xu,Yuxuan Zhou,Lai Wei,Rohitash Chandra

Main category: cs.CL

TL;DR: 研究提出了一种利用三种大语言模型（GPT-4o、DeepSeek-V3和Gemini-2.0）生成电影评论的框架，并通过与IMDb用户评论对比评估其性能。结果显示LLM能生成语法流畅的评论，但在情感丰富度和风格一致性上仍有差距。

Details

Motivation: 探索大语言模型在电影评论生成任务中的适用性，并评估其生成质量。 Method: 使用电影字幕和剧本作为输入，通过三种LLM生成评论，并从词汇、情感极性、相似性和主题一致性等方面与IMDb用户评论对比。 Result: LLM能生成语法流畅的评论，但情感丰富度和风格一致性不及IMDb评论。DeepSeek-V3表现最平衡，GPT-4o偏向积极情感，Gemini-2.0偏向负面情感但情感强度过高。 Conclusion: LLM在电影评论生成任务中表现良好，但需进一步优化以提升情感丰富度和风格一致性。 Abstract: Large language models (LLMs) have been prominent in various tasks, including text generation and summarisation. The applicability of LLMs to the generation of product reviews is gaining momentum, paving the way for the generation of movie reviews. In this study, we propose a framework that generates movie reviews using three LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0), and evaluate their performance by comparing the generated outputs with IMDb user reviews. We use movie subtitles and screenplays as input to the LLMs and investigate how they affect the quality of reviews generated. We review the LLM-based movie reviews in terms of vocabulary, sentiment polarity, similarity, and thematic consistency in comparison to IMDB user reviews. The results demonstrate that LLMs are capable of generating syntactically fluent and structurally complete movie reviews. Nevertheless, there is still a noticeable gap in emotional richness and stylistic coherence between LLM-generated and IMDb reviews, suggesting that further refinement is needed to improve the overall quality of movie review generation. We provided a survey-based analysis where participants were told to distinguish between LLM and IMDb user reviews. The results show that LLM-generated reviews are difficult to distinguish from IMDB user reviews. We found that DeepSeek-V3 produced the most balanced reviews, closely matching IMDb reviews. GPT-4o overemphasised positive emotions, while Gemini-2.0 captured negative emotions better but showed excessive emotional intensity.

[224] SkillVerse : Assessing and Enhancing LLMs with Tree Evaluation

Yufei Tian,Jiao Sun,Nanyun Peng,Zizhao Zhang

Main category: cs.CL

TL;DR: SkillVerse是一种无监督的树状诊断框架，用于评估语言模型在特定技能上的能力，并通过树搜索算法和预测模型弱点展示了其有效性。

Details

Motivation: 随着语言模型处理复杂任务的能力增强，需要更精细的评估方法来指导模型开发。 Method: SkillVerse利用LLM作为评判者，对模型响应进行批评并组织成层次结构（树状图），以灵活分析模型能力。 Result: SkillVerse在两项任务中表现优异：1）通过树搜索算法提升模型上下文学习能力25%；2）预测模型弱点的成功率提高22%。 Conclusion: SkillVerse提供了一种灵活且有效的方法，用于深入理解语言模型的能力并指导其改进。 Abstract: As language models evolve to tackle complex, multifaceted tasks, their evaluation must adapt to capture this intricacy. A granular, skill-specific understanding of model capabilities can empower researchers to make informed model development plans. In this paper, we introduce SkillVerse, an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. With LLM as a judge, SkillVerse first critiques the model responses, and then organizes them into a hierarchical structure termed dendrogram. Given proficiency at arbitrary levels of granularity, SkillVerse is flexible to produce insights of behaviors of modern large models. We also demonstrate its efficacy in two downstream tasks: 1) improving model in-context learning by 25% using a tree-search algorithm to select more informative few-shot demonstrations, and 2) accurately predicting new model weaknesses with a 55% success rate, 22% higher than without SkillVerse.

[225] TreeRare: Syntax Tree-Guided Retrieval and Reasoning for Knowledge-Intensive Question Answering

Boyi Zhang,Zhuo Liu,Hangfeng He

Main category: cs.CL

TL;DR: TreeRare提出了一种基于语法树的信息检索和推理框架，用于解决复杂、知识密集型问题，显著优于现有方法。

Details

Motivation: 复杂问题需要多源信息推理，但现有检索框架因推理错误和检索结果不匹配而受限。 Method: 利用语法树自底向上遍历，生成子查询并检索相关段落，合成上下文感知证据，最终聚合答案。 Result: 在五个问答数据集上，TreeRare显著优于现有方法。 Conclusion: TreeRare通过语法树引导的检索和推理，有效解决了复杂问题中的不确定性，提升了性能。 Abstract: In real practice, questions are typically complex and knowledge-intensive, requiring Large Language Models (LLMs) to recognize the multifaceted nature of the question and reason across multiple information sources. Iterative and adaptive retrieval, where LLMs decide when and what to retrieve based on their reasoning, has been shown to be a promising approach to resolve complex, knowledge-intensive questions. However, the performance of such retrieval frameworks is limited by the accumulation of reasoning errors and misaligned retrieval results. To overcome these limitations, we propose TreeRare (Syntax Tree-Guided Retrieval and Reasoning), a framework that utilizes syntax trees to guide information retrieval and reasoning for question answering. Following the principle of compositionality, TreeRare traverses the syntax tree in a bottom-up fashion, and in each node, it generates subcomponent-based queries and retrieves relevant passages to resolve localized uncertainty. A subcomponent question answering module then synthesizes these passages into concise, context-aware evidence. Finally, TreeRare aggregates the evidence across the tree to form a final answer. Experiments across five question answering datasets involving ambiguous or multi-hop reasoning demonstrate that TreeRare achieves substantial improvements over existing state-of-the-art methods.

[226] Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus

Svetlana Churina,Akshat Gupta,Insyirah Mujtahid,Kokil Jaidka

Main category: cs.CL

TL;DR: 本文介绍了一个首个公开的、标注的、通用的代码混合语料库，旨在支持计算语言学、社会语言学和NLP研究。

Details

Motivation: 代码混合在社交媒体等非正式交流中普遍存在，但缺乏公开的、适合建模人类对话和关系的标注语料库。 Method: 通过持续收集、验证和整合代码混合消息，构建结构化数据集（JSON格式），并提供详细元数据和语言统计。 Result: 目前已包含355,641条消息，涵盖多种代码混合模式，重点关注英语、普通话等语言。 Conclusion: Codemix语料库将成为计算语言学、社会语言学和NLP研究的基础数据集。 Abstract: Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications.

[227] Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models

Gerard Christopher Yeo,Kokil Jaidka

Main category: cs.CL

TL;DR: 论文探讨了大型语言模型（LLMs）在情感推理任务中如何利用上下文信息推断他人情绪状态，基于心理理论（ToM）框架，并指出LLMs在特定情感与情境关联方面的不足。

Details

Motivation: 情感识别任务通常依赖显性线索，但文本中可能存在隐性上下文线索，需要高阶推理能力。研究旨在探索LLMs如何利用这些线索进行情感推理。 Method: 基于认知评估理论，构建了一个专门的ToM评估数据集，用于测试LLMs的前向推理（从上下文到情感）和后向推理（从情感到上下文）能力。 Result: LLMs具备一定推理能力，但在将情境结果和评估与特定情感关联方面表现不佳。 Conclusion: 研究强调了在LLMs的情感推理训练和评估中融入心理学理论的必要性。 Abstract: Datasets used for emotion recognition tasks typically contain overt cues that can be used in predicting the emotions expressed in a text. However, one challenge is that texts sometimes contain covert contextual cues that are rich in affective semantics, which warrant higher-order reasoning abilities to infer emotional states, not simply the emotions conveyed. This study advances beyond surface-level perceptual features to investigate how large language models (LLMs) reason about others' emotional states using contextual information, within a Theory-of-Mind (ToM) framework. Grounded in Cognitive Appraisal Theory, we curate a specialized ToM evaluation dataset1 to assess both forward reasoning - from context to emotion- and backward reasoning - from emotion to inferred context. We showed that LLMs can reason to a certain extent, although they are poor at associating situational outcomes and appraisals with specific emotions. Our work highlights the need for psychological theories in the training and evaluation of LLMs in the context of emotion reasoning.

[228] OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning

Yifan Peng,Shakeel Muhammad,Yui Sudo,William Chen,Jinchuan Tian,Chyi-Jiunn Lin,Shinji Watanabe

Main category: cs.CL

TL;DR: OWSM项目通过整合YODAS数据集，解决了训练数据不足的问题，并开发了数据清洗流程，显著提升了模型性能。

Details

Motivation: 解决OWSM项目训练数据不足的问题，并处理YODAS数据集的语言标签和音频-文本对齐问题。 Method: 开发可扩展的数据清洗流程，整合YODAS数据集和现有OWSM数据，训练新的OWSM v4模型。 Result: 新模型在多项多语言基准测试中显著优于前代，甚至在某些场景下超越工业前沿模型。 Conclusion: 通过数据清洗和整合，OWSM v4模型性能显著提升，相关资源将公开分享。 Abstract: The Open Whisper-style Speech Models (OWSM) project has developed a series of fully open speech foundation models using academic-scale resources, but their training data remains insufficient. This work enhances OWSM by integrating YODAS, a large-scale web-crawled dataset with a Creative Commons license. However, incorporating YODAS is nontrivial due to its wild nature, which introduces challenges such as incorrect language labels and audio-text misalignments. To address this, we develop a scalable data-cleaning pipeline using public toolkits, yielding a dataset with 166,000 hours of speech across 75 languages. Our new series of OWSM v4 models, trained on this curated dataset alongside existing OWSM data, significantly outperform previous versions on multilingual benchmarks. Our models even match or surpass frontier industrial models like Whisper and MMS in multiple scenarios. We will publicly release the cleaned YODAS data, pre-trained models, and all associated scripts via the ESPnet toolkit.

[229] Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLMs

Sungjae Lee,Hoyoung Kim,Jeongyeon Hwang,Eunhyeok Park,Jungseul Ok

Main category: cs.CL

TL;DR: 论文提出了一种轻量级且上下文敏感的潜在语义聚类（LSC）方法，利用生成LLM的内部隐藏状态进行聚类，显著提升了测试时计算效率。

Details

Motivation: 现有方法依赖外部模型进行语义聚类，导致计算开销大且难以捕捉上下文语义。 Method: 提出LSC方法，利用生成LLM的内部隐藏状态进行聚类，避免使用外部模型。 Result: 实验表明，LSC在多种LLM和数据集上显著提高了计算效率，且性能优于现有方法。 Conclusion: LSC是一种高效且性能优越的语义聚类方法，适用于测试时计算扩展。 Abstract: Scaling test-time computation--generating and analyzing multiple or sequential outputs for a single input--has become a promising strategy for improving the reliability and quality of large language models (LLMs), as evidenced by advances in uncertainty quantification and multi-step reasoning. A key shared component is semantic clustering, which groups outputs that differ in form but convey the same meaning. Semantic clustering enables estimation of the distribution over the semantics of outputs and helps avoid redundant exploration of reasoning paths. However, existing approaches typically rely on external models, which introduce substantial computational overhead and often fail to capture context-aware semantics. We propose Latent Semantic Clustering (LSC), a lightweight and context-sensitive method that leverages the generator LLM's internal hidden states for clustering, eliminating the need for external models. Our extensive experiment across various LLMs and datasets shows that LSC significantly improves the computational efficiency of test-time scaling while maintaining or exceeding the performance of existing methods.

[230] Neuro2Semantic: A Transfer Learning Framework for Semantic Reconstruction of Continuous Language from Human Intracranial EEG

Siavash Shams,Richard Antonello,Gavin Mischler,Stephan Bickel,Ashesh Mehta,Nima Mesgarani

Main category: cs.CL

TL;DR: Neuro2Semantic是一种新框架，通过iEEG记录重建感知语音的语义内容，采用LSTM适配器和校正模块实现连续自然文本生成，性能优于现有方法。

Details

Motivation: 解决神经信号解码连续语言的挑战，推动脑机接口和神经解码技术的实际应用。 Method: 分两阶段：LSTM适配器对齐神经信号与预训练文本嵌入，校正模块生成连续自然文本。 Result: 仅需30分钟神经数据即可实现高性能，优于现有方法。 Conclusion: Neuro2Semantic在低数据环境下表现优异，具有实际应用潜力。 Abstract: Decoding continuous language from neural signals remains a significant challenge in the intersection of neuroscience and artificial intelligence. We introduce Neuro2Semantic, a novel framework that reconstructs the semantic content of perceived speech from intracranial EEG (iEEG) recordings. Our approach consists of two phases: first, an LSTM-based adapter aligns neural signals with pre-trained text embeddings; second, a corrector module generates continuous, natural text directly from these aligned embeddings. This flexible method overcomes the limitations of previous decoding approaches and enables unconstrained text generation. Neuro2Semantic achieves strong performance with as little as 30 minutes of neural data, outperforming a recent state-of-the-art method in low-data settings. These results highlight the potential for practical applications in brain-computer interfaces and neural decoding technologies.

[231] Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees' Dialogue to Facilitate Nurse Communication Training

Keyeun Lee,Seolhee Lee,Esther Hehsun Kim,Yena Ko,Jinsu Eun,Dahee Kim,Hyewon Cho,Haiyi Zhu,Robert E. Kraut,Eunyoung Suh,Eun-mee Kim,Hajin Lim

Main category: cs.CL

TL;DR: Adaptive-VP框架利用大型语言模型动态调整虚拟患者行为，提升护理沟通培训的适应性和效果。

Details

Motivation: 标准化患者模拟成本高且不灵活，现有虚拟患者系统缺乏对学员沟通技能的动态适应能力。 Method: 提出Adaptive-VP框架，结合临床场景构建和实时评估模块，动态调整虚拟患者行为。 Result: 自动化评估显示框架能反映真实沟通能力，专家认为其交互更自然真实。 Conclusion: Adaptive-VP是护理沟通培训中可扩展且有效的工具。 Abstract: Effective communication training is essential to preparing nurses for high-quality patient care. While standardized patient (SP) simulations provide valuable experiential learning, they are often costly and inflexible. Virtual patient (VP) systems offer a scalable alternative, but most fail to adapt to the varying communication skills of trainees. In particular, when trainees respond ineffectively, VPs should escalate in hostility or become uncooperative--yet this level of adaptive interaction remains largely unsupported. To address this gap, we introduce Adaptive-VP, a VP dialogue generation framework that leverages large language models (LLMs) to dynamically adapt VP behavior based on trainee input. The framework features a pipeline for constructing clinically grounded yet flexible VP scenarios and a modular system for assessing trainee communication and adjusting VP responses in real time, while ensuring learner safety. We validated Adaptive-VP by simulating challenging patient conversations. Automated evaluation using a corpus from practicing nurses showed that our communication skill evaluation mechanism reflected real-world proficiency levels. Expert nurses further confirmed that Adaptive-VP produced more natural and realistic interactions than existing approaches, demonstrating its potential as a scalable and effective tool for nursing communication training.

Ge Qu,Jinyang Li,Bowen Qin,Xiaolong Li,Nan Huo,Chenhao Ma,Reynold Cheng

Main category: cs.CL

TL;DR: SHARE是一种基于SLM的分层动作校正助手，通过三步流水线提升LLM在文本到SQL任务中的自校正能力，解决了传统方法的计算开销和错误检测问题。

Details

Motivation: 传统自校正方法依赖LLM的递归调用，计算开销大且难以有效检测和修正SQL查询错误。 Method: SHARE采用三个专用SLM的流水线，将SQL查询转换为逐步动作轨迹，并进行两阶段细化，同时提出分层自进化训练策略。 Result: 实验表明SHARE显著提升自校正能力，且在低资源训练下表现稳健。 Conclusion: SHARE为文本到SQL任务提供了一种高效、鲁棒的自校正解决方案，尤其适用于数据隐私受限场景。 Abstract: Current self-correction approaches in text-to-SQL face two critical limitations: 1) Conventional self-correction methods rely on recursive self-calls of LLMs, resulting in multiplicative computational overhead, and 2) LLMs struggle to implement effective error detection and correction for declarative SQL queries, as they fail to demonstrate the underlying reasoning path. In this work, we propose SHARE, an SLM-based Hierarchical Action corREction assistant that enables LLMs to perform more precise error localization and efficient correction. SHARE orchestrates three specialized Small Language Models (SLMs) in a sequential pipeline, where it first transforms declarative SQL queries into stepwise action trajectories that reveal underlying reasoning, followed by a two-phase granular refinement. We further propose a novel hierarchical self-evolution strategy for data-efficient training. Experimental results demonstrate that SHARE effectively enhances self-correction capabilities while proving robust across various LLMs. Furthermore, our comprehensive analysis shows that SHARE maintains strong performance even in low-resource training settings, which is particularly valuable for text-to-SQL applications with data privacy constraints.

[233] Speculative Reward Model Boosts Decision Making Ability of LLMs Cost-Effectively

Jiawei Gu,Shangsong Liang

Main category: cs.CL

TL;DR: 论文提出了一种名为Speculative Reward Model（SRM）的框架，通过外部奖励分配器和推测验证机制，在保持LLM决策效果的同时显著降低计算成本。

Details

Motivation: 现有方法在LLM决策中过于注重性能而忽略成本效益平衡，导致效率低下。 Method: 引入3E标准评估搜索策略，并提出SRM框架，结合外部奖励分配器和推测验证机制优化决策过程。 Result: 实验表明，SRM将成本降至原框架的1/10，同时保持决策效果。 Conclusion: SRM为LLM决策提供了一种高效且成本可控的解决方案。 Abstract: Effective decision-making in Large Language Models (LLMs) is essential for handling intricate tasks. However, existing approaches prioritize performance but often overlook the balance between effectiveness and computational cost. To address this, we first introduce the 3E Criteria to systematically assess the cost-effectiveness of search strategies, revealing that existing methods often trade significant efficiency for marginal performance gains. To improve LLM decision-making while maintaining efficiency, we propose the Speculative Reward Model (SRM), a plug-and-play framework that seamlessly integrates with existing search strategies. Specifically, SRM employs an external reward assigner to predict optimal actions, reducing reliance on LLMs' internal self-evaluation. And a speculative verification mechanism is used to prune suboptimal choices and guide the search toward more promising steps. We evaluate SRM on several complex decision-making tasks including mathematical reasoning, planning and numerical reasoning in specialized domains. Experimental results show that SRM reduces costs to 1/10 of the original search framework on average while maintaining effectiveness.

[234] Scaling Textual Gradients via Sampling-Based Momentum

Zixin Ding,Junyuan Hong,Jiachen T. Wang,Zinan Lin,Zhangyang Wang,Yuxin Chen

Main category: cs.CL

TL;DR: 本文提出了一种名为TSGD-M的新方法，通过基于过去批次分布重新加权提示采样，解决了TGD框架在数据扩展时性能下降和计算成本高的问题。TSGD-M在多个NLP任务中显著优于TGD基线，并降低了方差。

Details

Motivation: 随着提示在大型语言模型中的重要性增加，优化文本提示成为关键挑战。TGD框架虽然有效，但在数据扩展时性能下降且计算成本高。 Method: 受数值梯度下降启发，提出TSGD-M方法，通过重新加权提示采样实现可扩展的上下文学习。 Result: 在九个NLP任务中，TSGD-M显著优于未采用重新加权采样的TGD基线，并降低了方差。 Conclusion: TSGD-M是一种高效且可扩展的提示优化方法，适用于多种NLP任务。 Abstract: As prompts play an increasingly critical role in large language models (LLMs), optimizing textual prompts has become a crucial challenge. The Textual Gradient Descent (TGD) framework has emerged as a promising data-driven approach that iteratively refines textual prompts using LLM - suggested updates (or textual gradients) over minibatches of training samples. In this paper, we empirically demonstrate that scaling the number of training examples initially improves but later degrades TGD's performance across multiple downstream NLP tasks. However, while data scaling improves results for most tasks, it also significantly increases the computational cost when leveraging LLMs. To address this, we draw inspiration from numerical gradient descent and propose Textual Stochastic Gradient Descent with Momentum (TSGD-M) - a method that facilitates scalable in-context learning by reweighting prompt sampling based on past batch distributions. Across nine NLP tasks spanning three domains - including BIG-Bench Hard (BBH), natural language understanding tasks, and reasoning tasks - TSGD-M significantly outperforms TGD baselines that do not incorporate reweighted sampling, while also reducing variance in most tasks.

[235] Causal Structure Discovery for Error Diagnostics of Children's ASR

Vishwanath Pratap Singh,Md. Sahidullah,Tomi Kinnunen

Main category: cs.CL

TL;DR: 论文提出了一种因果结构发现方法，分析儿童语音识别中生理、认知和外部因素的相互依赖关系，并通过因果量化测量各因素的影响。

Details

Motivation: 儿童语音识别表现较差，现有方法仅孤立分析各因素，忽略了其相互依赖关系。 Method: 引入因果结构发现和因果量化方法，分析生理、认知和外部因素对ASR错误的影响，并扩展到微调模型。 Result: 实验证明该方法在Whisper和Wav2Vec2.0等ASR系统中具有普适性。 Conclusion: 因果分析方法能更全面地理解儿童ASR错误，并指导模型优化。 Abstract: Children's automatic speech recognition (ASR) often underperforms compared to that of adults due to a confluence of interdependent factors: physiological (e.g., smaller vocal tracts), cognitive (e.g., underdeveloped pronunciation), and extrinsic (e.g., vocabulary limitations, background noise). Existing analysis methods examine the impact of these factors in isolation, neglecting interdependencies-such as age affecting ASR accuracy both directly and indirectly via pronunciation skills. In this paper, we introduce a causal structure discovery to unravel these interdependent relationships among physiology, cognition, extrinsic factors, and ASR errors. Then, we employ causal quantification to measure each factor's impact on children's ASR. We extend the analysis to fine-tuned models to identify which factors are mitigated by fine-tuning and which remain largely unaffected. Experiments on Whisper and Wav2Vec2.0 demonstrate the generalizability of our findings across different ASR systems.

[236] Accelerating Diffusion LLMs via Adaptive Parallel Decoding

Daniel Israel,Guy Van den Broeck,Aditya Grover

Main category: cs.CL

TL;DR: 论文提出了一种自适应并行解码（APD）方法，通过动态调整并行生成的token数量，解决了扩散大语言模型（dLLMs）在并行生成时的速度与质量平衡问题。

Details

Motivation: 现有LLMs的自回归解码速度受限，而dLLMs虽然理论上支持并行生成，但在实践中难以在不牺牲质量的情况下实现高速。 Method: APD通过结合dLLM的边缘概率和一个辅助自回归模型的联合概率，动态调整并行采样的token数量，并优化了KV缓存和掩码输入大小。 Result: APD在保持质量的同时显著提高了吞吐量，下游基准测试中仅出现轻微质量下降。 Conclusion: APD通过灵活的参数设置，在速度和质量之间实现了有效平衡，为并行解码提供了新思路。 Abstract: The generation speed of LLMs are bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly tradeoff throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.

[237] Dual Debiasing for Noisy In-Context Learning for Text Generation

Siqi Liang,Sumyeong Ahn,Paramveer S. Dhillon,Jiayu Zhou

Main category: cs.CL

TL;DR: 论文提出了一种双重去偏框架，通过合成邻居修正困惑度估计，解决了高噪声比例下困惑度假设失效的问题，提升了样本清洁度评分和ICL性能。

Details

Motivation: 现有方法在高噪声比例下假设困惑度区分噪声样本失效，且困惑度受注释和领域知识偏见影响，需改进。 Method: 引入双重去偏框架，利用合成邻居显式修正困惑度估计，生成鲁棒的样本清洁度评分。 Result: 实验表明该方法噪声检测能力更强，ICL性能接近完全清洁语料库，且在高噪声比例下仍稳健。 Conclusion: 双重去偏框架有效解决了困惑度偏见问题，提升了噪声环境下的ICL表现。 Abstract: In context learning (ICL) relies heavily on high quality demonstrations drawn from large annotated corpora. Existing approaches detect noisy annotations by ranking local perplexities, presuming that noisy samples yield higher perplexities than their clean counterparts. However, this assumption breaks down when the noise ratio is high and many demonstrations are flawed. We reexamine the perplexity based paradigm for text generation under noisy annotations, highlighting two sources of bias in perplexity: the annotation itself and the domain specific knowledge inherent in large language models (LLMs). To overcome these biases, we introduce a dual debiasing framework that uses synthesized neighbors to explicitly correct perplexity estimates, yielding a robust Sample Cleanliness Score. This metric uncovers absolute sample cleanliness regardless of the overall corpus noise level. Extensive experiments demonstrate our method's superior noise detection capabilities and show that its final ICL performance is comparable to that of a fully clean demonstration corpus. Moreover, our approach remains robust even when noise ratios are extremely high.

[238] Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions

Jihyoung Jang,Minwook Bae,Minji Kim,Dilek Hakkani-Tur,Hyounghun Kim

Main category: cs.CL

TL;DR: 该论文提出了一种新型多模态对话模型，通过结合视觉和听觉输入，实现了更自然的动态交互。

Details

Motivation: 现有研究多关注图像任务，忽视了听觉模态，且交互多为静态。本文旨在解决这一问题，提升多模态对话的沉浸感。 Method: 引入多模态多会话多参与者对话数据集（$M^3C$），并提出基于多模态记忆检索的对话模型。 Result: 模型在复杂场景中能处理多模态输入，保持连贯对话，人类评估显示其性能优异。 Conclusion: 该模型为高级多模态对话代理提供了潜力，推动了多模态交互的发展。 Abstract: As chatbots continue to evolve toward human-like, real-world, interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, placing emphasis on the "eyes" of human perception while neglecting the "ears", namely auditory aspects. Moreover, these studies often center around static interactions that focus on discussing the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots with "eyes and ears" capable of more immersive interactions with humans. As part of this effort, we introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation ($M^3C$), and propose a novel multimodal conversation model featuring multimodal memory retrieval. Our model, trained on the $M^3C$, demonstrates the ability to seamlessly engage in long-term conversations with multiple speakers in complex, real-world-like settings, effectively processing visual and auditory inputs to understand and respond appropriately. Human evaluations highlight the model's strong performance in maintaining coherent and dynamic interactions, demonstrating its potential for advanced multimodal conversational agents.

[239] DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition

Yui Sudo,Yosuke Fukumoto,Muhammad Shakeel,Yifan Peng,Chyi-Jiunn Lin,Shinji Watanabe

Main category: cs.CL

TL;DR: DYNAC是一种基于动态词汇的非自回归上下文化方法，显著提升推理速度，同时保持高准确率。

Details

Motivation: 解决动态词汇在非自回归模型中因条件独立性假设而无法捕捉静态与动态词汇依赖关系的问题。 Method: 提出DYNAC，一种自条件CTC方法，将动态词汇集成到中间层，通过编码器捕捉依赖关系。 Result: 在LibriSpeech 960测试集上，RTF降低81%，词错误率仅增加0.1个百分点。 Conclusion: DYNAC在保持准确性的同时显著提升推理效率，适用于实际应用。 Abstract: Contextual biasing (CB) improves automatic speech recognition for rare and unseen phrases. Recent studies have introduced dynamic vocabulary, which represents context phrases as expandable tokens in autoregressive (AR) models. This method improves CB accuracy but with slow inference speed. While dynamic vocabulary can be applied to non-autoregressive (NAR) models, such as connectionist temporal classification (CTC), the conditional independence assumption fails to capture dependencies between static and dynamic tokens. This paper proposes DYNAC (Dynamic Vocabulary-based NAR Contextualization), a self-conditioned CTC method that integrates dynamic vocabulary into intermediate layers. Conditioning the encoder on dynamic vocabulary, DYNAC effectively captures dependencies between static and dynamic tokens while reducing the real-time factor (RTF). Experimental results show that DYNAC reduces RTF by 81% with a 0.1-point degradation in word error rate on the LibriSpeech 960 test-clean set.

[240] Inter-Passage Verification for Multi-evidence Multi-answer QA

Bingsen Chen,Shengjie Wang,Xi Ye,Chen Zhao

Main category: cs.CL

TL;DR: 论文提出了一种名为RI$^2$VER的多答案问答框架，通过独立阅读和跨段落验证解决现有系统在多答案QA任务中的检索和合成问题。

Details

Motivation: 现有基于检索增强生成的QA系统在多答案问答任务中表现不佳，难以检索和合成大量证据段落。 Method: RI$^2$VER框架包括检索大量段落、独立处理生成初始答案集，以及通过跨段落验证（生成验证问题、收集额外证据、跨段落合成验证）优化答案。 Result: 在QAMPARI和RoMQA数据集上，RI$^2$VER显著优于基线模型，平均F1分数提升11.17%。 Conclusion: 跨段落验证管道使RI$^2$VER在多证据合成问题上表现突出，验证了框架的有效性。 Abstract: Multi-answer question answering (QA), where questions can have many valid answers, presents a significant challenge for existing retrieval-augmented generation-based QA systems, as these systems struggle to retrieve and then synthesize a large number of evidence passages. To tackle these challenges, we propose a new multi-answer QA framework -- Retrieval-augmented Independent Reading with Inter-passage Verification (RI$^2$VER). Our framework retrieves a large set of passages and processes each passage individually to generate an initial high-recall but noisy answer set. Then we propose a new inter-passage verification pipeline that validates every candidate answer through (1) Verification Question Generation, (2) Gathering Additional Evidence, and (3) Verification with inter-passage synthesis. Evaluations on the QAMPARI and RoMQA datasets demonstrate that our framework significantly outperforms existing baselines across various model sizes, achieving an average F1 score improvement of 11.17%. Further analysis validates that our inter-passage verification pipeline enables our framework to be particularly beneficial for questions requiring multi-evidence synthesis.

[241] G2S: A General-to-Specific Learning Framework for Temporal Knowledge Graph Forecasting with Large Language Models

Long Bai,Zixuan Li,Xiaolong Jin,Jiafeng Guo,Xueqi Cheng,Tat-Seng Chua

Main category: cs.CL

TL;DR: 论文提出了一种名为G2S的学习框架，通过分离通用模式和场景信息的学习过程，提升LLMs在时间知识图谱预测任务中的泛化能力。

Details

Motivation: 现有方法在时间知识图谱预测任务中同时学习通用模式和场景信息，导致学习过程相互干扰，影响模型的泛化能力。 Method: G2S框架分为两个阶段：通用学习阶段通过屏蔽场景信息学习通用模式；特定学习阶段通过上下文学习或微调注入场景信息。 Result: 实验结果表明，G2S显著提升了LLMs的泛化能力。 Conclusion: G2S框架通过分离通用模式和场景信息的学习，有效提升了模型在时间知识图谱预测任务中的性能。 Abstract: Forecasting over Temporal Knowledge Graphs (TKGs) which predicts future facts based on historical ones has received much attention. Recent studies have introduced Large Language Models (LLMs) for this task to enhance the models' generalization abilities. However, these models perform forecasting via simultaneously learning two kinds of entangled knowledge in the TKG: (1) general patterns, i.e., invariant temporal structures shared across different scenarios; and (2) scenario information, i.e., factual knowledge engaged in specific scenario, such as entities and relations. As a result, the learning processes of these two kinds of knowledge may interfere with each other, which potentially impact the generalization abilities of the models. To enhance the generalization ability of LLMs on this task, in this paper, we propose a General-to-Specific learning framework (G2S) that disentangles the learning processes of the above two kinds of knowledge. In the general learning stage, we mask the scenario information in different TKGs and convert it into anonymous temporal structures. After training on these structures, the model is able to capture the general patterns across different TKGs. In the specific learning stage, we inject the scenario information into the structures via either in-context learning or fine-tuning modes. Experimental results show that G2S effectively improves the generalization abilities of LLMs.

[242] Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization

Suhas BN,Han-Chin Shing,Lei Xu,Mitch Strong,Jon Burnsky,Jessica Ofor,Jordan R. Mason,Susan Chen,Sundararajan Srinivasan,Chaitanya Shivade,Jack Moriarty,Joseph Paul Cohen

Main category: cs.CL

TL;DR: 论文研究了大型语言模型（LLM）在临床对话摘要中的幻觉问题，评估了检测方法并构建了两个数据集，发现通用检测器效果不佳，开发了基于事实的新方法。

Details

Motivation: LLM在临床对话摘要中的幻觉对患者护理和临床决策构成风险，但该领域研究不足，通用检测器适用性存疑。 Method: 构建了两个数据集（事实控制的Leave-N-out数据集和自然幻觉数据集），评估了检测方法，并开发了基于事实的检测方法。 Result: 通用检测器在临床幻觉检测中表现不佳，基于事实的方法能有效检测真实临床幻觉。 Conclusion: 研究提供了专业指标和数据集，推动了临床摘要系统的可信度提升。 Abstract: Hallucinations in large language models (LLMs) during summarization of patient-clinician dialogues pose significant risks to patient care and clinical decision-making. However, the phenomenon remains understudied in the clinical domain, with uncertainty surrounding the applicability of general-domain hallucination detectors. The rarity and randomness of hallucinations further complicate their investigation. In this paper, we conduct an evaluation of hallucination detection methods in the medical domain, and construct two datasets for the purpose: A fact-controlled Leave-N-out dataset -- generated by systematically removing facts from source dialogues to induce hallucinated content in summaries; and a natural hallucination dataset -- arising organically during LLM-based medical summarization. We show that general-domain detectors struggle to detect clinical hallucinations, and that performance on fact-controlled hallucinations does not reliably predict effectiveness on natural hallucinations. We then develop fact-based approaches that count hallucinations, offering explainability not available with existing methods. Notably, our LLM-based detectors, which we developed using fact-controlled hallucinations, generalize well to detecting real-world clinical hallucinations. This research contributes a suite of specialized metrics supported by expert-annotated datasets to advance faithful clinical summarization systems.

[243] Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

Shaoxiong Ji,Zihao Li,Jaakko Paavola,Indraneil Paul,Hengyu Luo,Jörg Tiedemann

Main category: cs.CL

TL;DR: 研究了在多语言持续预训练中加入平行数据的影响，构建了MaLA双语翻译语料库，开发了EMMA-500 Llama 3模型套件，发现双语数据能提升低资源语言的性能。

Details

Motivation: 探讨在500种语言的大规模多语言持续预训练中，双语翻译数据对模型性能的影响。 Method: 构建MaLA双语翻译语料库（2500+语言对），开发EMMA-500 Llama 3模型套件，并进行持续预训练实验。 Result: 双语数据显著提升语言迁移和性能，尤其对低资源语言效果明显。 Conclusion: 双语数据对多语言模型持续预训练具有积极影响，特别是在低资源语言场景下。 Abstract: This paper investigates a critical design decision in the practice of massively multilingual continual pre-training -- the inclusion of parallel data. Specifically, we study the impact of bilingual translation data for massively multilingual language adaptation of the Llama3 family of models to 500 languages. To this end, we construct the MaLA bilingual translation corpus, containing data from more than 2,500 language pairs. Subsequently, we develop the EMMA-500 Llama 3 suite of four massively multilingual models -- continually pre-trained from the Llama 3 family of base models extensively on diverse data mixes up to 671B tokens -- and explore the effect of continual pre-training with or without bilingual translation data. Comprehensive evaluation across 7 tasks and 12 benchmarks demonstrates that bilingual data tends to enhance language transfer and performance, particularly for low-resource languages. We open-source the MaLA corpus, EMMA-500 Llama 3 suite artefacts, code, and model generations.

[244] EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models

Zekun Wang,Minghua Ma,Zexin Wang,Rongchuan Mu,Liping Shan,Ming Liu,Bing Qin

Main category: cs.CL

TL;DR: 本文系统评估了大型视觉语言模型（LVLM）的主流加速技术，提出了EffiVLM-Bench框架，并开源代码以促进未来研究。

Details

Motivation: 尽管LVLM取得了显著成功，但其高计算需求限制了实际部署，现有方法缺乏全面的评估。 Method: 将加速技术分为令牌和参数压缩两类，并引入EffiVLM-Bench框架进行多维度评估。 Result: 通过大量实验和深入分析，提供了加速LVLM的最佳策略。 Conclusion: 开源EffiVLM-Bench代码，为未来研究提供支持。 Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success, yet their significant computational demands hinder practical deployment. While efforts to improve LVLM efficiency are growing, existing methods lack comprehensive evaluation across diverse backbones, benchmarks, and metrics. In this work, we systematically evaluate mainstream acceleration techniques for LVLMs, categorized into token and parameter compression. We introduce EffiVLM-Bench, a unified framework for assessing not only absolute performance but also generalization and loyalty, while exploring Pareto-optimal trade-offs. Our extensive experiments and in-depth analyses offer insights into optimal strategies for accelerating LVLMs. We open-source code and recipes for EffiVLM-Bench to foster future research.

[245] PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings

Junseo Kim,Jongwook Han,Dongmin Choi,Jongwook Yoon,Eun-Ju Lee,Yohan Jo

Main category: cs.CL

TL;DR: 论文介绍了个性化视觉说服（PVP）数据集，包含28,454张说服性图像及其评分，结合了2,521名标注者的人口统计和心理特征，用于推动个性化视觉说服技术的发展。

Details

Motivation: 视觉说服在广告和政治传播等领域至关重要，但缺乏连接图像说服力与个人信息的综合数据集，阻碍了AI技术的发展。 Method: 发布PVP数据集，包含图像、评分及标注者的心理特征，并开发了说服性图像生成器和自动评估器。 Result: 实验表明，结合心理特征能提升说服性图像的生成和评估效果。 Conclusion: PVP数据集为个性化视觉说服提供了重要资源，心理特征的引入有助于技术进步。 Abstract: Visual persuasion, which uses visual elements to influence cognition and behaviors, is crucial in fields such as advertising and political communication. With recent advancements in artificial intelligence, there is growing potential to develop persuasive systems that automatically generate persuasive images tailored to individuals. However, a significant bottleneck in this area is the lack of comprehensive datasets that connect the persuasiveness of images with the personal information about those who evaluated the images. To address this gap and facilitate technological advancements in personalized visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset, comprising 28,454 persuasive images across 596 messages and 9 persuasion strategies. Importantly, the PVP dataset provides persuasiveness scores of images evaluated by 2,521 human annotators, along with their demographic and psychological characteristics (personality traits and values). We demonstrate the utility of our dataset by developing a persuasive image generator and an automated evaluator, and establish benchmark baselines. Our experiments reveal that incorporating psychological characteristics enhances the generation and evaluation of persuasive images, providing valuable insights for personalized visual persuasion.

[246] Auto-Patching: Enhancing Multi-Hop Reasoning in Language Models

Aviv Jan,Dean Tahory,Omer Talmi,Omar Abo Mokh

Main category: cs.CL

TL;DR: Auto-Patch通过动态修改隐藏状态提升LLMs的多跳推理能力，在MuSiQue数据集上表现优于基线。

Details

Motivation: 解决大语言模型在多跳推理中难以链接信息的问题。 Method: 基于PatchScopes框架，利用学习到的分类器选择性修改内部表示。 Result: 在MuSiQue数据集上，解决率从18.45%提升至23.63±0.7%。 Conclusion: 动态隐藏状态干预有望提升LLMs的复杂推理能力。 Abstract: Multi-hop questions still stump large language models (LLMs), which struggle to link information across multiple reasoning steps. We introduce Auto-Patch, a novel method that dynamically patches hidden states during inference to enhance multi-hop reasoning in LLMs. Building on the PatchScopes framework, Auto-Patch selectively modifies internal representations using a learned classifier. Evaluated on the MuSiQue dataset, Auto-Patch improves the solve rate from 18.45\% (baseline) to 23.63~$\pm$~0.7\% (3 runs), narrowing the gap to Chain-of-Thought prompting (27.44\%). Our results highlight the potential of dynamic hidden state interventions for advancing complex reasoning in LLMs.

[247] Synergizing LLMs with Global Label Propagation for Multimodal Fake News Detection

Shuguo Hu,Jun Hu,Huaiwen Zhang

Main category: cs.CL

TL;DR: GLPN-LLM模型通过标签传播技术整合LLM生成的伪标签，提升多模态假新闻检测性能。

Details

Motivation: LLM生成的伪标签单独使用时性能较差，需有效整合以提升检测效果。 Method: 提出GLPN-LLM模型，结合全局标签传播和基于掩码的机制防止标签泄漏。 Result: 在基准数据集上表现优于现有方法。 Conclusion: 结合LLM与标签传播技术可显著提升假新闻检测性能。 Abstract: Large Language Models (LLMs) can assist multimodal fake news detection by predicting pseudo labels. However, LLM-generated pseudo labels alone demonstrate poor performance compared to traditional detection methods, making their effective integration non-trivial. In this paper, we propose Global Label Propagation Network with LLM-based Pseudo Labeling (GLPN-LLM) for multimodal fake news detection, which integrates LLM capabilities via label propagation techniques. The global label propagation can utilize LLM-generated pseudo labels, enhancing prediction accuracy by propagating label information among all samples. For label propagation, a mask-based mechanism is designed to prevent label leakage during training by ensuring that training nodes do not propagate their own labels back to themselves. Experimental results on benchmark datasets show that by synergizing LLMs with label propagation, our model achieves superior performance over state-of-the-art baselines.

[248] Exploring In-context Example Generation for Machine Translation

Dohyun Lee,Seungil Chad Lee,Chanwoo Yang,Yujin Baek,Jaegul Choo

Main category: cs.CL

TL;DR: 本文提出了一种名为DAT的方法，用于在低资源语言中生成机器翻译的上下文示例，无需依赖外部资源，显著提升了翻译质量。

Details

Motivation: 现有研究假设存在人工标注的示例池，这在低资源语言中难以实现，因此需要一种不依赖外部资源的示例生成方法。 Method: 提出DAT方法，基于相关性和多样性标准生成示例对，无需外部资源。 Result: 在低资源语言中，DAT的翻译质量优于基线方法。 Conclusion: DAT是一种简单有效的方法，适用于低资源语言的机器翻译，并展示了逐步积累生成示例的潜力。 Abstract: Large language models (LLMs) have demonstrated strong performance across various tasks, leveraging their exceptional in-context learning ability with only a few examples. Accordingly, the selection of optimal in-context examples has been actively studied in the field of machine translation. However, these studies presuppose the presence of a demonstration pool with human-annotated pairs, making them less applicable to low-resource languages where such an assumption is challenging to meet. To overcome this limitation, this paper explores the research direction of in-context example generation for machine translation. Specifically, we propose Demonstration Augmentation for Translation (DAT), a simple yet effective approach that generates example pairs without relying on any external resources. This method builds upon two prior criteria, relevance and diversity, which have been highlighted in previous work as key factors for in-context example selection. Through experiments and analysis on low-resource languages where human-annotated pairs are scarce, we show that DAT achieves superior translation quality compared to the baselines. Furthermore, we investigate the potential of progressively accumulating generated pairs during test time to build and reuse a demonstration pool. Our implementation is publicly available at https://github.com/aiclaudev/DAT.

[249] Goal-Aware Identification and Rectification of Misinformation in Multi-Agent Systems

Zherui Li,Yan Mi,Zhenhong Zhou,Houcheng Jiang,Guibin Zhang,Kun Wang,Junfeng Fang

Main category: cs.CL

TL;DR: 论文提出了一种名为ARGUS的两阶段防御框架，用于解决多智能体系统中的错误信息传播问题，并通过实验验证了其有效性。

Details

Motivation: 多智能体系统在处理复杂任务时容易受到错误信息注入的攻击，需要一种有效的防御机制来保障系统稳健性。 Method: 提出ARGUS框架，基于目标感知推理，无需训练即可纠正信息流中的错误信息。 Result: 在实验中，ARGUS平均减少错误信息毒性约28.17%，并在攻击下提升任务成功率约10.33%。 Conclusion: ARGUS是一种高效且无需训练的防御框架，能显著提升多智能体系统对错误信息的抵抗能力。 Abstract: Large Language Model-based Multi-Agent Systems (MASs) have demonstrated strong advantages in addressing complex real-world tasks. However, due to the introduction of additional attack surfaces, MASs are particularly vulnerable to misinformation injection. To facilitate a deeper understanding of misinformation propagation dynamics within these systems, we introduce MisinfoTask, a novel dataset featuring complex, realistic tasks designed to evaluate MAS robustness against such threats. Building upon this, we propose ARGUS, a two-stage, training-free defense framework leveraging goal-aware reasoning for precise misinformation rectification within information flows. Our experiments demonstrate that in challenging misinformation scenarios, ARGUS exhibits significant efficacy across various injection attacks, achieving an average reduction in misinformation toxicity of approximately 28.17% and improving task success rates under attack by approximately 10.33%. Our code and dataset is available at: https://github.com/zhrli324/ARGUS.

[250] Evaluating the Evaluation of Diversity in Commonsense Generation

Tianhui Zhang,Bei Peng,Danushka Bollegala

Main category: cs.CL

TL;DR: 本文通过系统评估常识生成中的多样性指标，发现基于形式的指标高估多样性，而基于内容的指标表现更优，推荐未来研究采用后者。

Details

Motivation: 现有多样性评估指标在常识生成任务中表现不一，缺乏明确的最佳选择，需系统评估以指导未来研究。 Method: 通过LLM创建标注数据集，对现有多样性指标进行元评估，比较形式与内容指标的表现。 Result: 基于内容的多样性指标与LLM评分高度相关，优于形式指标。 Conclusion: 建议未来常识生成研究采用基于内容的多样性评估指标。 Abstract: In commonsense generation, given a set of input concepts, a model must generate a response that is not only commonsense bearing, but also capturing multiple diverse viewpoints. Numerous evaluation metrics based on form- and content-level overlap have been proposed in prior work for evaluating the diversity of a commonsense generation model. However, it remains unclear as to which metrics are best suited for evaluating the diversity in commonsense generation. To address this gap, we conduct a systematic meta-evaluation of diversity metrics for commonsense generation. We find that form-based diversity metrics tend to consistently overestimate the diversity in sentence sets, where even randomly generated sentences are assigned overly high diversity scores. We then use an Large Language Model (LLM) to create a novel dataset annotated for the diversity of sentences generated for a commonsense generation task, and use it to conduct a meta-evaluation of the existing diversity evaluation metrics. Our experimental results show that content-based diversity evaluation metrics consistently outperform the form-based counterparts, showing high correlations with the LLM-based ratings. We recommend that future work on commonsense generation should use content-based metrics for evaluating the diversity of their outputs.

[251] CausalAbstain: Enhancing Multilingual LLMs with Causal Reasoning for Trustworthy Abstention

Yuxi Sun,Aoqi Zuo,Wei Gao,Jing Ma

Main category: cs.CL

TL;DR: 论文提出了一种名为CausalAbstain的方法，通过因果视角帮助LLMs在多语言场景中更有效地决定是否使用生成的反馈，以减少幻觉。

Details

Motivation: LLMs在不同语言中存在知识差异，当前的多语言弃权策略依赖生成的反馈，但易受不准确和偏见影响。 Method: 引入CausalAbstain方法，从因果角度帮助LLMs选择有用的反馈并优化弃权决策。 Result: 实验表明，CausalAbstain在原生语言和多语言设置中均优于基线，提升了弃权决策的准确性和可解释性。 Conclusion: CausalAbstain方法在多语言场景中有效减少了LLMs的幻觉问题，并开源了代码和数据。 Abstract: Large Language Models (LLMs) often exhibit knowledge disparities across languages. Encouraging LLMs to \textit{abstain} when faced with knowledge gaps is a promising strategy to reduce hallucinations in multilingual settings. Current abstention strategies for multilingual scenarios primarily rely on generating feedback in various languages using LLMs and performing self-reflection. However, these methods can be adversely impacted by inaccuracies and biases in the generated feedback. To address this, from a causal perspective, we introduce \textit{CausalAbstain}, a method that helps LLMs determine whether to utilize multiple generated feedback responses and how to identify the most useful ones. Extensive experiments demonstrate that \textit{CausalAbstain} effectively selects helpful feedback and enhances abstention decisions with interpretability in both native language (\textsc{Casual-native}) and multilingual (\textsc{Causal-multi}) settings, outperforming strong baselines on two benchmark datasets covering encyclopedic and commonsense knowledge QA tasks. Our code and data are open-sourced at https://github.com/peachch/CausalAbstain.

[252] Retrieval-Augmented Generation Systems for Intellectual Property via Synthetic Multi-Angle Fine-tuning

Runtao Ren,Jian Ma,Jianxi Luo

Main category: cs.CL

TL;DR: MQG-RFM框架通过多角度问题生成和检索微调方法，显著提升IP领域RAG系统的检索准确性和生成质量。

Details

Motivation: 解决IP领域RAG系统因用户查询多样性（如口语化表达、拼写错误等）导致的检索不准确和响应不佳问题。 Method: 采用轻量级Data-to-Tune范式，结合提示工程化查询生成和硬负例挖掘，优化检索模型。 Result: 在台湾专利Q&A数据集上，检索准确率提升185.62%（专利咨询）和262.26%（新技术报告），生成质量分别提升14.22%和53.58%。 Conclusion: MQG-RFM通过语义感知检索优化，为中小型机构提供实用、可扩展的专利情报解决方案，已被ScholarMate采用。 Abstract: Retrieval-Augmented Generation (RAG) systems in the Intellectual Property (IP) field often struggle with diverse user queries, including colloquial expressions, spelling errors, and ambiguous terminology, leading to inaccurate retrieval and suboptimal responses. To address this challenge, we propose Multi-Angle Question Generation and Retrieval Fine-Tuning Method (MQG-RFM), a novel framework that leverages large language models (LLMs) to simulate varied user inquiries and fine-tunes retrieval models to align semantically equivalent but linguistically diverse questions. Unlike complex architectural modifications, MQG-RFM adopts a lightweight Data-to-Tune paradigm, combining prompt-engineered query generation with hard negative mining to enhance retrieval robustness without costly infrastructure changes. Experimental results on a Taiwan patent Q&A dataset show 185.62% improvement in retrieval accuracy on the Patent Consultation dataset and 262.26% improvement on the Novel Patent Technology Report dataset, with 14.22% and 53.58% improvements in generation quality over the baselines, respectively. By bridging the gap between user intent and system comprehension through semantic-aware retrieval optimization, MQG-RFM offers a practical, scalable approach for rapid, cost-effective deployment among small and medium-sized agencies seeking reliable patent intelligence solutions. Additionally, our proposed method has already been adopted by ScholarMate, the largest professional research social networking platform in China, to support real-world development and deployment. A demo version of the instantiated is available at https://github.com/renruntao/patent_rag.

[253] Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing

Changyue Wang,Weihang Su,Qingyao Ai,Yujia Zhou,Yiqun Liu

Main category: cs.CL

TL;DR: DecKER是一种新的知识编辑框架，通过解耦推理和知识编辑，显著提升了多跳问答任务的性能。

Details

Motivation: 现有知识编辑方法未明确分离新注入知识与模型原有推理过程，导致知识冲突和推理路径不一致。 Method: 提出DecKER框架，通过生成掩码推理路径并结合混合检索和模型验证来解决知识编辑问题。 Result: 在多跳问答基准测试中，DecKER显著优于现有方法，减少了知识冲突并保持了推理一致性。 Conclusion: DecKER为知识编辑提供了一种高效且轻量级的解决方案，尤其在多跳任务中表现优异。 Abstract: Knowledge editing aims to efficiently update Large Language Models (LLMs) by modifying specific knowledge without retraining the entire model. Among knowledge editing approaches, in-context editing (ICE) offers a lightweight solution by injecting new knowledge directly into the input context, leaving model parameters unchanged. However, existing ICE approaches do not explicitly separate the newly injected knowledge from the model's original reasoning process. This entanglement often results in conflicts between external updates and internal parametric knowledge, undermining the consistency and accuracy of the reasoning path.In this work, we conduct preliminary experiments to examine how parametric knowledge influences reasoning path planning. We find that the model's reasoning is tightly coupled with its internal knowledge, and that naively injecting new information without adapting the reasoning path often leads to performance degradation, particularly in multi-hop tasks. To this end, we propose DecKER, a novel ICE framework that decouples reasoning from knowledge editing by generating a masked reasoning path and then resolving knowledge edits via hybrid retrieval and model-based validation. Experiments on multi-hop QA benchmarks show that DecKER significantly outperforms existing ICE methods by mitigating knowledge conflicts and preserving reasoning consistency. Our code is available at: https://github.com/bebr2/DecKER .

[254] ARIA: Training Language Agents with Intention-Driven Reward Aggregation

Ruihan Yang,Yikai Zhang,Aili Chen,Xintao Wang,Siyu Yuan,Jiangjie Chen,Deqing Yang,Yanghua Xiao

Main category: cs.CL

TL;DR: ARIA通过将自然语言动作从高维令牌分布空间映射到低维意图空间，聚合奖励以减少方差，提升语言代理训练效率。

Details

Motivation: 开放语言动作环境中的动作空间巨大，导致奖励稀疏和方差大，阻碍强化学习效果。 Method: 提出ARIA方法，将动作投影到意图空间，聚类语义相似动作并共享奖励。 Result: ARIA显著降低策略梯度方差，在四项任务中平均性能提升9.95%。 Conclusion: ARIA通过意图空间奖励聚合，有效解决了语言代理训练中的奖励稀疏问题。 Abstract: Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, resulting in an exponentially large action space. Sampling actions in such a space can lead to extreme reward sparsity, which brings large reward variance, hindering effective reinforcement learning (RL). To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective language Agents training. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering better policy optimization. Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains of an average of 9.95% across four downstream tasks, consistently outperforming offline and online RL baselines.

[255] Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages

Hyangsuk Min,Yuho Lee,Minjeong Ban,Jiaqi Deng,Nicole Hee-Yeon Kim,Taewon Yun,Hang Su,Jason Cai,Hwanjun Song

Main category: cs.CL

TL;DR: MSumBench是一个多维度、多领域的文本摘要评估框架，支持中英文，并提供领域特定评估标准，通过多智能体辩论系统提升标注质量。

Details

Motivation: 现有摘要评估框架缺乏领域特定标准，以英语为主，且人工标注面临推理复杂性挑战。 Method: 引入MSumBench，包含多领域评估标准和中英文支持，利用多智能体辩论系统优化标注，并评估现代摘要模型及大语言模型的评估能力。 Result: 发现不同领域和语言的摘要模型表现差异，揭示大语言模型在评估自生成摘要时的系统性偏差。 Conclusion: MSumBench为文本摘要提供了更全面的评估工具，公开数据集促进进一步研究。 Abstract: Evaluation frameworks for text summarization have evolved in terms of both domain coverage and metrics. However, existing benchmarks still lack domain-specific assessment criteria, remain predominantly English-centric, and face challenges with human annotation due to the complexity of reasoning. To address these, we introduce MSumBench, which provides a multi-dimensional, multi-domain evaluation of summarization in English and Chinese. It also incorporates specialized assessment criteria for each domain and leverages a multi-agent debate system to enhance annotation quality. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages. We further examine large language models as summary evaluators, analyzing the correlation between their evaluation and summarization capabilities, and uncovering systematic bias in their assessment of self-generated summaries. Our benchmark dataset is publicly available at https://github.com/DISL-Lab/MSumBench.

[256] AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation

Ming Wang,Peidong Wang,Lin Wu,Xiaocui Yang,Daling Wang,Shi Feng,Yuxin Chen,Bixuan Wang,Yifei Zhang

Main category: cs.CL

TL;DR: 论文提出AnnaAgent，一种动态情感和认知代理系统，用于模拟心理辅导中的求助者，解决了动态演化和多会话记忆两大挑战。

Details

Motivation: 由于成本和伦理问题，AI驱动的心理健康研究中难以使用真实求助者，因此需要更真实的模拟方法。 Method: AnnaAgent结合情感调节器和抱怨引发器，并采用三级记忆机制，动态控制模拟配置。 Result: 评估显示，AnnaAgent在心理辅导中比现有基线更真实地模拟求助者。 Conclusion: AnnaAgent为心理健康研究提供了更真实的模拟工具，代码已通过伦理审查并开源。 Abstract: Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers' mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose AnnaAgent, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator's configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found on https://github.com/sci-m-wang/AnnaAgent.

[257] The Hidden Language of Harm: Examining the Role of Emojis in Harmful Online Communication and Content Moderation

Yuhang Zhou,Yimin Xiao,Wei Ai,Ge Gao

Main category: cs.CL

TL;DR: 研究探讨了社交媒体中表情符号在冒犯性内容中的作用，并提出了一种基于LLM的多步审核方法，有效减少冒犯性同时保留语义。

Details

Motivation: 社交媒体中表情符号虽单独无害，但可能通过象征性关联、讽刺或上下文滥用产生冒犯性，其作用尚未充分研究。 Method: 系统分析表情符号在冒犯性推文中的分布及用户如何利用其模糊性，提出基于LLM的多步审核流程，选择性替换有害表情符号。 Result: 人类评估证实该方法能有效降低冒犯性感知且不牺牲语义，同时揭示了不同冒犯类型的异质性效果。 Conclusion: 研究为在线交流和表情符号审核提供了细致见解，展示了LLM在内容审核中的潜力。 Abstract: Social media platforms have become central to modern communication, yet they also harbor offensive content that challenges platform safety and inclusivity. While prior research has primarily focused on textual indicators of offense, the role of emojis, ubiquitous visual elements in online discourse, remains underexplored. Emojis, despite being rarely offensive in isolation, can acquire harmful meanings through symbolic associations, sarcasm, and contextual misuse. In this work, we systematically examine emoji contributions to offensive Twitter messages, analyzing their distribution across offense categories and how users exploit emoji ambiguity. To address this, we propose an LLM-powered, multi-step moderation pipeline that selectively replaces harmful emojis while preserving the tweet's semantic intent. Human evaluations confirm our approach effectively reduces perceived offensiveness without sacrificing meaning. Our analysis also reveals heterogeneous effects across offense types, offering nuanced insights for online communication and emoji moderation.

[258] Entriever: Energy-based Retriever for Knowledge-Grounded Dialog Systems

Yucheng Cai,Ke Li,Yi Huang,Junlan Feng,Zhijian Ou

Main category: cs.CL

TL;DR: 提出了一种基于能量的检索器Entriever，用于解决现有检索器中知识片段条件独立假设的问题，显著提升了知识检索和对话系统的性能。

Details

Motivation: 现有检索器通常假设知识片段条件独立，忽略了其相关性，导致知识检索效果受限。 Method: 提出Entriever，通过能量函数整体建模候选检索结果，探索不同能量函数架构和训练方法。 Result: Entriever在知识检索任务中显著优于基线模型，并在半监督对话系统中显著提升端到端性能。 Conclusion: Entriever通过整体建模知识片段相关性，有效提升了检索和对话系统的性能。 Abstract: A retriever, which retrieves relevant knowledge pieces from a knowledge base given a context, is an important component in many natural language processing (NLP) tasks. Retrievers have been introduced in knowledge-grounded dialog systems to improve knowledge acquisition. In knowledge-grounded dialog systems, when conditioning on a given context, there may be multiple relevant and correlated knowledge pieces. However, knowledge pieces are usually assumed to be conditionally independent in current retriever models. To address this issue, we propose Entriever, an energy-based retriever. Entriever directly models the candidate retrieval results as a whole instead of modeling the knowledge pieces separately, with the relevance score defined by an energy function. We explore various architectures of energy functions and different training methods for Entriever, and show that Entriever substantially outperforms the strong cross-encoder baseline in knowledge retrieval tasks. Furthermore, we show that in semi-supervised training of knowledge-grounded dialog systems, Entriever enables effective scoring of retrieved knowledge pieces and significantly improves end-to-end performance of dialog systems.

[259] PAKTON: A Multi-Agent Framework for Question Answering in Long Legal Agreements

Petros Raptopoulos,Giorgos Filandrianos,Maria Lymperaiou,Giorgos Stamou

Main category: cs.CL

TL;DR: PAKTON是一个开源的多代理框架，旨在通过协作代理工作流和检索增强生成技术，实现更易用、适应性更强且保护隐私的合同自动审查。

Details

Motivation: 合同审查复杂且耗时，通常需要法律专业知识，非专家难以参与；法律解释常具主观性，且合同保密性限制了专有模型的使用。 Method: PAKTON采用多代理框架和检索增强生成（RAG）技术，支持端到端的合同分析。 Result: 实验表明，PAKTON在预测准确性、检索性能、可解释性等方面优于通用和预训练模型。 Conclusion: PAKTON为合同审查提供了更高效、隐私保护的解决方案，尤其适合非专家使用。 Abstract: Contract review is a complex and time-intensive task that typically demands specialized legal expertise, rendering it largely inaccessible to non-experts. Moreover, legal interpretation is rarely straightforward-ambiguity is pervasive, and judgments often hinge on subjective assessments. Compounding these challenges, contracts are usually confidential, restricting their use with proprietary models and necessitating reliance on open-source alternatives. To address these challenges, we introduce PAKTON: a fully open-source, end-to-end, multi-agent framework with plug-and-play capabilities. PAKTON is designed to handle the complexities of contract analysis through collaborative agent workflows and a novel retrieval-augmented generation (RAG) component, enabling automated legal document review that is more accessible, adaptable, and privacy-preserving. Experiments demonstrate that PAKTON outperforms both general-purpose and pretrained models in predictive accuracy, retrieval performance, explainability, completeness, and grounded justifications as evaluated through a human study and validated with automated metrics.

[260] Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation

Running Yang,Wenlong Deng,Minghui Chen,Yuyin Zhou,Xiaoxiao Li

Main category: cs.CL

TL;DR: 本文提出了一种基于知识图谱的数据增强框架（KGGDG），用于生成更具迷惑性的临床多选题干扰项，以更严格评估大型语言模型（LLM）的可靠性。

Details

Motivation: 临床任务（如诊断和治疗）需要强大的决策能力，因此需要更严格的评估基准来测试LLM的可靠性。 Method: 通过多步语义遍历医学知识图谱，生成临床合理且具有误导性的干扰项，增强多选题的难度。 Result: 在六个广泛使用的医学QA基准测试中，KGGDG显著降低了最先进LLM的准确率。 Conclusion: KGGDG是一种强大的工具，可用于更稳健和诊断性地评估医学LLM。 Abstract: Clinical tasks such as diagnosis and treatment require strong decision-making abilities, highlighting the importance of rigorous evaluation benchmarks to assess the reliability of large language models (LLMs). In this work, we introduce a knowledge-guided data augmentation framework that enhances the difficulty of clinical multiple-choice question (MCQ) datasets by generating distractors (i.e., incorrect choices that are similar to the correct one and may confuse existing LLMs). Using our KG-based pipeline, the generated choices are both clinically plausible and deliberately misleading. Our approach involves multi-step, semantically informed walks on a medical knowledge graph to identify distractor paths-associations that are medically relevant but factually incorrect-which then guide the LLM in crafting more deceptive distractors. We apply the designed knowledge graph guided distractor generation (KGGDG) pipline, to six widely used medical QA benchmarks and show that it consistently reduces the accuracy of state-of-the-art LLMs. These findings establish KGGDG as a powerful tool for enabling more robust and diagnostic evaluations of medical LLMs.

[261] Improving Dialogue State Tracking through Combinatorial Search for In-Context Examples

Haesung Pyun,Yoonah Park,Yohan Jo

Main category: cs.CL

TL;DR: CombiSearch是一种新方法，通过考虑示例的组合影响来优化对话状态跟踪（DST）中的检索器训练，显著提升了性能和数据效率。

Details

Motivation: 现有检索器训练方法存在三个主要问题：未考虑示例的协同效应、未充分利用查询的语言特征、评分未直接优化DST性能。 Method: 提出CombiSearch，基于示例对DST性能的组合影响进行评分，优化检索器的训练数据。 Result: 在MultiWOZ和SGD数据集上，CombiSearch显著优于现有方法，数据效率提升20倍，DST性能上限提高12%。 Conclusion: CombiSearch证明了现有检索器训练数据的不足，并为实际DST性能提供了更大的提升空间。 Abstract: In dialogue state tracking (DST), in-context learning comprises a retriever that selects labeled dialogues as in-context examples and a DST model that uses these examples to infer the dialogue state of the query dialogue. Existing methods for constructing training data for retrievers suffer from three key limitations: (1) the synergistic effect of examples is not considered, (2) the linguistic characteristics of the query are not sufficiently factored in, and (3) scoring is not directly optimized for DST performance. Consequently, the retriever can fail to retrieve examples that would substantially improve DST performance. To address these issues, we present CombiSearch, a method that scores effective in-context examples based on their combinatorial impact on DST performance. Our evaluation on MultiWOZ shows that retrievers trained with CombiSearch surpass state-of-the-art models, achieving a 20x gain in data efficiency and generalizing well to the SGD dataset. Moreover, CombiSearch attains a 12% absolute improvement in the upper bound DST performance over traditional approaches when no retrieval errors are assumed. This significantly increases the headroom for practical DST performance while demonstrating that existing methods rely on suboptimal data for retriever training.

[262] LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented Speech

Niyati Bafna,Matthew Wiesner

Main category: cs.CL

TL;DR: 研究发现LID模型在带口音语音上表现显著下降，但具体原因和错误特征未充分探讨。通过分析，提出了增强模型对口音鲁棒性的方法，并展示了如何整合序列级信息以提升性能。

Details

Motivation: 探索LID模型在带口音语音上表现下降的具体原因及错误模式，以提升模型鲁棒性。 Method: 识别常见错误模式，分析模型对短语音段的排列不变性，提出输入分块和序列级信息整合方法。 Result: 发现模型易将L2口音语音误分类为母语或相关语言，通过改进方法显著提升带口音语音的性能。 Conclusion: 输入分块和序列级信息整合能有效减少口音-语言混淆，提升LID模型在带口音语音上的表现。 Abstract: Prior research indicates that LID model performance significantly declines on accented speech; however, the specific causes, extent, and characterization of these errors remain under-explored. (i) We identify a common failure mode on accented speech whereby LID systems often misclassify L2 accented speech as the speaker's native language or a related language. (ii) We present evidence suggesting that state-of-the-art models are invariant to permutations of short spans of speech, implying they classify on the basis of short phonotactic features indicative of accent rather than language. Our analysis reveals a simple method to enhance model robustness to accents through input chunking. (iii) We present an approach that integrates sequence-level information into our model without relying on monolingual ASR systems; this reduces accent-language confusion and significantly enhances performance on accented speech while maintaining comparable results on standard LID.

Adam Visokay,Ruth Bagley,Ian Kennedy,Chris Hess,Kyle Crowder,Rob Voigt,Denis Peskoff

Main category: cs.CL

TL;DR: 通过分析芝加哥Craigslist租房广告（2018-2024），研究揭示了语言如何塑造城市空间的社会建构，发现社区边界与广告描述的不匹配现象。

Details

Motivation: 探索租房广告如何通过语言反映城市空间的社会建构，并揭示传统方法可能忽略的空间定义争议。 Method: 结合人工和大语言模型标注Craigslist广告，进行地理空间分析和主题建模。 Result: 发现三类空间模式：冲突的社区定义、边界属性及声誉洗白行为；主题建模显示位置与广告内容的相关性。 Conclusion: 自然语言处理技术能有效揭示传统方法忽视的城市空间定义争议。 Abstract: Rental listings offer a unique window into how urban space is socially constructed through language. We analyze Chicago Craigslist rental advertisements from 2018 to 2024 to examine how listing agents characterize neighborhoods, identifying mismatches between institutional boundaries and neighborhood claims. Through manual and large language model annotation, we classify unstructured listings from Craigslist according to their neighborhood. Geospatial analysis reveals three distinct patterns: properties with conflicting neighborhood designations due to competing spatial definitions, border properties with valid claims to adjacent neighborhoods, and ``reputation laundering" where listings claim association with distant, desirable neighborhoods. Through topic modeling, we identify patterns that correlate with spatial positioning: listings further from neighborhood centers emphasize different amenities than centrally-located units. Our findings demonstrate that natural language processing techniques can reveal how definitions of urban spaces are contested in ways that traditional methods overlook.

[264] ViToSA: Audio-Based Toxic Spans Detection on Vietnamese Speech Utterances

Huy Ba Do,Vy Le-Phuong Huynh,Luan Thanh Nguyen

Main category: cs.CL

TL;DR: 论文提出了ViToSA数据集，用于越南语语音中的毒性内容检测，结合ASR和毒性片段检测方法，显著降低了毒性语音的转录错误率，并提升了检测效果。

Details

Motivation: 在线平台上的毒性语音问题日益严重，但针对低资源语言（如越南语）的音频毒性检测研究不足，因此需要构建相关数据集和方法。 Method: 提出ViToSA数据集（11,000个音频样本），并设计结合ASR和毒性片段检测的流程，进行细粒度毒性内容识别。 Result: 实验表明，在ViToSA上微调ASR模型显著降低了毒性语音的WER，同时文本毒性片段检测模型优于现有基线。 Conclusion: ViToSA为越南语音频毒性检测建立了新基准，为未来语音内容审核研究铺平了道路。 Abstract: Toxic speech on online platforms is a growing concern, impacting user experience and online safety. While text-based toxicity detection is well-studied, audio-based approaches remain underexplored, especially for low-resource languages like Vietnamese. This paper introduces ViToSA (Vietnamese Toxic Spans Audio), the first dataset for toxic spans detection in Vietnamese speech, comprising 11,000 audio samples (25 hours) with accurate human-annotated transcripts. We propose a pipeline that combines ASR and toxic spans detection for fine-grained identification of toxic content. Our experiments show that fine-tuning ASR models on ViToSA significantly reduces WER when transcribing toxic speech, while the text-based toxic spans detection (TSD) models outperform existing baselines. These findings establish a novel benchmark for Vietnamese audio-based toxic spans detection, paving the way for future research in speech content moderation.

[265] Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution's Characteristics

Lorenzo Jaime Yu Flores,Ori Ernst,Jackie Chi Kit Cheung

Main category: cs.CL

TL;DR: 提出了一种任务无关的置信度度量方法，用于改进文本生成模型的校准，无需微调或启发式方法。

Details

Motivation: 现有置信度度量在文本生成中校准不佳，因为存在多个有效答案，传统方法未充分考虑。 Method: 提出任务无关的置信度度量，仅依赖模型输出的概率分布。 Result: 在BART和Flan-T5模型上，改进了摘要、翻译和问答数据集的校准效果。 Conclusion: 该方法能有效提升生成模型的置信度校准，增强其实际应用价值。 Abstract: Well-calibrated model confidence scores can improve the usefulness of text generation models. For example, users can be prompted to review predictions with low confidence scores, to prevent models from returning bad or potentially dangerous predictions. However, confidence metrics are not always well calibrated in text generation. One reason is that in generation, there can be many valid answers, which previous methods do not always account for. Hence, a confident model could distribute its output probability among multiple sequences because they are all valid. We propose task-agnostic confidence metrics suited to generation, which rely solely on the probabilities associated with the model outputs without the need for further fine-tuning or heuristics. Using these, we are able to improve the calibration of BART and Flan-T5 on summarization, translation, and QA datasets.

[266] SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

Weijie Xu,Shixian Cui,Xi Fang,Chi Xue,Stephanie Eckman,Chandan Reddy

Main category: cs.CL

TL;DR: 论文介绍了SATA-BENCH，首个用于评估大语言模型在“选择所有适用项”任务上的基准，揭示了模型在此类任务中的显著不足，并提出Choice Funnel方法以提升性能。

Details

Motivation: 现实问题常需从选项中选择所有正确答案，但现有评估多关注单一答案任务，导致大语言模型在此类任务中的能力未被充分探索。 Method: 提出SATA-BENCH基准，评估27个开源和专有模型，发现其性能不足，并提出Choice Funnel解码策略，结合去偏和自适应阈值技术。 Result: 最强模型仅达41.8%准确率，Choice Funnel提升29%性能并降低64%推理成本。 Conclusion: 研究揭示了大语言模型在多答案任务中的局限性，并提供了改进框架，推动其在现实决策中的应用。 Abstract: Large language models (LLMs) are increasingly evaluated on single-answer multiple-choice tasks, yet many real-world problems require identifying all correct answers from a set of options. This capability remains underexplored. We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on Select All That Apply (SATA) questions across diverse domains, including reading comprehension, law, and biomedicine. Our evaluation of 27 open-source and proprietary models reveals a significant gap: even the strongest model achieves only 41.8% exact match, exposing LLMs' inability to reliably identify all correct answers. We find that this weakness stems from two core challenges: selection bias - models favor certain choices regardless of content, and count bias - models fail to predict the correct number of answers. To address these issues, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding to guide models toward complete and accurate selections. Choice Funnel achieves up to 29% higher exact match than competitive baselines while reducing inference cost by over 64%. Our findings expose fundamental limitations in current LLMs and introduce a new framework for diagnosing and improving multi-answer reasoning. We release SATA-BENCH and Choice Funnel to promote LLM development for robust decision-making in realistic, multi-answer applications.

[267] Clinical Annotations for Automatic Stuttering Severity Assessment

Ana Rita Valente,Rufael Marew,Hawau Olamide Toyin,Hamdan Al-Ali,Anelise Bohnen,Inma Becerra,Elsa Marta Soares,Goncalo Leal,Hanan Aldarmaki

Main category: cs.CL

TL;DR: 本文提出了一种基于临床标准的新口吃标注方案，用于增强FluencyBank数据集，并通过专家标注和多模态特征提升标注质量。

Details

Motivation: 口吃是一种复杂障碍，需要专业评估和治疗。本文旨在通过高质量标注方案提升数据集的有效性。 Method: 聘请临床专家标注数据，采用多模态特征（视听）检测和分类口吃时刻、次要行为及紧张评分，并提供基于专家共识的测试集。 Result: 实验和分析展示了任务复杂性，强调临床专业知识对模型训练和评估的重要性。 Conclusion: 高质量标注和专家共识是开发有效口吃评估模型的关键。 Abstract: Stuttering is a complex disorder that requires specialized expertise for effective assessment and treatment. This paper presents an effort to enhance the FluencyBank dataset with a new stuttering annotation scheme based on established clinical standards. To achieve high-quality annotations, we hired expert clinicians to label the data, ensuring that the resulting annotations mirror real-world clinical expertise. The annotations are multi-modal, incorporating audiovisual features for the detection and classification of stuttering moments, secondary behaviors, and tension scores. In addition to individual annotations, we additionally provide a test set with highly reliable annotations based on expert consensus for assessing individual annotators and machine learning models. Our experiments and analysis illustrate the complexity of this task that necessitates extensive clinical expertise for valid training and evaluation of stuttering assessment models.

[268] GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction

Neil De La Fuente,Oscar Sainz,Iker García-Ferrero,Eneko Agirre

Main category: cs.CL

TL;DR: GUIDEX是一种新方法，通过自动定义领域特定模式、推断指南并生成合成标记实例，提升零样本信息抽取性能，无需人工标记数据即可超越现有方法。

Details

Motivation: 传统信息抽取系统需领域专家设计模式、标注数据和训练模型，成本高；大语言模型在零样本任务中表现不佳，尤其在未知领域。 Method: 提出GUIDEX方法，自动生成领域模式、指南和合成数据，结合Llama 3.1微调。 Result: 在七个零样本命名实体识别基准中达到新SOTA，无需人工数据提升7 F1，结合人工数据提升2 F1。 Conclusion: GUIDEX显著提升模型对复杂领域模式的理解，代码和数据集已开源。 Abstract: Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state-of-the-art across seven zeroshot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without humanlabeled data, and nearly 2 F1 points higher when combined with it. Models trained on GUIDEX demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at neilus03.github.io/guidex.com

[269] Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques

Lang Xiong,Raina Gao,Alyssa Jeong,Yicheng Fu,Sean O'Brien,Vasu Sharma,Kevin Zhu

Main category: cs.CL

TL;DR: 论文提出Sarc7基准，用于分类7种讽刺类型，并通过情感提示技术提升分类和生成效果。

Details

Motivation: 讽刺因其微妙性对计算模型构成挑战，研究旨在改进讽刺的分类和生成方法。 Method: 使用MUStARD数据集标注7类讽刺，评估零样本、少样本、链式思维及情感提示技术，提出基于情感的生成方法。 Result: Gemini 2.5结合情感提示技术表现最佳（F1=0.3664），人类评估显示其生成效果优于零样本提示38.46%。 Conclusion: 情感提示技术显著提升讽刺分类和生成效果，为理解人类交流提供新工具。 Abstract: Sarcasm is a form of humor where expressions convey meanings opposite to their literal interpretations. Classifying and generating sarcasm using large language models is vital for interpreting human communication. Sarcasm poses challenges for computational models, due to its nuanced nature. We introduce Sarc7, a benchmark that classifies 7 types of sarcasm: self-deprecating, brooding, deadpan, polite, obnoxious, raging, and manic by annotating entries of the MUStARD dataset. Classification was evaluated using zero-shot, few-shot, chain-of-thought (CoT), and a novel emotion-based prompting technique. We propose an emotion-based generation method developed by identifying key components of sarcasm-incongruity, shock value, and context dependency. Our classification experiments show that Gemini 2.5, using emotion-based prompting, outperforms other setups with an F1 score of 0.3664. Human evaluators preferred our emotion-based prompting, with 38.46% more successful generations than zero-shot prompting.

[270] SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues

Martin Kuo,Jianyi Zhang,Aolin Ding,Louis DiValentin,Amin Hass,Benjamin F Morris,Isaac Jacobson,Randolph Linderman,James Kiessling,Nicolas Ramos,Bhavna Gopal,Maziyar Baran Pouyan,Changwei Liu,Hai Li,Yiran Chen

Main category: cs.CL

TL;DR: STREAM是一种新型防御机制，通过安全推理对齐技术保护大型语言模型免受多轮对话攻击，同时保持其功能。

Details

Motivation: 恶意攻击者通过多轮对话利用大型语言模型（LLMs）实现有害目标，对社会安全构成重大风险。 Method: 构建人工标注的安全推理多轮对话数据集，用于微调一个即插即用的安全推理调节器，识别多轮对话中的恶意意图并警示目标LLM。 Result: STREAM显著优于现有防御技术，将攻击成功率（ASR）降低51.2%，同时保持LLM的功能性。 Conclusion: STREAM是一种有效的防御机制，能够显著降低多轮对话攻击的风险，同时不影响LLM的核心能力。 Abstract: Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues to achieve harmful objectives, posing significant safety risks to society. To address this challenge, we propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM). STREAM defends LLMs against multi-turn attacks while preserving their functional capabilities. Our approach involves constructing a human-annotated dataset, the Safety Reasoning Multi-turn Dialogues dataset, which is used to fine-tune a plug-and-play safety reasoning moderator. This model is designed to identify malicious intent hidden within multi-turn conversations and alert the target LLM of potential risks. We evaluate STREAM across multiple LLMs against prevalent multi-turn attack strategies. Experimental results demonstrate that our method significantly outperforms existing defense techniques, reducing the Attack Success Rate (ASR) by 51.2%, all while maintaining comparable LLM capability.

[271] DeepRAG: Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA

Yuelyu Ji,Hang Zhang,Shiven Verma,Hui Ji,Chun Li,Yushui Han,Yanshan Wang

Main category: cs.CL

TL;DR: DeepRAG结合DeepSeek和RAG Gym，通过分层问题分解和检索增强生成优化，显著提升MedHopQA生物医学问答任务的性能。

Details

Motivation: 解决生物医学问答中复杂查询的挑战，提升准确性和概念理解。 Method: 结合DeepSeek的分层问题分解和RAG Gym的检索增强生成优化，利用UMLS本体提供概念级奖励信号。 Result: 在MedHopQA数据集上显著优于基线模型，包括独立的DeepSeek和RAG Gym，在精确匹配和概念级准确性上均有提升。 Conclusion: DeepRAG框架在生物医学问答任务中表现出色，为复杂查询处理提供了有效解决方案。 Abstract: We propose DeepRAG, a novel framework that integrates DeepSeek hierarchical question decomposition capabilities with RAG Gym unified retrieval-augmented generation optimization using process level supervision. Targeting the challenging MedHopQA biomedical question answering task, DeepRAG systematically decomposes complex queries into precise sub-queries and employs concept level reward signals informed by the UMLS ontology to enhance biomedical accuracy. Preliminary evaluations on the MedHopQA dataset indicate that DeepRAG significantly outperforms baseline models, including standalone DeepSeek and RAG Gym, achieving notable improvements in both Exact Match and concept level accuracy.

[272] Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments

Li Zhang,Morgan Gray,Jaromir Savelka,Kevin D. Ashley

Main category: cs.CL

TL;DR: 论文提出了一种自动化评估大语言模型（LLM）在生成法律论证任务中的表现的方法，重点关注其可靠性、避免幻觉和适当弃权的能力。

Details

Motivation: 尽管LLM在复杂法律任务中表现出潜力，但其可靠性仍存疑，需要一种可扩展的评估方法。 Method: 使用外部LLM提取生成论证中的因素，并与输入案例的真实因素对比，评估八种LLM在三种难度递增的测试中的表现。 Result: LLM在避免幻觉方面表现良好（准确率超90%），但未能充分利用相关因素，且在需要弃权时往往生成虚假论证。 Conclusion: 自动化方法揭示了LLM在法律任务中需改进的领域，尤其是因素利用和弃权能力，以确保可靠部署。 Abstract: Large Language Models (LLMs) demonstrate potential in complex legal tasks like argument generation, yet their reliability remains a concern. Building upon pilot work assessing LLM generation of 3-ply legal arguments using human evaluation, this paper introduces an automated pipeline to evaluate LLM performance on this task, specifically focusing on faithfulness (absence of hallucination), factor utilization, and appropriate abstention. We define hallucination as the generation of factors not present in the input case materials and abstention as the model's ability to refrain from generating arguments when instructed and no factual basis exists. Our automated method employs an external LLM to extract factors from generated arguments and compares them against the ground-truth factors provided in the input case triples (current case and two precedent cases). We evaluated eight distinct LLMs on three tests of increasing difficulty: 1) generating a standard 3-ply argument, 2) generating an argument with swapped precedent roles, and 3) recognizing the impossibility of argument generation due to lack of shared factors and abstaining. Our findings indicate that while current LLMs achieve high accuracy (over 90%) in avoiding hallucination on viable argument generation tests (Tests 1 & 2), they often fail to utilize the full set of relevant factors present in the cases. Critically, on the abstention test (Test 3), most models failed to follow instructions to stop, instead generating spurious arguments despite the lack of common factors. This automated pipeline provides a scalable method for assessing these crucial LLM behaviors, highlighting the need for improvements in factor utilization and robust abstention capabilities before reliable deployment in legal settings. Project page: https://github.com/lizhang-AIandLaw/Measuring-Faithfulness-and-Abstention.

[273] From Argumentative Text to Argument Knowledge Graph: A New Framework for Structured Argumentation

Debarati Bhattacharjee,Ashish Anand

Main category: cs.CL

TL;DR: 提出了一种将论证文本转化为论证知识图（AKG）的框架，通过知识库和推理规则构建图形化结构，便于理解和推理。

Details

Motivation: 现有论证数据集难以检测隐含的间接关系（如下位攻击），且理论格式不易理解，因此需要一种图形化表示方法。 Method: 从论证组件和关系的基本标注出发，构建带元数据的知识库图，利用前提和推理规则形成论证，并通过模因推理生成AKG。 Result: AKG能捕捉重要论证特征，发现隐含推理规则，并检测之前无法识别的下位攻击。 Conclusion: AKG为未来论证推理任务（如一致性检查和修订机会识别）奠定了基础，并有助于推理模型学习隐含的间接关系。 Abstract: This paper presents a framework to convert argumentative texts into argument knowledge graphs (AKG). Starting with basic annotations of argumentative components (ACs) and argumentative relations (ARs), we enrich the information by constructing a knowledge base (KB) graph with metadata attributes for nodes. Next, we use premises and inference rules from the KB to form arguments by applying modus ponens. From these arguments, we create an AKG. The nodes and edges of the AKG have attributes that capture important argumentative features. We also find missing inference rules by identifying markers. This makes it possible to identify undercut attacks that were previously undetectable in existing datasets. The AKG gives a graphical view of the argumentative structure that is easier to understand than theoretical formats. It also prepares the ground for future reasoning tasks, including checking the coherence of arguments and identifying opportunities for revision. For this, it is important to find indirect relations, many of which are implicit. Our proposed AKG format, with annotated inference rules and modus ponens, will help reasoning models learn the implicit indirect relations that require inference over arguments and the relations between them.

[274] Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

Siddhant Arora,Jinchuan Tian,Hayato Futami,Jee-weon Jung,Jiatong Shi,Yosuke Kashiwagi,Emiru Tsunoo,Shinji Watanabe

Main category: cs.CL

TL;DR: 论文提出了一种基于链式思维（CoT）的端到端（E2E）口语对话系统训练方法，解决了现有方法需要大量数据且生成语义不连贯的问题。

Details

Motivation: 现有E2E口语对话系统需要大规模训练数据且生成响应语义不连贯，限制了其实际应用。 Method: 采用链式思维（CoT）策略，将对话数据训练与多模态语言模型（LM）的预训练任务（如ASR、TTS和文本LM）紧密结合。 Result: 在公开的人类对话数据集（如Switchboard）上训练，仅需300小时数据，ROUGE-1分数提升1.5以上。 Conclusion: 该方法简单高效，显著提升了口语对话系统的性能，并公开了模型和训练代码。 Abstract: Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generates responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition~(ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets, while being compute-efficient enough to train on just 300 hours of public human-human conversation data, such as the Switchboard. We will publicly release our models and training code.

[275] Structured Gradient Guidance for Few-Shot Adaptation in Large Language Models

Hongye Zheng,Yichen Wang,Ray Pan,Guiran Liu,Binrong Zhu,Hanlu Zhang

Main category: cs.CL

TL;DR: 本文提出了一种梯度感知的微调方法，用于少样本条件下的大型语言模型，旨在提升任务适应性和训练稳定性。

Details

Motivation: 在数据有限的情况下，增强任务的适应性和训练稳定性。 Method: 基于基础损失函数，引入两个梯度相关的正则项：梯度方向一致性和梯度幅度控制，并结合梯度对齐机制提升跨任务泛化能力。 Result: 在多种自然语言理解任务中，该方法在平均准确率、梯度稳定性和方向对齐方面优于现有微调策略。 Conclusion: 基于梯度的微调框架能有效利用大型语言模型的表征能力，确保训练稳定性并减少对大量标注数据的依赖。 Abstract: This paper presents a gradient-informed fine-tuning method for large language models under few-shot conditions. The goal is to enhance task adaptability and training stability when data is limited. The method builds on a base loss function and introduces two gradient-related regularization terms. The first enforces gradient direction consistency to guide parameter updates along task-relevant directions and prevent drift. The second controls gradient magnitude to avoid abnormal updates. Together, these components support a more efficient and stable optimization path. To further improve cross-task generalization, the method incorporates a gradient alignment mechanism. This mechanism measures the consistency between optimization directions of the source and target tasks. It enhances fine-tuning performance in multi-task and cross-domain scenarios. Across various natural language understanding tasks, the method outperforms existing fine-tuning strategies in average accuracy, gradient stability, and directional alignment. Empirical evaluations under different sample sizes and domain-specific tasks confirm the method's robustness and broad applicability in low-resource environments. In particular, the method shows clear advantages in controlling parameter update paths. The results demonstrate that a gradient-based fine-tuning framework can effectively leverage the representational power of large language models. It ensures training stability while reducing dependence on large volumes of labeled data.

[276] Narrative Media Framing in Political Discourse

Yulia Otmakhova,Lea Frermann

Main category: cs.CL

TL;DR: 本文提出了一种框架，将叙事性与框架分析结合，并验证了其在气候变化和COVID-19领域的通用性。

Details

Motivation: 自动化框架分析通常忽略了叙事框架，而叙事框架是传达复杂争议性观点的有效工具。 Method: 通过标注新闻数据集，分析叙事框架成分的政治倾向，并测试LLMs预测能力。 Result: 在气候变化和COVID-19领域验证了框架的通用性，预测结果与理论一致。 Conclusion: 该框架为叙事框架分析提供了可操作的方法，并展示了跨领域的适用性。 Abstract: Narrative frames are a powerful way of conceptualizing and communicating complex, controversial ideas, however automated frame analysis to date has mostly overlooked this framing device. In this paper, we connect elements of narrativity with fundamental aspects of framing, and present a framework which formalizes and operationalizes such aspects. We annotate and release a data set of news articles in the climate change domain, analyze the dominance of narrative frame components across political leanings, and test LLMs in their ability to predict narrative frames and their components. Finally, we apply our framework in an unsupervised way to elicit components of narrative framing in a second domain, the COVID-19 crisis, where our predictions are congruent with prior theoretical work showing the generalizability of our approach.

[277] DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Chiyu Zhang,Marc-Alexandre Cote,Michael Albada,Anush Sankaran,Jack W. Stokes,Tong Wang,Amir Abdi,William Blum,Muhammad Abdul-Mageed

Main category: cs.CL

TL;DR: DefenderBench是一个开源工具包，用于评估语言模型在网络安全任务中的表现，包括入侵检测、恶意内容分析和漏洞评估。Claude-3.7-sonnet表现最佳，得分81.65。

Details

Motivation: 探索大型语言模型在网络安全领域的潜力，填补现有研究的空白。 Method: 开发DefenderBench工具包，包含多种网络安全任务环境，使用标准化框架评估多个先进语言模型。 Result: Claude-3.7-sonnet表现最佳（81.65分），开源模型Llama 3.3 70B紧随其后（71.81分）。 Conclusion: DefenderBench为网络安全领域的语言模型评估提供了实用、可扩展的工具，促进公平比较和可重复性。 Abstract: Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.

[278] Length Aware Speech Translation for Video Dubbing

Harveen Singh Chadha,Aswin Shanmugam Subramanian,Vikas Joshi,Shubham Bansal,Jian Xue,Rupeshkumar Mehta,Jinyu Li

Main category: cs.CL

TL;DR: 提出了一种基于音素的端到端长度敏感语音翻译模型（LSST）和长度感知束搜索（LABS），用于实时视频配音，显著提升了音频同步质量。

Details

Motivation: 解决视频配音中翻译音频与源音频对齐的挑战，特别是在实时、设备端场景下的高效实现。 Method: 开发了LSST模型，通过预定义标签生成不同长度的翻译，并引入LABS在单次解码中生成多种长度翻译。 Result: 在保持BLEU分数可比性的同时，显著提升了同步质量，西班牙语和韩语的MOS分别提高了0.34和0.65。 Conclusion: 该方法在实时视频配音中有效解决了音频对齐问题，同时保持了翻译质量。 Abstract: In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths short, normal, and long using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintained comparable BLEU scores compared to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving a mean opinion score (MOS) gain of 0.34 for Spanish and 0.65 for Korean, respectively.

[279] Data Swarms: Optimizable Generation of Synthetic Evaluation Data

Shangbin Feng,Yike Wang,Weijia Shi,Yulia Tsvetkov

Main category: cs.CL

TL;DR: Data Swarms算法通过粒子群优化生成合成评估数据，提升LLM评估的定量目标。Adversarial Swarms进一步通过对抗性生成与模型共同进化，增强数据与模型的鲁棒性。

Details

Motivation: 优化合成评估数据的生成，以更好地反映LLM评估的定量目标（如生成更难的测试问题）。 Method: 1. 训练初始数据生成器群；2. 定义评估目标；3. 使用粒子群优化优化生成器群；4. 扩展为对抗性生成器与模型共同进化。 Result: Data Swarms在五项评估目标上优于八种基线方法，Adversarial Swarms生成的数据和模型更具鲁棒性和泛化能力。 Conclusion: Data Swarms能有效优化多目标组合，并泛化至未见的LLM模型。 Abstract: We propose Data Swarms, an algorithm to optimize the generation of synthetic evaluation data and advance quantitative desiderata of LLM evaluation. We first train a swarm of initial data generators using existing data, and define various evaluation objectives to reflect the desired properties of evaluation (e.g., generate more difficult problems for the evaluated models) and quantitatively evaluate data generators. We then employ particle swarm optimization to optimize the swarm of data generators, where they collaboratively search through the model parameter space to find new generators that advance these objectives. We further extend it to Adversarial Swarms, where the data generator swarm generates harder data while the test taker model swarm learns from such data, co-evolving dynamically for better data and models simultaneously. Extensive experiments demonstrate that Data Swarms outperforms eight data generation baselines across five evaluation objectives, while Adversarial Swarms produce more robust learning of synthetic data and stronger generalization. Further analysis reveals that Data Swarms successfully optimizes compositions of multiple evaluation objectives and generalizes to new off-the-shelf LLMs, unseen at optimization time.

[280] Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection

Yeshwanth Venkatesha,Souvik Kundu,Priyadarshini Panda

Main category: cs.CL

TL;DR: 论文提出了一种在联邦学习框架中高效执行参数高效微调（PEFT）的方法，通过头剪枝、加权聚合机制和客户端选择策略解决资源受限和数据分布多样化的挑战。

Details

Motivation: 尽管PEFT在自然语言处理中广泛用于适应大型语言模型，但在隐私保护的分布式学习框架（如联邦学习）中应用有限，主要由于资源受限设备和客户端数据分布多样化的挑战。 Method: 采用头剪枝减少训练复杂度，提出头特定的加权聚合机制和客户端选择策略，确保全局模型从多样化客户端捕获关键更新。 Result: 在MultiNLI等数据集上测试，使用T5-small模型和LoRA方法，实现了高达90%的稀疏度，通信优势达1.8倍，训练操作减少3.9倍，精度下降控制在2%以内。 Conclusion: 该方法在联邦学习中高效实现了PEFT，显著降低了通信和计算成本，同时保持了模型性能。 Abstract: Parameter Efficient Fine-Tuning (PEFT) has become the de-facto approach in adapting Large Language Models (LLMs) for downstream tasks in Natural Language Processing. However, its adoption in privacy-preserving distributed learning frameworks, such as Federated Learning (FL), remains relatively limited. This is mainly due to challenges specific to FL, such as resource-constrained devices and diverse data distributions among clients. In this paper, we propose an efficient method to perform PEFT within the FL framework for Multi-Head Attention (MHA) based language models. We address the challenges through head pruning, a novel head-specific weighted aggregation mechanism, and a client selection strategy. Head pruning minimizes training complexity within the clients, guided by the importance score computed based on the confidence of the attention head. Weighted aggregation of heads ensures the global model captures crucial updates from diverse clients complementing our client selection strategy. We show results on the MultiNLI benchmark along with 20 Newsgroups, XL-Sum, and E2E NLG datasets. We use the MultiNLI dataset and T5-small model with LoRA as our PEFT method, attaining sparsity levels of up to 90%, resulting in a communication advantage of up to 1.8x and a reduction in training OPs of 3.9x while maintaining the accuracy drop under 2%.

[281] Translate With Care: Addressing Gender Bias, Neutrality, and Reasoning in Large Language Model Translations

Pardis Sadat Zahraei,Ali Emami

Main category: cs.CL

TL;DR: 论文研究了机器翻译中的性别偏见问题，提出了Translate-with-Care数据集，评估了多种翻译模型的性能，发现普遍存在性别刻板印象和推理错误。微调mBART-50显著改善了这些问题。

Details

Motivation: 解决性别偏见和保持逻辑一致性在机器翻译中仍具挑战性，尤其是在自然性别语言（如英语）与无性别语言（如波斯语、印尼语、芬兰语）之间的翻译。 Method: 引入Translate-with-Care数据集（3,950个挑战场景），评估包括GPT-4、mBART-50、NLLB-200和Google Translate在内的多种技术。 Result: 所有模型在无性别内容翻译中表现不佳，倾向于使用男性代词，尤其在领导力和职业成功场景中。微调mBART-50显著减少了偏见和错误。 Conclusion: 需要针对性别和语义一致性的方法，特别是在无性别语言中，以实现更公平和准确的翻译系统。 Abstract: Addressing gender bias and maintaining logical coherence in machine translation remains challenging, particularly when translating between natural gender languages, like English, and genderless languages, such as Persian, Indonesian, and Finnish. We introduce the Translate-with-Care (TWC) dataset, comprising 3,950 challenging scenarios across six low- to mid-resource languages, to assess translation systems' performance. Our analysis of diverse technologies, including GPT-4, mBART-50, NLLB-200, and Google Translate, reveals a universal struggle in translating genderless content, resulting in gender stereotyping and reasoning errors. All models preferred masculine pronouns when gender stereotypes could influence choices. Google Translate and GPT-4 showed particularly strong bias, favoring male pronouns 4-6 times more than feminine ones in leadership and professional success contexts. Fine-tuning mBART-50 on TWC substantially resolved these biases and errors, led to strong generalization, and surpassed proprietary LLMs while remaining open-source. This work emphasizes the need for targeted approaches to gender and semantic coherence in machine translation, particularly for genderless languages, contributing to more equitable and accurate translation systems.

[282] Understanding and Mitigating Cross-lingual Privacy Leakage via Language-specific and Universal Privacy Neurons

Wenshuo Dong,Qingsong Yang,Shu Yang,Lijie Hu,Meng Ding,Wanyu Lin,Tianhang Zheng,Di Wang

Main category: cs.CL

TL;DR: 论文研究了大型语言模型（LLMs）在多语言环境下的隐私泄露风险，发现即使训练数据为单一语言，模型仍可能在其他语言查询中泄露隐私。通过分析信息流，识别出隐私通用神经元和语言特定隐私神经元，并通过去激活这些神经元将跨语言隐私泄露风险降低23.3%-31.6%。

Details

Motivation: 大型语言模型在跨语言环境中可能泄露隐私信息，而现有方法仅针对单一语言（英语）场景，无法解决跨语言隐私问题。 Method: 研究跨语言隐私泄露的信息流，识别隐私通用神经元和语言特定隐私神经元，并通过去激活这些神经元降低隐私风险。 Result: 去激活隐私神经元后，跨语言隐私泄露风险降低23.3%-31.6%。 Conclusion: 跨语言隐私泄露是一个重要问题，通过识别和去激活特定神经元可以有效缓解风险。 Abstract: Large Language Models (LLMs) trained on massive data capture rich information embedded in the training data. However, this also introduces the risk of privacy leakage, particularly involving personally identifiable information (PII). Although previous studies have shown that this risk can be mitigated through methods such as privacy neurons, they all assume that both the (sensitive) training data and user queries are in English. We show that they cannot defend against the privacy leakage in cross-lingual contexts: even if the training data is exclusively in one language, these (private) models may still reveal private information when queried in another language. In this work, we first investigate the information flow of cross-lingual privacy leakage to give a better understanding. We find that LLMs process private information in the middle layers, where representations are largely shared across languages. The risk of leakage peaks when converted to a language-specific space in later layers. Based on this, we identify privacy-universal neurons and language-specific privacy neurons. Privacy-universal neurons influence privacy leakage across all languages, while language-specific privacy neurons are only related to specific languages. By deactivating these neurons, the cross-lingual privacy leakage risk is reduced by 23.3%-31.6%.

[283] Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models

Boheng Sheng,Jiacheng Yao,Meicong Zhang,Guoxiu He

Main category: cs.CL

TL;DR: 提出了一种动态分割长文本的方法，通过语义相似度自适应分块，并结合问题感知分类器选择关键块，显著提升了LLMs对长文本的理解能力。

Details

Motivation: 现有固定长度分块方法可能导致语义相关内容的分离，影响LLMs对长文本的准确理解。 Method: 计算相邻句子的语义相似度，动态分块；训练问题感知分类器选择关键块。 Result: 在单跳和多跳问答基准上表现优于基线，支持长达256k tokens的输入。 Conclusion: 动态分块和问题感知选择有效提升了LLMs处理长文本的能力。 Abstract: Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks separating semantically relevant content, leading to ambiguity and compromising accurate understanding. To overcome this limitation, we propose a straightforward approach for dynamically separating and selecting chunks of long context, facilitating a more streamlined input for LLMs. In particular, we compute semantic similarities between adjacent sentences, using lower similarities to adaptively divide long contexts into variable-length chunks. We further train a question-aware classifier to select sensitive chunks that are critical for answering specific questions. Experimental results on both single-hop and multi-hop question-answering benchmarks show that the proposed approach consistently outperforms strong baselines. Notably, it maintains robustness across a wide range of input lengths, handling sequences of up to 256k tokens. Our datasets and code are available at the following link: https://github.com/ECNU-Text-Computing/DCS

[284] Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge

Md Tahmid Rahman Laskar,Israt Jahan,Elham Dolatabadi,Chun Peng,Enamul Hoque,Jimmy Huang

Main category: cs.CL

TL;DR: 本文探讨了使用LLMs作为评估者（LLM-as-the-Judge）在生物医学关系抽取任务中的可行性，发现其性能较差（通常低于50%准确率），并提出结构化输出格式和领域适应技术以提升性能。

Details

Motivation: 传统自动评估指标在生物医学关系抽取任务中不可靠，而人工评估成本高且耗时，因此需要一种替代的评估方法。 Method: 研究通过将8个LLMs作为评估者，对5个其他LLMs生成的响应进行评估，并提出结构化输出格式和领域适应技术以改进评估性能。 Result: LLM-based评估者在生物医学关系抽取任务中表现不佳（通常低于50%准确率），但通过结构化输出格式和领域适应技术，性能平均提升了15%。 Conclusion: 结构化输出格式和领域适应技术可有效提升LLM-based评估者在生物医学关系抽取任务中的性能，为未来研究提供了实用工具和数据资源。 Abstract: Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction, even in zero-shot scenarios. However, evaluating LLMs in this task remains challenging due to their ability to generate human-like text, often producing synonyms or abbreviations of gold-standard answers, making traditional automatic evaluation metrics unreliable. On the other hand, while human evaluation is more reliable, it is costly and time-consuming, making it impractical for real-world applications. This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction. We benchmark 8 LLMs as judges to evaluate the responses generated by 5 other LLMs across 3 biomedical relation extraction datasets. Unlike other text-generation tasks, we observe that LLM-based judges perform quite poorly (usually below 50% accuracy) in the biomedical relation extraction task. Our findings reveal that it happens mainly because relations extracted by LLMs do not adhere to any standard format. To address this, we propose structured output formatting for LLM-generated responses that helps LLM-Judges to improve their performance by about 15% (on average). We also introduce a domain adaptation technique to further enhance LLM-Judge performance by effectively transferring knowledge between datasets. We release both our human-annotated and LLM-annotated judgment data (36k samples in total) for public use here: https://github.com/tahmedge/llm_judge_biomedical_re.

[285] KG-TRACES: Enhancing Large Language Models with Knowledge Graph-constrained Trajectory Reasoning and Attribution Supervision

Rong Wu,Pinlong Cai,Jianbiao Mei,Licheng Wen,Tao Hu,Xuemeng Yang,Daocheng Fu,Botian Shi

Main category: cs.CL

TL;DR: KG-TRACES框架通过显式监督推理路径和过程，提升大语言模型在复杂推理任务中的可解释性和可信度，显著优于现有方法。

Details

Motivation: 大语言模型在复杂推理任务中缺乏可解释性和可信度，限制了其应用。 Method: 提出KG-TRACES框架，联合监督模型预测符号关系路径、完整三元组推理路径，并生成基于推理路径的归因感知推理过程。 Result: 在WebQSP和CWQ任务中，Hits@1和F1分数显著提升，并在医学等专业领域展示出迁移能力。 Conclusion: KG-TRACES通过显式监督实现了更稳定、目标导向的推理过程，提升了模型的解释性和性能。 Abstract: Large language models (LLMs) have made remarkable strides in various natural language processing tasks, but their performance on complex reasoning problems remains hindered by a lack of explainability and trustworthiness. This issue, often manifesting as hallucinations or unattributable reasoning processes, limits their applicability in complex reasoning scenarios. To address this, we propose Knowledge Graph-constrained Trajectory Reasoning Attribution and Chain Explanation Supervision (KG-TRACES), a novel framework that enhances the reasoning ability of LLMs through explicit supervision over reasoning paths and processes. KG-TRACES jointly supervises the model to: (1) predict symbolic relation paths, (2) predict full triple-level reasoning paths, and (3) generate attribution-aware reasoning processes grounded in the reasoning paths. At inference phase, the model adapts to both KG-available and KG-unavailable scenarios, retrieving reasoning paths from a KG when possible or predicting plausible reasoning paths with only intrinsic knowledge when not. This design enables the model to reason in an explainable and source-attributable pattern. Through extensive experiments on complex reasoning tasks, we demonstrate that KG-TRACES significantly outperforms existing SOTA: it improves Hits@1 by 1.6% and F1 by 4.7% on WebQSP, and achieves improvements of 4.8% in Hits@1 and 2.1% in F1 on CWQ. Moreover, we show its transferability to specialized domains such as medicine. By visualizing the intermediate steps of reasoning processes, we further show that the explicit supervision introduced by KG-TRACES leads to more stable and goal-directed reasoning processes, aligning closely with correct answers. Code is available at https://github.com/Edaizi/KG-TRACES.

[286] Research Borderlands: Analysing Writing Across Research Cultures

Shaily Bhatt,Tal August,Maria Antoniak

Main category: cs.CL

TL;DR: 本文提出了一种以人为中心的方法，通过访谈跨学科研究者，发现并衡量语言文化规范和LLM的文化能力，聚焦于研究文化和写作适应任务。

Details

Motivation: 当前语言技术对文化能力的关注不足，且研究方法多依赖合成设置和不完美的文化代理。 Method: 通过访谈跨学科专家，构建研究文化的结构、风格、修辞和引用规范框架，并用计算指标操作化这些特征。 Result: 揭示了人类研究论文中的潜在文化规范，并指出LLM在文化能力上的不足及其写作同质化倾向。 Conclusion: 以人为中心的方法能有效衡量人类和LLM生成文本中的文化规范。 Abstract: Improving cultural competence of language technologies is important. However most recent works rarely engage with the communities they study, and instead rely on synthetic setups and imperfect proxies of culture. In this work, we take a human-centered approach to discover and measure language-based cultural norms, and cultural competence of LLMs. We focus on a single kind of culture, research cultures, and a single task, adapting writing across research cultures. Through a set of interviews with interdisciplinary researchers, who are experts at moving between cultures, we create a framework of structural, stylistic, rhetorical, and citational norms that vary across research cultures. We operationalise these features with a suite of computational metrics and use them for (a) surfacing latent cultural norms in human-written research papers at scale; and (b) highlighting the lack of cultural competence of LLMs, and their tendency to homogenise writing. Overall, our work illustrates the efficacy of a human-centered approach to measuring cultural norms in human-written and LLM-generated texts.

[287] RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

Yixiao Zeng,Tianyu Cao,Danqing Wang,Xinran Zhao,Zimeng Qiu,Morteza Ziyadi,Tongshuang Wu,Lei Li

Main category: cs.CL

TL;DR: RARE框架通过知识图谱驱动的合成管道（RARE-Get）生成多级问题集，测试RAG系统在动态、时间敏感语料库中的鲁棒性，发现其对扰动表现出明显脆弱性。

Details

Motivation: 现有评估很少测试RAG系统如何应对真实世界噪声、内外检索上下文冲突或快速变化的事实，因此需要更全面的鲁棒性评估。 Method: 提出RARE框架，结合知识图谱驱动的合成管道（RARE-Get）构建大规模数据集（RARE-Set），并定义检索条件鲁棒性指标（RARE-Met）。 Result: RAG系统对扰动表现出明显脆弱性，文档鲁棒性是最薄弱环节，多跳查询的鲁棒性普遍低于单跳查询。 Conclusion: RARE为RAG系统的鲁棒性评估提供了统一框架，揭示了其在动态环境中的局限性。 Abstract: Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicting between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 400 expert-level time-sensitive finance, economics, and policy documents and 48,322 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our results show that RAG systems exhibit surprising vulnerability to perturbations, with document robustness consistently being the weakest point regardless of generator size or architecture. RAG systems consistently show lower robustness on multi-hop queries than single-hop queries across all domains.

[288] Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering

Songtao Jiang,Chenyi Zhou,Yan Zhang,Yeying Jin,Zuozhu Liu

Main category: cs.CL

TL;DR: FOCUS是一种动态适应问题复杂性的方法，结合直觉与分析推理，提升多模态大语言模型在视觉问答中的性能。

Details

Motivation: 当前方法在视觉问答中过度标记所有对象，导致性能下降，缺乏对关键视觉元素的关注。 Method: FOCUS根据问题复杂性动态调整策略：简单问题使用零样本推理，复杂问题采用先概念化再观察的策略。 Result: 在四个基准测试中，FOCUS显著提升了开源和黑盒模型的性能。 Conclusion: 结合多样化认知策略和精细视觉信息是提升性能的关键。 Abstract: Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering (VQA). While current methods have advanced by incorporating visual prompts, our study uncovers critical limitations: these approaches indiscriminately annotate all detected objects for every visual question, generating excessive visual markers that degrade task performance. This issue stems primarily from a lack of focus on key visual elements, raising two important questions: Are all objects equally important, and do all questions require visual prompts? Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the vision-language reasoning capability of the MLLM. For straightforward questions, FOCUS supports efficient zero-shot reasoning. For more complex tasks, it employs the conceptualizing before observation strategy to highlight critical elements. Extensive experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs, achieving significant gains across all datasets. Ablation studies further validate the importance of combining diverse cognitive strategies with refined visual information for superior performance. Code will be released.

[289] GuessBench: Sensemaking Multimodal Creativity in the Wild

Zifeng Zhu,Shangbin Feng,Herun Wan,Ningnan Wang,Minnan Luo,Yulia Tsvetkov

Main category: cs.CL

TL;DR: GuessBench是一个评估视觉语言模型（VLMs）在模拟人类创造力方面表现的新基准，基于Minecraft游戏数据，展示了模型在创造性理解上的挑战。

Details

Motivation: 研究旨在评估VLMs在模拟人类创造力方面的能力，尤其是在嘈杂和多元化的现实场景中。 Method: 通过Minecraft游戏“Guess the Build”收集1500张图像和设计2000个问题，测试VLMs在静态和动态图像、自然语言提示等场景下的表现。 Result: 实验显示，即使是GPT-4o在34%的情况下也会出错，开源模型与API模型之间存在显著性能差距（13.87% vs. 53.93%）。微调后，视觉感知任务性能平均提升15.36%。 Conclusion: GuessBench揭示了VLMs在创造力建模中的挑战，尤其是对低频概念和文化背景不足的数据表现较差。 Abstract: We propose GuessBench, a novel benchmark that evaluates Vision Language Models (VLMs) on modeling the pervasive, noisy, and pluralistic human creativity. GuessBench sources data from "Guess the Build", an online multiplayer Minecraft minigame where one player constructs a Minecraft build given a concept (e.g. caterpillar) and others try to guess it with natural language hints, presenting a pristine testbed for sensemaking creativity in the wild with VLMs acting as guessers. We curate 1500 images from the actual gameplay and design 2000 problems spanning static and dynamic image settings, natural language hints of varying completeness, and more. Extensive experiments with six open/API VLMs and five reasoning enhancement approaches demonstrate that GuessBench presents a uniquely challenging task in creativity modeling: even the start-of-the-art GPT-4o is incorrect on 34% of instances, while we observe a huge performance gap (13.87% vs. 53.93% on average) between open and API models. When used as a resource to improve VLMs, fine-tuning on the reasoning traces for GuessBench problems improves visual perception tasks by 15.36% on average. Further analysis reveals that VLM performance in creativity sensemaking correlates with the frequency of the concept in training data, while the accuracy drops sharply for concepts in underrepresented cultural contexts and low-resource languages.

[290] From Plain Text to Poetic Form: Generating Metrically-Constrained Sanskrit Verses

Manoj Balaji Jagadeeshan,Samarth Bhatia,Pretam Ray,Harshul Raj Surana,Akhil Rajeev P,Priya Mishra,Annarao Kulkarni,Ganesh Ramakrishnan,Prathosh AP,Pawan Goyal

Main category: cs.CL

TL;DR: 论文探讨了如何将大型语言模型（LLMs）应用于低资源、形态丰富的语言（如梵语）的结构化诗歌生成，提出了数据集和评估方法，并在韵律和语义保真度上取得了显著成果。

Details

Motivation: 研究动机是解决LLMs在高资源语言中表现优异但在低资源语言（如梵语）中生成结构化诗歌的挑战。 Method: 方法包括构建用于英译梵语诗歌的数据集，评估多种生成模型，并探索约束解码和指令微调策略。 Result: 结果显示约束解码在韵律准确性上表现优异（99%），指令微调模型在语义和风格对齐上有所提升。 Conclusion: 结论表明LLMs可通过特定方法有效适应低资源语言的诗歌生成，但需权衡韵律精度与语义保真度。 Abstract: Recent advances in large language models (LLMs) have significantly improved natural language generation, including creative tasks like poetry composition. However, most progress remains concentrated in high-resource languages. This raises an important question: Can LLMs be adapted for structured poetic generation in a low-resource, morphologically rich language such as Sanskrit? In this work, we introduce a dataset designed for translating English prose into structured Sanskrit verse, with strict adherence to classical metrical patterns, particularly the Anushtub meter. We evaluate a range of generative models-both open-source and proprietary-under multiple settings. Specifically, we explore constrained decoding strategies and instruction-based fine-tuning tailored to metrical and semantic fidelity. Our decoding approach achieves over 99% accuracy in producing syntactically valid poetic forms, substantially outperforming general-purpose models in meter conformity. Meanwhile, instruction-tuned variants show improved alignment with source meaning and poetic style, as supported by human assessments, albeit with marginal trade-offs in metrical precision.

[291] One for All: Update Parameterized Knowledge Across Multiple Models

Weitao Ma,Xiyuan Du,Xiaocheng Feng,Lei Huang,Yichong Huang,Huiyi Zhang,Xiaoliang Yang,Baohang Li,Xiachong Feng,Ting Liu,Bing Qin

Main category: cs.CL

TL;DR: OnceEdit是一种基于集成的知识编辑方法，通过插件模型实现多模型的稳定知识更新，并引入动态权重和集成增强机制以提高效果。

Details

Motivation: 大型语言模型（LLMs）难以保持知识更新，现有编辑方法主要针对单一模型，无法高效更新多模型或适应新模型。 Method: 提出OnceEdit，采用插件模型作为编辑模块，结合动态权重机制和集成增强机制，优化知识编辑效果。 Result: 实验表明OnceEdit在多种LLMs上表现优于现有方法，编辑效率更高，且在多模型编辑场景中具有适应性和稳定性。 Conclusion: OnceEdit为多模型知识编辑提供了高效稳定的解决方案，未来代码将公开。 Abstract: Large language models (LLMs) encode vast world knowledge but struggle to stay up-to-date, often leading to errors and hallucinations. Knowledge editing offers an efficient alternative to retraining, enabling targeted modifications by updating specific model parameters. However, existing methods primarily focus on individual models, posing challenges in efficiently updating multiple models and adapting to new models. To address this, we propose OnceEdit, a novel ensemble-based approach that employs a plug-in model as the editing module, enabling stable knowledge updates across multiple models. Building on the model ensemble, OnceEdit introduces two key mechanisms to enhance its effectiveness. First, we introduce a dynamic weight mechanism through a \weight token for distinguishing between edit-related and non-edit-related instances, ensuring the appropriate utilization of knowledge from integrated models. Second, we incorporate an ensemble enhancement mechanism to mitigate the excessive reliance on the central model inherent in the model ensemble technique, making it more suitable for knowledge editing. Extensive experiments on diverse LLMs demonstrate that OnceEdit consistently outperforms existing methods while achieving superior editing efficiency. Further analysis confirms its adaptability and stability in multi-model editing scenarios. Our code will be available.

[292] Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks

Yuntai Bao,Xuhong Zhang,Tianyu Du,Xinkui Zhao,Zhengwen Feng,Hao Peng,Jianwei Yin

Main category: cs.CL

TL;DR: 研究发现，大型语言模型（LLMs）的“真实性方向”并非普遍一致，且更强大的模型在逻辑否定中表现更好。真实性探针能泛化到多种任务，提升用户对LLM输出的信任。

Details

Motivation: 探讨LLMs中“真实性方向”的普遍性、识别方法及泛化能力，以提升模型输出的可信度。 Method: 通过实验分析不同LLMs的真实性方向一致性，并测试真实性探针在多种任务中的泛化能力。 Result: 并非所有LLMs具有一致的真实性方向，但探针能有效泛化到逻辑转换、问答等任务，提升输出可信度。 Conclusion: 研究深化了对LLMs内部表征的理解，并为提升模型输出的真实性提供了新方法。 Abstract: Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the "truth direction", which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs. Our code is public at https://github.com/colored-dye/truthfulness_probe_generalization

[293] HERGC: Heterogeneous Experts Representation and Generative Completion for Multimodal Knowledge Graphs

Yongkang Xiao,Rui Zhang

Main category: cs.CL

TL;DR: HERGC是一个多模态知识图谱补全框架，通过融合多模态信息和生成式LLM预测器，显著提升了补全性能。

Details

Motivation: 现有MMKGC方法在封闭世界假设下仅利用MMKG中的信息，限制了推理能力，而生成式方法在多模态领域的潜力尚未充分探索。 Method: HERGC结合了异构专家表示检索器和生成式LLM预测器，先检索候选集，再生成正确答案。 Result: 在三个标准MMKG基准测试中，HERGC表现出色，达到最先进性能。 Conclusion: HERGC通过多模态融合和生成式推理，有效解决了MMKGC的挑战，展示了生成式方法的潜力。 Abstract: Multimodal knowledge graphs (MMKGs) enrich traditional knowledge graphs (KGs) by incorporating diverse modalities such as images and text. Multi-modal knowledge graph completion (MMKGC) seeks to exploit these heterogeneous signals to infer missing facts, thereby mitigating the intrinsic incompleteness of MMKGs. Existing MMKGC methods typically leverage only the information contained in the MMKGs under the closed-world assumption and adopt discriminative training objectives, which limits their reasoning capacity during completion. Recent generative completion approaches powered by advanced large language models (LLMs) have shown strong reasoning abilities in unimodal knowledge graph completion, but their potential in MMKGC remains largely unexplored. To bridge this gap, we propose HERGC, a Heterogeneous Experts Representation and Generative Completion framework for MMKGs. HERGC first deploys a Heterogeneous Experts Representation Retriever that enriches and fuses multimodal information and retrieves a compact candidate set for each incomplete triple. It then uses a Generative LLM Predictor fine-tuned on minimal instruction data to accurately identify the correct answer from these candidates. Extensive experiments on three standard MMKG benchmarks demonstrate HERGC's effectiveness and robustness, achieving state-of-the-art performance.

[294] COMPKE: Complex Question Answering under Knowledge Editing

Keyuan Cheng,Zijian Kan,Zhixian He,Zhuoran Zhang,Muhammad Asif Ali,Ke Xu,Lijie Hu,Di Wang

Main category: cs.CL

TL;DR: COMPKE是一个新的知识编辑基准测试，专注于复杂推理场景下的问答能力评估，填补了现有测试的不足。

Details

Motivation: 现有知识编辑测试主要依赖多跳问答，未能有效评估模型在复杂推理和实际场景中的应用能力。 Method: 提出了COMPKE基准测试，包含11,924个复杂问题，并对四种知识编辑方法进行了广泛评估。 Result: 不同模型在COMPKE上的表现差异显著，例如MeLLo在GPT-4O-MINI上准确率为39.47，而在QWEN2.5-3B上仅为3.83。 Conclusion: COMPKE揭示了知识编辑方法在不同模型中的有效性差异，为未来研究提供了新的评估工具。 Abstract: Knowledge Editing, which efficiently modifies the knowledge in large language models, has gathered great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning, involving one-to-many relationships or multi-step logical intersections. To fill in this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We conduct an extensive evaluation of four knowledge editing methods on COMPKE, revealing that their effectiveness varies notably across different models. For instance, MeLLo attains an accuracy of 39.47 on GPT-4O-MINI, but this drops sharply to 3.83 on QWEN2.5-3B. We further investigate the underlying causes of these disparities from both methodological and model-specific perspectives. The datasets are available at https://github.com/kzjkzj666/CompKE.

[295] Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience

Jiawei Gu,Ziting Xian,Yuanzhen Xie,Ye Liu,Enjie Liu,Ruichao Zhong,Mochi Gao,Yunzhi Tan,Bo Hu,Zang Li

Main category: cs.CL

TL;DR: CoRE框架通过对比性上下文学习和经验记忆提升LLMs在结构化数据上的表现，显著优于传统方法。

Details

Motivation: LLMs在结构化数据（如表格和数据库）上表现不佳，主要由于预训练中缺乏相关经验和僵化的文本到结构转换机制。 Method: 引入CoRE框架，结合对比性上下文学习和经验记忆，模拟人类知识迁移，并通过MCTS生成的经验记忆扩展训练数据。 Result: 在Text-to-SQL和TableQA任务中，CoRE平均提升3.44%和4.24%，最高达17.2%。训练数据扩展8-9倍。 Conclusion: CoRE为LLMs提供了一种无需训练且持续的方法，显著提升其在结构化数据上的能力。 Abstract: Large language models (LLMs) achieve strong performance on plain text tasks but underperform on structured data like tables and databases. Potential challenges arise from their underexposure during pre-training and rigid text-to-structure transfer mechanisms. Unlike humans who seamlessly apply learned patterns across data modalities, LLMs struggle to infer implicit relationships embedded in tabular formats, especially in the absence of explicit structural guidance. To bridge this cognitive gap, we introduce Contrastive Retrieval-Augmented Generation on Experience (CoRE), a framework that builds experience memory representations and enhances generalization through contrastive In-Context Learning (ICL) to simulate human-like knowledge transfer. Experiments on Text-to-SQL and TableQA show CoRE significantly improves performance, achieving average gains of 3.44% and 4.24%, with up to 17.2% on challenging tasks. Our Monte Carlo Tree Search (MCTS)-generated Experience Memory expands training data 8-9x, enhancing diversity and domain coverage. This training-free and continual method propels LLMs toward structured knowledge expertise.

[296] EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

Jacky Tai-Yu Lu,Jung Chiang,Chi-Sheng Chen,Anna Nai-Yun Tung,Hsiang Wei Hu,Yuan Chiao Cheng

Main category: cs.CL

TL;DR: EEG2TEXT-CN是一个针对中文的开源词汇EEG到文本生成框架，结合生物启发的EEG编码器和预训练语言模型，通过掩码预训练和对比学习实现脑信号与语言对齐。

Details

Motivation: 探索非语音、跨模态的脑信号解码为中文文本的可行性，为多语言脑文本研究开辟新方向。 Method: 使用NICE-EEG编码器和MiniLM语言模型，通过掩码预训练和对比学习对齐脑信号与语言表示，采用教师强制和填充掩码训练解码器。 Result: 在1500个训练验证句子和300个测试样本上，最佳BLEU-1得分为6.38%，显示词汇对齐的潜力，但句法流畅性仍需改进。 Conclusion: EEG2TEXT-CN证明了从EEG解码中文文本的可行性，为未来中文认知语言接口奠定了基础。 Abstract: We propose EEG2TEXT-CN, which, to the best of our knowledge, represents one of the earliest open-vocabulary EEG-to-text generation frameworks tailored for Chinese. Built on a biologically grounded EEG encoder (NICE-EEG) and a compact pretrained language model (MiniLM), our architecture aligns multichannel brain signals with natural language representations via masked pretraining and contrastive learning. Using a subset of the ChineseEEG dataset, where each sentence contains approximately ten Chinese characters aligned with 128-channel EEG recorded at 256 Hz, we segment EEG into per-character embeddings and predict full sentences in a zero-shot setting. The decoder is trained with teacher forcing and padding masks to accommodate variable-length sequences. Evaluation on over 1,500 training-validation sentences and 300 held-out test samples shows promising lexical alignment, with a best BLEU-1 score of 6.38\%. While syntactic fluency remains a challenge, our findings demonstrate the feasibility of non-phonetic, cross-modal language decoding from EEG. This work opens a new direction in multilingual brain-to-text research and lays the foundation for future cognitive-language interfaces in Chinese.

[297] How Bidirectionality Helps Language Models Learn Better via Dynamic Bottleneck Estimation

Md Kowsher,Nusrat Jahan Prottasha,Shiyun Xu,Shetu Mohanto,Chen Chen,Niloofar Yousefi,Ozlem Garibay

Main category: cs.CL

TL;DR: 论文通过信息瓶颈（IB）原则解释了双向语言模型优于单向模型的原因，并提出FlowNIB方法动态估计互信息，验证了双向模型在信息保留和表示复杂性上的优势。

Details

Motivation: 探究双向语言模型在自然语言理解任务中优于单向模型的理论原因。 Method: 提出FlowNIB方法，动态估计互信息，解决传统IB方法的计算难题和固定权衡问题。 Result: 理论证明双向模型保留更多互信息且具有更高有效维度；实验验证了信息编码和压缩的动态过程。 Conclusion: 为双向架构的有效性提供了理论解释，并提出了分析语言模型中信息流的实用工具。 Abstract: Bidirectional language models have better context understanding and perform better than unidirectional models on natural language understanding tasks, yet the theoretical reasons behind this advantage remain unclear. In this work, we investigate this disparity through the lens of the Information Bottleneck (IB) principle, which formalizes a trade-off between compressing input information and preserving task-relevant content. We propose FlowNIB, a dynamic and scalable method for estimating mutual information during training that addresses key limitations of classical IB approaches, including computational intractability and fixed trade-off schedules. Theoretically, we show that bidirectional models retain more mutual information and exhibit higher effective dimensionality than unidirectional models. To support this, we present a generalized framework for measuring representational complexity and prove that bidirectional representations are strictly more informative under mild conditions. We further validate our findings through extensive experiments across multiple models and tasks using FlowNIB, revealing how information is encoded and compressed throughout training. Together, our work provides a principled explanation for the effectiveness of bidirectional architectures and introduces a practical tool for analyzing information flow in deep language models.

[298] L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models

Nidhi Kowtal,Raviraj Joshi

Main category: cs.CL

TL;DR: 论文提出了L3Cube-MahaEmotions数据集，用于低资源语言（如马拉地语）的情感识别，通过LLM生成训练数据，并评估了GPT-4和Llama3-405B的表现。

Details

Motivation: 解决低资源语言（如马拉地语）情感识别中标注数据不足的问题。 Method: 使用Chain-of-Translation（CoTR）提示技术，通过翻译和LLM标注生成训练数据，并对比GPT-4与BERT模型的性能。 Result: GPT-4在情感识别任务中表现优于微调的BERT模型，但BERT模型在合成标签上训练未能超越GPT-4。 Conclusion: 高质量人工标注数据的重要性，以及通用LLM在低资源情感识别任务中的优越性。 Abstract: Emotion recognition in low-resource languages like Marathi remains challenging due to limited annotated data. We present L3Cube-MahaEmotions, a high-quality Marathi emotion recognition dataset with 11 fine-grained emotion labels. The training data is synthetically annotated using large language models (LLMs), while the validation and test sets are manually labeled to serve as a reliable gold-standard benchmark. Building on the MahaSent dataset, we apply the Chain-of-Translation (CoTR) prompting technique, where Marathi sentences are translated into English and emotion labeled via a single prompt. GPT-4 and Llama3-405B were evaluated, with GPT-4 selected for training data annotation due to superior label quality. We evaluate model performance using standard metrics and explore label aggregation strategies (e.g., Union, Intersection). While GPT-4 predictions outperform fine-tuned BERT models, BERT-based models trained on synthetic labels fail to surpass GPT-4. This highlights both the importance of high-quality human-labeled data and the inherent complexity of emotion recognition. An important finding of this work is that generic LLMs like GPT-4 and Llama3-405B generalize better than fine-tuned BERT for complex low-resource emotion recognition tasks. The dataset and model are shared publicly at https://github.com/l3cube-pune/MarathiNLP

[299] What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning

Zhaotian Weng,Haoxuan Li,Kuan-Hao Huang,Jieyu Zhao

Main category: cs.CL

TL;DR: 论文提出了两个新基准VQA-Causal和VCR-Causal，专门用于评估视觉语言模型（VLMs）的因果推理能力，发现VLMs在因果推理任务上表现不佳，主要原因是训练数据中缺乏明确的因果关系表达。

Details

Motivation: 现有基准难以真正评估VLMs的因果推理能力，因为模型可以通过对象识别和活动识别等捷径回答问题。 Method: 设计了VQA-Causal和VCR-Causal两个新基准，并通过微调策略（如硬负例）提升模型的因果推理能力。 Result: VLMs在因果推理任务上表现较差，仅略高于随机猜测，但通过针对性微调可以提升性能。 Conclusion: 研究揭示了当前VLMs在因果推理上的不足，为未来改进提供了方向。 Abstract: Despite the impressive performance of vision-language models (VLMs) on downstream tasks, their ability to understand and reason about causal relationships in visual inputs remains unclear. Robust causal reasoning is fundamental to solving complex high-level reasoning tasks, yet existing benchmarks often include a mixture of reasoning questions, and VLMs can frequently exploit object recognition and activity identification as shortcuts to arrive at the correct answers, making it challenging to truly assess their causal reasoning abilities. To bridge this gap, we introduce VQA-Causal and VCR-Causal, two new benchmarks specifically designed to isolate and rigorously evaluate VLMs' causal reasoning abilities. Our findings reveal that while VLMs excel in object and activity recognition, they perform poorly on causal reasoning tasks, often only marginally surpassing random guessing. Further analysis suggests that this limitation stems from a severe lack of causal expressions in widely used training datasets, where causal relationships are rarely explicitly conveyed. We additionally explore fine-tuning strategies with hard negative cases, showing that targeted fine-tuning can improve model's causal reasoning while maintaining generalization and downstream performance. Our study highlights a key gap in current VLMs and lays the groundwork for future work on causal understanding.

[300] CC-Tuning: A Cross-Lingual Connection Mechanism for Improving Joint Multilingual Supervised Fine-Tuning

Yangfan Ye,Xiaocheng Feng,Zekun Yuan,Xiachong Feng,Libo Qin,Lei Huang,Weitao Ma,Yichong Huang,Zhirui Zhang,Yunfei Lu,Xiaohui Yan,Duyu Tang,Dandan Tu,Bing Qin

Main category: cs.CL

TL;DR: CC-Tuning是一种新的多语言微调方法，通过在潜在层面建立跨语言连接机制，提升大语言模型的多语言能力。

Details

Motivation: 解决当前大语言模型因英语为中心的训练数据导致的多语言能力不平衡问题。 Method: 提出CC-Tuning，通过融合英语和非英语输入的激活，并利用可训练的决策机制和转换矩阵实现跨语言连接。 Result: 在22种语言的六个基准测试中，CC-Tuning表现优于传统微调方法。 Conclusion: CC-Tuning展示了潜在层面跨语言交互在提升多语言性能方面的潜力。 Abstract: Current large language models (LLMs) often exhibit imbalanced multilingual capabilities due to their English-centric training corpora. To address this, existing fine-tuning approaches operating at the data-level (e.g., through data augmentation or distillation) typically introduce implicit cross-lingual alignment, overlooking the potential for more profound, latent-level cross-lingual interactions. In this work, we propose CC-Tuning, a novel multilingual fine-tuning paradigm that explicitly establishes a cross-lingual connection mechanism at the latent level. During training, CC-Tuning fuses the feed forward activations from both English and non-English inputs, enabling the model to benefit from both linguistic resources. This process is facilitated with a trainable Decision Maker that identifies beneficial activations. Furthermore, during inference, a Transform Matrix is utilized to simulate the cross-lingual connection under monolingual setting through representation transformation. Our experiments on six benchmarks covering 22 languages show that CC-Tuning outperforms vanilla SFT and offers a strong latent-level alternative to data-level augmentation methods. Further analysis also highlights the practicality of CC-Tuning and the potential of latent-level cross-lingual interactions in advancing the multilingual performance of LLMs.

[301] Not Every Token Needs Forgetting: Selective Unlearning to Limit Change in Utility in Large Language Model Unlearning

Yixin Wan,Anil Ramakrishna,Kai-Wei Chang,Volkan Cevher,Rahul Gupta

Main category: cs.CL

TL;DR: 本文提出选择性遗忘（SU）方法，仅针对目标文档中与不需要信息相关的关键子集进行遗忘，而非全部内容，从而在有效遗忘的同时保留模型的其他知识。

Details

Motivation: 传统遗忘方法会无差别地更新模型参数以遗忘目标文档中的所有内容，包括通用知识（如代词、介词等），这可能导致模型性能下降。因此，需要一种更精确的遗忘方法。 Method: 提出选择性遗忘（SU），通过识别目标文档中与不需要信息相关的关键子集，仅对这些子集进行遗忘操作。 Result: 在两个基准测试和六种基线遗忘算法上的实验表明，SU不仅能有效遗忘目标数据，还能显著保留模型在保留集上的性能。 Conclusion: 选择性遗忘（SU）是一种更精确的遗忘方法，能够在遗忘不需要信息的同时，最大限度地保留模型的其他知识。 Abstract: Large Language Model (LLM) unlearning has recently gained significant attention, driven by the need to remove unwanted information, such as private, sensitive, or copyrighted content, from LLMs. However, conventional unlearning approaches indiscriminately update model parameters to forget all tokens in a target document, including common tokens (e.g., pronouns, prepositions, general nouns) that carry general knowledge. In this paper, we highlight that not every token needs forgetting. We propose Selective Unlearning (SU), which identifies a critical subset of tokens within the forgetting set that is relevant to the unwanted information, and unlearns only those tokens. Experiments on two benchmarks and six baseline unlearning algorithms demonstrate that SU not only achieves effective unlearning on the targeted forget data, but also significantly preserves the model's utility in the retaining set.

[302] Improve MLLM Benchmark Efficiency through Interview

Farong Wen,Yijin Guo,Junying Wang,Jiaohao Xiao,Yingjie Zhou,Chunyi Li,Zicheng Zhang,Guangtao Zhai

Main category: cs.CL

TL;DR: 论文提出了一种名为MLLM Interview (MITV)的策略，通过少量问题快速评估多模态大语言模型（MLLM）的性能。

Details

Motivation: 现有的大规模问答测试资源消耗大且耗时，需要一种更高效的评估方法。 Method: 构建带难度标签的面试数据集，并通过少量问题逐步测试模型极限。 Result: MITV策略在基准数据集上表现良好，能通过少量问答快速评估模型能力。 Conclusion: MITV是一种高效且资源节约的MLLM评估方法。 Abstract: The rapid development of Multimodal Large Language Models (MLLM) has led to a wide range of MLLM applications, and a number of benchmark datasets have sprung up in order to assess MLLM abilities. However, full-coverage Q&A testing on large-scale data is resource-intensive and time-consuming. To address this issue, we propose the MLLM Interview (MITV) strategy, which aims to quickly obtain MLLM performance metrics by quizzing fewer question. First, First, we constructed the interview dataset, which was built on an existing MLLM assessment dataset, by adding difficulty labels based on the performance of some typical MLLMs in this dataset. Second, we propose an MLLM Interview strategy, which obtains an initial performance situation of the large model by quizzing a small number of topics and then continuously tries to test the model's limits. Through extensive experiments, the result shows that the MITV strategy proposed in this paper performs well on MLLM benchmark datasets, and it is able to obtain the model evaluation capability faster through a small number of questions and answers.

[303] Affordance Benchmark for MLLMs

Junying Wang,Wenzhe Li,Yalun Wu,Yingji Liang,Yijin Guo,Chunyi Li,Haodong Duan,Zicheng Zhang,Guangtao Zhai

Main category: cs.CL

TL;DR: 论文提出了A4Bench基准，用于评估多模态大语言模型（MLLMs）在感知环境动作可能性（affordance）方面的能力，发现现有模型表现有限，尤其在动态和上下文相关的感知上。

Details

Motivation: 尽管MLLMs在视觉语言任务中表现出色，但其感知环境动作可能性的能力尚未充分研究，这对实现直观和安全的交互至关重要。 Method: 通过A4Bench基准，从构成性（Constitutive Affordance）和转化性（Transformative Affordance）两个维度评估17种MLLMs的表现，并与人类表现对比。 Result: 专有模型优于开源模型，但所有模型表现均有限，转化性感知尤其薄弱；最佳模型Gemini-2.0-Pro的准确率（18.05%）远低于人类（81.25%-85.34%）。 Conclusion: 研究揭示了MLLMs在环境理解上的关键不足，为开发更鲁棒、上下文感知的AI系统提供了基础。数据集已开源。 Abstract: Affordance theory posits that environments inherently offer action possibilities that shape perception and behavior. While Multimodal Large Language Models (MLLMs) excel in vision-language tasks, their ability to perceive affordance, which is crucial for intuitive and safe interactions, remains underexplored. To address this, we introduce A4Bench, a novel benchmark designed to evaluate the affordance perception abilities of MLLMs across two dimensions: 1) Constitutive Affordance}, assessing understanding of inherent object properties through 1,282 question-answer pairs spanning nine sub-disciplines, and 2) Transformative Affordance, probing dynamic and contextual nuances (e.g., misleading, time-dependent, cultural, or individual-specific affordance) with 718 challenging question-answer pairs. Evaluating 17 MLLMs (nine proprietary and eight open-source) against human performance, we find that proprietary models generally outperform open-source counterparts, but all exhibit limited capabilities, particularly in transformative affordance perception. Furthermore, even top-performing models, such as Gemini-2.0-Pro (18.05% overall exact match accuracy), significantly lag behind human performance (best: 85.34%, worst: 81.25%). These findings highlight critical gaps in environmental understanding of MLLMs and provide a foundation for advancing AI systems toward more robust, context-aware interactions. The dataset is available in https://github.com/JunyingWang959/A4Bench/.

Jinfeng Zhou,Yuxuan Chen,Yihan Shi,Xuanming Zhang,Leqi Lei,Yi Feng,Zexuan Xiong,Miao Yan,Xunzhi Wang,Yaru Cao,Jianing Yin,Shuai Wang,Quanyu Dai,Zhenhua Dong,Hongning Wang,Minlie Huang

Main category: cs.CL

TL;DR: 论文提出SocialEval，一个基于脚本的双语社交智能（SI）评测基准，结合结果导向和过程导向的评估方法，揭示LLMs在社交智能上与人类的差距。

Details

Motivation: 评估LLMs的社交智能及其与人类的差异，填补现有工作在结果和过程导向评估上的空白。 Method: 通过手工编写叙事脚本构建SocialEval基准，每个脚本以世界树形式组织，包含由社交能力驱动的故事情节。 Result: 实验显示LLMs在社交智能上落后于人类，表现出亲社会性，并倾向于积极社交行为，即使导致目标失败。 Conclusion: LLMs已形成类似人脑的能力特定功能分区，但其社交智能仍需改进。 Abstract: LLMs exhibit promising Social Intelligence (SI) in modeling human behavior, raising the need to evaluate LLMs' SI and their discrepancy with humans. SI equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation, which existing work fails to address. To this end, we propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts. Each script is structured as a world tree that contains plot lines driven by interpersonal ability, providing a comprehensive view of how LLMs navigate social interactions. Experiments show that LLMs fall behind humans on both SI evaluations, exhibit prosociality, and prefer more positive social behaviors, even if they lead to goal failure. Analysis of LLMs' formed representation space and neuronal activations reveals that LLMs have developed ability-specific functional partitions akin to the human brain.

[305] Pi-SQL: Enhancing Text-to-SQL with Fine-Grained Guidance from Pivot Programming Languages

Yongdong chi,Hanqing Wang,Zonghan Yang,Jian Yang,Xiao Yan,Yun Chen,Guanhua Chen

Main category: cs.CL

TL;DR: Pi-SQL通过将Python程序作为桥梁，将自然语言查询转换为SQL程序，显著提升了执行准确性和效率。

Details

Motivation: 现有基于提示的方法在自然语言与SQL程序之间存在语义鸿沟，导致准确性受限。 Method: Pi-SQL首先生成提供细粒度指导的Python程序，再基于其生成SQL程序，并通过候选策略选择优化执行速度。 Result: Pi-SQL的执行准确率比最佳基线提升3.20，效率评分高4.55。 Conclusion: Pi-SQL通过引入Python作为中间层，有效缩小了语义鸿沟，显著提升了SQL生成的性能。 Abstract: Text-to-SQL transforms the user queries from natural language to executable SQL programs, enabling non-experts to interact with complex databases. Existing prompt-based methods craft meticulous text guidelines and examples to facilitate SQL generation, but their accuracy is hindered by the large semantic gap between the texts and the low-resource SQL programs. In this work, we propose Pi-SQL, which incorporates the high-resource Python program as a pivot to bridge between the natural language query and SQL program. In particular, Pi-SQL first generates Python programs that provide fine-grained step-by-step guidelines in their code blocks or comments, and then produces an SQL program following the guidance of each Python program.The final SQL program matches the reference Python program's query results and, through selection from candidates generated by different strategies, achieves superior execution speed, with a reward-based valid efficiency score up to 4.55 higher than the best-performing baseline.Extensive experiments demonstrate the effectiveness of Pi-SQL, which improves the execution accuracy of the best-performing baseline by up to 3.20.

[306] How do Transformer Embeddings Represent Compositions? A Functional Analysis

Aishik Nagar,Ishaan Singh Rawal,Mansi Dhanania,Cheston Tan

Main category: cs.CL

TL;DR: 研究测试了Mistral、OpenAI Large和Google嵌入模型以及BERT的组合性表现，发现大多数嵌入模型具有高度组合性，而BERT表现较差。

Details

Motivation: 组合性是人工智能推理和泛化的关键，但现有研究对Transformer模型如何表示复合词及其组合性知之甚少。 Method: 通过六种不同的组合性模型（加法、乘法、回归等）评估模型表现，并使用合成数据集验证结果。 Result: 岭回归模型在组合性表现上最佳，经典向量加法模型表现接近。大多数嵌入模型高度组合，BERT表现较差。 Conclusion: 研究全面探讨了组合性，为模型设计和评估提供了重要参考。 Abstract: Compositionality is a key aspect of human intelligence, essential for reasoning and generalization. While transformer-based models have become the de facto standard for many language modeling tasks, little is known about how they represent compound words, and whether these representations are compositional. In this study, we test compositionality in Mistral, OpenAI Large, and Google embedding models, and compare them with BERT. First, we evaluate compositionality in the representations by examining six diverse models of compositionality (addition, multiplication, dilation, regression, etc.). We find that ridge regression, albeit linear, best accounts for compositionality. Surprisingly, we find that the classic vector addition model performs almost as well as any other model. Next, we verify that most embedding models are highly compositional, while BERT shows much poorer compositionality. We verify and visualize our findings with a synthetic dataset consisting of fully transparent adjective-noun compositions. Overall, we present a thorough investigation of compositionality.

[307] anyECG-chat: A Generalist ECG-MLLM for Flexible ECG Input and Multi-Task Understanding

Haitao Li,Ziyu Li,Yiheng Mao,Ziyi Liu,Zhoujian Sun,Zhengxing Huang

Main category: cs.CL

TL;DR: 本文提出了一种支持多任务和灵活输入的MLLM模型anyECG-chat，并构建了anyECG数据集，涵盖多种ECG分析任务。

Details

Motivation: 现有ECG-focused MLLMs主要局限于单导联短时ECG输入和报告生成任务，未能充分发挥MLLMs的潜力。 Method: 构建anyECG数据集，提出anyECG-chat模型，支持动态长度和多ECG输入，采用三阶段课程训练方法。 Result: anyECG-chat支持多种实际应用场景，包括报告生成、异常波形定位和多ECG比较分析。 Conclusion: anyECG-chat模型和anyECG数据集为ECG分析提供了更灵活和全面的解决方案。 Abstract: The advent of multimodal large language models (MLLMs) has sparked interest in their application to electrocardiogram (ECG) analysis. However, existing ECG-focused MLLMs primarily focus on report generation tasks, often limited to single 12-lead, short-duration (10s) ECG inputs, thereby underutilizing the potential of MLLMs. To this end, we aim to develop a MLLM for ECG analysis that supports a broader range of tasks and more flexible ECG inputs. However, existing ECG-QA datasets are often monotonous. To address this gap, we first constructed the anyECG dataset, which encompasses a wide variety of tasks, including report generation, abnormal waveform localization, and open-ended question answering. In addition to standard hospital ECGs, we introduced long-duration reduced-lead ECGs for home environments and multiple ECG comparison scenarios commonly encountered in clinical practice. Furthermore, we propose the anyECG-chat model, which supports dynamic-length ECG inputs and multiple ECG inputs. We trained the model using a three-stage curriculum training recipe with the anyECG dataset. A comprehensive evaluation was conducted, demonstrating that anyECG-chat is capable of supporting various practical application scenarios, including not only common report generation tasks but also abnormal waveform localization for long-duration reduced-lead ECGs in home environments and comprehensive comparative analysis of multiple ECGs.

[308] Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection

Zhu Li,Yuqing Zhang,Xiyuan Gao,Shekhar Nayak,Matt Coler

Main category: cs.CL

TL;DR: 论文提出了一种利用大语言模型（LLMs）生成讽刺语音数据集的标注流程，并通过人类验证提升标注质量。最终构建了PodSarc数据集，检测模型F1分数达73.63%。

Details

Motivation: 讽刺通过语气和上下文改变意义，但语音讽刺检测因数据稀缺和现有系统依赖多模态数据而受限。 Method: 使用GPT-4o和LLaMA 3对公开讽刺播客进行初始标注，人类验证解决分歧，并通过协作门控架构验证标注质量和检测性能。 Result: 构建了PodSarc数据集，检测模型F1分数为73.63%。 Conclusion: 提出的标注流程和数据集为讽刺检测研究提供了有效基准。 Abstract: Sarcasm fundamentally alters meaning through tone and context, yet detecting it in speech remains a challenge due to data scarcity. In addition, existing detection systems often rely on multimodal data, limiting their applicability in contexts where only speech is available. To address this, we propose an annotation pipeline that leverages large language models (LLMs) to generate a sarcasm dataset. Using a publicly available sarcasm-focused podcast, we employ GPT-4o and LLaMA 3 for initial sarcasm annotations, followed by human verification to resolve disagreements. We validate this approach by comparing annotation quality and detection performance on a publicly available sarcasm dataset using a collaborative gating architecture. Finally, we introduce PodSarc, a large-scale sarcastic speech dataset created through this pipeline. The detection model achieves a 73.63% F1 score, demonstrating the dataset's potential as a benchmark for sarcasm detection research.

[309] From Objectives to Questions: A Planning-based Framework for Educational Mathematical Question Generation

Cheng Cheng,Zhenya Huang,Guanhao Zhao,Yuxiang Guo,Xin Lin,Jinze Wu,Xin Li,Shijin Wang

Main category: cs.CL

TL;DR: 论文提出了一种基于教育目标的数学问题自动生成方法EQPR，通过结合蒙特卡洛树搜索和大型语言模型，优化生成符合多维教育目标的高质量问题。

Details

Motivation: 传统方法在生成数学问题时忽视教育目标，仅关注文本质量，无法满足复杂教育需求。 Method: 提出EQPR方法，采用“计划-评估-优化”流程，结合蒙特卡洛树搜索和大型语言模型，通过迭代反馈优化问题生成。 Result: 基于EQGEVAL的实验表明，EQPR在生成符合多维教育目标的问题上表现显著提升。 Conclusion: EQPR方法有效解决了传统生成方法的局限性，能够生成高质量且符合教育目标的数学问题。 Abstract: Automatically generating high-quality mathematical problems that align with educational objectives is a crucial task in NLP-based educational technology. Traditional generation methods focus primarily on textual quality, but they often overlook educational objectives. Moreover, these methods address only single-dimensional, simple question generation, failing to meet complex, multifaceted educational requirements. To address these challenges, we constructed and annotated EduMath, a dataset of 16k mathematical questions with multi-dimensional educational objectives. Based on this dataset, we developed EQGEVAL, which incorporates three evaluation dimensions and is designed to assess the ability of models to generate educational questions. Drawing inspiration from teachers' problem design processes, we propose the Educational Question Planning with self-Reflection (EQPR) method for educational mathematical question generation, following a "plan-evaluate-optimize" approach. Specifically, by combining planning algorithm based on Monte Carlo Tree Search with the generative capabilities of Large Language Models, we continuously optimize questions through iterative feedback. This self-optimization mechanism ensures that the generated questions both fit the educational context and strategically achieve specific basic educational objectives. Through extensive experiments based on EQGEVAL, we have demonstrated that EQPR achieves significant improvements in generating questions that meet multi-dimensional educational objectives.

[310] ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness

Dren Fazlija,Arkadij Orlov,Sandipan Sikdar

Main category: cs.CL

TL;DR: 提出了敏感性感知（SA）概念，使大语言模型（LLM）能遵循预设访问权限规则，并开发了评估工具ACCESS DENIED INC。实验显示模型在管理未授权请求时行为差异显著。

Details

Motivation: 解决LLM在企业数据管理中因信息敏感性及访问限制带来的性能和隐私挑战。 Method: 提出敏感性感知（SA）概念，开发评估环境ACCESS DENIED INC进行实验。 Result: 模型在未授权请求管理上表现差异显著，同时有效处理合法查询。 Conclusion: 为敏感性感知语言模型奠定基准，为企业环境中隐私中心AI系统提供改进方向。 Abstract: Large language models (LLMs) are increasingly becoming valuable to corporate data management due to their ability to process text from various document formats and facilitate user interactions through natural language queries. However, LLMs must consider the sensitivity of information when communicating with employees, especially given access restrictions. Simple filtering based on user clearance levels can pose both performance and privacy challenges. To address this, we propose the concept of sensitivity awareness (SA), which enables LLMs to adhere to predefined access rights rules. In addition, we developed a benchmarking environment called ACCESS DENIED INC to evaluate SA. Our experimental findings reveal significant variations in model behavior, particularly in managing unauthorized data requests while effectively addressing legitimate queries. This work establishes a foundation for benchmarking sensitivity-aware language models and provides insights to enhance privacy-centric AI systems in corporate environments.

[311] XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content

Vadivel Abishethvarman,Bhavik Chandna,Pratik Jalan,Usman Naseem

Main category: cs.CL

TL;DR: XGUARD是一个评估LLM生成极端内容严重性的框架，通过五级危险分类和攻击严重性曲线（ASC）提供更细致的分析。

Details

Motivation: 现有安全评估过于简化，无法捕捉LLM生成内容的复杂风险。 Method: XGUARD包含3,840个真实世界提示，将模型响应分为五级危险，并引入ASC可视化漏洞。 Result: 评估了六种LLM和两种防御策略，揭示了安全漏洞与表达自由之间的权衡。 Conclusion: 分级安全指标对构建可信LLM至关重要。 Abstract: Large Language Models (LLMs) can generate content spanning ideological rhetoric to explicit instructions for violence. However, existing safety evaluations often rely on simplistic binary labels (safe and unsafe), overlooking the nuanced spectrum of risk these outputs pose. To address this, we present XGUARD, a benchmark and evaluation framework designed to assess the severity of extremist content generated by LLMs. XGUARD includes 3,840 red teaming prompts sourced from real world data such as social media and news, covering a broad range of ideologically charged scenarios. Our framework categorizes model responses into five danger levels (0 to 4), enabling a more nuanced analysis of both the frequency and severity of failures. We introduce the interpretable Attack Severity Curve (ASC) to visualize vulnerabilities and compare defense mechanisms across threat intensities. Using XGUARD, we evaluate six popular LLMs and two lightweight defense strategies, revealing key insights into current safety gaps and trade-offs between robustness and expressive freedom. Our work underscores the value of graded safety metrics for building trustworthy LLMs.

[312] NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

Qichao Wang,Ziqiao Meng,Wenqian Cui,Yifei Zhang,Pengcheng Wu,Bingzhe Wu,Irwin King,Liang Chen,Peilin Zhao

Main category: cs.CL

TL;DR: 论文提出了一种新的生成建模范式NTPP，利用双通道语音数据提升语音语言模型的对话能力，显著改善了响应连贯性和自然度，同时降低了推理延迟。

Details

Motivation: 当前语音语言模型未能充分利用双通道语音数据，限制了对话的自然性和流畅性。 Method: 引入Next-Token-Pair Prediction（NTPP）方法，首次在解码器架构中实现说话者无关的双通道对话学习。 Result: NTPP在标准测试中显著提升了对话能力（如轮转预测、响应连贯性和自然度），并降低了推理延迟。 Conclusion: NTPP为语音语言模型的实时应用提供了高效且实用的解决方案。 Abstract: Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications.

[313] LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World

Sina J. Semnani,Pingyue Zhang,Wanyue Zhai,Haozhuo Li,Ryan Beauchamp,Trey Billing,Katayoun Kishi,Manling Li,Monica S. Lam

Main category: cs.CL

TL;DR: LEMONADE是一个大规模多语言冲突事件数据集，基于ACLED数据重新标注，并提出抽象事件提取（AEE）和抽象实体链接（AEL）方法，通过大型语言模型（LLMs）和零样本系统ZEST进行评估。

Details

Motivation: 解决多语言源数据聚合的挑战，提升全球事件分析的准确性和覆盖范围。 Method: 采用AEE和AEL方法，通过文档整体理解检测事件参数和实体，并评估LLMs及零样本系统ZEST。 Result: 最佳零样本系统在端到端任务中F1得分为58.3%，ZEST在AEL任务中F1得分为45.7%，但仍落后于监督系统。 Conclusion: 零样本方法虽有进展，但与监督系统差距显著，需进一步研究。 Abstract: This paper presents LEMONADE, a large-scale conflict event dataset comprising 39,786 events across 20 languages and 171 countries, with extensive coverage of region-specific entities. LEMONADE is based on a partially reannotated subset of the Armed Conflict Location & Event Data (ACLED), which has documented global conflict events for over a decade. To address the challenge of aggregating multilingual sources for global event analysis, we introduce abstractive event extraction (AEE) and its subtask, abstractive entity linking (AEL). Unlike conventional span-based event extraction, our approach detects event arguments and entities through holistic document understanding and normalizes them across the multilingual dataset. We evaluate various large language models (LLMs) on these tasks, adapt existing zero-shot event extraction systems, and benchmark supervised models. Additionally, we introduce ZEST, a novel zero-shot retrieval-based system for AEL. Our best zero-shot system achieves an end-to-end F1 score of 58.3%, with LLMs outperforming specialized event extraction models such as GoLLIE. For entity linking, ZEST achieves an F1 score of 45.7%, significantly surpassing OneNet, a state-of-the-art zero-shot baseline that achieves only 23.7%. However, these zero-shot results lag behind the best supervised systems by 20.1% and 37.0% in the end-to-end and AEL tasks, respectively, highlighting the need for further research.

[314] What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training

Marianne de Heer Kloots,Hosein Mohebbi,Charlotte Pouw,Gaofei Shen,Willem Zuidema,Martijn Bentum

Main category: cs.CL

TL;DR: 研究探讨自监督模型学习到的语音表征是否具有语言特异性，发现荷兰语预训练能更好地编码荷兰语的语言特征。

Details

Motivation: 探索自监督模型在不同语言预训练下对特定语言特征的编码能力，尤其是荷兰语的语音和词汇信息。 Method: 通过对比荷兰语、英语及多语言预训练的Wav2Vec2模型，使用聚类或分类探针及零样本指标评估语言特征编码效果。 Result: 荷兰语预训练显著提升荷兰语语言特征的编码效果，且与自动语音识别任务性能相关。 Conclusion: 语言特异性预训练对编码特定语言特征及下游任务表现具有优势。 Abstract: How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it's less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.

[315] Do LLMs Understand Why We Write Diaries? A Method for Purpose Extraction and Clustering

Valeriya Goloviznina,Alexander Sergeev,Mikhail Melnichenko,Evgeny Kotelnikov

Main category: cs.CL

TL;DR: 本研究提出了一种基于大语言模型（LLM）的新方法，用于识别和聚类日记写作的多种目的，并在苏联时期日记上验证了其有效性。

Details

Motivation: 传统方法在处理大规模日记语料时效果不佳，因此需要更高效的方法来提取日记写作的意图。 Method: 利用LLM（如GPT-4o和o1-mini）对日记进行分类和聚类，并与模板基线方法对比。 Result: GPT-4o和o1-mini表现最佳，而模板方法效果较差。研究还分析了作者性别、年龄和写作年份对目的的影响。 Conclusion: LLM方法在日记分析中表现优越，但仍需改进错误类型和模型局限性。 Abstract: Diary analysis presents challenges, particularly in extracting meaningful information from large corpora, where traditional methods often fail to deliver satisfactory results. This study introduces a novel method based on Large Language Models (LLMs) to identify and cluster the various purposes of diary writing. By "purposes," we refer to the intentions behind diary writing, such as documenting life events, self-reflection, or practicing language skills. Our approach is applied to Soviet-era diaries (1922-1929) from the Prozhito digital archive, a rich collection of personal narratives. We evaluate different proprietary and open-source LLMs, finding that GPT-4o and o1-mini achieve the best performance, while a template-based baseline is significantly less effective. Additionally, we analyze the retrieved purposes based on gender, age of the authors, and the year of writing. Furthermore, we examine the types of errors made by the models, providing a deeper understanding of their limitations and potential areas for improvement in future research.

[316] Talking to Data: Designing Smart Assistants for Humanities Databases

Alexander Sergeev,Valeriya Goloviznina,Mikhail Melnichenko,Evgeny Kotelnikov

Main category: cs.CL

TL;DR: 该研究开发了一个基于LLM的智能助手，通过自然语言交互提升人文研究数据库的可访问性，结合RAG和先进技术，实验验证了其在数字档案中的有效性。

Details

Motivation: 传统交互方式限制了人文研究数据库的访问，研究旨在通过自然语言交互提升效率和可访问性。 Method: 采用RAG方法，结合混合搜索、自动查询生成、文本到SQL过滤等技术，开发聊天机器人形式的智能助手。 Result: 实验基于Prozhito数字档案，验证了系统在支持人类学、历史研究及非专业用户中的有效性。 Conclusion: LLM有潜力改变数字档案的交互方式，使其更直观和包容。 Abstract: Access to humanities research databases is often hindered by the limitations of traditional interaction formats, particularly in the methods of searching and response generation. This study introduces an LLM-based smart assistant designed to facilitate natural language communication with digital humanities data. The assistant, developed in a chatbot format, leverages the RAG approach and integrates state-of-the-art technologies such as hybrid search, automatic query generation, text-to-SQL filtering, semantic database search, and hyperlink insertion. To evaluate the effectiveness of the system, experiments were conducted to assess the response quality of various language models. The testing was based on the Prozhito digital archive, which contains diary entries from predominantly Russian-speaking individuals who lived in the 20th century. The chatbot is tailored to support anthropology and history researchers, as well as non-specialist users with an interest in the field, without requiring prior technical training. By enabling researchers to query complex databases with natural language, this tool aims to enhance accessibility and efficiency in humanities research. The study highlights the potential of Large Language Models to transform the way researchers and the public interact with digital archives, making them more intuitive and inclusive. Additional materials are presented in GitHub repository: https://github.com/alekosus/talking-to-data-intersys2025.

[317] Less is More: Local Intrinsic Dimensions of Contextual Language Models

Benjamin Matthias Ruppik,Julius von Rohrscheidt,Carel van Niekerk,Michael Heck,Renato Vukovic,Shutong Feng,Hsien-chin Lin,Nurul Lubis,Bastian Rieck,Marcus Zibrowius,Milica Gašić

Main category: cs.CL

TL;DR: 本文通过几何视角分析LLM的训练和微调效果，提出局部维度作为衡量指标，揭示模型动态和泛化能力。

Details

Motivation: 研究LLM内部机制，尤其是微调对模型行为的影响，填补理论和实践的鸿沟。 Method: 通过测量上下文潜在嵌入的局部维度，分析其在训练和微调中的变化。 Result: 局部维度均值可预测模型能力极限、过拟合和‘grokking’现象，且其减少预示性能提升。 Conclusion: 局部维度为理解LLM的微调效果提供了新视角，有助于模型配置决策。 Abstract: Understanding the internal mechanisms of large language models (LLMs) remains a challenging and complex endeavor. Even fundamental questions, such as how fine-tuning affects model behavior, often require extensive empirical evaluation. In this paper, we introduce a novel perspective based on the geometric properties of contextual latent embeddings to study the effects of training and fine-tuning. To that end, we measure the local dimensions of a contextual language model's latent space and analyze their shifts during training and fine-tuning. We show that the local dimensions provide insights into the model's training dynamics and generalization ability. Specifically, the mean of the local dimensions predicts when the model's training capabilities are exhausted, as exemplified in a dialogue state tracking task, overfitting, as demonstrated in an emotion recognition task, and grokking, as illustrated with an arithmetic task. Furthermore, our experiments suggest a practical heuristic: reductions in the mean local dimension tend to accompany and predict subsequent performance gains. Through this exploration, we aim to provide practitioners with a deeper understanding of the implications of fine-tuning on embedding spaces, facilitating informed decisions when configuring models for specific applications. The results of this work contribute to the ongoing discourse on the interpretability, adaptability, and generalizability of LLMs by bridging the gap between intrinsic model mechanisms and geometric properties in the respective embeddings.

[318] Probing Neural Topology of Large Language Models

Yu Zheng,Yuan Yuan,Yong Li,Paolo Santi

Main category: cs.CL

TL;DR: 论文提出了一种名为“graph probing”的方法，用于揭示大语言模型（LLM）神经元的功能连接拓扑结构，并发现神经拓扑结构可以普遍预测模型的性能。

Details

Motivation: 尽管对LLM的神经元表征已有研究，但神经元如何协同激活以产生涌现能力仍不清楚，阻碍了对LLM的深入理解和安全开发。 Method: 通过分析不同LLM家族和规模的内部神经图，研究神经拓扑结构与语言生成性能的关系。 Result: 发现神经拓扑结构能普遍预测模型的性能，且这种预测性在仅保留1%神经元连接或模型仅预训练8步时仍稳健。 Conclusion: 不同LLM尽管在架构、参数和数据上有显著差异，但会形成一致且复杂的神经拓扑结构，这可能是其语言生成能力的基础。 Abstract: Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural representations to interpretable semantics. However, how neurons functionally co-activate with each other to give rise to emergent capabilities remains largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity topology of LLM neurons and relating it to language generation performance. By analyzing internal neural graphs across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance using only neural topology. This predictability is robust even when retaining just 1% of neuron connections or probing models after only 8 pretraining steps, highlighting the sparsity and early emergence of topological patterns. Further graph matching analysis suggests that, despite significant distinctions in architectures, parameters, and training data, different LLMs develop intricate and consistent neural topological structures that may form the foundation for their language generation abilities. Codes and data for the graph probing toolbox are released at https://github.com/DavyMorgan/llm-graph-probing.

[319] CHEER-Ekman: Fine-grained Embodied Emotion Classification

Phan Anh Duong,Cat Luong,Divyesh Bommana,Tianyu Jiang

Main category: cs.CL

TL;DR: 论文提出了一个基于Ekman六种基本情绪类别的数据集CHEER-Ekman，并通过自动最佳-最差缩放方法结合大型语言模型，实现了优于监督方法的性能。研究发现简化的提示指令和链式思维推理显著提高了情绪识别准确率。

Details

Motivation: 现有研究中，通过文本识别身体反应的情绪表达仍未被充分探索，因此需要扩展数据集并改进方法。 Method: 使用自动最佳-最差缩放方法结合大型语言模型，并探索简化的提示指令和链式思维推理。 Result: 新方法在CHEER-Ekman数据集上表现优于监督方法，且小模型也能达到与大模型竞争的性能。 Conclusion: 简化的提示和链式思维推理能有效提升情绪识别性能，为小模型的应用提供了可能。 Abstract: Emotions manifest through physical experiences and bodily reactions, yet identifying such embodied emotions in text remains understudied. We present an embodied emotion classification dataset, CHEER-Ekman, extending the existing binary embodied emotion dataset with Ekman's six basic emotion categories. Using automatic best-worst scaling with large language models, we achieve performance superior to supervised approaches on our new dataset. Our investigation reveals that simplified prompting instructions and chain-of-thought reasoning significantly improve emotion recognition accuracy, enabling smaller models to achieve competitive performance with larger ones.

[320] SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Thinh Pham,Nguyen Nguyen,Pratibha Zunjare,Weiyuan Chen,Yu-Min Tseng,Tu Vu

Main category: cs.CL

TL;DR: SealQA是一个新的基准测试，用于评估搜索增强语言模型在事实性问题上的表现，尤其是在搜索结果冲突、嘈杂或无帮助的情况下。它包含三个版本：Seal-0、Seal-Hard和LongSeal，分别测试准确性、推理能力和长文本多文档推理。当前前沿模型表现不佳，且增加计算资源并不能显著提升性能。

Details

Motivation: 为了解决现有语言模型在事实性问题上的局限性，尤其是在搜索结果不可靠时的表现，SealQA被设计为一个挑战性基准。 Method: SealQA包含三个测试集：Seal-0（最具挑战性的问题）、Seal-Hard（测试推理能力）和LongSeal（测试长文本多文档推理）。通过评估前沿模型在这些测试集上的表现，分析其局限性。 Result: 前沿模型在SealQA上表现不佳，例如Seal-0上最高准确率仅为17.1%。增加计算资源未能显著提升性能，且模型对嘈杂搜索结果高度敏感。 Conclusion: SealQA揭示了当前语言模型在事实性问题上的重大缺陷，尤其是在复杂搜索场景下的表现不足，为未来研究提供了方向。 Abstract: We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.

[321] How Programming Concepts and Neurons Are Shared in Code Language Models

Amir Hossein Kargaran,Yihong Liu,François Yvon,Hinrich Schütze

Main category: cs.CL

TL;DR: 研究探讨了多编程语言（PL）与英语在LLMs概念空间中的关系，通过少量样本翻译任务和神经元激活分析，揭示了PL在模型内部的表示方式。

Details

Motivation: 现有研究多关注单语言编程任务，本文旨在探索多PL与英语在LLMs概念空间中的交互关系。 Method: 使用两个基于Llama的模型对21对PL进行少量样本翻译任务，解码中间层嵌入并分析神经元激活。 Result: 发现概念空间更接近英语，PL关键词在中间层后半部分概率较高；语言特定神经元集中在底层，而PL特有神经元出现在顶层。 Conclusion: 研究揭示了LLMs内部PL表示的结构模式，为理解模型概念空间提供了新视角。 Abstract: Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and are closer to the model's concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model's concept space. Code is available at https://github.com/cisnlp/code-specific-neurons.

[322] zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

Saibo Geng,Nathan Ranchin,Yunzhen yao,Maxime Peyrard,Chris Wendler,Michael Gastpar,Robert West

Main category: cs.CL

TL;DR: zip2zip是一个动态调整token词汇表的框架，通过减少生成的token数量来加速大型语言模型的推理。

Details

Motivation: 静态tokenizer在领域或语言特定输入上表现不佳，导致token序列过长和计算成本增加。 Method: zip2zip包含基于LZW压缩的动态tokenizer、运行时计算新token嵌入的层，以及支持压缩序列的语言建模变体。 Result: zip2zip化的模型在推理时减少了20-60%的序列长度，显著降低了延迟。 Conclusion: zip2zip通过动态token调整，有效提升了大型语言模型的推理效率。 Abstract: Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60\%, with significant improvements in inference latency.

[323] Un-considering Contextual Information: Assessing LLMs' Understanding of Indexical Elements

Metehan Oguz,Yavuz Bakman,Duygu Nur Yaldiz

Main category: cs.CL

TL;DR: 该研究评估了大型语言模型（LLMs）在处理索引词（如I、you、here、tomorrow）时的共指消解能力，发现模型对某些索引词表现良好（如I），但对其他词（如you、here、tomorrow）表现较差。

Details

Motivation: 现有研究主要评估LLMs在名词和第三人称代词上的共指消解能力，而忽略了索引词的独特挑战。 Method: 研究创建了包含1600个多选题的英语索引数据集，并评估了GPT-4o、Claude 3.5 Sonnet等领先LLMs的表现。 Result: LLMs对某些索引词（如I）表现优异，但对其他词（如you、here、tomorrow）表现不佳。句法线索（如引号）对不同索引词的影响各异。 Conclusion: LLMs在处理索引词时表现不均衡，未来研究需进一步优化模型以应对此类挑战。 Abstract: Large Language Models (LLMs) have demonstrated impressive performances in tasks related to coreference resolution. However, previous studies mostly assessed LLM performance on coreference resolution with nouns and third person pronouns. This study evaluates LLM performance on coreference resolution with indexical like I, you, here and tomorrow, which come with unique challenges due to their linguistic properties. We present the first study examining how LLMs interpret indexicals in English, releasing the English Indexical Dataset with 1600 multiple-choice questions. We evaluate pioneering LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and DeepSeek V3. Our results reveal that LLMs exhibit an impressive performance with some indexicals (I), while struggling with others (you, here, tomorrow), and that syntactic cues (e.g. quotation) contribute to LLM performance with some indexicals, while they reduce performance with others. Code and data are available at: https://github.com/metehanoguzz/LLMs-Indexicals-English.

[324] Contextual Candor: Enhancing LLM Trustworthiness Through Hierarchical Unanswerability Detection

Steven Robinson,Antonio Carlos Rivera

Main category: cs.CL

TL;DR: 本文提出了一种名为Reinforced Unanswerability Learning (RUL)的新型混合训练范式，旨在提升大型语言模型（LLMs）检测不可回答问题并生成适当响应的能力。通过结合判别性预测头和多阶段学习策略，RUL在不可回答性检测和拒绝响应生成方面表现出色。

Details

Motivation: 尽管LLMs在对话AI中广泛应用，但其生成无事实依据或幻觉回答的问题阻碍了其可信度和普及。因此，需要一种方法使LLMs能够准确识别不可回答问题并生成可靠响应。 Method: RUL结合了判别性不可回答性预测头和LLM的生成核心，采用多阶段学习策略，包括在Enhanced-CAsT-Answerability (ECA)数据集上的监督微调，以及基于人类反馈的强化学习（RLHF）阶段。 Result: 实验表明，RUL在不可回答性检测和拒绝响应生成方面显著优于传统方法，同时在可回答问题上也表现良好。人类评估进一步验证了其在帮助性和可信度上的提升。 Conclusion: RUL为更可靠、以用户为中心的对话AI铺平了道路，显著提升了模型的信任度和实用性。 Abstract: The pervasive deployment of large language models (LLMs) in conversational AI systems has revolutionized information access, yet their propensity for generating factually unsupported or hallucinated responses remains a critical impediment to trustworthiness and widespread adoption. This paper introduces Reinforced Unanswerability Learning (RUL), a novel hybrid training paradigm designed to imbue LLMs with the intrinsic capability to accurately detect unanswerable questions and generate reliably appropriate responses. Unlike conventional approaches that rely on external classifiers or simple prompting, RUL integrates a discriminative unanswerability prediction head with the LLM's generative core, guided by a multi-stage learning strategy. This includes supervised fine-tuning on a novel, richly annotated dataset, Enhanced-CAsT-Answerability (ECA), which features hierarchical answerability labels and ground-truth refusal responses. Crucially, RUL incorporates a subsequent reinforcement learning with human feedback (RLHF) phase to refine the nuance, helpfulness, and informativeness of refusal responses. Extensive experiments demonstrate RUL's superior performance, achieving significantly higher accuracy in unanswerability detection across sentence, paragraph, and ranking levels, and substantially increasing the generation of appropriate refusals for unanswerable queries, alongside strong performance on answerable questions. Human evaluations further corroborate RUL's effectiveness, highlighting a marked improvement in perceived helpfulness and trustworthiness, ultimately paving the way for more reliable and user-centric conversational AI.

[325] From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models

Asım Ersoy,Basel Mousi,Shammur Chowdhury,Firoj Alam,Fahim Dalvi,Nadir Durrani

Main category: cs.CL

TL;DR: 论文探讨了多模态模型（如语音和文本）是否能够形成更丰富的语义理解，并通过潜在概念分析研究了其语义抽象的形成。

Details

Motivation: 研究大型语言模型（LLMs）在文本训练中展现的智能特性是否也存在于其他模态（如语音）或多模态联合训练中。 Method: 使用潜在概念分析（Latent Concept Analysis）这一无监督方法，分析语音和文本模型单独及联合训练时的潜在表示。 Result: 研究发现多模态联合训练可能促进更结构化、丰富的语义理解。 Conclusion: 多模态训练有助于模型形成更全面的语义抽象，为未来研究提供了工具和资源。 Abstract: The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts--showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities: Do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility we made scripts and other resources available to the community.

[326] A Word is Worth 4-bit: Efficient Log Parsing with Binary Coded Decimal Recognition

Prerak Srivastava,Giulio Corallo,Sergey Rybalko

Main category: cs.CL

TL;DR: 提出了一种基于字符级嵌入的新型日志解析器，能够更精细地提取日志模板，在准确性和效率上优于现有方法。

Details

Motivation: 现有日志解析器无法捕捉细粒度模板细节，导致下游任务准确性不足。 Method: 使用字符级嵌入的神经架构，通过二进制编码序列实现高粒度模板提取。 Result: 在Loghub-2k和工业数据集上测试，准确性媲美LLM解析器，效率优于语义解析器。 Conclusion: 该方法在低资源环境下实现了高精度和高效率的日志模板提取。 Abstract: System-generated logs are typically converted into categorical log templates through parsing. These templates are crucial for generating actionable insights in various downstream tasks. However, existing parsers often fail to capture fine-grained template details, leading to suboptimal accuracy and reduced utility in downstream tasks requiring precise pattern identification. We propose a character-level log parser utilizing a novel neural architecture that aggregates character embeddings. Our approach estimates a sequence of binary-coded decimals to achieve highly granular log templates extraction. Our low-resource character-level parser, tested on revised Loghub-2k and a manually annotated industrial dataset, matches LLM-based parsers in accuracy while outperforming semantic parsers in efficiency.

[327] Mispronunciation Detection Without L2 Pronunciation Dataset in Low-Resource Setting: A Case Study in Finland Swedish

Nhan Phan,Mikko Kuronen,Maria Kautonen,Riikka Ullakonoja,Anna von Zansen,Yaroslav Getman,Ekaterina Voskoboinik,Tamás Grósz,Mikko Kurimo

Main category: cs.CL

TL;DR: 本文提出了一种针对芬兰瑞典语（FS）的发音错误检测（MD）模型，适用于低资源语言，通过多语言wav2vec 2.0模型和熵正则化实现。

Details

Motivation: 现有MD模型主要针对英语等主流语言，低资源语言如芬兰瑞典语缺乏相关工具。 Method: 使用多语言wav2vec 2.0模型，结合熵正则化、温度缩放和top-k归一化，仅需少量L2数据。 Result: 模型在召回率（43.2%）和精确率（29.8%）上优于基线模型（召回率77.5%，精确率17.6%）。 Conclusion: 该方法简单且语言无关，适用于其他低资源语言。 Abstract: Mispronunciation detection (MD) models are the cornerstones of many language learning applications. Unfortunately, most systems are built for English and other major languages, while low-resourced language varieties, such as Finland Swedish (FS), lack such tools. In this paper, we introduce our MD model for FS, trained on 89 hours of first language (L1) speakers' spontaneous speech and tested on 33 minutes of L2 transcribed read-aloud speech. We trained a multilingual wav2vec 2.0 model with entropy regularization, followed by temperature scaling and top-k normalization after the inference to better adapt it for MD. The main novelty of our method lies in its simplicity, requiring minimal L2 data. The process is also language-independent, making it suitable for other low-resource languages. Our proposed algorithm allows us to balance Recall (43.2%) and Precision (29.8%), compared with the baseline model's Recall (77.5%) and Precision (17.6%).

[328] The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage

Byung-Doh Oh,Hongao Zhu,William Schuler

Main category: cs.CL

TL;DR: 研究发现，预训练语言模型的规模与对人类阅读时间的预测能力呈负相关，且数据泄露并非主要原因。

Details

Motivation: 探讨预训练语言模型的规模与其对人类阅读时间预测能力的关系，并验证数据泄露是否影响先前研究结果。 Method: 通过两项研究：1) 分析五个自然阅读时间语料库在两个预训练数据集中的泄露情况；2) 使用与阅读时间语料库重叠极少的‘无泄露’数据训练模型，验证模型规模与阅读时间拟合度的关系。 Result: 1) 数据泄露较少；2) 模型规模与阅读时间拟合度仍呈负相关，表明先前结果不受数据泄露影响。 Conclusion: 预训练语言模型规模越大，对人类阅读时间的预测能力越差，且这一现象与数据泄露无关。 Abstract: In psycholinguistic modeling, surprisal from larger pre-trained language models has been shown to be a poorer predictor of naturalistic human reading times. However, it has been speculated that this may be due to data leakage that caused language models to see the text stimuli during training. This paper presents two studies to address this concern at scale. The first study reveals relatively little leakage of five naturalistic reading time corpora in two pre-training datasets in terms of length and frequency of token $n$-gram overlap. The second study replicates the negative relationship between language model size and the fit of surprisal to reading times using models trained on 'leakage-free' data that overlaps only minimally with the reading time corpora. Taken together, this suggests that previous results using language models trained on these corpora are not driven by the effects of data leakage.

[329] LAQuer: Localized Attribution Queries in Content-grounded Generation

Eran Hirsch,Aviv Slobodkin,David Wan,Elias Stengel-Eskin,Mohit Bansal,Ido Dagan

Main category: cs.CL

TL;DR: 论文提出了一种名为LAQuer的任务，旨在将生成文本的特定片段与其来源片段关联，以提供更细粒度的用户导向溯源。通过比较两种方法并扩展现有框架，实验表明LAQuer显著减少了溯源文本长度。

Details

Motivation: 现有溯源方法要么过于笼统（整句关联），要么过于精确但不符合用户需求，因此需要一种更灵活、用户导向的溯源方式。 Method: 提出了LAQuer任务，比较了基于提示的大型语言模型（LLMs）和利用LLM内部表征的两种方法，并扩展了现有框架以支持LAQuer。 Result: 实验表明，LAQuer方法在多文档摘要和长形式问答任务中显著减少了溯源文本长度。 Conclusion: LAQuer任务提升了溯源的可用性，提出了新的建模框架和评估设置，为未来细粒度溯源研究奠定了基础。 Abstract: Grounded text generation models often produce content that deviates from their source material, requiring user verification to ensure accuracy. Existing attribution methods associate entire sentences with source documents, which can be overwhelming for users seeking to fact-check specific claims. In contrast, existing sub-sentence attribution methods may be more precise but fail to align with users' interests. In light of these limitations, we introduce Localized Attribution Queries (LAQuer), a new task that localizes selected spans of generated output to their corresponding source spans, allowing fine-grained and user-directed attribution. We compare two approaches for the LAQuer task, including prompting large language models (LLMs) and leveraging LLM internal representations. We then explore a modeling framework that extends existing attributed text generation methods to LAQuer. We evaluate this framework across two grounded text generation tasks: Multi-document Summarization (MDS) and Long-form Question Answering (LFQA). Our findings show that LAQuer methods significantly reduce the length of the attributed text. Our contributions include: (1) proposing the LAQuer task to enhance attribution usability, (2) suggesting a modeling framework and benchmarking multiple baselines, and (3) proposing a new evaluation setting to promote future research on localized attribution in content-grounded generation.

[330] Culturally-Grounded Chain-of-Thought (CG-CoT):Enhancing LLM Performance on Culturally-Specific Tasks in Low-Resource Languages

Madhavendra Thakur

Main category: cs.CL

TL;DR: CG-CoT是一种新的提示策略，结合文化背景检索和显式推理序列，显著提高了低资源语言任务的文化对齐准确性。

Details

Motivation: 解决LLMs在低资源语言和文化特定推理任务中的局限性，促进AI的全球公平应用。 Method: 提出CG-CoT，结合密集向量检索文化背景和显式推理序列，并在约鲁巴谚语解释任务中进行实验。 Result: CG-CoT在文化对齐准确性和深度上显著优于传统方法，同时揭示了BLEU等指标与文化相关性之间的差异。 Conclusion: CG-CoT有效提升文化特定任务的性能，并呼吁重新思考低资源NLP的评估方法。 Abstract: Large Language Models (LLMs) struggle with culturally-specific reasoning tasks, particularly in low-resource languages, hindering their global applicability. Addressing this gap is crucial for equitable AI deployment. We introduce Culturally-Grounded Chain-of-Thought (CG-CoT), a novel prompting strategy that combines dense vector retrieval of cultural context with explicit reasoning sequences. Our extensive experiments on Yoruba proverb interpretation demonstrate that CG-CoT provides significantly higher culturally-aligned accuracy and depth than traditional prompting methods, validated through both automated metrics and LLM-based evaluations. Notably, we uncover stark disparities between token-level translation metrics like BLEU and human-judged cultural relevance, suggesting a rethinking of evaluation approaches for low-resource NLP.

[331] CoBRA: Quantifying Strategic Language Use and LLM Pragmatics

Anshun Asher Zheng,Junyi Jessy Li,David I. Beaver

Main category: cs.CL

TL;DR: 论文提出CoBRA框架和CHARM数据集，用于量化非合作话语的战略效果，并评估LLMs在此类场景中的表现。

Details

Motivation: 填补对非合作话语系统理解的空白，研究语言在高风险对抗性环境中的战略使用。 Method: 引入CoBRA框架和三个可解释指标（BaT、PaT、NRBaT），并使用CHARM数据集验证框架有效性。 Result: LLMs在战略语言理解上表现有限，模型规模提升性能，但推理能力反而导致过复杂化和内部混乱。 Conclusion: LLMs在非合作话语场景中的实用性仍需改进，尤其是推理能力的负面影响需进一步研究。 Abstract: Language is often used strategically, particularly in high-stakes, adversarial settings, yet most work on pragmatics and LLMs centers on cooperativity. This leaves a gap in systematic understanding of non-cooperative discourse. To address this, we introduce CoBRA (Cooperation-Breach Response Assessment), along with three interpretable metrics -- Benefit at Turn (BaT), Penalty at Turn (PaT), and Normalized Relative Benefit at Turn (NRBaT) -- to quantify the perceived strategic effects of discourse moves. We also present CHARM, an annotated dataset of real courtroom cross-examinations, to demonstrate the framework's effectiveness. Using these tools, we evaluate a range of LLMs and show that LLMs generally exhibit limited pragmatic understanding of strategic language. While model size shows an increase in performance on our metrics, reasoning ability does not help and largely hurts, introducing overcomplication and internal confusion.

[332] Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures

Mark Muchane,Sean Richardson,Kiho Park,Victor Veitch

Main category: cs.CL

TL;DR: 提出了一种改进的稀疏自编码器（SAE）架构，显式建模概念的语义层次结构，提高了重建性能、可解释性和计算效率。

Details

Motivation: 稀疏字典学习（如稀疏自编码器）未利用或表示学习概念间的语义关系，限制了其效果。 Method: 引入改进的SAE架构，显式建模概念的语义层次结构。 Result: 实验表明，该架构能学习语义层次结构，提升重建性能、可解释性和计算效率。 Conclusion: 显式建模语义层次结构的SAE架构在多个方面优于传统方法。 Abstract: Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.

[333] Trick or Neat: Adversarial Ambiguity and Language Model Evaluation

Antonia Karamolegkou,Oliver Eberle,Phillip Rust,Carina Kauf,Anders Søgaard

Main category: cs.CL

TL;DR: 论文研究了语言模型对歧义的敏感性，通过对抗性歧义数据集评估模型表现，发现直接提示效果不佳，而基于模型表示的线性探针能高精度解码歧义。

Details

Motivation: 检测歧义对语言理解至关重要，包括不确定性估计、幽默检测和花园路径句处理。 Method: 引入对抗性歧义数据集（含句法、词汇和语音歧义及对抗性变体），评估直接提示和线性探针的表现。 Result: 直接提示效果不佳，线性探针解码歧义准确率高达90%。 Conclusion: 研究揭示了提示范式的局限性及语言模型在不同层编码歧义的方式，代码和数据已开源。 Abstract: Detecting ambiguity is important for language understanding, including uncertainty estimation, humour detection, and processing garden path sentences. We assess language models' sensitivity to ambiguity by introducing an adversarial ambiguity dataset that includes syntactic, lexical, and phonological ambiguities along with adversarial variations (e.g., word-order changes, synonym replacements, and random-based alterations). Our findings show that direct prompting fails to robustly identify ambiguity, while linear probes trained on model representations can decode ambiguity with high accuracy, sometimes exceeding 90\%. Our results offer insights into the prompting paradigm and how language models encode ambiguity at different layers. We release both our code and data: https://github.com/coastalcph/lm_ambiguity.

[334] Mamba Drafters for Speculative Decoding

Daewon Choi,Seunghyuk Oh,Saket Dingliwal,Jihoon Tack,Kyuyoung Kim,Woomin Song,Seojin Kim,Insu Han,Jinwoo Shin,Aram Galstyan,Shubham Katiyar,Sravan Babu Bodapati

Main category: cs.CL

TL;DR: 本文提出了一种基于Mamba的新型推测解码方法，结合了外部草稿和自我推测的优点，通过线性结构避免了传统Transformer的二次复杂度，实现了更快的草稿生成和更低的内存占用。

Details

Motivation: 现有推测解码方法存在灵活性或效率的权衡，外部草稿速度慢，而自我推测需要重新训练。本文旨在结合两者的优点。 Method: 利用Mamba（一种状态空间模型）的线性结构，避免了二次复杂度，并引入了一种新颖的测试时树搜索算法以提高草稿质量。 Result: 实验表明，基于Mamba的草稿方法不仅优于现有外部草稿方法，还能与自我推测方法媲美，同时内存占用更低且保持跨模型适应性。 Conclusion: Mamba为基础的草稿方法在推测解码中表现出色，兼具高效性和灵活性，为LLM加速提供了新思路。 Abstract: Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.

[335] Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers

Woomin Song,Sai Muralidhar Jayanthi,Srikanth Ronanki,Kanthashree Mysore Sathyendra,Jinwoo Shin,Aram Galstyan,Shubham Katiyar,Sravan Babu Bodapati

Main category: cs.CL

TL;DR: REFORM是一种新型推理框架，通过两阶段方法高效处理长上下文，显著提升性能并降低资源消耗。

Details

Motivation: 随着大语言模型在现实应用中的普及，处理超出预训练上下文限制的极长上下文成为关键挑战。现有方法在信息保留或内存资源需求上存在不足。 Method: REFORM采用两阶段方法：1）增量处理输入块并维护压缩的KV缓存，构建跨层上下文嵌入；2）通过相似性匹配识别关键令牌并选择性重新计算KV缓存。 Result: 在1M上下文长度下，REFORM在RULER和BABILong上分别实现50%和27%的性能提升，同时在Infinite-Bench和MM-NIAH上优于基线，减少30%推理时间和5%峰值内存使用。 Conclusion: REFORM在高效性和性能上均优于现有方法，适用于多样化任务和领域。 Abstract: As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model's pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and utilizes early exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.

SungHo Kim,Nayeon Kim,Taehee Jeon,SangKeun Lee

Main category: cs.CL

TL;DR: KoGEM是一个评估LLMs和人类韩语能力的基准，包含1.5k选择题，覆盖5大类16子类。27种LLMs的零样本评估显示，LLMs在定义性任务上表现优异，但在需要现实经验的任务（如音韵规则）上表现不佳。分析表明，融入经验知识可提升LLMs的语言能力。

Details

Motivation: 评估LLMs和人类在韩语中的语言能力，揭示LLMs的局限性并探索提升其语言理解的途径。 Method: 构建KoGEM基准，包含1.5k选择题，分为5大类16子类，对27种LLMs进行零样本评估。 Result: LLMs在定义性任务上表现良好，但在需要现实经验的任务上表现较差。融入经验知识可能提升其语言能力。 Conclusion: KoGEM不仅揭示了LLMs在语言能力上的局限性，还为提升其全面语言理解提供了方向。 Abstract: We introduce the $\underline{Ko}rean \underline{G}rammar \underline{E}valuation Bench\underline{M}ark (KoGEM)$, designed to assess the linguistic competence of LLMs and humans in Korean. KoGEM consists of 1.5k multiple-choice QA pairs covering five main categories and 16 subcategories. The zero-shot evaluation of 27 LLMs of various sizes and types reveals that while LLMs perform remarkably well on straightforward tasks requiring primarily definitional knowledge, they struggle with tasks that demand the integration of real-world experiential knowledge, such as phonological rules and pronunciation. Furthermore, our in-depth analysis suggests that incorporating such experiential knowledge could enhance the linguistic competence of LLMs. With KoGEM, we not only highlight the limitations of current LLMs in linguistic competence but also uncover hidden facets of LLMs in linguistic competence, paving the way for enhancing comprehensive language understanding. Our code and dataset are available at: https://github.com/SungHo3268/KoGEM.

[337] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

Jie Ruan,Inderjeet Nair,Shuyang Cao,Amy Liu,Sheza Munir,Micah Pollens-Dempsey,Tiffany Chiang,Lucy Kates,Nicholas David,Sihan Chen,Ruxin Yang,Yuqian Yang,Jasmine Gump,Tessa Bialek,Vivek Sankaran,Margo Schlanger,Lu Wang

Main category: cs.CL

TL;DR: ExpertLongBench是一个专家级基准测试，包含11个任务，覆盖9个领域，要求长文本输出和严格遵循领域要求。CLEAR评估框架通过提取任务特定评分标准中的检查项，实现细粒度评估。实验显示现有大语言模型在专家任务上表现不佳，但CLEAR框架能有效支持评估。

Details

Motivation: 现有基准测试难以评估专家级任务的长文本输出和领域特定要求，因此需要开发一个更专业的基准和评估框架。 Method: 提出ExpertLongBench基准和CLEAR评估框架，后者通过提取评分标准中的检查项进行细粒度评估。 Result: 实验显示现有大语言模型在专家任务上表现较差（最高F1分数26.8%），但CLEAR框架能有效支持评估，且开源模型可实现准确的检查项提取。 Conclusion: ExpertLongBench和CLEAR填补了专家级任务评估的空白，为未来模型改进提供了方向。 Abstract: This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items for model outputs are then compared with corresponding items for reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 11 large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer achieving only a 26.8% F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, though often not accurately; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable and low-cost usage.

[338] MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine

Shufeng Kong,Xingru Yang,Yuanyuan Wei,Zijie Wang,Hao Tang,Jiuqi Qin,Shuting Lan,Yingheng Wang,Junwen Bai,Zhuangbin Chen,Zibin Zheng,Caihua Liu,Hao Liang

Main category: cs.CL

TL;DR: 论文提出了MTCMB基准，用于评估大语言模型在中医领域的知识、推理和安全性表现。

Details

Motivation: 中医领域缺乏系统化的评估基准，现有基准或过于狭窄或缺乏临床真实性。 Method: 开发了MTCMB基准，包含12个子数据集，涵盖知识问答、语言理解、诊断推理、处方生成和安全性评估五大类。 Result: 当前大语言模型在基础知识上表现良好，但在临床推理、处方规划和安全性合规方面不足。 Conclusion: MTCMB为开发更可靠的中医AI系统提供了必要的评估工具，数据与代码已开源。 Abstract: Traditional Chinese Medicine (TCM) is a holistic medical system with millennia of accumulated clinical experience, playing a vital role in global healthcare-particularly across East Asia. However, the implicit reasoning, diverse textual forms, and lack of standardization in TCM pose major challenges for computational modeling and evaluation. Large Language Models (LLMs) have demonstrated remarkable potential in processing natural language across diverse domains, including general medicine. Yet, their systematic evaluation in the TCM domain remains underdeveloped. Existing benchmarks either focus narrowly on factual question answering or lack domain-specific tasks and clinical realism. To fill this gap, we introduce MTCMB-a Multi-Task Benchmark for Evaluating LLMs on TCM Knowledge, Reasoning, and Safety. Developed in collaboration with certified TCM experts, MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. The benchmark integrates real-world case records, national licensing exams, and classical texts, providing an authentic and comprehensive testbed for TCM-capable models. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance. These findings highlight the urgent need for domain-aligned benchmarks like MTCMB to guide the development of more competent and trustworthy medical AI systems. All datasets, code, and evaluation tools are publicly available at: https://github.com/Wayyuanyuan/MTCMB.

[339] CoRE: Condition-based Reasoning for Identifying Outcome Variance in Complex Events

Sai Vallurupalli,Francis Ferraro

Main category: cs.CL

TL;DR: 论文研究了如何通过潜在条件分析复杂事件结果，结合现有数据集探索条件的影响，并测试了不同LLM在条件推理任务中的表现。

Details

Motivation: 研究旨在通过识别潜在条件来验证复杂事件结果的合理性，解决现有方法在条件分析和推理上的不足。 Method: 结合并扩展了两个现有数据集（目标和状态），设计了条件推理任务，测试了不同规模和意图对齐的LLM（包括GPT-4o）。 Result: 条件在上下文缺失时有用，不同模型在生成和识别条件上的能力差异显著，影响结果验证表现；大模型（如GPT-4o）在低约束场景中更谨慎。 Conclusion: 条件分析有助于验证事件结果，模型规模和意图对齐显著影响条件推理能力，大模型在不确定条件下表现更优。 Abstract: Knowing which latent conditions lead to a particular outcome is useful for critically examining claims made about complex event outcomes. Identifying implied conditions and examining their influence on an outcome is challenging. We handle this by combining and augmenting annotations from two existing datasets consisting of goals and states, and explore the influence of conditions through our research questions and Condition-based Reasoning tasks. We examine open and closed LLMs of varying sizes and intent-alignment on our reasoning tasks and find that conditions are useful when not all context is available. Models differ widely in their ability to generate and identify outcome-variant conditions which affects their performance on outcome validation when conditions are used to replace missing context. Larger models like GPT-4o, are more cautious in such less constrained situations.

[340] Memory-Efficient FastText: A Comprehensive Approach Using Double-Array Trie Structures and Mark-Compact Memory Management

Yimin Du

Main category: cs.CL

TL;DR: 本文提出了一种基于双数组字典树和标记-压缩垃圾回收原则的FastText内存优化框架，显著减少了内存占用并保持了嵌入质量。

Details

Motivation: FastText的哈希分桶机制在大规模工业部署中存在语义漂移和高内存消耗的问题，亟需优化。 Method: 通过双数组字典树结构和标记-压缩垃圾回收原则，分四个阶段优化内存管理：前缀字典树构建、前缀相似性压缩、后缀相似性压缩和标记-压缩内存重组。 Result: 在3000万中文词汇数据集上，内存占用从100GB降至30GB，性能几乎无损。 Conclusion: 该方法显著降低了工业部署成本，提升了模型加载速度和可靠性。 Abstract: FastText has established itself as a fundamental algorithm for learning word representations, demonstrating exceptional capability in handling out-of-vocabulary words through character-level n-gram embeddings. However, its hash-based bucketing mechanism introduces critical limitations for large-scale industrial deployment: hash collisions cause semantic drift, and memory requirements become prohibitively expensive when dealing with real-world vocabularies containing millions of terms. This paper presents a comprehensive memory optimization framework that fundamentally reimagines FastText's memory management through the integration of double-array trie (DA-trie) structures and mark-compact garbage collection principles. Our approach leverages the linguistic insight that n-grams sharing common prefixes or suffixes exhibit highly correlated embeddings due to co-occurrence patterns in natural language. By systematically identifying and merging semantically similar embeddings based on structural relationships, we achieve compression ratios of 4:1 to 10:1 while maintaining near-perfect embedding quality. The algorithm consists of four sophisticated phases: prefix trie construction with embedding mapping, prefix-based similarity compression, suffix-based similarity compression, and mark-compact memory reorganization. Comprehensive experiments on a 30-million Chinese vocabulary dataset demonstrate memory reduction from over 100GB to approximately 30GB with negligible performance degradation. Our industrial deployment results show significant cost reduction, faster loading times, and improved model reliability through the elimination of hash collision artifacts. Code and experimental implementations are available at: https://github.com/initial-d/me_fasttext

[341] DeepSeek in Healthcare: A Survey of Capabilities, Risks, and Clinical Applications of Open-Source Large Language Models

Jiancheng Ye,Sophie Bronstein,Jiarui Hai,Malak Abu Hashish

Main category: cs.CL

TL;DR: DeepSeek-R1是一款开源大型语言模型，结合混合专家架构、思维链推理和强化学习，在数学、医疗诊断等领域表现优异，但存在偏见和安全问题。

Details

Motivation: 提供透明且经济的替代方案，超越专有模型如GPT-4o和Claude-3 Opus，推动开源AI发展。 Method: 采用混合专家架构（MoE）、思维链推理（CoT）和强化学习，优化推理效率和深度。 Result: 在USMLE和AIME等基准测试中表现优异，尤其在医疗和数学领域，但存在偏见和安全漏洞。 Conclusion: DeepSeek-R1是开源AI的重要进展，需进一步研究偏见缓解和安全性，推动负责任部署。 Abstract: DeepSeek-R1 is a cutting-edge open-source large language model (LLM) developed by DeepSeek, showcasing advanced reasoning capabilities through a hybrid architecture that integrates mixture of experts (MoE), chain of thought (CoT) reasoning, and reinforcement learning. Released under the permissive MIT license, DeepSeek-R1 offers a transparent and cost-effective alternative to proprietary models like GPT-4o and Claude-3 Opus; it excels in structured problem-solving domains such as mathematics, healthcare diagnostics, code generation, and pharmaceutical research. The model demonstrates competitive performance on benchmarks like the United States Medical Licensing Examination (USMLE) and American Invitational Mathematics Examination (AIME), with strong results in pediatric and ophthalmologic clinical decision support tasks. Its architecture enables efficient inference while preserving reasoning depth, making it suitable for deployment in resource-constrained settings. However, DeepSeek-R1 also exhibits increased vulnerability to bias, misinformation, adversarial manipulation, and safety failures - especially in multilingual and ethically sensitive contexts. This survey highlights the model's strengths, including interpretability, scalability, and adaptability, alongside its limitations in general language fluency and safety alignment. Future research priorities include improving bias mitigation, natural language comprehension, domain-specific validation, and regulatory compliance. Overall, DeepSeek-R1 represents a major advance in open, scalable AI, underscoring the need for collaborative governance to ensure responsible and equitable deployment.

[342] Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis

Jisoo Mok,Ik-hwan Kim,Sangkwon Park,Sungroh Yoon

Main category: cs.CL

TL;DR: HiCUPID是一个新的开源基准测试，旨在解决个性化AI助手研究中缺乏专用对话数据集的问题，并提供基于Llama-3.2的自动评估模型。

Details

Motivation: 个性化AI助手的研究因缺乏开源对话数据集而受限，HiCUPID旨在填补这一空白。 Method: HiCUPID提供了一个对话数据集和基于Llama-3.2的自动评估模型，以支持个性化响应的研究。 Result: HiCUPID的数据集和评估模型已开源，其评估结果与人类偏好高度一致。 Conclusion: HiCUPID为个性化AI助手研究提供了重要的工具和资源，有望推动该领域的发展。 Abstract: Personalized AI assistants, a hallmark of the human-like capabilities of Large Language Models (LLMs), are a challenging application that intertwines multiple problems in LLM research. Despite the growing interest in the development of personalized assistants, the lack of an open-source conversational dataset tailored for personalization remains a significant obstacle for researchers in the field. To address this research gap, we introduce HiCUPID, a new benchmark to probe and unleash the potential of LLMs to deliver personalized responses. Alongside a conversational dataset, HiCUPID provides a Llama-3.2-based automated evaluation model whose assessment closely mirrors human preferences. We release our dataset, evaluation model, and code at https://github.com/12kimih/HiCUPID.

[343] WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing

Yu Nakagome,Michael Hentschel

Main category: cs.CL

TL;DR: 提出了一种无需额外训练的方法，通过关键词检测和偏置调整，显著提高了CTC模型对罕见词的识别准确率。

Details

Motivation: 解决端到端语音识别模型对训练数据词汇的依赖问题，特别是对专有名词和未知词的识别不准确。 Method: 利用中间层声学特征进行关键词检测，采用通配符CTC实现快速模糊匹配，并对检测到的关键词施加偏置。 Result: 在日语语音识别实验中，未知词的F1分数提升了29%。 Conclusion: 该方法无需重新训练模型，适用于大规模模型，显著提升了罕见词的识别性能。 Abstract: Despite recent advances in end-to-end speech recognition methods, the output tends to be biased to the training data's vocabulary, resulting in inaccurate recognition of proper nouns and other unknown terms. To address this issue, we propose a method to improve recognition accuracy of such rare words in CTC-based models without additional training or text-to-speech systems. Specifically, keyword spotting is performed using acoustic features of intermediate layers during inference, and a bias is applied to the subsequent layers of the acoustic model for detected keywords. For keyword detection, we adopt a wildcard CTC that is both fast and tolerant of ambiguous matches, allowing flexible handling of words that are difficult to match strictly. Since this method does not require retraining of existing models, it can be easily applied to even large-scale models. In experiments on Japanese speech recognition, the proposed method achieved a 29% improvement in the F1 score for unknown words.

[344] Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines

Do Xuan Long,Duong Ngoc Yen,Do Xuan Trong,Luu Anh Tuan,Kenji Kawaguchi,Shafiq Joty,Min-Yen Kan,Nancy F. Chen

Main category: cs.CL

TL;DR: 论文研究了上下文学习（ICL）在长文本生成任务中的不足，提出LongGuide方法，通过生成任务语言和格式的并行指导流，显著提升模型性能。

Details

Motivation: ICL在长文本生成任务中表现不佳，作者希望通过明确任务分布的定义和指导来提升模型性能。 Method: 提出LongGuide方法，生成Metric Guidelines（MGs）和Output Constraint Guidelines（OCGs）两套指导流，自动选择最佳组合。 Result: LongGuide在零样本和少样本设置下，将开源和闭源LLM的性能提升超过5%。 Conclusion: LongGuide具有通用性，可被弱模型学习以增强强模型，并能与自动提示优化器协同工作。 Abstract: In-context learning (ICL) is an important yet not fully understood ability of pre-trained large language models (LLMs). It can greatly enhance task performance using a few examples, termed demonstrations, without fine-tuning. Although effective in question answering, ICL often underperforms in long-form generation tasks such as summarization. Under appropriately realistic assumptions, we empirically and theoretically show that ICL demonstrations alone are insufficient to teach LLMs the task language and format distributions for generation. We argue for explicit exposure to the task distributions and hypothesize that defining them by prompting enhances model performance. To this end, we present LongGuide, which efficiently generates two parallel streams of guidelines capturing task language and format properties: (i) Metric Guidelines (MGs) that instruct models to optimize self-evaluated metrics; and (ii) Output Constraint Guidelines (OCGs) that constrain generation at both token and sentence levels. LongGuide automatically selects the best combination of guidelines, improving both strong open- and closed-source LLMs by over 5% in both zero- and few-shot settings. We show that LongGuide is generalizable, learnable by weak models to enhance strong ones, and integrates synergistically with automatic prompt optimizers.

[345] Detoxification of Large Language Models through Output-layer Fusion with a Calibration Model

Yuanhe Tian,Mingjie Deng,Guoqing Jin,Yan Song

Main category: cs.CL

TL;DR: 提出一种轻量级干预方法，通过预训练的校准模型指导LLM生成无害内容，避免传统方法的高计算成本和性能损失。

Details

Motivation: 现有LLM去毒方法依赖大规模数据或模型修改，计算成本高且影响模型流畅性和上下文理解。 Method: 利用预训练的紧凑校准模型，通过学习无毒数据的嵌入空间，轻量干预LLM生成过程。 Result: 实验表明，该方法有效降低毒性，同时保持内容表达的流畅性和上下文理解。 Conclusion: 该方法简单高效，适用于多种LLM，无需重复训练且性能稳定。 Abstract: Existing approaches for Large language model (LLM) detoxification generally rely on training on large-scale non-toxic or human-annotated preference data, designing prompts to instruct the LLM to generate safe content, or modifying the model parameters to remove toxic information, which are computationally expensive, lack robustness, and often compromise LLMs' fluency and contextual understanding. In this paper, we propose a simple yet effective approach for LLM detoxification, which leverages a compact, pre-trained calibration model that guides the detoxification process of a target LLM via a lightweight intervention in its generation pipeline. By learning a detoxified embedding space from non-toxic data, the calibration model effectively steers the LLM away from generating harmful content. This approach only requires a one-time training of the calibration model that is able to be seamlessly applied to multiple LLMs without compromising fluency or contextual understanding. Experiment results on the benchmark dataset demonstrate that our approach reduces toxicity while maintaining reasonable content expression.

[346] Schema as Parameterized Tools for Universal Information Extraction

Sheng Liang,Yongyue Zhang,Yaxiong Wu,Ruiming Tang,Yong Liu

Main category: cs.CL

TL;DR: 论文提出了一种名为SPT的统一自适应文本到结构生成框架，通过将预定义模式视为参数化工具，解决了UIE在模式选择和生成中的适应性问题。

Details

Motivation: UIE在预定义模式和即时模式生成之间缺乏适应性，尤其在模式选择较多时表现不佳。 Method: SPT框架将模式视为参数化工具，支持模式检索、填充和生成，统一了封闭、开放和按需IE任务。 Result: 实验表明，SPT能自适应处理四种IE任务，模式检索和选择性能稳健，且提取性能与现有系统相当，但参数更少。 Conclusion: SPT通过参数化工具的方法，显著提升了UIE的适应性和效率。 Abstract: Universal information extraction (UIE) primarily employs an extractive generation approach with large language models (LLMs), typically outputting structured information based on predefined schemas such as JSON or tables. UIE suffers from a lack of adaptability when selecting between predefined schemas and on-the-fly schema generation within the in-context learning paradigm, especially when there are numerous schemas to choose from. In this paper, we propose a unified adaptive text-to-structure generation framework, called Schema as Parameterized Tools (SPT), which reimagines the tool-calling capability of LLMs by treating predefined schemas as parameterized tools for tool selection and parameter filling. Specifically, our SPT method can be applied to unify closed, open, and on-demand IE tasks by adopting Schema Retrieval by fetching the relevant schemas from a predefined pool, Schema Filling by extracting information and filling slots as with tool parameters, or Schema Generation by synthesizing new schemas with uncovered cases. Experiments show that the SPT method can handle four distinct IE tasks adaptively, delivering robust schema retrieval and selection performance. SPT also achieves comparable extraction performance to LoRA baselines and current leading UIE systems with significantly fewer trainable parameters.

[347] VM14K: First Vietnamese Medical Benchmark

Thong Nguyen,Duc Nguyen,Minh Dang,Thai Dao,Long Nguyen,Quan H. Nguyen,Dat Nguyen,Kien Tran,Minh Tran

Main category: cs.CL

TL;DR: 论文提出了一种构建越南语医学问题基准的方法，解决了非英语社区资源不足和数据碎片化的问题，并发布了包含14,000个问题的基准。

Details

Motivation: 非英语社区缺乏资源和标准化方法来构建医学基准，且数据分散难以验证，亟需解决方案。 Method: 通过多种可验证来源（如医学考试和临床记录）构建基准，并由医学专家标注，涵盖34个医学专业和4个难度级别。 Result: 发布了包含14,000个问题的越南语医学基准，分为公开样本集、完整公开集和私有集，支持多语言扩展。 Conclusion: 该方法可扩展至其他语言，开源数据构建流程以支持未来多语言医学基准的开发。 Abstract: Medical benchmarks are indispensable for evaluating the capabilities of language models in healthcare for non-English-speaking communities,therefore help ensuring the quality of real-life applications. However, not every community has sufficient resources and standardized methods to effectively build and design such benchmark, and available non-English medical data is normally fragmented and difficult to verify. We developed an approach to tackle this problem and applied it to create the first Vietnamese medical question benchmark, featuring 14,000 multiple-choice questions across 34 medical specialties. Our benchmark was constructed using various verifiable sources, including carefully curated medical exams and clinical records, and eventually annotated by medical experts. The benchmark includes four difficulty levels, ranging from foundational biological knowledge commonly found in textbooks to typical clinical case studies that require advanced reasoning. This design enables assessment of both the breadth and depth of language models' medical understanding in the target language thanks to its extensive coverage and in-depth subject-specific expertise. We release the benchmark in three parts: a sample public set (4k questions), a full public set (10k questions), and a private set (2k questions) used for leaderboard evaluation. Each set contains all medical subfields and difficulty levels. Our approach is scalable to other languages, and we open-source our data construction pipeline to support the development of future multilingual benchmarks in the medical domain.

[348] A Platform for Investigating Public Health Content with Efficient Concern Classification

Christopher Li,Rickard Stureborg,Bhuwan Dhingra,Jun Yang

Main category: cs.CL

TL;DR: ConcernScope是一个平台，利用教师-学生框架在大型语言模型和轻量级分类器之间进行知识转移，快速识别文本语料库中的健康问题。

Details

Motivation: 在线内容对公共卫生措施的担忧影响了全球预防措施的采用，未来需理解此类内容及其对读者的影响。 Method: 使用教师-学生框架，结合大型语言模型和轻量级分类器，支持大规模文件上传、URL自动抓取和文本编辑。 Result: 平台展示了在在线社区数据中识别常见问题、分析时间序列趋势以及事件前后主题频率变化的应用。 Conclusion: ConcernScope为公共卫生官员提供了有效工具，以理解和应对公众对健康问题的担忧。 Abstract: A recent rise in online content expressing concerns with public health initiatives has contributed to already stalled uptake of preemptive measures globally. Future public health efforts must attempt to understand such content, what concerns it may raise among readers, and how to effectively respond to it. To this end, we present ConcernScope, a platform that uses a teacher-student framework for knowledge transfer between large language models and light-weight classifiers to quickly and effectively identify the health concerns raised in a text corpus. The platform allows uploading massive files directly, automatically scraping specific URLs, and direct text editing. ConcernScope is built on top of a taxonomy of public health concerns. Intended for public health officials, we demonstrate several applications of this platform: guided data exploration to find useful examples of common concerns found in online community datasets, identification of trends in concerns through an example time series analysis of 186,000 samples, and finding trends in topic frequency before and after significant events.

[349] Growing Through Experience: Scaling Episodic Grounding in Language Models

Chunhui Zhang,Sirui,Wang,Zhongyu Ouyang,Xiangchi Yuan,Soroush Vosoughi

Main category: cs.CL

TL;DR: 提出了一种可扩展的弱到强情景学习框架，通过蒙特卡洛树搜索和新型蒸馏方法，显著提升语言模型在规划和问答任务中的表现。

Details

Motivation: 当前情景学习方法在可扩展性和集成性上存在局限，尤其是对中型语言模型（7B参数），而大型语言模型（70-405B参数）虽具备更强的抽象能力，但缺乏有效利用经验流的机制。 Method: 结合蒙特卡洛树搜索进行结构化经验收集，并采用新型蒸馏方法，保留语言模型固有能力的同时嵌入情景记忆。 Result: 实验表明，该方法在多样规划和问答任务中超越现有专有语言模型3.45%，且在深层模型层中任务对齐显著提升。 Conclusion: 该方法在复杂规划场景中表现出稳定的泛化能力，而基线方法则显著退化。 Abstract: Language models (LMs) require robust episodic grounding-the capacity to learn from and apply past experiences-to excel at physical planning tasks. Current episodic grounding approaches struggle with scalability and integration, limiting their effectiveness, especially for medium-sized LMs (7B parameters). While larger LMs (70-405B parameters) possess superior hierarchical representations and extensive pre-trained knowledge, they encounter a fundamental scale paradox: despite their advanced abstraction capabilities, they lack efficient mechanisms to leverage experience streams. We propose a scalable weak-to-strong episodic learning framework that effectively transfers episodic behaviors from smaller to larger LMs. This framework integrates Monte Carlo tree search for structured experience collection with a novel distillation method, preserving the inherent LM capabilities while embedding episodic memory. Experiments demonstrate our method surpasses state-of-the-art proprietary LMs by 3.45% across diverse planning and question-answering tasks. Layer-wise probing further indicates significant improvements in task alignment, especially within deeper LM layers, highlighting stable generalization even for previously unseen scenarios with increased planning complexity-conditions where baseline methods degrade markedly.

[350] Zero-Shot Text-to-Speech for Vietnamese

Thi Vu,Linh The Nguyen,Dat Quoc Nguyen

Main category: cs.CL

TL;DR: PhoAudiobook是一个新整理的越南语文本转语音数据集，包含941小时高质量音频。实验表明，该数据集显著提升了VALL-E、VoiceCraft和XTTS-V2等零样本TTS模型的性能，其中VALL-E和VoiceCraft在短句合成中表现更优。

Details

Motivation: 为越南语文本转语音研究提供高质量数据集，推动该领域的发展。 Method: 使用PhoAudiobook数据集对VALL-E、VoiceCraft和XTTS-V2三种零样本TTS模型进行实验。 Result: PhoAudiobook显著提升模型性能，VALL-E和VoiceCraft在短句合成中表现更优。 Conclusion: PhoAudiobook的发布将促进越南语文本转语音的进一步研究和开发。 Abstract: This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit superior performance in synthesizing short sentences, highlighting their robustness in handling diverse linguistic contexts. We publicly release PhoAudiobook to facilitate further research and development in Vietnamese text-to-speech.

[351] Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines

Guifeng Deng,Shuyin Rao,Tianyu Lin,Anlu Dai,Pan Wang,Junyi Xie,Haidong Song,Ke Zhao,Dongwu Xu,Zhengdong Cheng,Tao Li,Haiteng Jiang

Main category: cs.CL

TL;DR: 论文研究了大型语言模型（LLMs）在心理危机评估中的应用，通过PsyCrisisBench基准测试评估了64种LLM在情绪状态识别、自杀意念检测等任务中的表现，发现LLM在结构化任务中表现优异，但情绪识别仍具挑战性。

Details

Motivation: 心理支持热线需求激增，但面临资源不足的挑战。LLM可能提供支持，但其在情感敏感任务中的能力尚不明确。 Method: 使用540条标注的心理热线转录本构建PsyCrisisBench，评估64种LLM在四种任务中的表现（零样本、少样本和微调）。 Result: LLM在自杀意念检测（F1=0.880）和风险评估（F1=0.907）中表现优异，情绪识别较弱（F1=0.709）。开源模型与闭源模型差距缩小。 Conclusion: LLM在心理危机评估中潜力巨大，尤其是微调后。情绪识别仍需改进，开源模型与量化技术为实际应用提供了可行性。 Abstract: Psychological support hotlines are critical for crisis intervention but face significant challenges due to rising demand. Large language models (LLMs) could support crisis assessments, yet their capabilities in emotionally sensitive contexts remain unclear. We introduce PsyCrisisBench, a benchmark of 540 annotated transcripts from the Hangzhou Psychological Assistance Hotline, assessing four tasks: mood status recognition, suicidal ideation detection, suicide plan identification, and risk assessment. We evaluated 64 LLMs across 15 families (e.g., GPT, Claude, Gemini, Llama, Qwen, DeepSeek) using zero-shot, few-shot, and fine-tuning paradigms. Performance was measured by F1-score, with statistical comparisons via Welch's t-tests. LLMs performed strongly on suicidal ideation detection (F1=0.880), suicide plan identification (F1=0.779), and risk assessment (F1=0.907), improved with few-shot and fine-tuning. Mood status recognition was more challenging (max F1=0.709), likely due to lost vocal cues and ambiguity. A fine-tuned 1.5B-parameter model (Qwen2.5-1.5B) surpassed larger models on mood and suicidal ideation. Open-source models like QwQ-32B performed comparably to closed-source on most tasks (p>0.3), though closed models retained an edge in mood detection (p=0.007). Performance scaled with size up to a point; quantization (AWQ) reduced GPU memory by 70% with minimal F1 degradation. LLMs show substantial promise in structured psychological crisis assessments, especially with fine-tuning. Mood recognition remains limited due to contextual complexity. The narrowing gap between open- and closed-source models, combined with efficient quantization, suggests feasible integration. PsyCrisisBench offers a robust evaluation framework to guide model development and ethical deployment in mental health.

[352] Enhancing Interpretable Image Classification Through LLM Agents and Conditional Concept Bottleneck Models

Yiwen Jiang,Deval Mehta,Wei Feng,Zongyuan Ge

Main category: cs.CL

TL;DR: 提出动态代理方法优化概念数量，并引入CoCoBMs改进传统CBMs的评分机制，提升分类准确性和可解释性。

Details

Motivation: 解决现有概念库冗余或覆盖不足的问题，优化概念数量及其评分机制。 Method: 采用动态代理方法调整概念库，并设计CoCoBMs改进概念评分机制，支持LLM修正冲突评分。 Result: 在6个数据集上，分类准确性提升6%，可解释性评估提升30%。 Conclusion: 动态概念库和CoCoBMs显著提升了分类性能和可解释性。 Abstract: Concept Bottleneck Models (CBMs) decompose image classification into a process governed by interpretable, human-readable concepts. Recent advances in CBMs have used Large Language Models (LLMs) to generate candidate concepts. However, a critical question remains: What is the optimal number of concepts to use? Current concept banks suffer from redundancy or insufficient coverage. To address this issue, we introduce a dynamic, agent-based approach that adjusts the concept bank in response to environmental feedback, optimizing the number of concepts for sufficiency yet concise coverage. Moreover, we propose Conditional Concept Bottleneck Models (CoCoBMs) to overcome the limitations in traditional CBMs' concept scoring mechanisms. It enhances the accuracy of assessing each concept's contribution to classification tasks and feature an editable matrix that allows LLMs to correct concept scores that conflict with their internal knowledge. Our evaluations across 6 datasets show that our method not only improves classification accuracy by 6% but also enhances interpretability assessments by 30%.

[353] The Landscape of Arabic Large Language Models (ALLMs): A New Era for Arabic Language Technology

Shahad Al-Khalifa,Nadir Durrani,Hend Al-Khalifa,Firoj Alam

Main category: cs.CL

TL;DR: 本文探讨了阿拉伯语大型语言模型（ALLMs）的发展历程、挑战与机遇，强调其在填补技术鸿沟和赋能社区方面的潜力。

Details

Motivation: 英语用户已从大型语言模型（LLMs）中受益，但阿拉伯语世界面临开发阿拉伯语特定模型的挑战，需要填补技术差距并赋能社区。 Method: 回顾ALLMs的发展轨迹，从基础文本处理系统到复杂AI驱动模型，并通过基准测试和公开排行榜评估这些模型。 Result: ALLMs的发展展示了其在阿拉伯世界的潜力，但也揭示了语言和文化复杂性带来的挑战。 Conclusion: 开发ALLMs是阿拉伯世界的重要机遇，尽管面临挑战，但其潜力巨大，值得进一步探索和投资。 Abstract: The emergence of ChatGPT marked a transformative milestone for Artificial Intelligence (AI), showcasing the remarkable potential of Large Language Models (LLMs) to generate human-like text. This wave of innovation has revolutionized how we interact with technology, seamlessly integrating LLMs into everyday tasks such as vacation planning, email drafting, and content creation. While English-speaking users have significantly benefited from these advancements, the Arabic world faces distinct challenges in developing Arabic-specific LLMs. Arabic, one of the languages spoken most widely around the world, serves more than 422 million native speakers in 27 countries and is deeply rooted in a rich linguistic and cultural heritage. Developing Arabic LLMs (ALLMs) presents an unparalleled opportunity to bridge technological gaps and empower communities. The journey of ALLMs has been both fascinating and complex, evolving from rudimentary text processing systems to sophisticated AI-driven models. This article explores the trajectory of ALLMs, from their inception to the present day, highlighting the efforts to evaluate these models through benchmarks and public leaderboards. We also discuss the challenges and opportunities that ALLMs present for the Arab world.

[354] TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

Yiran Zhang,Mo Wang,Xiaoyang Li,Kaixuan Ren,Chencheng Zhu,Usman Naseem

Main category: cs.CL

TL;DR: TurnBench是一个新的基准测试，用于评估多轮、多步推理能力，通过交互式代码破解任务测试模型在动态环境中的表现。

Details

Motivation: 现有基准测试多关注单轮或单步任务，无法捕捉真实场景中所需的迭代推理能力。 Method: 引入TurnBench，包含Classic和Nightmare两种模式，通过隐藏规则和反馈循环测试模型的推理能力。 Result: 最佳模型在Classic模式下准确率为81.5%，但在Nightmare模式下降至17.8%，而人类表现均为100%。 Conclusion: TurnBench为诊断和提升LLMs的多步、多轮推理能力提供了严格的测试平台。 Abstract: Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single-turn or single-step tasks, failing to capture the kind of iterative reasoning required in real-world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task inspired by a "Turing Machine Board Game." In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps-capabilities underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. To support fine-grained analysis, we provide ground-truth annotations for intermediate reasoning steps. Our evaluation of state-of-the-art LLMs reveals significant gaps: the best model achieves 81.5% accuracy in Classic mode, but performance drops to 17.8% in Nightmare mode. In contrast, human participants achieve 100% in both, underscoring the challenge TurnBench poses to current models. By incorporating feedback loops and hiding task rules, TurnBench reduces contamination risks and provides a rigorous testbed for diagnosing and advancing multi-step, multi-turn reasoning in LLMs.

[355] Follow the Flow: Fine-grained Flowchart Attribution with Neurosymbolic Agents

Manan Suri,Puneet Mathur,Nedim Lipka,Franck Dernoncourt,Ryan A. Rossi,Vivek Gupta,Dinesh Manocha

Main category: cs.CL

TL;DR: 论文提出了一种名为FlowPathAgent的神经符号代理，用于解决流程图分析中LLM的视觉幻觉问题，并通过细粒度归因提升解释性。

Details

Motivation: 流程图在关键领域（如物流、医疗和工程）中具有重要作用，但LLM在分析流程图时容易产生视觉幻觉，导致可靠性下降。 Method: 提出FlowPathAgent，通过图推理进行细粒度归因：先分割流程图并转换为符号图，再动态交互生成归因路径。 Result: FlowPathAgent在FlowExplainBench数据集上比基线方法性能提升10-14%，有效减少了视觉幻觉。 Conclusion: FlowPathAgent通过细粒度归因提升了流程图分析的可靠性和解释性，为关键领域提供了更可信的自动化处理方案。 Abstract: Flowcharts are a critical tool for visualizing decision-making processes. However, their non-linear structure and complex visual-textual relationships make it challenging to interpret them using LLMs, as vision-language models frequently hallucinate nonexistent connections and decision paths when analyzing these diagrams. This leads to compromised reliability for automated flowchart processing in critical domains such as logistics, health, and engineering. We introduce the task of Fine-grained Flowchart Attribution, which traces specific components grounding a flowchart referring LLM response. Flowchart Attribution ensures the verifiability of LLM predictions and improves explainability by linking generated responses to the flowchart's structure. We propose FlowPathAgent, a neurosymbolic agent that performs fine-grained post hoc attribution through graph-based reasoning. It first segments the flowchart, then converts it into a structured symbolic graph, and then employs an agentic approach to dynamically interact with the graph, to generate attribution paths. Additionally, we present FlowExplainBench, a novel benchmark for evaluating flowchart attributions across diverse styles, domains, and question types. Experimental results show that FlowPathAgent mitigates visual hallucinations in LLM answers over flowchart QA, outperforming strong baselines by 10-14% on our proposed FlowExplainBench dataset.

[356] The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Xinyu Zhu,Mengzhou Xia,Zhepei Wei,Wei-Lin Chen,Danqi Chen,Yu Meng

Main category: cs.CL

TL;DR: RLVR通过分解学习信号为正向和负向样本强化（PSR和NSR），发现仅使用负向样本训练（NSR）在数学推理任务中效果显著，甚至优于传统方法。

Details

Motivation: 探索强化学习在语言模型推理任务中的机制，尤其是正向和负向样本对性能的影响。 Method: 分解RLVR学习信号为PSR和NSR，训练Qwen2.5-Math-7B和Qwen3-4B模型，分析梯度。 Result: 仅NSR训练显著提升性能，尤其在Pass@k中表现优异，而仅PSR会降低多样性。 Conclusion: 负向样本强化（NSR）对性能贡献更大，提出优化RL目标以提升整体性能。 Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B and Qwen3-4B on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples -- without reinforcing correct responses -- can be highly effective: it consistently improves performance over the base model across the entire Pass@$k$ spectrum ($k$ up to $256$), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@$1$ but degrades performance at higher $k$, due to reduced diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized. Through gradient analysis, we show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs. It refines the model's existing knowledge rather than introducing entirely new behaviors. Building on this insight, we propose a simple variant of the RL objective that upweights NSR, and show that it consistently improves overall Pass@$k$ performance on MATH, AIME 2025, and AMC23. Our code is available at https://github.com/TianHongZXY/RLVR-Decomposed.

[357] KokoroChat: A Japanese Psychological Counseling Dialogue Dataset Collected via Role-Playing by Trained Counselors

Zhiyang Qi,Takumasa Kaneko,Keiko Takamizo,Mariko Ukiyo,Michimasa Inaba

Main category: cs.CL

TL;DR: 研究通过角色扮演方法构建高质量心理咨询对话数据集KokoroChat，提升语言模型生成咨询回复的质量。

Details

Motivation: 现有心理咨询对话数据集存在多样性不足和隐私问题，需高质量数据支持语言模型训练。 Method: 采用角色扮演方法，由训练有素的咨询师模拟咨询对话，构建包含6,589条长对话的KokoroChat数据集。 Result: 实验显示，基于KokoroChat微调的开源语言模型提升了咨询回复质量和自动评估效果。 Conclusion: 角色扮演方法有效解决了数据质量和隐私问题，KokoroChat为心理咨询对话研究提供了高质量资源。 Abstract: Generating psychological counseling responses with language models relies heavily on high-quality datasets. Crowdsourced data collection methods require strict worker training, and data from real-world counseling environments may raise privacy and ethical concerns. While recent studies have explored using large language models (LLMs) to augment psychological counseling dialogue datasets, the resulting data often suffers from limited diversity and authenticity. To address these limitations, this study adopts a role-playing approach where trained counselors simulate counselor-client interactions, ensuring high-quality dialogues while mitigating privacy risks. Using this method, we construct KokoroChat, a Japanese psychological counseling dialogue dataset comprising 6,589 long-form dialogues, each accompanied by comprehensive client feedback. Experimental results demonstrate that fine-tuning open-source LLMs with KokoroChat improves both the quality of generated counseling responses and the automatic evaluation of counseling dialogues. The KokoroChat dataset is available at https://github.com/UEC-InabaLab/KokoroChat.

[358] MMD-Flagger: Leveraging Maximum Mean Discrepancy to Detect Hallucinations

Kensuke Mitsuzawa,Damien Garreau

Main category: cs.CL

TL;DR: 提出了一种基于最大均值差异（MMD）的新方法MMD-Flagger，用于检测大语言模型生成的幻觉内容，并在机器翻译数据集上表现优于现有方法。

Details

Motivation: 大语言模型生成的幻觉内容（不真实的流畅文本）阻碍了其在关键应用中的使用，因此检测幻觉内容至关重要。 Method: 利用MMD（非参数分布距离）跟踪生成文档与不同温度参数生成的文档之间的差异，通过分析轨迹形状检测幻觉。 Result: 在机器翻译数据集上，MMD-Flagger优于其他竞争方法。 Conclusion: MMD-Flagger是一种有效的幻觉检测方法，适用于大语言模型生成内容的真实性验证。 Abstract: Large language models (LLMs) have become pervasive in our everyday life. Yet, a fundamental obstacle prevents their use in many critical applications: their propensity to generate fluent, human-quality content that is not grounded in reality. The detection of such hallucinations is thus of the highest importance. In this work, we propose a new method to flag hallucinated content, MMD-Flagger. It relies on Maximum Mean Discrepancy (MMD), a non-parametric distance between distributions. On a high-level perspective, MMD-Flagger tracks the MMD between the generated documents and documents generated with various temperature parameters. We show empirically that inspecting the shape of this trajectory is sufficient to detect most hallucinations. This novel method is benchmarked on two machine translation datasets, on which it outperforms natural competitors.

[359] AdaRewriter: Unleashing the Power of Prompting-based Conversational Query Reformulation via Test-Time Adaptation

Yilong Lai,Jialong Wu,Zhenglin Wang,Deyu Zhou

Main category: cs.CL

TL;DR: AdaRewriter是一个基于测试时适应的查询重写框架，通过轻量级奖励模型选择最佳重写方案，显著优于现有方法。

Details

Motivation: 现有方法在训练时和测试时均无法充分发挥提示式查询重写的潜力，需要更高效的适应机制。 Method: 提出AdaRewriter框架，使用对比排序损失训练奖励模型，在推理时选择最优重写方案。 Result: 在五个对话搜索数据集上，AdaRewriter显著优于现有方法。 Conclusion: 测试时适应在对话查询重写中具有巨大潜力，AdaRewriter展示了其有效性。 Abstract: Prompting-based conversational query reformulation has emerged as a powerful approach for conversational search, refining ambiguous user queries into standalone search queries. Best-of-N reformulation over the generated candidates via prompting shows impressive potential scaling capability. However, both the previous tuning methods (training time) and adaptation approaches (test time) can not fully unleash their benefits. In this paper, we propose AdaRewriter, a novel framework for query reformulation using an outcome-supervised reward model via test-time adaptation. By training a lightweight reward model with contrastive ranking loss, AdaRewriter selects the most promising reformulation during inference. Notably, it can operate effectively in black-box systems, including commercial LLM APIs. Experiments on five conversational search datasets show that AdaRewriter significantly outperforms the existing methods across most settings, demonstrating the potential of test-time adaptation for conversational query reformulation.

[360] Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages

Andrei Popescu-Belis,Alexis Allemann,Teo Ferrari,Gopal Krishnamani

Main category: cs.CL

TL;DR: 研究了土耳其语和普什图语与法语之间的自动语音翻译系统，通过多种指标评估了60多种流程，并确定了最佳方案。

Details

Motivation: 针对低资源语言的社区口译需求，提升语音翻译质量。 Method: 收集数据并微调模型，结合自动语音识别、机器翻译和语音合成，评估60多种流程。 Result: 确定了每种语言方向的最佳流程，并发现组件性能独立于流程其他部分。 Conclusion: 为低资源语言的语音翻译提供了优化方案，组件选择对性能至关重要。 Abstract: The popularity of automatic speech-to-speech translation for human conversations is growing, but the quality varies significantly depending on the language pair. In a context of community interpreting for low-resource languages, namely Turkish and Pashto to/from French, we collected fine-tuning and testing data, and compared systems using several automatic metrics (BLEU, COMET, and BLASER) and human assessments. The pipelines included automatic speech recognition, machine translation, and speech synthesis, with local models and cloud-based commercial ones. Some components have been fine-tuned on our data. We evaluated over 60 pipelines and determined the best one for each direction. We also found that the ranks of components are generally independent of the rest of the pipeline.

[361] Comparing LLM-generated and human-authored news text using formal syntactic theory

Olga Zamaraeva,Dan Flickinger,Francis Bond,Carlos Gómez-Rodríguez

Main category: cs.CL

TL;DR: 研究首次比较了六种大语言模型生成的《纽约时报》风格文本与真实人类写作的差异，基于形式句法理论HPSG分析，揭示了系统性的语法分布区别。

Details

Motivation: 探索大语言模型与人类在《纽约时报》风格写作中的句法行为差异，以深化对两者语法表现的理解。 Method: 使用Head-driven Phrase Structure Grammar (HPSG)分析文本的语法结构，比较HPSG语法类型的分布。 Result: 发现人类与LLM生成的文本在HPSG语法类型分布上存在系统性差异。 Conclusion: 研究增进了对LLM和人类在特定写作风格中语法行为的理解。 Abstract: This study provides the first comprehensive comparison of New York Times-style text generated by six large language models against real, human-authored NYT writing. The comparison is based on a formal syntactic theory. We use Head-driven Phrase Structure Grammar (HPSG) to analyze the grammatical structure of the texts. We then investigate and illustrate the differences in the distributions of HPSG grammar types, revealing systematic distinctions between human and LLM-generated writing. These findings contribute to a deeper understanding of the syntactic behavior of LLMs as well as humans, within the NYT genre.

[362] UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

Joseph Marvin Imperial,Abdullah Barayan,Regina Stodden,Rodrigo Wilkens,Ricardo Munoz Sanchez,Lingyun Gao,Melissa Torgbi,Dawn Knight,Gail Forey,Reka R. Jablonkai,Ekaterina Kochmar,Robert Reynolds,Eugenio Ribeiro,Horacio Saggion,Elena Volodina,Sowmya Vajjala,Thomas Francois,Fernando Alva-Manchego,Harish Tayyar Madabushi

Main category: cs.CL

TL;DR: UniversalCEFR是一个多语言、多维度的数据集，包含505,807篇按CEFR标准标注的文本，支持13种语言，旨在推动自动可读性和语言能力评估的研究。

Details

Motivation: 为语言能力评估和自动可读性研究提供一个统一、标准化的数据集，促进全球研究社区的开放研究。 Method: 数据集包含来自教育和学习者资源的文本，标准化为统一格式。实验采用三种建模范式：基于语言特征的分类、微调预训练大语言模型（LLMs）和基于描述符的指令调优LLMs。 Result: 实验结果表明，语言特征和微调预训练模型在多语言CEFR级别评估中表现良好。 Conclusion: UniversalCEFR通过标准化数据集格式和提升其可访问性，旨在为语言能力研究建立最佳实践。 Abstract: We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.

[363] Self-Refining Language Model Anonymizers via Adversarial Distillation

Kyuyoung Kim,Hyunjun Jeon,Jinwoo Shin

Main category: cs.CL

TL;DR: SEAL框架通过对抗蒸馏训练小型语言模型（SLM）进行高效匿名化，避免依赖外部昂贵模型，并在隐私-效用权衡上媲美甚至超越GPT-4。

Details

Motivation: 解决现有LLM匿名化方法依赖昂贵且可能不安全的专有模型（如GPT-4）的问题。 Method: 利用LLM匿名器与推理模型的对抗交互收集匿名化文本轨迹，通过监督微调和偏好学习将能力蒸馏到SLM中。 Result: 8B规模的SLM在隐私-效用权衡上媲美GPT-4，并通过自优化在隐私性上超越GPT-4。 Conclusion: SEAL框架有效训练SLM成为高效匿名化工具，为相关研究提供了数据集支持。 Abstract: Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data from seemingly benign text poses emerging privacy risks. While recent LLM-based anonymization methods help mitigate such risks, they often rely on proprietary models (e.g., GPT-4), raising concerns about cost and the potential exposure of sensitive data to untrusted external systems. To address this, we introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization without relying on external costly models at inference time. We leverage adversarial interactions between an LLM anonymizer and an inference model to collect trajectories of anonymized texts and inferred attributes, which are used to distill anonymization, adversarial inference, and utility evaluation capabilities into SLMs via supervised fine-tuning and preference learning. The resulting models learn to both anonymize text and critique their outputs, enabling iterative improvement of anonymization quality via self-refinement. Experiments on SynthPAI, a dataset of synthetic personal profiles and text comments, demonstrate that SLMs trained with SEAL achieve substantial improvements in anonymization capabilities. Notably, 8B models attain a privacy-utility trade-off comparable to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in terms of privacy. These results show the effectiveness of our adversarial distillation framework in training SLMs as efficient anonymizers. To facilitate further research, we release the full dataset used in our experiments.

[364] Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings

Hayato Tsukagoshi,Ryohei Sasano

Main category: cs.CL

TL;DR: 论文研究了基于提示的文本嵌入模型的高维冗余问题，发现即使大幅降维，任务性能下降很小，尤其是分类和聚类任务。

Details

Motivation: 高维嵌入导致存储和计算成本高，研究降维对任务性能的影响。 Method: 应用后处理降维技术，分析嵌入的固有维度和各向同性。 Result: 降维后性能下降很小，分类和聚类任务嵌入冗余性高。 Conclusion: 嵌入存在高维冗余，降维可行且对性能影响小。 Abstract: Prompt-based text embedding models, which generate task-specific embeddings upon receiving tailored prompts, have recently demonstrated remarkable performance. However, their resulting embeddings often have thousands of dimensions, leading to high storage costs and increased computational costs of embedding-based operations. In this paper, we investigate how post-hoc dimensionality reduction applied to the embeddings affects the performance of various tasks that leverage these embeddings, specifically classification, clustering, retrieval, and semantic textual similarity (STS) tasks. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in a very slight performance degradation, indicating that these embeddings are highly redundant. Notably, for classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality the performance degradation is very small. To quantitatively analyze this redundancy, we perform an analysis based on the intrinsic dimensionality and isotropy of the embeddings. Our analysis reveals that embeddings for classification and clustering, which are considered to have very high dimensional redundancy, exhibit lower intrinsic dimensionality and less isotropy compared with those for retrieval and STS.

[365] Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data

Yosuke Kashiwagi,Hayato Futami,Emiru Tsunoo,Satoshi Asakawa

Main category: cs.CL

TL;DR: 论文介绍了一种名为Whale的大规模语音识别模型，结合了w2v-BERT自监督模型、E-Branchformer编码器-解码器架构和联合CTC-注意力解码策略，在多个基准测试中表现优异。

Details

Motivation: 开发一种高性能的语音识别模型，通过结合先进技术和多样化数据集，提升对不同说话风格和声学条件的鲁棒性。 Method: 采用w2v-BERT自监督模型、E-Branchformer编码器-解码器架构和联合CTC-注意力解码策略，训练数据包括公开和内部数据集。 Result: 在Librispeech test-clean集上词错误率为2.4%，在CSJ eval3集上字符错误率为3.4%，优于Whisper large-v3和OWSM v3.1。 Conclusion: Whale模型在性能上超越了现有模型，展示了其在大规模语音识别任务中的潜力。 Abstract: This paper reports on the development of a large-scale speech recognition model, Whale. Similar to models such as Whisper and OWSM, Whale leverages both a large model size and a diverse, extensive dataset. Whale's architecture integrates w2v-BERT self-supervised model, an encoder-decoder backbone built on E-Branchformer, and a joint CTC-attention decoding strategy. The training corpus comprises varied speech data, of not only public corpora but also in-house data, thereby enhancing the model's robustness to different speaking styles and acoustic conditions. Through evaluations on multiple benchmarks, Whale achieved comparable performance to existing models. In particular, it achieves a word error rate of 2.4% on the Librispeech test-clean set and a character error rate of 3.4% on the CSJ eval3 set, outperforming Whisper large-v3 and OWSM v3.1.

[366] Building Entity Association Mining Framework for Knowledge Discovery

Anshika Rawal,Abhijeet Kumar,Mridul Mishra

Main category: cs.CL

TL;DR: 论文提出了一种通用框架，用于从非结构化文本中提取信号和模式，支持商业决策，包括文档过滤、实体提取和关联挖掘。

Details

Motivation: 从非结构化文本中提取有用信息以支持商业决策（如投资产品分析、客户偏好发现和风险监控）是一个挑战性任务。 Method: 框架包含三个主要组件：文档过滤、可配置的实体提取管道（使用多种技术）和关联关系挖掘（生成共现图）。 Result: 框架在金融用例（如品牌产品发现和供应商风险监控）中展示了其有效性。 Conclusion: 该框架旨在减少重复工作，降低开发成本，并促进关联挖掘业务应用的可重用性和快速原型设计。 Abstract: Extracting useful signals or pattern to support important business decisions for example analyzing investment product traction and discovering customer preference, risk monitoring etc. from unstructured text is a challenging task. Capturing interaction of entities or concepts and association mining is a crucial component in text mining, enabling information extraction and reasoning over and knowledge discovery from text. Furthermore, it can be used to enrich or filter knowledge graphs to guide exploration processes, descriptive analytics and uncover hidden stories in the text. In this paper, we introduce a domain independent pipeline i.e., generalized framework to enable document filtering, entity extraction using various sources (or techniques) as plug-ins and association mining to build any text mining business use-case and quantitatively define a scoring metric for ranking purpose. The proposed framework has three major components a) Document filtering: filtering documents/text of interest from massive amount of texts b) Configurable entity extraction pipeline: include entity extraction techniques i.e., i) DBpedia Spotlight, ii) Spacy NER, iii) Custom Entity Matcher, iv) Phrase extraction (or dictionary) based c) Association Relationship Mining: To generates co-occurrence graph to analyse potential relationships among entities, concepts. Further, co-occurrence count based frequency statistics provide a holistic window to observe association trends or buzz rate in specific business context. The paper demonstrates the usage of framework as fundamental building box in two financial use-cases namely brand product discovery and vendor risk monitoring. We aim that such framework will remove duplicated effort, minimize the development effort, and encourage reusability and rapid prototyping in association mining business applications for institutions.

[367] TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 Challenge

Tanel Alumäe,Artem Fedorchenko

Main category: cs.CL

TL;DR: 本文介绍了塔林理工大学为Interspeech 2025 ML-SUPERB 2.0挑战赛开发的语言识别和多语言语音识别系统，采用混合方法并取得最佳成绩。

Details

Motivation: 开发高效的语言识别和语音识别系统，以应对多语言环境下的挑战。 Method: 使用混合语言识别系统（预训练语言嵌入模型和轻量级语音识别模型），并结合三种语音识别模型（SeamlessM4T、MMS-1B-all和MMS-zeroshot）。 Result: 系统在挑战赛中获得了最高分。 Conclusion: 该混合方法在多语言语音识别任务中表现出色，验证了其有效性。 Abstract: This paper describes the language identification and multilingual speech recognition system developed at Tallinn University of Technology for the Interspeech 2025 ML-SUPERB 2.0 Challenge. A hybrid language identification system is used, consisting of a pretrained language embedding model and a light-weight speech recognition model with a shared encoder across languages and language-specific bigram language models. For speech recognition, three models are used, where only a single model is applied for each language, depending on the training data availability and performance on held-out data. The model set consists of a finetuned version of SeamlessM4T, MMS-1B-all with custom language adapters and MMS-zeroshot. The system obtained the top overall score in the challenge.

[368] Integrating Neural and Symbolic Components in a Model of Pragmatic Question-Answering

Polina Tsvilodub,Robert D. Hawkins,Michael Franke

Main category: cs.CL

TL;DR: 提出了一种结合神经符号框架的认知模型，利用LLM模块增强实用性，减少人工干预，在语用问答任务中表现优异，但需注意模块整合方式。

Details

Motivation: 传统语用语言计算模型依赖人工定义的语句和意义，限制了实际应用。 Method: 通过神经符号框架整合LLM模块，自动生成和评估自然语言关键组件，并在语用问答案例中测试不同整合方式。 Result: 混合模型在预测人类回答模式上表现优于传统概率模型，但LLM在语义评估方面存在挑战。 Conclusion: 该研究为更灵活、可扩展的语用语言模型提供了路径，同时强调了神经与符号组件平衡的设计考量。 Abstract: Computational models of pragmatic language use have traditionally relied on hand-specified sets of utterances and meanings, limiting their applicability to real-world language use. We propose a neuro-symbolic framework that enhances probabilistic cognitive models by integrating LLM-based modules to propose and evaluate key components in natural language, eliminating the need for manual specification. Through a classic case study of pragmatic question-answering, we systematically examine various approaches to incorporating neural modules into the cognitive model -- from evaluating utilities and literal semantics to generating alternative utterances and goals. We find that hybrid models can match or exceed the performance of traditional probabilistic models in predicting human answer patterns. However, the success of the neuro-symbolic model depends critically on how LLMs are integrated: while they are particularly effective for proposing alternatives and transforming abstract goals into utilities, they face challenges with truth-conditional semantic evaluation. This work charts a path toward more flexible and scalable models of pragmatic language use while illuminating crucial design considerations for balancing neural and symbolic components.

[369] LLM in the Loop: Creating the PARADEHATE Dataset for Hate Speech Detoxification

Shuzhou Yuan,Ercong Nie,Lukas Kouba,Ashish Yashwanth Kangen,Helmut Schmid,Hinrich Schutze,Michael Farber

Main category: cs.CL

TL;DR: 论文提出了一种利用GPT-4o-mini自动生成去毒文本的LLM-in-the-loop流程，并构建了大规模仇恨言论去毒数据集PARADEHATE，验证了其有效性。

Details

Motivation: 在线有毒内容日益增多，但高质量的去毒平行数据集稀缺，尤其是仇恨言论领域，人工标注成本高且敏感。 Method: 提出LLM-in-the-loop流程，用GPT-4o-mini替代人工标注，构建PARADEHATE数据集（8K仇恨/非仇恨文本对），并评估多种基线方法。 Result: 实验表明，基于PARADEHATE微调的BART等模型在风格准确性、内容保留和流畅性上表现更优，验证了LLM生成去毒文本的可扩展性。 Conclusion: LLM生成去毒文本可作为人工标注的高效替代方案，PARADEHATE为仇恨言论去毒提供了有价值的基准。 Abstract: Detoxification, the task of rewriting harmful language into non-toxic text, has become increasingly important amid the growing prevalence of toxic content online. However, high-quality parallel datasets for detoxification, especially for hate speech, remain scarce due to the cost and sensitivity of human annotation. In this paper, we propose a novel LLM-in-the-loop pipeline leveraging GPT-4o-mini for automated detoxification. We first replicate the ParaDetox pipeline by replacing human annotators with an LLM and show that the LLM performs comparably to human annotation. Building on this, we construct PARADEHATE, a large-scale parallel dataset specifically for hatespeech detoxification. We release PARADEHATE as a benchmark of over 8K hate/non-hate text pairs and evaluate a wide range of baseline methods. Experimental results show that models such as BART, fine-tuned on PARADEHATE, achieve better performance in style accuracy, content preservation, and fluency, demonstrating the effectiveness of LLM-generated detoxification text as a scalable alternative to human annotation.

[370] Argument-Centric Causal Intervention Method for Mitigating Bias in Cross-Document Event Coreference Resolution

Long Yao,Wenzhong Yang,Yabo Yin,Fuyuan Wei,Hongzhen Lv,Jiaren Peng,Liejun Wang,Xiaoming Tao

Main category: cs.CL

TL;DR: 提出了一种基于Argument-Centric Causal Intervention (ACCI)的新方法，用于解决跨文档事件共指消解中的虚假相关性问题，显著提升了性能。

Details

Motivation: 当前跨文档事件共指消解方法过度依赖触发词特征，导致表面词汇特征与共指关系之间的虚假相关性，影响模型性能。 Method: 构建结构因果图以揭示词汇触发词与共指标签之间的混杂依赖关系，并引入后门调整干预来隔离论元语义的真实因果效应。进一步通过反事实推理模块和论元感知增强模块减少虚假相关性。 Result: 在ECB+和GVC数据集上分别达到88.4%和85.2%的CoNLL F1分数，实现了最先进的性能。 Conclusion: ACCI方法在统一的端到端框架中有效去偏，无需改变训练过程，显著提升了跨文档事件共指消解的性能。 Abstract: Cross-document Event Coreference Resolution (CD-ECR) is a fundamental task in natural language processing (NLP) that seeks to determine whether event mentions across multiple documents refer to the same real-world occurrence. However, current CD-ECR approaches predominantly rely on trigger features within input mention pairs, which induce spurious correlations between surface-level lexical features and coreference relationships, impairing the overall performance of the models. To address this issue, we propose a novel cross-document event coreference resolution method based on Argument-Centric Causal Intervention (ACCI). Specifically, we construct a structural causal graph to uncover confounding dependencies between lexical triggers and coreference labels, and introduce backdoor-adjusted interventions to isolate the true causal effect of argument semantics. To further mitigate spurious correlations, ACCI integrates a counterfactual reasoning module that quantifies the causal influence of trigger word perturbations, and an argument-aware enhancement module to promote greater sensitivity to semantically grounded information. In contrast to prior methods that depend on costly data augmentation or heuristic-based filtering, ACCI enables effective debiasing in a unified end-to-end framework without altering the underlying training procedure. Extensive experiments demonstrate that ACCI achieves CoNLL F1 of 88.4% on ECB+ and 85.2% on GVC, achieving state-of-the-art performance. The implementation and materials are available at https://github.com/era211/ACCI.

[371] Multilingual Definition Modeling

Edison Marrese-Taylor,Erica K. Shimomoto,Alfredo Solano,Enrique Reid

Main category: cs.CL

TL;DR: 本文首次提出多语言定义建模研究，测试了预训练多语言模型在四种新语言上的表现，并评估了大型语言模型的零样本能力。

Details

Motivation: 探索多语言定义建模的可行性，验证预训练模型和大型语言模型在此任务中的表现。 Method: 使用单语词典数据微调预训练多语言模型，并采用零样本方法测试大型语言模型。 Result: 多语言模型表现与英语相当但未能利用跨语言协同效应，大型语言模型整体表现更优。BERTScore与多语言基准表现强相关。 Conclusion: 多语言定义建模任务可作为计算受限、稳定且自然的替代方案，大型语言模型在零样本和少样本场景下表现突出。 Abstract: In this paper, we propose the first multilingual study on definition modeling. We use monolingual dictionary data for four new languages (Spanish, French, Portuguese, and German) and perform an in-depth empirical study to test the performance of pre-trained multilingual language models on definition modeling of monosemic words when finetuned on this data. Furthermore, we use a zero-shot approach to test the multilingual capabilities of two popular chat-based Large Language Models (LLMs) in the task. Results show that multilingual language models can perform on-pair with English but cannot leverage potential cross-lingual synergies, with LLMs generally offering better performance overall. A comprehensive human evaluation of the LLM-generated definition highlights the zero and few-shot capabilities of these models in this new task, also showing their shortcomings. Finally, we show that performance on our task via BERTScore strongly correlates to the performance on multilingual LLM benchmarks, suggesting that our task offers a viable compute-constrained, stable and natural alternative to these.

[372] CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models

Ping Wu,Guobin Shen,Dongcheng Zhao,Yuwei Wang,Yiting Dong,Yu Shi,Enmeng Lu,Feifei Zhao,Yi Zeng

Main category: cs.CL

TL;DR: 提出了一种基于中国核心价值观的分层价值框架，并构建了大规模中文价值观语料库（CVC），用于评估和调整大型语言模型（LLM）的价值对齐。

Details

Motivation: 当前的价值评估和对齐受限于西方文化偏见和不完善的国内框架，缺乏可扩展的规则驱动场景生成方法。 Method: 提出分层价值框架，构建CVC语料库，并通过人工标注增强和扩展。 Result: CVC生成的场景在价值边界和内容多样性上表现更优，主流LLM在70.5%的情况下偏好CVC选项，人类标注者与CVC对齐率达87.5%。 Conclusion: 建立了一个文化适应性强的基准框架，用于全面价值评估和对齐，体现了中国特色。 Abstract: Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly and inadequate across diverse cultural contexts. To address these challenges, we propose a hierarchical value framework grounded in core Chinese values, encompassing three main dimensions, 12 core values, and 50 derived values. Based on this framework, we construct a large-scale Chinese Values Corpus (CVC) containing over 250,000 value rules enhanced and expanded through human annotation. Experimental results show that CVC-guided scenarios outperform direct generation ones in value boundaries and content diversity. In the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while five Chinese human annotators showed an 87.5% alignment with CVC, confirming its universality, cultural relevance, and strong alignment with Chinese values. Additionally, we construct 400,000 rule-based moral dilemma scenarios that objectively capture nuanced distinctions in conflicting value prioritization across 17 LLMs. Our work establishes a culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment, representing Chinese characteristics. All data are available at https://huggingface.co/datasets/Beijing-AISI/CVC, and the code is available at https://github.com/Beijing-AISI/CVC.

[373] Continual Speech Learning with Fused Speech Features

Guitao Wang,Jinming Zhao,Hao Yang,Guilin Qi,Tongtong Wu,Gholamreza Haffari

Main category: cs.CL

TL;DR: 提出了一种名为连续语音学习的新方法，通过动态选择任务特征改进语音模型适应性，显著提升了六项语音任务的准确性。

Details

Motivation: 传统静态方法无法适应动态多样的语音数据，需要更灵活的模型。 Method: 使用Whisper编码器-解码器模型，并集成可学习的门控融合层动态选择任务特征。 Result: 在六项语音任务中显著优于传统方法，无需完全重新训练即可适应新任务。 Conclusion: 连续语音学习方法有效解决了语音模型适应性问题，具有广泛的应用潜力。 Abstract: Rapid growth in speech data demands adaptive models, as traditional static methods fail to keep pace with dynamic and diverse speech information. We introduce continuous speech learning, a new set-up targeting at bridging the adaptation gap in current speech models. We use the encoder-decoder Whisper model to standardize speech tasks into a generative format. We integrate a learnable gated-fusion layer on the top of the encoder to dynamically select task-specific features for downstream tasks. Our approach improves accuracy significantly over traditional methods in six speech processing tasks, demonstrating gains in adapting to new speech tasks without full retraining.

[374] Representations of Fact, Fiction and Forecast in Large Language Models: Epistemics and Attitudes

Meng Li,Michael Vrazitulis,David Schlangen

Main category: cs.CL

TL;DR: 论文探讨了大型语言模型（LLMs）在不确定环境中生成基于事实和信心的表达的能力不足，并提出通过评估LLMs对情态知识的掌握来改进其不确定性表达。

Details

Motivation: 理性说话者能根据证据强度生成相应表达，而LLMs在不确定环境中生成可靠表达仍具挑战性。研究旨在填补对LLMs潜在空间中不确定性语言知识的研究空白。 Method: 利用类型学框架和受控故事评估LLMs对情态知识的掌握。 Result: LLMs生成情态表达的能力有限且不稳定，其不确定性表达不可靠。 Conclusion: 为构建不确定性感知的LLMs，需丰富其情态语义知识。 Abstract: Rational speakers are supposed to know what they know and what they do not know, and to generate expressions matching the strength of evidence. In contrast, it is still a challenge for current large language models to generate corresponding utterances based on the assessment of facts and confidence in an uncertain real-world environment. While it has recently become popular to estimate and calibrate confidence of LLMs with verbalized uncertainty, what is lacking is a careful examination of the linguistic knowledge of uncertainty encoded in the latent space of LLMs. In this paper, we draw on typological frameworks of epistemic expressions to evaluate LLMs' knowledge of epistemic modality, using controlled stories. Our experiments show that the performance of LLMs in generating epistemic expressions is limited and not robust, and hence the expressions of uncertainty generated by LLMs are not always reliable. To build uncertainty-aware LLMs, it is necessary to enrich semantic knowledge of epistemic modality in LLMs.

[375] FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents

Bobo Li,Yuheng Wang,Hao Fei,Juncheng Li,Wei Ji,Mong-Li Lee,Wynne Hsu

Main category: cs.CL

TL;DR: 论文提出FormFactory，一个用于评估多模态大语言模型在表单填写任务中表现的交互式基准套件，发现现有模型准确率不足5%，揭示了其在视觉布局推理和字段值对齐方面的局限性。

Details

Motivation: 在线表单填写是一项常见但劳动密集的任务，现有工具多为基于规则且缺乏通用生成能力。多模态大语言模型在GUI相关任务中展现出潜力，但在表单填写任务中面临独特挑战。 Method: 提出FormFactory，一个包含网页界面、后端评估模块和精心构建数据集的交互式基准套件，覆盖多样化真实场景和高保真表单交互模拟。 Result: 评估显示现有模型的准确率不足5%，揭示了其在视觉布局推理和字段值对齐方面的显著局限性。 Conclusion: FormFactory可作为进一步研究稳健、实用表单填写代理的基石。 Abstract: Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click", existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.

[376] V-VAE: A Variational Auto Encoding Framework Towards Fine-Grained Control over Human-Like Chat

Qi Lin,Weikai Xu,Lisi Chen,Bin Dai

Main category: cs.CL

TL;DR: 论文提出了一种名为V-VAE的框架，用于生成更符合人物特质的对话响应，并通过高质量数据集HumanChatData和基准测试HumanChatBench验证其有效性。

Details

Motivation: 现有基于角色扮演和人物特质的聊天方法依赖静态描述和低质量合成数据，难以捕捉动态细节，如情感、情境意识和个性演变。 Method: 提出Verbal Variational Auto-Encoding (V-VAE)框架，包含变分自编码模块和细粒度控制空间，动态调整对话行为。 Result: 实验表明，基于V-VAE的LLM在HumanChatBench和DialogBench上优于基线方法。 Conclusion: V-VAE框架和HumanChatData有效解决了高质量数据稀缺和动态对话建模问题。 Abstract: With the continued proliferation of Large Language Model (LLM) based chatbots, there is a growing demand for generating responses that are not only linguistically fluent but also consistently aligned with persona-specific traits in conversations. However, existing role-play and persona-based chat approaches rely heavily on static role descriptions, coarse-grained signal space, and low-quality synthetic data, which fail to capture dynamic fine-grained details in human-like chat. Human-like chat requires modeling subtle latent traits, such as emotional tone, situational awareness, and evolving personality, which are difficult to predefine and cannot be easily learned from synthetic or distillation-based data. To address these limitations, we propose a Verbal Variational Auto-Encoding (V-VAE) framework, containing a variational auto-encoding module and fine-grained control space which dynamically adapts dialogue behaviour based on fine-grained, interpretable latent variables across talking style, interaction patterns, and personal attributes. We also construct a high-quality dataset, HumanChatData, and benchmark HumanChatBench to address the scarcity of high-quality data in the human-like domain. Experiments show that LLMs based on V-VAE consistently outperform standard baselines on HumanChatBench and DialogBench, which further demonstrates the effectiveness of V-VAE and HumanChatData.

[377] STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework

Wenhao Liu,Zhenyi Lu,Xinyu Hu,Jierui Zhang,Dailin Li,Jiacheng Cen,Huilin Cao,Haiteng Wang,Yuhan Li,Kun Xie,Dandan Li,Pei Zhang,Chengbo Zhang,Yuxiang Ren,Xiaohong Huang,Yan Ma

Main category: cs.CL

TL;DR: STORM-BORN是一个高质量、高难度的数学数据集，旨在提升大型语言模型的推理能力。通过多智能体协作和人类专家验证，解决了现有数据集的不足。

Details

Motivation: 现有数学数据集存在内容过时、缺乏挑战性、忽视人类推理模式以及可靠性不足的问题。 Method: 采用人类参与的多智能体数据生成框架，结合推理密集过滤器和人类数学家的评估。 Result: 生成的2000个样本中，100个最难问题连GPT-o1也仅解决不到5%。微调后，LLaMA3-8B和Qwen2.5-7B的准确率分别提升7.84%和9.12%。 Conclusion: STORM-BORN为AI提供了高难度基准和人类推理训练资源，推动数学推理能力的进步。 Abstract: High-quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficient challenging content, neglecting human-like reasoning, and limited reliability due to single-LLM generation. To address these, we introduce $\textbf{STORM-BORN}$, an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers, which includes dense human-like approximations and heuristic cues. To ensure the reliability and quality, we propose a novel human-in-the-loop, multi-agent data generation framework, integrating reasoning-dense filters, multi-agent collaboration, and human mathematicians' evaluations. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems. Even most advanced models like GPT-o1 solved fewer than $5\%$ of them. Fine-tuning on STORM-BORN boosts accuracy by $7.84\%$ (LLaMA3-8B) and $9.12\%$ (Qwen2.5-7B). As AI approaches mathematician-level reasoning, STORM-BORN provides both a high-difficulty benchmark and a human-like reasoning training resource. Our code and dataset are publicly available at https://github.com/lwhere/STORM-BORN.

[378] Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries

Haruki Sakajo,Yusuke Ide,Justin Vasselli,Yusuke Sakai,Yingtao Tian,Hidetaka Kamigaito,Taro Watanabe

Main category: cs.CL

TL;DR: 提出了一种基于双语词典的跨语言词汇迁移方法，适用于低资源语言，通过逐步移除目标子词并估计其嵌入，性能优于现有方法。

Details

Motivation: 现有方法在低资源语言上表现不佳，而双语词典资源丰富，因此探索一种基于词典的跨语言词汇迁移方法。 Method: 利用BPE分词器的特性，逐步移除目标子词并回退到更短的子词，迭代估计目标子词的嵌入。 Result: 实验结果表明，该方法在低资源语言上优于现有方法。 Conclusion: 基于词典的跨语言词汇迁移方法简单有效，尤其适用于低资源语言。 Abstract: Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.

[379] Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation

Li Zhou,Lutong Yu,Dongchu Xie,Shaohuan Cheng,Wenyan Li,Haizhou Li

Main category: cs.CL

TL;DR: 论文提出了Hanfu-Bench数据集，用于研究视觉语言模型在文化理解中的时间维度，填补了现有研究忽视时间变化的空白。

Details

Motivation: 现有文化理解研究多关注地理多样性，忽视了时间维度，而Hanfu-Bench通过传统服饰Hanfu的多模态数据集填补了这一空白。 Method: 数据集包含文化视觉理解和文化图像转创两个任务，前者通过视觉问答评估时间文化特征识别，后者关注传统服饰到现代设计的转换。 Result: 封闭式视觉语言模型在文化视觉理解上与非专家相当，但落后专家10%；开放式模型表现更差。转创任务中，最佳模型成功率仅42%。 Conclusion: Hanfu-Bench揭示了时间文化理解和创意适应中的重大挑战，为未来研究提供了重要基准。 Abstract: Culture is a rich and dynamic domain that evolves across both geography and time. However, existing studies on cultural understanding with vision-language models (VLMs) primarily emphasize geographic diversity, often overlooking the critical temporal dimensions. To bridge this gap, we introduce Hanfu-Bench, a novel, expert-curated multimodal dataset. Hanfu, a traditional garment spanning ancient Chinese dynasties, serves as a representative cultural heritage that reflects the profound temporal aspects of Chinese culture while remaining highly popular in Chinese contemporary society. Hanfu-Bench comprises two core tasks: cultural visual understanding and cultural image transcreation.The former task examines temporal-cultural feature recognition based on single- or multi-image inputs through multiple-choice visual question answering, while the latter focuses on transforming traditional attire into modern designs through cultural element inheritance and modern context adaptation. Our evaluation shows that closed VLMs perform comparably to non-experts on visual cutural understanding but fall short by 10\% to human experts, while open VLMs lags further behind non-experts. For the transcreation task, multi-faceted human evaluation indicates that the best-performing model achieves a success rate of only 42\%. Our benchmark provides an essential testbed, revealing significant challenges in this new direction of temporal cultural understanding and creative adaptation.

[380] Prompt Engineering Large Language Models' Forecasting Capabilities

Philipp Schoenegger,Cameron R. Jones,Philip E. Tetlock,Barbara Mellers

Main category: cs.CL

TL;DR: 研究发现，在复杂任务（如预测）中，简单的提示工程改进对提升大型语言模型性能效果有限，某些策略甚至可能降低准确性。

Details

Motivation: 探讨提示工程是否足以提升大型语言模型在复杂领域（如预测）中的表现。 Method: 测试了38种提示，包括复合提示和外部来源提示，并引入了推理模型o1和o1-mini。 Result: 大多数提示改进效果微乎其微，部分策略（如贝叶斯推理）甚至显著降低准确性。 Conclusion: 在复杂任务中，仅靠基本提示改进效果有限，可能需要更强大或专业的技术来显著提升性能。 Abstract: Large language model performance can be improved in a large number of ways. Many such techniques, like fine-tuning or advanced tool usage, are time-intensive and expensive. Although prompt engineering is significantly cheaper and often works for simpler tasks, it remains unclear whether prompt engineering suffices for more complex domains like forecasting. Here we show that small prompt modifications rarely boost forecasting accuracy beyond a minimal baseline. In our first study, we tested 38 prompts across Claude 3.5 Sonnet, Claude 3.5 Haiku, GPT-4o, and Llama 3.1 405B. In our second, we introduced compound prompts and prompts from external sources, also including the reasoning models o1 and o1-mini. Our results show that most prompts lead to negligible gains, although references to base rates yield slight benefits. Surprisingly, some strategies showed strong negative effects on accuracy: especially encouraging the model to engage in Bayesian reasoning. These results suggest that, in the context of complex tasks like forecasting, basic prompt refinements alone offer limited gains, implying that more robust or specialized techniques may be required for substantial performance improvements in AI forecasting.

[381] Unified Large Language Models for Misinformation Detection in Low-Resource Linguistic Settings

Muhammad Islam,Javed Ali Khan,Mohammed Abaker,Ali Daud,Azeem Irshad

Main category: cs.CL

TL;DR: 该研究针对乌尔都语等资源匮乏语言的假新闻检测问题，提出了首个公开可用的乌尔都语假新闻检测数据集，并评估了多种预训练语言模型的性能，最终提出了一种统一的LLM模型。

Details

Motivation: 由于乌尔都语等资源匮乏语言缺乏可靠的假新闻检测数据集和验证资源，研究旨在填补这一空白，提升假新闻检测的准确性。 Method: 研究开发了一个公开可用的乌尔都语假新闻数据集，并评估了多种预训练语言模型（如XLNet、mBERT等），提出了一种统一的LLM模型。 Result: 提出的统一LLM模型在准确性、F1分数等指标上优于其他模型，并通过人工验证进一步确认了其可靠性。 Conclusion: 研究强调了开发可靠、专家验证且领域无关的数据集的重要性，为资源匮乏语言的假新闻检测提供了有效解决方案。 Abstract: The rapid expansion of social media platforms has significantly increased the dissemination of forged content and misinformation, making the detection of fake news a critical area of research. Although fact-checking efforts predominantly focus on English-language news, there is a noticeable gap in resources and strategies to detect news in regional languages, such as Urdu. Advanced Fake News Detection (FND) techniques rely heavily on large, accurately labeled datasets. However, FND in under-resourced languages like Urdu faces substantial challenges due to the scarcity of extensive corpora and the lack of validated lexical resources. Current Urdu fake news datasets are often domain-specific and inaccessible to the public. They also lack human verification, relying mainly on unverified English-to-Urdu translations, which compromises their reliability in practical applications. This study highlights the necessity of developing reliable, expert-verified, and domain-independent Urdu-enhanced FND datasets to improve fake news detection in Urdu and other resource-constrained languages. This paper presents the first benchmark large FND dataset for Urdu news, which is publicly available for validation and deep analysis. We also evaluate this dataset using multiple state-of-the-art pre-trained large language models (LLMs), such as XLNet, mBERT, XLM-RoBERTa, RoBERTa, DistilBERT, and DeBERTa. Additionally, we propose a unified LLM model that outperforms the others with different embedding and feature extraction techniques. The performance of these models is compared based on accuracy, F1 score, precision, recall, and human judgment for vetting the sample results of news.

[382] Statement-Tuning Enables Efficient Cross-lingual Generalization in Encoder-only Models

Ahmed Elshabrawy,Thanh-Nhi Nguyen,Yeeun Kang,Lihan Feng,Annant Jain,Faadil Abdullah Shaikh,Jonibek Mansurov,Mohamed Fazli Mohamed Imam,Jesus-German Ortiz-Barajas,Rendi Chevi,Alham Fikri Aji

Main category: cs.CL

TL;DR: 论文探讨了如何通过Statement Tuning方法使编码器模型（如BERT和RoBERTa）在零样本和多语言任务中表现接近大型语言模型（LLMs），同时保持高效性。

Details

Motivation: 编码器模型在计算和内存成本上优于LLMs，但在零样本任务中表现较差。研究旨在探索编码器模型在多语言零样本任务中的潜力，为低资源语言提供高效替代方案。 Method: 采用Statement Tuning方法，将任务重新表述为有限模板，并扩展到多语言环境，评估编码器模型的零样本跨语言泛化能力。 Result: 实验表明，先进的编码器模型在多语言任务中表现优异，与多语言LLMs相当，同时更高效。 Conclusion: 研究证明了编码器模型在多语言零样本任务中的潜力，为资源高效的NLP模型设计提供了新方向。 Abstract: Large Language Models (LLMs) excel in zero-shot and few-shot tasks, but achieving similar performance with encoder-only models like BERT and RoBERTa has been challenging due to their architecture. However, encoders offer advantages such as lower computational and memory costs. Recent work adapts them for zero-shot generalization using Statement Tuning, which reformulates tasks into finite templates. We extend this approach to multilingual NLP, exploring whether encoders can achieve zero-shot cross-lingual generalization and serve as efficient alternatives to memory-intensive LLMs for low-resource languages. Our results show that state-of-the-art encoder models generalize well across languages, rivaling multilingual LLMs while being more efficient. We also analyze multilingual Statement Tuning dataset design, efficiency gains, and language-specific generalization, contributing to more inclusive and resource-efficient NLP models. We release our code and models.

[383] MMD-Sense-Analysis: Word Sense Detection Leveraging Maximum Mean Discrepancy

Kensuke Mitsuzawa

Main category: cs.CL

TL;DR: 本文提出了一种基于最大均值差异（MMD）的新方法MMD-Sense-Analysis，用于检测和解释词义随时间的变化。

Details

Motivation: 词义分析对于理解语言和社会背景至关重要，而词义变化检测是识别和解释词义随时间变化的任务。 Method: 利用最大均值差异（MMD）选择语义上有意义的变量，并量化不同时间段的变化。 Result: 实证评估结果表明该方法的有效性。 Conclusion: 这是首次将MMD应用于词义变化检测，方法能够识别词义变化并解释其演变。 Abstract: Word sense analysis is an essential analysis work for interpreting the linguistic and social backgrounds. The word sense change detection is a task of identifying and interpreting shifts in word meanings over time. This paper proposes MMD-Sense-Analysis, a novel approach that leverages Maximum Mean Discrepancy (MMD) to select semantically meaningful variables and quantify changes across time periods. This method enables both the identification of words undergoing sense shifts and the explanation of their evolution over multiple historical periods. To my knowledge, this is the first application of MMD to word sense change detection. Empirical assessment results demonstrate the effectiveness of the proposed approach.

[384] IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems

Pasunuti Prasanjith,Prathmesh B More,Anoop Kunchukuttan,Raj Dabre

Main category: cs.CL

TL;DR: 论文提出了IndicMSMarco基准和数据集，以解决印度语言在RAG系统中缺乏评估基准和训练数据的问题。

Details

Motivation: 印度语言在RAG系统中缺乏高质量评估基准和训练数据，阻碍了其发展。 Method: 通过手动翻译MS MARCO-dev集的1000个查询创建IndicMSMarco基准，并利用LLMs从19种印度语言的维基百科构建大规模训练数据集。 Result: 开发了涵盖13种印度语言的评估基准和19种语言的训练数据集，填补了资源空白。 Conclusion: IndicMSMarco和数据集为印度语言的RAG系统提供了关键资源，推动了多语言信息检索和生成的发展。 Abstract: Retrieval-Augmented Generation (RAG) systems enable language models to access relevant information and generate accurate, well-grounded, and contextually informed responses. However, for Indian languages, the development of high-quality RAG systems is hindered by the lack of two critical resources: (1) evaluation benchmarks for retrieval and generation tasks, and (2) large-scale training datasets for multilingual retrieval. Most existing benchmarks and datasets are centered around English or high-resource languages, making it difficult to extend RAG capabilities to the diverse linguistic landscape of India. To address the lack of evaluation benchmarks, we create IndicMSMarco, a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages, created via manual translation of 1000 diverse queries from MS MARCO-dev set. To address the need for training data, we build a large-scale dataset of (question, answer, relevant passage) tuples derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs. Additionally, we include translated versions of the original MS MARCO dataset to further enrich the training data and ensure alignment with real-world information-seeking tasks. Resources are available here: https://huggingface.co/datasets/ai4bharat/Indic-Rag-Suite

[385] Domain Lexical Knowledge-based Word Embedding Learning for Text Classification under Small Data

Zixiao Zhu,Kezhi Mao

Main category: cs.CL

TL;DR: 论文提出了一种基于领域特定词汇知识增强BERT词嵌入的方法，以提升在关键词对分类任务起关键作用时的性能。

Details

Motivation: 研究发现BERT在关键词对分类任务（如情感分析和情绪识别）中表现不佳，原因是其基于上下文的词嵌入对关键词的区分性不足。 Method: 开发了一种基于领域特定词汇知识的词嵌入增强模型，将BERT嵌入投影到新空间以最大化类内相似性和类间差异，并设计了自动从在线开放资源获取词汇知识的算法。 Result: 在情感分析、情绪识别和问答三个分类任务上的实验证明了该模型的有效性。 Conclusion: 提出的方法显著提升了BERT在关键词敏感任务中的性能，代码和数据集已开源。 Abstract: Pre-trained language models such as BERT have been proved to be powerful in many natural language processing tasks. But in some text classification applications such as emotion recognition and sentiment analysis, BERT may not lead to satisfactory performance. This often happens in applications where keywords play critical roles in the prediction of class labels. Our investigation found that the root cause of the problem is that the context-based BERT embedding of the keywords may not be discriminative enough to produce discriminative text representation for classification. Motivated by this finding, we develop a method to enhance word embeddings using domain-specific lexical knowledge. The knowledge-based embedding enhancement model projects the BERT embedding into a new space where within-class similarity and between-class difference are maximized. To implement the knowledge-based word embedding enhancement model, we also develop a knowledge acquisition algorithm for automatically collecting lexical knowledge from online open sources. Experiment results on three classification tasks, including sentiment analysis, emotion recognition and question answering, have shown the effectiveness of our proposed word embedding enhancing model. The codes and datasets are in https://github.com/MidiyaZhu/KVWEFFER.

Shiwen Ni,Jiawen Li,Hung-Yu Kao

Main category: cs.CL

TL;DR: 该论文提出了一种名为MVAN的多视角注意力网络模型，用于在缺乏用户评论的情况下，仅通过源推文及其转发用户检测社交媒体上的假新闻，并提供解释。

Details

Motivation: 解决现有假新闻检测方法依赖长文本内容（如新闻文章和用户评论）的局限性，专注于更现实的短文本场景。 Method: 开发了MVAN模型，结合文本语义注意力和传播结构注意力，从源推文内容和传播结构中提取信息。 Result: 在两个真实数据集上的实验表明，MVAN在准确性上平均优于现有方法2.5%，并能提供合理解释。 Conclusion: MVAN模型在短文本假新闻检测中表现出色，同时具备解释能力。 Abstract: Fake news on social media is a widespread and serious problem in today's society. Existing fake news detection methods focus on finding clues from Long text content, such as original news articles and user comments. This paper solves the problem of fake news detection in more realistic scenarios. Only source shot-text tweet and its retweet users are provided without user comments. We develop a novel neural network based model, \textbf{M}ulti-\textbf{V}iew \textbf{A}ttention \textbf{N}etworks (MVAN) to detect fake news and provide explanations on social media. The MVAN model includes text semantic attention and propagation structure attention, which ensures that our model can capture information and clues both of source tweet content and propagation structure. In addition, the two attention mechanisms in the model can find key clue words in fake news texts and suspicious users in the propagation structure. We conduct experiments on two real-world datasets, and the results demonstrate that MVAN can significantly outperform state-of-the-art methods by 2.5\% in accuracy on average, and produce a reasonable explanation.

[387] Cross-Lingual Generalization and Compression: From Language-Specific to Shared Neurons

Frederick Riemenschneider,Anette Frank

Main category: cs.CL

TL;DR: 多语言模型（MLLMs）在无显式跨语言监督下展现出知识迁移能力。研究发现其参数空间在预训练中逐渐从语言特定表示压缩为跨语言抽象，神经元逐步对齐不同语言的语义概念。

Details

Motivation: 探索多语言模型如何在无显式跨语言监督下实现知识迁移，并分析其表示演化和神经元功能。 Method: 分析三种MLLMs的参数空间，通过探测实验观察层功能变化，追踪神经元在预训练中对语义概念的编码。 Result: 模型从语言特定表示逐渐收敛为跨语言抽象，神经元逐步对齐不同语言的语义概念，部分神经元成为跨语言概念的可靠预测器。 Conclusion: MLLMs通过预训练逐步形成跨语言抽象表示，神经元功能从语言识别转向语义概念编码，支持跨语言知识迁移。 Abstract: Multilingual language models (MLLMs) have demonstrated remarkable abilities to transfer knowledge across languages, despite being trained without explicit cross-lingual supervision. We analyze the parameter spaces of three MLLMs to study how their representations evolve during pre-training, observing patterns consistent with compression: models initially form language-specific representations, which gradually converge into cross-lingual abstractions as training progresses. Through probing experiments, we observe a clear transition from uniform language identification capabilities across layers to more specialized layer functions. For deeper analysis, we focus on neurons that encode distinct semantic concepts. By tracing their development during pre-training, we show how they gradually align across languages. Notably, we identify specific neurons that emerge as increasingly reliable predictors for the same concepts across languages.

Chaoyue He,Xin Zhou,Yi Wu,Xinjia Yu,Yan Zhang,Lei Zhang,Di Wang,Shengfei Lyu,Hong Xu,Xiaoqiao Wang,Wei Liu,Chunyan Miao

Main category: cs.CL

TL;DR: ESGenius是一个用于评估和提升大型语言模型（LLM）在环境、社会和治理（ESG）及可持续性问题回答能力的综合基准，包含问答集和语料库两部分，并通过零样本和RAG方法评估模型表现。

Details

Motivation: 当前LLM在跨学科的ESG和可持续性问题回答上表现有限，需要权威数据支持以提升理解能力。 Method: ESGenius包含1,136个多选问题（ESGenius-QA）和231份权威文档（ESGenius-Corpus），采用零样本和RAG两阶段评估协议测试50个LLM。 Result: 零样本下模型准确率仅为55-70%，而RAG方法显著提升性能，尤其是小模型（如DeepSeek-R1-Distill-Qwen-14B从63.82%提升至80.46%）。 Conclusion: ESGenius是首个专注于ESG和可持续性的LLM基准，强调权威数据对模型性能的重要性。 Abstract: We introduce ESGenius, a comprehensive benchmark for evaluating and enhancing the proficiency of Large Language Models (LLMs) in Environmental, Social and Governance (ESG) and sustainability-focused question answering. ESGenius comprises two key components: (i) ESGenius-QA, a collection of 1 136 multiple-choice questions generated by LLMs and rigorously validated by domain experts, covering a broad range of ESG pillars and sustainability topics. Each question is systematically linked to its corresponding source text, enabling transparent evaluation and supporting retrieval-augmented generation (RAG) methods; and (ii) ESGenius-Corpus, a meticulously curated repository of 231 foundational frameworks, standards, reports and recommendation documents from seven authoritative sources. Moreover, to fully assess the capabilities and adaptation potential of the model, we implement a rigorous two-stage evaluation protocol -- Zero-Shot and RAG. Extensive experiments across 50 LLMs (ranging from 0.5 B to 671 B parameters) demonstrate that state-of-the-art models achieve only moderate performance in zero-shot settings, with accuracies typically around 55--70\%, highlighting ESGenius's challenging nature for LLMs in interdisciplinary contexts. However, models employing RAG show significant performance improvements, particularly for smaller models. For example, "DeepSeek-R1-Distill-Qwen-14B" improves from 63.82\% (zero-shot) to 80.46\% with RAG. These results underscore the necessity of grounding responses in authoritative sources for enhanced ESG understanding. To the best of our knowledge, ESGenius is the first benchmark curated for LLMs and the relevant enhancement technologies that focuses on ESG and sustainability topics.

[389] Cross-Lingual Transfer of Cultural Knowledge: An Asymmetric Phenomenon

Chen Zhang,Zhiyuan Liao,Yansong Feng

Main category: cs.CL

TL;DR: 研究探讨了大型语言模型（LLMs）在多语言环境下文化知识获取的机制，发现高资源语言与英语之间存在双向文化知识转移，而低资源语言则主要单向转移。

Details

Motivation: 尽管已有大量研究评估LLMs处理全球文化多样性的能力，但其在多语言环境中文化知识获取的机制仍不明确。 Method: 引入了一个可解释的框架，研究文化知识在语言适应中的转移，并通过四种非英语文化的案例进行分析。 Result: 高资源语言与英语之间存在双向文化知识转移，而低资源语言则主要单向转移。这一不对称现象与训练数据中文化知识的频率相关。 Conclusion: 文化知识的转移与其在训练数据中的出现频率密切相关，频率越高，转移越容易。 Abstract: Despite substantial research efforts evaluating how well large language models~(LLMs) handle global cultural diversity, the mechanisms behind their cultural knowledge acquisition, particularly in multilingual settings, remain unclear. We study this question by investigating how cultural knowledge transfers across languages during language adaptation of LLMs. We introduce an interpretable framework for studying this transfer, ensuring training data transparency and controlling transfer effects. Through a study of four non-Anglophonic cultures, we observe bidirectional cultural transfer between English and other high-resource languages, while low-resource languages primarily transfer knowledge to English with limited reverse flow. To explain this asymmetric phenomenon, we propose a frequency-based hypothesis: cultural knowledge appearing more frequently in the pretraining data transfers more easily, which is supported by empirical analysis of the training corpora.

[390] StochasTok: Improving Fine-Grained Subword Understanding in LLMs

Anya Sims,Thom Foster,Klara Kaleb,Tuan-Duy H. Nguyen,Joseph Lee,Jakob N. Foerster,Yee Whye Teh,Cong Lu

Main category: cs.CL

TL;DR: 论文提出StochasTok，一种随机分词方案，通过训练中随机拆分token，提升大语言模型在子词任务上的表现。

Details

Motivation: 现有大语言模型在子词任务（如字符计数）上表现不佳，主要因分词掩盖了词的细粒度结构。现有替代方案（如字符级分词）计算成本高且效果不稳定。 Method: 引入StochasTok，一种简单高效的随机分词方案，训练中随机拆分token，使模型能看到词的内部结构。 Result: 实验表明，StochasTok显著提升模型在字符计数、子串识别和数学任务等子词任务上的表现，且能无缝集成到现有训练流程中。 Conclusion: StochasTok通过简单改动实现显著改进，展示了在更大模型中的应用潜力。 Abstract: Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still often struggle with seemingly simple subword-level tasks like How many 'r's in 'strawberry'?. A key factor behind these failures is tokenization which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok's simplicity allows seamless integration at any stage of the training pipeline; and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: https://github.com/anyasims/stochastok.

[391] When LLMs Team Up: The Emergence of Collaborative Affective Computing

Wenna Lai,Haoran Xie,Guandong Xu,Qing Li,S. Joe Qin

Main category: cs.CL

TL;DR: 该论文综述了基于大语言模型（LLMs）的协作系统在情感计算（AC）中的应用，探讨了从结构化协作到自主协作的方法，并分析了其潜力与挑战。

Details

Motivation: 传统的情感计算任务在自然语言处理中采用流水线架构，存在结构僵化和适应性不足的问题。LLMs为情感理解与生成提供了统一方法，但其在情感推理中存在认知局限，如文化误解和决策幻觉。 Method: 论文系统回顾了现有方法，包括协作策略、机制、关键功能和应用；实验比较了情感理解与生成任务中的协作策略；分析了系统在复杂情感推理中的鲁棒性和适应性潜力。 Result: 研究发现LLM协作系统能通过情感与理性思维的协同，提升情感推理的鲁棒性和适应性，接近人类社交智能。 Conclusion: 论文首次系统探索了LLMs在AC中的协作智能，为更强大的应用铺平了道路，并讨论了未来研究方向。 Abstract: Affective Computing (AC) is essential in bridging the gap between human emotional experiences and machine understanding. Traditionally, AC tasks in natural language processing (NLP) have been approached through pipeline architectures, which often suffer from structure rigidity that leads to inefficiencies and limited adaptability. The advent of Large Language Models (LLMs) has revolutionized this field by offering a unified approach to affective understanding and generation tasks, enhancing the potential for dynamic, real-time interactions. However, LLMs face cognitive limitations in affective reasoning, such as misinterpreting cultural nuances or contextual emotions, and hallucination problems in decision-making. To address these challenges, recent research advocates for LLM-based collaboration systems that emphasize interactions among specialized models and LLMs, mimicking human-like affective intelligence through the synergy of emotional and rational thinking that aligns with Dual Process Theory in psychology. This survey aims to provide a comprehensive overview of LLM-based collaboration systems in AC, exploring from structured collaborations to autonomous collaborations. Specifically, it includes: (1) A systematic review of existing methods, focusing on collaboration strategies, mechanisms, key functions, and applications; (2) Experimental comparisons of collaboration strategies across representative tasks in affective understanding and generation; (3) An analysis highlighting the potential of these systems to enhance robustness and adaptability in complex affective reasoning; (4) A discussion of key challenges and future research directions to further advance the field. This work is the first to systematically explore collaborative intelligence with LLMs in AC, paving the way for more powerful applications that approach human-like social intelligence.

[392] mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection

Dominik Macko

Main category: cs.CL

TL;DR: 本文提出了一种基于微调小型LLMs的文本分类方法（mdok），用于检测机器生成的文本，并在Voight-Kampff Generative AI Detection 2025任务中表现出色。

Details

Motivation: 大型语言模型（LLMs）生成的高质量文本可能被滥用（如抄袭、垃圾邮件、虚假信息传播），因此需要自动化检测方法。 Method: 通过微调小型LLMs进行文本分类，提出mdok方法，用于机器生成文本的检测。 Result: 在Voight-Kampff Generative AI Detection 2025任务中，mdok方法在二元检测和多分类任务中均表现优异（多分类排名第一）。 Conclusion: mdok方法为机器生成文本的检测提供了高效且鲁棒的解决方案。 Abstract: The large language models (LLMs) are able to generate high-quality texts in multiple languages. Such texts are often not recognizable by humans as generated, and therefore present a potential of LLMs for misuse (e.g., plagiarism, spams, disinformation spreading). An automated detection is able to assist humans to indicate the machine-generated texts; however, its robustness to out-of-distribution data is still challenging. This notebook describes our mdok approach in robust detection, based on fine-tuning smaller LLMs for text classification. It is applied to both subtasks of Voight-Kampff Generative AI Detection 2025, providing remarkable performance in binary detection as well as in multiclass (1st rank) classification of various cases of human-AI collaboration.

[393] Fairness Dynamics During Training

Krishna Patel,Nivedha Sivakumar,Barry-John Theobald,Luca Zappella,Nicholas Apostoloff

Main category: cs.CL

TL;DR: 研究大型语言模型（LLM）训练中的公平性动态，提出两种新指标评估偏见，发现模型在训练中可能突然产生偏见，且早期停止可显著提升公平性。

Details

Motivation: 探讨LLM训练中偏见的动态变化，以诊断和缓解偏见问题。 Method: 引入两种新指标（Average Rank和Jensen-Shannon Divergence by Parts）评估Pythia模型在WinoBias数据集上的性别偏见动态。 Result: 发现Pythia-6.9b对男性有偏见，早期停止可牺牲少量准确性换取大幅公平性提升，且更大模型可能更偏颇。 Conclusion: 监控公平性动态有助于优化训练策略，减少模型偏见。 Abstract: We investigate fairness dynamics during Large Language Model (LLM) training to enable the diagnoses of biases and mitigations through training interventions like early stopping; we find that biases can emerge suddenly and do not always follow common performance metrics. We introduce two new metrics to evaluate fairness dynamics holistically during model pre-training: Average Rank and Jensen-Shannon Divergence by Parts. These metrics provide insights into the Pythia models' progression of biases in gender prediction of occupations on the WinoBias dataset. By monitoring these dynamics, we find that (1) Pythia-6.9b is biased towards men; it becomes more performant and confident predicting "male" than "female" during training, (2) via early-stopping, Pythia-6.9b can exchange 1.7% accuracy on LAMBADA for a 92.5% increase in fairness, and (3) larger models can exhibit more bias; Pythia-6.9b makes more assumptions about gender than Pythia-160m, even when a subject's gender is not specified.

[394] Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning

Fangyu Lei,Jinxiang Meng,Yiming Huang,Tinghong Chen,Yun Zhang,Shizhu He,Jun Zhao,Kang Liu

Main category: cs.CL

TL;DR: 论文提出了一种基于强化学习的表格推理方法Reasoning-Table，通过数据预处理、奖励设计和训练策略优化，在多项任务中超越监督微调方法，并提升了模型的泛化能力和鲁棒性。

Details

Motivation: 现有监督微调方法在表格推理任务中存在泛化和鲁棒性问题，需要一种更有效的方法来克服这些限制。 Method: 采用强化学习（RL）框架，结合数据预处理、奖励设计和定制化训练策略，优化表格推理任务的表现。 Result: 在多个基准测试中表现优异，超越Claude-3.7-Sonnet等大型专有模型4.0%，并在文本到SQL任务中达到68.3%的性能。 Conclusion: Reasoning-Table展示了强化学习在表格推理任务中的潜力，显著提升了模型的性能和鲁棒性。 Abstract: Table reasoning, encompassing tasks such as table question answering, fact verification, and text-to-SQL, requires precise understanding of structured tabular data, coupled with numerical computation and code manipulation for effective inference. Supervised fine-tuning (SFT) approaches have achieved notable success but often struggle with generalization and robustness due to biases inherent in imitative learning. We introduce Reasoning-Table, the first application of reinforcement learning (RL) to table reasoning, achieving state-of-the-art performance. Through rigorous data preprocessing, reward design, and tailored training strategies, our method leverages simple rule-based outcome rewards to outperform SFT across multiple benchmarks. Unified training across diverse tasks enables Reasoning-Table to emerge as a robust table reasoning large language model, surpassing larger proprietary models like Claude-3.7-Sonnet by 4.0% on table reasoning benchmarks. The approach also achieves excellent performance on text-to-SQL tasks, reaching 68.3% performance on the BIRD dev dataset with a 7B model. Further experiments demonstrate that Reasoning-Table enhances the model's generalization capabilities and robustness.

[395] SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Zhongwei Wan,Zhihao Dou,Che Liu,Yu Zhang,Dongfei Cui,Qinjian Zhao,Hui Shen,Jing Xiong,Yi Xin,Yifan Jiang,Yangfan He,Mi Zhang,Shen Yan

Main category: cs.CL

TL;DR: 论文提出了一种名为SRPO的两阶段强化学习框架，通过自我反思增强多模态大语言模型的推理能力，显著提升了推理准确性和反思质量。

Details

Motivation: 现有的多模态大语言模型在复杂推理任务中表现不佳，尤其是缺乏有效的自我反思和自我纠正能力。现有反思方法过于简单，无法生成有意义的反馈。 Method: 提出SRPO框架，分两阶段：1）构建高质量反思数据集；2）在GRPO框架中引入新颖奖励机制，鼓励简洁且有认知意义的反思。 Result: 在多个多模态推理基准测试中，SRPO显著优于现有模型，推理准确性和反思质量均有显著提升。 Conclusion: SRPO框架有效提升了多模态大语言模型的推理能力，尤其在复杂任务中表现出色。 Abstract: Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.

[396] Tug-of-war between idiom's figurative and literal meanings in LLMs

Soyoung Oh,Xinting Huang,Mathis Pink,Michael Hahn,Vera Demberg

Main category: cs.CL

TL;DR: 论文研究了语言模型如何处理习语的非组合性比喻意义，通过机制可解释性工具追踪了LLama3.2-1B-base模型的三步习语处理过程。

Details

Motivation: 习语的比喻意义与字面意义差异大，模型需学习如何在两者间选择，研究旨在揭示模型如何处理这种歧义。 Method: 使用机制可解释性工具分析LLama3.2-1B-base模型，定位了习语处理的三个步骤：比喻意义检索、比喻表示路径和字面解释并行路径。 Result: 发现特定注意力头增强比喻意义并抑制字面解释，模型通过中间路径表示比喻意义，同时保留字面解释的并行路径。 Conclusion: 研究为自回归变换器中习语理解的机制提供了证据。 Abstract: Idioms present a unique challenge for language models due to their non-compositional figurative meanings, which often strongly diverge from the idiom's literal interpretation. This duality requires a model to learn representing and deciding between the two meanings to interpret an idiom in a figurative sense, or literally. In this paper, we employ tools from mechanistic interpretability to trace how a large pretrained causal transformer (LLama3.2-1B-base) deals with this ambiguity. We localize three steps of idiom processing: First, the idiom's figurative meaning is retrieved in early attention and MLP sublayers. We identify specific attention heads which boost the figurative meaning of the idiom while suppressing the idiom's literal interpretation. The model subsequently represents the figurative representation through an intermediate path. Meanwhile, a parallel bypass route forwards literal interpretation, ensuring that a both reading remain available. Overall, our findings provide a mechanistic evidence for idiom comprehension in an autoregressive transformer.

[397] Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Pierre-Carl Langlais,Carlos Rosas Hinostroza,Mattia Nee,Catherine Arnett,Pavel Chizhov,Eliot Krzystof Jones,Irène Girard,David Mach,Anastasia Stasenko,Ivan P. Yamshchikov

Main category: cs.CL

TL;DR: Common Corpus是一个开放的大规模语言模型预训练数据集，包含约两万亿个无版权或允许许可的标记，涵盖多种语言和代码数据，旨在解决AI立法中的数据合规问题。

Details

Motivation: 解决大型语言模型预训练数据中版权和专有内容的问题，提供符合数据安全法规的开放数据集。 Method: 收集无版权或允许许可的数据，涵盖多种语言和代码，并进行详细的过滤和整理。 Result: Common Corpus成为业界领先的开放数据集，已被Anthropic等多个LLM训练项目使用。 Conclusion: Common Corpus将为LLM的开放科学研究提供关键基础设施。 Abstract: Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These data most often contain trillions of tokens with large portions of copyrighted or proprietary content, which hinders the usage of such models under AI legislation. This raises the need for truly open pre-training data that is compliant with the data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for language model pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissible licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages, ranging from the main European languages to low-resource ones rarely present in pre-training datasets; in addition, it includes a large portion of code data. The diversity of data sources in terms of covered domains and time periods opens up the paths for both research and entrepreneurial needs in diverse areas of knowledge. In this technical report, we present the detailed provenance of data assembling and the details of dataset filtering and curation. Being already used by such industry leaders as Anthropic and multiple LLM training projects, we believe that Common Corpus will become a critical infrastructure for open science research in LLMs.

[398] Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs

Jiandong Shao,Yao Lu,Jianfei Yang

Main category: cs.CL

TL;DR: 研究发现大型语言模型（LLMs）在数值任务中存在与Benford定律相似的数字偏差，并提出通过修剪特定神经元来缓解这一问题。

Details

Motivation: LLMs在复杂推理任务中表现优异，但在基础数值问题上常出错，研究假设其数字生成偏差源于预训练数据中的长尾数字分布。 Method: 通过分析预训练数据（OLMo2）是否符合Benford定律，构建均匀分布的数字评估基准，并利用logit-lens和神经元剖析技术定位偏差来源。 Result: 开源LLMs表现出与Benford定律相似的数字偏差，且偏差主要源于深层网络中少数高度数字选择性的FFN神经元。 Conclusion: 修剪特定神经元可部分纠正错误输出，揭示了预训练数据统计与模型符号失败模式之间的关联，为数值任务中的幻觉诊断提供了新视角。 Abstract: Large Language Models (LLMs) exhibit impressive performance on complex reasoning tasks, yet they frequently fail on basic numerical problems, producing incorrect outputs. Inspired by Benford's Law -- a statistical pattern where lower digits occur more frequently as leading digits -- we hypothesize that the long-tailed digit distributions in web-collected corpora may be learned by LLMs during pretraining, leading to biased numerical generation. To investigate the hypothesis, we first examine whether digits frequencies in pretraining corpus (OLMo2) follows Benford's law. We then construct an evaluation benchmark with uniformly distributed ground-truth digits across seven numerical reasoning tasks. Our evaluation results demonstrate that leading open-source LLMs show a consistent pattern of digit bias that resembles Benford's law. Through logit-lens tracing and neuron-level dissection, we identify that this bias arises predominantly from a small subset of highly digit-selective feed-forward network (FFN) neurons in the deeper layers. Finally, we demonstrate that pruning these neurons mitigates imbalanced overgeneration and partially corrects erroneous outputs, providing causal evidence that fine-grained pretraining digit bias can propagate into model behavior. Our findings reveal a fundamental connection between corpus-level statistics and symbolic failure modes in LLMs, offering a new lens for diagnosing and mitigating hallucinations in numerical tasks.

[399] Thinking in Character: Advancing Role-Playing Agents with Role-Aware Reasoning

Yihong Tang,Kehai Chen,Muyun Yang,Zhengyu Niu,Jing Li,Tiejun Zhao,Min Zhang

Main category: cs.CL

TL;DR: 本文提出了一种名为角色感知推理（RAR）的新方法，通过角色身份激活（RIA）和推理风格优化（RSO）两阶段解决角色扮演代理（RPAs）中的注意力分散和风格漂移问题。

Details

Motivation: 现有角色扮演代理基于显式对话数据，缺乏深层次的人类思维模拟，导致知识表达和风格表现肤浅。大型推理模型（LRMs）直接应用时存在注意力分散和风格漂移问题。 Method: RAR方法包括角色身份激活（RIA）和推理风格优化（RSO）两阶段，RIA通过角色档案引导模型推理，RSO通过LRM蒸馏优化推理风格。 Result: 实验表明，RAR显著提升了角色扮演代理的性能，有效解决了注意力分散和风格漂移问题。 Conclusion: RAR方法为角色扮演代理提供了更深入、更一致的推理能力，提升了其表现力和实用性。 Abstract: The advancement of Large Language Models (LLMs) has spurred significant interest in Role-Playing Agents (RPAs) for applications such as emotional companionship and virtual interaction. However, recent RPAs are often built on explicit dialogue data, lacking deep, human-like internal thought processes, resulting in superficial knowledge and style expression. While Large Reasoning Models (LRMs) can be employed to simulate character thought, their direct application is hindered by attention diversion (i.e., RPAs forget their role) and style drift (i.e., overly formal and rigid reasoning rather than character-consistent reasoning). To address these challenges, this paper introduces a novel Role-Aware Reasoning (RAR) method, which consists of two important stages: Role Identity Activation (RIA) and Reasoning Style Optimization (RSO). RIA explicitly guides the model with character profiles during reasoning to counteract attention diversion, and then RSO aligns reasoning style with the character and scene via LRM distillation to mitigate style drift. Extensive experiments demonstrate that the proposed RAR significantly enhances the performance of RPAs by effectively addressing attention diversion and style drift.

[400] Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak'wala Legacy Texts

Milind Agarwal,Daisy Rosenblum,Antonios Anastasopoulos

Main category: cs.CL

TL;DR: 该论文探讨了如何利用最新OCR技术对Kwak'wala语言的早期文本进行数字化处理，以支持语言复兴和技术开发。

Details

Motivation: Kwak'wala语言有丰富的文献记录，但早期文本因机器不可读而难以利用，数字化可促进现代拼写转换和语言技术开发。 Method: 结合现成OCR技术、语言识别和掩码技术分离Kwak'wala文本，并使用后校正模型生成高质量转录。 Result: 成功应用OCR技术处理Kwak'wala文本，并讨论了实际应用中的挑战和适应性调整。 Conclusion: 该方法为Kwak'wala文本的数字化提供了可行方案，支持语言复兴和技术发展。 Abstract: Kwak'wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak'wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak'wala text, along with post-correction models, to produce a final high-quality transcription.

[401] MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation

Yile Liu,Ziwei Ma,Xiu Jiang,Jinglu Hu,Jing Chang,Liang Li

Main category: cs.CL

TL;DR: MaXIFE是一个多语言指令跟随评估基准，涵盖23种语言和1667个任务，结合规则和模型评估方法，为LLMs提供标准化评估工具。

Details

Motivation: 现有评估方法多关注单语言场景，忽略多语言和跨语言挑战，MaXIFE旨在填补这一空白。 Method: MaXIFE结合规则评估和模型评估，评估23种语言的1667个任务。 Result: 评估了多个主流商业和开源LLMs，建立了基线结果。 Conclusion: MaXIFE为多语言指令跟随评估提供标准化工具，推动NLP研究发展。 Abstract: With the rapid adoption of large language models (LLMs) in natural language processing, the ability to follow instructions has emerged as a key metric for evaluating their practical utility. However, existing evaluation methods often focus on single-language scenarios, overlooking the challenges and differences present in multilingual and cross-lingual contexts. To address this gap, we introduce MaXIFE: a comprehensive evaluation benchmark designed to assess instruction-following capabilities across 23 languages with 1,667 verifiable instruction tasks. MaXIFE integrates both Rule-Based Evaluation and Model-Based Evaluation, ensuring a balance of efficiency and accuracy. We applied MaXIFE to evaluate several leading commercial and open-source LLMs, establishing baseline results for future comparisons. By providing a standardized tool for multilingual instruction-following evaluation, MaXIFE aims to advance research and development in natural language processing.

[402] iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering

Shuai Wang,Yinan Yu

Main category: cs.CL

TL;DR: iQUEST框架通过迭代分解复杂问题和结合GNN增强多跳推理，显著提升KBQA任务性能。

Details

Motivation: LLMs在知识密集型任务中存在事实不准确问题，需结合外部知识资源（如知识图谱）以提高可靠性。 Method: 提出iQUEST框架，迭代分解复杂查询为子问题，并集成GNN以提前考虑2跳邻居信息。 Result: 在四个基准数据集和四种LLMs上，iQUEST表现一致优于其他方法。 Conclusion: iQUEST通过结构化推理路径和多跳信息整合，有效解决了KBQA中的多跳推理挑战。 Abstract: While Large Language Models (LLMs) excel at many natural language processing tasks, they often suffer from factual inaccuracies in knowledge-intensive scenarios. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1)~maintaining coherent reasoning paths, and (2)~avoiding prematurely discarding critical multi-hop connections. To address these issues, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.

[403] Human-Centric Evaluation for Foundation Models

Yijin Guo,Kaiyuan Ji,Xiaorong Zhu,Junying Wang,Farong Wen,Chunyi Li,Zicheng Zhang,Guangtao Zhai

Main category: cs.CL

TL;DR: 论文提出了一种以人为中心的评估框架（HCE），通过主观维度（问题解决能力、信息质量和交互体验）评估基础模型，填补了传统客观评估的不足。实验涉及多个模型，结果显示Grok 3表现最佳。

Details

Motivation: 当前基础模型的评估主要依赖客观指标，忽略了真实的人类体验。本文旨在填补这一空白，提出主观评估框架。 Method: 采用HCE框架，通过540多次参与者驱动的评估，人类与模型协作完成开放研究任务，生成主观数据集。 Result: 实验显示Grok 3表现最优，其次是Deepseek R1和Gemini 2.5，OpenAI o3 mini表现较差。数据集揭示了模型的多样性和适应性。 Conclusion: 研究不仅改进了主观评估方法，还为标准化自动化评估奠定了基础，推动了LLM的发展。数据集已公开。 Abstract: Currently, nearly all evaluations of foundation models focus on objective metrics, emphasizing quiz performance to define model capabilities. While this model-centric approach enables rapid performance assessment, it fails to reflect authentic human experiences. To address this gap, we propose a Human-Centric subjective Evaluation (HCE) framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. Through experiments involving Deepseek R1, OpenAI o3 mini, Grok 3, and Gemini 2.5, we conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks, yielding a comprehensive subjective dataset. This dataset captures diverse user feedback across multiple disciplines, revealing distinct model strengths and adaptability. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind. By offering a novel framework and a rich dataset, this study not only enhances subjective evaluation methodologies but also lays the foundation for standardized, automated assessments, advancing LLM development for research and practical scenarios. Our dataset link is https://github.com/yijinguo/Human-Centric-Evaluation.

[404] Read it in Two Steps: Translating Extremely Low-Resource Languages with Code-Augmented Grammar Books

Chen Zhang,Jiuheng Lin,Xiao Liu,Zekai Zhang,Yansong Feng

Main category: cs.CL

TL;DR: 本文探讨了语法书在极低资源语言翻译中的作用，提出将语法规则表示为代码函数的方法，显著提升了翻译效果。

Details

Motivation: 研究语法书在极低资源语言翻译中的有效性，尤其是语法规则检索和应用的瓶颈问题。 Method: 引入ZhuangRules数据集，将语法规则分解为检索和应用两步，并提出用代码函数表示规则以提升LLM的推理能力。 Result: 实验表明，代码规则显著提升了规则检索和应用，翻译BLEU分数提高了13.1%。 Conclusion: 代码化的语法规则能有效解决LLM在极低资源语言翻译中的瓶颈问题。 Abstract: While large language models (LLMs) have shown promise in translating extremely low-resource languages using resources like dictionaries, the effectiveness of grammar books remains debated. This paper investigates the role of grammar books in translating extremely low-resource languages by decomposing it into two key steps: grammar rule retrieval and application. To facilitate the study, we introduce ZhuangRules, a modularized dataset of grammar rules and their corresponding test sentences. Our analysis reveals that rule retrieval constitutes a primary bottleneck in grammar-based translation. Moreover, although LLMs can apply simple rules for translation when explicitly provided, they encounter difficulties in handling more complex rules. To address these challenges, we propose representing grammar rules as code functions, considering their similarities in structure and the benefit of code in facilitating LLM reasoning. Our experiments show that using code rules significantly boosts both rule retrieval and application, ultimately resulting in a 13.1% BLEU improvement in translation.

[405] Propaganda and Information Dissemination in the Russo-Ukrainian War: Natural Language Processing of Russian and Western Twitter Narratives

Zaur Gouliev

Main category: cs.CL

TL;DR: 分析乌克兰冲突中社交媒体（如X平台）上的信息战，通过NLP和机器学习研究推文情感与主题，揭示宣传账户与可信账户的策略差异。

Details

Motivation: 研究信息战在乌克兰冲突中的作用，揭示社交媒体如何塑造公众认知。 Method: 收集2022年2月至5月的40,000条推文，结合NLP、机器学习和人工分析，评估情感与主题。 Result: 宣传账户使用情绪化语言和虚假信息，而西方账户侧重事实报道；聚类分析显示可能存在协同行为。 Conclusion: 研究为理解信息战动态提供技术方法，并有助于未来社交媒体影响研究。 Abstract: The conflict in Ukraine has been not only characterised by military engagement but also by a significant information war, with social media platforms like X, formerly known as Twitter playing an important role in shaping public perception. This article provides an analysis of tweets from propaganda accounts and trusted accounts collected from the onset of the war, February 2022 until the middle of May 2022 with n=40,000 total tweets. We utilise natural language processing and machine learning algorithms to assess the sentiment and identify key themes, topics and narratives across the dataset with human-in-the-loop (HITL) analysis throughout. Our findings indicate distinct strategies in how information is created, spread, and targeted at different audiences by both sides. Propaganda accounts frequently employ emotionally charged language and disinformation to evoke fear and distrust, whereas other accounts, primarily Western tend to focus on factual reporting and humanitarian aspects of the conflict. Clustering analysis reveals groups of accounts with similar behaviours, which we suspect indicates the presence of coordinated efforts. This research attempts to contribute to our understanding of the dynamics of information warfare and offers techniques for future studies on social media influence in military conflicts.

[406] NAVER LABS Europe Submission to the Instruction-following Track

Beomseok Lee,Marcely Zanon Boito,Laurent Besacier,Ioan Calapodescu

Main category: cs.CL

TL;DR: NAVER LABS Europe 在 IWSLT 2025 的指令跟随语音处理短赛道中提交了一个系统，能够同时完成 ASR、ST 和 SQA 任务，支持英语输入到中文、意大利语和德语。

Details

Motivation: 开发一个能够同时处理多种语音和文本任务的系统，以提升多语言和多模态任务的效率。 Method: 结合两个预训练模块：语音到 LLM 嵌入投影器和基于 Llama-3.1-8B-Instruct 的 LoRA 适配器，并进行联合指令调优。 Result: 系统在 IWSLT 2025 中提交评估，支持多语言和多模态任务。 Conclusion: 通过预训练模块和指令调优，成功开发了一个高效的多任务处理系统。 Abstract: In this paper we describe NAVER LABS Europe submission to the instruction-following speech processing short track at IWSLT 2025. We participate in the constrained settings, developing systems that can simultaneously perform ASR, ST, and SQA tasks from English speech input into the following target languages: Chinese, Italian, and German. Our solution leverages two pretrained modules: (1) a speech-to-LLM embedding projector trained using representations from the SeamlessM4T-v2-large speech encoder; and (2) LoRA adapters trained on text data on top of a Llama-3.1-8B-Instruct. These modules are jointly loaded and further instruction-tuned for 1K steps on multilingual and multimodal data to form our final system submitted for evaluation.

[407] Analysis of LLM Bias (Chinese Propaganda & Anti-US Sentiment) in DeepSeek-R1 vs. ChatGPT o3-mini-high

PeiHsuan Huang,ZihWei Lin,Simon Imbot,WenCheng Fu,Ethan Tu

Main category: cs.CL

TL;DR: 研究比较了PRC对齐的DeepSeek-R1和非PRC的ChatGPT o3-mini-high在宣传和反美情绪上的偏见，发现DeepSeek-R1在简体中文中表现出显著偏见。

Details

Motivation: 探讨大型语言模型（LLMs）的地缘政治对齐是否影响其意识形态中立性。 Method: 开发了1200个去语境化问题，通过GPT-4o评分和人工标注评估7200个回答。 Result: DeepSeek-R1在简体中文中表现出更高的宣传和反美偏见，而ChatGPT o3-mini-high几乎无偏见。 Conclusion: LLMs的地缘政治对齐显著影响其输出偏见，尤其在简体中文中表现明显。 Abstract: Large language models (LLMs) increasingly shape public understanding and civic decisions, yet their ideological neutrality is a growing concern. While existing research has explored various forms of LLM bias, a direct, cross-lingual comparison of models with differing geopolitical alignments-specifically a PRC-system model versus a non-PRC counterpart-has been lacking. This study addresses this gap by systematically evaluating DeepSeek-R1 (PRC-aligned) against ChatGPT o3-mini-high (non-PRC) for Chinese-state propaganda and anti-U.S. sentiment. We developed a novel corpus of 1,200 de-contextualized, reasoning-oriented questions derived from Chinese-language news, presented in Simplified Chinese, Traditional Chinese, and English. Answers from both models (7,200 total) were assessed using a hybrid evaluation pipeline combining rubric-guided GPT-4o scoring with human annotation. Our findings reveal significant model-level and language-dependent biases. DeepSeek-R1 consistently exhibited substantially higher proportions of both propaganda and anti-U.S. bias compared to ChatGPT o3-mini-high, which remained largely free of anti-U.S. sentiment and showed lower propaganda levels. For DeepSeek-R1, Simplified Chinese queries elicited the highest bias rates; these diminished in Traditional Chinese and were nearly absent in English. Notably, DeepSeek-R1 occasionally responded in Simplified Chinese to Traditional Chinese queries and amplified existing PRC-aligned terms in its Chinese answers, demonstrating an "invisible loudspeaker" effect. Furthermore, such biases were not confined to overtly political topics but also permeated cultural and lifestyle content, particularly in DeepSeek-R1.

[408] BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses

Shadman Rohan,Ishita Sur Apan,Muhtasim Ibteda Shochcho,Md Fahim,Mohammad Ashfaq Ur Rahman,AKM Mahbubur Rahman,Amin Ahsan Ali

Main category: cs.CL

TL;DR: Team BD 提交了 BEA 2025 共享任务的解决方案，专注于 AI 导师的教学能力评估，包括错误识别（Track 1）和错误定位（Track 2）。基于 MPNet 模型，通过集成学习和交叉验证取得良好结果。

Details

Motivation: 评估 AI 导师在对话中识别和定位学生错误的能力，以提升教育对话系统的可靠性。 Method: 使用 MPNet 模型，结合类加权交叉熵损失处理数据不平衡，采用分组交叉验证和硬投票集成提升性能。 Result: 在测试集上，错误识别和错误定位的宏 F1 分数分别为 0.7110 和 0.5543。 Conclusion: 提出的集成方法和分析为教育对话系统中的导师响应评估提供了有价值的参考。 Abstract: We present Team BD's submission to the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors, under Track 1 (Mistake Identification) and Track 2 (Mistake Location). Both tracks involve three-class classification of tutor responses in educational dialogues - determining if a tutor correctly recognizes a student's mistake (Track 1) and whether the tutor pinpoints the mistake's location (Track 2). Our system is built on MPNet, a Transformer-based language model that combines BERT and XLNet's pre-training advantages. We fine-tuned MPNet on the task data using a class-weighted cross-entropy loss to handle class imbalance, and leveraged grouped cross-validation (10 folds) to maximize the use of limited data while avoiding dialogue overlap between training and validation. We then performed a hard-voting ensemble of the best models from each fold, which improves robustness and generalization by combining multiple classifiers. Our approach achieved strong results on both tracks, with exact-match macro-F1 scores of approximately 0.7110 for Mistake Identification and 0.5543 for Mistake Location on the official test set. We include comprehensive analysis of our system's performance, including confusion matrices and t-SNE visualizations to interpret classifier behavior, as well as a taxonomy of common errors with examples. We hope our ensemble-based approach and findings provide useful insights for designing reliable tutor response evaluation systems in educational dialogue settings.

[409] Not All Jokes Land: Evaluating Large Language Models Understanding of Workplace Humor

Moahmmadamin Shafiei,Hamidreza Saffari

Main category: cs.CL

TL;DR: 论文探讨了AI和LLMs在自动化任务中的应用，尤其是专业幽默的评估问题，发现LLMs在判断幽默适当性上表现不佳。

Details

Motivation: 随着AI和LLMs的发展，自动化任务如写作受到关注，但专业幽默的评估被忽视，论文旨在填补这一空白。 Method: 开发了一个专业幽默语句数据集，并评估了五种LLMs对幽默适当性的判断能力。 Result: LLMs在准确判断幽默适当性方面表现不佳。 Conclusion: 研究表明LLMs在专业幽默评估上存在局限性，需进一步改进。 Abstract: With the recent advances in Artificial Intelligence (AI) and Large Language Models (LLMs), the automation of daily tasks, like automatic writing, is getting more and more attention. Hence, efforts have focused on aligning LLMs with human values, yet humor, particularly professional industrial humor used in workplaces, has been largely neglected. To address this, we develop a dataset of professional humor statements along with features that determine the appropriateness of each statement. Our evaluation of five LLMs shows that LLMs often struggle to judge the appropriateness of humor accurately.

[410] Is Extending Modality The Right Path Towards Omni-Modality?

Tinghui Zhu,Kai Zhang,Muhao Chen,Yu Su

Main category: cs.CL

TL;DR: 论文研究了如何通过模态扩展技术实现真正的全模态语言模型（OLMs），并探讨了其对核心语言能力、模型合并效果及知识共享的影响。

Details

Motivation: 现有开源模型在实现全模态能力方面表现不足，无法泛化到未训练过的模态组合或多模态输入。本文旨在研究模态扩展技术的效果及其潜在问题。 Method: 通过模态扩展技术，对现成的语言模型进行目标领域和语言数据的微调，并探讨模型合并、知识共享与泛化的效果。 Result: 实验分析了模态扩展对核心语言能力的影响，模型合并的有效性，以及全模态扩展是否优于顺序扩展。 Conclusion: 研究为当前方法实现真正全模态的可行性提供了见解，但仍存在挑战。 Abstract: Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities--such as text, images, video, and audio--while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.

[411] CiteEval: Principle-Driven Citation Evaluation for Source Attribution

Yumo Xu,Peng Qi,Jifan Chen,Kunlun Liu,Rujun Han,Lan Liu,Bonan Min,Vittorio Castelli,Arshit Gupta,Zhiguo Wang

Main category: cs.CL

TL;DR: CiteEval是一个新的引用评估框架，通过细粒度评估和上下文分析改进现有基于NLI的方法，并开发了CiteBench基准和CiteEval-Auto自动评估工具。

Details

Motivation: 现有引用评估方法主要依赖NLI，未能全面捕捉引用的多维度特性，影响了信息检索的信任和效果。 Method: 提出CiteEval框架，结合上下文、用户查询和生成文本进行细粒度评估，并构建CiteBench基准和CiteEval-Auto自动评估工具。 Result: CiteEval-Auto在实验中表现出优于现有指标的评估能力，能更全面地捕捉引用的复杂性。 Conclusion: CiteEval为引用评估提供了更全面、可扩展的解决方案，有助于改进模型生成的引用质量。 Abstract: Citation quality is crucial in information-seeking systems, directly influencing trust and the effectiveness of information access. Current evaluation frameworks, both human and automatic, mainly rely on Natural Language Inference (NLI) to assess binary or ternary supportiveness from cited sources, which we argue is a suboptimal proxy for citation evaluation. In this work we introduce CiteEval, a citation evaluation framework driven by principles focusing on fine-grained citation assessment within a broad context, encompassing not only the cited sources but the full retrieval context, user query, and generated text. Guided by the proposed framework, we construct CiteBench, a multi-domain benchmark with high-quality human annotations on citation quality. To enable efficient evaluation, we further develop CiteEval-Auto, a suite of model-based metrics that exhibit strong correlation with human judgments. Experiments across diverse systems demonstrate CiteEval-Auto's superior ability to capture the multifaceted nature of citations compared to existing metrics, offering a principled and scalable approach to evaluate and improve model-generated citations.

[412] Minimal Pair-Based Evaluation of Code-Switching

Igor Sterner,Simone Teufel

Main category: cs.CL

TL;DR: 提出了一种基于最小对干预的方法，评估大型语言模型（LLM）在代码转换（CS）上的表现，发现模型规模越大，越能像双语者一样偏好自然CS句子。

Details

Motivation: 现有方法在语言覆盖、CS现象多样性或扩展性上不足，缺乏评估LLM在CS上表现的方法。 Method: 通过最小对干预，收集11种语言对的自然CS句子及其变体，进行人类和LLM实验。 Result: 双语者一致偏好自然CS句子；LLM规模越大，对自然CS句子的概率分配越高。 Conclusion: 模型规模与CS表现正相关，最大概率差异出现在封闭类词变体中。 Abstract: There is a lack of an evaluation methodology that estimates the extent to which large language models (LLMs) use code-switching (CS) in the same way as bilinguals. Existing methods do not have wide language coverage, fail to account for the diverse range of CS phenomena, or do not scale. We propose an intervention based on minimal pairs of CS. Each minimal pair contains one naturally occurring CS sentence and one minimally manipulated variant. We collect up to 1,000 such pairs each for 11 language pairs. Our human experiments show that, for every language pair, bilinguals consistently prefer the naturally occurring CS sentence. Meanwhile our experiments with current LLMs show that the larger the model, the more consistently it assigns higher probability to the naturally occurring CS sentence than to the variant. In accordance with theoretical claims, the largest probability differences arise in those pairs where the manipulated material consisted of closed-class words.

[413] Code-Switching and Syntax: A Large-Scale Experiment

Igor Sterner,Simone Teufel

Main category: cs.CL

TL;DR: 本文通过大规模多语言实验验证了语法在代码切换（CS）中的核心作用，证明仅凭语法信息即可预测双语者的切换行为。

Details

Motivation: 现有研究多为点状分析，缺乏大规模跨语言实验验证语法对CS的影响。 Method: 设计仅依赖语法信息的自动预测系统，测试其对CS位置的预测能力。 Result: 系统表现与双语人类相当，且学到的语法模式能泛化到未见过的语言对。 Conclusion: 语法是解释CS模式的关键因素，其预测能力具有普适性。 Abstract: The theoretical code-switching (CS) literature provides numerous pointwise investigations that aim to explain patterns in CS, i.e. why bilinguals switch language in certain positions in a sentence more often than in others. A resulting consensus is that CS can be explained by the syntax of the contributing languages. There is however no large-scale, multi-language, cross-phenomena experiment that tests this claim. When designing such an experiment, we need to make sure that the system that is predicting where bilinguals tend to switch has access only to syntactic information. We provide such an experiment here. Results show that syntax alone is sufficient for an automatic system to distinguish between sentences in minimal pairs of CS, to the same degree as bilingual humans. Furthermore, the learnt syntactic patterns generalise well to unseen language pairs.

[414] CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions

Tamer Alkhouli,Katerina Margatina,James Gung,Raphael Shu,Claudia Zaghi,Monica Sunkara,Yi Zhang

Main category: cs.CL

TL;DR: CONFETTI是一个用于评估大型语言模型（LLM）在复杂对话场景中功能调用能力的基准测试，包含109个人工模拟对话，覆盖86个API。测试结果显示，不同模型在长对话和多API调用中的表现差异显著。

Details

Motivation: 当前基准测试缺乏对LLM在复杂对话场景中功能调用能力的全面评估，CONFETTI旨在填补这一空白。 Method: 通过109个人工模拟对话（313个用户轮次）进行离策略轮级评估，涵盖多种对话复杂性（如跟进、目标修正、模糊目标等），并加入对话行为标注。 Result: 测试显示，部分模型能处理长对话和多API调用（如Nova Pro表现最佳，40.01%），但多数模型在链式功能调用中表现受限。 Conclusion: CONFETTI为LLM的功能调用能力提供了全面评估，揭示了模型在复杂对话中的局限性，尤其是链式功能调用方面。 Abstract: We introduce Conversational Function-Calling Evaluation Through Turn-Level Interactions (CONFETTI), a conversational benchmark1 designed to evaluate the function-calling capabilities and response quality of large language models (LLMs). Current benchmarks lack comprehensive assessment of LLMs in complex conversational scenarios. CONFETTI addresses this gap through 109 human-simulated conversations, comprising 313 user turns and covering 86 APIs. These conversations explicitly target various conversational complexities, such as follow-ups, goal correction and switching, ambiguous and implicit goals. We perform off-policy turn-level evaluation using this benchmark targeting function-calling. Our benchmark also incorporates dialog act annotations to assess agent responses. We evaluate a series of state-of-the-art LLMs and analyze their performance with respect to the number of available APIs, conversation lengths, and chained function calling. Our results reveal that while some models are able to handle long conversations, and leverage more than 20+ APIs successfully, other models struggle with longer context or when increasing the number of APIs. We also report that the performance on chained function-calls is severely limited across the models. Overall, the top performing models on CONFETTI are Nova Pro (40.01%), Claude Sonnet v3.5 (35.46%) and Llama 3.1 405B (33.19%) followed by command-r-plus (31.18%) and Mistral-Large-2407 (30.07%).

[415] Spatial Coordinates as a Cell Language: A Multi-Sentence Framework for Imaging Mass Cytometry Analysis

Chi-Jane Chen,Yuhang Chen,Sukwon Yun,Natalie Stanley,Tianlong Chen

Main category: cs.CL

TL;DR: Spatial2Sentence是一个新框架，通过多句子方法将单细胞表达和空间信息整合到自然语言中，解决了现有单细胞LLMs在空间信息整合和细胞间相互作用方面的不足。

Details

Motivation: 现有单细胞LLMs难以有效整合空间信息和捕捉细胞间相互作用，限制了其在生物学关系分析中的应用。 Method: Spatial2Sentence构建表达相似性和距离矩阵，将空间相邻且表达相似的细胞配对为正样本，远距离且表达不相似的细胞为负样本，通过多句子表示使LLMs学习细胞间相互作用。 Result: 在预处理IMC数据集上，Spatial2Sentence优于现有单细胞LLMs，糖尿病数据集的细胞类型分类和临床状态预测分别提高了5.98%和4.18%。 Conclusion: Spatial2Sentence通过整合空间和表达信息，显著提升了单细胞LLMs的性能和可解释性。 Abstract: Image mass cytometry (IMC) enables high-dimensional spatial profiling by combining mass cytometry's analytical power with spatial distributions of cell phenotypes. Recent studies leverage large language models (LLMs) to extract cell states by translating gene or protein expression into biological context. However, existing single-cell LLMs face two major challenges: (1) Integration of spatial information: they struggle to generalize spatial coordinates and effectively encode spatial context as text, and (2) Treating each cell independently: they overlook cell-cell interactions, limiting their ability to capture biological relationships. To address these limitations, we propose Spatial2Sentence, a novel framework that integrates single-cell expression and spatial information into natural language using a multi-sentence approach. Spatial2Sentence constructs expression similarity and distance matrices, pairing spatially adjacent and expressionally similar cells as positive pairs while using distant and dissimilar cells as negatives. These multi-sentence representations enable LLMs to learn cellular interactions in both expression and spatial contexts. Equipped with multi-task learning, Spatial2Sentence outperforms existing single-cell LLMs on preprocessed IMC datasets, improving cell-type classification by 5.98% and clinical status prediction by 4.18% on the diabetes dataset while enhancing interpretability. The source code can be found here: https://github.com/UNITES-Lab/Spatial2Sentence.

[416] From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation

Serry Sibaee,Omer Nacar,Adel Ammar,Yasser Al-Habashi,Abdulrahman Al-Batati,Wadii Boulila

Main category: cs.CL

TL;DR: 本文填补了阿拉伯语语言模型评估的关键空白，提出了理论指南和新评估框架ADMD，测试了五大模型，发现Claude 3.5 Sonnet表现最佳。

Details

Motivation: 解决现有阿拉伯语评估数据集的不足，如语言准确性、文化对齐和方法严谨性。 Method: 提出阿拉伯深度迷你数据集（ADMD），包含490个挑战性问题，覆盖10大领域，评估五大语言模型。 Result: 模型表现差异显著，Claude 3.5 Sonnet总体准确率30%，在数学理论、阿拉伯语和伊斯兰领域表现较强。 Conclusion: 强调文化能力与技术能力并重，为阿拉伯语模型评估提供理论和实践指导。 Abstract: This paper addresses critical gaps in Arabic language model evaluation by establishing comprehensive theoretical guidelines and introducing a novel evaluation framework. We first analyze existing Arabic evaluation datasets, identifying significant issues in linguistic accuracy, cultural alignment, and methodological rigor. To address these limitations in LLMs, we present the Arabic Depth Mini Dataset (ADMD), a carefully curated collection of 490 challenging questions spanning ten major domains (42 sub-domains, see Figure 1. Using ADMD, we evaluate five leading language models: GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max. Our results reveal significant variations in model performance across different domains, with particular challenges in areas requiring deep cultural understanding and specialized knowledge. Claude 3.5 Sonnet demonstrated the highest overall accuracy at 30\%, showing relative strength in mathematical theory in Arabic, Arabic language, and islamic domains. This work provides both theoretical foundations and practical insights for improving Arabic language model evaluation, emphasizing the importance of cultural competence alongside technical capabilities.

[417] Esoteric Language Models

Subham Sekhar Sahoo,Zhihan Yang,Yash Akhauri,Johnna Liu,Deepansha Singh,Zhoujun Cheng,Zhengzhong Liu,Eric Xing,John Thickstun,Arash Vahdat

Main category: cs.CL

TL;DR: Eso-LMs结合自回归和掩码扩散模型，提升语言模型性能，首次为MDM引入KV缓存，显著提高推理效率。

Details

Motivation: 解决掩码扩散模型（MDM）在困惑度和推理效率上不如自回归模型的问题。 Method: 融合自回归和MDM范式，引入KV缓存并优化采样计划。 Result: 在标准语言建模基准上达到新SOTA，推理速度比标准MDM快65倍，比半自回归方法快4倍。 Conclusion: Eso-LMs成功克服了MDM和自回归模型的局限性，显著提升了性能和效率。 Abstract: Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features--most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the **first to introduce KV caching for MDMs** while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to **65x** faster inference than standard MDMs and **4x** faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: [http://s-sahoo.github.io/Eso-LMs](http://s-sahoo.github.io/Eso-LMs)

[418] RewardBench 2: Advancing Reward Model Evaluation

Saumya Malik,Valentina Pyatkin,Sander Land,Jacob Morrison,Noah A. Smith,Hannaneh Hajishirzi,Nathan Lambert

Main category: cs.CL

TL;DR: RewardBench 2是一个新的多技能奖励建模基准，旨在提供更具挑战性的数据，用于基于准确性的奖励模型评估。与RewardBench 1相比，模型在RewardBench 2上的平均得分低约20分，但与下游任务性能高度相关。

Details

Motivation: 现有奖励模型在下游任务中的效果不如简单的直接对齐算法，因此需要更严格的评估基准来提升奖励模型的实际性能。 Method: RewardBench 2通过引入新的人类提示（而非现有下游评估中的提示）构建，并量化其在推理时扩展算法（如best-of-N采样）和RLHF训练算法（如近端策略优化）中的相关性。 Result: RewardBench 2的得分与下游任务性能高度相关，且模型在该基准上的表现明显低于RewardBench 1。 Conclusion: RewardBench 2为奖励模型提供了更具挑战性和相关性的评估基准，有助于推动奖励模型在实际任务中的改进。 Abstract: Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and more domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score about 20 points on average lower on RewardBench 2 compared to the first RewardBench -- while being highly correlated with downstream performance. Compared to most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.

[419] Novel Benchmark for NER in the Wastewater and Stormwater Domain

Franco Alberto Cardillo,Franca Debole,Francesca Frontini,Mitra Aelami,Nanée Chahinian,Serge Conrad

Main category: cs.CL

TL;DR: 论文研究了废水管理领域的多语言命名实体识别（NER），开发了法语和意大利语的语料库，并评估了包括LLM在内的先进方法，为未来策略提供基准。

Details

Motivation: 废水管理对城市可持续性和环境保护至关重要，但领域术语和多语言背景使得从报告中提取结构化知识具有挑战性。 Method: 研究开发了法语和意大利语的废水管理语料库，评估了包括LLM在内的NER方法，并探索了自动标注投影以扩展语料库。 Result: 提供了多语言NER的可靠基准，并展示了自动标注投影的潜力。 Conclusion: 该研究为废水管理领域的多语言信息提取提供了基础，支持未来决策制定。 Abstract: Effective wastewater and stormwater management is essential for urban sustainability and environmental protection. Extracting structured knowledge from reports and regulations is challenging due to domainspecific terminology and multilingual contexts. This work focuses on domain-specific Named Entity Recognition (NER) as a first step towards effective relation and information extraction to support decision making. A multilingual benchmark is crucial for evaluating these methods. This study develops a French-Italian domain-specific text corpus for wastewater management. It evaluates state-of-the-art NER methods, including LLM-based approaches, to provide a reliable baseline for future strategies and explores automated annotation projection in view of an extension of the corpus to new languages.

[420] Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang,Le Yu,Chang Gao,Chujie Zheng,Shixuan Liu,Rui Lu,Kai Dang,Xionghui Chen,Jianxin Yang,Zhenru Zhang,Yuqiong Liu,An Yang,Andrew Zhao,Yang Yue,Shiji Song,Bowen Yu,Gao Huang,Junyang Lin

Main category: cs.CL

TL;DR: RLVR通过分析令牌熵模式，发现高熵令牌对推理性能至关重要，优化这些令牌可显著提升模型表现。

Details

Motivation: 探索RLVR机制，理解令牌熵模式如何影响LLM的推理能力。 Method: 通过分析Chain-of-Thought推理中的令牌熵模式，研究RLVR训练中熵的演变，并优化高熵令牌。 Result: 仅优化20%高熵令牌即可保持性能，甚至超越全梯度更新，而优化低熵令牌则导致性能下降。 Conclusion: RLVR的有效性源于优化高熵令牌，未来可通过令牌熵视角进一步优化LLM推理。 Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.

[421] Self-ensemble: Mitigating Confidence Distortion for Large Language Models

Zicheng Xu,Guanchu Wang,Guangyao Zheng,Yu-Neng Chuang,Alexander Szalay,Xia Hu,Vladimir Braverman

Main category: cs.CL

TL;DR: LLMs在多项选择题（MCQA）中存在置信度扭曲问题，尤其是在选项增多时。作者提出Self-ensemble方法，通过分组和集成预测来解决这一问题，无需额外标注数据。实验证明其有效性。

Details

Motivation: LLMs在MCQA中表现出置信度扭曲问题（正确预测信心不足，错误预测信心过高），影响性能。 Method: 将选项分组，通过设计的注意力掩码和位置编码集成LLM预测，无需参数调优。 Result: 在三个LLM和数据集上，Self-ensemble显著解决了置信度扭曲问题，优于标准推理和基线方法。 Conclusion: Self-ensemble是一种即插即用的方法，有效提升LLMs在MCQA中的表现。 Abstract: Although Large Language Models (LLMs) perform well in general fields, they exhibit a confidence distortion problem on multi-choice question-answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs suffer from under-confidence in correct predictions and over-confidence in incorrect ones, leading to a substantially degraded performance. To solve this problem, we propose Self-ensemble in this work. Our method splits the choices into several groups and ensembles LLM predictions across these groups to reach a final decision. The advantage of Self-ensemble is its plug-and-play nature, where it can be integrated into existing LLM architecture based on a designed attention mask and positional encoding, without requiring labeled datasets for parameter tuning. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem of LLMs, outperforming standard inference as well as baseline methods.

[422] WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

Atsuyuki Miyai,Zaiying Zhao,Kazuki Egashira,Atsuki Sato,Tatsumi Sunada,Shota Onohara,Hiromasa Yamanishi,Mashiro Toyooka,Kunato Nishina,Ryoma Maeda,Kiyoharu Aizawa,Toshihiko Yamasaki

Main category: cs.CL

TL;DR: WebChoreArena是一个新的基准测试，包含532个任务，旨在评估LLM在复杂、繁琐任务上的表现，包括大规模记忆、计算和长期记忆挑战。

Details

Motivation: 评估LLM是否能超越一般浏览任务，处理人类常回避的繁琐复杂任务。 Method: 基于WebArena的四个模拟环境，设计WebChoreArena基准测试，包含三类挑战任务。 Result: 实验显示，随着LLM（如GPT-4o、Claude 3.7 Sonnet、Gemini 2.5 Pro）的进化，性能显著提升，但仍存在改进空间。 Conclusion: WebChoreArena能清晰衡量LLM的进步，但也揭示了其在复杂任务上的挑战。 Abstract: Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information in the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks necessitating long-term memory across multiple webpages. Built on top of the fully reproducible and widely adopted four WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, significant improvements in performance are observed on WebChoreArena. These findings suggest that WebChoreArena is well-suited to measure the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.

[423] DRAG: Distilling RAG for SLMs from LLMs to Transfer Knowledge and Mitigate Hallucination via Evidence and Graph-based Distillation

Jennifer Chen,Aidar Myrzakhan,Yaxin Luo,Hassaan Muhammad Khan,Sondos Mahmoud Bsharat,Zhiqiang Shen

Main category: cs.CL

TL;DR: 论文提出了一种名为DRAG的新框架，通过知识蒸馏将大型语言模型（LLMs）的检索增强生成（RAG）能力迁移到小型语言模型（SLMs）中，显著减少了计算资源消耗和幻觉内容生成，同时提升了事实准确性。

Details

Motivation: 解决大规模RAG系统的高计算资源消耗和幻觉内容生成问题，同时提升小型语言模型的事实准确性。 Method: 采用基于证据和知识图的知识蒸馏方法，将大型语言模型的RAG能力迁移到小型语言模型中。 Result: 实验表明，DRAG在多个基准测试中优于现有方法（如MiniRAG），性能提升高达27.7%，同时保持了高效性和可靠性。 Conclusion: DRAG为在小型语言模型中部署增强的检索和生成能力提供了一种实用且资源高效的解决方案。 Abstract: Retrieval-Augmented Generation (RAG) methods have proven highly effective for tasks requiring factual consistency and robust knowledge retrieval. However, large-scale RAG systems consume significant computational resources and are prone to generating hallucinated content from Humans. In this work, we introduce $\texttt{DRAG}$, a novel framework for distilling RAG knowledge from large-scale Language Models (LLMs) into small LMs (SLMs). Our approach leverages evidence- and knowledge graph-based distillation, ensuring that the distilled model retains critical factual knowledge while significantly reducing model size and computational cost. By aligning the smaller model's predictions with a structured knowledge graph and ranked evidence, $\texttt{DRAG}$ effectively mitigates hallucinations and improves factual accuracy. We further present a case demonstrating how our framework mitigates user privacy risks and introduce a corresponding benchmark. Experimental evaluations on multiple benchmarks demonstrate that our method outperforms the prior competitive RAG methods like MiniRAG for SLMs by up to 27.7% using the same models, preserving high-level efficiency and reliability. With $\texttt{DRAG}$, we provide a practical and resource-efficient roadmap to deploying enhanced retrieval and generation capabilities in small-sized LLMs.

cs.IR [Back]

[424] GLEN: Generative Retrieval via Lexical Index Learning

Sunkyung Lee,Minjin Choi,Jongwuk Lee

Main category: cs.IR

TL;DR: GLEN是一种新的生成式检索方法，通过动态词汇标识符和两阶段索引学习策略解决了现有方法的挑战，并在多个基准数据集上表现出色。

Details

Motivation: 生成式检索直接生成文档标识符，但面临预训练语言模型知识与标识符之间的差异以及训练与推理之间的差距问题。 Method: GLEN采用动态词汇标识符和两阶段索引学习策略，训练时学习词汇标识符和查询与文档的相关性信号，推理时使用无冲突推理和标识符权重排序文档。 Result: GLEN在NQ320k、MS MARCO和BEIR等基准数据集上实现了最先进或竞争性性能。 Conclusion: GLEN通过创新的索引学习和推理策略，有效解决了生成式检索的挑战，并展示了优越的性能。 Abstract: Generative retrieval shed light on a new paradigm of document retrieval, aiming to directly generate the identifier of a relevant document for a query. While it takes advantage of bypassing the construction of auxiliary index structures, existing studies face two significant challenges: (i) the discrepancy between the knowledge of pre-trained language models and identifiers and (ii) the gap between training and inference that poses difficulty in learning to rank. To overcome these challenges, we propose a novel generative retrieval method, namely Generative retrieval via LExical iNdex learning (GLEN). For training, GLEN effectively exploits a dynamic lexical identifier using a two-phase index learning strategy, enabling it to learn meaningful lexical identifiers and relevance signals between queries and documents. For inference, GLEN utilizes collision-free inference, using identifier weights to rank documents without additional overhead. Experimental results prove that GLEN achieves state-of-the-art or competitive performance against existing generative retrieval methods on various benchmark datasets, e.g., NQ320k, MS MARCO, and BEIR. The code is available at https://github.com/skleee/GLEN.

[425] Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers

Chaitanya Sharma

Main category: cs.IR

TL;DR: 本文综述了检索增强生成（RAG）的最新进展，分析了其架构分类、优化方法及性能表现，并探讨了未来研究方向。

Details

Motivation: RAG通过结合外部检索信息增强大语言模型（LLMs），解决了参数化知识存储的局限性，但也带来了检索质量、效率等新挑战。 Method: 文章对RAG系统进行分类（如检索器中心、生成器中心等），并系统分析了检索优化、上下文过滤、解码控制等改进方法。 Result: 通过比较分析，揭示了检索精度与生成灵活性、效率与忠实性之间的权衡。 Conclusion: 未来研究方向包括自适应检索架构、实时检索集成等，本文旨在为下一代RAG系统奠定基础。 Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance large language models (LLMs) by conditioning generation on external evidence retrieved at inference time. While RAG addresses critical limitations of parametric knowledge storage-such as factual inconsistency and domain inflexibility-it introduces new challenges in retrieval quality, grounding fidelity, pipeline efficiency, and robustness against noisy or adversarial inputs. This survey provides a comprehensive synthesis of recent advances in RAG systems, offering a taxonomy that categorizes architectures into retriever-centric, generator-centric, hybrid, and robustness-oriented designs. We systematically analyze enhancements across retrieval optimization, context filtering, decoding control, and efficiency improvements, supported by comparative performance analyses on short-form and multi-hop question answering tasks. Furthermore, we review state-of-the-art evaluation frameworks and benchmarks, highlighting trends in retrieval-aware evaluation, robustness testing, and federated retrieval settings. Our analysis reveals recurring trade-offs between retrieval precision and generation flexibility, efficiency and faithfulness, and modularity and coordination. We conclude by identifying open challenges and future research directions, including adaptive retrieval architectures, real-time retrieval integration, structured reasoning over multi-hop evidence, and privacy-preserving retrieval mechanisms. This survey aims to consolidate current knowledge in RAG research and serve as a foundation for the next generation of retrieval-augmented language modeling systems.

[426] GPR: Empowering Generation with Graph-Pretrained Retriever

Xiaochen Wang,Zongyu Wu,Yuan Zhong,Xiang Zhang,Suhang Wang,Fenglong Ma

Main category: cs.IR

TL;DR: GPR是一种基于知识图谱预训练的图检索器，通过LLM引导的图增强和结构感知目标，显著提升检索质量和下游生成效果。

Details

Motivation: 现有检索器依赖纯文本预训练的语言模型，存在领域不对齐和结构忽略问题，限制了其在图检索增强生成中的效果。 Method: 提出GPR，直接在知识图谱上预训练，通过LLM引导的图增强对齐自然语言问题与相关子图，并采用结构感知目标学习细粒度检索策略。 Result: 在两个数据集、三个LLM主干和五个基线上的实验表明，GPR显著提升了检索质量和下游生成效果。 Conclusion: GPR是一种针对图检索增强生成的鲁棒检索解决方案。 Abstract: Graph retrieval-augmented generation (GRAG) places high demands on graph-specific retrievers. However, existing retrievers often rely on language models pretrained on plain text, limiting their effectiveness due to domain misalignment and structure ignorance. To address these challenges, we propose GPR, a graph-based retriever pretrained directly on knowledge graphs. GPR aligns natural language questions with relevant subgraphs through LLM-guided graph augmentation and employs a structure-aware objective to learn fine-grained retrieval strategies. Experiments on two datasets, three LLM backbones, and five baselines show that GPR consistently improves both retrieval quality and downstream generation, demonstrating its effectiveness as a robust retrieval solution for GRAG.

[427] Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval

Yubai Wei,Jiale Han,Yi Yang

Main category: cs.IR

TL;DR: BMEmbed是一种新方法，通过利用BM25的关键词检索技术，将通用文本嵌入模型适配到私有数据集上，显著提升了检索性能。

Details

Motivation: 通用文本嵌入模型在私有数据集（如公司专有数据）上表现不佳，因为这些数据包含专业术语和行话。 Method: 利用BM25的关键词检索结果排名构建监督信号，以促进模型适配。 Result: 在多个领域、数据集和模型上评估，BMEmbed均表现出检索性能的持续提升。 Conclusion: BM25的信号通过促进对齐和一致性改善了嵌入效果，证明了该方法在适配领域特定数据上的价值。 Abstract: Text embedding models play a cornerstone role in AI applications, such as retrieval-augmented generation (RAG). While general-purpose text embedding models demonstrate strong performance on generic retrieval benchmarks, their effectiveness diminishes when applied to private datasets (e.g., company-specific proprietary data), which often contain specialized terminology and lingo. In this work, we introduce BMEmbed, a novel method for adapting general-purpose text embedding models to private datasets. By leveraging the well-established keyword-based retrieval technique (BM25), we construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation. We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance. Moreover, we provide empirical insights into how BM25-based signals contribute to improving embeddings by fostering alignment and uniformity, highlighting the value of this approach in adapting models to domain-specific data. We release the source code available at https://github.com/BaileyWei/BMEmbed for the research community.

[428] Bridging the Gap: From Ad-hoc to Proactive Search in Conversations

Chuan Meng,Francesco Tonolini,Fengran Mo,Nikolaos Aletras,Emine Yilmaz,Gabriella Kazai

Main category: cs.IR

TL;DR: Conv2Query框架通过将对话上下文映射为ad-hoc查询，解决了PSC中输入不匹配的问题，显著提升了检索性能。

Details

Motivation: PSC中直接使用对话上下文作为ad-hoc检索器输入存在输入不匹配问题，限制了检索质量。 Method: 提出Conv2Query框架，将对话上下文映射为ad-hoc查询，适配ad-hoc检索器。 Result: 在两个PSC数据集上的实验表明，Conv2Query显著提升了检索性能。 Conclusion: Conv2Query有效解决了PSC中的输入不匹配问题，提升了检索质量。 Abstract: Proactive search in conversations (PSC) aims to reduce user effort in formulating explicit queries by proactively retrieving useful relevant information given conversational context. Previous work in PSC either directly uses this context as input to off-the-shelf ad-hoc retrievers or further fine-tunes them on PSC data. However, ad-hoc retrievers are pre-trained on short and concise queries, while the PSC input is longer and noisier. This input mismatch between ad-hoc search and PSC limits retrieval quality. While fine-tuning on PSC data helps, its benefits remain constrained by this input gap. In this work, we propose Conv2Query, a novel conversation-to-query framework that adapts ad-hoc retrievers to PSC by bridging the input gap between ad-hoc search and PSC. Conv2Query maps conversational context into ad-hoc queries, which can either be used as input for off-the-shelf ad-hoc retrievers or for further fine-tuning on PSC data. Extensive experiments on two PSC datasets show that Conv2Query significantly improves ad-hoc retrievers' performance, both when used directly and after fine-tuning on PSC.

[429] GRAM: Generative Recommendation via Semantic-aware Multi-granular Late Fusion

Sunkyung Lee,Minjin Choi,Eunseong Choi,Hye-young Kim,Jongwuk Lee

Main category: cs.IR

TL;DR: GRAM是一种生成式推荐模型，通过语义感知的多粒度延迟融合解决现有方法在隐式项目关系和丰富但冗长项目信息利用上的不足。

Details

Motivation: 现有生成式推荐方法在隐式项目关系和冗长项目信息利用上存在局限，GRAM旨在解决这些问题。 Method: 提出语义到词汇的翻译和多粒度延迟融合，分别编码隐式关系和高效整合丰富语义。 Result: 在四个基准数据集上，GRAM在Recall@5和NDCG@5上显著优于现有方法。 Conclusion: GRAM通过创新设计有效提升了生成式推荐的性能。 Abstract: Generative recommendation is an emerging paradigm that leverages the extensive knowledge of large language models by formulating recommendations into a text-to-text generation task. However, existing studies face two key limitations in (i) incorporating implicit item relationships and (ii) utilizing rich yet lengthy item information. To address these challenges, we propose a Generative Recommender via semantic-Aware Multi-granular late fusion (GRAM), introducing two synergistic innovations. First, we design semantic-to-lexical translation to encode implicit hierarchical and collaborative item relationships into the vocabulary space of LLMs. Second, we present multi-granular late fusion to integrate rich semantics efficiently with minimal information loss. It employs separate encoders for multi-granular prompts, delaying the fusion until the decoding stage. Experiments on four benchmark datasets show that GRAM outperforms eight state-of-the-art generative recommendation models, achieving significant improvements of 11.5-16.0% in Recall@5 and 5.3-13.6% in NDCG@5. The source code is available at https://github.com/skleee/GRAM.

[430] When Should Dense Retrievers Be Updated in Evolving Corpora? Detecting Out-of-Distribution Corpora Using GradNormIR

Dayoon Ko,Jinyoung Kim,Sohyeon Kim,Jinhyuk Kim,Jaehoon Lee,Seonghak Song,Minyoung Lee,Gunhee Kim

Main category: cs.IR

TL;DR: 提出了一种新任务，预测语料库是否超出密集检索器的分布范围，并提出了无监督方法GradNormIR来检测OOD语料库，显著提升了检索的鲁棒性和效率。

Details

Motivation: 现实世界的语料库不断演变，可能导致密集检索器的性能下降，因此需要及时更新或重新训练。 Method: 提出了GradNormIR方法，利用梯度范数无监督地检测OOD语料库。 Result: 在BEIR基准测试中，GradNormIR显著提升了检索的鲁棒性和效率。 Conclusion: GradNormIR能够有效预测OOD语料库，为密集检索器的及时更新提供了解决方案。 Abstract: Dense retrievers encode texts into embeddings to efficiently retrieve relevant documents from large databases in response to user queries. However, real-world corpora continually evolve, leading to a shift from the original training distribution of the retriever. Without timely updates or retraining, indexing newly emerging documents can degrade retrieval performance for future queries. Thus, identifying when a dense retriever requires an update is critical for maintaining robust retrieval systems. In this paper, we propose a novel task of predicting whether a corpus is out-of-distribution (OOD) relative to a dense retriever before indexing. Addressing this task allows us to proactively manage retriever updates, preventing potential retrieval failures. We introduce GradNormIR, an unsupervised approach that leverages gradient norms to detect OOD corpora effectively. Experiments on the BEIR benchmark demonstrate that GradNormIR enables timely updates of dense retrievers in evolving document collections, significantly enhancing retrieval robustness and efficiency.

cs.LG [Back]

[431] PerFormer: A Permutation Based Vision Transformer for Remaining Useful Life Prediction

Zhengyang Fan,Wanru Li,Kuo-chu Chang,Ting Yuan

Main category: cs.LG

TL;DR: 论文提出了一种基于排列的视觉变换器方法（PerFormer），用于提升退化系统中剩余使用寿命（RUL）预测的准确性，通过将多变量时间序列数据转换为类似图像的空间特征，解决了ViT直接应用于时间序列数据的挑战。

Details

Motivation: 随着视觉变换器（ViT）在计算机视觉任务中表现出优于卷积神经网络（CNN）的性能，研究者希望探索其在RUL预测中的潜力，但直接应用于多变量传感器数据存在空间信息模糊的挑战。 Method: 提出PerFormer方法，通过排列多变量时间序列数据模拟图像的空间特征，并设计了一种新的排列损失函数来生成所需的排列矩阵。 Result: 在NASA的C-MAPSS数据集上，PerFormer在RUL预测中表现优于基于CNN、RNN和其他变换器模型的现有方法。 Conclusion: PerFormer展示了在预测和健康管理（PHM）应用中的有效性和潜力，为RUL预测提供了新的解决方案。 Abstract: Accurately estimating the remaining useful life (RUL) for degradation systems is crucial in modern prognostic and health management (PHM). Convolutional Neural Networks (CNNs), initially developed for tasks like image and video recognition, have proven highly effectively in RUL prediction, demonstrating remarkable performance. However, with the emergence of the Vision Transformer (ViT), a Transformer model tailored for computer vision tasks such as image classification, and its demonstrated superiority over CNNs, there is a natural inclination to explore its potential in enhancing RUL prediction accuracy. Nonetheless, applying ViT directly to multivariate sensor data for RUL prediction poses challenges, primarily due to the ambiguous nature of spatial information in time series data. To address this issue, we introduce the PerFormer, a permutation-based vision transformer approach designed to permute multivariate time series data, mimicking spatial characteristics akin to image data, thereby making it suitable for ViT. To generate the desired permutation matrix, we introduce a novel permutation loss function aimed at guiding the convergence of any matrix towards a permutation matrix. Our experiments on NASA's C-MAPSS dataset demonstrate the PerFormer's superior performance in RUL prediction compared to state-of-the-art methods employing CNNs, Recurrent Neural Networks (RNNs), and various Transformer models. This underscores its effectiveness and potential in PHM applications.

[432] Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation

Muhammad Adnan,Nithesh Kurella,Akhil Arunkumar,Prashant J. Nair

Main category: cs.LG

TL;DR: Foresight是一种自适应层重用技术，通过减少去噪步骤中的计算冗余，提升Diffusion Transformers在视频生成中的效率，同时保持性能。

Details

Motivation: Diffusion Transformers在视频生成中因模型大和时空注意力计算成本高而效率低下，静态缓存无法适应生成动态，导致速度与质量的权衡不佳。 Method: 提出Foresight技术，动态识别并重用DiT块输出，根据生成参数（如分辨率和去噪计划）优化效率。 Result: 在OpenSora、Latte和CogVideoX上，Foresight实现了最高1.63倍的端到端加速，同时保持视频质量。 Conclusion: Foresight通过自适应层重用显著提升了视频生成的效率，且不影响性能。 Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. However, their large model size and the quadratic cost of spatial-temporal attention over multiple denoising steps make video generation computationally expensive. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics, leading to suboptimal trade-offs between speed and quality. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance. Foresight dynamically identifies and reuses DiT block outputs for all layers across steps, adapting to generation parameters such as resolution and denoising schedules to optimize efficiency. Applied to OpenSora, Latte, and CogVideoX, Foresight achieves up to 1.63x end-to-end speedup, while maintaining video quality. The source code of Foresight is available at \texttt{https://github.com/STAR-Laboratory/foresight}.

[433] SST: Self-training with Self-adaptive Thresholding for Semi-supervised Learning

Shuai Zhao,Heyan Huang,Xinge Li,Xiaokang Chen,Rui Wang

Main category: cs.LG

TL;DR: 本文提出了一种名为SST的半监督学习框架，通过自适应的阈值调整机制（SAT）高效选择高质量伪标签，显著提升了性能。

Details

Motivation: 现实场景中获取高质量标注数据成本高昂，现有半监督学习方法依赖固定阈值或更新阈值过程耗时。 Method: 提出SST框架，引入SAT机制，根据模型学习进度自适应调整类别特定阈值。 Result: SST在ImageNet-1K基准测试中表现优异，仅用10%标注数据即超越全监督模型性能。 Conclusion: SST通过自适应阈值机制解决了伪标签选择问题，具有高效、泛化性强和可扩展性。 Abstract: Neural networks have demonstrated exceptional performance in supervised learning, benefiting from abundant high-quality annotated data. However, obtaining such data in real-world scenarios is costly and labor-intensive. Semi-supervised learning (SSL) offers a solution to this problem. Recent studies, such as Semi-ViT and Noisy Student, which employ consistency regularization or pseudo-labeling, have demonstrated significant achievements. However, they still face challenges, particularly in accurately selecting sufficient high-quality pseudo-labels due to their reliance on fixed thresholds. Recent methods such as FlexMatch and FreeMatch have introduced flexible or self-adaptive thresholding techniques, greatly advancing SSL research. Nonetheless, their process of updating thresholds at each iteration is deemed time-consuming, computationally intensive, and potentially unnecessary. To address these issues, we propose Self-training with Self-adaptive Thresholding (SST), a novel, effective, and efficient SSL framework. SST introduces an innovative Self-Adaptive Thresholding (SAT) mechanism that adaptively adjusts class-specific thresholds based on the model's learning progress. SAT ensures the selection of high-quality pseudo-labeled data, mitigating the risks of inaccurate pseudo-labels and confirmation bias. Extensive experiments demonstrate that SST achieves state-of-the-art performance with remarkable efficiency, generalization, and scalability across various architectures and datasets. Semi-SST-ViT-Huge achieves the best results on competitive ImageNet-1K SSL benchmarks, with 80.7% / 84.9% Top-1 accuracy using only 1% / 10% labeled data. Compared to the fully-supervised DeiT-III-ViT-Huge, which achieves 84.8% Top-1 accuracy using 100% labeled data, our method demonstrates superior performance using only 10% labeled data.

[434] Flashbacks to Harmonize Stability and Plasticity in Continual Learning

Leila Mahmoodi,Peyman Moghadam,Munawar Hayat,Christian Simon,Mehrtash Harandi

Main category: cs.LG

TL;DR: Flashback Learning (FL) 是一种新方法，通过双向正则化平衡持续学习中的稳定性和可塑性，显著提升了模型性能。

Details

Motivation: 解决持续学习中模型在保留旧知识的同时学习新知识的平衡问题。 Method: 采用两阶段训练过程，结合两个知识库分别增强可塑性和稳定性，并融入多种持续学习方法。 Result: 在标准图像分类基准上，FL 平均准确率提升达 4.91%（类增量）和 3.51%（任务增量），且在 ImageNet 上表现优于现有方法。 Conclusion: FL 通过双向正则化有效平衡稳定性和可塑性，为持续学习提供了高效解决方案。 Abstract: We introduce Flashback Learning (FL), a novel method designed to harmonize the stability and plasticity of models in Continual Learning (CL). Unlike prior approaches that primarily focus on regularizing model updates to preserve old information while learning new concepts, FL explicitly balances this trade-off through a bidirectional form of regularization. This approach effectively guides the model to swiftly incorporate new knowledge while actively retaining its old knowledge. FL operates through a two-phase training process and can be seamlessly integrated into various CL methods, including replay, parameter regularization, distillation, and dynamic architecture techniques. In designing FL, we use two distinct knowledge bases: one to enhance plasticity and another to improve stability. FL ensures a more balanced model by utilizing both knowledge bases to regularize model updates. Theoretically, we analyze how the FL mechanism enhances the stability-plasticity balance. Empirically, FL demonstrates tangible improvements over baseline methods within the same training budget. By integrating FL into at least one representative baseline from each CL category, we observed an average accuracy improvement of up to 4.91% in Class-Incremental and 3.51% in Task-Incremental settings on standard image classification benchmarks. Additionally, measurements of the stability-to-plasticity ratio confirm that FL effectively enhances this balance. FL also outperforms state-of-the-art CL methods on more challenging datasets like ImageNet.

[435] Dynamic Domain Adaptation-Driven Physics-Informed Graph Representation Learning for AC-OPF

Hongjie Zhu,Zezheng Zhang,Zeyu Zhang,Yu Bai,Shimin Wen,Huazhang Wang,Daji Ergu,Ying Cai,Yang Zhao

Main category: cs.LG

TL;DR: DDA-PIGCN是一种结合时空特征的图卷积网络方法，用于解决AC-OPF中的约束建模问题，表现优异。

Details

Motivation: 当前AC-OPF求解器难以有效建模约束空间与最优解之间的复杂关系，且缺乏时空信息的整合。 Method: 提出DDA-PIGCN方法，通过多层硬物理约束和动态域适应学习机制，结合电网物理结构捕捉时空依赖性。 Result: 在多个IEEE标准测试案例中，MAE为0.0011至0.0624，约束满足率达99.6%至100%。 Conclusion: DDA-PIGCN是一种可靠高效的AC-OPF求解器，解决了约束建模和时空信息整合的难题。 Abstract: Alternating Current Optimal Power Flow (AC-OPF) aims to optimize generator power outputs by utilizing the non-linear relationships between voltage magnitudes and phase angles in a power system. However, current AC-OPF solvers struggle to effectively represent the complex relationship between variable distributions in the constraint space and their corresponding optimal solutions. This limitation in constraint modeling restricts the system's ability to develop diverse knowledge representations. Additionally, modeling the power grid solely based on spatial topology further limits the integration of additional prior knowledge, such as temporal information. To overcome these challenges, we propose DDA-PIGCN (Dynamic Domain Adaptation-Driven Physics-Informed Graph Convolutional Network), a new method designed to address constraint-related issues and build a graph-based learning framework that incorporates spatiotemporal features. DDA-PIGCN improves consistency optimization for features with varying long-range dependencies by applying multi-layer, hard physics-informed constraints. It also uses a dynamic domain adaptation learning mechanism that iteratively updates and refines key state variables under predefined constraints, enabling precise constraint verification. Moreover, it captures spatiotemporal dependencies between generators and loads by leveraging the physical structure of the power grid, allowing for deep integration of topological information across time and space. Extensive comparative and ablation studies show that DDA-PIGCN delivers strong performance across several IEEE standard test cases (such as case9, case30, and case300), achieving mean absolute errors (MAE) from 0.0011 to 0.0624 and constraint satisfaction rates between 99.6% and 100%, establishing it as a reliable and efficient AC-OPF solver.

[436] MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning

Peng Xia,Jinglu Wang,Yibo Peng,Kaide Zeng,Xian Wu,Xiangru Tang,Hongtu Zhu,Yun Li,Shujie Liu,Yan Lu,Huaxiu Yao

Main category: cs.LG

TL;DR: MMedAgent-RL 是一个基于强化学习的多智能体框架，通过动态协作提升医学多模态诊断任务的性能，显著优于现有方法。

Details

Motivation: 现有单智能体模型在跨医学专业泛化能力不足，静态多智能体协作缺乏灵活性和适应性。 Method: 提出基于强化学习的动态协作框架，包括分诊医生和主治医生，并通过课程学习策略优化决策。 Result: 在五个医学 VQA 基准测试中表现优异，平均性能提升 18.4%，并展现出类人推理模式。 Conclusion: MMedAgent-RL 通过动态协作和课程学习策略，显著提升了医学多模态诊断的灵活性和性能。 Abstract: Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy that progressively teaches the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL not only outperforms both open-source and proprietary Med-LVLMs, but also exhibits human-like reasoning patterns. Notably, it achieves an average performance gain of 18.4% over supervised fine-tuning baselines.

[437] QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

Wei Dai,Peilin Chen,Chanakya Ekbote,Paul Pu Liang

Main category: cs.LG

TL;DR: QoQ-Med-7B/32B是首个开放通用的临床基础模型，支持跨医学图像、时间序列信号和文本报告的联合推理，通过DRPO训练显著提升诊断性能。

Details

Motivation: 现有多模态语言模型多为视觉中心，无法泛化到不同临床领域，需解决临床数据分布不均的问题。 Method: 采用Domain-aware Relative Policy Optimization (DRPO)训练，根据领域稀有性和模态难度分层缩放奖励，优化性能不平衡。 Result: DRPO训练使诊断性能平均提升43%（宏F1），在分割任务中IoU比开放模型高10倍，达到OpenAI o4-mini水平。 Conclusion: QoQ-Med通过DRPO和多模态联合推理显著提升临床决策能力，并开源模型权重、训练管道和推理痕迹以促进研究。 Abstract: Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.

[438] Adaptive Plane Reformatting for 4D Flow MRI using Deep Reinforcement Learning

Javier Bisbal,Julio Sotelo,Maria I Valdés,Pablo Irarrazaval,Marcelo E Andia,Julio García,José Rodriguez-Palomarez,Francesca Raimondi,Cristián Tejos,Sergio Uribe

Main category: cs.LG

TL;DR: 论文提出了一种基于灵活坐标系的深度强化学习方法，用于医学图像中的平面重格式化任务，显著提高了准确性和适应性。

Details

Motivation: 现有深度强化学习方法在平面重格式化任务中表现良好，但要求测试数据与训练数据的位置和方向一致，限制了其实际应用。 Method: 采用异步优势演员-评论家（A3C）算法，结合灵活坐标系，实现任意位置和方向的体积导航。 Result: 在4D流MRI中，平面重格式化的角度和距离误差显著降低（6.32±4.15°和3.40±2.75 mm），且流测量结果与专家操作无显著差异（p=0.21）。 Conclusion: 该方法具有灵活性和适应性，适用于4D流MRI及其他医学影像应用。 Abstract: Deep reinforcement learning (DRL) algorithms have shown robust results in plane reformatting tasks. In these methods, an agent sequentially adjusts the position and orientation of an initial plane towards an objective location. This process allows accurate plane reformatting, without the need for detailed landmarks, which makes it suitable for images with limited contrast and resolution, such as 4D flow MRI. However, current DRL methods require the test dataset to be in the same position and orientation as the training dataset. In this paper, we present a novel technique that utilizes a flexible coordinate system based on the current state, enabling navigation in volumes at any position or orientation. We adopted the Asynchronous Advantage Actor Critic (A3C) algorithm for reinforcement learning, outperforming Deep Q Network (DQN). Experimental results in 4D flow MRI demonstrate improved accuracy in plane reformatting angular and distance errors (6.32 +- 4.15 {\deg} and 3.40 +- 2.75 mm), as well as statistically equivalent flow measurements determined by a plane reformatting process done by an expert (p=0.21). The method's flexibility and adaptability make it a promising candidate for other medical imaging applications beyond 4D flow MRI.

[439] Understanding Model Reprogramming for CLIP via Decoupling Visual Prompts

Chengyi Cai,Zesheng Ye,Lei Feng,Jianzhong Qi,Feng Liu

Main category: cs.LG

TL;DR: 论文提出了一种解耦和重加权框架（DVP），通过分组优化视觉提示并引入概率重加权矩阵（PRM），提升CLIP模型在下游任务中的性能。

Details

Motivation: 现有视觉重编程方法（VR）在CLIP中训练单一视觉提示，可能无法捕捉描述多样性或偏向非信息性属性，影响分类效果。 Method: 提出解耦视觉提示（DVP），通过分组描述优化提示，并引入PRM衡量其对分类的贡献。 Result: DVP在11个下游数据集上表现优于基线方法，且PRM提供了对分类决策的可解释性。 Conclusion: DVP通过解耦和重加权提升了模型性能，同时提供了对视觉提示影响的概率性理解。 Abstract: Model reprogramming adapts pretrained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input images to facilitate downstream classification. The existing VR approaches for CLIP train a single visual prompt using all descriptions of different downstream classes. However, the limited learning capacity may result in (1) a failure to capture diverse aspects of the descriptions (e.g., shape, color, and texture), and (2) a possible bias toward less informative attributes that do not help distinguish between classes. In this paper, we introduce a decoupling-and-reweighting framework. Our decoupled visual prompts (DVP) are optimized using descriptions grouped by explicit causes (DVP-cse) or unsupervised clusters (DVP-cls). Then, we integrate the outputs of these visual prompts with a probabilistic reweighting matrix (PRM) that measures their contributions to each downstream class. Theoretically, DVP lowers the empirical risk bound. Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming. Our code is available at https://github.com/tmlr-group/DecoupledVP.

[440] $Ψ$-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models

Taehoon Yoon,Yunhong Min,Kyeongmin Yeo,Minhyuk Sung

Main category: cs.LG

TL;DR: Ψ-Sampler是一种基于SMC的框架，通过pCNL初始粒子采样实现高效的推理时奖励对齐。

Details

Motivation: 现有方法通常从高斯先验初始化粒子，无法有效捕捉奖励相关区域，导致采样效率低。 Method: 提出pCNL算法，结合维度鲁棒提议和梯度信息动态，实现高效后验采样。 Result: 实验证明，该方法在布局到图像生成、数量感知生成等任务中表现优异。 Conclusion: 后验采样显著提升了奖励对齐性能，适用于高维潜在空间。 Abstract: We introduce $\Psi$-Sampler, an SMC-based framework incorporating pCNL-based initial particle sampling for effective inference-time reward alignment with a score-based generative model. Inference-time reward alignment with score-based generative models has recently gained significant traction, following a broader paradigm shift from pre-training to post-training optimization. At the core of this trend is the application of Sequential Monte Carlo (SMC) to the denoising process. However, existing methods typically initialize particles from the Gaussian prior, which inadequately captures reward-relevant regions and results in reduced sampling efficiency. We demonstrate that initializing from the reward-aware posterior significantly improves alignment performance. To enable posterior sampling in high-dimensional latent spaces, we introduce the preconditioned Crank-Nicolson Langevin (pCNL) algorithm, which combines dimension-robust proposals with gradient-informed dynamics. This approach enables efficient and scalable posterior sampling and consistently improves performance across various reward alignment tasks, including layout-to-image generation, quantity-aware generation, and aesthetic-preference generation, as demonstrated in our experiments.

[441] Variance-Based Defense Against Blended Backdoor Attacks

Sujeevan Aseervatham,Achraf Kerzazi,Younès Bennani

Main category: cs.LG

TL;DR: 论文提出了一种新的防御方法，用于检测和清除AI模型中的后门攻击，无需依赖干净数据集。

Details

Motivation: 现有防御方法依赖干净数据集计算统计异常，但在实际场景中可能不可行。 Method: 通过训练模型、检测中毒类别、提取攻击触发关键部分并识别中毒实例。 Result: 实验证明该方法在知名图像数据集上有效，优于SCAn、ABL和AGPD三种算法。 Conclusion: 新方法提高了可解释性，能有效防御后门攻击，适用于实际场景。 Abstract: Backdoor attacks represent a subtle yet effective class of cyberattacks targeting AI models, primarily due to their stealthy nature. The model behaves normally on clean data but exhibits malicious behavior only when the attacker embeds a specific trigger into the input. This attack is performed during the training phase, where the adversary corrupts a small subset of the training data by embedding a pattern and modifying the labels to a chosen target. The objective is to make the model associate the pattern with the target label while maintaining normal performance on unaltered data. Several defense mechanisms have been proposed to sanitize training data-sets. However, these methods often rely on the availability of a clean dataset to compute statistical anomalies, which may not always be feasible in real-world scenarios where datasets can be unavailable or compromised. To address this limitation, we propose a novel defense method that trains a model on the given dataset, detects poisoned classes, and extracts the critical part of the attack trigger before identifying the poisoned instances. This approach enhances explainability by explicitly revealing the harmful part of the trigger. The effectiveness of our method is demonstrated through experimental evaluations on well-known image datasets and comparative analysis against three state-of-the-art algorithms: SCAn, ABL, and AGPD.

[442] Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability

Genta Indra Winata,David Anugraha,Emmy Liu,Alham Fikri Aji,Shou-Yi Hung,Aditya Parashar,Patrick Amadeus Irawan,Ruochen Zhang,Zheng-Xin Yong,Jan Christian Blaise Cruz,Niklas Muennighoff,Seungone Kim,Hanyang Zhao,Sudipta Kar,Kezia Erina Suryoraharjo,M. Farid Adilazuarda,En-Shiun Annie Lee,Ayu Purwarianti,Derry Tanti Wijaya,Monojit Choudhury

Main category: cs.LG

TL;DR: 论文提出DataRubrics框架，通过系统化、标准化的评估指标改进数据集质量审查流程，并探索合成数据生成方法。

Details

Motivation: 当前数据集论文缺乏原创性、多样性和质量控制，且审查过程中常忽视这些问题，需要更透明、可衡量的评估方法。 Method: 提出DataRubrics框架，结合LLM技术，提供可重复、可扩展的数据集质量评估方案，并开源相关代码。 Result: DataRubrics为数据集质量评估提供了可操作的标准，支持作者和审稿人提升数据研究的质量。 Conclusion: 论文呼吁采用系统化的评估方法，推动数据研究领域的透明度和标准化。 Abstract: High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation-especially with accurate human annotations-remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process-particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.

[443] Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment

Kundan Krishna,Joseph Y Cheng,Charles Maalouf,Leon A Gatys

Main category: cs.LG

TL;DR: DSA框架通过解耦安全计算与任务优化模型，提升AI安全性和效率，显著优于传统方法。

Details

Motivation: 现有AI安全方法（如护栏模型和对齐训练）常牺牲推理效率或开发灵活性，需改进。 Method: 引入轻量级适配器（DSA），利用基础模型内部表示，实现灵活安全功能且不影响推理成本。 Result: DSA在幻觉检测、仇恨言论分类等任务中表现优异，并支持动态调整对齐强度。 Conclusion: DSA为模块化、高效且适应性强的AI安全与对齐提供了新方向。 Abstract: Existing paradigms for ensuring AI safety, such as guardrail models and alignment training, often compromise either inference efficiency or development flexibility. We introduce Disentangled Safety Adapters (DSA), a novel framework addressing these challenges by decoupling safety-specific computations from a task-optimized base model. DSA utilizes lightweight adapters that leverage the base model's internal representations, enabling diverse and flexible safety functionalities with minimal impact on inference cost. Empirically, DSA-based safety guardrails substantially outperform comparably sized standalone models, notably improving hallucination detection (0.88 vs. 0.61 AUC on Summedits) and also excelling at classifying hate speech (0.98 vs. 0.92 on ToxiGen) and unsafe model inputs and responses (0.93 vs. 0.90 on AEGIS2.0 & BeaverTails). Furthermore, DSA-based safety alignment allows dynamic, inference-time adjustment of alignment strength and a fine-grained trade-off between instruction following performance and model safety. Importantly, combining the DSA safety guardrail with DSA safety alignment facilitates context-dependent alignment strength, boosting safety on StrongReject by 93% while maintaining 98% performance on MTBench -- a total reduction in alignment tax of 8 percentage points compared to standard safety alignment fine-tuning. Overall, DSA presents a promising path towards more modular, efficient, and adaptable AI safety and alignment.

[444] Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning

Babak Barazandeh

Main category: cs.LG

TL;DR: Localized LoRA提出了一种参数高效微调方法，通过局部低秩更新提升性能，优于全局低秩方法。

Details

Motivation: 现有PEFT方法（如LoRA）依赖全局低秩结构，可能忽略参数空间中的空间模式。 Method: 提出Localized LoRA框架，将权重更新建模为低秩矩阵的组合，应用于权重矩阵的结构化块。 Result: 在相同参数预算下，Localized LoRA的近似误差更低，实验证明其表达能力和适应性更强。 Conclusion: Localized LoRA是一种更高效、性能更好的微调方法，适用于多种场景。 Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, offer compact and effective alternatives to full model fine-tuning by introducing low-rank updates to pretrained weights. However, most existing approaches rely on global low-rank structures, which can overlook spatial patterns spread across the parameter space. In this work, we propose Localized LoRA, a generalized framework that models weight updates as a composition of low-rank matrices applied to structured blocks of the weight matrix. This formulation enables dense, localized updates throughout the parameter space-without increasing the total number of trainable parameters. We provide a formal comparison between global, diagonal-local, and fully localized low-rank approximations, and show that our method consistently achieves lower approximation error under matched parameter budgets. Experiments on both synthetic and practical settings demonstrate that Localized LoRA offers a more expressive and adaptable alternative to existing methods, enabling efficient fine-tuning with improved performance.

[445] Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity

Dang Nguyen,Ali Payani,Baharan Mirzasoleiman

Main category: cs.LG

TL;DR: 提出了一种基于最近邻熵估计的黑盒不确定性量化方法，解决了语义熵在长句生成中的局限性，并在多个任务和模型上验证了其有效性。

Details

Motivation: 语义熵在长句生成中因忽略簇内和簇间相似性而效果不佳，需要改进不确定性量化方法。 Method: 提出了一种基于最近邻熵估计的黑盒方法，并可扩展至白盒设置。 Result: 在Phi3和Llama3模型及多个任务上验证了方法的有效性。 Conclusion: 新方法优于语义熵，且具有理论和实践优势。 Abstract: Hallucination in large language models (LLMs) can be detected by assessing the uncertainty of model outputs, typically measured using entropy. Semantic entropy (SE) enhances traditional entropy estimation by quantifying uncertainty at the semantic cluster level. However, as modern LLMs generate longer one-sentence responses, SE becomes less effective because it overlooks two crucial factors: intra-cluster similarity (the spread within a cluster) and inter-cluster similarity (the distance between clusters). To address these limitations, we propose a simple black-box uncertainty quantification method inspired by nearest neighbor estimates of entropy. Our approach can also be easily extended to white-box settings by incorporating token probabilities. Additionally, we provide theoretical results showing that our method generalizes semantic entropy. Extensive empirical results demonstrate its effectiveness compared to semantic entropy across two recent LLMs (Phi3 and Llama3) and three common text generation tasks: question answering, text summarization, and machine translation. Our code is available at https://github.com/BigML-CS-UCLA/SNNE.

[446] Spectral Insights into Data-Oblivious Critical Layers in Large Language Models

Xuyuan Liu,Lei Hsiung,Yaoqing Yang,Yujun Yan

Main category: cs.LG

TL;DR: 本文提出了一种数据无关的方法，通过CKA分析预训练LLM中的关键层，发现这些层在微调时变化最大，且与语义转换相关。该方法在领域适应和防御后门攻击中表现优异。

Details

Motivation: 理解LLM中特征表示的演变对提高其可解释性和鲁棒性至关重要，但现有方法依赖数据且局限于微调后的分析。 Method: 采用CKA分析预训练LLM的表示动态，识别关键层，并通过谱分析揭示其语义转换机制。 Result: 关键层在微调时变化显著，且与语义转换相关；在领域适应和防御后门攻击中效果显著。 Conclusion: 数据无关方法能有效识别关键层，为LLM的高效微调和安全应用提供新思路。 Abstract: Understanding how feature representations evolve across layers in large language models (LLMs) is key to improving their interpretability and robustness. While recent studies have identified critical layers linked to specific functions or behaviors, these efforts typically rely on data-dependent analyses of fine-tuned models, limiting their use to post-hoc settings. In contrast, we introduce a data-oblivious approach to identify intrinsic critical layers in pre-fine-tuned LLMs by analyzing representation dynamics via Centered Kernel Alignment(CKA). We show that layers with significant shifts in representation space are also those most affected during fine-tuning--a pattern that holds consistently across tasks for a given model. Our spectral analysis further reveals that these shifts are driven by changes in the top principal components, which encode semantic transitions from rationales to conclusions. We further apply these findings to two practical scenarios: efficient domain adaptation, where fine-tuning critical layers leads to greater loss reduction compared to non-critical layers; and backdoor defense, where freezing them reduces attack success rates by up to 40%.

[447] BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

Eunsu Kim,Haneul Yoo,Guijin Son,Hitesh Patel,Amit Agarwal,Alice Oh

Main category: cs.LG

TL;DR: 论文介绍了BenchHub，一个动态的基准测试库，旨在解决现有数据集分散、难以管理的问题，支持灵活、定制化的LLM评估。

Details

Motivation: 随着大语言模型（LLMs）的发展，现有数据集分散且难以管理，难以满足特定领域或需求的评估需求。 Method: 提出BenchHub，一个动态基准测试库，聚合并自动分类来自不同领域的基准数据集，涵盖303K问题和38个基准。 Result: 实验表明，模型性能在不同领域子集间差异显著，凸显领域感知基准测试的重要性。 Conclusion: BenchHub能促进数据集重用、透明模型比较，并识别现有基准中的不足，为LLM评估研究提供关键基础设施。 Abstract: As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math or code. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate that model performance varies significantly across domain-specific subsets, emphasizing the importance of domain-aware benchmarking. We believe BenchHub can encourage better dataset reuse, more transparent model comparisons, and easier identification of underrepresented areas in existing benchmarks, offering a critical infrastructure for advancing LLM evaluation research.

[448] FLoE: Fisher-Based Layer Selection for Efficient Sparse Adaptation of Low-Rank Experts

Xinyi Wang,Lirong Gao,Haobo Wang,Yiming Zhang,Junbo Zhao

Main category: cs.LG

TL;DR: FLoE是一种新的参数高效微调（PEFT）框架，通过动态识别关键层和自动优化LoRA秩，显著提升了效率和准确性。

Details

Motivation: 现有PEFT方法在所有层上统一部署LoRA适配器，忽略了层的异质性和任务需求，导致参数冗余和效率低下。 Method: FLoE引入Fisher信息评分机制动态识别关键层，并使用贝叶斯优化自动分配LoRA秩。 Result: 实验表明FLoE在多种LLM和基准测试中实现了高效的性能平衡。 Conclusion: FLoE在资源受限环境中具有显著优势，适合快速适配任务。 Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a widely adopted strategy for adapting pre-trained Large Language Models (LLMs) to downstream tasks, significantly reducing memory and computational costs. However, most existing PEFT techniques uniformly deploy LoRA adapters across all layers, disregarding the intrinsic heterogeneity of layer contributions and task-specific rank requirements. This uniform paradigm leads to redundant parameter allocation and suboptimal adaptation efficiency. To address these limitations, we propose FLoE, a novel PEFT framework that introduces two key innovations: (i) a Fisher information-guided importance scoring mechanism to dynamically identify task-critical transformer layers for MoE-based low-rank adaptation, enabling sparse adapter deployment; and (ii) a Bayesian optimization-driven rank allocator that automatically determines optimal LoRA ranks on specific datasets without exhaustive grid search. Extensive experiments across diverse LLMs and benchmarks reveal that FLoE achieves impressive efficiency-accuracy trade-offs, making FLoE particularly advantageous in resource-constrained environments that necessitate rapid adaptation.

[449] Linear Representation Transferability Hypothesis: Leveraging Small Models to Steer Large Models

Femi Bello,Anubrata Das,Fanzhi Zeng,Fangcong Yin,Leqi Liu

Main category: cs.LG

TL;DR: 论文提出线性表示可迁移性（LRT）假设，认为不同模型的表示空间之间存在仿射变换关系，并通过实验验证了小模型表示可以指导大模型行为。

Details

Motivation: 探索神经网络在相似数据和架构下学习的表示是否具有共享性，以及这些表示是否可以跨模型迁移。 Method: 提出LRT假设，学习不同大小模型隐藏状态之间的仿射映射，并评估其语义效果。 Result: 实验证明仿射映射能保留语义行为，小模型的表示可以指导大模型行为。 Conclusion: LRT假设为理解跨模型规模的表示对齐提供了新方向。 Abstract: It has been hypothesized that neural networks with similar architectures trained on similar data learn shared representations relevant to the learning task. We build on this idea by extending the conceptual framework where representations learned across models trained on the same data can be expressed as linear combinations of a \emph{universal} set of basis features. These basis features underlie the learning task itself and remain consistent across models, regardless of scale. From this framework, we propose the \textbf{Linear Representation Transferability (LRT)} Hypothesis -- that there exists an affine transformation between the representation spaces of different models. To test this hypothesis, we learn affine mappings between the hidden states of models of different sizes and evaluate whether steering vectors -- directions in hidden state space associated with specific model behaviors -- retain their semantic effect when transferred from small to large language models using the learned mappings. We find strong empirical evidence that such affine mappings can preserve steering behaviors. These findings suggest that representations learned by small models can be used to guide the behavior of large models, and that the LRT hypothesis may be a promising direction on understanding representation alignment across model scales.

[450] Existing Large Language Model Unlearning Evaluations Are Inconclusive

Zhili Feng,Yixuan Even Xu,Alexander Robey,Robert Kirk,Xander Davies,Yarin Gal,Avi Schwarzschild,J. Zico Kolter

Main category: cs.LG

TL;DR: 论文批评了当前机器遗忘评估方法的局限性，提出了改进原则并通过实验验证。

Details

Motivation: 研究动机是揭示现有机器遗忘评估方法的不足，尤其是其可能掩盖真实遗忘效果的问题。 Method: 方法包括分析评估实践的局限性，提出最小信息注入和下游任务意识原则，并通过实验验证。 Result: 结果显示当前评估方法可能高估或低估遗忘效果，新原则能更准确评估。 Conclusion: 结论是未来评估需遵循新原则以提高可信度和泛化性。 Abstract: Machine unlearning aims to remove sensitive or undesired data from large language models. However, recent studies suggest that unlearning is often shallow, claiming that removed knowledge can easily be recovered. In this work, we critically examine standard unlearning evaluation practices and uncover key limitations that shake our trust in those findings. First, we show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance by re-teaching the model during testing. Second, we demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines. Finally, we find that many evaluations rely on spurious correlations, making their results difficult to trust and interpret. Taken together, these issues suggest that current evaluation protocols may both overstate and understate unlearning success. To address this, we propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness. We validate these principles through a series of targeted experiments, showing how violations of each can lead to misleading conclusions.

[451] Bregman Conditional Random Fields: Sequence Labeling with Parallelizable Inference Algorithms

Caio Corro,Mathieu Lacroix,Joseph Le Roux

Main category: cs.LG

TL;DR: 提出了一种新型序列标注模型BCRF，支持并行化推理，性能优于传统CRF和均值场方法。

Details

Motivation: 解决传统线性链条件随机场（CRF）推理速度慢的问题，提供更高效的并行化替代方案。 Method: 基于Bregman投影的并行化推理算法，使用Fenchel-Young损失函数进行模型训练，支持部分标签学习。 Result: 实验表明，BCRF性能与CRF相当但速度更快，在受限环境下优于均值场方法。 Conclusion: BCRF是一种高效且性能优越的序列标注模型，适用于需要快速推理的场景。 Abstract: We propose a novel discriminative model for sequence labeling called Bregman conditional random fields (BCRF). Contrary to standard linear-chain conditional random fields, BCRF allows fast parallelizable inference algorithms based on iterative Bregman projections. We show how such models can be learned using Fenchel-Young losses, including extension for learning from partial labels. Experimentally, our approach delivers comparable results to CRF while being faster, and achieves better results in highly constrained settings compared to mean field, another parallelizable alternative.

[452] LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning

Zihang Liu,Tianyu Pang,Oleg Balabanov,Chaoqun Yang,Tianjin Huang,Lu Yin,Yaoqing Yang,Shiwei Liu

Main category: cs.LG

TL;DR: 论文提出了一种名为LIFT的低秩稀疏微调方法，通过仅更新关键权重（Principal Weights），在保持高效的同时提升了LLM的推理能力。

Details

Motivation: 全参数微调（Full FT）计算成本高且易过拟合，稀疏微调在LLM时代表现不佳，因难以识别关键参数。 Method: 提出LIFT方法，基于低秩近似识别关键权重（Principal Weights），仅更新前5%的关键权重。 Result: LIFT在推理任务中表现优于Full FT，同时内存效率与参数高效微调方法相当，且保留更多源领域知识。 Conclusion: LIFT是一种高效且有效的LLM微调方法，适用于资源受限的场景。 Abstract: Recent studies have shown that supervised fine-tuning of LLMs on a small number of high-quality datasets can yield strong reasoning capabilities. However, full fine-tuning (Full FT), while powerful, is computationally expensive and susceptible to overfitting and catastrophic forgetting, particularly when data is limited. Sparse fine-tuning, which previously achieved notable success by updating only a small subset of model parameters, offers a promising trade-off between efficiency and effectiveness. Yet, it has lagged behind in the LLM era due to the difficulty of identifying parameters truly critical for reasoning. In this work, we state that weights with the largest magnitude after low-rank approximation are critical weights for fine-tuning, which we call Principal Weights. Surprisingly, while magnitude-based sparse fine-tuning performs poorly as a baseline on LLM fine-tuning, it becomes highly effective after rank reduction. These insights motivate our method: Low-rank Informed Sparse Fine-Tuning (LIFT). LIFT only updates the top 5% Principal Weights throughout training and consistently achieves better performance on reasoning tasks than Full FT, while maintaining memory efficiency on par with popular parameter-efficient fine-tuning methods. In addition to strong performance on target domains such as arithmetic reasoning, LIFT also retains up to 20% more source-domain knowledge, compared to Full FT and LoRA. Our code is available at: https://github.com/zihanghliu/LIFT.

[453] Generalizable LLM Learning of Graph Synthetic Data with Reinforcement Learning

Yizhuo Zhang,Heng Wang,Shangbin Feng,Zhaoxuan Tan,Xinyun Liu,Yulia Tsvetkov

Main category: cs.LG

TL;DR: 本文提出了一种利用强化学习（RL）提升LLMs在合成图数据上的泛化能力的方法，通过设计基于解决方案和过程的奖励机制，避免直接微调中的过拟合问题。实验表明，该方法在多个数据集上显著优于基线，平均提升12.9%。

Details

Motivation: 现有方法通过监督微调提升LLMs在图算法任务上的表现，但缺乏对真实世界中隐含图结构的泛化能力。本文旨在通过RL解锁合成图数据的泛化学习。 Method: 设计了基于解决方案和过程的奖励机制，并采用GRPO和DPO等RL算法，对LLMs进行对齐。实验覆盖了合成任务和隐含图结构的真实任务（如多跳QA、结构化规划）。 Result: 在5个数据集上取得显著提升，平均增益12.9%。基于过程的奖励表现更优，混合合成与真实数据有潜在增益，但组合性和可解释性仍是挑战。 Conclusion: RL方法有效提升了LLMs在图数据上的泛化能力，但组合性和中间步骤的可解释性仍需进一步研究。 Abstract: Previous research has sought to enhance the graph reasoning capabilities of LLMs by supervised fine-tuning on synthetic graph data. While these led to specialized LLMs better at solving graph algorithm problems, we don't need LLMs for shortest path: we need generalization from synthetic graph data to real-world tasks with implicit graph structures. In this work, we propose to unlock generalizable learning of graph synthetic data with reinforcement learning. We first design solution-based and process-based rewards for synthetic graph problems: instead of rigid memorizing response patterns in direct fine-tuning, we posit that RL would help LLMs grasp the essentials underlying graph reasoning and alleviate overfitting. We employ RL algorithms such as GRPO and DPO, aligning both off-the-shelf LLMs and LLMs fine-tuned on synthetic graph data. We then compare them against existing settings on both in-domain synthetic tasks and out-of-domain real-world tasks with implicit graph structures such as multi-hop QA, structured planning, and more. Extensive experiments demonstrate that our RL recipe leads to statistically significant improvement on 5 datasets, with an average gain of 12.9\% over baseline settings. Further analysis reveals that process-based rewards consistently outperform solution-based rewards, mixing synthetic and real-world task data yields potential gains, while compositionality and explainable intermediate steps remains a critical challenge even after RL.

[454] Position as Probability: Self-Supervised Transformers that Think Past Their Training for Length Extrapolation

Philip Heejun Lee

Main category: cs.LG

TL;DR: PRISM是一种新型的位置编码机制，使Transformer模型能够在训练长度10倍以上的范围内准确推断。

Details

Motivation: 解决深度序列模型在测试序列远超训练长度时精度下降的问题，尤其是在算法推理、多步算术和组合泛化等关键任务中。 Method: PRISM通过可微分直方图滤波更新学习连续相对位置，采用概率叠加而非确定性嵌入来保留位置不确定性。 Result: PRISM在算法基准测试（如算术、SCAN组合任务和复杂复制变体）中实现了最先进的长度推断能力。 Conclusion: PRISM的随机位置编码保持清晰可解释的内部状态，为可靠的长度泛化提供了理论基础。 Abstract: Deep sequence models typically degrade in accuracy when test sequences significantly exceed their training lengths, yet many critical tasks--such as algorithmic reasoning, multi-step arithmetic, and compositional generalization--require robust length extrapolation. We introduce PRISM, a Probabilistic Relative-position Implicit Superposition Model, a novel positional encoding mechanism that enables Transformers to extrapolate accurately up to 10x beyond their training length. PRISM learns continuous relative positions through a differentiable histogram-filter update, preserving position uncertainty via a probabilistic superposition rather than conventional deterministic embeddings. Empirically, PRISM achieves state-of-the-art length extrapolation, successfully generalizing to previously intractable sequence lengths across algorithmic benchmarks--including arithmetic (addition, multiplication), SCAN compositionality tasks, and complex copy variants derived from DeepMind's recent datasets. Our analysis demonstrates that PRISM's stochastic positional encoding maintains sharp and interpretable internal states, providing a theoretical basis for reliable length generalization. These results advance the goal of neural sequence models that remain algorithmically robust at lengths far exceeding their training horizon.

[455] Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer

Yihe Dong,Lorenzo Noci,Mikhail Khodak,Mufan Li

Main category: cs.LG

TL;DR: 研究发现，Transformer的性能提升并非完全依赖自注意力机制，随机固定注意力的简化模型MixiT在某些任务中表现与完整Transformer相当，但在依赖输入的任务中表现较差。

Details

Motivation: 探讨Transformer中自注意力机制对性能提升的具体贡献，以及不同组件对任务解决的重要性。 Method: 通过冻结MLP层或注意力投影器，并引入随机固定注意力的MixiT模型，对比分析不同变体的性能。 Result: MixiT在算法任务（如算术和记忆）中表现与完整Transformer相当，但在检索任务中表现较差；冻结查询和键投影器的注意力仍能形成关键电路（如归纳头）。 Conclusion: Transformer的不同组件提供互补的归纳偏差，对解决不同任务至关重要。 Abstract: The Transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of algorithmic tasks -- including mathematical reasoning, memorization, and retrieval -- using only gradient-based training on next-token prediction. While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard Transformers to variants in which either the multi-layer perceptron (MLP) layers or the attention projectors (queries and keys) are frozen at initialization. To further isolate the contribution of attention, we introduce MixiT -- the Mixing Transformer -- a simplified, principled model in which the attention coefficients are entirely random and fixed at initialization, eliminating any input-dependent computation or learning in attention. Surprisingly, we find that MixiT matches the performance of fully trained Transformers on various algorithmic tasks, especially those involving basic arithmetic or focusing heavily on memorization. For retrieval-based tasks, we observe that having input-dependent attention coefficients is consistently beneficial, while MixiT underperforms. We attribute this failure to its inability to form specialized circuits such as induction heads -- a specific circuit known to be crucial for learning and exploiting repeating patterns in input sequences. Even more interestingly, we find that attention with frozen key and query projectors is not only able to form induction heads, but can also perform competitively on language modeling. Our results underscore the importance of architectural heterogeneity, where distinct components contribute complementary inductive biases crucial for solving different classes of tasks.

[456] Earley-Driven Dynamic Pruning for Efficient Structured Decoding

Xintong Sun,Chi Wei,Minghao Tian,Shiwen Ni

Main category: cs.LG

TL;DR: ZapFormat提出了一种基于Earley算法的动态剪枝策略，显著减少内存占用，并提升LLM在结构化生成任务中的推理速度。

Details

Motivation: 确保LLM输出符合严格的结构或语法约束在函数调用和领域特定语言生成中至关重要，但现有方法存在显著开销。 Method: 提出动态剪枝策略ZapFormat，基于Earley算法实时消除无效状态，并实现状态缓存以加速生成。 Result: Formatron在结构化生成任务中保持高精度，推理速度提升至2倍，且适用于多种LLM架构。 Conclusion: ZapFormat和Formatron为LLM的结构化生成提供了高效且通用的解决方案，已开源。 Abstract: Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calls and domain-specific language (DSL) generation. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overheads in existing constrained decoding engines. To address this challenge, we propose $\textbf{ZapFormat}$, a novel $\textbf{dynamic pruning}$ strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real-time, significantly reducing memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generations on a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only $\textbf{consistently maintains}$ high-precision compliant outputs but also achieves $\textbf{significant improvements}$ in inference speed up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.

[457] MUDI: A Multimodal Biomedical Dataset for Understanding Pharmacodynamic Drug-Drug Interactions

Tung-Lam Ngo,Ba-Hoang Tran,Duy-Cat Can,Trung-Hieu Do,Oliver Y. Chén,Hoang-Quynh Le

Main category: cs.LG

TL;DR: 论文介绍了MUDI，一个多模态生物医学数据集，用于研究药物-药物相互作用（DDI），并评估了学习方法。

Details

Motivation: 现有DDI数据集主要依赖文本信息，忽略了反映复杂药物机制的多模态数据。 Method: 提出MUDI数据集，结合药理学文本、化学式、分子结构图和图像，标注了310,532对药物组合。测试集包含未见过的药物对以评估泛化能力。 Result: 评估了基于晚期融合投票和中期融合策略的基准模型。 Conclusion: MUDI为多模态DDI研究提供了全面资源，所有数据和工具已开源。 Abstract: Understanding the interaction between different drugs (drug-drug interaction or DDI) is critical for ensuring patient safety and optimizing therapeutic outcomes. Existing DDI datasets primarily focus on textual information, overlooking multimodal data that reflect complex drug mechanisms. In this paper, we (1) introduce MUDI, a large-scale Multimodal biomedical dataset for Understanding pharmacodynamic Drug-drug Interactions, and (2) benchmark learning methods to study it. In brief, MUDI provides a comprehensive multimodal representation of drugs by combining pharmacological text, chemical formulas, molecular structure graphs, and images across 310,532 annotated drug pairs labeled as Synergism, Antagonism, or New Effect. Crucially, to effectively evaluate machine-learning based generalization, MUDI consists of unseen drug pairs in the test set. We evaluate benchmark models using both late fusion voting and intermediate fusion strategies. All data, annotations, evaluation scripts, and baselines are released under an open research license.

[458] Unified Scaling Laws for Compressed Representations

Andrei Panferov,Alexandra Volkova,Ionut-Vlad Modoranu,Vage Egiazarian,Mher Safaryan,Dan Alistarh

Main category: cs.LG

TL;DR: 本文探讨了扩展定律与模型压缩技术（如量化和稀疏化）的相互作用，提出了一种统一的扩展框架，能够预测不同压缩表示下的模型性能，并发现了一种基于随机高斯数据拟合能力的简单“容量”指标。

Details

Motivation: 随着AI计算成本的上升，模型压缩技术成为减轻大规模训练和推理计算需求的关键。本文旨在研究扩展定律是否适用于压缩表示，并探索一种统一的预测框架。 Method: 通过理论和实证验证，提出了一种通用的扩展定律，并引入基于随机高斯数据拟合能力的“容量”指标，用于预测不同压缩表示的参数效率。 Result: 研究发现，该“容量”指标能够稳健地预测多种压缩表示的参数效率，并扩展了框架以比较不同压缩格式的精度潜力，改进了稀疏量化格式的训练算法。 Conclusion: 本文证明了扩展定律在压缩表示中的适用性，并提出了一种简单有效的“容量”指标，为模型压缩和训练提供了新的理论支持和实用方法。 Abstract: Scaling laws have shaped recent advances in machine learning by enabling predictable scaling of model performance based on model size, computation, and data volume. Concurrently, the rise in computational cost for AI has motivated model compression techniques, notably quantization and sparsification, which have emerged to mitigate the steep computational demands associated with large-scale training and inference. This paper investigates the interplay between scaling laws and compression formats, exploring whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations, such as sparse, scalar-quantized, sparse-quantized or even vector-quantized formats. Our key contributions include validating a general scaling law formulation and showing that it is applicable both individually but also composably across compression types. Based on this, our main finding is demonstrating both theoretically and empirically that there exists a simple "capacity" metric -- based on the representation's ability to fit random Gaussian data -- which can robustly predict parameter efficiency across multiple compressed representations. On the practical side, we extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.

cs.AR [Back]

[459] Enhancing Finite State Machine Design Automation with Large Language Models and Prompt Engineering Techniques

Qun-Kai Lin,Cheng Hsu,Tian-Sheuan Chang

Main category: cs.AR

TL;DR: 论文研究了三种大型语言模型（Claude 3 Opus、ChatGPT-4和ChatGPT-4o）在有限状态机（FSM）设计中的表现，并探讨了提示优化方法（TOP Patch）对其成功率的影响。

Details

Motivation: 由于大型语言模型在硬件描述语言（HDL）设计中的出色兼容性，研究其在FSM设计中的性能及优化方法具有重要意义。 Method: 利用HDLBits提供的指令内容评估模型的稳定性、局限性及优化方法，并测试提示优化方法（TOP Patch）的效果。 Result: 系统化格式提示方法和新型提示优化方法在提升模型成功率方面表现出潜力，并可推广至其他领域。 Conclusion: 提示优化方法（如TOP Patch）在HDL设计自动化及其他领域具有广泛应用前景。 Abstract: Large Language Models (LLMs) have attracted considerable attention in recent years due to their remarkable compatibility with Hardware Description Language (HDL) design. In this paper, we examine the performance of three major LLMs, Claude 3 Opus, ChatGPT-4, and ChatGPT-4o, in designing finite state machines (FSMs). By utilizing the instructional content provided by HDLBits, we evaluate the stability, limitations, and potential approaches for improving the success rates of these models. Furthermore, we explore the impact of using the prompt-refining method, To-do-Oriented Prompting (TOP) Patch, on the success rate of these LLM models in various FSM design scenarios. The results show that the systematic format prompt method and the novel prompt refinement method have the potential to be applied to other domains beyond HDL design automation, considering its possible integration with other prompt engineering techniques in the future.

cs.SE [Back]

[460] CODEMENV: Benchmarking Large Language Models on Code Migration

Keyuan Cheng,Xudong Shen,Yihao Yang,Tengyue Wang,Yang Cao,Muhammad Asif Ali,Hanbin Wang,Lijie Hu,Di Wang

Main category: cs.SE

TL;DR: CODEMENV是一个新基准，用于评估大语言模型（LLMs）在代码迁移任务中的表现，涵盖Python和Java的922个示例。实验显示LLMs的平均通过率为26.50%，GPT-4O表现最佳（43.84%）。

Details

Motivation: 研究LLMs在代码迁移任务中的能力，填补现有研究的空白。 Method: 提出CODEMENV基准，包含三个核心任务，评估七种LLMs的表现。 Result: LLMs在代码迁移中表现有限，平均通过率26.50%，GPT-4O最高（43.84%）。发现LLMs对新版本函数更熟练，但存在逻辑不一致问题。 Conclusion: CODEMENV为代码迁移任务提供了评估工具，揭示了LLMs的潜力与局限性。 Abstract: Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4O achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at https://github.com/xdshen-ai/Benchmark-of-Code-Migration.

cs.SD [Back]

[461] Learning Sparsity for Effective and Efficient Music Performance Question Answering

Xingjian Diao,Tianzhen Yang,Chunhui Zhang,Weiyi Wu,Ming Cheng,Jiang Gui

Main category: cs.SD

TL;DR: 论文提出了一种名为Sparsify的稀疏学习框架，用于解决音乐表演音频视觉问答（Music AVQA）中的效率问题，通过三种稀疏化策略提升性能并减少训练时间。

Details

Motivation: 音乐表演的密集连续音频和音频视觉整合为多模态场景理解带来挑战，现有方法存在信息冗余和效率低下的问题。 Method: 提出Sparsify框架，集成三种稀疏化策略，并设计关键子集选择算法以提高数据效率。 Result: Sparsify在Music AVQA数据集上达到最优性能，训练时间减少28.32%，且仅用25%数据即可保持70-80%的性能。 Conclusion: Sparsify框架有效解决了Music AVQA中的效率问题，为多模态学习提供了高效解决方案。 Abstract: Music performances, characterized by dense and continuous audio as well as seamless audio-visual integration, present unique challenges for multimodal scene understanding and reasoning. Recent Music Performance Audio-Visual Question Answering (Music AVQA) datasets have been proposed to reflect these challenges, highlighting the continued need for more effective integration of audio-visual representations in complex question answering. However, existing Music AVQA methods often rely on dense and unoptimized representations, leading to inefficiencies in the isolation of key information, the reduction of redundancy, and the prioritization of critical samples. To address these challenges, we introduce Sparsify, a sparse learning framework specifically designed for Music AVQA. It integrates three sparsification strategies into an end-to-end pipeline and achieves state-of-the-art performance on the Music AVQA datasets. In addition, it reduces training time by 28.32% compared to its fully trained dense counterpart while maintaining accuracy, demonstrating clear efficiency gains. To further improve data efficiency, we propose a key-subset selection algorithm that selects and uses approximately 25% of MUSIC-AVQA v2.0 training data and retains 70-80% of full-data performance across models.

[462] Probing Audio-Generation Capabilities of Text-Based Language Models

Arjun Prasaath Anbazhagan,Parteek Kumar,Ujjwal Kaur,Aslihan Akalin,Kevin Zhu,Sean O'Brien

Main category: cs.SD

TL;DR: 研究探讨了大型语言模型（LLMs）如何通过文本表示生成音频，发现其能力随音频复杂性增加而下降。

Details

Motivation: 探索LLMs在文本训练基础上生成音频的潜力，填补文本与音频之间的鸿沟。 Method: 采用三阶段渐进方法（音符、环境音、人声），通过代码作为中介生成音频，并用FAD和CLAP评分评估质量。 Result: LLMs能生成基础音频，但随着复杂性增加表现下降，表明其音频生成能力有限。 Conclusion: 需进一步研究提升LLMs音频生成质量与多样性的技术。 Abstract: How does textual representation of audio relate to the Large Language Model's (LLMs) learning about the audio world? This research investigates the extent to which LLMs can be prompted to generate audio, despite their primary training in textual data. We employ a three-tier approach, progressively increasing the complexity of audio generation: 1) Musical Notes, 2) Environmental Sounds, and 3) Human Speech. To bridge the gap between text and audio, we leverage code as an intermediary, prompting LLMs to generate code that, when executed, produces the desired audio output. To evaluate the quality and accuracy of the generated audio, we employ FAD and CLAP scores. Our findings reveal that while LLMs can generate basic audio features, their performance deteriorates as the complexity of the audio increases. This suggests that while LLMs possess a latent understanding of the auditory world, their ability to translate this understanding into tangible audio output remains rudimentary. Further research into techniques that can enhance the quality and diversity of LLM-generated audio can lead to an improvement in the performance of text-based LLMs in generating audio.

[463] XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark

Ioan-Paul Ciobanu,Andrei-Iulian Hiji,Nicolae-Catalin Ristea,Paul Irofti,Cristian Rusu,Radu Tudor Ionescu

Main category: cs.SD

TL;DR: 论文介绍了XMAD-Bench，一个跨领域多语言音频深度伪造基准测试，揭示了现有检测器在跨领域测试中性能显著下降的问题。

Details

Motivation: 随着音频生成技术的进步，深度伪造音频增多，公众面临更大风险。现有检测器在相同生成模型下的测试表现良好，但在跨领域场景中效果不佳。 Method: 构建了XMAD-Bench数据集，包含668.8小时的真实和伪造语音，训练集和测试集在说话者、生成方法和音频来源上完全独立。 Result: 实验显示，检测器在相同领域内表现接近100%，但在跨领域测试中性能可能接近随机猜测。 Conclusion: 研究强调了开发具有跨领域泛化能力的音频深度伪造检测器的必要性，并公开了XMAD-Bench基准测试。 Abstract: Recent advances in audio generation led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples from the training and test sets are produced by the same generative models. To this end, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested ``in the wild''. Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for the development of robust audio deepfake detectors, which maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at https://github.com/ristea/xmad-bench/.

[464] Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion

Kumud Tripathi,Chowdam Venkata Kumar,Pankaj Wasnik

Main category: cs.SD

TL;DR: FusionVAD结合MFCC和预训练模型特征，通过简单融合策略（如加法）提升语音活动检测性能，超越单特征模型和现有技术。

Details

Motivation: 研究MFCC和预训练模型特征在语音活动检测中的互补性，探索融合策略以提升性能。 Method: 提出FusionVAD框架，采用三种融合策略（拼接、加法、交叉注意力）结合MFCC和预训练模型特征。 Result: 加法融合效果最佳，模型性能超越单特征模型和Pyannote，平均提升2.04%。 Conclusion: 简单特征融合能增强语音活动检测的鲁棒性，同时保持计算效率。 Abstract: Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.

cs.MM [Back]

[465] Multiverse Through Deepfakes: The MultiFakeVerse Dataset of Person-Centric Visual and Conceptual Manipulations

Parul Gupta,Shreya Ghosh,Tom Gedeon,Thanh-Toan Do,Abhinav Dhall

Main category: cs.MM

TL;DR: 论文介绍了MultiFakeVerse，一个大规模人物中心化的深度伪造数据集，包含845,286张图像，通过视觉语言模型（VLM）生成，专注于语义和上下文感知的修改。

Details

Motivation: 当前研究缺乏针对人物中心化、上下文感知的大规模深度伪造基准数据集，MultiFakeVerse填补了这一空白。 Method: 利用视觉语言模型（VLM）生成语义和上下文感知的图像修改，而非传统的低级别身份替换或区域编辑。 Result: 实验表明，现有的深度伪造检测模型和人类观察者难以检测这些细微但有意义的修改。 Conclusion: MultiFakeVerse为深度伪造检测研究提供了新的挑战和基准，代码和数据集已开源。 Abstract: The rapid advancement of GenAI technology over the past few years has significantly contributed towards highly realistic deepfake content generation. Despite ongoing efforts, the research community still lacks a large-scale and reasoning capability driven deepfake benchmark dataset specifically tailored for person-centric object, context and scene manipulations. In this paper, we address this gap by introducing MultiFakeVerse, a large scale person-centric deepfake dataset, comprising 845,286 images generated through manipulation suggestions and image manipulations both derived from vision-language models (VLM). The VLM instructions were specifically targeted towards modifications to individuals or contextual elements of a scene that influence human perception of importance, intent, or narrative. This VLM-driven approach enables semantic, context-aware alterations such as modifying actions, scenes, and human-object interactions rather than synthetic or low-level identity swaps and region-specific edits that are common in existing datasets. Our experiments reveal that current state-of-the-art deepfake detection models and human observers struggle to detect these subtle yet meaningful manipulations. The code and dataset are available on \href{https://github.com/Parul-Gupta/MultiFakeVerse}{GitHub}.

cs.CR [Back]

[466] 3D Gaussian Splat Vulnerabilities

Matthew Hull,Haoyang Yang,Pratham Mehta,Mansi Phute,Aeree Cho,Haoran Wang,Matthew Lau,Wenke Lee,Willian T. Lunardi,Martin Andreoni,Polo Chau

Main category: cs.CR

TL;DR: 论文介绍了CLOAK和DAGGER两种针对3D高斯泼溅（3DGS）的攻击方法，揭示了3DGS在安全关键应用中的潜在漏洞。

Details

Motivation: 随着3DGS在安全关键应用中的普及，研究其潜在攻击方式以预防潜在危害。 Method: CLOAK利用视角依赖的高斯外观嵌入对抗内容；DAGGER通过扰动3D高斯直接攻击多阶段目标检测器。 Result: 攻击成功欺骗了如Faster R-CNN等检测器，展示了3DGS的未探索漏洞。 Conclusion: 研究揭示了3DGS在机器人学习和自主导航等应用中的新威胁。 Abstract: With 3D Gaussian Splatting (3DGS) being increasingly used in safety-critical applications, how can an adversary manipulate the scene to cause harm? We introduce CLOAK, the first attack that leverages view-dependent Gaussian appearances - colors and textures that change with viewing angle - to embed adversarial content visible only from specific viewpoints. We further demonstrate DAGGER, a targeted adversarial attack directly perturbing 3D Gaussians without access to underlying training data, deceiving multi-stage object detectors e.g., Faster R-CNN, through established methods such as projected gradient descent. These attacks highlight underexplored vulnerabilities in 3DGS, introducing a new potential threat to robotic learning for autonomous navigation and other safety-critical 3DGS applications.

[467] Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

Jiahui Geng,Thy Thy Tran,Preslav Nakov,Iryna Gurevych

Main category: cs.CR

TL;DR: 论文提出了一种新型攻击方法Con Instruction，利用多模态语言模型（MLLMs）对非文本指令（如图像或音频）的理解能力，生成对抗性示例以绕过安全机制。该方法无需训练数据或文本预处理，攻击成功率高达81.3%和86.6%。

Details

Motivation: 现有攻击方法主要依赖文本指令和对抗性图像，而本文旨在探索MLLMs对非文本指令的响应能力，揭示其潜在安全漏洞。 Method: 通过优化对抗性示例，使其在嵌入空间中与目标指令对齐，并结合文本输入增强攻击效果。同时提出ARC框架评估攻击质量。 Result: 在LLaVA-v1.5等模型上，攻击成功率显著（81.3%和86.6%），并发现现有防御技术存在性能差距。 Conclusion: Con Instruction有效揭示了MLLMs的安全隐患，为防御研究提供了新方向。 Abstract: Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, we exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction. We optimize these adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental implications of MLLMs' sophisticated understanding. Unlike prior work, our method does not require training data or preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass MLLM safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new Attack Response Categorization (ARC) framework, which evaluates both the quality of the model's response and its relevance to the malicious instructions. Experimental results demonstrate that Con Instruction effectively bypasses safety mechanisms in multiple vision- and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, evaluated on two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B). On the defense side, we explore various countermeasures against our attacks and uncover a substantial performance gap among existing techniques. Our implementation is made publicly available.

[468] Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution

Meysam Alizadeh,Zeynab Samei,Daria Stetsenko,Fabrizio Gilardi

Main category: cs.CR

TL;DR: 论文研究了提示注入攻击如何导致工具调用代理泄露个人数据，通过虚构银行代理测试，发现攻击成功率约20%，部分防御措施可降至0%。

Details

Motivation: 现有基准测试对复杂威胁（如数据泄露）的洞察有限，需深入研究提示注入对代理安全的影响。 Method: 开发基于数据流的攻击方法，集成到AgentDojo基准测试中，并使用合成数据集评估。 Result: 攻击下LLM效用下降15-50%，平均攻击成功率20%；部分防御可完全阻止泄露。 Conclusion: LLM对高度敏感数据（如密码）泄露有抵抗，但仍易泄露其他个人数据，任务类型与防御效果密切相关。 Abstract: Previous benchmarks on prompt injection in large language models (LLMs) have primarily focused on generic tasks and attacks, offering limited insights into more complex threats like data exfiltration. This paper examines how prompt injection can cause tool-calling agents to leak personal data observed during task execution. Using a fictitious banking agent, we develop data flow-based attacks and integrate them into AgentDojo, a recent benchmark for agentic security. To enhance its scope, we also create a richer synthetic dataset of human-AI banking conversations. In 16 user tasks from AgentDojo, LLMs show a 15-50 percentage point drop in utility under attack, with average attack success rates (ASR) around 20 percent; some defenses reduce ASR to zero. Most LLMs, even when successfully tricked by the attack, avoid leaking highly sensitive data like passwords, likely due to safety alignments, but they remain vulnerable to disclosing other personal data. The likelihood of password leakage increases when a password is requested along with one or two additional personal details. In an extended evaluation across 48 tasks, the average ASR is around 15 percent, with no built-in AgentDojo defense fully preventing leakage. Tasks involving data extraction or authorization workflows, which closely resemble the structure of exfiltration attacks, exhibit the highest ASRs, highlighting the interaction between task type, agent performance, and defense efficacy.

physics.soc-ph [Back]

[469] Transport Network, Graph, and Air Pollution

Nan Xu

Main category: physics.soc-ph

TL;DR: 研究通过分析全球城市的30万张图像，发现交通网络的几何模式与污染相关，并提出12项指数和优化策略以减轻污染。

Details

Motivation: 现有研究对交通网络的几何和拓扑特征分析不足，缺乏全面视角。 Method: 通过图像解析和12项指数分析交通网络与污染的关系。 Result: 发现优化连接性、平衡道路类型和避免极端聚类系数可减轻污染。 Conclusion: 研究为城市规划提供了基于永久性基础设施的污染减轻策略。 Abstract: Air pollution can be studied in the urban structure regulated by transport networks. Transport networks can be studied as geometric and topological graph characteristics through designed models. Current studies do not offer a comprehensive view as limited models with insufficient features are examined. Our study finds geometric patterns of pollution-indicated transport networks through 0.3 million image interpretations of global cities. These are then described as part of 12 indices to investigate the network-pollution correlation. Strategies such as improved connectivity, more balanced road types and the avoidance of extreme clustering coefficient are identified as beneficial for alleviated pollution. As a graph-only study, it informs superior urban planning by separating the impact of permanent infrastructure from that of derived development for a more focused and efficient effort toward pollution reduction.

q-bio.BM [Back]

[470] ProtInvTree: Deliberate Protein Inverse Folding with Reward-guided Tree Search

Mengdi Liu,Xiaoxue Cheng,Zhangyang Gao,Hong Chang,Cheng Tan,Shiguang Shan,Xilin Chen

Main category: q-bio.BM

TL;DR: ProtInvTree是一个基于树搜索的蛋白质逆折叠生成模型，能够设计多样化的序列同时保持结构一致性。

Details

Motivation: 现有深度学习方法在蛋白质逆折叠中忽略了问题的一对多性质，即多种序列可折叠为同一结构，因此需要一种能生成多样化序列的模型。 Method: ProtInvTree采用奖励引导的树搜索框架，通过分阶段的位置选择和残基生成机制，结合跳跃去噪策略高效评估中间状态。 Result: ProtInvTree在多个基准测试中优于现有方法，生成的结构一致且多样化的序列，包括远离原生序列的情况。 Conclusion: ProtInvTree为蛋白质逆折叠提供了一种高效且灵活的生成方法，解决了序列多样性与结构一致性的平衡问题。 Abstract: Designing protein sequences that fold into a target 3D structure, known as protein inverse folding, is a fundamental challenge in protein engineering. While recent deep learning methods have achieved impressive performance by recovering native sequences, they often overlook the one-to-many nature of the problem: multiple diverse sequences can fold into the same structure. This motivates the need for a generative model capable of designing diverse sequences while preserving structural consistency. To address this trade-off, we introduce ProtInvTree, the first reward-guided tree-search framework for protein inverse folding. ProtInvTree reformulates sequence generation as a deliberate, step-wise decision-making process, enabling the exploration of multiple design paths and exploitation of promising candidates through self-evaluation, lookahead, and backtracking. We propose a two-stage focus-and-grounding action mechanism that decouples position selection and residue generation. To efficiently evaluate intermediate states, we introduce a jumpy denoising strategy that avoids full rollouts. Built upon pretrained protein language models, ProtInvTree supports flexible test-time scaling by expanding the search depth and breadth without retraining. Empirically, ProtInvTree outperforms state-of-the-art baselines across multiple benchmarks, generating structurally consistent yet diverse sequences, including those far from the native ground truth.

eess.IV [Back]

[471] NTIRE 2025 the 2nd Restore Any Image Model (RAIM) in the Wild Challenge

Jie Liang,Radu Timofte,Qiaosi Yi,Zhengqiang Zhang,Shuaizheng Liu,Lingchen Sun,Rongyuan Wu,Xindong Zhang,Hui Zeng,Lei Zhang

Main category: eess.IV

TL;DR: NTIRE 2025挑战赛聚焦于真实世界图像修复，分为两个赛道：低光联合去噪与去马赛克（JDD）和图像细节增强/生成，吸引了大量参与者并推动了技术进步。

Details

Motivation: 为真实世界图像修复建立新基准，解决复杂未知退化问题，同时评估感知质量和保真度。 Method: 挑战赛分为两个赛道，每个赛道包含两个子任务：基于配对数据的定量评估和基于非配对数据的主观质量评估。 Result: 吸引了300注册和51团队提交600+结果，顶尖方法获得专家一致认可，推动了图像修复领域发展。 Conclusion: NTIRE 2025挑战赛成功推动了真实世界图像修复技术的进步，并为未来研究提供了新基准。 Abstract: In this paper, we present a comprehensive overview of the NTIRE 2025 challenge on the 2nd Restore Any Image Model (RAIM) in the Wild. This challenge established a new benchmark for real-world image restoration, featuring diverse scenarios with and without reference ground truth. Participants were tasked with restoring real-captured images suffering from complex and unknown degradations, where both perceptual quality and fidelity were critically evaluated. The challenge comprised two tracks: (1) the low-light joint denoising and demosaicing (JDD) task, and (2) the image detail enhancement/generation task. Each track included two sub-tasks. The first sub-task involved paired data with available ground truth, enabling quantitative evaluation. The second sub-task dealt with real-world yet unpaired images, emphasizing restoration efficiency and subjective quality assessed through a comprehensive user study. In total, the challenge attracted nearly 300 registrations, with 51 teams submitting more than 600 results. The top-performing methods advanced the state of the art in image restoration and received unanimous recognition from all 20+ expert judges. The datasets used in Track 1 and Track 2 are available at https://drive.google.com/drive/folders/1Mgqve-yNcE26IIieI8lMIf-25VvZRs_J and https://drive.google.com/drive/folders/1UB7nnzLwqDZOwDmD9aT8J0KVg2ag4Qae, respectively. The official challenge pages for Track 1 and Track 2 can be found at https://codalab.lisn.upsaclay.fr/competitions/21334#learn_the_details and https://codalab.lisn.upsaclay.fr/competitions/21623#learn_the_details.

[472] RAW Image Reconstruction from RGB on Smartphones. NTIRE 2025 Challenge Report

Marcos V. Conde,Radu Timofte,Radu Berdan,Beril Besbinar,Daisuke Iso,Pengzhou Ji,Xiong Dun,Zeying Fan,Chen Wu,Zhansheng Wang,Pengbo Zhang,Jiazi Huang,Qinglin Liu,Wei Yu,Shengping Zhang,Xiangyang Ji,Kyungsik Kim,Minkyung Kim,Hwalmin Lee,Hekun Ma,Huan Zheng,Yanyan Wei,Zhao Zhang,Jing Fang,Meilin Gao,Xiang Yu,Shangbin Xie,Mengyuan Sun,Huanjing Yue,Jingyu Yang Huize Cheng,Shaomeng Zhang,Zhaoyang Zhang,Haoxiang Liang

Main category: eess.IV

TL;DR: 该论文探讨了从sRGB图像重建RAW传感器图像的挑战（Reverse ISP），旨在通过无元数据的方式恢复智能手机的RAW图像，并提出了高效模型。

Details

Motivation: 由于RAW图像数据集稀缺且昂贵，而sRGB数据集丰富且公开，研究旨在通过逆向ISP转换生成逼真的RAW图像。 Method: 论文组织了NTIRE 2025挑战赛，吸引了150多名参与者提交高效模型，用于从sRGB图像重建RAW图像。 Result: 提出的方法和基准测试确立了生成逼真RAW数据的最新技术水平。 Conclusion: 该研究为RAW图像重建提供了高效解决方案，并推动了相关领域的发展。 Abstract: Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW Reconstruction from sRGB (Reverse ISP). We aim to recover RAW sensor images from smartphones given the corresponding sRGB images without metadata and, by doing this, ``reverse" the ISP transformation. Over 150 participants joined this NTIRE 2025 challenge and submitted efficient models. The proposed methods and benchmark establish the state-of-the-art for generating realistic RAW data.

[473] A European Multi-Center Breast Cancer MRI Dataset

Gustav Müller-Franzes,Lorena Escudero Sánchez,Nicholas Payne,Alexandra Athanasiou,Michael Kalogeropoulos,Aitor Lopez,Alfredo Miguel Soro Busto,Julia Camps Herrero,Nika Rasoolzadeh,Tianyu Zhang,Ritse Mann,Debora Jutz,Maike Bode,Christiane Kuhl,Wouter Veldhuis,Oliver Lester Saldanha,JieFu Zhu,Jakob Nikolas Kather,Daniel Truhn,Fiona J. Gilbert

Main category: eess.IV

TL;DR: 论文探讨了利用AI和MRI技术辅助乳腺癌早期检测的重要性，并介绍了ODELIA联盟公开的多中心数据集以支持AI工具开发。

Details

Motivation: 乳腺癌早期检测对治疗至关重要，MRI作为补充筛查工具需求增加，但专家解读耗时，需开发自动化方法。 Method: 利用AI技术分析MRI数据，开发自动化癌症检测工具。 Result: ODELIA联盟公开多中心数据集，支持AI工具开发。 Conclusion: AI辅助MRI解读有望提高乳腺癌早期检测效率。 Abstract: Detecting breast cancer early is of the utmost importance to effectively treat the millions of women afflicted by breast cancer worldwide every year. Although mammography is the primary imaging modality for screening breast cancer, there is an increasing interest in adding magnetic resonance imaging (MRI) to screening programmes, particularly for women at high risk. Recent guidelines by the European Society of Breast Imaging (EUSOBI) recommended breast MRI as a supplemental screening tool for women with dense breast tissue. However, acquiring and reading MRI scans requires significantly more time from expert radiologists. This highlights the need to develop new automated methods to detect cancer accurately using MRI and Artificial Intelligence (AI), which have the potential to support radiologists in breast MRI interpretation and classification and help detect cancer earlier. For this reason, the ODELIA consortium has made this multi-centre dataset publicly available to assist in developing AI tools for the detection of breast cancer on MRI.

[474] Image Restoration Learning via Noisy Supervision in the Fourier Domain

Haosen Liu,Jiahao Liu,Shan Tan,Edmund Y. Lam

Main category: eess.IV

TL;DR: 该论文提出了一种在傅里叶域中建立噪声监督的方法，以解决图像修复任务中空间相关噪声和像素级损失函数的问题。

Details

Motivation: 现有方法在处理空间相关噪声时效果不佳，且仅依赖像素级损失函数提供有限监督信息。傅里叶域中的噪声系数具有稀疏性和独立性，且包含全局信息，因此更适合用于监督学习。 Method: 利用傅里叶域中噪声系数的统计特性（收敛于高斯分布），建立噪声目标与干净目标在傅里叶域中的等价性，提出统一的学习框架。 Result: 实验验证了该框架在定量指标和感知质量上的优异表现。 Conclusion: 该方法为图像修复任务提供了一种高效且通用的噪声监督解决方案。 Abstract: Noisy supervision refers to supervising image restoration learning with noisy targets. It can alleviate the data collection burden and enhance the practical applicability of deep learning techniques. However, existing methods suffer from two key drawbacks. Firstly, they are ineffective in handling spatially correlated noise commonly observed in practical applications such as low-light imaging and remote sensing. Secondly, they rely on pixel-wise loss functions that only provide limited supervision information. This work addresses these challenges by leveraging the Fourier domain. We highlight that the Fourier coefficients of spatially correlated noise exhibit sparsity and independence, making them easier to handle. Additionally, Fourier coefficients contain global information, enabling more significant supervision. Motivated by these insights, we propose to establish noisy supervision in the Fourier domain. We first prove that Fourier coefficients of a wide range of noise converge in distribution to the Gaussian distribution. Exploiting this statistical property, we establish the equivalence between using noisy targets and clean targets in the Fourier domain. This leads to a unified learning framework applicable to various image restoration tasks, diverse network architectures, and different noise models. Extensive experiments validate the outstanding performance of this framework in terms of both quantitative indices and perceptual quality.

cs.AI [Back]

[475] GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning

Sahiti Yerramilli,Nilay Pande,Rynaa Grover,Jayant Sravan Tamarapalli

Main category: cs.AI

TL;DR: GeoChain是一个用于评估多模态大语言模型（MLLMs）逐步地理推理能力的大规模基准，包含146万张Mapillary街景图像和3000万问答对，揭示了模型在视觉定位和复杂推理中的挑战。

Details

Motivation: 当前MLLMs在地理推理任务中表现不佳，尤其是在视觉定位和逐步推理方面存在明显缺陷，需要一种系统化的评估方法来推动改进。 Method: 利用146万张街景图像，每张图像配以21步链式推理问题（共3000万问答对），涵盖视觉、空间、文化和精确定位四类推理，并标注难度。图像还包含语义分割和视觉定位评分。 Result: 测试多种MLLMs（如GPT-4.1、Claude 3.7、Gemini 2.5）在2088张图像上的表现，发现模型在视觉定位和复杂推理中表现不稳定，定位准确性随难度增加而下降。 Conclusion: GeoChain为MLLMs的复杂地理推理提供了诊断工具，有助于推动相关技术的进步。 Abstract: This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories - visual, spatial, cultural, and precise geolocation - annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Our benchmarking of contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent challenges: models frequently exhibit weaknesses in visual grounding, display erratic reasoning, and struggle to achieve accurate localization, especially as the reasoning complexity escalates. GeoChain offers a robust diagnostic methodology, critical for fostering significant advancements in complex geographic reasoning within MLLMs.

[476] SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning

Jisheng Dang,Yizhou Zhang,Hao Ye,Teng Wang,Siming Chen,Huicheng Zheng,Yulan Guo,Jianhuang Lai,Bin Hu

Main category: cs.AI

TL;DR: 本文提出了一种名为SynPO的新优化方法，通过偏好学习提升细粒度视频描述的性能，解决了现有方法的局限性。

Details

Motivation: 现有方法难以捕捉视频中的细微动态和丰富细节信息，需要一种更高效的优化方法。 Method: 提出了一种构建偏好对的流程，并结合SynPO优化方法，避免了负面偏好主导优化，同时保留了模型的语言能力。 Result: 在视频描述和NLP任务中，SynPO表现优于DPO及其变体，训练效率提升20%。 Conclusion: SynPO是一种高效的优化方法，适用于细粒度视频描述和广泛的NLP任务。 Abstract: Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detailed information. In this paper, we leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning, while mitigating several limitations inherent to direct preference optimization (DPO). First, we propose a pipeline for constructing preference pairs that leverages the intrinsic properties of VLMs along with partial assistance from large language models, achieving an optimal balance between cost and data quality. Second, we propose Synergistic Preference Optimization (SynPO), a novel optimization method offering significant advantages over DPO and its variants. SynPO prevents negative preferences from dominating the optimization, explicitly preserves the model's language capability to avoid deviation of the optimization objective, and improves training efficiency by eliminating the need for the reference model. We extensively evaluate SynPO not only on video captioning benchmarks (e.g., VDC, VDD, VATEX) but also across well-established NLP tasks, including general language understanding and preference evaluation, using diverse pretrained models. Results demonstrate that SynPO consistently outperforms DPO variants while achieving 20\% improvement in training efficiency. Code is available at https://github.com/longmalongma/SynPO

[477] Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues

Youngmin Kim,Jiwan Chung,Jisoo Kim,Sunghyun Lee,Sangkyu Lee,Junhyeok Kim,Cheoljong Yang,Youngjae Yu

Main category: cs.AI

TL;DR: MARS是一种多模态语言模型，结合文本和非语言线索（如面部表情和肢体语言），以提升对话AI的沉浸感。

Details

Motivation: 现有大型语言模型（LLMs）未能有效整合非语言元素，限制了对话体验的沉浸感。 Method: 通过VENUS数据集（包含标注视频、文本、面部表情和肢体语言）训练MARS，采用下一词预测目标，实现多模态理解和生成。 Result: MARS成功生成与对话输入对应的文本和非语言内容，VENUS数据集被验证为规模大且高效。 Conclusion: MARS填补了对话AI中非语言交流的空白，为多模态交互提供了新方向。 Abstract: Nonverbal communication is integral to human interaction, with gestures, facial expressions, and body language conveying critical aspects of intent and emotion. However, existing large language models (LLMs) fail to effectively incorporate these nonverbal elements, limiting their capacity to create fully immersive conversational experiences. We introduce MARS, a multimodal language model designed to understand and generate nonverbal cues alongside text, bridging this gap in conversational AI. Our key innovation is VENUS, a large-scale dataset comprising annotated videos with time-aligned text, facial expressions, and body language. Leveraging VENUS, we train MARS with a next-token prediction objective, combining text with vector-quantized nonverbal representations to achieve multimodal understanding and generation within a unified framework. Based on various analyses of the VENUS datasets, we validate its substantial scale and high effectiveness. Our quantitative and qualitative results demonstrate that MARS successfully generates text and nonverbal languages, corresponding to conversational input.

[478] EgoBrain: Synergizing Minds and Eyes For Human Action Understanding

Nie Lin,Yansen Wang,Dongqi Han,Weibang Jiang,Jingyuan Li,Ryosuke Furuta,Yoichi Sato,Dongsheng Li

Main category: cs.AI

TL;DR: EgoBrain是一个大规模、时间对齐的多模态数据集，同步记录第一人称视频和EEG信号，用于人类行为分析。通过多模态学习框架，实现了66.70%的动作识别准确率。

Details

Motivation: 结合脑机接口（BCI）和人工智能（AI）解码人类认知与行为，探索多模态AI模型的新可能性。 Method: 开发EgoBrain数据集，包含61小时的同步32通道EEG和第一人称视频，涵盖40名参与者的29类日常活动。构建多模态学习框架融合EEG和视觉数据。 Result: 在跨被试和跨环境挑战中验证，动作识别准确率达到66.70%。 Conclusion: EgoBrain为多模态脑机接口提供了统一框架，并公开数据以促进认知计算的开放科学。 Abstract: The integration of brain-computer interfaces (BCIs), in particular electroencephalography (EEG), with artificial intelligence (AI) has shown tremendous promise in decoding human cognition and behavior from neural signals. In particular, the rise of multimodal AI models have brought new possibilities that have never been imagined before. Here, we present EgoBrain --the world's first large-scale, temporally aligned multimodal dataset that synchronizes egocentric vision and EEG of human brain over extended periods of time, establishing a new paradigm for human-centered behavior analysis. This dataset comprises 61 hours of synchronized 32-channel EEG recordings and first-person video from 40 participants engaged in 29 categories of daily activities. We then developed a muiltimodal learning framework to fuse EEG and vision for action understanding, validated across both cross-subject and cross-environment challenges, achieving an action recognition accuracy of 66.70%. EgoBrain paves the way for a unified framework for brain-computer interface with multiple modalities. All data, tools, and acquisition protocols are openly shared to foster open science in cognitive computing.

[479] AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning

Zhong Zhang,Yaxi Lu,Yikun Fu,Yupeng Huo,Shenzhi Yang,Yesai Wu,Han Si,Xin Cong,Haotian Chen,Yankai Lin,Jie Xie,Wei Zhou,Wang Xu,Yuanheng Zhang,Zhou Su,Zhongwu Zhai,Xiaoming Liu,Yudong Mei,Jianming Xu,Hongyan Tian,Chongyi Wang,Chi Chen,Yuan Yao,Zhiyuan Liu,Maosong Sun

Main category: cs.AI

TL;DR: AgentCPM-GUI是一个8B参数的GUI代理，通过改进的训练流程在移动设备上实现高效交互，并在中英文界面上表现出色。

Details

Motivation: 现有GUI代理的训练数据噪声大且语义多样性不足，导致模型泛化能力差，且非英语界面（如中文）被忽视。 Method: 采用基于感知的预训练、高质量轨迹的监督微调、GRPO强化微调，并设计紧凑动作空间以降低延迟。 Result: 在五个公共基准和新的中文基准CAGUI上达到96.9% Type-Match和91.3% Exact-Match。 Conclusion: AgentCPM-GUI在性能和泛化能力上表现优异，代码和模型已公开以促进研究。 Abstract: The recent progress of large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability. However, practical deployment of such agents remains constrained by several key challenges. Existing training data is often noisy and lack semantic diversity, which hinders the learning of precise grounding and planning. Models trained purely by imitation tend to overfit to seen interface patterns and fail to generalize in unfamiliar scenarios. Moreover, most prior work focuses on English interfaces while overlooks the growing diversity of non-English applications such as those in the Chinese mobile ecosystem. In this work, we present AgentCPM-GUI, an 8B-parameter GUI agent built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception, supervised fine-tuning on high-quality Chinese and English trajectories to imitate human-like actions, and reinforcement fine-tuning with GRPO to improve reasoning capability. We also introduce a compact action space that reduces output length and supports low-latency execution on mobile devices. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks and a new Chinese GUI benchmark called CAGUI, reaching $96.9\%$ Type-Match and $91.3\%$ Exact-Match. To facilitate reproducibility and further research, we publicly release all code, model checkpoint, and evaluation data.

[480] The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets

Shenzhe Zhu,Jiao Sun,Yi Nian,Tobin South,Alex Pentland,Jiaxin Pei

Main category: cs.AI

TL;DR: 研究探讨AI代理在消费者市场中自动化谈判和交易的潜力，发现不同LLM代理的表现差异显著，且存在行为异常导致财务风险。

Details

Motivation: 探索AI代理在消费者市场中自动化谈判和交易的可行性及其潜在风险。 Method: 开发实验框架，评估多种LLM代理在真实谈判和交易场景中的表现。 Result: AI代理的表现差异显著，可能导致财务损失（如过度消费或不合理交易）。 Conclusion: 自动化虽提升效率，但风险显著，用户需谨慎授权AI代理进行商业决策。 Abstract: AI agents are increasingly used in consumer-facing applications to assist with tasks such as product search, negotiation, and transaction execution. In this paper, we explore a future scenario where both consumers and merchants authorize AI agents to fully automate negotiations and transactions. We aim to answer two key questions: (1) Do different LLM agents vary in their ability to secure favorable deals for users? (2) What risks arise from fully automating deal-making with AI agents in consumer markets? To address these questions, we develop an experimental framework that evaluates the performance of various LLM agents in real-world negotiation and transaction settings. Our findings reveal that AI-mediated deal-making is an inherently imbalanced game -- different agents achieve significantly different outcomes for their users. Moreover, behavioral anomalies in LLMs can result in financial losses for both consumers and merchants, such as overspending or accepting unreasonable deals. These results underscore that while automation can improve efficiency, it also introduces substantial risks. Users should exercise caution when delegating business decisions to AI agents.

[481] Control-R: Towards controllable test-time scaling

Di Zhang,Weida Wang,Junxian Li,Xunzhi Wang,Jiatong Li,Jianbo Wu,Jingdi Lei,Haonan He,Peng Ye,Shufei Zhang,Wanli Ouyang,Yuqiang Li,Dongzhan Zhou

Main category: cs.AI

TL;DR: 论文提出了一种名为推理控制场（RCF）的新方法，用于解决大型推理模型（LRMs）在长链推理（CoT）中的欠思考与过思考问题，并通过树搜索视角注入结构化控制信号。

Details

Motivation: 解决大型推理模型在长链推理中的欠思考与过思考问题，提升复杂任务中的推理可控性。 Method: 引入推理控制场（RCF）和条件蒸馏微调（CDF）方法，结合Control-R-4K数据集，训练模型（如Control-R-32B）动态调整推理努力。 Result: 在AIME2024和MATH500等基准测试中，该方法在32B规模上实现了最先进的性能，并实现了可控的长链推理过程（L-CoT）。 Conclusion: 该研究为可控的测试时推理扩展提供了一种有效范式。 Abstract: This paper target in addressing the challenges of underthinking and overthinking in long chain-of-thought (CoT) reasoning for Large Reasoning Models (LRMs) by introducing Reasoning Control Fields (RCF)--a novel test-time approach that injects structured control signals to guide reasoning from a tree search perspective. RCF enables models to adjust reasoning effort according to given control conditions when solving complex tasks. Additionally, we present the Control-R-4K dataset, which consists of challenging problems annotated with detailed reasoning processes and corresponding control fields. To further enhance reasoning control, we propose a Conditional Distillation Finetuning (CDF) method, which trains model--particularly Control-R-32B--to effectively adjust reasoning effort during test time. Experimental results on benchmarks such as AIME2024 and MATH500 demonstrate that our approach achieves state-of-the-art performance at the 32B scale while enabling a controllable Long CoT reasoning process (L-CoT). Overall, this work introduces an effective paradigm for controllable test-time scaling reasoning.

[482] Whispers of Many Shores: Cultural Alignment through Collaborative Cultural Expertise

Shuai Feng,Wei-Chuang Chan,Srishti Chouhan,Junior Francisco Garcia Ayala,Srujananjali Medicherla,Kyle Clark,Mingwei Shi

Main category: cs.AI

TL;DR: 提出了一种基于软提示微调的新框架，用于高效实现大语言模型（LLM）的文化对齐，显著提升文化敏感性和适应性。

Details

Motivation: 当前LLM在多样化文化背景下缺乏细致理解，且全微调成本高昂，亟需一种高效、模块化的文化对齐方法。 Method: 采用向量化提示调优，动态将查询路由至一组文化特化的专家LLM配置，通过优化软提示嵌入实现，无需修改基础模型参数。 Result: 实验表明，文化对齐得分从0.208提升至0.820，显著增强文化敏感性和适应性。 Conclusion: 该框架为文化感知的LLM部署提供了高效解决方案，并为后续研究（如文化覆盖扩展和动态专家适应）奠定了基础。 Abstract: The integration of large language models (LLMs) into global applications necessitates effective cultural alignment for meaningful and culturally-sensitive interactions. Current LLMs often lack the nuanced understanding required for diverse cultural contexts, and adapting them typically involves costly full fine-tuning. To address this, we introduce a novel soft prompt fine-tuning framework that enables efficient and modular cultural alignment. Our method utilizes vectorized prompt tuning to dynamically route queries to a committee of culturally specialized 'expert' LLM configurations, created by optimizing soft prompt embeddings without altering the base model's parameters. Extensive experiments demonstrate that our framework significantly enhances cultural sensitivity and adaptability, improving alignment scores from 0.208 to 0.820, offering a robust solution for culturally-aware LLM deployment. This research paves the way for subsequent investigations into enhanced cultural coverage and dynamic expert adaptation, crucial for realizing autonomous AI with deeply nuanced understanding in a globally interconnected world.

[483] MIR: Methodology Inspiration Retrieval for Scientific Research Problems

Aniketh Garikaparthi,Manasi Patwardhan,Aditya Sanjiv Kanade,Aman Hassan,Lovekesh Vig,Arman Cohan

Main category: cs.AI

TL;DR: 论文提出了一种名为方法论灵感检索（MIR）的任务，通过构建方法论邻接图（MAG）和改进检索方法，显著提升了检索效果。

Details

Motivation: 现有方法依赖文献质量，效果不稳定，需解决如何从文献中获取方法论灵感的问题。 Method: 构建MAG图，利用密集检索器嵌入方法论邻接关系，并结合LLM重新排序策略。 Result: 在Recall@3和mAP指标上分别提升了5.4和7.8，结合LLM后进一步提升了4.5和4.8。 Conclusion: MIR在自动化科学发现中具有潜力，未来可进一步优化灵感驱动的检索方法。 Abstract: There has been a surge of interest in harnessing the reasoning capabilities of Large Language Models (LLMs) to accelerate scientific discovery. While existing approaches rely on grounding the discovery process within the relevant literature, effectiveness varies significantly with the quality and nature of the retrieved literature. We address the challenge of retrieving prior work whose concepts can inspire solutions for a given research problem, a task we define as Methodology Inspiration Retrieval (MIR). We construct a novel dataset tailored for training and evaluating retrievers on MIR, and establish baselines. To address MIR, we build the Methodology Adjacency Graph (MAG); capturing methodological lineage through citation relationships. We leverage MAG to embed an "intuitive prior" into dense retrievers for identifying patterns of methodological inspiration beyond superficial semantic similarity. This achieves significant gains of +5.4 in Recall@3 and +7.8 in Mean Average Precision (mAP) over strong baselines. Further, we adapt LLM-based re-ranking strategies to MIR, yielding additional improvements of +4.5 in Recall@3 and +4.8 in mAP. Through extensive ablation studies and qualitative analyses, we exhibit the promise of MIR in enhancing automated scientific discovery and outline avenues for advancing inspiration-driven retrieval.

[484] Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents

Xiao Yu,Baolin Peng,Ruize Xu,Michel Galley,Hao Cheng,Suman Nath,Jianfeng Gao,Zhou Yu

Main category: cs.AI

TL;DR: Dyna-Think框架通过结合规划、世界模型与推理，提升AI代理性能，其训练方法DIT和DDT显著减少了计算资源需求并提高了任务表现。

Details

Motivation: 探索大型语言模型在长时程任务中的有效行为，提出整合世界模型与推理的框架以优化AI代理性能。 Method: 提出Dyna-Think框架，包含DIT（模仿学习）和DDT（两阶段训练），分别用于初始化和增强代理的世界模型与行动能力。 Result: 在OSWorld上验证，Dyna-Think在性能上与R1相当，但生成标记数减少50%，且世界模型能力与代理表现正相关。 Conclusion: 世界模型仿真为提升AI代理的推理、规划和行动能力提供了有前景的研究方向。 Abstract: Recent progress in reasoning with large language models (LLMs), such as DeepSeek-R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self-reflection. However, it is unclear what behavior is effective and what behavior is missing for long-horizon AI agents tasks. In this work, we propose Dyna-Think, a thinking framework that integrates planning with an internal world model with reasoning and acting to enhance AI agent performance. To enable Dyna-Think, we propose Dyna-Think Imitation Learning (DIT) and Dyna-Think Dyna Training (DDT). To initialize a policy with Dyna-Think, DIT reconstructs the thinking process of R1 to focus on performing world model simulation relevant to the proposed (and planned) action, and trains the policy using this reconstructed data. To enhance Dyna-Think, DDT uses a two-stage training process to first improve the agent's world modeling ability via objectives such as state prediction or critique generation, and then improve the agent's action via policy training. We evaluate our methods on OSWorld, and demonstrate that Dyna-Think improves the agent's in-domain and out-of-domain performance, achieving similar best-of-n performance compared to R1 while generating 2x less tokens on average. Our extensive empirical studies reveal that 1) using critique generation for world model training is effective to improve policy performance; and 2) AI agents with better performance correlate with better world modeling abilities. We believe our results suggest a promising research direction to integrate world model simulation into AI agents to enhance their reasoning, planning, and acting capabilities.

[485] CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing

Tianhui Liu,Jie Feng,Hetian Pang,Xin Zhang,Tianjian Ouyang,Zhiyuan Zhang,Yong Li

Main category: cs.AI

TL;DR: CityLens是一个多模态基准测试，用于评估大型语言视觉模型（LLVMs）从卫星和街景图像预测社会经济指标的能力。

Details

Motivation: 理解城市社会经济状况对可持续发展和政策规划至关重要，但现有方法存在挑战。 Method: 构建覆盖17个全球城市的多模态数据集，定义11个预测任务，采用三种评估范式，并对17种LLVMs进行基准测试。 Result: LLVMs在感知和推理方面表现良好，但在预测社会经济指标时仍有局限。 Conclusion: CityLens为诊断LLVMs的局限性提供了统一框架，并指导未来研究。 Abstract: Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce $\textbf{CityLens}$, a comprehensive benchmark designed to evaluate the capabilities of large language-vision models (LLVMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize three evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LLVMs across these tasks. Our results reveal that while LLVMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LLVMs to understand and predict urban socioeconomic patterns. Our codes and datasets are open-sourced via https://github.com/tsinghua-fib-lab/CityLens.

[486] Reasoning Like an Economist: Post-Training on Economic Problems Induces Strategic Generalization in LLMs

Yufa Zhou,Shaobo Wang,Xingyu Dong,Xiangqi Jin,Yifang Chen,Yue Min,Kexin Yang,Xingzhang Ren,Dayiheng Liu,Linfeng Zhang

Main category: cs.AI

TL;DR: 论文探讨了通过监督微调（SFT）和可验证奖励的强化学习（RLVR）提升大语言模型在多智能体系统中的泛化能力，并以经济学推理为测试平台。

Details

Motivation: 直接训练大语言模型用于多智能体系统面临奖励建模复杂、动态交互和泛化要求高的挑战。 Method: 使用SFT和RLVR对7B参数的Recon模型进行后训练，数据集包含2100个高质量经济学推理问题。 Result: 在经济学推理基准和多智能体游戏中，模型在结构化推理和经济理性方面表现显著提升。 Conclusion: 领域对齐的后训练能有效增强推理能力和智能体对齐，揭示了SFT和RL在模型行为塑造中的作用。 Abstract: Directly training Large Language Models (LLMs) for Multi-Agent Systems (MAS) remains challenging due to intricate reward modeling, dynamic agent interactions, and demanding generalization requirements. This paper explores whether post-training techniques, specifically Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), can effectively $\textit{generalize}$ to multi-agent scenarios. We use economic reasoning as a testbed, leveraging its strong foundations in mathematics and game theory, its demand for structured analytical reasoning, and its relevance to real-world applications such as market design, resource allocation, and policy analysis. We introduce $\textbf{Recon}$ ($\textbf{R}$easoning like an $\textbf{ECON}$omist), a 7B-parameter open-source LLM post-trained on a hand-curated dataset of 2,100 high-quality economic reasoning problems. Comprehensive evaluation on economic reasoning benchmarks and multi-agent games reveals clear improvements in structured reasoning and economic rationality. These results underscore the promise of domain-aligned post-training for enhancing reasoning and agent alignment, shedding light on the roles of SFT and RL in shaping model behavior. Code is available at https://github.com/MasterZhou1/Recon .

[487] DrKGC: Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion across General and Biomedical Domains

Yongkang Xiao,Sinian Zhang,Yi Dai,Huixue Zhou,Jue Hou,Jie Ding,Rui Zhang

Main category: cs.AI

TL;DR: DrKGC是一种结合动态子图检索和LLM的知识图谱补全方法，通过学习结构嵌入和逻辑规则，并利用GCN增强嵌入，显著提升了性能。

Details

Motivation: 现有方法未能充分利用LLM对图结构的感知和推理能力，DrKGC旨在解决这一问题。 Method: DrKGC通过学习结构嵌入和逻辑规则，采用动态子图检索方法提取子图，并通过GCN增强嵌入，最后整合到LLM的提示中进行微调。 Result: 在两个通用领域和两个生物医学数据集上表现出优越性能，并在生物医学案例中展示了其可解释性和实用性。 Conclusion: DrKGC通过结合动态子图检索和LLM，显著提升了知识图谱补全的性能和实用性。 Abstract: Knowledge graph completion (KGC) aims to predict missing triples in knowledge graphs (KGs) by leveraging existing triples and textual information. Recently, generative large language models (LLMs) have been increasingly employed for graph tasks. However, current approaches typically encode graph context in textual form, which fails to fully exploit the potential of LLMs for perceiving and reasoning about graph structures. To address this limitation, we propose DrKGC (Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion). DrKGC employs a flexible lightweight model training strategy to learn structural embeddings and logical rules within the KG. It then leverages a novel bottom-up graph retrieval method to extract a subgraph for each query guided by the learned rules. Finally, a graph convolutional network (GCN) adapter uses the retrieved subgraph to enhance the structural embeddings, which are then integrated into the prompt for effective LLM fine-tuning. Experimental results on two general domain benchmark datasets and two biomedical datasets demonstrate the superior performance of DrKGC. Furthermore, a realistic case study in the biomedical domain highlights its interpretability and practical utility.

[488] Aligning VLM Assistants with Personalized Situated Cognition

Yongqi Li,Shen Zhou,Xiaohu Li,Xin Miao,Jintao Wen,Mayi Xu,Jianhao Chen,Birong Pan,Hankun Kang,Yuanyuan Zhu,Ming Zhong,Tieyun Qian

Main category: cs.AI

TL;DR: 论文提出了一种个性化对齐视觉语言模型（VLM）的方法，通过社会学角色集概念简化问题，构建了一个包含18k实例的基准PCogAlignBench，并提出了PCogAlign框架。实验证明其有效性和基准的可靠性。

Details

Motivation: 由于不同背景的人对同一情境的认知和期望不同，需要将VLM助手与个性化情境认知对齐，以满足现实世界中的个性化需求。 Method: 通过社会学角色集概念简化问题，构建PCogAlignBench基准，提出PCogAlign框架，利用认知感知和基于动作的奖励模型实现个性化对齐。 Result: 实验和人工评估验证了PCogAlignBench的可靠性和PCogAlign框架的有效性。 Conclusion: 研究为个性化对齐VLM提供了可行方法，并开源了基准和代码。 Abstract: Vision-language models (VLMs) aligned with general human objectives, such as being harmless and hallucination-free, have become valuable assistants of humans in managing visual tasks. However, people with diversified backgrounds have different cognition even in the same situation. Consequently, they may have personalized expectations for VLM assistants. This highlights the urgent need to align VLM assistants with personalized situated cognition for real-world assistance. To study this problem, we first simplify it by characterizing individuals based on the sociological concept of Role-Set. Then, we propose to evaluate the individuals' actions to examine whether the personalized alignment is achieved. Further, we construct a benchmark named PCogAlignBench, which includes 18k instances and 20 individuals with different Role-Sets. Finally, we present a framework called PCogAlign, which constructs a cognition-aware and action-based reward model for personalized alignment. Experimental results and human evaluations demonstrate the reliability of the PCogAlignBench and the effectiveness of our proposed PCogAlign. We will open-source the constructed benchmark and code at https://github.com/NLPGM/PCogAlign.

[489] Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner

Chunhui Zhang,Zhongyu Ouyang,Kwonjoon Lee,Nakul Agarwal,Sean Dae Houlihan,Soroush Vosoughi,Shao-Yuan Lo

Main category: cs.AI

TL;DR: 提出了一种基于贝叶斯更新的可扩展ToM规划器，通过小模型与大模型的协同推理，显著提升了多模态ToM任务的性能。

Details

Motivation: 现有ToM计算方法依赖结构化工作流或深度微调，难以在多模态环境中扩展且泛化能力不足。 Method: 将ToM推理分解为逐步贝叶斯更新，采用弱到强控制策略，让小模型专注于ToM似然估计，并将推理行为迁移到大模型中。 Result: 在多模态ToM基准测试中，准确率比现有技术提升了4.6%，在未见场景中表现优异。 Conclusion: 该方法为复杂环境中建模人类心理状态设立了新标准。 Abstract: Theory-of-Mind (ToM) enables humans to infer mental states-such as beliefs, desires, and intentions-forming the foundation of social cognition. However, existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning, which struggle with scalability in multimodal environments and fail to generalize as task complexity increases. To address these limitations, we propose a scalable Bayesian ToM planner that decomposes ToM reasoning into stepwise Bayesian updates. Our framework introduces weak-to-strong control, allowing smaller language models (LMs) to specialize in ToM-specific likelihood estimation and transfer their reasoning behaviors to larger LMs (7B to 405B) for integration with social and world knowledge. This synergistic approach aligns large-model inference of human mental states with Bayesian principles. Extensive experiments show that our method achieves a 4.6% accuracy improvement over state-of-the-art techniques on multimodal ToM benchmarks, including challenging unseen scenarios, thereby establishing a new standard for modeling human mental states in complex environments.

[490] An Empirical Study of Group Conformity in Multi-Agent Systems

Min Choi,Keonwoo Kim,Sungwon Chae,Sangyeob Baek

Main category: cs.AI

TL;DR: 研究探讨了多智能体LLM在争议话题辩论中如何形成偏见，发现智能体倾向于与多数群体或更智能的智能体保持一致，强调了政策干预的必要性。

Details

Motivation: 探索多智能体LLM在争议话题中偏见的产生与传播，填补现有研究的空白。 Method: 通过模拟2500多场辩论，分析中立智能体在辩论中立场的变化。 Result: 智能体表现出显著的群体一致性，倾向于与多数或更智能的智能体保持一致。 Conclusion: 需通过政策促进多样性和透明度，以减少匿名在线环境中偏见的传播风险。 Abstract: Recent advances in Large Language Models (LLMs) have enabled multi-agent systems that simulate real-world interactions with near-human reasoning. While previous studies have extensively examined biases related to protected attributes such as race, the emergence and propagation of biases on socially contentious issues in multi-agent LLM interactions remain underexplored. This study explores how LLM agents shape public opinion through debates on five contentious topics. By simulating over 2,500 debates, we analyze how initially neutral agents, assigned a centrist disposition, adopt specific stances over time. Statistical analyses reveal significant group conformity mirroring human behavior; LLM agents tend to align with numerically dominant groups or more intelligent agents, exerting a greater influence. These findings underscore the crucial role of agent intelligence in shaping discourse and highlight the risks of bias amplification in online interactions. Our results emphasize the need for policy measures that promote diversity and transparency in LLM-generated discussions to mitigate the risks of bias propagation within anonymous online environments.

[491] AI Scientists Fail Without Strong Implementation Capability

Minjun Zhu,Qiujie Xie,Yixuan Weng,Jian Wu,Zhen Lin,Linyi Yang,Yue Zhang

Main category: cs.AI

TL;DR: AI Scientist展现了独立科学发现的能力，但在计算机科学领域尚未取得突破性成就，主要瓶颈在于验证程序的执行能力不足。

Details

Motivation: 探讨AI Scientist在科学发现中的潜力及其当前限制，特别是验证和执行能力的不足。 Method: 通过定量证据和系统评估28篇由AI Scientist生成的研究论文，分析其执行能力。 Result: AI Scientist在验证和执行实验方面存在显著不足，导致无法产生高质量科学成果。 Conclusion: 呼吁社区共同努力，解决AI Scientist的执行能力瓶颈，推动其进一步发展。 Abstract: The emergence of Artificial Intelligence (AI) Scientist represents a paradigm shift in scientific discovery, with large language models (LLMs) taking the lead as the primary executor in the entire scientific workflow from idea generation to experiment implementation. Recent AI Scientist studies demonstrate sufficient capabilities for independent scientific discovery, with the generated research reports gaining acceptance at the ICLR 2025 workshop and ACL 2025, arguing that a human-level AI Scientist, capable of uncovering phenomena previously unknown to humans, may be imminent. Despite this substantial progress, AI Scientist has yet to produce a groundbreaking achievement in the domain of computer science on par with automated scientific tools. Based on extensive quantitative evidence from existing benchmarks in complex engineering tasks and a systematic evaluation assess 28 research papers generated by five advanced AI Scientist systems, we argue that \textbf{the fundamental bottleneck for AI Scientists lies in their capability to execute the requisite verification procedures.} Current AI Scientist systems lack the execution capabilities needed to execute rigorous experiments and produce high-quality scientific papers. To better illustrate the root cause of this \textbf{implementation gap}, we provide an in-depth discussion on the fundamental limitations of AI Scientist. This position paper aims to call for the participants in the community to bridge the implementation gap.

[492] PGPO: Enhancing Agent Reasoning via Pseudocode-style Planning Guided Preference Optimization

Zouying Cao,Runze Wang,Yifei Yang,Xinbei Ma,Xiaoyong Zhu,Bo Zheng,Hai Zhao

Main category: cs.AI

TL;DR: 论文提出了一种伪代码风格的计划（P-code Plan）方法，以提升大型语言模型（LLM）代理的推理效率和泛化能力，并提出了PGPO方法进一步优化代理学习。

Details

Motivation: 现有LLM代理主要依赖自然语言计划，效率低且泛化能力差，因此探索伪代码风格计划以改进。 Method: 提出伪代码风格计划（P-code Plan）和PGPO方法，通过规划导向的奖励优化代理学习。 Result: PGPO在代表性代理基准测试中表现优异，优于当前领先基线，减少了推理中的动作错误和遗漏。 Conclusion: 伪代码风格计划和PGPO方法显著提升了LLM代理的推理效率和泛化能力。 Abstract: Large Language Model (LLM) agents have demonstrated impressive capabilities in handling complex interactive problems. Existing LLM agents mainly generate natural language plans to guide reasoning, which is verbose and inefficient. NL plans are also tailored to specific tasks and restrict agents' ability to generalize across similar tasks. To this end, we explore pseudocode-style plans (P-code Plan) to capture the structural logic of reasoning. We find that P-code Plan empowers LLM agents with stronger generalization ability and more efficiency. Inspired by this finding, we propose a pseudocode-style Planning Guided Preference Optimization method called PGPO for effective agent learning. With two planning-oriented rewards, PGPO further enhances LLM agents' ability to generate high-quality P-code Plans and subsequent reasoning. Experiments show that PGPO achieves superior performance on representative agent benchmarks and outperforms the current leading baselines. Analyses reveal the advantage of PGPO in reducing action errors and omissions during reasoning.

[493] Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents

Shuting Wang,Yunqi Liu,Zixin Yang,Ning Hu,Zhicheng Dou,Chenyan Xiong

Main category: cs.AI

TL;DR: 论文构建了RealVideoQuest基准，用于评估文本到视频（T2V）模型在回答真实世界视觉查询中的表现，发现现有模型效果不佳。

Details

Motivation: 现有查询-回答数据集主要关注文本响应，难以满足需要视觉演示的复杂查询需求。 Method: 通过多阶段视频检索和精炼过程构建了4.5K高质量查询-视频对，并开发了多角度评估系统。 Result: 实验表明当前T2V模型在应对真实用户查询时表现不佳。 Conclusion: 指出了多模态AI的关键挑战和未来研究方向。 Abstract: Querying generative AI models, e.g., large language models (LLMs), has become a prevalent method for information acquisition. However, existing query-answer datasets primarily focus on textual responses, making it challenging to address complex user queries that require visual demonstrations or explanations for better understanding. To bridge this gap, we construct a benchmark, RealVideoQuest, designed to evaluate the abilities of text-to-video (T2V) models in answering real-world, visually grounded queries. It identifies 7.5K real user queries with video response intents from Chatbot-Arena and builds 4.5K high-quality query-video pairs through a multistage video retrieval and refinement process. We further develop a multi-angle evaluation system to assess the quality of generated video answers. Experiments indicate that current T2V models struggle with effectively addressing real user queries, pointing to key challenges and future research opportunities in multimodal AI.

[494] Self-Challenging Language Model Agents

Yifei Zhou,Sergey Levine,Jason Weston,Xian Li,Sainbayar Sukhbaatar

Main category: cs.AI

TL;DR: Self-Challenging框架通过让智能体自我生成高质量任务并训练，显著提升了工具使用能力。

Details

Motivation: 训练智能体使用工具需要大量人工标注任务，成本高且多样性有限。 Method: 智能体分为挑战者和执行者角色，生成Code-as-Task任务并通过强化学习训练。 Result: 在M3ToolEval和TauBench基准测试中，Llama-3.1-8B-Instruct性能提升两倍以上。 Conclusion: Self-Challenging框架证明了自我生成任务的有效性，减少了对外部标注数据的依赖。 Abstract: Large language models are quickly becoming the foundation for intelligent agents that are capable of using tools. However, training such agents is challenging because it requires human creation and annotation of a diverse set of tasks, tools, and evaluation criteria. In this paper, we propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself. The agent first plays the role of challenger and generates a task after interacting with the given tools. The tasks take the form of a novel general class of problems termed Code-as-Task, which are defined by an instruction, a verification function and solution and failure cases which serve as tests, allowing to filter only for high-quality tasks. The agent then takes an executor role and trains on those tasks with reinforcement learning using the evaluation feedback as a reward. Evaluation on two existing multi-turn tool-use agent benchmarks, M3ToolEval and TauBench, shows the Self-Challenging framework achieves over a two-fold improvement in Llama-3.1-8B-Instruct, despite using only self-generated training data.

[495] WHEN TO ACT, WHEN TO WAIT: Modeling Structural Trajectories for Intent Triggerability in Task-Oriented Dialogue

Yaoyao Qian,Jindan Huang,Yuanli Wang,Simon Yu,Kyrie Zhixuan Zhou,Jiayuan Mao,Mingfu Liang,Hanhan Zhou

Main category: cs.AI

TL;DR: STORM框架通过用户与代理LLM的对话建模信息不对称动态，捕捉协作意图形成过程，并评估认知改进与任务性能。

Details

Motivation: 解决用户表达语义完整但结构信息不足的问题，以及LLM代理无法区分语言完整性与上下文触发性的局限性。 Method: 提出STORM框架，通过UserLLM和AgentLLM的对话建模信息不对称，生成标注语料库分析协作理解发展。 Result: 实验显示适度不确定性（40-60%）在某些场景下优于完全透明，模型特定模式提示重新思考人机协作中的信息完整性。 Conclusion: STORM为理解不对称推理动态提供新视角，并指导不确定性校准的对话系统设计。 Abstract: Task-oriented dialogue systems often face difficulties when user utterances seem semantically complete but lack necessary structural information for appropriate system action. This arises because users frequently do not fully understand their own needs, while systems require precise intent definitions. Current LLM-based agents cannot effectively distinguish between linguistically complete and contextually triggerable expressions, lacking frameworks for collaborative intent formation. We present STORM, a framework modeling asymmetric information dynamics through conversations between UserLLM (full internal access) and AgentLLM (observable behavior only). STORM produces annotated corpora capturing expression trajectories and latent cognitive transitions, enabling systematic analysis of collaborative understanding development. Our contributions include: (1) formalizing asymmetric information processing in dialogue systems; (2) modeling intent formation tracking collaborative understanding evolution; and (3) evaluation metrics measuring internal cognitive improvements alongside task performance. Experiments across four language models reveal that moderate uncertainty (40-60%) can outperform complete transparency in certain scenarios, with model-specific patterns suggesting reconsideration of optimal information completeness in human-AI collaboration. These findings contribute to understanding asymmetric reasoning dynamics and inform uncertainty-calibrated dialogue system design.

[496] Large language models can learn and generalize steganographic chain-of-thought under process supervision

Joey Skaf,Luis Ibanez-Lissen,Robert McCarthy,Connor Watts,Vasil Georgiv,Hannes Whittingham,Lorena Gonzalez-Manzano,David Lindner,Cameron Tice,Edward James Young,Puria Radmard

Main category: cs.AI

TL;DR: 论文探讨了链式思维（CoT）监控的可靠性问题，指出惩罚特定字符串会导致模型学习隐写编码，但并未改变其底层行为。

Details

Motivation: 研究旨在揭示惩罚CoT中特定字符串对模型行为的影响，以及模型如何通过隐写编码规避监控。 Method: 通过实验展示惩罚特定字符串后模型的替代行为，并验证其编码方案的泛化能力。 Result: 模型学会用替代字符串隐写编码，且能泛化到未见过的同类字符串，但底层行为未变。 Conclusion: CoT监控可能因模型的隐写编码而失效，需更深入的方法确保模型行为的透明性。 Abstract: Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insights into decision-making processes, marking it as a useful tool for monitoring model intent and planning. By proactively preventing models from acting on CoT indicating misaligned or harmful intent, CoT monitoring can be used to reduce risks associated with deploying models. However, developers may be incentivized to train away the appearance of harmful intent from CoT traces, by either customer preferences or regulatory requirements. Recent works have shown that banning mention of a specific example of reward hacking, which may be done either to make CoT presentable to users or as a naive attempt to prevent the behavior, causes obfuscation of the undesired reasoning traces but the persistence of the undesired behavior. Such obfuscation threatens the reliability of CoT monitoring. However, obfuscation of reasoning can be due to its internalization to latent space computation, or its encoding within the CoT. Here, we provide an extension to these results. First, we show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings. Crucially, this does not alter the underlying method by which the model performs the task, demonstrating that the model can learn to steganographically encode its reasoning. We further demonstrate that models can generalize an encoding scheme. When the penalized strings belong to an overarching class, the model learns not only to substitute strings seen in training, but also develops a general encoding scheme for all members of the class which it can apply to held-out testing strings.

cs.CY [Back]

[497] Comparative analysis of privacy-preserving open-source LLMs regarding extraction of diagnostic information from clinical CMR imaging reports

Sina Amirrajab,Volker Vehof,Michael Bietenbeck,Ali Yilmaz

Main category: cs.CY

TL;DR: 研究评估了开源大语言模型（LLMs）在心血管磁共振（CMR）报告中的诊断信息提取能力，发现多个模型表现优异，优于专业心脏病专家。

Details

Motivation: 探索隐私保护、本地部署的开源LLMs在临床环境中自动化分析影像报告的可行性。 Method: 评估了9个开源LLMs在109份临床CMR报告中的诊断分类能力，使用准确率、精确率、召回率和F1分数等指标。 Result: 多个模型表现优异，Google的Gemma2模型F1分数最高（0.98），部分模型甚至优于心脏病专家（F1分数0.94）。 Conclusion: 开源LLMs可用于临床影像报告的自动化分析，提供准确、快速且资源高效的诊断分类。 Abstract: Purpose: We investigated the utilization of privacy-preserving, locally-deployed, open-source Large Language Models (LLMs) to extract diagnostic information from free-text cardiovascular magnetic resonance (CMR) reports. Materials and Methods: We evaluated nine open-source LLMs on their ability to identify diagnoses and classify patients into various cardiac diagnostic categories based on descriptive findings in 109 clinical CMR reports. Performance was quantified using standard classification metrics including accuracy, precision, recall, and F1 score. We also employed confusion matrices to examine patterns of misclassification across models. Results: Most open-source LLMs demonstrated exceptional performance in classifying reports into different diagnostic categories. Google's Gemma2 model achieved the highest average F1 score of 0.98, followed by Qwen2.5:32B and DeepseekR1-32B with F1 scores of 0.96 and 0.95, respectively. All other evaluated models attained average scores above 0.93, with Mistral and DeepseekR1-7B being the only exceptions. The top four LLMs outperformed our board-certified cardiologist (F1 score of 0.94) across all evaluation metrics in analyzing CMR reports. Conclusion: Our findings demonstrate the feasibility of implementing open-source, privacy-preserving LLMs in clinical settings for automated analysis of imaging reports, enabling accurate, fast and resource-efficient diagnostic categorization.

[498] SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models?

Aladin Djuhera,Swanand Ravindra Kadhe,Farhan Ahmed,Syed Zawad,Holger Boche,Walid Saad

Main category: cs.CY

TL;DR: 研究发现，电信领域微调的大语言模型（LLMs）可能损害模型安全性，即使使用看似无害的数据集。通过实验验证了三种安全重对齐防御方法的有效性。

Details

Motivation: 探讨电信领域微调LLMs时模型安全性的退化问题，并提出解决方案。 Method: 使用三种电信数据集和公开的Telecom LLMs，评估三种安全重对齐防御方法（SafeInstruct、SafeLoRA、SafeMERGE）。 Result: 防御方法能有效恢复模型安全性且不影响下游任务性能。 Conclusion: 强调电信LLMs需安全感知的微调和指令，为实际部署提供诊断研究和实用指南。 Abstract: Fine-tuning large language models (LLMs) for telecom tasks and datasets is a common practice to adapt general-purpose models to the telecom domain. However, little attention has been paid to how this process may compromise model safety. Recent research has shown that even benign fine-tuning can degrade the safety alignment of LLMs, causing them to respond to harmful or unethical user queries. In this paper, we investigate this issue for telecom-tuned LLMs using three representative datasets featured by the GenAINet initiative. We show that safety degradation persists even for structured and seemingly harmless datasets such as 3GPP standards and tabular records, indicating that telecom-specific data is not immune to safety erosion during fine-tuning. We further extend our analysis to publicly available Telecom LLMs trained via continual pre-training, revealing that safety alignment is often severely lacking, primarily due to the omission of safety-focused instruction tuning. To address these issues in both fine-tuned and pre-trained models, we conduct extensive experiments and evaluate three safety realignment defenses (SafeInstruct, SafeLoRA, and SafeMERGE) using established red-teaming benchmarks. The results show that, across all settings, the proposed defenses can effectively restore safety after harmful degradation without compromising downstream task performance, leading to Safe teleCOMMunication (SafeCOMM) models. In a nutshell, our work serves as a diagnostic study and practical guide for safety realignment in telecom-tuned LLMs, and emphasizes the importance of safety-aware instruction and fine-tuning for real-world deployments of Telecom LLMs.

[499] Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs

Nariman Naderi,Zahra Atf,Peter R Lewis,Aref Mahjoub far,Seyed Amir Ahmad Safavi-Naini,Ali Soroush

Main category: cs.CY

TL;DR: 研究了提示工程技术对大型语言模型（LLMs）在医学应用中准确性和置信度的影响，发现Chain-of-Thought提示提高准确性但导致过度自信，情感提示进一步增加置信度风险。

Details

Motivation: 探讨提示工程技术如何影响LLMs在医学任务中的表现，特别是准确性和置信度的校准问题。 Method: 使用波斯医学考试数据集，评估五种LLMs在不同配置（温度、提示风格、置信度标度）下的表现，采用AUC-ROC、Brier Score和ECE作为评估指标。 Result: Chain-of-Thought提示提高准确性但导致过度自信，情感提示增加置信度风险；小模型表现较差，专有模型准确性高但置信度校准不足。 Conclusion: 提示工程需同时优化准确性和不确定性校准，以适用于高风险的医学任务。 Abstract: This paper investigates how prompt engineering techniques impact both accuracy and confidence elicitation in Large Language Models (LLMs) applied to medical contexts. Using a stratified dataset of Persian board exam questions across multiple specialties, we evaluated five LLMs - GPT-4o, o3-mini, Llama-3.3-70b, Llama-3.1-8b, and DeepSeek-v3 - across 156 configurations. These configurations varied in temperature settings (0.3, 0.7, 1.0), prompt styles (Chain-of-Thought, Few-Shot, Emotional, Expert Mimicry), and confidence scales (1-10, 1-100). We used AUC-ROC, Brier Score, and Expected Calibration Error (ECE) to evaluate alignment between confidence and actual performance. Chain-of-Thought prompts improved accuracy but also led to overconfidence, highlighting the need for calibration. Emotional prompting further inflated confidence, risking poor decisions. Smaller models like Llama-3.1-8b underperformed across all metrics, while proprietary models showed higher accuracy but still lacked calibrated confidence. These results suggest prompt engineering must address both accuracy and uncertainty to be effective in high-stakes medical tasks.

[500] Optimizing Storytelling, Improving Audience Retention, and Reducing Waste in the Entertainment Industry

Andrew Cornfeld,Ashley Miller,Mercedes Mora-Figueroa,Kurt Samuels,Anthony Palomba

Main category: cs.CY

TL;DR: 该研究提出了一种结合NLP特征与传统收视数据的机器学习框架，用于提升电视节目收视率的预测准确性。

Details

Motivation: 电视网络在节目决策中面临高风险，依赖有限的历史数据预测收视率。 Method: 通过提取剧集对话的情感基调、认知复杂性和叙事结构，结合SARIMAX、滚动XGBoost和特征选择模型进行评估。 Result: NLP特征对某些剧集的预测有显著提升，同时提出了一种基于对话向量相似性的评分方法。 Conclusion: 该框架在不同类型节目中表现良好，为内容创作者和营销者提供了数据驱动的见解。 Abstract: Television networks face high financial risk when making programming decisions, often relying on limited historical data to forecast episodic viewership. This study introduces a machine learning framework that integrates natural language processing (NLP) features from over 25000 television episodes with traditional viewership data to enhance predictive accuracy. By extracting emotional tone, cognitive complexity, and narrative structure from episode dialogue, we evaluate forecasting performance using SARIMAX, rolling XGBoost, and feature selection models. While prior viewership remains a strong baseline predictor, NLP features contribute meaningful improvements for some series. We also introduce a similarity scoring method based on Euclidean distance between aggregate dialogue vectors to compare shows by content. Tested across diverse genres, including Better Call Saul and Abbott Elementary, our framework reveals genre-specific performance and offers interpretable metrics for writers, executives, and marketers seeking data-driven insight into audience behavior.

[501] Bottom-Up Perspectives on AI Governance: Insights from User Reviews of AI Products

Stefan Pasch

Main category: cs.CY

TL;DR: 本研究采用自下而上的方法，通过分析用户评论揭示AI治理的实际关注点，发现与技术与非技术领域相关的多样化主题，补充了现有规范性框架的不足。

Details

Motivation: 现有AI治理框架多为高层级原则，未能充分反映实际应用中的用户关注点，因此需通过用户视角填补这一空白。 Method: 利用BERTopic分析G2.com上超过10万条AI产品用户评论，提取与AI治理相关的潜在主题。 Result: 研究发现治理主题涵盖技术与非技术领域，如隐私、透明度、项目管理等，与现有框架有重叠但也有新发现。 Conclusion: 研究强调需结合用户实践视角，推动更具包容性和操作性的AI治理方法。 Abstract: With the growing importance of AI governance, numerous high-level frameworks and principles have been articulated by policymakers, institutions, and expert communities to guide the development and application of AI. While such frameworks offer valuable normative orientation, they may not fully capture the practical concerns of those who interact with AI systems in organizational and operational contexts. To address this gap, this study adopts a bottom-up approach to explore how governance-relevant themes are expressed in user discourse. Drawing on over 100,000 user reviews of AI products from G2.com, we apply BERTopic to extract latent themes and identify those most semantically related to AI governance. The analysis reveals a diverse set of governance-relevant topics spanning both technical and non-technical domains. These include concerns across organizational processes-such as planning, coordination, and communication-as well as stages of the AI value chain, including deployment infrastructure, data handling, and analytics. The findings show considerable overlap with institutional AI governance and ethics frameworks on issues like privacy and transparency, but also surface overlooked areas such as project management, strategy development, and customer interaction. This highlights the need for more empirically grounded, user-centered approaches to AI governance-approaches that complement normative models by capturing how governance unfolds in applied settings. By foregrounding how governance is enacted in practice, this study contributes to more inclusive and operationally grounded approaches to AI governance and digital policy.

[502] ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases

Yuchong Li,Xiaojun Zeng,Chihua Fang,Jian Yang,Lei Zhang

Main category: cs.CY

TL;DR: 研究团队建立了HPB疾病评估基准ClinBench-HBP，包含3535道选择题和337个真实诊断案例，评估了商业和开源LLM在HPB领域的表现，发现其在复杂临床案例中表现不佳。

Details

Motivation: HPB疾病的高发病率和死亡率是全球公共卫生挑战，但现有LLM评估基准缺乏HPB覆盖和真实临床案例。 Method: 系统建立HPB疾病评估基准，涵盖ICD-10定义的33个主类和465个子类，数据来自公共数据集、合成数据及临床案例。 Result: 商业LLM在医学考试题上表现良好，但在HPB诊断任务（尤其是复杂临床案例）中表现显著下降，医学LLM泛化能力有限。 Conclusion: 当前LLM在HPB领域存在关键局限，未来需改进以处理真实复杂临床诊断。基准将公开发布。 Abstract: Hepato-pancreato-biliary (HPB) disorders represent a global public health challenge due to their high morbidity and mortality. Although large language models (LLMs) have shown promising performance in general medical question-answering tasks, the current evaluation benchmarks are mostly derived from standardized examinations or manually designed questions, lacking HPB coverage and clinical cases. To address these issues, we systematically eatablish an HPB disease evaluation benchmark comprising 3,535 closed-ended multiple-choice questions and 337 open-ended real diagnosis cases, which encompasses all the 33 main categories and 465 subcategories of HPB diseases defined in the International Statistical Classification of Diseases, 10th Revision (ICD-10). The multiple-choice questions are curated from public datasets and synthesized data, and the clinical cases are collected from prestigious medical journals, case-sharing platforms, and collaborating hospitals. By evalauting commercial and open-source general and medical LLMs on our established benchmark, namely ClinBench-HBP, we find that while commercial LLMs perform competently on medical exam questions, they exhibit substantial performance degradation on HPB diagnosis tasks, especially on complex, inpatient clinical cases. Those medical LLMs also show limited generalizability to HPB diseases. Our results reveal the critical limitations of current LLMs in the domain of HPB diseases, underscoring the imperative need for future medical LLMs to handle real, complex clinical diagnostics rather than simple medical exam questions. The benchmark will be released at the homepage.

[503] Children's Voice Privacy: First Steps And Emerging Challenges

Ajinkya Kulkarni,Francisco Teixeira,Enno Hermann,Thomas Rolland,Isabel Trancoso,Mathew Magimai Doss

Main category: cs.CY

TL;DR: 研究评估了针对成人语音的匿名化技术在儿童语音上的应用效果，发现其虽能保护隐私但实用性下降明显，且自动评估方法在儿童语音质量上存在挑战。

Details

Motivation: 儿童在语音技术中代表性不足且隐私易受侵害，但针对其语音的匿名化技术研究较少，本研究旨在填补这一空白。 Method: 使用三个儿童数据集和六种匿名化方法，结合主客观评估指标进行分析。 Result: 现有成人语音匿名化系统能保护儿童隐私，但实用性显著降低；自动评估方法在儿童语音质量上效果不佳。 Conclusion: 需进一步研究儿童语音匿名化技术，改进评估方法。 Abstract: Children are one of the most under-represented groups in speech technologies, as well as one of the most vulnerable in terms of privacy. Despite this, anonymization techniques targeting this population have received little attention. In this study, we seek to bridge this gap, and establish a baseline for the use of voice anonymization techniques designed for adult speech when applied to children's voices. Such an evaluation is essential, as children's speech presents a distinct set of challenges when compared to that of adults. This study comprises three children's datasets, six anonymization methods, and objective and subjective utility metrics for evaluation. Our results show that existing systems for adults are still able to protect children's voice privacy, but suffer from much higher utility degradation. In addition, our subjective study displays the challenges of automatic evaluation methods for speech quality in children's speech, highlighting the need for further research.

Hayoung Jung,Shravika Mittal,Ananya Aatreya,Navreet Kaur,Munmun De Choudhury,Tanushree Mitra

Main category: cs.CY

TL;DR: 该研究首次大规模分析了YouTube上与阿片类药物使用障碍（OUD）相关的错误信息，提出了一种高效标注方法MythTriage，显著降低了标注成本，并揭示了错误信息的传播机制。

Details

Motivation: 在线健康信息中的错误信息可能影响公共健康政策，但目前缺乏对高关注但研究不足的OUD相关错误信息的量化研究。 Method: 研究结合临床专家验证了8种常见错误信息，并开发了MythTriage标注流程，结合轻量级模型和大语言模型（LLM）提高效率。 Result: MythTriage在减少76%标注成本的同时，达到0.86的宏F1分数，分析了2.9K搜索结果和343K推荐内容。 Conclusion: 研究为公共健康干预和平台内容审核提供了实用工具和见解，揭示了YouTube上OUD错误信息的传播模式。 Abstract: Understanding the prevalence of misinformation in health topics online can inform public health policies and interventions. However, measuring such misinformation at scale remains a challenge, particularly for high-stakes but understudied topics like opioid-use disorder (OUD)--a leading cause of death in the U.S. We present the first large-scale study of OUD-related myths on YouTube, a widely-used platform for health information. With clinical experts, we validate 8 pervasive myths and release an expert-labeled video dataset. To scale labeling, we introduce MythTriage, an efficient triage pipeline that uses a lightweight model for routine cases and defers harder ones to a high-performing, but costlier, large language model (LLM). MythTriage achieves up to 0.86 macro F1-score while estimated to reduce annotation time and financial cost by over 76% compared to experts and full LLM labeling. We analyze 2.9K search results and 343K recommendations, uncovering how myths persist on YouTube and offering actionable insights for public health and platform moderation.

[505] AIMSCheck: Leveraging LLMs for AI-Assisted Review of Modern Slavery Statements Across Jurisdictions

Adriana Eufrosina Bora,Akshatha Arodi,Duoyi Zhang,Jordan Bannister,Mirko Bronzi,Arsene Fansi Tchango,Md Abul Bashar,Richi Nayak,Kerrie Mengersen

Main category: cs.CY

TL;DR: 论文提出AIMSCheck框架和跨司法管辖区数据集（AIMS.uk和AIMS.ca），用于验证现代奴隶制法案的合规性，并展示了模型在不同司法管辖区的泛化能力。

Details

Motivation: 现代奴隶制法案要求企业披露其反奴隶制措施，但验证这些声明的复杂性和数据稀缺性带来了挑战。 Method: 与领域专家合作，构建跨司法管辖区数据集（AIMS.uk和AIMS.ca），并提出AIMSCheck框架，将合规评估任务分解为三个层次。 Result: 实验表明，基于澳大利亚数据集训练的模型在英加司法管辖区表现良好，具有广泛应用的潜力。 Conclusion: 发布数据集和AIMSCheck框架，推动AI在合规评估中的应用和相关研究。 Abstract: Modern Slavery Acts mandate that corporations disclose their efforts to combat modern slavery, aiming to enhance transparency and strengthen practices for its eradication. However, verifying these statements remains challenging due to their complex, diversified language and the sheer number of statements that must be reviewed. The development of NLP tools to assist in this task is also difficult due to a scarcity of annotated data. Furthermore, as modern slavery transparency legislation has been introduced in several countries, the generalizability of such tools across legal jurisdictions must be studied. To address these challenges, we work with domain experts to make two key contributions. First, we present AIMS.uk and AIMS.ca, newly annotated datasets from the UK and Canada to enable cross-jurisdictional evaluation. Second, we introduce AIMSCheck, an end-to-end framework for compliance validation. AIMSCheck decomposes the compliance assessment task into three levels, enhancing interpretability and practical applicability. Our experiments show that models trained on an Australian dataset generalize well across UK and Canadian jurisdictions, demonstrating the potential for broader application in compliance monitoring. We release the benchmark datasets and AIMSCheck to the public to advance AI-adoption in compliance assessment and drive further research in this field.

cs.HC [Back]

[506] Vid2Coach: Transforming How-To Videos into Task Assistants

Mina Huh,Zihui Xue,Ujjaini Das,Kumar Ashutosh,Kristen Grauman,Amy Pavel

Main category: cs.HC

TL;DR: Vid2Coach是一个基于可穿戴摄像头的系统，旨在帮助盲人和低视力人群通过视频学习技能，减少错误并提供实时反馈。

Details

Motivation: 盲人和低视力人群难以通过视频学习技能，因为视频依赖视觉比较。视觉康复治疗师的指导方式启发了系统的设计。 Method: Vid2Coach通过视频生成可访问的指令，结合非视觉替代方案，并使用智能眼镜摄像头监控用户进度，提供上下文感知的反馈。 Result: 使用Vid2Coach的盲人参与者在烹饪任务中错误减少了58.5%，并表示希望在日常中使用该系统。 Conclusion: Vid2Coach展示了AI视觉辅助的潜力，能够增强而非取代非视觉专业知识。 Abstract: People use videos to learn new recipes, exercises, and crafts. Such videos remain difficult for blind and low vision (BLV) people to follow as they rely on visual comparison. Our observations of visual rehabilitation therapists (VRTs) guiding BLV people to follow how-to videos revealed that VRTs provide both proactive and responsive support including detailed descriptions, non-visual workarounds, and progress feedback. We propose Vid2Coach, a system that transforms how-to videos into wearable camera-based assistants that provide accessible instructions and mixed-initiative feedback. From the video, Vid2Coach generates accessible instructions by augmenting narrated instructions with demonstration details and completion criteria for each step. It then uses retrieval-augmented-generation to extract relevant non-visual workarounds from BLV-specific resources. Vid2Coach then monitors user progress with a camera embedded in commercial smart glasses to provide context-aware instructions, proactive feedback, and answers to user questions. BLV participants (N=8) using Vid2Coach completed cooking tasks with 58.5\% fewer errors than when using their typical workflow and wanted to use Vid2Coach in their daily lives. Vid2Coach demonstrates an opportunity for AI visual assistance that strengthens rather than replaces non-visual expertise.

eess.AS [Back]

[507] Pushing the Limits of Beam Search Decoding for Transducer-based ASR models

Lilit Grigoryan,Vladimir Bataev,Andrei Andrusenko,Hainan Xu,Vitaly Lavrukhin,Boris Ginsburg

Main category: eess.AS

TL;DR: 本文提出了一种加速Transducer模型束搜索的通用方法，包括ALSD++和AES++两种优化算法，显著提升了推理速度和识别准确率。

Details

Motivation: 尽管Transducer模型在端到端ASR系统中表现出色，但束搜索会显著降低其推理速度，限制了实际应用。 Method: 通过批量操作、树形假设结构、新颖的空白评分以及CUDA图执行，优化了束搜索过程。 Result: 该方法将束搜索与贪婪解码的速度差距缩小至10-20%，词错误率相对提升14-30%，低资源场景下浅融合提升达11%。 Conclusion: 提出的方法显著提升了Transducer模型的实用性和性能，相关算法已开源。 Abstract: Transducer models have emerged as a promising choice for end-to-end ASR systems, offering a balanced trade-off between recognition accuracy, streaming capabilities, and inference speed in greedy decoding. However, beam search significantly slows down Transducers due to repeated evaluations of key network components, limiting practical applications. This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++. The proposed method utilizes batch operations, a tree-based hypothesis structure, novel blank scoring for enhanced shallow fusion, and CUDA graph execution for efficient GPU inference. This narrows the speed gap between beam and greedy modes to only 10-20% for the whole system, achieves 14-30% relative improvement in WER compared to greedy decoding, and improves shallow fusion for low-resource up to 11% compared to existing implementations. All the algorithms are open sourced.

[508] Confidence intervals for forced alignment boundaries using model ensembles

Matthew C. Kelley

Main category: eess.AS

TL;DR: 本文提出了一种使用神经网络集成技术为强制对齐边界生成置信区间的方法，通过多个模型的边界中位数和顺序统计量构建置信区间，并在Buckeye和TIMIT语料库上验证了其效果。

Details

Motivation: 现有的强制对齐工具通常仅提供单一的边界估计，缺乏对边界不确定性的量化。本文旨在通过神经网络集成技术为边界提供置信区间，以更好地反映对齐的不确定性。 Method: 训练了十个不同的分段分类神经网络模型，每个模型独立进行对齐，通过边界中位数和顺序统计量构建97.85%的置信区间。 Result: 在Buckeye和TIMIT语料库上，集成边界比单一模型略有改进，置信区间被整合到Praat TextGrids中，并输出为表格供进一步分析。 Conclusion: 该方法成功为强制对齐边界提供了置信区间，增强了对齐结果的可靠性，并为研究者提供了不确定性分析的工具。 Abstract: Forced alignment is a common tool to align audio with orthographic and phonetic transcriptions. Most forced alignment tools provide only a single estimate of a boundary. The present project introduces a method of deriving confidence intervals for these boundaries using a neural network ensemble technique. Ten different segment classifier neural networks were previously trained, and the alignment process is repeated with each model. The alignment ensemble is then used to place the boundary at the median of the boundaries in the ensemble, and 97.85% confidence intervals are constructed using order statistics. On the Buckeye and TIMIT corpora, the ensemble boundaries show a slight improvement over using just a single model. The confidence intervals are incorporated into Praat TextGrids using a point tier, and they are also output as a table for researchers to analyze separately as diagnostics or to incorporate uncertainty into their analyses.

[509] LinearVC: Linear transformations of self-supervised features through the lens of voice conversion

Herman Kamper,Benjamin van Niekerk,Julian Zaïdi,Marc-André Carbonneau

Main category: eess.AS

TL;DR: LinearVC是一种简单的语音转换方法，通过线性变换自监督特征实现高质量语音转换，并揭示了特征空间的结构。

Details

Motivation: 研究自监督表示的结构，探索语音转换中内容和说话人信息的分离。 Method: 使用线性变换自监督特征，并通过旋转和奇异值分解（SVD）显式分解内容和说话人信息。 Result: 仅100维的线性投影即可实现竞争性的语音转换效果。 Conclusion: LinearVC不仅实用，还深化了对自监督语音表示的理解。 Abstract: We introduce LinearVC, a simple voice conversion method that sheds light on the structure of self-supervised representations. First, we show that simple linear transformations of self-supervised features effectively convert voices. Next, we probe the geometry of the feature space by constraining the set of allowed transformations. We find that just rotating the features is sufficient for high-quality voice conversion. This suggests that content information is embedded in a low-dimensional subspace which can be linearly transformed to produce a target voice. To validate this hypothesis, we finally propose a method that explicitly factorizes content and speaker information using singular value decomposition; the resulting linear projection with a rank of just 100 gives competitive conversion results. Our work has implications for both practical voice conversion and a broader understanding of self-supervised speech representations. Samples and code: https://www.kamperh.com/linearvc/.

cs.RO [Back]

[510] GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving

Shuai Liu,Quanmin Liang,Zefeng Li,Boyang Li,Kai Huang

Main category: cs.RO

TL;DR: GaussianFusion提出了一种基于高斯分布的多传感器融合框架，用于端到端自动驾驶，通过高斯表示聚合多模态信息，提升性能与鲁棒性。

Details

Motivation: 现有方法（如注意力机制或鸟瞰图融合）存在可解释性差或计算开销大的问题，需要一种更高效且直观的融合方式。 Method: 采用2D高斯分布作为信息载体，通过物理属性和显隐特征逐步融合多传感器数据，并设计级联规划头优化轨迹预测。 Result: 在NAVSIM和Bench2Drive基准测试中验证了框架的有效性与鲁棒性。 Conclusion: GaussianFusion为多传感器融合提供了一种高效且可解释的解决方案，适用于自动驾驶系统。 Abstract: Multi-sensor fusion is crucial for improving the performance and robustness of end-to-end autonomous driving systems. Existing methods predominantly adopt either attention-based flatten fusion or bird's eye view fusion through geometric transformations. However, these approaches often suffer from limited interpretability or dense computational overhead. In this paper, we introduce GaussianFusion, a Gaussian-based multi-sensor fusion framework for end-to-end autonomous driving. Our method employs intuitive and compact Gaussian representations as intermediate carriers to aggregate information from diverse sensors. Specifically, we initialize a set of 2D Gaussians uniformly across the driving scene, where each Gaussian is parameterized by physical attributes and equipped with explicit and implicit features. These Gaussians are progressively refined by integrating multi-modal features. The explicit features capture rich semantic and spatial information about the traffic scene, while the implicit features provide complementary cues beneficial for trajectory planning. To fully exploit rich spatial and semantic information in Gaussians, we design a cascade planning head that iteratively refines trajectory predictions through interactions with Gaussians. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate the effectiveness and robustness of the proposed GaussianFusion framework. The source code will be released at https://github.com/Say2L/GaussianFusion.

[511] From Motion to Behavior: Hierarchical Modeling of Humanoid Generative Behavior Control

Jusheng Zhang,Jinzhou Tang,Sidi Liu,Mingyan Li,Sheng Zhang,Jian Wang,Keze Wang

Main category: cs.RO

TL;DR: 论文提出了一种基于认知科学的统一框架GBC，通过结合LLMs生成的行为计划来建模多样化的人类行为，解决了现有方法在行为计划方面的不足，并提出了GBC-100K数据集。

Details

Motivation: 当前研究主要关注低层次短周期运动或高层次动作规划，忽视了人类活动的层次化目标导向特性。论文旨在从人类运动生成扩展到人类行为建模。 Method: 提出Generative Behavior Control (GBC)框架，利用LLMs生成层次化行为计划，结合任务和运动规划控制人类运动。 Result: GBC在GBC-100K数据集上训练后，能生成更多样化、目的性更强的高质量人类运动，且运动时长是现有方法的10倍。 Conclusion: GBC为人类运动行为建模的未来研究奠定了基础，数据集和源代码将公开。 Abstract: Human motion generative modeling or synthesis aims to characterize complicated human motions of daily activities in diverse real-world environments. However, current research predominantly focuses on either low-level, short-period motions or high-level action planning, without taking into account the hierarchical goal-oriented nature of human activities. In this work, we take a step forward from human motion generation to human behavior modeling, which is inspired by cognitive science. We present a unified framework, dubbed Generative Behavior Control (GBC), to model diverse human motions driven by various high-level intentions by aligning motions with hierarchical behavior plans generated by large language models (LLMs). Our insight is that human motions can be jointly controlled by task and motion planning in robotics, but guided by LLMs to achieve improved motion diversity and physical fidelity. Meanwhile, to overcome the limitations of existing benchmarks, i.e., lack of behavioral plans, we propose GBC-100K dataset annotated with a hierarchical granularity of semantic and motion plans driven by target goals. Our experiments demonstrate that GBC can generate more diverse and purposeful high-quality human motions with 10* longer horizons compared with existing methods when trained on GBC-100K, laying a foundation for future research on behavioral modeling of human motions. Our dataset and source code will be made publicly available.

[512] Understanding while Exploring: Semantics-driven Active Mapping

Liyan Chen,Huangying Zhan,Hairong Yin,Yi Xu,Philippos Mordohai

Main category: cs.RO

TL;DR: ActiveSGM是一种主动语义映射框架，通过预测潜在观测的信息量来提升机器人探索效率。

Details

Motivation: 在未知环境中实现高效机器人自主性需要主动探索和对几何与语义的精确理解。 Method: 基于3D高斯散射映射，结合语义和几何不确定性量化及稀疏语义表示，指导机器人选择最优视角。 Result: 在Replica和Matterport3D数据集上验证了ActiveSGM在提升地图完整性、准确性和鲁棒性方面的有效性。 Conclusion: ActiveSGM支持更自适应的场景探索，显著提升了主动语义映射的性能。 Abstract: Effective robotic autonomy in unknown environments demands proactive exploration and precise understanding of both geometry and semantics. In this paper, we propose ActiveSGM, an active semantic mapping framework designed to predict the informativeness of potential observations before execution. Built upon a 3D Gaussian Splatting (3DGS) mapping backbone, our approach employs semantic and geometric uncertainty quantification, coupled with a sparse semantic representation, to guide exploration. By enabling robots to strategically select the most beneficial viewpoints, ActiveSGM efficiently enhances mapping completeness, accuracy, and robustness to noisy semantic data, ultimately supporting more adaptive scene exploration. Our experiments on the Replica and Matterport3D datasets highlight the effectiveness of ActiveSGM in active semantic mapping tasks.

[513] Using Diffusion Ensembles to Estimate Uncertainty for End-to-End Autonomous Driving

Florian Wintel,Sigmund H. Høeg,Gabriel Kiss,Frank Lindseth

Main category: cs.RO

TL;DR: EnDfuser是一种端到端自动驾驶系统，利用扩散模型作为轨迹规划器，通过集成扩散生成候选轨迹分布，提升驾驶决策的安全性。

Details

Motivation: 现有自动驾驶系统在规划中未充分考虑不确定性，或使用难以泛化的专用表示方法。 Method: 结合注意力池化和轨迹规划，使用扩散变换器模块处理感知信息，生成128条候选轨迹。 Result: 在CARLA的Longest6基准测试中取得70.1的驾驶分数，推理速度影响较小。 Conclusion: 集成扩散模型可替代传统点估计轨迹规划模块，通过建模后验轨迹分布的不确定性提升安全性。 Abstract: End-to-end planning systems for autonomous driving are improving rapidly, especially in closed-loop simulation environments like CARLA. Many such driving systems either do not consider uncertainty as part of the plan itself, or obtain it by using specialized representations that do not generalize. In this paper, we propose EnDfuser, an end-to-end driving system that uses a diffusion model as the trajectory planner. EnDfuser effectively leverages complex perception information like fused camera and LiDAR features, through combining attention pooling and trajectory planning into a single diffusion transformer module. Instead of committing to a single plan, EnDfuser produces a distribution of candidate trajectories (128 for our case) from a single perception frame through ensemble diffusion. By observing the full set of candidate trajectories, EnDfuser provides interpretability for uncertain, multi-modal future trajectory spaces, where there are multiple plausible options. EnDfuser achieves a competitive driving score of 70.1 on the Longest6 benchmark in CARLA with minimal concessions on inference speed. Our findings suggest that ensemble diffusion, used as a drop-in replacement for traditional point-estimate trajectory planning modules, can help improve the safety of driving decisions by modeling the uncertainty of the posterior trajectory distribution.

[514] OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation

Ishika Singh,Ankit Goyal,Stan Birchfield,Dieter Fox,Animesh Garg,Valts Blukis

Main category: cs.RO

TL;DR: OG-VLA结合了视觉语言动作模型（VLA）的泛化能力和3D感知策略的鲁棒性，通过多视角RGBD观测和自然语言指令生成机器人动作，显著提升了未见场景和指令的泛化性能。

Details

Motivation: 解决3D感知策略在泛化性上的不足以及VLA模型对相机和机器人姿态变化的敏感性，结合两者的优势以提升机器人操作的泛化性和鲁棒性。 Method: 通过将多视角观测投影为点云并渲染为标准正交视图，结合视觉主干网络、大型语言模型（LLM）和图像扩散模型生成末端执行器的目标位姿。 Result: 在Arnold和Colosseum基准测试中，OG-VLA在未见环境中实现了40%以上的相对性能提升，同时在已知场景中保持鲁棒性。 Conclusion: OG-VLA通过结合语言和视觉先验知识，显著提升了机器人操作的泛化能力，并在实际应用中展示了快速适应能力。 Abstract: We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and multi-view RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA projects input observations from diverse views into a point cloud which is then rendered from canonical orthographic views, ensuring input view invariance and consistency between input and output spaces. These canonical views are processed with a vision backbone, a Large Language Model (LLM), and an image diffusion model to generate images that encode the next position and orientation of the end-effector on the input scene. Evaluations on the Arnold and Colosseum benchmarks demonstrate state-of-the-art generalization to unseen environments, with over 40% relative improvements while maintaining robust performance in seen settings. We also show real-world adaption in 3 to 5 demonstrations along with strong generalization. Videos and resources at https://og-vla.github.io/

[515] Sparse Imagination for Efficient Visual World Model Planning

Junha Chun,Youngjoon Jeong,Taesup Kim

Main category: cs.RO

TL;DR: 提出了一种稀疏想象的视觉世界模型规划方法，通过减少前向预测中的令牌数量提升计算效率，适用于资源受限的实时决策场景。

Details

Motivation: 世界模型在复杂环境中的决策能力强大，但高精度预测需要大量计算资源，尤其在机器人领域资源受限时成为瓶颈。 Method: 基于稀疏训练的视觉世界模型，采用随机分组注意力策略的Transformer，动态调整处理的令牌数量以适应计算资源。 Result: 实验表明，稀疏想象在保持任务性能的同时显著提升推理效率。 Conclusion: 该方法为世界模型在实时决策场景中的部署提供了可行路径。 Abstract: World model based planning has significantly improved decision-making in complex environments by enabling agents to simulate future states and make informed choices. However, ensuring the prediction accuracy of world models often demands substantial computational resources, posing a major challenge for real-time applications. This computational burden is particularly restrictive in robotics, where resources are severely constrained. To address this limitation, we propose a Sparse Imagination for Efficient Visual World Model Planning, which enhances computational efficiency by reducing the number of tokens processed during forward prediction. Our method leverages a sparsely trained vision-based world model based on transformers with randomized grouped attention strategy, allowing the model to adaptively adjust the number of tokens processed based on the computational resource. By enabling sparse imagination (rollout), our approach significantly accelerates planning while maintaining high control fidelity. Experimental results demonstrate that sparse imagination preserves task performance while dramatically improving inference efficiency, paving the way for the deployment of world models in real-time decision-making scenarios.

Rafael Flor-Rodríguez,Carlos Gutiérrez-Álvarez,Francisco Javier Acevedo-Rodríguez,Sergio Lafuente-Arroyo,Roberto J. López-Sastre

Main category: cs.RO

TL;DR: 论文提出SEMNAV，一种利用语义分割作为主要视觉输入的方法，以提升视觉语义导航（VSN）的泛化能力。

Details

Motivation: 现有VSN模型依赖虚拟场景的RGB数据，泛化到真实环境时存在领域适应问题。 Method: 通过引入语义分割作为视觉输入，结合新数据集SEMNAV，训练模型以增强感知和决策能力。 Result: 在模拟和真实环境中，SEMNAV表现优于现有方法，成功率高且能有效缩小模拟与现实的差距。 Conclusion: SEMNAV为VSN提供了一种高效解决方案，适用于实际机器人应用。 Abstract: Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, in this work, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, both in simulated and real world settings. We also introduce a newly curated dataset, i.e. the SEMNAV dataset, designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively in both simulated environments and with real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment, using the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. We release SEMNAV dataset, code and trained models at https://github.com/gramuah/semnav

[517] FreqPolicy: Frequency Autoregressive Visuomotor Policy with Continuous Tokens

Yiming Zhong,Yumeng Liu,Chuyang Xiao,Zemin Yang,Youzhuo Wang,Yufei Zhu,Ye Shi,Yujing Sun,Xinge Zhu,Yuexin Ma

Main category: cs.RO

TL;DR: 提出了一种基于频域表示的新型视觉运动策略学习方法，通过分层建模频率组件和连续潜在表示，提高了机器人操作的精度和效率。

Details

Motivation: 现有方法在动作表示和网络架构上存在局限性，频域表示能更好地捕捉动作的结构化特性，且不同复杂度的任务需要不同频率带的建模精度。 Method: 采用频域自回归框架，分层建模频率组件，并引入连续潜在表示以保持动作空间的平滑性和连续性。 Result: 在多种2D和3D机器人操作基准测试中，该方法在精度和效率上均优于现有方法。 Conclusion: 频域自回归框架与连续潜在表示的结合为通用机器人操作提供了潜力。 Abstract: Learning effective visuomotor policies for robotic manipulation is challenging, as it requires generating precise actions while maintaining computational efficiency. Existing methods remain unsatisfactory due to inherent limitations in the essential action representation and the basic network architectures. We observe that representing actions in the frequency domain captures the structured nature of motion more effectively: low-frequency components reflect global movement patterns, while high-frequency components encode fine local details. Additionally, robotic manipulation tasks of varying complexity demand different levels of modeling precision across these frequency bands. Motivated by this, we propose a novel paradigm for visuomotor policy learning that progressively models hierarchical frequency components. To further enhance precision, we introduce continuous latent representations that maintain smoothness and continuity in the action space. Extensive experiments across diverse 2D and 3D robotic manipulation benchmarks demonstrate that our approach outperforms existing methods in both accuracy and efficiency, showcasing the potential of a frequency-domain autoregressive framework with continuous tokens for generalized robotic manipulation.

[518] WoMAP: World Models For Embodied Open-Vocabulary Object Localization

Tenny Yin,Zhiting Mei,Tao Sun,Lihan Zha,Emily Zhou,Jeremy Bao,Miyu Yamane,Ola Shorinwa,Anirudha Majumdar

Main category: cs.RO

TL;DR: WoMAP是一种用于开放词汇对象定位的策略，通过高斯散射和世界模型实现高效探索和物理动作生成，显著优于现有方法。

Details

Motivation: 现有方法在泛化性和物理动作生成方面存在不足，WoMAP旨在解决这些问题。 Method: 使用高斯散射实现数据生成，结合开放词汇对象检测器和潜在世界模型进行训练。 Result: 在零样本任务中，WoMAP的成功率比基线方法高9倍和2倍，并展示了强泛化能力。 Conclusion: WoMAP在对象定位任务中表现出色，具有广泛的应用潜力。 Abstract: Language-instructed active object localization is a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art approaches either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that: (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense rewards signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and rewards prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance in a broad range of zero-shot object localization tasks, with more than 9x and 2x higher success rates compared to VLM and diffusion policy baselines, respectively. Further, we show that WoMAP achieves strong generalization and sim-to-real transfer on a TidyBot.

Jiajun Jiang,Yiming Zhu,Zirui Wu,Jie Song

Main category: cs.RO

TL;DR: DualMap是一个在线开放词汇映射系统，通过自然语言查询帮助机器人理解和导航动态变化的环境。

Details

Motivation: 为满足现实世界机器人导航应用的需求，设计一个高效的语义映射系统，能够适应环境变化。 Method: 采用混合分割前端和对象级状态检查，避免昂贵的3D对象合并，同时使用双地图表示（全局抽象地图和局部具体地图）管理动态变化。 Result: 在仿真和实际场景中表现出色，实现了3D开放词汇分割、高效场景映射和在线语言引导导航的先进性能。 Conclusion: DualMap通过创新的双地图表示和高效分割方法，显著提升了动态环境中的机器人导航能力。 Abstract: We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries. Designed for efficient semantic mapping and adaptability to changing environments, DualMap meets the essential requirements for real-world robot navigation applications. Our proposed hybrid segmentation frontend and object-level status check eliminate the costly 3D object merging required by prior methods, enabling efficient online scene mapping. The dual-map representation combines a global abstract map for high-level candidate selection with a local concrete map for precise goal-reaching, effectively managing and updating dynamic changes in the environment. Through extensive experiments in both simulation and real-world scenarios, we demonstrate state-of-the-art performance in 3D open-vocabulary segmentation, efficient scene mapping, and online language-guided navigation.

[520] RoboMoRe: LLM-based Robot Co-design via Joint Optimization of Morphology and Reward

Jiawei Fang,Yuxuan Sun,Chengtian Ma,Qiuyu Lu,Lining Yao

Main category: cs.RO

TL;DR: RoboMoRe是一个基于大语言模型（LLM）的机器人协同设计框架，通过双阶段优化（粗优化和细优化）联合优化形态和奖励函数，显著优于人工设计和其他方法。

Details

Motivation: 传统机器人协同设计因固定奖励函数易收敛至次优设计，无法探索适合不同形态的多样化运动模式。 Method: RoboMoRe采用双阶段优化：粗优化阶段通过LLM生成多样且高质量的形态-奖励对；细优化阶段通过交替更新奖励和形态梯度迭代优化。 Result: 在八项任务中，RoboMoRe无需任务特定提示或预定义模板，显著优于人工设计和其他方法。 Conclusion: RoboMoRe通过LLM驱动的奖励和形态协同优化，解决了机器人协同设计的局限性，实现了高效设计和多样化运动行为。 Abstract: Robot co-design, jointly optimizing morphology and control policy, remains a longstanding challenge in the robotics community, where many promising robots have been developed. However, a key limitation lies in its tendency to converge to sub-optimal designs due to the use of fixed reward functions, which fail to explore the diverse motion modes suitable for different morphologies. Here we propose RoboMoRe, a large language model (LLM)-driven framework that integrates morphology and reward shaping for co-optimization within the robot co-design loop. RoboMoRe performs a dual-stage optimization: in the coarse optimization stage, an LLM-based diversity reflection mechanism generates both diverse and high-quality morphology-reward pairs and efficiently explores their distribution. In the fine optimization stage, top candidates are iteratively refined through alternating LLM-guided reward and morphology gradient updates. RoboMoRe can optimize both efficient robot morphologies and their suited motion behaviors through reward shaping. Results demonstrate that without any task-specific prompting or predefined reward/morphology templates, RoboMoRe significantly outperforms human-engineered designs and competing methods across eight different tasks.

astro-ph.IM [Back]

[521] Applying Vision Transformers on Spectral Analysis of Astronomical Objects

Luis Felipe Strano Moraes,Ignacio Becker,Pavlos Protopapas,Guillermo Cabrera-Vives

Main category: astro-ph.IM

TL;DR: 将预训练的视觉Transformer（ViT）应用于天文光谱数据分析，通过将一维光谱转换为二维图像表示，ViT能够捕捉局部和全局特征。在SDSS和LAMOST数据上微调后，模型在恒星分类和红移估计任务中表现优异。

Details

Motivation: 探索预训练视觉模型在天文光谱分析中的潜力，解决传统方法在捕捉光谱特征上的局限性。 Method: 将一维光谱转换为二维图像，利用预训练的ViT模型进行微调，应用于SDSS和LAMOST数据。 Result: 在恒星分类和红移估计任务中表现优于支持向量机和随机森林，与AstroCLIP相当。 Conclusion: 预训练视觉模型在天文光谱分析中具有高效性和可扩展性，首次成功应用于真实光谱数据。 Abstract: We apply pre-trained Vision Transformers (ViTs), originally developed for image recognition, to the analysis of astronomical spectral data. By converting traditional one-dimensional spectra into two-dimensional image representations, we enable ViTs to capture both local and global spectral features through spatial self-attention. We fine-tune a ViT pretrained on ImageNet using millions of spectra from the SDSS and LAMOST surveys, represented as spectral plots. Our model is evaluated on key tasks including stellar object classification and redshift ($z$) estimation, where it demonstrates strong performance and scalability. We achieve classification accuracy higher than Support Vector Machines and Random Forests, and attain $R^2$ values comparable to AstroCLIP's spectrum encoder, even when generalizing across diverse object types. These results demonstrate the effectiveness of using pretrained vision models for spectroscopic data analysis. To our knowledge, this is the first application of ViTs to large-scale, which also leverages real spectroscopic data and does not rely on synthetic inputs.

Table of Contents

cs.CV [Back]

[1] EgoVIS@CVPR: What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning

[2] Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

[3] Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation

[4] Detection of Endangered Deer Species Using UAV Imagery: A Comparative Study Between Efficient Deep Learning Approaches

[5] Efficient Endangered Deer Species Monitoring with UAV Aerial Imagery and Deep Learning

[6] FastCAR: Fast Classification And Regression for Task Consolidation in Multi-Task Learning to Model a Continuous Property Variable of Detected Object Class

[7] Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

[8] ZeShot-VQA: Zero-Shot Visual Question Answering Framework with Answer Mapping for Natural Disaster Damage Assessment

[9] Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning

[10] Improving Optical Flow and Stereo Depth Estimation by Leveraging Uncertainty-Based Learning Difficulties

[11] Towards Effective and Efficient Adversarial Defense with Diffusion Models for Robust Visual Tracking

[12] Latent Guidance in Diffusion Models for Perceptual Evaluations

[13] Test-time Vocabulary Adaptation for Language-driven Object Detection

[14] Feature Fusion and Knowledge-Distilled Multi-Modal Multi-Target Detection

[15] Sequence-Based Identification of First-Person Camera Wearers in Third-Person Views

[16] iDPA: Instance Decoupled Prompt Attention for Incremental Medical Object Detection

[17] Latent Wavelet Diffusion: Enabling 4K Image Synthesis for Free

[18] Efficient 3D Brain Tumor Segmentation with Axial-Coronal-Sagittal Embedding

[19] Performance Analysis of Few-Shot Learning Approaches for Bangla Handwritten Character and Digit Recognition

[20] BAGNet: A Boundary-Aware Graph Attention Network for 3D Point Cloud Semantic Segmentation

[21] UNSURF: Uncertainty Quantification for Cortical Surface Reconstruction of Clinical Brain MRIs

[22] SSAM: Self-Supervised Association Modeling for Test-Time Adaption

[23] SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation

[24] 3D Trajectory Reconstruction of Moving Points Based on Asynchronous Cameras

[25] ViVo: A Dataset for Volumetric VideoReconstruction and Compression

[26] SEED: A Benchmark Dataset for Sequential Facial Attribute Editing with Diffusion Models

[27] CReFT-CAD: Boosting Orthographic Projection Reasoning for CAD via Reinforcement Fine-Tuning

[28] Event-based multi-view photogrammetry for high-dynamic, high-velocity target measurement

[29] MR2US-Pro: Prostate MR to Ultrasound Image Translation and Registration Based on Diffusion Models

[30] Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

[31] XYZ-IBD: High-precision Bin-picking Dataset for Object 6D Pose Estimation Capturing Real-world Industrial Complexity

[32] SatDreamer360: Geometry Consistent Street-View Video Generation from Satellite Imagery

[33] ABCDEFGH: An Adaptation-Based Convolutional Neural Network-CycleGAN Disease-Courses Evolution Framework Using Generative Models in Health Education

[34] Parallel Rescaling: Rebalancing Consistency Guidance for Personalized Diffusion Models

[35] Long-Tailed Visual Recognition via Permutation-Invariant Head-to-Tail Feature Fusion

[36] Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

[37] Video Signature: In-generation Watermarking for Latent Video Diffusion Models

[38] Poster: Adapting Pretrained Vision Transformers with LoRA Against Attack Vectors

[39] Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis

[40] CineMA: A Foundation Model for Cine Cardiac MRI

[41] Concept-Centric Token Interpretation for Vector-Quantized Generative Models

[42] Fovea Stacking: Imaging with Dynamic Localized Aberration Correction

[43] From Local Cues to Global Percepts: Emergent Gestalt Organization in Self-Supervised Vision Models

[44] Common Inpainted Objects In-N-Out of Context

[45] Involution-Infused DenseNet with Two-Step Compression for Resource-Efficient Plant Disease Classification

[46] ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary

[47] EcoLens: Leveraging Multi-Objective Bayesian Optimization for Energy-Efficient Video Processing on Edge Devices

[48] Depth-Aware Scoring and Hierarchical Alignment for Multiple Object Tracking

[49] Aiding Medical Diagnosis through Image Synthesis and Classification

[50] HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models

[51] TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning

[52] L3A: Label-Augmented Analytic Adaptation for Multi-Label Class Incremental Learning

[53] QuantFace: Low-Bit Post-Training Quantization for One-Step Diffusion Face Restoration

[54] Improving Keystep Recognition in Ego-Video via Dexterous Focus

[55] SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers

[56] Advancing from Automated to Autonomous Beamline by Leveraging Computer Vision

[57] Towards Predicting Any Human Trajectory In Context

[58] Breaking Latent Prior Bias in Detectors for Generalizable AIGC Image Detection

[59] Uneven Event Modeling for Partially Relevant Video Retrieval

[60] Leveraging CLIP Encoder for Multimodal Emotion Recognition

[61] Towards Edge-Based Idle State Detection in Construction Machinery Using Surveillance Cameras

[62] DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation

[63] 3D Skeleton-Based Action Recognition: A Review

[64] Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times

[65] Deformable registration and generative modelling of aortic anatomies by auto-decoders and neural ODEs

[66] TIGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction

[67] Continual-MEGA: A Large-scale Benchmark for Generalizable Continual Anomaly Detection

[68] Camera Trajectory Generation: A Comprehensive Survey of Methods, Metrics, and Future Directions

[69] CAPAA: Classifier-Agnostic Projector-Based Adversarial Attack

[70] IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

[71] GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs

[72] Quotient Network -- A Network Similar to ResNet but Learning Quotients

[73] FlexSelect: Flexible Token Selection for Efficient Long Video Understanding

[74] Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

[75] Pseudo-Labeling Driven Refinement of Benchmark Object Detection Datasets via Analysis of Learning Patterns

[76] Motion-Aware Concept Alignment for Consistent Video Editing

[77] AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

[78] Modality Translation and Registration of MR and Ultrasound Images Using Diffusion Models