cs.CV [Total: 258]
cs.GR [Total: 8]
cs.CL [Total: 254]
cs.LG [Total: 48]
eess.IV [Total: 8]
cs.HC [Total: 1]
cs.DL [Total: 1]
cs.IT [Total: 1]
cs.RO [Total: 4]
math.AG [Total: 1]
cs.IR [Total: 6]
astro-ph.GA [Total: 1]
stat.AP [Total: 1]
eess.SP [Total: 3]
cs.SE [Total: 5]
cs.DB [Total: 3]
q-bio.NC [Total: 1]
cs.CY [Total: 3]
cs.DC [Total: 1]
cs.CR [Total: 2]
cs.MA [Total: 1]
eess.AS [Total: 5]
cs.SD [Total: 2]
cs.AI [Total: 31]

cs.CV

[1] InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

Zifu Wan,Yaqi Xie,Ce Zhang,Zhiqiu Lin,Zihan Wang,Simon Stepputtis,Deva Ramanan,Katia Sycara

Main category: cs.CV

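TL;DR: The paper introduces InstructPart, a new benchmark dataset for evaluating how well models understand object parts and their functions, and shows the limitations of current vision-language models on this task.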

Details

Motivation: Existing large multimodal foundation models typically treat objects as indivisible wholes, ignoring their parts and the functions of those parts, which limits performance in task execution. Method: Build the InstructPart dataset, containing hand-labeled part segmentation annotations and task-oriented instructions, and design a simple baseline model that is fine-tuned on it. Result: Experiments show that task-oriented part segmentation remains challenging even for advanced vision-language models, while the baseline achieves a twofold performance improvement through fine-tuning. Conclusion: The InstructPart dataset and benchmark aim to advance research on task-oriented part segmentation and to improve the applicability of vision-language models in robotics, virtual reality, and related fields. Abstract: Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.

[2] Sampling Strategies for Efficient Training of Deep Learning Object Detection Algorithms

Gefei Shen,Yung-Hong Sun,Yu Hen Hu,Hongrui Jiang

Main category: cs.CV

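TL;DR: Investigates two sampling strategies for improving the training efficiency of deep learning object detection models, under the assumption of Lipschitz continuity.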

Details

Motivation: Improve the training efficiency of deep learning object detection models and reduce the need for manually labeled samples. Method: Two sampling strategies are proposed: uniform sampling, which samples evenly across the state space, and frame difference sampling, which exploits redundancy between video frames. Result: Experiments show that both strategies produce high-quality training datasets while requiring relatively few manually labeled samples. Conclusion: The proposed sampling strategies reduce annotation cost while maintaining good training performance. Abstract: Two sampling strategies are investigated to enhance efficiency in training a deep learning object detection model. These sampling strategies are employed under the assumption of Lipschitz continuity of deep learning models. The first strategy is uniform sampling, which seeks to obtain samples evenly yet randomly through the state space of the object dynamics. The second strategy, frame difference sampling, is developed to explore the temporal redundancy among successive frames in a video. Experimental results indicate that the proposed sampling strategies provide a dataset that yields good training performance while requiring relatively few manually labelled samples.
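The abstract does not spell out the selection rule behind frame difference sampling, so the following is only a minimal sketch under an assumed criterion: keep a frame when its mean absolute pixel difference from the last kept frame exceeds a threshold. The names `mean_abs_diff`, `frame_difference_sample`, and `threshold` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence

# A grayscale frame as rows of pixel intensities (assumed representation).
Frame = Sequence[Sequence[float]]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count if count else 0.0


def frame_difference_sample(frames: List[Frame], threshold: float) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame.

    Assumed rule (not from the paper): a frame is kept only when its mean
    absolute difference from the most recently kept frame exceeds `threshold`.
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame as the initial reference
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept
```

Near-duplicate frames are skipped, so only the kept indices would need manual labels; the threshold trades label count against coverage of the video.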

[3] CTRL-GS: Cascaded Temporal Residue Learning for 4D Gaussian Splatting

Karly Hou,Wanhua Li,Hanspeter Pfister

Main category: cs.CV

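TL;DR: Proposes a novel-view synthesis method for dynamic scenes based on 4D Gaussian Splatting that uses residual learning and hierarchical decomposition to improve rendering quality in complex scenes.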

Details

Motivation: Address the performance degradation of existing methods on dynamic scenes with large movements, occlusions, and fine details. Method: Uses a hierarchical decomposition (video-segment-frame) with residual learning, with segments adjusted dynamically by optical flow. Result: Achieves state-of-the-art visual quality and real-time rendering on several datasets, with the largest gains on complex scenes. Conclusion: The method performs well for novel-view synthesis of dynamic scenes and outperforms existing techniques particularly on complex scenes. Abstract: Recently, Gaussian Splatting methods have emerged as a desirable substitute for prior Radiance Field methods for novel-view synthesis of scenes captured with multi-view images or videos. In this work, we propose a novel extension to 4D Gaussian Splatting for dynamic scenes. Drawing on ideas from residual learning, we hierarchically decompose the dynamic scene into a "video-segment-frame" structure, with segments dynamically adjusted by optical flow. Then, instead of directly predicting the time-dependent signals, we model the signal as the sum of video-constant values, segment-constant values, and frame-specific residuals, as inspired by the success of residual learning. This approach allows more flexible models that adapt to highly variable scenes. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets, with the greatest improvements on complex scenes with large movements, occlusions, and fine details, where current methods degrade most.
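The "video-segment-frame" decomposition described in the abstract can be illustrated for a single scalar attribute. This is a toy sketch, not the authors' implementation: the segment boundaries are passed in by hand (in the paper they are adjusted by optical flow), and `compose_signal` and its parameters are hypothetical names.

```python
from bisect import bisect_right
from typing import List, Sequence


def compose_signal(
    video_value: float,
    segment_starts: Sequence[int],
    segment_values: Sequence[float],
    frame_residuals: Sequence[float],
) -> List[float]:
    """Rebuild a per-frame signal as video constant + segment constant + residual.

    `segment_starts` holds the first frame index of each segment, sorted and
    beginning with 0; `segment_values[i]` is the constant offset of segment i.
    """
    signal = []
    for t, residual in enumerate(frame_residuals):
        # Index of the segment that contains frame t.
        seg = bisect_right(segment_starts, t) - 1
        signal.append(video_value + segment_values[seg] + residual)
    return signal
```

With `video_value=1.0`, segments starting at frames 0 and 3 with offsets 0.5 and -0.5, and small per-frame residuals, each frame only has to encode its deviation from the coarser levels.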

[4] COLORA: Efficient Fine-Tuning for Convolutional Models with a Study Case on Optical Coherence Tomography Image Classification

Mariano Rivera,Angello Hoyos

Main category: cs.CV

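TL;DR: CoLoRA is a low-rank adaptation method for fine-tuning CNNs that significantly improves efficiency, reduces parameters, and speeds up training, outperforming traditional methods on the OCTMNIST dataset.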

Details

Motivation: Address the inefficiency of current CNN fine-tuning methods with a more efficient parameter adaptation strategy. Method: Extends the Low-Rank Adaptation (LoRA) technique to convolutional architectures, and develops and evaluates models on a pre-trained CNN backbone. Result: On OCTMNIST, CoLoRA improves accuracy by nearly 1%, with performance comparable to state-of-the-art models. Conclusion: CoLoRA is an efficient, stable CNN fine-tuning method suitable for medical image classification. Abstract: We introduce the Convolutional Low-Rank Adaptation (CoLoRA) method, designed explicitly to overcome the inefficiencies found in current CNN fine-tuning methods. CoLoRA can be seen as a natural extension of the Low-Rank Adaptation (LoRA) technique to convolutional architectures. We demonstrate the capabilities of our method by developing and evaluating models using the widely adopted CNN backbone pre-trained on ImageNet. We observed that this strategy results in a stable and accurate coarse-tuning procedure. Moreover, this strategy is computationally efficient and significantly reduces the number of parameters required for fine-tuning compared to traditional methods. Furthermore, our method substantially improves the speed and stability of training. Our case study focuses on classifying retinal diseases from optical coherence tomography (OCT) images, specifically using the OCTMNIST dataset. Experimental results demonstrate that a CNN backbone fine-tuned with CoLoRA improves accuracy by nearly 1%. Such performance is comparable to the Vision Transformer, State-space discrete, and Kolmogorov-Arnold network models.
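The abstract does not give CoLoRA's exact factorization. As a rough illustration of why a low-rank update reduces trainable parameters, the sketch below applies the standard LoRA update, W' = W + scale * B A, to a convolution kernel flattened to shape (out_channels, in_channels * k * k). Treating the kernel this way is an assumption, and all function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def lora_param_counts(out_ch: int, in_ch: int, k: int, rank: int) -> Tuple[int, int]:
    """Trainable parameters: full kernel vs. low-rank factors B (out x r), A (r x in*k*k)."""
    full = out_ch * in_ch * k * k
    low_rank = rank * (out_ch + in_ch * k * k)
    return full, low_rank


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x n) and b (n x p)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def adapted_kernel(w_flat: Matrix, b_mat: Matrix, a_mat: Matrix, scale: float) -> Matrix:
    """Frozen flattened kernel plus the scaled low-rank update B @ A."""
    delta = matmul(b_mat, a_mat)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_flat, delta)
    ]
```

For a 3x3 convolution with 64 input and 64 output channels, `lora_param_counts(64, 64, 3, 4)` gives 36864 parameters for full fine-tuning against 2560 for the rank-4 factors.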

[5] DART$^3$: Leveraging Distance for Test Time Adaptation in Person Re-Identification

Rajarshi Bhattacharya,Shakeeb Murtaza,Christian Desrosiers,Jose Dolz,Maguelonne Heritier,Eric Granger

Main category: cs.CV

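TL;DR: DART$^3$ is a test-time adaptation framework designed to mitigate camera bias in person re-identification (ReID), optimizing performance with a distance-aware objective.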

Details

Motivation: Existing test-time adaptation methods mainly target classification tasks and cannot effectively handle camera bias in ReID. Method: DART$^3$ uses a distance-based objective that exploits the correlation between nearest-neighbor distance and prediction error, and requires no source data or model modification. Result: On multiple ReID benchmarks, DART$^3$ and its lightweight variant both outperform existing methods. Conclusion: DART$^3$ is an effective online learning method that significantly mitigates the negative effects of camera bias. Abstract: Person re-identification (ReID) models are known to suffer from camera bias, where learned representations cluster according to camera viewpoints rather than identity, leading to significant performance degradation under (inter-camera) domain shifts in real-world surveillance systems when new cameras are added to camera networks. State-of-the-art test-time adaptation (TTA) methods, largely designed for classification tasks, rely on classification entropy-based objectives that fail to generalize well to ReID, thus making them unsuitable for tackling camera bias. In this paper, we introduce DART$^3$, a TTA framework specifically designed to mitigate camera-induced domain shifts in person ReID. DART$^3$ (Distance-Aware Retrieval Tuning at Test Time) leverages a distance-based objective that aligns better with image retrieval tasks like ReID by exploiting the correlation between nearest-neighbor distance and prediction error. Unlike prior ReID-specific domain adaptation methods, DART$^3$ requires no source data, architectural modifications, or retraining, and can be deployed in both fully black-box and hybrid settings. Empirical evaluations on multiple ReID benchmarks indicate that DART$^3$ and DART$^3$ LITE, a lightweight alternative to the approach, consistently outperform state-of-the-art TTA baselines, making for a viable option to online learning to mitigate the adverse effects of camera bias.

[6] Pose Splatter: A 3D Gaussian Splatting Model for Quantifying Animal Pose and Appearance

Jack Goffinet,Youngjo Min,Carlo Tomasi,David E. Carlson

Main category: cs.CV

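TL;DR: Pose Splatter is a framework for modeling 3D animal pose and appearance that needs no prior geometric knowledge, per-frame optimization, or manual annotation, built on shape carving and 3D Gaussian splatting.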

Details

Motivation: Current 3D pose estimation techniques suffer from limited detail, time-consuming annotation, and expensive per-frame optimization, which hinders the study of subtle movements and large-scale analysis. Method: Models animal pose and appearance using shape carving and 3D Gaussian splatting, and proposes a rotation-invariant visual embedding technique as a replacement for 3D keypoint data. Result: Pose Splatter's accuracy is validated on multiple animal datasets; it captures subtle pose variations and provides better low-dimensional embeddings. Conclusion: Pose Splatter removes the annotation and per-frame optimization bottlenecks, supports large-scale behavioral analysis, and provides a high-resolution tool for studying genotype, neural activity, and micro-behavior. Abstract: Accurate and scalable quantification of animal pose and appearance is crucial for studying behavior. Current 3D pose estimation techniques, such as keypoint- and mesh-based techniques, often face challenges including limited representational detail, labor-intensive annotation requirements, and expensive per-frame optimization. These limitations hinder the study of subtle movements and can make large-scale analyses impractical. We propose Pose Splatter, a novel framework leveraging shape carving and 3D Gaussian splatting to model the complete pose and appearance of laboratory animals without prior knowledge of animal geometry, per-frame optimization, or manual annotations. We also propose a novel rotation-invariant visual embedding technique for encoding pose and appearance, designed to be a plug-in replacement for 3D keypoint data in downstream behavioral analyses. Experiments on datasets of mice, rats, and zebra finches show Pose Splatter learns accurate 3D animal geometries. Notably, Pose Splatter represents subtle variations in pose, provides better low-dimensional pose embeddings over state-of-the-art as evaluated by humans, and generalizes to unseen data. By eliminating annotation and per-frame optimization bottlenecks, Pose Splatter enables analysis of large-scale, longitudinal behavior needed to map genotype, neural activity, and micro-behavior at unprecedented resolution.

[7] CONCORD: Concept-Informed Diffusion for Dataset Distillation

Jianyang Gu,Haonan Wang,Ruoxi Jia,Saeed Vahidian,Vyacheslav Kungurtsev,Wei Jiang,Yiran Chen

Main category: cs.CV

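TL;DR: The paper proposes CONCORD, which uses the concept understanding of large language models to improve the generation process in dataset distillation, enhancing fine-detail control and interpretability of the samples.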

Details

Motivation: Existing dataset distillation methods fall short on instance-level concept completeness, and the generation process lacks explicit per-sample control. Method: Proposes Concept-Informed Diffusion (CONCORD), which retrieves distinguishable, fine-grained concepts to guide the denoising process and refine object details. Result: Achieves state-of-the-art performance on ImageNet-1K and its subsets. Conclusion: CONCORD significantly improves the controllability and interpretability of distilled image generation without relying on pre-trained classifiers. Abstract: Dataset distillation (DD) has witnessed significant progress in creating small datasets that encapsulate rich information from large original ones. Particularly, methods based on generative priors show promising performance, while maintaining computational efficiency and cross-architecture generalization. However, the generation process lacks explicit controllability for each sample. Previous distillation methods primarily match the real distribution from the perspective of the entire dataset, while overlooking concept completeness at the instance level. The missing or incorrectly represented object details cannot be efficiently compensated due to the constrained sample amount typical in DD settings. To this end, we propose incorporating the concept understanding of large language models (LLMs) to perform Concept-Informed Diffusion (CONCORD) for dataset distillation. Specifically, distinguishable and fine-grained concepts are retrieved based on category labels to inform the denoising process and refine essential object details. By integrating these concepts, the proposed method significantly enhances both the controllability and interpretability of the distilled image generation, without relying on pre-trained classifiers. We demonstrate the efficacy of CONCORD by achieving state-of-the-art performance on ImageNet-1K and its subsets. The code implementation is released at https://github.com/vimar-gu/CONCORD.

[8] Weakly-supervised Mamba-Based Mastoidectomy Shape Prediction for Cochlear Implant Surgery Using 3D T-Distribution Loss

Yike Zhang,Jack H. Noble

Main category: cs.CV

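TL;DR: Proposes a weakly-supervised Mamba-based framework for predicting the mastoidectomy region from preoperative CT scans, using a 3D T-Distribution loss that significantly improves prediction accuracy and robustness.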

Details

Motivation: Accurate prediction of the mastoidectomy region is critical for cochlear implant surgery, but the limited robustness of existing methods restricts their clinical use. Method: Proposes a weakly-supervised Mamba framework that uses a 3D T-Distribution loss to handle geometric variability, with weak supervision provided by the outputs of a self-supervised network. Result: The method outperforms existing techniques in predicting the mastoidectomy region, with higher accuracy and clinical relevance. Conclusion: Combining the weakly-supervised learning framework with the 3D T-Distribution loss significantly improves prediction robustness and efficiency and shows potential for clinical use. Abstract: Cochlear implant surgery is a treatment for individuals with severe hearing loss. It involves inserting an array of electrodes inside the cochlea to electrically stimulate the auditory nerve and restore hearing sensation. A crucial step in this procedure is mastoidectomy, a surgical intervention that removes part of the mastoid region of the temporal bone, providing a critical pathway to the cochlea for electrode placement. Accurate prediction of the mastoidectomy region from preoperative imaging assists presurgical planning, reduces surgical risks, and improves surgical outcomes. In previous work, a self-supervised network was introduced to predict the mastoidectomy region using only preoperative CT scans. While promising, the method suffered from suboptimal robustness, limiting its practical application. To address this limitation, we propose a novel weakly-supervised Mamba-based framework to predict accurate mastoidectomy regions directly from preoperative CT scans. Our approach utilizes a 3D T-Distribution loss function inspired by the Student-t distribution, which effectively handles the complex geometric variability inherent in mastoidectomy shapes. Weak supervision is achieved using the segmentation results from the prior self-supervised network to eliminate the need for manual data cleaning or labeling throughout the training process. The proposed method is extensively evaluated against state-of-the-art approaches, demonstrating superior performance in predicting accurate and clinically relevant mastoidectomy regions. Our findings highlight the robustness and efficiency of the weakly-supervised learning framework with the proposed novel 3D T-Distribution loss.
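The abstract says only that the loss is "inspired by the Student-t distribution" and does not give its form. As a hedged illustration of why such a loss tolerates noisy weak labels, the sketch below uses the data-dependent term of the Student-t negative log-likelihood, averaged over voxels. The function name, the flattened-volume input, and the parameters `nu` (degrees of freedom) and `sigma` (scale) are assumptions, not the paper's definition.

```python
import math
from typing import Sequence


def student_t_loss(
    pred: Sequence[float],
    target: Sequence[float],
    nu: float = 1.0,
    sigma: float = 1.0,
) -> float:
    """Mean of ((nu + 1) / 2) * log(1 + r^2 / nu) with r = (pred - target) / sigma.

    `pred` and `target` are flattened voxel values of equal length. The loss
    grows logarithmically in the residual, so large errors caused by imperfect
    weak labels are penalized far less than under a squared-error loss.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if not pred:
        return 0.0
    total = 0.0
    for p, t in zip(pred, target):
        r = (p - t) / sigma
        total += 0.5 * (nu + 1.0) * math.log1p(r * r / nu)
    return total / len(pred)
```

A residual of 10 costs about 4.6 here with the default parameters, compared with 50 for half the squared error, which is the heavy-tailed behaviour that makes the Student-t family attractive for outlier-prone targets.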

[9] Monocular Marker-free Patient-to-Image Intraoperative Registration for Cochlear Implant Surgery

Yike Zhang,Eduardo Davalos Anaya,Jack H. Noble

Main category: cs.CV

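TL;DR: Proposes a monocular patient-to-image intraoperative registration method that needs no external hardware or markers, using a lightweight neural network for real-time cochlear implant surgery guidance.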

Details

Motivation: Traditional methods depend on external hardware or markers, which limits clinical practicality. This paper aims to develop an intraoperative registration method that requires no additional equipment. Method: Using a synthetic microscopy surgical scene dataset, a zero-shot learning approach maps preoperative CT scans to 2D intraoperative images and estimates the camera pose (a rotation matrix and a translation vector). Result: Validated on nine clinical cases, with angular error mostly within 10 degrees, which meets clinical needs. Conclusion: The method achieves high-accuracy intraoperative registration without external hardware or markers and is clinically practical. Abstract: This paper presents a novel method for monocular patient-to-image intraoperative registration, specifically designed to operate without any external hardware tracking equipment or fiducial point markers. Leveraging a synthetic microscopy surgical scene dataset with a wide range of transformations, our approach directly maps preoperative CT scans to 2D intraoperative surgical frames through a lightweight neural network for real-time cochlear implant surgery guidance via a zero-shot learning approach. Unlike traditional methods, our framework seamlessly integrates with monocular surgical microscopes, making it highly practical for clinical use without additional hardware dependencies and requirements. Our method estimates camera poses, which include a rotation matrix and a translation vector, by learning from the synthetic dataset, enabling accurate and efficient intraoperative registration. The proposed framework was evaluated on nine clinical cases using a patient-specific and cross-patient validation strategy. Our results suggest that our approach achieves clinically relevant accuracy in predicting 6D camera poses for registering 3D preoperative CT scans to 2D surgical scenes with an angular error within 10 degrees in most cases, while also addressing limitations of traditional methods, such as reliance on external tracking systems or fiducial markers.
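The abstract reports an angular error without defining the metric. A common choice for comparing a predicted and a reference rotation matrix is the geodesic angle, theta = arccos((trace(R_pred^T R_true) - 1) / 2); whether the paper uses exactly this metric is an assumption. The helper `rot_z` is a hypothetical convenience for building test rotations.

```python
import math
from typing import List

Matrix3 = List[List[float]]


def rotation_angle_error_deg(r_pred: Matrix3, r_true: Matrix3) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_true) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def rot_z(angle_deg: float) -> Matrix3:
    """Rotation about the z-axis by `angle_deg` degrees (illustrative helper)."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

Under this metric, a prediction of `rot_z(10.0)` against a reference of `rot_z(0.0)` sits at the 10-degree boundary mentioned in the abstract.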

[10] Taming Diffusion for Dataset Distillation with High Representativeness

Lin Zhao,Yushu Wu,Xinru Jiang,Jianyang Gu,Yanzhi Wang,Xiaolin Xu,Pu Zhao,Xue Lin

Main category: cs.CV

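TL;DR: The paper proposes the D^3HR framework, which uses a diffusion model to generate highly representative distilled datasets, addressing problems such as inaccurate distribution matching in existing methods.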

Details

Motivation: Current diffusion-based dataset distillation methods suffer from inaccurate distribution matching and distribution deviation caused by random noise, among other issues, and a more effective solution is needed. Method: Uses DDIM inversion to map the latents of the full dataset to a high-normality Gaussian domain, and proposes an efficient sampling scheme to align the latent distributions. Result: Experiments show that D^3HR achieves higher accuracy across different model architectures. Conclusion: The D^3HR framework performs well in dataset distillation and outperforms existing baselines. Abstract: Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Due to the powerful image generation capability of diffusion, it has been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset distillation methods, including inaccurate distribution matching, distribution deviation with random noise, and separate sampling. Building on this, we propose D^3HR, a novel diffusion-based framework to generate distilled datasets with high representativeness. Specifically, we adopt DDIM inversion to map the latents of the full dataset from a low-normality latent domain to a high-normality Gaussian domain, preserving information and ensuring structural consistency to generate representative latents for the distilled dataset. Furthermore, we propose an efficient sampling scheme to better align the representative latents with the high-normality Gaussian distribution. Our comprehensive experiments demonstrate that D^3HR can achieve higher accuracy across different model architectures compared with state-of-the-art baselines in dataset distillation. Source code: https://github.com/lin-zhao-resoLve/D3HR.

[11] Recent Deep Learning in Crowd Behaviour Analysis: A Brief Review

Jiangbei Yue,He Wang

Main category: cs.CV

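TL;DR: This chapter reviews recent advances in deep learning for crowd behaviour analysis, focusing on the core tasks of crowd behaviour prediction and recognition, and discusses the effectiveness of existing methods and future research directions.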

Details

Motivation: Crowd behaviour analysis is essential to real-world applications such as public safety and urban planning, and progress in deep learning has driven research in this field. Method: Surveys the application of deep learning models to crowd behaviour analysis, covering pure deep learning models and methods that incorporate physics, and compares representative studies in detail. Result: Summarizes the effectiveness of deep learning in crowd behaviour analysis and proposes future research directions. Conclusion: The chapter aims to give researchers an overview of deep learning in crowd behaviour analysis, helping new researchers understand the state of the field and giving existing researchers a reference for future directions. Abstract: Crowd behaviour analysis is essential to numerous real-world applications, such as public safety and urban planning, and therefore has been studied for decades. In the last decade or so, the development of deep learning has significantly propelled the research on crowd behaviours. This chapter reviews recent advances in crowd behaviour analysis using deep learning. We mainly review the research in two core tasks in this field, crowd behaviour prediction and recognition. We broadly cover how different deep neural networks, after first being proposed in machine learning, are applied to analysing crowd behaviours. This includes pure deep neural network models as well as recent development of methodologies combining physics with deep learning. In addition, representative studies are discussed and compared in detail. Finally, we discuss the effectiveness of existing methods and future research directions in this rapidly evolving field. This chapter aims to provide a high-level summary of the ongoing deep learning research in crowd behaviour analysis. It intends to help new researchers who have just entered this field to obtain an overall understanding of the ongoing research, as well as to provide a retrospective analysis for existing researchers to identify possible future directions.

[12] Rehabilitation Exercise Quality Assessment and Feedback Generation Using Large Language Models with Prompt Engineering

Jessica Tang,Ali Abedi,Tracey J. F. Colella,Shehroz S. Khan

Main category: cs.CV

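TL;DR: Proposes a new method that uses pre-trained large language models (LLMs) to provide natural-language feedback for rehabilitation exercises, addressing the data scarcity and lack of feedback mechanisms in conventional rehabilitation.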

Details

Motivation: Conventional rehabilitation has high dropout rates because of transportation constraints and staff shortages. Virtual platforms and AI feedback can help, but existing methods lack textual feedback data. Method: Extracts features from patients' skeletal joints, feeds them into pre-trained LLMs, and generates natural-language feedback using several prompting techniques (zero-shot, few-shot, chain-of-thought, and others). Result: Evaluated on two public rehabilitation datasets (UI-PRMD and REHAB24-6), the method performs well in assessment, reasoning, and feedback generation. Conclusion: The method can be integrated into virtual rehabilitation platforms to help patients exercise correctly, support recovery, and improve health outcomes. Abstract: Exercise-based rehabilitation improves quality of life and reduces morbidity, mortality, and rehospitalization, though transportation constraints and staff shortages lead to high dropout rates from rehabilitation programs. Virtual platforms enable patients to complete prescribed exercises at home, while AI algorithms analyze performance, deliver feedback, and update clinicians. Although many studies have developed machine learning and deep learning models for exercise quality assessment, few have explored the use of large language models (LLMs) for feedback and are limited by the lack of rehabilitation datasets containing textual feedback. In this paper, we propose a new method in which exercise-specific features are extracted from the skeletal joints of patients performing rehabilitation exercises and fed into pre-trained LLMs. Using a range of prompting techniques, such as zero-shot, few-shot, chain-of-thought, and role-play prompting, LLMs are leveraged to evaluate exercise quality and provide feedback in natural language to help patients improve their movements. The method was evaluated through extensive experiments on two publicly available rehabilitation exercise assessment datasets (UI-PRMD and REHAB24-6) and showed promising results in exercise assessment, reasoning, and feedback generation. This approach can be integrated into virtual rehabilitation platforms to help patients perform exercises correctly, support recovery, and improve health outcomes.

[13] Dynamics of Affective States During Takeover Requests in Conditionally Automated Driving Among Older Adults with and without Cognitive Impairment

Gelareh Hajian,Ali Abedi,Bing Ye,Jennifer Campos,Alex Mihailidis

Main category: cs.CV

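TL;DR: Studies the affective responses of cognitively healthy and cognitively impaired older adults during takeover requests (TORs) in conditionally automated vehicles, finding that cognitively impaired drivers show weaker affective responses and need support from adaptive systems.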

Details

Motivation: Cognitive decline can affect driving safety, and conditionally automated vehicles need to understand drivers' affective responses to ensure safe takeovers. Method: Measures valence and arousal through facial expression analysis and compares affective states across different road geometries and speeds. Result: Cognitively healthy participants showed increased arousal under high-demand conditions, while cognitively impaired participants showed reduced arousal and more positive valence. Conclusion: Cognitively impaired drivers show reduced affective responses, and adaptive systems are needed to support safe takeovers. Abstract: Driving is a key component of independence and quality of life for older adults. However, cognitive decline associated with conditions such as mild cognitive impairment and dementia can compromise driving safety and often lead to premature driving cessation. Conditionally automated vehicles, which require drivers to take over control when automation reaches its operational limits, offer a potential assistive solution. However, their effectiveness depends on the driver's ability to respond to takeover requests (TORs) in a timely and appropriate manner. Understanding emotional responses during TORs can provide insight into drivers' engagement, stress levels, and readiness to resume control, particularly in cognitively vulnerable populations. This study investigated affective responses, measured via facial expression analysis of valence and arousal, during TORs among cognitively healthy older adults and those with cognitive impairment. Facial affect data were analyzed across different road geometries and speeds to evaluate within- and between-group differences in affective states. Within-group comparisons using the Wilcoxon signed-rank test revealed significant changes in valence and arousal during TORs for both groups. Cognitively healthy individuals showed adaptive increases in arousal under higher-demand conditions, while those with cognitive impairment exhibited reduced arousal and more positive valence in several scenarios. Between-group comparisons using the Mann-Whitney U test indicated that cognitively impaired individuals displayed lower arousal and higher valence than controls across different TOR conditions. These findings suggest reduced emotional response and awareness in cognitively impaired drivers, highlighting the need for adaptive vehicle systems that detect affective states and support safe handovers for vulnerable users.
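The between-group comparison in the abstract uses the Mann-Whitney U test. As a small illustration of the statistic itself (not of the study's analysis pipeline, and without the p-value computation), the sketch below counts, over all cross-group pairs, how often a value from the first group exceeds one from the second, with ties counted as one half. The function name is hypothetical.

```python
from typing import Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic for sample x relative to sample y.

    Counts the pairs (xi, yi) with xi > yi, adding 0.5 for each tie.
    U ranges from 0 (every x below every y) to len(x) * len(y).
    """
    u = 0.0
    for xi in x:
        for yi in y:
            if xi > yi:
                u += 1.0
            elif xi == yi:
                u += 0.5
    return u
```

A U value near 0 or near `len(x) * len(y)` means one group's values sit almost entirely below or above the other's, which is the kind of separation the abstract reports for arousal between the impaired and control groups.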

[14] CENet: Context Enhancement Network for Medical Image Segmentation

Afshin Bozorgpour,Sina Ghorbani Kolahi,Reza Azad,Ilker Hacihaliloglu,Dorit Merhof

Main category: cs.CV

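TL;DR: Proposes CENet, a new segmentation framework whose DSEB and CFAM modules address boundary detail and spatial integrity in medical image segmentation, outperforming existing methods.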

Details

Motivation: Existing deep learning models for medical image segmentation struggle to represent boundaries accurately, handle variability in organ morphology, and avoid information loss during downsampling, which limits accuracy and robustness. Method: The CENet framework comprises the DSEB module (enhancing boundary details and the detection of small organs) and the CFAM module (a multi-scale design that maintains spatial integrity). Result: On radiology and dermoscopic datasets, CENet outperforms existing methods in multi-organ segmentation and boundary detail preservation. Conclusion: CENet provides a robust and accurate solution for complex medical image analysis tasks, and the code is open source. Abstract: Medical image segmentation, particularly in multi-domain scenarios, requires precise preservation of anatomical structures across diverse representations. While deep learning has advanced this field, existing models often struggle with accurate boundary representation, variability in organ morphology, and information loss during downsampling, limiting their accuracy and robustness. To address these challenges, we propose the Context Enhancement Network (CENet), a novel segmentation framework featuring two key innovations. First, the Dual Selective Enhancement Block (DSEB) integrated into skip connections enhances boundary details and improves the detection of smaller organs in a context-aware manner. Second, the Context Feature Attention Module (CFAM) in the decoder employs a multi-scale design to maintain spatial integrity, reduce feature redundancy, and mitigate overly enhanced representations. Extensive evaluations on both radiology and dermoscopic datasets demonstrate that CENet outperforms state-of-the-art (SOTA) methods in multi-organ segmentation and boundary detail preservation, offering a robust and accurate solution for complex medical image analysis tasks. The code is publicly available at https://github.com/xmindflow/cenet.

[15] TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP

Yuliang Cai,Jesse Thomason,Mohammad Rostami

Main category: cs.CV

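TL;DR: The paper proposes TNG-CLIP, an efficient training-time negation data generation method, and introduces Neg-TtoI, the first benchmark for evaluating text-to-image generation models on prompts containing negation, significantly improving CLIP's performance on negation understanding tasks.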

Details

Motivation: Existing methods fine-tune CLIP on large-scale generated negation data, which is time- and compute-intensive, and their evaluation is limited to image-text matching. This paper aims to address these problems. Method: Proposes a training-time negation data generation pipeline that adds only 2.5% to training time, and designs the Neg-TtoI benchmark to evaluate image generation under prompts containing negation. Result: TNG-CLIP achieves SOTA performance on a range of negation tasks, including image-text matching, text-to-image retrieval, and image generation. Conclusion: TNG-CLIP is efficient and performs strongly, providing a new solution and evaluation standard for negation understanding. Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP is still limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods address the problem by using a large language model (LLM) to generate large-scale data of image captions containing negation for further fine-tuning CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. To expand the horizon, we (1) introduce a training-time negation data generation pipeline such that negation captions are generated during the training stage, which adds only 2.5% to training time, and (2) propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation, assessing a model's ability to produce semantically accurate images. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks of image-to-text matching, text-to-image retrieval, and image generation.

[16] OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data

Yiren Song,Cheng Liu,Mike Zheng Shou

Main category: cs.CV

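TL;DR: OmniConsistency is a universal plugin for diffusion models that uses large-scale DiTs to address consistency in image stylization, with performance approaching GPT-4o.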

Details

Motivation: Existing diffusion models struggle to maintain stylization consistency in complex scenes, and open-source methods lag behind proprietary models such as GPT-4o. Method: Proposes OmniConsistency, comprising (1) an in-context consistency learning framework based on aligned image pairs; (2) a two-stage progressive learning strategy; and (3) a plug-and-play design compatible with arbitrary style LoRAs. Result: Experiments show that OmniConsistency significantly improves visual coherence and aesthetic quality, with performance approaching GPT-4o. Conclusion: OmniConsistency gives open-source models an efficient solution for style consistency, narrowing the gap with proprietary models. Abstract: Diffusion models have advanced image stylization significantly, yet two core challenges persist: (1) maintaining consistent stylization in complex scenes, particularly identity, composition, and fine details, and (2) preventing style degradation in image-to-image pipelines with style LoRAs. GPT-4o's exceptional stylization consistency highlights the performance gap between open-source methods and proprietary models. To bridge this gap, we propose OmniConsistency, a universal consistency plugin leveraging large-scale Diffusion Transformers (DiTs). OmniConsistency contributes: (1) an in-context consistency learning framework trained on aligned image pairs for robust generalization; (2) a two-stage progressive learning strategy decoupling style learning from consistency preservation to mitigate style degradation; and (3) a fully plug-and-play design compatible with arbitrary style LoRAs under the Flux framework. Extensive experiments show that OmniConsistency significantly enhances visual coherence and aesthetic quality, achieving performance comparable to the commercial state-of-the-art model GPT-4o.

[17] Mitigating Context Bias in Domain Adaptation for Object Detection using Mask Pooling

Hojun Son,Asma Almutairi,Arpan Kusari

Main category: cs.CV

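TL;DR: The paper offers a causal view of context bias and proposes Mask Pooling, which reduces the bias by separating the pooling of foreground and background regions, along with a benchmark for testing model robustness.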

Details

Motivation: Context bias is common in object detection training, but its causes and remedies have not been studied systematically. This paper aims to fill that gap. Method: Proposes Mask Pooling, which uses foreground masks to separate the pooling of foreground and background, and verifies model robustness with a benchmark built on random backgrounds. Result: Experiments show that Mask Pooling effectively reduces context bias and improves detection performance across domains. Conclusion: The paper provides a principled approach to reducing context bias in domain-adaptive object detection and validates it experimentally. Abstract: Context bias refers to the association between the foreground objects and background during the object detection training process. Various methods have been proposed to minimize the context bias when applying the trained model to an unseen domain, known as domain adaptation for object detection (DAOD). However, a principled approach to understand why the context bias occurs and how to remove it has been missing. In this work, we provide a causal view of the context bias, pointing towards the pooling operation in the convolution network architecture as the possible source of this bias. We present an alternative, Mask Pooling, which uses an additional input of foreground masks to separate the pooling process in the respective foreground and background regions, and show that this process leads the trained model to detect objects in a more robust manner under different domains. We also provide a benchmark designed to create an ultimate test for DAOD, using foregrounds in the presence of absolute random backgrounds, to analyze the robustness of the intended trained models. Through these experiments, we hope to provide a principled approach for minimizing context bias under domain shift.
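The abstract describes Mask Pooling only at a high level: a foreground mask is used so that foreground and background regions are pooled separately. The sketch below is one possible reading, using non-overlapping average pooling on a single-channel feature map and returning separate foreground and background outputs. The pooling type, the zero fill for empty regions, and the name `mask_avg_pool` are all assumptions rather than the paper's design.

```python
from typing import List, Sequence, Tuple

Grid = Sequence[Sequence[float]]


def mask_avg_pool(
    feature: Grid, mask: Sequence[Sequence[int]], k: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Average-pool k x k windows separately over foreground and background.

    `mask` has the same shape as `feature`, with 1 marking foreground pixels.
    A window containing no pixels of a region yields 0.0 for that region.
    """
    height, width = len(feature), len(feature[0])
    fg_out: List[List[float]] = []
    bg_out: List[List[float]] = []
    for i in range(0, height - k + 1, k):
        fg_row, bg_row = [], []
        for j in range(0, width - k + 1, k):
            fg_vals, bg_vals = [], []
            for di in range(k):
                for dj in range(k):
                    value = feature[i + di][j + dj]
                    if mask[i + di][j + dj]:
                        fg_vals.append(value)
                    else:
                        bg_vals.append(value)
            fg_row.append(sum(fg_vals) / len(fg_vals) if fg_vals else 0.0)
            bg_row.append(sum(bg_vals) / len(bg_vals) if bg_vals else 0.0)
        fg_out.append(fg_row)
        bg_out.append(bg_row)
    return fg_out, bg_out
```

Because background activations never enter the foreground average, a change of background alone leaves the pooled foreground features unchanged, which is the intuition behind reducing context bias.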

[18] Agentic 3D Scene Generation with Spatially Contextualized VLMs

Xinhang Liu,Yu-Wing Tai,Chi-Keung Tang

Main category: cs.CV

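TL;DR: The paper proposes a new paradigm that injects a dynamic spatial context so that vision-language models (VLMs) can generate, understand, and edit complex 3D scenes, improving their performance on spatial tasks.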

Details

Motivation: The ability of current vision-language models to generate and reason about structured 3D scenes remains underexplored, which limits their use in spatial tasks such as embodied AI and immersive simulation. Method: Builds a spatial context consisting of a scene portrait, a semantically labeled point cloud, and a scene hypergraph, and combines it with the VLM's multimodal reasoning in a pipeline that generates 3D scenes iteratively. Result: Experiments show that the framework handles diverse inputs, generalizes better than prior work, and supports downstream tasks such as interactive scene editing and path planning. Conclusion: The study shows the potential of spatial context injection for VLM-based spatially intelligent 3D systems, with applications in computer graphics, 3D vision, and embodied settings. Abstract: Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications.

[19] BiomechGPT: Towards a Biomechanically Fluent Multimodal Foundation Model for Clinically Relevant Motion Tasks

Ruize Yang,Ann Kennedy,R. James Cotton

Main category: cs.CV

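TL;DR: BiomechGPT is a multimodal biomechanics-language model that answers clinically relevant questions about movement and performs well across a range of tasks.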

Details

Motivation: Advances in markerless motion capture make high-quality movement data obtainable in many settings, but analyzing that data efficiently remains a challenge. Method: Collects a large body of biomechanics data, tokenizes it, builds a multimodal question-answer dataset, and develops the BiomechGPT model on it. Result: BiomechGPT performs well on activity recognition, identification of movement impairments, diagnosis, clinical outcome scoring, and walking measurement. Conclusion: BiomechGPT is an important step towards a foundation model for rehabilitation movement data. Abstract: Advances in markerless motion capture are expanding access to biomechanical movement analysis, making it feasible to obtain high-quality movement data from outpatient clinics, inpatient hospitals, therapy, and even home. Expanding access to movement data in these diverse contexts makes the challenge of performing downstream analytics all the more acute. Creating separate bespoke analysis code for all the tasks end users might want is both intractable and does not take advantage of the common features of human movement underlying them all. Recent studies have shown that fine-tuning language models to accept tokenized movement as an additional modality enables successful descriptive captioning of movement. Here, we explore whether such a multimodal motion-language model can answer detailed, clinically meaningful questions about movement. We collected over 30 hours of biomechanics from nearly 500 participants, many with movement impairments from a variety of etiologies, performing a range of movements used in clinical outcomes assessments. After tokenizing these movement trajectories, we created a multimodal dataset of motion-related questions and answers spanning a range of tasks. We developed BiomechGPT, a multimodal biomechanics-language model, on this dataset. Our results show that BiomechGPT demonstrates high performance across a range of tasks such as activity recognition, identifying movement impairments, diagnosis, scoring clinical outcomes, and measuring walking. BiomechGPT provides an important step towards a foundation model for rehabilitation movement data.

[20] In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation

Yu Xu,Fan Tang,You Wu,Lin Gao,Oliver Deussen,Hongbin Yan,Jintao Li,Juan Cao,Tong-Yee Lee

Main category: cs.CV

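TL;DR: Proposes "In-Context Brush", a zero-shot framework that uses the in-context learning paradigm to insert customized subjects with high fidelity, without model tuning.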

Details

Motivation: Existing methods struggle to insert customized subjects with high fidelity while staying aligned with user intent, which motivates an approach that needs no additional training. Method: Builds on a pretrained MMDiT-based inpainting network and enhances it at test time through dual-level latent space manipulation (latent feature shifting and attention reweighting). Result: Experiments show that the method outperforms existing approaches in identity preservation, text alignment, and image quality. Conclusion: The In-Context Brush framework achieves high-quality customized subject insertion without additional training. Abstract: Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly "brushes" user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and align results with the user's intent through textual prompts. In this work, we propose "In-Context Brush", a zero-shot framework for customized subject insertion by reformulating the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with the subject, in line with the textual prompts, without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head "latent feature shifting" within each attention head that dynamically shifts attention outputs to reflect the desired subject semantics and inter-head "attention reweighting" across different heads that amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection.

[21] HonestFace: Towards Honest Face Restoration with One-Step Diffusion Model

Jingkai Wang,Wu Miao,Jue Gong,Zheng Chen,Xing Liu,Hong Gu,Yutong Liu,Yulun Zhang

Main category: cs.CV

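TL;DR: HonestFace is a novel face restoration method that emphasizes identity consistency and texture realism. It improves restoration quality with an identity embedder and a masked face alignment method, and it introduces a new evaluation metric.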

Details

Motivation: Current face restoration methods struggle to maintain high fidelity while avoiding artifacts and biases; more "honest" models are needed that accurately reflect the original characteristics. Method: Proposes HonestFace, which comprises an identity embedder, a masked face alignment method, and a new evaluation metric based on affine transformations, all within a one-step diffusion framework. Result: Experiments show that HonestFace outperforms existing methods in both visual quality and quantitative evaluation. Conclusion: HonestFace's components yield more realistic and consistent face restoration; the code and models will be released. Abstract: Face restoration has achieved remarkable advancements over years of development. However, ensuring that restored facial images exhibit high fidelity, preserve authentic features, and avoid introducing artifacts or biases remains a significant challenge. This highlights the need for models that are more "honest" in their reconstruction from low-quality inputs, accurately reflecting original characteristics. In this work, we propose HonestFace, a novel approach designed to restore faces with a strong emphasis on such honesty, particularly concerning identity consistency and texture realism. To achieve this, HonestFace incorporates several key components. First, we propose an identity embedder to effectively capture and preserve crucial identity features from both the low-quality input and multiple reference faces. Second, a masked face alignment method is presented to enhance fine-grained details and textural authenticity, thereby preventing the generation of patterned or overly synthetic textures and improving overall clarity. Furthermore, we present a new landmark-based evaluation metric. Based on affine transformation principles, this metric improves the accuracy compared to conventional L2 distance calculations for facial feature alignment. Leveraging these contributions within a one-step diffusion model framework, HonestFace delivers exceptional restoration results in terms of facial fidelity and realism. Extensive experiments demonstrate that our approach surpasses existing state-of-the-art methods, achieving superior performance in both visual quality and quantitative assessments. The code and pre-trained models will be made publicly available at https://github.com/jkwang28/HonestFace .

[22] ZooplanktonBench: A Geo-Aware Zooplankton Recognition and Classification Dataset from Marine Observations

Fukun Liu,Adam T. Greer,Gengchen Mai,Jin Sun

Main category: cs.CV

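TL;DR: ZooplanktonBench is a benchmark dataset of zooplankton images and videos, aimed at the computer vision challenges of detecting, classifying, and tracking zooplankton in marine environments.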

Details

Motivation: Because zooplankton look highly similar to their background, general-purpose computer vision tools struggle to analyze such imagery accurately, so dedicated datasets and methods are needed to support marine science. Method: Introduces the ZooplanktonBench dataset, with rich image, video, and geospatial metadata, and defines a set of tasks that test computer vision systems in complex environments. Result: The dataset poses distinctive challenges and opportunities for computer vision systems to improve visual understanding and geo-awareness in dynamic environments. Conclusion: ZooplanktonBench helps advance the use of computer vision in marine science and supports work on future marine seafood productivity. Abstract: Plankton are small drifting organisms found throughout the world's oceans. One component of this plankton community is the zooplankton, which includes gelatinous animals and crustaceans (e.g., shrimp), as well as the early life stages (i.e., eggs and larvae) of many commercially important fishes. Being able to monitor zooplankton abundances accurately and understand how populations change in relation to ocean conditions is invaluable to marine science research, with important implications for future marine seafood productivity. While new imaging technologies generate massive amounts of video data of zooplankton, analyzing them using general-purpose computer vision tools developed for general objects turns out to be highly challenging due to the high similarity in appearance between the zooplankton and its background (e.g., marine snow). In this work, we present ZooplanktonBench, a benchmark dataset containing images and videos of zooplankton associated with rich geospatial metadata (e.g., geographic coordinates, depth, etc.) in various water ecosystems. ZooplanktonBench defines a collection of tasks to detect, classify, and track zooplankton in challenging settings, including highly cluttered environments, living vs non-living classification, objects with similar shapes, and relatively small objects. Our dataset presents unique challenges and opportunities for state-of-the-art computer vision systems to evolve and improve visual understanding in a dynamic environment with huge variations and be geo-aware.

[23] Syn3DTxt: Embedding 3D Cues for Scene Text Generation

Li-Syun Hsiung,Jun-Kai Tu,Kuan-Wu Chu,Yu-Hsuan Chiu,Yan-Tsung Peng,Sheng-Luen Chung,Gee-Sern Jison Hsu

Main category: cs.CV

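TL;DR: The study proposes a new standard for constructing synthetic datasets that adds surface normals to enrich 3D scene characteristics, addressing the lack of 3D context in conventional 2D data for scene text rendering.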

Details

Motivation: Existing methods rely mainly on 2D data and cannot capture the complex interplay between spatial layout and visual effects in real scenes, which limits the accuracy of text rendering. Method: Proposes a synthetic dataset construction method that incorporates surface normals to enrich 3D scene characteristics. Result: Experiments show that datasets built under the new standard provide better geometric context, which helps text rendering under complex 3D spatial conditions. Conclusion: By introducing surface normals, the method significantly improves the 3D expressiveness of synthetic datasets and gives future scene text rendering methods a firmer foundation. Abstract: This study aims to investigate the challenge of insufficient three-dimensional context in synthetic datasets for scene text rendering. Although recent advances in diffusion models and related techniques have improved certain aspects of scene text generation, most existing approaches continue to rely on 2D data, sourcing authentic training examples from movie posters and book covers, which limits their ability to capture the complex interactions among spatial layout and visual effects in real-world scenes. In particular, traditional 2D datasets do not provide the necessary geometric cues for accurately embedding text into diverse backgrounds. To address this limitation, we propose a novel standard for constructing synthetic datasets that incorporates surface normals to enrich three-dimensional scene characteristics. By adding surface normals to conventional 2D data, our approach aims to enhance the representation of spatial relationships and provide a more robust foundation for future scene text rendering methods. Extensive experiments demonstrate that datasets built under this new standard offer improved geometric context, facilitating further advancements in text rendering under complex 3D-spatial conditions.

[24] Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning

Aofei Chang,Le Huang,Alex James Boyd,Parminder Bhatia,Taha Kass-Hout,Cao Xiao,Fenglong Ma

Main category: cs.CV

TL;DR: A$^3$Tune is a new fine-tuning framework that uses automatic attention alignment tuning to improve the attention distribution of Med-LVLMs.

Details

Motivation: Med-LVLMs distribute attention poorly over visual inputs, which leads to inaccurate or hallucinated outputs. Existing methods rely on inference-time interventions and have limited effect. Method: Takes zero-shot weak labels from SAM, refines them into prompt-aware labels with BioMedCLIP, selectively modifies the key attention heads, and adds an A$^3$MoE module for adaptive parameter selection. Result: On medical VQA and report generation tasks, A$^3$Tune outperforms existing methods, improving both attention distribution and performance. Conclusion: A$^3$Tune effectively addresses attention alignment in Med-LVLMs and offers a better solution for medical vision-language models. Abstract: Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing mitigation methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A$^3$Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A$^3$Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually-critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A$^3$MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A$^3$Tune outperforms state-of-the-art baselines, achieving enhanced attention distributions and performance in Med-LVLMs.

[25] Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility

Yiheng Li,Feng Liang,Dan Kondratyuk,Masayoshi Tomizuka,Kurt Keutzer,Chenfeng Xu

Main category: cs.CV

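TL;DR: The paper accelerates diffusion model training by reducing the mixing of diffusion trajectories (miscibility reduction), moves beyond the limitations of linear assignment, and achieves more than 4x faster training.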

Details

Motivation: The high training cost of diffusion models limits their use, so a more efficient training method is needed. Method: Reduces trajectory mixing in the noise space (for example, via KNN noise selection and image scaling) to simplify denoising, and analyzes the bijective nature of the process. Result: Achieves up to more than 4x faster training across a variety of tasks while preserving generative diversity. Conclusion: Trajectory miscibility is a fundamental bottleneck in diffusion training, and the work opens a new research direction for efficient diffusion training. Abstract: The substantial training cost of diffusion models hinders their deployment. Immiscible Diffusion recently showed that reducing diffusion trajectory mixing in the noise space via linear assignment accelerates training by simplifying denoising. To extend immiscible diffusion beyond the inefficient linear assignment under high batch sizes and high dimensions, we refine this concept to a broader miscibility reduction at any layer and by any implementation. Specifically, we empirically demonstrate the bijective nature of the denoising process with respect to immiscible diffusion, ensuring its preservation of generative diversity. Moreover, we provide thorough analysis and show step-by-step how immiscibility eases denoising and improves efficiency. Extending beyond linear assignment, we propose a family of implementations including K-nearest neighbor (KNN) noise selection and image scaling to reduce miscibility, achieving up to >4x faster training across diverse models and tasks including unconditional/conditional generation, image editing, and robotics planning. Furthermore, our analysis of immiscibility offers a novel perspective on how optimal transport (OT) enhances diffusion training. By identifying trajectory miscibility as a fundamental bottleneck, we believe this work establishes a potentially new direction for future research into high-efficiency diffusion training. The code is available at https://github.com/yhli123/Immiscible-Diffusion.
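
As a rough illustration of the KNN noise selection idea named in the abstract, the sketch below draws k Gaussian noise candidates for one image and keeps the candidate closest to it. This is a minimal sketch under assumptions, not the authors' implementation: the function names, the flattened-list image representation, and the use of squared Euclidean distance are all hypothetical choices made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_noise_selection(image, k, rng):
    """Sample k standard-normal noise vectors and return the one nearest to `image`.

    Assumption: `image` is a flattened list of floats. The candidate pool,
    distance measure, and batching used in the paper may differ.
    """
    candidates = [[rng.gauss(0.0, 1.0) for _ in image] for _ in range(k)]
    return min(candidates, key=lambda noise: squared_distance(noise, image))
```

Calling `knn_noise_selection([0.5] * 8, k=16, rng=random.Random(0))` returns one 8-element noise vector. With k = 1 the function reduces to ordinary independent noise sampling, so k controls how strongly each image is paired with nearby noise.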

[26] TK-Mamba: Marrying KAN with Mamba for Text-Driven 3D Medical Image Segmentation

Haoyu Yang,Yuxiang Cai,Jintao Chen,Xuhong Zhang,Wenhui Lei,Xiaoming Shi,Jianwei Yin,Yankai Jiang

Main category: cs.CV

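TL;DR: Proposes a multimodal framework combining Mamba and KAN for efficient 3D medical image segmentation. With an EGSC module, 3D-GR-KAN, and a dual-branch text-driven strategy, it reaches state-of-the-art performance on the MSD and KiTS23 datasets.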

Details

Motivation: Addresses the computational inefficiency and limited contextual modeling of traditional single-modality networks in 3D medical image segmentation. Method: Uses Mamba and KAN as the backbone, introduces an EGSC module to capture spatial information, extends GR-KAN to 3D-GR-KAN for 3D data, and adds a dual-branch text-driven strategy. Result: Achieves the best performance on the MSD and KiTS23 datasets, surpassing existing methods in both accuracy and efficiency. Conclusion: Combining sequence modeling, extended network architectures, and vision-language synergy yields a scalable clinical solution for 3D medical image segmentation. Abstract: 3D medical image segmentation is vital for clinical diagnosis and treatment but is challenged by high-dimensional data and complex spatial dependencies. Traditional single-modality networks, such as CNNs and Transformers, are often limited by computational inefficiency and constrained contextual modeling in 3D settings. We introduce a novel multimodal framework that leverages Mamba and Kolmogorov-Arnold Networks (KAN) as an efficient backbone for long-sequence modeling. Our approach features three key innovations: First, an EGSC (Enhanced Gated Spatial Convolution) module captures spatial information when unfolding 3D images into 1D sequences. Second, we extend Group-Rational KAN (GR-KAN), a Kolmogorov-Arnold Network variant with rational basis functions, into 3D-Group-Rational KAN (3D-GR-KAN) for 3D medical imaging - its first application in this domain - enabling superior feature representation tailored to volumetric data. Third, a dual-branch text-driven strategy leverages CLIP's text embeddings: one branch swaps one-hot labels for semantic vectors to preserve inter-organ semantic relationships, while the other aligns images with detailed organ descriptions to enhance semantic alignment. Experiments on the Medical Segmentation Decathlon (MSD) and KiTS23 datasets show our method achieving state-of-the-art performance, surpassing existing approaches in accuracy and efficiency. This work highlights the power of combining advanced sequence modeling, extended network architectures, and vision-language synergy to push forward 3D medical image segmentation, delivering a scalable solution for clinical use. The source code is openly available at https://github.com/yhy-whu/TK-Mamba.

[27] ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts

Shiu-hong Kao,Yu-Wing Tai,Chi-Keung Tang

Main category: cs.CV

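TL;DR: ThinkVideo is a new framework that uses the zero-shot Chain-of-Thought capability of MLLMs for video object segmentation. It is training-free, compatible with closed-source MLLMs, and significantly outperforms existing methods.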

Details

Motivation: Existing methods struggle with temporally sensitive queries in video object segmentation, mainly because they fail to integrate temporal and spatial information. Method: Uses CoT prompts to extract object selectivities for keyframes, then combines a reasoning image segmentation model with the SAM2 video processor to output mask sequences. Result: On video object segmentation with explicit and implicit queries, ThinkVideo significantly outperforms prior work both qualitatively and quantitatively. Conclusion: By using zero-shot CoT, ThinkVideo integrates temporal and spatial information effectively and offers an efficient, broadly compatible solution for video object segmentation. Abstract: Reasoning Video Object Segmentation is a challenging task, which generates a mask sequence from an input video and an implicit, complex text query. Existing works probe into the problem by finetuning Multimodal Large Language Models (MLLM) for segmentation-based output, while still falling short in difficult cases on videos given temporally-sensitive queries, primarily due to the failure to integrate temporal and spatial information. In this paper, we propose ThinkVideo, a novel framework which leverages the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these challenges. Specifically, ThinkVideo utilizes the CoT prompts to extract object selectivities associated with particular keyframes, then bridges the reasoning image segmentation model and the SAM2 video processor to output mask sequences. The ThinkVideo framework is training-free and compatible with closed-source MLLMs, which can be applied to Reasoning Video Instance Segmentation. We further extend the framework for online video streams, where the CoT is used to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that ThinkVideo significantly outperforms previous works in both cases, qualitatively and quantitatively.

[28] On Denoising Walking Videos for Gait Recognition

Dongyang Jin,Chao Fan,Jingzhe Ma,Jingkai Zhou,Weihua Chen,Shiqi Yu

Main category: cs.CV

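TL;DR: DenoisingGait is a gait denoising method based on generative diffusion models, combined with a geometry-driven feature matching module, that significantly improves gait recognition accuracy.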

Details

Motivation: Traditional silhouette- and pose-based gait recognition methods lack accuracy because their inputs are sparse, while existing end-to-end methods denoise RGB videos directly but still leave room for improvement. Method: Uses a generative diffusion model to filter out irrelevant factors and introduces a geometry-driven feature matching module that produces a two-channel direction vector representing gait features. Result: On the CCPG, CASIA-B*, and SUSTech1K datasets, DenoisingGait sets a new state of the art in most cases. Conclusion: By combining diffusion models with feature matching, DenoisingGait reduces noise effectively and improves gait recognition performance. Abstract: To capture individual gait patterns, excluding identity-irrelevant cues in walking videos, such as clothing texture and color, remains a persistent challenge for vision-based gait recognition. Traditional silhouette- and pose-based methods, though theoretically effective at removing such distractions, often fall short of high accuracy due to their sparse and less informative inputs. Emerging end-to-end methods address this by directly denoising RGB videos using human priors. Building on this trend, we propose DenoisingGait, a novel gait denoising method. Inspired by the philosophy that "what I cannot create, I do not understand", we turn to generative diffusion models, uncovering how they partially filter out irrelevant factors for gait understanding. Additionally, we introduce a geometry-driven Feature Matching module, which, combined with background removal via human silhouettes, condenses the multi-channel diffusion features at each foreground pixel into a two-channel direction vector. Specifically, the proposed within- and cross-frame matching respectively capture the local vectorized structures of gait appearance and motion, producing a novel flow-like gait representation termed Gait Feature Field, which further reduces residual noise in diffusion features. Experiments on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate that DenoisingGait achieves a new SoTA performance in most cases for both within- and cross-domain evaluations. Code is available at https://github.com/ShiqiYu/OpenGait.

[29] Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

Chaofan Gan,Yuanpeng Tu,Xi Chen,Tieyuan Chen,Yuxi Li,Mehrtash Harandi,Weiyao Lin

Main category: cs.CV

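TL;DR: The paper studies Diffusion Transformers (DiTs) on dense visual correspondence, finds that they suffer from "massive activations", and proposes DiTF, a training-free framework that addresses the problem.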

Details

Motivation: Pre-trained Stable Diffusion (SD) models perform well on visual correspondence, but Diffusion Transformers (DiTs) degrade because of "massive activations". The paper aims to solve this problem. Method: Proposes the DiTF framework, which uses zero-initialized Adaptive Layer Norm (AdaLN-zero) to localize and normalize massive activations, plus a channel discard strategy to further remove their negative effects. Result: Experiments show that DiTF outperforms DINO and SD-based models on several visual correspondence tasks, with notable gains (e.g., +9.4% on Spair-71k and +4.4% on AP-10K-C.S.). Conclusion: By resolving the massive activation problem, DiTF significantly improves DiT performance on visual correspondence and suggests new ways to apply DiTs. Abstract: Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as massive activations, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We trace these dimension-concentrated massive activations and find that such concentration can be effectively localized by the zero-initialized Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose Diffusion Transformer Feature (DiTF), a training-free framework designed to extract semantic-discriminative features from DiTs. Specifically, DiTF employs AdaLN to adaptively localize and normalize massive activations with channel-wise modulation. In addition, we develop a channel discard strategy to further eliminate the negative impacts from massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (e.g., with +9.4% on Spair-71k and +4.4% on AP-10K-C.S.).
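
To make the notion of dimension-concentrated massive activations concrete, the sketch below flags channels whose mean magnitude across all tokens far exceeds the typical channel, then drops them. It is only a hedged illustration of a generic channel-discard step: the function names, the median-based threshold, and the default `ratio` are assumptions, and the AdaLN-based localization described in the abstract is not reproduced here.

```python
from statistics import median


def find_massive_channels(features, ratio=20.0):
    """Return indices of channels whose mean |activation| exceeds ratio * median.

    `features` is a list of token vectors (tokens x channels). The threshold
    rule is a hypothetical heuristic, not the criterion used by DiTF.
    """
    n_tokens = len(features)
    n_channels = len(features[0])
    mean_abs = [
        sum(abs(token[c]) for token in features) / n_tokens
        for c in range(n_channels)
    ]
    typical = median(mean_abs)
    return [c for c, m in enumerate(mean_abs) if m > ratio * typical]


def discard_channels(features, channels):
    """Remove the given channel indices from every token vector."""
    drop = set(channels)
    return [[v for c, v in enumerate(token) if c not in drop] for token in features]
```

On a toy input such as three tokens with four channels, where one channel holds values near 50 and the rest stay below 1, `find_massive_channels` flags only that channel and `discard_channels` returns three-channel features.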

[30] Guiding the Experts: Semantic Priors for Efficient and Focused MoE Routing

Chengxi Min,Wei Wang,Yahui Liu,Weixin Ye,Enver Sangineto,Qi Wang,Yao Zhao

Main category: cs.CV

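TL;DR: The paper proposes a foreground-guided enhancement strategy that improves expert routing in Soft MoE models through a spatially aware auxiliary loss and a lightweight LayerScale mechanism.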

Details

Motivation: Current Soft MoE designs ignore the semantic structure implicit in the dispatch weights, resulting in suboptimal expert routing. The authors find that dispatch weights show segmentation-like patterns but are not explicitly aligned with semantic regions. Method: Proposes a foreground-guided enhancement strategy consisting of a spatially aware auxiliary loss and a lightweight LayerScale mechanism to improve expert routing. Result: Performance improves consistently on ImageNet-1K and several smaller-scale classification benchmarks, and expert routing becomes more interpretable. Conclusion: The method needs only minor architectural changes, integrates seamlessly into existing Soft MoE frameworks, and significantly improves performance and interpretability. Abstract: Mixture-of-Experts (MoE) models have emerged as a promising direction for scaling vision architectures efficiently. Among them, Soft MoE improves training stability by assigning each token to all experts via continuous dispatch weights. However, current designs overlook the semantic structure which is implicitly encoded in these weights, resulting in suboptimal expert routing. In this paper, we discover that dispatch weights in Soft MoE inherently exhibit segmentation-like patterns but are not explicitly aligned with semantic regions. Motivated by this observation, we propose a foreground-guided enhancement strategy. Specifically, we introduce a spatially aware auxiliary loss that encourages expert activation to align with semantic foreground regions. To further reinforce this supervision, we integrate a lightweight LayerScale mechanism that improves information flow and stabilizes optimization in skip connections. Our method necessitates only minor architectural adjustments and can be seamlessly integrated into prevailing Soft MoE frameworks. Comprehensive experiments on ImageNet-1K and multiple smaller-scale classification benchmarks not only showcase consistent performance enhancements but also reveal more interpretable expert routing mechanisms.
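
For readers unfamiliar with the continuous dispatch weights the abstract refers to, the sketch below computes them in the standard Soft MoE way: token-slot affinities are normalized over tokens to give dispatch weights and over slots to give combine weights. It reflects the general Soft MoE formulation rather than anything specific to this paper; the function names are hypothetical, and the auxiliary loss and LayerScale components are not shown.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_moe_weights(logits):
    """Compute Soft MoE dispatch and combine weights.

    logits[i][j] is the affinity between token i and expert slot j.
    dispatch: softmax over tokens, so each slot's column sums to 1.
    combine:  softmax over slots, so each token's row sums to 1.
    """
    n_tokens, n_slots = len(logits), len(logits[0])
    columns = [
        softmax([logits[i][j] for i in range(n_tokens)]) for j in range(n_slots)
    ]
    dispatch = [[columns[j][i] for j in range(n_slots)] for i in range(n_tokens)]
    combine = [softmax(row) for row in logits]
    return dispatch, combine
```

Reading one column of `dispatch` back onto the image patch grid gives a per-slot map over tokens, which is the kind of map in which segmentation-like patterns could be observed.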

[31] HyperFake: Hyperspectral Reconstruction and Attention-Guided Analysis for Advanced Deepfake Detection

Pavan C Shekar,Pawan Soni,Vivek Kanhangad

Main category: cs.CV

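TL;DR: HyperFake reconstructs 31-channel hyperspectral data and combines an improved MST++ architecture with a spectral attention mechanism, significantly improving the accuracy and generalization of deepfake detection.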

Details

Motivation: Current deepfake detection methods generalize poorly across manipulation techniques and datasets and are limited by the inherent constraints of RGB data. Method: Reconstructs 31-channel hyperspectral data with an MST++ architecture and a spectral attention mechanism, then performs spectral analysis with an optimized EfficientNet classifier. Result: Achieves more accurate and more generalizable detection across deepfake styles and datasets. Conclusion: To the authors' knowledge, HyperFake is the first approach to apply hyperspectral imaging reconstruction to deepfake detection, opening a new avenue for detecting more sophisticated forgeries. Abstract: Deepfakes pose a significant threat to digital media security, with current detection methods struggling to generalize across different manipulation techniques and datasets. While recent approaches combine CNN-based architectures with Vision Transformers or leverage multi-modal learning, they remain limited by the inherent constraints of RGB data. We introduce HyperFake, a novel deepfake detection pipeline that reconstructs 31-channel hyperspectral data from standard RGB videos, revealing hidden manipulation traces invisible to conventional methods. Using an improved MST++ architecture, HyperFake enhances hyperspectral reconstruction, while a spectral attention mechanism selects the most critical spectral features for deepfake detection. The refined spectral data is then processed by an EfficientNet-based classifier optimized for spectral analysis, enabling more accurate and generalizable detection across different deepfake styles and datasets, all without the need for expensive hyperspectral cameras. To the best of our knowledge, this is the first approach to leverage hyperspectral imaging reconstruction for deepfake detection, opening new possibilities for detecting increasingly sophisticated manipulations.

[32] EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

GuangHao Meng,Sunan He,Jinpeng Wang,Tao Dai,Letian Zhang,Jieming Zhu,Qing Li,Gang Wang,Rui Zhang,Yong Jiang

Main category: cs.CV

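TL;DR: The paper proposes EvdCLIP, which enriches queries with Entity Visual Descriptions (EVDs) and refines them with an EVD-aware Rewriter (EaRW) to improve vision-language retrieval performance.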

Details

Motivation: Existing vision-language retrieval methods neglect the rich visual semantic knowledge of entities, which leads to inaccurate retrieval results. Method: Uses a large language model to generate EVDs that complement the textual data, and develops EaRW to rewrite queries and reduce noise. Result: EvdCLIP shows superior performance on image-text retrieval benchmarks. Conclusion: With EVDs and EaRW, EvdCLIP effectively improves the accuracy and quality of vision-language retrieval. Abstract: Vision-language retrieval (VLR), which involves using text (or images) as queries to retrieve corresponding images (or text), has attracted significant attention in both academia and industry. However, existing methods often neglect the rich visual semantic knowledge of entities, thus leading to incorrect retrieval results. To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually-rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a novel, trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW utilizes EVD knowledge and the generative capabilities of the language model to effectively rewrite queries. With our specialized training strategy, EaRW can generate high-quality and low-noise EVD-enhanced queries. Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks.

[33] Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Bryan Sangwoo Kim,Jeongsol Kim,Jong Chul Ye

Main category: cs.CV

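TL;DR: The CoZ framework uses multi-scale-aware prompts and an autoregressive chain of zoom steps to address the collapse of single-image super-resolution (SISR) models beyond their training scale.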

Details

Motivation: Modern SISR models work well at their training scale but collapse far beyond it, so a method is needed that reaches extreme magnification without additional training. Method: Proposes the Chain-of-Zoom (CoZ) framework, which uses multi-scale-aware prompts and autoregressive chained zooming, decomposing the problem by repeatedly reusing a backbone SR model. Prompts are generated by a vision-language model (VLM), and the prompt extractor is fine-tuned with GRPO. Result: Experiments show that CoZ lets a standard 4x diffusion SR model reach beyond 256x enlargement while keeping high quality and fidelity. Conclusion: CoZ is a model-agnostic framework that effectively addresses the scale extension problem of SISR models and suits extreme magnification scenarios. Abstract: Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity.
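
The scale arithmetic behind chained zooming can be sketched as follows: reusing a 4x step four times compounds to 4^4 = 256x. The code is a minimal sketch under assumptions, with a nearest-neighbour upscaler standing in for the SR backbone; it omits the VLM prompts and any cropping CoZ may apply between steps, and all function names are hypothetical.

```python
def upscale_nearest(image, factor):
    """Stand-in for an SR backbone: nearest-neighbour upscaling of a 2D list."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def chain_of_zoom(image, sr_step, step_factor, num_steps):
    """Apply the same SR step repeatedly, feeding each output into the next step."""
    for _ in range(num_steps):
        image = sr_step(image, step_factor)
    return image


def total_magnification(step_factor, num_steps):
    """Overall scale factor reached after chaining num_steps equal steps."""
    return step_factor ** num_steps
```

For example, `chain_of_zoom([[7]], upscale_nearest, 4, 4)` turns a 1x1 input into a 256x256 grid, matching `total_magnification(4, 4)`.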

[34] Rethinking Causal Mask Attention for Vision-Language Inference

Xiaohuan Pei,Tao Huang,YanXiang Ma,Chang Xu

Main category: cs.CV

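TL;DR: The paper studies causal attention in vision-language models, finds that the conventional causal mask is too strict for vision queries, and proposes a family of future-aware attentions to improve model performance.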

Details

Motivation: Existing causal masking strategies come from text-only models and adapt poorly to vision queries, which limits the model's ability to use future semantic cues. Method: Empirically analyzes the effect of different causal masking strategies and proposes a lightweight attention family that aggregates future visual context via pooling. Result: Experiments show that selectively compressing future semantic context into past representations improves inference. Conclusion: The future-aware attention family preserves the autoregressive structure while strengthening cross-token dependencies, which benefits vision-language inference. Abstract: Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model's capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies. We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits the inference.

[35] Spiking Transformers Need High Frequency Information

Yuetong Fang,Deming Zhou,Ziqing Wang,Hongwei Ren,ZeCui Zeng,Lusong Li,Shibo Zhou,Renjing Xu

Main category: cs.CV

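TL;DR: Spiking Transformers transmit information through binary spikes but underperform conventional neural networks. The study finds that spiking neurons favor low-frequency information and identifies the loss of high-frequency information as the likely main cause of the gap. By enhancing high-frequency signals with Max-Pooling and related operators, Max-Former improves on Spikformer by 7.58 percentage points on ImageNet.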

Details

Motivation: Investigates why Spiking Transformers underperform conventional neural networks and proposes a remedy. Method: Enhances high-frequency signals with Max-Pooling and Depth-Wise Convolution, and designs the Max-Former model around them. Result: Max-Former reaches 82.39% top-1 accuracy on ImageNet, 7.58 percentage points above Spikformer. Conclusion: Restoring high-frequency signals is key to improving Spiking Transformers, and future research should attend to their distinctive properties. Abstract: Spiking Transformers offer an energy-efficient alternative to conventional deep learning by transmitting information solely through binary (0/1) spikes. However, there remains a substantial performance gap compared to artificial neural networks. A common belief is that their binary and sparse activation transmission leads to information loss, thus degrading feature representation and accuracy. In this work, however, we reveal for the first time that spiking neurons preferentially propagate low-frequency information. We hypothesize that the rapid dissipation of high-frequency components is the primary cause of performance degradation. For example, on CIFAR-100, adopting Avg-Pooling (low-pass) for token mixing lowers performance to 76.73%; interestingly, replacing it with Max-Pooling (high-pass) pushes the top-1 accuracy to 79.12%, surpassing the well-tuned Spikformer baseline by 0.97%. Accordingly, we introduce Max-Former that restores high-frequency signals through two frequency-enhancing operators: extra Max-Pooling in patch embedding and Depth-Wise Convolution in place of self-attention. Notably, our Max-Former (63.99 M) hits the top-1 accuracy of 82.39% on ImageNet, showing a +7.58% improvement over Spikformer with comparable model size (74.81%, 66.34 M). We hope this simple yet effective solution inspires future research to explore the distinctive nature of spiking neural networks, beyond the established practice in standard deep learning.

[36] Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

Weizhi Zhong,Huan Yang,Zheng Liu,Huiguo He,Zijian He,Xuesong Niu,Di Zhang,Guanbin Li

Main category: cs.CV

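TL;DR: Proposes a tuning-free method for personalized text-to-image generation that supports both object and abstract concepts, achieving efficient multi-concept customization through the modulation mechanism and a Mod-Adapter module.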

Details

Motivation: Existing methods are mostly limited to object concepts and handle abstract concepts (e.g., pose, lighting) poorly; those that do support them require test-time fine-tuning, which is time-consuming and prone to overfitting. Method: Building on the modulation mechanism of pretrained DiTs, the Mod-Adapter module combines vision-language cross-attention with MoE layers to map concept features into the modulation space, and a VLM-guided pretraining strategy is introduced. Result: On an extended benchmark, the method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations. Conclusion: The method requires no test-time fine-tuning, efficiently supports both object and abstract concepts, and offers a new approach to personalized image generation. Abstract: Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most methods are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pretrained Diffusion Transformer (DiT) models, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, to predict concept-specific modulation directions for the modulation process of concept-related text tokens. It incorporates vision-language cross-attention for extracting concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map the concept features into the modulation space. Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pretraining strategy that leverages the strong image understanding capabilities of vision-language models to provide semantic supervision signals. For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.

[37] SerendibCoins: Exploring The Sri Lankan Coins Dataset

NH Wanigasingha,ES Sithpahan,MKA Ariyaratne,PRS De Silva

Main category: cs.CV

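TL;DR: The study introduces a Sri Lankan coin image dataset and evaluates its impact on the classification accuracy of machine learning models. SVM performs best among the traditional classifiers, while the CNN model achieves near-perfect classification accuracy.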

Details

Motivation: Coin recognition and classification are essential in financial and automated systems, but no dataset exists for Sri Lankan coins; the study aims to fill this gap. Method: Evaluate traditional classifiers (KNN, SVM, Random Forest) and a custom CNN model at different levels of classification. Result: SVM performs best among the traditional methods, while the CNN model achieves near-perfect accuracy with very few misclassifications. Conclusion: The dataset provides a solid foundation for automated coin recognition systems and is expected to support future research on regional currency classification and deep learning applications. Abstract: The recognition and classification of coins are essential in numerous financial and automated systems. This study introduces a comprehensive Sri Lankan coin image dataset and evaluates its impact on machine learning model accuracy for coin classification. We experiment with traditional machine learning classifiers, namely K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forest, as well as a custom Convolutional Neural Network (CNN), to benchmark performance at different levels of classification. Our results show that SVM outperforms KNN and Random Forest in traditional classification approaches, while the CNN model achieves near-perfect classification accuracy with minimal misclassifications. The dataset demonstrates significant potential in enhancing automated coin recognition systems, offering a robust foundation for future research in regional currency classification and deep learning applications.

[38] SuperGS: Consistent and Detailed 3D Super-Resolution Scene Reconstruction via Gaussian Splatting

Shiyun Xie,Zhiru Wang,Yinghao Zhu,Xu Wang,Chengwei Pan,Xiwang Dong

Main category: cs.CV

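TL;DR: SuperGS extends Scaffold-GS with a two-stage training framework to address the difficulties 3DGS faces in high-resolution novel view synthesis, introducing a latent feature field and a multi-view consistent densification strategy; experiments show it outperforms existing methods.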

Details

Motivation: 3DGS performs poorly in high-resolution novel view synthesis because primitives derived from low-resolution input views are coarse. Method: A two-stage training framework: the low-resolution stage initializes the scene with a latent feature field, and the high-resolution stage refines it through a multi-view consistent densification strategy and variational feature learning. Result: SuperGS outperforms existing high-resolution novel view synthesis methods on forward-facing and 360-degree datasets. Conclusion: Through its two-stage framework and uncertainty modeling, SuperGS achieves high-quality high-resolution novel view synthesis. Abstract: Recently, 3D Gaussian Splatting (3DGS) has excelled in novel view synthesis (NVS) with its real-time rendering capabilities and superior quality. However, it encounters challenges for high-resolution novel view synthesis (HRNVS) due to the coarse nature of primitives derived from low-resolution input views. To address this issue, we propose SuperGS, an expansion of Scaffold-GS designed with a two-stage coarse-to-fine training framework. In the low-resolution stage, we introduce a latent feature field to represent the low-resolution scene, which serves as both the initialization and foundational information for super-resolution optimization. In the high-resolution stage, we propose a multi-view consistent densification strategy that backprojects high-resolution depth maps based on error maps and employs a multi-view voting mechanism, mitigating ambiguities caused by multi-view inconsistencies in the pseudo labels provided by 2D prior models while avoiding Gaussian redundancy. Furthermore, we model uncertainty through variational feature learning and use it to guide further scene representation refinement and adjust the supervisory effect of pseudo-labels, ensuring consistent and detailed scene reconstruction. Extensive experiments demonstrate that SuperGS outperforms state-of-the-art HRNVS methods on both forward-facing and 360-degree datasets.

[39] ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos

Xiaodong Wang,Peixi Peng

Main category: cs.CV

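TL;DR: ProphetDWM is a novel end-to-end driving world model that jointly predicts future videos and actions, addressing the limitations of existing methods in action control and prediction.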

Details

Motivation: Real-world driving requires observing the environment and anticipating the future; existing world models fall short in action control and prediction and do not account for dynamical action laws. Method: ProphetDWM comprises an action module that learns latent actions and a diffusion-model-based transition module that learns the state distribution; the two are trained jointly to connect action dynamics and states. Result: On the nuScenes dataset, ProphetDWM performs best in video generation and action prediction and supports high-quality long-term prediction. Conclusion: By jointly learning action and video prediction, ProphetDWM improves driving world model performance and is suited to long-term prediction. Abstract: Real-world driving requires people to observe the current environment, anticipate the future, and make appropriate driving decisions. This requirement is aligned well with the capabilities of world models, which understand the environment and predict the future. However, recent world models in autonomous driving are built explicitly, where they could predict the future by controllable driving video generation. We argue that driving world models should have two additional abilities: action control and action prediction. Following this line, previous methods are limited because predicting the video requires given actions of the same length as the video, and because they ignore the dynamical action laws. To address these issues, we propose ProphetDWM, a novel end-to-end driving world model that jointly predicts future videos and actions. Our world model has an action module that learns latent actions from the present to the future period, given the action sequence and observations, and a diffusion-model-based transition module that learns the state distribution. The model is jointly trained by learning latent actions given finite states and predicting action and video. The joint learning connects the action dynamics and states and enables long-term future prediction. We evaluate our method in video generation and action prediction tasks on the nuScenes dataset. Compared to the state-of-the-art methods, our method achieves the best video consistency and best action prediction accuracy, while also enabling high-quality long-term video and action generation.

[40] Why Not Replace? Sustaining Long-Term Visual Localization via Handcrafted-Learned Feature Collaboration on CPU

Yicheng Lin,Yunlong Jiang,Xujia Jiao,Bin Han

Main category: cs.CV

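TL;DR: Proposes a hierarchical localization framework that combines handcrafted and learned features for long-term visual localization in complex industrial environments, substantially reducing error and improving localization consistency.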

Details

Motivation: Existing methods for long-term visual localization in complex industrial environments have limitations: handcrafted features are sensitive to illumination, learned features are computationally expensive, and semantic- or marker-based methods are constrained by the environment. Handcrafted and learned features are functionally complementary and should be integrated rather than substituted for one another. Method: A hierarchical localization framework that extracts handcrafted features in real time for relative pose estimation while selectively detecting learned keypoints on keyframes for absolute positioning. Result: Experiments show a 47% average error reduction under illumination changes, with markedly improved localization consistency. Conclusion: A hierarchical framework integrating handcrafted and learned features achieves efficient, robust long-term visual localization. Abstract: Robust long-term visual localization in complex industrial environments is critical for mobile robotic systems. Existing approaches face limitations: handcrafted features are illumination-sensitive, learned features are computationally intensive, and semantic- or marker-based methods are environmentally constrained. Handcrafted and learned features share similar representations but differ functionally. Handcrafted features are optimized for continuous tracking, while learned features excel in wide-baseline matching. Their complementarity calls for integration rather than replacement. Building on this, we propose a hierarchical localization framework. It leverages real-time handcrafted feature extraction for relative pose estimation. In parallel, it employs selective learned keypoint detection on optimized keyframes for absolute positioning. This design enables CPU-efficient, long-term visual localization. Experiments progress systematically through three validation phases: first, establishing feature complementarity through comparative analysis; second, profiling computational latency across algorithm stages on CPU platforms; and finally, evaluating under photometric variations (including seasonal transitions and diurnal cycles), which demonstrates a 47% average error reduction with significantly improved localization consistency. The code implementation is publicly available at https://github.com/linyicheng1/ORB_SLAM3_localization.
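
The division of labor described above (frequent relative pose updates, occasional absolute fixes on keyframes) can be sketched in 2D as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function names, the SE(2) pose representation `(x, y, theta)`, and the rule that an absolute fix simply overwrites the running estimate are all hypothetical.

```python
import math


def compose(pose, delta):
    """Apply a relative motion (expressed in the robot frame) to a 2D pose."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (x + dx * math.cos(theta) - dy * math.sin(theta),
            y + dx * math.sin(theta) + dy * math.cos(theta),
            theta + dtheta)


def localize(start, relative_steps, absolute_fixes):
    """Chain relative poses; on keyframes, replace the estimate with an absolute fix.

    relative_steps: per-frame relative motions (standing in for handcrafted-feature tracking).
    absolute_fixes: dict mapping frame index -> absolute pose
                    (standing in for learned-feature matching against a map).
    """
    pose = start
    trajectory = []
    for index, delta in enumerate(relative_steps):
        pose = compose(pose, delta)
        if index in absolute_fixes:
            # Hypothetical policy: trust the absolute fix completely.
            pose = absolute_fixes[index]
        trajectory.append(pose)
    return trajectory


# Two forward steps of 1 m; the second frame is a keyframe with an absolute fix.
trajectory = localize((0.0, 0.0, 0.0),
                      [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                      {1: (2.5, 0.0, 0.0)})
print(trajectory)
```

A real system would fuse the two estimates rather than overwrite one with the other; the sketch only shows why drift accumulated by relative tracking can be bounded by sparse absolute positioning.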

Zhenglin Huang,Tianxiao Li,Xiangtai Li,Haiquan Wen,Yiwei He,Jiangning Zhang,Hao Fei,Xi Yang,Xiaowei Huang,Bei Peng,Guangliang Cheng

Main category: cs.CV

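TL;DR: The paper introduces the So-Fake-Set dataset and the So-Fake-R1 detection framework to improve detection of synthetic images on social media, outperforming existing methods in both accuracy and localization.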

Details

Motivation: AI-generated synthetic images threaten information integrity and public trust on social media, and existing datasets and methods fall short in diversity and generalization. Method: Introduces the So-Fake-Set dataset (2 million images, 35 generative models) and the So-Fake-OOD benchmark (100K images), and develops the So-Fake-R1 framework, a vision-language model trained with reinforcement learning. Result: So-Fake-R1 improves detection accuracy by 1.3% and localization IoU by 4.5% over the second-best method. Conclusion: The combination of dataset, benchmark, and detection framework lays a foundation for social media forgery detection research. Abstract: Recent advances in AI-powered generative models have enabled the creation of increasingly realistic synthetic images, posing significant risks to information integrity and public trust on social media platforms. While robust detection frameworks and diverse, large-scale datasets are essential to mitigate these risks, existing academic efforts remain limited in scope: current datasets lack the diversity, scale, and realism required for social media contexts, while detection methods struggle with generalization to unseen generative technologies. To bridge this gap, we introduce So-Fake-Set, a comprehensive social media-oriented dataset with over 2 million high-quality images, diverse generative sources, and photorealistic imagery synthesized using 35 state-of-the-art generative models. To rigorously evaluate cross-domain robustness, we establish a novel and large-scale (100K) out-of-domain benchmark (So-Fake-OOD) featuring synthetic imagery from commercial models explicitly excluded from the training distribution, creating a realistic testbed for evaluating real-world performance. Leveraging these resources, we present So-Fake-R1, an advanced vision-language framework that employs reinforcement learning for highly accurate forgery detection, precise localization, and explainable inference through interpretable visual rationales. Extensive experiments show that So-Fake-R1 outperforms the second-best method, with a 1.3% gain in detection accuracy and a 4.5% increase in localization IoU. By integrating a scalable dataset, a challenging OOD benchmark, and an advanced detection framework, this work establishes a new foundation for social media-centric forgery detection research. The code, models, and datasets will be released publicly.

[42] DVD-Quant: Data-free Video Diffusion Transformers Quantization

Zhiteng Li,Hanxuan Li,Junyi Wu,Kai Liu,Linghe Kong,Guihai Chen,Yulun Zhang,Xiaokang Yang

Main category: cs.CV

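TL;DR: DVD-Quant is a quantization framework that requires no calibration data; using PBQ, ARQ, and δ-GBS, it substantially improves the quantization efficiency of Video DiTs while preserving video quality.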

Details

Motivation: Existing quantization methods for Video DiTs depend on complex calibration and suffer severe performance degradation, so an efficient alternative is needed. Method: Proposes PBQ, ARQ, and δ-GBS to achieve data-free quantization with adaptive bit-width allocation. Result: Across multiple benchmarks, DVD-Quant achieves roughly 2× speedup while maintaining visual quality, and is the first to enable W4A4 PTQ for Video DiTs. Conclusion: DVD-Quant provides an efficient, high-quality quantization solution for Video DiTs. Abstract: Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on lengthy, computation-heavy calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Progressive Bounded Quantization (PBQ) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) $\delta$-Guided Bit Switching ($\delta$-GBS) for adaptive bit-width allocation. Extensive experiments across multiple video generation benchmarks demonstrate that DVD-Quant achieves an approximately 2$\times$ speedup over full-precision baselines on HunyuanVideo while maintaining visual fidelity. Notably, DVD-Quant is the first to enable W4A4 PTQ for Video DiTs without compromising video quality. Code and models will be available at https://github.com/lhxcs/DVD-Quant.
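
For readers unfamiliar with what 4-bit quantization involves, the following is a minimal sketch of plain symmetric uniform quantization, which needs no calibration data because the scale is derived from the tensor itself. It is a generic baseline and not PBQ, ARQ, or δ-GBS, whose details are not given in the abstract; the function names are hypothetical.

```python
def quantize_symmetric(values, bits=4):
    """Map floats to signed integers in [-(2**(bits-1)), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    qmin = -qmax - 1                    # -8 for 4-bit
    max_abs = max(abs(v) for v in values)
    # One scale per tensor, computed from the data being quantized.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.4, 0.0, 0.3, 1.0]   # illustrative values
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
print(codes, restored)
```

With round-to-nearest, each restored value differs from the original by at most half a quantization step. With only 16 levels that step is large, which is one reason low-bit settings such as W4A4 (4-bit weights and activations) are hard to reach without quality loss.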

[43] ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation

Zhen Li,Yukai Guo,Duan Li,Xinyuan Guo,Bowen Li,Lanxi Xiao,Shenyu Qiao,Jiashu Chen,Zijian Wu,Hui Zhang,Xinhuan Shu,Shixia Liu

Main category: cs.CV

TL;DR: ChartGalaxy is a million-scale dataset designed to improve the ability of large vision-language models to understand and generate infographic charts.

Details

Motivation: Infographic charts combine visual and textual elements, and their complexity challenges large vision-language models that are typically trained on plain charts. Method: The dataset is built through an inductive process that identifies 75 chart types, 330 chart variations, and 68 layout templates, then generates synthetic charts programmatically. Result: The dataset proves useful for fine-tuning chart understanding, benchmarking code generation, and example-based chart generation. Conclusion: By capturing the complexity of real designs, ChartGalaxy provides a valuable resource for enhancing multimodal reasoning and generation. Abstract: Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 330 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.

[44] Restoring Real-World Images with an Internal Detail Enhancement Diffusion Model

Peng Xiao,Hongbo Zhao,Yijun Wang,Jianxin Lin

Main category: cs.CV

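TL;DR: Proposes an internal detail-enhancement diffusion model built on pretrained Stable Diffusion for high-fidelity restoration of real-world degraded images, with support for text-guided restoration and object-level color control.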

Details

Motivation: Existing data-driven methods struggle to maintain high fidelity and to offer object-level color control when restoring images with complex degradations. Method: Uses a pretrained Stable Diffusion model as a generative prior together with the Internal Image Detail Enhancement (IIDE) technique, injecting degradation operations in latent space to simulate degradation effects. Result: The method significantly outperforms state-of-the-art models in qualitative and quantitative evaluations and supports text-guided restoration and object-level color control. Conclusion: The method offers an efficient and controllable solution for high-fidelity restoration of real-world degraded images. Abstract: Restoring real-world degraded images, such as old photographs or low-resolution images, presents a significant challenge due to the complex, mixed degradations they exhibit, such as scratches, color fading, and noise. Recent data-driven approaches have struggled with two main challenges: achieving high-fidelity restoration and providing object-level control over colorization. While diffusion models have shown promise in generating high-quality images with specific controls, they often fail to fully preserve image details during restoration. In this work, we propose an internal detail-preserving diffusion model for high-fidelity restoration of real-world degraded images. Our method utilizes a pre-trained Stable Diffusion model as a generative prior, eliminating the need to train a model from scratch. Central to our approach is the Internal Image Detail Enhancement (IIDE) technique, which directs the diffusion model to preserve essential structural and textural information while mitigating degradation effects. The process starts by mapping the input image into a latent space, where we inject the diffusion denoising process with degradation operations that simulate the effects of various degradation factors. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art models in both qualitative assessments and perceptual quantitative evaluations. Additionally, our approach supports text-guided restoration, enabling object-level colorization control that mimics the expertise of professional photo editing.

[45] Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

Sicheng Feng,Song Wang,Shuyi Ouyang,Lingdong Kong,Zikai Song,Jianke Zhu,Huan Wang,Xinchao Wang

Main category: cs.CV

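TL;DR: ReasonMap is a benchmark for evaluating the fine-grained visual understanding and spatial reasoning of multimodal large language models (MLLMs), comprising transit maps from 30 cities and 1,008 question-answer pairs. The evaluation finds that among open-source models base variants outperform reasoning variants, whereas the opposite holds for closed-source models.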

Details

Motivation: The ability of existing MLLMs on fine-grained visual reasoning tasks has not been adequately evaluated, and a new benchmark is needed to fill this gap. Method: Introduces the ReasonMap benchmark, with high-resolution transit maps and diverse question-answer pairs, and designs a two-level evaluation pipeline. Result: Among open-source models, base models outperform reasoning models, while the opposite holds for closed-source models; performance degrades when visual inputs are masked. Conclusion: ReasonMap offers new insights for visual reasoning research and sheds light on the gap between open-source and closed-source models. Abstract: Multimodal large language models (MLLMs) have recently achieved significant progress in visual tasks, including semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks involving fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.

[46] Manifold-aware Representation Learning for Degradation-agnostic Image Restoration

Bin Ren,Yawei Li,Xu Zheng,Yuqian Fu,Danda Pani Paudel,Ming-Hsuan Yang,Luc Van Gool,Nicu Sebe

Main category: cs.CV

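TL;DR: MIRAGE is a lightweight framework that decomposes the input feature space into three parallel branches (global context, local texture, and channel statistics) and applies contrastive learning in SPD manifold space, markedly improving image restoration performance and generalization.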

Details

Motivation: Existing image restoration methods usually treat the problem as a direct mapping and ignore the structural diversity of different degradation types, which limits performance. Method: MIRAGE decomposes the input feature space into three parallel branches that handle global, local, and channel information respectively, and performs cross-layer contrastive learning in SPD manifold space. Result: Experiments show that MIRAGE sets a new state of the art across multiple degradation types while remaining efficient and scalable. Conclusion: Through modular decomposition and contrastive learning, MIRAGE offers an efficient and general solution for image restoration. Abstract: Image Restoration (IR) aims to recover high-quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low-light conditions. Despite recent advances, most existing approaches treat IR as a direct mapping problem, relying on shared representations across degradation types without modeling their structural diversity. In this work, we present MIRAGE, a unified and lightweight framework for all-in-one IR that explicitly decomposes the input feature space into three semantically aligned parallel branches, each processed by a specialized module: attention for global context, convolution for local textures, and MLP for channel-wise statistics. This modular decomposition significantly improves generalization and efficiency across diverse degradations. Furthermore, we introduce a cross-layer contrastive learning scheme that aligns shallow and latent features to enhance the discriminability of shared representations. To better capture the underlying geometry of feature representations, we perform contrastive learning in a Symmetric Positive Definite (SPD) manifold space rather than the conventional Euclidean space. Extensive experiments show that MIRAGE not only achieves new state-of-the-art performance across a variety of degradation types but also offers a scalable solution for challenging all-in-one IR scenarios. Our code and models will be publicly available at https://amazingren.github.io/MIRAGE/.
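
To make "comparing features on an SPD manifold rather than in Euclidean space" concrete, here is a small sketch of the log-Euclidean distance between 2×2 symmetric positive definite matrices. The abstract does not say which SPD metric MIRAGE uses, so the choice of the log-Euclidean metric is our assumption, as are the function names; the 2×2 closed form is used only to keep the example dependency-free.

```python
import math


def spd_log_2x2(matrix):
    """Matrix logarithm of a 2x2 symmetric positive definite matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = matrix
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam_hi, lam_lo = mean + radius, mean - radius   # eigenvalues
    if lam_lo <= 0:
        raise ValueError("matrix is not positive definite")
    if radius == 0.0:
        # The matrix is a multiple of the identity.
        return [[math.log(lam_hi), 0.0], [0.0, math.log(lam_hi)]]
    # log(S) = alpha * S + beta * I, with alpha and beta fitted on the two eigenvalues.
    alpha = (math.log(lam_hi) - math.log(lam_lo)) / (lam_hi - lam_lo)
    beta = math.log(lam_hi) - alpha * lam_hi
    return [[alpha * a + beta, alpha * b], [alpha * b, alpha * c + beta]]


def log_euclidean_distance(first, second):
    """Frobenius norm of log(first) - log(second)."""
    log_first, log_second = spd_log_2x2(first), spd_log_2x2(second)
    return math.sqrt(sum((log_first[i][j] - log_second[i][j]) ** 2
                         for i in range(2) for j in range(2)))


identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[math.e, 0.0], [0.0, math.e]]
print(log_euclidean_distance(identity, scaled))
```

In a contrastive setup, a distance of this kind would replace the Euclidean or cosine distance between feature statistics such as covariance matrices; how MIRAGE builds its SPD representations is not specified in the abstract.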

[47] WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation

Yang Liu,Silin Cheng,Xinwei He,Sebastien Ourselin,Lei Tan,Gen Luo

Main category: cs.CV

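TL;DR: The paper proposes WeakMCN, a multi-task collaborative network that jointly learns weakly supervised referring expression comprehension (WREC) and segmentation (WRES) with a dual-branch architecture, and introduces Dynamic Visual Feature Enhancement (DVFE) and a Collaborative Consistency Module (CCM) to improve performance.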

Details

Motivation: WREC and WRES have traditionally been modeled separately, but the authors argue that joint learning can improve performance. Method: WeakMCN uses a dual-branch architecture in which the WREC branch is based on contrastive learning and supervises the WRES branch; the DVFE and CCM modules facilitate multi-task collaboration. Result: On RefCOCO and related benchmarks, WeakMCN outperforms single-task methods by a clear margin and also performs well in semi-supervised settings. Conclusion: Through multi-task collaboration, WeakMCN improves both WREC and WRES and shows strong generalization. Abstract: Weakly supervised referring expression comprehension (WREC) and segmentation (WRES) aim to learn object grounding based on a given expression using weak supervision signals like image-text pairs. While these tasks have traditionally been modeled separately, we argue that they can benefit from joint learning in a multi-task framework. To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. Specifically, the WREC branch is formulated as anchor-based contrastive learning, which also acts as a teacher to supervise the WRES branch. In WeakMCN, we propose two innovative designs to facilitate multi-task collaboration, namely Dynamic Visual Feature Enhancement (DVFE) and Collaborative Consistency Module (CCM). DVFE dynamically combines various pre-trained visual knowledge to meet different task requirements, while CCM promotes cross-task consistency from the perspective of optimization. Extensive experimental results on three popular REC and RES benchmarks, i.e., RefCOCO, RefCOCO+, and RefCOCOg, consistently demonstrate performance gains of WeakMCN over state-of-the-art single-task alternatives, e.g., up to 3.91% and 13.11% on RefCOCO for WREC and WRES tasks, respectively. Furthermore, experiments also validate the strong generalization ability of WeakMCN in both semi-supervised REC and RES settings against existing methods, e.g., +8.94% for semi-REC and +7.71% for semi-RES on 1% RefCOCO. The code is publicly available at https://github.com/MRUIL/WeakMCN.

[48] Affective Image Editing: Shaping Emotional Factors via Text Descriptions

Peixuan Zhang,Shuchen Weng,Chengxuan Zhu,Binghao Tang,Zijian Jia,Si Li,Boxin Shi

Main category: cs.CV

TL;DR: AIEdiT is a text-driven affective image editing system that adjusts emotional factors in an image to meet users' emotional requests.

Details

Motivation: Current text-driven image editing research pays little attention to users' emotional requests, a gap AIEdiT aims to fill. Method: Builds a continuous emotional spectrum to represent emotional priors, designs an emotional mapper that translates abstract emotional requests into concrete semantic representations, and uses an MLLM to supervise model training. Result: Experiments show that AIEdiT effectively reflects users' emotional requests and achieves superior performance. Conclusion: AIEdiT provides an effective solution for emotion-driven image editing. Abstract: In daily life, images as common affective stimuli have widespread applications. Despite significant progress in text-driven image editing, there is limited work focusing on understanding users' emotional requests. In this paper, we introduce AIEdiT for Affective Image Editing using Text descriptions, which evokes specific emotions by adaptively shaping multiple emotional factors across entire images. To represent universal emotional priors, we build the continuous emotional spectrum and extract nuanced emotional requests. To manipulate emotional factors, we design the emotional mapper to translate visually-abstract emotional requests to visually-concrete semantic representations. To ensure that editing results evoke specific emotions, we introduce an MLLM to supervise the model training. During inference, we strategically distort visual elements and subsequently shape corresponding emotional factors to edit images according to users' instructions. Additionally, we introduce a large-scale dataset that includes the emotion-aligned text and image pair set for training and evaluation. Extensive experiments demonstrate that AIEdiT achieves superior performance, effectively reflecting users' emotional requests.

[49] GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

Chun Wang,Xiaoran Pan,Zihao Pan,Haofan Wang,Yiren Song

Main category: cs.CV

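TL;DR: The paper proposes the Geo Reason Enhancement (GRE) Suite, which augments vision-language models (VLMs) with structured reasoning chains to address weak reasoning mechanisms and poor explainability in geo-localization.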

Details

Motivation: Geo-localization requires extracting multigranular visual cues from images and combining them with external world knowledge, yet existing methods lack robust reasoning mechanisms and explainability. Method: The GRE Suite has three key components: the GRE30K dataset, the GRE model (with a multi-stage reasoning strategy), and the GREval-Bench evaluation framework. Result: Experiments show that GRE significantly outperforms existing methods at all granularities of geo-localization. Conclusion: The GRE Suite demonstrates the effectiveness of reasoning-augmented VLMs in complex geographic inference. Abstract: Recent advances in Visual Language Models (VLMs) have demonstrated exceptional performance in visual reasoning tasks. However, geo-localization presents unique challenges, requiring the extraction of multigranular visual cues from images and their integration with external world knowledge for systematic reasoning. Current approaches to geo-localization tasks often lack robust reasoning mechanisms and explainability, limiting their effectiveness. To address these limitations, we propose the Geo Reason Enhancement (GRE) Suite, a novel framework that augments VLMs with structured reasoning chains for accurate and interpretable location inference. The GRE Suite is systematically developed across three key dimensions: dataset, model, and benchmark. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision. Finally, we construct the Geo Reason Evaluation Benchmark (GREval-Bench), a comprehensive evaluation framework that assesses VLMs across diverse urban, natural, and landmark scenes to measure both coarse-grained (e.g., country, continent) and fine-grained (e.g., city, street) localization performance. Experimental results demonstrate that GRE significantly outperforms existing methods across all granularities of geo-localization tasks, underscoring the efficacy of reasoning-augmented VLMs in complex geographic inference. Code and data will be released at https://github.com/Thorin215/GRE.

[50] Deep Learning for Breast Cancer Detection: Comparative Analysis of ConvNeXT and EfficientNet

Mahmudul Hasan

Main category: cs.CV

TL;DR: The paper compares two convolutional neural networks, ConvNeXT and EfficientNet, for breast cancer screening and finds that ConvNeXT performs better.

Details

Motivation: Breast cancer is a common cancer worldwide, and early detection and treatment are essential to reducing mortality. Method: Use ConvNeXT and EfficientNet to preprocess and classify mammograms and evaluate performance. Result: ConvNeXT outperforms EfficientNet in AUC, accuracy, and F-score. Conclusion: ConvNeXT offers better predictive performance for breast cancer screening. Abstract: Breast cancer is the most commonly occurring cancer worldwide. This cancer caused 670,000 deaths globally in 2022, as reported by the WHO. Yet since health officials began routine mammography screening in age groups deemed at risk in the 1980s, breast cancer mortality has decreased by 40% in high-income nations. Every day, more and more people are receiving a breast cancer diagnosis. Reducing cancer-related deaths requires early detection and treatment. This paper compares two convolutional neural networks, ConvNeXT and EfficientNet, for predicting the likelihood of cancer in mammograms from screening exams. Image preprocessing, classification, and performance evaluation are the main parts of the procedure. Several evaluation metrics were used to compare and evaluate the performance of the models. The results show that ConvNeXT performs better, with a 94.33% AUC score, 93.36% accuracy, and 95.13% F-score, compared to EfficientNet with a 92.34% AUC score, 91.47% accuracy, and 93.06% F-score on the RSNA screening mammography breast cancer dataset.
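
As a reminder of how two of the reported metrics are defined, the sketch below computes accuracy and F-score from confusion-matrix counts. It assumes the abstract's "F-score" means the F1 score (the harmonic mean of precision and recall); the counts are invented for illustration and are not the paper's data.

```python
def classification_metrics(true_pos, false_pos, false_neg, true_neg):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = true_pos + false_pos + false_neg + true_neg
    accuracy = (true_pos + true_neg) / total
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical screening results: 8 cancers found, 2 missed, 2 false alarms.
metrics = classification_metrics(true_pos=8, false_pos=2, false_neg=2, true_neg=8)
print(metrics)
```

In screening, where positive cases are rare, accuracy alone can look high for a model that misses many cancers, which is why the paper reports AUC and F-score alongside it.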

[51] FusionTrack: End-to-End Multi-Object Tracking in Arbitrary Multi-View Environment

Xiaohe Li,Pengfei Li,Zide Fan,Ying Geng,Fangli Mou,Haohua Wu,Yunping Ge

Main category: cs.CV

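TL;DR: The paper proposes FusionTrack, an end-to-end framework for free-viewpoint multi-object tracking, and validates its performance on the newly constructed MDMOT dataset.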

Details

Motivation: Existing studies rarely address genuinely free-viewpoint multi-object tracking, which limits the flexibility and scalability of cooperative tracking systems. Method: Constructs the MDMOT dataset and proposes the FusionTrack framework, which integrates tracking and re-identification to exploit multi-view information. Result: On MDMOT and other benchmark datasets, FusionTrack achieves state-of-the-art performance in both single-view and multi-view tracking. Conclusion: FusionTrack provides an effective solution for free-viewpoint multi-object tracking and shows potential for practical applications. Abstract: Multi-view multi-object tracking (MVMOT) has found widespread applications in intelligent transportation, surveillance systems, and urban management. However, existing studies rarely address genuinely free-viewpoint MVMOT systems, which could significantly enhance the flexibility and scalability of cooperative tracking systems. To bridge this gap, we first construct the Multi-Drone Multi-Object Tracking (MDMOT) dataset, captured by mobile drone swarms across diverse real-world scenarios, establishing the first benchmark for multi-object tracking in arbitrary multi-view environments. Building upon this foundation, we propose FusionTrack, an end-to-end framework that reasonably integrates tracking and re-identification to leverage multi-view information for robust trajectory association. Extensive experiments on our MDMOT and other benchmark datasets demonstrate that FusionTrack achieves state-of-the-art performance in both single-view and multi-view tracking.

[52] Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation

Wenchao Zhang,Jiahe Tian,Runze He,Jizhong Han,Jiao Dai,Miaomiao Feng,Wei Mi,Xiaodan Zhang

Main category: cs.CV

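TL;DR: The paper proposes the Align Beyond Prompts (ABP) benchmark for evaluating how well text-to-image models align generated images with real-world knowledge beyond the prompt, along with the ABPScore metric and the Inference-Time Knowledge Injection (ITKI) strategy.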

Details

Motivation: Existing benchmarks focus on the explicit alignment between generated images and prompts and neglect alignment with real-world knowledge. Method: ABP contains over 2,000 carefully designed prompts covering six scenarios, and the ABPScore metric is built on MLLMs. The ITKI strategy is also proposed to improve model outputs. Result: An evaluation of 8 popular T2I models shows that even advanced models such as GPT-4o are limited in integrating real-world knowledge. The ITKI strategy improves ABPScore by approximately 43%. Conclusion: ABP fills a gap in evaluation benchmarks, and the ITKI strategy markedly improves model performance, opening a direction for future research. Abstract: Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond the explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, which demonstrates strong correlations with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available at https://github.com/smile365317/ABP.

[53] Rethinking Direct Preference Optimization in Diffusion Models

Junyong Kang,Seohyun Lim,Kyungjune Baek,Hyunjung Shim

Main category: cs.CV

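TL;DR: Proposes a new approach to enhancing preference optimization for diffusion models, comprising a stable reference model update strategy and a timestep-aware training strategy, significantly improving performance.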

Details

Motivation: Aligning text-to-image diffusion models with human preferences is an important research challenge, and existing methods suffer from limited exploration. Method: Introduces a stable reference model update strategy and a timestep-aware training strategy, both of which can be integrated into a variety of preference optimization algorithms. Result: Experiments show that the approach improves the performance of existing methods on human preference evaluation benchmarks. Conclusion: Through a stable reference model and timestep-aware training, the new approach effectively improves preference optimization for diffusion models. Abstract: Aligning text-to-image (T2I) diffusion models with human preferences has emerged as a critical research challenge. While recent advances in this area have extended preference optimization techniques from large language models (LLMs) to the diffusion setting, they often struggle with limited exploration. In this work, we propose a novel and orthogonal approach to enhancing diffusion-based preference optimization. First, we introduce a stable reference model update strategy that relaxes the frozen reference model, encouraging exploration while maintaining a stable optimization anchor through reference model regularization. Second, we present a timestep-aware training strategy that mitigates the reward scale imbalance problem across timesteps. Our method can be integrated into various preference optimization algorithms. Experimental results show that our approach improves the performance of state-of-the-art methods on human preference evaluation benchmarks.

[54] MoMBS: Mixed-order minibatch sampling enhances model training from diverse-quality images

Han Li,Hu Han,S. Kevin Zhou

Main category: cs.CV

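TL;DR: The paper proposes a new Mixed-order Minibatch Sampling (MoMBS) method to make better use of training samples of diverse quality, addressing the imprecise measurement of sample hardness and the poor utilization of samples in conventional methods.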

Details

Motivation: Medical images in universal lesion detection (ULD) vary in quality (for example, in clarity and label correctness), and conventional training methods such as SCL and OHEM measure sample hardness imprecisely and utilize samples poorly when handling such data. Method: Proposes Mixed-order Minibatch Sampling (MoMBS), which measures sample hardness using both loss and uncertainty and optimizes sample utilization through a mixed-order design. Result: MoMBS categorizes high-loss samples more precisely and prioritizes underrepresented samples as the main gradient contributors, shielding them from the negative influence of poorly labeled or overfitted samples. Conclusion: MoMBS outperforms conventional methods on training samples of diverse quality and offers a more effective approach to training deep learning models. Abstract: Natural images exhibit label diversity (clean vs. noisy) in noisy-labeled image classification and prevalence diversity (abundant vs. sparse) in long-tailed image classification. Similarly, medical images in universal lesion detection (ULD) exhibit substantial variations in image quality, encompassing attributes such as clarity and label correctness. How to effectively leverage training images with diverse qualities becomes a problem in learning deep models. Conventional training mechanisms, such as self-paced curriculum learning (SCL) and online hard example mining (OHEM), relieve this problem by reweighting images with high loss values. Despite their success, these methods still confront two challenges: (i) the loss-based measure of sample hardness is imprecise, preventing optimum handling of different cases, and (ii) the identified hard samples are under-utilized in SCL or over-utilized in OHEM. To address these issues, this paper revisits minibatch sampling (MBS), a technique widely used in deep network training but largely unexplored concerning the handling of diverse-quality training samples. We discover that the samples within a minibatch influence each other during training; thus, we propose a novel Mixed-order Minibatch Sampling (MoMBS) method to optimize the use of training samples with diverse qualities. MoMBS introduces a measure that takes both loss and uncertainty into account to surpass a sole reliance on loss and allows for a more refined categorization of high-loss samples by distinguishing them as either poorly labeled and underrepresented or well represented and overfitted. We prioritize underrepresented samples as the main gradient contributors in a minibatch and keep them from the negative influences of poorly labeled or overfitted samples with a mixed-order minibatch sampling design.

[55] C3R: Channel Conditioned Cell Representations for unified evaluation in microscopy imaging

Umar Marikkar,Syed Sameed Husain,Muhammad Awais,Sara Atito

Main category: cs.CV

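TL;DR: Proposes Channel Conditioned Cell Representations (C3R), a framework based on context-concept channel grouping that addresses inconsistent channels across IHC image datasets and performs well on both ID and OOD tasks.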

Details

Motivation: IHC datasets have inconsistent channels because staining protocols differ, and existing methods cannot support zero-shot evaluation across datasets. Method: Groups channels into context and concept groups and develops the C3R framework, comprising a channel-adaptive encoder and a masked knowledge distillation training strategy. Result: C3R outperforms existing benchmarks on ID and OOD tasks, and a trivial implementation of its core idea outperforms the channel-adaptive methods reported on the CHAMMI benchmark. Conclusion: C3R opens a new path to cross-dataset generalization for IHC datasets without dataset-specific adaptation or retraining. Abstract: Immunohistochemical (IHC) images reveal detailed information about structures and functions at the subcellular level. However, unlike natural images, IHC datasets pose challenges for deep learning models due to their inconsistencies in channel count and configuration, stemming from varying staining protocols across laboratories and studies. Existing approaches build channel-adaptive models, which unfortunately fail to support out-of-distribution (OOD) evaluation across IHC datasets and cannot be applied in a true zero-shot setting with mismatched channel counts. To address this, we introduce a structured view of cellular image channels by grouping them into either context or concept, where we treat the context channels as a reference to the concept channels in the image. We leverage this context-concept principle to develop Channel Conditioned Cell Representations (C3R), a framework designed for unified evaluation on in-distribution (ID) and OOD datasets. C3R is a two-fold framework comprising a channel-adaptive encoder architecture and a masked knowledge distillation training strategy, both built around the context-concept principle. We find that C3R outperforms existing benchmarks on both ID and OOD tasks, while a trivial implementation of our core idea also outperforms the channel-adaptive methods reported on the CHAMMI benchmark. Our method opens a new pathway for cross-dataset generalization between IHC datasets, without requiring dataset-specific adaptation or retraining.

[56] ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models

Duo Li,Zuhao Yang,Shijian Lu

Main category: cs.CV

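TL;DR: ToDRE is a training-free, two-stage visual token compression framework that prunes tokens based on token diversity and task relevance, substantially reducing computational overhead.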

Details

Motivation: In large vision-language models (LVLMs), visual inputs are usually represented with far more tokens than textual inputs, causing heavy computational overhead. Most existing methods use token importance as the redundancy indicator and overlook token diversity and task relevance. Method: ToDRE works in two stages: (1) a greedy k-center algorithm retains a diverse subset of visual tokens; (2) task-irrelevant visual tokens are further removed inside the large language model (LLM) decoder. Result: Experiments show that ToDRE removes 90% of visual tokens after the vision encoder and adaptively prunes all visual tokens in certain LLM decoder layers, yielding a 2.6x speed-up in total inference time while retaining 95.1% of model performance. Conclusion: By pruning on token diversity and task relevance, ToDRE markedly improves computational efficiency and remains compatible with efficient attention operators. Abstract: The representation of visual inputs of large vision-language models (LVLMs) usually involves substantially more tokens than that of textual inputs, leading to significant computational overhead. Several recent studies strive to mitigate this issue by either conducting token compression to prune redundant visual tokens or guiding them to bypass certain computational stages. While most existing work exploits token importance as the redundancy indicator, our study reveals that two largely neglected factors, namely, the diversity of retained visual tokens and their task relevance, often offer more robust criteria in token pruning. To this end, we design ToDRE, a two-stage and training-free token compression framework that achieves superior performance by pruning Tokens based on token Diversity and token-task RElevance. Instead of pruning redundant tokens, ToDRE introduces a greedy k-center algorithm to select and retain a small subset of diverse visual tokens after the vision encoder. Additionally, ToDRE addresses the "information migration" by further eliminating task-irrelevant visual tokens within the decoder of the large language model (LLM). Extensive experiments show that ToDRE effectively reduces 90% of visual tokens after the vision encoder and adaptively prunes all visual tokens within certain LLM decoder layers, leading to a 2.6x speed-up in total inference time while maintaining 95.1% of model performance and excellent compatibility with efficient attention operators.
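
The greedy k-center selection named in the abstract can be illustrated with a minimal sketch. This is the textbook farthest-point heuristic, not the authors' implementation; the Euclidean distance, the choice of the first center, and the names `greedy_k_center` and `tokens` are assumptions made for illustration.

```python
import math

def greedy_k_center(tokens, k, first=0):
    """Pick k indices; each new pick is the token farthest from those already kept.

    Assumptions (not from the paper): Euclidean distance, first center = index 0.
    """
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and len(tokens)")
    selected = [first]
    # min_dist[i] is the distance from token i to its nearest selected center.
    min_dist = [math.dist(t, tokens[first]) for t in tokens]
    while len(selected) < k:
        nxt = max(range(len(tokens)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, t in enumerate(tokens):
            min_dist[i] = min(min_dist[i], math.dist(t, tokens[nxt]))
    return selected

# Toy 2D "token embeddings": two tight pairs and one outlier.
tokens = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0)]
kept = greedy_k_center(tokens, k=3)
```

On this toy input the procedure keeps one token from each well-separated group, which is the sense in which the retained subset is diverse rather than merely high-scoring.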

[57] StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations

Yanjie Li,Wenxuan Zhang,Xinqi Lyu,Yihao Liu,Bin Xiao

Main category: cs.CV

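TL;DR: StyleGuard is a new anti-mimicry method that optimizes style features in the latent space and introduces a new upscale loss, improving robustness against diffusion-based purification attacks and model-agnostic transferability.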

Details

Motivation: Text-to-image diffusion models are now widely used for style mimicry and personalized customization through methods such as DreamBooth and Textual Inversion, raising concerns about intellectual property protection and deceptive content generation. Existing defenses such as Glaze and Anti-DreamBooth are vulnerable to purification attacks and transfer poorly across models. Method: Proposes StyleGuard, which combines a novel style loss with a novel upscale loss: it optimizes style features in the latent space and involves an ensemble of purifiers and upscalers during training to strengthen the perturbation. Result: Experiments on the WikiArt and CelebA datasets show that StyleGuard outperforms existing methods under various transformations and purification attacks and is effective against multiple style mimicry techniques. Conclusion: With its style and upscale losses, StyleGuard markedly improves robustness and model-agnostic transferability against style mimicry, offering a more effective tool for intellectual property protection. Abstract: Recently, text-to-image diffusion models have been widely used for style mimicry and personalized customization through methods such as DreamBooth and Textual Inversion. This has raised concerns about intellectual property protection and the generation of deceptive content. Recent studies, such as Glaze and Anti-DreamBooth, have proposed using adversarial noise to protect images from these attacks. However, recent purification-based methods, such as DiffPure and Noise Upscaling, have successfully attacked these latest defenses, showing the vulnerabilities of these methods. Moreover, present methods show limited transferability across models, making them less effective against unknown text-to-image models. To address these issues, we propose a novel anti-mimicry method, StyleGuard. We propose a novel style loss that optimizes the style-related features in the latent space to make it deviate from the original image, which improves model-agnostic transferability. Additionally, to enhance the perturbation's ability to bypass diffusion-based purification, we designed a novel upscale loss that involves ensemble purifiers and upscalers during training. Extensive experiments on the WikiArt and CelebA datasets demonstrate that StyleGuard outperforms existing methods in robustness against various transformations and purifications, effectively countering style mimicry in various models. Moreover, StyleGuard is effective on different style mimicry methods, including DreamBooth and Textual Inversion.

[58] Dual-Path Stable Soft Prompt Generation for Domain Generalization

Yuedi Zhang,Shuanghao Bai,Wanqi Zhou,Zhirong Luan,Badong Chen

Main category: cs.CV

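TL;DR: The paper proposes Dual-Path Stable Soft Prompt Generation (DPSPG), which uses negative learning to improve prompt stability and generalization, and performs strongly on multiple domain generalization benchmark datasets.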

Details

Motivation: Existing prompt generation methods suffer from prompt variability: the same input yields inconsistent, suboptimal prompts under different random seeds, which hurts generalization. Method: Proposes the DPSPG framework, which incorporates negative learning and uses a complementary prompt generator to reduce misleading information, improving prompt stability and generalization. Result: Experiments on five domain generalization benchmark datasets show that DPSPG outperforms existing methods in both performance and prompt stability. Conclusion: Through negative learning, DPSPG effectively addresses prompt variability and improves model generalization and stability. Abstract: Domain generalization (DG) aims to learn a model using data from one or multiple related but distinct source domains that can generalize well to unseen out-of-distribution target domains. Inspired by the success of large pre-trained vision-language models (VLMs), prompt tuning has emerged as an effective generalization strategy. However, it often struggles to capture domain-specific features due to its reliance on manual or fixed prompt inputs. Recently, some prompt generation methods have addressed this limitation by dynamically generating instance-specific and domain-specific prompts for each input, enriching domain information and demonstrating potential for enhanced generalization. Through further investigation, we identify a notable issue in existing prompt generation methods: the same input often yields significantly different and suboptimal prompts across different random seeds, a phenomenon we term Prompt Variability. To address this, we introduce negative learning into the prompt generation process and propose Dual-Path Stable Soft Prompt Generation (DPSPG), a transformer-based framework designed to improve both the stability and generalization of prompts. Specifically, DPSPG incorporates a complementary prompt generator to produce negative prompts, thereby reducing the risk of introducing misleading information. Both theoretical and empirical analyses demonstrate that negative learning leads to more robust and effective prompts by increasing the effective margin and reducing the upper bound of the gradient norm. Extensive experiments on five DG benchmark datasets show that DPSPG consistently outperforms state-of-the-art methods while maintaining prompt stability.

[59] OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks

Jiayu Wang,Yang Jiao,Yue Yu,Tianwen Qian,Shaoxiang Chen,Jingjing Chen,Yu-Gang Jiang

Main category: cs.CV

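TL;DR: OmniGenBench is a new benchmark for comprehensively evaluating the instruction-following abilities of large multimodal models (LMMs) across perception-centric and cognition-centric dimensions.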

Details

Motivation: Current benchmarks are insufficient to fully evaluate the diverse capabilities of LMMs, so a more comprehensive evaluation tool is needed. Method: Designs OmniGenBench with 57 sub-tasks and a dual-mode evaluation protocol (visual parsing tools and an LLM-based judge). Result: Evaluates mainstream generative models (such as GPT-4o and Gemini-2.0-Flash) and provides an in-depth comparison of their performance. Conclusion: OmniGenBench provides a more comprehensive benchmark for evaluating LMMs and helps advance model capabilities. Abstract: Recent breakthroughs in large multimodal models (LMMs), such as the impressive GPT-4o-Native, have demonstrated remarkable proficiency in following general-purpose instructions for image generation. However, current benchmarks often lack the necessary breadth and depth to fully evaluate the diverse capabilities of these models. To overcome this limitation, we introduce OmniGenBench, a novel and comprehensive benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs across both perception-centric and cognition-centric dimensions. Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand. For rigorous evaluation, we further employ a dual-mode protocol. This protocol utilizes off-the-shelf visual parsing tools for perception-centric tasks and a powerful LLM-based judger for cognition-centric tasks to assess the alignment between generated images and user instructions. Using OmniGenBench, we evaluate mainstream generative models, including prevalent models like GPT-4o, Gemini-2.0-Flash, and Seedream, and provide in-depth comparisons and analyses of their performance. Code and data are available at https://github.com/emilia113/OmniGenBench.

[60] Think Twice before Adaptation: Improving Adaptability of DeepFake Detection via Online Test-Time Adaptation

Hong-Hanh Nguyen-Le,Van-Tuan Tran,Dinh-Thuc Nguyen,Nhien-An Le-Khac

Main category: cs.CV

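TL;DR: The paper proposes T²A, an online test-time adaptation method that improves the adaptability of deepfake detectors through an uncertainty-aware negative learning objective, without requiring source training data or labels.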

Details

Motivation: Addresses the performance degradation of deepfake detectors in real-world environments caused by postprocessing manipulations or distribution shifts. Method: Proposes T²A, which combines an uncertainty-aware negative learning objective, an uncertain sample prioritization strategy, and a gradient masking technique. Result: Theoretical analysis shows that the negative learning objective is complementary to entropy minimization, and experiments show that T²A outperforms existing TTA methods. Conclusion: T²A markedly improves detector adaptability and generalization, and the code is open source. Abstract: Deepfake (DF) detectors face significant challenges when deployed in real-world environments, particularly when encountering test samples that deviate from the training data through either postprocessing manipulations or distribution shifts. We demonstrate that postprocessing techniques can completely obscure the generation artifacts present in DF samples, leading to performance degradation of DF detectors. To address these challenges, we propose Think Twice before Adaptation (T²A), a novel online test-time adaptation method that enhances the adaptability of detectors during inference without requiring access to source training data or labels. Our key idea is to enable the model to explore alternative options through an Uncertainty-aware Negative Learning objective rather than solely relying on its initial predictions as commonly seen in entropy minimization (EM)-based approaches. We also introduce an Uncertain Sample Prioritization strategy and Gradients Masking technique to improve the adaptation by focusing on important samples and model parameters. Our theoretical analysis demonstrates that the proposed negative learning objective exhibits complementary behavior to EM, facilitating better adaptation capability. Empirically, our method achieves state-of-the-art results compared to existing test-time adaptation (TTA) approaches and significantly enhances the resilience and generalization of DF detectors during inference. Code is available at https://github.com/HongHanh2104/T2A-Think-Twice-Before-Adaptation.

[61] VORTA: Efficient Video Diffusion via Routing Sparse Attention

Wenhao Sun,Rong-Cheng Tu,Yifu Ding,Zhao Jin,Jingyi Liao,Shunyu Liu,Dacheng Tao

Main category: cs.CV

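TL;DR: VDiTs generate high-quality video but are computationally expensive; VORTA achieves substantial acceleration through a sparse attention mechanism and an adaptive routing strategy, and is compatible with other acceleration methods.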

Details

Motivation: Addresses the inefficiency of VDiTs caused by costly attention over high-dimensional video sequences, especially the inefficiency of redundant long-range interactions. Method: Proposes the VORTA framework, which pairs a sparse attention mechanism with an adaptive routing strategy that replaces full 3D attention. Result: Achieves a 1.76x end-to-end speedup without quality loss, and up to 14.41x when combined with other methods. Conclusion: VORTA markedly improves the efficiency of VDiTs and makes them more practical for real-world use. Abstract: Video Diffusion Transformers (VDiTs) have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent attention acceleration methods leverage the sparsity of attention patterns to improve efficiency; however, they often overlook inefficiencies of redundant long-range interactions. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants throughout the sampling process. It achieves a 1.76x end-to-end speedup without quality loss on VBench. Furthermore, VORTA can seamlessly integrate with various other acceleration methods, such as caching and step distillation, reaching up to 14.41x speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of VDiTs in real-world settings.

[62] SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models

Ye Sun,Hao Zhang,Henghui Ding,Tiehua Zhang,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

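TL;DR: The paper proposes the SAMA model and the SAMA-239K dataset to address fine-grained spatio-temporal understanding in video multimodal models, and evaluates performance with SAMA-Bench.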

Details

Motivation: Current video multimodal models struggle with fine-grained spatio-temporal understanding and lack high-quality unified video instruction data and an evaluation benchmark. Method: Proposes the SAMA model, which combines a spatio-temporal context aggregator with a Segment Anything Model, and builds the SAMA-239K dataset and the SAMA-Bench benchmark. Result: SAMA performs strongly on SAMA-Bench and sets a new state of the art on general grounding benchmarks. Conclusion: The SAMA model and dataset offer a unified solution for video multimodal tasks and markedly improve performance. Abstract: Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). Addressing this challenge requires mastering two core capabilities: video referring understanding, which captures the semantics of video regions, and video grounding, which segments object regions based on natural language descriptions. However, most existing approaches tackle these tasks in isolation, limiting progress toward unified, referentially grounded video interaction. We identify a key bottleneck in the lack of high-quality, unified video instruction data and a comprehensive benchmark for evaluating referentially grounded video chat. To address these challenges, we contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to enable joint learning of video referring understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities. Finally, we establish SAMA-Bench, a meticulously designed benchmark consisting of 5,067 questions from 522 videos, to comprehensively evaluate the integrated capabilities of Video LMMs in multi-turn, spatio-temporal referring understanding and grounded dialogue. Extensive experiments and benchmarking results show that SAMA not only achieves strong performance on SAMA-Bench but also sets a new state-of-the-art on general grounding benchmarks, while maintaining highly competitive performance on standard visual understanding benchmarks.

[63] Reasoning Segmentation for Images and Videos: A Survey

Yiqing Shen,Chenjia Li,Fei Xiong,Jeong-O Jeong,Tianpeng Wang,Michael Latman,Mathias Unberath

Main category: cs.CV

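TL;DR: This paper surveys the field of reasoning segmentation (RS), covering its methods, evaluation metrics, datasets, and applications, and identifies future research directions.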

Details

Motivation: RS aims to enable more intuitive human-AI interaction through natural language queries, bridging the gap between visual perception and human-like reasoning. Method: Analyzes 26 state-of-the-art RS methods and reviews the corresponding evaluation metrics along with 29 datasets and benchmarks. Result: Summarizes applications of RS across multiple domains and their potential extensions. Conclusion: Identifies gaps in current research and proposes promising future directions. Abstract: Reasoning Segmentation (RS) aims to delineate objects based on implicit text queries, the interpretation of which requires reasoning and knowledge integration. Unlike the traditional formulation of segmentation problems that relies on fixed semantic categories or explicit prompting, RS bridges the gap between visual perception and human-like reasoning capabilities, facilitating more intuitive human-AI interaction through natural language. Our work presents the first comprehensive survey of RS for image and video processing, examining 26 state-of-the-art methods together with a review of the corresponding evaluation metrics, as well as 29 datasets and benchmarks. We also explore existing applications of RS across diverse domains and identify their potential extensions. Finally, we identify current research gaps and highlight promising future directions.

[64] Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding

Guofeng Mei,Bin Ren,Juan Liu,Luigi Riz,Xiaoshui Huang,Xu Zheng,Yongshun Gong,Ming-Hsuan Yang,Nicu Sebe,Fabio Poiesi

Main category: cs.CV

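TL;DR: S4Token is a universal 3D tokenizer that combines superpoint-based grouping with coordinate scale normalization to overcome the scale sensitivity that limits cross-domain generalization in conventional methods, significantly improving performance.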

Details

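Motivation: Conventional 3D tokenization methods (such as k-nearest neighbor or radius-based tokenization) are sensitive to dataset-specific spatial scales, which leads to poor cross-domain generalization. The paper aims to design a scale-invariant 3D tokenizer that improves 3D scene understanding when paired with a frozen CLIP backbone. Method: Proposes the S4Token tokenization pipeline, which combines superpoint-based grouping with coordinate scale normalization to produce semantically informed tokens. It is trained without annotations using masked point modeling and clustering-based objectives, with cross-modal distillation aligning 3D tokens with 2D multi-view image features. For dense prediction tasks, a superpoint-level feature propagation module recovers point-level detail. Result: Experiments show that S4Token clearly outperforms conventional methods in cross-domain generalization, produces scale-invariant semantic tokens, and recovers detail in dense prediction tasks. Conclusion: S4Token is an effective universal 3D tokenizer that resolves the scale sensitivity of conventional methods and provides a more robust foundation for 3D scene understanding. Abstract: Vision-language models like CLIP can offer a promising foundation for 3D scene understanding when extended with 3D tokenizers. However, standard approaches, such as k-nearest neighbor or radius-based tokenization, struggle with cross-domain generalization due to sensitivity to dataset-specific spatial scales. We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. We show that combining superpoint-based grouping with coordinate scale normalization consistently outperforms conventional methods through extensive experimental analysis. Specifically, we introduce S4Token, a tokenization pipeline that produces semantically-informed tokens regardless of scene scale. Our tokenizer is trained without annotations using masked point modeling and clustering-based objectives, along with cross-modal distillation to align 3D tokens with 2D multi-view image features. For dense prediction tasks, we propose a superpoint-level feature propagation module to recover point-level detail from sparse tokens.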
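
To see why coordinate scale normalization removes the dependence on dataset-specific units, consider the following minimal sketch. The abstract does not state which normalization S4Token uses; centering on the centroid and dividing by the largest distance from it is one common choice and is an assumption here, as is the name `normalize_scale`.

```python
import math

def normalize_scale(points):
    """Center a 3D point set on its centroid and scale it to fit the unit sphere.

    Assumed normalization for illustration; the paper may use a different one.
    """
    n = len(points)
    centroid = tuple(sum(p[d] for p in points) / n for d in range(3))
    centered = [tuple(p[d] - centroid[d] for d in range(3)) for p in points]
    radius = max(math.hypot(*c) for c in centered)
    if radius == 0:  # all points coincide, nothing to scale
        return centered
    return [tuple(x / radius for x in c) for c in centered]

# The same shape expressed in two different units (for example metres vs. centimetres).
pts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
small = normalize_scale(pts)
big = normalize_scale([tuple(100.0 * x for x in p) for p in pts])
```

After normalization the two copies coincide up to floating-point error, so a fixed neighborhood radius or grouping threshold means the same thing in both scenes.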

[65] MSLAU-Net: A Hybird CNN-Transformer Network for Medical Image Segmentation

Libin Lan,Yanxin Li,Xiaojuan Liu,Juan Zhou,Jianxun Zhang,Nannan Huang,Yudong Zhang

Main category: cs.CV

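TL;DR: Proposes MSLAU-Net, a hybrid architecture that combines the strengths of CNNs and Transformers for medical image segmentation, addressing the weak global context of CNNs and the high computational complexity of Transformers.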

Details

Motivation: CNNs struggle to capture global information, while Transformers model local features poorly and are computationally expensive, so a method combining the strengths of both is needed. Method: Designs a multi-scale linear attention mechanism and a top-down feature aggregation mechanism that extract multi-scale features efficiently and reduce computational complexity. Result: Outperforms existing methods on multiple benchmark datasets, confirming its superiority and robustness. Conclusion: MSLAU-Net is an efficient and robust medical image segmentation method that combines the strengths of CNNs and Transformers. Abstract: Both CNN-based and Transformer-based methods have achieved remarkable success in medical image segmentation tasks. However, CNN-based methods struggle to effectively capture global contextual information due to the inherent limitations of convolution operations. Meanwhile, Transformer-based methods suffer from insufficient local feature modeling and face challenges related to the high computational complexity caused by the self-attention mechanism. To address these limitations, we propose a novel hybrid CNN-Transformer architecture, named MSLAU-Net, which integrates the strengths of both paradigms. The proposed MSLAU-Net incorporates two key ideas. First, it introduces Multi-Scale Linear Attention, designed to efficiently extract multi-scale features from medical images while modeling long-range dependencies with low computational complexity. Second, it adopts a top-down feature aggregation mechanism, which performs multi-level feature aggregation and restores spatial resolution using a lightweight structure. Extensive experiments conducted on benchmark datasets covering three imaging modalities demonstrate that the proposed MSLAU-Net outperforms other state-of-the-art methods on nearly all evaluation metrics, validating the superiority, effectiveness, and robustness of our approach. Our code is available at https://github.com/Monsoon49/MSLAU-Net.

[66] Localizing Knowledge in Diffusion Transformers

Arman Zarei,Samyadeep Basu,Keivan Rezaei,Zihao Lin,Sayan Nag,Soheil Feizi

Main category: cs.CV

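TL;DR: Proposes a model- and knowledge-agnostic method for localizing where specific knowledge is encoded within Diffusion Transformer (DiT) blocks, and validates it on multiple DiT models.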

Details

Motivation: Studies how knowledge is distributed in DiT models in order to improve interpretability, controllability, and adaptation. Method: Proposes a model- and knowledge-agnostic localization method and evaluates it on six knowledge categories across DiT models including PixArt-alpha, FLUX, and SANA. Result: The localized blocks are interpretable and causally linked to the expression of knowledge in generated outputs. The framework proves efficient when applied to model personalization and knowledge unlearning. Conclusion: The study sheds light on the internal structure of DiTs and offers a more efficient, controllable, and interpretable path to model editing. Abstract: Understanding how knowledge is distributed across the layers of generative models is crucial for improving interpretability, controllability, and adaptation. While prior work has explored knowledge localization in UNet-based architectures, Diffusion Transformer (DiT)-based models remain underexplored in this context. In this paper, we propose a model- and knowledge-agnostic method to localize where specific types of knowledge are encoded within the DiT blocks. We evaluate our method on state-of-the-art DiT-based models, including PixArt-alpha, FLUX, and SANA, across six diverse knowledge categories. We show that the identified blocks are both interpretable and causally linked to the expression of knowledge in generated outputs. Building on these insights, we apply our localization framework to two key applications: model personalization and knowledge unlearning. In both settings, our localized fine-tuning approach enables efficient and targeted updates, reducing computational cost, improving task-specific performance, and better preserving general model behavior with minimal interference to unrelated or surrounding content. Overall, our findings offer new insights into the internal structure of DiTs and introduce a practical pathway for more interpretable, efficient, and controllable model editing.

[67] Inference Compute-Optimal Video Vision Language Models

Peiqi Wang,ShengYun Peng,Xuewen Zhang,Hanchao Yu,Yibo Yang,Lifu Huang,Fujun Liu,Qifan Wang

Main category: cs.CV

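TL;DR: This paper studies how to optimally allocate a fixed inference compute budget across three key scaling factors in video vision language models: language model size, frame count, and the number of visual tokens per frame.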

Details

Motivation: Prior work usually focuses on model efficiency or performance gains while ignoring resource constraints; this paper seeks the optimal model configuration under a fixed compute budget. Method: Identifies the inference compute-optimal frontier through large-scale training sweeps and parametric modeling of task performance. Result: Experiments reveal how task performance depends on the scaling factors and finetuning data size, and how changes in data size shift the compute-optimal frontier. Conclusion: The findings provide practical guidance for choosing the scaling factors. Abstract: This work investigates the optimal allocation of inference compute across three key scaling factors in video vision language models: language model size, frame count, and the number of visual tokens per frame. While prior work typically focuses on optimizing model efficiency or improving performance without considering resource constraints, we instead identify the optimal model configuration under fixed inference compute budgets. We conduct large-scale training sweeps and careful parametric modeling of task performance to identify the inference compute-optimal frontier. Our experiments reveal how task performance depends on scaling factors and finetuning data size, as well as how changes in data size shift the compute-optimal frontier. These findings translate to practical tips for selecting these scaling factors.
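
The allocation problem described in the abstract can be framed as a constrained search, sketched below under stated assumptions. The cost model (roughly 2 x parameters x visual tokens for a forward pass) is a common rule of thumb, and `toy_perf` is an invented stand-in for the paper's fitted parametric performance model; neither is taken from the paper.

```python
import math
from itertools import product

def best_config(budget, perf, sizes, frames, tokens):
    """Return the (model size, frame count, tokens per frame) triple that maximizes
    perf among configurations whose approximate inference cost fits the budget."""
    feasible = [
        (n, f, t)
        for n, f, t in product(sizes, frames, tokens)
        if 2 * n * f * t <= budget  # assumed cost model: ~2 * params * visual tokens
    ]
    return max(feasible, key=lambda c: perf(*c)) if feasible else None

def toy_perf(n, f, t):
    # Hypothetical performance surface with diminishing returns in every factor.
    return math.log(n) + 0.5 * math.log(f) + 0.25 * math.log(t)

sizes = [1_000_000_000, 7_000_000_000]  # language model parameters
frames = [8, 32]                        # frames per video
tokens = [16, 64]                       # visual tokens per frame
choice = best_config(2e12, toy_perf, sizes, frames, tokens)
```

Sweeping `budget` and recording `choice` at each value traces out a compute-optimal frontier for whatever performance model is plugged in.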

[68] Eye-See-You: Reverse Pass-Through VR and Head Avatars

Ankan Dash,Jingyi Gu,Guiling Wang,Chen Chen

Main category: cs.CV

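TL;DR: RevAvatar uses AI to address the occlusion of users' faces by VR headsets, generating high-fidelity 2D facial images and 3D head avatars to improve interaction between virtual and physical environments.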

Details

Motivation: VR headsets occlude users' eyes and parts of their faces, hindering visual communication and potentially contributing to social isolation. Method: Combines generative models and multimodal AI techniques to reconstruct 2D facial images and generate 3D head avatars from partially observed regions, and introduces the VR-Face dataset. Result: RevAvatar markedly improves interaction in VR environments and supports immersive social activities. Conclusion: RevAvatar demonstrates the synergy between AI and next-generation technologies and offers a robust platform for human connection in virtual environments. Abstract: Virtual Reality (VR) headsets, while integral to the evolving digital ecosystem, present a critical challenge: the occlusion of users' eyes and portions of their faces, which hinders visual communication and may contribute to social isolation. To address this, we introduce RevAvatar, an innovative framework that leverages AI methodologies to enable reverse pass-through technology, fundamentally transforming VR headset design and interaction paradigms. RevAvatar integrates state-of-the-art generative models and multimodal AI techniques to reconstruct high-fidelity 2D facial images and generate accurate 3D head avatars from partially observed eye and lower-face regions. This framework represents a significant advancement in AI4Tech by enabling seamless interaction between virtual and physical environments, fostering immersive experiences such as VR meetings and social engagements. Additionally, we present VR-Face, a novel dataset comprising 200,000 samples designed to emulate diverse VR-specific conditions, including occlusions, lighting variations, and distortions. By addressing fundamental limitations in current VR systems, RevAvatar exemplifies the transformative synergy between AI and next-generation technologies, offering a robust platform for enhancing human connection and interaction in virtual environments.

[69] Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Shuo Yang,Haocheng Xi,Yilong Zhao,Muyang Li,Jintao Zhang,Han Cai,Yujun Lin,Xiuyu Li,Chenfeng Xu,Kelly Peng,Jianfei Chen,Song Han,Kurt Keutzer,Ion Stoica

Main category: cs.CV

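TL;DR: The SVG2 framework substantially improves video generation efficiency while maintaining high quality through semantic-aware token permutation and dynamic budget control.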

Details

Motivation: Existing sparse attention methods fall short of optimal generation quality under the same compute budget because of inaccurate token identification and wasted computation. Method: Proposes the SVG2 framework, which clusters and reorders tokens by semantic similarity using k-means and combines this with dynamic budget control and customized kernels for efficient computation. Result: Achieves speedups of up to 2.30x and 1.89x on HunyuanVideo and Wan 2.1, with PSNR of up to 30 and 26, respectively. Conclusion: SVG2 reaches a Pareto-optimal trade-off between generation quality and efficiency, providing an efficient solution for video generation. Abstract: Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.
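
Two ideas from the abstract, reordering tokens so that each cluster is contiguous and choosing clusters under a top-p budget, can be sketched as follows. This is an illustration only: the cluster labels are assumed to come from a k-means step that is not shown, the per-cluster scores are hypothetical, and how SVG2 actually estimates scores or applies the budget is not specified in the abstract.

```python
def cluster_permutation(labels):
    """Return a token order in which tokens of the same cluster are contiguous.

    sorted() is stable, so the original order is preserved within each cluster.
    """
    return sorted(range(len(labels)), key=lambda i: labels[i])

def top_p_clusters(scores, p):
    """Smallest set of clusters, taken in descending score order, whose
    normalized cumulative score reaches p."""
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    chosen, mass = [], 0.0
    for c in order:
        chosen.append(c)
        mass += scores[c] / total
        if mass >= p:
            break
    return chosen

labels = [2, 0, 1, 0, 2, 1]          # assumed k-means cluster id per token
perm = cluster_permutation(labels)    # [1, 3, 2, 5, 0, 4]
scores = [0.5, 0.125, 0.375]          # hypothetical attention mass per cluster
critical = top_p_clusters(scores, p=0.75)  # [0, 2]
```

Because the permuted layout places each selected cluster in one contiguous run, a block-sparse kernel can process the critical tokens without padding, which is the efficiency argument the abstract makes.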

[70] REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Weihan Xu,Yimeng Ma,Jingyue Huang,Yang Li,Wenye Ma,Taylor Berg-Kirkpatrick,Julian McAuley,Paul Pu Liang,Hao-Wen Dong

Main category: cs.CV

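TL;DR: Proposes REGen, a method that combines retrieval and generation to produce coherent short video summaries with embedded video clips, outperforming existing methods.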

Details

Motivation: Existing extractive video summarization methods lack coherence, while abstractive methods cannot quote clips from the input video, so a new method is needed to generate shorts that are both coherent and able to quote. Method: Uses a retrieval-embedded generation framework that first generates a story script with placeholders and then uses a retrieval model to select the best video clip for each placeholder. Result: On documentary teaser generation, the method inserts video clips effectively while staying coherent, and a subjective evaluation shows it outperforms existing methods in coherence, alignment, and realism. Conclusion: By combining generation and retrieval, the REGen system balances coherence and quotation in short video summaries. Abstract: Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.
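
The two-step flow in the abstract, generating a script with quote placeholders and then filling each placeholder by retrieval, can be sketched as below. Everything concrete here is hypothetical: the segment dictionaries, the 2D embeddings, the `<clip:...>` marker, and cosine similarity as the retrieval score stand in for the finetuned language model and the retrieval model described in the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def fill_quotes(script, clips):
    """Replace each quote placeholder with the id of the best-matching clip."""
    out = []
    for seg in script:
        if seg["type"] == "quote":
            # Pick the candidate clip whose embedding is closest to the placeholder query.
            best = max(clips, key=lambda c: cosine(seg["query"], c["emb"]))
            out.append(f"<clip:{best['id']}>")
        else:
            out.append(seg["text"])
    return " ".join(out)

# Hypothetical script (as a language model might emit it) and candidate clip pool.
script = [
    {"type": "text", "text": "The reef is dying."},
    {"type": "quote", "query": (1.0, 0.0)},
    {"type": "text", "text": "But hope remains."},
]
clips = [{"id": "a", "emb": (0.0, 1.0)}, {"id": "b", "emb": (0.9, 0.1)}]
teaser = fill_quotes(script, clips)
```

Keeping generation and retrieval separate means the narration stays fluent while every inserted clip is guaranteed to come from the source footage rather than being synthesized.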

Dicong Qiu,Jiadi You,Zeying Gong,Ronghe Qiu,Hui Xiong,Junwei Liang

Main category: cs.CV

TL;DR: SD-OVON is a semantics-aware dataset and benchmark generation pipeline built on pretrained multimodal models for open-vocabulary object navigation in dynamic scenes, providing datasets of realistic scenes and manipulatable objects.

Details

Motivation: Addresses the limitation of existing datasets to static environments, improving the realism and practicality of navigation tasks in complex dynamic scenes. Method: Uses pretrained multimodal foundation models to generate an unlimited number of unique photo-realistic scene variants, and provides a task-generation plugin compatible with the Habitat simulator. Result: Produces two datasets, SD-OVON-3k and SD-OVON-10k, and validates their effectiveness. Conclusion: SD-OVON improves the realism and training effectiveness of open-vocabulary object navigation and suits real-world robotic applications. Abstract: We present the Semantics-aware Dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes (SD-OVON). It utilizes pretrained multimodal foundation models to generate infinite unique photo-realistic scene variants that adhere to real-world semantics and daily commonsense for the training and the evaluation of navigation agents, accompanied by a plugin for generating object navigation task episodes compatible with the Habitat simulator. In addition, we offer two pre-generated object navigation task datasets, SD-OVON-3k and SD-OVON-10k, comprising respectively about 3k and 10k episodes of the open-vocabulary object navigation task, derived from the SD-OVON-Scenes dataset with 2.5k photo-realistic scans of real-world environments and the SD-OVON-Objects dataset with 0.9k manually inspected scanned and artist-created manipulatable object models. Unlike prior datasets limited to static environments, SD-OVON covers dynamic scenes and manipulatable objects, facilitating both real-to-sim and sim-to-real robotic applications. This approach enhances the realism of navigation tasks and the training and evaluation of open-vocabulary object navigation agents in complex settings. To demonstrate the effectiveness of our pipeline and datasets, we propose two baselines and evaluate them along with state-of-the-art baselines on SD-OVON-3k. The datasets, benchmark and source code are publicly available.

[72] Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos

Andrea Ramazzina,Vittorio Giammarino,Matteo El-Hariry,Mario Bijelic

Main category: cs.CV

TL;DR: Proposes an event-inspired approach to visual imitation that converts RGB videos into a sparse event representation, removing the influence of appearance features and enabling robust imitation when the expert and learner environments differ visually.

Details

Motivation: Conventional visual imitation performs poorly when expert and learner environments differ visually (e.g., lighting, color, texture), while visual randomization is computationally expensive and struggles with unseen scenarios. Method: Converts standard RGB videos into a sparse event representation that encodes temporal intensity gradients and ignores static appearance features, separating motion dynamics from visual style. Result: Validated on the DeepMind Control Suite and the Adroit platform, achieving invariance to visual distractors without costly data augmentation. Conclusion: Event-inspired perception offers a robust and efficient solution for visual imitation under visual mismatch. Abstract: Imitation from videos often fails when expert demonstrations and learner environments exhibit domain shifts, such as discrepancies in lighting, color, or texture. While visual randomization partially addresses this problem by augmenting training data, it remains computationally intensive and inherently reactive, struggling with unseen scenarios. We propose a different approach: instead of randomizing appearances, we eliminate their influence entirely by rethinking the sensory representation itself. Inspired by biological vision systems that prioritize temporal transients (e.g., retinal ganglion cells) and by recent sensor advancements, we introduce event-inspired perception for visually robust imitation. Our method converts standard RGB videos into a sparse, event-based representation that encodes temporal intensity gradients, discarding static appearance features. This biologically grounded approach disentangles motion dynamics from visual style, enabling robust visual imitation from observations even in the presence of visual mismatches between expert and agent environments. By training policies on event streams, we achieve invariance to appearance-based distractors without requiring computationally expensive and environment-specific data augmentation techniques. Experiments across the DeepMind Control Suite and the Adroit platform for dynamic dexterous manipulation show the efficacy of our method. Our code is publicly available at Eb-LAIfO.
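
A minimal sketch of the video-to-event conversion, assuming the common event-camera model in which a pixel fires when its log intensity changes by more than a threshold between consecutive frames. The function name, the threshold, and the use of grayscale frames are illustrative assumptions; the abstract says only that the representation encodes temporal intensity gradients and discards static appearance.

```python
import math


def frames_to_events(frames, threshold=0.2, eps=1e-3):
    """Convert grayscale frames (2D lists with values in [0, 1]) to event maps.

    Each event map holds +1 (brighter), -1 (darker), or 0 (no event) per pixel,
    based on the log-intensity change between consecutive frames.
    """
    events = []
    for prev, curr in zip(frames, frames[1:]):
        event_map = []
        for row_prev, row_curr in zip(prev, curr):
            row = []
            for p, c in zip(row_prev, row_curr):
                delta = math.log(c + eps) - math.log(p + eps)
                row.append(1 if delta > threshold else -1 if delta < -threshold else 0)
            event_map.append(row)
        events.append(event_map)
    return events


# Two 2x2 frames: one pixel brightens, one darkens, two stay static.
frames = [[[0.5, 0.2], [0.8, 0.5]],
          [[0.5, 0.8], [0.2, 0.5]]]
events = frames_to_events(frames)

# The same clip under dimmer lighting (all intensities halved).
dim_frames = [[[v * 0.5 for v in row] for row in frame] for frame in frames]
dim_events = frames_to_events(dim_frames)
```

Because the comparison happens in log space, a global brightness scaling largely cancels out, so `events` and `dim_events` match in this toy case. That is the kind of appearance invariance the abstract attributes to the representation.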

[73] Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering

Yixiong Chen,Wenjie Xiao,Pedro R. A. S. Bassi,Xinze Zhou,Sezgin Er,Ibrahim Ethem Hamamci,Zongwei Zhou,Alan Yuille

Main category: cs.CV

TL;DR: DeepTumorVQA is a diagnostic visual question answering (VQA) benchmark for abdominal tumors in CT scans that evaluates vision-language models (VLMs) on 3D clinical diagnosis.

Details

Motivation: To evaluate VLMs on the recognition precision, reasoning ability, and domain knowledge that 3D clinical diagnosis demands. Method: Builds the DeepTumorVQA benchmark with 9,262 CT volumes and 395K expert-level questions, and tests four advanced VLMs. Result: Current models perform adequately on measurement tasks but fall short on lesion recognition and reasoning, and do not yet meet clinical needs. Conclusion: Large-scale multimodal pretraining and image preprocessing are crucial for 3D perception; DeepTumorVQA provides a rigorous benchmark for medical multimodal research. Abstract: Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear due to stringent demands for recognition precision, reasoning ability, and domain knowledge. To systematically evaluate these dimensions, we present DeepTumorVQA, a diagnostic visual question answering (VQA) benchmark targeting abdominal tumors in CT scans. It comprises 9,262 CT volumes (3.7M slices) from 17 public datasets, with 395K expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning. DeepTumorVQA introduces unique challenges, including small tumor detection and clinical reasoning across 3D anatomy. Benchmarking four advanced VLMs (RadFM, M3D, Merlin, CT-CHAT), we find current models perform adequately on measurement tasks but struggle with lesion recognition and reasoning, and are still not meeting clinical needs. Two key insights emerge: (1) large-scale multimodal pretraining plays a crucial role in DeepTumorVQA testing performance, making RadFM stand out among all VLMs. (2) Our dataset exposes critical differences in VLM components, where proper image preprocessing and design of vision modules significantly affect 3D perception. To facilitate medical multimodal research, we have released DeepTumorVQA as a rigorous benchmark: https://github.com/Schuture/DeepTumorVQA.

[74] LLM-Guided Taxonomy and Hierarchical Uncertainty for 3D Point Cloud Active Learning

Chenxi Li,Nuo Chen,Fengyun Tan,Yantong Chen,Bochun Yuan,Tianrui Li,Chongshou Li

Main category: cs.CV

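TL;DR: Proposes a novel active learning framework that, for the first time, integrates large language models (LLMs) into 3D point cloud semantic segmentation, improving performance through a hierarchical label structure and uncertainty-based sample selection.

Details

Motivation: Existing methods treat labels as flat and independent, ignoring semantic hierarchy. This work uses LLMs as knowledge priors to build multi-level semantic taxonomies and improve sample selection. Method: Uses LLMs to automatically generate multi-level semantic taxonomies and introduces a recursive uncertainty projection mechanism that propagates uncertainty across hierarchy levels, enabling label-aware point selection. Result: On S3DIS and ScanNet v2, the method achieves up to 4% mIoU improvement under extremely low annotation budgets (e.g., 0.02%), substantially outperforming baselines. Conclusion: LLMs show potential as knowledge priors in 3D vision, and hierarchical uncertainty modeling offers a new paradigm for efficient point cloud annotation. Abstract: We present a novel active learning framework for 3D point cloud semantic segmentation that, for the first time, integrates large language models (LLMs) to construct hierarchical label structures and guide uncertainty-based sample selection. Unlike prior methods that treat labels as flat and independent, our approach leverages LLM prompting to automatically generate multi-level semantic taxonomies and introduces a recursive uncertainty projection mechanism that propagates uncertainty across hierarchy levels. This enables spatially diverse, label-aware point selection that respects the inherent semantic structure of 3D scenes. Experiments on S3DIS and ScanNet v2 show that our method achieves up to 4% mIoU improvement under extremely low annotation budgets (e.g., 0.02%), substantially outperforming existing baselines. Our results highlight the untapped potential of LLMs as knowledge priors in 3D vision and establish hierarchical uncertainty modeling as a powerful paradigm for efficient point cloud annotation.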
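
To make the idea of hierarchy-aware uncertainty concrete, here is a simplified sketch. It is not the paper's recursive uncertainty projection, whose exact form the abstract does not give. It assumes a two-level taxonomy, projects fine-class probabilities onto parent classes by summation, and averages the entropies of the two levels. The taxonomy, class order, and equal weights are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def hierarchical_uncertainty(fine_probs, taxonomy, weights=(0.5, 0.5)):
    """Weighted sum of fine-level and coarse-level entropy.

    taxonomy maps each parent label to the indices of its child classes;
    coarse probabilities are the sums of their children's probabilities.
    """
    coarse_probs = [sum(fine_probs[i] for i in children)
                    for children in taxonomy.values()]
    return weights[0] * entropy(fine_probs) + weights[1] * entropy(coarse_probs)


# Hypothetical taxonomy over fine classes [chair, table, wall, floor].
taxonomy = {"furniture": [0, 1], "structure": [2, 3]}

within_parent = hierarchical_uncertainty([0.5, 0.5, 0.0, 0.0], taxonomy)  # chair vs table
across_parents = hierarchical_uncertainty([0.5, 0.0, 0.5, 0.0], taxonomy)  # chair vs wall
```

Both points have the same flat entropy, but the point confused across parent categories scores higher. A hierarchy-aware selector would therefore prioritize it for labeling.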

[75] Words as Geometric Features: Estimating Homography using Optical Character Recognition as Compressed Image Representation

Ross Greer,Alisha Ukani,Katherine Izhikevich,Earlence Fernandes,Stefan Savage,Alex C. Snoeren

Main category: cs.CV

TL;DR: Proposes a document alignment method based on OCR output that does not rely on the original image data, suited to privacy- or storage-constrained settings.

Details

Motivation: Traditional document alignment needs the original image data, which is infeasible under privacy or storage constraints. This work aims to achieve efficient alignment from OCR output. Method: Uses the spatial positions and textual content of OCR-detected words, with RANSAC to handle noise, to estimate geometric transformations such as homographies. Result: On test documents, the method is more accurate than traditional image-based methods and robust to OCR noise. Conclusion: The method offers an efficient, scalable solution for document processing with less reliance on high-dimensional image data. Abstract: Document alignment and registration play a crucial role in numerous real-world applications, such as automated form processing, anomaly detection, and workflow automation. Traditional methods for document alignment rely on image-based features like keypoints, edges, and textures to estimate geometric transformations, such as homographies. However, these approaches often require access to the original document images, which may not always be available due to privacy, storage, or transmission constraints. This paper introduces a novel approach that leverages Optical Character Recognition (OCR) outputs as features for homography estimation. By utilizing the spatial positions and textual content of OCR-detected words, our method enables document alignment without relying on pixel-level image data. This technique is particularly valuable in scenarios where only OCR outputs are accessible. Furthermore, the method is robust to OCR noise, incorporating RANSAC to handle outliers and inaccuracies in the OCR data. On a set of test documents, we demonstrate that our OCR-based approach even performs more accurately than traditional image-based methods, offering a more efficient and scalable solution for document registration tasks. The proposed method facilitates applications in document processing, all while reducing reliance on high-dimensional image data.
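
The OCR-based alignment idea can be sketched as follows: treat words that appear in both documents as point correspondences, then fit a homography with RANSAC. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes each word occurs once per document (so its text serves as a match key), uses word centers as points, and fits each hypothesis from four pairs with the last homography entry fixed to 1. All function names and the sample data are hypothetical.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None  # degenerate sample, e.g. collinear points
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_4(src, dst):
    """Homography as 9 row-major values (last fixed to 1) from four point pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = solve_linear(a, b)
    return None if h is None else h + [1.0]


def apply_h(h, pt):
    """Map a 2D point through the homography."""
    x, y = pt
    w = h[6] * x + h[7] * y + h[8]
    if abs(w) < 1e-12:
        return (math.inf, math.inf)
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


def ransac_homography(words_a, words_b, iters=200, tol=2.0, seed=0):
    """words_*: dict mapping a word's text to its (x, y) center in that document."""
    shared = sorted(set(words_a) & set(words_b))
    if len(shared) < 4:
        return None, []
    rng = random.Random(seed)
    best_h, best_inliers = None, []
    for _ in range(iters):
        sample = rng.sample(shared, 4)
        h = homography_from_4([words_a[w] for w in sample],
                              [words_b[w] for w in sample])
        if h is None:
            continue
        inliers = [w for w in shared
                   if math.dist(apply_h(h, words_a[w]), words_b[w]) < tol]
        if len(inliers) > len(best_inliers):
            best_h, best_inliers = h, inliers
    return best_h, best_inliers


# Synthetic example: document B is document A under a known homography,
# with one word ("page") misplaced to simulate an OCR error.
true_h = [1.1, 0.05, 12.0, -0.03, 0.95, -7.0, 1e-4, 2e-4, 1.0]
doc_a = {"invoice": (10, 20), "total": (200, 30), "date": (50, 150),
         "amount": (220, 180), "tax": (120, 90), "name": (30, 260),
         "signature": (180, 250), "page": (90, 40)}
doc_b = {w: apply_h(true_h, p) for w, p in doc_a.items()}
doc_b["page"] = (doc_b["page"][0] + 50, doc_b["page"][1])
h_est, inliers = ransac_homography(doc_a, doc_b)
```

In practice the estimate would usually be refined with a least-squares fit over all inliers, and repeated words would need a disambiguation step. Both are omitted here for brevity.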

[76] WeedNet: A Foundation Model-Based Global-to-Local AI Approach for Real-Time Weed Species Identification and Classification

Yanben Shen,Timilehin T. Ayanlade,Venkata Naresh Boddepalli,Mojdeh Saadati,Ashlyn Rairdin,Zi K. Deng,Muhammad Arbab Arshad,Aditya Balu,Daren Mueller,Asheesh K Singh,Wesley Everman,Nirav Merchant,Baskar Ganapathysubramanian,Meaghan Anderson,Soumik Sarkar,Arti Singh

Main category: cs.CV

TL;DR: WeedNet is the first global-scale weed identification model; using self-supervised learning and fine-tuning strategies, it reaches 91.02% accuracy globally and 97.38% accuracy in a local region.

Details

Motivation: Early weed identification is essential for effective management, but existing AI models face limited data and complex morphological features. Method: WeedNet is an end-to-end real-time identification pipeline combining self-supervised learning, fine-tuning, and enhanced trustworthiness strategies. Result: The model reaches 91.02% accuracy across 1,593 weed species, and the local model reaches 97.38% accuracy on 85 Iowa weeds. Conclusion: WeedNet is generalizable and adaptable, can serve as a foundation model, and is promising for integration into robotic platforms and smart agriculture tools. Abstract: Early identification of weeds is essential for effective management and control, and there is growing interest in automating the process using computer vision techniques coupled with AI methods. However, challenges associated with training AI-based weed identification models, such as limited expert-verified data and complexity and variability in morphological features, have hindered progress. To address these issues, we present WeedNet, the first global-scale weed identification model capable of recognizing an extensive set of weed species, including noxious and invasive plant species. WeedNet is an end-to-end real-time weed identification pipeline and uses self-supervised learning, fine-tuning, and enhanced trustworthiness strategies. WeedNet achieved 91.02% accuracy across 1,593 weed species, with 41% of species achieving 100% accuracy. Using a fine-tuning strategy and a Global-to-Local approach, the local Iowa WeedNet model achieved an overall accuracy of 97.38% for 85 Iowa weeds, with most classes exceeding a 90% mean accuracy per class. Testing across intra-species dissimilarity (developmental stages) and inter-species similarity (look-alike species) suggests that diversity in the images collected, spanning all the growth stages and distinguishable plant characteristics, is crucial in driving model performance. The generalizability and adaptability of the Global WeedNet model enable it to function as a foundational model, with the Global-to-Local strategy allowing fine-tuning for region-specific weed communities. Additional validation of drone- and ground-rover-based images highlights the potential of WeedNet for integration into robotic platforms. Furthermore, integration with AI for conversational use provides intelligent agricultural and ecological conservation consulting tools for farmers, agronomists, researchers, land managers, and government agencies across diverse landscapes.

[77] Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency

Hyunho Ha,Lei Xiao,Christian Richardt,Thu Nguyen-Phuoc,Changil Kim,Min H. Kim,Douglas Lanman,Numair Khan

Main category: cs.CV

TL;DR: Proposes a geometry-guided online video view synthesis method that addresses the trade-off between computational cost and synthesis quality in traditional methods.

Details

Motivation: Traditional methods need dense multi-view camera setups and heavy computation, while selective-input methods cut cost at the expense of quality and consistency. This work aims for efficient, high-quality synthesis that is consistent across views and time. Method: Uses global geometry to guide an image-based rendering pipeline: depth maps are progressively refined with color difference masks across time, accumulated through truncated signed distance fields, and used to guide a pre-trained blending network that fuses multi-view images. Result: Achieves high-quality video synthesis consistent across views and time while running efficiently online. Conclusion: The method significantly improves computational efficiency while maintaining high quality and consistency. Abstract: We introduce a novel geometry-guided online video view synthesis method with enhanced view and temporal consistency. Traditional approaches achieve high-quality synthesis from dense multi-view camera setups but require significant computational resources. In contrast, selective-input methods reduce this cost but often compromise quality, leading to multi-view and temporal inconsistencies such as flickering artifacts. Our method addresses this challenge to deliver efficient, high-quality novel-view synthesis with view and temporal consistency. The key innovation of our approach lies in using global geometry to guide an image-based rendering pipeline. To accomplish this, we progressively refine depth maps using color difference masks across time. These depth maps are then accumulated through truncated signed distance fields in the synthesized view's image space. This depth representation is view and temporally consistent, and is used to guide a pre-trained blending network that fuses multiple forward-rendered input-view images. Thus, the network is encouraged to output geometrically consistent synthesis results across multiple views and time. Our approach achieves consistent, high-quality video synthesis, while running efficiently in an online manner.

[78] Echo Planning for Autonomous Driving: From Current Observations to Future Trajectories and Back

Jintao Sun,Hu Zhang,Gangyi Ding,Zhedong Zheng

Main category: cs.CV

TL;DR: Proposes the Echo Planning framework, which uses a closed-loop CFC cycle to address temporal consistency in trajectory prediction for autonomous driving, improving performance without additional supervision.

Details

Motivation: Planners in modern end-to-end autonomous driving systems lack a temporal consistency mechanism, so early prediction errors accumulate over time. Method: Introduces the CFC cycle, which enforces bidirectional consistency (generating future trajectories from current observations and reconstructing the current state from them) and penalizes implausible trajectories with a cycle loss. Result: Strong performance on nuScenes, reducing L2 error by 0.04 m and collision rate by 0.12%. Conclusion: Echo Planning offers a deployable solution for safety-critical autonomous driving systems. Abstract: Modern end-to-end autonomous driving systems suffer from a critical limitation: their planners lack mechanisms to enforce temporal consistency between predicted trajectories and evolving scene dynamics. This absence of self-supervision allows early prediction errors to compound catastrophically over time. We introduce Echo Planning, a novel self-correcting framework that establishes a closed-loop Current - Future - Current (CFC) cycle to harmonize trajectory prediction with scene coherence. Our key insight is that plausible future trajectories must be bi-directionally consistent, i.e., not only generated from current observations but also capable of reconstructing them. The CFC mechanism first predicts future trajectories from the Bird's-Eye-View (BEV) scene representation, then inversely maps these trajectories back to estimate the current BEV state. By enforcing consistency between the original and reconstructed BEV representations through a cycle loss, the framework intrinsically penalizes physically implausible or misaligned trajectories. Experiments on nuScenes demonstrate state-of-the-art performance, reducing L2 error by 0.04 m and collision rate by 0.12% compared to one-shot planners. Crucially, our method requires no additional supervision, leveraging the CFC cycle as an inductive bias for robust planning. This work offers a deployable solution for safety-critical autonomous systems.

[79] OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model

Zhenhao Zhang,Ye Shi,Lingxiao Yang,Suting Ni,Qi Ye,Jingya Wang

Main category: cs.CV

TL;DR: OpenHOI is a framework for open-world hand-object interaction synthesis that generates long-horizon manipulation sequences for novel objects from free-form language instructions, addressing the poor generalization of existing methods to unseen objects and open-vocabulary instructions.

Details

Motivation: Existing methods perform well on closed-set objects and predefined tasks but struggle with unseen objects or open-vocabulary instructions, limiting their use in AR/VR and robotics. Method: OpenHOI combines a 3D multimodal large language model (MLLM) for joint affordance grounding and semantic task decomposition with an affordance-driven diffusion model and a training-free physics refinement stage to synthesize physically plausible interactions. Result: Experiments show OpenHOI generalizes better than existing methods to unseen object categories, multi-stage tasks, and complex language instructions. Conclusion: OpenHOI provides the first solution for open-world hand-object interaction synthesis, with broad application potential. Abstract: Understanding and synthesizing realistic 3D hand-object interactions (HOI) is critical for applications ranging from immersive AR/VR to dexterous robotics. Existing methods struggle with generalization, performing well on closed-set objects and predefined tasks but failing to handle unseen objects or open-vocabulary instructions. We introduce OpenHOI, the first framework for open-world HOI synthesis, capable of generating long-horizon manipulation sequences for novel objects guided by free-form language commands. Our approach integrates a 3D Multimodal Large Language Model (MLLM) fine-tuned for joint affordance grounding and semantic task decomposition, enabling precise localization of interaction regions (e.g., handles, buttons) and breakdown of complex instructions (e.g., "Find a water bottle and take a sip") into executable sub-tasks. To synthesize physically plausible interactions, we propose an affordance-driven diffusion model paired with a training-free physics refinement stage that minimizes penetration and optimizes affordance alignment. Evaluations across diverse scenarios demonstrate OpenHOI's superiority over state-of-the-art methods in generalizing to novel object categories, multi-stage tasks, and complex language instructions. Our project page is at https://openhoi.github.io.

Yining Pan,Qiongjie Cui,Xulei Yang,Na Zhao

Main category: cs.CV

TL;DR: IAL is a novel multi-modal 3D panoptic segmentation framework that resolves LiDAR-image alignment issues through synchronized data augmentation and geometry-guided feature fusion, and performs strongly on benchmarks.

Details

Motivation: The sparsity of LiDAR data makes distant or small objects hard to recognize, and existing methods rely on post-processing and suffer from data misalignment. Method: Proposes the IAL framework, comprising PieAug data augmentation, a GTF module for feature fusion, and a PQG module for query initialization, and predicts segmentation results directly. Result: IAL achieves state-of-the-art performance on two widely used benchmarks. Conclusion: Through multi-modal fusion and synchronized augmentation, IAL significantly improves 3D panoptic segmentation accuracy. Abstract: LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose Image-Assists-LiDAR (IAL), a novel multi-modal 3D panoptic segmentation framework. In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder's ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. Code and models are publicly available.

[81] CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation

Jiong Wu,Yang Xing,Boxiao Yu,Wei Shao,Kuang Gong

Main category: cs.CV

TL;DR: Proposes CDPDNet, a medical image segmentation network that combines CLIP text embeddings with DINOv2 visual features, addressing the challenges of partially labeled data and modeling complex anatomical relationships.

Details

Motivation: Medical image datasets are usually only partially labeled, limiting a model's ability to learn shared anatomical features; existing vision-only frameworks struggle to capture complex anatomical relationships. Method: Combines a CNN, DINOv2, and CLIP text embeddings, with a multi-head cross-attention module and a Text-based Task Prompt Generation (TTPG) module. Result: Outperforms existing methods on multiple medical imaging datasets. Conclusion: By fusing visual and textual features, CDPDNet effectively improves segmentation accuracy and generalization. Abstract: Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model's ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex anatomical relationships and task-specific distinctions, leading to reduced segmentation accuracy and poor generalizability to unseen datasets. In this study, we proposed a novel CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet), which combined a self-supervised vision transformer with CLIP-based text embedding and introduced task-specific text prompts to tackle these challenges. Specifically, the framework was constructed upon a convolutional neural network (CNN) and incorporated DINOv2 to extract both fine-grained and global visual features, which were then fused using a multi-head cross-attention module to overcome the limited long-range modeling capability of CNNs. In addition, CLIP-derived text embeddings were projected into the visual space to help model complex relationships among organs and tumors. To further address the partial label challenge and enhance inter-task discriminative capability, a Text-based Task Prompt Generation (TTPG) module that generated task-specific prompts was designed to guide the segmentation. Extensive experiments on multiple medical imaging datasets demonstrated that CDPDNet consistently outperformed existing state-of-the-art segmentation methods. Code and pretrained model are available at: https://github.com/wujiong-hub/CDPDNet.git.

[82] MGD$^3$: Mode-Guided Dataset Distillation using Diffusion Models

Jeffrey A. Chan-Santiago,Praveen Tirupattur,Gaurav Kumar Nayak,Gaowen Liu,Mubarak Shah

Main category: cs.CV

TL;DR: Proposes a dataset distillation method based on a pre-trained diffusion model that improves sample diversity through three stages (mode discovery, mode guidance, and stop guidance), requires no fine-tuning, and significantly reduces computational cost.

Details

Motivation: Existing dataset distillation methods require model fine-tuning and cannot guarantee sample diversity, limiting performance. Method: Uses a pre-trained diffusion model in three stages: mode discovery, mode guidance, and stop guidance, improving diversity and reducing artifacts. Result: Outperforms existing methods on multiple datasets, with accuracy gains of up to 4.4%. Conclusion: The method needs no fine-tuning of the diffusion model, significantly reducing computational cost while improving diversity and performance. Abstract: Dataset distillation has emerged as an effective strategy, significantly reducing training costs and facilitating more efficient model deployment. Recent advances have leveraged generative models to distill datasets by capturing the underlying data distribution. Unfortunately, existing methods require model fine-tuning with distillation losses to encourage diversity and representativeness. However, these methods do not guarantee sample diversity, limiting their performance. We propose a mode-guided diffusion model leveraging a pre-trained diffusion model without the need to fine-tune with distillation losses. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples that affect performance. Our approach outperforms state-of-the-art methods, achieving accuracy gains of 4.4%, 2.9%, 1.6%, and 1.6% on ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K, respectively. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs. Our code is available on the project webpage: https://jachansantiago.github.io/mode-guided-distillation/

[83] VL-SAM-V2: Open-World Object Detection with General and Specific Query Fusion

Zhiwei Lin,Yongtao Wang

Main category: cs.CV

TL;DR: VL-SAM-V2 is an open-world object detection framework that combines queries from open-set and open-ended models, improves performance with a general and specific query fusion module, and performs strongly on LVIS.

Details

Motivation: Current perception models have limited ability to detect novel objects in open-world environments, and open-ended models perform poorly and need improvement. Method: Combines queries from open-set and open-ended models, designs a general and specific query fusion module, and introduces ranked learnable queries and a denoising point training strategy. Result: Surpasses previous open-set and open-ended methods on LVIS, especially on rare objects. Conclusion: By fusing queries and improving the training strategy, VL-SAM-V2 significantly improves open-world object detection. Abstract: Current perception models have achieved remarkable success by leveraging large-scale labeled datasets, but still face challenges in open-world environments with novel objects. To address this limitation, researchers introduce open-set perception models to detect or segment arbitrary test-time user-input categories. However, open-set models rely on human involvement to provide predefined object categories as input during inference. More recently, researchers have framed a more realistic and challenging task known as open-ended perception that aims to discover unseen objects without requiring any category-level input from humans at inference time. Nevertheless, open-ended models suffer from low performance compared to open-set models. In this paper, we present VL-SAM-V2, an open-world object detection framework that is capable of discovering unseen objects while achieving favorable performance. To achieve this, we combine queries from open-set and open-ended models and propose a general and specific query fusion module to allow different queries to interact. By adjusting queries from open-set models, we enable VL-SAM-V2 to be evaluated in the open-set or open-ended mode. In addition, to learn more diverse queries, we introduce ranked learnable queries to match queries with proposals from open-ended models by sorting. Moreover, we design a denoising point training strategy to facilitate the training process. Experimental results on LVIS show that our method surpasses the previous open-set and open-ended methods, especially on rare objects.

[84] NTIRE 2025 Challenge on Video Quality Enhancement for Video Conferencing: Datasets, Methods and Results

Varun Jain,Zongwei Wu,Quan Zou,Louis Florentin,Henrik Turbell,Sandeep Siddhartha,Radu Timofte,others

Main category: cs.CV

TL;DR: Reviews the Video Quality Enhancement challenge held at the NTIRE workshop at CVPR 2025, covering the problem statement, datasets, solutions, and results.

Details

Motivation: Improve video quality in video conferencing, including lighting, color, noise reduction, and sharpness. Method: Participants were given a differentiable video quality assessment model and designed VQE models, which were then evaluated. Result: 91 participants registered and 10 valid submissions were received, evaluated in a crowdsourced framework. Conclusion: The challenge advanced video quality enhancement techniques. Abstract: This paper presents a comprehensive review of the 1st Challenge on Video Quality Enhancement for Video Conferencing held at the NTIRE workshop at CVPR 2025, and highlights the problem statement, datasets, proposed solutions, and results. The aim of this challenge was to design a Video Quality Enhancement (VQE) model to enhance video quality in video conferencing scenarios by (a) improving lighting, (b) enhancing colors, (c) reducing noise, and (d) enhancing sharpness - giving a professional studio-like effect. Participants were given a differentiable Video Quality Assessment (VQA) model, training, and test videos. A total of 91 participants registered for the challenge. We received 10 valid submissions that were evaluated in a crowdsourced framework.

[85] SPARS: Self-Play Adversarial Reinforcement Learning for Segmentation of Liver Tumours

Catalina Tan,Yipeng Hu,Shaheer U. Saeed

Main category: cs.CV

TL;DR: Proposes SPARS, a weakly-supervised semantic segmentation framework that localizes tumours using a small number of image-level binary labels, with performance close to fully-supervised methods.

Details

Motivation: Tumour segmentation is vital for cancer diagnosis and treatment, but fully-supervised models need many annotations that are highly subjective, and histopathology labels are hard to obtain. Method: SPARS trains a classifier on image-level binary labels and combines it with adversarial reinforcement learning to localize tumour regions. Result: On real patient data, SPARS achieves a Dice score of 77.3±9.4, outperforming other weakly-supervised methods and approaching fully-supervised ones. Conclusion: SPARS can reduce dependence on manual annotation and offers an efficient tumour detection approach for real healthcare settings. Abstract: Accurate tumour segmentation is vital for various targeted diagnostic and therapeutic procedures for cancer, e.g., planning biopsies or tumour ablations. Manual delineation is extremely labour-intensive, requiring substantial expert time. Fully-supervised machine learning models aim to automate such localisation tasks, but require a large number of costly and often subjective 3D voxel-level labels for training. The high variance and subjectivity in such labels impact model generalisability, even when large datasets are available. Histopathology labels may offer more objective labels but the infeasibility of acquiring pixel-level annotations to develop tumour localisation methods based on histology remains challenging in-vivo. In this work, we propose a novel weakly-supervised semantic segmentation framework called SPARS (Self-Play Adversarial Reinforcement Learning for Segmentation), which utilises an object presence classifier, trained on a small number of image-level binary cancer presence labels, to localise cancerous regions on CT scans. Such binary labels of patient-level cancer presence can be sourced more feasibly from biopsies and histopathology reports, enabling a more objective cancer localisation on medical images. Evaluating with real patient data, we observed that SPARS yielded a mean dice score of $77.3 \pm 9.4$, which outperformed other weakly-supervised methods by large margins. This performance was comparable with recent fully-supervised methods that require voxel-level annotations. Our results demonstrate the potential of using SPARS to reduce the need for extensive human-annotated labels to detect cancer in real-world healthcare settings.
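
For readers unfamiliar with the metric behind the $77.3 \pm 9.4$ figure, the Dice score compares a predicted mask with a reference mask as twice their overlap divided by the sum of their sizes. A minimal sketch follows. The function name and the convention of returning 1.0 for two empty masks are assumptions, and the reported value is presumably on a 0 to 100 scale, whereas this function returns a fraction.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks of equal length.

    Dice = 2 * |pred AND target| / (|pred| + |target|), in [0, 1].
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Prediction marks two voxels, the reference marks one; they share one voxel.
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
score = dice_score(pred, target)  # 2 * 1 / (2 + 1)
```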

[86] Kernel Space Diffusion Model for Efficient Remote Sensing Pansharpening

Hancong Jin,Zihan Cao,Liangjian Deng

Main category: cs.CV

TL;DR: Proposes KSDiff, a diffusion-based method that generates convolutional kernels in a latent space to improve pansharpening quality while speeding up inference.

Details

Motivation: Existing deep learning methods struggle to capture global priors in remote sensing data, and diffusion models, though effective, have high inference latency. Method: KSDiff uses a low-rank core tensor generator and a unified factor generator, combined with a structure-aware multi-head attention mechanism, to generate convolutional kernels. Result: Strong performance on the WorldView-3, GaoFen-2, and QuickBird datasets. Conclusion: KSDiff is an efficient pansharpening framework that can enhance existing methods. Abstract: Pansharpening is a fundamental task in remote sensing that integrates high-resolution panchromatic imagery (PAN) with low-resolution multispectral imagery (LRMS) to produce an enhanced image with both high spatial and spectral resolution. Despite significant progress in deep learning-based approaches, existing methods often fail to capture the global priors inherent in remote sensing data distributions. Diffusion-based models have recently emerged as promising solutions due to their powerful distribution mapping capabilities; however, they suffer from significant inference latency, which limits their practical applicability. In this work, we propose the Kernel Space Diffusion Model (KSDiff), a novel approach that leverages diffusion processes in a latent space to generate convolutional kernels enriched with global contextual information, thereby improving pansharpening quality while enabling faster inference. Specifically, KSDiff constructs these kernels through the integration of a low-rank core tensor generator and a unified factor generator, orchestrated by a structure-aware multi-head attention mechanism. We further introduce a two-stage training strategy tailored for pansharpening, enabling KSDiff to serve as a framework for enhancing existing pansharpening architectures. Experiments on three widely used datasets, including WorldView-3, GaoFen-2, and QuickBird, demonstrate the superior performance of KSDiff both qualitatively and quantitatively. Code will be released upon possible acceptance.

[87] VPGS-SLAM: Voxel-based Progressive 3D Gaussian SLAM in Large-Scale Scenes

Tianchen Deng,Wenhua Wu,Junjie He,Yue Pan,Xirui Jiang,Shenghai Yuan,Danwei Wang,Hesheng Wang,Weidong Chen

Main category: cs.CV

TL;DR: VPGS-SLAM is a large-scale RGBD SLAM framework based on 3D Gaussian Splatting for indoor and outdoor scenes that addresses memory explosion and long sequences.

Details

Motivation: Existing 3DGS-based SLAM methods are limited to small scenes and suffer memory explosion in large-scale scenes; VPGS-SLAM aims to solve these problems. Method: Uses voxel-based progressive 3D Gaussian mapping, 2D-3D fusion camera tracking, and a loop closure method to support large-scale scenes. Result: Experiments show VPGS-SLAM is superior and generalizable on indoor and outdoor datasets. Conclusion: VPGS-SLAM is the first 3DGS-based SLAM framework for large-scale scenes, and the code will be open-sourced. Abstract: 3D Gaussian Splatting has recently shown promising results in dense visual SLAM. However, existing 3DGS-based SLAM methods are all constrained to small-room scenarios and struggle with memory explosion in large-scale scenes and long sequences. To this end, we propose VPGS-SLAM, the first 3DGS-based large-scale RGBD SLAM framework for both indoor and outdoor scenarios. We design a novel voxel-based progressive 3D Gaussian mapping method with multiple submaps for compact and accurate scene representation in large-scale and long-sequence scenes. This allows us to scale up to arbitrary scenes and improves robustness (even under pose drifts). In addition, we propose a 2D-3D fusion camera tracking method to achieve robust and accurate camera tracking in both indoor and outdoor large-scale scenes. Furthermore, we design a 2D-3D Gaussian loop closure method to eliminate pose drift. We further propose a submap fusion method with online distillation to achieve global consistency in large-scale scenes when detecting a loop. Experiments on various indoor and outdoor datasets demonstrate the superiority and generalizability of the proposed framework. The code will be open-sourced at https://github.com/dtc111111/vpgs-slam.

Md. Mithun Hossain,Md. Shakil Hossain,Sudipto Chaki,M. F. Mridha

Main category: cs.CV

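TL;DR: The paper proposes Co-AttenDWG, a multi-modal learning architecture that uses dual-path encoding, co-attention, and dimension-wise gating to significantly improve performance on multi-modal tasks.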

Details

Motivation: Current multi-modal learning methods fall short in cross-modal interaction and rely on static fusion strategies, so they do not fully exploit the complementarity of different modalities. Method: The model uses dual-path encoding, a co-attention mechanism, and a dimension-wise gating network, combined with an expert fusion module that produces a unified representation. Result: It achieves significant improvements in cross-modal alignment and state-of-the-art performance on the MIMIC and SemEval Memotion 1.0 datasets. Conclusion: The Co-AttenDWG architecture shows potential for a wide range of multi-modal tasks. Abstract: Multi-modal learning has become a critical research area because integrating text and image data can significantly improve performance in tasks such as classification, retrieval, and scene understanding. However, despite progress with pre-trained models, current approaches are limited by inadequate cross-modal interactions and static fusion strategies that do not fully exploit the complementary nature of different modalities. To address these shortcomings, we introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. Our approach begins by projecting text and image features into a common embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This mechanism is further enhanced by a dimension-wise gating network that adaptively regulates the feature contributions at the channel level, ensuring that only the most relevant information is emphasized. In parallel, dual-path encoders refine the representations by processing cross-modal information separately before an additional cross-attention layer further aligns modalities. The refined features are then aggregated via an expert fusion module that combines learned gating and self-attention to produce a robust, unified representation. We validate our approach on the MIMIC and SemEval Memotion 1.0 datasets, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance, underscoring the potential of our model for a wide range of multi-modal applications.

[89] Can Multimodal Large Language Models Understand Spatial Relations?

Jingping Liu,Ziyan Liu,Zhedong Cen,Yan Zhou,Yinan Zou,Weiyan Zhang,Haiyun Jiang,Tong Ruan

Main category: cs.CV

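TL;DR: SpatialMQA is a human-annotated spatial relation reasoning benchmark based on COCO2017 that addresses the shortcomings of existing benchmarks and pushes multimodal large language models to actually understand the image.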

Details

Motivation: Existing spatial relation reasoning benchmarks rely on bounding boxes, ignore perspective substitutions, or can be answered from the model's prior knowledge alone, which limits how well they measure a model's understanding of the objective world. Method: The authors design a careful annotation procedure, build the SpatialMQA benchmark with 5,392 samples, and evaluate a range of open- and closed-source MLLMs. Result: The current state-of-the-art MLLM reaches only 48.14% accuracy, far below the human-level accuracy of 98.40%. Conclusion: SpatialMQA points to directions for future research; the benchmark and code are open source. Abstract: Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues like relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting the future research directions. The benchmark and codes are available at https://github.com/ziyan-xiaoyu/SpatialMQA.git.

[90] Rethinking Metrics and Benchmarks of Video Anomaly Detection

Zihao Liu,Xiaoyu Wu,Wenna Li,Linlin Yang

Main category: cs.CV

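TL;DR: This paper rethinks evaluation protocols for video anomaly detection (VAD) and proposes three new methods that address the limitations of existing metrics and benchmarks: averaged AUC/AP over multi-round annotations, Latency-aware Average Precision (LaAP), and two hard normal benchmarks (UCF-HN, MSAD-HN).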

Details

Motivation: VAD research has advanced model architectures and training strategies, but weak evaluation metrics and benchmarks hold the field back. The paper targets single annotation bias, the lack of reward for early detection, and the missing evaluation of scene overfitting. Method: Three new evaluation methods are proposed: 1) averaged AUC/AP over multi-round annotations; 2) Latency-aware Average Precision (LaAP); 3) two hard normal benchmarks (UCF-HN, MSAD-HN). Result: Experimental analysis reveals three major limitations of current evaluation practice and validates the proposed methods. Conclusion: The paper offers new evaluation perspectives for VAD model development and addresses the shortcomings of existing evaluation protocols. Abstract: Video Anomaly Detection (VAD), which aims to detect anomalies that deviate from expectation, has attracted increasing attention in recent years. Existing advancements in VAD primarily focus on model architectures and training strategies, while devoting insufficient attention to evaluation metrics and benchmarks. In this paper, we rethink VAD evaluation protocols through comprehensive experimental analyses, revealing three critical limitations in current practices: 1) existing metrics are significantly influenced by single annotation bias; 2) current metrics fail to reward early detection of anomalies; 3) available benchmarks lack the capability to evaluate scene overfitting. To address these limitations, we propose three novel evaluation methods: first, we establish averaged AUC/AP metrics over multi-round annotations to mitigate single annotation bias; second, we develop a Latency-aware Average Precision (LaAP) metric that rewards early and accurate anomaly detection; and finally, we introduce two hard normal benchmarks (UCF-HN, MSAD-HN) with videos specifically designed to evaluate scene overfitting. We report performance comparisons of ten state-of-the-art VAD approaches using our proposed evaluation methods, providing novel perspectives for future VAD model development.
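
The first proposed metric, AUC averaged over multi-round annotations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy scores, and the three annotation rounds are assumptions made for illustration, and the abstract does not specify how rounds are collected or weighted (a plain mean is assumed here).

```python
from typing import List, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based ROC AUC: P(positive score > negative score) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def averaged_auc(scores: Sequence[float],
                 annotation_rounds: List[Sequence[int]]) -> float:
    """Score the same predictions against every annotation round, then average."""
    aucs = [roc_auc(scores, labels) for labels in annotation_rounds]
    return sum(aucs) / len(aucs)


# Hypothetical per-frame anomaly scores and three annotation rounds
# that disagree about where the anomaly starts and ends.
scores = [0.1, 0.2, 0.8, 0.9, 0.3]
rounds = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 0],
]
mean_auc = averaged_auc(scores, rounds)
```

Averaging over rounds means a detector is not penalized or rewarded for matching one annotator's particular choice of anomaly boundaries.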

[91] A Smart Healthcare System for Monkeypox Skin Lesion Detection and Tracking

Huda Alghoraibi,Nuha Alqurashi,Sarah Alotaibi,Renad Alkhudaydi,Bdoor Aldajani,Lubna Alqurashi,Jood Batweel,Maha A. Thafar

Main category: cs.CV

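TL;DR: The researchers developed ITMAINN, an AI-driven system that uses deep learning to detect monkeypox from skin lesion images, and deployed a mobile application and a real-time monitoring dashboard to support public health responses.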

Details

Motivation: The global monkeypox outbreak calls for scalable, accessible, and accurate diagnostic solutions. Method: The authors apply transfer learning to pretrained models and build the ITMAINN system, which includes a mobile application and a monitoring dashboard. Result: The models perform strongly on binary and multiclass classification, with top accuracies of 97.8% and 92%, respectively. Conclusion: The ITMAINN system offers a transformative solution for public health management in smart cities. Abstract: Monkeypox is a viral disease characterized by distinctive skin lesions and has been reported in many countries. The recent global outbreak has emphasized the urgent need for scalable, accessible, and accurate diagnostic solutions to support public health responses. In this study, we developed ITMAINN, an intelligent, AI-driven healthcare system specifically designed to detect Monkeypox from skin lesion images using advanced deep learning techniques. Our system consists of three main components. First, we trained and evaluated several pretrained models using transfer learning on publicly available skin lesion datasets to identify the most effective models. For binary classification (Monkeypox vs. non-Monkeypox), the Vision Transformer, MobileViT, Transformer-in-Transformer, and VGG16 achieved the highest performance, each with an accuracy and F1-score of 97.8%. For multiclass classification, which contains images of patients with Monkeypox and five other classes (chickenpox, measles, hand-foot-mouth disease, cowpox, and healthy), ResNetViT and ViT Hybrid models achieved 92% accuracy, with F1 scores of 92.24% and 92.19%, respectively. The best-performing and most lightweight model, MobileViT, was deployed within the mobile application. The second component is a cross-platform smartphone application that enables users to detect Monkeypox through image analysis, track symptoms, and receive recommendations for nearby healthcare centers based on their location. The third component is a real-time monitoring dashboard designed for health authorities to support them in tracking cases, analyzing symptom trends, guiding public health interventions, and taking proactive measures. This system is fundamental in developing responsive healthcare infrastructure within smart cities. Our solution, ITMAINN, is part of revolutionizing public health management.

[92] InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

Minzhi Lin,Tianchi Xie,Mengchen Liu,Yilin Ye,Changjian Chen,Shixia Liu

Main category: cs.CV

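TL;DR: InfoChartQA is a new benchmark for evaluating how well multimodal large language models (MLLMs) understand infographic charts. It contains 5,642 pairs of infographic and plain charts, together with questions about their visual elements.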

Details

Motivation: Existing visual question answering benchmarks cannot evaluate the visual recognition and reasoning abilities of MLLMs on infographic charts, so a new benchmark is needed to fill this gap. Method: The benchmark poses visual-element-based questions and compares performance on infographic charts against paired plain charts. Result: An evaluation of 20 MLLMs shows a substantial performance drop on infographic charts, especially on visual-element questions that involve metaphors. Conclusion: InfoChartQA opens new opportunities for improving MLLMs on infographic chart understanding and has been publicly released. Abstract: Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at https://github.com/CoolDawnAnt/InfoChartQA.

[93] Medical Large Vision Language Models with Multi-Image Visual Ability

Xikai Yang,Juzheng Miao,Yuchen Yuan,Jiaze Wang,Qi Dou,Jinpeng Li,Pheng-Ann Heng

Main category: cs.CV

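TL;DR: Medical large vision-language models (LVLMs) perform well on single-image question answering but fall short in multi-image clinical scenarios. The study introduces the Med-MIM dataset and benchmark and fine-tunes models to improve multi-image analysis.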

Details

Motivation: Current medical LVLMs perform poorly on multi-image tasks and lack support for complex capabilities such as temporal reasoning and cross-modal analysis; the study aims to fill this gap. Method: The authors build the Med-MIM dataset of 83.2K multi-image QA pairs, fine-tune Mantis and LLaVA-Med on it, and develop the Med-MIM benchmark to evaluate model performance. Result: The fine-tuned MIM-LLaVA-Med and Med-Mantis perform strongly on the Med-MIM benchmark, which validates the dataset. Conclusion: The Med-MIM dataset significantly improves the multi-image understanding of medical LVLMs and provides an important tool for future research. Abstract: Medical large vision-language models (LVLMs) have demonstrated promising performance across various single-image question answering (QA) benchmarks, yet their capability in processing multi-image clinical scenarios remains underexplored. Unlike single image based tasks, medical tasks involving multiple images often demand sophisticated visual understanding capabilities, such as temporal reasoning and cross-modal analysis, which are poorly supported by current medical LVLMs. To bridge this critical gap, we present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs that span four types of multi-image visual abilities (temporal understanding, reasoning, comparison, co-reference). Using this dataset, we fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs: MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Additionally, we develop the Med-MIM benchmark to comprehensively evaluate the medical multi-image understanding capabilities of LVLMs. We assess eight popular LVLMs, including our two models, on the Med-MIM benchmark. Experimental results show that both Med-Mantis and MIM-LLaVA-Med achieve superior performance on the held-in and held-out subsets of the Med-MIM benchmark, demonstrating that the Med-MIM instruction dataset effectively enhances LVLMs' multi-image understanding capabilities in the medical domain.

[94] Disentangled Human Body Representation Based on Unsupervised Semantic-Aware Learning

Lu Wang,Xishuai Peng,S. Kevin Zhou

Main category: cs.CV

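TL;DR: The paper proposes a 3D human body representation learned in an unsupervised framework. A skeleton-grouped disentanglement strategy and template-based residual learning give high-precision reconstruction and controllable fine-grained semantics.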

Details

Motivation: Existing methods are constrained by hand-defined human body constraints and a lack of supervision data, which makes it hard to control the human body representation precisely in terms of semantics and representation ability. Method: The authors design a skeleton-grouped disentanglement strategy and a template-based residual learning scheme, combined with unsupervised disentanglement losses and a part-aware decoder. Result: The method shows high-precision reconstruction on public 3D human datasets and supports applications such as human pose transfer and latent code interpolation. Conclusion: Through unsupervised learning, the method achieves controllable fine-grained semantics and high-precision reconstruction, with broad application potential. Abstract: In recent years, more and more attention has been paid to the learning of 3D human representation. However, the complexity of the many hand-defined human body constraints and the absence of supervision data limit the ability of existing works to represent the human body controllably and accurately in terms of semantics and representation ability. In this paper, we propose a human body representation with controllable fine-grained semantics and high reconstruction precision in an unsupervised learning framework. In particular, we design a whole-aware skeleton-grouped disentangle strategy to learn a correspondence between geometric semantic measurements of the body and latent codes, which facilitates control of the shape and posture of the human body by modifying latent code parameters. With the help of a skeleton-grouped whole-aware encoder and unsupervised disentanglement losses, our representation model is learned in an unsupervised manner. In addition, a template-based residual learning scheme is injected into the encoder to ease the learning of human body latent parameters in complicated body shape and pose spaces. Because of the geometrically meaningful latent codes, it can be used in a wide range of applications, from human body pose transfer to bilinear latent code interpolation. Furthermore, a part-aware decoder is utilized to promote the learning of controllable fine-grained semantics. The experimental results on public 3D human datasets show that the method is capable of precise reconstruction.

[95] Less is More: Efficient Point Cloud Reconstruction via Multi-Head Decoders

Pedro Alonso,Tianrui Li,Chongshou Li

Main category: cs.CV

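TL;DR: The paper challenges the assumption that deeper decoders always improve point cloud reconstruction. It proposes a multi-head decoder that reconstructs the point cloud from several independent heads, improving diversity and fidelity, and experiments show it outperforms single-head baselines.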

Details

Motivation: The paper studies how decoder depth affects point cloud reconstruction, finds that excessive depth causes overfitting and weaker generalization, and proposes a multi-head architecture to address this. Method: In the proposed multi-head decoder, each head independently handles a subset of points, and the predictions of all heads are concatenated, which improves diversity and fidelity. Result: On ModelNet40 and ShapeNetPart, the multi-head architecture beats single-head baselines on CD, HD, EMD, and F1-score. Conclusion: In point cloud reconstruction, output diversity and architectural design can matter more than simply adding depth, and the multi-head architecture is an efficient and effective solution. Abstract: We challenge the common assumption that deeper decoder architectures always yield better performance in point cloud reconstruction. Our analysis reveals that, beyond a certain depth, increasing decoder complexity leads to overfitting and degraded generalization. Additionally, we propose a novel multi-head decoder architecture that exploits the inherent redundancy in point clouds by reconstructing complete shapes from multiple independent heads, each operating on a distinct subset of points. The final output is obtained by concatenating the predictions from all heads, enhancing both diversity and fidelity. Extensive experiments on ModelNet40 and ShapeNetPart demonstrate that our approach achieves consistent improvements across key metrics--including Chamfer Distance (CD), Hausdorff Distance (HD), Earth Mover's Distance (EMD), and F1-score--outperforming standard single-head baselines. Our findings highlight that output diversity and architectural design can be more critical than depth alone for effective and efficient point cloud reconstruction.
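
The concatenation step of the multi-head decoder can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's architecture: each head here is a single random linear map (`make_head` and `run_head` are hypothetical helpers), and the latent size, head count, and points per head are arbitrary.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float, float]
Weights = List[List[float]]


def make_head(latent_dim: int, points_per_head: int, rng: random.Random) -> Weights:
    """Hypothetical head: one random linear layer with 3 outputs per point."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(latent_dim)]
            for _ in range(points_per_head * 3)]


def run_head(weights: Weights, latent: List[float]) -> List[Point]:
    """Map the shared latent code to this head's subset of 3D points."""
    flat = [sum(w * z for w, z in zip(row, latent)) for row in weights]
    return [(flat[i], flat[i + 1], flat[i + 2]) for i in range(0, len(flat), 3)]


def multi_head_decode(heads: List[Weights], latent: List[float]) -> List[Point]:
    """Run every head independently and concatenate their point subsets."""
    cloud: List[Point] = []
    for weights in heads:
        cloud.extend(run_head(weights, latent))
    return cloud


rng = random.Random(0)
latent = [rng.uniform(-1.0, 1.0) for _ in range(8)]
heads = [make_head(8, 256, rng) for _ in range(4)]  # 4 heads x 256 points each
cloud = multi_head_decode(heads, latent)
```

Because the heads share only the latent code, each one is free to specialize on its own subset of points, and the output size is simply the sum of the per-head sizes.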

[96] Training-free Stylized Text-to-Image Generation with Fast Inference

Xin Ma,Yaohui Wang,Xinyuan Chen,Tien-Tsin Wong,Cunjian Chen

Main category: cs.CV

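TL;DR: The paper proposes OmniPainter, a new method that generates stylized images with a pre-trained diffusion model and requires no fine-tuning or optimization.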

Details

Motivation: Existing methods need textual inversion or fine-tuning on style images, which is time-consuming and limits the practical use of large-scale diffusion models. Method: The method uses the self-consistency of latent consistency models to extract style statistics, and introduces a norm mixture of self-attention to query the relevant style patterns. Result: Qualitative and quantitative experiments show that the method outperforms existing techniques. Conclusion: OmniPainter is efficient, performs strongly, and suits large-scale diffusion models. Abstract: Although diffusion models exhibit impressive generative capabilities, existing methods for stylized image generation based on these models often require textual inversion or fine-tuning with style images, which is time-consuming and limits the practical applicability of large-scale diffusion models. To address these challenges, we propose a novel stylized image generation method, termed OmniPainter, that leverages a pre-trained large-scale diffusion model without requiring fine-tuning or any additional optimization. Specifically, we exploit the self-consistency property of latent consistency models to extract the representative style statistics from reference style images to guide the stylization process. We then introduce the norm mixture of self-attention, which enables the model to query the most relevant style patterns from these statistics for the intermediate output content features. This mechanism also ensures that the stylized results align closely with the distribution of the reference style images. Our qualitative and quantitative experimental results demonstrate that the proposed method outperforms state-of-the-art approaches.

[97] MMP-2K: A Benchmark Multi-Labeled Macro Photography Image Quality Assessment Database

Jiashuo Chang,Zhengyi Li,Jianxun Lou,Zhen Qiu,Hanhe Lin

Main category: cs.CV

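TL;DR: The paper presents MMP-2k, a new macro photography image quality assessment (MPIQA) database that fills a data gap in this area, and shows that existing generic IQA metrics perform poorly on macro photography images.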

Details

Motivation: Macro photography matters in fields such as scientific research and medicine, but the lack of dedicated image quality assessment data limits the development of MPIQA metrics. Method: The authors collect 15,700 macro photography images from public websites, select 2,000 of them, and run a lab study that gathers 17 quality ratings and a detailed distortion report for each image. Result: They build the multi-labeled MPIQA database MMP-2k; experiments show that existing generic IQA metrics perform poorly on macro photography images. Conclusion: MMP-2k is an important resource for MPIQA research and exposes the limitations of existing IQA metrics. Abstract: Macro photography (MP) is a specialized field of photography that captures objects at an extremely close range, revealing tiny details. Although an accurate macro photography image quality assessment (MPIQA) metric can benefit macro photograph capturing, which is vital in some domains such as scientific research and medical applications, the lack of MPIQA data limits the development of MPIQA metrics. To address this limitation, we conducted a large-scale MPIQA study. Specifically, to ensure diversity both in content and quality, we sampled 2,000 MP images from 15,700 MP images, collected from three public image websites. For each MP image, 17 (out of 21 after outlier removal) quality ratings and a detailed quality report of distortion magnitudes, types, and positions are gathered by a lab study. The images, quality ratings, and quality reports form our novel multi-labeled MPIQA database, MMP-2k. Experimental results showed that the state-of-the-art generic IQA metrics underperform on MP images. The database and supplementary materials are available at https://github.com/Future-IQA/MMP-2k.

[98] ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding

Muye Huang,Lingling Zhang,Jie Ma,Han Lai,Fangzhi Xu,Yifei Li,Wenjun Wu,Yaqiang Wu,Jun Liu

Main category: cs.CV

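TL;DR: ChartSketcher is a multimodal feedback-driven step-by-step reasoning method that improves chart understanding through visual annotation and iterative feedback.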

Details

Motivation: Existing multimodal large language models struggle to correct faulty reasoning in chart understanding because they lack visual interaction capabilities. Method: ChartSketcher uses Sketch-CoT to annotate intermediate reasoning steps on the chart and is trained with a two-stage strategy (cold start followed by reinforcement learning). Result: Experiments show that ChartSketcher performs well on chart understanding and general vision tasks. Conclusion: ChartSketcher provides an interactive and interpretable approach to chart understanding. Abstract: Charts are high-density visualization carriers for complex data, serving as a crucial medium for information extraction and analysis. Automated chart understanding poses significant challenges to existing multimodal large language models (MLLMs) due to the need for precise and complex visual reasoning. Current step-by-step reasoning models primarily focus on text-based logical reasoning for chart understanding. However, they struggle to refine or correct their reasoning when errors stem from flawed visual understanding, as they lack the ability to leverage multimodal interaction for deeper comprehension. Inspired by human cognitive behavior, we propose ChartSketcher, a multimodal feedback-driven step-by-step reasoning method designed to address these limitations. ChartSketcher is a chart understanding model that employs Sketch-CoT, enabling MLLMs to annotate intermediate reasoning steps directly onto charts using a programmatic sketching library, iteratively feeding these visual annotations back into the reasoning process. This mechanism enables the model to visually ground its reasoning and refine its understanding over multiple steps. We employ a two-stage training strategy: a cold start phase to learn sketch-based reasoning patterns, followed by off-policy reinforcement learning to enhance reflection and generalization. Experiments demonstrate that ChartSketcher achieves promising performance on chart understanding benchmarks and general vision tasks, providing an interactive and interpretable approach to chart comprehension.

[99] Towards Generalized Proactive Defense against Face Swapping with Contour-Hybrid Watermark

Ruiyang Xia,Dawei Zhou,Decheng Liu,Lin Yuan,Jie Li,Nannan Wang,Xinbo Gao

Main category: cs.CV

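TL;DR: The paper proposes CMark, a contour-hybrid watermark for proactive defense against unknown face swapping techniques. The watermark is embedded in the facial contour region, which enables efficient detection without storing large-scale data in advance.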

Details

Motivation: As AI-generated content improves, the differences between real and swapped faces become subtle and traditional forgery trace detection becomes difficult, so the authors turn to proactively embedded watermarks to handle unknown swapping techniques. Method: The method focuses on the region surrounding the face and embeds contour texture and identity information to form the contour-hybrid watermark (CMark); training requires neither specific swapping techniques nor pre-stored data. Result: Experiments on 8 face swapping techniques show that the method balances image quality and watermark robustness and outperforms existing passive and proactive detectors. Conclusion: CMark offers an efficient proactive defense against unknown face swapping, with broad application potential. Abstract: Face swapping, recognized as a privacy and security concern, has prompted considerable defensive research. With the advancements in AI-generated content, the discrepancies between the real and swapped faces have become nuanced. Considering the difficulty of forged traces detection, we shift the focus to the face swapping purpose and proactively embed elaborate watermarks against unknown face swapping techniques. Given that the constant purpose is to swap the original face identity while preserving the background, we concentrate on the regions surrounding the face to ensure robust watermark generation, while embedding the contour texture and face identity information to achieve progressive image determination. The watermark is located in the facial contour and contains hybrid messages, dubbed the contour-hybrid watermark (CMark). Our approach generalizes face swapping detection without requiring any swapping techniques during training and the storage of large-scale messages in advance. Experiments conducted across 8 face swapping techniques demonstrate the superiority of our approach compared with state-of-the-art passive and proactive detectors while achieving a favorable balance between the image quality and watermark robustness.

[100] Jodi: Unification of Visual Generation and Understanding via Joint Modeling

Yifeng Xu,Zhenliang He,Meina Kan,Shiguang Shan,Xilin Chen

Main category: cs.CV

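TL;DR: Jodi is a unified diffusion framework that combines visual generation and understanding by jointly modeling the image domain and multiple label domains.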

Details

Motivation: Traditional methods treat visual generation and understanding as separate tasks, whereas Jodi aims to unify them. Method: Jodi builds on a linear diffusion transformer with a role switch mechanism and supports joint generation, controllable generation, and image perception. Result: Experiments show that Jodi performs strongly on both generation and understanding tasks and extends well to more domains. Conclusion: Jodi provides an effective framework for unifying visual generation and understanding, validated across multiple tasks. Abstract: Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have been traditionally treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer along with a role switch mechanism, which enables it to perform three particular types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels can be predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from public sources, automatic labels for 7 visual domains, and LLM-generated captions. Extensive experiments demonstrate that Jodi excels in both generation and understanding tasks and exhibits strong extensibility to a wider range of visual domains. Code is available at https://github.com/VIPL-GENUN/Jodi.

[101] Plug-and-Play Context Feature Reuse for Efficient Masked Generation

Xuejie Liu,Anji Liu,Guy Van den Broeck,Yitao Liang

Main category: cs.CV

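TL;DR: The ReCAP module accelerates inference in masked generative models by reusing context features, which reduces computation while preserving generation quality.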

Details

Motivation: Masked generative models (MGMs) need many iterative decoding steps to produce high-quality samples, which makes inference expensive. Direct speedups, such as decoding more tokens at once, sacrifice generation fidelity. Method: The proposed ReCAP module builds low-cost steps by reusing context features from previously decoded tokens and alternates standard full evaluations with lightweight steps. Result: On ImageNet256 class-conditional generation, ReCAP achieves up to 2.4x faster inference than the base model with minimal performance drop. Conclusion: ReCAP consistently improves the efficiency-fidelity trade-off across several MGMs. Abstract: Masked generative models (MGMs) have emerged as a powerful framework for image synthesis, combining parallel decoding with strong bidirectional context modeling. However, generating high-quality samples typically requires many iterative decoding steps, resulting in high inference costs. A straightforward way to speed up generation is by decoding more tokens in each step, thereby reducing the total number of steps. However, when many tokens are decoded simultaneously, the model can only estimate the univariate marginal distributions independently, failing to capture the dependency among them. As a result, reducing the number of steps significantly compromises generation fidelity. In this work, we introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs by constructing low-cost steps via reusing feature embeddings from previously decoded context tokens. ReCAP interleaves standard full evaluations with lightweight steps that cache and reuse context features, substantially reducing computation while preserving the benefits of fine-grained, iterative generation. We demonstrate its effectiveness on top of three representative MGMs (MaskGIT, MAGE, and MAR), including both discrete and continuous token spaces and covering diverse architectural designs. In particular, on ImageNet256 class-conditional generation, ReCAP achieves up to 2.4x faster inference than the base model with minimal performance drop, and consistently delivers better efficiency-fidelity trade-offs under various generation settings.
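
The interleaving of full evaluations with lightweight steps can be illustrated with a toy schedule. This sketch only models the step pattern and a rough cost estimate; it does not implement feature caching, and the function names, the ratio of light to full steps, and the `light_cost` of 0.25 are assumptions, not figures from the paper.

```python
from typing import List


def build_schedule(total_steps: int, light_per_full: int) -> List[str]:
    """Each full evaluation is followed by up to `light_per_full` lightweight
    steps, which would reuse the context features cached by that full step."""
    schedule: List[str] = []
    while len(schedule) < total_steps:
        schedule.append("full")
        for _ in range(light_per_full):
            if len(schedule) < total_steps:
                schedule.append("light")
    return schedule


def relative_cost(schedule: List[str], light_cost: float = 0.25) -> float:
    """Cost relative to running every step as a full evaluation.
    `light_cost` is a hypothetical fraction of one full step."""
    spent = sum(1.0 if step == "full" else light_cost for step in schedule)
    return spent / len(schedule)


schedule = build_schedule(total_steps=8, light_per_full=1)
cost = relative_cost(schedule)
```

Under these assumed numbers, the number of decoding steps stays the same while the total compute drops, which is the trade-off the abstract describes: fine-grained iteration is kept, but only some steps pay for a full forward pass.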

[102] SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards

Chuming Shen,Wei Wei,Xiaoye Qu,Yu Cheng

Main category: cs.CV

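TL;DR: SATORI decomposes VQA into three verifiable stages (global image captioning, region localization, and answer prediction). This addresses the diffused visual focus and unverifiable intermediate steps of R1-like free-form reasoning in VQA and yields accuracy gains of up to 15.7% across several benchmarks.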

Details

Motivation: Multimodal tasks differ in nature from text tasks, and directly applying RL to generate free-form reasoning diffuses visual focus and raises computational cost. Method: SATORI decomposes VQA into three verifiable stages, and the authors introduce the VQA-Verify dataset. Result: Accuracy improves by up to 15.7% across seven VQA benchmarks, and attention map analysis shows stronger focus on critical regions. Conclusion: With verifiable stages and explicit reward signals, SATORI significantly improves VQA performance. Abstract: DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). Recently, in the multimodal domain, works have begun to directly apply RL to generate R1-like free-form reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks are intrinsically different from textual tasks in that they rely heavily on understanding the input image to solve the problem. Therefore, such free-form reasoning faces two critical limitations in the VQA task: (1) Extended reasoning chains diffuse visual focus away from task-critical regions, degrading answer accuracy. (2) Unverifiable intermediate steps amplify policy-gradient variance and computational overhead. To address these issues, in this paper, we introduce SATORI (Spatially Anchored Task Optimization with ReInforcement Learning), which decomposes VQA into three verifiable stages, including global image captioning, region localization, and answer prediction, each supplying explicit reward signals. Furthermore, we also introduce VQA-Verify, a 12k dataset annotated with answer-aligned captions and bounding-boxes to facilitate training. Experiments demonstrate consistent performance improvements across seven VQA benchmarks, achieving up to 15.7% improvement in accuracy compared to the R1-like baseline. Our analysis of the attention map confirms enhanced focus on critical regions, which brings improvements in accuracy. Our code is available at https://github.com/justairr/SATORI-R1.
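
The idea of one explicit reward per verifiable stage can be sketched as below. This is an illustrative assumption, not the paper's reward design: the abstract does not say how each stage is scored, so the sketch uses box IoU for region localization, case-insensitive exact match for the answer, an externally supplied caption score, and equal hypothetical weights.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def staged_reward(caption_score: float,
                  pred_box: Box, gt_box: Box,
                  pred_answer: str, gt_answer: str,
                  weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Sum of per-stage rewards; the scoring rules and weights are assumptions."""
    r_loc = box_iou(pred_box, gt_box)
    r_ans = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    w_cap, w_loc, w_ans = weights
    return w_cap * caption_score + w_loc * r_loc + w_ans * r_ans


iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
reward = staged_reward(0.5, (0, 0, 2, 2), (0, 0, 2, 2), "Cat", "cat ")
```

Each term can be checked against an annotation (a reference caption, a bounding box, an answer), which is what makes the intermediate steps verifiable rather than free-form.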

[103] An Interpretable Representation Learning Approach for Diffusion Tensor Imaging

Vishwa Mohan Singh,Alberto Gaston Villagran Asiares,Luisa Sophie Schuhmacher,Kate Rendall,Simon Weißbrod,David Rügamer,Inga Körte

Main category: cs.CV

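TL;DR: The paper proposes a new 2D representation that encodes the FA values of DTI tractography into a 9x9 grayscale image, and learns a disentangled, interpretable latent embedding with a Beta-Total Correlation VAE and a Spatial Broadcast Decoder.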

Details

Motivation: DTI tractography is hard to represent and interpret effectively in deep learning models, so a better representation is needed. Method: The 2D representation is processed by a Beta-Total Correlation VAE with a Spatial Broadcast Decoder, and embedding quality is evaluated with supervised and unsupervised learning strategies. Result: Compared with the 1D Group DNN baselines, the F1 score on a sex classification task improves by 15.74%, and disentanglement is better than with the 3D representation. Conclusion: The method significantly improves the representation and interpretability of DTI tractography in deep learning. Abstract: Diffusion Tensor Imaging (DTI) tractography offers detailed insights into the structural connectivity of the brain, but presents challenges in effective representation and interpretation in deep learning models. In this work, we propose a novel 2D representation of DTI tractography that encodes tract-level fractional anisotropy (FA) values into a 9x9 grayscale image. This representation is processed through a Beta-Total Correlation Variational Autoencoder with a Spatial Broadcast Decoder to learn a disentangled and interpretable latent embedding. We evaluate the quality of this embedding using supervised and unsupervised representation learning strategies, including auxiliary classification, triplet loss, and SimCLR-based contrastive learning. Compared to the 1D Group deep neural network (DNN) baselines, our approach improves the F1 score in a downstream sex classification task by 15.74% and shows a better disentanglement than the 3D representation.

[104] Remote Sensing Image Classification with Decoupled Knowledge Distillation

Yaping He,Jianfeng Cai,Qicong Hu,Peiqing Wang

Main category: cs.CV

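TL;DR: The paper proposes a lightweight remote sensing image classification method based on knowledge distillation that significantly reduces parameters and improves efficiency.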

Details

Motivation: Existing remote sensing image classification models have too many parameters to deploy easily on resource-constrained devices. Method: The method adopts G-GhostNet as the backbone, uses feature reuse to reduce redundant parameters, and applies a decoupled knowledge distillation strategy to improve classification accuracy. Result: On the RSOD and AID datasets, the method uses 6.24 times fewer parameters than VGG-16 with nearly equivalent Top-1 accuracy. Conclusion: The method balances model size and classification performance well and suits deployment on resource-constrained devices. Abstract: To address the challenges posed by the large number of parameters in existing remote sensing image classification models, which hinder deployment on resource-constrained devices, this paper proposes a lightweight classification method based on knowledge distillation. Specifically, G-GhostNet is adopted as the backbone network, leveraging feature reuse to reduce redundant parameters and significantly improve inference efficiency. In addition, a decoupled knowledge distillation strategy is employed, which separates target and non-target classes to effectively enhance classification accuracy. Experimental results on the RSOD and AID datasets demonstrate that, compared with the high-parameter VGG-16 model, the proposed method achieves nearly equivalent Top-1 accuracy while reducing the number of parameters by 6.24 times. This approach strikes an excellent balance between model size and classification performance, offering an efficient solution for deployment on resource-limited devices.
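
The separation of target and non-target classes in decoupled knowledge distillation can be sketched for a single sample as follows. This follows the commonly cited formulation (a weighted sum of a target-class term and a non-target-class term, each a KL divergence from teacher to student); whether this paper uses exactly this form is an assumption, and the `alpha`, `beta`, and `temperature` values are illustrative, not taken from the abstract.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def dkd_loss(student_logits: Sequence[float], teacher_logits: Sequence[float],
             target: int, alpha: float = 1.0, beta: float = 8.0,
             temperature: float = 4.0) -> float:
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)

    # Target-class term: binary distribution (target vs. all other classes).
    tckd = kl_divergence([p_t[target], 1.0 - p_t[target]],
                         [p_s[target], 1.0 - p_s[target]])

    # Non-target-class term: renormalize over the non-target classes only.
    rest_t = [p for i, p in enumerate(p_t) if i != target]
    rest_s = [p for i, p in enumerate(p_s) if i != target]
    rest_t = [p / sum(rest_t) for p in rest_t]
    rest_s = [p / sum(rest_s) for p in rest_s]
    nckd = kl_divergence(rest_t, rest_s)

    return (alpha * tckd + beta * nckd) * temperature ** 2


loss_same = dkd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1], target=0)
loss_diff = dkd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], target=0)
```

Weighting the two terms separately lets the non-target term, which carries the teacher's knowledge about how the wrong classes relate to each other, contribute even when the teacher is very confident about the target class.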

[105] CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design

Hui Zhang,Dexiang Hong,Maoke Yang,Yutao Chen,Zhao Zhang,Jie Shao,Xinglong Wu,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

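TL;DR: The paper proposes CreatiDesign, an automated graphic design solution whose unified multi-condition driven architecture and multimodal attention mask mechanism address the weaknesses of existing methods in multi-condition control.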

Details

Motivation: Complex graphic design scenarios require accurately following design intent specified by multiple heterogeneous user-provided elements, and existing methods are limited in multi-condition control. Method: The authors design a unified multi-condition driven architecture and a multimodal attention mask mechanism, and build a dataset of 400K samples. Result: Experiments show that CreatiDesign clearly outperforms existing models in faithfully following user intent. Conclusion: CreatiDesign offers an effective solution for automated graphic design and addresses the challenge of multi-condition control. Abstract: Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurately adhering to design intent specified by multiple heterogeneous user-provided elements (e.g., images, layouts, and texts), which pose multi-condition control challenges for existing methods. Specifically, previous single-condition control models demonstrate effectiveness only within their specialized domains but fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets, and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.

[106] Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition

Xiaoyang Liu,Bolin Qiu,Jiezhang Cao,Zheng Chen,Yulun Zhang,Xiaokang Yang

Main category: cs.CV

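TL;DR: Freqformer is a Transformer-based framework that tackles image demoiréing through frequency separation, combining a dual-branch architecture with a learnable frequency fusion module to achieve strong demoiréing performance.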

Details

Motivation: Existing methods struggle to disentangle the texture corruption and color distortion caused by moiré patterns, and the potential of wavelet-based frequency-aware approaches remains underexplored. Method: Proposes Freqformer, which uses frequency decomposition to split moiré patterns into high-frequency textures and low-frequency color distortions, handles them with a dual-branch architecture, fuses the outputs with a learnable Frequency Composition Transform (FCT) module, and adds a Spatial-Aware Channel Attention (SA-CA) module. Result: Across multiple demoiréing benchmarks, Freqformer achieves state-of-the-art performance with a compact model size. Conclusion: Through frequency separation and adaptive fusion, Freqformer markedly improves image demoiréing while remaining computationally efficient. Abstract: Image demoiréing remains a challenging task due to the complex interplay between texture corruption and color distortions caused by moiré patterns. Existing methods, especially those relying on direct image-to-image restoration, often fail to disentangle these intertwined artifacts effectively. While wavelet-based frequency-aware approaches offer a promising direction, their potential remains underexplored. In this paper, we present Freqformer, a Transformer-based framework specifically designed for image demoiréing through targeted frequency separation. Our method performs an effective frequency decomposition that explicitly splits moiré patterns into high-frequency spatially-localized textures and low-frequency scale-robust color distortions, which are then handled by a dual-branch architecture tailored to their distinct characteristics. We further propose a learnable Frequency Composition Transform (FCT) module to adaptively fuse the frequency-specific outputs, enabling consistent and high-fidelity reconstruction. To better aggregate the spatial dependencies and the inter-channel complementary information, we introduce a Spatial-Aware Channel Attention (SA-CA) module that refines moiré-sensitive regions without incurring high computational cost. Extensive experiments on various demoiréing benchmarks demonstrate that Freqformer achieves state-of-the-art performance with a compact model size. The code is publicly available at https://github.com/xyLiu339/Freqformer.
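To make the idea of splitting a signal into low- and high-frequency bands concrete, here is a minimal sketch of a one-level Haar split on a 1D sequence. The abstract does not say which decomposition Freqformer uses, so the Haar averaging variant and the names `haar_split` and `haar_merge` are assumptions chosen purely for illustration.

```python
def haar_split(signal):
    """One-level Haar split of an even-length sequence into (low, high) bands.

    low holds pairwise averages (coarse content), high holds pairwise
    half-differences (fine detail). Illustrative only, not the paper's code.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def haar_merge(low, high):
    """Invert haar_split: rebuild the original sequence from the two bands."""
    out = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out
```

Because the split is invertible, the two bands can in principle be processed separately and recombined without losing information. This is the property that a dual-branch design followed by a fusion step relies on.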

[107] Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

Eric Tillman Bill,Cristian Perez Jensen,Sotiris Anagnostidis,Dimitri von Rütte

Main category: cs.CV

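TL;DR: This paper examines whether magnitude-preserving design stabilizes training in the Diffusion Transformer (DiT) architecture and proposes rotation modulation, a new conditioning method, reporting significant performance gains.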

Details

Motivation: High-variance gradient estimates slow convergence when training denoising diffusion models; this paper tests whether magnitude-preserving design is effective in the DiT architecture and explores better conditioning strategies. Method: Proposes a magnitude-preserving design that avoids normalization layers and introduces rotation modulation as a new conditioning method, validated through experiments on small-scale models. Result: Magnitude-preserving strategies significantly improve performance, reducing FID scores by about 12.8%; rotation modulation combined with scaling is competitive with AdaLN while using about 5.4% fewer parameters. Conclusion: The work offers new insights into conditioning strategies and magnitude control, and the implementation will be publicly released. Abstract: Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps with stabilizing training in the U-Net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. As such, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, which is a novel conditioning method using learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by ~12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN, while requiring ~5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release the implementation of our method.
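The appeal of rotations as a conditioning signal is that they change a feature vector's direction without changing its magnitude. The sketch below illustrates this with 2D rotations applied to consecutive channel pairs. The abstract does not describe how the learned rotations are parameterized, so the pairing scheme and the names `rotate_pairs` and `l2_norm` are assumptions for illustration only.

```python
import math


def rotate_pairs(features, angles):
    """Rotate channel pairs (x0, x1), (x2, x3), ... by the given angles in radians.

    In a conditioned model the angles would be predicted from the condition;
    here they are plain inputs. Hypothetical parameterization.
    """
    if len(features) != 2 * len(angles):
        raise ValueError("need exactly one angle per channel pair")
    out = []
    for k, theta in enumerate(angles):
        a, b = features[2 * k], features[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([c * a - s * b, s * a + c * b])
    return out


def l2_norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))
```

Each 2D rotation is orthogonal, so `l2_norm(rotate_pairs(x, angles))` equals `l2_norm(x)` up to floating-point error. Scale-and-shift modulation offers no such guarantee.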

Yuqi Liu,Qin Jin,Tianyuan Qu,Xuan Liu,Yang Du,Bei Yu,Jiaya Jia

Main category: cs.CV

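TL;DR: The paper introduces the RTime-QA benchmark and the RTime-IT dataset for evaluating and improving the atomic temporal event understanding of large multimodal models (LMMs). Experiments show that existing models fall far short of human performance, but fine-tuning on RTime-IT substantially improves results.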

Details

Motivation: Current video-language benchmarks cannot effectively evaluate the temporal event understanding of LMMs, because their questions can be solved by image-language models. Method: Designs the RTime-QA benchmark (822 high-quality video-text questions) and the RTime-IT instruction-tuning dataset (14k samples) to evaluate and improve the temporal understanding of LMMs. Result: Qwen2-VL reaches a strict accuracy of only 34.6 on RTime-QA, far below human performance; after fine-tuning on RTime-IT, it reaches 65.9. Conclusion: RTime-QA and RTime-IT provide effective tools for evaluating and improving the temporal event understanding of LMMs. Abstract: Accurately understanding atomic temporal events is essential for video comprehension. However, current video-language benchmarks often fall short in evaluating the temporal event understanding capabilities of Large Multi-modal Models (LMMs), as they can be effectively addressed using image-language models. In this paper, we introduce RTime-QA, a novel benchmark specifically designed to assess the atomic temporal event understanding ability of LMMs. RTime-QA comprises 822 high-quality, carefully-curated video-text questions, each meticulously annotated by human experts. Each question features a video depicting an atomic temporal event, paired with both correct answers and temporal negative descriptions, specifically designed to evaluate temporal understanding. To advance LMMs' temporal event understanding ability, we further introduce RTime-IT, a 14k instruction-tuning dataset that employs an annotation process similar to that of RTime-QA. Extensive experimental analysis demonstrates that RTime-QA presents a significant challenge for LMMs: the state-of-the-art model Qwen2-VL achieves only 34.6 on the strict-ACC metric, substantially lagging behind human performance. Furthermore, our experiments reveal that RTime-IT effectively enhances LMMs' capacity for temporal understanding. By fine-tuning on RTime-IT, our Qwen2-VL achieves 65.9 on RTime-QA.

[109] Veta-GS: View-dependent deformable 3D Gaussian Splatting for thermal infrared Novel-view Synthesis

Myeongseok Nam,Wongi Park,Minsol Kim,Hyejin Hur,Soomok Lee

Main category: cs.CV

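TL;DR: Veta-GS introduces a view-dependent deformation field and a Thermal Feature Extractor (TFE) to address transmission effects, emissivity, and low resolution when applying 3D Gaussian Splatting to thermal infrared images, improving rendering quality.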

Details

Motivation: Thermal infrared images in 3D Gaussian Splatting suffer from transmission effects, emissivity, and low resolution, which cause floaters and blur in rendered images. Method: Designs a view-dependent deformation field to capture thermal variations, and introduces a Thermal Feature Extractor (TFE) and a MonoSSIM loss that jointly consider appearance, edges, and frequency to maintain robustness. Result: On the TI-NSD benchmark, Veta-GS outperforms existing methods. Conclusion: With its deformation field and feature extractor, Veta-GS markedly improves the rendering of thermal infrared images in 3D Gaussian Splatting. Abstract: Recently, 3D Gaussian Splatting (3D-GS) based on Thermal Infrared (TIR) imaging has gained attention in novel-view synthesis, showing real-time rendering. However, novel-view synthesis with thermal infrared images suffers from transmission effects, emissivity, and low resolution, leading to floaters and blur effects in rendered images. To address these problems, we introduce Veta-GS, which leverages a view-dependent deformation field and a Thermal Feature Extractor (TFE) to precisely capture subtle thermal variations and maintain robustness. Specifically, we design a view-dependent deformation field that leverages camera position and viewing direction to capture thermal variations. Furthermore, we introduce the Thermal Feature Extractor (TFE) and MonoSSIM loss, which consider appearance, edge, and frequency to maintain robustness. Extensive experiments on the TI-NSD benchmark show that our method achieves better performance than existing methods.

[110] The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework

Feiran Liu,Yuzhe Zhang,Xinyi Huang,Yinan Peng,Xinfeng Li,Lixu Wang,Yutong Shen,Ranjie Duan,Simeng Qin,Xiaojun Jia,Qingsong Wen,Wei Dong

Main category: cs.CV

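TL;DR: The paper reveals a new privacy risk: vision-language models (VLMs) can infer sensitive and abstract attributes from personal images. It also proposes HolmesEye, a hybrid framework that strengthens this kind of privacy inference.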

Details

Motivation: Modern apps can easily access users' photo albums, and multi-image inference can exploit inter-image relations for more sophisticated privacy profiling, yet relevant datasets and model capabilities are lacking. Method: Builds the PAPI dataset and proposes the HolmesEye framework, which combines VLMs and LLMs to extract image information and guide the inference process. Result: HolmesEye improves average accuracy by 10.8% over existing baselines and surpasses human-level performance by 15.0% on abstract attribute prediction. Conclusion: The study highlights the urgency of addressing image-based privacy profiling and provides a dataset and framework to guide future research. Abstract: Our research reveals a new privacy risk associated with the vision-language model (VLM) agentic framework: the ability to infer sensitive attributes (e.g., age and health information) and even abstract ones (e.g., personality and social traits) from a set of personal images, which we term "image private attribute profiling." This threat is particularly severe given that modern apps can easily access users' photo albums, and inference from image sets enables models to exploit inter-image relations for more sophisticated profiling. However, two main challenges hinder our understanding of how well VLMs can profile an individual from a few personal photos: (1) the lack of benchmark datasets with multi-image annotations for private attributes, and (2) the limited ability of current multimodal large language models (MLLMs) to infer abstract attributes from large image collections. In this work, we construct PAPI, the largest dataset for studying private attribute profiling in personal images, comprising 2,510 images from 251 individuals with 3,012 annotated privacy attributes. We also propose HolmesEye, a hybrid agentic framework that combines VLMs and LLMs to enhance privacy inference. HolmesEye uses VLMs to extract both intra-image and inter-image information and LLMs to guide the inference process as well as consolidate the results through forensic analysis, overcoming existing limitations in long-context visual reasoning. Experiments reveal that HolmesEye achieves a 10.8% improvement in average accuracy over state-of-the-art baselines and surpasses human-level performance by 15.0% in predicting abstract attributes. This work highlights the urgency of addressing privacy risks in image-based profiling and offers both a new dataset and an advanced framework to guide future research in this area.

[111] DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing

Shengdong Han,Shangdong Yang,Xin Zhang,Yuxuan Li,Xiang Li,Jian Yang,Ming-Ming Cheng,Yimian Dai

Main category: cs.CV

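TL;DR: Proposes DISTA-Net, a network for unmixing closely-spaced small targets in infrared imaging, and establishes an open-source ecosystem around the task.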

Details

Motivation: Overlapping signals from closely-spaced small targets in infrared imaging make it hard to determine their number, positions, and radiation intensities precisely, and existing deep learning methods have not addressed this problem. Method: Proposes the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which generates convolution weights and thresholding parameters on the fly to tailor the sparse reconstruction process. Result: DISTA-Net achieves superior sub-pixel detection accuracy, and the authors establish the first open-source ecosystem for the task (the CSIST-100K dataset, the CSO-mAP evaluation metric, and the GrokCSO toolkit). Conclusion: DISTA-Net is the first deep learning model designed for unmixing closely-spaced infrared small targets, and the open-source ecosystem is intended to advance research in this field. Abstract: Resolving closely-spaced small targets in dense clusters presents a significant challenge in infrared imaging, as the overlapping signals hinder precise determination of their quantity, sub-pixel positions, and radiation intensities. While deep learning has advanced the field of infrared small target detection, its application to closely-spaced infrared small targets has not yet been explored. This gap exists primarily due to the complexity of separating superimposed characteristics and the lack of an open-source infrastructure. In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. DISTA-Net adaptively generates convolution weights and thresholding parameters to tailor the reconstruction process in real time. To the best of our knowledge, DISTA-Net is the first deep learning model designed specifically for the unmixing of closely-spaced infrared small targets, achieving superior sub-pixel detection accuracy. Moreover, we have established the first open-source ecosystem to foster further research in this field. This ecosystem comprises three key components: (1) CSIST-100K, a publicly available benchmark dataset; (2) CSO-mAP, a custom evaluation metric for sub-pixel detection; and (3) GrokCSO, an open-source toolkit featuring DISTA-Net and other models. Our code and dataset are available at https://github.com/GrokCV/GrokCSO.
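For readers unfamiliar with the "traditional sparse reconstruction" that the abstract builds on, the following is a minimal sketch of the classical Iterative Shrinkage-Thresholding Algorithm (ISTA) for the problem min_x 0.5 * ||Ax - y||^2 + lam * ||x||_1. It uses a fixed step size and threshold; DISTA-Net's dynamically generated weights and thresholds are not modeled here. The helper names and the list-of-lists matrix format are assumptions for illustration.

```python
def soft_threshold(v, thr):
    """Shrink each entry toward zero by thr; entries within [-thr, thr] become 0."""
    out = []
    for x in v:
        mag = max(abs(x) - thr, 0.0)
        out.append(mag if x >= 0 else -mag)
    return out


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def ista(A, y, lam, step, n_iter):
    """Classical ISTA with a fixed step size and threshold (illustrative only)."""
    At = [list(col) for col in zip(*A)]  # transpose of A
    x = [0.0] * len(A[0])
    for _ in range(n_iter):
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = matvec(At, residual)  # gradient of the least-squares term
        x = soft_threshold([xi - step * g for xi, g in zip(x, grad)], step * lam)
    return x
```

With an identity measurement matrix, the iteration reduces to soft-thresholding the observation, which shows how small responses are zeroed out while strong ones survive with reduced magnitude.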

[112] MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection

Shuyu Wang,Weiqi Li,Qian Wang,Shijie Zhao,Jian Zhang

Main category: cs.CV

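TL;DR: MIND-Edit is an end-to-end image editing framework that combines a pretrained diffusion model with a multimodal large language model (MLLM). By optimizing text instructions and exploiting the MLLM's visual understanding, it improves the semantic accuracy and visual consistency of edits.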

Details

Motivation: Existing image editing methods struggle to achieve high precision and semantic accuracy in complex scenarios, and current MLLM-based methods rely mainly on textual instructions while neglecting the models' visual understanding. Method: Proposes the MIND-Edit framework, which comprises a text instruction optimization strategy and an MLLM insight-driven editing strategy, integrated through joint training. Result: Experiments show that MIND-Edit outperforms existing methods in both quantitative metrics and visual quality, particularly in complex scenarios. Conclusion: By combining textual and visual understanding, MIND-Edit improves the accuracy and consistency of image editing and offers an effective solution for editing in complex scenarios. Abstract: Recent advances in AI-generated content (AIGC) have significantly accelerated image editing techniques, driving increasing demand for diverse and fine-grained edits. Despite these advances, existing image editing methods still face challenges in achieving high precision and semantic accuracy in complex scenarios. Recent studies address this issue by incorporating multimodal large language models (MLLMs) into image editing pipelines. However, current MLLM-based methods mainly rely on interpreting textual instructions, leaving the intrinsic visual understanding of large models largely unexplored, thus resulting in insufficient alignment between textual semantics and visual outcomes. To overcome these limitations, we propose MIND-Edit, an end-to-end image-editing framework integrating a pretrained diffusion model with an MLLM. MIND-Edit introduces two complementary strategies: (1) a text instruction optimization strategy that clarifies ambiguous user instructions based on semantic reasoning from the MLLM, and (2) an MLLM insight-driven editing strategy that explicitly leverages the intrinsic visual understanding capability of the MLLM to infer editing intent and guide the diffusion process via generated visual embeddings. Furthermore, we propose a joint training approach to effectively integrate both strategies, allowing them to reinforce each other for more accurate instruction interpretation and visually coherent edits aligned with user intent. Extensive experiments demonstrate that MIND-Edit outperforms state-of-the-art image editing methods in both quantitative metrics and visual quality, particularly under complex and challenging scenarios.

[113] FHGS: Feature-Homogenized Gaussian Splatting

Q. G. Duan,Benyun Zhao,Mingqiao Han,Yijun Huang,Ben M. Chen

Main category: cs.CV

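TL;DR: FHGS proposes a feature fusion framework based on 3D Gaussian Splatting that resolves the conflict between the isotropy required of semantic features and the anisotropy of Gaussian primitives, achieving efficient rendering together with cross-view feature consistency.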

Details

Motivation: 3D Gaussian Splatting methods render efficiently but cannot satisfy the isotropy required of semantic features, which leads to insufficient cross-view feature consistency. Method: FHGS maps 2D features from pre-trained models into 3D scenes using a universal feature fusion architecture, a non-differentiable feature fusion mechanism, and a dual-driven optimization strategy. Result: FHGS achieves efficient rendering and high-precision feature mapping, balancing anisotropic rendering with isotropic feature expression. Conclusion: FHGS offers a new approach to 3D scene understanding that combines efficient rendering with cross-view feature consistency. Abstract: Scene understanding based on 3D Gaussian Splatting (3DGS) has recently achieved notable advances. Although 3DGS-related methods have efficient rendering capabilities, they fail to address the inherent contradiction between the anisotropic color representation of Gaussian primitives and the isotropic requirements of semantic features, leading to insufficient cross-view feature consistency. To overcome this limitation, we propose FHGS (Feature-Homogenized Gaussian Splatting), a novel 3D feature fusion framework inspired by physical models, which can achieve high-precision mapping of arbitrary 2D features from pre-trained models to 3D scenes while preserving the real-time rendering efficiency of 3DGS. Specifically, FHGS introduces the following innovations. Firstly, a universal feature fusion architecture is proposed, enabling robust embedding of large-scale pre-trained models' semantic features (e.g., SAM, CLIP) into sparse 3D structures. Secondly, a non-differentiable feature fusion mechanism is introduced, which enables semantic features to exhibit viewpoint-independent isotropic distributions. This fundamentally balances the anisotropic rendering of Gaussian primitives and the isotropic expression of features. Thirdly, a dual-driven optimization strategy inspired by electric potential fields is proposed, which combines external supervision from semantic feature fields with internal primitive clustering guidance. This mechanism enables synergistic optimization of global semantic alignment and local structural consistency. More interactive results can be accessed at https://fhgs.cuastro.org/.

[114] Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

Xuan Zhang,Cunxiao Du,Sicheng Yu,Jiawei Wu,Fengzhuo Zhang,Wei Gao,Qian Liu

Main category: cs.CV

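TL;DR: The paper proposes Sparse-to-Dense (StD), a decoding strategy that combines sparse and dense attention modules to accelerate inference in video large language models while preserving performance.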

Details

Motivation: Because current video large language models are auto-regressive, inference latency grows with input sequence length, and video sequences are usually long, so efficient processing is needed. Method: StD combines a sparse top-K attention module with a dense full-attention module; the fast model speculatively decodes multiple tokens and the slow model verifies them in parallel. Result: StD achieves up to a 1.94× speedup, requires no tuning, and needs only minimal code changes. Conclusion: StD is an efficient, plug-and-play solution that significantly improves video processing efficiency. Abstract: Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94× wall-time speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
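The draft-then-verify pattern described in the abstract can be sketched as a generic greedy speculative decoding loop. This is not the StD implementation. The two models are stand-in callables (`draft_next`, `verify_next`) that map a token list to the next token, and the verification calls are written sequentially for clarity where a real system would batch them into one forward pass. All names here are hypothetical.

```python
def speculative_decode(draft_next, verify_next, prompt, n_new, gamma=4):
    """Greedy speculative decoding sketch.

    The output is identical to greedy decoding with verify_next alone,
    which is why the scheme is lossless.
    """
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    while len(tokens) < target_len:
        # 1) The cheap draft model proposes gamma tokens autoregressively.
        proposal = []
        for _ in range(gamma):
            proposal.append(draft_next(tokens + proposal))
        # 2) The verifier checks each proposed position in order.
        accepted = []
        for i, tok in enumerate(proposal):
            expected = verify_next(tokens + proposal[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the verifier's token and stop.
                accepted.append(expected)
                break
        else:
            # Every draft token matched; the verifier adds one bonus token.
            accepted.append(verify_next(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:target_len]
```

The speedup comes from accepting several draft tokens per verifier pass when the draft is usually right. When the draft is wrong, each round still makes progress by at least one token.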

[115] A Joint Learning Framework with Feature Reconstruction and Prediction for Incomplete Satellite Image Time Series in Agricultural Semantic Segmentation

Yuze Wang,Mariana Belgiu,Haiyang Wu,Dandan Zhong,Yangyang Cao,Chao Tao

Main category: cs.CV

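TL;DR: Proposes a joint learning framework that handles incomplete satellite image time series (SITS) through feature reconstruction and prediction tasks, significantly improving agricultural semantic segmentation.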

Details

Motivation: Cloud contamination causes missing data in SITS, disrupting temporal dependencies and shifting features, while existing methods such as full reconstruction or data augmentation suffer from noise, redundancy, or poor generalization. Method: Uses a joint learning framework that combines feature reconstruction and prediction tasks, simulates missing data with temporal masks, and guides learning with a teacher model trained on complete SITS. Result: In experiments on SITS from Hunan Province, Western France, and Catalonia, the method improves mean F1-scores by 6.93% in cropland extraction and 7.09% in crop classification, and generalizes well across satellite sensors and missing rates. Conclusion: By selectively reconstructing critical features and limiting noise propagation, the method handles missing SITS data effectively and improves model robustness and generalization. Abstract: Satellite Image Time Series (SITS) is crucial for agricultural semantic segmentation. However, cloud contamination introduces time gaps in SITS, disrupting temporal dependencies and causing feature shifts, leading to degraded performance of models trained on complete SITS. Existing methods typically address this by reconstructing the entire SITS before prediction or using data augmentation to simulate missing data. Yet, full reconstruction may introduce noise and redundancy, while the data-augmented model can only handle limited missing patterns, leading to poor generalization. We propose a joint learning framework with feature reconstruction and prediction to address incomplete SITS more effectively. During training, we simulate data-missing scenarios using temporal masks. The two tasks are guided by both ground-truth labels and the teacher model trained on complete SITS. The prediction task constrains the model to selectively reconstruct critical features from masked inputs that align with the teacher's temporal feature representations. It reduces unnecessary reconstruction and limits noise propagation. By integrating reconstructed features into the prediction task, the model avoids learning shortcuts and maintains its ability to handle varied missing patterns and complete SITS. Experiments on SITS from Hunan Province, Western France, and Catalonia show that our method improves mean F1-scores by 6.93% in cropland extraction and 7.09% in crop classification over baselines. It also generalizes well across satellite sensors, including Sentinel-2 and PlanetScope, under varying temporal missing rates and model backbones.

[116] Benchmarking Laparoscopic Surgical Image Restoration and Beyond

Jialun Pei,Diandian Guo,Donghui Yang,Zhixi Li,Yuxin Feng,Long Ma,Bo Du,Pheng-Ann Heng

Main category: cs.CV

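TL;DR: The paper introduces SurgClean, an open-source dataset for addressing visual degradation in laparoscopic surgery, and evaluates the performance of 22 image restoration methods on it.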

Details

Motivation: Visual degradation during laparoscopic surgery (such as smoke, lens fogging, and contamination) affects surgical outcomes and patient safety, and calls for systematic study. Method: Creates the SurgClean dataset, which contains 1,020 images with paired reference labels covering multiple degradation types, and evaluates 22 image restoration methods. Result: Experiments show a substantial gap between existing methods and clinical requirements, indicating that better algorithms are needed. Conclusion: SurgClean provides a foundation for research on surgical image restoration, and improved algorithms could lead to better surgical environments. Abstract: In laparoscopic surgery, a clear and high-quality visual field is critical for surgeons to make accurate intraoperative decisions. However, persistent visual degradation, including smoke generated by energy devices, lens fogging from thermal gradients, and lens contamination due to blood or tissue fluid splashes during surgical procedures, severely impairs visual clarity. These degradations can seriously hinder surgical workflow and pose risks to patient safety. To systematically investigate and address various forms of surgical scene degradation, we introduce a real-world open-source surgical image restoration dataset covering laparoscopic environments, called SurgClean, which involves multi-type image restoration tasks, e.g., desmoking, defogging, and desplashing. SurgClean comprises 1,020 images with diverse degradation types and corresponding paired reference labels. Based on SurgClean, we establish a standardized evaluation benchmark and provide performance results for 22 representative image restoration approaches, including 12 generic and 10 task-specific ones. Experimental results reveal substantial performance gaps relative to clinical requirements, highlighting a critical opportunity for algorithm advancements in intelligent surgical restoration. Furthermore, we explore the degradation discrepancies between surgical and natural scenes from structural perception and semantic understanding perspectives, providing fundamental insights for domain-specific image restoration research. Our work aims to empower restoration algorithms to improve surgical environments and the efficiency of clinical procedures.

[117] JEDI: The Force of Jensen-Shannon Divergence in Disentangling Diffusion Models

Eric Tillmann Bill,Enis Simsar,Thomas Hofmann

Main category: cs.CV

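TL;DR: JEDI is a test-time adaptation method that needs no retraining or external supervision; it improves subject separation and compositional alignment in diffusion models by minimizing semantic entanglement in attention maps.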

Details

Motivation: To address subject separation and compositional alignment in diffusion models without retraining or relying on external supervision. Method: Minimizes semantic entanglement with an objective based on Jensen-Shannon divergence, and uses adversarial optimization to improve efficiency. Result: JEDI applies to multiple architectures (such as Stable Diffusion 1.5 and 3.5) and consistently improves prompt alignment and disentanglement in complex scenes. Conclusion: JEDI provides a lightweight, CLIP-free disentanglement score that serves as a benchmark for compositional alignment under test-time conditions, and the implementation will be publicly released. Abstract: We introduce JEDI, a test-time adaptation method that enhances subject separation and compositional alignment in diffusion models without requiring retraining or external supervision. JEDI operates by minimizing semantic entanglement in attention maps using a novel Jensen-Shannon divergence based objective. To improve efficiency, we leverage adversarial optimization, reducing the number of updating steps required. JEDI is model-agnostic and applicable to architectures such as Stable Diffusion 1.5 and 3.5, consistently improving prompt alignment and disentanglement in complex scenes. Additionally, JEDI provides a lightweight, CLIP-free disentanglement score derived from internal attention distributions, offering a principled benchmark for compositional alignment under test-time conditions. We will publicly release the implementation of our method.
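As background for the objective named in the abstract, here is a minimal sketch of the Jensen-Shannon divergence between two discrete distributions, such as two attention maps flattened and normalized to sum to 1. How JEDI combines such terms into its objective is not specified in the abstract, so this shows only the standard divergence; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log(2) in nats."""
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Identical distributions give a divergence of 0, and distributions with disjoint support give the maximum value log(2). That bounded range makes the quantity convenient as a score for how much two attention maps overlap.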

[118] EventEgoHands: Event-based Egocentric 3D Hand Mesh Reconstruction

Ryosei Hara,Wataru Ikeda,Masashi Hatano,Mariko Isogawa

Main category: cs.CV

TL;DR: Proposes EventEgoHands, an event-camera-based method for 3D hand mesh reconstruction that is robust to dynamic backgrounds and camera motion.

Details

Motivation: Conventional RGB or depth cameras perform poorly in low light and under motion blur; event cameras are an alternative thanks to their high dynamic range and high temporal resolution, but existing studies are limited to static backgrounds and fixed cameras. Method: Introduces a Hand Segmentation Module that extracts hand regions, reducing the influence of dynamic background events. Result: Validated on the N-HOT3D dataset, improving MPJPE by about 4.5 cm (43%). Conclusion: EventEgoHands effectively improves the accuracy of 3D hand mesh reconstruction under dynamic backgrounds. Abstract: Reconstructing 3D hand meshes is a challenging but important task for human-computer interaction and AR/VR applications. In particular, RGB and/or depth cameras have been widely used in this task. However, methods using these conventional cameras face challenges in low-light environments and during motion blur. Thus, to address these limitations, event cameras have been attracting attention in recent years for their high dynamic range and high temporal resolution. Despite their advantages, event cameras are sensitive to background noise or camera motion, which has limited existing studies to static backgrounds and fixed cameras. In this study, we propose EventEgoHands, a novel method for event-based 3D hand mesh reconstruction in an egocentric view. Our approach introduces a Hand Segmentation Module that extracts hand regions, effectively mitigating the influence of dynamic background events. We evaluated our approach and demonstrated its effectiveness on the N-HOT3D dataset, improving MPJPE by approximately 4.5 cm (43%).
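The reported metric, MPJPE (mean per-joint position error), is the average Euclidean distance between predicted and ground-truth 3D joint positions. The sketch below shows the standard definition. Any alignment step the paper may apply before measuring is not stated in the abstract and is not modeled here.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error over matching lists of (x, y, z) joints.

    The result is in the same unit as the input coordinates.
    """
    if len(pred_joints) != len(gt_joints) or not pred_joints:
        raise ValueError("joint lists must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(distances) / len(distances)
```

For example, two joints that are off by 3 and 4 units respectively give an MPJPE of 3.5 units.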

[119] Triangle Splatting for Real-Time Radiance Field Rendering

Jan Held,Renaud Vandeghen,Adrien Deliege,Abdullah Hamdi,Silvio Giancola,Anthony Cioppa,Andrea Vedaldi,Bernard Ghanem,Andrea Tagliasacchi,Marc Van Droogenbroeck

Main category: cs.CV

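TL;DR: This paper proposes a triangle-based differentiable rendering method that combines the efficiency of triangles with modern differentiable rendering frameworks, outperforming existing methods in visual fidelity and rendering speed.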

Details

Motivation: Triangles, the basic representation in classical computer graphics, have recently been displaced by methods such as NeRF and 3D Gaussian Splatting; this paper uses differentiable rendering to show that triangles remain efficient and compatible for high-fidelity view synthesis. Method: Develops a differentiable renderer that optimizes triangles directly via end-to-end gradients, rendering each triangle as a differentiable splat to combine the efficiency of triangles with the adaptive density of independent primitives. Result: On the Mip-NeRF360 dataset, the method outperforms concurrent non-volumetric primitives in visual fidelity and achieves higher perceptual quality than Zip-NeRF on indoor scenes; on the Garden scene, it reaches over 2,400 FPS using an off-the-shelf mesh renderer. Conclusion: Triangles bridge classical computer graphics and modern differentiable rendering frameworks, and are efficient and effective for high-quality novel view synthesis. Abstract: The field of computer graphics was revolutionized by models such as Neural Radiance Fields and 3D Gaussian Splatting, displacing triangles as the dominant representation for photogrammetry. In this paper, we argue for a triangle comeback. We develop a differentiable renderer that directly optimizes triangles via end-to-end gradients. We achieve this by rendering each triangle as differentiable splats, combining the efficiency of triangles with the adaptive density of representations based on independent primitives. Compared to popular 2D and 3D Gaussian Splatting methods, our approach achieves higher visual fidelity, faster convergence, and increased rendering throughput. On the Mip-NeRF360 dataset, our method outperforms concurrent non-volumetric primitives in visual fidelity and achieves higher perceptual quality than the state-of-the-art Zip-NeRF on indoor scenes. Triangles are simple, compatible with standard graphics stacks and GPU hardware, and highly efficient: for the Garden scene, we achieve over 2,400 FPS at 1280x720 resolution using an off-the-shelf mesh renderer. These results highlight the efficiency and effectiveness of triangle-based representations for high-quality novel view synthesis. Triangles bring us closer to mesh-based optimization by combining classical computer graphics with modern differentiable rendering frameworks. The project page is https://trianglesplatting.github.io/

[120] Saliency-guided Emotion Modeling: Predicting Viewer Reactions from Video Stimuli

Akhila Yaragoppa,Siddharth

Main category: cs.CV

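TL;DR: The paper proposes a visual-saliency-based approach to emotion prediction that uses deep learning to extract salient-region features, reveals how saliency relates to viewer emotion, and points out the limitations of self-reported emotions.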

Details

Motivation: Understanding how videos affect viewers' emotions matters for content creation, advertising, and HCI, but traditional methods overlook the role of visual saliency. Method: Uses deep learning to extract salient-region features (saliency area and number of salient regions), and combines the HD2S saliency model with OpenFace facial action unit analysis to study the relationship between video saliency and emotion. Result: Videos with multiple salient regions tend to elicit high-valence, low-arousal emotions, videos with a single dominant salient region tend to elicit low-valence, high-arousal emotions, and self-reports often disagree with facial-expression-based detection. Conclusion: The saliency-based approach offers an efficient and interpretable alternative for emotion modeling, with implications for content creation and affective computing research. Abstract: Understanding the emotional impact of videos is crucial for applications in content creation, advertising, and Human-Computer Interaction (HCI). Traditional affective computing methods rely on self-reported emotions, facial expression analysis, and biosensing data, yet they often overlook the role of visual saliency, the naturally attention-grabbing regions within a video. In this study, we utilize deep learning to introduce a novel saliency-based approach to emotion prediction by extracting two key features: saliency area and number of salient regions. Using the HD2S saliency model and OpenFace facial action unit analysis, we examine the relationship between video saliency and viewer emotions. Our findings reveal three key insights: (1) Videos with multiple salient regions tend to elicit high-valence, low-arousal emotions, (2) Videos with a single dominant salient region are more likely to induce low-valence, high-arousal responses, and (3) Self-reported emotions often misalign with facial expression-based emotion detection, suggesting limitations in subjective reporting. By leveraging saliency-driven insights, this work provides a computationally efficient and interpretable alternative for emotion modeling, with implications for content creation, personalized media experiences, and affective computing research.

[121] PosePilot: An Edge-AI Solution for Posture Correction in Physical Exercises

Rushiraj Gadhvi,Priyansh Desai,Siddharth

Main category: cs.CV

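TL;DR: PosePilot is an AI-driven real-time posture correction system aimed at complex exercises such as Yoga, combining LSTM and BiLSTM models to deliver efficient feedback.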

Details

Motivation: To address the challenge of automated posture correction in AI fitness systems and overcome the limitations of traditional approaches. Method: Uses a Vanilla LSTM and a BiLSTM with multi-head attention to analyze spatio-temporal dependencies for efficient pose recognition and error detection. Result: The system provides real-time personalized corrective feedback and runs efficiently on edge devices. Conclusion: PosePilot offers a lightweight, efficient real-time posture correction solution for complex exercises, suitable for at-home and outdoor settings. Abstract: Automated pose correction remains a significant challenge in AI-driven fitness systems, despite extensive research in activity recognition. This work presents PosePilot, a novel system that integrates pose recognition with real-time personalized corrective feedback, overcoming the limitations of traditional fitness solutions. Using Yoga, a discipline requiring precise spatio-temporal alignment, as a case study, we demonstrate PosePilot's ability to analyze complex physical movements. Designed for deployment on edge devices, PosePilot can be extended to various at-home and outdoor exercises. We employ a Vanilla LSTM, allowing the system to capture temporal dependencies for pose recognition. Additionally, a BiLSTM with multi-head Attention enhances the model's ability to process motion contexts, selectively focusing on key limb angles for accurate error detection while maintaining computational efficiency. As part of this work, we introduce a high-quality video dataset used for evaluating our models. Most importantly, PosePilot provides instant corrective feedback at every stage of a movement, ensuring precise posture adjustments throughout the exercise routine. The proposed approach 1) performs automatic human posture recognition, 2) provides personalized posture correction feedback at each instant, which is crucial in Yoga, and 3) offers a lightweight and robust posture correction model feasible for deployment on edge devices in real-world environments.

[122] Step-level Reward for Free in RL-based T2I Diffusion Model Fine-tuning

Xinyao Liao,Wei Wei,Xiaoye Qu,Yu Cheng

Main category: cs.CV

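TL;DR: This paper proposes a credit assignment framework that dynamically distributes dense rewards, addressing reward sparsity in fine-tuning text-to-image diffusion models and improving training efficiency and generalization.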

Details

Motivation: In RL fine-tuning of text-to-image diffusion models, existing methods suffer from sparse rewards (a single delayed reward per generated trajectory), which makes it hard to attribute contributions to individual denoising steps and hurts training efficiency. Method: Tracks changes in cosine similarity between intermediate images and the final image to distribute dense rewards dynamically, quantifying each denoising step's contribution to the final image without additional auxiliary neural networks. Result: The method achieves 1.25 to 2 times higher sample efficiency across four human preference reward functions without compromising the original optimal policy. Conclusion: The proposed credit assignment framework is simple and effective, improves training efficiency and generalization, and offers a new direction for RL fine-tuning of text-to-image diffusion models. Abstract: Recent advances in text-to-image (T2I) diffusion model fine-tuning leverage reinforcement learning (RL) to align generated images with learnable reward functions. The existing approaches reformulate denoising as a Markov decision process for RL-driven optimization. However, they suffer from reward sparsity, receiving only a single delayed reward per generated trajectory. This flaw hinders precise step-level attribution of denoising actions and undermines training efficiency. To address this, we propose a simple yet effective credit assignment framework that dynamically distributes dense rewards across denoising steps. Specifically, we track changes in cosine similarity between intermediate and final images to quantify each step's contribution to progressively reducing the distance to the final image. Our approach avoids additional auxiliary neural networks for step-level preference modeling and instead uses reward shaping to highlight denoising phases that have a greater impact on image quality. Our method achieves 1.25 to 2 times higher sample efficiency and better generalization across four human preference reward functions, without compromising the original optimal policy.
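One way to picture the credit assignment idea is to split a single trajectory-level reward across denoising steps in proportion to how much each step increases cosine similarity to the final image. The sketch below does that. The proportional split, the uniform fallback, and the names `cosine` and `step_rewards` are assumptions for illustration, since the abstract does not give the exact shaping formula.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def step_rewards(states, final_reward):
    """Distribute final_reward over the steps of a denoising trajectory.

    states: flattened images in denoising order; the last one is the final image.
    Returns one reward per transition (hypothetical shaping rule).
    """
    if len(states) < 2:
        raise ValueError("need at least two states to form one step")
    final = states[-1]
    sims = [cosine(s, final) for s in states]
    gains = [sims[t + 1] - sims[t] for t in range(len(sims) - 1)]
    total = sum(gains)
    if abs(total) < 1e-12:
        # No net change in similarity: fall back to a uniform split.
        return [final_reward / len(gains)] * len(gains)
    return [final_reward * g / total for g in gains]
```

Under this rule the per-step rewards always sum to the original trajectory reward, so the shaping redistributes credit without changing the total return.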

[123] Domain and Task-Focused Example Selection for Data-Efficient Contrastive Medical Image Segmentation

Tyler Ward,Aaron Moseley,Abdullah-Al-Zubaer Imran

Main category: cs.CV

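TL;DR: Proposes PolyCL, a self-supervised contrastive learning framework for medical image segmentation that requires no pixel-level annotations and integrates the Segment Anything Model (SAM) to improve segmentation accuracy.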

Details

Motivation: Medical image segmentation needs large amounts of annotated data, but annotation is costly and time-consuming, so methods that make efficient use of limited labels are required. Method: Self-supervised contrastive learning (SSL) pre-training on unlabeled data, combined with SAM as a post-processing module and as a propagation mechanism. Result: On three CT datasets, PolyCL outperforms fully-supervised and self-supervised baselines in low-data and cross-domain scenarios. Conclusion: PolyCL provides an efficient solution for medical image segmentation that does not depend on large amounts of annotation. Abstract: Segmentation is one of the most important tasks in the medical imaging pipeline as it influences a number of image-based decisions. To be effective, fully supervised segmentation approaches require large amounts of manually annotated training data. However, the pixel-level annotation process is expensive, time-consuming, and error-prone, hindering progress and making it challenging to perform effective segmentations. Therefore, models must learn efficiently from limited labeled data. Self-supervised learning (SSL), particularly contrastive learning via pre-training on unlabeled data and fine-tuning on limited annotations, can facilitate such limited labeled image segmentation. To this end, we propose a novel self-supervised contrastive learning framework for medical image segmentation, leveraging inherent relationships of different images, dubbed PolyCL. Without requiring any pixel-level annotations or unreasonable data augmentations, our PolyCL learns and transfers context-aware discriminant features useful for segmentation from an innovative surrogate, in a task-related manner. Additionally, we integrate the Segment Anything Model (SAM) into our framework in two novel ways: as a post-processing refinement module that improves the accuracy of predicted masks using bounding box prompts derived from coarse outputs, and as a propagation mechanism via SAM 2 that generates volumetric segmentations from a single annotated 2D slice. Experimental evaluations on three public computed tomography (CT) datasets demonstrate that PolyCL outperforms fully-supervised and self-supervised baselines in both low-data and cross-domain scenarios. Our code is available at https://github.com/tbwa233/PolyCL.

[124] Towards Understanding the Mechanisms of Classifier-Free Guidance

Xiang Li,Rongrong Wang,Qing Qu

Main category: cs.CV

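TL;DR: This paper analyzes the role of classifier-free guidance (CFG) in image generation and shows that it improves generation quality through three mechanisms: a mean shift, amplification of contrastive principal components, and suppression of generic features.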

Details

Motivation: CFG is a core technique in state-of-the-art image generation systems, yet how it works remains unclear; this paper aims to expose its mechanisms through a simplified model. Method: Analyze CFG in a simplified linear diffusion model and compare against nonlinear models to verify the mechanisms. Result: Linear CFG improves generation quality through a mean shift, amplification of contrastive principal components, and suppression of generic features, and behaves similarly to the nonlinear case. Conclusion: The linear analysis offers useful insight into nonlinear CFG, even though the two diverge at low noise levels. Abstract: Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we begin by analyzing CFG in a simplified linear diffusion model, where we show its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on the CFG's mechanism in the nonlinear regime.
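
For readers unfamiliar with CFG, the sketch below shows the commonly used guidance rule that the paper sets out to explain: the guided noise prediction extrapolates from the unconditional prediction toward the conditional one. It does not reproduce the paper's three-term decomposition; the function name and the list-of-floats representation are illustrative assumptions, and some implementations parameterize the scale as `1 + w` instead.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance combination, applied elementwise.

    guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    A scale of 0 gives the unconditional prediction, 1 gives the conditional
    prediction, and values above 1 push further along the conditional direction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions for a two-dimensional sample.
guided = cfg_combine([0.0, 0.0], [1.0, 2.0], guidance_scale=2.0)
```

The difference `eps_cond - eps_uncond` is the quantity the abstract decomposes into a mean-shift term and positive and negative CPC terms in the linear setting.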

[125] Advancing Video Self-Supervised Learning via Image Foundation Models

Jingwei Wu,Zhewei Huang,Chang Liu

Main category: cs.CV

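TL;DR: AdViSe is a low-cost video self-supervised learning approach built on pre-trained image foundation models (IFMs); by adding temporal modeling modules and a playback rate perception task, it significantly reduces training overhead.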

Details

Motivation: The potential of using IFMs directly for video self-supervised learning has not been fully explored; this study aims to lower the training cost of video representation models. Method: Add ResNet3D temporal modeling modules to IFMs and train them with a playback rate perception self-supervised task while keeping the IFM components frozen. Result: On UCF101, AdViSe performs comparably to state-of-the-art methods while cutting training time by 3.4x and GPU memory usage by 8.2x. Conclusion: AdViSe offers a new direction for low-cost video self-supervised learning based on pre-trained IFMs, and the code is open source. Abstract: In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by $3.4\times$ and GPU memory usage by $8.2\times$. This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at https://github.com/JingwWu/advise-video-ssl.
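
Playback rate perception is a pretext task in which a clip is subsampled at a randomly chosen speed and the model must predict that speed. The sketch below shows only the data-sampling side under assumed settings: the function name, the candidate rates `(1, 2, 4, 8)`, and the clip length are hypothetical and are not taken from the AdViSe code.

```python
import random

def sample_rate_clip(num_frames, clip_len, rates=(1, 2, 4, 8), rng=random):
    """Pick a playback rate and return (frame_indices, rate_label).

    Frames are taken every `rate` frames from a random start position. The
    label is the index of the chosen rate, which a classifier on top of the
    temporal modules would be trained to predict.
    """
    # Keep only rates whose clip fits inside the video.
    valid = [(i, r) for i, r in enumerate(rates) if (clip_len - 1) * r < num_frames]
    if not valid:
        raise ValueError("video too short for the requested clip length")
    label, rate = rng.choice(valid)
    start = rng.randrange(num_frames - (clip_len - 1) * rate)
    indices = [start + k * rate for k in range(clip_len)]
    return indices, label

# Draw one training example from a 64-frame video with a seeded generator.
indices, label = sample_rate_clip(64, 8, rng=random.Random(0))
```

Because the label comes for free from the sampling procedure, no human annotation is needed, which is what makes the task self-supervised.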

[126] RAISE: Realness Assessment for Image Synthesis and Evaluation

Aniruddha Mukherjee,Spriha Dubey,Somdyuti Paul

Main category: cs.CV

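TL;DR: The paper proposes a way to assess the perceived realness of AI-generated images, builds the RAISE dataset from a human study, and validates it with deep learning models.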

Details

Motivation: The visual realness of AI-generated images is hard to assess objectively; this work aims to fill that gap and provide reliable data for practical applications. Method: Collect subjective realness scores through a human study, build the RAISE dataset, and train deep learning models to predict realness. Result: Experiments show that deep learning models can effectively capture subjective realness, and RAISE is a valuable resource for related research. Conclusion: The RAISE dataset and baseline models provide practical tools for assessing the realness of AI-generated images. Abstract: The rapid advancement of generative AI has enabled the creation of highly photorealistic visual content, offering practical substitutes for real images and videos in scenarios where acquiring real data is difficult or expensive. However, reliably substituting real visual content with AI-generated counterparts requires robust assessment of the perceived realness of AI-generated visual content, a challenging task due to its inherent subjective nature. To address this, we conducted a comprehensive human study evaluating the perceptual realness of both real and AI-generated images, resulting in a new dataset, containing images paired with subjective realness scores, introduced as RAISE in this paper. Further, we develop and train multiple models on RAISE to establish baselines for realness prediction. Our experimental results demonstrate that features derived from deep foundation vision models can effectively capture the subjective realness. RAISE thus provides a valuable resource for developing robust, objective models of perceptual realness assessment.

[127] DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

Chen Shi,Shaoshuai Shi,Kehua Sheng,Bo Zhang,Li Jiang

Main category: cs.CV

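TL;DR: DriveX is a self-supervised world model that learns general scene dynamics and holistic representations from multimodal supervision, significantly improving tasks such as 3D point cloud forecasting.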

Details

Motivation: Task-specific models struggle in out-of-distribution scenarios and depend on annotated data. Method: Introduce the Omni Scene Modeling (OSM) module and a decoupled latent world modeling strategy, with dynamic-aware ray sampling to strengthen motion modeling. Result: State-of-the-art results on tasks including 3D point cloud forecasting, occupancy prediction, flow estimation, and end-to-end driving. Conclusion: As a general-purpose world model, DriveX lays the groundwork for robust and unified autonomous driving frameworks. Abstract: Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision (3D point cloud forecasting, 2D semantic representation, and image generation) to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX's capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.

[128] Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model

Alaa Dalaq,Muzammil Behzad

Main category: cs.CV

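TL;DR: SegVLM is a vision-language model whose architectural improvements raise segmentation accuracy and cross-modal alignment; it combines SE blocks, deformable convolutions, and residual connections and introduces an RAF loss. Experiments show strong performance and generalization.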

Details

Motivation: Image segmentation is a fundamental computer vision task, and referring image segmentation localizes specific objects from natural language expressions, which requires effective integration of visual and linguistic information. Method: SegVLM combines SE blocks for dynamic feature recalibration, deformable convolutions for geometric adaptability, and residual connections for deep feature learning, and designs an RAF loss that balances region-level alignment, boundary precision, and class imbalance. Result: Experiments and ablation studies show that every component improves performance, and SegVLM performs well across diverse datasets and referring expression scenarios. Conclusion: Through architectural improvements and a new loss function, SegVLM significantly improves the performance and generalization of referring image segmentation. Abstract: Image segmentation is a fundamental task in computer vision, aimed at partitioning an image into semantically meaningful regions. Referring image segmentation extends this task by using natural language expressions to localize specific objects, requiring effective integration of visual and linguistic information. In this work, we propose SegVLM, a vision-language model that incorporates architectural improvements to enhance segmentation accuracy and cross-modal alignment. The model integrates squeeze-and-excitation (SE) blocks for dynamic feature recalibration, deformable convolutions for geometric adaptability, and residual connections for deep feature learning. We also introduce a novel referring-aware fusion (RAF) loss that balances region-level alignment, boundary precision, and class imbalance. Extensive experiments and ablation studies demonstrate that each component contributes to consistent performance improvements. SegVLM also shows strong generalization across diverse datasets and referring expression scenarios.
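
The "dynamic feature recalibration" performed by a squeeze-and-excitation block can be shown with a dependency-free sketch. This follows the generic SE design (global average pooling, a two-layer bottleneck, sigmoid gates) rather than SegVLM's actual code; the function name is hypothetical, bias terms are omitted, and the weights in the example are placeholders rather than learned values.

```python
import math

def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation on a feature map stored as [channel][row][col].

    w1: (C // r) x C weight matrix of the reduction layer (ReLU follows it).
    w2: C x (C // r) weight matrix of the expansion layer (sigmoid follows it).
    Returns the feature map with every channel scaled by its gate in (0, 1).
    """
    # Squeeze: one scalar per channel via global average pooling.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: bottleneck MLP producing one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Recalibration: rescale each channel by its gate.
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feature_map, gates)]

# Two 1x1 channels; all-zero placeholder weights give a gate of 0.5 per channel.
out = se_block([[[2.0]], [[4.0]]], w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
```

With trained weights the gates differ per channel, letting the network emphasize informative channels and suppress less useful ones for each input.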

[129] PolyPose: Localizing Deformable Anatomy in 3D from Sparse 2D X-ray Images using Polyrigid Transforms

Vivek Gopalakrishnan,Neel Dey,Polina Golland

Main category: cs.CV

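TL;DR: PolyPose is a simple and robust 2D/3D registration method that parameterizes complex 3D deformation fields as a composition of rigid transforms, addressing 3D pose estimation from sparse-view intraoperative X-ray images.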

Details

Motivation: Fast intraoperative 2D imaging (X-ray) cannot provide the precise 3D localization of preoperative 3D imaging (such as CT and MRI), so a method is needed to bring 3D guidance into intraoperative workflows. Method: PolyPose uses the biological constraint that bones do not bend to parameterize the 3D deformation field as a composition of rigid transforms, avoiding the complex deformation regularization that traditional methods need. Result: Experiments show that PolyPose can align the preoperative volume using as few as two X-ray images, handling challenging sparse-view and limited-angle settings. Conclusion: By introducing anatomically plausible priors, PolyPose provides an efficient and robust 2D/3D registration method suited to intraoperative 3D guidance. Abstract: Determining the 3D pose of a patient from a limited set of 2D X-ray images is a critical task in interventional settings. While preoperative volumetric imaging (e.g., CT and MRI) provides precise 3D localization and visualization of anatomical targets, these modalities cannot be acquired during procedures, where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance into intraoperative procedures, we present PolyPose, a simple and robust method for deformable 2D/3D registration. PolyPose parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion. Unlike existing methods that either assume no inter-joint movement or fail outright in this under-determined setting, our polyrigid formulation enforces anatomically plausible priors that respect the piecewise rigid nature of human movement. This approach eliminates the need for expensive deformation regularizers that require patient- and procedure-specific hyperparameter optimization. Across extensive experiments on diverse datasets from orthopedic surgery and radiotherapy, we show that this strong inductive bias enables PolyPose to successfully align the patient's preoperative volume to as few as two X-ray images, thereby providing crucial 3D guidance in challenging sparse-view and limited-angle settings where current registration methods fail.

[130] Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

Yu Zhang,Jialei Zhou,Xinchen Li,Qi Zhang,Zhongwei Wan,Tianyu Wang,Duoqian Miao,Changwei Wang,Longbing Cao

Main category: cs.CV

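TL;DR: Proposes DiT-ST, a split-text conditioning framework that converts a complete caption into a set of simplified sentences and injects them into the diffusion model in stages, addressing diffusion transformers' weak comprehension of complete-text captions.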

Details

Motivation: Text-to-image diffusion generation usually conditions on the complete text, but because of complex syntax, diffusion transformers (DiTs) comprehend complete-text captions poorly, which loses or confuses semantic details. Method: The DiT-ST framework uses large language models to parse the caption, extracts semantic primitives and organizes them hierarchically into a split-text input, and injects it into the diffusion denoising process in stages. Result: Experiments validate the effectiveness of DiT-ST in mitigating the complete-text comprehension defect. Conclusion: By splitting the text and injecting it in stages, DiT-ST significantly improves the diffusion transformer's representation learning of semantic primitives. Abstract: Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner. Specifically, DiT-ST leverages Large Language Models to parse captions, extracting diverse primitives and hierarchically sorting out and constructing these primitives into a split-text input. Moreover, we partition the diffusion denoising process according to its differential sensitivities to diverse semantic primitive types and determine the appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens via cross-attention. In this way, DiT-ST enhances the representation learning of specific semantic primitive types across different stages. Extensive experiments validate the effectiveness of our proposed DiT-ST in mitigating the complete-text comprehension defect.

[131] Improving Novel view synthesis of 360$^\circ$ Scenes in Extremely Sparse Views by Jointly Training Hemisphere Sampled Synthetic Images

Guangan Chen,Anh Minh Truong,Hanhe Lin,Michiel Vlaminck,Wilfried Philips,Hiep Luong

Main category: cs.CV

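TL;DR: Proposes a framework for novel view synthesis of 360-degree scenes from extremely sparse views, combining DUSt3R camera pose estimation, densely sampled synthetic images, and a 3D Gaussian Splatting model to significantly improve synthesis quality.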

Details

Motivation: Novel view synthesis of 360-degree scenes from extremely sparse views is challenging and matters for virtual reality and augmented reality applications. Method: Use DUSt3R to estimate camera poses and generate a dense point cloud, densely sample additional views and render synthetic images, train a 3D Gaussian Splatting model on the combined data, and improve image quality with a diffusion model. Result: With only four input views, the framework significantly outperforms benchmark methods on extremely sparse-view synthesis of 360-degree scenes. Conclusion: The framework effectively addresses novel view synthesis under extremely sparse views, improving synthesis quality and scene coverage. Abstract: Novel view synthesis in 360$^\circ$ scenes from extremely sparse input views is essential for applications like virtual reality and augmented reality. This paper presents a novel framework for novel view synthesis in extremely sparse-view cases. As typical structure-from-motion methods are unable to estimate camera poses in extremely sparse-view cases, we apply DUSt3R to estimate camera poses and generate a dense point cloud. Using the poses of estimated cameras, we densely sample additional views from the upper hemisphere space of the scenes, from which we render synthetic images together with the point cloud. Training a 3D Gaussian Splatting model on a combination of reference images from sparse views and densely sampled synthetic images allows a larger scene coverage in 3D space, addressing the overfitting challenge due to the limited input in sparse-view cases. Retraining a diffusion-based image enhancement model on our created dataset, we further improve the quality of the point-cloud-rendered images by removing artifacts. We compare our framework with benchmark methods in cases of only four input views, demonstrating significant improvement in novel view synthesis under extremely sparse-view conditions for 360$^\circ$ scenes.
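
The dense view sampling step can be pictured as placing virtual cameras on the upper hemisphere around the scene. The sketch below is an illustration under stated assumptions, not the paper's procedure: it assumes a z-up world frame, a known scene center and radius (which the paper derives from the DUSt3R-estimated poses), and a regular azimuth/elevation grid; all names and default counts are hypothetical.

```python
import math

def sample_upper_hemisphere(center, radius, n_azimuth=12, n_elevation=3):
    """Return camera positions on the upper hemisphere around `center`.

    Elevations are spread strictly between 0 and 90 degrees, so every camera
    lies above the horizontal plane through the center. Each camera would
    then be oriented to look at `center` before rendering the point cloud.
    """
    cameras = []
    for j in range(n_elevation):
        elevation = (j + 0.5) / n_elevation * (math.pi / 2)
        for i in range(n_azimuth):
            azimuth = 2 * math.pi * i / n_azimuth
            x = center[0] + radius * math.cos(elevation) * math.cos(azimuth)
            y = center[1] + radius * math.cos(elevation) * math.sin(azimuth)
            z = center[2] + radius * math.sin(elevation)
            cameras.append((x, y, z))
    return cameras

# 36 candidate viewpoints, two units away from the scene origin.
views = sample_upper_hemisphere((0.0, 0.0, 0.0), radius=2.0)
```

Rendering the point cloud from such viewpoints yields the synthetic images that, together with the sparse reference views, give the 3D Gaussian Splatting model broader scene coverage.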

[132] TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis

Kazi Mahathir Rahman,Showrin Rahman,Sharmin Sultana Srishty

Main category: cs.CV

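TL;DR: Proposes a two-stage text-embedded image generation method that incorporates reinforcement learning, substantially improving runtime efficiency while maintaining high quality.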

Details

Motivation: Existing text-to-image generation methods such as TextDiffuser-2 produce good results but are resource-intensive and hard to run efficiently on both CPU and GPU platforms. Method: A two-stage pipeline that uses reinforcement learning to optimize text layout generation and a diffusion model to synthesize the image, reducing overlaps and speeding up prediction. Result: On the MARIOEval benchmark, OCR and CLIPScore metrics are close to state-of-the-art models, while the method runs 97.64% faster and needs only 2MB of memory. Conclusion: The method maintains or surpasses TextDiffuser-2's quality while significantly improving efficiency and flexibility. Abstract: Text-embedded image generation plays a critical role in industries such as graphic design, advertising, and digital content creation. Text-to-Image generation methods leveraging diffusion models, such as TextDiffuser-2, have demonstrated promising results in producing images with embedded text. TextDiffuser-2 effectively generates bounding box layouts that guide the rendering of visual text, achieving high fidelity and coherence. However, existing approaches often rely on resource-intensive processes and are limited in their ability to run efficiently on both CPU and GPU platforms. To address these challenges, we propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our RL-based approach significantly accelerates the bounding box prediction step while reducing overlaps, allowing the system to run efficiently on both CPUs and GPUs. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2's quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Our approach has been evaluated on the MARIOEval benchmark, achieving OCR and CLIPScore metrics close to state-of-the-art models, while being 97.64% faster and requiring only 2MB of memory to run.

[133] Alchemist: Turning Public Text-to-Image Data into Generative Gold

Valerii Startsev,Alexander Ustyuzhanin,Alexey Kirillov,Dmitry Baranchuk,Sergey Kastryulin

Main category: cs.CV

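TL;DR: The paper proposes using a pre-trained generative model to select high-quality samples and builds Alchemist, a general-purpose SFT dataset that significantly improves the generation quality of five public T2I models.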

Details

Motivation: Existing SFT datasets mostly target specific domains, while high-quality general-purpose datasets are scarce and costly to build, which holds back research. Method: Use a pre-trained generative model as an estimator of sample impact to build the compact and effective Alchemist dataset. Result: Alchemist significantly improves the generation quality of five T2I models while preserving diversity and style. Conclusion: The method is a viable way to build high-quality general-purpose SFT datasets, and the dataset and fine-tuned model weights are released publicly. Abstract: Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models' weights to the public.

[134] Holistic White-light Polyp Classification via Alignment-free Dense Distillation of Auxiliary Optical Chromoendoscopy

Qiang Hu,Qimei Wang,Jia Chen,Xuantao Ji,Qiang Li,Zhiwei Wang

Main category: cs.CV

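TL;DR: This paper proposes a full-image classification framework that needs no polyp localization; its Alignment-free Dense Distillation (ADD) module performs cross-domain knowledge distillation and significantly improves polyp classification on WLI images.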

Details

Motivation: WLI is the main colonoscopic modality in resource-limited settings, but its classification performance is poor. Existing methods rely on cropped lesion regions, which are vulnerable to detection errors and ignore contextual information. Method: The ADD module performs fine-grained knowledge distillation through pixel-wise cross-domain affinities and CAM filtering, without image alignment. Result: On public and in-house datasets, relative AUC gains over other approaches are at least 2.5% and 16.2%, respectively. Conclusion: The ADD framework significantly improves polyp classification on WLI images and offers a practical solution for resource-limited settings. Abstract: White Light Imaging (WLI) and Narrow Band Imaging (NBI) are the two main colonoscopic modalities for polyp classification. While NBI, as optical chromoendoscopy, offers valuable vascular details, WLI remains the most common and often the only available modality in resource-limited settings. However, WLI-based methods typically underperform, limiting their clinical applicability. Existing approaches transfer knowledge from NBI to WLI through global feature alignment but often rely on cropped lesion regions, which are susceptible to detection errors and neglect contextual and subtle diagnostic cues. To address this, this paper proposes a novel holistic classification framework that leverages full-image diagnosis without requiring polyp localization. The key innovation lies in the Alignment-free Dense Distillation (ADD) module, which enables fine-grained cross-domain knowledge distillation regardless of misalignment between WLI and NBI images. Without resorting to explicit image alignment, ADD learns pixel-wise cross-domain affinities to establish correspondences between feature maps, guiding the distillation along the most relevant pixel connections. To further enhance distillation reliability, ADD incorporates Class Activation Mapping (CAM) to filter cross-domain affinities, ensuring the distillation path connects only those semantically consistent regions with equal contributions to polyp diagnosis. Extensive results on public and in-house datasets show that our method achieves state-of-the-art performance, relatively outperforming the other approaches by at least 2.5% and 16.2% in AUC, respectively. Code is available at: https://github.com/Huster-Hq/ADD.

[135] BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Behavioural Change

Manuela González-González,Soufiane Belharbi,Muhammad Osama Zeeshan,Masoumeh Sharafi,Muhammad Haseeb Aslam,Marco Pedersoli,Alessandro Lameiras Koerich,Simon L Bacon,Eric Granger

Main category: cs.CV

TL;DR: The paper introduces BAH, a dataset for recognizing ambivalence/hesitancy (A/H) in videos, and reports results for multimodal baseline models.

Details

Motivation: Recognizing ambivalence/hesitancy is important for personalized digital behaviour change interventions, but datasets and automatic recognition methods are lacking. Method: Collect video data from 224 participants through an online platform, annotate A/H segments, and provide multimodal annotations (such as face, speech, and text). Result: The BAH dataset contains 1,118 videos totaling 8.26 hours, of which 1.5 hours are A/H segments. Baseline models perform poorly, which shows how hard A/H recognition is. Conclusion: BAH fills a gap in A/H recognition and provides a resource for future research, but model performance needs further improvement. Abstract: Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of digital behaviour change interventions. These subtle and conflicting emotions are manifested by a discord between multiple modalities, such as facial and vocal expressions, and body language. Although experts can be trained to identify A/H, integrating them into digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users, and operate seamlessly within real-time, and resource-limited environments. However, there are currently no datasets available for the design of ML models to recognize A/H. This paper introduces a first Behavioural Ambivalence/Hesitancy (BAH) dataset collected for subject-based multimodal recognition of A/H in videos. It contains videos from 224 participants of different ages and ethnicities, captured across 9 provinces in Canada. Through our web platform, we recruited participants to answer 7 questions, some of which were designed to elicit A/H while recording themselves via webcam with microphone. BAH amounts to 1,118 videos for a total duration of 8.26 hours with 1.5 hours of A/H. Our behavioural team annotated timestamp segments to indicate where A/H occurs, and provide frame- and video-level annotations with the A/H cues. Video transcripts and their timestamps are also included, along with cropped and aligned faces in each frame, and a variety of participants meta-data. We include baseline results for BAH at frame- and video-level recognition in multi-modal setups, in addition to zero-shot prediction, and for personalization using unsupervised domain adaptation. The limited performance of baseline models highlights the challenges of recognizing A/H in real-world videos. The data, code, and pretrained weights are available.

[136] Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions

Chenrui Ma,Xi Xiao,Tianyang Wang,Yanning Shen

Main category: cs.CV

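TL;DR: Proposes a new instruction-based image editing method that uses widely available text-image pairs, avoiding the complexity of building large-scale editing datasets.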

Details

Motivation: Existing methods rely either on large-scale editing datasets or on dataset-free techniques; the former are costly to build and hard to quality-control, and the latter are limited in editing capability and instruction comprehension. Method: Introduce multi-scale learnable regions to localize and guide the editing process, using text-image alignment as supervision to generate task-specific editing regions. Result: Experiments show state-of-the-art performance across various tasks and benchmarks, with strong adaptability to different types of generative models. Conclusion: The method achieves high-fidelity, precise, and instruction-consistent image editing and offers a new direction for the field. Abstract: Current text-driven image editing methods typically follow one of two directions: relying on large-scale, high-quality editing pair datasets to improve editing precision and diversity, or exploring alternative dataset-free techniques. However, constructing large-scale editing datasets requires carefully designed pipelines, is time-consuming, and often results in unrealistic samples or unwanted artifacts. Meanwhile, dataset-free methods may suffer from limited instruction comprehension and restricted editing capabilities. Faced with these challenges, the present work develops a novel paradigm for instruction-driven image editing that leverages widely available and enormous text-image pairs, instead of relying on editing pair datasets. Our approach introduces a multi-scale learnable region to localize and guide the editing process. By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing. Extensive experiments demonstrate that the proposed approach attains state-of-the-art performance across various tasks and benchmarks, while exhibiting strong adaptability to various types of generative models.

[137] DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models

Niloufar Alipour Talemi,Hossein Kashiani,Hossein R. Nowdeh,Fatemeh Afghah

Main category: cs.CV

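TL;DR: DiSa is a directional saliency-aware prompt learning framework that improves generalization through cross-interactive regularization and directional regularization, performing strongly across multiple tasks.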

Details

Motivation: Existing prompt learning methods lose performance when generalizing to novel classes or unseen domains; DiSa aims to solve this. Method: Combine Cross-Interactive Regularization (CIR) with directional regularization; the former uses saliency masks to focus on key image regions, and the latter aligns visual embeddings with the directions of class prototype features. Result: On 11 image classification benchmarks, DiSa outperforms existing methods in base-to-novel generalization, cross-dataset transfer, domain generalization, and few-shot learning. Conclusion: DiSa's dual regularization strategy significantly improves generalization and applies to a range of task settings. Abstract: Prompt learning has emerged as a powerful paradigm for adapting vision-language models such as CLIP to downstream tasks. However, existing methods often overfit to seen data, leading to significant performance degradation when generalizing to novel classes or unseen domains. To address this limitation, we propose DiSa, a Directional Saliency-Aware Prompt Learning framework that integrates two complementary regularization strategies to enhance generalization. First, our Cross-Interactive Regularization (CIR) fosters cross-modal alignment by enabling cooperative learning between prompted and frozen encoders. Within CIR, a saliency-aware masking strategy guides the image encoder to prioritize semantically critical image regions, reducing reliance on less informative patches. Second, we introduce a directional regularization strategy that aligns visual embeddings with class-wise prototype features in a directional manner to prioritize consistency in feature orientation over strict proximity. This approach ensures robust generalization by leveraging stable prototype directions derived from class-mean statistics. Extensive evaluations on 11 diverse image classification benchmarks demonstrate that DiSa consistently outperforms state-of-the-art prompt learning methods across various settings, including base-to-novel generalization, cross-dataset transfer, domain generalization, and few-shot learning.
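
The idea of favoring feature orientation over strict proximity can be sketched with a cosine-based penalty against class-mean prototypes. This is a generic illustration, not DiSa's loss: the function names are hypothetical, features are plain lists of floats, and the `1 - cosine` form is an assumption chosen because it depends only on direction and ignores vector length.

```python
import math

def class_prototypes(features, labels):
    """Average the feature vectors of each class to obtain one prototype per class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def directional_loss(embeddings, labels, prototypes):
    """Mean of (1 - cosine) between each embedding and its class prototype."""
    losses = [1.0 - cosine(e, prototypes[y]) for e, y in zip(embeddings, labels)]
    return sum(losses) / len(losses)

# Prototypes from three reference features, then a loss on two new embeddings.
protos = class_prototypes([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]], [0, 0, 1])
loss = directional_loss([[5.0, 0.0], [0.0, 1.0]], [0, 1], protos)
```

Here both embeddings point exactly along their class prototypes, so the penalty vanishes even though their lengths differ from the prototypes, which is the sense in which orientation matters more than distance.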

[138] Absolute Coordinates Make Motion Generation Easy

Zichong Meng,Zeyu Han,Xiaogang Peng,Yiming Xie,Huaizu Jiang

Main category: cs.CV

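TL;DR: The paper proposes text-to-motion generation based on global absolute joint coordinates in place of the traditional relative motion representation, significantly improving motion fidelity and text alignment and supporting downstream tasks.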

Details

Motivation: The traditional relative motion representation limits diffusion models and hinders downstream tasks. Method: Use global absolute joint coordinates as the motion representation with a simple Transformer architecture and no additional kinematic-aware losses. Result: The new formulation significantly improves motion fidelity and text alignment and supports downstream tasks without extra adaptation. Conclusion: Global absolute joint coordinates are a better motion representation and lay a foundation for future research and applications. Abstract: State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D, which encodes motion relative to the pelvis and to the previous frame with built-in redundancy. While this design simplifies training for earlier generation models, it introduces critical limitations for diffusion models and hinders applicability to downstream tasks. In this work, we revisit the motion representation and propose a radically simplified and long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space. Through systematic analysis of design choices, we show that this formulation achieves significantly higher motion fidelity, improved text alignment, and strong scalability, even with a simple Transformer backbone and no auxiliary kinematic-aware losses. Moreover, our formulation naturally supports downstream tasks such as text-driven motion control and temporal/spatial editing without additional task-specific reengineering and costly classifier guidance generation from control signals. Finally, we demonstrate promising generalization to directly generate SMPL-H mesh vertices in motion from text, laying a strong foundation for future research and motion-related applications.

[139] Advancing Limited-Angle CT Reconstruction Through Diffusion-Based Sinogram Completion

Jiaqi Guo,Santiago Lopez-Tapia,Aggelos K. Katsaggelos

Main category: cs.CV

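TL;DR: Proposes a projection-domain data completion method based on MR-SDEs to handle missing angles in limited-angle CT reconstruction, combining distillation with a pseudo-inverse constraint to accelerate the diffusion process and a post-processing module to refine reconstruction quality.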

Details

Motivation: Limited-angle CT reconstruction is challenging because angular information is missing, and traditional methods that operate in the image domain have limited effect, so new approaches in the projection domain are worth exploring. Method: Use MR-SDEs (mean-reverting stochastic differential equations) to complete the projection data, combine distillation with a pseudo-inverse constraint to accelerate the diffusion process, and refine the reconstruction with a post-processing module. Result: Experiments show state-of-the-art perceptual and fidelity quality, applicable to scientific and clinical settings. Conclusion: The method offers an efficient and accurate solution for limited-angle CT reconstruction with practical potential. Abstract: Limited Angle Computed Tomography (LACT) often faces significant challenges due to missing angular information. Unlike previous methods that operate in the image domain, we propose a new method that focuses on sinogram inpainting. We leverage MR-SDEs, a variant of diffusion models that characterize the diffusion process with mean-reverting stochastic differential equations, to fill in missing angular data at the projection level. Furthermore, by combining distillation with constraining the output of the model using the pseudo-inverse of the inpainting matrix, the diffusion process is accelerated and completed in a single step, enabling efficient and accurate sinogram completion. A subsequent post-processing module back-projects the inpainted sinogram into the image domain and further refines the reconstruction, effectively suppressing artifacts while preserving critical structural details. Quantitative experimental results demonstrate that the proposed method achieves state-of-the-art performance in both perceptual and fidelity quality, offering a promising solution for LACT reconstruction in scientific and clinical applications.

[140] Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

Nate Gillman,Charles Herrmann,Michael Freeman,Daksh Aggarwal,Evan Luo,Deqing Sun,Chen Sun

Main category: cs.CV

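TL;DR: The paper proposes generating videos with physical forces as the control signal; force prompts let users interact with images without any 3D asset or physics simulator.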

Details

Motivation: Explores the use of physical forces in video generation, addressing the lack of physically meaningful interaction in existing work. Method: Proposes force prompts that exploit the visual and motion priors of a pretrained model, trained on videos synthesized with Blender. Result: The model performs well with limited training data, simulates diverse physical force interactions, and outperforms existing methods. Conclusion: Video generation models can produce realistic physical interactions through force prompts, bringing world models closer to real-world physics. Abstract: Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well-explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion prior in the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty in obtaining high quality paired force-video training data, both in the real world due to the difficulty of obtaining force signals, and in synthetic data due to limitations in the visual quality and domain diversity of physics simulators. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of few objects. Our method can generate videos which simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.

[141] Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression

Yiwei Xie,Ping Liu,Zheng Zhang

Main category: cs.CV

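TL;DR: This survey reviews concept erasure techniques for text-to-image (T2I) diffusion models, which aim to address the ethical and legal problems of generating sensitive, copyrighted, or harmful content.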

Details

Motivation: T2I models generate high-quality images but may produce sensitive or harmful content, raising ethical and legal challenges. Concept erasure offers a way to selectively remove specific semantic concepts. Method: The survey categorizes existing methods along three dimensions (intervention level, optimization structure, and semantic scope) and discusses evaluation benchmarks and datasets. Result: The multi-dimensional taxonomy exposes the trade-offs between erasure specificity, generalization, and computational complexity, and the survey identifies limitations of current evaluation practice. Conclusion: Future directions include disentangling concept representations, adaptive erasure strategies, and adversarial robustness. The survey aims to advance safe and ethical generative AI. Abstract: Text-to-Image (T2I) models have demonstrated impressive capabilities in generating high-quality and diverse visual content from natural language prompts. However, uncontrolled reproduction of sensitive, copyrighted, or harmful imagery poses serious ethical, legal, and safety challenges. To address these concerns, the concept erasure paradigm has emerged as a promising direction, enabling the selective removal of specific semantic concepts from generative models while preserving their overall utility. This survey provides a comprehensive overview and in-depth synthesis of concept erasure techniques in T2I diffusion models. We systematically categorize existing approaches along three key dimensions: intervention level, which identifies specific model components targeted for concept removal; optimization structure, referring to the algorithmic strategies employed to achieve suppression; and semantic scope, concerning the complexity and nature of the concepts addressed. This multi-dimensional taxonomy enables clear, structured comparisons across diverse methodologies, highlighting fundamental trade-offs between erasure specificity, generalization, and computational complexity. We further discuss current evaluation benchmarks, standardized metrics, and practical datasets, emphasizing gaps that limit comprehensive assessment, particularly regarding robustness and practical effectiveness. Finally, we outline major challenges and promising future directions, including disentanglement of concept representations, adaptive and incremental erasure strategies, adversarial robustness, and new generative architectures. This survey aims to guide researchers toward safer, more ethically aligned generative models, providing foundational knowledge and actionable recommendations to advance responsible development in generative AI.

Hang Hua,Ziyun Zeng,Yizhi Song,Yunlong Tang,Liu He,Daniel Aliaga,Wei Xiong,Jiebo Luo

Main category: cs.CV

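TL;DR: MMIG-Bench is a unified benchmark for multi-modal image generation. It pairs rich text prompts with multi-view reference images and uses a three-level evaluation framework to assess 17 state-of-the-art models.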

Details

Motivation: Current multi-modal image generators lack a unified evaluation toolkit, and existing benchmarks cannot fully measure multi-modal conditioning and semantic consistency. Method: Proposes the MMIG-Bench benchmark, containing 4,850 text prompts and 1,750 multi-view reference images, with a three-level evaluation framework (low-, mid-, and high-level metrics). Result: Evaluates 17 state-of-the-art models, confirms that the metrics correlate strongly with human ratings, and yields in-depth insights into architecture and data design. Conclusion: MMIG-Bench provides a rigorous, unified evaluation tool for multi-modal image generation and supports future innovation. Abstract: Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) a novel Aspect Matching Score (AMS), a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design. We will release the dataset and evaluation code to foster rigorous, unified evaluation and accelerate future innovations in multi-modal image generation.

[143] ADD-SLAM: Adaptive Dynamic Dense SLAM with Gaussian Splatting

Wenhua Wu,Chenpeng Su,Siting Zhu,Tianchen Deng,Zhe Liu,Hesheng Wang

Main category: cs.CV

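TL;DR: ADD-SLAM is an adaptive dynamic dense SLAM framework based on Gaussian splatting. It identifies dynamic objects through scene consistency analysis, without predefined semantic categories, and achieves precise recognition and separate modeling of dynamic targets.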

Details

Motivation: Dynamic objects break scene consistency, causing tracking drift and mapping artifacts. Existing methods rely on predefined categories and discard dynamic information, which does not meet the needs of robotic applications. Method: Designs an adaptive dynamic identification mechanism based on scene consistency analysis, comparing geometric and textural discrepancies between real-time observations and the historical map, and proposes a dynamic-static separation mapping strategy. Result: Demonstrates flexible and accurate dynamic segmentation on multiple dynamic datasets, with state-of-the-art localization and mapping performance. Conclusion: ADD-SLAM needs no predefined semantic categories, adaptively discovers scene dynamics, and improves SLAM performance in dynamic environments. Abstract: Recent advancements in Neural Radiance Fields (NeRF) and 3D Gaussian-based Simultaneous Localization and Mapping (SLAM) methods have demonstrated exceptional localization precision and remarkable dense mapping performance. However, dynamic objects introduce critical challenges by disrupting scene consistency, leading to tracking drift and mapping artifacts. Existing methods that employ semantic segmentation or object detection for dynamic identification and filtering typically rely on predefined categorical priors, while discarding dynamic scene information crucial for robotic applications such as dynamic obstacle avoidance and environmental interaction. To overcome these challenges, we propose ADD-SLAM: an Adaptive Dynamic Dense SLAM framework based on Gaussian splatting. We design an adaptive dynamic identification mechanism grounded in scene consistency analysis, comparing geometric and textural discrepancies between real-time observations and historical maps. Our method requires no predefined semantic category priors and adaptively discovers scene dynamics. Precise dynamic object recognition effectively mitigates interference from moving targets during localization. Furthermore, we propose a dynamic-static separation mapping strategy that constructs a temporal Gaussian model to achieve online incremental dynamic modeling. Experiments conducted on multiple dynamic datasets demonstrate our method's flexible and accurate dynamic segmentation capabilities, along with state-of-the-art performance in both localization and mapping.
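
The idea of comparing geometric and textural discrepancies between an observation and the map can be sketched per pixel. The following is a minimal illustration, not the mechanism from the paper: it assumes a depth image and a grayscale image rendered from the historical map are available, and the function name and both tolerances are hypothetical.

```python
def dynamic_mask(observed_depth, rendered_depth, observed_gray, rendered_gray,
                 depth_tol=0.10, gray_tol=0.20):
    """Flag pixels whose observation disagrees with the map rendering.

    All inputs are 2D lists of floats with identical shape. A pixel is
    marked dynamic when either the geometric (depth) or the textural
    (intensity) discrepancy exceeds its tolerance.
    """
    mask = []
    for od_row, rd_row, og_row, rg_row in zip(
            observed_depth, rendered_depth, observed_gray, rendered_gray):
        mask.append([
            abs(od - rd) > depth_tol or abs(og - rg) > gray_tol
            for od, rd, og, rg in zip(od_row, rd_row, og_row, rg_row)
        ])
    return mask


# One static pixel and one pixel where something moved in front of the map
mask = dynamic_mask(
    observed_depth=[[2.0, 1.2]], rendered_depth=[[2.0, 2.0]],
    observed_gray=[[0.5, 0.5]], rendered_gray=[[0.5, 0.55]],
)
```

Fixed thresholds are the crudest possible choice; the abstract describes the identification as adaptive, which this sketch does not attempt to reproduce.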

[144] Certainty and Uncertainty Guided Active Domain Adaptation

Bardia Safaei,Vibashan VS,Vishal M. Patel

Main category: cs.CV

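TL;DR: This paper proposes a collaborative framework that combines active sampling with pseudo-label sampling to improve active domain adaptation.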

Details

Motivation: Existing active domain adaptation methods focus only on uncertain samples and overlook the value of confident samples, which often match the ground truth. Method: Proposes a collaborative framework that combines Gaussian Process-based Active Sampling (GPAS) and Pseudo-Label-based Certain Sampling (PLCS) to handle uncertain and confident samples respectively. Result: Experiments on Office-Home and DomainNet show that the method outperforms state-of-the-art active domain adaptation methods. Conclusion: Using both uncertain and confident samples substantially improves domain adaptation. Abstract: Active Domain Adaptation (ADA) adapts models to target domains by selectively labeling a few target samples. Existing ADA methods prioritize uncertain samples but overlook confident ones, which often match the ground truth. We find that incorporating confident predictions into the labeled set before active sampling reduces the search space and improves adaptation. To address this, we propose a collaborative framework that labels uncertain samples while treating highly confident predictions as ground truth. Our method combines Gaussian Process-based Active Sampling (GPAS) for identifying uncertain samples and Pseudo-Label-based Certain Sampling (PLCS) for confident ones, progressively enhancing adaptation. PLCS refines the search space, and GPAS reduces the domain gap, boosting the proportion of confident samples. Extensive experiments on Office-Home and DomainNet show that our approach outperforms state-of-the-art ADA methods.
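
The division of labor between confident and uncertain samples can be sketched as follows. This is an illustration under stated assumptions rather than the paper's method: predictive entropy stands in for the Gaussian Process uncertainty used by GPAS, and the confidence threshold, the annotation budget, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def split_target_samples(probabilities, confidence_threshold=0.95, budget=1):
    """Pseudo-label confident samples, then pick uncertain ones to annotate.

    Returns (pseudo_labeled, to_annotate): a dict {sample index: class}
    for confident predictions, and the `budget` most uncertain indices
    among the remaining samples.
    """
    pseudo_labeled, remaining = {}, []
    for idx, probs in enumerate(probabilities):
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= confidence_threshold:
            pseudo_labeled[idx] = top   # treated as ground truth
        else:
            remaining.append(idx)       # still in the search space
    # Entropy is a stand-in here; the paper uses a Gaussian Process
    remaining.sort(key=lambda i: entropy(probabilities[i]), reverse=True)
    return pseudo_labeled, remaining[:budget]


probabilities = [[0.98, 0.01, 0.01], [0.40, 0.35, 0.25], [0.70, 0.20, 0.10]]
pseudo_labeled, to_annotate = split_target_samples(probabilities)
```

Removing confident samples first shrinks the pool that the active sampler has to rank, which is the search-space reduction the abstract refers to.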

[145] LlamaSeg: Image Segmentation via Autoregressive Mask Generation

Jiru Deng,Tengjin Weng,Tianyu Yang,Wenhan Luo,Zhiheng Li,Wenhao Jiang

Main category: cs.CV

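TL;DR: LlamaSeg is a unified image segmentation framework driven by natural language instructions. It recasts segmentation as an autoregressive prediction problem through visual generation and introduces a new dataset and evaluation metric.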

Details

Motivation: Existing image segmentation methods are usually designed for specific tasks and lack generality and flexibility. LlamaSeg aims to unify multiple segmentation tasks through natural language instructions. Method: Recasts image segmentation as a visual generation problem and uses a LLaMA-style Transformer to predict masks directly. Builds the SA-OVRS dataset (2M masks and over 5,800 labels) and proposes a composite metric that combines IoU and AHD. Result: Experiments show that LlamaSeg outperforms existing generative models on multiple datasets and produces finer segmentation masks. Conclusion: LlamaSeg achieves general-purpose image segmentation with an autoregressive framework and natural language instructions, and the new dataset and metric give visual generative models more complete support. Abstract: We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of masks produced by visual generative models, we further propose a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD), offering a more precise assessment of contour fidelity. Experimental results demonstrate that our method surpasses existing generative models across multiple datasets and yields more detailed segmentation masks.
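
The two ingredients of the composite metric can be computed as in the sketch below. The abstract does not say how IoU and AHD are combined, so the two values are reported separately here. Two simplifying assumptions are made: AHD is taken as the mean of the two directed average distances (some definitions use their sum), and it is computed over all foreground pixels rather than contour pixels only. Function names are hypothetical.

```python
import math


def foreground(mask):
    """Coordinates (row, col) of the non-zero pixels of a 2D binary mask."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def iou(pred, target):
    """Intersection over Union of two binary masks."""
    p, t = set(foreground(pred)), set(foreground(target))
    union = p | t
    return len(p & t) / len(union) if union else 1.0


def average_hausdorff(pred, target):
    """Mean of the two directed average nearest-neighbor distances."""
    p, t = foreground(pred), foreground(target)
    if not p or not t:
        return 0.0 if p == t else math.inf

    def directed(src, dst):
        return sum(min(math.dist(a, b) for b in dst) for a in src) / len(src)

    return 0.5 * (directed(p, t) + directed(t, p))


pred = [[1, 1, 0], [0, 0, 0]]
target = [[1, 1, 1], [0, 0, 0]]
overlap = iou(pred, target)                  # 2 shared pixels out of 3
boundary_error = average_hausdorff(pred, target)
```

IoU rewards area overlap, while AHD penalizes how far the missed pixel lies from the prediction; that second term is what makes the pair sensitive to contour errors.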

[146] Structure Disruption: Subverting Malicious Diffusion-Based Inpainting via Self-Attention Query Perturbation

Yuhao He,Jinyu Tian,Haiwei Wu,Jianqing Li

Main category: cs.CV

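TL;DR: Proposes SDA, a protection framework that perturbs the self-attention mechanism of diffusion models so that they cannot generate coherent images, protecting sensitive regions from inpainting-based editing.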

Details

Motivation: Rapid progress in diffusion models has improved image inpainting and editing but also introduced societal risks, such as malicious use of user images to generate misleading content. Existing methods perform poorly on mask-guided editing, so a more effective protection is needed. Method: SDA optimizes perturbations that disrupt the queries in self-attention during the initial denoising step, destroying the contour generation process and directly interfering with the structural generation capability of diffusion models. Result: Experiments show that SDA achieves state-of-the-art protection performance on public datasets while remaining robust. Conclusion: SDA is an effective protection framework that prevents diffusion models from generating coherent images and is suited to protecting sensitive regions. Abstract: The rapid advancement of diffusion models has enhanced their image inpainting and editing capabilities but also introduced significant societal risks. Adversaries can exploit user images from social media to generate misleading or harmful content. While adversarial perturbations can disrupt inpainting, global perturbation-based methods fail in mask-guided editing tasks due to spatial constraints. To address these challenges, we propose Structure Disruption Attack (SDA), a powerful protection framework for safeguarding sensitive image regions against inpainting-based editing. Building upon the contour-focused nature of self-attention mechanisms of diffusion models, SDA optimizes perturbations by disrupting queries in self-attention during the initial denoising step to destroy the contour generation process. This targeted interference directly disrupts the structural generation capability of diffusion models, effectively preventing them from producing coherent images. We validate our motivation through visualization techniques and extensive experiments on public datasets, demonstrating that SDA achieves state-of-the-art (SOTA) protection performance while maintaining strong robustness.

[147] CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features

X. Feng,D. Zhang,S. Hu,X. Li,M. Wu,J. Zhang,X. Chen,K. Huang

Main category: cs.CV

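TL;DR: CSTrack proposes compact spatiotemporal feature modeling. By integrating the RGB-X dual input streams into a compact spatial feature, it simplifies the model structure and improves tracking.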

Details

Motivation: Existing RGB-X trackers use two branches to process dispersed feature spaces, which makes the model complex and computationally expensive and limits spatiotemporal modeling. Method: Proposes a Spatial Compact Module that integrates the RGB-X inputs into a compact feature, and a Temporal Compact Module that represents temporal features efficiently. Result: CSTrack achieves SOTA results on mainstream RGB-X benchmarks. Conclusion: Compact spatiotemporal modeling simplifies the model structure and improves tracking, making it both efficient and practical. Abstract: Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (e.g., depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing the refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks. The code and models will be released at: https://github.com/XiaokunFeng/CSTrack.

Xu Li,Fan Lyu

Main category: cs.CV

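TL;DR: The MM-Prompt framework uses cross-modal prompt query and recovery to address modality imbalance in CVQA, improving performance and knowledge retention.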

Details

Motivation: Existing methods isolate prompts across modalities, which causes modality imbalance and degraded performance. Method: Proposes the MM-Prompt framework, which combines cross-modal prompt query and cross-modal prompt recovery and uses an alignment loss to prevent representational drift. Result: Experiments show that MM-Prompt surpasses existing methods in accuracy and knowledge retention while keeping the modalities balanced. Conclusion: MM-Prompt effectively addresses modality imbalance in CVQA and improves continual learning performance. Abstract: Continual Visual Question Answering (CVQA) based on pre-trained models (PTMs) has achieved promising progress by leveraging prompt tuning to enable continual multi-modal learning. However, most existing methods adopt cross-modal prompt isolation, constructing visual and textual prompts separately, which exacerbates modality imbalance and leads to degraded performance over time. To tackle this issue, we propose MM-Prompt, a novel framework incorporating cross-modal prompt query and cross-modal prompt recovery. The former enables balanced prompt selection by incorporating cross-modal signals during query formation, while the latter promotes joint prompt reconstruction through iterative cross-modal interactions, guided by an alignment loss to prevent representational drift. Extensive experiments show that MM-Prompt surpasses prior approaches in accuracy and knowledge retention, while maintaining balanced modality engagement throughout continual learning.

[149] Revolutionizing Wildfire Detection with Convolutional Neural Networks: A VGG16 Model Approach

Lakshmi Aishwarya Malladi,Navarun Gupta,Ahmed El-Sayed,Xingguo Xiong

Main category: cs.CV

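TL;DR: The study uses a CNN built on the VGG16 architecture to improve wildfire detection accuracy, addresses dataset issues through data augmentation and model optimization, and shows the potential of deep learning for early wildfire recognition.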

Details

Motivation: Wildfires are becoming more frequent and more destructive, so efficient warning systems are urgently needed to reduce catastrophic outcomes. Method: Uses a VGG16-based CNN with the D-FIRE dataset, applying data augmentation and model optimization to address low resolution, dataset imbalance, and related problems. Result: The model achieves a low false negative rate, supporting the reliability of deep learning for early wildfire recognition. Conclusion: Future work will integrate real-time surveillance networks and expand the dataset to further reduce the impact of wildfires. Abstract: Over 8,024 wildfire incidents have been documented in 2024 alone, causing thousands of fatalities and significant damage to infrastructure and ecosystems. Wildfires in the United States have inflicted devastating losses. Wildfires are becoming more frequent and intense, which highlights how urgently efficient warning systems are needed to avoid disastrous outcomes. The goal of this study is to enhance the accuracy of wildfire detection by using a Convolutional Neural Network (CNN) built on the VGG16 architecture. The D-FIRE dataset, which includes several kinds of wildfire and non-wildfire images, was employed in the study. Low-resolution images, dataset imbalance, and the necessity for real-time applicability are some of the main challenges. These problems were addressed by enriching the dataset using data augmentation techniques and optimizing the VGG16 model for binary classification. The model produced a low false negative rate, which is essential for reducing undetected fires, despite dataset limitations. In order to help authorities execute fast responses, this work shows that deep learning models such as VGG16 can offer a reliable, automated approach for early wildfire recognition. To reduce the impact of wildfires, our future work will concentrate on integrating with real-time surveillance networks and enlarging the dataset to cover more varied fire situations.

[150] SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams

Zhuoheng Gao,Yihao Li,Jiyao Zhang,Rui Zhao,Tong Wu,Hao Tang,Zhaofei Yu,Hao Dong,Guozhang Chen,Tiejun Huang

Main category: cs.CV

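TL;DR: The paper proposes SpikeStereoNet, the first brain-inspired framework to estimate stereo depth directly from raw spike streams, and introduces synthetic and real-world spike datasets for validation.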

Details

Motivation: Conventional frame-based cameras perform poorly at stereo depth estimation in rapidly changing scenes. Spike cameras provide event data at microsecond-level resolution, but dedicated algorithms and benchmarks are lacking. Method: Proposes SpikeStereoNet, which fuses spike streams from two viewpoints and iteratively refines the depth estimate with a recurrent spiking neural network (RSNN). Result: Outperforms existing methods on synthetic and real datasets, especially on textureless surfaces and under extreme lighting, and is data-efficient. Conclusion: SpikeStereoNet provides an effective solution for stereo depth estimation from spike data; the code and datasets will be released. Abstract: Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack specialized stereo algorithms and benchmarks tailored to the spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams' ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data. The source code and datasets will be publicly available.

[151] ViewCraft3D: High-Fidelity and View-Consistent 3D Vector Graphics Synthesis

Chuang Wang,Haitao Zhou,Ling Luo,Qian Yu

Main category: cs.CV

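TL;DR: VC3D is an efficient method for generating 3D vector graphics. It uses 3D priors to address the long processing times and view inconsistency of existing methods.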

Details

Motivation: 3D vector graphics are important in many applications, but existing methods suffer from long processing times and poor view consistency. Method: VC3D generates 3D vector graphics through 3D object analysis, a geometric extraction algorithm, and a view-consistent refinement process. Result: VC3D outperforms existing methods in quality and efficiency; the resulting 3D sketches stay view-consistent and capture the key characteristics of the objects. Conclusion: VC3D offers an efficient, high-quality solution for 3D vector graphics generation. Abstract: 3D vector graphics play a crucial role in various applications including 3D shape retrieval, conceptual design, and virtual reality interactions due to their ability to capture essential structural information with minimal representation. While recent approaches have shown promise in generating 3D vector graphics, they often suffer from lengthy processing times and struggle to maintain view consistency. To address these limitations, we propose ViewCraft3D (VC3D), an efficient method that leverages 3D priors to generate 3D vector graphics. Specifically, our approach begins with 3D object analysis, employs a geometric extraction algorithm to fit 3D vector graphics to the underlying structure, and applies a view-consistent refinement process to enhance visual quality. Our comprehensive experiments demonstrate that VC3D outperforms previous methods in both qualitative and quantitative evaluations, while significantly reducing computational overhead. The resulting 3D sketches maintain view consistency and effectively capture the essential characteristics of the original objects.

[152] The Role of Video Generation in Enhancing Data-Limited Action Understanding

Wei Li,Dezhao Luo,Dongbao Yang,Zhenhang Li,Weiping Wang,Yu Zhou

Main category: cs.CV

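TL;DR: Proposes generating annotated data with a text-to-video diffusion transformer to address data scarcity in video action understanding, and improves the quality of the generated samples with an information enhancement strategy and uncertainty-based label smoothing.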

Details

Motivation: Video action understanding in real-world scenarios is often limited by data; the paper aims to solve this by generating annotated data at unlimited scale. Method: Generates annotated data with a text-to-video diffusion transformer and proposes an information enhancement strategy and an uncertainty-based label smoothing strategy to improve the generated samples. Result: The method is validated on five tasks across four datasets and reaches state-of-the-art performance in zero-shot action recognition. Conclusion: The method effectively addresses data scarcity and improves model performance by raising the quality of generated samples. Abstract: Video action understanding tasks in real-world scenarios often suffer from data limitations. In this paper, we address the data-limited action understanding problem by mitigating data scarcity. We propose a novel method that employs a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data on an infinite scale without human intervention. We propose an information enhancement strategy and an uncertainty-based label smoothing strategy tailored to training on generated samples. Through quantitative and qualitative analysis, we observed that real samples generally contain a richer level of information than generated samples. Based on this observation, the information enhancement strategy is proposed to enhance the informative content of the generated samples from two aspects: the environments and the characters. Furthermore, we observed that some low-quality generated samples might negatively affect model training. To address this, we devised the uncertainty-based label smoothing strategy to increase the smoothing of these samples, thus reducing their impact. We demonstrate the effectiveness of the proposed method on four datasets across five tasks and achieve state-of-the-art performance for zero-shot action recognition.
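
Uncertainty-based label smoothing can be sketched as standard label smoothing with a per-sample strength. The abstract does not specify how uncertainty is estimated or mapped to a smoothing factor, so the linear mapping, the `max_smoothing` cap, and the function name below are assumptions for illustration only.

```python
def smoothed_target(label, num_classes, uncertainty, max_smoothing=0.3):
    """Soften a one-hot target in proportion to a sample's uncertainty.

    uncertainty is expected in [0, 1] (clamped otherwise). The result is
    (1 - eps) * one_hot + eps / num_classes with eps = max_smoothing * uncertainty,
    so low-quality (uncertain) generated samples pull less on the model.
    """
    eps = max_smoothing * min(max(uncertainty, 0.0), 1.0)
    off_value = eps / num_classes
    return [
        1.0 - eps + off_value if k == label else off_value
        for k in range(num_classes)
    ]


clean = smoothed_target(label=1, num_classes=4, uncertainty=0.0)
noisy = smoothed_target(label=1, num_classes=4, uncertainty=1.0)
```

A sample judged fully reliable keeps its one-hot target, while a maximally uncertain one has part of its probability mass spread over all classes.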

[153] Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models

Nanxing Hu,Xiaoyue Duan,Jinchao Zhang,Guoliang Kang

Main category: cs.CV

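TL;DR: The paper tackles hallucination in large vision-language models (LVLMs) from a Bayesian perspective, strengthening visual reliance by removing redundant visual tokens, rectifying the prior, and stopping generation at the right time.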

Details

Motivation: Text generated by LVLMs often fails to match the visual input, and this hallucination problem limits practical use. Existing methods do not systematically strengthen visual reliance. Method: Analyzes the factors that degrade visual reliance from a Bayesian perspective and proposes three improvements: removing redundant visual tokens, rectifying the prior, and stopping generation in time. Result: Performs well on the POPE, CHAIR, and MME benchmarks and substantially mitigates hallucination. Conclusion: The method systematically strengthens the visual reliance of LVLMs and offers an effective remedy for hallucination. Abstract: Large Vision-Language Models (LVLMs) usually generate texts which satisfy context coherence but don't match the visual input. Such a hallucination issue hinders LVLMs' applicability in the real world. The key to solving hallucination in LVLM is to make the text generation rely more on the visual content. Most previous works choose to enhance/adjust the features/output of a specific modality (i.e., visual or textual) to alleviate hallucinations in LVLM, which do not explicitly or systematically enhance the visual reliance. In this paper, we comprehensively investigate the factors which may degenerate the visual reliance in text generation of LVLM from a Bayesian perspective. Based on our observations, we propose to mitigate hallucination in LVLM from three aspects. Firstly, we observe that not all visual tokens are informative in generating meaningful texts. We propose to evaluate and remove redundant visual tokens to avoid their disturbance. Secondly, LVLM may encode inappropriate prior information, making it lean toward generating unexpected words. We propose a simple yet effective way to rectify the prior from a Bayesian perspective. Thirdly, we observe that starting from certain steps, the posterior of next-token prediction conditioned on visual tokens may collapse to a prior distribution which does not depend on any informative visual tokens at all. Thus, we propose to stop further text generation to avoid hallucination. Extensive experiments on three benchmarks including POPE, CHAIR, and MME demonstrate that our method can consistently mitigate the hallucination issue of LVLM and performs favorably against previous state-of-the-art methods.
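
The third observation, that the visually conditioned posterior can collapse onto a prior that ignores the image, suggests a simple stopping check. The sketch below is one hypothetical way to detect the collapse: it measures the KL divergence between the two next-token distributions at each step and reports the first step where it falls under a threshold. The divergence measure, the threshold, and the function names are assumptions, not details from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; q must be > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def first_collapsed_step(posteriors, priors, threshold=0.01):
    """Index of the first decoding step where the posterior matches the prior.

    posteriors[t] : next-token distribution conditioned on text and image
    priors[t]     : next-token distribution conditioned on text only
    Returns None if the posterior never collapses.
    """
    for step, (post, prior) in enumerate(zip(posteriors, priors)):
        if kl_divergence(post, prior) < threshold:
            return step
    return None


posteriors = [[0.9, 0.1], [0.6, 0.4], [0.5, 0.5]]
priors = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
stop_at = first_collapsed_step(posteriors, priors)
```

In this toy run the image stops influencing the prediction at the third step, so a decoder following this rule would end generation there.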

[154] Objective, Absolute and Hue-aware Metrics for Intrinsic Image Decomposition on Real-World Scenes: A Proof of Concept

Shogo Sato,Masaru Tsuchida,Mariko Yamaguchi,Takuhiro Kaneko,Kazuhiko Murasaki,Taiga Yoshida,Ryuichi Tanida

Main category: cs.CV

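TL;DR: This paper proposes a quantitative evaluation method based on hyperspectral imaging and LiDAR intensity to address the lack of ground truth in intrinsic image decomposition (IID), and demonstrates its feasibility with a laboratory validation.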

Details

Motivation: Existing IID evaluation relies on subjective human annotations and suffers from subjectivity, relative-only evaluation, and neglect of hue; an objective, absolute, and hue-aware evaluation method is needed. Method: Proposes computing albedo from hyperspectral imaging and LiDAR intensity, and introduces an albedo densification approach based on spectral similarity. Result: A proof of concept in a laboratory environment shows that objective, absolute, and hue-aware evaluation is achievable. Conclusion: The method offers a feasible quantitative evaluation scheme for IID with potential for practical use. Abstract: Intrinsic image decomposition (IID) is the task of separating an image into albedo and shade. In real-world scenes, it is difficult to quantitatively assess IID quality due to the unavailability of ground truth. The existing method provides the relative reflection intensities based on human-judged annotations. However, these annotations suffer from subjectivity, relative evaluation, and a lack of hue assessment. To address these, we propose a concept of quantitative evaluation with an albedo calculated from hyperspectral imaging and light detection and ranging (LiDAR) intensity. Additionally, we introduce an optional albedo densification approach based on spectral similarity. This paper conducted a concept verification in a laboratory environment, and suggested the feasibility of an objective, absolute, and hue-aware assessment. (This paper is accepted by IEEE ICIP 2025.)

[155] Locality-Aware Zero-Shot Human-Object Interaction Detection

Sanghyun Kim,Deunsol Jung,Minsu Cho

Main category: cs.CV

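TL;DR: LAIN is a new zero-shot human-object interaction detection framework that substantially improves zero-shot HOI detection by enhancing the locality and interaction awareness of CLIP representations.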

Details

Motivation: When using CLIP for zero-shot HOI detection, existing methods struggle to capture fine-grained information about human-object pairs, which weakens their ability to distinguish interactions. Method: LAIN enhances locality by aggregating information and spatial priors from adjacent regions, and improves interaction awareness by capturing human-object interaction patterns. Result: Experiments show that LAIN outperforms existing methods in a range of zero-shot settings. Conclusion: Locality and interaction awareness are essential for zero-shot HOI detection, and LAIN provides an effective solution. Abstract: Recent methods for zero-shot Human-Object Interaction (HOI) detection typically leverage the generalization ability of a large Vision-Language Model (VLM), i.e., CLIP, on unseen categories, showing impressive results on various zero-shot settings. However, existing methods struggle to adapt CLIP representations for human-object pairs, as CLIP tends to overlook fine-grained information necessary for distinguishing interactions. To address this issue, we devise LAIN, a novel zero-shot HOI detection framework enhancing the locality and interaction awareness of CLIP representations. The locality awareness, which involves capturing fine-grained details and the spatial structure of individual objects, is achieved by aggregating the information and spatial priors of adjacent neighborhood patches. The interaction awareness, which involves identifying whether and how a human is interacting with an object, is achieved by capturing the interaction pattern between the human and the object. By infusing locality and interaction awareness into CLIP representation, LAIN captures detailed information about the human-object pairs. Our extensive experiments on existing benchmarks show that LAIN outperforms previous methods on various zero-shot settings, demonstrating the importance of locality and interaction awareness for effective zero-shot HOI detection.

[156] Multimodal Machine Translation with Visual Scene Graph Pruning

Chenyu Lu,Shiliang Sun,Jing Zhao,Nan Zhang,Tengfei Song,Hao Yang

Main category: cs.CV

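TL;DR: Proposes a multimodal machine translation method based on visual scene graph pruning (PSG), which uses the language scene graph to guide the pruning of redundant visual nodes and reduce noise in the translation task.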

Details

Motivation: Addresses the redundancy and under-utilization of visual information in current multimodal machine translation (MMT). Method: Uses language scene graph information to prune redundant nodes in the visual scene graph, reducing noise. Result: The PSG model outperforms existing methods in experiments, confirming the effectiveness of visual information pruning. Conclusion: Visual information pruning has significant potential for multimodal machine translation. Abstract: Multimodal machine translation (MMT) seeks to address the challenges posed by linguistic polysemy and ambiguity in translation tasks by incorporating visual information. A key bottleneck in current MMT research is the effective utilization of visual data. Previous approaches have focused on extracting global or region-level image features and using attention or gating mechanisms for multimodal information fusion. However, these methods have not adequately tackled the issue of visual information redundancy in MMT, nor have they proposed effective solutions. In this paper, we introduce a novel approach, multimodal machine translation with visual Scene Graph Pruning (PSG), which leverages language scene graph information to guide the pruning of redundant nodes in visual scene graphs, thereby reducing noise in downstream translation tasks. Through extensive comparative experiments with state-of-the-art methods and ablation studies, we demonstrate the effectiveness of the PSG model. Our results also highlight the promising potential of visual information pruning in advancing the field of MMT.
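
Language-guided pruning of a visual scene graph can be sketched with plain data structures. The example below assumes that nodes are matched by exact label equality and that edges are (subject, relation, object) triples; the abstract does not describe the matching rule, so both choices and the function name are hypothetical.

```python
def prune_visual_graph(visual_nodes, visual_edges, language_nodes):
    """Keep only visual nodes that also appear in the language scene graph.

    visual_nodes   : iterable of object labels detected in the image
    visual_edges   : list of (subject, relation, object) triples
    language_nodes : set of entity labels parsed from the source sentence
    Edges survive only if both endpoints survive.
    """
    keep = {node for node in visual_nodes if node in language_nodes}
    edges = [(s, r, o) for s, r, o in visual_edges if s in keep and o in keep]
    return sorted(keep), edges


nodes, edges = prune_visual_graph(
    visual_nodes=["man", "horse", "tree", "cloud"],
    visual_edges=[("man", "riding", "horse"), ("tree", "behind", "man")],
    language_nodes={"man", "horse"},
)
```

Objects the sentence never mentions ("tree", "cloud") are dropped along with their edges, so the translation model only sees visual context relevant to the text.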

[157] Toward Patient-specific Partial Point Cloud to Surface Completion for Pre- to Intra-operative Registration in Image-guided Liver Interventions

Nakul Poudel,Zixin Yang,Kelly Merrell,Richard Simon,Cristian A. Linte

Main category: cs.CV

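TL;DR: Proposes a patient-specific point cloud completion method based on VN-OccNet to address registration of partially visible intra-operative point clouds, and validates it experimentally.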

Details

Motivation: Intra-operative data lacks sub-surface information, and point cloud registration is difficult because of partial visibility, so the point cloud must be completed to improve registration. Method: Uses VN-OccNet to generate a complete liver surface from a partial intra-operative point cloud, combined with the Go-ICP algorithm for rigid registration. Result: The rotation equivariance and surface generation capability of VN-OccNet substantially improve registration. Conclusion: Patient-specific point cloud completion is a promising way to handle partial intra-operative visibility and improve registration robustness. Abstract: Intra-operative data captured during image-guided surgery lacks sub-surface information, where key regions of interest, such as vessels and tumors, reside. Image-to-physical registration enables the fusion of pre-operative information and intra-operative data, typically represented as a point cloud. However, this registration process struggles due to partial visibility of the intra-operative point cloud. In this research, we propose a patient-specific point cloud completion approach to assist with the registration process. Specifically, we leverage VN-OccNet to generate a complete liver surface from a partial intra-operative point cloud. The network is trained in a patient-specific manner, where simulated deformations from the pre-operative model are used to train the model. First, we conduct an in-depth analysis of VN-OccNet's rotation-equivariant property and its effectiveness in recovering complete surfaces from partial intra-operative surfaces. Next, we integrate the completed intra-operative surface into the Go-ICP registration algorithm to demonstrate its utility in improving initial rigid registration outcomes. Our results highlight the promise of this patient-specific completion approach in mitigating the challenges posed by partial intra-operative visibility. The rotation equivariant and surface generation capabilities of VN-OccNet hold strong promise for developing robust registration frameworks for variations of the intra-operative point cloud.

[158] Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift

Gihoon Kim,Hyungjin Park,Taesup Kim

Main category: cs.CV

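TL;DR: Proposes a new training objective based on a Lipschitz constraint to address distributional drift when personalizing text-to-image diffusion models; experiments show it outperforms existing methods.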

Details

Motivation: In personalization, the model must learn a new subject while preserving its original generative ability, but standard training objectives are mismatched with the goals of personalization, which causes distributional drift. Method: A training objective based on a Lipschitz-bounded formulation that explicitly constrains deviation from the pretrained distribution. Result: The method outperforms existing methods on CLIP-T, CLIP-I, and DINO scores and performs well in data-scarce scenarios. Conclusion: The new method effectively controls distributional drift and improves personalization performance. Abstract: Personalization using text-to-image diffusion models involves adapting a pretrained model to novel subjects with only a few image examples. This task presents a fundamental challenge, as the model must not only learn the new subject effectively but also preserve its ability to generate diverse and coherent outputs across a wide range of prompts. In other words, successful personalization requires integrating new concepts without forgetting previously learned generative capabilities. Forgetting denotes unintended distributional drift, where the model's output distribution deviates from that of the original pretrained model. In this paper, we provide an analysis of this issue and identify a mismatch between standard training objectives and the goals of personalization. To address this, we propose a new training objective based on a Lipschitz-bounded formulation that explicitly constrains deviation from the pretrained distribution. Our method provides improved control over distributional drift and performs well even in data-scarce scenarios. Experimental results demonstrate that our approach consistently outperforms existing personalization methods, achieving higher CLIP-T, CLIP-I, and DINO scores.

[159] Applications and Effect Evaluation of Generative Adversarial Networks in Semi-Supervised Learning

Jiyu Hu,Haijiang Zeng,Zhen Tian

Main category: cs.CV

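TL;DR: Proposes a GAN-based semi-supervised image classification model in which a generator, a discriminator, and a classifier are trained collaboratively to exploit limited labelled data together with abundant unlabelled data, improving image generation quality and classification accuracy.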

Details

Motivation: To address the shortage of labelled data in image classification and broaden the use of deep learning models in practical scenarios. Method: A GAN-based semi-supervised learning model with a collaborative training mechanism for the generator, discriminator, and classifier. Result: Improved image generation quality and classification accuracy, offering an effective solution for image recognition in complex environments. Conclusion: The model offers a new approach to semi-supervised learning for image classification and has practical potential. Abstract: Image classification, a core task in computer vision, relies on high-quality labelled data, which restricts the wide application of deep learning models in practical scenarios. To alleviate the problem of insufficient labelled samples, semi-supervised learning has gradually become a research hotspot. In this paper, we construct a semi-supervised image classification model based on Generative Adversarial Networks (GANs). By introducing a collaborative training mechanism for generators, discriminators and classifiers, we make effective use of limited labelled data and a large amount of unlabelled data, improve the quality of image generation and classification accuracy, and provide an effective solution for the task of image recognition in complex environments.

[160] TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs

Juntong Wang,Jiarui Wang,Huiyu Duan,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

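TL;DR: Introduces TDVE-DB, a large-scale benchmark dataset for text-driven video editing, and proposes TDVE-Assessor, a new VQA model designed specifically to assess the quality of text-driven video editing.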

Details

Motivation: No dedicated VQA model exists for assessing the quality of text-driven video editing, which makes evaluation difficult. Method: Builds the TDVE-DB dataset of 3,857 edited videos and proposes TDVE-Assessor, a model based on spatial and temporal features. Result: TDVE-Assessor substantially outperforms existing VQA models on all three evaluation dimensions. Conclusion: TDVE-DB and TDVE-Assessor fill a gap in the evaluation of text-driven video editing and provide a benchmark for future research. Abstract: Text-driven video editing is rapidly advancing, yet its rigorous evaluation remains challenging due to the absence of dedicated video quality assessment (VQA) models capable of discerning the nuances of editing quality. To address this critical gap, we introduce TDVE-DB, a large-scale benchmark dataset for text-driven video editing. TDVE-DB consists of 3,857 edited videos generated from 12 diverse models across 8 editing categories, and is annotated with 173,565 human subjective ratings along three crucial dimensions, i.e., edited video quality, editing alignment, and structural consistency. Based on TDVE-DB, we first conduct a comprehensive evaluation for the 12 state-of-the-art editing models revealing the strengths and weaknesses of current video techniques, and then benchmark existing VQA methods in the context of text-driven video editing evaluation. Building on these insights, we propose TDVE-Assessor, a novel VQA model specifically designed for text-driven video editing assessment. TDVE-Assessor integrates both spatial and temporal video features into a large language model (LLM) for rich contextual understanding to provide comprehensive quality assessment. Extensive experiments demonstrate that TDVE-Assessor substantially outperforms existing VQA models on TDVE-DB across all three evaluation dimensions, setting a new state-of-the-art. Both TDVE-DB and TDVE-Assessor will be released upon publication.

[161] FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

Jintao Tong,Wenwei Jin,Pengda Qin,Anqi Li,Yixiong Zou,Yuhong Li,Yuhua Li,Ruixuan Li

Main category: cs.CV

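TL;DR: FlowCut is an information-flow-based framework for pruning visual tokens that addresses the shortcomings of existing single-layer attention scoring and significantly improves model efficiency and performance.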

Details

Motivation: Large vision-language models (LVLMs) incur high computational costs because of redundant visual tokens, and existing pruning methods rely on single-layer attention scores, which cannot fully capture the complex interaction between tokens and layers. Method: Analyzes the interaction between tokens and layers through information flow, finds that the CLS token acts as an information relay and that redundancy emerges dynamically, and proposes the FlowCut framework on that basis. Result: FlowCut outperforms the state of the art by 1.6% on LLaVA-1.5-7B with 88.9% token reduction and by 4.3% on LLaVA-NeXT-7B with 94.4% token reduction, with a 3.2x speed-up in the prefilling stage. Conclusion: By analyzing information flow, FlowCut identifies redundant tokens more accurately and significantly improves model efficiency and performance. Abstract: Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut
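
The abstract's third finding, that single-layer attention scores can give contradictory redundancy decisions, can be illustrated with a toy ranking. This is a minimal sketch and not FlowCut's actual criterion, which the abstract does not spell out: the function names, the attention values, and the choice of summing CLS attention across layers are all assumptions made for illustration.

```python
def top_k_tokens(scores, k):
    """Return the indices of the k highest-scoring tokens, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_single_layer(cls_attention, layer, k):
    """Keep the k tokens that receive the most CLS attention in one layer."""
    return top_k_tokens(cls_attention[layer], k)


def prune_accumulated(cls_attention, k):
    """Keep the k tokens with the most CLS attention summed over all layers.

    Summation is a stand-in for a cross-layer criterion (an assumption).
    """
    num_tokens = len(cls_attention[0])
    totals = [sum(layer[i] for layer in cls_attention) for i in range(num_tokens)]
    return top_k_tokens(totals, k)


# Hypothetical CLS-to-token attention: one row per layer, one column per visual token.
cls_attention = [
    [0.40, 0.10, 0.30, 0.20],
    [0.10, 0.45, 0.35, 0.10],
    [0.15, 0.15, 0.40, 0.30],
]

kept_layer0 = prune_single_layer(cls_attention, 0, 2)
kept_layer1 = prune_single_layer(cls_attention, 1, 2)
kept_all = prune_accumulated(cls_attention, 2)
```

In this toy data, layer 0 would discard token 1 while layer 1 would keep it, so the decision depends on which single layer is consulted; a criterion that looks across layers removes that dependence.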

[162] SMART-PC: Skeletal Model Adaptation for Robust Test-Time Training in Point Clouds

Ali Bahri,Moslem Yazdanpanah,Sahar Dastani,Mehrdad Noori,Gustavo Adolfo Vargas Hakim,David Osowiechi,Farzad Beizaee,Ismail Ben Ayed,Christian Desrosiers

Main category: cs.CV

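TL;DR: SMART-PC is a skeleton-based framework for 3D point cloud classification that achieves real-time adaptation by eliminating backpropagation, improving robustness to distribution shifts and computational efficiency.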

Details

Motivation: Existing test-time training methods for 3D point cloud classification rely on computationally expensive backpropagation, which makes them hard to apply in real-time scenarios. Method: Uses the geometric structure of 3D point clouds to predict skeletal representations and extract robust geometric features, and achieves lightweight real-time adaptation by updating only BatchNorm statistics. Result: State-of-the-art results on benchmark datasets including ModelNet40-C, ShapeNet-C, and ScanObjectNN-C, outperforming methods such as MATE. Conclusion: Through skeletal representations and lightweight adaptation, SMART-PC significantly improves the robustness and efficiency of 3D point cloud classification. Abstract: Test-Time Training (TTT) has emerged as a promising solution to address distribution shifts in 3D point cloud classification. However, existing methods often rely on computationally expensive backpropagation during adaptation, limiting their applicability in real-world, time-sensitive scenarios. In this paper, we introduce SMART-PC, a skeleton-based framework that enhances resilience to corruptions by leveraging the geometric structure of 3D point clouds. During pre-training, our method predicts skeletal representations, enabling the model to extract robust and meaningful geometric features that are less sensitive to corruptions, thereby improving adaptability to test-time distribution shifts. Unlike prior approaches, SMART-PC achieves real-time adaptation by eliminating backpropagation and updating only BatchNorm statistics, resulting in a lightweight and efficient framework capable of achieving high frames-per-second rates while maintaining superior classification performance. Extensive experiments on benchmark datasets, including ModelNet40-C, ShapeNet-C, and ScanObjectNN-C, demonstrate that SMART-PC achieves state-of-the-art results, outperforming existing methods such as MATE in terms of both accuracy and computational efficiency. The implementation is available at: https://github.com/AliBahri94/SMART-PC.
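
Adapting by updating only BatchNorm statistics means the running mean and variance are refreshed from test batches while the learned weights stay fixed, so no gradients are needed. The sketch below shows that idea in plain Python under stated assumptions: the class name `BatchNormStats`, the momentum-based update, and the use of the biased batch variance are illustrative choices, not details taken from the SMART-PC implementation.

```python
class BatchNormStats:
    """Running per-channel statistics that can be refreshed without gradients."""

    def __init__(self, mean, var, momentum=0.1, eps=1e-5):
        self.mean = list(mean)
        self.var = list(var)
        self.momentum = momentum  # hypothetical value; not given in the abstract
        self.eps = eps

    def adapt(self, batch):
        """Blend statistics of a test batch (rows = samples) into the running values."""
        n = len(batch)
        m = self.momentum
        for c in range(len(self.mean)):
            column = [row[c] for row in batch]
            batch_mean = sum(column) / n
            # Biased (population) variance of the batch; an illustrative choice.
            batch_var = sum((x - batch_mean) ** 2 for x in column) / n
            self.mean[c] = (1 - m) * self.mean[c] + m * batch_mean
            self.var[c] = (1 - m) * self.var[c] + m * batch_var

    def normalize(self, row):
        """Normalize one sample with the current running statistics."""
        return [
            (x - mu) / (v + self.eps) ** 0.5
            for x, mu, v in zip(row, self.mean, self.var)
        ]


# Statistics from training, then one shifted test batch with two channels.
stats = BatchNormStats(mean=[0.0, 0.0], var=[1.0, 1.0], momentum=0.5)
shifted_batch = [[2.0, -1.0], [4.0, -3.0]]
stats.adapt(shifted_batch)
```

After `adapt`, the running mean has moved halfway toward the shifted batch mean, and no backward pass was involved, which is what keeps this style of adaptation cheap.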

[163] Aggregated Structural Representation with Large Language Models for Human-Centric Layout Generation

Jiongchao Jin,Shengchu Zhao,Dajun Chen,Wei Jiang,Yong Li

Main category: cs.CV

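TL;DR: Proposes an ASR module that combines graph networks with large language models (LLMs) for automated layout generation, addressing the limited generative capability and loss of structural information in existing methods.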

Details

Motivation: Manual layout design is time-consuming and complex; existing graph-based generation methods have limited generative capability, and vision-based generative models ignore structural information, which leads to unreasonable outputs. Method: The ASR module integrates graph networks with LLMs to preserve structural information while enhancing generative capability, and uses graph features in place of the ViT module to predict the full layout. Result: Strong performance on the RICO dataset in both quantitative (mIoU) and qualitative (user study) evaluation, with support for diverse layout generation. Conclusion: The ASR module is efficient and creative in automated layout generation and supports human-centric, interactive design. Abstract: Time consumption and the complexity of manual layout design make automated layout generation a critical task, especially for multiple applications across different mobile devices. Existing graph-based layout generation approaches suffer from limited generative capability, often resulting in unreasonable and incompatible outputs. Meanwhile, vision-based generative models tend to overlook the original structural information, leading to component intersections and overlaps. To address these challenges, we propose an Aggregation Structural Representation (ASR) module that integrates graph networks with large language models (LLMs) to preserve structural information while enhancing generative capability. This novel pipeline utilizes graph features as hierarchical prior knowledge, replacing the traditional Vision Transformer (ViT) module in multimodal large language models (MLLM) to predict full layout information for the first time. Moreover, the intermediate graph matrix used as input for the LLM is human-editable, enabling progressive, human-centric design generation. A comprehensive evaluation on the RICO dataset demonstrates the strong performance of ASR, both quantitatively using mean Intersection over Union (mIoU), and qualitatively through a crowdsourced user study. Additionally, sampling on relational features ensures diverse layout generation, further enhancing the adaptability and creativity of the proposed approach.
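
The quantitative metric named in the abstract, mean Intersection over Union, can be sketched for layout boxes as follows. This is a minimal illustration that assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and a one-to-one pairing of predicted and target components by position; the paper's exact matching protocol on RICO is not described in the abstract.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted, target):
    """Average IoU over components paired by position (an assumed pairing)."""
    pairs = list(zip(predicted, target))
    return sum(box_iou(p, t) for p, t in pairs) / len(pairs)


# Two layout components: the first matches exactly, the second is shifted right.
predicted = [(0, 0, 2, 2), (2, 0, 4, 2)]
target = [(0, 0, 2, 2), (3, 0, 5, 2)]
score = mean_iou(predicted, target)
```

Here the first pair scores 1.0 and the second 1/3 (overlap 2 over union 6), so the mean is 2/3.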

[164] K-Buffers: A Plug-in Method for Enhancing Neural Fields with Multiple Buffers

Haofan Ren,Zunjie Zhu,Xiang Chen,Ming Lu,Rongfeng Lu,Chenggang Yan

Main category: cs.CV

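TL;DR: Proposes a plug-in method named K-Buffers that uses multiple buffers to improve the rendering performance of neural fields, and validates it on neural point fields and 3D Gaussian Splatting.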

Details

Motivation: Existing work focuses mainly on scene representations and neglects the rendering process, so K-Buffers is proposed to improve the rendering performance of neural fields. Method: (1) render K buffers from the scene representation and construct pixel-wise feature maps; (2) fuse the feature maps with a K-Feature Fusion Network (KFN); (3) generate the rendered image with a feature decoder; (4) introduce an acceleration strategy to improve speed and quality. Result: Experiments show that the method significantly improves the rendering performance of neural point fields and 3D Gaussian Splatting. Conclusion: K-Buffers is an effective plug-in method for improving the rendering process of neural fields. Abstract: Neural fields are now the central focus of research in 3D vision and computer graphics. Existing methods mainly focus on various scene representations, such as neural points and 3D Gaussians. However, few works have studied the rendering process to enhance the neural fields. In this work, we propose a plug-in method named K-Buffers that leverages multiple buffers to improve the rendering performance. Our method first renders K buffers from scene representations and constructs K pixel-wise feature maps. Then, we introduce a K-Feature Fusion Network (KFN) to merge the K pixel-wise feature maps. Finally, we adopt a feature decoder to generate the rendered image. We also introduce an acceleration strategy to improve rendering speed and quality. We apply our method to well-known radiance field baselines, including neural point fields and 3D Gaussian Splatting (3DGS). Extensive experiments demonstrate that our method effectively enhances the rendering performance of neural point fields and 3DGS.

[165] Few-Shot Class-Incremental Learning For Efficient SAR Automatic Target Recognition

George Karantaidis,Athanasios Pantsios,Ioannis Kompatsiaris,Symeon Papadopoulos

Main category: cs.CV

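TL;DR: Proposes a few-shot class-incremental learning framework built on a dual-branch architecture to address data scarcity in SAR-ATR; it combines local feature extraction with global dependency capture, and experiments show it outperforms existing methods.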

Details

Motivation: Data scarcity is a major challenge for SAR-ATR systems that conventional methods struggle to handle, so a new learning framework is needed. Method: A dual-branch architecture that uses the discrete Fourier transform and global filters to capture spatial dependencies, together with a lightweight cross-attention mechanism and tailored loss functions (focal loss and center loss). Result: Outperforms existing methods on the MSTAR dataset, supporting its effectiveness in practical scenarios. Conclusion: The proposed FSCIL framework performs well in SAR-ATR and offers an effective way to deal with data scarcity. Abstract: Synthetic aperture radar automatic target recognition (SAR-ATR) systems have rapidly evolved to tackle incremental recognition challenges in operational settings. Data scarcity remains a major hurdle that conventional SAR-ATR techniques struggle to address. To cope with this challenge, we propose a few-shot class-incremental learning (FSCIL) framework based on a dual-branch architecture that focuses on local feature extraction and leverages the discrete Fourier transform and global filters to capture long-term spatial dependencies. This incorporates a lightweight cross-attention mechanism that fuses domain-specific features with global dependencies to ensure robust feature interaction, while maintaining computational efficiency by introducing minimal scale-shift parameters. The framework combines focal loss for class distinction under imbalance and center loss for compact intra-class distributions to enhance class separation boundaries. Experimental results on the MSTAR benchmark dataset demonstrate that the proposed framework consistently outperforms state-of-the-art methods in FSCIL SAR-ATR, attesting to its effectiveness in real-world scenarios.
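
The combination of focal loss and center loss mentioned in the abstract can be sketched for a single sample as below. The two loss terms follow their standard textbook forms (focal loss without the optional class-balancing factor); the `gamma` and `weight` values, the fixed class centers, and the function names are assumptions for illustration, since the abstract does not state how the paper weights or parameterizes the two terms.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: down-weights examples that are already easy."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def center_loss(feature, center):
    """Half the squared distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))


def combined_loss(probs, target, feature, centers, gamma=2.0, weight=0.1):
    """Focal term plus a weighted center term (the weight is hypothetical)."""
    return focal_loss(probs, target, gamma) + weight * center_loss(
        feature, centers[target]
    )


# One sample of class 0 with predicted probabilities and a 2-D feature vector.
probs = [0.7, 0.2, 0.1]
centers = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.0]}
loss = combined_loss(probs, target=0, feature=[1.0, 1.0], centers=centers)
```

The focal term shrinks quickly as the predicted probability of the true class approaches 1, which is why it helps under class imbalance, while the center term pulls features of the same class toward a common point.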

[166] What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation

Jianghang Lin,Yue Hu,Jiangtao Shen,Yunhang Shen,Liujuan Cao,Shengchuan Zhang,Rongrong Ji

Main category: cs.CV

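TL;DR: Proposes an open vocabulary image segmentation framework inspired by human vision that uses a generative vision-language model and a concept-aware module to significantly improve segmentation performance.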

Details

Motivation: Existing methods suffer from poor semantic alignment between region segmentation and category matching, which departs from the cognitive process of the human visual system. Method: The framework comprises a Generative Vision-Language Model (G-VLM), a Concept-Aware Visual Enhancer Module, and a Cognition-Inspired Decoder, emulating the human process of first understanding semantics and then perceiving spatial extent. Result: Significant improvements on multiple datasets, for example 27.2 PQ, 17.0 mAP, and 35.3 mIoU on A-150, plus support for vocabulary-free segmentation. Conclusion: By emulating human cognition, the framework significantly improves the performance and flexibility of open vocabulary image segmentation. Abstract: Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer Module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on target concepts. (3) A Cognition-Inspired Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching 27.2 PQ, 17.0 mAP, and 35.3 mIoU on A-150. It further attains 56.2, 28.2, 15.4, 59.2, 18.7, and 95.8 mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free segmentation, offering enhanced flexibility in recognizing unseen categories. Code will be public.

[167] VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models

Hu Xiaobin,Liang Yujie,Luo Donghao,Peng Xu,Zhang Jiangning,Zhu Junwei,Wang Chengjie,Fu Yanwei

Main category: cs.CV

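TL;DR: VTBench is a comprehensive benchmark suite for virtual try-on models that addresses the shortcomings of existing evaluation methods and provides a multi-dimensional, human-aligned evaluation framework.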

Details

Motivation: Current evaluation methods for virtual try-on models do not adequately reflect human perception, and test scenarios are limited to indoor settings that lack the complexity of real-world scenes. Method: Introduces VTBench, a hierarchical benchmark suite that decomposes virtual try-on into multiple dimensions, each with tailored test sets and evaluation criteria. Result: VTBench provides a multi-dimensional evaluation framework and human preference annotations, and reveals the performance gap between indoor and real-world scenarios. Conclusion: VTBench will be open-sourced to push virtual try-on toward more challenging real-world scenarios. Abstract: While virtual try-on has achieved significant progress, evaluating these models towards real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) Current metrics inadequately reflect human perception, particularly in unpaired try-on settings; (2) Most existing test sets are limited to indoor scenarios, lacking complexity for real-world evaluation; and (3) An ideal system should guide future advancements in virtual try-on generation. To address these needs, we introduce VTBench, a hierarchical benchmark suite that systematically decomposes virtual image try-on into hierarchical, disentangled dimensions, each equipped with tailored test sets and evaluation criteria. VTBench exhibits three key advantages: (1) Multi-Dimensional Evaluation Framework: The benchmark encompasses five critical dimensions for virtual try-on generation (e.g., overall image quality, texture preservation, complex background consistency, cross-category size adaptability, and hand-occlusion handling). Granular evaluation metrics of corresponding test sets pinpoint model capabilities and limitations across diverse, challenging scenarios. (2) Human Alignment: Human preference annotations are provided for each test set, ensuring the benchmark's alignment with perceptual quality across all evaluation dimensions. (3) Valuable Insights: Beyond standard indoor settings, we analyze model performance variations across dimensions and investigate the disparity between indoor and real-world try-on scenarios. To foster the field of virtual try-on towards challenging real-world scenarios, VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.

[168] Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes

Kaiqing Lin,Zhiyuan Yan,Ke-Yue Zhang,Li Hao,Yue Zhou,Yuzhen Lin,Weixiang Li,Taiping Yao,Shouhong Ding,Bin Li

Main category: cs.CV

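TL;DR: VIPGuard is a multimodal framework for personalized deepfake detection that exploits detailed features of known facial identities and outperforms traditional methods.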

Details

Motivation: Protecting personal identity against deepfake attacks is critical in the digital age, especially for celebrities and political figures, yet existing methods often ignore the valuable prior knowledge of known facial identities. Method: VIPGuard works in three stages: (1) fine-tune a multimodal large language model (MLLM) to learn detailed facial attributes; (2) perform identity-level discriminative learning; (3) introduce user-specific customization with semantic reasoning via the MLLM. Result: VIPGuard outperforms traditional methods on personalized deepfake detection and provides explainable predictions; the VIPBench benchmark was also built for evaluation. Conclusion: By combining detailed facial features with semantic reasoning, VIPGuard significantly improves the accuracy and explainability of deepfake detection. Abstract: Securing personal identity against deepfake attacks is increasingly critical in the digital age, especially for celebrities and political figures whose faces are easily accessible and frequently targeted. Most existing deepfake detection methods focus on general-purpose scenarios and often ignore the valuable prior knowledge of known facial identities, e.g., "VIP individuals" whose authentic facial data are already available. In this paper, we propose VIPGuard, a unified multimodal framework designed to capture fine-grained and comprehensive facial representations of a given identity, compare them against potentially fake or similar-looking faces, and reason over these comparisons to make accurate and explainable predictions. Specifically, our framework consists of three main stages. First, we fine-tune a multimodal large language model (MLLM) to learn detailed and structural facial attributes. Second, we perform identity-level discriminative learning to enable the model to distinguish subtle differences between highly similar faces, including real and fake variations. Finally, we introduce user-specific customization, where we model the unique characteristics of the target face identity and perform semantic reasoning via MLLM to enable personalized and explainable deepfake detection. Our framework shows clear advantages over previous detection works, where traditional detectors mainly rely on low-level visual cues and provide no human-understandable explanations, while other MLLM-based models often lack a detailed understanding of specific face identities.
To facilitate the evaluation of our method, we built a comprehensive identity-aware benchmark called VIPBench for personalized deepfake detection, involving the latest 7 face-swapping and 7 entire face synthesis techniques for generation.

[169] Beyond Segmentation: Confidence-Aware and Debiased Estimation of Ratio-based Biomarkers

Jiameng Li,Teodora Popordanoska,Sebastian G. Gruber,Frederik Maes,Matthew B. Blaschko

Main category: cs.CV

TL;DR: Proposes a unified framework for estimating ratio-based biomarkers that provides uncertainty measures through a post-hoc calibration module.

Details

Motivation: Existing methods provide only point estimates with no measure of uncertainty, whereas clinical decisions need more reliable information. Method: Analyzes error propagation in the segmentation-to-biomarker pipeline and proposes a lightweight post-hoc calibration module, with a tunable parameter Q that controls the confidence level. Result: Experiments show that the method produces statistically sound confidence intervals that can be adapted to clinical practice. Conclusion: The method makes biomarkers more trustworthy in clinical workflows. Abstract: Ratio-based biomarkers -- such as the proportion of necrotic tissue within a tumor -- are widely used in clinical practice to support diagnosis, prognosis and treatment planning. These biomarkers are typically estimated from soft segmentation outputs by computing region-wise ratios. Despite the high-stakes nature of clinical decision making, existing methods provide only point estimates, offering no measure of uncertainty. In this work, we propose a unified confidence-aware framework for estimating ratio-based biomarkers. We conduct a systematic analysis of error propagation in the segmentation-to-biomarker pipeline and identify model miscalibration as the dominant source of uncertainty. To mitigate this, we incorporate a lightweight, post-hoc calibration module that can be applied using internal hospital data without retraining. We leverage a tunable parameter $Q$ to control the confidence level of the derived bounds, allowing adaptation towards clinical practice. Extensive experiments show that our method produces statistically sound confidence intervals, with tunable confidence levels, enabling more trustworthy application of predictive biomarkers in clinical workflows.
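
A ratio-based biomarker computed from soft segmentation outputs is sensitive to how well the probabilities are calibrated, which is the abstract's reason for adding a post-hoc calibration step. The sketch below illustrates that sensitivity only. It uses temperature scaling as a generic stand-in for calibration, which is an assumption: the paper's calibration module and its parameter Q are not specified in the abstract, and the logits and function names here are hypothetical.

```python
import math


def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def soft_ratio(numerator_logits, denominator_logits, temperature=1.0):
    """Ratio of summed soft-segmentation probabilities for two regions.

    Dividing logits by a temperature is a generic post-hoc calibration
    (an illustrative stand-in, not the paper's module).
    """
    num = sum(sigmoid(z / temperature) for z in numerator_logits)
    den = sum(sigmoid(z / temperature) for z in denominator_logits)
    return num / den if den > 0 else 0.0


# Hypothetical per-voxel logits: necrosis is predicted in half of the tumor voxels.
necrosis_logits = [4.0, 4.0, -4.0, -4.0]
tumor_logits = [4.0, 4.0, 4.0, 4.0]

raw = soft_ratio(necrosis_logits, tumor_logits)
softened = soft_ratio(necrosis_logits, tumor_logits, temperature=4.0)
```

With identical logits, the estimated necrotic fraction moves from about 0.51 to about 0.68 when the temperature changes from 1 to 4, which shows why a point estimate alone can be misleading when the model is miscalibrated.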

[170] Rep3D: Re-parameterize Large 3D Kernels with Low-Rank Receptive Modeling for Medical Imaging

Ho Hin Lee,Quan Liu,Shunxing Bao,Yuankai Huo,Bennett A. Landman

Main category: cs.CV

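TL;DR: Rep3D is a large-kernel 3D convolutional framework that introduces a learnable spatial prior and adaptive weight adjustment to resolve optimization instability in large-kernel training, and outperforms existing methods on multiple 3D segmentation tasks.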

Details

Motivation: Conventional large-kernel convolutions suffer from optimization instability and degraded performance on high-resolution 3D volumetric data; Rep3D analyzes the spatial bias of effective receptive fields to arrive at a more efficient solution. Method: A two-stage modulation network generates a receptive-biased scaling mask that adaptively re-weights kernel updates, combined with a plain encoder design that avoids the complexity of multi-branch structures. Result: On five 3D segmentation benchmarks, Rep3D outperforms existing approaches, including transformer-based and fixed-prior re-parameterization methods. Conclusion: By combining spatial inductive bias with optimization-aware learning, Rep3D offers an interpretable and scalable solution for 3D medical image analysis. Abstract: In contrast to vision transformers, which model long-range dependencies through global self-attention, large kernel convolutions provide a more efficient and scalable alternative, particularly in high-resolution 3D volumetric settings. However, naively increasing kernel size often leads to optimization instability and degradation in performance. Motivated by the spatial bias observed in effective receptive fields (ERFs), we hypothesize that different kernel elements converge at variable rates during training. To support this, we derive a theoretical connection between element-wise gradients and first-order optimization, showing that structurally re-parameterized convolution blocks inherently induce spatially varying learning rates. Building on this insight, we introduce Rep3D, a 3D convolutional framework that incorporates a learnable spatial prior into large kernel training. A lightweight two-stage modulation network generates a receptive-biased scaling mask, adaptively re-weighting kernel updates and enabling local-to-global convergence behavior. Rep3D adopts a plain encoder design with large depthwise convolutions, avoiding the architectural complexity of multi-branch compositions. We evaluate Rep3D on five challenging 3D segmentation benchmarks and demonstrate consistent improvements over state-of-the-art baselines, including transformer-based and fixed-prior re-parameterization methods. By unifying spatial inductive bias with optimization-aware learning, Rep3D offers an interpretable and scalable solution for 3D medical image analysis. The source code is publicly available at https://github.com/leeh43/Rep3D.

[171] JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

Jiaxin Song,Yixu Wang,Jie Li,Rui Yu,Yan Teng,Xingjun Ma,Yingchun Wang

Main category: cs.CV

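TL;DR: Proposes JailBound, a framework that attacks vision-language models (VLMs) efficiently by probing and crossing a boundary in latent space, exposing their safety risks.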

Details

Motivation: Existing jailbreak methods lack well-defined attack objectives and neglect cross-modal interactions, which limits their effectiveness. The paper posits that VLMs encode an implicit safety decision boundary in latent space and uses this boundary to guide attacks. Method: JailBound has two stages: (1) Safety Boundary Probing, which approximates the decision boundary in latent space; (2) Safety Boundary Crossing, which jointly optimizes adversarial perturbations across image and text inputs. Result: Across six VLMs, average white-box and black-box attack success rates are 94.32% and 67.28%, respectively, higher than existing methods. Conclusion: The study exposes safety risks in VLMs and calls for more robust defenses. Abstract: Vision-Language Models (VLMs) exhibit impressive performance, yet the integration of powerful vision encoders has significantly broadened their attack surface, rendering them increasingly susceptible to jailbreak attacks. However, lacking well-defined attack objectives, existing jailbreak methods often struggle with gradient-based strategies prone to local optima and lacking precise directional guidance, and typically decouple visual and textual modalities, thereby limiting their effectiveness by neglecting crucial cross-modal interactions. Inspired by the Eliciting Latent Knowledge (ELK) framework, we posit that VLMs encode safety-relevant information within their internal fusion-layer representations, revealing an implicit safety decision boundary in the latent space. This motivates exploiting this boundary to steer model behavior. Accordingly, we propose JailBound, a novel latent space jailbreak framework comprising two stages: (1) Safety Boundary Probing, which addresses the guidance issue by approximating the decision boundary within the fusion layer's latent space, thereby identifying optimal perturbation directions towards the target region; and (2) Safety Boundary Crossing, which overcomes the limitations of decoupled approaches by jointly optimizing adversarial perturbations across both image and text inputs. This latter stage employs an innovative mechanism to steer the model's internal state towards policy-violating outputs while maintaining cross-modal semantic consistency. Extensive experiments on six diverse VLMs demonstrate JailBound's efficacy: it achieves average attack success rates of 94.32% in the white-box setting and 67.28% in the black-box setting, which are 6.17% and 21.13% higher than SOTA methods, respectively.
Our findings expose an overlooked safety risk in VLMs and highlight the urgent need for more robust defenses. Warning: This paper contains potentially sensitive, harmful and offensive content.

[172] Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning

Ruolin Shen,Xiaozhong Ji,Kai WU,Jiangning Zhang,Yijun He,HaiHua Yang,Xiaobin Hu,Xiaoyu Sun

Main category: cs.CV

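TL;DR: Multimodal models differ markedly from the human visual system when identifying objects that blend visually into the background. This paper proposes a visual refocus reinforcement framework that uses stepwise reasoning and dynamic adjustment to bring models closer to human cognitive ability.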

Details

Motivation: 现有模型难以区分隐蔽对象，无法模拟人类利用前景-背景相似性进行视觉分析的认知过程。 Method: 构建模拟人类视觉伪装感知的系统，采用逐步推理的动态调整机制，并通过策略优化算法实现视觉重聚焦强化框架。 Result: 实验显示模型在隐蔽对象分类和检测任务中表现显著优于基线方法，并展现出动态调整检测框的多重推理特征。 Conclusion: 提出的框架成功缩小了人类与模型在视觉伪装感知上的差距，甚至在某些任务中超越人类表现。 Abstract: Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. Our observations reveal that these multi-modal models cannot distinguish concealed objects, demonstrating an inability to emulate human cognitive processes which effectively utilize foreground-background similarity principles for visual analysis. To analyze this hidden human-model visual thinking discrepancy, we build a visual system that mimicks human visual camouflaged perception to progressively and iteratively `refocus' visual concealed content. The refocus is a progressive guidance mechanism enabling models to logically localize objects in visual images through stepwise reasoning. The localization process of concealed objects requires hierarchical attention shifting with dynamic adjustment and refinement of prior cognitive knowledge. In this paper, we propose a visual refocus reinforcement framework via the policy optimization algorithm to encourage multi-modal models to think and refocus more before answering, and achieve excellent reasoning abilities to align and even surpass human camouflaged perception systems. Our extensive experiments on camouflaged perception successfully demonstrate the emergence of refocus visual phenomena, characterized by multiple reasoning tokens and dynamic adjustment of the detection box. Besides, experimental results on both camouflaged object classification and detection tasks exhibit significantly superior performance compared to Supervised Fine-Tuning (SFT) baselines.

[173] TESSER: Transfer-Enhancing Adversarial Attacks from Vision Transformers via Spectral and Semantic Regularization

Amira Guesmi,Bassem Ouni,Muhammad Shafique

Main category: cs.CV

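TL;DR: TESSER is a new adversarial attack framework that uses feature-sensitive gradient scaling and spectral smoothness regularization to significantly improve the transferability and attack success rate of adversarial examples.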

Details

Motivation: Adversarial transferability is a key challenge in evaluating the robustness of deep neural networks, and assessing the threat of black-box attacks is especially important in security-critical applications. Method: TESSER combines Feature-Sensitive Gradient Scaling (FSGS) and Spectral Smoothness Regularization (SSR) to generate perturbations that are semantically meaningful and spectrally smooth. Result: Across 12 architectures on ImageNet, TESSER raises the attack success rate over existing methods by 10.9% on CNNs and 7.2% on ViTs, and performs well against defended models. Conclusion: By improving how perturbations are generated, TESSER significantly increases the transferability and effectiveness of adversarial attacks and provides a new tool for adversarial robustness evaluation. Abstract: Adversarial transferability remains a critical challenge in evaluating the robustness of deep neural networks. In security-critical applications, transferability enables black-box attacks without access to model internals, making it a key concern for real-world adversarial threat assessment. While Vision Transformers (ViTs) have demonstrated strong adversarial performance, existing attacks often fail to transfer effectively across architectures, especially from ViTs to Convolutional Neural Networks (CNNs) or hybrid models. In this paper, we introduce TESSER, a novel adversarial attack framework that enhances transferability via two key strategies: (1) Feature-Sensitive Gradient Scaling (FSGS), which modulates gradients based on token-wise importance derived from intermediate feature activations, and (2) Spectral Smoothness Regularization (SSR), which suppresses high-frequency noise in perturbations using a differentiable Gaussian prior. These components work in tandem to generate perturbations that are both semantically meaningful and spectrally smooth. Extensive experiments on ImageNet across 12 diverse architectures demonstrate that TESSER achieves +10.9% higher attack success rate (ASR) on CNNs and +7.2% on ViTs compared to the state-of-the-art Adaptive Token Tuning (ATT) method. Moreover, TESSER significantly improves robustness against defended models, achieving 53.55% ASR on adversarially trained CNNs.
Qualitative analysis shows strong alignment between TESSER's perturbations and salient visual regions identified via Grad-CAM, while frequency-domain analysis reveals a 12% reduction in high-frequency energy, confirming the effectiveness of spectral regularization.

[174] Rotation-Equivariant Self-Supervised Method in Image Denoising

Hanze Liu,Jiahong Fu,Qi Xie,Deyu Meng

Main category: cs.CV

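TL;DR: Proposes a self-supervised image denoising method that incorporates rotation equivariant convolution, and validates it through theoretical analysis and experiments.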

Details

Motivation: 自监督方法减少了对大规模训练数据的需求，但现有方法主要依赖平移等变先验。本文探索如何进一步引入旋转等变先验以提升性能。 Method: 使用旋转等变卷积替换传统卷积层，并通过理论分析证明其有效性；设计了新的掩码机制融合旋转等变网络与传统CNN网络的输出。 Result: 实验表明，该方法在三种典型方法中均表现出色。 Conclusion: 旋转等变先验的引入为自监督图像去噪提供了新视角，并通过自适应框架进一步提升了性能。 Abstract: Self-supervised image denoising methods have garnered significant research attention in recent years, for this kind of method reduces the requirement of large training datasets. Compared to supervised methods, self-supervised methods rely more on the prior embedded in deep networks themselves. As a result, most of the self-supervised methods are designed with Convolution Neural Networks (CNNs) architectures, which well capture one of the most important image prior, translation equivariant prior. Inspired by the great success achieved by the introduction of translational equivariance, in this paper, we explore the way to further incorporate another important image prior. Specifically, we first apply high-accuracy rotation equivariant convolution to self-supervised image denoising. Through rigorous theoretical analysis, we have proved that simply replacing all the convolution layers with rotation equivariant convolution layers would modify the network into its rotation equivariant version. To the best of our knowledge, this is the first time that rotation equivariant image prior is introduced to self-supervised image denoising at the network architecture level with a comprehensive theoretical analysis of equivariance errors, which offers a new perspective to the field of self-supervised image denoising. Moreover, to further improve the performance, we design a new mask mechanism to fusion the output of rotation equivariant network and vanilla CNN-based network, and construct an adaptive rotation equivariant framework. Through extensive experiments on three typical methods, we have demonstrated the effectiveness of the proposed method.

[175] Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat

Pusheng Xu,Xia Gong,Xiaolan Chen,Weiyi Zhang,Jiancheng Yang,Bingjie Yan,Meng Yuan,Yalin Zheng,Mingguang He,Danli Shi

Main category: cs.CV

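TL;DR: This study develops a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology. Ophthalmic images and captions posted by WeChat Official Accounts were collected to generate bilingual QA pairs, and three VLMs were evaluated. Gemini 2.0 Flash performed best, and the dataset supports the development of ophthalmic AI systems.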

Details

Motivation: To develop a bilingual VQA benchmark grounded in real-world context that supports accurate and trustworthy AI systems in ophthalmology. Method: Ophthalmic images and captions were collected from WeChat Official Accounts, bilingual QA pairs were generated with GPT-4o-mini and divided into six subsets, and three VLMs were evaluated. Result: The dataset contains 3,469 images and 30,120 QA pairs. Gemini 2.0 Flash achieved the highest overall accuracy (0.548) and performed well across several subsets. Conclusion: The study presents the first bilingual VQA benchmark for ophthalmology, supporting the development of accurate, specialized, and trustworthy ophthalmic AI systems. Abstract: Purpose: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating VLMs in ophthalmology. Methods: Ophthalmic image posts and associated captions published between January 1, 2016, and December 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate the performance of three VLMs: GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B-Instruct. Results: The final OphthalWeChat dataset included 3,469 images and 30,120 QA pairs across 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.548), outperforming GPT-4o (0.522, P < 0.001) and Qwen2.5-VL-72B-Instruct (0.514, P < 0.001). It also led in both Chinese (0.546) and English subsets (0.550). Subset-specific performance showed Gemini 2.0 Flash excelled in Binary_CN (0.687), Single-choice_CN (0.666), and Single-choice_EN (0.646), while GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (BLEU-1: 0.301; BERTScore: 0.382), and Open-ended_EN (BLEU-1: 0.183; BERTScore: 0.240). Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset reflects authentic clinical decision-making scenarios and enables quantitative evaluation of VLMs, supporting the development of accurate, specialized, and trustworthy AI systems for eye care.

[176] HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment

Ming Meng,Qi Dong,Jiajie Li,Zhe Zhu,Xingyu Wang,Zhaoxin Fan,Wei Zhao,Wenjun Wu

Main category: cs.CV

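TL;DR: HF-VTON is a high-fidelity virtual try-on framework that uses three modules to address geometric distortion, semantic inconsistency, and loss of detail, and it is validated on the new SAMP-VTONS dataset.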

Details

Motivation: Existing virtual try-on methods suffer from geometric distortion, semantic inconsistency, and loss of detail across poses, calling for a more effective solution. Method: HF-VTON comprises three modules: APWAM (aligning garments to poses), SRCM (enhancing semantic representation), and MPAGM (optimizing appearance generation), and introduces the new SAMP-VTONS dataset. Result: Experiments show that HF-VTON outperforms existing methods on VITON-HD and SAMP-VTONS, particularly in visual fidelity, semantic consistency, and detail preservation. Conclusion: Through its cooperating modules and the new dataset, HF-VTON markedly improves the quality and consistency of virtual try-on. Abstract: Virtual try-on technology has become increasingly important in the fashion and retail industries, enabling the generation of high-fidelity garment images that adapt seamlessly to target human models. While existing methods have achieved notable progress, they still face significant challenges in maintaining consistency across different poses. Specifically, geometric distortions lead to a lack of spatial consistency, mismatches in garment structure and texture across poses result in semantic inconsistency, and the loss or distortion of fine-grained details diminishes visual fidelity. To address these challenges, we propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: (1) the Appearance-Preserving Warp Alignment Module (APWAM), which aligns garments to human poses, addressing geometric deformations and ensuring spatial consistency; (2) the Semantic Representation and Comprehension Module (SRCM), which captures fine-grained garment attributes and multi-pose data to enhance semantic representation, maintaining structural, textural, and pattern consistency; and (3) the Multimodal Prior-Guided Appearance Generation Module (MPAGM), which integrates multimodal features and prior knowledge from pre-trained models to optimize appearance generation, ensuring both semantic and geometric consistency. Additionally, to overcome data limitations in existing benchmarks, we introduce the SAMP-VTONS dataset, featuring multi-pose pairs and rich textual annotations for a more comprehensive evaluation. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS, excelling in visual fidelity, semantic consistency, and detail preservation.

[177] Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval

Fanheng Kong,Jingyuan Zhang,Yahui Liu,Hongzhi Zhang,Shi Feng,Xiaocui Yang,Daling Wang,Yu Tian,Qi Wang,Fuzheng Zhang,Guorui Zhou

Main category: cs.CV

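TL;DR: The UNITE framework tackles the challenges of multimodal information retrieval through data curation and modality-aware training configurations, and proposes MAMCL to reduce competition between modalities, achieving state-of-the-art results on multiple benchmarks.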

Details

Motivation: Multimodal information retrieval suffers from data heterogeneity and complex cross-modal alignment, and prior work has not systematically addressed the modality gap. Method: The UNITE framework combines data curation with modality-aware training configurations and introduces MAMCL to reduce competition between modalities. Result: State-of-the-art performance on multiple multimodal retrieval benchmarks, notably ahead of existing methods. Conclusion: UNITE not only improves multimodal retrieval performance but also provides a foundational blueprint for future research on multimodal systems. Abstract: Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modal gaps in feature spaces, a systematic approach to address these challenges remains unexplored. In this work, we introduce UNITE, a universal framework that tackles these challenges through two critical yet underexplored aspects: data curation and modality-aware training configurations. Our work provides the first comprehensive analysis of how modality-specific data properties influence downstream task performance across diverse scenarios. Moreover, we propose Modal-Aware Masked Contrastive Learning (MAMCL) to mitigate the competitive relationships among the instances of different modalities. Our framework achieves state-of-the-art results on multiple multimodal retrieval benchmarks, outperforming existing methods by notable margins. Through extensive experiments, we demonstrate that strategic modality curation and tailored training protocols are pivotal for robust cross-modal representation learning. This work not only advances MIR performance but also provides a foundational blueprint for future research in multimodal systems. Our project is available at https://friedrichor.github.io/projects/UNITE.
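One way to picture the masking idea behind MAMCL is an InfoNCE-style loss in which candidates whose modality differs from that of the positive target are dropped from the denominator. The sketch below is an illustration only: the function name `masked_info_nce`, the temperature value, and the specific masking rule are assumptions, not details given in the abstract.

```python
import math


def masked_info_nce(sim_row, modalities, positive_idx, temperature=0.05):
    """InfoNCE loss for one query, keeping only same-modality candidates.

    sim_row      -- similarity of the query to every candidate
    modalities   -- modality label of every candidate (hypothetical labels)
    positive_idx -- index of the positive (target) candidate
    """
    target_modality = modalities[positive_idx]
    # Assumed masking rule: candidates of a different modality than the
    # target are excluded, so they cannot compete with the positive.
    logits = [
        s / temperature
        for s, m in zip(sim_row, modalities)
        if m == target_modality
    ]
    pos_logit = sim_row[positive_idx] / temperature
    # Numerically stable log-sum-exp over the kept candidates.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_denominator - pos_logit
```

With similarities `[0.9, 0.2, 0.8]` and modalities `["image", "image", "text"]`, the text candidate no longer competes with the image positive, so the masked loss is lower than it would be if all three candidates shared one modality.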

[178] ReDDiT: Rehashing Noise for Discrete Visual Generation

Tianren Ma,Xiaosong Zhang,Boyu Yang,Junlan Feng,Qixiang Ye

Main category: cs.CV

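TL;DR: ReDDiT proposes a new discrete diffusion framework that substantially improves generation quality and efficiency through improved noise design and sampling heuristics.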

Details

Motivation: Discrete diffusion models are efficient and compatible for visual generation, but they still lag behind continuous models, mainly because of shortcomings in noise design and sampling heuristics. Method: The ReDDiT framework extends the absorbing states with randomized multi-index corruption and derives a rehash sampler that reverses the randomized absorbing paths, improving generation diversity and consistency. Result: ReDDiT significantly outperforms the baseline (gFID drops from 6.18 to 1.61) and matches continuous models with higher efficiency. Conclusion: By improving noise and sampling, ReDDiT substantially raises the generation quality and efficiency of discrete diffusion models and narrows the gap with continuous models. Abstract: Discrete diffusion models are gaining traction in the visual generative area for their efficiency and compatibility. However, pioneering attempts still fall behind their continuous counterparts, which we attribute to the noise (absorbing state) design and sampling heuristics. In this study, we propose the rehashing noise framework for discrete diffusion transformer, termed ReDDiT, to extend absorbing states and improve the expressive capacity of discrete diffusion models. ReDDiT enriches the potential paths that latent variables can traverse during training with randomized multi-index corruption. The derived rehash sampler, which reverses the randomized absorbing paths, guarantees the diversity and low discrepancy of the generation process. These reformulations lead to more consistent and competitive generation quality, mitigating the need for heavily tuned randomness. Experiments show that ReDDiT significantly outperforms the baseline (reducing gFID from 6.18 to 1.61) and is on par with the continuous counterparts with higher efficiency.

[179] LangDAug: Langevin Data Augmentation for Multi-Source Domain Generalization in Medical Image Segmentation

Piyush Tiwary,Kinjawl Bhattacharyya,Prathosh A. P

Main category: cs.CV

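TL;DR: Proposes LangDAug, a new method that generates intermediate samples via Langevin dynamics for multi-source domain generalization in 2D medical image segmentation.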

Details

Motivation: Medical image segmentation models generalize poorly across domains, and existing approaches (representation learning or data augmentation) have limitations. Method: Energy-based models (EBMs) trained via contrastive divergence are used to generate intermediate samples through Langevin dynamics. Result: Outperforms existing domain generalization methods on fundus segmentation and 2D MRI prostate segmentation benchmarks. Conclusion: LangDAug not only performs better but also effectively complements existing domain-randomization methods. Abstract: Medical image segmentation models often struggle to generalize across different domains due to various reasons. Domain Generalization (DG) methods overcome this either through representation learning or data augmentation (DAug). While representation learning methods seek domain-invariant features, they often rely on ad-hoc techniques and lack formal guarantees. DAug methods, which enrich model representations through synthetic samples, have shown comparable or superior performance to representation learning approaches. We propose LangDAug, a novel $\textbf{Lang}$evin $\textbf{D}$ata $\textbf{Aug}$mentation for multi-source domain generalization in 2D medical image segmentation. LangDAug leverages Energy-Based Models (EBMs) trained via contrastive divergence to traverse between source domains, generating intermediate samples through Langevin dynamics. Theoretical analysis shows that LangDAug induces a regularization effect, and for GLMs, it upper-bounds the Rademacher complexity by the intrinsic dimensionality of the data manifold. Through extensive experiments on Fundus segmentation and 2D MRI prostate segmentation benchmarks, we show that LangDAug outperforms state-of-the-art domain generalization methods and effectively complements existing domain-randomization approaches. The codebase for our method is available at https://github.com/backpropagator/LangDAug.
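The Langevin dynamics step that LangDAug relies on can be illustrated with a toy one-dimensional energy. This is a minimal sketch, assuming a quadratic energy E(x) = 0.5 (x - target)^2 in place of the learned EBM; `grad_energy`, `langevin_chain`, the step size, and the seed are hypothetical choices, and the update shown is the standard unadjusted Langevin rule rather than the authors' exact sampler.

```python
import math
import random


def grad_energy(x, target=0.0):
    # Gradient of the toy energy E(x) = 0.5 * (x - target)^2.
    # In LangDAug this role is played by a trained EBM (not shown here).
    return x - target


def langevin_chain(x0, n_steps, step_size, target=0.0, seed=0):
    """Run unadjusted Langevin dynamics and return every visited state.

    Each step moves downhill on the energy and adds Gaussian noise:
        x <- x - (step_size / 2) * grad E(x) + sqrt(step_size) * noise
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step_size * grad_energy(x, target) + math.sqrt(step_size) * noise
        samples.append(x)
    return samples
```

Starting far from the low-energy region (for example `x0=5.0` with `target=2.0`), the early states of the chain are the intermediate samples, and the later states concentrate around the target.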

[180] Burst Image Super-Resolution via Multi-Cross Attention Encoding and Multi-Scan State-Space Decoding

Tengda Huang,Yu Zhang,Tianren Li,Yufu Qu,Fulin Liu,Zhenzhong Wei

Main category: cs.CV

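TL;DR: This paper proposes a new multi-image super-resolution method that improves feature extraction and aggregation through overlapping cross-window attention and cross-frame attention.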

Details

Motivation: Existing super-resolution methods rely on fixed, narrow attention windows, which limit the range of feature perception and degrade image alignment and feature aggregation. Method: A new feature extractor with overlapping cross-window attention and cross-frame attention is proposed, together with a Multi-scan State-Space Module to strengthen feature aggregation. Result: Experiments on synthetic and real-world datasets show better super-resolution performance than existing methods, and ISO 12233 tests further confirm the advantage. Conclusion: By improving attention and feature aggregation, the proposed method markedly raises multi-image super-resolution quality. Abstract: Multi-image super-resolution (MISR) can achieve higher image quality than single-image super-resolution (SISR) by aggregating sub-pixel information from multiple spatially shifted frames. Among MISR tasks, burst super-resolution (BurstSR) has gained significant attention due to its wide range of applications. Recent methods have increasingly adopted Transformers over convolutional neural networks (CNNs) in super-resolution tasks, due to their superior ability to capture both local and global context. However, most existing approaches still rely on fixed and narrow attention windows that restrict the perception of features beyond the local field. This limitation hampers alignment and feature aggregation, both of which are crucial for high-quality super-resolution. To address these limitations, we propose a novel feature extractor that incorporates two newly designed attention mechanisms: overlapping cross-window attention and cross-frame attention, enabling more precise and efficient extraction of sub-pixel information across multiple frames. Furthermore, we introduce a Multi-scan State-Space Module with the cross-frame attention mechanism to enhance feature aggregation. Extensive experiments on both synthetic and real-world benchmarks demonstrate the superiority of our approach. Additional evaluations on ISO 12233 resolution test charts further confirm its enhanced super-resolution performance.

[181] VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models

Bingrui Sima,Linhua Cong,Wenxuan Wang,Kun He

Main category: cs.CV

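TL;DR: This paper studies the safety risks that come with stronger visual reasoning in multimodal large language models (MLRMs) and proposes VisCRA, a new attack framework that exploits visual reasoning chains to bypass safety mechanisms. Experiments show that VisCRA achieves high attack success rates on several mainstream MLRMs.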

Details

Motivation: As the visual reasoning of MLRMs improves, the associated safety risks remain underexplored. This paper aims to reveal the trade-off between visual reasoning capability and model vulnerability. Method: The VisCRA framework combines visual attention masking with a two-stage reasoning induction strategy to precisely control harmful outputs. Result: VisCRA achieves high attack success rates on Gemini 2.0 Flash Thinking (76.48%), QvQ-Max (68.56%), and GPT-4o (56.60%). Conclusion: Visual reasoning is both a core strength of MLRMs and a potential attack vector, so its safety risks deserve attention. Abstract: The emergence of Multimodal Large Language Models (MLRMs) has enabled sophisticated visual reasoning capabilities by integrating reinforcement learning and Chain-of-Thought (CoT) supervision. However, while these enhanced reasoning capabilities improve performance, they also introduce new and underexplored safety risks. In this work, we systematically investigate the security implications of advanced visual reasoning in MLRMs. Our analysis reveals a fundamental trade-off: as visual reasoning improves, models become more vulnerable to jailbreak attacks. Motivated by this critical finding, we introduce VisCRA (Visual Chain Reasoning Attack), a novel jailbreak framework that exploits the visual reasoning chains to bypass safety mechanisms. VisCRA combines targeted visual attention masking with a two-stage reasoning induction strategy to precisely control harmful outputs. Extensive experiments demonstrate VisCRA's significant effectiveness, achieving high attack success rates on leading closed-source MLRMs: 76.48% on Gemini 2.0 Flash Thinking, 68.56% on QvQ-Max, and 56.60% on GPT-4o. Our findings highlight a critical insight: the very capability that empowers MLRMs -- their visual reasoning -- can also serve as an attack vector, posing significant security risks.

[182] DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving

Wenchao Sun,Xuewu Lin,Keyu Chen,Zixiang Pei,Yining Shi,Chuang Zhang,Sifa Zheng

Main category: cs.CV

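TL;DR: DriveCamSim proposes a generalizable camera sensor simulation framework whose Explicit Camera Modeling (ECM) mechanism addresses the limitations of existing methods in multi-view video generation, along with an information-preserving control mechanism that improves controllability and temporal consistency.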

Details

Motivation: Existing generative models for multi-view video generation are restricted to fixed camera viewpoints and video frequency, which limits their flexibility in downstream applications. Method: An Explicit Camera Modeling (ECM) mechanism establishes pixel-wise correspondences across views and frames, and an information-preserving control mechanism strengthens conditional controllability. Result: The model performs well in visual quality, controllability, and generalization across spatial and temporal levels, supporting user-customized camera simulation. Conclusion: DriveCamSim offers a flexible and efficient solution for camera sensor simulation in autonomous driving and related fields. Abstract: Camera sensor simulation plays a critical role in autonomous driving (AD), e.g. evaluating vision-based AD algorithms. While existing approaches have leveraged generative models for controllable image/video generation, they remain constrained to generating multi-view video sequences with fixed camera viewpoints and video frequency, significantly limiting their downstream applications. To address this, we present a generalizable camera simulation framework DriveCamSim, whose core innovation lies in the proposed Explicit Camera Modeling (ECM) mechanism. Instead of implicit interaction through vanilla attention, ECM establishes explicit pixel-wise correspondences across multi-view and multi-frame dimensions, decoupling the model from overfitting to the specific camera configurations (intrinsic/extrinsic parameters, number of views) and temporal sampling rates presented in the training data. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines, proposing an information-preserving control mechanism. This control mechanism not only improves conditional controllability, but also can be extended to be identity-aware to enhance temporal consistency in foreground object rendering. With the above designs, our model demonstrates superior performance in both visual quality and controllability, as well as generalization capability across spatial-level (camera parameters variations) and temporal-level (video frame rate variations), enabling flexible user-customizable camera simulation tailored to diverse application scenarios. Code will be available at https://github.com/swc-17/DriveCamSim for facilitating future research.

[183] Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition

Wen Yin,Yong Wang,Guiduo Duan,Dongyang Zhang,Xin Hu,Yuan-Fang Li,Tao He

Main category: cs.CV

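TL;DR: The paper introduces the Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER) task and addresses the challenges of emotional expression variability and affective distribution shift with the KCDP framework, substantially improving model performance.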

Details

Motivation: Existing visual emotion recognition (VER) research is confined to single domains and lacks cross-domain generalization, so an unsupervised method is needed to generalize from a source domain (e.g., realistic images) to a target domain (e.g., stickers). Method: The KCDP framework uses a VLM to align emotional representations and a diffusion model to enhance visual affective perception; the CLIEA method generates high-quality pseudo-labels for the target domain. Result: Experiments show that KCDP surpasses existing SOTA models in both perceptibility and generalization, e.g., a 12% improvement over TGCA-PVT. Conclusion: KCDP effectively addresses the challenges of cross-domain emotion recognition and offers a new approach to unsupervised cross-domain VER. Abstract: Visual Emotion Recognition (VER) is a critical yet challenging task aimed at inferring emotional states of individuals based on visual cues. However, existing works focus on single domains, e.g., realistic images or stickers, limiting VER models' cross-domain generalizability. To fill this gap, we introduce an Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER) task, which aims to generalize visual emotion recognition from the source domain (e.g., realistic images) to the low-resource target domain (e.g., stickers) in an unsupervised manner. Compared to the conventional unsupervised domain adaptation problems, UCDVER presents two key challenges: a significant emotional expression variability and an affective distribution shift. To mitigate these issues, we propose the Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework. Specifically, KCDP leverages a VLM to align emotional representations in a shared knowledge space and guides diffusion models for improved visual affective perception. Furthermore, a Counterfactual-Enhanced Language-image Emotional Alignment (CLIEA) method generates high-quality pseudo-labels for the target domain. Extensive experiments demonstrate that our model surpasses SOTA models in both perceptibility and generalization, e.g., gaining 12% improvements over the SOTA VER model TGCA-PVT. The project page is at https://yinwen2019.github.io/ucdver.

[184] Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality

Mohamed Amine Kerkouri,Marouane Tliba,Aladine Chetouani,Nour Aburaed,Alessandro Bruno

Main category: cs.CV

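TL;DR: This paper argues that MOS is no longer sufficient as the sole supervisory signal for multimedia quality assessment and must be combined with context-awareness, reasoning, and multimodal capabilities.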

Details

Motivation: MOS reduces complex human judgments to a single scalar, ignoring semantic failures, user intent, and the rationale behind judgments. Method: Proposes models that integrate context-awareness, reasoning, and multimodality, and recommends reforming datasets and evaluation metrics. Result: Richer datasets and new evaluation metrics are intended to improve model robustness and trustworthiness. Conclusion: Reframing quality assessment as a contextual, explainable, and multimodal task would push toward more human-aligned evaluation systems. Abstract: This position paper argues that Mean Opinion Score (MOS), while historically foundational, is no longer sufficient as the sole supervisory signal for multimedia quality assessment models. MOS reduces rich, context-sensitive human judgments to a single scalar, obscuring semantic failures, user intent, and the rationale behind quality decisions. We contend that modern quality assessment models must integrate three interdependent capabilities: (1) context-awareness, to adapt evaluations to task-specific goals and viewing conditions; (2) reasoning, to produce interpretable, evidence-grounded justifications for quality judgments; and (3) multimodality, to align perceptual and semantic cues using vision-language models. We critique the limitations of current MOS-centric benchmarks and propose a roadmap for reform: richer datasets with contextual metadata and expert rationales, and new evaluation metrics that assess semantic alignment, reasoning fidelity, and contextual sensitivity. By reframing quality assessment as a contextual, explainable, and multimodal modeling task, we aim to catalyze a shift toward more robust, human-aligned, and trustworthy evaluation systems.
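The claim that MOS collapses judgments into one scalar can be made concrete with a small arithmetic example. The ratings below are made up for illustration (a 1-to-5 scale is assumed), and `mean_opinion_score` is a hypothetical helper, not something defined in the paper.

```python
from statistics import mean, pstdev


def mean_opinion_score(ratings):
    # MOS is the arithmetic mean of the individual opinion ratings.
    return mean(ratings)


# Hypothetical ratings on a 1-5 scale for two different stimuli.
consistent = [3, 3, 3, 3, 3, 3]  # every viewer finds the quality "fair"
polarized = [1, 1, 1, 5, 5, 5]   # half the viewers reject it, half love it

mos_consistent = mean_opinion_score(consistent)
mos_polarized = mean_opinion_score(polarized)

# The spread differs sharply even though the MOS values are identical.
spread_consistent = pstdev(consistent)
spread_polarized = pstdev(polarized)
```

Both stimuli receive the same MOS, yet the second one hides a strong disagreement among viewers, which is the kind of information a single scalar cannot carry.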

[185] Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

Minheng Ni,Zhengyuan Yang,Linjie Li,Chung-Ching Lin,Kevin Lin,Wangmeng Zuo,Lijuan Wang

Main category: cs.CV

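TL;DR: Point-RFT is a multimodal reasoning framework that improves visual document understanding through visually grounded CoT reasoning, clearly outperforming text-only CoT.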

Details

Motivation: Text-only CoT suffers from visual hallucinations and insufficient multimodal integration in vision-language tasks and needs improvement. Method: A two-stage approach: format finetuning (on a dataset of 71K visual reasoning problems) followed by reinforcement finetuning (targeting visual document understanding). Result: On ChartQA, accuracy rises from 70.88% to 90.04%, surpassing the 83.92% of text-only CoT. Conclusion: Point-RFT is more effective for multimodal reasoning and generalizes well, making it suitable for complex scenarios. Abstract: Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations in text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: First, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. Second, we employ reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT. The result shows that our grounded CoT is more effective for multimodal reasoning compared with the text-only CoT. Moreover, Point-RFT exhibits superior generalization capability across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, TabMWP, etc., and highlights its potential in complex real-world scenarios.

[186] MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval

Rong-Cheng Tu,Zhao Jin,Jingyi Liao,Xiao Luo,Yingjie Wang,Li Shen,Dacheng Tao

Main category: cs.CV

TL;DR: Proposes MVFT-JI, a new method that improves zero-shot composed image retrieval by jointly optimizing two tasks.

Details

Motivation: Existing methods only use adapters to generate pseudo-text tokens and do not directly optimize the composed query representation, which limits retrieval performance. Method: A pretrained multimodal large language model (MLLM) is used to construct two tasks that are jointly optimized to strengthen compositional retrieval. Result: Theoretical and empirical validation indicates markedly better retrieval in scenarios involving complex visual transformations. Conclusion: MVFT-JI combines the VLM's semantic alignment with the MLLM's reasoning to improve composed image retrieval. Abstract: Existing Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens, which are concatenated with the modifying text and processed by frozen text encoders in pretrained VLMs or LLMs. While this design leverages the strengths of large pretrained models, it only supervises the adapter to produce encoder-compatible tokens that loosely preserve visual semantics. Crucially, it does not directly optimize the composed query representation to capture the full intent of the composition or to align with the target semantics, thereby limiting retrieval performance, particularly in cases involving fine-grained or complex visual transformations. To address this problem, we propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI), a novel approach that leverages a pretrained multimodal large language model (MLLM) to construct two complementary training tasks using only unlabeled images: a target text retrieval task and a text-to-image retrieval task. By jointly optimizing these tasks, our method enables the VLM to inherently acquire robust compositional retrieval capabilities, supported by the provided theoretical justifications and empirical validation. Furthermore, during inference, we prompt the MLLM to generate target texts from composed queries and compute retrieval scores by integrating similarities between (i) the composed query and candidate images, and (ii) the MLLM-generated target text and candidate images. This strategy effectively combines the VLM's semantic alignment strengths with the MLLM's reasoning capabilities.
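The joint inference step, which integrates two similarity sources per candidate image, might look like the following. This is a sketch under stated assumptions: the abstract does not say how the two similarities are integrated, so the weighted sum with `alpha`, the use of cosine similarity, and the names `cosine`, `joint_scores`, and `rank` are all hypothetical, and the embeddings are toy vectors rather than VLM outputs.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def joint_scores(query_emb, target_text_emb, image_embs, alpha=0.5):
    """Fuse (i) composed-query-to-image and (ii) target-text-to-image similarity.

    alpha is an assumed mixing weight; the paper's integration rule may differ.
    """
    return [
        alpha * cosine(query_emb, img) + (1 - alpha) * cosine(target_text_emb, img)
        for img in image_embs
    ]


def rank(scores):
    # Candidate indices sorted from best to worst score.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

For a query embedding `[1.0, 0.0]`, a generated-text embedding `[0.0, 1.0]`, and candidates `[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]`, the third candidate ranks first because it agrees with both signals rather than only one.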

[187] Cross-Sequence Semi-Supervised Learning for Multi-Parametric MRI-Based Visual Pathway Delineation

Alou Diakite,Cheng Li,Lei Xie,Yuanjing Feng,Ruoyou Wu,Jianzhong He,Hairong Zheng,Shanshan Wang

Main category: cs.CV

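TL;DR: Proposes a semi-supervised multi-parametric feature decomposition framework for accurate delineation of the visual pathway (VP), addressing the limitations of existing methods in multi-parametric MRI fusion and scarce labeled data.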

Details

Motivation: Accurate delineation of the visual pathway is crucial for understanding the human visual system and diagnosing related disorders, but existing methods fall short in fusing multi-parametric MRI data and depend heavily on labeled data. Method: A correlation-constrained feature decomposition (CFD) handles the complex relationships in multi-parametric MRI data, and a consistency-based sample enhancement (CSE) module exploits unlabeled data. Result: Validated on two public datasets and one in-house multi-shell diffusion MRI dataset, the method outperforms seven state-of-the-art approaches in delineation performance. Conclusion: The proposed framework markedly improves visual pathway delineation under multi-parametric MRI fusion and limited labeled data. Abstract: Accurately delineating the visual pathway (VP) is crucial for understanding the human visual system and diagnosing related disorders. Exploring multi-parametric MR imaging data has been identified as an important way to delineate VP. However, due to the complex cross-sequence relationships, existing methods cannot effectively model the complementary information from different MRI sequences. In addition, these existing methods heavily rely on large training data with labels, which is labor-intensive and time-consuming to obtain. In this work, we propose a novel semi-supervised multi-parametric feature decomposition framework for VP delineation. Specifically, a correlation-constrained feature decomposition (CFD) is designed to handle the complex cross-sequence relationships by capturing the unique characteristics of each MRI sequence and easing the multi-parametric information fusion process. Furthermore, a consistency-based sample enhancement (CSE) module is developed to address the limited labeled data issue, by generating and promoting meaningful edge information from unlabeled data. We validate our framework using two public datasets, and one in-house Multi-Shell Diffusion MRI (MDM) dataset. Experimental results demonstrate the superiority of our approach in terms of delineation performance when compared to seven state-of-the-art approaches.

[188] HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance

Jue Gong,Tingyu Yang,Jingkai Wang,Zheng Chen,Xing Liu,Hong Gu,Yulun Zhang,Xiaokang Yang

Main category: cs.CV

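TL;DR: This paper proposes HAODiff, which designs a degradation pipeline simulating the coexistence of human motion blur (HMB) and generic noise, and uses a triple-branch dual-prompt guidance (DPG) to improve restoration.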

Details

Motivation: Existing research pays too little attention to the coexistence of human motion blur and generic noise, leading to poor restoration results. Method: A degradation pipeline simulates the coexistence of HMB and noise, and HAODiff is trained with triple-branch dual-prompt guidance (DPG). Result: HAODiff outperforms existing methods on synthetic and real-world datasets, including the MPII-Test benchmark. Conclusion: Adaptive dual-prompt guidance lets HAODiff improve the robustness and quality of image restoration. Abstract: Human-centered images often suffer from severe generic degradation during transmission and are prone to human motion blur (HMB), making restoration challenging. Existing research lacks sufficient focus on these issues, as both problems often coexist in practice. To address this, we design a degradation pipeline that simulates the coexistence of HMB and generic noise, generating synthetic degraded data to train our proposed HAODiff, a human-aware one-step diffusion. Specifically, we propose a triple-branch dual-prompt guidance (DPG), which leverages high-quality images, residual noise (LQ minus HQ), and HMB segmentation masks as training targets. It produces a positive-negative prompt pair for classifier-free guidance (CFG) in a single diffusion step. The resulting adaptive dual prompts let HAODiff exploit CFG more effectively, boosting robustness against diverse degradations. For fair evaluation, we introduce MPII-Test, a benchmark rich in combined noise and HMB cases. Extensive experiments show that our HAODiff surpasses existing state-of-the-art (SOTA) methods in terms of both quantitative metrics and visual quality on synthetic and real-world datasets, including our introduced MPII-Test. Code is available at: https://github.com/gobunu/HAODiff.
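The positive-negative prompt pair feeds classifier-free guidance, whose usual combination rule is simple enough to write out. The sketch below shows the standard CFG formula with a negative prompt in place of the unconditional branch; the function name, the flat-list representation of the noise predictions, and the guidance scale are illustrative assumptions, and the abstract does not specify how HAODiff applies the rule inside its single diffusion step.

```python
def classifier_free_guidance(eps_positive, eps_negative, scale):
    """Combine two noise predictions with the standard CFG rule.

    eps_positive -- prediction conditioned on the positive prompt
    eps_negative -- prediction conditioned on the negative prompt
    scale        -- guidance scale; 1.0 returns eps_positive unchanged

    Result: eps_negative + scale * (eps_positive - eps_negative), per element.
    """
    return [n + scale * (p - n) for p, n in zip(eps_positive, eps_negative)]
```

A scale above 1.0 pushes the output further away from what the negative prompt predicts and toward what the positive prompt predicts.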

[189] Improving Heart Rejection Detection in XPCI Images Using Synthetic Data Augmentation

Jakov Samardžija,Donik Vršnak,Sven Lončarić

Main category: cs.CV

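TL;DR: The paper uses StyleGAN to generate synthetic data to address the scarcity of high-grade rejection (3R) samples in heart transplant biopsies; training classifiers on synthetic plus real data markedly improves classification performance.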

Details

Motivation: The scarcity of high-grade rejection (3R) samples makes deep learning models hard to train, so the class imbalance must be addressed to improve classification accuracy. Method: StyleGAN generates synthetic 3R images, which are combined with real 0R samples to train ResNet-18 classifiers; three data configurations are evaluated. Result: Synthetic data improves classification performance; the model combining real and synthetic data performs best, with strong precision and recall. Conclusion: Combining GAN-generated synthetic data with real data shows promise for biomedical image analysis, especially in domains with limited annotated data. Abstract: Accurate identification of acute cellular rejection (ACR) in endomyocardial biopsies is essential for effective management of heart transplant patients. However, the rarity of high-grade rejection cases (3R) presents a significant challenge for training robust deep learning models. This work addresses the class imbalance problem by leveraging synthetic data generation using StyleGAN to augment the limited number of real 3R images. Prior to GAN training, histogram equalization was applied to standardize image appearance and improve the consistency of tissue representation. StyleGAN was trained on available 3R biopsy patches and subsequently used to generate 10,000 realistic synthetic images. These were combined with real 0R samples, that is, samples without rejection, in various configurations to train ResNet-18 classifiers for binary rejection classification. Three classifier variants were evaluated: one trained on real 0R and synthetic 3R images, another using both synthetic and additional real samples, and a third trained solely on real data. All models were tested on an independent set of real biopsy images. Results demonstrate that synthetic data improves classification performance, particularly when used in combination with real samples. The highest-performing model, which used both real and synthetic images, achieved strong precision and recall for both classes. These findings underscore the value of hybrid training strategies and highlight the potential of GAN-based data augmentation in biomedical image analysis, especially in domains constrained by limited annotated datasets.
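Histogram equalization, the preprocessing step named in the abstract, can be sketched on a flat list of 8-bit grayscale values. This is a minimal illustration of the textbook CDF-based mapping; the function name, the integer rounding rule, and the flat-list input are assumptions, and the authors' actual implementation is not described.

```python
def equalize_histogram(pixels, levels=256):
    """Map grayscale values so their cumulative distribution becomes roughly uniform.

    pixels -- flat list of integers in [0, levels - 1]
    """
    n = len(pixels)
    if n == 0:
        return []
    # Histogram of intensity values.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution function (as counts).
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    denom = n - cdf_min
    if denom == 0:
        # Constant image: nothing to stretch.
        return list(pixels)
    top = levels - 1
    # Integer arithmetic with round-half-up keeps the result deterministic.
    return [((cdf[p] - cdf_min) * top + denom // 2) // denom for p in pixels]
```

The darkest value present maps to 0 and the brightest to 255, so patches scanned with different exposure end up on a comparable intensity scale before GAN training.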

[190] SuperAD: A Training-free Anomaly Classification and Segmentation Method for CVPR 2025 VAND 3.0 Workshop Challenge Track 1: Adapt & Detect

Huaiyuan Zhang,Hang Chen,Yu Cheng,Shunyi Wu,Linghao Sun,Linao Han,Zeyu Shi,Lei Qi

Main category: cs.CV

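TL;DR: Proposes SuperAD, a training-free anomaly detection method based on the DINOv2 model for complex anomaly detection in industrial scenarios, with competitive results on the MVTec AD 2 dataset.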

Details

Motivation: In industrial anomaly detection, identifying anomalies with complex physical properties such as transparent or reflective surfaces is crucial; the MVTec AD 2 dataset is closer to real-world conditions, but lighting variation and large scale differences pose challenges. Method: DINOv2 extracts features, a memory bank is built from a small number of normal samples, and anomalies are segmented by nearest-neighbor matching. Result: Competitive results on the MVTec AD 2 test sets. Conclusion: SuperAD requires no training and suits complex industrial scenarios. Abstract: In this technical report, we present our solution to the CVPR 2025 Visual Anomaly and Novelty Detection (VAND) 3.0 Workshop Challenge Track 1: Adapt & Detect: Robust Anomaly Detection in Real-World Applications. In real-world industrial anomaly detection, it is crucial to accurately identify anomalies with physical complexity, such as transparent or reflective surfaces, occlusions, and low-contrast contaminations. The recently proposed MVTec AD 2 dataset significantly narrows the gap between publicly available benchmarks and anomalies found in real-world industrial environments. To address the challenges posed by this dataset--such as complex and varying lighting conditions and real anomalies with large scale differences--we propose SuperAD, a fully training-free anomaly detection and segmentation method based on feature extraction using the DINOv2 model. Our method carefully selects a small number of normal reference images and constructs a memory bank by leveraging the strong representational power of DINOv2. Anomalies are then segmented by performing nearest neighbor matching between test image features and the memory bank. Our method achieves competitive results on both test sets of the MVTec AD 2 dataset.
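The memory-bank matching step can be sketched in a few lines. This is a hedged illustration, not the SuperAD implementation: the feature vectors below stand in for DINOv2 patch features (extraction is not shown), Euclidean distance is an assumed metric, and `nearest_neighbor_scores` is a hypothetical name.

```python
import math


def nearest_neighbor_scores(test_features, memory_bank):
    """Score each test feature by its distance to the closest normal feature.

    test_features -- list of feature vectors from the test image (one per patch)
    memory_bank   -- list of feature vectors collected from normal reference images
    """
    scores = []
    for feature in test_features:
        # A patch that resembles some normal patch gets a low score;
        # a patch unlike anything in the bank gets a high score.
        scores.append(min(math.dist(feature, normal) for normal in memory_bank))
    return scores
```

Reshaping the per-patch scores back onto the image grid and thresholding them would yield a segmentation map; the threshold choice is left open here because the abstract does not describe it.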

[191] SAIL: Self-supervised Albedo Estimation from Real Images with a Latent Diffusion Model

Hala Djeghim,Nathan Piasco,Luis Roldão,Moussab Bennehar,Dzmitry Tsishkou,Céline Loscos,Désiré Sidibé

Main category: cs.CV

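TL;DR: SAIL estimates albedo-like representations from single-view real-world images, using a latent diffusion model's prior for unconditioned scene relighting together with decomposition and regularization in the latent space to achieve stable albedo prediction.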

Details

Motivation: Albedo decomposition of real-world images is challenging because labeled data is scarce and existing methods generalize poorly. Method: Drawing on a latent diffusion model's prior for unconditioned scene relighting, the paper proposes an albedo decomposition formulated in the latent space and introduces regularization terms that constrain the lighting-dependent and lighting-independent components. Result: SAIL predicts stable albedo under varying lighting conditions and generalizes to multiple scenes, using only unlabeled multi-illumination data. Conclusion: SAIL offers an effective and generalizable solution for albedo decomposition of real-world images. Abstract: Intrinsic image decomposition aims at separating an image into its underlying albedo and shading components, isolating the base color from lighting effects to enable downstream applications such as virtual relighting and scene editing. Despite the rise and success of learning-based approaches, intrinsic image decomposition from real-world images remains a significantly challenging task due to the scarcity of labeled ground-truth data. Most existing solutions rely on synthetic data as supervised setups, limiting their ability to generalize to real-world scenes. Self-supervised methods, on the other hand, often produce albedo maps that contain reflections and lack consistency under different lighting conditions. To address this, we propose SAIL, an approach designed to estimate albedo-like representations from single-view real-world images. We repurpose the prior knowledge of a latent diffusion model for unconditioned scene relighting as a surrogate objective for albedo estimation. To extract the albedo, we introduce a novel intrinsic image decomposition fully formulated in the latent space. To guide the training of our latent diffusion model, we introduce regularization terms that constrain both the lighting-dependent and independent components of our latent image decomposition. SAIL predicts stable albedo under varying lighting conditions and generalizes to multiple scenes, using only unlabeled multi-illumination data available online.

[192] Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction

Li Fang,Hao Zhu,Longlong Chen,Fei Hu,Long Ye,Zhan Ma

Main category: cs.CV

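TL;DR: Proposes a depth-guided bundle sampling strategy that groups adjacent rays and dynamically allocates sample points, significantly improving the efficiency and quality of high-resolution image rendering.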

Details

Motivation: Natural scenes are typically piecewise smooth, so densely sampling every ray is redundant; a more efficient sampling method is needed. Method: A depth-guided bundle sampling strategy groups adjacent rays and generates a shared representation for them, while sample points are allocated dynamically according to depth confidence. Result: Applied to ENeRF on the DTU dataset, the method improves PSNR by up to 1.27 dB and FPS by 47%; across synthetic and real-world datasets it renders up to 2x faster than existing generalizable methods. Conclusion: The method achieves state-of-the-art rendering quality and efficiency on both synthetic and real-world scenes. Abstract: Recent advancements in generalizable novel view synthesis have achieved impressive quality through interpolation between nearby views. However, rendering high-resolution images remains computationally intensive due to the need for dense sampling of all rays. Recognizing that natural scenes are typically piecewise smooth and sampling all rays is often redundant, we propose a novel depth-guided bundle sampling strategy to accelerate rendering. By grouping adjacent rays into a bundle and sampling them collectively, a shared representation is generated for decoding all rays within the bundle. To further optimize efficiency, our adaptive sampling strategy dynamically allocates samples based on depth confidence, concentrating more samples in complex regions while reducing them in smoother areas. When applied to ENeRF, our method achieves up to a 1.27 dB PSNR improvement and a 47% increase in FPS on the DTU dataset. Extensive experiments on synthetic and real-world datasets demonstrate state-of-the-art rendering quality and up to 2x faster rendering compared to existing generalizable methods. Code is available at https://github.com/KLMAV-CUC/GDB-NeRF.
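
The bundling and confidence-driven allocation described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`bundle_of`, `allocate_samples`), the square bundle layout, the linear mapping from confidence to sample count, and the bounds `n_min` and `n_max` are all assumptions chosen for the example.

```python
def bundle_of(x, y, bundle_size):
    """Return the (column, row) index of the square bundle containing pixel (x, y).

    Assumption: bundles are non-overlapping bundle_size x bundle_size tiles.
    """
    return (x // bundle_size, y // bundle_size)


def allocate_samples(confidences, n_min=2, n_max=8):
    """Map each bundle's depth confidence in [0, 1] to a sample count.

    High confidence (smooth region) gets few samples; low confidence
    (complex region) gets many. The linear rule is an illustrative assumption.
    """
    counts = []
    for c in confidences:
        c = min(max(c, 0.0), 1.0)  # clamp out-of-range confidences
        extra = (1.0 - c) * (n_max - n_min)
        counts.append(n_min + int(extra + 0.5))  # round half up
    return counts


if __name__ == "__main__":
    print(bundle_of(5, 2, 4))
    print(allocate_samples([1.0, 0.0, 0.5]))
```

With this rule the total sample budget shrinks as more of the image becomes confidently smooth, which is the efficiency argument the abstract makes.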

[193] The Missing Point in Vision Transformers for Universal Image Segmentation

Sajjad Shahabodini,Mobina Mansoori,Farnoush Bayatmakou,Jamshid Abouei,Konstantinos N. Plataniotis,Arash Mohammadi

Main category: cs.CV

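TL;DR: ViT-P is a two-stage segmentation framework that decouples mask generation from classification and improves performance with a point-based classification model. It serves as a pre-training-free adapter that supports various pre-trained ViTs, and it reaches state-of-the-art results on multiple datasets.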

Details

Motivation: Address the difficulty of mask classification in image segmentation, especially with ambiguous boundaries and class imbalance. Method: A two-stage framework: (1) generate class-agnostic mask proposals; (2) refine predictions with a ViT-based point classification model. Result: Strong results on COCO, ADE20K, and Cityscapes, e.g., 54.0 PQ on ADE20K panoptic segmentation. Conclusion: ViT-P is efficient and flexible, reduces annotation costs, and delivers strong performance. Abstract: Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P.

[194] A Regularization-Guided Equivariant Approach for Image Restoration

Yulu Bai,Jiahong Fu,Qi Xie,Deyu Meng

Main category: cs.CV

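TL;DR: Proposes EQ-Reg, a rotation-equivariant regularization strategy that improves representation accuracy and adaptability to symmetry in image restoration tasks.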

Details

Motivation: Existing equivariant and invariant deep learning models exploit data symmetries but suffer from limited representation accuracy and strict symmetry assumptions, which falls short of the demands of image restoration. Method: The EQ-Reg regularizer is built from self-supervised learning together with spatial rotation and cyclic channel shift of feature maps, and it adjusts equivariance adaptively. Result: On three low-level tasks, the method shows higher accuracy and better generalization than existing approaches. Conclusion: EQ-Reg gives non-strictly equivariant networks a simple and effective adaptive mechanism suited to image restoration. Abstract: Equivariant and invariant deep learning models have been developed to exploit intrinsic symmetries in data, demonstrating significant effectiveness in certain scenarios. However, these methods often suffer from limited representation accuracy and rely on strict symmetry assumptions that may not hold in practice. These limitations pose a significant drawback for image restoration tasks, which demand high accuracy and precise symmetry representation. To address these challenges, we propose a rotation-equivariant regularization strategy that adaptively enforces the appropriate symmetry constraints on the data while preserving the network's representational accuracy. Specifically, we introduce EQ-Reg, a regularizer designed to enhance rotation equivariance, which innovatively extends the insights of data-augmentation-based and equivariant-based methodologies. This is achieved through self-supervised learning and the spatial rotation and cyclic channel shift of feature maps deduced in the equivariant framework. Our approach is the first to enable a non-strictly equivariant network suitable for image restoration, providing a simple and adaptive mechanism for adjusting equivariance based on the task. Extensive experiments across three low-level tasks demonstrate the superior accuracy and generalization capability of our method, outperforming state-of-the-art approaches.
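
The idea of penalizing deviation from rotation equivariance, using a spatial rotation paired with a cyclic channel shift, can be sketched in plain Python. This is a generic equivariance-error penalty, not the authors' EQ-Reg: the function names, the two toy "networks", the four-channel layout, and the pairing of a 90-degree rotation with a one-step channel shift are assumptions made for illustration.

```python
def rot90(grid):
    """Rotate a 2D list 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def cyclic_shift(channels, k=1):
    """Cyclically shift a list of channels by k positions."""
    k %= len(channels)
    return channels[-k:] + channels[:-k]


def equivariance_penalty(f, x):
    """Mean squared gap between f(rot(x)) and shift(rot(f(x))).

    f maps a 2D list to a list of 2D channels. A value of 0 means f is
    equivariant for this input under the assumed rotate-and-shift action.
    """
    lhs = f(rot90(x))
    rhs = cyclic_shift([rot90(c) for c in f(x)])
    total, count = 0.0, 0
    for a, b in zip(lhs, rhs):
        for row_a, row_b in zip(a, b):
            for va, vb in zip(row_a, row_b):
                total += (va - vb) ** 2
                count += 1
    return total / count


def toy_equivariant(x):
    """Hypothetical network: four identical pointwise channels."""
    return [x, x, x, x]


def toy_not_equivariant(x):
    """Hypothetical network: channel k scales the input by (k + 1)."""
    return [[[(k + 1) * v for v in row] for row in x] for k in range(4)]


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    print(equivariance_penalty(toy_equivariant, x))
    print(equivariance_penalty(toy_not_equivariant, x))
```

In a training loop, a penalty of this kind would typically be added to the task loss with a weight, so that equivariance is encouraged softly instead of being built into the architecture.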

[195] Translation-Equivariance of Normalization Layers and Aliasing in Convolutional Neural Networks

Jérémy Scanvic,Quentin Barthélemy,Julián Tachella

Main category: cs.CV

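TL;DR: The paper presents a new theoretical framework for understanding the equivariance of normalization layers to discrete shifts and continuous translations, and validates the theory experimentally.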

Details

Motivation: Designing convolutional architectures that are exactly equivariant to continuous translations is an active research area, but the equivariance of normalization layers has received little study. This work aims to fill that gap. Method: A theoretical framework is proposed to analyze the conditions under which normalization layers are equivariant, and it is tested on feature maps from ResNet-18 and ImageNet. Result: The empirical results are consistent with the theoretical predictions about how equivariance depends on the dimensions a normalization layer operates on. Conclusion: The study provides theoretical support for the equivariance of normalization layers, which can help improve physical accuracy in scientific computing. Abstract: The design of convolutional neural architectures that are exactly equivariant to continuous translations is an active field of research. It promises to benefit scientific computing, notably by making existing imaging systems more physically accurate. Most efforts focus on the design of downsampling/pooling layers, upsampling layers and activation functions, but little attention is dedicated to normalization layers. In this work, we present a novel theoretical framework for understanding the equivariance of normalization layers to discrete shifts and continuous translations. We also determine necessary and sufficient conditions for normalization layers to be equivariant in terms of the dimensions they operate on. Using real feature maps from ResNet-18 and ImageNet, we test those theoretical results empirically and find that they are consistent with our predictions.
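
A small numerical check shows what "equivariant to discrete shifts" means for a normalization layer. The sketch below is not taken from the paper: it uses a 1D signal with circular shifts, and both layers (`normalize_over_space`, `normalize_first_half`) are hypothetical examples chosen to contrast statistics computed over all positions with statistics computed over a fixed window.

```python
import math


def circular_shift(x, s):
    """Circularly shift a 1D list to the right by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s]


def normalize_over_space(x, eps=1e-5):
    """Normalize with mean and variance taken over all positions."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def normalize_first_half(x, eps=1e-5):
    """Hypothetical layer: statistics come from the first half only."""
    window = x[: len(x) // 2]
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def equivariance_gap(layer, x, s):
    """Largest absolute difference between layer(shift(x)) and shift(layer(x))."""
    a = layer(circular_shift(x, s))
    b = circular_shift(layer(x), s)
    return max(abs(u - v) for u, v in zip(a, b))


if __name__ == "__main__":
    x = [1.0, 2.0, 4.0, 7.0]
    print(equivariance_gap(normalize_over_space, x, 1))
    print(equivariance_gap(normalize_first_half, x, 1))
```

The first layer commutes with circular shifts because its statistics are unchanged by any permutation of positions; the second does not, because the window sees different values after the shift.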

Zehong Ma,Shiliang Zhang,Longhui Wei,Qi Tian

Main category: cs.CV

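TL;DR: EMLoC is a training-free method that embeds demonstration examples into the model input for efficient, flexible multi-modal task adaptation. Chunk-wise compression combined with layer-wise adaptive pruning substantially reduces computational overhead.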

Details

Motivation: Adapting multi-modal large language models (MLLMs) to new tasks traditionally relies on fine-tuning, which is costly in computation and memory. EMLoC aims to provide a more efficient and flexible alternative. Method: EMLoC compresses long-context inputs via chunk-wise compression and layer-wise adaptive pruning (under a Jensen-Shannon divergence constraint), producing compact task-specific representations. Result: On diverse vision-language benchmarks, EMLoC performs on par with or better than naive long-context approaches while significantly reducing inference complexity. Conclusion: EMLoC offers an efficient, flexible adaptation framework for multi-modal models in resource-constrained environments, with practical potential. Abstract: Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. Codes are publicly available at https://github.com/Zehong-Ma/EMLoC.
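
The Jensen-Shannon divergence constraint mentioned in this abstract can be made concrete with a short sketch. The divergence itself is the standard definition; the acceptance rule `pruning_allowed` and its threshold `delta` are hypothetical, since the digest does not say how the authors apply the constraint.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def pruning_allowed(full_probs, pruned_probs, delta):
    """Hypothetical rule: accept a pruned context only if the output
    distribution stays within delta of the unpruned one."""
    return js_divergence(full_probs, pruned_probs) <= delta


if __name__ == "__main__":
    full = [0.7, 0.2, 0.1]
    pruned = [0.65, 0.25, 0.1]
    print(js_divergence(full, pruned))
    print(pruning_allowed(full, pruned, delta=0.01))
```

Because the divergence is bounded and symmetric, a single threshold is easy to interpret across layers, which may be one reason to prefer it over plain KL divergence for a constraint of this kind.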

[197] GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis

You Wang,Li Fang,Hao Zhu,Fei Hu,Long Ye,Zhan Ma

Main category: cs.CV

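TL;DR: GoLF-NRT is a neural rendering transformer based on global and local feature fusion; it uses a 3D transformer and local geometric features to improve generalizable neural rendering from few input views.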

Details

Motivation: Address the sharp drop in rendering quality of generalizable NeRF models when input views are limited. Method: Combine global scene context captured by a 3D transformer with local geometric feature extraction, and introduce an adaptive sampling strategy based on attention weights and kernel regression. Result: On public datasets, GoLF-NRT achieves state-of-the-art rendering performance with 1 to 3 input views. Conclusion: GoLF-NRT is effective and superior in few-view settings, and the code is open source. Abstract: Neural Radiance Fields (NeRF) have transformed novel view synthesis by modeling scene-specific volumetric representations directly from images. While generalizable NeRF models can generate novel views across unknown scenes by learning latent ray representations, their performance heavily depends on a large number of multi-view observations. However, with limited input views, these methods experience significant degradation in rendering quality. To address this limitation, we propose GoLF-NRT: a Global and Local feature Fusion-based Neural Rendering Transformer. GoLF-NRT enhances generalizable neural rendering from few input views by leveraging a 3D transformer with efficient sparse attention to capture global scene context. In parallel, it integrates local geometric features extracted along the epipolar line, enabling high-quality scene reconstruction from as few as 1 to 3 input views. Furthermore, we introduce an adaptive sampling strategy based on attention weights and kernel regression, improving the accuracy of transformer-based neural rendering. Extensive experiments on public datasets show that GoLF-NRT achieves state-of-the-art performance across varying numbers of input views, highlighting the effectiveness and superiority of our approach. Code is available at https://github.com/KLMAV-CUC/GoLF-NRT.

[198] Zero-Shot Pseudo Labels Generation Using SAM and CLIP for Semi-Supervised Semantic Segmentation

Nagito Saito,Shintaro Ito,Koichi Ito,Takafumi Aoki

Main category: cs.CV

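TL;DR: Proposes generating pseudo labels by zero-shot annotation with SAM and CLIP, refining them with UniMatch, and using them to train a semantic segmentation model; experiments confirm the method's effectiveness.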

Details

Motivation: Reduce the high annotation cost of semantic segmentation by using semi-supervised learning to lessen dependence on labeled data. Method: Generate pseudo labels with SAM and CLIP, improve their quality with UniMatch, and use them to train a semantic segmentation model. Result: The method's effectiveness is demonstrated on the PASCAL and MS COCO datasets. Conclusion: The proposed method improves semantic segmentation performance while reducing dependence on labeled data. Abstract: Semantic segmentation is a fundamental task in medical image analysis and autonomous driving, but it suffers from the high cost of annotating the labels required for training. To address this problem, semantic segmentation methods based on semi-supervised learning with a small amount of labeled data have been proposed. For example, one approach is to train a semantic segmentation model using images with annotated labels and pseudo labels. In this approach, the accuracy of the semantic segmentation model depends on the quality of the pseudo labels, and the quality of the pseudo labels depends on the performance of the model to be trained and the amount of data with annotated labels. In this paper, we generate pseudo labels using zero-shot annotation with the Segment Anything Model (SAM) and Contrastive Language-Image Pretraining (CLIP), improve the accuracy of the pseudo labels using the Unified Dual-Stream Perturbations Approach (UniMatch), and use them as enhanced labels to train a semantic segmentation model. The effectiveness of the proposed method is demonstrated through experiments on the public datasets PASCAL and MS COCO.
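
One plausible way to turn class-agnostic masks into pseudo labels is to compare an image embedding of each masked region with text embeddings of the class names and keep the best match. The sketch below shows only that matching step in plain Python. It does not call SAM or CLIP: the embeddings are assumed to be precomputed, and `label_masks` and its inputs are hypothetical names, not the authors' pipeline.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_masks(mask_embeddings, text_embeddings, class_names):
    """Assign each mask the class whose text embedding is most similar.

    mask_embeddings: one vector per mask (assumed to come from an image encoder).
    text_embeddings: one vector per class (assumed to come from a text encoder).
    """
    labels = []
    for emb in mask_embeddings:
        scores = [cosine(emb, t) for t in text_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(class_names[best])
    return labels


if __name__ == "__main__":
    masks = [[1.0, 0.1], [0.0, 2.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print(label_masks(masks, texts, ["cat", "dog"]))
```

Labels produced this way are noisy, which is consistent with the abstract's choice to refine them with a semi-supervised method before training the final model.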

Miaoyu Li,Qin Chao,Boyang Li

Main category: cs.CV

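TL;DR: Proposes Causal2Needles, a long-context video understanding benchmark that evaluates two abilities insufficiently tested by existing benchmarks: extracting information from separate locations in a long video and understanding it jointly, and modeling cause and effect in human behavior.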

Details

Motivation: Existing benchmarks do not adequately evaluate the ability of Video-Language Models (VLMs) to extract information and model causality in long-context videos. Method: The benchmark introduces 2-needle questions that require extracting information from causally related events in a long video and the associated narration text, posed in two complementary formats to avoid textual bias. Result: Models that excel on existing benchmarks perform poorly on 2-needle visual grounding, and performance is negatively correlated with the distance between the two events. Conclusion: Current VLMs have significant limitations in long-context video understanding and causal modeling. Abstract: Evaluating the video understanding capabilities of Video-Language Models (VLMs) remains a significant challenge. We propose a long-context video understanding benchmark, Causal2Needles, that assesses two crucial abilities insufficiently evaluated by existing benchmarks: (1) the ability to extract information from two separate locations in a long video and understand them jointly, and (2) the ability to model the world in terms of cause and effect in human behaviors. Specifically, Causal2Needles introduces 2-needle questions, which require extracting information from both the cause and effect human-behavior events in a long video and the associated narration text. To prevent textual bias, these questions comprise two complementary formats: one asking to identify the video clip containing the answer, and one asking for the textual description of an unrelated visual detail from that video clip. Our experiments reveal that models excelling in pre-existing benchmarks struggle with 2-needle visual grounding, and the model performance is negatively correlated with the distance between the two needles. These findings highlight critical limitations in current VLMs.

[200] Sparse2DGS: Sparse-View Surface Reconstruction using 2D Gaussian Splatting with Dense Point Cloud

Natsuki Takama,Shintaro Ito,Koichi Ito,Hwann-Tzong Chen,Takafumi Aoki

Main category: cs.CV

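TL;DR: Sparse2DGS is an improved 3D reconstruction method that combines DUSt3R and COLMAP MVS to generate dense point clouds, addressing the accuracy drop of Gaussian Splatting when input images are limited.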

Details

Motivation: Gaussian Splatting performs well with many multi-view images, but reconstruction accuracy drops sharply with few input images because the sparse point cloud provides a poor initialization. Method: Sparse2DGS uses DUSt3R and COLMAP MVS to generate accurate, dense point clouds that initialize the 2D Gaussians. Result: Experiments on the DTU dataset show that the 3D shapes of objects can be reconstructed accurately from only three images. Conclusion: Sparse2DGS significantly improves 3D reconstruction accuracy with limited input images. Abstract: Gaussian Splatting (GS) has gained attention as a fast and effective method for novel view synthesis. It has also been applied to 3D reconstruction using multi-view images and can achieve fast and accurate 3D reconstruction. However, GS assumes that the input contains a large number of multi-view images, and therefore, the reconstruction accuracy significantly decreases when only a limited number of input images are available. One of the main reasons is the insufficient number of 3D points in the sparse point cloud obtained through Structure from Motion (SfM), which results in a poor initialization for optimizing the Gaussian primitives. We propose a new 3D reconstruction method, called Sparse2DGS, to enhance 2DGS in reconstructing objects using only three images. Sparse2DGS employs DUSt3R, a foundation model for stereo images, along with COLMAP MVS to generate highly accurate and dense 3D point clouds, which are then used to initialize 2D Gaussians. Through experiments on the DTU dataset, we show that Sparse2DGS can accurately reconstruct the 3D shapes of objects using just three images.

[201] A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking

Zixiang Zhao,Haowen Bai,Bingxin Ke,Yukun Cui,Lilun Deng,Yulun Zhang,Kai Zhang,Konrad Schindler

Main category: cs.CV

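TL;DR: UniVF is a video fusion framework based on multi-frame learning and optical flow-based feature warping that addresses the neglect of temporal correlations in conventional methods; the paper also introduces VF-Bench, the first comprehensive video fusion benchmark.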

Details

Motivation: The real world is dynamic, yet existing image fusion methods process static frames independently, causing flickering and temporal inconsistency in fused videos. Method: UniVF uses multi-frame learning and optical flow-based feature warping to achieve temporally coherent video fusion. Result: UniVF achieves state-of-the-art performance on all VF-Bench tasks. Conclusion: UniVF provides an efficient and temporally consistent solution for video fusion, and VF-Bench offers a standardized evaluation platform for future research. Abstract: The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel framework that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: https://vfbench.github.io.

[202] FruitNeRF++: A Generalized Multi-Fruit Counting Method Utilizing Contrastive Learning and Neural Radiance Fields

Lukas Meyer,Andrei-Timotei Ardelean,Tim Weyrich,Marc Stamminger

Main category: cs.CV

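TL;DR: FruitNeRF++ is a new fruit-counting method that combines contrastive learning with neural radiance fields; its shape-agnostic multi-fruit counting framework makes it more practical.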

Details

Motivation: FruitNeRF requires separate adaptation for each fruit type, which limits its practical use. FruitNeRF++ aims to remove this limitation. Method: Instance masks predicted by a vision foundation model are used to encode each fruit's identity as instance embeddings in a neural instance field; a point cloud is extracted by volumetric sampling and clustered to obtain the count. Result: On a synthetic dataset and a real-world apple dataset, FruitNeRF++ is easier to control and compares favorably to existing methods. Conclusion: The shape-agnostic design of FruitNeRF++ improves the generality and practicality of fruit counting. Abstract: We introduce FruitNeRF++, a novel fruit-counting approach that combines contrastive learning with neural radiance fields to count fruits from unstructured input photographs of orchards. Our work is based on FruitNeRF, which employs a neural semantic field combined with a fruit-specific clustering approach. The requirement for adaptation for each fruit type limits the applicability of the method and makes it difficult to use in practice. To lift this limitation, we design a shape-agnostic multi-fruit counting framework that complements the RGB and semantic data with instance masks predicted by a vision foundation model. The masks are used to encode the identity of each fruit as instance embeddings into a neural instance field. By volumetrically sampling the neural fields, we extract a point cloud embedded with the instance features, which can be clustered in a fruit-agnostic manner to obtain the fruit count. We evaluate our approach using a synthetic dataset containing apples, plums, lemons, pears, peaches, and mangoes, as well as a real-world benchmark apple dataset. Our results demonstrate that FruitNeRF++ is easier to control and compares favorably to other state-of-the-art methods.

[203] Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling

Junhong Lee,Seungwook Kim,Minsu Cho

Main category: cs.CV

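TL;DR: This paper studies training-free techniques such as CFG and FreeU in Score Distillation Sampling (SDS), reveals the trade-offs they introduce in text-to-3D generation, and proposes a dynamic scaling strategy to improve results.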

Details

Motivation: Training-free techniques such as CFG and FreeU work well in text-to-2D generation, but their use in SDS has been underexplored. The paper aims to fill this gap and examine their effect on text-to-3D generation. Method: The authors experimentally analyze how different scales of CFG and FreeU in SDS affect the output, and propose scaling these techniques dynamically with respect to the timestep or optimization iteration. Result: Varying the CFG scale trades off object size against surface smoothness, while varying the FreeU scale trades off texture details against geometric errors. Dynamic scaling balances texture details and surface smoothness while preserving output size and mitigating geometric defects. Conclusion: The paper shows that training-free techniques are effective in SDS and proposes a dynamic scaling strategy that yields better text-to-3D results. Abstract: Recent studies show that simple training-free techniques, such as Classifier-Free Guidance (CFG) and FreeU, can dramatically improve the quality of text-to-2D generation outputs. However, these training-free techniques have been underexplored through the lens of Score Distillation Sampling (SDS), which is a popular and effective technique to leverage the power of pretrained text-to-2D diffusion models for various tasks. In this paper, we aim to shed light on the effect such training-free techniques have on SDS, via a particular application of text-to-3D generation via 2D lifting. We present our findings, which show that varying the scales of CFG presents a trade-off between object size and surface smoothness, while varying the scales of FreeU presents a trade-off between texture details and geometric errors. Based on these findings, we provide insights into how we can effectively harness training-free techniques for SDS, via a strategic scaling of such techniques in a dynamic manner with respect to the timestep or optimization iteration step. We show that using our proposed scheme strikes a favorable balance between texture details and surface smoothness in text-to-3D generations, while preserving the size of the output and mitigating the occurrence of geometric defects.
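
Classifier-Free Guidance combines an unconditional and a conditional noise prediction with a scale factor, and "dynamic scaling" means letting that factor depend on the step. The sketch below shows the standard CFG combination together with a linear schedule. The schedule shape, its endpoints, and the name `linear_scale` are assumptions for illustration; the digest does not specify which schedule the authors use or in which direction the scale should move.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard CFG: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def linear_scale(step, total_steps, start, end):
    """Hypothetical schedule: move the guidance scale linearly from
    `start` at the first step to `end` at the last step."""
    t = step / max(total_steps - 1, 1)
    return start + t * (end - start)


if __name__ == "__main__":
    eps_uncond = [0.0, 1.0]
    eps_cond = [1.0, 3.0]
    for step in range(5):
        scale = linear_scale(step, 5, start=10.0, end=2.0)
        print(step, scale, cfg_combine(eps_uncond, eps_cond, scale))
```

A scale of 1 reproduces the conditional prediction, and larger values push further along the conditional direction, which is why the choice of scale over time can shift the trade-offs the abstract reports.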

[204] Deep Spectral Prior

Yanqi Cheng,Tieyong Zeng,Pietro Lio,Carola-Bibiane Schönlieb,Angelica I Aviles-Rivero

Main category: cs.CV

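TL;DR: Deep Spectral Prior (DSP) is a reformulation of Deep Image Prior (DIP) that reconstructs images through frequency-domain alignment, avoiding the overfitting of conventional DIP and improving reconstruction quality.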

Details

Motivation: Conventional DIP relies on a pixel-wise loss and early stopping to avoid overfitting, with limited success. DSP instead matches Fourier coefficients directly, exploiting the known frequency structure of images and the spectral bias of CNNs. Method: DSP recasts image reconstruction as a frequency-domain alignment problem; matching Fourier coefficients provides implicit spectral regularization without early stopping. Result: DSP outperforms classical DIP and other unsupervised baselines on denoising, inpainting, and super-resolution, and the theoretical analysis supports its convergence, stability, and bias-variance trade-off. Conclusion: DSP offers a more robust and interpretable approach to image reconstruction, improving performance through frequency-domain alignment and implicit regularization. Abstract: We introduce Deep Spectral Prior (DSP), a new formulation of Deep Image Prior (DIP) that redefines image reconstruction as a frequency-domain alignment problem. Unlike traditional DIP, which relies on pixel-wise loss and early stopping to mitigate overfitting, DSP directly matches Fourier coefficients between the network output and observed measurements. This shift introduces an explicit inductive bias towards spectral coherence, aligning with the known frequency structure of images and the spectral bias of convolutional neural networks. We provide a rigorous theoretical framework demonstrating that DSP acts as an implicit spectral regulariser, suppressing high-frequency noise by design and eliminating the need for early stopping. Our analysis spans four core dimensions establishing smooth convergence dynamics, local stability, and favourable bias-variance tradeoffs. We further show that DSP naturally projects reconstructions onto a frequency-consistent manifold, enhancing interpretability and robustness. These theoretical guarantees are supported by empirical results across denoising, inpainting, and super-resolution tasks, where DSP consistently outperforms classical DIP and other unsupervised baselines.

[205] StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation

Yi Wu,Lingting Zhu,Shengju Qian,Lei Liu,Wandi Qiao,Lequan Yu,Bin Li

Main category: cs.CV

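TL;DR: The paper proposes StyleAR, which combines a data curation method with an autoregressive model to perform style-aligned text-to-image generation using text-image binary data.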

Details

Motivation: Address the difficulty of acquiring data for style-aligned text-to-image generation. Method: Design a data curation method, combine a CLIP image encoder with a style-enhanced token technique, and mix raw images with stylized images. Result: Experiments show that StyleAR performs strongly on style-aligned generation. Conclusion: StyleAR effectively addresses the data acquisition problem and improves the quality and consistency of style-aligned generation. Abstract: In the current research landscape, multimodal autoregressive (AR) models have shown exceptional capabilities across various domains, including visual understanding and generation. However, complex tasks such as style-aligned text-to-image generation present significant challenges, particularly in data acquisition. In analogy to instruction-following tuning for image editing of AR models, style-aligned generation requires a reference style image and prompt, resulting in a text-image-to-image triplet where the output shares the style and semantics of the input. However, acquiring large volumes of such triplet data with specific styles is considerably more challenging than obtaining conventional text-to-image data used for training generative models. To address this issue, we propose StyleAR, an innovative approach that combines a specially designed data curation method with our proposed AR models to effectively utilize text-to-image binary data for style-aligned text-to-image generation. Our method synthesizes target stylized data using a reference style image and prompt, but only incorporates the target stylized image as the image modality to create high-quality binary data. To facilitate binary data training, we introduce a CLIP image encoder with a perceiver resampler that translates the image input into style tokens aligned with multimodal tokens in AR models, and we implement a style-enhanced token technique to prevent content leakage, which is a common issue in previous work. Furthermore, we mix raw images drawn from large-scale text-image datasets with stylized images to enhance StyleAR's ability to extract richer stylistic features and ensure style consistency. Extensive qualitative and quantitative experiments demonstrate our superior performance.

[206] Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought

Chao Huang,Benfeng Wang,Jie Wen,Chengliang Liu,Wei Wang,Li Shen,Xiaochun Cao

Main category: cs.CV

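TL;DR: The paper proposes a new task, Video Anomaly Reasoning (VAR), in which multimodal large language models (MLLMs) reason deeply about video anomalies. The authors design the Vad-R1 framework, which combines the P2C-CoT chain of thought with the AVA-GRPO reinforcement learning algorithm and significantly improves anomaly detection and reasoning.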

Details

Motivation: Existing MLLM-based video anomaly detection methods stop at shallow descriptions and lack deep reasoning, so a method that reasons step by step in a human-like way is needed. Method: Vad-R1 comprises P2C-CoT, a chain of thought that simulates the human process of recognizing anomalies, and AVA-GRPO, a reinforcement learning algorithm that strengthens reasoning through a self-verification mechanism. Result: Vad-R1 outperforms both open-source and proprietary models on VAD and VAR tasks. Conclusion: Through deep reasoning and self-verification, Vad-R1 significantly improves video anomaly analysis and points to new directions for future research. Abstract: Recent advancements in the reasoning capability of Multimodal Large Language Models (MLLMs) demonstrate their effectiveness in tackling complex visual tasks. However, existing MLLM-based Video Anomaly Detection (VAD) methods remain limited to shallow anomaly descriptions without deep reasoning. In this paper, we propose a new task named Video Anomaly Reasoning (VAR), which aims to enable deep analysis and understanding of anomalies in the video by requiring MLLMs to think explicitly before answering. To this end, we propose Vad-R1, an end-to-end MLLM-based framework for VAR. Specifically, we design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies, guiding the MLLM to reason about anomalies step by step. Based on the structured P2C-CoT, we construct Vad-Reasoning, a dedicated dataset for VAR. Furthermore, we propose an improved reinforcement learning algorithm, AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs through a self-verification mechanism with limited annotations. Experimental results demonstrate that Vad-R1 achieves superior performance, outperforming both open-source and proprietary models on VAD and VAR tasks. Codes and datasets will be released at https://github.com/wbfwonderful/Vad-R1.

[207] ErpGS: Equirectangular Image Rendering enhanced with 3D Gaussian Regularization

Shintaro Ito,Natsuki Takama,Koichi Ito,Hwann-Tzong Chen,Takafumi Aoki

Main category: cs.CV

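TL;DR: ErpGS is an omnidirectional Gaussian splatting method based on 3DGS that addresses the large distortion caused by the 360-degree camera projection model and improves rendering accuracy with geometric regularization and related techniques.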

Details

Motivation: The projection model of a 360-degree camera causes large distortion, which in 3DGS-based methods produces extremely large 3D Gaussians and degrades rendering accuracy. Method: ErpGS introduces geometric regularization, scale regularization, and distortion-aware weights, plus a mask to suppress the effects of obstacles. Result: On public datasets, ErpGS renders more accurately than conventional methods. Conclusion: ErpGS effectively handles the distortion of the 360-degree camera model and improves rendering quality in novel view synthesis. Abstract: Multi-view images acquired by a 360-degree camera can be used to reconstruct a wide-area 3D space. There are 3D reconstruction methods from equirectangular images based on NeRF and 3DGS, as well as Novel View Synthesis (NVS) methods. On the other hand, it is necessary to overcome the large distortion caused by the projection model of a 360-degree camera when equirectangular images are used. In 3DGS-based methods, the large distortion of the 360-degree camera model generates extremely large 3D Gaussians, resulting in poor rendering accuracy. We propose ErpGS, an omnidirectional GS method based on 3DGS, to realize NVS while addressing these problems. ErpGS introduces several techniques to improve rendering accuracy: geometric regularization, scale regularization, and distortion-aware weights and a mask to suppress the effects of obstacles in equirectangular images. Through experiments on public datasets, we demonstrate that ErpGS can render novel view images more accurately than conventional methods.
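
The distortion of equirectangular images comes from the fact that rows near the poles cover far less solid angle per pixel than rows near the equator. One common way to account for this is to weight each row by the cosine of its latitude. The sketch below shows that weighting and a weighted mean squared error. It is an assumption about what "distortion-aware weights" could look like, not the ErpGS formulation, and both function names are hypothetical.

```python
import math


def latitude_weights(height):
    """Per-row weights for an equirectangular image with `height` rows.

    Row centers are mapped to latitudes in (-pi/2, pi/2); the weight is
    cos(latitude), so rows near the poles count less.
    """
    weights = []
    for row in range(height):
        lat = (0.5 - (row + 0.5) / height) * math.pi
        weights.append(math.cos(lat))
    return weights


def weighted_mse(pred, target):
    """Latitude-weighted mean squared error between two 2D lists."""
    weights = latitude_weights(len(pred))
    num, den = 0.0, 0.0
    for w, pred_row, target_row in zip(weights, pred, target):
        for p, t in zip(pred_row, target_row):
            num += w * (p - t) ** 2
            den += w
    return num / den


if __name__ == "__main__":
    print(latitude_weights(4))
    print(weighted_mse([[1.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]))
```

The same weights could in principle scale a photometric loss during optimization so that heavily stretched polar regions do not dominate the gradients.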

[208] OmniFall: A Unified Staged-to-Wild Benchmark for Human Fall Detection

David Schneider,Zdravko Marinov,Rafael Baur,Zeyun Zhong,Rodi Düger,Rainer Stiefelhagen

Main category: cs.CV

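TL;DR: OmniFall unifies eight public fall detection datasets under standardized evaluation protocols and uses real-world data to expose the performance gap of existing methods in uncontrolled environments.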

Details

Motivation: Existing fall detection research relies on small, staged datasets with domain biases that do not reflect real-world performance. Method: Introduce the OmniFall benchmark, which unifies eight datasets with standardized labels and evaluation protocols, and evaluate on real-world videos. Result: Experiments show a significant performance gap between controlled and uncontrolled environments. Conclusion: OmniFall provides a benchmark for fair comparison in fall detection research and highlights key challenges in building robust systems. Abstract: Current video-based fall detection research mostly relies on small, staged datasets with significant domain biases concerning background, lighting, and camera setup, resulting in unknown real-world performance. We introduce OmniFall, unifying eight public fall detection datasets (roughly 14 h of recordings, roughly 42 h of multiview data, 101 subjects, 29 camera views) under a consistent ten-class taxonomy with standardized evaluation protocols. Our benchmark provides complete video segmentation labels and enables fair cross-dataset comparison previously impossible with incompatible annotation schemes. For real-world evaluation we curate OOPS-Fall from genuine accident videos and establish a staged-to-wild protocol measuring generalization from controlled to uncontrolled environments. Experiments with frozen pre-trained backbones such as I3D or VideoMAE reveal significant performance gaps between in-distribution and in-the-wild scenarios, highlighting critical challenges in developing robust fall detection systems. The OmniFall dataset is available at https://huggingface.co/datasets/simplexsigil2/omnifall and the code at https://github.com/simplexsigil/omnifall-experiments

[209] Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement

Afrah Shaahid,Muzammil Behzad

Main category: cs.CV

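TL;DR: UDAN-CLIP is a diffusion-based underwater image enhancement method that combines a vision-language model with a spatial attention module, addressing the bias that arises from existing methods' reliance on synthetic data.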

Details

Motivation: Underwater images suffer from multiple degradations; existing methods depend on synthetic data and are prone to distortion after fine-tuning, so a more effective and realistic enhancement method is needed. Method: The UDAN-CLIP framework combines a customized classifier, a spatial attention module, and a CLIP-Diffusion loss to preserve prior knowledge and correct localized degradations. Result: The model performs well in both quantitative and qualitative evaluations, correcting distortions and restoring natural appearance. Conclusion: UDAN-CLIP achieves more realistic, detail-preserving image enhancement in challenging underwater conditions. Abstract: Underwater images are often affected by complex degradations such as light absorption, scattering, color casts, and artifacts, making enhancement critical for effective object detection, recognition, and scene understanding in aquatic environments. Existing methods, especially diffusion-based approaches, typically rely on synthetic paired datasets due to the scarcity of real underwater references, introducing bias and limiting generalization. Furthermore, fine-tuning these models can degrade learned priors, resulting in unrealistic enhancements due to domain shifts. To address these challenges, we propose UDAN-CLIP, an image-to-image diffusion framework pre-trained on synthetic underwater datasets and enhanced with a customized classifier based on a vision-language model, a spatial attention module, and a novel CLIP-Diffusion loss. The classifier preserves natural in-air priors and semantically guides the diffusion process, while the spatial attention module focuses on correcting localized degradations such as haze and low contrast. The proposed CLIP-Diffusion loss further strengthens visual-textual alignment and helps maintain semantic consistency during enhancement. The proposed contributions empower our UDAN-CLIP model to perform more effective underwater image enhancement, producing results that are not only visually compelling but also more realistic and detail-preserving. These improvements are consistently validated through both quantitative metrics and qualitative visual comparisons, demonstrating the model's ability to correct distortions and restore natural appearance in challenging underwater conditions.

[210] Dynamic-I2V: Exploring Image-to-Video Generation Models via Multimodal LLM

Peng Liu,Xiaoming Ren,Fengkai Liu,Qingsong Xie,Quanlong Zheng,Yanhao Zhang,Haonan Lu,Yujiu Yang

Main category: cs.CV

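TL;DR: Dynamic-I2V integrates multimodal large language models (MLLMs) with a diffusion transformer (DiT) to significantly improve motion controllability and temporal coherence in image-to-video generation, and proposes a new evaluation benchmark, DIVE.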

Details

Motivation: Existing image-to-video methods struggle in complex scenes because they lack a deep understanding of nuanced motion and object-action relationships. Method: Dynamic-I2V uses MLLMs to jointly encode visual and textual conditions for a DiT architecture and supports diverse conditional inputs. Result: Dynamic-I2V improves dynamic range, controllability, and quality by 42.5%, 7.9%, and 11.8%, respectively, as measured by DIVE. Conclusion: Dynamic-I2V reaches state-of-the-art image-to-video generation, and the DIVE benchmark addresses limitations of existing evaluation. Abstract: Recent advancements in image-to-video (I2V) generation have shown promising performance in conventional scenarios. However, these methods still encounter significant challenges when dealing with complex scenes that require a deep understanding of nuanced motion and intricate object-action relationships. To address these challenges, we present Dynamic-I2V, an innovative framework that integrates Multimodal Large Language Models (MLLMs) to jointly encode visual and textual conditions for a diffusion transformer (DiT) architecture. By leveraging the advanced multimodal understanding capabilities of MLLMs, our model significantly improves motion controllability and temporal coherence in synthesized videos. The inherent multimodality of Dynamic-I2V further enables flexible support for diverse conditional inputs, extending its applicability to various downstream generation tasks. Through systematic analysis, we identify a critical limitation in current I2V benchmarks: a significant bias towards favoring low-dynamic videos, stemming from an inadequate balance between motion complexity and visual quality metrics. To resolve this evaluation gap, we propose DIVE, a novel assessment benchmark specifically designed for comprehensive dynamic quality measurement in I2V generation. In conclusion, extensive quantitative and qualitative experiments confirm that Dynamic-I2V attains state-of-the-art performance in image-to-video generation, particularly revealing significant improvements of 42.5%, 7.9%, and 11.8% in dynamic range, controllability, and quality, respectively, as assessed by the DIVE metric in comparison to existing methods.

[211] Attention! You Vision Language Model Could Be Maliciously Manipulated

Xiaosen Wang,Shaokang Wang,Zhijin Ge,Yuyang Luo,Shudong Zhang

Main category: cs.CV

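TL;DR: This paper proposes VMA, a new attack on large vision-language models (VLMs) that optimizes adversarial perturbations to carry out a range of attacks or to enable copyright protection.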

Details

Motivation: VLMs excel at understanding complex scenes, but their vulnerability to adversarial examples, especially images, can lead to serious consequences such as jailbreaking and hijacking. Method: VMA combines first-order and second-order momentum optimization with a differentiable transformation mechanism. Result: VMA is effective and generalizes across diverse scenarios and datasets, and can be used either for attacks or for copyright protection. Conclusion: VMA exposes the vulnerability of VLMs while showing dual potential for attack and protection. Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in understanding complex real-world scenarios and supporting data-driven decision-making processes. However, VLMs exhibit significant vulnerability against adversarial examples, either text or image, which can lead to various adversarial outcomes, e.g., jailbreaking, hijacking, and hallucination. In this work, we empirically and theoretically demonstrate that VLMs are particularly susceptible to image-based adversarial examples, where imperceptible perturbations can precisely manipulate each output token. To this end, we propose a novel attack called Vision-language model Manipulation Attack (VMA), which integrates first-order and second-order momentum optimization techniques with a differentiable transformation mechanism to effectively optimize the adversarial perturbation. Notably, VMA can be a double-edged sword: it can be leveraged to implement various attacks, such as jailbreaking, hijacking, privacy breaches, Denial-of-Service, and the generation of sponge examples, while simultaneously enabling the injection of watermarks for copyright protection. Extensive empirical evaluations substantiate the efficacy and generalizability of VMA across diverse scenarios and datasets.
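
Illustrative note (not from the paper): the abstract mentions first-order and second-order momentum optimization of a perturbation. A minimal sketch of that general idea is an Adam-style update on a bounded perturbation, shown here on a toy quadratic loss. The function name `adam_linf_perturbation`, the toy gradient, and all hyperparameters are assumptions for illustration. The paper's differentiable transformation mechanism and the actual VLM loss are not modelled.

```python
import math


def adam_linf_perturbation(grad_fn, x0, eps=0.1, lr=0.01, steps=200,
                           beta1=0.9, beta2=0.999, tiny=1e-8):
    """Minimise a loss over a perturbation `delta` added to `x0`.

    Uses Adam-style first and second moment estimates of the gradient and
    clips every coordinate of `delta` to [-eps, eps] after each step.
    This is a generic sketch, not the VMA algorithm itself.
    """
    n = len(x0)
    delta = [0.0] * n
    m = [0.0] * n  # first moment: running mean of gradients
    v = [0.0] * n  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        x = [a + d for a, d in zip(x0, delta)]
        g = grad_fn(x)
        for i in range(n):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            delta[i] -= lr * m_hat / (math.sqrt(v_hat) + tiny)
            delta[i] = max(-eps, min(eps, delta[i]))  # stay in the L-inf ball
    return delta


# Hypothetical toy objective: squared distance between x and a target point.
TARGET = [1.0, -1.0]


def toy_grad(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, TARGET)]


delta = adam_linf_perturbation(toy_grad, [0.0, 0.0])
```

In this toy setting the perturbation moves towards the target until it reaches the `eps` bound in each coordinate, which mirrors the "imperceptible perturbation" constraint in spirit only.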

[212] Weather-Magician: Reconstruction and Rendering Framework for 4D Weather Synthesis In Real Time

Chen Sang,Yeqiang Qian,Jiale Zhang,Chunxiang Wang,Ming Yang

Main category: cs.CV

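TL;DR: Proposes a Gaussian splatting-based framework that reconstructs real scenes and renders 4D weather effects, supporting dynamic weather changes and real-time rendering.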

Details

Motivation: Traditional approaches to complex scene reconstruction and weather rendering are inefficient and costly, and existing algorithms cannot handle weather effects effectively. Method: Uses Gaussian modeling and rendering techniques that support continuous dynamic weather changes and fine control over details. Result: Achieves real-time rendering with low hardware requirements and supports a variety of common weather effects. Conclusion: The framework offers an efficient solution for complex scene reconstruction and weather effect rendering. Abstract: For tasks such as urban digital twins, VR/AR/game scene design, or creating synthetic films, the traditional industrial approach often involves manually modeling scenes and using various rendering engines to complete the rendering process. This approach typically requires high labor costs and hardware demands, and can result in poor quality when replicating complex real-world scenes. A more efficient approach is to use data from captured real-world scenes, then apply reconstruction and rendering algorithms to quickly recreate the authentic scene. However, current algorithms are unable to effectively reconstruct and render real-world weather effects. To address this, we propose a framework based on Gaussian splatting that can reconstruct real scenes and render them under synthesized 4D weather effects. Our work can simulate various common weather effects by applying Gaussian modeling and rendering techniques. It supports continuous dynamic weather changes and can easily control the details of the effects. Additionally, our work has low hardware requirements and achieves real-time rendering performance. The result demos can be accessed on our project homepage: weathermagician.github.io

[213] A Responsible Face Recognition Approach for Small and Mid-Scale Systems Through Personalized Neural Networks

Sebastian Groß,Stefan Heindorf,Philipp Terhörst

Main category: cs.CV

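TL;DR: Proposes MOTE, a new approach that replaces conventional vector-based face templates with small personalized neural networks to improve fairness and privacy.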

Details

Motivation: Traditional face recognition systems rely on fixed templates, lack explainability, and raise fairness and privacy concerns. Method: MOTE creates a dedicated binary classifier for each identity, trained with only a single reference sample plus synthetic samples, which allows fairness to be adjusted per individual. Result: Experiments show substantial gains in fairness and privacy, at the cost of higher inference time and storage requirements. Conclusion: MOTE suits small and mid-scale applications, particularly where fairness and privacy requirements are high. Abstract: Traditional face recognition systems rely on extracting fixed face representations, known as templates, to store and verify identities. These representations are typically generated by neural networks that often lack explainability and raise concerns regarding fairness and privacy. In this work, we propose a novel model-template (MOTE) approach that replaces vector-based face templates with small personalized neural networks. This design enables more responsible face recognition for small and medium-scale systems. During enrollment, MOTE creates a dedicated binary classifier for each identity, trained to determine whether an input face matches the enrolled identity. Each classifier is trained using only a single reference sample, along with synthetically balanced samples to allow adjusting fairness at the level of a single individual during enrollment. Extensive experiments across multiple datasets and recognition systems demonstrate substantial improvements in fairness and particularly in privacy. Although the method increases inference time and storage requirements, it presents a strong solution for small- and mid-scale applications where fairness and privacy are critical.

[214] CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge

Gabriele Lagani,Fabrizio Falchi,Claudio Gennaro,Giuseppe Amato

Main category: cs.CV

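TL;DR: Proposes a deep learning model for video activity recognition that combines convolutional layers with a linear-complexity attention mechanism and uses a quantization mechanism to improve efficiency.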

Details

Motivation: Current models have high computing requirements; the goal is efficient, privacy-friendly smart home and healthcare applications on consumer and edge devices. Method: Combines convolutional layers with a linear-complexity attention mechanism and introduces a quantization mechanism to improve training and inference efficiency. Result: On several public video activity recognition benchmarks, the model improves accuracy while keeping computational cost low. Conclusion: The model performs well in both computational efficiency and accuracy and suits scenarios where efficiency and privacy matter. Abstract: In this paper, we introduce a deep learning solution for video activity recognition that leverages an innovative combination of convolutional layers with a linear-complexity attention mechanism. Moreover, we introduce a novel quantization mechanism to further improve the efficiency of our model during both training and inference. Our model maintains a reduced computational cost, while preserving robust learning and generalization capabilities. Our approach addresses the issues related to the high computing requirements of current models, with the goal of achieving competitive accuracy on consumer and edge devices, enabling smart home and smart healthcare applications where efficiency and privacy issues are of concern. We experimentally validate our model on different established and publicly available video activity recognition benchmarks, improving accuracy over alternative models at a competitive computing cost.

[215] Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning

Wenrui Li,Penghong Wang,Xingtao Wang,Wangmeng Zuo,Xiaopeng Fan,Yonghong Tian

Main category: cs.CV

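TL;DR: Proposes MDST++, a dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer that decouples contextual semantic information from sparse dynamic motion information to address background scene bias and inadequate motion detail in audio-visual zero-shot learning.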

Details

Motivation: Current audio-visual zero-shot learning methods suffer from background scene bias and inadequate motion detail, so a more effective way to classify unseen video data is needed. Method: MDST++ comprises a dual-stream structure, multi-timescale motion decoupling, RGB-to-event conversion, a discrepancy analysis block, and dynamic adjustment of LIF neuron thresholds. Result: Experiments show that MDST++ outperforms existing methods, with HM and ZSL accuracy improving by 26.2% and 39.9%, respectively. Conclusion: Through decoupling and dynamic adjustment, MDST++ significantly improves audio-visual zero-shot learning. Abstract: Audio-visual zero-shot learning (ZSL) has been extensively researched for its capability to classify video data from unseen classes during training. Nevertheless, current methodologies often struggle with background scene biases and inadequate motion detail. This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++), which decouples contextual semantic information and sparse dynamic motion information. The recurrent joint learning unit is proposed to extract contextual semantic information and capture joint knowledge across various modalities to understand the environment of actions. By converting RGB images to events, our method captures motion information more accurately and mitigates background scene biases. Moreover, we introduce a discrepancy analysis block to model audio motion information. To enhance the robustness of SNNs in extracting temporal and motion cues, we dynamically adjust the threshold of Leaky Integrate-and-Fire neurons based on global motion and contextual semantic information. Our experiments validate the effectiveness of MDST++, demonstrating its consistent superiority over state-of-the-art methods on mainstream benchmarks. Additionally, incorporating motion and multi-timescale information significantly improves HM and ZSL accuracy by 26.2% and 39.9%.
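
Illustrative note (not from the paper): the abstract refers to adjusting the firing threshold of Leaky Integrate-and-Fire (LIF) neurons. A minimal sketch of a discrete-time LIF neuron shows why the threshold matters. The function name `lif_spikes`, the leak factor, and the hard-reset rule are assumptions; how MDST++ derives the threshold from motion and semantic cues is not reproduced here.

```python
def lif_spikes(inputs, threshold=1.0, leak=0.9, reset=0.0):
    """Simulate one discrete-time Leaky Integrate-and-Fire neuron.

    The membrane potential decays by `leak` each step, accumulates the
    input current, and emits a spike (1) when it reaches `threshold`,
    after which it is set back to `reset`.
    """
    potential = reset
    spikes = []
    for current in inputs:
        potential = leak * potential + current  # leaky integration
        if potential >= threshold:
            spikes.append(1)
            potential = reset  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


constant_input = [0.5] * 6
high_threshold = lif_spikes(constant_input, threshold=1.0)
low_threshold = lif_spikes(constant_input, threshold=0.9)
print(high_threshold)  # [0, 0, 1, 0, 0, 1]
print(low_threshold)   # [0, 1, 0, 1, 0, 1]
```

With the same input, lowering the threshold makes the neuron fire more often, so a threshold that changes with context effectively changes how sensitive the neuron is to its input.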

[216] Can Visual Encoder Learn to See Arrows?

Naoyuki Terashita,Yusuke Tozaki,Hideaki Omote,Congkha Nguyen,Ryosuke Nakamoto,Yuta Koreeda,Hiroaki Ozaki

Main category: cs.CV

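TL;DR: By removing textual and positional biases, the authors train vision language models (VLMs) to learn edge features, improving performance on diagram recognition tasks.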

Details

Motivation: Existing VLMs perform poorly at recognizing edges in diagrams, presumably because over-reliance on textual and positional biases prevents them from learning explicit edge features. Method: Applies contrastive learning on a diagram dataset free of textual and positional biases to train an image encoder, then evaluates it on probing, image retrieval, and captioning. Result: The finetuned model outperforms pretrained CLIP on all tasks and surpasses zero-shot GPT-4o and LLaVA-Mistral on captioning. Conclusion: Eliminating textual and positional biases helps VLMs recognize diagram edges accurately, offering an effective path towards better diagram understanding. Abstract: The diagram is a visual representation of a relationship illustrated with edges (lines or arrows), which is widely used in industrial and scientific communication. Although recognizing diagrams is essential for vision language models (VLMs) to comprehend domain-specific knowledge, recent studies reveal that many VLMs fail to identify edges in images. We hypothesize that these failures stem from an over-reliance on textual and positional biases, preventing VLMs from learning explicit edge features. Based on this idea, we empirically investigate whether the image encoder in VLMs can learn edge representation through training on a diagram dataset in which edges are biased neither by textual nor positional information. To this end, we conduct contrastive learning on an artificially generated diagram-caption dataset to train an image encoder and evaluate its diagram-related features on three tasks: probing, image retrieval, and captioning. Our results show that the finetuned model outperforms pretrained CLIP in all tasks and surpasses zero-shot GPT-4o and LLaVA-Mistral in the captioning task. These findings confirm that eliminating textual and positional biases fosters accurate edge recognition in VLMs, offering a promising path for advancing diagram understanding.

[217] SaSi: A Self-augmented and Self-interpreted Deep Learning Approach for Few-shot Cryo-ET Particle Detection

Gokul Adethya,Bhanu Pratyush Mantha,Tianyang Wang,Xingjian Li,Min Xu

Main category: cs.CV

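TL;DR: Proposes SaSi, a deep learning approach for few-shot particle detection in 3D cryo-electron tomography images that uses self-augmentation and self-interpretation to reduce dependence on labeled data and significantly outperforms existing methods.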

Details

Motivation: Cryo-electron tomography (cryo-ET) is a powerful way to image macromolecular complexes in near-native states, but 3D particle localization in cellular environments is still hampered by low signal-to-noise ratios and missing wedge artifacts. Existing deep learning methods need large amounts of data, while labeled data is scarce. Method: SaSi combines self-augmentation to improve data utilization with a self-interpreted segmentation strategy that reduces dependence on labeled data, improving generalization and robustness. Result: Experiments on simulated and real cryo-ET datasets show that SaSi significantly outperforms state-of-the-art methods for particle localization. Conclusion: SaSi sets a new benchmark for few-shot learning in cryo-ET and improves understanding of particle detection with very few labels. Abstract: Cryo-electron tomography (cryo-ET) has emerged as a powerful technique for imaging macromolecular complexes in their near-native states. However, the localization of 3D particles in cellular environments still presents a significant challenge due to low signal-to-noise ratios and missing wedge artifacts. Deep learning approaches have shown great potential, but they need huge amounts of data, which can be a challenge in cryo-ET scenarios where labeled data is often scarce. In this paper, we propose a novel Self-augmented and Self-interpreted (SaSi) deep learning approach towards few-shot particle detection in 3D cryo-ET images. Our method builds upon self-augmentation techniques to further boost data utilization and introduces a self-interpreted segmentation strategy for alleviating dependency on labeled data, hence improving generalization and robustness. As demonstrated by experiments conducted on both simulated and real-world cryo-ET datasets, the SaSi approach significantly outperforms existing state-of-the-art methods for particle localization. This research increases understanding of how to detect particles with very few labels in cryo-ET and thus sets a new benchmark for few-shot learning in structural biology.

[218] Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval

Rong-Cheng Tu,Wenhao Sun,Hanzhe You,Yingjie Wang,Jiaxing Huang,Li Shen,Dacheng Tao

Main category: cs.CV

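TL;DR: Proposes a zero-shot composed image retrieval (ZS-CIR) method that needs no annotated data: a Multimodal Reasoning Agent (MRA) constructs triplets directly, avoiding the error propagation introduced by intermediate text.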

Details

Motivation: Existing methods depend on generated intermediate text, which propagates errors and hurts retrieval performance. Method: Uses MRA to directly construct (reference image, modifying text, target image) triplets and trains the model on unlabeled image data. Result: Performance improves markedly on FashionIQ, CIRR, and CIRCO, with gains of 7.5% in R@10, 9.6% in R@1, and 9.5% in mAP@5, respectively. Conclusion: The MRA framework removes the problems caused by intermediate text and significantly improves zero-shot composed image retrieval. Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query, consisting of a reference image and a modifying text, without relying on annotated training data. Existing approaches often generate a synthetic target text using large language models (LLMs) to serve as an intermediate anchor between the compositional query and the target image. Models are then trained to align the compositional query with the generated text, and separately align images with their corresponding texts using contrastive learning. However, this reliance on intermediate text introduces error propagation, as inaccuracies in query-to-text and text-to-image mappings accumulate, ultimately degrading retrieval performance. To address these problems, we propose a novel framework by employing a Multimodal Reasoning Agent (MRA) for ZS-CIR. MRA eliminates the dependence on textual intermediaries by directly constructing triplets, (reference image, modifying text, target image), using only unlabeled image data. By training on these synthetic triplets, our model learns to capture the relationships between compositional queries and candidate images directly. Extensive experiments on three standard CIR benchmarks demonstrate the effectiveness of our approach. On the FashionIQ dataset, our method improves Average R@10 by at least 7.5% over existing baselines; on CIRR, it boosts R@1 by 9.6%; and on CIRCO, it increases mAP@5 by 9.5%.
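
Illustrative note (not from the paper): the results above are reported as R@1 and R@10, i.e. Recall@K, the fraction of queries whose target appears among the top K retrieved items. A minimal sketch of that metric follows; the function name `recall_at_k` and the toy rankings are assumptions for illustration, and the benchmarks' official evaluation scripts may differ in detail.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose target is among the top-k retrieved ids.

    ranked_ids: one list of candidate ids per query, best match first.
    target_ids: the single correct id for each query, in the same order.
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("need exactly one target per query")
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids)
        if target in ranked[:k]
    )
    return hits / len(target_ids)


# Hypothetical rankings for three queries that all share the target "a".
rankings = [
    ["a", "b", "c"],  # target ranked 1st
    ["c", "a", "b"],  # target ranked 2nd
    ["b", "c", "a"],  # target ranked 3rd
]
targets = ["a", "a", "a"]

r_at_1 = recall_at_k(rankings, targets, 1)
r_at_2 = recall_at_k(rankings, targets, 2)
r_at_3 = recall_at_k(rankings, targets, 3)
```

Here one of three targets is ranked first, two are within the top two, and all are within the top three, so Recall@K grows with K as expected.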

[219] UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space

Yong Liu,Jinshan Pan,Yinchuan Li,Qingji Dong,Chao Zhu,Yu Guo,Fei Wang

Main category: cs.CV

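TL;DR: UltraVSR is a diffusion-based video super-resolution framework that uses a Degradation-aware Restoration Schedule (DRS) and a Recurrent Temporal Shift (RTS) module to achieve efficient, temporally consistent super-resolution.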

Details

Motivation: Diffusion models excel at image generation, but applying them to video super-resolution is hindered by their stochasticity and lack of temporal modeling. Method: DRS turns the iterative denoising process into a single-step reconstruction, the RTS module ensures temporal consistency through feature propagation and alignment, and Spatio-temporal Joint Distillation (SJD) further enhances the result. Result: UltraVSR achieves state-of-the-art performance in a single sampling step, balancing quality and efficiency. Conclusion: UltraVSR addresses the challenges of using diffusion models for video super-resolution and offers new ideas for the field. Abstract: Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporal-coherent VSR through an efficient one-step diffusion space. A central component of UltraVSR is the Degradation-aware Restoration Schedule (DRS), which estimates a degradation factor from the low-resolution input and transforms the iterative denoising process into a single-step reconstruction from low-resolution to high-resolution videos. This design eliminates randomness from diffusion noise and significantly speeds up inference. To ensure temporal consistency, we propose a lightweight yet effective Recurrent Temporal Shift (RTS) module, composed of an RTS-convolution unit and an RTS-attention unit. By partially shifting feature components along the temporal dimension, these two units collaboratively facilitate effective feature propagation, fusion, and alignment across neighboring frames, without relying on explicit temporal layers. The RTS module is integrated into a pretrained text-to-image diffusion model and is further enhanced through Spatio-temporal Joint Distillation (SJD), which improves temporal coherence while preserving realistic details. Additionally, we introduce a Temporally Asynchronous Inference (TAI) strategy to capture long-range temporal dependencies under limited memory constraints. Extensive experiments show that UltraVSR achieves state-of-the-art performance, both qualitatively and quantitatively, in a single sampling step.
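
Illustrative note (not from the paper): the abstract describes partially shifting feature components along the temporal dimension. A minimal sketch of a generic temporal shift is given below, where one group of channels is taken from the previous frame, another from the next frame, and the rest stay in place. The function name `temporal_shift`, the zero padding at sequence ends, and the channel split are assumptions; the RTS-convolution and RTS-attention units themselves are not reproduced.

```python
def temporal_shift(frames, fold):
    """Shift part of each frame's channels along the time axis.

    frames: list of T frames, each a list of C channel values.
    fold:   number of channels in each shifted group.
    Channels [0, fold) are read from the previous frame, channels
    [fold, 2 * fold) from the next frame, and the remaining channels are
    left untouched. Positions with no source frame are zero-filled.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                source = t - 1  # information flows forward in time
            elif c < 2 * fold:
                source = t + 1  # information flows backward in time
            else:
                source = t      # unshifted channels
            if 0 <= source < num_frames:
                shifted[t][c] = frames[source][c]
    return shifted


# Hypothetical clip: 3 frames with 3 channels each.
clip = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mixed = temporal_shift(clip, fold=1)
print(mixed)  # [[0.0, 5, 3], [1, 8, 6], [4, 0.0, 9]]
```

After the shift, each frame carries some features from its neighbours, so later per-frame layers can mix temporal information without a dedicated temporal layer.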

[220] PHI: Bridging Domain Shift in Long-Term Action Quality Assessment via Progressive Hierarchical Instruction

Kanglei Zhou,Hubert P. H. Shum,Frederick W. B. Li,Xingxing Zhang,Xiaohui Liang

Main category: cs.CV

TL;DR: The paper proposes PHI, a method that improves long-term action quality assessment by addressing task-level and feature-level domain shift.

Details

Motivation: Existing methods are limited by the domain shift between pretrained large-scale action recognition backbones and the specific AQA task, and fine-tuning those backbones on small datasets is impractical. Method: Proposes PHI with two strategies, Gap Minimization Flow (GMF) and List-wise Contrastive Regularization (LCR), to reduce the domain gap. Result: Achieves state-of-the-art performance on three representative long-term AQA datasets. Conclusion: Through progressive hierarchical instruction, PHI effectively addresses domain shift and improves long-term AQA performance. Abstract: Long-term Action Quality Assessment (AQA) aims to evaluate the quantitative performance of actions in long videos. However, existing methods face challenges due to domain shifts between the pre-trained large-scale action recognition backbones and the specific AQA task, thereby hindering their performance. This arises since fine-tuning resource-intensive backbones on small AQA datasets is impractical. We address this by identifying two levels of domain shift: task-level, regarding differences in task objectives, and feature-level, regarding differences in important features. For feature-level shifts, which are more detrimental, we propose Progressive Hierarchical Instruction (PHI) with two strategies. First, Gap Minimization Flow (GMF) leverages flow matching to progressively learn a fast flow path that reduces the domain gap between initial and desired features across shallow to deep layers. Additionally, a temporally-enhanced attention module captures long-range dependencies essential for AQA. Second, List-wise Contrastive Regularization (LCR) facilitates coarse-to-fine alignment by comprehensively comparing batch pairs to learn fine-grained cues while mitigating domain shift. Integrating these modules, PHI offers an effective solution. Experiments demonstrate that PHI achieves state-of-the-art performance on three representative long-term AQA datasets, proving its superiority in addressing the domain shift for long-term AQA.

[221] Structured Initialization for Vision Transformers

Jianqiao Zheng,Xueqian Li,Hemanth Saratchandran,Simon Lucey

Main category: cs.CV

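TL;DR: The paper introduces CNN inductive biases into ViTs purely through initialization, with no architectural change, so that ViTs match CNNs on small datasets while retaining ViT-level performance on large ones.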

Details

Motivation: ViTs underperform CNNs on small datasets, and current ViT initialization strategies impose no structural constraints; the authors aim to improve performance by bringing in CNN inductive biases. Method: Improves ViT initialization by using random impulse filters (analogous to CNN filters) rather than relying on pretrained models or attention weight distributions. Result: The method significantly outperforms standard ViT initialization on several small and medium-scale datasets (such as Food-101 and CIFAR) and performs comparably on large-scale datasets (such as ImageNet-1K). Conclusion: The initialization strategy is simple to use, applies to a range of architectures (such as Swin Transformer and MLP-Mixer), and brings performance gains. Abstract: Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structures. Empirical results demonstrate that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparable performance on large-scale datasets such as ImageNet-1K. Moreover, our initialization strategy can be easily integrated into various transformer-based architectures such as Swin Transformer and MLP-Mixer with consistent improvements in performance.
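
Illustrative note (not from the paper): the abstract refers to random impulse filters. A minimal sketch of what such a filter looks like is a kernel that is zero everywhere except for a single one at a randomly chosen position, so applying it simply picks one neighbour of each pixel. The function name `random_impulse_kernel` and the seeded generator are assumptions; how the paper turns such filters into a ViT attention initialization is not described in the abstract and is not modelled here.

```python
import random


def random_impulse_kernel(size, rng):
    """Return a size x size kernel with a single 1.0 at a random position."""
    kernel = [[0.0] * size for _ in range(size)]
    row = rng.randrange(size)
    col = rng.randrange(size)
    kernel[row][col] = 1.0
    return kernel


rng = random.Random(0)  # seeded only to make the example repeatable
kernel = random_impulse_kernel(3, rng)
nonzero_entries = sum(1 for row in kernel for value in row if value != 0.0)
total_weight = sum(sum(row) for row in kernel)
```

Each kernel has exactly one nonzero entry and unit total weight, which is the structural constraint that distinguishes it from a densely random initialization.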

[222] Progressive Scaling Visual Object Tracking

Jack Hong,Shilin Yan,Zehao Xiao,Jiayin Cai,Xiaolong Jiang,Yao Hu,Henghui Ding

Main category: cs.CV

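TL;DR: Proposes a progressive scaling training strategy (DT-Training) that systematically analyzes how training data volume, model size, and input resolution affect tracking performance, in order to improve tracking accuracy.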

Details

Motivation: Existing training approaches suffer from suboptimal optimization and limited iterative refinement when scaling training data, model size, or input resolution. Method: Introduces the DT-Training framework, which combines small teacher transfer with dual-branch alignment to scale training progressively. Result: The scaled tracker outperforms existing methods on multiple benchmarks and shows strong generalization and transferability. Conclusion: The method applies beyond visual object tracking to a broader range of tasks. Abstract: In this work, we propose a progressive scaling training strategy for visual object tracking, systematically analyzing the influence of training data volume, model size, and input resolution on tracking performance. Our empirical study reveals that while scaling each factor leads to significant improvements in tracking accuracy, naive training suffers from suboptimal optimization and limited iterative refinement. To address this issue, we introduce DT-Training, a progressive scaling framework that integrates small teacher transfer and dual-branch alignment to maximize model potential. The resulting scaled tracker consistently outperforms state-of-the-art methods across multiple benchmarks, demonstrating strong generalization and transferability of the proposed method. Furthermore, we validate the broader applicability of our approach to additional tasks, underscoring its versatility beyond tracking.

Shihao Li,Chenglong Li,Aihua Zheng,Andong Lu,Jin Tang,Jixin Ma

Main category: cs.CV

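TL;DR: Proposes an attribute-confidence-based multi-modal caption generation method and NEXT, a multi-modal object re-identification framework that improves recognition through text modulation and a multi-grained mixture of experts.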

Details

Motivation: Existing methods rely on implicit feature fusion and struggle to model fine-grained recognition strategies under complex conditions, whereas multi-modal large language models (MLLMs) can effectively translate visual appearance into descriptive text. Method: (1) An attribute-confidence-based caption generation method that reduces the unknown recognition rate of MLLMs; (2) the NEXT framework, comprising Text-Modulated Semantic-sampling Experts (TMSE) and Context-Shared Structure-aware Experts (CSSE), which handle semantic and structural features respectively; (3) Multi-Modal Feature Aggregation (MMFA) for unified feature fusion. Result: Substantially improves the quality of generated text and achieves more accurate multi-modal object re-identification through the multi-grained mixture of experts. Conclusion: By combining semantic and structural features, NEXT effectively improves multi-modal object re-identification. Abstract: Multi-modal object re-identification (ReID) aims to extract identity features across heterogeneous spectral modalities to enable accurate recognition and retrieval in complex real-world scenarios. However, most existing methods rely on implicit feature fusion structures, making it difficult to model fine-grained recognition strategies under varying challenging conditions. Benefiting from the powerful semantic understanding capabilities of Multi-modal Large Language Models (MLLMs), the visual appearance of an object can be effectively translated into descriptive text. In this paper, we propose a reliable multi-modal caption generation method based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs in multi-modal semantic generation and improves the quality of generated text. Additionally, we propose a novel ReID framework NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural expert branches to separately capture modality-specific appearance and intrinsic structure. For semantic recognition, we propose the Text-Modulated Semantic-sampling Experts (TMSE), which leverages randomly sampled high-quality semantic texts to modulate expert-specific sampling of multi-modal features and mine intra-modality fine-grained semantic cues. Then, to recognize coarse-grained structure features, we propose the Context-Shared Structure-aware Experts (CSSE) that focuses on capturing the holistic object structure across modalities and maintains inter-modality structural consistency through a soft routing mechanism. Finally, we propose the Multi-Modal Feature Aggregation (MMFA), which adopts a unified feature fusion strategy to simply and effectively integrate semantic and structural expert outputs into the final identity representations.

[224] Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models

Hyunsik Chae,Seungwoo Yoon,Jaden Park,Chloe Yewon Chun,Yongin Cho,Mu Cai,Yong Jae Lee,Ernest K. Ryu

Main category: cs.CV

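TL;DR: The paper studies how vision-language models (VLMs) perform on basic 2D geometry tasks, finds that they perform poorly, and introduces AVSD, a new dataset for evaluating their atomic visual skills.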

Details

Motivation: Although VLMs show strong multimodal comprehension and reasoning, they struggle with simple visual tasks. The study aims to close this gap by systematically categorizing basic visual skills and evaluating VLMs on them. Method: Defines atomic visual skills and builds the AVSD dataset to evaluate VLMs on these basic tasks. Result: Experiments show that state-of-the-art VLMs perform poorly on AVSD even though the tasks are trivial for humans. Conclusion: The study highlights the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual tasks. Abstract: Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on the atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, despite being trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.

[225] ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving

Xueyi Liu,Zuodong Zhong,Yuxin Guo,Yun-Fu Liu,Zhiguo Su,Qichao Zhang,Junli Wang,Yinfeng Gao,Yupeng Zheng,Qiao Lin,Huiyong Chen,Dongbin Zhao

Main category: cs.CV

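TL;DR: ReasonPlan is a closed-loop driving framework built on multimodal large language models (MLLMs) that combines self-supervised next scene prediction with a supervised decision chain-of-thought and clearly outperforms mainstream end-to-end imitation learning methods.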

Details

Motivation: MLLMs remain underexplored in closed-loop driving systems and have not yet shown a clear advantage, so ReasonPlan is proposed to fill this gap. Method: Uses a self-supervised next scene prediction task and a supervised decision chain-of-thought process, together with PDR, a planning-oriented decision reasoning dataset. Result: On the Bench2Drive benchmark, L2 improves by 19% and the driving score by 16.1, and the model shows strong zero-shot generalization on the DOS benchmark. Conclusion: ReasonPlan performs well in closed-loop driving, with strong generalization and adaptability. Abstract: Owing to their powerful vision-language reasoning and generalization abilities, multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving. However, their application to closed-loop systems remains underexplored, and current MLLM-based methods have not shown clear superiority to mainstream E2E imitation learning approaches. In this work, we propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning with a self-supervised Next Scene Prediction task and supervised Decision Chain-of-Thought process. This dual mechanism encourages the model to align visual representations with actionable driving context, while promoting interpretable and causally grounded decision making. We curate a planning-oriented decision reasoning dataset, namely PDR, comprising 210k diverse and high-quality samples. Our method outperforms the mainstream E2E imitation learning method by a large margin of 19% L2 and 16.1 driving score on the Bench2Drive benchmark. Furthermore, ReasonPlan demonstrates strong zero-shot generalization on the unseen DOS benchmark, highlighting its adaptability in handling zero-shot corner cases. Code and dataset can be found at https://github.com/Liuxueyi/ReasonPlan.

Fotios Lygerakis,Ozan Özdenizci,Elmar Rückert

Main category: cs.CV

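TL;DR: ViTaPEs is a transformer-based framework that fuses visual and tactile data through a multi-scale positional encoding scheme to learn task-agnostic representations, and it generalizes well across tasks and environments.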

Details

Motivation: Fusing visual and tactile perception remains challenging; existing methods depend on pretrained models and do not fully exploit spatial information. ViTaPEs aims to address these issues. Method: Proposes a multi-scale positional encoding scheme that combines visual and tactile inputs, learns task-agnostic representations with a transformer, and proves mathematical properties of the encodings (injectivity, rigid-motion equivariance, and information preservation). Result: Surpasses existing methods on multiple large-scale datasets, shows zero-shot generalization, and performs well on a robotic grasping task. Conclusion: ViTaPEs offers an efficient and generalizable solution for visuotactile fusion with broad application potential. Abstract: Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-scale spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based framework that robustly integrates visual and tactile input data to learn task-agnostic representations for visuotactile perception. Our approach exploits a novel multi-scale positional encoding scheme to capture intra-modal structures, while simultaneously modeling cross-modal cues. Unlike prior work, we provide provable guarantees in visuotactile fusion, showing that our encodings are injective, rigid-motion-equivariant, and information-preserving, validating these properties empirically. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes

[227] EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition

Christoph Schuhmann,Robert Kaczmarczyk,Gollam Rabby,Maurice Kraus,Felix Friedrich,Huu Nguyen,Krishna Kalyan,Kourosh Nadi,Kristian Kersting,Sören Auer

Main category: cs.CV

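TL;DR: EmoNet Face is a new emotion recognition benchmark suite that addresses the limitations of existing datasets with a more comprehensive emotion taxonomy, diverse datasets, and a strong model.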

Details

Motivation: Existing emotion recognition benchmarks cover a narrow emotional spectrum, lack diversity, and carry bias, which limits how well AI can understand complex emotions. Method: Proposes a 40-category emotion taxonomy, creates three large-scale AI-generated datasets, and builds Empathic Insight Face, a high-performing model. Result: EmoNet Face provides high-quality data and a model that reaches expert-level emotion recognition performance. Conclusion: EmoNet Face gives AI emotion understanding a stronger foundation and advances human-AI interaction. Abstract: Effective human-AI interaction relies on AI's ability to accurately perceive and interpret human emotions. Current benchmarks for vision and vision-language models are severely limited, offering a narrow emotional spectrum that overlooks nuanced states (e.g., bitterness, intoxication) and fails to distinguish subtle differences between related feelings (e.g., shame vs. embarrassment). Existing datasets also often use uncontrolled imagery with occluded faces and lack demographic diversity, risking significant bias. To address these critical gaps, we introduce EmoNet Face, a comprehensive benchmark suite. EmoNet Face features: (1) A novel 40-category emotion taxonomy, meticulously derived from foundational research to capture finer details of human emotional experiences. (2) Three large-scale, AI-generated datasets (EmoNet HQ, Binary, and Big) with explicit, full-face expressions and controlled demographic balance across ethnicity, age, and gender. (3) Rigorous, multi-expert annotations for training and high-fidelity evaluation. (4) We build Empathic Insight Face, a model achieving human-expert-level performance on our benchmark. The publicly released EmoNet Face suite - taxonomy, datasets, and model - provides a robust foundation for developing and evaluating AI systems with a deeper understanding of human emotions.

[228] DepthMatch: Semi-Supervised RGB-D Scene Parsing through Depth-Guided Regularization

Jianxin Huang,Jiahang Li,Sergey Vityazev,Alexander Dvorkovich,Rui Fan

Main category: cs.CV

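TL;DR: DepthMatch is a semi-supervised learning framework for RGB-D scene parsing that improves performance with complementary patch mix-up augmentation and a lightweight spatial prior injector, achieving leading results on the NYUv2 and KITTI datasets.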

Details

Motivation: Existing RGB-D scene parsing methods depend on large amounts of manually annotated data, which is costly and time-consuming, so a semi-supervised framework is proposed. Method: Proposes complementary patch mix-up augmentation, a lightweight spatial prior injector, and a depth-guided boundary loss to make full use of unlabeled data and improve feature fusion efficiency. Result: Reaches state-of-the-art results on the NYUv2 dataset and ranks first on the KITTI Semantics benchmark. Conclusion: DepthMatch is efficient and applicable in both indoor and outdoor scenes, offering a new solution for RGB-D scene parsing. Abstract: RGB-D scene parsing methods effectively capture both semantic and geometric features of the environment, demonstrating great potential under challenging conditions such as extreme weather and low lighting. However, existing RGB-D scene parsing methods predominantly rely on supervised training strategies, which require a large amount of manually annotated pixel-level labels that are both time-consuming and costly. To overcome these limitations, we introduce DepthMatch, a semi-supervised learning framework that is specifically designed for RGB-D scene parsing. To make full use of unlabeled data, we propose complementary patch mix-up augmentation to explore the latent relationships between texture and spatial features in RGB-D image pairs. We also design a lightweight spatial prior injector to replace traditional complex fusion modules, improving the efficiency of heterogeneous feature fusion. Furthermore, we introduce depth-guided boundary loss to enhance the model's boundary prediction capabilities. Experimental results demonstrate that DepthMatch exhibits high applicability in both indoor and outdoor scenes, achieving state-of-the-art results on the NYUv2 dataset and ranking first on the KITTI Semantics benchmark.

[229] Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay

Hongsong Wang,Ao Sun,Jie Gui,Liang Wang

Main category: cs.CV

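TL;DR: The paper proposes a framework called PGPFR for class-incremental gesture recognition that markedly improves recognition performance through techniques such as pseudo feature generation and prototype replay.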

Details

Motivation: Most existing gesture recognition methods focus on closed-set scenarios and struggle with unseen or newly added gesture classes, so an incremental learning method that adapts dynamically to new classes is needed. Method: The PGPFR framework has four components: PFGBP (Pseudo Feature Generation with Batch Prototypes), VPR (Variational Prototype Replay), TCE (Truncated Cross-Entropy), and CCRT (Continual Classifier Re-Training), which address feature generation, retention of old-class prototypes, adaptation to new classes, and classifier stability, respectively. Result: On the SHREC 2017 3D and EgoGesture 3D datasets, PGPFR's mean global accuracy exceeds existing methods by 11.8% and 12.8%, respectively. Conclusion: The PGPFR framework effectively mitigates catastrophic forgetting in class-incremental gesture recognition, markedly improves recognition performance, and has broad application potential. Abstract: Gesture recognition is an important research area in the field of computer vision. Most gesture recognition efforts focus on closed-set scenarios, thereby limiting the capacity to effectively handle unseen or novel gestures. We aim to address class-incremental gesture recognition, which entails the ability to accommodate new and previously unseen gestures over time. Specifically, we introduce a Prototype-Guided Pseudo Feature Replay (PGPFR) framework for data-free class-incremental gesture recognition. This framework comprises four components: Pseudo Feature Generation with Batch Prototypes (PFGBP), Variational Prototype Replay (VPR) for old classes, Truncated Cross-Entropy (TCE) for new classes, and Continual Classifier Re-Training (CCRT). To tackle the issue of catastrophic forgetting, the PFGBP dynamically generates a diversity of pseudo features in an online manner, leveraging class prototypes of old classes along with batch class prototypes of new classes. Furthermore, the VPR enforces consistency between the classifier's weights and the prototypes of old classes, leveraging class prototypes and covariance matrices to enhance robustness and generalization capabilities. The TCE mitigates the impact of domain differences of the classifier caused by pseudo features. Finally, the CCRT training strategy is designed to prevent overfitting to new classes and ensure the stability of features extracted from old classes. Extensive experiments conducted on two widely used gesture recognition datasets, namely SHREC 2017 3D and EgoGesture 3D, demonstrate that our approach outperforms existing state-of-the-art methods by 11.8% and 12.8% in terms of mean global accuracy, respectively. The code is available at https://github.com/sunao-101/PGPFR-3/.

[230] Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Zheqi Lv,Junhao Chen,Qi Tian,Keting Yin,Shengyu Zhang,Fei Wu

Main category: cs.CV

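TL;DR: The PPAD framework introduces a multimodal large language model (MLLM) as a semantic observer that corrects semantic inconsistencies in real time during diffusion generation, markedly improving image quality and prompt alignment.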

Details

Motivation: Current diffusion models lack interpretable semantic supervision and correction mechanisms during inference, so generated images often show object confusion, spatial errors, and similar problems that hurt prompt alignment and image quality. Method: Proposes the PPAD framework, which uses an MLLM to analyze intermediate generations in real time, identify semantic inconsistencies, and guide the subsequent denoising steps through controllable signals. Result: Experiments show that PPAD achieves semantic correction at very few diffusion steps and markedly improves image quality and prompt alignment. Conclusion: PPAD gives diffusion models an effective semantic correction mechanism with strong generality and scalability. Abstract: Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies, making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings, and performs semantic correction at only a very small number of diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD's significant improvements.

[231] PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation

Hongsong Wang,Yin Zhu,Qiuxia Lai,Yang Zhang,Guo-Sen Xie,Xin Geng

Main category: cs.CV

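TL;DR: PAMD is a diffusion-based dance generation framework that uses physical constraints and motion guidance to generate dance motions that are aligned with the music and physically plausible.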

Details

Motivation: Existing dance generation methods fall short on physical plausibility, which PAMD aims to address. Method: Proposes the Plausible Motion Constraint (PMC) and Prior Motion Guidance (PMG), combined with the Motion Refinement with Foot-ground Contact (MRFC) module to refine motions. Result: Experiments show that PAMD significantly improves the musical alignment and physical plausibility of motions. Conclusion: PAMD offers a more realistic and physically plausible solution for dance generation. Abstract: Computational dance generation is crucial in many areas, such as art, human-computer interaction, virtual reality, and digital entertainment, particularly for generating coherent and expressive long dance sequences. Diffusion-based music-to-dance generation has made significant progress, yet existing methods still struggle to produce physically plausible motions. To address this, we propose Plausibility-Aware Motion Diffusion (PAMD), a framework for generating dances that are both musically aligned and physically realistic. The core of PAMD lies in the Plausible Motion Constraint (PMC), which leverages Neural Distance Fields (NDFs) to model the actual pose manifold and guide generated motions toward a physically valid pose manifold. To provide more effective guidance during generation, we incorporate Prior Motion Guidance (PMG), which uses standing poses as auxiliary conditions alongside music features. To further enhance realism for complex movements, we introduce the Motion Refinement with Foot-ground Contact (MRFC) module, which addresses foot-skating artifacts by bridging the gap between the optimization objective in linear joint position space and the data representation in nonlinear rotation space. Extensive experiments show that PAMD significantly improves musical alignment and enhances the physical plausibility of generated motions. The project page is available at: https://mucunzhuzhu.github.io/PAMD-page/.

[232] M3DHMR: Monocular 3D Hand Mesh Recovery

Yihong Lin,Xianjia Wu,Xilai Wang,Jianqiao Hu,Songju Lei,Xiandong Li,Wenxiong Kang

Main category: cs.CV

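TL;DR: Proposes a new method, M3DHMR, that directly estimates the 3D positions of hand mesh vertices from a single image, using Dynamic Spiral Convolution layers and an ROI layer, and significantly outperforms existing real-time methods.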

Details

Motivation: Because of the hand's many degrees of freedom, 2D-to-3D ambiguity, and self-occlusion, existing methods are either inefficient or indirect, so a more efficient and direct way to predict 3D mesh vertex positions is needed. Method: M3DHMR uses Dynamic Spiral Convolution (DSC) layers that adaptively adjust weights and extract vertex features, and an ROI layer that uses physical information to refine the mesh vertices of each predefined hand region. Result: Experiments on the FreiHAND dataset show that M3DHMR significantly outperforms existing real-time methods. Conclusion: M3DHMR provides an efficient and direct approach to monocular 3D hand mesh recovery that addresses the shortcomings of existing methods. Abstract: Monocular 3D hand mesh recovery is challenging due to the high degrees of freedom of hands, 2D-to-3D ambiguity, and self-occlusion. Most existing methods are either inefficient or less straightforward for predicting the position of 3D mesh vertices. Thus, we propose a new pipeline called Monocular 3D Hand Mesh Recovery (M3DHMR) to directly estimate the positions of hand mesh vertices. M3DHMR provides 2D cues for 3D tasks from a single image and uses a new spiral decoder consisting of several Dynamic Spiral Convolution (DSC) Layers and a Region of Interest (ROI) Layer. On the one hand, the DSC Layers adaptively adjust the weights based on the vertex positions and extract the vertex features in both spatial and channel dimensions. On the other hand, the ROI Layer utilizes the physical information and refines mesh vertices in each predefined hand region separately. Extensive experiments on the popular FreiHAND dataset demonstrate that M3DHMR significantly outperforms state-of-the-art real-time methods.

[233] AdaTP: Attention-Debiased Token Pruning for Video Large Language Models

Fengyuan Sun,Leqi Shen,Hui Chen,Sicheng Zhao,Jungong Han,Guiguang Ding

Main category: cs.CV

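TL;DR: AdaTP is an attention-debiased token pruning method for video large language models that addresses global and local attention bias, significantly reducing computational overhead while preserving model performance.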

Details

Motivation: Existing visual token compression methods rely on attention scores from the language model, but these scores carry global and local biases that lead to poor computational efficiency. Method: Proposes AdaTP, which integrates two debiasing modules targeting global and local attention bias, respectively, and reduces computational overhead without additional training. Result: Performs strongly on multiple video understanding benchmarks; for example, on LLaVA-OneVision-7B it needs only 27.3% of the FLOPs with no loss in performance. Conclusion: AdaTP offers an efficient token pruning solution for video large language models without performance loss. Abstract: Video Large Language Models (Video LLMs) have achieved remarkable results in video understanding tasks. However, they often suffer from heavy computational overhead due to the large number of visual tokens generated from multiple video frames. Existing visual token compression methods often rely on attention scores from language models as guidance. However, these scores exhibit inherent biases: global bias reflects a tendency to focus on the two ends of the visual token sequence, while local bias leads to an over-concentration on the same spatial positions across different frames. To address the issue of attention bias, we propose Attention-Debiased Token Pruning for Video Large Language Models (AdaTP), a novel token pruning pipeline for Video LLMs. AdaTP integrates two dedicated debiasing modules into the pipeline, targeting global attention bias and local attention bias, respectively. Without the need for additional training, our method significantly reduces the computational overhead of Video LLMs while retaining the performance of vanilla models. Extensive evaluation shows that AdaTP achieves state-of-the-art performance in various commonly used video understanding benchmarks. In particular, on LLaVA-OneVision-7B, AdaTP maintains performance without degradation while using only up to 27.3% of the FLOPs of the vanilla model. Our code will be released soon.
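
The attention-score pruning that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the baseline idea only (keep the top-scoring visual tokens), not AdaTP's debiasing modules, whose details the abstract does not give. The function name `prune_by_attention` and the score values are hypothetical.

```python
def prune_by_attention(scores, keep_ratio):
    """Keep the highest-scoring visual tokens; return their indices in original order."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one token.
    k = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by attention score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore sequence order so the surviving tokens keep their relative positions.
    return sorted(ranked[:k])


# Hypothetical scores for 8 visual tokens; both ends of the sequence are inflated,
# mimicking the "global bias" described in the abstract.
scores = [0.30, 0.05, 0.08, 0.12, 0.10, 0.06, 0.04, 0.25]
kept = prune_by_attention(scores, keep_ratio=0.25)
print(kept)
```

With scores like these, plain top-k selection keeps only the two end tokens and drops everything in the middle, which is the failure mode a debiasing step would need to correct.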

[234] From Data to Modeling: Fully Open-vocabulary Scene Graph Generation

Zuyao Chen,Jinlin Wu,Zhen Lei,Chang Wen Chen

Main category: cs.CV

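TL;DR: OvSGTR is a transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. It jointly predicts objects and relationships beyond predefined categories, uses relation-aware pre-training and a visual-concept retention mechanism, and performs strongly on the VG150 benchmark.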

Details

Motivation: Conventional methods restrict object and relationship recognition to a fixed vocabulary and cope poorly with the novel concepts that frequently emerge in the real world. OvSGTR aims to solve this and enable more general visual understanding. Method: Uses a DETR-like architecture with a frozen image backbone and text encoder to extract features, which a transformer decoder fuses for end-to-end prediction. Proposes a relation-aware pre-training strategy that generates scene graph annotations with weak supervision, and introduces a visual-concept retention mechanism to prevent catastrophic forgetting. Result: On the VG150 benchmark, OvSGTR achieves state-of-the-art performance in the closed-set, open-vocabulary object detection, relation recognition, and fully open-vocabulary settings. Conclusion: OvSGTR demonstrates the potential of large-scale relation-aware pre-training and transformer architectures for improving the generality and reliability of scene graph generation. Abstract: We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines (scene parser-based, LLM-based, and multimodal LLM-based) to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.

[235] MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

Anh Thai,Stefan Stojanov,Zixuan Huang,Bikram Boote,James M. Rehg

Main category: cs.CV

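TL;DR: MEBench is a new benchmark for evaluating mutual exclusivity (ME) bias that incorporates spatial reasoning, giving vision-language models (VLMs) a more challenging evaluation setting.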

Details

Motivation: Traditional ME tasks do not adequately reflect real-world scenarios, so more complex evaluation methods are needed. Method: Introduces spatial reasoning and new evaluation metrics, and develops a flexible data generation pipeline. Result: Evaluates existing VLMs on MEBench and reports their capabilities under the new metrics. Conclusion: MEBench provides a more comprehensive tool for studying ME bias and supports more complex future experiments. Abstract: This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. We assess the performance of state-of-the-art vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes.

[236] TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos

Fanheng Kong,Jingyuan Zhang,Hongzhi Zhang,Shi Feng,Daling Wang,Linhao Yu,Xingguang Ji,Yu Tian,Qi Wang,Fuzheng Zhang

Main category: cs.CV

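TL;DR: TUNA is a temporal-oriented video understanding benchmark focused on fine-grained understanding of dense dynamic videos. It evaluates models through captioning and QA tasks and reveals key challenges that existing models face in temporal understanding.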

Details

Motivation: Existing video understanding benchmarks tend to treat temporal elements (such as camera, scene, and action) separately or focus only on specific aspects, overlooking the holistic nature of video content. TUNA aims to fill this gap. Method: Introduces the TUNA benchmark, which covers diverse video scenarios and dynamics, evaluates models through captioning and QA tasks, and provides interpretable and robust evaluation criteria. Result: The evaluation shows that existing models fall short in action description, multi-subject understanding, and sensitivity to camera motion. Conclusion: TUNA offers valuable insights for improving video understanding models; the data and code are publicly available. Abstract: Videos are unique in their integration of temporal elements, including camera, scene, action, and attribute, along with their dynamic relationships over time. However, existing benchmarks for video understanding often treat these properties separately or narrowly focus on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding on dense dynamic videos, with two complementary tasks: captioning and QA. Our TUNA features diverse video scenarios and dynamics, assisted by interpretable and robust evaluation criteria. We evaluate several leading models on our benchmark, providing fine-grained performance assessments across various dimensions. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering valuable insights for improving video understanding models. The data and code are available at https://friedrichor.github.io/projects/TUNA.

[237] OB3D: A New Dataset for Benchmarking Omnidirectional 3D Reconstruction Using Blender

Shintaro Ito,Natsuki Takama,Toshiki Watanabe,Koichi Ito,Hwann-Tzong Chen,Takafumi Aoki

Main category: cs.CV

TL;DR: Introduces the Omnidirectional Blender 3D (OB3D) dataset for addressing geometric distortion in 3D reconstruction from multiple 360-degree omnidirectional images.

Details

Motivation: Existing datasets lack systematic benchmarking of the challenges specific to omnidirectional images, which limits progress in 3D reconstruction. Method: Generates diverse, complex scenes from Blender 3D projects and provides comprehensive ground truth, including RGB images, camera parameters, and depth and normal maps. Result: The OB3D dataset provides a standardized evaluation environment for 3D reconstruction from omnidirectional images. Conclusion: OB3D is expected to drive further progress in 3D reconstruction and improve the accuracy and reliability of reconstruction from omnidirectional images. Abstract: Recent advancements in radiance field rendering, exemplified by Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have significantly progressed 3D modeling and reconstruction. The use of multiple 360-degree omnidirectional images for these tasks is increasingly favored due to advantages in data acquisition and comprehensive scene capture. However, the inherent geometric distortions in common omnidirectional representations, such as equirectangular projection (particularly severe in polar regions and varying with latitude), pose substantial challenges to achieving high-fidelity 3D reconstructions. Current datasets, while valuable, often lack the specific focus, scene composition, and ground truth granularity required to systematically benchmark and drive progress in overcoming these omnidirectional-specific challenges. To address this critical gap, we introduce Omnidirectional Blender 3D (OB3D), a new synthetic dataset curated for advancing 3D reconstruction from multiple omnidirectional images. OB3D features diverse and complex 3D scenes generated from Blender 3D projects, with a deliberate emphasis on challenging scenarios. The dataset provides comprehensive ground truth, including omnidirectional RGB images, precise omnidirectional camera parameters, and pixel-aligned equirectangular maps for depth and normals, alongside evaluation metrics. By offering a controlled yet challenging environment, OB3D aims to facilitate the rigorous evaluation of existing methods and prompt the development of new techniques to enhance the accuracy and reliability of 3D reconstruction from omnidirectional images.
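
The latitude-dependent distortion of equirectangular projection mentioned in the abstract can be made concrete with a short sketch. It maps a pixel to a unit viewing direction and reports how much a row is horizontally over-sampled relative to the equator. The axis convention (y up, z forward) and the function names are assumptions made for illustration, not OB3D's documented conventions.

```python
import math


def pixel_to_direction(u, v, width, height):
    """Map the centre of equirectangular pixel (u, v) to a unit direction vector.

    Assumed convention: longitude spans [-pi, pi) left to right, latitude runs
    from +pi/2 (top row) to -pi/2 (bottom row); y is up and z is forward.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def horizontal_stretch(v, height):
    """Horizontal over-sampling of row v relative to the equator: 1 / cos(latitude)."""
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return 1.0 / math.cos(lat)


# Compare a row next to the pole with a row next to the equator in a 512-row image.
print(horizontal_stretch(0, 512), horizontal_stretch(255, 512))
```

Every row has the same pixel count, but the circle of latitude it covers shrinks with cos(latitude), so rows near the poles are stretched by orders of magnitude while rows near the equator are almost undistorted.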

[238] FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

Jin Wang,Yao Lai,Aoxue Li,Shifeng Zhang,Jiacheng Sun,Ning Kang,Chengyue Wu,Zhenguo Li,Ping Luo

Main category: cs.CV

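TL;DR: FUDOKI is a unified multimodal model based on discrete flow matching that challenges the prevailing autoregressive architectures and, through iterative refinement and bidirectional context integration, performs strongly on visual understanding and image generation tasks.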

Details

Motivation: Existing MLLMs rely mainly on autoregressive architectures, which impose a fixed order on image generation and limit reasoning ability; FUDOKI aims to overcome these limitations. Method: FUDOKI is based on discrete flow matching and uses metric-induced probability paths with kinetic-optimal velocities, supporting iterative refinement and bidirectional context integration. Result: FUDOKI performs comparably to existing AR-based MLLMs on visual understanding and image generation tasks, and test-time scaling techniques yield significant further gains. Conclusion: FUDOKI shows potential as a next-generation unified multimodal model and could be improved further through reinforcement learning. Abstract: The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

[239] Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

Kai Sun,Yushi Bai,Zhen Yang,Jiajie Zhang,Ji Qi,Lei Hou,Juanzi Li

Main category: cs.CV

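TL;DR: The paper proposes a new hard negative contrastive learning framework, MMCLIP, to improve the geometric understanding of vision encoders, and trains an LMM, MMGeoLM, that performs strongly on geometric reasoning tasks.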

Details

Motivation: Existing contrastive learning methods are limited on geometric reasoning tasks and need improvement to strengthen models' fine-grained reasoning. Method: Combines image-based and text-based hard negative contrastive learning, with generation-based and rule-based negatives, to train the CLIP model MMCLIP, and then trains the LMM MMGeoLM. Result: MMGeoLM significantly outperforms other open-source models on three geometric reasoning benchmarks, and at 7B scale its performance rivals GPT-4o. Conclusion: Hard negative contrastive learning effectively improves geometric reasoning, and the construction method and number of negatives have a significant effect on performance. Abstract: Benefiting from contrastively trained visual encoders on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across various visual perception tasks. However, the inherent limitations of contrastive learning upon summarized descriptions fundamentally restrict the capabilities of models in meticulous reasoning, particularly in crucial scenarios of geometric problem-solving. To enhance geometric understanding, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train CLIP using our strong negative learning method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further study the impact of different negative sample construction methods and the number of negative samples on the geometric reasoning performance of LMM, yielding fruitful conclusions. The code and dataset are available at https://github.com/THU-KEG/MMGeoLM.
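
Why hard negatives matter for a contrastive objective can be illustrated with a standard InfoNCE loss on toy embeddings. This is a generic sketch, not the MMCLIP training code: the 2-D vectors, the temperature value, and the function names are assumptions chosen for illustration.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for one anchor: -log softmax of the positive over all candidates."""
    a = _normalize(anchor)
    logits = [_dot(a, _normalize(positive)) / temperature]
    logits += [_dot(a, _normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]
positive = [1.0, 0.1]
hard_negative = [1.0, 0.2]   # nearly as close to the anchor as the positive
easy_negative = [0.0, 1.0]   # orthogonal to the anchor

loss_hard = info_nce(anchor, positive, [hard_negative])
loss_easy = info_nce(anchor, positive, [easy_negative])
print(loss_hard, loss_easy)
```

The easy negative contributes almost nothing to the loss, so it provides little gradient signal, whereas the hard negative keeps the loss high until the encoder separates it from the positive. Negatives built by slightly perturbing a diagram or its description play the role of `hard_negative` here.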

[240] HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters

Yi Chen,Sen Liang,Zixiang Zhou,Ziyao Huang,Yifeng Ma,Junshu Tang,Qin Lin,Yuan Zhou,Qinglin Lu

Main category: cs.CV

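TL;DR: HunyuanVideo-Avatar is a multimodal diffusion transformer (MM-DiT)-based model that addresses dynamic video generation, emotion alignment, and multi-character animation in audio-driven animation.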

Details

Motivation: Audio-driven animation still faces challenges in dynamic video generation, emotion alignment, and multi-character animation, and needs a more effective solution. Method: The model introduces three key innovations: a character image injection module, an Audio Emotion Module (AEM), and a Face-Aware Audio Adapter (FAA), which address character consistency, emotion control, and multi-character audio injection, respectively. Result: HunyuanVideo-Avatar surpasses existing methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios. Conclusion: Through these innovations the model addresses key problems in audio-driven animation and shows strong capability in dynamic multi-character scenes. Abstract: Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) A character image injection module is designed to replace the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference. This ensures dynamic motion and strong character consistency; (ii) An Audio Emotion Module (AEM) is introduced to extract and transfer the emotional cues from an emotion reference image to the target generated video, enabling fine-grained and accurate emotion style control; (iii) A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention for multi-character scenarios. These innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios.

[241] Long-Context State-Space Video World Models

Ryan Po,Yotam Nitzan,Richard Zhang,Berlin Chen,Tri Dao,Eli Shechtman,Gordon Wetzstein,Xun Huang

Main category: cs.CV

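TL;DR: Proposes a new architecture combining state-space models (SSMs) with local attention to address the limited long-term memory of video diffusion models, and validates it experimentally.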

Details

Motivation: Video diffusion models have limited long-term memory, mainly because of the high computational cost of processing long sequences in attention layers. Method: Uses state-space models (SSMs) to extend temporal memory, combined with a block-wise SSM scanning scheme and local attention. Result: On the Memory Maze and Minecraft datasets, the model outperforms baselines on long-term memory tasks while maintaining efficient inference speeds. Conclusion: The new architecture addresses the long-term memory problem and is suitable for interactive applications. Abstract: Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.
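
The reason an SSM can carry memory over long horizons at constant cost per step can be seen in the basic linear recurrence. The sketch below is a scalar, time-invariant toy (h_t = a*h_{t-1} + b*x_t, y_t = c*h_t), not the block-wise scanning scheme proposed in the paper. The parameter names `a`, `b`, `c` follow common SSM notation and the values are arbitrary.

```python
def ssm_scan(inputs, a, b, c, h0=0.0):
    """Run the scalar recurrence h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t."""
    h = h0
    outputs = []
    for x in inputs:
        # The whole history is summarised in the single state h,
        # so each step costs O(1) regardless of sequence length.
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


# An impulse at t = 0 followed by silence: the state decays geometrically with rate a.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
print(ys)
```

An attention layer would instead revisit every earlier input at each step, which is the cost the abstract identifies as the obstacle to long-term memory. The decay rate `a` controls how long the impulse remains visible in the output.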

[242] PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology

Jiabo Ma,Yingxue Xu,Fengtao Zhou,Yihui Wang,Cheng Jin,Zhengrui Guo,Jianfeng Wu,On Ki Tang,Huajun Zhou,Xi Wang,Luyang Luo,Zhengyu Zhang,Du Cai,Zizhao Gao,Wei Wang,Yueping Liu,Jiankun He,Jing Cui,Zhenhui Li,Jing Zhang,Feng Gao,Xiuming Zhang,Li Liang,Ronald Cheong Kin Chan,Zhe Wang,Hao Chen

Main category: cs.CV

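TL;DR: PathBench is a comprehensive benchmark for pathology foundation models (PFMs) that addresses shortcomings of existing evaluations, such as data leakage and incomplete task coverage, and advances the clinical translation of PFMs through multi-center datasets and an automated evaluation system.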

Details

Motivation: Pathology foundation models hold great potential for cancer diagnosis and prognosis, but their clinical translation faces challenges including model generalizability, data leakage, and standardized evaluation. Method: PathBench comprehensively evaluates 19 PFMs using multi-center datasets (15,888 WSIs), strict leakage prevention, and an automated evaluation system. Result: Virchow2 and H-Optimus-1 perform best, and PathBench provides a reliable platform for model development and clinical decision-making. Conclusion: PathBench accelerates the clinical translation of PFMs and provides a standardized evaluation tool for research and practice. Abstract: The emergence of pathology foundation models (PFMs) has revolutionized computational histopathology, enabling highly accurate, generalized whole-slide image analysis for improved cancer diagnosis and prognosis assessment. While these models show remarkable potential across cancer diagnostics and prognostics, their clinical translation faces critical challenges including variability in the optimal model across cancer types, potential data leakage in evaluation, and lack of standardized benchmarks. Without rigorous, unbiased evaluation, even the most advanced PFMs risk remaining confined to research settings, delaying their life-saving applications. Existing benchmarking efforts remain limited by narrow cancer-type focus, potential pretraining data overlaps, or incomplete task coverage. We present PathBench, the first comprehensive benchmark addressing these gaps through: multi-center in-house datasets spanning common cancers with rigorous leakage prevention, evaluation across the full clinical spectrum from diagnosis to prognosis, and an automated leaderboard system for continuous model assessment. Our framework incorporates large-scale data, enabling objective comparison of PFMs while reflecting real-world clinical complexity. All evaluation data comes from private medical providers, with strict exclusion of any pretraining usage to avoid data leakage risks. We have collected 15,888 WSIs from 8,549 patients across 10 hospitals, encompassing over 64 diagnosis and prognosis tasks. Currently, our evaluation of 19 PFMs shows that Virchow2 and H-Optimus-1 are the most effective models overall. This work provides researchers with a robust platform for model development and offers clinicians actionable insights into PFM performance across diverse clinical scenarios, ultimately accelerating the translation of these transformative technologies into routine pathology practice.

[243] Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models

Weihao Xuan,Qingcheng Zeng,Heli Qi,Junjue Wang,Naoto Yokoya

Main category: cs.CV

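TL;DR: The paper studies the effectiveness of verbalized uncertainty in vision-language models (VLMs), finds that current models are miscalibrated across tasks and settings, and proposes a method to improve calibration.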

Details

Motivation: To evaluate how well verbalized uncertainty works in VLMs and fill a gap in existing research. Method: Comprehensively evaluates verbalized confidence across three model categories, four task domains, and three evaluation scenarios, and proposes Visual Confidence-Aware Prompting. Result: Current VLMs are miscalibrated across tasks, while visual reasoning models are better calibrated; the proposed method improves confidence calibration in multimodal settings. Conclusion: The study underscores the importance of modality alignment and model faithfulness for reliable multimodal systems. Abstract: Uncertainty quantification is essential for assessing the reliability and trustworthiness of modern AI systems. Among existing approaches, verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution in large language models (LLMs). However, its effectiveness in vision-language models (VLMs) remains insufficiently studied. In this work, we conduct a comprehensive evaluation of verbalized confidence in VLMs, spanning three model categories, four task domains, and three evaluation scenarios. Our results show that current VLMs often display notable miscalibration across diverse tasks and settings. Notably, visual reasoning models (i.e., thinking with images) consistently exhibit better calibration, suggesting that modality-specific reasoning is critical for reliable uncertainty estimation. To further address calibration challenges, we introduce Visual Confidence-Aware Prompting, a two-stage prompting strategy that improves confidence alignment in multimodal settings. Overall, our study highlights the inherent miscalibration in VLMs across modalities. More broadly, our findings underscore the fundamental importance of modality alignment and model faithfulness in advancing reliable multimodal systems.
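
Miscalibration of verbalized confidence is commonly quantified with the Expected Calibration Error (ECE). The abstract does not say which metric the paper reports, so the following is a generic sketch of equal-width-bin ECE rather than the authors' evaluation code. The function name, the bin count, and the sample data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bins are [k/n, (k+1)/n); a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# A model that says "90% sure" four times but is right only three times.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(ece)
```

Here the stated confidence (0.9) exceeds the observed accuracy (0.75), so the ECE is about 0.15. A perfectly calibrated model would score 0.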

[244] AniCrafter: Customizing Realistic Human-Centric Animation via Avatar-Background Conditioning in Video Diffusion Models

Muyao Niu,Mingdeng Cao,Yifan Zhan,Qingtian Zhu,Mingze Ma,Jiancheng Zhao,Yanhong Zeng,Zhihang Zhong,Xiao Sun,Yinqiang Zheng

Main category: cs.CV

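TL;DR: AniCrafter is a diffusion-based human animation method that seamlessly integrates and animates a given character in dynamic backgrounds, addressing the limitations of existing methods in open-domain scenarios.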

Details

Motivation: Current human animation methods that rely on structural conditions such as DWPose or SMPL-X are of limited effectiveness with dynamic backgrounds or challenging poses; AniCrafter aims to address this. Method: Built on advanced Image-to-Video diffusion architectures, it introduces an "avatar-background" conditioning mechanism that reframes open-domain human animation as a restoration task. Result: Experimental results show that AniCrafter performs strongly in stability and versatility. Conclusion: AniCrafter offers a more effective solution for open-domain human animation, and the code will be open-sourced. Abstract: Recent advances in video diffusion models have significantly improved character animation techniques. However, current approaches rely on basic structural conditions such as DWPose or SMPL-X to animate character images, limiting their effectiveness in open-domain scenarios with dynamic backgrounds or challenging human poses. In this paper, we introduce AniCrafter, a diffusion-based human-centric animation model that can seamlessly integrate and animate a given character into open-domain dynamic backgrounds while following given human motion sequences. Built on cutting-edge Image-to-Video (I2V) diffusion architectures, our model incorporates an innovative "avatar-background" conditioning mechanism that reframes open-domain human-centric animation as a restoration task, enabling more stable and versatile animation outputs. Experimental results demonstrate the superior performance of our method. Code will be available at https://github.com/MyNiuuu/AniCrafter.

[245] Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Hao Zhong,Muzhi Zhu,Zongze Du,Zheng Huang,Canyu Zhao,Mingyu Liu,Wen Wang,Hao Chen,Chunhua Shen

Main category: cs.CV

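TL;DR: The paper proposes a two-system architecture (Omni-R1) that uses reinforcement learning to resolve the conflict between resolution and coverage in video-audio reasoning, and performs strongly on multiple benchmarks.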

Details

Motivation: To resolve the conflict between the different resolution and coverage requirements that long-horizon video-audio reasoning and fine-grained pixel understanding place on a model. Method: Uses a two-system architecture: a global reasoning system selects keyframes and rewrites the task, and a detail understanding system processes high-resolution snippets; keyframe selection and task rewriting are optimized with reinforcement learning. Result: Surpasses supervised baselines and specialized SOTA models on the RefAVS and REVOS benchmarks, improves out-of-domain generalization, and reduces multimodal hallucination. Conclusion: The first successful application of reinforcement learning to large-scale omnimodal reasoning, offering a scalable path toward universal foundation models. Abstract: Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, namely Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
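
Group Relative Policy Optimization, which the abstract names as the basis of Omni-R1, scores each sampled response relative to the other responses drawn for the same prompt instead of using a learned value function. The sketch below shows only that advantage normalisation step, under the common formulation (reward minus group mean, divided by group standard deviation). The function name, the `eps` constant, and the use of the population standard deviation are assumptions, and Omni-R1's hierarchical reward is not modelled.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps guards against division by zero when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate keyframe selections on the same query.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Candidates that beat the group average receive a positive advantage and are reinforced, and those below it receive a negative one. If all candidates earn the same reward, every advantage is zero and the group contributes no gradient.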

[246] HaloGS: Loose Coupling of Compact Geometry and Gaussian Splats for 3D Scenes

Changjian Jiang,Kerui Ren,Linning Xu,Jiong Chen,Jiangmiao Pang,Yu Zhang,Bo Dai,Mulin Yu

Main category: cs.CV

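TL;DR: HaloGS proposes a dual representation that combines coarse triangles for geometry with Gaussian primitives for appearance, enabling efficient, high-fidelity 3D reconstruction and rendering.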

Details

Motivation: Existing methods typically fuse geometry and appearance into a single complex model or adopt hybrid schemes, which forces a trade-off between efficiency and fidelity. HaloGS aims to resolve this by loosely coupling a lightweight classic geometry representation with Gaussian primitives. Method: HaloGS uses a dual representation, with coarse triangles for geometry and Gaussian primitives for appearance, that adapts to varying scene complexity. Result: Experiments show that HaloGS achieves compact, accurate geometry reconstruction and high-fidelity rendering on multiple benchmark datasets, particularly in complex scenes. Conclusion: Through its dual-representation design, HaloGS balances efficiency and fidelity and is suitable for both indoor and outdoor environments. Abstract: High-fidelity 3D reconstruction and rendering hinge on capturing precise geometry while preserving photorealistic detail. Most existing methods either fuse these goals into a single cumbersome model or adopt hybrid schemes whose uniform primitives lead to a trade-off between efficiency and fidelity. In this paper, we introduce HaloGS, a dual representation that loosely couples coarse triangles for geometry with Gaussian primitives for appearance, motivated by the lightweight classic geometry representations and their proven efficiency in real-world applications. Our design yields a compact yet expressive model capable of photorealistic rendering across both indoor and outdoor environments, seamlessly adapting to varying levels of scene complexity. Experiments on multiple benchmark datasets demonstrate that our method yields both compact, accurate geometry and high-fidelity renderings, especially in challenging scenarios where robust geometric structure makes a clear difference.

[247] ParticleGS: Particle-Based Dynamics Modeling of 3D Gaussians for Prior-free Motion Extrapolation

Jinsheng Quan,Chunshi Wang,Yawei Luo

Main category: cs.CV

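TL;DR: Proposes a motion extrapolation framework for dynamic 3D Gaussian Splatting based on particle dynamics systems. It requires no manually defined physical priors, models Gaussian particle dynamics by learning differential equations, and significantly improves future-frame extrapolation.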

Details

Motivation: Existing dynamic 3D reconstruction methods struggle to learn the underlying dynamics effectively or rely on manually defined physical priors, which limits extrapolation. Method: Introduces a dynamics latent state vector and an encoder, designs a Neural ODE-based dynamics module to model Gaussian particle dynamics, and uses a decoder to convert latent states into deformations. Result: Rendering quality on reconstruction tasks is comparable to existing methods, while future-frame extrapolation is significantly better. Conclusion: By learning differential equations that model Gaussian particle dynamics, the method significantly improves the extrapolation capability of dynamic 3D reconstruction. Abstract: This paper aims to model the dynamics of 3D Gaussians from visual observations to support temporal extrapolation. Existing dynamic 3D reconstruction methods often struggle to effectively learn underlying dynamics or rely heavily on manually defined physical priors, which limits their extrapolation capabilities. To address this issue, we propose a novel dynamic 3D Gaussian Splatting prior-free motion extrapolation framework based on particle dynamics systems. The core advantage of our method lies in its ability to learn differential equations that describe the dynamics of 3D Gaussians, and follow them during future frame extrapolation. Instead of simply fitting to the observed visual frame sequence, we aim to more effectively model the Gaussian particle dynamics system. To this end, we introduce a dynamics latent state vector into the standard Gaussian kernel and design a dynamics latent space encoder to extract the initial state. Subsequently, we introduce a Neural ODE-based dynamics module that models the temporal evolution of Gaussians in the dynamics latent space. Finally, a Gaussian kernel space decoder is used to decode the latent state at a specific time step into the deformation. Experimental results demonstrate that the proposed method achieves rendering quality comparable to existing approaches in reconstruction tasks, and significantly outperforms them in future frame extrapolation. Our code is available at https://github.com/QuanJinSheng/ParticleGS.
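
The extrapolation step described above amounts to integrating a learned differential equation forward from an initial latent state. The sketch below shows that idea with a fixed-step explicit Euler integrator. The `dynamics` callable stands in for the learned module, and the decay dynamics used in the example are hypothetical; the paper's actual solver and network are not specified in the abstract.

```python
def euler_rollout(z0, dynamics, t0, t1, num_steps):
    """Integrate dz/dt = dynamics(z, t) from t0 to t1 with explicit Euler.

    `z0` is the initial latent state as a list of floats. `dynamics` is a
    placeholder for a learned function returning dz/dt as a list of the
    same length. Returns the latent state at time t1.
    """
    dt = (t1 - t0) / num_steps
    z, t = list(z0), t0
    for _ in range(num_steps):
        dz = dynamics(z, t)
        z = [zi + dt * dzi for zi, dzi in zip(z, dz)]
        t += dt
    return z


def decay(z, t):
    """Hypothetical stand-in dynamics: every component decays as dz/dt = -z."""
    return [-zi for zi in z]


# Roll the latent state past the "observed" window, from t=0 to t=1.
z_future = euler_rollout([1.0, 2.0], decay, 0.0, 1.0, 1000)
```

Since the state at any future time comes from following the same equation, extrapolating beyond the training frames needs no extra machinery, only a later `t1`.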

[248] Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

Meng Cao,Haoze Zhao,Can Zhang,Xiaojun Chang,Ian Reid,Xiaodan Liang

Main category: cs.CV

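TL;DR: Ground-R1 is a reinforcement learning framework that needs no explicit evidence or rationale annotations; it improves the reliability and interpretability of visual reasoning by generating visual evidence regions and rewarding answer correctness.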

Details

Motivation: Large Vision-Language Models (LVLMs) perform well on multimodal tasks, but their reasoning produces unreliable outputs with limited interpretability. Existing methods rely on costly supervision, which limits scalability. Method: Proposes the Ground-R1 framework, consisting of a grounding phase that generates visual evidence regions and an answering phase guided by answer-correctness and format rewards. Result: Performs strongly on multiple visual reasoning benchmarks and exhibits cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement. Conclusion: Ground-R1 offers a scalable and interpretable alternative to existing approaches. Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive general capabilities across a wide range of multi-modal tasks. However, the reasoning processes of LVLMs often suffer from unreliable outputs and limited interpretability. To address this, grounded visual reasoning has emerged as a promising paradigm that enforces responses anchored on salient visual evidence regions. However, existing approaches typically rely on costly supervision such as bounding box annotations, chain-of-thought rationales, or external tool calls, limiting their scalability. In this work, we propose Ground-R1, a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations. Ground-R1 consists of a grounding phase that generates evidence region rollouts based on format constraints, and an answering phase that produces responses guided by both answer correctness and format adherence rewards. Extensive experiments across multiple visual reasoning benchmarks show that Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement, offering a scalable and interpretable alternative to existing approaches.

[249] ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye,Xianyi He,Zongjian Li,Bin Lin,Shenghai Yuan,Zhiyuan Yan,Bohan Hou,Li Yuan

Main category: cs.CV

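TL;DR: ImgEdit is a large-scale, high-quality image-editing dataset with 1.2 million edit pairs that supports single-turn and multi-turn editing tasks; the paper also introduces the ImgEdit-E1 model and the ImgEdit-Bench benchmark.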

Details

Motivation: Open-source image-editing models lag behind proprietary ones in data quality and benchmarking; high-quality datasets and evaluation standards are needed to drive progress. Method: Builds the ImgEdit dataset through a multi-stage pipeline (including a vision-language model, a detection model, and a segmentation model) and trains the ImgEdit-E1 model. Result: ImgEdit surpasses existing datasets in task novelty and data quality, and ImgEdit-E1 outperforms other open-source models on multiple tasks. Conclusion: ImgEdit and ImgEdit-Bench provide high-quality data and evaluation tools for image editing and advance open-source models. Abstract: Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks. To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, and a segmentation model, alongside task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality. Using ImgEdit, we train ImgEdit-E1, an editing model that uses a Vision-Language Model to process the reference image and editing prompt, which outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and model design. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance in terms of instruction adherence, editing quality, and detail preservation. It includes a basic test suite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1, providing deep analysis and actionable insights into the current behavior of image-editing models. The source data are publicly available at https://github.com/PKU-YuanGroup/ImgEdit.

[250] VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan,Jian Zhang,Renjie Li,Junge Zhang,Runjin Chen,Hezhen Hu,Kevin Wang,Huaizhi Qu,Dilin Wang,Zhicheng Yan,Hongyu Xu,Justin Theiss,Tianlong Chen,Jiachen Li,Zhengzhong Tu,Zhangyang Wang,Rakesh Ranjan

Main category: cs.CV

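TL;DR: VLM-3R is a unified vision-language model framework that processes monocular video frames via 3D reconstructive instruction tuning, enabling spatial understanding and temporal reasoning.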

Details

Motivation: Extending Large Multimodal Models (LMMs) to 3D scene understanding would bring them closer to human visual-spatial intelligence, but existing methods depend on external depth sensors or pre-built 3D maps, which limits scalability. Method: Proposes the VLM-3R framework, which uses a geometry encoder to produce implicit 3D tokens and combines Spatial-Visual-View Fusion with over 200K 3D reconstructive instruction-tuning QA pairs to align spatial context with language instructions. Result: VLM-3R performs strongly in visual-spatial reasoning and temporal 3D context understanding, standing out in both accuracy and scalability. Conclusion: VLM-3R provides an effective solution for monocular 3D spatial assistance and embodied reasoning, and its performance is validated on a new benchmark. Abstract: The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or utilize off-the-shelf algorithms for pre-constructing 3D maps, thereby limiting their scalability, especially with prevalent monocular video inputs and for time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding. Leveraging our Spatial-Visual-View Fusion and over 200K curated 3D reconstructive instruction tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that our model, VLM-3R, not only facilitates robust visual-spatial reasoning but also enables the understanding of temporal 3D context changes, excelling in both accuracy and scalability.

[251] Category-Agnostic Neural Object Rigging

Guangzhao He,Chen Geng,Shangzhe Wu,Jiajun Wu

Main category: cs.CV

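TL;DR: The paper proposes a data-driven method that represents deformable 4D objects with sparse blobs and a feature volume, enabling intuitive manipulation of 3D object poses.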

Details

Motivation: Traditional methods depend on expert knowledge and do not scale; this work aims to discover low-dimensional structure automatically for better controllability. Method: Designs a new representation that encodes 4D objects as sparse blobs and an instance-aware feature volume, disentangling pose and instance information. Result: Validates the method on a variety of object categories and achieves intuitive pose manipulation. Conclusion: The proposed framework offers an effective solution for low-dimensional representation and manipulation of deformable objects. Abstract: The motion of deformable 4D objects lies on a low-dimensional manifold. To better capture the low dimensionality and enable better controllability, traditional methods have devised several heuristic-based methods, i.e., rigging, for manipulating dynamic objects in an intuitive fashion. However, such representations are not scalable due to the need for expert knowledge of specific categories. Instead, we study the automatic exploration of such low-dimensional structures in a purely data-driven manner. Specifically, we design a novel representation that encodes deformable 4D objects into a sparse set of spatially grounded blobs and an instance-aware feature volume to disentangle the pose and instance information of the 3D shape. With such a representation, we can manipulate the pose of 3D objects intuitively by modifying the parameters of the blobs, while preserving rich instance-specific information. We evaluate the proposed method on a variety of object categories and demonstrate the effectiveness of the proposed framework. Project page: https://guangzhaohe.com/canor

[252] MotionPro: A Precise Motion Controller for Image-to-Video Generation

Zhongwei Zhang,Fuchen Long,Zhaofan Qiu,Yingwei Pan,Wu Liu,Ting Yao,Tao Mei

Main category: cs.CV

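TL;DR: MotionPro proposes a precise motion controller that uses region-wise trajectories and a motion mask for fine-grained motion synthesis and for distinguishing object motion from camera motion.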

Details

Motivation: Existing methods rely on large Gaussian kernels to extend motion trajectories, leading to coarse control and confusion between motion types. Method: MotionPro uses a tracking model to estimate flow maps, samples region-wise trajectories, and combines them with a motion mask through feature modulation for precise control. Result: MotionPro's effectiveness is validated on WebVid-10M and MC-Bench, supporting fine-grained and object-level motion control. Conclusion: MotionPro significantly improves the precision and naturalness of motion control through region-wise trajectories and motion masks. Abstract: Animating images with interactive motion control has garnered popularity for image-to-video (I2V) generation. Modern approaches typically rely on large Gaussian kernels to extend motion trajectories as a condition without explicitly defining the movement region, leading to coarse motion control and failing to disentangle object and camera motion. To alleviate these issues, we present MotionPro, a precise motion controller that leverages region-wise trajectories and a motion mask to regulate fine-grained motion synthesis and identify the target motion category (i.e., object or camera motion), respectively. Technically, MotionPro first estimates the flow maps on each training video via a tracking model, and then samples the region-wise trajectories to simulate the inference scenario. Instead of extending flow through large Gaussian kernels, our region-wise trajectory approach enables more precise control by directly utilizing trajectories within local regions, thereby effectively characterizing fine-grained movements. A motion mask is simultaneously derived from the predicted flow maps to capture the holistic motion dynamics of the movement regions. To pursue natural motion control, MotionPro further strengthens video denoising by incorporating both region-wise trajectories and the motion mask through feature modulation. More remarkably, we meticulously construct a benchmark, i.e., MC-Bench, with 1.1K user-annotated image-trajectory pairs, for the evaluation of both fine-grained and object-level I2V motion control. Extensive experiments conducted on WebVid-10M and MC-Bench demonstrate the effectiveness of MotionPro. Please refer to our project page for more results: https://zhw-zhang.github.io/MotionPro-page/.

[253] Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots

Guangting Zheng,Yehao Li,Yingwei Pan,Jiajun Deng,Ting Yao,Yanyong Zhang,Tao Mei

Main category: cs.CV

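TL;DR: Hi-MAR is a new autoregressive design that improves visual generation by modeling hierarchical dependencies across multi-scale image tokens; it outperforms conventional AR models at lower computational cost.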

Details

Motivation: Conventional autoregressive models over single-scale dense image tokens cannot fully exploit global context, especially when predicting early tokens. Method: Proposes Hi-MAR, which performs hierarchical autoregressive modeling in phases from low-resolution image tokens to dense tokens, and introduces a Diffusion Transformer head to strengthen global context. Result: On class-conditional and text-to-image generation tasks, Hi-MAR outperforms conventional AR models at lower computational cost. Conclusion: Through hierarchical design and enhanced global context, Hi-MAR offers a more efficient autoregressive solution for visual generation. Abstract: Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de-facto standard of next token prediction commonly operates over a single-scale sequence of dense image tokens, and is incapable of utilizing global context, especially for early token prediction. In this paper, we introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and delve into a thorough hierarchical dependency across multi-scale image tokens. Technically, we present Hierarchical Masked Autoregressive models (Hi-MAR) that pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. In the first phase, Hi-MAR learns to predict a few image tokens in low resolution, functioning as intermediary pivots to reflect global structure. Such pivots act as additional guidance to strengthen the next autoregressive modeling phase by shaping global structural awareness of typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines, while requiring lower computational cost. Code is available at https://github.com/HiDream-ai/himar.

[254] VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

Zeyi Huang,Yuyang Ji,Anirudh Sundara Rajan,Zefan Cai,Wen Xiao,Junjie Hu,Yong Jae Lee

Main category: cs.CV

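TL;DR: VisTA is a new reinforcement learning framework that improves visual agents by dynamically exploring and selecting tools from a tool library. Unlike existing methods, VisTA optimizes its tool-selection policy with end-to-end reinforcement learning and needs no explicit supervision.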

Details

Motivation: Existing tool-augmented reasoning methods rely on training-free prompting or large-scale fine-tuning; they lack active tool exploration and offer limited tool diversity, and fine-tuning methods also require extensive human supervision. VisTA aims to address these issues. Method: VisTA adopts an end-to-end reinforcement learning framework that uses Group Relative Policy Optimization (GRPO) to autonomously discover effective tool-selection pathways, with task outcomes as feedback signals. Result: On the ChartQA, Geometry3K, and BlindTest benchmarks, VisTA substantially outperforms training-free baselines, especially on out-of-distribution examples. Conclusion: VisTA enhances generalization, adaptively uses diverse tools, and opens a path toward flexible, experience-driven visual reasoning systems. Abstract: We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning either rely on training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.

[255] Visualized Text-to-Image Retrieval

Di Wu,Yixin Wan,Kai-Wei Chang

Main category: cs.CV

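TL;DR: VisRet projects text queries into the image modality and then retrieves within that modality, significantly improving T2I retrieval performance while remaining compatible with existing retrievers.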

Details

Motivation: Address the limitations of existing cross-modal embeddings in recognizing subtle visual-spatial features. Method: First convert the text query into an image via T2I generation, then perform retrieval within the image modality. Result: Across three benchmarks, NDCG@10 improves by 24.5% to 32.7%, and downstream visual question answering accuracy also improves. Conclusion: VisRet is an effective plug-and-play module for knowledge-intensive multimodal systems. Abstract: We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval that mitigates the limitations of cross-modal similarity alignment of existing multi-modal embeddings. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Experiments on three knowledge-intensive T2I retrieval benchmarks, including a newly introduced multi-entity benchmark, demonstrate that VisRet consistently improves T2I retrieval by 24.5% to 32.7% NDCG@10 across different embedding models. VisRet also significantly benefits downstream visual question answering accuracy when used in retrieval-augmented generation pipelines. The method is plug-and-play and compatible with off-the-shelf retrievers, making it an effective module for knowledge-intensive multi-modal systems. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
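
The headline numbers above are reported in NDCG@10. For readers unfamiliar with the metric, here is a minimal sketch of one common formulation. It assumes linear gain (some evaluations use 2^rel - 1 instead) and that the relevance list covers every judged candidate, so the ideal ranking can be obtained by sorting that same list; the paper's exact evaluation script may differ.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top hit divides by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking.

    `relevances[i]` is the graded relevance of the item ranked at position i.
    Returns 0.0 when no item is relevant.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# A single relevant image ranked second instead of first.
score = ndcg_at_k([0, 1, 0, 0])
```

The logarithmic discount means that moving a relevant image from rank 2 to rank 1 matters much more than moving it from rank 9 to rank 8.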

[256] OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Shenghai Yuan,Xianyi He,Yufan Deng,Yang Ye,Jinfa Huang,Bin Lin,Chongyang Ma,Jiebo Luo,Li Yuan

Main category: cs.CV

TL;DR: The paper proposes OpenS2V-Nexus, comprising the OpenS2V-Eval benchmark and the OpenS2V-5M dataset, to support Subject-to-Video (S2V) generation research, and evaluates 16 S2V models.

Details

Motivation: Provide infrastructure for S2V generation, address the overly coarse granularity of existing benchmarks, and improve the subject consistency and naturalness of generated videos. Method: Proposes the OpenS2V-Eval benchmark (180 prompts and three automatic metrics) and the OpenS2V-5M dataset (five million high-quality triples), and evaluates 16 S2V models. Result: OpenS2V-Nexus provides a comprehensive evaluation tool and a large-scale dataset for S2V generation, revealing the strengths and weaknesses of different models. Conclusion: OpenS2V-Nexus lays a solid foundation for S2V generation research and can accelerate future progress in the field. Abstract: Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.

[257] GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scenes

Xiao Chen,Tai Wang,Quanyi Li,Tao Huang,Jiangmiao Pang,Tianfan Xue

Main category: cs.CV

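TL;DR: GLEAM-Bench is a large-scale benchmark for generalizable active mapping, and GLEAM is a generalizable exploration policy that significantly outperforms existing methods.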

Details

Motivation: Address the challenge of generalizable active mapping for mobile robots in complex unknown environments; existing methods generalize poorly because of insufficient data and conservative strategies. Method: Introduces the GLEAM-Bench benchmark and proposes the GLEAM policy, which combines semantic representations, long-term navigable goals, and randomized strategies. Result: On 128 unseen complex scenes, coverage reaches 66.50% (+9.49%), with efficient trajectories and improved mapping accuracy. Conclusion: GLEAM significantly outperforms existing methods in generalizability and performance, providing a reliable solution for active mapping in complex environments. Abstract: Generalizable active mapping in complex unknown environments remains a critical challenge for mobile robots. Existing methods, constrained by insufficient training data and conservative exploration strategies, exhibit limited generalizability across scenes with diverse layouts and complex connectivity. To enable scalable training and reliable evaluation, we introduce GLEAM-Bench, the first large-scale benchmark designed for generalizable active mapping with 1,152 diverse 3D scenes from synthetic and real-scan datasets. Building upon this foundation, we propose GLEAM, a unified generalizable exploration policy for active mapping. Its superior generalizability comes mainly from our semantic representations, long-term navigable goals, and randomized strategies. It significantly outperforms state-of-the-art methods, achieving 66.50% coverage (+9.49%) with efficient trajectories and improved mapping accuracy on 128 unseen complex scenes. Project page: https://xiao-chen.tech/gleam/.

[258] DiSA: Diffusion Step Annealing in Autoregressive Image Generation

Qinyu Zhao,Jaskirat Singh,Ming Xu,Akshay Asthana,Stephen Gould,Liang Zheng

Main category: cs.CV

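TL;DR: The paper proposes DiSA, a training-free method that gradually reduces the number of diffusion steps to speed up diffusion sampling in autoregressive models while maintaining generation quality.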

Details

Motivation: As more tokens are generated during the autoregressive process, the distribution of subsequent tokens becomes more constrained and easier to sample. Method: Introduces diffusion step annealing (DiSA), which gradually reduces the number of diffusion steps. Result: DiSA achieves 5-10x faster inference for MAR and Harmon and 1.4-2.5x for FlowAR and xAR, with generation quality maintained. Conclusion: DiSA is a simple and efficient acceleration method for diffusion sampling in autoregressive models. Abstract: An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. To intuitively explain, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on our finding, we introduce diffusion step annealing (DiSA), a training-free method which gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models, and albeit simple, achieves $5-10\times$ faster inference for MAR and Harmon and $1.4-2.5\times$ for FlowAR and xAR, while maintaining the generation quality.
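
The annealing idea can be sketched as a schedule that maps generation progress to a diffusion step count. The abstract gives only the endpoints (50 steps early, 5 steps late) as an example; the linear interpolation, the function name, and the parameter names below are assumptions of this sketch, and the paper may use a different schedule shape.

```python
def annealed_steps(tokens_done, total_tokens, start_steps=50, end_steps=5):
    """Return how many diffusion steps to use for the next token.

    Interpolates linearly (an assumed schedule) from `start_steps` when no
    tokens have been generated to `end_steps` when generation is complete.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    progress = min(max(tokens_done / total_tokens, 0.0), 1.0)
    return round(start_steps + (end_steps - start_steps) * progress)


# Step budget at a few points of a 256-token image.
schedule = [annealed_steps(n, 256) for n in (0, 64, 128, 192, 256)]
```

Because the schedule only changes how many denoising steps are requested per token, it can sit in front of an existing sampler without retraining, which matches the abstract's description of the method as training-free.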

cs.GR [Back]

[259] A Novel Benchmark and Dataset for Efficient 3D Gaussian Splatting with Gaussian Point Cloud Compression

Kangli Wang,Shihao Li,Qianxi Yi,Wei Gao

Main category: cs.GR

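TL;DR: The paper proposes GausPcgc, an AI-based point cloud compression method for improving the storage efficiency of 3D Gaussian Splatting (3DGS), addressing the neglect of Gaussian spatial position compression in existing methods.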

Details

Motivation: Although 3DGS offers high-fidelity rendering and computational efficiency, it has large storage requirements; in particular, Gaussian spatial positions are not compressed effectively, causing unnecessary bitstream overhead. Method: Treats Gaussian primitives as point clouds and compresses them with AI-based point cloud compression (which outperforms MPEG G-PCC); designs the GausPcgc framework and the GausPcc-1K dataset around the globally sparse yet locally dense distribution of Gaussian point clouds. Result: The proposed method achieves superior compression ratios, delivers significant performance improvements, and complements existing Gaussian compression methods. Conclusion: GausPcgc brings AI-based point cloud compression into Gaussian compression; the code, data, and models will be publicly released for future research. Abstract: Recently, immersive media and autonomous driving applications have significantly advanced through 3D Gaussian Splatting (3DGS), which offers high-fidelity rendering and computational efficiency. Despite these advantages, 3DGS as a display-oriented representation requires substantial storage due to its numerous Gaussian attributes. Current compression methods have shown promising results but typically neglect the compression of Gaussian spatial positions, creating unnecessary bitstream overhead. We conceptualize Gaussian primitives as point clouds and propose leveraging point cloud compression techniques for more effective storage. AI-based point cloud compression demonstrates superior performance and faster inference compared to MPEG Geometry-based Point Cloud Compression (G-PCC). However, direct application of existing models to Gaussian compression may yield suboptimal results, as Gaussian point clouds tend to exhibit globally sparse yet locally dense geometric distributions that differ from conventional point cloud characteristics. To address these challenges, we introduce GausPcgc for Gaussian point cloud geometry compression along with a specialized training dataset GausPcc-1K. Our work pioneers the integration of AI-based point cloud compression into Gaussian compression pipelines, achieving superior compression ratios. The framework complements existing Gaussian compression methods while delivering significant performance improvements. All code, data, and pre-trained models will be publicly released to facilitate further research advances in this field.

[260] Efficient Differentiable Hardware Rasterization for 3D Gaussian Splatting

Yitian Yuan,Qianyue He

Main category: cs.GR

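TL;DR: Proposes a differentiable hardware rasterization method for backward-pass gradient computation in 3D Gaussian Splatting (3DGS), significantly improving performance and memory efficiency.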

Details

Motivation: Hardware rasterization excels at forward-pass rendering for 3DGS, but backward-pass gradient computation is limited by graphics pipeline constraints, so memory and performance issues must be solved. Method: Uses programmable blending and a hybrid gradient reduction strategy (quad-level + subgroup) to compute gradients efficiently in fragment shaders. Result: On an RTX4080 GPU with the float16 format, achieves a 3.07x full-pipeline speedup while incurring only 2.67% of the memory overhead required for splat sorting. Conclusion: The method jointly optimizes runtime and memory usage for 3DGS and is particularly suitable for resource-constrained devices. Abstract: Recent works demonstrate the advantages of hardware rasterization for 3D Gaussian Splatting (3DGS) in forward-pass rendering through fast GPU-optimized graphics and fixed memory footprint. However, extending these benefits to backward-pass gradient computation remains challenging due to graphics pipeline constraints. We present a differentiable hardware rasterizer for 3DGS that overcomes the memory and performance limitations of tile-based software rasterization. Our solution employs programmable blending for per-pixel gradient computation combined with a hybrid gradient reduction strategy (quad-level + subgroup) in fragment shaders, achieving over 10x faster backward rasterization versus naive atomic operations and 3x speedup over the canonical tile-based rasterizer. Systematic evaluation reveals 16-bit render targets (float16 and unorm16) as the optimal accuracy-efficiency trade-off, achieving higher gradient accuracy among mixed-precision rendering formats with execution speeds second only to unorm8, while float32 texture incurs severe forward pass performance degradation due to suboptimal hardware optimizations. Our method with float16 formats demonstrates 3.07x acceleration in full pipeline execution (forward + backward passes) on RTX4080 GPUs with the MipNeRF dataset, outperforming the baseline tile-based renderer while preserving hardware rasterization's memory efficiency advantages -- incurring merely 2.67% of the memory overhead required for splat sorting operations. This work presents a unified differentiable hardware rasterization method that simultaneously optimizes runtime and memory usage for 3DGS, making it particularly suitable for resource-constrained devices with limited memory capacity.

[261] CageNet: A Meta-Framework for Learning on Wild Meshes

Michal Edelstein,Hsueh-Ti Derek Liu,Mirela Ben-Chen

Main category: cs.GR

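TL;DR: This paper proposes a configurable meta-framework based on caged geometry for handling "wild" triangle meshes that are non-manifold, multi-component, or have disrupted connectivity; it maps functions via generalized barycentric coordinates and outperforms existing techniques on segmentation and skinning-weight learning.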

Details

Motivation: Broaden the applicability of generic triangle-mesh frameworks to the "wild" meshes common in real-world data, which are non-manifold, multi-component, or have disrupted connectivity. Method: Proposes a meta-framework based on caged geometry: a cage (a single-component manifold mesh) envelops the target mesh, and generalized barycentric coordinates map functions between the two. Result: Outperforms existing techniques on segmentation and skinning-weight learning for wild meshes. Conclusion: The caged-geometry framework offers a flexible and efficient way to handle complex meshes and extends the reach of generic frameworks. Abstract: Learning on triangle meshes has recently proven to be instrumental to a myriad of tasks, from shape classification, to segmentation, to deformation and animation, to mention just a few. While some of these applications are tackled through neural network architectures which are tailored to the application at hand, many others use generic frameworks for triangle meshes where the only customization required is the modification of the input features and the loss function. Our goal in this paper is to broaden the applicability of these generic frameworks to "wild" meshes, i.e., meshes in the wild, which often have multiple components, non-manifold elements, disrupted connectivity, or a combination of these. We propose a configurable meta-framework based on the concept of caged geometry: Given a mesh, a cage is a single-component manifold triangle mesh that envelops it closely. Generalized barycentric coordinates map between functions on the cage, and functions on the mesh, allowing us to learn and test on a variety of data, in different applications. We demonstrate this concept by learning segmentation and skinning weights on difficult data, achieving better performance than state-of-the-art techniques on wild meshes.
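
The cage-to-mesh mapping described above is a weighted average: each mesh vertex takes the values at the cage vertices, blended by its generalized barycentric coordinates. The sketch below shows only that linear map. The weights in the example are hypothetical hand-written numbers; how the paper computes the coordinates and which coordinate family it uses are not stated in the abstract.

```python
def cage_to_mesh(weights, cage_values):
    """Transfer a scalar function from cage vertices to mesh vertices.

    `weights[i][j]` is the coordinate of mesh vertex i with respect to cage
    vertex j; each row is expected to sum to 1 (partition of unity).
    `cage_values[j]` is the function value stored at cage vertex j.
    """
    mesh_values = []
    for row in weights:
        if len(row) != len(cage_values):
            raise ValueError("each weight row needs one entry per cage vertex")
        mesh_values.append(sum(w * v for w, v in zip(row, cage_values)))
    return mesh_values


# Hypothetical coordinates of three mesh vertices inside a three-vertex cage.
example_weights = [
    [1.0, 0.0, 0.0],    # coincides with cage vertex 0
    [0.5, 0.5, 0.0],    # midway between cage vertices 0 and 1
    [0.25, 0.25, 0.5],
]
mesh_values = cage_to_mesh(example_weights, [0.0, 2.0, 4.0])
```

Since the map is linear and the cage is a clean manifold mesh, a network can be trained on the cage while predictions are still read out on the original, possibly broken, mesh.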

[262] DiffHairCard: Auto Hair Card Extraction with Differentiable Rendering

Zhongtian Zheng,Tao Huang,Haozhe Su,Xueqi Ma,Yuefan Shen,Tongtong Wang,Yin Yang,Xifeng Gao,Zherong Pan,Kui Wu

Main category: cs.GR

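TL;DR: Proposes an automated pipeline that converts strand-based hair models into high-quality hair card models, using a differentiable representation and two-stage optimization for efficiency.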

Details

Motivation: Hair card models are widely used in real-time applications, but generating high-quality ones remains challenging and time-consuming. Method: Uses a differentiable representation that encodes each strand as a 2D spline in texture space, and generates hair card models through clustering and two-stage optimization (per-card optimization followed by joint optimization). Result: The method handles a range of hairstyles (straight, wavy, curly, and coily), supports hair caps and cross-cards to better capture short or coily hair, and supports seamless LoD transitions. Conclusion: The framework preserves hairstyle appearance while balancing texture memory efficiency and visual quality. Abstract: Hair cards remain a widely used representation for hair modeling in real-time applications, offering a practical trade-off between visual fidelity, memory usage, and performance. However, generating high-quality hair card models remains a challenging and labor-intensive task. This work presents an automated pipeline for converting strand-based hair models into hair card models with a limited number of cards and textures while preserving the hairstyle appearance. Our key idea is a novel differentiable representation where each strand is encoded as a projected 2D spline in the texture space, which enables efficient optimization with differentiable rendering and structured results respecting the hair geometry. Based on this representation, we develop a novel algorithm pipeline, where we first cluster hair strands into initial hair cards and project the strands into the texture space. We then conduct a two-stage optimization where our first stage optimizes the texture and geometry of each hair card separately, and after texture reduction, our second stage conducts joint optimization of all the cards for fine-tuning. Put together, our method is evaluated on a wide range of hairstyles, including straight, wavy, curly, and coily hair. To better capture the appearance of short or coily hair, we additionally support hair cap and cross-card. Furthermore, our framework supports seamless LoD transitions via texture sharing, balancing texture memory efficiency and visual quality.

[263] SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation

Shenggan Cheng,Yuanxin Wei,Lansong Diao,Yong Liu,Bujiao Chen,Lianghua Huang,Yu Liu,Wenyuan Yu,Jiangsu Du,Wei Lin,Yang You

Main category: cs.GR

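TL;DR: SRDiffusion accelerates diffusion-based video generation through cooperation between a large and a small model, cutting compute cost while maintaining high quality.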

Details

Motivation: Diffusion-based video generation is computationally expensive, and existing acceleration methods often degrade quality; a new approach is needed to balance speed and quality. Method: The large model handles the high-noise steps (Sketching) and the small model refines the low-noise steps (Rendering), so their cooperation reduces computation. Result: Experiments show a speedup of over 3x on Wan and 2x on CogVideoX with almost no loss in quality. Conclusion: SRDiffusion offers a new direction for video generation that is orthogonal to existing acceleration strategies and has potential for practical use. Abstract: Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates its inference by skipping computation, usually at the cost of severe quality degradation. In this paper, we propose SRDiffusion, a novel framework that leverages collaboration between large and small models to reduce inference cost. The large model handles high-noise steps to ensure semantic and motion fidelity (Sketching), while the smaller model refines visual details in low-noise steps (Rendering). Experimental results demonstrate that our method outperforms existing approaches, achieving over 3$\times$ speedup for Wan with nearly no quality loss on VBench, and 2$\times$ speedup for CogVideoX. Our method is introduced as a new direction orthogonal to existing acceleration strategies, offering a practical solution for scalable video generation.
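
The Sketching-Rendering split described in the abstract can be illustrated with a minimal Python sketch. It assumes a fixed switch point between the two models and uses toy stand-in denoisers; the names `cooperative_denoise`, `switch_step`, `toy_large`, and `toy_small` are hypothetical, and the abstract does not say how the actual switch point is chosen.

```python
from typing import Callable, List, Tuple

# A denoiser takes (latent, step index) and returns the updated latent.
Denoiser = Callable[[List[float], int], List[float]]


def cooperative_denoise(
    latent: List[float],
    num_steps: int,
    switch_step: int,
    large_model: Denoiser,
    small_model: Denoiser,
) -> Tuple[List[float], List[str]]:
    """Run the large model on the first `switch_step` (high-noise) steps
    and the small model on the remaining (low-noise) steps.

    The fixed `switch_step` is an assumption made for this illustration.
    """
    schedule = []
    for step in range(num_steps):
        if step < switch_step:
            name, model = "large", large_model  # Sketching phase
        else:
            name, model = "small", small_model  # Rendering phase
        latent = model(latent, step)
        schedule.append(name)
    return latent, schedule


# Toy stand-ins for real denoisers (hypothetical, for illustration only).
def toy_large(latent: List[float], step: int) -> List[float]:
    return [0.5 * x for x in latent]


def toy_small(latent: List[float], step: int) -> List[float]:
    return [0.9 * x for x in latent]


final_latent, schedule = cooperative_denoise(
    [1.0, -2.0], num_steps=5, switch_step=2,
    large_model=toy_large, small_model=toy_small,
)
print(schedule)
```

The saving comes from the small model running the majority of steps; how much quality is retained depends on where the switch happens, which this sketch does not model.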

[264] A Fluorescent Material Model for Non-Spectral Editing & Rendering

Belcour Laurent,Fichet Alban,Barla Pascal

Main category: cs.GR

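TL;DR: Presents an analytical approach, based on a decomposition of the reduced reradiation matrix and a Gaussian fluorescence model, for editing and rendering fluorescence in non-spectral engines; it supports real-time parameter adjustment and artist-friendly creation of fluorescent materials.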

Details

Motivation: Existing methods for rendering fluorescence in non-spectral engines have limited expressivity: they require storing one reduced matrix per fluorescent material and only work with measured data. This work aims to address these issues. Method: The reduced reradiation matrix is decomposed and combined with an analytically integrable Gaussian fluorescence model and a UV basis, enabling accurate rendering and real-time parameter adjustment of fluorescent materials. Result: The model accurately reproduces the appearance of fluorescent materials and supports real-time editing and dynamic spatial variation; a simplified version can also create fluorescent materials from only a few reflectance colors. Conclusion: The approach makes fluorescence rendering in non-spectral engines more flexible and practical, and suits both artistic creation and real-time editing. Abstract: Fluorescent materials are characterized by a spectral reradiation toward longer wavelengths. Recent work [Fichet et al. 2024] has shown that the rendering of fluorescence in a non-spectral engine is possible through the use of appropriate reduced reradiation matrices. But the approach has limited expressivity, as it requires the storage of one reduced matrix per fluorescent material, and only works with measured fluorescent assets. In this work, we introduce an analytical approach to the editing and rendering of fluorescence in a non-spectral engine. It is based on a decomposition of the reduced reradiation matrix, and an analytically-integrable Gaussian-based model of the fluorescent component. The model reproduces the appearance of fluorescent materials accurately, especially with the addition of a UV basis. Most importantly, it grants variations of fluorescent material parameters in real-time, either for the editing of fluorescent materials, or for the dynamic spatial variation of fluorescence properties across object surfaces. A simplified one-Gaussian fluorescence model even allows for the artist-friendly creation of plausible fluorescent materials from scratch, requiring only a few reflectance colors as input.

[265] CAD-Coder: Text-to-CAD Generation with Chain-of-Thought and Geometric Reward

Yandong Guan,Xilin Wang,Xingxi Ming,Jing Zhang,Dong Xu,Qian Yu

Main category: cs.GR

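TL;DR: CAD-Coder is a new framework that performs text-to-CAD by generating CadQuery scripts; it combines supervised learning and reinforcement learning to improve model performance and introduces chain-of-thought planning to improve reasoning.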

Details

Motivation: Address geometric validation, the richness of the modeling vocabulary, and seamless integration with existing LLMs in text-to-CAD. Method: A two-stage learning pipeline of supervised fine-tuning followed by reinforcement learning (GRPO) with geometric and format rewards, plus chain-of-thought planning. Result: Experiments show that CAD-Coder generates diverse, valid, and complex CAD models directly from natural language, advancing text-to-CAD generation and geometric reasoning. Conclusion: Through its novel framework and optimization methods, CAD-Coder significantly improves the performance and practicality of text-to-CAD generation. Abstract: In this work, we introduce CAD-Coder, a novel framework that reformulates text-to-CAD as the generation of CadQuery scripts - a Python-based, parametric CAD language. This representation enables direct geometric validation, a richer modeling vocabulary, and seamless integration with existing LLMs. To further enhance code validity and geometric fidelity, we propose a two-stage learning pipeline: (1) supervised fine-tuning on paired text-CadQuery data, and (2) reinforcement learning with Group Reward Policy Optimization (GRPO), guided by a CAD-specific reward comprising both a geometric reward (Chamfer Distance) and a format reward. We also introduce a chain-of-thought (CoT) planning process to improve model reasoning, and construct a large-scale, high-quality dataset of 110K text-CadQuery-3D model triplets and 1.5K CoT samples via an automated pipeline. Extensive experiments demonstrate that CAD-Coder enables LLMs to generate diverse, valid, and complex CAD models directly from natural language, advancing the state of the art of text-to-CAD generation and geometric reasoning.
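
To make the CAD-specific reward more concrete, here is a minimal Python sketch of a reward that combines a Chamfer-distance term with a format term. Only the two ingredients come from the abstract. The `exp(-d)` mapping from distance to reward, the weights `w_geo` and `w_fmt`, and the string-based format check are assumptions chosen for illustration, not the paper's definition.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance between two point sets, using the mean
    squared distance from each point to its nearest neighbour in the other set."""
    def one_way(src: List[Point], dst: List[Point]) -> float:
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def cad_reward(
    script: str,
    pred_points: List[Point],
    target_points: List[Point],
    w_geo: float = 1.0,   # assumed weight of the geometric term
    w_fmt: float = 0.1,   # assumed weight of the format term
) -> float:
    """Hypothetical reward: geometric term plus format term."""
    # Assumed format check: the script at least imports CadQuery.
    fmt = 1.0 if "import cadquery" in script else 0.0
    # Assumed mapping: a smaller Chamfer distance gives a reward closer to 1.
    if pred_points and target_points:
        geo = math.exp(-chamfer_distance(pred_points, target_points))
    else:
        geo = 0.0  # script produced no geometry
    return w_geo * geo + w_fmt * fmt


unit_square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
perfect = cad_reward("import cadquery as cq", unit_square, unit_square)
failed = cad_reward("print('no CAD here')", [], unit_square)
print(perfect, failed)
```

In practice the point sets would be sampled from the surfaces of the generated and reference models; here they are passed in directly to keep the sketch self-contained.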

[266] MAMM: Motion Control via Metric-Aligning Motion Matching

Naoki Agata,Takeo Igarashi

Main category: cs.GR

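TL;DR: Proposes a new method for controlling motion sequences based on temporal alignment, without relying on cross-domain mappings or annotated data.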

Details

Motivation: Traditional methods require large annotated datasets or hand-defined mappings, which is time-consuming and inflexible. Method: Computes within-domain distances and optimizes a matching that aligns the motion sequence with the control sequence. Result: The method works with many kinds of control sequences (such as sketches, labels, and audio) without training or annotated data. Conclusion: The work demonstrates practicality for efficient motion control, with broad application potential. Abstract: We introduce a novel method for controlling a motion sequence with an arbitrary temporal control sequence via temporal alignment. Temporal alignment of motion has gained significant attention owing to its applications in motion control and retargeting. Traditional methods rely on either learned or hand-crafted cross-domain mappings between frames in the original and control domains, which often require large, paired, or annotated datasets and time-consuming training. Our approach, named Metric-Aligning Motion Matching, achieves alignment by solely considering within-domain distances. It computes distances among patches in each domain and seeks a matching that optimally aligns the two within-domain distances. This framework allows for the alignment of a motion sequence to various types of control sequences, including sketches, labels, audio, and another motion sequence, all without the need for manually defined mappings or training with annotated data. We demonstrate the effectiveness of our approach through applications in efficient motion control, showcasing its potential in practical scenarios.

cs.CL [Back]

[267] Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language

Jesus Alvarez C,Daua D. Karajeanes,Ashley Celeste Prado,John Ruttan,Ivory Yang,Sean O'Brien,Vasu Sharma,Kevin Zhu

Main category: cs.CL

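TL;DR: Examines the digital exclusion of the endangered Comanche language and proposes low-cost, community-informed NLP methods to support language preservation; a handful of examples substantially improves model performance.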

Details

Motivation: Address the digital exclusion of the endangered Comanche language in NLP and support linguistic research and revitalization. Method: Builds a manually curated dataset of 412 phrases, designs a synthetic data generation pipeline, and evaluates GPT-4o and GPT-4o-mini in zero-shot and few-shot settings. Result: Few-shot prompting significantly improves performance, reaching near-perfect accuracy with just five examples. Conclusion: The study highlights the potential of targeted NLP methods in low-resource settings and calls for computational approaches that prioritize accessibility, cultural sensitivity, and community engagement. Abstract: The digital exclusion of endangered languages remains a critical challenge in NLP, limiting both linguistic research and revitalization efforts. This study introduces the first computational investigation of Comanche, an Uto-Aztecan language on the verge of extinction, demonstrating how minimal-cost, community-informed NLP interventions can support language preservation. We present a manually curated dataset of 412 phrases, a synthetic data generation pipeline, and an empirical evaluation of GPT-4o and GPT-4o-mini for language identification. Our experiments reveal that while LLMs struggle with Comanche in zero-shot settings, few-shot prompting significantly improves performance, achieving near-perfect accuracy with just five examples. Our findings highlight the potential of targeted NLP methodologies in low-resource contexts and emphasize that visibility is the first step toward inclusion. By establishing a foundation for Comanche in NLP, we advocate for computational approaches that prioritize accessibility, cultural sensitivity, and community engagement.

[268] Do BERT-Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLMs?

Junyan Zhang,Yiming Huang,Shuliang Liu,Yubo Gao,Xuming Hu

Main category: cs.CL

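TL;DR: Challenges the LLM-dominant trend, finding that BERT-like models often outperform LLMs on text classification, and proposes TaMAS, a task-driven selection strategy.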

Details

Motivation: Explore whether traditional BERT-like models retain advantages amid the rapid adoption of LLMs. Method: Systematically compares three approaches (BERT fine-tuning, LLM internal state utilization, and zero-shot inference) and analyzes dataset types. Result: BERT-like models perform better on pattern-driven tasks, while LLMs lead on tasks requiring deep semantics. Conclusion: Proposes the TaMAS strategy, emphasizing task-driven model selection over reliance on a single LLM. Abstract: The rapid adoption of LLMs has overshadowed the potential advantages of traditional BERT-like models in text classification. This study challenges the prevailing "LLM-centric" trend by systematically comparing three categories of methods, i.e., fine-tuning BERT-like models, LLM internal state utilization, and zero-shot inference, across six high-difficulty datasets. Our findings reveal that BERT-like models often outperform LLMs. We further categorize datasets into three types, perform PCA and probing experiments, and identify task-specific model strengths: BERT-like models excel in pattern-driven tasks, while LLMs dominate those requiring deep semantics or world knowledge. Based on this, we propose TaMAS, a fine-grained task selection strategy, advocating for a nuanced, task-driven approach over a one-size-fits-all reliance on LLMs.

[269] CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games

Shuhang Xu,Fangwei Zhong

Main category: cs.CL

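TL;DR: The CoMet framework combines hypothesis-based reasoning with a self-reflective metaphor generator to improve how LLMs interpret and apply metaphors in multi-agent language games, significantly improving strategic communication.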

Details

Motivation: Large language models struggle to interpret and apply metaphors, which limits their strategic communication in multi-agent language games. Method: Proposes the CoMet framework, which combines a hypothesis-based metaphor reasoner with a self-reflective metaphor generator and improves metaphor processing through knowledge integration. Result: In the Undercover and Adversarial Taboo games, CoMet significantly improves the agents' strategic communication. Conclusion: CoMet effectively addresses the shortcomings of LLMs in metaphor processing and offers a new approach to multi-agent strategic communication. Abstract: Metaphors are a crucial way for humans to express complex or subtle ideas by comparing one concept to another, often from a different domain. However, many large language models (LLMs) struggle to interpret and apply metaphors in multi-agent language games, hindering their ability to engage in covert communication and semantic evasion, which are crucial for strategic communication. To address this challenge, we introduce CoMet, a framework that enables LLM-based agents to engage in metaphor processing. CoMet combines a hypothesis-based metaphor reasoner with a metaphor generator that improves through self-reflection and knowledge integration. This enhances the agents' ability to interpret and apply metaphors, improving the strategic and nuanced quality of their interactions. We evaluate CoMet on two multi-agent language games - Undercover and Adversarial Taboo - which emphasize Covert Communication and Semantic Evasion. Experimental results demonstrate that CoMet significantly enhances the agents' ability to communicate strategically using metaphors.

[270] IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

Hanyu Li,Haoyu Liu,Tingyu Zhu,Tianyu Guo,Zeyu Zheng,Xiaotie Deng,Michael I. Jordan

Main category: cs.CL

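TL;DR: IDA-Bench is a new benchmark for evaluating LLMs in multi-round interactive scenarios; results show that current state-of-the-art coding agents succeed on fewer than 50% of the tasks.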

Details

Motivation: Existing benchmarks overlook the iterative nature of data analysis, in which experts' decisions evolve as their understanding of the dataset deepens. Method: Tasks are extracted from complex Kaggle notebooks and presented as multi-round natural-language instructions by an LLM-simulated user; the agent's final output is compared with a human-derived baseline. Result: Even state-of-the-art coding agents (such as Claude-3.7-thinking) succeed on fewer than 50% of the tasks. Conclusion: LLMs' multi-round interaction capabilities must improve in order to build more reliable data analysis agents, while balancing instruction following and reasoning. Abstract: Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insights into the dataset. To address this, we introduce IDA-Bench, a novel benchmark evaluating LLM agents in multi-round interactive scenarios. Derived from complex Kaggle notebooks, tasks are presented as sequential natural language instructions by an LLM-simulated user. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Initial results show that even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on < 50% of the tasks, highlighting limitations not evident in single-turn tests. This work underscores the need to improve LLMs' multi-round capabilities for building more reliable data analysis agents, highlighting the necessity of achieving a balance between instruction following and reasoning.

[271] Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens

Xixian Yong,Xiao Zhou,Yingying Zhang,Jinlin Li,Yefeng Zheng,Xian Wu

Main category: cs.CL

TL;DR: Proposes an entropy-based adaptive reasoning strategy that halts reasoning dynamically to improve efficiency while maintaining accuracy.

Details

Motivation: The multi-step reasoning gains of Large Reasoning Models (LRMs) often come with overly long reasoning chains; the study aims to balance reasoning length and semantic efficiency. Method: Proposes two metrics, InfoBias and InfoGain, and introduces an entropy-based adaptive reasoning strategy (Adaptive Think). Result: On QwQ-32B, the strategy improves average accuracy by 1.10% and reduces token usage by 50.80%. Conclusion: Entropy-based methods can effectively improve the reasoning efficiency and cost-effectiveness of large language models. Abstract: The recent rise of Large Reasoning Models (LRMs) has significantly improved multi-step reasoning performance, but often at the cost of generating excessively long reasoning chains. This paper revisits the efficiency of such reasoning processes through an information-theoretic lens, revealing a fundamental trade-off between reasoning length and semantic efficiency. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution, respectively. Empirical analyses show that longer reasoning chains tend to exhibit higher information bias and diminishing information gain, especially for incorrect answers. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high, improving efficiency while maintaining competitive accuracy. Compared to the Vanilla Think approach (default mode), our strategy yields a 1.10% improvement in average accuracy and a 50.80% reduction in token usage on QwQ-32B across six benchmark tasks spanning diverse reasoning types and difficulty levels, demonstrating superior efficiency and reasoning performance. These results underscore the promise of entropy-based methods for enhancing both accuracy and cost-efficiency in large language model deployment.
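
The idea of halting once confidence is high enough can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's procedure: it assumes that after each reasoning step we can read off a probability distribution over candidate answers, and the function names and the `threshold` value are hypothetical.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def adaptive_think(step_distributions: List[List[float]], threshold: float = 0.5) -> int:
    """Return the index of the first reasoning step whose answer distribution
    has entropy below `threshold`; fall back to the last step otherwise.

    `threshold` is an assumed hyperparameter for this sketch.
    """
    for i, probs in enumerate(step_distributions):
        if entropy(probs) < threshold:
            return i  # confident enough: stop reasoning here
    return len(step_distributions) - 1


# Hypothetical answer distributions after each of three reasoning steps.
trace = [
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain (2 bits)
    [0.70, 0.10, 0.10, 0.10],  # leaning towards one answer
    [0.97, 0.01, 0.01, 0.01],  # confident
]
stop_at = adaptive_think(trace)
print(stop_at)
```

Lowering the threshold makes the model reason longer before committing, so it trades token usage against the risk of stopping on a wrong but confident answer.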

[272] Taming LLMs with Negative Samples: A Reference-Free Framework to Evaluate Presentation Content with Actionable Feedback

Ananth Muppidi,Tarak Das,Sambaran Bandyopadhyay,Tripti Shukla,Dharun D A

Main category: cs.CL

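TL;DR: Proposes REFLEX, a method for automatically evaluating presentation slide content; by generating negative samples and fine-tuning LLMs, it scores slides and gives feedback without ground-truth references.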

Details

Motivation: Automatically generating presentation slides is an important problem in the era of generative AI, but effective methods for evaluating content quality are still lacking. Method: Introduces the RefSlides benchmark dataset, proposes a set of metrics, and fine-tunes LLMs on generated negative samples to enable reference-free evaluation. Result: REFLEX outperforms classical heuristic-based and LLM-based evaluation methods in automated and human experiments. Conclusion: REFLEX provides an efficient, reference-free method for evaluating presentation slide content. Abstract: The generation of presentation slides automatically is an important problem in the era of generative AI. This paper focuses on evaluating multimodal content in presentation slides that can effectively summarize a document and convey concepts to a broad audience. We introduce a benchmark dataset, RefSlides, consisting of human-made high-quality presentations that span various topics. Next, we propose a set of metrics to characterize different intrinsic properties of the content of a presentation and present REFLEX, an evaluation approach that generates scores and actionable feedback for these metrics. We achieve this by generating negative presentation samples with different degrees of metric-specific perturbations and use them to fine-tune LLMs. This reference-free evaluation technique does not require ground truth presentations during inference. Our extensive automated and human experiments demonstrate that our evaluation approach outperforms classical heuristic-based and state-of-the-art large language model-based evaluations in generating scores and explanations.

[273] Multi-Scale Probabilistic Generation Theory: A Hierarchical Framework for Interpreting Large Language Models

Yukin Zhang,Qi Dong

Main category: cs.CL

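TL;DR: MSPGT proposes a hierarchical framework that decomposes the generation process of Transformer models into three semantic scales (global, intermediate, and local) and validates its stability and effectiveness experimentally.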

Details

Motivation: Large Transformer language models perform remarkably well, but their internal mechanisms are opaque; a unified method is needed to interpret and control their generation process. Method: Proposes the MSPGT framework, which uses attention span thresholds and inter-layer mutual information peaks to delimit semantic scales, and validates it on four representative models. Result: Decoder models devote more capacity to intermediate and global processing, while encoder models emphasize local feature extraction; interventions at different scales have significant effects on the generated text. Conclusion: MSPGT provides a unified, architecture-agnostic method for understanding and controlling large language models, bridging the gap between mechanistic interpretability and emergent capabilities. Abstract: Large Transformer-based language models achieve remarkable performance but remain opaque in how they plan, structure, and realize text. We introduce Multi-Scale Probabilistic Generation Theory (MSPGT), a hierarchical framework that factorizes generation into three semantic scales (global context, intermediate structure, and local word choices) and aligns each scale with specific layer ranges in Transformer architectures. To identify scale boundaries, we propose two complementary metrics: attention span thresholds and inter-layer mutual information peaks. Across four representative models (GPT-2, BERT, RoBERTa, and T5), these metrics yield stable local/intermediate/global partitions, corroborated by probing tasks and causal interventions. We find that decoder-only models allocate more layers to intermediate and global processing while encoder-only models emphasize local feature extraction. Through targeted interventions, we demonstrate that local-scale manipulations primarily influence lexical diversity, intermediate-scale modifications affect sentence structure and length, and global-scale perturbations impact discourse coherence, all with statistically significant effects. MSPGT thus offers a unified, architecture-agnostic method for interpreting, diagnosing, and controlling large language models, bridging the gap between mechanistic interpretability and emergent capabilities.

[274] MetaGen Blended RAG: Higher Accuracy for Domain-Specific Q&A Without Fine-Tuning

Kunal Sawarkar,Shivam R. Solanki,Abhilasha Mangal

Main category: cs.CL

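TL;DR: Proposes 'MetaGen Blended RAG', which improves retrieval over domain-specific corpora through hybrid query indexes and metadata enrichment, significantly improving the accuracy and generalization of RAG systems.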

Details

Motivation: RAG systems perform poorly on enterprise domain-specific datasets, mainly because complex domain terminology and semantic variability lead to low retrieval precision. Existing remedies such as fine-tuning are costly and generalize poorly. Method: Uses hybrid query indexes and metadata enrichment: a metadata generation pipeline (covering key concepts, topics, and acronyms) feeds a metadata-enriched hybrid index. Result: On the PubMedQA biomedical benchmark, the method reaches 82% retrieval accuracy and 77% RAG accuracy, surpassing existing methods that do not use fine-tuning and matching the best fine-tuned models. Conclusion: The method is robust and scalable across multiple Q&A datasets, offering an efficient solution for domain-specific RAG systems that generalizes well. Abstract: Despite the widespread exploration of Retrieval-Augmented Generation (RAG), its deployment in enterprises for domain-specific datasets remains limited due to poor answer accuracy. These corpora, often shielded behind firewalls in private enterprise knowledge bases, have complex, domain-specific terminology rarely seen by LLMs during pre-training; they exhibit significant semantic variability across domains (such as networking, military, or legal), or even within a single domain like medicine, and thus result in poor context precision for RAG systems. Currently, in such situations, fine-tuning or RAG with fine-tuning is attempted, but these approaches are slow, expensive, and lack generalization for accuracy as new domain-specific data emerges. We propose an approach for Enterprise Search that focuses on enhancing the retriever for a domain-specific corpus through hybrid query indexes and metadata enrichment. This 'MetaGen Blended RAG' method constructs a metadata generation pipeline using key concepts, topics, and acronyms, and then creates a metadata-enriched hybrid index with boosted search queries. This approach avoids overfitting and generalizes effectively across domains. On the PubMedQA benchmark for the biomedical domain, the proposed method achieves 82% retrieval accuracy and 77% RAG accuracy, surpassing all previous RAG accuracy results without fine-tuning and setting a new benchmark for zero-shot results while outperforming much larger models like GPT3.5. The results are even comparable to the best fine-tuned models on this dataset, and we further demonstrate the robustness and scalability of the approach by evaluating it on other Q&A datasets such as SQuAD and NQ.

[275] TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification

Jianghao Wu,Feilong Tang,Yulong Li,Ming Hu,Haochen Xue,Shoaib Jameel,Yutong Xie,Imran Razzak

Main category: cs.CL

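TL;DR: TAGS is a test-time framework that needs no fine-tuning or parameter updates; it pairs a generalist model with a domain specialist, supported by a hierarchical retrieval mechanism and a reliability scorer, and significantly improves performance on medical question answering.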

Details

Motivation: Prompting-based methods are shallow and unstable, and fine-tuned medical LLMs generalize poorly under distribution shift and in unseen clinical scenarios. Method: Combines a generalist model with a domain specialist, and adds a hierarchical retrieval mechanism that supplies multi-scale exemplars and a reliability scorer that evaluates reasoning consistency. Result: Performs strongly across nine MedQA benchmarks, significantly improving the accuracy of GPT-4o, DeepSeek-R1, and a 7B model. Conclusion: Without any parameter updates, TAGS outperforms several fine-tuned medical LLMs, showing its potential for medical reasoning tasks. Abstract: Recent advances such as Chain-of-Thought prompting have significantly improved large language models (LLMs) in zero-shot medical reasoning. However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs suffer from poor generalization under distribution shifts and limited adaptability to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives without any model fine-tuning or parameter updates. To support this generalist-specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs, without any parameter updates. The code will be available at https://github.com/JianghaoWu/TAGS.

[276] Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards

Jinyan Su,Claire Cardie

Main category: cs.CL

TL;DR: Proposes an adaptive reward-shaping method that lets large language models produce concise outputs while preserving accuracy.

Details

Motivation: RL-trained large language models often produce lengthy reasoning traces on math tasks, which raises inference cost and latency; existing methods with fixed penalty terms have limited effect. Method: Dynamically adjusts the trade-off between accuracy and output length in the reward function, adapting the length penalty to the model's performance. Result: Experiments show that the method substantially shortens reasoning while largely maintaining accuracy. Conclusion: The work offers a new direction for efficient adaptive reasoning in large-scale language models. Abstract: Large language models (LLMs) have demonstrated strong reasoning abilities in mathematical tasks, often enhanced through reinforcement learning (RL). However, RL-trained models frequently produce unnecessarily long reasoning traces -- even for simple queries -- leading to increased inference costs and latency. While recent approaches attempt to control verbosity by adding length penalties to the reward function, these methods rely on fixed penalty terms that are hard to tune and cannot adapt as the model's reasoning capability evolves, limiting their effectiveness. In this work, we propose an adaptive reward-shaping method that enables LLMs to "think fast and right" -- producing concise outputs without sacrificing correctness. Our method dynamically adjusts the reward trade-off between accuracy and response length based on model performance: when accuracy is high, the length penalty increases to encourage faster length reduction; when accuracy drops, the penalty is relaxed to preserve correctness. This adaptive reward accelerates early-stage length reduction while avoiding over-compression in later stages. Experiments across multiple datasets show that our approach consistently and dramatically reduces reasoning length while largely maintaining accuracy, offering a new direction for cost-efficient adaptive reasoning in large-scale language models.
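
The adaptive rule in the abstract (raise the length penalty when accuracy is high, relax it when accuracy drops) can be sketched as follows. The additive update, the `target_accuracy` reference level, the step size, and the linear length term are all assumptions made for illustration; the abstract does not give the exact update or reward formula.

```python
def update_length_penalty(
    lam: float,
    accuracy: float,
    target_accuracy: float,
    step_size: float = 0.05,  # assumed update step
    lam_min: float = 0.0,
    lam_max: float = 1.0,
) -> float:
    """Increase the penalty weight when accuracy meets the target,
    decrease it otherwise, and clamp to [lam_min, lam_max]."""
    if accuracy >= target_accuracy:
        lam += step_size  # accuracy is fine: push harder on length
    else:
        lam -= step_size  # accuracy dropped: relax to protect correctness
    return min(lam_max, max(lam_min, lam))


def shaped_reward(is_correct: bool, num_tokens: int, lam: float, max_tokens: int = 1000) -> float:
    """Correctness reward minus a length penalty scaled by `lam`
    (the linear penalty form is an assumption)."""
    return float(is_correct) - lam * (num_tokens / max_tokens)


lam = 0.1
lam_up = update_length_penalty(lam, accuracy=0.9, target_accuracy=0.8)
lam_down = update_length_penalty(lam, accuracy=0.5, target_accuracy=0.8)
reward = shaped_reward(True, num_tokens=500, lam=0.2)
print(lam_up, lam_down, reward)
```

With a fixed penalty, `lam` would stay constant throughout training; the adaptive variant lets the penalty grow only while the model can afford it.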

Zhuozhuo Joy Liu,Farhan Samir,Mehar Bhatia,Laura K. Nelson,Vered Shwartz

Main category: cs.CL

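TL;DR: Finds that although the norms GPT-4 generates are not necessarily wrong, they are less culture-specific, and that hidden cultural stereotypes can easily be recovered.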

Details

Motivation: Examine whether LLMs consistently apply Western or North American cultural values in real-world scenarios, rather than relying only on direct surveys. Method: Takes a bottom-up approach, asking LLMs to reason about cultural norms in narratives from different cultures. Result: The norms GPT-4 generates are less culture-specific, and hidden stereotypes can easily be recovered. Conclusion: Addressing these issues is crucial for developing LLMs that fairly serve a diverse user base. Abstract: LLMs have been demonstrated to align with the values of Western or North American cultures. Prior work predominantly showed this effect through leveraging surveys that directly ask (originally people and now also LLMs) about their values. However, it is hard to believe that LLMs would consistently apply those values in real-world scenarios. To address that, we take a bottom-up approach, asking LLMs to reason about cultural norms in narratives from different cultures. We find that GPT-4 tends to generate norms that, while not necessarily incorrect, are significantly less culture-specific. In addition, while it avoids overtly generating stereotypes, the stereotypical representations of certain cultures are merely hidden rather than suppressed in the model, and such stereotypes can be easily recovered. Addressing these challenges is a crucial step towards developing LLMs that fairly serve their diverse user base.

[278] PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language

Naghmeh Jamali,Milad Mohammadi,Danial Baledi,Zahra Rezvani,Hesham Faili

Main category: cs.CL

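TL;DR: PerMedCQA is the first Persian-language medical question answering benchmark, containing 68,138 question-answer pairs, for evaluating multilingual large language models on real-world medical questions.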

Details

Motivation: Fill the gap in medical question answering for low-resource languages such as Persian and provide personalized, reliable medical information. Method: Collects data from a medical QA forum, cleans it to build the PerMedCQA dataset, and evaluates multilingual LLMs with the MedJudge evaluation framework. Result: The study reveals key challenges in multilingual medical QA and offers insights for developing more accurate, context-aware medical assistance systems. Conclusion: PerMedCQA is an important resource for Persian medical QA and advances multilingual medical AI. Abstract: Medical consumer question answering (CQA) is crucial for empowering patients by providing personalized and reliable health information. Despite recent advances in large language models (LLMs) for medical QA, consumer-oriented and multilingual resources, particularly in low-resource languages like Persian, remain sparse. To bridge this gap, we present PerMedCQA, the first Persian-language benchmark for evaluating LLMs on real-world, consumer-generated medical questions. Curated from a large medical QA forum, PerMedCQA contains 68,138 question-answer pairs, refined through careful data cleaning from an initial set of 87,780 raw entries. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs, utilizing MedJudge, a novel rubric-based evaluation framework driven by an LLM grader, validated against expert human annotators. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems. The data is publicly available at https://huggingface.co/datasets/NaghmehAI/PerMedCQA

[279] Model Editing with Graph-Based External Memory

Yash Kumar Atri,Ahmed Alaa,Thomas Hartvigsen

Main category: cs.CL

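TL;DR: The HYPE framework uses hyperbolic geometry and graph neural networks for stable, precise editing of large language models, addressing overfitting and catastrophic forgetting.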

Details

Motivation: Large language models suffer from hallucinations and outdated knowledge, and existing editing methods are prone to overfitting and catastrophic forgetting. Method: The HYPE framework has three components: hyperbolic graph construction, Möbius-transformed updates, and dual stabilization. Result: Experiments show that HYPE significantly improves edit stability, factual accuracy, and multi-hop reasoning. Conclusion: HYPE provides an efficient and stable solution for dynamically updating large language models. Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their practical utility is often limited by persistent issues of hallucinations and outdated parametric knowledge. Although post-training model editing offers a pathway for dynamic updates, existing methods frequently suffer from overfitting and catastrophic forgetting. To tackle these challenges, we propose a novel framework that leverages hyperbolic geometry and graph neural networks for precise and stable model edits. We introduce HYPE (HYperbolic Parameter Editing), which comprises three key components: (i) Hyperbolic Graph Construction, which uses Poincaré embeddings to represent knowledge triples in hyperbolic space, preserving hierarchical relationships and preventing unintended side effects by ensuring that edits to parent concepts do not inadvertently affect child concepts; (ii) Möbius-Transformed Updates, which apply hyperbolic addition to propagate edits while maintaining structural consistency within the hyperbolic manifold, unlike conventional Euclidean updates that distort relational distances; and (iii) Dual Stabilization, which combines gradient masking and periodic GNN parameter resetting to prevent catastrophic forgetting by focusing updates on critical parameters and preserving long-term knowledge. Experiments on CounterFact, CounterFact+, and MQuAKE with GPT-J and GPT2-XL demonstrate that HYPE significantly enhances edit stability, factual accuracy, and multi-hop reasoning.

[280] The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs

Lucas Bandarkar,Nanyun Peng

Main category: cs.CL

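TL;DR: Studies how cross-lingual transfer can improve performance on low-resource language tasks, finds that the model parameters behind mathematical reasoning and multilingual capability do not overlap, and proposes modular methods (such as parameter freezing and model merging) to improve fine-tuning.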

Details

Motivation: Large language models perform poorly on tasks in low-resource languages, especially when task-specific data is lacking. Method: After validating that the parameters for mathematical reasoning and multilingual capability do not overlap, the authors develop modular frameworks (such as parameter freezing and model merging) to improve fine-tuning. Result: Across three languages, four models, and two fine-tuning paradigms, the modular methods improve on the baselines, with Layer-Swapping model merging performing best. Conclusion: Modular methods, Layer-Swapping merging in particular, effectively improve low-resource language task performance, and post hoc adjustment works better than freezing parameters. Abstract: Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overlapping. To exploit this implicit separability between task and target language parameterization, we develop and analyze numerous modular frameworks to improve the composition of the two during fine-tuning. These methods generally employ freezing parameters or post hoc model merging to assign math and language improvement to different key parts of the LLM. In the absence of in-language math data, we demonstrate that the modular approaches successfully improve upon baselines across three languages, four models, and two fine-tuning paradigms (full and LoRA). Furthermore, we find, somewhat surprisingly, that the most consistently successful modular method is fine-tuning separate language and math experts and merging them via Layer-Swapping. We offer possible explanations for this result via recent works on the linearity of task vectors. We further explain this by empirically showing that reverting less useful fine-tuning updates after training often outperforms freezing them from the start.
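
As a rough illustration of merging two experts by swapping layers, the Python sketch below builds a merged parameter dictionary from a math expert and a language expert. The abstract does not say which layers are taken from which expert, so the choice here (outermost layers from the language expert, everything else from the math expert), the `layers.<index>.<name>` naming scheme, and the function names are all assumptions.

```python
from typing import Dict, Optional


def layer_index(param_name: str) -> Optional[int]:
    """Extract the layer index from names like 'layers.3.weight'
    (a hypothetical naming scheme); return None for non-layer parameters."""
    parts = param_name.split(".")
    if len(parts) >= 2 and parts[0] == "layers" and parts[1].isdigit():
        return int(parts[1])
    return None


def layer_swap(
    math_expert: Dict[str, object],
    language_expert: Dict[str, object],
    num_layers: int,
    swap_bottom: int,
    swap_top: int,
) -> Dict[str, object]:
    """Start from the math expert and replace the bottom `swap_bottom` and
    top `swap_top` layers with the language expert's parameters.

    Which layers to swap is an assumption made for this sketch.
    """
    merged = dict(math_expert)
    for name, value in language_expert.items():
        idx = layer_index(name)
        if idx is not None and (idx < swap_bottom or idx >= num_layers - swap_top):
            merged[name] = value
    return merged


# Toy "state dicts" with string placeholders instead of tensors.
math_expert = {"embed.w": "m_e", "layers.0.w": "m0", "layers.1.w": "m1",
               "layers.2.w": "m2", "layers.3.w": "m3"}
language_expert = {"embed.w": "l_e", "layers.0.w": "l0", "layers.1.w": "l1",
                   "layers.2.w": "l2", "layers.3.w": "l3"}

merged = layer_swap(math_expert, language_expert, num_layers=4, swap_bottom=1, swap_top=1)
print(merged)
```

Because the merge happens after both experts are trained, no joint fine-tuning run is needed, which fits the abstract's observation that post hoc composition can beat freezing parameters from the start.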

[281] SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph Algorithms for Text-to-SQL on Large-Scale Databases

AmirHossein Safdarian,Milad Mohammadi,Ehsan Jahanbakhsh,Mona Shahamat Naderi,Heshaam Faili

Main category: cs.CL

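TL;DR: Proposes a zero-shot, training-free schema linking method that builds a schema graph and applies path-finding algorithms to improve SQL generation, achieving the best results on the BIRD benchmark.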

Details

Motivation: Schema linking is a key component of Text-to-SQL systems, but existing methods are complex and costly; its efficiency and accuracy need improvement. Method: Builds a schema graph from foreign key relations, uses Gemini 2.5 Flash to extract the tables referenced by the query, and applies path-finding algorithms to determine the sequence of tables to join. Result: Outperforms existing methods on the BIRD benchmark while being low-cost and highly scalable. Conclusion: The method is simple and efficient, significantly improves Text-to-SQL performance, and applies to models of different sizes. Abstract: Text-to-SQL systems translate natural language questions into executable SQL queries, and recent progress with large language models (LLMs) has driven substantial improvements in this task. Schema linking remains a critical component in Text-to-SQL systems, reducing prompt size for models with narrow context windows and sharpening model focus even when the entire schema fits. We present a zero-shot, training-free schema linking approach that first constructs a schema graph based on foreign key relations, then uses a single prompt to Gemini 2.5 Flash to extract source and destination tables from the user query, followed by applying classical path-finding algorithms and post-processing to identify the optimal sequence of tables and columns that should be joined, enabling the LLM to generate more accurate SQL queries. Despite being simple, cost-effective, and highly scalable, our method achieves state-of-the-art results on the BIRD benchmark, outperforming previous specialized, fine-tuned, and complex multi-step LLM-based approaches. We conduct detailed ablation studies to examine the precision-recall trade-off in our framework. Additionally, we evaluate the execution accuracy of our schema filtering method compared to other approaches across various model sizes.
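
The graph step of this pipeline can be sketched with a plain breadth-first search. The Python example below builds an undirected schema graph from foreign key pairs and finds a shortest join path between a source and a destination table. The table names are made up, the LLM extraction step is replaced by hard-coded inputs, and BFS is only one candidate for the "classical path-finding algorithms" mentioned in the abstract.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple


def build_schema_graph(foreign_keys: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected graph whose nodes are tables and whose edges
    are foreign key relations."""
    graph: Dict[str, Set[str]] = {}
    for child, parent in foreign_keys:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_join_path(
    graph: Dict[str, Set[str]], source: str, destination: str
) -> Optional[List[str]]:
    """Breadth-first search for a shortest chain of tables linking
    `source` to `destination`; returns None if they are not connected."""
    if source not in graph or destination not in graph:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == destination:
            return path
        for neighbour in sorted(graph[path[-1]]):  # sorted for determinism
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hypothetical schema: (referencing table, referenced table).
foreign_keys = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("reviews", "products"),
]
graph = build_schema_graph(foreign_keys)

# In the described pipeline these two tables would come from the LLM prompt.
path = shortest_join_path(graph, "customers", "products")
print(path)
```

Only the tables on the returned path (plus their join columns) would then be included in the SQL-generation prompt, which is how the schema is filtered down.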

[282] ShIOEnv: A CLI Behavior-Capturing Environment Enabling Grammar-Guided Command Synthesis for Dataset Curation

Jarrod Ragsdale,Rajendra Boppana

Main category: cs.CL

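TL;DR: Proposes the Shell Input-Output Environment (ShIOEnv), which models command construction as a Markov Decision Process and combines grammar masking with PPO optimization to improve sample efficiency and data quality; the resulting datasets are used to fine-tune CodeT5.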

Details

Motivation: Existing CLI interaction datasets lack execution data (exit codes, outputs, and so on), which limits their usefulness for behavioral modeling; smaller architectures need a method for generating high-quality datasets. Method: Introduce ShIOEnv, which treats command construction as a Markov Decision Process and generates datasets using grammar masking and PPO-optimized policies. Result: Grammar masking and PPO significantly improve sample efficiency; fine-tuning CodeT5 on the generated datasets improves BLEU-4 by 85% with grammar masking and by a further 26% with PPO. Conclusion: ShIOEnv and the generated datasets provide tools for future research and demonstrate the effectiveness of grammar constraints and reinforcement learning for CLI behavior modeling. Abstract: Command-line interfaces (CLIs) provide structured textual environments for system administration. Explorations have been performed using pre-trained language models (PLMs) to simulate these environments for safe interaction in high-risk environments. However, their use has been constrained to frozen, large parameter models like GPT. For smaller architectures to reach a similar level of believability, a rich dataset of CLI interactions is required. Existing public datasets focus on mapping natural-language tasks to commands, omitting crucial execution data such as exit codes, outputs, and environmental side effects, limiting their usability for behavioral modeling. We introduce a Shell Input-Output Environment (ShIOEnv), which casts command construction as a Markov Decision Process whose state is the partially built sequence and whose actions append arguments. After each action, ShIOEnv executes the candidate and returns its exit status, output, and progress toward a minimal-length behavioral objective. Due to the intractable nature of the combinatorial argument state-action space, we derive a context-free grammar from man pages to mask invalid arguments from being emitted. We explore random and proximal-policy optimization (PPO)-optimized sampling of unrestricted and grammar-masked action spaces to produce four exploration strategies. We observed that grammar masking and PPO significantly improve sample efficiency to produce a higher quality dataset (maximizing the number of arguments while minimizing redundancies). Policy-generated datasets of shell input-output behavior pairs are used to fine-tune CodeT5, where we observe 85% improvements in BLEU-4 when constraining the action space to grammar productions with an additional 26% improvement when applying PPO. The ShIOEnv environment and curated command behavior datasets are released for use in future research.

[283] NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Abdellah El Mekki,Houdaifa Atou,Omer Nacar,Shady Shehata,Muhammad Abdul-Mageed

Main category: cs.CL

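TL;DR: Proposes a method for tailoring pre-training data to low-resource languages that combines synthetic and retrieval-based data and accounts for language, cultural heritage, and cultural values; NileChat is developed using Egyptian and Moroccan dialects as testbeds.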

Details

Motivation: Strengthen LLM support for low-resource languages while avoiding the cultural bias that arises when existing methods rely on data translated from English. Method: Combine synthetic and retrieval-based data to build pre-training data tailored to a specific community's language, cultural heritage, and cultural values, and use it to develop NileChat. Result: NileChat outperforms Arabic-aware LLMs of similar size on understanding and translation tasks and performs on par with larger models. Conclusion: The method improves LLM coverage of low-resource languages and cultures and promotes the inclusion of more diverse communities. Abstract: Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B parameter LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in LLM development.

[284] RaDeR: Reasoning-aware Dense Retrieval Models

Debrup Das,Sam O' Nuallain,Razieh Rahimi

Main category: cs.CL

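TL;DR: RaDeR is a set of reasoning-based dense retrieval models trained on LLM-generated mathematical problem-solving data; it uses retrieval-augmented reasoning trajectories and self-reflective relevance evaluation to improve retrieval for reasoning-intensive tasks.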

Details

Motivation: Improve retrieval performance on reasoning-intensive tasks, especially math and coding, while reducing dependence on large amounts of training data. Method: Train dense retrievers on LLM-generated data, using retrieval-augmented reasoning trajectories and self-reflective relevance evaluation to create diverse and hard-negative samples. Result: RaDeR performs strongly on the BRIGHT and RAR-b benchmarks, significantly outperforming baselines on the Math and Coding splits, and matches or exceeds REASONIR while using only 2.5% of its training data. Conclusion: RaDeR demonstrates the importance of reasoning-based retrieval for augmenting reasoning language models and offers a clear advantage in data efficiency. Abstract: We propose RaDeR, a set of reasoning-based dense retrieval models trained with data derived from mathematical problem solving using large language models (LLMs). Our method leverages retrieval-augmented reasoning trajectories of an LLM and self-reflective relevance evaluation, enabling the creation of both diverse and hard-negative samples for reasoning-intensive relevance. RaDeR retrievers, trained for mathematical reasoning, effectively generalize to diverse reasoning tasks in the BRIGHT and RAR-b benchmarks, consistently outperforming strong baselines in overall performance. Notably, RaDeR achieves significantly higher performance than baselines on the Math and Coding splits. In addition, RaDeR presents the first dense retriever that outperforms BM25 when queries are Chain-of-Thought reasoning steps, underscoring the critical role of reasoning-based retrieval to augment reasoning language models. Furthermore, RaDeR achieves comparable or superior performance while using only 2.5% of the training data used by the concurrent work REASONIR, highlighting the quality of our synthesized training data.

Yue Jiang,Jichu Li,Yang Liu,Dingkang Yang,Feng Zhou,Quyu Kong

Main category: cs.CL

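TL;DR: DanmakuTPPBench is a benchmark for multi-modal temporal point process (TPP) modeling, comprising the DanmakuTPP-Events dataset and the DanmakuTPP-QA question-answering dataset, intended to advance multi-modal TPP models.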

Details

Motivation: Existing TPP datasets are mostly unimodal, which limits progress on multi-modal temporal event modeling; DanmakuTPPBench fills this gap. Method: Build a multi-modal event dataset from bullet comments (Danmaku) on the Bilibili platform, and generate a question-answering dataset using LLMs and MLLMs. Result: Evaluations reveal significant performance gaps in how current methods model multi-modal events. Conclusion: The benchmark provides a foundation for multi-modal TPP modeling and calls for further integration of TPP modeling with multi-modal language modeling. Abstract: We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. The code and dataset have been released at https://github.com/FRENKIE-CHIANG/DanmakuTPPBench

[286] Retrieval Augmented Generation-based Large Language Models for Bridging Transportation Cybersecurity Legal Knowledge Gaps

Khandakar Ashrafi Akbar,Md Nahiyan Uddin,Latifur Khan,Trayce Hockstad,Mizanur Rahman,Mashrur Chowdhury,Bhavani Thuraisingham

Main category: cs.CL

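TL;DR: Proposes a Retrieval-Augmented Generation (RAG) based large language model (LLM) framework to support policymakers in addressing cybersecurity and data privacy legal issues in transportation systems.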

Details

Motivation: As transportation systems become more automated and connected, existing laws need updating to address emerging cybersecurity and data privacy challenges. Method: Use a RAG framework that combines retrieval mechanisms with a curated set of domain-specific questions to reduce LLM hallucinations and improve the accuracy and factual grounding of responses. Result: The framework outperforms leading commercial LLMs on four metrics (AlignScore, ParaScore, BERTScore, and ROUGE), showing greater reliability and context awareness. Conclusion: The approach offers a scalable, AI-driven method for legislative analysis that helps legal frameworks keep pace with technological progress. Abstract: As connected and automated transportation systems evolve, there is a growing need for federal and state authorities to revise existing laws and develop new statutes to address emerging cybersecurity and data privacy challenges. This study introduces a Retrieval-Augmented Generation (RAG) based Large Language Model (LLM) framework designed to support policymakers by extracting relevant legal content and generating accurate, inquiry-specific responses. The framework focuses on reducing hallucinations in LLMs by using a curated set of domain-specific questions to guide response generation. By incorporating retrieval mechanisms, the system enhances the factual grounding and specificity of its outputs. Our analysis shows that the proposed RAG-based LLM outperforms leading commercial LLMs across four evaluation metrics: AlignScore, ParaScore, BERTScore, and ROUGE, demonstrating its effectiveness in producing reliable and context-aware legal insights. This approach offers a scalable, AI-driven method for legislative analysis, supporting efforts to update legal frameworks in line with advancements in transportation technologies.

[287] Voice of a Continent: Mapping Africa's Speech Technology Frontier

AbdelRahim Elmadany,Sang Yun Kwon,Hawau Olamide Toyin,Alcides Alcoba Inciarte,Hanan Aldarmaki,Muhammad Abdul-Mageed

Main category: cs.CL

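TL;DR: Proposes the SimbaBench benchmark and the Simba family of models to close the gap for African languages in speech technology and improve digital inclusion.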

Details

Motivation: African languages are underrepresented in speech technologies, which hinders digital inclusion and calls for a systematic response. Method: Systematically map African speech datasets and technologies, build the SimbaBench benchmark, and develop the Simba family of models. Result: Simba models achieve state-of-the-art performance across multiple African languages and tasks, and the analysis reveals critical patterns in resource availability. Conclusion: The study highlights the need to expand speech technology resources to reflect Africa's linguistic diversity and lays a foundation for more inclusive technologies. Abstract: Africa's rich linguistic diversity remains significantly underrepresented in speech technologies, creating barriers to digital inclusion. To alleviate this challenge, we systematically map the continent's speech space of datasets and technologies, leading to a new comprehensive benchmark SimbaBench for downstream African speech tasks. Using SimbaBench, we introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our benchmark analysis reveals critical patterns in resource availability, while our model evaluation demonstrates how dataset quality, domain diversity, and language family relationships influence performance across languages. Our work highlights the need for expanded speech technology resources that better reflect Africa's linguistic diversity and provides a solid foundation for future research and development efforts toward more inclusive speech technologies.

[288] Efficient Long CoT Reasoning in Small Language Models

Zhaoyang Wang,Jinqi Jiang,Tian Qiu,Hui Liu,Xianfeng Tang,Huaxiu Yao

Main category: cs.CL

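TL;DR: Proposes a method for pruning redundant steps in long chain-of-thought (CoT) reasoning and uses an on-policy method to help small language models (SLMs) learn efficient long-CoT reasoning.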

Details

Motivation: Large reasoning models such as DeepSeek-R1 can generate long CoT, but small language models (SLMs) struggle to learn long CoT directly, and redundant content makes learning harder still. Method: Prune redundant steps from long CoT, then use an on-policy method that lets the SLM itself curate valid and useful long-CoT training data. Result: Experiments show that the method significantly reduces redundant reasoning steps while keeping SLMs competitive on mathematical reasoning tasks. Conclusion: The method distills long-CoT reasoning ability into SLMs, improving learning efficiency and reducing redundancy. Abstract: Recent large reasoning models such as DeepSeek-R1 exhibit strong complex problem-solving abilities by generating long chain-of-thought (CoT) reasoning steps. It is challenging to directly train small language models (SLMs) to produce long CoT. Thus, distillation becomes a practical method to give SLMs such reasoning ability. However, long CoT often contains a lot of redundant content (e.g., overthinking steps), which may make it hard for SLMs to learn given their relatively poor capacity and generalization. To address this issue, we propose a simple-yet-effective method to prune unnecessary steps in long CoT, and then employ an on-policy method for the SLM itself to curate valid and useful long CoT training data. In this way, SLMs can effectively learn efficient long CoT reasoning and preserve competitive performance at the same time. Experimental results across a series of mathematical reasoning benchmarks demonstrate the effectiveness of the proposed method in distilling long CoT reasoning ability into SLMs, maintaining competitive performance while significantly reducing the generation of redundant reasoning steps.

[289] BRIT: Bidirectional Retrieval over Unified Image-Text Graph

Ainulla Khan,Yamada Moyuru,Srinidhi Akella

Main category: cs.CL

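TL;DR: BRIT is a multi-modal RAG framework that unifies text-image connections in a multi-modal graph to retrieve cross-modal content effectively and improve answers to complex multi-hop questions.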

Details

Motivation: Existing RAG techniques mainly target text queries; RAG over multi-modal documents containing text and images is underexplored, especially when fine-tuning does not work. Method: Propose BRIT, which builds a multi-modal graph unifying text-image connections, retrieves a query-specific sub-graph, and gathers relevant content by traversing paths in both directions. Result: Experiments show that BRIT performs strongly on multi-modal question answering and handles cross-modal questions effectively. Conclusion: BRIT offers an effective solution for multi-modal RAG, particularly for complex cross-modal questions. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. While recent advancements have mainly focused on improving RAG for text-based queries, RAG on multi-modal documents containing both texts and images has not been fully explored, especially when fine-tuning does not work. This paper proposes BRIT, a novel multi-modal RAG framework that effectively unifies various text-image connections in the document into a multi-modal graph and retrieves the texts and images as a query-specific sub-graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieves not only directly query-relevant images and texts but also further content relevant to answering complex cross-modal multi-hop questions. To evaluate the effectiveness of BRIT, we introduce the MM-RAG test set, specifically designed for multi-modal question answering tasks that require understanding text-image relations. Our comprehensive experiments demonstrate the superiority of BRIT, highlighting its ability to handle cross-modal questions on multi-modal documents.

[290] MedScore: Factuality Evaluation of Free-Form Medical Answers

Heyuan Huang,Alexandra DeLucia,Vijay Murari Tiyyala,Mark Dredze

Main category: cs.CL

TL;DR: MedScore is a new method for decomposing medical answers into condition-aware valid facts, improving the reliability of factuality evaluation.

Details

Motivation: Existing factuality evaluation systems perform poorly in the medical domain because medical answers are condition-dependent and diverse, which makes them hard to decompose into valid facts. Method: Propose MedScore, which decomposes medical answers in a condition-aware way, extracting more valid facts while reducing hallucination and vague references. Result: MedScore extracts up to three times more valid facts than existing methods. Conclusion: MedScore shows the importance of customizing each step of factuality evaluation for the medical domain. Abstract: While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing the generations into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically only evaluated on objective, entity-centric, formulaic texts such as biographies and historical topics. This differs from condition-dependent, conversational, hypothetical, sentence-structure diverse, and subjective medical answers, which makes decomposition into valid facts challenging. We propose MedScore, a new approach to decomposing medical answers into condition-aware valid facts. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references, and retaining condition-dependency in facts. The resulting factuality score varies significantly by decomposition method, verification corpus, and backbone LLM used, highlighting the importance of customizing each step for reliable factuality evaluation.

[291] Hybrid Latent Reasoning via Reinforcement Learning

Zhenrui Yue,Bowen Jin,Huimin Zeng,Honglei Zhuang,Zhen Qin,Jinsung Yoon,Lanyu Shang,Jiawei Han,Dong Wang

Main category: cs.CL

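TL;DR: HRPO is a reinforcement-learning-based hybrid latent reasoning method that combines discrete and continuous representations to improve LLM performance on knowledge- and reasoning-intensive tasks.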

Details

Motivation: Existing latent reasoning methods are incompatible with the discrete generation of LLMs and rely on chain-of-thought traces for training, so they fail to exploit the models' inherent reasoning ability. Method: Propose HRPO, which integrates hidden states into sampled tokens through a learnable gating mechanism and progressively incorporates more hidden features during training. Result: HRPO outperforms prior methods across diverse benchmarks; the resulting models remain interpretable and exhibit intriguing behaviors such as cross-lingual patterns and shorter completion lengths. Conclusion: HRPO demonstrates the potential of RL-based latent reasoning and offers new directions for future research. Abstract: Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.

[292] Anchored Diffusion Language Model

Litu Rout,Constantine Caramanis,Sanjay Shakkottai

Main category: cs.CL

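TL;DR: ADLM uses a two-stage framework (predict important tokens, then predict missing tokens conditioned on those anchors) to improve diffusion language models, narrowing the gap with autoregressive models and achieving state-of-the-art results on several tasks.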

Details

Motivation: Diffusion language models (DLMs) offer parallel generation and bidirectional context but trail autoregressive (AR) models in likelihood modeling and generated text quality. The authors find that the gap arises when important tokens are masked early, leaving too little context. Method: Propose ADLM, in which an anchor network predicts distributions over important tokens and the likelihoods of missing tokens are then predicted conditioned on those anchors. Result: ADLM improves test perplexity on LM1B and OpenWebText by up to 25.4% over prior DLMs, narrows the gap with AR models, achieves state-of-the-art zero-shot generalization, and surpasses AR models in MAUVE score. Conclusion: ADLM improves diffusion models and also shows the potential of anchoring for AR models and for math and logic tasks. Abstract: Diffusion Language Models (DLMs) promise parallel generation and bidirectional context, yet they underperform autoregressive (AR) models in both likelihood modeling and generated text quality. We identify that this performance gap arises when important tokens (e.g., key words or low-frequency words that anchor a sentence) are masked early in the forward process, limiting contextual information for accurate reconstruction. To address this, we introduce the Anchored Diffusion Language Model (ADLM), a novel two-stage framework that first predicts distributions over important tokens via an anchor network, and then predicts the likelihoods of missing tokens conditioned on the anchored predictions. ADLM significantly improves test perplexity on LM1B and OpenWebText, achieving up to 25.4% gains over prior DLMs, and narrows the gap with strong AR baselines. It also achieves state-of-the-art performance in zero-shot generalization across seven benchmarks and surpasses AR models in MAUVE score, which marks the first time a DLM generates better human-like text than an AR model. Theoretically, we derive an Anchored Negative Evidence Lower Bound (ANELBO) objective and show that anchoring improves sample complexity and likelihood modeling. Beyond diffusion, anchoring boosts performance in AR models and enhances reasoning in math and logic tasks, outperforming existing chain-of-thought approaches.

[293] Measuring South Asian Biases in Large Language Models

Mamnuya Rinki,Chahat Raj,Anjishnu Mukherjee,Ziwei Zhu

Main category: cs.CL

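TL;DR: Evaluates cultural bias in large language models in South Asian multilingual settings through multilingual and intersectional analysis, and introduces a new bias lexicon and self-debiasing strategies.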

Details

Motivation: Existing research often overlooks intersectional and culturally specific biases in multilingual regions such as South Asia, especially subtle biases related to gender, religion, and similar dimensions. Method: Construct a culturally grounded bias lexicon, analyze 10 Indo-Aryan and Dravidian languages, and test the effectiveness of two self-debiasing strategies. Result: Cultural bias is clearly present in generative tasks, and simple debiasing prompts have limited effect. Conclusion: The study offers a new perspective on cultural bias and proposes an evaluation framework suited to multilingual settings. Abstract: Evaluations of Large Language Models (LLMs) often overlook intersectional and culturally specific biases, particularly in underrepresented multilingual regions like South Asia. This work addresses these gaps by conducting a multilingual and intersectional analysis of LLM outputs across 10 Indo-Aryan and Dravidian languages, identifying how cultural stigmas influenced by purdah and patriarchy are reinforced in generative tasks. We construct a culturally grounded bias lexicon capturing previously unexplored intersectional dimensions including gender, religion, marital status, and number of children. We use our lexicon to quantify intersectional bias and the effectiveness of self-debiasing in open-ended generations (e.g., storytelling, hobbies, and to-do lists), where bias manifests subtly and remains largely unexamined in multilingual contexts. Finally, we evaluate two self-debiasing strategies (simple and complex prompts) to measure their effectiveness in reducing culturally specific bias in Indo-Aryan and Dravidian languages. Our approach offers a nuanced lens into cultural bias by introducing a novel bias lexicon and evaluation framework that extends beyond Eurocentric or small-scale multilingual settings.

[294] Investigating AI Rater Effects of Large Language Models: GPT, Claude, Gemini, and DeepSeek

Hong Jiao,Dan Song,Won-Chan Lee

Main category: cs.CL

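TL;DR: Compares ten large language models (LLMs) with human expert raters in scoring writing tasks and finds that ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet perform best in scoring accuracy, rater consistency, and reduced rater effects.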

Details

Motivation: Examine the reliability of LLM-based automated scoring in low-stakes assessment and provide empirical evidence for practical use. Method: Compare ten LLMs with human expert raters on two types of writing tasks in terms of scoring accuracy (Quadratic Weighted Kappa), intra-rater consistency (Cronbach Alpha), and rater effects (Many-Facet Rasch model). Result: ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet perform best, with high scoring accuracy and low rater effects. Conclusion: The results support using ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet for automated scoring because of their high reliability and low rater effects. Abstract: Large language models (LLMs) have been widely explored for automated scoring in low-stakes assessment to facilitate learning and instruction. Empirical evidence related to which LLM produces the most reliable scores and induces the fewest rater effects needs to be collected before the use of LLMs for automated scoring in practice. This study compared ten LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, as well as DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. The accuracy of the holistic and analytic scores from LLMs compared with human raters was evaluated in terms of Quadratic Weighted Kappa. Intra-rater consistency across prompts was compared in terms of Cronbach Alpha. Rater effects of LLMs were evaluated and compared with human raters using the Many-Facet Rasch model. The results in general supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet with high scoring accuracy, better rater reliability, and fewer rater effects.
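
For readers unfamiliar with the accuracy metric named above, Quadratic Weighted Kappa can be computed as in the sketch below. This is the standard textbook definition, not code from the study: the function name, the integer score scale, and the example ratings are assumptions made for illustration.

```python
from typing import List


def quadratic_weighted_kappa(rater_a: List[int], rater_b: List[int],
                             min_rating: int, max_rating: int) -> float:
    """Agreement between two raters on an ordinal scale.

    1.0 is perfect agreement, 0.0 is chance-level agreement, and negative
    values mean agreement worse than chance. Disagreements are penalized by
    the squared distance between the two scores.
    """
    n = len(rater_a)
    n_cat = max_rating - min_rating + 1
    if n == 0 or n != len(rater_b):
        raise ValueError("rating lists must be non-empty and of equal length")
    if n_cat < 2:
        raise ValueError("the scale needs at least two categories")

    # Observed confusion matrix between the two raters.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal score histograms, used for the chance-expected matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * hist_a[i] * hist_b[j] / n

    if denominator == 0.0:
        # Degenerate case: both raters gave every essay the same single score.
        return 1.0
    return 1.0 - numerator / denominator


# Hypothetical scores on a 1-4 scale.
human = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human, [1, 2, 3, 4], 1, 4))  # identical ratings
print(quadratic_weighted_kappa(human, [4, 3, 2, 1], 1, 4))  # fully reversed ratings
```

Identical ratings give a kappa of 1 and fully reversed ratings give -1, which brackets the range an LLM rater's agreement with humans can fall into.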

[295] The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models

Kefan Yu,Qingcheng Zeng,Weihao Xuan,Wanxin Li,Jingyi Wu,Rob Voigt

Main category: cs.CL

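TL;DR: The ALTPRAG dataset is used to evaluate how well LLMs at different training stages infer speaker intentions; increasing model and data scale improves pragmatic competence.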

Details

Motivation: Study how LLMs acquire pragmatic competence during training, filling a gap in current understanding. Method: Use the ALTPRAG dataset to evaluate the pragmatic competence of 22 LLMs across pre-training, supervised fine-tuning, and preference optimization. Result: Base models already show pragmatic sensitivity; larger model and data scale improve it further; SFT and RLHF contribute notable gains in cognitive-pragmatic reasoning. Conclusion: Pragmatic competence is an emergent and compositional property of LLM training, which offers new insights for aligning models with human communicative norms. Abstract: Current large language models (LLMs) have demonstrated emerging capabilities in social intelligence tasks, including implicature resolution (Sravanthi et al. (2024)) and theory-of-mind reasoning (Shapira et al. (2024)), both of which require substantial pragmatic understanding. However, how LLMs acquire this competence throughout the training process remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, designed to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two contextually appropriate but pragmatically distinct continuations, enabling fine-grained assessment of both pragmatic interpretation and contrastive reasoning. We systematically evaluate 22 LLMs across key training stages: pre-training, supervised fine-tuning (SFT), and preference optimization, to examine the development of pragmatic competence. Our results show that even base models exhibit notable sensitivity to pragmatic cues, which improves consistently with increases in model and data scale. Additionally, SFT and RLHF contribute further gains, particularly in cognitive-pragmatic reasoning. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.

[296] How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation

Xin Lu,Yanyan Zhao,Si Wei,Shijin Wang,Bing Qin,Ting Liu

Main category: cs.CL

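TL;DR: Examines how sequence modeling architectures affect the base capabilities of pre-trained language models, proposes a limited-domain pre-training setting that reveals significant differences between architectures, and distills a key design principle.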

Details

Motivation: Study how sequence modeling architectures affect the base capabilities of pre-trained language models, since the commonly used mixed-domain pre-training setting fails to reveal architectural differences. Method: Propose a limited-domain pre-training setting with out-of-distribution testing, analyze the base capabilities of stateful sequence modeling architectures, and derive design principles through architecture component analysis. Result: Stateful sequence modeling architectures show significantly degraded base capabilities; full-sequence arbitrary selection capability is identified as the key design principle and validated experimentally. Conclusion: The resulting design principle is a useful reference for future architecture improvements and new designs. Abstract: Pre-trained language models represented by the Transformer have been proven to possess strong base capabilities, and the representative self-attention mechanism in the Transformer has become a classic in sequence modeling architectures. Different from the work of proposing sequence modeling architecture to improve the efficiency of attention mechanism, this work focuses on the impact of sequence modeling architectures on base capabilities. Specifically, our concern is: How exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? In this work, we first point out that the mixed domain pre-training setting commonly adopted in existing architecture design works fails to adequately reveal the differences in base capabilities among various architectures. To address this, we propose a limited domain pre-training setting with out-of-distribution testing, which successfully uncovers significant differences in base capabilities among architectures at an early stage. Next, we analyze the base capabilities of stateful sequence modeling architectures, and find that they exhibit significant degradation in base capabilities compared to the Transformer. Then, through a series of architecture component analyses, we summarize a key architecture design principle: A sequence modeling architecture needs to possess full-sequence arbitrary selection capability to avoid degradation in base capabilities. Finally, we empirically validate this principle using an extremely simple Top-1 element selection architecture and further generalize it to a more practical Top-1 chunk selection architecture. Experimental results demonstrate our proposed sequence modeling architecture design principle and suggest that our work can serve as a valuable reference for future architecture improvements and novel designs.

[297] metaTextGrad: Automatically optimizing language model optimizers

Guowei Xu,Mert Yuksekgonul,Carlos Guestrin,James Zou

Main category: cs.CL

TL;DR: Proposes metaTextGrad, a meta-optimizer that enhances existing LLM-based optimizers and improves performance by aligning them to the task.

Details

Motivation: Existing LLM-based optimizers are designed by hand, are not themselves optimized, and are general-purpose rather than tailored to specific tasks. Method: Propose metaTextGrad, which consists of a meta prompt optimizer and a meta structure optimizer. Result: Across multiple benchmarks, the method achieves an average absolute performance improvement of up to 6% over the best baseline. Conclusion: metaTextGrad effectively improves the task fit and performance of existing optimizers. Abstract: Large language models (LLMs) are increasingly used in learning algorithms, evaluations, and optimization tasks. Recent studies have shown that using LLM-based optimizers to automatically optimize model prompts, demonstrations, predictions themselves, or other components can significantly enhance the performance of AI systems, as demonstrated by frameworks such as DSPy and TextGrad. However, optimizers built on language models themselves are usually designed by humans with manual design choices; optimizers themselves are not optimized. Moreover, these optimizers are general purpose by design, to be useful to a broad audience, and are not tailored for specific tasks. To address these challenges, we propose metaTextGrad, which focuses on designing a meta-optimizer to further enhance existing optimizers and align them to be good optimizers for a given task. Our approach consists of two key components: a meta prompt optimizer and a meta structure optimizer. The combination of these two significantly improves performance across multiple benchmarks, achieving an average absolute performance improvement of up to 6% compared to the best baseline.

[298] Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models

Haoyuan Sun,Jiaqi Wu,Bo Xia,Yifu Luo,Yifei Zhao,Kai Qin,Xufei Lv,Tiantian Zhang,Yongzhe Chang,Xueqian Wang

Main category: cs.CL

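TL;DR: Reinforcement fine-tuning (RFT) shows great potential for improving the reasoning capability of multimodal large language models (MLLMs) and has driven the development of frontier AI models. This position paper summarizes five key improvements brought by RFT and proposes five directions for future research.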

Details

Motivation: At a critical stage in the pursuit of Artificial General Intelligence (AGI), explore how RFT can improve the reasoning capability of MLLMs. Method: Introduce the background knowledge of RFT in detail and summarize its improvements to MLLM reasoning in five key points. Result: RFT has advanced MLLM reasoning across diverse modalities, diverse tasks and domains, training algorithms, benchmarks, and engineering frameworks. Conclusion: The paper offers insights for AGI development and proposes five promising directions for future research. Abstract: Standing in 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential in enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. To begin with, we provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. Furthermore, we meticulously summarize the improvements of RFT in powering reasoning capability of MLLMs into five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. Summary of works done on RFT for MLLMs is available at https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs.

[299] Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

Chen Yang,Ruping Xu,Ruizhe Li,Bin Cao,Jing Fan

Main category: cs.CL

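TL;DR: Introduces BPRF, a new Chinese dataset, and ExIde, a framework for extracting rule flows and their dependencies from business documents, and validates them with LLMs.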

Details

Motivation: Rule flows in business documents remain underexplored; prior work has focused mainly on mining procedural flows from instructional texts. Method: Build the BPRF dataset with 326 annotated rules and propose the ExIde framework, which uses LLMs to automatically extract rules and identify their dependencies. Result: Evaluated with 12 SOTA LLMs, ExIde effectively extracts structured rules and analyzes their dependencies. Conclusion: ExIde paves the way for more automated and interpretable business process automation. Abstract: Process mining aims to discover, monitor and optimize the actual behaviors of real processes. While prior work has mainly focused on extracting procedural action flows from instructional texts, rule flows embedded in business documents remain underexplored. To this end, we introduce a novel annotated Chinese dataset, BPRF, which contains 50 business process documents with 326 explicitly labeled business rules across multiple domains. Each rule is represented as a pair, and we annotate logical dependencies between rules (sequential, conditional, or parallel). We also propose ExIde, a framework for automatic business rule extraction and dependency relationship identification using large language models (LLMs). We evaluate ExIde using 12 state-of-the-art (SOTA) LLMs on the BPRF dataset, benchmarking performance on both rule extraction and dependency classification tasks of current LLMs. Our results demonstrate the effectiveness of ExIde in extracting structured business rules and analyzing their interdependencies for current SOTA LLMs, paving the way for more automated and interpretable business process automation.

[300] Composable Cross-prompt Essay Scoring by Merging Models

Sanwoo Lee,Kun Liang,Yunfang Wu

Main category: cs.CL

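TL;DR: Proposes a source-free approach to cross-prompt automated essay scoring that selectively merges the parameters of source models rather than their datasets, which addresses privacy concerns and outperforms conventional methods in experiments.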

Details

Motivation: Conventional cross-prompt automated essay scoring requires simultaneous access to the datasets of all source prompts, which is both suboptimal and a privacy concern, so the paper proposes an adaptation method that needs no source datasets. Method: Simulates joint training through linear combinations of task vectors (parameter updates), proposes the PIM objective to optimize the combination coefficients, and solves it efficiently with Bayesian optimization. Result: The method outperforms joint training on all sources in in-dataset and cross-dataset adaptation, is more robust than other merging methods, and excels under severe distribution shift. Conclusion: The method improves the performance and robustness of cross-prompt automated essay scoring while protecting privacy. Abstract: Recent advances in cross-prompt automated essay scoring (AES) typically train models jointly on all source prompts, often requiring additional access to unlabeled target prompt essays simultaneously. However, using all sources is suboptimal in our pilot study, and re-accessing source datasets during adaptation raises privacy concerns. We propose a source-free adaptation approach that selectively merges individually trained source models' parameters instead of datasets. In particular, we simulate joint training through linear combinations of task vectors -- the parameter updates from fine-tuning. To optimize the combination's coefficients, we propose Prior-encoded Information Maximization (PIM), an unsupervised objective which promotes the model's score discriminability regularized by priors pre-computed from the sources. We employ Bayesian optimization as an efficient optimizer of PIM. Experimental results with LLMs on in-dataset and cross-dataset adaptation show that our method (1) consistently outperforms training jointly on all sources, (2) maintains superior robustness compared to other merging methods, (3) excels under severe distribution shifts where recent leading cross-prompt methods struggle, all while retaining computational efficiency.
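
The linear combination of task vectors mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: parameters are flat lists of floats, the names `task_vector` and `merge` are hypothetical, and the coefficients are fixed by hand, whereas the paper selects them with the PIM objective and Bayesian optimization (not shown here).

```python
def task_vector(pretrained, finetuned):
    """Parameter update produced by fine-tuning: finetuned minus pretrained."""
    return [f - p for p, f in zip(pretrained, finetuned)]


def merge(pretrained, finetuned_models, coeffs):
    """Add a weighted sum of task vectors to the pretrained parameters."""
    merged = list(pretrained)
    for coeff, model in zip(coeffs, finetuned_models):
        delta = task_vector(pretrained, model)
        merged = [m + coeff * d for m, d in zip(merged, delta)]
    return merged


# Toy parameters (hypothetical values, chosen only for illustration).
pretrained = [0.0, 1.0, 2.0]
source_a = [1.0, 1.0, 2.0]   # model fine-tuned on source prompt A
source_b = [0.0, 3.0, 2.0]   # model fine-tuned on source prompt B

merged = merge(pretrained, [source_a, source_b], coeffs=[0.5, 0.25])
print(merged)
```

Setting a coefficient to zero drops that source model entirely, which is one way to read the "selective" merging the abstract describes.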

[301] MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors

Baraa Hikal,Mohamed Basem,Islam Oshallah,Ali Hamdi

Main category: cs.CL

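TL;DR: MSA-MathEval is an approach for evaluating AI tutor responses that combines a unified training pipeline with a disagreement-aware ensemble inference strategy and performs strongly across multiple dimensions.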

Details

Motivation: Evaluate AI tutor responses along multiple dimensions: mistake identification, mistake location, providing guidance, and actionability. Method: Fine-tunes an instruction-tuned language model with a unified training pipeline and introduces a disagreement-aware ensemble inference strategy to improve prediction reliability. Result: Ranked 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. Conclusion: Demonstrates the effectiveness of scalable instruction tuning and disagreement-driven modeling for multi-dimensional evaluation. Abstract: We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.

[302] Unraveling Misinformation Propagation in LLM Reasoning

Yiyang Feng,Yichen Wang,Shaobo Cui,Boi Faltings,Mina Lee,Jiawei Zhou

Main category: cs.CL

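TL;DR: The paper studies how misinformation affects large language models (LLMs) in mathematical reasoning and explores ways to correct it. Even with explicit instructions, LLMs correct the misinformation less than 50% of the time, but early correction and fine-tuning significantly improve accuracy.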

Details

Motivation: In real-world use, misinformation in user input can degrade LLM reasoning, yet this effect is underexplored. The paper analyzes how misinformation propagates and looks for effective correction strategies. Method: Uses mathematical reasoning tasks to analyze how misinformation affects intermediate reasoning steps and final answers, tests how well LLMs correct it when explicitly instructed, and studies how early correction and fine-tuning improve accuracy. Result: LLMs correct misinformation less than 50% of the time, leading to significant accuracy drops (10.02% to 72.20%). Early correction and fine-tuning reduce misinformation propagation and improve reasoning accuracy. Conclusion: The paper offers a practical approach, based on early correction and fine-tuning, to reducing misinformation propagation and improving the reliability of LLM reasoning. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning, positioning them as promising tools for supporting human problem-solving. However, what happens when their performance is affected by misinformation, i.e., incorrect inputs introduced by users due to oversights or gaps in knowledge? Such misinformation is prevalent in real-world interactions with LLMs, yet how it propagates within LLMs' reasoning process remains underexplored. Focusing on mathematical reasoning, we present a comprehensive analysis of how misinformation affects intermediate reasoning steps and final answers. We also examine how effectively LLMs can correct misinformation when explicitly instructed to do so. Even with explicit instructions, LLMs succeed less than half the time in rectifying misinformation, despite possessing correct internal knowledge, leading to significant accuracy drops (10.02% - 72.20%). Further analysis shows that applying factual corrections early in the reasoning process most effectively reduces misinformation propagation, and fine-tuning on synthesized data with early-stage corrections significantly improves reasoning factuality. Our work offers a practical approach to mitigating misinformation propagation.

[303] Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Jun Zhuang,Haibo Jin,Ye Zhang,Zhengjian Kang,Wenbin Zhang,Gaby G. Dagher,Haohan Wang

Main category: cs.CL

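TL;DR: The paper studies the vulnerability of intent detection in large language models (LLMs) and proposes IntentPrompt, a two-stage intent-based prompt-refinement framework that significantly raises jailbreak attack success rates.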

Details

Motivation: Intent detection plays an important role in LLM content moderation, but its robustness under malicious manipulation is under-explored. Method: Proposes the IntentPrompt framework, which refines prompts in two stages (structured outlines, then declarative-style narratives) to strengthen jailbreak attacks for red-teaming. Result: IntentPrompt outperforms existing methods across several benchmarks and LLMs, with attack success rates of 88.25% to 96.54% against CoT-based defenses on o1 and 86.75% to 97.12% against IA-based defenses on GPT-4o. Conclusion: Intent manipulation exposes a weakness in LLM safety mechanisms and poses a new challenge to content moderation guardrails. Abstract: Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our "FSTR+SPIN" variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs' safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.

[304] TAG-INSTRUCT: Controlled Instruction Complexity Enhancement through Structure-based Augmentation

He Zhu,Zhiwen Ruan,Junyou Su,Xingwei He,Wenjia Zhang,Yun Chen,Guanhua Chen

Main category: cs.CL

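TL;DR: TAG-INSTRUCT is a framework that increases instruction complexity through structured semantic compression and controlled difficulty augmentation, outperforming existing methods.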

Details

Motivation: High-quality instruction data is crucial for developing large language models, but existing methods struggle to control instruction complexity effectively. Method: TAG-INSTRUCT compresses instructions into a compact tag space and systematically increases complexity through RL-guided tag expansion. Result: TAG-INSTRUCT outperforms existing instruction complexity augmentation approaches, and operating in tag space gives better controllability and stability. Conclusion: TAG-INSTRUCT provides an efficient and stable way to control instruction complexity. Abstract: High-quality instruction data is crucial for developing large language models (LLMs), yet existing approaches struggle to effectively control instruction complexity. We present TAG-INSTRUCT, a novel framework that enhances instruction complexity through structured semantic compression and controlled difficulty augmentation. Unlike previous prompt-based methods operating on raw text, TAG-INSTRUCT compresses instructions into a compact tag space and systematically enhances complexity through RL-guided tag expansion. Through extensive experiments, we show that TAG-INSTRUCT outperforms existing instruction complexity augmentation approaches. Our analysis reveals that operating in tag space provides superior controllability and stability across different instruction synthesis frameworks.

[305] From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test

Xunlian Dai,Li Zhou,Benyou Wang,Haizhou Li

Main category: cs.CL

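TL;DR: The paper proposes CultureSteer, a method that uses a culture-aware steering mechanism to improve the alignment of large language models (LLMs) with cross-cultural cognition; experiments show it clearly outperforms existing methods.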

Details

Motivation: Current LLMs show a clear Western cultural bias, especially toward American culture, at the word association level, so a method is needed to improve their cross-cultural cognitive alignment. Method: Proposes CultureSteer, which uses a culture-aware steering mechanism to guide semantic representations toward culturally specific spaces and so mitigate the culture preference. Result: CultureSteer substantially improves cross-cultural alignment, surpasses prompt-based methods, and is further validated on culture-sensitive downstream tasks. Conclusion: The work contributes a new method for enhancing cultural awareness in LLMs and advances more inclusive language technologies. Abstract: The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through lexical-semantic patterns. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To mitigate the culture preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism to guide semantic representations toward culturally specific spaces. Experiments show that current LLMs exhibit significant bias toward Western cultural (notably in American) schemas at the word association level. In contrast, our model substantially improves cross-cultural alignment, surpassing prompt-based methods in capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.

[306] Removal of Hallucination on Hallucination: Debate-Augmented RAG

Wentao Hu,Wengyu Zhang,Yiyang Jiang,Chen Jason Zhang,Xiaoyong Wei,Qing Li

Main category: cs.CL

TL;DR: The DRAG framework uses a multi-agent debate mechanism to improve the factual accuracy of RAG and reduce hallucinations.

Details

Motivation: Address the problem in RAG where erroneous or biased retrieval misleads generation (Hallucination on Hallucination). Method: Proposes the DRAG framework, which introduces Multi-Agent Debate (MAD) into both the retrieval and generation stages to refine retrieval quality and generation reliability. Result: DRAG improves retrieval reliability, reduces hallucinations, and significantly enhances factual accuracy. Conclusion: DRAG offers a training-free improvement to RAG that mitigates hallucination. Abstract: Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate-Augmented RAG (DRAG), a training-free framework that integrates Multi-Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG-induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at https://github.com/Huenao/Debate-Augmented-RAG.

[307] Safety Alignment via Constrained Knowledge Unlearning

Zesheng Shi,Yucheng Zhou,Jing Li

Main category: cs.CL

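TL;DR: The paper proposes CKU, a new strategy that localizes and retains useful knowledge while unlearning harmful knowledge, significantly improving language model safety.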

Details

Motivation: Despite progress in safety alignment, large language models remain vulnerable to jailbreak attacks, and existing defenses have not fully removed harmful knowledge. Method: CKU scores neurons in MLP layers to identify a subset U of neurons associated with useful knowledge, then prunes the gradients of the neurons in U during unlearning to preserve that knowledge. Result: CKU significantly improves model safety without compromising overall performance and outperforms existing methods. Conclusion: CKU strikes a better balance between safety and utility and offers new insights into the mechanisms of safety alignment and knowledge editing. Abstract: Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance, offering a superior balance between safety and utility compared to existing methods. Additionally, our analysis of neuron knowledge sensitivity across various MLP layers provides valuable insights into the mechanics of safety alignment and model knowledge editing.
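
The gradient pruning step described above can be illustrated with a toy update rule. This is only a sketch under simplifying assumptions: each neuron is reduced to a single scalar weight, the protected set `U` is given rather than computed (the paper's neuron scoring is not reproduced), and the function name `constrained_step`, the learning rate, and the sign convention of the update are hypothetical.

```python
def constrained_step(weights, grads, protected, lr=0.5):
    """One update step in which gradients of protected neurons are zeroed.

    weights, grads: one scalar per neuron (a simplification for illustration).
    protected: set of neuron indices (the subset U) that must stay unchanged.
    """
    pruned = [0.0 if i in protected else g for i, g in enumerate(grads)]
    return [w - lr * g for w, g in zip(weights, pruned)]


weights = [1.0, 2.0, 3.0]
grads = [0.5, 0.5, 0.5]      # hypothetical gradients of an unlearning loss
U = {1}                      # neuron 1 is assumed to carry useful knowledge

updated = constrained_step(weights, grads, U)
print(updated)
```

Neuron 1 keeps its original weight because its gradient was pruned, while the other neurons are updated normally.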

[308] Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

Chen Han,Wenzhen Zheng,Xijin Tang

Main category: cs.CL

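TL;DR: The paper proposes Debate-to-Detect (D2D), a multi-agent debate framework for misinformation detection that significantly improves detection through structured debate and a multi-dimensional evaluation mechanism.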

Details

Motivation: Traditional misinformation detection relies on static classification and cannot capture the complexity of real-world fact-checking, while existing large language models fall short on logical consistency and depth of verification. Method: D2D reformulates misinformation detection as a structured adversarial debate with a five-stage process (Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment) and a multi-dimensional evaluation mechanism (Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics). Result: Experiments with GPT-4o on two fake news datasets show that D2D significantly outperforms baseline methods and iteratively refines evidence while improving decision transparency. Conclusion: D2D offers a more robust and interpretable approach to misinformation detection; the code will be open-sourced. Abstract: The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two fake news datasets demonstrate significant improvements over baseline methods, and the case study highlights D2D's capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards robust and interpretable misinformation detection. The code will be open-sourced in a future release.

[309] Flex-Judge: Think Once, Judge Anywhere

Jongwoo Ko,Sungnyun Kim,Sungwoo Cho,Se-Young Yun

Main category: cs.CL

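TL;DR: Flex-Judge is a reasoning-guided multimodal judge model that generalizes across modalities and evaluation formats from a small amount of textual reasoning data, matching or exceeding commercial APIs and multimodal evaluators.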

Details

Motivation: Reduce manual annotation costs and address the poor generalization of existing LLM evaluators on multimodal tasks. Method: Uses structured textual reasoning data so that reasoning guidance transfers to cross-modal evaluation. Result: Trained on little data, Flex-Judge achieves competitive or superior performance compared with commercial APIs and multimodal evaluators, and is particularly valuable in data-scarce domains such as molecule evaluation. Conclusion: Reasoning-based text supervision is an efficient, low-cost approach to multimodal evaluation with broad potential. Abstract: Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly fewer text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge presents broad impact in modalities like molecule, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.

[310] RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations

Ashwin Sankar,Yoach Lacombe,Sherry Thomas,Praveen Srinivasa Varadhan,Sanchit Gandhi,Mitesh M Khapra

Main category: cs.CL

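TL;DR: RASMALAI is a large-scale speech dataset with rich text descriptions for advancing controllable, expressive text-to-speech (TTS) synthesis in 23 Indian languages and English. IndicParlerTTS, built on this dataset, is the first open-source, text-description-guided TTS system for Indian languages and performs strongly.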

Details

Motivation: Advance controllable, expressive TTS synthesis for Indian languages and English and fill the gap in open-source tools for Indian languages. Method: Uses the RASMALAI dataset (13,000 hours of speech and 24 million text descriptions) to develop IndicParlerTTS, which supports fine-grained attribute control. Result: IndicParlerTTS generates high-quality speech for named speakers, reliably follows text descriptions, synthesizes specified attributes accurately, and transfers expressive characteristics across languages. Conclusion: IndicParlerTTS sets a new standard for controllable multilingual expressive speech synthesis in Indian languages. Abstract: We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations with fine-grained attributes like speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we develop IndicParlerTTS, the first open-source, text-description-guided TTS for Indian languages. Systematic evaluation demonstrates its ability to generate high-quality speech for named speakers, reliably follow text descriptions and accurately synthesize specified attributes. Additionally, it effectively transfers expressive characteristics both within and across languages. IndicParlerTTS consistently achieves strong performance across these evaluations, setting a new standard for controllable multilingual expressive speech synthesis in Indian languages.

[311] PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs

Tengxuan Liu,Shiyao Li,Jiayi Yang,Tianchen Zhao,Feng Zhou,Xiaohui Song,Guohao Dai,Shengen Yan,Huazhong Yang,Yu Wang

Main category: cs.CL

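TL;DR: The paper proposes Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) to address KV cache quantization in long chain-of-thought (CoT) large language models (LLMs), significantly improving reasoning performance.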

Details

Motivation: Existing KV cache quantization methods perform poorly on long-CoT LLMs, mainly because of large cumulative error and short-context calibration. Method: PM-KVQ uses a progressive quantization strategy and block-wise memory allocation to reduce cumulative error, and proposes a new calibration strategy based on positional interpolation. Result: Under the same memory budget, PM-KVQ improves reasoning benchmark performance by up to 8% over existing methods. Conclusion: PM-KVQ addresses KV cache quantization in long-CoT LLMs and offers a new route to efficient inference. Abstract: Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large Key-Value (KV) Cache. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation due to the following two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration fails to account for the distribution of less frequent channels in the Key Cache, resulting in performance loss. We propose Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) for long-CoT LLMs to address the above issues in two folds: (1) To reduce cumulative error, we design a progressive quantization strategy to gradually lower the bit-width of KV Cache in each block. Then, we propose block-wise memory allocation to assign a higher bit-width to more sensitive transformer blocks. (2) To increase the calibration length without additional overhead, we propose a new calibration strategy with positional interpolation that leverages short calibration data with positional interpolation to approximate the data distribution of long-context data. Extensive experiments on 7B-70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget. Our code is available at https://github.com/thu-nics/PM-KVQ.
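
The trade-off behind lowering the bit-width gradually can be seen in a toy uniform quantizer. This sketch is not PM-KVQ's quantizer: the min-max scheme, the function names `quantize` and `max_error`, and the cache values are assumptions chosen for illustration. It only shows that rounding error grows as the bit-width drops, which is why keeping entries at a higher bit-width for as long as memory allows reduces accumulated error.

```python
def quantize(values, bits):
    """Uniform min-max quantization to 2**bits levels, followed by dequantization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def max_error(values, bits):
    """Largest absolute rounding error introduced at the given bit-width."""
    return max(abs(v - q) for v, q in zip(values, quantize(values, bits)))


# Hypothetical cache entries; real KV caches are tensors per layer and head.
cache = [0.03, -1.2, 0.75, 2.4, -0.6, 1.1]

# Stepping the bit-width down from 8 to 4 to 2 bits, as a progressive schedule would.
errors = {bits: max_error(cache, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit max error: {err:.4f}")
```

In a block-wise allocation like the one the abstract describes, more sensitive blocks would be assigned the higher bit-widths from such a schedule.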

[312] MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

Woohyun Cho,Youngmin Kim,Sunghyun Lee,Youngjae Yu

Main category: cs.CL

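TL;DR: The paper proposes SylAVL-CoT, a multimodal, multilingual lyrics translation method that uses audio and video cues to significantly improve singability and contextual accuracy.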

Details

Motivation: Lyrics translation must preserve semantic accuracy together with musical rhythm, syllabic structure, and poetic style. In animated musicals it must also align with audio-visual cues, which makes it harder. Method: Introduces the multimodal benchmark MAVL, which combines text, audio, and video, and builds on it the SylAVL-CoT model, which uses audio-video cues and enforces syllabic constraints. Result: SylAVL-CoT significantly outperforms text-only models in singability and contextual accuracy. Conclusion: Multimodal, multilingual approaches are valuable for lyrics translation. Abstract: Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.

[313] DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation

Zhihao Jia,Mingyi Jia,Junwen Duan,Jianxin Wang

Main category: cs.CL

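TL;DR: The DDO framework uses multi-agent collaboration to optimize the two sub-tasks of medical consultation, symptom inquiry and disease diagnosis, and significantly improves performance.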

Details

Motivation: Existing LLM methods fail to distinguish symptom inquiry (sequential decision-making) from disease diagnosis (classification) in medical consultation, which hurts results. Method: Proposes the DDO framework, which optimizes the two sub-tasks independently through multi-agent collaboration. Result: On three real-world medical datasets, DDO outperforms existing LLM methods and is competitive with state-of-the-art generation-based methods. Conclusion: DDO effectively addresses the dual-task optimization problem in medical consultation. Abstract: Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose DDO, a novel LLM-based framework that performs Dual-Decision Optimization by decoupling and independently optimizing the two sub-tasks through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task.

[314] Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models

Md. Tanzib Hosain,Rajan Das Gupta,Md. Kishor Morol

Main category: cs.CL

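TL;DR: The DZEN dataset contains over 5K parallel Dzongkha and English test questions for evaluating LLMs across languages. The study finds that CoT prompting helps on reasoning questions and that adding English translations improves the precision of answers to Dzongkha questions.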

Details

Motivation: Build a bilingual dataset of test questions for Bhutanese middle and high school students and measure how LLM performance differs in a low-resource language (Dzongkha). Method: Constructs a parallel Dzongkha-English dataset, tests several LLMs and prompting strategies (such as CoT), and analyzes the effect of translation on performance. Result: LLMs perform significantly differently in English and Dzongkha; CoT works well for reasoning questions; adding English translations improves the precision of Dzongkha answers. Conclusion: The study points to directions for improving LLM performance in low-resource languages, and the dataset is released. Abstract: In this work, we provide DZEN, a dataset of parallel Dzongkha and English test questions for Bhutanese middle and high school students. The over 5K questions in our collection span a variety of scientific topics and include factual, application, and reasoning-based questions. We use our parallel dataset to test a number of Large Language Models (LLMs) and find a significant performance difference between the models in English and Dzongkha. We also look at different prompting strategies and discover that Chain-of-Thought (CoT) prompting works well for reasoning questions but less well for factual ones. We also find that adding English translations enhances the precision of Dzongkha question responses. Our results point to exciting avenues for further study to improve LLM performance in Dzongkha and, more generally, in low-resource languages. We release the dataset at: https://github.com/kraritt/llm_dzongkha_evaluation.

[315] Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster

Xiao Chen,Sihang Zhou,Ke Liang,Xiaoyu Sun,Xinwang Liu

Main category: cs.CL

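TL;DR: The paper proposes chunk-wise training (CWT) and skip-thinking training (STT) to address the gradient over-smoothing and slow inference caused by long rationales in existing CoT distillation.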

Details

Motivation: Existing methods train the small language model (SLM) to learn a long rationale in one pass, which over-smooths the gradients of core reasoning tokens and makes inference slow. Method: A heuristic search divides the rationale into semantically coherent chunks, the SLM learns from only one chunk per iteration (CWT), and STT lets the SLM automatically skip non-reasoning chunks. Result: Experiments validate CWT and STT across a variety of SLMs and reasoning tasks, improving reasoning speed while maintaining accuracy. Conclusion: Chunk-wise training and skip-thinking training substantially improve the performance and efficiency of CoT distillation. Abstract: Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the long rationale in one iteration, resulting in two issues: 1) Long rationales lead to a large token-level batch size during training, making gradients of core reasoning tokens (i.e., the token will directly affect the correctness of subsequent reasoning) over-smoothed as they contribute a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The response is slow, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search to divide the rationale into internal semantically coherent chunks and focuses SLM on learning from only one chunk per iteration. In this way, CWT naturally isolates non-reasoning chunks that do not involve the core reasoning token (e.g., summary and transitional chunks) from the SLM learning for reasoning chunks, making the fraction of the core reasoning token increase in the corresponding iteration. Based on CWT, skip-thinking training (STT) is proposed. STT makes the SLM automatically skip non-reasoning medium chunks to reach the answer, improving reasoning speed while maintaining accuracy. We validate our approach on a variety of SLMs and multiple reasoning tasks.

[316] On the Emergence of Linear Analogies in Word Embeddings

Daniel J. Korchinski,Dhruva Karkada,Yasaman Bahri,Matthieu Wyart

Main category: cs.CL

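TL;DR: The paper examines the theoretical origin of the linear analogy structure of word embeddings in models such as Word2Vec and GloVe and proposes a generative model based on binary semantic attributes to explain it.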

Details

Motivation: Investigate the theoretical origin of the linear analogy structure in word embeddings (for example, "king - man + woman ≈ queen") and explain its behavior under a range of conditions. Method: Proposes a generative model based on binary semantic attributes and explains the emergence of linear analogy structure by analyzing the eigenvectors of the word-pair co-occurrence probability matrix. Result: The model reproduces the emergence of linear analogy structure and accounts for its presence in the top eigenvectors of the co-occurrence matrix, the effect of increasing dimension, the enhancement from the log transform, and its persistence after data removal. Conclusion: The generative model gives a theoretical explanation of linear analogy structure in word embeddings, is robust to noise, and agrees with real data. Abstract: Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/P(i)P(j)$, (ii) strengthens and then saturates as more eigenvectors of $M(i,j)$, which controls the dimension of the embeddings, are included, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
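
As a small worked example of the matrix named in the abstract, the sketch below computes $M(i,j) = P(i,j)/(P(i)P(j))$ and its logarithm from a symmetric co-occurrence count table. It assumes that $P(i,j)$ is estimated as normalized counts and that $P(i)$ is the corresponding row marginal; the function name `ratio_matrix` and the toy counts are hypothetical, and the eigenvector analysis and generative model from the paper are not reproduced.

```python
import math


def ratio_matrix(counts):
    """Return M with M[i][j] = P(i,j) / (P(i) * P(j)) from co-occurrence counts."""
    total = sum(sum(row) for row in counts)
    joint = [[c / total for c in row] for row in counts]   # P(i,j)
    marginal = [sum(row) for row in joint]                 # P(i), assumed row marginal
    n = len(counts)
    return [
        [joint[i][j] / (marginal[i] * marginal[j]) for j in range(n)]
        for i in range(n)
    ]


# Two words that co-occur with each other more often than with themselves.
counts = [[1, 3],
          [3, 1]]
M = ratio_matrix(counts)
log_M = [[math.log(m) for m in row] for row in M]

print(M)      # entries above 1 mean the pair co-occurs more than chance
print(log_M)  # the log transform the abstract reports as strengthening analogies
```

When the counts factorize, so that words co-occur independently, every entry of `M` equals 1 and every entry of `log_M` equals 0.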

Murathan Kurfalı,Shorouq Zahra,Joakim Nivre,Gabriele Messori

Main category: cs.CL

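TL;DR: Climate-Eval is a comprehensive benchmark for evaluating NLP models on climate change tasks, covering 25 tasks and 13 datasets, with zero-shot and few-shot evaluation of open-source LLMs.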

Details

Motivation: Provide a standardized evaluation framework for NLP tasks in the climate change domain and fill a gap in existing research. Method: Aggregates existing datasets with a newly developed news classification dataset into a benchmark of 25 tasks, and evaluates open-source LLMs in zero-shot and few-shot settings. Result: The benchmark covers text classification, question answering, and information extraction, and evaluates LLMs from 2B to 70B parameters on climate change tasks. Conclusion: Climate-Eval provides a systematic evaluation tool for climate change NLP research and reveals the strengths and limitations of LLMs in this domain. Abstract: Climate-Eval is a comprehensive benchmark designed to evaluate natural language processing models across a broad range of tasks related to climate change. Climate-Eval aggregates existing datasets along with a newly developed news classification dataset, created specifically for this release. This results in a benchmark of 25 tasks based on 13 datasets, covering key aspects of climate discourse, including text classification, question answering, and information extraction. Our benchmark provides a standardized evaluation suite for systematically assessing the performance of large language models (LLMs) on these tasks. Additionally, we conduct an extensive evaluation of open-source LLMs (ranging from 2B to 70B parameters) in both zero-shot and few-shot settings, analyzing their strengths and limitations in the domain of climate change.

[318] Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics

Pankaj Kumar,Subhankar Mishra

Main category: cs.CL

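TL;DR: This survey reviews the robustness challenges of large language models (LLMs), covering concepts, sources of non-robustness, mitigation strategies, evaluation methods, and future research directions.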

Details

Motivation: LLMs show promise for NLP and AI, but their robustness remains an open problem that calls for systematic study. Method: Systematically analyzes the concept of LLM robustness and the sources of non-robustness (model limitations, data vulnerabilities, external attacks), and reviews mitigation strategies and evaluation methods. Result: Summarizes trends, unresolved issues, and future directions in existing research, providing a comprehensive reference for LLM robustness. Conclusion: LLM robustness research needs interdisciplinary collaboration and should focus on assessing and improving reliability in real-world applications. Abstract: Large Language Models (LLMs) have emerged as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (AI). However, ensuring the robustness of LLMs remains a critical challenge. To address these challenges and advance the field, this survey provides a comprehensive overview of current studies in this area. First, we systematically examine the nature of robustness in LLMs, including its conceptual foundations, the importance of consistent performance across diverse inputs, and the implications of failure modes in real-world applications. Next, we analyze the sources of non-robustness, categorizing intrinsic model limitations, data-driven vulnerabilities, and external adversarial factors that compromise reliability. Following this, we review state-of-the-art mitigation strategies, and then we discuss widely adopted benchmarks, emerging metrics, and persistent gaps in assessing real-world reliability. Finally, we synthesize findings from existing surveys and interdisciplinary studies to highlight trends, unresolved issues, and pathways for future research.

[319] Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models

Zixiang Xu,Yanbo Wang,Yue Huang,Xiuying Chen,Jieyu Zhao,Meng Jiang,Xiangliang Zhang

Main category: cs.CL

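TL;DR: This paper proposes a new method that generates bilingual question pairs to expose cross-lingual weaknesses of large language models (LLMs), and builds a dataset of over 6,000 pairs across 16 languages. Experiments show that the method identifies cross-lingual weaknesses efficiently and at low cost, and they reveal a relationship between linguistic similarity and performance drops.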

Details

Motivation: LLMs perform well in NLP, but the consistency of their cross-lingual performance remains a major challenge. This paper aims to develop a method that effectively identifies cross-lingual weaknesses in LLMs. Method: Uses beam search and LLM-based simulation to generate bilingual question pairs, building a dataset of over 6,000 pairs covering 16 languages. Result: Experiments show that the method pinpoints cross-lingual weaknesses precisely and at low cost, with accuracy in target languages dropping by more than 50%. Linguistic similarity is also related to the performance drop. Conclusion: The method effectively exposes cross-lingual weaknesses in LLMs and provides a basis for targeted post-training. The code is open source. Abstract: Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual performance consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. The extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing over 50% accuracy drops in target languages across a wide range of models. Moreover, further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at https://github.com/xzx34/Cross-Lingual-Pitfalls.

Eric Chamoun,Nedjma Ousidhoum,Michael Schlichtkrull,Andreas Vlachos

Main category: cs.CL

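TL;DR: This paper proposes a system that automates the analysis of NLP research framings by extracting key elements and linking them to infer the framing, and evaluates it on two domains.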

Details

Motivation: Clarifying the research framing of NLP artefacts (models, datasets, etc.) is crucial for aligning research with practical applications, yet current papers rarely identify key stakeholders, intended uses, or appropriate contexts explicitly. Method: Develops a three-component system that extracts key elements (means, ends, stakeholders) and links them through interpretable rules and contextual reasoning. Result: Evaluated on two domains (automated fact-checking and hate speech detection), the system outperforms strong LLM baselines. Applying it reveals three trends: vague research goals, an emphasis on scientific exploration over application, and support for humans rather than full automation. Conclusion: Automated analysis of NLP research framing is feasible and can reveal research trends, helping align research with practical needs. Abstract: Clarifying the research framing of NLP artefacts (e.g., models, datasets, etc.) is crucial to aligning research with practical applications. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking using an existing dataset, and hate speech detection for which we annotate a new dataset, achieving consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in vague or underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation.

[321] TULUN: Transparent and Adaptable Low-resource Machine Translation

Raphaël Merx,Hanna Suominen,Lois Hong,Nick Thieberger,Trevor Cohn,Ekaterina Vylomova

Main category: cs.CL

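TL;DR: Tulun is a terminology-aware translation solution that combines neural machine translation with large language model (LLM)-based post-editing. It targets low-resource languages and specialized domains, requires no model fine-tuning, and substantially improves translation quality.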

Details

Motivation: Machine translation for low-resource languages performs poorly in specialized domains, and existing methods require model fine-tuning, which is impractical for non-technical users and small organizations. Method: Tulun combines neural machine translation with LLM-based post-editing guided by existing glossaries and translation memories, and provides an open-source web platform for collaborative translation. Result: On medical and disaster relief translation tasks for Tetun and Bislama, it improves on baseline systems by 16.90-22.41 ChrF++ points; on the FLORES dataset, it improves on NLLB-54B by an average of 2.8 ChrF points. Conclusion: Through a collaborative human-machine translation process, Tulun substantially improves translation quality for low-resource languages in specialized domains and remains easy for non-technical users to adopt. Abstract: Machine translation (MT) systems that support low-resource languages often struggle on specialized domains. While researchers have proposed various techniques for domain adaptation, these approaches typically require model fine-tuning, making them impractical for non-technical users and small organizations. To address this gap, we propose Tulun, a versatile solution for terminology-aware translation, combining neural MT with large language model (LLM)-based post-editing guided by existing glossaries and translation memories. Our open-source web-based platform enables users to easily create, edit, and leverage terminology resources, fostering a collaborative human-machine translation process that respects and incorporates domain expertise while increasing MT accuracy. Evaluations show effectiveness in both real-world and benchmark scenarios: on medical and disaster relief translation tasks for Tetun and Bislama, our system achieves improvements of 16.90-22.41 ChrF++ points over baseline MT systems. Across six low-resource languages on the FLORES dataset, Tulun outperforms both standalone MT and LLM approaches, achieving an average improvement of 2.8 ChrF points over NLLB-54B.

[322] From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

Zhihao Zhang,Yiran Zhang,Xiyue Zhou,Liting Huang,Imran Razzak,Preslav Nakov,Usman Naseem

Main category: cs.CL

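TL;DR: MM Health is a large-scale multimodal health misinformation dataset containing both human-generated and AI-generated content, intended to support research on misinformation detection.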

Details

Motivation: Health misinformation (including AI-generated content) has a significant negative impact on society, and existing datasets are limited in topical coverage, inclusion of AI-generated content, and accessibility of raw content. Method: Builds a multimodal dataset of 34,746 news articles (5,776 human-generated, 28,880 AI-generated) and runs three benchmark tasks. Result: Existing SOTA models struggle to distinguish the reliability and origin of information. Conclusion: The MM Health dataset supports multimodal misinformation detection and helps identify human- and machine-generated content. Abstract: Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human-like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat the infodemics, most existing work has focused on developing misinformation datasets from social media and fact-checking platforms, but has faced limitations in topical coverage, inclusion of AI generation, and accessibility of raw content. To address these issues, we present MM Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles encompassing both textual and visual information. MM Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, we benchmarked our dataset against three tasks (reliability checks, originality checks, and fine-grained AI detection), demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human- and machine-generated content at multimodal levels.

[323] Large Language Models in the Task of Automatic Validation of Text Classifier Predictions

Aleksandr Tsymbalov

Main category: cs.CL

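TL;DR: This paper proposes replacing human annotators with large language models (LLMs) to check classifier predictions for correctness, reducing annotation costs and supporting high-quality incremental learning.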

Details

Motivation: Human annotation of text classification samples is costly and inefficient, especially in incremental learning settings that require continual model updates. Method: Proposes several approaches that use LLMs to check classifier predictions for correctness. Result: Replacing human annotators with LLMs can reduce annotation costs while ensuring model quality. Conclusion: LLMs have the potential to replace human annotators in text classification, significantly reducing annotation costs and supporting continual model improvement. Abstract: Machine learning models for text classification are trained to predict a class for a given text. To do this, training and validation samples must be prepared: a set of texts is collected, and each text is assigned a class. These classes are usually assigned by human annotators with different expertise levels, depending on the specific classification task. Collecting such samples from scratch is labor-intensive because it requires finding specialists and compensating them for their work; moreover, the number of available specialists is limited, and their productivity is constrained by human factors. While it may not be too resource-intensive to collect samples once, the ongoing need to retrain models (especially in incremental learning pipelines) to address data drift (also called model drift) makes the data collection process crucial and costly over the model's entire lifecycle. This paper proposes several approaches to replace human annotators with Large Language Models (LLMs) to test classifier predictions for correctness, helping ensure model quality and support high-quality incremental learning.

[324] Benchmarking and Rethinking Knowledge Editing for Large Language Models

Guoxiu He,Xin Song,Futing Wang,Aixin Sun

Main category: cs.CL

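TL;DR: Through a comprehensive benchmarking study, this paper evaluates knowledge editing methods and finds that parameter-based editing methods perform poorly under realistic conditions, whereas Selective Contextual Reasoning (SCR) performs better.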

Details

Motivation: Existing knowledge editing methods are evaluated with inconsistent objectives and experimental setups, a gap that calls for more comprehensive benchmarking. Method: Introduces complex event-based datasets and general-purpose datasets, evaluates instruction-tuned and reasoning-oriented LLMs under an autoregressive inference setting, and assesses multi-edit scenarios. Result: Parameter-based editing methods perform poorly under realistic conditions, while SCR performs consistently better across all settings. Conclusion: The study reveals the limitations of current knowledge editing methods and highlights the potential of context-based reasoning. Abstract: Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple and straightforward baseline named Selective Contextual Reasoning (SCR). Empirical results reveal that parameter-based editing methods perform poorly under realistic conditions. In contrast, SCR consistently outperforms them across all settings. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.

[325] Towards Semantic Integration of Opinions: Unified Opinion Concepts Ontology and Extraction Task

Gaurav Negi,Dhairya Dalal,Omnia Zayed,Paul Buitelaar

Main category: cs.CL

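TL;DR: The paper proposes the Unified Opinion Concepts (UOC) ontology for integrating opinions within their semantic context, along with the UOCE task and an evaluation dataset.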

Details

Motivation: Resolve the differences in semantic representation between opinion formulations and improve the semantic consistency of opinion expression. Method: Builds the UOC ontology from the facets of opinion studied in NLP and from semantic structures given by symbolic descriptions, and proposes the UOCE task with an evaluation method. Result: Provides an extended evaluation dataset and tailored evaluation metrics, and establishes baseline performance on UOCE using generative models. Conclusion: The UOC ontology and UOCE task provide a unified semantic framework for opinion extraction, improving expressivity and evaluation. Abstract: This paper introduces the Unified Opinion Concepts (UOC) ontology to integrate opinions within their semantic context. The UOC ontology bridges the gap between the semantic representation of opinion across different formulations. It is a unified conceptualisation based on the facets of opinions studied extensively in NLP and semantic structures described through symbolic descriptions. We further propose the Unified Opinion Concept Extraction (UOCE) task of extracting opinions from the text with enhanced expressivity. Additionally, we provide a manually extended and re-annotated evaluation dataset for this task and tailored evaluation metrics to assess the adherence of extracted opinions to UOC semantics. Finally, we establish baseline performance for the UOCE task using state-of-the-art generative models.

[326] A General Knowledge Injection Framework for ICD Coding

Xu Zhang,Kun Zhang,Wenxin Ma,Rongsheng Wang,Chenxu Wu,Yingtai Li,S. Kevin Zhou

Main category: cs.CL

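TL;DR: GKI-ICD is a general knowledge injection framework that integrates three types of knowledge (ICD descriptions, synonyms, and hierarchy) without designing additional modules, and significantly improves ICD coding performance.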

Details

Motivation: Address the limited scalability and effectiveness of existing methods, which focus on a single knowledge type and rely on complex module designs. Method: Proposes the GKI-ICD framework, which jointly uses ICD descriptions, synonyms, and hierarchy while avoiding additional module design. Result: Performs strongly on several ICD coding benchmarks, reaching state-of-the-art performance. Conclusion: By integrating multiple knowledge types, GKI-ICD significantly improves the effectiveness and scalability of ICD coding. Abstract: ICD Coding aims to assign a wide range of medical codes to a medical text document, which is a popular and challenging task in the healthcare domain. To alleviate the problems of long-tail distribution and the lack of annotations of code-specific evidence, many previous works have proposed incorporating code knowledge to improve coding performance. However, existing methods often focus on a single type of knowledge and design specialized modules that are complex and incompatible with each other, thereby limiting their scalability and effectiveness. To address this issue, we propose GKI-ICD, a novel, general knowledge injection framework that integrates three key types of knowledge, namely ICD Description, ICD Synonym, and ICD Hierarchy, without specialized design of additional modules. The comprehensive utilization of the above knowledge, which exhibits both differences and complementarity, can effectively enhance the ICD coding performance. Extensive experiments on existing popular ICD coding benchmarks demonstrate the effectiveness of GKI-ICD, which achieves state-of-the-art performance on most evaluation metrics. Code is available at https://github.com/xuzhang0112/GKI-ICD.

[327] Improving Bangla Linguistics: Advanced LSTM, Bi-LSTM, and Seq2Seq Models for Translating Sylheti to Modern Bangla

Sourav Kumar Das,Md. Julkar Naeen,MD. Jahidul Islam,Md. Anisul Haque Sajeeb,Narayan Ranjan Chakraborty,Mayen Uddin Mojumdar

Main category: cs.CL

TL;DR: The paper presents an NLP-based system for translating Modern Bangla into the regional Sylheti variety and compares LSTM, Bi-LSTM, and Seq2Seq models, with LSTM performing best (89.3% accuracy).

Details

Motivation: Different regions of Bangladesh use local languages such as Sylheti, but research on them is scarce. This paper aims to fill that gap and advance Bangla NLP research. Method: Trains three models (LSTM, Bi-LSTM, and Seq2Seq) on 1,200 data samples and evaluates their translation performance. Result: The LSTM model performs best, with 89.3% accuracy. Conclusion: The study lays groundwork for future innovation in Bangla NLP, particularly translation of local languages. Abstract: Bangla, or Bengali, is the national language of Bangladesh, but people from different regions do not speak proper Bangla. Every division of Bangladesh has its own local language, such as Sylheti or Chittagong. In recent years, papers have been published on Bangla-language tasks such as sentiment analysis, fake news detection, and classification, but only a few of them address local Bangla languages. This research is for the local language, and this particular paper is on the Sylheti language. It presents a comprehensive system using Natural Language Processing (NLP) techniques for translating Pure or Modern Bangla to the locally spoken Sylheti Bangla language. A total of 1,200 data samples were used to train three models (LSTM, Bi-LSTM, and Seq2Seq), and LSTM performed best with 89.3% accuracy. The findings of this research may support Bangla NLP researchers in pursuing more advanced innovations in the future.

[328] Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization

Meng Li,Guangda Huzhang,Haibo Zhang,Xiting Wang,Anxiang Zeng

Main category: cs.CL

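TL;DR: OTPO proposes an optimal transport-based token weighting scheme to improve Direct Preference Optimization (DPO); by emphasizing semantically important token pairs, it makes the reward difference estimate more contrastive.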

Details

Motivation: Existing DPO methods give all tokens equal weight, whereas humans focus on the more meaningful parts, which leads to suboptimal optimization. Method: Introduces optimal transport to design a context-aware token weighting scheme that highlights semantically important token pairs. Result: Experiments confirm that OTPO improves instruction-following ability and enhances reward stability and interpretability. Conclusion: Through adaptive token weighting, OTPO improves preference learning by focusing on meaningful differences between responses. Abstract: Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO's effectiveness in improving instruction-following ability across various settings. Code is available at https://github.com/Mimasss2/OTPO.
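
To make the idea of token weighting in a DPO-style loss concrete, here is a minimal sketch. It is not the OTPO implementation: the per-token weights are hand-picked placeholders (the abstract derives them from optimal transport, which is not reproduced here), and the function names, the `beta` value, and the toy log-probabilities are all assumptions made for illustration.

```python
import math

def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Sum of per-token log-ratios log pi(y_t) - log pi_ref(y_t), scaled by token weights."""
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))

def token_weighted_dpo_loss(chosen, rejected, beta=0.1):
    """DPO-style loss in which each token's contribution is scaled by a weight.

    `chosen` and `rejected` are dicts with per-token lists under the keys
    'policy', 'ref', and 'weights' (a hypothetical layout for this sketch).
    With all weights equal to 1 this reduces to the standard DPO loss.
    """
    margin = (
        weighted_log_ratio(chosen["policy"], chosen["ref"], chosen["weights"])
        - weighted_log_ratio(rejected["policy"], rejected["ref"], rejected["weights"])
    )
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))

# Toy two-token responses: only the second token differs from the reference model.
chosen = {"policy": [-1.0, -0.5], "ref": [-1.0, -1.0], "weights": [1.0, 1.0]}
rejected = {"policy": [-1.0, -1.5], "ref": [-1.0, -1.0], "weights": [1.0, 1.0]}
uniform_loss = token_weighted_dpo_loss(chosen, rejected)

# Placeholder weights that emphasize the informative second token.
chosen_w = dict(chosen, weights=[0.2, 1.8])
rejected_w = dict(rejected, weights=[0.2, 1.8])
weighted_loss = token_weighted_dpo_loss(chosen_w, rejected_w)

print(f"uniform weights: {uniform_loss:.4f}, emphasized weights: {weighted_loss:.4f}")
```

In this toy case the emphasized weights enlarge the margin between chosen and rejected responses, which is one way to read the abstract's "more contrastive reward difference estimate".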

[329] LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges

Tao Liu,Hongying Zan,Yifan Li,Dixuan Zhang,Lulu Kong,Haixin Liu,Jiaming Hou,Aoze Zheng,Rui Li,Yiming Qiao,Zewei Luo,Qi Wang,Zhiqiang Zhang,Jiaxi Li,Supeng Liu,Kunli Zhang,Min Peng

Main category: cs.CL

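TL;DR: The paper introduces LogicCat, a new Text-to-SQL dataset for complex reasoning and chain-of-thought analysis, addressing the lack of domain knowledge and mathematical reasoning in existing datasets.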

Details

Motivation: Existing Text-to-SQL datasets focus mainly on business scenarios and lack coverage of domain knowledge and complex reasoning; LogicCat aims to address this. Method: Builds a dataset of 4,038 questions with 12,114 step-by-step reasoning annotations, covering 45 cross-domain databases. Result: LogicCat is substantially harder for existing models, with a highest execution accuracy of only 14.96%, rising to 33.96% when chain-of-thought annotations are added. Conclusion: LogicCat offers new challenges and opportunities for reasoning-driven Text-to-SQL research; the dataset is open source. Abstract: Text-to-SQL is a fundamental task in natural language processing that seeks to translate natural language questions into meaningful and executable SQL queries. While existing datasets are extensive and primarily focus on business scenarios and operational logic, they frequently lack coverage of domain-specific knowledge and complex mathematical reasoning. To address this gap, we present a novel dataset tailored for complex reasoning and chain-of-thought analysis in SQL inference, encompassing physical, arithmetic, commonsense, and hypothetical reasoning. The dataset consists of 4,038 English questions, each paired with a unique SQL query and accompanied by 12,114 step-by-step reasoning annotations, spanning 45 databases across diverse domains. Experimental results demonstrate that LogicCat substantially increases the difficulty for state-of-the-art models, with the highest execution accuracy reaching only 14.96%. Incorporating our chain-of-thought annotations boosts performance to 33.96%. Benchmarking leading public methods on Spider and BIRD further underscores the unique challenges presented by LogicCat, highlighting the significant opportunities for advancing research in robust, reasoning-driven text-to-SQL systems. We have released our dataset code at https://github.com/Ffunkytao/LogicCat.
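
Execution accuracy, the metric quoted above, is commonly computed by running the predicted and gold queries against the same database and comparing their results. The sketch below shows that general idea with Python's built-in `sqlite3` module. It is not LogicCat's evaluation script: the toy table, the queries, and the order-insensitive comparison are assumptions made for illustration.

```python
import sqlite3

def run_query(conn, sql):
    """Return the query result as a sorted list of rows, or None if the SQL fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so that row order does not affect the comparison.
    return sorted(rows, key=repr)

def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose result sets match."""
    hits = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        if gold is not None and predicted == gold:
            hits += 1
    return hits / len(pairs) if pairs else 0.0

# Hypothetical toy database, not taken from LogicCat.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (name TEXT, mass_kg REAL)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?)",
    [("bolt", 0.02), ("beam", 12.5), ("plate", 3.0)],
)

pairs = [
    # Different SQL text, same rows: counted as correct.
    ("SELECT name FROM parts WHERE mass_kg > 1",
     "SELECT name FROM parts WHERE mass_kg >= 3.0"),
    # Valid SQL, different rows: counted as wrong.
    ("SELECT name FROM parts", "SELECT name FROM parts WHERE mass_kg < 1"),
    # Invalid SQL: counted as wrong.
    ("SELEC name FROM parts", "SELECT name FROM parts"),
]
score = execution_accuracy(conn, pairs)
print(f"execution accuracy: {score:.2f}")
```

Comparing results rather than query strings credits predictions that are written differently from the gold query but return the same rows.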

[330] Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning

Haolin Yang,Hakaze Cho,Yiqiao Zhong,Naoya Inoue

Main category: cs.CL

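TL;DR: This paper proposes a unified framework that analyzes geometric properties of hidden states (separability and alignment) and reveals a two-stage mechanism of in-context learning (ICL) in classification tasks.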

Details

Motivation: Prior work lacks a unified framework that connects attention heads and task vectors to the evolution of hidden states; this paper aims to fill that gap. Method: Proposes a framework based on the separability and alignment of query hidden states, and studies layer-wise dynamics and the role of key attention heads. Result: ICL operates in two stages: early layers establish separability and later layers establish alignment, with specific attention heads and task vectors driving the respective stages. Conclusion: The paper unifies the roles of attention heads and task vectors and offers new insight into the internal mechanisms of ICL. Abstract: The unusual properties of in-context learning (ICL) have prompted investigations into the internal mechanisms of large language models. Prior work typically focuses on either special attention heads or task vectors at specific layers, but lacks a unified framework linking these components to the evolution of hidden states across layers that ultimately produce the model's output. In this paper, we propose such a framework for ICL in classification tasks by analyzing two geometric factors that govern performance: the separability and alignment of query hidden states. A fine-grained analysis of layer-wise dynamics reveals a striking two-stage mechanism: separability emerges in early layers, while alignment develops in later layers. Ablation studies further show that Previous Token Heads drive separability, while Induction Heads and task vectors enhance alignment. Our findings thus bridge the gap between attention heads and task vectors, offering a unified account of ICL's underlying mechanisms.

[331] Few-Shot Optimization for Sensor Data Using Large Language Models: A Case Study on Fatigue Detection

Elsen Ronando,Sozo Inoue

Main category: cs.CL

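TL;DR: Proposes a few-shot optimization method based on HED-LM that improves example selection for sensor-based classification by combining Euclidean distance with large language models.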

Details

Motivation: The performance of few-shot prompting depends on the quality of example selection, and sensor tasks such as fatigue detection need finer-grained selection because of subtle signal differences and inter-subject variability. Method: HED-LM filters candidate examples with a hybrid pipeline (initial filtering by Euclidean distance, then re-ranking by LLM-scored contextual relevance) and is applied to a fatigue detection task. Result: HED-LM achieves a mean macro F1-score of 69.13 ± 10.71% on fatigue detection, outperforming random selection (59.30 ± 10.13%) and distance-only filtering (67.61 ± 11.39%). Conclusion: By combining numerical similarity with contextual relevance, HED-LM improves the robustness of few-shot prompting and is applicable to healthcare monitoring, activity recognition, and similar scenarios. Abstract: In this paper, we propose a novel few-shot optimization with HED-LM (Hybrid Euclidean Distance with Large Language Models) to improve example selection for sensor-based classification tasks. While few-shot prompting enables efficient inference with limited labeled data, its performance largely depends on the quality of selected examples. HED-LM addresses this challenge through a hybrid selection pipeline that filters candidate examples based on Euclidean distance and re-ranks them using contextual relevance scored by large language models (LLMs). To validate its effectiveness, we apply HED-LM to a fatigue detection task using accelerometer data characterized by overlapping patterns and high inter-subject variability. Unlike simpler tasks such as activity recognition, fatigue detection demands more nuanced example selection due to subtle differences in physiological signals. Our experiments show that HED-LM achieves a mean macro F1-score of 69.13 ± 10.71%, outperforming both random selection (59.30 ± 10.13%) and distance-only filtering (67.61 ± 11.39%). These represent relative improvements of 16.6% and 2.3%, respectively. The results confirm that combining numerical similarity with contextual relevance improves the robustness of few-shot prompting. Overall, HED-LM offers a practical solution to improve performance in real-world sensor-based learning tasks and shows potential for broader applications in healthcare monitoring, human activity recognition, and industrial safety scenarios.
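
The two-stage selection described above (distance filter, then relevance re-ranking) can be sketched as follows. This is an illustration under stated assumptions rather than the HED-LM code: the feature vectors, the `llm_score` field standing in for an LLM-produced relevance score, and the parameters `k_filter` and `k_final` are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_examples(query, pool, relevance_fn, k_filter=3, k_final=2):
    """Stage 1: keep the k_filter candidates nearest to the query.
    Stage 2: re-rank the survivors by relevance (highest first) and keep k_final."""
    nearest = sorted(pool, key=lambda ex: euclidean(query, ex["features"]))[:k_filter]
    reranked = sorted(nearest, key=lambda ex: relevance_fn(query, ex), reverse=True)
    return reranked[:k_final]

def precomputed_relevance(query, example):
    # Stand-in for a call to an LLM that scores contextual relevance;
    # here the score is simply read from the example record.
    return example["llm_score"]

# Hypothetical candidate pool of labeled sensor windows.
pool = [
    {"id": "a", "features": [0.1, 0.2], "label": "fatigued", "llm_score": 0.4},
    {"id": "b", "features": [0.2, 0.1], "label": "rested", "llm_score": 0.9},
    {"id": "c", "features": [0.9, 0.9], "label": "fatigued", "llm_score": 1.0},
    {"id": "d", "features": [0.15, 0.25], "label": "rested", "llm_score": 0.7},
]
query = [0.1, 0.1]

chosen = select_examples(query, pool, precomputed_relevance)
chosen_ids = [ex["id"] for ex in chosen]
print(chosen_ids)
```

Candidate `c` has the highest relevance score but is dropped by the distance filter, so the re-ranking stage only ever sees numerically similar examples.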

[332] How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark

Minglai Yang,Ethan Huang,Liang Zhang,Mihai Surdeanu,William Wang,Liangming Pan

Main category: cs.CL

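TL;DR: GSM-DC is a synthetic benchmark for evaluating the reasoning robustness of large language models (LLMs) under controlled irrelevant context (IC). Experiments show that LLMs are sensitive to IC, but training with strong distractors and tree search improve performance.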

Details

Motivation: Evaluate the reasoning robustness of LLMs under distracting irrelevant context and explore ways to improve it. Method: Constructs symbolic reasoning graphs with precisely injected distractors, trains models with strong distractors, and proposes a stepwise tree search guided by a process reward model. Result: LLMs are sensitive to IC; distractor training and tree search significantly improve performance in in-distribution and out-of-distribution scenarios. Conclusion: GSM-DC is an effective tool for evaluating LLM reasoning robustness, and distractor training and tree search can strengthen model performance. Abstract: We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark to evaluate Large Language Models' (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injections, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are significantly sensitive to IC, which affects both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.

[333] Towards an automatic method for generating topical vocabulary test forms for specific reading passages

Michael Flor,Zuowei Wang,Paul Deane,Tenaha O'Reilly

Main category: cs.CL

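TL;DR: Presents K-tool, an automated system that generates vocabulary tests measuring students' background knowledge in order to predict their comprehension of a specific text.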

Details

Motivation: Background knowledge is essential for understanding texts in STEM and similar domains, but readily deployable automated measures are lacking. Method: The system automatically detects the topic of a text and generates a related vocabulary test containing both highly related and unrelated words. Result: An initial evaluation suggests that the system can generate vocabulary tests effectively, helping predict student comprehension. Conclusion: K-tool gives educators an automated tool for rapid assessment of background knowledge. Abstract: Background knowledge is typically needed for successful comprehension of topical and domain-specific reading passages, such as in the STEM domain. However, there are few automated measures of student knowledge that can be readily deployed and scored in time to make predictions on whether a given student will likely be able to understand a specific content area text. In this paper, we present our effort in developing K-tool, an automated system for generating topical vocabulary tests that measure students' background knowledge related to a specific text. The system automatically detects the topic of a given text and produces topical vocabulary items based on their relationship with the topic. This information is used to automatically generate background knowledge forms that contain words that are highly related to the topic and words that share similar features but do not share high associations to the topic. Prior research indicates that performance on such tasks can help determine whether a student is likely to understand a particular text based on their knowledge state. The described system is intended for use with middle and high school students who are native speakers of English. It is designed to handle single reading passages and is not dependent on any corpus or text collection. In this paper, we describe the system architecture and present an initial evaluation of the system outputs.

[334] Disentangling Knowledge Representations for Large Language Model Editing

Mengqi Zhang,Zisheng Zhou,Xiaotian Ye,Qiang Liu,Zhaochun Ren,Zhumin Chen,Pengjie Ren

Main category: cs.CL

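TL;DR: DiKE proposes a new method that improves knowledge editing in large language models by disentangling knowledge representations, significantly improving the preservation of fine-grained irrelevant knowledge.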

Details

Motivation: Existing methods fail to preserve fine-grained irrelevant knowledge that shares the subject of the edited knowledge but differs in relation, because target and irrelevant knowledge are entangled in the representation space. Method: DiKE comprises a Knowledge Representation Disentanglement (KRD) module and a Disentanglement-based Knowledge Edit (DKE) module, and uses matrix theory to achieve efficient, minimally invasive edits. Result: Experiments show that DiKE substantially improves the preservation of fine-grained irrelevant knowledge across multiple LLMs while maintaining general editing performance. Conclusion: DiKE offers a finer-grained solution for knowledge editing and effectively addresses the challenge of preserving fine-grained knowledge. Abstract: Knowledge Editing has emerged as a promising solution for efficiently updating embedded knowledge in large language models (LLMs). While existing approaches demonstrate effectiveness in integrating new knowledge and preserving the original capabilities of LLMs, they fail to maintain fine-grained irrelevant knowledge facts that share the same subject as edited knowledge but differ in relation and object. This challenge arises because subject representations inherently encode multiple attributes, causing the target and fine-grained irrelevant knowledge to become entangled in the representation space, and thus vulnerable to unintended alterations during editing. To address this, we propose DiKE, a novel approach that Disentangles Knowledge representations for LLM Editing. DiKE consists of two key components: a Knowledge Representation Disentanglement (KRD) module that decomposes the subject representation into target-knowledge-related and -unrelated components, and a Disentanglement-based Knowledge Edit (DKE) module that updates only the target-related component while explicitly preserving the unrelated one. We further derive a closed-form, rank-one parameter update based on matrix theory to enable efficient and minimally invasive edits. To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge. Extensive experiments across multiple LLMs demonstrate that DiKE substantially improves fine-grained irrelevant knowledge preservation while maintaining competitive general editing performance.

[335] A generalised editor calculus (Short Paper)

Benjamin Bennetzen,Peter Buus Steffensen,Hans Hüttel,Nikolaj Rossander Kristensen,Andreas Tor Mortensen

Main category: cs.CL

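TL;DR: This paper presents a generalization of a syntax-directed editor calculus that can instantiate a specialized editor for any language from its abstract syntax, guaranteeing the absence of syntax errors while supporting incomplete programs. The calculus is encoded in an extended simply typed lambda calculus.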

Details

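Motivation: Provide syntax-directed editing support for any language while ensuring syntactic correctness and flexibility. Method: Generalizes a syntax-directed editor calculus and encodes it in a simply typed lambda calculus extended with pairs, booleans, pattern matching, and fixed points. Result: Yields syntax-directed editors that support incomplete programs and rule out syntax errors. Conclusion: The generalization is effective and provides a theoretical foundation and practical tool for language editor design. Abstract: In this paper, we present a generalization of a syntax-directed editor calculus, which can be used to instantiate a specialized syntax-directed editor for any language, given by some abstract syntax. The editor calculus guarantees the absence of syntactical errors while allowing incomplete programs. The generalized editor calculus is then encoded into a simply typed lambda calculus, extended with pairs, booleans, pattern matching and fixed points.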
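
One way to picture "no syntax errors, but incomplete programs allowed" is a tree editor whose only operation fills a hole with a constructor drawn from the abstract syntax. The sketch below is a toy illustration of that idea, not the calculus from the paper: the three-constructor `SYNTAX` table, the `Hole` and `Node` classes, and the `fill` operation are all assumptions introduced here.

```python
from dataclasses import dataclass, field

# Hypothetical abstract syntax: constructor name -> number of children.
SYNTAX = {"num": 0, "add": 2, "neg": 1}

@dataclass
class Hole:
    """Placeholder for a missing subterm; this is what makes incomplete programs representable."""

@dataclass
class Node:
    op: str
    children: list = field(default_factory=list)

def fill(tree, path, op):
    """Return a new tree in which the hole at `path` is replaced by constructor `op`.

    The new node's children are fresh holes, one per argument of `op`, so every
    tree reachable through `fill` conforms to SYNTAX by construction.
    """
    if op not in SYNTAX:
        raise ValueError(f"unknown constructor: {op}")
    if not path:
        if not isinstance(tree, Hole):
            raise ValueError("only a hole can be filled")
        return Node(op, [Hole() for _ in range(SYNTAX[op])])
    if isinstance(tree, Hole):
        raise ValueError("path passes through a hole")
    children = list(tree.children)
    children[path[0]] = fill(children[path[0]], path[1:], op)
    return Node(tree.op, children)

def is_complete(tree):
    """A program is complete when it contains no holes."""
    return isinstance(tree, Node) and all(is_complete(c) for c in tree.children)

program = Hole()
program = fill(program, [], "add")      # add(?, ?)
program = fill(program, [0], "num")     # add(num, ?)
partial_ok = is_complete(program)       # still has a hole
program = fill(program, [1], "neg")     # add(num, neg(?))
program = fill(program, [1, 0], "num")  # add(num, neg(num))
final_ok = is_complete(program)
print(partial_ok, final_ok)
```

Because the editor never accepts free text, an ill-formed tree cannot be constructed; the only "error" state is an unfinished one, which the holes make explicit.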

[336] ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models

Hao Chen,Haoze Li,Zhiqing Xiao,Lirong Gao,Qi Zhang,Xiaomeng Hu,Ningtao Wang,Xing Fu,Junbo Zhao

Main category: cs.CL

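TL;DR: ALPS localizes and prunes task-sensitive attention heads, activating only 10% of attention parameters while improving performance by 2% on three tasks and increasing alignment efficiency and transferability.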

Details

Motivation: Aligning general-purpose LLMs to downstream tasks is costly, and existing methods are data-dependent and generalize poorly. Method: Proposes the ALPS algorithm, which localizes task-sensitive attention heads and prunes by updating only the parameters of those heads. Result: Experiments show a 2% performance improvement with only 10% of attention parameters activated, and the task-specific heads transfer across datasets. Conclusion: ALPS offers a new perspective on efficient LLM alignment, reducing cost and improving generalization. Abstract: Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant costs, including constructing task-specific instruction pairs and extensive training adjustments. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the Attention Localization and Pruning Strategy (ALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only 10% of attention parameters during fine-tuning while achieving a 2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment.
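
Restricting training updates to a small set of attention heads can be sketched as a top-k selection followed by a masked gradient step. The example below is a simplified illustration, not the ALPS algorithm: the abstract does not say how task sensitivity is measured, so the scores here are arbitrary placeholders, and each head's parameters are reduced to a single scalar.

```python
def select_heads(scores, fraction=0.10):
    """Return the set of (layer, head) keys with the highest sensitivity scores."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def masked_update(params, grads, trainable, lr=0.1):
    """Gradient step applied only to heads in `trainable`; all other heads stay frozen."""
    return {
        head: (value - lr * grads[head] if head in trainable else value)
        for head, value in params.items()
    }

# Hypothetical model with 2 layers x 10 heads; scores are placeholders,
# chosen only so that every head has a distinct value.
scores = {
    (layer, head): ((layer * 10 + head) * 7 % 20) / 20
    for layer in range(2)
    for head in range(10)
}
selected = select_heads(scores)  # 10% of 20 heads -> 2 heads

params = {key: 1.0 for key in scores}  # one scalar stands in for each head's weights
grads = {key: 0.5 for key in scores}
updated = masked_update(params, grads, selected)

changed = sorted(key for key in params if updated[key] != params[key])
print(sorted(selected), changed)
```

In a real framework the same effect is usually obtained by zeroing gradients or disabling them for the frozen parameters; the dictionary form above only makes the masking explicit.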

[337] Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Jiwan Chung,Junhyeok Kim,Siyeol Kim,Jaeyoung Lee,Min Soo Kim,Youngjae Yu

Main category: cs.CL

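TL;DR: v1 is a lightweight extension that lets multimodal large language models (MLLMs) selectively revisit visual input during reasoning, improving performance by dynamically retrieving relevant image regions.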

Details

Motivation: Current MLLMs typically process visual input only once; v1 adds a dynamic visual access mechanism to strengthen fine-grained visual reference and multi-step reasoning. Method: v1 introduces a simple point-and-copy mechanism that dynamically retrieves relevant image regions, and trains the model on the v1g dataset (300K multimodal reasoning traces). Result: On three multimodal mathematical reasoning benchmarks (MathVista, MathVision, and MathVerse), v1 outperforms baselines, especially on tasks requiring fine-grained visual reference. Conclusion: Dynamic visual access is an effective direction for enhancing multimodal reasoning; the code, models, and data for v1 will be released to support future research. Abstract: We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model's evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks (MathVista, MathVision, and MathVerse) demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.

[338] Multi-Party Conversational Agents: A Survey

Sagar Sapkota,Mohammad Saqib Hasan,Mubarak Shah,Santu Karmaker

Main category: cs.CL

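TL;DR: This survey reviews recent progress in multi-party conversational agents (MPCAs), discusses the challenges they face (mental state modeling, semantic understanding, and conversation flow prediction), and proposes future research directions.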

Details

Motivation: MPCAs face additional challenges compared with traditional two-party agents, such as semantic understanding and modeling social dynamics. Method: Surveys approaches ranging from classical machine learning to LLMs and multimodal systems, and analyzes their ability to model MPCAs. Result: The analysis shows that Theory of Mind (ToM) is essential for building intelligent MPCAs and that multimodal understanding is a promising but underexplored direction. Conclusion: The survey offers guidance to future researchers on developing more capable MPCAs, stressing the importance of ToM and multimodal understanding. Abstract: Multi-party Conversational Agents (MPCAs) are systems designed to engage in dialogue with more than two participants simultaneously. Unlike traditional two-party agents, designing MPCAs faces additional challenges due to the need to interpret both utterance semantics and social dynamics. This survey explores recent progress in MPCAs by addressing three key questions: 1) Can agents model each participant's mental states? (State of Mind Modeling); 2) Can they properly understand the dialogue content? (Semantic Understanding); and 3) Can they reason about and predict future conversation flow? (Agent Action Modeling). We review methods ranging from classical machine learning to Large Language Models (LLMs) and multi-modal systems. Our analysis underscores Theory of Mind (ToM) as essential for building intelligent MPCAs and highlights multi-modal understanding as a promising yet underexplored direction. Finally, this survey offers guidance to future researchers on developing more capable MPCAs.

[339] Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

Alexander Shabalin,Viacheslav Meshchaninov,Dmitry Vetrov

Main category: cs.CL

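TL;DR: Smoothie is a new diffusion method that progressively smooths token embeddings based on semantic similarity, combining the strengths of continuous and discrete approaches to improve text generation quality.

Details

Motivation: Diffusion models excel at generating images, audio, and video, but adapting them to discrete text remains challenging. Existing methods either apply Gaussian diffusion in a continuous latent space (which preserves semantic structure but makes decoding difficult) or operate on the categorical simplex (which respects discreteness but ignores semantic relations). Method: Smoothie progressively smooths token embeddings based on semantic similarity, combining the strengths of both approaches to achieve gradual information removal while keeping a natural decoding process. Result: Experiments show that Smoothie outperforms existing diffusion models on several sequence-to-sequence generation tasks, and ablation studies confirm that its diffusion space performs better than both the standard embedding space and the categorical simplex. Conclusion: By combining the strengths of continuous and discrete approaches, Smoothie significantly improves text generation performance. Abstract: Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. Our code is available at https://github.com/ashaba1in/smoothie.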
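
The idea of smoothing a token embedding by semantic similarity can be illustrated with a toy sketch. This is not the authors' formulation: the Gaussian kernel over squared Euclidean distances, the function name smooth_embedding, and the three-token vocabulary are all assumptions made for illustration. With a small noise scale the token keeps its own embedding; with a large one it blends toward the average of the vocabulary, so information is removed gradually.

```python
import math

def smooth_embedding(embeddings, token_id, sigma):
    """Blend a token's embedding with its neighbours (hypothetical helper).

    Weights are a softmax over negative squared distances scaled by sigma,
    so nearby (semantically similar) tokens contribute first as sigma grows.
    """
    target = embeddings[token_id]
    logits = [
        -sum((a - b) ** 2 for a, b in zip(target, e)) / (2 * sigma ** 2)
        for e in embeddings
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(target)
    return [sum(w * e[k] for w, e in zip(weights, embeddings)) for k in range(dim)]

# Toy vocabulary: tokens 0 and 1 are close, token 2 is far away.
vocab = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
almost_clean = smooth_embedding(vocab, 0, sigma=0.01)   # stays near token 0
fully_smoothed = smooth_embedding(vocab, 0, sigma=1e3)  # near the vocabulary mean
```

At intermediate values of sigma, token 0 mixes with its close neighbour (token 1) well before the distant token 2 receives noticeable weight, which is the sense in which the smoothing follows semantic similarity in this sketch.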

[340] Writing Like the Best: Exemplar-Based Expository Text Generation

Yuxiang Liu,Kevin Chen-Chuan Chang

Main category: cs.CL

TL;DR: The paper proposes an exemplar-based expository text generation task and the Recurrent Plan-then-Adapt (RePA) framework, which addresses the shortcomings of existing methods through adaptive imitation and segment-by-segment generation.

Details

Motivation: Current methods rely on extensive exemplar data, struggle to adapt topic-specific content, and produce long texts with poor coherence. Method: RePA uses large language models for fine-grained planning and adaptive imitation, and introduces memory structures to improve coherence. Result: Experiments show that RePA outperforms existing baselines on three diverse datasets, producing more factual, consistent, and relevant texts. Conclusion: RePA effectively addresses the key problems of exemplar-based expository text generation and offers a new direction for future research. Abstract: We introduce the Exemplar-Based Expository Text Generation task, aiming to generate an expository text on a new topic using an exemplar on a similar topic. Current methods fall short due to their reliance on extensive exemplar data, difficulty in adapting topic-specific content, and issues with long-text coherence. To address these challenges, we propose the concept of Adaptive Imitation and present a novel Recurrent Plan-then-Adapt (RePA) framework. RePA leverages large language models (LLMs) for effective adaptive imitation through a fine-grained plan-then-adapt process. RePA also enables recurrent segment-by-segment imitation, supported by two memory structures that enhance input clarity and output coherence. We also develop task-specific evaluation metrics--imitativeness, adaptiveness, and adaptive-imitativeness--using LLMs as evaluators. Experimental results across the three diverse datasets we collected demonstrate that RePA surpasses existing baselines in producing factual, consistent, and relevant texts for this task.

[341] Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework

Binhao Ma,Hanqing Guo,Zhengping Jay Luo,Rui Duan

Main category: cs.CL

TL;DR: The paper proposes a white-box adversarial attack on the speech input of multimodal large language models (MLLMs) that generates adversarial speech token sequences to bypass safety safeguards, achieving an attack success rate of up to 89%.

Details

Motivation: Voice-enabled MLLMs such as SpeechGPT make human-computer interaction more natural but also introduce new security risks, and existing research has underexplored attack and defense mechanisms for the voice modality. Method: A token-level adversarial attack that uses the model's speech tokenization to generate adversarial token sequences, which are synthesized into audio prompts that bypass the model's safety alignment. Result: On SpeechGPT, the attack success rate reaches up to 89%, significantly outperforming existing voice-based attack methods. Conclusion: The study exposes the vulnerability of voice-enabled MLLMs and offers guidance for developing more robust next-generation models. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced the naturalness and flexibility of human-computer interaction by enabling seamless understanding across text, vision, and audio modalities. Among these, voice-enabled models such as SpeechGPT have demonstrated considerable improvements in usability, offering expressive and emotionally responsive interactions that foster deeper connections in real-world communication scenarios. However, the use of voice introduces new security risks, as attackers can exploit the unique characteristics of spoken language, such as timing, pronunciation variability, and speech-to-text translation, to craft inputs that bypass defenses in ways not seen in text-based systems. Despite substantial research on text-based jailbreaks, the voice modality remains largely underexplored in terms of both attack strategies and defense mechanisms. In this work, we present an adversarial attack targeting the speech input of aligned MLLMs in a white-box scenario. Specifically, we introduce a novel token-level attack that leverages access to the model's speech tokenization to generate adversarial token sequences. These sequences are then synthesized into audio prompts, which effectively bypass alignment safeguards and induce prohibited outputs. Evaluated on SpeechGPT, our approach achieves up to 89 percent attack success rate across multiple restricted tasks, significantly outperforming existing voice-based jailbreak methods. Our findings shed light on the vulnerabilities of voice-enabled multimodal systems and help guide the development of more robust next-generation MLLMs.

[342] Sci-LoRA: Mixture of Scientific LoRAs for Cross-Domain Lay Paraphrasing

Ming Cheng,Jiaying Gong,Hoda Eldardiry

Main category: cs.CL

TL;DR: Sci-LoRA is a dynamic mixture-of-LoRAs model for cross-domain lay paraphrasing of scientific information that adjusts the influence of different domains without requiring explicit domain labels.

Details

Motivation: With the rise of interdisciplinary research, methods are needed that can handle knowledge spanning multiple fields, yet most existing studies are limited to a single domain. Method: Sci-LoRA dynamically generates and applies weights for multiple LoRAs and integrates information at both the data and model levels to balance across domains. Result: Across twelve domains on five public datasets, Sci-LoRA significantly outperforms state-of-the-art large language models and shows flexible generalization and adaptability. Conclusion: Sci-LoRA provides an efficient and flexible solution for cross-domain lay paraphrasing of scientific information. Abstract: Lay paraphrasing aims to make scientific information accessible to audiences without technical backgrounds. However, most existing studies focus on a single domain, such as biomedicine. With the rise of interdisciplinary research, it is increasingly necessary to comprehend knowledge spanning multiple technical fields. To address this, we propose Sci-LoRA, a model that leverages a mixture of LoRAs fine-tuned on multiple scientific domains. In particular, Sci-LoRA dynamically generates and applies weights for each LoRA, enabling it to adjust the impact of different domains based on the input text, without requiring explicit domain labels. To balance domain-specific knowledge and generalization across various domains, Sci-LoRA integrates information at both the data and model levels. This dynamic fusion enhances the adaptability and performance across various domains. Experimental results across twelve domains on five public datasets show that Sci-LoRA significantly outperforms state-of-the-art large language models and demonstrates flexible generalization and adaptability in cross-domain lay paraphrasing.
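
A minimal sketch of input-dependent weighting over several LoRA adapters is shown below. The abstract does not say how Sci-LoRA computes its weights, so the gating rule here (a softmax over dot products with per-domain key vectors) and all names (mixture_of_loras, gate_keys, the toy matrices) are hypothetical stand-ins for illustration only.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_loras(x, base_w, adapters, gate_keys):
    """Apply a frozen base layer plus a gated sum of low-rank updates.

    adapters: list of (A, B) pairs, A is r x d and B is d x r (LoRA factors).
    gate_keys: one hypothetical key vector per adapter, used to score the input.
    """
    gates = softmax([sum(k * v for k, v in zip(key, x)) for key in gate_keys])
    out = matvec(base_w, x)
    for gate, (a, b) in zip(gates, adapters):
        delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
        out = [o + gate * d for o, d in zip(out, delta)]
    return out, gates

# Two toy "domain" adapters of rank 1 on a 2-dimensional layer.
base_w = [[1.0, 0.0], [0.0, 1.0]]
adapters = [([[1.0, 0.0]], [[1.0], [0.0]]), ([[0.0, 1.0]], [[0.0], [1.0]])]
gate_keys = [[4.0, 0.0], [0.0, 4.0]]

output, gates = mixture_of_loras([1.0, 0.0], base_w, adapters, gate_keys)
print(gates)  # the first adapter receives most of the weight for this input
```

Because the gate depends only on the input vector, no domain label is needed at inference time, which mirrors the property the abstract highlights.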

[343] CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Kung-Hsiang Huang,Akshara Prabhakar,Onkar Thorat,Divyansh Agarwal,Prafulla Kumar Choubey,Yixin Mao,Silvio Savarese,Caiming Xiong,Chien-Sheng Wu

Main category: cs.CL

TL;DR: CRMArena-Pro is a new benchmark for evaluating LLM agents in diverse business scenarios; it finds that current agents perform poorly in multi-turn interactions and confidentiality awareness.

Details

Motivation: Existing benchmarks lack realism and diversity and cannot comprehensively evaluate AI agents in business environments. Method: CRMArena-Pro comprises nineteen expert-validated tasks covering sales, service, and 'configure, price, and quote' processes, and supports multi-turn interaction and confidentiality awareness assessment. Result: Leading LLM agents achieve around 58% success in single-turn tasks, dropping to approximately 35% in multi-turn settings, with near-zero confidentiality awareness. Conclusion: There is a substantial gap between current LLM agent capabilities and enterprise demands; improvements are needed in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition. Abstract: While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.

[344] StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos

Valentin Barriere,Nahuel Gomez,Leo Hemamou,Sofia Callejas,Brian Ravenet

Main category: cs.CL

TL;DR: The paper proposes a multimodal stand-up comedy dataset in seven languages to improve computational models of humor detection. The dataset is automatically annotated for laughter, with a manually annotated subset for validation. The task is framed as word-level sequence labeling rather than binary classification.

Details

Motivation: Improve current humor detection models by providing a larger and more diverse dataset. Method: Build a multilingual stand-up comedy dataset, annotate laughter automatically and manually, and frame the task as word-level sequence labeling. Result: The dataset is currently the largest and most diverse for this type of task; the paper also proposes a method to enhance automatic laughter detection based on speech recognition errors. Conclusion: The dataset and method provide new resources and ideas for humor detection. Abstract: Aiming to improve current computational models of humor detection, we propose a new multimodal dataset of stand-up comedy in seven languages: English, French, Spanish, Italian, Portuguese, Hungarian and Czech. Our dataset of more than 330 hours is, at the time of writing, the largest available for this type of task, and the most diverse. The whole dataset is automatically annotated for laughter (from the audience), and the subpart left for model validation is manually annotated. Contrary to contemporary approaches, we do not frame the task of humor detection as a binary sequence classification, but as word-level sequence labeling, in order to take into account all the context of the sequence and to capture the continuous joke tagging mechanism typically occurring in natural conversations. Along with unimodal baseline results, we propose a method to enhance the automatic laughter detection based on Audio Speech Recognition errors. Our code and data are available online: https://tinyurl.com/EMNLPHumourStandUpPublic

[345] Building a Functional Machine Translation Corpus for Kpelle

Kweku Andoh Yamoah,Jackson Weako,Emmanuel J. Dorley

Main category: cs.CL

TL;DR: This paper introduces the first publicly available English-Kpelle machine translation dataset, with over 2000 sentence pairs. Fine-tuning the NLLB model with data augmentation yields BLEU scores of up to 30, showing the potential of low-resource languages.

Details

Motivation: Provide machine translation support for the low-resource Kpelle language and enable broader NLP tasks. Method: Fine-tune Meta's NLLB model on two versions of the dataset, using data augmentation. Result: BLEU scores reach up to 30 in the Kpelle-to-English direction, comparable to performance on other African languages. Conclusion: Future work should expand the dataset with attention to orthographic consistency and community-driven validation, advancing inclusive language technology for Kpelle and other low-resource languages. Abstract: In this paper, we introduce the first publicly available English-Kpelle dataset for machine translation, comprising over 2000 sentence pairs drawn from everyday communication, religious texts, and educational materials. By fine-tuning Meta's No Language Left Behind (NLLB) model on two versions of the dataset, we achieved BLEU scores of up to 30 in the Kpelle-to-English direction, demonstrating the benefits of data augmentation. Our findings align with NLLB-200 benchmarks on other African languages, underscoring Kpelle's potential for competitive performance despite its low-resource status. Beyond machine translation, this dataset enables broader NLP tasks, including speech recognition and language modelling. We conclude with a roadmap for future dataset expansion, emphasizing orthographic consistency, community-driven validation, and interdisciplinary collaboration to advance inclusive language technology development for Kpelle and other low-resourced Mande languages.

[346] Federated Retrieval-Augmented Generation: A Systematic Mapping Study

Abhijit Chakraborty,Chahana Dahal,Vivek Gupta

Main category: cs.CL

TL;DR: Federated RAG combines Federated Learning (FL) with Retrieval-Augmented Generation (RAG), offering a secure framework for NLP in privacy-sensitive domains. This paper presents the first systematic mapping study of the area, covering work from 2020 to 2025 and analyzing architectural patterns, trends, and challenges.

Details

Motivation: Address the security and factual accuracy needs of large language models in privacy-sensitive domains such as healthcare and finance. Method: Following Kitchenham's guidelines, systematically classify the literature and analyze research focuses, contribution types, and application domains. Result: The study summarizes research progress, design patterns, and open questions in Federated RAG. Conclusion: The study lays a foundation for research at the intersection of RAG and federated systems and identifies future research directions. Abstract: Federated Retrieval-Augmented Generation (Federated RAG) combines Federated Learning (FL), which enables distributed model training without exposing raw data, with Retrieval-Augmented Generation (RAG), which improves the factual accuracy of language models by grounding outputs in external knowledge. As large language models are increasingly deployed in privacy-sensitive domains such as healthcare, finance, and personalized assistance, Federated RAG offers a promising framework for secure, knowledge-intensive natural language processing (NLP). To the best of our knowledge, this paper presents the first systematic mapping study of Federated RAG, covering literature published between 2020 and 2025. Following Kitchenham's guidelines for evidence-based software engineering, we develop a structured classification of research focuses, contribution types, and application domains. We analyze architectural patterns, temporal trends, and key challenges, including privacy-preserving retrieval, cross-client heterogeneity, and evaluation limitations. Our findings synthesize a rapidly evolving body of research, identify recurring design patterns, and surface open questions, providing a foundation for future work at the intersection of RAG and federated systems.

Yue Li,Jake Vasilakes,Zhixue Zhao,Carolina Scarton

Main category: cs.CL

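TL;DR: SCRum-9 is a multilingual rumour stance classification dataset containing 7,516 tweet-reply pairs in 9 languages, linked to 2.1k fact-checked claims and including complex annotations from multiple annotators.

Details

Motivation: Existing stance classification datasets fall short in language coverage and annotation complexity; SCRum-9 aims to fill this gap. Method: The dataset contains multilingual tweet-reply pairs annotated by at least three native speakers per language, totalling around 405 hours of annotation and 8,150 dollars in compensation. Result: Experiments show that SCRum-9 is challenging for both state-of-the-art models (e.g. Deepseek) and fine-tuned pre-trained models. Conclusion: SCRum-9 provides a valuable benchmark for future research and advances multilingual rumour stance classification. Abstract: We introduce SCRum-9, a multilingual dataset for Rumour Stance Classification, containing 7,516 tweet-reply pairs from X. SCRum-9 goes beyond existing stance classification datasets by covering more languages (9), linking examples to more fact-checked claims (2.1k), and including complex annotations from multiple annotators to account for intra- and inter-annotator variability. Annotations were made by at least three native speakers per language, totalling around 405 hours of annotation and 8,150 dollars in compensation. Experiments on SCRum-9 show that it is a challenging benchmark for both state-of-the-art LLMs (e.g. Deepseek) as well as fine-tuned pre-trained models, motivating future work in this area.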

[348] Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments

Amel Muminovic

Main category: cs.CL

TL;DR: The study benchmarks three large language models (GPT-4.1, Gemini 1.5 Pro, Claude 3 Opus) on detecting harmful comments and finds that GPT-4.1 has the best overall balance, but all models still struggle with sarcasm and mixed-language text.

Details

Motivation: Comment sections on online platforms often host harassment that harms user experience and well-being, so effective automated content moderation tools are needed. Method: The study uses 5,080 YouTube comments (1,334 harmful and 3,746 non-harmful) and evaluates the three models with a unified prompt and deterministic settings. Result: GPT-4.1 achieves the highest F1 score (0.863), Gemini the highest recall (0.875), and Claude the highest precision (0.920), but all models perform poorly on complex language. Conclusion: The study recommends combining multiple models, adding contextual information, and improving detection for under-represented languages and implicit abuse to make moderation more effective. The dataset is publicly released to support further research. Abstract: As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three leading large language models, OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus, on a corpus of 5,080 YouTube comments sampled from high-abuse threads in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with substantial agreement (Cohen's kappa = 0.83). Using a unified prompt and deterministic settings, GPT-4.1 achieved the best overall balance with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flagged the highest share of harmful posts (recall = 0.875) but its precision fell to 0.767 due to frequent false positives. Claude delivered the highest precision at 0.920 and the lowest false-positive rate of 0.022, yet its recall dropped to 0.720. Qualitative analysis showed that all three models struggle with sarcasm, coded insults, and mixed-language slang. These results underscore the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset and full prompts is publicly released to promote reproducibility and further progress in automated content moderation.
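
The reported F1 score can be checked from the reported precision and recall, since F1 is their harmonic mean. The short sketch below does this for all three models; only the GPT-4.1 F1 value (0.863) appears in the abstract, so the Gemini and Claude F1 values it prints are derived here from the stated precision and recall rather than quoted from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in the abstract.
reported = {
    "GPT-4.1": (0.887, 0.841),
    "Gemini 1.5 Pro": (0.767, 0.875),
    "Claude 3 Opus": (0.920, 0.720),
}

for model, (precision, recall) in reported.items():
    print(f"{model}: F1 = {f1_score(precision, recall):.3f}")
# GPT-4.1: F1 = 0.863
# Gemini 1.5 Pro: F1 = 0.817
# Claude 3 Opus: F1 = 0.808
```

The comparison shows why GPT-4.1 is described as the best overall balance: the other two models each lead on one metric but lose more on the other, which the harmonic mean penalizes.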

Xuanming Zhang,Yuxuan Chen,Min-Hsuan Yeh,Yixuan Li

Main category: cs.CL

TL;DR: MetaMind is a multi-agent framework that emulates human Theory of Mind (ToM) to improve LLM social reasoning, with a 35.7% improvement in real-world social scenarios.

Details

Motivation: LLMs excel at semantic understanding tasks but struggle with the ambiguity and contextual nuance of human social interaction. Method: MetaMind decomposes social understanding into three stages: a ToM agent generates hypotheses, a domain agent refines them using cultural and ethical constraints, and a response agent generates and validates contextually appropriate responses. Result: The framework performs strongly on three benchmarks, with a 6.2% gain in ToM reasoning, and for the first time enables LLMs to match human-level performance on key ToM tasks. Conclusion: By balancing contextual plausibility, social appropriateness, and user adaptation, MetaMind advances AI systems toward human-like social intelligence. Abstract: Human social interactions depend on the ability to infer others' unspoken intentions, emotions, and beliefs, a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce MetaMind, a multi-agent framework inspired by psychological theories of metacognition, designed to emulate human-like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a Theory-of-Mind Agent generates hypotheses about user mental states (e.g., intent, emotion), (2) a Domain Agent refines these hypotheses using cultural norms and ethical constraints, and (3) a Response Agent generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state-of-the-art performance across three challenging benchmarks, with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human-level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components, which showcase the framework's ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at https://github.com/XMZhangAI/MetaMind.

[350] The Price of Format: Diversity Collapse in LLMs

Longfei Yun,Chenyang An,Zilong Wang,Letian Peng,Jingbo Shang

Main category: cs.CL

TL;DR: The study finds that structured templates used by instruction-tuned large language models (LLMs) cause diversity collapse, limiting the creativity and variability of outputs.

Details

Motivation: Examine how structured templates affect the output diversity of LLMs and expose the problems they may cause. Method: Systematically evaluate different tasks (such as story completion and free-form generation), analyze how template structure affects output diversity, and compare the effects of different structured prompts. Result: Structured templates significantly constrain the output space, and diversity collapse is widespread; format consistency matters for structure-sensitive tasks but has little influence on knowledge-heavy tasks. Conclusion: Current prompting conventions may inadvertently suppress output diversity, so diversity-aware prompt design and instruction tuning are needed. Abstract: Instruction-tuned large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. However, we identify a critical limitation of such formatting: it induces a phenomenon we term diversity collapse, where the model generates semantically similar outputs for open-ended inputs, undermining creativity and variability. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that (1) diversity collapse persists even under high-temperature sampling, and (2) structural tokens in templates significantly constrain the model's output space. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity. Our analysis shows that format consistency between fine-tuning and inference is crucial for structure-sensitive tasks (e.g., GSM8K, IFEval), but has marginal influence on knowledge-heavy tasks (e.g., MMLU, WebQuestions). In contrast, output diversity is primarily governed by the presence or absence of structural tokens, with minimal formatting yielding the most diverse outputs. These findings reveal that current prompting conventions, while beneficial for alignment, may inadvertently suppress output diversity, underscoring the need for diversity-aware prompt design and instruction tuning.
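
One simple way to see what a collapse in output diversity looks like is to compare sets of sampled outputs with a lexical proxy such as distinct-n (the fraction of n-grams that are unique across samples). The abstract does not name the paper's diversity metric and speaks of semantic similarity, so distinct-n, the function name distinct_n, and the toy samples below are assumptions chosen purely for illustration.

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams among all n-grams in a set of outputs."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

# Hypothetical samples for the same open-ended prompt.
collapsed = ["the cat sat down", "the cat sat down", "the cat sat up"]
diverse = ["a dragon slept", "rain fell softly", "she opened the door"]

print(distinct_n(collapsed))  # 4 unique bigrams out of 9
print(distinct_n(diverse))    # all 7 bigrams are unique
```

A lexical measure like this misses paraphrases, so an embedding-based similarity score would be closer to the semantic notion of collapse that the abstract describes.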

[351] BnMMLU: Measuring Massive Multitask Language Understanding in Bengali

Saman Sarker Joy

Main category: cs.CL

TL;DR: The paper introduces BnMMLU, a benchmark dataset for evaluating multitask language understanding in Bengali, filling the gap for low-resource languages in MMLU-style benchmarks.

Details

Motivation: Existing MMLU benchmarks focus mainly on high-resource languages such as English, leaving low-resource languages like Bengali underrepresented, so a dedicated dataset is needed to evaluate language models on them. Method: Build the BnMMLU dataset, with 138,949 question-option pairs across 23 domains in multiple-choice format, and annotate the test set with three cognitive categories (factual knowledge, procedural application, and reasoning) to evaluate the multitask abilities of language models. Result: Several proprietary and open-source large language models were tested; the results show significant performance gaps, indicating the need for improved pre-training and fine-tuning strategies tailored to Bengali data. Conclusion: The release of BnMMLU aims to facilitate further research on Bengali language models and fill the benchmarking gap for low-resource languages. Abstract: The Massive Multitask Language Understanding (MMLU) benchmark has been widely used to evaluate language models across various domains. However, existing MMLU datasets primarily focus on high-resource languages such as English, which leaves low-resource languages like Bengali underrepresented. In this paper, we introduce BnMMLU, a benchmark to evaluate the Bengali multitask language understanding capabilities of language models. The dataset spans 23 domains, including science, humanities, mathematics and general knowledge, and is structured in a multiple-choice format to assess factual knowledge, application-based problem-solving and reasoning abilities of language models. It consists of 138,949 question-option pairs. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set. Additionally, we annotate the test set with three cognitive categories (factual knowledge, procedural application and reasoning) to gain deeper insights into model strengths and weaknesses across various cognitive tasks. The results reveal significant performance gaps, highlighting the need for improved pre-training and fine-tuning strategies tailored to Bengali data. We release the dataset and benchmark results to facilitate further research in this area.

[352] Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk?

Divij Chawla,Ashita Bhutada,Do Duc Anh,Abhinav Raghunathan,Vinod SP,Cathy Guo,Dar Win Liew,Prannaya Gupta,Rishabh Bhardwaj,Rajat Bhardwaj,Soujanya Poria

Main category: cs.CL

TL;DR: The study evaluates the credibility of several AI models in assessing investment risk and finds significant differences across models in score distributions and demographic sensitivity, with no model performing consistently across all regions and demographics.

Details

Motivation: Evaluate how AI models perform in financial risk assessment in order to expose potential bias and inconsistency. Method: Analyze multiple proprietary and open-weight AI models using 1,720 user profiles covering 10 countries, both genders, and 16 risk-relevant features. Result: Models differ significantly in risk scores and demographic sensitivity; for example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles, while LLaMA and DeepSeek show opposite gender tendencies in risk classification. Conclusion: The study stresses the need for rigorous, standardized evaluation of AI systems in finance to prevent bias and inconsistency. Abstract: We evaluate the credibility of leading AI models in assessing investment risk appetite. Our analysis spans proprietary (GPT-4, Claude 3.7, Gemini 1.5) and open-weight models (LLaMA 3.1/3.3, DeepSeek-V3, Mistral-small), using 1,720 user profiles constructed with 16 risk-relevant features across 10 countries and both genders. We observe significant variance across models in score distributions and demographic sensitivity. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles, while LLaMA and DeepSeek show opposite gender tendencies in risk classification. While some models (e.g., GPT-4o, LLaMA 3.1) align closely with expected scores in low- and mid-risk ranges, none maintain consistent performance across regions and demographics. Our findings highlight the need for rigorous, standardized evaluations of AI systems in regulated financial contexts to prevent bias, opacity, and inconsistency in real-world deployment.

[353] System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts

Xiaoqiang Wang,Suyuchen Wang,Yun Zhu,Bang Liu

Main category: cs.CL

TL;DR: System-1.5 Reasoning is an adaptive reasoning framework that dynamically allocates computation across reasoning steps, achieving efficient inference while maintaining strong performance.

Details

Motivation: Traditional chain-of-thought (CoT) reasoning is inefficient, and existing latent-space reasoning methods fail to distinguish critical steps from auxiliary ones, wasting computational resources. Method: The System-1.5 Reasoning framework introduces two types of dynamic shortcuts, the model depth shortcut (DS) and the step shortcut (SS), and is trained with a two-stage self-distillation process. Result: On reasoning tasks such as GSM8K, System-1.5 Reasoning performs comparably to traditional CoT fine-tuning methods while accelerating inference by over 20x and reducing token generation by 92.31% on average. Conclusion: System-1.5 Reasoning substantially improves inference efficiency while maintaining strong performance, offering a new approach to optimizing LLM reasoning. Abstract: Chain-of-thought (CoT) reasoning enables large language models (LLMs) to move beyond fast System-1 responses and engage in deliberative System-2 reasoning. However, this comes at the cost of significant inefficiency due to verbose intermediate output. Recent latent-space reasoning methods improve efficiency by operating on hidden states without decoding into language, yet they treat all steps uniformly, failing to distinguish critical deductions from auxiliary steps and resulting in suboptimal use of computational resources. In this paper, we propose System-1.5 Reasoning, an adaptive reasoning framework that dynamically allocates computation across reasoning steps through shortcut paths in latent space. Specifically, System-1.5 Reasoning introduces two types of dynamic shortcuts. The model depth shortcut (DS) adaptively reasons along the vertical depth by early exiting non-critical tokens through lightweight adapter branches, while allowing critical tokens to continue through deeper Transformer layers. The step shortcut (SS) reuses hidden states across the decoding steps to skip trivial steps and reason horizontally in latent space. Training System-1.5 Reasoning involves a two-stage self-distillation process: first distilling natural language CoT into latent-space continuous thought, and then distilling full-path System-2 latent reasoning into adaptive shortcut paths (System-1.5 Reasoning). Experiments on reasoning tasks demonstrate the superior performance of our method. For example, on GSM8K, System-1.5 Reasoning achieves reasoning performance comparable to traditional CoT fine-tuning methods while accelerating inference by over 20x and reducing token generation by 92.31% on average.
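
The depth shortcut described in the abstract (non-critical tokens exit early through a lightweight adapter, critical tokens continue through deeper layers) can be sketched in a few lines. The abstract gives no implementation details, so the router, the exit layer, the threshold, and the scalar toy layers below are all hypothetical and serve only to show the control flow.

```python
def run_with_depth_shortcut(hidden, layers, adapter, router, exit_layer, threshold=0.5):
    """Run a token's hidden state through the stack, allowing one early exit.

    router(hidden) returns a criticality score (hypothetical); tokens scoring
    below the threshold leave at exit_layer through the lightweight adapter.
    Returns the final state and the number of full layers actually executed.
    """
    for depth, layer in enumerate(layers):
        if depth == exit_layer and router(hidden) < threshold:
            return adapter(hidden), depth  # early exit: remaining layers skipped
        hidden = layer(hidden)
    return hidden, len(layers)

# Toy setup with scalar "hidden states": each layer adds 1.
layers = [lambda h: h + 1.0] * 4
adapter = lambda h: h * 1.0                  # stand-in for a small adapter branch
router = lambda h: 0.9 if h > 10 else 0.1    # large values count as "critical"

easy_state, easy_depth = run_with_depth_shortcut(0.0, layers, adapter, router, exit_layer=2)
hard_state, hard_depth = run_with_depth_shortcut(20.0, layers, adapter, router, exit_layer=2)
print(easy_depth, hard_depth)  # 2 4
```

In this toy run the non-critical token uses two of the four layers, which is the source of the compute savings; the step shortcut in the abstract applies a similar idea across decoding steps rather than across depth.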

[354] Learning to Explain: Prototype-Based Surrogate Models for LLM Classification

Bowen Wei,Ziwei Zhu

Main category: cs.CL

TL;DR: ProtoSurE is a prototype-based surrogate framework that provides faithful and easy-to-understand explanations for large language models, outperforming existing methods.

Details

Motivation: The decision-making process of large language models is opaque, and existing explanation methods are either not faithful enough or hard to understand. Method: ProtoSurE trains an interpretable surrogate model that uses sentence-level prototypes as human-understandable concepts. Result: ProtoSurE outperforms existing methods across diverse models and datasets and is data-efficient. Conclusion: ProtoSurE offers a practical and efficient explanation solution for LLMs. Abstract: Large language models (LLMs) have demonstrated impressive performance on natural language tasks, but their decision-making processes remain largely opaque. Existing explanation methods either suffer from limited faithfulness to the model's reasoning or produce explanations that humans find difficult to understand. To address these challenges, we propose ProtoSurE, a novel prototype-based surrogate framework that provides faithful and human-understandable explanations for LLMs. ProtoSurE trains an interpretable-by-design surrogate model that aligns with the target LLM while utilizing sentence-level prototypes as human-understandable concepts. Extensive experiments show that ProtoSurE consistently outperforms SOTA explanation methods across diverse LLMs and datasets. Importantly, ProtoSurE demonstrates strong data efficiency, requiring relatively few training examples to achieve good performance, making it practical for real-world applications.

[355] Is Architectural Complexity Overrated? Competitive and Interpretable Knowledge Graph Completion with RelatE

Abhijit Chakraborty,Chahana Dahal,Ashutosh Balasubramaniam,Tejas Anvekar,Vivek Gupta

Main category: cs.CL

TL;DR: RelatE is a simple, interpretable knowledge graph completion method that uses a phase-modulus decomposition to achieve competitive or superior performance compared with more complex models.

Details

Motivation: Explore the effectiveness of simple real-valued embedding models for knowledge graph completion and propose an interpretable, efficient method. Method: Use a phase-modulus decomposition with sinusoidal phase alignments to encode relational patterns such as symmetry, inversion, and composition. Result: On YAGO3-10, MRR is 0.521 and Hit@10 is 0.680; compared to RotatE, training time is reduced by 24% and inference latency by 31%. Conclusion: RelatE is a scalable, interpretable, and efficient alternative for knowledge graph completion. Abstract: We revisit the efficacy of simple, real-valued embedding models for knowledge graph completion and introduce RelatE, an interpretable and modular method that efficiently integrates dual representations for entities and relations. RelatE employs a real-valued phase-modulus decomposition, leveraging sinusoidal phase alignments to encode relational patterns such as symmetry, inversion, and composition. In contrast to recent approaches based on complex-valued embeddings or deep neural architectures, RelatE preserves architectural simplicity while achieving competitive or superior performance on standard benchmarks. Empirically, RelatE outperforms prior methods across several datasets: on YAGO3-10, it achieves an MRR of 0.521 and Hit@10 of 0.680, surpassing all baselines. Additionally, RelatE offers significant efficiency gains, reducing training time by 24%, inference latency by 31%, and peak GPU memory usage by 22% compared to RotatE. Perturbation studies demonstrate improved robustness, with MRR degradation reduced by up to 61% relative to TransE and by up to 19% compared to RotatE under structural edits such as edge removals and relation swaps. Formal analysis further establishes the model's full expressiveness and its capacity to represent essential first-order logical inference patterns. These results position RelatE as a scalable and interpretable alternative to more complex architectures for knowledge graph completion.

[356] Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings

Sarang Patil,Ashish Parmanand Pandey,Ioannis Koutis,Mengjia Xu

Main category: cs.CL

TL;DR: The paper proposes Hierarchical Mamba (HiM), a model that combines Mamba2 with hyperbolic geometry to learn hierarchy-aware language embeddings and improve performance on complex hierarchical reasoning tasks.

Details

Motivation: Existing large language models rely mainly on flat Euclidean embeddings, which limits their ability to capture latent hierarchies. To address this, the paper explores the potential of selective state-space models for language representation. Method: HiM projects Mamba2-processed sequences onto the Poincare ball (via tangent-based mapping) or the Lorentzian manifold (via cosine and sine-based mapping) and optimizes with a combined hyperbolic loss. Result: Experiments show that HiM outperforms Euclidean baselines on four linguistic and medical datasets and effectively captures hierarchical relationships. HiM-Poincare is better at fine-grained semantic distinctions, while HiM-Lorentz provides more stable and compact embeddings. Conclusion: By combining hyperbolic geometry with Mamba2, HiM significantly improves performance on hierarchical reasoning tasks and offers a new solution for complex language understanding. Abstract: Selective state-space models have achieved great success in long-sequence modeling. However, their capacity for language representation, especially in complex hierarchical reasoning tasks, remains underexplored. Most large language models rely on flat Euclidean embeddings, limiting their ability to capture latent hierarchies. To address this limitation, we propose Hierarchical Mamba (HiM), integrating the efficient Mamba2 with the exponential growth and curved nature of hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected to the Poincare ball (via tangent-based mapping) or Lorentzian manifold (via cosine and sine-based mapping) with "learnable" curvature, optimized with a combined hyperbolic loss. Our HiM model facilitates the capture of relational distances across varying hierarchical levels, enabling effective long-range reasoning. This makes it well-suited for tasks like mixed-hop prediction and multi-hop inference in hierarchical classification. We evaluated our HiM with four linguistic and medical datasets for mixed-hop prediction and multi-hop inference tasks. Experimental results demonstrated that: 1) Both HiM models effectively capture hierarchical relationships for four ontological datasets, surpassing Euclidean baselines. 2) HiM-Poincare captures fine-grained semantic distinctions with higher h-norms, while HiM-Lorentz provides more stable, compact, and hierarchy-preserving embeddings favoring robustness over detail.
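
For readers unfamiliar with hyperbolic embeddings, the sketch below shows the standard exponential map at the origin of the Poincare ball and the standard Poincare distance. These are textbook formulas, not code from the paper: the abstract only says the projection is tangent-based, so treating it as this particular map is an assumption, and the curvature parameter c is fixed here rather than learned.

```python
import math

def sq_norm(vector):
    return sum(x * x for x in vector)

def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector into the Poincare ball of curvature -c."""
    norm = math.sqrt(sq_norm(v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)

# Any Euclidean vector lands strictly inside the unit ball (for c = 1).
point = expmap0([3.0, 4.0])
print(math.sqrt(sq_norm(point)) < 1.0)  # True

# Under this convention the distance from the origin is twice the tangent norm.
child = expmap0([0.3, 0.4])
print(poincare_distance([0.0, 0.0], child))  # approximately 1.0
```

Distances grow rapidly as points approach the boundary of the ball, which is why hyperbolic space can place the many leaves of a hierarchy far apart while keeping them all within a bounded region.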

[357] AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models

Miguel Angel Peñaloza Perez,Bruno Lopez Orozco,Jesus Tadeo Cruz Soto,Michelle Bruno Hernandez,Miguel Angel Alvarado Gonzalez,Sandra Malagon

Main category: cs.CL

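TL;DR: AI4Math is a benchmark of university-level math problems natively authored in Spanish; it evaluates six large language models under zero-shot and chain-of-thought configurations and reveals how language and domain affect model reasoning.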

Details

Motivation: Existing mathematical reasoning benchmarks are mostly English-only or translated, which can cause semantic drift and language-specific errors, so a native-language benchmark is needed. Method: The authors created 105 math problems natively authored in Spanish, covering seven advanced domains, and evaluated six models under zero-shot and chain-of-thought configurations. Result: Some models (e.g., o3 mini, DeepSeek R1/V3 685B) exceed 70% accuracy, while LLaMA 3.3 70B and GPT-4o mini stay below 40%. Geometry, combinatorics, and probability problems are challenging for all models. Conclusion: Native-language benchmarks and domain-specific evaluation can reveal reasoning failures that standard metrics miss. Abstract: Existing mathematical reasoning benchmarks are predominantly English-only or translation-based, which can introduce semantic drift and mask language-specific reasoning errors. To address this, we present AI4Math, a benchmark of 105 original university-level math problems natively authored in Spanish. The dataset spans seven advanced domains (Algebra, Calculus, Geometry, Probability, Number Theory, Combinatorics, and Logic), and each problem is accompanied by a step-by-step human solution. We evaluate six large language models (GPT-4o, GPT-4o mini, o3 mini, LLaMA 3.3 70B, DeepSeek R1 685B, and DeepSeek V3 685B) under four configurations: zero-shot and chain-of-thought, each in Spanish and English. The top models (o3 mini, DeepSeek R1 685B, DeepSeek V3 685B) achieve over 70% accuracy, whereas LLaMA 3.3 70B and GPT-4o mini remain below 40%. Most models show no significant performance drop between languages, with GPT-4o even performing better on Spanish problems in the zero-shot setting. Geometry, Combinatorics, and Probability questions remain persistently challenging for all models. These results highlight the need for native-language benchmarks and domain-specific evaluations to reveal reasoning failures not captured by standard metrics.

[358] FiLLM -- A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM)

Carlos Jude G. Maminta,Isaiah Job Enriquez,Deandre Nigel Nunez,Michael B. Dela Fuente

Main category: cs.CL

TL;DR: FiLLM is a language model optimized for Filipino, built on SeaLLM-7B 2.5 and fine-tuned with LoRA for efficiency, but it underperforms CalamanCy on several tasks.

Details

Motivation: Improve natural language processing for Filipino and meet local language needs. Method: Fine-tune the SeaLLM-7B 2.5 model with LoRA, then train and evaluate it on a variety of Filipino datasets. Result: CalamanCy outperforms FiLLM on several metrics. Conclusion: FiLLM provides an efficient, scalable model for Filipino NLP, but there is still room for improvement. Abstract: This study presents FiLLM, a Filipino-optimized large language model, designed to enhance natural language processing (NLP) capabilities in the Filipino language. Built upon the SeaLLM-7B 2.5 model, FiLLM leverages Low-Rank Adaptation (LoRA) fine-tuning to optimize memory efficiency while maintaining task-specific performance. The model was trained and evaluated on diverse Filipino datasets to address key NLP tasks, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Dependency Parsing, and Text Summarization. Performance comparisons with the CalamanCy model were conducted using F1 Score, Precision, Recall, Compression Rate, and Keyword Overlap metrics. Results indicate that CalamanCy outperforms FiLLM in several aspects, demonstrating its effectiveness in processing Filipino text with improved linguistic comprehension and adaptability. This research contributes to the advancement of Filipino NLP applications by providing an optimized, efficient, and scalable language model tailored for local linguistic needs.

[359] VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

Yunxin Li,Xinyu Chen,Zitao Li,Zhenyu Liu,Longyue Wang,Wenhan Luo,Baotian Hu,Min Zhang

Main category: cs.CL

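TL;DR: The paper proposes VerIPO, a verifier-guided iterative policy optimization method that improves the long-chain reasoning ability of video large language models and clearly outperforms existing methods.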

Details

Motivation: Existing reinforcement fine-tuning methods (e.g., GRPO) face bottlenecks in data preparation and quality improvement, which hurts the stability and performance of long-chain reasoning. Method: The paper proposes a GRPO-Verifier-DPO training loop that uses small models to verify reasoning logic, builds high-quality contrastive data, and makes the DPO stage efficient. Result: Experiments show that VerIPO markedly improves reasoning-chain length and consistency and outperforms existing models (e.g., Kimi-VL). Conclusion: By combining GRPO's expansive search with DPO's targeted optimization, VerIPO effectively improves the reasoning ability of video large language models. Abstract: Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization.
Experimental results demonstrate: 1) Significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) Our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.

[360] CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language

Md. Mithun Hossain,Md. Shakil Hossain,Sudipto Chaki,Md. Rajib Hossain,Md. Saifur Rahman,A. B. M. Shawkat Ali

Main category: cs.CL

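TL;DR: The paper proposes CrosGrpsABS, a hybrid framework that combines syntactic and semantic graphs through bidirectional cross-attention to improve aspect-based sentiment analysis for low-resource languages such as Bengali.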

Details

Motivation: Existing research mainly targets resource-rich languages (e.g., English), while low-resource languages (e.g., Bengali) lack annotated data, pre-trained models, and tuned hyperparameters, which makes sentiment analysis hard to carry out. Method: CrosGrpsABS combines rule-based syntactic dependency parsing with semantic similarity computation and uses bidirectional cross-attention to fuse local syntactic structure with global semantic context. Result: On both Bengali and English datasets, CrosGrpsABS outperforms existing methods and raises F1 scores. Conclusion: CrosGrpsABS offers an effective solution for aspect-based sentiment analysis in low-resource languages and shows cross-lingual applicability. Abstract: Aspect-Based Sentiment Analysis (ABSA) is a fundamental task in natural language processing, offering fine-grained insights into opinions expressed in text. Existing research has largely focused on resource-rich languages like English, leveraging large annotated datasets, pre-trained models, and language-specific tools; these resources are often unavailable for low-resource languages such as Bengali. The ABSA task in Bengali remains poorly explored and is further complicated by its unique linguistic characteristics and a lack of annotated data, pre-trained models, and optimized hyperparameters. To address these challenges, this research proposes CrosGrpsABS, a novel hybrid framework that leverages bidirectional cross-attention between syntactic and semantic graphs to enhance aspect-level sentiment classification. CrosGrpsABS combines transformer-based contextual embeddings with graph convolutional networks, built upon rule-based syntactic dependency parsing and semantic similarity computations. By employing bidirectional cross-attention, the model effectively fuses local syntactic structure with global semantic context, resulting in improved sentiment classification performance across both low- and high-resource settings. We evaluate CrosGrpsABS on four low-resource Bengali ABSA datasets and the high-resource English SemEval 2014 Task 4 dataset. CrosGrpsABS consistently outperforms existing approaches, achieving notable improvements, including a 0.93% F1-score increase for the Restaurant domain and a 1.06% gain for the Laptop domain in the SemEval 2014 Task 4 benchmark.

[361] Efficient Data Selection at Scale via Influence Distillation

Mahdi Nikdan,Vincent Cohen-Addad,Dan Alistarh,Vahab Mirrokni

Main category: cs.CL

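TL;DR: This paper proposes Influence Distillation, a data selection framework that uses second-order information to optimize training-sample weights and markedly improves the efficiency of LLM fine-tuning.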

Details

Motivation: Efficient training of modern large language models (LLMs) depends on effective data selection, but existing methods lack mathematical justification or are computationally inefficient. Method: Use second-order information to compute each sample's influence on a target distribution, and propose a landmark-based approximation to reduce computational cost. Result: Experiments on the Tulu V2 dataset show that the method matches or outperforms state-of-the-art approaches while selecting data up to 3.5x faster. Conclusion: Influence Distillation offers an efficient and mathematically justified solution for LLM data selection. Abstract: Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically-justified framework for data selection that employs second-order information to optimally weight training samples. By distilling each sample's influence on a target distribution, our method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding it toward strong performance on the target domain. We derive these optimal weights for both Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, we propose a $\textit{landmark-based approximation}$: influence is precisely computed for a small subset of "landmark" samples and then efficiently propagated to all other samples to determine their weights. We validate Influence Distillation by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to $3.5\times$ faster selection.

[362] An Embarrassingly Simple Defense Against LLM Abliteration Attacks

Harethah Abu Shairah,Hasan Abed Al Kader Hammoud,Bernard Ghanem,George Turkiyyah

Main category: cs.CL

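TL;DR: The paper proposes a defense against abliteration attacks on large language models: build an extended-refusal dataset and fine-tune on it, preserving both safety and performance.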

Details

Motivation: Large language models usually comply with safety guidelines by refusing harmful instructions, but abliteration attacks can suppress refusal behavior and lead the model to generate unethical content, so a defense is needed. Method: Build an extended-refusal dataset containing harmful prompts with full justifications for refusal, and fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct on it. Result: Under abliteration, the refusal rates of extended-refusal models drop by at most 10%, whereas those of baseline models drop by 70-80%, and general performance is preserved. Conclusion: Extended-refusal fine-tuning effectively defends against abliteration attacks while keeping the model safe and useful. Abstract: Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that contains harmful prompts with a full response that justifies the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset, and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping at most by 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
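
Illustrative sketch (not from the paper): the abstract describes abliteration as suppressing a single latent direction. One common way to picture that operation is an orthogonal projection that removes the component of a hidden-state vector along that direction. The code below assumes this reading; `ablate_direction` and its arguments are hypothetical names, and real attacks operate on model activations or weights rather than plain lists.

```python
import math


def ablate_direction(hidden, direction):
    """Remove from `hidden` its component along `direction` (orthogonal projection)."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be a non-zero vector")
    unit = [d / norm for d in direction]
    # Scalar projection of the hidden state onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]
```

After this projection the result has zero dot product with the ablated direction, which is why a refusal signal concentrated in one direction is easy to erase. The defense summarized above is reported to make refusals robust to this kind of single-direction removal.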

[363] UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models

Roman Vashurin,Maiya Goloburda,Preslav Nakov,Maxim Panov

Main category: cs.CL

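TL;DR: The paper proposes UNCERTAINTY-LINE, a post-hoc method that removes length bias from uncertainty quantification (UQ) of large language model outputs. It corrects uncertainty scores through regression and is validated on multiple tasks.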

Details

Motivation: Existing uncertainty quantification methods for large language models are biased by output length, and the bias persists even after length normalization. Method: UNCERTAINTY-LINE regresses uncertainty scores on output length and uses the residuals as corrected, length-invariant estimates. Result: On machine translation, summarization, and question-answering tasks, UNCERTAINTY-LINE consistently improves over other length-normalized methods. Conclusion: UNCERTAINTY-LINE is a simple, model-agnostic post-hoc method that removes length bias and improves the accuracy of uncertainty quantification. Abstract: Large Language Models (LLMs) have become indispensable tools across various applications, making it more important than ever to ensure the quality and the trustworthiness of their outputs. This has led to growing interest in uncertainty quantification (UQ) methods for assessing the reliability of LLM outputs. Many existing UQ techniques rely on token probabilities, which inadvertently introduces a bias with respect to the length of the output. While some methods attempt to account for this, we demonstrate that such biases persist even in length-normalized approaches. To address the problem, here we propose UNCERTAINTY-LINE (Length-INvariant Estimation), a simple debiasing procedure that regresses uncertainty scores on output length and uses the residuals as corrected, length-invariant estimates. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures. Through extensive evaluation on machine translation, summarization, and question-answering tasks, we demonstrate that UNCERTAINTY-LINE consistently improves uncertainty estimates over even nominally length-normalized UQ methods across multiple metrics and models.
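
Illustrative sketch (not from the paper): the debiasing step described above, regress uncertainty scores on output length and keep the residuals, can be written in a few lines. The code assumes an ordinary least-squares fit with a single linear length term; the abstract does not specify the regression form, and the function name `length_debiased_scores` is hypothetical.

```python
def length_debiased_scores(scores, lengths):
    """Return residuals of an OLS fit of uncertainty scores on output lengths."""
    n = len(scores)
    if n == 0 or n != len(lengths):
        raise ValueError("scores and lengths must be non-empty and equally long")
    mean_len = sum(lengths) / n
    mean_score = sum(scores) / n
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    if var_len == 0.0:
        # All outputs have the same length: only the mean can be removed.
        return [s - mean_score for s in scores]
    cov = sum((l - mean_len) * (s - mean_score) for l, s in zip(lengths, scores))
    slope = cov / var_len
    intercept = mean_score - slope * mean_len
    # The residual is the part of each score that length does not explain.
    return [s - (intercept + slope * l) for s, l in zip(scores, lengths)]
```

By construction, the residuals of a least-squares fit are uncorrelated with the regressor, so under this assumption the corrected scores carry no linear dependence on length.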

[364] Towards Harmonized Uncertainty Estimation for Large Language Models

Rui Li,Jing Long,Muge Qi,Heming Xia,Lei Sha,Peiyi Wang,Zhifang Sui

Main category: cs.CL

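TL;DR: This paper proposes CUE, a lightweight method for adjusting the uncertainty scores of content generated by large language models (LLMs), addressing the shortcomings of existing methods in indication, balance, and calibration.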

Details

Motivation: Reliable deployment of LLMs requires quantifying the uncertainty of their generations. Existing methods fall short in indication, balance, and calibration and need improvement. Method: CUE trains a lightweight model on data reflecting the target LLM's performance and uses it to adjust uncertainty scores. Result: Experiments show that CUE performs well across models and tasks, improving on existing methods by up to 60%. Conclusion: CUE is a simple and effective method that markedly improves the accuracy of uncertainty estimation for LLMs. Abstract: To facilitate robust and trustworthy deployment of large language models (LLMs), it is essential to quantify the reliability of their generations through uncertainty estimation. While recent efforts have made significant advancements by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis highlights the pitfalls of these methods to strike a harmonized estimation between indication, balance, and calibration, which hinders their broader capability for accurate uncertainty estimation. To address this challenge, we propose CUE (Corrector for Uncertainty Estimation), a straightforward yet effective method that employs a lightweight model trained on data aligned with the target LLM's performance to adjust uncertainty scores. Comprehensive experiments across diverse models and tasks demonstrate its effectiveness, with consistent improvements of up to 60% over existing methods.

[365] ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models

Benjamin Clavié,Florian Brand

Main category: cs.CL

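TL;DR: ReadBench is a new multimodal benchmark that evaluates the reading comprehension of large vision-language models (VLMs) on text-rich images and finds that performance drops sharply on long and multi-page inputs.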

Details

Motivation: Existing benchmarks focus on visual comprehension and rarely assess how well VLMs read and reason about text-rich images. Method: ReadBench converts the contexts of text-only benchmarks into images of text while keeping the textual prompts and questions unchanged. Result: VLM performance degrades slightly on short text-image inputs and sharply on long, multi-page contexts; text resolution has negligible effect on performance. Conclusion: VLMs need to improve their reasoning over extensive, visually presented text, which is critical for practical applications. Abstract: Recent advancements in Large Vision-Language Models (VLMs) have greatly enhanced their capability to jointly process text and images. However, despite extensive benchmarks evaluating visual comprehension (e.g., diagrams, color schemes, OCR tasks), there is limited assessment of VLMs' ability to read and reason about text-rich images effectively. To fill this gap, we introduce ReadBench, a multimodal benchmark specifically designed to evaluate the reading comprehension capabilities of VLMs. ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. Evaluating leading VLMs with ReadBench, we find minimal-but-present performance degradation on short, text-image inputs, while performance sharply declines for longer, multi-page contexts. Our experiments further reveal that text resolution has negligible effects on multimodal performance. These findings highlight needed improvements in VLMs, particularly their reasoning over visually presented extensive textual content, a capability critical for practical applications. ReadBench is available at https://github.com/answerdotai/ReadBench .

[366] ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning

Yeyuan Wang,Dehong Gao,Rujiao Long,Lei Yi,Linbo Jin,Libin Yang,Xiaoyan Cai

Main category: cs.CL

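TL;DR: ASPO uses sentence-level preference optimization to address the coarse granularity of conventional DPO and improve multimodal model performance.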

Details

Motivation: Conventional DPO optimizes only on binary preferences and ignores fine-grained supervision, which leads to suboptimal solutions. Method: ASPO dynamically computes adaptive rewards at the sentence level, without additional models or parameters. Result: Experiments show that ASPO substantially improves multimodal model performance. Conclusion: ASPO improves multimodal alignment through fine-grained optimization. Abstract: Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong performance. However, traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segment correctness, leading to suboptimal solutions. The root of this issue lies in the absence of fine-grained supervision during the optimization process. To address this, we propose Adaptive Sentence-level Preference Optimization (ASPO), which evaluates individual sentences for more precise preference optimization. By dynamically calculating adaptive rewards at the sentence level based on model predictions, ASPO enhances response content assessment without additional models or parameters. This significantly improves the alignment of multimodal features. Extensive experiments show that ASPO substantially enhances the overall performance of multimodal models.

[367] WHISTRESS: Enriching Transcriptions with Sentence Stress Detection

Iddo Yosha,Dorin Shteyman,Yossi Adi

Main category: cs.CL

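TL;DR: WHISTRESS is an alignment-free method that adds sentence stress detection to transcription systems; it is trained on the synthetic TINYSTRESS-15K dataset and performs strongly on multiple benchmarks.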

Details

Motivation: Stress in spoken language is crucial for conveying speaker intent, yet existing transcription systems cannot detect sentence stress. Method: The paper proposes WHISTRESS, trained on the synthetic TINYSTRESS-15K dataset and requiring no additional input priors. Result: WHISTRESS outperforms existing methods on multiple benchmarks and shows zero-shot generalization. Conclusion: WHISTRESS performs well on sentence stress detection, and training on synthetic data does not hurt its generalization. Abstract: Spoken language conveys meaning not only through words but also through intonation, emotion, and emphasis. Sentence stress, the emphasis placed on specific words within a sentence, is crucial for conveying speaker intent and has been extensively studied in linguistics. In this work, we introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection. To support this task, we propose TINYSTRESS-15K, a scalable, synthetic training dataset for sentence stress detection, produced by a fully automated dataset creation process. We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines. Our results show that WHISTRESS outperforms existing methods while requiring no additional input priors during training or inference. Notably, despite being trained on synthetic data, WHISTRESS demonstrates strong zero-shot generalization across diverse benchmarks. Project page: https://pages.cs.huji.ac.il/adiyoss-lab/whistress.

Yongheng Zhang,Xu Liu,Ruoxi Zhou,Qiguang Chen,Hao Fei,Wenpeng Lu,Libo Qin

Main category: cs.CL

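TL;DR: The paper proposes CCHall, a joint cross-lingual and cross-modal hallucination benchmark that fills a gap in existing research, and evaluates mainstream LLMs on it.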

Details

Motivation: Existing studies cover only a single scenario (cross-lingual or cross-modal) and do not explore LLM hallucinations in the joint setting. Method: The paper introduces the CCHall benchmark, which covers cross-lingual and cross-modal hallucination scenarios together, and comprehensively evaluates mainstream open-source and closed-source LLMs. Result: Experiments show that current LLMs perform poorly on CCHall. Conclusion: CCHall can serve as a valuable resource for assessing LLMs in joint cross-lingual and cross-modal scenarios. Abstract: Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.

[369] Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering

Zheng Chu,Huiming Fan,Jingchang Chen,Qianyu Wang,Mingda Yang,Jiafeng Liang,Zhongjie Wang,Hao Li,Guo Tang,Ming Liu,Bing Qin

Main category: cs.CL

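TL;DR: SiGIR uses self-critique feedback to guide iterative reasoning, addressing LLMs' weaknesses in multi-hop reasoning and improving on the previous state of the art by 8.6%.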

Details

Motivation: Large language models lack intermediate guidance in multi-hop reasoning, which leads to inaccurate retrieval and reasoning. Method: The paper proposes SiGIR, which uses end-to-end training to enable question decomposition and self-evaluation, and uses the latter to guide the selection of reasoning trajectories. Result: SiGIR performs strongly on three multi-hop reasoning datasets, surpassing the previous SOTA by 8.6%. Conclusion: SiGIR effectively improves multi-hop reasoning and offers insights for future research. Abstract: Although large language models (LLMs) have demonstrated remarkable reasoning capabilities, they still face challenges in knowledge-intensive multi-hop reasoning. Recent work explores iterative retrieval to address complex problems. However, the lack of intermediate guidance often results in inaccurate retrieval and flawed intermediate reasoning, leading to incorrect reasoning. To address these issues, we propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process. Specifically, through end-to-end training, we enable the model to iteratively address complex problems via question decomposition. Additionally, the model is able to self-evaluate its intermediate reasoning steps. During iterative reasoning, the model engages in branching exploration and employs self-evaluation to guide the selection of promising reasoning trajectories. Extensive experiments on three multi-hop reasoning datasets demonstrate the effectiveness of our proposed method, surpassing the previous SOTA by $8.6\%$. Furthermore, our thorough analysis offers insights for future research. Our code, data, and models are available on GitHub: https://github.com/zchuz/SiGIR-MHQA.

[370] Controlling Language Confusion in Multilingual LLMs

Nahyun Lee,Yeongseo Woo,Hyunwoo Ko,Guijin Son

Main category: cs.CL

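TL;DR: The paper studies language confusion in large language models and shows that ORPO, which adds a penalty term to standard fine-tuning, suppresses language confusion without degrading model performance.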

Details

Motivation: Large language models are prone to language confusion in low-resource settings, which hurts user experience, and conventional supervised fine-tuning does not solve the problem. Method: By tracking loss trajectories during the pretraining phase, the authors find that models struggle to distinguish monolingual from language-confused text; they then apply ORPO, which adds a penalty for unwanted output styles to standard SFT. Result: ORPO effectively suppresses language-confused generations, even at high decoding temperatures, without degrading overall model performance. Conclusion: In low-resource settings, adding appropriate penalty terms can mitigate language confusion. Abstract: Large language models often suffer from language confusion, a phenomenon where responses are partially or entirely generated in unintended languages. This can critically impact user experience in low-resource settings. We hypothesize that conventional supervised fine-tuning exacerbates this issue because the softmax objective focuses probability mass only on the single correct token but does not explicitly penalize cross-lingual mixing. Interestingly, by tracking loss trajectories during the pretraining phase, we observe that models fail to learn to distinguish between monolingual and language-confused text. Additionally, we find that ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppresses language-confused generations even at high decoding temperatures without degrading overall model performance. Our findings suggest that incorporating appropriate penalty terms can mitigate language confusion in low-resource settings with limited data.
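
Illustrative sketch (not from the paper): the ORPO objective is commonly written as the usual SFT negative log-likelihood plus a weighted odds-ratio penalty between a preferred and a dispreferred response. The code below assumes that form, with length-averaged log-probabilities as inputs. Treating the monolingual response as "chosen" and the language-confused one as "rejected" is an assumption about how the setting above would map onto it; the function names and the default weight `lam` are hypothetical.

```python
import math


def log_odds(avg_logprob):
    """log(p / (1 - p)) for p = exp(avg_logprob); requires avg_logprob < 0."""
    if avg_logprob >= 0.0:
        raise ValueError("average log-probability must be negative")
    return avg_logprob - math.log1p(-math.exp(avg_logprob))


def orpo_loss(chosen_avg_logprob, rejected_avg_logprob, lam=0.1):
    """SFT loss on the chosen response plus an odds-ratio penalty on the rejected one."""
    sft_loss = -chosen_avg_logprob
    log_odds_ratio = log_odds(chosen_avg_logprob) - log_odds(rejected_avg_logprob)
    # -log(sigmoid(x)) == log(1 + exp(-x)); large when the rejected response is likelier.
    penalty = math.log1p(math.exp(-log_odds_ratio))
    return sft_loss + lam * penalty
```

With `lam=0` this reduces to plain SFT. The penalty grows as the model assigns more probability to the rejected (here, language-confused) response, which is the explicit signal the abstract says standard SFT lacks.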

[371] Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models

Seunguk Yu,Juhwan Choi,Youngbin Kim

Main category: cs.CL

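TL;DR: The study finds cross-language ethical biases in large language models and introduces the Multilingual Sensitive Questions & Answers Dataset (MSQAD) to test for them; the results show that such biases are widespread.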

Details

Motivation: Validate the ethical biases of large language models on sensitive topics, under the hypothesis that these biases may stem from language differences. Method: Using the MSQAD dataset (built from Human Rights Watch news articles), generate socially sensitive questions and responses in multiple languages, and analyze the biases with statistical hypothesis tests. Result: The null hypotheses are rejected in most cases, indicating biases arising from cross-language differences, and similar issues appear across different LLMs. Conclusion: Ethical biases are widespread across languages, and releasing MSQAD will support future research. Abstract: Despite the recent strides in large language models, studies have underscored the existence of social biases within these systems. In this paper, we delve into the validation and comparison of the ethical biases of LLMs concerning globally discussed and potentially sensitive topics, hypothesizing that these biases may arise from language-specific distinctions. Introducing the Multilingual Sensitive Questions & Answers Dataset (MSQAD), we collected news articles from Human Rights Watch covering 17 topics, and generated socially sensitive questions along with corresponding responses in multiple languages. We scrutinized the biases of these responses across languages and topics, employing two statistical hypothesis tests. The results showed that the null hypotheses were rejected in most cases, indicating biases arising from cross-language differences. This demonstrates that ethical biases in responses are widespread across various languages, and notably, these biases were prevalent even among different LLMs. By making the proposed MSQAD openly available, we aim to facilitate future research endeavors focused on examining cross-language biases in LLMs and their variant models.

[372] MMATH: A Multilingual Benchmark for Mathematical Reasoning

Wenyang Luo,Wayne Xin Zhao,Jing Sha,Shijin Wang,Ji-Rong Wen

Main category: cs.CL

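TL;DR: The paper introduces MMATH, a benchmark for evaluating multilingual complex reasoning, finds performance disparities and off-target language issues in existing models, and proposes strategies to address them.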

Details

Motivation: The multilingual complex reasoning abilities of existing large reasoning models remain underexplored, and this gap needs to be filled. Method: The paper introduces the MMATH benchmark, covering 374 high-quality math problems in 10 languages, and explores prompting and training strategies to improve model behavior. Result: Models show cross-language performance disparities and off-target language issues; reasoning in English and answering in the target language markedly improves performance. Conclusion: The study offers new insights and practical strategies for improving the multilingual reasoning of large language models. Abstract: The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data can be found at https://github.com/RUCAIBox/MMATH.

[373] RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models

Jin Zhang,Fan Gao,Linyu Li,Yongbin Yu,Xiangxiang Wang,Nyima Tashi,Gadeng Luosang

Main category: cs.CL

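TL;DR: The paper proposes RetrieveAll, a multilingual NER framework based on dynamic LoRA that addresses language interference during multilingual adaptation and improves performance through cross-granularity knowledge augmentation.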

Details

Motivation: NER performance has improved markedly for high-resource languages, but low- and medium-resource languages still have room for improvement, and existing methods suffer from language interference. Method: The paper proposes the RetrieveAll framework, which uses dynamic LoRA to decouple language features and combines cross-granularity knowledge augmentation with a hierarchical prompting mechanism. Result: On the PAN-X dataset, average F1 improves by 12.1%, outperforming existing baselines. Conclusion: RetrieveAll effectively addresses interference in multilingual NER and markedly improves performance. Abstract: The rise of large language models has led to significant performance breakthroughs in named entity recognition (NER) for high-resource languages, yet there remains substantial room for improvement in low- and medium-resource languages. Existing multilingual NER methods face severe language interference during the multi-language adaptation process, manifested in feature conflicts between different languages and the competitive suppression of low-resource language features by high-resource languages. Although training a dedicated model for each language can mitigate such interference, it lacks scalability and incurs excessive computational costs in real-world applications. To address this issue, we propose RetrieveAll, a universal multilingual NER framework based on dynamic LoRA. The framework decouples task-specific features across languages and demonstrates efficient dynamic adaptability. Furthermore, we introduce a cross-granularity knowledge-augmented method that fully exploits the intrinsic potential of the data without relying on external resources. By leveraging a hierarchical prompting mechanism to guide knowledge injection, this approach advances the paradigm from "prompt-guided inference" to "prompt-driven learning." Experimental results show that RetrieveAll outperforms existing baselines; on the PAN-X dataset, it achieves an average F1 improvement of 12.1 percent.

[374] Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Xuyang Liu,Zichen Wen,Shaobo Wang,Junjie Chen,Zhishan Tao,Yubo Wang,Xiangqi Jin,Chang Zou,Yiyu Wang,Chenfei Liao,Xu Zheng,Honggang Chen,Weijia Li,Xuming Hu,Conghui He,Linfeng Zhang

Main category: cs.CL

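TL;DR: The paper argues for shifting from model-centric to data-centric compression, in particular improving AI efficiency by reducing the number of tokens during training or inference.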

Details

Motivation: As large language models and multimodal models advance, hardware limits on model size have shifted the computational bottleneck to the cost of self-attention over long token sequences, which calls for new ways to improve efficiency. Method: Analyze existing model efficiency strategies within a unified mathematical framework and systematically review the state of token compression research and its advantages. Result: Token compression is argued to be a key paradigm shift for addressing long-context overhead, with broad application potential. Conclusion: The paper offers a new perspective on AI efficiency, synthesizes existing research, and outlines future directions for handling the challenges of long contexts. Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency by reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.
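
Illustrative sketch (not from the paper): because self-attention cost grows quadratically with sequence length, keeping a fraction r of the tokens cuts that cost by roughly 1/r^2. The code below works through the arithmetic under a rough cost model of 2·n²·d multiply-adds per attention layer; the cost model, function names, and parameters are assumptions for illustration.

```python
def attention_cost(num_tokens, hidden_dim):
    """Rough multiply-add count for one self-attention layer (QK^T plus weighting V)."""
    return 2 * num_tokens ** 2 * hidden_dim


def compression_speedup(num_tokens, keep_ratio, hidden_dim=1):
    """Ratio of attention cost before vs. after keeping `keep_ratio` of the tokens."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    kept = max(1, round(num_tokens * keep_ratio))
    return attention_cost(num_tokens, hidden_dim) / attention_cost(kept, hidden_dim)
```

For example, under this model halving a 1,000-token sequence cuts attention cost by a factor of 4, and keeping a quarter of the tokens cuts it by a factor of 16. Shrinking the hidden dimension, by contrast, reduces this term only linearly.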

[375] SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs

Firoj Alam,Md Arid Hasan,Shammur Absar Chowdhury

Main category: cs.CL

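TL;DR: The paper introduces SpokenNativQA, the first multilingual spoken question-answering dataset, for evaluating large language models in real-world conversational settings.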

Details

Motivation: Existing text-based QA datasets do not cover speech variability, accents, and linguistic diversity, which limits the evaluation of LLMs in spoken interaction. Method: Build a dataset of about 33,000 multilingual spoken questions and answers, covering low-resource languages and dialects, and compare different ASR systems and LLMs. Result: The dataset provides a benchmark for evaluating LLM performance in speech-based interaction; the data and experimental scripts are publicly available. Conclusion: SpokenNativQA fills a gap in spoken QA evaluation and is an important resource for multilingual LLM research. Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various disciplines and tasks. However, benchmarking their capabilities with multilingual spoken queries remains largely unexplored. In this study, we introduce SpokenNativQA, the first multilingual and culturally aligned spoken question-answering (SQA) dataset designed to evaluate LLMs in real-world conversational settings. The dataset comprises approximately 33,000 naturally spoken questions and answers in multiple languages, including low-resource and dialect-rich languages, providing a robust benchmark for assessing LLM performance in speech-based interactions. SpokenNativQA addresses the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. We benchmark different ASR systems and LLMs for SQA and present our findings. We released the data at (https://huggingface.co/datasets/QCRI/SpokenNativQA) and the experimental scripts at (https://llmebench.qcri.org/) for the research community.

[376] Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge

Zhuo Liu,Moxin Li,Xun Deng,Qifan Wang,Fuli Feng

Main category: cs.CL

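TL;DR: Proposes AGDe-Judge, a framework that introduces an unbiased assistant model and a three-stage debiasing method to address teacher preference bias in LLM-based evaluation.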

Details

Motivation: LLM-based evaluation is popular for its cost-effectiveness and strong alignment with human evaluation, but training proxy judge models suffers from teacher preference bias. Method: Proposes a new setting that includes an assistant model and designs AGDe-Judge, a three-stage framework that debiases both the labels and the feedback in the training data. Result: Experiments show that AGDe-Judge effectively reduces teacher preference bias while performing strongly on six evaluation benchmarks. Conclusion: AGDe-Judge offers an effective solution to teacher preference bias in LLM evaluation. Abstract: LLM-as-a-Judge employs large language models (LLMs), such as GPT-4, to evaluate the quality of LLM-generated responses, gaining popularity for its cost-effectiveness and strong alignment with human evaluations. However, training proxy judge models using evaluation data generated by powerful teacher models introduces a critical yet previously overlooked issue: teacher preference bias, where the proxy judge model learns a biased preference for responses from the teacher model. To tackle this problem, we propose a novel setting that incorporates an additional assistant model, which is not biased toward the teacher model's responses, to complement the training data. Building on this setup, we introduce AGDe-Judge, a three-stage framework designed to debias from both the labels and the feedback in the training data. Extensive experiments demonstrate that AGDe-Judge effectively reduces teacher preference bias while maintaining strong performance across six evaluation benchmarks. Code is available at https://github.com/Liuz233/AGDe-Judge.

[377] Two LLMs debate, both are certain they've won

Minh Nhat Nguyen,Pradyumna Shyama Prasad

Main category: cs.CL

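TL;DR: Finds that large language models (LLMs) in dynamic adversarial debates show systematic overconfidence, escalating confidence, and mutual overestimation, indicating that they cannot accurately self-assess or update their beliefs.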

Details

Motivation: Examines whether LLMs can accurately adjust their confidence in dynamic adversarial settings, particularly under multi-turn debate and a zero-sum structure. Method: Organizes 60 three-round policy debates among ten state-of-the-art LLMs, with each model privately rating its confidence in winning after each round. Result: Five problematic patterns are observed, including systematic overconfidence, confidence escalation, and mutual overestimation. Conclusion: LLMs lack the ability to self-assess accurately in dynamic multi-turn tasks, a major concern for their use in assistant or agentic roles. Abstract: Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLM outputs are deployed without careful review in assistant roles or agentic settings.
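The zero-sum argument in the abstract can be made concrete: in a debate with exactly one winner, the two sides' win probabilities should sum to 100, so any excess is overconfidence shared between them. The sketch below is an illustration only. The function names and the confidence pairs are hypothetical and are not data from the paper; only the 75 threshold comes from the abstract.

```python
from typing import List, Tuple


def mutual_overestimation_rate(
    confidences: List[Tuple[float, float]], threshold: float = 75.0
) -> float:
    """Fraction of debates in which both sides report at least `threshold` confidence."""
    both_high = sum(1 for a, b in confidences if a >= threshold and b >= threshold)
    return both_high / len(confidences)


def mean_excess_confidence(confidences: List[Tuple[float, float]]) -> float:
    """Average amount by which the two win probabilities exceed the coherent total of 100."""
    return sum(a + b - 100.0 for a, b in confidences) / len(confidences)


# Hypothetical (side A, side B) final-round confidences on a 0-100 scale.
HYPOTHETICAL_DEBATES = [(80.0, 85.0), (70.0, 60.0), (90.0, 75.0), (55.0, 45.0)]

if __name__ == "__main__":
    print(mutual_overestimation_rate(HYPOTHETICAL_DEBATES))
    print(mean_excess_confidence(HYPOTHETICAL_DEBATES))
```

A pair such as (55, 45) is coherent because it sums to 100, while (80, 85) cannot both be right in a zero-sum setting.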

Yang Xiao,Jiashuo Wang,Ruifeng Yuan,Chunpu Xu,Kaishuai Xu,Wenjie Li,Pengfei Liu

Main category: cs.CL

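TL;DR: Proposes the PIR framework, which quantifies the importance of reasoning steps and selectively prunes low-importance functional steps to optimize training data, improving both reasoning efficiency and accuracy.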

Details

Motivation: Reasoning chains of existing large language models contain redundant functional steps that add computational cost and need to be optimized to improve efficiency and performance. Method: Introduces the PIR framework, which scores the importance of reasoning steps using perplexity and prunes low-importance functional steps while preserving the core reasoning path. Result: Models fine-tuned on PIR-optimized data improve accuracy (+0.9% to +6.6%) on several reasoning benchmarks while using fewer tokens (-3% to -41%). Conclusion: PIR generalizes well and is practical for efficient reasoning deployment across model sizes and compute budgets. Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.
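One plausible reading of scoring a step "based on its impact on answer prediction confidence" is a leave-one-out comparison of answer perplexity. The sketch below follows that reading and is not the authors' implementation: the ratio-based score, the function names, and the toy scorer that stands in for a language model are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a token sequence from its per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def step_importance(
    steps: List[str],
    answer_logprobs_fn: Callable[[List[str]], Sequence[float]],
) -> List[float]:
    """Assumed score: answer perplexity without a step divided by perplexity with all steps.

    Values above 1 mean that removing the step makes the answer harder to predict.
    """
    base = perplexity(answer_logprobs_fn(steps))
    scores = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]
        scores.append(perplexity(answer_logprobs_fn(ablated)) / base)
    return scores


def toy_answer_logprobs(steps: List[str]) -> List[float]:
    """Hypothetical stand-in for a model: the answer is likely only if 'derive' is present."""
    return [-0.1, -0.1, -0.1] if "derive" in steps else [-2.0, -2.0, -2.0]


if __name__ == "__main__":
    print(step_importance(["derive", "verify"], toy_answer_logprobs))
```

In this toy setup the "verify" step scores exactly 1 and would be a pruning candidate, while "derive" scores well above 1 and would be kept. A real pipeline would replace the toy scorer with log-probabilities from an actual model.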

[379] Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection

Nursulu Sagimbayeva,Ruveyda Betül Bahçeci,Ingmar Weber

Main category: cs.CL

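TL;DR: Proposes the task of automatically detecting inconsistencies between political statements and develops a scale of inconsistency types. Using a dataset of 698 human-annotated pairs of political statements, the authors benchmark large language models (LLMs) and find that they match humans at detecting inconsistencies but leave room for improvement on fine-grained type identification.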

Details

Motivation: Inconsistent political statements erode public trust and make accountability harder; detecting them automatically can help journalists ask questions that keep politicians accountable. Method: Proposes the inconsistency detection task, develops a scale of inconsistency types, builds a dataset of 698 human-annotated pairs of political statements, and benchmarks LLMs on it. Result: LLMs are on par with humans at detecting inconsistencies but do not reach the upper bound on fine-grained type identification, leaving room for improvement. Conclusion: The paper provides resources and direction for inconsistency detection in the political domain; LLMs perform well but need better fine-grained identification. Abstract: Inconsistent political statements represent a form of misinformation. They erode public trust and pose challenges to accountability when left unnoticed. Detecting inconsistencies automatically could support journalists in asking clarification questions, thereby helping to keep politicians accountable. We propose the Inconsistency detection task and develop a scale of inconsistency types to prompt NLP-research in this direction. To provide a resource for detecting inconsistencies in a political domain, we present a dataset of 698 human-annotated pairs of political statements with explanations of the annotators' reasoning for 237 samples. The statements mainly come from voting assistant platforms such as Wahl-O-Mat in Germany and Smartvote in Switzerland, reflecting real-world political issues. We benchmark Large Language Models (LLMs) on our dataset and show that in general, they are as good as humans at detecting inconsistencies, and might be even better than individual humans at predicting the crowd-annotated ground-truth. However, when it comes to identifying fine-grained inconsistency types, none of the models has reached the upper bound of performance (due to natural labeling variation), thus leaving room for improvement. We make our dataset and code publicly available.

[380] DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding

Yunhai Hu,Tianhua Xia,Zining Liu,Rahul Raman,Xingyu Liu,Bo Bao,Eric Sather,Vithursan Thangarasa,Sai Qian Zhang

Main category: cs.CL

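TL;DR: DREAM is a new speculative decoding framework for vision-language models (VLMs) that uses a cross-attention mechanism, adaptive feature selection, and visual token compression to substantially improve the efficiency and accuracy of multimodal decoding.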

Details

Motivation: Speculative decoding (SD) is effective at accelerating autoregressive generation in large language models (LLMs), but its application to vision-language models (VLMs) remains underexplored. Method: DREAM combines three techniques: a cross-attention mechanism, adaptive feature selection based on attention entropy, and visual token compression. Result: Experiments show up to 3.6x speedup across several VLMs, significantly outperforming conventional decoding and prior SD baselines. Conclusion: DREAM provides an efficient and accurate solution for multimodal decoding; the code is open source. Abstract: Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel speculative decoding framework tailored for VLMs that combines three key innovations: (1) a cross-attention-based mechanism to inject intermediate features from the target model into the draft model for improved alignment, (2) adaptive intermediate feature selection based on attention entropy to guide efficient draft model training, and (3) visual token compression to reduce draft model latency. DREAM enables efficient, accurate, and parallel multimodal decoding with significant throughput improvement. Experiments across a diverse set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM and Gemma3, demonstrate up to 3.6x speedup over conventional decoding and significantly outperform prior SD baselines in both inference throughput and speculative draft acceptance length across a broad range of multimodal benchmarks. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM.git
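The "speculative draft acceptance length" reported in the abstract can be illustrated with the greedy variant of speculative decoding, in which the target model keeps the longest prefix of draft tokens that matches its own greedy choices. The sketch below shows only that generic verification step, not DREAM itself; the function name and the token ids are hypothetical.

```python
from typing import Sequence


def accepted_prefix_length(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Number of leading draft tokens that agree with the target model's greedy tokens."""
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # the first mismatch ends the accepted run
        accepted += 1
    return accepted


if __name__ == "__main__":
    # Hypothetical token ids: the draft diverges from the target at position 2.
    print(accepted_prefix_length([5, 9, 2, 7], [5, 9, 4, 7]))
```

Longer accepted prefixes mean fewer target-model forward passes per generated token, which is the source of the speedup that speculative decoding aims for.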

[381] SpeakStream: Streaming Text-to-Speech with Interleaved Data

Richard He Bai,Zijin Gu,Tatiana Likhomanenko,Navdeep Jaitly

Main category: cs.CL

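TL;DR: SpeakStream is a streaming TTS system that generates audio incrementally, addressing the latency that conventional TTS systems introduce when consuming streaming LLM output.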

Details

Motivation: Conventional TTS systems introduce unacceptable latency when paired with streaming LLM output, hurting the responsiveness of conversational AI. Method: Uses a decoder-only architecture trained with a next-step prediction loss to generate audio incrementally. Result: SpeakStream achieves the lowest first-token latency while maintaining the quality of non-streaming TTS systems. Conclusion: SpeakStream is well suited to streaming conversational AI and significantly improves responsiveness. Abstract: The latency bottleneck of traditional text-to-speech (TTS) systems fundamentally hinders the potential of streaming large language models (LLMs) in conversational AI. These TTS systems, typically trained and run on complete utterances, introduce unacceptable delays, even with optimized inference speeds, when coupled with streaming LLM outputs. This is particularly problematic for creating responsive conversational agents where low first-token latency is critical. In this paper, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. SpeakStream is trained using a next-step prediction loss on interleaved text-speech data. During inference, it generates speech incrementally while absorbing streaming input text, making it particularly suitable for cascaded conversational AI agents where an LLM streams text to a TTS system. Our experiments demonstrate that SpeakStream achieves state-of-the-art latency results in terms of first-token latency while maintaining the quality of non-streaming TTS systems.

[382] MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search

Zonglin Yang,Wanhao Liu,Ben Gao,Yujie Liu,Wei Li,Tong Xie,Lidong Bing,Wanli Ouyang,Erik Cambria,Dongzhan Zhou

Main category: cs.CL

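TL;DR: Proposes the task of fine-grained scientific hypothesis discovery and uses combinatorial optimization with a hierarchical search method to have large language models (LLMs) generate detailed, experimentally verifiable hypotheses, outperforming baselines on a chemistry-literature benchmark.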

Details

Motivation: Hypotheses generated by existing methods usually lack detail and cannot be used directly for experimental validation, so a new method is needed that produces fine-grained, actionable hypotheses. Method: Frames the task as a combinatorial optimization problem, proposes a hierarchical search method that progressively refines general concepts into specific experimental configurations, and explores four foundational questions about making the most of LLMs. Result: On an expert-annotated benchmark drawn from chemistry literature, the method consistently outperforms baselines. Conclusion: Hierarchical search smooths the reward landscape and improves optimization, offering a practical approach to fine-grained scientific hypothesis generation. Abstract: Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.

Steffen Backmann,David Guzman Piedrahita,Emanuel Tewolde,Rada Mihalcea,Bernhard Schölkopf,Zhijing Jin

Main category: cs.CL

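TL;DR: Studies how large language models (LLMs) behave when moral norms conflict with rewards, using MoralSim to simulate the prisoner's dilemma and public goods game, and finds substantial differences in moral behavior across models and a lack of consistency.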

Details

Motivation: Examines how LLMs behave when moral norms directly conflict with rewards, to address ethical alignment in complex agentic roles. Method: Introduces the MoralSim simulation and tests a range of frontier LLMs in the prisoner's dilemma and public goods game under three moral framings. Result: Models differ substantially in their tendency to act morally and in the consistency of their behavior, and no model behaves morally in a consistent way across all settings. Conclusion: Caution is needed when deploying LLMs in agentic roles, where the agent's self-interest may conflict with ethical expectations. Abstract: Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs' moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial variation across models in both their general tendency to act morally and the consistency of their behavior across game types, the specific moral framing, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent's "self-interest" may conflict with ethical expectations. Our code is available at https://github.com/sbackmann/moralsim.

[384] The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training

Weize Chen,Jiarui Yuan,Tailin Jin,Ning Ding,Huimin Chen,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

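TL;DR: The DIET framework dynamically adjusts token compression strategies to optimize the trade-off between performance and efficiency, significantly reducing token counts while improving reasoning performance.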

Details

Motivation: Large language models (LLMs) often overthink and produce overly long responses, which hurts efficiency. DIET aims to optimize token usage by adapting dynamically to task difficulty. Method: DIET integrates difficulty awareness into the reinforcement learning (RL) process, dynamically adjusting token penalty strength and target length, and proposes an Advantage Weighting technique for stable optimization of these objectives. Result: Experiments show that DIET significantly reduces token counts while improving reasoning performance, with better inference scaling and a sensible allocation of response length. Conclusion: DIET provides a principled and effective framework for developing efficient, practical, high-performing LLMs. Abstract: Recent large language models (LLMs) exhibit impressive reasoning but often over-think, generating excessively long responses that hinder efficiency. We introduce DIET (DIfficulty-AwarE Training), a framework that systematically cuts these "token calories" by integrating on-the-fly problem difficulty into the reinforcement learning (RL) process. DIET dynamically adapts token compression strategies by modulating token penalty strength and conditioning target lengths on estimated task difficulty, to optimize the performance-efficiency trade-off. We also theoretically analyze the pitfalls of naive reward weighting in group-normalized RL algorithms like GRPO, and propose the Advantage Weighting technique, which enables stable and effective implementation of these difficulty-aware objectives. Experimental results demonstrate that DIET significantly reduces token counts while simultaneously improving reasoning performance. Beyond raw token reduction, we show two crucial benefits largely overlooked by prior work: (1) DIET leads to superior inference scaling. By maintaining high per-sample quality with fewer tokens, it enables better scaling performance via majority voting with more samples under fixed computational budgets, an area where other methods falter. (2) DIET enhances the natural positive correlation between response length and problem difficulty, ensuring verbosity is appropriately allocated, unlike many existing compression methods that disrupt this relationship. Our analyses provide a principled and effective framework for developing more efficient, practical, and high-performing LLMs.
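One way a weight applied to the reward can lose its effect under group normalization is sketched below. The abstract does not state which pitfalls the paper analyzes or how Advantage Weighting is defined, so both functions are assumptions: the first shows that when every sample in a group has the same outcome reward, the penalty weight cancels after normalization, and the second is a hypothetical variant that applies the weight after normalizing each component separately.

```python
from statistics import mean, pstdev
from typing import List


def group_normalize(rewards: List[float]) -> List[float]:
    """Group-normalized advantages: (r - mean) / std, or zeros if the group has no spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def naive_weighted_advantages(
    outcome: List[float], length_penalty: List[float], weight: float
) -> List[float]:
    """Weight the penalty inside the reward, then normalize the combined reward."""
    rewards = [o + weight * p for o, p in zip(outcome, length_penalty)]
    return group_normalize(rewards)


def separately_weighted_advantages(
    outcome: List[float], length_penalty: List[float], weight: float
) -> List[float]:
    """Hypothetical variant: normalize each component first, then apply the weight."""
    a_out = group_normalize(outcome)
    a_len = group_normalize(length_penalty)
    return [ao + weight * al for ao, al in zip(a_out, a_len)]


# Hypothetical group of four samples that all answered correctly but differ in length.
OUTCOME = [1.0, 1.0, 1.0, 1.0]
LENGTH_PENALTY = [-0.1, -0.2, -0.4, -0.8]

if __name__ == "__main__":
    print(naive_weighted_advantages(OUTCOME, LENGTH_PENALTY, 0.1))
    print(naive_weighted_advantages(OUTCOME, LENGTH_PENALTY, 1.0))
    print(separately_weighted_advantages(OUTCOME, LENGTH_PENALTY, 0.1))
```

In the naive version the two printed advantage lists coincide up to rounding, so the intended penalty strength has no effect for this group. In the second version the weight directly scales the length term.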

[385] Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Qian Cao,Xiting Wang,Yuzhuo Yuan,Yahui Liu,Fang Luo,Ruihua Song

Main category: cs.CL

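TL;DR: Proposes a pairwise-comparison framework for evaluating textual creativity that uses shared contextual instructions to improve consistency, and develops the CrEval evaluator, which outperforms existing methods.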

Details

Motivation: Current creativity evaluation relies on inefficient and costly human judgment, while automated methods lack generalizability or alignment with human judgment. Method: Proposes a pairwise-comparison framework, introduces the CreataSet dataset (with human and synthetic data), and trains an LLM-based evaluator, CrEval. Result: CrEval aligns with human judgments significantly better than existing methods, confirming the importance of combining human and synthetic data. Conclusion: CrEval can effectively boost LLM creativity; the data, code, and models will be released to support research. Abstract: Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.

[386] LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models

Aida Kostikova,Zhipin Wang,Deidamea Bajri,Ole Pütz,Benjamin Paaßen,Steffen Eger

Main category: cs.CL

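TL;DR: Using a data-driven, semi-automated approach, this survey examines research trends on the limitations of large language models (LLMs) from 2022 to 2024, finding that such research has grown rapidly and that reasoning is the most studied limitation.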

Details

Motivation: As LLM research grows rapidly, so does concern about limitations such as reasoning failures, hallucinations, and limited multilingual capability; this paper aims to quantify trends in that research. Method: Takes a bottom-up approach, identifying 14,648 relevant papers from a corpus of 250,000 ACL and arXiv papers through keyword filtering, LLM-based classification, expert validation, and topic clustering (HDBSCAN+BERTopic and LlooM). Result: LLM-related research grew more than fivefold in ACL and fourfold in arXiv; reasoning is the most studied limitation, followed by generalization, hallucination, bias, and security; arXiv topics shift toward safety and controllability. Conclusion: The paper offers a quantitative view of trends in LLM limitations research and releases a dataset of annotated abstracts and a validated methodology for future work. Abstract: Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations such as failures in reasoning, hallucinations, and limited multilingual capability. In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to 2024 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification, validated against expert labels, and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that LLM-related research increases over fivefold in ACL and fourfold in arXiv. Since 2022, LLLMs research grows even faster, reaching over 30% of LLM papers by late 2024. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2024. We release a dataset of annotated abstracts and a validated methodology, and offer a quantitative view of trends in LLM limitations research.

[387] PATS: Process-Level Adaptive Thinking Mode Switching

Yi Wang,Junxiao Liu,Shimao Zhang,Jiajun Chen,Shujian Huang

Main category: cs.CL

TL;DR: Proposes PATS, a new reasoning paradigm that dynamically adjusts the reasoning strategy to balance accuracy and computational efficiency.

Details

Motivation: Existing large language models use a fixed reasoning strategy, ignoring differences in the complexity of tasks and reasoning processes, which leads to an imbalance between performance and efficiency. Method: Combines Process Reward Models (PRMs) with Beam Search and introduces progressive mode switching and a bad-step penalty mechanism. Result: On diverse mathematical benchmarks, the method achieves high accuracy with moderate token usage. Conclusion: The study highlights the importance of process-level, difficulty-aware adaptation of the reasoning strategy and offers new insight into efficient LLM inference. Abstract: Current large-language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing methods attempt to implement training-free fast-slow thinking system switching to handle problems of varying difficulty, but are limited by coarse-grained solution-level strategy adjustments. To address this issue, we propose a novel reasoning paradigm: Process-Level Adaptive Thinking Mode Switching (PATS), which enables LLMs to dynamically adjust their reasoning strategy based on the difficulty of each step, optimizing the balance between accuracy and computational efficiency. Our approach integrates Process Reward Models (PRMs) with Beam Search, incorporating progressive mode switching and bad-step penalty mechanisms. Experiments on diverse mathematical benchmarks demonstrate that our methodology achieves high accuracy while maintaining moderate token usage. This study emphasizes the significance of process-level, difficulty-aware reasoning strategy adaptation, offering valuable insights into efficient inference for LLMs.

[388] Unveiling Dual Quality in Product Reviews: An NLP-Based Approach

Rafał Poświata,Marcin Michał Mirończuk,Sławomir Dadas,Małgorzata Grębowiec,Michał Perełkiewicz

Main category: cs.CL

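TL;DR: Explores how natural language processing (NLP) can detect the dual quality problem in products and describes the full process from dataset construction to experimental validation.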

Details

Motivation: Consumers often face the dual quality problem (the same product differing in quality between markets), and automated techniques are needed to identify and address it. Method: Builds a new dataset of 1,957 Polish-language reviews, 540 of which are labeled as highlighting dual quality issues. Experiments with several NLP approaches (SetFit, transformer-based encoders, and LLMs) and evaluates multilingual transfer. Result: Experiments assess the effectiveness of the different approaches, with error analysis and robustness checks. Multilingual transfer experiments show how the approach extends to other languages. Conclusion: The study offers a feasible approach to automated detection of dual quality issues and discusses deployment and practical applications. Abstract: Consumers often face inconsistent product quality, particularly when identical products vary between markets, a situation known as the dual quality problem. To identify and address this issue, automated techniques are needed. This paper explores how natural language processing (NLP) can aid in detecting such discrepancies and presents the full process of developing a solution. First, we describe in detail the creation of a new Polish-language dataset with 1,957 reviews, 540 highlighting dual quality issues. We then discuss experiments with various approaches like SetFit with sentence-transformers, transformer-based encoders, and LLMs, including error analysis and robustness verification. Additionally, we evaluate multilingual transfer using a subset of opinions in English, French, and German. The paper concludes with insights on deployment and practical applications.

[389] A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models

Utkarsh Sahu,Zhisheng Qi,Yongjia Lei,Ryan A. Rossi,Franck Dernoncourt,Nesreen K. Ahmed,Mahantesh M Halappanavar,Yao Ma,Yu Wang

Main category: cs.CL

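TL;DR: Studies the structural patterns of knowledge in large language models (LLMs), analyzing the distribution of their knowledge from a graph perspective and proposing a knowledge estimation method based on graph machine learning.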

Details

Motivation: Existing work focuses on knowledge access, editing, reasoning, and explainability in LLMs, but few studies address the structural patterns of their knowledge. Method: Quantifies LLM knowledge at the triplet and entity levels, analyzes its relation to graph structural properties such as node degree, and uses graph machine learning models to estimate entity knowledge. Result: Uncovers knowledge homophily, where topologically close entities have similar levels of knowledgeability, and shows that fine-tuning on selected triplets yields superior performance. Conclusion: Analyzing LLM knowledge structure from a graph perspective offers new ideas for knowledge evaluation and optimization. Abstract: Large language models have been extensively studied as neural knowledge bases for their knowledge access, editability, reasoning, and explainability. However, few works focus on the structural patterns of their knowledge. Motivated by this gap, we investigate these structural patterns from a graph perspective. We quantify the knowledge of LLMs at both the triplet and entity levels, and analyze how it relates to graph structural properties such as node degree. Furthermore, we uncover knowledge homophily, where topologically close entities exhibit similar levels of knowledgeability, which further motivates us to develop graph machine learning models to estimate entity knowledge based on its local neighbors. This model further enables valuable knowledge checking by selecting triplets less known to LLMs. Empirical results show that using selected triplets for fine-tuning leads to superior performance.

[390] 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Wang Yang,Hongye Jin,Shaochen Zhong,Song Jiang,Qifan Wang,Vipin Chaudhary,Xiaotian Han

Main category: cs.CL

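TL;DR: Proposes a length-controllable long-context benchmark and a new metric to address shortcomings of existing long-context evaluation benchmarks.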

Details

Motivation: Existing long-context benchmarks have two major problems: they cannot separate a model's long-context performance from its baseline ability, and their fixed input lengths limit their applicability. Method: Introduces a length-controllable long-context benchmark and a new metric that separates baseline knowledge from true long-context ability. Result: Experiments show that the approach evaluates the long-context ability of LLMs more effectively. Conclusion: The new benchmark and metric provide a clearer and more flexible tool for evaluating long-context ability. Abstract: Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.

[391] A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations

Lingjun Zhao,Hal Daumé III

Main category: cs.CL

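TL;DR: Proposes a measure of Prediction-EXplanation (PEX) consistency, finds that more than 62% of explanations generated by large language models lack consistency, and significantly improves it with direct preference optimization.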

Details

Motivation: Ensuring the faithfulness of free-text explanations is critical in AI decision-making, but generating and evaluating such explanations is challenging. Method: Extends the concept of weight of evidence to define a PEX consistency measure that quantifies how much an explanation supports or opposes a prediction. Result: More than 62% of LLM-generated explanations lack consistency; direct preference optimization improves consistency by 43.1% to 292.3% and explanation faithfulness by up to 9.7%. Conclusion: PEX consistency is an important aspect of explanation faithfulness, and optimizing it can improve the quality of generated explanations. Abstract: Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging to generate by language models and assess by humans. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a prediction, serving as an important aspect of explanation faithfulness. Our analysis reveals that more than 62% of explanations generated by large language models lack this consistency. We show that applying direct preference optimization improves the consistency of generated explanations across three model families, with improvement ranging from 43.1% to 292.3%. Furthermore, we demonstrate that optimizing this consistency measure can improve explanation faithfulness by up to 9.7%.
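The classical weight of evidence that the abstract builds on is the change in log-odds of a hypothesis once evidence is taken into account. The sketch below shows only that classical quantity, with an explanation treated as the evidence for a prediction. How the paper extends it into the PEX measure is not given in the abstract, and the function names and example probabilities are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def weight_of_evidence(prior: float, posterior: float) -> float:
    """Classical weight of evidence: posterior log-odds minus prior log-odds.

    Positive values mean the evidence supports the hypothesis; negative values oppose it.
    """
    return logit(posterior) - logit(prior)


if __name__ == "__main__":
    # Hypothetical probabilities of the prediction without and with the explanation.
    print(weight_of_evidence(prior=0.5, posterior=0.8))  # explanation supports the prediction
    print(weight_of_evidence(prior=0.5, posterior=0.2))  # explanation opposes the prediction
```

Under this reading, an explanation would count as consistent with a prediction when the value is positive, although the threshold and probability estimates used in the paper may differ.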

[392] SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking

Junnan Liu,Linhao Luo,Thuy-Trang Vu,Gholamreza Haffari

Main category: cs.CL

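TL;DR: The SituatedThinker framework strengthens the real-world reasoning of large language models (LLMs) by combining internal knowledge with external information.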

Details

Motivation: LLM reasoning is confined to the internal parametric space, so models cannot access real-time external information or understand the physical world. Method: Proposes SituatedThinker, which uses reinforcement learning to incentivize the model to combine internal knowledge with external interfaces for real-world reasoning. Result: Performance improves significantly on multi-hop question answering and mathematical reasoning, and the framework generalizes to unseen tasks such as KBQA and TableQA. Conclusion: Through real-world grounded reasoning, SituatedThinker extends the knowledge boundaries and reasoning ability of LLMs. Abstract: Recent advances in large language models (LLMs) demonstrate their impressive reasoning capabilities. However, the reasoning confined to internal parametric space limits LLMs' access to real-time information and understanding of the physical world. To overcome this constraint, we introduce SituatedThinker, a novel framework that enables LLMs to ground their reasoning in real-world contexts through situated thinking, which adaptively combines both internal knowledge and external information with predefined interfaces. By utilizing reinforcement learning, SituatedThinker incentivizes deliberate reasoning with the real world to acquire information and feedback, allowing LLMs to surpass their knowledge boundaries and enhance reasoning. Experimental results demonstrate significant performance improvements on multi-hop question-answering and mathematical reasoning benchmarks. Furthermore, SituatedThinker demonstrates strong performance on unseen tasks, such as KBQA, TableQA, and text-based games, showcasing the generalizable real-world grounded reasoning capability. Our code is available at https://github.com/jnanliu/SituatedThinker.

[393] PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims

Yongmin Yoo,Qiongkai Xu,Longbing Cao

Main category: cs.CL

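TL;DR: Proposes PatentScore, a framework for evaluating LLM-generated patent claims along legal, technical, and structural dimensions, which outperforms general-purpose NLG metrics.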

Details

Motivation: Existing NLG metrics are ill suited to the legal and structural characteristics of patent documents, so a dedicated way to evaluate the quality of LLM-generated patents is needed. Method: PatentScore comprises hierarchical decomposition, domain-specific validation patterns, and scoring across structural, semantic, and legal dimensions. Result: On 400 Claim 1s generated by GPT-4o-mini, the Pearson correlation with expert annotations reaches 0.819, outperforming existing metrics. Conclusion: PatentScore is robust and generalizes across multiple LLMs. Abstract: Natural language generation (NLG) metrics play a central role in evaluating generated texts, but are not well suited for the structural and legal characteristics of patent documents. Large language models (LLMs) offer strong potential in automating patent generation, yet research on evaluating LLM-generated patents remains limited, especially in evaluating the generation quality of patent claims, which are central to defining the scope of protection. Effective claim evaluation requires addressing legal validity, technical accuracy, and structural compliance. To address this gap, we introduce PatentScore, a multi-dimensional evaluation framework for assessing LLM-generated patent claims. PatentScore incorporates: (1) hierarchical decomposition for claim analysis; (2) domain-specific validation patterns based on legal and technical standards; and (3) scoring across structural, semantic, and legal dimensions. Unlike general-purpose NLG metrics, PatentScore reflects patent-specific constraints and document structures, enabling evaluation beyond surface similarity. We evaluate 400 GPT-4o-mini generated Claim 1s and report a Pearson correlation of $r = 0.819$ with expert annotations, outperforming existing NLG metrics. Furthermore, we conduct additional evaluations using open models such as Claude-3.5-Haiku and Gemini-1.5-flash, all of which show strong correlations with expert judgments, confirming the robustness and generalizability of our framework.

[394] GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance

Mohammad Mahdi Moradi,Sudhir Mudur

Main category: cs.CL

TL;DR: GC-KBVQA is a four-stage framework that lets LLMs perform zero-shot VQA without end-to-end multimodal training, significantly improving on existing KB-VQA methods.

Details

Motivation: The auxiliary text that existing KB-VQA methods provide may be irrelevant to the question or contain misleading information, which limits their potential. Method: Question-aware caption generation is combined with external knowledge to build informative prompts for the LLM. Result: GC-KBVQA handles a variety of VQA tasks without task-specific fine-tuning. Conclusion: GC-KBVQA reduces cost and deployment complexity and significantly outperforms existing KB-VQA methods. Abstract: Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image. Early methods relied on explicit knowledge bases to provide this auxiliary information. Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources. While KB-VQA methods have demonstrated promising results, their potential remains constrained as the auxiliary text provided may not be relevant to the question context, and may also include irrelevant information that could misguide the answer predictor. We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA), which enables LLMs to effectively perform zero-shot VQA tasks without the need for end-to-end multimodal training. Innovations include grounding question-aware caption generation to move beyond generic descriptions and have compact, yet detailed and context-rich information. This is combined with knowledge from external sources to create highly informative prompts for the LLM. GC-KBVQA can address a variety of VQA tasks, and does not require task-specific fine-tuning, thus reducing both costs and deployment complexity by leveraging general-purpose, pre-trained LLMs. Comparison with competing KB-VQA methods shows significantly improved performance. Our code will be made public.

Lin Tian,Marian-Andrei Rizoiu

Main category: cs.CL

TL;DR: The paper proposes a new joint treatment-outcome framework that uses sequential models to adapt to both policy timing and engagement effects, significantly improving causal influence prediction on social media.

Details

Motivation: Existing methods struggle to capture the causal mechanisms by which external temporal signals trigger engagement, so correlation must be separated from causation in social media. Method: Introduces a joint treatment-outcome framework grounded in causal inference that uses sequential models to estimate Average Treatment Effects (ATE) and addresses the challenge of external confounding signals. Result: On real-world datasets, the models outperform existing benchmarks by 15-22% in predicting engagement, and the causal effect measure aligns strongly with expert-based empirical influence. Conclusion: The new framework addresses the challenge of estimating causal influence on social media and offers a more reliable tool for analyzing information spread. Abstract: Understanding true influence in social media requires distinguishing correlation from causation--particularly when analyzing misinformation spread. While existing approaches focus on exposure metrics and network structures, they often fail to capture the causal mechanisms by which external temporal signals trigger engagement. We introduce a novel joint treatment-outcome framework that leverages existing sequential models to simultaneously adapt to both policy timing and engagement effects. Our approach adapts causal inference techniques from healthcare to estimate Average Treatment Effects (ATE) within the sequential nature of social media interactions, tackling challenges from external confounding signals. Through our experiments on real-world misinformation and disinformation datasets, we show that our models outperform existing benchmarks by 15--22% in predicting engagement across diverse counterfactual scenarios, including exposure adjustment, timing shifts, and varied intervention durations. Case studies on 492 social media users show our causal effect measure aligns strongly with the gold standard in influence estimation, the expert-based empirical influence.

[396] ChartLens: Fine-grained Visual Attribution in Charts

Manan Suri,Puneet Mathur,Nedim Lipka,Franck Dernoncourt,Ryan A. Rossi,Dinesh Manocha

Main category: cs.CL

TL;DR: The paper proposes ChartLens, a post-hoc visual attribution method that addresses hallucination in chart understanding by multimodal large language models, and validates it experimentally.

Details

Motivation: Multimodal large language models hallucinate on chart understanding tasks, producing text that can conflict with the visual data, so a method is needed to verify chart-associated responses. Method: Proposes the ChartLens algorithm, which uses segmentation-based techniques to identify chart objects and set-of-marks prompting with MLLMs for fine-grained visual attribution. Also builds the ChartVA-Eval benchmark. Result: Experiments show that ChartLens improves fine-grained attribution by 26-66%. Conclusion: ChartLens mitigates hallucination in chart understanding by multimodal large language models and offers a new approach to fine-grained visual attribution. Abstract: The growing capabilities of multimodal large language models (MLLMs) have advanced tasks like chart understanding. However, these models often suffer from hallucinations, where generated text sequences conflict with the provided visual data. To address this, we introduce Post-Hoc Visual Attribution for Charts, which identifies fine-grained chart elements that validate a given chart-associated response. We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects and employs set-of-marks prompting with MLLMs for fine-grained visual attribution. Additionally, we present ChartVA-Eval, a benchmark with synthetic and real-world charts from diverse domains like finance, policy, and economics, featuring fine-grained attribution annotations. Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.

[397] Belief Attribution as Mental Explanation: The Role of Accuracy, Informativity, and Causality

Lance Ying,Almog Hillel,Ryan Truong,Vikash K. Mansinghka,Joshua B. Tenenbaum,Tan Zhi-Xuan

Main category: cs.CL

TL;DR: The study examines which beliefs people tend to attribute to others, proposes a model of explanatory strength (accuracy, informativity, and causal relevance), and finds that causal relevance best predicts people's attribution choices.

Details

Motivation: To investigate which beliefs people tend to attribute when explaining others' behavior, in order to understand how human theory-of-mind operates. Method: Develops a computational model that quantifies the explanatory strength of a belief statement (accuracy, informativity, causal relevance) and runs an experiment in which participants rank statements describing an agent's beliefs. Result: Accuracy and informativity predict the rankings reasonably well when combined, but causal relevance is the single factor that best explains participants' responses. Conclusion: Causal relevance plays a key role when people selectively attribute beliefs, supporting the explanatory-strength model. Abstract: A key feature of human theory-of-mind is the ability to attribute beliefs to other agents as mentalistic explanations for their behavior. But given the wide variety of beliefs that agents may hold about the world and the rich language we can use to express them, which specific beliefs are people inclined to attribute to others? In this paper, we investigate the hypothesis that people prefer to attribute beliefs that are good explanations for the behavior they observe. We develop a computational model that quantifies the explanatory strength of a (natural language) statement about an agent's beliefs via three factors: accuracy, informativity, and causal relevance to actions, each of which can be computed from a probabilistic generative model of belief-driven behavior. Using this model, we study the role of each factor in how people selectively attribute beliefs to other agents. We investigate this via an experiment where participants watch an agent collect keys hidden in boxes in order to reach a goal, then rank a set of statements describing the agent's beliefs about the boxes' contents. We find that accuracy and informativity perform reasonably well at predicting these rankings when combined, but that causal relevance is the single factor that best explains participants' responses.

[398] GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor

Seokgi Lee,Jungjun Kim

Main category: cs.CL

TL;DR: GSA-TTS proposes a gradual style encoder for zero-shot speech synthesis that combines local and global style encoding to obtain a robust and rich style representation.

Details

Motivation: To make style representations in zero-shot speech synthesis both robust and rich. Method: A gradual style encoder first captures local styles, then combines them through self-attention into a global style condition. Result: Tests on unseen speakers show promising naturalness, speaker similarity, and intelligibility. Conclusion: GSA-TTS shows potential in style representation, interpretability, and controllability. Abstract: We present the gradual style adaptor TTS (GSA-TTS) with a novel style encoder that gradually encodes speaking styles from an acoustic reference for zero-shot speech synthesis. GSA first captures the local style of each semantic sound unit. Then the local styles are combined by self-attention to obtain a global style condition. This semantic and hierarchical encoding strategy provides a robust and rich style representation for an acoustic model. We test GSA-TTS on unseen speakers and obtain promising results regarding naturalness, speaker similarity, and intelligibility. Additionally, we explore the potential of GSA in terms of interpretability and controllability, which stems from its hierarchical structure.

[399] gec-metrics: A Unified Library for Grammatical Error Correction Evaluation

Takumi Goto,Yusuke Sakai,Taro Watanabe

Main category: cs.CL

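TL;DR: Introduces gec-metrics, a library for using and developing grammatical error correction (GEC) evaluation metrics through a unified interface, enabling fair system comparisons and high extensibility.

Details

Motivation: To address inconsistent implementations of GEC evaluation metrics by providing a unified interface and tooling that support fair system comparison and metric development. Method: Develops a library with a unified interface and a focus on API usage, including meta-evaluation functionality and analysis and visualization scripts, and releases it as open source. Result: The gec-metrics library supports consistent and extensible GEC evaluation and provides analysis and visualization tools. Conclusion: gec-metrics offers a unified, extensible solution for GEC evaluation and helps advance research in the field. Abstract: We introduce gec-metrics, a library for using and developing grammatical error correction (GEC) evaluation metrics through a unified interface. Our library enables fair system comparisons by ensuring that everyone conducts evaluations using a consistent implementation. Moreover, it is designed with a strong focus on API usage, making it highly extensible. It also includes meta-evaluation functionalities and provides analysis and visualization scripts, contributing to developing GEC evaluation metrics. Our code is released under the MIT license and is also distributed as an installable package. The video is available on YouTube.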

[400] Simple and Effective Baselines for Code Summarisation Evaluation

Jade Robinson,Jonathan K. Kummerfeld

Main category: cs.CL

TL;DR: Proposes a new LLM-based baseline for scoring code summaries that is as good as or better than existing metrics.

Details

Motivation: Code summarisation techniques are hard to compare because human evaluation is expensive and automatic metrics are unreliable. Method: An LLM assigns an overall score to a summary, with or without access to the reference summary. Result: The new method is as good as or better than prior metrics; the authors recommend pairing it with embedding-based methods to avoid LLM-specific bias. Conclusion: The method is effective but should be combined with other methods to reduce bias. Abstract: Code documentation is useful, but writing it is time-consuming. Different techniques for generating code summaries have emerged, but comparing them is difficult because human evaluation is expensive and automatic metrics are unreliable. In this paper, we introduce a simple new baseline in which we ask an LLM to give an overall score to a summary. Unlike n-gram and embedding-based baselines, our approach is able to consider the code when giving a score. This allows us to also make a variant that does not consider the reference summary at all, which could be used for other tasks, e.g., to evaluate the quality of documentation in code bases. We find that our method is as good as or better than prior metrics, though we recommend using it in conjunction with embedding-based methods to avoid the risk of LLM-specific bias.

[401] CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems

Yan Wen,Junfeng Guo,Heng Huang

Main category: cs.CL

TL;DR: CoTGuard is a trigger-based framework for copyright protection in multi-agent LLM systems that monitors the reasoning process to detect unauthorized content reproduction.

Details

Motivation: As multi-agent LLM systems develop, content leakage during reasoning poses a new challenge for copyright protection; existing methods look only at final outputs and overlook the reasoning process. Method: Trigger queries are embedded into Chain-of-Thought reasoning to activate specific reasoning segments, and intermediate steps are monitored for fine-grained copyright detection. Result: Experiments show that CoTGuard effectively detects content leakage with minimal interference to task performance. Conclusion: Reasoning-level monitoring offers a new direction for protecting intellectual property in LLM-based agent systems. Abstract: As large language models (LLMs) evolve into autonomous agents capable of collaborative reasoning and task execution, multi-agent LLM systems have emerged as a powerful paradigm for solving complex problems. However, these systems pose new challenges for copyright protection, particularly when sensitive or copyrighted content is inadvertently recalled through inter-agent communication and reasoning. Existing protection techniques primarily focus on detecting content in final outputs, overlooking the richer, more revealing reasoning processes within the agents themselves. In this paper, we introduce CoTGuard, a novel framework for copyright protection that leverages trigger-based detection within Chain-of-Thought (CoT) reasoning. Specifically, we can activate specific CoT segments and monitor intermediate reasoning steps for unauthorized content reproduction by embedding specific trigger queries into agent prompts. This approach enables fine-grained, interpretable detection of copyright violations in collaborative agent scenarios. We evaluate CoTGuard on various benchmarks in extensive experiments and show that it effectively uncovers content leakage with minimal interference to task performance. Our findings suggest that reasoning-level monitoring offers a promising direction for safeguarding intellectual property in LLM-based agent systems.

[402] Self-Reflective Planning with Knowledge Graphs: Enhancing LLM Reasoning Reliability for Question Answering

Jiajun Zhu,Ye Liu,Meikai Bao,Kai Zhang,Yanghai Zhang,Qi Liu

Main category: cs.CL

TL;DR: Proposes Self-Reflective Planning (SRP), a framework that couples large language models with knowledge graphs through iterative, reference-guided reasoning to reduce hallucination during reasoning.

Details

Motivation: Large language models tend to hallucinate when their internal knowledge is insufficient, and existing methods that integrate knowledge graphs often produce incomplete or factually inconsistent reasoning paths. Method: SRP searches for references to guide planning and reflection, generates an initial reasoning path, retrieves knowledge from the knowledge graph along that path, and iteratively reflects on the retrieval result to revise the path. Result: Experiments on three public datasets show that SRP outperforms various strong baselines and demonstrates reliable reasoning. Conclusion: By combining large language models with knowledge graphs, SRP improves the accuracy and reliability of reasoning. Abstract: Recently, large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, yet they remain prone to hallucinations when reasoning with insufficient internal knowledge. While integrating LLMs with knowledge graphs (KGs) provides access to structured, verifiable information, existing approaches often generate incomplete or factually inconsistent reasoning paths. To this end, we propose Self-Reflective Planning (SRP), a framework that synergizes LLMs with KGs through iterative, reference-guided reasoning. Specifically, given a question and topic entities, SRP first searches for references to guide planning and reflection. In the planning process, it checks initial relations and generates a reasoning path. After retrieving knowledge from KGs through a reasoning path, it implements iterative reflection by judging the retrieval result and editing the reasoning path until the answer is correctly retrieved. Extensive experiments on three public datasets demonstrate that SRP surpasses various strong baselines and further underscore its reliable reasoning ability.

[403] The Role of Diversity in In-Context Learning for Large Language Models

Wenyang Xiao,Haoyu Zhao,Lingxiao Huang

Main category: cs.CL

TL;DR: The study examines how diversity in example selection affects in-context learning (ICL) performance and finds that diversity-aware methods improve results and robustness on complex tasks.

Details

Motivation: Existing methods mostly focus on the similarity between examples and the query, while the effect of diversity on performance remains underexplored. Method: Diversity-aware selection methods are tested across a range of tasks (sentiment classification, math, and code problems), and a theoretical framework is introduced to explain their benefits. Result: Diversity-aware selection performs better on complex tasks such as math and code and is more robust to out-of-distribution queries. Conclusion: Diversity plays an important role in example selection, especially on complex tasks, and future work can explore its potential further. Abstract: In-context learning (ICL) is a crucial capability of current large language models (LLMs), where the selection of examples plays a key role in performance. While most existing approaches focus on selecting the most similar examples to the query, the impact of diversity in example selection remains underexplored. We systematically investigate the role of diversity in in-context example selection through experiments across a range of tasks, from sentiment classification to more challenging math and code problems. Experiments on Llama-3.1, Gemma-2, and Mistral-v0.3 families of models show that diversity-aware selection methods improve performance, particularly on complex tasks like math and code, and enhance robustness to out-of-distribution queries. To support these findings, we introduce a theoretical framework that explains the benefits of incorporating diversity in in-context example selection.
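
To make the contrast between similarity-only and diversity-aware example selection concrete, the sketch below uses greedy maximal marginal relevance (MMR). MMR is a stand-in chosen for illustration; the abstract does not say which diversity-aware methods the authors evaluate. The function names and the toy 2-D embeddings are assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr_select(query: Sequence[float],
               examples: List[Sequence[float]],
               k: int,
               lam: float = 0.5) -> List[int]:
    """Greedily pick k example indices.

    lam = 1.0 reduces to pure similarity ranking; smaller values
    penalise candidates that resemble already selected examples.
    """
    selected: List[int] = []
    remaining = list(range(len(examples)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, examples[i])
            redundancy = max(
                (cosine(examples[i], examples[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: examples 0 and 1 are near-duplicates, 2 is different.
query = [0.8, 0.6]
pool = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
similar_only = mmr_select(query, pool, k=2, lam=1.0)
diverse = mmr_select(query, pool, k=2, lam=0.5)
```

With `lam=1.0` the two near-duplicate examples are chosen; with `lam=0.5` the second pick switches to the dissimilar example, which is the behaviour a diversity-aware selector is meant to produce.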

[404] Frictional Agent Alignment Framework: Slow Down and Don't Break Things

Abhijnan Nath,Carine Graff,Andrei Bachinin,Nikhil Krishnaswamy

Main category: cs.CL

TL;DR: FAAF generates context-aware "friction" to dynamically steer human-AI collaboration, addressing the limitations of static preference alignment methods.

Details

Motivation: Static preference alignment methods such as DPO perform poorly in dynamic collaborative tasks because the signals are sparse and skewed, so a new method is needed to handle belief misalignment in dynamic interactions. Method: Proposes the FAAF framework with two policies, a frictive-state policy that identifies belief misalignments and an intervention policy that generates collaborator-preferred responses, and trains a single policy with a supervised loss. Result: On three benchmarks, FAAF outperforms competitors in producing concise, interpretable friction and in OOD generalization. Conclusion: By making LLMs act as adaptive "thought partners", FAAF advances scalable, dynamic human-AI collaboration. Abstract: AI support of collaborative interactions entails mediating potential misalignment between interlocutor beliefs. Common preference alignment methods like DPO excel in static settings, but struggle in dynamic collaborative tasks where the explicit signals of interlocutor beliefs are sparse and skewed. We propose the Frictional Agent Alignment Framework (FAAF), to generate precise, context-aware "friction" that prompts for deliberation and re-examination of existing evidence. FAAF's two-player objective decouples from data skew: a frictive-state policy identifies belief misalignments, while an intervention policy crafts collaborator-preferred responses. We derive an analytical solution to this objective, enabling training a single policy via a simple supervised loss. Experiments on three benchmarks show FAAF outperforms competitors in producing concise, interpretable friction and in OOD generalization. By aligning LLMs to act as adaptive "thought partners" -- not passive responders -- FAAF advances scalable, dynamic human-AI collaboration. Our code and data can be found at https://github.com/csu-signal/FAAF_ACL.

[405] Rhapsody: A Dataset for Highlight Detection in Podcasts

Younghan Park,Anuj Diwan,David Harwath,Eunsol Choi

Main category: cs.CL

TL;DR: The paper introduces Rhapsody, a dataset for automatically detecting highlights in podcasts, and shows that fine-tuned language models perform significantly better than zero-shot prompting.

Details

Motivation: Podcast content is vast and unstructured, so automatically identifying highlights is important for helping users choose what to listen to. Method: Highlight detection is framed as a segment-level binary classification task, and both zero-shot prompting and fine-tuned language models are explored. Result: Fine-tuned models significantly outperform zero-shot approaches, and combining speech signal features with transcripts works best. Conclusion: Fine-grained information access in long-form spoken media remains challenging. Abstract: Podcasts have become daily companions for half a billion users. Given the enormous amount of podcast content available, highlights provide a valuable signal that helps viewers get the gist of an episode and decide if they want to invest in listening to it in its entirety. However, identifying highlights automatically is challenging due to the unstructured and long-form nature of the content. We introduce Rhapsody, a dataset of 13K podcast episodes paired with segment-level highlight scores derived from YouTube's 'most replayed' feature. We frame the podcast highlight detection as a segment-level binary classification task. We explore various baseline approaches, including zero-shot prompting of language models and lightweight finetuned language models using segment-level classification heads. Our experimental results indicate that even state-of-the-art language models like GPT-4o and Gemini struggle with this task, while models finetuned with in-domain data significantly outperform their zero-shot performance. The finetuned model benefits from leveraging both speech signal features and transcripts. These findings highlight the challenges for fine-grained information access in long-form spoken media.

[406] Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation

Keane Ong,Rui Mao,Deeksha Varshney,Paul Pu Liang,Erik Cambria,Gianmarco Mengaldo

Main category: cs.CL

TL;DR: The paper proposes Fin-Force, a new benchmark for evaluating the forward counterfactual reasoning of large language models (LLMs) in finance, with the aim of automating the anticipation of market risks and opportunities.

Details

Motivation: The dynamic nature of financial markets calls for forward-looking counterfactual reasoning to uncover potential risks and opportunities, but doing this manually is cognitively demanding, so automated solutions are needed. Method: Builds the Fin-Force benchmark, which pairs curated financial news headlines with structured evaluation to support LLM-based forward counterfactual generation. Result: Experiments evaluate state-of-the-art LLMs and counterfactual generation methods, analyze their limitations, and suggest directions for future research. Conclusion: Fin-Force provides a structured tool for automated exploration of future market developments and opens a new avenue for decision support. Abstract: Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force (FINancial FORward Counterfactual Evaluation). By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research.

[407] Route to Reason: Adaptive Routing for LLM and Reasoning Strategy Selection

Zhihong Pan,Kai Zhang,Yuze Zhao,Yupeng Han

Main category: cs.CL

TL;DR: The paper proposes Route-To-Reason (RTR), a framework that dynamically allocates language models and reasoning strategies to optimize task performance and computational efficiency.

Details

Motivation: To address the high computational cost of test-time scaling and the tendency of models to become trapped in "thought pitfalls". Method: RTR learns compressed representations of expert models and reasoning strategies, enabling their joint, adaptive selection at inference time. Result: Across seven open-source models and four reasoning strategies, RTR achieves the best trade-off between accuracy and computational efficiency among the baselines and reduces token usage by over 60%. Conclusion: RTR is a low-cost, highly flexible method that applies to arbitrary black-box or white-box models and strategies. Abstract: The inherent capabilities of a language model (LM) and the reasoning strategies it employs jointly determine its performance in reasoning tasks. While test-time scaling is regarded as an effective approach to tackling complex reasoning tasks, it incurs substantial computational costs and often leads to "overthinking", where models become trapped in "thought pitfalls". To address this challenge, we propose Route-To-Reason (RTR), a novel unified routing framework that dynamically allocates both LMs and reasoning strategies according to task difficulty under budget constraints. RTR learns compressed representations of both expert models and reasoning strategies, enabling their joint and adaptive selection at inference time. This method is low-cost, highly flexible, and can be seamlessly extended to arbitrary black-box or white-box models and strategies, achieving true plug-and-play functionality. Extensive experiments across seven open source models and four reasoning strategies demonstrate that RTR achieves an optimal trade-off between accuracy and computational efficiency among all baselines, achieving higher accuracy than the best single model while reducing token usage by over 60%.
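
The routing decision described for RTR, picking a model and a reasoning strategy jointly under a budget, can be illustrated with a small sketch. It assumes that some upstream predictor has already produced an expected accuracy and token cost for each (model, strategy) pair; how RTR obtains such estimates from its learned representations is not reproduced here, and every name and number below is hypothetical.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]                  # (model name, strategy name)
Estimate = Tuple[float, float]          # (predicted accuracy, predicted tokens)


def route(predictions: Dict[Pair, Estimate],
          budget_tokens: float,
          cost_weight: float = 0.0) -> Pair:
    """Choose the pair with the best accuracy/cost score within budget.

    If no pair fits the budget, fall back to the cheapest pair.
    """
    if not predictions:
        raise ValueError("no candidate pairs to route between")
    feasible = {
        pair: est for pair, est in predictions.items()
        if est[1] <= budget_tokens
    }
    if not feasible:
        return min(predictions, key=lambda p: predictions[p][1])
    return max(
        feasible,
        key=lambda p: feasible[p][0] - cost_weight * feasible[p][1],
    )


# Hypothetical per-query estimates for two models and two strategies.
estimates = {
    ("small-lm", "direct"): (0.55, 120.0),
    ("small-lm", "cot"): (0.68, 600.0),
    ("large-lm", "direct"): (0.70, 300.0),
    ("large-lm", "cot"): (0.82, 1500.0),
}
tight = route(estimates, budget_tokens=400)
loose = route(estimates, budget_tokens=2000)
```

Under the tight budget the router prefers the large model with a short answer over the small model with long reasoning, which shows why model and strategy are worth selecting jointly instead of one after the other.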

[408] Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

Rihui Xin,Han Liu,Zecheng Wang,Yupeng Zhang,Dianbo Sui,Xiaolin Hu,Bingning Wang

Main category: cs.CL

TL;DR: The study uses format and length as surrogate signals to train LLMs on mathematical problems, reducing dependence on ground truth answers and surpassing the standard approach in certain scenarios.

Details

Motivation: Training LLMs conventionally requires many ground truth answers, which is costly and sometimes infeasible, so the study explores surrogate signals as an alternative. Method: A reward function built on format correctness and answer length is combined with the GRPO algorithm to train the LLM. Result: The approach reaches 40.0% accuracy on AIME2024 with a 7B base model and, in certain scenarios, surpasses standard GRPO that relies on ground truth answers. Conclusion: The label-free approach unlocks capabilities the base model already has, reduces dependence on ground truth answers, and offers a new route for mathematical problem-solving. Abstract: Large Language Models have achieved remarkable success in natural language processing tasks, with Reinforcement Learning playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes unfeasible. This research delves into the utilization of format and length as surrogate signals to train LLMs for mathematical problem-solving, bypassing the need for traditional ground truth answers. Our study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard GRPO algorithm in early phases. Recognizing the limitations of format-only rewards in the later phases, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but surpasses the performance of the standard GRPO algorithm relying on ground truth answers in certain scenarios, achieving 40.0% accuracy on AIME2024 with a 7B base model. Through systematic exploration and experimentation, this research not only offers a practical solution for training LLMs to solve mathematical problems and reducing the dependence on extensive ground truth data collection, but also reveals the essence of why our label-free approach succeeds: the base model is like an excellent student who has already mastered mathematical and logical reasoning skills but performs poorly on the test paper; it simply needs to develop good answering habits to achieve outstanding results in exams, in other words, to unlock the capabilities it already possesses.
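
A reward that looks only at format and length, never at the correct answer, might be sketched as follows. The abstract does not give the actual reward definition, so everything here is an assumption made for illustration: the `\boxed{...}` format check, the triangular length shape, whitespace token counting, and the default values of `target`, `max_len`, and `weight`.

```python
import re

# Assumed format rule: exactly one non-nested \boxed{...} final answer.
BOXED = re.compile(r"\\boxed\{[^{}]+\}")


def format_reward(response: str) -> float:
    """1.0 if the response contains exactly one boxed answer, else 0.0."""
    return 1.0 if len(BOXED.findall(response)) == 1 else 0.0


def length_reward(n_tokens: int, target: int = 512, max_len: int = 2048) -> float:
    """Rise linearly to 1.0 at `target`, then fall to 0.0 at `max_len`."""
    if n_tokens <= 0 or n_tokens > max_len:
        return 0.0
    if n_tokens <= target:
        return n_tokens / target
    return (max_len - n_tokens) / (max_len - target)


def surrogate_reward(response: str, weight: float = 0.5) -> float:
    """Format reward plus a weighted length bonus; no ground truth needed."""
    fmt = format_reward(response)
    if fmt == 0.0:
        return 0.0  # badly formatted answers earn nothing
    n_tokens = len(response.split())  # crude whitespace token count
    return fmt + weight * length_reward(n_tokens)


good = surrogate_reward("We add the terms, so the result is \\boxed{42}")
bad = surrogate_reward("The result is 42")
```

Gating the length bonus on the format check keeps a policy from collecting reward for long but unparseable output. A scalar of this kind would take the place of the correctness reward inside a GRPO-style training loop.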

[409] The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models

Shashata Sawmya,Micah Adler,Nir Shavit

Main category: cs.CL

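TL;DR: Studies the emergence of interpretable categorical features in large language models across training time, layer depth, and model scale, finding thresholds for feature emergence and showing that early-layer features re-emerge at later layers.

Details

Motivation: To explore how semantic features emerge in large language models and how they change across time, space, and scale. Method: Uses sparse autoencoders for mechanistic interpretability to study when and where specific semantic concepts emerge in neural activations. Result: Finds clear temporal and scale-specific thresholds for feature emergence and observes early-layer features reactivating at later layers. Conclusion: The findings challenge standard assumptions about representational dynamics in transformer models and reveal the complex, nonlinear dynamics of feature emergence. Abstract: This paper studies the emergence of interpretable categorical features within large language models (LLMs), analyzing their behavior across training checkpoints (time), transformer layers (space), and varying model sizes (scale). Using sparse autoencoders for mechanistic interpretability, we identify when and where specific semantic concepts emerge within neural activations. Results indicate clear temporal and scale-specific thresholds for feature emergence across multiple domains. Notably, spatial analysis reveals unexpected semantic reactivation, with early-layer features re-emerging at later layers, challenging standard assumptions about representational dynamics in transformer models.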

[410] Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks

Mohammad Mahdi Moradi,Walid Ahmed,Shuangyue Wen,Sudhir Mudur,Weiwei Zhang,Yang Liu

Main category: cs.CL

TL;DR: FlowHN is a new parallel hybrid network architecture that dynamically distributes input tokens and fuses branch outputs, significantly improving processing speed and accuracy.

Details

Motivation: Hybrid networks that combine attention with state-space models (SSMs), whether sequentially or in parallel, suffer from unbalanced compute load and insufficiently expressive outputs. Method: Proposes FlowHN, which balances compute load with a dynamic token-split strategy and fuses the branch outputs to enhance expressivity. Result: On autoregressive language modeling, FlowHN outperforms sequential hybrid models and its parallel counterpart, with up to 4× higher Tokens per Second and 2× better Model FLOPs Utilization. Conclusion: Through dynamic load balancing and output fusion, FlowHN significantly improves the performance and efficiency of hybrid networks. Abstract: Attention and State-Space Models (SSMs) when combined in a hybrid network in sequence or in parallel provide complementary strengths. In a hybrid sequential pipeline they alternate between applying a transformer to the input and then feeding its output into a SSM. This results in idle periods in the individual components increasing end-to-end latency and lowering throughput caps. In the parallel hybrid architecture, the transformer operates independently in parallel with the SSM, and these pairs are cascaded, with output from one pair forming the input to the next. Two issues are (i) creating an expressive knowledge representation with the inherently divergent outputs from these separate branches, and (ii) load balancing the computation between these parallel branches, while maintaining representation fidelity. In this work we present FlowHN, a novel parallel hybrid network architecture that accommodates various strategies for load balancing, achieved through appropriate distribution of input tokens between the two branches. Two innovative differentiating factors in FlowHN include a FLOP aware dynamic token split between the attention and SSM branches yielding efficient balance in compute load, and secondly, a method to fuse the highly divergent outputs from individual branches for enhancing representation expressivity. Together they enable much better token processing speeds, avoid bottlenecks, and at the same time yield significantly improved accuracy as compared to other competing works. We conduct comprehensive experiments on autoregressive language modeling for models with 135M, 350M, and 1B parameters. FlowHN outperforms sequential hybrid models and its parallel counterpart, achieving up to 4× higher Tokens per Second (TPS) and 2× better Model FLOPs Utilization (MFU).

[411] Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection

Mohammad Mahdi Moradi,Hossam Amer,Sudhir Mudur,Weiwei Zhang,Yang Liu,Walid Ahmed

Main category: cs.CL

TL;DR: VDS-TTT uses verifier-driven sample selection and fine-tuning of low-rank adapter parameters to substantially improve how well pretrained language models adapt to unlabeled, out-of-distribution data.

Details

Motivation: Pretrained language models perform poorly on out-of-distribution data and need an efficient way to adapt. Method: A verifier scores generated responses, high-confidence pseudo-labeled examples are selected for fine-tuning, and only low-rank LoRA adapter parameters are updated. Result: Across three benchmarks, VDS-TTT yields up to a 32.29% relative improvement over the base model and a 6.66% gain over verifier-based methods without test-time training. Conclusion: VDS-TTT is an efficient and effective framework for on-the-fly language model adaptation. Abstract: Learning to adapt pretrained language models to unlabeled, out-of-distribution data is a critical challenge, as models often falter on structurally novel reasoning tasks even while excelling within their training distribution. We introduce a new framework called VDS-TTT - Verifier-Driven Sample Selection for Test-Time Training to efficiently address this. We use a learned verifier to score a pool of generated responses and select only from high ranking pseudo-labeled examples for fine-tuned adaptation. Specifically, for each input query our LLM generates N candidate answers; the verifier assigns a reliability score to each, and the response with the highest confidence and above a fixed threshold is paired with its query for test-time training. We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence. Our proposed self-supervised framework is the first to synthesize verifier driven test-time training data for continuous self-improvement of the model. Experiments across three diverse benchmarks and three state-of-the-art LLMs demonstrate that VDS-TTT yields up to a 32.29% relative improvement over the base model and a 6.66% gain compared to verifier-based methods without test-time training, highlighting its effectiveness and efficiency for on-the-fly large language model adaptation.
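
The selection step in the VDS-TTT abstract (score N candidates, keep the top one only if it clears a fixed threshold) can be sketched as below. The verifier here is a toy lookup table standing in for the learned verifier, and the function names, threshold value, and example data are all hypothetical. Generation and the LoRA fine-tuning step are left out.

```python
from typing import Callable, List, Optional, Tuple


def select_pseudo_label(
    query: str,
    candidates: List[str],
    verifier: Callable[[str, str], float],
    threshold: float,
) -> Optional[Tuple[str, str]]:
    """Return (query, best answer) if the best score exceeds the threshold.

    Returns None when there are no candidates or none is reliable enough,
    so the query contributes no test-time training example.
    """
    if not candidates:
        return None
    scored = [(verifier(query, answer), answer) for answer in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    if best_score > threshold:
        return (query, best_answer)
    return None


# Toy verifier: fixed reliability scores per candidate (assumed values).
toy_scores = {"x = 4": 0.91, "x = 5": 0.40, "x = -4": 0.12}


def toy_verifier(query: str, answer: str) -> float:
    return toy_scores.get(answer, 0.0)


question = "Solve 2x + 1 = 9."
kept = select_pseudo_label(question, list(toy_scores), toy_verifier, threshold=0.8)
dropped = select_pseudo_label(question, ["x = 5", "x = -4"], toy_verifier, threshold=0.8)
```

Returning `None` for low-confidence queries is what stops noisy pseudo-labels from entering the training set; the retained pairs would then drive the adapter update.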

[412] CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis

Ruixiang Feng,Shen Gao,Xiuying Chen,Lisi Chen,Shuo Shang

Main category: cs.CL

TL;DR: CulFiT is a new culturally-aware training paradigm that uses multilingual data and fine-grained reward modeling to improve the cultural sensitivity and inclusivity of LLMs.

Details

Motivation: LLMs exhibit cultural bias and neglect the values and linguistic diversity of low-resource regions, which risks reinforcing discrimination. Method: Combines multilingual data, synthesized culture-related questions, and a fine-grained reward model that decomposes cultural texts into verifiable knowledge units. Result: CulFiT achieves state-of-the-art performance among open-source models in cultural alignment and general reasoning. Conclusion: CulFiT reduces cultural bias in LLMs and improves global cultural inclusivity. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often exhibit specific cultural biases, neglecting the values and linguistic diversity of low-resource regions. This cultural bias not only undermines universal equality, but also risks reinforcing stereotypes and perpetuating discrimination. To address this, we propose CulFiT, a novel culturally-aware training paradigm that leverages multilingual data and fine-grained reward modeling to enhance cultural sensitivity and inclusivity. Our approach synthesizes diverse cultural-related questions, constructs critique data in culturally relevant languages, and employs fine-grained rewards to decompose cultural texts into verifiable knowledge units for interpretable evaluation. We also introduce GlobalCultureQA, a multilingual open-ended question-answering dataset designed to evaluate culturally-aware responses in a global context. Extensive experiments on three existing benchmarks and our GlobalCultureQA demonstrate that CulFiT achieves state-of-the-art open-source model performance in cultural alignment and general reasoning.

[413] Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents

Manoj Balaji Jagadeeshan,Prince Raj,Pawan Goyal

Main category: cs.CL

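TL;DR: The study presents a comprehensive benchmark for retrieving Sanskrit documents with English queries, focusing on the chapters of the Srimadbhagavatam. It uses three approaches, Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT), drawing on shared embedding spaces and advanced translation methods to enhance retrieval in a RAG framework.

Details

Motivation: To address the challenges of cross-lingual retrieval over ancient Sanskrit texts, improve their accessibility and understanding, and help preserve and share the philosophical value of Sanskrit scriptures. Method: Applies Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT) with shared embedding spaces and advanced translation techniques. It also fine-tunes state-of-the-art models for Sanskrit's linguistic nuances, evaluates models such as BM25, REPLUG, and mDPR, and adapts summarization techniques for Sanskrit documents to improve QA processing. Result: Evaluation shows that DT methods outperform DR and QT in handling cross-lingual challenges. Conclusion: With a public dataset of 3,400 English-Sanskrit query-document pairs and the accompanying methods, the study provides effective tools for retrieving and understanding ancient Sanskrit texts and supports their dissemination and preservation. Abstract: The study presents a comprehensive benchmark for retrieving Sanskrit documents using English queries, focusing on the chapters of the Srimadbhagavatam. It employs a tripartite approach: Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT), utilizing shared embedding spaces and advanced translation methods to enhance retrieval systems in a RAG framework. The study fine-tunes state-of-the-art models for Sanskrit's linguistic nuances, evaluating models such as BM25, REPLUG, mDPR, ColBERT, Contriever, and GPT-2. It adapts summarization techniques for Sanskrit documents to improve QA processing. Evaluation shows DT methods outperform DR and QT in handling the cross-lingual challenges of ancient texts, improving accessibility and understanding. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study, aiming to preserve Sanskrit scriptures and share their philosophical importance widely. Our dataset is publicly available at https://huggingface.co/datasets/manojbalaji1/anveshana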

[414] LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study

Dongil Yang,Minjin Kim,Sunghwan Kim,Beong-woo Kwak,Minjun Park,Jinseok Hong,Woontack Woo,Jinyoung Yeo

Main category: cs.CL

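TL;DR: The paper introduces Text-Scene Graph (TSG) Bench, which evaluates the ability of large language models (LLMs) to understand and generate scene graphs, and finds that LLMs perform poorly at generating scene graphs from complex narratives.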

Details

Motivation: Assess LLMs' spatial and temporal understanding, especially in multimodal environments, to support applications such as embodied AI and robotics. Method: Proposes the TSG Bench benchmark and systematically evaluates 11 LLMs on scene graph understanding and generation tasks. Result: LLMs perform well on scene graph understanding but struggle with scene graph generation for complex narratives, in particular failing to decompose a narrative into discrete scenes. Conclusion: The study exposes a bottleneck in scene graph generation by LLMs, points to directions for improved methods, and releases the benchmark and code. Abstract: The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs' ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs' ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at https://tsg-bench.netlify.app. Additionally, our code and evaluation data are publicly available at https://anonymous.4open.science/r/TSG-Bench.

[415] Causal Distillation: Transferring Structured Explanations from Large to Compact Language Models

Aggrey Muhebwa,Khalid K. Osman

Main category: cs.CL

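TL;DR: Proposes a new framework that transfers causal reasoning ability to small open-source models by distilling causal explanations from a powerful teacher model.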

Details

Motivation: Large proprietary language models exhibit strong causal reasoning that small open-source models struggle to replicate. Method: Trains the small model to generate causal explanations consistent with those of the teacher model, and introduces the CEC metric to evaluate explanation quality. Result: The framework and the CEC metric provide a principled foundation for training small models to perform robust causal reasoning. Conclusion: The method and metric offer a systematic tool for assessing causal explanations in language model outputs. Abstract: Large proprietary language models exhibit strong causal reasoning abilities that smaller open-source models struggle to replicate. We introduce a novel framework for distilling causal explanations that transfers causal reasoning skills from a powerful teacher model to a compact open-source model. The key idea is to train the smaller model to develop causal reasoning abilities by generating structured cause-and-effect explanations consistent with those of the teacher model. To evaluate the quality of the student-generated explanations, we introduce a new metric called Causal Explanation Coherence (CEC) to assess the structural and logical consistency of causal reasoning. This metric uses sentence-level semantic alignment to measure how well each part of the generated explanation corresponds to the teacher's reference, capturing both faithfulness and coverage of the underlying causal chain. Our framework and the CEC metric provide a principled foundation for training smaller models to perform robust causal reasoning and for systematically assessing the coherence of explanations in language model outputs.
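
The sentence-level alignment idea behind a metric like CEC can be illustrated with a small sketch. This is not the paper's definition of CEC: the abstract gives no formula, so the function names, the bag-of-words stand-in for a sentence encoder, and the harmonic-mean combination below are all assumptions made for illustration. Faithfulness is taken as how well each student sentence matches some teacher sentence, and coverage as how well each teacher sentence is matched by some student sentence.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter on '.', '!' and '?'; a real pipeline would use a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def embed(sentence):
    # Hypothetical stand-in for a sentence encoder: lowercase bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def alignment_scores(student, teacher):
    """Return faithfulness, coverage, and their harmonic mean (an assumed combination)."""
    s_vecs = [embed(s) for s in split_sentences(student)]
    t_vecs = [embed(t) for t in split_sentences(teacher)]
    if not s_vecs or not t_vecs:
        return {"faithfulness": 0.0, "coverage": 0.0, "score": 0.0}
    # Faithfulness: every student sentence should resemble some teacher sentence.
    faithfulness = sum(max(cosine(s, t) for t in t_vecs) for s in s_vecs) / len(s_vecs)
    # Coverage: every teacher sentence should be matched by some student sentence.
    coverage = sum(max(cosine(t, s) for s in s_vecs) for t in t_vecs) / len(t_vecs)
    total = faithfulness + coverage
    score = 2 * faithfulness * coverage / total if total else 0.0
    return {"faithfulness": faithfulness, "coverage": coverage, "score": score}
```

A student explanation that repeats only the first link of a two-step causal chain would score high on faithfulness but only about half on coverage, which is the distinction the abstract draws between the two properties.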

[416] SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback

Yaoning Yu,Ye Yu,Kai Wei,Haojing Luo,Haohan Wang

Main category: cs.CL

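TL;DR: SIPDO is a closed-loop framework for self-improving prompts through data-augmented optimization; it couples synthetic data generation with prompt optimization and markedly improves prompt performance.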

Details

Motivation: Existing prompt optimization methods typically work over a fixed dataset, assume a static input distribution, and lack support for iterative improvement. SIPDO aims to optimize prompts dynamically through a closed-loop framework. Method: SIPDO couples a synthetic data generator with a prompt optimizer: the generator produces data that expose the current prompt's weaknesses, and the optimizer incrementally refines the prompt. Result: On question answering and reasoning benchmarks, SIPDO outperforms standard prompt tuning methods. Conclusion: SIPDO demonstrates the value of integrating data synthesis into prompt learning and offers a new approach to dynamic optimization. Abstract: Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.

[417] Bias in Political Dialogue: Tagging U.S. Presidential Debates with an Extended DAMSL Framework

Lavanya Prahallad,Radhika Mamidi

Main category: cs.CL

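TL;DR: The paper uses the BEADS framework to analyze Trump's rhetorical strategies in the 2024 U.S. presidential debates and finds that he dominated in categories such as adversarial exchanges and selective emphasis.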

Details

Motivation: Study Trump's rhetorical strategies in political debates and their influence on the audience. Method: Proposes the BEADS framework and analyzes the debates by combining human annotation with ChatGPT-assisted tagging. Result: Trump stands out in categories such as adversarial exchanges and selective emphasis. Conclusion: BEADS is a scalable framework for critical discourse analysis across languages and domains. Abstract: We present a critical discourse analysis of the 2024 U.S. presidential debates, examining Donald Trump's rhetorical strategies in his interactions with Joe Biden and Kamala Harris. We introduce a novel annotation framework, BEADS (Bias Enriched Annotation for Dialogue Structure), which systematically extends the DAMSL framework to capture bias driven and adversarial discourse features in political communication. BEADS includes a domain and language agnostic set of tags that model ideological framing, emotional appeals, and confrontational tactics. Our methodology compares detailed human annotation with zero shot ChatGPT assisted tagging on verified transcripts from the Trump and Biden (19,219 words) and Trump and Harris (18,123 words) debates. Our analysis shows that Trump consistently dominated in key categories: Challenge and Adversarial Exchanges, Selective Emphasis, Appeal to Fear, Political Bias, and Perceived Dismissiveness. These findings underscore his use of emotionally charged and adversarial rhetoric to control the narrative and influence audience perception. In this work, we establish BEADS as a scalable and reproducible framework for critical discourse analysis across languages, domains, and political contexts.

[418] AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection

Yejin Lee,Joonghyuk Hahn,Hyeseon Ahn,Yo-Sub Han

Main category: cs.CL

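TL;DR: AmpleHate is a new implicit hate speech detection method that mirrors the human reasoning process by combining explicit and implicit target information, and it significantly outperforms existing contrastive learning methods.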

Details

Motivation: Implicit hate speech detection is challenging because of its subtlety and reliance on contextual interpretation. Existing methods rely on contrastive learning, whereas human detection involves identifying targets and analyzing their relation to the context. Method: AmpleHate identifies explicit targets with a pretrained named entity recognition model, captures implicit target information via [CLS] tokens, computes attention relationships between targets and context, and injects the resulting relational vectors into the sentence representation. Result: AmpleHate outperforms contrastive learning baselines by an average of 82.14%, converges faster, and produces attention patterns consistent with human judgement. Conclusion: By mirroring the human reasoning process, AmpleHate significantly improves both the performance and the interpretability of implicit hate speech detection. Abstract: Implicit hate speech detection is challenging due to its subtlety and reliance on contextual interpretation rather than explicit offensive words. Current approaches rely on contrastive learning, which has been shown to be effective at distinguishing hate and non-hate sentences. Humans, however, detect implicit hate speech by first identifying specific targets within the text and subsequently interpreting how these targets relate to their surrounding context. Motivated by this reasoning process, we propose AmpleHate, a novel approach designed to mirror human inference for implicit hate detection. AmpleHate identifies explicit targets using a pretrained Named Entity Recognition model and captures implicit target information via [CLS] tokens. It computes attention-based relationships between explicit targets, implicit targets, and sentence context, and then directly injects these relational vectors into the final sentence representation. This amplifies the critical signals of target-context relations for determining implicit hate. Experiments demonstrate that AmpleHate achieves state-of-the-art performance, outperforming contrastive learning baselines by an average of 82.14% and achieving faster convergence. Qualitative analyses further reveal that attention patterns produced by AmpleHate closely align with human judgement, underscoring its interpretability and robustness.

[419] Small Language Models: Architectures, Techniques, Evaluation, Problems and Future Adaptation

Tanjil Hasan Sakib,Md. Tanzib Hosain,Md. Kishor Morol

Main category: cs.CL

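TL;DR: This paper comprehensively assesses small language models (SLMs), covering their design frameworks, training approaches, and model optimization techniques, and proposes a new classification system. It also assembles an evaluation platform and discusses unresolved challenges.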

Details

Motivation: SLMs have drawn attention for their efficiency in resource-limited environments; the study aims to guide the construction of efficient, compact language models. Method: Proposes a classification system that organizes SLM optimization approaches (such as pruning, quantization, and model compression) and assembles an evaluation platform. Result: Establishes an evaluation suite for SLMs, summarizes optimization techniques, and identifies the trade-off between efficiency and performance. Conclusion: The paper gives researchers and practitioners a guide to building efficient SLMs and proposes directions for future research. Abstract: Small Language Models (SLMs) have gained substantial attention due to their ability to execute diverse language tasks successfully while using fewer computational resources. These models are particularly well suited for deployment in resource-limited environments, such as mobile devices, on-device processing, and edge systems. In this study, we present a complete assessment of SLMs, focussing on their design frameworks, training approaches, and techniques for lowering model size and complexity. We offer a novel classification system to organize the optimization approaches applied for SLMs, encompassing strategies like pruning, quantization, and model compression. Furthermore, we assemble an evaluation suite for SLM studies from existing datasets, establishing a rigorous platform for measuring SLM capabilities. Alongside this, we discuss the important difficulties that remain unresolved in this sector, including trade-offs between efficiency and performance, and we suggest directions for future study. We anticipate this study to serve as a beneficial guide for researchers and practitioners who aim to construct compact, efficient, and high-performing language models.

[420] DoctorRAG: Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients

Yuxing Lu,Gecheng Fu,Wei Wu,Xukai Zhao,Sin Yee Goi,Jinzhuo Wang

Main category: cs.CL

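TL;DR: DoctorRAG is a RAG framework that emulates doctor-like reasoning by integrating clinical knowledge with case-based experience, significantly improving retrieval precision and response quality.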

Details

Motivation: Existing medical RAG systems rely mainly on knowledge bases and neglect the experiential knowledge found in similar patient cases, which is central to clinical reasoning. Method: Integrates knowledge-base and patient-case data through conceptual tagging and a hybrid retrieval mechanism, and adds a Med-TextGrad module to keep the output consistent with the retrieved knowledge and the patient query. Result: Significantly outperforms baseline RAG models on multilingual, multitask datasets and improves further through iterative refinement. Conclusion: DoctorRAG generates more accurate, relevant, and comprehensive responses, advancing medical systems that reason more like doctors. Abstract: Existing medical RAG systems mainly leverage knowledge from medical knowledge bases, neglecting the crucial role of experiential knowledge derived from similar patient cases -- a key component of human clinical reasoning. To bridge this gap, we propose DoctorRAG, a RAG framework that emulates doctor-like reasoning by integrating both explicit clinical knowledge and implicit case-based experience. DoctorRAG enhances retrieval precision by first allocating conceptual tags for queries and knowledge sources, together with a hybrid retrieval mechanism from both relevant knowledge and patient cases. In addition, a Med-TextGrad module using multi-agent textual gradients is integrated to ensure that the final output adheres to the retrieved knowledge and patient query. Comprehensive experiments on multilingual, multitask datasets demonstrate that DoctorRAG significantly outperforms strong baseline RAG models and gains improvements from iterative refinements. Our approach generates more accurate, relevant, and comprehensive responses, taking a step towards more doctor-like medical reasoning systems.

[421] How Syntax Specialization Emerges in Language Models

Xufeng Duan,Zhaoqian Yao,Yunhao Zhang,Shaonan Wang,Zhenguang G. Cai

Main category: cs.CL

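TL;DR: The study finds that large language models (LLMs) gradually develop internal sensitivity to syntactic structure during training, showing patterns similar to those of the human brain, and that this process is influenced by model scale and training data.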

Details

Motivation: Explore how internal syntactic sensitivity forms in LLMs and what influences it, filling a gap in existing research. Method: Quantifies internal syntactic consistency across minimal pairs from various syntactic phenomena and tracks its development over training. Result: Syntactic sensitivity emerges gradually, concentrates in specific layers, and exhibits a "critical period" of rapid internal specialization; the process is consistent across model architectures and initialization parameters. Conclusion: The study reveals how syntactic sensitivity forms in LLMs and will release code, models, and training checkpoints to support future research. Abstract: Large language models (LLMs) have been found to develop surprising internal specializations: Individual neurons, attention heads, and circuits become selectively sensitive to syntactic structure, reflecting patterns observed in the human brain. While this specialization is well-documented, how it emerges during training and what influences its development remains largely unknown. In this work, we tap into the black box of specialization by tracking its formation over time. By quantifying internal syntactic consistency across minimal pairs from various syntactic phenomena, we identify a clear developmental trajectory: Syntactic sensitivity emerges gradually, concentrates in specific layers, and exhibits a 'critical period' of rapid internal specialization. This process is consistent across architectures and initialization parameters (e.g., random seeds), and is influenced by model scale and training data. We therefore reveal not only where syntax arises in LLMs but also how some models internalize it during training. To support future research, we will release the code, models, and training checkpoints upon acceptance.

[422] Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents

Derong Xu,Yi Wen,Pengyue Jia,Yingyi Zhang,wenlin zhang,Yichao Wang,Huifeng Guo,Ruiming Tang,Xiangyu Zhao,Enhong Chen,Tong Xu

Main category: cs.CL

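TL;DR: The MemGAS framework uses multi-granularity memory association and adaptive retrieval to improve the memory coherence and personalized responses of LLMs in long conversations.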

Details

Motivation: Address the incoherent long-conversation memory and insufficiently personalized responses caused by the limited context windows of LLMs. Method: Proposes MemGAS, built on multi-granularity memory units and Gaussian Mixture Model clustering, with an entropy-based router and LLM-based filtering to refine retrieval. Result: On four long-term memory benchmarks, MemGAS outperforms existing methods on both question answering and retrieval tasks. Conclusion: Through multi-granularity association and adaptive retrieval, MemGAS significantly improves long-conversation memory management. Abstract: Large Language Models (LLMs) have recently been widely adopted in conversational agents. However, the increasingly long interactions between users and agents accumulate extensive dialogue records, making it difficult for LLMs with limited context windows to maintain a coherent long-term dialogue memory and deliver personalized responses. While retrieval-augmented memory systems have emerged to address this issue, existing methods often depend on single-granularity memory segmentation and retrieval. This approach falls short in capturing deep memory connections, leading to partial retrieval of useful information or substantial noise, resulting in suboptimal performance. To tackle these limits, we propose MemGAS, a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. MemGAS is based on multi-granularity memory units and employs Gaussian Mixture Models to cluster and associate new memories with historical ones. An entropy-based router adaptively selects optimal granularity by evaluating query relevance distributions and balancing information completeness and noise. Retrieved memories are further refined via LLM-based filtering. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answering and retrieval tasks, achieving superior performance across different query types and top-K settings.
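
One way to read the entropy-based router is sketched below. It is an illustrative assumption, not MemGAS's actual routing rule: for each granularity, the query-to-memory similarity scores are turned into a distribution with a softmax, and the granularity whose distribution is most peaked (lowest normalized entropy) is selected. The granularity names, the temperature, and the hard arg-min choice are all hypothetical; the paper may instead weight granularities softly.

```python
import math


def softmax(scores, temperature=1.0):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def normalized_entropy(probs):
    # Shannon entropy divided by log(n), so distributions of different sizes are comparable.
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def route(scores_by_granularity, temperature=0.1):
    """Pick the granularity whose relevance distribution is most concentrated."""
    entropies = {
        name: normalized_entropy(softmax(scores, temperature))
        for name, scores in scores_by_granularity.items()
    }
    best = min(entropies, key=entropies.get)
    return best, entropies


# Hypothetical similarity scores between one query and the memories at two granularities.
choice, ent = route({
    "turn": [0.9, 0.1, 0.1, 0.1],     # one clearly relevant memory
    "session": [0.5, 0.5, 0.5, 0.5],  # nothing stands out
})
print(choice)
```

A flat distribution suggests that the granularity carries mostly noise for this query, while a peaked one suggests that a few memories are clearly relevant, which is one way to interpret the balance between information completeness and noise that the abstract describes.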

[423] DocMEdit: Towards Document-Level Model Editing

Li Zeng,Zeming Liu,Chong Feng,Heyan Huang,Yuhang Guo

Main category: cs.CL

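TL;DR: The paper proposes the task of document-level model editing and introduces a new dataset to address the limitation that existing datasets focus only on short outputs.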

Details

Motivation: Existing model editing datasets require only short outputs, overlooking the prevalence of document-level tasks in practice and limiting real-world applicability. Method: Proposes the document-level model editing task and introduces a new dataset featuring document-level inputs and outputs, extrapolation, and multiple facts within a single edit. Result: Experiments show that document-level model editing challenges existing methods. Conclusion: The task helps advance the application of model editing in real-world scenarios. Abstract: Model editing aims to correct errors and outdated knowledge in large language models (LLMs) with minimal cost. Prior research has proposed a variety of datasets to assess the effectiveness of these model editing methods. However, most existing datasets only require models to output short phrases or sentences, overlooking the widespread existence of document-level tasks in the real world and raising doubts about their practical usability. Aimed at addressing this limitation and promoting the application of model editing in real-world scenarios, we propose the task of document-level model editing. To tackle such challenges and enhance model capabilities in practical settings, we introduce DocMEdit, a dataset focused on document-level model editing, characterized by document-level inputs and outputs, extrapolation, and multiple facts within a single edit. We propose a series of evaluation metrics and experiments. The results show that the difficulties in document-level model editing pose challenges for existing model editing methods.

[424] TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization

Dingyu Yao,Bowen Shen,Zheng Lin,Wei Liu,Jian Luan,Bin Wang,Weiping Wang

Main category: cs.CL

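TL;DR: The paper proposes TailorKV, a hybrid compression method that combines quantization and offloading to substantially reduce KV cache memory overhead in generative large language models while maintaining high performance.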

Details

Motivation: The KV cache incurs substantial memory overhead in generative LLMs, and existing approaches (offloading or compression) suffer from high latency or performance degradation. Method: Proposes TailorKV, which selectively loads dominant tokens and quantizes all tokens, combining quantization with offloading in a hardware-friendly inference framework. Result: In long-context evaluations, TailorKV is nearly lossless under aggressive compression and outperforms the state of the art; for example, it serves a 128k context on an RTX 3090 GPU at 82 ms per token. Conclusion: Through complementary compression strategies, TailorKV addresses the memory and performance problems of the KV cache and offers an efficient solution for practical deployment. Abstract: The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight that loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV develops an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations show that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state-of-the-art. Particularly, the Llama-3.1-8B with 128k context can be served within a single RTX 3090 GPU, reaching 82 ms per token during decoding.
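
To see why the KV cache matters at this scale, a back-of-the-envelope calculation helps. The sketch below is not part of TailorKV; it only applies the standard KV cache size formula to the published Llama-3.1-8B configuration (32 layers, 8 key-value heads under grouped-query attention, head dimension 128). The 4-bit figure is a hypothetical illustration that ignores quantization metadata such as scales and zero points, and is not a claim about the bit widths TailorKV uses.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value=16):
    # The factor 2 accounts for storing both keys and values at every layer.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bits_per_value // 8


# Llama-3.1-8B attention configuration.
LLAMA_31_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

context = 128 * 1024  # 128k tokens
fp16_cache = kv_cache_bytes(seq_len=context, **LLAMA_31_8B)
int4_cache = kv_cache_bytes(seq_len=context, bits_per_value=4, **LLAMA_31_8B)

print(f"fp16 KV cache:  {fp16_cache / GIB:.1f} GiB")  # 16.0 GiB
print(f"4-bit KV cache: {int4_cache / GIB:.1f} GiB")  # 4.0 GiB
```

At 16-bit precision the cache alone comes to 16 GiB for a single 128k-token sequence, on top of roughly 16 GB of 16-bit weights for an 8B-parameter model, which exceeds the 24 GB of an RTX 3090. That gap is what motivates combining quantization with offloading.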

[425] Multi-Agent Collaboration via Evolving Orchestration

Yufan Dang,Chen Qian,Xueheng Luo,Jingru Fan,Zihao Xie,Ruijie Shi,Weize Chen,Cheng Yang,Xiaoyin Che,Ye Tian,Xuantang Xiong,Lei Han,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

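TL;DR: Proposes a puppeteer-style paradigm for LLM-based multi-agent collaboration, in which a centralized orchestrator trained with reinforcement learning dynamically schedules agents, improving efficiency and adaptability.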

Details

Motivation: The monolithic nature of current LLMs and static collaboration structures adapt poorly as task complexity and agent numbers grow, leading to coordination overhead and inefficiency. Method: Adopts a puppeteer-style paradigm in which a centralized orchestrator, trained with reinforcement learning, dynamically schedules agents and optimizes their execution order and priority. Result: Experiments show that the method performs well in both closed- and open-domain scenarios at lower computational cost and yields more compact, cyclic reasoning structures. Conclusion: Dynamic orchestration markedly improves the efficiency and adaptability of multi-agent collaboration and offers a new approach to complex problem-solving. Abstract: Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator's evolution.

[426] Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study

Guanyu Hou,Jiaming He,Yinhang Zhou,Ji Guo,Yitong Qiao,Rui Zhang,Wenbo Jiang

Main category: cs.CL

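TL;DR: The study systematically evaluates the robustness of five large audio-language models (LALMs) across four attack scenarios, reveals performance differences among models and the effect of malicious content position on attack success, and offers recommendations for improvement.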

Details

Motivation: Although LALMs are widely deployed in real-world applications, their robustness against malicious audio injection attacks has not been studied thoroughly, so a systematic evaluation is needed. Method: Quantitatively evaluates models across four attack scenarios (audio interference, instruction following, context injection, and judgment hijacking) using metrics such as Defense Success Rate and Context Robustness Score. Result: Experiments show significant performance differences among models, the position of malicious content is critical to attack effectiveness, and instruction-following capability correlates negatively with robustness. Conclusion: The study introduces a benchmark framework and stresses the need to integrate robustness into training and to develop multi-modal defenses and architectural designs that make LALMs more secure. Abstract: Large Audio-Language Models (LALMs) are increasingly deployed in real-world applications, yet their robustness against malicious audio injection attacks remains underexplored. This study systematically evaluates five leading LALMs across four attack scenarios: Audio Interference Attack, Instruction Following Attack, Context Injection Attack, and Judgment Hijacking Attack. Using metrics like Defense Success Rate, Context Robustness Score, and Judgment Robustness Index, their vulnerabilities and resilience were quantitatively assessed. Experimental results reveal significant performance disparities among models; no single model consistently outperforms others across all attack types. The position of malicious content critically influences attack effectiveness, particularly when placed at the beginning of sequences. A negative correlation between instruction-following capability and robustness suggests models adhering strictly to instructions may be more susceptible, contrasting with greater resistance by safety-aligned models. Additionally, system prompts show mixed effectiveness, indicating the need for tailored strategies. This work introduces a benchmark framework and highlights the importance of integrating robustness into training pipelines. Findings emphasize developing multi-modal defenses and architectural designs that decouple capability from susceptibility for secure LALMs deployment.

[427] Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar

Andrew Gambardella,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CL

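TL;DR: The paper proposes a new way to evaluate language models that focuses on their ability to recognize rare grammar points in languages other than English, using the Japanese "first person psych predicate restriction" as an example. It finds that the Weblab model performs well, but that this may be attributable to its poor tokenization.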

Details

Motivation: Existing evaluation methods fail to capture language models' ability to recognize rare grammar points in languages other than English, so finer-grained evaluation is needed. Method: Measures language model perplexity on the Japanese "first person psych predicate restriction" grammar point, analyzes model behavior, and examines how tokenization affects the results. Result: Weblab is the only tested open-source model in the 7-10B parameter range that performs well, but its good performance may be linked to its poor tokenization. Llama 3's perplexity can be reduced 28-fold by restricting test sentences to those with well-behaved tokenizations. Conclusion: Tokenization has a significant effect on language model performance, especially on rare grammar points, and better tokenization can greatly improve results. Abstract: Typical methods for evaluating the performance of language models evaluate their ability to answer questions accurately. These evaluation metrics are acceptable for determining the extent to which language models can understand and reason about text in a general sense, but fail to capture nuanced capabilities, such as the ability of language models to recognize and obey rare grammar points, particularly in languages other than English. We measure the perplexity of language models when confronted with the "first person psych predicate restriction" grammar point in Japanese. Weblab is the only tested open source model in the 7-10B parameter range which consistently assigns higher perplexity to ungrammatical psych predicate sentences than grammatical ones. We give evidence that Weblab's uniformly bad tokenization is a possible root cause for its good performance, and show that Llama 3's perplexity on grammatical psych predicate sentences can be reduced by orders of magnitude (28x difference) by restricting test sentences to those with uniformly well-behaved tokenizations. We show in further experiments on machine translation tasks that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.
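
The comparison at the heart of this evaluation rests on the standard definition of perplexity, the exponential of the negative mean token log-probability. The sketch below shows that definition and the grammatical-versus-ungrammatical test. The function names are hypothetical and the log-probabilities are made-up numbers rather than model outputs; how the paper obtains per-token scores is not described in the abstract.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prefers_grammatical(grammatical_logprobs, ungrammatical_logprobs):
    # A model "passes" a minimal pair if the grammatical sentence is less surprising.
    return perplexity(grammatical_logprobs) < perplexity(ungrammatical_logprobs)


# Illustrative values only: each list holds one log-probability per token.
grammatical = [math.log(0.5), math.log(0.4), math.log(0.5)]
ungrammatical = [math.log(0.5), math.log(0.05), math.log(0.5)]

print(perplexity(grammatical))
print(prefers_grammatical(grammatical, ungrammatical))
```

Because the sum and the count N both depend on how a sentence is split into tokens, the same sentence can receive very different perplexities under different tokenizations, which is why the paper controls for tokenization before comparing models.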

[428] Evaluating Machine Translation Models for English-Hindi Language Pairs: A Comparative Analysis

Ahan Prasannakumar Shetty

Main category: cs.CL

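TL;DR: This paper evaluates several English-Hindi machine translation models with a range of automatic evaluation metrics on a large parallel corpus and a custom FAQ dataset, revealing how the models differ across general and specialized domains.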

Details

Motivation: Machine translation is especially important for languages as different as English and Hindi; the paper evaluates different translation models to inform improvements to translation systems. Method: Comprehensively evaluates several machine translation models on a parallel corpus of more than 18,000 entries and a custom FAQ dataset, using both lexical and machine learning-based metrics. Result: Models perform differently across metrics, revealing the strengths and weaknesses of current translation systems. Conclusion: The study provides a useful reference for improving machine translation models, especially for handling general and specialized language domains. Abstract: Machine translation has become a critical tool in bridging linguistic gaps, especially between languages as diverse as English and Hindi. This paper comprehensively evaluates various machine translation models for translating between English and Hindi. We assess the performance of these models using a diverse set of automatic evaluation metrics, both lexical and machine learning-based metrics. Our evaluation leverages an 18000+ corpus of English Hindi parallel dataset and a custom FAQ dataset comprising questions from government websites. The study aims to provide insights into the effectiveness of different machine translation approaches in handling both general and specialized language domains. Results indicate varying performance levels across different metrics, highlighting strengths and areas for improvement in current translation systems.

[429] Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically

Ryan Soh-Eun Shim,Domenico De Cristofaro,Chengzhi Martin Hu,Alessandro Vietti,Barbara Plank

Main category: cs.CL

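TL;DR: The study examines whether cross-lingual alignment in speech foundation models rests on semantics rather than phonetic similarity, and experimentally confirms semantic alignment and its effect on speech recognition.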

Details

Motivation: Explore the mechanism of cross-lingual alignment in speech foundation models and test whether it depends on semantics rather than phonetic similarity. Method: Uses pronunciation-controlled experiments and a word-level dataset to analyze the semantic and phonetic knowledge behind cross-lingual alignment, and early-exits the encoder to observe semantic errors in speech translation. Result: Cross-lingual alignment persists on a semantic basis even without phonetic cues, and applying an early-exit strategy improves speech recognition accuracy in low-resource languages. Conclusion: Speech foundation models exhibit semantically grounded cross-lingual alignment, a finding that can improve speech recognition for low-resource languages. Abstract: Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech. Building on prior work on spoken translation retrieval, we perform pronunciation-controlled experiments to observe if cross-lingual alignment can indeed occur in such models on a semantic basis, instead of relying on phonetic similarities. Our findings indicate that even in the absence of phonetic cues, spoken translation retrieval accuracy remains relatively stable. We follow up with a controlled experiment on a word-level dataset of cross-lingual synonyms and near-homophones, confirming the existence of both phonetic and semantic knowledge in the encoder. Finally, we qualitatively examine the transcriptions produced by early exiting the encoder, where we observe that speech translation produces semantic errors that are characterized by phonetic similarities to corresponding words in the source language. We apply this insight from early exiting to speech recognition in seven low-resource languages unsupported by the Whisper model, and achieve improved accuracy in all languages examined, particularly for languages with transparent orthographies.

[430] HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices

Silin Li,Yuhang Guo,Jiashu Yao,Zeming Liu,Haifeng Wang

Main category: cs.CL

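TL;DR: The paper introduces HomeBench, a dataset for testing how well LLMs handle invalid and multi-device instructions in smart home scenarios, and finds that current state-of-the-art models perform poorly.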

Details

Motivation: Improve LLM performance in complex smart home scenarios by addressing invalid instructions and multi-device operations. Method: Builds the HomeBench dataset, containing valid and invalid instructions for single and multiple devices, and tests 13 LLMs on it. Result: GPT-4o achieves a 0.0% success rate on invalid multi-device instructions, showing that existing models fall short. Conclusion: Existing LLMs still have room to improve in complex smart home scenarios, and HomeBench provides a benchmark for future research. Abstract: Large language models (LLMs) have the potential to revolutionize smart home assistants by enhancing their ability to accurately understand user needs and respond appropriately, which is extremely beneficial for building a smarter home environment. While recent studies have explored integrating LLMs into smart home systems, they primarily focus on handling straightforward, valid single-device operation instructions. However, real-world scenarios are far more complex and often involve users issuing invalid instructions or controlling multiple devices simultaneously. These pose two main challenges: LLMs must accurately identify and rectify errors in user instructions and execute multiple user instructions perfectly. To address these challenges and advance the development of LLM-based smart home assistants, we introduce HomeBench, the first smart home dataset with valid and invalid instructions across single and multiple devices. We report experimental results on 13 distinct LLMs; e.g., GPT-4o achieves only a 0.0% success rate in the scenario of invalid multi-device instructions, revealing that the existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning, retrieval-augmented generation, and fine-tuning. Our code and dataset are publicly available at https://github.com/BITHLP/HomeBench.

[431] DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

Yichun Feng,Jiawei Wang,Lu Zhou,Yixue Li

Main category: cs.CL

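TL;DR: The paper proposes DoctorAgent-RL, a reinforcement learning-based multi-agent collaborative framework that addresses dynamic decision-making by large language models in clinical consultations and significantly improves multi-turn reasoning and diagnostic performance.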

Details

Motivation: Existing systems rely on one-way information transmission and handle vague symptom descriptions poorly, while traditional supervised learning methods lack generalizability and struggle to extract key clinical information intelligently. Method: Uses a reinforcement learning framework in which a doctor agent optimizes its questioning strategy through multi-turn interactions with a patient agent, adjusting its information-gathering path based on comprehensive rewards from a Consultation Evaluator. Result: Experiments show that DoctorAgent-RL outperforms existing models in multi-turn reasoning and final diagnostic performance; the authors also construct MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Conclusion: Through its reinforcement learning mechanism, DoctorAgent-RL lets LLMs autonomously develop interaction strategies aligned with clinical reasoning, with practical value for assisting clinical consultations. Abstract: Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Existing systems rely on a one-way information transmission mode where patients must fully describe their symptoms in a single round, leading to nonspecific diagnostic recommendations when complaints are vague. Traditional multi-turn dialogue methods based on supervised learning are constrained by static data-driven paradigms, lacking generalizability and struggling to intelligently extract key clinical information. To address these limitations, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that DoctorAgent-RL outperforms existing models in both multi-turn reasoning capability and final diagnostic performance, demonstrating practical value in assisting clinical consultations. https://github.com/JarvisUSTC/DoctorAgent-RL

[432] Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models

Zihong Zhang,Liqi He,Zuchao Li,Lefei Zhang,Hai Zhao,Bo Du

Main category: cs.CL

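TL;DR: Proposes LLACA, an unsupervised word segmentation framework based on large language models (LLMs) and combined with an Aho-Corasick automaton, which significantly outperforms traditional methods.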

Details

Motivation: Explore the potential of LLMs for unsupervised word segmentation and evaluate their semantic understanding. Method: Uses mainstream LLMs for multilingual word segmentation and proposes LLACA, which combines an Aho-Corasick automaton with the deep understanding of LLMs. Result: LLMs can segment text by following simple prompts, and models with more parameters perform better; LLACA significantly outperforms traditional methods. Conclusion: LLACA demonstrates the potential of LLMs for unsupervised word segmentation and opens a new direction for future research. Abstract: Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic n-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at https://github.com/hkr04/LLACA
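
For readers unfamiliar with the automaton that LLACA builds on, the following is a minimal sketch of the classical Aho-Corasick algorithm, which finds every occurrence of every vocabulary word in a single pass over the text. It is not the LLACA implementation: how LLACA derives its vocabulary from LLM output and how it scores segmentations with its n-gram model are not described in the abstract, so the class and method names here are hypothetical and only the textbook automaton is shown.

```python
from collections import deque


class AhoCorasick:
    """Classical multi-pattern matcher: a trie plus failure links."""

    def __init__(self, words):
        self.goto = [{}]   # goto[node][char] -> child node
        self.fail = [0]    # failure link of each node
        self.out = [[]]    # words that end at each node (including via failure links)
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep their failure link at the root.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches from the longest proper suffix that is also a trie path.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find_all(self, text):
        """Return (start_index, word) for every vocabulary word occurring in text."""
        node = 0
        matches = []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                matches.append((i - len(word) + 1, word))
        return matches


automaton = AhoCorasick(["he", "she", "his", "hers"])
print(automaton.find_all("ushers"))
```

In a segmentation setting, the matches give the candidate words that start at each position; choosing among overlapping candidates is then left to a scoring model, which is the role the abstract assigns to the dynamic n-gram model.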

[433] Faster and Better LLMs via Latency-Aware Test-Time Scaling

Zili Wang,Tianyu Zhang,Haoli Bai,Lu Hou,Xianzhi Yu,Wulong Liu,Shiming Xiang,Lei Zhu

Main category: cs.CL

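TL;DR: Proposes a latency-aware test-time scaling (TTS) method that reaches latency-optimal TTS by optimizing concurrency configurations (branch-wise and sequence-wise parallelism), significantly improving model performance in latency-sensitive scenarios.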

Details

Motivation: Existing research overlooks the efficiency of TTS in latency-sensitive scenarios; compute-optimal TTS does not necessarily achieve the lowest latency. Method: Two approaches to optimizing concurrency configurations are proposed, branch-wise parallelism and sequence-wise parallelism (based on speculative decoding), with computational resources allocated appropriately between them. Result: A 32B model reaches 82.3% accuracy on MATH-500 within 1 minute, and a 3B model reaches 72.4% within 10 seconds. Conclusion: Latency-aware TTS can deliver both speed and accuracy in latency-sensitive scenarios. Abstract: Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches by optimizing the concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.

[434] Interleaved Reasoning for Large Language Models via Reinforcement Learning

Roy Xie,David Qiu,Deepak Gopinath,Dong Lin,Yanchao Sun,Chong Wang,Saloni Potdar,Bhuwan Dhingra

Main category: cs.CL

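TL;DR: Proposes a reinforcement-learning-based training paradigm in which the model interleaves thinking and answering on multi-hop questions, significantly improving the reasoning efficiency of large language models, reducing time-to-first-token (TTFT), and improving accuracy.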

Details

Motivation: Long chain-of-thought (CoT) strengthens the reasoning of language models, but lengthy reasoning traces cause inefficiency and increase time-to-first-token. Method: Reinforcement learning (RL) guides the model to interleave thinking and answering, with a rule-based reward designed to incentivize correct intermediate steps. Result: Experiments show the method reduces TTFT by over 80% on average, improves Pass@1 accuracy by up to 19.3%, and generalizes well to complex reasoning datasets. Conclusion: Without relying on external tools, the method significantly improves reasoning efficiency and accuracy, and the analysis yields valuable insights into conditional reward modeling. Abstract: Long chain-of-thought (CoT) significantly enhances large language models' (LLM) reasoning capabilities. However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward to incentivize correct intermediate steps, which guides the policy model toward correct reasoning paths by leveraging intermediate signals generated during interleaved reasoning. Extensive experiments conducted across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer reasoning, without requiring external tools. Specifically, our approach reduces TTFT by over 80% on average and improves up to 19.3% in Pass@1 accuracy. Furthermore, our method, trained solely on question answering and logical reasoning datasets, exhibits strong generalization ability to complex reasoning datasets such as MATH, GPQA, and MMLU. Additionally, we conduct in-depth analysis to reveal several valuable insights into conditional reward modeling.

Xiaochuan Liu,Ruihua Song,Xiting Wang,Xu Chen

Main category: cs.CL

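TL;DR: Proposes a multi-agent framework for automatic related work generation in which a selector, a reader, and a writer collaborate, combined with graph-aware strategies, significantly improving generation quality.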

Details

Motivation: Existing related work generation methods take limited input and fail to capture relationships among references, producing shallow and isolated content. Method: A multi-agent framework consisting of a selector, a reader, and a writer, with graph-aware strategies that optimize the reading order. Result: Experiments show the framework performs well across multiple base models and input configurations, and the graph-aware selectors achieve state-of-the-art results. Conclusion: With full-text input and graph-aware strategies, the framework significantly improves the depth and coherence of generated related work. Abstract: Automatic related work generation (RWG) can save people's time and effort when writing a draft of related work section (RWS) for further revision. However, existing methods for RWG always suffer from shallow comprehension due to taking only limited portions of reference papers as input and isolated explanation for each reference due to ineffective capture of the relationships among them. To address these issues, we focus on full-text-based RWG task and propose a novel multi-agent framework. Our framework consists of three agents: a selector that decides which section of the papers to read next, a reader that digests the selected section and updates a shared working memory, and a writer that generates RWS based on the final curated memory. To better capture the relationships among references, we also propose two graph-aware strategies for selector, enabling it to optimize the reading order under the constraints of the graph structure. Extensive experiments demonstrate that our framework consistently improves performance across three base models and various input configurations. The graph-aware selectors outperform alternative selectors, achieving state-of-the-art results. The code and data are available at https://github.com/1190200817/Full_Text_RWG.

[436] GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models

Tingjia Shen,Hao Wang,Chuan Qin,Ruijun Sun,Yang Song,Defu Lian,Hengshu Zhu,Enhong Chen

Main category: cs.CL

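TL;DR: The GenKI framework improves OpenQA performance through knowledge integration and controllable generation, addressing the challenges of knowledge integration and answer-format adaptation in LLMs.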

Details

Motivation: OpenQA is central to NLP, but LLM-based methods face challenges in integrating knowledge and adapting to required answer formats. Method: GenKI combines dense passage retrieval, a knowledge integration model, and controllable generation, improving performance by fine-tuning LLMs. Result: Strong performance on the TriviaQA, MSMARCO, and CMRC2018 datasets; the frequency of retrieved knowledge is linearly related to the model's ability to recall knowledge. Conclusion: GenKI effectively addresses key problems in OpenQA and offers a new approach to applying LLMs. Abstract: Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve the OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieval knowledge into instructions during fine-tuning to intensify the model. Furthermore, to enable controllable generation in LLMs, we leverage a certain fine-tuned LLM and an ensemble based on text consistency incorporating all coherence, fluency, and answer format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the effectiveness of GenKI with comparison of state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model's ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at https://github.com/USTC-StarTeam/GenKI

[437] LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation

Weikang Yuan,Kaisong Song,Zhuoren Jiang,Junjie Cao,Yujie Zhang,Jun Lin,Kun Kuang,Ji Zhang,Xiaozhong Liu

Main category: cs.CL

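TL;DR: Introduces LeCoDe, a multi-turn legal consultation dialogue dataset for evaluating and improving the legal consultation capability of large language models, together with a comprehensive evaluation framework.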

Details

Motivation: Legal consultation is costly and hard to access, and existing large language models fall short on the interactive and knowledge-intensive aspects of consultation. Method: Live-streamed consultations are collected from short-video platforms to build the LeCoDe dataset, which is annotated by legal experts; a comprehensive evaluation framework with 12 metrics is proposed. Result: Experiments show that even advanced models such as GPT-4 are limited in clarification capability and professional advice quality (39.8% recall for clarification and a 59% overall score for advice quality). Conclusion: The LeCoDe dataset and evaluation framework support research on legal-domain dialogue systems and advance the simulation of real user-expert interactions. Abstract: Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet remains costly and inaccessible to many individuals due to the shortage of professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs' legal consultation capability. With LeCoDe, we innovatively collect live-streamed consultations from short-video platforms, providing authentic multi-turn legal consultation dialogues. The rigorous annotation by legal experts further enhances the dataset with professional insights and expertise. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs' consultation capabilities in terms of (1) clarification capability and (2) professional advice quality. This unified framework incorporates 12 metrics across two dimensions. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task, with even state-of-the-art models like GPT-4 achieving only 39.8% recall for clarification and 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs' legal consultation abilities. Our benchmark contributes to advancing research in legal domain dialogue systems, particularly in simulating more real-world user-expert interactions.

[438] Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models

Hao Yang,Lizhen Qu,Ehsan Shareghi,Gholamreza Haffari

Main category: cs.CL

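TL;DR: Proposes an unsupervised safety fine-tuning strategy that strengthens the safety alignment of Large Audio Language Models (LALMs) while avoiding over-rejection.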

Details

Motivation: Existing LALMs are insufficiently safety-aligned, dedicated audio-safety datasets and strategies are lacking, and conventional supervised fine-tuning struggles to balance safety and helpfulness. Method: An unsupervised safety fine-tuning strategy that reshapes the model's representation space to improve safety alignment, evaluated under three input modalities (audio-text, text-only, and audio-only). Result: Experiments show the method significantly improves LALM safety while increasing the over-rejection rate by only 0.88% on average. Conclusion: The proposed unsupervised fine-tuning strategy effectively addresses LALM safety with little impact on helpfulness. Abstract: Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety-alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety dataset specifically targeting LALMs are notably absent. Meanwhile defence measures based on Supervised Fine-tuning (SFT) struggle to address safety improvement while avoiding over-rejection issues, significantly compromising helpfulness. In this work, we propose an unsupervised safety-fine-tuning strategy as remedy that reshapes model's representation space to enhance existing LALMs safety-alignment while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves LALMs safety under three modality input conditions (audio-text, text-only, and audio-only) while increasing over-rejection rate by only 0.88% on average. Warning: this paper contains harmful examples.

[439] Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations

Chaoyi Xiang,Chunhua Liu,Simon De Deyne,Lea Frermann

Main category: cs.CL

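TL;DR: Proposes assessing the moral values of large language models (LLMs) through word association analysis, which avoids the limitations of direct prompting, and compares moral concepts between English-speaking communities and LLMs.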

Details

Motivation: As the influence of large language models grows, understanding the moral values they reflect becomes increasingly important. Direct prompting suffers from leakage of human norms into training data and from prompt sensitivity, so a more robust method is needed. Method: Word associations are used as low-level representations; a dataset of LLM-generated word associations is built, and a new method propagates moral values through the association graphs starting from seed words derived from Moral Foundation Theory. Result: The study finds detailed but systematic differences in moral concepts between English-speaking communities and LLMs. Conclusion: Word associations offer a new perspective on LLM moral reasoning and reveal differences between model and human moral values. Abstract: As the impact of large language models increases, understanding the moral values they reflect becomes ever more important. Assessing the nature of moral values as understood by these models via direct prompting is challenging due to potential leakage of human norms into model training data, and their sensitivity to prompt formulation. Instead, we propose to use word associations, which have been shown to reflect moral reasoning in humans, as low-level underlying representations to obtain a more robust picture of LLMs' moral reasoning. We study moral differences in associations from western English-speaking communities and LLMs trained predominantly on English data. First, we create a large dataset of LLM-generated word associations, resembling an existing data set of human word associations. Next, we propose a novel method to propagate moral values based on seed words derived from Moral Foundation Theory through the human and LLM-generated association graphs. Finally, we compare the resulting moral conceptualizations, highlighting detailed but systematic differences between moral values emerging from English speakers and LLM associations.

Liqin Ye,Agam Shah,Chao Zhang,Sudheer Chava

Main category: cs.CL

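TL;DR: Proposes SiDyP, which calibrates classifier predictions via simplex label diffusion with a dynamic prior, improving robustness to LLM-generated noisy labels.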

Details

Motivation: Conventionally annotated datasets are expensive; LLM-generated labels offer an alternative, but their noise harms model generalization. Method: SiDyP retrieves potential true label candidates from the neighborhood label distribution in text embedding space and iteratively refines noisy candidates with a simplex diffusion model. Result: On zero-shot and few-shot LLM-generated noisy label datasets, SiDyP improves BERT classifier performance by an average of 7.21% and 7.30%, respectively. Conclusion: SiDyP effectively improves robustness to LLM-generated noisy labels and applies to a variety of NLP tasks. Abstract: The traditional process of creating labeled datasets is labor-intensive and expensive. Recent breakthroughs in open-source large language models (LLMs) have opened up a new avenue in generating labeled datasets automatically for various natural language processing (NLP) tasks, providing an alternative to such an expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from noisy labels, the model's generalization is likely to be harmed as it is prone to overfit to those label noises. While previous studies in learning from noisy labels mainly focus on synthetic noise and real-world noise, LLM-generated label noise receives less attention. In this paper, we propose SiDyP: Simplex Label Diffusion with Dynamic Prior to calibrate the classifier's prediction, thus enhancing its robustness towards LLM-generated noisy labels. SiDyP retrieves potential true label candidates by neighborhood label distribution in text embedding space and iteratively refines noisy candidates using a simplex diffusion model. Our framework can increase the performance of the BERT classifier fine-tuned on both zero-shot and few-shot LLM-generated noisy label datasets by an average of 7.21% and 7.30% respectively. We demonstrate the effectiveness of SiDyP by conducting extensive benchmarking for different LLMs over a variety of NLP tasks. Our code is available on GitHub.

[441] Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

Hao Fang,Changle Zhou,Jiawei Kong,Kuofeng Gao,Bin Chen,Tao Liang,Guojun Ma,Shu-Tao Xia

Main category: cs.CL

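TL;DR: Proposes a decoding strategy based on Conditional Pointwise Mutual Information (C-PMI) that strengthens the mutual dependency between generated text and the input image to reduce hallucinations in Large Vision-Language Models (LVLMs).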

Details

Motivation: LVLMs over-rely on language priors and disregard visual information during decoding, producing responses unrelated to the input image. Method: A C-PMI calibrated decoding strategy that jointly models the contributions of visual and textual tokens and formulates hallucination mitigation as a bi-level optimization problem, with a token purification mechanism that dynamically regulates decoding. Result: Across multiple benchmarks, the method significantly reduces LVLM hallucinations while preserving decoding efficiency. Conclusion: The C-PMI calibrated decoding strategy effectively mitigates hallucinations in LVLMs and offers a new direction for future research. Abstract: Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
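
For readers unfamiliar with the quantity, the textbook definition of conditional pointwise mutual information is given below for a response token $y_t$, an image $v$, and a text context $c$ (prompt plus previously generated tokens). The symbols are illustrative notation chosen here, and the paper's exact formulation may differ.

```latex
\mathrm{PMI}(y_t ; v \mid c)
  = \log \frac{p(y_t, v \mid c)}{p(y_t \mid c)\, p(v \mid c)}
  = \log \frac{p(y_t \mid v, c)}{p(y_t \mid c)}
```

The second form follows from $p(y_t, v \mid c) = p(y_t \mid v, c)\, p(v \mid c)$. It is large when conditioning on the image makes the token much more likely than the language context alone does, which is the kind of image dependence the abstract describes strengthening.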

[442] KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization

Zhaolin Li,Yining Liu,Danni Liu,Tuan Nam Nguyen,Enes Yavuz Ugan,Tu Anh Dinh,Carlos Mullov,Alexander Waibel,Jan Niehues

Main category: cs.CL

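TL;DR: The KIT team submitted cascaded systems based on ASR and MT as well as end-to-end speech translation systems to the IWSLT 2025 low-resource track, studied enhancements with synthetic data and model regularization, and validated them on several language pairs.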

Details

Motivation: To explore how to use pre-trained models and synthetic data efficiently to improve speech translation for low-resource language pairs (Bemba, North Levantine Arabic, and Tunisian Arabic into English). Method: Pre-trained models are fine-tuned with different strategies, using synthetic data augmentation (MT-generated translations and TTS-generated speech), model regularization (intra-distillation), and Minimum Bayes Risk decoding. Result: For North Levantine, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data; intra-distillation consistently improves ASR, MT, and ST results, and Minimum Bayes Risk decoding adds approximately 1.5 BLEU points. Conclusion: Synthetic data and model regularization are effective for low-resource speech translation, and combining multiple strategies improves system performance. Abstract: This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
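
Minimum Bayes Risk decoding, used above to combine systems, can be sketched as follows: pool candidate translations, then pick the one with the highest average similarity to the others. This is a generic illustration, not KIT's implementation. The unigram-F1 utility and the names `unigram_f1` and `mbr_select` are assumptions made for the example; real systems typically use a stronger metric as the utility.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Toy utility: F1 over whitespace tokens (stand-in for a real MT metric)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates, utility=unigram_f1):
    """Return the candidate with the highest mean utility against the others."""
    if len(candidates) < 2:
        return candidates[0]

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

# Candidates could be pooled from a cascaded and an end-to-end system.
pool = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran away",
]
choice = mbr_select(pool)
```

Here the second candidate wins because it overlaps with both of the others, while the first shares nothing with the third.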

[443] Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

Yi Liu,Dianqing Liu,Mingye Zhu,Junbo Guo,Yongdong Zhang,Zhendong Mao

Main category: cs.CL

TL;DR: Proposes a new method named RAM that performs LLM alignment via importance sampling, improving flexibility and scalability.

Details

Motivation: Traditional alignment methods require retraining large pretrained models, making it hard to adapt quickly to diverse application needs. Method: Alignment is formalized as importance sampling, with the unaligned upstream model as the proposal distribution and the alignment module as an estimator of the importance weights. Result: Experiments across multiple tasks show that RAM outperforms baseline models. Conclusion: RAM offers an efficient and flexible solution for LLM alignment. Abstract: The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel \textit{Residual Alignment Model} (\textit{RAM}) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.
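
The proposal-plus-weights idea can be illustrated with generic sampling-importance-resampling: draw candidates from a proposal, weight each one, and resample in proportion to the weights. The sketch below is not RAM's token-level algorithm. The function name `importance_resample` and the toy weight function are assumptions; in the paper's setting the candidates would come from the unaligned model and the weights from the alignment module.

```python
import random

def importance_resample(candidates, weight_fn, k=1, rng=None):
    """Resample k items from proposal draws in proportion to their weights."""
    rng = rng or random.Random()
    weights = [weight_fn(c) for c in candidates]
    if sum(weights) <= 0:
        raise ValueError("all importance weights are zero")
    # random.choices samples with replacement using relative weights.
    return rng.choices(candidates, weights=weights, k=k)

# Toy example: only one proposal draw has nonzero weight.
draws = ["response a", "response b", "response c"]
picked = importance_resample(
    draws,
    weight_fn=lambda c: 1.0 if c == "response b" else 0.0,
    k=5,
    rng=random.Random(0),
)
```

Because the weighting step is separate from the proposal, the component that supplies `weight_fn` can be swapped without touching whatever produced the draws, which mirrors the detachment the abstract emphasizes.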

[444] Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision

Tej Deep Pala,Panshul Sharma,Amir Zadeh,Chuan Li,Soujanya Poria

Main category: cs.CL

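TL;DR: PathFinder-PRM is a new hierarchical, error-aware discriminative process reward model that classifies math and consistency errors and combines these fine-grained signals to estimate step correctness, significantly improving performance and data efficiency on mathematical reasoning tasks.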

Details

Motivation: LLMs are prone to hallucination on multi-hop and reasoning-intensive tasks such as mathematical problem solving, and existing outcome reward models verify only the final answer and cannot guide the generation of intermediate steps. Method: PathFinder-PRM classifies math and consistency errors at each step and combines these fine-grained signals to estimate step correctness; it is trained on a 400K-sample dataset. Result: On PRMBench, PathFinder-PRM sets a new state-of-the-art PRMScore of 67.7 while using 3 times less data; in reward-guided greedy search, it reaches prm@8 of 48.3, 1.5 points above the strongest baseline. Conclusion: Decoupling error detection from reward estimation improves fine-grained error detection and substantially improves end-to-end reward-guided mathematical reasoning, with greater data efficiency. Abstract: Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a novel hierarchical, error-aware discriminative PRM that first classifies math and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching the human-annotated PRM800K corpus and RLHFlow Mistral traces with three-dimensional step-level labels. On PRMBench, PathFinder-PRM achieves a new state-of-the-art PRMScore of 67.7, outperforming the prior best (65.5) while using 3 times less data. When applied to reward guided greedy search, our model yields prm@8 48.3, a +1.5 point gain over the strongest baseline. These results demonstrate that decoupled error detection and reward estimation not only boost fine-grained error detection but also substantially improve end-to-end, reward-guided mathematical reasoning with greater data efficiency.
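
Reward-guided greedy search can be sketched as a loop that proposes several candidate next steps, scores each with a step-level reward model, and keeps the best. This is a generic illustration rather than the paper's evaluation code. `propose` and `score` are hypothetical callables standing in for a generator and a PRM, and the default of 8 candidates is an assumption chosen to echo the prm@8 metric name.

```python
def reward_guided_greedy_search(propose, score, max_steps, n_candidates=8):
    """Build a solution step by step, keeping the highest-scoring candidate.

    propose(steps, n) -> list of candidate next steps (hypothetical generator)
    score(steps, candidate) -> float (hypothetical step-level reward model)
    """
    steps = []
    for _ in range(max_steps):
        candidates = propose(steps, n_candidates)
        if not candidates:
            break  # generator signals that the solution is complete
        best = max(candidates, key=lambda c: score(steps, c))
        steps.append(best)
    return steps

# Toy stand-ins: candidates are integers and the reward is the value itself.
def toy_propose(steps, n):
    return [len(steps) * 10 + i for i in range(n)]

def toy_score(steps, candidate):
    return float(candidate)

path = reward_guided_greedy_search(toy_propose, toy_score, max_steps=2, n_candidates=3)
```

Since the search commits to one step at a time, its quality depends entirely on how well the step scores track real correctness.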

[445] MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning

Zhaopeng Feng,Yupu Liang,Shaosheng Cao,Jiayuan Su,Jiahan Ren,Zhe Xu,Yao Hu,Wenxuan Huang,Jian Wu,Zuozhu Liu

Main category: cs.CL

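TL;DR: MT$^{3}$ is the first framework to apply multi-task reinforcement learning to multimodal large language models (MLLMs) for end-to-end text image machine translation (TIMT), and it achieves leading performance on multiple benchmarks.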

Details

Motivation: TIMT is critical for accessibility, cross-lingual information access, and real-world document understanding, but existing approaches are mostly multi-stage cascades that are complex and inefficient. Method: MT$^{3}$ adopts a multi-task optimization paradigm targeting three sub-skills (text recognition, context-aware reasoning, and translation) and is trained with a rule-based multi-mixed reward mechanism. Result: MT$^{3}$-7B-Zero outperforms baselines such as Qwen2.5-VL-72B on the MIT-10M benchmark and generalizes well to out-of-distribution language pairs and datasets. Conclusion: Multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation together advance MLLM-driven TIMT. Abstract: Text Image Machine Translation (TIMT)-the task of translating textual content embedded in images-is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT's intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduced XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.

[446] Graceful Forgetting in Generative Language Models

Chunyang Jiang,Chi-min Chan,Yiyang Cai,Yulong Liu,Wei Xue,Yike Guo

Main category: cs.CL

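TL;DR: Proposes a new framework named LWF for graceful forgetting in generative language models, improving fine-tuning performance by selectively discarding irrelevant knowledge.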

Details

Motivation: The pretrain-finetune paradigm is broadly effective, but some pre-trained knowledge can harm fine-tuning tasks (negative transfer), and graceful forgetting as a remedy remains underexplored in generative language models. Method: The LWF framework weights parameter updates with the Fisher Information Matrix, computes forgetting confidence to evaluate self-generated knowledge, and periodically unlearns high-confidence knowledge. Result: Experiments show that, although the mechanisms of knowledge interaction are not fully uncovered, graceful forgetting can improve fine-tuning performance. Conclusion: LWF provides an effective approach to graceful forgetting in generative language models and improves fine-tuning results. Abstract: Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While in general the pre-trained model would promote both effectiveness and efficiency of downstream tasks fine-tuning, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of the knowledge may actually bring detrimental effects to the fine-tuning tasks, which is also known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task, and consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.

[447] Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking

Yihao Ai,Zhiyuan Ning,Weiwei Dai,Pengfei Wang,Yi Du,Wenjuan Cui,Kunpeng Liu,Yuanchun Zhou

Main category: cs.CL

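TL;DR: The RPDR framework combines closed-source and open-source LLMs, generating training data and fine-tuning an open-source model to address low-resource biomedical entity linking while avoiding high cost and stability issues.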

Details

Motivation: Traditional supervised methods need large amounts of annotated data, and closed-source LLMs are costly and unstable; RPDR aims to address both. Method: RPDR prompts a closed-source LLM to generate training data and fine-tunes an open-source LLM to re-rank candidates, distilling knowledge into the open-source model. Result: On two datasets, RPDR improves Acc@1 in low-resource settings (by 0.019 on Aier and 0.036 on Ask A Patient). Conclusion: RPDR shows superiority and generalizability in low-resource scenarios. Abstract: Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer, limiting their usage in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address these but risk stability issues and high economic costs: using these models is restricted by commercial companies and brings significant economic costs when dealing with large amounts of data. To address this, we propose "RPDR", a framework combining closed-source LLMs and open-source LLMs for re-ranking candidates retrieved by a retriever fine-tuned with a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and fine-tuning an open-source LLM for re-ranking, we effectively distill the knowledge to the open-source LLM that can be deployed locally, thus avoiding the stability issues and the problem of high economic costs. We evaluate RPDR on two datasets, including one real-world dataset and one publicly available dataset involving two languages: Chinese and English. RPDR achieves 0.019 Acc@1 improvement and 0.036 Acc@1 improvement on the Aier dataset and the Ask A Patient dataset when the amount of training data is not enough. The results demonstrate the superiority and generalizability of the proposed framework.

[448] Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models

Yang Zhang,Yu Yu,Bo Tang,Yu Zhu,Chuxiong Sun,Wenqiang Wei,Jie Hu,Zipeng Xie,Zhiyu Li,Feiyu Xiong,Edward Chung

Main category: cs.CL

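TL;DR: Proposes MARA, a lightweight alignment method that decomposes sentence-level preference learning into token-level binary classification, significantly reducing computational cost while improving performance.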

Details

Motivation: With the rapid development of LLMs, aligning them with human preferences and values efficiently and at low cost has become a key problem. Method: MARA uses a compact three-layer fully-connected network to classify candidate tokens as "Accepted" or "Rejected", operating independently of the language model. Result: Experiments on seven different LLMs and three open-source datasets show that MARA significantly improves alignment performance while reducing computational cost. Conclusion: MARA offers an efficient, lightweight solution for language model alignment. Abstract: With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning on LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose Micro token-level Accept-Reject Aligning (MARA) approach designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are "Accepted" or "Rejected" as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs.

[449] NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

Ruisheng Cao,Hanchong Zhang,Tiancheng Huang,Zhangyi Kang,Yuxin Zhang,Liangtai Sun,Hanqi Li,Yuxun Miao,Shuai Fan,Lu Chen,Kai Yu

Main category: cs.CL

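TL;DR: NeuSym-RAG is a hybrid framework that combines neural and symbolic retrieval, using multi-view chunking and schema-based parsing to improve retrieval over PDF content and significantly improve question answering performance.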

Details

Motivation: To address two problems in existing retrieval augmented generation (RAG) methods: neural and symbolic retrieval are kept separate, and single-view chunking neglects PDF structure. Method: The NeuSym-RAG framework combines neural and symbolic retrieval and uses multi-view chunking and schema-based parsing to organize PDF content into a relational database and a vectorstore. Result: On three PDF question answering datasets, NeuSym-RAG consistently outperforms vector-based RAG and structured baselines. Conclusion: NeuSym-RAG unifies the two retrieval schemes and exploits multi-view chunking to improve performance; code and data are publicly available. Abstract: The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both the relational database and vectorstore, enabling LLM agents to iteratively gather context until sufficient to generate answers. Experiments on three full PDF-based QA datasets, including a self-annotated one AIRQA-REAL, show that NeuSym-RAG stably defeats both the vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views. Code and data are publicly available at https://github.com/X-LANCE/NeuSym-RAG.

[450] Efficient Reasoning via Chain of Unconscious Thought

Ruihan Gong,Yue Liu,Wenjie Qu,Mingzhe Du,Yufei He,Yingwei Ma,Yulin Chen,Xiang Liu,Yi Wen,Xinfeng Li,Ruidong Wang,Xinzhong Zhu,Bryan Hooi,Jiaheng Zhang

Main category: cs.CL

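TL;DR: The paper proposes a new reasoning paradigm, CoUT, that improves the token efficiency of large reasoning models by mimicking human unconscious thought, cutting token usage by 47.62% while maintaining performance.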

Details

Motivation: Large Reasoning Models (LRMs) perform well, but their reasoning is verbose and token-inefficient. Inspired by Unconscious Thought Theory (UTT), the authors aim to improve efficiency by internalizing the reasoning process. Method: Chain of Unconscious Thought (CoUT) first prompts the model to internalize its reasoning in the hidden layer, then applies a set of token-efficient strategies to cut unnecessary tokens. Result: Experiments show that CoUT is effective, reducing token usage by 47.62% with accuracy comparable to CoT. Conclusion: Models may possess beneficial unconscious thought, allowing efficiency gains without sacrificing performance. Abstract: Large Reasoning Models (LRMs) achieve promising performance but compromise token efficiency due to verbose reasoning processes. Unconscious Thought Theory (UTT) posits that complex problems can be solved more efficiently through internalized cognitive processes. Inspired by UTT, we propose a new reasoning paradigm, termed Chain of Unconscious Thought (CoUT), to improve the token efficiency of LRMs by guiding them to mimic human unconscious thought and internalize reasoning processes. Concretely, we first prompt the model to internalize the reasoning by thinking in the hidden layer. Then, we design a bag of token-efficient strategies to further help models reduce unnecessary tokens yet preserve the performance. Our work reveals that models may possess beneficial unconscious thought, enabling improved efficiency without sacrificing performance. Extensive experiments demonstrate the effectiveness of CoUT. Remarkably, it surpasses CoT by reducing token usage by 47.62% while maintaining comparable accuracy, as shown in Figure 1. The code of CoUT is available at this link: https://github.com/Rohan-GRH/CoUT

[451] SGM: A Framework for Building Specification-Guided Moderation Filters

Masoomali Fatehkia,Enes Altinisik,Husrev Taha Sencar

Main category: cs.CL

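TL;DR: The SGM framework trains content moderation filters from user-defined specifications, supporting diverse alignment goals while performing on par with existing safety filters.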

Details

Motivation: Existing content moderation filters usually focus only on safety, cannot meet diverse deployment needs, and depend on human-annotated data. Method: The SGM framework automates training data generation and trains filters from user-defined specifications. Result: SGM-trained filters perform on par with existing safety filters while supporting fine-grained, user-defined alignment control. Conclusion: SGM offers a flexible and scalable solution for the diverse alignment needs of LLMs. Abstract: Aligning large language models (LLMs) with deployment-specific requirements is critical but inherently imperfect. Despite extensive training, models remain susceptible to misalignment and adversarial inputs such as jailbreaks. Content moderation filters are commonly used as external safeguards, though they typically focus narrowly on safety. We introduce SGM (Specification-Guided Moderation), a flexible framework for training moderation filters grounded in user-defined specifications that go beyond standard safety concerns. SGM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals. SGM-trained filters perform on par with state-of-the-art safety filters built on curated datasets, while supporting fine-grained and user-defined alignment control.

[452] T^2Agent: A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search

Xing Cui,Yueying Zou,Zekun Li,Peipei Li,Xinyuan Xu,Xuannan Liu,Huaibo Huang,Ran He

Main category: cs.CL

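TL;DR: T2Agent is a new multimodal misinformation detection agent that combines an extensible toolkit with Monte Carlo Tree Search (MCTS), improving detection through dynamic evidence collection and multi-source verification.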

Details

Motivation: Real-world multimodal misinformation comes from complex and varied sources; existing methods rely on static pipelines and limited tools and struggle with this complexity. Method: T2Agent combines a modular toolkit (web search, forgery detection, consistency analysis) with MCTS, uses a Bayesian optimization-based selector to choose a task-relevant subset of tools, and extends MCTS with multi-source verification and a dual reward mechanism. Result: Experiments show that T2Agent consistently outperforms existing baselines on mixed-source multimodal misinformation detection. Conclusion: As a training-free approach, T2Agent improves detection accuracy and shows strong potential for broader use. Abstract: Real-world multimodal misinformation often arises from mixed forgery sources, requiring dynamic reasoning and adaptive verification. However, existing methods mainly rely on static pipelines and limited tool usage, limiting their ability to handle such complexity and diversity. To address this challenge, we propose T2Agent, a novel misinformation detection agent that incorporates an extensible toolkit with Monte Carlo Tree Search (MCTS). The toolkit consists of modular tools such as web search, forgery detection, and consistency analysis. Each tool is described using standardized templates, enabling seamless integration and future expansion. To avoid inefficiency from using all tools simultaneously, a Bayesian optimization-based selector is proposed to identify a task-relevant subset. This subset then serves as the action space for MCTS to dynamically collect evidence and perform multi-source verification. To better align MCTS with the multi-source nature of misinformation detection, T2Agent extends traditional MCTS with multi-source verification, which decomposes the task into coordinated subtasks targeting different forgery sources. A dual reward mechanism containing a reasoning trajectory score and a confidence score is further proposed to encourage a balance between exploration across mixed forgery sources and exploitation for more reliable evidence. We conduct ablation studies to confirm the effectiveness of the tree search mechanism and tool usage. Extensive experiments further show that T2Agent consistently outperforms existing baselines on challenging mixed-source multimodal misinformation benchmarks, demonstrating its strong potential as a training-free approach for enhancing detection accuracy. The code will be released.

[453] What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs

Sangyeop Kim,Yohan Lee,Yongwoo Song,Kimin Lee

Main category: cs.CL

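TL;DR: The study finds that context length is the primary driver of safety vulnerabilities in large language models (LLMs) under many-shot attacks; even simple repetitive or random text can bypass safety measures.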

Details

Motivation: To investigate the safety vulnerabilities of LLMs in long contexts (up to 128K tokens) and expose fundamental limits of their safety mechanisms. Method: Experimental analysis across a range of many-shot attack settings that vary instruction style, shot density, topic, and format. Result: Context length is the key factor in attack effectiveness, and safety measures can be bypassed without carefully crafted harmful content. Conclusion: The context-expansion capabilities of LLMs carry significant safety risks, and new safety mechanisms are needed. Abstract: We investigate long-context vulnerabilities in Large Language Models (LLMs) through Many-Shot Jailbreaking (MSJ). Our experiments utilize context lengths of up to 128K tokens. Through comprehensive analysis with various many-shot attack settings with different instruction styles, shot density, topic, and format, we reveal that context length is the primary factor determining attack effectiveness. Critically, we find that successful attacks do not require carefully crafted harmful content. Even repetitive shots or random dummy text can circumvent model safety measures, suggesting fundamental limitations in long-context processing capabilities of LLMs. The safety behavior of well-aligned models becomes increasingly inconsistent with longer contexts. These findings highlight significant safety gaps in context expansion capabilities of LLMs, emphasizing the need for new safety mechanisms.

[454] Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification

Akram Elbouanani,Evan Dufraisse,Adrian Popescu

Main category: cs.CL

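TL;DR: The paper proposes a new method, based on inconsistencies in sentiment predictions, for analyzing political bias in LLMs, and finds that bias is pervasive and varies with model size and language.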

Details

Motivation: Existing bias analysis methods rely on small-scale tasks and on the LLMs themselves for analysis, which can propagate bias, so a more objective method is needed. Method: Insert a diverse set of politician names into political sentences, define an entropy-based inconsistency metric, and analyze the variability of the sentiment predictions. Result: Inconsistencies appear in all tested combinations; bias is stronger for Western languages, and larger models show stronger and more consistent bias. Conclusion: Replacing politician names with fictional but plausible counterparts partially mitigates LLM unreliability in TSC, but the bias problem remains open. Abstract: Political biases encoded by LLMs might have detrimental effects on downstream applications. Existing bias analysis methods rely on small-size intermediate tasks (questionnaire answering or political content generation) and rely on the LLMs themselves for analysis, thus propagating bias. We propose a new approach leveraging the observation that LLM sentiment predictions vary with the target entity in the same sentence. We define an entropy-based inconsistency metric to encode this prediction variability. We insert 1319 demographically and politically diverse politician names in 450 political sentences and predict target-oriented sentiment using seven models in six widely spoken languages. We observe inconsistencies in all tested combinations and aggregate them in a statistically robust analysis at different granularity levels. We observe positive and negative bias toward left and far-right politicians and positive correlations between politicians with similar alignment. Bias intensity is higher for Western languages than for others. Larger models exhibit stronger and more consistent biases and reduce discrepancies between similar languages. We partially mitigate LLM unreliability in target-oriented sentiment classification (TSC) by replacing politician names with fictional but plausible counterparts.
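
An entropy-based inconsistency score of the kind described above might look like the sketch below. The abstract does not give the exact formula, so this assumes one plausible form: the Shannon entropy of the sentiment labels predicted for a single sentence as the inserted name varies, normalized by the log of the number of classes. The three-class label set and the function name are also assumptions.

```python
import math
from collections import Counter


def inconsistency(labels, num_classes=3):
    """Normalized Shannon entropy of the labels predicted for one sentence.

    `labels` holds one predicted sentiment per inserted politician name.
    Returns 0 when every name gets the same label and 1 when the labels
    are spread uniformly over `num_classes` classes.
    """
    total = len(labels)
    counts = Counter(labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return abs(entropy) / math.log(num_classes)


# Same sentence, four different names substituted for the target entity.
print(inconsistency(["positive", "positive", "positive", "positive"]))
print(inconsistency(["positive", "negative", "neutral", "negative"]))
```

A sentence whose predicted sentiment never changes with the name scores zero; any name-dependent flip raises the score, which is the variability the metric is meant to capture.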

[455] The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants

Yiqun Zhang,Hao Li,Chenxu Wang,Linyao Chen,Qiaosheng Zhang,Peng Ye,Shi Feng,Daling Wang,Zhen Wang,Xinrun Wang,Jia Xu,Lei Bai,Wanli Ouyang,Shuyue Hu

Main category: cs.CL

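TL;DR: The paper presents the Avengers, a framework that pools the collective intelligence of multiple smaller open-source language models and outperforms GPT-4.1 on 10 of 15 datasets.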

Details

Motivation: To determine whether smaller open-source language models can remain competitive while large proprietary models dominate. Method: Combine multiple smaller models through four lightweight operations: embedding, clustering, scoring, and voting. Result: The Avengers outperforms GPT-4.1 on 10 of 15 datasets, by 18.21% on mathematics tasks and 7.46% on code tasks. Conclusion: The Avengers framework shows the potential of smaller models working collectively, and its code is open-sourced. Abstract: As proprietary giants increasingly dominate the race for ever-larger language models, a pressing question arises for the open-source community: can smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers--a simple recipe that effectively leverages the collective intelligence of open-source, smaller language models. Our framework is built upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: score each model's performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response using Self-Consistency or its multi-model variant. Remarkably, with 10 open-source models (~7B parameters each), the Avengers collectively outperforms GPT-4.1 on 10 out of 15 datasets (spanning mathematics, code, logic, knowledge, and affective tasks). In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization, and remains robust across various embedding models, clustering algorithms, ensemble strategies, and values of its sole parameter--the number of clusters. We have open-sourced the code on GitHub: https://github.com/ZhangYiqun018/Avengers
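
The inference-time path described in the abstract (assign the query to its nearest cluster, pick the best-scoring model there, then vote over repeated samples) can be sketched roughly as follows. This is a toy illustration under stated assumptions: the embeddings and centroids are given as plain lists, the distance is squared Euclidean, and the model names and scores are invented for the example. It is not taken from the released code.

```python
from collections import Counter


def nearest_cluster(query_vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((q - c) ** 2 for q, c in zip(query_vec, centroid))
    return min(range(len(centroids)), key=lambda i: squared_distance(centroids[i]))


def route(query_vec, centroids, cluster_scores):
    """Select the model with the highest recorded score in the query's cluster."""
    scores = cluster_scores[nearest_cluster(query_vec, centroids)]
    return max(scores, key=scores.get)


def vote(answers):
    """Self-Consistency style majority vote over repeatedly sampled answers."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical centroids and per-cluster validation scores.
centroids = [[0.0, 0.0], [10.0, 10.0]]
cluster_scores = [
    {"math-7b": 0.8, "code-7b": 0.4},
    {"math-7b": 0.3, "code-7b": 0.9},
]
print(route([1.0, 1.0], centroids, cluster_scores))
print(vote(["42", "41", "42"]))
```

In this sketch the only tunable quantity is the number of centroids, which matches the abstract's remark that the number of clusters is the method's sole parameter.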

[456] MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

Zaid Alyafeai,Maged S. Al-Shaibani,Bernard Ghanem

Main category: cs.CL

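TL;DR: MOLE is a framework that uses large language models (LLMs) to automatically extract dataset metadata from scientific papers covering languages other than Arabic, relying on a schema-driven method with validation mechanisms, and introduces a new benchmark to track progress on the task.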

Details

Motivation: The rapid growth of scientific research calls for efficient metadata extraction tools, but existing approaches such as Masader depend on manual annotation, which limits their scalability. Method: MOLE uses LLMs to extract metadata automatically, supports multiple input formats, and includes validation mechanisms to keep outputs consistent. Result: Modern LLMs show promise for automating metadata extraction but still need improvement to deliver stable and reliable performance. Conclusion: MOLE provides an automated solution for metadata extraction and calls for further work on LLM performance. Abstract: Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al.,2021) laid the groundwork for extracting a wide range of metadata attributes from Arabic NLP datasets' scholarly articles, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate the research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, highlighting the need for further improvements in future work to ensure consistent and reliable performance. We release the code: https://github.com/IVUL-KAUST/MOLE and dataset: https://huggingface.co/datasets/IVUL-KAUST/MOLE for the research community.

[457] Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation

Siyuan Li,Jian Chen,Rui Yao,Xuming Hu,Peilin Zhou,Weihua Qiu,Simin Zhang,Chucheng Dong,Zhiyao Li,Qipeng Xie,Zixuan Yuan

Main category: cs.CL

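TL;DR: The paper presents Compliance-to-Code, a large-scale dataset for Chinese financial regulatory compliance, and develops the FinCheck pipeline to address the shortcomings of existing RegTech and LLMs on Chinese financial regulations.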

Details

Motivation: Financial regulations are complex and hard to interpret, and existing RegTech and LLMs suffer from incomplete knowledge representation, weak reasoning, and poor logical coherence when handling Chinese financial regulations. Method: Build the Compliance-to-Code dataset of 1,159 annotated clauses, each split into four logical elements and paired with Python code mappings and explanations, and develop the FinCheck pipeline for regulation structuring, code generation, and report generation. Result: Compliance-to-Code is the first large-scale dataset dedicated to Chinese financial regulatory compliance, and the FinCheck pipeline demonstrates its practical value. Conclusion: The work fills the gap in datasets for Chinese financial regulatory compliance and provides an effective tool for automated auditing. Abstract: Nowadays, regulatory compliance has become a cornerstone of corporate governance, ensuring adherence to systematic legal frameworks. At its core, financial regulations often comprise highly intricate provisions, layered logical structures, and numerous exceptions, which inevitably result in labor-intensive or comprehension challenges. To mitigate this, recent Regulatory Technology (RegTech) and Large Language Models (LLMs) have gained significant attention in automating the conversion of regulatory text into executable compliance logic. However, their performance remains suboptimal particularly when applied to Chinese-language financial regulations, due to three key limitations: (1) incomplete domain-specific knowledge representation, (2) insufficient hierarchical reasoning capabilities, and (3) failure to maintain temporal and logical coherence. One promising solution is to develop domain-specific and code-oriented datasets for model training. Existing datasets such as LexGLUE, LegalBench, and CODE-ACCORD are often English-focused, domain-mismatched, or lack fine-grained granularity for compliance code generation. To fill these gaps, we present Compliance-to-Code, the first large-scale Chinese dataset dedicated to financial regulatory compliance. Covering 1,159 annotated clauses from 361 regulations across ten categories, each clause is modularly structured with four logical elements (subject, condition, constraint, and contextual information) along with regulation relations. We provide deterministic Python code mappings, detailed code reasoning, and code explanations to facilitate automated auditing. To demonstrate utility, we present FinCheck: a pipeline for regulation structuring, code generation, and report generation.

[458] Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks

Sirui Chen,Shuqin Ma,Shu Yu,Hanwang Zhang,Shengjie Zhao,Chaochao Lu

Main category: cs.CL

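TL;DR: This survey examines consciousness in large language models (LLMs): it clarifies the relevant terminology, systematically organizes existing research, and identifies potential risks and future directions.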

Details

Motivation: As LLMs advance rapidly, the question of their consciousness has become important yet remains underexplored. Method: Systematically organize and analyze the literature on LLM consciousness from theoretical and empirical perspectives. Result: A summary of the current state of LLM consciousness research, its potential risks, and the open gaps. Conclusion: The paper outlines future research directions and calls for further study of the challenges and implications of LLM consciousness. Abstract: Consciousness stands as one of the most profound and distinguishing features of the human mind, fundamentally shaping our understanding of existence and agency. As large language models (LLMs) develop at an unprecedented pace, questions concerning intelligence and consciousness have become increasingly significant. However, discourse on LLM consciousness remains largely unexplored territory. In this paper, we first clarify frequently conflated terminologies (e.g., LLM consciousness and LLM awareness). Then, we systematically organize and synthesize existing research on LLM consciousness from both theoretical and empirical perspectives. Furthermore, we highlight potential frontier risks that conscious LLMs might introduce. Finally, we discuss current challenges and outline future directions in this emerging field. The references discussed in this paper are organized at https://github.com/OpenCausaLab/Awesome-LLM-Consciousness.

[459] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

Junnan Liu,Hongwei Liu,Linchen Xiao,Shudong Liu,Taolin Zhang,Zihan Ma,Songyang Zhang,Kai Chen

Main category: cs.CL

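TL;DR: The paper proposes a new framework for understanding the reasoning capabilities of large language models (LLMs) through the lens of meta-learning, treating reasoning trajectories as pseudo-gradient descent updates and validating the connection to meta-learning empirically.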

Details

Motivation: To explore LLM reasoning and uncover its potential links to meta-learning paradigms, improving both understanding and optimization of these models. Method: Treat training on reasoning tasks as a meta-learning setup in which each question is an individual task and the reasoning trajectory serves as the inner-loop optimization. Result: Experiments confirm a strong connection between LLM reasoning and meta-learning and show that the model generalizes to unseen questions. Conclusion: The study deepens the understanding of LLM reasoning and offers practical insights for improving models with meta-learning techniques. Abstract: We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM's parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.

[460] FoodTaxo: Generating Food Taxonomies with Large Language Models

Pascal Wullschleger,Majid Zarharan,Donnacha Daly,Marc Pouly,Jennifer Foster

Main category: cs.CL

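TL;DR: The study examines the use of large language models for taxonomy generation and completion in the food technology industry, finding the results promising but the correct placement of inner nodes still difficult.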

Details

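Motivation: To explore the potential of large language models for automated taxonomy generation and completion, particularly in the food technology industry. Method: Use an open-source LLM (Llama-3) with recent prompting techniques to iteratively complete a taxonomy from a seed taxonomy or generate one from a set of known concepts without a seed. Result: Experiments on five taxonomies are promising but show that placing inner nodes correctly remains difficult. Conclusion: Large language models are promising for taxonomy generation and completion, but the accuracy of inner-node placement needs further work. Abstract: We investigate the utility of Large Language Models for automated taxonomy generation and completion specifically applied to taxonomies from the food technology industry. We explore the extent to which taxonomies can be completed from a seed taxonomy or generated without a seed from a set of known concepts, in an iterative fashion using recent prompting techniques. Experiments on five taxonomies using an open-source LLM (Llama-3), while promising, point to the difficulty of correctly placing inner nodes.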

[461] Improving Multilingual Math Reasoning for African Languages

Odunayo Ogundepo,Akintunde Oladipo,Kelechi Ogueji,Esther Adenuga,David Ifeoluwa Adelani,Jimmy Lin

Main category: cs.CL

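TL;DR: This paper systematically studies how to extend existing large language models (LLMs) to African languages, running experiments over combinations of data types and training stages to identify the best adaptation strategy.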

Details

Motivation: Low-resource languages, African languages in particular, have limited data and compute, while existing LLMs are trained mainly on high-resource languages, so effective adaptation strategies need to be explored. Method: Apply multi-stage pre-training and post-training with different data types (translated versus synthetically generated), running mathematical reasoning experiments on the Llama 3.1 model family. Result: Extensive experiments and ablation studies evaluate how the different adaptation strategies perform. Conclusion: The work provides a systematic evaluation of strategies for adapting LLMs to low-resource languages such as African languages, though the best approach still requires further study. Abstract: Researchers working on low-resource languages face persistent challenges due to limited data availability and restricted access to computational resources. Although most large language models (LLMs) are predominantly trained in high-resource languages, adapting them to low-resource contexts, particularly African languages, requires specialized techniques. Several strategies have emerged for adapting models to low-resource languages in today's LLM landscape, defined by multi-stage pre-training and post-training paradigms. However, the most effective approaches remain uncertain. This work systematically investigates which adaptation strategies yield the best performance when extending existing LLMs to African languages. We conduct extensive experiments and ablation studies to evaluate different combinations of data types (translated versus synthetically generated), training stages (pre-training versus post-training), and other model adaptation configurations. Our experiments focus on mathematical reasoning tasks, using the Llama 3.1 model family as our base model.

[462] Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages

Gulfarogh Azam,Mohd Sadique,Saif Ali,Mohammad Nadeem,Erik Cambria,Shahab Saquib Sohail,Mohammad Sultan Alam

Main category: cs.CL

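TL;DR: The paper evaluates general-purpose large language models (such as the GPT family) on transliteration of Indian languages, finding that they outperform the specialized model IndicXlit in most cases and improve further with fine-tuning.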

Details

Motivation: To study the potential of general-purpose large language models for transliteration, especially in multilingual settings, and reduce reliance on specialized models. Method: Compare GPT-4o, GPT-4.5, and other models against IndicXlit, using Top-1 Accuracy and Character Error Rate as metrics. Result: GPT family models outperform the other models and IndicXlit in most cases, and fine-tuning improves performance notably. Conclusion: Foundation models perform well on specialized tasks such as transliteration and have broad application potential. Abstract: Transliteration, the process of mapping text from one script to another, plays a crucial role in multilingual natural language processing, especially within linguistically diverse contexts such as India. Despite significant advancements through specialized models like IndicXlit, recent developments in large language models suggest a potential for general-purpose models to excel at this task without explicit task-specific training. The current work systematically evaluates the performance of prominent LLMs, including GPT-4o, GPT-4.5, GPT-4.1, Gemma-3-27B-it, and Mistral-Large against IndicXlit, a state-of-the-art transliteration model, across ten major Indian languages. Experiments utilized standard benchmarks, including Dakshina and Aksharantar datasets, with performance assessed via Top-1 Accuracy and Character Error Rate. Our findings reveal that GPT family models generally outperform other LLMs and IndicXlit in most instances. Additionally, fine-tuning GPT-4o notably improves performance on specific languages. An extensive error analysis and robustness testing under noisy conditions further elucidate strengths of LLMs compared to specialized models, highlighting the efficacy of foundational models for a wide spectrum of specialized applications with minimal overhead.
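
The two metrics named above can be computed as in the sketch below. It assumes a corpus-level Character Error Rate (total edit distance divided by total reference characters) and counts Unicode code points as characters; the paper may instead average per-word rates or count grapheme clusters, which matters for Indic scripts. The example words are illustrative, not drawn from Dakshina or Aksharantar.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        cur = [i]
        for j, char_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                       # deletion
                cur[j - 1] + 1,                    # insertion
                prev[j - 1] + (char_a != char_b),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(predictions, references):
    """Total edit distance divided by total reference length."""
    edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    return edits / sum(len(r) for r in references)


def top1_accuracy(predictions, references):
    """Fraction of predictions that match their reference exactly."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


predictions = ["namaste", "dilli"]
references = ["namaste", "delhi"]
print(character_error_rate(predictions, references))
print(top1_accuracy(predictions, references))
```

Top-1 Accuracy treats a one-character slip the same as a completely wrong output, while Character Error Rate gives partial credit, which is why the two are usually reported together.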

[463] REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models

Hexuan Deng,Wenxiang Jiao,Xuebo Liu,Jun Rao,Min Zhang

Main category: cs.CL

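TL;DR: The paper proposes REA-RL, which introduces a small reflection model and a reflection reward to address overthinking in Large Reasoning Models (LRMs), significantly improving inference efficiency while maintaining or enhancing performance.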

Details

Motivation: Large Reasoning Models (LRMs) perform well on complex tasks but overthink, which drives up inference cost; existing methods are either inefficient or sacrifice reflection ability. Method: REA-RL uses a small reflection model to provide parallel sampling and sequential revision, and adds a reflection reward to discourage short but non-reflective responses. Result: Experiments show significantly better inference efficiency (35% lower cost) with maintained or improved performance, and reflection frequency that adapts to problem difficulty. Conclusion: REA-RL balances performance and efficiency, offering an effective approach to efficient reasoning in LRMs. Abstract: Large Reasoning Models (LRMs) demonstrate strong performance in complex tasks but often face the challenge of overthinking, leading to substantially high inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn, but are inefficient for online usage due to the time-consuming data generation and filtering processes. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but tends to lose the reflection ability and harm the performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. Besides, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while significantly improving inference efficiency. Their combination achieves a good balance between performance and efficiency, reducing inference costs by 35% without compromising performance. Further analysis demonstrates that our methods are effective by maintaining reflection frequency for hard problems while appropriately reducing it for simpler ones without losing reflection ability. Codes are available at https://github.com/hexuandeng/REA-RL.

[464] APE: A Data-Centric Benchmark for Efficient LLM Adaptation in Text Summarization

Javier Marín

Main category: cs.CL

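TL;DR: APE is an efficient method for adapting large language models through iterative fine-tuning on small batches, improving task performance substantially with low compute requirements.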

Details

Motivation: To give researchers and practitioners with limited compute a simple and efficient model adaptation method. Method: APE iteratively fine-tunes on small data batches (200 examples) and keeps only the updates that improve performance, without expensive compute. Result: On news summarization, APE achieves a 40% BLEU improvement with a single T4 GPU in 60 minutes, matching or exceeding more complex methods. Conclusion: APE shows that iterative fine-tuning on small amounts of data is efficient and offers a practical option for resource-constrained settings. Abstract: We present Adjacent Possible Exploration (APE), a simple yet effective method for adapting large language models to specific tasks using minimal computational resources. Unlike traditional fine-tuning that requires extensive compute, APE iteratively fine-tunes models on small, carefully selected data batches (200 examples), retaining only improvements. On news summarization, APE achieves a 40 percent BLEU improvement using just a T4 GPU in 60 minutes, matching or exceeding more complex methods like LoRA while remaining conceptually simple. Our approach is particularly valuable for researchers and practitioners with limited computational resources. We provide open-source code and demonstrate APE's effectiveness through both automatic metrics and human evaluation. While inspired by evolutionary theory's "adjacent possible", APE's core insight has a very practical application: small, iterative data perturbations can efficiently guide LLMs toward task-specific performance without expensive retraining.
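
The "fine-tune on a small batch, keep only improvements" loop reads like a greedy accept/reject search, which could be sketched as below. The `fine_tune` and `evaluate` callables are hypothetical placeholders supplied by the caller; the paper's batch selection, metric, and stopping rule are not reproduced here. The toy usage replaces the model with a number so the loop can run without any ML library.

```python
def adjacent_possible_exploration(model, batches, fine_tune, evaluate):
    """Greedy loop: try each small batch and keep the update only if it helps.

    `fine_tune(model, batch)` must return a new candidate model and
    `evaluate(model)` a score where higher is better (both hypothetical).
    """
    best_score = evaluate(model)
    for batch in batches:
        candidate = fine_tune(model, batch)
        score = evaluate(candidate)
        if score > best_score:  # retain only improvements
            model, best_score = candidate, score
    return model, best_score


# Toy stand-ins: the "model" is a number, "fine-tuning" adds the batch value,
# and the score is the negative distance to a target of 10.
final_model, final_score = adjacent_possible_exploration(
    model=0,
    batches=[3, 20, 4, -5, 3],
    fine_tune=lambda m, b: m + b,
    evaluate=lambda m: -abs(m - 10),
)
print(final_model, final_score)
```

Rejected candidates are simply discarded, so the score in this sketch can never drop below its starting value; the price is one extra evaluation per batch.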

[465] Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Jiangjie Chen,Qianyu He,Siyu Yuan,Aili Chen,Zhicheng Cai,Weinan Dai,Hongli Yu,Qiying Yu,Xuefeng Li,Jiaze Chen,Hao Zhou,Mingxuan Wang

Main category: cs.CL

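TL;DR: Enigmata is a comprehensive suite designed to improve the puzzle-solving ability of LLMs, with 36 tasks and automatic verifiers; training with RLVR significantly improves model performance on puzzle and mathematical reasoning tasks.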

Details

Motivation: Although LLMs excel at tasks such as math and coding, they still fall short on puzzles that require no domain knowledge, so dedicated tools are needed to improve their puzzle-solving ability. Method: Introduce the Enigmata suite, with task generators and rule-based verifiers that support multi-task RL training and RLVR integration, along with the Enigmata-Eval benchmark and optimized training strategies. Result: The trained model, Qwen2.5-32B-Enigmata, performs strongly on puzzle and mathematical reasoning tasks, surpassing other models, and Enigmata data further improves STEM performance on larger models. Conclusion: Enigmata provides a unified, controllable framework for advancing logical reasoning in LLMs and shows good generalization. Abstract: Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at https://seed-enigmata.github.io.
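
The generator-verifier pairing can be illustrated with a toy task. The puzzle below (find two positions in a list whose values sum to a target) is invented for this sketch and is not one of Enigmata's 36 tasks; the `difficulty` knob, the answer format, and the 0/1 reward are likewise assumptions about how such a pair might plug into RLVR.

```python
import random


def generate(difficulty, seed):
    """Produce a seeded puzzle; higher difficulty means a longer list."""
    rng = random.Random(seed)
    numbers = rng.sample(range(1, 10 * difficulty + 1), k=difficulty + 2)
    i, j = rng.sample(range(len(numbers)), k=2)
    return {"numbers": numbers, "target": numbers[i] + numbers[j]}


def verify(puzzle, answer):
    """Rule-based check: two distinct valid positions whose values hit the target."""
    try:
        i, j = answer
        numbers = puzzle["numbers"]
        in_range = 0 <= i < len(numbers) and 0 <= j < len(numbers)
        return in_range and i != j and numbers[i] + numbers[j] == puzzle["target"]
    except (TypeError, ValueError, IndexError):
        return False  # malformed answers are simply wrong


def reward(puzzle, answer):
    """Binary verifiable reward of the kind an RL loop could consume."""
    return 1.0 if verify(puzzle, answer) else 0.0


puzzle = generate(difficulty=3, seed=0)
print(puzzle)
print(reward(puzzle, (0, 0)))
```

Because the verifier checks the rule rather than comparing against a stored answer, any valid solution earns the reward, and the generator can emit as many fresh instances as training needs.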

[466] ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs

Pooneh Mousavi,Yingzhi Wang,Mirco Ravanelli,Cem Subakan

Main category: cs.CL

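TL;DR: The paper proposes a new metric, ALAS, for assessing the quality of alignment between audio and text modalities in large language models (LLMs), and validates it on two tasks.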

Details

Motivation: There is no standard metric for evaluating the alignment quality between audio and text modalities, so a new way of measuring multimodal learning in LLMs is needed. Method: Propose the ALAS metric, which analyzes the correlation between audio and text representations across Transformer layers, and test it on Spoken Question Answering and Emotion Recognition. Result: ALAS behaves as expected across tasks and model layers, supporting its use for assessing alignment quality. Conclusion: ALAS offers a standardized way to evaluate modality alignment in multimodal learning and can help future work improve the multimodal capabilities of LLMs. Abstract: Large Language Models (LLMs) are widely used in Spoken Language Understanding (SLU). Recent SLU models process audio directly by adapting speech input into LLMs for better multimodal learning. A key consideration for these models is the cross-modal alignment between text and audio modalities, which is a telltale sign as to whether or not the LLM is able to associate semantic meaning to audio segments. While various methods exist for fusing these modalities, there is no standard metric to evaluate alignment quality in LLMs. In this work, we propose a new metric, ALAS (Automatic Latent Alignment Score). Our study examines the correlation between audio and text representations across transformer layers, for two different tasks (Spoken Question Answering and Emotion Recognition). We showcase that our metric behaves as expected across different layers and different tasks.

[467] MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

Zhongzhan Huang,Guoming Ling,Shanshan Zhong,Hefeng Wu,Liang Lin

Main category: cs.CL

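TL;DR: The paper proposes a compression method for long-text data and uses it to build the MiniLongBench benchmark, which sharply reduces evaluation cost while remaining highly correlated with the original benchmark.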

Details

Motivation: Existing Long Context Understanding (LCU) benchmarks are too expensive to run and contain redundancy, so a more efficient evaluation method is needed. Method: Propose a data compression method for long text with sparse information and prune the LongBench benchmark to create MiniLongBench. Result: MiniLongBench reduces evaluation cost to 4.5% of the original while reaching a rank correlation coefficient of 0.97 with LongBench results. Conclusion: As a low-cost benchmark, MiniLongBench can help drive research on the long-context understanding abilities of LLMs. Abstract: Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs, like testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which makes evaluation inefficient. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. Through empirical analysis of over 60 LLMs, MiniLongBench achieves an average evaluation cost reduced to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See https://github.com/MilkThink-Lab/MiniLongBench for our code, data and tutorial.
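
A rank correlation between the model rankings produced by a full benchmark and by its pruned version can be computed as sketched below. The abstract does not say which coefficient is used; this assumes Spearman's rho in its no-ties form, and the scores are made up purely for illustration.

```python
def ranks(scores):
    """Rank 1 for the highest score; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(full_scores, mini_scores):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(full_scores)
    d_squared = sum(
        (a - b) ** 2 for a, b in zip(ranks(full_scores), ranks(mini_scores))
    )
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Illustrative scores for five models on the full and the pruned benchmark.
full_benchmark = [71.2, 65.0, 58.3, 49.9, 40.1]
mini_benchmark = [70.0, 66.1, 50.2, 55.4, 39.0]
print(spearman(full_benchmark, mini_benchmark))
```

A coefficient near 1 means the pruned benchmark orders models almost exactly as the full one does, which is the property that makes a cheaper benchmark a usable substitute.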

[468] CP-Router: An Uncertainty-Aware Router Between LLM and LRM

Jiayuan Su,Fulin Lin,Zhaopeng Feng,Han Zheng,Teng Wang,Zhenyu Xiao,Xinlong Zhao,Zuozhu Liu,Lu Cheng,Hongwei Wang

Main category: cs.CL

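TL;DR: CP-Router is a routing framework that dynamically chooses between an LLM and an LRM, using Conformal Prediction and an entropy-based criterion to reduce output length while preserving accuracy.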

Details

Motivation: LRMs excel at long-chain reasoning but produce lengthy outputs that can even hurt accuracy, so an efficient routing mechanism is needed. Method: The CP-Router framework routes dynamically based on prediction uncertainty from Conformal Prediction and the FBE entropy criterion. Result: Across multiple MCQA tasks, CP-Router reduces token usage while maintaining or improving accuracy. Conclusion: CP-Router is general and robust, extending to diverse model pairings and open-ended QA. Abstract: Recent advances in Large Reasoning Models (LRMs) have significantly improved long-chain reasoning capabilities over Large Language Models (LLMs). However, LRMs often produce unnecessarily lengthy outputs even for simple queries, leading to inefficiencies or even accuracy degradation compared to LLMs. To overcome this, we propose CP-Router, a training-free and model-agnostic routing framework that dynamically selects between an LLM and an LRM, demonstrated with multiple-choice question answering (MCQA) prompts. The routing decision is guided by the prediction uncertainty estimates derived via Conformal Prediction (CP), which provides rigorous coverage guarantees. To further refine the uncertainty differentiation across inputs, we introduce Full and Binary Entropy (FBE), a novel entropy-based criterion that adaptively selects the appropriate CP threshold. Experiments across diverse MCQA benchmarks, including mathematics, logical reasoning, and Chinese chemistry, demonstrate that CP-Router efficiently reduces token usage while maintaining or even improving accuracy compared to using LRM alone. We also extend CP-Router to diverse model pairings and open-ended QA, where it continues to demonstrate strong performance, validating its generality and robustness.
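
Routing on conformal prediction sets can be sketched with standard split conformal prediction. Everything specific here is an assumption rather than the paper's procedure: the nonconformity score (one minus the probability of the true option), the rule "a singleton set goes to the LLM, anything else goes to the LRM", and the fixed error level `alpha`. The FBE criterion for choosing the threshold is not implemented, and clamping the quantile index to the largest calibration score is a simplification for small calibration sets.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal quantile of the scores 1 - p(true option)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[min(k, n) - 1]  # clamp to the largest score when k > n


def prediction_set(probs, threshold):
    """Options whose nonconformity score does not exceed the threshold."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= threshold]


def route(probs, threshold):
    """Assumed rule: a single-option set is confident enough for the LLM."""
    return "LLM" if len(prediction_set(probs, threshold)) == 1 else "LRM"


# Calibration data: LLM option probabilities with known correct answers.
cal_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]
threshold = conformal_threshold(cal_probs, [0, 1, 2, 3], alpha=0.25)
print(route([0.90, 0.05, 0.03, 0.02], threshold))
print(route([0.45, 0.45, 0.05, 0.05], threshold))
```

The set size acts as the uncertainty signal: a question on which the LLM is split between several options yields a larger set and is handed to the more expensive reasoning model.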

[469] Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language

Kilian Sennrich,Sina Ahmadi

Main category: cs.CL

TL;DR: The paper examines the challenges of building natural language interfaces for knowledge graphs such as Wikidata, develops a multidimensional taxonomy and a template-based dataset, and compares several models experimentally, finding that GPT-3.5-Turbo performs best.

Details

Motivation: To give non-expert users an easier way to query knowledge graphs by lowering the high barrier posed by the SPARQL query language. Method: Develops a multidimensional taxonomy, builds a template-based dataset of over 1.2 million mappings from natural language to SPARQL queries, and tests GPT-2, Phi-1.5, and GPT-3.5-Turbo. Result: GPT-3.5-Turbo shows stronger generalization, suggesting that model size and diverse pre-training are crucial for adaptability. Conclusion: Despite this progress, challenges remain in generalization, handling diverse linguistic data, and scalability. Abstract: Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata's lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.

[470] DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset

Alkis Koudounas,Moreno La Quatra,Elena Baralis

Main category: cs.CL

TL;DR: DeepDialogue is a large-scale multimodal dialogue dataset of 40,150 high-quality multi-turn dialogues spanning 41 domains and 20 emotions. It is built by combining language-model generation with human annotation, reveals the limitations of smaller models in multi-turn dialogue, and is presented as the first large-scale open-source dataset of emotion-consistent multimodal dialogues.

Details

Motivation: Current dialogue datasets are limited in emotional range, domain diversity, and turn depth, and are mostly text-only, which hinders the development of more human-like multimodal conversational systems. Method: Uses 9 language models of different sizes to generate 65,600 initial conversations, then applies human annotation and LLM-based quality filtering to build a multimodal dataset that includes synthesized speech. Result: Smaller models struggle to stay coherent beyond 6 dialogue turns; concrete domains yield more meaningful conversations; and cross-model interactions are more coherent than same-model conversations. Conclusion: DeepDialogue fills a gap in multimodal dialogue datasets and provides an important resource for research on emotion-consistent multi-turn dialogue. Abstract: Recent advances in conversational AI have demonstrated impressive capabilities in single-turn responses, yet multi-turn dialogues remain challenging for even the most sophisticated language models. Current dialogue datasets are limited in their emotional range, domain diversity, and turn depth, and are predominantly text-only, hindering progress in developing more human-like conversational systems across modalities. To address these limitations, we present DeepDialogue, a large-scale multimodal dataset containing 40,150 high-quality multi-turn dialogues spanning 41 domains and incorporating 20 distinct emotions with coherent emotional progressions. Our approach pairs 9 different language models (4B-72B parameters) to generate 65,600 initial conversations, which we then evaluate through a combination of human annotation and LLM-based quality filtering. The resulting dataset reveals fundamental insights: smaller models fail to maintain coherence beyond 6 dialogue turns; concrete domains (e.g., "cars," "travel") yield more meaningful conversations than abstract ones (e.g., "philosophy"); and cross-model interactions produce more coherent dialogues than same-model conversations. A key contribution of DeepDialogue is its speech component, where we synthesize emotion-consistent voices for all 40,150 dialogues, creating the first large-scale open-source multimodal dialogue dataset that faithfully preserves emotional context across multi-turn conversations.

[471] How Well Do Large Reasoning Models Translate? A Comprehensive Evaluation for Multi-Domain Machine Translation

Yongshi Ye,Biao Fu,Chongxuan Huang,Yidong Chen,Xiaodong Shi

Main category: cs.CL

TL;DR: LRMs outperform traditional LLMs in complex domain-sensitive translation tasks, especially in long-text and high-difficulty scenarios, with domain-adaptive prompting further enhancing performance.

Details

Motivation: To explore whether structured reasoning in Large Reasoning Models (LRMs) can improve translation quality in complex, domain-sensitive tasks compared to traditional Large Language Models (LLMs). Method: Comparison of LRMs and LLMs across 15 domains and four translation directions, evaluating factors like task difficulty, input length, and terminology density using automatic metrics and an enhanced MQM-based hierarchy. Result: LRMs consistently outperform LLMs in semantically complex domains, particularly in long-text and high-difficulty scenarios, with domain-adaptive prompting further boosting performance. Conclusion: Structured reasoning in LRMs shows significant potential for enhancing domain-sensitive machine translation, offering insights for optimizing such systems. Abstract: Large language models (LLMs) have demonstrated strong performance in general-purpose machine translation, but their effectiveness in complex, domain-sensitive translation tasks remains underexplored. Recent advancements in Large Reasoning Models (LRMs) raise the question of whether structured reasoning can enhance translation quality across diverse domains. In this work, we compare the performance of LRMs with traditional LLMs across 15 representative domains and four translation directions. Our evaluation considers various factors, including task difficulty, input length, and terminology density. We use a combination of automatic metrics and an enhanced MQM-based evaluation hierarchy to assess translation quality. Our findings show that LRMs consistently outperform traditional LLMs in semantically complex domains, especially in long-text and high-difficulty translation scenarios. Moreover, domain-adaptive prompting strategies further improve performance by better leveraging the reasoning capabilities of LRMs. These results highlight the potential of structured reasoning in multi-domain machine translation (MDMT) tasks and provide valuable insights for optimizing translation systems in domain-sensitive contexts.

[472] Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition

Raphaël Bagat,Irina Illina,Emmanuel Vincent

Main category: cs.CL

TL;DR: Proposes MAS-LoRA, a method that mixes multiple accent-specific Low-Rank Adaptation (LoRA) experts to make automatic speech recognition (ASR) more robust in non-native, multi-accent settings.

Details

Motivation: To address the robustness of ASR systems to non-native speech in low-resource multi-accent settings. Method: A fine-tuning method based on a mixture of LoRA experts (MAS-LoRA) that needs no further fine-tuning at inference time and works whether the accent is known or unknown. Result: On the L2-ARCTIC corpus, MAS-LoRA significantly lowers word error rate compared with regular LoRA and full fine-tuning when the accent is unknown, improves further when the accent is known, and shows less catastrophic forgetting. Conclusion: MAS-LoRA is, to the authors' knowledge, the first mixture of LoRA experts for non-native multi-accent ASR, and it significantly improves performance. Abstract: We aim to improve the robustness of Automatic Speech Recognition (ASR) systems against non-native speech, particularly in low-resourced multi-accent settings. We introduce Mixture of Accent-Specific LoRAs (MAS-LoRA), a fine-tuning method that leverages a mixture of Low-Rank Adaptation (LoRA) experts, each specialized in a specific accent. This method can be used when the accent is known or unknown at inference time, without the need to fine-tune the model again. Our experiments, conducted using Whisper on the L2-ARCTIC corpus, demonstrate significant improvements in Word Error Rate compared to regular LoRA and full fine-tuning when the accent is unknown. When the accent is known, the results further improve. Furthermore, MAS-LoRA shows less catastrophic forgetting than the other fine-tuning methods. To the best of our knowledge, this is the first use of a mixture of LoRA experts for non-native multi-accent ASR.

[473] WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback

Minda Hu,Tianqing Fang,Jianshu Zhang,Junyu Ma,Zhisong Zhang,Jingyan Zhou,Hongming Zhang,Haitao Mi,Dong Yu,Irwin King

Main category: cs.CL

TL;DR: The paper proposes improving web agents by strengthening the LLM's reasoning skills (reflection, lookahead, branching, and rollback) and validates the approach experimentally.

Details

Motivation: Current LLM-based web agents reason poorly in uncertain, dynamic web environments, which limits their practical deployment. Method: Reconstructs the agent's reasoning algorithms as chain-of-thought rationales, extracts the key reasoning skills, and distills these patterns into the LLM via fine-tuning. Result: Performance improves significantly on benchmarks including OpenWebVoyager, validating the method. Conclusion: Targeted enhancement of reasoning skills is an effective way to improve web agent performance. Abstract: Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection & lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent's (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments in the agent self-improving benchmark, OpenWebVoyager, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.

[474] Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation

Hoyun Song,Huije Lee,Jisu Shin,Sukmin Cho,Changgeon Ko,Jong C. Park

Main category: cs.CL

TL;DR: The study examines how improving rationale quality can optimize small language models (SLMs) for mental health detection and explanation generation, and proposes a framework based on alignment with expert clinical reasoning.

Details

Motivation: Large language models (LLMs) are good at generating explanatory rationales for mental health detection but are computationally expensive; small language models (SLMs) inherit this ability through reasoning distillation, but the rationales vary in relevance and domain alignment. Method: Proposes a framework that improves SLM performance by selecting high-quality rationales aligned with expert clinical reasoning. Result: Experiments show that this quality-focused approach significantly improves SLM performance in both mental disorder detection and rationale generation. Conclusion: Rationale quality is critical for knowledge transfer, and the proposed framework offers an effective solution for mental health applications. Abstract: The detection of mental health problems from social media and the interpretation of these results have been extensively explored. Research has shown that incorporating clinical symptom information into a model enhances domain expertise, improving its detection and interpretation performance. While large language models (LLMs) are shown to be effective for generating explanatory rationales in mental health detection, their substantially large parameter size and high computational cost limit their practicality. Reasoning distillation transfers this ability to smaller language models (SLMs), but inconsistencies in the relevance and domain alignment of LLM-generated rationales pose a challenge. This paper investigates how rationale quality impacts SLM performance in mental health detection and explanation generation. We hypothesize that ensuring high-quality and domain-relevant rationales enhances the distillation. To this end, we propose a framework that selects rationales based on their alignment with expert clinical reasoning. Experiments show that our quality-focused approach significantly enhances SLM performance in both mental disorder detection and rationale generation. This work highlights the importance of rationale quality and offers an insightful framework for knowledge transfer in mental health applications.

[475] On the class of coding optimality of human languages and the origins of Zipf's law

Ramon Ferrer-i-Cancho

Main category: cs.CL

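TL;DR: The paper presents a new class of optimality for coding systems whose members are linearly separated from optimal coding and exhibit Zipf's law (a power-law distribution of frequency ranks). Human languages belong to this class, whereas communication systems of other species that follow an exponential distribution are excluded. The work supports the hypothesis that Zipf's law originates from compression.

Details

Motivation: To explore classes of optimality for coding systems and to test whether Zipf's law originates from compression. Method: Analyzes plots of frequency versus rank in double logarithmic scale to identify systems that follow Zipf's law, and studies their relationship to optimal coding. Result: Human languages are found to belong to the optimality class, and the communication systems of some animals (such as dolphins and humpback whales) might as well. A straight line in double logarithmic scale may indicate that a system is coding close to optimality. Conclusion: The study supports the hypothesis that Zipf's law originates from compression and offers a new perspective on the optimization of coding systems. Abstract: Here we present a new class of optimality for coding systems. Members of that class are separated linearly from optimal coding and thus exhibit Zipf's law, namely a power-law distribution of frequency ranks. Within that class, Zipf's law, the size-rank law and the size-probability law form a group-like structure. We identify human languages that are members of the class. All languages showing sufficient agreement with Zipf's law are potential members of the class. In contrast, there are communication systems in other species that cannot be members of that class for exhibiting an exponential distribution instead, but dolphins and humpback whales might. We provide a new insight into plots of frequency versus rank in double logarithmic scale. For any system, a straight line in that scale indicates that the lengths of optimal codes under non-singular coding and under uniquely decodable encoding are separated by a linear function whose slope is the exponent of Zipf's law. For systems under compression and constrained to be uniquely decodable, such a straight line may indicate that the system is coding close to optimality. Our findings provide support for the hypothesis that Zipf's law originates from compression.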
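
The straight line in double logarithmic scale mentioned above can be checked numerically: if frequency follows a power law in rank, then log frequency is linear in log rank and the negated slope is the Zipf exponent. The sketch below is only an illustration using ordinary least squares on synthetic frequencies; the function name and the data are assumptions, and the paper's own analysis is not reproduced here.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)  # rank 1 is the most frequent item
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Synthetic frequencies that follow f(r) = 1000 / r, i.e. a Zipf exponent of 1.
zipf_freqs = [1000.0 / r for r in range(1, 51)]
slope = loglog_slope(zipf_freqs)
zipf_exponent = -slope
print(round(zipf_exponent, 6))  # 1.0
```

An exponentially decaying distribution would instead bend downward in this plot, which is the contrast the abstract draws with some non-human communication systems.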

[476] TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation

Chengrui Huang,Shen Gao,Zhengliang Shi,Dongsheng Wang,Shuo Shang

Main category: cs.CL

TL;DR: Proposes TTPA, a framework that improves LLM tool use by optimizing tool-call details at a fine-grained level.

Details

Motivation: Existing tool-learning methods usually rely on supervised fine-tuning and overlook fine-grained optimization of tool-call details, which limits preference alignment and error discrimination. Method: TTPA comprises reversed dataset construction, Token-level Preference Sampling (TPS), and an Error-oriented Scoring Mechanism (ESM). Result: Experiments on three benchmark datasets show that TTPA significantly improves tool-use performance and generalizes well. Conclusion: Through fine-grained optimization and error-oriented scoring, TTPA effectively improves LLM tool-use performance and generalization. Abstract: Existing tool-learning methods usually rely on supervised fine-tuning and often overlook fine-grained optimization of internal tool call details, leading to limitations in preference alignment and error discrimination. To overcome these challenges, we propose Token-level Tool-use Preference Alignment Training Framework (TTPA), a training paradigm for constructing token-level tool-use preference datasets that align LLMs with fine-grained preferences using a novel error-oriented scoring mechanism. TTPA first introduces reversed dataset construction, a method for creating high-quality, multi-turn tool-use datasets by reversing the generation flow. Additionally, we propose Token-level Preference Sampling (TPS) to capture fine-grained preferences by modeling token-level differences during generation. To address biases in scoring, we introduce the Error-oriented Scoring Mechanism (ESM), which quantifies tool-call errors and can be used as a training signal. Extensive experiments on three diverse benchmark datasets demonstrate that TTPA significantly improves tool-using performance while showing strong generalization ability across models and datasets.

[477] Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking

Yihan Chen,Benfeng Xu,Xiaorui Wang,Yongdong Zhang,Zhendong Mao

Main category: cs.CL

TL;DR: The paper proposes STeP, a new method that improves the training of LLM-based agents by synthesizing self-reflected trajectories containing reflections on and corrections of erroneous steps, and introduces a partial masking strategy that keeps the model from learning the erroneous steps. Experiments show significant gains in agent performance across multiple tasks.

Details

Motivation: Current LLM-based agents depend on sophisticated prompt engineering and closed-source models, while training open-source models on expert trajectories suffers from performance plateaus and error propagation. Method: Proposes STeP, which synthesizes self-reflected trajectories (containing reflections and corrections) and introduces a partial masking strategy so the model does not learn erroneous steps. Result: Experiments show that STeP significantly improves LLaMA2-7B-Chat on ALFWorld, WebShop, and SciWorld while requiring less training data. Conclusion: Through self-reflected trajectories and masking, STeP effectively improves agent learning and offers a new approach to training open-source LLM agents. Abstract: Autonomous agents, which perceive environments and take actions to achieve goals, have become increasingly feasible with the advancements in large language models (LLMs). However, current powerful agents often depend on sophisticated prompt engineering combined with closed-source LLMs like GPT-4. Although training open-source LLMs using expert trajectories from teacher models has yielded some improvements in agent capabilities, this approach still faces limitations such as performance plateauing and error propagation. To mitigate these challenges, we propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections and corrections of error steps, which enhance the effectiveness of LLM agents in learning from teacher models, enabling them to become agents capable of self-reflecting and correcting. We also introduce a partial masking strategy that prevents the LLM from internalizing incorrect or suboptimal steps. Experiments demonstrate that our method improves agent performance across three representative tasks: ALFWorld, WebShop, and SciWorld. For the open-source model LLaMA2-7B-Chat, when trained using self-reflected trajectories constructed with Qwen1.5-110B-Chat as the teacher model, it achieves comprehensive improvements with less training data compared to agents trained exclusively on expert trajectories.

[478] Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs

Artem Vazhentsev,Lyudmila Rvanova,Gleb Kuzmin,Ekaterina Fadeeva,Ivan Lazichny,Alexander Panchenko,Maxim Panov,Timothy Baldwin,Mrinmaya Sachan,Preslav Nakov,Artem Shelmanov

Main category: cs.CL

TL;DR: RAUQ is an unsupervised, attention-based method for efficiently detecting hallucinations in large language models; it outperforms existing methods at low computational overhead.

Details

Motivation: Large language models (LLMs) often produce hallucinations, and existing uncertainty quantification (UQ) methods suffer from high computational overhead or rely on supervised learning. Method: Proposes RAUQ, which exploits attention patterns in transformers: it analyzes attention weights to automatically select uncertainty-aware heads and computes sequence-level uncertainty scores. Result: Across 4 LLMs and 12 tasks, RAUQ performs strongly with low overhead (<1% latency) and needs no task-specific labels or hyperparameter tuning. Conclusion: RAUQ offers plug-and-play, real-time hallucination detection for white-box LLMs. Abstract: Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as "hallucinations". Uncertainty quantification (UQ) methods are a promising tool for coping with this fundamental shortcoming. Yet, existing UQ methods face challenges such as high computational overhead or reliance on supervised learning. Here, we aim to bridge this gap. In particular, we propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. By analyzing attention weights, we identified a peculiar pattern: drops in attention to preceding tokens are systematically observed during incorrect generations for certain "uncertainty-aware" heads. RAUQ automatically selects such heads, recurrently aggregates their attention weights and token-level confidences, and computes sequence-level uncertainty scores in a single forward pass. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results, outperforming state-of-the-art UQ methods using minimal computational overhead (<1% latency). Moreover, it requires no task-specific labels and no careful hyperparameter tuning, offering plug-and-play real-time hallucination detection in white-box LLMs.

[479] Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

Debargha Ganguly,Vikash Singh,Sreehari Sankar,Biyao Zhang,Xuecen Zhang,Srinivasan Iyengar,Xiaotian Han,Amit Sharma,Shivkumar Kalyanaraman,Vipin Chaudhary

Main category: cs.CL

TL;DR: The paper studies failure modes and uncertainty quantification (UQ) when large language models (LLMs) generate formal specifications, proposes a probabilistic context-free grammar (PCFG) framework, and significantly reduces errors through task-dependent signal fusion.

Details

Motivation: LLMs show promise for automated reasoning, but their probabilistic nature conflicts with formal verification's need for deterministic guarantees; the paper aims to bridge this epistemological gap. Method: Systematically evaluates five frontier LLMs, introduces a PCFG framework to model LLM outputs, and proposes a lightweight signal fusion method. Result: Uncertainty signals are task-dependent (e.g., grammar entropy reaches AUROC>0.93 on logic tasks), and signal fusion reduces errors substantially (14-100%). Conclusion: With task-dependent fusion of uncertainty signals, LLM-driven formalization can become a reliable engineering discipline. Abstract: Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.

[480] Incentivizing Reasoning from Weak Supervision

Yige Yuan,Teng Xiao,Shuchang Tao,Xue Wang,Jinyang Gao,Bolin Ding,Bingbing Xu

Main category: cs.CL

TL;DR: The paper proposes using supervision from weaker models to incentivize reasoning in large language models (LLMs), avoiding expensive high-quality demonstrations and reinforcement learning.

Details

Motivation: Improving LLM reasoning usually relies on expensive high-quality demonstrations or reinforcement learning; this study explores whether supervision from weak models can effectively incentivize reasoning. Method: Incentivizes the LLM's reasoning using supervisor models that are significantly weaker than the target model, and analyzes when and why this succeeds. Result: Weak supervision substantially improves student reasoning performance, recovering close to 94% of the gains of reinforcement learning at a much lower cost. Conclusion: The weak-to-strong paradigm is a simple, general, and low-cost way to incentivize reasoning in LLMs. Abstract: Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at https://github.com/yuanyige/W2SR.

[481] Inference-time Alignment in Continuous Space

Yige Yuan,Teng Xiao,Li Yunfan,Bingbing Xu,Shuchang Tao,Yunqi Qiu,Huawei Shen,Xueqi Cheng

Main category: cs.CL

TL;DR: The paper proposes SEA, a simple yet effective algorithm that adapts original responses via gradient-based sampling in a continuous latent space, addressing the limited effectiveness of existing methods when the base policy is weak or the candidate set is small.

Details

Motivation: Existing methods struggle to explore informative candidate responses when the base policy is weak or the candidate set is small, which limits their effectiveness. Method: SEA adapts original responses directly through gradient-based sampling in a continuous latent space, avoiding expensive search over the discrete space. Result: SEA achieves relative improvements of up to 77.51% on AdvBench and 16.36% on MATH over the second-best baseline. Conclusion: SEA is a simple and effective inference-time alignment algorithm that clearly outperforms existing baselines. Abstract: Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation (SEA), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to 77.51% on AdvBench and 16.36% on MATH. Our code is publicly available at https://github.com/yuanyige/SEA.
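
For readers unfamiliar with gradient-based sampling on an energy function, the toy sketch below shows generic Langevin-style updates in one dimension. It is not SEA itself: the quadratic energy, the step size, and the function names are assumptions chosen so the example stays self-contained, whereas the paper operates on a continuous representation of model responses with an energy defined by the optimal policy.

```python
import math
import random


def langevin_sample(grad_energy, x0, step=0.1, n_steps=200, noise=True, seed=0):
    """Iterate x <- x - step * grad E(x) + sqrt(2 * step) * eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        eps = rng.gauss(0.0, 1.0) if noise else 0.0
        x = x - step * grad_energy(x) + math.sqrt(2.0 * step) * eps
    return x


def toy_grad(x):
    """Gradient of the toy energy E(x) = 0.5 * (x - 3)^2, minimized at x = 3."""
    return x - 3.0


# Without noise the update is plain gradient descent and converges to 3.
x_det = langevin_sample(toy_grad, x0=0.0, noise=False)
# With noise the iterate fluctuates around the low-energy region instead.
x_noisy = langevin_sample(toy_grad, x0=0.0, noise=True)
print(round(x_det, 6))  # 3.0
```

The point of the sketch is the contrast with discrete search: instead of scoring a fixed set of candidates, the current point is moved along the energy gradient at every step.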

[482] Multi-Domain Explainability of Preferences

Nitay Calderon,Liat Ein-Dor,Roi Reichart

Main category: cs.CL

TL;DR: Proposes an automated method that uses an LLM to generate concept-based explanations of preferences and validates it across multiple domains.

Details

Motivation: Preference mechanisms (such as human preference and LLM-as-a-Judge) are central to LLM alignment and evaluation, but the concepts that drive them remain unclear. Method: Uses an LLM to discover concepts that distinguish chosen from rejected responses, builds concept-based vectors, and proposes a white-box Hierarchical Multi-Domain Regression model. Result: On a dataset spanning eight domains, the method outperforms baselines in preference prediction while remaining explainable. Conclusion: The work provides a new paradigm for explainability in the era of LLMs. Abstract: Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated end-to-end method for generating local and global concept-based explanations of preferences across multiple domains. Our method employs an LLM to discover concepts that differentiate between chosen and rejected responses and represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two novel application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work provides a new paradigm for explainability in the era of LLMs.

[483] MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

Thang Nguyen,Peter Chin,Yu-Wing Tai

Main category: cs.CL

TL;DR: MA-RAG is a multi-agent framework in which collaborating specialized agents address ambiguity and reasoning challenges in RAG, delivering efficient, interpretable results without fine-tuning.

Details

Motivation: Conventional RAG methods struggle with ambiguity and reasoning in complex information-seeking tasks; MA-RAG aims to address these problems through multi-agent collaboration. Method: MA-RAG uses specialized agents (Planner, Step Definer, Extractor, and QA Agents) that decompose and handle each stage of the RAG pipeline with task-aware reasoning. Result: On complex QA tasks, MA-RAG outperforms training-free baselines and approaches the performance of fine-tuned systems. Conclusion: MA-RAG demonstrates the effectiveness of multi-agent collaboration in RAG and provides an efficient, interpretable solution. Abstract: We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on either end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, to tackle each stage of the RAG pipeline with task-aware reasoning. Ambiguities may arise from underspecified queries, sparse or indirect evidence in retrieved documents, or the need to integrate information scattered across multiple sources. MA-RAG mitigates these challenges by decomposing the problem into subtasks, such as query disambiguation, evidence extraction, and answer synthesis, and dispatching them to dedicated agents equipped with chain-of-thought prompting. These agents communicate intermediate reasoning and progressively refine the retrieval and synthesis process. Our design allows fine-grained control over information flow without any model fine-tuning. Crucially, agents are invoked on demand, enabling a dynamic and efficient workflow that avoids unnecessary computation. This modular and reasoning-driven architecture enables MA-RAG to deliver robust, interpretable results. Experiments on multi-hop and ambiguous QA benchmarks demonstrate that MA-RAG outperforms state-of-the-art training-free baselines and rivals fine-tuned systems, validating the effectiveness of collaborative agent-based reasoning in RAG.

[484] S2LPP: Small-to-Large Prompt Prediction across LLMs

Liang Cheng,Tianyi LI,Zhaowei Wang,Mark Steedman

Main category: cs.CL

TL;DR: The study observes that large language models (LLMs) are sensitive to prompt templates and proposes using a smaller model to select effective prompt templates for a larger one, substantially reducing the cost of prompt engineering.

Details

Motivation: To reduce the compute and human cost caused by differences between prompt templates and to make LLM performance more stable. Method: Verifies experimentally that LLMs of different sizes show consistent prompt template preferences, and uses a smaller model to select the best template for a larger model. Result: The method substantially reduces prompt engineering cost while matching the performance of the best candidate prompt, and it applies to a range of NLP tasks. Conclusion: Selecting prompt templates with a smaller model is broadly applicable and robust, offering an efficient solution for LLM applications. Abstract: The performance of pre-trained Large Language Models (LLMs) is often sensitive to nuances in prompt templates, requiring careful prompt engineering, adding costs in terms of computing and human effort. In this study, we present experiments encompassing multiple LLM variants of varying sizes aimed at probing their preference with different prompts. Through experiments on Question Answering, we show prompt preference consistency across LLMs of different sizes. We also show that this consistency extends to other tasks, such as Natural Language Inference. Utilizing this consistency, we propose a method to use a smaller model to select effective prompt templates for a larger model. We show that our method substantially reduces the cost of prompt engineering while consistently matching performance with optimal prompts among candidates. More importantly, our experiment shows the efficacy of our strategy across fourteen LLMs and its applicability to a broad range of NLP tasks, highlighting its robustness.
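
The selection step described above can be pictured as scoring each candidate template with the cheaper model on a small development set and reusing the winner with the larger model. The sketch below is a hypothetical illustration of that idea only: `select_template`, the stub model, and the templates are invented for the example, and the paper's actual selection criterion may differ.

```python
def select_template(templates, dev_examples, answer_fn):
    """Return the template with the highest dev-set accuracy under answer_fn."""
    def accuracy(template):
        correct = sum(
            answer_fn(template.format(question=q)) == gold
            for q, gold in dev_examples
        )
        return correct / len(dev_examples)

    return max(templates, key=accuracy)


def toy_small_model(prompt):
    """Hypothetical stand-in for a small LLM: only the 'Q:' format works for it."""
    return "4" if prompt.startswith("Q:") else "unknown"


templates = [
    "Question: {question}\nAnswer:",
    "Q: {question}\nA:",
]
dev_examples = [("What is 2 + 2?", "4")]

best = select_template(templates, dev_examples, toy_small_model)
print(best == templates[1])  # True
```

The chosen template would then be passed unchanged to the larger model, so the expensive model is never queried during template search.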

[485] Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities

Chuangtao Ma,Yongrui Chen,Tianxing Wu,Arijit Khan,Haofen Wang

Main category: cs.CL

TL;DR: This survey reviews how large language models (LLMs) and knowledge graphs (KGs) can be combined to address poor reasoning, outdated knowledge, and hallucinations in complex question answering (QA).

Details

Motivation: LLMs perform well on QA tasks, but on complex QA they suffer from poor reasoning, outdated knowledge, and hallucinations, so they need to be combined with KGs to improve performance. Method: Proposes a new structured taxonomy that classifies existing methods by QA category and by the role the KG plays when integrated with LLMs, and systematically surveys the relevant state-of-the-art work. Result: Analyzes the strengths, limitations, and KG requirements of the different approaches and discusses how they address the main challenges of complex QA. Conclusion: Summarizes advancements, evaluation metrics, and benchmark datasets, and highlights open challenges and opportunities. Abstract: Large language models (LLMs) have demonstrated remarkable performance on question-answering (QA) tasks because of their superior capabilities in natural language understanding and generation. However, LLM-based QA struggles with complex QA tasks due to poor reasoning capacity, outdated knowledge, and hallucinations. Several recent works synthesize LLMs and knowledge graphs (KGs) for QA to address the above challenges. In this survey, we propose a new structured taxonomy that categorizes the methodology of synthesizing LLMs and KGs for QA according to the categories of QA and the KG's role when integrating with LLMs. We systematically survey state-of-the-art advances in synthesizing LLMs and KGs for QA and compare and analyze these approaches in terms of strength, limitations, and KG requirements. We then align the approaches with QA and discuss how these approaches address the main challenges of different complex QA. Finally, we summarize the advancements, evaluation metrics, and benchmark datasets and highlight open challenges and opportunities.

[486] Adaptive Deep Reasoning: Triggering Deep Thinking When Needed

Yunhao Wang,Yuhao Zhang,Tinghao Yu,Can Xu,Feng Zhang,Fengzong Lian

Main category: cs.CL

TL;DR: Proposes a new method that uses supervised fine-tuning and reinforcement learning to let large language models switch automatically between short and long reasoning chains based on problem complexity, improving reasoning efficiency.

Details

Motivation: Long-chain reasoning is computationally expensive, and existing methods still require an initial reasoning phase or manual control, which falls short of practical needs. Method: Combines supervised fine-tuning (to provide both long-chain and short-chain reasoning abilities) with reinforcement learning (optimized through an adaptive reward strategy and a reasoning mode switching loss). Result: Evaluations on mathematical datasets show that the model can switch reasoning modes dynamically without substantially sacrificing performance. Conclusion: The method makes reasoning in large language models more practical for real-world applications. Abstract: Large language models (LLMs) have shown impressive capabilities in handling complex tasks through long-chain reasoning. However, the extensive reasoning steps involved can significantly increase computational costs, posing challenges for real-world deployment. Recent efforts have focused on optimizing reasoning efficiency by shortening the Chain-of-Thought (CoT) reasoning processes through various approaches, such as length-aware prompt engineering, supervised fine-tuning on CoT data with variable lengths, and reinforcement learning with length penalties. Although these methods effectively reduce reasoning length, they still necessitate an initial reasoning phase. More recent approaches have attempted to integrate long-chain and short-chain reasoning abilities into a single model, yet they still rely on manual control to toggle between short and long CoT. In this work, we propose a novel approach that autonomously switches between short and long reasoning chains based on problem complexity. Our method begins with supervised fine-tuning of the base model to equip it with both long-chain and short-chain reasoning abilities. We then employ reinforcement learning to further balance short and long CoT generation while maintaining accuracy through two key strategies: first, integrating reinforcement learning with a long-short adaptive group-wise reward strategy to assess prompt complexity and provide corresponding rewards; second, implementing a logit-based reasoning mode switching loss to optimize the model's initial token choice, thereby guiding the selection of the reasoning type. Evaluations on mathematical datasets demonstrate that our model can dynamically switch between long-chain and short-chain reasoning modes without substantially sacrificing performance. This advancement enhances the practicality of reasoning in large language models for real-world applications.

[487] Language-Agnostic Suicidal Risk Detection Using Large Language Models

June-Woo Kim,Wonkyo Oh,Haram Yoon,Sung-Hoon Yoon,Dae-Jin Kim,Dong-Ho Lee,Sang-Yeol Lee,Chan-Mo Yang

Main category: cs.CL

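TL;DR: Proposes a language-agnostic framework for suicidal risk detection that uses large language models (LLMs) to extract features from speech transcripts and applies cross-linguistic analysis to improve robustness.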

Details

Motivation: Existing suicidal risk detection methods rely on language-specific models, which limits scalability and generalization. Method: An ASR model produces Chinese transcripts from speech; LLMs extract suicidal risk-related features from them; the features are kept in both Chinese and English for cross-linguistic analysis and used to fine-tune the corresponding pretrained language models independently. Result: Performance is comparable to direct fine-tuning on ASR output or to models trained only on Chinese features. Conclusion: The framework can overcome language constraints and improve the robustness of suicidal risk assessment. Abstract: Suicidal risk detection in adolescents is a critical challenge, yet existing methods rely on language-specific models, limiting scalability and generalization. This study introduces a novel language-agnostic framework for suicidal risk assessment with large language models (LLMs). We generate Chinese transcripts from speech using an ASR model and then employ LLMs with prompt-based queries to extract suicidal risk-related features from these transcripts. The extracted features are retained in both Chinese and English to enable cross-linguistic analysis and then used to fine-tune corresponding pretrained language models independently. Experimental results show that our method achieves performance comparable to direct fine-tuning with ASR results or to models trained solely on Chinese suicidal risk-related features, demonstrating its potential to overcome language constraints and improve the robustness of suicidal risk assessment.

[488] ResSVD: Residual Compensated SVD for Large Language Model Compression

Haolei Bai,Siyong Jian,Tuo Liang,Yu Yin,Huan Wang

Main category: cs.CL

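TL;DR: ResSVD is an SVD-based post-training LLM compression method that uses the residual matrix left by truncation and selectively compresses only the last few layers, reducing truncation loss and improving the performance of compressed models.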

Details

Motivation: The size and memory demands of large language models (LLMs) hinder practical deployment, so efficient compression strategies are needed. Existing SVD-based methods ignore the residual matrix produced by truncation, which degrades performance. Method: ResSVD uses the truncation residual matrix to reduce truncation loss and, under a fixed overall compression ratio, selectively compresses the last few layers to limit error propagation. Result: Across several LLM families and benchmark datasets, ResSVD outperforms existing methods. Conclusion: ResSVD is an effective LLM compression method with practical value. Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed models. Comprehensive evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.

[489] Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi's Zibaldone

Cristian Santini,Laura Melosi,Emanuele Frontoni

Main category: cs.CL

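TL;DR: Introduces a new named entity recognition (NER) dataset for historical Italian text and evaluates domain-specific BERT models and state-of-the-art LLMs such as LLaMa3.1. Fine-tuned NER models prove more robust on historical humanistic texts.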

Details

Motivation: Digitized textual heritage challenges both computer science and literary studies, especially because of spelling variation, fragmentary structure, and digitization errors in historical texts. No thorough evaluation of NER on historical Italian texts exists. Method: Builds a dataset of 2,899 entity references from a corpus of 19th-century scholarly notes (Giacomo Leopardi's Zibaldone) and compares domain-specific BERT-based models with LLMs such as LLaMa3.1. Result: Instruction-tuned models struggle with historical humanistic texts, while fine-tuned NER models are more robust on difficult entity types such as bibliographic references. Conclusion: Fine-tuned NER models are better suited to named entity recognition in historical texts, which points to directions for future work. Abstract: The increased digitization of the world's textual heritage poses significant challenges for both computer science and literary studies. Overall, there is an urgent need for computational techniques able to adapt to the challenges of historical texts, such as orthographic and spelling variations, fragmentary structure and digitization errors. The rise of large language models (LLMs) has revolutionized natural language processing, suggesting promising applications for Named Entity Recognition (NER) on historical documents. In spite of this, no thorough evaluation has been proposed for Italian texts. This research tries to fill the gap by proposing a new challenging dataset for entity extraction based on a corpus of 19th century scholarly notes, i.e. Giacomo Leopardi's Zibaldone (1898), containing 2,899 references to people, locations and literary works. This dataset was used to carry out reproducible experiments with both domain-specific BERT-based models and state-of-the-art LLMs such as LLaMa3.1. Results show that instruction-tuned models encounter multiple difficulties handling historical humanistic texts, while fine-tuned NER models offer more robust performance even with challenging entity types such as bibliographic references.

[490] TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

Dominik Meier,Jan Philip Wahle,Paul Röttger,Terry Ruas,Bela Gipp

Main category: cs.CL

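TL;DR: Proposes TrojanStego, a new threat model in which an adversary fine-tunes a large language model (LLM) to embed sensitive information in natural-looking outputs without controlling inference inputs. Experiments show that secrets are transmitted reliably and are hard to notice.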

Details

Motivation: As LLMs are increasingly used in sensitive workflows, their potential to leak information is a growing concern. The authors aim to expose a passive, covert, and practical data exfiltration attack. Method: Proposes an encoding scheme based on vocabulary partitioning that LLMs learn through fine-tuning, and evaluates the accuracy and stealth of secret transmission. Result: Compromised models transmit 32-bit secrets with 87% accuracy, rising above 97% with majority voting, while keeping high utility and remaining hard to detect. Conclusion: TrojanStego demonstrates a new class of LLM data exfiltration attacks that are passive, covert, and efficient, and that deserve attention. Abstract: As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.
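
To make the idea of encoding bits through a vocabulary partition more concrete, here is a toy sketch in Python. It is not the paper's scheme: the parity-based partition of token IDs, the candidate lists, and every function name are assumptions made for illustration. The abstract only states that the encoding partitions the vocabulary and that majority voting across three generations raises accuracy.

```python
from collections import Counter

def secret_to_bits(secret, n_bits=32):
    """Most-significant-bit-first bit list of a non-negative integer."""
    return [(secret >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]

def bits_to_secret(bits):
    """Inverse of secret_to_bits."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def encode(bits, candidates_per_step):
    """At each step, pick the first candidate token ID whose parity equals the bit.

    candidates_per_step[i] is a ranked list of token IDs (hypothetical model
    output); the even/odd partition is an assumption, not the paper's scheme.
    """
    return [next(t for t in cands if t % 2 == bit)
            for bit, cands in zip(bits, candidates_per_step)]

def decode(token_ids):
    """Read one bit per token from its partition (here: parity)."""
    return [t % 2 for t in token_ids]

def majority_vote(decodings):
    """Bitwise majority over an odd number of decoded bit lists."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*decodings)]
```

In this toy setting, a bit error in any single generation is corrected as long as the other two generations agree at that position, which is the intuition behind the reported gain from voting.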

[491] Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers

Zhengliang Shi,Lingyong Yan,Dawei Yin,Suzan Verberne,Maarten de Rijke,Zhaochun Ren

Main category: cs.CL

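TL;DR: EXSEARCH is an LLM-based agentic search framework that uses self-incentivized learning to improve knowledge retrieval for multi-hop queries, with substantial performance gains.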

Details

Motivation: Enabling LLMs to retrieve accurate knowledge in complex tasks remains difficult because of multi-hop queries and irrelevant retrieved content. Method: Adopts a Generalized Expectation-Maximization algorithm that alternates E-steps and M-steps, training the LLM on its own generated data so that its search ability improves iteratively. Result: On four knowledge-intensive benchmarks, EXSEARCH substantially outperforms baselines, for example by +7.8% on exact match score. Conclusion: EXSEARCH and its extension EXSEARCH-Zoo offer an effective approach to complex knowledge retrieval, with room for broader application. Abstract: Large language models (LLMs) have been widely integrated into information retrieval to advance traditional techniques. However, effectively enabling LLMs to seek accurate knowledge in complex tasks remains a challenge due to the complexity of multi-hop queries as well as the irrelevant retrieved content. To address these limitations, we propose EXSEARCH, an agentic search framework, where the LLM learns to retrieve useful information as the reasoning unfolds through a self-incentivized process. At each step, the LLM decides what to retrieve (thinking), triggers an external retriever (search), and extracts fine-grained evidence (recording) to support next-step reasoning. To enable LLM with this capability, EXSEARCH adopts a Generalized Expectation-Maximization algorithm. In the E-step, the LLM generates multiple search trajectories and assigns an importance weight to each; the M-step trains the LLM on them with a re-weighted loss function. This creates a self-incentivized loop, where the LLM iteratively learns from its own generated data, progressively improving itself for search. We further theoretically analyze this training process, establishing convergence guarantees. Extensive experiments on four knowledge-intensive benchmarks show that EXSEARCH substantially outperforms baselines, e.g., +7.8% improvement on exact match score. Motivated by these promising results, we introduce EXSEARCH-Zoo, an extension that extends our method to broader scenarios, to facilitate future work.
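
The E-step/M-step loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the softmax normalization of trajectory scores and the use of a per-trajectory negative log-likelihood are choices made here, and the function names are hypothetical; the abstract does not say how EXSEARCH computes its importance weights.

```python
import math

def importance_weights(scores):
    """E-step (sketch): softmax-normalize per-trajectory scores into weights
    that sum to 1. How scores are obtained is left open (assumption)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reweighted_loss(nlls, weights):
    """M-step objective (sketch): importance-weighted sum of the
    per-trajectory negative log-likelihoods."""
    return sum(w * nll for w, nll in zip(weights, nlls))
```

Trajectories with higher scores contribute more to the training loss, so the model is pushed toward the search behavior it judged most useful in the previous round.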

[492] AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings

Konstantin Dobler,Desmond Elliott,Gerard de Melo

Main category: cs.CL

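TL;DR: AweDist quickly learns high-quality input embeddings for new tokens by distilling representations obtained with the original tokenization, outperforming existing methods.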

Details

Motivation: Vocabularies are fixed at pretraining time, which lowers performance and raises computational cost for underrepresented domains, so new token embeddings need an efficient initialization. Method: Proposes AweDist, which learns embeddings for new tokens by distilling representations obtained with the original tokenization. Result: Experiments on a range of open-weight models show that AweDist outperforms baseline methods. Conclusion: AweDist learns new token embeddings efficiently and improves model performance. Abstract: Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods either require expensive further training or pretraining of additional modules. In this paper, we propose AweDist and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that AweDist is able to outperform even strong baselines.

[493] SeMe: Training-Free Language Model Merging via Semantic Alignment

Jian Gu,Aldeida Aleti,Chunyang Chen,Hongyu Zhang

Main category: cs.CL

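TL;DR: SeMe is a data-free, training-free method for merging language models that uses semantic alignment to merge at a fine-grained level and outperforms existing methods.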

Details

Motivation: Existing model merging methods depend on data or fail to preserve internal knowledge, so a more efficient and robust solution is needed. Method: SeMe uses latent semantic alignment to merge language models at a fine-grained, layer-wise level, with no data or training. Result: SeMe outperforms existing methods in both performance and efficiency without relying on external data. Conclusion: SeMe offers a new paradigm for knowledge-aware model merging and sheds light on the semantic structure of language models. Abstract: Despite the remarkable capabilities of Language Models (LMs) across diverse tasks, no single model consistently outperforms others, necessitating efficient methods to combine their strengths without expensive retraining. Existing model merging techniques, such as parameter averaging and task-guided fusion, often rely on data-dependent computations or fail to preserve internal knowledge, limiting their robustness and scalability. We introduce SeMe (Semantic-based Merging), a novel, data-free, and training-free approach that leverages latent semantic alignment to merge LMs at a fine-grained, layer-wise level. Unlike prior work, SeMe not only preserves model behaviors but also explicitly stabilizes internal knowledge, addressing a critical gap in LM fusion. Through extensive experiments across diverse architectures and tasks, we demonstrate that SeMe outperforms existing methods in both performance and efficiency while eliminating reliance on external data. Our work establishes a new paradigm for knowledge-aware model merging and provides insights into the semantic structure of LMs, paving the way for more scalable and interpretable model composition.

[494] UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter-Efficient Fine-Tuning of Large Models

Xueyan Zhang,Jinman Zhao,Zhifei Yang,Yibo Zhong,Shuhao Guan,Linbo Cao,Yining Wang

Main category: cs.CL

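TL;DR: UORA is a new parameter-efficient fine-tuning method that uses low-rank approximation and a selective reinitialization mechanism to cut the number of trainable parameters substantially while outperforming existing methods.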

Details

Motivation: Existing parameter-efficient fine-tuning methods such as LoRA and VeRA fall short on computation and storage efficiency, so a more efficient alternative is proposed. Method: Uses low-rank approximation together with a selective reinitialization mechanism, guided by vector magnitude, that reinitializes rows and columns of the frozen projection matrices. Result: Performs strongly on benchmarks such as GLUE and E2E, with markedly better computation and storage efficiency. Conclusion: UORA offers a new paradigm for scalable, efficient fine-tuning of LLMs. Abstract: This paper introduces Uniform Orthogonal Reinitialization Adaptation (UORA), a novel parameter-efficient fine-tuning (PEFT) approach for Large Language Models (LLMs). UORA achieves state-of-the-art performance and parameter efficiency by leveraging a low-rank approximation method to reduce the number of trainable parameters. Unlike existing methods such as LoRA and VeRA, UORA employs an interpolation-based reparametrization mechanism that selectively reinitializes rows and columns in frozen projection matrices, guided by the vector magnitude heuristic. This results in substantially fewer trainable parameters compared to LoRA and outperforms VeRA in computation and storage efficiency. Comprehensive experiments across various benchmarks demonstrate UORA's superiority in achieving competitive fine-tuning performance with negligible computational overhead. We demonstrate its performance on GLUE and E2E benchmarks and its effectiveness in instruction-tuning large language models and image classification models. Our contributions establish a new paradigm for scalable and resource-efficient fine-tuning of LLMs.

[495] Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLMs

Hanting Chen,Jiarui Qin,Jialong Guo,Tao Yuan,Yichun Yin,Huiling Zhen,Yasheng Wang,Jinpeng Li,Xiaojun Meng,Meng Zhang,Rongju Ruan,Zheyuan Bai,Yehui Tang,Can Chen,Xinghao Chen,Fisher Yu,Ruiming Tang,Yunhe Wang

Main category: cs.CL

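TL;DR: Pangu Light is an acceleration framework for large language models (LLMs) that combines structured pruning with weight re-initialization techniques to markedly improve post-pruning model performance.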

Details

Motivation: Existing structured pruning methods suffer substantial performance degradation when width and depth are reduced at the same time, so a strategic weight re-initialization is needed to improve post-pruning training accuracy. Method: Pangu Light combines structured pruning with new weight re-initialization techniques (CLAP and SLNP) and is optimized for Ascend NPUs. Result: Pangu Light beats baseline pruning methods such as Nemotron and established LLMs such as the Qwen3 series in both accuracy and efficiency; on Ascend NPUs, for example, Pangu Light-32B exceeds Qwen3-32B in both average score and throughput. Conclusion: Through its pruning and re-initialization techniques, Pangu Light provides a viable route to efficient LLM deployment. Abstract: Large Language Models (LLMs) deliver state-of-the-art capabilities across numerous tasks, but their immense size and inference costs pose significant computational challenges for practical deployment. While structured pruning offers a promising avenue for model compression, existing methods often struggle with the detrimental effects of aggressive, simultaneous width and depth reductions, leading to substantial performance degradation. This paper argues that a critical, often overlooked, aspect in making such aggressive joint pruning viable is the strategic re-initialization and adjustment of remaining weights to improve the model's post-pruning training accuracy. We introduce Pangu Light, a framework for LLM acceleration centered around structured pruning coupled with novel weight re-initialization techniques designed to address this "missing piece". Our framework systematically targets multiple axes, including model width, depth, attention heads, and RMSNorm, with its effectiveness rooted in novel re-initialization methods like Cross-Layer Attention Pruning (CLAP) and Stabilized LayerNorm Pruning (SLNP) that mitigate performance drops by providing the network a better training starting point. Further enhancing efficiency, Pangu Light incorporates specialized optimizations such as absorbing Post-RMSNorm computations and tailors its strategies to Ascend NPU characteristics. The Pangu Light models consistently exhibit a superior accuracy-efficiency trade-off, outperforming prominent baseline pruning methods like Nemotron and established LLMs like the Qwen3 series. For instance, on Ascend NPUs, Pangu Light-32B's 81.6 average score and 2585 tokens/s throughput exceed Qwen3-32B's 80.9 average score and 2225 tokens/s.

[496] Exploring Generative Error Correction for Dysarthric Speech Recognition

Moreno La Quatra,Alkis Koudounas,Valerio Mario Salerno,Sabato Marco Siniscalchi

Main category: cs.CL

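TL;DR: Proposes a two-stage framework that combines speech recognition models with LLM-based generative error correction to improve the accuracy of dysarthric speech recognition.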

Details

Motivation: Despite remarkable progress in end-to-end automatic speech recognition, accurately transcribing dysarthric speech remains a major challenge. Method: A two-stage framework combines cutting-edge speech recognition models with LLM-based generative error correction, compares different model scales and training strategies, and adds hypothesis selection to improve transcription accuracy. Result: On the Speech Accessibility Project dataset, the approach is strong on structured and spontaneous speech, while single-word recognition remains difficult. Conclusion: The analysis shows the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition. Abstract: Despite the remarkable progress in end-to-end Automatic Speech Recognition (ASR) engines, accurately transcribing dysarthric speech remains a major challenge. In this work, we propose a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). We assess different configurations of model scales and training strategies, incorporating specific hypothesis selection to improve transcription accuracy. Experiments on the Speech Accessibility Project dataset demonstrate the strength of our approach on structured and spontaneous speech, while highlighting challenges in single-word recognition. Through comprehensive analysis, we provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition.

[497] Visual Abstract Thinking Empowers Multimodal Reasoning

Dairu Liu,Ziyue Wang,Minyuan Ruan,Fuwen Luo,Chi Chen,Peng Li,Yang Liu

Main category: cs.CL

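TL;DR: Proposes Visual Abstract Thinking (VAT), a new paradigm that simplifies visual information to improve the reasoning of multimodal large language models (MLLMs); experiments show it outperforms conventional approaches.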

Details

Motivation: Images often contain redundant information that can hurt multimodal reasoning, whereas humans tend to simplify complex information through abstract thinking; this inspired VAT. Method: VAT prompts the model with visual abstracts rather than verbose intermediate steps or external knowledge, so that the model concentrates on the essential visual elements. Result: VAT yields an average gain of 17% over the GPT-4o baseline, performs well on conceptual, structural, and relational reasoning tasks, and is compatible with CoT. Conclusion: Abstract thinking effectively improves visual reasoning and points toward further reasoning paradigms grounded in human cognition. Abstract: Images usually convey richer detail than text, but often include redundant information which potentially degrades multimodal reasoning performance. When faced with lengthy or complex messages, humans tend to employ abstract thinking to convert them into simple and concise abstracts. Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT), a novel thinking paradigm that prompts Multimodal Large Language Models (MLLMs) with visual abstracts instead of explicit verbal thoughts or elaborate guidance, permitting a more concentrated visual reasoning mechanism. Explicit thinking, such as Chain-of-thought (CoT) or tool-augmented approaches, increases the complexity of the reasoning process by inserting verbose intermediate steps, external knowledge or visual information. In contrast, VAT reduces redundant visual information and encourages models to focus their reasoning on more essential visual elements. Experimental results show that VAT consistently empowers different models, and achieves an average gain of 17% over the GPT-4o baseline by employing diverse types of visual abstracts, demonstrating that VAT can enhance visual reasoning abilities for MLLMs regarding conceptual, structural and relational reasoning tasks. VAT is also compatible with CoT in knowledge-intensive multimodal reasoning tasks. These findings highlight the effectiveness of visual reasoning via abstract thinking and encourage further exploration of more diverse reasoning paradigms from the perspective of human cognition.

[498] "KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding

Alkis Koudounas,Moreno La Quatra,Eliana Pastor,Sabato Marco Siniscalchi,Elena Baralis

Main category: cs.CL

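TL;DR: The first study of Kolmogorov-Arnold Networks (KANs) for spoken language understanding, with experiments confirming the effectiveness of KAN layers across several configurations.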

Details

Motivation: KANs are an alternative to traditional neural architectures, but their use in speech processing has been little studied. Method: Integrates KAN layers into 2D-CNN models in five configurations and applies the best configuration to transformer-based models. Result: KAN layers can effectively replace linear layers, with comparable or better performance in most cases. Conclusion: KAN and linear layers on top of transformers attend to different regions of the input, which offers a new perspective on spoken language understanding. Abstract: Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional neural architectures, yet their application to speech processing remains underexplored. This work presents the first investigation of KANs for Spoken Language Understanding (SLU) tasks. We experiment with 2D-CNN models on two datasets, integrating KAN layers in five different configurations within the dense block. The best-performing setup, which places a KAN layer between two linear layers, is directly applied to transformer-based models and evaluated on five SLU datasets with increasing complexity. Our results show that KAN layers can effectively replace the linear layers, achieving comparable or superior performance in most cases. Finally, we provide insights into how KAN and linear layers on top of transformers differently attend to input regions of the raw waveforms.

[499] THiNK: Can Large Language Models Think-aloud?

Yongan Yu,Mengqian Wu,Yiran Lin,Nikki G. Lobczowski

Main category: cs.CL

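TL;DR: THiNK is a multi-agent, feedback-driven framework grounded in Bloom's Taxonomy for evaluating higher-order thinking in large language models. Models do well on lower-order tasks but fall short on higher-order ones, and feedback loops significantly improve reasoning.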

Details

Motivation: Assessing higher-order thinking in large language models is an important challenge, and existing methods rarely go beyond surface-level accuracy. Method: THiNK uses a multi-agent framework that iterates over problem generation, critique, and revision, and applies Bloom's Taxonomy to assess both lower-order and higher-order thinking skills. Result: Models are reliable on lower-order tasks but perform poorly on higher-order ones; feedback loops significantly improve reasoning. Conclusion: THiNK provides a scalable method for evaluating and improving LLM reasoning and opens new evaluation directions grounded in learning science. Abstract: Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think-aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models reliably perform lower-order categories well, they struggle with applying knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. Our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science; the code is available at our GitHub repository.

[500] Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning

Xiaorong Wang,Ting Yang,Zhu Zhang,Shuo Wang,Zihan Zhou,Liner Yang,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

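TL;DR: Proposes a divide-and-conquer method for evaluating long-text quality that combines local scoring with a global assessment and adds hybrid in-context learning and an active learning algorithm; experiments show it outperforms baselines.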

Details

Motivation: Quality evaluation of long text degrades as input length grows, and existing methods cope poorly. Method: A divide-and-conquer strategy splits the evaluation into local scoring tasks followed by a global assessment, combined with hybrid in-context learning and an active learning algorithm. Result: Experiments show the method outperforms several baselines. Conclusion: The proposed framework improves the accuracy and efficiency of long-text quality evaluation. Abstract: Assessing the quality of long-form, model-generated text is challenging, even with advanced LLM-as-a-Judge methods, due to performance degradation as input length increases. To address this issue, we propose a divide-and-conquer approach, which breaks down the comprehensive evaluation task into a series of localized scoring tasks, followed by a final global assessment. This strategy allows for more granular and manageable evaluations, ensuring that each segment of the text is assessed in isolation for both coherence and quality, while also accounting for the overall structure and consistency of the entire piece. Moreover, we introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. By incorporating human-generated feedback directly into the evaluation process, this method allows the model to better align with human judgment. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation, thereby reducing annotation costs in practical scenarios. Experimental results show that the proposed evaluation framework outperforms several representative baselines, highlighting the effectiveness of our approach.
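
One simple way to picture uncertainty-based sample selection is entropy ranking, sketched below. This is an assumption made for illustration: the abstract does not state which uncertainty measure Monocle uses, and the function names and the shape of `score_dists` (one probability distribution over quality scores per sample) are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(score_dists, budget):
    """Return the indices of the `budget` samples whose predicted score
    distributions are most uncertain (highest entropy first)."""
    ranked = sorted(range(len(score_dists)),
                    key=lambda i: entropy(score_dists[i]),
                    reverse=True)
    return ranked[:budget]
```

Samples on which the judge is confident are skipped, so the annotation budget goes to the cases where a human label is likely to change the outcome.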

[501] Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking

Pengxiang Li,Shilin Yan,Joey Tsai,Renrui Zhang,Ruichuan An,Ziyu Guo,Xiaowei Gao

Main category: cs.CL

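TL;DR: A-CFG adapts the unconditional input dynamically, using the model's predictive confidence to steer the iterative generation process, which substantially improves the effectiveness of guidance.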

Details

Motivation: Standard CFG uses a static unconditional input, which can be suboptimal in iterative generation because model uncertainty changes from step to step. Method: At each iteration, A-CFG identifies low-confidence tokens and re-masks them, creating a dynamic unconditional input that focuses guidance on ambiguous regions. Result: A-CFG outperforms standard CFG on a range of language generation tasks, for example by 3.9 points on GPQA. Conclusion: Adapting the guidance mechanism to model uncertainty substantially improves iterative generation. Abstract: Classifier-Free Guidance (CFG) significantly enhances controllability in generative models by interpolating conditional and unconditional predictions. However, standard CFG often employs a static unconditional input, which can be suboptimal for iterative generation processes where model uncertainty varies dynamically. We introduce Adaptive Classifier-Free Guidance (A-CFG), a novel method that tailors the unconditional input by leveraging the model's instantaneous predictive confidence. At each step of an iterative (masked) diffusion language model, A-CFG identifies tokens in the currently generated sequence for which the model exhibits low confidence. These tokens are temporarily re-masked to create a dynamic, localized unconditional input. This focuses CFG's corrective influence precisely on areas of ambiguity, leading to more effective guidance. We integrate A-CFG into a state-of-the-art masked diffusion language model and demonstrate its efficacy. Experiments on diverse language generation benchmarks show that A-CFG yields substantial improvements over standard CFG, achieving, for instance, a 3.9 point gain on GPQA. Our work highlights the benefit of dynamically adapting guidance mechanisms to model uncertainty in iterative generation.
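
The two ingredients named in the abstract, re-masking low-confidence tokens and combining conditional with unconditional predictions, can be sketched as below. The guidance formula is one common CFG formulation, not necessarily the one used in the paper, and the `MASK` token, the `ratio` parameter, and the function names are assumptions introduced for this example.

```python
MASK = "<mask>"  # hypothetical mask token

def remask_low_confidence(tokens, confidences, ratio):
    """Replace the `ratio` fraction of least-confident tokens with MASK,
    producing a dynamic 'unconditional' input for the next guidance step."""
    k = int(len(tokens) * ratio)
    lowest = sorted(range(len(tokens)), key=lambda i: confidences[i])[:k]
    drop = set(lowest)
    return [MASK if i in drop else t for i, t in enumerate(tokens)]

def cfg_combine(cond_logits, uncond_logits, scale):
    """One common CFG form: uncond + scale * (cond - uncond).
    scale = 1 recovers the conditional logits; scale > 1 extrapolates."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```

In a full system the unconditional logits would come from running the model on the re-masked sequence; here they are simply passed in as lists.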

[502] Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations

Mohit Chandra,Siddharth Sriraman,Harneet Singh Khanuja,Yiqiao Jin,Munmun De Choudhury

Main category: cs.CL

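TL;DR: Studies the multi-turn capabilities of large language models (LLMs) in mental health conversations, introducing the MedAgent framework, the MHSD dataset, and the MultiSenseEval evaluation framework. Frontier reasoning models perform poorly on patient-centric communication and advanced diagnostic capabilities.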

Details

Motivation: Limited access to mental healthcare and long wait times lead people to turn to LLMs for help, yet the ability of LLMs to hold multi-turn mental health conversations is under-explored. Method: Introduces MedAgent to generate synthetic data (the MHSD dataset) and MultiSenseEval to evaluate multi-turn conversation abilities against human-centric criteria. Result: Frontier reasoning models perform poorly on patient-centric communication and advanced diagnostic capabilities (average score 31%), and performance drops as the number of turns increases. Conclusion: The work contributes a synthetic data generation framework, a dataset, and an evaluation framework, laying groundwork for assessing LLMs in multi-turn mental health conversations. Abstract: Limited access to mental healthcare, extended wait times, and increasing capabilities of Large Language Models (LLMs) have led individuals to turn to LLMs for fulfilling their mental health needs. However, examining the multi-turn mental health conversation capabilities of LLMs remains under-explored. Existing evaluation frameworks typically focus on diagnostic accuracy and win-rates and often overlook alignment with patient-specific goals, values, and personalities required for meaningful conversations. To address this, we introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations and use it to create the Mental Health Sensemaking Dialogue (MHSD) dataset, comprising over 2,200 patient-LLM conversations. Additionally, we present MultiSenseEval, a holistic framework to evaluate the multi-turn conversation abilities of LLMs in healthcare settings using human-centric criteria. Our findings reveal that frontier reasoning models yield below-par performance for patient-centric communication and struggle at advanced diagnostic capabilities with an average score of 31%. Additionally, we observed variation in model performance based on the patient's persona and a performance drop with increasing turns in the conversation. Our work provides a comprehensive synthetic data generation framework, a dataset and evaluation framework for assessing LLMs in multi-turn mental health conversations.

[503] How to Improve the Robustness of Closed-Source Models on NLI

Joe Stacey,Lisa Alazraki,Aran Ubhi,Beyza Ermis,Aaron Mueller,Marek Rei

Main category: cs.CL

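TL;DR: Studies how data-centric methods can improve the robustness of closed-source large language models (LLMs) without access to model internals. The best strategy depends on the complexity of the OOD data, and choosing the method by complexity improves robustness.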

Details

Motivation: Closed-source LLMs perform well on natural language tasks, but fine-tuning can make them learn dataset-specific heuristics that reduce robustness on OOD data. Existing methods assume access to model internals or changes to the training procedure, so they do not apply to closed-source models. Method: Data-centric strategies chosen by OOD complexity: upsampling more challenging training examples, or replacing part of the training set with LLM-generated examples. Result: For highly complex OOD data, upsampling improves robustness by up to 1.5%; for less complex OOD data, replacing part of the training set improves robustness by 3.7%. Closed-source autoregressive LLMs are more robust than commonly used encoder models. Conclusion: Closed-source autoregressive LLMs are a more appropriate baseline, and data-centric methods improve their robustness without access to model internals. Abstract: Closed-source Large Language Models (LLMs) have become increasingly popular, with impressive performance across a wide range of natural language tasks. These models can be fine-tuned to further improve performance, but this often results in the models learning from dataset-specific heuristics that reduce their robustness on out-of-distribution (OOD) data. Existing methods to improve robustness either perform poorly, or are non-applicable to closed-source models because they assume access to model internals, or the ability to change the model's training procedure. In this work, we investigate strategies to improve the robustness of closed-source LLMs through data-centric methods that do not require access to model internals. We find that the optimal strategy depends on the complexity of the OOD data. For highly complex OOD datasets, upsampling more challenging training examples can improve robustness by up to 1.5%. For less complex OOD datasets, replacing a portion of the training set with LLM-generated examples can improve robustness by 3.7%. More broadly, we find that large-scale closed-source autoregressive LLMs are substantially more robust than commonly used encoder models, and are a more appropriate choice of baseline going forward.

[504] Dependency Parsing is More Parameter-Efficient with Normalization

Paolo Gajo,Domenic Rosati,Hassan Sajjad,Alberto Barrón-Cedeño

Main category: cs.CL

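TL;DR: Proposes normalizing biaffine scores to make dependency parsing more parameter-efficient and reduce overparameterization; experiments show it beats the state of the art on some datasets.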

Details

Motivation: Biaffine scoring in dependency parsing is not normalized, which leads to overparameterized models: extra parameters are needed to compensate for the sharp softmax outputs caused by high-variance inputs. Method: Experiments on six datasets compare parsing performance with and without normalized biaffine scores, using stacked BiLSTMs. Result: Normalization reaches state-of-the-art performance on two datasets while using fewer samples and parameters. Conclusion: Normalizing biaffine scores makes parsing substantially more efficient and reduces overparameterization. Abstract: Dependency parsing is the task of inferring natural language structure, often approached by modeling word interactions via attention through biaffine scoring. This mechanism works like self-attention in Transformers, where scores are calculated for every pair of words in a sentence. However, unlike Transformer attention, biaffine scoring does not use normalization prior to taking the softmax of the scores. In this paper, we provide theoretical evidence and empirical results revealing that a lack of normalization necessarily results in overparameterized parser models, where the extra parameters compensate for the sharp softmax outputs produced by high variance inputs to the biaffine scoring function. We argue that biaffine scoring can be made substantially more efficient by performing score normalization. We conduct experiments on six datasets for semantic and syntactic dependency parsing using a one-hop parser. We train N-layer stacked BiLSTMs and evaluate the parser's performance with and without normalizing biaffine scores. Normalizing allows us to beat the state of the art on two datasets, with fewer samples and trainable parameters. Code: https://anonymous.4open.science/r/EfficientSDP-70C1
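
The effect of normalizing scores before the softmax can be seen in a small sketch. It assumes the normalization is the Transformer-style division by the square root of the hidden size, which the abstract suggests by analogy but does not state outright; the bias terms of a full biaffine scorer are omitted, and all function names are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def biaffine_scores(head, deps, W):
    """score_j = head^T W dep_j for each candidate vector dep_j (no bias)."""
    d = len(head)
    hW = [sum(head[i] * W[i][k] for i in range(d)) for k in range(len(W[0]))]
    return [sum(hW[k] * dep[k] for k in range(len(dep))) for dep in deps]

def scale_scores(scores, d):
    """Assumed normalization: divide by sqrt(d), as in Transformer attention."""
    return [s / math.sqrt(d) for s in scores]
```

With larger hidden sizes the raw scores grow in magnitude and the softmax saturates; the scaled scores keep the distribution softer, which is the behavior the abstract links to needing fewer parameters.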

[505] FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

Hao Kang,Zichun Yu,Chenyan Xiong

Main category: cs.CL

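TL;DR: FLAME-MoE is a fully open-source research platform for studying scaling, routing, and expert behavior in Mixture-of-Experts (MoE) architectures. It provides models from 38M to 1.7B active parameters and outperforms dense baselines across six evaluation tasks.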

Details

Motivation: The lack of a fully open MoE research platform holds back academic study of MoE architectures. Method: Releases FLAME-MoE, a suite of seven decoder-only models with 64 experts and top-8 gating, together with the full training data pipelines and scripts. Result: Across six tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines; analysis shows that experts specialize progressively and routing behavior stabilizes early. Conclusion: FLAME-MoE gives MoE research a transparent, reproducible platform and demonstrates the potential of the architecture. Abstract: Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture (64 experts with top-8 gating and 2 shared experts) closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.
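
For readers unfamiliar with top-k gating, the routing step for one token can be sketched as follows. This is a generic illustration, not FLAME-MoE's code: renormalizing with a softmax over only the selected experts is an assumption (implementations differ on whether the softmax comes before or after selection), the always-active shared experts are not modeled, and the function name is hypothetical.

```python
import math

def top_k_gate(router_logits, k=8):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax over the selected logits only (assumption),
    so they sum to 1 across the chosen experts.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}
```

With 64 routed experts and k = 8, only one eighth of the routed experts run for each token, which is where the efficiency of the architecture comes from.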

[506] Bridging the Long-Term Gap: A Memory-Active Policy for Multi-Session Task-Oriented Dialogue

Yiming Du,Bingbing Wang,Yang He,Bin Liang,Baojun Wang,Zhongyang Li,Lin Gui,Jeff Z. Pan,Ruifeng Xu,Kam-Fai Wong

Main category: cs.CL

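TL;DR: The paper introduces MS-TOD, the first multi-session task-oriented dialogue dataset, and proposes a Memory-Active Policy (MAP) that improves multi-session dialogue efficiency through a two-stage approach.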

Details

Motivation: Existing task-oriented dialogue systems mainly target single sessions, which limits their effectiveness in long-term memory augmentation. Method: Propose the Memory-Active Policy (MAP), a two-stage approach consisting of Memory-Guided Dialogue Planning and a Proactive Response Strategy. Result: On the MS-TOD dataset, MAP significantly improves task success and turn efficiency. Conclusion: MAP performs well in multi-session scenarios while remaining competitive on single-session tasks. Abstract: Existing Task-Oriented Dialogue (TOD) systems primarily focus on single-session dialogues, limiting their effectiveness in long-term memory augmentation. To address this challenge, we introduce MS-TOD, the first multi-session TOD dataset designed to retain long-term memory across sessions, enabling fewer turns and more efficient task completion. This defines a new benchmark task for evaluating long-term memory in multi-session TOD. Based on this new dataset, we propose a Memory-Active Policy (MAP) that improves multi-session dialogue efficiency through a two-stage approach. 1) Memory-Guided Dialogue Planning retrieves intent-aligned history, identifies key QA units via a memory judger, refines them by removing redundant questions, and generates responses based on the reconstructed memory. 2) Proactive Response Strategy detects and corrects errors or omissions, ensuring efficient and accurate task completion. We evaluate MAP on the MS-TOD dataset, focusing on response quality and effectiveness of the proactive strategy. Experiments on MS-TOD demonstrate that MAP significantly improves task success and turn efficiency in multi-session scenarios, while maintaining competitive performance on conventional single-session tasks.

[507] Efficient Speech Translation through Model Compression and Knowledge Distillation

Yasmin Moslem

Main category: cs.CL

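TL;DR: The paper combines layer pruning, low-rank adaptation with 4-bit quantization (QLoRA), and knowledge distillation to compress large audio-language models, substantially reducing compute requirements while preserving translation quality.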

Details

Motivation: Large audio-language models for speech translation have high computational requirements, which makes efficient deployment difficult. Method: Combine iterative layer pruning, low-rank adaptation with 4-bit quantization (QLoRA), and knowledge distillation. Result: The compressed models reduce parameters and storage footprint by up to 50% while retaining 97-100% of translation quality. Conclusion: The approach balances model compression against performance for speech translation. Abstract: Efficient deployment of large audio-language models for speech translation remains challenging due to their significant computational requirements. In this paper, we address this challenge through our system submissions to the "Model Compression" track at the International Conference on Spoken Language Translation (IWSLT 2025). We experiment with a combination of approaches including iterative layer pruning based on layer importance evaluation, low-rank adaptation with 4-bit quantization (QLoRA), and knowledge distillation. In our experiments, we use Qwen2-Audio-7B-Instruct for speech translation into German and Chinese. Our pruned (student) models achieve up to a 50% reduction in both model parameters and storage footprint, while retaining 97-100% of the translation quality of the in-domain (teacher) models.

[508] It's High Time: A Survey of Temporal Information Retrieval and Question Answering

Bhawna Piryani,Abdelrahman Abdullah,Jamshid Mozafari,Avishek Anand,Adam Jatowt

Main category: cs.CL

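TL;DR: This survey reviews the state of research in Temporal Information Retrieval and Temporal Question Answering, covering the challenges of handling time-sensitive information, both traditional and modern neural methods, and evaluation strategies.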

Details

Motivation: As time-stamped content grows, systems must handle temporal intent detection, time expression normalization, event ordering, and temporal reasoning to serve dynamic, time-sensitive domains. Method: Review traditional approaches and modern neural methods (such as transformers and LLMs), along with recent advances in temporal language modeling, multi-hop reasoning, and retrieval-augmented generation (RAG). Result: A summary of research progress in Temporal Information Retrieval and Temporal Question Answering, including benchmark datasets and evaluation strategies that test temporal robustness, recency awareness, and generalization. Conclusion: Handling temporal information is a key challenge in dynamic domains, and modern methods such as LLMs and RAG offer new directions for addressing it. Abstract: Time plays a critical role in how information is generated, retrieved, and interpreted. In this survey, we provide a comprehensive overview of Temporal Information Retrieval and Temporal Question Answering, two research areas aimed at handling and understanding time-sensitive information. As the amount of time-stamped content from sources like news articles, web archives, and knowledge bases increases, systems must address challenges such as detecting temporal intent, normalizing time expressions, ordering events, and reasoning over evolving or ambiguous facts. These challenges are critical across many dynamic and time-sensitive domains, from news and encyclopedias to science, history, and social media. We review both traditional approaches and modern neural methods, including those that use transformer models and Large Language Models (LLMs). We also review recent advances in temporal language modeling, multi-hop reasoning, and retrieval-augmented generation (RAG), alongside benchmark datasets and evaluation strategies that test temporal robustness, recency awareness, and generalization.

[509] KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing

Rui Li,Quanyu Dai,Zeyu Zhang,Xu Chen,Zhenhua Dong,Ji-Rong Wen

Main category: cs.CL

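TL;DR: KnowTrace is a RAG framework that uses structured knowledge graphs to mitigate context overload and improve the quality of multi-step reasoning.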

Details

Motivation: When existing RAG methods handle multi-hop questions, context overload and futile reasoning steps make it hard for the LLM to perceive critical information. Method: KnowTrace autonomously traces knowledge triplets to build a question-specific knowledge graph, which provides structured context and enables a knowledge backtracing mechanism. Result: KnowTrace outperforms existing methods on three multi-hop question answering benchmarks, and the bootstrapped version improves performance further. Conclusion: Through structured knowledge and self-bootstrapping, KnowTrace mitigates context overload and improves reasoning quality. Abstract: Recent advances in retrieval-augmented generation (RAG) furnish large language models (LLMs) with iterative retrievals of relevant information to handle complex multi-hop questions. These methods typically alternate between LLM reasoning and retrieval to accumulate external information into the LLM's context. However, the ever-growing context inherently imposes an increasing burden on the LLM to perceive connections among critical information pieces, with futile reasoning steps further exacerbating this overload issue. In this paper, we present KnowTrace, an elegant RAG framework to (1) mitigate the context overload and (2) bootstrap higher-quality multi-step reasoning. Instead of simply piling the retrieved contents, KnowTrace autonomously traces out desired knowledge triplets to organize a specific knowledge graph relevant to the input question. Such a structured workflow not only empowers the LLM with an intelligible context for inference, but also naturally inspires a reflective mechanism of knowledge backtracing to identify contributive LLM generations as process supervision data for self-bootstrapping. Extensive experiments show that KnowTrace consistently surpasses existing methods across three multi-hop question answering benchmarks, and the bootstrapped version further amplifies the gains.

[510] WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models

Yongan Yu,Qingchen Hu,Xianda Du,Jiayin Wang,Fengran Mo,Renee Sieber

Main category: cs.CL

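TL;DR: The study builds a disruptive weather impact dataset and WXImpactBench, the first benchmark for evaluating large language models (LLMs) in this domain, with multi-label classification and ranking-based question answering tasks.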

Details

Motivation: Climate change adaptation requires understanding how disruptive weather affects society, but the effectiveness of LLMs here is under-explored because high-quality corpora are hard to collect and benchmarks are lacking. Method: Build a dataset with a four-stage construction pipeline, then propose the WXImpactBench benchmark with two evaluation tasks. Result: Experiments on a set of LLMs provide a first-hand analysis of the challenges in developing systems that understand disruptive weather impacts. Conclusion: The released dataset and evaluation framework code can help society protect against vulnerabilities from disasters. Abstract: Climate change adaptation requires the understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted to and recovered from disasters. However, the processing of the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a four-stage well-crafted construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks, multi-label classification and ranking-based question answering. Extensive experiments on evaluating a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.

[511] ARM: Adaptive Reasoning Model

Siye Wu,Jian Xie,Yikai Zhang,Aili Chen,Kai Zhang,Yu Su,Yanghua Xiao

Main category: cs.CL

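TL;DR: The paper proposes the Adaptive Reasoning Model (ARM), which selects a reasoning format dynamically to address the "overthinking" caused by fixed reasoning styles in large reasoning models, substantially improving efficiency.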

Details

Motivation: Large reasoning models perform well on complex tasks but cannot adjust their reasoning to task difficulty, which leads to "overthinking". Existing approaches rely on human intervention to control reasoning length, which contradicts the goal of fully autonomous AI. Method: Propose ARM, which supports four reasoning formats (Direct Answer, Short CoT, Code, Long CoT), and introduce Ada-GRPO, a training method that addresses format collapse. Result: ARM reduces reasoning tokens by 30% on average (up to 70%) with performance comparable to a model that uses only Long CoT, and training is 2x faster. Conclusion: By adapting its reasoning format, ARM substantially improves efficiency, supports multiple modes, and offers a workable path toward fully autonomous AI. Abstract: While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones -- Direct Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens -- ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
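
The Consensus-Guided Mode described in the abstract can be sketched as a small control-flow function. This is an illustration only: it assumes "agreement" means all three efficient formats return the same answer, and the solver functions are hypothetical lookup-table stand-ins for model calls.

```python
def consensus_guided_answer(question, efficient_solvers, long_cot_solver):
    """Use the cheap formats first; fall back to Long CoT only on disagreement."""
    answers = [solver(question) for solver in efficient_solvers]
    # Assumption: consensus means every efficient format gives the same answer.
    if len(set(answers)) == 1:
        return answers[0], "efficient-consensus"
    return long_cot_solver(question), "long-cot-fallback"


# Hypothetical stand-ins for the four reasoning formats (not real model calls).
def direct_answer(question):
    return {"2+2": "4", "17*23": "381"}[question]


def short_cot(question):
    return {"2+2": "4", "17*23": "391"}[question]


def code_format(question):
    return {"2+2": "4", "17*23": "391"}[question]


def long_cot(question):
    return {"2+2": "4", "17*23": "391"}[question]


efficient = [direct_answer, short_cot, code_format]
easy = consensus_guided_answer("2+2", efficient, long_cot)
hard = consensus_guided_answer("17*23", efficient, long_cot)
print(easy)
print(hard)
```

The fallback path costs the three cheap generations plus one Long CoT generation, which matches the abstract's note that this mode trades higher token usage for performance.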

[512] We Need to Measure Data Diversity in NLP -- Better and Broader

Dong Nguyen,Esther Ploeger

Main category: cs.CL

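TL;DR: This paper examines the conceptual and methodological challenges of measuring diversity in NLP datasets and stresses the importance of interdisciplinary perspectives for developing more fine-grained and valid measures.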

Details

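Motivation: Although diversity in NLP datasets is receiving growing attention, how to measure it remains underexplored. Method: Analyze the conceptual and methodological challenges and argue for the necessity of interdisciplinary perspectives. Result: The paper identifies shortcomings of current diversity measures and highlights the value of interdisciplinary collaboration. Conclusion: Interdisciplinary perspectives are essential for developing more valid diversity measures. Abstract: Although diversity in NLP datasets has received growing attention, the question of how to measure it remains largely underexplored. This opinion paper examines the conceptual and methodological challenges of measuring data diversity and argues that interdisciplinary perspectives are essential for developing more fine-grained and valid measures.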

[513] Does quantization affect models' performance on long-context tasks?

Anmol Mekala,Anirudh Atmakuru,Yixiao Song,Marzena Karpinska,Mohit Iyyer

Main category: cs.CL

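TL;DR: The paper systematically evaluates how quantization affects LLM performance on long-input (>64K tokens) and long-form output tasks. 8-bit quantization costs little (~0.8% drop), whereas 4-bit quantization causes substantial losses (up to 59%), with degradation tending to worsen on non-English input and varying widely across methods, models, and tasks.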

Details

Motivation: Long context windows (>128K tokens) in large language models (LLMs) bring high memory requirements and inference latency. Quantization can reduce these costs but may degrade performance, so a systematic evaluation is needed. Method: Evaluate five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4) and five models (Llama-3.1 8B/70B; Qwen-2.5 7B/32B/72B) on 9.7K test examples. Result: 8-bit quantization loses little on average (~0.8%), 4-bit quantization loses substantially (up to 59%), and results vary widely with non-English input and across models and tasks. Conclusion: Deploying quantized LLMs calls for careful evaluation per task, model, and language, especially in long-context and non-English scenarios. Abstract: Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long inputs (>64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and with languages other than English.
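
For intuition about why 4-bit methods lose more than 8-bit ones, the sketch below applies plain symmetric absmax round-to-nearest quantization to random weights at both bit widths. This is a deliberately simple stand-in, not GPTQ, AWQ, NF4, or FP8 as evaluated in the paper, and the Gaussian weights are synthetic.

```python
import random


def quantize_dequantize(weights, bits):
    """Symmetric absmax round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code
        restored.append(q * scale)
    return restored, scale


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weight tensor

restored8, scale8 = quantize_dequantize(weights, bits=8)
restored4, scale4 = quantize_dequantize(weights, bits=4)
error8 = mean_abs_error(weights, restored8)
error4 = mean_abs_error(weights, restored4)
print(f"8-bit: step {scale8:.6f}, mean abs error {error8:.6f}")
print(f"4-bit: step {scale4:.6f}, mean abs error {error4:.6f}")
```

With the same value range, the 4-bit grid has roughly 18 times coarser steps than the 8-bit grid (127/7), so rounding error per weight grows accordingly; how that translates into task accuracy is exactly what the paper measures empirically.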

[514] OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction

Haonan Zhang,Run Luo,Xiong Liu,Yuchuan Wu,Ting-En Lin,Pengpeng Zeng,Qiang Qu,Feiteng Fang,Min Yang,Lianli Gao,Jingkuan Song,Fei Huang,Yongbin Li

Main category: cs.CL

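TL;DR: OmniCharacter is a low-latency speech-language interaction model that aims to make role-playing agents (RPAs) more immersive by combining role-specific personality and voice traits.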

Details

Motivation: Existing methods focus mainly on textual dialogue and neglect the key role of voice traits in interaction, which limits immersion. Method: Propose the OmniCharacter model, which mixes speech and language responses, and build the OmniCharacter-10K dataset to support training. Result: Experiments show that OmniCharacter outperforms existing RPAs and speech-language models in both content and style, with latency as low as 289ms. Conclusion: OmniCharacter offers an effective solution for immersive role-playing interaction; the dataset and code are open-sourced. Abstract: Role-Playing Agents (RPAs), benefiting from large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the crucial effect of a role's voice traits (e.g., voice style and emotions) on interaction, which make for more immersive experiences in realistic scenarios. Towards this goal, we propose OmniCharacter, the first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction, enabling a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves more distinctive characters (20), richly contextualized multi-round dialogue (10K), and dynamic speech response (135K). Experimental results showcase that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.

[515] One-shot Entropy Minimization

Zitian Gao,Lynx Chen,Joey Zhou,Bryan Dai

Main category: cs.CL

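TL;DR: With just one unlabeled sample and 10 optimization steps, entropy minimization achieves performance gains comparable to or better than rule-based reinforcement learning.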

Details

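Motivation: Explore a simpler post-training paradigm for large language models that reduces dependence on large amounts of data and carefully designed rewards. Method: Train 13,440 large language models with entropy minimization, using only one unlabeled sample and 10 optimization steps. Result: The performance gains are comparable to or better than those of rule-based reinforcement learning, with far fewer resources. Conclusion: The result may prompt a rethinking of post-training paradigms for large language models. Abstract: We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled data sample and 10 optimization steps to achieve performance improvements comparable to or even greater than those obtained using thousands of samples and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is available at https://github.com/zitian-gao/one-shot-em.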
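
As a toy picture of what an entropy-minimization objective does, the sketch below runs 10 gradient steps that lower the entropy of a single next-token distribution. It is a strong simplification and not the authors' training code: the logits are treated as free parameters (in practice the model weights are updated), and the learning rate and starting logits are arbitrary assumptions. The gradient used is dH/dz_i = -p_i (log p_i + H) for H = -sum_j p_j log p_j.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gradient(logits):
    """Gradient of softmax entropy with respect to the logits."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]


logits = [1.0, 0.5, 0.2, -0.3]  # hypothetical next-token logits
learning_rate = 0.5             # arbitrary choice for the toy example
history = [entropy(softmax(logits))]
for _ in range(10):             # 10 steps, echoing the abstract
    grad = entropy_gradient(logits)
    logits = [z - learning_rate * g for z, g in zip(logits, grad)]
    history.append(entropy(softmax(logits)))

print(" -> ".join(f"{h:.3f}" for h in history))
```

Each step pushes probability mass toward the token the model already prefers, so the objective needs no labels; it only sharpens existing predictions.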

[516] MASKSEARCH: A Universal Pre-Training Framework to Enhance Agentic Search Capability

Weiqi Wu,Xin Guan,Shen Huang,Yong Jiang,Pengjun Xie,Fei Huang,Jiuxin Cao,Hai Zhao,Jingren Zhou

Main category: cs.CL

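TL;DR: The paper proposes MASKSEARCH, a pre-training framework that uses the Retrieval Augmented Mask Prediction (RAMP) task to improve the universal retrieval and reasoning capabilities of large language models (LLMs), followed by supervised fine-tuning (SFT) and reinforcement learning (RL) on downstream tasks.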

Details

Motivation: Existing training-based methods are limited by their task-specific training data and cannot fully realize an agent's universal search capability. Method: Pre-train with the RAMP task, then train on downstream tasks with SFT and RL, using a multi-agent system for data generation and a hybrid reward scheme. Result: Experiments show that MASKSEARCH significantly improves LLM-based search agents on open-domain multi-hop question answering. Conclusion: The MASKSEARCH framework strengthens the universal retrieval and reasoning capabilities of LLMs and applies to a range of downstream tasks. Abstract: Retrieval-Augmented Language Models (RALMs) represent a classic paradigm where models enhance generative capabilities using external knowledge retrieved via a specialized module. Recent advancements in Agent techniques enable Large Language Models (LLMs) to autonomously utilize tools for retrieval, planning, and reasoning. While existing training-based methods show promise, their agentic abilities are limited by inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MASKSEARCH. In the pre-training stage, we introduce the Retrieval Augmented Mask Prediction (RAMP) task, where the model learns to leverage search tools to fill masked spans on a large number of pre-training data, thus acquiring universal retrieval and reasoning capabilities for LLMs. After that, the model is trained on downstream tasks to achieve further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, rewriter, and observer, followed by a self-evolving teacher model. For RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances based on the number of masked spans. We evaluate the effectiveness of our framework in the scenario of open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MASKSEARCH significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.

[517] Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery

Yifan Sun,Danding Wang,Qiang Sheng,Juan Cao,Jintao Li

Main category: cs.CL

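TL;DR: ECO-Concept is an annotation-free framework that extracts semantic concepts automatically and uses large language models to evaluate their comprehensibility, producing more intuitive explanations.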

Details

Motivation: Existing concept-based explanation methods see limited use in the text domain: they either rely on predefined concepts or produce explanations that are hard to understand, which undermines user trust. Method: ECO-Concept uses an object-centric architecture to extract concepts automatically, evaluates their comprehensibility with large language models, and uses the evaluation to guide model fine-tuning. Result: Experiments show that ECO-Concept performs strongly across diverse tasks and that its concepts are more comprehensible than those of existing methods. Conclusion: ECO-Concept discovers comprehensible concepts automatically and gives explainable AI a more intuitive form of explanation. Abstract: Concept-based explainable approaches have emerged as a promising method in explainable AI because they can interpret models in a way that aligns with human reasoning. However, their adaptation in the text domain remains limited. Most existing methods rely on predefined concept annotations and cannot discover unseen concepts, while other methods that extract concepts without supervision often produce explanations that are not intuitively comprehensible to humans, potentially diminishing user trust. These methods fall short of discovering comprehensible concepts automatically. To address this issue, we propose ECO-Concept, an intrinsically interpretable framework to discover comprehensible concepts with no concept annotations. ECO-Concept first utilizes an object-centric architecture to extract semantic concepts automatically. Then the comprehensibility of the extracted concepts is evaluated by large language models. Finally, the evaluation result guides the subsequent model fine-tuning to obtain more understandable explanations. Experiments show that our method achieves superior performance across diverse tasks. Further concept evaluations validate that the concepts learned by ECO-Concept surpass current counterparts in comprehensibility.

[518] Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?

Michael Kirchhof,Luca Füger,Adam Goliński,Eeshan Gunesh Dhekane,Arno Blaas,Sinead Williamson

Main category: cs.CL

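TL;DR: The paper proposes SelfReflect, a metric that assesses how faithfully a string output by a large language model (LLM) summarizes its internal answer distribution, and finds that sampling and summarizing yields more faithful summaries.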

Details

Motivation: Conventional uncertainty quantification only attaches percentage numbers to the output, which cannot fully express an LLM's internal uncertainty, so more expressive string summaries are needed. Method: Propose the SelfReflect metric for assessing the faithfulness of summary strings and show experimentally that it outperforms alternatives such as LLM judges and embedding comparisons. Result: SelfReflect discriminates subtle differences between candidate summary strings and aligns with human judgement; sampling and summarizing produces more faithful summaries. Conclusion: SelfReflect opens a new direction for expressing LLM uncertainty and enables future work on this universal form of LLM uncertainty. Abstract: To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM's internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences of candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. But we find that faithful summarizations can be generated by sampling and summarizing. Our metric enables future works towards this universal form of LLM uncertainties.

[519] Reasoning LLMs are Wandering Solution Explorers

Jiahao Lu,Ziwei Xu,Mohan Kankanhalli

Main category: cs.CL

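TL;DR: The paper argues that current reasoning large language models (RLLMs) lack the ability to explore the solution space systematically, identifies their common failure modes, and calls for new evaluation metrics and tools.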

Details

Motivation: Current reasoning LLMs perform poorly on complex tasks and lack systematic problem-solving ability, which calls for deeper analysis and improvement. Method: Use qualitative and quantitative analysis to study common problems in the reasoning of multiple state-of-the-art LLMs, such as invalid reasoning steps and redundant explorations. Result: Models can appear competent on simple tasks, but performance degrades sharply as task complexity increases. Conclusion: New metrics and tools are needed that evaluate the structure of the reasoning process rather than only the final output. Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning abilities through test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning. However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderers rather than systematic explorers. Through qualitative and quantitative analysis across multiple state-of-the-art LLMs, we uncover persistent issues: invalid reasoning steps, redundant explorations, hallucinated or unfaithful conclusions, and so on. Our findings suggest that current models' performance can appear to be competent on simple tasks yet degrade sharply as complexity increases. Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.

[520] MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

Jeonghun Baek,Kazuki Egashira,Shota Onohara,Atsuyuki Miyai,Yuki Imajuku,Hikaru Ikuta,Kiyoharu Aizawa

Main category: cs.CL

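TL;DR: The paper introduces two benchmarks for multimodal manga understanding (MangaOCR and MangaVQA) and develops MangaLMM, a specialized model, to evaluate and improve how well large multimodal models handle manga.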

Details

Motivation: Improving how well large multimodal models understand manga could help manga creators reflect on and refine their storytelling. Method: Introduce two benchmarks, MangaOCR (text recognition) and MangaVQA (visual question answering), and develop MangaLMM, finetuned from Qwen2.5-VL. Result: Extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, assess how well LMMs understand manga. Conclusion: The benchmarks and model provide a comprehensive foundation for evaluating and advancing multimodal models in the narrative domain of manga. Abstract: Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.

cs.LG [Back]

[521] Emotion Knowledge Enhancement for Vision Large Language Models: A Self-Verification Approach for High-Quality Emotion Instruction Data Generation

Feifan Wang,Tengfei Song,Minggui He,Chang Su,Zhanglin Wu,Hao Yang,Wenming Zheng,Osamu Yoshie

Main category: cs.LG

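TL;DR: Proposes a self-verification approach with emotion knowledge enhancement (SEKE) and a self-verification strategy with Uncertainty-Aware Monte Carlo sampling (SV-UAMC) to generate high-quality multi-grained facial emotion instruction data at low cost, significantly improving the emotion perception of vision large language models (VLLMs).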

Details

Motivation: High-quality facial emotion annotations are costly and scarce, which limits VLLM performance on emotion perception tasks. Method: Combine prior human knowledge with VLLM inference and generate multi-grained emotion annotations through a self-verification strategy with Monte Carlo sampling. Result: The authors build a facial emotion instruction dataset (FEID) and a benchmark (FEAB), and the method significantly outperforms existing methods on three downstream tasks. Conclusion: SEKE and SV-UAMC improve VLLM emotion perception and offer a low-cost way to generate high-quality annotations. Abstract: Facial emotion perception in the vision large language model (VLLM) is crucial for achieving natural human-machine interaction. However, creating high-quality annotations for both coarse- and fine-grained facial emotion analysis demands costly expertise. The lack of such high-quality instruction data limits the performance of VLLMs in facial emotion perception. To address this, we propose a self-verification approach with emotion knowledge enhancement (SEKE), which generates high-quality instruction data for multi-grained emotion analysis cost-effectively using a closed-source VLLM. This approach integrates prior human knowledge into VLLM inference, guided by the inherent correlations between three granularity levels of emotion descriptions, i.e., discrete expression, valence-arousal, and action unit, to reliably generate comprehensive annotations. A self-verification strategy with Uncertainty-Aware Monte Carlo sampling (SV-UAMC) is further embedded to efficiently extract more accurate VLLM predictions, further improving annotation reliability. Consequently, we construct a facial emotion instruction dataset (FEID) containing three comprehensive descriptions, which provides coarse- and fine-grained emotional information for effective model training. Additionally, we introduce a facial emotion analysis benchmark (FEAB) to measure the VLLM's corresponding ability. Our method significantly outperforms state-of-the-art methods on three downstream facial emotion analysis tasks.

[522] Training Acceleration of Low-Rank Decomposed Networks using Sequential Freezing and Rank Quantization

Habib Hajimolahoseini,Walid Ahmed,Yang Liu

Main category: cs.LG

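TL;DR: The paper proposes two techniques, rank optimization and sequential freezing of decomposed layers, to accelerate low-rank decomposed models without resorting to small ranks. Experiments show substantial training and inference speedups with accuracy close to that of the original models.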

Details

Motivation: Low Rank Decomposition (LRD) reduces the parameters and computational complexity of deep learning models, but it adds many new layers. Unless the ranks are small enough it brings little training or inference speedup, while small ranks risk an accuracy drop. Method: Propose rank optimization and sequential freezing of decomposed layers, which avoid the need for small ranks; experiments cover convolutional and transformer-based models. Result: Combined, the two techniques improve throughput by up to 60% during training and 37% during inference while keeping accuracy close to that of the original models. Conclusion: Rank optimization and sequential freezing accelerate low-rank decomposed models without sacrificing accuracy and apply to multiple architectures. Abstract: Low Rank Decomposition (LRD) is a model compression technique applied to the weight tensors of deep learning models in order to reduce the number of trainable parameters and computational complexity. However, due to the high number of new layers added to the architecture after applying LRD, it may not lead to a high training/inference acceleration if the decomposition ranks are not small enough. The issue is that using small ranks increases the risk of a significant accuracy drop after decomposition. In this paper, we propose two techniques for accelerating low rank decomposed models without requiring small ranks for decomposition. These methods include rank optimization and sequential freezing of decomposed layers. We perform experiments on both convolutional and transformer-based models. Experiments show that these techniques can improve the model throughput by up to 60% during training and 37% during inference when combined together while preserving the accuracy close to that of the original models.
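
The tension around rank choice can be seen with simple parameter arithmetic. The sketch below assumes the textbook factorization of a dense rows x cols weight matrix into a rows x rank matrix times a rank x cols matrix (biases ignored); it does not model the paper's rank optimization or sequential freezing, and the 1024 x 1024 layer size is an arbitrary example.

```python
def dense_params(rows, cols):
    """Parameters in a dense rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Parameters after factoring W into (rows x rank) @ (rank x cols)."""
    return rank * (rows + cols)


def max_compressing_rank(rows, cols):
    """Largest rank for which the factorization has strictly fewer parameters."""
    return (rows * cols - 1) // (rows + cols)


rows, cols = 1024, 1024  # hypothetical layer size
for rank in (64, 256, 511, 512):
    ratio = low_rank_params(rows, cols, rank) / dense_params(rows, cols)
    print(f"rank {rank:3d}: {ratio:.3f} of the dense parameter count")
print(f"largest rank that still compresses: {max_compressing_rank(rows, cols)}")
```

Each decomposed layer also becomes two consecutive layers, so a rank close to the break-even point saves few parameters while adding depth; that is the regime in which, per the abstract, plain LRD yields little speedup.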

[523] Improving Resnet-9 Generalization Trained on Small Datasets

Omar Mohamed Awad,Habib Hajimolahoseini,Michael Lim,Gurpreet Gosal,Walid Ahmed,Yang Liu,Gordon Deng

Main category: cs.LG

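TL;DR: This paper presents the approach that won first prize at the ICLR competition on Hardware Aware Efficient Training, where the goal is the highest possible image classification accuracy in under 10 minutes of training.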

Details

Motivation: The competition requires training a model in under 10 minutes on a small subset of CIFAR-10 (5000 images), with evaluation on a secret test set. The goal is efficient training with better generalization. Method: Improve ResNet-9 with several techniques: sharpness-aware optimization, label smoothing, gradient centralization, input patch whitening, and meta-learning-based training. Result: ResNet-9 reaches 88% accuracy when trained on a 10% subset of CIFAR-10 in under 10 minutes. Conclusion: The approach delivers both efficient training and strong model performance and suits resource-constrained settings. Abstract: This paper presents our proposed approach that won the first prize at the ICLR competition on Hardware Aware Efficient Training. The challenge is to achieve the highest possible accuracy in an image classification task in less than 10 minutes. The training is done on a small dataset of 5000 images picked randomly from the CIFAR-10 dataset. The evaluation is performed by the competition organizers on a secret dataset with 1000 images of the same size. Our approach includes applying a series of techniques for improving the generalization of ResNet-9 including: sharpness-aware optimization, label smoothing, gradient centralization, input patch whitening as well as meta-learning-based training. Our experiments show that ResNet-9 can achieve an accuracy of 88% while trained only on a 10% subset of the CIFAR-10 dataset in less than 10 minutes.
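
Of the techniques listed, label smoothing is the easiest to show in a few lines. The sketch below assumes the standard formulation (mix the one-hot target with a uniform distribution) and a smoothing factor of 0.1; the abstract does not state which variant or value the authors used, and the logits are made up.

```python
import math

NUM_CLASSES = 10  # CIFAR-10 has ten classes
EPSILON = 0.1     # assumed smoothing factor, not taken from the paper


def smooth_targets(label, num_classes=NUM_CLASSES, epsilon=EPSILON):
    """Mix a one-hot target with a uniform distribution over all classes."""
    off_value = epsilon / num_classes
    return [1.0 - epsilon + off_value if k == label else off_value
            for k in range(num_classes)]


def cross_entropy(logits, targets):
    """Cross-entropy between soft targets and softmax(logits)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(t * (z - log_norm) for z, t in zip(logits, targets))


targets = smooth_targets(label=3)
one_hot = [1.0 if k == 3 else 0.0 for k in range(NUM_CLASSES)]

confident_logits = [20.0 if k == 3 else 0.0 for k in range(NUM_CLASSES)]
moderate_logits = [4.0 if k == 3 else 0.0 for k in range(NUM_CLASSES)]

for name, logits in (("confident", confident_logits), ("moderate", moderate_logits)):
    hard = cross_entropy(logits, one_hot)
    soft = cross_entropy(logits, targets)
    print(f"{name:9s} logits: one-hot loss {hard:.4f}, smoothed loss {soft:.4f}")
```

Under one-hot targets the extremely confident logits have the lower loss, whereas under smoothed targets the moderate logits do, which is how label smoothing discourages overconfident predictions on a small training set.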

[524] GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values

Farnoosh Javadi,Walid Ahmed,Habib Hajimolahoseini,Foozhan Ataiefard,Mohammad Hassanpour,Saina Asani,Austin Wen,Omar Mohamed Awad,Kangling Liu,Yang Liu

Main category: cs.LG

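TL;DR: GQKVA generalizes query, key, and value grouping to speed up transformer pre-training and reduce model size; experiments show a trade-off between performance and model size.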

Details

Motivation: Large transformer models suffer from slow, computationally intensive pre-training and over-parametrization. Method: Propose GQKVA, which generalizes query, key, and value grouping techniques to speed up pre-training and reduce model size. Result: On ViT, accuracy rises by about 0.3% with a model about 4% smaller; in the most aggressive experiment, the model is about 15% smaller with only around a 1% drop in accuracy. Conclusion: GQKVA offers lighter and faster alternatives, and conventional multi-head attention is not always the best choice. Abstract: Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.

[525] Accelerating the Low-Rank Decomposed Models

Habib Hajimolahoseini,Walid Ahmed,Austin Wen,Yang Liu

Main category: cs.LG

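TL;DR: This paper studies how to modify low-rank decomposition so that AI models achieve high accuracy and low memory consumption while also training and running inference faster.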

Details

Motivation: Tensor decomposition is an effective data compression technique, but it is not popular for compressing AI models because the layers added after decomposition can make the model much deeper and increase latency. Method: Study how to modify low-rank decomposition so that it reduces redundancy without the drawbacks of a much deeper model. Result: The study aims to retain high accuracy and low memory consumption while speeding up training and inference. Conclusion: With suitable modifications, low-rank decomposition can deliver efficient compression together with faster training and inference. Abstract: Tensor decomposition is a mathematically supported technique for data compression. It consists of applying some kind of Low Rank Decomposition technique to the tensors or matrices in order to reduce the redundancy of the data. However, it is not a popular technique for compressing AI models due to the high number of new layers added to the architecture after decomposition. Although the number of parameters could shrink significantly, it could make the model more than twice as deep, which could add some latency to training or inference. In this paper, we present a comprehensive study of how to modify the low rank decomposition technique in AI models so that we can benefit from both high accuracy and low memory consumption as well as faster training and inference.

[526] LatentLLM: Attention-Aware Joint Tensor Compression

Toshiaki Koike-Akino,Xiangyu Chen,Jing Liu,Ye Wang,Pu Wang,Matthew Brand

Main category: cs.LG

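TL;DR: Proposes a new framework that converts large language models and large multimodal models into a reduced-dimension latent structure, significantly improving accuracy after compression.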

Details

Motivation: Modern foundation models such as LLMs and LMMs require massive computational and memory resources, and existing compression methods lose too much accuracy when the dimension is reduced. Method: Extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Result: On several benchmarks, including multimodal reasoning tasks, the framework significantly outperforms existing compression methods. Conclusion: The new framework effectively yields computationally and memory-efficient large models. Abstract: Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can significantly improve the model accuracy over the existing model compression methods when reducing the latent dimension to realize computationally/memory-efficient LLMs/LMMs. We show the benefit on several benchmarks, including multi-modal reasoning tasks.

[527] Evidence-Grounded Multimodal Misinformation Detection with Attention-Based GNNs

Sharad Duwal,Mir Nafis Sharear Shopnil,Abhishek Tyagi,Adiba Mahbub Proma

Main category: cs.LG

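TL;DR: The paper proposes a graph-based method that detects multimodal out-of-context (OOC) misinformation by constructing an evidence graph and a claim graph; it reaches 93.05% detection accuracy, outperforming existing methods.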

Details

Motivation: Current methods, including LLMs and LVLMs, lack a context-resolution step when detecting multimodal OOC misinformation, which hurts their performance. Method: Graph neural networks (GNNs) encode and compare an evidence graph built from online textual evidence and a claim graph built from the claim. Result: The method reaches 93.05% detection accuracy on the evaluation set, 2.82% above the second-best method (an LLM). Conclusion: The results suggest that small, task-specific methods perform better at multimodal misinformation detection. Abstract: Multimodal out-of-context (OOC) misinformation is misinformation that repurposes real images with unrelated or misleading captions. Detecting such misinformation is challenging because it requires resolving the context of the claim before checking for misinformation. Many current methods, including LLMs and LVLMs, do not perform this contextualization step. LLMs hallucinate in the absence of context or parametric knowledge. In this work, we propose a graph-based method that evaluates the consistency between the image and the caption by constructing two graph representations: an evidence graph, derived from online textual evidence, and a claim graph, from the claim in the caption. Using graph neural networks (GNNs) to encode and compare these representations, our framework then evaluates the truthfulness of image-caption pairs. We create datasets for our graph-based method, and evaluate and compare our baseline model against popular LLMs on the misinformation detection task. Our method scores $93.05\%$ detection accuracy on the evaluation set and outperforms the second-best performing method (an LLM) by $2.82\%$, making a case for smaller and task-specific methods.

[528] ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning

Mingkuan Feng,Jinyang Wu,Siyuan Liu,Shuai Zhang,Hongjian Fang,Ruihan Jin,Feihu Che,Pengpeng Shao,Zhengqi Wen,Jianhua Tao

Main category: cs.LG

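TL;DR: The paper proposes ELDeR, a data-driven regularized layer-wise pruning method for making large language models (LLMs) efficient; it reduces the information loss caused by direct pruning and greatly lowers the computational cost of recovery fine-tuning.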

Details

Motivation: The high computational and memory costs of large language models limit their deployment. Existing pruning methods usually follow a prune-then-finetune paradigm, but statically removing parameters causes irreversible performance degradation and requires costly recovery fine-tuning. Method: A regularize-then-prune paradigm: multiply each transformer layer's output by an initial weight, iteratively learn the weights with a small amount of data, then regularize the layers with smaller weights so that information is forced into the remaining layers. Result: Experiments show that ELDeR preserves language modeling ability, outperforms existing layer-wise structured pruning methods, and greatly reduces the computational cost of recovery fine-tuning. Conclusion: As a layer-wise pruning method, ELDeR delivers clear end-to-end acceleration and is a promising technique for efficient LLMs. Abstract: The deployment of large language models (LLMs) in many fields is largely hindered by their high computational and memory costs. Recent studies suggest that LLMs exhibit sparsity, which can be used for pruning. Previous pruning methods typically follow a prune-then-finetune paradigm. Since the pruned parts still contain valuable information, statically removing them without updating the remaining parameters often results in irreversible performance degradation, requiring costly recovery fine-tuning (RFT) to maintain performance. To address this, we propose a novel paradigm: first apply regularization, then prune. Based on this paradigm, we propose ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning. We multiply the output of each transformer layer by an initial weight, then we iteratively learn the weights of each transformer layer by using a small amount of data in a simple way. After that, we apply regularization to the difference between the output and input of the layers with smaller weights, forcing the information to be transferred to the remaining layers. Compared with direct pruning, ELDeR reduces the information loss caused by direct parameter removal, thus better preserving the model's language modeling ability. Experimental results show that ELDeR achieves superior performance compared with powerful layer-wise structured pruning methods, while greatly reducing RFT computational costs. Since ELDeR is a layer-wise pruning method, its end-to-end acceleration effect is obvious, making it a promising technique for efficient LLMs.

[529] Task Specific Pruning with LLM-Sieve: How Many Parameters Does Your Task Really Need?

Waleed Reda,Abhinav Jangda,Krishna Chintalapudi

Main category: cs.LG

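TL;DR: LLM-Sieve uses task-aware joint projections and a genetic algorithm to prune LLMs for specific tasks, cutting parameters by 20-75% with only a 1-5% drop in accuracy.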

Details

Motivation: To determine how many parameters an LLM actually needs for a specific task in resource-constrained settings. Method: Task-aware joint projections plus a genetic algorithm that finds differentiated pruning levels; compatible with LoRA fine-tuning and quantization. Result: 20-75% parameter reduction with 1-5% accuracy degradation, and strong generalization across datasets within the same task domain. Conclusion: LLM-Sieve provides a practical and robust mechanism for generating small, performant task-specific models. Abstract: As Large Language Models (LLMs) are increasingly being adopted for narrow tasks - such as medical question answering or sentiment analysis - and deployed in resource-constrained settings, a key question arises: how many parameters does a task actually need? In this work, we present LLM-Sieve, the first comprehensive framework for task-specific pruning of LLMs that achieves 20-75% parameter reduction with only 1-5% accuracy degradation across diverse domains. Unlike prior methods that apply uniform pruning or rely on low-rank approximations of weight matrices or inputs in isolation, LLM-Sieve (i) learns task-aware joint projections to better approximate output behavior, and (ii) employs a Genetic Algorithm to discover differentiated pruning levels for each matrix. LLM-Sieve is fully compatible with LoRA fine-tuning and quantization, and uniquely demonstrates strong generalization across datasets within the same task domain. Together, these results establish a practical and robust mechanism to generate smaller performant task-specific models.

[530] Learning without Isolation: Pathway Protection for Continual Learning

Zhikang Chen,Abudukelimu Wuerkaixi,Sen Cui,Haoxuan Li,Ding Li,Jingfeng Zhang,Bo Han,Gang Niu,Houfang Liu,Yi Yang,Sifan Yang,Changshui Zhang,Tianling Ren

Main category: cs.LG

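TL;DR: The paper proposes LwI, a new continual learning framework that addresses catastrophic forgetting in sequential task learning by protecting the pathways of old tasks rather than their parameters.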

Details

Motivation: Deep networks are prone to catastrophic forgetting during sequential task learning; existing methods mostly protect parameters, which is impractical and inefficient. Method: Drawing on neuroscience and physics, the paper argues that pathways matter more than parameters and designs the LwI framework, which formulates model fusion as graph matching and protects old-task pathways. Result: Experiments show that LwI addresses catastrophic forgetting in a parameter-efficient manner and performs well on several benchmark datasets. Conclusion: By protecting pathways rather than parameters, LwI offers an efficient and practical approach to continual learning. Abstract: Deep networks are prone to catastrophic forgetting during sequential task learning, i.e., losing the knowledge about old tasks upon learning new tasks. To this end, continual learning (CL) has emerged, whose existing methods focus mostly on regulating or protecting the parameters associated with the previous tasks. However, parameter protection is often impractical, since the size of parameters for storing the old-task knowledge increases linearly with the number of tasks, otherwise it is hard to preserve the parameters related to the old-task knowledge. In this work, we bring a dual opinion from neuroscience and physics to CL: in the whole network, the pathways matter more than the parameters when concerning the knowledge acquired from the old tasks. Following this opinion, we propose a novel CL framework, learning without isolation (LwI), where model fusion is formulated as graph matching and the pathways occupied by the old tasks are protected without being isolated. Thanks to the sparsity of activation channels in a deep network, LwI can adaptively allocate available pathways for a new task, realizing pathway protection and addressing catastrophic forgetting in a parameter-efficient manner. Experiments on popular benchmark datasets demonstrate the superiority of the proposed LwI.

[531] $μ$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts

Toshiaki Koike-Akino,Jing Liu,Ye Wang

Main category: cs.LG

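TL;DR: The paper proposes μ-MoE, a retraining-free activation-aware pruning method that uses efficient calibration to adapt to task- and prompt-dependent structured sparsity on the fly, reducing inference complexity.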

Details

Motivation: To meet the heavy computational demands of large foundation models while avoiding the domain shift that comes from relying on calibration data. Method: Activation-aware pruning with computationally efficient calibration, executed adaptively for each task or prompt. Result: Experiments show that μ-MoE adapts dynamically to task/prompt-dependent structured sparsity and reduces inference complexity. Conclusion: μ-MoE is an efficient and adaptive compression method suited to inference optimization of large foundation models. Abstract: To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these rely on calibration data, domain shift may arise for unknown downstream tasks. With a computationally efficient calibration, activation-aware pruning can be executed for every prompt adaptively, while still achieving reduced complexity at inference. We formulate it as a mixture of micro-experts, called $\mu$-MoE. Several experiments demonstrate that $\mu$-MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.
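As a rough illustration of what prompt-adaptive, activation-aware pruning could look like, the sketch below scores each weight by its magnitude times the norm of the matching input feature for the current prompt, then keeps the top entries per output row. The scoring rule, the per-row top-k selection (unstructured here, for brevity), and all names are assumptions of this sketch; the abstract does not specify the criterion μ-MoE uses.

```python
import math


def column_norms(activations):
    """L2 norm of each input feature over the tokens of one prompt."""
    n_features = len(activations[0])
    return [
        math.sqrt(sum(token[j] ** 2 for token in activations))
        for j in range(n_features)
    ]


def prune_for_prompt(weight, activations, keep):
    """Keep the `keep` highest-scoring weights in each output row, zero the rest.

    Score of weight[i][j] is |weight[i][j]| * norm_j, where norm_j is computed
    from this prompt's activations only (assumed scoring rule).
    """
    norms = column_norms(activations)
    pruned = []
    for row in weight:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        ranked = sorted(range(len(row)), key=lambda j: scores[j], reverse=True)
        kept = set(ranked[:keep])
        pruned.append([w if j in kept else 0.0 for j, w in enumerate(row)])
    return pruned


if __name__ == "__main__":
    weight = [[1.0, -2.0, 0.5]]    # hypothetical 1x3 layer
    prompt_a = [[3.0, 0.1, 1.0]]   # one token, feature 0 dominant
    prompt_b = [[0.1, 3.0, 1.0]]   # one token, feature 1 dominant
    print(prune_for_prompt(weight, prompt_a, keep=1))
    print(prune_for_prompt(weight, prompt_b, keep=1))
```

Because the mask is recomputed from each prompt's activations, different prompts activate different subsets of the same weights, which is the sense in which the selected weights behave like micro-experts.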

[532] How to build a consistency model: Learning flow maps via self-distillation

Nicholas M. Boffi,Michael S. Albergo,Eric Vanden-Eijnden

Main category: cs.LG

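TL;DR: Presents a systematic approach to learning flow maps that uses self-distillation to turn existing distillation schemes into direct training algorithms, with no pre-trained model required.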

Details

Motivation: To improve the efficiency of generative models based on differential equations by exploiting the relationship between flow maps and velocity fields. Method: Uses that relationship to train the model directly via self-distillation, avoiding pre-training. Result: High-dimensional tasks such as image synthesis benefit from objectives that avoid temporal and spatial derivatives of the flow map, while lower-dimensional tasks can benefit from higher-order derivatives that capture sharp features. Conclusion: The framework adapts to tasks of different dimensionality and gives an efficient training scheme for generative models. Abstract: Building on the framework proposed in Boffi et al. (2024), we present a systematic approach for learning flow maps associated with flow and diffusion models. Flow map-based models, commonly known as consistency models, encompass recent efforts to improve the efficiency of generative models based on solutions to differential equations. By exploiting a relationship between the velocity field underlying a continuous-time flow and the instantaneous rate of change of the flow map, we show how to convert existing distillation schemes into direct training algorithms via self-distillation, eliminating the need for pre-trained models. We empirically evaluate several instantiations of our framework, finding that high-dimensional tasks like image synthesis benefit from objective functions that avoid temporal and spatial derivatives of the flow map, while lower-dimensional tasks can benefit from objectives incorporating higher-order derivatives to capture sharp features.
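The relationship between the velocity field and the flow map that the abstract refers to can be written out explicitly. The notation below is assumed for illustration (the abstract gives none): $b_t$ is the velocity field of the probability-flow ODE and $X_{s,t}$ is the flow map that carries a state at time $s$ to the state at time $t$.

```latex
\begin{aligned}
&\dot{x}_t = b_t(x_t), \qquad X_{s,t}(x_s) = x_t, \qquad X_{s,s}(x) = x, \\
&\partial_t X_{s,t}(x) = b_t\bigl(X_{s,t}(x)\bigr), \\
&\lim_{t \to s} \partial_t X_{s,t}(x) = b_s(x).
\end{aligned}
```

The last line says that the flow map's time derivative on the diagonal $t = s$ recovers the velocity field, so a single learned flow map implicitly carries its own velocity model; this is what makes it plausible to distill from the model itself instead of from a separately pre-trained one.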

[533] Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications

Yanxiang Zhang,Zheng Xu,Shanshan Wu,Yuanbo Zhang,Daniel Ramage

Main category: cs.LG

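TL;DR: Uses large language models (LLMs) to synthesize a high-quality error correction dataset, then reweights it and mixes it with other data sources to improve error correction on mobile devices.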

Details

Motivation: To improve LLM-based correction of user typing errors on mobile devices. Method: (1) Synthesize an error correction dataset with LLMs; (2) adapt it to the mobile application domain by reweighting; (3) fit the reweighting with a handful of live A/B test metrics. Result: An efficient data synthesis and adaptation method that significantly improves error correction in both offline evaluation and production. Conclusion: Best practices for synthesizing data and mixing data sources improve LLM error correction on mobile devices. Abstract: Error correction is an important capability when applying large language models (LLMs) to facilitate user typing on mobile devices. In this paper, we use LLMs to synthesize a high-quality dataset of error correction pairs to evaluate and improve LLMs for mobile applications. We first prompt LLMs with error correction domain knowledge to build a scalable and reliable addition to the existing data synthesis pipeline. We then adapt the synthetic data distribution to match the mobile application domain by reweighting the samples. The reweighting model is learnt by predicting (a handful of) live A/B test metrics when deploying LLMs in production, given the LLM performance on offline evaluation data and scores from a small privacy-preserving on-device language model. Finally, we present best practices for mixing our synthetic data with other data sources to improve model performance on error correction in both offline evaluation and production live A/B testing.

[534] LORE: Lagrangian-Optimized Robust Embeddings for Visual Encoders

Borna Khodabandeh,Amirabbas Afzali,Amirhossein Afsharrad,Seyed Shahabeddin Mousavi,Sanjay Lall,Sajjad Amini,Seyed-Mohsen Moosavi-Dezfooli

Main category: cs.LG

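TL;DR: Proposes LORE, an unsupervised adversarial fine-tuning framework that uses constrained optimization to balance robustness and accuracy, significantly improving zero-shot adversarial robustness while preserving clean-data performance.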

Details

Motivation: Existing adversarial fine-tuning methods are unstable and trade robustness against accuracy poorly, so a better approach is needed. Method: LORE uses constrained optimization with embedding-space proximity constraints to balance robustness and clean-data performance. Result: Experiments show that LORE significantly improves zero-shot adversarial robustness with minimal loss of clean-data accuracy. Conclusion: LORE is an effective adversarial fine-tuning framework for improving model robustness and generalization. Abstract: Visual encoders have become fundamental components in modern computer vision pipelines. However, ensuring robustness against adversarial perturbations remains a critical challenge. Recent efforts have explored both supervised and unsupervised adversarial fine-tuning strategies. We identify two key limitations in these approaches: (i) they often suffer from instability, especially during the early stages of fine-tuning, resulting in suboptimal convergence and degraded performance on clean data, and (ii) they exhibit a suboptimal trade-off between robustness and clean data accuracy, hindering the simultaneous optimization of both objectives. To overcome these challenges, we propose Lagrangian-Optimized Robust Embeddings (LORE), a novel unsupervised adversarial fine-tuning framework. LORE utilizes constrained optimization, which offers a principled approach to balancing competing goals, such as improving robustness while preserving nominal performance. By enforcing embedding-space proximity constraints, LORE effectively maintains clean data performance throughout adversarial fine-tuning. Extensive experiments show that LORE significantly improves zero-shot adversarial robustness with minimal degradation in clean data accuracy. Furthermore, we demonstrate the effectiveness of the adversarially fine-tuned CLIP image encoder in out-of-distribution generalization and enhancing the interpretability of image embeddings.
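A generic way to write a constrained objective of the kind the abstract describes is given below. Every symbol is an assumption made for illustration: $f_\theta$ is the encoder being fine-tuned, $f_{\theta_0}$ the frozen original encoder, $\ell_{\mathrm{adv}}$ an unspecified adversarial loss, $d$ an embedding-space distance, $\epsilon$ the perturbation budget, and $\rho$ the proximity tolerance. The threat model, the choice of $d$, and the expectation form of the constraint may all differ from the paper's formulation.

```latex
\begin{aligned}
&\min_{\theta}\; \mathbb{E}_{x}\Bigl[\max_{\|\delta\|_\infty \le \epsilon} \ell_{\mathrm{adv}}(\theta; x + \delta)\Bigr]
\quad \text{s.t.} \quad \mathbb{E}_{x}\bigl[d\bigl(f_{\theta}(x), f_{\theta_0}(x)\bigr)\bigr] \le \rho, \\
&\mathcal{L}(\theta, \lambda) = \mathbb{E}_{x}\Bigl[\max_{\|\delta\|_\infty \le \epsilon} \ell_{\mathrm{adv}}(\theta; x + \delta)\Bigr]
+ \lambda \Bigl(\mathbb{E}_{x}\bigl[d\bigl(f_{\theta}(x), f_{\theta_0}(x)\bigr)\bigr] - \rho\Bigr), \qquad \lambda \ge 0.
\end{aligned}
```

In a primal-dual scheme, $\theta$ descends on $\mathcal{L}$ while $\lambda$ ascends, so the multiplier grows whenever clean embeddings drift past $\rho$ and shrinks once they are back within tolerance. That replaces a hand-tuned fixed penalty weight with one that adapts during fine-tuning.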

[535] AmorLIP: Efficient Language-Image Pretraining via Amortization

Haotian Sun,Yitong Li,Yuchen Zhuang,Niao He,Hanjun Dai,Bo Dai

Main category: cs.LG

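TL;DR: AmorLIP is an efficient CLIP pretraining framework that amortizes the computational cost of contrastive learning through lightweight neural networks, substantially improving training efficiency and performance.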

Details

Motivation: Existing CLIP methods need extremely large batches and heavy compute, and existing remedies tend to sacrifice performance or run into scalability problems. Method: The AmorLIP framework draws on insights from a spectral factorization of energy-based models and introduces new amortization objectives plus practical techniques for training stability. Result: Across 38 downstream tasks, AmorLIP's zero-shot classification and retrieval clearly outperform standard CLIP baselines, with relative improvements of up to 12.24%. Conclusion: By amortizing the expensive computations efficiently, AmorLIP eases the resource bottleneck of CLIP training while maintaining or improving performance. Abstract: Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. Existing CLIP methods typically optimize a contrastive objective using negative samples drawn from each minibatch. To achieve robust representation learning, these methods require extremely large batch sizes and escalate computational demands to hundreds or even thousands of GPUs. Prior approaches to mitigate this issue often compromise downstream performance, prolong training duration, or face scalability challenges with very large datasets. To overcome these limitations, we propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks, which substantially improves training efficiency and performance. Leveraging insights from a spectral factorization of energy-based models, we introduce novel amortization objectives along with practical techniques to improve training stability. Extensive experiments across 38 downstream tasks demonstrate the superior zero-shot classification and retrieval capabilities of AmorLIP, consistently outperforming standard CLIP baselines with substantial relative improvements of up to 12.24%.

[536] B-score: Detecting biases in large language models using response history

An Vo,Mohammad Reza Taesiri,Daeyoung Kim,Anh Totti Nguyen

Main category: cs.LG

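TL;DR: The study examines whether large language models (LLMs) give less biased outputs when they can see their own earlier answers in a multi-turn conversation, and proposes B-score as a new metric for detecting bias.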

Details

Motivation: LLMs often exhibit strong biases; the study asks whether multi-turn conversation helps a model de-bias itself. Method: Tests LLMs on questions spanning 9 topics and three types (Subjective, Random, Objective) and proposes B-score as a bias detection metric. Result: In multi-turn conversations LLMs can de-bias themselves on Random questions, and B-score substantially improves the verification accuracy of LLM answers. Conclusion: Multi-turn conversation helps LLMs reduce bias, and B-score is an effective bias detection tool. Abstract: Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to "de-bias" themselves in a multi-turn conversation in response to questions that seek a Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting LLM correct answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: https://b-score.github.io.
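The abstract does not give the formula for B-score. One plausible reading, assumed here rather than taken from the text, is that the score of an answer is the gap between how often the model picks it across independent single-turn queries and how often it picks it across the turns of one multi-turn conversation where it can see its history. The sketch below implements that assumed definition; the function names are hypothetical.

```python
from collections import Counter


def answer_frequency(answers, option):
    """Fraction of responses in `answers` that equal `option`."""
    return Counter(answers)[option] / len(answers)


def b_score(single_turn_answers, multi_turn_answers, option):
    """Assumed definition: single-turn frequency minus multi-turn frequency.

    A large positive value means the option is chosen far more often when the
    model cannot see its own earlier answers, which hints at a bias.
    """
    return (answer_frequency(single_turn_answers, option)
            - answer_frequency(multi_turn_answers, option))


if __name__ == "__main__":
    # Hypothetical responses to "pick a random digit".
    single = ["7", "7", "7", "3"]   # four independent single-turn queries
    multi = ["7", "1", "3", "9"]    # four turns of one conversation
    print(b_score(single, multi, "7"))
```

Under this reading, an answer the model repeats in both settings scores near zero, while an answer it favors only in isolation scores high.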

[537] STRICT: Stress Test of Rendering Images Containing Text

Tianyu Zhang,Xinyu Wang,Zhenghan Tai,Lu Li,Jijun Chi,Jingrui Tian,Hailin He,Suyuchen Wang

Main category: cs.LG

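TL;DR: The paper introduces the STRICT benchmark, which systematically tests how well diffusion models render coherent, legible text in images, and exposes their limits in long-range consistency and instruction following.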

Details

Motivation: Diffusion models struggle to generate consistent, legible text within images, which is attributed to a locality bias that prevents them from modeling long-range spatial dependencies. Method: Introduces the STRICT benchmark, which evaluates diffusion models along three dimensions: text length, correctness and legibility, and the rate of not following instructions. Result: The evaluation shows that current models fall well short in long-range consistency and instruction following. Conclusion: The study exposes architectural bottlenecks of diffusion models and points to directions for future research on multimodal generative models. Abstract: While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce $\textbf{STRICT}$, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text, and (3) the ratio of not following instructions for generating text. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at https://github.com/tianyu-z/STRICT-Bench.

[538] Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs

Mengqi Liao,Xiangyu Xi,Ruinian Chen,Jia Leng,Yangen Hu,Ke Zeng,Shuai Liu,Huaiyu Wan

Main category: cs.LG

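TL;DR: The paper proposes allocating RL training resources dynamically, adjusting the rollout budget to question difficulty and using an adaptive temperature strategy to preserve exploration, which improves the efficiency and performance of LLMs.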

Details

Motivation: Existing RL methods give every question the same number of rollouts, which is inefficient, and RL may limit the model's exploration ability so that its performance cap falls below that of the base model. Method: Allocate the rollout budget dynamically according to question difficulty, and introduce an adaptive dynamic temperature adjustment strategy that keeps entropy stable. Result: The method makes RL training more efficient while preserving the model's exploration ability and avoiding the performance cap. Conclusion: The proposed mechanisms address resource allocation and exploration in RL training and offer a new direction for optimizing LLMs. Abstract: Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model's exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at: https://github.com/LiaoMengqi/E3-RL4LLMs
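The two mechanisms in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: difficulty is taken to be one minus the observed pass rate, the budget is split proportionally, and the temperature controller is a simple fixed-step rule. All function names, defaults, and bounds are hypothetical.

```python
def allocate_rollouts(pass_rates, total_budget, min_rollouts=1):
    """Split a rollout budget across questions in proportion to difficulty.

    Difficulty is assumed to be 1 - pass_rate. Every question receives at
    least `min_rollouts`; the remainder is shared proportionally.
    """
    n = len(pass_rates)
    spare = total_budget - min_rollouts * n
    if spare < 0:
        raise ValueError("budget too small for the minimum allocation")
    difficulty = [1.0 - p for p in pass_rates]
    total = sum(difficulty)
    if total == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * d / total for d in difficulty]
    alloc = [min_rollouts + int(s) for s in shares]
    # Hand out rollouts lost to rounding, hardest questions first.
    leftover = total_budget - sum(alloc)
    order = sorted(range(n), key=lambda i: difficulty[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


def adjust_temperature(temp, entropy, target_entropy, step=0.05, lo=0.1, hi=2.0):
    """Raise temperature when entropy is below target, lower it when above."""
    if entropy < target_entropy:
        temp += step
    elif entropy > target_entropy:
        temp -= step
    return max(lo, min(hi, temp))


if __name__ == "__main__":
    print(allocate_rollouts([0.75, 0.5, 0.25], total_budget=15))
    print(adjust_temperature(1.0, entropy=0.5, target_entropy=1.0))
```

A proportional split keeps the total rollout count fixed, so any extra samples for hard questions are paid for by easy ones rather than by a larger compute budget.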

[539] I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts

Jiayi Xin,Sukwon Yun,Jie Peng,Inyoung Choi,Jenna L. Ballard,Tianlong Chen,Qi Long

Main category: cs.LG

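TL;DR: I2MoE is an interpretable, multimodal interaction-aware mixture-of-experts framework that improves modality fusion by explicitly modeling multimodal interactions and providing local and global interpretation.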

Details

Motivation: Conventional modality fusion methods cannot handle heterogeneous interactions between modalities and lack interpretability; I2MoE aims to address both. Method: I2MoE learns multimodal interactions with weakly supervised interaction losses and uses a reweighting model to assign an importance score to each interaction expert's output, giving sample-level and dataset-level interpretation. Result: Experiments on medical and general multimodal datasets show that I2MoE combines flexibly with different fusion techniques, improves task performance, and provides interpretation. Conclusion: I2MoE is an effective modality fusion method that offers both better performance and interpretability. Abstract: Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) inability to account for heterogeneous interactions between modalities and (2) lack of interpretability in uncovering the multimodal interactions inherent in the data. To this end, we propose I2MoE (Interpretable Multimodal Interaction-aware Mixture of Experts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation on a local and global level. First, I2MoE utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, I2MoE deploys a reweighting model that assigns importance scores for the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation of medical and general multimodal datasets shows that I2MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.
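The reweighting step described in the abstract can be pictured as a softmax-weighted sum of expert outputs, where the weights double as per-sample importance scores. The sketch below assumes the importance logits are already given; in I2MoE they would come from the learned reweighting model, whose architecture the abstract does not describe. Function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(expert_outputs, importance_logits):
    """Combine expert outputs using softmax importance weights.

    Returns the fused vector and the weights; the weights can be read as a
    sample-level explanation of which interaction expert mattered most.
    """
    weights = softmax(importance_logits)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    outputs = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical interaction experts
    print(fuse_experts(outputs, [0.0, 0.0]))
```

Averaging the per-sample weights over a dataset would give one simple form of the dataset-level interpretation the abstract mentions.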

[540] Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer

Guodong Du,Zitao Fang,Jing Li,Junlin Li,Runhua Jiang,Shuyang Yu,Yifei Guo,Yangneng Chen,Sim Kuan Goh,Ho-Kin Tang,Daojing He,Honghai Liu,Min Zhang

Main category: cs.LG

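TL;DR: The paper proposes NPS-Pruning, a new pruning method built on the task vector mechanism that searches neural parameters within low-rank subspaces to prune fine-tuned models more efficiently and to improve knowledge transfer, model merging, and compression.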

Details

Motivation: Fine-tuned models struggle outside their domains and carry considerable redundancy; combining a pruned fine-tuned model with the original pre-trained model can mitigate forgetting and interference, but this requires an effective pruning strategy. Method: Preprocess fine-tuned models with the task vector mechanism by computing their difference from the original model, then prune by searching neural parameters within low-rank subspaces (NPS-Pruning). Result: Experiments on vision, NLP, and multimodal benchmarks confirm the method's effectiveness, with substantial performance gains and lower storage costs. Conclusion: NPS-Pruning is an effective pruning approach for fine-tuned models that supports knowledge transfer, merging, and compression; the code is open source. Abstract: Foundation models and their checkpoints have significantly advanced deep learning, boosting performance across various applications. However, fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate forgetting, reduce interference when merging model parameters across tasks, and improve compression efficiency. In this context, developing an effective pruning strategy for fine-tuned models is crucial. Leveraging the advantages of the task vector mechanism, we preprocess fine-tuned models by calculating the differences between them and the original model. Recognizing that different task vector subspaces contribute variably to model performance, we introduce a novel method called Neural Parameter Search (NPS-Pruning) for slimming down fine-tuned models. This method enhances pruning efficiency by searching through neural parameters of task vectors within low-rank subspaces. Our method has three key applications: enhancing knowledge transfer through pairwise model interpolation, facilitating effective knowledge fusion via model merging, and enabling the deployment of compressed models that retain near-original performance while significantly reducing storage costs. Extensive experiments across vision, NLP, and multi-modal benchmarks demonstrate the effectiveness and robustness of our approach, resulting in substantial performance gains. The code is publicly available at: https://github.com/duguodong7/NPS-Pruning.
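The task vector preprocessing in the abstract is simple enough to sketch on flat parameter lists. Note that the pruning rule below is plain magnitude pruning, swapped in as a stand-in: the paper's Neural Parameter Search over low-rank subspaces is not described in enough detail in the abstract to reproduce. All function names are hypothetical.

```python
def task_vector(finetuned, pretrained):
    """Element-wise difference between fine-tuned and pre-trained parameters."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def magnitude_prune(vector, keep_ratio):
    """Keep the largest-magnitude entries, zero the rest (stand-in rule)."""
    k = max(1, round(len(vector) * keep_ratio))
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    kept = set(ranked[:k])
    return [v if i in kept else 0.0 for i, v in enumerate(vector)]


def rebuild(pretrained, pruned_vector, scale=1.0):
    """Add the (scaled) pruned task vector back onto the pre-trained weights."""
    return [p + scale * d for p, d in zip(pretrained, pruned_vector)]


if __name__ == "__main__":
    pretrained = [1.0, 2.0, 3.0, 4.0]
    finetuned = [1.5, 2.0, 1.0, 4.25]
    delta = magnitude_prune(task_vector(finetuned, pretrained), keep_ratio=0.5)
    print(delta, rebuild(pretrained, delta))
```

Only the sparse task vector needs to be stored per task, since the pre-trained weights are shared; that is where the storage saving in the abstract comes from.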

[541] CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models

Qinsi Wang,Hancheng Ye,Ming-Yu Chung,Yudong Liu,Yueqian Lin,Martin Kuo,Mingyuan Ma,Jianyi Zhang,Yiran Chen

Main category: cs.LG

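TL;DR: The paper investigates the interplay between token sparsity and neuron sparsity in vision-language models (VLMs) and proposes CoreMatching, a co-adaptive sparse inference framework that significantly improves inference efficiency.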

Details

Motivation: Prior work usually assumes that token sparsity and neuron sparsity operate independently and has not examined how they might interact. The paper aims to uncover that interplay and use it to make VLM inference more efficient. Method: Analyzes the matching mechanism between core neurons and core tokens and proposes the CoreMatching framework, which combines the two kinds of sparsity co-adaptively. Result: On ten image understanding tasks and three hardware devices, CoreMatching surpasses existing baselines, achieving a 5x FLOPs reduction and a 10x overall speedup on the NVIDIA Titan Xp. Conclusion: The paper reveals the synergy between token and neuron sparsity, and CoreMatching offers a new approach to efficient inference with broad potential for application. Abstract: Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: Do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we found that key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieved 5x FLOPs reduction and a 10x overall speedup. Code is released at https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main.

[542] On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

Wenlong Deng,Yi Ren,Muchen Li,Danica J. Sutherland,Xiaoxiao Li,Christos Thrampoulidis

Main category: cs.LG

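TL;DR: The paper proposes NTHR, a method that addresses Lazy Likelihood Displacement (LLD) in the GRPO algorithm by adjusting the penalty weights on individual tokens in incorrect responses, yielding consistent performance gains.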

Details

Motivation: GRPO exhibits LLD: the likelihood of correct responses increases only marginally, or even decreases, during training, which hurts model performance. Method: NTHR exploits GRPO's group structure, using correct responses as anchors to identify influential tokens in incorrect responses and downweight their penalties. Result: On math reasoning benchmarks, NTHR effectively mitigates LLD and yields consistent performance gains across models from 0.5B to 3B parameters. Conclusion: By adjusting penalty weights in a targeted way, NTHR addresses LLD in GRPO and points to improvements for RL applied to LLMs. Abstract: Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
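To see where the uniform penalty comes from, it helps to look at how GRPO-style training turns the rewards of one sampled group into advantages. The sketch below shows the usual group normalization (reward minus group mean, divided by group standard deviation). It uses the population standard deviation and a small epsilon as assumptions; implementations differ on both, and the function name is hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each response receives (reward - group mean) / (group std + eps). The
    value is shared by every token of that response.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group: two correct (reward 1) and two incorrect (reward 0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the advantage is per response, every token of an incorrect response is pushed down with the same strength, including tokens it shares with correct responses. That is the naive penalization the abstract identifies as the source of LLD, and the per-token downweighting in NTHR targets exactly those tokens.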

[543] Exploring the Possibility of TypiClust for Low-Budget Federated Active Learning

Yuta Ono,Hiroshi Nakamura,Hideki Takase

Main category: cs.LG

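TL;DR: Federated Active Learning (FAL) uses Active Learning (AL) to reduce the annotation burden. The paper studies how well TypiClust works in low-budget FAL and finds that it outperforms other methods and is not very vulnerable to distribution shifts.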

Details

Motivation: Labels are expensive to obtain in FAL settings, so low-budget strategies that cope with challenges such as data heterogeneity are needed. Method: Evaluates the TypiClust strategy in low-budget FAL settings and analyzes its sensitivity to distribution shifts and feature extraction methods. Result: TypiClust performs well in low-budget FAL, is not very vulnerable to typicality distribution shifts, and is only mildly affected by the choice of feature extraction method. Conclusion: TypiClust is an effective strategy for low-budget FAL and offers a workable approach for limited-data situations. Abstract: Federated Active Learning (FAL) seeks to reduce the burden of annotation under the realistic constraints of federated learning by leveraging Active Learning (AL). As FAL settings make it more expensive to obtain ground truth labels, FAL strategies that work well in low-budget regimes, where the amount of annotation is very limited, are needed. In this work, we investigate the effectiveness of TypiClust, a successful low-budget AL strategy, in low-budget FAL settings. Our empirical results show that TypiClust works well even in low-budget FAL settings, in contrast to the relatively low performance of other methods, although these settings present additional challenges, such as data heterogeneity, compared to AL. In addition, we show that FAL settings cause distribution shifts in terms of typicality, but TypiClust is not very vulnerable to the shifts. We also analyze the sensitivity of TypiClust to feature extraction methods, and it suggests a way to perform FAL even in limited data situations.
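The typicality score that TypiClust relies on, as defined in the original TypiClust work, is the inverse of a point's mean distance to its K nearest neighbors in feature space. The sketch below computes it on toy 2-D points and omits the clustering step that TypiClust runs first to pick one typical point per uncovered cluster. The function names, the guard against zero distance, and the example data are assumptions of this sketch.

```python
import math


def typicality(points, k):
    """Inverse mean Euclidean distance to the k nearest neighbors of each point."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_dist = sum(dists[:k]) / k
        scores.append(1.0 / max(mean_dist, 1e-12))  # guard against duplicates
    return scores


def most_typical(points, k):
    """Index of the point with the highest typicality."""
    scores = typicality(points, k)
    return max(range(len(points)), key=lambda i: scores[i])


if __name__ == "__main__":
    # Hypothetical embeddings: three nearby points and one outlier.
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
    print(typicality(pts, k=2), most_typical(pts, k=2))
```

In a federated setting each client only sees its own data, so the neighbors used for this score differ from those in the pooled data; that is one way to picture the typicality shift the abstract reports.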

[544] Diversity-Driven Generative Dataset Distillation Based on Diffusion Model with Self-Adaptive Memory

Mingzhuo Li,Guang Li,Jiafeng Mao,Takahiro Ogawa,Miki Haseyama

Main category: cs.LG

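TL;DR: Proposes a diversity-driven generative dataset distillation method based on a diffusion model, using self-adaptive memory to align distributions and improve downstream accuracy.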

Details

Motivation: Datasets distilled by existing generative models are not diverse enough, which lowers downstream validation accuracy. Method: Uses a diffusion model with self-adaptive memory to align the distribution of the distilled dataset with that of the original dataset. Result: Experiments show that the method outperforms existing state-of-the-art methods in most situations. Conclusion: The method handles dataset distillation effectively, improving both diversity and accuracy. Abstract: Dataset distillation enables the training of deep neural networks with comparable performance in significantly reduced time by compressing large datasets into small and representative ones. Although the introduction of generative models has made great achievements in this field, the distributions of their distilled datasets are not diverse enough to represent the original ones, leading to a decrease in downstream validation accuracy. In this paper, we present a diversity-driven generative dataset distillation method based on a diffusion model to solve this problem. We introduce self-adaptive memory to align the distribution between distilled and real datasets, assessing the representativeness. The degree of alignment leads the diffusion model to generate more diverse datasets during the distillation process. Extensive experiments show that our method outperforms existing state-of-the-art methods in most situations, proving its ability to tackle dataset distillation tasks.

[545] WQLCP: Weighted Adaptive Conformal Prediction for Robust Uncertainty Quantification Under Distribution Shifts

Shadi Alijani,Homayoun Najjaran

Main category: cs.LG

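TL;DR: The paper proposes two methods, RLSCP and WQLCP, that improve conformal prediction under distribution shift; WQLCP refines RLSCP with a weighted notion of exchangeability, and experiments show it outperforms existing baselines.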

Details

Motivation: Under distribution shift, conformal prediction gives unreliable coverage and inflated prediction sets. Method: RLSCP uses VAE reconstruction loss as an uncertainty metric; WQLCP introduces weighted exchangeability to adjust the calibration quantile threshold. Result: On ImageNet variants and other datasets, WQLCP maintains coverage while reducing prediction set sizes. Conclusion: WQLCP provides a robust solution for conformal prediction under distribution shift. Abstract: Conformal prediction (CP) provides a framework for constructing prediction sets with guaranteed coverage, assuming exchangeable data. However, real-world scenarios often involve distribution shifts that violate exchangeability, leading to unreliable coverage and inflated prediction sets. To address this challenge, we first introduce Reconstruction Loss-Scaled Conformal Prediction (RLSCP), which utilizes reconstruction losses derived from a Variational Autoencoder (VAE) as an uncertainty metric to scale score functions. While RLSCP demonstrates performance improvements, mainly resulting in better coverage, it quantifies quantiles based on a fixed calibration dataset without considering the discrepancies between test and train datasets in an unexchangeable setting. In the next step, we propose Weighted Quantile Loss-scaled Conformal Prediction (WQLCP), which refines RLSCP by incorporating a weighted notion of exchangeability, adjusting the calibration quantile threshold based on weights with respect to the ratio of calibration and test loss values. This approach improves the CP-generated prediction set outputs in the presence of distribution shifts. Experiments on large-scale datasets, including ImageNet variants, demonstrate that WQLCP outperforms existing baselines by consistently maintaining coverage while reducing prediction set sizes, providing a robust solution for CP under distribution shifts.
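For readers unfamiliar with the baseline being modified, the sketch below shows standard split conformal prediction (the calibration quantile and the resulting prediction set) together with a generic weighted quantile. It is a minimal illustration, not WQLCP itself: in the paper the weights are derived from the ratio of calibration and test reconstruction losses, whereas here they are simply passed in. All names are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def weighted_quantile(scores, weights, level):
    """Smallest score whose cumulative normalized weight reaches `level`."""
    pairs = sorted(zip(scores, weights))
    total = sum(weights)
    running = 0.0
    for score, weight in pairs:
        running += weight
        if running / total >= level:
            return score
    return pairs[-1][0]


def prediction_set(class_scores, threshold):
    """Labels whose nonconformity score does not exceed the threshold."""
    return [label for label, s in class_scores.items() if s <= threshold]


if __name__ == "__main__":
    calibration = [0.3, 0.1, 0.7, 0.5, 0.2, 0.6, 0.4]  # hypothetical scores
    q = conformal_quantile(calibration, alpha=0.25)
    print(q, prediction_set({"cat": 0.2, "dog": 0.55, "car": 0.9}, q))
```

With uniform weights the weighted quantile reduces to an ordinary empirical quantile; shifting weight toward calibration points that resemble the test data moves the threshold, which is the lever a weighted-exchangeability method uses to restore coverage.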

[546] Multiplicity is an Inevitable and Inherent Challenge in Multimodal Learning

Sanghyuk Chun

Main category: cs.LG

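TL;DR: The paper examines multiplicity in multimodal learning, arguing that the one-to-one modality alignment assumed by current methods is an oversimplification, since real relationships are many-to-many. Multiplicity arises from semantic abstraction, representational asymmetry, and task-dependent ambiguity, and it affects data construction, training, and evaluation.

Details

Motivation: Current multimodal learning methods assume a deterministic one-to-one alignment between modalities, whereas real-world relationships are many-to-many (multiplicity). This simplification ignores semantic abstraction and task ambiguity, making training and evaluation unreliable. Method: Analyzes the causes and consequences of multiplicity, argues that it is a fundamental bottleneck in multimodal learning, and calls for new multiplicity-aware learning frameworks and dataset construction methods. Result: Multiplicity leads to training uncertainty, unreliable evaluation, and low dataset quality, affecting every stage of multimodal learning. Conclusion: New multiplicity-aware learning frameworks and dataset construction protocols are needed to address multiplicity in multimodal learning. Abstract: Multimodal learning has seen remarkable progress, particularly with the emergence of large-scale pre-training across various modalities. However, most current approaches are built on the assumption of a deterministic, one-to-one alignment between modalities. This oversimplifies real-world multimodal relationships, where their nature is inherently many-to-many. This phenomenon, named multiplicity, is not a side-effect of noise or annotation error, but an inevitable outcome of semantic abstraction, representational asymmetry, and task-dependent ambiguity in multimodal tasks. This position paper argues that multiplicity is a fundamental bottleneck that manifests across all stages of the multimodal learning pipeline: from data construction to training and evaluation. This paper examines the causes and consequences of multiplicity, and highlights how multiplicity introduces training uncertainty, unreliable evaluation, and low dataset quality. This position calls for new research directions on multimodal learning: novel multiplicity-aware learning frameworks and dataset construction protocols considering multiplicity.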

[547] Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

Rui Cai,Bangzheng Li,Xiaofei Wen,Muhao Chen,Zhe Zhao

Main category: cs.LG

TL;DR: The paper identifies a cross-modality competency problem in multimodal large language models (MLLMs): models struggle to distinguish task-relevant from task-irrelevant signals, which degrades performance. The authors verify the problem with perturbation experiments and propose a new framework to improve robustness.

Details

Motivation: MLLMs perform well across tasks but are easily disturbed by irrelevant modalities, which degrades performance, especially in modality-specific tasks. Method: Designs perturbation experiments to verify the problem, and proposes a fine-tuning framework that combines perturbation-based data augmentation with a consistency regularization strategy. Result: Validated on multiple benchmark datasets and model families, with significant improvements in robustness and cross-modality competency. Conclusion: The proposed method strengthens unimodal reasoning while also improving performance on multimodal tasks. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks, yet they often exhibit difficulty in distinguishing task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA), which can lead to susceptibility to misleading or spurious inputs. We refer to this broader limitation as the Cross-Modality Competency Problem: the model's inability to fairly evaluate all modalities. This vulnerability becomes more evident in modality-specific tasks such as image classification or pure text question answering, where models are expected to rely solely on one modality. In such tasks, spurious information from irrelevant modalities often leads to significant performance degradation. We refer to this failure as Modality Interference, which serves as a concrete and measurable instance of the cross-modality competency problem. We further design a perturbation-based causal diagnostic experiment to verify and quantify this problem. To mitigate modality interference, we propose a novel framework to fine-tune MLLMs, including perturbation-based data augmentations with both heuristic perturbations and adversarial perturbations via Projected Gradient Descent (PGD), and a consistency regularization strategy applied to model outputs with original and perturbed inputs. Experiments on multiple benchmark datasets (image-heavy, text-heavy, and VQA tasks) and multiple model families with different scales demonstrate significant improvements in robustness and cross-modality competency, indicating our method's effectiveness in boosting unimodal reasoning ability while enhancing performance on multimodal tasks.

[548] GraphAU-Pain: Graph-based Action Unit Representation for Pain Intensity Estimation

Zhiyu Wang,Yang Liu,Hatice Gunes

Main category: cs.LG

TL;DR: GraphAU-Pain uses a graph neural network to model facial Action Units and their relationships, improving the accuracy and interpretability of pain intensity estimation.

Details

Motivation: Understanding pain-related facial behavior is essential for digital healthcare, especially for patients who cannot communicate verbally. Existing methods are limited in interpretability and in quantifying severity. Method: Proposes the GraphAU-Pain framework, which represents facial Action Units as graph nodes and their co-occurrence relationships as edges, modeled with a relational graph neural network. Result: On the UNBC dataset, the method achieves an F1-score of 66.21% and an accuracy of 87.61%. Conclusion: GraphAU-Pain performs well at pain intensity estimation, combining performance gains with interpretability. Abstract: Understanding pain-related facial behaviors is essential for digital healthcare in terms of effective monitoring, assisted diagnostics, and treatment planning, particularly for patients unable to communicate verbally. Existing data-driven methods of detecting pain from facial expressions are limited due to interpretability and severity quantification. To this end, we propose GraphAU-Pain, leveraging a graph-based framework to model facial Action Units (AUs) and their interrelationships for pain intensity estimation. AUs are represented as graph nodes, with co-occurrence relationships as edges, enabling a more expressive depiction of pain-related facial behaviors. By utilizing a relational graph neural network, our framework offers improved interpretability and significant performance gains. Experiments conducted on the publicly available UNBC dataset demonstrate the effectiveness of GraphAU-Pain, achieving an F1-score of 66.21% and accuracy of 87.61% in pain intensity estimation.

[549] Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

Ziyi Zhang,Li Shen,Deheng Ye,Yong Luo,Huangxuan Zhao,Lefei Zhang

Main category: cs.LG

TL;DR: Proposes MVC-ZigAL, a reinforcement-learning finetuning framework for few-step text-to-multiview (T2MV) diffusion models that improves both per-view fidelity and cross-view consistency.

Details

Motivation: Existing accelerated T2MV generation methods reduce computation at the expense of image quality and view consistency, so a more efficient optimization approach is needed. Method: 1. Unifies multiview denoising into a single Markov decision process; 2. Introduces the ZMV-Sampling technique; 3. Develops the MV-ZigAL policy optimization method; 4. Reframes RL finetuning as a constrained optimization problem. Result: The MVC-ZigAL framework improves the fidelity and consistency of T2MV generation while preserving few-step efficiency. Conclusion: By combining constrained optimization with policy optimization, MVC-ZigAL offers an efficient and balanced solution for T2MV generation. Abstract: Text-to-multiview (T2MV) generation, which produces coherent multiview images from a single text prompt, remains computationally intensive, while accelerated T2MV methods using few-step diffusion models often sacrifice image fidelity and view consistency. To address this, we propose a novel reinforcement learning (RL) finetuning framework tailored for few-step T2MV diffusion models to jointly optimize per-view fidelity and cross-view consistency. Specifically, we first reformulate T2MV denoising across all views as a single unified Markov decision process, enabling multiview-aware policy optimization driven by a joint-view reward objective. Next, we introduce ZMV-Sampling, a test-time T2MV sampling technique that adds an inversion-denoising pass to reinforce both viewpoint and text conditioning, resulting in improved T2MV generation at the cost of inference time. To internalize its performance gains into the base sampling policy, we develop MV-ZigAL, a novel policy optimization strategy that uses reward advantages of ZMV-Sampling over standard sampling as learning signals for policy updates. Finally, noting that the joint-view reward objective under-optimizes per-view fidelity but naively optimizing single-view metrics neglects cross-view alignment, we reframe RL finetuning for T2MV diffusion models as a constrained optimization problem that maximizes per-view fidelity subject to an explicit joint-view constraint, thereby enabling more efficient and balanced policy updates. By integrating this constrained optimization paradigm with MV-ZigAL, we establish our complete RL finetuning framework, referred to as MVC-ZigAL, which effectively refines the few-step T2MV diffusion baseline in both fidelity and consistency while preserving its few-step efficiency.

[550] Understanding Generalization in Diffusion Models via Probability Flow Distance

Huijie Zhang,Zijian Huang,Siyi Chen,Jinfan Zhou,Zekai Zhang,Peng Wang,Qing Qu

Main category: cs.LG

TL;DR: The paper proposes probability flow distance (PFD), a theoretically grounded and computationally efficient metric for evaluating distributional generalization in diffusion models.

Details

Motivation: Diffusion models excel at generating high-quality samples, but evaluating their generalization remains challenging. Existing theoretical metrics are impractical for high-dimensional data, while practical metrics lack rigor. Method: Proposes the PFD metric, which compares noise-to-data mappings induced by the probability flow ODE, and combines it with a teacher-student evaluation protocol to study generalization in diffusion models empirically. Result: The study uncovers several key generalization behaviors in diffusion models: scaling behavior from memorization to generalization, early learning and double descent training dynamics, and a bias-variance decomposition. Conclusion: PFD lays a foundation for future empirical and theoretical studies of generalization in diffusion models. Abstract: Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality samples that generalize beyond the training data. However, evaluating this generalization remains challenging: theoretical metrics are often impractical for high-dimensional data, while no practical metrics rigorously measure generalization. In this work, we bridge this gap by introducing probability flow distance (PFD), a theoretically grounded and computationally efficient metric to measure distributional generalization. Specifically, PFD quantifies the distance between distributions by comparing their noise-to-data mappings induced by the probability flow ODE. Moreover, by using PFD under a teacher-student evaluation protocol, we empirically uncover several key generalization behaviors in diffusion models, including: (1) scaling behavior from memorization to generalization, (2) early learning and double descent training dynamics, and (3) bias-variance decomposition. Beyond these insights, our work lays a foundation for future empirical and theoretical studies on generalization in diffusion models.

[551] Multimodal Federated Learning With Missing Modalities through Feature Imputation Network

Pranav Poudel,Aavash Chhetri,Prashnna Gyawali,Georgios Leontidis,Binod Bhattarai

Main category: cs.LG

TL;DR: Proposes a lightweight, low-dimensional feature translator that reconstructs bottleneck features of missing modalities, addressing the missing-modality challenge in multimodal federated learning for healthcare.

Details

Motivation: Missing modalities are common in multimodal federated learning for healthcare. Existing methods rely on public datasets or generative models, which are costly and error-prone. Method: Proposes a lightweight, low-dimensional feature translator that reconstructs bottleneck features of the missing modalities. Result: Outperforms baseline methods on three datasets (MIMIC-CXR, NIH Open-I, and CheXpert). Conclusion: The method effectively addresses missing modalities in multimodal federated learning for healthcare and outperforms existing methods. Abstract: Multimodal federated learning holds immense potential for collaboratively training models from multiple sources without sharing raw data, addressing both data scarcity and privacy concerns, two key challenges in healthcare. A major challenge in training multimodal federated models in healthcare is the presence of missing modalities due to multiple reasons, including variations in clinical practice, cost and accessibility constraints, retrospective data collection, privacy concerns, and occasional technical or human errors. Previous methods typically rely on publicly available real datasets or synthetic data to compensate for missing modalities. However, obtaining real datasets for every disease is impractical, and training generative models to synthesize missing modalities is computationally expensive and prone to errors due to the high dimensionality of medical data. In this paper, we propose a novel, lightweight, low-dimensional feature translator to reconstruct bottleneck features of the missing modalities. Our experiments on three different datasets (MIMIC-CXR, NIH Open-I, and CheXpert), in both homogeneous and heterogeneous settings consistently improve the performance of competitive baselines. The code and implementation details are available at: https://github.com/bhattarailab/FedFeatGen

[552] DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation

Pingzhi Li,Zhen Tan,Huaizhi Qu,Huan Liu,Tianlong Chen

Main category: cs.LG

TL;DR: The paper proposes Defensive Output Generation (DOGe), a strategy that fine-tunes only the final linear layer of an LLM to proactively protect it against imitation via knowledge distillation (KD).

Details

Motivation: Existing protections such as watermarking identify imitation only after the fact, while other defenses assume that the student mimics the teacher's internal logits and so cannot handle distillation from output text alone. The paper aims to protect LLMs proactively under the realistic constraints of API access. Method: Fine-tunes only the final linear layer of the teacher LLM with an adversarial loss, so that its outputs remain accurate yet are misleading for distillation. Result: Experiments show that student models distilled from the defensively generated teacher outputs suffer sharply reduced performance, while the teacher's original performance is preserved or even improved. Conclusion: DOGe is an efficient and practical defense against KD-based model imitation. Abstract: Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while preserving or even improving the original performance of the teacher model, student models distilled from the defensively generated teacher outputs demonstrate catastrophically reduced performance, demonstrating our method's effectiveness as a practical safeguard against KD-based model imitation.

[553] Probabilistic Kernel Function for Fast Angle Testing

Kejing Lu,Chuan Xiao,Yoshiharu Ishikawa

Main category: cs.LG

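TL;DR: The paper studies the angle testing problem in high-dimensional Euclidean spaces and proposes two projection-based probabilistic kernel functions, one for angle comparison and one for angle thresholding. Unlike existing methods that rely on random projection vectors drawn from Gaussian distributions, the approach uses reference angles and a deterministic structure. The kernel functions outperform Gaussian-based ones both theoretically and experimentally, and in approximate nearest neighbor search (ANNS) they achieve 2.5x to 3x higher query throughput than HNSW.

Details

Motivation: Address angle testing in high-dimensional spaces and improve on the efficiency and accuracy of existing methods. Method: Proposes two projection-based probabilistic kernel functions that use reference angles and a deterministic structure for the projection vectors. Result: The kernel functions outperform Gaussian-distribution-based methods both theoretically and experimentally, and raise query throughput by 2.5x to 3x in ANNS. Conclusion: The proposed kernel functions perform well in high-dimensional angle testing and ANNS, and show potential for practical use. Abstract: In this paper, we study the angle testing problem in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and employs a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be both theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We further apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves a 2.5X ~ 3X higher query-per-second (QPS) throughput compared to the state-of-the-art graph-based search algorithm HNSW.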
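
For context, the Gaussian random-projection approach that the abstract contrasts against can be sketched as follows. This is the classical sign-projection estimator (the probability that a Gaussian hyperplane separates two vectors equals their angle divided by π), not the paper's deterministic kernel functions; the function name and parameters are assumptions for illustration.

```python
import math
import random


def estimate_angle(u, v, num_projections=2000, seed=0):
    """Estimate the angle between u and v from sign disagreements.

    Each Gaussian projection vector r separates u and v with probability
    angle / pi, so the disagreement rate times pi estimates the angle.
    """
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(num_projections):
        r = [rng.gauss(0.0, 1.0) for _ in u]
        sign_u = sum(a * b for a, b in zip(r, u)) >= 0
        sign_v = sum(a * b for a, b in zip(r, v)) >= 0
        disagreements += sign_u != sign_v
    return math.pi * disagreements / num_projections


def angle_below(u, v, threshold, **kwargs):
    """Angle thresholding on top of the estimate (a naive illustration)."""
    return estimate_angle(u, v, **kwargs) < threshold
```

The estimate only becomes reliable as the number of projections grows, which is the kind of asymptotic dependence the abstract says the proposed kernel functions avoid.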

Dan Peng,Zhihui Fu,Zewen Ye,Zhuoran Song,Jun Wang

Main category: cs.LG

TL;DR: Proposes a highly accurate sparse attention mechanism that shares precise attention patterns across heads, improving both the efficiency and the accuracy of long-context inference.

Details

Motivation: Existing sparse attention methods rely on predefined patterns or inaccurate estimates and fail to fully capture the true dynamics of attention, which reduces efficiency and accuracy. Method: Building on two key observations (attention patterns are highly similar across heads, and this similarity is consistent across inputs), the method shares accurate attention patterns across heads and requires full attention computation for only a small subset of heads. Result: Achieves the best overall accuracy while matching or exceeding the speedup of existing methods. Conclusion: By sharing accurate patterns, the method captures attention dynamics effectively and offers an efficient and accurate solution for long-context inference. Abstract: Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. While existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, they often fail to fully capture the true dynamics of attention, resulting in reduced efficiency and compromised accuracy. Instead, we propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, enabling a more realistic capture of the dynamic behavior of attention. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By strategically sharing computed accurate patterns across attention heads, our method effectively captures actual patterns while requiring full attention computation for only a small subset of heads. Comprehensive evaluations demonstrate that our approach achieves superior or comparable speedup relative to state-of-the-art methods while delivering the best overall accuracy.

[555] Learning to Reason without External Rewards

Xuandong Zhao,Zhewei Kang,Aosong Feng,Sergey Levine,Dawn Song

Main category: cs.LG

TL;DR: The paper proposes RLIF, a framework in which internal feedback (such as the model's own confidence) replaces external rewards. Its method, Intuitor, matches GRPO on mathematical benchmarks and generalizes better to out-of-domain tasks.

Details

Motivation: Conventional RLVR relies on costly, domain-specific supervision, which limits its scalability. RLIF aims to enable learning from internal signals without external rewards or labeled data. Method: Proposes Intuitor, which uses the model's own confidence (self-certainty) as the reward signal in place of the external rewards in GRPO, enabling fully unsupervised learning. Result: Intuitor matches GRPO on mathematical benchmarks and generalizes better to out-of-domain tasks such as code generation, without requiring gold solutions or test cases. Conclusion: Intrinsic signals can drive effective learning, offering a scalable alternative for autonomous AI systems where verifiable rewards are unavailable. Abstract: Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
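
A rough sketch of how a confidence score could stand in for an external reward in a group-relative scheme is given below. The abstract does not define self-certainty, so the measure used here (mean KL divergence from a uniform distribution to each next-token distribution) is an assumption, as are the function names; the within-group standardization follows the general idea of GRPO rather than any specific implementation.

```python
import math
from statistics import mean, pstdev


def self_certainty(token_distributions):
    """Assumed confidence proxy: mean KL(uniform || p) over next-token distributions.

    The value is 0 for uniform predictions and grows as predictions sharpen.
    All probabilities must be strictly positive.
    """
    total = 0.0
    for probs in token_distributions:
        vocab = len(probs)
        total += sum((1.0 / vocab) * math.log((1.0 / vocab) / p) for p in probs)
    return total / len(token_distributions)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical responses to one prompt, each with two token distributions.
responses = [
    [[0.5, 0.5], [0.6, 0.4]],
    [[0.8, 0.2], [0.7, 0.3]],
    [[0.95, 0.05], [0.9, 0.1]],
]
advantages = group_relative_advantages([self_certainty(r) for r in responses])
```

In this toy group the most confident response receives the largest advantage, so a policy update would push probability toward it without consulting any label or test case.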

[556] Preference Optimization by Estimating the Ratio of the Data Distribution

Yeongmin Kim,Heesun Bae,Byeonghu Na,Il-Chul Moon

Main category: cs.LG

TL;DR: The paper proposes Bregman preference optimization (BPO), a generalized DPO loss that optimizes toward the target policy through a ratio matching framework while retaining simplicity and theoretical guarantees. Experiments show that BPO improves on DPO in both generation quality and diversity.

Details

Motivation: Existing generalizations of DPO do not retain both simplicity and theoretical guarantees when matching the target policy. BPO offers a more general solution through ratio matching, without relying on reward models or partition functions. Method: Proposes the Bregman preference optimization (BPO) framework, which uses ratio matching to achieve target policy optimality, and develops SBA, a gradient scaling method. Result: BPO outperforms DPO and other extensions in experiments, improving both generation quality and diversity, and reaches state-of-the-art performance among Llama-3-8B backbones. Conclusion: BPO is a general and efficient preference optimization framework applicable to a variety of target policies, and experiments confirm its advantages. Abstract: Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the target policy from a likelihood ratio estimation perspective. The ratio of the target policy provides a unique identification of the policy distribution without relying on reward models or partition functions. This allows the generalized loss to retain both simplicity and theoretical guarantees, which prior work such as f-PO fails to achieve simultaneously. We propose Bregman preference optimization (BPO), a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. BPO subsumes DPO as a special case and offers tractable forms for all instances, allowing implementation with a few lines of code. We further develop scaled Basu's power divergence (SBA), a gradient scaling method that can be used for BPO instances. The BPO framework complements other DPO variants and is applicable to target policies defined by these variants. In experiments, unlike other probabilistic loss extensions such as f-DPO or f-PO, which exhibit a trade-off between generation fidelity and diversity, instances of BPO improve both win rate and entropy compared with DPO. When applied to Llama-3-Instruct-8B, BPO achieves state-of-the-art performance among Llama-3-8B backbones, with a 55.9% length-controlled win rate on AlpacaEval2.
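
Since the abstract states that BPO subsumes DPO as a special case, a minimal sketch of the standard per-pair DPO loss may help fix ideas. Only that special case is shown; the BPO family and SBA are not reproduced, and the function and parameter names (including the default `beta`) are assumptions for illustration.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each log-ratio compares the policy's sequence log-probability with the
    reference model's log-probability of the same response.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood ratio relative to the rejected one.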

[557] Discrete Markov Bridge

Hengli Li,Yuxuan Wang,Song-Chun Zhu,Ying Nian Wu,Zilong Zheng

Main category: cs.LG

TL;DR: Proposes Discrete Markov Bridge, a new framework that improves discrete data modeling through Matrix Learning and Score Learning and outperforms existing baselines.

Details

Motivation: Existing discrete diffusion methods rely on a fixed rate transition matrix, which limits the expressiveness of latent representations and constrains the design space. Method: Combines Matrix Learning and Score Learning, with theoretical analysis and an analysis of space complexity. Result: Achieves an ELBO of 1.38 on Text8, outperforming baselines, and performs comparably to image-specific generation approaches on CIFAR-10. Conclusion: Discrete Markov Bridge addresses the limitations of existing methods and improves discrete data modeling. Abstract: Discrete diffusion has recently emerged as a promising paradigm in discrete data modeling. However, existing methods typically rely on a fixed rate transition matrix during training, which not only limits the expressiveness of latent representations, a fundamental strength of variational methods, but also constrains the overall design space. To address these limitations, we propose Discrete Markov Bridge, a novel framework specifically designed for discrete representation learning. Our approach is built upon two key components: Matrix Learning and Score Learning. We conduct a rigorous theoretical analysis, establishing formal performance guarantees for Matrix Learning and proving the convergence of the overall framework. Furthermore, we analyze the space complexity of our method, addressing practical constraints identified in prior studies. Extensive empirical evaluations validate the effectiveness of the proposed Discrete Markov Bridge, which achieves an Evidence Lower Bound (ELBO) of 1.38 on the Text8 dataset, outperforming established baselines. Moreover, the proposed model demonstrates competitive performance on the CIFAR-10 dataset, achieving results comparable to those obtained by image-specific generation approaches.

[558] Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Ruizhe Shi,Minhak Song,Runlong Zhou,Zihan Zhang,Maryam Fazel,Simon S. Du

Main category: cs.LG

TL;DR: The paper analyzes the performance gap between RLHF and DPO under a representation gap, decomposes it into explicit and implicit components, and examines the effects of model mis-specification and sample efficiency.

Details

Motivation: Study how RLHF and DPO differ in performance under various model mis-specification and optimization conditions, providing a theoretical basis for choosing between them. Method: A theoretical analysis that decomposes the performance gap and compares RLHF, DPO, and online DPO under exact and approximate optimization. Result: Online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified; RLHF needs significantly fewer samples when the ground-truth reward is implicitly sparse. Conclusion: The study gives theoretical support for choosing between RLHF and DPO and clarifies when each is preferable. Abstract: We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model -- highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.

[559] ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining

Melis Ilayda Bal,Volkan Cevher,Michael Muehlebach

Main category: cs.LG

TL;DR: ESLM is a risk-aware algorithm that improves training efficiency and distributional robustness through online token-level batch selection, substantially reducing training compute.

Details

Motivation: Large language model pretraining is compute-intensive, and many tokens contribute only marginally to learning, which makes training inefficient. Method: ESLM uses per-token statistics (such as entropy or loss) and applies value-at-risk thresholding to retain only the most informative tokens in each batch. Result: Experiments show that ESLM significantly reduces training FLOPs while maintaining or improving perplexity and downstream performance relative to baselines. Conclusion: ESLM provides an efficient and distributionally robust approach to language model pretraining that applies across model sizes and pretraining corpora. Abstract: Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Efficient Selective Language Modeling (ESLM), a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. ESLM leverages per-token statistics (e.g., entropy or loss) and applies value-at-risk thresholding to retain only the most informative tokens per batch. This data-centric mechanism reshapes the training loss, prioritizing high-risk tokens and eliminating redundant gradient computation. We frame ESLM as a bilevel game: the model competes with a masking adversary that selects worst-case token subsets under a constrained thresholding rule. In the loss-based setting, ESLM recovers conditional value-at-risk loss minimization, providing a principled connection to distributionally robust optimization. We extend our approach to Ada-ESLM, which adaptively tunes the selection confidence during training. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines. Our approach also scales across model sizes and pretraining corpora, and integrates naturally with knowledge distillation.
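
The value-at-risk thresholding step described in the abstract above can be sketched as follows. This is an illustrative reading, not the authors' code: the empirical quantile convention, the `alpha` parameter, and the function names are assumptions.

```python
import math


def value_at_risk(values, alpha):
    """Empirical alpha-quantile: smallest value with at least a fraction alpha at or below it."""
    ordered = sorted(values)
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]


def select_tokens(token_losses, alpha):
    """Keep tokens whose loss is at or above the value-at-risk threshold.

    Returns the selection mask and the mean loss over the kept tokens, which
    is an empirical conditional value-at-risk of the batch.
    """
    threshold = value_at_risk(token_losses, alpha)
    mask = [loss >= threshold for loss in token_losses]
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return mask, sum(kept) / len(kept)


# Hypothetical per-token losses for one small batch.
mask, selected_loss = select_tokens([0.1, 0.5, 2.0, 3.0], alpha=0.75)
```

Only the high-loss tokens contribute to the returned loss, so gradients for the masked tokens need not be computed, which is where the compute saving would come from.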

[560] An Explainable Diagnostic Framework for Neurodegenerative Dementias via Reinforcement-Optimized LLM Reasoning

Andrew Zamai,Nathanael Fijalkow,Boris Mansencal,Laurent Simon,Eloi Navet,Pierrick Coupe

Main category: cs.LG

TL;DR: Proposes a framework that combines deep learning with large language models to improve the transparency and accuracy of neurodegenerative dementia diagnosis.

Details

Motivation: Address overlapping symptoms and similar imaging patterns in the diagnosis of neurodegenerative dementias, while making deep learning models more useful in clinical practice. Method: A modular pipeline converts 3D T1-weighted brain MRIs into radiology reports, large language models assist diagnosis from these reports, and reinforcement learning incentivizes diagnostic reasoning. Result: The framework matches the diagnostic performance of existing methods while generating diagnostic rationales grounded in imaging findings, providing causally grounded explanations. Conclusion: The framework balances diagnostic accuracy and explainability and supports clinical decision making. Abstract: The differential diagnosis of neurodegenerative dementias is a challenging clinical task, mainly because of the overlap in symptom presentation and the similarity of patterns observed in structural neuroimaging. To improve diagnostic efficiency and accuracy, deep learning-based methods such as Convolutional Neural Networks and Vision Transformers have been proposed for the automatic classification of brain MRIs. However, despite their strong predictive performance, these models find limited clinical utility due to their opaque decision making. In this work, we propose a framework that integrates two core components to enhance diagnostic transparency. First, we introduce a modular pipeline for converting 3D T1-weighted brain MRIs into textual radiology reports. Second, we explore the potential of modern Large Language Models (LLMs) to assist clinicians in the differential diagnosis between Frontotemporal dementia subtypes, Alzheimer's disease, and normal aging based on the generated reports. To bridge the gap between predictive accuracy and explainability, we employ reinforcement learning to incentivize diagnostic reasoning in LLMs. Without requiring supervised reasoning traces or distillation from larger models, our approach enables the emergence of structured diagnostic rationales grounded in neuroimaging findings. Unlike post-hoc explainability methods that retrospectively justify model decisions, our framework generates diagnostic rationales as part of the inference process, producing causally grounded explanations that inform and guide the model's decision-making process. In doing so, our framework matches the diagnostic performance of existing deep learning methods while offering rationales that support its diagnostic conclusions.

[561] MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

Hui Chen,Miao Xiong,Yujie Lu,Wei Han,Ailin Deng,Yufei He,Jiaying Wu,Yibo Li,Yue Liu,Bryan Hooi

Main category: cs.LG

TL;DR: MLR-Bench is a benchmark for evaluating AI agents on open-ended machine learning research, comprising a task set, an evaluation framework, and a modular agent scaffold. It finds that current LLMs are good at generating research ideas and papers, but coding agents produce unreliable experimental results.

Details

Motivation: AI agents show growing potential to drive and support scientific discovery, but there is no comprehensive benchmark for evaluating them on open-ended research. Method: MLR-Bench includes 201 research tasks, the automated evaluation framework MLR-Judge, and the modular agent scaffold MLR-Agent, and supports both stepwise and end-to-end evaluation. Result: LLMs are effective at generating coherent ideas and well-structured papers, but coding agents frequently (e.g., in 80% of cases) produce fabricated or invalidated experimental results. MLR-Judge shows high agreement with human reviewers. Conclusion: MLR-Bench gives the community a tool for evaluating and improving AI research agents toward trustworthy scientific discovery. Abstract: Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results--posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.

[562] The Limits of Preference Data for Post-Training

Eric Zhao,Jessica Dai,Pranjal Awasthi

Main category: cs.LG

TL;DR: The paper examines the limits of reinforcement learning (RL) optimization in tasks that require human feedback, in particular the limitations of preference data, and argues that grounded human scoring and algorithmic innovations are needed.

Details

Motivation: Explore whether RL can be applied to tasks that require qualitative human feedback (such as deep research and trip planning), and how preference data limits optimization. Method: Formalizes the limitations of preference data using voting theory, drawing an analogy between how a model chooses an answer and how voters choose a candidate. Result: Even idealized preference data cannot guarantee approximately optimal solutions, and preference data limits the ability of RLHF to elicit reasoning behaviors. Conclusion: Grounded human scoring and algorithmic innovations are needed to extend the success of RL to tasks that require human feedback. Abstract: Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluation is qualitative and there are many possible degrees of success. One attractive and scalable modality for collecting human feedback is preference data: ordinal rankings (pairwise or k-wise) that indicate, for k given outcomes, which one is preferred. In this work, we study a critical roadblock: preference data fundamentally and significantly limits outcome-based optimization. Even with idealized preference data (infinite, noiseless, and online), the use of ordinal feedback can prevent obtaining even approximately optimal solutions. We formalize this impossibility using voting theory, drawing an analogy between how a model chooses to answer a query and how voters choose a candidate to elect. This indicates that grounded human scoring and algorithmic innovations are necessary for extending the success of RL post-training to domains demanding human feedback. We also explore why these limitations have disproportionately impacted RLHF when it comes to eliciting reasoning behaviors (e.g., backtracking) versus situations where RLHF has been historically successful (e.g., instruction-tuning and safety training), finding that the limitations of preference data primarily suppress RLHF's ability to elicit robust strategies -- a class that encompasses most reasoning behaviors.

[563] Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents

Tao Wu,Jingyuan Chen,Wang Lin,Mengze Li,Yumeng Zhu,Ang Li,Kun Kuang,Fei Wu

Main category: cs.LG

TL;DR: Proposes a training-free framework that builds a cognitive prototype of each student from a knowledge graph, simulates student behavior, and improves simulation accuracy.

Details

Motivation: Current large language models (LLMs), trained as assistants, generate perfect answers and struggle to simulate the diverse cognitive abilities and learning imperfections of students. Method: Builds a cognitive prototype for each student from a knowledge graph, predicts performance on new tasks, simulates solutions, and refines them iteratively with beam search. Result: On the Student_100 dataset, simulation accuracy improves by 100%. Conclusion: The method addresses the limitations of LLMs in simulating student behavior and makes simulations markedly more realistic. Abstract: Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as "helpful assistants", aim to generate perfect responses. As a result, they struggle to simulate students with diverse cognitive abilities, as they often produce overly advanced answers, missing the natural imperfections that characterize student learning and resulting in unrealistic simulations. To address this issue, we propose a training-free framework for student simulation. We begin by constructing a cognitive prototype for each student using a knowledge graph, which captures their understanding of concepts from past learning records. This prototype is then mapped to new tasks to predict student performance. Next, we simulate student solutions based on these predictions and iteratively refine them using a beam search method to better replicate realistic mistakes. To validate our approach, we construct the Student_100 dataset, consisting of 100 students working on Python programming and 5,000 learning records. Experimental results show that our method consistently outperforms baseline models, achieving 100% improvement in simulation accuracy.

[564] SAEs Are Good for Steering -- If You Select the Right Features

Dana Arad,Aaron Mueller,Yonatan Belinkov

Main category: cs.LG

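TL;DR: The paper distinguishes input features from output features and proposes scores for selecting features with high output scores, which substantially improves steering with sparse autoencoders.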

Details

Motivation: Existing methods analyze features only through their activations, which does not fully describe a feature's effect on the model's output; a better way to tell feature types apart is needed. Method: Input and output scores are proposed to distinguish input features from output features, and steering is improved by keeping only features with high output scores. Result: High input and high output scores rarely co-occur in the same feature; after filtering, steering with sparse autoencoders improves by 2-3x. Conclusion: Distinguishing and filtering feature types makes sparse autoencoders considerably more practical for steering and competitive with supervised methods. Abstract: Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model's latent space. This enables useful applications such as steering - influencing the output of a model towards a desired concept - without requiring labeled data. Current methods identify SAE features to steer by analyzing the input tokens that activate them. However, recent work has highlighted that activations alone do not fully describe the effect of a feature on the model's output. In this work, we draw a distinction between two types of features: input features, which mainly capture patterns in the model's input, and output features, which have a human-understandable effect on the model's output. We propose input and output scores to characterize and locate these types of features, and show that high values for both scores rarely co-occur in the same features. These findings have practical implications: after filtering out features with low output scores, we obtain 2-3x improvements when steering with SAEs, making them competitive with supervised methods.
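
A minimal sketch of the filter-then-steer idea described above. The abstract does not define the output score or the steering rule, so everything here is an assumption: the scores are made-up numbers, the threshold is arbitrary, and steering is written in the commonly used form of adding a scaled feature direction to a hidden activation.

```python
def filter_features(output_scores, threshold):
    """Keep feature ids whose (hypothetical) output score is at least threshold."""
    return [f for f, s in output_scores.items() if s >= threshold]

def steer(hidden, direction, alpha):
    """Add a scaled feature direction to a hidden activation vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Made-up output scores and decoder directions for three SAE features.
output_scores = {0: 0.05, 1: 0.80, 2: 0.40}
decoder = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}

kept = filter_features(output_scores, threshold=0.3)
steered = steer([0.5, 0.5, 0.5], decoder[kept[0]], alpha=2.0)
print(kept, steered)
```

Feature 0 is dropped before steering because its score falls below the threshold; only the surviving features are candidates for intervention.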

[565] Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

Jaehun Jung,Seungju Han,Ximing Lu,Skyler Hallinan,David Acuna,Shrimai Prabhumoye,Mostafa Patwary,Mohammad Shoeybi,Bryan Catanzaro,Yejin Choi

Main category: cs.LG

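TL;DR: The paper proposes G-Vendi, a new data diversity metric, and shows experimentally that it predicts language model generalization better than existing measures. It also presents Prismatic Synthesis, a framework for generating diverse synthetic data that significantly improves model performance.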

Details

Motivation: Existing data diversity metrics rely too heavily on surface-level heuristics and do not reflect model behavior, so a more accurate diversity measure is needed to improve language model generalization. Method: Through large-scale empirical analysis, the paper introduces G-Vendi, a metric that quantifies diversity via the entropy of model gradients, and proposes the Prismatic Synthesis framework for generating diverse synthetic data. Result: G-Vendi correlates strongly with OOD performance (Spearman's ρ≈0.9), and Prismatic Synthesis significantly improves model performance, outperforming existing methods. Conclusion: G-Vendi and Prismatic Synthesis are effective tools for improving language model generalization, particularly when data diversity is lacking. Abstract: Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models -- and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning -- as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman's $\rho \approx 0.9$) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data -- not just on in-distribution test but across unseen, out-of-distribution benchmarks -- significantly outperforming state-of-the-art models that rely on a data generator 20 times larger than ours. For example, PrismMath-7B, our model distilled from a 32B LLM, outperforms R1-Distill-Qwen-7B -- the same base model trained on proprietary data generated by 671B R1 -- on 6 out of 7 challenging benchmarks.
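
The correlation figure quoted above is a Spearman rank correlation. The sketch below shows how such a value is computed from paired measurements; the diversity and accuracy numbers are invented for illustration (they are not results from the paper), and the simple rank formula assumes no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-dataset diversity scores and OOD accuracies.
diversity = [0.10, 0.35, 0.20, 0.50, 0.80]
ood_accuracy = [41.0, 47.5, 48.0, 52.0, 60.0]
rho = spearman_rho(diversity, ood_accuracy)
print(round(rho, 3))
```

Because only ranks enter the formula, the metric rewards a diversity measure for ordering datasets correctly, regardless of the scale of its raw values.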

[566] Learning Extrapolative Sequence Transformations from Markov Chains

Sophia Hager,Aleem Khan,Andrew Wang,Nicholas Andrews

Main category: cs.LG

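TL;DR: The paper proposes learning an autoregressive model from MCMC search to efficiently generate new sequences with target properties, improving on conventional random search.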

Details

Motivation: Deep learning works well when training and test conditions are similar, but for tasks that must go beyond known data (such as biological sequence design), random search methods are inefficient. Method: States from the Markov chains produced by MCMC are used to train an autoregressive model that efficiently generates new sequences with the target properties. Result: On protein sequence design, text sentiment control, and text anonymization, the autoregressive model extrapolates as well as or better than MCMC, with better scalability and sample efficiency. Conclusion: The autoregressive model efficiently generates sequences that satisfy the target properties and improves on conventional random search. Abstract: Most successful applications of deep learning involve similar training and test conditions. However, tasks such as biological sequence design involve searching for sequences that improve desirable properties beyond previously known values, which requires novel hypotheses that \emph{extrapolate} beyond training data. In these settings, extrapolation may be achieved by using random search methods such as Markov chain Monte Carlo (MCMC), which, given an initial state, sample local transformations to approximate a target density that rewards states with the desired properties. However, even with a well-designed proposal, MCMC may struggle to explore large structured state spaces efficiently. Rather than relying on stochastic search, it would be desirable to have a model that greedily optimizes the properties of interest, successfully extrapolating in as few steps as possible. We propose to learn such a model from the Markov chains resulting from MCMC search. Specifically, our approach uses selected states from Markov chains as a source of training data for an autoregressive model, which is then able to efficiently generate novel sequences that extrapolate along the sequence-level properties of interest. The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the autoregressive model can extrapolate as well as or better than MCMC, but with the additional benefits of scalability and significantly higher sample efficiency.

[567] Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

Xiangchen Song,Aashiq Muhamed,Yujia Zheng,Lingjing Kong,Zeyu Tang,Mona T. Diab,Virginia Smith,Kun Zhang

Main category: cs.LG

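TL;DR: The paper argues that feature consistency should be prioritized in sparse autoencoders (SAEs), proposes PW-MCC as the metric, and shows that high consistency is achievable with appropriate architectures.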

Details

Motivation: Sparse autoencoders are used in mechanistic interpretability (MI) to decompose neural network activations, but features that differ across training runs undermine the reliability and efficiency of the research. Method: The Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) is proposed as a metric for feature consistency, supported by theoretical grounding and experiments on both synthetic data and real LLM data. Result: High feature consistency is achievable with appropriate architectures (PW-MCC reaches 0.80 for TopK SAEs on LLM activations), and high consistency correlates strongly with the semantic similarity of feature explanations. Conclusion: The paper calls on the community to measure feature consistency systematically in order to support robust, cumulative progress in mechanistic interpretability. Abstract: Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs -- the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.
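
A rough sketch of how a consistency score between two learned dictionaries might be computed. The abstract does not give the exact PW-MCC definition, so this is an assumption: features are compared by absolute cosine similarity, matched one-to-one to maximize total similarity, and averaged. The brute-force matching only works for tiny dictionaries; a real implementation would use an assignment solver.

```python
from itertools import permutations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def mcc(dict_a, dict_b):
    """Mean absolute cosine similarity under the best one-to-one matching."""
    n = len(dict_a)
    sims = [[abs(cosine(a, b)) for b in dict_b] for a in dict_a]
    best = max(sum(sims[i][p[i]] for i in range(n))
               for p in permutations(range(n)))
    return best / n

# Two toy "runs" that learned the same features up to order, scale, and sign.
run_1 = [[1.0, 0.0], [0.0, 1.0]]
run_2 = [[0.0, -2.0], [3.0, 0.0]]
score = mcc(run_1, run_2)
print(score)
```

For more than two training runs, the same function could be averaged over all pairs of runs, which is presumably what the "pairwise" in the name refers to.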

[568] The Coverage Principle: A Framework for Understanding Compositional Generalization

Hoyeon Chang,Jinho Park,Hanseul Cho,Sohee Yang,Miyoung Ko,Hyeonbin Hwang,Seungpil Won,Dohaeng Lee,Youbin Ahn,Minjoon Seo

Main category: cs.LG

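TL;DR: The paper proposes the coverage principle, which shows that models relying on pattern matching struggle to generalize on compositional tasks, and validates its predictive power experimentally.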

Details

Motivation: Large language models fall short on systematic compositional generalization; the paper proposes a data-centric framework to explain why. Method: The coverage principle is proposed, and its predictions about Transformer generalization are tested experimentally with respect to data requirements, parameter scaling, and path ambiguity. Result: The training data required grows at least quadratically with the token set size, and parameter scaling does not improve data efficiency; path ambiguity hurts performance; Chain-of-Thought supervision improves efficiency but remains limited. Conclusion: The coverage principle offers a unified view of compositional reasoning and underscores the need for fundamental architectural or training innovations to achieve systematic compositionality. Abstract: Large language models excel at pattern matching, yet often fall short in systematic compositional generalization. We propose the coverage principle: a data-centric framework showing that models relying primarily on pattern matching for compositional tasks cannot reliably generalize beyond substituting fragments that yield identical results when used in the same contexts. We demonstrate that this framework has strong predictive power for the generalization capabilities of Transformers. First, we derive and empirically confirm that the training data required for two-hop generalization grows at least quadratically with the token set size, and the training data efficiency does not improve with 20x parameter scaling. Second, for compositional tasks with path ambiguity where one variable affects the output through multiple computational paths, we show that Transformers learn context-dependent state representations that undermine both performance and interoperability. Third, Chain-of-Thought supervision improves training data efficiency for multi-hop tasks but still struggles with path ambiguity. Finally, we outline a \emph{mechanism-based} taxonomy that distinguishes three ways neural networks can generalize: structure-based (bounded by coverage), property-based (leveraging algebraic invariances), and shared-operator (through function reuse). This conceptual lens contextualizes our results and highlights where new architectural ideas are needed to achieve systematic compositionality. Overall, the coverage principle provides a unified lens for understanding compositional reasoning, and underscores the need for fundamental architectural or training innovations to achieve truly systematic compositionality.

eess.IV [Back]

[569] Brightness-Invariant Tracking Estimation in Tagged MRI

Zhangxing Bian,Shuwen Wei,Xiao Liang,Yuan-Chiao Lu,Samuel W. Remedios,Fangxu Xing,Jonghye Woo,Dzung L. Pham,Aaron Carass,Philip V. Bayly,Jiachen Zhuo,Ahmed Alshareef,Jerry L. Prince

Main category: eess.IV

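TL;DR: BRITE improves the accuracy of motion tracking in tagged MRI by separating the anatomy from the tag pattern, combining denoising diffusion probabilistic models with physics-informed neural networks.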

Details

Motivation: In tagged MRI, brightness changes and motion make optical flow methods error-prone, so a more robust tracking method is needed. Method: BRITE combines denoising diffusion probabilistic models with physics-informed neural networks to separate the anatomy from the tag pattern while simultaneously estimating the Lagrangian motion. Result: Experiments show that BRITE gives more accurate motion and strain estimates than other methods and is robust to tag fading. Conclusion: BRITE provides brightness-invariant motion tracking for tagged MRI with higher accuracy and robustness. Abstract: Magnetic resonance (MR) tagging is an imaging technique for noninvasively tracking tissue motion in vivo by creating a visible pattern of magnetization saturation (tags) that deforms with the tissue. Due to longitudinal relaxation and progression to steady-state, the tags and tissue brightnesses change over time, which makes tracking with optical flow methods error-prone. Although Fourier methods can alleviate these problems, they are also sensitive to brightness changes as well as spectral spreading due to motion. To address these problems, we introduce the brightness-invariant tracking estimation (BRITE) technique for tagged MRI. BRITE disentangles the anatomy from the tag pattern in the observed tagged image sequence and simultaneously estimates the Lagrangian motion. The inherent ill-posedness of this problem is addressed by leveraging the expressive power of denoising diffusion probabilistic models to represent the probabilistic distribution of the underlying anatomy and the flexibility of physics-informed neural networks to estimate biologically-plausible motion. A set of tagged MR images of a gel phantom was acquired with various tag periods and imaging flip angles to demonstrate the impact of brightness variations and to validate our method. The results show that BRITE achieves more accurate motion and strain estimates as compared to other state-of-the-art methods, while also being resistant to tag fading.

[570] How We Won the ISLES'24 Challenge by Preprocessing

Tianyi Ren,Juampablo E. Heras Rivera,Hitender Oswal,Yutong Pan,William Henry,Jacob Ruzevick,Mehmet Kurt

Main category: eess.IV

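TL;DR: The paper presents a deep-learning-based preprocessing and segmentation approach for predicting stroke lesion progression, which achieved the best result in the ISLES'24 challenge.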

Details

Motivation: Stroke is among the top three causes of death worldwide, and accurately identifying stroke lesion boundaries is critical for diagnosis and treatment. Supervised deep learning methods need large annotated datasets, and the ISLES'24 challenge provides longitudinal imaging data to address this. Method: A preprocessing pipeline with deep-learning-based skull stripping and custom intensity windowing is combined with a large residual nnU-Net architecture for segmentation. Result: The mean Dice score on the test set is 28.5, with a standard deviation of 21.27. Conclusion: A carefully designed preprocessing pipeline with a standard segmentation architecture effectively improves stroke lesion segmentation accuracy. Abstract: Stroke is among the top three causes of death worldwide, and accurate identification of stroke lesion boundaries is critical for diagnosis and treatment. Supervised deep learning methods have emerged as the leading solution for stroke lesion segmentation but require large, diverse, and annotated datasets. The ISLES'24 challenge addresses this need by providing longitudinal stroke imaging data, including CT scans taken on arrival to the hospital and follow-up MRI taken 2-9 days from initial arrival, with annotations derived from follow-up MRI. Importantly, models submitted to the ISLES'24 challenge are evaluated using only CT inputs, requiring prediction of lesion progression that may not be visible in CT scans for segmentation. Our winning solution shows that a carefully designed preprocessing pipeline including deep-learning-based skull stripping and custom intensity windowing is beneficial for accurate segmentation. Combined with a standard large residual nnU-Net architecture for segmentation, this approach achieves a mean test Dice of 28.5 with a standard deviation of 21.27.
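
Two of the ingredients named above, intensity windowing and the Dice score, are easy to sketch. The window bounds and sample values below are illustrative assumptions, not the authors' settings; the Dice function returns a value in [0, 1], whereas the abstract's 28.5 appears to be on a 0-100 scale.

```python
def window(values, low, high):
    """Clip intensities to [low, high], then rescale them to [0, 1]."""
    return [(min(max(v, low), high) - low) / (high - low) for v in values]

def dice(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2 * intersection / total

ct_values = [-1000, 0, 40, 80, 3000]   # illustrative CT intensities
windowed = window(ct_values, low=0, high=80)
score = dice([1, 1, 0, 0], [1, 0, 1, 0])
print(windowed, score)
```

Windowing discards contrast outside the chosen range so that the remaining range fills the network's input scale, which is one plausible reason such a step helps on CT.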

[571] ReflectGAN: Modeling Vegetation Effects for Soil Carbon Estimation from Satellite Imagery

Dristi Datta,Manoranjan Paul,Manzur Murshed,Shyh Wei Teng,Leigh M. Schmidtke

Main category: eess.IV

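TL;DR: Proposes ReflectGAN, a generative adversarial network framework that reconstructs bare soil reflectance from vegetated satellite imagery and markedly improves soil organic carbon (SOC) estimation accuracy.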

Details

Motivation: Vegetation cover contaminates soil reflectance and degrades SOC estimation; solving this would make soil health monitoring more reliable. Method: ReflectGAN learns the spectral transformation between vegetated and bare soil reflectance and is trained on the LUCAS 2018 dataset with Landsat 8 imagery. Result: Using ReflectGAN-generated reflectance as input significantly improves model performance ($R^2$ up 35%, RMSE down 43%), outperforming existing vegetation correction methods. Conclusion: ReflectGAN improves the accuracy of SOC estimation in vegetated areas and provides a more reliable tool for soil monitoring. Abstract: Soil organic carbon (SOC) is a critical indicator of soil health, but its accurate estimation from satellite imagery is hindered in vegetated regions due to spectral contamination from plant cover, which obscures soil reflectance and reduces model reliability. This study proposes the Reflectance Transformation Generative Adversarial Network (ReflectGAN), a novel paired GAN-based framework designed to reconstruct accurate bare soil reflectance from vegetated soil satellite observations. By learning the spectral transformation between vegetated and bare soil reflectance, ReflectGAN facilitates more precise SOC estimation under mixed land cover conditions. Using the LUCAS 2018 dataset and corresponding Landsat 8 imagery, we trained multiple learning-based models on both original and ReflectGAN-reconstructed reflectance inputs. Models trained on ReflectGAN outputs consistently outperformed those using existing vegetation correction methods. For example, the best-performing model (RF) achieved an $R^2$ of 0.54, RMSE of 3.95, and RPD of 2.07 when applied to the ReflectGAN-generated signals, representing a 35\% increase in $R^2$, a 43\% reduction in RMSE, and a 43\% improvement in RPD compared to the best existing method (PMM-SU). The performance of the models with ReflectGAN is also better compared to their counterparts when applied to another dataset, i.e., Sentinel-2 imagery. These findings demonstrate the potential of ReflectGAN to improve SOC estimation accuracy in vegetated landscapes, supporting more reliable soil monitoring.
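
For readers unfamiliar with the three metrics reported above, here is a small sketch of how $R^2$, RMSE, and RPD are commonly computed. The observed and predicted values are invented for illustration, and RPD is taken as the sample standard deviation of the observations divided by RMSE, which is an assumption; some studies use the population standard deviation instead.

```python
from math import sqrt
from statistics import mean, stdev

def regression_metrics(observed, predicted):
    """Return R^2, RMSE, and RPD for paired observations and predictions."""
    residual_ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    centre = mean(observed)
    total_ss = sum((o - centre) ** 2 for o in observed)
    rmse = sqrt(residual_ss / len(observed))
    return {
        "r2": 1 - residual_ss / total_ss,
        "rmse": rmse,
        "rpd": stdev(observed) / rmse,  # ratio of performance to deviation
    }

observed = [10.0, 20.0, 30.0, 40.0]    # hypothetical SOC measurements
predicted = [12.0, 18.0, 32.0, 38.0]   # hypothetical model outputs
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Since RPD divides a fixed spread by RMSE, a lower RMSE on the same observations raises RPD, so the two metrics move together when the evaluation set is unchanged.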

[572] Memory-Efficient Super-Resolution of 3D Micro-CT Images Using Octree-Based GANs: Enhancing Resolution and Segmentation Accuracy

Evgeny Ugolkov,Xupeng He,Hyung Kwak,Hussein Hoteit

Main category: eess.IV

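TL;DR: Proposes a memory-efficient generative algorithm that significantly improves the quality of segmented 3D micro-CT images of rocks, achieving a 16x resolution increase and correcting segmentation errors.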

Details

Motivation: Overlapping X-ray attenuation across different minerals makes micro-CT segmentation inaccurate, and 3D convolutional layers consume large amounts of memory. Method: A 3D Octree-based convolutional Wasserstein generative adversarial network with gradient penalty is combined with a progressive growing generator to obtain memory-efficient 3D Octree-based convolutional layers. Result: Resolution improves from 7 µm/voxel to 0.44 µm/voxel with more accurate mineral segmentation, substantially improving pore characterization and mineral differentiation. Conclusion: The method overcomes the memory bottleneck in volumetric deep learning and offers an efficient solution for modern geoscientific imaging. Abstract: We present a memory-efficient algorithm for significantly enhancing the quality of segmented 3D micro-Computed Tomography (micro-CT) images of rocks using a generative model. The proposed model achieves a 16x increase in resolution and corrects inaccuracies in segmentation caused by the overlapping X-ray attenuation in micro-CT measurements across different minerals. The generative model employed is a 3D Octree-based convolutional Wasserstein generative adversarial network with gradient penalty. To address the challenge of high memory consumption inherent in standard 3D convolutional layers, we implemented an Octree structure within the 3D progressive growing generator model. This enabled the use of memory-efficient 3D Octree-based convolutional layers. The approach is pivotal in overcoming the long-standing memory bottleneck in volumetric deep learning, making it possible to reach 16x super-resolution in 3D, a scale that is challenging to attain due to cubic memory scaling. For training, we utilized segmented 3D low-resolution micro-CT images along with unpaired segmented complementary 2D high-resolution laser scanning microscope images. Post-training, resolution improved from 7 to 0.44 µm/voxel with accurate segmentation of constituent minerals. Validated on Berea sandstone, this framework demonstrates substantial improvements in pore characterization and mineral differentiation, offering a robust solution to one of the primary computational limitations in modern geoscientific imaging.
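
The numbers above can be checked with a short calculation: dividing the 7 µm voxel size by the 16x factor gives the reported 0.44 µm (after rounding), and the cubic scaling mentioned in the abstract means a dense grid needs 16^3 times as many voxels. The function name is illustrative only.

```python
def super_resolved(voxel_size_um, factor):
    """Return the new voxel size and the growth in voxel count for a dense 3D grid."""
    return voxel_size_um / factor, factor ** 3

new_size_um, voxel_multiplier = super_resolved(7.0, 16)
print(new_size_um, voxel_multiplier)
```

A dense volume at the target resolution therefore holds 4096 times as many voxels as the input, which is the memory pressure that a sparse octree representation is meant to relieve.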

[573] MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation

Chenglong Ma,Yuanfeng Ji,Jin Ye,Zilong Li,Chenhui Wang,Junzhi Ning,Wei Li,Lihao Liu,Qiushan Guo,Tianbin Li,Junjun He,Hongming Shan

Main category: eess.IV

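TL;DR: MedITok is a unified tokenizer designed for medical images; a two-stage training framework lets it encode low-level structural detail and high-level clinical semantics together, and it performs strongly across many datasets and tasks.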

Details

Motivation: The potential of autoregressive models in medical imaging remains largely untapped because there is no unified visual tokenizer that captures both fine structure and clinical semantics. Method: MedITok uses a two-stage training framework, a visual representation alignment stage followed by a textual semantic representation alignment stage, and is trained on a large-scale dataset. Result: It achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 tasks. Conclusion: MedITok provides a unified token space for autoregressive modeling and supports a wide range of clinical diagnostic and generative healthcare applications. Abstract: Advanced autoregressive models have reshaped multimodal AI. However, their transformative potential in medical imaging remains largely untapped due to the absence of a unified visual tokenizer -- one capable of capturing fine-grained visual structures for faithful image reconstruction and realistic image synthesis, as well as rich semantics for accurate diagnosis and image interpretation. To this end, we present MedITok, the first unified tokenizer tailored for medical images, encoding both low-level structural details and high-level clinical semantics within a unified latent space. To balance these competing objectives, we introduce a novel two-stage training framework: a visual representation alignment stage that cold-starts the tokenizer reconstruction learning with a visual semantic constraint, followed by a textual semantic representation alignment stage that infuses detailed clinical semantics into the latent space. Trained on a meticulously collected large-scale dataset with over 30 million medical images and 2 million image-caption pairs, MedITok achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 different tasks. By providing a unified token space for autoregressive modeling, MedITok supports a wide range of tasks in clinical diagnostics and generative healthcare applications. Model and code will be made publicly available at: https://github.com/Masaaki-75/meditok.

[574] A Contrastive Learning Foundation Model Based on Perfectly Aligned Sample Pairs for Remote Sensing Images

Hengtong Shen,Haiyan Gu,Haitao Li,Yi Yang,Agen Qiu

Main category: eess.IV

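TL;DR: Proposes PerA, a self-supervised learning method that produces general-purpose remote sensing image features from semantically perfectly aligned sample pairs, making contrastive learning better suited to remote sensing.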

Details

Motivation: Because of the domain gap, contrastive learning methods need specific adaptation for remote sensing images; PerA is proposed to address this. Method: PerA samples views by applying spatially disjoint masks to augmented images, so that views are semantically aligned but differ in appearance, and it enforces feature consistency through a teacher-student model and learnable mask tokens. Result: Experiments show that PerA performs comparably to state-of-the-art methods on multiple downstream tasks, with higher memory efficiency and the ability to train with larger batches. Conclusion: PerA offers an efficient self-supervised learning method for remote sensing image interpretation with practical value. Abstract: Self-Supervised Learning (SSL) enables us to pre-train foundation models without costly labeled data. Among SSL methods, Contrastive Learning (CL) methods are better at obtaining accurate semantic representations under noise interference. However, due to the significant domain gap, while CL methods have achieved great success in many computer vision tasks, they still require specific adaptation for Remote Sensing (RS) images. To this end, we present a novel self-supervised method called PerA, which produces all-purpose RS features through semantically Perfectly Aligned sample pairs. Specifically, PerA obtains features from sampled views by applying spatially disjoint masks to augmented images rather than random cropping. With disjoint masks, we divide patches from different views into different parts that are semantically aligned but inconsistent in appearance. Our framework provides high-quality features by ensuring consistency between teacher and student and predicting learnable mask tokens. Compared to previous contrastive methods, our method demonstrates higher memory efficiency and can be trained with larger batches due to its sparse inputs. We also collect an unlabeled pre-training dataset, which contains about 5 million RS images. We conducted experiments on multiple downstream task datasets and achieved performance comparable to previous state-of-the-art methods with a limited model scale, which verified the superiority of our method. We hope this work will contribute to practical remote sensing interpretation work.
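
One simple way to obtain two spatially disjoint masks over the same image is to shuffle the patch indices and give each view a separate slice. This is only a sketch of the general idea: the patch count, the number of visible patches per view, and the uniform random sampling are assumptions, and the abstract does not state how PerA actually samples its masks.

```python
import random

def disjoint_masks(num_patches, keep_per_view, seed=0):
    """Return two non-overlapping lists of visible patch indices."""
    if 2 * keep_per_view > num_patches:
        raise ValueError("each view can keep at most num_patches // 2 patches")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    view_a = sorted(indices[:keep_per_view])
    view_b = sorted(indices[keep_per_view:2 * keep_per_view])
    return view_a, view_b

# Hypothetical 14 x 14 patch grid with 49 visible patches per view.
view_a, view_b = disjoint_masks(num_patches=196, keep_per_view=49)
print(len(view_a), len(view_b), set(view_a) & set(view_b))
```

Both views come from the same image content, so they stay semantically aligned, while the disjoint patch sets guarantee that no pixels are shared; keeping only a subset of patches per view is also what makes the inputs sparse.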

[575] Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models

Mobina Mansoori,Sajjad Shahabodini,Farnoush Bayatmakou,Jamshid Abouei,Konstantinos N. Plataniotis,Arash Mohammadi

Main category: eess.IV

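TL;DR: The study examines recent foundation models (such as DINOv2 and MAE) for medical image classification and finds that they significantly improve classification results, performing well even with limited data.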

Details

Motivation: To analyze the potential of foundation models in the medical field and assess how much they improve medical image classification. Method: Several recent foundation models (such as DINOv2 and MAE) are fine-tuned and evaluated on classification across multiple medical image datasets (such as CBIS-DDSM and ISIC2019). Result: AIMv2, DINOv2, and SAM2 perform best, indicating that progress in natural domain training benefits the medical domain. Conclusion: Foundation models show significant potential for medical image classification, performing well even with limited data. Abstract: Foundation models are large-scale models pre-trained on massive datasets that perform a wide range of tasks. These models have shown consistently improved results with the introduction of new methods. It is crucial to analyze how these trends impact the medical field and determine whether these advancements can drive meaningful change. This study investigates the application of recent state-of-the-art foundation models, DINOv2, MAE, VMamba, CoCa, SAM2, and AIMv2, for medical image classification. We explore their effectiveness on datasets including CBIS-DDSM for mammography, ISIC2019 for skin lesions, APTOS2019 for diabetic retinopathy, and CHEXPERT for chest radiographs. By fine-tuning these models and evaluating their configurations, we aim to understand the potential of these advancements in medical image classification. The results indicate that these advanced models significantly enhance classification outcomes, demonstrating robust performance despite limited labeled data. Based on our results, AIMv2, DINOv2, and SAM2 models outperformed others, demonstrating that progress in natural domain training has positively impacted the medical domain and improved classification outcomes. Our code is publicly available at: https://github.com/sajjad-sh33/Medical-Transfer-Learning.

[576] Improvement Strategies for Few-Shot Learning in OCT Image Classification of Rare Retinal Diseases

Cheng-Yu Tai,Ching-Wen Chen,Chi-Chin Wu,Bo-Chen Chiu,Cheng-Hung Lin,Cheng-Kai Lu,Jia-Kang Wang,Tzu-Lun Huang

Main category: eess.IV

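TL;DR: The paper uses few-shot learning to improve classification accuracy on OCT diagnostic images, combining GAN-based augmentation with new methods; the final model reaches 97.85% accuracy.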

Details

Motivation: To improve classification accuracy for major and rare classes in OCT diagnostic images, in particular the bias caused by data imbalance. Method: GAN-based augmentation serves as the baseline; U-GAT-IT is introduced to improve the generative part, combined with a data balancing technique, the CBAM attention mechanism, and a fine-tuned InceptionV3. Result: The best model reaches an overall accuracy of 97.85%, significantly better than the baseline. Conclusion: The proposed methods effectively improve OCT image classification accuracy, especially under data imbalance. Abstract: This paper focuses on using few-shot learning to improve the accuracy of classifying OCT diagnostic images with major and rare classes. We used the GAN-based augmentation strategy as a baseline and introduced several novel methods to further enhance our model. The proposed strategy uses U-GAT-IT to improve the generative part and a data balancing technique to reduce the disparity in accuracy across categories. The best model obtained was built with the CBAM attention mechanism and a fine-tuned InceptionV3, and achieved an overall accuracy of 97.85%, representing a significant improvement over the original baseline.

cs.HC [Back]

Ugur Kursuncu,Trilok Padhi,Gaurav Sinha,Abdulkadir Erol,Jaya Krishna Mandivarapu,Christopher R. Larrison

Main category: cs.HC

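TL;DR: The study evaluates the potential utility of large language models (GPT and Llama) for anxiety support and finds that fine-tuning improves linguistic quality but increases toxicity and bias; GPT is more supportive overall.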

Details

Motivation: Growing demand for mental health support, together with resource shortages, has prompted interest in LLMs for real-time assistance, yet their use in sensitive domains remains underexamined. Method: Real user posts from the r/Anxiety subreddit are used for prompting and fine-tuning, and a mixed-method evaluation assesses linguistic quality, safety, and supportiveness. Result: Fine-tuning improves linguistic quality but increases toxicity and bias and reduces emotional responsiveness; GPT is more supportive overall. Conclusion: Fine-tuning LLMs on unprocessed social media content is risky and requires mitigation strategies. Abstract: The growing demand for accessible mental health support, compounded by workforce shortages and logistical barriers, has led to increased interest in utilizing Large Language Models (LLMs) for scalable and real-time assistance. However, their use in sensitive domains such as anxiety support remains underexamined. This study presents a systematic evaluation of LLMs (GPT and Llama) for their potential utility in anxiety support by using real user-generated posts from the r/Anxiety subreddit for both prompting and fine-tuning. Our approach utilizes a mixed-method evaluation framework incorporating three main categories of criteria: (i) linguistic quality, (ii) safety and trustworthiness, and (iii) supportiveness. Results show that fine-tuning LLMs with naturalistic anxiety-related data enhanced linguistic quality but increased toxicity and bias, and diminished emotional responsiveness. While LLMs exhibited limited empathy, GPT was evaluated as more supportive overall. Our findings highlight the risks of fine-tuning LLMs on unprocessed social media content without mitigation strategies.

cs.DL [Back]

[578] SCIRGC: Multi-Granularity Citation Recommendation and Citation Sentence Preference Alignment

Xiangyu Li,Jingqiang Chen

Main category: cs.DL

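TL;DR: The SciRGC framework automatically recommends papers to cite and generates citation sentences, addressing two challenges in academic citation generation: accurately identifying citation intent and producing high-quality citation sentences.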

Details

Motivation: Citing is time-consuming for researchers; the aim is to improve the accuracy of citation recommendation and the quality of citation sentences. Method: Citation networks and sentiment intent are combined to improve recommendation accuracy, and reasoning-based citation sentences are generated from the original abstract, local context, citation intent, and recommended articles. Result: SciRGC outperforms baseline models in both citation recommendation and sentence generation, and ablation experiments confirm its effectiveness. Conclusion: SciRGC gives interdisciplinary researchers an efficient and accurate citation generation tool. Abstract: Citations are crucial in scientific research articles as they highlight the connection between the current study and prior work. However, finding and writing citations is often time-consuming for researchers. In this study, we propose the SciRGC framework, which aims to automatically recommend citation articles and generate citation sentences for citation locations within articles. The framework addresses two key challenges in academic citation generation: 1) how to accurately identify the author's citation intent and find relevant citation papers, and 2) how to generate high-quality citation sentences that align with human preferences. We enhance citation recommendation accuracy in the citation article recommendation module by incorporating citation networks and sentiment intent, and generate reasoning-based citation sentences in the citation sentence generation module by using the original article abstract, local context, citation intent, and recommended articles as inputs. Additionally, we propose a new evaluation metric to fairly assess the quality of generated citation sentences. Through comparisons with baseline models and ablation experiments, the SciRGC framework not only improves the accuracy and relevance of citation recommendations but also ensures the appropriateness of the generated citation sentences in context, providing a valuable tool for interdisciplinary researchers.

cs.IT [Back]

[579] ICDM: Interference Cancellation Diffusion Models for Wireless Semantic Communications

Tong Wu,Zhiyong Chen,Dazhi He,Feng Yang,Meixia Tao,Xiaodong Xu,Wenjun Zhang,Ping Zhang

Main category: cs.IT

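TL;DR: The paper proposes an interference cancellation method based on diffusion models (ICDM) for wireless semantic communication systems, which significantly reduces mean square error and improves perceptual quality.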

Details

Motivation: The broadcast nature of wireless signals exposes them to Gaussian noise and unknown interference, which raises the question of whether the denoising ability of diffusion models can also cancel interference. Method: Interference cancellation is modeled as a maximum a posteriori problem over the joint posterior of the signal and interference; ICDM decomposes the joint posterior into independent priors for the signal and interference plus the channel transition probability, and diffusion models learn the log-gradients. Result: Under the Rayleigh fading channel, ICDM reduces MSE by 4.54 dB and improves LPIPS by 2.47 dB. Conclusion: ICDM offers an efficient solution for interference cancellation in wireless communications with practical potential. Abstract: Diffusion models (DMs) have recently achieved significant success in wireless communications systems due to their denoising capabilities. The broadcast nature of wireless signals makes them susceptible not only to Gaussian noise, but also to unknown interference. This raises the question of whether DMs can effectively mitigate interference in wireless semantic communication systems. In this paper, we model the interference cancellation problem as a maximum a posteriori (MAP) problem over the joint posterior probability of the signal and interference, and theoretically prove that the solution provides excellent estimates for the signal and interference. To solve this problem, we develop an interference cancellation diffusion model (ICDM), which decomposes the joint posterior into independent prior probabilities of the signal and interference, along with the channel transition probability. The log-gradients of these distributions at each time step are learned separately by DMs and accurately estimated through derivation. ICDM further integrates these gradients with an advanced numerical iteration method, achieving accurate and rapid interference cancellation. Extensive experiments demonstrate that ICDM significantly reduces the mean square error (MSE) and enhances perceptual quality compared to schemes without ICDM. For example, on the CelebA dataset under the Rayleigh fading channel with a signal-to-noise ratio (SNR) of $20$ dB and signal-to-interference-plus-noise ratio (SINR) of 0 dB, ICDM reduces the MSE by 4.54 dB and improves the learned perceptual image patch similarity (LPIPS) by 2.47 dB.
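
To put the decibel figures above in linear terms, the short calculation below converts a dB change into a ratio. It assumes the reported values follow the usual 10·log10 convention for power-like quantities such as MSE; the abstract does not spell out the convention.

```python
def db_to_ratio(db):
    """Convert a decibel change to a linear ratio (10 * log10 convention)."""
    return 10 ** (db / 10)

mse_factor = db_to_ratio(4.54)        # factor by which the MSE shrinks
remaining_fraction = 1 / mse_factor   # MSE left relative to the baseline
print(round(mse_factor, 2), round(remaining_fraction, 2))
```

Under that assumption, a 4.54 dB reduction means the MSE falls to roughly a third of its value without ICDM.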

cs.RO [Back]

[580] MaskedManipulator: Versatile Whole-Body Control for Loco-Manipulation

Chen Tessler,Yifeng Jiang,Erwin Coumans,Zhengyi Luo,Gal Chechik,Xue Bin Peng

Main category: cs.RO

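TL;DR: MaskedManipulator is a unified generative policy developed through a two-stage learning approach; it targets high-level whole-body manipulation goals and addresses the lack of goal-driven control in current physics-based animation.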

Details

Motivation: Current whole-body manipulation methods in physics-based animation work well on specific tasks but lack versatility for high-level goals. MaskedManipulator aims to accomplish complex manipulation tasks from intuitive high-level objectives (such as target object poses and key character stances). Method: A two-stage learning approach: 1) a tracking controller is trained to reconstruct complex human-object interactions from large-scale motion capture data; 2) the tracking controller is distilled into MaskedManipulator, which gives users intuitive control over the character and the object. Result: MaskedManipulator synthesizes full-body actions from user-specified high-level objectives (such as target object poses) to accomplish complex manipulation tasks. Conclusion: MaskedManipulator opens new possibilities for interactive and lifelike virtual characters and fills the gap in goal-driven control in current physics-based animation. Abstract: Humans interact with their world while leveraging precise full-body control to achieve versatile goals. This versatility allows them to solve long-horizon, underspecified problems, such as placing a cup in a sink, by seamlessly sequencing actions like approaching the cup, grasping, transporting it, and finally placing it in the sink. Such goal-driven control can enable new procedural tools for animation systems, enabling users to define partial objectives while the system naturally ``fills in'' the intermediate motions. However, while current methods for whole-body dexterous manipulation in physics-based animation achieve success in specific interaction tasks, they typically employ control paradigms (e.g., detailed kinematic motion tracking, continuous object trajectory following, or direct VR teleoperation) that offer limited versatility for high-level goal specification across the entire coupled human-object system. To bridge this gap, we present MaskedManipulator, a unified and generative policy developed through a two-stage learning approach. First, our system trains a tracking controller to physically reconstruct complex human-object interactions from large-scale human mocap datasets. This tracking controller is then distilled into MaskedManipulator, which provides users with intuitive control over both the character's body and the manipulated object. As a result, MaskedManipulator enables users to specify complex loco-manipulation tasks through intuitive high-level objectives (e.g., target object poses, key character stances), and MaskedManipulator then synthesizes the necessary full-body actions for a physically simulated humanoid to achieve these goals, paving the way for more interactive and life-like virtual characters.

[581] From Single Images to Motion Policies via Video-Generation Environment Representations

Weiming Zhi,Ziyong Ma,Tianyi Zhang,Matthew Johnson-Roberson

Main category: cs.RO

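TL;DR: The paper proposes VGER, a framework that builds an environment representation from a single RGB image and trains a motion generation model for collision-free motion.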

Details

Motivation: Autonomous robots need to build representations of their environment and adapt to its geometry, but existing monocular depth estimation methods introduce errors that make their outputs hard to use directly for motion generation. Method: A large-scale video generation model produces a multiview video, a 3D foundation model turns it into a dense point cloud, and an implicit representation and a motion generation model are then trained. Result: VGER performs well across diverse indoor and outdoor environments, producing smooth, geometry-aware motions from a single RGB image. Conclusion: The VGER framework effectively addresses the challenge of going from a monocular image to motion generation, offering a practical solution for autonomous robots. Abstract: Autonomous robots typically need to construct representations of their surroundings and adapt their motions to the geometry of their environment. Here, we tackle the problem of constructing a policy model for collision-free motion generation, consistent with the environment, from a single input RGB image. Extracting 3D structures from a single image often involves monocular depth estimation. Developments in depth estimation have given rise to large pre-trained models such as DepthAnything. However, using outputs of these models for downstream motion generation is challenging due to frustum-shaped errors that arise. Instead, we propose a framework known as Video-Generation Environment Representation (VGER), which leverages the advances of large-scale video generation models to generate a moving camera video conditioned on the input image. Frames of this video, which form a multiview dataset, are then input into a pre-trained 3D foundation model to produce a dense point cloud. We then introduce a multi-scale noise approach to train an implicit representation of the environment structure and build a motion generation model that complies with the geometry of the representation. We extensively evaluate VGER over a diverse set of indoor and outdoor environments. We demonstrate its ability to produce smooth motions that account for the captured geometry of a scene, all from a single RGB input image.

[582] Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning

Junlin Wang,Zhiyun Lin

Main category: cs.RO

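TL;DR: The paper proposes ICon, a contrastive learning method that improves visual representation learning for robotic manipulation by separating agent and environment features, which in turn improves policy learning.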

Details

Motivation: In robotic manipulation, complex body dynamics make visual representation learning challenging; body-relevant inductive biases need to be embedded to make learning more efficient. Method: ICon is a contrastive learning method built on Vision Transformers that separates the agent and environment feature spaces to produce agent-centric visual representations, and is integrated into end-to-end policy learning as an auxiliary objective. Result: Experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. Conclusion: By embedding body-relevant inductive biases, ICon improves both visual representation learning and policy performance for robotic manipulation. Abstract: Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present Inter-token Contrast (ICon), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. The project website: https://github.com/HenryWJL/icon

[583] WorldEval: World Model as Real-World Robot Policies Evaluator

Yaxuan Li,Yichen Zhu,Junjie Wen,Chaomin Shen,Yi Xu

Main category: cs.RO

TL;DR: The paper proposes using world models as a proxy for evaluating robot policies, achieving efficient and scalable evaluation through Policy2Vec and WorldEval.

Details

Motivation: Evaluating generalist robot policies in real-world scenarios is time-consuming and difficult, especially as the number of tasks grows and environments change. Method: Propose Policy2Vec, which turns a video generation model into a world simulator, and design WorldEval, an automated evaluation pipeline. Result: WorldEval scores correlate strongly with real-world policy performance and the method outperforms existing approaches such as real-to-sim. Conclusion: World models can serve as a reliable proxy for robot policy evaluation, improving evaluation efficiency and safety. Abstract: The field of robotics has made significant strides toward developing generalist robot manipulation policies. However, evaluating these policies in real-world scenarios remains time-consuming and challenging, particularly as the number of tasks scales and environmental conditions change. In this work, we demonstrate that world models can serve as a scalable, reproducible, and reliable proxy for real-world robot policy evaluation. A key challenge is generating accurate policy videos from world models that faithfully reflect the robot actions. We observe that directly inputting robot actions or using high-dimensional encoding methods often fails to generate action-following videos. To address this, we propose Policy2Vec, a simple yet effective approach to turn a video generation model into a world simulator that follows latent actions to generate the robot video. We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online. WorldEval effectively ranks various robot policies and individual checkpoints within a single policy, and functions as a safety detector to prevent dangerous actions by newly developed robot models. Through comprehensive paired evaluations of manipulation policies in real-world environments, we demonstrate a strong correlation between policy performance in WorldEval and real-world scenarios. Furthermore, our method significantly outperforms popular methods such as real-to-sim approaches.
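
The claim of a "strong correlation" between proxy and real-world performance can be checked with a standard correlation coefficient. The sketch below computes Pearson's r between per-policy success rates; the abstract does not say which statistic the authors report, and the numbers are made-up placeholders rather than results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical success rates for four policies (illustrative values only).
world_model_success = [0.2, 0.5, 0.7, 0.9]
real_world_success = [0.1, 0.4, 0.8, 0.9]

r = pearson(world_model_success, real_world_success)
```

A value of r close to 1 would indicate that ranking policies in the world model preserves their real-world ordering; a rank correlation such as Spearman's could be substituted if only the ordering matters.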

math.AG

[584] Tropical Geometry Based Edge Detection Using Min-Plus and Max-Plus Algebra

Shivam Kumar Jha S,Jaya NN Iyer

Main category: math.AG

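TL;DR: Proposes a tropical-geometry-based edge detection framework that reformulates convolution and gradient computation with min-plus and max-plus algebra, emphasizing dominant intensity variations to yield sharper, more continuous edges.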

Details

Motivation: Traditional edge detectors perform poorly in low-contrast and textured regions; tropical algebra offers a noise-aware and scalable alternative. Method: Three variants are proposed: an adaptive threshold-based method, a multi-kernel min-plus method, and a max-plus method emphasizing structural continuity, combined with multi-scale processing, Hessian filtering, and wavelet shrinkage. Result: Experiments on MATLAB built-in images suggest that combining tropical formulations with classical operators (such as Canny and LoG) can improve boundary detection in low-contrast and textured regions. Conclusion: Tropical algebra shows potential for edge detection, improving edge clarity and structural coherence, and is applicable to practical image analysis tasks. Abstract: This paper proposes a tropical geometry-based edge detection framework that reformulates convolution and gradient computations using min-plus and max-plus algebra. The tropical formulation emphasizes dominant intensity variations, contributing to sharper and more continuous edge representations. Three variants are explored: an adaptive threshold-based method, a multi-kernel min-plus method, and a max-plus method emphasizing structural continuity. The framework integrates multi-scale processing, Hessian filtering, and wavelet shrinkage to enhance edge transitions while maintaining computational efficiency. Experiments on MATLAB built-in grayscale and color images suggest that tropical formulations integrated with classical operators, such as Canny and LoG, can improve boundary detection in low-contrast and textured regions. Quantitative evaluation using standard edge metrics indicates favorable edge clarity and structural coherence. These results highlight the potential of tropical algebra as a scalable and noise-aware formulation for edge detection in practical image analysis tasks.
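
To make the min-plus and max-plus reformulation concrete, here is a minimal sketch of a tropical sliding-window filter, where the usual multiply-and-sum is replaced by add-and-min (or add-and-max). The difference between the max-plus and min-plus responses is used as a simple edge measure. The kernel, the test image, and the function names are assumptions for illustration; the paper's actual kernels, thresholds, and multi-scale steps are not given in the abstract.

```python
def tropical_filter(image, kernel, reduce_op):
    """Slide `kernel` over `image`: add kernel weights, then reduce with min or max.

    With reduce_op=min this is a min-plus filter, with reduce_op=max a max-plus
    filter. Pixels outside the image are ignored. The window is applied
    correlation-style, which equals convolution for symmetric kernels.
    """
    rows, cols = len(image), len(image[0])
    half_r, half_c = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            candidates = []
            for i, kernel_row in enumerate(kernel):
                for j, weight in enumerate(kernel_row):
                    rr, cc = r + i - half_r, c + j - half_c
                    if 0 <= rr < rows and 0 <= cc < cols:
                        candidates.append(image[rr][cc] + weight)
            out[r][c] = reduce_op(candidates)
    return out

def tropical_gradient(image, kernel):
    """Max-plus response minus min-plus response: large where intensity changes."""
    low = tropical_filter(image, kernel, min)
    high = tropical_filter(image, kernel, max)
    return [[h - l for h, l in zip(high_row, low_row)]
            for high_row, low_row in zip(high, low)]

# Tiny synthetic image with a vertical step edge between columns 2 and 3.
image = [
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
]
flat_kernel = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]  # hypothetical all-zero kernel
edges = tropical_gradient(image, flat_kernel)
```

With an all-zero kernel the two filters reduce to grayscale erosion and dilation, so the result is the classical morphological gradient. Non-zero kernel weights would bias the response toward particular directions or scales.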

cs.IR

[585] News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation

Andreea Iana,Fabian David Schmidt,Goran Glavaš,Heiko Paulheim

Main category: cs.IR

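TL;DR: The paper proposes a news-adapted multilingual sentence encoder (NaSE), obtained by domain-specializing a pretrained multilingual sentence encoder. It addresses the performance drop of news recommenders in zero-shot cross-lingual transfer (ZS-XLT) and performs strongly in cold-start and few-shot recommendation.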

Details

Motivation: The number of multilingual news consumers is growing rapidly; existing neural news recommenders lose substantial performance in zero-shot cross-lingual transfer, and conventional fine-tuning is computationally expensive and infeasible in data-scarce or cold-start setups. Method: Construct and leverage two multilingual news-specific corpora, PolyNews and PolyNewsParallel, to build the news-adapted sentence encoder (NaSE), and test its effectiveness without supervised fine-tuning. Result: NaSE achieves state-of-the-art performance in zero-shot cross-lingual transfer for cold-start and few-shot news recommendation. Conclusion: NaSE provides a simple yet strong baseline that performs well on news recommendation without supervised fine-tuning. Abstract: Rapidly growing numbers of multilingual news consumers pose an increasing challenge to news recommender systems in terms of providing customized recommendations. First, existing neural news recommenders, even when powered by multilingual language models (LMs), suffer substantial performance losses in zero-shot cross-lingual transfer (ZS-XLT). Second, the current paradigm of fine-tuning the backbone LM of a neural recommender on task-specific data is computationally expensive and infeasible in few-shot recommendation and cold-start setups, where data is scarce or completely unavailable. In this work, we propose a news-adapted sentence encoder (NaSE), domain-specialized from a pretrained massively multilingual sentence encoder (SE). To this end, we construct and leverage PolyNews and PolyNewsParallel, two multilingual news-specific corpora. With the news-adapted multilingual SE in place, we test the effectiveness of (i.e., question the need for) supervised fine-tuning for news recommendation, and propose a simple and strong baseline based on (i) frozen NaSE embeddings and (ii) late click-behavior fusion. We show that NaSE achieves state-of-the-art performance in ZS-XLT in true cold-start and few-shot news recommendation.
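
The baseline of frozen embeddings plus late click-behavior fusion can be sketched as follows: instead of training a user encoder, each candidate article is scored against every previously clicked article and the scores are averaged. This is one common reading of "late fusion" and is an assumption here; the vectors are toy stand-ins for frozen sentence embeddings, and all names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def late_fusion_score(candidate, clicked):
    """Mean similarity between one candidate and each clicked-news embedding."""
    if not clicked:
        return 0.0  # true cold start: no click history to fuse
    return sum(cosine(candidate, c) for c in clicked) / len(clicked)

def rank_candidates(candidates, clicked):
    """Return candidate ids sorted by fused score, best first."""
    return sorted(candidates,
                  key=lambda name: late_fusion_score(candidates[name], clicked),
                  reverse=True)

# Toy 2-d "embeddings"; a real system would use frozen encoder outputs.
clicked_history = [[1.0, 0.0], [0.8, 0.6]]
candidates = {"sports": [1.0, 0.1], "cooking": [0.0, 1.0]}
ranking = rank_candidates(candidates, clicked_history)
```

Because nothing is trained, the same scoring works unchanged for any language the encoder covers, which is what makes this kind of baseline attractive in zero-shot cross-lingual settings.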

[586] Walk&Retrieve: Simple Yet Effective Zero-shot Retrieval-Augmented Generation via Knowledge Graph Walks

Martin Böckling,Heiko Paulheim,Andreea Iana

Main category: cs.IR

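TL;DR: Walk&Retrieve is a lightweight knowledge-graph-based retrieval-augmented generation framework that improves zero-shot RAG through graph walks and knowledge verbalization, adapting to dynamic knowledge graph updates without fine-tuning.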

Details

Motivation: Address LLM hallucinations and outdated knowledge, as well as the difficulties existing KG-based RAG methods face in representation alignment, retrieval efficiency, and adaptation to dynamic updates. Method: Use walk-based graph traversal and knowledge verbalization to generate a corpus for zero-shot RAG, with no fine-tuning on domain-specific data. Result: Often outperforms existing RAG systems in response accuracy and hallucination reduction, with low query latency and strong scalability. Conclusion: Walk&Retrieve shows the potential of lightweight retrieval strategies for RAG research and provides a strong baseline for future work. Abstract: Large Language Models (LLMs) have showcased impressive reasoning abilities, but often suffer from hallucinations or outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) remedies these shortcomings by grounding LLM responses in structured external information from a knowledge base. However, many KG-based RAG approaches struggle with (i) aligning KG and textual representations, (ii) balancing retrieval accuracy and efficiency, and (iii) adapting to dynamically updated KGs. In this work, we introduce Walk&Retrieve, a simple yet effective KG-based framework that leverages walk-based graph traversal and knowledge verbalization for corpus generation for zero-shot RAG. Built around efficient KG walks, our method does not require fine-tuning on domain-specific data, enabling seamless adaptation to KG updates, reducing computational overhead, and allowing integration with any off-the-shelf backbone LLM. Despite its simplicity, Walk&Retrieve performs competitively, often outperforming existing RAG systems in response accuracy and hallucination reduction. Moreover, it demonstrates lower query latency and robust scalability to large KGs, highlighting the potential of lightweight retrieval strategies as strong baselines for future RAG research.
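
The idea of walk-based traversal followed by verbalization can be illustrated with a small sketch: enumerate short walks from an entity and turn each walk into a plain sentence that a retriever can index. The graph, the exhaustive walk strategy, and the sentence template are assumptions for illustration; the paper's walk sampling and verbalization details are not given in the abstract.

```python
def enumerate_walks(graph, start, max_hops):
    """Return every cycle-free walk of 1..max_hops edges starting at `start`.

    `graph` maps a head entity to a list of (relation, tail) pairs; each walk
    is a list of (head, relation, tail) triples.
    """
    walks = []
    frontier = [(start, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            visited = {start} | {tail for _, _, tail in path}
            for relation, tail in graph.get(node, []):
                if tail in visited:
                    continue  # do not revisit entities within one walk
                new_path = path + [(node, relation, tail)]
                walks.append(new_path)
                next_frontier.append((tail, new_path))
        frontier = next_frontier
    return walks

def verbalize(walk):
    """Turn a walk into text using a naive 'head relation tail.' template."""
    return " ".join(f"{head} {relation.replace('_', ' ')} {tail}."
                    for head, relation, tail in walk)

# Hypothetical toy knowledge graph.
graph = {
    "Marie Curie": [("born_in", "Warsaw"), ("field", "physics")],
    "Warsaw": [("capital_of", "Poland")],
}
corpus = [verbalize(w) for w in enumerate_walks(graph, "Marie Curie", max_hops=2)]
```

Since the corpus is regenerated directly from the graph, an updated KG only requires re-running the walks for affected entities, with no model retraining.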

[587] Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems

Hansa Meghwani,Amit Agarwal,Priyaranjan Pattnayak,Hitesh Laxmichand Patel,Srikant Panda

Main category: cs.IR

TL;DR: Proposes a scalable hard-negative mining framework for enterprise domain data that substantially improves retrieval performance.

Details

Motivation: Enterprise search systems suffer degraded retrieval performance due to semantic mismatches and overlapping terminology, which affects downstream applications. Method: Combine multiple embedding models, apply dimensionality reduction, and dynamically select semantically challenging but contextually irrelevant documents as hard negatives. Result: On a proprietary enterprise dataset, MRR@3 and MRR@10 improve by 15% and 19%, respectively; generalization is validated on public datasets. Conclusion: The method is efficient and semantically precise, and suited to real-world applications. Abstract: Enterprise search systems often struggle to retrieve accurate, domain-specific information due to semantic mismatches and overlapping terminologies. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored specifically for domain-specific enterprise data. Our approach dynamically selects semantically challenging but contextually irrelevant documents to enhance deployed re-ranking models. Our method integrates diverse embedding models, performs dimensionality reduction, and uniquely selects hard negatives, ensuring computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) demonstrates substantial improvements of 15% in MRR@3 and 19% in MRR@10 compared to state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate Fever, TechQA) confirms our method's generalizability and readiness for real-world applications.
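
As a rough illustration of what "semantically challenging but contextually irrelevant" means in practice, the sketch below picks the non-relevant documents that sit closest to the query in embedding space. This is a generic similarity-based heuristic, not the paper's selection criterion, which also involves multiple embedding models and dimensionality reduction; the vectors and ids are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k):
    """Return the k non-positive documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()
              if doc_id not in positive_ids]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy 2-d embeddings; "d_pos" is the labeled relevant document.
query = [1.0, 0.0]
docs = {
    "d_pos": [0.9, 0.1],
    "d_near": [0.8, 0.3],
    "d_mid": [0.5, 0.5],
    "d_far": [0.0, 1.0],
}
hard = mine_hard_negatives(query, docs, positive_ids={"d_pos"}, k=2)
```

Negatives chosen this way are harder for a re-ranker to separate from positives than random ones, which is the usual motivation for mining them; unlabeled true positives can slip in, so practical pipelines add filtering.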

[588] AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking

Soyoung Yoon,Gyuwan Kim,Gyu-Hwung Cho,Seung-won Hwang

Main category: cs.IR

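TL;DR: AcuRank is an adaptive reranking framework that dynamically adjusts the amount and target of computation based on uncertainty estimates over document relevance, improving retrieval efficiency.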

Details

Motivation: Fixed-computation reranking ignores query difficulty and document distribution, leading to inefficiency. Method: Use a Bayesian TrueSkill model to iteratively refine relevance estimates until sufficient confidence is reached, explicitly modeling ranking uncertainty. Result: On the TREC-DL and BEIR benchmarks, AcuRank achieves a better accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. Conclusion: AcuRank is effective and generalizes across diverse retrieval tasks and LLM-based reranking models. Abstract: Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Due to the limit in context size and high inference cost of long context, reranking is typically performed over a fixed size of small subsets, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, leading to inefficiencies. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Using a Bayesian TrueSkill model, we iteratively refine relevance estimates until reaching sufficient confidence levels, and our explicit modeling of ranking uncertainty enables principled control over reranking behavior and avoids unnecessary updates to confident predictions. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. These results highlight the effectiveness and generalizability of our method across diverse retrieval tasks and LLM-based reranking models.
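
The idea of spending compute only where the ranking is uncertain can be sketched with Gaussian relevance beliefs. Below, each document carries a mean and standard deviation, and only documents whose probability of exceeding a cutoff is far from both 0 and 1 are sent back for another reranking pass. This is a simplified stand-in, not the TrueSkill update itself; the cutoff, the margin, and the belief values are assumptions.

```python
import math

def prob_above(mu, sigma, threshold):
    """P(score > threshold) for score ~ Normal(mu, sigma^2)."""
    z = (mu - threshold) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def uncertain_documents(beliefs, threshold, margin=0.1):
    """Ids of documents whose top-set membership is still uncertain.

    A document is treated as settled when its probability of beating the
    threshold is below `margin` or above `1 - margin`.
    """
    selected = []
    for doc_id, (mu, sigma) in beliefs.items():
        p = prob_above(mu, sigma, threshold)
        if margin < p < 1.0 - margin:
            selected.append(doc_id)
    return selected

# Hypothetical (mean, std) relevance beliefs after a first reranking pass.
beliefs = {
    "d1": (30.0, 2.0),  # clearly above the cutoff
    "d2": (25.5, 4.0),  # near the cutoff and noisy
    "d3": (24.0, 4.0),  # near the cutoff and noisy
    "d4": (15.0, 2.0),  # clearly below the cutoff
}
to_rerank = uncertain_documents(beliefs, threshold=25.0)
```

Repeating the select-then-rerank loop until the uncertain set is empty, or a budget is hit, gives a computation cost that adapts to how ambiguous each query's candidates are.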

[589] Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval

Kidist Amde Mekonnen,Yosef Worku Alemneh,Maarten de Rijke

Main category: cs.IR

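TL;DR: The paper proposes dense retrieval models for Amharic, based on pre-trained Amharic BERT and RoBERTa, that substantially improve retrieval effectiveness, and analyzes the challenges of low-resource languages.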

Details

Motivation: Investigate neural retrieval for low-resource, morphologically rich languages such as Amharic, filling a gap in existing research. Method: Introduce dense retrieval models built on pre-trained Amharic BERT and RoBERTa, including RoBERTa-Base-Amharic-Embed and more compact variants, plus a ColBERT model. Result: RoBERTa-Base-Amharic-Embed achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10; the ColBERT model reaches the highest MRR@10 (0.843). Conclusion: Language-specific adaptation is crucial for low-resource retrieval; the work releases a dataset, code, and models for future low-resource IR research. Abstract: Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.
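
For readers unfamiliar with the reported metrics, the following sketch shows how MRR@k, Recall@k, and a relative improvement are typically computed. The rankings and relevance labels are toy data, not figures from the paper, and the function names are illustrative.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for query_id, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant[query_id]:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

def recall_at_k(rankings, relevant, k=10):
    """Mean fraction of each query's relevant documents found in the top k."""
    total = 0.0
    for query_id, ranked in rankings.items():
        hits = len(set(ranked[:k]) & relevant[query_id])
        total += hits / len(relevant[query_id])
    return total / len(rankings)

def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0

# Toy system output and ground truth for two queries.
rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d2", "d5"}}

mrr = mrr_at_k(rankings, relevant, k=3)        # (1/2 + 1/1) / 2
recall = recall_at_k(rankings, relevant, k=3)  # (1/1 + 1/2) / 2
gain = relative_improvement(0.6, 0.5)          # hypothetical scores
```

Note that a relative improvement is a ratio of scores, so a 17.6% relative gain in MRR@10 does not mean the absolute score rose by 0.176.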

[590] REARANK: Reasoning Re-ranking Agent via Reinforcement Learning

Le Zhang,Bo Wang,Xipeng Qiu,Siva Reddy,Aishwarya Agrawal

Main category: cs.IR

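TL;DR: REARANK is an LLM-based listwise reasoning reranking agent that reasons explicitly to improve performance and interpretability. With only a small number of annotated samples, it surpasses baseline models on information retrieval tasks and performs close to GPT-4.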

Details

Motivation: Improve reranking performance and interpretability in information retrieval while reducing reliance on large amounts of annotated data. Method: Combine reinforcement learning and data augmentation to build REARANK-7B on top of Qwen2.5-7B, which reranks through explicit reasoning. Result: Performs strongly on multiple benchmarks, comparable to GPT-4, and surpasses GPT-4 on the reasoning-intensive BRIGHT benchmarks. Conclusion: Reinforcement learning can effectively enhance LLM reasoning capabilities, and REARANK shows clear advantages in reranking. Abstract: We present REARANK, a large language model (LLM)-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, significantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.

astro-ph.GA

[591] RGC-Bent: A Novel Dataset for Bent Radio Galaxy Classification

Mir Sazzat Hossain,Khan Muhammad Bin Asad,Payaswini Saikia,Adrita Khan,Md Akil Raihan Iftee,Rakibul Hasan Rajib,Arshad Momen,Md Ashraful Amin,Amin Ahsan Ali,AKM Mahbubur Rahman

Main category: astro-ph.GA

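TL;DR: The paper introduces a new machine learning dataset designed for classifying bent radio active galactic nuclei (AGN) and evaluates several deep learning models on it.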

Details

Motivation: Classifying bent radio AGN is challenging because specialized datasets and benchmarks are scarce, yet it matters for studying galaxy cluster dynamics and AGN physics. Method: The dataset is derived from a well-recognized radio astronomy survey, supports NAT and WAT classification, and comes with detailed data processing steps. CNN and transformer-based models are evaluated. Result: ConvNeXT achieves the highest F1-scores for both NAT and WAT classification. Conclusion: By sharing the dataset and benchmarks, the authors aim to advance research on AGN classification, galaxy cluster environments, and galaxy evolution. Abstract: We introduce a novel machine learning dataset tailored for the classification of bent radio active galactic nuclei (AGN) in astronomical observations. Bent radio AGN, distinguished by their curved jet structures, provide critical insights into galaxy cluster dynamics, interactions within the intracluster medium, and the broader physics of AGN. Despite their astrophysical significance, the classification of bent radio AGN remains a challenge due to the scarcity of specialized datasets and benchmarks. To address this, we present a dataset, derived from a well-recognized radio astronomy survey, that is designed to support the classification of NAT (Narrow-Angle Tail) and WAT (Wide-Angle Tail) categories, along with detailed data processing steps. We further evaluate the performance of state-of-the-art deep learning models on the dataset, including Convolutional Neural Networks (CNNs) and transformer-based architectures. Our results demonstrate the effectiveness of advanced machine learning models in classifying bent radio AGN, with ConvNeXT achieving the highest F1-scores for both NAT and WAT sources. By sharing this dataset and benchmarks, we aim to facilitate the advancement of research in AGN classification, galaxy cluster environments and galaxy evolution.

stat.AP

[592] Unsupervised cell segmentation by fast Gaussian Processes

Laura Baracaldo,Blythe King,Haoran Yan,Yizi Lin,Nina Miolane,Mengyang Gu

Main category: stat.AP

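TL;DR: Proposes an unsupervised cell segmentation algorithm based on fast Gaussian processes for noisy microscopy images, requiring no parameter tuning or restrictive shape assumptions.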

Details

Motivation: Existing supervised segmentation tools depend on high-quality labelled images and shape assumptions, making them hard to apply to new types of objects. Method: Adaptive thresholding criteria and watershed segmentation are used to handle images with uneven brightness and to separate touching cells. Result: Experiments on simulated and real data show the method outperforms existing methods in scalability and accuracy. Conclusion: The unsupervised method provides an efficient and general solution for cell behavior analysis. Abstract: Cell boundary information is crucial for analyzing cell behaviors from time-lapse microscopy videos. Existing supervised cell segmentation tools, such as ImageJ, require tuning various parameters and rely on restrictive assumptions about the shape of the objects. While recent supervised segmentation tools based on convolutional neural networks enhance accuracy, they depend on high-quality labelled images, making them unsuitable for segmenting new types of objects not in the database. We developed a novel unsupervised cell segmentation algorithm based on fast Gaussian processes for noisy microscopy images without the need for parameter tuning or restrictive assumptions about the shape of the object. We derived robust thresholding criteria adaptive for heterogeneous images containing distinct brightness at different parts to separate objects from the background, and employed watershed segmentation to distinguish touching cell objects. Both simulated studies and real-data analysis of large microscopy images demonstrate the scalability and accuracy of our approach compared with the alternatives.

eess.SP

[593] Evaluation in EEG Emotion Recognition: State-of-the-Art Review and Unified Framework

Natia Kukhilava,Tatia Tsmindashvili,Rapael Kalandadze,Anchit Gupta,Sofio Katamadze,François Brémond,Laura M. Ferrari,Philipp Müller,Benedikt Emanuel Wirth

Main category: eess.SP

TL;DR: The paper proposes EEGain, a unified evaluation protocol for EEG-ER that addresses the field's lack of standardized evaluation.

Details

Motivation: The EEG-ER field lacks a unified evaluation protocol, making results hard to compare and reproduce. Method: Based on an analysis of 216 papers, propose the EEGain framework, which standardizes data pre-processing, evaluation metrics, and dataset loading. Result: EEGain was validated with four publicly available methods on six commonly used datasets, improving reproducibility and comparability. Conclusion: EEGain provides a standardized tool for EEG-ER research and accelerates progress in the field. Abstract: Electroencephalography-based Emotion Recognition (EEG-ER) has become a growing research area in recent years. Analyzing 216 papers published between 2018 and 2023, we uncover that the field lacks a unified evaluation protocol, which is essential to fairly define the state of the art, compare new approaches and to track the field's progress. We report the main inconsistencies between the used evaluation protocols, which are related to ground truth definition, evaluation metric selection, data splitting types (e.g., subject-dependent or subject-independent) and the use of different datasets. Capitalizing on this state-of-the-art research, we propose a unified evaluation protocol, EEGain (https://github.com/EmotionLab/EEGain), which enables an easy and efficient evaluation of new methods and datasets. EEGain is a novel open source software framework, offering the capability to compare, and thus define, state-of-the-art results. EEGain includes standardized methods for data pre-processing, data splitting, evaluation metrics, and the ability to load the six most relevant datasets (i.e., AMIGOS, DEAP, DREAMER, MAHNOB-HCI, SEED, SEED-IV) in EEG-ER with only a single line of code. In addition, we have assessed and validated EEGain using these six datasets on the four most common publicly available methods (EEGNet, DeepConvNet, ShallowConvNet, TSception). This is a significant step to make research on EEG-ER more reproducible and comparable, thereby accelerating the overall progress of the field.
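
The distinction between subject-dependent and subject-independent splitting is a frequent source of incomparable results, so a small sketch may help. The generator below implements a leave-one-subject-out split, one common subject-independent scheme, in plain Python. It does not use the EEGain API, whose function names are not given in the abstract; the sample records are hypothetical.

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train, test) folds with no subject overlap.

    Each sample is a dict with at least a "subject" key. In every fold, all
    recordings from one subject form the test set and the rest form the
    training set.
    """
    subjects = sorted({sample["subject"] for sample in samples})
    for held_out in subjects:
        train = [s for s in samples if s["subject"] != held_out]
        test = [s for s in samples if s["subject"] == held_out]
        yield held_out, train, test

# Hypothetical trial records: subject id, trial index, emotion label.
samples = [
    {"subject": "s01", "trial": 0, "label": "high_valence"},
    {"subject": "s01", "trial": 1, "label": "low_valence"},
    {"subject": "s02", "trial": 0, "label": "high_valence"},
    {"subject": "s03", "trial": 0, "label": "low_valence"},
    {"subject": "s03", "trial": 1, "label": "high_valence"},
]
folds = list(leave_one_subject_out(samples))
```

A subject-dependent split would instead divide each subject's own trials between train and test, which usually yields higher scores because the model has already seen that person's signals.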

[594] AI-Enhanced Stethoscope in Remote Diagnostics for Cardiopulmonary Diseases

Hania Ghouse,Juveria Tanveen,Abdul Muqtadir Ahmed,Uma N. Dulhare

Main category: eess.SP

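TL;DR: The paper proposes a low-cost, AI-enabled stethoscope model that diagnoses cardiac and pulmonary diseases concurrently and is suited to under-resourced regions.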

Details

Motivation: Cardiopulmonary diseases are rising worldwide, and existing detection methods suffer from delayed diagnosis and limited resources, especially in remote areas. Method: MFCC feature extraction and a hybrid GRU-CNN model process auscultation audio signals; the model is deployed on low-cost devices. Result: The model can accurately diagnose six pulmonary and five cardiovascular diseases and generates digital audio records. Conclusion: Combining a low-cost stethoscope with an efficient AI model offers an innovative path toward standardized healthcare. Abstract: The increase in cardiac and pulmonary diseases presents an alarming and pervasive health challenge on a global scale, responsible for unexpected and premature mortalities. Despite the seriousness of these conditions, existing methods of detection and treatment encounter challenges, particularly in achieving timely diagnosis for effective medical intervention. Manual screening processes commonly used for primary detection of cardiac and respiratory problems face inherent limitations, compounded by a scarcity of skilled medical practitioners in remote or under-resourced areas. To address this, our study introduces an innovative yet efficient model which integrates AI for diagnosing lung and heart conditions concurrently using auscultation sounds. Unlike already high-priced digital stethoscopes, our proposed model has been designed specifically for deployment on low-cost embedded devices, ensuring applicability in under-developed regions that face difficulty accessing medical care. Our proposed model incorporates MFCC feature extraction and engineering techniques so that the signal is well analyzed for accurate diagnostics, using a hybrid model that combines a Gated Recurrent Unit with a CNN to process audio signals recorded from the low-cost stethoscope. Beyond its diagnostic capabilities, the model generates digital audio records that facilitate classifying six pulmonary and five cardiovascular diseases. Hence, the integration of a cost-effective stethoscope with an efficient AI-empowered model, deployed on a web app providing real-time analysis, represents a transformative step towards standardized healthcare.

[595] Large Language Model-Driven Distributed Integrated Multimodal Sensing and Semantic Communications

Yubo Peng,Luping Xiang,Bingxin Zhang,Kun Yang

Main category: eess.SP

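TL;DR: Proposes an LLM-driven multimodal sensing and semantic communication framework (LLM-DiSAC) that fuses RF and visual data to improve sensing accuracy and communication efficiency in complex environments.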

Details

Motivation: Traditional single-modal sensing systems fall short in complex, dynamic environments, and single-device systems have limited perspectives and coverage. Method: Develop an RF-vision fusion network (RVFN), an LLM-based semantic transmission network (LSTN), and a transformer-based aggregation model (TRAM), trained with a two-stage distributed learning strategy. Result: Performs well on a synthetic multi-view RF-visual dataset. Conclusion: LLM-DiSAC effectively addresses the limitations of single-modal systems and improves sensing and communication performance. Abstract: Traditional single-modal sensing systems, based solely on either radio frequency (RF) or visual data, struggle to cope with the demands of complex and dynamic environments. Furthermore, single-device systems are constrained by limited perspectives and insufficient spatial coverage, which impairs their effectiveness in urban or non-line-of-sight scenarios. To overcome these challenges, we propose a novel large language model (LLM)-driven distributed integrated multimodal sensing and semantic communication (LLM-DiSAC) framework. Specifically, our system consists of multiple collaborative sensing devices equipped with RF and camera modules, working together with an aggregation center to enhance sensing accuracy. First, on sensing devices, LLM-DiSAC develops an RF-vision fusion network (RVFN), which employs specialized feature extractors for RF and visual data, followed by a cross-attention module for effective multimodal integration. Second, an LLM-based semantic transmission network (LSTN) is proposed to enhance communication efficiency, where the LLM-based decoder leverages known channel parameters, such as transceiver distance and signal-to-noise ratio (SNR), to mitigate semantic distortion. Third, at the aggregation center, a transformer-based aggregation model (TRAM) with an adaptive aggregation attention mechanism is developed to fuse distributed features and enhance sensing accuracy. To preserve data privacy, a two-stage distributed learning strategy is introduced, allowing local model training at the device level and centralized aggregation model training using intermediate features. Finally, evaluations on a synthetic multi-view RF-visual dataset generated by the Genesis simulation engine show that LLM-DiSAC achieves good performance.

cs.SE

[596] SEW: Self-Evolving Agentic Workflows for Automated Code Generation

Siwei Liu,Jinyuan Fang,Han Zhou,Yingxu Wang,Zaiqiao Meng

Main category: cs.SE

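TL;DR: The paper proposes SEW, a self-evolving framework that automatically generates and optimises multi-agent workflows for complex coding tasks; experiments show substantial performance gains.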

Details

Motivation: Existing methods rely on hand-crafted multi-agent workflows that cannot automatically adapt to different types of coding problems, limiting flexibility and efficiency. Method: Propose the SEW framework, which automatically generates and optimises multi-agent workflows through a self-evolution mechanism, and explore the best way to encode workflow information as text. Result: Experiments on three coding benchmark datasets show SEW can automatically design and optimise workflows, with improvements of up to 33%. Conclusion: SEW effectively overcomes the limitations of hand-crafted workflows and offers a new approach to automated workflow design. Abstract: Large Language Models (LLMs) have demonstrated effectiveness in code generation tasks. To enable LLMs to address more complex coding challenges, existing research has focused on crafting multi-agent systems with agentic workflows, where complex coding tasks are decomposed into sub-tasks, assigned to specialized agents. Despite their effectiveness, current approaches heavily rely on hand-crafted agentic workflows, with both agent topologies and prompts manually designed, which limits their ability to automatically adapt to different types of coding problems. To address these limitations and enable automated workflow design, we propose Self-Evolving Workflow (SEW), a novel self-evolving framework that automatically generates and optimises multi-agent workflows. Extensive experiments on three coding benchmark datasets, including the challenging LiveCodeBench, demonstrate that our SEW can automatically design agentic workflows and optimise them through self-evolution, bringing up to 33% improvement on LiveCodeBench compared to using the backbone LLM only. Furthermore, by investigating different representation schemes of workflow, we provide insights into the optimal way to encode workflow information with text.

[597] From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?

Wasi Uddin Ahmad,Somshubra Majumdar,Boris Ginsburg

Main category: cs.SE

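TL;DR: The study finds that post-processing is crucial for the automatic evaluation of LLMs in FIM code generation, but supervised fine-tuning substantially improves performance and reduces the need for post-processing.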

Details

Motivation: Examine whether post-processing of instruction-tuned LLM outputs is necessary, given the extraneous content in generated code. Method: Improve LLMs' FIM code generation through supervised fine-tuning and evaluate on the HumanEval Infilling and SAFIM benchmarks. Result: The fine-tuned models need no post-processing when the middle consists of complete lines, but still need it when the middle is a random span of code. Conclusion: Supervised fine-tuning reduces the need for post-processing, though post-processing is still required in some cases to ensure generation quality. Abstract: Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. The determination of an optimal truncation strategy, however, often proves intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consists of complete lines. However, post-processing of the LLM outputs remains necessary when the middle is a random span of code.
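
To show what truncation-style post-processing of FIM output can look like, here is a minimal sketch of one possible heuristic: cut the generated middle at the first line that repeats the first non-blank line of the known suffix. The abstract does not specify which truncation rules the authors compare, so this rule and the function name are assumptions; real pipelines tend to use language-aware rules.

```python
def truncate_middle(generated, suffix):
    """Drop everything from the point where `generated` starts repeating `suffix`.

    The generated text is kept line by line until a line matches (ignoring
    surrounding whitespace) the first non-blank line of the suffix.
    """
    suffix_lines = [line.strip() for line in suffix.splitlines() if line.strip()]
    if not suffix_lines:
        return generated  # nothing to align against
    first_suffix_line = suffix_lines[0]
    kept = []
    for line in generated.splitlines(keepends=True):
        if line.strip() == first_suffix_line:
            break  # the model has run past the hole into the suffix
        kept.append(line)
    return "".join(kept)

# Example: the model fills the hole but then re-emits the suffix and extra code.
prefix = "def area(w, h):\n"
suffix = "    return result\n"
raw_output = "    result = w * h\n    return result\n\nprint(area(2, 3))\n"

middle = truncate_middle(raw_output, suffix)
completed = prefix + middle + suffix
```

A line-based rule like this only helps when the middle is made of complete lines. For a random span that ends mid-line it can cut too early or too late.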

[598] Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI

Ranjan Sapkota,Konstantinos I. Roumeliotis,Manoj Karkee

Main category: cs.SE

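TL;DR: This review surveys two emerging paradigms in AI-assisted software development, vibe coding and agentic coding, analyzing their differences, application scenarios, and future directions.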

Details

Motivation: Explore how LLMs can enable more efficient human-AI collaboration in software development, and compare the strengths, weaknesses, and suitable scenarios of the two paradigms. Method: Propose a detailed taxonomy covering conceptual foundations, execution models, feedback loops, and more, with a comparative analysis across 20 use cases. Result: Vibe coding suits early-stage prototyping and education, while agentic coding does better in automation, refactoring, and CI/CD. Conclusion: Successful AI software engineering will depend on combining the two paradigms rather than choosing one. Abstract: This review presents a comprehensive analysis of two emerging paradigms in AI-assisted software development: vibe coding and agentic coding. While both leverage large language models (LLMs), they differ fundamentally in autonomy, architectural design, and the role of the developer. Vibe coding emphasizes intuitive, human-in-the-loop interaction through prompt-based, conversational workflows that support ideation, experimentation, and creative exploration. In contrast, agentic coding enables autonomous software development through goal-driven agents capable of planning, executing, testing, and iterating tasks with minimal human intervention. We propose a detailed taxonomy spanning conceptual foundations, execution models, feedback loops, safety mechanisms, debugging strategies, and real-world tool ecosystems. Through comparative workflow analysis and 20 detailed use cases, we illustrate how vibe systems thrive in early-stage prototyping and education, while agentic systems excel in enterprise-grade automation, codebase refactoring, and CI/CD integration. We further examine emerging trends in hybrid architectures, where natural language interfaces are coupled with autonomous execution pipelines. Finally, we articulate a future roadmap for agentic AI, outlining the infrastructure needed for trustworthy, explainable, and collaborative systems. Our findings suggest that successful AI software engineering will rely not on choosing one paradigm, but on harmonizing their strengths within a unified, human-centered development lifecycle.

[599] CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement

Maria Dziuba,Valentin Malykh

Main category: cs.SE

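TL;DR: CIDRe is a language-agnostic, reference-free criterion for code comment quality that combines four aspects (relevance, informativeness, completeness, and description length); experiments show it outperforms existing methods.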

Details

Motivation: Existing code comment quality metrics (such as SIDE, MIDQ, and STASIS) are limited in their code-comment analysis, so a more comprehensive quality metric is needed. Method: The authors propose the CIDRe criterion, which combines four synergistic aspects (relevance, informativeness, completeness, and description length), and validate it on a manually annotated dataset. Result: CIDRe outperforms existing metrics in cross-entropy evaluation, and models fine-tuned on CIDRe-filtered comment data show significant quality gains in GPT-4o-mini assessments. Conclusion: CIDRe is an effective code comment quality criterion that can significantly improve the quality of model-generated comments. Abstract: Effective generation of structured code comments requires robust quality metrics for dataset curation, yet existing approaches (SIDE, MIDQ, STASIS) suffer from limited code-comment analysis. We propose CIDRe, a language-agnostic reference-free quality criterion combining four synergistic aspects: (1) relevance (code-comment semantic alignment), (2) informativeness (functional coverage), (3) completeness (presence of all structure sections), and (4) description length (detail sufficiency). We validate our criterion on a manually annotated dataset. Experiments demonstrate CIDRe's superiority over existing metrics, achieving an improvement in cross-entropy evaluation. When applied to filter comments, the models finetuned on CIDRe-filtered data show statistically significant quality gains in GPT-4o-mini assessments.

[600] StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Jialin Yang,Dongfu Jiang,Lipeng He,Sherman Siu,Yuxuan Zhang,Disen Liao,Zhuofeng Li,Huaye Zeng,Yiming Jia,Haozhe Wang,Benjamin Schneider,Chi Ruan,Wentao Ma,Zhiheng Lyu,Yifei Wang,Yi Lu,Quy Duc Do,Ziyan Jiang,Ping Nie,Wenhu Chen

Main category: cs.SE

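TL;DR: StructEval is a comprehensive benchmark for evaluating the ability of large language models (LLMs) to generate structured outputs, covering 18 formats and 44 task types; results show that current models still have substantial room for improvement.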

Details

Motivation: As LLMs are widely adopted in software development, evaluating their ability to generate structured outputs has become critical. Method: The benchmark systematically evaluates structural fidelity through generation tasks (producing structured output from natural language) and conversion tasks (translating between structured formats), and introduces new metrics for format adherence and structural correctness. Result: Even an advanced model such as o1-mini reaches an average score of only 75.58, open-source models lag by about 10 points, generation tasks are harder than conversion tasks, and producing visual content is harder than producing text-only structures. Conclusion: StructEval provides a comprehensive benchmark for the structured-output capabilities of LLMs, exposing the shortcomings of current models and pointing to directions for improvement. Abstract: As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, producing structured output from natural language prompts, and 2) conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 types of task, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve only a 75.58 average score, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
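
To make the idea of format adherence and structural correctness concrete for one non-renderable format (JSON), the following is a minimal sketch. The function name `json_structure_score`, the dotted-path convention, and the scoring rule are assumptions made for illustration; they are not StructEval's actual metrics.

```python
import json


def json_structure_score(output: str, required_paths: list[str]) -> float:
    """Return the fraction of required dotted key paths found in `output`.

    Hypothetical scoring rule, for illustration only: output that does not
    parse as JSON scores 0.0 (a format-adherence failure).
    """
    try:
        doc = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not required_paths:
        return 1.0

    def has_path(node, path: str) -> bool:
        # Walk the parsed document one key (or list index) at a time.
        for part in path.split("."):
            if isinstance(node, dict) and part in node:
                node = node[part]
            elif isinstance(node, list) and part.isdigit() and int(part) < len(node):
                node = node[int(part)]
            else:
                return False
        return True

    found = sum(has_path(doc, p) for p in required_paths)
    return found / len(required_paths)
```

A check of this kind separates two failure modes: output that is not valid JSON at all, and output that parses but is missing expected structure.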

cs.DB [Back]

[601] A Survey of LLM $\times$ DATA

Xuanhe Zhou,Junxuan He,Wei Zhou,Haodong Chen,Zirui Tang,Haoyu Zhao,Xin Tong,Guoliang Li,Youmin Chen,Jun Zhou,Zhaojun Sun,Binyuan Hui,Shuo Wang,Conghui He,Zhiyuan Liu,Jingren Zhou,Fan Wu

Main category: cs.DB

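TL;DR: This survey reviews the bidirectional relationship between large language models (LLM) and data management (DATA), covering recent advances in DATA4LLM (data management in support of LLMs) and LLM4DATA (LLMs in support of data management).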

Details

Motivation: With the rapid development of LLMs and data management, the bidirectional relationship between them is increasingly important; this survey aims to give a comprehensive overview of research at this intersection. Method: A categorized review organized around DATA4LLM (three stages of data handling) and LLM4DATA (three application directions). Result: The survey summarizes how data management serves the data needs of LLMs and how LLMs improve the efficiency and intelligence of data management. Conclusion: The integration of LLMs and data management brings new opportunities and challenges to both fields, and future work should further explore their synergy. Abstract: The integration of large language model (LLM) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, feeds LLMs with the high-quality, diverse, and timely data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows: (i) Data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) Data storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV-cache management, and fault-tolerant checkpointing; (iii) Data serving for LLMs tackles challenges in RAG (e.g., knowledge post-processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general-purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, discovery; (ii) data analysis, covering reasoning over structured, semi-structured, and unstructured data, and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval-augmented prompting, task-specialized fine-tuning, and multi-agent collaboration.

[602] SQUiD: Synthesizing Relational Databases from Unstructured Text

Mushtari Sadia,Zhenning Yang,Yunming Xiao,Ang Chen,Amrita Roy Chowdhury

Main category: cs.DB

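TL;DR: SQUiD uses a neurosymbolic framework to automatically synthesize relational databases from unstructured text and outperforms baseline methods.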

Details

Motivation: To bridge the gap between unstructured text and relational databases. Method: The SQUiD framework decomposes the task into four stages, each using specialized techniques. Result: Experiments show that SQUiD outperforms baselines across diverse datasets. Conclusion: SQUiD offers an effective approach to generating relational databases from text. Abstract: Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets.

[603] ODIN: A NL2SQL Recommender to Handle Schema Ambiguity

Kapil Vaidya,Abishek Sankararaman,Jialin Ding,Chuan Lei,Xiao Qin,Balakrishnan Narayanaswamy,Tim Kraska

Main category: cs.DB

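TL;DR: ODIN is an NL2SQL recommendation engine that addresses schema ambiguity by generating multiple candidate SQL queries, dynamically adjusting the number of suggestions and learning from user feedback, which substantially raises the likelihood of producing the correct SQL query.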

Details

Motivation: In complex enterprise environments, schema ambiguity is a major challenge for NL2SQL systems, especially when multiple tables and semantically similar column names co-exist. Method: ODIN handles schema ambiguity by generating a set of candidate SQL queries, dynamically adjusting the number of suggestions, and using user feedback to personalize recommendations. Result: The evaluation shows that ODIN is 1.5-2 times more likely to generate the correct SQL query than baseline methods. Conclusion: ODIN effectively addresses schema ambiguity and improves the accuracy and practicality of NL2SQL systems. Abstract: NL2SQL (natural language to SQL) systems translate natural language into SQL queries, allowing users with no technical background to interact with databases and create tools like reports or visualizations. While recent advancements in large language models (LLMs) have significantly improved NL2SQL accuracy, schema ambiguity remains a major challenge in enterprise environments with complex schemas, where multiple tables and columns with semantically similar names often co-exist. To address schema ambiguity, we introduce ODIN, an NL2SQL recommendation engine. Instead of producing a single SQL query given a natural language question, ODIN generates a set of potential SQL queries by accounting for different interpretations of ambiguous schema components. ODIN dynamically adjusts the number of suggestions based on the level of ambiguity, and ODIN learns from user feedback to personalize future SQL query recommendations. Our evaluation shows that ODIN improves the likelihood of generating the correct SQL query by 1.5-2$\times$ compared to baselines.
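
The idea of returning a set of queries whose size tracks the amount of ambiguity can be sketched as follows. This is a toy illustration under stated assumptions, not ODIN's algorithm: the function `candidate_queries`, the placeholder-template representation, and the fixed cap `max_suggestions` are all hypothetical, and learning from user feedback is not modeled.

```python
from itertools import islice, product


def candidate_queries(template: str,
                      interpretations: dict[str, list[str]],
                      max_suggestions: int = 5) -> list[str]:
    """Fill `template` with combinations of candidate schema elements.

    `interpretations` maps each placeholder in the template to the schema
    elements it might refer to. An unambiguous question yields one query;
    more ambiguity yields more queries, capped at `max_suggestions`.
    """
    names = list(interpretations)
    combos = product(*(interpretations[name] for name in names))
    return [
        template.format(**dict(zip(names, combo)))
        for combo in islice(combos, max_suggestions)
    ]


# Example: "revenue" could mean either of two similarly named columns.
suggestions = candidate_queries(
    "SELECT SUM({amount}) FROM {table}",
    {"amount": ["gross_revenue", "net_revenue"], "table": ["sales"]},
)
```

Here `suggestions` holds two queries, one per interpretation of the ambiguous column, which a user could then choose between.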

q-bio.NC [Back]

Subba Reddy Oota,Khushbu Pahwa,Mounika Marreddy,Maneesh Singh,Manish Gupta,Bapi S. Raju

Main category: q-bio.NC

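TL;DR: Multi-modal Transformer models predict visual brain activity well, but their accuracy under multi-modal stimuli has been unclear. The study finds that multi-modal models align better with brain activity in language and visual regions, and that multi-modal information processing involves additional features beyond the unimodal ones.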

Details

Motivation: To assess how accurately multi-modal models predict brain activity under multi-modal stimuli, in order to understand how the brain processes multi-modal information. Method: Unimodal and multi-modal models (cross-modal and jointly pretrained) are used to analyze fMRI data and to study how well they predict brain activity during movie watching. Result: Multi-modal models show better alignment in language and visual regions, and multi-modal information processing involves information beyond the unimodal features. Conclusion: Multi-modal models offer a new perspective on multi-modal information processing in the brain; the alignment of cross-modal models is partly attributable to the video modality, while that of jointly pretrained models is partly attributable to both video and audio. Abstract: Despite participants engaging in unimodal stimuli, such as watching images or silent videos, recent work has demonstrated that multi-modal Transformer models can predict visual brain activity impressively well, even with incongruent modality representations. This raises the question of how accurately these multi-modal models can predict brain activity when participants are engaged in multi-modal stimuli. As these models grow increasingly popular, their use in studying neural activity provides insights into how our brains respond to such multi-modal naturalistic stimuli, i.e., where it separates and integrates information across modalities through a hierarchy of early sensory regions to higher cognition. We investigate this question by using multiple unimodal and two types of multi-modal models (cross-modal and jointly pretrained) to determine which type of model is more relevant to fMRI brain activity when participants are engaged in watching movies. We observe that both types of multi-modal models show improved alignment in several language and visual regions. This study also helps in identifying which brain regions process unimodal versus multi-modal information. We further investigate the contribution of each modality to multi-modal alignment by carefully removing unimodal features one by one from multi-modal representations, and find that there is additional information beyond the unimodal embeddings that is processed in the visual and language regions. Based on this investigation, we find that while for cross-modal models, their brain alignment is partially attributed to the video modality; for jointly pretrained models, it is partially attributed to both the video and audio modalities. This serves as a strong motivation for the neuroscience community to investigate the interpretability of these models for deepening our understanding of multi-modal information processing in the brain.

cs.CY [Back]

[605] Towards medical AI misalignment: a preliminary study

Barbara Puccio,Federico Castagna,Allan Tucker,Pierangelo Veltri

Main category: cs.CY

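TL;DR: The study finds that large language models (LLMs), despite their strong capabilities, remain vulnerable to role-playing attacks (such as the "Goofy Game"), which can lead them to produce unsafe clinical suggestions.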

Details

Motivation: To examine the vulnerability of LLMs to role-playing attacks, particularly the harm this could cause in the medical domain. Method: Role-playing prompts are constructed to test how susceptible LLMs are to generating incorrect clinical suggestions. Result: Even users without a technical background can induce LLMs to produce harmful content with specific prompts. Conclusion: The study exposes a safety vulnerability of LLMs under a specific attack and offers a reference point for future safeguards. Abstract: Despite their staggering capabilities as assistant tools, often exceeding human performance, Large Language Models (LLMs) are still prone to jailbreak attempts from malevolent users. Although red teaming practices have already identified and helped to address several such jailbreak techniques, one particular sturdy approach involving role-playing (which we named 'Goofy Game') seems effective against most of the current LLM safeguards. This can result in the provision of unsafe content, which, although not harmful per se, might lead to dangerous consequences if delivered in a setting such as the medical domain. In this preliminary and exploratory study, we provide an initial analysis of how, even without technical knowledge of the internal architecture and parameters of generative AI models, a malicious user could construct a role-playing prompt capable of coercing an LLM into producing incorrect (and potentially harmful) clinical suggestions. We aim to illustrate a specific vulnerability scenario, providing insights that can support future advancements in the field.

[606] Will Large Language Models Transform Clinical Prediction?

Yusuf Yildiz,Goran Nenadic,Meghna Jani,David A. Jenkins

Main category: cs.CY

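TL;DR: This commentary discusses the use of large language models (LLMs) in clinical prediction, highlights their potential and challenges, and calls for further research to support their integration.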

Details

Motivation: LLMs have great potential in healthcare, but their concrete use in clinical prediction still needs validation and improvement. Method: The commentary examines the use of LLMs in clinical prediction, focusing on methodological extensions, validation, fairness evaluation, and the development of regulations. Result: LLMs show promise for clinical prediction, but issues of fairness, bias, and regulation remain to be resolved. Conclusion: Further research and domain-specific considerations are needed before LLMs can be fully integrated into clinical prediction workflows. Abstract: Background: Large language models (LLMs) are attracting increasing interest in healthcare. Their ability to summarise large datasets effectively, answer questions accurately, and generate synthesised text is widely recognised. These capabilities are already finding applications in healthcare. Body: This commentary discusses LLM usage in the clinical prediction context and highlights potential benefits and existing challenges. In these early stages, the focus should be on extending the methodology, specifically on validation, fairness and bias evaluation, survival analysis and development of regulations. Conclusion: We conclude that further work and domain-specific considerations need to be made for full integration into the clinical prediction workflows.

[607] Language Models Surface the Unwritten Code of Science and Society

Honglin Bao,Siyang Wu,Jiwoong Choi,Yingrong Mao,James A. Evans

Main category: cs.CY

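TL;DR: The paper calls on the research community not only to investigate how human biases are inherited by large language models (LLMs) but also to explore how those biases can be used to reveal society's "unwritten code" (such as implicit stereotypes and heuristics). Through a case study in science, it proposes a conceptual framework in which LLMs generate self-consistent hypotheses to uncover hidden rules in peer review. The LLMs' normative priors (such as theoretical rigor) shift toward narratives that emphasize external connections (such as positioning within the literature), revealing the primacy of scientific myths. Human reviewers implicitly reward these narratives but avoid mentioning them explicitly in their comments.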

Details

Motivation: To show how LLMs inherit and amplify human biases, and to use those biases to reveal implicit rules in society, such as unstated criteria in scientific peer review. Method: Through a case study, the authors propose a conceptual framework in which LLMs generate self-consistent hypotheses to analyze hidden rules in peer review, iteratively searching for deeper hypotheses and comparing the LLMs' normative priors with their posteriors. Result: The LLMs' normative priors (such as theoretical rigor) shift toward narratives that emphasize external connections (such as positioning within the literature). Human reviewers explicitly reward aspects that moderately align with the LLMs' normative priors (correlation = 0.49) but avoid articulating contextualization and storytelling in their comments (correlation = -0.14), even though they reward them implicitly with positive scores. Conclusion: The framework is broadly applicable to surfacing implicit rules in society and can help target responsible AI more precisely. Abstract: This paper calls on the research community not only to investigate how human biases are inherited by large language models (LLMs) but also to explore how these biases in LLMs can be leveraged to make society's "unwritten code" - such as implicit stereotypes and heuristics - visible and accessible for critique. We introduce a conceptual framework through a case study in science: uncovering hidden rules in peer review - the factors that reviewers care about but rarely state explicitly due to normative scientific expectations. The idea of the framework is to push LLMs to speak out their heuristics through generating self-consistent hypotheses - why one paper appeared stronger in reviewer scoring - among paired papers submitted to 45 computer science conferences, while iteratively searching deeper hypotheses from remaining pairs where existing hypotheses cannot explain. We observed that LLMs' normative priors about the internal characteristics of good science extracted from their self-talk, e.g. theoretical rigor, were systematically updated toward posteriors that emphasize storytelling about external connections, such as how the work is positioned and connected within and across literatures. This shift reveals the primacy of scientific myths about intrinsic properties driving scientific excellence rather than extrinsic contextualization and storytelling that influence conceptions of relevance and significance. Human reviewers tend to explicitly reward aspects that moderately align with LLMs' normative priors (correlation = 0.49) but avoid articulating contextualization and storytelling posteriors in their review comments (correlation = -0.14), despite giving implicit reward to them with positive scores. We discuss the broad applicability of the framework, leveraging LLMs as diagnostic tools to surface the tacit codes underlying human society, enabling more precisely targeted responsible AI.

cs.DC [Back]

[608] Optimizing edge AI models on HPC systems with the edge in the loop

Marcel Aach,Cyril Blanc,Andreas Lintermann,Kurt De Grave

Main category: cs.DC

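TL;DR: The paper proposes a hardware-aware neural architecture search (NAS) workflow for optimizing AI models for edge devices, substantially improving both inference speed and model quality.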

Details

Motivation: AI models on edge devices need high accuracy and low latency; conventional methods such as pruning, distillation, or quantization may fall short, whereas hardware-aware NAS can explore optimized architectures more systematically. Method: An edge device in Belgium is coupled with a high-performance computing system in Germany, so that latency is measured in real time on the target hardware while architecture candidates are trained quickly. Result: In experiments in the AM domain, inference is about 8.8 times faster and model quality improves by a factor of about 1.35. Conclusion: Hardware-aware NAS is an effective way to optimize AI models for edge devices and clearly outperforms a human-designed baseline. Abstract: Artificial intelligence and machine learning models deployed on edge devices, e.g., for quality control in Additive Manufacturing (AM), are frequently small in size. Such models usually have to deliver highly accurate results within a short time frame. Methods that are commonly employed in literature start out with larger trained models and try to reduce their memory and latency footprint by structural pruning, knowledge distillation, or quantization. It is, however, also possible to leverage hardware-aware Neural Architecture Search (NAS), an approach that seeks to systematically explore the architecture space to find optimized configurations. In this study, a hardware-aware NAS workflow is introduced that couples an edge device located in Belgium with a powerful High-Performance Computing system in Germany, to train possible architecture candidates as fast as possible while performing real-time latency measurements on the target hardware. The approach is verified on a use case in the AM domain, based on the open RAISE-LPBF dataset, achieving ~8.8 times faster inference speed while simultaneously enhancing model quality by a factor of ~1.35, compared to a human-designed baseline.

cs.CR [Back]

[609] $PD^3F$: A Pluggable and Dynamic DoS-Defense Framework Against Resource Consumption Attacks Targeting Large Language Models

Yuanhe Zhang,Xinyue Wang,Haoran Gao,Zhenhong Zhou,Fanyu Meng,Yuyao Zhang,Sen Su

Main category: cs.CR

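TL;DR: The paper proposes $PD^3F$, a pluggable and dynamic defense framework that protects large language models (LLMs) against resource consumption attacks, using a two-stage strategy on the input and output sides to substantially improve defense.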

Details

Motivation: Because large language models (LLMs) have high computational requirements, they are vulnerable to resource consumption attacks (such as DoS attacks); existing work lacks effective defenses, leaving security risks in real deployments. Method: The $PD^3F$ framework uses a two-stage defense: on the input side, a Resource Index guides dynamic request scheduling; on the output side, an adaptive termination mechanism suppresses malicious generation. Result: Experiments show that $PD^3F$ significantly mitigates resource consumption attacks across six models, improving users' access capacity by up to 500% under adversarial load. Conclusion: $PD^3F$ offers a resource-aware defense for the resilient deployment of LLMs and is a meaningful step toward countering resource consumption attacks. Abstract: Large Language Models (LLMs), due to substantial computational requirements, are vulnerable to resource consumption attacks, which can severely degrade server performance or even cause crashes, as demonstrated by denial-of-service (DoS) attacks designed for LLMs. However, existing works lack mitigation strategies against such threats, resulting in unresolved security risks for real-world LLM deployments. To this end, we propose the Pluggable and Dynamic DoS-Defense Framework ($PD^3F$), which employs a two-stage approach to defend against resource consumption attacks from both the input and output sides. On the input side, we propose the Resource Index to guide Dynamic Request Polling Scheduling, thereby reducing resource usage induced by malicious attacks under high-concurrency scenarios. On the output side, we introduce the Adaptive End-Based Suppression mechanism, which terminates excessive malicious generation early. Experiments across six models demonstrate that $PD^3F$ significantly mitigates resource consumption attacks, improving users' access capacity by up to 500% during adversarial load. $PD^3F$ represents a step toward the resilient and resource-aware deployment of LLMs against resource consumption attacks.
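
The abstract does not define the Resource Index or the polling scheduler, so the following is only a generic sketch of resource-aware request scheduling, not $PD^3F$ itself. It assumes a hypothetical per-user running total of estimated cost and always serves the pending request whose user has consumed the least so far; the class `ResourceAwareScheduler` and the cost estimates are illustrative.

```python
class ResourceAwareScheduler:
    """Serve the pending request whose user has consumed the least so far.

    Hypothetical stand-in for an input-side defense: `usage` plays the role
    of a per-user resource index, and costs are caller-supplied estimates.
    """

    def __init__(self):
        self.usage = {}     # user -> accumulated estimated cost
        self.pending = []   # (user, estimated_cost, request_id), arrival order

    def submit(self, user, request_id, estimated_cost):
        self.pending.append((user, estimated_cost, request_id))
        self.usage.setdefault(user, 0.0)

    def next_request(self):
        """Pop and return the next request id, or None if nothing is pending."""
        if not self.pending:
            return None
        # min() returns the first minimal index, so ties keep arrival order.
        idx = min(range(len(self.pending)),
                  key=lambda i: self.usage[self.pending[i][0]])
        user, cost, request_id = self.pending.pop(idx)
        self.usage[user] += cost
        return request_id


# A heavy user floods the queue before a light user arrives.
sched = ResourceAwareScheduler()
for rid in ("a1", "a2", "a3"):
    sched.submit("attacker", rid, estimated_cost=100.0)
sched.submit("benign", "b1", estimated_cost=1.0)
order = [sched.next_request() for _ in range(4)]
```

In this toy run the benign request is served second rather than last, because the heavy user's accumulated cost pushes its remaining requests back.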

[610] Lifelong Safety Alignment for Language Models

Haoyu Wang,Zeyu Qin,Yifei Zhao,Chao Du,Min Lin,Xueqian Wang,Tianyu Pang

Main category: cs.CR

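TL;DR: The paper proposes a lifelong safety alignment framework in which a Meta-Attacker and a Defender compete, enabling LLMs to keep adapting to new jailbreaking attacks.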

Details

Motivation: Existing defenses mainly target known attack types, but it is more critical to handle unseen attacks that may arise during deployment. Method: The framework consists of a Meta-Attacker and a Defender; the Meta-Attacker is warmed up with key insights that GPT-4o extracts from research papers and discovers new attack strategies through iterative training, while the Defender progressively improves its robustness. Result: The first-iteration Meta-Attacker reaches a 73% attack success rate on RR using only single-turn attacks, and the Defender ultimately reduces its success rate to 7%. Conclusion: The framework improves the safety and reliability of LLMs in open-ended environments. Abstract: LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at https://github.com/sail-sg/LifelongSafetyAlignment.

cs.MA [Back]

Alireza Rezazadeh,Zichao Li,Ange Lou,Yuying Zhao,Wei Wei,Yujia Bao

Main category: cs.MA

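TL;DR: The paper proposes Collaborative Memory, a framework for knowledge sharing in multi-user, multi-agent environments that supports dynamic, asymmetric access control.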

Details

Motivation: Current multi-agent systems are limited in knowledge sharing, particularly in cross-user knowledge transfer and dynamic permission management. Method: The framework uses two memory tiers (private and shared) and access controls encoded as bipartite graphs, combined with immutable provenance attributes and fine-grained read and write policies. Result: The framework enables safe, efficient, and interpretable cross-user knowledge sharing, with adherence to dynamic permissions and auditable operations. Conclusion: Collaborative Memory provides a flexible and secure solution for knowledge management in multi-agent environments. Abstract: Complex tasks are increasingly delegated to ensembles of specialized LLM-based agents that reason, communicate, and coordinate actions, both among themselves and through interactions with external tools, APIs, and databases. While persistent memory has been shown to enhance single-agent performance, most approaches assume a monolithic, single-user context, overlooking the benefits and challenges of knowledge transfer across users under dynamic, asymmetric permissions. We introduce Collaborative Memory, a framework for multi-user, multi-agent environments with asymmetric, time-evolving access controls encoded as bipartite graphs linking users, agents, and resources. Our system maintains two memory tiers: (1) private memory: private fragments visible only to their originating user; and (2) shared memory: selectively shared fragments. Each fragment carries immutable provenance attributes (contributing agents, accessed resources, and timestamps) to support retrospective permission checks. Granular read policies enforce current user-agent-resource constraints and project existing memory fragments into filtered transformed views. Write policies determine fragment retention and sharing, applying context-aware transformations to update the memory. Both policies may be designed conditioned on system, agent, and user-level information. Our framework enables safe, efficient, and interpretable cross-user knowledge sharing, with provable adherence to asymmetric, time-varying policies and full auditability of memory operations.
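
The combination of bipartite access graphs, immutable provenance, and read-time permission checks can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: the names `Fragment`, `AccessGraph`, and `read_view`, and the specific read rule (a shared fragment is visible only if the reading agent can currently access every resource in its provenance), are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A memory fragment; frozen so its provenance cannot be changed."""
    text: str
    owner: str             # originating user
    agents: frozenset      # contributing agents
    resources: frozenset   # resources accessed to produce the fragment
    timestamp: int
    shared: bool = False   # False: private tier, True: shared tier


class AccessGraph:
    """Two bipartite edge sets: user-agent and agent-resource permissions."""

    def __init__(self):
        self.user_agents = set()       # (user, agent) edges
        self.agent_resources = set()   # (agent, resource) edges

    def can_read(self, user: str, agent: str, frag: Fragment) -> bool:
        if (user, agent) not in self.user_agents:
            return False
        if not frag.shared:
            # Private fragments are visible only to their originating user.
            return frag.owner == user
        # Assumed rule: check provenance against the *current* permissions.
        return all((agent, r) in self.agent_resources for r in frag.resources)


def read_view(graph: AccessGraph, user: str, agent: str, memory: list) -> list:
    """Project the memory into the view permitted for this user and agent."""
    return [f.text for f in memory if graph.can_read(user, agent, f)]
```

Because the check runs at read time against the current edge sets, removing an edge from `agent_resources` revokes access to previously shared fragments without rewriting the stored memory.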

eess.AS [Back]

[612] Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving

Jingran Xie,Xiang Li,Hui Wang,Yue Yu,Yang Xiang,Xixin Wu,Zhiyong Wu

Main category: eess.AS

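TL;DR: The paper proposes MTBI, a multi-task behavior imitation method with speech-text interleaving that needs only paired speech and text data and improves the generalization of speech large language models (SLLMs).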

Details

Motivation: Current speech large language models (SLLMs) rely on supervised fine-tuning but lack annotated speech data across a wide range of tasks, which leads to inefficient alignment and poor generalization. Method: The multi-task behavior imitation (MTBI) method, together with speech-text interleaving, ensures that the LLM decoder generates equivalent responses to speech and text. Result: Experiments show that MTBI outperforms existing SLLMs on prompt and task generalization while requiring less supervised speech data. Conclusion: MTBI improves the generalization of SLLMs and offers a new approach to efficient alignment between speech and LLMs. Abstract: Large language models (LLMs) have shown remarkable generalization across tasks, leading to increased interest in integrating speech with LLMs. These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs. However, the lack of annotated speech data across a wide range of tasks hinders alignment efficiency, resulting in poor generalization. To address these issues, we propose a novel multi-task 'behavior imitation' method with speech-text interleaving, called MTBI, which relies solely on paired speech and transcripts. By ensuring the LLM decoder generates equivalent responses to paired speech and text, we achieve a more generalized SLLM. Interleaving is used to further enhance alignment efficiency. We introduce a simple benchmark to evaluate prompt and task generalization across different models. Experimental results demonstrate that our MTBI outperforms SOTA SLLMs on both prompt and task generalization, while requiring less supervised speech data.

[613] Evaluating the Usefulness of Non-Diagnostic Speech Data for Developing Parkinson's Disease Classifiers

Terry Yi Zhong,Esther Janse,Cristian Tejedor-Garcia,Louis ten Bosch,Martha Larson

Main category: eess.AS

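TL;DR: The paper examines the feasibility of detecting Parkinson's disease (PD) from a non-diagnostic speech task (the Turn-Taking dataset), finds it comparable to diagnostic datasets such as PC-GITA, and studies which dataset characteristics affect classification performance.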

Details

Motivation: To explore the potential of non-diagnostic speech data for PD detection, toward a more automated, cost-effective, and non-intrusive diagnostic method. Method: Using the Turn-Taking dataset, the authors analyze how concatenating audio and balancing gender and status distributions affect classification performance, and run cross-dataset evaluations. Result: The Turn-Taking dataset is as useful as PC-GITA; audio concatenation and balanced distributions can improve performance; models behave differently in cross-dataset evaluation. Conclusion: Non-diagnostic speech data can be used for PD detection, but dataset characteristics and individual differences must be taken into account. Abstract: Speech-based Parkinson's disease (PD) detection has gained attention for its automated, cost-effective, and non-intrusive nature. As research studies usually rely on data from diagnostic-oriented speech tasks, this work explores the feasibility of diagnosing PD on the basis of speech data not originally intended for diagnostic purposes, using the Turn-Taking (TT) dataset. Our findings indicate that TT can be as useful as diagnostic-oriented PD datasets like PC-GITA. We also investigate which specific dataset characteristics impact PD classification performance. The results show that concatenating audio recordings and balancing participants' gender and status distributions can be beneficial. Cross-dataset evaluation reveals that models trained on PC-GITA generalize poorly to TT, whereas models trained on TT perform better on PC-GITA. Furthermore, we provide insights into the high variability across folds, which is mainly due to large differences in individual speaker performance.

[614] Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models

Ke-Han Lu,Chun-Yi Kuan,Hung-yi Lee

Main category: eess.AS

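TL;DR: Speech-IFeval is an evaluation framework for testing the instruction-following ability of speech-aware language models (SLMs) and quantifying catastrophic forgetting. Most SLMs perform poorly on basic instructions and are sensitive to prompt variations.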

Details

Motivation: Existing evaluation methods conflate speech perception with instruction-following and cannot accurately assess the instruction-following ability of SLMs, so a dedicated evaluation framework is needed. Method: The Speech-IFeval framework provides a purpose-built benchmark for diagnosing the instruction-following ability of SLMs. Result: Most SLMs follow instructions far worse than text-based LLMs, and their outputs are inconsistent and unreliable. Conclusion: Evaluation of SLMs needs to go beyond task-level metrics, and the study offers guidance for future research. Abstract: We introduce Speech-IFeval, an evaluation framework designed to assess instruction-following capabilities and quantify catastrophic forgetting in speech-aware language models (SLMs). Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Existing benchmarks conflate speech perception with instruction-following, hindering evaluation of these distinct skills. To address this gap, we provide a benchmark for diagnosing the instruction-following abilities of SLMs. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs. Additionally, these models are highly sensitive to prompt variations, often yielding inconsistent and unreliable outputs. We highlight core challenges and provide insights to guide future research, emphasizing the need for evaluation beyond task-level metrics.

[615] MVP: Multi-source Voice Pathology detection

Alkis Koudounas,Moreno La Quatra,Gabriele Ciravegna,Marco Fantini,Erika Crosetti,Giovanni Succo,Tania Cerquitelli,Sabato Marco Siniscalchi,Elena Baralis

Main category: eess.AS

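TL;DR: MVP is a Transformer-based multi-source voice pathology detection method that fuses sentence-reading and sustained-vowel recordings to improve diagnostic performance.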

Details

Motivation: Voice disorders seriously affect patients' quality of life, yet non-invasive automated diagnosis is under-explored because of data scarcity and the variability of recording sources. Method: The MVP method applies Transformers directly to raw voice signals and explores three fusion strategies: waveform concatenation, intermediate feature fusion, and decision-level combination. Result: Validated on German, Portuguese, and Italian, intermediate feature fusion performs best, with AUC improvements of up to 13%. Conclusion: MVP's multi-source fusion strategy improves the accuracy of voice pathology detection. Abstract: Voice disorders significantly impact patient quality of life, yet non-invasive automated diagnosis remains under-explored due to both the scarcity of pathological voice data and the variability in recording sources. This work introduces MVP (Multi-source Voice Pathology detection), a novel approach that leverages transformers operating directly on raw voice signals. We explore three fusion strategies to combine sentence reading and sustained vowel recordings: waveform concatenation, intermediate feature fusion, and decision-level combination. Empirical validation across the German, Portuguese, and Italian languages shows that intermediate feature fusion using transformers best captures the complementary characteristics of both recording types. Our approach achieves up to +13% AUC improvement over single-source methods.
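
The three fusion strategies differ only in where the two recordings are combined, which the toy sketch below tries to make visible. It is not the MVP model: `encode` and `classify` are deliberately trivial, hypothetical stand-ins for a transformer encoder and a classification head, and waveforms are plain lists of floats.

```python
import math


def encode(waveform):
    """Toy stand-in for an encoder: returns [mean, mean absolute value]."""
    n = len(waveform)
    return [sum(waveform) / n, sum(abs(x) for x in waveform) / n]


def classify(features):
    """Toy stand-in for a classifier head: sigmoid of the feature sum."""
    return 1.0 / (1.0 + math.exp(-sum(features)))


def waveform_concatenation(sentence, vowel):
    # Early fusion: join the raw signals, then encode and classify once.
    return classify(encode(sentence + vowel))


def intermediate_feature_fusion(sentence, vowel):
    # Mid fusion: encode each recording separately, join the feature vectors.
    return classify(encode(sentence) + encode(vowel))


def decision_level_combination(sentence, vowel):
    # Late fusion: classify each recording separately, average the scores.
    return 0.5 * (classify(encode(sentence)) + classify(encode(vowel)))
```

With real encoders, intermediate fusion lets the classifier see both feature vectors jointly, which is one plausible reading of why the abstract reports it as the best at capturing complementary information.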

[616] From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data

Chun-Yi Kuan,Hung-yi Lee

Main category: eess.AS

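TL;DR: The paper proposes BALSa, which addresses audio-language alignment through synthetic data generation, and introduces the LISTEN training method to reduce audio hallucinations and improve model performance.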

Details

Motivation: Existing audio-aware large language models (ALLMs) suffer from catastrophic forgetting and audio hallucinations when adapted to audio tasks, and cross-modal alignment depends on large amounts of resource-intensive data. Method: The backbone LLMs are used to synthesize general-purpose alignment data (BALSa), the contrastive-like LISTEN training method is introduced, and the approach is extended to multi-audio scenarios. Result: Experiments show that the method reduces audio hallucinations while preserving audio understanding and reasoning, and that multi-audio training improves performance further. Conclusion: BALSa offers an efficient and scalable approach to developing ALLMs. Abstract: Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. However, this adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where important textual capabilities such as instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about their reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making the process resource-intensive. To address these issues, we leverage the backbone LLMs from ALLMs to synthesize general-purpose caption-style alignment data. We refer to this process as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Building on BALSa, we introduce LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method designed to improve ALLMs' ability to distinguish between present and absent sounds. We further extend BALSa to multi-audio scenarios, where the model either explains the differences between audio inputs or produces a unified caption that describes them all, thereby enhancing audio-language alignment. Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance in audio understanding, reasoning, and instruction-following skills. Moreover, incorporating multi-audio training further enhances the model's comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to the development of ALLMs.

cs.SD [Back]

[617] Towards Reliable Large Audio Language Model

Ziyang Ma,Xiquan Li,Yakun Song,Wenxi Chen,Chenpeng Du,Jian Wu,Yuanzhe Chen,Zhuo Chen,Yuping Wang,Yuxuan Wang,Xie Chen

Main category: cs.SD

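TL;DR: This paper studies how to improve the reliability of large audio language models (LALMs), examining training-free and training-based methods and introducing a new evaluation metric, RGI.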

Details

Motivation: Existing LALMs cannot recognize their knowledge boundaries or proactively refuse to answer unknown questions, and their reliability is under-studied. Method: The authors study methods such as multi-modal chain-of-thought (MCoT) and supervised fine-tuning (SFT), and propose the Reliability Gain Index (RGI) as a new evaluation metric. Result: Both training-free and training-based methods improve the reliability of LALMs to different extents, and awareness of reliability transfers across audio modalities. Conclusion: Reliability is a "meta ability" of LALMs that transfers across modalities, although significant differences remain among audio modalities. Abstract: Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and to proactively refuse to answer questions they don't know. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). Besides, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliable methods. Our findings suggest that both training-free and training-based methods enhance the reliability of LALMs to different extents. Moreover, we find that awareness of reliability is a "meta ability", which can be transferred across different audio modalities, although significant structural and content differences exist among sound, music, and speech.

[618] Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks

Chang Liu,Haomin Zhang,Shiyu Xia,Zihao Chen,Chaofan Ding,Xin Yue,Huizhe Chen,Xinhan Di

Main category: cs.SD

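TL;DR: The paper introduces the CoP Benchmark Dataset, a multimodal benchmark built specifically for video-guided piano music generation, to address the shortcomings of existing evaluation datasets and metrics in capturing synchronization and complexity.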

Details

Motivation: Existing evaluation datasets do not fully capture the intricate synchronization required for piano music generation, and existing metrics do not reflect the complexity of video-to-piano interactions, so a comprehensive benchmark is needed to accelerate research on high-quality piano music generation. Method: Proposes the CoP Benchmark Dataset with detailed multimodal annotations, using step-by-step Chain-of-Perform guidance to achieve precise semantic and temporal alignment between video content and piano audio, together with a versatile evaluation framework. Result: The CoP Benchmark Dataset is fully open-sourced, including the dataset, annotations, and evaluation protocols, with a continuously updated leaderboard to promote research. Conclusion: The dataset provides a comprehensive evaluation tool for video-guided piano music generation and is expected to advance research in this area. Abstract: Generating high-quality piano audio from video requires precise synchronization between visual cues and musical output, ensuring accurate semantic and temporal alignment. However, existing evaluation datasets do not fully capture the intricate synchronization required for piano music generation. A comprehensive benchmark is essential for two primary reasons: (1) existing metrics fail to reflect the complexity of video-to-piano music interactions, and (2) a dedicated benchmark dataset can provide valuable insights to accelerate progress in high-quality piano music generation. To address these challenges, we introduce the CoP Benchmark Dataset, a fully open-sourced, multimodal benchmark designed specifically for video-guided piano music generation. The proposed Chain-of-Perform (CoP) benchmark offers several compelling features: (1) detailed multimodal annotations, enabling precise semantic and temporal alignment between video content and piano audio via step-by-step Chain-of-Perform guidance; (2) a versatile evaluation framework for rigorous assessment of both general-purpose and specialized video-to-piano generation tasks; and (3) full open-sourcing of the dataset, annotations, and evaluation protocols. The dataset is publicly available at https://github.com/acappemin/Video-to-Audio-and-Piano, with a continuously updated leaderboard to promote ongoing research in this domain.

cs.AI [Back]

Jiayi Zhou,Jiaming Ji,Boyuan Chen,Jiapeng Sun,Wenqi Chen,Donghai Hong,Sirui Han,Yike Guo,Yaodong Yang

Main category: cs.AI

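TL;DR: The paper proposes Generative RLHF-V, a framework that combines generative reward models (GRMs) with multi-modal RLHF and uses a two-stage pipeline to improve the alignment of multi-modal large language models (MLLMs).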

Details

Motivation: Traditional score-only reward models perform poorly in accuracy, generalization, and interpretability, which hinders progress in alignment methods such as RLHF. Method: A two-stage pipeline: (1) multi-modal generative reward modeling, in which reinforcement learning (RL) guides GRMs to actively capture human intention and predict pair-wise scores; (2) RL optimization from grouped comparison, which improves multi-modal RL scoring precision. Result: Experiments show that the framework improves the performance of 4 MLLMs across 7 benchmarks by 18.1%, versus 5.3% for baseline RLHF, and that performance improves near-linearly as the number of candidate responses grows. Conclusion: Generative RLHF-V addresses the limitations of traditional reward models and substantially improves MLLM alignment. Abstract: Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: $\textbf{multi-modal generative reward modeling from RL}$, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores; and $\textbf{RL optimization from grouped comparison}$, which enhances multi-modal RL scoring precision through grouped response comparison. Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by $18.1\%$, compared with only $5.3\%$ for the baseline RLHF. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses. Our code and models can be found at https://generative-rlhf-v.github.io.

[620] Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models

Min Cheng,Fatemeh Doudi,Dileep Kalathil,Mohammad Ghavamzadeh,Panganamala R. Kumar

Main category: cs.AI

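TL;DR: The paper proposes Diffusion Blend, which blends the backward diffusion processes of fine-tuned models to align with multiple objectives and user preferences at inference time, without additional fine-tuning.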

Details

Motivation: Existing reinforcement learning methods fine-tune diffusion models against a single reward function, which makes it hard to balance conflicting objectives and diverse user preferences. Method: Proposes Diffusion Blend with two algorithms: DB-MPA for multi-reward alignment and DB-KLA for KL regularization control. Result: Experiments show that Diffusion Blend outperforms baselines and matches or exceeds individually fine-tuned models. Conclusion: Diffusion Blend enables efficient, user-driven inference-time alignment and addresses conflicting objectives and preference diversity. Abstract: Reinforcement learning (RL) algorithms have been used recently to align diffusion models with downstream objectives such as aesthetic quality and text-image consistency by fine-tuning them to maximize a single reward function under a fixed KL regularization. However, this approach is inherently restrictive in practice, where alignment must balance multiple, often conflicting objectives. Moreover, user preferences vary across prompts, individuals, and deployment contexts, with varying tolerances for deviation from a pre-trained base model. We address the problem of inference-time multi-preference alignment: given a set of basis reward functions and a reference KL regularization strength, can we design a fine-tuning procedure so that, at inference time, it can generate images aligned with any user-specified linear combination of rewards and regularization, without requiring additional fine-tuning? We propose Diffusion Blend, a novel approach to solve inference-time multi-preference alignment by blending backward diffusion processes associated with fine-tuned models, and we instantiate this approach with two algorithms: DB-MPA for multi-reward alignment and DB-KLA for KL regularization control. Extensive experiments show that Diffusion Blend algorithms consistently outperform relevant baselines and closely match or exceed the performance of individually fine-tuned models, enabling efficient, user-driven alignment at inference time. The code is available at https://github.com/bluewoods127/DB-2025.
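
The blending idea in the abstract above can be illustrated with a small sketch. It assumes that each fine-tuned model contributes a per-step drift (or noise-prediction) vector and that the user preference is a set of non-negative weights; the function name `blend_drifts` and the normalized weighted average are illustrative assumptions, not the DB-MPA or DB-KLA update rules, which the abstract does not spell out.

```python
def blend_drifts(drifts, weights):
    """Blend per-model drift vectors with user-specified preference weights.

    drifts  : list of equal-length lists, one per fine-tuned model (hypothetical inputs)
    weights : list of non-negative floats, one per model
    Returns the weighted average drift for a single reverse-diffusion step.
    """
    if not drifts or len(drifts) != len(weights):
        raise ValueError("need one weight per drift vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]
    dim = len(drifts[0])
    # Component-wise weighted sum across models.
    return [sum(nw * d[i] for nw, d in zip(normalized, drifts)) for i in range(dim)]


# Two models (say, one tuned for aesthetics, one for text-image consistency),
# blended with a 3:1 preference.
blended = blend_drifts([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
```

Because the weights are supplied at call time, the same set of fine-tuned models can serve different preferences without retraining, which is the property the abstract emphasizes.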

Ye Mo,Zirui Shao,Kai Ye,Xianwei Mao,Bo Zhang,Hangdi Xing,Peng Ye,Gang Huang,Kehan Chen,Zhou Huan,Zixu Yan,Sheng Zhou

Main category: cs.AI

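TL;DR: Doc-CoB mimics human reading patterns with a chain-style visual reasoning mechanism and significantly improves the performance of multimodal large language models on document understanding tasks.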

Details

Motivation: Existing MLLMs fail to focus on query-relevant regions when processing document images, which leads to redundant information and unreliable responses. Method: Introduces the Doc-CoB mechanism, generates training data with the help of a layout analyzer, and uses two auxiliary tasks to improve region selection and reasoning. Result: On seven benchmarks, Doc-CoB significantly improves the performance of four popular models. Conclusion: Doc-CoB is a simple and effective solution applicable to a wide range of document understanding tasks. Abstract: Multimodal large language models (MLLMs) have made significant progress in document understanding. However, the information-dense nature of document images still poses challenges, as most queries depend on only a few relevant regions, with the rest being redundant. Existing one-pass MLLMs process entire document images without considering query relevance, often failing to focus on critical regions and producing unfaithful responses. Inspired by the human coarse-to-fine reading pattern, we introduce Doc-CoB (Chain-of-Box), a simple-yet-effective mechanism that integrates human-style visual reasoning into MLLMs without modifying their architecture. Our method allows the model to autonomously select the set of regions (boxes) most relevant to the query, and then focus attention on them for further understanding. We first design a fully automatic pipeline, integrating a commercial MLLM with a layout analyzer, to generate 249k training samples with intermediate visual reasoning supervision. Then we incorporate two enabling tasks that improve box identification and box-query reasoning, which together enhance document understanding. Extensive experiments on seven benchmarks with four popular models show that Doc-CoB significantly improves performance, demonstrating its effectiveness and wide applicability. All code, data, and models will be released publicly.

[622] Pedagogy-R1: Pedagogically-Aligned Reasoning Model with Balanced Educational Benchmark

Unggi Lee,Jaeyong Lee,Jiyeong Bae,Yeil Jeong,Junbo Koh,Gyeonggeon Lee,Gunho Lee,Taekyung Ahn,Hyeoncheol Kim

Main category: cs.AI

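TL;DR: The paper proposes the Pedagogy-R1 framework, which improves the teaching behavior of large reasoning models through a distillation pipeline, an educational benchmark, and a prompting strategy.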

Details

Motivation: Large reasoning models perform well in structured domains such as mathematics and programming but lack pedagogical coherence and realistic teaching behaviors. Method: Uses a distillation pipeline to filter and refine model outputs, designs the WBEB educational benchmark to evaluate performance along multiple dimensions, and applies the CoP prompting strategy to generate teacher-style reasoning. Result: Quantitative and qualitative evaluation yields a systematic analysis of the pedagogical strengths and limitations of large reasoning models. Conclusion: The Pedagogy-R1 framework improves the teaching ability of large reasoning models and addresses a gap in their application to education. Abstract: Recent advances in large reasoning models (LRMs) show strong performance in structured domains such as mathematics and programming; however, they often lack pedagogical coherence and realistic teaching behaviors. To bridge this gap, we introduce Pedagogy-R1, a framework that adapts LRMs for classroom use through three innovations: (1) a distillation-based pipeline that filters and refines model outputs for instruction-tuning, (2) the Well-balanced Educational Benchmark (WBEB), which evaluates performance across subject knowledge, pedagogical knowledge, tracing, essay scoring, and teacher decision-making, and (3) a Chain-of-Pedagogy (CoP) prompting strategy for generating and eliciting teacher-style reasoning. Our mixed-method evaluation combines quantitative metrics with qualitative analysis, providing the first systematic assessment of LRMs' pedagogical strengths and limitations.

[623] Knowledge Grafting of Large Language Models

Guodong Du,Xuanning Zhou,Junlin Li,Zhuo Li,Zesheng Shi,Wanyu Lin,Ho-Kin Tang,Xiucheng Li,Fangming Liu,Wenya Wang,Min Zhang,Jing Li

Main category: cs.AI

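TL;DR: GraftLLM proposes a new cross-capability transfer method that stores source-model capabilities in a SkillPack format, addressing the limitations of existing methods on large, heterogeneous models.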

Details

Motivation: Existing methods mainly target small, homogeneous models, are hard to apply to large, heterogeneous models, and suffer from parameter conflicts and catastrophic forgetting. Method: Stores capabilities in a SkillPack format and combines this with a module-aware adaptive compression strategy for efficient storage and retention of task-specific knowledge. Result: Experiments show that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning. Conclusion: GraftLLM provides a scalable and efficient solution for cross-capability transfer. Abstract: Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model's intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model capabilities in a target model in a SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy to compress parameter updates, ensuring efficient storage while maintaining task-specific knowledge. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for heterogeneous model fusion and continual learning. Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer. The code is publicly available at: https://github.com/duguodong7/GraftLLM.

[624] RvLLM: LLM Runtime Verification with Domain Knowledge

Yedi Zhang,Sun Yi Emma,Annabelle Lee Jia En,Jin Song Dong

Main category: cs.AI

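TL;DR: Proposes an LLM error detection method that incorporates domain knowledge, with a specification language, ESL, and a runtime verification framework, RvLLM, that effectively detect erroneous LLM outputs.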

Details

Motivation: LLMs may produce inconsistent or erroneous outputs when generating text; existing work focuses on general-purpose scenarios and overlooks the integration of domain knowledge. Method: Designs the specification language ESL and the runtime verification framework RvLLM, which let domain experts define custom constraints and verify LLM outputs against them. Result: RvLLM is validated on three tasks, detecting LLM errors in a lightweight and flexible manner. Conclusion: LLMs remain prone to low-level errors, and RvLLM offers a potential long-term solution by drawing on domain knowledge. Abstract: Large language models (LLMs) have emerged as a dominant AI paradigm due to their exceptional text understanding and generation capabilities. However, their tendency to generate inconsistent or erroneous outputs challenges their reliability, especially in high-stakes domains requiring accuracy and trustworthiness. Existing research primarily focuses on detecting and mitigating model misbehavior in general-purpose scenarios, often overlooking the potential of integrating domain-specific knowledge. In this work, we advance misbehavior detection by incorporating domain knowledge. The core idea is to design a general specification language that enables domain experts to customize domain-specific predicates in a lightweight and intuitive manner, supporting later runtime verification of LLM outputs. To achieve this, we design a novel specification language, ESL, and introduce a runtime verification framework, RvLLM, to validate LLM output against domain-specific constraints defined in ESL. We evaluate RvLLM on three representative tasks: violation detection against Singapore Rapid Transit Systems Act, numerical comparison, and inequality solving. Experimental results demonstrate that RvLLM effectively detects erroneous outputs across various LLMs in a lightweight and flexible manner. The results reveal that despite their impressive capabilities, LLMs remain prone to low-level errors due to limited interpretability and a lack of formal guarantees during inference, and our framework offers a potential long-term solution by leveraging expert domain knowledge to rigorously and efficiently verify LLM outputs.
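
To make the idea of checking LLM output against an expert-defined predicate concrete, here is a minimal sketch for the numerical comparison task mentioned above. It is written in plain Python rather than ESL, whose syntax the abstract does not show; the claim format `(left, operator, right)` and the names `violates` and `verify` are assumptions made for illustration only.

```python
from decimal import Decimal
import operator

# Hypothetical predicate table: maps a comparison symbol to its exact check.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def violates(claim):
    """Return True if a (left, op, right) claim extracted from LLM output is false."""
    left, op, right = claim
    # Decimal avoids binary floating-point surprises when comparing literals.
    return not OPS[op](Decimal(left), Decimal(right))


def verify(claims):
    """Return the claims that fail the numerical-comparison predicate."""
    return [c for c in claims if violates(c)]


# Claims as they might be extracted from a model response.
flagged = verify([("9.11", ">", "9.9"), ("2", "<", "3")])
```

Here only the first claim is flagged, since 9.11 is not greater than 9.9. A real deployment would also need a step that extracts such claims from free text, which this sketch leaves out.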

[625] CardioCoT: Hierarchical Reasoning for Multimodal Survival Analysis

Shaohao Rui,Haoyang Su,Jinyi Xiang,Lian-Ming Wu,Xiaosong Wang

Main category: cs.AI

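TL;DR: CardioCoT is a two-stage, hierarchical reasoning-enhanced survival analysis framework that aims to improve interpretability and predictive performance when predicting the recurrence risk of major adverse cardiovascular events in patients with acute myocardial infarction.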

Details

Motivation: Existing methods focus on risk stratification and overlook the clinical need for robust intermediate reasoning and model interpretability; end-to-end risk prediction is also difficult because of data limitations and modeling complexity. Method: CardioCoT uses two stages: the first generates robust hierarchical reasoning trajectories through an evidence-augmented self-refinement mechanism; the second combines the reasoning trajectories with imaging data for risk model training and prediction. Result: CardioCoT shows superior performance in MACE recurrence risk prediction and provides interpretable reasoning processes. Conclusion: CardioCoT offers valuable insights for clinical decision-making while improving both predictive performance and interpretability. Abstract: Accurate prediction of major adverse cardiovascular events recurrence risk in acute myocardial infarction patients based on postoperative cardiac MRI and associated clinical notes is crucial for precision treatment and personalized intervention. Existing methods primarily focus on risk stratification capability while overlooking the need for intermediate robust reasoning and model interpretability in clinical practice. Moreover, end-to-end risk prediction using LLM/VLM faces significant challenges due to data limitations and modeling complexity. To bridge this gap, we propose CardioCoT, a novel two-stage hierarchical reasoning-enhanced survival analysis framework designed to enhance both model interpretability and predictive performance. In the first stage, we employ an evidence-augmented self-refinement mechanism to guide LLM/VLMs in generating robust hierarchical reasoning trajectories based on associated radiological findings. In the second stage, we integrate the reasoning trajectories with imaging data for risk model training and prediction. CardioCoT demonstrates superior performance in MACE recurrence risk prediction while providing interpretable reasoning processes, offering valuable insights for clinical decision-making.

[626] AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting

Shijue Huang,Hongru Wang,Wanjun Zhong,Zhaochen Su,Jiazhan Feng,Bowen Cao,Yi R. Fung

Main category: cs.AI

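TL;DR: AdaCtrl is a framework that dynamically adjusts reasoning length, balancing efficiency and effectiveness through adaptive budget allocation and user control.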

Details

Motivation: Modern large reasoning models generate lengthy reasoning chains even for simple problems and struggle to balance efficiency and effectiveness. Method: Two-stage training, with a cold-start fine-tuning phase followed by a difficulty-aware reinforcement learning phase, combined with explicit length-triggered tags for user control. Result: Across multiple datasets, AdaCtrl reduces reasoning length by 10.06% to 91.04% while improving performance. Conclusion: AdaCtrl improves both efficiency and effectiveness through adaptive reasoning and user control. Abstract: Modern large reasoning models demonstrate impressive problem-solving capabilities by employing sophisticated reasoning strategies. However, they often struggle to balance efficiency and effectiveness, frequently generating unnecessarily lengthy reasoning chains for simple problems. In this work, we propose AdaCtrl, a novel framework to support both difficulty-aware adaptive reasoning budget allocation and explicit user control over reasoning depth. AdaCtrl dynamically adjusts its reasoning length based on self-assessed problem difficulty, while also allowing users to manually control the budget to prioritize either efficiency or effectiveness. This is achieved through a two-stage training pipeline: an initial cold-start fine-tuning phase to instill the ability to self-assess difficulty and adjust the reasoning budget, followed by a difficulty-aware reinforcement learning (RL) stage that refines the model's adaptive reasoning strategies and calibrates its difficulty assessments based on its evolving capabilities during online training. To enable intuitive user interaction, we design explicit length-triggered tags that function as a natural interface for budget control. Empirical results show that AdaCtrl adapts reasoning length based on estimated difficulty. Compared to the standard training baseline that also incorporates fine-tuning and RL, it yields performance improvements and simultaneously reduces response length by 10.06% and 12.14% on the more challenging AIME2024 and AIME2025 datasets, which require elaborate reasoning, and by 62.05% and 91.04% on the MATH500 and GSM8K datasets, where more concise responses are sufficient. Furthermore, AdaCtrl enables precise user control over the reasoning budget, allowing for tailored responses to meet specific needs.

[627] Signal, Image, or Symbolic: Exploring the Best Input Representation for Electrocardiogram-Language Models Through a Unified Framework

William Han,Chaojing Duan,Zhepeng Cen,Yihang Yao,Xiaoyu Song,Atharva Mhaskar,Dylan Leong,Michael A. Rosenberg,Emerson Liu,Ding Zhao

Main category: cs.AI

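TL;DR: The paper asks which ECG input representation works best for Electrocardiogram-Language Models (ELMs), compares three candidates (raw time-series signals, rendered images, and discretized symbolic sequences), and finds that symbolic representations achieve the most statistically significant wins.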

Details

Motivation: Traditional classification-based ECG interpretation systems cannot reproduce the free-form textual responses of expert cardiac electrophysiologists, so the most effective ECG input representation must be identified to optimize ELM performance. Method: Benchmarks three ECG input representations (raw signals, images, and symbolic sequences) across 6 public datasets and 5 evaluation metrics, and analyzes factors such as the LLM backbone, ECG duration, and token budget. Result: Symbolic representations achieve the most statistically significant wins over raw signal and image inputs; robustness to signal perturbations is also evaluated. Conclusion: The findings offer clear guidance for choosing input representations when developing next-generation ELMs, with symbolic sequences as the strongest option. Abstract: Recent advances have increasingly applied large language models (LLMs) to electrocardiogram (ECG) interpretation, giving rise to Electrocardiogram-Language Models (ELMs). Conditioned on an ECG and a textual query, an ELM autoregressively generates a free-form textual response. Unlike traditional classification-based systems, ELMs emulate expert cardiac electrophysiologists by issuing diagnoses, analyzing waveform morphology, identifying contributing factors, and proposing patient-specific action plans. To realize this potential, researchers are curating instruction-tuning datasets that pair ECGs with textual dialogues and are training ELMs on these resources. Yet before scaling ELMs further, there is a fundamental question yet to be explored: What is the most effective ECG input representation? In recent works, three candidate representations have emerged: raw time-series signals, rendered images, and discretized symbolic sequences. We present the first comprehensive benchmark of these modalities across 6 public datasets and 5 evaluation metrics. We find symbolic representations achieve the greatest number of statistically significant wins over both signal and image inputs. We further ablate the LLM backbone, ECG duration, and token budget, and we evaluate robustness to signal perturbations. We hope that our findings offer clear guidance for selecting input representations when developing the next generation of ELMs.

[628] Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments

Mario Leiva,Noel Ngu,Joshua Shay Kricheli,Aditya Taparia,Ransalu Senanayake,Paulo Shakarian,Nathaniel Bastian,John Corcoran,Gerardo Simari

Main category: cs.AI

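TL;DR: The paper proposes a consistency-based abduction method that integrates the predictions of multiple pre-trained models to cope with distribution shift, yielding substantial performance gains.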

Details

Motivation: Pre-trained perception models often degrade in novel environments because of distribution shift, and existing methods tend to sacrifice recall when improving precision; the paper hypothesizes that using multiple models can mitigate this. Method: Encodes model predictions and error detection rules as a logic program and seeks an abductive explanation (a subset of predictions) that maximizes coverage while keeping the rate of logical inconsistencies under control; proposes two algorithms, one based on integer programming and one on heuristic search. Result: On a simulated aerial imagery dataset, the method achieves average relative improvements of about 13.6% in F1-score and 16.6% in accuracy over the best individual model, and also outperforms standard ensemble baselines. Conclusion: Consistency-based abduction is an effective mechanism for integrating knowledge from multiple models in challenging, novel scenarios. Abstract: The deployment of pre-trained perception models in novel environments often leads to performance degradation due to distributional shifts. Although recent artificial intelligence approaches for metacognition use logical rules to characterize and filter model errors, improving precision often comes at the cost of reduced recall. This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem. The input predictions and the learned error detection rules derived from each model are encoded in a logic program. We then seek an abductive explanation (a subset of model predictions) that maximizes prediction coverage while ensuring the rate of logical inconsistencies (derived from domain constraints) remains below a specified threshold. We propose two algorithms for this knowledge representation task: an exact method based on Integer Programming (IP) and an efficient Heuristic Search (HS). Through extensive experiments on a simulated aerial imagery dataset featuring controlled, complex distributional shifts, we demonstrate that our abduction-based framework outperforms individual models and standard ensemble baselines, achieving, for instance, average relative improvements of approximately 13.6% in F1-score and 16.6% in accuracy across 15 diverse test datasets when compared to the best individual model. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect reasoners in challenging, novel scenarios.
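
The selection problem described above (pick the largest subset of predictions whose inconsistency rate stays within a threshold) can be sketched with an exhaustive search on a toy instance. This is neither the paper's integer program nor its heuristic search; the tuple format `(model, object, label)`, the single domain constraint in `conflicts`, and the definition of the inconsistency rate as the fraction of selected predictions involved in a conflict are all assumptions made for illustration.

```python
from itertools import combinations


def conflicts(a, b):
    """Hypothetical domain constraint: one object cannot carry two different labels."""
    return a[1] == b[1] and a[2] != b[2]


def inconsistency_rate(subset):
    """Fraction of predictions in the subset that conflict with another member."""
    if not subset:
        return 0.0
    bad = sum(1 for p in subset if any(conflicts(p, q) for q in subset))
    return bad / len(subset)


def best_explanation(predictions, max_rate):
    """Return a largest subset whose inconsistency rate does not exceed max_rate.

    Exhaustive search, exponential in the number of predictions: toy sizes only.
    """
    for size in range(len(predictions), -1, -1):
        for subset in combinations(predictions, size):
            if inconsistency_rate(subset) <= max_rate:
                return list(subset)
    return []


predictions = [
    ("m1", "obj1", "car"),
    ("m2", "obj1", "truck"),  # disagrees with m1 about obj1
    ("m1", "obj2", "tree"),
]
explanation = best_explanation(predictions, max_rate=0.0)
```

With a zero threshold, one of the two conflicting predictions about `obj1` has to be dropped; raising `max_rate` trades consistency for coverage, which is the precision and recall tension the abstract describes.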

[629] Meta-aware Learning in text-to-SQL Large Language Model

Wenda Zhang

Main category: cs.AI

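TL;DR: This paper proposes a meta-aware learning framework that integrates domain knowledge, database schema, chain-of-thought reasoning, and metadata relationships to improve LLM performance on text-to-SQL tasks.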

Details

Motivation: Use advances in LLMs to address the challenge of understanding complex domain information and database structures in text-to-SQL tasks. Method: Proposes a framework with four learning strategies: schema-based learning, chain-of-thought learning, knowledge-enhanced learning, and key information tokenization. Result: Experiments show the approach is superior in execution accuracy, multi-task SQL generation capability, and reduction of catastrophic forgetting. Conclusion: The framework improves LLM performance on SQL generation in business domains. Abstract: Advances in large language models (LLMs) offer text-to-SQL tasks an opportunity to overcome a central challenge in business applications: understanding complex domain information and complex database structures. In this paper, we propose a meta-aware learning framework to integrate domain knowledge, database schema, chain-of-thought reasoning processes, and metadata relationships to improve the SQL generation quality. The proposed framework includes four learning strategies: schema-based learning, Chain-of-Thought (CoT) learning, knowledge-enhanced learning, and key information tokenization. Through fine-tuning, this approach gives the LLM a comprehensive understanding of database structure and metadata information and improves its performance on SQL generation within business domains. Through two experimental studies, we have demonstrated the superiority of the proposed methods in execution accuracy, multi-task SQL generation capability, and reduction of catastrophic forgetting.

[630] DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving

Anqing Jiang,Yu Gao,Zhigang Sun,Yiru Wang,Jijun Wang,Jinghao Chai,Qian Cao,Yuweng Heng,Hao Jiang,Zongzheng Zhang,Xianda Guo,Hao Sun,Hao Zhao

Main category: cs.AI

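TL;DR: The paper proposes Diff-VLA, a hybrid sparse-dense diffusion policy combined with a vision-language model (VLM), to address high computational cost, limited action diversity, and sub-optimal decisions in complex scenarios in end-to-end autonomous driving.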

Details

Motivation: End-to-end autonomous driving has great potential thanks to its fully differentiable design, but existing methods suffer from high computational cost, insufficient action diversity, and sub-optimal decisions in complex scenarios. Method: Proposes Diff-VLA, which uses a sparse diffusion representation for efficient multi-modal driving behavior and improves trajectory generation through deep interaction between the VLM and map instances. Result: Performs strongly in the Autonomous Grand Challenge 2025, achieving 45.0 PDMS. Conclusion: Diff-VLA shows superior performance in complex scenarios and offers a new direction for end-to-end autonomous driving. Abstract: Research interest in end-to-end autonomous driving has surged owing to its fully differentiable design integrating modular tasks, i.e., perception, prediction, and planning, which enables optimization in pursuit of the ultimate goal. Despite the great potential of the end-to-end paradigm, existing methods have several shortcomings, including expensive BEV (bird's eye view) computation, limited action diversity, and sub-optimal decisions in complex real-world scenarios. To address these challenges, we propose a novel hybrid sparse-dense diffusion policy, empowered by a Vision-Language Model (VLM), called Diff-VLA. We explore the sparse diffusion representation for efficient multi-modal driving behavior. Moreover, we rethink the effectiveness of VLM driving decisions and improve the trajectory generation guidance through deep interaction across agent, map instances, and VLM output. Our method shows superior performance in the Autonomous Grand Challenge 2025, which contains challenging real and reactive synthetic scenarios. Our method achieves 45.0 PDMS.

[631] Can Large Language Models Infer Causal Relationships from Real-World Text?

Ryan Saklad,Aman Chadha,Oleg Pavlov,Raha Moraffah

Main category: cs.AI

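TL;DR: The paper studies whether large language models (LLMs) can infer causal relationships from real-world text, presents the first real-world dataset as a benchmark for this task, and identifies the main challenges LLMs face.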

Details

Motivation: Existing work focuses on simple causal relationships in synthetic text, which does not reflect the complexity of real tasks, so the causal reasoning ability of LLMs on real-world text needs to be examined. Method: Builds a diverse benchmark dataset from real-world academic literature and evaluates state-of-the-art LLMs on it. Result: The best model reaches an average F1 score of only 0.477; common problems include handling implicit information, distinguishing relevant causal factors, and connecting information spread across long texts. Conclusion: The benchmark provides targeted directions for improving LLM causal reasoning. Abstract: Understanding and inferring causal relationships from texts is a core aspect of human cognition and is essential for advancing large language models (LLMs) towards artificial general intelligence. Existing work primarily focuses on synthetically generated texts which involve simple causal relationships explicitly mentioned in the text. This fails to reflect the complexities of real-world tasks. In this paper, we investigate whether LLMs are capable of inferring causal relationships from real-world texts. We develop a benchmark drawn from real-world academic literature which includes diverse texts with respect to length, complexity of relationships (different levels of explicitness, number of events, and causal relationships), and domains and sub-domains. To the best of our knowledge, our benchmark is the first-ever real-world dataset for this task. Our experiments on state-of-the-art LLMs evaluated on our proposed benchmark demonstrate significant challenges, with the best-performing model achieving an average F1 score of only 0.477. Analysis reveals common pitfalls: difficulty with implicitly stated information, with distinguishing relevant causal factors from surrounding contextual details, and with connecting causally relevant information spread across lengthy textual passages. By systematically characterizing these deficiencies, our benchmark offers targeted insights for further research into advancing LLM causal reasoning.

[632] REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing

Haitian Zhong,Yuhuan Liu,Ziyang Xu,Guofan Liu,Qiang Liu,Shu Wu,Zhe Zhao,Liang Wang,Tieniu Tan

Main category: cs.AI

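TL;DR: The REACT framework uses a two-phase approach to address overfitting in large language model editing, enabling precise and controllable knowledge editing.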

Details

Motivation: Large language model editing methods often overfit, so that knowledge updates propagate beyond their intended scope; a more precise and controllable editing method is needed. Method: REACT has two phases: it extracts latent factual representations and computes a "belief shift" vector, then applies edits through controllable perturbations gated by a pre-trained classifier. Result: Experiments show that REACT significantly reduces overfitting and preserves balanced editing performance across diverse editing scenarios. Conclusion: REACT provides an efficient and controllable solution for knowledge editing. Abstract: Large language model editing methods frequently suffer from overfitting, wherein factual updates can propagate beyond their intended scope, overemphasizing the edited target even when it's contextually inappropriate. To address this challenge, we introduce REACT (Representation Extraction And Controllable Tuning), a unified two-phase framework designed for precise and controllable knowledge editing. In the initial phase, we utilize tailored stimuli to extract latent factual representations and apply Principal Component Analysis with a simple learnable linear transformation to compute a directional "belief shift" vector for each instance. In the second phase, we apply controllable perturbations to hidden states using the obtained vector with a magnitude scalar, gated by a pre-trained classifier that permits edits only when contextually necessary. Relevant experiments on EVOKE benchmarks demonstrate that REACT significantly reduces overfitting across nearly all evaluation metrics, and experiments on COUNTERFACT and MQuAKE show that our method preserves balanced basic editing performance (reliability, locality, and generality) under diverse editing scenarios.
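
The second phase described above (a scaled shift of the hidden state, applied only when a classifier allows it) can be sketched as follows. This is a minimal illustration under stated assumptions: the hidden state and "belief shift" vector are plain Python lists, the gate is a single probability compared against a hypothetical `threshold`, and the function name `apply_gated_edit` is invented here. How REACT actually computes the vector or trains the classifier is not reproduced.

```python
def apply_gated_edit(hidden, shift, alpha, gate_prob, threshold=0.5):
    """Perturb a hidden state along a belief-shift direction when the gate allows it.

    hidden    : list of floats, the hidden state (assumed representation)
    shift     : list of floats, the directional belief-shift vector
    alpha     : float, magnitude scalar controlling edit strength
    gate_prob : float in [0, 1], classifier's estimate that the edit is relevant
    threshold : hypothetical cut-off below which the hidden state is left untouched
    """
    if len(hidden) != len(shift):
        raise ValueError("hidden state and shift vector must have the same length")
    if gate_prob < threshold:
        return list(hidden)  # context judged unrelated: no edit applied
    return [h + alpha * s for h, s in zip(hidden, shift)]


edited = apply_gated_edit([1.0, 2.0], [0.5, -1.0], alpha=2.0, gate_prob=0.9)
untouched = apply_gated_edit([1.0, 2.0], [0.5, -1.0], alpha=2.0, gate_prob=0.1)
```

The gate is what limits the overfitting described in the abstract: prompts the classifier judges unrelated pass through unchanged, so the edit does not leak into contexts where it is inappropriate.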

[633] FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

Atsunori Moteki,Shoichi Masui,Fan Yang,Yueqi Song,Yonatan Bisk,Graham Neubig,Ikuo Kusajima,Yasuto Watanabe,Hiroyuki Ishida,Jun Takahashi,Shan Jiang

Main category: cs.AI

TL;DR: FieldWorkArena is an agentic AI benchmark for real-world field work that fills the gap left by existing benchmarks in complex real-world environments.

Details

Motivation: Existing agentic AI benchmarks mainly target web tasks and cannot cover the complex tasks found in real work environments. Method: Defines a new action space and improves the evaluation function; builds the dataset from on-site videos and real documents, with tasks designed from interviews with workers and managers. Result: Confirms that performance evaluation of multimodal LLMs such as GPT-4o is feasible and identifies the effectiveness and limitations of the new evaluation method. Conclusion: FieldWorkArena provides an effective tool for evaluating agentic AI in real work environments; the dataset and evaluation program are publicly available. Abstract: This paper proposes FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, such agents are required to monitor and report safety and health incidents, as well as manufacturing-related incidents, that may occur in real-world work environments. Existing agentic AI benchmarks have been limited to evaluating web tasks and are insufficient for evaluating agents in real-world work environments, where complexity increases significantly. In this paper, we define a new action space that agentic AI should possess for real-world work environment benchmarks and improve the evaluation function from previous methods to assess the performance of agentic AI in diverse real-world tasks. The dataset consists of videos captured on-site and documents actually used in factories and warehouses, and tasks were created based on interviews with on-site workers and managers. Evaluation results confirmed that performance evaluation considering the characteristics of Multimodal LLMs (MLLMs) such as GPT-4o is feasible. Additionally, the effectiveness and limitations of the proposed new evaluation method were identified. The complete dataset (HuggingFace) and evaluation program (GitHub) can be downloaded from the following website: https://en-documents.research.global.fujitsu.com/fieldworkarena/.

[634] Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Jaemin Kim,Hangeol Chang,Hyunmin Hwang,Choonghan Kim,Jong Chul Ye

Main category: cs.AI

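TL;DR: UniR is a lightweight, composable reasoning module that can be combined with any frozen LLM to improve its reasoning ability without retraining the LLM.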

Details

Motivation: Address the high computational cost and loss of generalization that come with improving LLM reasoning, and the limitation that PEFT methods must be retrained for each LLM backbone. Method: UniR decomposes the reward into an independently trained reasoning module, turns trajectory-level signals into token-level guidance through predefined rewards, and adds its logits to those of the LLM. Result: On mathematical reasoning and machine translation tasks, UniR significantly outperforms baseline methods and shows strong generalization from small to large models. Conclusion: UniR is an efficient, flexible, and robust solution that enhances reasoning without compromising the LLM's core capabilities. Abstract: Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise their generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, here we propose Universal Reasoner (UniR) - a single, lightweight, composable, and plug-and-play reasoning module that can be used with any frozen LLM to endow it with specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module that is trained independently using predefined rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR can be combined with any frozen LLM at inference time by simply adding its output logits to those of the LLM backbone. This additive structure naturally enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Experimental results on mathematical reasoning and machine translation tasks show that UniR significantly outperforms existing baseline fine-tuning methods using the Llama3.2 model. Furthermore, UniR demonstrates strong weak-to-strong generalization: reasoning modules trained on smaller models effectively guide much larger LLMs. This makes UniR a cost-efficient, adaptable, and robust solution for enhancing reasoning in LLMs without compromising their core capabilities. Code is open-sourced at https://github.com/hangeol/UniR
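
The additive combination described above (module logits summed with the frozen backbone's logits at inference time) can be sketched in a few lines. This is an illustration only: the three-token vocabulary and the logit values are made up, `combine_logits` is a hypothetical helper, and the optional per-module `weights` argument is an assumption that goes beyond the plain summation the abstract describes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(backbone_logits, module_logits_list, weights=None):
    """Add the logits of one or more reasoning modules to the frozen backbone's logits.

    weights is an illustrative extension; with the default of 1.0 per module
    this reduces to plain summation.
    """
    if weights is None:
        weights = [1.0] * len(module_logits_list)
    combined = list(backbone_logits)
    for w, module_logits in zip(weights, module_logits_list):
        if len(module_logits) != len(combined):
            raise ValueError("all logit vectors must share one vocabulary size")
        combined = [c + w * m for c, m in zip(combined, module_logits)]
    return combined


backbone = [2.0, 1.0, 0.0]      # made-up next-token logits from the frozen LLM
math_module = [0.0, 2.0, 0.0]   # made-up logits from a reasoning module
combined = combine_logits(backbone, [math_module])
probs = softmax(combined)
next_token = max(range(len(probs)), key=probs.__getitem__)
```

In this toy case the module shifts the greedy choice from token 0 to token 1. Passing several modules in `module_logits_list` corresponds to the composition case, and the shared vocabulary check reflects the practical requirement that the logit vectors be aligned before they are added.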

[635] GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling

Jialong Zhou,Lichao Wang,Xiao Yang

Main category: cs.AI

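TL;DR: GUARDIAN is a unified method for detecting and mitigating safety problems in multi-agent collaboration. It models the collaboration as a temporal attributed graph and uses an unsupervised encoder-decoder architecture to provide efficient safeguarding.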

Details

Motivation: Large language models (LLMs) enable complex multi-agent dialogues but face safety challenges such as hallucination amplification and error propagation, which call for a solution. Method: Models multi-agent collaboration as a discrete-time temporal attributed graph, uses an unsupervised encoder-decoder architecture with incremental training, and compresses the graph structure using Information Bottleneck Theory. Result: Experiments show that GUARDIAN guards effectively against a range of safety vulnerabilities, reaching state-of-the-art accuracy with efficient resource use. Conclusion: GUARDIAN provides reliable safety protection for multi-agent collaboration and has practical value. Abstract: The emergence of large language models (LLMs) enables the development of intelligent agents capable of engaging in complex and multi-turn dialogues. However, multi-agent collaboration faces critical safety challenges, such as hallucination amplification and error injection and propagation. This paper presents GUARDIAN, a unified method for detecting and mitigating multiple safety concerns in GUARDing Intelligent Agent collaboratioNs. By modeling the multi-agent collaboration process as a discrete-time temporal attributed graph, GUARDIAN explicitly captures the propagation dynamics of hallucinations and errors. The unsupervised encoder-decoder architecture, which incorporates an incremental training paradigm, learns to reconstruct node attributes and graph structures from latent embeddings, enabling the identification of anomalous nodes and edges with unparalleled precision. Moreover, we introduce a graph abstraction mechanism based on the Information Bottleneck Theory, which compresses temporal interaction graphs while preserving essential patterns. Extensive experiments demonstrate GUARDIAN's effectiveness in safeguarding LLM multi-agent collaborations against diverse safety vulnerabilities, achieving state-of-the-art accuracy with efficient resource utilization.
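
The idea of identifying anomalous nodes from reconstruction quality can be sketched as below. This is a hedged illustration only: the abstract does not specify the scoring or thresholding rule, so the squared-error score, the `mean + k * std` threshold, the agent names, and the attribute vectors are all assumptions, and the "reconstructed" values stand in for what a trained decoder would output.

```python
from statistics import mean, pstdev


def reconstruction_errors(original, reconstructed):
    """Squared L2 error between each node's attributes and their reconstruction."""
    return {
        node: sum((a - b) ** 2 for a, b in zip(original[node], reconstructed[node]))
        for node in original
    }


def flag_anomalies(errors, k=1.0):
    """Flag nodes whose error exceeds mean + k * std (an assumed rule)."""
    values = list(errors.values())
    threshold = mean(values) + k * pstdev(values)
    return sorted(node for node, err in errors.items() if err > threshold)


# Hypothetical node attributes for four agents at one time step.
original = {
    "agent_a": [1.0, 0.0],
    "agent_b": [0.0, 1.0],
    "agent_c": [1.0, 1.0],
    "agent_d": [0.0, 0.0],
}
# Stand-in for decoder output; agent_c is reconstructed poorly.
reconstructed = {
    "agent_a": [1.0, 0.0],
    "agent_b": [0.0, 1.0],
    "agent_c": [0.0, 0.0],
    "agent_d": [0.0, 0.0],
}

errors = reconstruction_errors(original, reconstructed)
flagged = flag_anomalies(errors)
```

A node that the model cannot reconstruct well from its latent embedding is treated as a candidate anomaly; the same scoring could in principle be applied to edges.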

[636] Next Token Prediction Is a Dead End for Creativity

Ibukun Olatunji,Mark Sheppard

Main category: cs.AI

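TL;DR: The paper argues that token-prediction models are fundamentally mismatched with real creativity and proposes redefining creativity as an interactive process rather than a predictive output.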

Details

Motivation: To expose the limits of current language generation models, which favour surface-level coherence, especially their shortcomings in spontaneity and originality. Method: Uses battle rap as a case study to analyse the failings of predictive systems in adversarial and emotionally resonant exchanges. Result: Shows that existing models cannot truly engage in emotionally rich or adversarial creative exchanges. Conclusion: Future AI systems should emphasise interactivity and expressiveness to align better with human creative practice. Abstract: This paper argues that token prediction is fundamentally misaligned with real creativity. While next-token models have enabled impressive advances in language generation, their architecture favours surface-level coherence over spontaneity, originality, and improvisational risk. We use battle rap as a case study to expose the limitations of predictive systems, demonstrating that they cannot truly engage in adversarial or emotionally resonant exchanges. By reframing creativity as an interactive process rather than a predictive output, we offer a vision for AI systems that are more expressive, responsive, and aligned with human creative practice.

[637] ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Qiushi Sun,Zhoumianze Liu,Chang Ma,Zichen Ding,Fangzhi Xu,Zhangyue Yin,Haiteng Zhao,Zhenyu Wu,Kanzhi Cheng,Zhaoyang Liu,Jianing Wang,Qintong Li,Xiangru Tang,Tianbao Xie,Xiachong Feng,Xiang Li,Ben Kao,Wenhai Wang,Biqing Qi,Lingpeng Kong,Zhiyong Wu

Main category: cs.AI

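TL;DR: ScienceBoard is a multi-domain environment and benchmark for evaluating LLM agents in scientific discovery workflows; results show that existing agents achieve only a 15% success rate.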

Details

Motivation: Use LLM agents to accelerate interdisciplinary scientific discovery and handle complex research tasks. Method: Develops ScienceBoard, comprising a multi-domain environment and a benchmark of 169 real-world tasks, and evaluates agent performance. Result: Existing agents (e.g., GPT-4o, Claude 3.7) achieve only a 15% success rate on complex tasks. Conclusion: ScienceBoard points to directions for improving agent design and supports the development of more capable scientific discovery tools. Abstract: Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.

[638] Architectures of Error: A Philosophical Inquiry into AI and Human Code Generation

Camilo Chacón Sartori

Main category: cs.AI

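TL;DR: The paper examines collaboration between generative AI (GenAI) and human programmers in code generation, proposes the concept of "Architectures of Error" to distinguish the sources of human and machine errors, and analyses its philosophical and engineering implications.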

Details

Motivation: As generative AI is used more widely for code generation, the sources of human and machine errors, and their effect on collaborative development, need to be distinguished. Method: Draws on Dennett's mechanistic functionalism and Rescher's methodological pragmatism to analyse the origins of error in human cognition versus machine stochasticity, and uses Floridi's levels of abstraction to examine how they interact. Result: Reveals the fundamental differences between human and machine errors in code generation and raises key questions about semantic coherence, security, epistemic limits, and control mechanisms. Conclusion: Gives philosophers a framework for understanding GenAI's distinctive epistemological challenges and gives engineers a basis for more critical collaboration. Abstract: With the rise of generative AI (GenAI), Large Language Models are increasingly employed for code generation, becoming active co-authors alongside human programmers. Focusing specifically on this application domain, this paper articulates distinct "Architectures of Error" to ground an epistemic distinction between human and machine code generation. Examined through their shared vulnerability to error, this distinction reveals fundamentally different causal origins: human-cognitive versus artificial-stochastic. To develop this framework and substantiate the distinction, the analysis draws critically upon Dennett's mechanistic functionalism and Rescher's methodological pragmatism. I argue that a systematic differentiation of these error profiles raises critical philosophical questions concerning semantic coherence, security robustness, epistemic limits, and control mechanisms in human-AI collaborative software development. The paper also utilizes Floridi's levels of abstraction to provide a nuanced understanding of how these error dimensions interact and may evolve with technological advancements. This analysis aims to offer philosophers a structured framework for understanding GenAI's unique epistemological challenges, shaped by these architectural foundations, while also providing software engineers a basis for more critically informed engagement.

[639] Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents

Ye Ye

Main category: cs.AI

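TL;DR: The paper proposes the Task Memory Engine (TME), a modular memory controller that uses graph structures to improve LLMs' multi-step interaction and reduce hallucinations and misinterpretations.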

Details

Motivation: Large language models perform poorly in multi-step interactions; without persistent memory they struggle to track goals and manage task dependencies, which undermines trust in autonomous agents. Method: TME adopts a spatial memory framework that replaces linear context with graph structures (a tree or DAG), combined with a TRIM component that models task semantics and user intent. Result: Across four multi-turn scenarios, TME eliminates all hallucinations and misinterpretations in three tasks and reduces hallucinations by 66.7% and misinterpretations by 83.3% overall, outperforming ReAct. Conclusion: TME's modular design supports plug-and-play deployment and domain-specific customization, offering a scalable way to improve LLM agent performance in complex interactive settings. Abstract: Large Language Models (LLMs) falter in multi-step interactions -- often hallucinating, repeating actions, or misinterpreting user corrections -- due to reliance on linear, unstructured context. This fragility stems from the lack of persistent memory to track evolving goals and task dependencies, undermining trust in autonomous agents. We introduce the Task Memory Engine (TME), a modular memory controller that transforms existing LLMs into robust, revision-aware agents without fine-tuning. TME implements a spatial memory framework that replaces flat context with graph-based structures to support consistent, multi-turn reasoning. Departing from linear concatenation and ReAct-style prompting, TME builds a dynamic task graph -- either a tree or directed acyclic graph (DAG) -- to map user inputs to subtasks, align them with prior context, and enable dependency-tracked revisions. Its Task Representation and Intent Management (TRIM) component models task semantics and user intent to ensure accurate interpretation. Across four multi-turn scenarios -- trip planning, cooking, meeting scheduling, and shopping cart editing -- TME eliminates 100% of hallucinations and misinterpretations in three tasks, and reduces hallucinations by 66.7% and misinterpretations by 83.3% across 27 user turns, outperforming ReAct. TME's modular design supports plug-and-play deployment and domain-specific customization, adaptable to both personal assistants and enterprise automation. We release TME's codebase, benchmarks, and components as open-source resources, enabling researchers to develop reliable LLM agents. TME's scalable architecture addresses a critical gap in agent performance across complex, interactive settings.
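
A dependency-tracked revision over a task DAG, as described in the abstract, might look roughly like the sketch below. It is not TME's actual data model: the `TaskGraph` class, its method names, and the trip-planning tasks are hypothetical, and the only behavior illustrated is that revising one task marks every downstream task as needing re-derivation.

```python
from collections import defaultdict, deque


class TaskGraph:
    """Toy task DAG: revising a task marks everything downstream as stale."""

    def __init__(self):
        self.values = {}                    # task name -> current content
        self.dependents = defaultdict(set)  # task name -> tasks that depend on it
        self.stale = set()                  # tasks that need to be re-derived

    def add_task(self, name, value, depends_on=()):
        if name in self.values:
            raise ValueError(f"task {name!r} already exists")
        missing = [p for p in depends_on if p not in self.values]
        if missing:
            raise KeyError(f"unknown parent tasks: {missing}")
        # Parents must already exist, so the graph stays acyclic by construction.
        self.values[name] = value
        for parent in depends_on:
            self.dependents[parent].add(name)

    def revise(self, name, new_value):
        """Update one task and return all tasks that are now stale."""
        if name not in self.values:
            raise KeyError(name)
        self.values[name] = new_value
        self.stale.discard(name)
        queue = deque(self.dependents[name])
        while queue:  # breadth-first walk over downstream tasks
            task = queue.popleft()
            if task not in self.stale:
                self.stale.add(task)
                queue.extend(self.dependents[task])
        return sorted(self.stale)


# Hypothetical trip-planning session.
graph = TaskGraph()
graph.add_task("destination", "Paris")
graph.add_task("flight", "fly to Paris", depends_on=["destination"])
graph.add_task("hotel", "hotel near the arrival airport", depends_on=["flight"])
graph.add_task("packing", "bring a raincoat")

stale_after_revision = graph.revise("destination", "Rome")
```

With flat, concatenated context the correction "Rome, not Paris" is just one more message; with an explicit graph, the agent can see that the flight and hotel subtasks depend on the revised node while the independent packing subtask does not.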

[640] BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

Guilong Lu,Xuntao Guo,Rongjunchen Zhang,Wenqiao Zhu,Ji Liu

Main category: cs.AI

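TL;DR: BizFinBench is the first LLM evaluation benchmark designed for the financial domain. It contains 6,781 Chinese queries across five dimensions and introduces the IteraJudge method to reduce evaluation bias. Experiments show that models perform differently across tasks, and all struggle with complex reasoning scenarios.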

Details

Motivation: Assess the reliability of LLMs in finance, a logic-heavy and precision-critical domain, and fill a gap in existing research. Method: Builds the BizFinBench benchmark with tasks spanning multiple dimensions and designs the IteraJudge method to reduce bias when LLMs act as evaluators. Result: Experiments show that performance varies significantly across models and tasks, that no model dominates every task, and that complex reasoning remains weak. Conclusion: BizFinBench provides a rigorous benchmark for LLM research in finance; future work needs to improve performance on complex reasoning tasks. Abstract: Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research.
The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.

[641] Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights

Shi-Yu Tian,Zhi Zhou,Wei Dong,Ming Yang,Kun-Yang Yu,Zi-Jian Cheng,Lan-Zhe Guo,Yu-Feng Li

Main category: cs.AI

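TL;DR: The paper proposes AutoT2T, an automated generation pipeline that converts mathematical word problems into table-based reasoning tasks, builds the new benchmark TabularGSM, and finds that a key reason LLMs fail on complex Table QA is the tight coupling between reasoning and retrieval or identification.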

Details

Motivation: Existing research relies on costly manually annotated data that struggles to cover complex reasoning scenarios, and heterogeneous table structures hinder systematic analysis of LLM behavior on reasoning-intensive tasks. Method: Proposes the AutoT2T automated generation pipeline, which converts mathematical word problems into table-based reasoning tasks, generates multiple table variants (including noisy versions), and is used to build the TabularGSM benchmark. Result: Experimental analysis shows that a key reason LLMs fail on complex Table QA tasks is the tight coupling between reasoning and retrieval or identification. Conclusion: Models need to develop synergistic reasoning capabilities to handle complex Table QA tasks effectively. Abstract: Reasoning with tabular data holds increasing importance in modern applications, yet comprehensive evaluation methodologies for reasoning-intensive Table Question Answering (QA) tasks remain nascent. Existing research is constrained by two primary bottlenecks: 1) Reliance on costly manually annotated real-world data, which struggles to cover complex reasoning scenarios; 2) The heterogeneity of table structures hinders systematic analysis of the intrinsic mechanisms behind the underperformance of LLMs, especially in reasoning-intensive tasks. To address these issues, we propose an automated generation pipeline AutoT2T that transforms mathematical word problems into table-based reasoning tasks, eliminating the need for manual annotation. The pipeline can generate multiple variants of a table for the same reasoning problem, including noisy versions to support robustness evaluation. Based on this, we construct a new benchmark TabularGSM, which systematically spans a range of table complexities and trap problems. Experimental analyses through AutoT2T and TabularGSM reveal that the tight coupling between reasoning and retrieval or identification processes is a key factor underlying the failure of LLMs in complex Table QA tasks. This highlights the necessity for models to develop synergistic reasoning capabilities in order to perform effectively in complex Table QA tasks.

[642] Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models

George Kour,Itay Nakash,Ateret Anaby-Tavor,Michal Shmueli-Scheuer

Main category: cs.AI

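TL;DR: The paper presents POBs, a benchmark for assessing the subjective inclinations of large language models (LLMs) across societal, cultural, ethical, and personal domains, and finds that newer model versions are less consistent and more biased.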

Details

Motivation: As LLMs become more deeply embedded in human life, it is crucial to assess whether and how they exhibit subjective preferences, opinions, and beliefs, so that latent biases do not distort decision-making. Method: Develops the POBs benchmark to assess LLMs' subjective inclinations and tests how reasoning and self-reflection mechanisms affect the measured properties. Result: Newer model versions are less consistent and less neutral, leaning toward specific viewpoints, and reasoning and self-reflection mechanisms improve these metrics only marginally. Conclusion: The paper exposes potential problems in LLMs' subjective inclinations and calls for attention to trends in model bias and consistency. Abstract: As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it's crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs' subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated the effect of increasing the test-time compute, through reasoning and self-reflection mechanisms, on those metrics. While effective in other tasks, our results show that these mechanisms offer only limited gains in our domain. Furthermore, we reveal that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend. POBS: https://ibm.github.io/POBS

[643] SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond

Junteng Liu,Yuanxiang Fan,Zhuo Jiang,Han Ding,Yongyi Hu,Chi Zhang,Yiqi Shi,Shitong Weng,Aili Chen,Shiqi Chen,Yunan Huang,Mozhi Zhang,Pengyu Zhao,Junjie Yan,Junxian He

Main category: cs.AI

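TL;DR: SynLogic is a data synthesis framework and dataset that generates diverse logical reasoning data to improve the reasoning ability of large language models.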

Details

Motivation: Open-source replication efforts have focused mainly on mathematics and coding, while methods and resources for general reasoning remain underexplored; logical reasoning is regarded as a foundation of general reasoning. Method: Proposes the SynLogic framework, which generates data for 35 logical reasoning tasks with adjustable difficulty and quantity and verifies examples with simple rules. Result: SynLogic is validated on 7B and 32B models, leads open-source datasets in logical reasoning performance, and significantly improves reasoning generalization in mixed training. Conclusion: SynLogic is a valuable resource for improving the general reasoning of LLMs; the data and synthesis framework are open-sourced. Abstract: Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs.
We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.
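
To make "verified by simple rules" concrete, here is a minimal sketch of a rule-based verifier that yields a binary reward. The Latin-square puzzle, the function names, and the 0/1 reward convention are assumptions chosen for illustration; they are not taken from the SynLogic task list or codebase.

```python
def verify_latin_square(grid, givens):
    """Check an n x n grid: each row and column holds 1..n exactly once,
    and every pre-filled cell in `givens` ({(row, col): value}) is respected."""
    n = len(grid)
    expected = set(range(1, n + 1))
    if any(len(row) != n for row in grid):
        return False
    if any(grid[r][c] != value for (r, c), value in givens.items()):
        return False
    rows_ok = all(set(row) == expected for row in grid)
    cols_ok = all({grid[r][c] for r in range(n)} == expected for c in range(n))
    return rows_ok and cols_ok


def reward(grid, givens):
    """Binary verifiable reward: 1.0 for a valid solution, 0.0 otherwise."""
    return 1.0 if verify_latin_square(grid, givens) else 0.0


givens = {(0, 0): 1}  # hypothetical puzzle clue
good_answer = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
bad_answer = [[1, 2, 3], [1, 3, 2], [3, 1, 2]]  # column 0 repeats the value 1

good_reward = reward(good_answer, givens)
bad_reward = reward(bad_answer, givens)
```

Because the check is a deterministic rule rather than a learned judge, the reward signal cannot be gamed by fluent but wrong answers, which is what makes such tasks convenient for RL with verifiable rewards.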

[644] Large Language Models for Planning: A Comprehensive and Systematic Survey

Pengfei Cao,Tianyi Men,Wencan Liu,Jingwen Zhang,Xuzhao Li,Xixun Lin,Dianbo Sui,Yanan Cao,Kang Liu,Jun Zhao

Main category: cs.AI

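TL;DR: This survey reviews planning methods based on large language models (LLMs), classifies them as external-module-augmented, finetuning-based, or searching-based, and summarizes evaluation frameworks and research directions.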

Details

Motivation: Explore the potential of LLMs for planning tasks and fill the gap in systematic study. Method: Classifies and analyses three families of LLM planning methods: external-module-augmented, finetuning-based, and searching-based. Result: Summarizes existing evaluation frameworks and performance comparisons, and discusses the mechanisms behind LLM-based planning. Conclusion: The paper offers a systematic survey of LLM-based planning and identifies future research directions. Abstract: Planning represents a fundamental capability of intelligent agents, requiring comprehensive environmental understanding, rigorous logical reasoning, and effective sequential decision-making. While Large Language Models (LLMs) have demonstrated remarkable performance on certain planning tasks, their broader application in this domain warrants systematic investigation. This paper presents a comprehensive review of LLM-based planning. Specifically, this survey is structured as follows: First, we establish the theoretical foundations by introducing essential definitions and categories about automated planning. Next, we provide a detailed taxonomy and analysis of contemporary LLM-based planning methodologies, categorizing them into three principal approaches: 1) External Module Augmented Methods that combine LLMs with additional components for planning, 2) Finetuning-based Methods that involve using trajectory data and feedback signals to adjust LLMs in order to improve their planning abilities, and 3) Searching-based Methods that break down complex tasks into simpler components, navigate the planning space, or enhance decoding strategies to find the best solutions. Subsequently, we systematically summarize existing evaluation frameworks, including benchmark datasets, evaluation metrics and performance comparisons between representative planning methods. Finally, we discuss the underlying mechanisms enabling LLM-based planning and outline promising research directions for this rapidly evolving field. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this field.

[645] HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation

Feng Xiong,Hongling Xu,Yifei Wang,Runxi Cheng,Yong Wang,Xiangxiang Chu

Main category: cs.AI

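TL;DR: The HS-STaR framework uses a hierarchical sampling strategy to dynamically allocate the sampling budget to boundary-difficulty problems, significantly improving the mathematical reasoning of language models.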

Details

Motivation: Existing methods allocate a uniform sampling budget to every problem, ignoring differences in learning utility across difficulty levels. Method: HS-STaR uses lightweight pre-sampling followed by dynamic re-sampling that prioritizes boundary-difficulty problems. Result: Across multiple reasoning benchmarks, HS-STaR significantly outperforms baselines without extra budget. Conclusion: Boundary-difficulty problems carry higher learning utility, and HS-STaR makes effective use of a limited budget to improve model performance. Abstract: Self-taught reasoners (STaRs) enhance the mathematical reasoning abilities of large language models (LLMs) by leveraging self-generated responses for self-training. Recent studies have incorporated reward models to guide response selection or decoding, aiming to obtain higher-quality data. However, they typically allocate a uniform sampling budget across all problems, overlooking the varying utility of problems at different difficulty levels. In this work, we conduct an empirical study and find that problems near the boundary of the LLM's reasoning capability offer significantly greater learning utility than both easy and overly difficult ones. To identify and exploit such problems, we propose HS-STaR, a Hierarchical Sampling framework for Self-Taught Reasoners. Given a fixed sampling budget, HS-STaR first performs lightweight pre-sampling with a reward-guided difficulty estimation strategy to efficiently identify boundary-level problems. Subsequently, it dynamically reallocates the remaining budget toward these high-utility problems during a re-sampling phase, maximizing the generation of valuable training data. Extensive experiments across multiple reasoning benchmarks and backbone LLMs demonstrate that HS-STaR significantly outperforms other baselines without requiring additional sampling budget.
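
The two-phase budget logic described above (pre-sample, estimate difficulty, then reallocate the remainder to boundary-level problems) can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm: difficulty is approximated here by the mean pre-sampling reward, the `low`/`high` thresholds and the even split of the remaining budget are made up, and the problem names and rewards are toy data.

```python
def allocate_budget(pre_sample_rewards, total_budget, low=0.2, high=0.8):
    """Split the budget left after pre-sampling among boundary-level problems.

    pre_sample_rewards maps each problem to a non-empty list of rewards in [0, 1]
    obtained during pre-sampling. A problem counts as boundary-level when its
    mean reward lies in [low, high] (an assumed criterion).
    """
    used = sum(len(rewards) for rewards in pre_sample_rewards.values())
    remaining = total_budget - used
    if remaining < 0:
        raise ValueError("pre-sampling already exceeded the total budget")

    boundary = sorted(
        problem
        for problem, rewards in pre_sample_rewards.items()
        if low <= sum(rewards) / len(rewards) <= high
    )

    allocation = {problem: 0 for problem in pre_sample_rewards}
    if boundary:
        share, extra = divmod(remaining, len(boundary))
        for index, problem in enumerate(boundary):
            # Spread any remainder one sample at a time over the first problems.
            allocation[problem] = share + (1 if index < extra else 0)
    return allocation


# Toy pre-sampling results: four samples per problem, reward 1 = correct.
pre_sample_rewards = {
    "easy": [1, 1, 1, 1],
    "boundary_1": [1, 0, 1, 0],
    "boundary_2": [0, 0, 1, 0],
    "hard": [0, 0, 0, 0],
}

allocation = allocate_budget(pre_sample_rewards, total_budget=26)
```

Under a uniform policy the 10 remaining samples would be spread over all four problems; here they go only to the two problems the model sometimes solves, which is where the abstract reports the greatest learning utility.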

[646] Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program

Alejandro Carrasco,Victor Rodriguez-Fernandez,Richard Linares

Main category: cs.AI

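TL;DR: The paper explores using large language models (LLMs) as autonomous agents for space control, particularly for decision-making in autonomous satellite operations.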

Details

Motivation: Apply the text-understanding abilities of LLMs to autonomous satellite decision-making in non-cooperative space operations, to advance space research. Method: Uses prompt engineering, few-shot prompting, and fine-tuning to build a pure LLM-based solution for the Kerbal Space Program Differential Games challenge. Result: The LLM agent ranked second in the competition, supporting the feasibility of LLMs in space research. Conclusion: The work is the first to bring LLM agents into space research and provides open-source code and datasets as a foundation for follow-up work. Abstract: Recent trends are emerging in the use of Large Language Models (LLMs) as autonomous agents that take actions based on the content of the user text prompts. We intend to apply these concepts to the field of Control in space, enabling LLMs to play a significant role in the decision-making process for autonomous satellite operations. As a first step towards this goal, we have developed a pure LLM-based solution for the Kerbal Space Program Differential Games (KSPDG) challenge, a public software design competition where participants create autonomous agents for maneuvering satellites involved in non-cooperative space operations, running on the KSP game engine. Our approach leverages prompt engineering, few-shot prompting, and fine-tuning techniques to create an effective LLM-based agent that ranked 2nd in the competition. To the best of our knowledge, this work pioneers the integration of LLM agents into space research. The project comprises several open repositories to facilitate replication and further research. The codebase is accessible on GitHub (https://github.com/ARCLab-MIT/kspdg), while the trained models and datasets are available on Hugging Face (https://huggingface.co/OhhTuRnz). Additionally, experiment tracking and detailed results can be reviewed on Weights & Biases (https://wandb.ai/carrusk/huggingface).

[647] DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph

Jihyung Lee,Jin-Seop Lee,Jaehoon Lee,YunSeok Choi,Jee-Hyong Lee

Main category: cs.AI

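TL;DR: The paper proposes a new method based on a Deep Contextual Schema Link Graph that improves demonstration retrieval and SQL generation for Text-to-SQL, yielding significant performance gains.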

Details

Motivation: Existing methods depend on the capabilities of hyper-scaled LLMs and work poorly with small LLMs, which indicates that their demonstration retrieval is ineffective. Method: Builds a Deep Contextual Schema Link Graph that captures the key information and semantic relationships between a question and its database schema items, improving demonstration retrieval and SQL generation. Result: On the Spider benchmark, the method improves performance and efficiency for both hyper-scaled and small LLMs. Conclusion: By retrieving useful demonstrations and generating SQL effectively, the method significantly improves Text-to-SQL performance across LLMs of different sizes. Abstract: Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationships between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. Our code will be released.

[648] Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models

Makesh Narsimhan Sreedhar,Traian Rebedea,Christopher Parisien

Main category: cs.AI

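TL;DR: The paper analyses the use of reasoning-based language models for content moderation, focusing on data efficiency and inference efficiency.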

Details

Motivation: Study the potential of reasoning-based guardrail models for content moderation, especially their ability to generalize to custom safety policies. Method: Experimentally evaluates data efficiency and inference efficiency, covering sample efficiency, reasoning budgets, and dual-mode training. Result: Reasoning models perform well with little data, and reasoning budgets and dual-mode training make practical deployment more efficient. Conclusion: The study offers practical guidance for efficiently training and deploying reasoning-based guardrail models in real systems. Abstract: Reasoning-based language models have demonstrated strong performance across various domains, with the most notable gains seen in mathematical and coding tasks. Recent research has shown that reasoning also offers significant benefits for LLM safety and guardrail applications. In this work, we conduct a comprehensive analysis of training reasoning-based guardrail models for content moderation, with an emphasis on generalization to custom safety policies at inference time. Our study focuses on two key dimensions: data efficiency and inference efficiency. On the data front, we find that reasoning-based models exhibit strong sample efficiency, achieving competitive performance with significantly fewer training examples than their non-reasoning counterparts. This unlocks the potential to repurpose the remaining data for mining high-value, difficult samples that further enhance model performance. On the inference side, we evaluate practical trade-offs by introducing reasoning budgets, examining the impact of reasoning length on latency and accuracy, and exploring dual-mode training to allow runtime control over reasoning behavior. Our findings will provide practical insights for researchers and developers to effectively and efficiently train and deploy reasoning-based guardrail models in real-world systems.

[649] On Path to Multimodal Historical Reasoning: HistBench and HistAgent

Jiahao Qiu,Fulian Xiao,Yimin Wang,Yuchen Mao,Yijia Chen,Xinzhe Juan,Siran Wang,Xuan Qi,Tongcheng Zhang,Zixin Yao,Jiacheng Guo,Yifu Lu,Charles Argon,Jundi Cui,Daixin Chen,Junran Zhou,Shuyao Zhou,Zhanpeng Zhou,Ling Yang,Shilong Liu,Hongru Wang,Kaixuan Huang,Xun Jiang,Yuming Cao,Yue Chen,Yunfei Chen,Zhengyi Chen,Ruowei Dai,Mengqiu Deng,Jiye Fu,Yunting Gu,Zijie Guan,Zirui Huang,Xiaoyan Ji,Yumeng Jiang,Delong Kong,Haolong Li,Jiaqi Li,Ruipeng Li,Tianze Li,Zhuoran Li,Haixia Lian,Mengyue Lin,Xudong Liu,Jiayi Lu,Jinghan Lu,Wanyu Luo,Ziyue Luo,Zihao Pu,Zhi Qiao,Ruihuan Ren,Liang Wan,Ruixiang Wang,Tianhui Wang,Yang Wang,Zeyu Wang,Zihua Wang,Yujia Wu,Zhaoyi Wu,Hao Xin,Weiao Xing,Ruojun Xiong,Weijie Xu,Yao Shu,Xiao Yao,Xiaorui Yang,Yuchen Yang,Nan Yi,Jiadong Yu,Yangyuxuan Yu,Huiting Zeng,Danni Zhang,Yunjie Zhang,Zhaoyu Zhang,Zhiheng Zhang,Xiaofeng Zheng,Peirong Zhou,Linyan Zhong,Xiaoyin Zong,Ying Zhao,Zhenxin Chen,Lin Ding,Xiaoyu Gao,Bingbing Gong,Yichao Li,Yang Liao,Guang Ma,Tianyuan Ma,Xinrui Sun,Tianyi Wang,Han Xia,Ruobing Xian,Gen Ye,Tengfei Yu,Wentao Zhang,Yuxi Wang,Xi Gao,Mengdi Wang

Main category: cs.AI

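TL;DR: The paper introduces HistBench and HistAgent: the former is a new benchmark for evaluating AI's historical reasoning, and the latter is an AI agent designed for historical tasks that significantly outperforms general-purpose models.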

Details

Motivation: Explore the potential of large language models in history and address their weaknesses in historical reasoning. Method: Develops the HistBench benchmark and designs HistAgent, a history-specific agent equipped with tools for OCR, translation, archival search, and image understanding. Result: HistAgent significantly outperforms general-purpose models such as GPT-4o and DeepSeek-R1 on HistBench. Conclusion: Existing general-purpose models are limited in historical reasoning, and HistAgent demonstrates the advantage of domain-specific design. Abstract: Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI's capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems -- from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Finding the poor performance of LLMs and other agents on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%) and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.

Table of Contents

cs.CV [Back]

[1] InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

[2] Sampling Strategies for Efficient Training of Deep Learning Object Detection Algorithms

[3] CTRL-GS: Cascaded Temporal Residue Learning for 4D Gaussian Splatting

[4] COLORA: Efficient Fine-Tuning for Convolutional Models with a Study Case on Optical Coherence Tomography Image Classification

[5] DART$^3$: Leveraging Distance for Test Time Adaptation in Person Re-Identification

[6] Pose Splatter: A 3D Gaussian Splatting Model for Quantifying Animal Pose and Appearance

[7] CONCORD: Concept-Informed Diffusion for Dataset Distillation

[8] Weakly-supervised Mamba-Based Mastoidectomy Shape Prediction for Cochlear Implant Surgery Using 3D T-Distribution Loss

[9] Monocular Marker-free Patient-to-Image Intraoperative Registration for Cochlear Implant Surgery

[10] Taming Diffusion for Dataset Distillation with High Representativeness

[11] Recent Deep Learning in Crowd Behaviour Analysis: A Brief Review

[12] Rehabilitation Exercise Quality Assessment and Feedback Generation Using Large Language Models with Prompt Engineering

[13] Dynamics of Affective States During Takeover Requests in Conditionally Automated Driving Among Older Adults with and without Cognitive Impairment

[14] CENet: Context Enhancement Network for Medical Image Segmentation

[15] TNG-CLIP:Training-Time Negation Data Generation for Negation Awareness of CLIP

[16] OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data

[17] Mitigating Context Bias in Domain Adaptation for Object Detection using Mask Pooling

[18] Agentic 3D Scene Generation with Spatially Contextualized VLMs

[19] BiomechGPT: Towards a Biomechanically Fluent Multimodal Foundation Model for Clinically Relevant Motion Tasks

[20] In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation

[21] HonestFace: Towards Honest Face Restoration with One-Step Diffusion Model

[22] ZooplanktonBench: A Geo-Aware Zooplankton Recognition and Classification Dataset from Marine Observations

[23] Syn3DTxt: Embedding 3D Cues for Scene Text Generation

[24] Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning

[25] Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility

[26] TK-Mamba: Marrying KAN with Mamba for Text-Driven 3D Medical Image Segmentation

[27] ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts

[28] On Denoising Walking Videos for Gait Recognition

[29] Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

[30] Guiding the Experts: Semantic Priors for Efficient and Focused MoE Routing

[31] HyperFake: Hyperspectral Reconstruction and Attention-Guided Analysis for Advanced Deepfake Detection

[32] EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

[33] Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

[34] Rethinking Causal Mask Attention for Vision-Language Inference

[35] Spiking Transformers Need High Frequency Information

[36] Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

[37] SerendibCoins: Exploring The Sri Lankan Coins Dataset

[38] SuperGS: Consistent and Detailed 3D Super-Resolution Scene Reconstruction via Gaussian Splatting

[39] ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos

[40] Why Not Replace? Sustaining Long-Term Visual Localization via Handcrafted-Learned Feature Collaboration on CPU

[41] So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection

[42] DVD-Quant: Data-free Video Diffusion Transformers Quantization

[43] ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation

[44] Restoring Real-World Images with an Internal Detail Enhancement Diffusion Model

[45] Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

[46] Manifold-aware Representation Learning for Degradation-agnostic Image Restoration

[47] WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation

[48] Affective Image Editing: Shaping Emotional Factors via Text Descriptions

[49] GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

[50] Deep Learning for Breast Cancer Detection: Comparative Analysis of ConvNeXT and EfficientNet

[51] FusionTrack: End-to-End Multi-Object Tracking in Arbitrary Multi-View Environment

[52] Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation

[53] Rethinking Direct Preference Optimization in Diffusion Models

[54] MoMBS: Mixed-order minibatch sampling enhances model training from diverse-quality images

[55] C3R: Channel Conditioned Cell Representations for unified evaluation in microscopy imaging

[56] ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models

[57] StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations

[58] Dual-Path Stable Soft Prompt Generation for Domain Generalization

[59] OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks

[60] Think Twice before Adaptation: Improving Adaptability of DeepFake Detection via Online Test-Time Adaptation

[61] VORTA: Efficient Video Diffusion via Routing Sparse Attention

[62] SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models

[63] Reasoning Segmentation for Images and Videos: A Survey

[64] Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding

[65] MSLAU-Net: A Hybird CNN-Transformer Network for Medical Image Segmentation

[66] Localizing Knowledge in Diffusion Transformers

[67] Inference Compute-Optimal Video Vision Language Models

[68] Eye-See-You: Reverse Pass-Through VR and Head Avatars

[69] Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

[70] REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

[71] SD-OVON: A Semantics-aware Dataset and Benchmark Generation Pipeline for Open-Vocabulary Object Navigation in Dynamic Scenes

[72] Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos

[73] Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering

[74] LLM-Guided Taxonomy and Hierarchical Uncertainty for 3D Point CLoud Active Learning

[75] Words as Geometric Features: Estimating Homography using Optical Character Recognition as Compressed Image Representation

[76] WeedNet: A Foundation Model-Based Global-to-Local AI Approach for Real-Time Weed Species Identification and Classification

[77] Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency

[78] Echo Planning for Autonomous Driving: From Current Observations to Future Trajectories and Back