cs.CV

[1] CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge

Zehua Liu,Xiaolou Li,Chen Chen,Lantian Li,Dong Wang

Main category: cs.CV

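TL;DR: CNVSRC 2024 builds on CNVSRC 2023 to advance research in Chinese large-vocabulary continuous visual speech recognition, introducing a stronger baseline system and an additional dataset that increases data volume and diversity.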

Details

Motivation: To advance research in Chinese Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR).

Method: The challenge keeps the datasets of CNVSRC 2023 (CN-CVS for training, CNVSRC-Single/Multi for development and evaluation), adds a new dataset, CN-CVS2-P1, for the open tracks, and provides a stronger baseline system.

Result: The challenge showcased innovations in data preprocessing, feature extraction, model design, and training strategies, further raising the state of the art in LVC-VSR.

Conclusion: Through its improvements and the added dataset, CNVSRC 2024 advances the field of Chinese LVC-VSR.

Abstract: This paper presents the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), which builds on CNVSRC 2023 to advance research in Chinese Large Vocabulary Continuous Visual Speech Recognition (LVC-VSR). The challenge evaluates two test scenarios: reading in recording studios and Internet speech. CNVSRC 2024 uses the same datasets as its predecessor CNVSRC 2023, which involves CN-CVS for training and CNVSRC-Single/Multi for development and evaluation. However, CNVSRC 2024 introduces two key improvements: (1) a stronger baseline system, and (2) an additional dataset, CN-CVS2-P1, for open tracks to improve data volume and diversity. The new challenge has demonstrated several important innovations in data preprocessing, feature extraction, model design, and training strategies, further pushing the state-of-the-art in Chinese LVC-VSR. More details and resources are available at the official website.

[2] OASIS: Online Sample Selection for Continual Visual Instruction Tuning

Minjae Lee,Minhyuk Seo,Tingyu Qu,Tinne Tuytelaars,Jonghyun Choi

Main category: cs.CV

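TL;DR: OASIS is an adaptive online sample selection method for continual visual instruction tuning (CVIT) that dynamically adjusts the number of samples selected per batch to reduce redundancy and adapt to distribution shifts, matching full-data training performance with only 25% of the data.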

Details

Motivation: In CVIT scenarios, multi-modal data arrive continuously in an online stream. Existing data selection methods rely on pre-trained reference models or select a fixed number of samples per batch, so they cannot adapt to distribution shifts.

Method: OASIS dynamically adjusts the number of samples selected per batch and minimizes redundancy through iterative updates of the selection scores.

Result: Experiments show that OASIS reaches performance comparable to full-data training using only 25% of the data and outperforms existing methods.

Conclusion: OASIS provides an efficient, adaptive sample selection scheme for CVIT that substantially reduces training overhead.

Abstract: In continual visual instruction tuning (CVIT) scenarios, where multi-modal data continuously arrive in an online streaming manner, training delays from large-scale data significantly hinder real-time adaptation. While existing data selection strategies reduce training overheads, they rely on pre-trained reference models, which are impractical in CVIT setups due to unknown future data. Recent reference model-free online sample selection methods address this issue but typically select a fixed number of samples per batch (e.g., top-k), causing them to suffer from distribution shifts where informativeness varies across batches. To address these limitations, we propose OASIS, an adaptive online sample selection approach for CVIT that: (1) dynamically adjusts selected samples per batch based on relative inter-batch informativeness, and (2) minimizes redundancy of selected samples through iterative selection score updates. Empirical results across various MLLMs, such as LLaVA-1.5 and Qwen-VL-2.5, show that OASIS achieves comparable performance to full-data training using only 25% of the data and outperforms the state-of-the-art.
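
The idea of selecting a variable number of samples per batch, rather than a fixed top-k, can be illustrated with a minimal sketch. This is not the OASIS algorithm: the class name `AdaptiveSelector`, the z-score rule, and the threshold are assumptions made for illustration, and the redundancy-reducing score updates are not modeled. Each sample's informativeness score is compared against running statistics accumulated over all batches seen so far, so an informative batch contributes many samples and an uninformative one contributes few or none.

```python
import numpy as np


class AdaptiveSelector:
    """Hypothetical selector: keep samples whose score is high relative to all batches seen."""

    def __init__(self, z_threshold=0.5):
        self.z_threshold = z_threshold
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's algorithm)

    def _update(self, scores):
        # Fold the new scores into the running mean and variance.
        for s in scores:
            self.count += 1
            delta = s - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (s - self.mean)

    def select(self, scores):
        """Return indices of selected samples; the count varies from batch to batch."""
        scores = np.asarray(scores, dtype=float)
        self._update(scores)
        std = np.sqrt(self.m2 / max(self.count - 1, 1))
        if std == 0.0:
            # No spread observed yet, so there is no basis for discarding anything.
            return np.arange(len(scores))
        z = (scores - self.mean) / std
        return np.flatnonzero(z > self.z_threshold)
```

With this rule, a batch whose scores are all below the running mean yields an empty selection, which a fixed top-k policy cannot express.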

[3] Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing

Zehua Liu,Xiaolou Li,Li Guo,Lantian Li,Dong Wang

Main category: cs.CV

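TL;DR: This paper explores how to better leverage large language models (LLMs) to improve visual speech recognition (VSR), with three key contributions: a scaling test, context-aware decoding, and iterative polishing.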

Details

Motivation: Although LLMs have been integrated into VSR systems and have improved performance, their potential has not been studied in depth, and how to use them effectively remains an open question.

Method: A scaling test studies how LLM size affects VSR performance, context-aware decoding adds contextual text to improve accuracy, and iterative polishing progressively reduces recognition errors.

Result: Experiments show that these methods unlock much of the potential of LLMs and substantially improve VSR performance.

Conclusion: The paper systematically explores the use of LLMs in VSR tasks and provides a useful reference for future research.

Abstract: Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to effectively utilize LLMs in VSR tasks remains unexplored. This paper systematically explores how to better leverage LLMs for VSR tasks and provides three key contributions: (1) Scaling Test: We study how the LLM size affects VSR performance, confirming a scaling law in the VSR task. (2) Context-Aware Decoding: We add contextual text to guide the LLM decoding, improving recognition accuracy. (3) Iterative Polishing: We propose iteratively refining LLM outputs, progressively reducing recognition errors. Extensive experiments demonstrate that with these designs, the great potential of LLMs can be largely harnessed, leading to significant VSR performance improvement.

[4] Research on Driving Scenario Technology Based on Multimodal Large Language Model Optimization

Wang Mengjie,Zhu Huiping,Li Jian,Shi Wenxiu,Zhang Song

Main category: cs.CV

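TL;DR: This paper proposes a comprehensive method for optimizing multimodal models in driving scenarios, covering dynamic prompt optimization, dataset construction, model training, and deployment optimization, and reports significant gains in model performance and resource utilization.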

Details

Motivation: As autonomous driving technology advances, the need to understand complex driving scenarios grows, but applying multimodal large models in vertical domains faces challenges in data collection, training, and deployment optimization.

Method: The method comprises dynamic prompt optimization (adjusting prompts according to the input image), dataset construction (combining real and synthetic data), model training (knowledge distillation, dynamic fine-tuning, quantization), and deployment optimization.

Result: Experiments show that the method significantly improves accuracy on key tasks while using resources efficiently.

Conclusion: The method provides strong support for the practical application of driving scenario perception technologies.

Abstract: With the advancement of autonomous and assisted driving technologies, higher demands are placed on the ability to understand complex driving scenarios. Multimodal general large models have emerged as a solution for this challenge. However, applying these models in vertical domains involves difficulties such as data collection, model training, and deployment optimization. This paper proposes a comprehensive method for optimizing multimodal models in driving scenarios, including cone detection, traffic light recognition, speed limit recommendation, and intersection alerts. The method covers key aspects such as dynamic prompt optimization, dataset construction, model training, and deployment. Specifically, the dynamic prompt optimization adjusts the prompts based on the input image content to focus on objects affecting the ego vehicle, enhancing the model's task-specific focus and judgment capabilities. The dataset is constructed by combining real and synthetic data to create a high-quality and diverse multimodal training dataset, improving the model's generalization in complex driving environments. In model training, advanced techniques like knowledge distillation, dynamic fine-tuning, and quantization are integrated to reduce storage and computational costs while boosting performance. Experimental results show that this systematic optimization method not only significantly improves the model's accuracy in key tasks but also achieves efficient resource utilization, providing strong support for the practical application of driving scenario perception technologies.

[5] Object-centric Self-improving Preference Optimization for Text-to-Image Generation

Yoonjin Oh,Yongjin Kim,Hyomin Kim,Donghwan Chi,Sungwoong Kim

Main category: cs.CV

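TL;DR: This paper proposes OSPO, a framework that improves the fine-grained visual understanding of multimodal large language models (MLLMs) in text-to-image generation by autonomously constructing high-quality preference pair data for optimization.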

Details

Motivation: Although MLLMs have advanced in image understanding and generation, they still fall short in fine-grained visual comprehension, particularly in text-to-image generation. Preference optimization has been explored for image understanding tasks, but its application to image generation remains underexplored.

Method: OSPO leverages the intrinsic reasoning abilities of MLLMs without external data or models. Through object-centric prompt perturbation, densification, and VQA scoring, it autonomously constructs object-level contrastive preference pairs and eliminates ambiguous or disproportionate variations.

Result: OSPO is validated on three representative text-to-image generation benchmarks, where it substantially outperforms baseline models.

Conclusion: By autonomously constructing high-quality preference pairs, OSPO effectively improves MLLM performance in text-to-image generation and addresses the lack of preference optimization work in image generation.

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly improved both image understanding and generation capabilities. Despite these improvements, MLLMs still struggle with fine-grained visual comprehension, particularly in text-to-image generation tasks. While preference optimization methods have been explored to address these limitations in image understanding tasks, their application to image generation remains largely underexplored. To address this gap, we propose an Object-centric Self-improving Preference Optimization (OSPO) framework designed for text-to-image generation by MLLMs. OSPO leverages the intrinsic reasoning abilities of MLLMs without requiring any external datasets or models. OSPO emphasizes the importance of high-quality preference pair data, which is critical for effective preference optimization. To achieve this, it introduces a self-improving mechanism that autonomously constructs object-level contrastive preference pairs through object-centric prompt perturbation, densification and VQA scoring. This process eliminates ambiguous or disproportionate variations commonly found in naively generated preference pairs, thereby enhancing the effectiveness of preference optimization. We validate OSPO on three representative compositional text-to-image benchmarks, demonstrating substantial performance gains over baseline models.

[6] Are classical deep neural networks weakly adversarially robust?

Nuolin Sun,Linyuan Wang,Dongyang Li,Bin Yan,Lei Li

Main category: cs.CV

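TL;DR: Proposes an adversarial example detection and image recognition method based on layer-wise feature paths, avoiding computationally expensive adversarial training.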

Details

Motivation: Classical DNNs are considered weakly robust to adversarial attacks, and adversarial training is computationally expensive, so more efficient approaches are needed.

Method: Layer-wise features are used to construct feature paths, and the correlation between a sample's feature path and the class-centered feature paths is computed.

Result: On ResNet-20, the method achieves 82.77% clean accuracy and 44.17% adversarial accuracy; on ResNet-18, it achieves 80.01% and 46.1%, respectively.

Conclusion: The method reveals inherent adversarial robustness in DNNs, challenging the conventional view and offering an efficient defense strategy.

Abstract: Adversarial attacks have received increasing attention and it has been widely recognized that classical DNNs have weak adversarial robustness. The most commonly used adversarial defense method, adversarial training, improves the adversarial accuracy of DNNs by generating adversarial examples and retraining the model. However, adversarial training requires a significant computational overhead. In this paper, inspired by existing studies focusing on the clustering properties of DNN output features at each layer and the Progressive Feedforward Collapse phenomenon, we propose a method for adversarial example detection and image recognition that uses layer-wise features to construct feature paths and computes the correlation between the examples' feature paths and the class-centered feature paths. Experimental results show that the recognition method achieves 82.77% clean accuracy and 44.17% adversarial accuracy on the ResNet-20 with PFC. Compared to the adversarial training method with 77.64% clean accuracy and 52.94% adversarial accuracy, our method exhibits a trade-off without relying on computationally expensive defense strategies. Furthermore, on the standard ResNet-18, our method maintains this advantage with respective metrics of 80.01% and 46.1%. This result reveals inherent adversarial robustness in DNNs, challenging the conventional understanding of the weak adversarial robustness in DNNs.
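
One plausible reading of the feature-path idea can be sketched as follows. The abstract does not specify how paths are built or which correlation is used, so everything here is an assumption: a path is taken to be the list of a sample's feature vectors at each layer, a class center path is the per-layer mean over that class's training samples, and similarity is the Pearson correlation averaged over layers. The function names are hypothetical.

```python
import numpy as np


def class_center_paths(paths, labels, num_classes):
    """paths: list over layers of arrays shaped (n_samples, d_layer).

    Returns, for each class, a list over layers of mean feature vectors.
    """
    return [[layer[labels == c].mean(axis=0) for layer in paths]
            for c in range(num_classes)]


def path_correlation(sample_path, center_path):
    """Mean Pearson correlation across layers between a sample path and a center path."""
    corrs = [np.corrcoef(s, c)[0, 1] for s, c in zip(sample_path, center_path)]
    return float(np.mean(corrs))


def classify(sample_path, centers):
    """Predict the class whose center path correlates best with the sample path."""
    scores = [path_correlation(sample_path, c) for c in centers]
    return int(np.argmax(scores)), scores
```

Under the same assumptions, a sample whose best score is unusually low could be flagged as a suspected adversarial example; the threshold for such a rule would have to be calibrated on held-out data.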

[7] Fairness through Feedback: Addressing Algorithmic Misgendering in Automatic Gender Recognition

Camilla Quaresmini,Giacomo Zanotti

Main category: cs.CV

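TL;DR: The paper examines the problems of automatic gender recognition (AGR) systems, proposes a framework for rethinking them in theory and practice, and recommends a user feedback mechanism to improve fairness.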

Details

Motivation: AGR systems are typically built on binary gender assumptions, and there is a gap between what they classify and gender expression, which makes them ill-suited to non-binary people and in need of improvement.

Method: The paper distinguishes between gender, gender expression, and sex, and proposes a user feedback mechanism for correcting the system's output.

Result: Although a feedback mechanism reduces the system's autonomy, it can significantly increase the fairness of AGR.

Conclusion: AGR systems should respect individual rights and self-expression, and should be treated as supporting tools rather than absolute classifiers.

Abstract: Automatic Gender Recognition (AGR) systems are an increasingly widespread application in the Machine Learning (ML) landscape. While these systems are typically understood as detecting gender, they often classify datapoints based on observable features correlated at best with either male or female sex. In addition to questionable binary assumptions, from an epistemological point of view, this is problematic for two reasons. First, there exists a gap between the categories the system is meant to predict (woman versus man) and those onto which their output reasonably maps (female versus male). What is more, gender cannot be inferred on the basis of such observable features. This makes AGR tools often unreliable, especially in the case of non-binary and gender non-conforming people. We suggest a theoretical and practical rethinking of AGR systems. To begin, distinctions are made between sex, gender, and gender expression. Then, we build upon the observation that, unlike algorithmic misgendering, human-human misgendering is open to the possibility of re-evaluation and correction. We suggest that analogous dynamics should be recreated in AGR, giving users the possibility to correct the system's output. While implementing such a feedback mechanism could be regarded as diminishing the system's autonomy, it represents a way to significantly increase fairness levels in AGR. This is consistent with the conceptual change of paradigm that we advocate for AGR systems, which should be understood as tools respecting individuals' rights and capabilities of self-expression and determination.

Youze Xue,Dian Li,Gang Liu

Main category: cs.CV

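TL;DR: The paper analyzes how hard negative samples contribute to contrastive learning in multi-modal large language models (MLLMs) and proposes an Explicit Gradient Amplifier that strengthens the gradients of hard negatives to improve embedding performance.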

Details

Motivation: Although the CLIP framework has been successfully extended to MLLMs, the specific contribution of hard negative samples has not been studied in depth; the authors aim to improve contrastive learning through gradient analysis.

Method: Based on an analysis of the gradients of the info-NCE loss, the paper proposes an Explicit Gradient Amplifier that strengthens the gradients of hard negative samples, and trains a multi-modal embedding model on the LLaVA-OneVision-7B architecture.

Result: The model achieves state-of-the-art performance on the MMEB benchmark and, when combined with the self-developed MLLM QQMM, ranks first on the MMEB leaderboard.

Conclusion: The Explicit Gradient Amplifier effectively improves multi-modal embedding models and offers a new direction for contrastive learning.

Abstract: With the rapid advancement of multi-modal large language models (MLLMs) in recent years, the foundational Contrastive Language-Image Pretraining (CLIP) framework has been successfully extended to MLLMs, enabling more powerful and universal multi-modal embeddings for a wide range of retrieval tasks. Despite these developments, the core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. Within this framework, the effective mining of hard negative samples continues to be a critical factor for enhancing performance. Prior works have introduced both offline and online strategies for hard negative mining to improve the efficiency of contrastive learning. While these approaches have led to improved multi-modal embeddings, the specific contribution of each hard negative sample to the learning process has not been thoroughly investigated. In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive, and negative samples, elucidating the role of hard negatives in updating model parameters. Building upon this analysis, we propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings. Our multi-modal embedding model, trained with the proposed Explicit Gradient Amplifier and based on the LLaVA-OneVision-7B architecture, achieves state-of-the-art performance on the MMEB benchmark compared to previous methods utilizing the same MLLM backbone. Furthermore, when integrated with our self-developed MLLM, QQMM, our approach attains the top rank on the MMEB leaderboard. Code and models are released on https://github.com/QQ-MM/QQMM-embed.
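
For readers who want the gradient analysis spelled out, the standard info-NCE gradients are worked below. The notation is an assumption rather than the paper's: $q$ is the query embedding, $k_+$ the positive, $k_j^-$ the negatives, $\tau$ the temperature, and the embeddings are treated as free vectors (the effect of L2 normalization is ignored).

```latex
\begin{aligned}
s_+ &= \frac{q^\top k_+}{\tau}, \qquad s_j = \frac{q^\top k_j^-}{\tau}, \qquad
Z = e^{s_+} + \sum_{j} e^{s_j}, \\
\mathcal{L} &= -\log \frac{e^{s_+}}{Z} = -s_+ + \log Z, \\
p_+ &= \frac{e^{s_+}}{Z}, \qquad p_j = \frac{e^{s_j}}{Z}, \qquad 1 - p_+ = \sum_j p_j, \\
\frac{\partial \mathcal{L}}{\partial q} &= \frac{1}{\tau}\Big[ -(1 - p_+)\, k_+ + \sum_j p_j\, k_j^- \Big], \\
\frac{\partial \mathcal{L}}{\partial k_+} &= -\frac{1 - p_+}{\tau}\, q, \qquad
\frac{\partial \mathcal{L}}{\partial k_j^-} = \frac{p_j}{\tau}\, q .
\end{aligned}
```

Each negative contributes in proportion to its softmax weight $p_j$, so hard negatives (large $s_j$) already dominate the update. How the Explicit Gradient Amplifier rescales these terms is not stated in the abstract and is not reproduced here.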

[9] Dynamic-Aware Video Distillation: Optimizing Temporal Resolution Based on Video Semantics

Yinjie Zhao,Heng Zhao,Bihan Wen,Yew-Soon Ong,Joey Tianyi Zhou

Main category: cs.CV

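TL;DR: The paper proposes Dynamic-Aware Video Distillation (DAViD), a reinforcement learning approach that optimizes the temporal resolution of distilled video datasets and significantly improves performance.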

Details

Motivation: Temporal information and redundancy in video datasets are underexplored, and existing methods assume a uniform level of temporal redundancy across all video semantics, which limits their effectiveness.

Method: Reinforcement learning (RL) is used to predict the optimal temporal resolution of the synthetic videos, and a teacher-in-the-loop reward function updates the RL agent's policy.

Result: DAViD significantly outperforms existing dataset distillation methods.

Conclusion: The study lays the groundwork for more efficient and semantic-adaptive video dataset distillation.

Abstract: With the rapid development of vision tasks and the scaling on datasets and models, redundancy reduction in vision datasets has become a key area of research. To address this issue, dataset distillation (DD) has emerged as a promising approach to generating highly compact synthetic datasets with significantly less redundancy while preserving essential information. However, while DD has been extensively studied for image datasets, DD on video datasets remains underexplored. Video datasets present unique challenges due to the presence of temporal information and varying levels of redundancy across different classes. Existing DD approaches assume a uniform level of temporal redundancy across all different video semantics, which limits their effectiveness on video datasets. In this work, we propose Dynamic-Aware Video Distillation (DAViD), a Reinforcement Learning (RL) approach to predict the optimal Temporal Resolution of the synthetic videos. A teacher-in-the-loop reward function is proposed to update the RL agent policy. To the best of our knowledge, this is the first study to introduce adaptive temporal resolution based on video semantics in video dataset distillation. Our approach significantly outperforms existing DD methods, demonstrating substantial improvements in performance. This work paves the way for future research on more efficient and semantic-adaptive video dataset distillation.

[10] Do You See Me: A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

Aditya Kanade,Tanuja Ganu

Main category: cs.CV

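TL;DR: MLLMs have visual perception deficits and can misinterpret key visual elements even when their answers are correct. The study introduces a new benchmark and finds that MLLMs perform far below humans, especially on complex tasks.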

Details

Motivation: To expose the visual perception shortcomings of MLLMs and encourage the development of more robust models.

Method: A benchmark of 1,758 images and 2,612 questions across seven subtasks is built to evaluate the visual skills of MLLMs.

Result: Humans achieve 96.49% accuracy, while top MLLMs average below 50%, and the gap widens as task complexity increases.

Conclusion: MLLMs need better visual perception, particularly in handling fine-grained details.

Abstract: Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increased task complexity (e.g., from 12% to 45% in the visual form constancy subtask). Further analysis into the root causes suggests that failures stem from challenges like misallocated visual attention and the instability of internal representations for fine-grained details, especially at or below encoder patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code and evaluation scripts are available at https://github.com/microsoft/Do-You-See-Me.

[11] Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

Hyojin Bahng,Caroline Chan,Fredo Durand,Phillip Isola

Main category: cs.CV

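TL;DR: The paper proposes using cycle consistency as a supervisory signal: similarity is computed through text-to-image and image-to-text round trips, yielding a dataset of 866K comparison pairs, and the resulting models outperform existing methods on a range of tasks.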

Details

Motivation: Existing methods rely on human or AI preferences, which are costly and time-consuming to collect, so a more efficient way to learn language-vision alignment is needed.

Method: Cycle consistency scores are computed through text-to-image and image-to-text round trips, then used to rank candidates and construct a preference dataset.

Result: The reward model trained on this dataset outperforms existing alignment metrics on detailed captioning and improves performance on a range of vision-language tasks and text-to-image generation.

Conclusion: Cycle consistency is an effective supervisory signal that enables efficient construction of high-quality preference datasets and improves performance on multimodal tasks.

Abstract: Learning alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are at https://cyclereward.github.io
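
The ranking step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: `text_to_image` and `image_similarity` are hypothetical callables that the caller must supply (for example, a generative model and a perceptual similarity metric), and forming every strictly ordered pair is an assumed pairing rule.

```python
from itertools import combinations


def cycle_scores(image, captions, text_to_image, image_similarity):
    """Score each caption by how well its reconstruction matches the original image."""
    return [image_similarity(image, text_to_image(c)) for c in captions]


def preference_pairs(captions, scores):
    """Turn scored candidates into (preferred, rejected) pairs, skipping ties."""
    pairs = []
    for (c1, s1), (c2, s2) in combinations(zip(captions, scores), 2):
        if s1 == s2:
            continue  # a tie carries no preference signal
        pairs.append((c1, c2) if s1 > s2 else (c2, c1))
    return pairs
```

The text-to-image direction mentioned in the abstract would follow the same pattern with the roles swapped: reconstruct a caption from each generated image and compare it with the input caption using a text similarity function.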

[12] SAB3R: Semantic-Augmented Backbone in 3D Reconstruction

Xuweiyi Chen,Tian Xia,Sihan Xu,Jianing Yang,Joyce Chai,Zezhou Cheng

Main category: cs.CV

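TL;DR: The paper introduces a new task, Map and Locate, that combines open-vocabulary segmentation with 3D reconstruction, and presents a baseline method, SAB3R, that outperforms existing techniques.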

Details

Motivation: To unify open-vocabulary segmentation and 3D reconstruction and advance embodied AI applications.

Method: Building on MASt3R, a lightweight distillation strategy transfers features from 2D vision backbones into the 3D model, producing semantic features and point maps in a single forward pass.

Result: SAB3R outperforms the combination of MASt3R and CLIP on the Map and Locate task, and its effectiveness is validated on both 2D and 3D tasks.

Conclusion: SAB3R offers an efficient solution to the unified task of open-vocabulary segmentation and 3D reconstruction and supports practical applications.

Abstract: We introduce a new task, Map and Locate, which unifies the traditionally distinct objectives of open-vocabulary segmentation - detecting and segmenting object instances based on natural language queries - and 3D reconstruction, the process of estimating a scene's 3D structure from visual inputs. Specifically, Map and Locate involves generating a point cloud from an unposed video and segmenting object instances based on open-vocabulary queries. This task serves as a critical step toward real-world embodied AI applications and introduces a practical task that bridges reconstruction, recognition and reorganization. To tackle this task, we introduce a simple yet effective baseline, which we denote as SAB3R. Our approach builds upon MASt3R, a recent breakthrough in 3D computer vision, and incorporates a lightweight distillation strategy. This method transfers dense, per-pixel semantic features from 2D vision backbones (e.g., CLIP and DINOv2) to enhance MASt3R's capabilities. Without introducing any auxiliary frozen networks, our model generates per-pixel semantic features and constructs cohesive point maps in a single forward pass. Compared to separately deploying MASt3R and CLIP, our unified model, SAB3R, achieves superior performance on the Map and Locate benchmark. Furthermore, we evaluate SAB3R on both 2D semantic segmentation and 3D tasks to comprehensively validate its effectiveness.

[13] Implicit Deformable Medical Image Registration with Learnable Kernels

Stefano Fogarollo,Gregor Laimer,Reto Bale,Matthias Harders

Main category: cs.CV

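TL;DR: Proposes a novel implicit medical image registration framework that reconstructs the dense displacement field from sparse keypoint correspondences, improving registration accuracy and reliability.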

Details

Motivation: Medical image registration is essential in clinical applications such as oncological treatment, but existing AI methods often produce unreliable deformations, which limits their clinical adoption.

Method: Image registration is reformulated as a signal reconstruction problem: a learned kernel function recovers the dense displacement field from sparse keypoints, and a hierarchical architecture estimates it coarse to fine.

Result: The method performs strongly on thoracic and abdominal zero-shot registration tasks, produces deformations that better preserve anatomical relationships, and matches the performance of specialized commercial systems.

Conclusion: The method improves registration accuracy and reliability and narrows the generalization gap between implicit and explicit registration techniques, indicating potential for clinical adoption.

Abstract: Deformable medical image registration is an essential task in computer-assisted interventions. This problem is particularly relevant to oncological treatments, where precise image alignment is necessary for tracking tumor growth, assessing treatment response, and ensuring accurate delivery of therapies. Recent AI methods can outperform traditional techniques in accuracy and speed, yet they often produce unreliable deformations that limit their clinical adoption. In this work, we address this challenge and introduce a novel implicit registration framework that can predict accurate and reliable deformations. Our insight is to reformulate image registration as a signal reconstruction problem: we learn a kernel function that can recover the dense displacement field from sparse keypoint correspondences. We integrate our method in a novel hierarchical architecture, and estimate the displacement field in a coarse-to-fine manner. Our formulation also allows for efficient refinement at test time, permitting clinicians to easily adjust registrations when needed. We validate our method on challenging intra-patient thoracic and abdominal zero-shot registration tasks, using public and internal datasets from the local University Hospital. Our method not only shows competitive accuracy to state-of-the-art approaches, but also bridges the generalization gap between implicit and explicit registration techniques. In particular, our method generates deformations that better preserve anatomical relationships and matches the performance of specialized commercial systems, underscoring its potential for clinical adoption.
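
The signal-reconstruction view can be made concrete with a small sketch. Note the substitution: the paper learns its kernel, whereas the code below uses a fixed Gaussian kernel with normalized weights (Nadaraya-Watson style interpolation) purely as a stand-in. The function name and the `bandwidth` parameter are assumptions for illustration.

```python
import numpy as np


def dense_displacement(query_points, keypoints, keypoint_disp, bandwidth=10.0):
    """Interpolate displacements at query points from sparse keypoint displacements.

    query_points: (Q, D) coordinates where the field is evaluated.
    keypoints: (K, D) keypoint coordinates.
    keypoint_disp: (K, D) displacement vectors at the keypoints.
    """
    diff = query_points[:, None, :] - keypoints[None, :, :]   # (Q, K, D)
    sq_dist = np.sum(diff ** 2, axis=-1)                      # (Q, K)
    logits = -sq_dist / (2.0 * bandwidth ** 2)
    # Subtract the row maximum so the nearest keypoint always has weight exp(0) = 1,
    # which avoids division by zero for queries far from every keypoint.
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keypoint_disp                            # (Q, D)
```

Because the field is a function of the keypoint correspondences, editing or adding a keypoint changes the result without retraining, which is in the spirit of the test-time refinement the abstract mentions.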

[14] TIIF-Bench: How Does Your T2I Model Follow Your Instructions?

Xinyu Wei,Jinrui Zhang,Zeqing Wang,Hongyang Wei,Zhen Guo,Lei Zhang

Main category: cs.CV

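TL;DR: TIIF-Bench is a new benchmark for text-to-image (T2I) models that uses diverse, complex prompts to systematically evaluate fine-grained alignment with textual instructions.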

Details

Motivation: Existing T2I benchmarks lack prompt diversity and complexity and use coarse metrics, so they cannot accurately assess fine-grained alignment between textual instructions and generated images.

Method: TIIF-Bench contains 5000 prompts organized along multiple dimensions and three difficulty levels, each with a short and a long version to test robustness to prompt length. It adds two key attributes, text rendering and style control, and proposes a computable evaluation framework built on large vision language models.

Result: A detailed evaluation of mainstream T2I models analyzes their strengths and weaknesses and reveals the limitations of existing benchmarks.

Conclusion: TIIF-Bench provides a systematic and comprehensive evaluation tool for T2I models and should help drive further improvements.

Abstract: The rapid advancements of Text-to-Image (T2I) models have ushered in a new phase of AI-generated content, marked by their growing ability to interpret and follow user instructions. However, existing T2I model evaluation benchmarks fall short in prompt diversity and complexity and rely on coarse evaluation metrics, making it difficult to evaluate the fine-grained alignment performance between textual instructions and generated images. In this paper, we present TIIF-Bench (Text-to-Image Instruction Following Benchmark), aiming to systematically assess T2I models' ability in interpreting and following intricate textual instructions. TIIF-Bench comprises a set of 5000 prompts organized along multiple dimensions, which are categorized into three levels of difficulties and complexities. To rigorously evaluate model robustness to varying prompt lengths, we provide a short and a long version for each prompt with identical core semantics. Two critical attributes, i.e., text rendering and style control, are introduced to evaluate the precision of text synthesis and the aesthetic coherence of T2I models. In addition, we collect 100 high-quality designer level prompts that encompass various scenarios to comprehensively assess model performance. Leveraging the world knowledge encoded in large vision language models, we propose a novel computable framework to discern subtle variations in T2I model outputs. Through meticulous benchmarking of mainstream T2I models on TIIF-Bench, we analyze the pros and cons of current T2I models and reveal the limitations of current T2I benchmarks. Project Page: https://a113n-w3i.github.io/TIIF_Bench/.

[15] Quantifying task-relevant representational similarity using decision variable correlation

Yu Qian,Wilson S. Geisler,Xue-Xin Wei

Main category: cs.CV

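TL;DR: The paper proposes a new method, decision variable correlation (DVC), for comparing the decision strategies of models and monkey brains in image classification. It finds that model-monkey similarity is lower than model-model or monkey-monkey similarity and decreases as model performance improves.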

Details

Motivation: To resolve the disagreement in prior work over how similar the representations of the brain and deep neural networks are in image classification, by proposing a comparison method that focuses on task-relevant information.

Method: Decision variable correlation (DVC) quantifies the correlation between decoded decisions in a classification task and is used to assess the similarity between models and monkey V4/IT recordings.

Result: Model-model similarity is comparable to monkey-monkey similarity, but model-monkey similarity is lower and decreases as model performance increases. Neither adversarial training nor pre-training on larger datasets improves model-monkey similarity.

Conclusion: There is a fundamental divergence between the task-relevant representations in monkey V4/IT and those of image classification models.

Abstract: Previous studies have compared the brain and deep neural networks trained on image classification. Intriguingly, while some suggest that their representations are highly similar, others argued the opposite. Here, we propose a new approach to characterize the similarity of the decision strategies of two observers (models or brains) using decision variable correlation (DVC). DVC quantifies the correlation between decoded decisions on individual samples in a classification task and thus can capture task-relevant information rather than general representational alignment. We evaluate this method using monkey V4/IT recordings and models trained on image classification tasks. We find that model-model similarity is comparable to monkey-monkey similarity, whereas model-monkey similarity is consistently lower and, surprisingly, decreases with increasing ImageNet-1k performance. While adversarial training enhances robustness, it does not improve model-monkey similarity in task-relevant dimensions; however, it markedly increases model-model similarity. Similarly, pre-training on larger datasets does not improve model-monkey similarity. These results suggest a fundamental divergence between the task-relevant representations in monkey V4/IT and those learned by models trained on image classification tasks.
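
A rough sketch of how a decision-variable correlation might be computed is given below. The details are assumptions, since the abstract does not specify the decoder or the conditioning: the decoder here is a simple projection onto the difference of class means for a binary task with labels 0 and 1, and the correlation is computed within each class and then averaged so that shared class structure alone does not inflate it. The function names are hypothetical.

```python
import numpy as np


def decision_variable(features, labels):
    """Project features onto the difference-of-class-means axis (binary labels 0/1)."""
    w = features[labels == 1].mean(axis=0) - features[labels == 0].mean(axis=0)
    return features @ w


def dvc(dv_a, dv_b, labels):
    """Average within-class Pearson correlation between two observers' decision variables."""
    corrs = [np.corrcoef(dv_a[labels == c], dv_b[labels == c])[0, 1]
             for c in np.unique(labels)]
    return float(np.mean(corrs))
```

In use, `decision_variable` would be applied separately to each observer's responses to the same stimuli (for example, one model's activations and one set of neural recordings), and `dvc` would then compare the two resulting vectors sample by sample.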

[16] Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360-Degree Firefighting Videos

Aditi Tiwari,Farzaneh Masoud,Dac Trong Nguyen,Jill Kraft,Heng Ji,Klara Nahrstedt

Main category: cs.CV

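TL;DR: Fire360 is a benchmark dataset for evaluating perception and reasoning in firefighting scenarios. It contains 228 360-degree videos and supports five tasks, with the aim of improving AI performance in degraded environments.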

Details

Motivation: Modern AI systems perform poorly in environments where reliability matters most, and firefighters are frequently injured because of breakdowns in situational perception, so AI capabilities under degraded conditions need to improve.

Method: The dataset contains firefighting training videos recorded under diverse conditions, annotated with actions, object locations, and degradation metadata, and supports five evaluation tasks.

Result: Human experts perform well on the TOR task (83.5%), while models such as GPT-4o lag significantly, exposing reasoning failures under degradation.

Conclusion: Releasing Fire360 and its evaluation suite is intended to advance AI perception, memory, reasoning, and action under uncertainty.

Abstract: Modern AI systems struggle most in environments where reliability is critical - scenes with smoke, poor visibility, and structural deformation. Each year, tens of thousands of firefighters are injured on duty, often due to breakdowns in situational perception. We introduce Fire360, a benchmark for evaluating perception and reasoning in safety-critical firefighting scenarios. The dataset includes 228 360-degree videos from professional training sessions under diverse conditions (e.g., low light, thermal distortion), annotated with action segments, object locations, and degradation metadata. Fire360 supports five tasks: Visual Question Answering, Temporal Action Captioning, Object Localization, Safety-Critical Reasoning, and Transformed Object Retrieval (TOR). TOR tests whether models can match pristine exemplars to fire-damaged counterparts in unpaired scenes, evaluating transformation-invariant recognition. While human experts achieve 83.5% on TOR, models like GPT-4o lag significantly, exposing failures in reasoning under degradation. By releasing Fire360 and its evaluation suite, we aim to advance models that not only see, but also remember, reason, and act under uncertainty. The dataset is available at: https://uofi.box.com/v/fire360dataset.

[17] Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment

Johannes Schusterbauer,Ming Gui,Frank Fundel,Björn Ommer

Main category: cs.CV

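TL;DR: The Diff2Flow framework efficiently transfers knowledge from pre-trained diffusion models to flow matching (FM) by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields from diffusion predictions, enabling direct and efficient FM finetuning.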

Details

Motivation: Current foundation FM models are computationally expensive to finetune, whereas diffusion models such as Stable Diffusion benefit from efficient architectures and ecosystem support, so a way to transfer diffusion knowledge to FM efficiently is needed.

Method: Diff2Flow transfers knowledge from diffusion models to FM by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields from diffusion predictions.

Result: Diff2Flow outperforms naïve FM and diffusion finetuning under parameter-efficient constraints and matches or exceeds state-of-the-art methods on diverse downstream tasks.

Conclusion: Diff2Flow addresses the problem of transferring knowledge from diffusion models to FM and offers a new approach to efficient FM finetuning.

Abstract: Diffusion models have revolutionized generative tasks through high-fidelity outputs, yet flow matching (FM) offers faster inference and empirical performance gains. However, current foundation FM models are computationally prohibitive for finetuning, while diffusion models like Stable Diffusion benefit from efficient architectures and ecosystem support. This work addresses the critical challenge of efficiently transferring knowledge from pre-trained diffusion models to flow matching. We propose Diff2Flow, a novel framework that systematically bridges diffusion and FM paradigms by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields from diffusion predictions. This alignment enables direct and efficient FM finetuning of diffusion priors with no extra computation overhead. Our experiments demonstrate that Diff2Flow outperforms naïve FM and diffusion finetuning particularly under parameter-efficient constraints, while achieving superior or competitive performance across diverse downstream tasks compared to state-of-the-art methods. We will release our code at https://github.com/CompVis/diff2flow.
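
To see why timestep rescaling and interpolant alignment are possible at all, a standard derivation helps. The conventions below are assumptions and may differ from the paper's: a variance-preserving diffusion $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with $\alpha_t^2 + \sigma_t^2 = 1$, and a linear FM interpolant $z_s = (1 - s)\,x_0 + s\,\epsilon$ with data at $s = 0$ and noise at $s = 1$.

```latex
\begin{aligned}
\frac{x_t}{\alpha_t + \sigma_t}
  &= \frac{\alpha_t}{\alpha_t + \sigma_t}\, x_0 + \frac{\sigma_t}{\alpha_t + \sigma_t}\, \epsilon
   = (1 - s)\, x_0 + s\, \epsilon = z_s,
  \qquad s = \frac{\sigma_t}{\alpha_t + \sigma_t}, \\
v &= \frac{\mathrm{d} z_s}{\mathrm{d} s} = \epsilon - x_0, \\
\text{$\epsilon$-prediction:}\quad
  \hat{x}_0 &= \frac{x_t - \sigma_t\, \hat{\epsilon}}{\alpha_t},
  \qquad \hat{v} = \hat{\epsilon} - \hat{x}_0, \\
\text{$v$-prediction } (v^{\mathrm{diff}} = \alpha_t \epsilon - \sigma_t x_0):\quad
  \hat{x}_0 &= \alpha_t\, x_t - \sigma_t\, \hat{v}^{\mathrm{diff}},
  \qquad \hat{\epsilon} = \sigma_t\, x_t + \alpha_t\, \hat{v}^{\mathrm{diff}},
  \qquad \hat{v} = \hat{\epsilon} - \hat{x}_0 .
\end{aligned}
```

Under these assumptions, a rescaled diffusion sample lies exactly on the FM path at time $s$, and any diffusion prediction can be converted into an FM velocity in closed form, which is consistent with the abstract's claim of no extra computational overhead.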

[18] VLCD: Vision-Language Contrastive Distillation for Accurate and Efficient Automatic Placenta Analysis

Manas Mehta,Yimu Pan,Kelly Gallagher,Alison D. Gernand,Jeffery A. Goldstein,Delia Mwinyelle,Leena Mithal,James Z. Wang

Main category: cs.CV

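TL;DR: The paper proposes two modifications to vision-language contrastive learning frameworks that improve the accuracy and efficiency of placental pathology detection, with model compression and acceleration improving deployability.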

Details

Motivation: Pathological examination of the placenta is effective for detecting and mitigating childbirth-related health risks, but existing automated methods are computationally heavy, which limits their practical use.

Method: Two modifications are proposed: (1) text-anchored vision-language contrastive knowledge distillation (VLCD), and (2) unsupervised predistillation on a natural image dataset for better initialization.

Result: The distilled models match or surpass the teacher model in performance and robustness, particularly on lower-quality images, while improving efficiency and deployability.

Conclusion: VLCD improves the efficiency and deployability of medical vision-language contrastive learning, making AI-based healthcare solutions more accessible in resource-constrained environments.

Abstract: Pathological examination of the placenta is an effective method for detecting and mitigating health risks associated with childbirth. Recent advancements in AI have enabled the use of photographs of the placenta and pathology reports for detecting and classifying signs of childbirth-related pathologies. However, existing automated methods are computationally expensive, which limits their deployability. We propose two modifications to vision-language contrastive learning (VLC) frameworks to enhance their accuracy and efficiency: (1) text-anchored vision-language contrastive knowledge distillation (VLCD), a new knowledge distillation strategy for medical VLC pretraining, and (2) unsupervised predistillation using a large natural images dataset for improved initialization. Our approach distills efficient neural networks that match or surpass the teacher model in performance while achieving model compression and acceleration. Our results showcase the value of unsupervised predistillation in improving the performance and robustness of our approach, specifically for lower-quality images. VLCD serves as an effective way to improve the efficiency and deployability of medical VLC approaches, making AI-based healthcare solutions more accessible, especially in resource-constrained environments.

[19] Motion aware video generative model

Bowen Xue,Giuseppe Claudio Guarnera,Shuang Zhao,Zahra Montazeri

Main category: cs.CV

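TL;DR: This paper proposes a physics-informed frequency-domain approach to improve the physical plausibility of diffusion-based video generation. It analyzes the spectral signatures of motion types and designs a loss function and an enhancement module, which significantly improve the motion quality of generated videos.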

Details

Motivation: Current diffusion-based video generation relies mainly on statistical learning and does not explicitly model the physics of motion, so generated videos contain non-physical artifacts that reduce realism. The paper aims to improve physical plausibility through frequency-domain physical modeling. Method: (1) Analyzes the frequency-domain characteristics of different physical motions (translation, rotation, scaling); (2) proposes a physical motion loss function that optimizes the video spectrum; (3) designs a frequency-domain enhancement module that adjusts video features using a zero-initialization strategy. Result: Experiments show that the method significantly improves motion quality and physical plausibility without compromising visual quality or semantic alignment. Conclusion: The proposed frequency-domain physical motion framework offers a general way to add physical constraints to deep learning-based video generation, connecting data-driven models with physics-based motion models. Abstract: Recent advances in diffusion-based video generation have yielded unprecedented quality in visual content and semantic coherence. However, current approaches predominantly rely on statistical learning from vast datasets without explicitly modeling the underlying physics of motion, resulting in subtle yet perceptible non-physical artifacts that diminish the realism of generated videos. This paper introduces a physics-informed frequency domain approach to enhance the physical plausibility of generated videos. We first conduct a systematic analysis of the frequency-domain characteristics of diverse physical motions (translation, rotation, scaling), revealing that each motion type exhibits distinctive and identifiable spectral signatures. Building on this theoretical foundation, we propose two complementary components: (1) a physical motion loss function that quantifies and optimizes the conformity of generated videos to ideal frequency-domain motion patterns, and (2) a frequency domain enhancement module that progressively learns to adjust video features to conform to physical motion constraints while preserving original network functionality through a zero-initialization strategy. Experiments across multiple video diffusion architectures demonstrate that our approach significantly enhances motion quality and physical plausibility without compromising visual quality or semantic alignment. Our frequency-domain physical motion framework generalizes effectively across different video generation architectures, offering a principled approach to incorporating physical constraints into deep learning-based video synthesis pipelines. This work seeks to establish connections between data-driven models and physics-based motion models.
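For translation, the "spectral signature" mentioned in the abstract has a textbook instance: the Fourier shift theorem, under which a shift leaves the magnitude spectrum unchanged and adds a phase that is linear in frequency. The sketch below checks this on a 1D signal with a naive DFT. It only illustrates that classical property and makes no claim about how the paper's motion loss is defined. The signal values and function names are illustrative.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_shift(signal, d):
    """Translate the signal by d samples with wrap-around: y[t] = x[t - d]."""
    n = len(signal)
    return [signal[(t - d) % n] for t in range(n)]


# A small bump, then the same bump translated by two samples.
signal = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0]
shifted = circular_shift(signal, 2)
spectrum = dft(signal)
spectrum_shifted = dft(shifted)
```

Comparing `spectrum` and `spectrum_shifted` bin by bin shows equal magnitudes and a phase factor of exp(-2*pi*i*k*d/n), which is the kind of regularity a frequency-domain constraint could target.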

[20] PAIR-Net: Enhancing Egocentric Speaker Detection via Pretrained Audio-Visual Fusion and Alignment Loss

Yu Wang,Juhyung Ha,David J. Crandall

Main category: cs.CV

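TL;DR: PAIR-Net combines a pretrained Whisper audio encoder with a fine-tuned AV-HuBERT visual backbone and uses a cross-modal alignment loss to improve active speaker detection, reaching 76.6% mAP on the Ego4D benchmark.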

Details

Motivation: Traditional visual-centric methods degrade under real-world conditions such as unstable viewpoints, motion blur, and off-screen speech. Method: Integrates a partially frozen Whisper audio encoder with a fine-tuned AV-HuBERT visual backbone and introduces a cross-modal alignment loss to synchronize audio and visual representations. Result: Reaches 76.6% mAP on the Ego4D ASD benchmark, outperforming LoCoNet and STHG. Conclusion: Pretrained audio priors and alignment-based fusion are valuable for ASD under real-world conditions. Abstract: Active speaker detection (ASD) in egocentric videos presents unique challenges due to unstable viewpoints, motion blur, and off-screen speech sources, conditions under which traditional visual-centric methods degrade significantly. We introduce PAIR-Net (Pretrained Audio-Visual Integration with Regularization Network), an effective model that integrates a partially frozen Whisper audio encoder with a fine-tuned AV-HuBERT visual backbone to robustly fuse cross-modal cues. To counteract modality imbalance, we introduce an inter-modal alignment loss that synchronizes audio and visual representations, enabling more consistent convergence across modalities. Without relying on multi-speaker context or ideal frontal views, PAIR-Net achieves state-of-the-art performance on the Ego4D ASD benchmark with 76.6% mAP, surpassing LoCoNet and STHG by 8.2% and 12.9% mAP, respectively. Our results highlight the value of pretrained audio priors and alignment-based fusion for robust ASD under real-world egocentric conditions.
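The abstract does not give the form of the inter-modal alignment loss. One common way to pull paired audio and visual embeddings together is a mean cosine distance over time-aligned frames, sketched below as an assumption rather than as PAIR-Net's actual loss. The function names and the per-frame pairing are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Guard against zero-length vectors; an assumption made for this sketch.
    return dot / max(norm, 1e-12)


def alignment_loss(audio_feats, visual_feats):
    """Mean (1 - cosine similarity) over time-aligned audio/visual frame pairs."""
    pairs = list(zip(audio_feats, visual_feats))
    return sum(1.0 - cosine_similarity(a, v) for a, v in pairs) / len(pairs)
```

The loss is 0 when every pair points in the same direction and 1 when every pair is orthogonal, so minimizing it alongside the detection objective would encourage the two encoders to agree frame by frame.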

[21] Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction

Samuel Li,Pujith Kachana,Prajwal Chidananda,Saurabh Nair,Yasutaka Furukawa,Matthew Brown

Main category: cs.CV

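TL;DR: Rig3R is a multiview reconstruction model that incorporates rig structure when it is available and infers it when it is not, significantly improving 3D reconstruction, camera pose estimation, and rig discovery.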

Details

Motivation: Existing methods such as DUSt3R treat images as unstructured collections, which limits their effectiveness when rig structure is known or inferable. Rig3R aims to address this. Method: Rig3R uses rig metadata (camera ID, time, and rig poses) to build a rig-aware latent space and jointly predicts pointmaps and two types of raymaps (global and rig-centric). Result: Rig3R achieves state-of-the-art results in 3D reconstruction, camera pose estimation, and rig discovery, improving by 17-45% mAA without post-processing. Conclusion: Through its rig-aware design, Rig3R significantly improves multiview reconstruction and remains robust when rig structure information is missing. Abstract: Estimating agent pose and 3D scene structure from multi-camera rigs is a central task in embodied AI applications such as autonomous driving. Recent learned approaches such as DUSt3R have shown impressive performance in multiview settings. However, these models treat images as unstructured collections, limiting effectiveness in scenarios where frames are captured from synchronized rigs with known or inferable structure. To this end, we introduce Rig3R, a generalization of prior multiview reconstruction models that incorporates rig structure when available, and learns to infer it when not. Rig3R conditions on optional rig metadata including camera ID, time, and rig poses to develop a rig-aware latent space that remains robust to missing information. It jointly predicts pointmaps and two types of raymaps: a pose raymap relative to a global frame, and a rig raymap relative to a rig-centric frame consistent across time. Rig raymaps allow the model to infer rig structure directly from input images when metadata is missing. Rig3R achieves state-of-the-art performance in 3D reconstruction, camera pose estimation, and rig discovery, outperforming both traditional and learned methods by 17-45% mAA across diverse real-world rig datasets, all in a single forward pass without post-processing or iterative refinement.

Cristian-Ioan Blaga,Paul Suganthan,Sahil Dua,Krishna Srinivasan,Enrique Alfonseca,Peter Dornbach,Tom Duerig,Imed Zitouni,Zhe Dong

Main category: cs.CV

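TL;DR: The paper presents a new benchmark for mixed-modal image retrieval, comprising two datasets (EI and MMIR), to evaluate deep cross-modal understanding that combines visual and textual information.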

Details

Motivation: Challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking, so a new evaluation tool is needed. Method: Introduces two new datasets: EI (canonical images for Wikipedia entities) and MMIR (derived from the WIT dataset), supporting single entity-image queries and multi-entity-image queries. Result: Experiments validate the benchmark's utility, and crowd-sourced human annotations confirm dataset quality. Conclusion: The proposed benchmark and datasets provide effective training and evaluation resources for mixed-modal retrieval. Abstract: Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations. The datasets are accessible through the GitHub page: https://github.com/google-research-datasets/wit-retrieval.

[23] Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation

Niclas Popp,Kevin Alexander Laube,Matthias Hein,Lukas Schott

Main category: cs.CV

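TL;DR: The paper proposes a diffusion-based data augmentation strategy to address covariate shift in knowledge distillation, significantly improving model robustness.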

Details

Motivation: When data is limited, the effectiveness of knowledge distillation is constrained by spurious features in the training data. The paper aims to use a robust teacher so that the student also becomes robust to these unknown spurious features. Method: Proposes a novel diffusion-based data augmentation method that generates challenging samples by maximizing the disagreement between teacher and student. Result: On CelebA, SpuCo Birds, and spurious ImageNet, the method significantly improves worst-group and mean-group accuracy as well as spurious mAUC. Conclusion: The method outperforms existing diffusion-based data augmentation baselines under covariate shift and improves student robustness. Abstract: Large foundation models trained on extensive datasets demonstrate strong zero-shot capabilities in various domains. To replicate their success when data and model size are constrained, knowledge distillation has become an established tool for transferring knowledge from foundation models to small student networks. However, the effectiveness of distillation is critically limited by the available training data. This work addresses the common practical issue of covariate shift in knowledge distillation, where spurious features appear during training but not at test time. We ask the question: when these spurious features are unknown, yet a robust teacher is available, is it possible for a student to also become robust to them? We address this problem by introducing a novel diffusion-based data augmentation strategy that generates images by maximizing the disagreement between the teacher and the student, effectively creating challenging samples that the student struggles with. Experiments demonstrate that our approach significantly improves worst group and mean group accuracy on CelebA and SpuCo Birds as well as the spurious mAUC on spurious ImageNet under covariate shift, outperforming state-of-the-art diffusion-based data augmentation baselines.
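The abstract does not say how teacher-student disagreement is measured. A common choice in distillation is the KL divergence between the two softmax outputs, and the sketch below uses that as an assumption. It scores a single sample and leaves out the diffusion-guided generation step entirely. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def disagreement(teacher_logits, student_logits):
    """KL(teacher || student): large when the student misses what the teacher predicts."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

A generator steered to increase such a score would favor images on which the student contradicts the teacher, which matches the abstract's description of "challenging samples that the student struggles with".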

[24] QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

Ahmed Wasfy,Omer Nacar,Abdelakreem Elkhateb,Mahmoud Reda,Omar Elshehy,Adel Ammar,Wadii Boulila

Main category: cs.CV

TL;DR: Qari-OCR is a series of vision-language models derived from Qwen2-VL-2B-Instruct and optimized for Arabic OCR, significantly improving recognition accuracy.

Details

Motivation: The complexity of Arabic script (cursive writing, diacritical marks, and varied typography) poses persistent challenges for OCR and calls for more effective solutions. Method: Develops the Qari-OCR model series by iteratively fine-tuning Qwen2-VL-2B-Instruct on specialized synthetic datasets. Result: QARI v0.2 achieves a WER of 0.160, a CER of 0.061, and a BLEU score of 0.737 on diacritically rich text. Conclusion: Qari-OCR markedly improves the accuracy and efficiency of Arabic OCR, and the models and datasets are released to foster research. Abstract: The inherent complexities of Arabic script, namely its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
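The WER and CER figures quoted above are conventionally computed as a Levenshtein edit distance divided by the reference length, over words and characters respectively. The sketch below follows that standard definition. Whether the paper applies extra normalization (for example to whitespace or diacritics) is not stated in the abstract, so none is assumed here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: code-point-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Because Python strings are sequences of code points and Arabic tashkeel are separate combining code points, this `cer` counts a missing or wrong diacritic as one character error.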

[25] Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning

Yijun Yang,Zhao-Yang Wang,Qiuping Liu,Shuwen Sun,Kang Wang,Rama Chellappa,Zongwei Zhou,Alan Yuille,Lei Zhu,Yu-Dong Zhang,Jieneng Chen

Main category: cs.CV

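TL;DR: MeWM is a medical world model that visually predicts future disease states, combining generative models with survival analysis to optimize clinical decisions.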

Details

Motivation: Modern medicine requires effective treatment and informed clinical decisions; MeWM aims to simulate disease dynamics with generative models to help physicians choose the best treatment plan. Method: MeWM combines vision-language models (policy models) with tumor generative models (dynamics models) and uses an inverse dynamics model to evaluate treatment efficacy and select the treatment plan. Result: MeWM achieves state-of-the-art specificity in synthesizing tumor images, outperforms medical-specialized GPTs in optimizing individualized treatment protocols, and improves the F1-score of clinical decision-making by 13%. Conclusion: MeWM lays the groundwork for future use of medical world models as second readers in clinical decision-making. Abstract: Providing effective treatment and making informed clinical decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision-making, leveraging recent advances in large generative models. To this end, we introduce the Medical World Model (MeWM), the first world model in medicine that visually predicts future disease states based on clinical decisions. MeWM comprises (i) vision-language models to serve as policy models, and (ii) tumor generative models as dynamics models. The policy model generates action plans, such as clinical treatments, while the dynamics model simulates tumor progression or regression under given treatment conditions. Building on this, we propose the inverse dynamics model that applies survival analysis to the simulated post-treatment tumor, enabling the evaluation of treatment efficacy and the selection of the optimal clinical action plan. As a result, the proposed MeWM simulates disease dynamics by synthesizing post-treatment tumors, with state-of-the-art specificity in Turing tests evaluated by radiologists. Simultaneously, its inverse dynamics model outperforms medical-specialized GPTs in optimizing individualized treatment protocols across all metrics. Notably, MeWM improves clinical decision-making for interventional physicians, boosting F1-score in selecting the optimal TACE protocol by 13%, paving the way for future integration of medical world models as the second readers.

[26] Generalized Category Discovery via Reciprocal Learning and Class-Wise Distribution Regularization

Duo Liu,Zhiquan Tan,Linglan Zhao,Zhongqiang Zhang,Xiangzhong Fang,Weiran Huang

Main category: cs.CV

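TL;DR: The paper proposes a Reciprocal Learning Framework (RLF) that introduces an auxiliary branch, together with Class-wise Distribution Regularization (CDR), to address the weak base-class discrimination of existing parametric methods and significantly improve performance across all classes.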

Details

Motivation: Existing parametric methods for Generalized Category Discovery (GCD) discriminate base classes poorly because of unreliable self-supervision. Method: Proposes a Reciprocal Learning Framework (RLF) with a main branch and an auxiliary branch: the main branch filters pseudo-base samples to the auxiliary branch, and the auxiliary branch returns more reliable soft labels. Class-wise Distribution Regularization (CDR) is also introduced to reduce the learning bias towards base classes. Result: Experiments on seven GCD datasets show that RLCD performs well across all classes with negligible extra computation. Conclusion: Combining RLF and CDR improves GCD performance and offers a new solution for parametric methods. Abstract: Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones, where the unlabeled set consists of both base and novel classes. Since clustering methods are time-consuming at inference, parametric-based approaches have become more popular. However, recent parametric-based methods suffer from inferior base discrimination due to unreliable self-supervision. To address this issue, we propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary branch devoted to base classification. During training, the main branch filters the pseudo-base samples to the auxiliary branch. In response, the auxiliary branch provides more reliable soft labels for the main branch, leading to a virtuous cycle. Furthermore, we introduce Class-wise Distribution Regularization (CDR) to mitigate the learning bias towards base classes. CDR essentially increases the prediction confidence of the unlabeled data and boosts the novel class performance. Combined with both components, our proposed method, RLCD, achieves superior performance in all classes with negligible extra computation. Comprehensive experiments across seven GCD datasets validate its superiority. Our codes are available at https://github.com/APORduo/RLCD.

Junjie Li,Nan Zhang,Xiaoyang Qu,Kai Lu,Guokuan Li,Jiguang Wan,Jianzong Wang

Main category: cs.CV

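TL;DR: RATE-Nav is a region-aware termination-enhanced method that improves object navigation through a geometric predictive region segmentation algorithm and a region-based exploration estimation algorithm, significantly raising the success rate.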

Details

Motivation: Redundant exploration and exploration failures are prominent problems in current object navigation research, and the timing of exploration termination is a critical but underexplored direction. Method: Proposes RATE-Nav, which combines a geometric predictive region segmentation algorithm with a region-based exploration estimation algorithm and uses vision-language models (VLMs) for efficient termination. Result: Achieves a 67.8% success rate and 31.3% SPL on HM3D, and improves by about 10% over zero-shot methods on MP3D. Conclusion: By optimizing the exploration termination strategy, RATE-Nav significantly improves object navigation performance. Abstract: Object Navigation (ObjectNav) is a fundamental task in embodied artificial intelligence. Although significant progress has been made in semantic map construction and target direction prediction in current research, redundant exploration and exploration failures remain inevitable. A critical but underexplored direction is the timely termination of exploration to overcome these challenges. We observe a diminishing marginal effect between exploration steps and exploration rates and analyze the cost-benefit relationship of exploration. Inspired by this, we propose RATE-Nav, a Region-Aware Termination-Enhanced method. It includes a geometric predictive region segmentation algorithm and a region-based exploration estimation algorithm for exploration rate calculation. Leveraging the visual question answering capabilities of visual language models (VLMs) together with exploration rates enables efficient termination. RATE-Nav achieves a success rate of 67.8% and an SPL of 31.3% on the HM3D dataset. On the more challenging MP3D dataset, RATE-Nav shows approximately 10% improvement over previous zero-shot methods.
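SPL (Success weighted by Path Length), reported above alongside success rate, is the standard navigation metric that discounts each successful episode by how much longer the agent's path was than the shortest path. The sketch below follows the usual definition and is offered as a reference for reading the numbers, not as the paper's evaluation code. The episode tuple layout is an assumption.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: list of (success, shortest_path_length, agent_path_length).
    Each successful episode contributes shortest / max(agent, shortest);
    failed episodes contribute 0.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)
```

Since redundant exploration lengthens the agent's path without changing the shortest path, terminating exploration earlier can raise SPL even when the success rate stays the same.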

[28] InterRVOS: Interaction-aware Referring Video Object Segmentation

Woojeong Jin,Seongchan Kim,Seungryong Kim

Main category: cs.CV

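TL;DR: The paper introduces a new task, InterRVOS, focused on segmenting interacting objects in video, and builds the large-scale InterRVOS-8K dataset and a baseline model, ReVIOSa.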

Details

Motivation: Existing methods mostly segment a single target object and overlook inter-object interactions, which are crucial for video understanding. Method: Introduces the InterRVOS task, constructs the InterRVOS-8K dataset, and designs the baseline model ReVIOSa for interaction-aware segmentation. Result: Experiments show that the approach outperforms prior methods in modeling complex object interactions. Conclusion: The work lays a foundation for research on interaction-centric video understanding. Abstract: Referring video object segmentation aims to segment the object in a video corresponding to a given natural language expression. While prior works have explored various referring scenarios, including motion-centric or multi-instance expressions, most approaches still focus on localizing a single target object in isolation. However, in comprehensive video understanding, an object's role is often defined by its interactions with other entities, which are largely overlooked in existing datasets and models. In this work, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a new task that requires segmenting both actor and target entities involved in an interaction. Each interaction is described through a pair of complementary expressions from different semantic perspectives, enabling fine-grained modeling of inter-object relationships. To tackle this task, we propose InterRVOS-8K, a large-scale, automatically constructed dataset containing diverse interaction-aware expressions with corresponding masks, including challenging cases such as motion-only multi-instance expressions. We also present a baseline architecture, ReVIOSa, designed to handle actor-target segmentation from a single expression, achieving strong performance in both standard and interaction-focused settings. Furthermore, we introduce an actor-target-aware evaluation setting that enables a more targeted assessment of interaction understanding. Experimental results demonstrate that our approach outperforms prior methods in modeling complex object interactions for the referring video object segmentation task, establishing a strong foundation for future research in interaction-centric video understanding. Our project page is available at https://cvlab-kaist.github.io/InterRVOS.

[29] RoadFormer: Local-Global Feature Fusion for Road Surface Classification in Autonomous Driving

Tianze Wang,Zhang Zhang,Chao Sun

Main category: cs.CV

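TL;DR: The paper proposes RoadFormer, a vision-based fine-grained road surface classification method that combines convolutional and Transformer modules to extract local and global features and adds a Foreground-Background Module (FBM) to improve classification; experiments show it outperforms existing methods.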

Details

Motivation: Road surface classification (RSC) is important for the safety of autonomous driving and for traffic management, but existing vision-based methods overlook fine-grained classification (for example, pavements with similar textures). Method: Proposes a pure vision-based fine-grained RSC method that fuses convolutional and Transformer modules for feature extraction and designs an FBM to improve classification of complex pavements. Result: Achieves Top-1 accuracies of 92.52% and 96.50% on large-scale datasets, improving on existing methods by 5.69% to 12.84%. Conclusion: RoadFormer performs well on RSC and significantly improves the reliability of pavement perception in autonomous driving systems. Abstract: The classification of the type of road surface (RSC) aims to utilize pavement features to identify the roughness, wet and dry conditions, and material information of the road surface. Due to its ability to effectively enhance road safety and traffic management, it has received widespread attention in recent years. In autonomous driving, accurate RSC allows vehicles to better understand the road environment, adjust driving strategies, and ensure a safer and more efficient driving experience. For a long time, vision-based RSC has been favored. However, existing visual classification methods have overlooked the exploration of fine-grained classification of pavement types (such as similar pavement textures). In this work, we propose a pure vision-based fine-grained RSC method for autonomous driving scenarios, which fuses local and global feature information through the stacking of convolutional and transformer modules. We further explore the stacking strategies of local and global feature extraction modules to find the optimal feature extraction strategy. In addition, since fine-grained tasks also face the challenge of relatively large intra-class differences and relatively small inter-class differences, we propose a Foreground-Background Module (FBM) that effectively extracts fine-grained context features of the pavement, enhancing the classification ability for complex pavements. Experiments conducted on a large-scale pavement dataset containing one million samples and a simplified dataset reorganized from this dataset achieved Top-1 classification accuracies of 92.52% and 96.50%, respectively, improving by 5.69% to 12.84% compared to SOTA methods. These results demonstrate that RoadFormer outperforms existing methods in RSC tasks, providing significant progress in improving the reliability of pavement perception in autonomous driving systems.

[30] Auto-Labeling Data for Object Detection

Brent A. Griffin,Manushree Gangwar,Jacob Sela,Jason J. Corso

Main category: cs.CV

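TL;DR: This paper proposes training object detection models without ground-truth labels by using pretrained vision-language foundation models to generate pseudo-labels, reducing cost while keeping models efficient.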

Details

Motivation: Traditional object detection labeling is costly, and alternatives either lose functionality or incur prohibitive computational cost. The paper aims to address this. Method: Configures pretrained vision-language foundation models to generate pseudo-labels, which are then used to train lightweight detection models. Result: Experiments show performance competitive with standard labeling on multiple datasets while substantially reducing labeling time and cost. Conclusion: The approach is a viable alternative that offers both performance and efficiency. Abstract: Great labels make great models. However, traditional labeling approaches for tasks like object detection have substantial costs at scale. Furthermore, alternatives to fully-supervised object detection either lose functionality or require larger models with prohibitive computational costs for inference at scale. To that end, this paper addresses the problem of training standard object detection models without any ground truth labels. Instead, we configure previously-trained vision-language foundation models to generate application-specific pseudo "ground truth" labels. These auto-generated labels directly integrate with existing model training frameworks, and we subsequently train lightweight detection models that are computationally efficient. In this way, we avoid the costs of traditional labeling, leverage the knowledge of vision-language models, and keep the efficiency of lightweight models for practical application. We perform exhaustive experiments across multiple labeling configurations, downstream inference models, and datasets to establish best practices and set an extensive auto-labeling benchmark. From our results, we find that our approach is a viable alternative to standard labeling in that it maintains competitive performance on multiple datasets and substantially reduces labeling time and costs.

[31] A TRPCA-Inspired Deep Unfolding Network for Hyperspectral Image Denoising via Thresholded t-SVD and Top-K Sparse Transformer

Liang Li,Jianli Zhao,Sheng Fang,Siyu Chen,Hui Sun

Main category: cs.CV

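TL;DR: Proposes a deep unfolding network (DU-TRPCA) based on tensor robust principal component analysis (TRPCA) that removes mixed noise from hyperspectral images through tightly integrated low-rank and sparse modules.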

Details

Motivation: Hyperspectral images (HSIs) are often degraded by complex mixed noise during acquisition and transmission, and existing hybrid methods do not fully exploit the complementary strengths of different priors or modules. Method: Combines a low-rank module (thresholded tensor singular value decomposition) with a sparse module (Top-K sparse transformer) and alternates between them in a tightly coupled deep unfolding network. Result: Experiments on synthetic and real HSIs show that DU-TRPCA outperforms existing methods under severe mixed noise while offering interpretability and stable denoising dynamics. Conclusion: With tightly coupled low-rank and sparse modules, DU-TRPCA significantly improves HSI denoising while retaining the iterative optimization benefits of TRPCA. Abstract: Hyperspectral images (HSIs) are often degraded by complex mixed noise during acquisition and transmission, making effective denoising essential for subsequent analysis. Recent hybrid approaches that bridge model-driven and data-driven paradigms have shown great promise. However, most of these approaches lack effective alternation between different priors or modules, resulting in loosely coupled regularization and insufficient exploitation of their complementary strengths. Inspired by tensor robust principal component analysis (TRPCA), we propose a novel deep unfolding network (DU-TRPCA) that enforces stage-wise alternation between two tightly integrated modules: low-rank and sparse. The low-rank module employs thresholded tensor singular value decomposition (t-SVD), which provides a widely adopted convex surrogate for tensor low-rankness and has been demonstrated to effectively capture the global spatial-spectral structure of HSIs. The Top-K sparse transformer module adaptively imposes sparse constraints, directly matching the sparse regularization in TRPCA and enabling effective removal of localized outliers and complex noise. This tightly coupled architecture preserves the stage-wise alternation between low-rank approximation and sparse refinement inherent in TRPCA, while enhancing representational capacity through attention mechanisms. Extensive experiments on synthetic and real-world HSIs demonstrate that DU-TRPCA surpasses state-of-the-art methods under severe mixed noise, while offering interpretability benefits and stable denoising dynamics inspired by iterative optimization. Code is available at https://github.com/liangli97/TRPCA-Deep-Unfolding-HSI-Denoising.
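Two elementary operations sit behind the modules named above, and both can be shown on plain lists. In TRPCA-style solvers, thresholding is typically a soft-threshold (shrinkage) applied to singular values, and "Top-K" attention usually means applying softmax only over the k largest scores for each query. The sketch below illustrates those generic operations under that reading. It is not the DU-TRPCA code, and it omits the tensor SVD itself.

```python
import math


def soft_threshold(values, tau):
    """Shrink each value toward zero by tau; values within [-tau, tau] become 0."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in values]


def top_k_softmax(scores, k):
    """Softmax restricted to the k largest scores; other positions get weight 0."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]
```

Applied to singular values, `soft_threshold` suppresses small components and so lowers rank. Applied to attention scores, `top_k_softmax` zeroes out weak interactions, which is one way to impose sparsity inside a transformer block.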

[32] Approximate Borderline Sampling using Granular-Ball for Classification Tasks

Qin Xie,Qinghua Zhang,Shuyin Xia

Main category: cs.CV

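TL;DR: Proposes a granular-ball (GB) based approximate borderline sampling method (GBABS). It uses restricted diffusion-based GB generation (RD-GBG) to avoid overlap and the concept of heterogeneous nearest neighbor for borderline sampling, significantly improving the handling of noisy data in classification tasks.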

Details

Motivation: Existing GB-based sampling methods lack borderline sampling strategies and suffer from blurred class boundaries caused by overlapping GBs, which limits their classification performance. Method: (1) Proposes RD-GBG, which generates GBs by restricted diffusion to avoid overlap; (2) designs GBABS based on the concept of heterogeneous nearest neighbor, enabling borderline sampling and improving the quality of noisy datasets. Result: Experiments show that GBABS performs well on noisy datasets without needing an optimal purity threshold and outperforms the existing GB-based sampling method and other representative sampling methods. Conclusion: GBABS is a general sampling method that is the first to both perform borderline sampling and improve the quality of noisy data, providing an efficient and robust solution for classification tasks. Abstract: Data sampling enhances classifier efficiency and robustness through data compression and quality improvement. Recently, the sampling method based on granular-ball (GB) has shown promising performance in generality and noisy classification tasks. However, some limitations remain, including the absence of borderline sampling strategies and issues with class boundary blurring or shrinking due to overlap between GBs. In this paper, an approximate borderline sampling method using GBs is proposed for classification tasks. First, a restricted diffusion-based GB generation (RD-GBG) method is proposed, which prevents GB overlaps by constrained expansion, preserving precise geometric representation of GBs via redefined ones. Second, based on the concept of heterogeneous nearest neighbor, a GB-based approximate borderline sampling (GBABS) method is proposed, which is the first general sampling method capable of both borderline sampling and improving the quality of class noise datasets. Additionally, since RD-GBG incorporates noise detection and GBABS focuses on borderline samples, GBABS performs outstandingly on class noise datasets without the need for an optimal purity threshold. Experimental results demonstrate that the proposed methods outperform the GB-based sampling method and several representative sampling methods. Our source code is publicly available at https://github.com/CherylTse/GBABS.
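The abstract leaves the heterogeneous nearest neighbor undefined. A plausible point-level reading is "the closest sample with a different class label", and samples that serve as such a neighbor for some other sample lie near the class boundary. The sketch below implements that simplified reading on raw points. The actual GBABS method operates on granular-balls and adds noise handling, both of which are left out here. The function names are hypothetical.

```python
import math


def heterogeneous_nearest_neighbor(index, points, labels):
    """Index of the closest point whose label differs from points[index]'s label."""
    best, best_dist = None, math.inf
    for j, (p, lab) in enumerate(zip(points, labels)):
        if lab == labels[index]:
            continue
        d = math.dist(points[index], p)
        if d < best_dist:
            best, best_dist = j, d
    return best


def borderline_indices(points, labels):
    """Points that are the heterogeneous nearest neighbor of at least one sample."""
    chosen = set()
    for i in range(len(points)):
        j = heterogeneous_nearest_neighbor(i, points, labels)
        if j is not None:
            chosen.add(j)
    return sorted(chosen)
```

On two well-separated clusters this keeps only the samples facing the other class, which is the data-compression effect borderline sampling aims for.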

[33] ViTNF: Leveraging Neural Fields to Boost Vision Transformers in Generalized Category Discovery

Jiayi Su,Dequan Jin

Main category: cs.CV

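TL;DR: The paper proposes a neural field-based (NF) classifier that replaces the MLP head in ViT, simplifying the training pipeline and improving performance.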

Details

Motivation: The MLP head of a conventional ViT is costly to train and does not fully exploit the feature extractor, so it needs improving for generalized category discovery (GCD). Method: Proposes static neural field functions to build an NF classifier that replaces the MLP head, forming the ViTNF architecture, and streamlines the training pipeline. Result: Significantly surpasses existing methods on several datasets, with accuracy gains of 19% and 16% on new and all classes, respectively. Conclusion: ViTNF performs strongly on GCD with more efficient training. Abstract: Generalized category discovery (GCD) is a highly popular task in open-world recognition, aiming to identify unknown class samples using known class data. By leveraging pre-training, meta-training, and fine-tuning, ViT achieves excellent few-shot learning capabilities. Its MLP head is a feedforward network, trained synchronously with the entire network in the same process, increasing the training cost and difficulty without fully leveraging the power of the feature extractor. This paper proposes a new architecture by replacing the MLP head with a neural field-based one. We first present a new static neural field function to describe the activity distribution of the neural field and then use two static neural field functions to build an efficient few-shot classifier. This neural field-based (NF) classifier consists of two coupled static neural fields. It stores the feature information of support samples by its elementary field, the known categories by its high-level field, and the category information of support samples by its cross-field connections. We replace the MLP head with the proposed NF classifier, resulting in a novel architecture ViTNF, and simplify the three-stage training mode by pre-training the feature extractor on source tasks and training the NF classifier with support samples in meta-testing separately, significantly reducing ViT's demand for training samples and the difficulty of model training. To enhance the model's capability in identifying new categories, we provide an effective algorithm to determine the lateral interaction scale of the elementary field. Experimental results demonstrate that our model surpasses existing state-of-the-art methods on CIFAR-100, ImageNet-100, CUB-200, and Stanford Cars, achieving dramatic accuracy improvements of 19% and 16% in new and all classes, respectively, indicating a notable advantage in GCD.

Seulgi Kim,Ghazal Kaviani,Mohit Prabhushankar,Ghassan AlRegib

Main category: cs.CV

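TL;DR: The paper proposes a multi-level and multi-modal action anticipation method (m&m-Ant) that combines visual and textual cues and uses a fine-grained label generator and a temporal consistency loss, achieving a 3.08% accuracy improvement across several datasets.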

Details

Motivation: Action anticipation must handle incomplete information, and traditional methods focus only on the visual modality, neglecting the potential of integrating multiple information sources. Method: Proposes m&m-Ant, which combines visual and textual cues and introduces a fine-grained label generator and a temporal consistency loss function. Result: Achieves an average accuracy improvement of 3.08% on Breakfast, 50 Salads, and DARai. Conclusion: Multi-modal and hierarchical modeling shows promise for action anticipation and sets a new benchmark for future research. Abstract: Action anticipation, the task of predicting future actions from partially observed videos, is crucial for advancing intelligent systems. Unlike action recognition, which operates on fully observed videos, action anticipation must handle incomplete information. Hence, it requires temporal reasoning and inherent uncertainty handling. While recent advances have been made, traditional methods often focus solely on visual modalities, neglecting the potential of integrating multiple sources of information. Drawing inspiration from human behavior, we introduce Multi-level and Multi-modal Action Anticipation (m&m-Ant), a novel multi-modal action anticipation approach that combines both visual and textual cues, while explicitly modeling hierarchical semantic information for more accurate predictions. To address the challenge of inaccurate coarse action labels, we propose a fine-grained label generator paired with a specialized temporal consistency loss function to optimize performance. Extensive experiments on widely used datasets, including Breakfast, 50 Salads, and DARai, demonstrate the effectiveness of our approach, achieving state-of-the-art results with an average anticipation accuracy improvement of 3.08% over existing methods. This work underscores the potential of multi-modal and hierarchical modeling in advancing action anticipation and establishes a new benchmark for future research in the field. Our code is available at: https://github.com/olivesgatech/mM-ant.

[35] RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection

Yongxian Liu,Boyang Li,Ting Liu,Zaiping Lin,Wei An

Main category: cs.CV

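TL;DR: Proposes RRCA-Net, a recurrent reusable-convolution attention network for infrared small target detection that uses a reusable-convolution block and a dual interactive attention aggregation module for efficient detection; experiments show strong performance with few parameters.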

Details

Motivation: Infrared small target detection is challenging because targets are small, dim, and variable in shape, and existing CNN methods rely on heavy feature extraction and fusion modules, which is inefficient. Method: Designs RRCA-Net with a reusable-convolution block (RuCB) and a dual interactive attention aggregation module (DIAAM), refining feature extraction and fusion through recurrent iteration and attention. Result: Performs well on several benchmark datasets, comparable to state-of-the-art methods with fewer parameters, and can serve as a plug-in to improve other methods. Conclusion: Through efficient feature extraction and attention, RRCA-Net achieves strong infrared small target detection with low parameter requirements and has broad application potential. Abstract: Infrared small target detection is a challenging task due to its unique characteristics (e.g., small, dim, shapeless and changeable). Recently published CNN-based methods have achieved promising performance with heavy feature extraction and fusion modules. To achieve efficient and effective detection, we propose a recurrent reusable-convolution attention network (RRCA-Net) for infrared small target detection. Specifically, RRCA-Net incorporates reusable-convolution block (RuCB) in a recurrent manner without introducing extra parameters. With the help of the repetitive iteration in RuCB, the high-level information of small targets in the deep layers can be well maintained and further refined. Then, a dual interactive attention aggregation module (DIAAM) is proposed to promote the mutual enhancement and fusion of refined information. In this way, RRCA-Net can both achieve high-level feature refinement and enhance the correlation of contextual information between adjacent layers. Moreover, to achieve steady convergence, we design a target characteristic inspired loss function (DpT-k loss) by integrating physical and mathematical constraints. Experimental results on three benchmark datasets (NUAA-SIRST, IRSTD-1k, and DenseSIRST) demonstrate that our RRCA-Net can achieve comparable performance to the state-of-the-art methods while maintaining a small number of parameters, and act as a plug and play module to introduce consistent performance improvement for several popular IRSTD methods. Our code will be available at https://github.com/yongxianLiu/ soon.

[36] The Devil is in the Darkness: Diffusion-Based Nighttime Dehazing Anchored in Brightness Perception

Xiaofeng Cong,Yu-Xin Zhang,Haoran Wei,Yeying Jin,Junming Hou,Jie Gui,Jing Zhang,Dacheng Tao

Main category: cs.CV

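TL;DR: The DiffND framework uses data synthesis and brightness-aware optimization to resolve inconsistent brightness mapping in nighttime image dehazing and to make lighting reconstruction more realistic.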

Details

Motivation: Existing nighttime dehazing methods ignore the brightness relationship between day and night, so synthesized images have brightness inconsistent with the real world, and models lack explicit knowledge of daytime brightness. Method: Proposes the DiffND framework, comprising a brightness-consistent data synthesis pipeline and a brightness-aware restoration model built on a pre-trained diffusion model. Result: Experiments validate the dataset's utility and the model's superior performance in joint dehazing and brightness mapping. Conclusion: DiffND performs well in nighttime dehazing and brightness reconstruction and addresses the limitations of existing methods. Abstract: While nighttime image dehazing has been extensively studied, converting nighttime hazy images to daytime-equivalent brightness remains largely unaddressed. Existing methods face two critical limitations: (1) datasets overlook the brightness relationship between day and night, resulting in the brightness mapping being inconsistent with the real world during image synthesis; and (2) models do not explicitly incorporate daytime brightness knowledge, limiting their ability to reconstruct realistic lighting. To address these challenges, we introduce the Diffusion-Based Nighttime Dehazing (DiffND) framework, which excels in both data synthesis and lighting reconstruction. Our approach starts with a data synthesis pipeline that simulates severe distortions while enforcing brightness consistency between synthetic and real-world scenes, providing a strong foundation for learning night-to-day brightness mapping. Next, we propose a restoration model that integrates a pre-trained diffusion model guided by a brightness perception network. This design harnesses the diffusion model's generative ability while adapting it to nighttime dehazing through brightness-aware optimization. Experiments validate our dataset's utility and the model's superior performance in joint haze removal and brightness mapping.

[37] Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather

Longyu Yang,Ping Hu,Shangbo Yuan,Lu Zhang,Jun Liu,Hengtao Shen,Xiaofeng Zhu

Main category: cs.CV

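TL;DR: Proposes GRC, a dual-branch framework that separates geometric and reflectance feature extraction to improve LiDAR semantic segmentation in adverse weather.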

Details

Motivation: Existing LiDAR semantic segmentation models degrade in adverse weather, and few studies address the heterogeneous domain shifts in point-cloud geometric structure and reflectance intensity. Method: GRC uses a dual-branch architecture to process geometric and reflectance features separately, and a multi-level feature collaboration module to suppress redundant information. Result: Experiments show that GRC markedly improves robustness and generalization in adverse weather without relying on complex simulation or augmentation. Conclusion: Through geometry-reflectance collaboration, GRC effectively handles heterogeneous domain shifts and sets new state-of-the-art results on benchmarks. Abstract: Existing LiDAR semantic segmentation models often suffer from decreased accuracy when exposed to adverse weather conditions. Recent methods addressing this issue focus on enhancing training data through weather simulation or universal augmentation techniques. However, few works have studied the negative impacts caused by the heterogeneous domain shifts in the geometric structure and reflectance intensity of point clouds. In this paper, we delve into this challenge and address it with a novel Geometry-Reflectance Collaboration (GRC) framework that explicitly separates feature extraction for geometry and reflectance. Specifically, GRC employs a dual-branch architecture designed to independently process geometric and reflectance features initially, thereby capitalizing on their distinct characteristics. Then, GRC adopts a robust multi-level feature collaboration module to suppress redundant and unreliable information from both branches. Consequently, without complex simulation or augmentation, our method effectively extracts intrinsic information about the scene while suppressing interference, thus achieving better robustness and generalization in adverse weather conditions. We demonstrate the effectiveness of GRC through comprehensive experiments on challenging benchmarks, showing that our method outperforms previous approaches and establishes new state-of-the-art results.

[38] Modelship Attribution: Tracing Multi-Stage Manipulations Across Generative Models

Zhiya Tan,Xin Zhang,Joey Tianyi Zhou

Main category: cs.CV

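TL;DR: Proposes a new method for tracing complex, iterative image manipulations. It is the first to systematically model and define the "Modelship Attribution" task, which tracks the manipulation process by identifying the generative models involved and reconstructing the sequence of edits.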

Details

Motivation: As generative techniques spread, image manipulation becomes complex and hard to trace; existing methods work well for single-stage manipulation but fall short on complex iterative manipulation. Method: Simulates multi-stage manipulation with three generative models (StyleMapGAN, DiffSwap, FacePartsSwap), builds the first modelship attribution dataset, and proposes the modelship attribution transformer (MAT) framework. Result: Experiments show that MAT performs strongly on multi-stage manipulation tracing and clearly outperforms other methods. Conclusion: MAT offers an effective solution for tracing complex iterative manipulations and fills a gap in existing research. Abstract: As generative techniques become increasingly accessible, authentic visuals are frequently subjected to iterative alterations by various individuals employing a variety of tools. Currently, to avoid misinformation and ensure accountability, a lot of research on detection and attribution is emerging. Although these methods demonstrate promise in single-stage manipulation scenarios, they fall short when addressing complex real-world iterative manipulation. In this paper, we are the first, to the best of our knowledge, to systematically model this real-world challenge and introduce a novel method to solve it. We define a task called "Modelship Attribution", which aims to trace the evolution of manipulated images by identifying the generative models involved and reconstructing the sequence of edits they performed. To realistically simulate this scenario, we utilize three generative models, StyleMapGAN, DiffSwap, and FacePartsSwap, that sequentially modify distinct regions of the same image. This process leads to the creation of the first modelship dataset, comprising 83,700 images (16,740 images × 5). Given that later edits often overwrite the fingerprints of earlier models, the focus shifts from extracting blended fingerprints to characterizing each model's distinctive editing patterns. To tackle this challenge, we introduce the modelship attribution transformer (MAT), a purpose-built framework designed to effectively recognize and attribute the contributions of various models within complex, multi-stage manipulation workflows. Through extensive experiments and comparative analysis with other related methods, our results, including comprehensive ablation studies, demonstrate that the proposed approach is a highly effective solution for modelship attribution.

[39] Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

Wenhao Tang,Rong Qin,Heng Fang,Fengtao Zhou,Hao Chen,Xiang Li,Ming-Ming Cheng

Main category: cs.CV

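TL;DR: Proposes ABMILX, a new multiple instance learning method that addresses the optimization problem of sparse-attention MIL through global correlation-based attention refinement and a multi-head mechanism. Combined with an efficient multi-scale random patch sampling strategy, it delivers superior end-to-end training performance.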

Details

Motivation: The conventional two-stage paradigm in computational pathology is limited because encoders are not fine-tuned for downstream tasks and are optimized separately from MIL. End-to-end learning is intuitive but faces high computational cost and suboptimal results, so the authors revisit and improve it. Method: Proposes ABMILX, which resolves the optimization problem of sparse-attention MIL through global correlation-based attention refinement and a multi-head mechanism, and uses multi-scale random patch sampling for efficient training. Result: Experiments show that ABMILX surpasses state-of-the-art two-stage methods on multiple benchmarks while remaining computationally efficient (<10 RTX3090 hours). Conclusion: The paper demonstrates the potential of end-to-end learning in computational pathology and calls for more research in this area. Abstract: Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of the optimization challenge caused by sparse-attention MIL and propose a novel MIL called ABMILX. It mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With the efficient multi-scale random patch sampling strategy, an E2E trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient (<10 RTX3090 hours). We show the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at https://github.com/DearCaat/E2E-WSI-ABMILX.

[40] Guiding Registration with Emergent Similarity from Pre-Trained Diffusion Models

Nurislam Tursynbek,Hastings Greer,Basar Demir,Marc Niethammer

Main category: cs.CV

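TL;DR: Diffusion models are applied to medical image registration: semantic features extracted from them yield more accurate correspondences than traditional intensity-based registration.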

Details

Motivation: Traditional intensity-based registration performs poorly when anatomical structures are inconsistent between images; diffusion models can extract semantic features, which offers a new way to address this. Method: Uses features from a pre-trained diffusion model as a similarity measure to guide deformable image registration networks. Result: Performs well on 2D multimodal (DXA to X-ray) and 3D monomodal (brain-extracted to non-brain-extracted MRI) registration tasks. Conclusion: Using diffusion model features as a similarity measure substantially improves registration accuracy, especially when anatomical structures are inconsistent. Abstract: Diffusion models, while trained for image generation, have emerged as powerful foundational feature extractors for downstream tasks. We find that off-the-shelf diffusion models, trained exclusively to generate natural RGB images, can identify semantically meaningful correspondences in medical images. Building on this observation, we propose to leverage diffusion model features as a similarity measure to guide deformable image registration networks. We show that common intensity-based similarity losses often fail in challenging scenarios, such as when certain anatomies are visible in one image but absent in another, leading to anatomically inaccurate alignments. In contrast, our method identifies true semantic correspondences, aligning meaningful structures while disregarding those not present across images. We demonstrate superior performance of our approach on two tasks: multimodal 2D registration (DXA to X-Ray) and monomodal 3D registration (brain-extracted to non-brain-extracted MRI). Code: https://github.com/uncbiag/dgir

[41] Empowering Functional Neuroimaging: A Pre-trained Generative Framework for Unified Representation of Neural Signals

Weiheng Yao,Xuhang Chen,Shuqiang Wang

Main category: cs.CV

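TL;DR: Proposes a generative-AI-based unified representation framework for multimodal functional neuroimaging that addresses high data acquisition costs and fairness concerns.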

Details

Motivation: Multimodal functional neuroimaging is costly to acquire and raises fairness concerns, so the challenges of insufficient data and underrepresented groups need to be addressed. Method: Uses generative AI to map multimodal functional neuroimaging into a unified representation space and to generate data for acquisition-constrained modalities and underrepresented groups. Result: Experiments show that the framework generates data consistent with real brain activity patterns, improves downstream task performance, and enhances model fairness. Conclusion: The framework offers a new paradigm for reducing the cost of acquiring multimodal functional neuroimaging and improving the fairness of BCI decoding models. Abstract: Multimodal functional neuroimaging enables systematic analysis of brain mechanisms and provides discriminative representations for brain-computer interface (BCI) decoding. However, its acquisition is constrained by high costs and feasibility limitations. Moreover, underrepresentation of specific groups undermines the fairness of BCI decoding models. To address these challenges, we propose a unified representation framework for multimodal functional neuroimaging via generative artificial intelligence (AI). By mapping multimodal functional neuroimaging into a unified representation space, the proposed framework is capable of generating data for acquisition-constrained modalities and underrepresented groups. Experiments show that the framework can generate data consistent with real brain activity patterns, provide insights into brain mechanisms, and improve performance on downstream tasks. More importantly, it can enhance model fairness by augmenting data for underrepresented groups. Overall, the framework offers a new paradigm for decreasing the cost of acquiring multimodal functional neuroimages and enhancing the fairness of BCI decoding models.

[42] Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification

Shuang Li,Jiaxu Leng,Changjiang Kuang,Mingpi Tan,Xinbo Gao

Main category: cs.CV

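TL;DR: Proposes VLD, a video-level language-driven VVI-ReID framework that generates modality-shared text prompts and incorporates spatiotemporal information, substantially improving cross-modality person re-identification.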

Details

Motivation: Address the modality gap when generating and using video-level language prompts, in order to extract modality-invariant sequence-level features. Method: Proposes the VLD framework with two core modules, invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP), which generate modality-shared text prompts and model spatiotemporal information, respectively. Result: Achieves state-of-the-art results on two VVI-ReID benchmarks. Conclusion: Through language prompts and spatiotemporal modeling, VLD effectively mitigates modality differences in cross-modality person re-identification. Abstract: Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is theoretically feasible. However, the challenge of generating and utilizing modality-shared video-level language prompts to address modality gaps remains a critical problem. To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP employs a joint fine-tuning strategy for the visual encoder and the prompt learner to effectively generate modality-shared text prompts and align them with visual features from different modalities in CLIP's multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces dedicated identity-level loss and specialized multihead attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at https://github.com/Visuang/VLD.

[43] SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

Lingwei Dang,Ruizhi Shao,Hongwen Zhang,Wei Min,Yebin Liu,Qingyao Wu

Main category: cs.CV

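TL;DR: Proposes a synchronized diffusion framework that combines visual priors and dynamic constraints to generate hand-object interaction (HOI) video and motion simultaneously, removing the dependence on predefined 3D models and explicit pose guidance.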

Details

Motivation: Current HOI motion generation methods depend on predefined 3D models and lab-captured data, which limits generalization, while video generation methods sacrifice physical plausibility. This work combines visual and dynamic constraints to improve generation quality. Method: Uses tri-modal adaptive modulation to align features and 3D full-attention to model dependencies across modalities; proposes a vision-aware 3D interaction diffusion model that forms a closed feedback loop. Result: Experiments show that the method outperforms existing approaches in generating high-fidelity, dynamically plausible HOI sequences and generalizes notably well. Conclusion: The framework substantially improves video-motion consistency and applies to unseen real-world scenarios. Abstract: Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature alignment, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at https://github.com/Droliven.

[44] VidEvent: A Large Dataset for Understanding Dynamic Evolution of Events in Videos

Baoyu Liang,Qile Su,Shoutai Zhu,Yuchen Liang,Chao Tong

Main category: cs.CV

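TL;DR: Introduces the task of video event understanding and releases the VidEvent dataset, which contains over 23,000 annotated events and supports event script extraction and prediction.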

Details

Motivation: The complex structure and dynamic evolution of video events make them hard for AI to understand, so research needs a high-quality dataset and baseline models. Method: Builds the VidEvent dataset from carefully annotated movie recap videos and provides baseline model architectures and performance metrics. Result: The VidEvent dataset and baseline models provide high-quality resources for video event understanding and encourage algorithmic innovation. Conclusion: VidEvent is expected to advance research on video event understanding; the dataset and resources are publicly available. Abstract: Despite the significant impact of visual events on human cognition, understanding events in videos remains a challenging task for AI due to their complex structures, semantic hierarchies, and dynamic evolution. To address this, we propose the task of video event understanding that extracts event scripts and makes predictions with these scripts from videos. To support this task, we introduce VidEvent, a large-scale dataset containing over 23,000 well-labeled events, featuring detailed event structures, broad hierarchies, and logical relations extracted from movie recap videos. The dataset was created through a meticulous annotation process, ensuring high-quality and reliable event data. We also provide comprehensive baseline models offering detailed descriptions of their architecture and performance metrics. These models serve as benchmarks for future research, facilitating comparisons and improvements. Our analysis of VidEvent and the baseline models highlights the dataset's potential to advance video event understanding and encourages the exploration of innovative algorithms and models. The dataset and related resources are publicly available at www.videvent.top.

[45] ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model

Wenshuo Chen,Kuimou Yu,Haozhe Jia,Kaishen Yuan,Bowen Tian,Songning Lai,Hongru Xiao,Erhang Zhang,Lei Wang,Yutao Yue

Main category: cs.CV

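TL;DR: ANT is an Adaptive Neural Temporal-Aware architecture that dynamically adjusts semantic granularity and conditional guidance, substantially improving text-to-motion generation.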

Details

Motivation: Static semantic conditioning in existing diffusion models for text-to-motion generation ignores temporal-frequency demands: early denoising needs structural semantics while later stages need localized details, a mismatch that mirrors the stage-specific demands of biological morphogenesis. Method: Proposes the ANT architecture, comprising a Semantic Temporally Adaptive (STA) module, Dynamic Classifier-Free Guidance scheduling (DCFG), and temporal-semantic reweighting, to adjust semantic granularity and conditional guidance dynamically. Result: Experiments show that ANT substantially improves model performance and achieves state-of-the-art semantic alignment on StableMoFusion. Conclusion: By imitating the regulatory mechanisms of biological morphogenesis, ANT effectively addresses the temporal-frequency demands of text-to-motion generation. Abstract: While diffusion models advance text-to-motion generation, their static semantic conditioning ignores temporal-frequency demands: early denoising requires structural semantics for motion foundations while later stages need localized details for text alignment. This mismatch mirrors biological morphogenesis where developmental phases demand distinct genetic programs. Inspired by epigenetic regulation governing morphological specialization, we propose ANT, an Adaptive Neural Temporal-Aware architecture. ANT orchestrates semantic granularity through: (i) Semantic Temporally Adaptive (STA) Module: Automatically partitions denoising into low-frequency structural planning and high-frequency refinement via spectral analysis. (ii) Dynamic Classifier-Free Guidance scheduling (DCFG): Adaptively adjusts the conditional-to-unconditional ratio, enhancing efficiency while maintaining fidelity. (iii) Temporal-semantic reweighting: Quantitatively aligns text influence with phase requirements. Extensive experiments show that ANT can be applied to various baselines, significantly improving model performance, and achieving state-of-the-art semantic alignment on StableMoFusion.
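To make the notion of scheduling the conditional-to-unconditional ratio concrete, here is a minimal sketch built on the standard classifier-free guidance combination, `eps = eps_uncond + w * (eps_cond - eps_uncond)`. The linear schedule, its endpoints, and the function names are assumptions for illustration only; the abstract does not state how DCFG actually adapts the weight.

```python
def guidance_weight(step, total_steps, w_start, w_end):
    """Hypothetical linear schedule for the guidance weight across denoising steps."""
    if total_steps < 2:
        return w_start
    frac = step / (total_steps - 1)
    return w_start + frac * (w_end - w_start)


def guided_prediction(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance mix of conditional and unconditional noise predictions."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Illustrative values only: the weight changes from 2.0 to 6.0 over 50 steps.
weights = [guidance_weight(t, 50, 2.0, 6.0) for t in range(50)]
mixed = guided_prediction([1.0, 2.0], [0.0, 1.0], weights[0])
```

With `w = 0` the output equals the unconditional prediction and with `w = 1` it equals the conditional one; a static scheme fixes `w` for every step, whereas a scheduled scheme lets it vary with the denoising phase.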

[46] PAID: Pairwise Angular-Invariant Decomposition for Continual Test-Time Adaptation

Kunyu Wang,Xueyang Fu,Yunfei Bao,Chengjie Ge,Chengzhi Cao,Wei Zhai,Zheng-Jun Zha

Main category: cs.CV

TL;DR: PAID is a simple and effective continual test-time adaptation method that preserves the pairwise angular structure of pre-trained weights.

Details

Motivation: Existing methods overlook the domain-invariant priors encoded in pre-trained weights; PAID exploits their geometric attributes, such as pairwise angular structure, to improve adaptation. Method: PAID decomposes weights into magnitude and direction and introduces a learnable orthogonal matrix that globally rotates the directions while preserving pairwise angular structure. Result: PAID outperforms existing methods on four CTTA benchmarks. Conclusion: Preserving pairwise angular structure is a simple yet effective principle for CTTA. Abstract: Continual Test-Time Adaptation (CTTA) aims to adapt a pre-trained model online to changing environments during inference. Most existing methods focus on exploiting target data, while overlooking another crucial source of information, the pre-trained weights, which encode underutilized domain-invariant priors. This paper takes the geometric attributes of pre-trained weights as a starting point, systematically analyzing three key components: magnitude, absolute angle, and pairwise angular structure. We find that the pairwise angular structure remains stable across diverse corrupted domains and encodes domain-invariant semantic information, suggesting it should be preserved during adaptation. Based on this insight, we propose PAID (Pairwise Angular-Invariant Decomposition), a prior-driven CTTA method that decomposes weights into magnitude and direction, and introduces a learnable orthogonal matrix via Householder reflections to globally rotate direction while preserving the pairwise angular structure. During adaptation, only the magnitudes and the orthogonal matrices are updated. PAID achieves consistent improvements over recent SOTA methods on four widely used CTTA benchmarks, demonstrating that preserving pairwise angular structure offers a simple yet effective principle for CTTA.
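The geometric claim behind this decomposition can be checked with a small pure-Python sketch: split each weight vector into a magnitude and a unit direction, apply Householder reflections `H = I - 2 v v^T / (v^T v)` to every direction, and confirm that the pairwise cosines are unchanged. Treating each row as one weight vector, the example values, and all function names are assumptions; this is not the authors' implementation, and no learning is performed here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(weight_rows):
    """Split each row into its magnitude (L2 norm) and unit direction."""
    mags = [math.sqrt(dot(r, r)) for r in weight_rows]
    dirs = [[x / m for x in r] for r, m in zip(weight_rows, mags)]
    return mags, dirs


def householder_reflect(x, v):
    """Apply H = I - 2 v v^T / (v^T v) to vector x."""
    scale = 2.0 * dot(v, x) / dot(v, v)
    return [xi - scale * vi for xi, vi in zip(x, v)]


def rotate(dirs, reflection_vectors):
    """Apply the same product of Householder reflections to every direction."""
    out = dirs
    for v in reflection_vectors:
        out = [householder_reflect(d, v) for d in out]
    return out


def pairwise_cosines(dirs):
    return [[dot(a, b) for b in dirs] for a in dirs]


weights = [[3.0, 4.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 2.0]]   # illustrative values
mags, dirs = decompose(weights)
rotated = rotate(dirs, [[1.0, 2.0, -1.0], [0.5, -1.0, 3.0]])    # stand-ins for learnable vectors
before, after = pairwise_cosines(dirs), pairwise_cosines(rotated)
max_angle_change = max(abs(before[i][j] - after[i][j]) for i in range(3) for j in range(3))
```

Because a product of Householder reflections is orthogonal, `max_angle_change` stays at floating-point noise level even though the individual directions move, which is the invariance the abstract relies on.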

[47] ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment

Martin JJ. Bucher,Iro Armeni

Main category: cs.CV

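TL;DR: ReSpace is a generative framework based on autoregressive language models for text-driven 3D indoor scene synthesis and editing; a structured scene representation and dual-stage training improve its semantic and spatial reasoning.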

Details

Motivation: Existing methods for 3D indoor scene synthesis and editing oversimplify semantics, restrict editing, or reason poorly about space; ReSpace addresses these limitations with natural language and a structured representation. Method: Uses a compact structured scene representation and dual-stage training (supervised fine-tuning and preference alignment), with an autoregressive language model for object addition and a zero-shot LLM for object removal. Result: Experiments show that ReSpace surpasses the state of the art on object addition while remaining competitive on full scene synthesis. Conclusion: By combining language models with a structured representation, ReSpace offers a more flexible and semantically rich solution for 3D scene synthesis and editing. Abstract: Scene synthesis and editing has emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scenes either oversimplify object semantics through one-hot class encodings (e.g., 'chair' or 'table'), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. In contrast, LLM-based methods enable richer semantics via natural language (e.g., 'modern studio with light wood furniture') but do not support editing, remain limited to rectangular layouts or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for text-driven 3D indoor scene synthesis and editing using autoregressive language models. Our approach features a compact structured scene representation with explicit room boundaries that frames scene editing as a next-token prediction task. We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment, enabling a specially trained language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition. We further introduce a novel voxelization-based evaluation that captures fine-grained geometry beyond 3D bounding boxes. Experimental results surpass state-of-the-art on object addition while maintaining competitive results on full scene synthesis.

[48] Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning

Kunyu Wang,Xueyang Fu,Xin Lu,Chengjie Ge,Chengzhi Cao,Wei Zhai,Zheng-Jun Zha

Main category: cs.CV

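TL;DR: Proposes an efficient pruning-based method for continual test-time adaptive object detection; a sensitivity-guided channel pruning strategy and a stochastic channel reactivation mechanism reduce computational overhead while improving performance.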

Details

Motivation: Not all source features are beneficial: some domain-sensitive feature channels can harm target-domain performance, which motivates a pruning strategy. Method: Uses sensitivity-guided channel pruning that quantifies each channel's sensitivity to domain discrepancies and prunes selectively via weighted sparsity regularization; adds a stochastic channel reactivation mechanism. Result: Performs strongly on three benchmarks while reducing computational overhead by 12% in FLOPs. Conclusion: The method improves adaptation performance while significantly lowering computational cost. Abstract: Continual test-time adaptive object detection (CTTA-OD) aims to adapt a source pre-trained detector online to ever-changing environments during inference under continuous domain shifts. Most existing CTTA-OD methods prioritize effectiveness while overlooking computational efficiency, which is crucial for resource-constrained scenarios. In this paper, we propose an efficient CTTA-OD method via pruning. Our motivation stems from the observation that not all learned source features are beneficial; certain domain-sensitive feature channels can adversely affect target domain performance. Inspired by this, we introduce a sensitivity-guided channel pruning strategy that quantifies each channel based on its sensitivity to domain discrepancies at both image and instance levels. We apply weighted sparsity regularization to selectively suppress and prune these sensitive channels, focusing adaptation efforts on invariant ones. Additionally, we introduce a stochastic channel reactivation mechanism to restore pruned channels, enabling recovery of potentially useful features and mitigating the risks of early pruning. Extensive experiments on three benchmarks show that our method achieves superior adaptation performance while reducing computational overhead by 12% in FLOPs compared to the recent SOTA method.

[49] HRTR: A Single-stage Transformer for Fine-grained Sub-second Action Segmentation in Stroke Rehabilitation

Halil Ismail Helvaci,Justin Philip Huber,Jihye Bae,Sen-ching Samson Cheung

Main category: cs.CV

TL;DR: Proposes HRTR, a single-stage Transformer model for detecting and classifying high-resolution, sub-second actions that clearly outperforms existing methods.

Details

Motivation: Stroke rehabilitation requires precise tracking of patient movements, but existing methods struggle with high-resolution, sub-second action detection. Method: Proposes HRTR, a single-stage Transformer that detects and classifies high-resolution, sub-second actions without multi-stage methods or post-processing. Result: HRTR performs strongly across datasets, with Edit Scores of 70.1 (StrokeRehab Video), 69.4 (StrokeRehab IMU), and 88.4 (50Salads). Conclusion: HRTR excels at action detection and offers an efficient solution for stroke rehabilitation. Abstract: Stroke rehabilitation often demands precise tracking of patient movements to monitor progress, with complexities of rehabilitation exercises presenting two critical challenges: fine-grained and sub-second (under one-second) action detection. In this work, we propose the High Resolution Temporal Transformer (HRTR), to time-localize and classify high-resolution (fine-grained), sub-second actions in a single-stage transformer, eliminating the need for multi-stage methods and post-processing. Without any refinements, HRTR outperforms state-of-the-art systems on both stroke related and general datasets, achieving Edit Score (ES) of 70.1 on StrokeRehab Video, 69.4 on StrokeRehab IMU, and 88.4 on 50Salads.
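For readers unfamiliar with the Edit Score reported here, the sketch below follows the segmental edit score commonly used in action segmentation: frame labels are collapsed into an ordered list of segments, and the score is one minus the normalized Levenshtein distance between predicted and ground-truth segment lists, scaled to 100. That the paper uses exactly this variant is an assumption, and the function names are illustrative.

```python
def segments(frame_labels):
    """Collapse consecutive identical frame labels into one segment label each."""
    segs = []
    for label in frame_labels:
        if not segs or segs[-1] != label:
            segs.append(label)
    return segs


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def edit_score(pred_frames, gt_frames):
    """Segmental edit score in [0, 100]; higher is better."""
    p, g = segments(pred_frames), segments(gt_frames)
    if not p and not g:
        return 100.0
    return (1.0 - levenshtein(p, g) / max(len(p), len(g))) * 100.0


# Boundaries are shifted but the segment order is identical.
same_order = edit_score(["a", "a", "b", "b", "c"], ["a", "b", "b", "b", "c"])
# Over-segmentation: two spurious segments are inserted.
over_segmented = edit_score(["a", "b", "a", "c"], ["a", "c"])
```

The metric ignores exact boundary timing and penalizes wrong ordering and over-segmentation, so `same_order` is a perfect score while `over_segmented` is halved by the two spurious segments.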

[50] Generative Perception of Shape and Material from Differential Motion

Xinran Nicole Han,Ko Nishino,Todd Zickler

Main category: cs.CV

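TL;DR: Proposes a conditional denoising-diffusion model that uses object motion in short videos to generate multimodal predictions of shape and material, resolving the perceptual ambiguity of single views.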

Details

Motivation: Humans resolve ambiguity in shape and material perception by moving slightly or rotating the object; inspired by this, the work uses motion information to improve visual reasoning. Method: Uses a parameter-efficient architecture to train a conditional denoising-diffusion model directly in pixel space, generating disentangled representations of multiple object attributes. Result: The model produces diverse, multimodal predictions for static observations, converges quickly to more accurate explanations under motion, and performs well on real objects. Conclusion: By using continuous motion observations, this generative perception approach suggests a new direction for visual reasoning in physical systems. Abstract: Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel-space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: For static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions quickly converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, our work suggests a generative perception approach for improving visual reasoning in physically-embodied systems.

[51] Towards Better De-raining Generalization via Rainy Characteristics Memorization and Replay

Kunyu Wang,Xueyang Fu,Chengzhi Cao,Chengjie Ge,Wei Zhai,Zheng-Jun Zha

Main category: cs.CV

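TL;DR: Proposes a new framework that improves the adaptability of image de-raining networks by progressively expanding the datasets they learn from; it combines GANs and knowledge distillation to imitate the human brain's learning mechanisms and substantially improves performance.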

Details

Motivation: Existing methods learn from limited datasets and underperform in diverse real-world rainy scenes, so networks need better adaptability and generalization. Method: Uses GANs to capture the features of new data, imitating hippocampal learning, and combines knowledge distillation to imitate the synergy between neocortex and hippocampus, enabling continual learning. Result: Tested on three benchmark networks, the framework accumulates knowledge continually and surpasses existing methods in new scenarios. Conclusion: The framework effectively improves the generalization of de-raining networks and offers a new perspective on continual learning. Abstract: Current image de-raining methods primarily learn from a limited dataset, leading to inadequate performance in varied real-world rainy conditions. To tackle this, we introduce a new framework that enables networks to progressively expand their de-raining knowledge base by tapping into a growing pool of datasets, significantly boosting their adaptability. Drawing inspiration from the human brain's ability to continuously absorb and generalize from ongoing experiences, our approach borrows the mechanism of the complementary learning system. Specifically, we first deploy Generative Adversarial Networks (GANs) to capture and retain the unique features of new data, mirroring the hippocampus's role in learning and memory. Then, the de-raining network is trained with both existing and GAN-synthesized data, mimicking the process of hippocampal replay and interleaved learning. Furthermore, we employ knowledge distillation with the replayed data to replicate the synergy between the neocortex's activity patterns triggered by hippocampal replays and the pre-existing neocortical knowledge. This comprehensive framework empowers the de-raining network to amass knowledge from various datasets, continually enhancing its performance on previously unseen rainy scenes. Our testing on three benchmark de-raining networks confirms the framework's effectiveness. It not only facilitates continuous knowledge accumulation across six datasets but also surpasses state-of-the-art methods in generalizing to new real-world scenarios.

[52] Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models

Hongtao Huang,Xiaojun Chang,Lina Yao

Main category: cs.CV

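TL;DR: Flexiffusion is a training-free NAS framework that dynamically combines generation step types and introduces the lightweight evaluation metric rFID, substantially accelerating diffusion model inference while preserving generation quality.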

Details

Motivation: Diffusion models are computationally expensive because of multi-step inference, and existing NAS methods are limited by retraining requirements, search complexity, and slow evaluation. Method: Decomposes the generation process into flexible segments, dynamically combines three step types (full computation, cache-reused computation, and skipped computation), and introduces the rFID evaluation metric. Result: Achieves at least 2× acceleration across multiple models and datasets with FID degradation under 5%, and reaches 5.1× speedup on Stable Diffusion. Conclusion: Flexiffusion offers a new paradigm for efficiently searching for high-speed diffusion models without sacrificing generation quality. Abstract: Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but are constrained by high computational costs due to iterative multi-step inference. While Neural Architecture Search (NAS) can optimize DMs, existing methods are hindered by retraining requirements, exponential search complexity from step-wise optimization, and slow evaluation relying on massive image generation. To address these challenges, we propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters. Our key insight is to decompose the generation process into flexible segments of equal length, where each segment dynamically combines three step types: full (complete computation), partial (cache-reused computation), and null (skipped computation). This segment-wise search space reduces the candidate pool exponentially compared to step-wise NAS while preserving architectural diversity. Further, we introduce relative FID (rFID), a lightweight evaluation metric for NAS that measures divergence from a teacher model's outputs instead of ground truth, slashing evaluation time by over 90%. In practice, Flexiffusion achieves at least 2× acceleration across LDMs, Stable Diffusion, and DDPMs on ImageNet and MS-COCO, with FID degradation under 5%, outperforming prior NAS and caching methods. Notably, it attains 5.1× speedup on Stable Diffusion with near-identical CLIP scores. Our work pioneers a resource-efficient paradigm for searching high-speed DMs without sacrificing quality.
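How a schedule built from full, partial, and null steps translates into a speedup can be sketched with a simple cost model. The relative costs below (a partial step costing half a full step, a null step costing nothing) and the function names are assumptions for illustration; the abstract gives no per-step costs, and this is not the authors' search procedure.

```python
# Hypothetical relative compute cost of each step type (full step = 1.0).
STEP_COST = {"full": 1.0, "partial": 0.5, "null": 0.0}


def schedule_cost(segments):
    """Total relative cost of a schedule given as equal-length segments of step types."""
    if len({len(seg) for seg in segments}) != 1:
        raise ValueError("all segments must exist and have the same length")
    return sum(STEP_COST[step] for seg in segments for step in seg)


def speedup(segments):
    """Speedup relative to running every step as a full step."""
    total_steps = sum(len(seg) for seg in segments)
    cost = schedule_cost(segments)
    if cost == 0:
        raise ValueError("a schedule with no computation is not meaningful")
    return total_steps / cost


# Five segments of four steps each, all using the same illustrative pattern.
schedule = [["full", "partial", "null", "partial"] for _ in range(5)]
estimated = speedup(schedule)
baseline = speedup([["full"] * 4 for _ in range(5)])
```

Under these assumed costs the example schedule needs half the compute of the all-full baseline; a search over such schedules would trade this estimate against a quality metric such as the rFID described in the abstract.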

[53] Co-Evidential Fusion with Information Volume for Medical Image Segmentation

Yuanpeng He,Lijian Li,Tianxiang Zhan,Chi-Man Pun,Wenpin Jiao,Zhi Jin

Main category: cs.CV

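TL;DR: Proposes an improved semi-supervised medical image segmentation method that introduces a new uncertainty measurement strategy and an information-volume evaluation to improve model performance.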

Details

Motivation: Existing semi-supervised image segmentation methods fail to make full use of multiple sources of voxel-level uncertainty for targeted learning. Method: (1) Proposes a pignistic co-evidential fusion strategy based on generalized evidential deep learning; (2) introduces the information volume of the mass function (IVUM) to design two evidential learning schemes. Result: Competitive performance is demonstrated on four datasets. Conclusion: With a more precise uncertainty measure and a new optimization objective, the method substantially improves semi-supervised segmentation. Abstract: Although existing semi-supervised image segmentation methods have achieved good performance, they cannot effectively utilize multiple sources of voxel-level uncertainty for targeted learning. Therefore, we propose two main improvements. First, we introduce a novel pignistic co-evidential fusion strategy using generalized evidential deep learning, extended by traditional D-S evidence theory, to obtain a more precise uncertainty measure for each voxel in medical samples. This assists the model in learning mixed labeled information and establishing semantic associations between labeled and unlabeled data. Second, we introduce the concept of information volume of mass function (IVUM) to evaluate the constructed evidence, implementing two evidential learning schemes. One optimizes evidential deep learning by combining the information volume of the mass function with original uncertainty measures. The other integrates the learning pattern based on the co-evidential fusion strategy, using IVUM to design a new optimization objective. Experiments on four datasets demonstrate the competitive performance of our method.

[54] Towards In-the-wild 3D Plane Reconstruction from a Single Image

Jiachen Liu,Rui Yu,Sili Chen,Sharon X. Huang,Hengkai Guo

Main category: cs.CV

TL;DR: ZeroPlane is a Transformer-based model for zero-shot 3D plane detection and reconstruction from a single image, applicable to diverse indoor and outdoor scenes.

Details

Motivation: Existing methods are usually trained on a single dataset, which limits their generalization. ZeroPlane aims to address this and improve cross-domain performance. Method: Disentangles the representation of plane normal and offset, learns plane parameters with a classification-then-regression paradigm, and combines advanced backbones with a pixel-geometry-enhanced plane embedding module. Result: ZeroPlane significantly outperforms prior methods on multiple zero-shot evaluation datasets, especially on in-the-wild data. Conclusion: Through multi-domain training and geometry enhancement, ZeroPlane achieves accurate 3D plane reconstruction with strong generalization. Abstract: 3D plane reconstruction from a single image is a crucial yet challenging topic in 3D computer vision. Previous state-of-the-art (SOTA) methods have focused on training their system on a single dataset from either indoor or outdoor domain, limiting their generalizability across diverse testing data. In this work, we introduce a novel framework dubbed ZeroPlane, a Transformer-based model targeting zero-shot 3D plane detection and reconstruction from a single image, over diverse domains and environments. To enable data-driven models across multiple domains, we have curated a large-scale planar benchmark, comprising over 14 datasets and 560,000 high-resolution, dense planar annotations for diverse indoor and outdoor scenes. To address the challenge of achieving desirable planar geometry on multi-dataset training, we propose to disentangle the representation of plane normal and offset, and employ an exemplar-guided, classification-then-regression paradigm to learn plane and offset respectively. Additionally, we employ advanced backbones as image encoder, and present an effective pixel-geometry-enhanced plane embedding module to further facilitate planar reconstruction. Extensive experiments across multiple zero-shot evaluation datasets have demonstrated that our approach significantly outperforms previous methods on both reconstruction accuracy and generalizability, especially over in-the-wild data. Our code and data are available at: https://github.com/jcliu0428/ZeroPlane.

[55] LumosFlow: Motion-Guided Long Video Generation

Jiahao Chen,Hangjie Yuan,Yichen Qian,Jingyun Liang,Jiazheng Xing,Pengwei Liu,Weihua Chen,Fan Wang,Bing Su

Main category: cs.CV

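TL;DR: The paper proposes LumosFlow, a framework that improves long video generation through explicit motion guidance, addressing the temporal repetition and unnatural transitions of conventional approaches.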

Details

Motivation: Long video generation is widely used in entertainment and simulation, but existing methods still struggle with temporal coherence and visual appeal. Method: Uses LMTV-DM to generate key frames with large motion intervals, and decomposes intermediate frame interpolation into motion generation (LOF-DM) and post-hoc refinement (MotionControlNet). Result: Experiments show that the method generates long videos with consistent motion and appearance and achieves 15x interpolation. Conclusion: LumosFlow improves the quality and continuity of long video generation; code and models will be released. Abstract: Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches often synthesize long videos by sequentially generating and concatenating short clips, or by generating key frames and then interpolating the intermediate frames in a hierarchical manner. However, both still face significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework that introduces motion guidance explicitly. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose the intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex and large-motion optical flows, while MotionControlNet subsequently refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15x interpolation, ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance. Our project page: https://jiahaochen1.github.io/LumosFlow/

[56] RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers

Yan Gong,Yiren Song,Yicheng Li,Chenglin Li,Yin Zhang

Main category: cs.CV

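TL;DR: The paper proposes RelationAdapter, a visual prompt-based image editing method that extracts editing intent from source-target image pairs, and introduces the Relation252K dataset to evaluate it.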

Details

Motivation: Existing single-reference methods perform poorly on non-rigid transformations and need improvement to support diverse editing tasks. Method: Proposes the RelationAdapter module, which works with DiT-based models to capture and apply visual transformations from a small number of examples. Result: RelationAdapter significantly improves the understanding and transfer of editing intent, with notable gains in generation quality and editing performance. Conclusion: RelationAdapter provides an efficient and general solution for visual prompt-driven image editing. Abstract: Inspired by the in-context learning mechanism of large language models (LLMs), a new paradigm of generalizable visual prompt-based image editing is emerging. Existing single-reference methods typically focus on style or appearance adjustments and struggle with non-rigid transformations. To address these limitations, we propose leveraging source-target image pairs to extract and transfer content-aware editing intent to novel query images. To this end, we introduce RelationAdapter, a lightweight module that enables Diffusion Transformer (DiT) based models to effectively capture and apply visual transformations from minimal examples. We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. Experiments on Relation252K show that RelationAdapter significantly improves the model's ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance.

[57] Enhancing Monocular Height Estimation via Weak Supervision from Imperfect Labels

Sining Chen,Yilei Shi,Xiao Xiang Zhu

Main category: cs.CV

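TL;DR: The paper proposes a method for training monocular height estimation networks on data with imperfect labels, using weak supervision and carefully designed loss functions to improve cross-domain performance.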

Details

Motivation: Existing monocular height estimation methods depend on high-quality labels, which are scarce and concentrated in developed regions, limiting generalization. The paper is the first attempt to address this with imperfectly labeled data. Method: Proposes an ensemble-based pipeline compatible with any monocular height estimation network, using balanced soft losses and ordinal constraints to handle noisy labels, domain shift, and the long-tailed distribution of height values. Result: On DFC23 and GBH, the method improves average root mean square error by up to 22.94% and 18.62%, respectively, over baselines. Conclusion: By exploiting imperfect labels, the method improves the generalization of monocular height estimation and makes large-scale application more feasible. Abstract: Monocular height estimation is considered the most efficient and cost-effective means of 3D perception in remote sensing, and it has attracted much attention since the emergence of deep learning. While training neural networks requires a large amount of data, data with perfect labels are scarce and only available within developed regions. The trained models therefore lack generalizability, which limits the potential for large-scale application of existing methods. We tackle this problem for the first time, by introducing data with imperfect labels into training pixel-wise height estimation networks, including labels that are incomplete, inexact, and inaccurate compared to high-quality labels. We propose an ensemble-based pipeline compatible with any monocular height estimation network. Taking the challenges of noisy labels, domain shift, and long-tailed distribution of height values into consideration, we carefully design the architecture and loss functions to leverage the information concealed in imperfect labels using weak supervision through balanced soft losses and ordinal constraints. We conduct extensive experiments on two datasets with different resolutions, DFC23 (0.5 to 1 m) and GBH (3 m). The results indicate that the proposed pipeline outperforms baselines by achieving more balanced performance across various domains, leading to improvements of average root mean square errors up to 22.94% and 18.62% on DFC23 and GBH, respectively. The efficacy of each design component is validated through ablation studies. Code is available at https://github.com/zhu-xlab/weakim2h.

[58] MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection

Juntong Li,Lingwei Dang,Yukun Su,Yun Hao,Qingxin Xiao,Yongwei Nie,Qingyao Wu

Main category: cs.CV

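TL;DR: The paper proposes a new video anomaly detection framework that suppresses excessive generalization with a Sparse Feature Filtering Module (SFFM) and a Mixture of Experts (MoE) architecture, and adds a Vision-Language Model (VLM) to improve semantic understanding.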

Details

Motivation: Existing reconstruction- or prediction-based video anomaly detection methods over-generalize and ignore high-level semantics. Method: Proposes SFFM to dynamically filter out abnormal information, with an MoE architecture to extract diverse features; integrates a VLM to generate textual descriptions for multimodal joint modeling. Result: The framework's effectiveness is validated on multiple public datasets. Conclusion: Sparse feature filtering combined with multimodal modeling improves video anomaly detection. Abstract: Video Anomaly Detection (VAD) methods based on reconstruction or prediction face two critical challenges: (1) strong generalization capability often results in accurate reconstruction or prediction of abnormal events, making it difficult to distinguish normal from abnormal patterns; (2) reliance only on low-level appearance and motion cues limits their ability to identify high-level semantics in abnormal events from complex scenes. To address these limitations, we propose a novel VAD framework with two key innovations. First, to suppress excessive generalization, we introduce the Sparse Feature Filtering Module (SFFM) that employs bottleneck filters to dynamically and adaptively remove abnormal information from features. Unlike traditional memory modules, it does not need to memorize the normal prototypes across the training dataset. Further, we design the Mixture of Experts (MoE) architecture for SFFM. Each expert is responsible for extracting specialized principal features during running time, and different experts are selectively activated to ensure the diversity of the learned principal features. Second, to overcome the neglect of semantics in existing methods, we integrate a Vision-Language Model (VLM) to generate textual descriptions for video clips, enabling comprehensive joint modeling of semantic, appearance, and motion cues. Additionally, we enforce modality consistency through semantic similarity constraints and motion frame-difference contrastive loss. Extensive experiments on multiple public datasets validate the effectiveness of our multimodal joint modeling framework and sparse feature filtering paradigm. Project page at https://qzfm.github.io/sfn_vad_project_page/.

[59] VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

Hao Yan,Handong Zheng,Hao Wang,Liang Yin,Xingchen Liu,Zhenbiao Cao,Xinxing Su,Zihao Chen,Jihao Wu,Minghui Liao,Chao Weng,Wei Chen,Yuliang Liu,Xiang Bai

Main category: cs.CV

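TL;DR: The paper proposes the VisuRiddles benchmark and the PRS framework to address the bottleneck of multimodal large language models in abstract visual reasoning, and validates their effectiveness experimentally.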

Details

Motivation: Current multimodal large language models are limited in perceiving abstract graphics, which hampers Abstract Visual Reasoning (AVR). Method: (1) Proposes the VisuRiddles benchmark, which assesses models across five core dimensions and two high-level reasoning categories; (2) develops the PRS framework, which automatically generates riddles with fine-grained perceptual descriptions for training and for supervising intermediate reasoning stages. Result: Experiments show that fine-grained visual perception is the principal bottleneck and that PRS markedly improves model performance. Conclusion: PRS addresses the AVR bottleneck while improving model interpretability and training efficacy. Abstract: Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual description, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles

[60] Probabilistic Online Event Downsampling

Andreu Girbau-Xalabarder,Jun Nagata,Shinichi Sumiyoshi

Main category: cs.CV

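TL;DR: The paper proposes POLED, a probabilistic framework that estimates event importance on the fly through an event-importance probability density function (ePDF), enabling scene-adaptive online event downsampling, and validates it on multiple tasks.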

Details

Motivation: The high temporal resolution of event cameras brings high bandwidth and computational demands, and existing downsampling methods rely on fixed heuristics that adapt poorly. Method: Proposes the POLED framework, which estimates event importance online using a customizable ePDF, and designs a contour-preserving ePDF. Result: Evaluation across four datasets and tasks (classification, image interpolation, surface normal estimation, and object detection) shows that intelligent sampling is crucial for maintaining performance. Conclusion: By estimating event importance dynamically, POLED improves the adaptability and performance of event downsampling. Abstract: Event cameras capture scene changes asynchronously on a per-pixel basis, enabling extremely high temporal resolution. However, this advantage comes at the cost of high bandwidth, memory, and computational demands. To address this, prior work has explored event downsampling, but most approaches rely on fixed heuristics or threshold-based strategies, limiting their adaptability. Instead, we propose a probabilistic framework, POLED, that models event importance through an event-importance probability density function (ePDF), which can be arbitrarily defined and adapted to different applications. Our approach operates in a purely online setting, estimating event importance on-the-fly from raw event streams, enabling scene-specific adaptation. Additionally, we introduce zero-shot event downsampling, where downsampled events must remain usable for models trained on the original event stream, without task-specific adaptation. We design a contour-preserving ePDF that prioritizes structurally important events and evaluate our method across four datasets and tasks--object classification, image interpolation, surface normal estimation, and object detection--demonstrating that intelligent sampling is crucial for maintaining performance under event-budget constraints.

[61] Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025

Qiaohui Chu,Haoyu Zhang,Yisen Feng,Meng Liu,Weili Guan,Yaowei Wang,Liqiang Nie

Main category: cs.CV

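TL;DR: Proposes a three-stage framework for the Ego4D long-term action anticipation task, combining visual feature extraction, action recognition, and long-term action anticipation; it took first place in the CVPR 2025 challenge.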

Details

Motivation: Inspired by recent advances in foundation models, the work aims to improve the accuracy of long-term action anticipation. Method: A three-stage framework: (1) visual feature extraction; (2) a Transformer that predicts verbs and nouns; (3) a fine-tuned large language model that generates future action sequences. Result: First place in the challenge at CVPR 2025, establishing a new state of the art in long-term action prediction. Conclusion: Combining vision and language models improves long-term action anticipation. Abstract: In this report, we present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted using a high-performance visual encoder. The features are then fed into a Transformer to predict verbs and nouns, with a verb-noun co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the predicted verb-noun pairs are formatted as textual prompts and input into a fine-tuned large language model (LLM) to anticipate future action sequences. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction. Our code will be released at https://github.com/CorrineQiu/Ego4D-LTA-Challenge-2025.

[62] SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence

Zhitao Zeng,Zhu Zhuo,Xiaojun Jia,Erli Zhang,Junde Wu,Jiaan Zhang,Yuxuan Wang,Chang Han Low,Jian Jiang,Zilong Zheng,Xiaochun Cao,Yutong Ban,Qi Dou,Yang Liu,Yueming Jin

Main category: cs.CV

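TL;DR: SurgVLM is a large vision-language foundation model for surgical intelligence; by building SurgVLM-DB, a large-scale multimodal surgical database, it addresses the shortcomings of general-purpose vision-language models in the surgical domain.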

Details

Motivation: Surgical intelligence poses unique challenges, including surgical visual perception, temporal analysis, and reasoning; existing general-purpose vision-language models fall short because they lack domain-specific supervision and a large-scale, high-quality database. Method: Builds the SurgVLM-DB database (over 1.81 million frames and 7.79 million conversations) by unifying 23 public datasets, and develops SurgVLM (built on Qwen2.5-VL) through instruction tuning. Result: SurgVLM is evaluated on SurgVLM-Bench and compared with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash). Conclusion: SurgVLM offers a single universal model for multi-task surgical intelligence. Abstract: Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges - requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale high-quality surgical database. To bridge this gap, we propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence, where this single universal model can tackle versatile surgical tasks. To enable this, we construct a large-scale multimodal surgical database, SurgVLM-DB, comprising over 1.81 million frames with 7.79 million conversations, spanning more than 16 surgical types and 18 anatomical structures. We unify and reorganize 23 public datasets across 10 surgical tasks, followed by standardizing labels and doing hierarchical vision-language alignment to facilitate comprehensive coverage of gradually finer-grained surgical tasks, from visual perception, temporal analysis, to high-level reasoning. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL, and undergoes instruction tuning to 10+ surgical tasks. We further construct a surgical multimodal benchmark, SurgVLM-Bench, for method evaluation. SurgVLM-Bench consists of 6 popular and widely-used datasets in surgical domain, covering several crucial downstream tasks. Based on SurgVLM-Bench, we evaluate the performance of our SurgVLM (3 SurgVLM variants: SurgVLM-7B, SurgVLM-32B, and SurgVLM-72B), and conduct comprehensive comparisons with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash, Qwen2.5-Max).

[63] Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models

Shizhan Gong,Yankai Jiang,Qi Dou,Farzan Farnia

Main category: cs.CV

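TL;DR: The paper proposes a kernel-based method that aligns CLIP's visual representation with that of DINOv2, enhancing perceptual capability while preserving compatibility with text embeddings.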

Details

Motivation: CLIP's limited fine-grained perception hurts the performance of downstream multimodal large language models (MLLMs), whereas DINOv2 excels at capturing image details. Method: Aligns the visual representations of CLIP and DINOv2 with a kernel-based method, using an objective designed for efficient stochastic optimization during fine-tuning. Result: The aligned visual encoder shows significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization, and downstream MLLMs also improve. Conclusion: The method combines the strengths of CLIP and DINOv2 and improves the perceptual capability of multimodal models. Abstract: Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP's limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder retains compatibility with the frozen text encoder and exhibits significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization. By integrating the aligned visual encoder, downstream MLLMs also demonstrate enhanced performance.

[64] DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing

Zixiang Li,Haoyu Wang,Wei Wang,Chuangchuang Tan,Yunchao Wei,Yao Zhao

Main category: cs.CV

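TL;DR: The paper proposes Dual-Conditional Inversion (DCI), a framework that uses joint conditioning to address the trade-off between reconstruction accuracy and editing flexibility in diffusion model inversion.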

Details

Motivation: Existing inversion methods for diffusion models face an intrinsic trade-off between reconstruction accuracy and editing flexibility, and struggle to maintain both semantic alignment and structural consistency. Method: DCI formulates inversion as a dual-condition fixed-point optimization problem, guided jointly by the source prompt and the reference image, that minimizes both the latent noise gap and the reconstruction error. Result: Experiments show that DCI achieves state-of-the-art performance across multiple editing tasks, significantly improving reconstruction quality and editing precision. Conclusion: DCI brings new understanding to the inversion process and shows robustness and generalizability in reconstruction tasks. Abstract: Diffusion models have achieved remarkable success in image generation and editing tasks. Inversion within these models aims to recover the latent noise representation for a real or generated image, enabling reconstruction, editing, and other downstream tasks. However, to date, most inversion approaches suffer from an intrinsic trade-off between reconstruction accuracy and editing flexibility. This limitation arises from the difficulty of maintaining both semantic alignment and structural consistency during the inversion process. In this work, we introduce Dual-Conditional Inversion (DCI), a novel framework that jointly conditions on the source prompt and reference image to guide the inversion process. Specifically, DCI formulates the inversion process as a dual-condition fixed-point optimization problem, minimizing both the latent noise gap and the reconstruction error under the joint guidance. This design anchors the inversion trajectory in both semantic and visual space, leading to more accurate and editable latent representations. Our novel setup brings new understanding to the inversion process. Extensive experiments demonstrate that DCI achieves state-of-the-art performance across multiple editing tasks, significantly improving both reconstruction quality and editing precision. Furthermore, we also demonstrate that our method achieves strong results in reconstruction tasks, implying a degree of robustness and generalizability approaching the ultimate goal of the inversion process.

[65] Contrast & Compress: Learning Lightweight Embeddings for Short Trajectories

Abhishek Vivekanandan,Christian Hubschneider,J. Marius Zöllner

Main category: cs.CV

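TL;DR: The paper proposes a framework based on a Transformer encoder and a contrastive triplet loss for learning fixed-dimensional embeddings of short trajectories, outperforming traditional approaches.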

Details

Motivation: Existing approaches depend on computationally intensive heuristics or on latent anchor representations that lack interpretability, and so fall short of what downstream applications need. Method: Trains a Transformer encoder with a contrastive triplet loss, analyzing Cosine and FFT-based similarity metrics, to learn embeddings of short trajectories. Result: On the Argoverse 2 dataset, Cosine similarity objectives outperform FFT-based baselines in trajectory clustering and retrieval, and low-dimensional embeddings (e.g., 16 dimensions) balance retrieval performance against computational overhead. Conclusion: The method yields compact, semantically meaningful, and efficient trajectory representations, supporting more transparent and controllable motion forecasting in real-time systems. Abstract: The ability to retrieve semantically and directionally similar short-range trajectories with both accuracy and efficiency is foundational for downstream applications such as motion forecasting and autonomous navigation. However, prevailing approaches often depend on computationally intensive heuristics or latent anchor representations that lack interpretability and controllability. In this work, we propose a novel framework for learning fixed-dimensional embeddings for short trajectories by leveraging a Transformer encoder trained with a contrastive triplet loss that emphasizes the importance of discriminative feature spaces for trajectory data. We analyze the influence of Cosine and FFT-based similarity metrics within the contrastive learning paradigm, with a focus on capturing the nuanced directional intent that characterizes short-term maneuvers. Our empirical evaluation on the Argoverse 2 dataset demonstrates that embeddings shaped by Cosine similarity objectives yield superior clustering of trajectories by both semantic and directional attributes, outperforming FFT-based baselines in retrieval tasks. Notably, we show that compact Transformer architectures, even with low-dimensional embeddings (e.g., 16 dimensions, but qualitatively down to 4), achieve a compelling balance between retrieval performance (minADE, minFDE) and computational overhead, aligning with the growing demand for scalable and interpretable motion priors in real-time systems. The resulting embeddings provide a compact, semantically meaningful, and efficient representation of trajectory data, offering a robust alternative to heuristic similarity measures and paving the way for more transparent and controllable motion forecasting pipelines.
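
To make the "contrastive triplet loss with Cosine similarity" idea above concrete, here is a minimal sketch. It assumes the distance is defined as one minus cosine similarity and uses a hypothetical margin of 0.2; neither choice is stated in the abstract, and the function names are illustrative only. The embeddings would come from the Transformer encoder, which is not modeled here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def triplet_cosine_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine distance.

    Assumption: distance = 1 - cosine similarity, and margin = 0.2 is a
    placeholder value, not a setting taken from the paper.
    """
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    # The loss is zero once the negative is at least `margin` farther away
    # from the anchor than the positive is.
    return max(0.0, d_pos - d_neg + margin)
```

The loss vanishes when the positive embedding points in the anchor's direction and the negative is orthogonal to it, and grows when the two roles are swapped, which is the behavior that pushes directionally similar trajectories together.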

[66] BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations

Weiduo Yuan,Jerry Li,Justin Yue,Divyank Shah,Konstantinos Karydis,Hang Qiu

Main category: cs.CV

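TL;DR: BEVCALIB is a new model that uses bird's-eye view (BEV) features to perform LiDAR-camera calibration from raw data, significantly outperforming existing baselines.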

Details

Motivation: Traditional LiDAR-camera calibration methods require extensive data collection in controlled environments and cannot compensate for transformation changes during motion, which limits practical use. Method: Extracts camera and LiDAR BEV features separately, fuses them into a shared BEV feature space, and introduces a feature selector that reduces memory consumption and enables efficient training. Result: Under various noise conditions, BEVCALIB outperforms the best baseline by an average of (47.08%, 82.32%) on KITTI and (78.17%, 68.29%) on NuScenes in terms of (translation, rotation). Conclusion: BEVCALIB establishes a new state of the art in LiDAR-camera calibration, and the code is publicly available. Abstract: Accurate LiDAR-camera calibration is fundamental to fusing multi-modal perception in autonomous driving and robotic systems. Traditional calibration methods require extensive data collection in controlled environments and cannot compensate for the transformation changes during the vehicle/robot movement. In this paper, we propose the first model that uses bird's-eye view (BEV) features to perform LiDAR camera calibration from raw data, termed BEVCALIB. To achieve this, we extract camera BEV features and LiDAR BEV features separately and fuse them into a shared BEV feature space. To fully utilize the geometric information from the BEV feature, we introduce a novel feature selector to filter the most important features in the transformation decoder, which reduces memory consumption and enables efficient training. Extensive evaluations on KITTI, NuScenes, and our own dataset demonstrate that BEVCALIB establishes a new state of the art. Under various noise conditions, BEVCALIB outperforms the best baseline in the literature by an average of (47.08%, 82.32%) on KITTI dataset, and (78.17%, 68.29%) on NuScenes dataset, in terms of (translation, rotation), respectively. In the open-source domain, it improves the best reproducible baseline by one order of magnitude. Our code and demo results are available at https://cisl.ucr.edu/BEVCalib.

[67] Hyperspectral Image Generation with Unmixing Guided Diffusion Model

Shiyu Shen,Bin Pan,Ziye Zhang,Zhenwei Shi

Main category: cs.CV

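TL;DR: Proposes a diffusion model guided by hyperspectral unmixing that reduces computational complexity while preserving high fidelity, generating high-quality and diverse hyperspectral images.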

Details

Motivation: Existing generative models rely on conditional generation schemes, which limits the diversity of generated images, and the high dimensionality and physical constraints of hyperspectral data are challenging for diffusion models. Method: The model comprises an unmixing autoencoder module, which shifts the generative task to the low-dimensional abundance space, and an abundance diffusion module, which ensures generated samples satisfy the non-negativity and unity (sum-to-one) constraints. Result: Experiments show that the model generates high-quality and diverse hyperspectral images, evaluated with both traditional metrics and two newly proposed metrics. Conclusion: The model advances hyperspectral data generation by addressing the challenges of high dimensionality and physical constraints. Abstract: Recently, hyperspectral image generation has received increasing attention, but existing generative models rely on conditional generation schemes, which limits the diversity of generated images. Diffusion models are popular for their ability to generate high-quality samples, but adapting these models from RGB to hyperspectral data presents the challenge of high dimensionality and physical constraints. To address these challenges, we propose a novel diffusion model guided by hyperspectral unmixing. Our model comprises two key modules: an unmixing autoencoder module and an abundance diffusion module. The unmixing autoencoder module leverages unmixing guidance to shift the generative task from the image space to the low-dimensional abundance space, significantly reducing computational complexity while preserving high fidelity. The abundance diffusion module generates samples that satisfy the constraints of non-negativity and unity, ensuring the physical consistency of the reconstructed HSIs. Additionally, we introduce two evaluation metrics tailored to hyperspectral data. Empirical results, evaluated using both traditional metrics and our proposed metrics, indicate that our model is capable of generating high-quality and diverse hyperspectral images, offering an advancement in hyperspectral data generation.
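
As a rough illustration of what the non-negativity and unity constraints on abundances mean, the sketch below maps unconstrained scores to valid abundances with a softmax and reconstructs one pixel spectrum under the standard linear mixing model. Both choices are assumptions made for illustration: the abstract does not say how the abundance diffusion module enforces the constraints, and the endmember spectra used with these functions would be made-up values.

```python
import math


def softmax(logits):
    """Map unconstrained scores to abundances that are >= 0 and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_pixel(abundances, endmembers):
    """Linear mixing model: spectrum[b] = sum_k abundances[k] * endmembers[k][b].

    `endmembers` is a list of K spectra, each a list of B band values.
    """
    bands = len(endmembers[0])
    return [
        sum(a * spectrum[b] for a, spectrum in zip(abundances, endmembers))
        for b in range(bands)
    ]
```

With K endmembers and B spectral bands, generation in abundance space works with K values per pixel instead of B, which is the dimensionality reduction the abstract refers to when K is much smaller than B.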

[68] Application of convolutional neural networks in image super-resolution

Tian Chunwei,Song Mingjian,Zuo Wangmeng,Du Bo,Zhang Yanning,Zhang Shichao

Main category: cs.CV

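TL;DR: This paper surveys the different convolutional neural network (CNN) methods and modules used in image super-resolution, analyzes their differences and relations, compares their performance experimentally, and points out potential research directions and drawbacks.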

Details

Motivation: CNNs have strong learning ability in image super-resolution, but the methods differ widely and few works summarize them, so a systematic review is needed. Method: Introduces the principles of CNNs in image super-resolution and analyzes methods based on bicubic interpolation, nearest neighbor interpolation, bilinear interpolation, transposed convolution, the sub-pixel layer, and meta up-sampling. Result: Compares the performance of the different CNN-based interpolations and modules by experiments. Conclusion: Summarizes the strengths and weaknesses of existing methods and points out potential research directions to facilitate the development of CNNs in image super-resolution. Abstract: Due to the strong learning abilities of convolutional neural networks (CNNs), they have become mainstream methods for image super-resolution. However, deep learning methods of different types differ greatly, and there is little literature summarizing the relations and differences among methods in image super-resolution. Thus, summarizing this literature is important, taking into account the loading capacity and execution speed of devices. This paper first introduces the principles of CNNs in image super-resolution, then introduces CNN-based bicubic interpolation, nearest neighbor interpolation, bilinear interpolation, transposed convolution, sub-pixel layer, and meta up-sampling for image super-resolution to analyze the differences and relations of the different CNN-based interpolations and modules, and compares the performance of these methods by experiments. Finally, this paper gives potential research points and drawbacks and summarizes the whole paper, which can facilitate the development of CNNs in image super-resolution.
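
Of the up-sampling modules listed above, the sub-pixel layer is the easiest to show in a few lines. The sketch below is a plain-Python version of the rearrangement step only (often called pixel shuffle), assuming the channel ordering used by common deep learning frameworks; it operates on nested lists rather than tensors and omits the convolution that normally produces the input channels.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Each group of r*r input channels supplies the r x r block of output
    pixels at the corresponding spatial location.
    """
    in_channels = len(x)
    if in_channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    out_channels = in_channels // (r * r)
    height, width = len(x[0]), len(x[0][0])
    out = [
        [[0] * (width * r) for _ in range(height * r)]
        for _ in range(out_channels)
    ]
    for c in range(out_channels):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        src = c * r * r + i * r + j  # source channel index
                        out[c][h * r + i][w * r + j] = x[src][h][w]
    return out
```

Unlike interpolation, this step has no arithmetic of its own: all learning happens in the preceding convolution, and the layer only moves channel values into spatial positions.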

[69] One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation

Xue Wu,Jingwei Xin,Zhijun Tu,Jie Hu,Jie Li,Nannan Wang,Xinbo Gao

Main category: cs.CV

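TL;DR: VPD-SR is a novel visual perception diffusion distillation framework that uses explicit semantic supervision and a high-frequency perception loss to achieve efficient one-step super-resolution.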

Details

Motivation: Existing diffusion models for super-resolution are slowed by multi-step sampling, and the images produced by distilled models show insufficient semantic alignment and perceptual quality. Method: Proposes the VPD-SR framework, which combines Explicit Semantic-aware Supervision (ESS) with a High-Frequency Perception (HFP) loss and adds adversarial training to improve realism. Result: Experiments show that VPD-SR outperforms previous methods and the teacher model with one-step sampling. Conclusion: Through semantic and high-frequency optimization, VPD-SR improves both the efficiency and the perceptual quality of super-resolution. Abstract: Diffusion-based models have been widely used in various visual generation tasks, showing promising results in image super-resolution (SR), while typically being limited by dozens or even hundreds of sampling steps. Although existing methods aim to accelerate the inference speed of multi-step diffusion-based SR methods through knowledge distillation, their generated images exhibit insufficient semantic alignment with real images, resulting in suboptimal perceptual quality reconstruction, specifically reflected in the CLIPIQA score. These methods still have many challenges in perceptual quality and semantic fidelity. Based on the challenges, we propose VPD-SR, a novel visual perception diffusion distillation framework specifically designed for SR, aiming to construct an effective and efficient one-step SR model. Specifically, VPD-SR consists of two components: Explicit Semantic-aware Supervision (ESS) and High-Frequency Perception (HFP) loss. Firstly, the ESS leverages the powerful visual perceptual understanding capabilities of the CLIP model to extract explicit semantic supervision, thereby enhancing semantic consistency. Then, considering that high-frequency information contributes to the visual perception quality of images, in addition to the vanilla distillation loss, the HFP loss guides the student model to restore the missing high-frequency details in degraded images that are critical for enhancing perceptual quality. Lastly, we expand VPD-SR in an adversarial training manner to further enhance the authenticity of the generated content. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed VPD-SR achieves superior performance compared to both previous state-of-the-art methods and the teacher model with just one-step sampling.

[70] High Performance Space Debris Tracking in Complex Skylight Backgrounds with a Large-Scale Dataset

Guohang Zhuang,Weixi Song,Jinyang Huang,Chenwei Yang,Yan Lu

Main category: cs.CV

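TL;DR: The paper proposes a deep learning-based Space Debris Tracking Network (SDT-Net) that improves tracking accuracy through effective feature representation, and builds a large-scale dataset, SDTD, to validate the model.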

Details

Motivation: The rapid growth of space debris calls for real-time, accurate tracking, and traditional signal processing methods struggle with complex backgrounds and dense debris. Method: Proposes SDT-Net, a deep learning network trained end to end, and builds the SDTD dataset with a novel observation-based data simulation scheme. Result: Experiments validate the model's effectiveness, and it achieves a MOTA score of 70.6% on real data from the Antarctic Station. Conclusion: SDT-Net transfers well to real-world scenarios; the dataset and code will be released. Abstract: With the rapid development of space exploration, space debris has attracted more attention due to its potential extreme threat, leading to the need for real-time and accurate debris tracking. However, existing methods are mainly based on traditional signal processing, which cannot effectively process the complex background and dense space debris. In this paper, we propose a deep learning-based Space Debris Tracking Network (SDT-Net) to achieve highly accurate debris tracking. SDT-Net effectively represents the feature of debris, enhancing the efficiency and stability of end-to-end model learning. To train and evaluate this model effectively, we also produce a large-scale dataset Space Debris Tracking Dataset (SDTD) by a novel observation-based data simulation scheme. SDTD contains 18,040 video sequences with a total of 62,562 frames and covers 250,000 synthetic space debris. Extensive experiments validate the effectiveness of our model and the challenging nature of our dataset. Furthermore, we test our model on real data from the Antarctic Station, achieving a MOTA score of 70.6%, which demonstrates its strong transferability to real-world scenarios. Our dataset and code will be released soon.
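
For readers unfamiliar with the MOTA score quoted above, the sketch below computes it from per-frame error counts, assuming the standard CLEAR MOT definition (one minus the ratio of misses, false positives, and identity switches to ground-truth objects). The abstract does not say which evaluation toolkit was used, and matching predictions to ground truth, which produces these counts, is outside the scope of the sketch.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy from per-frame counts.

    Each argument is a list with one integer per frame. The score is at most
    1.0 and can be negative when the errors outnumber ground-truth objects.
    """
    total_gt = sum(num_ground_truth)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / total_gt
```

Under this definition, a MOTA of 70.6% means that the combined count of missed debris, false detections, and identity switches equals 29.4% of the number of ground-truth objects summed over all frames.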

[71] Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models

Safaa Abdullahi Moallim Mohamud,Minjin Baek,Dong Seog Han

Main category: cs.CV

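TL;DR: Proposes a hierarchical question-answering approach for scene understanding in autonomous vehicles that balances cost-efficiency with detailed visual interpretation.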

Details

Motivation: Autonomous driving needs scene understanding that is both efficient and detailed, without generating lengthy descriptions. Method: Fine-tunes a compact vision-language model (VLM) on a custom dataset and decomposes the task with a hierarchical QA strategy, dynamically skipping questions to cut computational overhead. Result: Performs well on the custom dataset, competing with state-of-the-art methods such as GPT-4o while significantly reducing inference time. Conclusion: The approach captures key driving elements efficiently and is suited to real-time deployment. Abstract: In this paper, we present a hierarchical question-answering (QA) approach for scene understanding in autonomous vehicles, balancing cost-efficiency with detailed visual interpretation. The method fine-tunes a compact vision-language model (VLM) on a custom dataset specific to the geographical area in which the vehicle operates to capture key driving-related visual elements. At the inference stage, the hierarchical QA strategy decomposes the scene understanding task into high-level and detailed sub-questions. Instead of generating lengthy descriptions, the VLM navigates a structured question tree, where answering high-level questions (e.g., "Is it possible for the ego vehicle to turn left at the intersection?") triggers more detailed sub-questions (e.g., "Is there a vehicle approaching the intersection from the opposite direction?"). To optimize inference time, questions are dynamically skipped based on previous answers, minimizing computational overhead. The extracted answers are then synthesized using handcrafted templates to ensure coherent, contextually accurate scene descriptions. We evaluate the proposed approach on the custom dataset using GPT reference-free scoring, demonstrating its competitiveness with state-of-the-art methods like GPT-4o in capturing key scene details while achieving significantly lower inference time. Moreover, qualitative results from real-time deployment highlight the proposed approach's capacity to capture key driving elements with minimal latency.
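
The question-tree traversal described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Question` class, the `trigger` field, and the `walk` function are hypothetical names, and `ask` stands in for a call to the fine-tuned VLM. The sketch assumes a sub-question is asked only when its parent's answer matches a trigger value, which is one simple way to skip questions based on previous answers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Question:
    text: str
    # Children are asked only when the answer to this question equals `trigger`.
    trigger: str = "yes"
    children: List["Question"] = field(default_factory=list)


def walk(question: Question,
         ask: Callable[[str], str],
         answers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Depth-first traversal that skips a whole subtree when the trigger is not met."""
    if answers is None:
        answers = {}
    answer = ask(question.text)
    answers[question.text] = answer
    if answer == question.trigger:
        for child in question.children:
            walk(child, ask, answers)
    return answers


# Two-level tree built from the example questions quoted in the abstract.
TREE = Question(
    "Is it possible for the ego vehicle to turn left at the intersection?",
    children=[
        Question("Is there a vehicle approaching the intersection "
                 "from the opposite direction?"),
    ],
)
```

Calling `walk(TREE, ask)` with an `ask` that answers "no" to the first question returns a single answer and never issues the second query, which is where the inference-time saving would come from.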

[72] Synthetic Iris Image Databases and Identity Leakage: Risks and Mitigation Strategies

Ada Sawilska,Mateusz Trokielewicz

Main category: cs.CV

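TL;DR: This paper surveys iris image synthesis methods, which aim to ease the problem of collecting large, diverse biometric datasets from living individuals, and discusses the potential and risks of different generative methods.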

Details

Motivation: Collecting large, diverse biometric datasets from living individuals is difficult; synthesis offers an alternative for developing biometric methods. Method: Surveys iris image synthesis methods, including traditional image processing techniques, GANs, VAEs, and diffusion models. Result: Analyzes the generative potential and fidelity of each method and provides examples of predictions. Conclusion: Discusses the risk of biometric feature leakage and strategies to prevent it, stressing that these must be addressed before generative methods can be considered a replacement for real-world datasets. Abstract: This paper presents a comprehensive overview of iris image synthesis methods, which can alleviate the issues associated with gathering large, diverse datasets of biometric data from living individuals, which are considered pivotal for biometric methods development. These methods for synthesizing iris data range from traditional, handcrafted image processing-based techniques, through various iterations of GAN-based image generators and variational autoencoders (VAEs), to diffusion models. The potential and fidelity in iris image generation of each method are discussed, and examples of inferred predictions are provided. Furthermore, the risks of individual biometric feature leakage from the training sets are considered, together with possible strategies for preventing them, which have to be implemented should these generative methods be considered a valid replacement of real-world biometric datasets.

[73] ControlMambaIR: Conditional Controls with State-Space Model for Image Restoration

Cheng Yang,Lijing Liang,Zhixun Su

Main category: cs.CV

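TL;DR: ControlMambaIR is a novel image restoration method that combines the Mamba network architecture with a diffusion model to improve control and optimization of image generation; it performs well on deraining, deblurring, and denoising.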

Details

Motivation: Address perceptual challenges in image restoration tasks such as deraining, deblurring, and denoising by improving conditional control over the image generation process. Method: Integrates the Mamba network architecture with a diffusion model, achieves fine-grained control through a condition network, and validates the approach on several benchmark datasets. Result: Outperforms existing methods on perceptual quality metrics (LPIPS, FID) while remaining competitive on image distortion metrics (PSNR, SSIM). Conclusion: ControlMambaIR is flexible and effective for image restoration, and the Mamba architecture works better as a conditional control network than CNN- and attention-based alternatives. Abstract: This paper proposes ControlMambaIR, a novel image restoration method designed to address perceptual challenges in image deraining, deblurring, and denoising tasks. By integrating the Mamba network architecture with the diffusion model, the condition network achieves refined conditional control, thereby enhancing the control and optimization of the image generation process. To evaluate the robustness and generalization capability of our method across various image degradation conditions, extensive experiments were conducted on several benchmark datasets, including Rain100H, Rain100L, GoPro, and SSID. The results demonstrate that our proposed approach consistently surpasses existing methods in perceptual quality metrics, such as LPIPS and FID, while maintaining comparable performance in image distortion metrics, including PSNR and SSIM, highlighting its effectiveness and adaptability. Notably, ablation experiments reveal that direct noise prediction in the diffusion process achieves better performance, effectively balancing noise suppression and detail preservation. Furthermore, the findings indicate that the Mamba architecture is particularly well-suited as a conditional control network for diffusion models, outperforming both CNN- and Attention-based approaches in this context. Overall, these results highlight the flexibility and effectiveness of ControlMambaIR in addressing a range of image restoration perceptual challenges.

[74] Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet

Xiao Chen,Jiazhen Huang,Qinting Jiang,Fanding Huang,Xianghua Fu,Jingyan Jiang,Zhi Wang

Main category: cs.CV

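TL;DR: SAIL is an efficient adapter-based test-time adaptation framework that uses a lightweight AdaptNet and a gradient-aware reset strategy to substantially cut computational cost and improve performance.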

Details

Motivation: Existing test-time adaptation methods are computationally expensive and scale poorly; SAIL aims to address both problems. Method: SAIL pairs a frozen pre-trained VLM with AdaptNet, generates predictions through a confidence-based interpolation weight, and uses a gradient-aware reset strategy to prevent catastrophic forgetting. Result: SAIL performs strongly across multiple benchmarks at low computational cost, making it suitable for practical deployment. Conclusion: SAIL is an efficient and scalable test-time adaptation solution. Abstract: Test-time adaptation (TTA) has emerged as a critical technique for enhancing the generalization capability of vision-language models (VLMs) during inference. However, existing approaches often incur substantial computational costs and exhibit poor scalability, primarily due to sample-wise adaptation granularity and reliance on costly auxiliary designs such as data augmentation. To address these limitations, we introduce SAIL (Small Aid, Big Leap), a novel adapter-based TTA framework that leverages a lightweight, learnable AdaptNet to enable efficient and scalable model adaptation. At SAIL's core, a frozen pre-trained VLM collaborates with AdaptNet through a confidence-based interpolation weight, generating robust predictions during inference. These predictions serve as self-supervised targets to align AdaptNet's outputs through efficient batch-wise processing, dramatically reducing computational costs without modifying the VLM or requiring memory caches. To mitigate catastrophic forgetting during continual adaptation, we propose a gradient-aware reset strategy driven by a gradient drift indicator (GDI), which dynamically detects domain transitions and strategically resets AdaptNet for stable adaptation. Extensive experiments across diverse benchmarks on two scenarios demonstrate that SAIL achieves state-of-the-art performance while maintaining low computational costs. These results highlight SAIL's effectiveness, efficiency and scalability for real-world deployment. The code will be released upon acceptance.
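
The abstract does not give the formula for the confidence-based interpolation weight. The sketch below shows one plausible form, stated as an assumption: each model's confidence is its maximum softmax probability, and the weight on the frozen VLM is its share of the total confidence. The function names and this weighting rule are hypothetical and may differ from SAIL's actual design.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate(vlm_logits, adapt_logits):
    """Blend two class distributions, weighting each by its own confidence.

    Assumption: confidence = max softmax probability; the source does not
    specify how SAIL computes its interpolation weight.
    """
    p_vlm = softmax(vlm_logits)
    p_adapt = softmax(adapt_logits)
    conf_vlm, conf_adapt = max(p_vlm), max(p_adapt)
    weight = conf_vlm / (conf_vlm + conf_adapt)  # share given to the frozen VLM
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_vlm, p_adapt)]
```

Because the result is a convex combination of two probability vectors, it is itself a valid distribution, so under this reading it could serve directly as the self-supervised target the abstract mentions.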

[75] Self-Disentanglement and Re-Composition for Cross-Domain Few-Shot Segmentation

Jintao Tong,Yixiong Zou,Guangyao Chen,Yuhua Li,Ruixuan Li

Main category: cs.CV

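TL;DR: The paper proposes a method for the feature entanglement problem in cross-domain few-shot segmentation (CD-FSS): it decomposes the ViT structure and learns a weighting over component comparisons, significantly improving performance.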

Details

Motivation: Current CD-FSS methods suffer from feature entanglement in distance calculation, which limits knowledge transfer. Method: Decomposes the ViT structure, analyzes the entanglement problem, and proposes disentangling features by learning a weighting over comparisons. Result: Experiments show average accuracy gains of 1.92% and 1.88% over the state of the art under 1-shot and 5-shot settings, respectively. Conclusion: Disentangling and re-composing ViT features effectively resolves the entanglement problem in CD-FSS and improves model performance. Abstract: Cross-Domain Few-Shot Segmentation (CD-FSS) aims to transfer knowledge from a source-domain dataset to unseen target-domain datasets with limited annotations. Current methods typically compare the distance between training and testing samples for mask prediction. However, we find an entanglement problem exists in this widely adopted method, which tends to bind source-domain patterns together and make each of them hard to transfer. In this paper, we aim to address this problem for the CD-FSS task. We first find a natural decomposition of the ViT structure, based on which we delve into the entanglement problem for an interpretation. We find that the decomposed ViT components are cross-compared between images in distance calculation, where the rational comparisons are entangled with the meaningless ones because all are given equal importance, leading to the entanglement problem. Based on this interpretation, we further propose to address the entanglement problem by learning to weight all comparisons of ViT components, which learns disentangled features and re-composes them for the CD-FSS task, benefiting both generalization and fine-tuning. Experiments show that our model outperforms the state-of-the-art CD-FSS method by 1.92% and 1.88% in average accuracy under 1-shot and 5-shot settings, respectively.

[76] Solving Inverse Problems with FLAIR

Julius Erbach,Dominik Narnhofer,Andreas Dombos,Bernt Schiele,Jan Eric Lenssen,Konrad Schindler

Main category: cs.CV

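TL;DR: FLAIR is a novel training-free variational framework that uses flow-based generative models as priors for inverse problems, addressing the fidelity and diversity shortcomings of existing methods.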

Details

Motivation: Flow-based generative models such as Stable Diffusion 3 excel at image generation but have not yet reached comparable results on inverse imaging problems; the main obstacles are a non-linear forward mapping, an intractable data likelihood term, and difficulty recovering rare modes. Method: FLAIR introduces a variational objective for flow matching, combines it with deterministic trajectory adjustments to recover rare modes, decouples the optimization of the data fidelity and regularization terms, and adds a time-dependent calibration scheme. Result: On standard imaging benchmarks, FLAIR outperforms existing diffusion- and flow-based methods in both reconstruction quality and sample diversity. Conclusion: FLAIR offers an efficient, training-free solution for inverse problems and markedly improves the performance of generative models used as priors. Abstract: Flow-based latent generative models such as Stable Diffusion 3 are able to generate images with remarkable quality, even enabling photorealistic text-to-image generation. Their impressive performance suggests that these models should also constitute powerful priors for inverse imaging problems, but that approach has not yet led to comparable fidelity. There are several key obstacles: (i) the encoding into a lower-dimensional latent space makes the underlying (forward) mapping non-linear; (ii) the data likelihood term is usually intractable; and (iii) learned generative models struggle to recover rare, atypical data modes during inference. We present FLAIR, a novel training-free variational framework that leverages flow-based generative models as a prior for inverse problems. To that end, we introduce a variational objective for flow matching that is agnostic to the type of degradation, and combine it with deterministic trajectory adjustments to recover atypical modes. To enforce exact consistency with the observed data, we decouple the optimization of the data fidelity and regularization terms. Moreover, we introduce a time-dependent calibration scheme in which the strength of the regularization is modulated according to off-line accuracy estimates. Results on standard imaging benchmarks demonstrate that FLAIR consistently outperforms existing diffusion- and flow-based methods in terms of reconstruction quality and sample diversity.

[77] Towards Geometry Problem Solving in the Large Model Era: A Survey

Yurui Zhao,Xiang Wang,Jiahong Liu,Irwin King,Zhitao Huang

Main category: cs.CV

TL;DR: The paper surveys progress in geometry problem solving (GPS), proposes a unified paradigm, and identifies future research directions.

Details

Motivation: Geometry problem solving matters for AI, but automating it remains challenging, and the field is fragmented across methods, benchmarks, and evaluation frameworks. Method: Systematically reviews GPS progress along three core dimensions (benchmark construction, textual and diagrammatic parsing, and reasoning paradigms) and proposes a unified analytical paradigm. Result: Summarizes the current limitations of GPS and proposes future research directions such as automated benchmark generation and interpretable neuro-symbolic integration. Conclusion: The paper charts directions for future research toward human-level geometric reasoning. Abstract: Geometry problem solving (GPS) represents a critical frontier in artificial intelligence, with profound applications in education, computer-aided design, and computational graphics. Despite its significance, automating GPS remains challenging due to the dual demands of spatial understanding and rigorous logical reasoning. Recent advances in large models have enabled notable breakthroughs, particularly for SAT-level problems, yet the field remains fragmented across methodologies, benchmarks, and evaluation frameworks. This survey systematically synthesizes GPS advancements through three core dimensions: (1) benchmark construction, (2) textual and diagrammatic parsing, and (3) reasoning paradigms. We further propose a unified analytical paradigm, assess current limitations, and identify emerging opportunities to guide future research toward human-level geometric reasoning, including automated benchmark generation and interpretable neuro-symbolic integration.

[78] Large-scale Self-supervised Video Foundation Model for Intelligent Surgery

Shu Yang,Fengtao Zhou,Leon Mayer,Fuxiang Huang,Yiliang Chen,Yihui Wang,Sunan He,Yuxiang Nie,Xi Wang,Ömer Sümer,Yueming Jin,Huihui Sun,Shuchang Xu,Alex Qinyang Liu,Zheng Li,Jing Qin,Jeremy YuenChun Teoh,Lena Maier-Hein,Hao Chen

Main category: cs.CV

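TL;DR: SurgVISTA is a video-level surgical pre-training framework that learns from large-scale surgical video data through joint spatiotemporal modeling, significantly improving spatiotemporal understanding of surgical scenes.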

Details

Motivation: Existing AI methods lack explicit temporal modeling, which leaves their understanding of dynamic surgical scenes incomplete. Method: Builds a large-scale surgical video dataset and proposes the SurgVISTA framework, which combines spatiotemporal modeling with knowledge distillation. Result: On 13 video-level datasets, SurgVISTA outperforms models pre-trained on natural and surgical domains. Conclusion: SurgVISTA has the potential to advance the clinical use of intelligent surgical systems. Abstract: Computer-Assisted Intervention (CAI) has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making, improving procedural efficacy, and ensuring intraoperative safety. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit temporal modeling during pre-training fundamentally restricts the capture of dynamic surgical contexts, resulting in incomplete spatiotemporal understanding. In this work, we introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. To achieve this, we constructed a large-scale surgical video dataset comprising 3,650 videos and approximately 3.55 million frames, spanning more than 20 surgical procedures and over 10 anatomical structures. Building upon this dataset, we propose SurgVISTA (Surgical Video-level Spatial-Temporal Architecture), a reconstruction-based pre-training method that captures intricate spatial structures and temporal dynamics through joint spatiotemporal modeling. Additionally, SurgVISTA incorporates image-level knowledge distillation guided by a surgery-specific expert to enhance the learning of fine-grained anatomical and semantic features. To validate its effectiveness, we established a comprehensive benchmark comprising 13 video-level datasets spanning six surgical procedures across four tasks. Extensive experiments demonstrate that SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models, demonstrating strong potential to advance intelligent surgical systems in clinically meaningful scenarios.

[79] FaceSleuth: Learning-Driven Single-Orientation Attention Verifies Vertical Dominance in Micro-Expression Recognition

Linquan Wu,Tianxiang Jiang,Wenhao Duan,Yini Fang,Jacky Keung

Main category: cs.CV

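TL;DR: FaceSleuth proposes a dual-stream architecture that enhances micro-expression recognition with vertical attention and adds a learnable single-orientation attention module, significantly improving recognition performance.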

Details

Motivation: Micro-expression recognition must amplify millisecond-level, low-amplitude facial motions while suppressing identity-specific features. Method: FaceSleuth uses a dual-stream architecture with a vertical attention block, a Facial Position Focalizer, and Action-Unit embeddings, and introduces a learnable Single-Orientation Attention module. Result: On three standard datasets, FaceSleuth clearly outperforms existing methods, reaching up to 95.1% accuracy. Conclusion: The vertical attention bias is the most discriminative orientation for micro-expression recognition, and FaceSleuth provides empirical evidence for it. Abstract: Micro-expression recognition (MER) demands models that can amplify millisecond-level, low-amplitude facial motions while suppressing identity-specific appearance. We introduce FaceSleuth, a dual-stream architecture that (1) enhances motion along the empirically dominant vertical axis through a Continuously Vertical Attention (CVA) block, (2) localises the resulting signals with a Facial Position Focalizer built on hierarchical cross-window attention, and (3) steers feature learning toward physiologically meaningful regions via lightweight Action-Unit embeddings. To examine whether the hand-chosen vertical axis is indeed optimal, we further propose a Single-Orientation Attention (SOA) module that learns its own pooling direction end-to-end. SOA is differentiable, adds only 0.16% parameters, and collapses to CVA when the learned angle converges to π/2. In practice, SOA reliably drifts to 88°, confirming the effectiveness of the vertical prior while delivering consistent gains. On three standard MER benchmarks, FaceSleuth with CVA already surpasses previous state-of-the-art methods; plugging in SOA lifts accuracy and F1 score to 95.1% / 0.918 on CASME II, 87.1% / 0.840 on SAMM, and 92.9% / 0.917 on MMEW without sacrificing model compactness. These results establish a new state of the art and, for the first time, provide empirical evidence that the vertical attention bias is the most discriminative orientation for MER.

[80] LayoutRAG: Retrieval-Augmented Model for Content-agnostic Conditional Layout Generation

Yuxuan Wu,Le Wang,Sanping Zhou,Mengnan Liu,Gang Hua,Haoxiang Li

Main category: cs.CV

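TL;DR: The paper proposes a retrieval- and reference-guided layout generation method that retrieves templates by condition and uses them as references to guide generation, outperforming existing methods.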

Details

Motivation: Existing diffusion and flow-matching models leave room for improvement in layout generation, particularly in producing optimal layouts under given conditions. Method: Retrieves suitable layout templates by condition as references and designs a condition-modulated attention mechanism that selectively absorbs the retrieved knowledge. Result: Experiments show the method generates high-quality layouts that satisfy the given conditions and outperforms existing state-of-the-art models. Conclusion: Retrieval and reference-guided generation effectively improve layout quality and condition satisfaction. Abstract: Controllable layout generation aims to create plausible visual arrangements of element bounding boxes within a graphic design according to certain optional constraints, such as the type or position of a specific component. While recent diffusion or flow-matching models have achieved considerable advances in multifarious conditional generation tasks, there remains considerable room for generating optimal arrangements under given conditions. In this work, we propose to carry out layout generation through retrieving by conditions and reference-guided generation. Specifically, we retrieve appropriate layout templates according to given conditions as references. The references are then utilized to guide the denoising or flow-based transport process. By retrieving layouts compatible with the given conditions, we can uncover the potential information not explicitly provided in the given condition. Such an approach offers more effective guidance to the model during the generation process, in contrast to previous models that feed the condition to the model and let the model infer the unprovided layout attributes directly. Meanwhile, we design a condition-modulated attention that selectively absorbs retrieval knowledge, adapting to the difference between retrieved templates and given conditions. Extensive experimental results show that our method successfully produces high-quality layouts that meet the given conditions and outperforms existing state-of-the-art models. Code will be released upon acceptance.

[81] Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences

Yunhong Lu,Qichao Wang,Hengyuan Cao,Xiaoyin Xu,Min Zhang

Main category: cs.CV

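TL;DR: SmPO-Diffusion improves DPO by modeling preference distributions and estimating an upper bound on the optimization objective, improving the alignment of text-to-image generation models with human preferences.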

Details

Motivation: Existing methods ignore variation in individual preferences, which calls for a finer-grained representation. Method: Replaces the binary preference distribution with a smoothed one, improves the DPO loss using a reward model and preference likelihood averaging, and uses an inversion technique to simulate the trajectory preference distribution of the diffusion model. Result: SmPO-Diffusion achieves the best performance in preference evaluation at lower training cost. Conclusion: Simple modifications effectively mitigate excessive optimization and objective misalignment, yielding better performance. Abstract: Direct Preference Optimization (DPO) aligns text-to-image (T2I) generation models with human preferences using pairwise preference data. Although substantial resources are expended in collecting and labeling datasets, a critical aspect is often neglected: preferences vary across individuals and should be represented with more granularity. To address this, we propose SmPO-Diffusion, a novel method for modeling preference distributions to improve the DPO objective, along with a numerical upper bound estimation for the diffusion optimization objective. First, we introduce a smoothed preference distribution to replace the original binary distribution. We employ a reward model to simulate human preferences and apply preference likelihood averaging to improve the DPO loss, such that the loss function approaches zero when preferences are similar. Furthermore, we utilize an inversion technique to simulate the trajectory preference distribution of the diffusion model, enabling more accurate alignment with the optimization objective. Our approach effectively mitigates issues of excessive optimization and objective misalignment present in existing methods through straightforward modifications. Our SmPO-Diffusion achieves state-of-the-art performance in preference evaluation, outperforming baselines across metrics with lower training costs. The project page is https://jaydenlyh.github.io/SmPO-project-page/.

[82] ToothForge: Automatic Dental Shape Generation using Synchronized Spectral Embeddings

Tibor Kubík,François Guibault,Michal Španěl,Hervé Lombaert

Main category: cs.CV

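TL;DR: ToothForge is a spectral method for automatically generating novel 3D tooth models, addressing the sparsity of dental shape datasets. Operating in the spectral domain enables compact machine learning modeling and fast generation of high-resolution tooth meshes.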

Details

Motivation: The sparsity of dental shape datasets limits 3D tooth model generation. Existing methods are constrained by the instability of decomposed harmonics and require all shapes to share a fixed connectivity. Method: Models the latent manifold on synchronized frequential embeddings, aligning the spectra of all data samples to a common basis to remove the bias introduced by decomposition instability. Result: On a dataset of real dental crowns, the reconstruction quality of generated shapes exceeds that of models trained on unaligned embeddings. Spectral analysis can also be used for shape compression and interpolation. Conclusion: ToothForge combines spectral analysis with machine learning, relaxes restrictions on mesh structure, and applies to shape analysis in dentistry and other medical fields. Abstract: We introduce ToothForge, a spectral approach for automatically generating novel 3D teeth, effectively addressing the sparsity of dental shape datasets. By operating in the spectral domain, our method enables compact machine learning modeling, allowing the generation of high-resolution tooth meshes in milliseconds. However, generating shape spectra comes with the instability of the decomposed harmonics. To address this, we propose modeling the latent manifold on synchronized frequential embeddings. Spectra of all data samples are aligned to a common basis prior to the training procedure, effectively eliminating biases introduced by the decomposition instability. Furthermore, synchronized modeling removes the limiting factor imposed by previous methods, which require all shapes to share a common fixed connectivity. Using a private dataset of real dental crowns, we observe a greater reconstruction quality of the synthesized shapes, exceeding that of models trained on unaligned embeddings. We also explore additional applications of spectral analysis in digital dentistry, such as shape compression and interpolation. ToothForge facilitates a range of approaches at the intersection of spectral analysis and machine learning, with fewer restrictions on mesh structure. This makes it applicable for shape analysis not only in dentistry, but also in broader medical applications, where guaranteeing consistent connectivity across shapes from various clinics is unrealistic. The code is available at https://github.com/tiborkubik/toothForge.

[83] Iterative Self-Improvement of Vision Language Models for Image Scoring and Self-Explanation

Naoto Tanji,Toshihiko Yamasaki

Main category: cs.CV

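TL;DR: Proposes a new method that self-trains a vision language model to generate image scores with natural-language explanations, without external data or models.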

Details

Motivation: Understanding the rationale behind a model's score is essential to trusting its judgment. Method: Self-trains using only an image scoring dataset and an instruction-tuned VLM, with iterative training via Direct Preference Optimization. Result: Improves both scoring accuracy and the coherence of explanations. Conclusion: The method effectively improves the alignment between the model's scores and its explanations. Abstract: Image scoring is a crucial task in numerous real-world applications. To trust a model's judgment, understanding its rationale is essential. This paper proposes a novel training method for Vision Language Models (VLMs) to generate not only image scores but also corresponding justifications in natural language. Leveraging only an image scoring dataset and an instruction-tuned VLM, our method enables self-training, utilizing the VLM's generated text without relying on external data or models. In addition, we introduce a simple method for creating a dataset designed to improve alignment between predicted scores and their textual justifications. By iteratively training the model with Direct Preference Optimization on two distinct datasets and merging them, we can improve both scoring accuracy and the coherence of generated explanations.

[84] LinkTo-Anime: A 2D Animation Optical Flow Dataset from 3D Model Rendering

Xiaoyi Feng,Kaifeng Zou,Caichun Cen,Tao Huang,Hui Guo,Zizhou Huang,Yingli Zhao,Mingqing Zhang,Diwei Wang,Yuntao Zou,Dagang Li

Main category: cs.CV

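TL;DR: LinkTo-Anime is the first high-quality dataset designed specifically for cel anime character motion, filling a gap in existing optical flow datasets and supporting research on optical flow estimation and related downstream tasks.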

Details

Motivation: Existing optical flow datasets focus mainly on real-world simulation or synthetic human motion; none targets cel anime character motion, a domain with distinctive visual and motion characteristics. Method: Generates the LinkTo-Anime dataset through 3D model rendering, provides rich annotations (optical flow, occlusion masks, skeletons, and more), and builds a multi-dataset benchmark. Result: The dataset contains 395 video sequences with 24,230 training frames, 720 validation frames, and 4,320 test frames, and the benchmark analyzes the limitations of different optical flow estimation methods. Conclusion: LinkTo-Anime fills the gap in optical flow datasets for cel animation and provides an important resource for related research. Abstract: Existing optical flow datasets focus primarily on real-world simulation or synthetic human motion, but few are tailored to celluloid (cel) anime character motion: a domain with unique visual and motion characteristics. To bridge this gap and facilitate research in optical flow estimation and downstream tasks such as anime video generation and line drawing colorization, we introduce LinkTo-Anime, the first high-quality dataset specifically designed for cel anime character motion generated with 3D model rendering. LinkTo-Anime provides rich annotations including forward and backward optical flow, occlusion masks, and Mixamo Skeleton. The dataset comprises 395 video sequences, totaling 24,230 training frames, 720 validation frames, and 4,320 test frames. Furthermore, a comprehensive benchmark is constructed with various optical flow estimation methods to analyze the shortcomings and limitations across multiple datasets.

[85] GeneA-SLAM2: Dynamic SLAM with AutoEncoder-Preprocessed Genetic Keypoints Resampling and Depth Variance-Guided Dynamic Region Removal

Shufan Qing,Anzhen Li,Qiandi Wang,Yuefeng Niu,Mingchen Feng,Guoliang Hu,Jinqiao Wu,Fengtao Nan,Yingchun Fan

Main category: cs.CV

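TL;DR: GeneA-SLAM2 handles dynamic scenes with depth variance constraints and uses an autoencoder to optimize keypoint distribution, improving SLAM accuracy in dynamic environments.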

Details

Motivation: Existing semantic SLAM cannot fully cover dynamic regions in highly dynamic scenes, so a more robust method is needed. Method: Extracts dynamic pixels via depth variance to generate depth masks, and uses an autoencoder to optimize keypoint distribution. Result: Outperforms existing methods on highly dynamic sequences while maintaining high accuracy. Conclusion: GeneA-SLAM2 offers an efficient and robust solution for dynamic environments. Abstract: Existing semantic SLAM in dynamic environments mainly identifies dynamic regions through object detection or semantic segmentation methods. However, in certain highly dynamic scenarios, the detection boxes or segmentation masks cannot fully cover dynamic regions. Therefore, this paper proposes a robust and efficient GeneA-SLAM2 system that leverages depth variance constraints to handle dynamic scenes. Our method extracts dynamic pixels via depth variance and creates precise depth masks to guide the removal of dynamic objects. Simultaneously, an autoencoder is used to reconstruct keypoints, improving the genetic resampling keypoint algorithm to obtain more uniformly distributed keypoints and enhance the accuracy of pose estimation. Our system was evaluated on multiple highly dynamic sequences. The results demonstrate that GeneA-SLAM2 maintains high accuracy in dynamic scenes compared to current methods. Code is available at: https://github.com/qingshufan/GeneA-SLAM2.
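
To make the idea of extracting dynamic pixels via depth variance concrete, here is a minimal sketch. It assumes the variance is taken per pixel over a short window of aligned depth frames and compared with a fixed threshold; the abstract does not say whether the variance is temporal or spatial or how the threshold is chosen, so `dynamic_mask`, `depth_frames`, and `threshold` are hypothetical.

```python
from statistics import pvariance


def dynamic_mask(depth_frames, threshold):
    """Flag pixels whose depth varies strongly across a window of frames.

    depth_frames: list of H x W nested lists of depth values, assumed aligned.
    Returns an H x W nested list of booleans (True = likely dynamic).
    """
    height = len(depth_frames[0])
    width = len(depth_frames[0][0])
    mask = []
    for row in range(height):
        mask_row = []
        for col in range(width):
            # Depth samples observed at this pixel across the window.
            samples = [frame[row][col] for frame in depth_frames]
            mask_row.append(pvariance(samples) > threshold)
        mask.append(mask_row)
    return mask
```

For three 1 x 2 frames where the first pixel stays at depth 1.0 and the second takes the values 2.0, 3.0, and 1.0, a threshold of 0.1 marks only the second pixel as dynamic. A real system would also need to compensate for camera motion before comparing frames, which this sketch leaves out.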

[86] Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning

Negin Baghbanzadeh,Sajad Ashkezari,Elham Dolatabadi,Arash Afkanpour

Main category: cs.CV

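TL;DR: The paper proposes a transformer-based object detection method for extracting subfigures from compound figures in biomedical literature at scale, and releases OPEN-PMC-18M, a dataset of 18 million high-quality subfigure-caption pairs that improves vision-language model performance.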

Details

Motivation: Subfigure extraction from compound figures in biomedical literature has not been solved at scale, and existing methods are limited in dataset size and generalizability. Method: Uses transformer-based object detection trained on a synthetic corpus of 500,000 compound figures, achieving state-of-the-art performance on ImageCLEF 2016 and synthetic benchmarks. Result: Releases the OPEN-PMC-18M dataset and shows improved model performance on retrieval, zero-shot classification, and robustness benchmarks. Conclusion: The work provides reproducible benchmarks and a foundation for further research in biomedical vision-language modeling and representation learning. Abstract: Compound figures, which are multi-panel composites containing diverse subfigures, are ubiquitous in biomedical literature, yet large-scale subfigure extraction remains largely unaddressed. Prior work on subfigure extraction has been limited in both dataset size and generalizability, leaving a critical open question: How does high-fidelity image-text alignment via large-scale subfigure extraction impact representation learning in vision-language models? We address this gap by introducing a scalable subfigure extraction pipeline based on transformer-based object detection, trained on a synthetic corpus of 500,000 compound figures, and achieving state-of-the-art performance on both ImageCLEF 2016 and synthetic benchmarks. Using this pipeline, we release OPEN-PMC-18M, a large-scale, high-quality biomedical vision-language dataset comprising 18 million clinically relevant subfigure-caption pairs spanning radiology, microscopy, and visible light photography. We train and evaluate vision-language models on our curated datasets and show improved performance across retrieval, zero-shot classification, and robustness benchmarks, outperforming existing baselines. We release our dataset, models, and code to support reproducible benchmarks and further study into biomedical vision-language modeling and representation learning.

[87] VTGaussian-SLAM: RGBD SLAM for Large Scale Scenes with Splatting View-Tied 3D Gaussians

Pengchong Hu,Zhizhong Han

Main category: cs.CV

TL;DR: The paper proposes a view-tied 3D Gaussian representation for RGBD SLAM systems, addressing the inefficiency of existing methods in extremely large scenes.

Details

Motivation: Existing methods represent scenes with 3D Gaussians, but in extremely large scenes limited GPU memory prevents them from optimizing all Gaussians efficiently, which makes tracking and mapping inefficient. Method: Proposes view-tied 3D Gaussians, which simplify the Gaussian parameters (no learned locations, rotations, or multi-dimensional variances) and tie the Gaussians to depth pixels, saving storage and allowing more Gaussians to represent detail. The tracking and mapping strategies are also redesigned so that not all Gaussians need to stay learnable throughout training. Result: On widely used benchmarks, the new method outperforms the latest methods in rendering accuracy, tracking accuracy, and scalability. Conclusion: View-tied 3D Gaussians and the accompanying strategies significantly improve the efficiency and performance of RGBD SLAM in large scenes. Abstract: Jointly estimating camera poses and mapping scenes from RGBD images is a fundamental task in simultaneous localization and mapping (SLAM). State-of-the-art methods employ 3D Gaussians to represent a scene, and render these Gaussians through splatting for higher efficiency and better rendering. However, these methods cannot scale up to extremely large scenes, due to inefficient tracking and mapping strategies that need to optimize all 3D Gaussians in limited GPU memory throughout training to maintain geometry and color consistency with previous RGBD observations. To resolve this issue, we propose novel tracking and mapping strategies to work with a novel 3D representation, dubbed view-tied 3D Gaussians, for RGBD SLAM systems. View-tied 3D Gaussians are a kind of simplified Gaussians that are tied to depth pixels, without needing to learn locations, rotations, and multi-dimensional variances. Tying Gaussians to views not only significantly saves storage but also allows us to employ many more Gaussians to represent local details in limited GPU memory. Moreover, our strategies remove the need to keep all Gaussians learnable throughout training, while improving rendering quality and tracking accuracy. We justify the effectiveness of these designs, and report better performance over the latest methods on widely used benchmarks in terms of rendering and tracking accuracy and scalability. Please see our project page for code and videos at https://machineperceptionlab.github.io/VTGaussian-SLAM-Project .

[88] RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS

Chuanyu Fu,Yuqi Zhang,Kunbin Yao,Guanying Chen,Yuan Xiong,Chuan Huang,Shuguang Cui,Xiaochun Cao

Main category: cs.CV

TL;DR: RobustSplat resolves rendering artifacts caused by transient objects in 3DGS through delayed Gaussian growth and scale-cascaded mask bootstrapping.

Details

Motivation: When modeling scenes affected by transient objects, existing 3DGS methods produce artifacts because of the Gaussian densification process, so a more robust solution is needed. Method: (1) A delayed Gaussian growth strategy that optimizes static scene structure first; (2) scale-cascaded mask bootstrapping that refines mask prediction progressively from low to high resolution. Result: Outperforms existing methods on multiple datasets, demonstrating robustness and effectiveness. Conclusion: RobustSplat effectively resolves artifacts caused by transient objects and improves 3DGS rendering quality. Abstract: 3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances. To address this, we propose RobustSplat, a robust solution based on two critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method. Our project page is https://fcyycf.github.io/RobustSplat/.
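
The delayed Gaussian growth strategy can be pictured as a gate on the densification step of the training loop. The sketch below is an illustration under stated assumptions rather than RobustSplat's actual schedule: `should_densify`, `delay_steps`, and `interval` are hypothetical names, and the abstract gives no concrete step counts.

```python
def should_densify(step, delay_steps, interval):
    """Return True when Gaussian splitting/cloning is allowed at this step.

    Densification is blocked for the first `delay_steps` iterations so the
    static structure is optimized first, then allowed every `interval` steps.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if step < delay_steps:
        return False
    return (step - delay_steps) % interval == 0


# Hypothetical schedule: wait 1,000 steps, then densify every 100 steps.
allowed_steps = [s for s in range(0, 1301, 50) if should_densify(s, 1000, 100)]
```

With this made-up schedule, the steps at which growth is allowed are 1000, 1100, 1200, and 1300; nothing is split or cloned during the warm-up.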

[89] Unified Attention Modeling for Efficient Free-Viewing and Visual Search via Shared Representations

Fatma Youssef Mohammed,Kostas Alexis

Main category: cs.CV

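TL;DR: The study asks whether free-viewing and task-driven visual search can share a common attention representation, and proposes a neural network architecture built on HAT. The results show that the two can share a representation efficiently: knowledge transfer costs only a 3.86% performance drop while substantially reducing computational cost.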

Details

Motivation: To explore whether free-viewing and task-driven visual search have a common attention representation, and so whether sharing a representation can improve efficiency and performance. Method: Proposes a neural network architecture based on the Human Attention Transformer (HAT) to test the shared-representation hypothesis for free-viewing and visual search. Result: A model trained on free-viewing and transferred to visual search loses only 3.86% in performance, while computational cost falls substantially (92.29% fewer GFLOPs and 31.23% fewer trainable parameters). Conclusion: Free-viewing and visual search can efficiently share a common attention representation, which offers a more efficient approach to computational attention modeling. Abstract: Computational human attention modeling in free-viewing and task-specific settings is often studied separately, with limited exploration of whether a common representation exists between them. This work investigates this question and proposes a neural network architecture that builds upon the Human Attention Transformer (HAT) to test the hypothesis. Our results demonstrate that free-viewing and visual search can efficiently share a common representation, allowing a model trained in free-viewing attention to transfer its knowledge to task-driven visual search with a performance drop of only 3.86% in the predicted fixation scanpaths, measured by the semantic sequence score (SemSS) metric, which reflects the similarity between predicted and human scanpaths. This transfer reduces computational costs by 92.29% in terms of GFLOPs and 31.23% in terms of trainable parameters.

[90] A Dynamic Transformer Network for Vehicle Detection

Chunwei Tian,Kai Liu,Bob Zhang,Zhixiang Huang,Chia-Wen Lin,David Zhang

Main category: cs.CV

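TL;DR: Proposes a dynamic Transformer network (DTNet) that uses dynamic convolution and a mixed attention mechanism to make vehicle detection more adaptive to differences in lighting and occlusion.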

Details

Motivation: Existing deep-network vehicle detection algorithms are limited because they ignore differences in lighting and occlusion, so detectors need to become more adaptive. Method: DTNet uses dynamic convolution to generate weights, a mixed attention mechanism (channel attention plus Transformer) to strengthen relations between pieces of information, and spatial location information to refine structural information. Result: Experiments show that DTNet is competitive for vehicle detection. Conclusion: DTNet improves the adaptability of vehicle detection through dynamic weights and attention mechanisms; the code is open source. Abstract: Stable consumer electronic systems can assist traffic better. Good traffic consumer electronic systems require collaborative work between traffic algorithms and hardware. However, the performance of popular traffic algorithms that include deep-network vehicle detection methods is limited, because these methods learn data relations rather than the differences caused by varying lighting and occlusion. In this paper, we present a dynamic Transformer network for vehicle detection (DTNet). DTNet utilizes a dynamic convolution to guide a deep network to dynamically generate weights, enhancing the adaptability of the obtained detector. Taking the relations between different kinds of information into account, a mixed attention mechanism based on channel attention and a Transformer is exploited to strengthen the relations among channels and pixels and extract more salient information for vehicle detection. To overcome the drawback of differences within an image, a translation-variant convolution relies on spatial location information to refine the obtained structural information for vehicle detection. Experimental results illustrate that our DTNet is competitive for vehicle detection. Code of the proposed DTNet can be obtained at https://github.com/hellloxiaotian/DTNet.

[91] FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts

Tongyuan Bai,Wangyuanfan Bai,Dong Chen,Tieru Wu,Manyi Li,Rui Ma

Main category: cs.CV

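TL;DR: FreeScene is a user-friendly framework for 3D indoor scene synthesis that accepts free-form user input (such as text descriptions or reference images) and uses a VLM-based Graph Designer and the MG-DiT model to generate high-quality, controllable scenes.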

Details

Motivation: Existing 3D indoor scene synthesis methods offer either coarse control (language-based) or control that requires a cumbersome graph design process (graph-based), and none balances convenience with fine-grained control. Method: FreeScene takes text and image inputs, converts them into a graph representation with a VLM-based Graph Designer, and then applies graph-aware denoising with the MG-DiT model to support multi-task scene generation. Result: Experiments show that FreeScene outperforms existing methods in generation quality and controllability, and applies to text-to-scene, graph-to-scene, and rearrangement tasks. Conclusion: FreeScene provides an efficient, user-friendly solution that unifies text-based and graph-based scene synthesis and has broad application potential. Abstract: Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow rough language-based control, which is convenient but lacks fine-grained scene customization, or employ graph-based control, which offers better controllability but demands considerable knowledge for the cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene synthesis. Specifically, FreeScene supports free-form user inputs including text description and/or reference images, allowing users to express versatile design intentions. The user inputs are adequately analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. Our MG-DiT not only excels at preserving graph structure but also offers broad applicability to various tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph-based scene synthesis, outperforming state-of-the-art methods in terms of both generation quality and controllability in a range of applications.

[92] SAMJ: Fast Image Annotation on ImageJ/Fiji via Segment Anything Model

Carlos Garcia-Lopez-de-Haro,Caterina Fuster-Barcelo,Curtis T. Rueden,Jonathan Heras,Vladimir Ulman,Daniel Franco-Barranco,Adrian Ines,Kevin W. Eliceiri,Jean-Christophe Olivo-Marin,Jean-Yves Tinevez,Daniel Sage,Arrate Munoz-Barrutia

Main category: cs.CV

TL;DR: SAMJ is an ImageJ/Fiji plugin based on the Segment Anything Model (SAM) that simplifies mask annotation in biomedical image analysis.

Details

Motivation: Mask annotation in AI-driven biomedical image analysis is time-consuming and labor-intensive, so a more efficient solution is needed. Method: Develops the SAMJ plugin, which uses SAM for interactive annotation and supports one-click installation and real-time object delineation. Result: SAMJ simplifies and accelerates the creation of labeled datasets and is suited to large scientific images. Conclusion: SAMJ gives users an easy-to-use, efficient annotation tool that addresses the annotation bottleneck in biomedical image analysis. Abstract: Mask annotation remains a significant bottleneck in AI-driven biomedical image analysis due to its labor-intensive nature. To address this challenge, we introduce SAMJ, a user-friendly ImageJ/Fiji plugin leveraging the Segment Anything Model (SAM). SAMJ enables seamless, interactive annotations with one-click installation on standard computers. Designed for real-time object delineation in large scientific images, SAMJ is an easy-to-use solution that simplifies and accelerates the creation of labeled image datasets.

[93] Automated Measurement of Optic Nerve Sheath Diameter Using Ocular Ultrasound Video

Renxing Li,Weiyi Tang,Peiqi Li,Qiming Huang,Jiayuan She,Shengkai Li,Haoran Xu,Yeyun Wan,Jing Liu,Hailong Fu,Xiang Li,Jiangang Chen

Main category: cs.CV

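TL;DR: Proposes a method that automatically identifies the optimal frame in an ultrasound video sequence for measuring optic nerve sheath diameter (ONSD), combining KCF tracking with SLIC segmentation and measuring with a GMM and a KL-divergence-based method; the results agree closely with expert measurements.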

Details

Motivation: ONSD correlates linearly with intracranial pressure (ICP), but manual measurement depends on operator experience, so an automated method is needed to improve accuracy and efficiency. Method: Uses the KCF tracking algorithm and the SLIC segmentation algorithm to identify the optimal frame automatically, then measures ONSD with a GMM combined with a KL-divergence-based method. Result: Compared with expert measurements, the method achieves a mean error of 0.04, a mean squared deviation of 0.054, and an intraclass correlation coefficient of 0.782. Conclusion: The method is highly automated, approaches expert-level accuracy, and shows potential for clinical application. Abstract: Objective. Elevated intracranial pressure (ICP) is recognized as a biomarker of secondary brain injury, with a significant linear correlation observed between optic nerve sheath diameter (ONSD) and ICP. Frequent monitoring of ONSD could effectively support dynamic evaluation of ICP. However, ONSD measurement is heavily reliant on the operator's experience and skill, particularly in manually selecting the optimal frame from ultrasound sequences and measuring ONSD. Approach. This paper presents a novel method to automatically identify the optimal frame from video sequences for ONSD measurement by employing the Kernel Correlation Filter (KCF) tracking algorithm and Simple Linear Iterative Clustering (SLIC) segmentation algorithm. The optic nerve sheath is mapped and measured using a Gaussian Mixture Model (GMM) combined with a KL-divergence-based method. Results. When compared with the average measurements of two expert clinicians, the proposed method achieved a mean error, mean squared deviation, and intraclass correlation coefficient (ICC) of 0.04, 0.054, and 0.782, respectively. Significance. The findings suggest that this method provides highly accurate automated ONSD measurements, showing potential for clinical application.
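
The abstract does not say how the KL divergence is applied to the GMM components, so the following is only a generic building block: the closed-form KL divergence between two univariate Gaussians, KL(p‖q) = ln(σ_q/σ_p) + (σ_p² + (μ_p − μ_q)²)/(2σ_q²) − 1/2. The function name and the idea of comparing two fitted intensity components are assumptions made for illustration.

```python
import math


def kl_gaussian(mu_p: float, sigma_p: float, mu_q: float, sigma_q: float) -> float:
    """Closed-form KL(p || q) for univariate Gaussians p and q.

    sigma_p and sigma_q are standard deviations and must be positive.
    """
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


# Two hypothetical components: identical ones diverge by 0, and the
# divergence grows as the means separate.
same = kl_gaussian(0.0, 1.0, 0.0, 1.0)
shifted = kl_gaussian(0.0, 1.0, 1.0, 1.0)
```

Note that KL divergence is asymmetric, so `kl_gaussian(a, b, c, d)` and `kl_gaussian(c, d, a, b)` generally differ; any pipeline built on it has to fix a direction.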

[94] Random Registers for Cross-Domain Few-Shot Learning

Shuai Yi,Yixiong Zou,Yuhua Li,Ruixuan Li

Main category: cs.CV

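TL;DR: The study finds that in cross-domain few-shot learning, prompt tuning of a ViT can hurt generalization to target domains, whereas random registers improve it. Based on an analysis of this effect, the authors propose a new method built on random registers that significantly improves performance.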

Details

Motivation: To explore the transferability of Vision Transformers in cross-domain few-shot learning, in particular how prompt tuning and random registers affect performance. Method: Experiments show that random registers outperform prompt tuning; the authors then propose adding random registers on the semantic regions of image tokens to strengthen the perturbation of attention. Result: The method is validated on four benchmarks and achieves state-of-the-art performance. Conclusion: Random registers perturb attention and help the model find a flat minimum, which improves transferability in cross-domain few-shot learning. Abstract: Cross-domain few-shot learning (CDFSL) aims to transfer knowledge from a data-sufficient source domain to data-scarce target domains. Although Vision Transformer (ViT) has shown superior capability in many vision tasks, its transferability against huge domain gaps in CDFSL is still under-explored. In this paper, we find an intriguing phenomenon: during the source-domain training, prompt tuning, as a common way to train ViT, could be harmful for the generalization of ViT in target domains, but setting the prompts to random noise (i.e., random registers) could consistently improve target-domain performance. We then delve into this phenomenon for an interpretation. We find that learnable prompts capture domain information during the training on the source dataset, which views irrelevant visual patterns as vital cues for recognition. This can be viewed as a kind of overfitting and increases the sharpness of the loss landscapes. In contrast, random registers are essentially a novel way of perturbing attention for the sharpness-aware minimization, which helps the model find a flattened minimum in loss landscapes, increasing the transferability. Based on this phenomenon and interpretation, we further propose a simple but effective approach for CDFSL to enhance the perturbation on attention maps by adding random registers on the semantic regions of image tokens, improving the effectiveness and efficiency of random registers. Extensive experiments on four benchmarks validate our rationale and state-of-the-art performance. Codes and models are available at https://github.com/shuaiyi308/REAP.
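
The basic notion of a random register, as opposed to a learnable prompt, can be sketched as follows. This is a simplified illustration rather than the released method: it only appends freshly sampled Gaussian-noise tokens to a token sequence and does not model the paper's placement of registers on semantic regions. The function name and the unit-variance noise are assumptions.

```python
import random


def append_random_registers(tokens, num_registers, rng):
    """Return tokens followed by num_registers noise tokens of the same width.

    Unlike learnable prompts, the extra tokens are re-sampled on every call
    and carry no trained parameters.
    """
    dim = len(tokens[0])
    registers = [
        [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_registers)
    ]
    return tokens + registers


# Two hypothetical 2-dimensional patch tokens plus three random registers.
patch_tokens = [[0.1, 0.2], [0.3, 0.4]]
extended = append_random_registers(patch_tokens, 3, random.Random(0))
```

Because the registers take part in self-attention but change from step to step, they act as a perturbation on the attention maps, which is the mechanism the abstract links to flatter minima.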

[95] Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments

Di Wen,Lei Qi,Kunyu Peng,Kailun Yang,Fei Teng,Ao Luo,Jia Fu,Yufan Chen,Ruiping Liu,Yitian Shi,M. Saquib Sarfraz,Rainer Stiefelhagen

Main category: cs.CV

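TL;DR: The paper introduces MicroG-4M, a dataset for spatio-temporal and semantic understanding of human activities in microgravity, filling a gap left by existing datasets.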

Details

Motivation: Existing video understanding datasets are mostly recorded under Earth's gravity, whereas microgravity changes human motion and visual semantics, which poses a challenge for safety-critical space applications. Method: Builds the dataset from real space missions and cinematic simulations; it contains 4,759 video clips covering 50 actions, 1,238 context-rich captions, and more than 7,000 question-answer pairs. Result: The dataset supports three core tasks (multi-label action recognition, temporal video captioning, and visual question answering), and baselines are established with state-of-the-art models. Conclusion: MicroG-4M is the first benchmark for video understanding in microgravity and advances research on domain robustness. Abstract: Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/LEI-QI-233/HAR-in-Space.

[96] PBR-SR: Mesh PBR Texture Super Resolution from 2D Image Priors

Yujin Chen,Yinyu Nie,Benjamin Ummenhofer,Reiner Birkl,Michael Paulitsch,Matthias Nießner

Main category: cs.CV

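TL;DR: PBR-SR is a zero-shot method that uses iterative optimization and prior constraints to produce high-quality, high-resolution PBR textures from low-resolution ones, with no additional training or data.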

Details

Motivation: To address the view inconsistency and lighting sensitivity common in view-based super-resolution while keeping textures faithful to the low-resolution input. Method: Uses a pretrained natural-image super-resolution model and iteratively minimizes the deviation between super-resolution priors and differentiable renderings, combined with 2D prior constraints across multi-view renderings and identity constraints in the texture domain. Result: PBR-SR produces high-quality PBR textures for both artist-designed and AI-generated meshes, outperforming direct application of super-resolution models and prior texture optimization methods. Conclusion: PBR-SR is an efficient method that needs no additional training, produces high-fidelity PBR textures, and supports advanced applications such as relighting. Abstract: We present PBR-SR, a novel method for physically based rendering (PBR) texture super resolution (SR). It outputs high-resolution, high-quality PBR textures from low-resolution (LR) PBR input in a zero-shot manner. PBR-SR leverages an off-the-shelf super-resolution model trained on natural images, and iteratively minimizes the deviations between super-resolution priors and differentiable renderings. These enhancements are then back-projected into the PBR map space in a differentiable manner to produce refined, high-resolution textures. To mitigate view inconsistencies and lighting sensitivity, which are common in view-based super-resolution, our method applies 2D prior constraints across multi-view renderings, iteratively refining the shared, upscaled textures. In parallel, we incorporate identity constraints directly in the PBR texture domain to ensure the upscaled textures remain faithful to the LR input. PBR-SR operates without any additional training or data requirements, relying entirely on pretrained image priors. We demonstrate that our approach produces high-fidelity PBR textures for both artist-designed and AI-generated meshes, outperforming both direct application of SR models and prior texture optimization methods. Our results show high-quality outputs in both PBR and rendering evaluations, supporting advanced applications such as relighting.

[97] METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding

Mengyue Wang,Shuo Chen,Kristian Kersting,Volker Tresp,Yunpu Ma

Main category: cs.CV

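TL;DR: METok is a training-free, multi-stage, event-based token compression framework that accelerates inference in video large language models (VLLMs) while preserving accuracy.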

Details

Motivation: Processing long videos is computationally demanding and the visual data is redundant, so an efficient compression method is needed. Method: METok removes redundant visual tokens progressively in three stages: event-aware compression, hierarchical token pruning based on semantic alignment and event importance, and KV cache optimization in the decoding stage. Result: Experiments show that METok achieves a favorable trade-off between efficiency and accuracy; for example, on LongVA-7B it cuts FLOPs by 80.6% and KV cache memory by 93.5%. Conclusion: METok offers an efficient and accurate solution for long-video processing. Abstract: Recent advances in Video Large Language Models (VLLMs) have significantly enhanced their ability to understand video content. Nonetheless, processing long videos remains challenging due to high computational demands and the redundancy present in the visual data. In this work, we propose METok, a training-free, Multi-stage Event-based Token compression framework designed to accelerate VLLMs' inference while preserving accuracy. METok progressively eliminates redundant visual tokens across three critical stages: (1) event-aware compression during vision encoding, (2) hierarchical token pruning in the prefilling stage based on semantic alignment and event importance, and (3) a decoding-stage KV Cache optimization that further reduces memory consumption. Our experiments on diverse video benchmarks demonstrate that METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings, all while maintaining comparable or even superior accuracy.
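
Token pruning by an importance score, the general operation behind stage (2), can be sketched in a few lines. This is a generic top-k pruning sketch, not METok itself: how METok derives its scores from semantic alignment and event importance is not specified here, so the `scores` list and the `keep_ratio` parameter are placeholders.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    keep = max(1, int(len(tokens) * keep_ratio))
    # Indices of the `keep` largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    # Restore temporal order so downstream layers see a consistent sequence.
    return [tokens[i] for i in sorted(top)]


# Six hypothetical visual tokens with made-up importance scores.
kept = prune_tokens(
    ["a", "b", "c", "d", "e", "f"], [0.1, 0.9, 0.3, 0.8, 0.2, 0.7], keep_ratio=0.5
)
```

Sorting the surviving indices before gathering keeps the tokens in their original temporal order, which matters for video inputs.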

[98] Learning Pyramid-structured Long-range Dependencies for 3D Human Pose Estimation

Mingjie Wei,Xuemei Xie,Yutong Zhong,Guangming Shi

Main category: cs.CV

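TL;DR: The paper proposes a Pyramid Graph Attention (PGA) module and a Pyramid Graph Transformer (PGFormer) for 3D human pose estimation that capture long-range dependencies across scales, achieving lower error with a smaller model.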

Details

Motivation: Existing methods learn dependencies between non-linked body parts by making networks deeper, which introduces uncorrelated noise and increases model size, so a more effective way to model long-range dependencies is needed. Method: Uses a pyramid structure to capture correlations across joints and groups, proposes the PGA module to compute correlations between multi-scale information in parallel, and combines it with graph convolution modules to build PGFormer. Result: On the Human3.6M and MPI-INF-3DHP datasets, PGFormer achieves lower error and a smaller model size than existing methods. Conclusion: With its pyramid structure and multi-scale attention, PGFormer models long-range dependencies effectively and offers a lightweight, efficient solution for 3D human pose estimation. Abstract: Action coordination in human structure is indispensable for the spatial constraints of 2D joints to recover 3D pose. Usually, action coordination is represented as a long-range dependence among body parts. However, there are two main challenges in modeling long-range dependencies. First, joints should not only be constrained by other individual joints but also be modulated by the body parts. Second, existing methods make networks deeper to learn dependencies between non-linked parts. They introduce uncorrelated noise and increase the model size. In this paper, we utilize a pyramid structure to better learn potential long-range dependencies. It can capture the correlation across joints and groups, which complements the context of the human sub-structure. In an effective cross-scale way, it captures the pyramid-structured long-range dependence. Specifically, we propose a novel Pyramid Graph Attention (PGA) module to capture long-range cross-scale dependencies. It concatenates information from various scales into a compact sequence, and then computes the correlation between scales in parallel. Combining PGA with graph convolution modules, we develop a Pyramid Graph Transformer (PGFormer) for 3D human pose estimation, which is a lightweight multi-scale transformer architecture. It encapsulates human sub-structures into self-attention by pooling. Extensive experiments show that our approach achieves lower error and smaller model size than state-of-the-art methods on Human3.6M and MPI-INF-3DHP datasets. The code is available at https://github.com/MingjieWe/PGFormer.
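
The step of pooling joints into coarser sub-structures and concatenating the scales into one sequence can be sketched as below. This is an assumption-laden illustration: mean pooling, the three-level pyramid (joints, groups, whole body), and the joint-to-group assignment are all hypothetical, and the attention computation over the sequence is omitted.

```python
def mean_pool(vectors):
    """Average a list of equal-length feature vectors element-wise."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def build_pyramid_sequence(joints, groups):
    """Concatenate joint-level, group-level, and body-level tokens.

    `groups` lists the joint indices belonging to each (hypothetical) body part.
    """
    group_tokens = [mean_pool([joints[i] for i in idx]) for idx in groups]
    body_token = mean_pool(joints)
    return joints + group_tokens + [body_token]


# Four toy joints with 2-D features, split into two made-up body parts.
joints = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
sequence = build_pyramid_sequence(joints, groups=[[0, 1], [2, 3]])
```

Feeding such a sequence to self-attention lets a joint token attend directly to a group or body token, which is one way a single layer can relate non-linked parts without stacking more depth.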

[99] Hierarchical Self-Prompting SAM: A Prompt-Free Medical Image Segmentation Framework

Mengmeng Zhang,Xingyuan Dai,Yicheng Sun,Jing Wang,Yueyang Yao,Xiaoyan Gong,Fuze Cong,Feiyue Wang,Yisheng Lv

Main category: cs.CV

TL;DR: HSP-SAM is a self-prompting framework for prompt-free medical image segmentation that outperforms existing methods.

Details

Motivation: SAM's reliance on prompts limits its use in medical imaging, and existing methods struggle to remove that dependency. Method: Proposes the hierarchical self-prompting framework HSP-SAM, the first to learn abstract prompts during the self-prompting process. Result: The framework performs well across multiple medical imaging modalities and generalizes strongly, with improvements of up to 14.04%. Conclusion: Abstract prompts carry richer semantic information than positional prompts, which improves the model's robustness and generalization. Abstract: Although the Segment Anything Model (SAM) is highly effective in natural image segmentation, it depends on prompts, which limits its applicability to medical imaging where manual prompts are often unavailable. Existing efforts to fine-tune SAM for medical segmentation typically struggle to remove this dependency. We propose Hierarchical Self-Prompting SAM (HSP-SAM), a novel self-prompting framework that enables SAM to achieve strong performance in prompt-free medical image segmentation. Unlike previous self-prompting methods that remain limited to positional prompts similar to vanilla SAM, we are the first to introduce learning abstract prompts during the self-prompting process. This simple and intuitive self-prompting framework achieves superior performance on classic segmentation tasks such as polyp and skin lesion segmentation, while maintaining robustness across diverse medical imaging modalities. Furthermore, it exhibits strong generalization to unseen datasets, achieving improvements of up to 14.04% over previous state-of-the-art methods on some challenging benchmarks. These results suggest that abstract prompts encapsulate richer and higher-dimensional semantic information compared to positional prompts, thereby enhancing the model's robustness and generalization performance. All models and codes will be released upon acceptance.

[100] Enhancing Abnormality Identification: Robust Out-of-Distribution Strategies for Deepfake Detection

Luca Maiano,Fabrizio Casadei,Irene Amerini

Main category: cs.CV

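TL;DR: The paper proposes two new OOD detection methods to address the open-set challenge in deepfake detection, and experiments show that they outperform existing techniques.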

Details

Motivation: Deepfake detectors generalize poorly in open-set scenarios, and existing methods struggle to keep up with constantly evolving generative models. Method: Proposes two OOD detection approaches: one based on reconstructing the input image, the other incorporating an attention mechanism. Result: Experiments show that the approaches outperform existing techniques and rank among the top performers on the benchmark. Conclusion: The approaches show potential for dynamic real-world applications and provide a robust solution for deepfake detection. Abstract: Detecting deepfakes has become a critical challenge in Computer Vision and Artificial Intelligence. Despite significant progress in detection techniques, generalizing them to open-set scenarios continues to be a persistent difficulty. Neural networks are often trained on the closed-world assumption, but with new generative models constantly evolving, it is inevitable to encounter data generated by models that are not part of the training distribution. To address these challenges, in this paper, we propose two novel Out-Of-Distribution (OOD) detection approaches. The first approach is trained to reconstruct the input image, while the second incorporates an attention mechanism for detecting OODs. Our experiments validate the effectiveness of the proposed approaches compared to existing state-of-the-art techniques. Our methods achieve promising results in deepfake detection and rank among the top-performing configurations on the benchmark, demonstrating their potential for robust, adaptable solutions in dynamic, real-world applications.

[101] MVTD: A Benchmark Dataset for Maritime Visual Object Tracking

Ahsan Baidar Bakht,Muhayy Ud Din,Sajid Javed,Irfan Hussain

Main category: cs.CV

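TL;DR: The paper introduces MVTD, a dataset designed for maritime visual object tracking, with 182 video sequences and about 150,000 frames covering four object classes. Experiments show that existing algorithms perform worse on this dataset, but fine-tuning improves their performance significantly.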

Details

Motivation: Visual object tracking in maritime environments faces distinctive challenges (such as water reflections and low-contrast targets) that existing general-purpose datasets do not cover, so a domain-specific dataset is needed. Method: Builds the MVTD dataset of high-resolution video sequences across diverse maritime scenarios, evaluates 14 state-of-the-art tracking algorithms on it, and runs fine-tuning experiments. Result: Existing algorithms degrade substantially on MVTD but improve markedly after fine-tuning, demonstrating the effectiveness of domain adaptation and transfer learning. Conclusion: MVTD fills a gap in maritime visual tracking and provides an important benchmark for research and algorithm development. Abstract: Visual Object Tracking (VOT) is a fundamental task with widespread applications in autonomous navigation, surveillance, and maritime robotics. Despite significant advances in generic object tracking, maritime environments continue to present unique challenges, including specular water reflections, low-contrast targets, dynamically changing backgrounds, and frequent occlusions. These complexities significantly degrade the performance of state-of-the-art tracking algorithms, highlighting the need for domain-specific datasets. To address this gap, we introduce the Maritime Visual Tracking Dataset (MVTD), a comprehensive and publicly available benchmark specifically designed for maritime VOT. MVTD comprises 182 high-resolution video sequences, totaling approximately 150,000 frames, and includes four representative object classes: boat, ship, sailboat, and unmanned surface vehicle (USV). The dataset captures a diverse range of operational conditions and maritime scenarios, reflecting the real-world complexities of maritime environments. We evaluated 14 recent SOTA tracking algorithms on the MVTD benchmark and observed substantial performance degradation compared to their performance on general-purpose datasets. However, when fine-tuned on MVTD, these models demonstrate significant performance gains, underscoring the effectiveness of domain adaptation and the importance of transfer learning in specialized tracking contexts. The MVTD dataset fills a critical gap in the visual tracking community by providing a realistic and challenging benchmark for maritime scenarios. The dataset and source code are available at https://github.com/AhsanBaidar/MVTD.

[102] Pan-Arctic Permafrost Landform and Human-built Infrastructure Feature Detection with Vision Transformers and Location Embeddings

Amal S. Perera,David Fernandez,Chandi Witharana,Elias Manos,Michael Pimenta,Anna K. Liljedahl,Ingmar Nitze,Yili Yang,Todd Nicholson,Chia-Yu Hsu,Wenwen Li,Guido Grosse

Main category: cs.CV

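TL;DR: The paper investigates Vision Transformers (ViTs) combined with geospatial location embeddings for Arctic remote sensing, aiming to improve detection of permafrost landforms, thaw disturbances, and human-built infrastructure.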

Details

Motivation: High-resolution remote sensing data for the Arctic is enormous in volume, and conventional CNN models are limited in capturing long-range dependencies and global context, whereas ViTs address these issues through attention mechanisms and self-supervised learning. Method: Integrates geospatial location embeddings into ViTs and evaluates them on detecting ice-wedge polygons (IWP), retrogressive thaw slumps (RTS), and human-built infrastructure. Result: ViTs with location embeddings outperform CNN models on two of the three tasks; for example, the F1 score for RTS detection rises from 0.84 to 0.92. Conclusion: ViTs with spatial awareness show potential for Arctic remote sensing and suggest a way to handle diverse spectral characteristics and adaptation across regions. Abstract: Accurate mapping of permafrost landforms, thaw disturbances, and human-built infrastructure at pan-Arctic scale using sub-meter satellite imagery is increasingly critical. Handling petabyte-scale image data requires high-performance computing and robust feature detection models. While convolutional neural network (CNN)-based deep learning approaches are widely used for remote sensing (RS), Vision Transformers (ViTs), mirroring the success of transformer-based large language models, offer advantages in capturing long-range dependencies and global context via attention mechanisms. ViTs support pretraining via self-supervised learning, addressing the common limitation of labeled data in Arctic feature detection, and outperform CNNs on benchmark datasets. The Arctic also poses challenges for model generalization, especially when features with the same semantic class exhibit diverse spectral characteristics. To address these issues for Arctic feature detection, we integrate geospatial location embeddings into ViTs to improve adaptation across regions. This work investigates: (1) the suitability of pre-trained ViTs as feature extractors for high-resolution Arctic remote sensing tasks, and (2) the benefit of combining image and location embeddings. Using previously published datasets for Arctic feature detection, we evaluate our models on three tasks: detecting ice-wedge polygons (IWP), retrogressive thaw slumps (RTS), and human-built infrastructure. We empirically explore multiple configurations to fuse image embeddings and location embeddings. Results show that ViTs with location embeddings outperform prior CNN-based models on two of the three tasks, including an F1 score increase from 0.84 to 0.92 for RTS detection, demonstrating the potential of transformer-based models with spatial awareness for Arctic RS applications.
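
One simple way to combine an image embedding with a location embedding is concatenation, sketched below. The abstract says several fusion configurations were explored but does not list them, so both the sinusoidal latitude/longitude encoding and the concatenation are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math


def location_embedding(lat_deg, lon_deg, num_freqs):
    """Encode latitude/longitude with sin/cos features at doubling frequencies."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    feats = []
    for k in range(num_freqs):
        f = 2 ** k
        feats += [math.sin(f * lat), math.cos(f * lat),
                  math.sin(f * lon), math.cos(f * lon)]
    return feats


def fuse_by_concatenation(image_emb, loc_emb):
    """Fuse the two embeddings by appending one to the other."""
    return image_emb + loc_emb


# A made-up 3-dimensional image embedding for a tile at 70N, 150W.
loc = location_embedding(70.0, -150.0, num_freqs=2)
fused = fuse_by_concatenation([0.5, -0.2, 0.1], loc)
```

A classifier head trained on `fused` can then condition its decision on where the tile lies, which is the kind of regional adaptation the abstract motivates.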

[103] NTIRE 2025 XGC Quality Assessment Challenge: Methods and Results

Xiaohong Liu,Xiongkuo Min,Qiang Hu,Xiaoyun Zhang,Jie Guo,Guangtao Zhai,Shushi Wang,Yingjie Zhou,Lu Liu,Jingxin Li,Liu Yang,Farong Wen,Li Xu,Yanwei Jiang,Xilei Zhu,Chunyi Li,Zicheng Zhang,Huiyu Duan,Xiele Wu,Yixuan Gao,Yuqin Cao,Jun Jia,Wei Sun,Jiezhang Cao,Radu Timofte,Baojun Li,Jiamian Huang,Dan Luo,Tao Liu,Weixia Zhang,Bingkun Zheng,Junlin Chen,Ruikai Zhou,Meiya Chen,Yu Wang,Hao Jiang,Xiantao Li,Yuxiang Jiang,Jun Tang,Yimeng Zhao,Bo Hu,Zelu Qi,Chaoyang Zhang,Fei Zhao,Ping Shi,Lingzhi Fu,Heng Cong,Shuai He,Rongyu Zhang,Jiarong He,Zongyao Hu,Wei Luo,Zihao Yu,Fengbin Guan,Yiting Lu,Xin Li,Zhibo Chen,Mengjing Su,Yi Wang,Tuo Chen,Chunxiao Li,Shuaiyu Zhao,Jiaxin Wen,Chuyi Lin,Sitong Liu,Ningxin Chu,Jing Wan,Yu Zhou,Baoying Chen,Jishen Zeng,Jiarui Liu,Xianjin Liu,Xin Chen,Lanzhi Zhou,Hangyu Li,You Han,Bibo Xiang,Zhenjie Liu,Jianzhang Lu,Jialin Gui,Renjie Lu,Shangfei Wang,Donghao Zhou,Jingyu Lin,Quanjian Song,Jiancheng Huang,Yufeng Yang,Changwei Wang,Shupeng Zhong,Yang Yang,Lihuo He,Jia Liu,Yuting Xing,Tida Fang,Yuchun Jin

Main category: cs.CV

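TL;DR: The NTIRE 2025 XGC challenge has three tracks (user-generated video, AI-generated video, and talking head); each track attracted many participants and produced methods that beat the baseline.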

Details

Motivation: To address a major challenge in video and talking head processing and to advance the related techniques. Method: The challenge has three tracks, each with its own dataset (FineVD-GC, Q-Eval-Video, THQA-NTIRE); participants submit models and fact sheets. Result: The participating teams in every track proposed methods that outperform the baseline, advancing the related fields. Conclusion: The NTIRE 2025 XGC challenge promoted progress in video and talking head processing. Abstract: This paper reports on the NTIRE 2025 XGC Quality Assessment Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The challenge addresses a major problem in the field of video and talking head processing. It is divided into three tracks: user-generated video, AI-generated video, and talking head. The user-generated video track uses FineVD-GC, which contains 6,284 user-generated videos. The track has a total of 125 registered participants. A total of 242 submissions were received in the development phase, and 136 in the test phase. Finally, 5 participating teams submitted their models and fact sheets. The AI-generated video track uses Q-Eval-Video, which contains 34,029 AI-Generated Videos (AIGVs) generated by 11 popular Text-to-Video (T2V) models. A total of 133 participants registered for this track. A total of 396 submissions were received in the development phase, and 226 in the test phase. Finally, 6 participating teams submitted their models and fact sheets. The talking head track uses THQA-NTIRE, which contains 12,247 2D and 3D talking heads. A total of 89 participants registered for this track. A total of 225 submissions were received in the development phase, and 118 in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Each participating team in every track proposed a method that outperforms the baseline, contributing to progress in the fields covered by the three tracks.

[104] GaRA-SAM: Robustifying Segment Anything Model with Gated-Rank Adaptation

Sohyun Lee,Yeho Kwon,Lukas Hoyer,Suha Kwak

Main category: cs.CV

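TL;DR: Proposes GaRA, a method that dynamically adjusts the rank of adapter weight matrices to make the Segment Anything Model (SAM) more robust to input degradations, significantly outperforming existing methods.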

Details

Motivation: Making SAM robust to input degradations is critical in high-stakes applications such as autonomous driving and robotics. Method: Introduces lightweight adapters (GaRA) that dynamically adjust the rank of their weight matrices, enabling fine-grained, input-aware robustification. Result: GaRA-SAM performs strongly on all robust segmentation benchmarks, improving IoU on the ACDC dataset by up to 21.3 percentage points. Conclusion: GaRA significantly improves robustness to input degradations without sacrificing SAM's generalization capability. Abstract: Improving robustness of the Segment Anything Model (SAM) to input degradations is critical for its deployment in high-stakes applications such as autonomous driving and robotics. Our approach to this challenge prioritizes three key aspects: first, parameter efficiency to maintain the inherent generalization capability of SAM; second, fine-grained and input-aware robustification to precisely address the input corruption; and third, adherence to standard training protocols for ease of training. To this end, we propose gated-rank adaptation (GaRA). GaRA introduces lightweight adapters into intermediate layers of the frozen SAM, where each adapter dynamically adjusts the effective rank of its weight matrix based on the input by selectively activating (rank-1) components of the matrix using a learned gating module. This adjustment enables fine-grained and input-aware robustification without compromising the generalization capability of SAM. Our model, GaRA-SAM, significantly outperforms prior work on all robust segmentation benchmarks. In particular, it surpasses the previous best IoU score by up to 21.3%p on ACDC, a challenging real corrupted image dataset.
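
The idea of selectively activating rank-1 components can be written as a gated sum, Δy = Σ_i g_i · b_i (a_i · x), where each pair (a_i, b_i) is one rank-1 component and g_i is its gate. The sketch below computes that sum in plain Python. It is not the GaRA implementation: in the paper the gates come from a learned, input-dependent gating module, whereas here they are passed in directly, and all names are hypothetical.

```python
def gated_rank_update(x, down_rows, up_cols, gates):
    """Compute sum_i gates[i] * up_cols[i] * dot(down_rows[i], x).

    Each (down_rows[i], up_cols[i]) pair is one rank-1 component of the
    adapter; a gate of 0 switches that component off, lowering the
    effective rank of the update.
    """
    out = [0.0] * len(up_cols[0])
    for a, b, g in zip(down_rows, up_cols, gates):
        if g == 0.0:
            continue  # inactive component contributes nothing
        coeff = g * sum(ai * xi for ai, xi in zip(a, x))
        for j, bj in enumerate(b):
            out[j] += coeff * bj
    return out


# Toy adapter with two rank-1 components and a 2-D input.
x = [1.0, 2.0]
down_rows = [[1.0, 0.0], [0.0, 1.0]]
up_cols = [[1.0, 1.0], [1.0, -1.0]]
rank_one = gated_rank_update(x, down_rows, up_cols, gates=[1.0, 0.0])
rank_two = gated_rank_update(x, down_rows, up_cols, gates=[1.0, 1.0])
```

With only the first gate open the update is rank 1; opening both gates uses the full rank-2 adapter, so the gate pattern controls the effective rank per input.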

[105] OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis

Jiewen Hu,Leena Mathur,Paul Pu Liang,Louis-Philippe Morency

Main category: cs.CV

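TL;DR: OpenFace 3.0 is an open-source toolkit for facial behavior analysis, covering facial landmark detection, action unit detection, eye-gaze estimation, and emotion recognition. Trained with a multi-task architecture, it improves prediction performance, speed, and memory efficiency.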

Details

Motivation: Interest in facial behavior analysis systems has grown across computing communities in recent years, and OpenFace 3.0 aims to provide a lightweight, efficient, and easy-to-use toolkit. Method: A unified model trained with a multi-task architecture across diverse populations, head poses, lighting conditions, and video resolutions. Result: OpenFace 3.0 improves on similar toolkits in prediction performance, inference speed, and memory efficiency, and runs in real time. Conclusion: OpenFace 3.0 is an efficient, easy-to-use open-source tool for research that supports community contributions. Abstract: In recent years, there has been increasing interest in automatic facial behavior analysis systems from computing communities such as vision, multimodal interaction, robotics, and affective computing. Building upon the widespread utility of prior open-source facial analysis systems, we introduce OpenFace 3.0, an open-source toolkit capable of facial landmark detection, facial action unit detection, eye-gaze estimation, and facial emotion recognition. OpenFace 3.0 contributes a lightweight unified model for facial analysis, trained with a multi-task architecture across diverse populations, head poses, lighting conditions, video resolutions, and facial analysis tasks. By leveraging the benefits of parameter sharing through a unified model and training paradigm, OpenFace 3.0 exhibits improvements in prediction performance, inference speed, and memory efficiency over similar toolkits and rivals state-of-the-art models. OpenFace 3.0 can be installed and run with a single line of code and operate in real-time without specialized hardware. OpenFace 3.0 code for training models and running the system is freely available for research purposes and supports contributions from the community.

[106] Dense Match Summarization for Faster Two-view Estimation

Jonathan Astermark,Anders Heyden,Viktor Larsson

Main category: cs.CV

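TL;DR: Proposes an efficient match summarization scheme that substantially speeds up two-view relative pose estimation from dense matches while keeping accuracy comparable to using the full set of dense matches.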

Details

Motivation: Dense matching improves the accuracy and robustness of pose estimation, but the large number of matches significantly increases the runtime of robust estimation in RANSAC. Method: Proposes an efficient match summarization scheme that reduces the number of matches while preserving accuracy. Result: Validated on standard benchmark datasets, the method runs 10-100x faster than using the full set of dense matches at comparable accuracy. Conclusion: The method addresses the computational cost of dense matching and makes real-time use more feasible. Abstract: In this paper, we speed up robust two-view relative pose from dense correspondences. Previous work has shown that dense matchers can significantly improve both accuracy and robustness in the resulting pose. However, the large number of matches comes with a significantly increased runtime during robust estimation in RANSAC. To avoid this, we propose an efficient match summarization scheme which provides comparable accuracy to using the full set of dense matches, while having 10-100x faster runtime. We validate our approach on standard benchmark datasets together with multiple state-of-the-art dense matchers.

[107] FlySearch: Exploring how vision-language models explore

Adam Pardyl,Dominik Matuszek,Mateusz Przebieracz,Marek Cygan,Bartosz Zieliński,Maciej Wołczyk

Main category: cs.CV

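TL;DR: The paper examines how vision-language models (VLMs) perform in complex, unstructured environments and finds that state-of-the-art VLMs cannot reliably solve even simple exploration tasks, with a significant gap to human performance.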

Details

Motivation: Investigates whether VLMs operate effectively in realistic, complex environments, addressing the lack of evaluation on active exploration tasks. Method: Introduces FlySearch, a 3D, outdoor, photorealistic environment for testing VLMs' search and navigation abilities in complex scenes, with three sets of scenarios of varying difficulty. Result: VLMs are unreliable even on the simplest tasks, and the gap to human performance widens as difficulty increases; the main causes identified are hallucination, context misunderstanding, and task planning failures. Conclusion: Some of these problems can be addressed by finetuning; the benchmark, scenarios, and codebase are publicly released. Abstract: The real world is messy and unstructured. Uncovering critical information often requires active, goal-driven exploration. It remains to be seen whether Vision-Language Models (VLMs), which recently emerged as a popular zero-shot tool in many difficult tasks, can operate effectively in such conditions. In this paper, we answer this question by introducing FlySearch, a 3D, outdoor, photorealistic environment for searching and navigating to objects in complex scenes. We define three sets of scenarios with varying difficulty and observe that state-of-the-art VLMs cannot reliably solve even the simplest exploration tasks, with the gap to human performance increasing as the tasks get harder. We identify a set of central causes, ranging from vision hallucination, through context misunderstanding, to task planning failures, and we show that some of them can be addressed by finetuning. We publicly release the benchmark, scenarios, and the underlying codebase.

[108] Towards Auto-Annotation from Annotation Guidelines: A Benchmark through 3D LiDAR Detection

Yechi Ma,Wei Hua,Shu Kong

Main category: cs.CV

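TL;DR: The paper introduces AnnoGuide, a new benchmark for automatically annotating data directly from expert-defined annotation guidelines, reducing the need for manual labeling. Using the nuScenes dataset as a case study, it applies foundation models to multi-modal few-shot 3D detection and achieves significant performance improvements.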

Details

Motivation: Data annotation is a crucial but tedious step in machine learning applications, and it is costly and time-consuming. The work aims to generate annotations directly from annotation guidelines with automated methods, reducing manual effort. Method: Uses foundation models (FMs) for object detection and segmentation in RGB images, projects the 2D detections into 3D, and clusters LiDAR points to generate 3D annotations, progressively refining key components to improve performance. Result: 3D detection mAP rises from 12.1 to 21.9, yet the results show that AnnoGuide remains an open and challenging problem that calls for LiDAR-based foundation models. Conclusion: AnnoGuide offers a new direction for automated data annotation and demonstrates the potential of foundation models, but further research is needed to address LiDAR-related challenges. Abstract: A crucial yet under-appreciated prerequisite in machine learning solutions for real-world applications is data annotation: human annotators are hired to manually label data according to detailed, expert-crafted guidelines. This is often a laborious, tedious, and costly process. To study methods for facilitating data annotation, we introduce a new benchmark AnnoGuide: Auto-Annotation from Annotation Guidelines. It aims to evaluate automated methods for data annotation directly from expert-defined annotation guidelines, eliminating the need for manual labeling. As a case study, we repurpose the well-established nuScenes dataset, commonly used in autonomous driving research, which provides comprehensive annotation guidelines for labeling LiDAR point clouds with 3D cuboids across 18 object classes. These guidelines include a few visual examples and textual descriptions, but no labeled 3D cuboids in LiDAR data, making this a novel task of multi-modal few-shot 3D detection without 3D annotations. The advances of powerful foundation models (FMs) make AnnoGuide especially timely, as FMs offer promising tools to tackle its challenges. We employ a conceptually straightforward pipeline that (1) utilizes open-source FMs for object detection and segmentation in RGB images, (2) projects 2D detections into 3D using known camera poses, and (3) clusters LiDAR points within the frustum of each 2D detection to generate a 3D cuboid. Starting with a non-learned solution that leverages off-the-shelf FMs, we progressively refine key components and achieve significant performance improvements, boosting 3D detection mAP from 12.1 to 21.9!
Nevertheless, our results highlight that AnnoGuide remains an open and challenging problem, underscoring the urgent need for developing LiDAR-based FMs. We release our code and models at GitHub: https://annoguide.github.io/annoguide3Dbenchmark
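
Steps (2) and (3) of the pipeline described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the released code: it assumes LiDAR points already transformed into the camera frame, a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and it replaces the clustering step with a plain axis-aligned bounding box over the frustum points.

```python
def frustum_points(points_cam, box, fx, fy, cx, cy):
    """Keep 3D points (camera frame, z forward) whose pinhole projection
    falls inside the 2D detection box = (u_min, v_min, u_max, v_max)."""
    u_min, v_min, u_max, v_max = box
    kept = []
    for x, y, z in points_cam:
        if z <= 0:  # behind the camera, cannot be inside the frustum
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept


def axis_aligned_cuboid(points):
    """Return (min_corner, max_corner) of the points; a stand-in for the
    clustering and cuboid fitting that a real pipeline would perform."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))
```

A real system would also need to reject background points that fall inside the frustum (for example by clustering on depth) and to estimate cuboid orientation, both of which this sketch omits.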

[109] MIND: Material Interface Generation from UDFs for Non-Manifold Surface Reconstruction

Xuhui Chen,Fei Hou,Wencheng Wang,Hong Qin,Ying He

Main category: cs.CV

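TL;DR: Proposes MIND, a new algorithm that generates material interfaces directly from unsigned distance fields (UDFs), enabling non-manifold mesh extraction.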

Details

Motivation: Existing methods for extracting meshes from UDFs introduce topological artifacts and cannot handle non-manifold geometry. Method: Computes a multi-labeled global field and combines it with the UDF to construct material interfaces that support non-manifold mesh extraction. Result: Experiments show that MIND robustly handles complex non-manifold surfaces and significantly outperforms existing methods. Conclusion: MIND provides an effective solution for non-manifold mesh extraction from UDFs. Abstract: Unsigned distance fields (UDFs) are widely used in 3D deep learning due to their ability to represent shapes with arbitrary topology. While prior work has largely focused on learning UDFs from point clouds or multi-view images, extracting meshes from UDFs remains challenging, as the learned fields rarely attain exact zero distances. A common workaround is to reconstruct signed distance fields (SDFs) locally from UDFs to enable surface extraction via Marching Cubes. However, this often introduces topological artifacts such as holes or spurious components. Moreover, local SDFs are inherently incapable of representing non-manifold geometry, leading to complete failure in such cases. To address this gap, we propose MIND (Material Interface from Non-manifold Distance fields), a novel algorithm for generating material interfaces directly from UDFs, enabling non-manifold mesh extraction from a global perspective. The core of our method lies in deriving a meaningful spatial partitioning from the UDF, where the target surface emerges as the interface between distinct regions. We begin by computing a two-signed local field to distinguish the two sides of manifold patches, and then extend this to a multi-labeled global field capable of separating all sides of a non-manifold structure. By combining this multi-labeled field with the input UDF, we construct material interfaces that support non-manifold mesh extraction via a multi-labeled Marching Cubes algorithm. Extensive experiments on UDFs generated from diverse data sources, including point cloud reconstruction, multi-view reconstruction, and medial axis transforms, demonstrate that our approach robustly handles complex non-manifold surfaces and significantly outperforms existing methods.
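
The notion that the surface emerges as the interface between labeled regions can be shown on a toy 2D grid. The sketch below is only an analogy for the multi-labeled field: the function name and the 4-neighbour grid are assumptions made for illustration, and it does not implement the paper's field computation or a multi-labeled Marching Cubes.

```python
def interface_edges(labels):
    """List pairs of 4-neighbour cells whose region labels differ.

    labels is a 2D list of region ids. Each returned pair marks a piece
    of the interface between two distinct regions.
    """
    edges = []
    rows, cols = len(labels), len(labels[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
                r2, c2 = r + dr, c + dc
                if r2 < rows and c2 < cols and labels[r][c] != labels[r2][c2]:
                    edges.append(((r, c), (r2, c2)))
    return edges
```

With three labels meeting at one point, as in `[[0, 1], [2, 2]]`, three interface pieces share a junction. A two-sided (signed) field cannot express that configuration, which is the motivation for using more than two labels.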

[110] FORLA: Federated Object-centric Representation Learning with Slot Attention

Guiqiu Liao,Matjaz Jogan,Eric Eaton,Daniel A. Hashimoto

Main category: cs.CV

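TL;DR: FORLA is a federated learning framework that uses unsupervised slot attention to learn object-centric representations and feature adaptation across clients, outperforming centralized baselines.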

Details

Motivation: Learning efficient visual representations across heterogeneous unlabeled datasets is challenging: features must be jointly informative across clients while disentangling domain-specific factors. Method: Uses a shared feature adapter and a shared slot attention module, with a two-branch student-teacher architecture to optimize the adapter. Result: Outperforms centralized baselines on multiple real-world datasets and learns a compact, universal representation that generalizes across domains. Conclusion: Federated slot attention is an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts. Abstract: Learning efficient visual representations across heterogeneous unlabeled datasets remains a central challenge in federated learning. Effective federated representations require features that are jointly informative across clients while disentangling domain-specific factors without supervision. We introduce FORLA, a novel framework for federated object-centric representation learning and feature adaptation across clients using unsupervised slot attention. At the core of our method is a shared feature adapter, trained collaboratively across clients to adapt features from foundation models, and a shared slot attention module that learns to reconstruct the adapted features. To optimize this adapter, we design a two-branch student-teacher architecture. In each client, a student decoder learns to reconstruct full features from foundation models, while a teacher decoder reconstructs their adapted, low-dimensional counterpart. The shared slot attention module bridges cross-domain learning by aligning object-level representations across clients. Experiments in multiple real-world datasets show that our framework not only outperforms centralized baselines on object discovery but also learns a compact, universal representation that generalizes well across domains. This work highlights federated slot attention as an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts.

[111] HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation

Yicheng Xiao,Lin Song,Rui Yang,Cheng Cheng,Zunnan Xu,Zhaoyang Zhang,Yixiao Ge,Xiu Li,Ying Shan

Main category: cs.CV

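TL;DR: The paper proposes an efficient training paradigm for building a single transformer for unified multimodal understanding and generation. Using a multimodal warmup strategy and feature pre-scaling, it yields HaploOmni, a model with competitive performance at limited training cost.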

Details

Motivation: As language models advance, multimodal understanding and generation are moving from separated components to unified single-model frameworks, but cross-modal compatibility and training efficiency remain challenging. Method: Proposes a multimodal warmup strategy and feature pre-scaling, combined with multimodal AdaLN, to build HaploOmni. Result: With limited training cost, HaploOmni achieves competitive performance on multiple image and video understanding and generation benchmarks. Conclusion: HaploOmni shows that a unified single model can be efficient and competitive on multimodal tasks; the code will be open-sourced. Abstract: With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an efficient training paradigm to build a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy utilizing prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating the proposed technologies, we present HaploOmni, a new single multimodal transformer. With limited training costs, HaploOmni achieves competitive performance across multiple image and video understanding and generation benchmarks over advanced unified models. All code will be made public at https://github.com/Tencent/HaploVLM.

[112] Deep Learning for Retinal Degeneration Assessment: A Comprehensive Analysis of the MARIO AMD Progression Challenge

Rachid Zeghlache,Ikram Brahim,Pierre-Henri Conze,Mathieu Lamard,Mohammed El Amine Lazouni,Zineb Aziza Elaouaber,Leila Ryma Lazouni,Christopher Nielsen,Ahmad O. Ahsan,Matthias Wilms,Nils D. Forkert,Lovre Antonio Budimir,Ivana Matovinović,Donik Vršnak,Sven Lončarić,Philippe Zhang,Weili Jiang,Yihao Li,Yiding Hao,Markus Frohmann,Patrick Binder,Marcel Huber,Taha Emre,Teresa Finisterra Araújo,Marzieh Oghbaie,Hrvoje Bogunović,Amerens A. Bekkers,Nina M. van Liebergen,Hugo J. Kuijf,Abdul Qayyum,Moona Mazher,Steven A. Niederer,Alberto J. Beltrán-Carrero,Juan J. Gómez-Valverde,Javier Torresano-Rodríquez,Álvaro Caballero-Sastre,María J. Ledesma Carbayo,Yosuke Yamagishi,Yi Ding,Robin Peretzke,Alexandra Ertl,Maximilian Fischer,Jessica Kächele,Sofiane Zehar,Karim Boukli Hacene,Thomas Monfort,Béatrice Cochener,Mostafa El Habib Daho,Anas-Alexis Benyoussef,Gwenolé Quellec

Main category: cs.CV

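TL;DR: The MARIO challenge targets automated detection and monitoring of AMD from OCT images and evaluates algorithmic performance. Results show that AI matches physicians in measuring AMD progression but cannot yet predict future evolution.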

Details

Motivation: To advance automated detection and monitoring of AMD and to evaluate algorithmic performance on multi-modal datasets. Method: Uses datasets from France and Algeria and defines two tasks: classifying the evolution between consecutive OCT B-scans and predicting future AMD evolution. Result: AI performs as well as a physician in measuring AMD progression but is not yet able to predict future evolution. Conclusion: The MARIO challenge sets a benchmark for AMD monitoring; AI performs well on the first task, but its predictive ability still needs improvement. Abstract: The MARIO challenge, held at MICCAI 2024, focused on advancing the automated detection and monitoring of age-related macular degeneration (AMD) through the analysis of optical coherence tomography (OCT) images. Designed to evaluate algorithmic performance in detecting neovascular activity changes within AMD, the challenge incorporated unique multi-modal datasets. The primary dataset, sourced from Brest, France, was used by participating teams to train and test their models. The final ranking was determined based on performance on this dataset. An auxiliary dataset from Algeria was used post-challenge to evaluate population and device shifts from submitted solutions. Two tasks were involved in the MARIO challenge. The first one was the classification of evolution between two consecutive 2D OCT B-scans. The second one was the prediction of future AMD evolution over three months for patients undergoing anti-vascular endothelial growth factor (VEGF) therapy. Thirty-five teams participated, with the top 12 finalists presenting their methods. This paper outlines the challenge's structure, tasks, data characteristics, and winning methodologies, setting a benchmark for AMD monitoring using OCT, infrared imaging, and clinical data (such as the number of visits, age, gender, etc.). The results of this challenge indicate that artificial intelligence (AI) performs as well as a physician in measuring AMD progression (Task 1) but is not yet capable of predicting future evolution (Task 2).

[113] Astrophotography turbulence mitigation via generative models

Joonyeoup Kim,Yu Yuan,Xingguang Zhang,Xijun Wang,Stanley Chan

Main category: cs.CV

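TL;DR: AstroDiff leverages the high-quality generative priors and restoration capabilities of diffusion models to significantly improve the quality of astronomical images degraded by atmospheric turbulence.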

Details

Motivation: Astronomical images captured by ground-based telescopes are often degraded by atmospheric turbulence, and traditional approaches such as lucky imaging require intensive data acquisition and complex processing. Method: Proposes AstroDiff, which combines the generative priors and restoration capabilities of diffusion models to restore turbulence-degraded astronomical images. Result: Experiments show that AstroDiff outperforms existing learning-based methods in turbulence mitigation, with higher perceptual quality and better structural fidelity. Conclusion: AstroDiff provides an effective solution for turbulence mitigation in astronomical images; code and results are publicly available. Abstract: Photography is the cornerstone of modern astronomical and space research. However, most astronomical images captured by ground-based telescopes suffer from atmospheric turbulence, resulting in degraded imaging quality. While multi-frame strategies like lucky imaging can mitigate some effects, they involve intensive data acquisition and complex manual processing. In this paper, we propose AstroDiff, a generative restoration method that leverages both the high-quality generative priors and restoration capabilities of diffusion models to mitigate atmospheric turbulence. Extensive experiments demonstrate that AstroDiff outperforms existing state-of-the-art learning-based methods in astronomical image turbulence mitigation, providing higher perceptual quality and better structural fidelity under severe turbulence conditions. Our code and additional results are available at https://web-six-kappa-66.vercel.app/

[114] DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models

Jiarui Wang,Huiyu Duan,Juntong Wang,Ziheng Jia,Woo Yi Yang,Xiaorong Zhu,Yu Zhao,Jiaying Qian,Yuke Xing,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

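TL;DR: DFBench is a large-scale deepfake benchmark with diverse images and recent generative models, used to evaluate both detectors and generative models. MoA-DF combines the outputs of multiple large multimodal models and achieves state-of-the-art detection performance.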

Details

Motivation: With the rapid progress of generative models, AI-generated images have become far more realistic, making it harder to verify the authenticity of digital content. Existing detection methods are limited by the diversity of their datasets and generation models and struggle to keep pace with increasingly complex AI-generated content. Method: Introduces the DFBench benchmark, with 540,000 images spanning real, AI-edited, and AI-generated content, built on 12 state-of-the-art generation models. Further proposes MoA-DF, which uses a combined probability strategy over multiple LMMs for detection. Result: MoA-DF achieves state-of-the-art deepfake detection performance, demonstrating the effectiveness of LMMs for this task. Conclusion: DFBench and MoA-DF provide a new benchmark and method for deepfake detection and show the potential of LMMs in this area. Abstract: With the rapid advancement of generative models, the realism of AI-generated images has significantly improved, posing critical challenges for verifying digital content authenticity. Current deepfake detection methods often depend on datasets with limited generation models and content diversity that fail to keep pace with the evolving complexity and increasing realism of the AI-generated content. Large multimodal models (LMMs), widely adopted in various vision tasks, have demonstrated strong zero-shot capabilities, yet their potential in deepfake detection remains largely unexplored. To bridge this gap, we present DFBench, a large-scale DeepFake Benchmark featuring (i) broad diversity, including 540,000 images across real, AI-edited, and AI-generated content, (ii) the latest models, with fake images generated by 12 state-of-the-art generation models, and (iii) bidirectional benchmarking, evaluating both the detection accuracy of deepfake detectors and the evasion capability of generative models. Based on DFBench, we propose MoA-DF, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs. MoA-DF achieves state-of-the-art performance, further proving the effectiveness of leveraging LMMs for deepfake detection. Database and codes are publicly available at https://github.com/IntMeGroup/DFBench.

[115] Smartflow: Enabling Scalable Spatiotemporal Geospatial Research

David McVicar,Brian Avant,Adrian Gould,Diego Torrejon,Charles Della Porta,Ryan Mukherjee

Main category: cs.CV

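TL;DR: BlackSky's Smartflow is a cloud-based framework for scalable spatiotemporal geospatial research. Built on open-source tools and technologies, it processes heterogeneous geospatial data into standardized datacubes and uses Kubernetes for workflow orchestration.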

Details

Motivation: To address the heterogeneity and scalability problems of geospatial data processing and to support model development and analysis over large geographic areas and time scales. Method: Takes STAC-compliant catalogs as input, converts heterogeneous data into standardized datacubes, manages model experimentation with ClearML, Tensorboard, and Apache Superset, and relies on Kubernetes for workflow orchestration. Result: Smartflow is well suited to large-scale geospatial model development; the paper also presents a novel neural architecture for monitoring heavy construction that detects all major phases of development. Conclusion: Smartflow provides an efficient, scalable framework for geospatial research and, together with the novel neural architecture, shows its potential for large-scale monitoring tasks. Abstract: BlackSky introduces Smartflow, a cloud-based framework enabling scalable spatiotemporal geospatial research built on open-source tools and technologies. Using STAC-compliant catalogs as a common input, heterogeneous geospatial data can be processed into standardized datacubes for analysis and model training. Model experimentation is managed using a combination of tools, including ClearML, Tensorboard, and Apache Superset. Underpinning Smartflow is Kubernetes, which orchestrates the provisioning and execution of workflows to support both horizontal and vertical scalability. This combination of features makes Smartflow well-suited for geospatial model development and analysis over large geographic areas, time scales, and expansive image archives. We also present a novel neural architecture, built using Smartflow, to monitor large geographic areas for heavy construction. Qualitative results based on data from the IARPA Space-based Machine Automated Recognition Technique (SMART) program are presented that show the model is capable of detecting heavy construction throughout all major phases of development.

[116] Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

Pengtao Chen,Xianfang Zeng,Maosen Zhao,Peng Ye,Mingzhu Shen,Wei Cheng,Gang Yu,Tao Chen

Main category: cs.CV

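TL;DR: The paper proposes Sparse-vDiT, a framework that exploits sparsity patterns in vDiT attention to substantially reduce computation and inference latency while maintaining high visual fidelity.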

Details

Motivation: In video generation, the attention mechanism of Diffusion Transformers (DiTs) has quadratic complexity, causing significant inference latency. The authors find three sparsity patterns in vDiT that can be exploited to improve computational efficiency. Method: Proposes Sparse-vDiT, comprising (1) sparse kernels optimized for each sparsity pattern and (2) an offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head. Result: On CogVideoX1.5, HunyuanVideo, and Wan2.1, Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reduction and 1.76×, 1.85×, and 1.58× actual inference speedups, with PSNR values of 24.13, 27.09, and 22.59, respectively. Conclusion: Latent structural sparsity in vDiTs can be systematically exploited for long video synthesis, substantially improving efficiency without sacrificing visual quality. Abstract: While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. Moreover, 3-6% of attention heads can even be skipped. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) Pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern. 2) An offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reduction, and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59.
Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
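
To make the three named sparsity patterns concrete, the sketch below builds boolean attention masks for each and reports their density (the fraction of query-key pairs kept). The exact definitions are assumptions made for illustration, since the abstract names the patterns without specifying them; the parameters `width`, `stride`, and `columns` are hypothetical.

```python
def diagonal_mask(n, width):
    """Keep entries within `width` of the main diagonal."""
    return [[abs(i - j) <= width for j in range(n)] for i in range(n)]


def multi_diagonal_mask(n, stride, width):
    """Keep bands around every diagonal whose offset is a multiple of `stride`."""
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            d = (i - j) % stride  # non-negative in Python
            row.append(min(d, stride - d) <= width)
        mask.append(row)
    return mask


def vertical_stripe_mask(n, columns):
    """Keep a fixed set of key columns for every query row."""
    cols = set(columns)
    return [[j in cols for j in range(n)] for _ in range(n)]


def density(mask):
    """Fraction of attention entries that are kept."""
    kept = sum(sum(row) for row in mask)
    return kept / (len(mask) * len(mask[0]))
```

A lower density means fewer query-key products to compute, which is where a pattern-specific kernel could save work. Realized speedups depend on the kernel and hardware, as the gap between the theoretical and actual figures reported above suggests.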

[117] EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models

Mingzhe Li,Gehao Zhang,Zhenting Wang,Shiqing Ma,Siqi Pan,Richard Cartwright,Juan Zhai

Main category: cs.CV

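TL;DR: Proposes EDITOR, a prompt inversion technique for text-to-image diffusion models that initializes embeddings with a pre-trained image captioning model, refines them through reverse-engineering in the latent space, and converts them to text. It outperforms existing methods in image similarity, textual alignment, prompt interpretability, and generalizability.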

Details

Motivation: Prompt inversion (recovering the text prompt from a generated image) has significant potential for data attribution, model provenance, and watermarking validation. Existing methods fall short in semantic fluency and efficiency. Method: Initializes embeddings with a pre-trained image captioning model, refines them through reverse-engineering in the latent space, and converts them to text prompts with an embedding-to-text model. Result: On MS COCO, LAION, and Flickr, EDITOR outperforms existing methods in image similarity, textual alignment, prompt interpretability, and generalizability. Conclusion: EDITOR performs well at prompt inversion and shows potential in cross-concept image synthesis, concept manipulation, multi-concept generation, and unsupervised segmentation. Abstract: Text-to-image generation models (e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often lack in image similarity. In this paper, we propose a prompt inversion technique called EDITOR for text-to-image diffusion models, which includes initializing embeddings using a pre-trained image captioning model, refining them through reverse-engineering in the latent space, and converting them to texts using an embedding-to-text model. Our experiments on the widely-used datasets, such as MS COCO, LAION, and Flickr, show that our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability. We further illustrate the application of our generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation and unsupervised segmentation.

[118] LEG-SLAM: Real-Time Language-Enhanced Gaussian Splatting for SLAM

Roman Titkov,Egor Zubkov,Dmitry Yudin,Jaafar Mahmoud,Malik Mohrat,Gennady Sidorov

Main category: cs.CV

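TL;DR: LEG-SLAM combines an optimized Gaussian Splatting implementation with visual-language feature extraction to achieve real-time semantic 3D SLAM, with reconstruction speed significantly better than existing methods.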

Details

Motivation: Integrating semantic information into Gaussian Splatting representations while maintaining real-time performance is a challenge for SLAM applications. Method: Fuses Gaussian Splatting with DINOv2-based visual-language feature extraction and a PCA-based feature compressor, enabling online dense SLAM. Result: Runs at more than 10 fps on Replica and 18 fps on ScanNet, with reconstruction speed significantly better than existing methods. Conclusion: LEG-SLAM requires no prior data preparation and offers an efficient solution for real-time semantic 3D SLAM. Abstract: Modern Gaussian Splatting methods have proven highly effective for real-time photorealistic rendering of 3D scenes. However, integrating semantic information into this representation remains a significant challenge, especially in maintaining real-time performance for SLAM (Simultaneous Localization and Mapping) applications. In this work, we introduce LEG-SLAM -- a novel approach that fuses an optimized Gaussian Splatting implementation with visual-language feature extraction using DINOv2 followed by a learnable feature compressor based on Principal Component Analysis, while enabling an online dense SLAM. Our method simultaneously generates high-quality photorealistic images and semantically labeled scene maps, achieving real-time scene reconstruction with more than 10 fps on the Replica dataset and 18 fps on ScanNet. Experimental results show that our approach significantly outperforms state-of-the-art methods in reconstruction speed while achieving competitive rendering quality. The proposed system eliminates the need for prior data preparation such as camera's ego motion or pre-computed static semantic maps. With its potential applications in autonomous robotics, augmented reality, and other interactive domains, LEG-SLAM represents a significant step forward in real-time semantic 3D Gaussian-based SLAM. Project page: https://titrom025.github.io/LEG-SLAM/

[119] ORV: 4D Occupancy-centric Robot Video Generation

Xiuyu Yang,Bohan Li,Shaocong Xu,Nan Wang,Chongjie Ye,Zhaoxi Chen,Minghan Qin,Yikang Ding,Xin Jin,Hang Zhao,Hao Zhao

Main category: cs.CV

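TL;DR: ORV generates fine-grained robot videos from 4D semantic occupancy sequences, addressing the limited control precision and poor generalization of existing methods.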

Details

Motivation: Acquiring robot data in the traditional way is time-consuming and labor-intensive, and existing generative models suffer from limited control precision and generalization because of their globally coarse alignment. Method: Proposes ORV, which uses 4D semantic occupancy sequences to provide fine-grained semantic and geometric guidance, yielding video generation with high temporal consistency and precise controllability. Result: ORV outperforms baseline methods across multiple datasets and sub-tasks and supports multi-view video generation. Conclusion: ORV offers an efficient and precise video generation solution for robot learning and simulation. Abstract: Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. Recently, action-driven generative models have gained widespread adoption in robot learning and simulation, as they eliminate safety concerns and reduce maintenance efforts. However, the action sequences used in these methods often result in limited control precision and poor generalization due to their globally coarse alignment. To address these limitations, we propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation to provide more accurate semantic and geometric guidance for video generation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability. Furthermore, our framework supports the simultaneous generation of multi-view videos of robot gripping operations - an important capability for downstream robotic learning tasks. Extensive experimental results demonstrate that ORV consistently outperforms existing baseline methods across various datasets and sub-tasks. Demo, Code and Model: https://orangesodahub.github.io/ORV

[120] SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis

Ssharvien Kumar Sivakumar,Yannik Frisch,Ghazal Ghazaei,Anirban Mukhopadhyay

Main category: cs.CV

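TL;DR: SG2VID is a diffusion-based video generation method that uses scene graphs for precise video synthesis and fine-grained control, aimed at surgical simulation training.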

Details

Motivation: Conventional surgical simulation tools lack photorealism and anatomical variability, and existing generative models neglect fine-grained control. Method: Proposes SG2VID, which combines a diffusion model with scene graphs for precise video synthesis and fine-grained control. Result: Outperforms previous methods on multiple surgical datasets and supports precise control over tools and anatomy. Conclusion: SG2VID provides a more realistic and controllable generative approach to surgical simulation and can improve downstream task performance. Abstract: Surgical simulation plays a pivotal role in training novice surgeons, accelerating their learning curve and reducing intra-operative errors. However, conventional simulation tools fall short in providing the necessary photorealism and the variability of human anatomy. In response, current methods are shifting towards generative model-based simulators. Yet, these approaches primarily focus on using increasingly complex conditioning for precise synthesis while neglecting the fine-grained human control aspect. To address this gap, we introduce SG2VID, the first diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control. We demonstrate SG2VID's capabilities across three public datasets featuring cataract and cholecystectomy surgery. While SG2VID outperforms previous methods both qualitatively and quantitatively, it also enables precise synthesis, providing accurate control over tool and anatomy's size and movement, entrance of new tools, as well as the overall scene layout. We qualitatively motivate how SG2VID can be used for generative augmentation and present an experiment demonstrating its ability to improve a downstream phase detection task when the training set is extended with our synthetic videos. Finally, to showcase SG2VID's ability to retain human control, we interact with the Scene Graphs to generate new video samples depicting major yet rare intra-operative irregularities.

[121] InterMamba: Efficient Human-Human Interaction Generation with Adaptive Spatio-Temporal Mamba

Zizhao Wu,Yingying Sun,Yiming Chen,Xiaoling Gu,Ruyu Liu,Jiazhou Chen

Main category: cs.CV

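TL;DR: Proposes an efficient human-human interaction generation method based on the Mamba framework, addressing the scalability and efficiency problems of existing transformer-based architectures.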

Details

Motivation: Human-human interaction generation is important in motion synthesis, but existing methods rely on transformer architectures that face scalability and efficiency challenges. Method: Uses an adaptive spatio-temporal Mamba framework with two parallel SSM branches and an adaptive mechanism to integrate the spatial and temporal features of motion sequences, plus self-adaptive and cross-adaptive modules to strengthen feature learning. Result: Achieves state-of-the-art results on two interaction datasets with only 36% of the baseline's parameters and an inference time that is 46% of the baseline's. Conclusion: The method improves on existing approaches in both quality and efficiency and suits applications that need real-time feedback. Abstract: Human-human interaction generation has garnered significant attention in motion synthesis due to its vital role in understanding humans as social beings. However, existing methods typically rely on transformer-based architectures, which often face challenges related to scalability and efficiency. To address these issues, we propose a novel, efficient human-human interaction generation method based on the Mamba framework, designed to meet the demands of effectively capturing long-sequence dependencies while providing real-time feedback. Specifically, we introduce an adaptive spatio-temporal Mamba framework that utilizes two parallel SSM branches with an adaptive mechanism to integrate the spatial and temporal features of motion sequences. To further enhance the model's ability to capture dependencies within individual motion sequences and the interactions between different individual sequences, we develop two key modules: the self-adaptive spatio-temporal Mamba module and the cross-adaptive spatio-temporal Mamba module, enabling efficient feature learning. Extensive experiments demonstrate that our method achieves state-of-the-art results on two interaction datasets with remarkable quality and efficiency. Compared to the baseline method InterGen, our approach not only improves accuracy but also requires a minimal parameter size of just 66M, only 36% of InterGen's, while achieving an average inference speed of 0.57 seconds, which is 46% of InterGen's execution time.

[122] Explicitly Modeling Subcortical Vision with a Neuro-Inspired Front-End Improves CNN Robustness

Lucas Piper,Arlindo L. Oliveira,Tiago Marques

Main category: cs.CV

TL;DR: EVNets combine the VOneBlock with a SubcorticalBlock, improving CNN robustness and biological alignment and performing well across a range of perturbation tests.

Details

Motivation: To address the vulnerability of CNNs to visual perturbations and out-of-domain images by mimicking the biological visual system. Method: Proposes EVNets, which combine the VOneBlock with a SubcorticalBlock whose design draws on computational models from neuroscience. Result: EVNets improve V1 alignment, shape bias, and robustness, outperforming the base CNN by 8.5% on an aggregate robustness benchmark; combined with data augmentation, they surpass the augmentation-only approach by 7.3%. Conclusion: EVNets show that architectural changes and data augmentation are complementary, offering a new direction for improving model robustness. Abstract: Convolutional neural networks (CNNs) trained on object recognition achieve high task performance but continue to exhibit vulnerability under a range of visual perturbations and out-of-domain images, when compared with biological vision. Prior work has demonstrated that coupling a standard CNN with a front-end block (VOneBlock) that mimics the primate primary visual cortex (V1) can improve overall model robustness. Expanding on this, we introduce Early Vision Networks (EVNets), a new class of hybrid CNNs that combine the VOneBlock with a novel SubcorticalBlock, whose architecture draws from computational models in neuroscience and is parameterized to maximize alignment with subcortical responses reported across multiple experimental studies. Without being optimized to do so, the assembly of the SubcorticalBlock with the VOneBlock improved V1 alignment across most standard V1 benchmarks, and better modeled extra-classical receptive field phenomena. In addition, EVNets exhibit stronger emergent shape bias and outperform the base CNN architecture by 8.5% on an aggregate benchmark of robustness evaluations, including adversarial perturbations, common corruptions, and domain shifts. Finally, we show that EVNets can be further improved when paired with a state-of-the-art data augmentation technique, surpassing the performance of the isolated data augmentation approach by 7.3% on our robustness benchmark. This result reveals complementary benefits between changes in architecture to better mimic biology and training-based machine learning approaches.

[123] FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

Christian Schlarmann,Francesco Croce,Nicolas Flammarion,Matthias Hein

Main category: cs.CV

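TL;DR: FuseLIP is a new multimodal embedding architecture that processes text and image tokens with a single Transformer, achieving early fusion and outperforming conventional late-fusion approaches.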

Details

Motivation: Existing contrastive language-image pre-training methods cannot natively handle multimodal inputs and need additional modules to merge features. Method: Building on discrete image tokenizers, proposes a single Transformer that operates on an extended vocabulary of text and image tokens. Result: FuseLIP outperforms baselines on multimodal tasks such as VQA and text-guided image transformation retrieval, and is comparable on unimodal tasks. Conclusion: Early fusion gives FuseLIP richer multimodal representations and offers an efficient solution for multimodal embedding tasks. Abstract: Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.
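
The "extended vocabulary of text and image tokens" in the FuseLIP abstract can be illustrated with a minimal sketch. It assumes one common way to build a shared id space: discrete image codes are shifted past the text id range so the two never collide. The vocabulary sizes, the function name `fuse_tokens`, and the image-first ordering are all hypothetical; the abstract does not specify them.

```python
# Hypothetical sizes; the abstract does not give the real vocabularies.
TEXT_VOCAB_SIZE = 1000
IMAGE_CODEBOOK_SIZE = 512


def fuse_tokens(text_ids, image_codes):
    """Map text ids and discrete image codes into one shared id space."""
    for t in text_ids:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id out of range: {t}")
    for c in image_codes:
        if not 0 <= c < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image code out of range: {c}")
    # Image codes are shifted past the text range so ids never collide.
    # Placing image tokens first is an arbitrary choice for this sketch.
    return [TEXT_VOCAB_SIZE + c for c in image_codes] + list(text_ids)


fused = fuse_tokens(text_ids=[5, 17, 999], image_codes=[0, 511])
print(fused)  # [1000, 1511, 5, 17, 999]
```

A single embedding table of size `TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE` could then serve both modalities, which is what lets one Transformer attend across text and image tokens at every layer.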

[124] EgoVLM: Policy Optimization for Egocentric Video Understanding

Ashwin Vinod,Shrey Pandit,Aditya Vavre,Linshen Liu

Main category: cs.CV

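TL;DR: EgoVLM is a vision-language model designed for first-person video; optimized with reinforcement learning, it substantially improves performance on egocentric video question answering.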

Details

Motivation: With the rise of wearable devices and autonomous agents, robust reasoning over first-person video streams is needed. Method: EgoVLM is fine-tuned with Group Relative Policy Optimization (GRPO), using reinforcement learning directly to align outputs with human-like reasoning steps, without a supervised fine-tuning phase. Result: EgoVLM-3B outperforms the Qwen2.5-VL 3B and 7B models on the EgoSchema benchmark by 14.33 and 13.87 accuracy points, respectively. Conclusion: By generating reasoning traces, EgoVLM improves interpretability and suits downstream applications; it also introduces a keyframe-based reward that opens a direction for future research. Abstract: Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero's approach, we directly tune using RL without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the EgoSchema benchmark, respectively. By explicitly generating reasoning traces, EgoVLM enhances interpretability, making it well-suited for downstream applications. Furthermore, we introduce a novel keyframe-based reward that incorporates salient frame selection to guide reinforcement learning optimization. This reward formulation opens a promising avenue for future exploration in temporally grounded egocentric reasoning.
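
The core of GRPO, as commonly described, is that each sampled response is scored relative to the other responses drawn for the same prompt, which removes the need for a learned value function. The sketch below shows only that group-relative normalization step. The function name, the use of the population standard deviation, and the `eps` term are assumptions of this sketch, not details taken from the EgoVLM paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is a hypothetical guard against division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one question: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group average get positive advantages and are reinforced; a group in which every response earns the same reward yields all-zero advantages and contributes no learning signal.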

[125] DyTact: Capturing Dynamic Contacts in Hand-Object Manipulation

Xiaoyan Cong,Angela Xing,Chandradeep Pokhariya,Rao Fu,Srinath Sridhar

Main category: cs.CV

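TL;DR: DyTact is a markerless capture method for accurately capturing dynamic hand-object contact; it binds 2D Gaussian surfels to MANO meshes to optimize dynamic contact reconstruction.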

Details

Motivation: Reconstructing dynamic hand-object contact is essential for AI animation, XR, and robotics, but remains difficult for existing techniques because of occlusion, complex surface detail, and capture limitations. Method: DyTact models dynamics with 2D Gaussian surfels, binds them to MANO mesh templates for optimization, and uses a contact-guided adaptive sampling strategy to handle occlusion. Result: DyTact achieves state-of-the-art dynamic contact estimation and novel view synthesis quality, with fast optimization and efficient memory use. Conclusion: DyTact offers an efficient, non-intrusive solution for dynamic hand-object contact reconstruction and significantly improves on existing techniques. Abstract: Reconstructing dynamic hand-object contacts is essential for realistic manipulation in AI character animation, XR, and robotics, yet it remains challenging due to heavy occlusions, complex surface details, and limitations in existing capture techniques. In this paper, we introduce DyTact, a markerless capture method for accurately capturing dynamic contact in hand-object manipulations in a non-intrusive manner. Our approach leverages a dynamic, articulated representation based on 2D Gaussian surfels to model complex manipulations. By binding these surfels to MANO meshes, DyTact harnesses the inductive bias of template models to stabilize and accelerate optimization. A refinement module addresses time-dependent high-frequency deformations, while a contact-guided adaptive sampling strategy selectively increases surfel density in contact regions to handle heavy occlusion. Extensive experiments demonstrate that DyTact not only achieves state-of-the-art dynamic contact estimation accuracy but also significantly improves novel view synthesis quality, all while operating with fast optimization and efficient memory usage. Project Page: https://oliver-cong02.github.io/DyTact.github.io/

[126] ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions

Di Chang,Mingdeng Cao,Yichun Shi,Bo Liu,Shengqu Cai,Shijie Zhou,Weilin Huang,Gordon Wetzstein,Mohammad Soleymani,Peng Wang

Main category: cs.CV

TL;DR: ByteMorph is an instruction-based image editing framework focused on non-rigid motion, comprising the ByteMorph-6M dataset and ByteMorpher, a model built on the Diffusion Transformer.

Details

Motivation: Existing methods focus mainly on static scenes or rigid transformations and cannot handle complex edits involving dynamic motion. Method: Builds the dataset with motion-guided data generation, layered compositing, and automated captioning, and builds the model on DiT. Result: ByteMorph-6M contains over 6 million high-resolution image editing pairs and is accompanied by the evaluation benchmark ByteMorph-Bench. Conclusion: ByteMorph fills the gap in non-rigid motion editing and provides a comprehensive evaluation. Abstract: Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions, poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains.

[127] Revisiting Continuity of Image Tokens for Cross-domain Few-shot Learning

Shuai Yi,Yixiong Zou,Yuhua Li,Ruixuan Li

Main category: cs.CV

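TL;DR: The study finds that the continuity of image tokens in ViT strongly affects generalization, and proposes a new method to improve cross-domain few-shot learning.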

Details

Motivation: Investigate why ViT performance drops in cross-domain few-shot learning, in particular the role of image token continuity. Method: Disrupts the continuity of image tokens so that the model relies more on small-scale patterns than on large-scale ones. Result: Experiments show the method reduces domain gaps and outperforms prior work. Conclusion: Disrupting image token continuity helps ViT generalize under extreme domain gaps. Abstract: Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains, but it still faces challenges when applied to distant downstream domains that have only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by Self-Attention's insensitivity to token orders, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels not transition smoothly across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This questions the role of image tokens' continuity in ViT's generalization under large domain gaps. In this paper, we delve into this phenomenon for an interpretation. We find continuity aids ViT in learning larger spatial patterns, which are harder to transfer than smaller ones, enlarging domain distances. Meanwhile, it implies that only smaller patterns within each patch could be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming state-of-the-art works. Codes and models are available at https://github.com/shuaiyi308/ReCIT.

[128] Zero-Shot Tree Detection and Segmentation from Aerial Forest Imagery

Michelle Chen,David Russell,Amritha Pallavoor,Derek Young,Jane Wu

Main category: cs.CV

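TL;DR: The paper studies zero-shot individual tree detection and segmentation with SAM2, finding that it generalizes well and works in synergy with specialized methods.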

Details

Motivation: Large-scale individual tree delineation is crucial for ecological research, but existing methods depend on training data that is hard to scale. Method: Uses a pretrained SAM2 model for zero-shot segmentation and for zero-shot transfer, with predictions from an existing tree detection model as prompts. Result: SAM2 performs well and complements specialized methods, suggesting a new approach to remote sensing problems. Conclusion: Large pretrained models are promising for remote sensing; the code is open source. Abstract: Large-scale delineation of individual trees from remote sensing imagery is crucial to the advancement of ecological research, particularly as climate change and other environmental factors rapidly transform forest landscapes across the world. Current RGB tree segmentation methods rely on training specialized machine learning models with labeled tree datasets. While these learning-based approaches can outperform manual data collection when accurate, the existing models still depend on training data that's hard to scale. In this paper, we investigate the efficacy of using a state-of-the-art image segmentation model, Segment Anything Model 2 (SAM2), in a zero-shot manner for individual tree detection and segmentation. We evaluate a pretrained SAM2 model on two tasks in this domain: (1) zero-shot segmentation and (2) zero-shot transfer by using predictions from an existing tree detection model as prompts. Our results suggest that SAM2 not only has impressive generalization capabilities, but also can form a natural synergy with specialized methods trained on in-domain labeled data. We find that applying large pretrained models to problems in remote sensing is a promising avenue for future progress. We make our code available at: https://github.com/open-forest-observatory/tree-detection-framework.

[129] Targeted Forgetting of Image Subgroups in CLIP Models

Zeliang Zhang,Gaowen Liu,Charles Fleming,Ramana Rao Kompella,Chenliang Xu

Main category: cs.CV

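TL;DR: This paper proposes a three-stage method for fine-grained, selective forgetting of specific knowledge in CLIP models while preserving zero-shot performance.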

Details

Motivation: Foundation models such as CLIP perform well on zero-shot tasks but may inherit harmful knowledge from internet data, and existing unlearning methods cannot handle fine-grained forgetting of specific knowledge. Method: A three-stage approach: a forgetting stage fine-tunes on the samples to be forgotten, a reminding stage restores performance on retained samples, and a restoring stage recovers zero-shot capability through model souping. Result: Validated on CIFAR-10, ImageNet-1K, and style datasets, the method forgets specific subgroups while maintaining zero-shot performance. Conclusion: The method significantly outperforms baseline unlearning methods and is suited to fine-grained unlearning in CLIP models. Abstract: Foundation models (FMs) such as CLIP have demonstrated impressive zero-shot performance across various tasks by leveraging large-scale, unsupervised pre-training. However, they often inherit harmful or unwanted knowledge from noisy internet-sourced datasets, compromising their reliability in real-world applications. Existing model unlearning methods either rely on access to pre-trained datasets or focus on coarse-grained unlearning (e.g., entire classes), leaving a critical gap for fine-grained unlearning. In this paper, we address the challenging scenario of selectively forgetting specific portions of knowledge within a class, without access to pre-trained data, while preserving the model's overall performance. We propose a novel three-stage approach that progressively unlearns targeted knowledge while mitigating over-forgetting. It consists of (1) a forgetting stage to fine-tune the CLIP on samples to be forgotten, (2) a reminding stage to restore performance on retained samples, and (3) a restoring stage to recover zero-shot capabilities using model souping. Additionally, we introduce knowledge distillation to handle the distribution disparity between forgetting, retaining samples, and unseen pre-trained data. Extensive experiments on CIFAR-10, ImageNet-1K, and style datasets demonstrate that our approach effectively unlearns specific subgroups while maintaining strong zero-shot performance on semantically similar subgroups and other categories, significantly outperforming baseline unlearning methods, which lose effectiveness under the CLIP unlearning setting.
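
"Model souping" generally refers to averaging the weights of several checkpoints that share an architecture. The sketch below shows that averaging on plain Python dictionaries standing in for state dicts. Which checkpoints the paper mixes and with what coefficients is not stated in the abstract, so the pairing of an original and an unlearned checkpoint, the parameter name `proj.weight`, and the equal coefficients are all assumptions here.

```python
def soup(state_dicts, coeffs=None):
    """Weighted average of checkpoints with identical parameter names.

    state_dicts: list of {parameter name: list of floats}.
    coeffs: mixing weights that sum to 1; defaults to a uniform soup.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    n = len(state_dicts)
    if coeffs is None:
        coeffs = [1.0 / n] * n
    if len(coeffs) != n or abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coeffs must match checkpoints and sum to 1")
    keys = state_dicts[0].keys()
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("checkpoints must share parameter names")
    souped = {}
    for key in keys:
        length = len(state_dicts[0][key])
        # Average each scalar entry across checkpoints.
        souped[key] = [
            sum(c * sd[key][i] for c, sd in zip(coeffs, state_dicts))
            for i in range(length)
        ]
    return souped


# Hypothetical checkpoints: the original model and the unlearned one.
original = {"proj.weight": [1.0, 3.0]}
unlearned = {"proj.weight": [3.0, 5.0]}
print(soup([original, unlearned]))  # {'proj.weight': [2.0, 4.0]}
```

Shifting the coefficients toward the original checkpoint would favor zero-shot ability, while shifting them toward the unlearned checkpoint would favor forgetting; the abstract does not say how the authors set this balance.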

[130] Controllable Human-centric Keyframe Interpolation with Generative Prior

Zujin Guo,Size Wu,Zhongang Cai,Wei Li,Chen Change Loy

Main category: cs.CV

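TL;DR: PoseFuse3D-KI is a new framework that improves video keyframe interpolation by integrating 3D human guidance signals, substantially improving generation quality and control for complex human motion.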

Details

Motivation: Existing interpolation methods lack 3D geometric guidance, struggle with complex human motion, and offer limited control. Method: Proposes the PoseFuse3D-KI framework, which combines 3D human signals (via an SMPL-X encoder) with 2D pose embeddings through a fusion network for controllable interpolation. Result: On the CHKI-Video dataset, PSNR improves by 9% and LPIPS drops by 38% relative to existing methods. Conclusion: 3D guidance substantially improves interpolation quality, validating the approach. Abstract: Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.

[131] DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

Zhengyao Lv,Chenyang Si,Tianlin Pan,Zhaoxi Chen,Kwan-Yee K. Wong,Yu Qiao,Ziwei Liu

Main category: cs.CV

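TL;DR: The paper proposes the Dual-Expert Consistency Model (DCM), which divides work between a semantic expert and a detail expert to resolve conflicting optimization gradients in video diffusion model distillation, substantially improving temporal consistency and detail quality.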

Details

Motivation: Diffusion models excel at video synthesis but are computationally expensive; consistency models accelerate them, but applying them directly to video models degrades temporal consistency and detail quality. Method: Proposes DCM with a semantic expert, which learns semantic layout and motion, and a detail expert, which refines fine detail; introduces a Temporal Coherence Loss plus GAN and Feature Matching losses. Result: DCM achieves state-of-the-art visual quality with fewer sampling steps. Conclusion: Expert specialization effectively addresses the optimization problem in video diffusion model distillation and improves synthesis quality. Abstract: Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflict in the learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient Dual-Expert Consistency Model (DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Loss to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at https://github.com/Vchitect/DCM.

[132] AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

Lu Qiu,Yizhuo Li,Yuying Ge,Yixiao Ge,Ying Shan,Xihui Liu

Main category: cs.CV

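TL;DR: AnimeShooter is a reference-guided multi-shot animation dataset that fills the gap left by existing datasets lacking character reference images, with visual consistency achieved through an automated pipeline.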

Details

Motivation: Existing public datasets focus on real-world scenes and lack character reference images and support for coherent multi-shot video generation; AnimeShooter aims to address this. Method: Builds the AnimeShooter dataset through an automated pipeline with story-level and shot-level annotations, and generates coherent multi-shot video using an MLLM and a video diffusion model. Result: Experiments show that a model trained on AnimeShooter achieves strong cross-shot visual consistency and adherence to reference visual guidance. Conclusion: The AnimeShooter dataset provides effective support for coherent animated video generation and demonstrates its value for AIGC. Abstract: Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by MLLM to produce representations aware of both reference and context, which are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, which highlights the value of our dataset for coherent animated video generation.

[133] Native-Resolution Image Synthesis

Zidong Wang,Lei Bai,Xiangyu Yue,Wanli Ouyang,Yiyuan Zhang

Main category: cs.CV

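TL;DR: Proposes native-resolution image synthesis, in which the Native-resolution diffusion Transformer (NiT) generates images at arbitrary resolutions and aspect ratios, overcoming the limits of conventional fixed-resolution methods.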

Details

Motivation: Conventional image generation is limited to fixed resolutions and square formats and cannot flexibly meet diverse visual needs. Method: Designs the NiT architecture, which models visual tokens of varying resolution and aspect ratio for native-resolution synthesis. Result: NiT achieves state-of-the-art performance on the ImageNet-256x256 and 512x512 benchmarks and shows strong zero-shot generalization, generating images at high resolutions (e.g., 1536x1536) and diverse aspect ratios (e.g., 16:9, 3:1). Conclusion: Native-resolution modeling shows significant potential as a bridge between visual generative modeling and advanced LLM methodologies. Abstract: We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves the state-of-the-art performance on both ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced large language models, NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1536 x 1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.

[134] OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Mengdi Jia,Zekun Qi,Shaochen Zhang,Wenyao Zhang,Xinqiang Yu,Jiawei He,He Wang,Li Yi

Main category: cs.CV

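TL;DR: OmniSpatial is a comprehensive spatial reasoning benchmark grounded in cognitive psychology, covering 4 major categories and 50 subcategories with over 1.5K question-answer pairs; it exposes the limits of current vision-language models in spatial understanding.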

Details

Motivation: Existing work on vision-language models targets only basic spatial relations and leaves complex spatial reasoning unevaluated, so a more comprehensive benchmark is needed. Method: Builds the OmniSpatial benchmark through Internet data crawling and manual annotation, covering dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking. Result: Experiments show that existing models fall significantly short in comprehensive spatial understanding. Conclusion: OmniSpatial points to directions for future research and shows where models can improve. Abstract: Spatial reasoning is a key aspect of cognitive psychology and remains a major bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks represent only the most fundamental level of spatial reasoning. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through Internet data crawling and careful manual annotation, we construct over 1.5K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs, as well as existing reasoning and spatial understanding models, exhibit significant limitations in comprehensive spatial understanding. We further analyze failure cases and propose potential directions for future research.

[135] SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation

Siqi Chen,Xinyu Dong,Haolei Xu,Xingyu Wu,Fei Tang,Hang Zhang,Yuchen Yan,Linjuan Wu,Wenqi Zhang,Guiyang Hou,Yongliang Shen,Weiming Lu,Yueting Zhuang

Main category: cs.CV

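TL;DR: SVGenius is a comprehensive SVG processing benchmark with 2,377 queries that evaluates 22 mainstream models. Proprietary models outperform open-source ones, but every model degrades as complexity increases, and reasoning-enhanced training is more effective than pure scaling.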

Details

Motivation: Existing SVG processing benchmarks have limited coverage, lack complexity stratification, and use fragmented evaluation paradigms, so a more comprehensive evaluation framework is needed. Method: SVGenius is built on real-world data from 24 application domains and evaluates models through 8 task categories and 18 metrics across three dimensions: understanding, editing, and generation. Result: Proprietary models significantly outperform open-source ones, but all models degrade as complexity increases; reasoning-enhanced training is more effective than pure scaling, and style transfer is the most challenging task. Conclusion: SVGenius provides the first systematic evaluation framework for SVG processing, with key insights for building more capable vector graphics models and automated graphic design applications. Abstract: Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches; however, reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications. Appendix and supplementary materials (including all data and code) are available at https://zju-real.github.io/SVGenius.

[136] CamCloneMaster: Enabling Reference-based Camera Control for Video Generation

Yawen Luo,Jianhong Bai,Xiaoyu Shi,Menghan Xia,Xintao Wang,Pengfei Wan,Di Zhang,Kun Gai,Tianfan Xue

Main category: cs.CV

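TL;DR: CamCloneMaster is an intuitive camera control framework that needs no camera parameters or fine-tuning: it replicates camera movement from a reference video and supports both image-to-video and video-to-video tasks.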

Details

Motivation: Existing methods rely on explicit sequences of camera parameters, which are cumbersome to construct and make intricate camera movements hard to achieve. Method: Proposes the CamCloneMaster framework, which replicates camera movement from reference videos without camera parameters or test-time fine-tuning, and builds the Camera Clone Dataset, a large-scale synthetic dataset. Result: Experiments and user studies show that CamCloneMaster outperforms existing methods in camera controllability and visual quality. Conclusion: CamCloneMaster offers a more intuitive way to control the camera that applies to multiple tasks and performs strongly. Abstract: Camera control is crucial for generating expressive and cinematic videos. Existing methods rely on explicit sequences of camera parameters as control conditions, which can be cumbersome for users to construct, particularly for intricate camera movements. To provide a more intuitive camera control method, we propose CamCloneMaster, a framework that enables users to replicate camera movements from reference videos without requiring camera parameters or test-time fine-tuning. CamCloneMaster seamlessly supports reference-based camera control for both Image-to-Video and Video-to-Video tasks within a unified framework. Furthermore, we present the Camera Clone Dataset, a large-scale synthetic dataset designed for camera clone learning, encompassing diverse scenes, subjects, and camera movements. Extensive experiments and user studies demonstrate that CamCloneMaster outperforms existing methods in terms of both camera controllability and visual quality.

[137] Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval

Jiwen Yu,Jianhong Bai,Yiran Qin,Quande Liu,Xintao Wang,Pengfei Wan,Di Zhang,Xihui Liu

Main category: cs.CV

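TL;DR: Proposes Context-as-Memory, which uses historical context as memory to improve scene consistency in long video generation, with efficient storage and retrieval mechanisms.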

Details

Motivation: Existing methods struggle to keep scenes consistent in long video generation because they make limited use of historical context. Method: (1) stores context in frame format; (2) concatenates context with the frames to be predicted along the frame dimension; (3) introduces a Memory Retrieval module to select relevant context frames. Result: Context-as-Memory shows superior memory capability in long video generation and generalizes to unseen open-domain scenarios. Conclusion: The method significantly improves scene consistency in long video generation while remaining computationally efficient. Abstract: Recent advances in interactive video generation have shown promising results, yet existing approaches struggle with scene-consistent memory capabilities in long video generation due to limited use of historical context. In this work, we propose Context-as-Memory, which utilizes historical context as memory for video generation. It includes two simple yet effective designs: (1) storing context in frame format without additional post-processing; (2) conditioning by concatenating context and frames to be predicted along the frame dimension at the input, requiring no external control modules. Furthermore, considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module to select truly relevant context frames by determining FOV (Field of View) overlap between camera poses, which significantly reduces the number of candidate frames without substantial information loss. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to SOTAs, even generalizing effectively to open-domain scenarios not seen during training. The project page is at https://context-as-memory.github.io/.
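
The Memory Retrieval idea, selecting context frames whose camera field of view overlaps the current one, can be sketched in a simplified 2D setting. Everything below is an illustrative assumption rather than the paper's procedure: poses are `(x, y, yaw_degrees)` tuples, the FOV is a flat wedge with a hypothetical maximum range, and overlap is estimated by sampling points inside the query wedge and counting how many the candidate camera also sees.

```python
import math


def in_fov(pose, point, fov_deg, max_range):
    """True if a 2D point lies inside the camera's viewing wedge."""
    x, y, yaw_deg = pose
    dx, dy = point[0] - x, point[1] - y
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return True
    if dist > max_range:
        return False
    bearing = math.degrees(math.atan2(dy, dx))
    # Wrap the angular difference into [-180, 180).
    diff = (bearing - yaw_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_deg / 2.0


def sample_fov(pose, fov_deg, max_range, n_angles=9, n_ranges=5):
    """Sample points strictly inside the wedge (no boundary points)."""
    x, y, yaw_deg = pose
    points = []
    for i in range(n_angles):
        angle = yaw_deg - fov_deg / 2.0 + fov_deg * (i + 0.5) / n_angles
        rad = math.radians(angle)
        for j in range(n_ranges):
            r = max_range * (j + 0.5) / n_ranges
            points.append((x + r * math.cos(rad), y + r * math.sin(rad)))
    return points


def fov_overlap(query, candidate, fov_deg=90.0, max_range=10.0):
    """Fraction of the query wedge that the candidate camera also sees."""
    points = sample_fov(query, fov_deg, max_range)
    seen = sum(in_fov(candidate, p, fov_deg, max_range) for p in points)
    return seen / len(points)


def retrieve(query, history, k=2):
    """Indices of the k past poses with the largest nonzero overlap."""
    scored = [(fov_overlap(query, pose), i) for i, pose in enumerate(history)]
    scored = [s for s in scored if s[0] > 0.0]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


history = [(0.0, 0.0, 180.0), (0.0, 0.0, 10.0), (0.0, 0.0, 0.0)]
print(retrieve((0.0, 0.0, 0.0), history))  # [2, 1]
```

In this toy run the frame facing the opposite direction is discarded, and only the two frames that actually share view content with the query remain as conditioning candidates, which is the pruning effect the abstract attributes to the retrieval module.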

[138] MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

Wei Chow,Yuan Gao,Linfeng Li,Xian Wang,Qi Xu,Hang Song,Lingdong Kong,Ran Zhou,Yi Zeng,Yidong Cai,Botian Jiang,Shilin Xu,Jiajun Zhang,Minghui Qiu,Xiangtai Li,Tianshu Yang,Siliang Tang,Juncheng Li

Main category: cs.CV

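TL;DR: The paper introduces the MERIT dataset and the Coral framework for multi-condition semantic retrieval, addressing the limitations of existing methods and substantially improving performance.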

Details

Motivation: Existing semantic retrieval research is limited to single languages or single conditions and cannot meet practical multi-condition query needs. Method: Proposes the MERIT dataset and the Coral framework, which adapts pre-trained models by combining embedding reconstruction with contrastive learning. Result: Coral improves performance on MERIT by 45.9%, and its generalization is validated on multiple benchmarks. Conclusion: The paper lays a foundation for research on multi-condition semantic retrieval, contributing a dataset, an analysis of existing limitations, and a new framework. Abstract: Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify existing models' limitation: focusing solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions - a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework - establish a foundation for future research in interleaved multi-condition semantic retrieval.

[139] UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin,Zongjian Li,Xinhua Cheng,Yuwei Niu,Yang Ye,Xianyi He,Shenghai Yuan,Wangbo Yu,Shaodong Wang,Yunyang Ge,Yatian Pang,Li Yuan

Main category: cs.CV

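TL;DR: The paper proposes UniWorld, a unified generative framework built on semantic features from strong vision-language models and contrastive semantic encoders, substantially improving performance on image perception and manipulation tasks.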

Details

Motivation: Existing unified models perform well on vision-language understanding and text-to-image generation but are limited on image perception and manipulation tasks; the success of GPT-4o-Image motivates a new approach based on semantic features. Method: The UniWorld framework uses semantic features extracted by vision-language models and contrastive semantic encoders, rather than a conventional VAE, to build an efficient unified model. Result: Using only 1% of BAGEL's data, UniWorld outperforms BAGEL on image editing benchmarks while remaining competitive on image understanding and generation. Conclusion: UniWorld shows the potential of a unified framework based on semantic features for image tasks, and the model weights, training scripts, and datasets are open source. Abstract: Although existing unified models deliver strong performance on vision-language understanding and text-to-image generation, they are limited in exploring image perception and manipulation tasks, which are urgently desired by users for wide applications. Recently, OpenAI released their powerful GPT-4o-Image model for comprehensive image perception and manipulation, achieving expressive capability and attracting community interests. By observing the performance of GPT-4o-Image in our carefully constructed experiments, we infer that GPT-4o-Image leverages features extracted by semantic encoders instead of VAE, while VAEs are considered essential components in many image manipulation models. Motivated by such inspiring observations, we present a unified generative framework named UniWorld based on semantic features provided by powerful visual-language models and contrastive semantic encoders. As a result, we build a strong unified model using only 1% of BAGEL's data, which consistently outperforms BAGEL on image editing benchmarks. UniWorld also maintains competitive image understanding and generation capabilities, achieving strong performance across multiple image perception tasks. We fully open-source our models, including model weights, training and evaluation scripts, and datasets.

[140] Self-Supervised Spatial Correspondence Across Modalities

Ayush Shrivastava,Andrew Owens

Main category: cs.CV

TL;DR: Proposes a method for cross-modal space-time correspondence that learns feature representations without aligned data.

Details

Motivation: Solve pixel-level correspondence between different visual modalities (such as RGB and depth maps) without relying on aligned data or photo-consistency assumptions. Method: Extends the contrastive random walk framework to learn cycle-consistent feature representations for both cross-modal and intra-modal matching. Result: Performs strongly on geometric and semantic matching tasks, including RGB-to-depth, RGB-to-thermal, photo-sketch, and cross-style image alignment. Conclusion: The method is simple and effective and applies to a variety of cross-modal matching tasks. Abstract: We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle-consistent feature representations for both cross-modal and intra-modal matching. The resulting model is simple and has no explicit photo-consistency assumptions. It can be trained entirely using unlabeled data, without the need for any spatially aligned multimodal image pairs. We evaluate our method on both geometric and semantic correspondence tasks. For geometric matching, we consider challenging tasks such as RGB-to-depth and RGB-to-thermal matching (and vice versa); for semantic matching, we evaluate on photo-sketch and cross-style image alignment. Our method achieves strong performance across all benchmarks.
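
The cycle-consistency objective behind a contrastive random walk can be sketched as follows: a walker steps from each node in one image to nodes in the other with softmax transition probabilities, steps back, and is rewarded for returning to its starting node. This is a generic two-step version on toy feature vectors. The temperature, the function names, and the two-node example are assumptions of this sketch, and the paper's combined cross-modal and intra-modal walk is not reproduced here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transition(src, dst, temperature):
    """Row-stochastic matrix of step probabilities from src to dst nodes."""
    return [
        softmax([sum(a * b for a, b in zip(s, d)) / temperature for d in dst])
        for s in src
    ]


def cycle_loss(feats_a, feats_b, temperature=0.1):
    """Mean negative log-probability of returning to the start node
    after the round trip A -> B -> A."""
    p_ab = transition(feats_a, feats_b, temperature)
    p_ba = transition(feats_b, feats_a, temperature)
    loss = 0.0
    for i in range(len(feats_a)):
        p_return = sum(p_ab[i][j] * p_ba[j][i] for j in range(len(feats_b)))
        loss -= math.log(p_return)
    return loss / len(feats_a)


# Toy features: two nodes per image, one per modality.
rgb = [[1.0, 0.0], [0.0, 1.0]]
depth_matched = [[1.0, 0.0], [0.0, 1.0]]    # discriminative features
depth_collapsed = [[0.7, 0.7], [0.7, 0.7]]  # uninformative features
print(cycle_loss(rgb, depth_matched))
print(cycle_loss(rgb, depth_collapsed))
```

The loss is near zero when features let the walker find its way back and rises to log 2 when the second image's features are uninformative, so minimizing it pushes the encoder toward features that match corresponding points without any correspondence labels.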

[141] IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation

Yuanze Lin,Yi-Wen Chen,Yi-Hsuan Tsai,Ronald Clark,Ming-Hsuan Yang

Main category: cs.CV

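TL;DR: IllumiCraft is an end-to-end diffusion framework that integrates HDR video maps, synthetically relit frames, and 3D point tracks to generate high-quality, temporally coherent videos.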

Details

Motivation: Existing diffusion models lack explicit integration of geometric cues and therefore cannot precisely control scene lighting and visual appearance. Method: Combines HDR video maps, synthetically relit frames, and 3D point tracks within a unified diffusion architecture to generate video. Result: The generated videos surpass existing controllable video generation methods in temporal coherence and fidelity. Conclusion: By integrating multiple cues, IllumiCraft significantly improves the precision and controllability of video generation. Abstract: Although diffusion-based models can generate high-quality and high-resolution video sequences from textual or image inputs, they lack explicit integration of geometric cues when controlling scene lighting and visual appearance across frames. To address this limitation, we propose IllumiCraft, an end-to-end diffusion framework accepting three complementary inputs: (1) high-dynamic-range (HDR) video maps for detailed lighting control; (2) synthetically relit frames with randomized illumination changes (optionally paired with a static background reference image) to provide appearance cues; and (3) 3D point tracks that capture precise 3D geometry information. By integrating the lighting, appearance, and geometry cues within a unified diffusion architecture, IllumiCraft generates temporally coherent videos aligned with user-defined prompts. It supports background-conditioned and text-conditioned video relighting and provides better fidelity than existing controllable video generation methods. Project Page: https://yuanze-lin.me/IllumiCraft_page

cs.GR [Back]

[142] High-throughput viscometry via machine-learning from videos of inverted vials

Ignacio Arretche,Mohammad Tanver Hossain,Ramdas Tiwari,Abbie Kim,Mya G. Mills,Connor D. Armstrong,Jacob J. Lessard,Sameh H. Tawfick,Randy H. Ewoldt

Main category: cs.GR

TL;DR: Proposes a computer-vision viscometer that automates the inverted vial test to measure viscosity quantitatively over a wide range at low cost.

Details

Motivation: The traditional inverted vial test involves complex flow that makes quantitative viscosity measurement difficult, so a simple, high-throughput method is needed. Method: Uses computer vision and a neural network to infer viscosity from visual features of the fluid flowing in inverted vials. Result: Relative errors are below 25%, and zero-shear viscosity is estimated reliably even for non-Newtonian fluids. Conclusion: The method is simple, low-cost, and scalable, making it suitable for high-throughput experiments and resource-limited settings. Abstract: Although the inverted vial test has been widely used as a qualitative method for estimating fluid viscosity, quantitative rheological characterization has remained limited due to its complex, uncontrolled flow - driven by gravity, surface tension, inertia, and initial conditions. Here, we present a computer vision (CV) viscometer that automates the inverted vial test and enables quantitative viscosity inference across nearly five orders of magnitude (0.01-1000 Pa·s), without requiring direct velocity field measurements. The system simultaneously inverts multiple vials and records videos of the evolving fluid, which are fed into a neural network that approximates the inverse function from visual features and known fluid density. Despite the complex, multi-regime flow within the vial, our approach achieves relative errors below 25%, improving to 15% for viscosities above 0.1 Pa·s. When tested on non-Newtonian polymer solutions, the method reliably estimates zero-shear viscosity as long as viscoelastic or shear-thinning behaviors remain negligible within the flow regime. Moreover, high standard deviations in the inferred values may serve as a proxy for identifying fluids with strong non-Newtonian behavior. The CV viscometer requires only one camera and one motor, is contactless and low-cost, and can be easily integrated into automated and manual high-throughput experimental workflows. Transcending traditional characterization paradigms, our method leverages uncontrolled flows and visual features to achieve simplicity and scalability, enabling high-throughput viscosity inference that can meet the growing demand of data-driven material models while remaining accessible to lower resource environments.

[143] Stochastic Barnes-Hut Approximation for Fast Summation on the GPU

Abhishek Madan,Nicholas Sharp,Francis Williams,Ken Museth,David I. W. Levin

Main category: cs.GR

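TL;DR: Proposes a new stochastic Barnes-Hut approximation that treats the level-of-detail (LOD) approximations as control variates to build an unbiased estimator; it performs well in graphics applications and runs efficiently on the GPU.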

Details

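Motivation: Improve the efficiency and accuracy of the Barnes-Hut approximation, particularly for GPU computation. Method: Treats the LOD approximations as control variates, constructs an unbiased estimator, and optimizes the GPU implementation. Result: In graphics applications such as winding number computation and smooth distance evaluation, the method outperforms the deterministic Barnes-Hut approximation, reaching equal median error in up to 9.4x less time. Conclusion: The method is efficient and accurate on the GPU and well suited to graphics applications. Abstract: We present a novel stochastic version of the Barnes-Hut approximation. Regarding the level-of-detail (LOD) family of approximations as control variates, we construct an unbiased estimator of the kernel sum being approximated. Through several examples in graphics applications such as winding number computation and smooth distance evaluation, we demonstrate that our method is well-suited for GPU computation, capable of outperforming a GPU-optimized implementation of the deterministic Barnes-Hut approximation by achieving equal median error in up to 9.4x less time.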
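
As a rough illustration of the control-variate idea in the abstract above, the following Python sketch uses a single coarse (centroid) approximation as the control variate and adds a sampled correction so that the estimate stays unbiased. It is a single-level, 1D toy under stated assumptions, not the authors' GPU implementation: the inverse-distance kernel, the function names (`kernel`, `far_field`, `stochastic_sum`), and uniform sampling with replacement are all hypothetical choices made for this example.

```python
import random


def kernel(q, x):
    """Inverse-distance kernel in 1D (an illustrative choice, not from the paper)."""
    return 1.0 / abs(q - x)


def exact_sum(q, points):
    """Brute-force kernel sum, used here only as a reference value."""
    return sum(kernel(q, x) for x in points)


def far_field(q, points):
    """Deterministic coarse approximation: collapse the cluster to its centroid."""
    centroid = sum(points) / len(points)
    return len(points) * kernel(q, centroid), centroid


def stochastic_sum(q, points, n_samples, rng):
    """Control-variate estimate: coarse value plus an unbiased sampled correction."""
    coarse, centroid = far_field(q, points)
    n = len(points)
    correction = 0.0
    for _ in range(n_samples):
        x = points[rng.randrange(n)]  # uniform sampling with replacement
        # Residual between the true kernel and the coarse stand-in for this point.
        correction += kernel(q, x) - kernel(q, centroid)
    # Scaling the mean residual by n makes the expectation equal the exact sum.
    return coarse + n * correction / n_samples


if __name__ == "__main__":
    pts = [10 + 0.01 * i for i in range(100)]
    rng = random.Random(0)
    print("exact     :", exact_sum(0.0, pts))
    print("coarse    :", far_field(0.0, pts)[0])
    print("stochastic:", stochastic_sum(0.0, pts, 8, rng))
```

Averaged over many draws, the stochastic estimate converges to the exact sum, whereas the centroid approximation alone keeps a fixed bias; the correction term only has to model the small residual, which is why a few samples can suffice.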

[144] FlexPainter: Flexible and Multi-View Consistent Texture Generation

Dongyu Yan,Leyi Wu,Jiantao Lin,Luozhou Wang,Tianshuo Xu,Zhifei Chen,Zhen Yang,Lie Xu,Shunsi Zhang,Yingcong Chen

Main category: cs.GR

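TL;DR: FlexPainter is a novel texture generation pipeline that uses multi-modal conditional guidance and consistency-enhancing techniques to significantly improve the flexibility and quality of texture generation.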

Details

Motivation: Existing diffusion-based texture generation methods suffer from limited control flexibility and inconsistent multi-view images, which restrict generation quality. Method: Builds a shared conditional embedding space for flexible aggregation of multi-modal inputs; proposes an image-based CFG method that decomposes structural and style information; uses 3D knowledge to generate multi-view images and enforces consistency with view synchronization and adaptive weighting modules; and combines 3D-aware texture completion with an enhancement model to produce high-quality textures. Result: Experiments show that FlexPainter significantly outperforms existing methods in flexibility and generation quality. Conclusion: Through multi-modal guidance and consistency optimization, FlexPainter offers an effective solution for high-quality texture generation. Abstract: Texture map production is an important part of 3D modeling and determines the rendering quality. Recently, diffusion-based methods have opened a new way for texture generation. However, restricted control flexibility and limited prompt modalities may prevent creators from producing desired results. Furthermore, inconsistencies between generated multi-view images often lead to poor texture generation quality. To address these issues, we introduce FlexPainter, a novel texture generation pipeline that enables flexible multi-modal conditional guidance and achieves highly consistent texture generation. A shared conditional embedding space is constructed to perform flexible aggregation between different input modalities. Utilizing such embedding space, we present an image-based CFG method to decompose structural and style information, achieving reference image-based stylization. Leveraging the 3D knowledge within the image diffusion prior, we first generate multi-view images simultaneously using a grid representation to enhance global understanding. Meanwhile, we propose a view synchronization and adaptive weighting module during diffusion sampling to further ensure local consistency. Finally, a 3D-aware texture completion model combined with a texture enhancement model is used to generate seamless, high-resolution texture maps. Comprehensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods in both flexibility and generation quality.

[145] Voyager: Real-Time Splatting City-Scale 3D Gaussians on Your Phone

Zheng Liu,He Zhu,Xinyang Li,Yirun Wang,Yujiao Shi,Wei Li,Jingwen Leng,Minyi Guo,Yu Feng

Main category: cs.GR

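TL;DR: Proposes an efficient solution for city-scale 3D Gaussian Splatting rendering on mobile devices that streams only the necessary Gaussians and optimizes the rendering pipeline, substantially reducing data transfer and latency.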

Details

Motivation: Mobile devices have limited resources and cannot directly render city-scale 3D Gaussian Splatting scenes, while conventional cloud rendering is impractical because of its high latency and excessive bandwidth requirements. Method: Exploits the observation that the number of newly visible Gaussians per second stays roughly constant under normal user motion; the cloud asynchronously searches for the necessary Gaussians, and the client accelerates rendering with a lookup table, combined with runtime optimizations. Result: Compared with existing solutions, data transfer is reduced by more than 100x and speed improves by up to 8.9x while rendering quality is preserved. Conclusion: The system delivers low-latency, city-scale 3D Gaussian Splatting rendering on mobile devices. Abstract: 3D Gaussian Splatting (3DGS) is an emerging technique for photorealistic 3D scene rendering. However, rendering city-scale 3DGS scenes on mobile devices, e.g., your smartphones, remains a significant challenge due to the limited resources on mobile devices. A natural solution is to offload computation to the cloud; however, naively streaming rendered frames from the cloud to the client introduces high latency and requires bandwidth far beyond the capacity of current wireless networks. In this paper, we propose an effective solution to enable city-scale 3DGS rendering on mobile devices. Our key insight is that, under normal user motion, the number of newly visible Gaussians per second remains roughly constant. Leveraging this, we stream only the necessary Gaussians to the client. Specifically, on the cloud side, we propose asynchronous level-of-detail search to identify the necessary Gaussians for the client. On the client side, we accelerate rendering via a lookup table-based rasterization. Combined with holistic runtime optimizations, our system can deliver low-latency, city-scale 3DGS rendering on mobile devices. Compared to existing solutions, Voyager achieves over 100× reduction in data transfer and up to 8.9× speedup while retaining comparable rendering quality.

[146] PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis

Mijeong Kim,Gunhee Kim,Jungyoon Choi,Wonjae Roh,Bohyung Han

Main category: cs.GR

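TL;DR: PhysGaia is a physics-aware dataset designed for Dynamic Novel View Synthesis (DyNVS); it covers structured objects and unstructured physical phenomena and supports physics-aware dynamic scene modeling.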

Details

Motivation: Existing datasets focus mainly on photorealistic reconstruction and lack comprehensive support for physical interactions; PhysGaia fills this gap. Method: Scenes are generated with carefully selected physics solvers so that they strictly follow physical laws, and the dataset provides ground-truth data such as 3D particle trajectories and physics parameters. Result: PhysGaia supports quantitative evaluation of physical modeling and provides integration pipelines for state-of-the-art DyNVS models. Conclusion: PhysGaia is expected to advance research on dynamic view synthesis, physics-based scene understanding, and the combination of deep learning with physical simulation. Abstract: We introduce PhysGaia, a novel physics-aware dataset specifically designed for Dynamic Novel View Synthesis (DyNVS), encompassing both structured objects and unstructured physical phenomena. Unlike existing datasets that primarily focus on photorealistic reconstruction, PhysGaia is created to actively support physics-aware dynamic scene modeling. Our dataset provides complex dynamic scenarios with rich interactions among multiple objects, where they realistically collide with each other and exchange forces. Furthermore, it contains a diverse range of physical materials, such as liquid, gas, viscoelastic substance, and textile, which moves beyond the rigid bodies prevalent in existing datasets. All scenes in PhysGaia are faithfully generated to strictly adhere to physical laws, leveraging carefully selected material-specific physics solvers. To enable quantitative evaluation of physical modeling, our dataset provides essential ground-truth information, including 3D particle trajectories and physics parameters, e.g., viscosity. To facilitate research adoption, we also provide essential integration pipelines for using state-of-the-art DyNVS models with our dataset and report their results. By addressing the critical lack of datasets for physics-aware modeling, PhysGaia will significantly advance research in dynamic view synthesis, physics-based scene understanding, and deep learning models integrated with physical simulation -- ultimately enabling more faithful reconstruction and interpretation of complex dynamic scenes. Our datasets and codes are available on the project website, http://cvlab.snu.ac.kr/research/PhysGaia.

[147] VolTex: Food Volume Estimation using Text-Guided Segmentation and Neural Surface Reconstruction

Ahmad AlMughrabi,Umair Haroon,Ricardo Marques,Petia Radeva

Main category: cs.GR

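TL;DR: The VolTex framework uses text input for precise food segmentation and a neural surface reconstruction method to produce high-fidelity 3D meshes, improving the accuracy of food volume estimation.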

Details

Motivation: Existing 3D food volume estimation methods cannot precisely select food portions; VolTex aims to solve this problem. Method: The user specifies the target food item via text input, and a neural surface reconstruction method generates a 3D mesh for volume computation. Result: Evaluations on the MetaFood3D dataset confirm VolTex's effectiveness for food segmentation and volume estimation. Conclusion: By improving food object selection, VolTex significantly increases the precision of food volume estimation. Abstract: Accurate food volume estimation is crucial for dietary monitoring, medical nutrition management, and food intake analysis. Existing 3D food volume estimation methods compute food volume accurately but lack food portion selection. We present VolTex, a framework that improves the food object selection in food volume estimation. Allowing users to specify a target food item via text input to be segmented, our method enables the precise selection of specific food objects in real-world scenes. The segmented object is then reconstructed using the Neural Surface Reconstruction method to generate high-fidelity 3D meshes for volume computation. Extensive evaluations on the MetaFood3D dataset demonstrate the effectiveness of our approach in isolating and reconstructing food items for accurate volume estimation. The source code is accessible at https://github.com/GCVCG/VolTex.

[148] PartComposer: Learning and Composing Part-Level Concepts from Single-Image Examples

Junyu Liu,R. Kenny Jones,Daniel Ritchie

Main category: cs.GR

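TL;DR: PartComposer is a framework for learning part-level concepts from single-image examples; dynamic data synthesis and mutual-information maximization yield strong disentanglement and controllable composition.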

Details

Motivation: Existing methods either struggle to learn fine-grained concepts or require large amounts of input data; PartComposer aims to address both issues. Method: Proposes a dynamic data synthesis pipeline that generates diverse part compositions, and uses a concept predictor to maximize the mutual information between denoised latents and structured concept codes. Result: The method achieves strong disentanglement and controllable composition, outperforming baselines when mixing concepts within and across object categories. Conclusion: PartComposer enables efficient part-level concept learning and composition. Abstract: We present PartComposer: a framework for part-level concept learning from single-image examples that enables text-to-image diffusion models to compose novel objects from meaningful components. Existing methods either struggle with effectively learning fine-grained concepts or require a large dataset as input. We propose a dynamic data synthesis pipeline generating diverse part compositions to address one-shot data scarcity. Most importantly, we propose to maximize the mutual information between denoised latents and structured concept codes via a concept predictor, enabling direct regulation on concept disentanglement and re-composition supervision. Our method achieves strong disentanglement and controllable composition, outperforming subject and part-level baselines when mixing concepts from the same, or different, object categories.

[149] HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers

Zhiyuan Yu,Zhe Li,Hujun Bao,Can Yang,Xiaowei Zhou

Main category: cs.GR

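TL;DR: HumanRAM is a generalizable method for human reconstruction and animation from monocular or sparse images; it combines explicit pose conditions and a shared SMPL-X neural texture with transformers and a DPT-based decoder to deliver high-quality reconstruction and animation.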

Details

Motivation: Existing methods rely on dense-view capture or time-consuming optimization, which limits practical use; HumanRAM aims to remove these limitations with an efficient, general solution. Method: Combines explicit pose conditions (a SMPL-X neural texture) with transformer models and uses a DPT-based decoder to synthesize human renderings under novel viewpoints and novel poses. Result: Experiments show that HumanRAM significantly outperforms existing methods in reconstruction accuracy, animation fidelity, and generalization. Conclusion: HumanRAM provides an efficient, high-quality solution for human reconstruction and animation from monocular or sparse images. Abstract: 3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM). Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets. Video results are available at https://zju3dv.github.io/humanram/.

cs.CL [Back]

[150] Research on Medical Named Entity Identification Based On Prompt-Biomrc Model and Its Application in Intelligent Consultation System

Jinzhu Yang

Main category: cs.CL

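TL;DR: This study proposes the Prompt-bioMRC model, which combines hard-template and soft-prompt designs to improve the precision and efficiency of named entity recognition (NER) in the medical domain. Experiments show that it outperforms traditional models, providing technical support for applications such as intelligent diagnosis systems.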

Details

Motivation: Explore prompt learning for NER in the medical domain in order to improve entity recognition in medical text. Method: Proposes the Prompt-bioMRC model, which combines hard-template and soft-prompt designs to optimize medical entity recognition. Result: In experiments on a range of medical datasets, the model outperforms traditional methods. Conclusion: The study provides reliable technical support for automated medical data processing and efficient healthcare decision-making. Abstract: This study is dedicated to exploring the application of prompt learning methods to advance Named Entity Recognition (NER) within the medical domain. In recent years, the emergence of large-scale models has driven significant progress in NER tasks, particularly with the introduction of the BioBERT language model, which has greatly enhanced NER capabilities in medical texts. Our research introduces the Prompt-bioMRC model, which integrates both hard template and soft prompt designs aimed at refining the precision and efficiency of medical entity recognition. Through extensive experimentation across diverse medical datasets, our findings consistently demonstrate that our approach surpasses traditional models. This enhancement not only validates the efficacy of our methodology but also highlights its potential to provide reliable technological support for applications like intelligent diagnosis systems. By leveraging advanced NER techniques, this study contributes to advancing automated medical data processing, facilitating more accurate medical information extraction, and supporting efficient healthcare decision-making processes.

[151] No Free Lunch in Active Learning: LLM Embedding Quality Dictates Query Strategy Success

Lukas Rauch,Moritz Wirth,Denis Huseljic,Marek Herde,Bernhard Sick,Matthias Aßenmacher

Main category: cs.CL

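TL;DR: Investigates the practicality of using large language model (LLM) embeddings to improve deep active learning (AL), and finds that embedding quality and the choice of query strategy significantly affect performance.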

Details

Motivation: Use frozen LLM embeddings to reduce the computational cost of iteratively fine-tuning large models in deep active learning. Method: Systematically evaluates the effect of embedding quality on AL strategies across ten text classification tasks, using five top-performing models from the MTEB leaderboard and two baselines. Result: Diversity-based sampling synergizes with high-quality embeddings to boost early AL performance; the choice of query strategy is sensitive to embedding quality, and the Badge strategy is the more robust. Conclusion: AL strategies need context-specific evaluation that accounts for embedding quality and task characteristics. Abstract: The advent of large language models (LLMs) capable of producing general-purpose representations lets us revisit the practicality of deep active learning (AL): By leveraging frozen LLM embeddings, we can mitigate the computational costs of iteratively fine-tuning large backbones. This study establishes a benchmark and systematically investigates the influence of LLM embedding quality on query strategies in deep AL. We employ five top-performing models from the massive text embedding benchmark (MTEB) leaderboard and two baselines for ten diverse text classification tasks. Our findings reveal key insights: First, initializing the labeled pool using diversity-based sampling synergizes with high-quality embeddings, boosting performance in early AL iterations. Second, the choice of the optimal query strategy is sensitive to embedding quality. While the computationally inexpensive Margin sampling can achieve performance spikes on specific datasets, we find that strategies like Badge exhibit greater robustness across tasks. Importantly, their effectiveness is often enhanced when paired with higher-quality embeddings. Our results emphasize the need for context-specific evaluation of AL strategies, as performance heavily depends on embedding quality and the target task.
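
The Margin sampling strategy named in the abstract above can be sketched in a few lines of Python. This is a minimal sketch of the selection step only, assuming class probabilities for the unlabeled pool have already been produced by some classifier trained on frozen embeddings; the function names `margin` and `margin_query` are hypothetical and not taken from the paper's benchmark code.

```python
def margin(probs):
    """Gap between the two most likely classes; a small gap means an uncertain sample."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def margin_query(prob_rows, batch_size):
    """Return indices of the `batch_size` pool samples with the smallest margins."""
    ranked = sorted(range(len(prob_rows)), key=lambda i: margin(prob_rows[i]))
    return ranked[:batch_size]


if __name__ == "__main__":
    # Hypothetical predicted class probabilities for three unlabeled samples.
    pool = [
        [0.90, 0.05, 0.05],  # confident prediction, large margin
        [0.40, 0.35, 0.25],  # fairly uncertain
        [0.50, 0.49, 0.01],  # near tie between two classes
    ]
    print(margin_query(pool, 2))
```

Because it needs only a sort over the predicted probabilities, this strategy is cheap compared with gradient-embedding approaches such as Badge, which matches the abstract's description of Margin as computationally inexpensive.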

[152] NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts

Abhay Gupta,Michael Lu,Kevin Zhu,Sean O'Brien,Vasu Sharma

Main category: cs.CL

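TL;DR: NovelHopQA is a new benchmark that evaluates multi-hop reasoning over long texts of 64k-128k tokens and reveals the limitations of current frontier models in long-context, multi-hop reasoning.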

Details

Motivation: Current large language models (LLMs) perform poorly on long-context and multi-hop reasoning tasks, and no benchmark evaluates the two jointly. Method: Builds a long-text dataset from 83 novels, designs a keyword-guided pipeline to generate multi-hop question chains, and evaluates six frontier models. Result: Model accuracy drops markedly as the number of hops and the context length increase, showing that scale alone does not guarantee reasoning ability. Conclusion: NovelHopQA provides a diagnostic tool for long-text multi-hop reasoning and exposes common failure modes. Abstract: Current large language models (LLMs) struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. While prior benchmarks explore long-context comprehension or multi-hop reasoning in isolation, none jointly vary context length and reasoning depth in natural narrative settings. We introduce NovelHopQA, the first benchmark to evaluate 1-4 hop QA over 64k-128k-token excerpts from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines. We evaluate six state-of-the-art (SOTA) models and apply oracle-context filtering to ensure all questions are genuinely answerable. Human annotators validate both alignment and hop depth. We observe consistent accuracy drops with increased hops and context length, even in frontier models, revealing that sheer scale does not guarantee robust reasoning. Our failure mode analysis highlights common breakdowns, such as missed final-hop integration and long-range drift. NovelHopQA offers a controlled diagnostic setting to stress-test multi-hop reasoning at scale.

[153] Pruning for Performance: Efficient Idiom and Metaphor Classification in Low-Resource Konkani Using mBERT

Timothy Do,Pranav Saran,Harshita Poojary,Pranav Prabhu,Sean O'Brien,Vasu Sharma,Kevin Zhu

Main category: cs.CL

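TL;DR: Proposes a hybrid model combining mBERT, a bidirectional LSTM, and a linear classifier for classifying figurative language in low-resource languages such as Konkani, with attention head pruning to improve efficiency.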

Details

Motivation: Address the challenges that figurative language in low-resource languages poses for NLP systems. Method: Uses a hybrid model (mBERT + bidirectional LSTM + linear classifier) together with a gradient-based attention head pruning strategy. Result: 78% accuracy on metaphor classification and 83% accuracy on the idiom classification task. Conclusion: Attention head pruning is effective for building efficient NLP tools for low-resource languages. Abstract: In this paper, we address the persistent challenges that figurative language expressions pose for natural language processing (NLP) systems, particularly in low-resource languages such as Konkani. We present a hybrid model that integrates a pre-trained Multilingual BERT (mBERT) with a bidirectional LSTM and a linear classifier. This architecture is fine-tuned on a newly introduced annotated dataset for metaphor classification, developed as part of this work. To improve the model's efficiency, we implement a gradient-based attention head pruning strategy. For metaphor classification, the pruned model achieves an accuracy of 78%. We also applied our pruning approach to expand on an existing idiom classification task, achieving 83% accuracy. These results demonstrate the effectiveness of attention head pruning for building efficient NLP tools in underrepresented languages.

[154] Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data

Christopher Lee Lübbers

Main category: cs.CL

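TL;DR: By combining human-ranked data with Direct Preference Optimization (DPO), this study improves paraphrase-type generation accuracy and human preference ratings, and creates a new human-annotated dataset to support future evaluations.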

Details

Motivation: Existing paraphrase-type generation methods rely on automated metrics and limited human-annotated data, so they often diverge from human preferences and overlook key aspects of semantic fidelity and linguistic transformation. Method: Uses a human-ranked paraphrase-type dataset and integrates DPO to align model outputs directly with human judgments. Result: DPO training raises paraphrase-type generation accuracy by 3 percentage points and human preference ratings by 7 percentage points; the paraphrase-type detection model achieves strong F1 scores. Conclusion: Preference data and DPO training yield more reliable, semantically accurate paraphrases, support downstream applications such as summarization and question answering, and move paraphrase-type research in a more user-aligned direction. Abstract: Paraphrasing re-expresses meaning to enhance applications like text simplification, machine translation, and question-answering. Specific paraphrase types facilitate accurate semantic analysis and robust language models. However, existing paraphrase-type generation methods often misalign with human preferences due to reliance on automated metrics and limited human-annotated training data, obscuring crucial aspects of semantic fidelity and linguistic transformations. This study addresses this gap by leveraging a human-ranked paraphrase-type dataset and integrating Direct Preference Optimization (DPO) to align model outputs directly with human judgments. DPO-based training increases paraphrase-type generation accuracy by 3 percentage points over a supervised baseline and raises human preference ratings by 7 percentage points. A newly created human-annotated dataset supports more rigorous future evaluations. Additionally, a paraphrase-type detection model achieves F1 scores of 0.91 for addition/deletion, 0.78 for same polarity substitution, and 0.70 for punctuation changes. These findings demonstrate that preference data and DPO training produce more reliable, semantically accurate paraphrases, enabling downstream applications such as improved summarization and more robust question-answering. The PTD model surpasses automated metrics and provides a more reliable framework for evaluating paraphrase quality, advancing paraphrase-type research toward richer, user-aligned language generation and establishing a stronger foundation for future evaluations grounded in human-centric criteria.
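
The abstract above does not spell out the training objective, so the following Python sketch shows the per-pair DPO loss as it is commonly defined: the negative log-sigmoid of a scaled difference between the policy-to-reference log-ratios of the preferred and the rejected output. The argument names and the default `beta=0.1` are assumptions made for this example, and the sequence log-probabilities are taken as given rather than computed from a model.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    policy_* are log-probabilities under the model being trained,
    ref_* are log-probabilities under the frozen reference model.
    """
    chosen_ratio = policy_chosen - ref_chosen        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected - ref_rejected  # log pi(y_l|x) - log pi_ref(y_l|x)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy shifted toward the human-preferred paraphrase: the loss is lower.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

In the setting described above, the preferred and rejected outputs would come from the human-ranked paraphrase pairs, so lowering this loss pushes the model toward the paraphrases annotators ranked higher.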

[155] ChatCFD: an End-to-End CFD Agent with Domain-specific Structured Thinking

E Fan,Weizong Wang,Tianhan Zhang

Main category: cs.CL

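TL;DR: ChatCFD is a large-language-model-based tool that automates CFD workflows, configuring simulations from natural language or published literature and reducing reliance on specialist expertise.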

Details

Motivation: CFD is essential in science and engineering but is complex to operate and requires specialist expertise; ChatCFD aims to address this. Method: Builds on the OpenFOAM framework and combines database construction, configuration validation, and error reflection to integrate CFD knowledge with language models. Result: Validation shows that ChatCFD can autonomously reproduce complex CFD results, exceeding the capabilities of general-purpose language models. Conclusion: ChatCFD offers an efficient, easy-to-use automated solution for CFD and broadens its use by non-experts. Abstract: Computational Fluid Dynamics (CFD) is essential for scientific and engineering advancements but is limited by operational complexity and the need for extensive expertise. This paper presents ChatCFD, a large language model-driven pipeline that automates CFD workflows within the OpenFOAM framework. It enables users to configure and execute complex simulations from natural language prompts or published literature with minimal expertise. The innovation is its structured approach to database construction, configuration validation, and error reflection, integrating CFD and OpenFOAM knowledge with general language models to improve accuracy and adaptability. Validation shows ChatCFD can autonomously reproduce published CFD results, handling complex, unseen configurations beyond basic examples, a task challenging for general language models.

[156] FinS-Pilot: A Benchmark for Online Financial System

Feng Wang,Yiding Sun,Jiaxin Mao,Wei Xue,Danqing Xu

Main category: cs.CL

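TL;DR: FinS-Pilot is a new benchmark for evaluating RAG systems in online financial applications; built from real financial assistant interactions, it addresses data confidentiality and dynamic data integration.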

Details

Motivation: Existing financial RAG benchmarks are constrained by data confidentiality and a lack of dynamic data integration, so a dedicated evaluation tool is needed. Method: Built from real financial assistant interactions, the benchmark integrates real-time API data and structured text and is organized through an intent classification framework. Result: FinS-Pilot comprehensively evaluates how financial assistants handle static knowledge and real-time market information, and identifies LLMs suitable for financial applications. Conclusion: FinS-Pilot fills the gap in specialized evaluation tools for the financial domain and provides a practical framework and dataset. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various professional domains, with their performance typically evaluated through standardized benchmarks. However, the development of financial RAG benchmarks has been constrained by data confidentiality issues and the lack of dynamic data integration. To address this issue, we introduce FinS-Pilot, a novel benchmark for evaluating RAG systems in online financial applications. Constructed from real-world financial assistant interactions, our benchmark incorporates both real-time API data and structured text sources, organized through an intent classification framework covering critical financial domains such as equity analysis and macroeconomic forecasting. The benchmark enables comprehensive evaluation of financial assistants' capabilities in handling both static knowledge and time-sensitive market information. Through systematic experiments with multiple leading Chinese LLMs, we demonstrate FinS-Pilot's effectiveness in identifying models suitable for financial applications while addressing the current gap in specialized evaluation tools for the financial domain. Our work contributes both a practical evaluation framework and a curated dataset to advance research in financial NLP systems. The code and dataset are accessible on GitHub: https://github.com/PhealenWang/financial_rag_benchmark.

[157] Enhancing Multimodal Continual Instruction Tuning with BranchLoRA

Duzhen Zhang,Yong Ren,Zhong-Zhi Li,Yahan Yu,Jiahua Dong,Chenxing Li,Zhilong Ji,Jinfeng Bai

Main category: cs.CL

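TL;DR: BranchLoRA is an asymmetric framework that uses a flexible tuning-freezing mechanism and task-specific routers to mitigate catastrophic forgetting in MCIT, significantly improving performance.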

Details

Motivation: Existing methods such as MoELoRA simply aggregate LoRA blocks in multimodal continual instruction tuning, which leads to catastrophic forgetting and parameter inefficiency; BranchLoRA aims to solve these problems. Method: Proposes the BranchLoRA framework with a tuning-freezing mechanism and task-specific routers, and introduces a task selector to streamline inference. Result: On the MCIT benchmark, BranchLoRA significantly outperforms MoELoRA and keeps its advantage across MLLMs of different sizes. Conclusion: BranchLoRA effectively addresses catastrophic forgetting and parameter inefficiency, providing an efficient, high-performing solution for MCIT. Abstract: Multimodal Continual Instruction Tuning (MCIT) aims to finetune Multimodal Large Language Models (MLLMs) to continually align with human intent across sequential tasks. Existing approaches often rely on the Mixture-of-Experts (MoE) LoRA framework to preserve previous instruction alignments. However, these methods are prone to Catastrophic Forgetting (CF), as they aggregate all LoRA blocks via simple summation, which compromises performance over time. In this paper, we identify a critical parameter inefficiency in the MoELoRA framework within the MCIT context. Based on this insight, we propose BranchLoRA, an asymmetric framework to enhance both efficiency and performance. To mitigate CF, we introduce a flexible tuning-freezing mechanism within BranchLoRA, enabling branches to specialize in intra-task knowledge while fostering inter-task collaboration. Moreover, we incrementally incorporate task-specific routers to ensure an optimal branch distribution over time, rather than favoring the most recent task. To streamline inference, we introduce a task selector that automatically routes test inputs to the appropriate router without requiring task identity. Extensive experiments on the latest MCIT benchmark demonstrate that BranchLoRA significantly outperforms MoELoRA and maintains its superiority across various MLLM sizes.

[158] Evaluating the Unseen Capabilities: How Many Theorems Do LLMs Know?

Xiang Li,Jiayi Xin,Qi Long,Weijie J. Su

Main category: cs.CL

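TL;DR: Proposes KnowSum, a framework that evaluates large language models (LLMs) more comprehensively by quantifying unseen knowledge, showing that current evaluation methods overlook a large amount of latent knowledge.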

Details

Motivation: Existing evaluation methods do not fully reflect the actual capabilities of LLMs, especially knowledge that goes unobserved, which leads to inconsistent evaluation results. Method: Introduces the KnowSum statistical framework, which estimates the amount of unobserved knowledge by extrapolating from the appearance frequencies of observed knowledge instances. Result: Experiments show that KnowSum is effective for estimating total knowledge, evaluating information retrieval, and measuring output diversity, and that it significantly changes the ranking of common LLMs. Conclusion: KnowSum offers a more comprehensive evaluation method, exposes the limitations of current evaluations, and shows the importance of unseen knowledge for assessing model capability. Abstract: Accurate evaluation of large language models (LLMs) is crucial for understanding their capabilities and guiding their development. However, current evaluations often inconsistently reflect the actual capacities of these models. In this paper, we demonstrate that one of many contributing factors to this evaluation crisis is the oversight of unseen knowledge -- information encoded by LLMs but not directly observed or not yet observed during evaluations. We introduce KnowSum, a statistical framework designed to provide a more comprehensive assessment by quantifying the unseen knowledge for a class of evaluation tasks. KnowSum estimates the unobserved portion by extrapolating from the appearance frequencies of observed knowledge instances. We demonstrate the effectiveness and utility of KnowSum across three critical applications: estimating total knowledge, evaluating information retrieval effectiveness, and measuring output diversity. Our experiments reveal that a substantial volume of knowledge is omitted when relying solely on observed LLM performance. Importantly, KnowSum yields significantly different comparative rankings for several common LLMs based on their internal knowledge.
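
The abstract above says only that KnowSum extrapolates from the appearance frequencies of observed instances; it does not name the estimator. As a stand-in, the Python sketch below uses the classical bias-corrected Chao1 estimator from species-richness statistics, which works from the same kind of input: the number of items seen exactly once and exactly twice. Whether KnowSum uses this particular estimator is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def chao1_unseen(observations):
    """Bias-corrected Chao1 estimate of how many distinct items were never observed."""
    counts = Counter(observations)             # item -> number of appearances
    freq_of_freq = Counter(counts.values())    # k -> number of items seen exactly k times
    f1, f2 = freq_of_freq[1], freq_of_freq[2]  # singletons and doubletons
    return f1 * (f1 - 1) / (2 * (f2 + 1))


def estimated_total(observations):
    """Observed distinct items plus the estimated unseen ones."""
    return len(set(observations)) + chao1_unseen(observations)


if __name__ == "__main__":
    # Hypothetical theorem names produced by a model over repeated queries.
    answers = ["pythagoras", "pythagoras", "fermat", "bayes", "stokes", "stokes", "noether"]
    print(chao1_unseen(answers), estimated_total(answers))
```

The intuition carries over to the setting in the abstract: many items seen only once suggest that a large share of what the model knows has not surfaced yet, so the observed count alone understates the total.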

[159] Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains

Juncheng Wu,Sheng Liu,Haoqin Tu,Hang Yu,Xiaoke Huang,James Zou,Cihang Xie,Yuyin Zhou

Main category: cs.CL

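TL;DR: Studies the step-by-step reasoning of reasoning-enhanced large language models in the medical and mathematical domains, proposes a fine-grained evaluation framework, and finds that supervised fine-tuning (SFT) and reinforcement learning (RL) affect reasoning ability differently.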

Details

Motivation: Reasoning-enhanced large language models perform well on complex tasks, but the quality and transparency of their internal reasoning remain underexplored. Method: Decomposes reasoning trajectories into knowledge and reasoning, proposes an evaluation framework (Knowledge Index and Information Gain), and analyzes SFT- and RL-trained models in the medical and math domains. Result: SFT improves final-answer accuracy but lowers reasoning quality, whereas RL improves reasoning in the medical domain by pruning inaccurate knowledge. Conclusion: SFT remains indispensable in the medical domain, while RL can refine reasoning paths, offering new directions for model design. Abstract: Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond the final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges: (1) the correctness of knowledge used (measured by Knowledge Index (KI)) and (2) the quality of reasoning (measured by Information Gain (InfoGain)). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities in R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models. In the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.

[160] Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models

Michael Li,Nishant Subramani

Main category: cs.CL

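TL;DR: The study examines how classical and contemporary language models (BERT, GPT-2, Pythia, and others) encode lexical and morphological information, finding that lexical information is concentrated linearly in early layers and increasingly nonlinearly in later layers, while inflectional information is linearly separable at every layer.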

Details

Motivation: To understand how modern large language models encode linguistic information, especially lexical and morphological features. Method: Trains linear and nonlinear classifiers on layer-wise activations to predict word lemmas and inflectional features. Result: Lexical information is concentrated linearly in early layers and increasingly nonlinearly in later layers; inflectional information is linearly separable in all layers and is encoded through generalizable abstractions. Conclusion: Despite differences in architecture and size, the models organize linguistic information in similar ways, suggesting these properties may be fundamental to language models. Abstract: Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today's language models, we investigate how both classical architectures (BERT, DeBERTa, GPT-2) and contemporary large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) represent lexical identity and inflectional morphology. We train linear and nonlinear classifiers on layer-wise activations to predict word lemmas and inflectional features. We discover that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout the layers. Further analysis reveals that these models encode inflectional morphology through generalizable abstractions, but rely predominantly on memorization to encode lexical identity. Remarkably, these patterns emerge across all 16 models we test, despite differences in architecture, size, and training regime (including pretrained and instruction-tuned variants). This consistency suggests that, despite substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties could be fundamental for next token prediction and are learned early during pretraining. Our code is available at https://github.com/ml5885/model_internal_sleuthing.
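Layer-wise linear probing, as described in the abstract above, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it fits a ridge-regularized least-squares probe (a stand-in for whatever linear classifier the paper trains) on synthetic vectors that play the role of one layer's activations. The names `fit_linear_probe` and `probe_accuracy` are hypothetical.

```python
import numpy as np


def fit_linear_probe(acts, labels, n_classes, ridge=1e-3):
    """Fit a linear map from activations to one-hot labels by ridge regression."""
    X = np.hstack([acts, np.ones((len(acts), 1))])   # append a bias column
    Y = np.eye(n_classes)[labels]                    # one-hot targets
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)            # shape: (dim + 1, n_classes)


def probe_accuracy(W, acts, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    return float(np.mean(np.argmax(X @ W, axis=1) == labels))


# Synthetic "activations": two classes with well-separated means in 8 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
acts = rng.normal(size=(400, 8)) + np.where(labels[:, None] == 1, 2.0, -2.0)

W = fit_linear_probe(acts, labels, n_classes=2)
print(f"probe accuracy: {probe_accuracy(W, acts, labels):.2f}")
```

In a real study the probe would be fit separately on each layer's activations and scored on held-out data; comparing a linear probe against a nonlinear one per layer is what distinguishes linearly encoded information from information that is present but not linearly accessible.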

[161] BabyLM's First Constructions: Causal interventions provide a signal of learning

Joshua Rozner,Leonie Weissweiler,Cory Shain

Main category: cs.CL

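TL;DR: Even when trained on developmentally plausible amounts of data, language models learn and represent diverse constructions, and constructional performance correlates with benchmark performance.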

Details

Motivation: To test whether language models still learn constructions when trained on amounts of data comparable to what children receive, and whether constructional learning relates to model performance. Method: Applies the methods of Rozner et al. to evaluate constructional learning in models from the 2024 BabyLM challenge. Result: The models represent diverse constructions, including hard cases that are superficially indistinguishable, and constructional performance correlates with BabyLM benchmark performance. Conclusion: Constructional learning appears functionally relevant in developmentally plausible language models, which lends support to construction grammar. Abstract: Construction grammar posits that children acquire constructions (form-meaning pairings) from the statistics of their environment. Recent work supports this hypothesis by showing sensitivity to constructions in pretrained language models (PLMs), including one recent study (Rozner et al., 2025) demonstrating that constructions shape the PLM's output distribution. However, models under study have generally been trained on developmentally implausible amounts of data, casting doubt on their relevance to human language learning. Here we use Rozner et al.'s methods to evaluate constructional learning in models from the 2024 BabyLM challenge. Our results show that even when trained on developmentally plausible quantities of data, models represent diverse constructions, even hard cases that are superficially indistinguishable. We further find correlational evidence that constructional performance may be functionally relevant: models that better represent constructions perform better on the BabyLM benchmarks.

[162] HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation

Amir Hussein,Cihan Xiao,Matthew Wiesner,Dan Povey,Leibny Paola Garcia,Sanjeev Khudanpur

Main category: cs.CL

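TL;DR: HENT-SRT is a new neural transducer framework that improves speech translation through task factorization and self-distillation while reducing computational cost.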

Details

Motivation: Existing neural transducers struggle with word reordering and suffer performance degradation in speech translation, and they are computationally expensive to train. Method: Proposes HENT-SRT, which factorizes the ASR and translation tasks, uses self-distillation with CTC consistency regularization, and streamlines the encoder and predictor. Result: Achieves state-of-the-art results among neural transducer models on Arabic, Spanish, and Mandarin datasets, narrowing the gap with AED models. Conclusion: HENT-SRT addresses the main weaknesses of neural transducers in speech translation, improving both quality and efficiency. Abstract: Neural transducers (NT) provide an effective framework for speech streaming, demonstrating strong performance in automatic speech recognition (ASR). However, the application of NT to speech translation (ST) remains challenging, as existing approaches struggle with word reordering and performance degradation when jointly modeling ASR and ST, resulting in a gap with attention-based encoder-decoder (AED) models. Existing NT-based ST approaches also suffer from high computational training costs. To address these issues, we propose HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel framework that factorizes ASR and translation tasks to better handle reordering. To ensure robust ST while preserving ASR performance, we use self-distillation with CTC consistency regularization. Moreover, we improve computational efficiency by incorporating best practices from ASR transducers, including a down-sampled hierarchical encoder, a stateless predictor, and a pruned transducer loss to reduce training complexity. Finally, we introduce a blank penalty during decoding, reducing deletions and improving translation quality. Our approach is evaluated on three conversational datasets (Arabic, Spanish, and Mandarin), achieving new state-of-the-art performance among NT models and substantially narrowing the gap with AED-based systems.

[163] Different Speech Translation Models Encode and Translate Speaker Gender Differently

Dennis Fucci,Marco Gaido,Matteo Negri,Luisa Bentivogli,Andre Martins,Giuseppe Attanasio

Main category: cs.CL

TL;DR: The study asks whether speech translation models capture speaker gender and how this affects gender assignment in translation.

Details

Motivation: To determine whether speech translation models, like speech models, capture speaker gender, and what this implies for gender assignment in translation. Method: Uses probing methods to assess gender encoding across different speech translation models. Result: Traditional encoder-decoder models capture gender information, but newer architectures that connect a speech encoder to a machine translation system via adapters do not. Models with weak gender encoding lean more toward a masculine default. Conclusion: Newer speech translation architectures encode gender poorly, which leads to a more pronounced masculine-default bias in translation. Abstract: Recent studies on interpreting the hidden states of speech models have shown their ability to capture speaker-specific features, including gender. Does this finding also hold for speech translation (ST) models? If so, what are the implications for the speaker's gender assignment in translation? We address these questions from an interpretability perspective, using probing methods to assess gender encoding across diverse ST models. Results on three language directions (English-French/Italian/Spanish) indicate that while traditional encoder-decoder models capture gender information, newer architectures -- integrating a speech encoder with a machine translation system via adapters -- do not. We also demonstrate that low gender encoding capabilities result in systems' tendency toward a masculine default, a translation bias that is more pronounced in newer architectures.

[164] AI Debate Aids Assessment of Controversial Claims

Salman Rahman,Sheriff Issaka,Ashima Suvarna,Genglin Liu,James Shiffer,Jaeyoung Lee,Md Rizwan Parvez,Hamid Palangi,Shi Feng,Nanyun Peng,Yejin Choi,Julian Michael,Liwei Jiang,Saadia Gabriel

Main category: cs.CL

TL;DR: AI debate helps reduce bias and improve judgment accuracy, particularly on contested topics such as COVID-19 factuality claims.

Details

Motivation: As AI grows more influential, it may amplify misinformation and deepen social divides, especially in critical areas such as public health. Method: Compares AI debate (two AI systems arguing opposing sides) with consultancy (a single AI advisor) and studies their effect on how human and AI judges assess COVID-19 factuality claims. Result: Debate improves judgment accuracy (+15.2% for judges with mainstream beliefs, +4.7% for skeptical judges), and persona-based AI judges outperform human judges (78.5% vs. 70.1%). Conclusion: AI debate is a promising oversight method that combines diverse human and AI judgments to better identify the truth in contested domains. Abstract: As AI grows more powerful, it will increasingly shape how we understand the world. But with this influence comes the risk of amplifying misinformation and deepening social divides, especially on consequential topics like public health where factual accuracy directly impacts well-being. Scalable Oversight aims to ensure AI truthfulness by enabling humans to supervise systems that may exceed human capabilities -- yet humans themselves hold different beliefs and biases that impair their judgment. We study whether AI debate can guide biased judges toward the truth by having two AI systems debate opposing sides of controversial COVID-19 factuality claims where people hold strong prior beliefs. We conduct two studies: one with human judges holding either mainstream or skeptical beliefs evaluating factuality claims through AI-assisted debate or consultancy protocols, and a second examining the same problem with personalized AI judges designed to mimic these different human belief systems. In our human study, we find that debate, where two AI advisor systems present opposing evidence-based arguments, consistently improves judgment accuracy and confidence calibration, outperforming consultancy with a single-advisor system by 10% overall. The improvement is most significant for judges with mainstream beliefs (+15.2% accuracy), though debate also helps skeptical judges who initially misjudge claims move toward accurate views (+4.7% accuracy). In our AI judge study, we find that AI judges with human-like personas achieve even higher accuracy (78.5%) than human judges (70.1%) and default AI judges without personas (69.8%), suggesting their potential for supervising frontier AI models. These findings highlight AI debate as a promising path toward scalable, bias-resilient oversight -- leveraging both diverse human and AI judgments to move closer to truth in contested domains.

[165] Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution

Dennis Fucci,Marco Gaido,Matteo Negri,Mauro Cettolo,Luisa Bentivogli

Main category: cs.CL

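TL;DR: Using feature attribution, the paper analyzes which acoustic cues a modern Conformer-based ASR system relies on and finds specific preferences in how it uses the acoustic properties of vowels, fricatives, and plosives.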

Details

Motivation: Despite significant advances in ASR, the specific acoustic cues that models rely on remain unclear; this study aims to fill that gap. Method: Applies a feature attribution technique and analyzes how attributions align with the acoustic properties of vowels, fricatives, and plosives in the time and frequency domains. Result: The ASR model relies on the full time span of vowels (particularly their first two formants), on the spectral characteristics of fricatives (sibilants more than non-sibilants), and on the release phase of plosives (especially burst characteristics). Conclusion: The findings improve the interpretability of ASR models and point to future work on uncovering potential gaps in model robustness. Abstract: Despite significant advances in ASR, the specific acoustic cues models rely on remain unclear. Prior studies have examined such cues on a limited set of phonemes and outdated models. In this work, we apply a feature attribution technique to identify the relevant acoustic cues for a modern Conformer-based ASR system. By analyzing plosives, fricatives, and vowels, we assess how feature attributions align with their acoustic properties in the time and frequency domains, also essential for human speech perception. Our findings show that the ASR model relies on vowels' full time spans, particularly their first two formants, with greater saliency in male speech. It also better captures the spectral characteristics of sibilant fricatives than non-sibilants and prioritizes the release phase in plosives, especially burst characteristics. These insights enhance the interpretability of ASR models and highlight areas for future research to uncover potential gaps in model robustness.

[166] BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models

Lindia Tjuatja,Graham Neubig

Main category: cs.CL

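TL;DR: The paper proposes BehaviorBox, a method for automatically comparing language models that uses performance-aware contextual embeddings to find fine-grained differences between them.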

Details

Motivation: Language model evaluation is difficult and imprecise, and conventional measures such as perplexity struggle to capture specific differences between models, so an automated way to find performance differences in particular contexts is needed. Method: Uses performance-aware contextual embeddings to extract fine-grained features (such as the use of particular phrases or punctuation) and compares two models on those features. Result: BehaviorBox identifies performance differences in fine-grained contexts, such as conditional clauses or emotional statements, that corpus-level perplexity does not reveal. Conclusion: BehaviorBox provides an automated, fine-grained way to compare language models and surfaces specific performance differences that conventional evaluation misses. Abstract: Language model evaluation is a daunting task: prompts are brittle, corpus-level perplexities are vague, and the choices of benchmarks are endless. Finding examples that show meaningful, generalizable differences between two LMs is crucial to understanding where one model succeeds and another fails. Can this process be done automatically? In this work, we propose methodology for automated comparison of language models that uses performance-aware contextual embeddings to find fine-grained features of text where one LM outperforms another. Our method, which we name BehaviorBox, extracts coherent features that demonstrate differences with respect to the ease of generation between two LMs. Specifically, BehaviorBox finds features that describe groups of words in fine-grained contexts, such as "conditional 'were' in the phrase 'if you were'" and "exclamation marks after emotional statements", where one model outperforms another within a particular dataset. We apply BehaviorBox to compare models that vary in size, model family, and post-training, and enumerate insights into specific contexts that illustrate meaningful differences in performance which cannot be found by measures such as corpus-level perplexity alone.

[167] Leveraging Natural Language Processing to Unravel the Mystery of Life: A Review of NLP Approaches in Genomics, Transcriptomics, and Proteomics

Ella Rannon,David Burstein

Main category: cs.CL

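TL;DR: This review surveys the application of natural language processing (NLP) methods to biological sequence data in genomics, transcriptomics, and proteomics, covering how approaches ranging from classic methods to advanced models are adapted and what they can offer.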

Details

Motivation: To explore the potential of NLP techniques for biological sequence analysis and thereby advance the understanding of biological processes. Method: Reviews the application of NLP methods (such as word2vec, transformers, and hyena operators) to DNA, RNA, and protein sequence analysis, and evaluates their strengths and weaknesses. Result: NLP methods show promise on biological tasks such as structure prediction, gene expression, and evolutionary analysis. Conclusion: Combining NLP with bioinformatics holds considerable promise for the life sciences. Abstract: Natural Language Processing (NLP) has transformed various fields beyond linguistics by applying techniques originally developed for human language to the analysis of biological sequences. This review explores the application of NLP methods to biological sequence data, focusing on genomics, transcriptomics, and proteomics. We examine how various NLP methods, from classic approaches like word2vec to advanced models employing transformers and hyena operators, are being adapted to analyze DNA, RNA, protein sequences, and entire genomes. The review also examines tokenization strategies and model architectures, evaluating their strengths, limitations, and suitability for different biological tasks. We further cover recent advances in NLP applications for biological data, such as structure prediction, gene expression, and evolutionary analysis, highlighting the potential of these methods for extracting meaningful insights from large-scale genomic data. As language models continue to advance, their integration into bioinformatics holds immense promise for advancing our understanding of biological processes in all domains of life.

[168] Investigating the Impact of Word Informativeness on Speech Emotion Recognition

Sofoklis Kakouros

Main category: cs.CL

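TL;DR: The paper proposes a speech emotion recognition approach based on semantic importance: a pre-trained language model identifies key speech segments, which improves recognition accuracy.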

Details

Motivation: Traditional approaches compute acoustic features over whole sentences or long speech segments and may miss fine-grained variation; this study selects key segments by semantic importance to improve recognition. Method: Uses a pre-trained language model to compute word informativeness, selects the key speech segments, and computes acoustic features (such as energy and F0) and their functionals only on those segments. Result: Computing features on segments selected by word informativeness yields a notable improvement in emotion recognition performance. Conclusion: Selecting speech segments by semantic importance improves emotion recognition accuracy and offers a new angle on speech emotion analysis. Abstract: In emotion recognition from speech, a key challenge lies in identifying speech signal segments that carry the most relevant acoustic variations for discerning specific emotions. Traditional approaches compute functionals for features such as energy and F0 over entire sentences or longer speech portions, potentially missing essential fine-grained variation in the long-form statistics. This research investigates the use of word informativeness, derived from a pre-trained language model, to identify semantically important segments. Acoustic features are then computed exclusively for these identified segments, enhancing emotion recognition accuracy. The methodology utilizes standard acoustic prosodic features, their functionals, and self-supervised representations. Results indicate a notable improvement in recognition performance when features are computed on segments selected based on word informativeness, underscoring the effectiveness of this approach.

[169] CoDial: Interpretable Task-Oriented Dialogue Systems Through Dialogue Flow Alignment

Radin Shayanfar,Chu Fei Luo,Rohan Bhambhoria,Samuel Dahan,Xiaodan Zhu

Main category: cs.CL

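TL;DR: The CoDial framework converts expert knowledge, represented as a structured heterogeneous graph, into executable conversation logic, letting non-technical experts define and refine task-oriented dialogue systems; it performs strongly on the STAR and MultiWOZ datasets.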

Details

Motivation: To address the high cost of expert knowledge and the technical difficulty of building dialogue systems for specialized domains, and to let non-technical experts take part in defining and refining such systems. Method: Proposes CoDial, which represents expert knowledge as a structured heterogeneous graph and implements it in existing guardrailing languages (such as Colang) to enable zero-shot task-oriented dialogue. Result: Achieves state-of-the-art performance on the STAR dataset among inference-based models, is competitive with similar baselines on MultiWOZ, and supports iterative improvement through manual and LLM-aided feedback. Conclusion: CoDial is a practical, interpretable, and modifiable tool for expert-guided alignment of LLMs in high-stakes domains. Abstract: It is often challenging to teach specialized, unseen tasks to dialogue systems due to the high cost of expert knowledge, training data, and high technical difficulty. To support domain-specific applications - such as law, medicine, or finance - it is essential to build frameworks that enable non-technical experts to define, test, and refine system behaviour with minimal effort. Achieving this requires cross-disciplinary collaboration between developers and domain specialists. In this work, we introduce a novel framework, CoDial (Code for Dialogue), that converts expert knowledge, represented as a novel structured heterogeneous graph, into executable conversation logic. CoDial can be easily implemented in existing guardrailing languages, such as Colang, to enable interpretable, modifiable, and true zero-shot specification of task-oriented dialogue systems. Empirically, CoDial achieves state-of-the-art performance on the STAR dataset for inference-based models and is competitive with similar baselines on the well-known MultiWOZ dataset. We also demonstrate CoDial's iterative improvement via manual and LLM-aided feedback, making it a practical tool for expert-guided alignment of LLMs in high-stakes domains.

[170] ImpRAG: Retrieval-Augmented Generation with Implicit Queries

Wenzheng Zhang,Xi Victoria Lin,Karl Stratos,Wen-tau Yih,Mingda Chen

Main category: cs.CL

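TL;DR: ImpRAG is a retrieval-augmented generation system that needs no explicit queries; by unifying retrieval and generation in one model, it markedly improves generalization across diverse tasks.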

Details

Motivation: Traditional RAG systems separate retrieval from generation, which limits generalization across diverse tasks. ImpRAG aims to remove the need for explicit queries by using a unified model. Method: ImpRAG divides a pretrained decoder-only language model into specialized layer groups and uses a two-stage inference process that optimizes retrieval and generation together with the same parameters and forward pass. Result: On 8 knowledge-intensive tasks, ImpRAG improves exact match scores on unseen tasks by 3.6-11.5. Conclusion: The analysis shows that balancing retrieval and generation parameters and using generation perplexities as retrieval training objectives are important for the performance gains. Abstract: Retrieval-Augmented Generation (RAG) systems traditionally treat retrieval and generation as separate processes, requiring explicit textual queries to connect them. This separation can limit the ability of models to generalize across diverse tasks. In this work, we propose a query-free RAG system, named ImpRAG, which integrates retrieval and generation into a unified model. ImpRAG allows models to implicitly express their information needs, eliminating the need for human-specified queries. By dividing pretrained decoder-only language models into specialized layer groups, ImpRAG optimizes retrieval and generation tasks simultaneously. Our approach employs a two-stage inference process, using the same model parameters and forward pass for both retrieval and generation, thereby minimizing the disparity between retrievers and language models. Experiments on 8 knowledge-intensive tasks demonstrate that ImpRAG achieves 3.6-11.5 improvements in exact match scores on unseen tasks with diverse formats, highlighting its effectiveness in enabling models to articulate their own information needs and generalize across tasks. Our analysis underscores the importance of balancing retrieval and generation parameters and leveraging generation perplexities as retrieval training objectives for enhanced performance.

[171] Sounding Like a Winner? Prosodic Differences in Post-Match Interviews

Sofoklis Kakouros,Haoyu Chen

Main category: cs.CL

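TL;DR: By analyzing prosodic features (such as pitch and intensity) of post-match tennis interviews together with self-supervised models (such as Wav2Vec 2.0 and HuBERT), the study explores whether match outcomes can be classified from interview recordings alone. Self-supervised representations separate winners from losers effectively, while prosodic cues such as pitch variability remain strong indicators of victory.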

Details

Motivation: To explore how prosodic features in post-match interviews relate to match outcome, and to test whether self-supervised models can classify wins and losses, shedding light on the link between emotional state and speech patterns. Method: Combines traditional acoustic features with deep speech representations (such as Wav2Vec 2.0 and HuBERT) and uses machine learning classifiers to distinguish winners from losers. Result: Self-supervised representations effectively distinguish wins from losses, and prosodic features such as pitch variability are strong indicators of victory. Conclusion: Prosodic features and self-supervised models can be used to classify match outcomes from post-match interviews, revealing a link between emotional state and speech patterns. Abstract: This study examines the prosodic characteristics associated with winning and losing in post-match tennis interviews. Additionally, this research explores the potential to classify match outcomes solely based on post-match interview recordings using prosodic features and self-supervised learning (SSL) representations. By analyzing prosodic elements such as pitch and intensity, alongside SSL models like Wav2Vec 2.0 and HuBERT, the aim is to determine whether an athlete has won or lost their match. Traditional acoustic features and deep speech representations are extracted from the data, and machine learning classifiers are employed to distinguish between winning and losing players. Results indicate that SSL representations effectively differentiate between winning and losing outcomes, capturing subtle speech patterns linked to emotional states. At the same time, prosodic cues -- such as pitch variability -- remain strong indicators of victory.

[172] LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback

Thai Hoang,Kung-Hsiang Huang,Shirley Kokane,Jianguo Zhang,Zuxin Liu,Ming Zhu,Jake Grigsby,Tian Lan,Michael S Ryoo,Chien-Sheng Wu,Shelby Heinecke,Huan Wang,Silvio Savarese,Caiming Xiong,Juan Carlos Niebles

Main category: cs.CL

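TL;DR: LAM SIMULATOR is a framework for AI agents that combines dynamic task generation, tool calling, and real-time feedback so that agents can explore tasks autonomously and produce high-quality training data, yielding large performance gains.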

Details

Motivation: Large Action Models (LAMs) lack high-quality training data for multi-step tasks, so an efficient way to generate such data and improve agent capability is needed. Method: Proposes the LAM SIMULATOR framework, comprising a dynamic task query generator, a tool collection, and an interactive environment in which LLM agents explore tasks autonomously and generate training data. Result: On the ToolBench and CRMArena benchmarks, models trained on self-generated data improve by up to 49.3% over their original baselines. Conclusion: LAM SIMULATOR is efficient and effective, reducing human involvement and speeding up the development of AI agents. Abstract: Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data, especially for multi-step tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback. This setup enables LLM Agents to explore and solve tasks autonomously, facilitating the discovery of multiple approaches to tackle any given task. The resulting action trajectory data are then used to create high-quality training datasets for LAMs. Our experiments on popular agentic benchmarks, ToolBench and CRMArena, highlight the effectiveness of LAM SIMULATOR: models trained with self-generated datasets using our framework achieve significant performance gains, up to a 49.3% improvement over their original baselines. LAM SIMULATOR requires minimal human input during dataset creation, highlighting LAM SIMULATOR's efficiency and effectiveness in speeding up development of AI agents.

[173] Explain-then-Process: Using Grammar Prompting to Enhance Grammatical Acceptability Judgments

Russell Scheinberg,Ameeta Agrawal,Amber Shore,So Young Lee

Main category: cs.CL

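TL;DR: The paper proposes "grammar prompting": a large language model (LLM) first explains the relevant grammatical rule, and that explanation is then given to the target model as context, which substantially improves its sentence acceptability judgments.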

Details

Motivation: Although large language models can explain grammatical rules, they often fail to apply those rules correctly in practice. Method: Uses an explain-then-process paradigm in which an LLM first produces a concise explanation of the grammatical phenomenon and the explanation is then passed as additional context to the target model (an LLM or a smaller language model, SLM). Result: The method yields substantial improvements on benchmarks in several languages and narrows the accuracy gap between LLMs and SLMs. Conclusion: Grammar prompting is a lightweight, language-agnostic method that helps low-cost SLMs approach frontier-LLM performance in multilingual settings. Abstract: Large language models (LLMs) can explain grammatical rules, yet they often fail to apply those rules when judging sentence acceptability. We present "grammar prompting", an explain-then-process paradigm: a large LLM first produces a concise explanation of the relevant syntactic phenomenon, then that explanation is fed back as additional context to the target model -- either an LLM or a smaller language model (SLM) -- before deciding which sentence of a minimal pair is grammatical. On the English BLiMP, Chinese SLING, and Russian RuBLiMP benchmarks, this simple prompt design yields substantial improvements over strong baselines across many syntactic phenomena. Feeding an LLM's metalinguistic explanation back to the target model bridges the gap between knowing a rule and using it. On SLMs, grammar prompting alone trims the average LLM-SLM accuracy gap by about 20%, and when paired with chain-of-thought, by 56% (13.0 pp -> 5.8 pp), all at negligible cost. The lightweight, language-agnostic cue lets low-cost SLMs approach frontier-LLM performance in multilingual settings.

[174] Quantifying Misattribution Unfairness in Authorship Attribution

Pegah Alipoormolabashi,Ajay Patel,Niranjan Balasubramanian

Main category: cs.CL

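TL;DR: The paper proposes MAUIk, a measure of unfairness in authorship attribution that captures each author's risk of being misattributed, and shows experimentally that existing models are markedly unfair, with some authors at higher risk than others.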

Details

Motivation: In forensic and other real-world settings, authorship misattribution can have serious consequences, yet existing evaluation measures do not account for fairness. Method: Introduces the Misattribution Unfairness Index (MAUIk), based on how often authors are ranked in the top k for texts they did not write, and uses it to assess five models on two datasets. Result: All models exhibit high levels of unfairness, with higher risk for some authors; the unfairness is related to how the models embed authors as vectors in the latent search space. Conclusion: The study highlights the potential for harm from authorship attribution models and recommends communicating with end users and calibrating them on misattribution risk when building and deploying such models. Abstract: Authorship misattribution can have profound consequences in real life. In forensic settings simply being considered as one of the potential authors of an evidential piece of text or communication can result in undesirable scrutiny. This raises a fairness question: Is every author in the candidate pool at equal risk of misattribution? Standard evaluation measures for authorship attribution systems do not explicitly account for this notion of fairness. We introduce a simple measure, Misattribution Unfairness Index (MAUIk), which is based on how often authors are ranked in the top k for texts they did not write. Using this measure we quantify the unfairness of five models on two different datasets. All models exhibit high levels of unfairness with increased risks for some authors. Furthermore, we find that this unfairness relates to how the models embed the authors as vectors in the latent search space. In particular, we observe that the risk of misattribution is higher for authors closer to the centroid (or center) of the embedded authors in the haystack. These results indicate the potential for harm and the need for communicating with and calibrating end users on misattribution risk when building and providing such models for downstream use.
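The quantity MAUIk builds on, namely how often each author lands in the top k for texts they did not write, can be counted directly. The sketch below computes only these raw per-author counts; how the paper aggregates them into a single index is not stated in the abstract, so no aggregation is attempted here. The function name `topk_misattribution_counts` and the toy rankings are hypothetical.

```python
from collections import Counter


def topk_misattribution_counts(rankings, true_authors, k):
    """Count, per author, how many query texts they did NOT write
    nevertheless rank them among the top-k candidates.

    rankings[i] is the model's ranked candidate list for query i;
    true_authors[i] is the actual author of query i.
    """
    counts = Counter()
    for ranked, true_author in zip(rankings, true_authors):
        for candidate in ranked[:k]:
            if candidate != true_author:
                counts[candidate] += 1
    return counts


# Toy example with three candidate authors and three query texts.
rankings = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
true_authors = ["a", "c", "b"]
counts = topk_misattribution_counts(rankings, true_authors, k=2)
print(dict(counts))  # {'b': 1, 'a': 2}
```

If every author carried the same risk, these counts would be roughly uniform across the candidate pool; a skewed distribution, as with author "a" here, is the kind of imbalance a fairness measure of this type is meant to expose.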

[175] Something Just Like TRuST : Toxicity Recognition of Span and Target

Berk Atil,Namrata Sureddy,Rebecca J. Passonneau

Main category: cs.CL

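TL;DR: TRuST is a comprehensive dataset for improving toxicity detection, with labels for toxicity, target social group, and toxic spans. Fine-tuned models outperform zero-shot and few-shot prompting on toxicity detection, but performance remains poor for certain social groups.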

Details

Motivation: Toxicity in online content has negative psychological and social effects, which calls for more comprehensive datasets and better detection methods. Method: Builds TRuST by merging existing datasets, with labels covering a diverse set of target groups, and evaluates large language models on toxicity detection, target group identification, and toxic span extraction. Result: Fine-tuned models outperform zero-shot and few-shot prompting but still perform poorly for certain social groups; reasoning capabilities do not significantly improve performance. Conclusion: LLMs have weak social reasoning skills, and toxicity detection models need further improvement, especially for specific social groups. Abstract: Toxicity in online content, including content generated by language models, has become a critical concern due to its potential for negative psychological and social impact. This paper introduces TRuST, a comprehensive dataset designed to improve toxicity detection that merges existing datasets, and has labels for toxicity, target social group, and toxic spans. It includes a diverse range of target groups such as ethnicity, gender, religion, disability, and politics, with both human/machine-annotated and human/machine-generated data. We benchmark state-of-the-art large language models (LLMs) on toxicity detection, target group identification, and toxic span extraction. We find that fine-tuned models consistently outperform zero-shot and few-shot prompting, though performance remains low for certain social groups. Further, reasoning capabilities do not significantly improve performance, indicating that LLMs have weak social reasoning skills.

[176] One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL

Hyungjoo Chae,Dongjin Kang,Jihyuk Kim,Beong-woo Kwak,Sunghyun Park,Haeju Park,Jinyoung Yeo,Moontae Lee,Kyungjae Lee

Main category: cs.CL

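TL;DR: The paper proposes a way to build a long chain-of-thought dataset independently of existing reasoning models: short-CoT LLMs generate the Long CoT Collection, whose quality is shown to be close to R1 and which proves useful as a starting point for reinforcement learning.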

Details

Motivation: To reduce reliance on existing large reasoning models (such as R1) and advance independent LRM development. Method: Uses short-CoT LLMs to build the Long CoT Collection dataset, with a pipeline that induces new reasoning strategies in those models. Result: The dataset's quality is comparable to, or slightly below, that of R1, and models initialized on it achieve 2-3x larger gains in reinforcement learning. Conclusion: Independently constructing a long-CoT dataset is feasible and effective, opening a new direction for LRM development. Abstract: With the release of R1, a publicly available large reasoning model (LRM), researchers commonly train new LRMs by training language models on R1's long chain-of-thought (CoT) inferences. While prior works show that LRMs' capabilities can be reproduced through direct distillation, the continued reliance on the existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies into short CoT LLMs, enabling them to think longer and introducing controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to--or slightly below--R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills, but also provides a strong foundation for reinforcement learning--models initialized on our data achieve 2-3x larger gains with RLVR.

[177] STORYTELLER: An Enhanced Plot-Planning Framework for Coherent and Cohesive Story Generation

Jiaming Li,Yukun Chen,Ziqiang Liu,Minghuan Tan,Lei Zhang,Yunshui Li,Run Luo,Longze Chen,Jing Luo,Ahmadreza Argha,Hamid Alinejad-Rokny,Wei Zhou,Min Yang

Main category: cs.CL

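TL;DR: The paper proposes Storyteller, which uses a plot node structure based on SVO triplets and two dynamic modules (STORYLINE and NEKG) to improve the coherence and logical consistency of automatically generated stories, clearly outperforming existing methods.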

Details

Motivation: Existing automatic story generation methods struggle with narrative coherence and logical consistency, which harms the storytelling experience and calls for improvement. Method: Introduces a plot node structure based on SVO triplets and integrates two dynamic modules, STORYLINE and NEKG, that interact continuously with the story generation process. Result: Storyteller achieves an 84.33% average win rate in human preference evaluation and also performs strongly on creativity, coherence, engagement, and relevance. Conclusion: Storyteller's systematic approach markedly improves the quality of automatic story generation and suggests a new direction for AI-assisted writing. Abstract: Stories are central to human culture, serving to share ideas, preserve traditions, and foster connections. Automatic story generation, a key advancement in artificial intelligence (AI), offers new possibilities for creating personalized content, exploring creative ideas, and enhancing interactive experiences. However, existing methods struggle to maintain narrative coherence and logical consistency. This disconnect compromises the overall storytelling experience, underscoring the need for substantial improvements. Inspired by human cognitive processes, we introduce Storyteller, a novel approach that systematically improves the coherence and consistency of automatically generated stories. Storyteller introduces a plot node structure based on linguistically grounded subject-verb-object (SVO) triplets, which capture essential story events and ensure a consistent logical flow. Unlike previous methods, Storyteller integrates two dynamic modules, the STORYLINE and narrative entity knowledge graph (NEKG), that continuously interact with the story generation process. This integration produces structurally sound, cohesive and immersive narratives. Extensive experiments demonstrate that Storyteller significantly outperforms existing approaches, achieving an 84.33% average win rate through human preference evaluation. At the same time, it is also far ahead in other aspects including creativity, coherence, engagement, and relevance.

[178] Truth over Tricks: Measuring and Mitigating Shortcut Learning in Misinformation Detection

Herun Wan,Jiaying Wu,Minnan Luo,Zhi Zeng,Zhixiong Su

Main category: cs.CL

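TL;DR: The paper proposes the TruthOverTricks evaluation paradigm, which shows that existing misinformation detectors depend on superficial cues, and introduces the SMF data augmentation framework to improve robustness.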

Details

Motivation: Misinformation detection models often rely on superficial cues that correlate with the training data and fail to generalize to real-world settings, a problem made worse by large language models (LLMs). Method: Proposes the TruthOverTricks evaluation paradigm, which categorizes shortcut behaviors and evaluates seven detectors, and develops SMF, an LLM-based data augmentation framework. Result: Existing detectors degrade severely under both naturally occurring and adversarially crafted shortcuts, while SMF improves robustness across 16 benchmarks. Conclusion: SMF reduces reliance on shortcut cues through data augmentation and supports the development of better misinformation detectors; the resources are publicly available. Abstract: Misinformation detection models often rely on superficial cues (i.e., shortcuts) that correlate with misinformation in training data but fail to generalize to the diverse and evolving nature of real-world misinformation. This issue is exacerbated by large language models (LLMs), which can easily generate convincing misinformation through simple prompts. We introduce TruthOverTricks, a unified evaluation paradigm for measuring shortcut learning in misinformation detection. TruthOverTricks categorizes shortcut behaviors into intrinsic shortcut induction and extrinsic shortcut injection, and evaluates seven representative detectors across 14 popular benchmarks, along with two new factual misinformation datasets, NQ-Misinfo and Streaming-Misinfo. Empirical results reveal that existing detectors suffer severe performance degradation when exposed to both naturally occurring and adversarially crafted shortcuts. To address this, we propose SMF, an LLM-augmented data augmentation framework that mitigates shortcut reliance through paraphrasing, factual summarization, and sentiment normalization. SMF consistently enhances robustness across 16 benchmarks, encouraging models to rely on deeper semantic understanding rather than shortcut cues. To promote the development of misinformation detectors, we have published the resources publicly at https://github.com/whr000001/TruthOverTricks.

[179] DIAMOND: An LLM-Driven Agent for Context-Aware Baseball Highlight Summarization

Jeonghun Kang,Soonmok Kwon,Joonseok Lee,Byung-Hak Kim

Main category: cs.CL

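TL;DR: DIAMOND is an LLM-based baseball highlight summarization system that combines structured sports analytics with natural language reasoning and substantially improves summary quality.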

Details

Motivation: Traditional approaches (such as WPA-based ranking or computer-vision event detection) lack strategic depth and narrative, while manual curation is costly and does not scale. Method: DIAMOND combines sabermetric features (Win Expectancy, WPA, and Leverage Index) with an LLM module to quantify play importance and weigh contextual narrative value. Result: On five Korean Baseball Organization League games, DIAMOND raises F1-score from 42.9% (WPA-only) to 84.8%, outperforming commercial and statistical baselines. Conclusion: Despite the limited scale, DIAMOND demonstrates the potential of modular, interpretable agent frameworks for sports event summarization. Abstract: Traditional approaches -- such as Win Probability Added (WPA)-based ranking or computer vision-driven event detection -- can identify scoring plays but often miss strategic depth, momentum shifts, and storyline progression. Manual curation remains the gold standard but is resource-intensive and not scalable. We introduce DIAMOND, an LLM-driven agent for context-aware baseball highlight summarization that integrates structured sports analytics with natural language reasoning. DIAMOND leverages sabermetric features -- Win Expectancy, WPA, and Leverage Index -- to quantify play importance, while an LLM module enhances selection based on contextual narrative value. This hybrid approach ensures both quantitative rigor and qualitative richness, surpassing the limitations of purely statistical or vision-based systems. Evaluated on five diverse Korean Baseball Organization League games, DIAMOND improves F1-score from 42.9% (WPA-only) to 84.8%, outperforming both commercial and statistical baselines. Though limited in scale, our results highlight the potential of modular, interpretable agent-based frameworks for event-level summarization in sports and beyond.
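
To make the WPA-only baseline and the F1 metric concrete, here is a minimal Python sketch. It assumes WPA for a play is the change in one team's win expectancy across that play, and that highlight selection is scored as set overlap with a reference list. The `Play` class, the field names, and the toy numbers are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Play:
    play_id: int
    we_before: float  # win expectancy before the play (hypothetical field)
    we_after: float   # win expectancy after the play (hypothetical field)


def wpa(play: Play) -> float:
    """Win Probability Added: change in win expectancy caused by the play."""
    return play.we_after - play.we_before


def top_k_by_wpa(plays: list[Play], k: int) -> set[int]:
    """WPA-only baseline: keep the k plays with the largest absolute swing."""
    ranked = sorted(plays, key=lambda p: abs(wpa(p)), reverse=True)
    return {p.play_id for p in ranked[:k]}


def f1_score(selected: set[int], reference: set[int]) -> float:
    """F1 between selected highlights and a reference highlight set."""
    true_positives = len(selected & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


plays = [
    Play(1, 0.50, 0.62),
    Play(2, 0.62, 0.60),
    Play(3, 0.60, 0.85),
    Play(4, 0.85, 0.80),
]
selected = top_k_by_wpa(plays, k=2)
score = f1_score(selected, reference={3, 4})
```

In this toy game the baseline picks the two largest swings but misses a reference highlight with a small swing, which is the kind of gap the LLM module is described as addressing.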

[180] AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output

Hisami Suzuki,Satoru Katsumata,Takashi Kodama,Tetsuro Takahashi,Kouta Nakayama,Satoshi Sekine

Main category: cs.CL

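TL;DR: AnswerCarefully is a dataset for improving the safety and appropriateness of Japanese LLM output. It contains 1,800 question-answer pairs that require special care and covers a wide range of risk categories. Fine-tuning a Japanese LLM on it improved output safety without reducing the utility of general responses.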

Details

Motivation: Address safety and cultural appropriateness in Japanese LLM output, which existing English-language datasets do not cover. Method: Manually create question-answer pairs that reflect the Japanese socio-cultural context and use them to fine-tune a Japanese LLM. Result: Fine-tuning improved output safety without compromising the utility of general responses; the paper also reports a safety evaluation of 12 Japanese LLMs. Conclusion: AnswerCarefully effectively improves the safety of Japanese LLMs and supports the development of similar datasets in other languages and regions. Abstract: In this paper we present AnswerCarefully, a dataset for promoting the safety and appropriateness of Japanese LLM outputs. The dataset consists of 1,800 pairs of questions and reference answers, where the questions require special attention in answering. It covers a wide range of risk categories established in prior English-language datasets, but the data samples are original in that they are manually created to reflect the socio-cultural context of LLM usage in Japan. We show that using this dataset for instruction to fine-tune a Japanese LLM led to improved output safety without compromising the utility of general responses. We also report the results of a safety evaluation of 12 Japanese LLMs using this dataset as a benchmark. Finally, we describe the latest update on the dataset which provides English translations and annotations of the questions, aimed at facilitating the derivation of similar datasets in different languages and regions.

[181] Exploring Explanations Improves the Robustness of In-Context Learning

Ukyo Honda,Tatsushi Oka

Main category: cs.CL

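TL;DR: The paper proposes X²-ICL, a framework that extends X-ICL by systematically exploring explanations for all possible labels, improving robustness and generalization.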

Details

Motivation: Existing ICL methods perform poorly on out-of-distribution data; X-ICL improves reliability by adding explanations but leaves room for improvement. Method: Extends X-ICL by systematically exploring explanations for all possible labels (X²-ICL) to support more comprehensive decisions. Result: Evaluated on multiple natural language understanding datasets, X²-ICL significantly improves robustness to out-of-distribution data. Conclusion: X²-ICL is an effective improvement that markedly strengthens generalization and robustness. Abstract: In-context learning (ICL) has emerged as a successful paradigm for leveraging large language models (LLMs). However, it often struggles to generalize beyond the distribution of the provided demonstrations. A recent advancement in enhancing robustness is ICL with explanations (X-ICL), which improves prediction reliability by guiding LLMs to understand and articulate the reasoning behind correct labels. Building on this approach, we introduce an advanced framework that extends X-ICL by systematically exploring explanations for all possible labels (X$^2$-ICL), thereby enabling more comprehensive and robust decision-making. Experimental results on multiple natural language understanding datasets validate the effectiveness of X$^2$-ICL, demonstrating significantly improved robustness to out-of-distribution data compared to the existing ICL approaches.

[182] Consultant Decoding: Yet Another Synergistic Mechanism

Chuanghao Ding,Jiaping Wang,Ziqing Yang,Xiaoliang Wang,Dahua Lin,Cam-Tu Nguyen,Fei Tan

Main category: cs.CL

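TL;DR: The paper proposes Consultant Decoding (CD), a new synergistic mechanism whose improved verification scheme substantially speeds up large language model (LLM) inference while preserving generation quality.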

Details

Motivation: The high rejection rate of Speculative Decoding (SD) forces repeated LLM calls to verify draft tokens, which hurts overall efficiency. Method: CD verifies candidate drafts using token-level likelihoods computed by the LLM, rather than SD's importance-sampling-derived metric. Result: CD speeds up inference by up to 2.5x, keeps generation quality at around 100% of the target model's, and sharply reduces how often the large target model is called. Conclusion: CD surpasses the performance upper bound of SD and performs well on demanding tasks, demonstrating its efficiency and practicality. Abstract: The synergistic mechanism based on Speculative Decoding (SD) has garnered considerable attention as a simple yet effective approach for accelerating the inference of large language models (LLMs). Nonetheless, the high rejection rates require repeated LLM calls to validate draft tokens, undermining the overall efficiency gain of SD. In this work, we revisit existing verification mechanisms and propose a novel synergistic mechanism, Consultant Decoding (CD). Unlike SD, which relies on a metric derived from importance sampling for verification, CD verifies candidate drafts using token-level likelihoods computed solely by the LLM. CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality (around 100% of the target model's performance). Interestingly, this is achieved by combining models whose parameter sizes differ by two orders of magnitude. In addition, CD reduces the call frequency of the large target model to below 10%, particularly in more demanding tasks. CD's performance was even found to surpass that of the large target model, which theoretically represents the upper bound for speculative decoding.
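
The contrast between the two verification styles can be sketched in Python. The first function is the standard speculative-sampling acceptance test (accept with probability min(1, p/q)). The second is only an illustrative stand-in for likelihood-based verification: it accepts a draft token when the target model's negative log-likelihood is under a threshold. The threshold `max_nll` and the exact acceptance rule are assumptions for illustration; the abstract does not give CD's actual criterion.

```python
import math
import random


def sd_accept(p_target: float, p_draft: float, rng: random.Random) -> bool:
    """Speculative-sampling test: accept with probability min(1, p_target / p_draft)."""
    return rng.random() < min(1.0, p_target / p_draft)


def likelihood_accept(p_target: float, max_nll: float = 2.0) -> bool:
    """Hypothetical likelihood-only test: uses the target model's probability alone.

    max_nll is an assumed threshold, not a value from the paper.
    """
    if p_target <= 0.0:
        return False
    return -math.log(p_target) <= max_nll


def accepted_prefix_len(target_probs: list[float], max_nll: float = 2.0) -> int:
    """Number of leading draft tokens kept before the first rejection."""
    kept = 0
    for p in target_probs:
        if not likelihood_accept(p, max_nll):
            break
        kept += 1
    return kept


# Target-model probabilities of four drafted tokens (toy values).
kept = accepted_prefix_len([0.9, 0.4, 0.05, 0.8])
```

Note that the likelihood-only test never looks at the draft model's probability, so a token the draft model was overconfident about can still be kept if the target model finds it plausible enough.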

[183] GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation

Yilin Xiao,Junnan Dong,Chuang Zhou,Su Dong,Qianwen Zhang,Di Yin,Xing Sun,Xiao Huang

Main category: cs.CL

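TL;DR: GraphRAG-Bench is a large-scale, domain-specific benchmark designed to comprehensively evaluate the reasoning ability of GraphRAG models and address gaps in existing evaluations.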

Details

Motivation: Current evaluations of GraphRAG models rely mainly on traditional question-answering datasets, which cannot fully assess improvements in reasoning ability. Method: Designs challenging questions, diverse task coverage, and a holistic evaluation framework. Result: Applying nine GraphRAG methods demonstrates the benchmark's utility in quantifying gains in model reasoning ability. Conclusion: GraphRAG-Bench offers the research community key insights and actionable guidance on graph architectures, retrieval efficacy, and reasoning capabilities. Abstract: Graph Retrieval Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However, current evaluations of GraphRAG models predominantly rely on traditional question-answering datasets. Their limited scope in questions and evaluation metrics fails to comprehensively assess the reasoning capacity improvements enabled by GraphRAG models. To address this gap, we introduce GraphRAG-Bench, a large-scale, domain-specific benchmark designed to rigorously evaluate GraphRAG models. Our benchmark offers three key advantages: (i) Challenging question design. Featuring college-level, domain-specific questions that demand multi-hop reasoning, the benchmark ensures that simple content retrieval is insufficient for problem-solving. For example, some questions require mathematical reasoning or programming. (ii) Diverse task coverage. The dataset includes a broad spectrum of reasoning tasks: multiple-choice, true/false, multi-select, open-ended, and fill-in-the-blank. It spans 16 disciplines in 20 core textbooks. (iii) Holistic evaluation framework. GraphRAG-Bench provides comprehensive assessment across the entire GraphRAG pipeline, including graph construction, knowledge retrieval, and answer generation. Beyond final-answer correctness, it evaluates the logical coherence of the reasoning process. By applying nine contemporary GraphRAG methods to GraphRAG-Bench, we demonstrate its utility in quantifying how graph-based structuring improves model reasoning capabilities. Our analysis reveals critical insights about graph architectures, retrieval efficacy, and reasoning capabilities, offering actionable guidance for the research community.

[184] SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning

Zhengyuan Liu,Geyu Lin,Hui Li Tan,Huayun Zhang,Yanfeng Lu,Xiaoxue Gao,Stella Xin Yin,He Sun,Hock Huan Goh,Lung Hsiang Wong,Nancy F. Chen

Main category: cs.CL

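TL;DR: SingaKids is a multilingual dialogic language-learning system that improves children's language learning through picture description tasks and an immersive design.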

Details

Motivation: Generative AI has great potential in educational applications, but multilingual and cultural adaptability and child-friendly design remain open problems. Method: Integrates dense image captioning, multilingual dialogic interaction, speech understanding, and speech generation, and improves the system through pre-training and optimization. Result: Empirical studies show that SingaKids is effective for learners at different performance levels. Conclusion: SingaKids provides an effective dialogic teaching tool for multilingual language learning by children. Abstract: The integration of generative artificial intelligence into educational applications has enhanced personalized and interactive learning experiences, and it shows strong potential to promote young learners' language acquisition. However, it is still challenging to ensure consistent and robust performance across different languages and cultural contexts, and kid-friendly design requires simplified instructions, engaging interactions, and age-appropriate scaffolding to maintain motivation and optimize learning outcomes. In this work, we introduce SingaKids, a dialogic tutor designed to facilitate language learning through picture description tasks. Our system integrates dense image captioning, multilingual dialogic interaction, speech understanding, and engaging speech generation to create an immersive learning environment in four languages: English, Mandarin, Malay, and Tamil. We further improve the system through multilingual pre-training, task-specific tuning, and scaffolding optimization. Empirical studies with elementary school students demonstrate that SingaKids provides effective dialogic teaching, benefiting learners at different performance levels.

[185] Gender Inequality in English Textbooks Around the World: an NLP Approach

Tairan Liu

Main category: cs.CL

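TL;DR: Using natural language processing, this study quantifies gender inequality in English textbooks from 22 countries and finds that male characters are consistently overrepresented in count, order of mention, and named entities.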

Details

Motivation: Prior studies have examined gender inequality in the textbooks of individual countries, but cross-cultural comparisons are lacking; this study fills that gap. Method: Applies natural language processing techniques, including character counts, order of first mention, TF-IDF word associations, named entity analysis, large language model tests, and GloVe embedding analysis. Result: Gender inequality appears in all regions, with male characters overrepresented along several dimensions; the Latin cultural sphere shows the least disparity. Conclusion: Gender inequality in textbooks is widespread, and cross-cultural study helps build a fuller understanding of it. Abstract: Textbooks play a critical role in shaping children's understanding of the world. While previous studies have identified gender inequality in individual countries' textbooks, few have examined the issue cross-culturally. This study applies natural language processing methods to quantify gender inequality in English textbooks from 22 countries across 7 cultural spheres. Metrics include character count, firstness (which gender is mentioned first), and TF-IDF word associations by gender. The analysis also identifies gender patterns in proper names appearing in TF-IDF word lists, tests whether large language models can distinguish between gendered word lists, and uses GloVe embeddings to examine how closely keywords associate with each gender. Results show consistent overrepresentation of male characters in terms of count, firstness, and named entities. All regions exhibit gender inequality, with the Latin cultural sphere showing the least disparity.

[186] Comparative Analysis of AI Agent Architectures for Entity Relationship Classification

Maryam Berijanian,Kuldeep Singh,Amin Sehati

Main category: cs.CL

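TL;DR: Compares three AI agent architectures on relation classification and finds that multi-agent coordination outperforms standard few-shot prompting and approaches the performance of fine-tuned models.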

Details

Motivation: Address the challenges of entity relationship classification in information extraction when labeled data is limited and relational structures are complex. Method: Analyzes three architectures (reflective self-evaluation, hierarchical task decomposition, and dynamic example generation) and compares their performance across multiple domains and model backends. Result: Multi-agent coordination performs best and approaches the performance of fine-tuned models. Conclusion: Offers practical guidance for designing LLM-based systems for structured relation extraction. Abstract: Entity relationship classification remains a challenging task in information extraction, especially in scenarios with limited labeled data and complex relational structures. In this study, we conduct a comparative analysis of three distinct AI agent architectures designed to perform relation classification using large language models (LLMs). The agentic architectures explored include (1) reflective self-evaluation, (2) hierarchical task decomposition, and (3) a novel multi-agent dynamic example generation mechanism, each leveraging different modes of reasoning and prompt adaptation. In particular, our dynamic example generation approach introduces real-time cooperative and adversarial prompting. We systematically compare their performance across multiple domains and model backends. Our experiments demonstrate that multi-agent coordination consistently outperforms standard few-shot prompting and approaches the performance of fine-tuned models. These findings offer practical guidance for the design of modular, generalizable LLM-based systems for structured relation extraction. The source code and dataset are available at https://github.com/maryambrj/ALIEN.git.

[187] From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models

Mahammed Kamruzzaman,Abdullah Al Monsur,Gene Louis Kim,Anshuman Chhabra

Main category: cs.CL

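TL;DR: The study finds that large language models (LLMs) exhibit emotional stereotypes when assigned personas of different nationalities, and that their emotion attributions differ markedly from human emotional responses.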

Details

Motivation: Examine whether LLMs exhibit emotional stereotypes when simulating personas of different nationalities, and whether their emotion attributions align with cultural norms. Method: Analyzes the emotions that pre-trained LLMs attribute to personas from different countries and compares them with human emotional responses. Result: LLM emotion attributions differ significantly by nationality, and the attribution of negative emotions (such as shame and fear) in particular is misaligned with human responses. Conclusion: LLM outputs contain reductive and potentially biased emotional stereotypes, and further work is needed to reduce cultural bias. Abstract: Emotions are a fundamental facet of human experience, varying across individuals, cultural contexts, and nationalities. Given the recent success of Large Language Models (LLMs) as role-playing agents, we examine whether LLMs exhibit emotional stereotypes when assigned nationality-specific personas. Specifically, we investigate how different countries are represented in pre-trained LLMs through emotion attributions and whether these attributions align with cultural norms. Our analysis reveals significant nationality-based differences, with emotions such as shame, fear, and joy being disproportionately assigned across regions. Furthermore, we observe notable misalignment between LLM-generated and human emotional responses, particularly for negative emotions, highlighting the presence of reductive and potentially biased stereotypes in LLM outputs.

[188] Should LLM Safety Be More Than Refusing Harmful Instructions?

Utsav Maskey,Mark Dras,Usman Naseem

Main category: cs.CL

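TL;DR: This paper systematically evaluates how large language models (LLMs) behave on long-tail (encrypted) text and the safety implications, proposes a two-dimensional safety evaluation framework, and shows experimentally that models able to decrypt ciphers may have safety vulnerabilities.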

Details

Motivation: Study the safety of LLMs on long-tail encrypted text, expose potential vulnerabilities, and point toward more robust safety mechanisms. Method: Proposes a two-dimensional safety evaluation framework (instruction refusal and generation safety) and tests experimentally how safe models with decryption capabilities are. Result: Models that can decrypt ciphers may be susceptible to mismatched-generalization attacks, leading to failed safety mechanisms or over-refusal. Conclusion: The work offers important insight into LLM safety in long-tail text scenarios and identifies directions for future safety mechanisms. Abstract: This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and their safety implications. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal: the ability to reject harmful obfuscated instructions, and (2) generation safety: the suppression of generating harmful responses. Through comprehensive experiments, we demonstrate that models that possess capabilities to decrypt ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding the safety of LLMs in long-tail text scenarios and provides directions for developing robust safety mechanisms.

[189] IP-Dialog: Evaluating Implicit Personalization in Dialogue Systems with Synthetic Data

Bo Peng,Zhiheng Wang,Heyang Gong,Chaochao Lu

Main category: cs.CL

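TL;DR: Proposes a new method for automatically generating synthetic data and introduces the IP-Dialog benchmark and training dataset, addressing the challenge of inferring user backgrounds in dialogue systems.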

Details

Motivation: Modern dialogue systems need to infer user backgrounds from conversations to personalize assistance, but high-quality data is scarce and traditional dataset construction raises privacy and resource concerns. Method: Proposes an automatic synthetic data generation method, builds the IP-Dialog benchmark and training dataset covering 10 tasks and 12 user attribute types, and develops a systematic evaluation framework. Result: Experiments confirm the reliability of the dataset and give insight into models' reasoning pathways during implicit personalization. Conclusion: The approach provides an effective tool for evaluating and improving the personalization abilities of dialogue systems. Abstract: In modern dialogue systems, the ability to implicitly infer user backgrounds from conversations and leverage this information for personalized assistance is crucial. However, the scarcity of high-quality data remains a fundamental challenge to evaluating and improving this capability. Traditional dataset construction methods are labor-intensive, resource-demanding, and raise privacy concerns. To address these issues, we propose a novel approach for automatic synthetic data generation and introduce the Implicit Personalized Dialogue (IP-Dialog) benchmark along with a training dataset, covering 10 tasks and 12 user attribute types. Additionally, we develop a systematic evaluation framework with four metrics to assess both attribute awareness and reasoning capabilities. We further propose five causal graphs to elucidate models' reasoning pathways during implicit personalization. Extensive experiments yield insightful observations and prove the reliability of our dataset.

[190] Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

Zhaorui Yang,Bo Pan,Han Wang,Yiyao Wang,Xingyu Liu,Minfeng Zhu,Bo Zhang,Wei Chen

Main category: cs.CL

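TL;DR: The paper proposes FDV, a structured textual representation that lets large language models learn from and generate diverse, high-quality visualizations, and builds on it the multimodal deep research framework Multimodal DeepResearcher.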

Details

Motivation: Existing deep research frameworks mostly produce text-only content, and the automated generation of interleaved text and visualizations is underexplored and raises design and integration challenges. Method: Proposes FDV as a structured textual representation of charts and develops the Multimodal DeepResearcher framework with four stages: researching, exemplar report textualization, planning, and multimodal report generation. Result: Using the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method. Conclusion: FDV and Multimodal DeepResearcher effectively address the challenges of multimodal report generation and improve how visualizations are integrated with text. Abstract: Visualizations play a crucial part in effective communication of concepts and information. Recent advances in reasoning and retrieval augmented generation have enabled Large Language Models (LLMs) to perform deep research and generate comprehensive reports. Despite this progress, existing deep research frameworks primarily focus on generating text-only content, leaving the automated generation of interleaved texts and visualizations underexplored. This novel task poses key challenges in designing informative visualizations and effectively integrating them with text reports. To address these challenges, we propose Formal Description of Visualization (FDV), a structured textual representation of charts that enables LLMs to learn from and generate diverse, high-quality visualizations. Building on this representation, we introduce Multimodal DeepResearcher, an agentic framework that decomposes the task into four stages: (1) researching, (2) exemplar report textualization, (3) planning, and (4) multimodal report generation. For the evaluation of generated multimodal reports, we develop MultimodalReportBench, which contains 100 diverse topics serving as inputs along with 5 dedicated metrics. Extensive experiments across models and evaluation methods demonstrate the effectiveness of Multimodal DeepResearcher. Notably, utilizing the same Claude 3.7 Sonnet model, Multimodal DeepResearcher achieves an 82% overall win rate over the baseline method.

[191] MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework

Yupeng Qi,Ziyu Lyu,Min Yang,Yanlin Wang,Lu Bai,Lixin Cui

Main category: cs.CL

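TL;DR: The paper proposes MidPO, a framework that uses a Mixture of Experts (MoE) approach to balance safety and helpfulness, substantially outperforming existing methods.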

Details

Motivation: As large language models (LLMs) are widely deployed, improving safety while preserving helpfulness has become a key challenge, and existing methods balance the two poorly. Method: MidPO first turns the base model into independent safety and helpfulness experts, then balances them adaptively through an MoE framework with a dynamic routing mechanism. Result: Experiments on three popular datasets show that MidPO significantly outperforms existing methods in both safety and helpfulness. Conclusion: MidPO's mixture-of-experts framework effectively addresses the safety-helpfulness trade-off and has practical potential. Abstract: As large language models (LLMs) are increasingly applied across various domains, enhancing safety while maintaining the helpfulness of LLMs has become a critical challenge. Recent studies solve this problem through safety-constrained online preference optimization or safety-constrained offline preference optimization. However, the safety-constrained online methods often suffer from excessive safety, which might reduce helpfulness, while the safety-constrained offline methods perform poorly in adaptively balancing safety and helpfulness. To address these limitations, we propose MidPO, a Mixture of Experts (MoE) framework for safety-helpfulness dual Preference Optimization. Firstly, MidPO devises a single-preference enhanced direct preference optimization approach to transform the base model into two independent experts, termed safety and helpfulness experts, and fine-tunes the two independent experts for optimal safety or helpfulness performance. Secondly, to achieve an effective balance between safety and helpfulness, MidPO incorporates the two experts into the MoE framework and designs a dynamic routing mechanism to allocate contributions from each expert adaptively. We conduct quantitative and qualitative experiments on three popular datasets to demonstrate that the proposed MidPO significantly outperforms state-of-the-art approaches in both safety and helpfulness. The code and models will be released.
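
A dynamic routing step over two experts can be pictured with a short Python sketch. It assumes a generic softmax gate that turns two router logits into mixing weights and blends the experts' output vectors; this is a common MoE pattern used here for illustration, not MidPO's actual router, and all names are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(
    router_logits: list[float],
    safety_out: list[float],
    helpful_out: list[float],
) -> list[float]:
    """Blend the two experts' outputs with input-dependent gate weights.

    router_logits holds one score per expert: [safety, helpfulness].
    """
    w_safety, w_helpful = softmax(router_logits)
    return [w_safety * s + w_helpful * h for s, h in zip(safety_out, helpful_out)]


# Equal logits give an even blend of the two experts.
blended = route([0.0, 0.0], safety_out=[1.0, 0.0], helpful_out=[0.0, 1.0])

# A strongly safety-leaning router output is dominated by the safety expert.
safety_heavy = route([10.0, -10.0], safety_out=[1.0, 0.0], helpful_out=[0.0, 1.0])
```

Because the logits depend on the input, a risky prompt can lean on the safety expert while a benign one leans on the helpfulness expert, which is the adaptive balance the abstract describes.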

[192] XToM: Exploring the Multilingual Theory of Mind for Large Language Models

Chunkit Chan,Yauwai Yim,Hongchuan Zeng,Zhiying Zou,Xinyuan Cheng,Zhifan Sun,Zheye Deng,Kawai Chung,Yuzhuo Ao,Yixiang Fan,Cheng Jiayang,Ercong Nie,Ginny Y. Wong,Helmut Schmid,Hinrich Schütze,Simon See,Yangqiu Song

Main category: cs.CL

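TL;DR: XToM is a multilingual benchmark for evaluating the theory-of-mind abilities of LLMs across languages; it finds that LLMs excel at language understanding but their theory-of-mind performance varies by language.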

Details

Motivation: Existing evaluations of theory of mind in LLMs are largely limited to English and ignore how linguistic diversity shapes human cognition, so it remains unclear whether LLMs have cross-lingual theory-of-mind abilities. Method: Develops the XToM benchmark, covering five languages and diverse task scenarios, and systematically evaluates LLMs (e.g., DeepSeek R1). Result: LLMs excel at multilingual language understanding, but their theory-of-mind performance differs markedly across languages. Conclusion: LLMs are limited in their ability to replicate human-like theory of mind across linguistic contexts. Abstract: Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts.

[193] FroM: Frobenius Norm-Based Data-Free Adaptive Model Merging

Zijian Li,Xiaocheng Feng,Huixin Liu,Yichong Huang,Ting Liu,Bing Qin

Main category: cs.CL

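TL;DR: This paper proposes FroM, an improved model merging method that measures model parameters directly with the Frobenius norm, needs no training data, and effectively alleviates task interference.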

Details

Motivation: Traditional model merging methods are prone to task interference in parameter-efficient fine-tuning scenarios and need improvement. Method: Proposes FroM, which measures parameters directly with the Frobenius norm and introduces a hyperparameter to control the merging process. Result: FroM outperforms baseline methods across various fine-tuning scenarios and markedly reduces task interference. Conclusion: FroM offers an efficient, data-free solution for model merging. Abstract: With the development of large language models, fine-tuning has emerged as an effective method to enhance performance in specific scenarios by injecting domain-specific knowledge. In this context, model merging techniques provide a solution for fusing knowledge from multiple fine-tuned models by combining their parameters. However, traditional methods often encounter task interference when merging fully fine-tuned models, and this problem becomes even more evident in parameter-efficient fine-tuning scenarios. In this paper, we introduce an improvement to the RegMean method, which indirectly leverages the training data to approximate the outputs of the linear layers before and after merging. We propose an adaptive merging method called FroM, which directly measures the model parameters using the Frobenius norm, without any training data. By introducing an additional hyperparameter for control, FroM outperforms baseline methods across various fine-tuning scenarios, alleviating the task interference problem.
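
As a rough illustration of data-free, norm-based merging, the Python sketch below weights each model's parameter delta by its Frobenius norm raised to a power `alpha` and takes the weighted average. Both the weighting rule and the role of `alpha` are assumptions made for this example; the abstract does not state FroM's actual formula or what its hyperparameter controls.

```python
import math

Matrix = list[list[float]]


def frobenius_norm(matrix: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def merge(task_deltas: list[Matrix], alpha: float = 1.0) -> tuple[Matrix, list[float]]:
    """Average same-shaped parameter deltas with norm-based weights.

    alpha is a hypothetical control knob: alpha=0 reduces to plain averaging,
    larger values favor deltas with larger Frobenius norm.
    """
    scores = [frobenius_norm(d) ** alpha for d in task_deltas]
    total = sum(scores)
    if total == 0.0:
        # All deltas are zero: fall back to uniform weights.
        weights = [1.0 / len(task_deltas)] * len(task_deltas)
    else:
        weights = [s / total for s in scores]
    rows, cols = len(task_deltas[0]), len(task_deltas[0][0])
    merged = [
        [sum(w * d[i][j] for w, d in zip(weights, task_deltas)) for j in range(cols)]
        for i in range(rows)
    ]
    return merged, weights


delta_a = [[3.0, 4.0]]  # Frobenius norm 5
delta_b = [[0.0, 5.0]]  # Frobenius norm 5
merged, weights = merge([delta_a, delta_b])
```

Nothing in this computation touches training data: the weights come from the parameters alone, which is the property that separates this family of methods from RegMean-style approaches.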

[194] ORPP: Self-Optimizing Role-playing Prompts to Enhance Language Model Capabilities

Yifan Duan,Yihong Tang,Kehai Chen,Liqiang Nie,Min Zhang

Main category: cs.CL

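TL;DR: ORPP (Optimized Role-Playing Prompt) is a framework that improves large language model performance by optimizing and generating role-playing prompts; its core idea is to confine the prompt search space to role-playing scenarios.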

Details

Motivation: Existing prompt optimization methods either incur high computational overhead or depend on the model's own optimization ability, which limits their applicability. Method: ORPP iteratively optimizes on a small set of samples to generate high-quality role-playing prompts, then uses the model's few-shot learning ability to transfer that optimization experience to the remaining samples. Result: Experiments show that ORPP matches and in most cases surpasses existing mainstream methods, and has strong plug-and-play capability. Conclusion: By optimizing role-playing prompts, ORPP markedly improves model performance and is broadly compatible with other methods. Abstract: High-quality prompts are crucial for eliciting outstanding performance from large language models (LLMs) on complex tasks. Existing research has explored model-driven strategies for prompt optimization. However, these methods often suffer from high computational overhead or require strong optimization capabilities from the model itself, which limits their broad applicability. To address these challenges, we propose ORPP (Optimized Role-Playing Prompt), a framework that enhances model performance by optimizing and generating role-playing prompts. The core idea of ORPP is to confine the prompt search space to role-playing scenarios, thereby fully activating the model's intrinsic capabilities through carefully crafted, high-quality role-playing prompts. Specifically, ORPP first performs iterative optimization on a small subset of training samples to generate high-quality role-playing prompts. Then, leveraging the model's few-shot learning capability, it transfers the optimization experience to efficiently generate suitable prompts for the remaining samples. Our experimental results show that ORPP not only matches but in most cases surpasses existing mainstream prompt optimization methods in terms of performance. Notably, ORPP demonstrates superior "plug-and-play" capability. In most cases, it can be integrated with various other prompt methods and further enhance their effectiveness.

[195] Do Language Models Think Consistently? A Study of Value Preferences Across Varying Response Lengths

Inderjeet Nair,Lu Wang

Main category: cs.CL

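TL;DR: The study finds that LLM value preferences inferred from short-form and long-form tests correlate only weakly, and that consistency across long-form generation settings is also low. Alignment improves the consistency of value expression only slightly.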

Details

Motivation: Examine whether LLM value preferences measured by short-form tests match those expressed in long-form, real-world use, a gap in existing research. Method: Compares value preferences of five LLMs (llama3-8b and others) in short-form and long-form responses, and analyzes how the number of arguments and generation attributes relate to preferences. Result: Short-form and long-form preferences correlate weakly; consistency across long-form generation settings is low; alignment helps only modestly; argument specificity correlates negatively with preference strength, while representation across scenarios correlates positively. Conclusion: More robust methods are needed to ensure consistent value expression by LLMs across applications. Abstract: Evaluations of LLMs' ethical risks and value inclinations often rely on short-form surveys and psychometric tests, yet real-world use involves long-form, open-ended responses -- leaving value-related risks and preferences in practical settings largely underexplored. In this work, we ask: Do value preferences inferred from short-form tests align with those expressed in long-form outputs? To address this question, we compare value preferences elicited from short-form reactions and long-form responses, varying the number of arguments in the latter to capture users' differing verbosity preferences. Analyzing five LLMs (llama3-8b, gemma2-9b, mistral-7b, qwen2-7b, and olmo-7b), we find (1) a weak correlation between value preferences inferred from short-form and long-form responses across varying argument counts, and (2) similarly weak correlation between preferences derived from any two distinct long-form generation settings. (3) Alignment yields only modest gains in the consistency of value expression. Further, we examine how long-form generation attributes relate to value preferences, finding that argument specificity negatively correlates with preference strength, while representation across scenarios shows a positive correlation. Our findings underscore the need for more robust methods to ensure consistent value expression across diverse applications.
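
One way to quantify agreement between two sets of value preferences is a rank correlation over per-value scores. The Python sketch below implements Spearman's coefficient as the Pearson correlation of average ranks. Using Spearman here is an assumption, since the abstract does not say which correlation measure the authors used, and the scores are made-up toy values.

```python
import math


def ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation between two score lists."""
    return pearson(ranks(x), ranks(y))


# Hypothetical preference scores for four values under two elicitation settings.
short_form = [0.9, 0.4, 0.7, 0.1]
long_form = [0.2, 0.8, 0.3, 0.6]
rho = spearman(short_form, long_form)
```

A coefficient near 1 would mean both settings order the values the same way; values near 0 correspond to the weak agreement the paper reports.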

[196] Enhancing Large Language Models with Neurosymbolic Reasoning for Multilingual Tasks

Sina Bagheri Nezhad,Ameeta Agrawal

Main category: cs.CL

TL;DR: NSAR combines neural and symbolic reasoning, substantially improving multi-target reasoning and outperforming existing methods.

Details

Motivation: Address the difficulty large language models have with multi-target reasoning over long texts. Method: Combines neural and symbolic reasoning by extracting symbolic facts and generating Python code. Result: Across seven languages and different context lengths, NSAR outperforms baseline methods. Conclusion: Combining neural and symbolic reasoning is effective and scalable in multilingual settings. Abstract: Large language models (LLMs) often struggle to perform multi-target reasoning in long-context scenarios where relevant information is scattered across extensive documents. To address this challenge, we introduce NeuroSymbolic Augmented Reasoning (NSAR), which combines the benefits of neural and symbolic reasoning during inference. NSAR explicitly extracts symbolic facts from text and generates executable Python code to handle complex reasoning steps. Through extensive experiments across seven languages and diverse context lengths, we demonstrate that NSAR significantly outperforms both a vanilla RAG baseline and advanced prompting strategies in accurately identifying and synthesizing multiple pieces of information. Our results highlight the effectiveness of combining explicit symbolic operations with neural inference for robust, interpretable, and scalable reasoning in multilingual settings.
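
The idea of answering a multi-target question from extracted symbolic facts can be illustrated with a small Python sketch. The `Fact` triple format, the relation name, and the example data are hypothetical; the abstract does not specify NSAR's fact schema or the code it generates, and in the described system the facts would be produced by the LLM rather than written by hand.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str


def find_subjects(facts: list[Fact], relation: str, obj: str) -> list[str]:
    """Return every subject linked to obj by relation, in sorted order."""
    return sorted(f.subject for f in facts if f.relation == relation and f.obj == obj)


# Facts as they might be extracted from passages scattered across a long document.
facts = [
    Fact("Alice", "lives_in", "Paris"),
    Fact("Bob", "lives_in", "Berlin"),
    Fact("Carol", "lives_in", "Paris"),
]

# Multi-target question: who lives in Paris?
answer = find_subjects(facts, "lives_in", "Paris")
```

Once the facts are explicit, collecting all matching targets is a deterministic filter, so the step that long-context models tend to get wrong (finding every relevant mention) no longer depends on attention over the full document.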

[197] Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text

Junzhe Zhang,Huixuan Zhang,Xinyu Hu,Li Lin,Mingqi Gao,Shi Qiu,Xiaojun Wan

Main category: cs.CL

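TL;DR: The paper presents the Minos-Corpus dataset and the Minos model for evaluating multimodal generation tasks, combining human and GPT evaluation data and performing strongly on T2I tasks.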

Details

Motivation: Existing work overlooks evaluation capabilities for T2I generation tasks and the incorporation of large-scale human evaluation data. Method: Introduces the Minos-Corpus dataset, which combines human and GPT evaluation data; applies Data Selection and Balance and Mix-SFT training methods, then DPO, to develop the Minos model. Result: Minos outperforms all open-source and closed-source models on T2I evaluation and achieves the best average performance across tasks among open-source models of similar scale. Conclusion: High-quality human evaluation data and joint training on I2T and T2I evaluation data are essential for improving evaluation performance. Abstract: Evaluation is important for multimodal generation tasks. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing work overlooks two aspects: (1) the development of evaluation capabilities for the text-to-image (T2I) generation task, and (2) the incorporation of large-scale human evaluation data. In this paper, we introduce Minos-Corpus, a large-scale multimodal evaluation dataset that combines evaluation data from both human and GPT. The corpus contains evaluation data across both image-to-text (I2T) and T2I generation tasks. Based on this corpus, we propose Data Selection and Balance, Mix-SFT training methods, and apply DPO to develop Minos, a multimodal evaluation model built upon a 7B backbone. Minos achieves state-of-the-art (SoTA) performance among all open-source evaluation models of similar scale on the average of evaluation performance on all tasks, and outperforms all open-source and closed-source models on evaluation of the T2I generation task. Extensive experiments demonstrate the importance of leveraging high-quality human evaluation data and jointly training on evaluation data from both I2T and T2I generation tasks.

[198] KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG

Yongjian Li,HaoCheng Chu,Yukun Yan,Zhenghao Liu,Shi Yu,Zheni Zeng,Ruobing Wang,Sen Song,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

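TL;DR: KARE-RAG substantially improves RAG pipelines through structured knowledge representations, a refined training objective, and contrastive data generation, particularly in handling noisy content and correcting critical errors.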

Details

Motivation: RAG broadens the knowledge sources available to LLMs, but noise in retrieved documents causes factual inconsistencies, so models need to become better at handling noisy content. Method: KARE-RAG introduces structured knowledge representations, the DDPO training objective, and contrastive data generation to improve knowledge utilization and error correction. Result: Experiments show that KARE-RAG significantly improves RAG pipelines on both in-domain and out-of-domain tasks, with high data efficiency. Conclusion: By improving how models process retrieved content, KARE-RAG opens a new direction for RAG optimization; the code and data will be released. Abstract: Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access broader knowledge sources, yet factual inconsistencies persist due to noise in retrieved documents, even with advanced retrieval methods. We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance. In this paper, we present KARE-RAG (Knowledge-Aware Refinement and Enhancement for RAG), which improves knowledge utilization through three key innovations: (1) structured knowledge representations that facilitate error detection during training, (2) Dense Direct Preference Optimization (DDPO), a refined training objective that prioritizes correction of critical errors, and (3) a contrastive data generation pipeline that maintains semantic consistency while rectifying factual inaccuracies. Experiments show our method significantly enhances standard RAG pipelines across model scales, improving both in-domain and out-of-domain task performance without compromising general capabilities. Notably, these gains are achieved with modest training data, suggesting data-efficient optimization is possible through targeted learning strategies. Our findings establish a new direction for RAG improvement: by improving how models learn to process retrieved content, we can enhance performance across diverse inference paradigms. All data and code will be publicly available on GitHub.

[199] M$^3$FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset

Jie Zhu,Junhui Li,Yalong Wen,Xiandong Li,Lifan Guo,Feng Chen

Main category: cs.CL

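TL;DR: Proposes M$^3$FinMeeting, a multilingual, multi-sector, multi-task benchmark for financial meeting understanding that fills a gap in existing financial evaluations.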

Details

Motivation: Existing financial benchmarks rely mainly on news or reports and struggle to capture the dynamics of real financial meetings, so a more comprehensive evaluation tool is needed. Method: Builds a multilingual dataset covering English, Chinese, and Japanese, spanning GICS industry sectors, with three tasks: summarization, QA pair extraction, and question answering. Result: Experiments show that even advanced long-context models have significant room for improvement in financial meeting understanding. Conclusion: M$^3$FinMeeting is an effective benchmark for assessing LLMs' financial meeting comprehension and can help drive model progress. Abstract: Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, making it challenging to capture the real-world dynamics of financial meetings. To address this gap, we propose a novel benchmark called M$^3$FinMeeting, which is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, M$^3$FinMeeting supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, M$^3$FinMeeting includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs reveal that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of M$^3$FinMeeting as a benchmark for assessing LLMs' financial meeting comprehension skills.

[200] FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

Zhuohan Xie,Dhruv Sahnan,Debopriyo Banerjee,Georgi Georgiev,Rushil Thareja,Hachem Madmoun,Jinyan Su,Aaryamonvikram Singh,Yuxia Wang,Rui Xing,Fajri Koto,Haonan Li,Ivan Koychev,Tanmoy Chakraborty,Salem Lahlou,Veselin Stoyanov,Preslav Nakov

Main category: cs.CL

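TL;DR: FinChain is the first symbolic benchmark for verifiable chain-of-thought (CoT) financial reasoning, filling the gap left by existing datasets that do not evaluate intermediate reasoning steps.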

Details

Motivation: Existing financial datasets such as FinQA and ConvFinQA supervise only the final numerical answer and offer no systematic evaluation of intermediate reasoning steps. Method: FinChain spans 54 topics across 12 financial domains, with five parameterized templates per topic; each instance includes an executable Python trace, which supports automatic generation of training data. The authors also propose ChainEval, a metric that automatically evaluates both final answers and intermediate reasoning. Result: Benchmarking 30 LLMs shows that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. Conclusion: FinChain provides a systematic evaluation tool for financial reasoning and exposes the limitations of current models. Abstract: Multi-step symbolic reasoning is critical for advancing downstream performance on financial tasks. Yet, benchmarks for systematically evaluating this capability are lacking. Existing datasets like FinQA and ConvFinQA supervise only final numerical answers, without assessing intermediate reasoning steps. To address this, we introduce FinChain, the first symbolic benchmark designed for verifiable Chain-of-Thought (CoT) financial reasoning. Spanning 54 topics across 12 financial domains, FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Each dataset instance includes an executable Python trace, enabling automatic generation of extensive training data and easy adaptation to other domains. We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. All templates and evaluation metrics for FinChain are available at https://github.com/mbzuai-nlp/finchain.
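
To make the idea of a parameterized template with an executable trace more concrete, here is a minimal sketch. The compound-interest template, its parameter ranges, and the names `Step` and `compound_interest_instance` are hypothetical illustrations and are not taken from the FinChain repository.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    """One verifiable intermediate step of a reasoning trace."""
    description: str
    value: float


def compound_interest_instance(seed: int) -> tuple[str, list[Step]]:
    """Sample parameters for a hypothetical template; return (question, trace)."""
    rng = random.Random(seed)  # seeded, so each instance is reproducible
    principal = rng.randrange(1_000, 10_001, 500)
    rate_pct = rng.choice([2, 3, 4, 5])
    years = rng.randint(2, 5)

    question = (
        f"An account holds ${principal} at {rate_pct}% annual interest, "
        f"compounded yearly. What is the balance after {years} years?"
    )

    # Every step is computed by code, so a model's intermediate values can be
    # checked against the trace instead of only checking the final answer.
    growth = 1 + rate_pct / 100
    factor = growth ** years
    balance = round(principal * factor, 2)
    trace = [
        Step("annual growth factor", growth),
        Step(f"growth factor over {years} years", factor),
        Step("final balance", balance),
    ]
    return question, trace
```

Because the parameters come from a seeded generator, the same template can yield many distinct instances, each with a ground-truth trace that an automatic metric could compare against a model's reasoning.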

[201] Learning Together to Perform Better: Teaching Small-Scale LLMs to Collaborate via Preferential Rationale Tuning

Sohan Patnaik,Milan Aggarwal,Sumit Bhatia,Balaji Krishnamurthy

Main category: cs.CL

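TL;DR: The COLLATE framework improves the reasoning ability of small LLMs through diverse rationale generation and preference optimization, without relying on larger models.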

Details

Motivation: Copyright and legal issues can rule out the use of large LLMs, so the goal is to improve the innate reasoning ability of small models without distilling from larger ones. Method: COLLATE generates diverse candidate rationales and uses preference optimization to select the rationale that best supports the ground-truth answer. Result: Outperforms baseline methods on datasets from multiple domains. Conclusion: COLLATE offers an effective way to improve the standalone reasoning ability of small LLMs. Abstract: LLMs such as GPT-4 have shown a remarkable ability to solve complex questions by generating step-by-step rationales. Prior works have utilized this capability to improve smaller and cheaper LMs (say, with 7B parameters). However, various practical constraints, such as copyright and legal issues, owing to lack of transparency in the pre-training data of large (often closed) models, prevent their use in commercial settings. Little focus has been given to improving the innate reasoning ability of smaller models without distilling information from larger LLMs. To address this, we propose COLLATE, a trainable framework that tunes a (small) LLM to generate those outputs from a pool of diverse rationales that selectively improve the downstream task. COLLATE enforces multiple instances of the same LLM to exhibit distinct behavior and employs them to generate rationales to obtain diverse outputs. The LLM is then tuned via preference optimization to choose the candidate rationale which maximizes the likelihood of the ground-truth answer. COLLATE outperforms several trainable and prompting baselines on 5 datasets across 3 domains: maths problem solving, natural language inference, and commonsense reasoning. We show the efficacy of COLLATE on LLMs from different model families across varying parameter scales (1B to 8B) and demonstrate the benefit of multiple rationale providers guided by the end task through ablations. Code is released here (https://github.com/Sohanpatnaik106/collate).

[202] Multilingual Information Retrieval with a Monolingual Knowledge Base

Yingying Zhuang,Aman Gupta,Anurag Beniwal

Main category: cs.CL

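TL;DR: Proposes fine-tuning multilingual embedding models with weighted sampling for cross-lingual information retrieval, with substantial performance gains.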

Details

Motivation: High-quality knowledge bases are scarce and available in few languages, so an effective embedding model is needed for cross-lingual knowledge sharing. Method: Fine-tunes multilingual embedding models with a weighted-sampling contrastive learning strategy. Result: The weighted sampling strategy improves on standard sampling by up to 31.03% in MRR and up to 33.98% in Recall@3. Conclusion: The method is language agnostic and applies to both multilingual and code-switching scenarios. Abstract: Multilingual information retrieval has emerged as a powerful tool for expanding knowledge sharing across languages. On the other hand, high-quality knowledge base resources are often scarce and available in only a few languages. An effective embedding model that maps sentences from different languages into the same feature vector space as the knowledge base language therefore becomes the key ingredient for cross-language knowledge sharing, especially for transferring knowledge available in high-resource languages to low-resource ones. In this paper we propose a novel strategy to fine-tune multilingual embedding models with weighted sampling for contrastive learning, enabling multilingual information retrieval with a monolingual knowledge base. We demonstrate that the weighted sampling strategy produces performance gains compared to standard ones by up to 31.03% in MRR and up to 33.98% in Recall@3. Additionally, our proposed methodology is language agnostic and applicable to both multilingual and code-switching use cases.
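
The two metrics quoted above have standard definitions, sketched below for reference. This sketch assumes exactly one relevant document per query, which the abstract does not state; the document IDs are made up for illustration.

```python
def mean_reciprocal_rank(ranked: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the relevant document; contributes 0 if it is missing."""
    total = 0.0
    for docs, gold in zip(ranked, relevant):
        if gold in docs:
            total += 1.0 / (docs.index(gold) + 1)  # ranks are 1-based
    return total / len(ranked)


def recall_at_k(ranked: list[list[str]], relevant: list[str], k: int = 3) -> float:
    """Fraction of queries whose relevant document appears in the top k results."""
    hits = sum(gold in docs[:k] for docs, gold in zip(ranked, relevant))
    return hits / len(ranked)


# Three toy queries: relevant doc at rank 1, at rank 4, and not retrieved at all.
ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6", "d7"], ["d8", "d9"]]
relevant = ["d1", "d7", "d0"]
print(mean_reciprocal_rank(ranked, relevant))  # (1 + 1/4 + 0) / 3
print(recall_at_k(ranked, relevant, k=3))      # only the first query hits: 1/3
```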

[203] ReasoningFlow: Semantic Structure of Complex Reasoning Traces

Jinu Lee,Sagnik Mukherjee,Dilek Hakkani-Tur,Julia Hockenmaier

Main category: cs.CL

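TL;DR: ReasoningFlow is a unified schema for analyzing the complex reasoning traces produced by large reasoning models; it parses traces into directed acyclic graphs so that distinct reasoning patterns can be identified.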

Details

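Motivation: The reasoning traces generated by large reasoning models are complex and varied, and a unified method is needed to analyze and understand their semantic structure. Method: Proposes the ReasoningFlow schema, which parses reasoning traces into directed acyclic graphs and characterizes distinct reasoning patterns as subgraph structures. Result: ReasoningFlow provides a human-interpretable representation that helps in understanding and evaluating the reasoning processes of large reasoning models. Conclusion: ReasoningFlow is a promising tool for analyzing and improving the reasoning processes of large reasoning models. Abstract: Large reasoning models (LRMs) generate complex reasoning traces with planning, reflection, verification, and backtracking. In this work, we introduce ReasoningFlow, a unified schema for analyzing the semantic structures of these complex traces. ReasoningFlow parses traces into directed acyclic graphs, enabling the characterization of distinct reasoning patterns as subgraph structures. This human-interpretable representation offers promising applications in understanding, evaluating, and enhancing the reasoning processes of LRMs.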
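
As a rough illustration of representing a trace as a directed acyclic graph and matching a pattern as a subgraph, consider the sketch below. The node labels, the edge encoding, and the helper names are assumptions made for this example; the abstract does not give the actual ReasoningFlow label set.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical step labels for a four-step trace.
nodes = {
    "s1": "planning",
    "s2": "deduction",
    "s3": "verification",
    "s4": "conclusion",
}
# edges[child] is the set of earlier steps that the child step depends on.
edges = {"s2": {"s1"}, "s3": {"s2"}, "s4": {"s2", "s3"}}


def is_dag(graph: dict[str, set[str]]) -> bool:
    """A graph is acyclic exactly when a topological order exists."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


def find_pattern(nodes, edges, parent_label, child_label):
    """Return (parent, child) pairs whose labels match a two-node subgraph pattern."""
    return sorted(
        (parent, child)
        for child, parents in edges.items()
        for parent in parents
        if nodes[parent] == parent_label and nodes[child] == child_label
    )
```

For instance, `find_pattern(nodes, edges, "deduction", "verification")` locates every place where a deduction is followed by a check of that deduction, which is the kind of pattern a subgraph-based analysis could count across traces.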

[204] Natural Language Processing to Enhance Deliberation in Political Online Discussions: A Survey

Maike Behrendt,Stefan Sylvius Wagner,Carina Weinmann,Marike Bormann,Mira Warne,Stefan Harmeling

Main category: cs.CL

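TL;DR: Surveys the problems that arise in political online discussions and how machine learning can be used to improve discussion quality.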

Details

Motivation: As political discussion moves online, high-quality discussion and participation processes matter for decision-making, yet their quality depends heavily on platform design. Method: Reviews machine learning methods for analyzing and addressing problems in political online discussions. Result: Shows the potential of machine learning to raise discussion quality and foster deliberative exchange. Conclusion: Machine learning can help improve the design and quality of political online discussions. Abstract: Political online participation in the form of discussing political issues and exchanging opinions among citizens is gaining importance with more and more formats being held digitally. To come to a decision, a careful discussion and consideration of opinions and a civil exchange of arguments, which is defined as the act of deliberation, is desirable. The quality of discussions and participation processes in terms of their deliberativeness highly depends on the design of platforms and processes. To facilitate online communication for both participants and initiators, machine learning methods offer a lot of potential. In this work we want to showcase which issues occur in political online discussions and how machine learning can be used to counteract these issues and enhance deliberation.

[205] Answer Convergence as a Signal for Early Stopping in Reasoning

Xin Liu,Lu Wang

Main category: cs.CL

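TL;DR: Finds that LLM reasoning contains substantial redundancy: on math reasoning tasks, models typically settle on their final answer after about 60% of the reasoning steps. Proposes three inference-time strategies that cut token usage with little or no loss in accuracy.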

Details

Motivation: Chain-of-thought (CoT) prompting strengthens reasoning but produces verbose outputs and higher inference cost; the authors hypothesize that many reasoning steps are unnecessary. Method: A systematic study determines the minimum reasoning a model needs, followed by three strategies: early stopping via answer consistency, boosting the probability of end-of-reasoning signals, and a supervised method that learns when to stop from internal activations. Result: Across five benchmarks and five open-weights LLMs, the methods significantly reduce token usage with little or no accuracy drop; on NaturalQuestions, Answer Consistency reduces tokens by over 40% while improving accuracy. Conclusion: The study underscores the importance of cost-effective inference-time reasoning methods and their practical value for real-world applications. Abstract: Chain-of-thought (CoT) prompting enhances reasoning in large language models (LLMs) but often leads to verbose and redundant outputs, thus increasing inference cost. We hypothesize that many reasoning steps are unnecessary for producing correct answers. To investigate this, we start with a systematic study to examine the minimum reasoning required for a model to reach a stable decision. We find that on math reasoning tasks, models typically converge to their final answers after 60% of the reasoning steps, suggesting substantial redundancy in the remaining content. Based on these insights, we propose three inference-time strategies to improve efficiency: (1) early stopping via answer consistency, (2) boosting the probability of generating end-of-reasoning signals, and (3) a supervised method that learns when to stop based on internal activations. Experiments across five benchmarks and five open-weights LLMs show that our methods significantly reduce token usage with little or no accuracy drop. In particular, on NaturalQuestions, Answer Consistency reduces tokens by over 40% while further improving accuracy. Our work underscores the importance of cost-effective reasoning methods that operate at inference time, offering practical benefits for real-world applications.
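
The first strategy, early stopping via answer consistency, can be sketched as follows. This is one plausible reading rather than the paper's procedure: it assumes an intermediate answer is elicited after each reasoning chunk, and that generation stops once the same answer has appeared `patience` times in a row. The function name and the `patience` parameter are hypothetical.

```python
from typing import Optional, Sequence


def early_stop_index(answers: Sequence[Optional[str]], patience: int = 2) -> Optional[int]:
    """Return the index of the chunk at which to stop, or None if never stable.

    answers[i] is the answer elicited after reasoning chunk i
    (None when no answer could be extracted).
    """
    streak = 0
    previous = None
    for i, answer in enumerate(answers):
        if answer is not None and answer == previous:
            streak += 1  # same answer as after the previous chunk
        else:
            streak = 1 if answer is not None else 0  # restart the count
        previous = answer
        if streak >= patience:
            return i
    return None


# The answer changes once and then stays fixed, so reasoning can stop at chunk 2
# instead of running through all five chunks.
print(early_stop_index(["12", "15", "15", "15", "15"], patience=2))
```

A larger `patience` trades token savings for a lower risk of stopping on an answer the model would later revise.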

[206] CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG

Yang Tian,Fan Liu,Jingyuan Zhang,Victoria W.,Yupeng Hu,Liqiang Nie

Main category: cs.CL

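TL;DR: CoRe-MMRAG is an end-to-end framework that uses a four-stage pipeline to resolve knowledge inconsistencies in multimodal retrieval-augmented generation, with substantial performance gains.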

Details

Motivation: Addresses two problems in multimodal retrieval-augmented generation: Parametric-Retrieved Knowledge Inconsistency (PRKI) and Visual-Textual Knowledge Inconsistency (VTKI). Method: Proposes a four-stage pipeline (generate an internal response, select multimodal evidence, generate an external response, integrate both into a reliable answer) together with a specialized training paradigm. Result: On KB-VQA benchmarks, performance improves by 5.6% on InfoSeek and 9.3% on Encyclopedic-VQA. Conclusion: CoRe-MMRAG effectively reconciles inconsistencies across multimodal knowledge sources and significantly improves model performance. Abstract: Multimodal Retrieval-Augmented Generation (MMRAG) has been introduced to enhance Multimodal Large Language Models by incorporating externally retrieved multimodal knowledge, but it introduces two challenges: Parametric-Retrieved Knowledge Inconsistency (PRKI), where discrepancies between parametric and retrieved knowledge create uncertainty in determining reliability, and Visual-Textual Knowledge Inconsistency (VTKI), where misalignment between visual and textual sources disrupts entity representation. To address these challenges, we propose Cross-source knowledge Reconciliation for MultiModal RAG (CoRe-MMRAG), a novel end-to-end framework that effectively reconciles inconsistencies across knowledge sources. CoRe-MMRAG follows a four-stage pipeline: it first generates an internal response from parametric knowledge, then selects the most relevant multimodal evidence via joint similarity assessment, generates an external response, and finally integrates both to produce a reliable answer. Additionally, a specialized training paradigm enhances knowledge source discrimination, multimodal integration, and unified answer generation. Experiments on KB-VQA benchmarks show that CoRe-MMRAG achieves substantial improvements over baseline methods, with performance gains of 5.6% and 9.3% on InfoSeek and Encyclopedic-VQA, respectively. We release code and data at https://github.com/TyangJN/CoRe-MMRAG.

[207] Pruning General Large Language Models into Customized Expert Models

Yirao Zhao,Guizhen Chen,Kenji Kawaguchi,Lidong Bing,Wenxuan Zhang

Main category: cs.CL

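TL;DR: Proposes Cus-Prun, a custom pruning method that prunes a large general-purpose language model into a smaller, lightweight expert model without post-training while preserving both expert and general capabilities.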

Details

Motivation: Large language models (LLMs) demand heavy computational resources, and existing pruning methods focus on preserving general capabilities, which often means extensive post-training or degraded performance. Method: Cus-Prun identifies and prunes neurons that are irrelevant along the language, domain, and task dimensions to produce a lightweight expert model. Result: Experiments show that Cus-Prun outperforms other methods across model families and sizes, with minimal loss in expert and general capabilities. Conclusion: Cus-Prun is an efficient pruning method that needs no post-training and is suited to building expert models for specific downstream scenarios. Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their large model sizes often require substantial computational resources. To preserve computing resources and accelerate inference speed, it is crucial to prune redundant parameters, especially for experienced users who often need compact expert models tailored to specific downstream scenarios. However, most existing pruning methods focus on preserving the model's general capabilities, often requiring extensive post-training or suffering from degraded performance due to coarse-grained pruning. In this work, we design a Custom Pruning method (Cus-Prun) to prune a large general model into a smaller lightweight expert model, which is positioned along the "language", "domain" and "task" dimensions. By identifying and pruning irrelevant neurons of each dimension, Cus-Prun creates expert models without any post-training. Our experiments demonstrate that Cus-Prun consistently outperforms other methods, achieving minimal loss in both expert and general capabilities across various models from different model families and sizes.

[208] IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages

Muhammad Falensi Azmi,Muhammad Dehan Al Kautsar,Alfan Farizki Wicaksono,Fajri Koto

Main category: cs.CL

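TL;DR: IndoSafety is the first high-quality, human-verified safety evaluation dataset for the Indonesian context, covering five language varieties. Existing Indonesian-centric LLMs often produce unsafe outputs, and fine-tuning on IndoSafety significantly improves safety.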

Details

Motivation: The safety of region-specific LLMs in the Indonesian context is underexplored, even though sensitivity to local norms is essential in such a culturally diverse setting. Method: Extends prior safety frameworks into a taxonomy that captures Indonesia's sociocultural context, builds the IndoSafety dataset, and evaluates LLMs before and after fine-tuning. Result: Existing Indonesian-centric LLMs often generate unsafe outputs in colloquial and local-language settings; fine-tuning significantly improves safety while preserving task performance. Conclusion: Highlights the need for culturally grounded safety evaluation and offers a concrete step toward responsible LLM deployment in multilingual settings. Abstract: Although region-specific large language models (LLMs) are increasingly developed, their safety remains underexplored, particularly in culturally diverse settings like Indonesia, where sensitivity to local norms is essential and highly valued by the community. In this work, we present IndoSafety, the first high-quality, human-verified safety evaluation dataset tailored for the Indonesian context, covering five language varieties: formal and colloquial Indonesian, along with three major local languages: Javanese, Sundanese, and Minangkabau. IndoSafety is constructed by extending prior safety frameworks to develop a taxonomy that captures Indonesia's sociocultural context. We find that existing Indonesian-centric LLMs often generate unsafe outputs, particularly in colloquial and local language settings, whereas fine-tuning on IndoSafety significantly improves safety while preserving task performance. Our work highlights the critical need for culturally grounded safety evaluation and provides a concrete step toward responsible LLM deployment in multilingual settings. Warning: This paper contains example data that may be offensive, harmful, or biased.

[209] Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised Learning

Sarenne Wallbridge,Christoph Minixhofer,Catherine Lai,Peter Bell

Main category: cs.CL

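TL;DR: Uses self-supervised learning (SSL) to analyze the acoustic correlates of prosody and finds that the learned representations predict both local and longer-term structures, such as emotion, better than traditional features.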

Details

Motivation: Examines how much prosody (for example intonation, tempo, and loudness) contributes to predictable structure in speech independently of lexical content. Method: Proposes the Masked Prosody Model, which uses self-supervised learning to analyze the acoustic correlates of prosody, and runs probing experiments on perceptual labels. Result: The model predicts labels that depend on local information and on longer-term structures (such as emotion recognition), with strong gains over traditional features. Conclusion: The timescale of the SSL training objective is important for capturing complex structure, and SSL-encoded structures are more valuable than more constrained classical ones. Abstract: People exploit the predictability of lexical structures during text comprehension. Though predictable structure is also present in speech, the degree to which prosody, e.g. intonation, tempo, and loudness, contributes to such structure independently of the lexical content is unclear. This study leverages self-supervised learning (SSL) to examine the temporal granularity of structures in the acoustic correlates of prosody. Representations from our proposed Masked Prosody Model can predict perceptual labels dependent on local information, such as word boundaries, but provide the most value for labels involving longer-term structures, like emotion recognition. Probing experiments across various perceptual labels show strong relative gains over untransformed pitch, energy, and voice activity features. Our results reveal the importance of SSL training objective timescale and highlight the value of complex SSL-encoded structures compared to more constrained classical structures.

[210] Evaluating Named Entity Recognition Models for Russian Cultural News Texts: From BERT to LLM

Maria Levchenko

Main category: cs.CL

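TL;DR: Studies named entity recognition (NER) for person names in Russian cultural news and compares a range of models. GPT-4o with specific prompting performs best (F1 = 0.93) and GPT-4 has the highest precision (0.99); the results show both the capabilities and the limits of these models on morphologically rich languages such as Russian.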

Details

Motivation: Addresses the challenge of person-name NER in Russian cultural news and evaluates different models to guide researchers and practitioners. Method: Uses the SPbLitGuide dataset (announcements of cultural events in Saint Petersburg, 1999 to 2019) to compare established models such as DeepPavlov, RoBERTa, and SpaCy with LLMs including GPT-3.5, GPT-4, and GPT-4o. Result: GPT-4o reaches F1 = 0.93 with JSON-output prompting and GPT-4 reaches a precision of 0.99; a follow-up evaluation with GPT-4.1 (2025) reaches F1 = 0.94. Conclusion: The study shows the potential of LLMs for Russian NER, in particular the rapid progress of the GPT family and simpler deployment requirements. Abstract: This paper addresses the challenge of Named Entity Recognition (NER) for person names within the specialized domain of Russian news texts concerning cultural events. The study utilizes the unique SPbLitGuide dataset, a collection of event announcements from Saint Petersburg spanning 1999 to 2019. A comparative evaluation of diverse NER models is presented, encompassing established transformer-based architectures such as DeepPavlov, RoBERTa, and SpaCy, alongside recent Large Language Models (LLMs) including GPT-3.5, GPT-4, and GPT-4o. Key findings highlight the superior performance of GPT-4o when provided with specific prompting for JSON output, achieving an F1 score of 0.93. Furthermore, GPT-4 demonstrated the highest precision at 0.99. The research contributes to a deeper understanding of current NER model capabilities and limitations when applied to morphologically rich languages like Russian within the cultural heritage domain, offering insights for researchers and practitioners. Follow-up evaluation with GPT-4.1 (April 2025) achieves F1=0.94 for both simple and structured prompts, demonstrating rapid progress across model families and simplified deployment requirements.
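
Since the findings are reported as precision and F1, it may help to recall how the two relate. The sketch below computes set-level precision, recall, and F1 for extracted person names. The exact-match criterion and the example names are assumptions for illustration and are not drawn from SPbLitGuide.

```python
def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between predicted and gold name sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


gold = {"Anna Akhmatova", "Joseph Brodsky", "Daniil Kharms"}

# A cautious system: everything it returns is correct, but it misses two names.
cautious = {"Anna Akhmatova"}
print(precision_recall_f1(cautious, gold))  # precision 1.0, recall 1/3, F1 0.5
```

This is why the model with the highest precision is not necessarily the one with the highest F1: a system can be almost never wrong about the names it returns while still missing many of them.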

[211] On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures

Minh Duc Bui,Kyung Eun Park,Goran Glavaš,Fabian David Schmidt,Katharina von der Wense

Main category: cs.CL

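TL;DR: Studies how large language models (LLMs) perform across measurement systems (for example currencies). Models default to the system that dominates the data and perform unstably across systems; reasoning methods mitigate this in part but raise inference cost.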

Details

Motivation: Investigates whether LLMs can provide accurate information regardless of the measurement system, so that they serve users from diverse cultural backgrounds. Method: Tests seven open-source LLMs on newly compiled datasets, examining their default measurement systems, performance differences across systems, and the effect of reasoning methods. Result: LLMs default to the measurement system that dominates the data and show unstable performance; reasoning methods partly mitigate the problem but increase cost. Conclusion: LLM performance differs across measurement systems, and models need improvement to serve diverse users equally. Abstract: Measurement systems (e.g., currencies) differ across cultures, but the conversions between them are well defined so that humans can state facts using any measurement system of their choice. Being available to users from diverse cultural backgrounds, large language models (LLMs) should also be able to provide accurate information irrespective of the measurement system at hand. Using newly compiled datasets we test if this is the case for seven open-source LLMs, addressing three key research questions: (RQ1) What is the default system used by LLMs for each type of measurement? (RQ2) Do LLMs' answers and their accuracy vary across different measurement systems? (RQ3) Can LLMs mitigate potential challenges w.r.t. underrepresented systems via reasoning? Our findings show that LLMs default to the measurement system predominantly used in the data. Additionally, we observe considerable instability and variance in performance across different measurement systems. While this instability can in part be mitigated by employing reasoning methods such as chain-of-thought (CoT), this implies longer responses and thereby significantly increases test-time compute (and inference costs), marginalizing users from cultural backgrounds that use underrepresented measurement systems.

[212] Beyond the Surface: Measuring Self-Preference in LLM Judgments

Zhi-Yuan Chen,Hao Wang,Xinyu Zhang,Enrui Hu,Yankai Lin

Main category: cs.CL

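TL;DR: Proposes the DBG score, a new measure of self-preference bias in large language models (LLMs) acting as judges that avoids the confounding effect of response quality present in earlier methods.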

Details

Motivation: Existing methods for measuring self-preference bias in LLMs cannot separate bias from response quality, which makes their results inaccurate. Method: Introduces gold judgments as a proxy for true response quality and proposes the DBG score, which measures bias as the difference between the judge model's scores for its own responses and the gold judgments. Result: Experiments validate the DBG score and analyze how model version, size, and reasoning ability affect self-preference bias. Conclusion: The DBG score measures self-preference bias more accurately and reveals factors that influence the bias and its potential mechanisms. Abstract: Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
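
The contrast between the earlier measure and the DBG score can be illustrated with a few lines of arithmetic. The sketch below assumes scores on a shared numeric scale and simple averaging over responses; both are assumptions, since the abstract does not specify the aggregation. The function names are hypothetical.

```python
from statistics import mean


def naive_gap(judge_scores_own: list[float], judge_scores_others: list[float]) -> float:
    """Earlier-style measure: judge's scores for its own responses minus others'."""
    return mean(judge_scores_own) - mean(judge_scores_others)


def dbg_score(judge_scores_own: list[float], gold_scores_own: list[float]) -> float:
    """Mean difference between the judge's scores for its own responses and gold."""
    if len(judge_scores_own) != len(gold_scores_own):
        raise ValueError("score lists must be aligned response by response")
    return mean(j - g for j, g in zip(judge_scores_own, gold_scores_own))


# The judge's own responses really are better (gold agrees with its scores),
# so the naive gap is positive even though the judge is not biased.
own, others, gold_own = [8, 7, 9], [6, 6, 7], [8, 7, 9]
print(naive_gap(own, others))   # positive: quality difference, not bias
print(dbg_score(own, gold_own)) # zero: the judge matches the gold judgments
```

In this toy case the naive gap would report a bias that is entirely explained by response quality, which is the confound the gold judgments are meant to remove.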

[213] EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing

Fan Gao,Dongyuan Li,Ding Xia,Fei Mi,Yasheng Wang,Lifeng Shang,Baojun Wang

Main category: cs.CL

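TL;DR: Proposes EssayBench, a multi-genre benchmark for Chinese essay writing that evaluates large language models across four major genres, together with a fine-grained scoring framework.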

Details

Motivation: Existing benchmarks for Chinese writing overlook structural and rhetorical complexity, especially differences across genres. Method: Builds a multi-genre dataset of 728 real-world prompts and designs a fine-grained, hierarchical scoring framework. Result: Benchmarks 15 large language models and analyzes their performance across genres and instruction types. Conclusion: EssayBench aims to advance LLM-based Chinese essay evaluation and to inspire future research on improving essay generation in educational settings. Abstract: Chinese essay writing and its evaluation are critical in educational contexts, yet the capabilities of Large Language Models (LLMs) in this domain remain largely underexplored. Existing benchmarks often rely on coarse-grained text quality metrics, largely overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. To address this gap, we propose EssayBench, a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We curate and refine a total of 728 real-world prompts to ensure authenticity and meticulously categorize them into the Open-Ended and Constrained sets to capture diverse writing scenarios. To reliably evaluate generated essays, we develop a fine-grained, genre-specific scoring framework that hierarchically aggregates scores. We further validate our evaluation protocol through a comprehensive human agreement study. Finally, we benchmark 15 large-sized LLMs, analyzing their strengths and limitations across genres and instruction types. With EssayBench, we aim to advance LLM-based Chinese essay evaluation and inspire future research on improving essay generation in educational settings.
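
Hierarchical score aggregation can be pictured as a recursive roll-up over a nested rubric. The sketch below is only an illustration of that general idea: the rubric dimensions, the 1 to 5 scale, and the unweighted averaging are all assumptions, as the abstract does not describe the actual rubric or weights.

```python
def aggregate(node) -> float:
    """Recursively average a nested rubric.

    Leaves are numeric scores; dicts are groups whose score is the
    unweighted mean of their children (an assumption of this sketch).
    """
    if isinstance(node, (int, float)):
        return float(node)
    return sum(aggregate(child) for child in node.values()) / len(node)


# Hypothetical rubric for an argumentative essay, scored 1 to 5 per trait.
rubric = {
    "structure": {"thesis": 4, "organization": 3},   # group mean 3.5
    "language": {"fluency": 5, "rhetoric": 3},       # group mean 4.0
    "content": 4,
}
print(aggregate(rubric))  # mean of 3.5, 4.0, and 4.0
```

Averaging within groups first keeps a dimension with many fine-grained traits from dominating the overall score.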

[214] Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning

Ömer Tarik Özyilmaz,Matt Coler,Matias Valdenegro-Toro

Main category: cs.CL

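TL;DR: Investigates fine-tuning OpenAI's Whisper for automatic speech recognition on five major Arabic dialects. Small amounts of Modern Standard Arabic (MSA) fine-tuning data substantially improve smaller models, and dialect-pooled models perform comparably to dialect-specific ones.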

Details

Motivation: Commercial Arabic ASR systems struggle with dialectal speech; the study explores how fine-tuning and data pooling can improve dialect recognition. Method: Uses MSA data from Mozilla Common Voice and dialectal data from the MASC dataset to evaluate the effect of MSA training size, the benefit of MSA pre-training, and dialect-specific versus dialect-pooled models. Result: Small amounts of MSA fine-tuning data substantially improve smaller models; MSA pre-training brings minimal benefit; dialect-pooled models perform comparably to dialect-specific ones. Conclusion: Pooling dialectal data, when properly balanced, can ease data scarcity in low-resource ASR without significant performance loss. Abstract: Although commercial Arabic automatic speech recognition (ASR) systems support Modern Standard Arabic (MSA), they struggle with dialectal speech. We investigate the effect of fine-tuning OpenAI's Whisper on five major Arabic dialects (Gulf, Levantine, Iraqi, Egyptian, Maghrebi) using Mozilla Common Voice for MSA and the MASC dataset for dialectal speech. We evaluate MSA training size effects, benefits of pre-training on MSA data, and dialect-specific versus dialect-pooled models. We find that small amounts of MSA fine-tuning data yield substantial improvements for smaller models, matching larger non-fine-tuned models. While MSA pre-training shows minimal benefit, suggesting limited shared features between MSA and dialects, our dialect-pooled models perform comparably to dialect-specific ones. This indicates that pooling dialectal data, when properly balanced, can help address data scarcity in low-resource ASR without significant performance loss.

[215] Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs

Manon Reusens,Bart Baesens,David Jurgens

Main category: cs.CL

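TL;DR: Proposes a standardized framework for analyzing how consistently persona-assigned large language models (LLMs) maintain their persona across tasks and runs. Consistency depends on the persona, stereotypes, and model design choices, and it varies across tasks.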

Details

Motivation: Prior work lacks a comprehensive analysis of LLM consistency across personas and task types, so a standardized evaluation framework is needed. Method: Introduces a new framework that evaluates consistency over four persona categories (happiness, occupation, personality, and political stance) and multiple task dimensions (survey writing, essay generation, social media post generation, and single-turn and multi-turn conversations). Result: Consistency is influenced by the persona, stereotypes, and model design, and it is higher for more structured tasks and with additional context. Conclusion: The framework offers a standardized way to evaluate persona consistency in LLMs, identifies the key factors that affect it, and suggests directions for future research. Abstract: Personalized Large Language Models (LLMs) are increasingly used in diverse applications, where they are assigned a specific persona - such as a happy high school teacher - to guide their responses. While prior research has examined how well LLMs adhere to predefined personas in writing style, a comprehensive analysis of consistency across different personas and task types is lacking. In this paper, we introduce a new standardized framework to analyze consistency in persona-assigned LLMs. We define consistency as the extent to which a model maintains coherent responses when assigned the same persona across different tasks and runs. Our framework evaluates personas across four different categories (happiness, occupation, personality, and political stance) spanning multiple task dimensions (survey writing, essay generation, social media post generation, and single-turn and multi-turn conversations). Our findings reveal that consistency is influenced by multiple factors, including the assigned persona, stereotypes, and model design choices. Consistency also varies across tasks, increasing with more structured tasks and additional context. All code is available on GitHub.

[216] EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

Shihan Dou,Ming Zhang,Chenhao Huang,Jiayi Chen,Feng Chen,Shichun Liu,Yan Liu,Chenxiao Liu,Cheng Zhong,Zongzhang Zhang,Tao Gui,Chao Xin,Wei Chengzhi,Lin Yan,Qi Zhang,Xuanjing Huang

Main category: cs.CL

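TL;DR: EvaLearn is a new benchmark for evaluating the learning capability and efficiency of large language models. It contains 648 challenging problems that models must solve sequentially so that they can draw on earlier experience.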

Details

Motivation: Evaluates the learning capability and efficiency of large language models on challenging tasks, an aspect that existing benchmarks leave underexplored. Method: Designs the EvaLearn benchmark with 182 task sequences and five automated metrics for evaluating model performance. Result: Models show varied profiles: some learn strongly while others struggle, and models with stronger static abilities do not necessarily have an advantage in learning capability. Conclusion: EvaLearn offers a new perspective for assessing LLM potential, exposes the gap between models and human capabilities, and encourages deeper evaluation approaches. Abstract: We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while some models struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available at the GitHub repository.

[217] TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

Zhong-Zhi Li,Xiao Liang,Zihao Tang,Lei Ji,Peijie Wang,Haotian Xu,Xing W,Haizhen Huang,Weiwei Deng,Ying Nian Wu,Yeyun Gong,Zhijiang Guo,Xiao Liu,Fei Yin,Cheng-Lin Liu

Main category: cs.CL

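TL;DR: Proposes a dynamic ratio-based training method that reduces the number of output tokens by nearly 40% while maintaining reasoning accuracy.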

Details

Motivation: Addresses the inefficiency of LLM reasoning with very long outputs, without relying on sophisticated data annotation or interpolation between multiple models. Method: Dynamically balances the weights of System-1 and System-2 data to eliminate redundant reasoning. Result: Validated on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B, with output tokens reduced by nearly 40% and reasoning accuracy maintained. Conclusion: The method is efficient and practical; code and data will be released. Abstract: Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning, especially during inference with extremely long outputs, has drawn increasing attention from the research community. In this work, we propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations or interpolation between multiple models. We continuously balance the weights between the model's System-1 and System-2 data to eliminate redundant reasoning processes while preserving the model's reasoning capability. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B across a diverse set of benchmarks with varying difficulty levels. Our method reduces the number of output tokens by nearly 40% while maintaining the accuracy of the reasoning. Our code and data will be available soon.

[218] Decompose, Plan in Parallel, and Merge: A Novel Paradigm for Large Language Models based Planning with Multiple Constraints

Zhengdong Lu,Weikai Lu,Yiling Tao,Yun Dai,ZiXuan Chen,Huiping Zhuang,Cen Chen,Hao Peng,Ziqian Zeng

Main category: cs.CL

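TL;DR: Proposes DPPM, a parallel planning paradigm that addresses LLM limitations in planning tasks by decomposing the task, planning subtasks in parallel, and merging the subplans; it significantly outperforms existing methods.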

Details

Motivation: Existing planning methods suffer from heavy constraints and cascading errors, which limit LLM performance on planning tasks. Method: The DPPM paradigm decomposes a task into subtasks, generates subplans in parallel, merges them into a global plan, and adds a verification and refinement module. Result: DPPM significantly outperforms existing methods on travel planning tasks. Conclusion: DPPM effectively addresses LLM limitations in planning tasks and improves performance. Abstract: Despite significant advances in Large Language Models (LLMs), planning tasks still present challenges for LLM-based agents. Existing planning methods face two key limitations: heavy constraints and cascading errors. To address these limitations, we propose a novel parallel planning paradigm, which Decomposes, Plans for subtasks in Parallel, and Merges subplans into a final plan (DPPM). Specifically, DPPM decomposes the complex task based on constraints into subtasks, generates the subplan for each subtask in parallel, and merges them into a global plan. In addition, our approach incorporates a verification and refinement module, enabling error correction and conflict resolution. Experimental results demonstrate that DPPM significantly outperforms existing methods in travel planning tasks.
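
Editor's sketch (not the authors' code): the decompose, plan-in-parallel, merge, and verify control flow described in the abstract can be outlined in a few lines of Python. The helpers `decompose`, `plan_subtask`, `merge`, and `verify` are hypothetical stand-ins; in a real system each would wrap an LLM call, and the one-subtask-per-constraint split is only an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(task):
    # Hypothetical decomposition: one subtask per constraint.
    return [{"constraint": c, "city": task["city"]} for c in task["constraints"]]


def plan_subtask(subtask):
    # Stand-in for an LLM call that plans a single subtask.
    return f"{subtask['constraint']} plan for {subtask['city']}"


def merge(subplans):
    # Stand-in for merging subplans into one global plan.
    return " | ".join(subplans)


def verify(plan, task):
    # Toy verification: every constraint must be mentioned in the merged plan.
    return all(c in plan for c in task["constraints"])


def dppm(task):
    subtasks = decompose(task)
    # Subplans are generated concurrently; pool.map preserves input order.
    with ThreadPoolExecutor() as pool:
        subplans = list(pool.map(plan_subtask, subtasks))
    plan = merge(subplans)
    if not verify(plan, task):
        raise ValueError("merged plan violates a constraint")
    return plan


example_plan = dppm({"city": "Kyoto", "constraints": ["budget", "lodging"]})
```

Because subtasks are planned independently, an error in one subplan does not propagate into the others before the merge step, which is the intuition behind avoiding cascading errors.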

[219] MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching

Liang Yue,Yihong Tang,Kehai Chen,Jie Liu,Min Zhang

Main category: cs.CL

TL;DR: Proposes MASTER, which augments instruction fine-tuning data through multi-agent interaction and significantly improves model performance.

Details

Motivation: High-quality fine-tuning data is difficult and costly to obtain, which limits gains in model performance. Method: Propose MASTER, which simulates teaching scenarios and uses multi-agent conversations to generate high-quality data. Result: The BOOST-QA dataset is constructed, and models fine-tuned on it perform well across multiple benchmarks. Conclusion: MASTER significantly improves models' reasoning abilities and offers useful insights for future research. Abstract: Instruction fine-tuning is crucial in NLP tasks, enhancing pretrained models' instruction-following capabilities and task-specific performance. However, obtaining high-quality fine-tuning data for large models is challenging due to data collection difficulties and high production costs. To address this, we propose MASTER, a novel data augmentation method that enriches original data through interactions among multiple agents with varying cognitive levels. We simulate three pedagogically grounded teaching scenarios, leveraging multi-agent conversations to generate high-quality teacher-student interaction data. Utilizing MASTER, we construct BOOST-QA, a fine-tuning dataset augmented from existing datasets like Orca-Math-200k, ProcQA, and OpenHermes2.5. Experiments show that models fine-tuned with BOOST-QA perform excellently across multiple benchmarks, demonstrating strong multitask generalization. Notably, MASTER significantly improves models' reasoning abilities in complex tasks, providing valuable insights for future research.

[220] On Entity Identification in Language Models

Masaki Sakata,Sho Yokoi,Benjamin Heinzerling,Takumi Ito,Kentaro Inui

Main category: cs.CL

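TL;DR: Studies how the internal representations of language models (LMs) identify and distinguish mentions of named entities, proposes a framework analogous to clustering quality metrics, and analyzes how entity representations affect model performance.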

Details

Motivation: Examine how LM internal representations capture mentions of named entities and how they distinguish different entities, in order to understand how models organize entity information internally. Method: Use cluster analysis to quantify how well mentions of the same entity cluster together and mentions of different entities stay separated, and analyze the low-dimensional linear subspace that encodes entity information. Result: Models effectively identify and distinguish entities (precision- and recall-like metrics between 0.66 and 0.9), and entity information is compactly represented in a low-dimensional linear subspace at early layers. Conclusion: LM internal representations are interpreted as isomorphic to entity-centric knowledge structures, which sheds light on how models organize and use entity information. Abstract: We analyze the extent to which internal representations of language models (LMs) identify and distinguish mentions of named entities, focusing on the many-to-many correspondence between entities and their mentions. We first formulate two problems of entity mentions, ambiguity and variability, and propose a framework analogous to clustering quality metrics. Specifically, we quantify through cluster analysis of LM internal representations the extent to which mentions of the same entity cluster together and mentions of different entities remain separated. Our experiments examine five Transformer-based autoregressive models, showing that they effectively identify and distinguish entities with metrics analogous to precision and recall ranging from 0.66 to 0.9. Further analysis reveals that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers. Additionally, we clarify how the characteristics of entity representations influence word prediction performance. These findings are interpreted through the lens of isomorphism between LM representations and entity-centric knowledge structures in the real world, providing insights into how LMs internally organize and use entity information.
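
Editor's sketch: the abstract does not spell out its precision- and recall-like metrics, so the following uses a common pairwise formulation as an assumption, not the paper's exact definition. Pairwise precision asks how many mention pairs placed in the same cluster truly share an entity (low precision signals ambiguity); pairwise recall asks how many same-entity pairs end up in the same cluster (low recall signals variability).

```python
from itertools import combinations


def pairwise_precision_recall(entity_ids, cluster_ids):
    """Pairwise clustering precision/recall over all mention pairs."""
    same_cluster = same_entity = both = 0
    for i, j in combinations(range(len(entity_ids)), 2):
        shares_entity = entity_ids[i] == entity_ids[j]
        shares_cluster = cluster_ids[i] == cluster_ids[j]
        same_entity += shares_entity
        same_cluster += shares_cluster
        both += shares_entity and shares_cluster
    precision = both / same_cluster if same_cluster else 0.0
    recall = both / same_entity if same_entity else 0.0
    return precision, recall


# Four mentions of "Paris": two refer to the city, two to the person.
gold_entities = ["paris_city", "paris_city", "paris_hilton", "paris_hilton"]
clusters = [0, 0, 0, 1]  # hypothetical clustering of the representations
precision, recall = pairwise_precision_recall(gold_entities, clusters)
```

In this toy case three pairs share a cluster but only one of them shares an entity, so precision is 1/3, while one of the two same-entity pairs is recovered, so recall is 1/2.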

[221] RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models

Qihang Yan,Xinyu Zhang,Luming Guo,Qi Zhang,Feifan Liu

Main category: cs.CL

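TL;DR: The RACE-Align framework combines external knowledge retrieval with Chain-of-Thought reasoning to significantly improve the accuracy, knowledgeability, and interpretability of large language models in vertical domains.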

Details

Motivation: Address LLMs' shortcomings in accuracy, domain-specific reasoning, and interpretability in vertical domains; traditional alignment methods overlook knowledge sources and reasoning logic. Method: Propose the RACE-Align framework, which builds a preference dataset incorporating external knowledge and Chain-of-Thought reasoning and aligns the model with the DPO algorithm. Result: In Traditional Chinese Medicine experiments, RACE-Align significantly outperforms the base model and an SFT-only model, improving accuracy, reasoning depth, and interpretability. Conclusion: RACE-Align offers an effective pathway to enhance LLMs' knowledge application, reasoning reliability, and process transparency in complex vertical domains. Abstract: Large Language Models (LLMs) struggle with accuracy, domain-specific reasoning, and interpretability in vertical domains. Traditional preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) often overlook the underlying knowledge sources and reasoning logic. This paper introduces RACE-Align (Retrieval-Augmented and Chain-of-Thought Enhanced Alignment), a novel framework designed to address these limitations. RACE-Align systematically constructs a binary preference dataset incorporating external knowledge support and explicit Chain-of-Thought (CoT) reasoning, then aligns LLMs using the DPO algorithm. The core innovation lies in its preference data construction strategy: it integrates AI-driven retrieval for factual grounding, enhancing knowledgeability and accuracy, and emphasizes the optimization of domain-specific CoT, treating the reasoning process itself as a key preference dimension. A multi-stage, AI-driven refinement pipeline cost-effectively generates these preference pairs. Experimental validation in Traditional Chinese Medicine (TCM) using Qwen3-1.7B as the base model demonstrates that RACE-Align significantly outperforms the original base model and a model fine-tuned only with Supervised Fine-Tuning (SFT). Improvements were observed across multiple dimensions, including answer accuracy, information richness, application of TCM thinking patterns, logicality and depth of reasoning, and interpretability. These findings suggest RACE-Align offers an effective pathway to enhance LLMs' knowledge application, reasoning reliability, and process transparency in complex vertical domains.

[222] Stereotypical gender actions can be extracted from Web text

Amaç Herdağdelen,Marco Baroni

Main category: cs.CL

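TL;DR: Extracts gender-specific actions from text corpora and Twitter and compares them with human stereotypes, showing that natural text can be used to augment commonsense knowledge repositories.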

Details

Motivation: Explore how natural text data (Twitter in particular) can supplement commonsense knowledge repositories with information about gender stereotypes. Method: Use Open Mind Common Sense (OMCS) and the gender information of Twitter users, together with Web-corpus-based pronoun/name gender heuristics, to compute the gender bias of actions. Result: A Spearman correlation of 0.47 with a human gold standard and an area under the ROC curve of 0.76, supporting the method's validity. Conclusion: It is feasible to use natural text (and a Twitter-derived corpus in particular) to augment commonsense repositories with gender stereotypes; two datasets are provided. Abstract: We extracted gender-specific actions from text corpora and Twitter, and compared them to stereotypical expectations of people. We used Open Mind Common Sense (OMCS), a commonsense knowledge repository, to focus on actions that are pertinent to common sense and daily life of humans. We use the gender information of Twitter users and Web-corpus-based pronoun/name gender heuristics to compute the gender bias of the actions. With high recall, we obtained a Spearman correlation of 0.47 between corpus-based predictions and a human gold standard, and an area under the ROC curve of 0.76 when predicting the polarity of the gold standard. We conclude that it is feasible to use natural text (and a Twitter-derived corpus in particular) in order to augment commonsense repositories with the stereotypical gender expectations of actions. We also present a dataset of 441 commonsense actions with human judges' ratings on whether the action is typically/slightly masculine/feminine (or neutral), and another larger dataset of 21,442 actions automatically rated by the methods we investigate in this study.
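
Editor's sketch: the Spearman correlation reported above compares the ranking of corpus-based bias scores with the ranking of human ratings. The following standard-library implementation (Pearson correlation of average ranks) illustrates the metric; the four scores and ratings are made-up values, not data from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(x), average_ranks(y))


corpus_bias = [0.9, 0.1, 0.5, 0.3]   # hypothetical corpus-based scores
human_rating = [2, -1, -2, 1]        # hypothetical human gold ratings
rho = spearman(corpus_bias, human_rating)
```

Because only ranks matter, the corpus scores and the human ratings can live on entirely different scales.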

[223] Multi-task Learning with Active Learning for Arabic Offensive Speech Detection

Aisha Alansari,Hamzah Luqman

Main category: cs.CL

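TL;DR: Proposes a framework that combines multi-task learning with active learning to improve offensive language detection in Arabic social media text; by dynamically adjusting task weights and actively selecting samples, it reaches a macro F1-score of 85.42%.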

Details

Motivation: Offensive, violent, and vulgar speech is spreading on social media, while detection in Arabic faces data scarcity and linguistic complexity. Method: Combine multi-task learning (MTL) with active learning, dynamically adjust task weights, select samples via uncertainty sampling, and introduce weighted emoji handling. Result: A macro F1-score of 85.42% on the OSACT2022 dataset, outperforming existing methods while using fewer samples. Conclusion: Combining MTL with active learning shows potential for efficient and accurate offensive language detection in resource-constrained settings. Abstract: The rapid growth of social media has amplified the spread of offensive, violent, and vulgar speech, which poses serious societal and cybersecurity concerns. Detecting such content in Arabic text is particularly complex due to limited labeled data, dialectal variations, and the language's inherent complexity. This paper proposes a novel framework that integrates multi-task learning (MTL) with active learning to enhance offensive speech detection in Arabic social media text. By jointly training on two auxiliary tasks, violent and vulgar speech, the model leverages shared representations to improve the detection accuracy of the offensive speech. Our approach dynamically adjusts task weights during training to balance the contribution of each task and optimize performance. To address the scarcity of labeled data, we employ an active learning strategy through several uncertainty sampling techniques to iteratively select the most informative samples for model training. We also introduce weighted emoji handling to better capture semantic cues. Experimental results on the OSACT2022 dataset show that the proposed framework achieves a state-of-the-art macro F1-score of 85.42%, outperforming existing methods while using significantly fewer fine-tuning samples. The findings of this study highlight the potential of integrating MTL with active learning for efficient and accurate offensive language detection in resource-constrained settings.
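
Editor's sketch: the abstract mentions "several uncertainty sampling techniques" without naming them. Entropy-based sampling is one widely used variant and is assumed here purely for illustration; the probabilities below are invented.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, k):
    """Indices of the k unlabeled samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs: P(offensive), P(not offensive) per unlabeled post.
pool = [[0.99, 0.01], [0.50, 0.50], [0.80, 0.20], [0.60, 0.40]]
to_label = select_most_uncertain(pool, k=2)
```

In each active learning round, the selected samples would be sent for annotation, added to the training set, and the model retrained before the next selection.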

[224] Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs

Stefano Bannò,Kate Knill,Mark Gales

Main category: cs.CL

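TL;DR: Proposes a new approach to fine-grained vocabulary assessment that combines large language models (LLMs) with the English Vocabulary Profile (EVP) and outperforms a part-of-speech-based baseline.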

Details

Motivation: Vocabulary use is central to second language (L2) proficiency, yet existing automated assessment mostly relies on context-independent or PoS-related word use and cannot precisely assess how words are actually used in a sentence. Method: Combine LLMs with the EVP to analyze vocabulary use in context, addressing polysemy, contextual variation, and multi-word expressions. Result: LLMs outperform the PoS-based baseline in vocabulary proficiency assessment, and correlations between word-level and essay-level proficiency are explored. Conclusion: LLMs are well-suited to vocabulary assessment and provide a more precise tool for L2 learning. Abstract: Vocabulary use is a fundamental aspect of second language (L2) proficiency. To date, its assessment by automated systems has typically examined the context-independent, or part-of-speech (PoS) related use of words. This paper introduces a novel approach to enable fine-grained vocabulary evaluation exploiting the precise use of words within a sentence. The scheme combines large language models (LLMs) with the English Vocabulary Profile (EVP). The EVP is a standard lexical resource that enables in-context vocabulary use to be linked with proficiency level. We evaluate the ability of LLMs to assign proficiency levels to individual words as they appear in L2 learner writing, addressing key challenges such as polysemy, contextual variation, and multi-word expressions. We compare LLMs to a PoS-based baseline. LLMs appear to exploit additional semantic information that yields improved performance. We also explore correlations between word-level proficiency and essay-level proficiency. Finally, the approach is applied to examine the consistency of the EVP proficiency levels. Results show that LLMs are well-suited for the task of vocabulary assessment.

[225] SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking

Sifan Li,Yujun Cai,Yiwei Wang

Main category: cs.CL

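TL;DR: The HC-Bench benchmark shows that vision-language models (VLMs) perform extremely poorly (0-5.36% accuracy) at detecting hidden content, such as hidden text and objects in optical illusions or AI-generated images. With SemVink (scaling images to low resolution), accuracy rises to >99%, exposing VLMs' overreliance on high-level semantics at the expense of low-level visual operations.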

Details

Motivation: Humans instinctively perceive hidden content, whereas VLMs perform poorly on such tasks, indicating an architectural flaw that must be addressed for real-world robustness. Method: Introduce HC-Bench, a benchmark of 112 images with hidden content, and design SemVink (scaling images to low resolutions) to eliminate visual noise. Result: VLMs reach only 0-5.36% accuracy on HC-Bench, while SemVink raises accuracy to >99%. Conclusion: VLMs need to integrate multi-scale processing to compensate for weak low-level visual operations, narrowing the gap between computational vision and human cognition for applications such as medical imaging and security. Abstract: Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0-5.36%), even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. We propose SemVink (Semantic Visual Thinking), which simply scales images to low resolutions (32-128 pixels) and, strikingly, unlocks >99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models integrating multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
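
Editor's sketch: the abstract says only that images are scaled to low resolution (32-128 pixels) and does not state the resampling method. Block averaging on a grayscale pixel grid is assumed below as one simple way to downscale; it shows how fine detail is averaged away while coarse structure survives.

```python
def downscale(image, factor):
    """Downscale a grayscale image (list of rows) by averaging factor x factor blocks.

    Rows and columns that do not fill a complete block are dropped.
    """
    height, width = len(image), len(image[0])
    result = []
    for top in range(0, height - height % factor, factor):
        row = []
        for left in range(0, width - width % factor, factor):
            block = [
                image[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


# A 4x4 checkerboard of 2x2 tiles collapses to a 2x2 image.
tiles = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
small = downscale(tiles, factor=2)
```

In practice an image library's resize routine would be used before passing the image to the VLM; the pure-Python version here is only meant to make the operation explicit.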

[226] ProcrustesGPT: Compressing LLMs with Structured Matrices and Orthogonal Transformations

Ekaterina Grishina,Mikhail Gorbunov,Maxim Rakhuba

Main category: cs.CL

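TL;DR: Uses orthogonal transformations to improve the structured-matrix representation of large language model (LLM) weight matrices and thereby reduce parameter count.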

Details

Motivation: LLMs require substantial computational and memory resources; structured matrices are an effective way to reduce parameters, but directly representing pretrained weight matrices with them may be inaccurate. Method: Exploit the invariance of LLM output under certain orthogonal transformations to find transformations that significantly improve the compressibility of weight matrices within structured classes. Result: The approach applies to various structured matrices that support efficient projection operations, and the code is open source. Conclusion: Optimizing the structured representation of weight matrices through orthogonal transformations offers a viable route to reducing LLM parameters. Abstract: Large language models (LLMs) demonstrate impressive results in natural language processing tasks but require a significant amount of computational and memory resources. Structured matrix representations are a promising way to reduce the number of parameters of these models. However, it seems unrealistic to expect that weight matrices of pretrained models can be accurately represented by structured matrices without any fine-tuning. To overcome this issue, we utilize the fact that LLM output is invariant under certain orthogonal transformations of weight matrices. This insight can be leveraged to identify transformations that significantly improve the compressibility of weights within structured classes. The proposed approach is applicable to various types of structured matrices that support efficient projection operations. Code is available at https://github.com/GrishKate/ProcrustesGPT
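
Editor's sketch: the invariance the abstract relies on rests on a basic linear-algebra identity. For any orthogonal matrix Q, (W Q)(Qᵀ x) = W x, so a weight matrix can be rotated freely as long as its input is rotated the opposite way. The toy 2x2 example below checks only this identity; it is not the paper's compression algorithm, and the matrices are arbitrary.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rotation(theta):
    """A 2x2 rotation matrix, which is orthogonal: Q Q^T = I."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


W = [[2.0, 1.0], [0.0, 3.0]]   # toy weight matrix
x = [[1.0], [-2.0]]            # toy input (column vector)
Q = rotation(0.7)

original = matmul(W, x)
# Rotate the weights by Q and the input by Q^T: the output is unchanged.
rotated = matmul(matmul(W, Q), matmul(transpose(Q), x))
```

The freedom to choose Q is what allows searching for a rotated weight matrix W Q that is closer to a structured class than W itself.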

[227] TO-GATE: Clarifying Questions and Summarizing Responses with Trajectory Optimization for Eliciting Human Preference

Yulin Dou,Jiangming Liu

Main category: cs.CL

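TL;DR: TO-GATE uses trajectory optimization to improve LLM question generation in multi-turn dialogue and significantly outperforms baseline methods.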

Details

Motivation: Existing approaches based on self-taught reasoning struggle to identify optimal dialogue trajectories and to avoid irrelevant questions. Method: TO-GATE comprises a clarification resolver (which generates optimal questioning trajectories) and a summarizer (which ensures task-aligned final responses). Result: Experiments show a 9.32% improvement over baseline methods on preference elicitation tasks. Conclusion: By optimizing dialogue trajectories, TO-GATE significantly improves LLM performance in task-aligned question generation and response. Abstract: Large language models (LLMs) can effectively elicit human preferences through multi-turn dialogue. Complex tasks can be accomplished through iterative clarifying questions and final responses generated by an LLM acting as a questioner (STaR-GATE; Andukuri et al., 2024). However, existing approaches based on self-taught reasoning struggle to identify optimal dialogue trajectories and to avoid questions irrelevant to the task. To address this limitation, we propose TO-GATE, a novel framework that enhances question generation through trajectory optimization, which consists of two key components: a clarification resolver that generates optimal questioning trajectories, and a summarizer that ensures task-aligned final responses. The trajectory optimization enables the model to produce effective elicitation questions and summary responses tailored to specific tasks. Experimental results demonstrate that TO-GATE significantly outperforms baseline methods, achieving a 9.32% improvement on standard preference elicitation tasks.

[228] Token and Span Classification for Entity Recognition in French Historical Encyclopedias

Ludovic Moncla,Hédi Zeghidi

Main category: cs.CL

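TL;DR: Benchmarks a range of named entity recognition (NER) approaches on historical text, from classical methods to Transformer-based models, and proposes a classification framing that handles nested entities.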

Details

Motivation: NER in historical texts faces distinctive challenges, including non-standardized language, archaic orthography, and nested entities, so effective methods need to be explored. Method: Benchmark CRFs, spaCy, CamemBERT, and Flair on the GeoEDdA dataset, frame NER as token- and span-level classification, and evaluate generative models in low-resource scenarios. Result: Transformer-based models perform best on nested entities, while generative models show promise when data are scarce. Conclusion: Historical NER remains challenging; future work could combine symbolic and neural methods to better handle the complexity of early modern French text. Abstract: Named Entity Recognition (NER) in historical texts presents unique challenges due to non-standardized language, archaic orthography, and nested or overlapping entities. This study benchmarks a diverse set of NER approaches, ranging from classical Conditional Random Fields (CRFs) and spaCy-based models to transformer-based architectures such as CamemBERT and sequence-labeling models like Flair. Experiments are conducted on the GeoEDdA dataset, a richly annotated corpus derived from 18th-century French encyclopedias. We propose framing NER as both token-level and span-level classification to accommodate complex nested entity structures typical of historical documents. Additionally, we evaluate the emerging potential of few-shot prompting with generative language models for low-resource scenarios. Our results demonstrate that while transformer-based models achieve state-of-the-art performance, especially on nested entities, generative models offer promising alternatives when labeled data are scarce. The study highlights ongoing challenges in historical NER and suggests avenues for hybrid approaches combining symbolic and neural methods to better capture the intricacies of early modern French text.

[229] CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective

Jintian Shao,Yiming Cheng

Main category: cs.CL

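TL;DR: Argues against the view that Chain-of-Thought (CoT) prompting elicits genuine reasoning in large language models; CoT is instead a structural constraint that guides models to imitate the form of reasoning.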

Details

Motivation: Examine whether CoT prompting truly elicits abstract reasoning in large language models or merely imitates the form of reasoning through structural constraints. Method: Through theoretical analysis, argue that CoT prompting, by forcing the generation of intermediate steps, leverages the model's sequence prediction and pattern matching rather than genuine reasoning. Result: CoT prompting constrains model output to sequences that resemble coherent thought processes rather than achieving genuine reasoning. Conclusion: CoT prompting does not elicit genuine abstract reasoning; it imitates the form of reasoning through structural constraints. Abstract: Chain-of-Thought (CoT) prompting has demonstrably enhanced the performance of Large Language Models on tasks requiring multi-step inference. This success has led to widespread claims of emergent reasoning capabilities in these models. In this paper, we present a theoretical counter-perspective: Chain-of-Thought (CoT) does not elicit genuine, abstract reasoning. Instead, we argue that Chain-of-Thought functions as a powerful structural constraint that guides Large Language Models to imitate the form of reasoning. By forcing the generation of intermediate steps, Chain-of-Thought leverages the model's immense capacity for sequence prediction and pattern matching, effectively constraining its output to sequences that resemble coherent thought processes.

[230] A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation

Verena Blaschke,Miriam Winkler,Constantin Förster,Gabriele Wenger-Glemser,Barbara Plank

Main category: cs.CL

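TL;DR: Introduces the Betthupferl dataset for evaluating the robustness of ASR models to dialects spoken in Southeast Germany and analyzes the differences between the dialects and Standard German.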

Details

Motivation: German dialects are underrepresented in current ASR research, and model robustness to dialectal variation needs to be evaluated. Method: Build the Betthupferl dataset, covering three dialect groups and Standard German, and benchmark multilingual ASR models on speech translation into Standard German. Result: Model outputs differ in how much they resemble the dialectal vs. standardized transcriptions; the best model sometimes normalizes grammatical differences but often stays closer to dialectal constructions. Conclusion: Betthupferl provides a basis for studying dialect ASR, and model behavior shows limited handling of grammatical differences. Abstract: Although Germany has a diverse landscape of dialects, they are underrepresented in current automatic speech recognition (ASR) research. To enable studies of how robust models are towards dialectal variation, we present Betthupferl, an evaluation dataset containing four hours of read speech in three dialect groups spoken in Southeast Germany (Franconian, Bavarian, Alemannic), and half an hour of Standard German speech. We provide both dialectal and Standard German transcriptions, and analyze the linguistic differences between them. We benchmark several multilingual state-of-the-art ASR models on speech translation into Standard German, and find differences between how much the output resembles the dialectal vs. standardized transcriptions. Qualitative error analyses of the best ASR model reveal that it sometimes normalizes grammatical differences, but often stays closer to the dialectal constructions.

[231] IMPARA-GED: Grammatical Error Detection is Boosting Reference-free Grammatical Error Quality Estimator

Yusuke Sakai,Takumi Goto,Taro Watanabe

Main category: cs.CL

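TL;DR: IMPARA-GED is a new reference-free automatic grammatical error correction (GEC) evaluation method with grammatical error detection (GED) capabilities, built on a pre-trained language model; experiments show it achieves the highest correlation with human evaluations on the SEEDA dataset.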

Details

Motivation: Improve the quality estimator of IMPARA, an existing automatic GEC evaluation method, by enhancing its GED capabilities so that corrections are evaluated more accurately. Method: Build IMPARA-GED's quality estimator from a pre-trained language model with enhanced GED capabilities. Result: On SEEDA, IMPARA-GED achieves the highest correlation with human sentence-level evaluations. Conclusion: IMPARA-GED is an effective automatic GEC evaluation method that benefits in particular from grammatical error detection. Abstract: We propose IMPARA-GED, a novel reference-free automatic grammatical error correction (GEC) evaluation method with grammatical error detection (GED) capabilities. We focus on the quality estimator of IMPARA, an existing automatic GEC evaluation method, and construct that of IMPARA-GED using a pre-trained language model with enhanced GED capabilities. Experimental results on SEEDA, a meta-evaluation dataset for automatic GEC evaluation methods, demonstrate that IMPARA-GED achieves the highest correlation with human sentence-level evaluations.

[232] Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning

Yin Fang,Qiao Jin,Guangzhi Xiong,Bowen Jin,Xianrui Zhong,Siru Ouyang,Aidong Zhang,Jiawei Han,Zhiyong Lu

Main category: cs.CL

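TL;DR: Introduces the CellPuzzles task, which requires reasoning over batch-level cellular context to assign a unique type to each cell, and develops the Cell-o1 model, which substantially improves performance.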

Details

Motivation: Current foundation models annotate single-cell RNA sequencing data without considering batch-level context or providing explanatory reasoning, whereas human experts assign distinct types to different cell clusters based on domain knowledge. Method: Introduce the CellPuzzles task and develop Cell-o1, trained via supervised fine-tuning followed by reinforcement learning with batch-level rewards. Result: Cell-o1 outperforms the best baseline (OpenAI's o1) by over 73% and exhibits expert-like reasoning. Conclusion: Cell-o1 performs strongly on batch-level cell annotation and provides new ideas and tools for future research. Abstract: Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI's o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are available at https://github.com/ncbi-nlp/cell-o1.
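
Editor's sketch: the abstract reports "batch-level accuracy" without defining it. A plausible reading, assumed here and not confirmed by the source, is that a batch counts as correct only when every cell receives the right type and no type is used twice. The cell labels are invented.

```python
def batch_level_accuracy(predictions, gold):
    """Fraction of batches in which every cell is labeled correctly and uniquely.

    Assumed definition: a single wrong or duplicated label fails the whole batch.
    """
    correct = 0
    for predicted, reference in zip(predictions, gold):
        labels_unique = len(set(predicted)) == len(predicted)
        correct += labels_unique and predicted == reference
    return correct / len(gold)


predicted_batches = [
    ["T cell", "B cell"],    # fully correct
    ["T cell", "T cell"],    # violates label uniqueness
    ["NK cell", "B cell"],   # right types, wrong assignment
]
gold_batches = [
    ["T cell", "B cell"],
    ["T cell", "B cell"],
    ["B cell", "NK cell"],
]
accuracy = batch_level_accuracy(predicted_batches, gold_batches)
```

Under this all-or-nothing reading, per-cell accuracy can be high while batch-level accuracy stays low, which would be consistent with the 19.0% figure reported for the strongest off-the-shelf baseline.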

[233] A Controllable Examination for Long-Context Language Models

Yijun Yang,Zeyu Huang,Wenhao Zhu,Zihan Qiu,Fei Yuan,Jeff Z. Pan,Ivan Titov

Main category: cs.CL

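TL;DR: Proposes LongBioBench, a new evaluation framework for long-context language models that addresses the limitations of existing evaluation methods, and validates it experimentally.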

Details

Motivation: Existing evaluation frameworks for long-context language models have limitations: real-world tasks are too complex and prone to data contamination, while synthetic tasks lack contextual coherence. Method: Propose LongBioBench, which uses artificially generated biographies as a controlled environment to assess understanding, reasoning, and trustworthiness. Result: Most models remain deficient in semantic understanding and elementary reasoning, and trustworthiness declines as context length increases. Conclusion: LongBioBench strikes a better balance between mirroring authentic tasks and maintaining controllability, and is highly interpretable and configurable. Abstract: Existing frameworks for evaluating long-context language models (LCLM) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks are too complex to interpret or characterize and are susceptible to data contamination. In contrast, synthetic tasks often adopt the needle-in-the-haystack (NIAH) format, wherein a lack of coherence between the "needle" and the "haystack" compromises their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: seamless context, controllable setting, and sound evaluation. This study introduces LongBioBench, a novel benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of understanding, reasoning, and trustworthiness. Our experimental evaluation, which includes 18 LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, leave them vulnerable as tests of a model's long-context capabilities. Moreover, we also reveal that long-context continual pretraining primarily adjusts RoPE embedding to accommodate extended context lengths. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.

[234] INESC-ID @ eRisk 2025: Exploring Fine-Tuned, Similarity-Based, and Prompt-Based Approaches to Depression Symptom Identification

Diogo A. P. Nunes,Eugénio Ribeiro

Main category: cs.CL

TL;DR: By fine-tuning foundation models and adding synthetic data, the team achieved the best results in eRisk 2025 Task 1, outperforming 16 other teams.

Details

Motivation: Retrieve sentences relevant to the depression symptoms in the BDI questionnaire; because the training data labels are limited, the problem is framed as binary classification. Method: Split the labeled data into training and validation sets and explore foundation model fine-tuning, sentence similarity, LLM prompting, and ensemble techniques. Result: Fine-tuned foundation models perform best, especially with synthetic data; the best approach varies by symptom, and the five submitted test runs achieved the highest scores in the official evaluation. Conclusion: Fine-tuning foundation models with synthetic data is an effective approach to retrieving depression-symptom sentences, and the strategy should be adapted per symptom. Abstract: In this work, we describe our team's approach to eRisk's 2025 Task 1: Search for Symptoms of Depression. Given a set of sentences and the Beck's Depression Inventory - II (BDI) questionnaire, participants were tasked with submitting up to 1,000 sentences per depression symptom in the BDI, sorted by relevance. Participant submissions were evaluated according to standard Information Retrieval (IR) metrics, including Average Precision (AP) and R-Precision (R-PREC). The provided training data, however, consisted of sentences labeled as to whether a given sentence was relevant or not w.r.t. one of BDI's symptoms. Due to this labeling limitation, we framed our development as a binary classification task for each BDI symptom, and evaluated accordingly. To that end, we split the available labeled data into training and validation sets, and explored foundation model fine-tuning, sentence similarity, Large Language Model (LLM) prompting, and ensemble techniques. The validation results revealed that fine-tuning foundation models yielded the best performance, particularly when enhanced with synthetic data to mitigate class imbalance. We also observed that the optimal approach varied by symptom. Based on these insights, we devised five independent test runs, two of which used ensemble methods. These runs achieved the highest scores in the official IR evaluation, outperforming submissions from 16 other teams.
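
Editor's sketch: the two IR metrics named in the abstract have standard textbook definitions, illustrated below for a single symptom's ranked list. The relevance judgments are invented, and the official eRisk evaluation scripts may differ in details such as cutoffs or pooling.

```python
def average_precision(ranked_relevance, total_relevant):
    """AP: mean of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    score = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank
    return score / total_relevant if total_relevant else 0.0


def r_precision(ranked_relevance, total_relevant):
    """R-Precision: precision within the top R results, R = number of relevant items."""
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:total_relevant]) / total_relevant


# Hypothetical ranked submission for one symptom: 1 = relevant sentence, 0 = not.
ranking = [1, 0, 1, 0, 0]
ap = average_precision(ranking, total_relevant=2)   # (1/1 + 2/3) / 2
rprec = r_precision(ranking, total_relevant=2)      # 1 relevant in the top 2
```

Both metrics reward placing relevant sentences early, which is why a binary classifier's confidence scores can be reused directly as the sort key for a submission.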

[235] Quantitative LLM Judges

Aishwarya Sahoo,Jeevana Kruthi Karnuthala,Tushar Parmanand Budhwani,Pranchal Agarwal,Sankaran Vaidyanathan,Alexa Siu,Franck Dernoncourt,Jennifer Healey,Nedim Lipka,Ryan Rossi,Uttaran Bhattacharya,Branislav Kveton

Main category: cs.CL

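TL;DR: Within the LLM-as-a-judge framework, the paper aligns LLM judge scores with human scores using regression models and presents four quantitative judges; the approach is computationally efficient and suited to settings with limited human feedback.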

Details

Motivation: Address the mismatch between existing LLM judge scores and human scores, improving scoring accuracy and efficiency. Method: Train regression models as quantitative judges, using the judge's textual evaluation and original score. Result: Experiments show that quantitative judges effectively improve the predictive power of existing judges. Conclusion: The framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient, making it suitable for practical applications. Abstract: LLM-as-a-judge is a framework in which a large language model (LLM) automatically evaluates the output of another LLM. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain using regression models. The models are trained to improve the score of the original judge by using the judge's textual evaluation and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in most applications of our work. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
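
Editor's sketch: the paper's regression models use both the judge's textual evaluation and its score. As a deliberately simplified illustration of post-hoc alignment, the example below fits a one-dimensional least-squares line from the judge's raw score to the human score and ignores the text entirely. That simplification, the function names, and the scores are the editor's assumptions.

```python
def fit_linear(judge_scores, human_scores):
    """Ordinary least squares for human ~ slope * judge + intercept."""
    n = len(judge_scores)
    mean_j = sum(judge_scores) / n
    mean_h = sum(human_scores) / n
    variance = sum((j - mean_j) ** 2 for j in judge_scores)
    covariance = sum(
        (j - mean_j) * (h - mean_h) for j, h in zip(judge_scores, human_scores)
    )
    slope = covariance / variance
    intercept = mean_h - slope * mean_j
    return slope, intercept


def calibrate(judge_score, slope, intercept):
    """Map a raw judge score onto the human scale."""
    return slope * judge_score + intercept


# Hypothetical data: the judge spreads its scores more widely than humans do.
judge = [1.0, 2.0, 3.0, 4.0]
human = [1.5, 2.0, 2.5, 3.0]
slope, intercept = fit_linear(judge, human)
aligned = calibrate(5.0, slope, intercept)
```

Only the small regression model is trained while the base judge stays frozen, which is the source of the computational advantage over supervised fine-tuning noted in the abstract.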

[236] Adaptive Graph Pruning for Multi-Agent Communication

Boyi Li,Zhonghan Zhao,Der-Horng Lee,Gaoang Wang

Main category: cs.CL

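TL;DR: Proposes Adaptive Graph Pruning (AGP), a multi-agent collaboration framework that dynamically optimizes the number of agents and the communication topology, significantly improving task adaptivity and performance.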

Details

Motivation: LLM-based multi-agent systems perform well, but fixed agent counts and static communication structures limit their ability to adapt to varying task complexity. Method: AGP uses a two-stage training strategy: first, soft-pruning networks are trained independently to determine task-specific optimal agent counts and communication topologies; then hard-pruning and soft-pruning are jointly optimized within a maximum complete graph. Result: AGP achieves state-of-the-art results on six benchmarks, with gains of 2.58% to 9.84%, while significantly reducing training steps and token consumption. Conclusion: AGP is an efficient, task-adaptive, and economical approach that dynamically optimizes multi-agent collaboration, improving performance while reducing resource consumption. Abstract: Large Language Model (LLM) based multi-agent systems have shown remarkable performance in various tasks, especially when enhanced through collaborative communication. However, current methods often rely on a fixed number of agents and static communication structures, limiting their ability to adapt to varying task complexities. In this paper, we propose Adaptive Graph Pruning (AGP), a novel task-adaptive multi-agent collaboration framework that jointly optimizes agent quantity (hard-pruning) and communication topology (soft-pruning). Specifically, our method employs a two-stage training strategy: first, independently training soft-pruning networks for different agent quantities to determine optimal agent-quantity-specific complete graphs and positional masks across specific tasks; and then jointly optimizing hard-pruning and soft-pruning within a maximum complete graph to dynamically configure the number of agents and their communication topologies per task. Extensive experiments demonstrate that our approach is: (1) High-performing, achieving state-of-the-art results across six benchmarks and generalizing consistently across multiple mainstream LLM architectures, with a performance increase of 2.58% to 9.84%; (2) Task-adaptive, dynamically constructing optimized communication topologies tailored to specific tasks, with extremely high performance in all three task categories (general reasoning, mathematical reasoning, and code generation); (3) Token-economical, reducing both training steps and token consumption, with token consumption decreasing by over 90%; and (4) Training-efficient, achieving high performance with very few training steps compared with other methods. Performance surpasses the existing baselines after about ten training steps on the six benchmarks.

[237] HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring

Zhixiong Su,Yichen Wang,Herun Wan,Zhaohan Zhang,Minnan Luo

Main category: cs.CL

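TL;DR: This paper explores the feasibility of fine-grained machine-generated text (MGT) detection when humans and AI coauthor text. It proposes the HACo-Det dataset and retrofits existing detectors. Experiments show that finetuned models outperform metric-based methods, but the problem remains far from solved.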

Details

Motivation: Existing work focuses mainly on binary, document-level detection and neglects text coauthored by humans and AI, so fine-grained detection methods are needed. Method: Proposes the HACo-Det dataset, retrofits seven document-level detectors to support word-level detection, and evaluates them on word- and sentence-level tasks. Result: Metric-based methods perform poorly (0.462 average F1), while finetuned models perform better and generalize better, but the problem is not fully solved. Conclusion: Fine-grained detection of coauthored text remains challenging and needs further improvement, for example in how the context window is handled. Abstract: The misuse of large language models (LLMs) poses potential risks, motivating the development of machine-generated text (MGT) detection. Existing literature primarily concentrates on binary, document-level detection, thereby neglecting texts that are composed jointly by human and LLM contributions. Hence, this paper explores the possibility of fine-grained MGT detection under human-AI coauthoring. We suggest fine-grained detectors can pave pathways toward coauthored text detection with a numeric AI ratio. Specifically, we propose a dataset, HACo-Det, which produces human-AI coauthored texts via an automatic pipeline with word-level attribution labels. We retrofit seven prevailing document-level detectors to generalize them to word-level detection. Then we evaluate these detectors on HACo-Det on both word- and sentence-level detection tasks. Empirical results show that metric-based methods struggle to conduct fine-grained detection with a 0.462 average F1 score, while finetuned models show superior performance and better generalization across domains. However, we argue that fine-grained coauthored text detection is far from solved. We further analyze factors influencing performance, e.g., context window, and highlight the limitations of current methods, pointing to potential avenues for improvement.
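To make the word-level evaluation and the "numeric AI ratio" concrete, here is a minimal sketch. It assumes binary word labels (1 for machine-generated, 0 for human-written) and scores F1 on the machine-generated class; the function names and the toy labels are hypothetical, and the paper's exact labeling and averaging scheme may differ.

```python
def word_level_f1(gold, pred):
    """F1 for the positive class (1 = machine-generated word)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def ai_ratio(labels):
    """Fraction of words attributed to the machine."""
    return sum(labels) / len(labels) if labels else 0.0


gold = [0, 0, 1, 1, 1, 0]  # toy gold attribution, one label per word
pred = [0, 1, 1, 1, 0, 0]  # toy detector output
f1 = word_level_f1(gold, pred)
ratio = ai_ratio(pred)
```

On the toy labels the detector gets two of three machine-generated words right with one false alarm, so precision and recall are both 2/3, and half of the words are predicted to be machine-generated.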

[238] FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models

Yan Gao,Massimo Roberto Scamarcia,Javier Fernandez-Marques,Mohammad Naseri,Chong Shen Ng,Dimitris Stripelis,Zexi Li,Tao Shen,Jiamu Bai,Daoyuan Chen,Zikai Zhang,Rui Hu,InSeo Song,Lee KangYoon,Hong Jia,Ting Dang,Junyan Wang,Zheyuan Liu,Daniel Janes Beutel,Lingjuan Lyu,Nicholas D. Lane

Main category: cs.CL

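TL;DR: The paper introduces the FlowerTune LLM Leaderboard for evaluating fine-tuning of pre-trained large language models in federated learning settings, covering four domains and comparing 26 models.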

Details

Motivation: To address the reliance of large language models on publicly available data and to explore the potential of federated learning for privacy-preserving fine-tuning. Method: Develops the FlowerTune LLM Leaderboard, which includes federated instruction-tuning datasets and evaluation metrics for four domains, and compares 26 pre-trained models. Result: Provides a comprehensive comparison of model performance, resource constraints, and domain adaptation. Conclusion: The work lays a foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications. Abstract: Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning on pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely underexplored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source and community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.

[239] Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation

Dingwei Chen,Ziqiang Liu,Feiteng Fang,Chak Tou Leong,Shiwen Ni,Ahmadreza Argha,Hamid Alinejad-Rokny,Min Yang,Chengming Li

Main category: cs.CL

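TL;DR: PLI (Premature Layers Interpolation) is a training-free, plug-and-play intervention that inserts intermediate layers formed by mathematical interpolation to reduce hallucinations in large language models (LLMs) and improve factual consistency.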

Details

Motivation: LLMs excel at text understanding and generation but produce factually inconsistent outputs (hallucinations); existing methods are either resource-intensive or fail to exploit the model's internal mechanisms. Method: Proposes PLI, which inserts intermediate layers between adjacent layers via mathematical interpolation, extending the depth of information processing and improving factual consistency. Result: On four public datasets, PLI effectively reduces hallucinations and outperforms existing baselines in most cases. Conclusion: The success of PLI is closely linked to LLMs' internal mechanisms; code and data will be released to support reproducibility. Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities in text understanding and generation. However, their tendency to produce factually inconsistent outputs, commonly referred to as "hallucinations", remains a critical challenge. Existing approaches, such as retrieval-based and inference-time correction methods, primarily address this issue at the input or output level, often overlooking the intrinsic information refinement process and the role of premature layers. Meanwhile, alignment- and fine-tuning-based methods are resource-intensive. In this paper, we propose PLI (Premature Layers Interpolation), a novel, training-free, and plug-and-play intervention designed to enhance factuality. PLI mitigates hallucinations by inserting premature layers formed through mathematical interpolation with adjacent layers. Inspired by stable diffusion and sampling steps, PLI extends the depth of information processing and transmission in LLMs, improving factual coherence. Experiments on four publicly available datasets demonstrate that PLI effectively reduces hallucinations while outperforming existing baselines in most cases. Further analysis suggests that the success of layer interpolation is closely linked to LLMs' internal mechanisms. To promote reproducibility, we will release our code and data upon acceptance.
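The idea of inserting a layer formed by interpolating its neighbors can be sketched in a few lines. This is only an illustration under loud assumptions: the abstract does not say which quantities are interpolated or with what formula, so the sketch assumes plain linear interpolation of flat per-layer weight vectors with a hypothetical mixing coefficient `alpha`; the function names are invented for the example.

```python
def interpolate_layer(lower, upper, alpha=0.5):
    """Linear interpolation between two layers' weights (assumed scheme)."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(lower, upper)]


def insert_interpolated(layers, index, alpha=0.5):
    """Return a new stack with one extra layer between index and index + 1."""
    new_layer = interpolate_layer(layers[index], layers[index + 1], alpha)
    return layers[: index + 1] + [new_layer] + layers[index + 1:]


# Toy "model": three layers, each reduced to a two-element weight vector.
layers = [[0.0, 2.0], [1.0, 4.0], [3.0, 8.0]]
expanded = insert_interpolated(layers, 0)
```

The original stack is left untouched and the expanded stack is one layer deeper, which matches the training-free, plug-and-play framing: no parameter is updated, only the depth changes.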

[240] Towards a Japanese Full-duplex Spoken Dialogue System

Atsumoto Ohashi,Shinya Iizuka,Jingjing Jiang,Ryuichiro Higashinaka

Main category: cs.CL

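TL;DR: This paper presents the first publicly available full-duplex spoken dialogue model for Japanese. Built on the English model Moshi, it uses two-stage training (pre-training and fine-tuning) plus synthetic data augmentation and outperforms baseline models.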

Details

Motivation: Research on full-duplex spoken dialogue systems for Japanese is scarce, and this paper aims to fill that gap. Method: Builds on the Moshi model with two-stage training (pre-training and fine-tuning) and augmentation with synthetic data. Result: The model outperforms Japanese baseline models in naturalness and meaningfulness. Conclusion: The paper delivers the first publicly available Japanese full-duplex spoken dialogue model, with improved performance over baselines. Abstract: Full-duplex spoken dialogue systems, which can model simultaneous bidirectional features of human conversations such as speech overlaps and backchannels, have attracted significant attention recently. However, the study of full-duplex spoken dialogue systems for the Japanese language has been limited, and the research on their development in Japanese remains scarce. In this paper, we present the first publicly available full-duplex spoken dialogue model in Japanese, which is built upon Moshi, a full-duplex dialogue model in English. Our model is trained through a two-stage process: pre-training on large-scale spoken dialogue data in Japanese, followed by fine-tuning on high-quality stereo spoken dialogue data. We further enhance the model's performance by incorporating synthetic dialogue data generated by a multi-stream text-to-speech system. Evaluation experiments demonstrate that the trained model outperforms Japanese baseline models in both naturalness and meaningfulness.

[241] Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis

Richard Armitage

Main category: cs.CL

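TL;DR: Tests leading LLMs on primary care education questions; all models substantially outperformed the average peer score of GPs and GP registrars.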

Details

Motivation: To assess the potential of LLMs in primary care, specifically their ability to answer MRCGP-style examination questions. Method: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tested on 100 MRCGP-style multiple choice questions. Result: o3 scored 99.0% and the other models scored 95.0%, all well above the average peer score of 73.0%. Conclusion: LLMs, particularly reasoning models, show substantial potential to support primary care. Abstract: Background: Large language models (LLMs) have demonstrated substantial potential to support clinical practice. Other than Chat GPT4 and its predecessors, few LLMs, especially those of the leading and more powerful reasoning model class, have been subjected to medical specialty examination questions, including in the domain of primary care. This paper aimed to test the capabilities of leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro) in primary care education, specifically in answering Membership of the Royal College of General Practitioners (MRCGP) style examination questions. Methods: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked to answer 100 randomly chosen multiple choice questions from the Royal College of General Practitioners GP SelfTest on 25 May 2025. Questions included textual information, laboratory results, and clinical images. Each model was prompted to answer as a GP in the UK and was provided with full question information. Each question was attempted once by each model. Responses were scored against correct answers provided by GP SelfTest. Results: The total score of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro was 99.0%, 95.0%, 95.0%, and 95.0%, respectively. The average peer score for the same questions was 73.0%. Discussion: All models performed remarkably well, and all substantially exceeded the average performance of GPs and GP registrars who had answered the same questions. o3 demonstrated the best performance, while the performances of the other leading models were comparable with each other and were not substantially lower than that of o3.
These findings strengthen the case for LLMs, particularly reasoning models, to support the delivery of primary care, especially those that have been specifically trained on primary care clinical data.

[242] It's Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems

Iuliia Zaitova,Badr M. Abdullah,Wei Xue,Dietrich Klakow,Bernd Möbius,Tania Avgustinova

Main category: cs.CL

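TL;DR: The paper systematically evaluates idiom translation in text-to-text machine translation (MT) and speech-to-text translation (SLT), finding that SLT systems handle idioms poorly while MT systems and large language models do better.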

Details

Motivation: Idiom translation is a major challenge in machine translation and is especially understudied in speech-to-text systems, so a systematic evaluation is needed. Method: Compares multiple MT systems, SLT systems, and large language models on German-to-English and Russian-to-English idiom translation. Result: SLT systems show a pronounced performance drop on idioms and often fall back to literal translations, while MT systems and large language models perform better. Conclusion: SLT architectures need idiom-specific strategies and improved internal representations. Abstract: Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.

[243] A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems

Đorđe Klisura,Astrid R Bernaga Torres,Anna Karen Gárate-Escamilla,Rajesh Roshan Biswal,Ke Yang,Hilal Pataci,Anthony Rios

Main category: cs.CL

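TL;DR: Proposes a multi-agent framework in which a Dialect Agent and a Privacy Policy Agent collaborate to reduce dialect bias in privacy policy question-answering systems, improving performance substantially without retraining.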

Details

Motivation: The complexity of privacy policies limits accessibility for diverse populations, and existing systems perform unevenly across English dialects, disadvantaging speakers of non-standard varieties. Method: Combines a Dialect Agent (which translates queries into Standard American English) with a Privacy Policy Agent (which refines predictions using domain knowledge), with no additional training or dialect-specific tuning. Result: Zero-shot accuracy rises from 0.394 to 0.601 on PrivacyQA and from 0.352 to 0.464 on PolicyQA, surpassing or matching few-shot baselines. Conclusion: Structured agent collaboration effectively reduces dialect bias and underscores the importance of designing NLP systems that account for linguistic diversity to ensure equitable access to privacy information. Abstract: Privacy policies inform users about data collection and usage, yet their complexity limits accessibility for diverse populations. Existing Privacy Policy Question Answering (QA) systems exhibit performance disparities across English dialects, disadvantaging speakers of non-standard varieties. We propose a novel multi-agent framework inspired by human-centered design principles to mitigate dialectal biases. Our approach integrates a Dialect Agent, which translates queries into Standard American English (SAE) while preserving dialectal intent, and a Privacy Policy Agent, which refines predictions using domain expertise. Unlike prior approaches, our method does not require retraining or dialect-specific fine-tuning, making it broadly applicable across models and domains. Evaluated on PrivacyQA and PolicyQA, our framework improves GPT-4o-mini's zero-shot accuracy from 0.394 to 0.601 on PrivacyQA and from 0.352 to 0.464 on PolicyQA, surpassing or matching few-shot baselines without additional training data. These results highlight the effectiveness of structured agent collaboration in mitigating dialect biases and underscore the importance of designing NLP systems that account for linguistic diversity to ensure equitable access to privacy information.

[244] Conditioning Large Language Models on Legal Systems? Detecting Punishable Hate Speech

Florian Ludwig,Torsten Zesch,Frederike Zufall

Main category: cs.CL

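TL;DR: This paper studies how to condition large language models (LLMs) on legal knowledge at different levels of abstraction to detect hate speech punishable under the German Criminal Code. The results show that a significant gap remains between models and legal experts in assessing hate speech.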

Details

Motivation: Assessing legal problems requires considering a specific legal system and its levels of abstraction, but it is unclear whether LLMs have internalized such systems. The paper explores how to condition LLMs on legal knowledge at different levels of abstraction. Method: Proposes and investigates approaches for conditioning LLMs at multiple levels of legal abstraction, focusing on classifying hate speech under German criminal law. Result: Models perform poorly in legal assessment: those conditioned on abstract legal knowledge lack deep task understanding, while those conditioned on concrete legal knowledge identify target groups reasonably well but still struggle to classify target conduct. Conclusion: Although LLMs perform acceptably on some subtasks, they still lag behind experts in legal assessment and need further improvement. Abstract: The assessment of legal problems requires the consideration of a specific legal system and its levels of abstraction, from constitutional law to statutory law to case law. The extent to which Large Language Models (LLMs) internalize such legal systems is unknown. In this paper, we propose and investigate different approaches to conditioning LLMs at multiple levels of abstraction in legal systems to detect potentially punishable hate speech. We focus on the task of classifying whether a specific social media post falls under the criminal offense of incitement to hatred as prescribed by the German Criminal Code. The results show that there is still a significant performance gap between models and legal experts in the legal assessment of hate speech, regardless of the level of abstraction with which the models were conditioned. Our analysis revealed that models conditioned on abstract legal knowledge lacked deep task understanding, often contradicting themselves and hallucinating answers, while models using concrete legal knowledge performed reasonably well in identifying relevant target groups, but struggled with classifying target conducts.

[245] Coding Agents with Multimodal Browsing are Generalist Problem Solvers

Aditya Bharat Soni,Boxuan Li,Xingyao Wang,Valerie Chen,Graham Neubig

Main category: cs.CL

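TL;DR: OpenHands-Versa is a generalist AI agent that uses a small set of general tools (such as code editing and web search) to perform strongly across diverse tasks, outperforming specialized agents.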

Details

Motivation: Modern AI agents are usually optimized for specific tasks and generalize poorly. The paper asks how a minimal set of general tools can achieve high performance across diverse tasks. Method: Develops the OpenHands-Versa agent, which integrates general tools for code editing and execution, web search, multimodal web browsing, and file access. Result: On the SWE-Bench Multimodal, GAIA, and The Agent Company benchmarks, OpenHands-Versa surpasses the best previously published results by 9.1, 1.3, and 9.1 absolute points in success rate, respectively. Conclusion: OpenHands-Versa demonstrates the feasibility of generalist agents and provides a strong baseline for future research. Abstract: Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. In addition, AI agents have been specialized for domains such as software engineering, web navigation, and workflow automation. However, this results in agents that are good for one thing but fail to generalize beyond their intended scope. One reason for this is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask the question: what is the minimal set of general tools that can be used to achieve high performance across a diverse set of tasks? Our answer is OpenHands-Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, as well as multimodal web browsing and file access. Importantly, OpenHands-Versa demonstrates superior or competitive performance over leading specialized agents across three diverse and challenging benchmarks: SWE-Bench Multimodal, GAIA, and The Agent Company, outperforming the best-performing previously published results with absolute improvements in success rate of 9.1, 1.3, and 9.1 points respectively. Further, we show how existing state-of-the-art multi-agent systems fail to generalize beyond their target domains. These results demonstrate the feasibility of developing a generalist agent to solve diverse tasks and establish OpenHands-Versa as a strong baseline for future research.

[246] Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot Learning

Pierre Lepagnol,Sahar Ghannay,Thomas Gerald,Christophe Servan,Sophie Rosset

Main category: cs.CL

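TL;DR: The paper proposes using information retrieval (IR) methods to select examples for an enhanced prompt, improving spoken language understanding (SLU) performance in few-shot learning.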

Details

Motivation: State-of-the-art SLU techniques rely on large amounts of training data, but annotated data for specific tasks or languages is limited, whereas instruction-tuned large language models (LLMs) perform well in few-shot settings. Method: Selects examples with information retrieval (IR) methods to build an enhanced prompt, which is applied to SLU tasks. Result: Experiments show that lexical IR methods significantly improve performance without increasing prompt length. Conclusion: The method helps address data scarcity in SLU and offers a new direction for few-shot learning. Abstract: Understanding user queries is fundamental in many applications, such as home assistants, booking systems, or recommendations. Accordingly, it is crucial to develop accurate Spoken Language Understanding (SLU) approaches to ensure the reliability of the considered system. Current state-of-the-art SLU techniques rely on large amounts of training data; however, only limited annotated examples are available for specific tasks or languages. In the meantime, instruction-tuned large language models (LLMs) have shown exceptional performance on unseen tasks in a few-shot setting when provided with adequate prompts. In this work, we propose to explore example selection by leveraging information retrieval (IR) approaches to build an enhanced prompt that is applied to an SLU task. We evaluate the effectiveness of the proposed method on several SLU benchmarks. Experimental results show that lexical IR methods significantly enhance performance without increasing prompt length.
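Retrieval-based example selection for a few-shot prompt can be sketched as follows. The abstract only says that lexical IR methods are used, so this sketch swaps in simple Jaccard token overlap as a stand-in scorer; the example pool, the intent labels, and the prompt template are all hypothetical.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(query, pool, k=2):
    """Rank annotated examples by lexical overlap with the query, keep top k."""
    q = tokenize(query)
    ranked = sorted(pool, key=lambda ex: jaccard(q, tokenize(ex["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble retrieved demonstrations followed by the unanswered query."""
    blocks = [f"Utterance: {ex['text']}\nIntent: {ex['intent']}" for ex in examples]
    blocks.append(f"Utterance: {query}\nIntent:")
    return "\n\n".join(blocks)


pool = [
    {"text": "book a table for two tonight", "intent": "book_restaurant"},
    {"text": "play some jazz music", "intent": "play_music"},
    {"text": "book a flight to paris", "intent": "book_flight"},
]
query = "book a table for four"
chosen = select_examples(query, pool, k=1)
prompt = build_prompt(query, chosen)
```

Because `k` is fixed, the prompt contains the same number of demonstrations as a randomly sampled few-shot prompt would; only which examples appear changes, which is consistent with the claim that prompt length does not grow.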

[247] Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

Jintian Shao,Yiming Cheng

Main category: cs.CL

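TL;DR: The paper analyzes the limitations of the VAPO framework when reinforcement learning (RL) is used to enhance long chain-of-thought reasoning in large language models (LLMs), examining credit assignment, value function representational capacity, and related issues.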

Details

Motivation: To expose the limitations of the VAPO framework in long-chain reasoning and thereby advance research on more robust LLM agents. Method: A theoretical analysis of VAPO's shortcomings in credit assignment, value function representation, and translating global value signals into local policy improvements. Result: VAPO faces fundamental limits in modeling long-term value and in providing fine-grained policy guidance. Conclusion: The paper offers insight into the limitations of current RL for advanced reasoning and suggests directions for future research. Abstract: Reinforcement learning (RL) enhances large language models (LLMs) in complex, long-chain-of-thought (long-CoT) reasoning. The advanced VAPO framework, despite sophisticated mechanisms like Decoupled GAE, theoretically faces fundamental limitations in comprehensively modeling and leveraging deep, long-term value for fine-grained, step-by-step policy guidance in extended reasoning chains. We argue these limitations stem from inherent difficulties in credit assignment, value function representational capacity with temporally abstracted goals, and translating global value signals into local policy improvements, especially with sparse rewards. Our theoretical analysis examines these aspects to illuminate VAPO's boundaries in long-term value modeling, aiming to deepen understanding of current RL for advanced reasoning and suggest future research for more robust LLM agents.

[248] Facts Do Care About Your Language: Assessing Answer Quality of Multilingual LLMs

Yuval Kansal,Shmuel Berman,Lydia Liu

Main category: cs.CL

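TL;DR: Evaluates the factual accuracy of Llama3.1 models in multilingual educational settings, finding that they provide extraneous information and exacerbate biases against rare languages.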

Details

Motivation: As large language models are increasingly used in education, ensuring their accuracy in multilingual settings is essential. Method: Evaluates how the Llama3.1 family of models answers factual questions appropriate for middle and high school students. Result: The models provide extraneous and less truthful information and also exacerbate biases against rare languages. Conclusion: The factuality and fairness of large language models in multilingual education need improvement. Abstract: Factuality is a necessary precursor to useful educational tools. As adoption of Large Language Models (LLMs) in education continues to grow, ensuring correctness in all settings is paramount. Despite their strong English capabilities, LLM performance in other languages is largely untested. In this work, we evaluate the correctness of the Llama3.1 family of models in answering factual questions appropriate for middle and high school students. We demonstrate that LLMs not only provide extraneous and less truthful information, but also exacerbate existing biases against rare languages.

[249] Literary Evidence Retrieval via Long-Context Language Models

Katherine Thai,Mohit Iyyer

Main category: cs.CL

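TL;DR: How well do modern long-context language models perform on literary evidence retrieval? The study finds that models such as Gemini Pro 2.5 exceed human expert accuracy (62.5% vs. 50%), while the best open-weight model lags behind (29.1%). Models still struggle with nuanced literary signals and overgeneration.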

Details

Motivation: To examine how well modern long-context language models understand literary fiction, specifically on the task of literary evidence retrieval. Method: Builds a benchmark from the RELiC dataset in which the model receives the full primary text and a passage of literary criticism and must generate the missing quotation, testing both global narrative reasoning and close textual analysis. Result: Gemini Pro 2.5 outperforms human experts (62.5% vs. 50%), while the best open-weight model reaches only 29.1% accuracy. Conclusion: Although some models perform well, they still fall short on nuanced literary signals and overgeneration, leaving room for future work. Abstract: How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a wide gap in interpretive reasoning between open and closed-weight models. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.

[250] Beyond Text Compression: Evaluating Tokenizers Across Scales

Jonas F. Lotz,António V. Lopes,Stephan Peitz,Hendra Setiawan,Leonardo Emili

Main category: cs.CL

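TL;DR: Small models can efficiently predict how a tokenizer will affect larger models. The paper also proposes new intrinsic evaluation metrics, giving a more reliable framework for tokenizer selection in multilingual settings.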

Details

Motivation: Tokenizer evaluation currently lacks efficient and reliable methods, especially for multilingual tasks, where tokenizer choice has a marked effect on performance. Method: Uses small models to predict tokenizer impact on large models, proposes intrinsic metrics inspired by Zipf's law, and combines several metrics into an evaluation framework. Result: Tokenizer choice has little effect on English tasks but produces consistent performance differences in multilingual settings; the new metrics predict downstream performance better than text compression. Conclusion: The work offers an efficient and reliable way to evaluate tokenizers and guides tokenizer selection in future language model development. Abstract: The choice of tokenizer can profoundly impact language model performance, yet accessible and reliable evaluations of tokenizer quality remain an open challenge. Inspired by scaling consistency, we show that smaller models can accurately predict significant differences in tokenizer impact on larger models at a fraction of the compute cost. By systematically evaluating both English-centric and multilingual tokenizers, we find that tokenizer choice has negligible effects on tasks in English but results in consistent performance differences in multilingual settings. We propose new intrinsic tokenizer metrics inspired by Zipf's law that correlate more strongly with downstream performance than text compression when modeling unseen languages. By combining several metrics to capture multiple aspects of tokenizer behavior, we develop a reliable framework for intrinsic tokenizer evaluations. Our work offers a more efficient path to informed tokenizer selection in future language model development.
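For readers unfamiliar with Zipf's law, the sketch below fits the slope of the log rank versus log frequency line for a token stream; under an ideal Zipfian distribution the slope is close to -1. This is only background illustration: the abstract does not define its Zipf-inspired metrics, so `zipf_slope` is a hypothetical helper and not the paper's metric.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct token types")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var


# Toy stream whose type frequencies are exactly 12 / rank: 12, 6, 4, 3.
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
slope = zipf_slope(tokens)
```

One could compute such a statistic on the output of each candidate tokenizer over the same corpus and compare the values, though how that relates to downstream quality is exactly what the paper investigates.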

[251] Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Xiaoying Zhang,Hao Sun,Yipeng Zhang,Kaituo Feng,Chao Yang,Helen Meng

Main category: cs.CL

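TL;DR: The paper proposes Critique-GRPO, a framework that combines natural language and numerical feedback in reinforcement learning to improve the reasoning ability of large language models.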

Details

Motivation: Reinforcement learning that relies solely on numerical feedback suffers from performance plateaus, limited effectiveness of self-reflection, and persistent failures. Method: Proposes Critique-GRPO, which integrates natural language and numerical feedback so that the model learns from initial responses and critique-guided refinements simultaneously. Result: Across eight reasoning tasks, Critique-GRPO outperforms supervised and RL-based fine-tuning approaches, improving average pass@1 by roughly 4.5% and 5% on the two base models. Conclusion: Natural language feedback helps overcome the limitations of numerically supervised RL; higher entropy and longer responses do not by themselves guarantee more efficient exploration. Abstract: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.
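As background on the numerical-feedback side, GRPO-style methods score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization only; the use of the population standard deviation and the `eps` constant are choices made for this example, and how Critique-GRPO folds critique-guided refinements into the update is not specified in the abstract and is not modeled here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize scalar rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to the same prompt, rewarded 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct responses receive positive advantages and incorrect ones negative, and the values sum to zero within the group. When every response in a group fails, all advantages collapse to zero, which is one way to picture the "persistent failures" that motivate adding natural language critiques.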

[252] AUTOCIRCUIT-RL: Reinforcement Learning-Driven LLM for Automated Circuit Topology Generation

Prashanth Vijayaraghavan,Luyao Shi,Ehsan Degan,Vandana Mukherjee,Xin Zhang

Main category: cs.CL

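TL;DR: AUTOCIRCUIT-RL is a reinforcement learning-based framework that uses an LLM to generate analog circuit topologies, improving both the validity and the efficiency of the generated designs.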

Details

Motivation: Analog circuit topology synthesis has a vast design space and strict constraints, and traditional methods are inefficient, so a smarter automated solution is needed. Method: The framework has two phases: instruction tuning (the LLM learns to generate topologies from structured prompts) and RL refinement (reward models are used to improve the generated results). Result: AUTOCIRCUIT-RL generates about 12% more valid circuits than the best baselines, improves efficiency by about 14%, reduces duplicate generation by about 38%, and achieves over 60% success with limited training data. Conclusion: The framework performs well on complex circuit design and marks a notable advance in AI-driven circuit design. Abstract: Analog circuit topology synthesis is integral to Electronic Design Automation (EDA), enabling the automated creation of circuit structures tailored to specific design requirements. However, the vast design search space and strict constraint adherence make efficient synthesis challenging. Leveraging the versatility of Large Language Models (LLMs), we propose AUTOCIRCUIT-RL, a novel reinforcement learning (RL)-based framework for automated analog circuit synthesis. The framework operates in two phases: instruction tuning, where an LLM learns to generate circuit topologies from structured prompts encoding design constraints, and RL refinement, which further improves the instruction-tuned model using reward models that evaluate validity, efficiency, and output voltage. The refined model is then used directly to generate topologies that satisfy the design constraints. Empirical results show that AUTOCIRCUIT-RL generates ~12% more valid circuits and improves efficiency by ~14% compared to the best baselines, while reducing duplicate generation rates by ~38%. It achieves over 60% success in synthesizing valid circuits with limited training data, demonstrating strong generalization. These findings highlight the framework's effectiveness in scaling to complex circuits while maintaining efficiency and constraint adherence, marking a significant advancement in AI-driven circuit design.

[253] Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

Yinjie Wang,Ling Yang,Ye Tian,Ke Shen,Mengdi Wang

Main category: cs.CL

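TL;DR: CURE is a novel reinforcement learning framework that co-evolves code generation and unit test generation without ground-truth code supervision, improving both coding and testing performance.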

Details

Motivation: Conventional methods depend on ground-truth code as supervision, which limits flexibility and scalability. CURE instead learns from interaction, letting the unit tester improve directly from the coder's mistakes. Method: Proposes the CURE framework with a dedicated reward design that co-evolves coding and unit test generation capabilities. Result: The ReasonFlux-Coder-7B and 14B models perform strongly on code generation and testing tasks, outperforming similarly sized models. Conclusion: CURE improves code and test generation and can also serve as a reward model for reinforcement learning, giving it broad potential use. Abstract: We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder's mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE
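Best-of-N selection with generated unit tests, mentioned in the abstract, can be illustrated with a toy sketch: run every candidate program against every generated test and keep the candidate that passes the most. The candidates and tests below are hand-written stand-ins for model outputs, and the scoring rule (a plain pass count) is an assumption for this example rather than CURE's reward design.

```python
def passes(candidate, test):
    """Run one (args, expected) test; any exception counts as a failure."""
    args, expected = test
    try:
        return candidate(*args) == expected
    except Exception:
        return False


def best_of_n(candidates, tests):
    """Return the index of the candidate passing the most tests, plus all scores."""
    scores = [sum(passes(c, t) for t in tests) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


# Three toy "generated programs" for the task "add two numbers".
candidates = [
    lambda a, b: a - b,
    lambda a, b: a + b,
    lambda a, b: a * b,
]
# Three toy "generated unit tests" as (arguments, expected result).
tests = [((2, 2), 4), ((1, 3), 4), ((0, 5), 5)]
best, scores = best_of_n(candidates, tests)
```

The multiplication candidate passes the first test by coincidence, which shows why test quality matters: a weak test suite cannot separate wrong programs from right ones, and that coupling is what makes training the coder and the tester together attractive.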

[254] GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Qianhui Wu,Kanzhi Cheng,Rui Yang,Chaoyun Zhang,Jianwei Yang,Huiqiang Jiang,Jian Mu,Baolin Peng,Bo Qiao,Reuben Tan,Si Qin,Lars Liden,Qingwei Lin,Huan Zhang,Tong Zhang,Jianbing Zhang,Dongmei Zhang,Jianfeng Gao

Main category: cs.CL

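TL;DR: GUI-Actor proposes a coordinate-free GUI grounding method built on vision-language models (VLMs), using an attention mechanism and a verifier to improve visual grounding.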

Details

Motivation: Existing visual grounding methods suffer from weak spatial-semantic alignment, an inability to handle ambiguous supervision targets, and a granularity mismatch between coordinates and visual features. Method: GUI-Actor introduces an attention-based action head that learns to align a dedicated token with the relevant visual patch tokens, and a verifier that selects the best action region. Result: Experiments show that GUI-Actor outperforms prior methods on multiple benchmarks, and that fine-tuning only the action head is enough to reach strong performance. Conclusion: GUI-Actor gives VLMs effective visual grounding while preserving their general-purpose strengths. Abstract: One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones.
Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
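The coordinate-free idea, attending from a dedicated token to visual patch tokens instead of emitting coordinates as text, can be sketched with plain dot-product attention. This is a toy illustration with assumed details: the embeddings, the 2x2 patch grid, the 14-pixel patch size, and both helper functions are hypothetical, and the actual action head and verifier are learned components not reproduced here.

```python
import math


def patch_attention(action_token, patches):
    """Softmax over dot-product scores between one token and each patch."""
    scores = [sum(a * p for a, p in zip(action_token, patch)) for patch in patches]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def patch_center(index, grid_width, patch_px):
    """Map a patch index on a row-major grid to the pixel center of that patch."""
    row, col = divmod(index, grid_width)
    return ((col + 0.5) * patch_px, (row + 0.5) * patch_px)


action_token = [1.0, 0.0]
patches = [[0.0, 1.0], [3.0, 0.0], [0.0, 0.0], [1.0, 1.0]]  # 2x2 grid, row-major
weights = patch_attention(action_token, patches)
best = max(range(len(weights)), key=weights.__getitem__)
click_x, click_y = patch_center(best, grid_width=2, patch_px=14)
```

The attention weights form a distribution over patches, so several high-weight patches can be read off as candidate regions from a single pass; a separate verifier would then choose among them.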

[255] Entity-Augmented Neuroscience Knowledge Retrieval Using Ontology and Semantic Understanding Capability of LLM

Pralaypati Ta,Sriram Venkatesaperumal,Keerthi Ram,Mohanasankar Sivaprakasam

Main category: cs.CL

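TL;DR: The paper proposes a new method that uses large language models, a neuroscience ontology, and text embeddings to build a knowledge graph from an unlabeled neuroscience research corpus, significantly improving knowledge discovery.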

Details

Motivation: Neuroscience literature is scattered across many sources, and existing retrieval methods struggle to extract the needed information; building a knowledge graph usually requires labeled data and domain expertise, and large-scale labeled data are hard to obtain. Method: Build a knowledge graph from an unlabeled neuroscience research corpus using large language models (LLMs), a neuroscience ontology, and text embeddings, and introduce an entity-augmented information retrieval algorithm. Result: The method reaches an F1 score of 0.84 on entity extraction, and knowledge obtained from the knowledge graph improves answers to over 54% of the questions. Conclusion: The method removes the dependence on labeled data when building neuroscience knowledge graphs and significantly improves knowledge discovery and information retrieval. Abstract: Neuroscience research publications encompass a vast wealth of knowledge. Accurately retrieving existing information and discovering new insights from this extensive literature is essential for advancing the field. However, when knowledge is dispersed across multiple sources, current state-of-the-art retrieval methods often struggle to extract the necessary information. A knowledge graph (KG) can integrate and link knowledge from multiple sources, but existing methods for constructing KGs in neuroscience often rely on labeled data and require domain expertise. Acquiring large-scale, labeled data for a specialized area like neuroscience presents significant challenges. This work proposes novel methods for constructing a KG from an unlabeled, large-scale neuroscience research corpus using large language models (LLMs), a neuroscience ontology, and text embeddings. We analyze the semantic relevance of neuroscience text segments identified by the LLM for building the knowledge graph. We also introduce an entity-augmented information retrieval algorithm to extract knowledge from the KG. Several experiments were conducted to evaluate the proposed approaches, and the results demonstrate that our methods significantly enhance knowledge discovery from the unlabeled neuroscience research corpus. The approach achieves an F1 score of 0.84 for entity extraction, and the knowledge obtained from the KG improves answers to over 54% of the questions.

[256] Causal Estimation of Tokenisation Bias

Pietro Lesci,Clara Meister,Thomas Hofmann,Andreas Vlachos,Tiago Pimentel

Main category: cs.CL

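TL;DR: The paper studies how the tokeniser affects the character-string probabilities assigned by a language model, proposes a method to quantify tokenisation bias, and shows experimentally that tokenisation choices significantly affect model outputs.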

Details

Motivation: Modern language models are usually trained on subword sequences but ultimately define probabilities over character strings. Ideally the choice of tokeniser would not affect those probabilities, yet in practice it does. The paper aims to quantify this tokenisation bias. Method: Tokenisation bias is framed as a causal effect and estimated with a regression discontinuity design, comparing subwords that rank just above and just below the vocabulary cutoff. Result: Tokenisation choices significantly affect model outputs; a subword's presence in the vocabulary can increase the probability of its characters by up to 17 times. Conclusion: The tokeniser is a key design choice in language modelling, and its bias has a significant effect on model outputs. Abstract: Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including or not a subword (e.g., $\langle hello \rangle$) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., "hello"). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first $K$ to a tokeniser's vocabulary, where $K$ is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
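
The regression discontinuity idea in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes a sharp design, a hypothetical outcome per subword (for instance the log-probability of its characters), and plain least-squares lines fitted within a bandwidth on each side of the cutoff $K$. The names `fit_line`, `rdd_effect`, and `bandwidth` are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rdd_effect(ranks, outcomes, cutoff, bandwidth):
    """Estimate the jump in the outcome at the vocabulary cutoff.

    Subwords with rank <= cutoff are treated as "in the vocabulary".
    Ranks are centred on the cutoff so each intercept is the fitted
    outcome exactly at the cutoff.
    """
    inside = [(r - cutoff, y) for r, y in zip(ranks, outcomes)
              if cutoff - bandwidth <= r <= cutoff]
    outside = [(r - cutoff, y) for r, y in zip(ranks, outcomes)
               if cutoff < r <= cutoff + bandwidth]
    inside_at_cutoff, _ = fit_line(*zip(*inside))
    outside_at_cutoff, _ = fit_line(*zip(*outside))
    return inside_at_cutoff - outside_at_cutoff


# Synthetic data (an assumption for illustration): a smooth downward trend
# in rank plus a constant bonus of 2.0 for subwords inside the vocabulary.
ranks = list(range(1, 201))
outcomes = [-0.01 * r + (2.0 if r <= 100 else 0.0) for r in ranks]
effect = rdd_effect(ranks, outcomes, cutoff=100, bandwidth=20)
print(f"estimated effect at the cutoff: {effect:.3f}")
```

Fitting separate lines on each side lets the smooth trend in rank be absorbed by the slopes, so only the discontinuity at the cutoff is attributed to vocabulary membership.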

cs.IR [Back]

[257] A Dynamic Framework for Semantic Grouping of Common Data Elements (CDE) Using Embeddings and Clustering

Madan Krishnamurthy,Daniel Korn,Melissa A Haendel,Christopher J Mungall,Anne E Thessen

Main category: cs.IR

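TL;DR: A dynamic, scalable framework that uses LLM embeddings and HDBSCAN to address semantic heterogeneity and structural variability among CDEs in biomedical data, enabling efficient integration and interoperability.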

Details

Motivation: Address the semantic heterogeneity, structural variability, and context dependence of CDEs in biomedical data in order to support data integration and scientific discovery. Method: (1) An LLM generates context-aware text embeddings; (2) HDBSCAN clusters semantically similar CDEs; (3) an LLM labels the clusters automatically; (4) a classifier is trained with supervised learning. Result: On more than 24,000 CDEs the system identifies 118 meaningful clusters, the classifier reaches 90.46% accuracy, and external validation shows strong agreement. Conclusion: The framework offers a practical solution to CDE harmonization, improving selection efficiency and supporting data interoperability. Abstract: This research aims to develop a dynamic and scalable framework to facilitate harmonization of Common Data Elements (CDEs) across heterogeneous biomedical datasets by addressing challenges such as semantic heterogeneity, structural variability, and context dependence to streamline integration, enhance interoperability, and accelerate scientific discovery. Our methodology leverages Large Language Models (LLMs) for context-aware text embeddings that convert CDEs into dense vectors capturing semantic relationships and patterns. These embeddings are clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to group semantically similar CDEs. The framework incorporates four key steps: (1) LLM-based text embedding to mathematically represent semantic context, (2) unsupervised clustering of embeddings via HDBSCAN, (3) automated labeling using LLM summarization, and (4) supervised learning to train a classifier assigning new or unclustered CDEs to labeled clusters. Evaluated on the NIH NLM CDE Repository with over 24,000 CDEs, the system identified 118 meaningful clusters at an optimized minimum cluster size of 20. The classifier achieved 90.46 percent overall accuracy, performing best in larger categories. External validation against the Gravity Project's Social Determinants of Health domains showed strong agreement (Adjusted Rand Index 0.52, Normalized Mutual Information 0.78), indicating that embeddings effectively capture cluster characteristics. This adaptable and scalable approach offers a practical solution to CDE harmonization, improving selection efficiency and supporting ongoing data interoperability.
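
The external validation above is reported with the Adjusted Rand Index (ARI). As a hedged illustration of what that number measures, the sketch below computes the standard pair-counting ARI between two labelings in plain Python. The function name and the toy labels are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting Adjusted Rand Index between two labelings."""
    n = len(labels_a)
    # Contingency counts: how many items share each (cluster_a, cluster_b) pair.
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Toy example: cluster ids differ, but the grouping is identical.
predicted = [0, 0, 1, 1]
reference = [1, 1, 0, 0]
print(adjusted_rand_index(predicted, reference))
```

ARI is invariant to how clusters are named and is corrected for chance, which is why it is a common choice for comparing an unsupervised clustering with an external taxonomy.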

cs.RO [Back]

[258] Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody

David Sasu,Kweku Andoh Yamoah,Benedict Quartey,Natalie Schluter

Main category: cs.RO

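TL;DR: The paper proposes a new method that uses speech prosody directly to infer and resolve instruction intent, and combines it with large language models to select task plans, improving the accuracy of human-robot interaction.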

Details

Motivation: Traditional methods rely on speech-to-text transcription and discard crucial prosodic cues, which leads to inaccurate intent resolution. Method: Infer intent directly from speech prosody, then integrate the predicted intents into a large language model via in-context learning to disambiguate instructions and select task plans. Result: The method reaches 95.79% accuracy in detecting referent intents and 71.96% accuracy in selecting the task plan for ambiguous instructions. Conclusion: The method improves the accuracy of instruction understanding in human-robot interaction and shows strong potential for practical use. Abstract: Enabling robots to accurately interpret and execute spoken language instructions is essential for effective human-robot collaboration. Traditional methods rely on speech recognition to transcribe speech into text, often discarding crucial prosodic cues needed for disambiguating intent. We propose a novel approach that directly leverages speech prosody to infer and resolve instruction intent. Predicted intents are integrated into large language models via in-context learning to disambiguate and select appropriate task plans. Additionally, we present the first ambiguous speech dataset for robotics, designed to advance research in speech disambiguation. Our method achieves 95.79% accuracy in detecting referent intents within an utterance and determines the intended task plan of ambiguous instructions with 71.96% accuracy, demonstrating its potential to significantly improve human-robot communication.

[259] Grasp2Grasp: Vision-Based Dexterous Grasp Translation via Schrödinger Bridges

Tao Zhong,Jonah Buchanan,Christine Allen-Blanchette

Main category: cs.RO

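TL;DR: A vision-based dexterous grasp translation method that uses the Schrödinger Bridge framework to transfer grasp intent between robotic hands with different morphologies.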

Details

Motivation: Transfer grasp intent between robotic hands with different morphologies without relying on paired demonstrations or hand-specific simulations. Method: Use the Schrödinger Bridge framework to learn a mapping between source and target latent grasp spaces via score and flow matching, guided by physics-informed cost functions. Result: Experiments show that the method generates stable, physically grounded grasps and generalizes well. Conclusion: The method enables semantic grasp transfer across heterogeneous manipulators and connects vision-based grasping with probabilistic generative modeling. Abstract: We propose a new approach to vision-based dexterous grasp translation, which aims to transfer grasp intent across robotic hands with differing morphologies. Given a visual observation of a source hand grasping an object, our goal is to synthesize a functionally equivalent grasp for a target hand without requiring paired demonstrations or hand-specific simulations. We frame this problem as a stochastic transport between grasp distributions using the Schrödinger Bridge formalism. Our method learns to map between source and target latent grasp spaces via score and flow matching, conditioned on visual observations. To guide this translation, we introduce physics-informed cost functions that encode alignment in base pose, contact maps, wrench space, and manipulability. Experiments across diverse hand-object pairs demonstrate our approach generates stable, physically grounded grasps with strong generalization. This work enables semantic grasp transfer for heterogeneous manipulators and bridges vision-based grasping with probabilistic generative modeling.

[260] HiLO: High-Level Object Fusion for Autonomous Driving using Transformers

Timo Osterburg,Franz Albers,Christopher Diehl,Rajesh Pushparaj,Torsten Bertram

Main category: cs.RO

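TL;DR: The paper modifies the Adapted Kalman Filter (AKF) and proposes HiLO, a transformer-based high-level object fusion method, reporting improvements of 25.9 percentage points in F1 score and 6.1 percentage points in mean IoU.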

Details

Motivation: Sensor data fusion is essential for environment perception in autonomous driving. Learning-based fusion methods perform well, but their complexity and hardware requirements limit their use in near-production vehicles, whereas high-level fusion methods are robust and computationally cheap. Method: The paper modifies the AKF and proposes HiLO, a transformer-based high-level object fusion method. Result: F1 score improves by 25.9 percentage points and mean IoU by 6.1 percentage points. A new large-scale real-world dataset confirms the effectiveness of the approaches, and cross-domain evaluation confirms their generalizability. Conclusion: HiLO improves on traditional high-level fusion methods and is suited to high-level data fusion in autonomous driving. Abstract: The fusion of sensor data is essential for a robust perception of the environment in autonomous driving. Learning-based fusion approaches mainly use feature-level fusion to achieve high performance, but their complexity and hardware requirements limit their applicability in near-production vehicles. High-level fusion methods offer robustness with lower computational requirements. Traditional methods, such as the Kalman filter, dominate this area. This paper modifies the Adapted Kalman Filter (AKF) and proposes a novel transformer-based high-level object fusion method called HiLO. Experimental results demonstrate improvements of $25.9$ percentage points in $\textrm{F}_1$ score and $6.1$ percentage points in mean IoU. Evaluation on a new large-scale real-world dataset demonstrates the effectiveness of the proposed approaches. Their generalizability is further validated by cross-domain evaluation between urban and highway scenarios. Code, data, and models are available at https://github.com/rst-tu-dortmund/HiLO.

[261] Rodrigues Network for Learning Robot Actions

Jialiang Zhang,Haoran Geng,Yang You,Congyue Deng,Pieter Abbeel,Jitendra Malik,Leonidas Guibas

Main category: cs.RO

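TL;DR: The paper proposes a learnable module called the Neural Rodrigues Operator and builds the Rodrigues Network (RodriNet) on it to bring kinematic structure into robot action learning. Experiments show that it outperforms standard backbones on synthetic tasks and in realistic applications.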

Details

Motivation: Existing architectures such as MLPs and Transformers lack inductive biases that reflect the kinematic structure of articulated systems, which limits action learning and prediction. Method: Propose the Neural Rodrigues Operator, a learnable generalization of the classical forward kinematics operation, and design RodriNet, a network specialized for processing actions. Result: RodriNet clearly outperforms standard backbones on tasks such as kinematics prediction and imitation learning. Conclusion: Building structured kinematic priors into the network architecture improves action learning. Abstract: Understanding and predicting articulated actions is important in robot learning. However, common architectures such as MLPs and Transformers lack inductive biases that reflect the underlying kinematic structure of articulated systems. To this end, we propose the Neural Rodrigues Operator, a learnable generalization of the classical forward kinematics operation, designed to inject kinematics-aware inductive bias into neural computation. Building on this operator, we design the Rodrigues Network (RodriNet), a novel neural architecture specialized for processing actions. We evaluate the expressivity of our network on two synthetic tasks on kinematic and motion prediction, showing significant improvements compared to standard backbones. We further demonstrate its effectiveness in two realistic applications: (i) imitation learning on robotic benchmarks with the Diffusion Policy, and (ii) single-image 3D hand reconstruction. Our results suggest that integrating structured kinematic priors into the network architecture improves action learning in various domains.
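
For background, the classical Rodrigues rotation formula that gives the operator its name rotates a vector $v$ about a unit axis $k$ by an angle $\theta$: $v' = v\cos\theta + (k \times v)\sin\theta + k\,(k \cdot v)(1 - \cos\theta)$. The sketch below implements only this classical formula in plain Python; it is not the learnable Neural Rodrigues Operator from the paper, and the helper names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rodrigues_rotate(v, axis, angle):
    """Rotate 3D vector v about `axis` by `angle` radians (right-hand rule)."""
    norm = math.sqrt(dot(axis, axis))
    k = [c / norm for c in axis]  # the formula requires a unit axis
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    k_cross_v = cross(k, v)
    k_dot_v = dot(k, v)
    return [v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1 - cos_t)
            for i in range(3)]


# A revolute joint about the z-axis turning the x unit vector by 90 degrees.
rotated = rodrigues_rotate([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], math.pi / 2)
print([round(c, 6) for c in rotated])
```

In forward kinematics, each revolute joint applies a rotation of this form to everything further down the chain, which is the structure the paper's learnable operator generalizes.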

cs.SE [Back]

[262] Automated Web Application Testing: End-to-End Test Case Generation with Large Language Models and Screen Transition Graphs

Nguyen-Khang Le,Quan Minh Bui,Minh Ngoc Nguyen,Hiep Nguyen,Trung Vo,Son T. Luu,Shoshin Nomura,Minh Le Nguyen

Main category: cs.SE

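TL;DR: This paper presents an automated system that combines graph structures with large language models (LLMs) to generate test cases for site navigation and form filling, improving test coverage and robustness.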

Details

Motivation: Modern web application interfaces are complex and highly dynamic, which makes reliability hard to ensure. Existing LLMs still have limitations in handling dynamic navigation flows and complex form interactions. Method: The system models navigation flows with screen transition graphs and LLMs to generate test scenarios, and uses state graphs to handle conditional forms and automate Selenium script generation. Result: Experiments show that the system improves test coverage and robustness. Conclusion: By combining graph structures with LLMs, the work advances web application testing. Abstract: Web applications are critical to modern software ecosystems, yet ensuring their reliability remains challenging due to the complexity and dynamic nature of web interfaces. Recent advances in large language models (LLMs) have shown promise in automating complex tasks, but limitations persist in handling dynamic navigation flows and complex form interactions. This paper presents an automated system for generating test cases for two key aspects of web application testing: site navigation and form filling. For site navigation, the system employs screen transition graphs and LLMs to model navigation flows and generate test scenarios. For form filling, it uses state graphs to handle conditional forms and automates Selenium script generation. Key contributions include: (1) a novel integration of graph structures and LLMs for site navigation testing, (2) a state graph-based approach for automating form-filling test cases, and (3) a comprehensive dataset for evaluating form-interaction testing. Experimental results demonstrate the system's effectiveness in improving test coverage and robustness, advancing the state of web application testing.
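
To make the screen transition graph idea concrete, here is a minimal sketch under stated assumptions: screens are nodes, user actions are labeled edges, and each cycle-free path from a start screen is treated as a candidate navigation scenario. The graph contents, screen names, and the function `enumerate_scenarios` are hypothetical, and the LLM step described in the abstract is omitted entirely.

```python
def enumerate_scenarios(graph, start, max_steps):
    """Return every cycle-free navigation path of 1..max_steps actions.

    `graph` maps a screen name to a list of (action, target_screen) edges.
    Each scenario is a list of (source, action, target) steps.
    """
    scenarios = []

    def walk(screen, visited, steps):
        if steps:
            scenarios.append(list(steps))
        if len(steps) == max_steps:
            return
        for action, target in graph.get(screen, []):
            if target in visited:  # skip edges that would revisit a screen
                continue
            steps.append((screen, action, target))
            walk(target, visited | {target}, steps)
            steps.pop()

    walk(start, {start}, [])
    return scenarios


# Hypothetical site: four screens plus a "back to home" edge.
site = {
    "home": [("click_login", "login"), ("click_search", "search")],
    "login": [("submit", "dashboard")],
    "search": [("open_result", "detail")],
    "dashboard": [("click_home", "home")],
}
for scenario in enumerate_scenarios(site, "home", max_steps=3):
    print(" -> ".join([scenario[0][0]] + [step[2] for step in scenario]))
```

Each enumerated path could then be handed to a script generator (for example one that emits Selenium steps) as the skeleton of a test case; bounding the depth and skipping revisits keeps the scenario set finite.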

[263] Is PMBOK Guide the Right Fit for AI? Re-evaluating Project Management in the Face of Artificial Intelligence Projects

Alexey Burdakov,Max Jaihyun Ahn

Main category: cs.SE

TL;DR: This paper evaluates how well the PMBOK Guide applies to AI software projects, identifies its limitations, and proposes adaptations.

Details

Motivation: Traditional project management frameworks such as PMBOK fall short for AI projects, especially in data management, iterative development, and ethical issues. Method: Analyze the limitations of the PMBOK Guide and propose adaptations that integrate data lifecycle management, iterative frameworks, and ethical considerations. Result: The study finds significant gaps in the PMBOK Guide for AI projects that call for targeted adjustments. Conclusion: The adapted project management approach better fits the dynamic and complex nature of AI projects. Abstract: This paper critically evaluates the applicability of the Project Management Body of Knowledge (PMBOK) Guide framework to Artificial Intelligence (AI) software projects, highlighting key limitations and proposing tailored adaptations. Unlike traditional projects, AI initiatives rely heavily on complex data, iterative experimentation, and specialized expertise while navigating significant ethical considerations. Our analysis identifies gaps in the PMBOK Guide, including its limited focus on data management, insufficient support for iterative development, and lack of guidance on ethical and multidisciplinary challenges. To address these deficiencies, we recommend integrating data lifecycle management, adopting iterative and AI project management frameworks, and embedding ethical considerations within project planning and execution. Additionally, we explore alternative approaches that better align with AI's dynamic and exploratory nature. We aim to enhance project management practices for AI software projects by bridging these gaps.

cs.MM [Back]

Zihao Ding,Cheng-Tse Lee,Mufeng Zhu,Tao Guan,Yuan-Chun Sun,Cheng-Hsin Hsu,Yao Liu

Main category: cs.MM

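TL;DR: This paper introduces EyeNavGS, the first publicly available 6-DoF navigation dataset, containing traces from 46 participants navigating 12 real-world 3DGS scenes, to support research on 6-DoF viewport prediction, adaptive streaming, and related topics.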

Details

Motivation: Real user navigation data for high-fidelity 3DGS scenes is currently lacking, which holds back application development and rendering performance optimization. Method: Record head pose and eye gaze data from 46 participants in 12 3DGS scenes using Meta Quest Pro headsets, with scene initialization and data post-processing. Result: The EyeNavGS dataset and accompanying open-source tools are released to support 6-DoF navigation research. Conclusion: EyeNavGS is a valuable resource for research on viewport prediction, adaptive streaming, and related topics for 3DGS scenes. Abstract: 3D Gaussian Splatting (3DGS) is an emerging media representation that reconstructs real-world 3D scenes in high fidelity, enabling 6-degrees-of-freedom (6-DoF) navigation in virtual reality (VR). However, developing and evaluating 3DGS-enabled applications and optimizing their rendering performance require realistic user navigation data. Such data is currently unavailable for photorealistic 3DGS reconstructions of real-world scenes. This paper introduces EyeNavGS, the first publicly available 6-DoF navigation dataset featuring traces from 46 participants exploring twelve diverse, real-world 3DGS scenes. The dataset was collected at two sites, using the Meta Quest Pro headsets, recording the head pose and eye gaze data for each rendered frame during free world standing 6-DoF navigation. For each of the twelve scenes, we performed careful scene initialization to correct for scene tilt and scale, ensuring a perceptually-comfortable VR experience. We also release our open-source SIBR viewer software fork with record-and-replay functionalities and a suite of utility tools for data processing, conversion, and visualization. The EyeNavGS dataset and its accompanying software tools provide valuable resources for advancing research in 6-DoF viewport prediction, adaptive streaming, 3D saliency, and foveated rendering for 3DGS scenes. The EyeNavGS dataset is available at: https://symmru.github.io/EyeNavGS/.

[265] StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion

Fengjin Li,Jie Wang,Yadong Niu,Yongqing Wang,Meng Meng,Jian Luan,Zhiyong Wu

Main category: cs.MM

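TL;DR: StarVC is an autoregressive voice conversion framework that predicts text tokens before synthesizing acoustic features, improving voice conversion performance.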

Details

Motivation: Traditional voice conversion methods make little use of semantic content; StarVC addresses this with explicit text modeling. Method: StarVC uses an autoregressive framework that first predicts text tokens and then synthesizes acoustic features. Result: Experiments show that StarVC outperforms conventional methods in preserving linguistic content (WER, CER) and speaker characteristics (SECS, MOS). Conclusion: Explicit text modeling improves voice conversion performance and points to a new direction for future research. Abstract: Voice Conversion (VC) modifies speech to match a target speaker while preserving linguistic content. Traditional methods usually extract speaker information directly from speech while neglecting the explicit utilization of linguistic content. Since VC fundamentally involves disentangling speaker identity from linguistic content, leveraging structured semantic features could enhance conversion performance. However, previous attempts to incorporate semantic features into VC have shown limited effectiveness, motivating the integration of explicit text modeling. We propose StarVC, a unified autoregressive VC framework that first predicts text tokens before synthesizing acoustic features. The experiments demonstrate that StarVC outperforms conventional VC methods in preserving both linguistic content (i.e., WER and CER) and speaker characteristics (i.e., SECS and MOS). An audio demo can be found at: https://thuhcsi.github.io/StarVC/.
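
The linguistic-content metrics named above, WER and CER, are both edit distances normalized by the reference length. The sketch below shows the standard definitions in plain Python. It assumes whitespace tokenization for words and a non-empty reference, and it says nothing about which ASR system or text normalization the authors used.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_item != hyp_item),   # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """CER: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(char_error_rate("hello", "hallo"))
```

For voice conversion, the hypothesis would typically be a transcript of the converted audio, so a lower WER or CER indicates that more of the original linguistic content survived conversion.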

eess.IV [Back]

[266] Surgical Foundation Model Leveraging Compression and Entropy Maximization for Image-Guided Surgical Assistance

Lianhao Yin,Ozanan Meireles,Guy Rosman,Daniela Rus

Main category: eess.IV

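TL;DR: The paper proposes Compress-to-Explore (C2E), a self-supervised learning framework that uses Kolmogorov complexity to learn compact, informative representations from surgical videos, improving encoder performance without labeled data.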

Details

Motivation: Real-time video understanding is critical for minimally invasive surgery (MIS), but supervised methods need large annotated datasets, and annotation in medicine is costly and scarce. Existing self-supervised methods struggle to capture structural and physical information that generalizes well. Method: C2E compresses images with entropy-maximizing decoders while preserving clinically relevant details, and thereby learns compact representations. Result: Trained on large-scale unlabeled surgical datasets, C2E generalizes well across surgical ML tasks such as workflow classification, tool-tissue interaction classification, segmentation, and diagnosis. Conclusion: C2E shows the potential of self-supervised learning to improve surgical AI and MIS outcomes. Abstract: Real-time video understanding is critical to guide procedures in minimally invasive surgery (MIS). However, supervised learning approaches require large, annotated datasets that are scarce due to annotation efforts that are prohibitive, e.g., in medical fields. Although self-supervision methods can address such limitations, current self-supervised methods often fail to capture structural and physical information in a form that generalizes across tasks. We propose Compress-to-Explore (C2E), a novel self-supervised framework that leverages Kolmogorov complexity to learn compact, informative representations from surgical videos. C2E uses entropy-maximizing decoders to compress images while preserving clinically relevant details, improving encoder performance without labeled data. Trained on large-scale unlabeled surgical datasets, C2E demonstrates strong generalization across a variety of surgical ML tasks, such as workflow classification, tool-tissue interaction classification, segmentation, and diagnosis tasks, providing improved performance as a surgical visual foundation model. As we further show in the paper, the model's internal compact representation better disentangles features from different structural parts of images. The resulting performance improvements highlight the yet untapped potential of self-supervised learning to enhance surgical AI and improve outcomes in MIS.

[267] Alzheimers Disease Classification in Functional MRI With 4D Joint Temporal-Spatial Kernels in Novel 4D CNN Model

Javier Salazar Cavazos,Scott Peltier

Main category: eess.IV

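TL;DR: A novel 4D convolutional network that extracts spatio-temporal features from functional MRI data and outperforms conventional 3D models, notably for Alzheimer's disease diagnosis.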

Details

Motivation: Existing methods apply 3D spatial-only models to 4D functional MRI data, which may lead to inadequate feature extraction and hurt downstream tasks. Method: Develop a 4D convolutional network that learns spatial information and temporal dynamics jointly. Result: Experiments show that the 4D CNN outperforms 3D models on functional MRI data and improves Alzheimer's disease diagnosis. Conclusion: Future work could explore task-based fMRI applications and regression tasks to better understand cognitive performance and disease progression. Abstract: Previous works in the literature apply 3D spatial-only models to 4D functional MRI data, which may lead to sub-par feature extraction for downstream tasks such as classification. In this work, we aim to develop a novel 4D convolution network to extract 4D joint temporal-spatial kernels that not only learn spatial information but also capture temporal dynamics. Experimental results show promising performance in capturing spatial-temporal data in functional MRI compared to 3D models. The 4D CNN model improves Alzheimer's disease diagnosis for rs-fMRI data, enabling earlier detection and better interventions. Future research could explore task-based fMRI applications and regression tasks, enhancing understanding of cognitive performance and disease progression.

[268] Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction?

Tianyu Lin,Xinran Li,Chuntung Zhuang,Qi Chen,Yuanhao Cai,Kai Ding,Alan L. Yuille,Zongwei Zhou

Main category: eess.IV

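TL;DR: The paper proposes a set of new anatomy-aware evaluation metrics and the CARE framework to address the failure of conventional metrics to capture the completeness of critical anatomical structures in sparse-view CT reconstruction.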

Details

Motivation: Conventional metrics such as SSIM and PSNR cannot reliably assess the completeness of critical anatomical structures in sparse-view CT reconstruction, especially small or thin regions. Method: Propose anatomy-aware evaluation metrics and develop the CARE framework, which adds structural penalties during training to encourage preservation of anatomical structures. Result: CARE substantially improves structural completeness in CT reconstructions, by up to 32% for large organs, 22% for small organs, 40% for intestines, and 36% for vessels. Conclusion: CARE improves the completeness of anatomical structures in sparse-view CT reconstruction and can be applied to a range of reconstruction methods. Abstract: Widely adopted evaluation metrics for sparse-view CT reconstruction, such as Structural Similarity Index Measure and Peak Signal-to-Noise Ratio, prioritize pixel-wise fidelity but often fail to capture the completeness of critical anatomical structures, particularly small or thin regions that are easily missed. To address this limitation, we propose a suite of novel anatomy-aware evaluation metrics designed to assess structural completeness across anatomical structures, including large organs, small organs, intestines, and vessels. Building on these metrics, we introduce CARE, a Completeness-Aware Reconstruction Enhancement framework that incorporates structural penalties during training to encourage anatomical preservation of significant structures. CARE is model-agnostic and can be seamlessly integrated into analytical, implicit, and generative methods. When applied to these methods, CARE substantially improves structural completeness in CT reconstructions, achieving up to +32% improvement for large organs, +22% for small organs, +40% for intestines, and +36% for vessels.
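
A toy calculation shows why a pixel-wise metric can overlook a thin structure. The sketch below uses the standard PSNR definition on a synthetic 64x64 image whose only content is a 4-pixel "vessel". The image, its size, and the intensity range are assumptions for illustration and do not come from the paper.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized 2D images."""
    ref_pixels = [p for row in reference for p in row]
    rec_pixels = [p for row in reconstruction for p in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref_pixels, rec_pixels)) / len(ref_pixels)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


size = 64
# Reference: empty background with a thin, 4-pixel vessel at full intensity.
reference = [[0.0] * size for _ in range(size)]
for column in range(30, 34):
    reference[32][column] = 1.0

# Reconstruction: the background is perfect, but the vessel is missing entirely.
reconstruction = [[0.0] * size for _ in range(size)]

score = psnr(reference, reconstruction)
print(f"PSNR with the vessel completely missing: {score:.2f} dB")
```

The mean squared error here is 4/4096, so the score is about 30.1 dB even though the only structure in the image has vanished, because the error is averaged over thousands of correctly reconstructed background pixels. A structure-level completeness measure would instead report that none of the vessel was recovered.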

[269] NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution

Marcos V. Conde,Radu Timofte,Zihao Lu,Xiangyu Kong,Xiaoxia Xing,Fan Wang,Suejin Han,MinKyu Park,Tianyu Zhang,Xin Luo,Yeda Chen,Dong Liu,Li Pang,Yuhang Yang,Hongzhong Wang,Xiangyong Cao,Ruixuan Jiang,Senyan Xu,Siyuan Jiang,Xueyang Fu,Zheng-Jun Zha,Tianyu Hao,Yuhong He,Ruoqi Li,Yueqi Yang,Xiang Yu,Guanlan Hong,Minmin Yi,Yuanjia Chen,Liwen Zhang,Zijie Jin,Cheng Li,Lian Liu,Wei Song,Heng Sun,Yubo Wang,Jinghua Wang,Jiajie Lu,Watchara Ruangsang

Main category: eess.IV

TL;DR: This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge and summarizes the proposed solutions and results.

Details

Motivation: RAW image restoration and super-resolution could be essential in modern Image Signal Processing (ISP) pipelines, but the problem is less explored than in the RGB domain. Method: The challenge targets two tasks: restoring RAW images degraded by blur and noise, and upscaling RAW Bayer images by 2x under unknown noise and blur. Result: A total of 230 participants registered, and 45 submitted results. Conclusion: The report presents the current state of the art in RAW image restoration. Abstract: This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Restoration and Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines; however, this problem is not as explored as in the RGB domain. The goal of this challenge is twofold: (i) restore RAW images with blur and noise degradations, and (ii) upscale RAW Bayer images by 2x, considering unknown noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during the challenge period. This report presents the current state-of-the-art in RAW Restoration.

[270] Dual encoding feature filtering generalized attention UNET for retinal vessel segmentation

Md Tauhidul Islam,Wu Da-Wen,Tang Qing-Qing,Zhao Kai-Yang,Yin Teng,Li Yan-Fei,Shang Wen-Yi,Liu Jing-Yu,Zhang Hai-Xian

Main category: eess.IV

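TL;DR: DEFFA-Unet is an improved retinal vessel segmentation method that adds an extra encoder, a feature filtering fusion module, and an attention-guided feature reconstructing fusion module to address limited data, imbalanced data distribution, and inadequate feature extraction, improving model performance.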

Details

Motivation: Retinal vessel segmentation is crucial for diagnosing ocular and cardiovascular diseases, but existing methods are held back by limited data, imbalanced data distribution, and inadequate feature extraction, which limits segmentation performance and generalization. Method: DEFFA-Unet adds an extra encoder for domain-invariant pre-processed inputs, develops a feature filtering fusion module and an attention-guided feature reconstructing fusion module, and proposes new data augmentation and balancing methods. Result: Validation on multiple benchmark datasets shows that DEFFA-Unet outperforms baseline and state-of-the-art models in segmentation performance and generalization. Conclusion: By improving feature extraction and data handling, DEFFA-Unet improves the accuracy and generalization of retinal vessel segmentation. Abstract: Retinal blood vessel segmentation is crucial for diagnosing ocular and cardiovascular diseases. Although the introduction of U-Net in 2015 by Olaf Ronneberger significantly advanced this field, issues like limited training data, imbalanced data distribution, and inadequate feature extraction persist, hindering both the segmentation performance and optimal model generalization. Addressing these critical issues, the DEFFA-Unet is proposed, featuring an additional encoder to process domain-invariant pre-processed inputs, thereby enabling both richer feature encoding and better model generalization. A feature filtering fusion module is developed to ensure precise feature filtering and robust hybrid feature fusion. In response to the task-specific need for higher precision where false positives are very costly, traditional skip connections are replaced with the attention-guided feature reconstructing fusion module. Additionally, innovative data augmentation and balancing methods are proposed to counter data scarcity and distribution imbalance, further boosting the robustness and generalization of the model. With a comprehensive suite of evaluation metrics, extensive validations on four benchmark datasets (DRIVE, CHASEDB1, STARE, and HRF) and an SLO dataset (IOSTAR) demonstrate the proposed method's superiority over both baseline and state-of-the-art models. In particular, the proposed method significantly outperforms the compared methods in cross-validation model generalization.

[271] Unrolling Nonconvex Graph Total Variation for Image Denoising

Songlin Wei,Gene Cheung,Fei Chen,Ivan Selesnick

Main category: eess.IV

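TL;DR: The paper proposes a new non-convex graph total variation (NC-GTV) term that combines a graph Huber function with the Gershgorin Circle Theorem to keep the overall objective convex, and achieves efficient denoising with an ADMM algorithm unrolled into a lightweight network.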

Details

Motivation: Conventional model-based image denoising uses convex regularization terms such as TV. The paper aims to promote sparser signal representation with a non-convex regularizer while keeping the overall objective convex, so that no extraneous local minima arise. Method: Propose NC-GTV, a non-convex regularizer defined through a graph Huber function; compute its parameter with the Gershgorin Circle Theorem to guarantee convexity; design a linear-time ADMM-based algorithm and unroll it into a lightweight network. Result: Experiments show that the method outperforms unrolled GTV and other representative denoising schemes while using fewer network parameters. Conclusion: NC-GTV combines non-convex regularization with convex optimization to deliver efficient, high-performing image denoising and offers a new approach to sparse signal processing. Abstract: Conventional model-based image denoising optimizations employ convex regularization terms, such as total variation (TV) that convexifies the $\ell_0$-norm to promote sparse signal representation. Instead, we propose a new non-convex total variation term in a graph setting (NC-GTV), such that when combined with an $\ell_2$-norm fidelity term for denoising, leads to a convex objective with no extraneous local minima. We define NC-GTV using a new graph variant of the Huber function, interpretable as a Moreau envelope. The crux is the selection of a parameter $a$ characterizing the graph Huber function that ensures overall objective convexity; we efficiently compute $a$ via an adaptation of Gershgorin Circle Theorem (GCT). To minimize the convex objective, we design a linear-time algorithm based on Alternating Direction Method of Multipliers (ADMM) and unroll it into a lightweight feed-forward network for data-driven parameter learning. Experiments show that our method outperforms unrolled GTV and other representative image denoising schemes, while employing far fewer network parameters.
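
The abstract describes the graph Huber function as interpretable as a Moreau envelope. The scalar analogue is a standard fact: the Huber function with threshold $\delta$ equals the Moreau envelope of $|\cdot|$, $\min_u \left( |u| + \frac{1}{2\delta}(x - u)^2 \right)$, whose minimizer is the soft-thresholding of $x$. The sketch below checks this numerically for the scalar case only. It does not reproduce the paper's graph variant or its parameter $a$, and the function names are illustrative.

```python
def huber(x, delta):
    """Scalar Huber function: quadratic near zero, linear in the tails."""
    if abs(x) <= delta:
        return x * x / (2 * delta)
    return abs(x) - delta / 2


def soft_threshold(x, delta):
    """Proximal operator of delta * |u|: shrink x toward zero by delta."""
    if x > delta:
        return x - delta
    if x < -delta:
        return x + delta
    return 0.0


def moreau_envelope_abs(x, delta):
    """Moreau envelope of |u| with parameter delta, evaluated at its minimizer."""
    u = soft_threshold(x, delta)
    return abs(u) + (x - u) ** 2 / (2 * delta)


# The two definitions agree on a grid of test points.
delta = 1.0
for x in [-3.0, -1.0, -0.5, 0.0, 0.25, 1.0, 3.0]:
    assert abs(huber(x, delta) - moreau_envelope_abs(x, delta)) < 1e-12
    print(x, huber(x, delta))
```

In the usual construction of convex non-convex regularizers, subtracting such a Huber-type term from a convex penalty yields a non-convex penalty whose concavity is controlled by a single parameter, which is why the choice of that parameter decides whether the overall denoising objective stays convex.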

Haowen Pang,Weiyan Guo,Chuyang Ye

Main category: eess.IV

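TL;DR: SwinUNETR is applied to synthesize missing modalities in brain MRI, combining the strengths of the Swin Transformer and CNNs to generate high-quality synthetic images.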

Details

Motivation: Missing brain MRI modalities are a common problem in clinical practice, and an effective way to synthesize them is needed to support diagnosis. Method: Apply SwinUNETR, which combines the global context modeling of the Swin Transformer with the local detail handling of CNNs. Result: The model performs well on brain MRI datasets, with significant improvements in image quality, anatomical consistency, and diagnostic value. Conclusion: SwinUNETR can effectively synthesize missing modalities and provide high-quality images for clinical diagnosis. Abstract: Multi-modal brain magnetic resonance imaging (MRI) plays a crucial role in clinical diagnostics by providing complementary information across different imaging modalities. However, a common challenge in clinical practice is missing MRI modalities. In this paper, we apply SwinUNETR to the synthesis of missing modalities in brain MRI. SwinUNETR is a novel neural network architecture designed for medical image analysis, integrating the strengths of Swin Transformer and convolutional neural networks (CNNs). The Swin Transformer, a variant of the Vision Transformer (ViT), incorporates hierarchical feature extraction and window-based self-attention mechanisms, enabling it to capture both local and global contextual information effectively. By combining the Swin Transformer with CNNs, SwinUNETR merges global context awareness with detailed spatial resolution. This hybrid approach addresses the challenges posed by the varying modality characteristics and complex brain structures, facilitating the generation of accurate and realistic synthetic images. We evaluate the performance of SwinUNETR on brain MRI datasets and demonstrate its superior capability in generating clinically valuable images. Our results show significant improvements in image quality, anatomical consistency, and diagnostic value.

[273] Dynamic mapping from static labels: remote sensing dynamic sample generation with temporal-spectral embedding

Shuai Yuan,Shuang Chen,Tianwu Lin,Jie Wang,Peng Gong

Main category: eess.IV

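TL;DR: TasGen is a two-stage automated framework that generates dynamic samples from static samples via temporal-spectral embedding, reducing the need for manual annotation.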

Details

Motivation: Remote sensing geographic mapping depends on representative sample data, but the land surface changes quickly, so samples become outdated and need frequent manual updates. Method: Propose the TasGen framework, which generates dynamic samples from static samples and models spectral and temporal dependencies through temporal-spectral embedding. Result: Land surface changes are captured without additional manual annotations. Conclusion: TasGen reduces the manual annotation burden and improves the efficiency of remote sensing mapping. Abstract: Accurate remote sensing geographic mapping depends heavily on representative and timely sample data. However, rapid changes in land surface dynamics necessitate frequent updates, quickly rendering previously collected samples obsolete and imposing significant labor demands for continuous manual updates. In this study, we aim to address this problem by dynamic sample generation using existing single-date static labeled samples. We introduce TasGen, a two-stage automated framework to automatically generate dynamic samples, designed to simultaneously model spectral and temporal dependencies in time-series remote sensing imagery via temporal-spectral embedding, capturing land surface changes without additional manual annotations.

[274] A Tree-guided CNN for image super-resolution

Chunwei Tian,Mingjian Song,Xiaopeng Fan,Xiangtao Zheng,Bob Zhang,David Zhang

Main category: eess.IV

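TL;DR: A tree-guided CNN (TSRNet) for image super-resolution that uses a tree architecture to strengthen the effect of key nodes, combined with cosine transform techniques and the Adan optimizer to improve performance.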

Details

Motivation: Existing deep convolutional neural networks struggle to make effective use of information from key layers in image super-resolution, which degrades performance. Method: Designs a tree-guided CNN (TSRNet), uses cosine transforms to extract cross-domain information, and optimizes parameters with the Adan optimizer. Result: Experiments verify the superiority of TSRNet in restoring high-quality images. Conclusion: TSRNet improves image super-resolution performance through its tree architecture and cross-domain information extraction. Abstract: Deep convolutional neural networks can extract more accurate structural information via deep architectures to obtain good performance in image super-resolution. However, it is not easy to identify the effect of important layers in a single network architecture, which can decrease super-resolution performance. In this paper, we design a tree-guided CNN for image super-resolution (TSRNet). It uses a tree architecture to guide a deep network to enhance the effect of key nodes and amplify the relation of hierarchical information, improving the ability to recover images. To prevent insufficiency of the obtained structural information, cosine transform techniques in the TSRNet are used to extract cross-domain information to improve the performance of image super-resolution. The Adaptive Nesterov momentum optimizer (Adan) is applied to optimize parameters to boost the effectiveness of training a super-resolution model. Extensive experiments verify the superiority of the proposed TSRNet for restoring high-quality images. Its code can be obtained at https://github.com/hellloxiaotian/TSRNet.

eess.SP [Back]

[275] Simulate Any Radar: Attribute-Controllable Radar Simulation via Waveform Parameter Embedding

Weiqing Xiao,Hao Huang,Chonghao Zhong,Yujie Lin,Nan Wang,Xiaoxue Chen,Zhaoxi Chen,Saining Zhang,Shuocheng Yang,Pierre Merriaux,Lei Lei,Hao Zhao

Main category: eess.SP

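TL;DR: SA-Radar is a radar simulation method that combines generative and physics-based simulation through a waveform-parameterized attribute embedding to efficiently generate customizable radar data.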

Details

Motivation: Existing radar simulators are either generative or physics-based, do not combine the two, and require detailed hardware specifications; SA-Radar aims to address this. Method: Designs ICFAR-Net, a 3D U-Net conditioned on radar attributes encoded via waveform parameters, to generate range-azimuth-Doppler (RAD) tensors without detailed hardware specifications. Result: Experiments show that SA-Radar's simulated data is realistic and effective on tasks such as 2D/3D object detection and radar semantic segmentation, consistently improving model performance. Conclusion: SA-Radar is a general-purpose radar data engine that supports novel sensor viewpoints and scene editing, suited to autonomous driving applications. Abstract: We present SA-Radar (Simulate Any Radar), a radar simulation approach that enables controllable and efficient generation of radar cubes conditioned on customizable radar attributes. Unlike prior generative or physics-based simulators, SA-Radar integrates both paradigms through a waveform-parameterized attribute embedding. We design ICFAR-Net, a 3D U-Net conditioned on radar attributes encoded via waveform parameters, which captures signal variations induced by different radar configurations. This formulation bypasses the need for detailed radar hardware specifications and allows efficient simulation of range-azimuth-Doppler (RAD) tensors across diverse sensor settings. We further construct a mixed real-simulated dataset with attribute annotations to robustly train the network. Extensive evaluations on multiple downstream tasks, including 2D/3D object detection and radar semantic segmentation, demonstrate that SA-Radar's simulated data is both realistic and effective, consistently improving model performance when used standalone or in combination with real data. Our framework also supports simulation in novel sensor viewpoints and edited scenes, showcasing its potential as a general-purpose radar data engine for autonomous driving applications. Code and additional materials are available at https://zhuxing0.github.io/projects/SA-Radar.

cs.HC [Back]

[276] Inter(sectional) Alia(s): Ambiguity in Voice Agent Identity via Intersectional Japanese Self-Referents

Takao Fujii,Katie Seaborn,Madeleine Steeds,Jun Kato

Main category: cs.HC

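TL;DR: The study examines how non-pronominal self-referents (NPSR) and voice shape perceived social identity in conversational agents, finding clear voice gendering but also that certain self-referents can evade gendering.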

Details

Motivation: To examine the ethics of anthropomorphizing machines, in particular the role of non-pronominal self-referents and voice in perceived social identity. Method: A crowdsourcing study in which 204 Japanese participants evaluated three ChatGPT voices using seven self-referents. Result: Voice gendering was pronounced, while some self-referents could evade gendering; perceptions of age and formality intersected with gendering, especially for boku and watakushi. Conclusion: The study highlights the complexity of agent identity perception and advocates intersectional, culturally sensitive voice agent design. Abstract: Conversational agents that mimic people have raised questions about the ethics of anthropomorphizing machines with human social identity cues. Critics have also questioned assumptions of identity neutrality in humanlike agents. Recent work has revealed that intersectional Japanese pronouns can elicit complex and sometimes evasive impressions of agent identity. Yet, the role of other "neutral" non-pronominal self-referents (NPSR) and voice as a socially expressive medium remains unexplored. In a crowdsourcing study, Japanese participants (N = 204) evaluated three ChatGPT voices (Juniper, Breeze, and Ember) using seven self-referents. We found strong evidence of voice gendering alongside the potential of intersectional self-referents to evade gendering, i.e., ambiguity through neutrality and elusiveness. Notably, perceptions of age and formality intersected with gendering as per sociolinguistic theories, especially boku and watakushi. This work provides a nuanced take on agent identity perceptions and champions intersectional and culturally-sensitive work on voice agents.

cs.MA [Back]

[277] MAEBE: Multi-Agent Emergent Behavior Framework

Sinem Erisken,Timothy Gothard,Martin Leitgab,Ram Potham

Main category: cs.MA

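TL;DR: The paper proposes the MAEBE framework for evaluating emergent risks in multi-agent AI systems, finding that LLM moral preferences are sensitive to question framing and that ensemble behavior cannot be predicted from single-agent behavior.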

Details

Motivation: Traditional AI safety evaluations that test isolated LLMs are insufficient; multi-agent systems introduce new emergent risks that need systematic assessment. Method: Uses the MAEBE framework with the Greatest Good Benchmark and a double-inversion question technique to analyze the moral preferences and behavior of single agents and ensembles. Result: LLM moral preferences are brittle and shift with question framing; ensemble behavior is emergent and cannot be predicted from single agents; ensembles exhibit phenomena such as peer pressure. Conclusion: AI systems need to be evaluated in interactive, multi-agent contexts to address their distinct safety and alignment challenges. Abstract: Traditional AI safety evaluations on isolated LLMs are insufficient as multi-agent AI ensembles become prevalent, introducing novel emergent risks. This paper introduces the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework to systematically assess such risks. Using MAEBE with the Greatest Good Benchmark (and a novel double-inversion question technique), we demonstrate that: (1) LLM moral preferences, particularly for Instrumental Harm, are surprisingly brittle and shift significantly with question framing, both in single agents and ensembles. (2) The moral reasoning of LLM ensembles is not directly predictable from isolated agent behavior due to emergent group dynamics. (3) Specifically, ensembles exhibit phenomena like peer pressure influencing convergence, even when guided by a supervisor, highlighting distinct safety and alignment challenges. Our findings underscore the necessity of evaluating AI systems in their interactive, multi-agent contexts.

astro-ph.IM [Back]

[278] An Exploratory Framework for Future SETI Applications: Detecting Generative Reactivity via Language Models

Po-Chieh Yu

Main category: astro-ph.IM

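TL;DR: The paper explores whether noise-like input can elicit structured responses from language models, proposes an evaluation framework, and finds experimentally that whale and bird vocalizations trigger stronger semantic responses than white noise.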

Details

Motivation: To explore whether language models can detect latent structure in data without conventional semantics, as a complement to traditional SETI methods. Method: Tests GPT-2 small on four types of acoustic input (human speech, whale vocalizations, birdsong, and white noise) and defines a Semantic Induction Potential (SIP) score to measure reactivity. Result: Whale and bird vocalizations score higher on SIP than white noise, while human speech elicits only moderate responses, suggesting the model may detect structure in data without conventional semantics. Conclusion: Generative reactivity may offer a new way to identify data worth closer attention, especially when communicative intent is unknown. Abstract: We present an exploratory framework to test whether noise-like input can induce structured responses in language models. Instead of assuming that extraterrestrial signals must be decoded, we evaluate whether inputs can trigger linguistic behavior in generative systems. This shifts the focus from decoding to viewing structured output as a sign of underlying regularity in the input. We tested GPT-2 small, a 117M-parameter model trained on English text, using four types of acoustic input: human speech, humpback whale vocalizations, Phylloscopus trochilus birdsong, and algorithmically generated white noise. All inputs were treated as noise-like, without any assumed symbolic encoding. To assess reactivity, we defined a composite score called Semantic Induction Potential (SIP), combining entropy, syntax coherence, compression gain, and repetition penalty. Results showed that whale and bird vocalizations had higher SIP scores than white noise, while human speech triggered only moderate responses. This suggests that language models may detect latent structure even in data without conventional semantics. We propose that this approach could complement traditional SETI methods, especially in cases where communicative intent is unknown. Generative reactivity may offer a different way to identify data worth closer attention.

cs.SD [Back]

[279] MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation

Mingyang Huang,Peng Zhang,Bang Zhang

Main category: cs.SD

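TL;DR: MotionRAG-Diff combines retrieval-augmented generation (RAG) with diffusion models for high-quality, musically coherent dance generation, addressing existing methods' shortcomings in long-term coherence and music alignment.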

Details

Motivation: Existing dance generation methods (such as motion graphs and diffusion models) are limited in long-term coherence and music alignment, so a new method is needed that combines the strengths of both. Method: (1) Cross-modal contrastive learning to align music and dance representations; (2) an optimized motion graph system for efficient retrieval and concatenation; (3) a multi-condition diffusion model conditioned jointly on music signals and contrastive features. Result: MotionRAG-Diff achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy. Conclusion: By combining the fidelity of retrieved templates with the creativity of diffusion models, the work establishes a new paradigm for music-driven dance generation. Abstract: Generating long-term, coherent, and realistic music-conditioned dance sequences remains a challenging task in human motion synthesis. Existing approaches exhibit critical limitations: motion graph methods rely on fixed template libraries, restricting creative generation; diffusion models, while capable of producing novel motions, often lack temporal coherence and musical alignment. To address these challenges, we propose MotionRAG-Diff, a hybrid framework that integrates Retrieval-Augmented Generation (RAG) with diffusion-based refinement to enable high-quality, musically coherent dance generation for arbitrary long-term music inputs. Our method introduces three core innovations: (1) A cross-modal contrastive learning architecture that aligns heterogeneous music and dance representations in a shared latent space, establishing unsupervised semantic correspondence without paired data; (2) An optimized motion graph system for efficient retrieval and seamless concatenation of motion segments, ensuring realism and temporal coherence across long sequences; (3) A multi-condition diffusion model that jointly conditions on raw music signals and contrastive features to enhance motion quality and global synchronization. Extensive experiments demonstrate that MotionRAG-Diff achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy. This work establishes a new paradigm for music-driven dance generation by synergizing retrieval-based template fidelity with diffusion-based creative enhancement.

[280] TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models

Chetwin Low,Weimin Wang

Main category: cs.SD

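TL;DR: TalkingMachines is an efficient framework that turns pretrained video generation models into real-time, audio-driven character animators, integrating an audio large language model with a video generation foundation model for natural conversational experiences.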

Details

Motivation: To build a system that generates audio-driven character animation in real time, making conversational experiences more natural and efficient. Method: (1) Adapts a pretrained SOTA image-to-video DiT into an 18B-parameter audio-driven avatar generation model; (2) enables infinite video streaming through asymmetric knowledge distillation; (3) designs a high-throughput, low-latency inference pipeline. Result: The TalkingMachines framework supports efficient, real-time audio-driven character animation. Conclusion: Through technical innovation and engineering optimization, TalkingMachines achieves high-quality real-time character animation and offers a new tool for conversational experiences. Abstract: In this paper, we present TalkingMachines, an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model. Our primary contributions include: (1) We adapt a pretrained SOTA image-to-video DiT into an audio-driven avatar generation model of 18 billion parameters; (2) We enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) We design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations such as: (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, (c) elimination of redundant recomputations to maximize frame-generation throughput. Demo videos are available at https://aaxwaz.github.io/TalkingMachines/

[281] Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion Recognition

Ziwei Gong,Pengyuan Shi,Kaan Donbekci,Lin Ai,Run Chen,David Sasu,Zehui Wu,Julia Hirschberg

Main category: cs.SD

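TL;DR: The paper explores unsupervised learning for speech emotion recognition (SER) in low-resource languages (LRLs); contrastive learning and BYOL yield notable performance gains.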

Details

Motivation: Annotated data is scarce in low-resource languages, which limits progress in speech emotion recognition, so the paper explores unsupervised learning to improve performance. Method: Uses contrastive learning (CL) and Bootstrap Your Own Latent (BYOL) as self-supervised approaches to strengthen cross-lingual generalization. Result: F1 scores improve by 10.6% in Urdu, 15.2% in German, and 13.9% in Bangla. Conclusion: The study lays a foundation for more inclusive, explainable, and robust emotion recognition systems for low-resource languages. Abstract: Speech Emotion Recognition (SER) has seen significant progress with deep learning, yet remains challenging for Low-Resource Languages (LRLs) due to the scarcity of annotated data. In this work, we explore unsupervised learning to improve SER in low-resource settings. Specifically, we investigate contrastive learning (CL) and Bootstrap Your Own Latent (BYOL) as self-supervised approaches to enhance cross-lingual generalization. Our methods achieve notable F1 score improvements of 10.6% in Urdu, 15.2% in German, and 13.9% in Bangla, demonstrating their effectiveness in LRLs. Additionally, we analyze model behavior to provide insights on key factors influencing performance across languages, and highlight challenges in low-resource SER. This work provides a foundation for developing more inclusive, explainable, and robust emotion recognition systems for underrepresented languages.

[282] Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion

Ajinkya Kulkarni,Sandipana Dowerah,Tanel Alumae,Mathew Magimai.-Doss

Main category: cs.SD

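TL;DR: The paper proposes a new audio source tracing system that combines a deep metric multi-class N-pair loss, the Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ensemble score-embedding fusion, improving source tracing for audio deepfakes.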

Details

Motivation: As AI advances, audio deepfakes are becoming increasingly realistic. Current research focuses on distinguishing real from spoofed speech, but tracing the system that produced a fake is equally important. Method: Combines a deep metric multi-class N-pair loss, the Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ensemble score-embedding fusion to improve the discriminative ability and robustness of source tracing. Result: Evaluated with Frechet Distance and standard metrics, the system outperforms the baseline on source tracing. Conclusion: The proposed method performs well both at separating real from fake speech patterns and at source tracing, offering an effective solution for tracing audio deepfake sources. Abstract: Audio deepfakes are acquiring an unprecedented level of realism with advanced AI. While current research focuses on discerning real speech from spoofed speech, tracing the source system is equally crucial. This work proposes a novel audio source tracing system combining deep metric multi-class N-pair loss with Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ensemble score-embedding fusion. The N-pair loss improves discriminative ability, while Real Emphasis and Fake Dispersion enhance robustness by focusing on differentiating real and fake speech patterns. The Conformer network captures both global and local dependencies in the audio signal, crucial for source tracing. The proposed ensemble score-embedding fusion shows an optimal trade-off between in-domain and out-of-domain source tracing scenarios. We evaluate our method using Frechet Distance and standard metrics, demonstrating superior performance in source tracing over the baseline system.
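To make the N-pair loss mentioned above concrete, here is a minimal sketch of the standard multi-class N-pair loss for a single anchor. It is an illustration only: the abstract does not give the paper's exact formulation or how it interacts with Real Emphasis and Fake Dispersion, and the function names and the tiny 2-D embeddings are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def n_pair_loss(anchor, positive, negatives):
    """Multi-class N-pair loss for one anchor: log(1 + sum_j exp(a.n_j - a.p)).

    `positive` is an embedding from the same source system as `anchor`;
    `negatives` are embeddings from other source systems (assumed setup).
    """
    pos_score = dot(anchor, positive)
    return math.log1p(sum(math.exp(dot(anchor, n) - pos_score) for n in negatives))


# Hypothetical embeddings: the loss is low when the anchor matches its positive
# and high when it is closer to a negative.
well_separated = n_pair_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
confused = n_pair_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(well_separated < confused)  # True
```

The loss pushes the anchor's similarity to its positive above its similarity to every negative at once, which is why it suits a task with many candidate source systems.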

[283] Enhancing Speech Emotion Recognition with Graph-Based Multimodal Fusion and Prosodic Features for the Speech Emotion Recognition in Naturalistic Conditions Challenge at Interspeech 2025

Alef Iury Siqueira Ferreira,Lucas Rafael Gris,Alexandre Ferro Filho,Lucas Ólives,Daniel Ribeiro,Luiz Fernando,Fernanda Lustosa,Rodrigo Tanaka,Frederico Santos de Oliveira,Arlindo Galvão Filho

Main category: cs.SD

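TL;DR: The paper presents a robust system that combines audio models with text features for emotion recognition in natural speech, focusing on F0 quantization and a pretrained audio tagging model, and uses an ensemble to improve performance.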

Details

Motivation: Emotion recognition in natural speech is hard because emotions are expressed subtly and real-world audio is unpredictable; the paper aims to address this. Method: Combines audio models with text features, studies F0 quantization and a pretrained audio tagging model, and uses an ensemble model and Graph Attention Networks. Result: Macro F1-score of 39.79% on the test set (42.20% on validation). Conclusion: The results show the potential of the proposed approach, in particular the effectiveness of Graph Attention Networks; the code is publicly available. Abstract: Training SER models on natural, spontaneous speech is especially challenging due to the subtle expression of emotions and the unpredictable nature of real-world audio. In this paper, we present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, focusing on categorical emotion recognition. Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues. In particular, we investigate the effectiveness of Fundamental Frequency (F0) quantization and the use of a pretrained audio tagging model. We also employ an ensemble model to improve robustness. On the official test set, our system achieved a Macro F1-score of 39.79% (42.20% on validation). Our results underscore the potential of these methods, and analysis of fusion techniques confirmed the effectiveness of Graph Attention Networks. Our source code is publicly available.

[284] Cocktail-Party Audio-Visual Speech Recognition

Thai-Binh Nguyen,Ngoc-Quan Pham,Alexander Waibel

Main category: cs.SD

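TL;DR: The study introduces a new audio-visual cocktail-party dataset for evaluating AVSR systems in realistic noisy conditions and demonstrates significant performance gains.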

Details

Motivation: Existing AVSR models perform well in idealized scenarios but overlook the complexity of real-world settings that include both speaking and silent facial segments. Method: Introduces a 1526-hour AVSR dataset containing both talking-face and silent-face segments and tests under extreme noise. Result: WER is reduced by 67% relative to the state of the art, from 119% to 39.2%. Conclusion: The approach substantially improves AVSR performance in noisy environments without explicit segmentation cues. Abstract: Audio-Visual Speech Recognition (AVSR) offers a robust solution for speech recognition in challenging environments, such as cocktail-party scenarios, where relying solely on audio proves insufficient. However, current AVSR models are often optimized for idealized scenarios with consistently active speakers, overlooking the complexities of real-world settings that include both speaking and silent facial segments. This study addresses this gap by introducing a novel audio-visual cocktail-party dataset designed to benchmark current AVSR systems and highlight the limitations of prior approaches in realistic noisy conditions. Additionally, we contribute a 1526-hour AVSR dataset comprising both talking-face and silent-face segments, enabling significant performance gains in cocktail-party environments. Our approach reduces WER by 67% relative to the state-of-the-art, reducing WER from 119% to 39.2% in extreme noise, without relying on explicit segmentation cues.
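Two things in the numbers above are easy to misread: the 67% is a relative reduction, and a WER above 100% is possible because inserted words count as errors. The sketch below works through both. The function names and the toy sentences are illustrative and are not taken from the paper or its dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Toy example: 2 reference words, 4 wrong hypothesis words -> 4 edits / 2 words.
print(word_error_rate("a b", "x y z w"))           # 2.0, i.e. 200% WER
# Figures quoted in the abstract: WER falls from 119% to 39.2%.
print(f"{relative_reduction(119.0, 39.2):.1%}")    # 67.1%
```

The absolute drop is 79.8 percentage points; dividing by the 119% baseline gives the roughly 67% relative reduction quoted in the abstract.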

[285] Synthetic Speech Source Tracing using Metric Learning

Dimitrios Koutsianos,Stavros Zacharopoulos,Yannis Panagakis,Themos Stafylakis

Main category: cs.SD

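TL;DR: The paper traces the generative systems behind synthetic speech using speaker-recognition-style pipelines, compares classification and metric learning approaches, and finds that ResNet can outperform self-supervised learning backbones.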

Details

Motivation: Existing work mostly targets spoofing detection, and robust solutions for tracing the source of synthetic speech are lacking. Method: Evaluates classification-based and metric learning approaches with ResNet and self-supervised learning (SSL) backbones on the MLAADv5 benchmark. Result: ResNet performs strongly with metric learning, matching and even exceeding SSL-based systems. Conclusion: ResNet is viable for source tracing, while SSL representations need further optimization for this task; the work offers new directions for synthetic media detection. Abstract: This paper addresses source tracing in synthetic speech, that is, identifying the generative systems behind manipulated audio, via speaker recognition-inspired pipelines. While prior work focuses on spoofing detection, source tracing lacks robust solutions. We evaluate two approaches: classification-based and metric-learning. We tested our methods on the MLAADv5 benchmark using ResNet and self-supervised learning (SSL) backbones. The results show that ResNet achieves competitive performance with the metric learning approach, matching and even exceeding SSL-based systems. Our work demonstrates ResNet's viability for source tracing while underscoring the need to optimize SSL representations for this task. Our work bridges speaker recognition methodologies with audio forensic challenges, offering new directions for combating synthetic media manipulation.

cs.LG [Back]

[286] Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

Andrew Kiruluta,Preethi Raju,Priscilla Burity

Main category: cs.LG

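TL;DR: Proposes a non-attention large language model architecture that handles ultra-long context windows efficiently, avoiding the quadratic complexity of traditional Transformers.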

Details

Motivation: Self-attention makes the memory and compute cost of traditional Transformers grow quadratically, which makes ultra-long contexts hard to handle. Method: Combines State Space blocks (inspired by S4), Multi-Resolution Convolution layers, a lightweight Recurrent Supervisor, and Retrieval-Augmented External Memory, avoiding token-to-token attention. Result: The model handles contexts of hundreds of thousands to millions of tokens with near-linear computational complexity. Conclusion: The new architecture offers an efficient and scalable solution for ultra-long contexts. Abstract: We present a novel non-attention-based architecture for large language models (LLMs) that efficiently handles very long context windows, on the order of hundreds of thousands to potentially millions of tokens. Unlike traditional Transformer designs, which suffer from quadratic memory and computation overload due to the nature of the self-attention mechanism, our model avoids token-to-token attention entirely. Instead, it combines the following complementary components: State Space blocks (inspired by S4) that learn continuous-time convolution kernels and scale near-linearly with sequence length, Multi-Resolution Convolution layers that capture local context at different dilation levels, a lightweight Recurrent Supervisor to maintain a global hidden state across sequential chunks, and Retrieval-Augmented External Memory that stores and retrieves high-level chunk embeddings without reintroducing quadratic operations.

[287] Turning LLM Activations Quantization-Friendly

Patrik Czakó,Gábor Kertész,Sándor Szénási

Main category: cs.LG

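TL;DR: The paper studies the outlier problem in quantizing large language models (LLMs) and proposes a new metric and a hybrid quantization approach to reduce quantization error.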

Details

Motivation: Quantization lowers LLM serving costs, but enabling integer arithmetic requires quantizing both weights and activations, and outliers in LLMs increase quantization error. Method: Studies the effect of outliers on layer-wise quantization error, proposes a new metric based on channel magnitudes, and designs a hybrid approach that combines channel-wise scaling with rotation. Result: A mathematical formulation supports the benefits of the hybrid approach in reducing quantization error. Conclusion: The proposed hybrid approach can reduce quantization error in LLMs and offers a new direction for efficient quantization. Abstract: Quantization effectively reduces the serving costs of Large Language Models (LLMs) by speeding up data movement through compressed parameters and enabling faster operations via integer arithmetic. However, activating integer arithmetic requires quantizing both weights and activations, which poses challenges due to the significant outliers in LLMs that increase quantization error. In this work, we investigate these outliers with an emphasis on their effect on layer-wise quantization error, then examine how smoothing and rotation transform the observed values. Our primary contributions include introducing a new metric to measure and visualize quantization difficulty based on channel magnitudes, as well as proposing a hybrid approach that applies channel-wise scaling before rotation, supported by a mathematical formulation of its benefits.
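The idea of applying channel-wise scaling before rotation can be sketched on a tiny linear layer. The following is a rough illustration under stated assumptions, not the paper's method: the scaling rule is the SmoothQuant-style `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`, the rotation is an orthonormal Hadamard matrix, and `alpha`, the helper names, and the toy matrices are all hypothetical. Both transforms are folded into the weights so that the layer output `X W` is unchanged.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def smooth_then_rotate(x, w, alpha=0.5):
    """Scale input channels, then rotate; returns (x', w') with x' w' == x w.

    Assumes every channel has non-zero activation and weight magnitude.
    """
    n = len(w)  # number of input channels
    act_max = [max(abs(row[j]) for row in x) for j in range(n)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n)]
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n)]
    x_s = [[row[j] / s[j] for j in range(n)] for row in x]  # X diag(s)^-1
    w_s = [[v * s[j] for v in w[j]] for j in range(n)]      # diag(s) W
    h = hadamard(n)
    h_t = [list(col) for col in zip(*h)]
    return matmul(x_s, h), matmul(h_t, w_s)                 # (X S^-1 H, H^T S W)


# Toy activations (2 tokens x 4 channels) with one outlier channel.
x = [[1.0, -2.0, 60.0, 0.5], [-1.5, 1.0, -80.0, 2.0]]
w = [[1.0, -0.5], [0.5, 1.0], [-1.0, 0.25], [0.75, -1.0]]
x2, w2 = smooth_then_rotate(x, w)
print(max(abs(v) for row in x for v in row), max(abs(v) for row in x2 for v in row))
```

Scaling first shrinks the outlier channel by moving part of its magnitude into the weights; the rotation then spreads what remains across channels, so the activation range a quantizer must cover becomes much smaller.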

[288] Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition

Yoonjun Cho,Soeun Kim,Dongjae Jeon,Kyelim Lee,Beomsoo Lee,Albert No

Main category: cs.LG

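TL;DR: The paper proposes ODLRI, a method that improves large language model compression by decomposing weight matrices into a quantized part and a low-rank part with distinct roles.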

Details

Motivation: When jointly optimizing quantization and low-rank approximation, existing methods tend to favor one component over the other, leading to poor decompositions. Method: Introduces Outlier-Driven Low-Rank Initialization (ODLRI), which assigns the low-rank component the role of capturing activation-sensitive weights, balancing quantization and low-rank approximation. Result: Experiments show that ODLRI reduces activation-aware error, lowers quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings. Conclusion: Through structured decomposition, ODLRI improves weight matrix decomposition and applies to a range of large language models. Abstract: Decomposing weight matrices into quantization and low-rank components ($\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$) is a widely used technique for compressing large language models (LLMs). Existing joint optimization methods iteratively alternate between quantization and low-rank approximation. However, these methods tend to prioritize one component at the expense of the other, resulting in suboptimal decompositions that fail to leverage each component's unique strengths. In this work, we introduce Outlier-Driven Low-Rank Initialization (ODLRI), which assigns low-rank components the specific role of capturing activation-sensitive weights. This structured decomposition mitigates outliers' negative impact on quantization, enabling more effective balance between quantization and low-rank approximation. Experiments on Llama2 (7B, 13B, 70B), Llama3-8B, and Mistral-7B demonstrate that incorporating ODLRI into the joint optimization framework consistently reduces activation-aware error, minimizes quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings.
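The alternating scheme that the abstract attributes to existing joint optimization methods can be sketched as follows. This is a generic baseline under simplifying assumptions, not ODLRI itself: the quantizer is plain round-to-nearest with one absmax scale, the low-rank part is rank 1 and found by power iteration, no activation statistics are used, and all function names are hypothetical.

```python
import math


def sub(a, b):
    """Elementwise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def frob(m):
    """Frobenius norm."""
    return math.sqrt(sum(v * v for row in m for v in row))


def quantize(m, bits=3):
    """Symmetric round-to-nearest quantization with a single absmax scale."""
    levels = 2 ** (bits - 1) - 1
    amax = max(abs(v) for row in m for v in row)
    scale = amax / levels if amax > 0 else 1.0
    return [[round(v / scale) * scale for v in row] for row in m]


def rank_one(m, iters=50):
    """Rank-1 approximation sigma * u v^T via power iteration."""
    rows, cols = len(m), len(m[0])
    v = [1.0 / math.sqrt(cols)] * cols
    u = [0.0] * rows
    for _ in range(iters):
        u = [sum(m[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm_u for x in u]
        v = [sum(m[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_v = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm_v for x in v]
    sigma = sum(u[i] * m[i][j] * v[j] for i in range(rows) for j in range(cols))
    return [[sigma * u[i] * v[j] for j in range(cols)] for i in range(rows)]


def decompose(w, bits=3, rounds=5):
    """Alternate Q = quantize(W - LR) and LR = rank_one(W - Q)."""
    low_rank = [[0.0] * len(w[0]) for _ in w]
    for _ in range(rounds):
        q = quantize(sub(w, low_rank), bits)  # quantized part fits what LR leaves
        low_rank = rank_one(sub(w, q))        # low-rank part absorbs the residual
    return q, low_rank


# Toy weight matrix whose last column holds outliers.
w = [[0.1, -0.2, 0.05, 4.0], [0.3, 0.1, -0.1, 3.5], [-0.2, 0.25, 0.15, -3.8]]
q, lr = decompose(w)
print(frob(sub(w, quantize(w))), frob(sub(sub(w, q), lr)))
```

The printout compares quantization alone with the combined decomposition. How the low-rank part is initialized determines which weights it ends up representing, which is the design question the paper addresses.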

[289] SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis

Zijian Wu,Jinjie Ni,Xiangyan Liu,Zichen Liu,Hang Yan,Michael Qizhe Shieh

Main category: cs.LG

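TL;DR: SynthRL is a pipeline that improves vision-language models by synthesizing reinforcement learning data, yielding notable gains on complex reasoning tasks.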

Details

Motivation: To investigate how synthesized RL data can further improve RLVR (reinforcement learning with verifiable reward) training. Method: Proposes the SynthRL pipeline, which selects seed questions, augments them into harder variants while preserving the answers, and verifies correctness and increased difficulty. Result: On the MMK12 dataset, SynthRL synthesizes over 3.3K additional questions, and trained models improve across several visual math reasoning benchmarks. Conclusion: SynthRL improves model performance on complex reasoning tasks, with the largest gains on the most challenging samples. Abstract: Vision-language models (VLMs) trained via reinforcement learning with verifiable reward (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this end, we propose SynthRL, a scalable and guaranteed pipeline for automatic data scaling in reasoning-oriented RL training. SynthRL comprises three key stages: (1) selecting seed questions with appropriate distribution, (2) augmenting them into more challenging variants while preserving the original answers, and (3) a guaranteed verification stage that ensures near-perfect correctness and difficulty enhancement. Our empirical experiments demonstrate SynthRL's scalability and effectiveness. When applied to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable, challenging questions from approximately 8K seed samples. Models trained with our synthesized data achieve consistent gains across five out-of-domain visual math reasoning benchmarks, with a significant improvement over baseline models trained on seed data alone. Notably, detailed analysis reveals that the gains are more pronounced on the most challenging evaluation samples, highlighting SynthRL's effectiveness in eliciting deeper and more complex reasoning patterns.

[290] KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning

Hongling Xu,Qi Zhu,Heyuan Deng,Jinpeng Li,Lu Hou,Yasheng Wang,Lifeng Shang,Ruifeng Xu,Fei Mi

Main category: cs.LG

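TL;DR: KDRL is a unified post-training framework that combines knowledge distillation (KD) and reinforcement learning (RL) to improve the reasoning ability of large language models.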

Details

Motivation: RL can produce complex reasoning behavior but is sample-inefficient, while KD learns efficiently but generalizes poorly, so the strengths of both should be combined. Method: KDRL uses policy gradient optimization to simultaneously minimize the reverse KL divergence (RKL) between the student and teacher distributions and maximize rule-based rewards. Result: On multiple reasoning benchmarks, KDRL outperforms GRPO and KD baselines while balancing performance and reasoning efficiency. Conclusion: Combining KD and RL is an effective and efficient strategy for training reasoning LLMs. Abstract: Recent advances in large language model (LLM) post-training have leveraged two distinct paradigms to enhance reasoning capabilities: reinforcement learning (RL) and knowledge distillation (KD). While RL enables the emergence of complex reasoning behaviors, it often suffers from low sample efficiency when the initial policy struggles to explore high-reward trajectories. Conversely, KD improves learning efficiency via mimicking the teacher model but tends to generalize poorly to out-of-domain scenarios. In this work, we present KDRL, a unified post-training framework that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). Specifically, KDRL leverages policy gradient optimization to simultaneously minimize the reverse Kullback-Leibler divergence (RKL) between the student and teacher distributions while maximizing the expected rule-based rewards. We first formulate a unified objective that integrates GRPO and KD, and systematically explore how different KL approximations, KL coefficients, and reward-guided KD strategies affect the overall post-training dynamics and performance. Empirical results on multiple reasoning benchmarks demonstrate that KDRL outperforms GRPO and various KD baselines while achieving a favorable balance between performance and reasoning token efficiency. These findings indicate that integrating KD and RL serves as an effective and efficient strategy to train reasoning LLMs.
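The combined objective described above, maximizing reward while minimizing the reverse KL divergence from student to teacher, can be illustrated on a single next-token distribution. This is a toy sketch and not the KDRL objective as implemented in the paper (which builds on GRPO and studies several KL approximations); the coefficient `beta` and the function names are assumptions.

```python
import math


def reverse_kl(student, teacher):
    """KL(student || teacher): the expectation is taken under the student."""
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0)


def forward_kl(student, teacher):
    """KL(teacher || student), shown only for contrast."""
    return sum(q * math.log(q / p) for p, q in zip(student, teacher) if q > 0)


def combined_objective(reward, student, teacher, beta=0.1):
    """Quantity to maximize: reward minus a KL penalty weighted by beta (assumed form)."""
    return reward - beta * reverse_kl(student, teacher)


teacher = [0.9, 0.1]
student = [0.5, 0.5]
print(reverse_kl(student, teacher), forward_kl(student, teacher))
print(combined_objective(1.0, student, teacher))
```

Because the reverse KL is an expectation under the student's own samples, it can be estimated from the same on-policy rollouts that RL already generates, which makes the two terms natural to optimize together.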

[291] Comba: Improving Nonlinear RNNs with Closed-loop Control

Jiaxi Hu,Yongqi Pan,Jusen Du,Disen Lan,Xiaqiang Tang,Qingsong Wen,Yuxuan Liang,Weigao Sun

Main category: cs.LG

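TL;DR: The paper proposes Comba, a new nonlinear RNN variant based on closed-loop control theory that uses state feedback and output feedback corrections for efficient language and vision modeling.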

Details

Motivation: Recent sequence models (such as Gated DeltaNet, TTT, and RWKV-7) improve performance through the Delta learning rule, but their nonlinear recursive structure has limitations. The paper analyzes the strengths and weaknesses of nonlinear RNNs and proposes a more efficient design. Method: Based on closed-loop control theory, proposes the Comba model, which adopts a scalar-plus-low-rank state transition with state feedback and output feedback corrections, and implements a hardware-efficient parallel kernel. Result: At 340M/1.3B parameters, Comba shows superior performance and computation efficiency in language and vision modeling. Conclusion: Comba's nonlinear RNN design improves both the efficiency and performance of sequence modeling. Abstract: Recent efficient sequence modeling methods such as Gated DeltaNet, TTT, and RWKV-7 have achieved performance improvements by supervising the recurrent memory management through the Delta learning rule. Unlike previous state-space models (e.g., Mamba) and gated linear attentions (e.g., GLA), these models introduce interactions between the recurrent state and the key vector, resulting in a nonlinear recursive structure. In this paper, we first introduce the concept of Nonlinear RNNs with a comprehensive analysis of the advantages and limitations of these models. Then, based on closed-loop control theory, we propose a novel Nonlinear RNN variant named Comba, which adopts a scalar-plus-low-rank state transition, with both state feedback and output feedback corrections. We also implement a hardware-efficient chunk-wise parallel kernel in Triton and train models with 340M/1.3B parameters on a large-scale corpus. Comba demonstrates its superior performance and computation efficiency in both language and vision modeling.

[292] Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective

Shenghua He,Tian Xia,Xuan Zhou,Hui Wei

Main category: cs.LG

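TL;DR: The paper studies the Zero-Reward Assumption in reinforcement learning for large language models (LLMs), presents a theoretical framework showing that a response-level reward model suffices for unbiased policy gradient estimation, and proposes a new algorithm, TRePO.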

Details

Motivation: To address the lack of immediate reward for non-terminal actions (intermediate token generations) in LLM reinforcement learning and to give theoretical support for reducing reliance on precise token-level rewards. Method: Introduces the Trajectory Policy Gradient Theorem, proves that a response-level reward model yields an unbiased policy gradient estimate, and proposes the TRePO algorithm. Result: The theory shows that methods such as PPO and GRPO are capable of modeling token-level reward signals; TRePO is simpler than PPO and matches GRPO in memory efficiency. Conclusion: The work supports more practical LLM fine-tuning, letting developers focus on improving the response-level reward model, and TRePO shows promise for broad applicability. Abstract: We study a common challenge in reinforcement learning for large language models (LLMs): the Zero-Reward Assumption, where non-terminal actions (i.e., intermediate token generations) receive zero task-specific immediate reward, while only the final token receives a reward for the entire response. This assumption arises frequently in practice, as precise token-level rewards are often difficult or infeasible to obtain in LLM applications. In this work, we provide a unifying theoretical perspective. We introduce the Trajectory Policy Gradient Theorem, which shows that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model, regardless of whether the Zero-Reward Assumption holds or not, for algorithms in the REINFORCE and Actor-Critic families. This result reveals that widely used methods such as PPO, GRPO, ReMax, and RLOO inherently possess the capacity to model token-level reward signals, offering a theoretical justification for response-level reward approaches. Our findings pave the way for more practical, efficient LLM fine-tuning, allowing developers to treat training algorithms as black boxes and focus on improving the response-level reward model with auxiliary sub-models. We also offer a detailed analysis of popular RL and non-RL methods, comparing their theoretical foundations and practical advantages across common LLM tasks. Finally, we propose a new algorithm: Token-Reinforced Policy Optimization (TRePO), a theoretically grounded method that is simpler than PPO, matches GRPO in memory efficiency, and holds promise for broad applicability.
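As a concrete picture of what "only a response-level reward" means in practice, the sketch below uses an RLOO-style estimator, one of the existing methods the abstract names. It is not TRePO, whose details the abstract does not give. Each sampled response gets one scalar reward, the baseline for a response is the mean reward of the other responses to the same prompt, and every token in a response shares that response's advantage. The function names and numbers are illustrative.

```python
def leave_one_out_advantages(rewards):
    """Advantage of each response: its reward minus the mean reward of the others.

    Requires at least two sampled responses per prompt.
    """
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def surrogate_loss(token_logprobs, advantages):
    """Negative advantage-weighted log-likelihood, averaged over responses.

    token_logprobs[i] lists the log-probabilities of the tokens in response i.
    No per-token reward appears anywhere: the response-level advantage is
    applied to the summed log-probability of the whole response.
    """
    weighted = (adv * sum(lps) for lps, adv in zip(token_logprobs, advantages))
    return -sum(weighted) / len(advantages)


# Four sampled responses to one prompt, scored 1 (correct) or 0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = leave_one_out_advantages(rewards)
logprobs = [[-0.1, -0.2], [-0.5, -0.4, -0.3], [-0.7], [-0.2, -0.1]]
print(advantages)
print(surrogate_loss(logprobs, advantages))
```

In a real training loop the log-probabilities would come from the policy network and the loss would be differentiated by an autograd framework; the plain floats here only show how the reward signal is attached.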

[293] Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights

Jakub Krajewski,Marcin Chochowski,Daniel Korzekwa

Main category: cs.LG

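TL;DR: Fine-grained MoE (Mixture of Experts) architectures show potential for improving the convergence and quality of large language models. The paper proposes training recipes, compares their scaling against standard MoE, and shows that fine-grained MoE performs better at the 56B-parameter scale.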

Details

Motivation: To explore the potential of fine-grained MoE architectures to improve convergence speed and model performance, and to provide empirical support for large-scale model development. Method: Proposes training recipes and compares fine-grained MoE with standard MoE at several scales (up to 56B parameters) in terms of convergence speed, downstream performance, and training practice. Result: At the largest scale, fine-grained MoE achieves lower validation loss and higher downstream accuracy. Conclusion: Fine-grained MoE offers empirical grounding and practical guidance for developing future large-scale models. Abstract: Mixture of Experts (MoE) architectures have emerged as pivotal for scaling Large Language Models (LLMs) efficiently. Fine-grained MoE approaches, which use more numerous, smaller experts, have demonstrated potential in improving model convergence and quality. This work proposes a set of training recipes and provides a comprehensive empirical evaluation of fine-grained MoE, directly comparing its scaling properties against standard MoE configurations for models with up to 56B total (17B active) parameters. We investigate convergence speed, model performance on downstream benchmarks, and practical training considerations across various setups. Overall, at the largest scale we show that fine-grained MoE achieves better validation loss and higher accuracy across a set of downstream benchmarks. This study offers empirical grounding and practical insights for leveraging fine-grained MoE in the development of future large-scale models.
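One common way to read "more numerous, smaller experts" is a granularity factor `G` that divides each expert's hidden size by `G` while multiplying the number of experts and the number of routed experts per token by `G`, so total and active parameter counts stay fixed. The arithmetic below follows that convention as an assumption; the abstract does not state the paper's exact configuration, and the layer sizes are made up. Router weights and biases are ignored.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k, granularity=1):
    """Return (total, active) feed-forward weight counts for one MoE layer.

    Assumes d_ff is divisible by granularity and that each expert is a
    two-matrix MLP (up-projection and down-projection) without biases.
    """
    expert_hidden = d_ff // granularity
    experts = num_experts * granularity
    active_experts = top_k * granularity
    per_expert = 2 * d_model * expert_hidden
    return experts * per_expert, active_experts * per_expert


# Hypothetical layer: 8 experts with top-2 routing, versus granularity 4
# (32 experts of a quarter the width, with 8 routed per token).
standard = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
fine_grained = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2, granularity=4)
print(standard, fine_grained)
```

Under this convention the parameter budget is unchanged, but the router chooses among many more expert combinations per token, which is the extra flexibility fine-grained designs aim to exploit.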

[294] Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds

Yang Guo,Yutian Tao,Yifei Ming,Robert D. Nowak,Yingyu Liang

Main category: cs.LG

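TL;DR: This paper proposes the first finite-sample generalization bound for retrieval-augmented generation (RAG) in in-context linear regression and derives a bias-variance tradeoff.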

Details

Motivation: RAG has many empirical successes, but its theory remains largely unexplored; this paper aims to fill that gap. Method: Treats retrieved texts as query-dependent noisy in-context examples and recovers classical in-context learning (ICL) and standard RAG as limit cases. Result: The analysis suggests that RAG has an intrinsic ceiling on generalization error, and the framework can model retrieval from both the training data and external corpora. Conclusion: In line with the theory, experiments show the sample efficiency of ICL and RAG. Abstract: Retrieval-augmented generation (RAG) has seen many empirical successes in recent years by aiding the LLM with external knowledge. However, its theoretical aspect has remained mostly unexplored. In this paper, we propose the first finite-sample generalization bound for RAG in in-context linear regression and derive an exact bias-variance tradeoff. Our framework views the retrieved texts as query-dependent noisy in-context examples and recovers the classical in-context learning (ICL) and standard RAG as the limit cases. Our analysis suggests that an intrinsic ceiling on generalization error exists on RAG as opposed to the ICL. Furthermore, our framework is able to model retrieval both from the training data and from external corpora by introducing uniform and non-uniform RAG noise. In line with our theory, we show the sample efficiency of ICL and RAG empirically with experiments on common QA benchmarks, such as Natural Questions and TriviaQA.

[295] Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

Bao Pham,Gabriel Raya,Matteo Negri,Mohammed J. Zaki,Luca Ambrogioni,Dmitry Krotov

Main category: cs.LG

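TL;DR: The paper studies diffusion models as associative memory systems, reveals distinct memorization and generation regimes under small and large data, and predicts the existence of spurious states.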

Details

Motivation: Study the behavior of diffusion models within the associative memory framework to understand their memorization and generation mechanisms, in particular the emergence of spurious states. Method: Treats the training and generation phases of a diffusion model as memory encoding and memory retrieval respectively, and analyzes its behavior at different data scales. Result: With little data the model memorizes strongly; with large data it creates new attractor states, and spurious states appear at the boundary of this transition. Conclusion: The work offers a new perspective on the memorization-generalization phenomenon in diffusion models and validates the theoretical prediction of spurious states. Abstract: Hopfield networks are associative memory (AM) systems, designed for storing and retrieving patterns as local minima of an energy landscape. In the classical Hopfield model, an interesting phenomenon occurs when the amount of training data reaches its critical memory load: spurious states, or unintended stable points, emerge at the end of the retrieval dynamics, leading to incorrect recall. In this work, we examine diffusion models, commonly used in generative modeling, from the perspective of AMs. The training phase of a diffusion model is conceptualized as memory encoding (training data is stored in the memory). The generation phase is viewed as an attempt at memory retrieval. In the small data regime the diffusion model exhibits a strong memorization phase, where the network creates distinct basins of attraction around each sample in the training set, akin to the Hopfield model below the critical memory load. In the large data regime, a different phase appears where an increase in the size of the training set fosters the creation of new attractor states that correspond to manifolds of the generated samples. Spurious states appear at the boundary of this transition and correspond to emergent attractor states, which are absent in the training set, but, at the same time, have distinct basins of attraction around them. Our findings provide a novel perspective on the memorization-generalization phenomenon in diffusion models via the lens of AMs, a theoretical prediction of the existence of spurious states, and empirical validation of this prediction in commonly used diffusion models.
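
For readers unfamiliar with the classical Hopfield model that the abstract uses as its reference point, the following is a minimal sketch of storage and retrieval with bipolar (+1/-1) patterns. The Hebbian learning rule, the asynchronous update order, and all function names are assumptions chosen for illustration; none of this code comes from the paper.

```python
def train_hebbian(patterns):
    """Build a symmetric weight matrix from bipolar patterns (Hebbian rule)."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += p[i] * p[j] / n
    return weights


def energy(weights, state):
    """Hopfield energy; stored patterns sit at local minima of this function."""
    n = len(state)
    return -0.5 * sum(
        weights[i][j] * state[i] * state[j] for i in range(n) for j in range(n)
    )


def retrieve(weights, state, max_sweeps=20):
    """Asynchronous retrieval: update units one by one until nothing changes."""
    s = list(state)
    n = len(s)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            field = sum(weights[i][j] * s[j] for j in range(n))
            new_value = 1 if field >= 0 else -1
            if new_value != s[i]:
                s[i] = new_value
                changed = True
        if not changed:
            break
    return s
```

Starting `retrieve` from a slightly corrupted copy of a stored pattern should fall back into that pattern's basin of attraction when few patterns are stored. Storing many patterns relative to the number of units is the regime in which spurious stable states become likely.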

[296] Johnny: Structuring Representation Space to Enhance Machine Abstract Reasoning Ability

Ruizhuo Song,Beiming Yuan

Main category: cs.LG

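TL;DR: The paper proposes the Johnny architecture and the Spin-Transformer network, which improve the representation space and the capture of positional relationships, markedly improving AI abstract reasoning on Raven's Progressive Matrices tasks.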

Details

Motivation: Traditional RPM-solving models rely heavily on option pool configurations, which limits their reasoning ability and calls for improvement. Method: Proposes the Johnny architecture (a Representation Extraction Module and a Reasoning Module) and the Spin-Transformer network (which better captures positional relationships). Result: Experiments show that Johnny and Spin-Transformer perform strongly on RPM tasks. Conclusion: The new methods offer an innovative route to improving AI abstract reasoning. Abstract: This paper thoroughly investigates the challenges of enhancing AI's abstract reasoning capabilities, with a particular focus on Raven's Progressive Matrices (RPM) tasks involving complex human-like concepts. Firstly, it dissects the empirical reality that traditional end-to-end RPM-solving models heavily rely on option pool configurations, highlighting that this dependency constrains the model's reasoning capabilities. To address this limitation, the paper proposes the Johnny architecture, a novel representation space-based framework for RPM-solving. Through the synergistic operation of its Representation Extraction Module and Reasoning Module, Johnny significantly enhances reasoning performance by supplementing primitive negative option configurations with a learned representation space. Furthermore, to strengthen the model's capacity for capturing positional relationships among local features, the paper introduces the Spin-Transformer network architecture, accompanied by a lightweight Straw Spin-Transformer variant that reduces computational overhead through parameter sharing and attention mechanism optimization. Experimental evaluations demonstrate that both Johnny and Spin-Transformer achieve superior performance on RPM tasks, offering innovative methodologies for advancing AI's abstract reasoning capabilities.

[297] EWGN: Elastic Weight Generation and Context Switching in Deep Learning

Shriraj P. Sawant,Krishna P. Miyapuram

Main category: cs.LG

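TL;DR: The paper proposes Elastic Weight Generative Networks (EWGN), which mitigate catastrophic forgetting in neural networks through dynamic weight generation and context switching, and evaluates the approach on MNIST and fashion-MNIST.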

Details

Motivation: The human ability to learn and retain many tasks has inspired research in artificial general intelligence, and continual learning is an important step toward it. Catastrophic forgetting is a main challenge for neural networks in multi-task learning. Method: EWGN uses an additional network to dynamically generate the weights of the primary network, which enables input-dependent context switching, and is trained with Stochastic Gradient Descent and Elastic Weight Consolidation. Result: Experiments on MNIST and fashion-MNIST show that EWGN effectively retains performance on previously learned tasks. Conclusion: Dynamic weight generation and context switching help enable continual learning and improve performance. Abstract: The ability to learn and retain a wide variety of tasks is a hallmark of human intelligence that has inspired research in artificial general intelligence. Continual learning approaches provide a significant step towards achieving this goal. It has been known that task variability and context switching are challenging for learning in neural networks. Catastrophic forgetting refers to the poor performance on retention of a previously learned task when a new task is being learned. Switching between different task contexts can be a useful approach to mitigate this by preventing the interference between the varying task weights of the network. This paper introduces Elastic Weight Generative Networks (EWGN) as an idea for context switching between two different tasks. The proposed EWGN architecture uses an additional network that generates the weights of the primary network dynamically while consolidating the weights learned. The weight generation is input-dependent and thus enables context switching. Using standard computer vision datasets, namely MNIST and fashion-MNIST, we analyse the retention of previously learned task representations in Fully Connected Networks, Convolutional Neural Networks, and EWGN architectures with Stochastic Gradient Descent and Elastic Weight Consolidation learning algorithms. Understanding dynamic weight generation and context-switching ability can be useful in enabling continual learning for improved performance.
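
The abstract relies on Elastic Weight Consolidation (EWC) as one of its learning algorithms. As a rough illustration, the standard EWC regularizer adds a quadratic penalty that anchors each parameter to its value after the previous task, weighted by a diagonal Fisher information estimate. The sketch below shows only that penalty, not the EWGN weight-generating network; the function names and the flat-list parameter layout are assumptions.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """Standard EWC term: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def ewc_penalty_grad(params, anchor_params, fisher_diag, lam):
    """Gradient of the penalty with respect to each parameter."""
    return [lam * f * (p - a) for p, a, f in zip(params, anchor_params, fisher_diag)]


def ewc_total_loss(new_task_loss, params, anchor_params, fisher_diag, lam):
    """Loss on the new task plus the consolidation penalty."""
    return new_task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)
```

Parameters with a large Fisher value were important for the old task and are pulled back strongly; parameters with a value near zero remain free to adapt to the new task.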

[298] Robust Federated Learning against Noisy Clients via Masked Optimization

Xuefeng Jiang,Tian Wen,Zhiqin Yang,Lvhua Wu,Yufeng Chen,Sheng Sun,Yuwei Wang,Min Liu

Main category: cs.LG

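TL;DR: This paper proposes MaskedOptim, a two-stage optimization framework for complex label noise in federated learning, which improves model performance by detecting high-noise clients and applying a label correction mechanism.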

Details

Motivation: Annotated data provided by federated learning clients often contains complex label noise that hurts model performance, so an effective strategy is needed to reduce its negative impact. Method: Proposes a two-stage framework: the first stage detects high-noise clients, and the second stage corrects noisy labels through an end-to-end label correction mechanism; geometric median based model aggregation is used to improve robustness. Result: Experiments on multiple datasets show that the framework is robust across scenarios and effectively improves the quality of the noisy clients' local datasets. Conclusion: MaskedOptim effectively addresses label noise in federated learning, improving model performance and data quality. Abstract: In recent years, federated learning (FL) has made significant advances in privacy-sensitive applications. However, it can be hard to ensure that FL participants provide well-annotated data for training. The corresponding annotations from different clients often contain complex label noise at varying levels. This label noise issue has a substantial impact on the performance of the trained models, and clients with greater noise levels are largely responsible for this degradation. To this end, it is necessary to develop an effective optimization strategy to alleviate the adverse effects of these noisy clients. In this study, we present a two-stage optimization framework, MaskedOptim, to address this intricate label noise problem. The first stage is designed to facilitate the detection of noisy clients with higher label noise rates. The second stage focuses on rectifying the labels of the noisy clients' data through an end-to-end label correction mechanism, aiming to mitigate the negative impacts caused by misinformation within datasets. This is achieved by learning the potential ground-truth labels of the noisy clients' datasets via backpropagation. To further enhance the training robustness, we apply the geometric median based model aggregation instead of the commonly-used vanilla averaged model aggregation. We implement sixteen related methods and conduct evaluations on three image datasets and one text dataset with diverse label noise patterns for a comprehensive comparison. Extensive experimental results indicate that our proposed framework shows its robustness in different scenarios. Additionally, our label correction framework effectively enhances the data quality of the detected noisy clients' local datasets. Our codes are available via https://github.com/Sprinter1999/MaskedOptim .
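
The abstract replaces averaged aggregation with geometric median based aggregation but does not say how the median is computed. One common choice is Weiszfeld's iterative algorithm, sketched below under that assumption; the function names and the representation of each client update as a flat list of floats are illustrative, not taken from the MaskedOptim code.

```python
import math


def mean_aggregate(updates):
    """Plain coordinate-wise average (the vanilla baseline)."""
    dim = len(updates[0])
    return [sum(u[d] for u in updates) / len(updates) for d in range(dim)]


def geometric_median(updates, iters=200, eps=1e-9, tol=1e-10):
    """Weiszfeld iteration: re-weight each point by 1 / distance to the estimate."""
    dim = len(updates[0])
    estimate = mean_aggregate(updates)  # start from the mean
    for _ in range(iters):
        numerator = [0.0] * dim
        denominator = 0.0
        for u in updates:
            dist = math.sqrt(sum((u[d] - estimate[d]) ** 2 for d in range(dim)))
            weight = 1.0 / max(dist, eps)  # clamp to avoid division by zero
            denominator += weight
            for d in range(dim):
                numerator[d] += weight * u[d]
        new_estimate = [x / denominator for x in numerator]
        shift = math.sqrt(
            sum((new_estimate[d] - estimate[d]) ** 2 for d in range(dim))
        )
        estimate = new_estimate
        if shift < tol:
            break
    return estimate
```

With three clients sending similar updates and one sending a far-off update, the mean is dragged toward the outlier while the geometric median stays near the majority, which is the robustness property the aggregation step is after.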

[299] Rethinking Post-Unlearning Behavior of Large Vision-Language Models

Minsung Kim,Nakyeong Yang,Kyomin Jung

Main category: cs.LG

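TL;DR: The paper proposes PUBG, a machine unlearning method for Large Vision-Language Models (LVLMs), which addresses the drop in output quality that existing methods cause after privacy-protecting unlearning.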

Details

Motivation: Existing unlearning methods protect privacy but often neglect output quality, leading to degenerate, hallucinated, or excessively refused responses. Method: Introduces a new unlearning task that requires privacy-preserving yet informative and visually grounded responses, and designs PUBG, which guides post-unlearning behavior toward a desirable output distribution. Result: Experiments show that PUBG prevents privacy leakage while generating visually grounded and informative responses, mitigating the problems left by existing methods. Conclusion: PUBG gives LVLMs an unlearning method that balances privacy protection and output quality and has practical value. Abstract: Machine unlearning is used to mitigate the privacy risks of Large Vision-Language Models (LVLMs) arising from training on large-scale web data. However, existing unlearning methods often fail to carefully select substitute outputs for forget targets, resulting in Unlearning Aftermaths: undesirable behaviors such as degenerate, hallucinated, or excessively refused responses. We highlight that, especially for generative LVLMs, it is crucial to consider the quality and informativeness of post-unlearning responses rather than relying solely on naive suppression. To address this, we introduce a new unlearning task for LVLMs that requires models to provide privacy-preserving yet informative and visually grounded responses. We also propose PUBG, a novel unlearning method that explicitly guides post-unlearning behavior toward a desirable output distribution. Experiments show that, while existing methods suffer from Unlearning Aftermaths despite successfully preventing privacy violations, PUBG effectively mitigates these issues, generating visually grounded and informative responses without privacy leakage for forgotten targets.

[300] HIEGNet: A Heterogenous Graph Neural Network Including the Immune Environment in Glomeruli Classification

Niklas Kormann,Masoud Ramuz,Zeeshan Nisar,Nadine S. Schaadt,Hendrik Annuth,Benjamin Doerr,Friedrich Feuerhake,Thomas Lampert,Johannes F. Lutzeyer

Main category: cs.LG

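TL;DR: Proposes HIEGNet, a heterogeneous graph neural network for glomeruli classification. The graph is built with a combination of traditional and machine learning methods, and the model outperforms baseline models on a Whole Slide Image dataset.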

Details

Motivation: Address the unique difficulties of graph construction in glomeruli health classification, in particular the identification of nodes, edges, and informative features. Method: Proposes a pipeline that combines traditional and machine learning techniques to build a heterogeneous graph, and designs the HIEGNet architecture to integrate information from glomeruli and their surrounding immune cells. Result: On a Whole Slide Image dataset from kidney transplant patients, HIEGNet outperforms several baseline models and generalises best between patients. Conclusion: By integrating the immune environment, HIEGNet offers an effective solution for glomeruli classification; the implementation is publicly available. Abstract: Graph Neural Networks (GNNs) have recently been found to excel in histopathology. However, an important histopathological task, where GNNs have not been extensively explored, is the classification of glomeruli health as an important indicator in nephropathology. This task presents unique difficulties, particularly for the graph construction, i.e., the identification of nodes, edges, and informative features. In this work, we propose a pipeline composed of different traditional and machine learning-based computer vision techniques to identify nodes, edges, and their corresponding features to form a heterogeneous graph. We then proceed to propose a novel heterogeneous GNN architecture for glomeruli classification, called HIEGNet, that integrates both glomeruli and their surrounding immune cells. Hence, HIEGNet is able to consider the immune environment of each glomerulus in its classification. Our HIEGNet was trained and tested on a dataset of Whole Slide Images from kidney transplant patients. Experimental results demonstrate that HIEGNet outperforms several baseline models and generalises best between patients among all baseline models. Our implementation is publicly available at https://github.com/nklsKrmnn/HIEGNet.git.

[301] SiamNAS: Siamese Surrogate Model for Dominance Relation Prediction in Multi-objective Neural Architecture Search

Yuyang Zhou,Ferrante Neri,Yew-Soon Ong,Ruibin Bai

Main category: cs.LG

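TL;DR: The paper proposes SiamNAS, a surrogate modelling approach based on Siamese network blocks that efficiently predicts dominance relations in neural architecture search (NAS), significantly reducing computational cost.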

Details

Motivation: Modern NAS is a multi-objective optimization problem that is computationally expensive and hard to solve directly, so efficient approximations are needed. Method: Builds a surrogate model from Siamese network blocks to predict dominance relations between architectures, and replaces the crowding distance calculation with a heuristic rule based on model size. Result: The surrogate achieves 92% accuracy; on NAS-Bench-201, SiamNAS finds Pareto-optimal solutions within 0.01 GPU days, including the best architecture for CIFAR-10 and the second-best for ImageNet. Conclusion: The approach shows the potential of Siamese network surrogates for multi-tasking optimisation and can be extended to generate diverse sets of Pareto-optimal solutions. Abstract: Modern neural architecture search (NAS) is inherently multi-objective, balancing trade-offs such as accuracy, parameter count, and computational cost. This complexity makes NAS computationally expensive and nearly impossible to solve without efficient approximations. To address this, we propose a novel surrogate modelling approach that leverages an ensemble of Siamese network blocks to predict dominance relationships between candidate architectures. Lightweight and easy to train, the surrogate achieves 92% accuracy and replaces the crowding distance calculation in the survivor selection strategy with a heuristic rule based on model size. Integrated into a framework termed SiamNAS, this design eliminates costly evaluations during the search process. Experiments on NAS-Bench-201 demonstrate the framework's ability to identify Pareto-optimal solutions with significantly reduced computational costs. The proposed SiamNAS identified a final non-dominated set containing the best architecture in NAS-Bench-201 for CIFAR-10 and the second-best for ImageNet, in terms of test error rate, within 0.01 GPU days. This proof-of-concept study highlights the potential of the proposed Siamese network surrogate model to generalise to multi-tasking optimisation, enabling simultaneous optimisation across tasks. Additionally, it offers opportunities to extend the approach for generating Sets of Pareto Sets (SOS), providing diverse Pareto-optimal solutions for heterogeneous task settings.
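
The dominance relation that the surrogate is trained to predict can be stated in a few lines. The sketch below computes it directly from known objective values, which is exactly the costly step SiamNAS tries to avoid, so it illustrates only the target of the prediction. The objective tuples (test error, parameter count in millions) are hypothetical, and all objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(candidates):
    """Return the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical architectures as (test error, parameters in millions).
architectures = [(0.06, 1.2), (0.08, 0.4), (0.07, 1.5), (0.10, 0.3)]
pareto_front = non_dominated(architectures)
```

Here the third architecture is worse than the first on both objectives, so it drops out; the remaining three trade error against size and all stay in the non-dominated set.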

[302] Interaction Field Matching: Overcoming Limitations of Electrostatic Models

Stepan I. Manukhov,Alexander Kolesov,Vladimir V. Palyulin,Alexander Korotin

Main category: cs.LG

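TL;DR: The paper proposes Interaction Field Matching (IFM), which generalizes the Electrostatic Field Matching (EFM) paradigm by introducing more general interaction fields, resolving the difficulty of modeling electrostatic fields in EFM.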

Details

Motivation: EFM is a novel approach to data generation and transfer, but it is hard to implement because it requires modeling complex electrostatic fields. Method: Proposes IFM, which replaces the electrostatic field with more general interaction fields, and designs a realization inspired by the strong interaction. Result: Demonstrates the performance of IFM on toy and image data transfer problems. Conclusion: IFM resolves the problems of EFM, broadens its applicability, and offers a new direction for data generation and transfer. Abstract: Electrostatic field matching (EFM) has recently appeared as a novel physics-inspired paradigm for data generation and transfer using the idea of an electric capacitor. However, it requires modeling electrostatic fields using neural networks, which is non-trivial because of the necessity to take into account the complex field outside the capacitor plates. In this paper, we propose Interaction Field Matching (IFM), a generalization of EFM which allows using general interaction fields beyond the electrostatic one. Furthermore, inspired by strong interactions between quarks and antiquarks in physics, we design a particular interaction field realization which solves the problems which arise when modeling electrostatic fields in EFM. We show the performance on a series of toy and image data transfer problems.

cs.CR

[303] BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage

Kalyan Nakka,Nitesh Saxena

Main category: cs.CR

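TL;DR: The paper proposes BitBypass, a novel black-box jailbreak attack that uses hyphen-separated bitstream camouflage to bypass the safety alignment of Large Language Models (LLMs).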

Details

Motivation: Existing safety alignment methods for LLMs (such as supervised fine-tuning and reinforcement learning from human feedback) remain vulnerable to adversarial attacks, so new attack strategies need to be explored to expose these weaknesses. Method: Develops the BitBypass attack, which bypasses LLM safety alignment with bitstream camouflage rather than conventional prompt engineering or adversarial manipulation. Result: Tested on five state-of-the-art LLMs (GPT-4o, Gemini 1.5, and others), BitBypass bypasses safety alignment and elicits harmful content, and outperforms existing attacks in stealthiness and attack success. Conclusion: BitBypass exposes potential weaknesses in LLM safety alignment and points to new directions for defense research. Abstract: The inherent risk of generating harmful and unsafe content by Large Language Models (LLMs) has highlighted the need for their safety alignment. Various techniques like supervised fine-tuning, reinforcement learning from human feedback, and red-teaming were developed for ensuring the safety alignment of LLMs. However, the robustness of these aligned LLMs is always challenged by adversarial attacks that exploit unexplored and underlying vulnerabilities of the safety alignment. In this paper, we develop a novel black-box jailbreak attack, called BitBypass, that leverages hyphen-separated bitstream camouflage for jailbreaking aligned LLMs. This represents a new direction in jailbreaking by exploiting fundamental information representation of data as continuous bits, rather than leveraging prompt engineering or adversarial manipulations. Our evaluation of five state-of-the-art LLMs, namely GPT-4o, Gemini 1.5, Claude 3.5, Llama 3.1, and Mixtral, from an adversarial perspective, revealed the capabilities of BitBypass in bypassing their safety alignment and tricking them into generating harmful and unsafe content. Further, we observed that BitBypass outperforms several state-of-the-art jailbreak attacks in terms of stealthiness and attack success. Overall, these results highlight the effectiveness and efficiency of BitBypass in jailbreaking these state-of-the-art LLMs.

cs.AI

Jiongnan Liu,Zhicheng Dou,Ning Hu,Chenyan Xiong

Main category: cs.AI

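TL;DR: The paper proposes a new paradigm beyond traditional recommender systems: using multimodal generative models to directly generate personalized content (such as images) for users rather than only filtering existing content.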

Details

Motivation: Traditional recommender systems are limited to filtering existing content and cannot generate novel concepts, so they cannot fully satisfy user demands and preferences. Method: Uses Large Multimodal Models (LMMs), trained with supervised fine-tuning and an online reinforcement learning strategy, to generate personalized content. Result: Two benchmark datasets and a user study confirm the method's effectiveness; the generated images match users' historical preferences and are also relevant to their potential interests. Conclusion: The method opens new possibilities for recommender systems and can satisfy user needs more fully. Abstract: To address the challenge of information overload from massive web contents, recommender systems are widely applied to retrieve and present personalized results for users. However, recommendation tasks are inherently constrained to filtering existing items and lack the ability to generate novel concepts, limiting their capacity to fully satisfy user demands and preferences. In this paper, we propose a new paradigm that goes beyond content filtering and selecting: directly generating personalized items in a multimodal form, such as images, tailored to individual users. To accomplish this, we leverage any-to-any Large Multimodal Models (LMMs) and train them with both supervised fine-tuning and online reinforcement learning strategies to equip them with the ability to yield tailored next items for users. Experiments on two benchmark datasets and a user study confirm the efficacy of the proposed method. Notably, the generated images not only align well with users' historical preferences but also exhibit relevance to their potential future interests.

[305] ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

Tianyu Hua,Harper Hua,Violet Xiang,Benjamin Klieger,Sang T. Truong,Weixin Liang,Fan-Yun Sun,Nick Haber

Main category: cs.AI

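TL;DR: ResearchCodeBench evaluates the ability of large language models (LLMs) to turn novel ideas from cutting-edge research papers into executable code, and finds that the best model succeeds less than 40% of the time.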

Details

Motivation: Examine how well LLMs implement novel ideas from research papers they have not seen, filling a gap in existing research. Method: Introduces ResearchCodeBench, a benchmark of 212 coding challenges, and evaluates the code generation ability of 30+ proprietary and open-source LLMs. Result: The best model, Gemini-2.5-Pro-Preview, reaches a 37.3% success rate; other models perform worse. Conclusion: ResearchCodeBench provides an evaluation platform for the continued improvement of LLMs in code generation. Abstract: Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement novel ideas from recent research papers (ideas unseen during pretraining) remains unclear. We introduce ResearchCodeBench, a benchmark of 212 coding challenges that evaluates LLMs' ability to translate cutting-edge ML contributions from top 2024-2025 research papers into executable code. We assessed 30+ proprietary and open-source LLMs, finding that even the best models correctly implement less than 40% of the code. We find Gemini-2.5-Pro-Preview to perform best at 37.3% success rate, with O3 (High) and O4-mini (High) following behind at 32.3% and 30.8% respectively. We present empirical findings on performance comparison, contamination, and error patterns. By providing a rigorous and community-driven evaluation platform, ResearchCodeBench enables continuous understanding and advancement of LLM-driven innovation in research code generation.

[306] Benchmarking and Advancing Large Language Models for Local Life Services

Xiaochong Lan,Jie Feng,Jiahuan Lei,Xinlei Shi,Yong Li

Main category: cs.AI

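TL;DR: The paper studies the potential of large language models (LLMs) in local life services. By establishing a benchmark and evaluating performance, it finds that a 7B model can rival a 72B model, improving deployment feasibility.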

Details

Motivation: Explore the potential of LLMs in local life services while balancing model capability against inference cost. Method: Establishes a comprehensive benchmark, evaluates LLM performance, and studies model fine-tuning and agent-based workflows. Result: A 7B model reaches performance close to a 72B model, significantly improving the feasibility and efficiency of deployment. Conclusion: The optimized LLMs are more practical and accessible for local life services. Abstract: Large language models (LLMs) have exhibited remarkable capabilities and achieved significant breakthroughs across various domains, leading to their widespread adoption in recent years. Building on this progress, we investigate their potential in the realm of local life services. In this study, we establish a comprehensive benchmark and systematically evaluate the performance of diverse LLMs across a wide range of tasks relevant to local life services. To further enhance their effectiveness, we explore two key approaches: model fine-tuning and agent-based workflows. Our findings reveal that even a relatively compact 7B model can attain performance levels comparable to a much larger 72B model, effectively balancing inference cost and model capability. This optimization greatly enhances the feasibility and efficiency of deploying LLMs in real-world online services, making them more practical and accessible for local life applications.

[307] Rethinking Machine Unlearning in Image Generation Models

Renyang Liu,Wenjie Feng,Tianwei Zhang,Wei Zhou,Xueqi Cheng,See-Kiong Ng

Main category: cs.AI

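TL;DR: The paper examines the challenges of image generation model unlearning (IGMU) and proposes the task categorization framework CatIGMU, the evaluation framework EvalIGMU, and the high-quality dataset DataIGM to address the practical shortcomings of existing methods.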

Details

Motivation: With the widespread use of image generation models, data privacy and content safety have become major concerns, and machine unlearning (MU) is seen as an effective way to address them. In practice, however, IGMU still suffers from unclear task discrimination and the lack of an evaluation framework. Method: Proposes CatIGMU (a task categorization framework), EvalIGMU (an evaluation framework), and DataIGM (a dataset) to standardize IGMU tasks and evaluate existing algorithms. Result: Existing IGMU algorithms perform poorly across evaluation dimensions, especially preservation and robustness. Conclusion: The new frameworks and dataset lay a foundation for standardized, reliable evaluation of IGMU and expose the limitations of existing algorithms. Abstract: With the surge and widespread application of image generation models, data privacy and content safety have become major concerns and attracted great attention from users, service providers, and policymakers. Machine unlearning (MU) is recognized as a cost-effective and promising means to address these challenges. Despite some advancements, image generation model unlearning (IGMU) still faces remarkable gaps in practice, e.g., unclear task discrimination and unlearning guidelines, lack of an effective evaluation framework, and unreliable evaluation metrics. These can hinder the understanding of unlearning mechanisms and the design of practical unlearning algorithms. We perform exhaustive assessments over existing state-of-the-art unlearning algorithms and evaluation standards, and discover several critical flaws and challenges in IGMU tasks. Driven by these limitations, we make several core contributions to facilitate the comprehensive understanding, standardized categorization, and reliable evaluation of IGMU. Specifically, (1) We design CatIGMU, a novel hierarchical task categorization framework. It provides detailed implementation guidance for IGMU, assisting in the design of unlearning algorithms and the construction of testbeds. (2) We introduce EvalIGMU, a comprehensive evaluation framework. It includes reliable quantitative metrics across five critical aspects. (3) We construct DataIGM, a high-quality unlearning dataset, which can be used for extensive evaluations of IGMU, training content detectors for judgment, and benchmarking the state-of-the-art unlearning algorithms. With EvalIGMU and DataIGM, we discover that most existing IGMU algorithms cannot handle the unlearning well across different evaluation dimensions, especially for preservation and robustness. Code and models are available at https://github.com/ryliu68/IGMU.

[308] Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning

Chen Qian,Dongrui Liu,Haochen Wen,Zhen Bai,Yong Liu,Jing Shao

Main category: cs.AI

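TL;DR: The paper studies the reasoning trajectories of large reasoning models (LRMs) from an information-theoretic perspective, finds peaks in the mutual information (MI) between intermediate representations and the correct answer, shows that these peaks correspond to "thinking tokens", and proposes two methods to improve LRM reasoning performance.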

Details

Motivation: Large reasoning models (LRMs) excel at complex problem-solving, but their internal reasoning mechanisms remain unclear; the study aims to explain the reasoning process from an information-theoretic perspective. Method: Tracks how the mutual information (MI) between intermediate representations and the correct answer evolves, analyzes the MI peaks phenomenon, and identifies the "thinking tokens" associated with the peaks. Result: Higher MI is linked to a lower probability of prediction error, and thinking tokens are crucial to reasoning performance. Conclusion: The study sheds light on the reasoning mechanisms of LRMs and proposes practical methods that use thinking tokens to improve reasoning performance. Abstract: Large reasoning models (LRMs) have demonstrated impressive capabilities in complex problem-solving, yet their internal reasoning mechanisms remain poorly understood. In this paper, we investigate the reasoning trajectories of LRMs from an information-theoretic perspective. By tracking how mutual information (MI) between intermediate representations and the correct answer evolves during LRM reasoning, we observe an interesting MI peaks phenomenon: the MI at specific generative steps exhibits a sudden and significant increase during LRM's reasoning process. We theoretically analyze this phenomenon and show that as MI increases, the probability of model's prediction error decreases. Furthermore, these MI peaks often correspond to tokens expressing reflection or transition, such as "Hmm", "Wait", and "Therefore", which we term the thinking tokens. We then demonstrate that these thinking tokens are crucial for LRM's reasoning performance, while other tokens have minimal impact. Building on these analyses, we propose two simple yet effective methods to improve LRM's reasoning performance, by delicately leveraging these thinking tokens. Overall, our work provides novel insights into the reasoning mechanisms of LRMs and offers practical ways to improve their reasoning capabilities. The code is available at https://github.com/ChnQ/MI-Peaks.
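
As a reminder of the quantity being tracked, mutual information between two discrete variables can be estimated from paired samples with a simple plug-in formula. The paper works with high-dimensional continuous representations, which call for a dedicated estimator that the abstract does not specify, so the sketch below illustrates only the definition; the function name and the toy inputs are assumptions.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

When one variable fully determines the other the estimate equals the entropy of that variable, and when the two are independent in the sample it is zero. An MI peak in the paper's sense would be a generation step at which this kind of dependence on the correct answer jumps sharply.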

[309] Mitigating Manipulation and Enhancing Persuasion: A Reflective Multi-Agent Approach for Legal Argument Generation

Li Zhang,Kevin D. Ashley

Main category: cs.AI

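TL;DR: The paper proposes a reflective multi-agent method for generating legally compliant legal arguments, which significantly reduces hallucination and ungrounded arguments and abstains effectively when no tenable argument exists.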

Details

Motivation: Large Language Models (LLMs) used for legal argument generation hallucinate, produce ungrounded arguments, and fail to abstain when they should, so a more reliable method is needed. Method: Uses a multi-agent framework with a Factor Analyst and an Argument Polisher that iteratively refine 3-ply legal arguments (plaintiff, defendant, rebuttal). Result: Across several scenarios, the method significantly outperforms baselines in reducing hallucination, improving the use of provided facts, and abstaining successfully. Conclusion: The reflective multi-agent framework offers a trustworthy AI approach to legal argument generation and helps reduce the risk of manipulation. Abstract: Large Language Models (LLMs) are increasingly explored for legal argument generation, yet they pose significant risks of manipulation through hallucination and ungrounded persuasion, and often fail to utilize provided factual bases effectively or abstain when arguments are untenable. This paper introduces a novel reflective multi-agent method designed to address these challenges in the context of legally compliant persuasion. Our approach employs specialized agents (a Factor Analyst and an Argument Polisher) in an iterative refinement process to generate 3-ply legal arguments (plaintiff, defendant, rebuttal). We evaluate Reflective Multi-Agent against single-agent, enhanced-prompt single-agent, and non-reflective multi-agent baselines using four diverse LLMs (GPT-4o, GPT-4o-mini, Llama-4-Maverick-17b-128e, Llama-4-Scout-17b-16e) across three legal scenarios: "arguable", "mismatched", and "non-arguable". Results demonstrate Reflective Multi-Agent's significant superiority in successful abstention (preventing generation when arguments cannot be grounded), marked improvements in hallucination accuracy (reducing fabricated and misattributed factors), particularly in "non-arguable" scenarios, and enhanced factor utilization recall (improving the use of provided case facts). These findings suggest that structured reflection within a multi-agent framework offers a robust computable method for fostering ethical persuasion and mitigating manipulation in LLM-based legal argumentation systems, a critical step towards trustworthy AI in law. Project page: https://lizhang-aiandlaw.github.io/A-Reflective-Multi-Agent-Approach-for-Legal-Argument-Generation/

[310] DPO Learning with LLMs-Judge Signal for Computer Use Agents

Man Luo,David Cobbley,Xin Su,Shachar Rosenman,Vasudev Lal,Shao-Yen Tseng,Phillip Howard

Main category: cs.AI

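TL;DR: This paper presents a lightweight vision-language model for building privacy-preserving, resource-efficient computer use agents (CUA), using local execution and an automatic data-filtering framework to improve performance.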

Details

Motivation: Existing CUA rely on cloud-based inference, which raises privacy and scalability concerns, so a lightweight model that runs locally is needed. Method: Introduces an LLM-as-Judge framework that automatically evaluates and filters synthetic interaction trajectories, producing high-quality data for reinforcement learning to train a lightweight model. Result: On the OS-World benchmark, the fine-tuned local model outperforms existing baselines. Conclusion: The approach offers a viable path toward private, efficient, and generalizable GUI agents. Abstract: Computer use agents (CUA) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. CUA have made significant progress with the advent of large vision-language models (VLMs). However, these agents typically rely on cloud-based inference with substantial compute demands, raising critical privacy and scalability concerns, especially when operating on personal devices. In this work, we take a step toward privacy-preserving and resource-efficient agents by developing a lightweight vision-language model that runs entirely on local machines. To train this compact agent, we introduce an LLM-as-Judge framework that automatically evaluates and filters synthetic interaction trajectories, producing high-quality data for reinforcement learning without human annotation. Experiments on the OS-World benchmark demonstrate that our fine-tuned local model outperforms existing baselines, highlighting a promising path toward private, efficient, and generalizable GUI agents.

Table of Contents

cs.CV

[1] CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge

[2] OASIS: Online Sample Selection for Continual Visual Instruction Tuning

[3] Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing

[4] Research on Driving Scenario Technology Based on Multimodal Large Lauguage Model Optimization

[5] Object-centric Self-improving Preference Optimization for Text-to-Image Generation

[6] Are classical deep neural networks weakly adversarially robust?

[7] Fairness through Feedback: Addressing Algorithmic Misgendering in Automatic Gender Recognition

[8] Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying

[9] Dynamic-Aware Video Distillation: Optimizing Temporal Resolution Based on Video Semantics

[10] Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

[11] Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

[12] SAB3R: Semantic-Augmented Backbone in 3D Reconstruction

[13] Implicit Deformable Medical Image Registration with Learnable Kernels

[14] TIIF-Bench: How Does Your T2I Model Follow Your Instructions?

[15] Quantifying task-relevant representational similarity using decision variable correlation

[16] Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360-Degree Firefighting Videos

[17] Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment

[18] VLCD: Vision-Language Contrastive Distillation for Accurate and Efficient Automatic Placenta Analysis

[19] Motion aware video generative model

[20] PAIR-Net: Enhancing Egocentric Speaker Detection via Pretrained Audio-Visual Fusion and Alignment Loss

[21] Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction

[22] Entity Image and Mixed-Modal Image Retrieval Datasets

[23] Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation

[24] QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

[25] Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning

[26] Generalized Category Discovery via Reciprocal Learning and Class-Wise Distribution Regularization

[27] RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models

[28] InterRVOS: Interaction-aware Referring Video Object Segmentation

[29] RoadFormer : Local-Global Feature Fusion for Road Surface Classification in Autonomous Driving

[30] Auto-Labeling Data for Object Detection

[31] A TRPCA-Inspired Deep Unfolding Network for Hyperspectral Image Denoising via Thresholded t-SVD and Top-K Sparse Transformer

[32] Approximate Borderline Sampling using Granular-Ball for Classification Tasks

[33] ViTNF: Leveraging Neural Fields to Boost Vision Transformers in Generalized Category Discovery

[34] Multi-level and Multi-modal Action Anticipation

[35] RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection

[36] The Devil is in the Darkness: Diffusion-Based Nighttime Dehazing Anchored in Brightness Perception

[37] Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather

[38] Modelship Attribution: Tracing Multi-Stage Manipulations Across Generative Models

[39] Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

[40] Guiding Registration with Emergent Similarity from Pre-Trained Diffusion Models

[41] Empowering Functional Neuroimaging: A Pre-trained Generative Framework for Unified Representation of Neural Signals

[42] Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification

[43] SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

[44] VidEvent: A Large Dataset for Understanding Dynamic Evolution of Events in Videos

[45] ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model

[46] PAID: Pairwise Angular-Invariant Decomposition for Continual Test-Time Adaptation

[47] ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment

[48] Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning

[49] HRTR: A Single-stage Transformer for Fine-grained Sub-second Action Segmentation in Stroke Rehabilitation

[50] Generative Perception of Shape and Material from Differential Motion

[51] Towards Better De-raining Generalization via Rainy Characteristics Memorization and Replay

[52] Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models

[53] Co-Evidential Fusion with Information Volume for Medical Image Segmentation

[54] Towards In-the-wild 3D Plane Reconstruction from a Single Image

[55] LumosFlow: Motion-Guided Long Video Generation

[56] RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers

[57] Enhancing Monocular Height Estimation via Weak Supervision from Imperfect Labels

[58] MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection

[59] VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

[60] Probabilistic Online Event Downsampling

[61] Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025

[62] SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence

[63] Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models

[64] DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing

[65] Contrast & Compress: Learning Lightweight Embeddings for Short Trajectories

[66] BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations

[67] Hyperspectral Image Generation with Unmixing Guided Diffusion Model

[68] Application of convolutional neural networks in image super-resolution

[69] One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation

[70] High Performance Space Debris Tracking in Complex Skylight Backgrounds with a Large-Scale Dataset

[71] Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models

[72] Synthetic Iris Image Databases and Identity Leakage: Risks and Mitigation Strategies

[73] ControlMambaIR: Conditional Controls with State-Space Model for Image Restoration

[74] Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet

[75] Self-Disentanglement and Re-Composition for Cross-Domain Few-Shot Segmentation

[76] Solving Inverse Problems with FLAIR

[77] Towards Geometry Problem Solving in the Large Model Era: A Survey

[78] Large-scale Self-supervised Video Foundation Model for Intelligent Surgery