cs.CV [Total: 139]
cs.GR [Total: 12]
cs.CL [Total: 140]
cs.DL [Total: 1]
cs.RO [Total: 4]
cs.LG [Total: 17]
eess.IV [Total: 5]
cs.SE [Total: 3]
cs.CY [Total: 2]
cs.AI [Total: 5]
eess.SP [Total: 1]
q-bio.NC [Total: 1]
cs.SD [Total: 2]
stat.ML [Total: 1]
cs.IR [Total: 4]
cs.CR [Total: 1]
cs.HC [Total: 4]
cs.DB [Total: 1]
eess.AS [Total: 2]

cs.CV

[1] ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking

Sijia Chen,Yanqiu Yu,En Yu,Wenbing Tao

Main category: cs.CV

TL;DR: The paper proposes ReaMOT, a new reasoning-based multi-object tracking task, and builds the ReaMOT Challenge benchmark and the ReaTrack framework.

Details

Motivation: Existing RMOT methods perform poorly on complex language instructions that require reasoning, so a more challenging task is needed to evaluate the reasoning ability of tracking models.

Method: Build the ReaMOT Challenge benchmark and propose ReaTrack, a training-free framework based on an LVLM and SAM2.

Result: ReaTrack is effective on the ReaMOT Challenge benchmark.

Conclusion: The ReaMOT task and the ReaTrack framework give the multi-object tracking field a new challenge and a baseline solution.

Abstract: Referring Multi-object tracking (RMOT) is an important research field in computer vision. Its task form is to guide the models to track the objects that conform to the language instruction. However, the RMOT task commonly requires clear language instructions; such methods often fail to work when complex language instructions with reasoning characteristics appear. In this work, we propose a new task, called Reasoning-based Multi-Object Tracking (ReaMOT). ReaMOT is a more challenging task that requires accurate reasoning about objects that match the language instruction with reasoning characteristics and tracking the objects' trajectories. To advance the ReaMOT task and evaluate the reasoning capabilities of tracking models, we construct ReaMOT Challenge, a reasoning-based multi-object tracking benchmark built upon 12 datasets. Specifically, it comprises 1,156 language instructions with reasoning characteristics, 423,359 image-language pairs, and 869 diverse scenes, divided into three levels of reasoning difficulty. In addition, we propose a set of evaluation metrics tailored for the ReaMOT task. Furthermore, we propose ReaTrack, a training-free framework for reasoning-based multi-object tracking based on large vision-language models (LVLM) and SAM2, as a baseline for the ReaMOT task. Extensive experiments on the ReaMOT Challenge benchmark demonstrate the effectiveness of our ReaTrack framework.

[2] What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Lorenzo Baraldi,Davide Bucciarelli,Federico Betti,Marcella Cornia,Lorenzo Baraldi,Nicu Sebe,Rita Cucchiara

Main category: cs.CV

TL;DR: DICE is a new method for evaluating instruction-guided image editing models. It detects localized differences and assesses their relevance to the modification request, which markedly improves agreement with human judgment.

Details

Motivation: Existing evaluation metrics fall short in alignment with human judgment and in explainability, so a more effective evaluation method is needed.

Method: DICE consists of a difference detector and a coherence estimator, both built on an autoregressive multimodal large language model (MLLM) and trained with a combination of self-supervision, distillation, and full supervision.

Result: Experiments show that DICE effectively identifies coherent edits and correlates strongly with human judgment.

Conclusion: DICE offers a more reliable way to evaluate image edits; code, models, and data are publicly released.

Abstract: Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.

[3] RetroMotion: Retrocausal Motion Forecasting Models are Instructable

Royden Wagner,Omer Sahin Tas,Felix Hauser,Marlon Steiner,Dominik Strutz,Abhishek Vivekanandan,Carlos Fernandez,Christoph Stiller

Main category: cs.CV

TL;DR: The paper proposes a multi-task learning method for motion forecasting that includes a retrocausal flow of information. It achieves state-of-the-art results on the Waymo Interaction Prediction dataset and generalizes well to Argoverse 2.

Details

Motivation: Address the varying complexity of motion forecasts caused by scene constraints and interactive behavior.

Method: Use a transformer model to generate joint trajectory distributions by re-encoding marginal distributions followed by pairwise modeling, which introduces a retrocausal flow of information.

Result: State-of-the-art results on the Waymo Interaction Prediction dataset and good generalization to Argoverse 2; instructions can be issued through trajectory modifications.

Conclusion: The method improves forecasting performance and also supports goal-based instructions and adaptation of directional instructions to the scene context.

Abstract: Motion forecasts of road users (i.e., agents) vary in complexity as a function of scene constraints and interactive behavior. We address this with a multi-task learning method for motion forecasting that includes a retrocausal flow of information. The corresponding tasks are to forecast (1) marginal trajectory distributions for all modeled agents and (2) joint trajectory distributions for interacting agents. Using a transformer model, we generate the joint distributions by re-encoding marginal distributions followed by pairwise modeling. This incorporates a retrocausal flow of information from later points in marginal trajectories to earlier points in joint trajectories. Per trajectory point, we model positional uncertainty using compressed exponential power distributions. Notably, our method achieves state-of-the-art results in the Waymo Interaction Prediction dataset and generalizes well to the Argoverse 2 dataset. Additionally, our method provides an interface for issuing instructions through trajectory modifications. Our experiments show that regular training of motion forecasting leads to the ability to follow goal-based instructions and to adapt basic directional instructions to the scene context. Code: https://github.com/kit-mrt/future-motion

[4] MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yunlong Tang,Pinxin Liu,Mingqian Feng,Zhangyun Tan,Rui Mao,Chao Huang,Jing Bi,Yunzhong Xiao,Susan Liang,Hang Hua,Ali Vosoughi,Luchuan Song,Zeliang Zhang,Chenliang Xu

Main category: cs.CV

TL;DR: MMPerspective is the first benchmark designed specifically to evaluate how well multimodal large language models (MLLMs) understand perspective geometry, with 10 tasks and 2,711 images. MLLMs do well on surface-level perceptual tasks but show limitations in compositional reasoning and spatial consistency.

Details

Motivation: Investigate whether MLLMs genuinely understand perspective geometry, filling a gap in existing research.

Method: Design the MMPerspective benchmark, with 10 tasks and 2,711 images, and evaluate 43 state-of-the-art MLLMs.

Result: MLLMs perform well on perception tasks but poorly on reasoning and robustness; model architecture and scale affect performance.

Conclusion: MMPerspective is a valuable tool for diagnosing and improving spatial understanding in vision-language systems.

Abstract: Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/

[5] DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data

Ruqi Wu,Xinjie Wang,Liu Liu,Chunle Guo,Jiaxiong Qiu,Chongyi Li,Lichao Huang,Zhizhong Su,Ming-Ming Cheng

Main category: cs.CV

TL;DR: DIPO is a novel framework that generates articulated 3D objects controlled by a pair of images (resting state and articulated state) and outperforms existing baselines.

Details

Motivation: Single-image methods lack motion information, whereas a dual-image input allows more reliable prediction of the kinematic relationships between parts.

Method: Propose a dual-image diffusion model and a Chain-of-Thought based graph reasoner, and develop LEGO-Art, an automated dataset expansion pipeline.

Result: DIPO significantly outperforms baselines in both the resting and articulated states, and the PM-X dataset improves generalization.

Conclusion: The DIPO framework and the PM-X dataset provide an effective solution for generating complex articulated objects.

Abstract: We present DIPO, a novel framework for the controllable generation of articulated 3D objects from a pair of images: one depicting the object in a resting state and the other in an articulated state. Compared to the single-image approach, our dual-image input imposes only a modest overhead for data collection, but at the same time provides important motion information, which is a reliable guide for predicting kinematic relationships between parts. Specifically, we propose a dual-image diffusion model that captures relationships between the image pair to generate part layouts and joint parameters. In addition, we introduce a Chain-of-Thought (CoT) based graph reasoner that explicitly infers part connectivity relationships. To further improve robustness and generalization on complex articulated objects, we develop a fully automated dataset expansion pipeline, named LEGO-Art, that enriches the diversity and complexity of the PartNet-Mobility dataset. We propose PM-X, a large-scale dataset of complex articulated 3D objects, accompanied by rendered images, URDF annotations, and textual descriptions. Extensive experiments demonstrate that DIPO significantly outperforms existing baselines in both the resting state and the articulated state, while the proposed PM-X dataset further enhances generalization to diverse and structurally complex articulated objects. Our code and dataset will be released to the community upon publication.

[6] CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting

Lei Tian,Xiaomin Li,Liqian Ma,Hefei Huang,Zirui Zheng,Hao Yin,Taiqing Li,Huchuan Lu,Xu Jia

Main category: cs.CV

TL;DR: The paper proposes the CCL-LGS framework, which uses multi-view semantic cues to resolve cross-view inconsistencies in 3D semantic understanding and improve the quality of 3D Gaussian semantic fields.

Details

Motivation: Existing methods that rely on 2D priors suffer from semantic inconsistencies caused by occlusion, image blur, and viewpoint changes, which degrade 3D reconstruction quality.

Method: Use a zero-shot tracker to align SAM-generated 2D masks, extract multi-view semantic encodings with CLIP, and refine the features with a Contrastive Codebook Learning module.

Result: Experiments show that CCL-LGS outperforms existing methods, resolving semantic conflicts while preserving category discriminability.

Conclusion: Through view-consistent supervision, CCL-LGS improves the robustness and accuracy of 3D semantic understanding.

Abstract: Recent advances in 3D reconstruction techniques and vision-language models have fueled significant progress in 3D semantic understanding, a capability critical to robotics, autonomous driving, and virtual/augmented reality. However, methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies induced by occlusion, image blur, and view-dependent variations. These inconsistencies, when propagated via projection supervision, deteriorate the quality of 3D Gaussian semantic fields and introduce artifacts in the rendered outputs. To mitigate this limitation, we propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Specifically, our approach first employs a zero-shot tracker to align a set of SAM-generated 2D masks and reliably identify their corresponding categories. Next, we utilize CLIP to extract robust semantic encodings across views. Finally, our Contrastive Codebook Learning (CCL) module distills discriminative semantic features by enforcing intra-class compactness and inter-class distinctiveness. In contrast to previous methods that directly apply CLIP to imperfect masks, our framework explicitly resolves semantic conflicts while preserving category discriminability. Extensive experiments demonstrate that CCL-LGS outperforms previous state-of-the-art methods. Our project page is available at https://epsilontl.github.io/CCL-LGS/.

[7] WeatherEdit: Controllable Weather Editing with 4D Gaussian Field

Chenghao Qian,Wenjing Li,Yuhu Guo,Gustav Markkula

Main category: cs.CV

TL;DR: WeatherEdit is a new method for generating controllable weather effects in 3D scenes. It consists of background editing and particle construction, and achieves high realism through physics-based simulation.

Details

Motivation: Provide controllable and realistic adverse-weather effects for autonomous driving simulation.

Method: Edit 2D backgrounds with a diffusion model and a Temporal-View (TV) attention mechanism, and generate dynamic weather particles with a 4D Gaussian field.

Result: The diversity and controllability of the method are validated on multiple driving datasets.

Conclusion: WeatherEdit shows potential for autonomous driving simulation.

Abstract: In this work, we present WeatherEdit, a novel weather editing pipeline for generating realistic weather effects with controllable types and severity in 3D scenes. Our approach is structured into two key components: weather background editing and weather particle construction. For weather background editing, we introduce an all-in-one adapter that integrates multiple weather styles into a single pretrained diffusion model, enabling the generation of diverse weather effects in 2D image backgrounds. During inference, we design a Temporal-View (TV-) attention mechanism that follows a specific order to aggregate temporal and spatial information, ensuring consistent editing across multi-frame and multi-view images. To construct the weather particles, we first reconstruct a 3D scene using the edited images and then introduce a dynamic 4D Gaussian field to generate snowflakes, raindrops and fog in the scene. The attributes and dynamics of these particles are precisely controlled through physics-based modelling and simulation, ensuring realistic weather representation and flexible severity adjustments. Finally, we integrate the 4D Gaussian field with the 3D scene to render consistent and highly realistic weather effects. Experiments on multiple driving datasets demonstrate that WeatherEdit can generate diverse weather effects with controllable condition severity, highlighting its potential for autonomous driving simulation in adverse weather. See project page: https://jumponthemoon.github.io/w-edit

[8] ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image

Dongyu Luo,Kelin Yu,Amir-Hossein Shahidzadeh,Cornelia Fermüller,Yiannis Aloimonos

Main category: cs.CV

TL;DR: ControlTac is a two-stage controllable framework that generates realistic tactile images conditioned on a reference tactile image, contact force, and contact position, addressing the high cost of tactile data collection.

Details

Motivation: Because sensor-object interactions are localized and sensor instances are inconsistent, collecting large-scale tactile data is costly; existing approaches such as simulation and free-form tactile generation often produce unrealistic output that transfers poorly to downstream tasks.

Method: Propose ControlTac, which takes a reference tactile image, contact force, and contact position as control inputs and generates physically plausible and varied tactile images for data augmentation.

Result: Experiments on three downstream tasks show that ControlTac effectively augments tactile datasets and improves performance; real-world experiments further confirm its practical utility.

Conclusion: By controllably generating realistic tactile images, ControlTac mitigates tactile data scarcity and improves task performance.

Abstract: Vision-based tactile sensing has been widely used in perception, reconstruction, and robotic manipulation. However, collecting large-scale tactile data remains costly due to the localized nature of sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data, such as simulation and free-form tactile generation, often suffer from unrealistic output and poor transferability to downstream tasks. To address this, we propose ControlTac, a two-stage controllable framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact position. With those physical priors as control input, ControlTac generates physically plausible and varied tactile images that can be used for effective data augmentation. Through experiments on three downstream tasks, we demonstrate that ControlTac can effectively augment tactile datasets and lead to consistent gains. Our three real-world experiments further validate the practical utility of our approach. Project page: https://dongyuluo.github.io/controltac.

[9] Electrolyzers-HSI: Close-Range Multi-Scene Hyperspectral Imaging Benchmark Dataset

Elias Arbash,Ahmed Jamal Afifi,Ymane Belahsen,Margret Fuchs,Pedram Ghamisi,Paul Scheunders,Richard Gloaguen

Main category: cs.CV

TL;DR: The paper introduces Electrolyzers-HSI, a multimodal benchmark dataset intended to accelerate the classification of electrolyzer materials in support of sustainable recycling.

Details

Motivation: The global challenge of sustainable recycling calls for automated, fast, and accurate state-of-the-art material detection systems that support a circular economy and the Green Deal.

Method: The dataset contains 55 co-registered high-resolution RGB images and HSI data cubes covering the 400-2500 nm spectral range; a range of ML and DL methods is evaluated, including Vision Transformer and the Multimodal Fusion Transformer.

Result: The dataset provides over 4.2 million pixel vectors, of which 424,169 are labeled, supporting non-invasive spectral analysis and material classification.

Conclusion: The Electrolyzers-HSI dataset and codebase are openly available, supporting reproducible research and the broader adoption of smart and sustainable e-waste recycling.

Abstract: The global challenge of sustainable recycling demands automated, fast, and accurate, state-of-the-art (SOTA) material detection systems that act as a bedrock for a circular economy. Democratizing access to these cutting-edge solutions that enable real-time waste analysis is essential for scaling up recycling efforts and fostering the Green Deal. In response, we introduce Electrolyzers-HSI, a novel multimodal benchmark dataset designed to accelerate the recovery of critical raw materials through accurate electrolyzer materials classification. The dataset comprises 55 co-registered high-resolution RGB images and hyperspectral imaging (HSI) data cubes spanning the 400-2500 nm spectral range, yielding over 4.2 million pixel vectors and 424,169 labeled ones. This enables non-invasive spectral analysis of shredded electrolyzer samples, supporting quantitative and qualitative material classification and spectral properties investigation. We evaluate a suite of baseline machine learning (ML) methods alongside SOTA transformer-based deep learning (DL) architectures, including Vision Transformer, SpectralFormer, and the Multimodal Fusion Transformer, to investigate architectural bottlenecks for further efficiency optimisation when deploying transformers in material identification. We implement zero-shot detection techniques and majority voting across pixel-level predictions to establish object-level classification robustness. In adherence to the FAIR data principles, the Electrolyzers-HSI dataset and accompanying codebase are openly available at https://github.com/hifexplo/Electrolyzers-HSI and https://rodare.hzdr.de/record/3668, supporting reproducible research and facilitating the broader adoption of smart and sustainable e-waste recycling solutions.
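
Illustrative note: the abstract mentions majority voting across pixel-level predictions to obtain object-level labels. A minimal sketch of that general idea follows. The class names, the object-ID input, and the function name `object_level_labels` are hypothetical and are not taken from the Electrolyzers-HSI codebase.

```python
from collections import Counter, defaultdict


def object_level_labels(pixel_predictions, pixel_object_ids):
    """Aggregate per-pixel class predictions into one label per object.

    Each pixel votes for its predicted class; the object receives the
    class with the most votes (hypothetical helper, for illustration).
    """
    votes = defaultdict(Counter)
    for label, object_id in zip(pixel_predictions, pixel_object_ids):
        votes[object_id][label] += 1
    # most_common(1) returns the top-voted class; ties fall back to the
    # class seen first, which is an arbitrary choice made in this sketch.
    return {obj: counts.most_common(1)[0][0] for obj, counts in votes.items()}


# Hypothetical pixel predictions for two segmented objects (IDs 0 and 1).
pixel_predictions = ["mesh", "mesh", "membrane", "steel", "steel", "steel", "mesh"]
pixel_object_ids = [0, 0, 0, 1, 1, 1, 1]
labels = object_level_labels(pixel_predictions, pixel_object_ids)
print(labels)
```

Voting at the object level lets a few misclassified pixels be outvoted by their neighbours, which is presumably why it is used to assess object-level robustness.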

[10] CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic

Yuxuan Sun,Yixuan Si,Chenglu Zhu,Kai Zhang,Zhongyi Shui,Bowen Ding,Tao Lin,Lin Yang

Main category: cs.CV

TL;DR: CPathAgent is an agent-based model that mimics pathologists' diagnostic logic (such as zoom-in/out and navigation operations) to generate more detailed and interpretable diagnostic reports.

Details

Motivation: Existing methods do not emulate pathologists' diagnostic logic and cannot systematically examine pathology images from low to high magnification.

Method: Develop a multi-stage training strategy that unifies patch-level, region-level, and whole-slide capabilities, and construct PathMMU-HR², the first benchmark for huge region analysis.

Result: CPathAgent outperforms existing methods on benchmarks at all three scales and generates more detailed diagnostic reports.

Conclusion: CPathAgent points to a promising direction for the future development of computational pathology.

Abstract: Recent advances in computational pathology have led to the emergence of numerous foundation models. However, these approaches fail to replicate the diagnostic process of pathologists, as they either simply rely on general-purpose encoders with multi-instance learning for classification or directly apply multimodal models to generate reports from images. A significant limitation is their inability to emulate the diagnostic logic employed by pathologists, who systematically examine slides at low magnification for overview before progressively zooming in on suspicious regions to formulate comprehensive diagnoses. To address this gap, we introduce CPathAgent, an innovative agent-based model that mimics pathologists' reasoning processes by autonomously executing zoom-in/out and navigation operations across pathology images based on observed visual features. To achieve this, we develop a multi-stage training strategy unifying patch-level, region-level, and whole-slide capabilities within a single model, which is essential for mimicking pathologists, who require understanding and reasoning capabilities across all three scales. This approach generates substantially more detailed and interpretable diagnostic reports compared to existing methods, particularly for huge region understanding. Additionally, we construct an expert-validated PathMMU-HR², the first benchmark for huge region analysis, a critical intermediate scale between patches and whole slides, as diagnosticians typically examine several key regions rather than entire slides at once. Extensive experiments demonstrate that CPathAgent consistently outperforms existing approaches across three scales of benchmarks, validating the effectiveness of our agent-based diagnostic approach and highlighting a promising direction for the future development of computational pathology.

[11] A Feature-level Bias Evaluation Framework for Facial Expression Recognition Models

Tangzheng Lian,Oya Celiktutan

Main category: cs.CV

TL;DR: The paper proposes a feature-level bias evaluation framework for FER models that requires no demographic labels, and introduces a statistical module to ensure that reported biases are statistically significant.

Details

Motivation: Existing FER models are biased toward certain demographic groups, but the lack of demographic labels in public datasets limits the scope of bias analysis, and pseudo-labels may distort the results.

Method: Propose a feature-level bias evaluation framework that needs no demographic labels in the test set, and add a statistical module to verify the significance of the results.

Result: Experiments show that the method is more effective than existing approaches that rely on pseudo-labels, and reveal prominent age, gender, and race biases in FER models.

Conclusion: The framework offers a more reliable way to evaluate the fairness of FER models and provides guidance for selecting a fairer network architecture.

Abstract: Recent studies on fairness have shown that Facial Expression Recognition (FER) models exhibit biases toward certain visually perceived demographic groups. However, the limited availability of human-annotated demographic labels in public FER datasets has constrained the scope of such bias analysis. To overcome this limitation, some prior works have resorted to pseudo-demographic labels, which may distort bias evaluation results. Alternatively, in this paper, we propose a feature-level bias evaluation framework for evaluating demographic biases in FER models under the setting where demographic labels are unavailable in the test set. Extensive experiments demonstrate that our method more effectively evaluates demographic biases compared to existing approaches that rely on pseudo-demographic labels. Furthermore, we observe that many existing studies do not include statistical testing in their bias evaluations, raising concerns that some reported biases may not be statistically significant but rather due to randomness. To address this issue, we introduce a plug-and-play statistical module to ensure the statistical significance of bias evaluation results. A comprehensive bias analysis based on the proposed module is then conducted across three sensitive attributes (age, gender, and race), seven facial expressions, and multiple network architectures on a large-scale dataset, revealing the prominent demographic biases in FER and providing insights on selecting a fairer network architecture.
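
Illustrative note: the abstract does not say which statistical test the plug-and-play module uses. As one common way to check whether a performance gap between two groups could be due to randomness, the sketch below runs a two-sided permutation test on the difference in group means. The function name, the per-sample correctness scores, and the choice of test are all assumptions made for illustration, not the authors' module.

```python
import random
from statistics import fmean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns the fraction of random relabelings whose absolute mean gap
    is at least as large as the observed gap (with add-one smoothing).
    """
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign samples to the two groups
        gap = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if gap >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical per-sample correctness (1 = correct) for two groups.
clear_gap = permutation_p_value([1] * 20, [0] * 20)
no_gap = permutation_p_value([1, 0, 1, 0], [1, 0, 1, 0])
print(clear_gap, no_gap)
```

A small p-value suggests the gap is unlikely under random group assignment; a large one means the reported bias may simply be noise, which is the concern the abstract raises.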

[12] MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning

Wenhao Gu,Li Gu,Ching Yee Suen,Yang Wang

Main category: cs.CV

TL;DR: Proposes an efficient prompt-tuning framework for personalized handwritten text recognition; a self-supervised loss and meta-learning substantially reduce the number of updated parameters and the need for annotation.

Details

Motivation: Traditional HTR methods are not robust to diverse writing styles, and existing personalization methods require labeled data and are parameter-inefficient.

Method: Formulate personalization as prompt tuning, combining a self-supervised loss with meta-learning to optimize the prompt initialization, so that only a small number of parameters is updated.

Result: Outperforms existing methods on the RIMES and IAM benchmarks while using 20x fewer parameters.

Conclusion: The method offers an efficient solution for personalized HTR that is suited to resource-constrained scenarios.

Abstract: Recent advancements in handwritten text recognition (HTR) have enabled the effective conversion of handwritten text to digital formats. However, achieving robust recognition across diverse writing styles remains challenging. Traditional HTR methods lack writer-specific personalization at test time due to limitations in model architecture and training strategies. Existing attempts to bridge this gap, through gradient-based meta-learning, still require labeled examples and suffer from parameter-inefficient fine-tuning, leading to substantial computational and memory overhead. To overcome these challenges, we propose an efficient framework that formulates personalization as prompt tuning, incorporating an auxiliary image reconstruction task with a self-supervised loss to guide prompt adaptation with unlabeled test-time examples. To ensure that the self-supervised loss effectively minimizes text recognition error, we leverage meta-learning to learn the optimal initialization of the prompts. As a result, our method allows the model to efficiently capture unique writing styles by updating less than 1% of its parameters and eliminating the need for time-intensive annotation processes. We validate our approach on the RIMES and IAM Handwriting Database benchmarks, where it consistently outperforms previous state-of-the-art methods while using 20x fewer parameters. We believe this represents a significant advancement in personalized handwritten text recognition, paving the way for more reliable and practical deployment in resource-constrained scenarios.

[13] MultLFG: Training-free Multi-LoRA composition using Frequency-domain Guidance

Aniket Roy,Maitreya Suin,Ketul Shah,Rama Chellappa

Main category: cs.CV

TL;DR: MultLFG is a training-free framework for composing multiple LoRA adapters. It uses frequency-domain guidance for adaptive fusion and markedly improves accuracy and consistency in multi-concept generation.

Details

Motivation: Current methods struggle to merge multiple LoRA adapters effectively without training, especially in compositions involving complex visual elements.

Method: MultLFG uses a fusion strategy that adapts to the timestep and the frequency subband, selectively activating the relevant LoRA adapters.

Result: On the ComposLoRA benchmark, MultLFG substantially improves compositional fidelity and image quality and outperforms existing methods.

Conclusion: Through frequency-domain guidance, MultLFG achieves effective multi-LoRA fusion and offers a better solution for multi-concept generation.

Abstract: Low-Rank Adaptation (LoRA) has gained prominence as a computationally efficient method for fine-tuning generative models, enabling distinct visual concept synthesis with minimal overhead. However, current methods struggle to effectively merge multiple LoRA adapters without training, particularly in complex compositions involving diverse visual elements. We introduce MultLFG, a novel framework for training-free multi-LoRA composition that utilizes frequency-domain guidance to achieve adaptive fusion of multiple LoRAs. Unlike existing methods that uniformly aggregate concept-specific LoRAs, MultLFG employs a timestep and frequency subband adaptive fusion strategy, selectively activating relevant LoRAs based on content relevance at specific timesteps and frequency bands. This frequency-sensitive guidance not only improves spatial coherence but also provides finer control over multi-LoRA composition, leading to more accurate and consistent results. Experimental evaluations on the ComposLoRA benchmark reveal that MultLFG substantially enhances compositional fidelity and image quality across various styles and concept sets, outperforming state-of-the-art baselines in multi-concept generation tasks. Code will be released.

[14] Causality and "In-the-Wild" Video-Based Person Re-ID: A Survey

Md Rashidunnabi,Kailash Hambarde,Hugo Proença

Main category: cs.CV

TL;DR: This survey reviews the use of causal reasoning in video-based person re-identification (Re-ID), discusses how it can replace traditional correlation-based approaches, and proposes a new taxonomy.

Details

Motivation: Existing models rely on superficial correlations (such as clothing, background, or lighting) that generalize poorly across domains, viewpoints, and temporal variations, so causal reasoning is needed to address the problem.

Method: Review methods based on structural causal models, interventions, and counterfactual reasoning, categorized into generative disentanglement, domain-invariant modeling, and causal transformers.

Result: Introduces causal-specific robustness measures and analyzes scalability, fairness, interpretability, and privacy issues in real-world use.

Conclusion: Establishes a foundation for causal video-based person Re-ID and identifies future research directions, such as combining causal modeling with efficient architectures and self-supervised learning.

Abstract: Video-based person re-identification (Re-ID) remains brittle in real-world deployments despite impressive benchmark performance. Most existing models rely on superficial correlations such as clothing, background, or lighting that fail to generalize across domains, viewpoints, and temporal variations. This survey examines the emerging role of causal reasoning as a principled alternative to traditional correlation-based approaches in video-based Re-ID. We provide a structured and critical analysis of methods that leverage structural causal models, interventions, and counterfactual reasoning to isolate identity-specific features from confounding factors. The survey is organized around a novel taxonomy of causal Re-ID methods that spans generative disentanglement, domain-invariant modeling, and causal transformers. We review current evaluation metrics and introduce causal-specific robustness measures. In addition, we assess practical challenges of scalability, fairness, interpretability, and privacy that must be addressed for real-world adoption. Finally, we identify open problems and outline future research directions that integrate causal modeling with efficient architectures and self-supervised learning. This survey aims to establish a coherent foundation for causal video-based person Re-ID and to catalyze the next phase of research in this rapidly evolving domain.

[15] Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models

Jihoon Lee,Min Song

Main category: cs.CV

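TL;DR: RVCD (Retrieval Visual Contrastive Decoding) is a new method that suppresses object hallucination (OH) at the logit level using positive and negative images, and it substantially outperforms existing decoding-based methods.

Details

Motivation: Despite significant progress in large vision-language models, object hallucination (OH) remains a persistent challenge. Existing contrastive decoding methods address it without additional training, but leave room for improvement.

Method: RVCD performs contrastive decoding with both positive and negative images, referencing AI-generated single-concept images and adjusting the decoding process at the logit level.

Result: RVCD suppresses OH better than existing decoding-based methods.

Conclusion: RVCD offers a more effective way to suppress object hallucination without additional training.

Abstract: Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.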
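
Illustrative note: the abstract does not give RVCD's formula. The sketch below shows only the generic contrastive decoding form that such methods build on, in which next-token logits from the original input are amplified and logits from a contrasting (negative) input are subtracted. How RVCD incorporates positive images is not specified in the abstract and is left out. The vocabulary, logit values, and `alpha` are hypothetical.

```python
import math


def contrastive_logits(logits, negative_logits, alpha=1.0):
    """Generic contrastive decoding: (1 + alpha) * original - alpha * negative."""
    return [(1 + alpha) * l - alpha * n for l, n in zip(logits, negative_logits)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical three-token vocabulary and next-token logits.
vocab = ["dog", "frisbee", "grass"]
logits = [1.8, 2.0, 1.0]            # original input: "frisbee" narrowly wins
negative_logits = [0.5, 2.5, 0.9]   # contrasting input strongly favours "frisbee"

adjusted = contrastive_logits(logits, negative_logits, alpha=1.0)
probs = softmax(adjusted)
best_before = vocab[logits.index(max(logits))]
best_after = vocab[adjusted.index(max(adjusted))]
print(best_before, "->", best_after)
```

Tokens that the contrasting input promotes are penalized, so a token supported mainly by the negative signal loses its lead. No model weights change, which matches the training-free setting described in the abstract.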

[16] Total-Editing: Head Avatar with Editable Appearance, Motion, and Lighting

Yizhou Zhao,Chunjiang Liu,Haoyu Chen,Bhiksha Raj,Min Xu,Tadas Baltrusaitis,Mitch Rundle,HsiangTao Wu,Kamran Ghasedi

Main category: cs.CV

TL;DR: Total-Editing is a unified portrait editing framework that combines face reenactment and portrait relighting, using a neural radiance field decoder and a deformation field to improve editing quality and realism.

Details

Motivation: Existing face reenactment and portrait relighting methods are handled independently and lack synergy, so editing results are less natural and consistent.

Method: Design a neural radiance field decoder with intrinsic decomposition capabilities and combine it with a deformation field based on moving least squares, enabling precise control over appearance, motion, and lighting.

Result: The framework significantly improves the quality and realism of portrait editing and supports multi-source applications such as illumination transfer and portrait animation.

Conclusion: With a unified framework, Total-Editing enables more natural and flexible portrait editing and offers a new approach for the field.

Abstract: Face reenactment and portrait relighting are essential tasks in portrait editing, yet they are typically addressed independently, without much synergy. Most face reenactment methods prioritize motion control and multiview consistency, while portrait relighting focuses on adjusting shading effects. To take advantage of both geometric consistency and illumination awareness, we introduce Total-Editing, a unified portrait editing framework that enables precise control over appearance, motion, and lighting. Specifically, we design a neural radiance field decoder with intrinsic decomposition capabilities. This allows seamless integration of lighting information from portrait images or HDR environment maps into synthesized portraits. We also incorporate a moving least squares based deformation field to enhance the spatiotemporal coherence of avatar motion and shading effects. With these innovations, our unified framework significantly improves the quality and realism of portrait editing results. Further, the multi-source nature of Total-Editing supports more flexible applications, such as illumination transfer from one portrait to another, or portrait animation with customized backgrounds.

[17] Be Decisive: Noise-Induced Layouts for Multi-Subject Generation

Omer Dahary,Yehonathan Cohen,Or Patashnik,Kfir Aberman,Daniel Cohen-Or

Main category: cs.CV

TL;DR: The paper proposes a new method that predicts and refines the spatial layout induced by the initial noise, avoiding conflicts with external layouts and improving the accuracy and stability of multi-subject generation.

Details

Motivation: Existing text-to-image diffusion models suffer from subject leakage in multi-subject generation, and external layout control often conflicts with the model's prior.

Method: Use a small neural network to predict and refine the noise-induced layout, ensuring clear and consistent boundaries between subjects.

Result: Experiments show that the method improves text-image alignment and the stability of multi-subject generation compared with existing layout-guided techniques.

Conclusion: The noise-aligned strategy improves generation quality while preserving the diversity of the model.

Abstract: Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.

[18] OmniIndoor3D: Comprehensive Indoor 3D Reconstruction

Xiaobao Wei,Xiaoan Zhang,Hao Wang,Qingpo Wuwu,Ming Lu,Wenzhao Zheng,Shanghang Zhang

Main category: cs.CV

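TL;DR: OmniIndoor3D is an indoor 3D reconstruction framework based on Gaussian representations that combines RGB-D images with a lightweight MLP to optimize geometry and appearance, achieving high-quality panoptic reconstruction.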

Details

Motivation: Addresses the insufficient geometric accuracy of 3D Gaussian Splatting (3DGS) in order to meet the need for high-quality panoptic reconstruction of indoor scenes. Method: Combines RGB-D images into a coarse 3D reconstruction, uses it to initialize the 3D Gaussians, and introduces a lightweight MLP to refine geometry; proposes a densification strategy for Gaussian primitives guided by panoptic priors. Result: Achieves state-of-the-art appearance, geometry, and panoptic reconstruction on multiple datasets and supports robotic navigation. Conclusion: OmniIndoor3D bridges a critical gap in indoor 3D reconstruction; the code will be released. Abstract: We propose a novel framework for comprehensive indoor 3D reconstruction using Gaussian representations, called OmniIndoor3D. This framework enables accurate appearance, geometry, and panoptic reconstruction of diverse indoor scenes captured by a consumer-level RGB-D camera. Since 3DGS is primarily optimized for photorealistic rendering, it lacks the precise geometry critical for high-quality panoptic reconstruction. Therefore, OmniIndoor3D first combines multiple RGB-D images to create a coarse 3D reconstruction, which is then used to initialize the 3D Gaussians and guide the 3DGS training. To decouple the optimization conflict between appearance and geometry, we introduce a lightweight MLP that adjusts the geometric properties of 3D Gaussians. The introduced lightweight MLP serves as a low-pass filter for geometry reconstruction and significantly reduces noise in indoor scenes. To improve the distribution of Gaussian primitives, we propose a densification strategy guided by panoptic priors to encourage smoothness on planar surfaces. Through the joint optimization of appearance, geometry, and panoptic reconstruction, OmniIndoor3D provides comprehensive 3D indoor scene understanding, which facilitates accurate and robust robotic navigation. We perform thorough evaluations across multiple datasets, and OmniIndoor3D achieves state-of-the-art results in appearance, geometry, and panoptic reconstruction. We believe our work bridges a critical gap in indoor 3D reconstruction. The code will be released at: https://ucwxb.github.io/OmniIndoor3D/

[19] Mamba-Driven Topology Fusion for Monocular 3-D Human Pose Estimation

Zenghao Zheng,Lianping Yang,Jinshan Pan,Hegui Zhu

Main category: cs.CV

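TL;DR: Proposes a Mamba-based topology fusion framework for 3D human pose estimation that improves the Mamba model with a Bone Aware Module and graph convolutional networks, substantially reducing computational cost while improving accuracy.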

Details

Motivation: Transformers face computational challenges in 3D pose estimation because of the high complexity of self-attention; Mamba is efficient but cannot handle topological structure or local joint relationships well. Method: Proposes a Bone Aware Module that infers the direction and length of bone vectors, replaces Mamba's convolutional structure with forward and backward graph convolutional networks, and designs a Spatiotemporal Refinement Module to model spatiotemporal relationships. Result: Validated on the Human3.6M and MPI-INF-3DHP datasets, the method greatly reduces computational cost while improving accuracy; ablation studies confirm the effectiveness of each module. Conclusion: Incorporating skeletal topology into Mamba alleviates its limitations in capturing human structural relationships while preserving its efficiency. Abstract: Transformer-based methods for 3-D human pose estimation face significant computational challenges due to the quadratic growth of self-attention mechanism complexity with sequence length. Recently, the Mamba model has substantially reduced computational overhead and demonstrated outstanding performance in modeling long sequences by leveraging state space models (SSMs). However, the ability of SSMs to process sequential data is not suitable for 3-D joint sequences with topological structures, and the causal convolution structure in Mamba also lacks insight into local joint relationships. To address these issues, we propose the Mamba-Driven Topology Fusion framework in this paper. Specifically, the proposed Bone Aware Module infers the direction and length of bone vectors in the spherical coordinate system, providing effective topological guidance for the Mamba model in processing joint sequences. Furthermore, we enhance the convolutional structure within the Mamba model by integrating forward and backward graph convolutional networks, enabling it to better capture local joint dependencies. Finally, we design a Spatiotemporal Refinement Module to model both temporal and spatial relationships within the sequence. Through the incorporation of skeletal topology, our approach effectively alleviates Mamba's limitations in capturing human structural relationships. We conduct extensive experiments on the Human3.6M and MPI-INF-3DHP datasets for testing and comparison, and the results show that the proposed method greatly reduces computational cost while achieving higher accuracy. Ablation studies further demonstrate the effectiveness of each proposed module. The code and models will be released.
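
The abstract says the Bone Aware Module represents bone vectors by direction and length in a spherical coordinate system. The sketch below only illustrates that coordinate conversion for a single bone; it is not the authors' module. The function name, the choice of the +z axis as the polar axis, and the joint positions are assumptions made for illustration.

```python
import math


def bone_to_spherical(parent, child):
    """Return (length, polar_angle, azimuth) of the bone vector child - parent.

    Assumed convention: polar angle is measured from the +z axis,
    azimuth is measured in the x-y plane from the +x axis.
    """
    dx, dy, dz = (c - p for p, c in zip(parent, child))
    length = math.sqrt(dx * dx + dy * dy + dz * dz)
    if length == 0.0:
        return 0.0, 0.0, 0.0  # degenerate bone: direction is undefined
    # Clamp to [-1, 1] to guard against rounding error before acos.
    polar = math.acos(max(-1.0, min(1.0, dz / length)))
    azimuth = math.atan2(dy, dx)
    return length, polar, azimuth


# Hypothetical joint positions in metres: hip at the origin, knee offset from it.
length, polar, azimuth = bone_to_spherical((0.0, 0.0, 0.0), (0.0, 0.3, -0.4))
```

Separating length from direction in this way means a bone's length, which is constant for a given person, is held in one scalar while pose changes only affect the two angles.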

[20] Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Peter Robicheaux,Matvei Popov,Anish Madan,Isaac Robinson,Joseph Nelson,Deva Ramanan,Neehar Peri

Main category: cs.CV

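TL;DR: The paper proposes aligning vision-language models (VLMs) to new concepts with a few visual examples and rich textual descriptions, to address their poor generalization to out-of-distribution tasks and modalities, and introduces the Roboflow100-VL dataset for evaluation.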

Details

Motivation: Existing vision-language models perform well on common objects but generalize poorly to out-of-distribution classes, tasks, and imaging modalities, so a more effective way to align them to new concepts is needed. Method: Introduces Roboflow100-VL, a collection of 100 multi-modal object detection datasets, and evaluates models in zero-shot, few-shot, semi-supervised, and fully-supervised settings. Result: Models such as GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets, indicating the need for few-shot concept alignment. Conclusion: Aligning VLMs with a few visual examples and textual descriptions is an effective way to improve their performance on out-of-distribution tasks. Abstract: Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Our code and dataset are available at https://github.com/roboflow/rf100-vl/ and https://universe.roboflow.com/rf100-vl/

[21] Intelligent Incident Hypertension Prediction in Obstructive Sleep Apnea

Omid Halimi Milani,Ahmet Enis Cetin,Bharati Prasad

Main category: cs.CV

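TL;DR: The study proposes a deep learning method that combines the Discrete Cosine Transform (DCT) with transfer learning to predict the risk that patients with obstructive sleep apnea (OSA) develop hypertension within five years.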

Details

Motivation: OSA is a significant risk factor for hypertension, but accurately predicting which patients will develop it remains challenging. The study aims to improve prediction by integrating polysomnography signals with frequency-domain feature extraction. Method: Extracts features from polysomnography signals, converts them to a frequency-domain representation with a DCT layer, and applies transfer learning with pre-trained 2D neural networks such as MobileNet, EfficientNet, and ResNet. Result: The best AUC, obtained with EfficientNet, is 72.88%, supporting the effectiveness of frequency-domain feature extraction and transfer learning. Conclusion: The method offers a new approach to hypertension risk prediction in OSA patients and is especially suited to limited medical datasets. Abstract: Obstructive sleep apnea (OSA) is a significant risk factor for hypertension, primarily due to intermittent hypoxia and sleep fragmentation. Predicting whether individuals with OSA will develop hypertension within five years remains a complex challenge. This study introduces a novel deep learning approach that integrates Discrete Cosine Transform (DCT)-based transfer learning to enhance prediction accuracy. We are the first to incorporate all polysomnography signals together for hypertension prediction, leveraging their collective information to improve model performance. Features were extracted from these signals and transformed into a 2D representation to utilize pre-trained 2D neural networks such as MobileNet, EfficientNet, and ResNet variants. To further improve feature learning, we introduced a DCT layer, which transforms input features into a frequency-based representation, preserving essential spectral information, decorrelating features, and enhancing robustness to noise. This frequency-domain approach, coupled with transfer learning, is especially beneficial for limited medical datasets, as it leverages rich representations from pre-trained networks to improve generalization. By strategically placing the DCT layer at deeper truncation depths within EfficientNet, our model achieved a best area under the curve (AUC) of 72.88%, demonstrating the effectiveness of frequency-domain feature extraction and transfer learning in predicting hypertension risk in OSA patients over a five-year period.
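
To make the "frequency-based representation" concrete, here is a minimal sketch of a one-dimensional DCT-II written with the standard library only. The abstract does not say which DCT variant or normalization the paper's layer uses, so the unnormalized DCT-II and the toy inputs here are assumptions, and this is not the authors' layer.

```python
import math


def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_samples = len(x)
    return [
        sum(
            x[n] * math.cos(math.pi / n_samples * (n + 0.5) * k)
            for n in range(n_samples)
        )
        for k in range(n_samples)
    ]


# A constant input puts all of its energy in the k = 0 (DC) coefficient.
flat = dct2([1.0, 1.0, 1.0, 1.0])
# A slowly varying input concentrates most of its energy in low-k coefficients.
ramp = dct2([0.0, 1.0, 2.0, 3.0])
```

This energy compaction into a few low-frequency coefficients is the usual reason a DCT is described as decorrelating features.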

[22] OccLE: Label-Efficient 3D Semantic Occupancy Prediction

Naiyu Fang,Zheyuan Zhou,Fayao Liu,Xulei Yang,Jiacheng Wei,Lemiao Qiu,Guosheng Lin

Main category: cs.CV

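TL;DR: OccLE is a label-efficient 3D semantic occupancy prediction method that decouples the semantic and geometric learning tasks and fuses their feature grids, maintaining high performance with only 10% of voxel annotations.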

Details

Motivation: Existing methods rely either on costly full supervision or on self-supervision with limited performance; OccLE aims to address this in a label-efficient way. Method: Decouples semantic and geometric learning, uses a 2D foundation model to generate pseudo labels, strengthens geometry learning with semi-supervision, and fuses the feature grids for the final prediction. Result: With only 10% of voxel annotations, OccLE reaches an mIoU of 16.59% on the SemanticKITTI validation set. Conclusion: OccLE shows the potential of high-performance 3D semantic occupancy prediction under limited annotation. Abstract: 3D semantic occupancy prediction offers an intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches either rely on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a label-efficient 3D semantic occupancy prediction method that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our intuition is to decouple the semantic and geometric learning tasks and then fuse the learned feature grids from both tasks for the final semantic occupancy prediction. Therefore, the semantic branch distills a 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs in cross-plane synergy based on their inherency, employing semi-supervision to enhance geometry learning. We fuse semantic-geometric feature grids through Dual Mamba and incorporate a scatter-accumulated projection to supervise unannotated prediction with aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations, reaching an mIoU of 16.59% on the SemanticKITTI validation set.
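
The reported metric, mIoU, is the per-class intersection-over-union averaged over classes. The sketch below computes it over flat lists of voxel labels. It is a generic illustration: the benchmark's exact protocol (for example, which classes are ignored) is not described in the abstract, and skipping classes absent from both prediction and ground truth is an assumption made here.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: skip classes absent from both inputs
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with four voxels and two classes (values are illustrative only).
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.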

[23] ConsiStyle: Style Diversity in Training-Free Consistent T2I Generation

Yohai Mazuz,Janna Bruner,Lior Wolf

Main category: cs.CV

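TL;DR: The paper proposes a training-free method that manipulates attention matrices to achieve both text alignment and subject consistency, resolving the trade-off between style and subject consistency in existing methods.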

Details

Motivation: Existing text-to-image methods face a tension between subject consistency and style diversity: they usually require large-scale fine-tuning or per-subject optimization, while training-free methods often fail to keep the subject consistent. Method: Manipulates the attention matrices so that Queries and Keys come from the anchor image(s) and Values come from a non-anchored copy, adds cross-image components to self-attention, and aligns the statistics of the Value matrices. Result: Experiments show that the method decouples style from subject appearance and generates text-aligned, subject-consistent images across diverse styles. Conclusion: The method achieves style alignment and subject consistency without training, offering an efficient solution for text-to-image generation. Abstract: In text-to-image models, consistent character generation is the task of achieving text alignment while maintaining the subject's appearance across different prompts. However, since style and appearance are often entangled, the existing methods struggle to preserve consistent subject characteristics while adhering to varying style prompts. Current approaches for consistent text-to-image generation typically rely on large-scale fine-tuning on curated image sets or per-subject optimization, which either fail to generalize across prompts or do not align well with textual descriptions. Meanwhile, training-free methods often fail to maintain subject consistency across different styles. In this work, we introduce a training-free method that achieves both style alignment and subject consistency. The attention matrices are manipulated such that Queries and Keys are obtained from the anchor image(s) that are used to define the subject, while the Values are imported from a parallel copy that is not subject-anchored. Additionally, cross-image components are added to the self-attention mechanism by expanding the Key and Value matrices. To do so without shifting from the target style, we align the statistics of the Value matrices. As is demonstrated in a comprehensive battery of qualitative and quantitative experiments, our method effectively decouples style from subject appearance and enables faithful generation of text-aligned images with consistent characters across diverse styles.
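
The abstract states that the statistics of the Value matrices are aligned but does not say which statistics. One common choice, assumed here purely for illustration, is to match the mean and standard deviation of one channel to those of a target channel. The function name and the toy values are hypothetical, and this should not be read as the paper's exact procedure.

```python
import math


def align_statistics(source, target, eps=1e-8):
    """Shift and scale `source` so its mean and std match those of `target`.

    Assumption: first- and second-order statistics of a single channel,
    using the population standard deviation.
    """
    def mean_std(values):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return mean, math.sqrt(var)

    src_mean, src_std = mean_std(source)
    tgt_mean, tgt_std = mean_std(target)
    return [(v - src_mean) / (src_std + eps) * tgt_std + tgt_mean for v in source]


# Toy channel values: the aligned output keeps the ordering of `source`
# but takes on the mean and spread of `target`.
aligned = align_statistics([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```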

[24] Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training

Bolin Lai,Sangmin Lee,Xu Cao,Xiang Li,James M. Rehg

Main category: cs.CV

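TL;DR: The paper proposes FlexTI2V, a training-free method that generates videos under flexible image conditioning and outperforms previous training-free image conditioning methods.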

Details

Motivation: Addresses the resource cost and restricted conditioning settings of existing methods to enable more flexible video generation. Method: Inverts condition images to noisy latent representations and uses a random patch swapping strategy, together with a dynamic control mechanism that adjusts conditioning strength. Result: Experiments show that FlexTI2V clearly outperforms other training-free image conditioning methods. Conclusion: FlexTI2V provides an efficient and flexible new approach to video generation. Abstract: Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few predefined conditioning settings. To tackle this issue, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. We also provide further insights into our method through a detailed ablation study and analysis.
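
The random patch swapping step can be pictured as replacing a random subset of a frame's latent patches with the corresponding patches of the inverted condition image. The sketch below is a simplified illustration under several assumptions that are not in the abstract: patches are items in a flat list, a single `swap_ratio` controls conditioning strength, and swaps are position-aligned. The function and parameter names are hypothetical.

```python
import random


def swap_patches(frame_patches, cond_patches, swap_ratio, seed=0):
    """Replace a random fraction of frame patches with condition patches.

    `swap_ratio` in [0, 1] is a hypothetical knob for conditioning strength.
    """
    if len(frame_patches) != len(cond_patches):
        raise ValueError("frame and condition must have the same patch count")
    if not 0.0 <= swap_ratio <= 1.0:
        raise ValueError("swap_ratio must lie in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducibility
    n_swap = round(swap_ratio * len(frame_patches))
    chosen = set(rng.sample(range(len(frame_patches)), n_swap))
    return [
        cond_patches[i] if i in chosen else frame_patches[i]
        for i in range(len(frame_patches))
    ]


frame = ["f0", "f1", "f2", "f3"]   # placeholder patch contents
cond = ["c0", "c1", "c2", "c3"]
mixed = swap_patches(frame, cond, swap_ratio=0.5)
```

A per-frame strength schedule, as the abstract's dynamic control mechanism suggests, could be expressed by passing a different `swap_ratio` for each frame; how the paper sets that schedule is not stated in the abstract.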

[25] TrustSkin: A Fairness Pipeline for Trustworthy Facial Affect Analysis Across Skin Tone

Ana M. Cabanas,Alma Pedro,Domingo Mery

Main category: cs.CV

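TL;DR: The study compares how two skin tone classification methods (ITA and an $H^*$-$L^*$-based method) affect fairness assessment in facial affect analysis systems, finding that dark skin tones are severely underrepresented and that ITA is limited by its sensitivity to lighting.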

Details

Motivation: Examines how the choice of skin tone classification method affects fairness assessment, particularly in facial affect analysis systems. Method: Uses the AffectNet dataset and a MobileNet-based model to compare the ITA and $H^*$-$L^*$ methods, and analyzes model attention patterns with Grad-CAM. Result: Dark skin tones make up only about 2% of samples; fairness metrics (F1-score and TPR) differ across skin tone groups by up to 0.08 and 0.11 respectively; the $H^*$-$L^*$ method gives more consistent subgrouping. Conclusion: The choice of skin tone measurement is critical to fairness assessment, and ITA-based evaluations may overlook disparities affecting darker-skinned individuals. The study also proposes a modular fairness evaluation pipeline. Abstract: Understanding how facial affect analysis (FAA) systems perform across different demographic groups requires reliable measurement of sensitive attributes such as ancestry, often approximated by skin tone, which itself is highly influenced by lighting conditions. This study compares two objective skin tone classification methods: the widely used Individual Typology Angle (ITA) and a perceptually grounded alternative based on Lightness ($L^*$) and Hue ($H^*$). Using AffectNet and a MobileNet-based model, we assess fairness across skin tone groups defined by each method. Results reveal a severe underrepresentation of dark skin tones ($\sim 2 \%$), alongside fairness disparities in F1-score (up to 0.08) and TPR (up to 0.11) across groups. While ITA shows limitations due to its sensitivity to lighting, the $H^*$-$L^*$ method yields more consistent subgrouping and enables clearer diagnostics through metrics such as Equal Opportunity. Grad-CAM analysis further highlights differences in model attention patterns by skin tone, suggesting variation in feature encoding. To support future mitigation efforts, we also propose a modular fairness-aware pipeline that integrates perceptual skin tone estimation, model interpretability, and fairness evaluation. These findings emphasize the relevance of skin tone measurement choices in fairness assessment and suggest that ITA-based evaluations may overlook disparities affecting darker-skinned individuals.
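
For readers unfamiliar with ITA, it is conventionally computed from CIELAB lightness $L^*$ and the yellow-blue component $b^*$ as $\arctan((L^* - 50)/b^*) \cdot 180/\pi$. The sketch below applies that standard formula to made-up values to show why a lighting-induced shift in $L^*$ alone moves the angle. The sample values are hypothetical, and the paper's own preprocessing and group thresholds are not reproduced here.

```python
import math


def individual_typology_angle(l_star, b_star):
    """ITA in degrees: arctan((L* - 50) / b*) * 180 / pi.

    atan2 is used so that b* = 0 does not divide by zero; for b* > 0
    (typical for skin) it gives the same result as the arctan form.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))


# Hypothetical example: the same skin patch measured under brighter and
# dimmer lighting, modelled here as a change in L* with b* held fixed.
ita_bright = individual_typology_angle(l_star=65.0, b_star=15.0)
ita_dim = individual_typology_angle(l_star=55.0, b_star=15.0)
```

In this made-up case a 10-point drop in $L^*$ changes the angle from 45 degrees to roughly 18 degrees, which is the kind of lighting sensitivity the abstract describes.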

[26] Open-Det: An Efficient Learning Framework for Open-Ended Detection

Guiping Cao,Tao Wang,Wenjian Huang,Xiangyuan Lan,Jianguo Zhang,Dongmei Jiang

Main category: cs.CV

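TL;DR: Open-Det restructures the object detector and the object name generator and adds a vision-language alignment mechanism, significantly improving the efficiency and performance of open-ended object detection.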

Details

Motivation: Existing open-ended detection models such as GenerateU require large-scale training data, converge slowly, and show limited performance. Method: Proposes the Open-Det framework, comprising a reconstructed object detector and object name generator, a vision-language alignment mechanism, a prompts distiller, and a joint loss. Result: With only 1.5% of the training data and 20.8% of the training epochs, performance improves by 1.0% (APr). Conclusion: Open-Det outperforms existing methods in both efficiency and performance, offering an efficient solution for open-ended object detection. Abstract: Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, the existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation process by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, incorporating a Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100), achieves even higher performance (+1.0% in APr). The source codes are available at: https://github.com/Med-Process/Open-Det.
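
The two efficiency percentages follow directly from the raw figures quoted in the abstract; the short check below simply redoes that arithmetic. Only the four numbers from the abstract are used, and the variable names are illustrative.

```python
# Figures quoted in the abstract (data sizes in millions of samples).
open_det_data, generateu_data = 0.077, 5.077
open_det_epochs, generateu_epochs = 31, 149

data_ratio = open_det_data / generateu_data
epoch_ratio = open_det_epochs / generateu_epochs

print(f"training data:   {data_ratio:.1%}")
print(f"training epochs: {epoch_ratio:.1%}")
```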

[27] IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

Yifan Li,Yuhang Chen,Anh Dao,Lichi Li,Zhongyi Cai,Zhen Tan,Tianlong Chen,Yu Kong

Main category: cs.CV

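TL;DR: IndustryEQA is the first EQA benchmark focused on safety-critical industrial scenarios; it fills the gap that existing benchmarks leave in industrial environments and provides high-fidelity videos and rich annotations.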

Details

Motivation: Existing EQA benchmarks focus mainly on household environments and overlook the safety aspects and reasoning processes of industrial settings, which limits the evaluation of agents for real industrial applications. Method: Builds IndustryEQA on the NVIDIA Isaac Sim platform, with high-fidelity videos, diverse industrial assets, dynamic human agents, and hazardous situations, and provides annotations in six categories plus additional reasoning evaluation. Result: The benchmark contains 1,344 question-answer pairs (971 from small warehouses and 373 from large ones) and proposes a comprehensive evaluation framework for perception and reasoning abilities. Conclusion: IndustryEQA aims to steer EQA research toward more robust, safe, and practical agents for industrial scenarios. Abstract: Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this gap, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. It also provides additional reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouses and 373 pairs from large ones, incorporating scenarios with and without humans. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments. Benchmark and codes are available.

[28] See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction

Yuan Wu,Zhiqiang Yan,Yigong Zhang,Xiang Li,ian Yang

Main category: cs.CV

TL;DR: LIAR is a new framework that learns illumination-affined representations to address 3D occupancy prediction in nighttime scenes.

Details

Motivation: Existing vision-based methods perform well in daytime but degrade at night because of poor lighting conditions. Method: LIAR introduces Selective Low-light Image Enhancement (SLLIE) and two illumination-aware components, 2D-IGS and 3D-IDP, which handle local underexposure and overexposure respectively. Result: Experiments show that LIAR performs strongly in nighttime scenarios. Conclusion: LIAR provides an effective solution for nighttime 3D occupancy prediction. Abstract: Occupancy prediction aims to estimate the 3D spatial distribution of occupied regions along with their corresponding semantic labels. Existing vision-based methods perform well on daytime benchmarks but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions. To address these challenges, we propose LIAR, a novel framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which leverages the illumination priors from daytime scenes to adaptively determine whether a nighttime image is genuinely dark or sufficiently well-lit, enabling more targeted global enhancement. Building on the illumination maps generated by SLLIE, LIAR further incorporates two illumination-aware components: 2D Illumination-guided Sampling (2D-IGS) and 3D Illumination-driven Projection (3D-IDP), to respectively tackle local underexposure and overexposure. Specifically, 2D-IGS modulates feature sampling positions according to illumination maps, assigning larger offsets to darker regions and smaller ones to brighter regions, thereby alleviating feature degradation in underexposed areas. Subsequently, 3D-IDP enhances semantic understanding in overexposed regions by constructing illumination intensity fields and supplying refined residual queries to the BEV context refinement process. Extensive experiments on both real and synthetic datasets demonstrate the superior performance of LIAR under challenging nighttime scenarios. The source code and pretrained models are available at https://github.com/yanzq95/LIAR.

[29] HCQA-1.5 @ Ego4D EgoSchema Challenge 2025

Haoyu Zhang,Yisen Feng,Qiaohui Chu,Meng Liu,Weili Guan,Yaowei Wang,Liqiang Nie

Main category: cs.CV

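TL;DR: The paper proposes an improved HCQA framework that uses a multi-source aggregation strategy and a confidence-based filtering mechanism to make video question answering more reliable, placing third in the EgoSchema challenge.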

Details

Motivation: To improve the reliability of answer prediction in egocentric video question answering. Method: Extends the HCQA framework with a multi-source aggregation strategy that generates diverse predictions, a confidence-based filtering mechanism that selects high-confidence answers directly, and a fine-grained reasoning module that refines low-confidence cases. Result: Achieves 77% accuracy on the EgoSchema blind test set, outperforming last year's winning solution and the majority of participating teams. Conclusion: The method makes video question answering noticeably more reliable; the code is to be released. Abstract: In this report, we present the method that achieves third place in the Ego4D EgoSchema Challenge at CVPR 2025. To improve the reliability of answer prediction in egocentric video question answering, we propose an effective extension to the previously proposed HCQA framework. Our approach introduces a multi-source aggregation strategy to generate diverse predictions, followed by a confidence-based filtering mechanism that selects high-confidence answers directly. For low-confidence cases, we incorporate a fine-grained reasoning module that performs additional visual and contextual analysis to refine the predictions. Evaluated on the EgoSchema blind test set, our method achieves 77% accuracy on over 5,000 human-curated multiple-choice questions, outperforming last year's winning solution and the majority of participating teams. Our code will be added at https://github.com/Hyu-Zhang/HCQA.
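
The aggregate-then-filter control flow described in the abstract can be sketched as follows. This is a generic illustration, not the HCQA-1.5 implementation: the report does not say how confidence is computed, so vote share across sources is an assumption, and `threshold` and the `refine` callback (standing in for the fine-grained reasoning module) are hypothetical names.

```python
from collections import Counter


def aggregate_answers(predictions, threshold, refine):
    """Pick the majority answer if its vote share is high enough.

    Otherwise defer to `refine`, a caller-supplied fallback that stands in
    for a more expensive reasoning step. Returns (answer, confidence).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    answer, votes = Counter(predictions).most_common(1)[0]
    confidence = votes / len(predictions)  # assumed confidence: vote share
    if confidence >= threshold:
        return answer, confidence
    return refine(predictions), confidence


# Hypothetical usage: three sources answer a multiple-choice question.
direct = aggregate_answers(["A", "A", "B"], threshold=0.6, refine=lambda p: "?")
deferred = aggregate_answers(["A", "B", "C"], threshold=0.6, refine=lambda p: "C")
```

The design point is that the costly refinement path only runs when the sources disagree, so agreement among sources is answered directly.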

[30] Scan-and-Print: Patch-level Data Summarization and Augmentation for Content-aware Layout Generation in Poster Design

HsiaoYuan Hsu,Yuxin Peng

Main category: cs.CV

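TL;DR: The paper proposes Scan-and-Print, a patch-level data summarization and augmentation method for content-aware layout generation in AI poster design, which substantially reduces the computational bottleneck while improving generation quality.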

Details

Motivation: Existing methods have too many parameters, which limits real-time performance and generalization. Method: Uses two steps, Scan (selecting the patches suitable for placing elements) and Print (synthesizing new samples across image-layout pairs), together with a vertex-based layout representation. Result: Experiments show that the method reduces the computational bottleneck by 95.2% while maintaining high-quality layout generation. Conclusion: Scan-and-Print is an efficient, high-quality method for content-aware layout generation. Abstract: In AI-empowered poster design, content-aware layout generation is crucial for the on-image arrangement of visual-textual elements, e.g., logo, text, and underlay. To perceive the background images, existing work demanded a high parameter count that far exceeds the size of available training data, which has impeded the model's real-time performance and generalization ability. To address these challenges, we proposed a patch-level data summarization and augmentation approach, vividly named Scan-and-Print. Specifically, the scan procedure selects only the patches suitable for placing element vertices to perform fine-grained perception efficiently. Then, the print procedure mixes up the patches and vertices across two image-layout pairs to synthesize over 100% new samples in each epoch while preserving their plausibility. Besides, to facilitate the vertex-level operations, a vertex-based layout representation is introduced. Extensive experimental results on widely used benchmarks demonstrated that Scan-and-Print can generate visually appealing layouts with state-of-the-art quality while reducing the computational bottleneck by 95.2%.

[31] RoGA: Towards Generalizable Deepfake Detection through Robust Gradient Alignment

Lingyu Qiu,Ke Jiang,Xiaoyang Tan

Main category: cs.CV

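TL;DR: Proposes a new learning objective that uses gradient alignment to improve the generalization of deepfake detection models without additional regularization.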

Details

Motivation: Existing methods add extra modules to prevent overfitting, but this can hinder optimization of the empirical risk minimization (ERM) objective and hurt performance. Method: Perturbs model parameters to align gradient updates across domains, preserving domain-invariant features while managing domain-specific characteristics. Result: Outperforms the state of the art on multiple deepfake detection datasets. Conclusion: The gradient alignment strategy improves the model's robustness to domain shifts. Abstract: Recent advancements in domain generalization for deepfake detection have attracted significant attention, with previous methods often incorporating additional modules to prevent overfitting to domain-specific patterns. However, such regularization can hinder the optimization of the empirical risk minimization (ERM) objective, ultimately degrading model performance. In this paper, we propose a novel learning objective that aligns generalization gradient updates with ERM gradient updates. The key innovation is the application of perturbations to model parameters, aligning the ascending points across domains, which specifically enhances the robustness of deepfake detection models to domain shifts. This approach effectively preserves domain-invariant features while managing domain-specific characteristics, without introducing additional regularization. Experimental results on multiple challenging deepfake detection datasets demonstrate that our gradient alignment strategy outperforms state-of-the-art domain generalization techniques, confirming the efficacy of our method. The code is available at https://github.com/Lynn0925/RoGA.

[32] Photography Perspective Composition: Towards Aesthetic Perspective Recommendation

Lujian Yao,Siming Zheng,Xinbin Yuan,Zhuoxuan Cai,Pu Wu,Jinwei Chen,Bo Li,Peng-Tao Jiang

Main category: cs.CV

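TL;DR: The paper proposes photography perspective composition (PPC), a composition method based on 3D perspective adjustment that addresses the shortcomings of traditional 2D cropping, supported by automated dataset construction, video generation, and a perspective quality assessment model.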

Details

Motivation: Traditional 2D cropping is of limited use when subjects in a scene are poorly arranged, and professional photographers often adjust the 3D perspective to improve composition. Inspired by this, the paper proposes PPC, which faces two challenges: scarce datasets and missing assessment criteria. Method: Proposes PPC, comprising automated dataset construction, generation of perspective adjustment videos, and a perspective quality assessment model based on human performance. Result: Yields a concise approach that needs no additional prompts or camera trajectories and helps ordinary users improve their composition skills. Conclusion: PPC extends traditional composition techniques and addresses the challenges of perspective adjustment through data construction and an assessment model. Abstract: Traditional photography composition approaches are dominated by 2D cropping-based methods. However, these methods fall short when scenes contain poorly arranged subjects. Professional photographers often employ perspective adjustment as a form of 3D recomposition, modifying the projected 2D relationships between subjects while maintaining their actual spatial positions to achieve better compositional balance. Inspired by this artistic practice, we propose photography perspective composition (PPC), extending beyond traditional cropping-based methods. However, implementing PPC faces significant challenges: the scarcity of perspective transformation datasets and undefined assessment criteria for perspective quality. To address these challenges, we present three key contributions: (1) An automated framework for building PPC datasets through expert photographs. (2) A video generation approach that demonstrates the transformation process from suboptimal to optimal perspectives. (3) A perspective quality assessment (PQA) model constructed based on human performance. Our approach is concise and requires no additional prompt instructions or camera trajectories, helping and guiding ordinary users to enhance their composition skills.

[33] DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving

Muxi Diao,Lele Yang,Hongbo Yin,Zhexu Wang,Yejie Wang,Daxin Tian,Kongming Liang,Zhanyu Ma

Main category: cs.CV

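TL;DR: AutoDriveRL is a unified training framework that models autonomous driving as a structured reasoning process over four core tasks and optimizes each task with a task-specific reward model; it is used to train DriveRX, a model for real-time decision-making that outperforms GPT-4o in behavior reasoning.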

Details

Motivation: Conventional end-to-end models generalize poorly in complex scenarios, and existing vision-language models (VLMs) cannot support multi-stage decision-making because they rely on isolated modules and static supervision. Method: Decomposes autonomous driving into four core tasks, models each as a vision-language question-answering problem optimized with a task-specific reward model, and trains DriveRX, a cross-task reasoning VLM. Result: DriveRX performs strongly on a public benchmark, outperforms GPT-4o in behavior reasoning, and remains robust under complex or corrupted driving conditions. Conclusion: The AutoDriveRL framework and the DriveRX model offer a new direction for autonomous driving research and will be released to support further work. Abstract: Autonomous driving requires real-time, robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. Recent vision-language models (VLMs) have been applied to driving tasks, but they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language question-answering problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for real-time decision-making. DriveRX achieves strong performance on a public benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. Our analysis further highlights the impact of vision encoder design and reward-guided reasoning compression. We will release the AutoDriveRL framework and the DriveRX model to support future research.

[34] Contrastive Desensitization Learning for Cross Domain Face Forgery Detection

Lingyu Qiu,Ke Jiang,Xiaoyang Tan

Main category: cs.CV

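TL;DR: Proposes a new cross-domain face forgery detection method that uses a Contrastive Desensitization Network (CDN) to lower the false positive rate and improve detection accuracy.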

Details

Motivation: Existing methods suffer from a high false positive rate when applied across domains, which harms system usability. Method: Builds a Contrastive Desensitization Network (CDN) on a robust desensitization algorithm that captures domain characteristics by learning them from domain transformations over pairs of genuine face images. Result: On large-scale benchmark datasets, CDN substantially lowers the false alarm rate and improves detection accuracy. Conclusion: CDN performs well in cross-domain face forgery detection, with both theoretical robustness and practical advantages. Abstract: In this paper, we propose a new cross-domain face forgery detection method that is insensitive to different and possibly unseen forgery methods while ensuring an acceptably low false positive rate. Although existing face forgery detection methods are applicable to multiple domains to some degree, they often come with a high false positive rate, which can greatly disrupt the usability of the system. To address this issue, we propose a Contrastive Desensitization Network (CDN) based on a robust desensitization algorithm, which captures the essential domain characteristics through learning them from domain transformation over pairs of genuine face images. One advantage of CDN is that the learnt face representation is theoretically justified with regard to its robustness against domain changes. Extensive experiments over large-scale benchmark datasets demonstrate that our method achieves a much lower false alarm rate with improved detection accuracy compared to several state-of-the-art methods.

[35] Supervised Contrastive Learning for Ordinal Engagement Measurement

Sadaf Safa,Ali Abedi,Shehroz S. Khan

Main category: cs.CV

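TL;DR: Proposes a video-based method for measuring student engagement that uses supervised contrastive learning for ordinal classification, addressing class imbalance and the ordering of engagement levels, and validates it on a public dataset.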

Details

Motivation: Student engagement is crucial to educational programs, and automated measurement helps instructors monitor participation and adapt their teaching strategies. Method: Extracts affective and behavioral features from videos and trains ordinal classifiers within a supervised contrastive learning framework, using time-series data augmentation. Result: Validated on the DAiSEE dataset, where the method classifies engagement levels robustly. Conclusion: The method contributes to understanding and improving student engagement in virtual learning environments. Abstract: Student engagement plays a crucial role in the successful delivery of educational programs. Automated engagement measurement helps instructors monitor student participation, identify disengagement, and adapt their teaching strategies to enhance learning outcomes effectively. This paper identifies two key challenges in this problem: class imbalance and incorporating order into engagement levels rather than treating them as mere categories. Then, a novel approach to video-based student engagement measurement in virtual learning environments is proposed that utilizes supervised contrastive learning for ordinal classification of engagement. Various affective and behavioral features are extracted from video samples and utilized to train ordinal classifiers within a supervised contrastive learning framework (with a sequential classifier as the encoder). A key step involves the application of diverse time-series data augmentation techniques to these feature vectors, enhancing model training. The effectiveness of the proposed method was evaluated using a publicly available dataset for engagement measurement, DAiSEE, containing videos of students who participated in virtual learning programs. The results demonstrate the robust ability of the proposed method for the classification of the engagement level. This approach promises a significant contribution to understanding and enhancing student engagement in virtual learning environments.
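
As background for the supervised contrastive framework mentioned above, the sketch below implements the generic supervised contrastive loss (pull same-label embeddings together, push different-label ones apart) in plain Python. It is not the paper's training code: it does not encode the ordinal structure of engagement levels, and the temperature, the toy embeddings, and the assumption that every embedding is non-zero are choices made for illustration.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss, averaged over anchors with positives.

    Assumes every embedding is a non-zero vector; embeddings are
    L2-normalized so that dot products are cosine similarities.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors without a same-label partner are skipped
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n) if a != i
        }
        # Log-sum-exp over all other samples, stabilized by the max logit.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy embeddings forming two tight clusters.
points = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_matching = supcon_loss(points, [0, 0, 1, 1])    # labels follow clusters
loss_mismatched = supcon_loss(points, [0, 1, 0, 1])  # labels cut across clusters
```

The loss is low when labels agree with the cluster structure and high when they cut across it, which is the signal an encoder is trained on.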

[36] Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors

Haodong Lu,Xinyu Zhang,Kristen Moore,Jason Xue,Lina Yao,Anton van den Hengel,Dong Gong

Main category: cs.CV

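TL;DR: Proposes TPPT, a concise continual learning method based on incremental prompt tuning that exploits CLIP's multi-modal structure, using textual prototypes to guide visual prompt learning and reduce forgetting.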

Details

Motivation: Existing methods rely on complex designs that may add unnecessary complexity and underutilize CLIP's intrinsic capabilities. Method: TPPT uses textual prototypes to guide visual prompt learning (TPPT-V), jointly optimizes visual and textual prompts (TPPT-VT), and adds a relational diversity regularization to prevent embedding space collapse. Result: Experiments show that the method learns new knowledge effectively while reducing forgetting. Conclusion: Leveraging CLIP's intrinsic guidance for continual adaptation offers clear benefits. Abstract: Continual learning (CL) enables deep networks to acquire new knowledge while avoiding catastrophic forgetting. The powerful generalization ability of pre-trained models (PTMs), such as the Contrastive Language-Image Pre-training (CLIP) model, has inspired a range of CL methods targeting new and specialized tasks, providing rich multi-modal embeddings that support lightweight, incremental prompt tuning. Existing methods often rely on complex designs built upon specific assumptions, such as intricate regularization schemes for prompt pools, specialized routing mechanisms, or multi-stage incrementations, that introduce additional (and possibly unnecessary) complexity, underutilizing CLIP's intrinsic capabilities. In this paper, we propose a concise CL approach for CLIP based on incremental prompt tuning that fully exploits its multi-modal structure and the stability of textual representations. Our method, Textual Prototype-guided Prompt Tuning (TPPT), introduces textual prototypes not merely as static classifiers, as in existing methods, but as stable anchors to guide the learning of visual prompts, thereby shaping the embedding space (i.e., TPPT-V). We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting. To further close the vision-language gap during CL, we jointly optimize visual and textual prompts (i.e., TPPT-VT). We also introduce a relational diversity regularization on the textual anchors to prevent embedding space collapse and mitigate correlated forgetting. Extensive experiments and analyses demonstrate the effectiveness of our proposed approach, highlighting the benefits of leveraging CLIP's intrinsic guidance for continual adaptation.

[37] VisAlgae 2023: A Dataset and Challenge for Algae Detection in Microscopy Images

Mingxuan Sun,Juntao Jiang,Zhiqiang Yang,Shenao Kong,Jiamin Qi,Jianru Shang,Shuangling Luo,Wanfa Sun,Tianyi Wang,Yanqi Wang,Qixuan Wang,Tingjian Dai,Tianxiang Chen,Jinming Zhang,Xuerui Zhang,Yuepeng He,Pengcheng Fu,Qiu Guan,Shizheng Zhou,Yanbo Yu,Qigui Jiang,Teng Zhou,Liuyong Shi,Hong Yan

Main category: cs.CV

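TL;DR: Summarizes the second "Vision Meets Algae" challenge (VisAlgae 2023), which aims to advance high-throughput microalgae cell detection. The challenge attracted 369 participating teams and provides a dataset of 1000 images covering six classes of microalgae with varying sizes and features.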

Details

Motivation: Microalgae are important for ecological balance and economic sectors, but their diverse sizes and conditions make detection challenging. Method: Challenge tasks include detecting small targets and handling motion blur and complex backgrounds; the top 10 submitted methods are summarized. Result: The challenge demonstrates the potential of combining computer vision with algae research, offering new ideas for ecological understanding and technological advancement. Conclusion: The work provides practical methods and a dataset for microalgae detection and advances the field. Abstract: Microalgae, vital for ecological balance and economic sectors, present challenges in detection due to their diverse sizes and conditions. This paper summarizes the second "Vision Meets Algae" (VisAlgae 2023) Challenge, aiming to enhance high-throughput microalgae cell detection. The challenge, which attracted 369 participating teams, includes a dataset of 1000 images across six classes, featuring microalgae of varying sizes and distinct features. Participants faced tasks such as detecting small targets and handling motion blur and complex backgrounds. The top 10 methods, outlined here, offer insights into overcoming these challenges and maximizing detection accuracy. This intersection of algae research and computer vision offers promise for ecological understanding and technological advancement. The dataset can be accessed at: https://github.com/juntaoJianggavin/Visalgae2023/.

[38] Temporal Saliency-Guided Distillation: A Scalable Framework for Distilling Video Datasets

Xulin Gu,Xinhao Zhong,Zhixing Wei,Yimin Zhou,Shuoyang Sun,Bin Chen,Hongpeng Wang,Yuan Luo

Main category: cs.CV

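TL;DR: Proposes a new uni-level video dataset distillation framework that directly optimizes synthetic videos, reducing computational cost while preserving temporal dynamics.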

Details

Motivation: The high dimensionality and temporal complexity of video data make video dataset distillation challenging; existing methods are computationally expensive and struggle to preserve temporal dynamics. Method: A temporal saliency-guided filtering mechanism uses inter-frame differences to guide distillation, retaining informative temporal cues while reducing redundancy. Result: On standard video benchmarks, the method achieves state-of-the-art performance and narrows the gap between real and distilled video data. Conclusion: The method offers a scalable solution for video dataset compression. Abstract: Dataset distillation (DD) has emerged as a powerful paradigm for dataset compression, enabling the synthesis of compact surrogate datasets that approximate the training utility of large-scale ones. While significant progress has been achieved in distilling image datasets, extending DD to the video domain remains challenging due to the high dimensionality and temporal complexity inherent in video data. Existing video distillation (VD) methods often suffer from excessive computational costs and struggle to preserve temporal dynamics, as naïve extensions of image-based approaches typically lead to degraded performance. In this paper, we propose a novel uni-level video dataset distillation framework that directly optimizes synthetic videos with respect to a pre-trained model. To address temporal redundancy and enhance motion preservation, we introduce a temporal saliency-guided filtering mechanism that leverages inter-frame differences to guide the distillation process, encouraging the retention of informative temporal cues while suppressing frame-level redundancy. Extensive experiments on standard video benchmarks demonstrate that our method achieves state-of-the-art performance, bridging the gap between real and distilled video data and offering a scalable solution for video dataset compression.
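
The idea of scoring frames by inter-frame differences can be illustrated with a toy example. This is only a sketch of the general idea under simplifying assumptions. The abstract does not specify the authors' saliency measure or how it enters the distillation objective, so the mean-absolute-difference score, the top-k selection, and the names `temporal_saliency` and `select_salient_frames` are all hypothetical.

```python
from typing import List

Frame = List[float]  # a flattened grayscale frame (hypothetical representation)


def temporal_saliency(frames: List[Frame]) -> List[float]:
    """Mean absolute difference between each frame and its predecessor.

    The first frame has no predecessor, so its score is set to 0.0.
    """
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(c - p) for c, p in zip(cur, prev)) / len(cur)
        scores.append(diff)
    return scores


def select_salient_frames(frames: List[Frame], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order."""
    scores = temporal_saliency(frames)
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


video = [
    [0.0, 0.0, 0.0, 0.0],  # static
    [0.0, 0.0, 0.0, 0.0],  # unchanged
    [1.0, 1.0, 0.0, 0.0],  # large change
    [1.0, 1.0, 0.0, 0.0],  # unchanged
    [1.0, 1.0, 0.5, 0.5],  # moderate change
]
salient = select_salient_frames(video, k=2)
```

Frames that repeat their predecessor score zero and are dropped first, which is the sense in which such a filter suppresses frame-level redundancy.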

[39] Beyond Entropy: Region Confidence Proxy for Wild Test-Time Adaptation

Zixuan Hu,Yichun Hu,Xiaotong Li,Shixiang Tang,Ling-Yu Duan

Main category: cs.CV

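TL;DR: Proposes ReCAP, a region-integrated optimization method that addresses noisy optimization in Wild Test-Time Adaptation (WTTA) and significantly improves adaptation efficiency.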

Details

Motivation: Existing WTTA methods focus on sample selection strategies and overlook the underlying optimization problem; the entropy minimization framework suffers from noisy optimization dynamics that hinder adaptation efficiency. Method: ReCAP combines a probabilistic region modeling scheme with a finite-to-infinite asymptotic approximation that turns the intractable region confidence into a tractable proxy. Result: Experiments show that ReCAP outperforms existing methods across various datasets and scenarios. Conclusion: Through region-integrated optimization, ReCAP significantly improves WTTA adaptation efficiency and offers a concise, effective solution for practical use. Abstract: Wild Test-Time Adaptation (WTTA) is proposed to adapt a source model to unseen domains under extreme data scarcity and multiple shifts. Previous approaches mainly focused on sample selection strategies, while overlooking the fundamental problem of the underlying optimization. Initially, we critically analyze the widely adopted entropy minimization framework in WTTA and uncover its significant limitations in noisy optimization dynamics that substantially hinder adaptation efficiency. Through our analysis, we identify region confidence as a superior alternative to traditional entropy; however, its direct optimization remains computationally prohibitive for real-time applications. In this paper, we introduce a novel region-integrated method, ReCAP, that bypasses the lengthy process. Specifically, we propose a probabilistic region modeling scheme that flexibly captures semantic changes in embedding space. Subsequently, we develop a finite-to-infinite asymptotic approximation that transforms the intractable region confidence into a tractable and upper-bounded proxy. These innovations significantly unlock the overlooked potential dynamics in local regions in a concise solution. Our extensive experiments demonstrate the consistent superiority of ReCAP over existing methods across various datasets and wild scenarios.
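
As background for the entropy minimization baseline that the abstract criticizes, the per-sample quantity being minimized is the Shannon entropy of the softmax prediction. The sketch below shows only that baseline quantity. It does not implement ReCAP's region confidence proxy, which the abstract does not define in enough detail to reproduce. The helper names are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A confident prediction has lower entropy than an uncertain one, so
# minimizing entropy pushes the model toward confident outputs.
uncertain = entropy(softmax([0.0, 0.0, 0.0]))
confident = entropy(softmax([10.0, 0.0, 0.0]))
```

Because this objective is computed from a single sample's prediction, one noisy test input can dominate an update. That is consistent with the noisy optimization dynamics the abstract describes, although the paper's own analysis is not reproduced here.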

[40] Hierarchical Instruction-aware Embodied Visual Tracking

Kui Wu,Hao Chen,Churan Wang,Fakhri Karray,Zhoujun Li,Yizhou Wang,Fangwei Zhong

Main category: cs.CV

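TL;DR: HIEVT is a hierarchical instruction-aware embodied visual tracking method that uses spatial goals to bridge user instructions and agent actions, addressing the inference speed and generalizability limitations of existing language models on UC-EVT tasks.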

Details

Motivation: In UC-EVT, there is a substantial gap between high-level user instructions and low-level agent actions, and existing language models are limited in either inference speed or generalizability. Method: HIEVT uses an LLM-based Semantic-Spatial Goal Aligner to translate instructions into spatial goals, then an RL-based Adaptive Goal-Aligned Policy to generate actions. Result: Trained on over ten million trajectories and tested in diverse environments, HIEVT shows strong robustness and generalizability. Conclusion: HIEVT offers an efficient and general solution for UC-EVT that handles complex instructions and dynamic targets. Abstract: User-Centric Embodied Visual Tracking (UC-EVT) presents a novel challenge for reinforcement learning-based models due to the substantial gap between high-level user instructions and low-level agent actions. While recent advancements in language models (e.g., LLMs, VLMs, VLAs) have improved instruction comprehension, these models face critical limitations in either inference speed (LLMs, VLMs) or generalizability (VLAs) for UC-EVT tasks. To address these challenges, we propose the Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT) agent, which bridges instruction comprehension and action generation using spatial goals as intermediaries. HIEVT first introduces an LLM-based Semantic-Spatial Goal Aligner to translate diverse human instructions into spatial goals that directly annotate the desired spatial position. Then the RL-based Adaptive Goal-Aligned Policy, a general offline policy, enables the tracker to position the target as specified by the spatial goal. To benchmark UC-EVT tasks, we collect over ten million trajectories for training and evaluate across one seen environment and nine unseen challenging environments. Extensive experiments and real-world deployments demonstrate the robustness and generalizability of HIEVT across diverse environments, varying target dynamics, and complex instruction combinations. The complete project is available at https://sites.google.com/view/hievt.

[41] MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

Fuwen Luo,Shengfeng Lou,Chi Chen,Ziyue Wang,Chenliang Li,Weizhou Shen,Jiyue Guo,Peng Li,Ming Yan,Ji Zhang,Fei Huang,Yang Liu

Main category: cs.CV

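TL;DR: MUSEG is a reinforcement learning-based method that improves the video temporal understanding of multimodal large language models through timestamp-aware multi-segment grounding.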

Details

Motivation: Current multimodal large language models perform poorly on fine-grained temporal reasoning, and existing reinforcement learning approaches have limited effectiveness. Method: MUSEG introduces timestamp-aware multi-segment grounding, combined with an RL training recipe that uses phased rewards. Result: MUSEG significantly outperforms existing methods on temporal grounding and time-sensitive video QA tasks and generalizes well. Conclusion: MUSEG effectively improves video temporal understanding and offers a new direction for temporal reasoning in multimodal large language models. Abstract: Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. While reinforcement learning (RL) has been explored to address this issue recently, existing RL approaches remain limited in effectiveness. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. View our project at https://github.com/THUNLP-MT/MUSEG.

[42] VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Visual-Language Models

Kui Wu,Shuhang Xu,Hao Chen,Churan Wang,Zhoujun Li,Yizhou Wang,Fangwei Zhong

Main category: cs.CV

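TL;DR: Proposes a self-improving framework that incorporates visual-language models (VLMs) to improve the ability of Embodied Visual Tracking (EVT) systems to recover from tracking failure.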

Details

Motivation: Current active visual tracking systems have limited ability to recover from tracking failure and need a more intelligent solution. Method: The framework combines off-the-shelf active tracking methods with VLM reasoning: a fast visual policy handles normal tracking, VLM reasoning is activated upon failure, and a memory-augmented self-reflection mechanism is added. Result: Experiments show significant gains, with success rates boosted by 72% with RL-based approaches and by 220% with PID-based methods. Conclusion: This is the first integration of VLM reasoning into EVT and an important advance for robotic applications in dynamic, unstructured environments. Abstract: We introduce a novel self-improving framework that enhances Embodied Visual Tracking (EVT) with Visual-Language Models (VLMs) to address the limitations of current active visual tracking systems in recovering from tracking failure. Our approach combines off-the-shelf active tracking methods with VLMs' reasoning capabilities, deploying a fast visual policy for normal tracking and activating VLM reasoning only upon failure detection. The framework features a memory-augmented self-reflection mechanism that enables the VLM to progressively improve by learning from past experiences, effectively addressing VLMs' limitations in 3D spatial reasoning. Experimental results demonstrate significant performance improvements, with our framework boosting success rates by 72% with state-of-the-art RL-based approaches and 220% with PID-based methods in challenging environments. This work represents the first integration of VLM-based reasoning to assist EVT agents in proactive failure recovery, offering substantial advances for real-world robotic applications that require continuous target monitoring in dynamic, unstructured environments. Project website: https://sites.google.com/view/evt-recovery-assistant.

[43] LeDiFlow: Learned Distribution-guided Flow Matching to Accelerate Image Generation

Pascal Zwick,Nils Friederich,Maximilian Beichter,Lennart Hilbert,Ralf Mikut,Oliver Bringmann

Main category: cs.CV

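TL;DR: LeDiFlow is a learned distribution-guided flow matching method that reduces the number of ODE solver steps by learning a better prior distribution, significantly improving image generation efficiency and quality.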

Details

Motivation: The iterative nature of diffusion models (DMs) makes high-quality image generation inefficient; flow matching (FM) is an alternative, but the curved probability paths induced by its Gaussian prior increase computational cost. Method: LeDiFlow learns a better-suited prior distribution to initialize the ODE solver, yielding more computationally tractable probability paths, and combines a SOTA transformer architecture with latent space sampling. Result: LeDiFlow accelerates pixel-space inference by up to 3.75x, and its latent-space model improves image quality by 1.32x on average (CMMD metric). Conclusion: By learning a better prior, LeDiFlow significantly improves the efficiency and image quality of FM models and can be trained on a consumer workstation. Abstract: Enhancing the efficiency of high-quality image generation using Diffusion Models (DMs) is a significant challenge due to the iterative nature of the process. Flow Matching (FM) is emerging as a powerful generative modeling paradigm based on a simulation-free training objective instead of a score-based one used in DMs. Typical FM approaches rely on a Gaussian distribution prior, which induces curved, conditional probability paths between the prior and target data distribution. These curved paths pose a challenge for the Ordinary Differential Equation (ODE) solver, requiring a large number of inference calls to the flow prediction network. To address this issue, we present Learned Distribution-guided Flow Matching (LeDiFlow), a novel scalable method for training FM-based image generation models using a better-suited prior distribution learned via a regression-based auxiliary model. By initializing the ODE solver with a prior closer to the target data distribution, LeDiFlow enables the learning of more computationally tractable probability paths. These paths directly translate to fewer solver steps needed for high-quality image generation at inference time. Our method utilizes a State-Of-The-Art (SOTA) transformer architecture combined with latent space sampling and can be trained on a consumer workstation. We empirically demonstrate that LeDiFlow remarkably outperforms the respective FM baselines. For instance, when operating directly on pixels, our model accelerates inference by up to 3.75x compared to the corresponding pixel-space baseline. Simultaneously, our latent FM model enhances image quality on average by 1.32x in CLIP Maximum Mean Discrepancy (CMMD) metric against its respective baseline.
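
The link between path curvature and solver steps can be seen in a toy setting. The sketch below is not LeDiFlow itself. It assumes a common linear-interpolation form of flow matching, in which the path is x_t = (1 - t) x0 + t x1 and the regression target is x1 - x0, and it uses a fixed-step Euler solver. The velocity fields are hand-written stand-ins for a trained network, and all names are illustrative.

```python
import math
from typing import Callable, List

Vec = List[float]


def interpolate(x0: Vec, x1: Vec, t: float) -> Vec:
    """Point on the straight path between a prior sample x0 and a data sample x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vec, x1: Vec) -> Vec:
    """Regression target for the linear path: constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_solve(x0: Vec, velocity: Callable[[Vec, float], Vec], steps: int) -> Vec:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Straight trajectory: constant velocity, so a single Euler step is exact.
start, end = [0.0, 0.0], [2.0, -1.0]
v_const = target_velocity(start, end)
one_step = euler_solve(start, lambda x, t: v_const, steps=1)

# Curved trajectory (toy ODE dx/dt = x, exact answer e at t = 1):
# the error shrinks only as the number of steps grows.
coarse_error = abs(euler_solve([1.0], lambda x, t: x, steps=1)[0] - math.e)
fine_error = abs(euler_solve([1.0], lambda x, t: x, steps=100)[0] - math.e)
```

A straighter trajectory tolerates a coarser discretization. That is the mechanism the abstract appeals to when it says a prior closer to the data distribution leads to fewer solver steps.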

[44] Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting

Xiangyu Sun,Runnan Chen,Mingming Gong,Dong Xu,Tongliang Liu

Main category: cs.CV

TL;DR: Intern-GS uses vision foundation models to enhance sparse-view Gaussian splatting, enabling high-quality scene reconstruction.

Details

Motivation: Sparse-view scene reconstruction suffers from incomplete information due to limited data, and existing methods perform poorly. Method: Intern-GS uses DUSt3R to generate a dense Gaussian point cloud and uses vision foundation models to predict depth and appearance for refining the 3D Gaussians. Result: It achieves state-of-the-art rendering quality on datasets such as LLFF and DTU. Conclusion: Intern-GS effectively addresses the limitations of sparse-view reconstruction and improves reconstruction quality. Abstract: Sparse-view scene reconstruction often faces significant challenges due to the constraints imposed by limited observational data. These limitations result in incomplete information, leading to suboptimal reconstructions using existing methodologies. To address this, we present Intern-GS, a novel approach that effectively leverages rich prior knowledge from vision foundation models to enhance the process of sparse-view Gaussian Splatting, thereby enabling high-quality scene reconstruction. Specifically, Intern-GS utilizes vision foundation models to guide both the initialization and the optimization process of 3D Gaussian splatting, effectively addressing the limitations of sparse inputs. In the initialization process, our method employs DUSt3R to generate a dense and non-redundant Gaussian point cloud. This approach significantly alleviates the limitations encountered by traditional structure-from-motion (SfM) methods, which often struggle under sparse-view constraints. During the optimization process, vision foundation models predict depth and appearance for unobserved views, refining the 3D Gaussians to compensate for missing information in unseen regions. Extensive experiments demonstrate that Intern-GS achieves state-of-the-art rendering quality across diverse datasets, including both forward-facing and large-scale scenes, such as LLFF, DTU, and Tanks and Temples.

[45] MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition

Hao Zhang,Zhan Zhuang,Xuehao Wang,Xiaodong Yang,Yu Zhang

Main category: cs.CV

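TL;DR: MoPFormer is a Transformer-based self-supervised framework that tokenizes sensor signals into semantically meaningful motion primitives, improving the interpretability and cross-dataset generalization of HAR.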

Details

Motivation: To address the limited interpretability and poor cross-dataset generalization of HAR. Method: A two-stage framework: (1) sensor signals are segmented and quantized into motion primitives; (2) a context-aware embedding module and a Transformer encoder learn temporal representations. Result: MoPFormer outperforms existing methods on six HAR benchmarks and significantly improves cross-dataset performance. Conclusion: By capturing fundamental movement patterns, MoPFormer improves both the interpretability and the generalization of HAR. Abstract: Human Activity Recognition (HAR) with wearable sensors is challenged by limited interpretability, which significantly impacts cross-dataset generalization. To address this challenge, we propose Motion-Primitive Transformer (MoPFormer), a novel self-supervised framework that enhances interpretability by tokenizing inertial measurement unit signals into semantically meaningful motion primitives and leverages a Transformer architecture to learn rich temporal representations. MoPFormer comprises two stages. The first stage partitions multi-channel sensor streams into short segments and quantizes them into discrete "motion primitive" codewords, while the second stage enriches those tokenized sequences through a context-aware embedding module and then processes them with a Transformer encoder. The proposed MoPFormer can be pre-trained using a masked motion-modeling objective that reconstructs missing primitives, enabling it to develop robust representations across diverse sensor configurations. Experiments on six HAR benchmarks demonstrate that MoPFormer not only outperforms state-of-the-art methods but also successfully generalizes across multiple datasets. Most importantly, the learned motion primitives significantly enhance both interpretability and cross-dataset performance by capturing fundamental movement patterns that remain consistent across similar activities regardless of dataset origin.

[46] Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models

Yufei Zhan,Hongyin Zhao,Yousong Zhu,Shurong Zheng,Fan Yang,Ming Tang,Jinqiao Wang

Main category: cs.CV

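TL;DR: Proposes a unified visual reasoning mechanism that enables large multimodal models (LMMs) to solve complex compositional problems with their intrinsic capabilities, without multiple inferences or external tools.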

Details

Motivation: Current LMMs fall short on compositional reasoning tasks, which hinders their progress toward truly general vision models. Method: Introduces a human-like understanding-thinking-answering process completed in a single forward pass, and curates 334K visual instruction samples. Result: The resulting model, Griffon-R, performs strongly on complex visual reasoning benchmarks such as VSR and CLEVR and also improves on multimodal benchmarks such as MMBench and ScienceQA. Conclusion: The method bridges the gap between foundational visual capabilities and general question answering, providing faithful and traceable answers for complex visual reasoning. Abstract: Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. However, they often fall short in integrating advanced, task-specific capabilities for compositional reasoning, which hinders their progress toward truly competent general vision models. To address this, we present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems by leveraging their intrinsic capabilities (e.g., grounding and visual understanding capabilities). Different from the previous shortcut learning mechanism, our approach introduces a human-like understanding-thinking-answering process, allowing the model to complete all steps in a single forward pass without the need for multiple inferences or external tools. This design bridges the gap between foundational visual capabilities and general question answering, encouraging LMMs to generate faithful and traceable responses for complex visual reasoning. Meanwhile, we curate 334K visual instruction samples covering both general scenes and text-rich scenes and involving multiple foundational visual capabilities. Our trained model, Griffon-R, is capable of end-to-end automatic understanding, self-thinking, and reasoned answering. Comprehensive experiments show that Griffon-R not only achieves advanced performance on complex visual reasoning benchmarks including VSR and CLEVR, but also enhances multimodal capabilities across various benchmarks like MMBench and ScienceQA. Data, models, and code will be released at https://github.com/jefferyZhan/Griffon/tree/master/Griffon-R soon.

[47] PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

Ansel Blume,Jeonghwan Kim,Hyeonjeong Ha,Elen Chatikyan,Xiaomeng Jin,Khanh Duy Nguyen,Nanyun Peng,Kai-Wei Chang,Derek Hoiem,Heng Ji

Main category: cs.CV

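TL;DR: PARTONOMY is an LMM benchmark for pixel-level part grounding that reveals the weaknesses of existing LMMs in part grounding; an improved model, PLUM, is also proposed.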

Details

Motivation: Real-world objects are composed of distinctive, object-specific parts, yet existing LMMs struggle to identify them, which calls for a new benchmark and model. Method: Constructs the PARTONOMY dataset (862 part labels and 534 object labels) and designs the PLUM model (span tagging and a feedback loop). Result: Existing LMMs perform poorly (e.g., LISA-13B achieves only 5.9% gIoU), while PLUM outperforms existing models on several tasks. Conclusion: PLUM gives LMMs fine-grained visual understanding and opens up new research directions. Abstract: Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.

[48] ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval

Eric Xing,Pranavi Kolouju,Robert Pless,Abby Stylianou,Nathan Jacobs

Main category: cs.CV

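TL;DR: Proposes ConText-CIR, a framework that uses a Text Concept-Consistency loss to represent images and text modifications more accurately and, combined with a synthetic data generation pipeline, sets a new state of the art on CIR tasks.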

Details

Motivation: Existing methods for composed image retrieval (CIR) struggle to represent the image and the text modification accurately, resulting in subpar performance. Method: ConText-CIR is trained with a Text Concept-Consistency loss that encourages noun phrases in the text modification to attend to the relevant parts of the query image; a synthetic data generation pipeline supports training. Result: In both supervised and zero-shot settings, ConText-CIR sets a new state of the art on multiple benchmark datasets, including CIRR and CIRCO. Conclusion: By improving representation learning and data generation, ConText-CIR significantly improves composed image retrieval performance. Abstract: Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on multiple benchmark datasets, including CIRR and CIRCO. Source code, model checkpoints, and our new datasets are available at https://github.com/mvrl/ConText-CIR.

[49] MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning

Hongjia Liu,Rongzhen Zhao,Haohan Chen,Joni Pajarinen

Main category: cs.CV

TL;DR: MetaSlot is an improved Slot Attention variant that adapts dynamically to the number of objects, improving the performance of object-centric learning.

Details

Motivation: Existing object-centric learning methods rely on a fixed number of slots, which leads to inaccurate representations when the number of objects varies. Method: MetaSlot maintains a codebook of object prototypes, removes duplicate slots, and injects progressively weaker noise, thereby adapting the number of slots dynamically. Result: Across multiple datasets and tasks, MetaSlot significantly improves performance and the interpretability of slot representations. Conclusion: MetaSlot is a general and efficient Slot Attention variant that can be integrated into existing OCL architectures. Abstract: Learning object-level, structured representations is widely regarded as a key to better generalization in vision and underpins the design of next-generation Pre-trained Vision Models (PVMs). Mainstream Object-Centric Learning (OCL) methods adopt Slot Attention or its variants to iteratively aggregate objects' super-pixels into a fixed set of query feature vectors, termed slots. However, their reliance on a static slot count leads to an object being represented as multiple parts when the number of objects varies. We introduce MetaSlot, a plug-and-play Slot Attention variant that adapts to variable object counts. MetaSlot (i) maintains a codebook that holds prototypes of objects in a dataset by vector-quantizing the resulting slot representations; (ii) removes duplicate slots from the traditionally aggregated slots by quantizing them with the codebook; and (iii) injects progressively weaker noise into the Slot Attention iterations to accelerate and stabilize the aggregation. MetaSlot is a general Slot Attention variant that can be seamlessly integrated into existing OCL architectures. Across multiple public datasets and tasks, including object discovery and recognition, models equipped with MetaSlot achieve significant performance gains and markedly interpretable slot representations, compared with existing Slot Attention variants.
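
Step (ii) of the abstract, removing duplicate slots by quantizing them against a codebook, can be sketched as follows. This is one plausible reading of that sentence, not the authors' implementation. The nearest-prototype rule, the keep-the-first-slot policy, and the names `nearest_code` and `deduplicate_slots` are assumptions made for illustration.

```python
from typing import List, Sequence

Vec = Sequence[float]


def nearest_code(slot: Vec, codebook: Sequence[Vec]) -> int:
    """Index of the codebook prototype closest to the slot (squared Euclidean)."""
    def sq_dist(u: Vec, v: Vec) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(range(len(codebook)), key=lambda k: sq_dist(slot, codebook[k]))


def deduplicate_slots(slots: Sequence[Vec], codebook: Sequence[Vec]) -> List[int]:
    """Keep the first slot assigned to each prototype; drop later duplicates.

    Returns the indices of the retained slots, so the number of slots
    kept varies with how many distinct prototypes are present.
    """
    seen = set()
    kept = []
    for i, slot in enumerate(slots):
        code = nearest_code(slot, codebook)
        if code not in seen:
            seen.add(code)
            kept.append(i)
    return kept


codebook = [[0.0, 0.0], [1.0, 1.0]]  # two hypothetical object prototypes
slots = [[0.1, 0.0], [0.9, 1.1], [0.0, 0.2], [1.0, 0.8]]
kept = deduplicate_slots(slots, codebook)
```

Four input slots collapse to two retained slots here, one per prototype. This illustrates how a fixed slot budget could be reduced to a variable, scene-dependent count.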

[50] TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs

Zhehan Kan,Yanlin Liu,Kun Yin,Xinghua Jiang,Xin Li,Haoyu Cao,Yinsong Liu,Deqiang Jiang,Xing Sun,Qingmin Liao,Wenming Yang

Main category: cs.CV

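TL;DR: TACO is a new reinforcement learning algorithm for visual reasoning that addresses the shortcomings of existing methods in reasoning-answer consistency, model stability, and data efficiency.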

Details

Motivation: Existing attempts to replicate DeepSeek R1's reasoning capabilities in multimodal settings are limited by inconsistencies between reasoning and answers, model instability, and low data efficiency. Method: Building on GRPO, TACO introduces Think-Answer Consistency to ground answers in reasoning, a Rollback Resample Strategy to stabilize long-chain exploration, an adaptive learning schedule to improve data efficiency, and Test-Time-Resolution-Scaling to balance computational overhead. Result: Experiments on REC and VQA tasks show that TACO significantly improves performance. Conclusion: TACO uses reinforcement learning to address key problems in multimodal reasoning and improves model performance. Abstract: DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). While recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings, they face limitations, including inconsistencies between reasoning and final answers, model instability and crashes during long-chain exploration, and low data learning efficiency. To address these challenges, we propose TACO, a novel reinforcement learning algorithm for visual reasoning. Building on Generalized Reinforcement Policy Optimization (GRPO), TACO introduces Think-Answer Consistency, which tightly couples reasoning with answer consistency to ensure answers are grounded in thoughtful reasoning. We also introduce the Rollback Resample Strategy, which adaptively removes problematic samples and reintroduces them to the sampler, enabling stable long-chain exploration and future learning opportunities. Additionally, TACO employs an adaptive learning schedule that focuses on moderate-difficulty samples to optimize data efficiency. Furthermore, we propose the Test-Time-Resolution-Scaling scheme to address performance degradation due to varying resolutions during reasoning while balancing computational overhead. Extensive experiments on in-distribution and out-of-distribution benchmarks for REC and VQA tasks show that fine-tuning LVLMs leads to significant performance improvements.

[51] Breaking Dataset Boundaries: Class-Agnostic Targeted Adversarial Attacks

Taïga Gonçalves,Tomo Miyazaki,Shinichiro Omachi

Main category: cs.CV

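TL;DR: CD-MTA is a method for generating adversarial examples that mislead image classifiers toward any target class, including classes not seen during training. It uses an image-based conditional input and class-agnostic losses to avoid the reliance on class embeddings and the data leakage found in prior methods.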

Details

Motivation: Traditional targeted attacks require retraining a model for each target class, and existing multi-targeted attacks depend on training data and class embeddings, which limits their practicality in black-box scenarios. Method: CD-MTA uses an image-based conditional input and class-agnostic losses, removing the dependence on class semantics and enabling generalization to unseen classes. Result: Experiments on ImageNet and seven other datasets show that CD-MTA outperforms existing methods in both standard and cross-domain settings without accessing the black-box model's training data. Conclusion: CD-MTA addresses the data leakage and generalization problems of multi-targeted attacks, offering a more practical approach to adversarial attacks in black-box scenarios. Abstract: We present Cross-Domain Multi-Targeted Attack (CD-MTA), a method for generating adversarial examples that mislead image classifiers toward any target class, including those not seen during training. Traditional targeted attacks are limited to one class per model, requiring expensive retraining for each target. Multi-targeted attacks address this by introducing a perturbation generator with a conditional input to specify the target class. However, existing methods are constrained to classes observed during training and require access to the black-box model's training data, introducing a form of data leakage that undermines realistic evaluation in practical black-box scenarios. We identify overreliance on class embeddings as a key limitation, leading to overfitting and poor generalization to unseen classes. To address this, CD-MTA replaces class-level supervision with an image-based conditional input and introduces class-agnostic losses that align the perturbed and target images in the feature space. This design removes dependence on class semantics, thereby enabling generalization to unseen classes across datasets. Experiments on ImageNet and seven other datasets show that CD-MTA outperforms prior multi-targeted attacks in both standard and cross-domain settings, without accessing the black-box model's training data.

[52] Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models

Yang Zheng,Wen Li,Zhaoqiang Liu

Main category: cs.CV

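TL;DR: Proposes two new methods, DMILO and DMILO-PGD, which use intermediate layer optimization and sparse deviations to improve diffusion models on inverse problems, addressing their computational burden and convergence issues.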

Details

Motivation: Traditional approaches to inverse problems rely on handcrafted priors that fail to capture complex data. Diffusion models perform well but suffer from heavy computation and suboptimal convergence. Method: DMILO uses intermediate layer optimization to reduce the memory burden and introduces sparse deviations to expand the range of the model; DMILO-PGD adds projected gradient descent to improve convergence. Result: Experiments on diverse image datasets confirm that DMILO and DMILO-PGD significantly outperform existing methods. Conclusion: DMILO and DMILO-PGD effectively address common challenges of diffusion-model-based inverse problem solvers and improve reconstruction performance. Abstract: Inverse problems (IPs) involve reconstructing signals from noisy observations. Traditional approaches often rely on handcrafted priors, which can fail to capture the complexity of real-world data. The advent of pre-trained generative models has introduced new paradigms, offering improved reconstructions by learning rich priors from data. Among these, diffusion models (DMs) have emerged as a powerful framework, achieving remarkable reconstruction performance across numerous IPs. However, existing DM-based methods frequently encounter issues such as heavy computational demands and suboptimal convergence. In this work, building upon the idea of the recent work DMPlug (Wang et al., 2024), we propose two novel methods, DMILO and DMILO-PGD, to address these challenges. Our first method, DMILO, employs intermediate layer optimization (ILO) to alleviate the memory burden inherent in DMPlug. Additionally, by introducing sparse deviations, we expand the range of DMs, enabling the exploration of underlying signals that may lie outside the range of the diffusion model. We further propose DMILO-PGD, which integrates ILO with projected gradient descent (PGD), thereby reducing the risk of suboptimal convergence. We provide an intuitive theoretical analysis of our approach under appropriate conditions and validate its superiority through extensive experiments on diverse image datasets, encompassing both linear and nonlinear IPs. Our results demonstrate significant performance gains over state-of-the-art methods, highlighting the effectiveness of DMILO and DMILO-PGD in addressing common challenges in DM-based IP solvers.
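
Projected gradient descent, the generic building block named in the abstract, alternates a gradient step on the data-fidelity term with a projection onto a constraint set. The sketch below shows that loop on a tiny linear inverse problem. It is not DMILO-PGD: the projection onto the range of a diffusion model is replaced by a simple box projection onto [0, 1], and the matrix, step size, and function names are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]
Vec = List[float]


def matvec(A: Matrix, x: Vec) -> Vec:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A: Matrix) -> Matrix:
    return [list(col) for col in zip(*A)]


def project_box(x: Vec, lo: float = 0.0, hi: float = 1.0) -> Vec:
    """Stand-in projection: clip each coordinate to [lo, hi]."""
    return [min(max(v, lo), hi) for v in x]


def pgd(A: Matrix, y: Vec, x0: Vec, step: float, iters: int) -> Vec:
    """Minimize ||A x - y||^2 subject to x in the box, by projected gradient descent."""
    At = transpose(A)
    x = list(x0)
    for _ in range(iters):
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = [2.0 * g for g in matvec(At, residual)]  # gradient of ||Ax - y||^2
        x = project_box([xi - step * gi for xi, gi in zip(x, grad)])
    return x


A = [[1.0, 0.0], [0.0, 2.0]]
y = [0.5, 3.0]  # the unconstrained solution [0.5, 1.5] lies outside the box
x_hat = pgd(A, y, x0=[0.0, 0.0], step=0.1, iters=200)
```

The first coordinate converges to its unconstrained optimum of 0.5. The second is held at the boundary value 1.0 by the projection, showing how the constraint set shapes the reconstruction.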

[53] Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Juan A. Rodriguez,Haotian Zhang,Abhay Puri,Aarash Feizi,Rishav Pramanik,Pascal Wichmann,Arnab Mondal,Mohammad Reza Samsami,Rabiul Awal,Perouz Taslakian,Spandana Gella,Sai Rajeswar,David Vazquez,Christopher Pal,Marco Pedersoli

Main category: cs.CV

TL;DR: The paper proposes RLRF, a reinforcement learning method that uses rendering feedback to improve the accuracy and efficiency of SVG generation.

Details

Motivation: Existing vision-language models (VLMs) struggle to generate efficient and accurate SVGs because they lack rendering feedback. Method: RLRF, a reinforcement learning (RL) method, compares the rendered SVG with the original image to produce a reward signal that guides model optimization. Result: RLRF significantly outperforms supervised fine-tuning and produces more accurate, efficient, and semantically coherent SVGs. Conclusion: RLRF uses rendering feedback to address common problems in SVG generation, improving generation quality and generalization. Abstract: Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
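The render-and-compare reward described above can be sketched in miniature. The abstract does not specify the reward terms, so everything here is an assumption for illustration: the toy "SVG" is a list of axis-aligned rectangles, `render` is a hypothetical rasterizer, and the reward is negative pixel MSE minus a small penalty per primitive to favor compact output.

```python
def render(rects, size=8):
    """Rasterize rectangles (x0, y0, x1, y1, value) onto a size x size grid."""
    img = [[0.0] * size for _ in range(size)]
    for x0, y0, x1, y1, v in rects:
        for y in range(y0, y1):
            for x in range(x0, x1):
                img[y][x] = v
    return img


def mse(a, b):
    """Mean squared error between two equally sized images."""
    n = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def reward(rects, target, length_penalty=0.01):
    """Higher is better: visual fidelity minus a cost per primitive (assumed form)."""
    return -mse(render(rects, len(target)), target) - length_penalty * len(rects)


target = render([(2, 2, 6, 6, 1.0)])                # image to reproduce
exact = [(2, 2, 6, 6, 1.0)]                         # faithful and compact
redundant = [(2, 2, 6, 4, 1.0), (2, 4, 6, 6, 1.0)]  # faithful but longer
wrong = [(0, 0, 4, 4, 1.0)]                         # compact but unfaithful
```

Because the reward only needs a rendered image, it works even though the renderer is not differentiable, which is the reason the abstract gives for using RL.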

[54] Not All Thats Rare Is Lost: Causal Paths to Rare Concept Synthesis

Bo-Kai Ruan,Zi-Xiang Ni,Bo-Lun Huang,Teng-Fang Hsiao,Hong-Han Shuai

Main category: cs.CV

TL;DR: The RAP framework treats rare concept generation as navigation along a latent causal path, approximates rare prompts with semantically related frequent prompts, and improves generation through dynamic prompt switching and a second-order denoising mechanism.

Details

Motivation: To address the poor performance of diffusion models when generating concepts that are rare in the training distribution. Method: Proposes the RAP framework, which models rare concept generation as a latent causal path, switches prompts dynamically, and introduces a second-order denoising mechanism. Result: RAP significantly improves rare concept generation quality across several diffusion models and outperforms the baselines. Conclusion: Through causal path navigation and dynamic prompt switching, RAP effectively improves the ability of diffusion models to generate rare concepts. Abstract: Diffusion models have shown strong capabilities in high-fidelity image generation but often falter when synthesizing rare concepts, i.e., prompts that are infrequently observed in the training distribution. In this paper, we introduce RAP, a principled framework that treats rare concept generation as navigating a latent causal path: a progressive, model-aligned trajectory through the generative space from frequent concepts to rare targets. Rather than relying on heuristic prompt alternation, we theoretically justify that rare prompt guidance can be approximated by semantically related frequent prompts. We then formulate prompt switching as a dynamic process based on score similarity, enabling adaptive stage transitions. Furthermore, we reinterpret prompt alternation as a second-order denoising mechanism, promoting smooth semantic progression and coherent visual synthesis. Through this causal lens, we align input scheduling with the model's internal generative dynamics. Experiments across diverse diffusion backbones demonstrate that RAP consistently enhances rare concept generation, outperforming strong baselines in both automated evaluations and human studies.

[55] Frame-Level Captions for Long Video Generation with Complex Multi Scenes

Guangcong Zheng,Jianlong Yuan,Bo Wang,Haoyang Huang,Guoqing Ma,Nan Duan

Main category: cs.CV

TL;DR: This paper proposes a new method that uses frame-level annotation and an attention mechanism to address error accumulation and multi-scene handling in long video generation; experiments show it outperforms existing methods.

Details

Motivation: Current autoregressive diffusion models suffer from error accumulation when generating long videos, and existing methods struggle with complex multi-scene stories. This paper aims to solve these problems. Method: Proposes a frame-level dataset annotation method combined with a frame-level attention mechanism and Diffusion Forcing training to ensure precise text-video alignment. Result: Performs well on the VBench 2.0 benchmark, generating high-quality long videos especially in complex, changing scenes. Conclusion: The new method effectively addresses the challenges of long video generation; the authors plan to release the dataset and models to support further research. Abstract: Generating long videos that can show complex stories, like movie scenes from scripts, has great promise and offers much more than short clips. However, current methods that use autoregression with diffusion models often struggle because their step-by-step process naturally leads to serious error accumulation (drift). Also, many existing ways to make long videos focus on single, continuous scenes, making them less useful for stories with many events and changes. This paper introduces a new approach to solve these problems. First, we propose a novel way to annotate datasets at the frame level, providing the detailed text guidance needed for making complex, multi-scene long videos. This detailed guidance works with a Frame-Level Attention Mechanism to make sure text and video match precisely. A key feature is that each part (frame) within these windows can be guided by its own distinct text prompt. Our training uses Diffusion Forcing to provide the model with the ability to handle time flexibly. We tested our approach on difficult VBench 2.0 benchmarks ("Complex Plots" and "Complex Landscapes") based on the WanX2.1-T2V-1.3B model. The results show our method is better at following instructions in complex, changing scenes and creates high-quality long videos. We plan to share our dataset annotation methods and trained models with the research community. Project page: https://zgctroy.github.io/frame-level-captions.

[56] Causality-Driven Infrared and Visible Image Fusion

Linli Ma,Suzhen Lin,Jianchao Zeng,Zanxia Jin,Yanbo Wang,Fengyuan Li,Yubing Luo

Main category: cs.CV

TL;DR: This paper proposes a causality-based image fusion method that builds a causal graph to remove the influence of dataset scene bias and designs the BAFFM module to improve fusion performance.

Details

Motivation: Existing methods overlook the effect of dataset scene bias on model training, which leads the model to learn spurious correlations and limits fusion performance. Method: Re-examines the image fusion task from a causal perspective, builds a causal graph to remove the influence of bias, and proposes the BAFFM module to eliminate confounder interference. Result: Experiments on three standard datasets show that the method significantly outperforms existing infrared and visible image fusion methods. Conclusion: Causal analysis together with the BAFFM module effectively improves image fusion performance. Abstract: Image fusion aims to combine complementary information from multiple source images to generate more comprehensive scene representations. Existing methods primarily rely on the stacking and design of network architectures to enhance the fusion performance, often ignoring the impact of dataset scene bias on model training. This oversight leads the model to learn spurious correlations between specific scenes and fusion weights under the conventional likelihood estimation framework, thereby limiting fusion performance. To solve the above problems, this paper first re-examines the image fusion task from the causality perspective, and disentangles the model from the impact of bias by constructing a tailored causal graph to clarify the causalities among the variables in the image fusion task. Then, the Back-door Adjustment based Feature Fusion Module (BAFFM) is proposed to eliminate confounder interference and enable the model to learn the true causal effect. Finally, extensive experiments on three standard datasets show that the proposed method significantly surpasses state-of-the-art methods in infrared and visible image fusion.
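For readers unfamiliar with the back-door adjustment that BAFFM is named after, the textbook formula is P(Y | do(X=x)) = sum over z of P(Y | X=x, Z=z) P(Z=z), where Z is the confounder. The sketch below evaluates it on a toy discrete example and contrasts it with the ordinary conditional probability. All probability tables are made-up illustrative numbers, and this does not reproduce how BAFFM applies the adjustment to feature maps.

```python
# Hypothetical confounder: the scene type (biased toward "day" in the dataset).
p_z = {"day": 0.8, "night": 0.2}
# P(X=1 | z): how often a given fusion choice co-occurs with each scene.
p_x1_given_z = {"day": 0.9, "night": 0.1}
# P(Y=1 | X=1, z): outcome quality under that choice in each scene.
p_y1_given_x1_z = {"day": 0.7, "night": 0.3}


def backdoor_adjusted():
    """P(Y=1 | do(X=1)): average over scenes using the marginal P(z)."""
    return sum(p_y1_given_x1_z[z] * p_z[z] for z in p_z)


def naive_conditional():
    """P(Y=1 | X=1): scenes are weighted by P(z | X=1), which carries the bias."""
    joint = {z: p_x1_given_z[z] * p_z[z] for z in p_z}
    total = sum(joint.values())
    return sum(p_y1_given_x1_z[z] * joint[z] / total for z in p_z)


adjusted = backdoor_adjusted()
naive = naive_conditional()
```

In this toy setup the naive estimate is inflated because the choice X=1 mostly appears in the favorable "day" scenes; the adjusted estimate weights scenes by how often they occur overall.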

[57] Fully Spiking Neural Networks for Unified Frame-Event Object Tracking

Jingjun Yang,Liangwei Fan,Jinpu Zhang,Xiangkai Lian,Hui Shen,Dewen Hu

Main category: cs.CV

TL;DR: Proposes SpikeFET, a fully spiking frame-event tracking method that combines convolutional local feature extraction with Transformer-based global modeling to fuse image and event data efficiently, significantly reducing power consumption while improving tracking accuracy.

Details

Motivation: Current fusion methods incur heavy computational overhead to achieve high performance in complex environments, struggle to efficiently extract sparse, asynchronous information from event streams, and fail to exploit the energy efficiency of event-driven spiking paradigms. Method: Proposes the SpikeFET framework, which combines convolution and Transformer components; introduces a Random Patchwork Module (RPM) to eliminate positional bias; and proposes a Spatial-Temporal Regularization (STR) strategy to strengthen feature consistency. Result: Outperforms existing methods on multiple benchmarks while significantly reducing power consumption, achieving the best balance between performance and efficiency. Conclusion: The SpikeFET framework combines high efficiency with high performance in visual object tracking; the code will be open-sourced. Abstract: The integration of image and event streams offers a promising approach for achieving robust visual object tracking in complex environments. However, current fusion methods achieve high performance at the cost of significant computational overhead and struggle to efficiently extract the sparse, asynchronous information from event streams, failing to leverage the energy-efficient advantages of event-driven spiking paradigms. To address this challenge, we propose the first fully Spiking Frame-Event Tracking framework called SpikeFET. This network achieves synergistic integration of convolutional local feature extraction and Transformer-based global modeling within the spiking paradigm, effectively fusing frame and event data. To overcome the degradation of translation invariance caused by convolutional padding, we introduce a Random Patchwork Module (RPM) that eliminates positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures. Furthermore, we propose a Spatial-Temporal Regularization (STR) strategy that overcomes similarity metric degradation from asymmetric features by enforcing spatio-temporal consistency among temporal template features in latent space. Extensive experiments across multiple benchmarks demonstrate that the proposed framework achieves superior tracking accuracy over existing methods while significantly reducing power consumption, attaining an optimal balance between performance and efficiency. The code will be released.

[58] ProBA: Probabilistic Bundle Adjustment with the Bhattacharyya Coefficient

Jason Chui,Daniel Cremers

Main category: cs.CV

TL;DR: Proposes ProBA, a new probabilistic Bundle Adjustment method that needs no prior knowledge of camera poses or focal length and achieves more robust optimization by modeling and propagating uncertainty in the 2D observations and the 3D scene structure.

Details

Motivation: Classical BA methods depend on accurate initial estimates and known camera intrinsics, which limits their use when intrinsics are uncertain or unknown. Method: Replaces point-like landmarks with 3D Gaussians, introduces uncertainty-aware reprojection losses, and enforces geometric consistency across multiple 3D Gaussians using the Bhattacharyya coefficient. Result: Experiments show that ProBA outperforms traditional methods under challenging real-world conditions and reduces the need for initialization and known intrinsics. Conclusion: ProBA makes SLAM systems more practical in unstructured environments. Abstract: Classical Bundle Adjustment (BA) methods require accurate initial estimates for convergence and typically assume known camera intrinsics, which limits their applicability when such information is uncertain or unavailable. We propose a novel probabilistic formulation of BA (ProBA) that explicitly models and propagates uncertainty in both the 2D observations and the 3D scene structure, enabling optimization without any prior knowledge of camera poses or focal length. Our method uses 3D Gaussians instead of point-like landmarks. We introduce uncertainty-aware reprojection losses by projecting the 3D Gaussians onto the 2D image space, and enforce geometric consistency across multiple 3D Gaussians using the Bhattacharyya coefficient to encourage overlap between their corresponding Gaussian distributions. This probabilistic framework leads to more robust and reliable optimization, even in the presence of outliers in the correspondence set, reducing the likelihood of converging to poor local minima. Experimental results show that ProBA outperforms traditional methods in challenging real-world conditions. By removing the need for strong initialization and known intrinsics, ProBA enhances the practicality of SLAM systems deployed in unstructured environments.
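The Bhattacharyya coefficient between two Gaussians has a closed form, which makes it convenient as an overlap measure. The sketch below uses the standard formula, restricted to diagonal covariances for brevity (an assumption made here; the abstract does not say which covariance structure ProBA uses). The function name and example values are illustrative.

```python
import math


def bhattacharyya_coefficient(mu1, var1, mu2, var2):
    """Overlap in (0, 1] between two Gaussians with diagonal covariances.

    BC = exp(-D_B), where
    D_B = 1/8 * sum((mu1 - mu2)^2 / v) + 1/2 * sum(ln(v / sqrt(v1 * v2)))
    and v = (v1 + v2) / 2 is the averaged variance on each axis.
    """
    distance = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)
        distance += 0.125 * (m1 - m2) ** 2 / v
        distance += 0.5 * math.log(v / math.sqrt(v1 * v2))
    return math.exp(-distance)


unit = [1.0, 1.0, 1.0]
same = bhattacharyya_coefficient([0.0, 0.0, 0.0], unit, [0.0, 0.0, 0.0], unit)
near = bhattacharyya_coefficient([0.0, 0.0, 0.0], unit, [1.0, 0.0, 0.0], unit)
far = bhattacharyya_coefficient([0.0, 0.0, 0.0], unit, [3.0, 0.0, 0.0], unit)
```

A coefficient of 1 means the two distributions coincide, and it decays toward 0 as they separate, so maximizing it encourages the overlap the abstract refers to.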

[59] Exploring Timeline Control for Facial Motion Generation

Yifeng Ma,Jinwei Qi,Chaonan Ji,Peng Zhang,Bang Zhang,Zhidong Deng,Liefeng Bo

Main category: cs.CV

TL;DR: This paper proposes a new control signal for facial motion generation: timeline control. Compared with audio and text signals, it offers finer-grained control. Users can specify the timing of facial actions precisely through a multi-track timeline.

Details

Motivation: Conventional audio- or text-driven facial motion generation lacks precise control over the timing of actions; timeline control fills this gap. Method: First annotates the time intervals of facial actions using Toeplitz Inverse Covariance-based Clustering, then proposes a diffusion-based generation model that also supports text-guided motion generation. Result: Experiments show that the method annotates facial action intervals accurately and generates natural facial motion precisely aligned with the timeline. Conclusion: Timeline control provides a finer-grained and more flexible way to control facial motion generation. Abstract: This paper introduces a new control signal for facial motion generation: timeline control. Compared to audio and text signals, timelines provide more fine-grained control, such as generating specific facial motions with precise timing. Users can specify a multi-track timeline of facial actions arranged in temporal intervals, allowing precise control over the timing of each action. To model the timeline control capability, we first annotate the time intervals of facial actions in natural facial motion sequences at a frame-level granularity. This process is facilitated by Toeplitz Inverse Covariance-based Clustering to minimize human labor. Based on the annotations, we propose a diffusion-based generation model capable of generating facial motions that are natural and accurately aligned with input timelines. Our method supports text-guided motion generation by using ChatGPT to convert text into timelines. Experimental results show that our method can annotate facial action intervals with satisfactory accuracy, and produces natural facial motions accurately aligned with timelines.
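One way to picture a multi-track timeline is as a list of tracks, each holding (action, start, end) intervals, which can be expanded into per-frame labels for a generator to condition on. The representation below is a hypothetical sketch: the tuple layout, the action names, the frame rate, and the function `timeline_to_frames` are assumptions, not the paper's data format.

```python
def timeline_to_frames(timeline, num_frames, fps=25):
    """Expand a multi-track timeline into a per-frame multi-hot label matrix.

    timeline: list of tracks; each track is a list of (action, start_s, end_s).
    Returns the sorted action vocabulary and a num_frames x num_actions matrix.
    """
    actions = sorted({action for track in timeline for action, _, _ in track})
    frames = [[0] * len(actions) for _ in range(num_frames)]
    for track in timeline:
        for action, start_s, end_s in track:
            column = actions.index(action)
            first = round(start_s * fps)
            last = min(round(end_s * fps), num_frames)  # end is exclusive
            for frame in range(first, last):
                frames[frame][column] = 1
    return actions, frames


timeline = [
    [("smile", 0.0, 0.5)],                         # track 1: mouth
    [("blink", 0.2, 0.3), ("blink", 0.8, 0.9)],    # track 2: eyes
]
actions, frames = timeline_to_frames(timeline, num_frames=10, fps=10)
```

Separate tracks let actions overlap in time (a blink during a smile), which a single-track sequence of labels could not express.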

[60] AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding

Chaeyoung Jung,Youngjoon Jang,Joon Son Chung

Main category: cs.CV

TL;DR: The paper proposes AVCD, a new decoding framework that suppresses hallucinations in multimodal large language models by dynamically identifying and masking less dominant modalities, significantly improving model performance.

Details

Motivation: Hallucinations in audio-visual large language models (AV-LLMs) are complex, arising from unimodal and cross-modal combinations of audio, video, and language, and call for a more adaptive, modality-aware decoding strategy. Method: Proposes the AVCD framework, which uses attention distributions to dynamically identify less dominant modalities and applies attentive masking to generate perturbed output logits; it also introduces entropy-guided adaptive decoding to improve efficiency. Result: On the AVHBench dataset, AVCD improves accuracy by 6% for VideoLLaMA2 and 11% for video-SALMONN. Conclusion: AVCD is a training-free decoding framework that effectively suppresses multimodal hallucinations and shows strong robustness and generalizability. Abstract: Hallucination remains a major challenge in multimodal large language models (MLLMs). To address this, various contrastive decoding (CD) methods have been proposed that contrast original logits with hallucinated logits generated from perturbed inputs. While CD has shown promise in vision-language models (VLMs), it is not well-suited for AV-LLMs, where hallucinations often emerge from both unimodal and cross-modal combinations involving audio, video, and language. These intricate interactions call for a more adaptive and modality-aware decoding strategy. In this paper, we propose Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding framework designed to model trimodal interactions and suppress modality-induced hallucinations in AV-LLMs. Unlike previous CD methods in VLMs that corrupt a fixed modality, AVCD leverages attention distributions to dynamically identify less dominant modalities and applies attentive masking to generate perturbed output logits. To support CD in a trimodal setting, we also reformulate the original CD framework to jointly handle audio, visual, and textual inputs. Finally, to improve efficiency, we introduce entropy-guided adaptive decoding, which selectively skips unnecessary decoding steps based on the model's confidence in its predictions. Extensive experiments demonstrate that AVCD consistently outperforms existing decoding methods. In particular, on the AVHBench dataset, it improves accuracy by 6% for VideoLLaMA2 and 11% for video-SALMONN, demonstrating strong robustness and generalizability.
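The two ideas in the abstract, contrasting original logits against perturbed logits and skipping the contrast when the model is already confident, can be sketched for a single decoding step. The contrast rule used here, (1 + alpha) * original - alpha * perturbed, is the form commonly used in earlier contrastive decoding work; AVCD's trimodal reformulation and attentive masking are not reproduced. The names `alpha`, `entropy_threshold`, and `perturbed_fn` are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def contrastive_step(original, perturbed_fn, alpha=1.0, entropy_threshold=0.5):
    """Return (chosen token index, whether the contrast was applied)."""
    if entropy(softmax(original)) < entropy_threshold:
        # Confident prediction: skip the extra forward pass entirely.
        return original.index(max(original)), False
    perturbed = perturbed_fn()  # logits from the input with a modality masked
    adjusted = [(1 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]
    return adjusted.index(max(adjusted)), True


calls = []


def fake_perturbed():
    """Stand-in for a second forward pass on the perturbed input."""
    calls.append(1)
    return [2.5, 0.5, 0.0]


uncertain = contrastive_step([2.0, 1.9, 0.0], fake_perturbed)
confident = contrastive_step([5.0, 0.0, 0.0], fake_perturbed)
```

In the uncertain case, token 0 is favored even more strongly once the modality is masked, which suggests it is driven by a prior rather than the input, so the contrast shifts the choice to token 1. In the confident case the second forward pass is never made.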

[61] In Context Learning with Vision Transformers: Case Study

Antony Zhao,Alex Proshkin,Fergal Hennessy,Francesco Crivelli

Main category: cs.CV

TL;DR: Large Transformer models can perform few-shot, one-shot, and even zero-shot tasks through in-context learning. This paper aims to extend that to the image space and study their ability to learn complex functions such as convolutional neural networks.

Details

Motivation: To explore the ability of Transformer models to learn complex functions (such as convolutional neural networks) in the image space, broadening the scope of in-context learning. Method: Experimentally analyzes the in-context learning ability of Transformer models on complex functions in the image space, such as convolutional neural networks. Result: (Requires further experimental validation.) Conclusion: The study may reveal the in-context learning potential of Transformer models in the image space and provide a foundation for more complex tasks. Abstract: Large transformer models have been shown to be capable of performing in-context learning. By using examples in a prompt as well as a query, they are capable of performing tasks such as few-shot, one-shot, or zero-shot learning to output the corresponding answer to this query. One area of interest to us is that these transformer models have been shown to be capable of learning the general class of certain functions, such as linear functions and small 2-layer neural networks, on random data (Garg et al., 2023). We aim to extend this to the image space to analyze their capability to in-context learn more complex functions on the image space, such as convolutional neural networks and other methods.

[62] Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

Chaeyoung Jung,Youngjoon Jang,Jongmin Choi,Joon Son Chung

Main category: cs.CV

TL;DR: Proposes Fork-Merge Decoding (FMD), an inference-time fork-and-merge strategy that addresses modality bias in audio-visual large language models without additional training.

Details

Motivation: Current audio-visual large language models (AV-LLMs) can introduce modality bias when processing multimodal features jointly, causing the model to over-rely on one modality. Method: At inference time, FMD processes audio and video inputs separately (fork phase) and then merges the hidden states for joint reasoning (merge phase). Result: On VideoLLaMA2 and video-SALMONN, FMD improves performance on audio, video, and combined audio-visual reasoning tasks. Conclusion: FMD improves the robustness of multimodal understanding through inference-time intervention, with no additional training or architectural changes. Abstract: The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs) by addressing modality bias without requiring additional training. In current AV-LLMs, audio and video features are typically processed jointly in the decoder. While this strategy facilitates unified multimodal understanding, it may introduce modality bias, where the model tends to over-rely on one modality due to imbalanced training signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications. FMD first performs modality-specific reasoning by processing audio-only and video-only inputs through the early decoder layers (a fork phase), and then merges the resulting hidden states for joint reasoning in the remaining layers (a merge phase). This approach promotes balanced modality contributions and leverages complementary information across modalities. We evaluate our method on two representative AV-LLMs, VideoLLaMA2 and video-SALMONN, using three benchmark datasets. Experimental results demonstrate consistent performance improvements on tasks focused on audio, video, and combined audio-visual reasoning, demonstrating the effectiveness of inference-time interventions for robust multimodal understanding.
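The fork and merge phases can be sketched as control flow over a stack of layers. This is only a structural illustration: the layers are toy functions on lists of floats, and merging by element-wise averaging is an assumption made here, since the abstract does not state the merge rule. `fork_depth` is a hypothetical parameter for where the fork phase ends.

```python
def run_layers(layers, hidden):
    """Apply a sequence of layer functions to a hidden state."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def average(a, b):
    """Assumed merge rule: element-wise mean of the two hidden states."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def fork_merge(layers, audio_only, video_only, fork_depth, merge=average):
    early, late = layers[:fork_depth], layers[fork_depth:]
    h_audio = run_layers(early, audio_only)    # fork: audio-only branch
    h_video = run_layers(early, video_only)    # fork: video-only branch
    return run_layers(late, merge(h_audio, h_video))  # merge, then joint layers


def double(h):
    return [2 * x for x in h]


def add_one(h):
    return [x + 1 for x in h]


toy_layers = [double, add_one, double]
output = fork_merge(toy_layers, [1.0, 0.0], [0.0, 1.0], fork_depth=2)
```

The same weights are used in both branches and in the joint layers, which is why the approach needs no additional training.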

[63] Stereo Radargrammetry Using Deep Learning from Airborne SAR Images

Tatsuya Sasayama,Shintaro Ito,Koichi Ito,Takafumi Aoki

Main category: cs.CV

TL;DR: Proposes a deep learning-based stereo radargrammetry method that measures elevation from SAR images over a wider range and with higher accuracy.

Details

Motivation: No public SAR image dataset exists for training deep learning methods, and conventional methods are susceptible to geometric image modulation. Method: Creates a SAR dataset and fine-tunes a deep learning-based image correspondence method, avoiding ground projection and processing the SAR image in patches. Result: Experiments show that the method measures elevation over a wider range and more accurately than conventional methods. Conclusion: The method effectively handles geometric modulation in SAR image processing and improves measurement accuracy. Abstract: In this paper, we propose a stereo radargrammetry method using deep learning from airborne Synthetic Aperture Radar (SAR) images. Deep learning-based methods are considered to suffer less from geometric image modulation, while there is no public SAR image dataset used to train such methods. We create a SAR image dataset and perform fine-tuning of a deep learning-based image correspondence method. The proposed method suppresses the degradation of image quality by pixel interpolation without ground projection of the SAR image and divides the SAR image into patches for processing, which makes it possible to apply deep learning. Through a set of experiments, we demonstrate that the proposed method exhibits a wider range and more accurate elevation measurements compared to conventional methods.

[64] YOLO-FireAD: Efficient Fire Detection via Attention-Guided Inverted Residual Learning and Dual-Pooling Feature Preservation

Weichao Pan,Bohan Xu,Xu Wang,Chengze Lv,Shuoyang Wang,Zhenke Duan

Main category: cs.CV

TL;DR: Proposes YOLO-FireAD, a YOLO-based fire detection model that improves feature extraction through attention and dual-pooling fusion, substantially reducing parameters and computation while improving detection accuracy.

Details

Motivation: To address the challenges of fire detection in dynamic environments, such as illumination interference, false detections, and missed detections, as well as the feature extraction limits and information loss of existing YOLO models. Method: Introduces an Attention-guided Inverted Residual Block (AIR) to enhance fire features and suppress noise, and a Dual Pool Downscale Fusion Block (DPDF) to preserve multi-scale fire patterns. Result: Performs well on public datasets, with parameters and computation reduced by 51.8% and 43.2% respectively, and mAP75 1.3-5.5% higher than mainstream real-time detection models. Conclusion: YOLO-FireAD outperforms existing models in both efficiency and accuracy and is suitable for fire detection in dynamic environments. Abstract: Fire detection in dynamic environments faces persistent challenges, including interference from illumination changes, frequent false and missed detections, and the difficulty of achieving both efficiency and accuracy. To address the problem of feature extraction limitation and information loss in the existing YOLO-based models, this study proposes You Only Look Once for Fire Detection with Attention-guided Inverted Residual and Dual-pooling Downscale Fusion (YOLO-FireAD) with two core innovations: (1) Attention-guided Inverted Residual Block (AIR) integrates hybrid channel-spatial attention with inverted residuals to adaptively enhance fire features and suppress environmental noise; (2) Dual Pool Downscale Fusion Block (DPDF) preserves multi-scale fire patterns through learnable fusion of max-average pooling outputs, mitigating small-fire detection failures. Extensive evaluation on two public datasets shows the efficient performance of our model. Our proposed model keeps a low parameter count (1.45M, 51.8% lower than YOLOv8n) and computational cost (4.6G, 43.2% lower than YOLOv8n), and its mAP75 is 1.3-5.5% higher than the mainstream real-time object detection models YOLOv8n, YOLOv9t, YOLOv10n, YOLO11n, YOLOv12n and other YOLOv8 variants.
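The "learnable fusion of max-average pooling outputs" can be pictured with a minimal downscaling function that blends 2x2 max pooling and 2x2 average pooling. This is a sketch under assumptions: the single scalar weight `w` stands in for whatever learnable fusion DPDF actually uses, and the function name is hypothetical.

```python
def dual_pool_downscale(feature, w=0.5):
    """Downscale a 2D feature map by 2 using a blend of max and average pooling.

    w = 1.0 gives pure max pooling (keeps small, bright activations);
    w = 0.0 gives pure average pooling (keeps the overall context).
    """
    rows, cols = len(feature), len(feature[0])
    out = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            window = [feature[r][c], feature[r][c + 1],
                      feature[r + 1][c], feature[r + 1][c + 1]]
            pooled_max = max(window)
            pooled_avg = sum(window) / 4
            out_row.append(w * pooled_max + (1 - w) * pooled_avg)
        out.append(out_row)
    return out


feature = [[1.0, 2.0, 0.0, 0.0],
           [3.0, 4.0, 0.0, 8.0]]   # the lone 8.0 mimics a small fire region
fused = dual_pool_downscale(feature, w=0.5)
```

Average pooling alone would dilute the isolated activation from 8.0 to 2.0, which gives some intuition for why keeping the max branch could help with small fires.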

[65] Frequency Composition for Compressed and Domain-Adaptive Neural Networks

Yoojin Kwon,Hongjun Suh,Wooseok Lee,Taesik Gong,Songyi Han,Hyung-Sin Kim

Main category: cs.CV

TL;DR: The CoDA framework unifies model compression and domain adaptation through frequency composition, using low-frequency information during training and high-frequency information at test time, and significantly improves compressed models on domain adaptation tasks.

Details

Motivation: On-device neural network applications must cope with both resource constraints and unpredictable domain shifts, yet existing work usually handles compression or domain adaptation in isolation. Method: CoDA combines quantization-aware training (QAT) with test-time adaptation (TTA): during training it learns robust features from low-frequency components, and at test time it uses high-frequency information to adapt to the target domain. Result: On the CIFAR10-C and ImageNet-C benchmarks, CoDA significantly improves compressed models, with gains of 7.96 and 5.37 percentage points over the full-precision TTA baseline. Conclusion: CoDA offers an efficient solution to domain adaptation on resource-constrained devices and works together with existing methods. Abstract: Modern on-device neural network applications must operate under resource constraints while adapting to unpredictable domain shifts. However, this combined challenge (model compression and domain adaptation) remains largely unaddressed, as prior work has tackled each issue in isolation: compressed networks prioritize efficiency within a fixed domain, whereas large, capable models focus on handling domain shifts. In this work, we propose CoDA, a frequency composition-based framework that unifies compression and domain adaptation. During training, CoDA employs quantization-aware training (QAT) with low-frequency components, enabling a compressed model to selectively learn robust, generalizable features. At test time, it refines the compact model in a source-free manner (i.e., test-time adaptation, TTA), leveraging the full-frequency information from incoming data to adapt to target domains while treating high-frequency components as domain-specific cues. Low-frequency components (LFC) are aligned with the training distribution, while high-frequency components (HFC) unique to the target distribution are used solely for batch normalization. CoDA can be integrated synergistically into existing QAT and TTA methods. CoDA is evaluated on widely used domain-shift benchmarks, including CIFAR10-C and ImageNet-C, across various model architectures. With significant compression, it achieves accuracy improvements of 7.96 percentage points on CIFAR10-C and 5.37 percentage points on ImageNet-C over the full-precision TTA baseline.
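Splitting an input into low- and high-frequency components, the basic operation CoDA builds on, can be sketched with a discrete Fourier transform. The example below works on a 1D signal with a hand-written DFT to stay dependency-free; the abstract does not say which transform or cutoff CoDA uses on images, so `split_frequencies` and `cutoff` are hypothetical.

```python
import cmath


def dft(signal):
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def split_frequencies(signal, cutoff):
    """Return (low, high) with low + high == signal.

    Frequency bins whose index distance from 0 is at most `cutoff` go to the
    low-frequency part; the remainder is the high-frequency part.
    """
    n = len(signal)
    spectrum = dft(signal)
    low_spectrum = [c if min(k, n - k) <= cutoff else 0.0
                    for k, c in enumerate(spectrum)]
    low = idft(low_spectrum)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


signal = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0]  # mean 2 plus fast alternation
low, high = split_frequencies(signal, cutoff=1)
```

Here the low-frequency part recovers the constant level and the high-frequency part holds the alternation, mirroring the split between general structure and fine detail that the abstract describes.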

Pingrui Zhang,Yifei Su,Pengyuan Wu,Dong An,Li Zhang,Zhigang Wang,Dong Wang,Yan Ding,Bin Zhao,Xuelong Li

Main category: cs.CV

TL;DR: The paper proposes the Adaptive Text Dreamer (ATD), which imagines in language form using a dual-branch self-guided imagination policy built on a large language model (LLM), significantly improving the efficiency and reliability of Vision-and-Language Navigation (VLN).

Details

Motivation: In Vision-and-Language Navigation, partial observability makes it hard to align perception with language. Existing methods rely on visual synthesis, which is computationally expensive and redundant. Method: Proposes the Adaptive Text Dreamer (ATD) with a left-right brain architecture, in which the left brain handles logical integration and the right brain handles imaginative prediction of future scenes; only the Q-former is fine-tuned to activate domain knowledge in the LLM. Result: On the R2R benchmark, ATD achieves state-of-the-art performance with fewer parameters. Conclusion: By imagining key environmental semantics in language form, ATD provides a more efficient and reliable navigation strategy. Abstract: Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics via language form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at https://github.com/zhangpingrui/Adaptive-Text-Dreamer.

[67] HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion

Guanghu Xie,Yonglong Zhang,Zhiduo Jiang,Yang Liu,Zongwu Xie,Baoshi Cao,Hong Liu

Main category: cs.CV

TL;DR: HTMNet is a hybrid model combining Transformer, CNN, and Mamba architectures to address incomplete depth information for transparent and reflective objects, and it performs well on multiple datasets.

Details

Motivation: Transparent and reflective objects cause depth sensors to return incomplete information, which harms robotic perception and manipulation tasks. Method: Uses a dual-branch Transformer-CNN encoder together with a Transformer-Mamba multi-scale fusion module, and introduces a multimodal fusion module based on self-attention and state space models. Result: Achieves SOTA performance on multiple public datasets. Conclusion: HTMNet performs strongly on transparent object depth completion, validating the approach. Abstract: Transparent and reflective objects pose significant challenges for depth sensors, resulting in incomplete depth information that adversely affects downstream robotic perception and manipulation tasks. To address this issue, we propose HTMNet, a novel hybrid model integrating Transformer, CNN, and Mamba architectures. The encoder is constructed based on a dual-branch Transformer-CNN framework, while the multi-scale fusion module leverages a Transformer-Mamba architecture, which also serves as the foundation for the decoder design. We introduce a novel multimodal fusion module grounded in self-attention mechanisms and state space models, marking the first application of the Mamba architecture in the field of transparent object depth completion and revealing its promising potential. Additionally, we design an innovative multi-scale fusion module for the decoder that combines channel attention, spatial attention, and multi-scale feature extraction techniques to effectively integrate multi-scale features through a down-fusion strategy. Extensive evaluations on multiple public datasets demonstrate that our model achieves state-of-the-art (SOTA) performance, validating the effectiveness of our approach.

[68] Create Anything Anywhere: Layout-Controllable Personalized Diffusion Model for Multiple Subjects

Wei Li,Hebei Li,Yansong Peng,Siying Wu,Yueyi Zhang,Xiaoyan Sun

Main category: cs.CV

TL;DR: The LCP-Diffusion model uses a Dynamic-Static Complementary Visual Refining module and a Dual Layout Control mechanism to achieve tuning-free, high-fidelity text-to-image generation with precise layout control.

Details

Motivation: Existing text-to-image methods lack precise layout control and do not fully exploit the dynamic features of reference subjects to improve fidelity. Method: Proposes the LCP-Diffusion model, which combines a Dynamic-Static Complementary Visual Refining module with a Dual Layout Control mechanism for high-fidelity generation with flexible layouts. Result: Experiments confirm that LCP-Diffusion excels in identity preservation and layout control. Conclusion: LCP-Diffusion is a pioneering personalized generation framework that lets users "create anything anywhere". Abstract: Diffusion models have significantly advanced text-to-image generation, laying the foundation for the development of personalized generative frameworks. However, existing methods lack precise layout controllability and overlook the potential of dynamic features of reference subjects in improving fidelity. In this work, we propose the Layout-Controllable Personalized Diffusion (LCP-Diffusion) model, a novel framework that integrates subject identity preservation with flexible layout guidance in a tuning-free approach. Our model employs a Dynamic-Static Complementary Visual Refining module to comprehensively capture the intricate details of reference subjects, and introduces a Dual Layout Control mechanism to enforce robust spatial control across both training and inference stages. Extensive experiments validate that LCP-Diffusion excels in both identity preservation and layout controllability. To the best of our knowledge, this is a pioneering work enabling users to "create anything anywhere".

[69] Geometry-Editable and Appearance-Preserving Object Compositon

Jianman Lin,Haojie Li,Chunmei Qing,Zhijing Yang,Liang Lin,Tianshui Chen

Main category: cs.CV

TL;DR: The DGAD model disentangles geometry editing from appearance preservation, combining semantic embeddings with a cross-attention mechanism to achieve precise geometry editing and consistent appearance.

Details

Motivation: Existing methods encode only high-level semantic cues and lose fine-grained appearance details, so they cannot satisfy geometry editing and appearance preservation at the same time. Method: DGAD uses CLIP/DINO and reference networks to extract semantic embeddings and appearance features, captures geometric information through a pre-trained diffusion model, and designs a dense cross-attention mechanism to align appearance features. Result: On public benchmarks, the DGAD framework performs well, achieving flexible geometry editing and faithful appearance preservation. Conclusion: Through its disentangled design, DGAD balances geometry editing against appearance preservation and offers a new direction for general object composition. Abstract: General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties, while simultaneously preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-level semantic cues and inevitably discard fine-grained appearance details. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model that first leverages semantic embeddings to implicitly capture the desired geometric transformations and then employs a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, facilitating both precise geometry editing and faithful appearance preservation in object composition. Specifically, DGAD builds on CLIP/DINO-derived and reference networks to extract semantic embeddings and appearance-preserving representations, which are then seamlessly integrated into the encoding and decoding pipelines in a disentangled manner. We first integrate the semantic embeddings into pre-trained diffusion models that exhibit strong spatial reasoning capabilities to implicitly capture object geometry, thereby facilitating flexible object manipulation and ensuring effective editability. Then, we design a dense cross-attention mechanism that leverages the implicitly learned object geometry to retrieve and spatially align appearance features with their corresponding regions, ensuring faithful appearance consistency. Extensive experiments on public benchmarks demonstrate the effectiveness of the proposed DGAD framework.

[70] HuMoCon: Concept Discovery for Human Motion Understanding

Qihang Fang,Chengcheng Tang,Bugra Tekin,Shugao Ma,Yanchao Yang

Main category: cs.CV

TL;DR: HuMoCon is a new motion-video understanding framework focused on human behavior analysis; it tackles the challenges of motion concept discovery with multi-modal encoders and a feature alignment strategy.

Details

Motivation: To address the lack of multi-modal feature alignment in motion concept discovery and the loss of high-frequency information in masked autoencoding frameworks. Method: Combines a feature alignment strategy, which uses video for contextual understanding and motion for fine-grained interaction modeling, with a velocity reconstruction mechanism that strengthens high-frequency feature expression. Result: Significantly outperforms existing methods on standard benchmarks and achieves effective motion concept discovery. Conclusion: HuMoCon provides an efficient solution for human motion understanding, and the associated code will be open-sourced. Abstract: We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses key challenges in motion concept discovery for understanding and reasoning, including the lack of explicit multi-modality feature alignment and the loss of high-frequency information in masked autoencoding frameworks. Our approach integrates a feature alignment strategy that leverages video for contextual understanding and motion for fine-grained interaction modeling, together with a velocity reconstruction mechanism to enhance high-frequency feature expression and mitigate temporal over-smoothing. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon enables effective motion concept discovery and significantly outperforms state-of-the-art methods in training large models for human motion understanding. We will open-source the associated code with our paper.

[71] Good Enough: Is it Worth Improving your Label Quality?

Alexander Jaus,Zdravko Marinov,Constantin Seibold,Simon Reiß,Jens Kleesiek,Rainer Stiefelhagen

Main category: cs.CV

TL;DR: 本文研究了医疗图像分割中标签质量的影响，发现高质量标签对域内性能有提升，但对预训练影响较小。

Details

Motivation: Examine whether the cost of improving label quality in medical image segmentation is matched by its benefits. Method: Generate multiple pseudo-labeled versions of CT datasets with nnU-Net, TotalSegmentator, and MedSAM, and systematically evaluate the effect of label quality. Result: Higher-quality labels improve in-domain performance, but the gains are unclear when the quality difference is below a small threshold; for pre-training, label quality has little impact. Conclusion: Provides practical guidance on when investing in label quality is worthwhile. Abstract: Improving label quality in medical image segmentation is costly, but its benefits remain unclear. We systematically evaluate its impact using multiple pseudo-labeled versions of CT datasets, generated by models like nnU-Net, TotalSegmentator, and MedSAM. Our results show that while higher-quality labels improve in-domain performance, gains remain unclear if below a small threshold. For pre-training, label quality has minimal impact, suggesting that models transfer general concepts rather than detailed annotations. These findings provide guidance on when improving label quality is worth the effort.

[72] QwT-v2: Practical, Effective and Efficient Post-Training Quantization

Ningyuan Tang,Minghao Fu,Hao Yu,Jianxin Wu

Main category: cs.CV

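TL;DR: QwT-v2 is an improved network quantization method that resolves QwT's compatibility and efficiency problems with a lightweight channel-wise affine compensation (CWAC) module while matching or improving accuracy.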

Details

Motivation: Network quantization is a practical way to reduce the resource consumption of deep neural networks, but existing methods such as QwT add extra parameters and latency and have hardware compatibility problems. Method: Proposes QwT-v2, which adopts a lightweight CWAC module to reduce the extra parameters and computation and to improve hardware compatibility. Result: QwT-v2 lowers resource consumption while matching or exceeding QwT in accuracy, and it is compatible with most hardware platforms. Conclusion: QwT-v2 is an efficient and broadly compatible improvement to network quantization. Abstract: Network quantization is arguably one of the most practical network compression approaches for reducing the enormous resource consumption of modern deep neural networks. Existing methods usually require diverse and subtle design choices for specific architectures and tasks. Instead, the QwT method is a simple and general approach which introduces lightweight additional structures to improve quantization. But QwT incurs extra parameters and latency. More importantly, QwT is not compatible with many hardware platforms. In this paper, we propose QwT-v2, which not only enjoys all advantages of but also resolves major defects of QwT. By adopting a very lightweight channel-wise affine compensation (CWAC) module, QwT-v2 introduces significantly fewer extra parameters and computations compared to QwT, and at the same time matches or even outperforms QwT in accuracy. The compensation module of QwT-v2 can be integrated into quantization inference engines with little effort, which not only effectively removes the extra costs but also makes it compatible with most existing hardware platforms.
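
To make the idea of channel-wise affine compensation in the QwT-v2 entry concrete, here is a minimal sketch. The abstract does not say how the per-channel scale and bias are obtained; the closed-form least-squares fit and the function names `fit_channel_affine` and `compensate` are assumptions made purely for illustration, not the authors' procedure.

```python
def fit_channel_affine(quantized, reference):
    """Least-squares fit of reference ~= scale * quantized + bias for one channel.

    Assumption: the compensation is fitted on calibration activations;
    the QwT-v2 abstract does not specify the fitting procedure.
    """
    n = len(quantized)
    mean_q = sum(quantized) / n
    mean_r = sum(reference) / n
    var_q = sum((q - mean_q) ** 2 for q in quantized)
    if var_q == 0.0:
        # Constant channel: only a bias shift can be estimated.
        return 1.0, mean_r - mean_q
    cov = sum((q - mean_q) * (r - mean_r) for q, r in zip(quantized, reference))
    scale = cov / var_q
    bias = mean_r - scale * mean_q
    return scale, bias


def compensate(quantized_channels, params):
    """Apply one (scale, bias) pair per channel to the quantized outputs."""
    return [
        [scale * value + bias for value in channel]
        for channel, (scale, bias) in zip(quantized_channels, params)
    ]


# Hypothetical calibration data for a single channel.
quantized = [0.0, 1.0, 2.0]
reference = [0.5, 2.5, 4.5]
scale, bias = fit_channel_affine(quantized, reference)
restored = compensate([quantized], [(scale, bias)])
```

Because the correction is only one multiply and one add per channel, it is plausible that it can be folded into existing per-channel scale and bias terms of a quantized layer, which would be consistent with the abstract's claim that the module integrates into inference engines with little effort.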

[73] ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

Sanghyun Jo,Wooyeol Lee,Ziseok Lee,Kyungsu Kim

Main category: cs.CV

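TL;DR: Proposes ISAC, a training-free method that uses instance-first modeling to address merged or omitted objects in multi-instance scenes, markedly improving multi-class and multi-instance accuracy.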

Details

Motivation: Existing text-to-image diffusion models perform poorly in multi-instance scenes, often merging or omitting objects, and a training-free remedy is needed. Method: Instance-to-Semantic Attention Control (ISAC) uses instance-first modeling and a hierarchical, tree-structured prompt mechanism to separate multiple object instances and align each with its semantic label. Result: ISAC achieves up to 52% average multi-class accuracy and 83% average multi-instance accuracy without any external models. Conclusion: ISAC is an efficient, training-free method that markedly improves object generation in multi-instance scenes. Abstract: Text-to-image diffusion models excel at generating single-instance scenes but struggle with multi-instance scenarios, often merging or omitting objects. Unlike previous training-free approaches that rely solely on semantic-level guidance without addressing instance individuation, our training-free method, Instance-to-Semantic Attention Control (ISAC), explicitly resolves incomplete instance formation and semantic entanglement through an instance-first modeling approach. This enables ISAC to effectively leverage a hierarchical, tree-structured prompt mechanism, disentangling multiple object instances and individually aligning them with their corresponding semantic labels. Without employing any external models, ISAC achieves up to 52% average multi-class accuracy and 83% average multi-instance accuracy by effectively forming disentangled instances. The code will be made available upon publication.

[74] PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter

Yaohua Zha,Yanzi Wang,Hang Guo,Jinpeng Wang,Tao Dai,Bin Chen,Zhihao Ouyang,Xue Yuerong,Ke Chen,Shu-Tao Xia

Main category: cs.CV

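TL;DR: The paper proposes Point Mamba Adapter (PMA), which integrates complementary information from the intermediate layers of a pre-trained model to make point cloud understanding more comprehensive.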

Details

Motivation: Existing methods use only the final output of the pre-trained model and ignore the rich complementary information in intermediate layers, which limits the model's potential. Method: PMA builds an ordered feature sequence, uses Mamba to fuse multi-layer semantics, and optimizes the spatial order with a geometry-constrained gate prompt generator (G2PG). Result: Experiments on multiple point cloud datasets show that PMA significantly improves point cloud understanding by fusing intermediate-layer features. Conclusion: By effectively integrating intermediate-layer information from pre-trained models, PMA offers a new solution for point cloud understanding. Abstract: Applying pre-trained models to assist point cloud understanding has recently become a mainstream paradigm in 3D perception. However, existing application strategies are straightforward, utilizing only the final output of the pre-trained model for various task heads. This neglects the rich complementary information in the intermediate layers, thereby failing to fully unlock the potential of pre-trained models. To overcome this limitation, we propose an orthogonal solution: Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and leverages Mamba to fuse all complementary semantics, thereby promoting comprehensive point cloud understanding. Constructing this ordered sequence is non-trivial due to the inherent isotropy of 3D space. Therefore, we further propose a geometry-constrained gate prompt generator (G2PG) shared across different layers, which applies shared geometric constraints to the output gates of the Mamba and dynamically optimizes the spatial order, thus enabling more effective integration of multi-layer information. Extensive experiments conducted on challenging point cloud datasets across various tasks demonstrate that our PMA elevates the capability for point cloud understanding to a new level by fusing diverse complementary intermediate features. Code is available at https://github.com/zyh16143998882/PMA.

[75] DSOcc: Leveraging Depth Awareness and Semantic Aid to Boost Camera-Based 3D Semantic Occupancy Prediction

Naiyu Fang,Zheyuan Zhou,Kang Wang,Ruibo Li,Lemiao Qiu,Shuyou Zhang,Zhe Wang,Guosheng Lin

Main category: cs.CV

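TL;DR: The paper proposes DSOcc, which uses depth awareness and semantic aid to improve camera-based 3D semantic occupancy prediction, addressing the incorrect feature assignments and insufficient samples of existing methods.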

Details

Motivation: Existing camera-based 3D semantic occupancy prediction methods rely on explicit occupancy state inference, which causes incorrect feature assignments, and insufficient samples limit the learning of occupancy class inference. Method: Jointly infer occupancy state and occupancy class: a soft occupancy confidence is computed with a non-learning method and combined with image features for depth-aware, adaptive implicit occupancy state inference, while well-trained image semantic segmentation and multi-frame fusion aid occupancy class inference. Result: On the SemanticKITTI dataset, DSOcc achieves state-of-the-art performance among camera-based methods. Conclusion: Depth awareness and semantic aid improve the accuracy and robustness of 3D semantic occupancy prediction. Abstract: Camera-based 3D semantic occupancy prediction offers an efficient and cost-effective solution for perceiving surrounding scenes in autonomous driving. However, existing works rely on explicit occupancy state inference, leading to numerous incorrect feature assignments, and insufficient samples restrict the learning of occupancy class inference. To address these challenges, we propose leveraging Depth awareness and Semantic aid to boost camera-based 3D semantic Occupancy prediction (DSOcc). We jointly perform occupancy state and occupancy class inference, where soft occupancy confidence is calculated through a non-learning method and multiplied with image features to make the voxel representation aware of depth, enabling adaptive implicit occupancy state inference. Rather than focusing on improving feature learning, we directly utilize well-trained image semantic segmentation and fuse multiple frames with their occupancy probabilities to aid occupancy class inference, thereby enhancing robustness. Experimental results demonstrate that DSOcc achieves state-of-the-art performance on the SemanticKITTI dataset among camera-based methods.
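
The DSOcc abstract describes multiplying a soft occupancy confidence with image features so that voxels along a camera ray become depth-aware. The sketch below illustrates only that multiplication. The exponential falloff around an estimated depth, the `sigma` parameter, and both function names are assumptions; the abstract states only that the confidence comes from a non-learning method.

```python
import math


def soft_occupancy_confidence(voxel_depths, estimated_depth, sigma=0.5):
    """Confidence in (0, 1] that each voxel on a ray is occupied.

    Assumption: confidence decays exponentially with the distance between
    the voxel depth and an estimated surface depth. This is a stand-in for
    the unspecified non-learning method in the abstract.
    """
    return [math.exp(-abs(d - estimated_depth) / sigma) for d in voxel_depths]


def depth_aware_voxel_features(image_feature, voxel_depths, estimated_depth, sigma=0.5):
    """Scale one pixel's feature vector by the confidence of each voxel on its ray."""
    confidences = soft_occupancy_confidence(voxel_depths, estimated_depth, sigma)
    return [[c * f for f in image_feature] for c in confidences]


# Hypothetical ray with three voxels and an estimated surface at depth 2.0.
feature = [0.2, -1.0, 3.0]
voxel_features = depth_aware_voxel_features(feature, [1.0, 2.0, 3.0], 2.0)
```

In this toy setup the voxel at the estimated depth keeps the full feature, while voxels in front of or behind it receive attenuated copies rather than being hard-assigned as empty, which is the sense in which the occupancy state inference stays implicit.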

[76] OrienText: Surface Oriented Textual Image Generation

Shubham Singh Paliwal,Arushi Jain,Monika Sharma,Vikram Jamwal,Lovekesh Vig

Main category: cs.CV

TL;DR: OrienText uses region-specific surface normals as conditional input to improve how accurately text-to-image generation models render text on complex surfaces.

Details

Motivation: Text in images is crucial in areas such as e-commerce, but existing text-to-image generation models struggle to render text accurately on complex surfaces. Method: Proposes OrienText, which uses region-specific surface normals as conditional input to improve the rendering and orientation of text on complex surfaces. Result: OrienText is validated on a self-curated dataset and compared against existing methods, which it outperforms. Conclusion: OrienText markedly improves the accuracy of generated text on complex surfaces. Abstract: Textual content in images is crucial in e-commerce sectors, particularly in marketing campaigns, product imaging, advertising, and the entertainment industry. Current text-to-image (T2I) generation diffusion models, though proficient at producing high-quality images, often struggle to incorporate text accurately onto complex surfaces with varied perspectives, such as angled views of architectural elements like buildings, banners, or walls. In this paper, we introduce the Surface Oriented Textual Image Generation (OrienText) method, which leverages region-specific surface normals as conditional input to a T2I generation diffusion model. Our approach ensures accurate rendering and correct orientation of the text within the image context. We demonstrate the effectiveness of the OrienText method on a self-curated dataset of images and compare it against the existing textual image generation methods.

[77] RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes

Jiarui Zhang,Zhihao Li,Chong Wang,Bihan Wen

Main category: cs.CV

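TL;DR: RF4D is a radar-based neural field framework designed for novel view synthesis in dynamic outdoor scenes; by explicitly incorporating spatiotemporal information and a feature-level flow module, it significantly improves the modeling of moving objects.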

Details

Motivation: Existing neural field methods are fragile in adverse weather, whereas millimeter-wave radar is robust to environmental changes, yet its combination with neural fields is underexplored. Dynamic outdoor scenes also require spatiotemporal modeling for temporally consistent novel view synthesis. Method: RF4D explicitly incorporates temporal information, introduces a feature-level flow module that predicts latent temporal offsets between adjacent frames, and proposes a power rendering formulation consistent with radar physics. Result: Experiments on public radar datasets show that RF4D is superior in radar measurement synthesis quality and occupancy estimation accuracy, with especially pronounced gains in dynamic outdoor scenes. Conclusion: By combining radar's robustness with spatiotemporal modeling, RF4D offers an efficient solution for novel view synthesis in dynamic outdoor scenes. Abstract: Neural fields (NFs) have demonstrated remarkable performance in scene reconstruction, powering various tasks such as novel view synthesis. However, existing NF methods relying on RGB or LiDAR inputs often exhibit severe fragility to adverse weather, particularly when applied in outdoor scenarios like autonomous driving. In contrast, millimeter-wave radar is inherently robust to environmental changes, but its integration with NFs remains largely underexplored. Moreover, outdoor driving scenarios frequently involve moving objects, making spatiotemporal modeling essential for temporally consistent novel view synthesis. To this end, we introduce RF4D, a radar-based neural field framework specifically designed for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, significantly enhancing its capability to model moving objects. We further introduce a feature-level flow module that predicts latent temporal offsets between adjacent frames, enforcing temporal coherence in dynamic scene modeling. Moreover, we propose a radar-specific power rendering formulation closely aligned with radar sensing physics, improving synthesis accuracy and interoperability. Extensive experiments on public radar datasets demonstrate the superior performance of RF4D in terms of radar measurement synthesis quality and occupancy estimation accuracy, achieving especially pronounced improvements in dynamic outdoor scenarios.

[78] DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization

Shamil Ayupov,Maksim Nakhodnov,Anastasia Yaschenko,Andrey Kuznetsov,Aibek Alanov

Main category: cs.CV

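TL;DR: Proposes a reinforcement-learning-based approach that generates a synthetic paired dataset to balance concept fidelity and contextual alignment in text-to-image generation.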

Details

Motivation: Balance concept fidelity against contextual alignment in personalized text-to-image generation. Method: An RL-based approach that uses external quality metrics to generate a synthetic paired dataset and supports flexible adjustment of the trade-off between image fidelity and textual alignment. Result: The method outperforms a naive baseline in convergence speed and output quality, with its effectiveness shown through multi-step training. Conclusion: The method performs well across various architectures and fine-tuning techniques, and the code is open source. Abstract: Personalized diffusion models have shown remarkable success in Text-to-Image (T2I) generation by enabling the injection of user-defined concepts into diverse contexts. However, balancing concept fidelity with contextual alignment remains a challenging open problem. In this work, we propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training using external quality metrics. These better-worse pairs are specifically constructed to improve both concept fidelity and prompt adherence. Moreover, our approach supports flexible adjustment of the trade-off between image fidelity and textual alignment. Through multi-step training, our approach outperforms a naive baseline in convergence speed and output quality. We conduct extensive qualitative and quantitative analysis, demonstrating the effectiveness of our method across various architectures and fine-tuning techniques. The source code can be found at https://github.com/ControlGenAI/DreamBoothDPO.

[79] RefAV: Towards Planning-Centric Scenario Mining

Cainan Davidson,Deva Ramanan,Neehar Peri

Main category: cs.CV

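TL;DR: The paper revisits spatio-temporal scenario mining with vision-language models (VLMs) and releases RefAV, a large-scale dataset for identifying and localizing safety-critical scenarios in autonomous-vehicle logs.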

Details

Motivation: Traditional scenario mining is slow and error-prone, so a more efficient way to find critical scenarios in massive driving logs is needed. Method: Use vision-language models (VLMs) to detect and localize complex multi-agent interactions described by natural language queries, and introduce the RefAV dataset for evaluation. Result: Experiments show that directly reusing off-the-shelf VLMs performs poorly, indicating that scenario mining poses unique challenges. Conclusion: RefAV provides a new dataset and baselines for scenario mining, but models need further adaptation to the task. Abstract: Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html

[80] Assessing the Use of Face Swapping Methods as Face Anonymizers in Videos

Mustafa İzzet Muştu,Hazım Kemal Ekenel

Main category: cs.CV

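TL;DR: This paper explores the potential of face swapping for privacy protection in video data; by evaluating temporal consistency, anonymity strength, and visual fidelity, it shows that face swapping can hide identities effectively while preserving data quality.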

Details

Motivation: Growing demand for large-scale visual data, together with strict privacy regulations, motivates anonymization methods that hide personal identities without seriously degrading data quality. Method: Evaluate face swapping techniques on temporal consistency, anonymity strength, and visual fidelity to analyze their effectiveness for video privacy protection. Result: Face swapping produces consistent facial transitions and hides identities effectively, making it suitable for privacy-preserving video applications. Conclusion: Face swapping is a viable option for privacy-preserving video applications and lays groundwork for future anonymization research. Abstract: The increasing demand for large-scale visual data, coupled with strict privacy regulations, has driven research into anonymization methods that hide personal identities without seriously degrading data quality. In this paper, we explore the potential of face swapping methods to preserve privacy in video data. Through extensive evaluations focusing on temporal consistency, anonymity strength, and visual fidelity, we find that face swapping techniques can produce consistent facial transitions and effectively hide identities. These results underscore the suitability of face swapping for privacy-preserving video applications and lay the groundwork for future advancements in anonymization-focused face-swapping models.

[81] Facial Attribute Based Text Guided Face Anonymization

Mustafa İzzet Muştu,Hazım Kemal Ekenel

Main category: cs.CV

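TL;DR: The paper proposes a deep-learning-based face anonymization pipeline that uses a diffusion model to generate natural-looking but unrecognizable faces, addressing privacy compliance.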

Details

Motivation: Computer vision applications process large amounts of visual data containing personal information, which makes privacy protection a major challenge; existing practice requires individual consent, which limits the collection of high-quality datasets. Method: A three-stage pipeline: face detection with RetinaNet, feature extraction with VGG-Face, and anonymized face generation with the BrushNet diffusion model. Result: Generates natural-looking, unrecognizable face images that support the creation of privacy-compliant datasets. Conclusion: The method gives computer vision research a privacy-preserving solution that does not require training GANs. Abstract: The increasing prevalence of computer vision applications necessitates handling vast amounts of visual data, often containing personal information. While this technology offers significant benefits, it should not compromise privacy. Data privacy regulations emphasize the need for individual consent for processing personal data, hindering researchers' ability to collect high-quality datasets containing the faces of the individuals. This paper presents a deep learning-based face anonymization pipeline to overcome this challenge. Unlike most of the existing methods, our method leverages recent advancements in diffusion-based inpainting models, eliminating the need for training Generative Adversarial Networks. The pipeline employs a three-stage approach: face detection with RetinaNet, feature extraction with VGG-Face, and realistic face generation using the state-of-the-art BrushNet diffusion model. BrushNet utilizes the entire image, face masks, and text prompts specifying desired facial attributes like age, ethnicity, gender, and expression. This enables the generation of natural-looking images with unrecognizable individuals, facilitating the creation of privacy-compliant datasets for computer vision research.

[82] Unified Alignment Protocol: Making Sense of the Unlabeled Data in New Domains

Sabbir Ahmed,Mamshad Nayeem Rizve,Abdullah Al Arafat,Jacqueline Liu,Rahim Hossain,Mohaiminul Al Nahian,Adnan Siraj Rakin

Main category: cs.CV

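TL;DR: The paper proposes UAP, a new framework for domain generalization in semi-supervised federated learning that uses two-stage training to improve performance on new domains.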

Details

Motivation: Traditional semi-supervised federated learning assumes that training and test data share the same distribution, but domain shift is common in practice, so stronger generalization is needed. Method: Proposes the UAP framework with two alternating training stages: the server model learns to align features with a parametric distribution, and clients then use the server's feature distribution to align their own features. Result: On multiple standard datasets, UAP achieves state-of-the-art generalization in semi-supervised federated learning. Conclusion: UAP addresses domain shift effectively and makes semi-supervised federated learning more practical. Abstract: Semi-Supervised Federated Learning (SSFL) is gaining popularity over conventional Federated Learning in many real-world applications. Due to the practical limitation of limited labeled data on the client side, SSFL considers that participating clients train with unlabeled data, and only the central server has the necessary resources to access limited labeled data, making it an ideal fit for real-world applications (e.g., healthcare). However, traditional SSFL assumes that the data distributions in the training phase and testing phase are the same. In practice, however, domain shifts frequently occur, making it essential for SSFL to incorporate generalization capabilities and enhance its practicality. The core challenge is improving model generalization to new, unseen domains while clients participate in SSFL. However, the decentralized setup of SSFL and unsupervised client training necessitate innovation to achieve improved generalization across domains. To achieve this, we propose a novel framework called the Unified Alignment Protocol (UAP), which consists of an alternating two-stage training process. The first stage involves training the server model to learn and align the features with a parametric distribution, which is subsequently communicated to clients without additional communication overhead. The second stage proposes a novel training algorithm that utilizes the server feature distribution to align client features accordingly. Our extensive experiments on standard domain generalization benchmark datasets across multiple model architectures reveal that the proposed UAP successfully achieves SOTA generalization performance in the SSFL setting.

[83] FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models

Nils Neukirch,Johanna Vielhaben,Nils Strodthoff

Main category: cs.CV

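TL;DR: Proposes a conditional-diffusion-based method for high-fidelity mapping from feature space to input space, to improve understanding of the internal representations of deep neural networks.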

Details

Motivation: The internal representations of deep neural networks are hard to interpret, and existing approaches usually rely on crude approximations, so a more precise mapping is needed. Method: Use a pretrained high-fidelity conditional diffusion model to learn the mapping from feature space to input space in a probabilistic manner. Result: Shows excellent reconstruction across various pretrained image classifiers (such as CNNs and ViTs), validated through qualitative comparisons and robustness analysis. Conclusion: The method has broad potential for understanding the feature spaces of computer vision models. Abstract: Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.

[84] RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy

Aiyue Chen,Bin Dong,Jingru Li,Jing Lin,Yiwu Yao,Gongyi Wang

Main category: cs.CV

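TL;DR: RainFusion is a new training-free sparse attention method that exploits sparsity in video data to accelerate attention computation while preserving video quality, achieving more than a 2× speedup in attention computation.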

Details

Motivation: Diffusion models are computationally expensive for video generation, with 3D attention consuming most of the resources; RainFusion targets this cost. Method: RainFusion identifies three sparse patterns (spatial, temporal, and textural) and uses an Adaptive Recognition Module (ARM) to determine the pattern of each attention head dynamically. Result: Across several open-source models, RainFusion accelerates attention computation by more than 2× with almost no loss in video quality (VBench scores drop by only 0.2%). Conclusion: RainFusion is a plug-and-play method for existing 3D-attention video generation models that markedly improves computational efficiency. Abstract: Video generation using diffusion models is highly computationally intensive, with 3D attention in Diffusion Transformer (DiT) models accounting for over 80% of the total computational resources. In this work, we introduce RainFusion, a novel training-free sparse attention method that exploits the inherent sparsity of visual data to accelerate attention computation while preserving video quality. Specifically, we identify three unique sparse patterns in video generation attention calculations: Spatial Pattern, Temporal Pattern and Textural Pattern. The sparse pattern for each attention head is determined online with negligible overhead (~0.2%) with our proposed ARM (Adaptive Recognition Module) during inference. Our proposed RainFusion is a plug-and-play method that can be seamlessly integrated into state-of-the-art 3D-attention video generation models without additional training or calibration. We evaluate our method on leading open-sourced models including HunyuanVideo, OpenSoraPlan-1.2 and CogVideoX-5B, demonstrating its broad applicability and effectiveness. Experimental results show that RainFusion achieves over 2× speedup in attention computation while maintaining video quality, with only a minimal impact on VBench scores (-0.2%).

[85] Robust Video-Based Pothole Detection and Area Estimation for Intelligent Vehicles with Depth Map and Kalman Smoothing

Dehao Wang,Haohang Zhu,Yiwen Xu,Kaiqi Liu

Main category: cs.CV

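TL;DR: This paper proposes a robust pothole area estimation framework that combines object detection with monocular depth estimation, improving detection accuracy and practicality through an improved ACSH-YOLOv8 model and the MBTP method.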

Details

Motivation: Road potholes seriously threaten driving safety and comfort, and existing vision-based methods are sensitive to camera angle and assume a flat road surface, which leads to large errors in complex environments. Method: Proposes ACSH-YOLOv8 to strengthen feature extraction and small-pothole detection, combines BoT-SORT tracking with depth maps from DepthAnything V2, estimates pothole area with the MBTP method, and uses CDKF to keep results consistent across frames. Result: ACSH-YOLOv8 reaches an AP(50) of 76.6%, a 7.6% improvement over YOLOv8; CDKF optimization makes predictions more robust and the method more practical. Conclusion: The framework markedly improves the accuracy and robustness of pothole detection and area estimation and suits complex real-world environments. Abstract: Road potholes pose a serious threat to driving safety and comfort, making their detection and assessment a critical task in fields such as autonomous driving. When driving vehicles, the operators usually avoid large potholes and approach smaller ones at reduced speeds to ensure safety. Therefore, accurately estimating pothole area is of vital importance. Most existing vision-based methods rely on distance priors to construct geometric models. However, their performance is susceptible to variations in camera angles and typically relies on the assumption of a flat road surface, potentially leading to significant errors in complex real-world environments. To address these problems, a robust pothole area estimation framework that integrates object detection and monocular depth estimation in a video stream is proposed in this paper. First, to enhance pothole feature extraction and improve the detection of small potholes, ACSH-YOLOv8 is proposed with the ACmix module and a small object detection head. Then, the BoT-SORT algorithm is utilized for pothole tracking, while DepthAnything V2 generates depth maps for each frame. With the obtained depth maps and pothole labels, a novel Minimum Bounding Triangulated Pixel (MBTP) method is proposed for pothole area estimation. Finally, a Kalman Filter based on Confidence and Distance (CDKF) is developed to maintain consistency of estimation results across consecutive frames. The results show that the ACSH-YOLOv8 model achieves an AP(50) of 76.6%, representing a 7.6% improvement over YOLOv8. Through CDKF optimization across consecutive frames, pothole predictions become more robust, thereby enhancing the method's practical applicability.
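
The pothole entry relies on Kalman smoothing to keep per-frame area estimates consistent. The abstract does not define how CDKF uses confidence and distance, so the following is only a generic scalar Kalman filter in which the measurement noise grows as detection confidence drops. The function name `smooth_areas`, the noise constants, and the confidence weighting are all assumptions, and the distance term is omitted.

```python
def smooth_areas(measurements, confidences, process_var=1e-3, base_meas_var=0.05):
    """Smooth per-frame area estimates with a scalar Kalman filter.

    Assumption: measurement variance is base_meas_var / confidence, so
    low-confidence detections pull the estimate less. This is not the
    paper's CDKF, only a sketch of confidence-aware smoothing.
    """
    estimate = measurements[0]
    error_var = 1.0  # large initial uncertainty so early frames adapt quickly
    smoothed = [estimate]
    for z, c in zip(measurements[1:], confidences[1:]):
        # Predict: the area is modeled as constant, uncertainty grows slightly.
        error_var += process_var
        # Update: weight the new measurement by its confidence-scaled noise.
        meas_var = base_meas_var / max(c, 1e-6)
        gain = error_var / (error_var + meas_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain
        smoothed.append(estimate)
    return smoothed


# Hypothetical track: frame 4 is a low-confidence outlier.
areas = [1.0, 1.2, 0.8, 1.1, 3.0, 1.0]
confs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.9]
smoothed = smooth_areas(areas, confs)
```

With these toy inputs the outlier at frame 4 moves the estimate only slightly, because its low confidence translates into a small Kalman gain.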

[86] Advancing high-fidelity 3D and Texture Generation with 2.5D latents

Xin Yang,Jiantao Lin,Yingjie Xu,Haodong Li,Yingcong Chen

Main category: cs.CV

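TL;DR: Proposes a new framework for joint generation of 3D geometry and texture, achieving high-quality 3D generation through a 2.5D latent representation and a lightweight decoder.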

Details

Motivation: Existing methods generate 3D geometry and texture separately, which causes incoherence, and uneven data quality hurts performance. Method: Integrate multiview RGB, normal, and coordinate images into a 2.5D latent representation, generate 2.5D content with pre-trained 2D foundation models, then convert it to 3D with a lightweight decoder. Result: The model significantly outperforms existing methods in generating high-quality 3D objects and in geometry-conditioned texture generation. Conclusion: The framework resolves the coherence problem between geometry and texture in 3D generation and delivers superior performance. Abstract: Despite the availability of large-scale 3D datasets and advancements in 3D generative models, the complexity and uneven quality of 3D geometry and texture data continue to hinder the performance of 3D generation techniques. In most existing approaches, 3D geometry and texture are generated in separate stages using different models and non-unified representations, frequently leading to unsatisfactory coherence between geometry and texture. To address these challenges, we propose a novel framework for joint generation of 3D geometry and texture. Specifically, we focus on generating versatile 2.5D representations that can be seamlessly transformed between 2D and 3D. Our approach begins by integrating multiview RGB, normal, and coordinate images into a unified representation, termed 2.5D latents. Next, we adapt pre-trained 2D foundation models for high-fidelity 2.5D generation, utilizing both text and image conditions. Finally, we introduce a lightweight 2.5D-to-3D refiner-decoder framework that efficiently generates detailed 3D representations from 2.5D images. Extensive experiments demonstrate that our model not only excels in generating high-quality 3D objects with coherent structure and color from text and image inputs but also significantly outperforms existing methods in geometry-conditioned texture generation.

[87] Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles

Peng Wang,Xiang Liu,Peidong Liu

Main category: cs.CV

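TL;DR: Proposes a fast 3D scene stylization method whose branched architecture separates structure modeling from appearance shading, achieving multi-view consistency and efficiency.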

Details

Motivation: Existing 3D stylization methods are computationally heavy and depend on dense input images, so a more efficient method without dense inputs is needed. Method: A branched architecture separates structure from appearance and is pre-trained with an identity loss, enabling fast stylization. Result: Evaluations on multiple datasets confirm high-quality stylization that outperforms existing methods. Conclusion: The method stylizes efficiently while preserving 3D structure and is broadly applicable. Abstract: Stylizing 3D scenes instantly while maintaining multi-view consistency and faithfully resembling a style image remains a significant challenge. Current state-of-the-art 3D stylization methods typically involve computationally intensive test-time optimization to transfer artistic features into a pretrained 3D representation, often requiring dense posed input images. In contrast, leveraging recent advances in feed-forward reconstruction models, we demonstrate a novel approach to achieve direct 3D stylization in less than a second using unposed sparse-view scene images and an arbitrary style image. To address the inherent decoupling between reconstruction and stylization, we introduce a branched architecture that separates structure modeling and appearance shading, effectively preventing stylistic transfer from distorting the underlying 3D scene structure. Furthermore, we adapt an identity loss to facilitate pre-training our stylization model through the novel view synthesis task. This strategy also allows our model to retain its original reconstruction capabilities while being fine-tuned for stylization. Comprehensive evaluations, using both in-domain and out-of-domain datasets, demonstrate that our approach produces high-quality stylized 3D content that achieves a superior blend of style and scene appearance, while also outperforming existing methods in terms of multi-view consistency and efficiency.

[88] LPOI: Listwise Preference Optimization for Vision Language Models

Fatemeh Pesaran Zadeh,Yoojin Oh,Gunhee Kim

Main category: cs.CV

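TL;DR: LPOI is an object-aware listwise preference optimization method for large vision-language models (VLMs) that reduces hallucinations through masking and interpolation, without extra annotated data.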

Details

Motivation: Existing methods such as RLHF and DPO tend to overfit or worsen hallucinations, and listwise preference optimization has not been studied for VLMs. Method: LPOI masks a critical object and interpolates to produce a sequence of progressively more complete images, then trains the model to rank them by object visibility. Result: On multiple benchmarks, LPOI outperforms existing methods at reducing hallucinations and improving VLM performance. Conclusion: LPOI offers an efficient preference optimization approach for VLMs that needs no extra annotations. Abstract: Aligning large VLMs with human preferences is a challenging task, as methods like RLHF and DPO often overfit to textual information or exacerbate hallucinations. Although augmenting negative image samples partially addresses these pitfalls, no prior work has employed listwise preference optimization for VLMs, due to the complexity and cost of constructing listwise image samples. In this work, we propose LPOI, the first object-aware listwise preference optimization developed for reducing hallucinations in VLMs. LPOI identifies and masks a critical object in the image, and then interpolates the masked region between the positive and negative images to form a sequence of incrementally more complete images. The model is trained to rank these images in ascending order of object visibility, effectively reducing hallucinations while retaining visual fidelity. LPOI requires no extra annotations beyond standard pairwise preference data, as it automatically constructs the ranked lists through object masking and interpolation. Comprehensive experiments on MMHalBench, AMBER, and Object HalBench confirm that LPOI outperforms existing preference optimization methods in reducing hallucinations and enhancing VLM performance. We make the code available at https://github.com/fatemehpesaran310/lpoi.
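
The LPOI entry describes building a ranked list by interpolating a masked object region between a negative and a positive image. A rough sketch of that construction follows, together with a standard Plackett-Luce listwise loss. Linear pixel blending, the choice of Plackett-Luce, and the names `interpolate_masked_region` and `listwise_ranking_loss` are assumptions; the abstract specifies neither the interpolation scheme nor the loss.

```python
import math


def interpolate_masked_region(negative, positive, mask, steps):
    """Return `steps` images (steps >= 2) blending negative -> positive inside mask.

    Assumption: images are 2D lists of floats that agree outside the mask,
    and blending is linear in pixel space.
    """
    images = []
    for k in range(steps):
        t = k / (steps - 1)  # 0.0 = object hidden, 1.0 = object fully visible
        images.append([
            [(1.0 - t) * n + t * p if m else p
             for n, p, m in zip(neg_row, pos_row, mask_row)]
            for neg_row, pos_row, mask_row in zip(negative, positive, mask)
        ])
    return images


def listwise_ranking_loss(scores_best_first):
    """Plackett-Luce negative log-likelihood for scores listed best-first."""
    loss = 0.0
    for i, score in enumerate(scores_best_first):
        tail = scores_best_first[i:]
        peak = max(tail)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in tail))
        loss -= score - log_norm
    return loss


# Hypothetical 2x2 images where only the top-left pixel belongs to the object.
positive = [[1.0, 1.0], [1.0, 1.0]]
negative = [[0.0, 1.0], [1.0, 1.0]]
mask = [[True, False], [False, False]]
images = interpolate_masked_region(negative, positive, mask, steps=3)
```

In this sketch a model that scores the more complete images higher incurs a lower loss than one that scores them in the reverse order, which is the behavior a listwise objective is meant to reward.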

[89] Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

Davide Lobba,Fulvio Sanguigni,Bin Ren,Marcella Cornia,Rita Cucchiara,Nicu Sebe

Main category: cs.CV

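TL;DR: This paper proposes TEMU-VTOFF, a new architecture that tackles two major challenges of the virtual try-off (VTOFF) task: disentangling garment features and supporting multiple garment categories.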

Details

Motivation: Virtual try-off (VTOFF) aims to generate standardized garment product images from photos of clothed individuals, but existing methods are limited in disentangling garment features and in handling multiple garment categories. Method: TEMU-VTOFF uses a dual DiT-based backbone with a modified multimodal attention mechanism, takes multimodal inputs such as images, text, and masks, and adds an alignment module to refine generated details. Result: Experiments on the VITON-HD and Dress Code datasets show that TEMU-VTOFF reaches state-of-the-art visual quality and garment fidelity. Conclusion: TEMU-VTOFF provides an efficient and general solution for virtual try-off, markedly improving the quality and applicability of generated images. Abstract: While virtual try-on (VTON) systems aim to render a garment onto a target person image, this paper tackles the novel task of virtual try-off (VTOFF), which addresses the inverse problem: generating standardized product images of garments from real-world photos of clothed individuals. Unlike VTON, which must resolve diverse pose and style variations, VTOFF benefits from a consistent and well-defined output format -- typically a flat, lay-down-style representation of the garment -- making it a promising tool for data generation and dataset enhancement. However, existing VTOFF approaches face two major limitations: (i) difficulty in disentangling garment features from occlusions and complex poses, often leading to visual artifacts, and (ii) restricted applicability to single-category garments (e.g., upper-body clothes only), limiting generalization. To address these challenges, we present Text-Enhanced MUlti-category Virtual Try-Off (TEMU-VTOFF), a novel architecture featuring a dual DiT-based backbone with a modified multimodal attention mechanism for robust garment feature extraction. Our architecture is designed to receive garment information from multiple modalities like images, text, and masks to work in a multi-category setting. Finally, we propose an additional alignment module to further refine the generated visual details. Experiments on VITON-HD and Dress Code datasets show that TEMU-VTOFF sets a new state-of-the-art on the VTOFF task, significantly improving both visual quality and fidelity to the target garments.

[90] Minute-Long Videos with Dual Parallelisms

Zeqing Wang,Bowen Zheng,Xingyi Yang,Yuecong Xu,Xinchao Wang

Main category: cs.CV

TL;DR: Proposes DualParal, a distributed inference strategy that parallelizes both temporal frames and model layers across GPUs to reduce the latency and memory cost of long-video generation with DiT-based video diffusion models.

Details

Motivation: DiT-based video diffusion models incur high latency and memory costs when generating long videos. Method: A block-wise denoising scheme passes a sequence of frame blocks through a pipeline of GPUs, each handling a specific block and layer subset, combined with a per-GPU feature cache and a coordinated noise initialization strategy. Result: On 8×RTX 4090 GPUs, the method produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost. Conclusion: DualParal substantially improves the efficiency of long-video generation. Abstract: Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8×RTX 4090 GPUs.

[91] DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

Weihao Xuan,Junjue Wang,Heli Qi,Zihang Chen,Zhuo Zheng,Yanfei Zhong,Junshi Xia,Naoto Yokoya

Main category: cs.CV

TL;DR: DVL-Suite is a framework for analyzing long-term urban dynamics from remote sensing imagery. It comprises DVL-Bench and DVL-Instruct and targets the limitations of multimodal large language models in long-term Earth observation analysis.

Details

Motivation: Existing models are of limited use for long-term Earth observation analysis and, in particular, understand multi-temporal imagery poorly. Method: DVL-Suite comprises 15,063 high-resolution multi-temporal images, covers seven urban understanding tasks, and is used to evaluate 17 state-of-the-art models. Result: Existing models fall short in long-term temporal understanding and quantitative analysis, which motivates the DVL-Instruct dataset and the DVLChat baseline model. Conclusion: DVL-Suite provides a new tool for analyzing long-term urban dynamics, and DVLChat demonstrates the potential of language interaction for remote sensing image analysis. Abstract: Multimodal large language models have demonstrated remarkable capabilities in visual understanding, but their application to long-term Earth observation analysis remains limited, primarily focusing on single-temporal or bi-temporal imagery. To address this gap, we introduce DVL-Suite, a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. Our suite comprises 15,063 high-resolution (1.0m) multi-temporal images spanning 42 megacities in the U.S. from 2005 to 2023, organized into two components: DVL-Bench and DVL-Instruct. The DVL-Bench includes seven urban understanding tasks, from fundamental change detection (pixel-level) to quantitative analyses (regional-level) and comprehensive urban narratives (scene-level), capturing diverse urban dynamics including expansion/transformation patterns, disaster assessment, and environmental challenges. We evaluate 17 state-of-the-art multimodal large language models and reveal their limitations in long-term temporal understanding and quantitative analysis. These challenges motivate the creation of DVL-Instruct, a specialized instruction-tuning dataset designed to enhance models' capabilities in multi-temporal Earth observation. Building upon this dataset, we develop DVLChat, a baseline model capable of both image-level question-answering and pixel-level segmentation, facilitating a comprehensive understanding of city dynamics through language interactions.

[92] Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts

Yue Zhang,Yingzhao Jian,Hehe Fan,Yi Yang,Roger Zimmermann

Main category: cs.CV

TL;DR: Uni3D-MoE is a multimodal large language model based on a sparse Mixture of Experts (MoE) that dynamically selects experts to achieve adaptive 3D multimodal fusion, aiming at more complete and accurate 3D scene understanding.

Details

Motivation: Existing methods typically use only one or a limited subset of 3D modalities, which yields incomplete scene representations and lower interpretive accuracy; moreover, different queries depend on different modalities, so uniform processing may fail to capture query-specific context. Method: Uni3D-MoE integrates multiple 3D modalities (multi-view RGB and depth images, BEV maps, point clouds, and voxel representations) and uses a learnable routing mechanism within a sparse MoE-based LLM to select experts at the token level, with each expert specializing according to learned modality preferences. Result: Extensive evaluations on standard 3D scene understanding benchmarks and specialized datasets demonstrate the effectiveness of Uni3D-MoE. Conclusion: Through adaptive multimodal fusion, Uni3D-MoE significantly improves 3D scene understanding and offers a flexible solution for complex tasks. Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated considerable potential for comprehensive 3D scene understanding. However, existing approaches typically utilize only one or a limited subset of 3D modalities, resulting in incomplete representations of 3D scenes and reduced interpretive accuracy. Furthermore, different types of queries inherently depend on distinct modalities, indicating that uniform processing of all modality tokens may fail to effectively capture query-specific context. To address these challenges, we propose Uni3D-MoE, a sparse Mixture-of-Experts (MoE)-based 3D MLLM designed to enable adaptive 3D multimodal fusion. Specifically, Uni3D-MoE integrates a comprehensive set of 3D modalities, including multi-view RGB and depth images, bird's-eye-view (BEV) maps, point clouds, and voxel representations. At its core, our framework employs a learnable routing mechanism within the sparse MoE-based large language model, dynamically selecting appropriate experts at the token level. Each expert specializes in processing multimodal tokens based on learned modality preferences, thus facilitating flexible collaboration tailored to diverse task-specific requirements. Extensive evaluations on standard 3D scene understanding benchmarks and specialized datasets demonstrate the efficacy of Uni3D-MoE.
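
The token-level routing described in the abstract can be illustrated with a generic top-k gate. This is a minimal sketch of sparse MoE routing in general, not the Uni3D-MoE implementation: the function names, the choice of k = 2, and the toy scaling "experts" are all assumptions made for illustration.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts for one token and renormalize their gates."""
    # Indices of the k largest logits (ties broken by lower index).
    top = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))[:k]
    # Softmax restricted to the selected experts, shifted for numerical stability.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(token, router_logits, experts, k=2):
    """Gate-weighted sum of the selected experts' outputs for a single token vector."""
    out = [0.0] * len(token)
    for idx, gate in top_k_route(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + gate * y for o, y in zip(out, expert_out)]
    return out

# Hypothetical toy experts: each one scales the token by a different constant.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
mixed = moe_forward([1.0, -1.0], [0.1, 2.0, -1.0, 2.0], experts, k=2)
```

Only the selected experts are evaluated for each token, which is what keeps the model sparse. In a trained model the router logits would come from a learned projection of the token rather than being passed in by hand.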

[93] DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response

Junjue Wang,Weihao Xuan,Heli Qi,Zhihao Liu,Kunyi Liu,Yuhan Wu,Hongruixuan Chen,Jian Song,Junshi Xia,Zhuo Zheng,Naoto Yokoya

Main category: cs.CV

TL;DR: The paper presents DisasterM3, a global-scale remote sensing vision-language dataset for disaster assessment and response with multi-hazard, multi-sensor, and multi-task characteristics, and shows performance gains from fine-tuning existing models on it.

Details

Motivation: Existing large vision-language models perform poorly in complex disaster scenes and lack both disaster-specific datasets and cross-sensor capability. Method: DisasterM3 contains 26,988 bi-temporal satellite images and 123k instruction pairs across 5 continents, covers 36 historical disaster events grouped into 10 disaster types, combines SAR and optical sensor data, and defines 9 disaster-related tasks. Result: An evaluation of 14 generic and remote sensing VLMs shows that existing models struggle with the disaster tasks; fine-tuning four VLMs yields stable improvements with cross-sensor and cross-disaster generalization. Conclusion: DisasterM3 fills the gap in disaster-specific datasets, and its multi-task, cross-sensor design markedly improves VLM performance in disaster scenes. Abstract: Large vision-language models (VLMs) have made great achievements in Earth vision. However, complex disaster scenes with diverse disaster types, geographic regions, and satellite sensors have posed new challenges for VLM applications. To fill this gap, we curate a remote sensing vision-language dataset (DisasterM3) for global-scale disaster assessment and response. DisasterM3 includes 26,988 bi-temporal satellite images and 123k instruction pairs across 5 continents, with three characteristics: 1) Multi-hazard: DisasterM3 involves 36 historical disaster events with significant impacts, which are categorized into 10 common natural and man-made disasters. 2) Multi-sensor: Extreme weather during disasters often hinders optical sensor imaging, making it necessary to combine Synthetic Aperture Radar (SAR) imagery for post-disaster scenes. 3) Multi-task: Based on real-world scenarios, DisasterM3 includes 9 disaster-related visual perception and reasoning tasks, harnessing the full potential of VLMs' reasoning ability by progressing from disaster-bearing body recognition to structural damage assessment and object relational reasoning, culminating in the generation of long-form disaster reports. We extensively evaluated 14 generic and remote sensing VLMs on our benchmark, revealing that state-of-the-art models struggle with the disaster tasks, largely due to the lack of a disaster-specific corpus, cross-sensor gap, and damage object counting insensitivity. Focusing on these issues, we fine-tune four VLMs using our dataset and achieve stable improvements across all tasks, with robust cross-sensor and cross-disaster generalization capabilities.

[94] Instance Data Condensation for Image Super-Resolution

Tianhao Peng,Ho Man Kwan,Yuxuan Jiang,Ge Gao,Fan Zhang,Xiaozhong Xu,Shan Liu,David Bull

Main category: cs.CV

TL;DR: Proposes an Instance Data Condensation (IDC) framework for image super-resolution (ISR) that uses Random Local Fourier Feature Extraction and Multi-level Feature Distribution Matching to match or exceed the performance of the original dataset at a 10% condensation rate.

Details

Motivation: Deep learning based image super-resolution relies on large training datasets, which is costly in computation and storage; dataset condensation works well for high-level computer vision tasks but has not been fully explored for ISR. Method: The IDC framework combines Random Local Fourier Feature Extraction with Multi-level Feature Distribution Matching to optimize feature distributions at both global and local levels and to synthesize high-quality training data. Result: On DIV2K, a 10% condensation rate yields a synthetic dataset whose performance is comparable to, and in some cases better than, the full dataset, with excellent training stability. Conclusion: This is reported as the first condensed dataset at 10% data volume to achieve such performance on ISR; the source code and dataset have been released. Abstract: Deep learning based image Super-Resolution (ISR) relies on large training datasets to optimize model generalization; this requires substantial computational and storage resources during training. While dataset condensation has shown potential in improving data efficiency and privacy for high-level computer vision tasks, it has not yet been fully exploited for ISR. In this paper, we propose a novel Instance Data Condensation (IDC) framework specifically for ISR, which achieves instance-level data condensation through Random Local Fourier Feature Extraction and Multi-level Feature Distribution Matching. This aims to optimize feature distributions at both global and local levels and obtain high-quality synthesized training content with fine detail. This framework has been utilized to condense the most commonly used training dataset for ISR, DIV2K, with a 10% condensation rate. The resulting synthetic dataset offers comparable or (in certain cases) even better performance compared to the original full dataset and excellent training stability when used to train various popular ISR models. To the best of our knowledge, this is the first time that a condensed/synthetic dataset (with a 10% data volume) has demonstrated such performance. The source code and the synthetic dataset have been made available at https://github.com/.

[95] Differentiable Solver Search for Fast Diffusion Sampling

Shuai Wang,Zexian Li,Qipeng zhang,Tianhui Song,Xubin Li,Tiezheng Ge,Bo Zheng,Limin Wang

Main category: cs.CV

TL;DR: The paper proposes a differentiable solver search algorithm that optimizes the solver used for diffusion model sampling, improving generation quality and efficiency.

Details

Motivation: Diffusion models generate high-quality samples but are computationally expensive, and existing ODE solvers rely on t-related Lagrange interpolation, which is suboptimal and limits performance. Method: After identifying a compact search space made up of time steps and solver coefficients, the authors propose a differentiable solver search algorithm to find a better solver. Result: With only 10 steps on ImageNet256, SiT-XL/2 and FlowDCN-XL/2 reach FID scores of 2.40 and 2.35, and DiT-XL/2 reaches 2.33, outperforming traditional solvers. Conclusion: The searched solver generalizes across model architectures, resolutions, and model sizes. Abstract: Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion models and reveal a compact search space comprised of time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify a more optimal solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet256 with only 10 steps. Meanwhile, the DDPM model DiT-XL/2 reaches an FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.

[96] ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction

Adeela Islam,Stefano Fiorini,Stuart James,Pietro Morerio,Alessio Del Bue

Main category: cs.CV

TL;DR: ReassembleNet reduces complexity with contour keypoints and graph neural network pooling, improves multimodal feature integration, and substantially outperforms prior methods on rotation and translation error.

Details

Motivation: Existing deep learning methods for reassembly are limited in scalability, multimodality, and real-world applicability, especially for complex shapes and realistic problems. Method: Each input piece is represented as a set of contour keypoints, the most informative keypoints are selected with techniques inspired by graph neural network pooling, multimodal features are combined, and diffusion-based pose estimation recovers the original structure. Result: RMSE for rotation and translation improves on prior methods by 55% and 86%, respectively. Conclusion: ReassembleNet performs well in complex scenarios and offers an efficient solution for reassembly tasks across domains. Abstract: The task of reassembly is a significant challenge across multiple domains, including archaeology, genomics, and molecular docking, requiring the precise placement and orientation of elements to reconstruct an original structure. In this work, we address key limitations in state-of-the-art Deep Learning methods for reassembly, namely i) scalability; ii) multimodality; and iii) real-world applicability: beyond square or simple geometric shapes, realistic and complex erosion, or other real-world problems. We propose ReassembleNet, a method that reduces complexity by representing each input piece as a set of contour keypoints and learning to select the most informative ones by Graph Neural Networks pooling inspired techniques. ReassembleNet effectively lowers computational complexity while enabling the integration of features from multiple modalities, including both geometric and texture data. The model is further enhanced through pretraining on a semi-synthetic dataset. We then apply diffusion-based pose estimation to recover the original structure. We improve on prior methods by 55% and 86% for RMSE Rotation and Translation, respectively.

[97] FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention

Sergey Karpukhin,Vadim Titov,Andrey Kuznetsov,Aibek Alanov

Main category: cs.CV

TL;DR: Proposes FastFace, a framework that redesigns classifier-free guidance and attention mechanisms so that pretrained ID adapters can be applied, without training, to diffusion models accelerated via distillation.

Details

Motivation: Existing ID adapters are mostly trained jointly with base diffusion models, whose multi-step inference is slow; FastFace aims to adapt them to distilled, few-step models. Method: The framework redesigns classifier-free guidance for few-step generation and manipulates attention in decoupled blocks. Result: Identity similarity and generation fidelity improve, and a public evaluation protocol is developed. Conclusion: FastFace provides an efficient, training-free way to adapt ID adapters. Abstract: In recent years, a plethora of identity-preserving adapters for personalized generation with diffusion models have been released. Their main disadvantage is that they are predominantly trained jointly with base diffusion models, which suffer from slow multi-step inference. This work aims to tackle the challenge of training-free adaptation of pretrained ID-adapters to diffusion models accelerated via distillation. Through careful re-design of classifier-free guidance for few-step stylistic generation and attention manipulation mechanisms in decoupled blocks to improve identity similarity and fidelity, we propose the universal FastFace framework. Additionally, we develop a disentangled public evaluation protocol for ID-preserving adapters.

[98] RoBiS: Robust Binary Segmentation for High-Resolution Industrial Images

Xurui Li,Zhonesheng Jiang,Tingxuan Ai,Yu Zhou

Main category: cs.CV

TL;DR: The RoBiS framework combines Swin-Cropping, data augmentation, and adaptive binarization to substantially improve anomaly detection performance on the MVTec AD 2 benchmark.

Details

Motivation: Existing methods degrade severely in complex real-world scenarios. Method: Swin-Cropping preprocessing, data augmentation, and an adaptive binarization strategy. Result: On MVTec AD 2, SegF1 improves by 29.2 percentage points on Test_private (21.8% to 51.00%) and by 29.82 points on Test_private_mixed (16.7% to 46.52%). Conclusion: RoBiS performs well for anomaly detection in real-world scenarios. Abstract: Robust unsupervised anomaly detection (AD) in real-world scenarios is an important task. Current methods exhibit severe performance degradation on the MVTec AD 2 benchmark due to its complex real-world challenges. To solve this problem, we propose a robust framework RoBiS, which consists of three core modules: (1) Swin-Cropping, a high-resolution image pre-processing strategy to preserve the information of small anomalies through overlapping window cropping. (2) Data augmentation with noise addition and lighting simulation is applied to the training data to improve the robustness of the AD model. We use INP-Former as our baseline, which could generate better results on the various sub-images. (3) The traditional statistical-based binarization strategy (mean+3std) is combined with our previous work, MEBin (published in CVPR2025), for joint adaptive binarization. Then, SAM is further employed to refine the segmentation results. Compared with some methods reported by MVTec AD 2, our RoBiS achieves a 29.2% SegF1 improvement (from 21.8% to 51.00%) on Test_private and 29.82% SegF1 gains (from 16.7% to 46.52%) on Test_private_mixed. Code is available at https://github.com/xrli-U/RoBiS.
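
Two of the ingredients named in the abstract, overlapping window cropping and the mean+3std threshold, are simple enough to sketch. The code below is an illustrative reading rather than the RoBiS code: the window and stride values, the border handling, and the choice of which scores feed the statistics are assumptions, and the MEBin and SAM stages are not covered.

```python
import statistics

def window_starts(length, window, stride):
    """Start offsets of overlapping windows that cover [0, length) completely."""
    if window >= length:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    # Add a final window flush with the border so no pixels are left uncovered.
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts

def crop_boxes(height, width, window, stride):
    """(top, left, bottom, right) boxes for every overlapping crop of an image."""
    return [(t, l, min(t + window, height), min(l + window, width))
            for t in window_starts(height, window, stride)
            for l in window_starts(width, window, stride)]

def mean_plus_3std_threshold(scores):
    """Statistical threshold: mean + 3 * population standard deviation of the scores."""
    return statistics.mean(scores) + 3 * statistics.pstdev(scores)

def binarize(score_map, threshold):
    """Mark pixels whose anomaly score exceeds the threshold."""
    return [[1 if s > threshold else 0 for s in row] for row in score_map]
```

A stride smaller than the window gives the overlap, so a small anomaly near a crop border still appears whole in a neighbouring crop. Per-crop score maps would then need to be stitched back together (for example by averaging overlaps) before thresholding.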

[99] Normalized Attention Guidance: Universal Negative Guidance for Diffusion Model

Dar-Yen Chen,Hmrishav Bandyopadhyay,Kai Zou,Yi-Zhe Song

Main category: cs.CV

TL;DR: The paper proposes NAG, an efficient, training-free negative guidance method that addresses the failure of negative guidance in diffusion models under few-step sampling.

Details

Motivation: Negative guidance is a fundamental challenge in diffusion models, especially under few-step sampling, where Classifier-Free Guidance (CFG) fails because the predictions of the positive and negative branches diverge. Method: NAG applies extrapolation in attention space with L1-based normalization and refinement. Result: NAG improves text alignment (CLIP Score), fidelity (FID, PFID), and human-perceived quality (ImageReward), and works across architectures and modalities. Conclusion: As a model-agnostic inference-time method that requires no retraining, NAG provides efficient negative guidance for modern diffusion frameworks. Abstract: Negative guidance -- explicitly suppressing unwanted attributes -- remains a fundamental challenge in diffusion models, particularly in few-step sampling regimes. While Classifier-Free Guidance (CFG) works well in standard settings, it fails under aggressive sampling step compression due to divergent predictions between positive and negative branches. We present Normalized Attention Guidance (NAG), an efficient, training-free mechanism that applies extrapolation in attention space with L1-based normalization and refinement. NAG restores effective negative guidance where CFG collapses while maintaining fidelity. Unlike existing approaches, NAG generalizes across architectures (UNet, DiT), sampling regimes (few-step, multi-step), and modalities (image, video), functioning as a universal plug-in with minimal computational overhead. Through extensive experimentation, we demonstrate consistent improvements in text alignment (CLIP Score), fidelity (FID, PFID), and human-perceived quality (ImageReward). Our ablation studies validate each design component, while user studies confirm significant preference for NAG-guided outputs. As a model-agnostic inference-time approach requiring no retraining, NAG provides effortless negative guidance for all modern diffusion frameworks -- pseudocode in the Appendix!
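
The abstract names three steps: extrapolation in attention space, L1-based normalization, and refinement. One plausible reading of those steps is sketched below on plain Python lists. The abstract does not give the formulas, so the exact form of each step and the parameters `scale`, `tau`, and `alpha` are assumptions; the paper's appendix pseudocode is the authoritative version.

```python
def l1_norm(vec):
    """Sum of absolute values of a feature vector."""
    return sum(abs(x) for x in vec)

def normalized_attention_guidance(z_pos, z_neg, scale=4.0, tau=2.5, alpha=0.25):
    """Extrapolate away from the negative branch, cap the L1 growth, then blend.

    z_pos / z_neg stand for attention outputs computed with the positive and
    negative prompts; scale, tau and alpha are hypothetical hyperparameters.
    """
    # 1) Extrapolation: push the positive features away from the negative ones.
    z_ext = [p + scale * (p - n) for p, n in zip(z_pos, z_neg)]
    # 2) L1-based normalization: limit how far the L1 norm may grow relative to z_pos.
    ratio = l1_norm(z_ext) / max(l1_norm(z_pos), 1e-12)
    if ratio > tau:
        z_ext = [x * tau / ratio for x in z_ext]
    # 3) Refinement: blend the guided features back toward the positive features.
    return [alpha * e + (1.0 - alpha) * p for e, p in zip(z_ext, z_pos)]
```

When the two branches agree, the extrapolation term vanishes and the output equals the positive features. When they diverge sharply, the cap bounds the magnitude of the guided features, which is the situation in which the abstract says plain CFG breaks down.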

[100] Boosting Adversarial Transferability via High-Frequency Augmentation and Hierarchical-Gradient Fusion

Yayin Zheng,Chen Wan,Zihong Guo,Hailing Kuang,Xiaohai Lu

Main category: cs.CV

TL;DR: This paper proposes FSA, a new adversarial attack framework that combines frequency-domain and spatial-domain transformations to raise the attack success rate of adversarial examples against black-box defense models.

Details

Motivation: Adversarial attacks challenge the security of machine learning models, and existing transferability methods focus on the spatial domain while neglecting the frequency domain. Method: FSA combines two techniques, High-Frequency Augmentation (Fourier transform with frequency-selective amplification) and Hierarchical-Gradient Fusion (multi-scale gradient decomposition and fusion), to craft adversarial examples. Result: FSA outperforms state-of-the-art methods across various black-box models, raising the average attack success rate by 23.6% over BSR (CVPR 2024) on eight black-box defense models. Conclusion: Combining the frequency and spatial domains gives a more effective adversarial attack. Abstract: Adversarial attacks have become a significant challenge in the security of machine learning models, particularly in the context of black-box defense strategies. Existing methods for enhancing adversarial transferability primarily focus on the spatial domain. This paper presents Frequency-Space Attack (FSA), a new adversarial attack framework that effectively integrates frequency-domain and spatial-domain transformations. FSA combines two key techniques: (1) High-Frequency Augmentation, which applies Fourier transform with frequency-selective amplification to diversify inputs and emphasize the critical role of high-frequency components in adversarial attacks, and (2) Hierarchical-Gradient Fusion, which merges multi-scale gradient decomposition and fusion to capture both global structures and fine-grained details, resulting in smoother perturbations. Our experiments demonstrate that FSA consistently outperforms state-of-the-art methods across various black-box models. Notably, our proposed FSA achieves an average attack success rate increase of 23.6% compared with BSR (CVPR 2024) on eight black-box defense models.
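
Frequency-selective amplification can be shown on a one-dimensional signal with a naive discrete Fourier transform. This is a toy sketch of the general idea only: images would need a 2D transform, and the `cutoff` and `gain` parameters and the hard frequency mask are assumptions, not the settings used by FSA.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft(), including the 1/n normalization."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def amplify_high_frequencies(signal, cutoff, gain):
    """Scale every DFT bin whose frequency index exceeds `cutoff` by `gain`."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        freq = min(k, n - k)  # bins k and n-k carry the same frequency magnitude
        if freq > cutoff:
            spectrum[k] *= gain
    # Scaling conjugate bins equally keeps the result real up to rounding error.
    return [value.real for value in idft(spectrum)]
```

A constant signal has energy only in the zero-frequency bin and passes through unchanged, while a rapidly alternating signal is scaled by `gain`. Sampling `gain` at random per input would be one way to turn this into an augmentation.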

[101] Making Every Event Count: Balancing Data Efficiency and Accuracy in Event Camera Subsampling

Hesam Araghi,Jan van Gemert,Nergis Tomen

Main category: cs.CV

TL;DR: This paper systematically evaluates how six hardware-friendly subsampling methods affect event video classification and proposes a density-based subsampling method that improves classification accuracy in sparse regimes.

Details

Motivation: The high event rates of event cameras make data transmission and processing difficult, and the effect of subsampling on downstream tasks is underexplored. Method: Six subsampling methods are evaluated with convolutional neural networks, and a causal density-based subsampling method is introduced. Result: Density-based subsampling improves classification accuracy in sparse regimes, and the analysis identifies key factors that affect subsampling performance. Conclusion: The study offers guidance on hardware-efficient subsampling strategies that balance data efficiency and task accuracy. Abstract: Event cameras offer high temporal resolution and power efficiency, making them well-suited for edge AI applications. However, their high event rates present challenges for data transmission and processing. Subsampling methods provide a practical solution, but their effect on downstream visual tasks remains underexplored. In this work, we systematically evaluate six hardware-friendly subsampling methods using convolutional neural networks for event video classification on various benchmark datasets. We hypothesize that events from high-density regions carry more task-relevant information and are therefore better suited for subsampling. To test this, we introduce a simple causal density-based subsampling method, demonstrating improved classification accuracy in sparse regimes. Our analysis further highlights key factors affecting subsampling performance, including sensitivity to hyperparameters and failure cases in scenarios with large event count variance. These findings provide insights for utilization of hardware-efficient subsampling strategies that balance data efficiency and task accuracy. The code for this paper will be released at: https://github.com/hesamaraghi/event-camera-subsampling-methods.
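
A causal density criterion keeps an event only if enough earlier events occurred nearby. The abstract does not specify how density is estimated, so the sketch below is one hypothetical realization: the `(timestamp, x, y)` event layout, the square neighbourhood, the time window, and the threshold are all assumptions.

```python
from collections import deque

def density_subsample(events, radius=1, window=10_000, min_neighbors=2):
    """Keep events with enough recent spatial neighbours, using past events only.

    `events` is a list of (timestamp, x, y) tuples sorted by timestamp.
    """
    recent = deque()  # past events that are still inside the time window
    kept = []
    for t, x, y in events:
        # Forget events that are older than the time window.
        while recent and t - recent[0][0] > window:
            recent.popleft()
        # Local density: past events within a (2*radius+1)-pixel square.
        neighbors = sum(1 for _, px, py in recent
                        if abs(px - x) <= radius and abs(py - y) <= radius)
        if neighbors >= min_neighbors:
            kept.append((t, x, y))
        recent.append((t, x, y))
    return kept

# Four events near pixel (5, 5) and one isolated event at (50, 50).
events = [(0, 5, 5), (10, 5, 6), (20, 6, 5), (30, 50, 50), (40, 5, 5)]
kept = density_subsample(events)
```

Because the decision for each event depends only on earlier events, the filter can run online. The linear scan over `recent` is for clarity; a hardware-friendly version would more likely keep per-pixel counters.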

[102] Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models

Xudong Tan,Yaoxin Yang,Peng Ye,Jialin Zheng,Bizhe Bai,Xinyi Wang,Jia Hao,Tao Chen

Main category: cs.CV

TL;DR: FlashVLA is a training-free, plug-and-play acceleration framework that uses action reuse and visual token pruning to substantially reduce the inference cost of Vision-Language-Action models.

Details

Motivation: The high inference cost of VLA models limits real-time deployment and edge applications; prior work focuses on architectural optimization, whereas FlashVLA targets redundancy through action reuse and visual token pruning. Method: FlashVLA uses a token-aware action reuse mechanism and an information-guided visual token selection strategy to cut redundant decoding and low-contribution tokens. Result: On the LIBERO benchmark, FlashVLA reduces FLOPs by 55.7% and latency by 36.0%, with only a 0.7% drop in task success rate. Conclusion: FlashVLA enables lightweight, low-latency VLA inference without retraining. Abstract: Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. However, their high inference cost, stemming from large-scale token computation and autoregressive decoding, poses significant challenges for real-time deployment and edge applications. While prior work has primarily focused on architectural optimization, we take a different perspective by identifying a dual form of redundancy in VLA models: (i) high similarity across consecutive action steps, and (ii) substantial redundancy in visual tokens. Motivated by these observations, we propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models. FlashVLA improves inference efficiency through a token-aware action reuse mechanism that avoids redundant decoding across stable action steps, and an information-guided visual token selection strategy that prunes low-contribution tokens. Extensive experiments on the LIBERO benchmark show that FlashVLA reduces FLOPs by 55.7% and latency by 36.0%, with only a 0.7% drop in task success rate. These results demonstrate the effectiveness of FlashVLA in enabling lightweight, low-latency VLA inference without retraining.

[103] Sci-Fi: Symmetric Constraint for Frame Inbetweening

Liuhan Chen,Xiaodong Cun,Xiaoyu Li,Xianyi He,Shenghai Yuan,Jie Chen,Ying Shan,Li Yuan

Main category: cs.CV

TL;DR: The paper proposes Sci-Fi, a framework that uses an improved injection mechanism to constrain the start and end frames symmetrically, addressing the asymmetric control found in existing frame inbetweening methods.

Details

Motivation: Existing frame inbetweening methods control the start and end frames with unequal strength, which leads to inconsistent motion or appearance collapse in the generated intermediate frames. Method: Sci-Fi introduces EF-Net, a lightweight module that encodes only the end frame and expands it into temporally adaptive frame-wise features, strengthening the end-frame constraint. Result: Experiments show that Sci-Fi produces more harmonious transitions in various scenarios and outperforms other baselines. Conclusion: The symmetric constraint mechanism resolves the control asymmetry in frame inbetweening and improves generation quality. Abstract: Frame inbetweening aims to synthesize intermediate video sequences conditioned on the given start and end frames. Current state-of-the-art methods mainly extend large-scale pre-trained Image-to-Video Diffusion models (I2V-DMs) by incorporating end-frame constraints via directly fine-tuning or omitting training. We identify a critical limitation in their design: Their injections of the end-frame constraint usually utilize the same mechanism that originally imposed the start-frame (single image) constraint. However, since the original I2V-DMs are adequately trained for the start-frame condition in advance, naively introducing the end-frame constraint by the same mechanism with much less (even zero) specialized training probably can't make the end frame have a strong enough impact on the intermediate content like the start frame. This asymmetric control strength of the two frames over the intermediate content likely leads to inconsistent motion or appearance collapse in generated frames. To efficiently achieve symmetric constraints of start and end frames, we propose a novel framework, termed Sci-Fi, which applies a stronger injection for the constraint of a smaller training scale. Specifically, it deals with the start-frame constraint as before, while introducing the end-frame constraint by an improved mechanism. The new mechanism is based on a well-designed lightweight module, named EF-Net, which encodes only the end frame and expands it into temporally adaptive frame-wise features injected into the I2V-DM. This makes the end-frame constraint as strong as the start-frame constraint, enabling our Sci-Fi to produce more harmonious transitions in various scenarios. Extensive experiments prove the superiority of our Sci-Fi compared with other baselines.

[104] Is Hyperbolic Space All You Need for Medical Anomaly Detection?

Alvaro Gonzalez-Jimenez,Simone Lionetti,Ludovic Amruthalingam,Philippe Gottfrois,Fabian Gröger,Marc Pouly,Alexander A. Navarini

Main category: cs.CV

TL;DR: The paper proposes projecting feature representations into hyperbolic space for medical anomaly detection and reports better performance than conventional Euclidean approaches.

Details

Motivation: Conventional methods extract features in Euclidean space, which fails to capture the hierarchical relationships among features and leads to suboptimal anomaly detection. Method: Feature representations are projected into hyperbolic space, aggregated based on confidence levels, and used to classify samples as healthy or anomalous. Result: The hyperbolic approach outperforms Euclidean frameworks on multiple medical datasets with higher AUROC scores, and it remains robust in few-shot scenarios. Conclusion: Hyperbolic space is a strong and practical alternative for medical anomaly detection. Abstract: Medical anomaly detection has emerged as a promising solution to challenges in data availability and labeling constraints. Traditional methods extract features from different layers of pre-trained networks in Euclidean space; however, Euclidean representations fail to effectively capture the hierarchical relationships within these features, leading to suboptimal anomaly detection performance. We propose a novel yet simple approach that projects feature representations into hyperbolic space, aggregates them based on confidence levels, and classifies samples as healthy or anomalous. Our experiments demonstrate that hyperbolic space consistently outperforms Euclidean-based frameworks, achieving higher AUROC scores at both image and pixel levels across multiple medical benchmark datasets. Additionally, we show that hyperbolic space exhibits resilience to parameter variations and excels in few-shot scenarios, where healthy images are scarce. These findings underscore the potential of hyperbolic space as a powerful alternative for medical anomaly detection. The project website can be found at https://hyperbolic-anomalies.github.io
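
A common way to project Euclidean features into hyperbolic space is the exponential map at the origin of the Poincaré ball. The sketch below uses that standard construction as an assumption; the abstract does not say which hyperbolic model, curvature, or confidence measure the authors use, and treating distance from the origin as a confidence proxy is only a frequent convention.

```python
import math

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    factor = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [factor * x for x in v]

def dist_to_origin(x, c=1.0):
    """Hyperbolic distance from the origin to a point strictly inside the ball."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return 2.0 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm)

# A Euclidean feature of norm 0.5 maps to a point at hyperbolic distance 2 * 0.5.
point = expmap0([0.3, 0.4])
radius = dist_to_origin(point)
```

Every projected point lies inside the ball of radius 1/sqrt(c), and features with larger Euclidean norm land closer to the boundary. For very large inputs `tanh` saturates to 1.0 in floating point, so practical implementations usually clip the norm slightly below the boundary before calling `atanh`.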

[105] Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning

Lintao Xu,Yinghao Wang,Chaohui Wang

Main category: cs.CV

TL;DR: The paper proposes MoDOT, a network that jointly estimates depth and occlusion boundaries (OBs), using the CASM module and the OBDCL loss to improve performance; experiments show state-of-the-art results on synthetic and real datasets.

Details

Motivation: Occlusion boundary estimation and monocular depth estimation (MDE) provide geometric cues for each other, so estimating them jointly can improve scene understanding and 3D reconstruction. Method: MoDOT combines the CASM module (cross-attention multi-scale strip convolution) with the occlusion-aware OBDCL loss to jointly optimize depth and OB estimation. Result: The method reaches the state of the art on the proposed synthetic datasets and on the real NYUD-v2 dataset, and its real-world depth transfer results are comparable to competing methods. Conclusion: Jointly estimating depth and OBs is clearly beneficial and the MoDOT design is effective; code, pre-trained models, and datasets will be released. Abstract: Occlusion Boundary Estimation (OBE) identifies boundaries arising from both inter-object occlusions and self-occlusion within individual objects, distinguishing intrinsic object edges from occlusion-induced contours to improve scene understanding and 3D reconstruction capacity. This is closely related to Monocular Depth Estimation (MDE), which infers depth from a single image, as occlusion boundaries provide critical geometric cues for resolving depth ambiguities, while depth priors can conversely refine occlusion reasoning in complex scenes. In this paper, we propose a novel network, MoDOT, the first to jointly estimate depth and OBs. We propose CASM, a cross-attention multi-scale strip convolution module, which leverages mid-level OB features to significantly enhance depth prediction. Additionally, we introduce an occlusion-aware loss function, OBDCL, which encourages sharper and more accurate depth boundaries. Extensive experiments on both real and synthetic datasets demonstrate the mutual benefits of jointly estimating depth and OB, and highlight the effectiveness of our model design. Our method achieves the state-of-the-art (SOTA) on both our proposed synthetic datasets and one popular real dataset, NYUD-v2, significantly outperforming multi-task baselines. Besides, without domain adaptation, results on real-world depth transfer are comparable to the competitors, while preserving sharp occlusion boundaries for geometric fidelity. We will release our code, pre-trained models, and datasets to support future research in this direction.

[106] CROP: Contextual Region-Oriented Visual Token Pruning

Jiawei Guo,Feifei Zhai,Pu Jian,Qianrun Wei,Yu Zhou

Main category: cs.CV

TL;DR: The CROP framework compresses visual tokens in two steps, localization and pruning, removing redundant information and improving performance on VQA tasks.

Details

Motivation: Existing VLM methods process entire images, producing redundant visual tokens that increase the computational burden. Method: CROP first localizes the region relevant to the question and then prunes tokens with the PLC and ILP strategies. Result: CROP performs strongly across a range of VQA tasks and outperforms existing methods. Conclusion: CROP compresses visual tokens efficiently and offers a new solution for VQA tasks. Abstract: Current VLM-based VQA methods often process entire images, leading to excessive visual tokens that include redundant information irrelevant to the posed question. This abundance of unnecessary image details creates numerous visual tokens, drastically increasing memory and computational requirements in VLMs. To address this, we propose Contextual Region-Oriented Visual Token Pruning (CROP), a novel framework to compress visual tokens through a two-step process: Localization and Pruning. Specifically, CROP first employs an efficient model to identify the contextual region relevant to the input query. Subsequently, two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early LLM layers guided by the identified contextual region. Extensive experiments on a wide range of VQA tasks demonstrate that CROP significantly outperforms existing visual token pruning methods and achieves state-of-the-art performance. Our code and datasets will be made available.

[107] 3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics-Based Appearance-Medium Decoupling

Jieyu Yuan,Yujun Li,Yuanlin Zhang,Chunle Guo,Xiongxin Tang,Ruixing Wang,Chongyi Li

Main category: cs.CV

TL;DR: The paper proposes a physics-based framework for novel view synthesis in underwater scene reconstruction. It separates object appearance from water medium effects through Gaussian modeling and adds a distance-guided optimization strategy, improving rendering quality and restoration accuracy.

Details

Motivation: Underwater scene reconstruction is difficult because of complex light-medium interactions: conventional volume rendering assumptions break down in inhomogeneous media, and 3D Gaussian Splatting (3DGS) performs poorly underwater. Method: A physics-based framework separates object appearance from water medium effects through tailored Gaussian modeling, and introduces appearance embeddings and a distance-guided optimization strategy with pseudo-depth supervision and depth regularization. Result: Experiments show significant improvements in rendering quality and restoration accuracy over existing methods. Conclusion: The approach achieves both high-quality novel view synthesis and physically accurate scene restoration, providing an effective solution for underwater scene reconstruction. Abstract: Novel view synthesis for underwater scene reconstruction presents unique challenges due to complex light-media interactions. Optical scattering and absorption in the water body introduce inhomogeneous medium attenuation that disrupts the conventional volume rendering assumption of a uniform propagation medium. While 3D Gaussian Splatting (3DGS) offers real-time rendering capabilities, it struggles with underwater inhomogeneous environments where scattering media introduce artifacts and inconsistent appearance. In this study, we propose a physics-based framework that disentangles object appearance from water medium effects through tailored Gaussian modeling. Our approach introduces appearance embeddings, which are explicit medium representations for backscatter and attenuation, enhancing scene consistency. In addition, we propose a distance-guided optimization strategy that leverages pseudo-depth maps as supervision with depth regularization and scale penalty terms to improve geometric fidelity. By integrating the proposed appearance and medium modeling components via an underwater imaging model, our approach achieves both high-quality novel view synthesis and physically accurate scene restoration. Experiments demonstrate our significant improvements in rendering quality and restoration accuracy over existing methods. The project page is available at https://bilityniu.github.io/3D-UIR

[108] Plenodium: UnderWater 3D Scene Reconstruction with Plenoptic Medium Representation

Changguanng Wu,Jiangxin Dong,Chengjian Li,Jinhui Tang

Main category: cs.CV

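TL;DR: Plenodium is an efficient 3D representation framework that jointly models objects and participating media. By combining directional and positional information through spherical harmonics encoding, it markedly improves underwater scene reconstruction accuracy.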

Details

Motivation: Existing medium representations rely solely on view-dependent modeling and cannot fully capture the complexity of underwater environments, so a more comprehensive representation is needed. Method: Proposes pseudo-depth Gaussian complementation to augment COLMAP point clouds, and develops a depth ranking regularized loss to optimize scene geometry and depth-map consistency. Result: Achieves significant 3D reconstruction improvements on real-world underwater datasets and validates restoration capability on a simulated dataset. Conclusion: Plenodium performs well in underwater scene reconstruction; the code and dataset are open-sourced. Abstract: We present Plenodium (plenoptic medium), an effective and efficient 3D representation framework capable of jointly modeling both objects and participating media. In contrast to existing medium representations that rely solely on view-dependent modeling, our novel plenoptic medium representation incorporates both directional and positional information through spherical harmonics encoding, enabling highly accurate underwater scene reconstruction. To address the initialization challenge in degraded underwater environments, we propose the pseudo-depth Gaussian complementation to augment COLMAP-derived point clouds with robust depth priors. In addition, a depth ranking regularized loss is developed to optimize the geometry of the scene and improve the ordinal consistency of the depth maps. Extensive experiments on real-world underwater datasets demonstrate that our method achieves significant improvements in 3D reconstruction. Furthermore, we construct a simulated dataset with ground truth and a controllable scattering medium to demonstrate the restoration capability of our method in underwater scenarios. Our code and dataset are available at https://plenodium.github.io/.

[109] DiMoSR: Feature Modulation via Multi-Branch Dilated Convolutions for Efficient Image Super-Resolution

M. Akin Yilmaz,Ahmet Bilican,A. Murat Tekalp

Main category: cs.CV

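TL;DR: DiMoSR is a new lightweight single image super-resolution (SISR) architecture that enhances feature representation through modulation and captures contextual information with multi-branch dilated convolutions, improving performance while keeping computation efficient.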

Details

Motivation: To address the trade-off between reconstruction quality and model efficiency in lightweight SISR and to explore architectural paradigms beyond attention. Method: DiMoSR uses multi-branch dilated convolutions to enlarge the receptive field and feature modulation to complement attention. Result: DiMoSR outperforms existing lightweight methods on multiple benchmark datasets, with better PSNR and SSIM at comparable or lower computational complexity. Conclusion: DiMoSR confirms that feature modulation and attention are complementary and offers new ideas for efficient network design. Abstract: Balancing reconstruction quality versus model efficiency remains a critical challenge in lightweight single image super-resolution (SISR). Despite the prevalence of attention mechanisms in recent state-of-the-art SISR approaches that primarily emphasize or suppress feature maps, alternative architectural paradigms warrant further exploration. This paper introduces DiMoSR (Dilated Modulation Super-Resolution), a novel architecture that enhances feature representation through modulation to complement attention in lightweight SISR networks. The proposed approach leverages multi-branch dilated convolutions to capture rich contextual information over a wider receptive field while maintaining computational efficiency. Experimental results demonstrate that DiMoSR outperforms state-of-the-art lightweight methods across diverse benchmark datasets, achieving superior PSNR and SSIM metrics with comparable or reduced computational complexity. Through comprehensive ablation studies, this work not only validates the effectiveness of DiMoSR but also provides critical insights into the interplay between attention mechanisms and feature modulation to guide future research in efficient network design. The code and model weights to reproduce our results are available at: https://github.com/makinyilmaz/DiMoSR
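
Why dilated branches widen the receptive field at no extra parameter cost can be shown with a little arithmetic. The sketch below is not taken from the DiMoSR code: the function names and the dilation rates (1, 2, 3) are illustrative assumptions, and only the standard span formula for a dilated kernel is used.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span (in pixels) covered by one dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def branch_report(kernel_size: int, dilations: tuple) -> dict:
    """Per-branch span and weight count for parallel dilated convolutions.

    The weight count per channel pair depends only on kernel_size,
    so every branch costs the same regardless of its dilation.
    """
    return {
        d: {"span": effective_kernel(kernel_size, d), "weights": kernel_size ** 2}
        for d in dilations
    }


# Hypothetical 3x3 branches; the rates are assumptions, not values from the paper.
report = branch_report(3, (1, 2, 3))
for dilation, info in report.items():
    print(f"dilation={dilation}: span={info['span']}, weights={info['weights']}")
```

Each hypothetical branch keeps 9 weights per channel pair while its span grows from 3 to 5 to 7 pixels, which is the efficiency argument the abstract makes.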

[110] Supervised and self-supervised land-cover segmentation & classification of the Biesbosch wetlands

Eva Gmelich Meijling,Roberto Del Prete,Arnoud Visser

Main category: cs.CV

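TL;DR: The study proposes a method combining supervised and self-supervised learning for wetland land-cover classification, addressing the scarcity of annotated high-resolution satellite imagery and improving classification accuracy.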

Details

Motivation: Wetland land-cover classification is vital for environmental monitoring and ecosystem management, but annotated data for high-resolution satellite imagery is scarce, which limits supervised learning. Method: A U-Net model trained on Sentinel-2 imagery of six wetland regions in the Netherlands, combining supervised learning with self-supervised learning (SSL) via autoencoder pretraining to improve accuracy. Result: Baseline accuracy is 85.26%, rising to 88.23% with SSL pretraining. High-resolution imagery yields sharper boundaries and finer detail. Conclusion: The study addresses label scarcity and contributes a public dataset for wetland classification. Abstract: Accurate wetland land-cover classification is essential for environmental monitoring, biodiversity assessment, and sustainable ecosystem management. However, the scarcity of annotated data, especially for high-resolution satellite imagery, poses a significant challenge for supervised learning approaches. To tackle this issue, this study presents a methodology for wetland land-cover segmentation and classification that adopts both supervised and self-supervised learning (SSL). We train a U-Net model from scratch on Sentinel-2 imagery across six wetland regions in the Netherlands, achieving a baseline model accuracy of 85.26%. Addressing the limited availability of labeled data, the results show that SSL pretraining with an autoencoder can improve accuracy, especially for the high-resolution imagery where it is more difficult to obtain labeled data, reaching an accuracy of 88.23%. Furthermore, we introduce a framework to scale manually annotated high-resolution labels to medium-resolution inputs. While the quantitative performance between resolutions is comparable, high-resolution imagery provides significantly sharper segmentation boundaries and finer spatial detail. As part of this work, we also contribute a curated Sentinel-2 dataset with Dynamic World labels, tailored for wetland classification tasks and made publicly available.

[111] Spectral Compression Transformer with Line Pose Graph for Monocular 3D Human Pose Estimation

Zenghao Zheng,Lianping Yang,Hegui Zhu,Mingrui Ye

Main category: cs.CV

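TL;DR: Proposes a dual-stream network based on the Spectral Compression Transformer (SCT) and the Line Pose Graph (LPG) for efficient 3D human pose estimation, achieving SOTA performance while reducing sequence redundancy and computational cost.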

Details

Motivation: Transformers for 3D human pose estimation are computationally expensive because of the quadratic complexity of self-attention, and existing methods fail to remove sequence redundancy effectively. Method: (1) SCT compresses the sequence length with the Discrete Cosine Transform; (2) LPG supplies additional structural information; (3) a dual-stream network models spatial joint relationships and motion trajectories. Result: SOTA performance on Human3.6M and MPI-INF-3DHP (e.g., an MPJPE of 37.7mm on Human3.6M) with improved computational efficiency. Conclusion: SCT and LPG reduce redundancy and computational cost, the dual-stream network further improves performance, and ablations confirm the effectiveness of each module. Abstract: Transformer-based 3D human pose estimation methods suffer from high computational costs due to the quadratic complexity of self-attention with respect to sequence length. Additionally, pose sequences often contain significant redundancy between frames. However, recent methods typically fail to improve model capacity while effectively eliminating sequence redundancy. In this work, we introduce the Spectral Compression Transformer (SCT) to reduce sequence length and accelerate computation. The SCT encoder treats hidden features between blocks as Temporal Feature Signals (TFS) and applies the Discrete Cosine Transform, a Fourier transform-based technique, to determine the spectral components to be retained. By filtering out certain high-frequency noise components, SCT compresses the sequence length and reduces redundancy. To further enrich the input sequence with prior structural information, we propose the Line Pose Graph (LPG) based on line graph theory. The LPG generates skeletal position information that complements the input 2D joint positions, thereby improving the model's performance. Finally, we design a dual-stream network architecture to effectively model spatial joint relationships and the compressed motion trajectory within the pose sequence. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our model achieves state-of-the-art performance with improved computational efficiency. For example, on the Human3.6M dataset, our method achieves an MPJPE of 37.7mm while maintaining a low computational cost. Furthermore, we perform ablation studies on each module to assess its effectiveness. The code and models will be released.
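
The idea of shortening a temporal sequence by keeping only low-frequency DCT coefficients can be sketched in a few lines. This is a minimal illustration, not the authors' SCT encoder: the helper names (`dct2`, `idct2`, `compress`) are hypothetical, it handles a single feature channel, and keeping the first `keep` coefficients is an assumption about which spectral components are retained.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a 1D sequence (one feature channel over time)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct2(coeffs, n):
    """Inverse of dct2 on n time steps; coefficients beyond len(coeffs) count as zero."""
    out = []
    for t in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        for k in range(1, len(coeffs)):
            s += coeffs[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(s)
    return out


def compress(signal, keep):
    """Keep only the `keep` lowest-frequency coefficients (a shorter sequence)."""
    return dct2(signal)[:keep]


# A smooth 16-frame trajectory built from the two lowest DCT basis functions.
n = 16
signal = [1.0 + 0.5 * math.cos(math.pi * (t + 0.5) / n) for t in range(n)]
coeffs = compress(signal, keep=2)      # 16 values -> 2 coefficients
restored = idct2(coeffs, n)
max_error = max(abs(a - b) for a, b in zip(signal, restored))
print(len(coeffs), max_error)
```

Because this toy signal lives entirely in the two lowest frequencies, two coefficients reconstruct all 16 frames up to floating-point error. Real pose features are only approximately low-frequency, so the truncation is lossy there.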

[112] Efficient Leaf Disease Classification and Segmentation using Midpoint Normalization Technique and Attention Mechanism

Enam Ahmed Taufik,Antara Firoz Parsa,Seraj Al Mahmud Mostafa

Main category: cs.CV

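TL;DR: The paper proposes a two-stage approach (MPN plus attention) for plant disease detection that significantly improves classification and segmentation performance.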

Details

Motivation: Plant disease detection is hampered by scarce labeled data and complex backgrounds, calling for an efficient and balanced solution. Method: Combines Mid Point Normalization (MPN) preprocessing with attention (SE blocks) for classification and segmentation. Result: 93% classification accuracy; for segmentation, a Dice score of 72.44% and an IoU of 58.54%, both above the baselines. Conclusion: The method is efficient and lightweight, suited to real-world computer vision applications. Abstract: Enhancing plant disease detection from leaf imagery remains a persistent challenge due to scarce labeled data and complex contextual factors. We introduce a transformative two-stage methodology, Mid Point Normalization (MPN) for intelligent image preprocessing, coupled with sophisticated attention mechanisms that dynamically recalibrate feature representations. Our classification pipeline, merging MPN with Squeeze-and-Excitation (SE) blocks, achieves remarkable 93% accuracy while maintaining exceptional class-wise balance. The perfect F1 score attained for our target class exemplifies attention's power in adaptive feature refinement. For segmentation tasks, we seamlessly integrate identical attention blocks within U-Net architecture using MPN-enhanced inputs, delivering compelling performance gains with 72.44% Dice score and 58.54% IoU, substantially outperforming baseline implementations. Beyond superior accuracy metrics, our approach yields computationally efficient, lightweight architectures perfectly suited for real-world computer vision applications.
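
For readers less familiar with the two segmentation metrics quoted above, the following sketch computes Dice and IoU for one pair of binary masks. The function name and the toy masks are illustrative assumptions; the paper's own evaluation code and averaging scheme are not described in the abstract.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two flat binary masks (sequences of 0/1)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treat as a perfect match by convention.
        return 1.0, 1.0
    dice = 2 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


# Toy masks: 3 predicted pixels, 3 ground-truth pixels, 2 overlapping.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(round(dice, 4), round(iou, 4))
```

For a single mask pair the two metrics are tied by IoU = Dice / (2 - Dice). Scores averaged over a dataset need not satisfy that identity exactly, so a reported Dice and IoU cannot always be converted into each other.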

[113] MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on

Guangyuan Li,Siming Zheng,Hao Zhang,Jinwei Chen,Junsheng Luan,Binkai Ou,Lei Zhao,Bo Li,Peng-Tao Jiang

Main category: cs.CV

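TL;DR: MagicTryOn proposes a video virtual try-on framework built on a video diffusion Transformer, addressing existing methods' shortcomings in spatiotemporal consistency and garment detail preservation.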

Details

Motivation: Current video virtual try-on methods are limited in spatiotemporal consistency, garment content preservation, and detail expression, and cannot effectively capture dynamic variations and complex details. Method: Replaces the U-Net with a diffusion Transformer and uses full self-attention to jointly model spatiotemporal consistency; designs a coarse-to-fine garment preservation strategy and introduces a mask-aware loss to improve garment-region fidelity. Result: Experiments on image and video try-on datasets show that the method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios. Conclusion: Through architectural and strategy improvements, MagicTryOn markedly improves spatiotemporal consistency and garment detail preservation in video virtual try-on. Abstract: Video Virtual Try-On (VVT) aims to simulate the natural appearance of garments across consecutive video frames, capturing their dynamic variations and interactions with human body motion. However, current VVT methods still face challenges in terms of spatiotemporal consistency and garment content preservation. First, they use diffusion models based on the U-Net, which are limited in their expressive capability and struggle to reconstruct complex details. Second, they model spatial and temporal attention separately, which hinders the effective capture of structural relationships and dynamic consistency across frames. Third, their expression of garment details remains insufficient, affecting the realism and stability of the overall synthesized results, especially during human motion. To address the above challenges, we propose MagicTryOn, a video virtual try-on framework built upon the large-scale video diffusion Transformer. We replace the U-Net architecture with a diffusion Transformer and combine full self-attention to jointly model the spatiotemporal consistency of videos. We design a coarse-to-fine garment preservation strategy. The coarse strategy integrates garment tokens during the embedding stage, while the fine strategy incorporates multiple garment-based conditions, such as semantics, textures, and contour lines during the denoising stage. Moreover, we introduce a mask-aware loss to further optimize garment region fidelity. Extensive experiments on both image and video try-on datasets demonstrate that our method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios.

[114] MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Yang Shi,Huanqian Wang,Wulin Xie,Huanyao Zhang,Lijie Zhao,Yi-Fan Zhang,Xinfeng Li,Chaoyou Fu,Zhuoer Wen,Wenting Liu,Zhuoran Zhang,Xinlong Chen,Bohan Zeng,Sihan Yang,Yuanxing Zhang,Pengfei Wan,Haotian Wang,Wenjing Yang

Main category: cs.CV

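TL;DR: The paper introduces the MME-VideoOCR benchmark for evaluating multimodal large language models on video OCR tasks and finds that existing models perform poorly on tasks requiring spatio-temporal reasoning.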

Details

Motivation: Existing MLLMs perform well on static-image OCR, but their effectiveness drops significantly in video OCR because of factors such as motion blur, so a more comprehensive benchmark is needed. Method: Proposes the MME-VideoOCR benchmark, covering 10 task categories, 25 individual tasks, and 44 scenarios, and evaluates 18 state-of-the-art MLLMs. Result: The best model, Gemini-2.5 Pro, reaches only 73.7% accuracy; models do well on single-frame tasks but poorly on spatio-temporal reasoning tasks. Conclusion: Video OCR needs high-resolution input and sufficient temporal coverage, and existing models must improve to handle complex video scenarios. Abstract: Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished due to factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension and reasoning of textual content within videos. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR, revealing that even the best-performing model (Gemini-2.5 Pro) achieves an accuracy of only 73.7%. Fine-grained analysis indicates that while existing MLLMs demonstrate strong performance on tasks where relevant texts are contained within a single or few frames, they exhibit limited capability in effectively handling tasks that demand holistic video comprehension. These limitations are especially evident in scenarios that require spatio-temporal reasoning, cross-frame information integration, or resistance to language prior bias. Our findings also highlight the importance of high-resolution visual input and sufficient temporal coverage for reliable OCR in dynamic video scenarios.

[115] HoliTom: Holistic Token Merging for Fast Video Large Language Models

Kele Shao,Keda Tao,Can Qin,Haoxuan You,Yang Sui,Huan Wang

Main category: cs.CV

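TL;DR: HoliTom is a new training-free video token merging framework that cuts visual tokens by over 90% through global redundancy-aware temporal segmentation and spatial-temporal merging, greatly reducing computation while retaining 99.1% of the original performance.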

Details

Motivation: Existing token pruning methods for video LLMs are either computationally inefficient or ignore global temporal dynamics; HoliTom aims to improve redundancy reduction by combining inner-LLM and outer-LLM strategies. Method: HoliTom applies outer-LLM pruning (global redundancy-aware temporal segmentation and spatial-temporal merging) together with inner-LLM token similarity-based merging. Result: On LLaVA-OneVision-7B, computational cost falls to 6.9% of FLOPs with 99.1% of performance retained, TTFT drops by 2.28x, and decoding throughput rises by 1.32x. Conclusion: HoliTom shows the potential of combining inner-LLM and outer-LLM pruning and offers a practical solution for efficient video LLM inference. Abstract: Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the crucial global temporal dynamics and correlations across longer video sequences. This leads to sub-optimal spatio-temporal reduction and does not leverage video compressibility fully. Crucially, the synergistic potential and mutual influence of combining these strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom employs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatial-temporal merging to reduce visual tokens by over 90%, significantly alleviating the LLM's computational burden. Complementing this, we introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning. Evaluations demonstrate our method's promising efficiency-performance trade-off on LLaVA-OneVision-7B, reducing computational costs to 6.9% of FLOPs while maintaining 99.1% of the original performance. Furthermore, we achieve a 2.28x reduction in Time-To-First-Token (TTFT) and a 1.32x acceleration in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLM inference.
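
To make the notion of temporal token redundancy concrete, here is a deliberately simple sketch that merges a token into the token at the same spatial position in earlier frames when their cosine similarity is high. It is a generic illustration, not HoliTom's algorithm: the function names, the fixed `threshold`, and the running-average merge rule are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_temporal(frames, threshold=0.9):
    """Merge per-position tokens that stay similar across consecutive frames.

    frames: list of frames, each a list of token vectors (same length per frame).
    Returns a list of (merged_vector, count) pairs.
    """
    kept = []                             # merged tokens produced so far
    active = [None] * len(frames[0])      # index into `kept` of the open run per position
    for frame in frames:
        for pos, tok in enumerate(frame):
            idx = active[pos]
            if idx is not None and cosine(kept[idx][0], tok) >= threshold:
                vec, n = kept[idx]
                # Running average keeps the merged token representative of its run.
                kept[idx] = ([(v * n + t) / (n + 1) for v, t in zip(vec, tok)], n + 1)
            else:
                kept.append((list(tok), 1))
                active[pos] = len(kept) - 1
    return kept


# Three frames, two positions: position 0 is static, position 1 changes in frame 3.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],
]
kept = merge_temporal(frames)
print(len(kept), [count for _, count in kept])
```

In this toy input six tokens collapse to three. The abstract's point is that choosing where to merge with global temporal context, rather than a local rule like this one, allows far more aggressive reduction.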

[116] Beyond Accuracy: Uncovering the Role of Similarity Perception and its Alignment with Semantics in Supervised Learning

Katarzyna Filus,Mateusz Żarski

Main category: cs.CV

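TL;DR: The paper proposes the Deep Similarity Inspector (DSI) framework to study how deep vision networks develop similarity perception and how it aligns with semantic similarity. Experiments find that CNNs and ViTs go through three phases during training, with clear differences between the two.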

Details

Motivation: Similarity perception has received too little attention in deep vision, particularly its relationship to semantic similarity. Method: Introduces the DSI framework and experimentally analyzes how similarity perception develops in CNNs and ViTs during training. Result: CNNs and ViTs pass through three phases during training (initial similarity surge, refinement, stabilization) and behave differently. A mistakes refinement phenomenon is also observed. Conclusion: The DSI framework reveals how similarity perception develops in deep vision networks and offers a new perspective for future research. Abstract: Similarity manifests in various forms, including semantic similarity, which is particularly important because it approximates human object categorization based on, e.g., shared functionalities and evolutionary traits. It also offers practical advantages in computational modeling via lexical structures such as WordNet, which provide constant and interpretable similarity. In the domain of deep vision, however, there is still not enough focus on how similarity perception emerges. We introduce Deep Similarity Inspector (DSI), a systematic framework to inspect how deep vision networks develop their similarity perception and its alignment with semantic similarity. Our experiments show that both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) develop a rich similarity perception during training in 3 phases (initial similarity surge, refinement, stabilization), with clear differences between CNNs and ViTs. Besides the gradual elimination of mistakes, a mistakes refinement phenomenon can be observed.

[117] AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Crop Mapping

Wenyuan Li,Shunlin Liang,Keyan Chen,Yongzhe Chen,Han Ma,Jianglei Xu,Yichuan Ma,Shikang Guan,Husheng Fang,Zhenwei Shi

Main category: cs.CV

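TL;DR: AgriFM is a remote sensing foundation model designed for agricultural crop mapping. It extracts multi-scale spatiotemporal features with a modified Video Swin Transformer architecture and significantly outperforms existing methods.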

Details

Motivation: Existing remote sensing foundation models underperform in crop mapping, mainly because they use fixed spatiotemporal windows or ignore temporal information. AgriFM aims to close these gaps. Method: A modified Video Swin Transformer with synchronized temporal and spatial down-sampling, pre-trained on multi-source satellite data (MODIS, Landsat-8/9, Sentinel-2). Result: AgriFM outperforms conventional deep learning approaches and existing remote sensing foundation models across downstream tasks. Conclusion: By processing multi-scale spatiotemporal features in a unified way, AgriFM markedly improves crop mapping accuracy. Abstract: Accurate crop mapping fundamentally relies on modeling multi-scale spatiotemporal patterns, where spatial scales range from individual field textures to landscape-level context, and temporal scales capture both short-term phenological transitions and full growing-season dynamics. Transformer-based remote sensing foundation models (RSFMs) offer promising potential for crop mapping due to their innate ability for unified spatiotemporal processing. However, current RSFMs remain suboptimal for crop mapping: they either employ fixed spatiotemporal windows that ignore the multi-scale nature of crop systems or completely disregard temporal information by focusing solely on spatial patterns. To bridge these gaps, we present AgriFM, a multi-source remote sensing foundation model specifically designed for agricultural crop mapping. Our approach begins by establishing the necessity of simultaneous hierarchical spatiotemporal feature extraction, leading to the development of a modified Video Swin Transformer architecture where temporal down-sampling is synchronized with spatial scaling operations. This modified backbone enables efficient unified processing of long time-series satellite inputs. AgriFM leverages temporally rich data streams from three satellite sources including MODIS, Landsat-8/9 and Sentinel-2, and is pre-trained on a global representative dataset comprising over 25 million image samples supervised by land cover products. The resulting framework incorporates a versatile decoder architecture that dynamically fuses these learned spatiotemporal representations, supporting diverse downstream tasks. Comprehensive evaluations demonstrate AgriFM's superior performance over conventional deep learning approaches and state-of-the-art general-purpose RSFMs across all downstream tasks. Code will be available at https://github.com/flyakon/AgriFM.

[118] YOLO-SPCI: Enhancing Remote Sensing Object Detection via Selective-Perspective-Class Integration

Xinyuan Wang,Lian Peng,Xiangcheng Li,Yilin He,KinTak U

Main category: cs.CV

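TL;DR: YOLO-SPCI is an object detection framework built on YOLOv8 that adds a lightweight SPCI module to strengthen multi-scale feature representation and performs strongly on remote sensing imagery.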

Details

Motivation: Object detection in remote sensing imagery faces large scale variation, dense objects, and cluttered backgrounds, and existing detectors lack mechanisms for multi-scale feature refinement. Method: Proposes the SPCI module, consisting of three components (SSG, PFM, and CDM), embedded in the P3 and P5 stages of YOLOv8 to refine feature representation. Result: On the NWPU VHR-10 dataset, YOLO-SPCI outperforms existing state-of-the-art detectors. Conclusion: The SPCI module markedly improves remote sensing object detection performance. Abstract: Object detection in remote sensing imagery remains a challenging task due to extreme scale variation, dense object distributions, and cluttered backgrounds. While recent detectors such as YOLOv8 have shown promising results, their backbone architectures lack explicit mechanisms to guide multi-scale feature refinement, limiting performance on high-resolution aerial data. In this work, we propose YOLO-SPCI, an attention-enhanced detection framework that introduces a lightweight Selective-Perspective-Class Integration (SPCI) module to improve feature representation. The SPCI module integrates three components: a Selective Stream Gate (SSG) for adaptive regulation of global feature flow, a Perspective Fusion Module (PFM) for context-aware multi-scale integration, and a Class Discrimination Module (CDM) to enhance inter-class separability. We embed two SPCI blocks into the P3 and P5 stages of the YOLOv8 backbone, enabling effective refinement while preserving compatibility with the original neck and head. Experiments on the NWPU VHR-10 dataset demonstrate that YOLO-SPCI achieves superior performance compared to state-of-the-art detectors.

[119] Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Junhao Cheng,Yuying Ge,Teng Wang,Yixiao Ge,Jing Liao,Ying Shan

Main category: cs.CV

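TL;DR: Video-Holmes is a new video reasoning benchmark designed to evaluate the complex reasoning capabilities of multimodal large language models (MLLMs). It finds that existing models struggle to integrate information.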

Details

Motivation: Existing video benchmarks mainly assess visual perception and do not fully reflect real-world complex reasoning, so evaluation tools closer to human reasoning are needed. Method: Builds 1,837 questions from 270 suspense short films across 7 tasks that require models to actively locate and integrate scattered visual clues. Result: The best model, Gemini-2.5-Pro, reaches only 45% accuracy and most models score below 40%, showing weak information integration. Conclusion: Video-Holmes can serve as a "Holmes-test" that pushes models toward more human-like reasoning and highlights the challenges in this field. Abstract: Recent advances in CoT reasoning and RL post-training have been reported to enhance video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered based on explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues before reaching a conclusion. To address this issue, we present Video-Holmes, a benchmark inspired by the reasoning process of Sherlock Holmes, designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films, spanning seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%. We hope that Video-Holmes can serve as a "Holmes-test" for multimodal reasoning, motivating models to reason more like humans and emphasizing the ongoing challenges in this field. The benchmark is released at https://github.com/TencentARC/Video-Holmes.

[120] GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

Fengxiang Wang,Mingshuo Chen,Yueying Li,Di Wang,Haotian Wang,Zonghao Guo,Zefan Wang,Boqi Shan,Long Lan,Yulin Wang,Hongzhen Wang,Wenjing Yang,Bo Du,Jing Zhang

Main category: cs.CV

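TL;DR: The paper proposes two strategies to tackle data scarcity and token explosion in ultra-high-resolution remote sensing image processing, and builds on them GeoLLaVA-8K, the first RS-focused multimodal large language model able to handle 8K-resolution inputs.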

Details

Motivation: Ultra-high-resolution remote sensing imagery is valuable for Earth observation, but existing models are bottlenecked by data scarcity and token explosion. Method: Introduces two high-resolution datasets, SuperRS-VQA and HighRS-VQA, and proposes Background Token Pruning and Anchored Token Selection to reduce the memory footprint. Result: The resulting GeoLLaVA-8K model sets a new state of the art on XLRS-Bench. Conclusion: By adding training data and reducing tokens, the work addresses the challenges of ultra-high-resolution remote sensing image processing. Abstract: Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.

[121] Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility

Yidi Li,Jun Xiao,Zhengda Lu,Yiqun Wang,Haiyong Jiang

Main category: cs.CV

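TL;DR: Dream3DVG is a novel text-to-vector graphics generation method that supports arbitrary-viewpoint viewing, progressive detail optimization, and view-dependent occlusion awareness.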

Details

Motivation: To bridge the domain gap between text prompts and vector graphics, provide more consistent guidance, and improve detail control and occlusion awareness. Method: A dual-branch optimization framework with an auxiliary 3D Gaussian Splatting optimization branch and a 3D vector graphics optimization branch, combined with progressive detail control and a visibility-aware rendering module. Result: Results on 3D sketches and 3D iconographies show the method's advantages across levels of detail abstraction, cross-view consistency, and occlusion-aware stroke culling. Conclusion: Through its dual-branch framework and progressive optimization, Dream3DVG markedly improves the quality and flexibility of text-to-vector graphics generation. Abstract: This work presents a novel text-to-vector graphics generation approach, Dream3DVG, allowing for arbitrary viewpoint viewing, progressive detail optimization, and view-dependent occlusion awareness. Our approach is a dual-branch optimization framework, consisting of an auxiliary 3D Gaussian Splatting optimization branch and a 3D vector graphics optimization branch. The introduced 3DGS branch can bridge the domain gaps between text prompts and vector graphics with more consistent guidance. Moreover, 3DGS allows for progressive detail control by scheduling classifier-free guidance, guiding the vector graphics with coarse shapes at the initial stages and finer details at later stages. We also improve the view-dependent occlusions by devising a visibility-aware rendering module. Extensive results on 3D sketches and 3D iconographies demonstrate the superiority of the method on different abstraction levels of details, cross-view consistency, and occlusion-aware stroke culling.

[122] ZigzagPointMamba: Spatial-Semantic Mamba for Point Cloud Understanding

Linshuang Diao,Dayong Ren,Sensen Song,Yurong Qian

Main category: cs.CV

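TL;DR: ZigzagPointMamba improves point cloud feature extraction, addressing the disruption of spatial continuity and local semantic correlations in existing methods and significantly improving downstream task performance.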

Details

Motivation: Existing PointMamba methods depend on complex token ordering and random masking, which disrupt spatial continuity and local semantic correlations and need improvement. Method: Proposes ZigzagPointMamba, which uses a simple zigzag scan path and a Semantic-Siamese Masking Strategy to strengthen spatial continuity and local semantic modeling. Result: Strong results on several downstream tasks, such as a 1.59% mIoU gain on ShapeNetPart part segmentation and a 0.4% accuracy gain on ModelNet40 classification. Conclusion: By improving feature extraction and the masking strategy, ZigzagPointMamba significantly improves point cloud self-supervised learning. Abstract: State space models (SSMs) such as PointMamba enable efficient feature extraction for point cloud self-supervised learning with linear complexity, outperforming Transformers in computational efficiency. However, existing PointMamba-based methods depend on complex token ordering and random masking, which disrupt spatial continuity and local semantic correlations. We propose ZigzagPointMamba to tackle these challenges. The core of our approach is a simple zigzag scan path that globally sequences point cloud tokens, enhancing spatial continuity by preserving the proximity of spatially adjacent point tokens. Nevertheless, random masking undermines local semantic modeling in self-supervised learning. To address this, we introduce a Semantic-Siamese Masking Strategy (SMS), which masks semantically similar tokens to facilitate reconstruction by integrating local features of original and similar tokens. This overcomes the dependence on isolated local features and enables robust global semantic modeling. Our pre-trained ZigzagPointMamba weights significantly improve downstream tasks, achieving a 1.59% mIoU gain on ShapeNetPart for part segmentation, a 0.4% higher accuracy on ModelNet40 for classification, and 0.19%, 1.22%, and 0.72% higher accuracies respectively for the classification tasks on the OBJ-BG, OBJ-ONLY, and PB-T50-RS subsets of ScanObjectNN. The code is available at: https://anonymous.4open.science/r/ZigzagPointMamba-1800/
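
The spatial-continuity argument for a zigzag scan can be illustrated on a small 2D grid. The sketch below is an assumption-laden simplification (2D instead of 3D tokens, a hypothetical `zigzag_order` helper, a fixed cell size) and not the paper's scan path; it only shows that a serpentine ordering keeps consecutive tokens spatially adjacent, unlike a plain row-major scan that jumps back at each row end.

```python
def zigzag_order(points, cell=1.0):
    """Order 2D points along a serpentine path over a regular grid.

    Rows are visited bottom to top; even rows run left to right and
    odd rows right to left, so each row starts next to where the last ended.
    """
    def key(p):
        col = int(p[0] // cell)
        row = int(p[1] // cell)
        return (row, col if row % 2 == 0 else -col)

    return sorted(points, key=key)


def manhattan(a, b):
    """Manhattan distance between two 2D points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# A 3x3 grid of token centers.
grid = [(x, y) for x in range(3) for y in range(3)]
path = zigzag_order(grid)
steps = [manhattan(a, b) for a, b in zip(path, path[1:])]
print(path)
print(steps)
```

Every step in this path has length 1, whereas a row-major scan of the same grid would include jumps of length 3 at the row boundaries.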

[123] Automatically Identify and Rectify: Robust Deep Contrastive Multi-view Clustering in Noisy Scenarios

Xihong Yang,Siwei Wang,Fangdi Wang,Jiaqi Jin,Suyuan Liu,Yue Liu,En Zhu,Xinwang Liu,Yueming Jin

Main category: cs.CV

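TL;DR: AIRMVC is a new multi-view clustering framework that automatically identifies and rectifies noisy data to improve clustering performance in noisy settings.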

Details

Motivation: Real-world multi-view data is often noisy, while existing methods assume clean views, causing performance to degrade. Method: Formulates noise identification as an anomaly identification problem using a GMM, designs a hybrid rectification strategy, and introduces a noise-robust contrastive mechanism to generate reliable representations. Result: On six benchmark datasets, AIRMVC outperforms existing algorithms in noisy scenarios. Conclusion: Backed by a theoretical proof and experiments, AIRMVC improves multi-view clustering in noisy settings. Abstract: Leveraging the powerful representation learning capabilities, deep multi-view clustering methods have demonstrated reliable performance by effectively integrating multi-source information from diverse views in recent years. Most existing methods rely on the assumption of clean views. However, noise is pervasive in real-world scenarios, leading to a significant degradation in performance. To tackle this problem, we propose a novel multi-view clustering framework for the automatic identification and rectification of noisy data, termed AIRMVC. Specifically, we reformulate noise identification as an anomaly identification problem using GMM. We then design a hybrid rectification strategy to mitigate the adverse effects of noisy data based on the identification results. Furthermore, we introduce a noise-robust contrastive mechanism to generate reliable representations. Additionally, we provide a theoretical proof demonstrating that these representations can discard noisy information, thereby improving the performance of downstream tasks. Extensive experiments on six benchmark datasets demonstrate that AIRMVC outperforms state-of-the-art algorithms in terms of robustness in noisy scenarios. The code of AIRMVC is available at https://github.com/xihongyang1999/AIRMVC.

[124] Mentor3AD: Feature Reconstruction-based 3D Anomaly Detection via Multi-modality Mentor Learning

Jinbao Wang,Hanzhe Liang,Can Gao,Chenxi Hu,Jie Zhou,Yunkang Cao,Linlin Shen,Weiming Shen

Main category: cs.CV

TL;DR: Mentor3AD is a new method that uses multi-modal mentor learning, fusing features from the RGB and 3D modalities to improve 3D anomaly detection.

Details

Motivation: To exploit the complementary information of multiple modalities and use mentor learning to better separate normal from anomalous feature differences. Method: Proposes Mentor3AD, comprising a fusion module (MFM), a guidance module (MGM), and a voting module (VM) for feature fusion, cross-modal reconstruction, and final anomaly scoring. Result: Experiments on the MVTec 3D-AD and Eyecandies datasets verify the method's effectiveness. Conclusion: Mentor3AD markedly improves 3D anomaly detection through multi-modal mentor learning. Abstract: Multimodal feature reconstruction is a promising approach for 3D anomaly detection, leveraging the complementary information from dual modalities. We further advance this paradigm by utilizing multi-modal mentor learning, which fuses intermediate features to better distinguish normal features from anomalous ones. To this end, we propose a novel method called Mentor3AD, which utilizes multi-modal mentor learning. By leveraging the shared features of different modalities, Mentor3AD can extract more effective features and guide feature reconstruction, ultimately improving detection performance. Specifically, Mentor3AD includes a Mentor of Fusion Module (MFM) that merges features extracted from RGB and 3D modalities to create a mentor feature. Additionally, we have designed a Mentor of Guidance Module (MGM) to facilitate cross-modal reconstruction, supported by the mentor feature. Lastly, we introduce a Voting Module (VM) to more accurately generate the final anomaly score. Extensive comparative and ablation studies on MVTec 3D-AD and Eyecandies have verified the effectiveness of the proposed method.

[125] OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers

Ziqiao Peng,Jiwen Liu,Haoxian Zhang,Xiaoqiang Liu,Songlin Tang,Pengfei Wan,Di Zhang,Hongyan Liu,Jun He

Main category: cs.CV

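TL;DR: OmniSync is a universal lip synchronization framework that uses mask-free training and a Diffusion Transformer to address the weaknesses of existing methods in identity consistency, pose variation, and weak audio conditioning.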

Details

Motivation: Existing lip synchronization methods rely on reference frames and masked-frame inpainting, which limits their robustness to identity consistency, pose variation, and occlusion, and audio provides only weak conditioning. Method: Adopts a mask-free training paradigm with a Diffusion Transformer and flow-matching-based progressive noise initialization, and proposes a Dynamic Spatiotemporal Classifier-Free Guidance (DS-CFG) mechanism. Result: On diverse AI-generated videos, OmniSync significantly outperforms existing methods in visual quality and lip sync accuracy. Conclusion: OmniSync offers a universal and efficient solution for lip synchronization that applies to both real and AI-generated videos. Abstract: Lip synchronization is the task of aligning a speaker's lip movements in video with corresponding speech audio, and it is essential for creating realistic, expressive video content. However, existing methods often rely on reference frames and masked-frame inpainting, which limit their robustness to identity consistency, pose variations, facial occlusions, and stylized content. In addition, since audio signals provide weaker conditioning than visual cues, lip shape leakage from the original video will affect lip sync quality. In this paper, we present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks, enabling unlimited-duration inference while maintaining natural facial dynamics and preserving character identity. During inference, we propose a flow-matching-based progressive noise initialization to ensure pose and identity consistency, while allowing precise mouth-region editing. To address the weak conditioning signal of audio, we develop a Dynamic Spatiotemporal Classifier-Free Guidance (DS-CFG) mechanism that adaptively adjusts guidance strength over time and space. We also establish the AIGC-LipSync Benchmark, the first evaluation suite for lip synchronization in diverse AI-generated videos. Extensive experiments demonstrate that OmniSync significantly outperforms prior methods in both visual quality and lip sync accuracy, achieving superior results in both real-world and AI-generated videos.

[126] Visual Product Graph: Bridging Visual Products And Composite Images For End-to-End Style Recommendations

Yue Li Du,Ben Alexander,Mikhail Antonenka,Rohan Mahadev,Hao-yu Wu,Dmitry Kislyuk

Main category: cs.CV

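TL;DR: The paper presents a real-time retrieval system built on the Visual Product Graph (VPG) for retrieving content that is semantically similar but visually distinct, and demonstrates its use for product styling recommendations.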

Details

Motivation: Address the challenge of retrieving semantically similar but visually distinct content in visual search systems. Method: Builds the VPG on high-performance infrastructure and state-of-the-art computer vision models, supporting navigation from individual products to composite scenes that contain them, along with complementary recommendations. Result: The system achieves 78.8% extremely similar@1 in end-to-end human relevance evaluations and a 6% module engagement rate. Conclusion: VPG is deployed in Pinterest's "Ways to Style It" module, demonstrating its effectiveness in production. Abstract: Retrieving semantically similar but visually distinct content has been a critical capability in visual search systems. In this work, we aim to tackle this problem with Visual Product Graph (VPG), leveraging high-performance infrastructure for storage and state-of-the-art computer vision models for image understanding. VPG is built to be an online real-time retrieval system that enables navigation from individual products to composite scenes containing those products, along with complementary recommendations. Our system not only offers contextual insights by showcasing how products can be styled in a context, but also provides recommendations for complementary products drawn from these inspirations. We discuss the essential components for building the Visual Product Graph, along with the core computer vision model improvements across object detection, foundational visual embeddings, and other visual signals. Our system achieves a 78.8% extremely similar@1 in end-to-end human relevance evaluations, and a 6% module engagement rate. The "Ways to Style It" module, powered by the Visual Product Graph technology, is deployed in production at Pinterest.

[127] Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Muzhi Zhu,Hao Zhong,Canyu Zhao,Zongze Du,Zheng Huang,Mingyu Liu,Hao Chen,Cheng Zou,Jingdong Chen,Ming Yang,Chunhua Shen

Main category: cs.CV

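TL;DR: This paper proposes ACTIVE-O3, a reinforcement learning framework that equips multimodal large language models (MLLMs) with active perception, addresses the low search efficiency and inaccurate region selection of existing approaches, and validates the framework across multiple tasks and scenarios.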

Details

Motivation: Active perception is key to efficient perception and decision-making, but it remains largely unexplored for MLLMs. This work aims to fill that gap by exploring how to equip MLLMs with active perception. Method: Proposes ACTIVE-O3, a GRPO-based reinforcement learning framework that trains MLLMs for active perception, and establishes a comprehensive benchmark covering open-world tasks and domain-specific scenarios. Result: ACTIVE-O3 performs well across tasks, including small object detection and interactive segmentation, and shows zero-shot reasoning ability on the V* Benchmark. Conclusion: ACTIVE-O3 provides a simple codebase and evaluation protocol for research on active perception in MLLMs. Abstract: Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement learning based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense object grounding, and domain-specific scenarios, including small object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
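
Since the framework is described as built on GRPO, a short sketch of GRPO's group-relative advantage may help readers unfamiliar with it: several responses are sampled for the same prompt, and each reward is normalised by the mean and standard deviation of its group. The reward values below are hypothetical; the abstract does not specify ACTIVE-O3's reward design.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group, as in GRPO.

    `rewards` holds one scalar per response sampled for the same prompt.
    `eps` avoids division by zero when all rewards in the group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled region proposals (1.0 = task solved).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group average receive a positive advantage and are reinforced, so no separate value network is needed.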

[128] ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models

Bozhou Li,Wentao Zhang

Main category: cs.CV

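TL;DR: ID-Align reorders position IDs to relieve the limited interaction between high-resolution image tokens and thumbnail tokens, and between text and image, yielding significant performance gains.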

Details

Motivation: Encoding a high-resolution image together with its thumbnail produces a large number of image tokens, and combined with the long-range decay of RoPE this limits interaction between tokens. Method: Proposes ID-Align, which reassigns position IDs so that high-resolution tokens inherit the IDs of their corresponding thumbnail tokens while limiting the overexpansion of positional indices. Result: Within the LLaVA-Next framework, ID-Align improves MMBench relation reasoning by 6.09% and performs well on multiple benchmarks. Conclusion: ID-Align effectively addresses the token interaction problem and significantly improves model performance. Abstract: Currently, a prevalent approach for enhancing Vision-Language Models (VLMs) performance is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), its long-term decay property hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail token while constraining the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: https://github.com/zooblastlbz/ID-Align.
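
The ID inheritance idea can be sketched as follows. This is an illustrative reading of the abstract, not the released implementation: it assumes the thumbnail is tokenised into a raster-ordered grid, that the high-resolution grid is an integer multiple of that grid, and that "corresponding" means the thumbnail cell covering the same image region. The function name and parameters are hypothetical.

```python
def remap_position_ids(thumb_h, thumb_w, scale, start=0):
    """Assign position IDs so high-resolution tokens reuse thumbnail IDs.

    thumb_h, thumb_w: thumbnail token grid size.
    scale: assumed integer ratio between high-resolution and thumbnail grids.
    start: first free position ID (e.g. after preceding text tokens).
    Returns (thumbnail IDs, high-resolution IDs, next free ID).
    """
    # Thumbnail tokens get consecutive IDs in raster order.
    thumb_ids = [[start + r * thumb_w + c for c in range(thumb_w)]
                 for r in range(thumb_h)]
    # Each high-resolution token inherits the ID of the thumbnail cell it falls in.
    hi_ids = [[thumb_ids[r // scale][c // scale]
               for c in range(thumb_w * scale)]
              for r in range(thumb_h * scale)]
    # Following tokens continue right after the thumbnail range, so the index
    # range grows with the thumbnail size only, not with the high-res token count.
    next_id = start + thumb_h * thumb_w
    return thumb_ids, hi_ids, next_id

thumb, hi, next_id = remap_position_ids(thumb_h=2, thumb_w=2, scale=2)
```

In this toy case 16 high-resolution tokens share the 4 thumbnail IDs, so the text that follows starts at position 4 rather than 20, which keeps RoPE distances between text and image tokens short.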

[129] Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Mehrdad Fazli,Bowen Wei,Ziwei Zhu

Main category: cs.CV

TL;DR: The CAAC framework reduces hallucination in large vision-language models through visual-token calibration and adaptive attention re-scaling.

Details

Motivation: Large vision-language models perform well on multimodal tasks but hallucinate, describing objects or attributes that are not in the image. Existing methods struggle to stay accurate in open-ended and long-form generation. Method: The CAAC framework combines Visual-Token Calibration (VTC) and Adaptive Attention Re-Scaling (AAR), which address spatial perception bias and modality bias, respectively. Result: On the CHAIR, AMBER, and POPE benchmarks, CAAC outperforms baselines and markedly reduces hallucination in long-form generation. Conclusion: Through confidence-driven adjustment, CAAC improves the generation accuracy and visual grounding of vision-language models. Abstract: Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, confidently describing objects or attributes not present in the image. Current inference-time interventions, while training-free, struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding based on the model's confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.

[130] DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction

Yiheng Liu,Liao Qu,Huichao Zhang,Xu Wang,Yi Jiang,Yiming Gao,Hu Ye,Xian Li,Shuai Wang,Daniel K. Du,Shu Cheng,Zehuan Yuan,Xinglong Wu

Main category: cs.CV

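TL;DR: DetailFlow is a coarse-to-fine 1D autoregressive image generation method that produces high-quality images through progressive next-detail prediction, substantially reducing the token count and speeding up generation.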

Details

Motivation: Existing autoregressive models are inefficient and token-heavy when generating complex visual content; DetailFlow aims to address this with a more natural 1D token sequence and a parallel inference mechanism. Method: Uses a resolution-aware token sequence supervised with progressively degraded images, combined with parallel inference and self-correction, for efficient generation. Result: On the ImageNet 256x256 benchmark, DetailFlow achieves 2.96 gFID with 128 tokens, outperforming VAR and FlexVAR, with nearly 2x faster inference. Conclusion: DetailFlow surpasses existing methods in both generation quality and efficiency, offering a better option for autoregressive image generation. Abstract: This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure and incrementally refine details. This coarse-to-fine 1D token sequence aligns well with the autoregressive inference mechanism, providing a more natural and efficient way for the AR model to generate complex visual content. Our compact 1D AR model achieves high-quality image synthesis with significantly fewer tokens than previous approaches, i.e. VAR/VQGAN. We further propose a parallel inference mechanism with self-correction that accelerates generation speed by approximately 8x while reducing accumulation sampling error inherent in teacher-forcing supervision. On the ImageNet 256x256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), which both require 680 tokens in their AR models. Moreover, due to the significantly reduced token count and parallel inference mechanism, our method runs inference nearly 2x faster than VAR and FlexVAR. Extensive experimental results demonstrate DetailFlow's superior generation quality and efficiency compared to existing state-of-the-art methods.

[131] Policy Optimized Text-to-Image Pipeline Design

Uri Gadot,Rinon Gal,Yftah Ziser,Gal Chechik,Shie Mannor

Main category: cs.CV

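TL;DR: The paper proposes a reinforcement learning framework for automatically designing multi-component text-to-image pipelines, addressing the high computational cost and poor generalization of existing approaches.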

Details

Motivation: Designing multi-component text-to-image pipelines currently requires substantial expertise, and existing automated methods are limited by high computational cost and poor generalization. Method: Uses a reinforcement learning framework that first trains reward models to predict image quality scores, then optimizes pipeline design in two phases, initial workflow vocabulary training followed by GRPO-based optimization, combined with a classifier-free-guidance-based enhancement technique. Result: Experiments show that the method produces more diverse pipelines and significantly improves image quality over existing baselines. Conclusion: The proposed reinforcement learning framework automates multi-component pipeline design with greater efficiency and better generalization. Abstract: Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise. Recent approaches have shown promise in automating this process through large language models (LLMs), but they suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow vocabulary training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.
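
The extrapolation between the initial and GRPO-tuned models can be pictured with the usual classifier-free guidance formula. The sketch below assumes the extrapolation is applied to next-token logits and that a single scalar weight is used; both are assumptions, since the abstract gives no formula, and `extrapolate_logits` is a hypothetical name.

```python
def extrapolate_logits(initial, tuned, weight=1.5):
    """CFG-style extrapolation from the initial model towards the tuned one.

    initial, tuned: logits over the same vocabulary from the two models.
    weight = 0 returns the initial logits, weight = 1 the tuned logits, and
    weight > 1 moves further along the same direction (extrapolation).
    """
    return [b + weight * (t - b) for b, t in zip(initial, tuned)]

# Toy three-token vocabulary: tuning raised the logit of token 0 only.
guided = extrapolate_logits([0.0, 1.0, -1.0], [1.0, 1.0, -1.0], weight=2.0)
```

Tokens whose logits the tuning left unchanged are unaffected, while the shift introduced by GRPO is amplified.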

[132] MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation

Kerui Ren,Jiayang Bai,Linning Xu,Lihan Jiang,Jiangmiao Pang,Mulin Yu,Bo Dai

Main category: cs.CV

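TL;DR: MV-CoLight is a two-stage framework for illumination-consistent object compositing in 2D images and 3D scenes, addressing multi-view consistency, complex scenes, and diverse lighting conditions.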

Details

Motivation: Existing methods focus mainly on single-image scenarios or intrinsic decomposition and struggle with multi-view consistency, complex scenes, and diverse lighting conditions. Method: Proposes the MV-CoLight framework, whose feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods, and uses a Hilbert curve-based mapping to align 2D inputs with 3D Gaussian scene representations. Result: Experiments show state-of-the-art harmonization on standard benchmarks and real-world scenes, demonstrating the framework's robustness and broad generalization. Conclusion: MV-CoLight provides an efficient, consistent object compositing solution for AR and embodied intelligence applications. Abstract: Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly focus on single-image scenarios or intrinsic decomposition techniques, facing challenges with multi-view consistency, complex scenes, and diverse lighting conditions. Recent inverse rendering advancements, such as 3D Gaussian and diffusion-based methods, have enhanced consistency but are limited by scalability, heavy data requirements, or prolonged reconstruction time per scene. To broaden its applicability, we introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations seamlessly. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonized results across standard benchmarks and our dataset, and results on casually captured real-world scenes demonstrate the framework's robustness and wide generalization.

[133] Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

Boyang Wang,Xuweiyi Chen,Matheus Gadelha,Zezhou Cheng

Main category: cs.CV

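TL;DR: The paper proposes a video generation method based on the Frame In and Frame Out technique, in which user-specified motion trajectories control objects entering and leaving the scene, and introduces a new dataset and evaluation protocol.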

Details

Motivation: Address controllability, temporal coherence, and detail synthesis in video generation. Method: Uses an identity-preserving, motion-controllable video Diffusion Transformer architecture together with a semi-automatically curated dataset. Result: The proposed method significantly outperforms existing baselines. Conclusion: The method offers a more efficient and controllable solution for video generation. Abstract: Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can control the objects in the image to naturally leave the scene or provide breaking new identity references to enter the scene, guided by user-specified motion trajectory. To support this task, we introduce a new dataset curated semi-automatically, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving motion-controllable video Diffusion Transformer architecture. Our evaluation shows that our proposed approach significantly outperforms existing baselines.

[134] Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

Xiaojun Jia,Sensen Gao,Simeng Qin,Tianyu Pang,Chao Du,Yihao Huang,Xinfeng Li,Yiming Li,Bo Li,Yang Liu

Main category: cs.CV

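TL;DR: Proposes FOA-Attack, a targeted transferable adversarial attack based on feature optimal alignment, which aligns global and local features to improve the transferability of adversarial examples.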

Details

Motivation: Existing methods typically achieve targeted attacks through global feature alignment but ignore local information, which limits transferability, especially to closed-source models. Method: At the global level, introduces a cosine-similarity loss to align coarse-grained features; at the local level, uses clustering to extract compact patterns and aligns fine-grained features by solving an optimal transport problem. Also adopts a dynamic ensemble model weighting strategy. Result: Experiments show that the method outperforms prior work, with notably strong transfer to closed-source MLLMs. Conclusion: FOA-Attack significantly improves adversarial transferability through optimal alignment of global and local features. Abstract: Multimodal large language models (MLLMs) remain vulnerable to transferable adversarial examples. While existing methods typically achieve targeted attacks by aligning global features, such as CLIP's [CLS] token, between adversarial and target samples, they often overlook the rich local information encoded in patch tokens. This leads to suboptimal alignment and limited transferability, particularly for closed-source models. To address this limitation, we propose a targeted transferable adversarial attack method based on feature optimal alignment, called FOA-Attack, to improve adversarial transfer capability. Specifically, at the global level, we introduce a global feature loss based on cosine similarity to align the coarse-grained features of adversarial samples with those of target samples. At the local level, given the rich local representations within Transformers, we leverage clustering techniques to extract compact local patterns to alleviate redundant local features. We then formulate local feature alignment between adversarial and target samples as an optimal transport (OT) problem and propose a local clustering optimal transport loss to refine fine-grained feature alignment. Additionally, we propose a dynamic ensemble model weighting strategy to adaptively balance the influence of multiple models during adversarial example generation, thereby further improving transferability. Extensive experiments across various models demonstrate the superiority of the proposed method, outperforming state-of-the-art methods, especially in transferring to closed-source MLLMs. The code is released at https://github.com/jiaxiaojunQAQ/FOA-Attack.
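
The two alignment terms described in the abstract can be sketched in a few lines. This is a minimal illustration, not the released FOA-Attack code: the clustering step is omitted (cluster centres are passed in directly), the transport cost is assumed to be one minus cosine similarity, marginals are assumed uniform, and entropic Sinkhorn iterations are used as one common OT solver. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def global_loss(adv_global, tgt_global):
    """Global term: 1 - cosine similarity of the coarse (e.g. [CLS]) features."""
    return 1.0 - cosine(adv_global, tgt_global)

def local_ot_loss(adv_centers, tgt_centers, reg=0.1, iters=200):
    """Local term: entropic OT cost between two sets of cluster centres.

    Assumptions: uniform marginals, cost = 1 - cosine similarity, and
    Sinkhorn scaling with regularisation strength `reg`.
    """
    n, m = len(adv_centers), len(tgt_centers)
    cost = [[1.0 - cosine(a, t) for t in tgt_centers] for a in adv_centers]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P[i][j] = u[i] * K[i][j] * v[j]; the loss is <P, cost>.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

# Toy features: the same local patterns in a different order still align well.
loss_g = global_loss([1.0, 0.0], [0.0, 1.0])
loss_l = local_ot_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In an actual attack these terms would be computed on features from surrogate encoders and minimised with respect to the adversarial perturbation using an autodiff framework; the pure-Python version only shows the quantities involved.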

[135] Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Wei Pang,Kevin Qinghong Lin,Xiangru Jian,Xi He,Philip Torr

Main category: cs.CV

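TL;DR: The paper presents the first benchmark and evaluation metrics for academic poster generation and introduces PosterAgent, a multi-agent pipeline for efficiently generating high-quality academic posters.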

Details

Motivation: Academic poster generation is an important but challenging task that requires compressing a long paper into a single visually coherent page. Benchmarks and automated tools are currently lacking. Method: Proposes PosterAgent, a top-down, visual-feedback-driven multi-agent pipeline comprising a Parser, a Planner, and a Painter-Commenter loop. Result: Experiments show that PosterAgent outperforms existing methods on multiple metrics at low cost (only $0.005 per poster). Conclusion: The study points to directions for next-generation fully automated poster generation models and open-sources its code and datasets. Abstract: Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.

[136] ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Dingming Li,Hongxing Li,Zixuan Wang,Yuchen Yan,Hang Zhang,Siqi Chen,Guiyang Hou,Shengpei Jiang,Wenqi Zhang,Yongliang Shen,Weiming Lu,Yueting Zhuang

Main category: cs.CV

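TL;DR: The paper introduces ViewSpatial-Bench, the first comprehensive benchmark for multi-perspective spatial localization, exposes the limitations of current vision-language models in cross-viewpoint spatial reasoning, and shows that fine-tuning substantially improves performance.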

Details

Motivation: Current vision-language models perform poorly on cross-viewpoint spatial reasoning, especially when reasoning from non-camera viewpoints such as a human's. Method: Introduces the ViewSpatial-Bench benchmark with five task types, with precise directional labels generated by automated 3D annotation. Evaluates a range of VLMs and fine-tunes models on a multi-perspective spatial dataset. Result: Models perform reasonably on camera-perspective tasks, but accuracy drops markedly on human-perspective tasks. Fine-tuning improves overall performance by 46.24%. Conclusion: The study provides an important benchmark for spatial intelligence and shows that modeling 3D spatial relationships improves the spatial understanding of vision-language models. Abstract: Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.

[137] Vision Transformers with Self-Distilled Registers

Yinjie Chen,Zipeng Yan,Chong Zhou,Bo Dai,Andrew F. Luo

Main category: cs.CV

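TL;DR: The paper proposes PH-Reg, an efficient self-distillation method that adds register tokens to pre-trained ViTs to reduce the impact of artifact tokens without retraining from scratch.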

Details

Motivation: ViTs perform well on vision tasks but suffer from artifact tokens that harm fine-grained localization and structural coherence. Addressing this calls for adding register tokens to pre-trained ViTs without retraining them. Method: Proposes PH-Reg, which integrates register tokens into an existing ViT through self-distillation. The teacher network stays frozen, the student is augmented with randomly initialized register tokens, and test-time augmentation produces artifact-free dense embeddings that are used to optimize a small subset of student weights. Result: PH-Reg effectively reduces the number of artifact tokens and improves the student ViT's segmentation and depth prediction under zero-shot and linear probing. Conclusion: PH-Reg is an efficient way to improve ViT performance without retraining. Abstract: Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with the local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly "absorb" the artifact term during training. Given the availability of various large-scale pre-trained ViTs, in this paper we aim at equipping them with such register tokens without the need of re-training them from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.

[138] Generalizable and Relightable Gaussian Splatting for Human Novel View Synthesis

Yipengjing Sun,Chenyang Wang,Shunyuan Zheng,Zonglin Li,Shengping Zhang,Xiangyang Ji

Main category: cs.CV

TL;DR: GRGS is a generalizable and relightable 3D Gaussian framework for high-fidelity human novel view synthesis under diverse lighting conditions.

Details

Motivation: Existing methods rely on per-character optimization or ignore physical constraints; GRGS aims to address this with a feed-forward, fully supervised strategy. Method: GRGS reconstructs lighting-invariant geometry with a Lighting-aware Geometry Refinement (LGR) module, then combines neural prediction with physics-based shading in a Physically Grounded Neural Rendering (PGNR) module. Result: GRGS performs strongly in visual quality, geometric consistency, and generalization across characters and lighting conditions. Conclusion: With its new modules and training scheme, GRGS achieves high-quality, editable relighting. Abstract: We propose GRGS, a generalizable and relightable 3D Gaussian framework for high-fidelity human novel view synthesis under diverse lighting conditions. Unlike existing methods that rely on per-character optimization or ignore physical constraints, GRGS adopts a feed-forward, fully supervised strategy that projects geometry, material, and illumination cues from multi-view 2D observations into 3D Gaussian representations. Specifically, to reconstruct lighting-invariant geometry, we introduce a Lighting-aware Geometry Refinement (LGR) module trained on synthetically relit data to predict accurate depth and surface normals. Based on the high-quality geometry, a Physically Grounded Neural Rendering (PGNR) module is further proposed to integrate neural prediction with physics-based shading, supporting editable relighting with shadows and indirect illumination. Besides, we design a 2D-to-3D projection training scheme that leverages differentiable supervision from ambient occlusion, direct, and indirect lighting maps, which alleviates the computational cost of explicit ray tracing. Extensive experiments demonstrate that GRGS achieves superior visual quality, geometric consistency, and generalization across characters and lighting conditions.

[139] InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

Zifu Wan,Yaqi Xie,Ce Zhang,Zhiqiu Lin,Zihan Wang,Simon Stepputtis,Deva Ramanan,Katia Sycara

Main category: cs.CV

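TL;DR: The paper introduces InstructPart, a new benchmark for evaluating how well models understand and execute part-level tasks, and shows that task-oriented part segmentation remains challenging for current vision-language models.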

Details

Motivation: Current multimodal foundation models usually treat objects as indivisible wholes, overlooking their constituent parts and functional properties, even though understanding those parts is essential for carrying out tasks. Method: Introduces InstructPart, a benchmark with hand-labeled part segmentation annotations and task-oriented instructions, and evaluates existing models on it. Result: Experiments show that task-oriented part segmentation remains challenging for existing vision-language models, while a simple proposed baseline achieves a twofold performance improvement through fine-tuning. Conclusion: The dataset and benchmark aim to advance research on task-oriented part segmentation and broaden the applicability of vision-language models across domains. Abstract: Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.

cs.GR [Back]

[140] Precise Gradient Discontinuities in Neural Fields for Subspace Physics

Mengfei Liu,Yue Chang,Zhecheng Wang,Peter Yichen Chen,Eitan Grinspun

Main category: cs.GR

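TL;DR: Proposes a neural field construction that captures gradient discontinuities and supports mesh-free simulation of parametrized shape families.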

Details

Motivation: Traditional mesh-based methods struggle with discontinuities in spatial derivatives, and neural fields, although able to handle varying shapes, cannot represent gradient discontinuities well. Method: Augments the input coordinates with a smoothly clamped distance function, which encodes gradient jumps and supports evolving interfaces. Result: The method supports simulation of heterogeneous materials and evolving creases, enabling new capabilities such as shape morphing and interactive crease editing. Conclusion: The method can be combined with existing techniques to handle gradient and value discontinuities in a unified way, extending the use of neural fields to complex physical systems. Abstract: Discontinuities in spatial derivatives appear in a wide range of physical systems, from creased thin sheets to materials with sharp stiffness transitions. Accurately modeling these features is essential for simulation but remains challenging for traditional mesh-based methods, which require discontinuity-aligned remeshing -- entangling geometry with simulation and hindering generalization across shape families. Neural fields offer an appealing alternative by encoding basis functions as smooth, continuous functions over space, enabling simulation across varying shapes. However, their smoothness makes them poorly suited for representing gradient discontinuities. Prior work addresses discontinuities in function values, but capturing sharp changes in spatial derivatives while maintaining function continuity has received little attention. We introduce a neural field construction that captures gradient discontinuities without baking their location into the network weights. By augmenting input coordinates with a smoothly clamped distance function in a lifting framework, we enable encoding of gradient jumps at evolving interfaces. This design supports discretization-agnostic simulation of parametrized shape families with heterogeneous materials and evolving creases, enabling new reduced-order capabilities such as shape morphing, interactive crease editing, and simulation of soft-rigid hybrid structures. We further demonstrate that our method can be combined with previous lifting techniques to jointly capture both gradient and value discontinuities, supporting simultaneous cuts and creases within a unified model.
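
A one-dimensional toy example may help show why an extra distance coordinate produces a gradient jump while keeping the function continuous. Everything below is an illustrative assumption: the abstract does not give the form of the smooth clamp, so a simple quadratic blend is used, and a linear function stands in for the neural network that would consume the lifted coordinates.

```python
def smooth_clamped_distance(x, crease, radius):
    """Distance to the crease, clamped smoothly to a constant beyond `radius`.

    Near the crease it behaves like |x - crease| (so it has a kink there);
    at distance `radius` its slope reaches zero and it stays constant after.
    The quadratic blend is an assumed form, chosen only for illustration.
    """
    t = min(abs(x - crease) / radius, 1.0)
    return radius * (t - 0.5 * t * t)

def lifted_field(x, crease, radius=0.5, slope=1.0, kink=2.0):
    """Stand-in for a network evaluated on the lifted input (x, d(x)).

    A smooth function of (x, d) is continuous in x, but inherits the kink
    of d at the crease, which gives a gradient jump of 2 * kink there.
    """
    d = smooth_clamped_distance(x, crease, radius)
    return slope * x + kink * d

# One-sided finite differences around a crease at x = 0.
h = 1e-6
left_slope = (lifted_field(0.0, 0.0) - lifted_field(-h, 0.0)) / h
right_slope = (lifted_field(h, 0.0) - lifted_field(0.0, 0.0)) / h
```

Because the crease location enters only through the distance feature, moving the crease changes the input rather than the weights, which is the property the abstract highlights.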

[141] ART-DECO: Arbitrary Text Guidance for 3D Detailizer Construction

Qimin Chen,Yuezhi Yang,Yifang Wang,Vladimir G. Kim,Siddhartha Chaudhuri,Hao Zhang,Zhiqin Chen

Main category: cs.GR

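TL;DR: The paper proposes a 3D detailizer model that quickly turns a coarse 3D shape proxy into a high-quality asset, with the generated details controlled by a text prompt.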

Details

Motivation: Existing text-to-3D generative models fall short in detail control and interactivity, motivating a method that quickly produces high-quality 3D assets while supporting user control over structure. Method: The model is trained in two stages, using a pretrained multi-view image diffusion model and Score Distillation Sampling (SDS), to combine a text prompt with a coarse 3D proxy and generate richly detailed 3D shapes. Result: Experiments show that the generated shapes surpass existing text-to-3D models in quality and detail, are produced in under one second, and support interactive editing. Conclusion: The model generalizes well across styles, structures, and object categories and can produce creative 3D assets beyond the capabilities of current models. Abstract: We introduce a 3D detailizer, a neural model which can instantaneously (in <1s) transform a coarse 3D shape proxy into a high-quality asset with detailed geometry and texture as guided by an input text prompt. Our model is trained using the text prompt, which defines the shape class and characterizes the appearance and fine-grained style of the generated details. The coarse 3D proxy, which can be easily varied and adjusted (e.g., via user editing), provides structure control over the final shape. Importantly, our detailizer is not optimized for a single shape; it is the result of distilling a generative model, so that it can be reused, without retraining, to generate any number of shapes, with varied structures, whose local details all share a consistent style and appearance. Our detailizer training utilizes a pretrained multi-view image diffusion model, with text conditioning, to distill the foundational knowledge therein into our detailizer via Score Distillation Sampling (SDS). To improve SDS and enable our detailizer architecture to learn generalizable features over complex structures, we train our model in two training stages to generate shapes with increasing structural complexity. Through extensive experiments, we show that our method generates shapes of superior quality and details compared to existing text-to-3D models under varied structure control. Our detailizer can refine a coarse shape in less than a second, making it possible to interactively author and adjust 3D shapes. Furthermore, the user-imposed structure control can lead to creative, and hence out-of-distribution, 3D asset generations that are beyond the current capabilities of leading text-to-3D generative models. We demonstrate an interactive 3D modeling workflow our method enables, and its strong generalizability over styles, structures, and object categories.

[142] SZ Sequences: Binary-Based $(0, 2^q)$-Sequences

Abdalla G. M. Ahmed,Matt Pharr,Victor Ostromoukhov,Hui Huang

Main category: cs.GR

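TL;DR: The paper proposes a new construction of binary-based (0, 4)-sequences and extends it to higher dimensions to improve the convergence of rendering integrals.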

Details

Motivation: Low-discrepancy sequences are widely used in computer graphics, but existing sequences are not well distributed in lower-dimensional projections, so new constructions are needed. Method: 2×2 block matrices serve as symbols for building larger matrices that generate sequences with the target (0, s)-sequence property. Suitable alphabets and candidate generator matrices are found by search, and a formula for constructing 64-bit matrices is then inferred. Result: SZ sequences improve mean relative squared error by up to 1.93× in common rendering applications. Conclusion: The proposed SZ construction is efficient and easy to implement, and can replace existing sequences in current applications with a significant gain in rendering performance. Abstract: Low-discrepancy sequences have seen widespread adoption in computer graphics thanks to their superior convergence rates. Since rendering integrals often comprise products of lower-dimensional integrals, recent work has focused on developing sequences that are also well-distributed in lower-dimensional projections. To this end, we introduce a novel construction of binary-based (0, 4)-sequences; that is, progressive fully multi-stratified sequences of 4D points, and extend the idea to higher power-of-two dimensions. We further show that not only is it possible to nest lower-dimensional sequences in higher-dimensional ones -- for example, embedding a (0, 2)-sequence within our (0, 4)-sequence -- but that we can ensemble two (0, 2)-sequences into a (0, 4)-sequence, four (0, 4)-sequences into a (0, 16)-sequence, and so on. Such sequences can provide excellent convergence rates when integrals include lower-dimensional integration problems in 2, 4, 16, ... dimensions. Our construction is based on using 2$\times$2 block matrices as symbols to construct larger matrices that potentially generate a sequence with the target (0, s)-sequence in base $s$ property. We describe how to search for suitable alphabets and identify two distinct, cross-related alphabets of block symbols, which we call S and Z, hence \emph{SZ} for the resulting family of sequences. Given the alphabets, we construct candidate generator matrices and search for valid sets of matrices. We then infer a formula to construct full-resolution (64-bit) matrices. Our binary generator matrices allow highly efficient implementation using bitwise operations, and can be used as a drop-in replacement for Sobol matrices in existing applications.
We compare SZ sequences to state-of-the-art low discrepancy sequences, and demonstrate mean relative squared error improvements up to $1.93\times$ in common rendering applications.
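
The abstract notes that binary generator matrices can be evaluated with bitwise operations and swapped in for Sobol matrices. As a rough illustration of that evaluation scheme only, the sketch below computes points of a generic base-2 digital sequence. It does not use the SZ matrices, which are not given in this listing; the two matrices shown are the well-known identity (van der Corput) and second Sobol dimension, and the function names are hypothetical.

```python
BITS = 32  # number of binary digits per coordinate (an illustrative choice)

def digital_coordinate(index, columns):
    """Multiply a generator matrix by the binary digits of `index` over GF(2).

    columns[k] is column k of the matrix packed into an integer whose
    highest of BITS bits is the first binary digit of the output.
    """
    acc = 0
    k = 0
    while index:
        if index & 1:
            acc ^= columns[k]  # XOR adds column k modulo 2
        index >>= 1
        k += 1
    return acc / (1 << BITS)

def identity_columns():
    # Identity matrix: yields the van der Corput sequence.
    return [1 << (BITS - 1 - k) for k in range(BITS)]

def sobol_dim2_columns():
    # Second Sobol dimension: each column is the previous one XOR its right shift.
    cols = [1 << (BITS - 1)]
    for _ in range(BITS - 1):
        cols.append(cols[-1] ^ (cols[-1] >> 1))
    return cols

matrices = [identity_columns(), sobol_dim2_columns()]
points = [tuple(digital_coordinate(i, m) for m in matrices) for i in range(4)]
```

Under this scheme, replacing one sequence with another amounts to supplying a different `columns` table per dimension, which is presumably what makes a drop-in replacement practical.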

[143] Learned Adaptive Mesh Generation

Zhiyuan Zhang,Amir Vaxman,Stefanos-Aldo Papanicolopulos,Kartic Subr

Main category: cs.GR

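TL;DR: The paper proposes Learned Adaptive Mesh Generation (LAMG), which uses sparse Monte Carlo estimates and a neural network to solve partial differential equations (PDEs) on 3D domains efficiently, running 2× to 4× faster than adaptive FEM or MC methods.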

Details

Motivation: Solving PDEs on 3D domains with the traditional finite element method (FEM) is computationally expensive, while Monte Carlo (MC) methods have advantages that remain underused. The paper aims to combine the strengths of both in a more efficient solution. Method: A lightweight neural network is trained on sparse, approximate MC estimates to produce an adaptive mesh (LAMG), and the PDE is then solved with a single FEM computation. Result: LAMG is 2× to 4× faster than adaptive FEM or MC methods at similar error. Conclusion: LAMG is an efficient, lightweight, and versatile learning framework that applies to a wide range of meshes and boundary conditions. Abstract: The distribution and evolution of several real-world quantities, such as temperature, pressure, light, and heat, are modelled mathematically using Partial Differential Equations (PDEs). Solving PDEs defined on arbitrary 3D domains, say a 3D scan of a turbine's blade, is computationally expensive and scales quadratically with discretization. Traditional workflows in research and industry exploit variants of the finite element method (FEM), but some key benefits of using Monte Carlo (MC) methods have been identified. We use sparse and approximate MC estimates to infer adaptive discretization. We achieve this by training a neural network that is lightweight and that generalizes across shapes and boundary conditions. Our algorithm, Learned Adaptive Mesh Generation (LAMG), maps a set of sparse MC estimates of the solution to a sizing field that defines a local (adaptive) spatial resolution. We then use standard methods to generate tetrahedral meshes that respect the sizing field, and obtain the solution via one FEM computation on the adaptive mesh. We train the network to mimic a computationally expensive method that requires multiple (iterative) FEM solves. Thus, our one-shot method is $2\times$ to $4\times$ faster than adaptive methods for FEM or MC while achieving similar error. Our learning framework is lightweight and versatile. We demonstrate its effectiveness across a large dataset of meshes.

[144] Stochastic Preconditioning for Neural Field Optimization

Selena Ling,Merlin Nimier-David,Alec Jacobson,Nicholas Sharp

Main category: cs.GR

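TL;DR: Introducing spatial stochasticity during training markedly improves the fitting of neural fields, even outperforming custom-designed hierarchies and frequency-space constructions. The method samples with Gaussian-distributed offsets; querying the blurred field during optimization improves convergence and robustness.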

Details

Motivation: Neural fields perform very well in visual computing, but existing methods depend on complex custom designs and lack a simple, unified optimization technique. Method: Proposes stochastic preconditioning based on sampling with Gaussian-distributed offsets, which implicitly operates on a blurred field, adds no cost, and is easy to implement. Result: Experiments show that the method clearly improves performance across many representations and tasks, nearly matching or exceeding existing custom-designed methods. Conclusion: Stochastic preconditioning gives neural fields a simple and efficient optimization method that applies to many settings and has broad potential. Abstract: Neural fields are a highly effective representation across visual computing. This work observes that fitting these fields is greatly improved by incorporating spatial stochasticity during training, and that this simple technique can replace or even outperform custom-designed hierarchies and frequency space constructions. The approach is formalized as implicitly operating on a blurred version of the field, evaluated in-expectation by sampling with Gaussian-distributed offsets. Querying the blurred field during optimization greatly improves convergence and robustness, akin to the role of preconditioners in numerical linear algebra. This implicit, sampling-based perspective fits naturally into the neural field paradigm, comes at no additional cost, and is extremely simple to implement. We describe the basic theory of this technique, including details such as handling boundary conditions, and extending to a spatially-varying blur. Experiments demonstrate this approach on representations including coordinate MLPs, neural hashgrids, triplanes, and more, across tasks including surface reconstruction and radiance fields. In settings where custom-designed hierarchies have already been developed, stochastic preconditioning nearly matches or improves their performance with a simple and unified approach; in settings without existing hierarchies it provides an immediate boost to quality and robustness.
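
The central operation described in this abstract, evaluating a blurred field in expectation by sampling with Gaussian-distributed offsets, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `field` stands in for any neural field, the function name and the choice of how `sigma` is set are hypothetical, and boundary handling is omitted.

```python
import random

def blurred_query(field, x, sigma, n_samples=1, rng=random):
    """Monte Carlo estimate of the Gaussian-blurred field at point x.

    field: callable mapping a list of coordinates to a float (a stand-in
           for a neural field).
    sigma: standard deviation of the isotropic Gaussian offset.
    """
    total = 0.0
    for _ in range(n_samples):
        # Perturb every coordinate independently, then query the raw field.
        jittered = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += field(jittered)
    return total / n_samples

# A toy analytic field used in place of a trained network.
toy_field = lambda p: 2.0 * p[0] + 1.0
estimate = blurred_query(toy_field, [1.0], sigma=0.5, n_samples=20000,
                         rng=random.Random(0))
```

During optimization a single jittered sample per query would typically be used, so that the loss sees the blurred field only in expectation; that is consistent with the "no additional cost" claim, although the exact training loop is not given here.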

[145] Progressively Projected Newton's Method

José Antonio Fernández-Fernández,Fabian Löschner,Jan Bender

Main category: cs.GR

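TL;DR: This paper proposes Progressively Projected Newton (PPN), a variant of Newton's method that projects only a subset of element Hessians, reducing computation while preserving convergence.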

Details

Motivation: Conventional Projected Newton (PN) clamps the negative eigenvalues of every element Hessian, which perturbs the global Hessian and slows convergence. Method: PPN uses the current iterate residual to determine dynamically which subset of element Hessians to project, reducing the number of projections and eigendecompositions. Result: Experiments show that PPN needs far fewer projections than PN and PDN and, in most cases, fewer iterations, making it the fastest solver in the benchmark. Conclusion: PPN outperforms PN and PDN in most cases, but PN remains the better choice for very large time steps and quasistatic simulations. Abstract: Newton's Method is widely used to find the solution of complex non-linear simulation problems in Computer Graphics. To guarantee a descent direction, it is common practice to clamp the negative eigenvalues of each element Hessian prior to assembly - a strategy known as Projected Newton (PN) - but this perturbation often hinders convergence. In this work, we observe that projecting only a small subset of element Hessians is sufficient to secure a descent direction. Building on this insight, we introduce Progressively Projected Newton (PPN), a novel variant of Newton's Method that uses the current iterate residual to cheaply determine the subset of element Hessians to project. The global Hessian thus remains closer to its original form, reducing both the number of Newton iterations and the amount of required eigen-decompositions. We compare PPN with PN and Project-on-Demand Newton (PDN) in a comprehensive set of experiments covering contact-free and contact-rich deformables (including large stiffness and mass ratios), co-dimensional, and rigid-body simulations, and a range of time step sizes, tolerances and resolutions. PPN consistently performs fewer than 10% of the projections required by PN or PDN and, in the vast majority of cases, converges in fewer Newton iterations, which makes PPN the fastest solver in our benchmark. The most notable exceptions are simulations with very large time steps and quasistatics, where PN remains a better choice.
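
For readers unfamiliar with the projection step that PN applies to every element Hessian, the sketch below clamps the negative eigenvalues of a symmetric 2×2 matrix using its closed-form eigendecomposition. It is a toy illustration only: real element Hessians are larger, the function name is hypothetical, and the residual-based rule PPN uses to decide which elements to project is not specified in this listing, so it is not shown.

```python
import math

def project_psd_2x2(a, b, d, floor=0.0):
    """Clamp the eigenvalues of [[a, b], [b, d]] to be at least `floor`.

    Returns the entries (a, b, d) of the projected symmetric matrix.
    """
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    lam1 = max(mean + radius, floor)  # larger eigenvalue, clamped
    lam2 = max(mean - radius, floor)  # smaller eigenvalue, clamped
    # Angle of the eigenvector that belongs to the larger eigenvalue.
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    # Reassemble lam1 * v1 v1^T + lam2 * v2 v2^T with v1 = (c, s), v2 = (-s, c).
    return (lam1 * c * c + lam2 * s * s,
            (lam1 - lam2) * c * s,
            lam1 * s * s + lam2 * c * c)

# [[1, 2], [2, 1]] has eigenvalues 3 and -1; the projection removes the -1 part.
projected = project_psd_2x2(1.0, 2.0, 1.0)
```

A matrix that is already positive semidefinite passes through unchanged, which is why skipping the projection for most elements leaves the global Hessian closer to its original form.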

[146] CityGo: Lightweight Urban Modeling and Rendering with Proxy Buildings and Residual Gaussians

Weihang Liu,Yuhui Zhong,Yuke Li,Xi Chen,Jiadi Cui,Honglong Zhang,Lan Xu,Xin Lou,Yujiao Shi,Jingyi Yu,Yingliang Zhang

Main category: cs.GR

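TL;DR: CityGo proposes a hybrid framework that combines textured proxy geometry with residual and surrounding 3D Gaussians for lightweight, photorealistic rendering of urban scenes.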

Details

Motivation: Accurate and efficient modeling of large-scale urban scenes is critical for applications such as AR navigation, UAV inspection, and smart city digital twins. Existing methods such as 3D Gaussian Splatting (3DGS) improve scalability and visual quality but remain limited by dense primitive usage, long training times, and poor suitability for edge devices. Method: CityGo first extracts compact building proxy meshes from MVS point clouds and uses zero-order SH Gaussians to generate occlusion-free textures via image-based rendering and back-projection. To capture high-frequency detail, it adds residual Gaussians placed according to proxy-photo discrepancies and depth priors. Non-critical regions are downsampled in an importance-aware manner to reduce redundancy. Result: Experiments show that CityGo substantially reduces training time (1.4× speedup on average), renders in real time on mobile GPUs, and greatly lowers memory and energy use, with visual fidelity comparable to pure 3DGS. Conclusion: CityGo's hybrid representation outperforms pure 3D Gaussian approaches in training efficiency, real-time rendering capability, and resource consumption, and suits lightweight modeling of large-scale urban scenes. Abstract: Accurate and efficient modeling of large-scale urban scenes is critical for applications such as AR navigation, UAV-based inspection, and smart city digital twins. While aerial imagery offers broad coverage and complements limitations of ground-based data, reconstructing city-scale environments from such views remains challenging due to occlusions, incomplete geometry, and high memory demands. Recent advances like 3D Gaussian Splatting (3DGS) improve scalability and visual quality but remain limited by dense primitive usage, long training times, and poor suitability for edge devices. We propose CityGo, a hybrid framework that combines textured proxy geometry with residual and surrounding 3D Gaussians for lightweight, photorealistic rendering of urban scenes from aerial perspectives. Our approach first extracts compact building proxy meshes from MVS point clouds, then uses zero-order SH Gaussians to generate occlusion-free textures via image-based rendering and back-projection. To capture high-frequency details, we introduce residual Gaussians placed based on proxy-photo discrepancies and guided by depth priors. Broader urban context is represented by surrounding Gaussians, with importance-aware downsampling applied to non-critical regions to reduce redundancy. A tailored optimization strategy jointly refines proxy textures and Gaussian parameters, enabling real-time rendering of complex urban scenes on mobile GPUs with significantly reduced training and memory requirements.
Extensive experiments on real-world aerial datasets demonstrate that our hybrid representation significantly reduces training time, achieving on average 1.4x speedup, while delivering comparable visual fidelity to pure 3D Gaussian Splatting approaches. Furthermore, CityGo enables real-time rendering of large-scale urban scenes on mobile consumer GPUs, with substantially reduced memory usage and energy consumption.

[147] IKMo: Image-Keyframed Motion Generation with Trajectory-Pose Conditioned Motion Diffusion Model

Yang Zhao,Yan Zhang,Xubo Yang

Main category: cs.GR

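TL;DR: IKMo is a two-stage, diffusion-based motion generation method that decouples trajectory and pose inputs and combines an optimization module with parallel encoders to generate high-fidelity motion. Experiments show that it outperforms existing methods under trajectory-keyframe constraints, and that pre-processing by MLLM-based agents improves user satisfaction.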

Details

Motivation: Existing methods process trajectory and pose globally, which leads to suboptimal outputs; an improvement is needed to raise the fidelity and controllability of motion generation. Method: IKMo uses a two-stage conditioning framework: the first stage refines the inputs, and the second stage encodes trajectory and pose with parallel encoders before a ControlNet guides motion generation. Result: IKMo outperforms existing methods on the HumanML3D and KIT-ML datasets, and a user study shows that MLLM-based agent pre-processing makes the generated motion match user expectations more closely. Conclusion: By decoupling and refining its inputs and adding MLLM-based agents, IKMo markedly improves the fidelity and controllability of motion generation. Abstract: Existing human motion generation methods with trajectory and pose inputs apply global processing to both modalities, leading to suboptimal outputs. In this paper, we propose IKMo, an image-keyframed motion generation method based on the diffusion model with trajectory and pose being decoupled. The trajectory and pose inputs go through a two-stage conditioning framework. In the first stage, the dedicated optimization module is applied to refine inputs. In the second stage, trajectory and pose are encoded via a Trajectory Encoder and a Pose Encoder in parallel. Then, motion with high spatial and semantic fidelity is guided by a motion ControlNet, which processes the fused trajectory and pose data. Experimental results based on HumanML3D and KIT-ML datasets demonstrate that the proposed method outperforms state-of-the-art methods on all metrics under trajectory-keyframe constraints. In addition, MLLM-based agents are implemented to pre-process model inputs. Given texts and keyframe images from users, the agents extract motion descriptions, keyframe poses, and trajectories as the optimized inputs into the motion generation model. We conduct a user study with 10 participants. The experimental results show that pre-processing by the MLLM-based agents makes generated motion more in line with users' expectations. We believe that the proposed method improves both the fidelity and controllability of motion generation by the diffusion model.

[148] Hand Shadow Art: A Differentiable Rendering Perspective

Aalok Gangopadhyay,Prajwal Singh,Ashish Tiwari,Shanmuganathan Raman

Main category: cs.GR

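TL;DR: Proposes a differentiable rendering-based method that deforms hand models so that they cast shadows consistent with a target image and lighting configuration.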

Details

Motivation: To explore how hand deformation can produce artistic shadow effects, and to give the graphics community a practical tool. Method: Uses differentiable rendering to adjust the shape of a hand model until its shadow matches the target shadow image under the given lighting. Result: Demonstrates shadows cast by a pair of hands, as well as interpolation of hand poses between two target shadow images. Conclusion: The method offers the graphics community a new tool for generating artistic shadows. Abstract: Shadow art is an exciting form of sculptural art that produces captivating artistic effects through the 2D shadows cast by 3D shapes. Hand shadows, also known as shadow puppetry or shadowgraphy, involve creating various shapes and figures using your hands and fingers to cast meaningful shadows on a wall. In this work, we propose a differentiable rendering-based approach to deform hand models such that they cast a shadow consistent with a desired target image and the associated lighting configuration. We showcase the results of shadows cast by a pair of two hands and the interpolation of hand poses between two desired shadow images. We believe that this work will be a useful tool for the graphics community.

[149] efunc: An Efficient Function Representation without Neural Networks

Biao Zhang,Peter Wonka

Main category: cs.GR

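TL;DR: The paper proposes a parameter-efficient, high-quality function approximation method that does not rely on neural networks; it is based on polynomials interpolated with radial basis functions and substantially reduces computation time and memory consumption.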

Details

Motivation: Existing neural network methods use many parameters, which limits their practicality, so more efficient function approximation methods are worth exploring. Method: Proposes a framework for continuous function modeling, a compact representation based on polynomials interpolated with radial basis functions, and efficient CUDA-optimized algorithms. Result: In 3D SDF experiments, performance is comparable or superior to existing techniques (e.g., octree/hash-grid methods) with fewer parameters. Conclusion: The method is efficient, uses few parameters, and suits high-quality function approximation tasks. Abstract: Function fitting/approximation plays a fundamental role in computer graphics and other engineering applications. While recent advances have explored neural networks to address this task, these methods often rely on architectures with many parameters, limiting their practical applicability. In contrast, we pursue high-quality function approximation using parameter-efficient representations that eliminate the dependency on neural networks entirely. We first propose a novel framework for continuous function modeling. Most existing works can be formulated using this framework. We then introduce a compact function representation, which is based on polynomials interpolated using radial basis functions, bypassing both neural networks and complex/hierarchical data structures. We also develop memory-efficient CUDA-optimized algorithms that reduce computational time and memory consumption to less than 10% compared to conventional automatic differentiation frameworks. Finally, we validate our representation and optimization pipeline through extensive experiments on 3D signed distance functions (SDFs). The proposed representation achieves comparable or superior performance to state-of-the-art techniques (e.g., octree/hash-grid techniques) with significantly fewer parameters.

[150] Structure from Collision

Takuhiro Kaneko

Main category: cs.GR

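TL;DR: The paper introduces a new task, Structure from Collision (SfC), which estimates an object's internal structure from appearance changes during collision, and proposes SfC-NeRF, a model that optimizes the invisible internal structure with the help of volume annealing.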

Details

Motivation: Existing neural 3D representations (such as NeRF and 3DGS) can estimate only the visible external structure and cannot identify hidden internal structure. The SfC task is meant to address this limitation. Method: Proposes SfC-NeRF, which optimizes internal structure under physical, appearance-preserving, and keyframe constraints, and uses volume annealing to avoid local optima. Result: Experiments on 115 objects with different structures and materials reveal the properties of SfC and confirm the effectiveness of SfC-NeRF. Conclusion: SfC-NeRF successfully addresses the task of estimating internal structure from collision and demonstrates the benefit of volume annealing. Abstract: Recent advancements in neural 3D representations, such as neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS), have enabled the accurate estimation of 3D structures from multiview images. However, this capability is limited to estimating the visible external structure, and identifying the invisible internal structure hidden behind the surface is difficult. To overcome this limitation, we address a new task called Structure from Collision (SfC), which aims to estimate the structure (including the invisible internal structure) of an object from appearance changes during collision. To solve this problem, we propose a novel model called SfC-NeRF that optimizes the invisible internal structure of an object through a video sequence under physical, appearance (i.e., visible external structure)-preserving, and keyframe constraints. In particular, to avoid falling into undesirable local optima owing to its ill-posed nature, we propose volume annealing; that is, searching for global optima by repeatedly reducing and expanding the volume. Extensive experiments on 115 objects involving diverse structures (i.e., various cavity shapes, locations, and sizes) and material properties revealed the properties of SfC and demonstrated the effectiveness of the proposed SfC-NeRF.

[151] CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects

Huaijin Pi,Zhi Cen,Zhiyang Dou,Taku Komura

Main category: cs.GR

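TL;DR: Proposes a coordinated diffusion noise optimization framework for synthesizing whole-body manipulation of articulated objects, covering body, hand, and object motion, and addressing the challenges of coordination and precision.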

Details

Motivation: Whole-body manipulation of articulated objects has broad applications in virtual humans and robotics, but it requires solving body-hand coordination and precise control with many degrees of freedom. Method: Three specialized diffusion models optimize the noise spaces of the body, left hand, and right hand separately; coordination arises through gradient flow, and a unified BPS representation improves the precision of hand-object interaction. Result: Experiments show that the method surpasses existing approaches in motion quality and physical plausibility and supports capabilities such as object pose control and manipulation while walking. Conclusion: The proposed framework effectively addresses coordination and precision in whole-body manipulation and has broad application potential. Abstract: Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation. Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion.
We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility, and enables various capabilities such as object pose control, simultaneous walking and manipulation, and whole-body generation from hand-only data.

cs.CL [Back]

[152] Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

Amr Hegazy,Mostafa Elhoushi,Amr Alanwar

Main category: cs.CL

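TL;DR: Proposes a lightweight, trainable controller network that dynamically modulates LLM activations for fine-grained, adaptive control of behavior at inference time.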

Details

Motivation: Existing methods lack fine-grained, adaptive mechanisms and cannot effectively keep LLMs from generating unsafe content or violating safety guidelines. Method: A lightweight controller network is integrated at inference; it observes intermediate activations, predicts a global scaling factor and layer-specific weights, and uses them to modulate a pre-computed "refusal direction" vector dynamically. Result: On safety benchmarks such as ToxicChat and In-The-Wild Jailbreak Prompts, the method significantly raises refusal rates and outperforms existing methods. Conclusion: The method achieves fine-grained behavioral control without modifying the original model parameters, and is efficient and adaptive. Abstract: Controlling undesirable Large Language Model (LLM) behaviors, such as the generation of unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller network observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. The predicted global scaling factor and layer-specific weights then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to discriminatively apply nuanced, layer-aware interventions, activating steering primarily for harmful inputs. Experiments using safety benchmarks like ToxicChat & In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B & Mistral-7B show our approach outperforms existing methods, presenting an efficient and adaptive method for fine-grained control over LLM behavior at inference time.
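
One plausible reading of how a global scaling factor and layer-specific weights modulate a steering patch is sketched below. The additive form `h + scale * weight * direction` is an assumption made for illustration, as are the function name and the plain-list representation of activations; the abstract does not give the exact formula, and the controller that predicts `global_scale` and `layer_weights` is not shown.

```python
def apply_weighted_steering(hidden_states, refusal_direction,
                            global_scale, layer_weights):
    """Add a scaled steering vector to each layer's activation.

    hidden_states: one activation vector (list of floats) per layer.
    refusal_direction: pre-computed steering vector, same width as a layer.
    global_scale, layer_weights: values a controller would predict
        (supplied by hand here; assumed additive combination).
    """
    steered = []
    for activation, weight in zip(hidden_states, layer_weights):
        strength = global_scale * weight  # per-layer intensity of the patch
        steered.append([a + strength * r
                        for a, r in zip(activation, refusal_direction)])
    return steered

# Two toy layers; only the second one is steered (weight 1.0 vs 0.0).
steered = apply_weighted_steering(
    hidden_states=[[1.0, 0.0], [0.0, 1.0]],
    refusal_direction=[0.5, -0.5],
    global_scale=2.0,
    layer_weights=[0.0, 1.0],
)
```

With a scale near zero the activations pass through untouched, which matches the stated goal of activating steering primarily for harmful inputs.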

[153] Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL

Zhewei Yao,Guoheng Sun,Lukasz Borchmann,Zheyu Shen,Minghang Deng,Bohan Zhai,Hao Zhang,Ang Li,Yuxiong He

Main category: cs.CL

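TL;DR: Arctic-Text2SQL-R1 is a reinforcement learning framework for generating accurate, executable SQL queries; it trains efficiently with a lightweight reward signal and reaches state-of-the-art performance on multiple benchmarks.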

Details

Motivation: Although large language models have advanced SQL generation, producing correct and executable SQL for complex queries remains a challenge. Method: A reinforcement learning framework with a lightweight reward signal based solely on execution correctness, avoiding complex intermediate supervision and reward shaping. It is combined with curated data, strong supervised initialization, and effective training practices. Result: State-of-the-art execution accuracy on six different Text2SQL benchmarks, with the 7B model outperforming prior 70B-class systems. Conclusion: The framework is scalable and efficient; simple extensions such as value retrieval and majority voting further improve inference-time robustness, and the study offers practical guidance for future research. Abstract: Translating natural language into SQL (Text2SQL) is a longstanding challenge at the intersection of natural language understanding and structured data access. While large language models (LLMs) have significantly improved fluency in SQL generation, producing correct and executable SQL--particularly for complex queries--remains a bottleneck. We present Arctic-Text2SQL-R1, a reinforcement learning (RL) framework and model family designed to generate accurate, executable SQL using a lightweight reward signal based solely on execution correctness. Our approach avoids brittle intermediate supervision and complex reward shaping, promoting stable training and alignment with the end task. Combined with carefully curated data, strong supervised initialization, and effective training practices, Arctic-Text2SQL-R1 achieves state-of-the-art execution accuracy across six diverse Text2SQL benchmarks, including the top position on the BIRD leaderboard. Notably, our 7B model outperforms prior 70B-class systems, highlighting the framework's scalability and efficiency. We further demonstrate inference-time robustness through simple extensions like value retrieval and majority voting. Extensive experiments and ablation studies offer both positive and negative insights, providing practical guidance for future Text2SQL research.
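
A reward based solely on execution correctness, together with majority voting over candidates, might look roughly like the sketch below, which uses Python's standard `sqlite3` module. The binary 1.0/0.0 reward values, the order-insensitive comparison of result rows, and the function names are assumptions for illustration; the abstract does not specify them.

```python
import sqlite3
from collections import Counter

def run_query(conn, sql):
    """Execute sql; return its rows in a canonical order, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)  # assumption: row order is ignored

def execution_reward(conn, predicted_sql, gold_sql):
    """1.0 if the prediction runs and returns the gold result set, else 0.0."""
    predicted = run_query(conn, predicted_sql)
    return 1.0 if predicted is not None and predicted == run_query(conn, gold_sql) else 0.0

def majority_vote(conn, candidates):
    """Return the first candidate whose result is the most common one."""
    results = {sql: run_query(conn, sql) for sql in candidates}
    counts = Counter(repr(r) for r in results.values() if r is not None)
    if not counts:
        return None  # no candidate executed successfully
    winner = counts.most_common(1)[0][0]
    return next(sql for sql in candidates
                if results[sql] is not None and repr(results[sql]) == winner)

# Hypothetical toy database for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO staff VALUES (?, ?)",
                 [("Ann", 31), ("Bob", 45), ("Cy", 28)])
gold = "SELECT name FROM staff WHERE age > 30"
```

Queries with an explicit `ORDER BY` would need an order-sensitive comparison instead of the sorted one used here.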

[154] Beyond Demonstrations: Dynamic Vector Construction from Latent Representations

Wang Cai,Hsiu-Yuan Huang,Zhixiang Wang,Yunfang Wu

Main category: cs.CL

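TL;DR: DyVec is a dynamic vector method that addresses the limitations of ICV methods, such as sensitivity to ICL-specific factors and heuristic injection positions, and outperforms few-shot ICL and existing ICV baselines.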

Details

Motivation: Existing ICV methods are sensitive to ICL-specific factors, use coarse or semantically fragmented representations, and rely on heuristic injection positions, which limits their applicability. Method: DyVec uses an Exhaustive Query Rotation (EQR) strategy to extract robust, semantically aggregated representations, then applies Dynamic Latent Segmentation and Injection with REINFORCE-based optimization to learn the best injection positions. Result: Experiments show that DyVec outperforms few-shot ICL, LoRA, and existing ICV baselines, and that dynamically segmenting and injecting semantically aggregated representations is effective. Conclusion: DyVec offers a lightweight, data-efficient solution for inference-time task adaptation. Abstract: In-Context derived Vector (ICV) methods extract task-relevant representations from large language models (LLMs) and reinject them during inference, achieving comparable performance to few-shot In-Context Learning (ICL) without repeated demonstration processing. However, existing ICV methods remain sensitive to ICL-specific factors, often use coarse or semantically fragmented representations as the source of the vector, and rely on heuristic-based injection positions, limiting their applicability. To address these issues, we propose Dynamic Vector (DyVec), which incorporates an Exhaustive Query Rotation (EQR) strategy to extract robust semantically aggregated latent representations by mitigating variance introduced by ICL. It then applies Dynamic Latent Segmentation and Injection to adaptively partition representations based on task complexity and leverages REINFORCE-based optimization to learn optimal injection positions for each segment. Experimental results show that DyVec outperforms few-shot ICL, LoRA, and prior ICV baselines. Further analysis highlights the effectiveness of dynamically segmenting and injecting semantically aggregated latent representations. DyVec provides a lightweight and data-efficient solution for inference-time task adaptation.

[155] Less Context, Same Performance: A RAG Framework for Resource-Efficient LLM-Based Clinical NLP

Satya Narayana Cheetirala,Ganesh Raut,Dhavalkumar Patel,Fabio Sanatana,Robert Freeman,Matthew A Levin,Girish N. Nadkarni,Omar Dawkins,Reba Miller,Randolph M. Steinhagen,Eyal Klang,Prem Timsina

Main category: cs.CL

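TL;DR: By using only the most relevant text segments, a RAG approach matches whole-text processing on a long-text classification task while substantially reducing computational cost.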

Details

Motivation: To address the challenges that token limits and high computational cost pose for LLMs in long-text classification. Method: Clinical documents are split into chunks, converted to vector embeddings, and stored in a FAISS index; the 4,000 most relevant words are retrieved and passed to LLMs for classification. Result: The RAG approach and whole-text processing show no significant differences in AUC ROC, precision, recall, or F1 (p > 0.05). Conclusion: RAG can substantially reduce token usage without losing classification accuracy, offering a scalable and economical solution for analyzing long clinical documents. Abstract: Long text classification is challenging for Large Language Models (LLMs) due to token limits and high computational costs. This study explores whether a Retrieval Augmented Generation (RAG) approach using only the most relevant text segments can match the performance of processing entire clinical notes with large context LLMs. We begin by splitting clinical documents into smaller chunks, converting them into vector embeddings, and storing these in a FAISS index. We then retrieve the top 4,000 words most pertinent to the classification query and feed these consolidated segments into an LLM. We evaluated three LLMs (GPT4o, LLaMA, and Mistral) on a surgical complication identification task. Metrics such as AUC ROC, precision, recall, and F1 showed no statistically significant differences between the RAG based approach and whole-text processing (p > 0.05). These findings indicate that RAG can significantly reduce token usage without sacrificing classification accuracy, providing a scalable and cost-effective solution for analyzing lengthy clinical documents.
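
The chunk, embed, retrieve steps described above can be sketched without external dependencies. To keep the example self-contained, a bag-of-words cosine similarity stands in for the learned embeddings and a plain sort stands in for the FAISS index; both substitutions, the chunk size, the function names, and the choice to return chunks in document order are assumptions, not details from the study.

```python
import math
import re
from collections import Counter

def split_into_chunks(text, chunk_words):
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]

def embed(text):
    # Stand-in embedding: lowercase word counts (not a learned model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def retrieve(text, query, chunk_words=200, word_budget=4000):
    """Keep the highest-scoring chunks until the word budget is reached."""
    chunks = split_into_chunks(text, chunk_words)
    query_vec = embed(query)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(embed(chunks[i]), query_vec),
                    reverse=True)
    kept, used = [], 0
    for i in ranked:
        size = len(chunks[i].split())
        if used + size > word_budget:
            break
        kept.append(i)
        used += size
    # Assumption: selected chunks are concatenated in document order.
    return " ".join(chunks[i] for i in sorted(kept))

note = " ".join(["routine vitals stable today"] * 5
                + ["wound infection after surgery"]
                + ["diet advanced as tolerated"] * 5)
context = retrieve(note, "surgical wound infection",
                   chunk_words=4, word_budget=4)
```

The returned `context` would then be placed in the LLM prompt in place of the full note.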

[156] BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

Mathew J. Koretsky,Maya Willey,Adi Asija,Owen Bianchi,Chelsea X. Alvarado,Tanay Nayak,Nicole Kuznetsov,Sungwon Kim,Mike A. Nalls,Daniel Khashabi,Faraz Faghri

Main category: cs.CL

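TL;DR: BiomedSQL is a benchmark designed to evaluate scientific reasoning in text-to-SQL generation for the biomedical domain, comprising 68,000 question/SQL query/answer triples grounded in a real-world biomedical knowledge base.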

Details

Motivation: Current text-to-SQL systems struggle to turn qualitative scientific questions in the biomedical domain into executable SQL, especially when implicit domain reasoning is required. Method: The BiomedSQL benchmark is built by integrating gene-disease associations, causal inference from omics data, and drug approval records, and a range of open- and closed-source LLMs is evaluated on it. Result: GPT-o3-mini reaches 59.0% execution accuracy and the custom multi-step agent BMSQL reaches 62.6%, both far below the expert baseline of 90.0%. Conclusion: BiomedSQL provides a new foundation for improving the reasoning of text-to-SQL systems over biomedical knowledge bases in support of scientific discovery. Abstract: Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.

[157] Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

Mengru Wang,Ziwen Xu,Shengyu Mao,Shumin Deng,Zhaopeng Tu,Huajun Chen,Ningyu Zhang

Main category: cs.CL

TL;DR: Proposes Steering Target Atoms (STA), a new method that isolates and manipulates knowledge components to improve the safety and reliability of language models.

Details

Motivation: Controlling language model generation is vital for safety and reliability, but existing methods such as prompt engineering offer limited control precision because of the large number of parameters and highly intertwined internal representations. Method: Sparse autoencoders (SAE) are used to disentangle knowledge in high-dimensional spaces, and STA locates and manipulates atomic knowledge components. Result: Experiments show that STA is effective at improving safety and control precision, and that it is more robust and flexible in adversarial scenarios. Conclusion: STA offers a new approach to precise control of language models, and its effectiveness is also confirmed on a large reasoning model. Abstract: Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating atomic knowledge components. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis reveals that steering exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply the steering strategy to the large reasoning model, confirming its effectiveness in precise reasoning control.

[158] PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus

Shahriar Noroozizadeh,Sayantan Kumar,George H. Chen,Jeremy C. Weiss

Main category: cs.CL

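TL;DR: PMOA-TTS is the first openly available large-scale dataset of clinical timelines, covering 124,699 PubMed case reports; an LLM pipeline extracts 5.6 million timestamped events, and the paper validates timeline quality and demonstrates predictive value.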

Details

Motivation: Modeling temporal dynamics in clinical narratives requires large-scale temporally annotated resources, but existing resources are limited. Method: Heuristic filtering is combined with Llama 3.3 to select case reports, timelines are extracted with Llama 3.3 and DeepSeek R1, and quality is assessed with three metrics. Result: An 80% event match rate, temporal concordance with c-index > 0.90, and downstream prediction concordance of up to 0.82 ± 0.01. Conclusion: PMOA-TTS provides a scalable foundation for timeline extraction and modeling in biomedical NLP. Abstract: Understanding temporal dynamics in clinical narratives is essential for modeling patient trajectories, yet large-scale temporally annotated resources remain limited. We present PMOA-TTS, the first openly available dataset of 124,699 PubMed Open Access (PMOA) case reports, each converted into structured (event, time) timelines via a scalable LLM-based pipeline. Our approach combines heuristic filtering with Llama 3.3 to identify single-patient case reports, followed by prompt-driven extraction using Llama 3.3 and DeepSeek R1, resulting in over 5.6 million timestamped clinical events. To assess timeline quality, we evaluate against a clinician-curated reference set using three metrics: (i) event-level matching (80% match at a cosine similarity threshold of 0.1), (ii) temporal concordance (c-index > 0.90), and (iii) Area Under the Log-Time CDF (AULTC) for timestamp alignment. Corpus-level analysis shows wide diagnostic and demographic coverage. In a downstream survival prediction task, embeddings from extracted timelines achieve time-dependent concordance indices up to 0.82 $\pm$ 0.01, demonstrating the predictive value of temporally structured narratives. PMOA-TTS provides a scalable foundation for timeline extraction, temporal reasoning, and longitudinal modeling in biomedical NLP. The dataset is available at: https://huggingface.co/datasets/snoroozi/pmoa-tts .
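
The temporal concordance metric can be illustrated with a basic pairwise concordance index over matched events: the fraction of event pairs whose extracted timestamps are ordered the same way as in the reference. This is a generic sketch, assuming ties in the reference are skipped and ties in the prediction count as half; the paper's exact definition may differ, and the function name and sample times are hypothetical.

```python
def concordance_index(reference_times, predicted_times):
    """Share of comparable event pairs ordered consistently in both timelines."""
    concordant = 0.0
    comparable = 0
    n = len(reference_times)
    for i in range(n):
        for j in range(i + 1, n):
            ref_gap = reference_times[i] - reference_times[j]
            if ref_gap == 0:
                continue  # reference gives no ordering for this pair
            comparable += 1
            pred_gap = predicted_times[i] - predicted_times[j]
            if pred_gap == 0:
                concordant += 0.5  # predicted tie counts as half
            elif (ref_gap > 0) == (pred_gap > 0):
                concordant += 1.0
    return concordant / comparable if comparable else float("nan")

# Hypothetical hours since admission for three matched events.
score = concordance_index([0, 24, 48], [0, 30, 20])
```

Here two of the three pairs keep their reference order, so the score is two thirds; a perfectly ordered extraction scores 1.0.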

[159] Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

Amirhosein Ghasemabadi,Keith G. Mills,Baochun Li,Di Niu

Main category: cs.CL

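TL;DR: The paper proposes Guided by Gut (GG), an efficient self-guided test-time scaling framework that reaches PRM-level performance without relying on costly external verifier models.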

Details

Motivation: Existing TTS methods depend on external PRMs or BoN sampling and are computationally expensive; GG aims to cut this cost through self-guidance. Method: GG uses a lightweight tree search that relies only on intrinsic LLM signals (token-level confidence and step novelty), and improves the reliability of confidence estimates through reinforcement learning fine-tuning. Result: Experiments show that GG enables small models (1.5B parameters) to match or surpass the accuracy of much larger models (32B-70B parameters) while reducing GPU memory usage by up to 10x; compared with PRM-based methods, inference is 8x faster with 4-5x lower memory usage. Conclusion: GG substantially lowers the compute and memory cost of TTS techniques, making them more efficient and practical. Abstract: Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals: token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference speeds and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.
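
To make the idea of a token-level confidence signal concrete, here is a minimal sketch that scores each candidate reasoning step by the geometric-mean probability of its tokens and keeps the most confident one. The abstract does not specify how confidence is aggregated or how step novelty enters the search, so the aggregation rule, the function names, and the sample log-probabilities are all assumptions.

```python
import math


def step_confidence(token_logprobs):
    """Geometric-mean probability of the tokens in one reasoning step."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def pick_step(candidates):
    """candidates: dict mapping step text to its token log-probabilities."""
    return max(candidates, key=lambda step: step_confidence(candidates[step]))


# Hypothetical candidate steps with the log-probabilities the model assigned.
candidates = {
    "x = 4": [math.log(0.9), math.log(0.8)],
    "x = 5": [math.log(0.5), math.log(0.4)],
}
print(pick_step(candidates))
```

Because the score comes from the generating model's own probabilities, no separate verifier has to be loaded, which is the cost saving the abstract emphasizes.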

[160] Multi-Scale Manifold Alignment: A Unified Framework for Enhanced Explainability of Large Language Models

Yukun Zhang,Qi Dong

Main category: cs.CL

TL;DR: Proposes a multi-scale manifold alignment framework that decomposes an LLM's latent space to improve interpretability and trust.

Details

Motivation: Although large language models (LLMs) perform strongly, their internal reasoning is opaque, which limits interpretability and trust in critical applications. Method: A multi-scale manifold alignment framework decomposes the latent space into global, intermediate, and local semantic manifolds, and introduces cross-scale mapping functions that combine geometric alignment with information-preservation constraints. Result: Theoretical analysis shows that the alignment error, measured by KL divergence, can be bounded under mild assumptions. Conclusion: The framework gives a unified account of the multi-scale semantic structure of LLMs, advances interpretability research, and supports applications such as bias detection and robustness enhancement. Abstract: Recent advances in Large Language Models (LLMs) have achieved strong performance, yet their internal reasoning remains opaque, limiting interpretability and trust in critical applications. We propose a novel Multi-Scale Manifold Alignment framework that decomposes the latent space into global, intermediate, and local semantic manifolds capturing themes, context, and word-level details. Our method introduces cross-scale mapping functions that jointly enforce geometric alignment (e.g., Procrustes analysis) and information preservation (via mutual information constraints like MINE or VIB). We further incorporate curvature regularization and hyperparameter tuning for stable optimization. Theoretical analysis shows that alignment error, measured by KL divergence, can be bounded under mild assumptions. This framework offers a unified explanation of how LLMs structure multi-scale semantics, advancing interpretability and enabling applications such as bias detection and robustness enhancement.
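
The abstract names Procrustes analysis as one option for geometric alignment. As a rough illustration, the sketch below solves the simplest special case in closed form: finding the 2D rotation that best maps one centered point set onto another in the least-squares sense. The general orthogonal Procrustes problem is usually solved with an SVD in higher dimensions; the 2D restriction, the function names, and the sample points are assumptions for illustration, not the paper's procedure.

```python
import math


def best_rotation_angle(source, target):
    """Angle (radians) of the 2D rotation that best maps source onto target.

    Minimizes the summed squared distance between rotated source points
    and their paired target points (both assumed centered at the origin).
    """
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)


def rotate(points, angle):
    """Rotate 2D points counterclockwise by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Hypothetical embeddings at two scales that differ by a 30-degree rotation.
coarse = [(1.0, 0.0), (0.0, 2.0), (-1.0, -1.0)]
fine = rotate(coarse, math.radians(30))
print(math.degrees(best_rotation_angle(coarse, fine)))
```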

[161] Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query

Yixuan Wang,Shiyu Ji,Yijun Liu,Yuzhuang Xu,Yang Xu,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

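TL;DR: The paper proposes Lookahead Q-Cache (LAQ), a new framework that generates low-cost pseudo lookahead queries to guide KV cache eviction, addressing the inconsistency of existing methods under tight memory budgets.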

Details

Motivation: Large language models (LLMs) rely on the KV cache to accelerate decoding, but long text sequences sharply increase memory usage. Existing methods evict cache entries based on prefilling-stage attention scores, which are inconsistent with the actual inference queries, especially under tight memory budgets. Method: LAQ generates low-cost pseudo lookahead queries and uses them as the observation window for importance estimation, so that cache eviction better matches real inference. Result: On the LongBench and Needle-in-a-Haystack benchmarks, LAQ outperforms existing methods, gaining 1 to 4 points under limited cache budgets. Conclusion: LAQ is effective on its own and can also be combined with existing methods for further gains. Abstract: Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 to 4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.
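
The eviction step described above can be sketched as follows: each cached position is scored by the total attention it receives from the queries in an observation window, and only the top-scoring positions are kept. The sketch covers the scoring-and-selection step only; how LAQ produces its pseudo lookahead queries is not shown, and the function name, the sum-based importance score, and the sample weights are assumptions.

```python
def select_kv_to_keep(attention, budget):
    """Return the sorted cache positions to keep.

    attention[q][k] is the attention weight that observation-window
    query q assigns to cached position k (rows assumed normalized).
    """
    num_keys = len(attention[0])
    # Importance of a cached position = total attention it receives
    # from all queries in the observation window.
    importance = [sum(row[k] for row in attention) for k in range(num_keys)]
    ranked = sorted(range(num_keys), key=lambda k: importance[k], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical attention of two window queries over four cached tokens.
window_attention = [
    [0.70, 0.05, 0.05, 0.20],
    [0.10, 0.05, 0.60, 0.25],
]
print(select_kv_to_keep(window_attention, budget=2))
```

The choice of window matters because the scores depend entirely on which queries do the attending; this is why queries that resemble the real decoding-stage queries should give more consistent eviction decisions than prefilling-stage ones.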

[162] Language Model Distillation: A Temporal Difference Imitation Learning Perspective

Zishun Yu,Shangzhe Li,Xinhua Zhang

Main category: cs.CL

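TL;DR: The paper proposes a general temporal difference learning framework for language model distillation that exploits the distributional sparsity of the teacher model and operates on a reduced action space to improve performance.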

Details

Motivation: Large language models are powerful but computationally expensive. Distillation is widely used to compress them, yet most existing methods amount to behavior cloning; this paper proposes a new framework from the perspective of temporal difference learning. Method: Exploiting the distributional sparsity of the teacher model (most probability mass concentrates on a few tokens), the authors design a temporal difference learning framework that operates on a reduced action space. Result: Experiments show that the method improves the performance of distilled models. Conclusion: Combining temporal difference learning with distributional sparsity yields an efficient distillation framework and a new direction for follow-up work. Abstract: Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models assign most probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of vocabulary), and demonstrate how practical algorithms can be derived and the resulting performance improvements.
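
The notion of a reduced action space can be made concrete with a small sketch: keep only the most probable tokens under the teacher's next-token distribution and renormalize. Top-k selection is one plausible way to pick the subset and is an assumption here, as are the function name and the toy distribution; the paper may construct the subset differently.

```python
def reduced_action_space(teacher_probs, k):
    """Keep the k most probable tokens and renormalize their probabilities.

    Returns the renormalized distribution and the probability mass
    the kept tokens covered in the original distribution.
    """
    top = sorted(teacher_probs, key=teacher_probs.get, reverse=True)[:k]
    mass = sum(teacher_probs[t] for t in top)
    return {t: teacher_probs[t] / mass for t in top}, mass


# Hypothetical teacher next-token distribution over a five-word vocabulary.
teacher = {"the": 0.55, "a": 0.30, "an": 0.10, "zebra": 0.04, "qux": 0.01}
actions, covered = reduced_action_space(teacher, k=2)
print(sorted(actions), round(covered, 2))
```

When the teacher is sparse in this sense, a small subset covers most of the mass, so value estimates only need to be maintained for a handful of actions per state rather than the full vocabulary.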

[163] MOSLIM: Align with diverse preferences in prompts through reward classification

Yu Zhang,Wanli Jiang,Zhengyu Yang

Main category: cs.CL

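TL;DR: MOSLIM is a novel multi-objective alignment method that handles diverse objectives with a single reward model and a single policy model, requires no preference training, and significantly reduces GPU resource requirements.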

Details

Motivation: Existing methods need multiple policies or reward models, or a preference-specific fine-tuned model; MOSLIM aims to simplify this process and improve efficiency. Method: MOSLIM uses a multi-head reward model to classify question-answer pairs, converts the classification results into reward scores through a mapping function, and optimizes the policy model with them. Result: MOSLIM performs strongly on several multi-objective benchmarks while consuming significantly fewer resources than existing methods. Conclusion: MOSLIM offers an efficient and flexible approach to multi-objective alignment that suits large-scale model deployment. Abstract: The multi-objective alignment of Large Language Models (LLMs) is essential for ensuring foundational models conform to diverse human preferences. Current research in this field typically involves either multiple policies or multiple reward models customized for various preferences, or the need to train a preference-specific supervised fine-tuning (SFT) model. In this work, we introduce a novel multi-objective alignment method, MOSLIM, which utilizes a single reward model and policy model to address diverse objectives. MOSLIM provides a flexible way to control these objectives through prompting and does not require preference training during the SFT phase, allowing thousands of off-the-shelf models to be directly utilized within this training framework. MOSLIM leverages a multi-head reward model that classifies question-answer pairs instead of scoring them and then optimizes the policy model with a scalar reward derived from a mapping function that converts classification results from the reward model into reward scores. We demonstrate the efficacy of our proposed method across several multi-objective benchmarks and conduct ablation studies on various reward model sizes and policy optimization methods. The MOSLIM method outperforms current multi-objective approaches in most results while requiring significantly fewer GPU computing resources compared with existing policy optimization methods.
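
As a rough illustration of turning classification results into a scalar reward, the sketch below compares the class each reward head predicts with the level requested in the prompt and averages a distance-based score across objectives. The abstract does not define the mapping function, so this particular mapping, the objective names, and the three-level scale are purely hypothetical.

```python
def classification_to_reward(predicted, target, num_levels):
    """Map per-objective class predictions to one scalar reward in [0, 1].

    predicted / target: dicts from objective name to class index.
    The distance-based mapping is an illustrative assumption.
    """
    scores = []
    for objective, wanted in target.items():
        distance = abs(predicted[objective] - wanted)
        scores.append(1.0 - distance / (num_levels - 1))
    return sum(scores) / len(scores)


# Hypothetical levels: 0 = low, 1 = medium, 2 = high.
target = {"helpfulness": 2, "harmlessness": 2}
predicted = {"helpfulness": 2, "harmlessness": 1}
print(classification_to_reward(predicted, target, num_levels=3))
```

A single scalar of this kind can be fed to a standard policy optimization loop, which is consistent with the abstract's point that one reward model and one policy model suffice.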

[164] Assessing the Capability of LLMs in Solving POSCOMP Questions

Cayo Viegas,Rohit Gheyi,Márcio Ribeiro

Main category: cs.CL

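TL;DR: A study of LLM performance in computer science, using Brazil's POSCOMP exam as the benchmark; models such as ChatGPT-4 perform well on text-based questions, but image interpretation remains a challenge.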

Details

Motivation: To assess the practical capabilities of LLMs in specialized domains such as computer science and to guide their future development. Method: Evaluates four LLMs (ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large) on the 2022 and 2023 POSCOMP exams, then extends the analysis to the 2024 exam and more recent models. Result: ChatGPT-4 surpassed all human test takers on the 2023 exam, and the newer models consistently outperform both the average and the top-performing human participants on the 2022-2024 exams. Conclusion: LLMs show clear promise on text-based tasks but need better image handling; future models are expected to improve further. Abstract: Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination used for graduate admissions in computer science promoted by the Brazilian Computer Society (SBC), provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam. ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models - o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high - evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.

[165] Dynamic Manifold Evolution Theory: Modeling and Stability Analysis of Latent Representations in Large Language Models

Yukun Zhang,Qi Dong

Main category: cs.CL

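TL;DR: DMET is a dynamic manifold evolution theory that models large language model generation as a controlled dynamical system on a low-dimensional semantic manifold and links it to generation quality through quantitative metrics.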

Details

Motivation: To address the balance between creativity and consistency in text generation and to provide a unified dynamical-systems framework. Method: Models latent-state updates as discrete-time Euler approximations of continuous dynamics and uses Lyapunov stability theory to define three quantitative metrics. Result: Experiments validate DMET's predictions and yield guidelines for balancing creativity and consistency in generated text. Conclusion: DMET provides a theoretical framework and a quantitative evaluation approach for text generation that can help improve generation quality. Abstract: We introduce Dynamic Manifold Evolution Theory (DMET), a unified framework that models large language model generation as a controlled dynamical system evolving on a low-dimensional semantic manifold. By casting latent-state updates as discrete-time Euler approximations of continuous dynamics, we map intrinsic energy-driven flows and context-dependent forces onto Transformer components (residual connections, attention, feed-forward networks). Leveraging Lyapunov stability theory, we define three empirical metrics (state continuity, clustering quality, topological persistence) that quantitatively link latent-trajectory properties to text fluency, grammaticality, and semantic coherence. Extensive experiments across decoding parameters validate DMET's predictions and yield principled guidelines for balancing creativity and consistency in text generation.
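
For readers unfamiliar with the Euler view of latent-state updates, the generic form is sketched below. The symbols ($h$ for the latent state, $c$ for the context, $f$ for the vector field, $\Delta t$ for the step size, $F_\ell$ for a layer's update) are assumptions chosen for illustration and need not match the paper's notation.

```latex
% Continuous-time dynamics of the latent state h under context c:
\frac{dh}{dt} = f\bigl(h(t), c(t)\bigr)

% Forward-Euler discretization with step size \Delta t:
h_{t+1} = h_t + \Delta t \, f(h_t, c_t)

% A residual connection has the same shape, with \Delta t = 1 and the
% layer's attention / feed-forward update F_\ell playing the role of f:
h_{\ell+1} = h_\ell + F_\ell(h_\ell)
```

The structural match between the second and third lines is what allows a stack of residual layers to be read as a discretized trajectory.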

[166] Do LLMs have a Gender (Entropy) Bias?

Sonal Prabhune,Balaji Padmanabhan,Kaushik Dutta

Main category: cs.CL

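TL;DR: The study finds that although LLM responses show no significant bias between men and women at the category level, they differ markedly at the level of individual questions. It proposes a simple debiasing method that effectively increases the information content of responses.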

Details

Motivation: To investigate whether popular LLMs exhibit gender bias and to develop a new benchmark dataset, RealWorldQuestioning. Method: Tests four LLMs for entropy bias, with qualitative and quantitative evaluation by ChatGPT-4o. Result: No significant bias at the category level, but differences at the individual-question level; the proposed debiasing method increases information content in 78% of cases. Conclusion: A simple prompt-based debiasing strategy can effectively balance LLM outputs and improve response quality. Abstract: We investigate the existence and persistence of a specific type of gender bias in some of the popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace), developed from real-world questions across four key domains in business and health contexts: education, jobs, personal financial management, and general health. We define and study entropy bias, which we define as a discrepancy in the amount of information generated by an LLM in response to real questions users have asked. We tested this using four different LLMs and evaluated the generated responses both qualitatively and quantitatively by using ChatGPT-4o (as "LLM-as-judge"). Our analyses (metric-based comparisons and "LLM-as-judge" evaluation) suggest that there is no significant bias in LLM responses for men and women at a category level. However, at a finer granularity (the individual question level), there are substantial differences in LLM responses for men and women in the majority of cases, which "cancel" each other out often due to some responses being better for males and vice versa. This is still a concern since typical users of these tools often ask a specific question (only) as opposed to several varied ones in each of these common yet important areas of life. We suggest a simple debiasing approach that iteratively merges the responses for the two genders to produce a final result. Our approach demonstrates that a simple, prompt-based debiasing strategy can effectively debias LLM outputs, thus producing responses with higher information content than both gendered variants in 78% of the cases, and consistently achieving a balanced integration in the remaining cases.

[167] SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data

Wenkai Fang,Shunyu Liu,Yang Zhou,Kongcheng Zhang,Tongya Zheng,Kaixuan Chen,Mingli Song,Dacheng Tao

Main category: cs.CL

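TL;DR: SeRL uses self-instruction and self-rewarding modules to improve LLM reasoning when initial data is limited, without relying on high-quality instructions or external reward annotations.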

Details

Motivation: Existing RL methods depend on high-quality instructions and verifiable rewards, which are hard to obtain in specialized domains; SeRL aims to solve this problem. Method: SeRL comprises a self-instruction module (generating diverse, high-quality instructions) and a self-rewarding module (estimating rewards by majority voting), combined with conventional RL for iterative learning. Result: Across multiple reasoning benchmarks and different LLM backbones, SeRL outperforms comparable methods and approaches the results of training on high-quality data. Conclusion: SeRL is an effective solution for training LLMs with limited data, performing strongly without external annotations. Abstract: Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at https://github.com/wantbook-book/SeRL.
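
The majority-voting reward described above can be sketched in a few lines: sample several responses to the same instruction, treat the most common final answer as a pseudo-label, and reward the responses that agree with it. The function name, the binary reward values, and the sample answers are assumptions; the released SeRL code may extract and compare answers differently.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Reward 1.0 for responses whose final answer matches the most common one."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers], majority


# Hypothetical final answers extracted from five sampled responses.
sampled = ["42", "42", "41", "42", "7"]
rewards, pseudo_label = majority_vote_rewards(sampled)
print(rewards, pseudo_label)
```

No ground-truth label appears anywhere in this computation, which is what removes the need for external annotation; the trade-off is that the reward is only as reliable as the model's majority answer.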

[168] Rethinking Text-based Protein Understanding: Retrieval or LLM?

Juntong Wu,Zijing Liu,He Cao,Hao Li,Bin Feng,Zishan Shu,Ke Yu,Li Yuan,Yu Li

Main category: cs.CL

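TL;DR: The paper analyzes data leakage in current protein-text models and proposes a new evaluation framework based on biological entities together with a retrieval-enhanced method that significantly improves protein-to-text generation.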

Details

Motivation: Evaluations of current protein-text models suffer from data leakage, and conventional natural language processing metrics are ill-suited to this domain. Method: Reorganizes existing datasets, introduces an evaluation framework based on biological entities, and proposes a retrieval-enhanced method. Result: The retrieval-enhanced method significantly outperforms fine-tuned LLMs in protein-to-text generation and is accurate and efficient in training-free scenarios. Conclusion: The new method and evaluation framework address the existing problems and provide a more reliable benchmark for developing protein-text models. Abstract: In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model's performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data can be seen at https://github.com/IDEA-XL/RAPM.

[169] Enhancing Logical Reasoning in Language Models via Symbolically-Guided Monte Carlo Process Supervision

Xingwei Tan,Marco Valentino,Mahmud Akhter,Maria Liakata,Nikolaos Aletras

Main category: cs.CL

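TL;DR: The paper proposes generating symbolic reasoning trajectories and selecting them with a process reward model tuned via Monte Carlo estimation, improving the generalization of large language models in logical reasoning and reducing their reliance on memorization.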

Details

Motivation: Current large language models perform well on mathematical and logical reasoning but rely on memorization rather than generalization and are sensitive to content variations. Combining them with symbolic methods can improve reliability, yet existing approaches struggle to exploit symbolic representations effectively. Method: Generates symbolic reasoning trajectories, selects the high-quality ones with a process reward model tuned via Monte Carlo estimation, and uses them to fine-tune the model. Result: Strong results on logical reasoning benchmarks such as FOLIO and LogicAsker, together with improved out-of-domain generalization. Conclusion: Symbolically guided process supervision can reduce the effect of memorization on reasoning and improve generalization. Abstract: Large language models (LLMs) have shown promising performance in mathematical and logical reasoning benchmarks. However, recent studies have pointed to memorization, rather than generalization, as one of the leading causes for such performance. LLMs, in fact, are susceptible to content variations, demonstrating a lack of robust symbolic abstractions supporting their reasoning process. To improve reliability, many attempts have been made to combine LLMs with symbolic methods. Nevertheless, existing approaches fail to effectively leverage symbolic representations due to the challenges involved in developing reliable and scalable verification mechanisms. In this paper, we propose to overcome such limitations by generating symbolic reasoning trajectories and selecting the high-quality ones using a process reward model automatically tuned based on Monte Carlo estimation. The trajectories are then employed via fine-tuning methods to improve logical reasoning and generalization. Our results on logical reasoning benchmarks such as FOLIO and LogicAsker show the effectiveness of the proposed method with large gains on frontier and open-weight models. Moreover, additional experiments on claim verification reveal that fine-tuning on the generated symbolic reasoning trajectories enhances out-of-domain generalizability, suggesting the potential impact of symbolically-guided process supervision in alleviating the effect of memorization on LLM reasoning.

[170] GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

Zihong Chen,Wanli Jiang,Jinzhe Li,Zhonghang Yuan,Huanjun Kong,Wanli Ouyang,Nanqing Dong

Main category: cs.CL

TL;DR: GraphGen is a knowledge graph-based framework for generating high-quality QA data, addressing data scarcity in LLM fine-tuning.

Details

Motivation: Conventional synthetic data methods suffer from factual inaccuracies, insufficient long-tail coverage, and related problems; GraphGen aims to provide a more reliable solution. Method: Builds a fine-grained knowledge graph, identifies knowledge gaps, and uses multi-hop neighborhood sampling and style-controlled generation to produce diverse QA data. Result: Experiments show that GraphGen outperforms conventional synthetic data methods on knowledge-intensive tasks. Conclusion: GraphGen offers a more comprehensive and reliable solution to data scarcity in supervised fine-tuning. Abstract: Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
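
Since the abstract relies on the expected calibration error (ECE), a minimal sketch of the standard binned formulation may help: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is weighted by the bin's share of samples. The equal-width binning with half-open intervals, the function name, and the sample values are assumptions; how GraphGen applies the metric to individual knowledge-graph facts is not shown.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: weighted mean of |average confidence - accuracy| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Hypothetical model confidences and whether each answer was right.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [True, True, True, False]
print(expected_calibration_error(confidences, correct))
```

A large gap signals statements the model is miscalibrated on, which is one plausible reading of what the abstract calls a knowledge gap.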

[171] SEMMA: A Semantic Aware Knowledge Graph Foundation Model

Arvindh Arun,Sumit Kumar,Mojtaba Nayyeri,Bo Xiong,Ponnurangam Kumaraguru,Antonio Vergari,Steffen Staab

Main category: cs.CL

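TL;DR: SEMMA is a dual-module knowledge graph foundation model that combines textual semantics with structural information, substantially improving zero-shot reasoning, especially in the challenging setting where the relation vocabulary is unseen.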

Details

Motivation: Existing knowledge graph foundation models rely mainly on graph structure and overlook the rich semantic information in textual attributes, which limits their generalization. Method: SEMMA uses large language models (LLMs) to generate semantic embeddings, builds a textual relation graph from them, and fuses it with the structural graph. Result: Across 54 diverse knowledge graphs, SEMMA outperforms purely structural baselines (such as ULTRA) in fully inductive link prediction and is 2x more effective than structural methods when the relation vocabulary is unseen. Conclusion: Textual semantics are critical for generalization in knowledge reasoning, and future foundation models need to unify structural and linguistic signals. Abstract: Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.

[172] The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project

Angelina A. Aquino,Lester James V. Miranda,Elsie Marie T. Or

Main category: cs.CL

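TL;DR: This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated under the Universal Dependencies framework.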

Details

Motivation: To provide resources for computational linguistics research, particularly for underrepresented languages such as Tagalog. Method: Describes the treebank development process in detail, including data collection, pre-processing, manual annotation, and quality assurance, and evaluates transformer-based models. Result: Provides baseline model evaluations and discusses the challenges that Tagalog's distinctive grammatical properties pose for syntactic analysis. Conclusion: UD-NewsCrawl and the baseline models will be valuable resources for advancing research on languages like Tagalog. Abstract: This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss its implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.

[173] PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy

Shuhao Guan,Moule Lin,Cheng Xu,Xinyi Liu,Jinman Zhao,Jiexin Fan,Qi Xu,Derek Greene

Main category: cs.CL

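TL;DR: PreP-OCR is a two-stage pipeline that combines image restoration with semantic-aware post-OCR correction, substantially improving text extraction from historical documents.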

Details

Motivation: To address the high OCR error rates caused by degradation in historical documents by jointly optimizing image clarity and linguistic consistency. Method: (1) Generate synthetic image pairs to train an image restoration model; (2) use a ByT5 post-corrector to handle the remaining OCR errors. Result: On 13,831 pages of historical documents, the character error rate drops by 63.9-70.3%. Conclusion: PreP-OCR shows the potential of combining image restoration with linguistic correction for digitizing historical archives. Abstract: This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to improve text extraction from degraded historical documents. Our key innovation lies in jointly optimizing image clarity and linguistic consistency. First, we generate synthetic image pairs with randomized text fonts, layouts, and degradations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-corrector, fine-tuned on synthetic historical text training pairs, addresses any remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
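
The character error rate (CER) reported above is conventionally the Levenshtein edit distance between the OCR output and the ground-truth transcription, divided by the length of the ground truth. The sketch below implements that standard definition; the function names and sample strings are illustrative, and the paper may normalize text differently before scoring.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical OCR output with one misread character.
print(character_error_rate("historical", "hist0rical"))
```

A relative reduction such as the 63.9-70.3% in the abstract compares the CER of the full pipeline against the CER of OCR on the raw images.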

[174] HAMburger: Accelerating LLM Inference via Token Smashing

Jingyu Liu,Ce Zhang

Main category: cs.CL

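TL;DR: HAMburger is a hierarchically auto-regressive model that significantly improves LLM inference efficiency by optimizing how KV cache and compute are allocated.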

Details

Motivation: In the current LLM inference pattern, every token requires one forward pass and one KV cache entry, which is inefficient. The authors find that LLMs can self-identify how much information is needed and can generate many tokens without global context. Method: Introduces HAMburger, which combines a compositional embedder and a micro-step decoder to compress multiple tokens into a single KV cache entry and generate several tokens per step. Result: Experiments show that HAMburger reduces KV cache computation by up to 2x and raises throughput by up to 2x while maintaining quality on short- and long-context tasks. Conclusion: HAMburger explores a hardware-agnostic design that is both compute- and memory-efficient, substantially improving LLM inference efficiency. Abstract: The growing demand for efficient Large Language Model (LLM) inference requires a holistic optimization on algorithms, systems, and hardware. However, very few works have fundamentally changed the generation pattern: each token needs one forward pass and one KV cache. This can be sub-optimal because we found that LLMs are extremely capable of self-identifying the exact dose of information that a single KV cache can store, and many tokens can be generated confidently without global context. Based on this insight, we introduce HAMburger, a Hierarchically Auto-regressive Model that redefines resource allocation in LLMs by moving beyond uniform computation and storage per token during inference. Stacking a compositional embedder and a micro-step decoder in between a base LLM, HAMburger smashes multiple tokens into a single KV and generates several tokens per step. Additionally, HAMburger functions as a speculative decoding framework where it can blindly trust self-drafted tokens. As a result, HAMburger shifts the growth of KV cache and forward FLOPs from linear to sub-linear with respect to output length, and adjusts its inference speed based on query perplexity and output structure. Extensive evaluations show that HAMburger reduces the KV cache computation by up to 2× and achieves up to 2× TPS, while maintaining quality in both short- and long-context tasks. Our method explores an extremely challenging inference regime that requires both computation- and memory-efficiency with a hardware-agnostic design.

[175] In-context Language Learning for Endangered Languages in Speech Recognition

Zhaolin Li,Jan Niehues

Main category: cs.CL

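TL;DR: The study shows that, through in-context learning (ICL), large language models (LLMs) can learn low-resource languages without supervised data and perform strongly on speech recognition tasks.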

Details

Motivation: Roughly 7,000 languages are spoken worldwide, but current LLMs support only a few. The study explores whether LLMs can learn unseen low-resource languages through ICL. Method: Runs experiments on four endangered languages the models were not trained on, comparing probability-based and instruction-based learning approaches. Result: Providing more relevant text samples improves both language modelling and speech recognition; the probability-based approach outperforms the traditional instruction-based approach; ICL lets LLMs match dedicated models on ASR. Conclusion: ICL enables LLMs to learn low-resource languages efficiently while retaining their original capabilities, opening a new route to preserving linguistic diversity. Abstract: With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, examining whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). With experiments on four diverse endangered languages that LLMs have not been trained on, we find that providing more relevant text samples enhances performance in both language modelling and Automatic Speech Recognition (ASR) tasks. Furthermore, we show that the probability-based approach outperforms the traditional instruction-based approach in language learning. Lastly, we show ICL enables LLMs to achieve ASR performance that is comparable to or even surpasses dedicated language models trained specifically for these languages, while preserving the original capabilities of the LLMs.

[176] Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries

Sahana Ramnath,Anurag Mudgil,Brihi Joshi,Skyler Hallinan,Xiang Ren

Main category: cs.CL

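TL;DR: The Amulet framework uses dialog acts and maxims to improve the accuracy of LLM judges on complex multi-turn conversations; experiments show it clearly outperforms baselines on four datasets.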

Details

Motivation: Because large language models are widely used to evaluate other models' responses, their judgment accuracy on diverse multi-turn conversations needs to improve. Method: Amulet combines dialog acts and maxims to analyze the structure and intents of a conversation and to assess whether responses satisfy conversational principles. Result: Experiments show that human intents change in 60-70% of turns and that 75% of preference responses can be distinguished via dialog acts or maxims. Amulet outperforms baselines on four datasets. Conclusion: Amulet effectively improves LLM judges on multi-turn conversations and can be used on its own or as part of a jury. Abstract: Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g. social interactions, task requests, feedback. We present Amulet, a framework that leverages pertinent linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet presents valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference responses, and uses them to make judgments. On four challenging datasets, Amulet shows that (a) humans frequently (60 to 70 percent of the time) change their intents from one turn of the conversation to the next, and (b) in 75 percent of instances, the preference responses can be differentiated via dialog acts and/or maxims, reiterating the latter's significance in judging such data. Amulet can be used either as a judge by applying the framework to a single LLM, or integrated into a jury with different LLM judges; our judges and juries show strong improvements on relevant baselines for all four datasets.

[177] Conversation Kernels: A Flexible Mechanism to Learn Relevant Context for Online Conversation Understanding

Vibhor Agarwal,Arjoo Gupta,Suparna De,Nishanth Sastry

Main category: cs.CL

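TL;DR: The paper proposes a general-purpose mechanism (Conversation Kernels) for analyzing context dependencies in online conversations in order to understand various attributes of a post (such as how informative or funny it is).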

Details

Motivation: The growth of social networks and online discussion forums has increased the need to understand online conversations, but because posts are short and implicitly depend on context, traditional methods struggle to analyze them effectively. Method: Designs two families of Conversation Kernels that explore different neighborhoods of a post in the conversation tree to build context suited to a specific task. Result: The generality and flexibility of the method are validated on conversation data from slashdot.org. Conclusion: Conversation Kernels are a general-purpose and flexible framework applicable to a range of conversation understanding tasks. Abstract: Understanding online conversations has attracted research attention with the growth of social networks and online discussion forums. Content analysis of posts and replies in online conversations is difficult because each individual utterance is usually short and may implicitly refer to other posts within the same conversation. Thus, understanding individual posts requires capturing the conversational context and dependencies between different parts of a conversation tree and then encoding the context dependencies between posts and comments/replies into the language model. To this end, we propose a general-purpose mechanism to discover appropriate conversational context for various aspects of an online post in a conversation, such as whether it is informative, insightful, interesting or funny. Specifically, we design two families of Conversation Kernels, which explore different parts of the neighborhood of a post in the tree representing the conversation and through this, build relevant conversational context that is appropriate for each task being considered. We apply our developed method to conversations crawled from slashdot.org, which allows users to apply highly different labels to posts, such as 'insightful', 'funny', etc., and therefore provides an ideal experimental platform to study whether a framework such as Conversation Kernels is general-purpose and flexible enough to be adapted to widely different conversation understanding tasks.

[178] InFact: Informativeness Alignment for Improved LLM Factuality

Roi Cohen,Russa Biswas,Gerard de Melo

Main category: cs.CL

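TL;DR: This paper proposes an informativeness alignment mechanism that aims to improve the factual completeness and informativeness of text generated by language models.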

Details

Motivation: Text generated by LLMs may be factually correct yet often carries little information; this paper addresses that problem. Method: Building on existing factual benchmarks, the authors propose an informativeness alignment objective that prioritizes answers that are both correct and informative. Result: Optimizing this objective improves not only informativeness but also factual accuracy. Conclusion: Informativeness alignment is an effective way to improve the quality and factuality of LLM-generated text. Abstract: Factual completeness is a general term that captures how detailed and informative a factually correct text is. For instance, the factual sentence "Barack Obama was born in the United States" is factually correct, though less informative than the factual sentence "Barack Obama was born in Honolulu, Hawaii, United States". Despite the known fact that LLMs tend to hallucinate and generate factually incorrect text, they might also tend to choose to generate factual text that is indeed factually correct and yet less informative than other, more informative choices. In this work, we tackle this problem by proposing an informativeness alignment mechanism. This mechanism takes advantage of recent factual benchmarks to propose an informativeness alignment objective. This objective prioritizes answers that are both correct and informative. A key finding of our work is that when training a model to maximize this objective or optimize its preference, we can improve not just informativeness but also factuality.

[179] Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages

Asif Shahriar,Rifat Shahriyar,M Saifur Rahman

Main category: cs.CL

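TL;DR: Inceptive Transformer uses a multi-scale feature extraction module to weight tokens dynamically, balancing local and global dependencies, and outperforms baseline models by 1% to 14% across multiple tasks.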

Details

Motivation: Conventional transformers compress the information from all tokens into a single [CLS] token, which can lose localized or hierarchical information. Method: The authors introduce Inceptive Transformer, which adds a multi-scale feature extraction module inspired by inception networks and weights tokens dynamically. Result: On tasks including emotion recognition, irony detection, and disease identification, the models outperform the baselines by 1% to 14%. Conclusion: The method applies across languages and domains and effectively enriches transformer representations. Abstract: Conventional transformer models typically compress the information from all tokens in a sequence into a single [CLS] token to represent global context, an approach that can lead to information loss in tasks requiring localized or hierarchical cues. In this work, we introduce Inceptive Transformer, a modular and lightweight architecture that enriches transformer-based token representations by integrating a multi-scale feature extraction module inspired by inception networks. Our model is designed to balance local and global dependencies by dynamically weighting tokens based on their relevance to a particular task. Evaluation across a diverse range of tasks including emotion recognition (both English and Bangla), irony detection, disease identification, and anti-COVID vaccine tweets classification shows that our models consistently outperform the baselines by 1% to 14% while maintaining efficiency. These findings highlight the versatility and cross-lingual applicability of our method for enriching transformer-based representations across diverse domains.

[180] Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism

Naba Rizvi,Harper Strickland,Saleha Ahmedi,Aekta Kallepalli,Isha Khirwadkar,William Wu,Imani N. S. Munyaka,Nedjma Ousidhoum

Main category: cs.CL

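TL;DR: The paper evaluates how well four large language models (LLMs) identify nuanced ableism directed at autistic individuals, finding that LLMs recognize the relevant language but often miss harmful connotations and rely on surface-level keyword matching rather than contextual understanding.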

Details

Motivation: The study aims to understand how LLMs conceptualize and detect nuanced ableism directed at autistic individuals, filling a gap in existing research on LLM performance in ableism detection. Method: The authors evaluate four LLMs on their understanding of autism-related terminology and on their ability to recognize ableist content in context, and qualitatively compare LLM explanations with those of human annotators. Result: LLMs can identify autism-related language but often miss harmful connotations, and they rely on keyword matching rather than contextual understanding, whereas human annotators take a broader view. Conclusion: LLMs remain limited in identifying nuanced ableism and need better contextual understanding, but a binary classification scheme is adequate for evaluating LLM performance. Abstract: Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation, giving them the power to amplify or suppress certain perspectives. While previous research has identified disability-related biases in LLMs, little is known about how they conceptualize ableism or detect it in text. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. We examine the gap between their understanding of relevant terminology and their effectiveness in recognizing ableist content in context. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations. Further, we conduct a qualitative comparison of human and LLM explanations. We find that LLMs tend to rely on surface-level keyword matching, leading to context misinterpretations, in contrast to human annotators who consider context, speaker identity, and potential impact. On the other hand, both LLMs and humans agree on the annotation scheme, suggesting that a binary classification is adequate for evaluating LLM performance, which is consistent with findings from prior studies involving human annotators.

[181] Gatsby Without the 'E': Crafting Lipograms with LLMs

Rohan Balasubramanian,Nitish Gokulakrishnan,Syeda Jannatus Saba,Steven Skiena

Main category: cs.CL

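TL;DR: The study explores how modern large language models (LLMs) can transform The Great Gatsby into a text that contains no letter 'e', illustrating the flexibility of English under strict constraints.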

Details

Motivation: To explore the text generation ability of large language models under strict constraints (such as avoiding a specific letter) and the adaptability of the English language. Method: A range of techniques is used, from simple synonym replacement to generative models combined with beam search and named entity analysis. Result: Excluding up to 3.6% of the most common letters has little effect on the meaning of the text, but stronger constraints markedly reduce translation fidelity. Conclusion: English shows surprising flexibility and creativity under strict constraints, and language models can effectively support such challenging tasks. Abstract: Lipograms are a unique form of constrained writing where all occurrences of a particular letter are excluded from the text, typified by the novel Gadsby, which daringly avoids all usage of the letter 'e'. In this study, we explore the power of modern large language models (LLMs) by transforming F. Scott Fitzgerald's novel The Great Gatsby into a fully 'e'-less text. We experimented with a range of techniques, from baseline methods like synonym replacement to sophisticated generative models enhanced with beam search and named entity analysis. We show that excluding up to 3.6% of the most common letters (up to the letter 'u') had minimal impact on the text's meaning, although translation fidelity rapidly and predictably decays with stronger lipogram constraints. Our work highlights the surprising flexibility of English under strict constraints, revealing just how adaptable and creative language can be.
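
The lipogram constraint described in this abstract can be checked mechanically. The following is a minimal sketch, not the authors' code: `is_lipogram` and `letter_share` are hypothetical helper names, and the check compares lowercased characters only, so accented variants such as 'é' are not treated as 'e'.

```python
from collections import Counter


def is_lipogram(text: str, banned: str) -> bool:
    """Return True if none of the banned letters occur in text (case-insensitive)."""
    banned_set = set(banned.lower())
    return not any(ch in banned_set for ch in text.lower())


def letter_share(text: str) -> dict[str, float]:
    """Return each letter's share of all alphabetic characters in text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {ch: n / len(letters) for ch, n in counts.items()}
```

A frequency table such as the one returned by `letter_share` is one way to rank letters by how common they are before deciding which ones to exclude.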

[182] Large Language Models for IT Automation Tasks: Are We There Yet?

Md Mahadi Hassan,John Salvador,Akond Rahman,Santu Karmaker

Main category: cs.CL

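TL;DR: The ITAB benchmark evaluates open-source LLMs on generating functional Ansible scripts and finds significant weaknesses in state reasoning and module knowledge.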

Details

Motivation: To study how LLMs actually perform on IT automation tasks (such as Ansible) and to address the shortcomings of existing benchmarks. Method: The authors propose the ITAB benchmark, comprising 126 diverse tasks, and evaluate 14 open-source LLMs through dynamic execution. Result: No LLM exceeds a 12% pass@10 rate; the main error categories are state reasoning (44.87%) and module knowledge deficiencies (24.37%). Conclusion: LLMs need better state reasoning and domain knowledge before they can be applied reliably to IT automation. Abstract: LLMs show promise in code generation, yet their effectiveness for IT automation tasks, particularly for tools like Ansible, remains understudied. Existing benchmarks rely primarily on synthetic tasks that fail to capture the needs of practitioners who use IT automation tools, such as Ansible. We present ITAB (IT Automation Task Benchmark), a benchmark of 126 diverse tasks (e.g., configuring servers, managing files) where each task accounts for state reconciliation: a property unique to IT automation tools. ITAB evaluates LLMs' ability to generate functional Ansible automation scripts via dynamic execution in controlled environments. We evaluate 14 open-source LLMs, none of which accomplish pass@10 at a rate beyond 12%. To explain these low scores, we analyze 1,411 execution failures across the evaluated LLMs and identify two main categories of prevalent semantic errors: failures in state reconciliation related reasoning (44.87% combined from variable (11.43%), host (11.84%), path (11.63%), and template (9.97%) issues) and deficiencies in module-specific execution knowledge (24.37% combined from attribute and parameter (14.44%) and module (9.93%) errors). Our findings reveal key limitations in open-source LLMs' ability to track state changes and apply specialized module knowledge, indicating that reliable IT automation will require major advances in state reasoning and domain-specific execution understanding.
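
For readers unfamiliar with the pass@10 metric quoted in this abstract, the sketch below shows one common way to estimate pass@k for a single task. The abstract does not say how ITAB computes it, so this assumes the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated and c of them pass; the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the task, c: samples that passed, k: budget.
    Assumed estimator: 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would then typically be the mean of this value over all tasks.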

[183] ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis

Hawau Olamide Toyin,Rufael Marew,Humaid Alblooshi,Samar M. Magdy,Hanan Aldarmaki

Main category: cs.CL

TL;DR: ArVoice is a multi-speaker Modern Standard Arabic speech corpus with diacritized transcriptions, suited to speech synthesis and related tasks.

Details

Motivation: To provide high-quality Arabic speech data for multi-speaker speech synthesis and other tasks such as diacritic restoration, voice conversion, and deepfake detection. Method: The corpus combines professional recordings, a modified subset of the Arabic Speech Corpus, and high-quality synthetic speech, for a total of 83.52 hours of speech. Result: Three open-source TTS systems and two voice conversion systems are trained to demonstrate the dataset's usefulness. Conclusion: The ArVoice corpus is a valuable research resource and is available for research use. Abstract: We introduce ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions, intended for multi-speaker speech synthesis and potentially useful for other tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection. ArVoice comprises: (1) a new professionally recorded set from six voice talents with diverse demographics; (2) a modified subset of the Arabic Speech Corpus; and (3) high-quality synthetic speech from two commercial systems. The complete corpus consists of a total of 83.52 hours of speech across 11 voices; around 10 hours consist of human voices from 7 speakers. We train three open-source TTS and two voice conversion systems to illustrate the use cases of the dataset. The corpus is available for research use.

[184] Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects

Chengyan Wu,Yiqiang Cai,Yang Liu,Pengxu Zhu,Yun Xue,Ziwei Gong,Julia Hirschberg,Bolei Ma

Main category: cs.CL

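TL;DR: This paper surveys research on Multimodal Emotion Recognition in Conversations (MERC), covering its motivations, core tasks, methods, evaluation strategies, and future directions.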

Details

Motivation: Improving the naturalness and emotional understanding of human-computer interaction requires more than any single modality can offer, so information from multiple modalities must be integrated. Method: A systematic survey of MERC's core tasks, representative methods, and evaluation strategies. Result: The survey summarizes recent trends, key challenges, and future directions for MERC. Conclusion: MERC research is crucial to the development of emotionally intelligent systems, and this survey provides timely guidance for it. Abstract: While text-based emotion recognition methods have achieved notable success, real-world dialogue systems often demand a more nuanced emotional understanding than any single modality can offer. Multimodal Emotion Recognition in Conversations (MERC) has thus emerged as a crucial direction for enhancing the naturalness and emotional understanding of human-computer interaction. Its goal is to accurately recognize emotions by integrating information from various modalities such as text, speech, and visual signals. This survey offers a systematic overview of MERC, including its motivations, core tasks, representative methods, and evaluation strategies. We further examine recent trends, highlight key challenges, and outline future directions. As interest in emotionally intelligent systems grows, this survey provides timely guidance for advancing MERC research.

[185] AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

Sebastian Antony Joseph,Syed Murtaza Husain,Stella S. R. Offner,Stéphanie Juneau,Paul Torrey,Adam S. Bolton,Juan P. Farias,Niall Gaffney,Greg Durrett,Junyi Jessy Li

Main category: cs.CL

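TL;DR: The paper introduces AstroVisBench, the first benchmark for evaluating large language models on scientific computing and visualization in astronomy, and reveals the limitations of current models in this domain.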

Details

Motivation: To evaluate whether LLMs can produce correct scientific insights within scientific workflows, a question that prior work has not addressed. Method: The authors develop the AstroVisBench benchmark together with an LLM-as-a-judge workflow that is validated against annotations by professional astronomers. Result: State-of-the-art language models show a significant gap in their usefulness as assistants for astronomy research. Conclusion: AstroVisBench gives AI scientists an end-to-end evaluation tool and supports the development of visualization-based workflows. Abstract: Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, determining whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model's ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.

[186] Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline

Meng Lu,Ruochen Zhang,Ellie Pavlick,Carsten Eickhoff

Main category: cs.CL

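TL;DR: Multilingual large language models (LLMs) perform inconsistently on factual recall, with English outperforming other languages. The study finds that the models answer multilingual queries through an English-centric factual recall mechanism and then translate the answer back into the target language. Errors stem from insufficient engagement of the English mechanism and from translation mistakes. Two vector interventions together raise recall accuracy by over 35 percent for the lowest-performing language.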

Details

Motivation: To investigate why multilingual LLMs perform inconsistently on factual recall and to propose improvements. Method: Mechanistic analysis techniques are used to uncover the English-centric processing pipeline in LLMs, and two language-independent vector interventions are proposed. Result: After the interventions, recall accuracy for the lowest-performing language improves by over 35 percent. Conclusion: Mechanistic analysis can unlock latent multilingual capabilities in LLMs. Abstract: Multilingual large language models (LLMs) often exhibit factual inconsistencies across languages, with significantly better performance in factual recall tasks in English than in other languages. The causes of these failures, however, remain poorly understood. Using mechanistic analysis techniques, we uncover the underlying pipeline that LLMs employ, which involves using the English-centric factual recall mechanism to process multilingual queries and then translating English answers back into the target language. We identify two primary sources of error: insufficient engagement of the reliable English-centric mechanism for factual recall, and incorrect translation from English back into the target language for the final answer. To address these vulnerabilities, we introduce two vector interventions, both independent of languages and datasets, to redirect the model toward better internal paths for higher factual consistency. Our interventions combined increase the recall accuracy by over 35 percent for the lowest-performing language. Our findings demonstrate how mechanistic insights can be used to unlock latent multilingual capabilities in LLMs.

[187] The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

Chris Emezue,The NaijaVoices Community,Busayo Awobade,Abraham Owodunni,Handel Emezue,Gloria Monica Tobechukwu Emezue,Nefertiti Nneoma Emezue,Sewade Ogun,Bunmi Akinremi,David Ifeoluwa Adelani,Chris Pal

Main category: cs.CL

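TL;DR: The NaijaVoices dataset addresses the shortage of speech data for African languages (Igbo, Hausa, and Yoruba), providing 1,800 hours of speech-text data that substantially improve speech recognition models.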

Details

Motivation: African languages are under-represented in speech technology, which limits accessibility for roughly one billion people, and existing datasets are too small to support robust speech models. Method: The NaijaVoices dataset is built with a unique data collection approach; it contains 1,800 hours of speech-text data from more than 5,000 speakers, and its acoustic diversity is analyzed. Result: Fine-tuning experiments show large WER improvements in automatic speech recognition: 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). Conclusion: NaijaVoices is an important resource for speech processing in African languages and advances multilingual technology. Abstract: The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages, including our focus languages Igbo, Hausa, and Yoruba, remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for circa one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
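
As background for the WER figures in this abstract, the sketch below computes word error rate as word-level edit distance divided by reference length, plus a relative reduction between two WER values. It is illustrative only: the abstract does not say whether the reported improvements are relative or absolute, and the function names `word_error_rate` and `relative_wer_reduction` are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(before: float, after: float) -> float:
    """Fraction by which WER dropped relative to its starting value."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before
```

Under the relative reading, a drop from a WER of 0.50 to 0.25 would count as a 50% improvement.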

[188] Emotion Classification In-Context in Spanish

Bipul Thapa,Gabriel Cofre

Main category: cs.CL

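TL;DR: The paper proposes a hybrid approach that combines TF-IDF with BERT embeddings to classify Spanish customer feedback as positive, neutral, or negative, and reports that a Custom Stacking Ensemble (CSE) model substantially improves classification accuracy.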

Details

Motivation: Traditional methods translate feedback from widely spoken languages into less common ones, which loses semantic integrity and contextual nuance, so a classification method that preserves the semantic depth of the original language is needed. Method: A hybrid of TF-IDF and BERT embeddings feeds a CSE model whose base models include Logistic Regression, KNN, LGBM, and AdaBoost. Result: The CSE model reaches a test accuracy of 93.3% on the Spanish dataset, significantly better than the individual models and the BERT model. Conclusion: Combining TF-IDF with BERT performs well for Spanish emotion classification and offers useful insights for customer feedback analysis and service improvement. Abstract: Classifying customer feedback into distinct emotion categories is essential for understanding sentiment and improving customer experience. In this paper, we classify customer feedback in Spanish into three emotion categories (positive, neutral, and negative) using advanced NLP and ML techniques. Traditional methods translate feedback from widely spoken languages to less common ones, resulting in a loss of semantic integrity and contextual nuances inherent to the original language. To address this limitation, we propose a hybrid approach that combines TF-IDF with BERT embeddings, effectively transforming Spanish text into rich numerical representations that preserve the semantic depth of the original language by using a Custom Stacking Ensemble (CSE) approach. To evaluate emotion classification, we utilize a range of models, including Logistic Regression, KNN, Bagging classifier with LGBM, and AdaBoost. The CSE model combines these classifiers as base models and uses a one-vs-all Logistic Regression as the meta-model. Our experimental results demonstrate that CSE significantly outperforms the individual models and the BERT model, achieving a test accuracy of 93.3% on the native Spanish dataset, higher than the accuracy obtained from the translated version. These findings underscore the challenges of emotion classification in Spanish and highlight the advantages of combining vectorization techniques like TF-IDF with BERT for improved accuracy. Our results provide valuable insights for businesses seeking to leverage emotion classification to enhance customer feedback analysis and service improvements.

[189] Effectiveness of Prompt Optimization in NL2SQL Systems

Sairam Gurajada,Eser Kandogan,Sajjadur Rahman

Main category: cs.CL

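TL;DR: This paper targets high-precision, high-performance NL2SQL systems for production settings, addressing context selection with a static exemplar set and a multi-objective optimization framework.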

Details

Motivation: Current NL2SQL methods can generate high-quality SQL, but production settings demand high precision and high performance, so context selection must be optimized. Method: The authors propose a prompt optimization framework that selects a static exemplar set through multi-objective optimization, covering key factors such as the query log, the database, SQL constructs, and execution latency. Result: Preliminary experiments indicate that the framework improves both the precision and the performance of generated SQL. Conclusion: A static exemplar set chosen by multi-objective optimization can improve NL2SQL systems in production settings. Abstract: NL2SQL approaches have greatly benefited from the impressive capabilities of large language models (LLMs). In particular, bootstrapping an NL2SQL system for a specific domain can be as simple as instructing an LLM with sufficient contextual information, such as schema details and translation demonstrations. However, building an accurate system still requires the rigorous task of selecting the right context for each query, including identifying relevant schema elements, cell values, and suitable exemplars that help the LLM understand domain-specific nuances. Retrieval-based methods have become the go-to approach for identifying such context. While effective, these methods introduce additional inference-time costs due to the retrieval process. In this paper, we argue that production scenarios demand high-precision, high-performance NL2SQL systems, rather than simply high-quality SQL generation, which is the focus of most current NL2SQL approaches. In such scenarios, the careful selection of a static set of exemplars, capturing the intricacies of the query log, target database, SQL constructs, and execution latencies, plays a more crucial role than exemplar selection based solely on similarity. The key challenge, however, lies in identifying a representative set of exemplars for a given production setting. To this end, we propose a prompt optimization framework that not only addresses the high-precision requirement but also optimizes the performance of the generated SQL through multi-objective optimization. Preliminary empirical analysis demonstrates the effectiveness of the proposed framework.

[190] Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation

Dancheng Liu,Amir Nassereldine,Chenhui Xu,Jinjun Xiong

Main category: cs.CL

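TL;DR: The study finds that ASR robustness is driven mainly by acoustic diversity rather than linguistic richness, and that acoustic augmentation can significantly improve performance when training on small datasets.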

Details

Motivation: To examine how linguistic and acoustic diversity in training data affect ASR robustness, given that massive datasets are impractical for most researchers. Method: The authors analyze the effect of acoustic and linguistic diversity on ASR models and test acoustic augmentation methods on the Librispeech dataset. Result: When training on the 960-hour Librispeech dataset, acoustic augmentation reduces word error rates on unseen datasets by up to 19.24 percent. Conclusion: Acoustic augmentation is an effective alternative for building robust ASR models, especially when large-scale speech data is unavailable. Abstract: Whisper's robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set, an impractical scale for most researchers. In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. We find that targeted acoustic augmentation methods could significantly improve the generalization ability of ASR models, reducing word-error rates by up to 19.24 percent on unseen datasets when training on the 960-hour Librispeech dataset. These findings highlight strategic acoustically focused data augmentation as a promising alternative to massive datasets for building robust ASR models, offering a potential solution to future foundation ASR models when massive human speech data is lacking.

[191] REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

Ziju Shen,Naohao Huang,Fanyi Yang,Yutong Wang,Guoxiong Gao,Tianyi Xu,Jiedong Jiang,Wanyi He,Pu Yang,Mengzhou Sun,Haocheng Ju,Peihao Wu,Bryan Dai,Bin Dong

Main category: cs.CL

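TL;DR: REAL-Prover is an open-source stepwise theorem prover for Lean 4 that combines a large language model with a retrieval system and notably improves performance on college-level mathematics problems.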

Details

Motivation: Current formal theorem provers generalize poorly to advanced mathematics, and REAL-Prover aims to push past this limitation. Method: The prover uses a fine-tuned large language model (REAL-Prover-v1) and a retrieval system (Leansearch-PS), with training data collected through a data extraction pipeline (HERALD-AF) and an interactive environment (Jixia-interactive). Result: It achieves a 23.7% success rate (Pass@64) on the ProofNet dataset and a SOTA success rate of 56.7% on the FATE-M benchmark. Conclusion: REAL-Prover performs well on advanced mathematics problems and contributes a new tool and benchmark to formal theorem proving. Abstract: Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tuning achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).

[192] SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation

Ting Xu,Zhichao Huang,Jiankai Sun,Shanbo Cheng,Wai Lam

Main category: cs.CL

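TL;DR: SeqPO-SiMT is a new policy optimization framework that treats simultaneous machine translation as a sequential decision-making problem and uses a tailored reward to improve translation quality while reducing latency.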

Details

Motivation: To address the multi-step decision problem in simultaneous machine translation (SiMT), in contrast to reinforcement learning methods designed for single-step tasks. Method: The SeqPO-SiMT framework simulates and refines the SiMT process using a tailored reward. Result: Across multiple datasets it achieves significantly higher translation quality with lower latency, outperforming the supervised fine-tuning model. Conclusion: SeqPO-SiMT performs strongly on simultaneous translation and even rivals high-performing offline translation models. Abstract: We present Sequential Policy Optimization for Simultaneous Machine Translation (SeqPO-SiMT), a new policy optimization framework that defines the simultaneous machine translation (SiMT) task as a sequential decision making problem, incorporating a tailored reward to enhance translation quality while reducing latency. In contrast to popular Reinforcement Learning from Human Feedback (RLHF) methods, such as PPO and DPO, which are typically applied in single-step tasks, SeqPO-SiMT effectively tackles the multi-step SiMT task. This intuitive framework allows the SiMT LLMs to simulate and refine the SiMT process using a tailored reward. We conduct experiments on six datasets from diverse domains for En to Zh and Zh to En SiMT tasks, demonstrating that SeqPO-SiMT consistently achieves significantly higher translation quality with lower latency. In particular, SeqPO-SiMT outperforms the supervised fine-tuning (SFT) model by 1.13 points in COMET, while reducing the Average Lagging by 6.17 in the NEWSTEST2021 En to Zh dataset. While SiMT operates with far less context than offline translation, the SiMT results of SeqPO-SiMT on 7B LLM surprisingly rival the offline translation of high-performing LLMs, including Qwen-2.5-7B-Instruct and LLaMA-3-8B-Instruct.

[193] POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization

Usman Naseem,Juan Ren,Saba Anwar,Sarah Kohail,Rudy Alexandro Garrido Veliz,Robert Geislinger,Aisha Jabr,Idris Abdulmumin,Laiba Qureshi,Aarushi Ajay Borkar,Maryam Ibrahim Mukhtar,Abinew Ali Ayele,Ibrahim Said Ahmad,Adem Ali,Martin Semmann,Shamsuddeen Hassan Muhammad,Seid Muhie Yimam

Main category: cs.CL

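TL;DR: POLAR is a multilingual, multicultural, multi-event dataset for studying online polarization; experiments evaluate a range of language models on polarization detection.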

Details

Motivation: Online polarization challenges democratic discourse, yet existing research is mostly monolingual or event-specific and lacks multilingual and multicultural perspectives. Method: The authors build the POLAR dataset, annotate polarization along three axes (presence, type, and manifestation), and evaluate multilingual pretrained models and large language models. Result: Models perform well on binary polarization detection but considerably worse when predicting polarization types and manifestations. Conclusion: Polarization is highly context-dependent and calls for more robust, adaptable approaches. All resources will be released to support research on digital polarization worldwide. Abstract: Online polarization poses a growing challenge for democratic discourse, yet most computational social science research remains monolingual, culturally narrow, or event-specific. We introduce POLAR, a multilingual, multicultural, and multievent dataset with over 23k instances in seven languages from diverse online platforms and real-world events. Polarization is annotated along three axes: presence, type, and manifestation, using a variety of annotation platforms adapted to each cultural context. We conduct two main experiments: (1) we fine-tune six multilingual pretrained language models in both monolingual and cross-lingual setups; and (2) we evaluate a range of open and closed large language models (LLMs) in few-shot and zero-shot scenarios. Results show that while most models perform well on binary polarization detection, they achieve substantially lower scores when predicting polarization types and manifestations. These findings highlight the complex, highly contextual nature of polarization and the need for robust, adaptable approaches in NLP and computational social science. All resources will be released to support further research and effective mitigation of digital polarization globally.

[194] Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

Sibo Xiao,Zixin Lin,Wenyang Gao,Yue Zhang

Main category: cs.CL

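TL;DR: The paper proposes XpandA, a multi-agent framework that uses dynamic partitioning and a question-driven workflow to address latency, information loss, and broken dependencies in long-context processing.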

Details

Motivation: Existing agent-based divide-and-conquer methods for long contexts suffer from accumulated latency, information loss, and disrupted textual dependencies. Method: XpandA uses dynamic partitioning, a question-guided protocol, and a selective replay mechanism to improve long-text processing. Result: On benchmarks with lengths from 1k to 1M, XpandA significantly enhances the long-context capabilities of LLMs, with a 20% performance improvement and a 1.5x inference speedup. Conclusion: XpandA offers a feasible approach to processing ultra-long sequences and significantly improves the long-context capabilities of LLMs. Abstract: Processing long contexts has become a critical capability for modern large language models (LLMs). Existing works leverage agent-based divide-and-conquer methods for processing long contexts. However, these methods face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, and the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi-agent framework XpandA (Expand-Agent) coupled with question-driven workflow and dynamic partitioning for robust long-context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) question-guided protocol to update flat information ensembles within centralized shared memory, constructing consistent inter-agent knowledge across partitions; and 3) selectively replaying specific partitions based on the state-tracking of question-information couples to promote the resolution of inverted-order structures across partitions (e.g., flashbacks). We perform a comprehensive evaluation of XpandA on multiple long-context benchmarks with length varying from 1k to 1M, demonstrating XpandA's feasibility for processing ultra-long sequences and its significant effectiveness in enhancing the long-context capabilities of various LLMs by achieving 20% improvements and 1.5x inference speedup over baselines of full-context, RAG and previous agent-based methods.

[195] Test-Time Learning for Large Language Models

Jinwu Hu,Zhitian Zhang,Guohao Chen,Xutao Wen,Chao Shuai,Wei Luo,Bin Xiao,Yuanqing Li,Mingkui Tan

Main category: cs.CL

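TL;DR: The paper proposes TLM, a test-time learning paradigm that dynamically adapts LLMs to target domains by minimizing input perplexity on unlabeled test data, improving performance by at least 20%.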

Details

Motivation: LLMs are limited in generalizing to specialized domains and handling linguistic variation, so a dynamic adaptation method is needed. Method: Self-supervised learning is achieved by minimizing input perplexity, with an efficient sample selection strategy and Low-Rank Adaptation (LoRA) to avoid catastrophic forgetting. Result: On domain knowledge adaptation, TLM improves performance by at least 20% over the original LLMs. Conclusion: TLM provides an effective and stable way to adapt LLMs dynamically at test time. Abstract: While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation.
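
The input-perplexity quantity at the center of this abstract can be illustrated with a few lines of code. This is a rough sketch rather than the TLM implementation: it assumes per-token natural-log probabilities are already available from some model, and the hard `threshold` cutoff in `select_high_perplexity` is a hypothetical stand-in for the paper's sample selection strategy, which the abstract does not specify.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the negative mean per-token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_high_perplexity(samples: dict[str, list[float]],
                           threshold: float) -> list[str]:
    """Return the names of samples whose perplexity exceeds threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [name for name, logprobs in samples.items()
            if perplexity(logprobs) > threshold]
```

Minimizing input perplexity then amounts to minimizing the mean negative log-likelihood of the test inputs, here restricted to the selected samples.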

[196] STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models

Kai Chen,Zihao He,Taiwei Shi,Kristina Lerman

Main category: cs.CL

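TL;DR: Steer-Bench is a benchmark for evaluating how well large language models (LLMs) adapt to different community norms; it covers 30 contrasting subreddit pairs, and results show that LLMs still have significant gaps in community sensitivity.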

Details

Motivation: The ability of LLMs to adapt to diverse community norms ("steerability") is critical in real-world applications, but systematic evaluation methods are lacking. Method: The authors introduce Steer-Bench, which contains over 10,000 instruction-response pairs and 5,500 multiple-choice questions drawn from 30 contrasting subreddit pairs, to test community alignment. Result: Human experts reach 81% accuracy, while the best LLMs reach only about 65%, and some models trail humans by more than 15 percentage points. Conclusion: Steer-Bench exposes the shortcomings of LLMs in community sensitivity and provides a tool for systematic evaluation. Abstract: Steerability, or the ability of large language models (LLMs) to adapt outputs to align with diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce Steer-Bench, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, Steer-Bench includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. Our evaluation of 13 popular LLMs using Steer-Bench reveals that while human experts achieve an accuracy of 81% with silver labels, the best-performing models reach only around 65% accuracy depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage points, highlighting significant gaps in community-sensitive steerability. Steer-Bench is a benchmark to systematically assess how effectively LLMs understand community-specific instructions, their resilience to adversarial steering attempts, and their ability to accurately represent diverse cultural and ideological perspectives.

[197] FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information

Yan Wang,Yang Ren,Lingfei Qian,Xueqing Peng,Keyi Wang,Yi Han,Dongji Feng,Xiao-Yang Liu,Jimin Huang,Qianqian Xie

Main category: cs.CL

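TL;DR: FinTagging is the first full-scope, table-aware XBRL benchmark for evaluating the structured information extraction and semantic alignment capabilities of large language models (LLMs) on XBRL financial reports.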

Details

Motivation: Existing benchmarks reduce XBRL tagging to flat multi-class classification and consider only narrative text, so they cannot fully evaluate LLMs on financial reports. Method: FinTagging decomposes XBRL tagging into two subtasks, FinNI (financial entity extraction) and FinCL (taxonomy-driven concept alignment), and requires models to extract facts from both unstructured text and structured tables and align them with the full 10k+ US-GAAP taxonomy. Result: LLMs perform well on information extraction but struggle with fine-grained concept alignment, especially when distinguishing closely related taxonomy entries. Conclusion: Existing LLMs cannot fully automate XBRL tagging; better semantic reasoning and schema-aware modeling are needed to meet the accuracy demands of financial disclosure. Abstract: We introduce FinTagging, the first full-scope, table-aware XBRL benchmark designed to evaluate the structured information extraction and semantic alignment capabilities of large language models (LLMs) in the context of XBRL-based financial reporting. Unlike prior benchmarks that oversimplify XBRL tagging as flat multi-class classification and focus solely on narrative text, FinTagging decomposes the XBRL tagging problem into two subtasks: FinNI for financial entity extraction and FinCL for taxonomy-driven concept alignment. It requires models to jointly extract facts and align them with the full 10k+ US-GAAP taxonomy across both unstructured text and structured tables, enabling realistic, fine-grained evaluation. We assess a diverse set of LLMs under zero-shot settings, systematically analyzing their performance on both subtasks and overall tagging accuracy. Our results reveal that, while LLMs demonstrate strong generalization in information extraction, they struggle with fine-grained concept alignment, particularly in disambiguating closely related taxonomy entries. These findings highlight the limitations of existing LLMs in fully automating XBRL tagging and underscore the need for improved semantic reasoning and schema-aware modeling to meet the demands of accurate financial disclosure. Code is available at our GitHub repository and data is at our Hugging Face repository.

[198] Chinese Cyberbullying Detection: Dataset, Method, and Validation

Yi Zhu,Xin Zou,Xindong Wu

Main category: cs.CL

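TL;DR: This paper presents CHNCI, an incident-annotated Chinese cyberbullying detection dataset containing 220,676 comments across 91 incidents. It is built with an ensemble method plus human annotation and is validated as a benchmark for cyberbullying detection and incident prediction.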

Details

Motivation: Existing cyberbullying detection benchmarks are mostly organized by the polarity of speech (such as "offensive" and "non-offensive"), whereas real-world cyberbullying usually draws public attention through incidents, so an incident-based annotation method is needed. Method: Combine three explanation-generation-based cyberbullying detection methods into an ensemble that produces pseudo labels, which human annotators then judge; propose evaluation criteria for validating cyberbullying incidents. Result: The constructed CHNCI dataset can serve as a benchmark for cyberbullying detection and incident prediction. Conclusion: This is the first study of Chinese cyberbullying incident detection, and the CHNCI dataset fills a gap in the field. Abstract: Existing cyberbullying detection benchmarks were organized by the polarity of speech, such as "offensive" and "non-offensive", which were essentially hate speech detection. However, in the real world, cyberbullying often attracted widespread social attention through incidents. To address this problem, we propose a novel annotation method to construct a cyberbullying dataset that is organized by incidents. The constructed CHNCI is the first Chinese cyberbullying incident detection dataset, which consists of 220,676 comments in 91 incidents. Specifically, we first combine three cyberbullying detection methods based on explanation generation as an ensemble method to generate the pseudo labels, and then let human annotators judge these labels. Then we propose the evaluation criteria for validating whether it constitutes a cyberbullying incident. Experimental results demonstrate that the constructed dataset can be a benchmark for the tasks of cyberbullying detection and incident prediction. To the best of our knowledge, this is the first study for the Chinese cyberbullying incident detection task.

[199] Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge

Yue Fang,Zhi Jin,Jie An,Hongshen Chen,Xiaohong Chen,Naijun Zhan

Main category: cs.CL

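TL;DR: The paper proposes the STL-DivEn dataset and the KGST framework for automatically transforming natural language into Signal Temporal Logic (STL), addressing the shortage of existing data and outperforming existing methods in diversity and accuracy.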

Details

Motivation: Manually transforming natural language (NL) into Signal Temporal Logic (STL) is time-consuming and error-prone, and the lack of datasets has held back research on automatic transformation. Method: Build the STL-DivEn dataset by using a small seed set and LLMs to generate diverse NL-STL pairs, followed by rule-based filtering and human validation; propose the KGST framework, which performs a generate-then-refine transformation based on external knowledge. Result: STL-DivEn is more diverse than the existing dataset, and KGST achieves higher transformation accuracy than baseline models on both STL-DivEn and DeepSTL. Conclusion: STL-DivEn and KGST provide an effective solution for automatic NL-to-STL transformation and advance research in this area. Abstract: Temporal Logic (TL), especially Signal Temporal Logic (STL), enables precise formal specification, making it widely used in cyber-physical systems such as autonomous driving and robotics. Automatically transforming natural language (NL) into STL is an attractive approach to overcome the limitations of manual transformation, which is time-consuming and error-prone. However, due to the lack of datasets, automatic transformation currently faces significant challenges and has not been fully explored. In this paper, we propose an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn), which comprises 16,000 samples enriched with diverse patterns. To develop the dataset, we first manually create a small-scale seed set of NL-STL pairs. Next, representative examples are identified through clustering and used to guide large language models (LLMs) in generating additional NL-STL pairs. Finally, diversity and accuracy are ensured through rigorous rule-based filters and human validation. Furthermore, we introduce the Knowledge-Guided STL Transformation (KGST) framework, a novel approach for transforming natural language into STL, involving a generate-then-refine process based on external knowledge. Statistical analysis shows that the STL-DivEn dataset exhibits more diversity than the existing NL-STL dataset. Moreover, both metric-based and human evaluations indicate that our KGST approach outperforms baseline models in transformation accuracy on STL-DivEn and DeepSTL datasets.

[200] BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism

Qinzhuo Wu,Pengzhi Gao,Wei Liu,Jian Luan

Main category: cs.CL

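TL;DR: BacktrackAgent is a GUI agent framework that uses a backtracking mechanism to improve task completion efficiency. It comprises verifier, judger, and reflector modules, and experiments show gains in both task success rate and step accuracy.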

Details

Motivation: Existing GUI agents focus mainly on the accuracy of individual actions and lack effective mechanisms for detecting and recovering from errors; BacktrackAgent aims to address this. Method: Propose the BacktrackAgent framework, with verifier, judger, and reflector modules and a judgment reward mechanism, and develop a training dataset designed for the backtracking mechanism. Result: On the Mobile3M and Auto-UI benchmarks, BacktrackAgent improves both task success rate and step accuracy. Conclusion: The backtracking mechanism improves GUI agent performance; data and code will be released upon acceptance. Abstract: Graphical User Interface (GUI) agents have gained substantial attention due to their impressive capabilities to complete tasks through multiple interactions within GUI environments. However, existing agents primarily focus on enhancing the accuracy of individual actions and often lack effective mechanisms for detecting and recovering from errors. To address these shortcomings, we propose the BacktrackAgent, a robust framework that incorporates a backtracking mechanism to improve task completion efficiency. BacktrackAgent includes verifier, judger, and reflector components as modules for error detection and recovery, while also applying judgment rewards to further enhance the agent's performance. Additionally, we develop a training dataset specifically designed for the backtracking mechanism, which considers the outcome pages after action executions. Experimental results show that BacktrackAgent has achieved performance improvements in both task success rate and step accuracy on Mobile3M and Auto-UI benchmarks. Our data and code will be released upon acceptance.

[201] Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning

Yang He,Xiao Ding,Bibo Cai,Yufei Zhang,Kai Xiong,Zhouhao Sun,Bing Qin,Ting Liu

Main category: cs.CL

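TL;DR: The paper proposes Self-Route, a framework that dynamically selects the reasoning mode to cut unnecessary token consumption and improve efficiency.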

Details

Motivation: RLLMs waste resources by over-reasoning on simple problems, so reasoning efficiency needs to be improved. Method: Introduce a lightweight pre-inference stage that extracts capability-aware embeddings, and construct the Gradient-10K dataset to train the router. Result: Self-Route reduces token consumption by 30-55% while maintaining accuracy. Conclusion: Self-Route is broadly applicable and practical, working across models of different scales and reasoning paradigms. Abstract: While reasoning-augmented large language models (RLLMs) significantly enhance complex task performance through extended reasoning chains, they inevitably introduce substantial unnecessary token consumption, particularly for simpler problems where Short Chain-of-Thought (Short CoT) suffices. This overthinking phenomenon leads to inefficient resource usage without proportional accuracy gains. To address this issue, we propose Self-Route, a dynamic reasoning framework that automatically selects between general and reasoning modes based on model capability estimation. Our approach introduces a lightweight pre-inference stage to extract capability-aware embeddings from hidden layer representations, enabling real-time evaluation of the model's ability to solve problems. We further construct Gradient-10K, a model difficulty estimation-based dataset with dense complexity sampling, to train the router for precise capability boundary detection. Extensive experiments demonstrate that Self-Route achieves comparable accuracy to reasoning models while reducing token consumption by 30-55% across diverse benchmarks. The proposed framework demonstrates consistent effectiveness across models with different parameter scales and reasoning paradigms, highlighting its general applicability and practical value.

[202] Pretraining Language Models to Ponder in Continuous Space

Boyi Zeng,Shixiang Song,Siyuan Huang,Yixuan Wang,He Li,Ziwei He,Xinbing Wang,Zhiyu Li,Zhouhan Lin

Main category: cs.CL

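TL;DR: The paper proposes enhancing language models with a "pondering" process that invokes the forward pass multiple times within a single token generation step to improve performance.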

Details

Motivation: Humans ponder before articulating complex sentences; the paper brings this process into language models to strengthen their cognitive processing. Method: When generating a token, the model "ponders" by taking a weighted sum of all token embeddings and feeding the resulting embedding back as input for further forward passes. Result: Pondering models match the language modeling performance of vanilla models with twice as many parameters and significantly outperform baselines on downstream tasks. Conclusion: The method is simple and effective, and it can be integrated seamlessly into existing language models. Abstract: Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Our method is straightforward and can be seamlessly integrated with various existing language models. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, pondering-enhanced Pythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at https://github.com/LUMIA-Group/PonderingLM.
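
The pondering step in the abstract above replaces sampling with a probability-weighted sum of all token embeddings. The following is a minimal sketch of that single operation, assuming a toy vocabulary and made-up embedding values; the helper names are hypothetical, and the feedback of the result into another forward pass is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pondering_embedding(logits, embedding_table):
    """Weighted sum of all token embeddings under the predicted distribution."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (made-up values).
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = pondering_embedding([0.0, 0.0, 0.0], embedding_table)
peaked = pondering_embedding([100.0, 0.0, 0.0], embedding_table)
```

With a uniform prediction the result is the mean of the embeddings, and with a sharply peaked prediction it approaches the embedding of the most likely token, which is why the operation can be seen as a continuous relaxation of picking one token.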

[203] SELF-PERCEPT: Introspection Improves Large Language Models' Detection of Multi-Person Mental Manipulation in Conversations

Danush Khanna,Pratinav Seth,Sidhaarth Sredharan Murali,Aditya Kumar Guru,Siddharth Shukla,Tanuj Tyagi,Sandeep Chaurasia,Kripabandhu Ghosh

Main category: cs.CL

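TL;DR: The paper introduces the MultiManip dataset and the SELF-PERCEPT framework for detecting mental manipulation in multi-person, multi-turn conversations, addressing the limitations of LLMs on this task.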

Details

Motivation: Mental manipulation is pervasive and subtle in interpersonal communication, and existing LLMs struggle to detect it in complex conversations. Method: Build the MultiManip dataset, propose the two-stage SELF-PERCEPT prompting framework, and evaluate a range of LLMs. Result: Existing LLMs perform poorly, while SELF-PERCEPT shows strong detection performance. Conclusion: SELF-PERCEPT provides an effective approach to detecting mental manipulation in complex conversations. Abstract: Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, making its detection critical for safeguarding potential victims. However, due to manipulation's nuanced and context-specific nature, identifying manipulative language in complex, multi-turn, and multi-person conversations remains a significant challenge for large language models (LLMs). To address this gap, we introduce the MultiManip dataset, comprising 220 multi-turn, multi-person dialogues balanced between manipulative and non-manipulative interactions, all drawn from reality shows that mimic real-world scenarios. For manipulative interactions, it includes 11 distinct manipulations depicting real-life scenarios. We conduct extensive evaluations of state-of-the-art LLMs, such as GPT-4o and Llama-3.1-8B, employing various prompting strategies. Despite their capabilities, these models often struggle to detect manipulation effectively. To overcome this limitation, we propose SELF-PERCEPT, a novel, two-stage prompting framework inspired by Self-Perception Theory, demonstrating strong performance in detecting multi-person, multi-turn mental manipulation. Our code and data are publicly available at https://github.com/danushkhanna/self-percept .

[204] Phir Hera Fairy: An English Fairytaler is a Strong Faker of Fluent Speech in Low-Resource Indian Languages

Praveen Srinivasa Varadhan,Srija Anand,Soma Siddhartha,Mitesh M. Khapra

Main category: cs.CL

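TL;DR: The study fine-tunes the English F5-TTS model on 11 Indian languages and finds that fine-tuning on Indian data alone works best; the resulting IN-F5 model is a near-human polyglot.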

Details

Motivation: Explore how an English pretrained model adapts to low-resource Indian languages, and evaluate it on polyglot synthesis, voice cloning, and style cloning. Method: Compare three approaches: (i) training from scratch, (ii) fine-tuning English F5 on Indian data only, and (iii) fine-tuning on both Indian and English data to prevent forgetting. Result: Fine-tuning on Indian data only works best; IN-F5 achieves fluent cross-lingual synthesis and supports synthesis for zero-resource languages. Conclusion: English pretraining helps low-resource TTS reach human parity; the paper also proposes a compute-optimal strategy and a zero-resource synthesis approach. Abstract: What happens when an English Fairytaler is fine-tuned on Indian languages? We evaluate how the English F5-TTS model adapts to 11 Indian languages, measuring polyglot fluency, voice-cloning, style-cloning, and code-mixing. We compare: (i) training from scratch, (ii) fine-tuning English F5 on Indian data, and (iii) fine-tuning on both Indian and English data to prevent forgetting. Fine-tuning with only Indian data proves most effective and the resultant IN-F5 is a near-human polyglot; that enables speakers of one language (e.g., Odia) to fluently speak in another (e.g., Hindi). Our results show English pretraining aids low-resource TTS in reaching human parity. To aid progress in other low-resource languages, we study data-constrained setups and arrive at a compute optimal strategy. Finally, we show IN-F5 can synthesize unseen languages like Bhojpuri and Tulu using a human-in-the-loop approach for zero-resource TTS via synthetic data generation.

[205] Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration

Yong Wu,Weihang Pan,Ke Li,Chen Binhui,Ping Li,Binbin Lin

Main category: cs.CL

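TL;DR: The DART framework dynamically adapts reasoning trajectories to bridge the capability gap between large and small language models, improving the reasoning performance of small models.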

Details

Motivation: Existing reasoning datasets are typically designed for powerful language models and degrade performance when applied directly to weaker models, so the capability gap must be addressed. Method: DART uses a selective imitation strategy that estimates step-wise adaptability via solution simulation; when an expert step exceeds the student's capacity, the student autonomously explores alternative paths. Result: Validated across multiple reasoning benchmarks and model scales, DART significantly improves generalization and data efficiency. Conclusion: By dynamically adapting reasoning trajectories, DART offers a scalable solution for reasoning alignment in resource-constrained models. Abstract: Large language models (LLMs) have shown remarkable reasoning capabilities, yet aligning such abilities to small language models (SLMs) remains a challenge due to distributional mismatches and limited model capacity. Existing reasoning datasets, typically designed for powerful LLMs, often lead to degraded performance when directly applied to weaker models. In this work, we introduce Dynamic Adaptation of Reasoning Trajectories (DART), a novel data adaptation framework that bridges the capability gap between expert reasoning trajectories and diverse SLMs. Instead of uniformly imitating expert steps, DART employs a selective imitation strategy guided by step-wise adaptability estimation via solution simulation. When expert steps surpass the student's capacity -- signaled by an Imitation Gap -- the student autonomously explores alternative reasoning paths, constrained by outcome consistency. We validate DART across multiple reasoning benchmarks and model scales, demonstrating that it significantly improves generalization and data efficiency over static fine-tuning. Our method enhances supervision quality by aligning training signals with the student's reasoning capabilities, offering a scalable solution for reasoning alignment in resource-constrained models.

[206] Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective

Nicy Scaria,Silvester John Joseph Kennedy,Diksha Seth,Deepak Subramani

Main category: cs.CL

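TL;DR: The study examines how small language models (SLMs) perform on high school physics reasoning and finds high answer accuracy but low reasoning-chain correctness, suggesting overreliance on pattern recognition rather than genuine understanding.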

Details

Motivation: SLMs offer computational efficiency and accessibility, but their capacity for complex reasoning, for example in physics, remains underexplored. Method: Using a dataset built from the OpenStax High School Physics textbook, annotated with Bloom's Taxonomy and extended with cultural contextualization, evaluate the answer and reasoning-chain correctness of several SLMs. Result: Qwen 3 1.7B reaches 85% answer accuracy but only 38% fully correct reasoning; mathematical notation format has little effect on performance, and reasoning consistency is largely maintained across cultural contexts. Conclusion: To become reliable tools for physics education, SLMs need stronger genuine understanding and verifiable reasoning chains, not just answer accuracy. Abstract: Small Language Models (SLMs) offer computational efficiency and accessibility, making them promising for educational applications. However, their capacity for complex reasoning, particularly in domains such as physics, remains underexplored. This study investigates the high school physics reasoning capabilities of state-of-the-art SLMs (under 4 billion parameters), including instruct versions of Llama 3.2, Phi 4 Mini, Gemma 3, and Qwen series. We developed a comprehensive physics dataset from the OpenStax High School Physics textbook, annotated according to Bloom's Taxonomy, with LaTeX and plaintext mathematical notations. A novel cultural contextualization approach was applied to a subset, creating culturally adapted problems for Asian, African, and South American/Australian contexts while preserving core physics principles. Using an LLM-as-a-judge framework with Google's Gemini 2.5 Flash, we evaluated answer and reasoning chain correctness, along with calculation accuracy. The results reveal significant differences between the SLMs. Qwen 3 1.7B achieved high 'answer accuracy' (85%), but 'fully correct reasoning' was substantially low (38%). The format of the mathematical notation had a negligible impact on performance. SLMs exhibited varied performance across the physics topics and showed a decline in reasoning quality with increasing cognitive and knowledge complexity. In particular, the consistency of reasoning was largely maintained in diverse cultural contexts, especially by better performing models. These findings indicate that, while SLMs can often find correct answers, their underlying reasoning is frequently flawed, suggesting an overreliance on pattern recognition. For SLMs to become reliable educational tools in physics, future development must prioritize enhancing genuine understanding and the generation of sound, verifiable reasoning chains over mere answer accuracy.

[207] SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution

Hanlin Wang,Chak Tou Leong,Jiashuo Wang,Jian Wang,Wenjie Li

Main category: cs.CL

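TL;DR: The paper proposes SPA, a reward redistribution framework that decomposes the final reward into stepwise contributions to address delayed rewards in reinforcement learning, improving task success rate and grounding accuracy.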

Details

Motivation: When reinforcement learning is used to train LLM agents on complex tasks, delayed rewards leave early actions without effective guidance. Method: Propose the Stepwise Progress Attribution (SPA) framework, which trains a progress estimator to decompose the final reward into stepwise contributions and combines them with an environment grounding signal as the intermediate reward. Result: Across several benchmarks, SPA improves success rate by 2.5% and grounding accuracy by 1.9% on average. Conclusion: By providing more effective intermediate rewards, SPA improves reinforcement learning training for agents. Abstract: Reinforcement learning (RL) holds significant promise for training LLM agents to handle complex, goal-oriented tasks that require multi-step interactions with external environments. However, a critical challenge when applying RL to these agentic tasks arises from delayed rewards: feedback signals are typically available only after the entire task is completed. This makes it non-trivial to assign delayed rewards to earlier actions, providing insufficient guidance regarding environmental constraints and hindering agent training. In this work, we draw on the insight that the ultimate completion of a task emerges from the cumulative progress an agent makes across individual steps. We propose Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion. To achieve this, we train a progress estimator that accumulates stepwise contributions over a trajectory to match the task completion. During policy optimization, we combine the estimated per-step contribution with a grounding signal for actions executed in the environment as the fine-grained, intermediate reward for effective agent training. Extensive experiments on common agent benchmarks (including Webshop, ALFWorld, and VirtualHome) demonstrate that SPA consistently outperforms the state-of-the-art method in both success rate (+2.5% on average) and grounding accuracy (+1.9% on average). Further analyses demonstrate that our method remarkably provides more effective intermediate rewards for RL training. Our code is available at https://github.com/WangHanLinHenry/SPA-RL-Agent.
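
The reward redistribution described in the abstract above has two parts: stepwise contributions that should accumulate to the task-completion signal, and a per-step reward that mixes each contribution with a grounding signal. The following is a rough sketch under stated assumptions: the squared-error objective, the additive combination, and the weight `beta` are illustrative choices, not details confirmed by the abstract.

```python
def progress_loss(contributions, final_reward):
    """Squared error between accumulated contributions and task completion."""
    return (sum(contributions) - final_reward) ** 2


def intermediate_rewards(contributions, grounding, beta=0.5):
    """Per-step reward: estimated contribution plus a weighted grounding signal.

    The additive form and `beta` are assumptions made for illustration.
    """
    if len(contributions) != len(grounding):
        raise ValueError("one grounding value per step is required")
    return [c + beta * g for c, g in zip(contributions, grounding)]


contributions = [0.25, 0.5, 0.25]  # hypothetical estimator outputs for 3 steps
grounding = [1.0, 0.0, 1.0]        # 1.0 if the action executed in the environment
loss = progress_loss(contributions, final_reward=1.0)
rewards = intermediate_rewards(contributions, grounding)
```

Here the contributions already sum to the final reward, so the toy loss is zero, and the second step receives a lower reward because its action was not grounded.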

[208] Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator

Peiwen Yuan,Yiwei Li,Shaoxiong Feng,Xinglin Wang,Yueqi Zhang,Jiayi Shi,Chuyi Tan,Boyuan Pan,Yao Hu,Kan Li

Main category: cs.CL

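TL;DR: The paper studies the self-bias that can arise when LLMs serve as benchmark generators and proposes the Silencer framework to remove it and improve the quality of generated benchmarks.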

Details

Motivation: Benchmarks generated by an LLM can distort performance evaluation because of self-bias, a problem that has not been adequately explored. Method: Define and validate the self-bias phenomenon, then propose the Silencer framework, which uses the heterogeneity among multiple generators to neutralize bias at both the sample and benchmark levels. Result: Silencer suppresses self-bias to near zero and significantly improves the evaluation effectiveness of generated benchmarks (Pearson correlation rises from 0.655 to 0.833). Conclusion: Silencer is a general framework that effectively removes self-bias and produces high-quality evaluation benchmarks. Abstract: LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate high-quality, self-bias-silenced benchmarks. Experimental results across various settings demonstrate that Silencer can suppress self-bias to near zero, significantly improve evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with high-quality human-annotated benchmark), while also exhibiting strong generalizability.

[209] CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models

Xiaqiang Tang,Jian Li,Keyu Hu,Du Nan,Xiaolong Li,Xi Zhang,Weigao Sun,Sihong Xie

Main category: cs.CL

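TL;DR: The paper proposes a framework for assessing the faithfulness of cognitive statements generated by large language models (LLMs) and creates CogniBench-L, a benchmark dataset containing cognitive statements.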

Details

Motivation: Existing benchmarks contain only statements that restate facts and lack an assessment standard for cognitive statements (statements that make inferences from the context), which makes their faithfulness hard to measure and optimize. Method: Inspired by how evidence is assessed in the legislative domain, design a rigorous framework for assessing the faithfulness of cognitive statements, and develop an automatic annotation pipeline to produce larger benchmark datasets. Result: Create the CogniBench-L dataset, reveal statistical characteristics of cognitive statements, and train an accurate cognitive hallucination detection model. Conclusion: The framework and dataset provide effective tools for evaluating and optimizing LLM-generated cognitive statements; the resources are open-sourced. Abstract: Faithfulness hallucinations are claims generated by a Large Language Model (LLM) not supported by contexts provided to the LLM. Lacking an assessment standard, existing benchmarks only contain "factual statements" that rephrase source materials without marking "cognitive statements" that make inference from the given context, making the consistency evaluation and optimization of cognitive statements difficult. Inspired by how evidence is assessed in the legislative domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and create a benchmark dataset where we reveal insightful statistics. We design an annotation pipeline to create larger benchmarks for different LLMs automatically, and the resulting larger-scale CogniBench-L dataset can be used to train accurate cognitive hallucination detection models. We release our model and dataset at: https://github.com/FUTUREEEEEE/CogniBench

[210] SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences

Jungyoub Cha,Hyunjong Kim,Sungzoon Cho

Main category: cs.CL

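TL;DR: SpecExtend integrates efficient attention mechanisms and a cross-model retrieval strategy to improve speculative decoding on long sequences, achieving speedups of up to 2.22x.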

Details

Motivation: Speculative decoding degrades on long inputs, mainly because attention cost increases and draft accuracy drops. Method: Combine FlashAttention and Hybrid Tree Attention, and introduce a cross-model retrieval strategy that dynamically selects context for the draft model. Result: Up to 2.22x speedup for inputs up to 16K tokens. Conclusion: SpecExtend provides an efficient solution for speculative decoding of long sequences. Abstract: Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models, reducing latency across all stages. To improve draft accuracy and speed, we propose Cross-model Retrieval, a novel KV cache update strategy that uses the target model's attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. The code is available at https://github.com/jycha98/SpecExtend .
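
The Cross-model Retrieval idea in the abstract above selects context for the draft model from the target model's attention scores. The following is a simplified sketch of such a selection, assuming the context is split into fixed-size chunks and that the highest-scoring chunks are kept; the chunking granularity, `top_k`, and the function name are hypothetical and not taken from the paper.

```python
def select_chunks(attention_scores, chunk_size, top_k):
    """Return token indices of the `top_k` chunks with the most attention."""
    chunks = [
        list(range(start, min(start + chunk_size, len(attention_scores))))
        for start in range(0, len(attention_scores), chunk_size)
    ]
    # Rank chunk ids by the total attention mass their tokens receive.
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: sum(attention_scores[j] for j in chunks[i]),
        reverse=True,
    )
    keep = sorted(ranked[:top_k])  # restore the original token order
    return [j for i in keep for j in chunks[i]]


# Toy attention scores over 8 context tokens (made-up values).
scores = [0.01, 0.02, 0.30, 0.25, 0.02, 0.01, 0.20, 0.19]
kept = select_chunks(scores, chunk_size=2, top_k=2)
```

In this toy case the draft model would keep tokens 2, 3, 6, and 7 in its cache, shortening its context while retaining the positions the target model attends to most.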

[211] CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature

Noy Sternlicht,Tom Hope

Main category: cs.CL

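TL;DR: CHIMERA is a large-scale knowledge base built by mining the scientific literature, for studying idea recombination and predicting new research directions.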

Details

Motivation: Explore how scientists innovate by recombining existing concepts, and support the prediction of cross-domain research directions. Method: Define an information extraction task, build a high-quality annotated corpus, train an LLM-based extraction model, and apply it to papers in the AI domain. Result: A knowledge base of over 28K recombination examples, plus a trained scientific hypothesis generation model. Conclusion: CHIMERA provides an effective tool for studying idea recombination and predicting new directions. Abstract: A hallmark of human innovation is the process of recombination -- creating original ideas by integrating elements of existing mechanisms and concepts. In this work, we automatically mine the scientific literature and build CHIMERA: a large-scale knowledge base (KB) of recombination examples. CHIMERA can be used to empirically explore at scale how scientists recombine concepts and take inspiration from different areas, or to train supervised machine learning models that learn to predict new creative cross-domain directions. To build this KB, we present a novel information extraction task of extracting recombination from scientific paper abstracts, collect a high-quality corpus of hundreds of manually annotated abstracts, and use it to train an LLM-based extraction model. The model is applied to a large corpus of papers in the AI domain, yielding a KB of over 28K recombination examples. We analyze CHIMERA to explore the properties of recombination in different subareas of AI. Finally, we train a scientific hypothesis generation model using the KB, which predicts new recombination directions that real-world researchers find inspiring. Our data and code are available at https://github.cs.huji.ac.il/tomhope-lab/CHIMERA

[212] Improved Representation Steering for Language Models

Zhengxuan Wu,Qinan Yu,Aryaman Arora,Christopher D. Manning,Christopher Potts

Main category: cs.CL

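TL;DR: RePS is a new bidirectional preference-optimization method that adjusts language model representations to improve concept steering and suppression; on Gemma models it outperforms existing methods and approaches the performance of prompting.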

Details

Motivation: Existing methods that adjust weights or representations steer language models less effectively than prompting; RePS aims to improve representation steering through bidirectional preference optimization. Method: Propose Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly performs concept steering and suppression, and train and evaluate it on Gemma models. Result: On Gemma models (2B to 27B), RePS outperforms existing methods trained with a language modeling objective, approaches the performance of prompting, and performs better on suppression. Conclusion: RePS offers an interpretable and robust alternative for steering and suppression, with effectiveness close to prompting. Abstract: Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior. Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when wanting to introduce or suppress a particular concept. We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly does concept steering and suppression. We train three parameterizations of RePS and evaluate them on AxBench, a large-scale model steering benchmark. On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting -- while promoting interpretability and minimizing parameter count. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting. Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.

[213] RSCF: Relation-Semantics Consistent Filter for Entity Embedding of Knowledge Graph

Junsik Kim,Jinwook Park,Kangil Kim

Main category: cs.CL

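TL;DR: The paper proposes RSCF, a new knowledge graph embedding method that addresses inconsistent embedding differences in existing methods and significantly improves performance.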

Details

Motivation: In existing knowledge graph embedding methods, relation-specific entity transformations make embedding differences inconsistent, which risks losing valuable inductive bias. Method: RSCF keeps embedding semantics consistent through a shared affine transformation, a rooted entity transformation, and normalization of the change, and adds relation transformation and prediction modules. Result: On knowledge graph completion, RSCF significantly outperforms existing methods and is robust across all relations and their frequencies. Conclusion: By keeping embedding semantics consistent, RSCF improves both the performance and the robustness of knowledge graph embedding. Abstract: In knowledge graph embedding, leveraging relation-specific entity-transformation has markedly enhanced performance. However, the consistency of embedding differences before and after transformation remains unaddressed, risking the loss of valuable inductive bias inherent in the embeddings. This inconsistency stems from two problems. First, transformation representations are specified for relations in a disconnected manner, allowing dissimilar transformations and corresponding entity-embeddings for similar relations. Second, a generalized plug-in approach such as SFBR (Semantic Filter Based on Relations) disrupts this consistency through excessive concentration of entity embeddings under entity-based regularization, generating indistinguishable score distributions among relations. In this paper, we introduce a plug-in KGE method, Relation-Semantics Consistent Filter (RSCF), containing more consistent entity-transformation characterized by three features: 1) shared affine transformation of relation embeddings across all relations, 2) rooted entity-transformation that adds an entity embedding to its change represented by the transformed vector, and 3) normalization of the change to prevent scale reduction. To amplify the advantages of consistency that preserve semantics on embeddings, RSCF adds relation transformation and prediction modules for enhancing the semantics. In knowledge graph completion tasks with distance-based and tensor decomposition models, RSCF significantly outperforms state-of-the-art KGE methods, showing robustness across all relations and their frequencies.
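
The three features listed in the abstract above (a shared affine transformation, a rooted transformation, and normalization of the change) can be read as a single formula. The following sketch shows one possible reading, assuming the change is an element-wise product of the entity embedding with an L2-normalized affine map of the relation embedding; the exact form, the choice of norm, and all names here are assumptions rather than the paper's definition.

```python
import math


def affine(matrix, bias, vec):
    """Compute matrix @ vec + bias for plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(matrix, bias)]


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rooted_transform(entity, relation, matrix, bias):
    """Assumed form: entity + entity * normalize(A @ relation + b), element-wise."""
    change = l2_normalize(affine(matrix, bias, relation))
    return [e + e * c for e, c in zip(entity, change)]


A = [[1.0, 0.0], [0.0, 1.0]]  # toy identity map, shared across all relations
b = [0.0, 0.0]
entity = [2.0, 4.0]
relation = [3.0, 4.0]
transformed = rooted_transform(entity, relation, A, b)
```

Because the original entity embedding is added back, the transformed vector stays anchored to it, and sharing `A` and `b` across relations means similar relation embeddings yield similar changes.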

[214] Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

Krishna Singh Rajput,Tejas Anvekar,Chitta Baral,Vivek Gupta

Main category: cs.CL

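TL;DR: MAMMQA is a multi-agent framework for multimodal question answering over text, tables, and images; it decomposes questions, reasons across modalities, and integrates answers, improving accuracy and interpretability.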

Details

Motivation: Existing methods rely on a single reasoning strategy and overlook the unique characteristics of each modality, which limits accuracy and interpretability. Method: MAMMQA comprises two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent, responsible for question decomposition, cross-modal reasoning, and answer integration respectively. Result: On multimodal QA benchmarks, MAMMQA outperforms existing baselines in both accuracy and robustness. Conclusion: Through modular design and multi-agent cooperation, MAMMQA improves the performance and interpretability of multimodal question answering. Abstract: Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality, ultimately limiting both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.

[215] Tracing and Reversing Rank-One Model Edits

Paul Youssef,Zhixue Zhao,Christin Seifert,Jörg Schlötterer

Main category: cs.CL

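TL;DR: This paper studies the traceability and reversibility of knowledge editing methods (KEs), focusing on ROME, and proposes a framework for detecting, tracing, and reversing malicious edits.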

Details

Motivation: Knowledge editing methods (KEs) can effectively update the factual content of large language models (LLMs), but they can also be exploited maliciously to implant misinformation or bias, so defensive techniques are needed. Method: The authors analyze the distributional patterns that ROME edits introduce in the weight matrices to locate the edited weights and predict the edited content, and further propose a method to infer the edited object. Result: ROME-edited weights are traceable: the edited object can be inferred with over 95% accuracy, and edits can be reversed with at least 80% accuracy. Conclusion: The paper demonstrates the feasibility of detecting and reversing edits from the edited weights, providing a framework for defending LLMs against adversarial manipulation. Abstract: Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. In order to defend against these types of malicious manipulation, we need robust techniques that can reliably detect, interpret, and mitigate adversarial edits. This work investigates the traceability and reversibility of knowledge edits, focusing on the widely used Rank-One Model Editing (ROME) method. We first show that ROME introduces distinctive distributional patterns in the edited weight matrices, which can serve as effective signals for locating the edited weights. Second, we show that these altered weights can reliably be used to predict the edited factual relation, enabling partial reconstruction of the modified fact. Building on this, we propose a method to infer the edited object entity directly from the modified weights, without access to the editing prompt, achieving over 95% accuracy. Finally, we demonstrate that ROME edits can be reversed, recovering the model's original outputs with ≥ 80% accuracy. Our findings highlight the feasibility of detecting, tracing, and reversing edits based on the edited weights, offering a robust framework for safeguarding LLMs against adversarial manipulations.

[216] Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation

Yuhao Wang,Ruiyang Ren,Yucheng Wang,Wayne Xin Zhao,Jing Liu,Hua Wu,Haifeng Wang

Main category: cs.CL

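TL;DR: RioRAG proposes a reinforcement learning framework that optimizes informativeness with a hierarchical reward model, addressing data scarcity, hallucination, and evaluation difficulties in long-form question answering.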

Details

Motivation: Long-form question answering faces scarce high-quality training data, increased hallucination risk, and inadequate evaluation metrics. Method: Reinforcement learning is used to optimize informativeness, together with a hierarchical reward model that assesses answers through a three-stage process. Result: The method's effectiveness is validated on the LongFact and RAGChecker benchmarks. Conclusion: RioRAG offers an efficient and reliable solution for long-form question answering. Abstract: Long-form question answering (LFQA) presents unique challenges for large language models, requiring the synthesis of coherent, paragraph-length answers. While retrieval-augmented generation (RAG) systems have emerged as a promising solution, existing research struggles with key limitations: the scarcity of high-quality training data for long-form generation, the compounding risk of hallucination in extended outputs, and the absence of reliable evaluation metrics for factual completeness. In this paper, we propose RioRAG, a novel reinforcement learning (RL) framework that advances long-form RAG through reinforced informativeness optimization. Our approach introduces two fundamental innovations to address the core challenges. First, we develop an RL training paradigm of reinforced informativeness optimization that directly optimizes informativeness and effectively addresses the slow-thinking deficit in conventional RAG systems, bypassing the need for expensive supervised data. Second, we propose a nugget-centric hierarchical reward modeling approach that enables precise assessment of long-form answers through a three-stage process: extracting the nugget from every source webpage, constructing a nugget claim checklist, and computing rewards based on factual alignment. Extensive experiments on two LFQA benchmarks, LongFact and RAGChecker, demonstrate the effectiveness of the proposed method. Our code is available at https://github.com/RUCAIBox/RioRAG.

[217] AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset

Soichiro Murakami,Peinan Zhang,Hidetaka Kamigaito,Hiroya Takamura,Manabu Okumura

Main category: cs.CL

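TL;DR: AdParaphrase v2.0 is a dataset for ad text paraphrasing that contains human preference data, supporting the analysis of linguistic factors and the development of methods for generating attractive ad texts.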

Details

Motivation: Identifying the factors that make ad text attractive is essential for advertising success, and research and development need a more comprehensive dataset. Method: The authors build a dataset of 16,460 ad text paraphrase pairs, each annotated with preference data from ten evaluators, and analyze linguistic features and generation methods. Result: Experiments identify new linguistic features, explore methods for generating attractive ad texts, and reveal the relationship between human preference and ad performance. Conclusion: AdParaphrase v2.0 provides more reliable data for ad text research and shows the potential of reference-free metrics based on large language models. Abstract: Identifying factors that make ad text attractive is essential for advertising success. This study proposes AdParaphrase v2.0, a dataset for ad text paraphrasing, containing human preference data, to enable the analysis of the linguistic factors and to support the development of methods for generating attractive ad texts. Compared with v1.0, this dataset is 20 times larger, comprising 16,460 ad text paraphrase pairs, each annotated with preference data from ten evaluators, thereby enabling a more comprehensive and reliable analysis. Through the experiments, we identified multiple linguistic features of engaging ad texts that were not observed in v1.0 and explored various methods for generating attractive ad texts. Furthermore, our analysis demonstrated the relationships between human preference and ad performance, and highlighted the potential of reference-free metrics based on large language models for evaluating ad text attractiveness. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase-v2.0.

[218] Concealment of Intent: A Game-Theoretic Analysis

Xinbo Wu,Abhishek Umrawal,Lav R. Varshney

Main category: cs.CL

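TL;DR: This paper presents a scalable attack strategy, intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills, and develops a game-theoretic framework to analyze the interaction between attacks and defense systems. The analysis finds structural advantages for the attacker, and the paper proposes a targeted defense mechanism.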

Details

Motivation: As large language models (LLMs) grow more capable, their safe deployment becomes an increasingly pressing concern. Alignment mechanisms exist to prevent misuse, but they remain vulnerable to carefully designed adversarial prompts. Method: The paper proposes an intent-hiding adversarial prompting strategy, builds a game-theoretic framework to analyze the interaction between attacks and defense systems, and designs a targeted defense mechanism. Result: Empirical validation on multiple real-world LLMs shows that the attack is effective and outperforms existing adversarial prompting techniques. Conclusion: The attacker has structural advantages, so targeted defense mechanisms are needed to counter intent-hiding attacks. Abstract: As large language models (LLMs) grow more capable, concerns about their safe deployment have also grown. Although alignment mechanisms have been introduced to deter misuse, they remain vulnerable to carefully designed adversarial prompts. In this work, we present a scalable attack strategy: intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills. We develop a game-theoretic framework to model the interaction between such attacks and defense systems that apply both prompt and response filtering. Our analysis identifies equilibrium points and reveals structural advantages for the attacker. To counter these threats, we propose and analyze a defense mechanism tailored to intent-hiding attacks. Empirically, we validate the attack's effectiveness on multiple real-world LLMs across a range of malicious behaviors, demonstrating clear advantages over existing adversarial prompting techniques.

[219] Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG

Xin Sun,Jianan Xie,Zhongqi Chen,Qiang Liu,Shu Wu,Yuehe Chen,Bowen Song,Weiqiang Wang,Zilei Wang,Liang Wang

Main category: cs.CL

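TL;DR: The paper proposes Divide-Then-Align (DTA), which divides samples into knowledge quadrants and optimizes on preference data so that a retrieval-augmented system can answer "I don't know" when a query falls outside its knowledge boundary, improving reliability and trustworthiness.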

Details

Motivation: Retrieval-Augmented Fine-Tuning (RAFT) still generates answers when reliable knowledge is absent, which reduces reliability in high-stakes domains. DTA aims to address this. Method: DTA divides data samples into four knowledge quadrants, constructs tailored preference data for each quadrant, and optimizes the model with Direct Preference Optimization (DPO). Result: Experiments on three benchmark datasets show that DTA effectively balances accuracy with appropriate abstention, improving system reliability. Conclusion: DTA significantly enhances the trustworthiness of retrieval-augmented systems and suits scenarios that demand high reliability. Abstract: Large language models (LLMs) augmented with retrieval systems have significantly advanced natural language processing tasks by integrating external knowledge sources, enabling more accurate and contextually rich responses. To improve the robustness of such systems against noisy retrievals, Retrieval-Augmented Fine-Tuning (RAFT) has emerged as a widely adopted method. However, RAFT conditions models to generate answers even in the absence of reliable knowledge. This behavior undermines their reliability in high-stakes domains, where acknowledging uncertainty is critical. To address this issue, we propose Divide-Then-Align (DTA), a post-training approach designed to endow RAG systems with the ability to respond with "I don't know" when the query is out of the knowledge boundary of both the retrieved passages and the model's internal knowledge. DTA divides data samples into four knowledge quadrants and constructs tailored preference data for each quadrant, resulting in a curated dataset for Direct Preference Optimization (DPO). Experimental results on three benchmark datasets demonstrate that DTA effectively balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
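The quadrant split described in the abstract can be illustrated with a small sketch. The two boolean flags, the `Sample` fields, and the pairing rule (prefer abstention only when the answer lies outside both the model's knowledge and the retrieved passages) are simplifying assumptions made here; the paper's actual preference construction per quadrant may differ.

```python
from dataclasses import dataclass

IDK = "I don't know"


@dataclass
class Sample:
    question: str
    gold_answer: str
    model_knows: bool           # hypothetical flag: the model answers correctly without retrieval
    retrieval_has_answer: bool  # hypothetical flag: the retrieved passages contain the answer


def quadrant(sample):
    """Map a sample to one of four knowledge quadrants, identified by two booleans."""
    return (sample.model_knows, sample.retrieval_has_answer)


def preference_pair(sample, wrong_answer):
    """Build a prompt/chosen/rejected record in the shape commonly used for DPO data.

    Assumed rule: abstain only when neither knowledge source covers the query;
    otherwise prefer the gold answer over abstaining.
    """
    if quadrant(sample) == (False, False):
        return {"prompt": sample.question, "chosen": IDK, "rejected": wrong_answer}
    return {"prompt": sample.question, "chosen": sample.gold_answer, "rejected": IDK}


if __name__ == "__main__":
    unknown = Sample("Who won the 2093 cup?", "n/a", False, False)
    known = Sample("What is the capital of France?", "Paris", True, False)
    print(preference_pair(unknown, "Team A"))
    print(preference_pair(known, "Lyon"))
```

Under this simplified rule, only one of the four quadrants rewards abstention, which is how the sketch keeps the model from over-refusing questions it can answer.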

[220] Can LLMs Learn to Map the World from Local Descriptions?

Sirui Xia,Aili Chen,Xintao Wang,Tinghui Zhu,Yikai Zhang,Jiangjie Chen,Yanghua Xiao

Main category: cs.CL

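TL;DR: This study investigates whether large language models (LLMs) can build global spatial cognition by integrating local relative relationships; experiments show that LLMs perform well on spatial perception and navigation tasks.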

Details

Motivation: LLMs perform well on code and mathematics tasks, but their potential to internalize structured spatial knowledge remains underexplored. Method: Experiments in a simulated urban environment test LLMs on spatial perception (inferring a global layout) and spatial navigation (learning road connectivity and planning paths). Result: LLMs generalize to unseen spatial relationships, their latent representations align with real-world spatial distributions, and they can learn road connectivity from trajectory data to plan paths accurately. Conclusion: LLMs are capable of building global spatial cognition, which opens new directions for spatial intelligence applications. Abstract: Recent advances in Large Language Models (LLMs) have demonstrated strong capabilities in tasks such as code and mathematics. However, their potential to internalize structured spatial knowledge remains underexplored. This study investigates whether LLMs, grounded in locally relative human observations, can construct coherent global spatial cognition by integrating fragmented relational descriptions. We focus on two core aspects of spatial cognition: spatial perception, where models infer consistent global layouts from local positional relationships, and spatial navigation, where models learn road connectivity from trajectory data and plan optimal paths between unconnected locations. Experiments conducted in a simulated urban environment demonstrate that LLMs not only generalize to unseen spatial relationships between points of interest (POIs) but also exhibit latent representations aligned with real-world spatial distributions. Furthermore, LLMs can learn road connectivity from trajectory descriptions, enabling accurate path planning and dynamic spatial awareness during navigation.

[221] Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties

Jiyoung Lee,Seungho Kim,Jieun Han,Jun-Min Lee,Kitaek Kim,Alice Oh,Edward Choi

Main category: cs.CL

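TL;DR: The paper proposes Trans-EnV, a framework for evaluating the linguistic robustness of large language models (LLMs) on multiple non-standard English varieties, and reveals significant performance drops.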

Details

Motivation: LLMs are currently evaluated mainly on Standard American English (SAE), overlooking the diversity of global English, which may raise fairness concerns. Method: Combining linguistic expert knowledge with LLM-based transformations, the framework automatically transforms SAE datasets into 38 English varieties and evaluates seven state-of-the-art LLMs. Result: LLM accuracy drops by up to 46.3% on non-standard English varieties, showing significant performance disparities. Conclusion: The work underscores the importance of comprehensive linguistic robustness evaluation across diverse English varieties; the Trans-EnV framework is publicly available. Abstract: Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code (https://github.com/jiyounglee-0523/TransEnV) and datasets (https://huggingface.co/collections/jiyounglee0523/transenv-681eadb3c0c8cf363b363fb1) are publicly available.

[222] MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection

Baraa Hikal,Ahmed Nasreldin,Ali Hamdi

Main category: cs.CL

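TL;DR: This paper describes a submission to SemEval-2025 Task 3 that combines task-specific prompt engineering with an LLM ensemble verification mechanism to detect hallucinated spans in LLM-generated text across multiple languages, achieving top rankings in several of them.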

Details

Motivation: To address hallucinations in text generated by instruction-tuned large language models (LLMs) in multilingual settings and improve the reliability of text generation. Method: The system uses task-specific prompt engineering and an LLM ensemble verification mechanism that validates candidate hallucination spans through probability-based voting, with fuzzy matching to refine span alignment. Result: The system ranked 1st in Arabic and Basque, 2nd in German, Swedish, and Finnish, and 3rd in Czech, Farsi, and French. Conclusion: The proposed method performs strongly on multilingual hallucination detection, supporting its effectiveness and generalization. Abstract: This paper describes our submission for SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The task involves detecting hallucinated spans in text generated by instruction-tuned Large Language Models (LLMs) across multiple languages. Our approach combines task-specific prompt engineering with an LLM ensemble verification mechanism, where a primary model extracts hallucination spans and three independent LLMs adjudicate their validity through probability-based voting. This framework simulates the human annotation workflow used in the shared task validation and test data. Additionally, fuzzy matching refines span alignment. Our system ranked 1st in Arabic and Basque, 2nd in German, Swedish, and Finnish, and 3rd in Czech, Farsi, and French.
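The probability-based voting step can be pictured with a minimal sketch. It assumes each verifier returns a probability that a candidate span is hallucinated and that a span is kept when the mean probability reaches a threshold; the averaging rule, the 0.5 threshold, and the function names are illustrative assumptions, not details taken from the paper.

```python
def vote(verifier_probs, threshold=0.5):
    """Return True when the mean verifier probability reaches the threshold (assumed rule)."""
    if not verifier_probs:
        raise ValueError("at least one verifier probability is required")
    return sum(verifier_probs) / len(verifier_probs) >= threshold


def filter_spans(candidates, threshold=0.5):
    """Keep the candidate spans that the verifier ensemble confirms.

    `candidates` is a list of (span_text, [p1, p2, p3]) pairs, where each p is
    one verifier's probability that the span is hallucinated.
    """
    return [span for span, probs in candidates if vote(probs, threshold)]


if __name__ == "__main__":
    candidates = [
        ("in 1999", [0.9, 0.8, 0.4]),      # mean 0.7: kept
        ("in Paris", [0.2, 0.6, 0.1]),     # mean 0.3: dropped
    ]
    print(filter_spans(candidates))
```

Averaging probabilities rather than counting binary votes lets one highly confident verifier outweigh two mildly doubtful ones; whether that is desirable depends on how well calibrated the verifiers are.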

[223] EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models

Chengyu Wang,Junbing Yan,Wenrui Cai,Yuanhao Yue,Jun Huang

Main category: cs.CL

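TL;DR: EasyDistill is a comprehensive knowledge distillation toolkit for large language models that supports black-box and white-box distillation and offers a modular design and user-friendly interface.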

Details

Motivation: To simplify knowledge distillation of large language models and make it more accessible for research and industrial use. Method: The toolkit combines data synthesis, supervised fine-tuning, ranking optimization, and reinforcement learning, and supports distillation for both System 1 and System 2 models. Result: It provides distilled models and industrial solutions, and open-sources the corresponding datasets. Conclusion: EasyDistill makes advanced distillation techniques easier to use and more impactful within the NLP community. Abstract: In this paper, we present EasyDistill, a comprehensive toolkit designed for effective black-box and white-box knowledge distillation (KD) of large language models (LLMs). Our framework offers versatile functionalities, including data synthesis, supervised fine-tuning, ranking optimization, and reinforcement learning techniques specifically tailored for KD scenarios. The toolkit accommodates KD functionalities for both System 1 (fast, intuitive) and System 2 (slow, analytical) models. With its modular design and user-friendly interface, EasyDistill empowers researchers and industry practitioners to seamlessly experiment with and implement state-of-the-art KD strategies for LLMs. In addition, EasyDistill provides a series of robust distilled models and KD-based industrial solutions developed by us, along with the corresponding open-sourced datasets, catering to a variety of use cases. Furthermore, we describe the seamless integration of EasyDistill into Alibaba Cloud's Platform for AI (PAI). Overall, the EasyDistill toolkit makes advanced KD techniques for LLMs more accessible and impactful within the NLP community.

[224] Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

Jeongsoo Choi,Jaehun Kim,Joon Son Chung

Main category: cs.CL

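TL;DR: The paper proposes a cross-lingual dubbing system that achieves time-aligned speech translation through a discrete diffusion model with explicit duration control, while preserving speaker identity and speaking speed.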

Details

Motivation: Existing speech translation methods achieve high translation quality but often neglect the transfer of speech patterns, causing mismatches with the source speech and limiting dubbing applications. Method: A discrete diffusion model performs speech-to-unit translation with explicit duration control; a conditional flow matching model synthesizes the speech; and a unit-based speed adaptation mechanism is introduced. Result: Experiments show that the system produces natural, fluent translations that match the duration and speaking pace of the source speech while achieving competitive translation performance. Conclusion: The framework offers an efficient solution for cross-lingual dubbing while preserving speech characteristics and translation quality. Abstract: This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the predicted units and source identity with a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that align with the original speech's duration and speaking pace, while achieving competitive translation performance.

Junhyuk Choi,Minju Kim,Yeseon Hong,Bugeun Kim

Main category: cs.CL

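TL;DR: This paper addresses the concern that large vision-language models (LVLMs) may learn and generate social biases and stereotypes by proposing new evaluation metrics based on the Stereotype Content Model (SCM) and a benchmark, BASIC, for assessing gender, race, and color stereotypes. The study finds that SCM-based evaluation is effective, that LVLMs exhibit color, gender, and race stereotypes, and that model architecture and parameter size may affect stereotypes.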

Details

Motivation: As large vision-language models advance rapidly, their potential to learn and generate social biases and stereotypes is drawing concern. Existing studies have shortcomings in metrics and datasets, in particular overlooking the effects of content words and color. Method: The paper proposes new SCM-based evaluation metrics and designs the BASIC benchmark for assessing gender, race, and color stereotypes, then studies eight LVLMs. Result: (1) SCM-based evaluation effectively captures stereotypes; (2) LVLM outputs exhibit color, gender, and race stereotypes; (3) model architecture and parameter size may affect stereotypes. Conclusion: With new metrics and a benchmark, the study exposes stereotypes in LVLMs and provides tools and directions for future research. Abstract: As large vision language models (LVLMs) rapidly advance, concerns about their potential to learn and generate social biases and stereotypes are increasing. Previous studies on LVLMs' stereotypes face two primary limitations: metrics that overlooked the importance of content words, and datasets that overlooked the effect of color. To address these limitations, this study introduces new evaluation metrics based on the Stereotype Content Model (SCM). We also propose BASIC, a benchmark for assessing gender, race, and color stereotypes. Using SCM metrics and BASIC, we conduct a study with eight LVLMs to discover stereotypes. The study yields three findings. (1) The SCM-based evaluation is effective in capturing stereotypes. (2) LVLMs exhibit color stereotypes in the output along with gender and race ones. (3) Interaction between model architecture and parameter sizes seems to affect stereotypes. We release BASIC publicly on [anonymized for review].

[226] Towards Objective Fine-tuning: How LLMs' Prior Knowledge Causes Potential Poor Calibration?

Ziming Wang,Zeyu Shi,Haoyi Zhou,Shiqi Gao,Qingyun Sun,Jianxin Li

Main category: cs.CL

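TL;DR: LLMs are poorly calibrated after fine-tuning; the CogCalib framework significantly improves calibration through targeted learning strategies.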

Details

Motivation: The paper studies how LLMs' prior knowledge affects calibration during fine-tuning and finds that known data leads to poor calibration. Method: It proposes CogCalib, a framework that applies targeted learning strategies according to the model's prior knowledge. Result: Across 7 tasks, CogCalib significantly improves calibration, reducing ECE by 57% on average relative to standard fine-tuning on Llama3-8B. Conclusion: CogCalib improves LLM calibration and strengthens reliability in critical applications. Abstract: Fine-tuned Large Language Models (LLMs) often demonstrate poor calibration, with their confidence scores misaligned with actual performance. While calibration has been extensively studied in models trained from scratch, the impact of LLMs' prior knowledge on calibration during fine-tuning remains understudied. Our research reveals that LLMs' prior knowledge causes potential poor calibration due to the ubiquitous presence of known data in real-world fine-tuning, which appears harmful for calibration. Specifically, data aligned with LLMs' prior knowledge would induce overconfidence, while new knowledge improves calibration. Our findings expose a tension: LLMs' encyclopedic knowledge, while enabling task versatility, undermines calibration through unavoidable knowledge overlaps. To address this, we propose CogCalib, a cognition-aware framework that applies targeted learning strategies according to the model's prior knowledge. Experiments across 7 tasks using 3 LLM families prove that CogCalib significantly improves calibration while maintaining performance, achieving an average 57% reduction in ECE compared to standard fine-tuning in Llama3-8B. These improvements generalize well to out-of-domain tasks, enhancing the objectivity and reliability of domain-specific LLMs, and making them more trustworthy for critical human-AI interaction applications.
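For readers unfamiliar with the ECE metric quoted above, the following is a minimal sketch of the commonly used equal-width-bin Expected Calibration Error. The bin count, the binning convention (left-closed bins, with confidence 1.0 placed in the last bin), and the function name are assumptions; the abstract does not state which ECE variant the authors use.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Confidence 1.0 would index past the end, so clamp it into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, bool(is_correct)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # An overconfident model: 95% confidence, but only half the answers are right.
    print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0]))
```

An overconfident model of the kind the abstract describes shows up here as bins whose mean confidence exceeds their accuracy, which raises the score.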

[227] Automated Privacy Information Annotation in Large Language Model Interactions

Hang Zeng,Xiangyu Liu,Yong Hu,Chaoyue Niu,Fan Wu,Shaojie Tang,Guihai Chen

Main category: cs.CL

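TL;DR: The paper proposes an approach for detecting privacy leakage in interactions with large language models (LLMs), constructs a large-scale multilingual dataset, and designs evaluation metrics and baseline methods.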

Details

Motivation: Users may unintentionally disclose private information when interacting with LLMs, and existing privacy detection methods do not fit the LLM setting, so locally deployable privacy detection models are needed. Method: The authors construct a dataset of 249K user queries and 154K annotated privacy phrases, design an automated privacy annotation pipeline and evaluation metrics, and test baseline methods built on lightweight LLMs. Result: Evaluation shows a gap between current performance and the requirements of real-world applications, calling for further research into more effective local privacy detection methods. Conclusion: The paper provides a dataset and baselines for privacy detection in LLM interactions, laying groundwork for future research. Abstract: Users interacting with large language models (LLMs) under their real identifiers often unknowingly risk disclosing private information. Automatically notifying users whether their queries leak privacy and which phrases leak what private information has therefore become a practical need. Existing privacy detection methods, however, were designed for different objectives and application scenarios, typically tagging personally identifiable information (PII) in anonymous content. In this work, to support the development and evaluation of privacy detection models for LLM interactions that are deployable on local user devices, we construct a large-scale multilingual dataset with 249K user queries and 154K annotated privacy phrases. In particular, we build an automated privacy annotation pipeline with cloud-based strong LLMs to automatically extract privacy phrases from dialogue datasets and annotate leaked information. We also design evaluation metrics at the levels of privacy leakage, extracted privacy phrase, and privacy information. We further establish baseline methods using light-weight LLMs with both tuning-free and tuning-based methods, and report a comprehensive evaluation of their performance. Evaluation results reveal a gap between current performance and the requirements of real-world LLM applications, motivating future research into more effective local privacy detection methods grounded in our dataset.

[228] Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models

Injae Na,Keonwoong Noh,Woohwan Jung

Main category: cs.CL

TL;DR: The LLM-AT framework automatically selects LLM tiers to balance cost and performance, without training.

Details

Motivation: To address how to choose a suitable LLM tier for complex NLP tasks so as to optimize cost and performance. Method: LLM-AT consists of a Starter, a Generator, and a Judge, and iteratively upgrades the model tier until a valid response is obtained. Result: Experiments show that LLM-AT achieves high performance while reducing cost. Conclusion: LLM-AT is a practical solution for balancing cost and performance in real-world applications. Abstract: LLM providers typically offer multiple LLM tiers, varying in performance and price. As NLP tasks become more complex and modularized, selecting the suitable LLM tier for each subtask is a key challenge to balance between cost and performance. To address the problem, we introduce the LLM Automatic Transmission (LLM-AT) framework that automatically selects LLM tiers without training. LLM-AT consists of Starter, Generator, and Judge. The starter selects the initial LLM tier expected to solve the given question, the generator produces a response using the LLM of the selected tier, and the judge evaluates the validity of the response. If the response is invalid, LLM-AT iteratively upgrades to a higher-tier model, generates a new response, and re-evaluates until a valid response is obtained. Additionally, we propose an accuracy estimator, which enables the suitable initial LLM tier selection without training. Given an input question, the accuracy estimator estimates the expected accuracy of each LLM tier by computing the valid response rate across top-k similar queries from past inference records. Experiments demonstrate that LLM-AT achieves superior performance while reducing costs, making it a practical solution for real-world applications.
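The Starter, Generator, and Judge loop described in the abstract might be sketched as follows. The retrieval of the top-k similar past queries is assumed to happen elsewhere and arrive as `similar_records`; the `min_accuracy` threshold, the record format, and all function names are hypothetical choices made for illustration.

```python
def estimate_accuracy(similar_records, tier):
    """Valid-response rate of `tier` over the retrieved similar past queries.

    Each record is a dict mapping a tier name to True/False (was the response valid).
    """
    outcomes = [record[tier] for record in similar_records if tier in record]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def pick_start_tier(tiers, similar_records, min_accuracy=0.5):
    """Starter: index of the cheapest tier whose estimated accuracy is high enough."""
    for index, tier in enumerate(tiers):  # tiers are ordered cheapest first
        if estimate_accuracy(similar_records, tier) >= min_accuracy:
            return index
    return len(tiers) - 1  # nothing qualifies: start at the top tier


def answer(question, tiers, similar_records, generate, judge, min_accuracy=0.5):
    """Generate with the starting tier and upgrade until the judge accepts."""
    start = pick_start_tier(tiers, similar_records, min_accuracy)
    response = None
    for tier in tiers[start:]:
        response = generate(tier, question)   # Generator
        if judge(question, response):         # Judge
            return tier, response
    return tiers[-1], response  # top tier reached; return its response as-is


if __name__ == "__main__":
    tiers = ["small", "medium", "large"]
    history = [
        {"small": False, "medium": True, "large": True},
        {"small": False, "medium": False, "large": True},
    ]
    fake_generate = lambda tier, q: f"{tier}:{q}"
    fake_judge = lambda q, r: r.startswith("large")
    print(answer("2+2?", tiers, history, fake_generate, fake_judge))
```

Starting above the cheapest tier when history suggests it will fail avoids paying for responses that the judge would reject anyway, which is the cost argument the abstract makes for the accuracy estimator.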

[229] Multi-objective Large Language Model Alignment with Hierarchical Experts

Zhuo Li,Guodong Du,Weiyang Guo,Yigeng Zhou,Xiucheng Li,Wenya Wang,Fangming Liu,Yequan Wang,Deheng Ye,Min Zhang,Jing Li

Main category: cs.CL

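TL;DR: HoE is a lightweight, parameter-efficient, plug-and-play method that lets large language models adapt to diverse user preferences without model training.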

Details

Motivation: Existing alignment methods struggle to balance trade-offs among multiple objectives; HoE aims to solve this problem. Method: HoE consists of three hierarchical components, LoRA Experts, Router Experts, and Preference Routing, achieving a trade-off between parameter size, training cost, and performance. Result: Across 14 objectives and 6 benchmarks, HoE outperforms 15 baselines. Conclusion: HoE is an efficient and flexible multi-objective alignment method suited to diverse user preferences. Abstract: Aligning large language models (LLMs) to simultaneously satisfy multiple objectives remains a significant challenge, especially given the diverse and often conflicting nature of human preferences. Existing alignment methods struggle to balance trade-offs effectively, often requiring costly retraining or yielding suboptimal results across the Pareto frontier of preferences. In this paper, we introduce HoE (Hierarchical Mixture-of-Experts), a lightweight, parameter-efficient, and plug-and-play approach that eliminates the need for model training, while enabling LLMs to adapt across the entire Pareto frontier and accommodate diverse user preferences. In particular, HoE consists of three hierarchical components: LoRA Experts, Router Experts and Preference Routing, reaching optimal Pareto frontiers and achieving a trade-off between parameter size, training cost, and performance. We evaluate HoE across various tasks on 14 objectives and 200 different preferences among 6 benchmarks, demonstrating superior performance over 15 recent baselines. Code is available in the supplementary materials.

[230] Information-Theoretic Complementary Prompts for Improved Continual Text Classification

Duzhen Zhang,Yong Ren,Chenxing Li,Dong Yu,Tielin Zhang

Main category: cs.CL

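TL;DR: InfoComp proposes a new continual text classification method that separates task-specific and task-agnostic prompt spaces and optimizes learning with an information-theoretic framework, significantly reducing catastrophic forgetting and improving knowledge transfer.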

Details

Motivation: Existing methods focus too heavily on task-specific knowledge and overlook the value of shared knowledge; inspired by complementary learning systems theory, the paper proposes learning both kinds of knowledge together. Method: P-Prompt and S-Prompt are designed to encode task-specific and task-agnostic knowledge, respectively; learning is optimized with an information-theoretic framework and two loss functions. Result: The method outperforms existing methods on multiple benchmarks, significantly reducing catastrophic forgetting and improving knowledge transfer. Conclusion: By separating and optimizing the learning of the two kinds of knowledge, InfoComp offers an effective solution for continual text classification. Abstract: Continual Text Classification (CTC) aims to continuously classify new text data over time while minimizing catastrophic forgetting of previously acquired knowledge. However, existing methods often focus on task-specific knowledge, overlooking the importance of shared, task-agnostic knowledge. Inspired by the complementary learning systems theory, which posits that humans learn continually through the interaction of two systems -- the hippocampus, responsible for forming distinct representations of specific experiences, and the neocortex, which extracts more general and transferable representations from past experiences -- we introduce Information-Theoretic Complementary Prompts (InfoComp), a novel approach for CTC. InfoComp explicitly learns two distinct prompt spaces: P(rivate)-Prompt and S(hared)-Prompt. These respectively encode task-specific and task-invariant knowledge, enabling models to sequentially learn classification tasks without relying on data replay. To promote more informative prompt learning, InfoComp uses an information-theoretic framework that maximizes mutual information between different parameters (or encoded representations). Within this framework, we design two novel loss functions: (1) to strengthen the accumulation of task-specific knowledge in P-Prompt, effectively mitigating catastrophic forgetting, and (2) to enhance the retention of task-invariant knowledge in S-Prompt, improving forward knowledge transfer. Extensive experiments on diverse CTC benchmarks show that our approach outperforms previous state-of-the-art methods.

[231] On VLMs for Diverse Tasks in Multimodal Meme Classification

Deepesh Gavit,Debajyoti Mazumder,Samiran Das,Jasabanta Patro

Main category: cs.CL

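TL;DR: The paper proposes a new approach combining vision-language models (VLMs) and language models (LLMs) to improve meme classification, substantially raising baseline performance.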

Details

Motivation: The study systematically analyzes how vision-language models perform on meme classification tasks and explores how combining VLMs and LLMs can improve classification. Method: A new approach uses a VLM to generate a visual understanding of the meme and fine-tunes an LLM on the textual content, combining the two to improve classification. Result: Combining VLMs and LLMs improves performance on sarcasm, offensive, and sentiment classification by 8.34%, 3.52%, and 26.24%, respectively. Conclusion: The study reveals the strengths and limitations of VLMs and proposes a new meme understanding strategy that offers an effective solution for related tasks. Abstract: In this paper, we present a comprehensive and systematic analysis of vision-language models (VLMs) for disparate meme classification tasks. We introduce a novel approach that generates a VLM-based understanding of meme images and fine-tunes the LLMs on textual understanding of the embedded meme text to improve performance. Our contributions are threefold: (1) Benchmarking VLMs with diverse prompting strategies tailored to each sub-task; (2) Evaluating LoRA fine-tuning across all VLM components to assess performance gains; and (3) Proposing a novel approach where detailed meme interpretations generated by VLMs are used to train smaller language models (LLMs), significantly improving classification. The strategy of combining VLMs with LLMs improved the baseline performance by 8.34%, 3.52% and 26.24% for sarcasm, offensive and sentiment classification, respectively. Our results reveal the strengths and limitations of VLMs and present a novel strategy for meme understanding.

[232] Research Community Perspectives on "Intelligence" and Large Language Models

Bertram Højer,Terne Sasha Thorn Jakobsen,Anna Rogers,Stefan Heinrich

Main category: cs.CL

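TL;DR: Through a survey of 303 researchers, the paper examines how "intelligence" is defined in NLP and what role it plays in the research agenda, finding that most respondents do not consider current NLP systems "intelligent".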

Details

Motivation: The study aims to clarify what "intelligence" means in the term "artificial intelligence" as used in NLP, and how it shapes research goals. Method: A questionnaire surveyed 303 researchers from NLP, machine learning, cognitive science, and related fields. Result: Researchers broadly agree on three criteria of intelligence: generalization, adaptability, and reasoning. Only 29% consider current NLP systems "intelligent", and 16.2% see developing intelligent systems as a research goal. Conclusion: Most researchers do not regard current NLP systems as "intelligent", and developing intelligent systems is not a mainstream research goal. Abstract: Despite the widespread use of "artificial intelligence" (AI) framing in Natural Language Processing (NLP) research, it is not clear what researchers mean by "intelligence". To that end, we present the results of a survey on the notion of "intelligence" among researchers and its role in the research agenda. The survey elicited complete responses from 303 researchers from a variety of fields including NLP, Machine Learning (ML), Cognitive Science, Linguistics, and Neuroscience. We identify 3 criteria of intelligence that the community agrees on the most: generalization, adaptability, and reasoning. Our results suggest that the perception of the current NLP systems as "intelligent" is a minority position (29%). Furthermore, only 16.2% of the respondents see developing intelligent systems as a research goal, and these respondents are more likely to consider the current systems intelligent.

[233] Context-Aware Content Moderation for German Newspaper Comments

Felix Krejca,Tobias Kietreiber,Alexander Buchelt,Sebastian Neumaier

Main category: cs.CL

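TL;DR: This paper studies content moderation for German-language newspaper forums, proposes classification models that incorporate contextual information, and evaluates LSTM, CNN, and ChatGPT-3.5 Turbo.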

Details

Motivation: To fill the gap in research on content moderation for German-language newspaper forums and explore how contextual information affects model performance. Method: LSTM, CNN, and ChatGPT-3.5 Turbo models are evaluated on the One Million Posts Corpus to assess the role of contextual information. Result: CNN and LSTM models benefit from contextual information and perform competitively, whereas ChatGPT's zero-shot classification does not benefit and underperforms. Conclusion: Contextual information helps conventional models such as CNN and LSTM but does not help ChatGPT. Abstract: The increasing volume of online discussions requires advanced automatic content moderation to maintain responsible discourse. While hate speech detection on social media is well-studied, research on German-language newspaper forums remains limited. Existing studies often neglect platform-specific context, such as user history and article themes. This paper addresses this gap by developing and evaluating binary classification models for automatic content moderation in German newspaper forums, incorporating contextual information. Using LSTM, CNN, and ChatGPT-3.5 Turbo, and leveraging the One Million Posts Corpus from the Austrian newspaper Der Standard, we assess the impact of context-aware models. Results show that CNN and LSTM models benefit from contextual information and perform competitively with state-of-the-art approaches. In contrast, ChatGPT's zero-shot classification does not improve with added context and underperforms.

[234] Personalized Query Auto-Completion for Long and Short-Term Interests with Adaptive Detoxification Generation

Zhibo Wang,Xiaoze Jiang,Zhiheng Qin,Enyun Yu,Han Li

Main category: cs.CL

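TL;DR: The paper proposes a new model, LaD, to address personalized representation and content detoxification in query auto-completion (QAC). The model captures users' long-term and short-term interests hierarchically and combines this with adaptive detoxification, significantly improving the quality and safety of generated content.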

Details

Motivation: 现有QAC系统在个性化表示和内容去毒方面存在不足，无法满足复杂生成场景的需求，且可能生成有毒内容，影响用户体验。 Method: 提出LaD模型，分层捕获用户的长短期兴趣，并采用基于Reject Preference Optimization（RPO）的在线训练方法，通过特殊标记[Reject]实现自适应去毒。 Result: 在工业级数据集和在线A/B测试中表现优异，显著提升了指标，并已部署在快手搜索中，服务数亿活跃用户。 Conclusion: LaD模型有效解决了QAC中的个性化表示和去毒问题，显著提升了系统性能和用户体验。 Abstract: Query auto-completion (QAC) plays a crucial role in modern search systems. However, in real-world applications, there are two pressing challenges that still need to be addressed. First, there is a need for hierarchical personalized representations for users. Previous approaches have typically used users' search behavior as a single, overall representation, which proves inadequate in more nuanced generative scenarios. Additionally, query prefixes are typically short and may contain typos or sensitive information, increasing the likelihood of generating toxic content compared to traditional text generation tasks. Such toxic content can degrade user experience and lead to public relations issues. Therefore, the second critical challenge is detoxifying QAC systems. To address these two limitations, we propose a novel model (LaD) that captures personalized information from both long-term and short-term interests, incorporating adaptive detoxification. In LaD, personalized information is captured hierarchically at both coarse-grained and fine-grained levels. This approach preserves as much personalized information as possible while enabling online generation within time constraints. To move a futher step, we propose an online training method based on Reject Preference Optimization (RPO). By incorporating a special token [Reject] during both the training and inference processes, the model achieves adaptive detoxification. Consequently, the generated text presented to users is both non-toxic and relevant to the given prefix. 
We conduct comprehensive experiments on industrial-scale datasets and perform online A/B tests, delivering the largest single-experiment metric improvement in nearly two years of our product. Our model has been deployed on Kuaishou search, driving the primary traffic for hundreds of millions of active users. The code is available at https://github.com/JXZe/LaD.

[235] Reason-Align-Respond: Aligning LLM Reasoning with Knowledge Graphs for KGQA

Xiangqing Shen,Fanfan Wang,Rui Xia

Main category: cs.CL

TL;DR: The RAR framework combines LLM reasoning with knowledge graphs to address both LLM hallucination and the limited reasoning ability of knowledge graphs, yielding efficient, state-of-the-art KGQA.

Details

Motivation: LLMs excel at complex reasoning but hallucinate, while knowledge graphs provide structured knowledge but lack flexible reasoning; the two need to be combined. Method: The RAR framework consists of a Reasoner that generates reasoning chains, an Aligner that maps them to KG paths, and a Responser that synthesizes the answer; the whole is optimized with the EM algorithm. Result: Hit@1 scores of 93.3% on WebQSP and 91.0% on CWQ, with strong zero-shot generalization and efficient inference. Conclusion: RAR integrates LLMs with knowledge graphs and produces high-quality, interpretable reasoning chains, with strong performance and generality. Abstract: LLMs have demonstrated remarkable capabilities in complex reasoning tasks, yet they often suffer from hallucinations and lack reliable factual grounding. Meanwhile, knowledge graphs (KGs) provide structured factual knowledge but lack the flexible reasoning abilities of LLMs. In this paper, we present Reason-Align-Respond (RAR), a novel framework that systematically integrates LLM reasoning with knowledge graphs for KGQA. Our approach consists of three key components: a Reasoner that generates human-like reasoning chains, an Aligner that maps these chains to valid KG paths, and a Responser that synthesizes the final answer. We formulate this process as a probabilistic model and optimize it using the Expectation-Maximization algorithm, which iteratively refines the reasoning chains and knowledge paths. Extensive experiments on multiple benchmarks demonstrate the effectiveness of RAR, achieving state-of-the-art performance with Hit@1 scores of 93.3% and 91.0% on WebQSP and CWQ respectively. Human evaluation confirms that RAR generates high-quality, interpretable reasoning chains well-aligned with KG paths. Furthermore, RAR exhibits strong zero-shot generalization capabilities and maintains computational efficiency during inference.
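
For readers unfamiliar with the Hit@1 metric reported for RAR, the sketch below shows one common way to compute it: a question counts as a hit when the top-ranked predicted answer is among the gold answers. The function name and the toy data are illustrative assumptions, not taken from the paper.

```python
def hit_at_1(predictions, gold_answers):
    """Fraction of questions whose top-ranked prediction is a gold answer.

    predictions:  list of ranked answer lists, one per question
    gold_answers: list of sets of acceptable answers, one per question
    """
    if not predictions:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(predictions, gold_answers)
        if ranked and ranked[0] in gold
    )
    return hits / len(predictions)


# Toy data (hypothetical): two of three top answers are correct.
preds = [["Paris", "Lyon"], ["Berlin"], ["Rome", "Milan"]]
gold = [{"Paris"}, {"Munich"}, {"Rome"}]
score = hit_at_1(preds, gold)
```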

[236] Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing

Peiming Guo,Meishan Zhang,Jianling Li,Min Zhang,Yue Zhang

Main category: cs.CL

TL;DR: The paper proposes LLM back generation, an LLM-based method for automatically generating cross-domain constituency treebanks, and combines it with a contrastive learning pre-training strategy to improve cross-domain constituency parsing.

Details

Motivation: Cross-domain constituency parsing is hampered by the scarcity of multi-domain treebanks, which motivates automatic treebank generation. Method: LLM back generation takes an incomplete constituency tree whose leaves contain only domain keywords and fills in the missing words to produce a complete treebank; a span-level contrastive learning pre-training strategy is then applied. Result: Evaluated on five target domains of MCTB, the approach achieves the best average performance. Conclusion: The LLM back generation treebank combined with contrastive learning pre-training improves cross-domain constituency parsing. Abstract: Cross-domain constituency parsing is still an unsolved challenge in computational linguistics since the available multi-domain constituency treebank is limited. We investigate automatic treebank generation by large language models (LLMs) in this paper. Because LLMs perform poorly on constituency parsing, we propose a novel treebank generation method, LLM back generation, which is similar to the reverse process of constituency parsing. LLM back generation takes the incomplete cross-domain constituency tree with only domain keyword leaf nodes as input and fills the missing words to generate the cross-domain constituency treebank. We also introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of our LLM back generation treebank coupled with contrastive learning pre-training on five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art performance on average results compared with various baselines.

[237] Evaluating and Steering Modality Preferences in Multimodal Large Language Model

Yu Zhang,Jinlong Ma,Yongshuai Hou,Xuefeng Bai,Kehai Chen,Yang Xiang,Jun Yu,Min Zhang

Main category: cs.CL

TL;DR: The paper investigates whether multimodal large language models (MLLMs) exhibit modality preference when processing multimodal context. Using the newly built MC² benchmark, it finds that all 18 tested MLLMs show modality preference, and it proposes a steering method based on representation engineering.

Details

Motivation: Determine whether MLLMs exhibit modality preference on multimodal tasks and whether that preference can be steered by external intervention. Method: Build the MC² benchmark to evaluate modality preference, and propose a probing and steering method based on representation engineering that needs neither additional fine-tuning nor carefully crafted prompts. Result: All 18 tested MLLMs show modality preference, and the preference direction can be steered effectively with the representation-engineering method. Conclusion: Modality preference is widespread among MLLMs; the proposed method steers it and improves downstream task performance. Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks with multimodal context. However, it is still understudied whether they exhibit modality preference when processing multimodal contexts. To study this question, we first build the MC² benchmark under controlled evidence conflict scenarios to systematically evaluate modality preference, which is the tendency to favor one modality over another when making decisions based on multimodal conflicting evidence. Our extensive evaluation reveals that all 18 tested MLLMs generally demonstrate clear modality bias, and modality preference can be influenced by external interventions. An in-depth analysis reveals that the preference direction can be captured within the latent representations of MLLMs. Built on this, we propose a probing and steering method based on representation engineering to explicitly control modality preference without additional fine-tuning or carefully crafted prompts. Our method effectively amplifies modality preference toward a desired direction and applies to downstream tasks such as hallucination mitigation and multimodal machine translation, yielding promising improvements.
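
The abstract does not spell out how the preference direction is extracted or applied. As a rough illustration only, the sketch below uses difference-of-means steering, a common representation-engineering recipe that is assumed here rather than taken from the paper: the direction is the difference between mean hidden states of examples where the model followed one modality versus the other, and it is added to a hidden state with a strength `alpha`. All names and the toy vectors are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(prefer_a_states, prefer_b_states):
    """Difference of mean hidden states: points from modality B toward A."""
    mu_a = mean_vector(prefer_a_states)
    mu_b = mean_vector(prefer_b_states)
    return [a - b for a, b in zip(mu_a, mu_b)]


def steer(hidden_state, direction, alpha):
    """Shift a hidden state along the direction; alpha sets the strength."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]


# Toy 2-d hidden states (hypothetical): vision-following vs. text-following.
vision_states = [[1.0, 0.0], [3.0, 0.0]]
text_states = [[0.0, 1.0], [0.0, 3.0]]
direction = steering_direction(vision_states, text_states)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

A negative `alpha` would push the state toward the other modality under the same assumption.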

[238] Who Reasons in the Large Language Models?

Jie Shao,Jianxin Wu

Main category: cs.CL

TL;DR: The study hypothesizes, and presents supporting evidence, that the reasoning ability of large language models is attributable mainly to the output projection module (oproj) in the Transformer's multi-head self-attention, rather than to the whole model or to other modules.

Details

Motivation: Identify where the reasoning ability of large language models (LLMs) comes from: the whole model, specific modules, or overfitting. Method: Introduce Stethoscope for Networks (SfN), a suite of tools for probing and analyzing the internal behavior of LLMs, and use it to examine the role of the oproj module. Result: The evidence suggests that oproj plays a central role in reasoning, whereas other modules contribute more to fluent dialogue. Conclusion: The findings offer a new perspective on LLM interpretability and open avenues for more efficient, targeted training strategies. Abstract: Despite the impressive performance of large language models (LLMs), the process of endowing them with new capabilities--such as mathematical reasoning--remains largely empirical and opaque. A critical open question is whether reasoning abilities stem from the entire model, specific modules, or are merely artifacts of overfitting. In this work, we hypothesize that the reasoning capabilities in well-trained LLMs are primarily attributed to the output projection module (oproj) in the Transformer's multi-head self-attention (MHSA) mechanism. To support this hypothesis, we introduce Stethoscope for Networks (SfN), a suite of diagnostic tools designed to probe and analyze the internal behaviors of LLMs. Using SfN, we provide both circumstantial and empirical evidence suggesting that oproj plays a central role in enabling reasoning, whereas other modules contribute more to fluent dialogue. These findings offer a new perspective on LLM interpretability and open avenues for more targeted training strategies, potentially enabling more efficient and specialized LLMs.

[239] Articulatory strategy in vowel production as a basis for speaker discrimination

Justin J. H. Lo,Patrycja Strycharczuk,Sam Kirkham

Main category: cs.CL

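TL;DR: The study asks whether articulatory strategy in vowel production is speaker-specific enough to support speaker discrimination, and finds tongue size to be the most discriminating feature.

Details

Motivation: Examine whether the speaker-specificity of articulatory strategy can serve as a basis for speaker discrimination. Method: Apply Generalised Procrustes Analysis to tongue shape data from 40 English speakers and assess the discriminatory potential of orthogonal tongue shape features within a likelihood-ratio framework. Result: Tongue size is the most discriminating feature, and shape variation in the anterior part of the tongue outperforms variation in the posterior part. Shape-only information can match size-and-shape information when features do not co-vary at the speaker level. Conclusion: Tongue shape features show potential for speaker discrimination, but co-variation among features must be taken into account. Abstract: The way speakers articulate is well known to be variable across individuals while at the same time subject to anatomical and biomechanical constraints. In this study, we ask whether articulatory strategy in vowel production can be sufficiently speaker-specific to form the basis for speaker discrimination. We conducted Generalised Procrustes Analyses of tongue shape data from 40 English speakers from the North West of England, and assessed the speaker-discriminatory potential of orthogonal tongue shape features within the framework of likelihood ratios. Tongue size emerged as the individual dimension with the strongest discriminatory power, while tongue shape variation in the more anterior part of the tongue generally outperformed tongue shape variation in the posterior part. When considered in combination, shape-only information may offer comparable levels of speaker specificity to size-and-shape information, but only when features do not exhibit speaker-level co-variation.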

[240] Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models?

Yifei Wang,Yu Sheng,Linjing Li,Daniel Zeng

Main category: cs.CL

TL;DR: This paper studies how adding examples in long-context in-context learning (ICL) affects predictive uncertainty, and shows that more examples improve performance by reducing epistemic uncertainty (EU).

Details

Motivation: Existing work focuses on performance gains; the effect of additional examples on the trustworthiness of generated responses remains underexplored. Method: Systematically quantify uncertainty at different shot counts, analyze the effect of example quantity, and introduce a new perspective based on uncertainty decomposition with a focus on epistemic uncertainty. Result: Additional examples reduce total uncertainty by injecting task-specific knowledge, which improves performance; for complex tasks, the benefit appears only after the noise introduced by longer inputs has been addressed. Conclusion: The study explains how example quantity reduces uncertainty and offers a new perspective on making ICL more trustworthy. Abstract: Recent advances in handling long sequences have facilitated the exploration of long-context in-context learning (ICL). While much of the existing research emphasizes performance improvements driven by additional in-context examples, the influence on the trustworthiness of generated responses remains underexplored. This paper addresses this gap by investigating how increased examples influence predictive uncertainty, an essential aspect in trustworthiness. We begin by systematically quantifying the uncertainty of ICL with varying shot counts, analyzing the impact of example quantity. Through uncertainty decomposition, we introduce a novel perspective on performance enhancement, with a focus on epistemic uncertainty (EU). Our results reveal that additional examples reduce total uncertainty in both simple and complex tasks by injecting task-specific knowledge, thereby diminishing EU and enhancing performance. For complex tasks, these advantages emerge only after addressing the increased noise and uncertainty associated with longer inputs. Finally, we explore the evolution of internal confidence across layers, unveiling the mechanisms driving the reduction in uncertainty.
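
The abstract does not state which decomposition is used. A widely used entropy-based one, assumed here purely for illustration, treats the entropy of the averaged prediction as total uncertainty, the average per-prediction entropy as aleatoric uncertainty, and their difference as epistemic uncertainty. In this sketch each distribution is imagined to come from a different sampled set of demonstrations, which is also an assumption.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(distributions):
    """Split predictive uncertainty over several predictive distributions.

    Returns (total, aleatoric, epistemic) where
      total     = entropy of the mean distribution,
      aleatoric = mean entropy of the individual distributions,
      epistemic = total - aleatoric (their disagreement).
    """
    n = len(distributions)
    mean_dist = [sum(column) / n for column in zip(*distributions)]
    total = entropy(mean_dist)
    aleatoric = sum(entropy(d) for d in distributions) / n
    return total, aleatoric, total - aleatoric


# Two confident but contradictory predictions: all uncertainty is epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
# Two identical, maximally unsure predictions: all uncertainty is aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
```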

[241] LLMs are Frequency Pattern Learners in Natural Language Inference

Liang Cheng,Zhaowei Wang,Mark Steedman

Main category: cs.CL

TL;DR: The paper investigates what LLMs learn when fine-tuned on NLI, and finds that they exploit a predicate frequency bias between premises and hypotheses and perform poorly on adversarial cases.

Details

Motivation: Find out what LLMs actually learn during NLI fine-tuning and uncover the mechanism behind the performance gain. Method: Analyze predicate frequencies in the premises and hypotheses of NLI datasets, evaluate models on bias-consistent and bias-adversarial cases, and study the relationship between frequency bias and textual entailment. Result: LLMs rely on frequency bias for inference, fine-tuning significantly increases that reliance, and frequency bias correlates with textual entailment. Conclusion: Learning frequency patterns is a key factor in the improved performance of LLMs on inference tasks. Abstract: While fine-tuning LLMs on NLI corpora improves their inferential performance, the underlying mechanisms driving this improvement remain largely opaque. In this work, we conduct a series of experiments to investigate what LLMs actually learn during fine-tuning. We begin by analyzing predicate frequencies in premises and hypotheses across NLI datasets and identify a consistent frequency bias, where predicates in hypotheses occur more frequently than those in premises for positive instances. To assess the impact of this bias, we evaluate both standard and NLI fine-tuned LLMs on bias-consistent and bias-adversarial cases. We find that LLMs exploit frequency bias for inference and perform poorly on adversarial instances. Furthermore, fine-tuned LLMs exhibit significantly increased reliance on this bias, suggesting that they are learning these frequency patterns from datasets. Finally, we compute the frequencies of hyponyms and their corresponding hypernyms from WordNet, revealing a correlation between frequency bias and textual entailment. These findings help explain why learning frequency patterns can enhance model performance on inference tasks.
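
The frequency bias described above (hypothesis predicates being more frequent than premise predicates in positive instances) can be measured with a simple count. The sketch below is a minimal illustration under stated assumptions: the tiny corpus, the predicate pairs, and the function name are all hypothetical, and the paper's actual frequency source and extraction procedure are not given in the abstract.

```python
from collections import Counter


def bias_consistent_rate(positive_pairs, freq):
    """Share of positive (entailment) pairs whose hypothesis predicate is
    more frequent than the premise predicate."""
    if not positive_pairs:
        return 0.0
    consistent = sum(
        1 for premise_pred, hypothesis_pred in positive_pairs
        if freq[hypothesis_pred] > freq[premise_pred]
    )
    return consistent / len(positive_pairs)


# Toy predicate occurrences standing in for corpus counts (hypothetical).
corpus_predicates = [
    "buy", "own", "own", "own", "purchase",
    "have", "have", "have", "have",
    "devour", "eat", "eat", "possess",
]
freq = Counter(corpus_predicates)

# (premise predicate, hypothesis predicate) for positive instances.
pairs = [("buy", "own"), ("purchase", "have"), ("devour", "eat"), ("own", "possess")]
rate = bias_consistent_rate(pairs, freq)
```

Pairs where the inequality fails, such as the last one here, correspond to the bias-adversarial cases on which the abstract reports poor performance.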

[242] Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation

Seungmin Lee,Yongsang Yoo,Minhwa Jung,Min Song

Main category: cs.CL

TL;DR: Def-DTS uses multi-step deductive reasoning with LLMs to improve dialogue topic segmentation. Structured prompts drive context summarization, intent classification, and topic shift detection, and experiments show it outperforms traditional methods.

Details

Motivation: Dialogue topic segmentation (DTS) matters for many NLP tasks but suffers from data shortage, labeling ambiguity, and increasingly complex solutions, while LLMs and reasoning techniques have rarely been applied to it. Method: Apply LLM-based multi-step deductive reasoning with structured prompts for bidirectional context summarization, intent classification, and topic shift detection, and propose a generalizable intent list. Result: Def-DTS outperforms traditional and state-of-the-art methods across dialogue settings and reduces type 2 error in particular; the potential for autolabeling is also explored. Conclusion: LLM reasoning improves DTS performance and shows promise for autolabeling and domain-agnostic intent classification. Abstract: Dialogue Topic Segmentation (DTS) aims to divide dialogues into coherent segments. DTS plays a crucial role in various NLP downstream tasks, but suffers from chronic problems: data shortage, labeling ambiguity, and incremental complexity of recently proposed solutions. Moreover, despite advances in Large Language Models (LLMs) and reasoning strategies, these have rarely been applied to DTS. This paper introduces Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation, which utilizes LLM-based multi-step deductive reasoning to enhance DTS performance and enable case studies using intermediate results. Our method employs a structured prompting approach for bidirectional context summarization, utterance intent classification, and deductive topic shift detection. In the intent classification process, we propose a generalizable intent list for domain-agnostic dialogue intent classification. Experiments in various dialogue settings demonstrate that Def-DTS consistently outperforms traditional and state-of-the-art approaches, with each subtask contributing to improved performance, particularly in reducing type 2 error. We also explore the potential for autolabeling, emphasizing the importance of LLM reasoning techniques in DTS.

[243] FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis

Wei Chen,Zhao Zhang,Meng Yuan,Kepeng Xu,Fuzhen Zhuang

Main category: cs.CL

TL;DR: This paper proposes FCKT, a fine-grained cross-task knowledge transfer framework for targeted sentiment analysis (TSA). It explicitly incorporates aspect-level information into sentiment prediction, reducing negative transfer and improving performance.

Details

Motivation: Existing work mostly relies on coarse-grained knowledge transfer and lacks fine-grained control over aspect-sentiment relationships, which leads to negative transfer. Method: The FCKT framework explicitly incorporates aspect-level information to achieve fine-grained knowledge transfer. Result: Experiments on three datasets show that FCKT outperforms a range of baselines and large language models. Conclusion: Fine-grained knowledge transfer in FCKT improves performance on the TSA task. Abstract: In this paper, we address the task of targeted sentiment analysis (TSA), which involves two sub-tasks, i.e., identifying specific aspects from reviews and determining their corresponding sentiments. Aspect extraction forms the foundation for sentiment prediction, highlighting the critical dependency between these two tasks for effective cross-task knowledge transfer. While most existing studies adopt a multi-task learning paradigm to align task-specific features in the latent space, they predominantly rely on coarse-grained knowledge transfer. Such approaches lack fine-grained control over aspect-sentiment relationships, often assuming uniform sentiment polarity within related aspects. This oversimplification neglects contextual cues that differentiate sentiments, leading to negative transfer. To overcome these limitations, we propose FCKT, a fine-grained cross-task knowledge transfer framework tailored for TSA. By explicitly incorporating aspect-level information into sentiment prediction, FCKT achieves fine-grained knowledge transfer, effectively mitigating negative transfer and enhancing task performance. Experiments on three datasets, including comparisons with various baselines and large language models (LLMs), demonstrate the effectiveness of FCKT. The source code is available at https://github.com/cwei01/FCKT.

[244] Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction

Sam O'Connor Russell,Naomi Harte

Main category: cs.CL

TL;DR: MM-VAP is a multimodal predictive turn-taking model that combines speech with visual cues (facial expression, head pose, and gaze). In videoconferencing interactions it outperforms a speech-only model.

Details

Motivation: Most predictive turn-taking models rely on speech alone and ignore the visual cues that matter in turn-taking, which limits the naturalness of human-robot interaction. Method: MM-VAP integrates speech with visual features (facial expression, head pose, and gaze), and model performance is analyzed by grouping turns according to the duration of silence between them. Result: In videoconferencing interactions, MM-VAP reaches 84% hold/shift prediction accuracy versus 79% for the audio-only model and is better across all silence durations; facial expression features contribute the most. Conclusion: Visual cues are vital for turn-taking prediction and should be included in future multimodal models. The code is public, and the work is presented as the first comprehensive analysis of multimodal predictive turn-taking models. Abstract: Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
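
The evaluation idea of grouping holds and shifts by the silence between turns, instead of pooling them, can be sketched as a per-bin accuracy computation. The bin edges, event tuples, and function name below are illustrative assumptions; the abstract does not give the actual bins.

```python
from collections import defaultdict


def accuracy_by_silence(events, bin_edges):
    """Hold/shift accuracy per silence-duration bin.

    events:    iterable of (silence_seconds, predicted_label, true_label)
    bin_edges: ascending thresholds; bin i holds silences with exactly
               i edges at or below them, so len(bin_edges) + 1 bins exist.
    Returns {bin_index: accuracy} for bins that contain at least one event.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for silence, predicted, true in events:
        bin_index = sum(1 for edge in bin_edges if silence >= edge)
        total[bin_index] += 1
        correct[bin_index] += int(predicted == true)
    return {i: correct[i] / total[i] for i in sorted(total)}


# Toy events (hypothetical) with bins <0.25 s, 0.25-0.5 s, and >=0.5 s.
events = [
    (0.10, "hold", "hold"),
    (0.20, "shift", "hold"),
    (0.30, "shift", "shift"),
    (0.70, "hold", "hold"),
]
per_bin = accuracy_by_silence(events, bin_edges=[0.25, 0.5])
```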

[245] Predicting Implicit Arguments in Procedural Video Instructions

Anil Batra,Laura Sevilla-Lara,Marcus Rohrbach,Frank Keller

Main category: cs.CL

TL;DR: The paper introduces the Implicit-VidSRL dataset to address missing implicit arguments in semantic role labeling (SRL) and to benchmark the contextual reasoning of multimodal models on cooking procedures; a new model, iSRL-Qwen2-VL, outperforms GPT-4o.

Details

Motivation: Existing SRL benchmarks often ignore implicit arguments, which leaves contextual understanding incomplete, especially in highly elliptic procedural text. Method: Introduce Implicit-VidSRL, a dataset that requires inferring both explicit and implicit arguments from multimodal cooking procedures, and use it to evaluate the contextual reasoning of multimodal models. Result: Multimodal LLMs struggle to predict implicit arguments, while iSRL-Qwen2-VL achieves relative F1 improvements over GPT-4o of 17% for what-implicit and 14.7% for where/with-implicit roles. Conclusion: Implicit-VidSRL fills a gap in implicit argument inference, and iSRL-Qwen2-VL offers a stronger solution for multimodal SRL. Abstract: Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) improves understanding of individual steps by identifying predicate-argument structure like {verb,what,where/with}. Procedural instructions are highly elliptic. For instance, in (i) add cucumber to the bowl and (ii) add sliced tomatoes, the second step's where argument is inferred from the context, referring to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models' contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multi-modal procedural data given the verb. Lastly, we propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and 14.7% for where/with-implicit semantic roles over GPT-4o.

[246] Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation

Ekaterina Fadeeva,Aleksandr Rubashevskii,Roman Vashurin,Shehzaad Dhuliawala,Artem Shelmanov,Timothy Baldwin,Preslav Nakov,Mrinmaya Sachan,Maxim Panov

Main category: cs.CL

TL;DR: FRANQ is a new method for detecting hallucinations in RAG systems. It uses uncertainty quantification while keeping factuality and faithfulness separate, and it outperforms existing methods in experiments.

Details

Motivation: RAG systems perform well in open-domain question answering but still hallucinate, and existing methods often conflate factuality with faithfulness. Method: FRANQ applies uncertainty quantification techniques to estimate factuality depending on whether a statement is faithful to the retrieved context. Result: Experiments across multiple datasets and LLMs show that FRANQ detects factual errors in RAG-generated responses more accurately. Conclusion: By separating factuality from faithfulness, FRANQ improves the accuracy of hallucination detection in RAG systems. Abstract: Large Language Models (LLMs) enhanced with external knowledge retrieval, an approach known as Retrieval-Augmented Generation (RAG), have shown strong performance in open-domain question answering. However, RAG systems remain susceptible to hallucinations: factually incorrect outputs that may arise either from inconsistencies in the model's internal knowledge or incorrect use of the retrieved context. Existing approaches often conflate factuality with faithfulness to the retrieved context, misclassifying factually correct statements as hallucinations if they are not directly supported by the retrieval. In this paper, we introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs. FRANQ applies different Uncertainty Quantification (UQ) techniques to estimate factuality based on whether a statement is faithful to the retrieved context or not. To evaluate FRANQ and other UQ techniques for RAG, we present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging examples. Extensive experiments on long- and short-form QA across multiple datasets and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing methods.
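
One natural way to read "different UQ techniques depending on faithfulness" is as a law-of-total-probability mixture: estimate the probability that a statement is faithful to the retrieved context, then weight a factuality estimate for the faithful case against one for the unfaithful case. The sketch below illustrates that reading only; the three input probabilities are assumed to come from separate estimators, and the function name is hypothetical rather than the paper's API.

```python
def factuality_estimate(p_faithful, p_true_if_faithful, p_true_if_unfaithful):
    """Combine two conditional factuality estimates by faithfulness.

    P(true) = P(faithful) * P(true | faithful)
            + (1 - P(faithful)) * P(true | unfaithful)
    """
    for p in (p_faithful, p_true_if_faithful, p_true_if_unfaithful):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return (
        p_faithful * p_true_if_faithful
        + (1.0 - p_faithful) * p_true_if_unfaithful
    )


# A statement unsupported by retrieval can still be judged likely true,
# which avoids flagging every unfaithful statement as a hallucination.
unsupported_but_plausible = factuality_estimate(0.0, 0.9, 0.8)
```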

[247] LLMs Think, But Not In Your Flow: Reasoning-Level Personalization for Black-Box Large Language Models

Jieyong Kim,Tongyoung Kim,Soonjin Yoon,Jaehyung Kim,Dongha Lee

Main category: cs.CL

TL;DR: The RPM framework personalizes black-box LLMs at the reasoning level, improving both accuracy and interpretability over existing response-level methods.

Details

Motivation: Black-box large language models (LLMs) are powerful, but their outputs usually overlook user preferences and reasoning styles; existing methods personalize only at the response level and do not model the user's thought process. Method: RPM extracts features from user history to build personalized reasoning paths, then at inference time retrieves reasoning-aligned examples by feature-level similarity to achieve reasoning-level personalization. Result: Experiments show that RPM consistently outperforms response-level personalization methods across diverse tasks. Conclusion: Reasoning-level personalization improves the performance and interpretability of black-box LLMs. Abstract: Large language models (LLMs) have recently achieved impressive performance across a wide range of natural language tasks and are now widely used in real-world applications. Among them, black-box LLMs--served via APIs without access to model internals--are especially dominant due to their scalability and ease of deployment. Despite their strong capabilities, these models typically produce generalized responses that overlook personal preferences and reasoning styles. This has led to growing interest in black-box LLM personalization, which aims to tailor model outputs to user-specific context without modifying model parameters. However, existing approaches primarily focus on response-level personalization, attempting to match final outputs without modeling personal thought process. To address this limitation, we propose RPM, a framework for reasoning-level personalization that aligns the model's reasoning process with a user's personalized logic. RPM first constructs statistical user-specific factors by extracting and grouping response-influential features from user history. It then builds personalized reasoning paths that reflect how these factors are used in context. In the inference stage, RPM retrieves reasoning-aligned examples for new queries via feature-level similarity and performs inference conditioned on the structured factors and retrieved reasoning paths, enabling the model to follow user-specific reasoning trajectories. This reasoning-level personalization enhances both predictive accuracy and interpretability by grounding model outputs in user-specific logic through structured information.
Extensive experiments across diverse tasks show that RPM consistently outperforms response-level personalization methods, demonstrating the effectiveness of reasoning-level personalization in black-box LLMs.
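
The retrieval step, finding past examples whose features resemble those of a new query, can be sketched with a set-overlap score. Jaccard similarity is an assumption chosen for simplicity; the abstract says only "feature-level similarity". The record layout, feature names, and function names are hypothetical.

```python
def jaccard(features_a, features_b):
    """Set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    a, b = set(features_a), set(features_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_examples(query_features, history, k=2):
    """Return the k history records most similar to the query features.

    sorted() is stable, so ties keep their original history order.
    """
    ranked = sorted(
        history,
        key=lambda record: jaccard(query_features, record["features"]),
        reverse=True,
    )
    return ranked[:k]


# Toy user history (hypothetical): features that influenced past responses.
history = [
    {"id": "a", "features": ["price", "brand"]},
    {"id": "b", "features": ["plot", "acting"]},
    {"id": "c", "features": ["price", "durability", "brand"]},
]
top = retrieve_examples(["price", "brand"], history, k=2)
top_ids = [record["id"] for record in top]
```

The retrieved records, together with their stored reasoning paths, would then be placed in the prompt for the black-box model.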

[248] BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge

Daeen Kabir,Minhajur Rahman Chowdhury Mahim,Sheikh Shafayat,Adnan Sadik,Arian Ahmed,Eunsu Kim,Alice Oh

Main category: cs.CL

TL;DR: BLUCK is a new dataset for evaluating large language models (LLMs) on Bengali linguistic understanding and cultural knowledge, with 2366 multiple-choice questions across 23 categories.

Details

Motivation: Evaluate LLMs on Bengali language and cultural knowledge and fill a gap in existing benchmarks. Method: Curate multiple-choice questions from college-level and job-level examinations and test 6 proprietary and 3 open-source LLMs. Result: The LLMs perform reasonably well overall but struggle in some areas of Bengali phonetics; the results indicate that Bengali is a mid-resource language. Conclusion: BLUCK is the first multiple-choice benchmark centered on native Bengali culture and provides a tool for future research. Abstract: In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college and job level examinations and spans 23 categories covering knowledge on Bangladesh's culture and history and Bengali linguistics. We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs - including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs' performance on Bengali cultural and linguistic contexts is still not comparable to that of mainstream languages like English, our results indicate Bengali's status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark that is centered around native Bengali culture, history, and linguistics.

[249] Thinker: Learning to Think Fast and Slow

Stephen Chung,Wenyu Du,Jie Fu

Main category: cs.CL

TL;DR: A four-stage task (Fast Thinking, Verification, Slow Thinking, and Summarization) improves the reasoning of large language models on question answering, raising both accuracy and efficiency.

Details

Motivation: The search behavior of large language models in question answering is often imprecise and lacks confidence; inspired by Dual Process Theory in psychology, the authors propose a staged task to improve reasoning. Method: A four-stage task: Fast Thinking (under a strict token budget), Verification, Slow Thinking (more deliberate refinement), and Summarization (distilling precise steps). Result: Average accuracy rises from 24.9% to 27.9% for Qwen2.5-1.5B and from 45.9% to 49.8% for DeepSeek-R1-Qwen-1.5B; for Qwen2.5-1.5B, Fast Thinking alone reaches 26.8% accuracy with fewer than 1000 tokens. Conclusion: Intuition and deliberative reasoning are complementary systems, and both benefit from targeted training. Abstract: Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 24.9% to 27.9% for Qwen2.5-1.5B, and from 45.9% to 49.8% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 26.8% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training.
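
The four stages can be pictured as a chain of calls in which each stage sees the output of the previous ones. The sketch below is a hedged illustration of that control flow only: `model` is any callable taking a prompt and a token budget, the prompt wording and budgets are invented for the example, and whether later stages are skipped after a successful verification is not stated in the abstract, so all four always run here. The RL training that the paper applies to this task is not shown.

```python
def thinker_episode(question, model, fast_budget=1000, slow_budget=4000):
    """Run Fast Thinking, Verification, Slow Thinking, and Summarization.

    model: callable (prompt: str, max_tokens: int) -> str  (assumed interface)
    Budgets are illustrative; the abstract only says the fast stage is
    limited by a strict token budget.
    """
    fast = model(f"[Fast Thinking] {question}", fast_budget)
    verification = model(
        f"[Verification] Question: {question}\nAnswer: {fast}", slow_budget
    )
    slow = model(
        f"[Slow Thinking] Question: {question}\nDraft: {fast}\n"
        f"Check: {verification}",
        slow_budget,
    )
    summary = model(f"[Summarization] {slow}", fast_budget)
    return {
        "fast": fast,
        "verification": verification,
        "slow": slow,
        "summary": summary,
    }


def echo_model(prompt, max_tokens):
    """Stand-in for a real LLM: reports which stage it was asked to run."""
    stage = prompt[1:prompt.index("]")]
    return f"<{stage} output, <= {max_tokens} tokens>"


result = thinker_episode("What is 2 + 2?", echo_model)
```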

[250] A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction

Bogdan Bogachov,Yaoyao Fiona Zhao

Main category: cs.CL

TL;DR: This paper proposes Small Language Graph (SLG), a lightweight domain adaptation approach that uses small expert nodes arranged in a graph to reduce compute requirements and hallucination, and that outperforms conventional fine-tuning.

Details

Motivation: Existing domain adaptation methods are computationally expensive and still hallucinate, which is a particular problem in engineering settings that require highly accurate text generation. Method: SLG is structured as a graph in which each node is a small expert model fine-tuned on specific texts. Result: SLG scores 3 times higher than conventional fine-tuning on the Exact Match metric, and fine-tuning is 1.7 times faster. Conclusion: SLG gives small and medium-sized engineering companies a low-cost route to generative AI and supports distributed AI systems. Abstract: Despite recent advancements in domain adaptation techniques for large language models, these methods remain computationally intensive, and the resulting models can still exhibit hallucination issues. Most existing adaptation methods do not prioritize reducing the computational resources required for fine-tuning and inference of language models. Hallucination issues have gradually decreased with each new model release. However, they remain prevalent in engineering contexts, where generating well-structured text with minimal errors and inconsistencies is critical. This work introduces a novel approach called the Small Language Graph (SLG), which is a lightweight adaptation solution designed to address the two key challenges outlined above. The system is structured in the form of a graph, where each node represents a lightweight expert - a small language model fine-tuned on specific and concise texts. The results of this study show that SLG surpassed conventional fine-tuning methods on the Exact Match metric by a factor of 3. Additionally, the fine-tuning process was 1.7 times faster compared to that of a larger stand-alone language model. These findings introduce a potential for small to medium-sized engineering companies to confidently use generative AI technologies, such as LLMs, without the necessity to invest in expensive computational resources. Also, the graph architecture and the small size of expert nodes offer a possible opportunity for distributed AI systems, thus potentially reducing the global need for expensive centralized compute clusters.
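
The Exact Match metric cited above is typically the share of generated answers that equal the reference string after light normalization. The sketch below assumes lowercasing and whitespace collapsing as the normalization, which the abstract does not specify; the example strings are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(predictions, references):
    """Fraction of predictions identical to their reference after normalizing."""
    if not references:
        return 0.0
    matches = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Toy engineering-style answers (hypothetical): one match out of two.
em = exact_match(
    ["Torque: 12 Nm", "M8 bolt"],
    ["torque:  12 nm", "M10 bolt"],
)
```

Because the metric gives no partial credit, it suits settings like the engineering texts described above, where a nearly correct value is still an error.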

[251] Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

Sergey Pletenev,Maria Marina,Nikolay Ivanov,Daria Galimzianova,Nikita Krayko,Mikhail Salnikov,Vasily Konovalov,Alexander Panchenko,Viktor Moskvoretskii

Main category: cs.CL

TL;DR: The paper introduces the EverGreenQA dataset for evaluating and training LLMs on question temporality in question answering, and demonstrates its practical applications.

Details

Motivation: LLMs often hallucinate in question answering, and the temporality of a question (whether its answer changes over time) is a key but underexplored factor. Method: Introduce EverGreenQA, benchmark 12 modern LLMs on whether they encode question temporality explicitly or implicitly, and train a lightweight classifier, EG-E5. Result: EG-E5 achieves SoTA performance on the task, and evergreen classification proves useful in three practical applications. Conclusion: EverGreenQA provides a new tool for handling question temporality in LLMs and demonstrates its practical value. Abstract: Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.

[252] Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction

Mengjie Qian,Rao Ma,Stefano Bannò,Kate M. Knill,Mark J. F. Gales

Main category: cs.CL

TL;DR: 论文研究了端到端语音基础模型在口语语法纠错（SGEC）和反馈生成（SGECF）中的有效性，提出伪标注方法扩展训练数据，并探讨了提示和模型规模的影响。

Details

Motivation: Traditional SGEC systems rely on cascaded modules, while the rise of end-to-end speech foundation models opens a new approach to SGEC and feedback generation. Method: Expand the training data with pseudo-labelling and prompt a Whisper-based model with fluent transcriptions. Result: Pseudo-labelling expands the training data from 77 hours to approximately 2500 hours and improves performance; prompting yields a slight SGEC gain and larger gains in feedback generation. Conclusion: Pseudo-labelling and prompting both help SGEC and feedback generation, but pseudo-labelled data brings no gain for a larger Whisper model. Abstract: Spoken Grammatical Error Correction (SGEC) and Feedback (SGECF) are crucial for second language learners, teachers and test takers. Traditional SGEC systems rely on a cascaded pipeline consisting of an ASR, a module for disfluency detection (DD) and removal and one for GEC. With the rise of end-to-end (E2E) speech foundation models, we investigate their effectiveness in SGEC and feedback generation. This work introduces a pseudo-labelling process to address the challenge of limited labelled data, expanding the training data size from 77 hours to approximately 2500 hours, leading to improved performance. Additionally, we prompt an E2E Whisper-based SGEC model with fluent transcriptions, showing a slight improvement in SGEC performance, with more significant gains in feedback generation. Finally, we assess the impact of increasing model size, revealing that while pseudo-labelled data does not yield performance gain for a larger Whisper model, training with prompts proves beneficial.

[253] Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis

Tianyi Xu,Hongjie Chen,Wang Qing,Lv Hang,Jian Kang,Li Jie,Zhennan Lin,Yongxiang Li,Xie Lei

Main category: cs.CL

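TL;DR: The paper studies how self-supervised pre-training combined with large language models (LLMs) improves speech recognition for Chinese dialects and accents, achieving SOTA results on multiple dialect datasets.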

Details

Motivation: Because data is scarce, Chinese dialects and accents remain challenging for most ASR models. Recent advances in self-supervised learning suggest that self-supervised pre-training combined with LLMs can effectively improve ASR in low-resource scenarios. Method: Pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech, perform alignment training on a 40,000-hour supervised dataset, and systematically examine how different projectors and LLMs affect recognition of Mandarin, dialect, and accented speech. Result: SOTA results on multiple dialect datasets, including Kespeech. Conclusion: Self-supervised pre-training combined with LLMs improves recognition of Chinese dialects and accented speech; the work will be open-sourced to promote reproducible research. Abstract: Large-scale training corpora have significantly improved the performance of ASR models. Unfortunately, due to the relative scarcity of data, Chinese accents and dialects remain a challenge for most ASR models. Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLM), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and do alignment training on a supervised dataset of 40,000 hours. Then, we systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieved SOTA results on multiple dialect datasets, including Kespeech. We will open-source our work to promote reproducible research.

[254] Assessment of L2 Oral Proficiency using Speech Large Language Models

Rao Ma,Mengjie Qian,Siyuan Tang,Stefano Bannò,Kate M. Knill,Mark J. F. Gales

Main category: cs.CL

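TL;DR: The paper explores multimodal large language models (LLMs) as graders of L2 English oral proficiency, comparing training strategies with regression and classification targets; speech LLMs outperform existing baselines and generalise well.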

Details

Motivation: The growing number of L2 English speakers increases demand for automatic spoken language assessment, but existing methods lose information or have other limitations; multimodal LLMs offer a new way to address these issues. Method: Compare training strategies with regression and classification targets for L2 oral grading with multimodal LLMs. Result: Speech LLMs outperform existing baselines on two datasets and generalise strongly in cross-part and cross-task evaluation. Conclusion: Multimodal LLMs show clear potential for L2 oral grading, helped by the audio understanding acquired during pre-training. Abstract: The growing population of L2 English speakers has increased the demand for developing automatic graders for spoken language assessment (SLA). Historically, statistical models, text encoders, and self-supervised speech models have been utilised for this task. However, cascaded systems suffer from the loss of information, while E2E graders also have limitations. With the recent advancements of multi-modal large language models (LLMs), we aim to explore their potential as L2 oral proficiency graders and overcome these issues. In this work, we compare various training strategies using regression and classification targets. Our results show that speech LLMs outperform all previous competitive baselines, achieving superior performance on two datasets. Furthermore, the trained grader demonstrates strong generalisation capabilities in the cross-part or cross-task evaluation, facilitated by the audio understanding knowledge acquired during LLM pre-training.

[255] M-Wanda: Improving One-Shot Pruning for Multilingual LLMs

Rochelle Choenni,Ivan Titov

Main category: cs.CL

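TL;DR: The paper studies multilingual LLM performance under sparsification (pruning) and proposes M-Wanda, a pruning method that uses language-aware activation statistics and dynamically adjusts layerwise sparsity to preserve multilingual performance.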

Details

Motivation: Multilingual LLM performance often depends on model size, and pruning shrinks models but tends to cost performance, so the trade-off between pruning and multilingual performance needs study. Method: Propose M-Wanda, which incorporates language-aware activation statistics into the pruning criterion and dynamically adjusts layerwise sparsity. Result: M-Wanda consistently improves multilingual performance at minimal additional cost. Conclusion: This is the first work to explicitly optimize pruning to retain multilingual performance, and it aims to inspire future research on multilingual pruning. Abstract: Multilingual LLM performance is often critically dependent on model size. With an eye on efficiency, this has led to a surge in interest in one-shot pruning methods that retain the benefits of large-scale pretraining while shrinking the model size. However, as pruning tends to come with performance loss, it is important to understand the trade-offs between multilinguality and sparsification. In this work, we study multilingual performance under different sparsity constraints and show that moderate ratios already substantially harm performance. To help bridge this gap, we propose M-Wanda, a pruning method that models cross-lingual variation by incorporating language-aware activation statistics into its pruning criterion and dynamically adjusts layerwise sparsity based on cross-lingual importance. We show that M-Wanda consistently improves performance at minimal additional costs. We are the first to explicitly optimize pruning to retain multilingual performance, and hope to inspire future advances in multilingual pruning.
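
Illustrative sketch (not the authors' implementation): the base Wanda criterion scores each weight as |W_ij| multiplied by the L2 norm of input feature j over calibration tokens, then prunes the lowest-scoring weights within each output row. The abstract does not specify how M-Wanda combines per-language statistics, so `language_aware_norms` below simply averages per-language norms; that averaging, the function names, and the toy data are all assumptions.

```python
import math


def feature_norms(activations):
    """L2 norm of each input feature over calibration tokens (rows are tokens)."""
    n_features = len(activations[0])
    return [math.sqrt(sum(row[j] ** 2 for row in activations)) for j in range(n_features)]


def language_aware_norms(activations_by_language):
    """Hypothetical combination: average the per-language feature norms."""
    per_lang = [feature_norms(acts) for acts in activations_by_language.values()]
    return [sum(col) / len(per_lang) for col in zip(*per_lang)]


def wanda_mask(weights, norms, sparsity):
    """Per output row, mark the lowest |W_ij| * norm_j scores for pruning (0 = pruned)."""
    mask = []
    for row in weights:
        scores = [abs(w) * n for w, n in zip(row, norms)]
        n_prune = int(len(row) * sparsity)
        pruned = set(sorted(range(len(row)), key=lambda j: scores[j])[:n_prune])
        mask.append([0 if j in pruned else 1 for j in range(len(row))])
    return mask


if __name__ == "__main__":
    weights = [[0.5, -2.0, 0.1, 1.0]]
    calib = {"en": [[2.0, 0.0, 10.0, 1.0]], "hi": [[0.0, 0.0, 10.0, 1.0]]}
    norms = language_aware_norms(calib)
    print(wanda_mask(weights, norms, sparsity=0.5))
```

In the toy data the largest weight (-2.0) is pruned because its input feature is never active, while the small weight 0.1 survives because its feature has a large norm. This is the behavior that separates activation-aware pruning from plain magnitude pruning.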

[256] TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment

Zheng Li,Mao Zheng,Mingyang Song,Wenjie Yang

Main category: cs.CL

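TL;DR: The paper proposes TAT-R1, which uses reinforcement learning and word alignment to improve terminology translation accuracy; experiments show it outperforms baseline models.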

Details

Motivation: Deep reasoning LLMs excel at mathematics and coding, but terminology translation remains unexplored for them. Method: Combine reinforcement learning with word alignment to design rule-based rewards that train the model to focus on terminology translation. Result: TAT-R1 significantly improves terminology translation accuracy while maintaining comparable performance on general translation tasks. Conclusion: TAT-R1 addresses terminology translation effectively, and ablations of the DeepSeek-R1-like training paradigm for machine translation reveal several key findings. Abstract: Recently, deep reasoning large language models (LLMs) like DeepSeek-R1 have made significant progress in tasks such as mathematics and coding. Inspired by this, several studies have employed reinforcement learning (RL) to enhance models' deep reasoning capabilities and improve machine translation (MT) quality. However, terminology translation, an essential task in MT, remains unexplored in deep reasoning LLMs. In this paper, we propose TAT-R1, a terminology-aware translation model trained with reinforcement learning and word alignment. Specifically, we first extract the keyword translation pairs using a word alignment model. Then we carefully design three types of rule-based alignment rewards with the extracted alignment relationships. With those alignment rewards, the RL-trained translation model can learn to focus on the accurate translation of key information, including terminology in the source text. Experimental results show the effectiveness of TAT-R1. Our model significantly improves terminology translation accuracy compared to the baseline models while maintaining comparable performance on general translation tasks. In addition, we conduct detailed ablation studies of the DeepSeek-R1-like training paradigm for machine translation and reveal several key findings.

[257] Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning

Mingyang Song,Mao Zheng

Main category: cs.CL

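TL;DR: The paper proposes ConciseR, a two-stage reinforcement learning framework that addresses overthinking in long Chain-of-Thought (CoT) reasoning by large language models (LLMs) by optimizing for concise reasoning responses.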

Details

Motivation: Existing reasoning models overthink when generating long CoT responses (redundant or repetitive thinking patterns), which hurts reasoning efficiency and performance. Method: A two-stage reinforcement learning framework: the first stage (GRPO++) uses more training steps to strengthen reasoning; the second stage (L-GRPO) uses fewer training steps to enforce conciseness. Result: ConciseR outperforms recent state-of-the-art reasoning models with zero RL paradigm across AIME 2024, MATH-500, AMC 2023, Minerva, and Olympiad benchmarks. Conclusion: By optimizing for concise reasoning, ConciseR improves both reasoning efficiency and performance. Abstract: As test-time scaling becomes a pivotal research frontier in Large Language Models (LLMs) development, contemporary and advanced post-training methodologies increasingly focus on extending the generation length of long Chain-of-Thought (CoT) responses to enhance reasoning capabilities toward DeepSeek R1-like performance. However, recent studies reveal a persistent overthinking phenomenon in state-of-the-art reasoning models, manifesting as excessive redundancy or repetitive thinking patterns in long CoT responses. To address this issue, in this paper, we propose a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in LLMs, named ConciseR. Specifically, the first stage, using more training steps, aims to incentivize the model's reasoning capabilities via Group Relative Policy Optimization with clip-higher and dynamic sampling components (GRPO++), and the second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization (L-GRPO). Significantly, ConciseR only optimizes response length once all rollouts of a sample are correct, following the "walk before you run" principle. Extensive experimental results demonstrate that our ConciseR model, which generates more concise CoT reasoning responses, outperforms recent state-of-the-art reasoning models with zero RL paradigm across AIME 2024, MATH-500, AMC 2023, Minerva, and Olympiad benchmarks.
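
Illustrative sketch (not the authors' reward): the "walk before you run" rule can be read as a gate on the length term, which is applied only when every rollout for a sample is correct. The linear bonus `alpha * (1 - length / max_len)`, the parameter names, and the default `alpha` are assumptions made for illustration; the abstract does not give the L-GRPO reward formula.

```python
def group_rewards(correct, lengths, max_len, alpha=0.5):
    """Reward each rollout in a group; add a length bonus only if all rollouts are correct.

    correct: list of bools, one per rollout of the same prompt
    lengths: response lengths in tokens, same order as `correct`
    max_len: length at which the (hypothetical) bonus reaches zero
    """
    rewards = [1.0 if ok else 0.0 for ok in correct]
    if all(correct):
        # Gate passed: shorter responses earn a larger bonus.
        rewards = [
            r + alpha * (1.0 - min(n, max_len) / max_len)
            for r, n in zip(rewards, lengths)
        ]
    return rewards


if __name__ == "__main__":
    print(group_rewards([True, False], [100, 50], max_len=1000))
    print(group_rewards([True, True], [250, 1000], max_len=1000))
```

While any rollout in the group is still wrong, length plays no role in the reward, so the model is never pushed to shorten answers it cannot yet get right.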

[258] Exploring the Latent Capacity of LLMs for One-Step Text Generation

Gleb Mezentsev,Ivan Oseledets

Main category: cs.CL

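TL;DR: Frozen large language models (LLMs) can generate hundreds of accurate tokens in a single forward pass from only two learned embeddings, revealing a capability for multi-token generation without iterative decoding.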

Details

Motivation: Explore whether text reconstruction is possible without autoregression, revealing latent capabilities of LLMs. Method: Provide a frozen LLM with only two learned embeddings and generate multi-token text in a single forward pass. Result: The model generates hundreds of accurate tokens, and the embeddings form connected and local regions in embedding space. Conclusion: LLMs can generate multiple tokens without iterative decoding, which suggests the potential of learning a dedicated encoder. Abstract: A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one specially trained input embedding. In this work, we explore whether such reconstruction is possible without autoregression. We show that frozen LLMs can generate hundreds of accurate tokens in just one forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored capability of LLMs - multi-token generation without iterative decoding. We investigate the behaviour of these embeddings and provide insight into the type of information they encode. We also empirically show that although these representations are not unique for a given text, they form connected and local regions in embedding space - a property that suggests the potential of learning a dedicated encoder into that space.

[259] Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

Jong Hak Moon,Geon Choi,Paloma Rabaey,Min Gwan Kim,Hyuk Gi Hong,Jung-Oh Lee,Hangyul Yoon,Eun Woo Doe,Jiyoun Kim,Harshita Sharma,Daniel C. Castro,Javier Alvarez-Valle,Edward Choi

Main category: cs.CL

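TL;DR: The paper introduces the LUNGUAGE benchmark dataset and the LUNGUAGESCORE metric for structured radiology report generation, supporting both single-report and longitudinal patient-level evaluation.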

Details

Motivation: Existing evaluation methods for radiology reports are limited to single-report settings and rely on coarse metrics that miss fine-grained clinical semantics and temporal dependencies. Method: Develop a two-stage framework that converts generated reports into structured representations, and propose LUNGUAGESCORE, which compares outputs at the entity, relation, and attribute level. Result: Empirical results show that LUNGUAGESCORE effectively supports structured report evaluation. Conclusion: LUNGUAGE provides the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting. Abstract: Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 80 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage framework that transforms generated reports into fine-grained, schema-aligned structured representations, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage

[260] Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM's Instruction-Following Capabilities

Junyan Zhang,Yubo Gao,Yibo Yan,Jungang Li,Zhaorui Hou,Sicheng Tao,Shuliang Liu,Song Dai,Yonghua Hei,Junzhuo Li,Xuming Hu

Main category: cs.CL

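TL;DR: The study systematically analyzes how fine-tuning reconfigures computation in large language models (LLMs) by identifying and analyzing sparse components (neurons and experts); it introduces the HexaInst dataset and the SPARCOM analytical framework and shows the critical role of these components in instruction execution.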

Details

Motivation: Fine-tuning significantly improves the instruction-following ability of LLMs, but the underlying computational mechanisms remain poorly understood; the study aims to fill this gap. Method: Use the HexaInst dataset and the SPARCOM framework to identify and analyze sparse components (neurons and experts) and evaluate their functional generality and uniqueness. Result: Experiments demonstrate the functional generality and uniqueness of the sparse components and their critical role in instruction execution. Conclusion: The study shows how fine-tuning internalizes instruction following through sparse computational substrates, offering deeper insight for the trustworthy LLM community. Abstract: The fine-tuning of Large Language Models (LLMs) has significantly advanced their instruction-following capabilities, yet the underlying computational mechanisms driving these improvements remain poorly understood. This study systematically examines how fine-tuning reconfigures LLM computations by isolating and analyzing instruction-specific sparse components, i.e., neurons in dense models and both neurons and experts in Mixture-of-Experts (MoE) architectures. In particular, we introduce HexaInst, a carefully curated and balanced instructional dataset spanning six distinct categories, and propose SPARCOM, a novel analytical framework comprising three key contributions: (1) a method for identifying these sparse components, (2) an evaluation of their functional generality and uniqueness, and (3) a systematic comparison of their alterations. Through experiments, we demonstrate functional generality, uniqueness, and the critical role of these components in instruction execution. By elucidating the relationship between fine-tuning-induced adaptations and sparse computational substrates, this work provides deeper insights into how LLMs internalize instruction-following behavior for the trustworthy LLM community.

[261] Pretrained LLMs Learn Multiple Types of Uncertainty

Roi Cohen,Omri Fahn,Gerard de Melo

Main category: cs.CL

TL;DR: The study examines how large language models (LLMs) implicitly capture uncertainty and shows its usefulness for predicting task correctness.

Details

Motivation: LLMs perform well on many tasks but still hallucinate and produce inaccurate text; the study asks whether LLMs capture uncertainty without being explicitly trained to. Method: Treat uncertainty as a linear concept in the model's latent space, analyze how well LLMs capture it after pretraining, and examine the role of different uncertainty types. Result: LLMs appear to capture several types of uncertainty, each useful for predicting correctness on a specific task or benchmark; model scaling has little impact on capturing uncertainty. Conclusion: Unifying the uncertainty types through instruction-tuning or [IDK]-token tuning helps correctness prediction. Abstract: Large Language Models are known to capture real-world knowledge, allowing them to excel in many downstream tasks. Despite recent advances, these models are still prone to what are commonly known as hallucinations, causing them to emit unwanted and factually incorrect text. In this work, we study how well LLMs capture uncertainty, without explicitly being trained for that. We show that, if considering uncertainty as a linear concept in the model's latent space, it might indeed be captured, even after only pretraining. We further show that, though unintuitive, LLMs appear to capture several different types of uncertainty, each of which can be useful to predict the correctness for a specific task or benchmark. Furthermore, we provide in-depth results such as demonstrating a correlation between our correctness prediction and the model's ability to abstain from misinformation using words, and the lack of impact of model scaling for capturing uncertainty. Finally, we claim that unifying the uncertainty types as a single one using instruction-tuning or [IDK]-token tuning is helpful for the model in terms of correctness prediction.

[262] A Representation Level Analysis of NMT Model Robustness to Grammatical Errors

Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis

Main category: cs.CL

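TL;DR: By analyzing internal representations, the paper studies how machine translation models handle grammatical errors, finding that the encoder first detects an error and then corrects it, and identifying key attention heads (Robustness Heads).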

Details

Motivation: Prior work mostly documents robustness failures in machine translation or improves robustness; this paper studies robustness from a representation perspective, analyzing how ungrammatical inputs are represented inside the model and how those representations evolve. Method: Use Grammatical Error Detection (GED) probing and representational similarity analysis to study changes across layers, and analyze Robustness Heads in the attention mechanism. Result: The encoder first detects the grammatical error and then moves its representation toward the correct form; Robustness Heads attend to interpretable linguistic units when responding to errors, and models fine-tuned for robustness rely on them more. Conclusion: Models handle grammatical errors through specific attention heads, and robustness fine-tuning strengthens this mechanism, offering a new view of model robustness. Abstract: Understanding robustness is essential for building reliable NLP systems. Unfortunately, in the context of machine translation, previous work mainly focused on documenting robustness failures or improving robustness. In contrast, we study robustness from a model representation perspective by looking at internal model representations of ungrammatical inputs and how they evolve through model layers. For this purpose, we perform Grammatical Error Detection (GED) probing and representational similarity analysis. Our findings indicate that the encoder first detects the grammatical error, then corrects it by moving its representation toward the correct form. To understand what contributes to this process, we turn to the attention mechanism where we identify what we term Robustness Heads. We find that Robustness Heads attend to interpretable linguistic units when responding to grammatical errors, and that when we fine-tune models for robustness, they tend to rely more on Robustness Heads for updating the ungrammatical word representation.

[263] LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners

Yu He,Zihan Yao,Chentao Song,Tianyu Qi,Jun Liu,Ming Li,Qing Huang

Main category: cs.CL

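TL;DR: The LMCD framework uses large language models to address the cold-start problem in cognitive diagnosis, improving performance through two phases: knowledge diffusion and semantic-cognitive fusion.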

Details

Motivation: Traditional cognitive diagnosis models perform poorly in cold-start scenarios, and existing NLP approaches do not fully bridge the gap between semantic understanding and cognitive profiling. Method: LMCD generates enriched content for exercises and knowledge concepts through Knowledge Diffusion and integrates textual information with student cognitive states through Semantic-Cognitive Fusion. Result: On two real-world datasets, LMCD significantly outperforms existing methods in cold-start settings. Conclusion: LMCD offers an effective cold-start solution for cognitive diagnosis, and its code is publicly available. Abstract: Cognitive Diagnosis (CD) has become a critical task in AI-empowered education, supporting personalized learning by accurately assessing students' cognitive states. However, traditional CD models often struggle in cold-start scenarios due to the lack of student-exercise interaction data. Recent NLP-based approaches leveraging pre-trained language models (PLMs) have shown promise by utilizing textual features but fail to fully bridge the gap between semantic understanding and cognitive profiling. In this work, we propose Language Models as Zeroshot Cognitive Diagnosis Learners (LMCD), a novel framework designed to handle cold-start challenges by harnessing large language models (LLMs). LMCD operates via two primary phases: (1) Knowledge Diffusion, where LLMs generate enriched contents of exercises and knowledge concepts (KCs), establishing stronger semantic links; and (2) Semantic-Cognitive Fusion, where LLMs employ causal attention mechanisms to integrate textual information and student cognitive states, creating comprehensive profiles for both students and exercises. These representations are efficiently trained with off-the-shelf CD models. Experiments on two real-world datasets demonstrate that LMCD significantly outperforms state-of-the-art methods in both exercise-cold and domain-cold settings. The code is publicly available at https://github.com/TAL-auroraX/LMCD

[264] Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings

Gunjan Balde,Soumyadeep Roy,Mainack Mondal,Niloy Ganguly

Main category: cs.CL

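TL;DR: LLMs perform well on medical text summarization, but performance drops on data with many OOV words or high novelty. Vocabulary adaptation improves performance in these settings, particularly in the medical domain.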

Details

Motivation: Study the limitations of LLMs in medical text summarization, especially on data with many OOV words or high novelty, and address the vocabulary mismatch with vocabulary adaptation. Method: Experiment with multiple vocabulary adaptation strategies (updating the LLM vocabulary) and two continual pretraining strategies on three medical summarization datasets, and run a human evaluation. Result: Vocabulary adaptation improves LLM summarization performance even in difficult settings, and medical experts found the resulting summaries more relevant and faithful. Conclusion: Vocabulary adaptation is an effective way to customize LLMs to the medical domain, particularly for specialized vocabulary. Abstract: Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs show a significant performance drop for data points with a high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces an over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation of multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at https://github.com/gb-kgp/LLM-MedicalSummarization-Benchmark.
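
Illustrative sketch (not the authors' code): over-fragmentation can be made concrete by counting how many subword pieces a tokenizer needs per word. The toy greedy longest-match tokenizer, the tiny vocabulary, and the helper names below are all assumptions; real LLM tokenizers such as the one used by Llama-3.1 work differently, but the effect of adding a domain word to the vocabulary is the same in spirit.

```python
def greedy_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right.

    Characters not covered by the vocabulary become single-character pieces.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fallback: emit one character
            i += 1
    return pieces


def fragmentation(words, vocab):
    """Average number of pieces per word (higher means more fragmented)."""
    if not words:
        return 0.0
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


if __name__ == "__main__":
    vocab = {"hyper", "tension", "lip", "id", "emia"}
    print(greedy_tokenize("hyperlipidemia", vocab))
    adapted = vocab | {"hyperlipidemia"}  # vocabulary adaptation: add the domain word
    print(greedy_tokenize("hyperlipidemia", adapted))
```

With the base vocabulary the medical term splits into four pieces; after the term is added, it becomes a single token.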

[265] ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision

Dosung Lee,Wonjun Oh,Boyoung Kim,Minyoung Kim,Joonsuk Park,Paul Hongsuck Seo

Main category: cs.CL

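TL;DR: ReSCORE is a new method that trains dense retrievers with large language models and without labeled documents, significantly improving multi-hop question answering.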

Details

Motivation: Dense retrievers for multi-hop QA need labeled query-document pairs, which are hard to obtain because queries vary widely across reasoning steps. Method: Use large language models to estimate each document's relevance to the question and its consistency with the correct answer, and train the retriever within an iterative question-answering framework. Result: On three multi-hop QA benchmarks, ReSCORE significantly improves retrieval and achieves state-of-the-art QA performance. Conclusion: ReSCORE offers an effective way to train dense retrievers without labeled documents, advancing multi-hop QA. Abstract: Multi-hop question answering (MHQA) involves reasoning across multiple documents to answer complex questions. Dense retrievers typically outperform sparse methods like BM25 by leveraging semantic embeddings; however, they require labeled query-document pairs for fine-tuning. This poses a significant challenge in MHQA due to the high variability of queries (reformulated questions) throughout the reasoning steps. To overcome this limitation, we introduce Retriever Supervision with Consistency and Relevance (ReSCORE), a novel method for training dense retrievers for MHQA without labeled documents. ReSCORE leverages large language models to capture each document's relevance to the question and consistency with the correct answer and uses them to train a retriever within an iterative question-answering framework. Experiments on three MHQA benchmarks demonstrate the effectiveness of ReSCORE, with significant improvements in retrieval, and in turn, the state-of-the-art MHQA performance. Our implementation is available at: https://leeds1219.github.io/ReSCORE.

[266] Multilingual Pretraining for Pixel Language Models

Ilker Kesen,Jonas F. Lotz,Ingo Ziegler,Phillip Rust,Desmond Elliott

Main category: cs.CL

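TL;DR: PIXEL-M4 is a pixel language model pretrained multilingually on English, Hindi, Ukrainian, and Simplified Chinese; it outperforms an English-only counterpart on non-Latin scripts.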

Details

Motivation: Explore how much multilingual pretraining can improve pixel language models on cross-lingual tasks. Method: Pretrain PIXEL-M4 on four visually and linguistically diverse languages (English, Hindi, Ukrainian, Simplified Chinese). Result: PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts and captures rich linguistic features even in languages not seen during pretraining. Conclusion: Multilingual pretraining substantially improves the ability of pixel language models to support diverse languages. Abstract: Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.

[267] rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

Yifei Liu,Li Lyna Zhang,Yi Zhu,Bingcheng Dong,Xudong Zhou,Ning Shang,Fan Yang,Mao Yang

Main category: cs.CL

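TL;DR: rStar-Coder builds a large-scale, verified, high-difficulty dataset (418K problems and 580K solutions) that significantly improves LLM code reasoning and performs strongly on multiple benchmarks.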

Details

Motivation: The scarcity of high-difficulty code datasets, especially ones with verifiable test cases, limits progress in LLM code reasoning. Method: (1) Curate competitive programming problems and oracle solutions to synthesize new, solvable problems; (2) introduce a three-step input generation method and a mutual verification mechanism for output labeling; (3) augment problems with high-quality, test-case-verified long-reasoning solutions. Result: On LiveCodeBench, Qwen2.5-7B improves from 17.4% to 57.3% and Qwen2.5-14B from 23.3% to 62.5%; on the USA Computing Olympiad, the 7B model outperforms QWQ-32B. Conclusion: The rStar-Coder dataset significantly improves LLM code reasoning and shows that much smaller models can reach performance comparable to frontier reasoning LLMs. Abstract: Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems, 580K long-reasoning solutions along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples the generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.
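
Illustrative sketch (one plausible reading, not the authors' pipeline): a mutual verification step for output labeling can be pictured as running several candidate solutions on each generated input and keeping an output as the label only when enough candidates agree. The agreement threshold `min_agreement`, the function name, and the toy solutions are assumptions; the abstract does not describe the mechanism at this level of detail.

```python
from collections import Counter


def label_outputs(solutions, test_inputs, min_agreement=0.6):
    """Label each input with the majority output if enough solutions agree.

    solutions: callables taking one input and returning a hashable output
    test_inputs: hashable inputs to label
    Returns a dict mapping accepted inputs to their agreed output.
    """
    labeled = {}
    for x in test_inputs:
        outputs = []
        for solve in solutions:
            try:
                outputs.append(solve(x))
            except Exception:
                continue  # a crashing solution casts no vote
        if not outputs:
            continue
        value, votes = Counter(outputs).most_common(1)[0]
        # Agreement is measured against all solutions, including crashed ones.
        if votes / len(solutions) >= min_agreement:
            labeled[x] = value
    return labeled


if __name__ == "__main__":
    candidates = [lambda n: n * n, lambda n: n ** 2, lambda n: n * 2]
    print(label_outputs(candidates, [2, 3]))
```

Inputs on which the candidates disagree too much are dropped rather than labeled, trading coverage for label reliability.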

[268] How Humans and LLMs Organize Conceptual Knowledge: Exploring Subordinate Categories in Italian

Andrea Pedrotti,Giulia Rambelli,Caterina Villani,Marianna Bolognesi

Main category: cs.CL

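TL;DR: This study is the first to examine category organization by analyzing exemplars produced at the subordinate level; it finds low alignment between humans and LLMs across exemplar generation, category induction, and typicality judgment, with performance varying by semantic domain.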

Details

Motivation: Examine how categories are organized at the subordinate level, filling the gap left by prior research focused on basic-level categories. Method: Use a new Italian psycholinguistic dataset to evaluate textual and vision LLMs on exemplar generation, category induction, and typicality judgment. Result: Alignment between humans and LLMs is low, but performance varies notably across semantic domains. Conclusion: AI-generated exemplars show promise for psychological and linguistic research but also have constraints. Abstract: People can categorize the same entity at multiple taxonomic levels, such as basic (bear), superordinate (animal), and subordinate (grizzly bear). While prior research has focused on basic-level categories, this study is the first attempt to examine the organization of categories by analyzing exemplars produced at the subordinate level. We present a new Italian psycholinguistic dataset of human-generated exemplars for 187 concrete words. We then use these data to evaluate whether textual and vision LLMs produce meaningful exemplars that align with human category organization across three key tasks: exemplar generation, category induction, and typicality judgment. Our findings show a low alignment between humans and LLMs, consistent with previous studies. However, their performance varies notably across different semantic domains. Ultimately, this study highlights both the promises and the constraints of using AI-generated exemplars to support psychological and linguistic research.

[269] Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead

Jesujoba O. Alabi,Michael A. Hedderich,David Ifeoluwa Adelani,Dietrich Klakow

Main category: cs.CL

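TL;DR: Africa is linguistically rich, yet current NLP technology supports its languages poorly. The paper surveys 734 papers, analyzes progress, and outlines directions for more inclusive research.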

Details

Motivation: The diversity of African languages is scarcely reflected in NLP systems, which risks widening the digital divide; the survey aims to promote inclusive support for African languages. Method: Analyze 734 papers on NLP for African languages published over the past five years and summarize progress and trends. Result: NLP research on African languages is active and growing, driven by multilingual resources, community-led initiatives, and funding programs. Conclusion: The survey outlines directions for more inclusive and sustainable NLP research for African languages. Abstract: With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors, including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 734 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.

[270] Leveraging large language models and traditional machine learning ensembles for ADHD detection from narrative transcripts

Yuxin Zhu,Yuting Guo,Noah Marchuck,Abeed Sarker,Yun Wang

Main category: cs.CL

TL;DR: The study proposes an ensemble framework that combines LLaMA3, RoBERTa, and an SVM through majority voting to classify ADHD diagnosis, reaching an F1 score of 0.71.

Details

Motivation: Despite rapid advances in large language models (LLMs), their integration with traditional supervised machine learning for medical data, especially in psychiatry, remains underexplored. Method: Combine LLaMA3 (capturing long-range semantic structure), RoBERTa (fine-tuned on labeled clinical narratives), and an SVM (trained on TF-IDF features) through majority voting. Result: The ensemble outperforms the individual models with an F1 score of 0.71, improving recall while maintaining competitive precision. Conclusion: Hybrid architectures that combine the semantic richness of LLMs with the interpretability of traditional ML offer a new direction for psychiatric text classification. Abstract: Despite rapid advances in large language models (LLMs), their integration with traditional supervised machine learning (ML) techniques that have proven applicability to medical data remains underexplored. This is particularly true for psychiatric applications, where narrative data often exhibit nuanced linguistic and contextual complexity, and can benefit from the combination of multiple models with differing characteristics. In this study, we introduce an ensemble framework for automatically classifying Attention-Deficit/Hyperactivity Disorder (ADHD) diagnosis (binary) using narrative transcripts. Our approach integrates three complementary models: LLaMA3, an open-source LLM that captures long-range semantic structure; RoBERTa, a pre-trained transformer model fine-tuned on labeled clinical narratives; and a Support Vector Machine (SVM) classifier trained using TF-IDF-based lexical features. These models are aggregated through a majority voting mechanism to enhance predictive robustness. The dataset includes 441 instances, including 352 for training and 89 for validation. Empirical results show that the ensemble outperforms individual models, achieving an F1 score of 0.71 (95% CI: [0.60-0.80]). Compared to the best-performing individual model (SVM), the ensemble improved recall while maintaining competitive precision. This indicates the strong sensitivity of the ensemble in identifying ADHD-related linguistic cues. These findings demonstrate the promise of hybrid architectures that leverage the semantic richness of LLMs alongside the interpretability and pattern recognition capabilities of traditional supervised ML, offering a new direction for robust and generalizable psychiatric text classification.
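
Illustrative sketch (toy data, not the study's models or results): majority voting over three binary classifiers and the F1 computation can be written in a few lines. The prediction lists stand in for the outputs of LLaMA3, RoBERTa, and the SVM; their values and the function names are assumptions made for illustration.

```python
def majority_vote(*model_predictions):
    """Combine binary predictions (0/1) from several models by strict majority."""
    n_models = len(model_predictions)
    return [1 if sum(votes) * 2 > n_models else 0 for votes in zip(*model_predictions)]


def f1_score(y_true, y_pred):
    """F1 for the positive class, computed as 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    llm_preds = [1, 0, 1, 0]      # placeholder for LLaMA3 outputs
    roberta_preds = [1, 1, 0, 0]  # placeholder for RoBERTa outputs
    svm_preds = [0, 1, 1, 0]      # placeholder for SVM outputs
    ensemble = majority_vote(llm_preds, roberta_preds, svm_preds)
    print(ensemble, f1_score([1, 1, 0, 0], ensemble))
```

With an odd number of binary voters there are no ties, which is one practical reason to ensemble three models instead of two.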

[271] PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims

Valentin Knappich,Annemarie Friedrich,Anna Hätty,Simon Razniewski

Main category: cs.CL

TL;DR: The paper introduces PEDANTIC, a dataset for automatically detecting indefiniteness in patent claims, and reports how LLMs perform on this task.

Details

Motivation: Indefiniteness in patent claims is a frequent reason for rejection, yet no annotated dataset exists to support research on automatic detection. Method: An automatic pipeline retrieves office action documents from the USPTO and uses LLMs to extract the reasons for indefiniteness, followed by human validation. An LLM-as-Judge evaluation is also used to assess model performance. Result: LLMs often identify the underlying reasons for indefiniteness correctly but struggle to outperform logistic regression baselines on the prediction task. Conclusion: PEDANTIC is a valuable resource for patent AI research; the dataset and code will be publicly released. Abstract: Patent claims define the scope of protection for an invention. If there are ambiguities in a claim, it is rejected by the patent office. In the US, this is referred to as indefiniteness (35 U.S.C § 112(b)) and is among the most frequent reasons for patent application rejection. The development of automatic methods for patent definiteness examination has the potential to make patent drafting and examination more efficient, but no annotated dataset has been published to date. We introduce PEDANTIC (Patent Definiteness Examination Corpus), a novel dataset of 14k US patent claims from patent applications relating to Natural Language Processing (NLP), annotated with reasons for indefiniteness. We construct PEDANTIC using a fully automatic pipeline that retrieves office action documents from the USPTO and uses Large Language Models (LLMs) to extract the reasons for indefiniteness. A human validation study confirms the pipeline's accuracy in generating high-quality annotations. To gain insight beyond binary classification metrics, we implement an LLM-as-Judge evaluation that compares the free-form reasoning of every model-cited reason with every examiner-cited reason. We show that LLM agents based on Qwen 2.5 32B and 72B struggle to outperform logistic regression baselines on definiteness prediction, even though they often correctly identify the underlying reasons. PEDANTIC provides a valuable resource for patent AI researchers, enabling the development of advanced examination models. We will publicly release the dataset and code.

[272] Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning

Bidyarthi Paul,Jalisha Jashim Era,Mirazur Rahman Zim,Tahmid Sattar Aothoi,Faisal Muhammad Shah

Main category: cs.CL

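TL;DR: The paper introduces SOMADHAN, a dataset of 8792 complex Bengali math word problems with step-by-step solutions, aimed at improving mathematical reasoning in low-resource languages. An evaluation of several large language models (including GPT-4o and LLaMA models) shows that Chain of Thought prompting consistently improves performance, especially on multi-step logic tasks. LLaMA-3.3 70B reaches 88% accuracy with few-shot CoT prompting.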

Details

Motivation: Address the challenge of solving Bengali math word problems (MWPs) and fill the gap left by the lack of a high-quality dataset for mathematical reasoning in this language. Method: Create the SOMADHAN dataset and evaluate several large language models in zero-shot and few-shot settings, with and without Chain of Thought prompting, alongside LoRA fine-tuning. Result: Chain of Thought prompting consistently improves model performance; LLaMA-3.3 70B reaches 88% accuracy with few-shot CoT prompting. Conclusion: SOMADHAN fills a gap in Bengali NLP, providing high-quality data and a scalable framework for mathematical reasoning in low-resource languages. Abstract: Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP) due to the language's low-resource status and the multi-step reasoning required. Existing models struggle with complex Bengali MWPs, largely because no human-annotated Bengali dataset has previously addressed this task. This gap has limited progress in Bengali mathematical reasoning. To address this, we created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step-by-step solutions. We designed this dataset to support reasoning-focused evaluation and model development in a linguistically underrepresented context. Using SOMADHAN, we evaluated a range of large language models (LLMs) - including GPT-4o, GPT-3.5 Turbo, LLaMA series models, Deepseek, and Qwen - through both zero-shot and few-shot prompting with and without Chain of Thought (CoT) reasoning. CoT prompting consistently improved performance over standard prompting, especially in tasks requiring multi-step logic. LLaMA-3.3 70B achieved the highest accuracy of 88% with few-shot CoT prompting. We also applied Low-Rank Adaptation (LoRA) to fine-tune models efficiently, enabling them to adapt to Bengali MWPs with minimal computational cost. Our work fills a critical gap in Bengali NLP by providing a high-quality reasoning dataset and a scalable framework for solving complex MWPs. We aim to advance equitable research in low-resource languages and enhance reasoning capabilities in educational and language technologies.

[273] Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History

Qishuai Zhong,Zongmin Li,Siqi Fan,Aixin Sun

Main category: cs.CL

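TL;DR: The paper proposes a framework for evaluating how large language models (LLMs) adapt their behavior to users' sociodemographic characteristics (such as age, occupation, and education level), and examines model consistency when these attributes are introduced explicitly versus implicitly.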

Details

Motivation: Existing evaluations mostly focus on single-turn prompts, whereas real-world applications must adapt behavior based on multi-turn dialogue history, so a more comprehensive evaluation method is needed. Method: A multi-agent pipeline builds a synthetic dataset pairing user profiles with multi-turn dialogue histories, and questions from VSM 2013 are used to probe the models' value expression. Result: Most models adjust their expressed values according to demographic attributes, particularly age and education level, but consistency varies; models with stronger reasoning capabilities behave more consistently. Conclusion: Reasoning ability is important for robust sociodemographic adaptation; future work could further improve model consistency. Abstract: Effective engagement by large language models (LLMs) requires adapting responses to users' sociodemographic characteristics, such as age, occupation, and education level. While many real-world applications leverage dialogue history for contextualization, existing evaluations of LLMs' behavioral adaptation often focus on single-turn prompts. In this paper, we propose a framework to evaluate LLM adaptation when attributes are introduced either (1) explicitly via user profiles in the prompt or (2) implicitly through multi-turn dialogue history. We assess the consistency of model behavior across these modalities. Using a multi-agent pipeline, we construct a synthetic dataset pairing dialogue histories with distinct user profiles and employ questions from the Value Survey Module (VSM 2013) (Hofstede and Hofstede, 2016) to probe value expression. Our findings indicate that most models adjust their expressed values in response to demographic changes, particularly in age and education level, but consistency varies. Models with stronger reasoning capabilities demonstrate greater alignment, indicating the importance of reasoning in robust sociodemographic adaptation.

[274] Analyzing values about gendered language reform in LLMs' revisions

Jules Watson,Xi Wang,Raymond Liu,Suzanne Stevenson,Barend Beekhuizen

Main category: cs.CL

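TL;DR: The study examines how LLMs revise gendered role nouns during text revision and how they justify those revisions, evaluates whether the revisions align with feminist and trans-inclusive language reforms, and discusses implications for value alignment.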

Details

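Motivation: Examine how LLMs behave when revising gendered role nouns and whether that behavior aligns with feminist and trans-inclusive language reforms, with a view to value alignment. Method: Analyze LLMs' revisions of gendered role nouns and their justifications, drawing on sociolinguistics to assess whether LLMs are sensitive to the same contextual effects as people when applying such reforms. Result: LLMs show broad evidence of the same contextual sensitivity that people exhibit when revising gendered role nouns. Conclusion: The findings carry implications for LLM value alignment, particularly regarding gender-inclusive language. Abstract: Within the common LLM use case of text revision, we study LLMs' revision of gendered role nouns (e.g., outdoorsperson/woman/man) and their justifications of such revisions. We evaluate their alignment with feminist and trans-inclusive language reforms for English. Drawing on insight from sociolinguistics, we further assess if LLMs are sensitive to the same contextual effects in the application of such reforms as people are, finding broad evidence of such effects. We discuss implications for value alignment.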

[275] PHISH in MESH: Korean Adversarial Phonetic Substitution and Phonetic-Semantic Feature Integration Defense

Byungjun Kim,Minju Kim,Hyeonchu Park,Bugeun Kim

Main category: cs.CL

TL;DR: The paper proposes PHISH and MESH to address phonetic substitution attacks in Korean and improve the robustness of hate speech detection.

Details

Motivation: Malicious users employ phonetic substitution to evade hate speech detection, yet existing studies overlook Korean and lack architecture-level defenses. Method: Proposes PHISH (which exploits the phonological characteristics of the Korean writing system) and MESH (which combines semantic and phonetic features at the architectural level). Result: Experiments show the methods are effective on both perturbed and unperturbed datasets, improving detection performance while reflecting realistic adversarial behavior. Conclusion: PHISH and MESH effectively counter phonetic substitution attacks in Korean and make detection systems more practical. Abstract: As malicious users increasingly employ phonetic substitution to evade hate speech detection, researchers have investigated such strategies. However, two key challenges remain. First, existing studies have overlooked the Korean language, despite its vulnerability to phonetic perturbations due to its phonographic nature. Second, prior work has primarily focused on constructing datasets rather than developing architectural defenses. To address these challenges, we propose (1) PHonetic-Informed Substitution for Hangul (PHISH) that exploits the phonological characteristics of the Korean writing system, and (2) Mixed Encoding of Semantic-pHonetic features (MESH) that enhances the detector's robustness by incorporating phonetic information at the architectural level. Our experimental results demonstrate the effectiveness of our proposed methods on both perturbed and unperturbed datasets, suggesting that they not only improve detection performance but also reflect realistic adversarial behaviors employed by malicious users.

[276] AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs

Xuanwen Ding,Chengjun Pan,Zejun Li,Jiwen Zhang,Siyuan Wang,Zhongyu Wei

Main category: cs.CL

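TL;DR: AutoJudger is an agent-driven framework that substantially reduces the cost of evaluating multimodal large language models (MLLMs) by dynamically selecting the most informative test questions.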

Details

Motivation: As multimodal benchmarks grow in size and complexity, evaluation costs rise sharply, calling for a more efficient approach. Method: AutoJudger combines Item Response Theory (IRT) with an autonomous evaluation agent that dynamically selects test questions, supported by a semantic-aware retrieval mechanism and a dynamic memory. Result: Experiments on four multimodal benchmarks show large reductions in evaluation cost; on MMT-Bench, AutoJudger uses only 4% of the data to reach over 90% ranking accuracy relative to the full benchmark. Conclusion: AutoJudger substantially lowers evaluation cost while maintaining high accuracy. Abstract: Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring efforts. To tackle with this difficulty, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs that tackles this escalating cost. AutoJudger employs the Item Response Theory (IRT) to estimate the question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model's real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses, i.e. AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy with the full benchmark evaluation on MMT-Bench.
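To make the role of Item Response Theory concrete, the sketch below uses the standard two-parameter logistic (2PL) model and picks the unasked question with the highest Fisher information at the current ability estimate. This is a textbook adaptive-testing rule shown for illustration only: AutoJudger delegates selection to an evaluation agent, and the abstract does not say which IRT variant it uses. The question names and parameter values are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL model: probability that a model with ability theta answers correctly.

    a is the item's discrimination and b is its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def most_informative(theta, items, asked):
    """Return the name of the not-yet-asked item with maximal information."""
    candidates = [name for name in items if name not in asked]
    return max(candidates, key=lambda name: item_information(theta, *items[name]))


# Hypothetical item bank: name -> (discrimination a, difficulty b).
items = {
    "q_easy": (1.0, -2.0),
    "q_mid": (1.0, 0.1),
    "q_hard": (1.0, 2.5),
}

first = most_informative(0.0, items, asked=set())
second = most_informative(0.0, items, asked={first})
print(first, second)
```

Information peaks where an item's difficulty is close to the current ability estimate, which is why a small, well-chosen subset of questions can rank models almost as well as the full benchmark.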

Xiao Liu,Xinyi Dong,Xinyang Gao,Yansong Feng,Xun Pang

Main category: cs.CL

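TL;DR: Augmenting large language models (LLMs) with relevant data and automatic validation improves the quality of the research ideas they generate; experiments show gains in both feasibility and overall quality.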

Details

Motivation: Address the limited feasibility and expected effectiveness of LLM-generated research ideas. Method: Provide metadata as guidance during the idea generation stage and add automatic validation during the idea selection stage. Result: Metadata improves feasibility by 20% and automatic validation improves the quality of selected ideas by 7%; a human study shows that LLM-generated ideas inspire researchers to propose higher-quality research ideas. Conclusion: Data-driven, LLM-assisted research idea generation has practical potential. Abstract: Recent advancements in large language models (LLMs) have shown promise in generating novel research ideas. However, these ideas often face challenges related to feasibility and expected effectiveness. This paper explores how augmenting LLMs with relevant data during the idea generation process can enhance the quality of generated ideas. We introduce two ways of incorporating data: (1) providing metadata during the idea generation stage to guide LLMs toward feasible directions, and (2) adding automatic validation during the idea selection stage to assess the empirical plausibility of hypotheses within ideas. We conduct experiments in the social science domain, specifically with climate negotiation topics, and find that metadata improves the feasibility of generated ideas by 20%, while automatic validation improves the overall quality of selected ideas by 7%. A human study shows that LLM-generated ideas, along with their related data and validation processes, inspire researchers to propose research ideas with higher quality. Our work highlights the potential of data-driven research idea generation, and underscores the practical utility of LLM-assisted ideation in real-world academic settings.

[278] DecisionFlow: Advancing Large Language Model as Principled Decision Maker

Xiusi Chen,Shanyong Wang,Cheng Qian,Hongru Wang,Peixuan Han,Heng Ji

Main category: cs.CL

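TL;DR: DecisionFlow is a new decision modeling framework that uses structured reasoning to make language model decisions more transparent and explainable, outperforming baselines in high-stakes domains.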

Details

Motivation: Current language models lack structured deliberation in high-stakes decision-making, so decisions and their justifications become disconnected; more transparent and explainable methods are needed. Method: Proposes DecisionFlow, which builds a semantically grounded decision space and infers a latent utility function to evaluate trade-offs transparently. Result: On two high-stakes benchmarks, DecisionFlow achieves up to 30% accuracy gains over strong prompting baselines and improves alignment in outcomes. Conclusion: By combining symbolic reasoning with LLMs, DecisionFlow is a key step toward more reliable and explainable decision support systems. Abstract: In high-stakes domains such as healthcare and finance, effective decision-making demands not just accurate outcomes but transparent and explainable reasoning. However, current language models often lack the structured deliberation needed for such tasks, instead generating decisions and justifications in a disconnected, post-hoc manner. To address this, we propose DecisionFlow, a novel decision modeling framework that guides models to reason over structured representations of actions, attributes, and constraints. Rather than predicting answers directly from prompts, DecisionFlow builds a semantically grounded decision space and infers a latent utility function to evaluate trade-offs in a transparent, utility-driven manner. This process produces decisions tightly coupled with interpretable rationales reflecting the model's reasoning. Empirical results on two high-stakes benchmarks show that DecisionFlow not only achieves up to 30% accuracy gains over strong prompting baselines but also enhances alignment in outcomes. Our work is a critical step toward integrating symbolic reasoning with LLMs, enabling more accountable, explainable, and reliable LLM decision support systems. We release the data and code at https://github.com/xiusic/DecisionFlow.

[279] Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling

Hovhannes Tamoyan,Subhabrata Dutta,Iryna Gurevych

Main category: cs.CL

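TL;DR: The study finds that large language models (LLMs) have an intrinsic self-monitoring ability during generation: linear features in the Transformer's residual stream indicate whether factual recall will be correct.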

Details

Motivation: Address factual errors in LLM-generated content by investigating the models' intrinsic self-monitoring mechanisms. Method: Analyze linear features in the Transformer's residual stream to study whether LLMs encode, at generation time, the correctness of their factual recall. Result: This self-awareness emerges rapidly during training, peaks in intermediate layers, and is robust to minor formatting variations. Conclusion: LLMs possess intrinsic self-monitoring capabilities, which contributes to their interpretability and reliability. Abstract: Factual incorrectness in generated content is one of the primary concerns in ubiquitous deployment of large language models (LLMs). Prior findings suggest LLMs can (sometimes) detect factual incorrectness in their generated content (i.e., fact-checking post-generation). In this work, we provide evidence supporting the presence of LLMs' internal compass that dictate the correctness of factual recall at the time of generation. We demonstrate that for a given subject entity and a relation, LLMs internally encode linear features in the Transformer's residual stream that dictate whether it will be able to recall the correct attribute (that forms a valid entity-relation-attribute triplet). This self-awareness signal is robust to minor formatting variations. We investigate the effects of context perturbation via different example selection strategies. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers. These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.

[280] RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models

Dario Satriani,Enzo Veltri,Donatello Santoro,Paolo Papotti

Main category: cs.CL

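TL;DR: The paper examines how accurately large language models (LLMs) generate structured tabular outputs, introduces the RelationalFactQA benchmark, and finds that current LLMs perform poorly on this task.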

Details

Motivation: Existing benchmarks mainly assess short factual answers and overlook the ability to generate structured tabular outputs, an important aspect of LLM factuality. Method: Introduces RelationalFactQA, a benchmark of diverse natural language questions paired with SQL and gold-standard tabular answers, for evaluating structured knowledge retrieval. Result: Even state-of-the-art LLMs do not exceed 25% factual accuracy when generating relational outputs, and performance degrades notably as output dimensionality increases. Conclusion: The study exposes the limits of LLMs in synthesizing structured factual knowledge, and RelationalFactQA provides a key resource for future evaluation of LLM factuality. Abstract: Factuality in Large Language Models (LLMs) is a persistent challenge. Current benchmarks often assess short factual answers, overlooking the critical ability to generate structured, multi-record tabular outputs from parametric knowledge. We demonstrate that this relational fact retrieval is substantially more difficult than isolated point-wise queries, even when individual facts are known to the model, exposing distinct failure modes sensitive to output dimensionality (e.g., number of attributes or records). To systematically evaluate this under-explored capability, we introduce RelationalFactQA, a new benchmark featuring diverse natural language questions (paired with SQL) and gold-standard tabular answers, specifically designed to assess knowledge retrieval in a structured format. RelationalFactQA enables analysis across varying query complexities, output sizes, and data characteristics. Our experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy in generating relational outputs, with performance notably degrading as output dimensionality increases. These findings underscore critical limitations in current LLMs' ability to synthesize structured factual knowledge and establish RelationalFactQA as a crucial resource for measuring future progress in LLM factuality.

[281] Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity

Yehui Tang,Xiaosong Li,Fangcheng Liu,Wei Guo,Hang Zhou,Yaoyuan Wang,Kai Han,Xianzhi Yu,Jinpeng Li,Hui Zang,Fei Mi,Xiaojun Meng,Zhicheng Liu,Hanting Chen,Binfan Zheng,Can Chen,Youliang Yan,Ruiming Tang,Peifeng Qin,Xinghao Chen,Dacheng Tao,Yunhe Wang

Main category: cs.CL

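TL;DR: MoGE (Mixture of Grouped Experts) groups experts and balances their workload, addressing the imbalanced expert activation found in MoE and substantially improving training and inference efficiency on Ascend NPUs.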

Details

Motivation: Imbalanced expert activation in MoE leads to system inefficiency, especially when experts run in parallel on multiple devices. Method: Proposes MoGE, which groups experts and balances their load by design to optimize model execution. Result: MoGE achieves better load balancing and efficiency on Ascend NPUs, with substantially improved inference performance. Conclusion: MoGE is an efficient mixture-of-experts architecture suited to large-scale model training and inference. Abstract: The surgence of Mixture of Experts (MoE) in Large Language Models promises a small price of execution cost for a much larger model parameter count and learning capacity, because only a small fraction of parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and balances the expert workload better than MoE in nature. It constrains tokens to activate an equal number of experts within each predefined expert group. When a model execution is distributed on multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly for the inference phase. Further, we build Pangu Pro MoE on Ascend NPUs, a sparse model based on MoGE with 72 billion total parameters, 16 billion of which are activated for each token. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. The inference performance of Pangu Pro MoE achieves 1148 tokens/s per card and can be further improved to 1528 tokens/s per card by speculative acceleration, outperforming comparable 32B and 72B Dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization to make it a leading model within the sub-100B total parameter class, outperforming prominent open-source models like GLM-Z1-32B and Qwen3-32B.
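The grouping constraint described in this abstract, where each token activates an equal number of experts within each predefined group, can be contrasted with ordinary global top-k routing in a short sketch. The assumptions are loud ones: experts are laid out in contiguous, equal-sized groups (one group per device), experts are ranked by raw router score, and the scores are invented toy values. The abstract does not specify Pangu Pro MoE's actual router, so this is not its implementation.

```python
def global_top_k(scores, k):
    """Conventional MoE routing: the k best-scoring experts overall."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:k])


def grouped_top_k(scores, num_groups, k_per_group):
    """Grouped routing: the k_per_group best-scoring experts inside each group.

    Experts are assumed to be stored contiguously, group after group.
    """
    if len(scores) % num_groups != 0:
        raise ValueError("number of experts must be divisible by num_groups")
    group_size = len(scores) // num_groups
    if k_per_group > group_size:
        raise ValueError("k_per_group cannot exceed the group size")
    selected = []
    for g in range(num_groups):
        start = g * group_size
        members = range(start, start + group_size)
        ranked = sorted(members, key=lambda e: scores[e], reverse=True)
        selected.extend(sorted(ranked[:k_per_group]))
    return selected


# Hypothetical router scores for 8 experts in 4 groups of 2.
scores = [0.9, 0.8, 0.1, 0.2, 0.7, 0.3, 0.05, 0.6]

print(global_top_k(scores, 4))      # experts 0 and 1 share a group; group 1 is idle
print(grouped_top_k(scores, 4, 1))  # exactly one expert per group
```

With global top-k the first group serves two experts for this token while the second serves none; the grouped rule gives every group, and hence every device, the same number of activations.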

[282] RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation

Xiao Liu,Da Yin,Zirui Wu,Yansong Feng

Main category: cs.CL

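TL;DR: RefTool is a reference-guided framework that automatically creates tools to strengthen LLM reasoning on complex tasks, overcoming the reliance of earlier methods on the model's internal knowledge.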

Details

Motivation: Existing tool-creation methods rely on the LLM's internal knowledge and fail on tasks beyond its knowledge scope. RefTool addresses this by drawing on external reference materials such as textbooks. Method: RefTool has two modules: tool creation (generating executable tools from reference content and validating them) and tool utilization (selecting and applying tools by navigating a hierarchical toolbox). Result: On causality, physics, and chemistry benchmarks, RefTool improves average accuracy by 11.3% over existing methods while being cost-efficient and broadly generalizable. Conclusion: Grounding tool creation in external references yields accurate tools, and the hierarchical structure improves tool selection, together strengthening LLM reasoning. Abstract: Tools enhance the reasoning capabilities of large language models (LLMs) in complex problem-solving tasks, but not all tasks have available tools. In the absence of predefined tools, prior works have explored instructing LLMs to generate tools on their own. However, such approaches rely heavily on the models' internal knowledge and would fail in domains beyond the LLMs' knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages structured external materials such as textbooks. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 11.3% on average accuracy, while being cost-efficient and broadly generalizable. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome knowledge limitations, demonstrating the value of grounding tool creation in external references for enhanced and generalizable reasoning.

[283] Towards Better Instruction Following Retrieval Models

Yuchen Zhuang,Aaron Trinh,Rushi Qiang,Haotian Sun,Chao Zhang,Hanjun Dai,Bo Dai

Main category: cs.CL

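TL;DR: InF-IR is a large-scale training corpus designed to improve retrieval models on instruction-following information retrieval. Trained on its high-quality positive and negative samples, the InF-Embed model clearly outperforms baselines.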

Details

Motivation: Conventional retrieval models struggle to follow user instructions, so a dedicated dataset is needed to improve instruction following. Method: Build the InF-IR corpus of over 38,000 <instruction, query, passage> triplets, generate hard negatives, and validate them with an advanced reasoning model. Train InF-Embed with contrastive learning and instruction-query attention mechanisms. Result: Across five instruction-based retrieval benchmarks, InF-Embed improves p-MRR by 8.1% over competitive baselines. Conclusion: The InF-IR corpus and the InF-Embed model effectively improve instruction-following retrieval. Abstract: Modern information retrieval (IR) models, trained exclusively on standard pairs, struggle to effectively interpret and follow explicit user instructions. We introduce InF-IR, a large-scale, high-quality training corpus tailored for enhancing retrieval models in Instruction-Following IR. InF-IR expands traditional training pairs into over 38,000 expressive <instruction, query, passage> triplets as positive samples. In particular, for each positive triplet, we generate two additional hard negative examples by poisoning both instructions and queries, then rigorously validated by an advanced reasoning model (o3-mini) to ensure semantic plausibility while maintaining instructional incorrectness. Unlike existing corpora that primarily support computationally intensive reranking tasks for decoder-only language models, the highly contrastive positive-negative triplets in InF-IR further enable efficient representation learning for smaller encoder-only models, facilitating direct embedding-based retrieval. Using this corpus, we train InF-Embed, an instruction-aware Embedding model optimized through contrastive learning and instruction-query attention mechanisms to align retrieval outcomes precisely with user intents. Extensive experiments across five instruction-based retrieval benchmarks demonstrate that InF-Embed significantly surpasses competitive baselines by 8.1% in p-MRR, measuring the instruction-following capabilities.
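As a rough illustration of how contrastive learning can use one positive and two hard negatives per example, the sketch below implements a generic InfoNCE-style loss over dot-product similarities. This is an assumption for illustration only: the abstract does not state InF-Embed's exact objective, and the embeddings, the temperature value, and the function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive's similarity.

    The loss is small when the anchor is much closer to the positive
    than to any of the negatives.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Hypothetical 2-d embeddings: an instruction-plus-query anchor, the matching
# passage, and two hard negatives built from altered instructions or queries.
anchor = [1.0, 0.0]
matching_passage = [1.0, 0.0]
hard_negatives = [[0.0, 1.0], [0.0, 1.0]]

print(info_nce(anchor, matching_passage, hard_negatives, temperature=1.0))
```

Hard negatives matter here because they sit close to the positive in embedding space, so they contribute large terms to the denominator and push the encoder to separate cases that differ only in the instruction.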

[284] Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication

Jocelyn Shen,Akhila Yerukola,Xuhui Zhou,Cynthia Breazeal,Maarten Sap,Hae Won Park

Main category: cs.CL

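TL;DR: The paper studies the detection of conversational breakdowns in close relationships, argues that existing NLP research overlooks relational dynamics, and proposes an LLM evaluation grounded in nonviolent communication theory.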

Details

Motivation: Existing conflict detection tasks usually ignore how relationship context shapes the perception of a conversation, yet breakdowns in close relationships are closely tied to personal histories and emotional context. Method: Use nonviolent communication theory to evaluate LLMs on detecting conversational breakdowns, build the PersonaConflicts Corpus (N=5,772 simulated dialogues), and use human annotation to study how relationship backstory affects conflict perception. Result: The polarity of relationship backstories significantly shifts human perception of communication breakdowns, but models struggle to make meaningful use of this context. Models also tend to overestimate how positively a message will make a listener feel. Conclusion: The study underscores the importance of personalization to relationship context if LLMs are to serve as effective mediators in human communication. Abstract: Conversational breakdowns in close relationships are deeply shaped by personal histories and emotional context, yet most NLP research treats conflict detection as a general task, overlooking the relational dynamics that influence how messages are perceived. In this work, we leverage nonviolent communication (NVC) theory to evaluate LLMs in detecting conversational breakdowns and assessing how relationship backstory influences both human and model perception of conflicts. Given the sensitivity and scarcity of real-world datasets featuring conflict between familiar social partners with rich personal backstories, we contribute the PersonaConflicts Corpus, a dataset of N=5,772 naturalistic simulated dialogues spanning diverse conflict scenarios between friends, family members, and romantic partners. Through a controlled human study, we annotate a subset of dialogues and obtain fine-grained labels of communication breakdown types on individual turns, and assess the impact of backstory on human and model perception of conflict in conversation. We find that the polarity of relationship backstories significantly shifted human perception of communication breakdowns and impressions of the social partners, yet models struggle to meaningfully leverage those backstories in the detection task. Additionally, we find that models consistently overestimate how positively a message will make a listener feel. Our findings underscore the critical role of personalization to relationship contexts in enabling LLMs to serve as effective mediators in human communication for authentic connection.

[285] Do LLMs Need to Think in One Language? Correlation between Latent Language and Task Performance

Shintaro Ozaki,Tatsuya Hiraoka,Hiroto Otake,Hiroki Ouchi,Masaru Isonuma,Benjamin Heinzerling,Kentaro Inui,Taro Watanabe,Yusuke Miyao,Yohei Oseki,Yu Takagi

Main category: cs.CL

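TL;DR: The study finds that consistency between a large language model's internal latent language and its input/output languages does not always affect downstream task performance, because the model adapts its internal representations near the final layers to match the target language.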

Details

Motivation: Investigate how the discrepancy between the latent language and the input/output languages affects downstream task performance, a question that existing work on latent language has largely left open. Method: Vary the input prompt language across several downstream tasks, analyze the correlation between latent language consistency and task performance, and build datasets covering domains such as translation and geo-culture. Result: Consistency in latent language is not always necessary for task performance, because models adapt their internal representations near the final layers to match the target language. Conclusion: Latent language consistency is not a decisive factor for downstream performance; models can adjust internally to meet the needs of the target language. Abstract: Large Language Models (LLMs) are known to process information using a proficient internal language consistently, referred to as latent language, which may differ from the input or output languages. However, how the discrepancy between the latent language and the input and output language affects downstream task performance remains largely unexplored. While many studies research the latent language of LLMs, few address its importance in influencing task performance. In our study, we hypothesize that thinking in latent language consistently enhances downstream task performance. To validate this, our work varies the input prompt languages across multiple downstream tasks and analyzes the correlation between consistency in latent language and task performance. We create datasets consisting of questions from diverse domains such as translation and geo-culture, which are influenced by the choice of latent language. Experimental results across multiple LLMs on translation and geo-culture tasks, which are sensitive to the choice of language, indicate that maintaining consistency in latent language is not always necessary for optimal downstream task performance. This is because these models adapt their internal representations near the final layers to match the target language, reducing the impact of consistency on overall performance.

[286] Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Zhanqiu Hu,Jian Meng,Yash Akhauri,Mohamed S. Abdelfattah,Jae-sun Seo,Zhiru Zhang,Udit Gupta

Main category: cs.CL

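TL;DR: The paper proposes two training-free techniques, FreeCache and Guided Diffusion, that substantially speed up diffusion language model inference, achieving up to a 34x speedup without loss of quality.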

Details

Motivation: Diffusion language models offer parallel generation and bidirectionality, but slow inference and token incoherence limit their use. Method: (1) FreeCache reduces computation by reusing stable KV projections across denoising steps; (2) Guided Diffusion uses a lightweight autoregressive model to supervise token unmasking, reducing the number of denoising iterations. Result: On open-source reasoning benchmarks, the combined methods deliver up to a 34x end-to-end speedup without compromising accuracy. Conclusion: For the first time, diffusion language models reach latency comparable to, or even faster than, autoregressive models, paving the way for broader application. Abstract: Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized Autoregressive (AR) Models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer from significant quality drops with decreasing denoising steps. We address these limitations with two training-free techniques. First, we propose FreeCache, a Key-Value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference. Second, we introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver up to a 34x end-to-end speedup without compromising accuracy. For the first time, diffusion language models achieve a comparable and even faster latency as the widely adopted autoregressive models. Our work successfully paved the way for scaling up the diffusion language model to a broader scope of applications across different domains.

[287] Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration

Zijun Liu,Zhennan Wan,Peng Li,Ming Yan,Ji Zhang,Fei Huang,Yang Liu

Main category: cs.CL

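TL;DR: ExtAgents is a multi-agent framework that addresses the bottleneck in inference-time knowledge integration for LLMs, improving performance without longer-context training.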

Details

Motivation: The context window of current LLMs limits how much external knowledge can be supplied, causing information loss and holding back performance on complex tasks. Method: Proposes ExtAgents, a multi-agent framework that processes large-scale input in a distributed manner and improves the knowledge synchronization and reasoning processes. Result: On ∞Bench+ and other test sets, ExtAgents significantly outperforms existing non-training methods while remaining efficient through high parallelism. Conclusion: ExtAgents offers an effective solution for multi-agent coordination with large-scale external knowledge input and has potential for real-world applications. Abstract: With the rapid advancement of post-training techniques for reasoning and information seeking, large language models (LLMs) can incorporate a large quantity of retrieved knowledge to solve complex tasks. However, the limited context window of LLMs obstructs scaling the amount of external knowledge input, prohibiting further improvement, especially for tasks requiring significant amount of external knowledge. Existing context window extension methods inevitably cause information loss. LLM-based multi-agent methods emerge as a new paradigm to handle massive input in a distributional manner, where we identify two core bottlenecks in existing knowledge synchronization and reasoning processes. In this work, we develop a multi-agent framework, ExtAgents, to overcome the bottlenecks and enable better scalability in inference-time knowledge integration without longer-context training. Benchmarked with our enhanced multi-hop question answering test, ∞Bench+, and other public test sets including long survey generation, ExtAgents significantly enhances the performance over existing non-training methods with the same amount of external knowledge input, regardless of whether it falls within or exceeds the context window. Moreover, the method maintains high efficiency due to high parallelism. Further study in the coordination of LLM agents on increasing external knowledge input could benefit real-world applications.

[288] Are Language Models Consequentialist or Deontological Moral Reasoners?

Keenan Samway,Max Kleiman-Weiner,David Guzman Piedrahita,Rada Mihalcea,Bernhard Schölkopf,Zhijing Jin

Main category: cs.CL

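TL;DR: Through a large-scale analysis of the moral reasoning traces of large language models (LLMs), the paper reveals their reasoning patterns in ethically complex scenarios: chains of thought tend to favor deontological principles, while post-hoc explanations shift toward consequentialist rationales.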

Details

Motivation: As AI systems are increasingly used in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments of LLMs rather than their reasoning process. Method: Uses over 600 distinct trolley problems as probes to analyze the moral reasoning traces of LLMs, and introduces a taxonomy to systematically classify their reasoning patterns. Result: LLM chains-of-thought tend to favor deontological principles, while post-hoc explanations lean more toward utilitarian rationales. Conclusion: The framework provides a foundation for understanding how LLMs process and articulate ethical considerations, supporting their safe and interpretable deployment in high-stakes decision-making environments. Abstract: As AI systems increasingly navigate applications in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments in large language models (LLMs), rather than their underlying moral reasoning process. In contrast, we focus on a large-scale analysis of the moral reasoning traces provided by LLMs. Furthermore, unlike prior work that attempted to draw inferences from only a handful of moral dilemmas, our study leverages over 600 distinct trolley problems as probes for revealing the reasoning patterns that emerge within different LLMs. We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology. Our analysis reveals that LLM chains-of-thought tend to favor deontological principles based on moral obligations, while post-hoc explanations shift notably toward consequentialist rationales that emphasize utility. Our framework provides a foundation for understanding how LLMs process and articulate ethical considerations, an important step toward safe and interpretable deployment of LLMs in high-stakes decision-making environments. Our code is available at https://github.com/keenansamway/moral-lens .

[289] UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

Han Xiao,Guozhi Wang,Yuxiang Chai,Zimu Lu,Weifeng Lin,Hao He,Lue Fan,Liuyang Bian,Rui Hu,Liang Liu,Shuai Ren,Yafei Wen,Xiaoxin Chen,Aojun Zhou,Hongsheng Li

Main category: cs.CL

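TL;DR: UI-Genie is a self-improving framework that uses a reward model and data generation strategies to address trajectory verification and the shortage of high-quality training data for GUI agents, achieving state-of-the-art performance on multiple benchmarks.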

Details

Motivation: Addresses two major challenges in GUI agents: trajectory outcomes are hard to verify, and high-quality training data is hard to scale. Method: Uses a reward model with an image-text interleaved architecture (UI-Genie-RM) and a self-improvement pipeline, generating data through rule-based verification, trajectory corruption, and hard negative mining. Result: UI-Genie achieves state-of-the-art performance on multiple GUI agent benchmarks, and the framework and datasets are open-sourced to support further research. Conclusion: Through its self-improving framework and high-quality data generation, UI-Genie significantly improves the performance and scalability of GUI agents. Abstract: In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research at https://github.com/Euphoria16/UI-Genie.

[290] Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making

Yihan Wang,Qiao Yan,Zhenghao Xing,Lihao Liu,Junjun He,Chi-Wing Fu,Xiaowei Hu,Pheng-Ann Heng

Main category: cs.CL

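TL;DR: The paper proposes the Catfish Agent, a role-specialized LLM that injects structured dissent to counter the Silent Agreement problem in multi-agent frameworks, improving the accuracy of clinical question answering.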

Details

Motivation: In multi-agent frameworks, agents tend to reach consensus prematurely on complex or ambiguous cases (Silent Agreement) without critical analysis, which hurts diagnostic accuracy. Method: Introduces the Catfish Agent, which stimulates deeper reasoning through two mechanisms: a complexity-aware intervention and a tone-calibrated intervention. Result: On nine medical Q&A and three medical VQA benchmarks, the approach outperforms single- and multi-agent LLM frameworks, including leading commercial models such as GPT-4o and DeepSeek-R1. Conclusion: The Catfish Agent effectively counters Silent Agreement and significantly improves the accuracy and reasoning depth of clinical question answering. Abstract: Large language models (LLMs) have demonstrated strong potential in clinical question answering, with recent multi-agent frameworks further improving diagnostic accuracy via collaborative reasoning. However, we identify a recurring issue of Silent Agreement, where agents prematurely converge on diagnoses without sufficient critical analysis, particularly in complex or ambiguous cases. We present a new concept called Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement. Inspired by the "catfish effect" in organizational psychology, the Catfish Agent is designed to challenge emerging consensus to stimulate deeper reasoning. We formulate two mechanisms to encourage effective and context-aware interventions: (i) a complexity-aware intervention that modulates agent engagement based on case difficulty, and (ii) a tone-calibrated intervention articulated to balance critique and collaboration. Evaluations on nine medical Q&A and three medical VQA benchmarks show that our approach consistently outperforms both single- and multi-agent LLM frameworks, including leading commercial models such as GPT-4o and DeepSeek-R1.

[291] How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective

Shimao Zhang,Zhejian Lai,Xiang Liu,Shuaijie She,Xiao Liu,Yeyun Gong,Shujian Huang,Jiajun Chen

Main category: cs.CL

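TL;DR: The paper proposes a finer-grained neuron identification algorithm that detects language neurons (both language-specific and language-related) and language-agnostic neurons, and, based on how these neurons are distributed, divides the multilingual inference process of LLMs into four parts.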

Details

Motivation: To study how multilingual alignment enhances LLMs' multilingual capabilities by transferring capabilities from high-resource to low-resource languages, and to analyze LLM mechanisms from the perspective of language-specific neurons. Method: Proposes a new neuron identification algorithm that detects language neurons and language-agnostic neurons, and partitions the multilingual inference process of LLMs based on their distributional characteristics. Result: Systematically analyzes neuron changes before and after alignment, analyzes the phenomenon of "Spontaneous Multilingual Alignment", and provides empirical results and insights into LLMs' multilingual capabilities. Conclusion: A comprehensive analysis across neuron types offers new understanding of, and empirical support for, multilingual alignment and the multilingual capabilities of LLMs. Abstract: Multilingual Alignment is an effective and representative paradigm to enhance LLMs' multilingual capabilities, which transfers the capabilities from the high-resource languages to the low-resource languages. Meanwhile, some research on language-specific neurons reveals that there are language-specific neurons that are selectively activated in LLMs when processing different languages. This provides a new perspective to analyze and understand LLMs' mechanisms more specifically in multilingual scenarios. In this work, we propose a new finer-grained neuron identification algorithm, which detects language neurons (including language-specific neurons and language-related neurons) and language-agnostic neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of "Spontaneous Multilingual Alignment". Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights for better understanding multilingual alignment and multilingual capabilities of LLMs.

cs.DL [Back]

[292] Leveraging GANs for citation intent classification and its impact on citation network analysis

Davi A. Bezerra,Filipi N. Silva,Diego R. Amancio

Main category: cs.DL

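TL;DR: The paper proposes a GAN-based method for classifying citation intent that performs on par with existing techniques while using fewer parameters. It also finds that filtering by citation intent changes the centrality of papers in citation networks.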

Details

Motivation: Citations play an important role in scientometrics, but different citation intents serve different functions. Understanding citation intent allows a more nuanced interpretation of scientific impact. Method: Uses a GAN-based method combined with contextual embeddings to classify citation intents. Result: The method performs strongly on the classification task with fewer parameters; filtering by citation intent significantly affects paper centrality in the citation network. Conclusion: GAN architectures combined with contextual embeddings are effective and efficient for citation intent classification, and citation intent has a significant effect on paper centrality. Abstract: Citations play a fundamental role in the scientific ecosystem, serving as a foundation for tracking the flow of knowledge, acknowledging prior work, and assessing scholarly influence. In scientometrics, they are also central to the construction of quantitative indicators. Not all citations, however, serve the same function: some provide background, others introduce methods, or compare results. Therefore, understanding citation intent allows for a more nuanced interpretation of scientific impact. In this paper, we adopted a GAN-based method to classify citation intents. Our results revealed that the proposed method achieves competitive classification performance, closely matching state-of-the-art results with substantially fewer parameters. This demonstrates the effectiveness and efficiency of leveraging GAN architectures combined with contextual embeddings in the intent classification task. We also investigated whether filtering citation intents affects the centrality of papers in citation networks. Analyzing the network constructed from the unArXiv dataset, we found that paper rankings can be significantly influenced by citation intent. All four centrality metrics examined (degree, PageRank, closeness, and betweenness) were sensitive to the filtering of citation types. The betweenness centrality displayed the greatest sensitivity, showing substantial changes in ranking when specific citation intents were removed.
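
The effect of intent filtering on centrality can be illustrated with a minimal sketch. It assumes a tiny hand-made edge list (the paper names and intent labels below are hypothetical, not taken from the dataset) and uses in-degree only, the simplest of the four metrics the abstract mentions.

```python
from collections import Counter

# Hypothetical toy citation edges: (citing paper, cited paper, intent label).
EDGES = [
    ("A", "C", "background"),
    ("B", "C", "background"),
    ("D", "C", "background"),
    ("F", "C", "method"),
    ("A", "E", "method"),
    ("B", "E", "method"),
    ("D", "E", "result"),
]

def in_degree_ranking(edges, drop_intent=None):
    """Rank cited papers by in-degree, optionally dropping one intent."""
    counts = Counter(
        cited for _, cited, intent in edges if intent != drop_intent
    )
    # Sort by descending count, then by name, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

full = in_degree_ranking(EDGES)
filtered = in_degree_ranking(EDGES, drop_intent="background")
print(full)      # ranking with all citations
print(filtered)  # ranking after removing background citations
```

In this toy graph, paper C leads the full ranking but falls behind E once background citations are removed, which is the kind of rank shift the abstract reports at scale.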

cs.RO [Back]

[293] Vision-Based Risk Aware Emergency Landing for UAVs in Complex Urban Environments

Julio de la Torre-Vanegas,Miguel Soriano-Garcia,Israel Becerra,Diego Mercado-Ravell

Main category: cs.RO

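TL;DR: Proposes a risk-aware method based on semantic segmentation for safe UAV landing in complex urban environments, with a landing success rate above 90%.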

Details

Motivation: To address the challenge of landing UAVs safely in crowded urban environments during emergencies. Method: Uses a deep neural network for pixel-level risk assessment, combined with a risk-map algorithm that dynamically identifies a safe landing zone, and a control system that guides the UAV to it. Result: Validated in diverse urban environments, with a landing success rate above 90% and significant improvements in risk metrics. Conclusion: Risk-oriented vision methods can effectively reduce accident risk in emergency landings and improve UAV operation in complex urban environments. Abstract: Landing safely in crowded urban environments remains an essential yet challenging endeavor for Unmanned Aerial Vehicles (UAVs), especially in emergency situations. In this work, we propose a risk-aware approach that harnesses semantic segmentation to continuously evaluate potential hazards in the drone's field of view. By using a specialized deep neural network to assign pixel-level risk values and applying an algorithm based on risk maps, our method adaptively identifies a stable Safe Landing Zone (SLZ) despite moving critical obstacles such as vehicles, people, etc., and other visual challenges like shifting illumination. A control system then guides the UAV toward this low-risk region, employing altitude-dependent safety thresholds and temporal landing point stabilization to ensure robust descent trajectories. Experimental validation in diverse urban environments demonstrates the effectiveness of our approach, achieving over 90% landing success rates in very challenging real scenarios, showing significant improvements in various risk metrics. Our findings suggest that risk-oriented vision methods can effectively help reduce the risk of accidents in emergency landing situations, particularly in complex, unstructured, urban scenarios, densely populated with moving risky obstacles, while potentiating the true capabilities of UAVs in complex urban operations.

[294] Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review

Matthew Lisondra,Beno Benhabib,Goldie Nejat

Main category: cs.RO

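TL;DR: A systematic review of foundation models in mobile service robotics, covering their role in addressing challenges such as multimodal perception, real-time decision-making, and task generalization, and outlining future research directions.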

Details

Motivation: Rapid progress in foundation models opens new opportunities for embodied AI in mobile service robots, but challenges such as multimodal sensor fusion and real-time decision-making remain. Method: A systematic review analyzing the use of foundation models in multimodal sensor fusion, language-conditioned control, and adaptive task execution. Result: Foundation models show transformative potential in domestic assistance, healthcare, and service automation. Conclusion: Future research should focus on predictive scaling laws, autonomous long-term adaptation, and cross-embodiment generalization to enable efficient deployment of foundation models in robotic systems. Abstract: Rapid advancements in foundation models, including Large Language Models, Vision-Language Models, Multimodal Large Language Models, and Vision-Language-Action Models have opened new avenues for embodied AI in mobile service robotics. By combining foundation models with the principles of embodied AI, where intelligent systems perceive, reason, and act through physical interactions, robots can better understand, adapt to, and execute complex tasks in dynamic real-world environments. However, embodied AI in mobile service robots continues to face key challenges, including multimodal sensor fusion, real-time decision-making under uncertainty, task generalization, and effective human-robot interactions (HRI). In this paper, we present the first systematic review of the integration of foundation models in mobile service robotics, identifying key open challenges in embodied AI and examining how foundation models can address them. Namely, we explore the role of such models in enabling real-time sensor fusion, language-conditioned control, and adaptive task execution. Furthermore, we discuss real-world applications in the domestic assistance, healthcare, and service automation sectors, demonstrating the transformative impact of foundation models on service robotics. We also include potential future research directions, emphasizing the need for predictive scaling laws, autonomous long-term adaptation, and cross-embodiment generalization to enable scalable, efficient, and robust deployment of foundation models in human-centric robotic systems.

[295] Spatial RoboGrasp: Generalized Robotic Grasping Control Policy

Yiqi Huang,Travis Davies,Jiahuan Yan,Jiankai Sun,Xiang Chen,Luhui Hu

Main category: cs.RO

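TL;DR: Proposes a unified framework that couples multimodal perception with grasp prediction, substantially improving grasp and task success rates for robots in diverse environments.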

Details

Motivation: To address the limited generalization and precision of robotic manipulation across diverse environments, in particular the limitations of spatial perception. Method: Fuses domain-randomized augmentation, monocular depth estimation, and a depth-aware 6-DoF Grasp Prompt into a unified spatial representation, and generates action sequences with a diffusion-based policy. Result: Up to 40% improvement in grasp success and 45% higher task success, with strong performance under environmental variation. Conclusion: Combining spatial perception with diffusion-based imitation learning offers a scalable and robust solution for general-purpose robotic grasping. Abstract: Achieving generalizable and precise robotic manipulation across diverse environments remains a critical challenge, largely due to limitations in spatial perception. While prior imitation-learning approaches have made progress, their reliance on raw RGB inputs and handcrafted features often leads to overfitting and poor 3D reasoning under varied lighting, occlusion, and object conditions. In this paper, we propose a unified framework that couples robust multimodal perception with reliable grasp prediction. Our architecture fuses domain-randomized augmentation, monocular depth estimation, and a depth-aware 6-DoF Grasp Prompt into a single spatial representation for downstream action planning. Conditioned on this encoding and a high-level task prompt, our diffusion-based policy yields precise action sequences, achieving up to 40% improvement in grasp success and 45% higher task success rates under environmental variation. These results demonstrate that spatially grounded perception, paired with diffusion-based imitation learning, offers a scalable and robust solution for general-purpose robotic grasping.

[296] Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning

Nikos Giannakakis,Argyris Manetas,Panagiotis P. Filntisis,Petros Maragos,George Retsinas

Main category: cs.RO

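TL;DR: The paper proposes an object-centric encoder that couples semantic segmentation with visual representation generation, using the Slot Attention mechanism and the pretrained SOLV model, fine-tuned on human action video data, to improve performance on robotic tasks.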

Details

Motivation: Inspired by the object-based way humans process scenes, the work aims to make robot visuo-motor policy learning more efficient by coupling semantic segmentation with visual representation generation. Method: Uses the Slot Attention mechanism and the pretrained SOLV model, fine-tuned on human action video data, and validates the approach with reinforcement and imitation learning. Result: Experiments show that the coupled approach significantly improves robotic task performance, and that fine-tuning out-of-domain pretrained models on human action data reduces reliance on annotated or robot-specific data. Conclusion: The approach shows the potential of building on existing visual encoders to accelerate training and improve generalization, offering an efficient route for robot learning. Abstract: Learning visual representations from observing actions to benefit robot visuo-motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by psychological theories suggesting that humans process scenes in an object-based fashion, we propose an object-centric encoder that performs semantic segmentation and visual representation generation in a coupled manner, unlike other works, which treat these as separate processes. To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained on large out-of-domain datasets, to bootstrap fine-tuning on human action video data. Through simulated robotic tasks, we demonstrate that visual representations can enhance reinforcement and imitation learning training, highlighting the effectiveness of our integrated approach for semantic segmentation and encoding. Furthermore, we show that exploiting models pretrained on out-of-domain datasets can benefit this process, and that fine-tuning on datasets depicting human actions -- although still out-of-domain -- can significantly improve performance due to close alignment with robotic tasks. These findings show the capability to reduce reliance on annotated or robot-specific action datasets and the potential to build on existing visual encoders to accelerate training and improve generalizability.

cs.LG [Back]

[297] FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation

Dong Liu,Jiayi Zhang,Yifan Li,Yanxuan Yu,Ben Lengerich,Ying Nian Wu

Main category: cs.LG

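TL;DR: FastCache accelerates DiT inference with a hidden-state-level caching and compression framework that reduces redundant computation while preserving generation quality.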

Details

Motivation: DiT models are computationally intensive; FastCache aims to improve efficiency by exploiting redundancy in their internal representations. Method: Uses a spatial-aware token selection mechanism and a transformer-level cache to cut unnecessary computation. Result: Experiments show that FastCache substantially reduces latency and memory usage, with better generation quality than other caching methods. Conclusion: FastCache is an efficient DiT acceleration method whose effectiveness is supported by both theoretical analysis and empirical results. Abstract: Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model's internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with the best generation output quality compared to other cache methods, as measured by FID and t-FID. Code implementation of FastCache is available on GitHub at https://github.com/NoakLiu/FastCache-xDiT.
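
The idea of reusing activations across timesteps when the input has barely changed can be sketched as follows. This is a simplified illustration, not the FastCache implementation: it assumes a plain relative-change threshold (the `threshold` value and the `CachedBlock` class are hypothetical) in place of the paper's hypothesis-testing rule, and it returns the cached output directly rather than applying a learnable linear approximation.

```python
import math

def relative_change(current, previous):
    """Relative L2 change between two hidden-state vectors."""
    diff = math.sqrt(sum((c - p) ** 2 for c, p in zip(current, previous)))
    base = math.sqrt(sum(p ** 2 for p in previous))
    return diff / base if base > 0 else float("inf")

class CachedBlock:
    """Wraps an expensive block and reuses its output across timesteps."""

    def __init__(self, block, threshold=0.05):
        self.block = block          # the expensive function to skip
        self.threshold = threshold  # hypothetical tolerance, not from the paper
        self.last_input = None
        self.last_output = None
        self.calls = 0              # how many times the block actually ran

    def __call__(self, hidden):
        if (self.last_input is not None
                and relative_change(hidden, self.last_input) < self.threshold):
            return self.last_output  # cache hit: skip the computation
        self.calls += 1
        self.last_input = list(hidden)
        self.last_output = self.block(hidden)
        return self.last_output

# Stand-in for a transformer block: doubles every entry.
block = CachedBlock(lambda h: [2.0 * x for x in h])
out1 = block([1.0, 1.0])    # first call, computed
out2 = block([1.01, 1.0])   # tiny change, cached output reused
out3 = block([2.0, 1.0])    # large change, recomputed
print(block.calls)
```

The change is measured against the last input that was actually computed, so small drifts cannot accumulate unnoticed over many consecutive cache hits.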

[298] HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

Haoran Li,Yingjie Qin,Baoyuan Ou,Lai Xu,Ruiwen Xu

Main category: cs.LG

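TL;DR: The paper proposes HoPE, a hybrid position embedding that improves the performance of vision-language models in long-context scenarios and addresses the shortcomings of existing methods on long-video tasks.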

Details

Motivation: Existing vision-language models degrade in long contexts such as long videos, and existing RoPE variants lack theoretical grounding and fail to capture spatial-temporal dependencies effectively. Method: Proposes HoPE, which combines a hybrid frequency allocation strategy with a dynamic temporal scaling mechanism to improve semantic modeling over long contexts. Result: On four video benchmarks, HoPE outperforms existing methods on long video understanding and retrieval tasks. Conclusion: HoPE effectively improves the long-context performance of vision-language models and has both theoretical and practical value. Abstract: Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long context, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.
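
For readers unfamiliar with the vanilla RoPE that HoPE builds on, the following minimal sketch shows the standard one-dimensional formulation, not HoPE's hybrid frequency allocation. The vectors `q` and `k` and the chosen positions are arbitrary illustrative values; the point is that the attention score depends only on the relative offset between positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Apply vanilla rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair of entries gets its own rotation frequency.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]   # arbitrary query vector
k = [0.2, -1.0, 0.7, 0.4]   # arbitrary key vector

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(abs(score_near - score_far))  # differs only by floating-point error
```

Multimodal variants of the kind the abstract discusses assign different subsets of these frequencies to the temporal and spatial axes; how that allocation is chosen is what HoPE revisits.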

[299] Avoid Forgetting by Preserving Global Knowledge Gradients in Federated Learning with Non-IID Data

Abhijit Chunduru,Majid Morafah,Mahdi Morafah,Vishnu Pandi Chellapandi,Ang Li

Main category: cs.LG

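TL;DR: The paper experimentally analyzes how data heterogeneity affects the global decision boundary in federated learning and proposes FedProj, a framework that uses a server-side knowledge transfer loss and an episodic memory on a public dataset to significantly improve performance.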

Details

Motivation: Existing methods lack a deep understanding of how data heterogeneity affects the global decision boundary, and clients end up forgetting that boundary. Method: Proposes the FedProj framework, which includes a server-side ensemble knowledge transfer loss and an episodic memory on a public dataset that regulates gradient updates. Result: FedProj outperforms existing methods by a large margin in experiments. Conclusion: FedProj effectively addresses forgetting of the global decision boundary and improves the robustness of federated learning. Abstract: The inevitable presence of data heterogeneity has made federated learning very challenging. There are numerous methods to deal with this issue, such as local regularization, better model fusion techniques, and data sharing. Though effective, they lack a deep understanding of how data heterogeneity can affect the global decision boundary. In this paper, we bridge this gap by performing an experimental analysis of the learned decision boundary using a toy example. Our observations are surprising: (1) we find that the existing methods suffer from forgetting and clients forget the global decision boundary and only learn the perfect local one, and (2) this happens regardless of the initial weights, and clients forget the global decision boundary even starting from pre-trained optimal weights. In this paper, we present FedProj, a federated learning framework that robustly learns the global decision boundary and avoids its forgetting during local training. To achieve better ensemble knowledge fusion, we design a novel server-side ensemble knowledge transfer loss to further calibrate the learned global decision boundary. To alleviate the issue of learned global decision boundary forgetting, we further propose leveraging an episodic memory of average ensemble logits on a public unlabeled dataset to regulate the gradient updates at each step of local training. Experimental results demonstrate that FedProj outperforms state-of-the-art methods by a large margin.

[300] Bi-Level Unsupervised Feature Selection

Jingjing Liu,Xiansen Ju,Xianchao Xiu,Wanquan Liu

Main category: cs.LG

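TL;DR: Proposes a new bi-level unsupervised feature selection method (BLUFS) that combines a clustering level and a feature level, using spectral clustering and an ℓ₂,₀-norm constraint to significantly improve performance.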

Details

Motivation: Existing unsupervised feature selection methods usually model from a single perspective and cannot simultaneously evaluate feature importance and preserve the data structure, which limits performance. Method: BLUFS comprises a clustering level (spectral clustering to generate pseudo-labels and linear regression to learn the projection matrix) and a feature level (an ℓ₂,₀-norm constraint), solved with an efficient PAM algorithm. Result: Experiments on synthetic and real datasets show that BLUFS performs better in clustering and classification tasks. Conclusion: With its bi-level framework and ℓ₂,₀-norm constraint, BLUFS significantly improves unsupervised feature selection. Abstract: Unsupervised feature selection (UFS) is an important task in data engineering. However, most UFS methods construct models from a single perspective and often fail to simultaneously evaluate feature importance and preserve their inherent data structure, thus limiting their performance. To address this challenge, we propose a novel bi-level unsupervised feature selection (BLUFS) method, including a clustering level and a feature level. Specifically, at the clustering level, spectral clustering is used to generate pseudo-labels for representing the data structure, while a continuous linear regression model is developed to learn the projection matrix. At the feature level, the $\ell_{2,0}$-norm constraint is imposed on the projection matrix for more effectively selecting features. To the best of our knowledge, this is the first work to combine a bi-level framework with the $\ell_{2,0}$-norm. To solve the proposed bi-level model, we design an efficient proximal alternating minimization (PAM) algorithm, whose subproblems either have explicit solutions or can be computed by fast solvers. Furthermore, we establish the convergence result and computational complexity. Finally, extensive experiments on two synthetic datasets and eight real datasets demonstrate the superiority of BLUFS in clustering and classification tasks.
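
To make the $\ell_{2,0}$-norm constraint concrete: the $\ell_{2,0}$-norm of a matrix counts its nonzero rows, so constraining it to at most k keeps at most k features. The sketch below shows the standard Euclidean projection onto that constraint set (keep the k rows with the largest ℓ₂ norm), assuming each row of the projection matrix corresponds to one feature. It is a generic illustration with a made-up matrix `W`, not the paper's PAM algorithm.

```python
import math

def row_norms(matrix):
    """Euclidean norm of each row."""
    return [math.sqrt(sum(v * v for v in row)) for row in matrix]

def l20_norm(matrix):
    """Number of rows that are not entirely zero."""
    return sum(1 for n in row_norms(matrix) if n > 0)

def project_l20(matrix, k):
    """Keep the k rows with the largest norm and zero out the rest."""
    norms = row_norms(matrix)
    keep = set(sorted(range(len(matrix)), key=lambda i: -norms[i])[:k])
    return [list(row) if i in keep else [0.0] * len(row)
            for i, row in enumerate(matrix)]

# Hypothetical 4-feature projection matrix (one row per feature).
W = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.0], [1.0, -1.0]]
W_sparse = project_l20(W, k=2)
selected = [i for i, n in enumerate(row_norms(W_sparse)) if n > 0]
print(selected)  # indices of the features that survive
```

Because the constraint fixes the number of nonzero rows directly, the number of selected features is set exactly by k rather than tuned indirectly through a regularization weight.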

[301] Detecting Informative Channels: ActionFormer

Kunpeng Zhao,Asahi Miyazaki,Tsuyoshi Okita

Main category: cs.LG

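TL;DR: The paper proposes a modified ActionFormer model for human activity recognition (HAR) from sensor signals, improving performance through a Sequence-and-Excitation strategy and the swish activation function.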

Details

Motivation: Conventional Transformer models for HAR struggle to capture high temporal dynamics and the dependencies between spatial and temporal features, which limits their ability to recognize subtle changes. Method: The modified ActionFormer adopts a Sequence-and-Excitation strategy to limit additional parameters and uses the swish activation function to retain information in the negative range. Result: On the WEAR dataset, the modified model improves average mAP on inertial data by 16.01%. Conclusion: The modified ActionFormer substantially improves HAR performance on sensor signals, confirming its effectiveness. Abstract: Human Activity Recognition (HAR) has recently witnessed advancements with Transformer-based models. In particular, ActionFormer offers a new perspective for HAR in that it provides additional outputs that detect the boundaries of activities as well as the activity labels. ActionFormer was originally proposed with image/video input, but it has since been adapted to sensor signals as well. We analyze this extensively in terms of deep learning architectures. Based on reports that high temporal dynamics limit the model's ability to capture subtle changes effectively, and on the interdependencies between the spatial and temporal features, we propose a modified ActionFormer that reduces these defects for sensor signals. The key to our approach is to follow the Sequence-and-Excitation strategy to minimize the increase in additional parameters and to opt for the swish activation function to retain the information about direction in the negative range. Experiments on the WEAR dataset show that our method achieves a substantial improvement of 16.01% in terms of average mAP for inertial data.
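
The remark about retaining information in the negative range can be seen from the definition swish(x) = x · sigmoid(x). The minimal sketch below compares it with ReLU on a few arbitrary sample inputs; it illustrates the activation function only and says nothing about the rest of the modified architecture.

```python
import math

def relu(x):
    """ReLU maps every negative input to zero."""
    return max(0.0, x)

def swish(x):
    """swish(x) = x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))

values = [-2.0, -0.5, 0.0, 1.0]   # arbitrary sample inputs
relu_out = [relu(v) for v in values]
swish_out = [swish(v) for v in values]
print(relu_out)
print(swish_out)
```

ReLU collapses both negative inputs to zero, while swish returns small negative values for them, so the sign of the input is still visible to later layers.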

[302] Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction

Yifei Wang,Weimin Bai,Colin Zhang,Debing Zhang,Weijian Luo,He Sun

Main category: cs.LG

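TL;DR: Uni-Instruct unifies more than 10 one-step diffusion distillation methods under a diffusion expansion theory of the f-divergence family, overcomes the intractability of the original expanded f-divergence to enable efficient training of one-step diffusion models, and achieves state-of-the-art performance on several benchmarks.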

Details

Motivation: To unify existing one-step diffusion distillation methods in a theory-driven framework, and to resolve the intractability of the f-divergence family using the diffusion expansion theory. Method: Proposes the Uni-Instruct framework, which trains one-step diffusion models with an equivalent yet tractable loss that minimizes the expanded f-divergence family. Result: Achieves record FID values on the CIFAR10 and ImageNet-64×64 benchmarks (1.46 and 1.02), and slightly outperforms existing methods on text-to-3D generation. Conclusion: The theoretical and empirical contributions of Uni-Instruct provide an important foundation for future research on one-step diffusion distillation and knowledge transfer in diffusion models. Abstract: In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, $f$-distill, etc, inside a theory-driven framework which we name the \textbf{\emph{Uni-Instruct}}. Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. Then we introduce key theories that overcome the intractability issue of the original expanded $f$-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded $f$-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performances. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Frechet Inception Distance (FID) values of \textbf{\emph{1.46}} for unconditional generation and \textbf{\emph{1.38}} for conditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of \textbf{\emph{1.02}}, which outperforms its 79-step teacher diffusion with a significant improvement margin of 1.33 (1.02 vs 2.35). We also apply Uni-Instruct on broader tasks like text-to-3D generation. For text-to-3D generation, Uni-Instruct gives decent results, which slightly outperforms previous methods, such as SDS and VSD, in terms of both generation quality and diversity. Both the solid theoretical and empirical contributions of Uni-Instruct will potentially help future studies on one-step diffusion distillation and knowledge transfer of diffusion models.

[303] Leaner Transformers: More Heads, Less Depth

Hemanth Saratchandran,Damien Teney,Simon Lucey

Main category: cs.LG

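TL;DR: The paper challenges the "bigger is better" view of Transformer models. A theoretical analysis finds that a main role of multi-head attention is to improve the conditioning of the attention block, so the number of heads can be increased and the depth reduced, cutting parameters by 30-50% while maintaining accuracy.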

Details

Motivation: Existing Transformer models may be unnecessarily large; the authors use theoretical analysis to redefine the role of multi-head attention and improve model design. Method: Derives a theoretical principle and redesigns popular architectures with more heads and less depth, reducing the parameter count. Result: Across a variety of tasks (computer vision, language modeling, and others), the parameter count is reduced by 30-50% while accuracy is maintained. Conclusion: Redesigning multi-head attention can significantly reduce model parameters without sacrificing performance, challenging the conventional "bigger is better" view. Abstract: Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means better", leading to ever-increasing model sizes. This paper challenges this ideology by showing that many existing transformers might be unnecessarily oversized. We discover a theoretical principle that redefines the role of multi-head attention. An important benefit of the multiple heads is in improving the conditioning of the attention block. We exploit this theoretical insight and redesign popular architectures with an increased number of heads. The improvement in the conditioning proves so significant in practice that model depth can be decreased, reducing the parameter count by up to 30-50% while maintaining accuracy. We obtain consistent benefits across a variety of transformer-based architectures of various scales, on tasks in computer vision (ImageNet-1k) as well as language and sequence modeling (GLUE benchmark, TinyStories, and the Long-Range Arena benchmark).

[304] Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning

Shenao Zhang,Yaqing Wang,Yinxiao Liu,Tianqi Liu,Peter Grabowski,Eugene Ie,Zhaoran Wang,Yunxuan Li

Main category: cs.LG

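TL;DR: The paper proposes a method based on Bayes-Adaptive reinforcement learning (BARL) that addresses the lack of reflective exploration in conventional Markovian RL and performs better at test time.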

Details

Motivation: Conventional Markovian reinforcement learning (RL) confines exploration to the training phase and depends on the history context only through the current state, so it is unclear whether reflective reasoning emerges naturally during training or why it helps at test time. Method: Within the Bayes-Adaptive RL framework, reflective exploration is recast as optimizing the expected return under a posterior distribution over Markov decision processes, which incentivizes both reward-maximizing exploitation and information-gathering exploration. Result: The BARL algorithm outperforms conventional Markovian RL on synthetic and mathematical reasoning tasks, with higher token efficiency and more effective exploration. Conclusion: BARL gives LLMs guidance for switching strategies based on observed outcomes, demonstrating the effectiveness of the Bayesian framework for reflective exploration. Abstract: Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as backtracking and error correction. However, conventional Markovian RL confines exploration to the training phase to learn an optimal deterministic policy and depends on the history contexts only through the current state. Therefore, it remains unclear whether reflective reasoning will emerge during Markovian RL training, or why it is beneficial at test time. To remedy this, we recast reflective exploration within the Bayes-Adaptive RL framework, which explicitly optimizes the expected return under a posterior distribution over Markov decision processes. This Bayesian formulation inherently incentivizes both reward-maximizing exploitation and information-gathering exploration via belief updates. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms standard Markovian RL approaches at test time, achieving superior token efficiency with improved exploration effectiveness. Our code is available at https://github.com/shenao-zhang/BARL.

[305] Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling

Yichuan Cao,Yibo Miao,Xiao-Shan Gao,Yinpeng Dong

Main category: cs.LG

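TL;DR: Proposes Rule-based Preference modeling Guided Red-Teaming (RPG-RT), which iteratively modifies prompts and uses feedback to adapt dynamically to unknown defense mechanisms, addressing the limitations of existing black-box methods.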

Details

Motivation: Evaluating the safety of text-to-image (T2I) models is vital, but existing methods are limited by their reliance on internal access or on knowledge of specific defense mechanisms. Method: RPG-RT uses an LLM to iteratively modify prompts and fine-tunes it on feedback from the T2I system, guided by rule-based preference modeling, so that it adapts dynamically to defense mechanisms. Result: Experiments on nineteen T2I systems, three commercial API services, and T2V models verify the method's superiority and practicality. Conclusion: RPG-RT effectively handles unknown defense mechanisms and offers a practical tool for evaluating the safety of T2I models. Abstract: Text-to-image (T2I) models raise ethical and safety concerns due to their potential to generate inappropriate or harmful images. Evaluating these models' security through red-teaming is vital, yet white-box approaches are limited by their need for internal access, complicating their use with closed-source models. Moreover, existing black-box methods often assume knowledge about the model's specific defense mechanisms, limiting their utility in real-world commercial API scenarios. A significant challenge is how to evade unknown and diverse defense mechanisms. To overcome this difficulty, we propose a novel Rule-based Preference modeling Guided Red-Teaming (RPG-RT), which iteratively employs an LLM to modify prompts to query and leverages feedback from T2I systems for fine-tuning the LLM. RPG-RT treats the feedback from each iteration as a prior, enabling the LLM to dynamically adapt to unknown defense mechanisms. Given that the feedback is often labeled and coarse-grained, making it difficult to utilize directly, we further propose rule-based preference modeling, which employs a set of rules to evaluate desired or undesired feedback, facilitating finer-grained control over the LLM's dynamic adaptation process. Extensive experiments on nineteen T2I systems with varied safety mechanisms, three online commercial API services, and T2V models verify the superiority and practicality of our approach.

[306] Learning Single Index Models with Diffusion Priors

Anqi Tang,Youming Chen,Shuchen Xue,Zhaoqiang Liu

Main category: cs.LG

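TL;DR: This paper proposes a diffusion model (DM)-based signal recovery method for semi-parametric single index models that handles discontinuous or unknown link functions in nonlinear measurement models. The method requires only one round of unconditional sampling and partial DM inversion; experiments show more accurate reconstructions at lower computational cost.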

Details

Motivation: Existing diffusion-based signal recovery methods either focus on specific reconstruction problems or cannot handle nonlinear measurement models with discontinuous or unknown link functions. This paper aims to fill that gap. Method: An efficient recovery method for semi-parametric single index models that requires only one round of unconditional sampling and partial inversion of the diffusion model. Result: Theoretical analysis and numerical experiments on image datasets show more accurate reconstructions with significantly fewer neural function evaluations. Conclusion: The method offers an efficient and accurate approach to signal recovery under nonlinear measurement models. Abstract: Diffusion models (DMs) have demonstrated remarkable ability to generate diverse and high-quality images by efficiently modeling complex data distributions. They have also been explored as powerful generative priors for signal recovery, resulting in a substantial improvement in the quality of reconstructed signals. However, existing research on signal recovery with diffusion models either focuses on specific reconstruction problems or is unable to handle nonlinear measurement models with discontinuous or unknown link functions. In this work, we focus on using DMs to achieve accurate recovery from semi-parametric single index models, which encompass a variety of popular nonlinear models that may have discontinuous and unknown link functions. We propose an efficient reconstruction method that only requires one round of unconditional sampling and (partial) inversion of DMs. Theoretical analysis of the effectiveness of the proposed method has been established under appropriate conditions. We perform numerical experiments on image datasets for different nonlinear measurement models. We observe that compared to competing methods, our approach can yield more accurate reconstructions while utilizing significantly fewer neural function evaluations.

[307] How Do Transformers Learn Variable Binding in Symbolic Programs?

Yiwei Wu,Atticus Geiger,Raphaël Millière

Main category: cs.LG

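TL;DR: The paper studies how Transformers learn variable binding without built-in binding operations, identifies three phases during training, and finds that the model uses the residual stream as an addressable memory space.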

Details

Motivation: Explore how modern neural networks achieve variable binding without explicit binding operations, in order to bridge connectionist and symbolic approaches. Method: Train a Transformer to dereference variables in symbolic programs, analyze the three phases of its training, and use causal interventions to uncover the underlying mechanism. Result: The model dynamically tracks variable bindings through the residual stream and specialized attention heads, yielding accurate dereferencing. Conclusion: Transformers can learn systematic variable binding without explicit architectural support. Abstract: Variable binding -- the ability to associate variables with values -- is fundamental to symbolic computation and cognition. Although classical architectures typically implement variable binding via addressable memory, it is not well understood how modern neural networks lacking built-in binding operations may acquire this capacity. We investigate this by training a Transformer to dereference queried variables in symbolic programs where variables are assigned either numerical constants or other variables. Each program requires following chains of variable assignments up to four steps deep to find the queried value, and also contains irrelevant chains of assignments acting as distractors. Our analysis reveals a developmental trajectory with three distinct phases during training: (1) random prediction of numerical constants, (2) a shallow heuristic prioritizing early variable assignments, and (3) the emergence of a systematic mechanism for dereferencing assignment chains. Using causal interventions, we find that the model learns to exploit the residual stream as an addressable memory space, with specialized attention heads routing information across token positions. This mechanism allows the model to dynamically track variable bindings across layers, resulting in accurate dereferencing. Our results show how Transformer models can learn to implement systematic variable binding without explicit architectural support, bridging connectionist and symbolic approaches.
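
The dereferencing task described in this abstract can be made concrete with a small reference solver. The sketch below computes the ground-truth answer a model would be trained to predict; it says nothing about the Transformer's internal mechanism. The program syntax (`name = value`, one assignment per line) and the assumption that each variable is assigned exactly once are illustrative choices, not details taken from the paper.

```python
def dereference(program: str, query: str, max_depth: int = 16) -> int:
    """Follow a chain of variable assignments until a numerical constant is reached."""
    bindings = {}
    for line in program.strip().splitlines():
        name, value = (part.strip() for part in line.split("="))
        bindings[name] = value  # assumes each variable is assigned exactly once

    current = query
    for _ in range(max_depth):
        value = bindings[current]
        if value.isdigit():
            return int(value)  # reached a constant: the chain ends here
        current = value  # the value is another variable: keep following
    raise ValueError("assignment chain is too deep or cyclic")


# A chain ending in `d`, interleaved with a distractor chain (x, y)
# that is irrelevant when `d` is queried.
PROGRAM = """
a = 7
b = a
x = 2
c = b
y = x
d = c
"""

answer = dereference(PROGRAM, "d")
distractor = dereference(PROGRAM, "y")
```

Querying `d` requires following `d`, `c`, `b`, `a` in turn, while the `x` and `y` assignments play the role of the distractor chains mentioned in the abstract.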

[308] SageAttention2++: A More Efficient Implementation of SageAttention2

Jintao Zhang,Xiaoming Xu,Jia Wei,Haofeng Huang,Pengle Zhang,Chendong Xiang,Jun Zhu,Jianfei Chen

Main category: cs.LG

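TL;DR: SageAttention2++ accelerates attention through quantization and gains further speed from a faster FP8 Matmul instruction, achieving a 3.9x speedup over FlashAttention while maintaining accuracy.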

Details

Motivation: The time complexity of attention grows quadratically with sequence length, so its efficiency is critical. Method: Use quantization to accelerate the matrix multiplications (Matmul) in attention, and adopt the faster FP8 Matmul instruction accumulated in FP16. Result: SageAttention2++ achieves a 3.9x speedup over FlashAttention with the same attention accuracy as SageAttention2. Conclusion: SageAttention2++ effectively accelerates a range of models (language, image, and video generation) with negligible loss in end-to-end metrics. Abstract: The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. The instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.

[309] Topological Deep Learning for Speech Data

Zhiwang Yu

Main category: cs.LG

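TL;DR: The paper proposes topology-aware convolutional kernels based on topological data analysis (TDA) that significantly improve the performance of speech recognition networks.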

Details

Motivation: Inspired by Carlsson et al., the study explores the potential of TDA in deep learning, in particular the use of mathematical tools to improve neural networks. Method: By studying orthogonal group actions on kernels, the author establishes a fiber-bundle decomposition of matrix spaces, derives new filter generation methods, and designs an Orthogonal Feature (OF) layer. Result: The OF layer performs strongly in phoneme recognition, particularly in low-noise scenarios, and shows cross-domain adaptability. Conclusion: The work reveals TDA's potential in neural network optimization and opens new avenues for interdisciplinary research between mathematics and deep learning. Abstract: Topological data analysis (TDA) offers novel mathematical tools for deep learning. Inspired by Carlsson et al., this study designs topology-aware convolutional kernels that significantly improve speech recognition networks. Theoretically, by investigating orthogonal group actions on kernels, we establish a fiber-bundle decomposition of matrix spaces, enabling new filter generation methods. Practically, our proposed Orthogonal Feature (OF) layer achieves superior performance in phoneme recognition, particularly in low-noise scenarios, while demonstrating cross-domain adaptability. This work reveals TDA's potential in neural network optimization, opening new avenues for mathematics-deep learning interdisciplinary studies.

[310] Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

Charles London,Varun Kanade

Main category: cs.LG

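TL;DR: Pause tokens (such as "...") improve Transformer performance on language and mathematical tasks, but their theoretical effect has been unclear. This paper gives the first proof that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity.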

Details

Motivation: Explain how pause tokens increase the computational power of Transformers, closing the gap between theory and empirical findings. Method: Theoretical analysis proving that pause tokens strictly increase Transformer expressivity, supported by experiments. Result: With pause tokens, Transformers can express richer function classes (all of $\mathsf{AC}^0$ under bounded precision, and $\mathsf{TC}^0$ under logarithmic precision) and learn tasks such as parity. Conclusion: Pause tokens are a distinct mechanism, complementary to chain-of-thought prompting, for enhancing Transformer reasoning. Abstract: Pause tokens, simple filler symbols such as "...", consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of $\mathsf{AC}^0$ functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to $\mathsf{TC}^0$, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them. Our results provide a rigorous theoretical explanation for prior empirical findings, clarify how pause tokens interact with width, depth, and numeric precision, and position them as a distinct mechanism, complementary to chain-of-thought prompting, for enhancing Transformer reasoning.

[311] PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing

Yu Yan,Sheng Sun,Zhifei Zheng,Ziji Hao,Teli Liu,Min Liu

Main category: cs.LG

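TL;DR: The PoisonSwarm framework uses a model crowdsourcing strategy to generate diverse harmful data, addressing the limits of existing methods in generation reliability and content diversity.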

Details

Motivation: Building safe and responsible AI applications requires high-quality harmful-information data, but existing methods are constrained by the safety alignment of LLMs and struggle to generate reliable and diverse harmful data. Method: The PoisonSwarm framework generates benign data as base templates, decomposes each template into semantic units, and performs toxification and refinement through dynamic model switching. Result: Experiments show that PoisonSwarm performs strongly at synthesizing diverse harmful data with high scalability. Conclusion: PoisonSwarm provides an efficient and diverse solution for harmful data synthesis. Abstract: To construct responsible and secure AI applications, harmful information data is widely utilized for adversarial testing and the development of safeguards. Existing studies mainly leverage Large Language Models (LLMs) to synthesize data to obtain high-quality task datasets at scale, thereby avoiding costly human annotation. However, limited by the safety alignment mechanisms of LLMs, the synthesis of harmful data still faces challenges in generation reliability and content diversity. In this study, we propose a novel harmful information synthesis framework, PoisonSwarm, which applies the model crowdsourcing strategy to generate diverse harmful data while maintaining a high success rate. Specifically, we generate abundant benign data as the base templates in a counterfactual manner. Subsequently, we decompose each base template into multiple semantic units and perform unit-by-unit toxification and final refinement through dynamic model switching, thus ensuring the success of synthesis. Experimental results demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data with high scalability and diversity.

[312] Hardware-Efficient Attention for Fast Decoding

Ted Zadouri,Hubert Strauss,Tri Dao

Main category: cs.LG

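TL;DR: The paper proposes two attention variants (GTA and GLA) that reduce KV cache loading and improve parallelism, significantly improving LLM decoding efficiency.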

Details

Motivation: For large batches and long contexts, LLM decoding is limited by KV cache loading and insufficient parallelism, which increases latency and lowers hardware utilization. Method: Propose Grouped-Tied Attention (GTA) to reduce memory transfers and Grouped Latent Attention (GLA) to improve parallel decoding, together with efficient kernels. Result: GTA uses roughly half the KV cache of GQA; the GLA kernel is up to 2x faster than FlashMLA, and GLA increases throughput by up to 2x. Conclusion: By improving hardware efficiency, GTA and GLA significantly improve decoding performance and suit online serving scenarios. Abstract: LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay among arithmetic intensity, parallelization, and model quality and question whether current architectures fully exploit modern hardware. This work redesigns attention to perform more computation per byte loaded from memory to maximize hardware efficiency without trading off parallel scalability. We first propose Grouped-Tied Attention (GTA), a simple variant that combines and reuses key and value states, reducing memory transfers without compromising model quality. We then introduce Grouped Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations for fast decoding while maintaining high model quality. Experiments show that GTA matches Grouped-Query Attention (GQA) quality while using roughly half the KV cache and that GLA matches Multi-head Latent Attention (MLA) and is easier to shard. Our optimized GLA kernel is up to 2$\times$ faster than FlashMLA, for example, in a speculative decoding setting when the query length exceeds one. Furthermore, by fetching a smaller KV cache per device, GLA reduces end-to-end latency and increases throughput in online serving benchmarks by up to 2$\times$.
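
To see why the KV cache dominates memory traffic during decoding, it helps to compute its size. The sketch below uses the standard accounting for a decoder that caches one key and one value vector per KV head, per layer, per token; the model configuration (32 layers, head dimension 128, 8192-token context, 2-byte elements) is a hypothetical example and is not taken from the paper. It compares multi-head attention with grouped-query attention only; GTA and GLA themselves are not modeled here.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    batch: int = 1,
    bytes_per_elem: int = 2,
) -> int:
    """Bytes needed to cache keys and values for every token and layer.

    The leading factor 2 accounts for storing both a key and a value vector.
    """
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical configuration, chosen only to make the arithmetic concrete.
cfg = dict(seq_len=8192, n_layers=32, head_dim=128)

mha_bytes = kv_cache_bytes(n_kv_heads=32, **cfg)  # one KV head per query head
gqa_bytes = kv_cache_bytes(n_kv_heads=8, **cfg)   # 4 query heads share each KV head

print(f"MHA: {mha_bytes / 2**30:.1f} GiB, GQA: {gqa_bytes / 2**30:.1f} GiB")
```

Every decoding step has to read this cache from high-bandwidth memory, so shrinking it (as the abstract reports GTA does relative to GQA) directly reduces the bytes moved per generated token.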

[313] Reinforcing General Reasoning without Verifiers

Xiangxin Zhou,Zichen Liu,Anya Sims,Haonan Wang,Tianyu Pang,Chongxuan Li,Liang Wang,Min Lin,Chao Du

Main category: cs.LG

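TL;DR: The paper proposes VeriFree, a verifier-free method that directly maximizes the probability of generating the reference answer, addressing the limitations of verifier-based reinforcement learning in general reasoning domains.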

Details

Motivation: Current verifier-based RL methods apply only to tasks with rule-based verification and do not extend to real-world domains; model-based verifiers add reliance on a strong verifier, susceptibility to reward hacking, and other problems. Method: VeriFree bypasses answer verification and uses RL to directly maximize the probability of generating the reference answer. Result: VeriFree matches and even surpasses verifier-based methods on multiple benchmarks while significantly reducing compute requirements. Conclusion: VeriFree offers an efficient and practical training method for general reasoning domains, with both theoretical and practical benefits. Abstract: The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
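
The core idea in this abstract, scoring a sampled reasoning trace by how likely the policy then is to produce the reference answer, can be sketched without any model. The code below is an assumption-laden illustration, not the VeriFree estimator: it takes hypothetical per-token probabilities that a policy would assign to the reference answer tokens (each conditioned on the question, the sampled reasoning, and the preceding answer tokens) and combines them into a sequence probability. The function name and the numbers are invented.

```python
import math
from typing import Sequence


def reference_answer_reward(token_probs: Sequence[float]) -> float:
    """Probability of the whole reference answer given a reasoning trace.

    Computed as the product of per-token probabilities, accumulated in log
    space for numerical stability.
    """
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(log_prob)


# Hypothetical probabilities of the three reference-answer tokens after two
# different sampled reasoning traces.
after_helpful_trace = [0.9, 0.8, 0.95]
after_unhelpful_trace = [0.3, 0.2, 0.5]

reward_helpful = reference_answer_reward(after_helpful_trace)
reward_unhelpful = reference_answer_reward(after_unhelpful_trace)
```

A trace that makes the reference answer more probable receives a higher reward, so no separate verifier is needed to judge the sampled answer.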

eess.IV [Back]

[314] Unpaired Image-to-Image Translation for Segmentation and Signal Unmixing

Nikola Andrejic,Milica Spasic,Igor Mihajlovic,Petra Milosavljevic,Djordje Pavlovic,Filip Milisavljevic,Uros Milivojevic,Danilo Delibasic,Ivana Mikic,Sinisa Todorovic

Main category: eess.IV

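TL;DR: Ui2i is a new unpaired image-to-image translation model that modifies CycleGAN to better disentangle content and style features and to preserve content integrity.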

Details

Motivation: Preserve content during style transfer on unpaired datasets, especially for biomedical image tasks that demand more accurate structural preservation. Method: U-Net generators, approximate bidirectional spectral normalization, and attention mechanisms, trained with image scale augmentation. Result: Ui2i's content preservation is validated on biomedical tasks, in particular the separation of superimposed signals in single-channel immunofluorescence images. Conclusion: Ui2i is the first method able to separate superimposed signals using real, unpaired data, and suits tasks that require high content fidelity. Abstract: This work introduces Ui2i, a novel model for unpaired image-to-image translation, trained on content-wise unpaired datasets to enable style transfer across domains while preserving content. Building on CycleGAN, Ui2i incorporates key modifications to better disentangle content and style features, and preserve content integrity. Specifically, Ui2i employs U-Net-based generators with skip connections to propagate localized shallow features deep into the generator. Ui2i removes feature-based normalization layers from all modules and replaces them with approximate bidirectional spectral normalization -- a parameter-based alternative that enhances training stability. To further support content preservation, channel and spatial attention mechanisms are integrated into the generators. Training is facilitated through image scale augmentation. Evaluation on two biomedical tasks -- domain adaptation for nuclear segmentation in immunohistochemistry (IHC) images and unmixing of biological structures superimposed in single-channel immunofluorescence (IF) images -- demonstrates Ui2i's ability to preserve content fidelity in settings that demand more accurate structural preservation than typical translation tasks. To the best of our knowledge, Ui2i is the first approach capable of separating superimposed signals in IF images using real, unpaired training data.

[315] The Role of AI in Early Detection of Life-Threatening Diseases: A Retinal Imaging Perspective

Tariq M Khan,Toufique Ahmed Soomro,Imran Razzak

Main category: eess.IV

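TL;DR: Retinal imaging combined with AI and mobile health technologies offers a new route to early detection of systemic diseases, but standardization and clinical integration remain open problems.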

Details

Motivation: Fragmented retinal imaging technologies and AI models limit their use in clinical practice, so systematic integration and standardization are needed. Method: Review recent advances in OCT/OCTA, adaptive optics, AI/ML algorithms, and mobile health technologies, and assess their diagnostic performance. Result: AI models show high sensitivity and specificity for diabetic retinopathy and cardiovascular risk prediction, and mobile health technologies improve access to screening. Conclusion: The authors propose multicenter protocol standardization and a roadmap for clinical integration to advance retinal screening for precision prevention and early intervention. Abstract: Retinal imaging has emerged as a powerful, non-invasive modality for detecting and quantifying biomarkers of systemic diseases, ranging from diabetes and hypertension to Alzheimer's disease and cardiovascular disorders, but current insights remain dispersed across platforms and specialties. Recent technological advances in optical coherence tomography (OCT/OCTA) and adaptive optics (AO) now deliver ultra-high-resolution scans (down to 5 μm) with superior contrast and spatial integration, allowing early identification of microvascular abnormalities and neurodegenerative changes. At the same time, AI-driven and machine learning (ML) algorithms have revolutionized the analysis of large-scale retinal datasets, increasing sensitivity and specificity; for example, deep learning models achieve >90% sensitivity for diabetic retinopathy and AUC = 0.89 for the prediction of cardiovascular risk from fundus photographs. The proliferation of mobile health technologies and telemedicine platforms further extends access, reduces costs, and facilitates community-based screening and longitudinal monitoring. Despite these breakthroughs, translation into routine practice is hindered by heterogeneous imaging protocols, limited external validation of AI models, and integration challenges within clinical workflows. In this review, we systematically synthesize the latest OCT/OCTA and AO developments, AI/ML approaches, and mHealth/Tele-ophthalmology initiatives and quantify their diagnostic performance across disease domains. Finally, we propose a roadmap for multicenter protocol standardization, prospective validation trials, and seamless incorporation of retinal screening into primary and specialty care pathways, paving the way for precision prevention, early intervention, and ongoing treatment of life-threatening systemic diseases.

[316] Multitemporal Latent Dynamical Framework for Hyperspectral Images Unmixing

Ruiying Li,Bin Pan,Lan Ma,Xia Xu,Zhenwei Shi

Main category: eess.IV

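TL;DR: This paper proposes MiLD, a multitemporal hyperspectral unmixing framework that models the dynamics of abundances with neural ordinary differential equations and provides theoretical support.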

Details

Motivation: Existing methods neglect the dynamics of abundances; the authors address this with neural ordinary differential equations. Method: The MiLD framework comprises problem definition, mathematical modeling, a solution algorithm, and theoretical validation, using neural ordinary differential equations and latent variables to model abundance dynamics. Result: Experiments validate MiLD on both synthetic and real datasets. Conclusion: MiLD addresses dynamic abundance modeling in multitemporal hyperspectral unmixing and provides theoretical support. Abstract: Multitemporal hyperspectral unmixing can capture the dynamical evolution of materials. Despite its capability, current methods emphasize the variability of endmembers while neglecting the dynamics of abundances, which motivates our adoption of neural ordinary differential equations to model abundances temporally. However, this motivation is hindered by two challenges: the inherent complexity in defining, modeling and solving the problem, and the absence of theoretical support. To address the above challenges, in this paper, we propose a multitemporal latent dynamical (MiLD) unmixing framework by capturing the dynamical evolution of materials with theoretical validation. For addressing multitemporal hyperspectral unmixing, MiLD consists of problem definition, mathematical modeling, solution algorithm and theoretical support. We formulate the multitemporal unmixing problem definition by conducting ordinary differential equations and developing latent variables. We transfer multitemporal unmixing to a mathematical model by dynamical discretization approaches, which describe the discreteness of observed sequence images with mathematical expansions. We propose an algorithm to solve the problem and capture the dynamics of materials, which approximates abundance evolution by neural networks. Furthermore, we provide theoretical support by validating the crucial properties, which verifies consistency, convergence and stability theorems. The major contributions of MiLD include defining the problem by ordinary differential equations, modeling the problem by a dynamical discretization approach, solving the problem by a multitemporal unmixing algorithm, and presenting theoretical support. Our experiments on both synthetic and real datasets have validated the utility of our work.

[317] Generative Image Compression by Estimating Gradients of the Rate-variable Feature Distribution

Minghao Han,Weiyi You,Jinhua Zhang,Leheng Zhang,Ce Zhu,Shuhang Gu

Main category: eess.IV

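TL;DR: This paper proposes a diffusion-based generative image compression method that reinterprets the compression process as a forward diffusion path governed by stochastic differential equations (SDEs) and trains a reverse neural network to reconstruct images directly, achieving high-quality reconstructions.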

Details

Motivation: Generative image compression (GIC) integrates generative modeling to produce photo-realistic reconstructions, but existing approaches exploit diffusion modeling only indirectly. This paper aims to improve reconstruction quality by modeling the compression process itself as a diffusion path. Method: A new diffusion-based framework that treats compression as a forward diffusion path and trains a reverse neural network to reconstruct images directly, without Gaussian noise initialization. Result: On multiple benchmark datasets, the method outperforms existing generative image compression approaches on perceptual distortion, statistical fidelity, and no-reference quality metrics. Conclusion: Modeling compression directly as a diffusion path yields high-quality reconstructions with only a few sampling steps. Abstract: While learned image compression (LIC) focuses on efficient data transmission, generative image compression (GIC) extends this framework by integrating generative modeling to produce photo-realistic reconstructed images. In this paper, we propose a novel diffusion-based generative modeling framework tailored for generative image compression. Unlike prior diffusion-based approaches that indirectly exploit diffusion modeling, we reinterpret the compression process itself as a forward diffusion path governed by stochastic differential equations (SDEs). A reverse neural network is trained to reconstruct images by reversing the compression process directly, without requiring Gaussian noise initialization. This approach achieves smooth rate adjustment and photo-realistic reconstructions with only a minimal number of sampling steps. Extensive experiments on benchmark datasets demonstrate that our method outperforms existing generative image compression approaches across a range of metrics, including perceptual distortion, statistical fidelity, and no-reference quality assessments.

[318] Prostate Cancer Screening with Artificial Intelligence-Enhanced Micro-Ultrasound: A Comparative Study with Traditional Methods

Muhammad Imran,Wayne G. Brisbane,Li-Ming Su,Jason P. Joseph,Wei Shao

Main category: eess.IV

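TL;DR: AI analysis of micro-ultrasound images detects prostate cancer more accurately than traditional PSA and DRE screening, improving specificity while maintaining high sensitivity.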

Details

Motivation: Investigate whether micro-ultrasound combined with AI can outperform traditional PSA and DRE screening in detecting clinically significant prostate cancer (csPCa). Method: A self-supervised convolutional autoencoder extracts features from micro-ultrasound images, random forest classifiers predict csPCa under five-fold cross-validation, and the result is compared with a clinical model based on PSA, DRE, prostate volume, and age. Result: The AI model reaches an AUROC of 0.871 with 92.5% sensitivity and 68.1% specificity; the clinical model reaches an AUROC of 0.753 with 96.2% sensitivity and 27.3% specificity. Conclusion: AI-interpreted micro-ultrasound improves specificity while maintaining high sensitivity, may reduce unnecessary biopsies, and could serve as a low-cost screening alternative. Abstract: Background and objective: Micro-ultrasound (micro-US) is a novel imaging modality with diagnostic accuracy comparable to MRI for detecting clinically significant prostate cancer (csPCa). We investigated whether artificial intelligence (AI) interpretation of micro-US can outperform clinical screening methods using PSA and digital rectal examination (DRE). Methods: We retrospectively studied 145 men who underwent micro-US guided biopsy (79 with csPCa, 66 without). A self-supervised convolutional autoencoder was used to extract deep image features from 2D micro-US slices. Random forest classifiers were trained using five-fold cross-validation to predict csPCa at the slice level. Patients were classified as csPCa-positive if 88 or more consecutive slices were predicted positive. Model performance was compared with a classifier using PSA, DRE, prostate volume, and age. Key findings and limitations: The AI-based micro-US model and clinical screening model achieved AUROCs of 0.871 and 0.753, respectively. At a fixed threshold, the micro-US model achieved 92.5% sensitivity and 68.1% specificity, while the clinical model showed 96.2% sensitivity but only 27.3% specificity. Limitations include a retrospective single-center design and lack of external validation. Conclusions and clinical implications: AI-interpreted micro-US improves specificity while maintaining high sensitivity for csPCa detection. This method may reduce unnecessary biopsies and serve as a low-cost alternative to PSA-based screening. Patient summary: We developed an AI system to analyze prostate micro-ultrasound images. It outperformed PSA and DRE in detecting aggressive cancer and may help avoid unnecessary biopsies.
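
The patient-level decision rule in this abstract (a patient is positive when enough consecutive slices are predicted positive) and the reported metrics can be sketched as follows. This is a minimal illustration with toy data, not the study's pipeline: the threshold is passed in as the parameter `min_consecutive`, and the function names and example predictions are invented.

```python
from typing import Sequence, Tuple


def longest_positive_run(slice_preds: Sequence[int]) -> int:
    """Length of the longest run of consecutive positive slice predictions."""
    longest = current = 0
    for pred in slice_preds:
        current = current + 1 if pred else 0
        longest = max(longest, current)
    return longest


def classify_patient(slice_preds: Sequence[int], min_consecutive: int) -> bool:
    """Patient is positive if at least `min_consecutive` slices in a row are positive."""
    return longest_positive_run(slice_preds) >= min_consecutive


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    return tp / (tp + fn), tn / (tn + fp)


# Toy example with a small threshold so the behaviour is easy to follow.
patient_a = classify_patient([0, 1, 1, 1, 0], min_consecutive=3)  # run of 3
patient_b = classify_patient([1, 1, 0, 1, 1], min_consecutive=3)  # longest run is 2
sens, spec = sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 1])
```

Requiring a run of consecutive positive slices, rather than any single positive slice, makes the patient-level call less sensitive to isolated slice-level false positives, which is one plausible reason for such a rule.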

cs.SE [Back]

[319] SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Ibragim Badertdinov,Alexander Golubev,Maksim Nekrashevich,Anton Shevtsov,Simon Karasik,Andrei Andriushchenko,Maria Trofimova,Daria Litvintseva,Boris Yangel

Main category: cs.SE

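TL;DR: The paper proposes an automated, scalable pipeline that extracts real-world interactive software engineering tasks from GitHub, builds the SWE-rebench dataset of over 21,000 Python tasks, and addresses data scarcity and evaluation contamination.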

Details

Motivation: LLM-based software engineering agents face a shortage of high-quality training data and contaminated evaluation; existing datasets are small and lack diversity. Method: Extract real interactive tasks from GitHub with an automated pipeline, build the SWE-rebench dataset, and use it for contamination-free benchmarking. Result: SWE-rebench contains over 21,000 tasks, and the comparison shows that the performance of some language models may be inflated by contamination. Conclusion: The approach addresses data scarcity and evaluation contamination and provides a more reliable benchmark for research on software engineering agents. Abstract: LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks collected using the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare results of various LLMs on this benchmark to results on SWE-bench Verified and show that performance of some language models might be inflated due to contamination issues.

[320] SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis

Yansong Li,Paula Branco,Alexander M. Hoole,Manish Marwah,Hari Manassery Koduvely,Guy-Vincent Jourdan,Stephan Jou

Main category: cs.SE

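TL;DR: SV-TrustEval-C is a benchmark for evaluating the reliability of large language models (LLMs) in vulnerability analysis of C code, focusing on structure reasoning and semantic reasoning.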

Details

Motivation: Existing studies overlook the importance of structure and semantic reasoning in vulnerability analysis, so a more comprehensive evaluation is needed. Method: Introduce the SV-TrustEval-C benchmark, which evaluates LLMs along two dimensions: structure reasoning and semantic reasoning. Result: Current LLMs perform poorly at understanding complex code relationships and at logical consistency, relying on pattern matching rather than logical reasoning. Conclusion: SV-TrustEval-C exposes these weaknesses and points to directions for improving LLM reasoning and trustworthiness. Abstract: As Large Language Models (LLMs) evolve in understanding and generating code, accurately evaluating their reliability in analyzing source code vulnerabilities becomes increasingly vital. While studies have examined LLM capabilities in tasks like vulnerability detection and repair, they often overlook the importance of both structure and semantic reasoning crucial for trustworthy vulnerability analysis. To address this gap, we introduce SV-TrustEval-C, a benchmark designed to evaluate LLMs' abilities for vulnerability analysis of code written in the C programming language through two key dimensions: structure reasoning - assessing how models identify relationships between code elements under varying data and control flow complexities; and semantic reasoning - examining their logical consistency in scenarios where code is structurally and semantically perturbed. Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning. These findings underscore the effectiveness of the SV-TrustEval-C benchmark and highlight critical areas for enhancing the reasoning capabilities and trustworthiness of LLMs in real-world vulnerability analysis tasks. Our initial benchmark dataset is publicly available.

[321] An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks

Xin Zhou,Kisub Kim,Ting Zhang,Martin Weyssow,Luis F. Gomes,Guang Yang,David Lo

Main category: cs.SE

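TL;DR: SWE-Judge is a new automatic evaluation metric designed to assess the correctness of generated software artifacts; by dynamically ensembling multiple strategies, it significantly outperforms existing metrics.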

Details

Motivation: Existing automatic evaluation methods are not accurate enough at judging the correctness of generated software artifacts, while human evaluation is accurate but does not scale. Method: SWE-Judge defines five independent evaluation strategies and dynamically selects the most appropriate subset to ensemble into a final score. Result: SWE-Judge correlates markedly better with human judgments (improvements of 5.9% to 183.8%) and reaches agreement levels close to those of human annotators on several tasks. Conclusion: SWE-Judge is a scalable and reliable alternative to human evaluation. Abstract: Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, other existing automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SWE-Judge, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SWE-Judge first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges to produce a final correctness score through ensembling. We evaluate SWE-Judge across a diverse set of software engineering (SE) benchmarks, including CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess. These benchmarks span three SE tasks: code generation, automated program repair, and code summarization. Experimental results demonstrate that SWE-Judge consistently achieves a higher correlation with human judgments, with improvements ranging from 5.9% to 183.8% over existing automatic metrics. Furthermore, SWE-Judge reaches agreement levels with human annotators that are comparable to inter-annotator agreement in code generation and program repair tasks. These findings underscore SWE-Judge's potential as a scalable and reliable alternative to human evaluation.
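
One way to picture the "dynamic team selection" step is as a search over subsets of judges. The sketch below rests on an assumption the abstract does not state: that a team is chosen by maximizing the Pearson correlation between the team's averaged scores and human scores on a small annotated sample. The judge names, scores, and selection criterion are all hypothetical, and the actual SWE-Judge mechanism may differ.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def ensemble(judge_scores: Dict[str, List[float]], team: Sequence[str]) -> List[float]:
    """Average the scores of the judges in `team`, sample by sample."""
    n_samples = len(next(iter(judge_scores.values())))
    return [mean(judge_scores[j][i] for j in team) for i in range(n_samples)]


def select_team(
    judge_scores: Dict[str, List[float]], human_scores: Sequence[float]
) -> Tuple[Tuple[str, ...], float]:
    """Exhaustively pick the judge subset whose ensemble best matches humans."""
    best_team: Tuple[str, ...] = ()
    best_corr = float("-inf")
    judges = sorted(judge_scores)
    for size in range(1, len(judges) + 1):
        for team in combinations(judges, size):
            corr = pearson(ensemble(judge_scores, team), human_scores)
            if corr > best_corr:
                best_team, best_corr = team, corr
    return best_team, best_corr


# Hypothetical scores from three judges on four annotated samples.
human = [1, 0, 1, 0]
scores = {
    "judge_a": [1, 0, 1, 0],  # agrees with the human labels
    "judge_b": [0, 1, 0, 1],  # systematically disagrees
    "judge_c": [1, 1, 1, 0],  # partially agrees
}
team, corr = select_team(scores, human)
```

With five judges, as in the abstract, an exhaustive search covers only 31 non-empty subsets, so brute force is affordable in this sketch.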

cs.CY [Back]

[322] Cultural Awareness in Vision-Language Models: A Cross-Country Exploration

Avinash Madasu,Vasudev Lal,Phillip Howard

Main category: cs.CY

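TL;DR: Proposes a new framework for systematically evaluating the cultural biases of vision-language models (VLMs) with respect to race, gender, and physical traits.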

Details

Motivation: VLMs are widely deployed across cultural contexts, yet their internal biases are not well understood. Method: Three retrieval-based tasks are designed: race-to-country, personal-traits-to-country, and physical-characteristics-to-country association. Result: VLMs exhibit persistent biases that may inadvertently reinforce societal stereotypes. Conclusion: The study exposes cultural bias in VLMs and the need for further work to reduce its impact. Abstract: Vision-Language Models (VLMs) are increasingly deployed in diverse cultural contexts, yet their internal biases remain poorly understood. In this work, we propose a novel framework to systematically evaluate how VLMs encode cultural differences and biases related to race, gender, and physical traits across countries. We introduce three retrieval-based tasks: (1) Race to Country retrieval, which examines the association between individuals from specific racial groups (East Asian, White, Middle Eastern, Latino, South Asian, and Black) and different countries; (2) Personal Traits to Country retrieval, where images are paired with trait-based prompts (e.g., Smart, Honest, Criminal, Violent) to investigate potential stereotypical associations; and (3) Physical Characteristics to Country retrieval, focusing on visual attributes like skinny, young, obese, and old to explore how physical appearances are culturally linked to nations. Our findings reveal persistent biases in VLMs, highlighting how visual representations may inadvertently reinforce societal stereotypes.

[323] Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs)

Anna Neumann,Elisabeth Kirsten,Muhammad Bilal Zafar,Jatinder Singh

Main category: cs.CY

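TL;DR: The paper studies how system prompts shape the behaviour of large language models (LLMs), shows that where information is placed can introduce bias, and calls for system prompt analysis to be part of AI auditing.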

Details

Motivation: System prompts take precedence over user inputs in LLMs, but their complexity and opacity can cause undetected biases and potential harms. Method: Compares how six commercial LLMs process demographic information for 50 demographic groups when it appears in the system prompt versus the user prompt. Result: Finds significant biases, manifesting as differences in user representation and decision-making scenarios, that stem from opaque system-level configurations. Conclusion: System prompt analysis should be incorporated into AI auditing to avoid potential biases and harms. Abstract: System prompts in Large Language Models (LLMs) are predefined directives that guide model behaviour, taking precedence over user inputs in text processing and generation. LLM deployers increasingly use them to ensure consistent responses across contexts. While model providers set a foundation of system prompts, deployers and third-party developers can append additional prompts without visibility into others' additions, and this layered implementation remains entirely hidden from end-users. As system prompts become more complex, they can directly or indirectly introduce unaccounted-for side effects. This lack of transparency raises fundamental questions about how the position of information in different directives shapes model outputs. As such, this work examines how the placement of information affects model behaviour. To this end, we compare how models process demographic information in system versus user prompts across six commercially available LLMs and 50 demographic groups. Our analysis reveals significant biases, manifesting in differences in user representation and decision-making scenarios. Since these variations stem from inaccessible and opaque system-level configurations, they risk representational, allocative and potential other biases and downstream harms beyond the user's ability to detect or correct. Our findings draw attention to these critical issues, which have the potential to perpetuate harms if left unexamined. Further, we argue that system prompt analysis must be incorporated into AI auditing processes, particularly as customisable system prompts become increasingly prevalent in commercial AI deployments.

cs.AI [Back]

[324] Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting

Ana Rita Ortigoso,Gabriel Vieira,Daniel Fuentes,Luis Frazão,Nuno Costa,António Pereira

Main category: cs.AI

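TL;DR: Project Riley is a multimodal, multi-model conversational AI architecture that simulates reasoning influenced by emotional states: five emotional agents (Joy, Sadness, Fear, Anger, Disgust) generate and refine responses, which are then synthesised into a coherent output. A derived prototype, Armando, targets emergency contexts and adds RAG. User testing indicates strong emotional alignment and communicative clarity in structured scenarios.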

Details

Motivation: Inspired by Inside Out, the work aims to simulate how emotions influence reasoning and to improve the emotional expressiveness and usefulness of conversational AI. Method: Emotional agents built on textual and visual LLMs hold structured multi-round dialogues; the Armando prototype adds RAG and cumulative context tracking. Result: User testing indicates good performance on emotional alignment, clarity, and utility. Conclusion: Project Riley shows the potential of emotion simulation in conversational AI, particularly in structured scenarios. Abstract: This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar's Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.

[325] Scaling over Scaling: Exploring Test-Time Scaling Pareto in Large Reasoning Models

Jian Wang,Boyan Zhu,Chak Tou Leong,Yongqi Li,Wenjie Li

Main category: cs.AI

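TL;DR: The paper studies the Pareto frontier of test-time compute scaling, introduces the Test-Time Scaling Performance Model (TTSPM), derives saturation points for parallel and sequential scaling, and validates their practical utility on reasoning benchmarks.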

Details

Motivation: To understand the practical performance limits of large reasoning models (LRMs) under test-time compute scaling and how to allocate resources optimally. Method: Theoretically analyzes two paradigms, parallel and sequential scaling, derives the saturation point of each, and validates the results on several reasoning benchmarks. Result: The upper bounds of both paradigms share a unified mathematical structure, and the derived saturation points prove useful on practical tasks. Conclusion: The work offers insight into the cost-benefit trade-offs of test-time scaling and supports the development of more efficient inference strategies. Abstract: Large reasoning models (LRMs) have exhibited the capacity to enhance reasoning performance via internal test-time scaling. Building upon this, a promising direction is to further scale test-time compute to unlock even greater reasoning capabilities. However, as we push these scaling boundaries, systematically understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling Pareto of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM). We theoretically analyze two fundamental paradigms for such extended scaling, parallel scaling and sequential scaling, from a probabilistic modeling perspective. Our primary contribution is the derivation of the saturation point on the scaling budget for both strategies, identifying thresholds beyond which additional computation yields diminishing returns. Remarkably, despite their distinct mechanisms, both paradigms converge to a unified mathematical structure in their upper bounds. We empirically validate our theoretical findings on challenging reasoning benchmarks, including AIME, MATH-500, and GPQA, demonstrating the practical utility of these bounds for test-time resource allocation. We hope that this work provides insights into the cost-benefit trade-offs of test-time scaling, guiding the development of more resource-efficient inference strategies for large reasoning models.
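
The notion of a saturation point can be made concrete with a toy model. The sketch below is not TTSPM, whose formulas the abstract does not give. It assumes the simplest possible parallel-scaling setup: N independent samples, each correct with probability p, and a problem counted as solved if any sample is correct. Under that assumption the success probability is 1 - (1 - p)^N, the marginal gain of one more sample is p(1 - p)^N, and the saturation budget is the smallest N whose marginal gain falls below a chosen threshold `eps` (a hypothetical parameter).

```python
def success_prob(p, n):
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def saturation_budget(p, eps):
    """Smallest n where one extra sample adds less than eps success probability."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    n = 0
    # Marginal gain of going from n to n + 1 samples is p * (1 - p) ** n.
    while p * (1.0 - p) ** n >= eps:
        n += 1
    return n


if __name__ == "__main__":
    for p in (0.05, 0.2, 0.5):
        n_sat = saturation_budget(p, eps=0.01)
        print(f"p={p}: saturates at N={n_sat}, success={success_prob(p, n_sat):.3f}")
```

In this toy model harder problems (smaller p) saturate later, which is the kind of budget threshold the abstract describes, although the paper's actual bounds may take a different form.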

[326] Comparisons between a Large Language Model-based Real-Time Compound Diagnostic Medical AI Interface and Physicians for Common Internal Medicine Cases using Simulated Patients

Hyungjun Park,Chang-Yun Woo,Seungjo Lim,Seunghwan Lim,Keunho Kwak,Ju Young Jeong,Chong Hyun Suh

Main category: cs.AI

TL;DR: LLM-based AI interface outperformed physicians in diagnostic accuracy and efficiency for internal medicine cases.

Details

Motivation: To compare the diagnostic performance of an AI interface with physicians in common internal medicine cases. Method: Nonrandomized clinical trial using USMLE Step 2 CS-style exams with 10 cases, comparing AI and physicians. Result: AI achieved 80% accuracy (vs. 50-70% for physicians), faster time (44.6% shorter), and lower cost (98.1% reduction). Conclusion: AI shows potential to assist in primary care with comparable accuracy and efficiency to physicians. Abstract: Objective To develop an LLM-based real-time compound diagnostic medical AI interface and to perform a clinical trial comparing this interface with physicians on common internal medicine cases based on United States Medical License Exam (USMLE) Step 2 Clinical Skill (CS) style exams. Methods A nonrandomized clinical trial was conducted on August 20, 2024. We recruited one general physician, two internal medicine residents (2nd and 3rd year), and five simulated patients. The clinical vignettes were adapted from the USMLE Step 2 CS style exams. We developed 10 representative internal medicine cases based on actual patients and included information available on initial diagnostic evaluation. The primary outcome was the accuracy of the first differential diagnosis. Repeatability was evaluated based on the proportion of agreement. Results The accuracy of the physicians' first differential diagnosis ranged from 50% to 70%, whereas the real-time compound diagnostic medical AI interface achieved an accuracy of 80%. The proportion of agreement for the first differential diagnosis was 0.7. The accuracy of the first and second differential diagnoses ranged from 70% to 90% for physicians, whereas the AI interface achieved an accuracy rate of 100%. The average time for the AI interface (557 sec) was 44.6% shorter than that of the physicians (1006 sec). The AI interface ($0.08) also reduced costs by 98.1% compared to the physicians' average ($4.2). Patient satisfaction scores ranged from 4.2 to 4.3 for care by physicians and were 3.9 for the AI interface. Conclusion An LLM-based real-time compound diagnostic medical AI interface demonstrated diagnostic accuracy and patient satisfaction comparable to those of a physician, while requiring less time and lower costs. These findings suggest that AI interfaces may have the potential to assist primary care consultations for common internal medicine cases.
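
The two percentage figures quoted in this abstract follow directly from the raw numbers it reports. The short check below recomputes them; the helper name `percent_reduction` is introduced here for illustration only.

```python
def percent_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# Figures taken from the abstract: mean time in seconds and mean cost in USD.
time_saving = percent_reduction(1006, 557)
cost_saving = percent_reduction(4.2, 0.08)

print(f"time: {time_saving:.1f}% shorter")
print(f"cost: {cost_saving:.1f}% lower")
```

Both values round to the reported 44.6% and 98.1%.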

[327] MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

Jiakang Yuan,Tianshuo Peng,Yilei Jiang,Yiting Lu,Renrui Zhang,Kaituo Feng,Chaoyou Fu,Tao Chen,Lei Bai,Bo Zhang,Xiangyu Yue

Main category: cs.AI

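TL;DR: The paper introduces MME-Reasoning, a benchmark that comprehensively evaluates the logical reasoning of multimodal large language models, and finds substantial limitations in current models on holistic reasoning.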

Details

Motivation: Existing benchmarks do not comprehensively evaluate the logical reasoning of multimodal large language models; they lack an explicit categorization of reasoning types and a clear understanding of reasoning. Method: Designs the MME-Reasoning benchmark, covering inductive, deductive, and abductive reasoning, with data curated to test reasoning rather than perception or breadth of knowledge. Result: Even advanced multimodal LLMs show limited performance on comprehensive logical reasoning, with clear performance imbalances across reasoning types. Conclusion: The study exposes key limitations of current multimodal LLMs in logical reasoning and offers systematic insight into understanding and evaluating reasoning ability. Abstract: Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate their reasoning abilities due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when subjected to holistic assessments of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conducted an in-depth analysis of approaches such as "thinking mode" and Rule-based RL, which are commonly believed to enhance reasoning abilities. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, providing comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.

[328] The Multilingual Divide and Its Impact on Global AI Safety

Aidan Peppin,Julia Kreutzer,Alice Schoenauer Sebag,Kelly Marchisio,Beyza Ermis,John Dang,Samuel Cahyawijaya,Shivalika Singh,Seraphina Goldfarb-Tarrant,Viraat Aryabumi,Aakanksha,Wei-Yin Ko,Ahmet Üstün,Matthias Gallé,Marzieh Fadaee,Sara Hooker

Main category: cs.AI

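TL;DR: The paper examines the "language gap" in AI and its negative effect on global AI safety, identifies the barriers to closing it, and offers recommendations.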

Details

Motivation: To examine the gap in AI capability and safety performance across languages, especially those outside the globally dominant few, in order to reduce inequality in global AI safety. Method: Analyzes why the language gap exists and grows and how it affects AI safety, then identifies barriers and makes recommendations. Result: Sets out policy and governance measures that support multilingual dataset creation, transparency, and research. Conclusion: Policy and governance support for multilingual development can narrow the language gap and improve global AI safety. Abstract: Despite advances in large language model capabilities in recent years, a large gap remains in their capabilities and safety performance for many languages beyond a relatively small handful of globally dominant languages. This paper provides researchers, policymakers and governance experts with an overview of key challenges to bridging the "language gap" in AI and minimizing safety risks across languages. We provide an analysis of why the language gap in AI exists and grows, and how it creates disparities in global AI safety. We identify barriers to address these challenges, and recommend how those working in policy and governance can help address safety concerns associated with the language gap by supporting multilingual dataset creation, transparency, and research.

eess.SP [Back]

[329] BrainStratify: Coarse-to-Fine Disentanglement of Intracranial Neural Dynamics

Hui Zheng,Hai-Teng Wang,Yi-Tao Jing,Pei-Yang Lin,Han-Qing Zhao,Wei Chen,Peng-Hu Wei,Yong-Zhi Shan,Guo-Guang Zhao,Yun-Zhe Liu

Main category: eess.SP

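TL;DR: BrainStratify is a unified coarse-to-fine neural disentanglement framework for decoding speech from intracranial neural signals, and it significantly outperforms previous methods.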

Details

Motivation: Task-relevant signals in intracranial recordings are sparsely distributed and entangled with task-irrelevant signals. Method: Identifies functional groups through spatial-context-guided temporal-spatial modeling, then disentangles the neural dynamics within the target functional group using Decoupled Product Quantization (DPQ). Result: Significantly outperforms previous decoding methods on multiple datasets. Conclusion: BrainStratify offers a robust and interpretable solution for speech decoding from intracranial recordings. Abstract: Decoding speech directly from neural activity is a central goal in brain-computer interface (BCI) research. In recent years, exciting advances have been made through the growing use of intracranial field potential recordings, such as stereo-ElectroEncephaloGraphy (sEEG) and ElectroCorticoGraphy (ECoG). These neural signals capture rich population-level activity but present key challenges: (i) task-relevant neural signals are sparsely distributed across sEEG electrodes, and (ii) they are often entangled with task-irrelevant neural signals in both sEEG and ECoG. To address these challenges, we introduce a unified Coarse-to-Fine neural disentanglement framework, BrainStratify, which includes (i) identifying functional groups through spatial-context-guided temporal-spatial modeling, and (ii) disentangling distinct neural dynamics within the target functional group using Decoupled Product Quantization (DPQ). We evaluate BrainStratify on two open-source sEEG datasets and one (epidural) ECoG dataset, spanning tasks like vocal production and speech perception. Extensive experiments show that BrainStratify, as a unified framework for decoding speech from intracranial neural signals, significantly outperforms previous decoding methods. Overall, by combining data-driven stratification with neuroscience-inspired modularity, BrainStratify offers a robust and interpretable solution for speech decoding from intracranial recordings.

q-bio.NC [Back]

[330] Optimizing fMRI Data Acquisition for Decoding Natural Speech with Limited Participants

Louis Jalouzot,Alexis Thual,Yair Lakretz,Christophe Pallier,Bertrand Thirion

Main category: q-bio.NC

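TL;DR: The study investigates the best strategies for decoding natural speech from fMRI data with few participants, finding that multi-subject training does not improve decoding accuracy and that decoders model syntactic features better than semantic ones.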

Details

Motivation: To find out how best to decode natural speech from fMRI when participant data is limited, and to assess how multi-subject training and the choice of stimuli affect decoding. Method: Using fMRI data from 8 participants, trains deep neural networks to predict LLM-derived text representations and compares single-subject with multi-subject training. Result: Multi-subject training does not improve decoding accuracy; decoders handle syntactic features better; stories with complex syntax or rich semantic content are harder to decode. Conclusion: Exploiting multi-subject data for natural speech decoding likely requires deeper phenotyping or a substantially larger cohort. Abstract: We investigate optimal strategies for decoding perceived natural speech from fMRI data acquired from a limited number of participants. Leveraging Lebel et al. (2023)'s dataset of 8 participants, we first demonstrate the effectiveness of training deep neural networks to predict LLM-derived text representations from fMRI activity. Then, in this data regime, we observe that multi-subject training does not improve decoding accuracy compared to a single-subject approach. Furthermore, training on similar or different stimuli across subjects has a negligible effect on decoding accuracy. Finally, we find that our decoders better model syntactic than semantic features, and that stories containing sentences with complex syntax or rich semantic content are more challenging to decode. While our results demonstrate the benefits of having extensive data per participant (deep phenotyping), they suggest that leveraging multi-subject data for natural speech decoding likely requires deeper phenotyping or a substantially larger cohort.

cs.SD [Back]

[331] Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs

Wenhao You,Xingjian Diao,Chunhui Zhang,Keyi Kong,Weiyi Wu,Zhongyu Ouyang,Chiyu Ma,Tingxuan Wu,Noah Wei,Zong Ke,Ming Cheng,Soroush Vosoughi,Jiang Gui

Main category: cs.SD

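TL;DR: This position paper examines the challenges of Music Audio-Visual Question Answering (Music AVQA), argues that specialized input processing, spatial-temporal architecture design, and music-specific modeling are critical, and outlines future research directions.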

Details

Motivation: Multimodal tasks in the music domain need tailored methods; Music AVQA is uniquely challenging because of its dense audio-visual content and intricate temporal dynamics. Method: Systematically analyzes Music AVQA datasets and methods, and identifies specialized input processing, spatial-temporal architecture design, and music-specific modeling strategies as critical. Result: Summarizes the design patterns linked to strong performance and proposes future directions for incorporating musical priors. Conclusion: The study aims to lay a foundation for multimodal musical understanding and to encourage further attention and research. Abstract: While recent Multimodal Large Language Models exhibit impressive capabilities for general multimodal tasks, specialized domains like music necessitate tailored approaches. Music Audio-Visual Question Answering (Music AVQA) particularly underscores this, presenting unique challenges with its continuous, densely layered audio-visual content, intricate temporal dynamics, and the critical need for domain-specific knowledge. Through a systematic analysis of Music AVQA datasets and methods, this position paper identifies that specialized input processing, architectures incorporating dedicated spatial-temporal designs, and music-specific modeling strategies are critical for success in this domain. Our study provides valuable insights for researchers by highlighting effective design patterns empirically linked to strong performance, proposing concrete future directions for incorporating musical priors, and aiming to establish a robust foundation for advancing multimodal musical understanding. This work is intended to inspire broader attention and further research, supported by a continuously updated anonymous GitHub repository of relevant papers: https://github.com/xid32/Survey4MusicAVQA.

[332] VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin

Zhiqi Ai,Meixuan Bao,Zhiyong Chen,Zhi Yang,Xinnuo Li,Shugong Xu

Main category: cs.SD

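TL;DR: The paper introduces the VoxAging dataset, studies how speaker aging affects speaker verification systems, and analyzes the role of factors such as age group and gender.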

Details

Motivation: Research on speaker aging has been held back by the lack of long-term longitudinal data, so VoxAging was built to fill this gap. Method: Uses the VoxAging dataset (293 speakers, spanning up to 17 years) to analyze aging and its effect on verification systems. Result: Examines the adverse effect of speaker aging on verification systems and explores the influence of age group and gender. Conclusion: VoxAging is a substantial resource for speaker aging research. Abstract: The performance of speaker verification systems is adversely affected by speaker aging. However, due to challenges in data collection, particularly the lack of sustained and large-scale longitudinal data for individuals, research on speaker aging remains difficult. In this paper, we present VoxAging, a large-scale longitudinal dataset collected from 293 speakers (226 English speakers and 67 Mandarin speakers) over several years, with the longest time span reaching 17 years (approximately 900 weeks). For each speaker, the data were recorded at weekly intervals. We studied the phenomenon of speaker aging and its effects on advanced speaker verification systems, analyzed individual speaker aging processes, and explored the impact of factors such as age group and gender on speaker aging research.

stat.ML [Back]

[333] A False Discovery Rate Control Method Using a Fully Connected Hidden Markov Random Field for Neuroimaging Data

Taehyo Kim,Qiran Jia,Mony J. de Leon,Hai Shu

Main category: stat.ML

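TL;DR: Proposes fcHMRF-LIS, a spatial FDR control method for multiple testing in neuroimaging data that addresses the shortcomings of existing methods in spatial dependence, stability, and computational efficiency.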

Details

Motivation: In neuroimaging analysis, classical FDR control methods assume independent tests and suffer from high false non-discovery rates, while existing spatial FDR methods do not jointly handle spatial dependence, stability, and computational efficiency. Method: Combines the local index of significance (LIS) with a novel fully connected hidden Markov random field (fcHMRF), fitted with an efficient expectation-maximization algorithm that reduces computational complexity. Result: In simulations, fcHMRF-LIS shows accurate FDR control, lower FNR, less variable FDP and FNP, and more true positives. Conclusion: fcHMRF-LIS offers clear advantages for neuroimaging analysis, identifying disease-relevant brain regions with improved computational efficiency. Abstract: False discovery rate (FDR) control methods are essential for voxel-wise multiple testing in neuroimaging data analysis, where hundreds of thousands or even millions of tests are conducted to detect brain regions associated with disease-related changes. Classical FDR control methods (e.g., BH, q-value, and LocalFDR) assume independence among tests and often lead to high false non-discovery rates (FNR). Although various spatial FDR control methods have been developed to improve power, they still fall short in jointly addressing three major challenges in neuroimaging applications: capturing complex spatial dependencies, maintaining low variability in both false discovery proportion (FDP) and false non-discovery proportion (FNP) across replications, and achieving computational scalability for high-resolution data. To address these challenges, we propose fcHMRF-LIS, a powerful, stable, and scalable spatial FDR control method for voxel-wise multiple testing. It integrates the local index of significance (LIS)-based testing procedure with a novel fully connected hidden Markov random field (fcHMRF) designed to model complex spatial structures using a parsimonious parameterization. We develop an efficient expectation-maximization algorithm incorporating mean-field approximation, the Conditional Random Fields as Recurrent Neural Networks (CRF-RNN) technique, and permutohedral lattice filtering, reducing the computational complexity from quadratic to linear in the number of tests. Extensive simulations demonstrate that fcHMRF-LIS achieves accurate FDR control, lower FNR, reduced variability in FDP and FNP, and a higher number of true positives compared to existing methods. Applied to an FDG-PET dataset from the Alzheimer's Disease Neuroimaging Initiative, fcHMRF-LIS identifies neurobiologically relevant brain regions and offers notable advantages in computational efficiency.
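
For orientation, the classical baseline this abstract calls BH (the Benjamini-Hochberg step-up procedure) can be sketched in a few lines. This is the independence-assuming baseline, not fcHMRF-LIS, and the p-values in the usage lines are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the corresponding hypothesis
    is rejected at FDR level alpha.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Made-up p-values for four tests (e.g. four voxels).
decisions = benjamini_hochberg([0.039, 0.001, 0.205, 0.008], alpha=0.05)
print(decisions)
```

Each test is treated in isolation here: a voxel's decision ignores its neighbours, which is the limitation that spatial methods such as the one in this abstract aim to address.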

cs.IR [Back]

[334] Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents

Jaeyoung Choe,Jihoon Kim,Woohwan Jung

Main category: cs.IR

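TL;DR: The paper proposes HiREC, a framework that uses hierarchical retrieval and evidence curation to fix the retrieval errors that repetitive text in standardized financial documents causes for conventional RAG methods.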

Details

Motivation: On standardized financial documents such as SEC filings, conventional RAG methods retrieve inaccurately and incompletely because of similar formats and repeated text. Method: HiREC combines hierarchical retrieval (retrieve relevant documents first, then select the most relevant passages) with evidence curation (remove irrelevant passages and, when necessary, generate complementary queries). Result: Constructs and releases the LOFin question answering benchmark, with 145,897 SEC documents and 1,595 question-answer pairs. Conclusion: HiREC addresses retrieval problems in financial documents, improving accuracy and completeness. Abstract: Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity causes traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach first performs hierarchical retrieval to reduce confusion among similar texts. It first retrieves related documents and then selects the most relevant passages from the documents. The evidence curation process removes irrelevant passages. When necessary, it automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.
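
The document-then-passage retrieval order described in this abstract can be sketched as follows. This is not the HiREC implementation: the Jaccard token-overlap scorer, the `summary`/`passages` corpus layout, the function names, and the toy filings are all assumptions for illustration, and the evidence curation step (filtering passages and issuing complementary queries) is left out.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def overlap(query, text):
    """Jaccard similarity between the token sets of query and text."""
    q, t = tokens(query), tokens(text)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def hierarchical_retrieve(query, corpus, top_docs=1, top_passages=1):
    """Stage 1: rank documents by summary. Stage 2: rank passages within them."""
    ranked_docs = sorted(
        corpus, key=lambda d: overlap(query, corpus[d]["summary"]), reverse=True
    )[:top_docs]
    candidates = [(d, p) for d in ranked_docs for p in corpus[d]["passages"]]
    candidates.sort(key=lambda dp: overlap(query, dp[1]), reverse=True)
    return candidates[:top_passages]


# Two made-up filings whose passages are near-duplicates of each other.
corpus = {
    "acme_2023": {
        "summary": "acme 10-k 2023",
        "passages": ["total revenue was 10 million", "the company faces market risks"],
    },
    "globex_2023": {
        "summary": "globex 10-k 2023",
        "passages": ["total revenue was 20 million", "the company faces market risks"],
    },
}

hits = hierarchical_retrieve("acme 2023 revenue", corpus)
print(hits)
```

Scoring passages alone would tie the two revenue passages, since the query terms they share are identical; narrowing to the right document first removes that ambiguity, which is the motivation the abstract gives for the hierarchy.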

[335] TeroSeek: An AI-Powered Knowledge Base and Retrieval Generation Platform for Terpenoid Research

Xu Kang,Siqi Jiang,Kangwei Xu,Jiahao Li,Ruibo Wu

Main category: cs.IR

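TL;DR: TeroSeek is a knowledge base and AI question-answering tool for terpenoids that uses a RAG framework to provide high-quality information and outperforms general-purpose large language models.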

Details

Motivation: Terpenoid research spans several disciplines, which makes knowledge integration difficult and calls for a specialized tool. Method: Builds a knowledge base from two decades of literature and couples it with a RAG-based question-answering chatbot and web service. Result: TeroSeek outperforms general-purpose large language models on terpenoid-related queries. Conclusion: TeroSeek is a domain-specific tool for multidisciplinary research and is publicly available. Abstract: Terpenoids are a crucial class of natural products that have been studied for over 150 years, but their interdisciplinary nature (spanning chemistry, pharmacology, and biology) complicates knowledge integration. To address this, the authors developed TeroSeek, a curated knowledge base (KB) built from two decades of terpenoid literature, coupled with an AI-powered question-answering chatbot and web service. Leveraging a retrieval-augmented generation (RAG) framework, TeroSeek provides structured, high-quality information and outperforms general-purpose large language models (LLMs) in terpenoid-related queries. It serves as a domain-specific expert tool for multidisciplinary research and is publicly available at http://teroseek.qmclab.com.

[336] What LLMs Miss in Recommendations: Bridging the Gap with Retrieval-Augmented Collaborative Signals

Shahrooz Pouryousef

Main category: cs.IR

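TL;DR: The paper compares large language models (LLMs) with classical matrix factorization (MF) models for recommendation, finding that LLMs fall short at capturing collaborative signals but that a retrieval-augmented generation (RAG) method substantially improves recommendation quality.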

Details

Motivation: To determine whether LLMs can make effective use of the collaborative signals in user-item interaction data, and how they compare with classical MF models. Method: Systematically compares LLMs with MF models and introduces a RAG method that grounds LLM predictions in structured interaction data. Result: Current LLMs capture collaborative patterns less well than MF models, but the RAG method substantially improves recommendation quality. Conclusion: RAG is a promising direction for LLM-based recommender systems. Abstract: User-item interactions contain rich collaborative signals that form the backbone of many successful recommender systems. While recent work has explored the use of large language models (LLMs) for recommendation, it remains unclear whether LLMs can effectively reason over this type of collaborative information. In this paper, we conduct a systematic comparison between LLMs and classical matrix factorization (MF) models to assess LLMs' ability to leverage user-item interaction data. We further introduce a simple retrieval-augmented generation (RAG) method that enhances LLMs by grounding their predictions in structured interaction data. Our experiments reveal that current LLMs often fall short in capturing collaborative patterns inherent to MF models, but that our RAG-based approach substantially improves recommendation quality, highlighting a promising direction for future LLM-based recommenders.

[337] Something's Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks

Allaa Boutaleb,Bernd Amann,Hubert Naacke,Rafael Angarita

Main category: cs.IR

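TL;DR: Current table union search (TUS) benchmarks have limitations that let simple baselines outperform sophisticated methods. The authors propose criteria that future benchmarks should meet for more realistic and reliable evaluation.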

Details

Motivation: Existing TUS benchmarks do not effectively assess semantic understanding and need to be improved to reflect method performance accurately. Method: Analyzes the problems in prominent TUS benchmarks and proposes key criteria for future ones. Result: Benchmark scores are heavily influenced by dataset-specific characteristics; simple baselines often outperform sophisticated methods, so the benchmarks do not isolate gains from semantic understanding. Conclusion: Future TUS benchmarks need to improve to evaluate progress in semantic table union search realistically and reliably. Abstract: Recent table representation learning and data discovery methods tackle table union search (TUS) within data lakes, which involves identifying tables that can be unioned with a given query table to enrich its content. These methods are commonly evaluated using benchmarks that aim to assess semantic understanding in real-world TUS tasks. However, our analysis of prominent TUS benchmarks reveals several limitations that allow simple baselines to perform surprisingly well, often outperforming more sophisticated approaches. This suggests that current benchmark scores are heavily influenced by dataset-specific characteristics and fail to effectively isolate the gains from semantic understanding. To address this, we propose essential criteria for future benchmarks to enable a more realistic and reliable evaluation of progress in semantic table union search.

cs.CR [Back]

[338] Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space

Yao Huang,Yitong Sun,Shouwei Ruan,Yichi Zhang,Yinpeng Dong,Xingxing Wei

Main category: cs.CR

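TL;DR: The paper proposes a framework based on the Elaboration Likelihood Model (ELM) and genetic optimization that expands the jailbreak strategy space and markedly raises attack success rates against LLMs, reporting over 90% success on Claude-3.5.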

Details

Motivation: Although existing jailbreak methods have improved through prompt engineering, their effectiveness is bounded by predefined strategy spaces and remains limited against safety-aligned models. Method: Decomposes jailbreak strategies into core components based on ELM theory and combines genetic optimization with an intention evaluation mechanism. Result: Achieves a jailbreak success rate above 90% on Claude-3.5, with cross-model transferability and high evaluation accuracy. Conclusion: Expanding the strategy space is key to stronger jailbreak attacks, and the framework points to an important direction for LLM safety research. Abstract: Large Language Models (LLMs), despite advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success remains limited against safety-aligned models, overlooking a more fundamental problem: the effectiveness is inherently bounded by the predefined strategy spaces. However, expanding this space presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of expanding the strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) theory and develops genetic-based optimization with intention evaluation mechanisms. Strikingly, our experiments reveal unprecedented jailbreak capabilities by expanding the strategy space: we achieve over 90% success rate on Claude-3.5 where prior methods completely fail, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: https://github.com/Aries-iai/CL-GSO.

cs.HC [Back]

[339] The Impact of a Chatbot's Ephemerality-Framing on Self-Disclosure Perceptions

Samuel Rhys Cox,Rune Møberg Jacobsen,Niels van Berkel

Main category: cs.HC

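TL;DR: The study examines how a chatbot's framing (Familiar vs. Stranger) affects users' self-disclosure: when emotional disclosure was sought first, Stranger-condition participants felt more comfortable, whereas when factual disclosure was sought first, Familiar-condition participants reported more enjoyment.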

Details

Motivation: Chatbots are increasingly used for self-disclosure, but how their framing affects users remains unclear. Method: A mixed factorial design compares Familiar and Stranger chatbots across emotional and factual disclosure. Result: Stranger framing was more comfortable when emotional disclosure came first, and Familiar framing more enjoyable when factual disclosure came first; qualitatively, Stranger afforded anonymity while Familiar needed rapport to be built. Conclusion: Chatbot framing should be matched to the type of disclosure sought. Abstract: Self-disclosure, the sharing of one's thoughts and feelings, is affected by the perceived relationship between individuals. While chatbots are increasingly used for self-disclosure, the impact of a chatbot's framing on users' self-disclosure remains under-explored. We investigated how a chatbot's description of its relationship with users, particularly in terms of ephemerality, affects self-disclosure. Specifically, we compared a Familiar chatbot, presenting itself as a companion remembering past interactions, with a Stranger chatbot, presenting itself as a new, unacquainted entity in each conversation. In a mixed factorial design, participants engaged with either the Familiar or Stranger chatbot in two sessions across two days, with one conversation focusing on Emotional- and another Factual-disclosure. When Emotional-disclosure was sought in the first chatting session, Stranger-condition participants felt more comfortable self-disclosing. However, when Factual-disclosure was sought first, these differences were replaced by more enjoyment among Familiar-condition participants. Qualitative findings showed Stranger afforded anonymity and reduced judgement, whereas Familiar sometimes felt intrusive unless rapport was built via low-risk Factual-disclosure.

Saharsh Barve,Andy Mao,Jiayue Melissa Shi,Prerna Juneja,Koustuv Saha

Main category: cs.HC

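TL;DR: The paper proposes a theory-driven bias detection rubric and a Social Stereotype Index (SSI) for evaluating social biases in text-to-image (T2I) generation models. An audit of three major T2I models finds that initial outputs often contain stereotypes. Prompt refinement significantly reduces bias, but a user study shows that users often perceive stereotypical images as better aligned with their expectations.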

Details

Motivation: T2I generation models have creative potential for visual content, but they often replicate and amplify societal stereotypes, raising ethical concerns. Method: Proposes the SSI and a bias detection rubric, audits the outputs of three T2I models, and applies LLM-based prompt refinement. Result: Prompt refinement significantly lowers the SSI (by 61% for geocultural, 69% for occupational, and 51% for adjectival queries), yet users often perceive stereotypical images as better aligned with their expectations. Conclusion: Ethical debiasing must be balanced with contextual relevance; the authors call for T2I systems that support global diversity and inclusivity while still reflecting real-world social complexity. Abstract: Recent advances in generative AI have enabled visual content creation through text-to-image (T2I) generation. However, despite their creative potential, T2I models often replicate and amplify societal stereotypes -- particularly those related to gender, race, and culture -- raising important ethical concerns. This paper proposes a theory-driven bias detection rubric and a Social Stereotype Index (SSI) to systematically evaluate social biases in T2I outputs. We audited three major T2I model outputs -- DALL-E-3, Midjourney-6.1, and Stability AI Core -- using 100 queries across three categories -- geocultural, occupational, and adjectival. Our analysis reveals that initial outputs are prone to include stereotypical visual cues, including gendered professions, cultural markers, and western beauty norms. To address this, we adopted our rubric to conduct targeted prompt refinement using LLMs, which significantly reduced bias -- SSI dropped by 61% for geocultural, 69% for occupational, and 51% for adjectival queries. We complemented our quantitative analysis through a user study examining perceptions, awareness, and preferences around AI-generated biased imagery. Our findings reveal a key tension -- although prompt refinement can mitigate stereotypes, it can limit contextual alignment. Interestingly, users often perceived stereotypical images to be more aligned with their expectations. We discuss the need to balance ethical debiasing with contextual relevance and call for T2I systems that support global diversity and inclusivity while not compromising the reflection of real-world social complexity.

[341] Learning Annotation Consensus for Continuous Emotion Recognition

Ibrahim Shoer,Engin Erzin

Main category: cs.HC

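TL;DR: Proposes a multi-annotator training approach for continuous emotion recognition that uses a consensus network to aggregate annotations and outperforms traditional single-label methods.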

Details

Motivation: Annotations from multiple annotators in affective computing datasets often disagree, and the traditional practice of merging them into a single label can discard valuable information. Method: A consensus network aggregates the multi-annotator annotations and guides the main predictor to better reflect the collective input. Result: Outperforms traditional single-label methods on the RECOLA and COGNIMUSE datasets. Conclusion: Multi-annotator data is beneficial for emotion recognition, and the approach applies to fields where annotations are abundant yet inconsistent. Abstract: In affective computing, datasets often contain multiple annotations from different annotators, which may lack full agreement. Typically, these annotations are merged into a single gold standard label, potentially losing valuable inter-rater variability. We propose a multi-annotator training approach for continuous emotion recognition (CER) that seeks a consensus across all annotators rather than relying on a single reference label. Our method employs a consensus network to aggregate annotations into a unified representation, guiding the main arousal-valence predictor to better reflect collective inputs. Tested on the RECOLA and COGNIMUSE datasets, our approach outperforms traditional methods that unify annotations into a single label. This underscores the benefits of fully leveraging multi-annotator data in emotion recognition and highlights its applicability across various fields where annotations are abundant yet inconsistent.

[342] Creativity in LLM-based Multi-Agent Systems: A Survey

Yi-Cheng Lin,Kang-Chieh Chen,Zhe-Yan Li,Tzu-Heng Wu,Tzu-Hsuan Wu,Kuan-Yu Chen,Hung-yi Lee,Yun-Nung Chen

Main category: cs.HC

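TL;DR: This is the first survey dedicated to creativity in multi-agent systems (MAS); it presents a structured framework covering agent proactivity, generation techniques, and open challenges.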

Details

Motivation: Existing surveys focus mainly on MAS infrastructures and overlook the dimension of creativity; this survey aims to fill that gap. Method: Focuses on text and image generation tasks and presents a taxonomy, an overview of generation techniques, and relevant datasets and evaluation metrics. Result: Summarizes a taxonomy of agent proactivity, generation techniques (such as divergent exploration and collaborative synthesis), and key challenges (such as inconsistent evaluation standards). Conclusion: The survey offers a structured framework and roadmap for the development, evaluation, and standardization of creative MAS. Abstract: Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. While existing surveys provide comprehensive overviews of MAS infrastructures, they largely overlook the dimension of creativity, including how novel outputs are generated and evaluated, how creativity informs agent personas, and how creative workflows are coordinated. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks. This survey offers a structured framework and roadmap for advancing the development, evaluation, and standardization of creative MAS.

cs.DB [Back]

[343] LazyVLM: Neuro-Symbolic Approach to Video Analytics

Xiangru Jian,Wei Pang,Zhengyuan Dong,Chao Zhang,M. Tamer Özsu

Main category: cs.DB

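TL;DR: LazyVLM is a neuro-symbolic video analytics system that combines a user-friendly query interface with efficient processing, addressing the trade-off between flexibility and efficiency in existing approaches.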

Details

Motivation: Existing video analytics approaches face a trade-off between flexibility and efficiency: end-to-end Vision Language Models (VLMs) are inefficient on long contexts, while neuro-symbolic methods depend on manual labeling and rigid rules. Method: LazyVLM supports complex multi-frame video queries through a semi-structured text interface, decomposes each query into fine-grained operations, and uses relational query execution and vector similarity search for efficiency. Result: LazyVLM is shown to be efficient, robust, and user-friendly for querying open-domain video data. Conclusion: LazyVLM offers an efficient and easy-to-use solution for large-scale video data analytics. Abstract: Current video analytics approaches face a fundamental trade-off between flexibility and efficiency. End-to-end Vision Language Models (VLMs) often struggle with long-context processing and incur high computational costs, while neural-symbolic methods depend heavily on manual labeling and rigid rule design. In this paper, we introduce LazyVLM, a neuro-symbolic video analytics system that provides a user-friendly query interface similar to VLMs, while addressing their scalability limitation. LazyVLM enables users to effortlessly drop in video data and specify complex multi-frame video queries using a semi-structured text interface for video analytics. To address the scalability limitations of VLMs, LazyVLM decomposes multi-frame video queries into fine-grained operations and offloads the bulk of the processing to efficient relational query execution and vector similarity search. We demonstrate that LazyVLM provides a robust, efficient, and user-friendly solution for querying open-domain video data at scale.
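
Illustration (not from the paper): the abstract describes offloading work to relational query execution and vector similarity search but gives no implementation details. The following is a minimal sketch of how those two steps can be combined, assuming a hypothetical `detections` table of per-frame entity ids and a hypothetical dictionary of entity embeddings. All names, the schema, and the toy data are assumptions made for illustration, not LazyVLM's actual design.

```python
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_entities(query_vec, embeddings, k=2):
    """Vector similarity step: ids of the k embeddings closest to query_vec."""
    ranked = sorted(
        embeddings,
        key=lambda eid: cosine(query_vec, embeddings[eid]),
        reverse=True,
    )
    return ranked[:k]


def frames_with_all(conn, entity_ids):
    """Relational step: frames in which every selected entity appears."""
    if not entity_ids:
        return []
    marks = ",".join("?" for _ in entity_ids)
    rows = conn.execute(
        f"SELECT frame FROM detections WHERE entity_id IN ({marks}) "
        "GROUP BY frame HAVING COUNT(DISTINCT entity_id) = ? ORDER BY frame",
        (*entity_ids, len(entity_ids)),
    ).fetchall()
    return [row[0] for row in rows]


# Toy data (hypothetical): 2-D embeddings and per-frame detections.
embeddings = {"car_1": [1.0, 0.0], "dog_1": [0.0, 1.0], "car_2": [0.9, 0.1]}
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detections (frame INTEGER, entity_id TEXT)")
conn.executemany(
    "INSERT INTO detections VALUES (?, ?)",
    [(1, "car_1"), (1, "dog_1"), (2, "car_1"), (2, "car_2"), (3, "car_2")],
)

cars = top_k_entities([1.0, 0.0], embeddings, k=2)  # entities most similar to the query
frames = frames_with_all(conn, cars)                # frames containing all of them
print(cars, frames)
```

The similarity search narrows the candidate entities cheaply, and the SQL query then handles the multi-frame co-occurrence logic without invoking a vision-language model on every frame.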

eess.AS [Back]

[344] Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset

Rui Liu,Pu Gao,Jiatian Xi,Berrak Sisman,Carlos Busso,Haizhou Li

Main category: eess.AS

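TL;DR: EmoCorrector is a post-correction scheme for text-based speech editing (TSE) that uses Retrieval-Augmented Generation (RAG) to address the emotional inconsistency of existing TSE methods.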

Details

Motivation: Existing TSE methods focus mainly on content accuracy and acoustic consistency and overlook the emotional shifts or inconsistencies introduced by text changes. Method: EmoCorrector extracts the emotional features of the edited text, retrieves speech samples with matching emotions, and synthesizes speech that matches the target emotion while preserving speaker identity and audio quality. Result: Experiments on the ECD-TSE dataset show that EmoCorrector significantly improves the expression of the target emotion and addresses the emotional inconsistency of existing TSE methods. Conclusion: EmoCorrector offers an effective solution for emotional consistency in TSE, and the ECD-TSE dataset supports further research in this area. Abstract: Text-based speech editing (TSE) modifies speech using only text, eliminating re-recording. However, existing TSE methods mainly focus on the content accuracy and acoustic consistency of synthetic speech segments, and often overlook the emotional shifts or inconsistency issues introduced by text changes. To address this issue, we propose EmoCorrector, a novel post-correction scheme for TSE. EmoCorrector leverages Retrieval-Augmented Generation (RAG) by extracting the edited text's emotional features, retrieving speech samples with matching emotions, and synthesizing speech that aligns with the desired emotion while preserving the speaker's identity and quality. To support the training and evaluation of emotional consistency modeling in TSE, we pioneer the benchmarking Emotion Correction Dataset for TSE (ECD-TSE). The prominent aspect of ECD-TSE is its inclusion of <text, speech> paired data featuring diverse text variations and a range of emotional expressions. Subjective and objective experiments and comprehensive analysis on ECD-TSE confirm that EmoCorrector significantly enhances the expression of intended emotion while addressing emotion inconsistency limitations in current TSE methods. Code and audio examples are available at https://github.com/AI-S2-Lab/EmoCorrector.

[345] PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems

Nima Sedghiyeh,Sara Sadeghi,Reza Khodadadi,Farzin Kashani,Omid Aghdaei,Somayeh Rahimi,Mohammad Sadegh Safari

Main category: eess.AS

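TL;DR: The paper introduces the Persian Speech Recognition Benchmark (PSRB) for evaluating ASR systems on low-resource languages such as Persian, and proposes a new error metric.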

Details

Motivation: Evaluating ASR systems for low-resource languages such as Persian remains challenging, and the paper aims to fill this gap. Method: Builds the PSRB benchmark, evaluates ten ASR systems, analyzes error types, and proposes a new metric that weights substitution errors. Result: ASR models perform well on standard Persian but struggle with regional accents, children's speech, and specific linguistic challenges. Conclusion: PSRB provides a resource for Persian ASR research and can serve as a framework for developing benchmarks in other low-resource languages. Abstract: Although Automatic Speech Recognition (ASR) systems have become an integral part of modern technology, their evaluation remains challenging, particularly for low-resource languages such as Persian. This paper introduces Persian Speech Recognition Benchmark (PSRB), a comprehensive benchmark designed to address this gap by incorporating diverse linguistic and acoustic conditions. We evaluate ten ASR systems, including state-of-the-art commercial and open-source models, to examine performance variations and inherent biases. Additionally, we conduct an in-depth analysis of Persian ASR transcriptions, identifying key error types and proposing a novel metric that weights substitution errors. This metric enhances evaluation robustness by reducing the impact of minor and partial errors, thereby improving the precision of performance assessment. Our findings indicate that while ASR models generally perform well on standard Persian, they struggle with regional accents, children's speech, and specific linguistic challenges. These results highlight the necessity of fine-tuning and incorporating diverse, representative training datasets to mitigate biases and enhance overall ASR performance. PSRB provides a valuable resource for advancing ASR research in Persian and serves as a framework for developing benchmarks in other low-resource languages. A subset of the PSRB dataset is publicly available at https://huggingface.co/datasets/PartAI/PSRB.
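
Illustration (not from the paper): the abstract says the proposed metric weights substitution errors so that minor and partial errors count less, but it does not give the formula. The sketch below shows the general idea under one assumed weighting: a substitution costs one minus the character-level similarity of the two words, while insertions and deletions keep a cost of 1. The function names and the use of `difflib.SequenceMatcher` are hypothetical choices for this example, not the PSRB definition.

```python
from difflib import SequenceMatcher


def substitution_cost(ref_word: str, hyp_word: str) -> float:
    """Assumed cost: 0.0 for identical words, approaching 1.0 for unrelated ones."""
    return 1.0 - SequenceMatcher(None, ref_word, hyp_word).ratio()


def weighted_wer(reference: str, hypothesis: str) -> float:
    """Word error rate in which substitutions are charged a partial cost."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = minimum cost of aligning ref[:i] with hyp[:j]
    dp = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = float(i)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = float(j)  # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,  # deletion
                dp[i][j - 1] + 1.0,  # insertion
                dp[i - 1][j - 1] + substitution_cost(ref[i - 1], hyp[j - 1]),
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

With this weighting, a near-miss such as a dropped suffix contributes only a fraction of an error, whereas standard WER would count it as a full substitution.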

Table of Contents

cs.CV [Back]

[1] ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking

[2] What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

[3] RetroMotion: Retrocausal Motion Forecasting Models are Instructable

[4] MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

[5] DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data

[6] CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting

[7] WeatherEdit: Controllable Weather Editing with 4D Gaussian Field

[8] ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image

[9] Electrolyzers-HSI: Close-Range Multi-Scene Hyperspectral Imaging Benchmark Dataset

[10] CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic

[11] A Feature-level Bias Evaluation Framework for Facial Expression Recognition Models

[12] MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning

[13] MultLFG: Training-free Multi-LoRA composition using Frequency-domain Guidance

[14] Causality and "In-the-Wild" Video-Based Person Re-ID: A Survey

[15] Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models

[16] Total-Editing: Head Avatar with Editable Appearance, Motion, and Lighting

[17] Be Decisive: Noise-Induced Layouts for Multi-Subject Generation

[18] OmniIndoor3D: Comprehensive Indoor 3D Reconstruction

[19] Mamba-Driven Topology Fusion for Monocular 3-D Human Pose Estimation

[20] Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

[21] Intelligent Incident Hypertension Prediction in Obstructive Sleep Apnea

[22] OccLE: Label-Efficient 3D Semantic Occupancy Prediction

[23] ConsiStyle: Style Diversity in Training-Free Consistent T2I Generation

[24] Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training

[25] TrustSkin: A Fairness Pipeline for Trustworthy Facial Affect Analysis Across Skin Tone

[26] Open-Det: An Efficient Learning Framework for Open-Ended Detection

[27] IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

[28] See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction

[29] HCQA-1.5 @ Ego4D EgoSchema Challenge 2025

[30] Scan-and-Print: Patch-level Data Summarization and Augmentation for Content-aware Layout Generation in Poster Design

[31] RoGA: Towards Generalizable Deepfake Detection through Robust Gradient Alignment

[32] Photography Perspective Composition: Towards Aesthetic Perspective Recommendation

[33] DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving

[34] Contrastive Desensitization Learning for Cross Domain Face Forgery Detection

[35] Supervised Contrastive Learning for Ordinal Engagement Measurement

[36] Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors

[37] VisAlgae 2023: A Dataset and Challenge for Algae Detection in Microscopy Images

[38] Temporal Saliency-Guided Distillation: A Scalable Framework for Distilling Video Datasets

[39] Beyond Entropy: Region Confidence Proxy for Wild Test-Time Adaptation

[40] Hierarchical Instruction-aware Embodied Visual Tracking

[41] MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

[42] VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Visual-Language Models

[43] LeDiFlow: Learned Distribution-guided Flow Matching to Accelerate Image Generation

[44] Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting

[45] MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition

[46] Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models

[47] PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

[48] ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval

[49] MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning

[50] TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs

[51] Breaking Dataset Boundaries: Class-Agnostic Targeted Adversarial Attacks

[52] Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models

[53] Rendering-Aware Reinforcement Learning for Vector Graphics Generation

[54] Not All Thats Rare Is Lost: Causal Paths to Rare Concept Synthesis

[55] Frame-Level Captions for Long Video Generation with Complex Multi Scenes

[56] Causality-Driven Infrared and Visible Image Fusion

[57] Fully Spiking Neural Networks for Unified Frame-Event Object Tracking

[58] ProBA: Probabilistic Bundle Adjustment with the Bhattacharyya Coefficient

[59] Exploring Timeline Control for Facial Motion Generation

[60] AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding

[61] In Context Learning with Vision Transformers: Case Study

[62] Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

[63] Stereo Radargrammetry Using Deep Learning from Airborne SAR Images

[64] YOLO-FireAD: Efficient Fire Detection via Attention-Guided Inverted Residual Learning and Dual-Pooling Feature Preservation

[65] Frequency Composition for Compressed and Domain-Adaptive Neural Networks

[66] Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation

[67] HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion

[68] Create Anything Anywhere: Layout-Controllable Personalized Diffusion Model for Multiple Subjects

[69] Geometry-Editable and Appearance-Preserving Object Compositon

[70] HuMoCon: Concept Discovery for Human Motion Understanding

[71] Good Enough: Is it Worth Improving your Label Quality?

[72] QwT-v2: Practical, Effective and Efficient Post-Training Quantization

[73] ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

[74] PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter

[75] DSOcc: Leveraging Depth Awareness and Semantic Aid to Boost Camera-Based 3D Semantic Occupancy Prediction

[76] OrienText: Surface Oriented Textual Image Generation

[77] RF4D:Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes

[78] DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization