cs.CV [Total: 161]
cs.GR [Total: 4]
cs.CL [Total: 101]
cs.IT [Total: 1]
cs.RO [Total: 6]
cs.MA [Total: 1]
physics.optics [Total: 1]
cs.LG [Total: 19]
quant-ph [Total: 1]
cs.CY [Total: 2]
eess.IV [Total: 15]
cs.IR [Total: 1]
cs.HC [Total: 2]
stat.ML [Total: 1]
eess.AS [Total: 2]
stat.CO [Total: 1]
physics.soc-ph [Total: 2]
cs.MM [Total: 1]
cs.AI [Total: 6]
cs.SD [Total: 4]
cs.CR [Total: 2]

cs.CV

[1] Enhancing Vision Transformer Explainability Using Artificial Astrocytes

Nicolas Echevarrieta-Catalan,Ana Ribas-Rodriguez,Francisco Cedron,Odelia Schwartz,Vanessa Aguiar-Pulido

Main category: cs.CV

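TL;DR: Proposes ViTA, a training-free method that uses artificial astrocytes to enhance the reasoning of a pretrained deep neural network and produce explanations that align more closely with human perception.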

Details

Motivation: Machine learning models achieve high precision but are hard to explain, and existing approaches to improving explainability have limited applicability. Method: A neuroscience-inspired, training-free approach, ViTA, is proposed and evaluated with Grad-CAM and Grad-CAM++. Result: ViTA improves the alignment of model explanations with human perception, with statistically significant gains over a standard ViT. Conclusion: ViTA offers an effective, training-free way to improve model explainability. Abstract: Machine learning models achieve high precision, but their decision-making processes often lack explainability. Furthermore, as model complexity increases, explainability typically decreases. Existing efforts to improve explainability primarily involve developing new eXplainable artificial intelligence (XAI) techniques or incorporating explainability constraints during training. While these approaches yield specific improvements, their applicability remains limited. In this work, we propose the Vision Transformer with artificial Astrocytes (ViTA). This training-free approach is inspired by neuroscience and enhances the reasoning of a pretrained deep neural network to generate more human-aligned explanations. We evaluated our approach employing two well-known XAI techniques, Grad-CAM and Grad-CAM++, and compared it to a standard Vision Transformer (ViT). Using the ClickMe dataset, we quantified the similarity between the heatmaps produced by the XAI techniques and a (human-aligned) ground truth. Our results consistently demonstrate that incorporating artificial astrocytes enhances the alignment of model explanations with human perception, leading to statistically significant improvements across all XAI techniques and metrics utilized.
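
The abstract quantifies the similarity between XAI heatmaps and a human-aligned ground truth but does not name the metrics here. As a minimal sketch, assuming Spearman rank correlation is one reasonable choice (the function names and the toy 2x2 maps below are illustrative, not taken from the paper), the comparison could look like this:

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def heatmap_rank_similarity(model_heatmap, human_heatmap):
    """Spearman correlation between two same-shaped 2D maps (lists of rows)."""
    flat_model = [v for row in model_heatmap for v in row]
    flat_human = [v for row in human_heatmap for v in row]
    return pearson(average_ranks(flat_model), average_ranks(flat_human))


# Hypothetical toy maps: both rank the top-right pixel as most important.
model_map = [[0.1, 0.9], [0.2, 0.4]]
human_map = [[0.0, 1.0], [0.3, 0.5]]
print(heatmap_rank_similarity(model_map, human_map))
```

A rank-based score ignores the absolute scale of the heatmap, which is convenient when Grad-CAM outputs and human click maps are normalized differently. A constant map has zero variance and would need special handling.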

[2] Do DeepFake Attribution Models Generalize?

Spiros Baxavanakis,Manos Schinas,Symeon Papadopoulos

Main category: cs.CV

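TL;DR: Examines the limitations of binary DeepFake detection, argues for the importance of multi-class attribution models, and tests experimentally whether contrastive methods improve cross-dataset generalization.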

Details

Motivation: The spread of DeepFake generation techniques threatens the authenticity of online information. Existing binary detection models ignore the differences between manipulation methods, and the scarcity of attribution models limits practicality and explainability. Method: Five backbone models are evaluated on six DeepFake datasets to compare the generalization of binary and multi-class models, measure the accuracy of attribution models on unseen datasets, and test the effect of contrastive methods. Result: Binary models generalize better, but larger models, contrastive methods, and higher-quality data improve the performance of attribution models. Conclusion: Attribution models have the potential to make detection more trustworthy and explainable; future work should focus on data quality and model optimization. Abstract: Recent advancements in DeepFake generation, along with the proliferation of open-source tools, have significantly lowered the barrier for creating synthetic media. This trend poses a serious threat to the integrity and authenticity of online information, undermining public trust in institutions and media. State-of-the-art research on DeepFake detection has primarily focused on binary detection models. A key limitation of these models is that they treat all manipulation techniques as equivalent, despite the fact that different methods introduce distinct artifacts and visual cues. Only a limited number of studies explore DeepFake attribution models, although such models are crucial in practical settings. By providing the specific manipulation method employed, these models could enhance both the perceived trustworthiness and explainability for end users. In this work, we leverage five state-of-the-art backbone models and conduct extensive experiments across six DeepFake datasets. First, we compare binary and multi-class models in terms of cross-dataset generalization. Second, we examine the accuracy of attribution models in detecting seen manipulation methods in unknown datasets, hence uncovering data distribution shifts on the same DeepFake manipulations. Last, we assess the effectiveness of contrastive methods in improving cross-dataset generalization performance. Our findings indicate that while binary models demonstrate better generalization abilities, larger models, contrastive methods, and higher data quality can lead to performance improvements in attribution models. The code of this work is available on GitHub.

[3] CIM-NET: A Video Denoising Deep Neural Network Model Optimized for Computing-in-Memory Architectures

Shan Gao,Zhiqiang Wu,Yawen Niu,Xiaotao Li,Qingqing Xu

Main category: cs.CV

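TL;DR: Proposes a hardware-algorithm co-design framework, CIM-NET, which combines an architecture optimized for large-receptive-field operation with a pseudo-convolutional operator, CIM-CONV, to sharply reduce matrix-vector multiplication operations and speed up inference on CIM chips while maintaining denoising performance.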

Details

Motivation: Existing DNN models are designed without considering CIM architectural constraints, which limits their acceleration potential on edge devices. Method: The CIM-NET architecture and the pseudo-convolutional operator CIM-CONV are proposed, combining slide-based processing with fully connected transformations for feature extraction and reconstruction. Result: CIM-NET reduces MVM operations to 1/77 of the original with only a slight drop in PSNR (35.11 dB vs. 35.56 dB). Conclusion: Hardware-algorithm co-design substantially improves inference efficiency on CIM chips while keeping performance competitive. Abstract: While deep neural network (DNN)-based video denoising has demonstrated significant performance, deploying state-of-the-art models on edge devices remains challenging due to stringent real-time and energy efficiency requirements. Computing-in-Memory (CIM) chips offer a promising solution by integrating computation within memory cells, enabling rapid matrix-vector multiplication (MVM). However, existing DNN models are often designed without considering CIM architectural constraints, thus limiting their acceleration potential during inference. To address this, we propose a hardware-algorithm co-design framework incorporating two innovations: (1) a CIM-Aware Architecture, CIM-NET, optimized for large receptive field operation and CIM's crossbar-based MVM acceleration; and (2) a pseudo-convolutional operator, CIM-CONV, used within CIM-NET to integrate slide-based processing with fully connected transformations for high-quality feature extraction and reconstruction. This framework significantly reduces the number of MVM operations, improving inference speed on CIM chips while maintaining competitive performance. Experimental results indicate that, compared to the conventional lightweight model FastDVDnet, CIM-NET substantially reduces MVM operations with a slight decrease in denoising performance. With a stride value of 8, CIM-NET reduces MVM operations to 1/77th of the original, while maintaining competitive PSNR (35.11 dB vs. 35.56 dB
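
The PSNR figures above follow the standard definition, PSNR = 10 * log10(MAX^2 / MSE). A minimal sketch of that formula, assuming frames stored as nested lists with pixel values in [0, 1] (the `psnr` helper and the toy frames are illustrative, not the paper's evaluation code):

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped 2D frames."""
    flat_ref = [v for row in reference for v in row]
    flat_est = [v for row in estimate for v in row]
    mse = sum((r - e) ** 2 for r, e in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 1x2 frame: each pixel is off by 0.1, so MSE is about 0.01.
clean = [[0.0, 1.0]]
denoised = [[0.1, 0.9]]
print(psnr(clean, denoised))
```

Because the scale is logarithmic, the reported 0.45 dB gap (35.56 dB vs. 35.11 dB) corresponds to roughly 11% higher mean squared error, since 10^(0.45/10) is about 1.11.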

[4] Learning Shared Representations from Unpaired Data

Amitai Yacobi,Nir Ben-Ari,Ronen Talmon,Uri Shaham

Main category: cs.CV

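TL;DR: Proposes a method that learns shared representations almost exclusively from unpaired data, using spectral embeddings, and performs well on multimodal tasks.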

Details

Motivation: Current multimodal representation learning methods depend on paired data, which is hard to obtain, so the paper explores how to learn shared representations from unpaired data. Method: Spectral embeddings of random walk matrices, constructed independently from each unimodal representation, are used to learn a shared representation. Result: The method performs well in computer vision and natural language processing, supporting retrieval, generation, arithmetic, zero-shot, and cross-domain classification. Conclusion: The work is the first to show that a general cross-modal embedding can be learned almost exclusively from unpaired data, with broad potential applications. Abstract: Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations, demonstrating high capabilities in retrieval tasks, generation, arithmetic, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our code is publicly available at https://github.com/shaham-lab/SUE.

[5] UniDB++: Fast Sampling of Unified Diffusion Bridge

Mokai Pan,Kaizhen Zhu,Yuexin Ma,Yanwei Fu,Jingyi Yu,Jingya Wang,Ye Shi

Main category: cs.CV

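TL;DR: UniDB++ is a training-free sampling algorithm that improves the computational efficiency and generation quality of the UniDB framework; closed-form solutions and an SDE-Corrector mechanism enable high-quality generation with up to 20× fewer sampling steps.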

Details

Motivation: The UniDB framework achieves high-fidelity image generation, but its reliance on iterative sampling makes inference computationally expensive, and existing acceleration techniques do not address its specific challenges. Method: UniDB++ derives closed-form solutions for UniDB's reverse-time SDEs to reduce error accumulation, and introduces a data prediction model and an SDE-Corrector mechanism. Result: Experiments show that UniDB++ performs strongly in image restoration, surpassing Euler-based methods in both fidelity and speed while significantly reducing inference time. Conclusion: UniDB++ reconciles theoretical generality with practical efficiency in SOC-driven diffusion bridge models. Abstract: Diffusion Bridges enable transitions between arbitrary distributions, with the Unified Diffusion Bridge (UniDB) framework achieving high-fidelity image generation via a Stochastic Optimal Control (SOC) formulation. However, UniDB's reliance on iterative Euler sampling methods results in slow, computationally expensive inference, while existing acceleration techniques for diffusion or diffusion bridge models fail to address its unique challenges: missing terminal mean constraints and SOC-specific penalty coefficients in its SDEs. We present UniDB++, a training-free sampling algorithm that significantly improves upon these limitations. The method's key advancement comes from deriving exact closed-form solutions for UniDB's reverse-time SDEs, effectively reducing the error accumulation inherent in Euler approximations and enabling high-quality generation with up to 20× fewer sampling steps. This method is further complemented by replacing conventional noise prediction with a more stable data prediction model, along with an SDE-Corrector mechanism that maintains perceptual quality for low-step regimes (5-10 steps). Additionally, we demonstrate that UniDB++ aligns with existing diffusion bridge acceleration methods by evaluating their update rules, and UniDB++ can recover DBIMs as special cases under some theoretical conditions. Experiments demonstrate UniDB++'s state-of-the-art performance in image restoration tasks, outperforming Euler-based methods in fidelity and speed while reducing inference time significantly. This work bridges the gap between theoretical generality and practical efficiency in SOC-driven diffusion bridge models. Our code is available at https://github.com/2769433owo/UniDB-plusplus.
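
The contrast between Euler stepping and a closed-form solution can be illustrated on a toy problem. The sketch below is not UniDB's reverse-time SDE: it assumes a simple mean-reverting ODE, dx/dt = -theta * (x - mu), chosen only because its exact solution is known, and the function names are hypothetical.

```python
import math


def exact_solution(x0, mu, theta, t):
    """Closed-form solution of dx/dt = -theta * (x - mu) at time t."""
    return mu + (x0 - mu) * math.exp(-theta * t)


def euler_solution(x0, mu, theta, t, num_steps):
    """Explicit Euler approximation of the same ODE with num_steps steps."""
    step = t / num_steps
    x = x0
    for _ in range(num_steps):
        x += step * (-theta * (x - mu))  # each step adds discretization error
    return x


x0, mu, theta, horizon = 1.0, 0.0, 2.0, 1.0
truth = exact_solution(x0, mu, theta, horizon)
for steps in (5, 20, 100):
    error = abs(euler_solution(x0, mu, theta, horizon, steps) - truth)
    print(steps, error)
```

With few steps the Euler error is large and only shrinks as the step count grows, whereas the closed form reaches the endpoint in a single evaluation. This is the general reason an exact solution can permit far fewer sampling steps.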

[6] How Much Do Large Language Models Know about Human Motion? A Case Study in 3D Avatar Control

Kunhang Li,Jason Naradowsky,Yansong Feng,Yusuke Miyao

Main category: cs.CV

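TL;DR: Probes the human motion knowledge of large language models (LLMs) through 3D avatar control, finding that LLMs are good at high-level motion planning but struggle with precise body part positioning and multi-step movements.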

Details

Motivation: The study tests how well LLMs understand and generate human motion instructions, using 3D avatar animation as the means of verification. Method: (1) LLMs generate a high-level movement plan; (2) body part positions are specified for each step; (3) animations are produced by linear interpolation; (4) 20 representative motion instructions are designed for comprehensive evaluation. Result: LLMs interpret high-level movements and conceptualize creative motions well, but perform poorly on precise spatial positioning and multi-step movement planning. Conclusion: LLMs show promise in human motion knowledge but need better handling of precise spatial and temporal parameters. Abstract: We explore the human motion knowledge of Large Language Models (LLMs) through 3D avatar control. Given a motion instruction, we prompt LLMs to first generate a high-level movement plan with consecutive steps (High-level Planning), then specify body part positions in each step (Low-level Planning), which we linearly interpolate into avatar animations as a clear verification lens for human evaluators. Through 20 carefully designed representative motion instructions with full coverage of basic movement primitives and balanced body part usage, we conduct comprehensive evaluations including human assessment of both generated animations and high-level movement plans, as well as automatic comparison with oracle positions in low-level planning. We find that LLMs are strong at interpreting the high-level body movements but struggle with precise body part positioning. While breaking down motion queries into atomic components improves planning performance, LLMs have difficulty with multi-step movements involving high-degree-of-freedom body parts. Furthermore, LLMs provide reasonable approximation for general spatial descriptions, but fail to handle precise spatial specifications in text, and the precise spatial-temporal parameters needed for avatar control. Notably, LLMs show promise in conceptualizing creative motions and distinguishing culturally-specific motion patterns.
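
The step from per-step body part positions to an animation is plain linear interpolation. A minimal sketch, assuming each planning step is a dictionary mapping a body part name to an (x, y, z) position (the data layout, the `left_hand` key, and the frame count are illustrative assumptions, not the paper's format):

```python
def lerp(start, end, t):
    """Linearly interpolate between two 3D points for t in [0, 1]."""
    return tuple(a + (b - a) * t for a, b in zip(start, end))


def interpolate_keyframes(keyframes, frames_per_segment):
    """Expand a non-empty list of keyframes into a dense frame sequence.

    Every keyframe must contain the same body part keys.
    """
    frames = []
    for current, following in zip(keyframes, keyframes[1:]):
        for i in range(frames_per_segment):
            t = i / frames_per_segment
            frames.append(
                {part: lerp(current[part], following[part], t) for part in current}
            )
    frames.append(dict(keyframes[-1]))  # end exactly on the last keyframe
    return frames


# Hypothetical two-step plan: the left hand moves from the origin to (1, 2, 4).
plan = [{"left_hand": (0.0, 0.0, 0.0)}, {"left_hand": (1.0, 2.0, 4.0)}]
for frame in interpolate_keyframes(plan, frames_per_segment=4):
    print(frame)
```

Linear interpolation keeps the rendering step free of learned components, so any error visible in the animation can be attributed to the positions the LLM produced.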

[7] EvidenceMoE: A Physics-Guided Mixture-of-Experts with Evidential Critics for Advancing Fluorescence Light Detection and Ranging in Scattering Media

Ismail Erbas,Ferhat Demirkiran,Karthik Swaminathan,Naigang Wang,Navid Ibtehaj Nizam,Stefan T. Radev,Kaoutar El Maghraoui,Xavier Intes,Vikas Pandey

Main category: cs.CV

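TL;DR: Proposes a physics-guided Mixture-of-Experts (MoE) framework that addresses the computational challenges of fluorescence LiDAR in scattering media and improves the accuracy of depth and fluorescence lifetime estimation.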

Details

Motivation: In scattering media, fluorescence LiDAR signals are complex, and photon time-of-flight is hard to separate from fluorescence lifetime; existing methods have limited effectiveness. Method: A physics-guided MoE framework combines Evidence-Based Dirichlet Critics (EDCs) with a Decider Network that adaptively fuses expert predictions. Result: On simulated fluorescence LiDAR data, the NRMSE is 0.030 for depth estimation and 0.074 for fluorescence lifetime. Conclusion: The method performs well in scattering media and offers a more reliable solution for fluorescence LiDAR applications. Abstract: Fluorescence LiDAR (FLiDAR), a Light Detection and Ranging (LiDAR) technology employed for distance and depth estimation across medical, automotive, and other fields, encounters significant computational challenges in scattering media. The complex nature of the acquired FLiDAR signal, particularly in such environments, makes isolating photon time-of-flight (related to target depth) and intrinsic fluorescence lifetime exceptionally difficult, thus limiting the effectiveness of current analytical and computational methodologies. To overcome this limitation, we present a Physics-Guided Mixture-of-Experts (MoE) framework tailored for specialized modeling of diverse temporal components. In contrast to conventional MoE approaches, our expert models are informed by underlying physics, such as the radiative transport equation governing photon propagation in scattering media. Central to our approach is EvidenceMoE, which integrates Evidence-Based Dirichlet Critics (EDCs). These critic models assess the reliability of each expert's output by providing per-expert quality scores and corrective feedback. A Decider Network then leverages this information to adaptively fuse expert predictions into a robust final estimate. We validate our method using realistically simulated Fluorescence LiDAR (FLiDAR) data for non-invasive cancer cell depth detection generated from photon transport models in tissue. Our framework demonstrates strong performance, achieving a normalized root mean squared error (NRMSE) of 0.030 for depth estimation and 0.074 for fluorescence lifetime.
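
NRMSE has several common normalizations, and the abstract does not say which one is used. The sketch below assumes normalization by the range of the ground-truth values; dividing by the mean or the standard deviation are equally common alternatives and would give different numbers. The `nrmse` helper and the sample values are illustrative.

```python
import math


def nrmse(predictions, targets):
    """RMSE divided by the range (max - min) of the target values."""
    n = len(targets)
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)
    value_range = max(targets) - min(targets)
    if value_range == 0:
        raise ValueError("targets are constant; range normalization is undefined")
    return rmse / value_range


# Hypothetical depths in millimetres: each prediction is off by 1 over a 10 mm range.
print(nrmse([1.0, 9.0], [0.0, 10.0]))
```

Under this convention, an NRMSE of 0.030 would mean a typical error of about 3% of the span of the true values.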

[8] Self-Organizing Visual Prototypes for Non-Parametric Representation Learning

Thalles Silva,Helio Pedrini,Adín Ramírez Rivera

Main category: cs.CV

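TL;DR: SOP is a new technique for unsupervised visual feature learning that represents each prototype with multiple support embeddings (SEs) to improve training performance.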

Details

Motivation: Existing self-supervised learning methods rely on a single prototype to encode the features of the data, which limits performance. SOP aims to characterize those features more fully through multiple support embeddings. Method: The SOP strategy represents a prototype with multiple semantically similar support embeddings, and introduces non-parametric loss functions and the SOP-MIM task. Result: SOP pre-trained encoders are evaluated on retrieval, linear evaluation, fine-tuning, and object detection, and reach state-of-the-art performance on many retrieval benchmarks. Conclusion: The SOP strategy improves unsupervised visual feature learning, with gains that grow for more complex encoders. Abstract: We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.

[9] Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement

Yuxin Ren,Maxwell D Collins,Miao Hu,Huanrui Yang

Main category: cs.CV

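TL;DR: The FAR framework replaces the attention mechanism in pretrained transformers with sequence modules such as LSTMs, improving inference efficiency while preserving model performance.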

Details

Motivation: Transformer attention is inefficient at inference time, especially on resource-constrained devices. Observed redundancy in attention at inference suggests it can be replaced by a simpler function. Method: The FAR framework replaces attention modules with multi-head LSTMs and optimizes the model with block-wise distillation and global structural pruning. Result: Validated on DeiT vision transformers, FAR matches the accuracy of the original models on ImageNet and downstream tasks while reducing parameters and latency. Conclusion: FAR preserves the semantic relationships learned by the attention modules, demonstrating that complex attention can be replaced by a simpler function. Abstract: While transformers excel across vision and language pretraining tasks, their reliance on attention mechanisms poses challenges for inference efficiency, especially on edge and embedded accelerators with limited parallelism and memory bandwidth. Hinted by the observed redundancy of attention at inference time, we hypothesize that though the model learns complicated token dependency through pretraining, the inference-time sequence-to-sequence mapping in each attention layer is actually "simple" enough to be represented with a much cheaper function. In this work, we explore FAR, a Function-preserving Attention Replacement framework that replaces all attention blocks in pretrained transformers with learnable sequence-to-sequence modules, exemplified by an LSTM. FAR optimizes a multi-head LSTM architecture with a block-wise distillation objective and a global structural pruning framework to achieve a family of efficient LSTM-based models from pretrained transformers. We validate FAR on the DeiT vision transformer family and demonstrate that it matches the accuracy of the original models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships and the token-to-token correlation learned in the transformer's attention module.

[10] Caption This, Reason That: VLMs Caught in the Middle

Zihan Weng,Lucas Gomez,Taylor Whittington Webb,Pouya Bashivan

Main category: cs.CV

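TL;DR: Analyzes the cognitive limitations of vision-language models (VLMs), evaluates them with methods from cognitive science, and identifies directions for improvement.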

Details

Motivation: VLMs have made notable progress in visual understanding but still lag behind humans on tasks such as counting and relational reasoning. The study aims to expose their cognitive limitations and propose improvements. Method: Methods from cognitive science are used to evaluate VLMs along the core cognitive axes of perception, attention, and memory, and a vision-text decoupling analysis probes the causes of failure. Result: VLMs show a significant gap on tasks requiring spatial understanding or selective attention, but reasoning over their own generated text improves performance. Fine-tuning smaller VLMs improves core cognitive abilities. Conclusion: The study gives a detailed analysis of VLM cognitive strengths and weaknesses, identifies directions for improvement, and shows that fine-tuning is effective. Abstract: Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g. category identification), a significant gap persists, particularly in tasks requiring spatial understanding or selective attention. Investigating the source of these failures and potential methods for improvement, we employ a vision-text decoupling analysis, finding that models struggling with direct visual reasoning show marked improvement when reasoning over their own generated text captions. These experiments reveal a strong need for improved VLM Chain-of-Thought (CoT) abilities, even in models that consistently exceed human performance. Furthermore, we demonstrate the potential of targeted fine-tuning on composite visual reasoning tasks and show that fine-tuning smaller VLMs substantially improves core cognitive abilities. While this improvement does not translate to large enhancements on challenging, out-of-distribution benchmarks, we show broadly that VLM performance on our datasets strongly correlates with performance on these other benchmarks. Our work provides a detailed analysis of VLM cognitive strengths and weaknesses and identifies key bottlenecks in simultaneous perception and reasoning while also providing an effective and simple solution.

[11] Equivariant Flow Matching for Point Cloud Assembly

Ziming Wang,Nan Xue,Rebecka Jörnsten

Main category: cs.CV

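TL;DR: Proposes Eda, an equivariant solver based on flow matching models for point cloud assembly, which learns equivariant distributions efficiently and handles non-overlapping inputs.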

Details

Motivation: Point cloud assembly aims to reconstruct a complete 3D shape by aligning multiple point cloud pieces; existing methods perform poorly when the input pieces do not overlap. Method: The Eda model learns vector fields conditioned on the input pieces to obtain equivariant distributions, and an equivariant path is constructed to improve training efficiency. Result: Eda performs well on practical datasets and can handle non-overlapping input pieces. Conclusion: Eda is an efficient and robust point cloud assembly method suited to complex scenarios. Abstract: The goal of point cloud assembly is to reconstruct a complete 3D shape by aligning multiple point cloud pieces. This work presents a novel equivariant solver for assembly tasks based on flow matching models. We first theoretically show that the key to learning equivariant distributions via flow matching is to learn related vector fields. Based on this result, we propose an assembly model, called equivariant diffusion assembly (Eda), which learns related vector fields conditioned on the input pieces. We further construct an equivariant path for Eda, which guarantees high data efficiency of the training process. Our numerical results show that Eda is highly competitive on practical datasets, and it can even handle the challenging situation where the input pieces are non-overlapping.

[12] DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers

Zitong Wang,Hang Zhao,Qianyu Zhou,Xuequan Lu,Xiangtai Li,Yiren Song

Main category: cs.CV

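TL;DR: Introduces a new task, layer-wise decomposition of alpha-composited images, together with the DiffDecompose framework and the AlphaBlend dataset, to address the challenges of decomposing transparent and semi-transparent layers.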

Details

Motivation: Existing image decomposition methods are limited when decomposing transparent or semi-transparent layers, owing to their dependence on mask priors, static object assumptions, and the lack of datasets. Method: The DiffDecompose framework, built on a diffusion Transformer, predicts layers through In-Context Decomposition and Layer Position Encoding Cloning. Result: Experiments on the AlphaBlend and LOGO datasets verify the effectiveness of DiffDecompose. Conclusion: DiffDecompose and the AlphaBlend dataset provide an effective solution for transparent and semi-transparent layer decomposition. Abstract: Diffusion models have recently motivated great success in many generation tasks like object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large-scale and high-quality dataset for transparent and semi-transparent layer decomposition, supporting six real-world subtasks (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In-Context Decomposition, enabling the model to predict one or multiple layers without per-layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel-level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and public LOGO dataset verify the effectiveness of DiffDecompose. The code and dataset will be made available upon paper acceptance at https://github.com/Wangzt1121/DiffDecompose.
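
The layer ambiguity mentioned above can be seen even in the simplest compositing model. The sketch below assumes the standard linear "over" operator, C = alpha * F + (1 - alpha) * B, applied per pixel; the paper targets more general, non-linear occlusion, so this is a simplified illustration only, and the `composite_over` helper is hypothetical.

```python
def composite_over(foreground, background, alpha):
    """Blend one RGB foreground pixel over a background pixel with opacity alpha."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(foreground, background))


black = (0.0, 0.0, 0.0)

# Two different (foreground, alpha) pairs over the same black background...
half_transparent_red = composite_over((1.0, 0.0, 0.0), black, alpha=0.5)
opaque_dark_red = composite_over((0.5, 0.0, 0.0), black, alpha=1.0)

# ...produce the same observed pixel, so the decomposition is not unique.
print(half_transparent_red, opaque_dark_red)
```

Since one composite admits many valid decompositions, modelling a posterior over layer decompositions, as the abstract describes, is a natural fit compared with regressing a single alpha matte.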

[13] Vision Meets Language: A RAG-Augmented YOLOv8 Framework for Coffee Disease Diagnosis and Farmer Assistance

Semanto Mondal

Main category: cs.CV

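TL;DR: Proposes a hybrid approach that combines object detection, a large language model, and retrieval-augmented generation for coffee leaf disease detection and remediation, aiming to optimize agricultural resource use and reduce environmental impact.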

Details

Motivation: Traditional farming is inefficient and harmful to the environment, so precision agriculture technologies are needed to optimize resource use. Method: YOLOv8, NLP, and RAG are combined into an AI system that detects diseases and suggests remediation. Result: The system detects diseases in real time and proposes environmentally friendly remediation that reduces pesticide use. Conclusion: The framework is scalable and user-friendly and could be applied widely in agriculture. Abstract: As social beings, we have an intimate bond with the environment. Many aspects of human life, such as lifestyle, health, and food, depend on the environment and agriculture, and it is our responsibility to support both. However, traditional farming practices often result in inefficient resource use and environmental challenges. To address these issues, precision agriculture has emerged as a promising approach that leverages advanced technologies to optimise agricultural processes. In this work, a hybrid approach is proposed that combines three fields of AI: object detection, large language models (LLMs), and Retrieval-Augmented Generation (RAG). In this novel framework, vision and language models work together to identify potential diseases in tree leaves. This study introduces a novel AI-based precision agriculture system that uses Retrieval-Augmented Generation (RAG) to provide context-aware diagnoses, and natural language processing (NLP) and YOLOv8 for crop disease detection. The system aims to tackle major issues with LLMs, especially hallucinations, and allows for adaptive treatment plans and real-time disease detection. It provides farmers with an easy-to-use interface for detecting diseases of coffee leaves: after the farmer submits an image of the affected leaf, the model detects the disease and suggests potential remediation methods, which aim to lower the use of pesticides, preserve livelihoods, and encourage environmentally friendly practices. With an emphasis on scalability, dependability, and user-friendliness, the project intends to improve RAG-integrated object detection systems for wider agricultural applications in the future.

[14] Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation

Chika Maduabuchi,Hao Chen,Yujin Han,Jindong Wang

Main category: cs.CV

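TL;DR: CAT-LVDM improves the robustness of latent video diffusion models through data-aligned noise injection, reducing semantic drift and temporal incoherence.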

Details

Motivation: Existing LVDMs are sensitive to imperfect conditioning, which causes semantic drift and temporal incoherence, especially on noisy web video-text data. Method: The CAT-LVDM framework includes Batch-Centered Noise Injection (BCNI) and Spectrum-Aware Contextual Noise (SACN), which target temporal consistency and low-frequency smoothness respectively. Result: BCNI reduces FVD by 31.9% on average across WebVid-2M, MSR-VTT, and MSVD, and SACN yields a 12.3% improvement on UCF-101. Conclusion: CAT-LVDM provides theoretical grounding and a practical method for robust video diffusion training under multimodal noise. Abstract: Latent Video Diffusion Models (LVDMs) achieve high-quality generation but are sensitive to imperfect conditioning, which causes semantic drift and temporal incoherence on noisy, web-scale video-text datasets. We introduce CAT-LVDM, the first corruption-aware training framework for LVDMs that improves robustness through structured, data-aligned noise injection. Our method includes Batch-Centered Noise Injection (BCNI), which perturbs embeddings along intra-batch semantic directions to preserve temporal consistency. BCNI is especially effective on caption-rich datasets like WebVid-2M, MSR-VTT, and MSVD. We also propose Spectrum-Aware Contextual Noise (SACN), which injects noise along dominant spectral directions to improve low-frequency smoothness, showing strong results on UCF-101. On average, BCNI reduces FVD by 31.9% across WebVid-2M, MSR-VTT, and MSVD, while SACN yields a 12.3% improvement on UCF-101. Ablation studies confirm the benefit of low-rank, data-aligned noise. Our theoretical analysis further explains how such perturbations tighten entropy, Wasserstein, score-drift, mixing-time, and generalization bounds. CAT-LVDM establishes a principled, scalable training approach for robust video diffusion under multimodal noise. Code and models: https://github.com/chikap421/catlvdm

[15] Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing

Weixing Wang,Zifeng Ding,Jindong Gu,Rui Cao,Christoph Meinel,Gerard de Melo,Haojin Yang

Main category: cs.CV

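TL;DR: Investigates the source of hallucination in large vision-language models (LVLMs) with discrete image tokenizers, analyzing token co-occurrence with a graph neural network and contrastive learning, and proposes a mitigation method that edits latent image embeddings.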

Details

Motivation: LVLMs unify multimodal representations through discrete image tokenizers but still hallucinate. The authors hypothesize that this stems from token co-occurrence associations induced by visual priors during training. Method: A token co-occurrence graph is built, tokens are clustered using a GNN with contrastive learning, and the influence of hallucination-prone tokens is suppressed by modifying latent image embeddings. Result: Experiments show that the method reduces hallucinations while preserving the model's expressivity. Conclusion: By analyzing token co-occurrence relations, the proposed method mitigates hallucination in LVLMs. Abstract: Large Vision-Language Models (LVLMs) with discrete image tokenizers unify multimodal representations by encoding visual inputs into a finite set of tokens. Despite their effectiveness, we find that these models still hallucinate non-existent objects. We hypothesize that this may be due to visual priors induced during training: When certain image tokens frequently co-occur in the same spatial regions and represent shared objects, they become strongly associated with the verbalizations of those objects. As a result, the model may hallucinate by evoking visually absent tokens that often co-occur with present ones. To test this assumption, we construct a co-occurrence graph of image tokens using a segmentation dataset and employ a Graph Neural Network (GNN) with contrastive learning followed by a clustering method to group tokens that frequently co-occur in similar visual contexts. We find that hallucinations predominantly correspond to clusters whose tokens dominate the input, and more specifically, that the visually absent tokens in those clusters show much higher correlation with hallucinated objects compared to tokens present in the image. Based on this observation, we propose a hallucination mitigation method that suppresses the influence of visually absent tokens by modifying latent image embeddings during generation. Experiments show our method reduces hallucinations while preserving expressivity. Code is available at https://github.com/weixingW/CGC-VTD/tree/main

[16] Distill CLIP (DCLIP)

Daniel Csizmadia,Andrei Codreanu,Victor Sim,Vighnesh Prabeau,Michael Lu,Kevin Zhu,Sean O'Brien,Vasu Sharma

Main category: cs.CV

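TL;DR: DCLIP is a fine-tuned variant of CLIP that uses a teacher-student distillation framework to improve multimodal image-text retrieval while preserving zero-shot classification performance.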

Details

Motivation: CLIP models are constrained by fixed image resolutions and limited context, which makes them ill-suited to retrieval tasks that require fine-grained cross-modal understanding. Method: A cross-modal transformer teacher produces enriched embeddings through bidirectional cross-attention and guides the training of a lightweight student model with a loss that combines contrastive learning and cosine similarity objectives. Result: Trained on only a small amount of data, DCLIP significantly improves retrieval metrics (Recall@K, MAP) while retaining about 94% of CLIP's zero-shot classification performance. Conclusion: DCLIP balances task specialization and generalization, offering a resource-efficient, domain-adaptive, and detail-sensitive solution for advanced vision-language tasks. Abstract: We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. These semantically and spatially aligned global representations guide the training of a lightweight student model using a hybrid loss that combines contrastive learning and cosine similarity objectives. Despite being trained on only ~67,500 samples curated from MSCOCO, Flickr30k, and Conceptual Captions (just a fraction of CLIP's original dataset), DCLIP significantly improves image-text retrieval metrics (Recall@K, MAP), while retaining approximately 94% of CLIP's zero-shot classification performance. These results demonstrate that DCLIP effectively mitigates the trade-off between task specialization and generalization, offering a resource-efficient, domain-adaptive, and detail-sensitive solution for advanced vision-language tasks. Code available at https://anonymous.4open.science/r/DCLIP-B772/README.md.
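
Recall@K, one of the retrieval metrics cited above, is the fraction of queries whose correct match appears among the top K retrieved candidates. A minimal sketch, assuming one correct candidate per query and a precomputed similarity matrix (the `recall_at_k` helper and the toy scores are illustrative, not DCLIP's evaluation code):

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose true match ranks within the top k candidates.

    similarity[q][c] is the score between query q and candidate c;
    ground_truth[q] is the index of the correct candidate for query q.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if ground_truth[query_idx] in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical text-to-image scores for 3 captions against 3 images.
scores = [
    [0.9, 0.1, 0.3],  # caption 0: correct image ranked 1st
    [0.2, 0.4, 0.8],  # caption 1: correct image ranked 2nd
    [0.5, 0.6, 0.1],  # caption 2: correct image ranked 3rd
]
matches = [0, 1, 2]
for k in (1, 2, 3):
    print(k, recall_at_k(scores, matches, k))
```

Datasets such as MSCOCO and Flickr30k pair each image with several captions, so a full evaluation would accept any of the valid matches. The single-match version here keeps the idea visible.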

[17] Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts

Hee-Seon Kim,Minbeom Kim,Wonjun Lee,Kihyun Kim,Changick Kim

Main category: cs.CV

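TL;DR: Proposes a new Benign-to-Toxic (B2T) attack paradigm that optimizes adversarial images to induce toxic outputs from benign inputs, overcoming the limitations of the conventional Toxic-Continuation approach.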

Details

Motivation: The conventional Toxic-Continuation approach works poorly when explicit toxic signals are absent, so a new method is needed to expose safety vulnerabilities in multimodal models. Method: The B2T paradigm optimizes adversarial images to induce toxic outputs from benign inputs, without relying on toxic prompts. Result: B2T outperforms existing methods, transfers to black-box settings, and complements text-based attacks. Conclusion: B2T reveals an underexplored vulnerability in multimodal alignment and opens a new direction for jailbreak research. Abstract: Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model's safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored vulnerability in multimodal alignment and introduce a fundamentally new direction for jailbreak approaches.

[18] Analytical Calculation of Weights Convolutional Neural Network

Polad Geidarov

Main category: cs.CV

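TL;DR: Proposes a method for analytically computing CNN weights and thresholds without conventional training; the parameters are determined from only 10 MNIST images, and experiments show the resulting network recognizes over half of the test images with very fast inference.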

Details

Motivation: To explore a training-free, analytical way of computing CNN parameters that simplifies model construction. Method: CNN weights, thresholds, and channel counts are computed analytically from 10 MNIST images, and a module implemented in C++ Builder is used for the experiments. Result: The analytically computed CNN recognizes over half of the test images, with inference taking fractions of a second. Conclusion: CNNs can be applied directly to classification tasks using purely analytical computation, without training. Abstract: This paper presents an algorithm for analytically calculating the weights and thresholds of convolutional neural networks (CNNs) without using standard training procedures. The algorithm enables the determination of CNN parameters based on just 10 selected images from the MNIST dataset, each representing a digit from 0 to 9. As part of the method, the number of channels in CNN layers is also derived analytically. A software module was implemented in C++ Builder, and a series of experiments were conducted using the MNIST dataset. Results demonstrate that the analytically computed CNN can recognize over half of 1000 handwritten digit images without any training, achieving inference in fractions of a second. These findings suggest that CNNs can be constructed and applied directly for classification tasks without training, using purely analytical computation of weights.

[19] A Novel Convolutional Neural Network-Based Framework for Complex Multiclass Brassica Seed Classification

Elhoucine Elfatimia,Recep Eryigitb,Lahcen Elfatimi

Main category: cs.CV

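TL;DR: This paper proposes a new convolutional neural network (CNN)-based framework for efficiently classifying ten common Brassica seed types; it addresses the challenge of texture similarity in seed images and reaches 93% accuracy.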

Details

Motivation: Because of the demands of crop production and farm operations, farmers lack the time and resources for on-farm research. Seed classification is essential for quality control, production efficiency, and impurity detection, and early identification of seed types reduces the cost and risk associated with field emergence. Method: A custom-designed CNN architecture targets the texture similarity of seed images, with adjusted layer configurations, and is compared against several pre-trained state-of-the-art architectures. Result: On the collected Brassica seed dataset, the proposed model reaches 93% accuracy. Conclusion: The CNN framework offers an efficient solution for seed classification and helps farmers improve the precision of seed quality management and yield estimation. Abstract: Agricultural research has accelerated in recent years, yet farmers often lack the time and resources for on-farm research due to the demands of crop production and farm operations. Seed classification offers valuable insights into quality control, production efficiency, and impurity detection. Early identification of seed types is critical to reducing the cost and risk associated with field emergence, which can lead to yield losses or disruptions in downstream processes like harvesting. Seed sampling supports growers in monitoring and managing seed quality, improving precision in determining seed purity levels, guiding management adjustments, and enhancing yield estimations. This study proposes a novel convolutional neural network (CNN)-based framework for the efficient classification of ten common Brassica seed types. The approach addresses the inherent challenge of texture similarity in seed images using a custom-designed CNN architecture. The model's performance was evaluated against several pre-trained state-of-the-art architectures, with adjustments to layer configurations for optimized classification. Experimental results using our collected Brassica seed dataset demonstrate that the proposed model achieved a high accuracy rate of 93 percent.

[20] Knowledge Distillation Approach for SOS Fusion Staging: Towards Fully Automated Skeletal Maturity Assessment

Omid Halimi Milani,Amanda Nikho,Marouane Tliba,Lauren Mills,Ahmet Enis Cetin,Mohammed H Elnagar

Main category: cs.CV

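TL;DR: Proposes a new deep learning framework for automated staging of spheno-occipital synchondrosis fusion, achieving accurate diagnosis through a dual-model architecture and knowledge distillation.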

Details

Motivation: Spheno-occipital synchondrosis fusion is an important diagnostic marker in orthodontics and forensic anthropology, but existing methods depend on external cropping or segmentation tools, which is inefficient and inconsistent. Method: A dual-model architecture (teacher and student) performs knowledge distillation through a newly designed loss function and gradient-based attention spatial mapping. Result: The framework reaches high diagnostic accuracy without external pre-processing and is suitable for clinical settings. Conclusion: The method simplifies the pipeline and improves the efficiency and consistency of skeletal maturity assessment. Abstract: We introduce a novel deep learning framework for the automated staging of spheno-occipital synchondrosis (SOS) fusion, a critical diagnostic marker in both orthodontics and forensic anthropology. Our approach leverages a dual-model architecture wherein a teacher model, trained on manually cropped images, transfers its precise spatial understanding to a student model that operates on full, uncropped images. This knowledge distillation is facilitated by a newly formulated loss function that aligns spatial logits and incorporates gradient-based attention spatial mapping, ensuring that the student model internalizes the anatomically relevant features without relying on external cropping or YOLO-based segmentation. By leveraging expert-curated data and feedback at each step, our framework attains robust diagnostic accuracy, culminating in a clinically viable end-to-end pipeline. This streamlined approach obviates the need for additional pre-processing tools and accelerates deployment, thereby enhancing both the efficiency and consistency of skeletal maturation assessment in diverse clinical settings.

[21] Multi-instance Learning as Downstream Task of Self-Supervised Learning-based Pre-trained Model

Koki Matsuishi,Tsuyoshi Okita

Main category: cs.CV

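TL;DR: The paper proposes using a self-supervised pre-trained model to address the performance drop that deep multi-instance learning suffers as the number of instances grows, significantly improving classification of brain hematoma CT.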

Details

Motivation: In multi-instance learning on brain hematoma CT, conventional deep learning methods perform very poorly once the number of instances per bag rises to 256, so a solution is needed. Method: A model pre-trained with self-supervised learning is used, with the multi-instance learner as its downstream task, to overcome the spurious correlation problem in the original task. Result: On hypodensity marker classification of brain hematoma CT, accuracy improves by 5% to 13% and the F1 score by 40% to 55%. Conclusion: Self-supervised pre-training addresses the challenge posed by growing instance counts in multi-instance learning and significantly improves classification performance. Abstract: In deep multi-instance learning, the number of applicable instances depends on the data set. In histopathology images, deep learning multi-instance learners usually assume there are hundreds to thousands of instances in a bag. However, when the number of instances in a bag increases to 256 in brain hematoma CT, learning becomes extremely difficult. In this paper, we address this drawback. To overcome this problem, we propose using a pre-trained model with self-supervised learning for the multi-instance learner as a downstream task. With this method, even when the original target task suffers from the spurious correlation problem, we show improvements of 5% to 13% in accuracy and 40% to 55% in the F1 measure for the hypodensity marker classification of brain hematoma CT.

[22] Diffusion Model-based Activity Completion for AI Motion Capture from Videos

Gao Huayu,Huang Tengjiu,Ye Xiaolong,Tsuyoshi Okita

Main category: cs.CV

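TL;DR: A diffusion model generates complementary motion sequences for AI motion capture, removing the restriction to predefined actions found in conventional methods, and achieves competitive results on the Human3.6M dataset.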

Details

Motivation: Current AI motion capture relies on predefined action sequences and cannot handle unobserved actions, which limits the flexibility of virtual human motion. Method: A diffusion-model-based action completion technique, combined with a gate module and a position-time embedding module, generates smooth and continuous motion sequences. Result: MDC-Net outperforms existing methods on ADE, FDE, and MMADE, has a smaller model size (16.84M), and generates more natural motion. Conclusion: The method addresses the continuity of motion sequences and offers a more flexible solution for virtual human motion generation. Abstract: AI-based motion capture is an emerging technology that offers a cost-effective alternative to traditional motion capture systems. However, current AI motion capture methods rely entirely on observed video sequences, similar to conventional motion capture. This means that all human actions must be predefined, and movements outside the observed sequences are not possible. To address this limitation, we aim to apply AI motion capture to virtual humans, where flexible actions beyond the observed sequences are required. We assume that while many action fragments exist in the training data, the transitions between them may be missing. To bridge these gaps, we propose a diffusion-model-based action completion technique that generates complementary human motion sequences, ensuring smooth and continuous movements. By introducing a gate module and a position-time embedding module, our approach achieves competitive results on the Human3.6M dataset. Our experimental results show that (1) MDC-Net outperforms existing methods in ADE, FDE, and MMADE but is slightly less accurate in MMFDE, (2) MDC-Net has a smaller model size (16.84M) compared to HumanMAC (28.40M), and (3) MDC-Net generates more natural and coherent motion sequences. Additionally, we propose a method for extracting sensor data, including acceleration and angular velocity, from human motion sequences.
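
For readers unfamiliar with the ADE and FDE metrics cited in the results, the sketch below shows one common way to compute them. It assumes each frame is a flattened pose vector and uses the usual definitions (average displacement error over all frames, final displacement error on the last frame); the multi-modal variants MMADE and MMFDE are not covered, and the paper's exact evaluation protocol may differ.

```python
import math

def l2(a, b):
    """Euclidean distance between two flattened pose vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ade(pred, target):
    """Average displacement error: mean per-frame L2 distance."""
    return sum(l2(p, t) for p, t in zip(pred, target)) / len(pred)

def fde(pred, target):
    """Final displacement error: L2 distance on the last frame only."""
    return l2(pred[-1], target[-1])

# Toy example with three frames of a 2-D "pose".
predicted = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
reference = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
ade_value = ade(predicted, reference)  # per-frame errors are 0, 1, 2
fde_value = fde(predicted, reference)  # error on the last frame
```

Lower is better for both metrics; ADE rewards accuracy along the whole sequence, while FDE only measures where the motion ends up.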

[23] EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models

Feng Jiang,Zihao Zheng,Xiuping Cui,Maoliang Li,JIayu Chen,Xiang Chen

Main category: cs.CV

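TL;DR: Proposes an optimization framework called EaqVLA that applies encoding-aligned quantization to VLA models, significantly reducing computation and storage costs.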

Details

Motivation: Existing VLA models have high computation and storage costs, and existing quantization methods are hard to apply directly because of token alignment issues. Method: A complete analysis method identifies misalignment at multiple granularities, and an encoding-alignment-aware mixed-precision quantization method is designed on that basis. Result: EaqVLA achieves better quantization performance than existing methods, with minimal quantization loss and notable acceleration. Conclusion: The EaqVLA framework addresses the quantization problem of VLA models and offers a new approach to optimizing end-to-end control policies. Abstract: With the development of embodied artificial intelligence, end-to-end control policies such as Vision-Language-Action (VLA) models have become mainstream. Existing VLA models face high computing and storage costs, which need to be reduced. Quantization is considered the most effective method, as it can not only reduce the memory cost but also accelerate computation. However, we find that the token alignment of VLA models hinders the application of existing quantization methods. To address this, we propose an optimized framework called EaqVLA, which applies encoding-aligned quantization to VLA models. Specifically, we propose a complete analysis method to find the misalignment at various granularities. Based on the analysis results, we propose a mixed-precision quantization with the awareness of encoding alignment. Experiments show that the proposed EaqVLA achieves better quantization performance (with the minimal quantization loss for end-to-end action control and xxx times acceleration) than existing quantization methods.

[24] Thickness-aware E(3)-Equivariant 3D Mesh Neural Networks

Sungwon Kim,Namkyeong Lee,Yunyoung Doh,Seungmin Shin,Guimok Cho,Seung-Won Jeon,Sangkook Kim,Chanyoung Park

Main category: cs.CV

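TL;DR: Proposes a new thickness-aware 3D mesh neural network (T-EMNN) that addresses the neglect of object thickness in existing methods while maintaining computational efficiency.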

Details

Motivation: Existing 3D static analysis methods focus mainly on surface topology and geometry and overlook object thickness and its effect on behavior, which makes the analysis inaccurate. Method: The T-EMNN framework incorporates thickness information while preserving E(3)-equivariance and introduces data-driven coordinates that encode spatial information. Result: Validated on an industrial dataset, T-EMNN accurately predicts node-level 3D deformations, capturing thickness effects while remaining computationally efficient. Conclusion: T-EMNN provides a more accurate and efficient solution for 3D static analysis and fills a gap left by existing methods. Abstract: Mesh-based 3D static analysis methods have recently emerged as efficient alternatives to traditional computational numerical solvers, significantly reducing computational costs and runtime for various physics-based analyses. However, these methods primarily focus on surface topology and geometry, often overlooking the inherent thickness of real-world 3D objects, which exhibits high correlations and similar behavior between opposing surfaces. This limitation arises from the disconnected nature of these surfaces and the absence of internal edge connections within the mesh. In this work, we propose a novel framework, the Thickness-aware E(3)-Equivariant 3D Mesh Neural Network (T-EMNN), that effectively integrates the thickness of 3D objects while maintaining the computational efficiency of surface meshes. Additionally, we introduce data-driven coordinates that encode spatial information while preserving E(3)-equivariance or invariance properties, ensuring consistent and robust analysis. Evaluations on a real-world industrial dataset demonstrate the superior performance of T-EMNN in accurately predicting node-level 3D deformations, effectively capturing thickness effects while maintaining computational efficiency.

[25] Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models

Dang Nguyen,Jiping Li,Jinghao Zheng,Baharan Mirzasoleiman

Main category: cs.CV

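TL;DR: By synthetically augmenting only the part of the data that is not learned early in training, the method improves classifier performance in a variety of scenarios and outperforms augmenting the entire dataset.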

Details

Motivation: Existing methods struggle to ensure generation diversity and must enlarge the dataset substantially (10-30x); this paper proposes a partial augmentation strategy to improve performance. Method: An analysis of a two-layer CNN proves that augmenting the part of the data not yet learned promotes homogeneity in feature learning speed without amplifying noise. The experiments augment 30%-40% of the data. Result: On CIFAR-10, CIFAR-100, and TinyImageNet, performance improves by up to 2.8%, and the method applied with SGD outperforms the SOTA optimizer SAM on CIFAR-100 and TinyImageNet. Conclusion: Partial data augmentation is efficient and can be combined with other augmentation methods to improve performance further. Abstract: Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and increase the size of the data by up to 10-30x to improve the in-distribution performance. In this work, we show that synthetically augmenting part of the data that is not learned early in training outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts the performance by up to 2.8% in a variety of scenarios, including training ResNet, ViT and DenseNet on CIFAR-10, CIFAR-100, and TinyImageNet, with a range of optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet. It can also easily stack with existing weak and strong augmentation strategies to further boost the performance.
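
One way to picture "augmenting only the part of the data that is not learned early" is to rank training examples by their loss after a few early epochs and pass only the hardest fraction to the generator. The sketch below does exactly that; the use of early-epoch loss as the "not yet learned" signal, the function name, and the default fraction are assumptions made for illustration, and the paper's actual selection criterion may differ.

```python
def select_for_augmentation(early_losses, fraction=0.3):
    """Return indices of the examples with the highest early-training loss.

    early_losses: per-example loss measured after a few early epochs
                  (an assumed proxy for "not learned early").
    fraction:     share of the dataset to augment (the abstract reports 30%-40%).
    """
    count = max(1, round(fraction * len(early_losses)))
    ranked = sorted(range(len(early_losses)),
                    key=lambda i: early_losses[i],
                    reverse=True)
    return sorted(ranked[:count])

# Toy example: ten examples, augment the 30% that are hardest early on.
losses = [0.1, 2.0, 0.05, 1.5, 0.2, 0.3, 0.9, 0.01, 0.4, 0.6]
to_augment = select_for_augmentation(losses, fraction=0.3)
```

Only the selected indices would then be sent to the diffusion model for synthetic augmentation, leaving the already-learned examples untouched.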

[26] Do you see what I see? An Ambiguous Optical Illusion Dataset exposing limitations of Explainable AI

Carina Newen,Luca Hinkamp,Maria Ntonti,Emmanuel Müller

Main category: cs.CV

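TL;DR: This paper introduces a novel optical illusion dataset for studying ambiguity in machine learning and human perception, and examines how visual concepts such as gaze direction affect model accuracy.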

Details

Motivation: Ambiguous data is increasingly important in safety-critical domains such as autonomous driving and medical diagnostics, and optical illusions offer a distinctive view of the limits of human and machine perception. However, relevant datasets are scarce. Method: An optical illusion dataset of intermingled animal pairs is generated systematically, focusing on visual concepts such as gaze direction and eye cues, and their effect on models is analyzed. Result: Visual concepts such as gaze direction significantly influence model accuracy, providing a foundation for studying bias in machine vision and the alignment between human and machine perception. Conclusion: The dataset is a new tool for studying the importance of concepts in visual learning and the alignment between machine and human perception; the code and dataset are publicly available. Abstract: From uncertainty quantification to real-world object detection, we recognize the importance of machine learning algorithms, particularly in safety-critical domains such as autonomous driving or medical diagnostics. Ambiguous data plays an important role in various machine learning domains. Optical illusions present a compelling area of study in this context, as they offer insight into the limitations of both human and machine perception. Despite this relevance, optical illusion datasets remain scarce. In this work, we introduce a novel dataset of optical illusions featuring intermingled animal pairs designed to evoke perceptual ambiguity. We identify generalizable visual concepts, particularly gaze direction and eye cues, as subtle yet impactful features that significantly influence model accuracy. By confronting models with perceptual ambiguity, our findings underscore the importance of concepts in visual learning and provide a foundation for studying bias and alignment between human and machine vision. To make this dataset useful for general purposes, we generate optical illusions systematically with different concepts discussed in our bias mitigation section. The dataset is accessible on Kaggle via https://kaggle.com/datasets/693bf7c6dd2cb45c8a863f9177350c8f9849a9508e9d50526e2ffcc5559a8333. Our source code can be found at https://github.com/KDD-OpenSource/Ambivision.git.

[27] Any-to-Bokeh: One-Step Video Bokeh via Multi-Plane Image Guided Diffusion

Yang Yang,Siming Zheng,Jinwei Chen,Boxi Wu,Xiaofei He,Deng Cai,Bo Li,Peng-Tao Jiang

Main category: cs.CV

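TL;DR: Proposes a novel one-step video bokeh framework that addresses the shortcomings of existing methods in temporal consistency and depth control.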

Details

Motivation: Existing video editing models cannot explicitly control the focus plane or adjust bokeh intensity, and extending image bokeh methods to video causes temporal flickering and unsatisfactory edge blur transitions. Method: A multi-plane image (MPI) representation, built with a progressively widening depth sampling function, is combined with a single-step video diffusion model and the 3D priors of pre-trained models. Result: Experiments show that the method produces high-quality, controllable bokeh effects and reaches state-of-the-art performance on multiple evaluation benchmarks. Conclusion: The proposed framework performs well in temporal consistency, depth robustness, and detail preservation, and works across diverse scenes. Abstract: Recent advances in diffusion-based editing models have enabled realistic camera simulation and image-based bokeh, but video bokeh remains largely unexplored. Existing video editing models cannot explicitly control focus planes or adjust bokeh intensity, limiting their applicability for controllable optical effects. Moreover, naively extending image-based bokeh methods to video often results in temporal flickering and unsatisfactory edge blur transitions due to the lack of temporal modeling and generalization capability. To address these challenges, we propose a novel one-step video bokeh framework that converts arbitrary input videos into temporally coherent, depth-aware bokeh effects. Our method leverages a multi-plane image (MPI) representation constructed through a progressively widening depth sampling function, providing explicit geometric guidance for depth-dependent blur synthesis. By conditioning a single-step video diffusion model on MPI layers and utilizing the strong 3D priors from pre-trained models such as Stable Video Diffusion, our approach achieves realistic and consistent bokeh effects across diverse scenes. Additionally, we introduce a progressive training strategy to enhance temporal consistency, depth robustness, and detail preservation. Extensive experiments demonstrate that our method produces high-quality, controllable bokeh effects and achieves state-of-the-art performance on multiple evaluation benchmarks.

[28] Object Concepts Emerge from Motion

Haoqian Liang,Xiaohui Wang,Zhichao Li,Ya Yang,Naiyan Wang

Main category: cs.CV

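TL;DR: The paper proposes a biologically inspired unsupervised learning framework that uses motion boundaries as a signal for object-level grouping to learn object-centric visual representations from raw video.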

Details

Motivation: Inspired by developmental neuroscience, where infants acquire object understanding by observing motion, the paper aims to develop a scalable method that needs neither labels nor camera calibration. Method: Motion-based instance masks are generated with off-the-shelf optical flow and clustering algorithms and are used to train visual encoders via contrastive learning. Result: On three downstream tasks (monocular depth estimation, 3D object detection, and occupancy prediction), the method outperforms existing supervised and self-supervised baselines and generalizes strongly to unseen scenes. Conclusion: Motion-induced object representations offer a compelling alternative to existing vision foundation models, capturing a crucial but overlooked level of abstraction: the visual instance. Abstract: Object concepts play a foundational role in human visual cognition, enabling perception, memory, and interaction in the physical world. Inspired by findings in developmental neuroscience - where infants are shown to acquire object understanding through observation of motion - we propose a biologically inspired framework for learning object-centric visual representations in an unsupervised manner. Our key insight is that motion boundaries serve as a strong signal for object-level grouping, which can be used to derive pseudo instance supervision from raw videos. Concretely, we generate motion-based instance masks using off-the-shelf optical flow and clustering algorithms, and use them to train visual encoders via contrastive learning. Our framework is fully label-free and does not rely on camera calibration, making it scalable to large-scale unstructured video data. We evaluate our approach on three downstream tasks spanning both low-level (monocular depth estimation) and high-level (3D object detection and occupancy prediction) vision. Our models outperform previous supervised and self-supervised baselines and demonstrate strong generalization to unseen scenes. These results suggest that motion-induced object representations offer a compelling alternative to existing vision foundation models, capturing a crucial but overlooked level of abstraction: the visual instance. The corresponding code will be released upon paper acceptance.

[29] BaryIR: Learning Multi-Source Unified Representation in Continuous Barycenter Space for Generalizable All-in-One Image Restoration

Xiaole Tang,Xiaoyi He,Xiang Gu,Jian Sun

Main category: cs.CV

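TL;DR: Proposes the BaryIR framework, which uses multi-source representation learning to address out-of-distribution problems in all-in-one image restoration and improve generalization.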

Details

Motivation: Existing all-in-one image restoration methods generalize poorly to out-of-distribution data and degradation types, which limits practical use. Method: BaryIR decomposes the latent space of multi-source degraded images into a continuous barycenter space (unified feature encoding) and source-specific subspaces (specific semantic encoding), and learns the unified representation through a multi-source latent optimal transport barycenter problem. Result: Experiments show that BaryIR is competitive with state-of-the-art all-in-one methods and generalizes particularly well to real-world data and unseen degradations. Conclusion: Through multi-source representation learning, BaryIR improves the generalization of all-in-one image restoration and suits complex real-world scenarios. Abstract: Despite remarkable advances made in all-in-one image restoration (AIR) for handling different types of degradations simultaneously, existing methods remain vulnerable to out-of-distribution degradations and images, limiting their real-world applicability. In this paper, we propose a multi-source representation learning framework BaryIR, which decomposes the latent space of multi-source degraded images into a continuous barycenter space for unified feature encoding and source-specific subspaces for specific semantic encoding. Specifically, we seek the multi-source unified representation by introducing a multi-source latent optimal transport barycenter problem, in which a continuous barycenter map is learned to transport the latent representations to the barycenter space. The transport cost is designed such that the representations from source-specific subspaces are contrasted with each other while maintaining orthogonality to those from the barycenter space. This enables BaryIR to learn compact representations with unified degradation-agnostic information from the barycenter space, as well as degradation-specific semantics from source-specific subspaces, capturing the inherent geometry of multi-source data manifold for generalizable AIR. Extensive experiments demonstrate that BaryIR achieves competitive performance compared to state-of-the-art all-in-one methods. Particularly, BaryIR exhibits superior generalization ability to real-world data and unseen degradations. The code will be publicly available at https://github.com/xl-tang3/BaryIR.

[30] Geometric Feature Prompting of Image Segmentation Models

Kenneth Ball,Erin Taylor,Nirav Patel,Andrew Bartels,Gary Koplik,James Polly,Jay Hineman

Main category: cs.CV

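TL;DR: Proposes a geometry-driven prompt generator (GeomPrompt) that automatically produces prompt points for sensitive and specific segmentation in scientific image analysis, with the potential to substantially improve SAM-based segmentation of plant root images.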

Details

Motivation: Segmenting plant root images (such as rhizotron or minirhizotron images) has traditionally been hard to automate, and hand annotation is time-consuming and subjective. Combining SAM with a geometric prompt generator offers an efficient way to address this. Method: The GeomPrompt tool generates prompt points from geometric features and integrates directly with SAM to automate root image segmentation. Result: Prompt points generated by GeomPrompt yield sensitive and specific SAM segmentations of root images with relatively few point prompts. Conclusion: Geometry-driven prompt generation provides an efficient, automated approach for scientific image analysis tasks, and the associated tool geomprompt has been released as open source. Abstract: Advances in machine learning, especially the introduction of transformer architectures and vision transformers, have led to the development of highly capable computer vision foundation models. The segment anything model (known colloquially as SAM and more recently SAM 2) is a highly capable foundation model for segmentation of natural images and has been further applied to medical and scientific image segmentation tasks. SAM relies on prompts -- points or regions of interest in an image -- to generate associated segmentations. In this manuscript we propose the use of a geometrically motivated prompt generator to produce prompt points that are colocated with particular features of interest. Focused prompting enables the automatic generation of sensitive and specific segmentations in a scientific image analysis task using SAM with relatively few point prompts. The image analysis task examined is the segmentation of plant roots in rhizotron or minirhizotron images, which has historically been a difficult task to automate. Hand annotation of rhizotron images is laborious and often subjective; SAM, initialized with GeomPrompt local ridge prompts, has the potential to dramatically improve rhizotron image processing. The authors have concurrently released an open source software suite called geomprompt (https://pypi.org/project/geomprompt/) that can produce point prompts in a format that enables direct integration with the segment-anything package.

[31] QuARI: Query Adaptive Retrieval Improvement

Eric Xing,Abby Stylianou,Robert Pless,Nathan Jacobs

Main category: cs.CV

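TL;DR: Learning query-specific feature space transformations improves large-scale image retrieval at low computational cost and outperforms existing methods.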

Details

Motivation: Existing vision-language models perform poorly on large-scale instance retrieval, so a more efficient way to specialize them is needed. Method: Each query is mapped to a query-specific linear transformation of the feature space, which suits large-scale retrieval and re-ranking. Result: The method outperforms state-of-the-art alternatives at low computational cost. Conclusion: Query-specific feature space transformation is an effective way to improve large-scale image retrieval. Abstract: Massive-scale pretraining has made vision-language models increasingly popular for image-to-image and text-to-image retrieval across a broad collection of domains. However, these models do not perform well when used for challenging retrieval tasks, such as instance retrieval in very large-scale image collections. Recent work has shown that linear transformations of VLM features trained for instance retrieval can improve performance by emphasizing subspaces that relate to the domain of interest. In this paper, we explore a more extreme version of this specialization by learning to map a given query to a query-specific feature space transformation. Because this transformation is linear, it can be applied with minimal computational cost to millions of image embeddings, making it effective for large-scale retrieval or re-ranking. Results show that this method consistently outperforms state-of-the-art alternatives, including those that require many orders of magnitude more computation at query time.
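
The idea of applying a query-specific linear transformation before ranking can be sketched as follows. This is a toy illustration only: in the paper the transformation is predicted by a learned model, whereas here the matrix `transform` is simply given, and the function names are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def rerank(query, gallery, transform):
    """Rank gallery indices by cosine similarity in the transformed space."""
    q = matvec(transform, query)
    scored = [(cosine(q, matvec(transform, g)), idx)
              for idx, g in enumerate(gallery)]
    return [idx for _, idx in sorted(scored, key=lambda s: -s[0])]

query = [1.0, 1.0]
gallery = [[1.0, -1.0], [1.0, 0.9], [0.0, 1.0]]

identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical query-specific transform that down-weights the second dimension.
emphasize_first = [[1.0, 0.0], [0.0, 0.1]]

baseline_order = rerank(query, gallery, identity)
adapted_order = rerank(query, gallery, emphasize_first)
```

Because the transform is a single matrix multiplication per embedding, the same operation scales to very large galleries; the two orderings above differ, which shows how the transform changes which subspace drives the ranking.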

[32] Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks

Keanu Nichols,Nazia Tasnim,Yan Yuting,Nicholas Ikechukwu,Elva Zou,Deepti Ghadiyaram,Bryan Plummer

Main category: cs.CV

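TL;DR: DORI is a benchmark focused on evaluating how well multimodal systems perceive object orientation; it reveals the limitations of current models in precise orientation judgments.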

Details

Motivation: Existing vision-language benchmarks do not evaluate object orientation understanding in isolation and often conflate it with positional relationships and scene understanding. Method: Tasks covering four dimensions of orientation understanding are built from 11 datasets spanning 67 object categories and used to evaluate 15 state-of-the-art models. Result: The best models reach only 54.2% accuracy on coarse tasks and 33.0% on granular tasks. Conclusion: Dedicated orientation representation mechanisms are needed, and DORI offers guidance for improving robotic control and 3D scene reconstruction. Abstract: Object orientation understanding represents a fundamental challenge in visual perception critical for applications like robotic manipulation and augmented reality. Current vision-language benchmarks fail to isolate this capability, often conflating it with positional relationships and general scene understanding. We introduce DORI (Discriminative Orientation Reasoning Intelligence), a comprehensive benchmark establishing object orientation perception as a primary evaluation target. DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. Through carefully curated tasks from 11 datasets spanning 67 object categories across synthetic and real-world scenarios, DORI provides insights on how multi-modal systems understand object orientations. Our evaluation of 15 state-of-the-art vision-language models reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, with performance deteriorating for tasks requiring reference frame shifts or compound rotations. These findings demonstrate the need for dedicated orientation representation mechanisms, as models show systematic inability to perform precise angular estimations, track orientation changes across viewpoints, and understand compound rotations - suggesting limitations in their internal 3D spatial representations. As the first diagnostic framework specifically designed for orientation awareness in multimodal systems, DORI offers implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments.
DORI data: https://huggingface.co/datasets/appledora/DORI-Benchmark

[33] Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation

Ke Zhang,Cihan Xiao,Yiqun Mei,Jiacong Xu,Vishal M. Patel

Main category: cs.CV

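TL;DR: DiffPhy is a generic framework that fine-tunes a pre-trained video diffusion model to generate physically correct and photo-realistic videos, using an LLM and an MLLM to guide generation.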

Details

Motivation: Existing video diffusion models produce visually appealing results but struggle to synthesize correct physical effects; complex motions and interactions make physics harder to learn. Method: An LLM reasons about the physical context from the text prompt, and MLLM supervision together with new training objectives enforces physical correctness and semantic consistency. Result: On public benchmarks, DiffPhy achieves state-of-the-art generation across diverse physics-related scenarios. Conclusion: By combining an LLM and an MLLM, DiffPhy significantly improves physical correctness and realism in video generation. Abstract: Recent video diffusion models have demonstrated their great capability in generating visually-pleasing results, while synthesizing the correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. In this work, we propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method leverages large language models (LLMs) to explicitly reason about a comprehensive physical context from the text prompt and uses it to guide the generation. To incorporate physical context into the diffusion model, we leverage a multimodal large language model (MLLM) as a supervisory signal and introduce a set of novel training objectives that jointly enforce physical correctness and semantic consistency with the input text. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective finetuning. Extensive experiments on public benchmarks demonstrate that DiffPhy is able to produce state-of-the-art results across diverse physics-related scenarios. Our project page is available at https://bwgzk-keke.github.io/DiffPhy/

[34] Scalable Segmentation for Ultra-High-Resolution Brain MR Images

Xiaoling Hu,Peirong Liu,Dina Zemlyanker,Jonathan Williams Ramirez,Oula Puonti,Juan Eugenio Iglesias

Main category: cs.CV

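TL;DR: Proposes a framework that uses low-resolution coarse labels as spatial references, achieves boundary-aware segmentation by regressing signed distance transform maps, and introduces a scalable class-conditional segmentation strategy that improves efficiency and generalization.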

Details

Motivation: Segmentation of ultra-high-resolution brain MRI is hampered by scarce labeled data and high computational demands, so an efficient method that needs no extra annotation is required. Method: Low-resolution coarse labels serve as a reference, and signed distance transform maps are regressed for boundary-aware supervision; a class-conditional segmentation strategy segments one class at a time to reduce memory consumption. Result: The method's superior performance and scalability are validated on synthetic and real-world datasets. Conclusion: The method is more efficient and generalizes better than conventional segmentation approaches. Abstract: Although deep learning has shown great success in 3D brain MRI segmentation, achieving accurate and efficient segmentation of ultra-high-resolution brain images remains challenging due to the lack of labeled training data for fine-scale anatomical structures and high computational demands. In this work, we propose a novel framework that leverages easily accessible, low-resolution coarse labels as spatial references and guidance, without incurring additional annotation cost. Instead of directly predicting discrete segmentation maps, our approach regresses per-class signed distance transform maps, enabling smooth, boundary-aware supervision. Furthermore, to enhance scalability, generalizability, and efficiency, we introduce a scalable class-conditional segmentation strategy, where the model learns to segment one class at a time conditioned on a class-specific input. This novel design not only reduces memory consumption during both training and testing, but also allows the model to generalize to unseen anatomical classes. We validate our method through comprehensive experiments on both synthetic and real-world datasets, demonstrating its superior performance and scalability compared to conventional segmentation approaches.
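
To make the regression target concrete, the sketch below computes a signed distance transform for a small binary mask by brute force. It is a toy 2-D illustration, not the paper's pipeline: the sign convention (negative inside the structure, positive outside), the pixel-center distance, and the function name are assumptions.

```python
import math

def signed_distance(mask):
    """Brute-force signed distance transform of a 2-D binary mask.

    Each pixel gets the Euclidean distance to the nearest pixel of the
    opposite label; inside pixels are negative, outside pixels positive
    (an assumed convention).
    """
    height, width = len(mask), len(mask[0])
    inside = [(r, c) for r in range(height) for c in range(width) if mask[r][c]]
    outside = [(r, c) for r in range(height) for c in range(width) if not mask[r][c]]
    result = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            others = outside if mask[r][c] else inside
            if not others:
                continue  # mask is all one label; leave the distance at 0.0
            d = min(math.hypot(r - rr, c - cc) for rr, cc in others)
            result[r][c] = -d if mask[r][c] else d
    return result

mask = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
sdt = signed_distance(mask)
# Thresholding at zero recovers the discrete segmentation.
recovered = [[1 if v < 0 else 0 for v in row] for row in sdt]
```

Unlike a hard 0/1 label map, the distance values change smoothly across the boundary, which is what makes the supervision boundary-aware; the discrete mask is recovered by thresholding at zero.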

[35] MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis

Yitong Li,Morteza Ghahremani,Christian Wachinger

Main category: cs.CV

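TL;DR: MedBridge is a lightweight multimodal adaptation framework that repurposes pre-trained vision-language models (VLMs) to improve medical image diagnosis accuracy without requiring extensive resources.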

Details

Motivation: Existing vision-language foundation models excel at natural image classification but perform poorly on medical images because of domain shift, while training a medical foundation model requires extensive resources. Method: MedBridge has three key components: 1) a Focal Sampling module that extracts high-resolution local regions; 2) a Query Encoder (QEncoder) that injects learnable queries to align features with medical semantics; 3) a Mixture of Experts mechanism that exploits the complementary strengths of diverse VLMs. Result: On five medical imaging benchmarks, MedBridge performs strongly in both cross-domain and in-domain adaptation, with AUC improving by 6-15% in multi-label thoracic disease diagnosis. Conclusion: MedBridge effectively leverages foundation models for accurate and data-efficient medical diagnosis, and the code is open source. Abstract: Recent vision-language foundation models deliver state-of-the-art results on natural image classification but falter on medical images due to pronounced domain shifts. At the same time, training a medical foundation model requires substantial resources, including extensive annotated data and high computational capacity. To bridge this gap with minimal overhead, we introduce MedBridge, a lightweight multimodal adaptation framework that re-purposes pretrained VLMs for accurate medical image diagnosis. MedBridge comprises three key components. First, a Focal Sampling module that extracts high-resolution local regions to capture subtle pathological features and compensate for the limited input resolution of general-purpose VLMs. Second, a Query Encoder (QEncoder) injects a small set of learnable queries that attend to the frozen feature maps of VLM, aligning them with medical semantics without retraining the entire backbone. Third, a Mixture of Experts mechanism, driven by learnable queries, harnesses the complementary strength of diverse VLMs to maximize diagnostic performance. We evaluate MedBridge on five medical imaging benchmarks across three key adaptation tasks, demonstrating its superior performance in both cross-domain and in-domain adaptation settings, even under varying levels of training data availability. Notably, MedBridge achieved over 6-15% improvement in AUC compared to state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis, underscoring its effectiveness in leveraging foundation models for accurate and data-efficient medical diagnosis. Our code is available at https://github.com/ai-med/MedBridge.

[36] OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

Cheng Luo,Jianghui Wang,Bing Li,Siyang Song,Bernard Ghanem

Main category: cs.CV

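TL;DR: The paper introduces a new task, OMCRG, which aims to generate listener feedback online in sync with the speaker's multimodal input, and proposes the OmniResponse model to address the synchronization problem.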

Details

Motivation: In natural conversation, a listener's multimodal feedback (such as speech and facial expressions) must be generated in sync, which existing methods struggle to achieve. Method: Text is introduced as an intermediate modality, and the OmniResponse model combines the Chrono-Text and TempoVoice modules to generate synchronized responses. Result: On the ResponseNet dataset, OmniResponse significantly outperforms baseline models in semantic content, audio-visual synchronization, and generation quality. Conclusion: OmniResponse offers an effective solution for multimodal listener feedback generation and advances research on OMCRG. Abstract: In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to generate, online, synchronized verbal and non-verbal listener feedback conditioned on the speaker's multimodal input. OMCRG reflects natural dyadic interactions and poses new challenges in achieving synchronization between the generated audio and facial responses of the listener. To address these challenges, we introduce text as an intermediate modality to bridge the audio and facial responses. We hence propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates high-quality multi-modal listener responses. OmniResponse leverages a pretrained LLM enhanced with two novel components: Chrono-Text, which temporally anchors generated text tokens, and TempoVoice, a controllable online TTS module that produces speech synchronized with facial reactions. To support further OMCRG research, we present ResponseNet, a new dataset comprising 696 high-quality dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and facial behavior annotations. Comprehensive evaluations conducted on ResponseNet demonstrate that OmniResponse significantly outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality.

[37] Moment kernels: a simple and scalable approach for equivariance to rotations and reflections in deep convolutional networks

Zachary Schlamowitz,Andrew Bennecke,Daniel J. Tward

Main category: cs.CV

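TL;DR: The paper proposes a simple form of convolution kernel ("moment kernels") that achieves equivariance to rotations and reflections, avoids the mathematical complexity of traditional methods, and is validated on biomedical image analysis tasks.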

Details

Motivation: Rotation and reflection symmetries are critical in biomedical image analysis, but traditional methods rely on complex mathematical theory (such as representation theory), which limits their adoption. Method: The paper proposes "moment kernels" as a simple form of convolution kernel, proves that they achieve equivariance, and implements equivariant neural networks with standard convolution modules. Result: The method is validated on classification, 3D image registration, and cell segmentation tasks. Conclusion: Moment kernels provide a simple and general way to achieve symmetry equivariance and may advance biomedical image analysis. Abstract: The principle of translation equivariance (if an input image is translated, the output image should be translated by the same amount) led to the development of convolutional neural networks that revolutionized machine vision. Other symmetries, like rotations and reflections, play a similarly critical role, especially in biomedical image analysis, but exploiting these symmetries has not seen wide adoption. We hypothesize that this is partially due to the mathematical complexity of methods used to exploit these symmetries, which often rely on representation theory, a bespoke concept in differential geometry and group theory. In this work, we show that the same equivariance can be achieved using a simple form of convolution kernels that we call "moment kernels," and prove that all equivariant kernels must take this form. These are a set of radially symmetric functions of a spatial position $x$, multiplied by powers of the components of $x$ or the identity matrix. We implement equivariant neural networks using standard convolution modules, and provide architectures to execute several biomedical image analysis tasks that depend on equivariance principles: classification (outputs are invariant under orthogonal transforms), 3D image registration (outputs transform like a vector), and cell segmentation (quadratic forms defining ellipses transform like a matrix).
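
As a rough illustration of the kernel form described in the abstract above, the sketch below builds order-0, order-1, and order-2 moment kernels in 2D and checks how they transform under a 90-degree rotation. The Gaussian radial profile, the function names, and the restriction to 2D and to orders 0 through 2 are assumptions made for this example and are not taken from the paper.

```python
import math


def radial_profile(r, sigma=1.0):
    # Assumed radial function f(|x|); any radially symmetric profile would do.
    return math.exp(-r * r / (2.0 * sigma * sigma))


def moment_kernel(x, y, order, sigma=1.0):
    """Value at position (x, y) of f(|x|) times powers of the components of x."""
    f = radial_profile(math.hypot(x, y), sigma)
    if order == 0:
        return f                                   # scalar-valued
    if order == 1:
        return (f * x, f * y)                      # vector-valued
    if order == 2:
        return ((f * x * x, f * x * y),
                (f * y * x, f * y * y))            # matrix-valued
    raise ValueError("only orders 0, 1, and 2 are sketched here")


def rotate90(v):
    # Apply R = [[0, -1], [1, 0]] to a 2-vector.
    x, y = v
    return (-y, x)


def rotate90_matrix(m):
    # Compute R m R^T for the same 90-degree rotation R.
    (a, b), (c, d) = m
    return ((d, -c), (-b, a))


def is_rotation_equivariant(order, radius=2, tol=1e-12):
    """Check K(Rx) == K(x), R K(x), or R K(x) R^T on an integer grid."""
    for x in range(-radius, radius + 1):
        for y in range(-radius, radius + 1):
            rx, ry = rotate90((x, y))
            lhs = moment_kernel(rx, ry, order)
            k = moment_kernel(x, y, order)
            if order == 0:
                pairs = [(lhs, k)]
            elif order == 1:
                pairs = list(zip(lhs, rotate90(k)))
            else:
                rhs = rotate90_matrix(k)
                pairs = [(lhs[i][j], rhs[i][j])
                         for i in range(2) for j in range(2)]
            if any(abs(p - q) > tol for p, q in pairs):
                return False
    return True
```

The three cases correspond to the output types the abstract lists: order 0 is invariant, order 1 transforms like a vector, and order 2 transforms like a matrix.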

[38] What is Adversarial Training for Diffusion Models?

Briglia Maria Rosaria,Mujtaba Hussain Mirza,Giuseppe Lisanti,Iacopo Masi

Main category: cs.CV

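TL;DR: Adversarial training (AT) plays a different role in diffusion models (DMs) than in classifiers: it enforces equivariance of the diffusion process rather than output invariance, improving robustness to noise and outliers.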

Details

Motivation: To investigate the distinct role of adversarial training in diffusion models, as opposed to its conventional use in classifiers, in order to improve how models handle noise and outliers. Method: Random or adversarial noise is added during diffusion training, with no assumptions about the noise model, so the method integrates seamlessly into existing training pipelines. Result: The method is validated on low- and high-dimensional datasets and performs strongly on standard benchmarks such as CIFAR-10, especially under noise and adversarial attacks. Conclusion: Adversarial training improves the robustness of diffusion models through equivariance and is suited to handling noise, outliers, and adversarial attacks. Abstract: We answer the question in the title, showing that adversarial training (AT) for diffusion models (DMs) fundamentally differs from AT for classifiers: while AT in classifiers enforces output invariance, AT in DMs requires equivariance to keep the diffusion process aligned with the data distribution. AT is a way to enforce smoothness in the diffusion flow, improving robustness to outliers and corrupted data. Unlike prior art, our method makes no assumptions about the noise model and integrates seamlessly into diffusion training by adding random noise, similar to randomized smoothing, or adversarial noise, akin to AT. This enables intrinsic capabilities such as handling noisy data, dealing with extreme variability such as outliers, preventing memorization, and improving robustness. We rigorously evaluate our approach with proof-of-concept datasets with known distributions in low- and high-dimensional space, thereby taking a perfect measure of errors; we further evaluate on standard benchmarks such as CIFAR-10, CelebA and LSUN Bedroom, showing strong performance under severe noise, data corruption, and iterative adversarial attacks.

[39] Learning to See More: UAS-Guided Super-Resolution of Satellite Imagery for Precision Agriculture

Arif Masrur,Peder A. Olsen,Paul R. Adler,Carlan Jackson,Matthew W. Myers,Nathan Sedghi,Ray R. Weil

Main category: cs.CV

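TL;DR: The study proposes a super-resolution framework that fuses satellite and unmanned aircraft system (UAS) imagery, integrating spatial, spectral, and temporal data to improve the accuracy of crop biomass and nitrogen estimation.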

Details

Motivation: Satellites and UAS each have strengths and weaknesses in precision agriculture: satellites offer broad coverage but low resolution, while UAS offer high resolution but are costly and limited in coverage. The study aims to combine the strengths of both in a cost-effective solution. Method: Super-resolution methods fuse satellite and UAS imagery; spectral extension carries UAS RGB data into the vegetation red edge and near-infrared regions to generate high-resolution Sentinel-2 imagery. Result: Biomass and nitrogen estimation accuracy improve by 18% and 31%, respectively, and UAS data need only be collected from a subset of fields and time points. Conclusion: The framework is lightweight and scalable, suits practical on-farm use, remains effective even without satellite data, and reduces the need for repeated UAS flights. Abstract: Unmanned Aircraft Systems (UAS) and satellites are key data sources for precision agriculture, yet each presents trade-offs. Satellite data offer broad spatial, temporal, and spectral coverage but lack the resolution needed for many precision farming applications, while UAS provide high spatial detail but are limited by coverage and cost, especially for hyperspectral data. This study presents a novel framework that fuses satellite and UAS imagery using super-resolution methods. By integrating data across spatial, spectral, and temporal domains, we leverage the strengths of both platforms cost-effectively. We use estimation of cover crop biomass and nitrogen (N) as a case study to evaluate our approach. By spectrally extending UAS RGB data to the vegetation red edge and near-infrared regions, we generate high-resolution Sentinel-2 imagery and improve biomass and N estimation accuracy by 18% and 31%, respectively. Our results show that UAS data need only be collected from a subset of fields and time points. Farmers can then 1) enhance the spectral detail of UAS RGB imagery; 2) increase the spatial resolution by using satellite data; and 3) extend these enhancements spatially and across the growing season at the frequency of the satellite flights. Our SRCNN-based spectral extension model shows considerable promise for model transferability over other cropping systems in the Upper and Lower Chesapeake Bay regions. Additionally, it remains effective even when cloud-free satellite data are unavailable, relying solely on the UAS RGB input. The spatial extension model produces better biomass and N predictions than models built on raw UAS RGB images. Once trained with targeted UAS RGB data, the spatial extension model allows farmers to stop repeated UAS flights. While we introduce super-resolution advances, the core contribution is a lightweight and scalable system for affordable on-farm use.

[40] Visual Loop Closure Detection Through Deep Graph Consensus

Martin Büchner,Liza Dahiya,Simon Dorer,Vipul Ramtekkar,Kenji Nishimiya,Daniele Cattaneo,Abhinav Valada

Main category: cs.CV

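TL;DR: LoopGNN is a graph neural network architecture that estimates loop closure consensus from cliques of visually similar keyframes, improving the precision and efficiency of loop closure detection.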

Details

Motivation: Traditional loop closure detection relies on computationally expensive geometric verification, while deep methods usually operate only on pairs of keyframes, which limits performance and efficiency. Method: The paper proposes LoopGNN, which uses a graph neural network to propagate deep feature encodings among the keyframes of a clique and achieve high-precision loop closure detection. Result: LoopGNN outperforms traditional baselines on the TartanDrive 2.0 and NCLT datasets and is more computationally efficient. Conclusion: LoopGNN is an efficient and robust loop closure detection method suited to online SLAM scenarios. Abstract: Visual loop closure detection traditionally relies on place recognition methods to retrieve candidate loops that are validated using computationally expensive RANSAC-based geometric verification. As false positive loop closures significantly degrade downstream pose graph estimates, verifying a large number of candidates in online simultaneous localization and mapping scenarios is constrained by limited time and compute resources. While most deep loop closure detection approaches only operate on pairs of keyframes, we relax this constraint by considering neighborhoods of multiple keyframes when detecting loops. In this work, we introduce LoopGNN, a graph neural network architecture that estimates loop closure consensus by leveraging cliques of visually similar keyframes retrieved through place recognition. By propagating deep feature encodings among nodes of the clique, our method yields high-precision estimates while maintaining high recall. Extensive experimental evaluations on the TartanDrive 2.0 and NCLT datasets demonstrate that LoopGNN outperforms traditional baselines. Additionally, an ablation study across various keypoint extractors demonstrates that our method is robust, regardless of the type of deep feature encodings used, and exhibits higher computational efficiency compared to classical geometric verification baselines. We release our code, supplementary material, and keyframe data at https://loopgnn.cs.uni-freiburg.de.

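[41] FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

Chengyue Huang,Brisa Maneechotesuwan,Shivang Chopra,Zsolt Kira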

Main category: cs.CV

TL;DR: Proposes FRAMES-VQA, a new benchmark for evaluating robust fine-tuning on VQA tasks under multi-modal distribution shifts.

Details

Motivation: Existing evaluation settings are mostly unimodal or specific to certain OOD types, and cope poorly with the complexities of multi-modal contexts. Method: Ten existing VQA benchmarks are grouped into ID, near-OOD, and far-OOD datasets; the Mahalanobis distance quantifies the distribution shifts, and the interaction between uni-modal and multi-modal shifts is analyzed. Result: The study provides an in-depth analysis of uni-modal and multi-modal shifts and offers guidance for developing more robust fine-tuning methods. Conclusion: FRAMES-VQA provides a new evaluation benchmark and analysis tools for VQA under multi-modal distribution shifts. Abstract: Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA.
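
The Mahalanobis distance mentioned in the abstract above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions: embeddings are reduced to two dimensions so the covariance can be inverted in closed form, and the function names `fit_gaussian_2d` and `mahalanobis_2d` are hypothetical. How the benchmark extracts embeddings and aggregates distances is not reproduced here.

```python
import math


def fit_gaussian_2d(points):
    """Sample mean and (unbiased) covariance of 2-D in-distribution embeddings."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(x, mean, cov):
    """sqrt((x - mean)^T cov^{-1} (x - mean)) using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # cov^{-1} = (1 / det) * [[d, -b], [-c, a]]
    q = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return math.sqrt(q)


# Toy in-distribution embeddings (made up for illustration).
id_embeddings = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = fit_gaussian_2d(id_embeddings)
near_shift = mahalanobis_2d((1.5, 1.0), mean, cov)
far_shift = mahalanobis_2d((6.0, 5.0), mean, cov)
```

A larger distance from the in-distribution statistics indicates a larger shift, which is the sense in which such a score could separate near-OOD from far-OOD samples.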

[42] MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning

Prasham Yatinkumar Titiya,Jainil Trivedi,Chitta Baral,Vivek Gupta

Main category: cs.CV

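TL;DR: MMTBENCH is a benchmark of 500 real-world multimodal tables for evaluating vision-language models on complex table reasoning, and it reveals performance gaps in existing models.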

Details

Motivation: The ability of current vision-language models to reason over multimodal tables (which combine semi-structured data with visual elements) has not been fully explored, so a benchmark is needed to evaluate and improve model performance. Method: The authors build MMTBENCH, which contains 500 multimodal tables and 4021 question-answer pairs covering multiple question types, reasoning types, and table types. Result: Evaluation shows that existing models perform poorly on visual reasoning and multi-step reasoning tasks, indicating a need for architectures that better integrate vision and language processing. Conclusion: MMTBENCH provides a high-quality resource for future research on multimodal tables and underscores the urgency of improving model architectures. Abstract: Multimodal tables, those that integrate semi-structured data with visual elements such as charts and maps, are ubiquitous across real-world domains, yet they pose a formidable challenge to current vision-language models (VLMs). While large language models (LLMs) and VLMs have demonstrated strong capabilities in text and image understanding, their performance on complex, real-world multimodal table reasoning remains unexplored. To bridge this gap, we introduce MMTBENCH (Multimodal Table Benchmark), a benchmark consisting of 500 real-world multimodal tables drawn from diverse real-world sources, with a total of 4021 question-answer pairs. MMTBENCH questions cover four question types (Explicit, Implicit, Answer Mention, and Visual Based), five reasoning types (Mathematical, Extrema Identification, Fact Verification, Vision Based, and Others), and eight table types (Single/Multiple Entity, Maps and Charts with Entities, Single/Multiple Charts, Maps, and Visualizations). Extensive evaluation of state-of-the-art models on all types reveals substantial performance gaps, particularly on questions requiring visual-based reasoning and multi-step inference. These findings show the urgent need for improved architectures that more tightly integrate vision and language processing. By providing a challenging, high-quality resource that mirrors the complexity of real-world tasks, MMTBENCH aims to support future research on multimodal tables.

[43] Compositional Scene Understanding through Inverse Generative Modeling

Yanbo Wang,Justin Dauwels,Yilun Du

Main category: cs.CV

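TL;DR: The paper proposes understanding scenes through inverse generative modeling, using compositional generative models to infer scene structure and objects from images and to generalize robustly to new scenes.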

Details

Motivation: To explore how generative models can be used not only to synthesize visual content but also to understand scene properties in natural images. Method: Scene understanding is formulated as an inverse generative problem in which the conditional parameters of a compositional generative model are fitted to a given image. Result: The approach infers the objects and global factors of a scene, generalizes robustly to new scenes, and supports zero-shot multi-object perception. Conclusion: Compositional generative models offer an effective approach to scene understanding and can be applied directly to pretrained text-to-image generative models. Abstract: Generative models have demonstrated remarkable abilities in generating high-fidelity visual content. In this work, we explore how generative models can further be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, where we seek to find conditional parameters of a visual generative model to best fit a given natural image. To enable this procedure to infer scene structure from images substantially different from those seen during training, we further propose to build this visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure enables us to infer the set of objects in a scene, enabling robust generalization to new test scenes with an increased number of objects of new shapes. We further illustrate how this enables us to infer global scene factors, likewise enabling robust generalization to new scenes. Finally, we illustrate how this approach can be directly applied to existing pretrained text-to-image generative models for zero-shot multi-object perception. Code and visualizations are at https://energy-based-model.github.io/compositional-inference.

[44] SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation

Claudia Cuttano,Gabriele Trivigno,Giuseppe Averta,Carlo Masone

Main category: cs.CV

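TL;DR: SANSA improves the semantic alignment of SAM2 to achieve state-of-the-art few-shot segmentation, supporting multiple interaction modes while remaining efficient.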

Details

Motivation: Although SAM2 has strong segmentation capabilities, its feature representations are tied to object tracking, which hurts performance on tasks that require semantic understanding. Method: The SANSA framework makes SAM2's latent semantic structure explicit and applies minimal task-specific modifications. Result: SANSA achieves the best results on few-shot segmentation benchmarks, supports multiple interaction modes, and is faster. Conclusion: SANSA successfully repurposes SAM2 for few-shot segmentation and demonstrates the value of its latent semantic structure. Abstract: Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples. This requires mechanisms that can both identify semantically related objects across images and accurately produce segmentation masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, offers both strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes this latent structure explicit, and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art performance on few-shot segmentation benchmarks specifically designed to assess generalization, outperforms generalist methods in the popular in-context setting, supports flexible interaction through various prompts such as points, boxes, or scribbles, and remains significantly faster and more compact than prior approaches. Code is available at https://github.com/ClaudiaCuttano/SANSA.

[45] ALTER: All-in-One Layer Pruning and Temporal Expert Routing for Efficient Diffusion Generation

Xiaomeng Yang,Lei Lu,Qihui Fan,Changdi Yang,Juyi Lin,Yanzhi Wang,Xuan Zhang,Shangqian Gao

Main category: cs.CV

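TL;DR: The ALTER framework unifies layer pruning, expert routing, and fine-tuning to substantially improve the inference efficiency of diffusion models while preserving generation quality.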

Details

Motivation: The high computational cost of diffusion models limits their practical use in resource-constrained environments, and existing acceleration methods either fail to capture temporal dynamics or suffer from a mismatch between pruning and fine-tuning. Method: ALTER uses a hypernetwork to dynamically generate layer pruning decisions and timestep routing, jointly optimizing pruning, routing, and fine-tuning to form an efficient mixture of temporal experts. Result: With 20 inference steps, ALTER needs only 25.9% of the computation and delivers a 3.64x speedup while keeping visual fidelity comparable to the original model. Conclusion: ALTER provides an efficient, high-quality optimization framework for diffusion models that suits resource-constrained settings. Abstract: Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images. However, their iterative denoising process results in significant computational overhead during inference, limiting their practical deployment in resource-constrained environments. Existing acceleration methods often adopt uniform strategies that fail to capture the temporal variations during diffusion generation, while the commonly adopted sequential pruning-then-fine-tuning strategy suffers from sub-optimality due to the misalignment between pruning decisions made on pretrained weights and the model's final parameters. To address these limitations, we introduce ALTER: All-in-One Layer Pruning and Temporal Expert Routing, a unified framework that transforms diffusion models into a mixture of efficient temporal experts. ALTER achieves a single-stage optimization that unifies layer pruning, expert routing, and model fine-tuning by employing a trainable hypernetwork, which dynamically generates layer pruning decisions and manages timestep routing to specialized, pruned expert sub-networks throughout the ongoing fine-tuning of the UNet. This unified co-optimization strategy enables significant efficiency gains while preserving high generative quality. Specifically, ALTER achieves same-level visual fidelity to the original 50-step Stable Diffusion v2.1 model while utilizing only 25.9% of its total MACs with just 20 inference steps and delivering a 3.64x speedup through 35% sparsity.

[46] HDRSDR-VQA: A Subjective Video Quality Dataset for HDR and SDR Comparative Evaluation

Bowen Chen,Cheng-han Lee,Yixu Chen,Zaixi Shang,Hai Wei,Alan C. Bovik

Main category: cs.CV

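TL;DR: HDRSDR-VQA is a large-scale video quality assessment dataset that supports direct comparison of HDR and SDR content, with 960 videos and more than 22,000 subjective pairwise comparisons.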

Details

Motivation: Existing datasets usually focus on a single dynamic range format or use limited evaluation protocols, so they cannot compare HDR and SDR content directly. HDRSDR-VQA aims to fill this gap. Method: The dataset contains 960 videos generated from 54 source sequences across nine distortion levels. A subjective study with 145 participants and six HDR televisions collected more than 22,000 pairwise comparisons, which were converted into JOD scores. Result: The dataset supports direct comparison of HDR and SDR content and sheds light on when and why viewers prefer one format. Conclusion: HDRSDR-VQA is an important resource for research on video quality assessment, adaptive streaming, and perceptual models. Abstract: We introduce HDRSDR-VQA, a large-scale video quality assessment dataset designed to facilitate comparative analysis between High Dynamic Range (HDR) and Standard Dynamic Range (SDR) content under realistic viewing conditions. The dataset comprises 960 videos generated from 54 diverse source sequences, each presented in both HDR and SDR formats across nine distortion levels. To obtain reliable perceptual quality scores, we conducted a comprehensive subjective study involving 145 participants and six consumer-grade HDR-capable televisions. More than 22,000 pairwise comparisons were collected and scaled into Just-Objectionable-Difference (JOD) scores. Unlike prior datasets that focus on a single dynamic range format or use limited evaluation protocols, HDRSDR-VQA enables direct content-level comparison between HDR and SDR versions, supporting detailed investigations into when and why one format is preferred over the other. The open-sourced part of the dataset is publicly available to support further research in video quality assessment, content-adaptive streaming, and perceptual model development.

[47] UniMoGen: Universal Motion Generation

Aliasghar Khani,Arianna Rampini,Evan Atherton,Bruno Roy

Main category: cs.CV

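TL;DR: UniMoGen is a UNet-based diffusion model for skeleton-agnostic motion generation that handles diverse characters (such as humans and animals) and is both efficient and controllable.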

Details

Motivation: Existing methods depend on specific skeletal structures, which limits their generality. UniMoGen aims to solve this by making motion generation skeleton-agnostic. Method: A UNet-based diffusion model dynamically processes only the necessary joints, supports control through style and trajectory inputs, and can continue motion from past frames. Result: UniMoGen performs strongly on the 100style and LAFAN1 datasets, surpasses existing methods, and generates motion efficiently across skeletons. Conclusion: UniMoGen offers a flexible, efficient, and controllable solution for character animation with broad application potential. Abstract: Motion generation is a cornerstone of computer graphics, animation, gaming, and robotics, enabling the creation of realistic and varied character movements. A significant limitation of existing methods is their reliance on specific skeletal structures, which restricts their versatility across different characters. To overcome this, we introduce UniMoGen, a novel UNet-based diffusion model designed for skeleton-agnostic motion generation. UniMoGen can be trained on motion data from diverse characters, such as humans and animals, without the need for a predefined maximum number of joints. By dynamically processing only the necessary joints for each character, our model achieves both skeleton agnosticism and computational efficiency. Key features of UniMoGen include controllability via style and trajectory inputs, and the ability to continue motions from past frames. We demonstrate UniMoGen's effectiveness on the 100style dataset, where it outperforms state-of-the-art methods in diverse character motion generation. Furthermore, when trained on both the 100style and LAFAN1 datasets, which use different skeletons, UniMoGen achieves high performance and improved efficiency across both skeletons. These results highlight UniMoGen's potential to advance motion generation by providing a flexible, efficient, and controllable solution for a wide range of character animations.

[48] Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation

Mehrdad Noori,David Osowiechi,Gustavo Adolfo Vargas Hakim,Ali Bahri,Moslem Yazdanpanah,Sahar Dastani,Farzad Beizaee,Ismail Ben Ayed,Christian Desrosiers

Main category: cs.CV

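TL;DR: The paper proposes a test-time adaptation (TTA) method for open-vocabulary semantic segmentation (OVSS), called MLMP entropy minimization, which integrates features from intermediate vision-encoder layers with multiple prompt templates and significantly outperforms directly adopted classification TTA baselines.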

Details

Motivation: Existing TTA methods focus mainly on image classification and overlook dense prediction tasks such as OVSS, so a TTA method designed for segmentation is needed. Method: The proposed MLMP entropy minimization integrates features from intermediate vision-encoder layers and applies multiple prompt templates at both the global CLS token and the local pixel-wise level. Result: Experiments show that the method significantly outperforms classification TTA baselines across 82 test scenarios. Conclusion: The MLMP method establishes a standardized benchmark for TTA in open-vocabulary segmentation and demonstrates its effectiveness. Abstract: Recently, test-time adaptation has attracted wide interest in the context of vision-language models for image classification. However, to the best of our knowledge, the problem is completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation during test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS token and local pixel-wise levels. Our approach can be used as a plug-and-play module for any segmentation network, does not require additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, seven segmentation datasets, and 15 common corruptions, with a total of 82 distinct test scenarios, establishing a standardized and comprehensive testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines.
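
To make the objective in the abstract above more concrete, the sketch below computes a prediction entropy averaged over several encoder levels and prompt templates. It is only a toy under stated assumptions: the plain average over (level, prompt) pairs, the nested-list input layout, and the function names are hypothetical, and the paper's actual aggregation and optimization procedure may differ.

```python
import math


def softmax(logits):
    # Numerically stable softmax over class logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    # Shannon entropy in nats; zero-probability classes contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def multi_level_multi_prompt_entropy(logits):
    """Mean prediction entropy over encoder levels and prompt templates.

    logits[level][prompt] is a list of class logits for one token or pixel.
    A test-time adaptation loop would minimize this quantity; no gradient
    step is shown here.
    """
    values = [entropy(softmax(class_logits))
              for level in logits
              for class_logits in level]
    return sum(values) / len(values)


# Two levels x two prompt templates x three classes (made-up numbers).
confident = [[[50.0, 0.0, 0.0], [40.0, 0.0, 0.0]],
             [[30.0, 0.0, 0.0], [45.0, 0.0, 0.0]]]
uncertain = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
             [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
```

Confident predictions give an entropy near zero, while uniform predictions give the maximum value log(C) for C classes, which is why minimizing this score pushes the model toward sharper outputs.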

[49] RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

Xuwei Xu,Yang Li,Yudong Chen,Jiajun Liu,Sen Wang

Main category: cs.CV

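TL;DR: The study finds that FFN layers are the main source of ViT inference latency and proposes a channel idle mechanism that makes FFN layers more efficient through structural reparameterization, cutting latency substantially while maintaining or improving accuracy.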

Details

Motivation: To reveal the key impact of FFN layers on ViT inference latency and explore ways to make large-scale ViTs more efficient. Method: A channel idle mechanism lets a subset of feature channels bypass the nonlinear activation function, forming a linear pathway that enables structural reparameterization. Result: The RePaViT family substantially reduces latency while maintaining or improving accuracy, with larger models benefiting more (for example, RePa-ViT-Large gains a 66.8% speed-up and 1.7% higher accuracy). Conclusion: RePaViT is the first to apply structural reparameterization to FFN layers and opens a new direction for efficient ViTs. Abstract: We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact becoming more significant as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of ReParameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps or even higher accuracies on larger models. In particular, RePa-ViT-Large and RePa-ViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. To the best of our knowledge, RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs, and we believe that it represents a promising direction for efficient ViTs. Source code is available at https://github.com/Ackesnal/RePaViT.
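
The linear pathway described in the abstract above can be illustrated with a small numerical sketch: hidden channels that skip the activation form a purely linear map, so their two weight matrices can be merged into one after training. The code below is a simplified toy, not the paper's implementation. It assumes a plain two-layer FFN with ReLU and no normalization or residual connection, and all function names are hypothetical.

```python
import random


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def relu(v):
    # Assumed activation for this sketch.
    return [max(0.0, t) for t in v]


def ffn_with_idle_channels(x, W1, b1, W2, b2, n_active):
    """Training-time form: only the first n_active hidden channels are activated."""
    h = add(matvec(W1, x), b1)
    h = relu(h[:n_active]) + h[n_active:]   # idle channels bypass the activation
    return add(matvec(W2, h), b2)


def reparameterize(W1, b1, W2, b2, n_active):
    """Fold the idle (linear) path W2b @ (W1b x + b1b) into one matrix and bias."""
    W1a, W1b = W1[:n_active], W1[n_active:]
    b1a, b1b = b1[:n_active], b1[n_active:]
    W2a = [row[:n_active] for row in W2]
    W2b = [row[n_active:] for row in W2]
    merged = matmul(W2b, W1b)               # D x D matrix replacing the idle path
    bias = add(matvec(W2b, b1b), b2)
    return W1a, b1a, W2a, merged, bias


def ffn_reparameterized(x, W1a, b1a, W2a, merged, bias):
    """Inference-time form: a narrower nonlinear branch plus one linear map."""
    active = relu(add(matvec(W1a, x), b1a))
    return add(add(matvec(W2a, active), matvec(merged, x)), bias)


def demo_max_difference(dim=3, hidden=8, n_active=2, seed=0):
    """Largest output gap between the two forms on random toy weights."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(dim)]
    b2 = [rng.uniform(-1, 1) for _ in range(dim)]
    x = [rng.uniform(-1, 1) for _ in range(dim)]
    before = ffn_with_idle_channels(x, W1, b1, W2, b2, n_active)
    after = ffn_reparameterized(x, *reparameterize(W1, b1, W2, b2, n_active))
    return max(abs(p - q) for p, q in zip(before, after))
```

In this toy, the merged path costs one D-by-D product in place of two products through the idle hidden channels, which saves work whenever the number of idle channels exceeds D/2.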

[50] FPAN: Mitigating Replication in Diffusion Models through the Fine-Grained Probabilistic Addition of Noise to Token Embeddings

Jingqi Xu,Chenghao Li,Yuke Zhang,Peter A. Beerel

Main category: cs.CV

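TL;DR: The paper proposes a fine-grained noise injection technique (FPAN) that probabilistically adds larger amounts of noise to token embeddings, substantially reducing the replication of training data by diffusion models while preserving image quality.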

Details

Motivation: Diffusion models excel at generating high-quality images but tend to replicate training data, which raises privacy concerns. Existing methods have limited effect, so a more effective solution is needed. Method: The paper analyzes the effect of different noise amounts and proposes FPAN, which probabilistically adds a larger amount of noise to token embeddings. Result: FPAN reduces replication by an average of 28.78% relative to the baseline diffusion model, outperforms the prior consistent-magnitude noise approach by 26.51%, and reduces replication by up to a further 16.82% when combined with other mitigation methods. Conclusion: FPAN is an effective privacy-preserving technique for diffusion models that markedly reduces data replication without harming image quality. Abstract: Diffusion models have demonstrated remarkable potential in generating high-quality images. However, their tendency to replicate training data raises serious privacy concerns, particularly when the training datasets contain sensitive or private information. Existing mitigation strategies primarily focus on reducing image duplication, modifying the cross-attention mechanism, and altering the denoising backbone architecture of diffusion models. Moreover, recent work has shown that adding a consistent small amount of noise to text embeddings can reduce replication to some degree. In this work, we begin by analyzing the impact of adding varying amounts of noise. Based on our analysis, we propose a fine-grained noise injection technique that probabilistically adds a larger amount of noise to token embeddings. We refer to our method as Fine-grained Probabilistic Addition of Noise (FPAN). Through our extensive experiments, we show that our proposed FPAN can reduce replication by an average of 28.78% compared to the baseline diffusion model without significantly impacting image quality, and outperforms the prior consistent-magnitude-noise-addition approach by 26.51%. Moreover, when combined with other existing mitigation methods, our FPAN approach can further reduce replication by up to 16.82% with similar, if not improved, image quality.
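
One way to picture the noise injection described in the abstract above is the toy sketch below, in which each token embedding independently receives larger Gaussian noise with some probability and smaller (or no) noise otherwise. The per-token Bernoulli choice, the Gaussian noise, the parameter names `p`, `small_sigma`, and `large_sigma`, and their default values are all assumptions made for illustration; the abstract does not specify the paper's exact scheme.

```python
import random


def add_probabilistic_noise(embeddings, p=0.5, small_sigma=0.0,
                            large_sigma=1.0, rng=None):
    """Return a noisy copy of a list of token embeddings.

    Each token independently gets noise of scale large_sigma with
    probability p, and noise of scale small_sigma otherwise.
    """
    rng = rng or random.Random()
    noisy = []
    for token in embeddings:
        sigma = large_sigma if rng.random() < p else small_sigma
        if sigma > 0.0:
            noisy.append([v + rng.gauss(0.0, sigma) for v in token])
        else:
            noisy.append(list(token))       # leave this token unchanged
    return noisy


# Toy prompt of three tokens with four-dimensional embeddings (made-up values).
prompt = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8],
          [0.9, 1.0, 1.1, 1.2]]
perturbed = add_probabilistic_noise(prompt, p=0.5, rng=random.Random(0))
```

Setting `p=0.0` with `small_sigma=0.0` recovers the unmodified embeddings, while `p=1.0` perturbs every token, so `p` acts as a knob between no perturbation and full perturbation.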

[51] Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task

Yanbei Jiang,Yihao Ding,Chao Lei,Jiayang Ao,Jey Han Lau,Krista A. Ehinger

Main category: cs.CV

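TL;DR: The paper introduces the MultiStAR benchmark and the MSEval metric to evaluate the multi-stage performance of multimodal large language models on abstract visual reasoning, and finds that existing models still struggle at complex rule detection stages.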

Details

Motivation: Existing abstract visual reasoning (AVR) benchmarks focus only on single-step reasoning and final results, ignore the multi-stage nature of the reasoning process, and leave unexplained why existing models perform poorly. Method: The MultiStAR benchmark is built on RAVEN, the MSEval metric scores the correctness of both intermediate steps and final results, and experiments cover 17 representative MLLMs. Result: Experiments show that existing MLLMs perform well on basic perception tasks but still struggle at complex rule detection stages. Conclusion: MultiStAR and MSEval fill a gap in AVR evaluation and expose the limitations of MLLMs in complex reasoning. Abstract: Current Multimodal Large Language Models (MLLMs) excel in general visual reasoning but remain underexplored in Abstract Visual Reasoning (AVR), which demands higher-order reasoning to identify abstract rules beyond simple perception. Existing AVR benchmarks focus on single-step reasoning, emphasizing the end result but neglecting the multi-stage nature of the reasoning process. Past studies found that MLLMs struggle with these benchmarks, but they do not explain how the models fail. To address this gap, we introduce MultiStAR, a Multi-Stage AVR benchmark, based on RAVEN, designed to assess reasoning across varying levels of complexity. Additionally, existing metrics like accuracy focus only on the final outcomes and do not account for the correctness of intermediate steps. Therefore, we propose a novel metric, MSEval, which considers the correctness of intermediate steps in addition to the final outcomes. We conduct comprehensive experiments on MultiStAR using 17 representative closed-source and open-source MLLMs. The results reveal that while existing MLLMs perform adequately on basic perception tasks, they continue to face challenges in more complex rule detection stages.

[52] Rethinking Gradient-based Adversarial Attacks on Point Cloud Classification

Jun Chen,Xinke Li,Mingyue Xu,Tianrui Li,Chongshou Li

Main category: cs.CV

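TL;DR: The paper proposes two new strategies for gradient-based adversarial attacks, weighted gradients with adaptive step sizes (WAAttack) and subset attacks (SubAttack), which substantially improve attack effectiveness and imperceptibility.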

Details

Motivation: Existing gradient-based adversarial attacks ignore the heterogeneous nature of point clouds, which leads to excessive and perceptible perturbations. Method: The paper proposes WAAttack (weighted gradients and adaptive step sizes) and SubAttack (focusing on critical subsets) to refine the attack strategy. Result: Experiments show that the method outperforms existing baselines at generating imperceptible adversarial examples. Conclusion: The paper offers a more effective and stealthy approach to adversarial attacks on point cloud classification. Abstract: Gradient-based adversarial attacks have become a dominant approach for evaluating the robustness of point cloud classification models. However, existing methods often rely on uniform update rules that fail to consider the heterogeneous nature of point clouds, resulting in excessive and perceptible perturbations. In this paper, we rethink the design of gradient-based attacks by analyzing the limitations of conventional gradient update mechanisms and propose two new strategies to improve both attack effectiveness and imperceptibility. First, we introduce WAAttack, a novel framework that incorporates weighted gradients and an adaptive step-size strategy to account for the non-uniform contribution of points during optimization. This approach enables more targeted and subtle perturbations by dynamically adjusting updates according to the local structure and sensitivity of each point. Second, we propose SubAttack, a complementary strategy that decomposes the point cloud into subsets and focuses perturbation efforts on structurally critical regions. Together, these methods represent a principled rethinking of gradient-based adversarial attacks for 3D point cloud classification. Extensive experiments demonstrate that our approach outperforms state-of-the-art baselines in generating highly imperceptible adversarial examples. Code will be released upon paper acceptance.

[53] Towards Scalable Language-Image Pre-training for 3D Medical Imaging

Chenhui Zhao,Yiwei Lyu,Asadur Chowdury,Edward Harake,Akhil Kondepudi,Akshay Rao,Xinhai Hou,Honglak Lee,Todd Hollon

Main category: cs.CV

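TL;DR: HLIP is a scalable pre-training framework for 3D medical imaging that tackles computational cost with a hierarchical attention mechanism and achieves significant gains on multiple benchmarks.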

Details

Motivation: The high computational demands of 3D medical imaging (such as CT and MRI) have limited the success of language-image pre-training; HLIP aims to address this. Method: HLIP uses a lightweight hierarchical attention mechanism based on the natural hierarchy of radiology data (slice, scan, study), improving computational efficiency and generalization. Result: Strong results on multiple benchmarks, e.g., +4.3% macro AUC on Rad-ChestCT and +32.4% balanced accuracy on Pub-Brain-5. Conclusion: HLIP shows that pre-training directly on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. Abstract: Language-image pre-training has demonstrated strong performance in 2D medical imaging, but its success in 3D modalities such as CT and MRI remains limited due to the high computational demands of volumetric data, which pose a significant barrier to training on large-scale, uncurated clinical studies. In this study, we introduce Hierarchical attention for Language-Image Pre-training (HLIP), a scalable pre-training framework for 3D medical imaging. HLIP adopts a lightweight hierarchical attention mechanism inspired by the natural hierarchy of radiology data: slice, scan, and study. This mechanism exhibits strong generalizability, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. Moreover, the computational efficiency of HLIP enables direct training on uncurated datasets. Trained on 220K patients with 3.13 million scans for brain MRI and 240K patients with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +32.4% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +1.4% and +6.9% macro AUC on head CT benchmarks RSNA and CQ500, respectively. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/Zch0414/hlip

[54] GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning

Shikhhar Siingh,Abhinav Rawat,Vivek Gupta,Chitta Baral

Main category: cs.CV

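TL;DR: The GETReason framework improves understanding of an image's deeper meaning by extracting global event, temporal, and geospatial information, and introduces the GREAT evaluation metric.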

Details

Motivation: Existing methods struggle to accurately extract contextual information from publicly significant images; GETReason aims to address this. Method: Uses a layered multi-agent approach that reasons over global event, temporal, and geospatial information. Result: Evaluation with GREAT shows that the method can effectively infer the link between an image and its event context. Conclusion: GETReason significantly improves the understanding of image context, supporting journalism and education. Abstract: Publicly significant images from events hold valuable contextual information, crucial for journalism and education. However, existing methods often struggle to extract this relevance accurately. To address this, we introduce GETReason (Geospatial Event Temporal Reasoning), a framework that moves beyond surface-level image descriptions to infer deeper contextual meaning. We propose that extracting global event, temporal, and geospatial information enhances understanding of an image's significance. Additionally, we introduce GREAT (Geospatial Reasoning and Event Accuracy with Temporal Alignment), a new metric for evaluating reasoning-based image understanding. Our layered multi-agent approach, assessed using a reasoning-weighted metric, demonstrates that meaningful insights can be inferred, effectively linking images to their broader event context.

[55] Cross-DINO: Cross the Deep MLP and Transformer for Small Object Detection

Guiping Cao,Wenjian Huang,Xiangyuan Lan,Jianguo Zhang,Dongmei Jiang,Yaowei Wang

Main category: cs.CV

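TL;DR: The paper proposes Cross-DINO, which improves small object detection by combining a deep MLP network with a new Cross Coding Twice Module (CCTM), and introduces Category-Size (CS) soft labels and a Boost Loss function. Experiments show it outperforms existing methods.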

Details

Motivation: Small object detection (SOD) is challenging because of limited information and low model prediction scores, and the potential of existing Transformer detectors for this task has not been fully explored. Method: Proposes Cross-DINO, which combines a deep MLP network with the CCTM module to enhance small-object details, and introduces CS soft labels and the Boost Loss function. Result: Strong results on multiple datasets, e.g., 36.4% APs on COCO, +4.4% over DINO. Conclusion: Cross-DINO effectively improves the performance of DETR-like models on small object detection with fewer parameters and less computation. Abstract: Small Object Detection (SOD) poses significant challenges due to limited information and the model's low class prediction score. While Transformer-based detectors have shown promising performance, their potential for SOD remains largely unexplored. In typical DETR-like frameworks, the CNN backbone network, specialized in aggregating local information, struggles to capture the necessary contextual information for SOD. The multiple attention layers in the Transformer Encoder face difficulties in effectively attending to small objects and can also lead to blurring of features. Furthermore, the model's lower class prediction score of small objects compared to large objects further increases the difficulty of SOD. To address these challenges, we introduce a novel approach called Cross-DINO. This approach incorporates the deep MLP network to aggregate initial feature representations with both short and long range information for SOD. Then, a new Cross Coding Twice Module (CCTM) is applied to integrate these initial representations into the Transformer Encoder feature, enhancing the details of small objects. Additionally, we introduce a new kind of soft label named Category-Size (CS), integrating the Category and Size of objects. By treating CS as new ground truth, we propose a new loss function called Boost Loss to improve the class prediction score of the model. Extensive experimental results on COCO, WiderPerson, VisDrone, AI-TOD, and SODA-D datasets demonstrate that Cross-DINO efficiently improves the performance of DETR-like models on SOD. Specifically, our model achieves 36.4% APs on COCO for SOD with only 45M parameters, outperforming DINO by +4.4% APs (36.4% vs. 32.0%) with fewer parameters and FLOPs, under the 12-epoch training setting. The source code will be available at https://github.com/Med-Process/Cross-DINO.

[56] EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

Zun Wang,Jaemin Cho,Jialu Li,Han Lin,Jaehong Yoon,Yue Zhang,Mohit Bansal

Main category: cs.CV

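TL;DR: EPiC is an efficient and precise camera control learning framework that builds high-quality anchor videos by masking source videos, without expensive camera trajectory annotations, and pairs them with a lightweight Anchor-ControlNet module to achieve efficient training and precise 3D camera control.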

Details

Motivation: Existing methods rely on point cloud estimation to generate anchor videos, which introduces errors and requires extensive camera trajectory annotations, making them resource-intensive. Method: Generates high-quality anchor videos by masking source videos based on first-frame visibility, and introduces the Anchor-ControlNet module to integrate anchor video guidance. Result: EPiC achieves SOTA performance on RealEstate10K and MiraData and generalizes zero-shot to video-to-video scenarios. Conclusion: EPiC achieves efficient training and precise camera control without modifying the diffusion model backbone, and generalizes strongly. Abstract: Recent approaches to 3D camera control in video diffusion models (VDMs) often create anchor videos to guide diffusion models as a structured prior by rendering from estimated point clouds following annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and thus can be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions to pretrained VDMs, with less than 1% of backbone model parameters. By combining the proposed anchor video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, training steps, and less data, without requiring modifications to the diffusion model backbone typically needed to mitigate rendering misalignments. Although trained on masking-based anchor videos, our method generalizes robustly to anchor videos made with point clouds during inference, enabling precise 3D-informed camera control. EPiC achieves SOTA performance on RealEstate10K and MiraData on the I2V camera control task, demonstrating precise and robust camera control ability both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.

[57] Hyperspectral Gaussian Splatting

Sunil Kumar Narayanan,Lingjun Zhao,Lu Gan,Yongsheng Chen

Main category: cs.CV

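TL;DR: Proposes HS-GS, a new method that combines 3D Gaussian Splatting with a diffusion model for explicit 3D reconstruction and novel view synthesis of hyperspectral scenes, with significantly improved performance.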

Details

Motivation: To address the long training time and slow rendering of NeRF in hyperspectral imaging while capturing spectral detail and improving image quality. Method: Combines 3D Gaussian Splatting with a diffusion model and introduces a wavelength encoder and a KL-divergence loss to improve spectral reconstruction and denoising. Result: Evaluated on the Hyper-NeRF dataset, HS-GS outperforms existing methods. Conclusion: HS-GS offers an efficient, high-quality solution for hyperspectral imaging with broad application potential. Abstract: Hyperspectral imaging (HSI) has been widely used in agricultural applications for non-destructive estimation of plant nutrient composition and precise determination of nutritional elements in samples. Recently, 3D reconstruction methods have been used to create implicit neural representations of HSI scenes, which can help localize the target object's nutrient composition spatially and spectrally. Neural Radiance Field (NeRF) is a cutting-edge implicit representation that can render hyperspectral channel compositions of each spatial location from any viewing direction. However, it faces limitations in training time and rendering speed. In this paper, we propose Hyperspectral Gaussian Splatting (HS-GS), which combines the state-of-the-art 3D Gaussian Splatting (3DGS) with a diffusion model to enable 3D explicit reconstruction of the hyperspectral scenes and novel view synthesis for the entire spectral range. To enhance the model's ability to capture fine-grained reflectance variations across the light spectrum and leverage correlations between adjacent wavelengths for denoising, we introduce a wavelength encoder to generate wavelength-specific spherical harmonics offsets. We also introduce a novel Kullback-Leibler divergence-based loss to mitigate the spectral distribution gap between the rendered image and the ground truth. A diffusion model is further applied for denoising the rendered images and generating photorealistic hyperspectral images. We present extensive evaluations on five diverse hyperspectral scenes from the Hyper-NeRF dataset to show the effectiveness of our proposed HS-GS framework. The results demonstrate that HS-GS achieves new state-of-the-art performance among all previously published methods. Code will be released upon publication.
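To make the idea of a Kullback-Leibler loss over spectra concrete, here is a minimal sketch that compares one pixel's rendered spectrum with its ground truth. It assumes each spectrum is first normalized into a probability distribution over wavelength bands; the function names, the epsilon smoothing, and the direction KL(ground truth || rendered) are illustrative assumptions rather than details confirmed by the paper.

```python
import math

def normalize_spectrum(values, eps=1e-8):
    """Turn non-negative per-band intensities into a probability distribution."""
    shifted = [max(v, 0.0) + eps for v in values]
    total = sum(shifted)
    return [v / total for v in shifted]

def spectral_kl(ground_truth, rendered, eps=1e-8):
    """KL(p || q) between the normalized ground-truth and rendered spectra."""
    p = normalize_spectrum(ground_truth, eps)
    q = normalize_spectrum(rendered, eps)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

same_shape = spectral_kl([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # same shape, different brightness
mismatch = spectral_kl([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])    # reversed spectral profile
```

Because of the normalization, this sketch penalizes differences in spectral shape but not in overall brightness, so a full loss would normally pair it with a per-band reconstruction term.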

[58] Concentrate on Weakness: Mining Hard Prototypes for Few-Shot Medical Image Segmentation

Jianchao Jiang,Haofeng Zhang

Main category: cs.CV

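TL;DR: The paper proposes an improved few-shot medical image segmentation method that significantly boosts segmentation performance by focusing on weak features and refining boundaries.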

Details

Motivation: Existing prototype-based few-shot medical image segmentation methods blur boundaries because they rely on random sampling or local averaging; attending to weak features is needed to sharpen boundaries. Method: Designs a Support Self-Prediction (SSP) module to identify weak features, a Hard Prototypes Generation (HPG) module to generate hard prototypes, and a Multiple Similarity Maps Fusion (MSMF) module to refine the segmentation, and adds a boundary loss. Result: Achieves state-of-the-art performance on three public medical image datasets. Conclusion: By focusing on weak features and boundary refinement, the method significantly improves the accuracy and boundary clarity of few-shot medical image segmentation. Abstract: Few-Shot Medical Image Segmentation (FSMIS) has been widely used to train a model that can perform segmentation from only a few annotated images. However, most existing prototype-based FSMIS methods generate multiple prototypes from the support image solely by random sampling or local averaging, which can cause particularly severe boundary blurring because normal features tend to account for the majority of features of a specific category. Consequently, we propose to pay more attention to those weaker features that are crucial for a clear segmentation boundary. Specifically, we design a Support Self-Prediction (SSP) module to identify such weak features by comparing the true support mask with one predicted by the global support prototype. Then, a Hard Prototypes Generation (HPG) module is employed to generate multiple hard prototypes based on these weak features. Subsequently, a Multiple Similarity Maps Fusion (MSMF) module is devised to generate the final segmentation mask in a dual-path fashion to mitigate the imbalance between foreground and background in medical images. Furthermore, we introduce a boundary loss to further constrain the edge of the segmentation. Extensive experiments on three publicly available medical image datasets demonstrate that our method achieves state-of-the-art performance. Code is available at https://github.com/jcjiang99/CoW.

[59] CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation

Pardis Taghavi,Tian Liu,Renjie Li,Reza Langari,Zhengzhong Tu

Main category: cs.CV

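TL;DR: CAST is a semi-supervised knowledge distillation framework that compresses pretrained vision foundation models (VFMs) into compact expert models using limited labeled data and abundant unlabeled data, significantly improving instance segmentation performance.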

Details

Motivation: Instance segmentation requires costly per-pixel annotations and large models; CAST aims to reduce annotation needs and improve model efficiency through semi-supervised learning. Method: CAST has three stages: (1) domain adaptation of the VFM via self-training with contrastive pixel calibration; (2) distillation into a compact student model with a multi-objective loss that combines supervised learning and pseudo-labels; (3) fine-tuning on labeled data to remove pseudo-label bias. Its core is an instance-aware pixel-wise contrastive loss. Result: On Cityscapes and ADE20K, CAST's student model (about 11x smaller than the VFM) surpasses its teacher by +3.4 AP and +1.5 AP, respectively, and outperforms existing semi-supervised methods. Conclusion: By combining contrastive learning and knowledge distillation, CAST significantly improves the performance and efficiency of instance segmentation while reducing annotation needs. Abstract: Instance segmentation demands costly per-pixel annotations and large models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pretrained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM teacher(s) via self-training with contrastive pixel calibration, (2) distillation into a compact student via a unified multi-objective loss that couples standard supervision and pseudo-labels with our instance-aware pixel-wise contrastive term, and (3) fine-tuning on labeled data to remove residual pseudo-label bias. Central to CAST is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to mine informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11X smaller student surpasses its adapted VFM teacher(s) by +3.4 AP (33.9 vs. 30.5) and +1.5 AP (16.7 vs. 15.2) and outperforms state-of-the-art semi-supervised approaches.
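For readers unfamiliar with pixel-wise contrastive terms, the following is a generic InfoNCE-style sketch over pixel embeddings grouped by instance id. It is not CAST's loss: the fusion of mask and class scores for negative mining is omitted, and the function names, cosine similarity, and temperature value are assumptions made for illustration. Embeddings are assumed to be non-zero vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pixel_contrastive_loss(embeddings, instance_ids, temperature=0.1):
    """Average InfoNCE loss over pixels that have at least one same-instance positive."""
    n = len(embeddings)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and instance_ids[j] == instance_ids[i]]
        if not positives:
            continue  # a pixel with no positive contributes nothing
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Negative log-probability of picking a positive, averaged over positives.
        losses.append(-sum(logits[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

pixels = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = pixel_contrastive_loss(pixels, [0, 0, 1, 1])  # ids agree with embedding clusters
mixed = pixel_contrastive_loss(pixels, [0, 1, 0, 1])    # ids cut across the clusters
```

The loss is small when same-instance pixels are close in embedding space and far from other instances, which is the kind of inter-instance margin the abstract describes.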

[60] Reference-Guided Identity Preserving Face Restoration

Mo Zhou,Keren Ye,Viraj Shah,Kangfu Mei,Mauricio Delbracio,Peyman Milanfar,Vishal M. Patel,Hossein Talebi

Main category: cs.CV

TL;DR: The paper proposes a new method that improves identity preservation in diffusion-based image restoration by making maximal use of reference faces.

Details

Motivation: Preserving face identity is a critical yet persistent challenge in diffusion-based image restoration, and existing methods fail to fully exploit reference faces. Method: The method comprises (1) Composite Context, which fuses multi-level information from the reference face; (2) Hard Example Identity Loss, which improves the efficiency of identity learning; and (3) a training-free adaptation to multi-reference inputs at inference time. Result: On benchmarks such as FFHQ-Ref and CelebA-Ref-Test, the method significantly improves face restoration quality and achieves state-of-the-art identity preservation. Conclusion: Through its new techniques and loss function, the method significantly improves reference-based face restoration and identity preservation. Abstract: Preserving face identity is a critical yet persistent challenge in diffusion-based image restoration. While reference faces offer a path forward, existing reference-based methods often fail to fully exploit their potential. This paper introduces a novel approach that maximizes reference face utility for improved face restoration and identity preservation. Our method makes three key contributions: 1) Composite Context, a comprehensive representation that fuses multi-level (high- and low-level) information from the reference face, offering richer guidance than prior singular representations. 2) Hard Example Identity Loss, a novel loss function that leverages the reference face to address the identity learning inefficiencies found in the existing identity loss. 3) A training-free method to adapt the model to multi-reference inputs during inference. The proposed method demonstrably restores high-quality faces and achieves state-of-the-art identity preserving restoration on benchmarks such as FFHQ-Ref and CelebA-Ref-Test, consistently outperforming previous work.

[61] AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment

Yiheng Lin,Shifang Zhao,Ting Liu,Xiaochao Qu,Luoqi Liu,Yao Zhao,Yunchao Wei

Main category: cs.CV

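TL;DR: AlignGen proposes a cross-modality prior alignment mechanism that uses a learnable token, a robust training strategy, and a selective cross-modal attention mask to counter the bias toward the textual prior when the text and reference image are misaligned, significantly improving personalized image generation.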

Details

Motivation: When the text prompt and reference image are misaligned, existing zero-shot methods produce results biased toward the textual prior, losing reference content. Method: Introduces a learnable token, a robust training strategy, and a selective cross-modal attention mask to align the textual and visual priors. Result: AlignGen outperforms existing zero-shot methods in experiments and even surpasses popular test-time optimization methods. Conclusion: Through cross-modality prior alignment, AlignGen effectively improves personalized image generation when the text and reference image are misaligned. Abstract: Personalized image generation aims to integrate user-provided concepts into text-to-image models, enabling the generation of customized content based on a given prompt. Recent zero-shot approaches, particularly those leveraging diffusion transformers, incorporate reference image information through a multi-modal attention mechanism. This integration allows the generated output to be influenced by both the textual prior from the prompt and the visual prior from the reference image. However, we observe that when the prompt and reference image are misaligned, the generated results exhibit a stronger bias toward the textual prior, leading to a significant loss of reference content. To address this issue, we propose AlignGen, a Cross-Modality Prior Alignment mechanism that enhances personalized image generation by: 1) introducing a learnable token to bridge the gap between the textual and visual priors, 2) incorporating a robust training strategy to ensure proper prior alignment, and 3) employing a selective cross-modal attention mask within the multi-modal attention mechanism to further align the priors. Experimental results demonstrate that AlignGen outperforms existing zero-shot methods and even surpasses popular test-time optimization approaches.

[62] LiDARDustX: A LiDAR Dataset for Dusty Unstructured Road Environments

Chenfeng Wei,Qi Wu,Si Zuo,Jiahua Xu,Boyang Zhao,Zeyu Yang,Guotao Xie,Shenhong Wang

Main category: cs.CV

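TL;DR: The LiDARDustX dataset fills a gap in autonomous driving perception under high-dust conditions, providing 30,000 LiDAR frames and an analysis of how dust affects perception accuracy.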

Details

Motivation: Existing datasets mainly target structured urban environments and lack support for special scenarios such as heavy dust, which limits algorithm generalization. Method: Data are collected with six LiDAR sensors and come with 3D annotations and point cloud semantic segmentation; more than 80% of the scenes are dust-affected. Result: Establishes a benchmark for 3D detection and segmentation algorithms in high-dust environments and analyzes the impact of dust on perception. Conclusion: LiDARDustX is an important resource for autonomous driving research in high-dust environments and highlights the challenges dust poses to perception tasks. Abstract: Autonomous driving datasets are essential for validating the progress of intelligent vehicle algorithms, which include localization, perception, and prediction. However, existing datasets are predominantly focused on structured urban environments, which limits the exploration of unstructured and specialized scenarios, particularly those characterized by significant dust levels. This paper introduces the LiDARDustX dataset, which is specifically designed for perception tasks under high-dust conditions, such as those encountered in mining areas. The LiDARDustX dataset consists of 30,000 LiDAR frames captured by six different LiDAR sensors, each accompanied by 3D bounding box annotations and point cloud semantic segmentation. Notably, over 80% of the dataset comprises dust-affected scenes. By utilizing this dataset, we have established a benchmark for evaluating the performance of state-of-the-art 3D detection and segmentation algorithms. Additionally, we have analyzed the impact of dust on perception accuracy and delved into the causes of these effects. The data and further information can be accessed at: https://github.com/vincentweikey/LiDARDustX.

[63] BD Open LULC Map: High-resolution land use land cover mapping & benchmarking for urban development in Dhaka, Bangladesh

Mir Sazzat Hossain,Ovi Paul,Md Akil Raihan Iftee,Rakibul Hasan Rajib,Abu Bakar Siddik Nayem,Anis Sarker,Arshad Momen,Md. Ashraful Amin,Amin Ahsan Ali,AKM Mahbubur Rahman

Main category: cs.CV

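TL;DR: BD Open LULC Map (BOLM) provides pixel-wise LULC annotations on high-resolution satellite imagery to support deep learning models and domain adaptation tasks, filling a gap in LULC datasets for South/East Asia.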

Details

Motivation: To address the scarcity of annotated satellite data in South/East Asian developing countries, caused by limited funding, diverse infrastructure, and dense populations. Method: Uses high-resolution Bing satellite imagery (2.22 m/pixel) to provide 11-class LULC annotations for the Dhaka metropolitan area and its surroundings, validated by GIS experts. Result: BOLM covers 4,392 sq km (891 million pixels), and segmentation performance is compared across five major LULC classes with the DeepLab V3+ model. Conclusion: BOLM provides a reliable LULC dataset for South/East Asia that supports deep learning models and domain adaptation research. Abstract: Land Use Land Cover (LULC) mapping using deep learning significantly enhances the reliability of LULC classification, aiding in understanding geography, socioeconomic conditions, poverty levels, and urban sprawl. However, the scarcity of annotated satellite data, especially in South/East Asian developing countries, poses a major challenge due to limited funding, diverse infrastructures, and dense populations. In this work, we introduce the BD Open LULC Map (BOLM), providing pixel-wise LULC annotations across eleven classes (e.g., Farmland, Water, Forest, Urban Structure, Rural Built-Up) for Dhaka metropolitan city and its surroundings using high-resolution Bing satellite imagery (2.22 m/pixel). BOLM spans 4,392 sq km (891 million pixels), with ground truth validated through a three-stage process involving GIS experts. We benchmark LULC segmentation using DeepLab V3+ across five major classes and compare performance on Bing and Sentinel-2A imagery. BOLM aims to support reliable deep models and domain adaptation tasks, addressing critical LULC dataset gaps in South/East Asia.
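The pixel count quoted in the abstract follows directly from the area and the ground resolution. The short check below reproduces it, assuming square pixels of 2.22 m on a side; the helper name `pixels_for_area` is made up for this example.

```python
def pixels_for_area(area_sq_km, metres_per_pixel):
    """Number of square pixels needed to tile an area at a given ground resolution."""
    area_sq_m = area_sq_km * 1_000_000
    return area_sq_m / (metres_per_pixel ** 2)

pixel_count = pixels_for_area(4392, 2.22)
print(f"{pixel_count / 1e6:.0f} million pixels")  # prints "891 million pixels"
```

Each pixel covers about 4.93 square metres, so 4,392 sq km works out to roughly 891 million pixels, matching the figure in the abstract.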

[64] InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective

Yuanhong Zhang,Muyao Yuan,Weizhan Zhang,Tieliang Gong,Wen Wen,Jiangyong Ying,Weijie Shi

Main category: cs.CV

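TL;DR: InfoSAM proposes an information-theoretic approach that improves SAM fine-tuning by preserving the domain-invariant relations in the pre-trained model.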

Details

Motivation: SAM performs well on general tasks but poorly in specialized domains, and existing PEFT methods ignore the domain-invariant relations in the pre-trained model. Method: Proposes InfoSAM, which guides fine-tuning with two mutual-information objectives: compressing the domain-invariant relation and maximizing knowledge sharing between the teacher and student models. Result: Experiments show that InfoSAM significantly improves SAM's performance on specialized tasks. Conclusion: InfoSAM provides a robust distillation framework for fine-tuning SAM in specialized scenarios. Abstract: The Segment Anything Model (SAM), a vision foundation model, exhibits impressive zero-shot capabilities in general tasks but struggles in specialized domains. Parameter-efficient fine-tuning (PEFT) is a promising approach to unleash the potential of SAM in novel scenarios. However, existing PEFT methods for SAM neglect the domain-invariant relations encoded in the pre-trained model. To bridge this gap, we propose InfoSAM, an information-theoretic approach that enhances SAM fine-tuning by distilling and preserving its pre-trained segmentation knowledge. Specifically, we formulate the knowledge transfer process as two novel mutual information-based objectives: (i) to compress the domain-invariant relation extracted from pre-trained SAM, excluding pseudo-invariant information as far as possible, and (ii) to maximize mutual information between the relational knowledge learned by the teacher (pre-trained SAM) and the student (fine-tuned model). The proposed InfoSAM establishes a robust distillation framework for PEFT of SAM. Extensive experiments across diverse benchmarks validate InfoSAM's effectiveness in improving the SAM family's performance on real-world tasks, demonstrating its adaptability and superiority in handling specialized scenarios.

[65] Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting

Wei Lin,Chenyang Zhao,Antoni B. Chan

Main category: cs.CV

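TL;DR: The paper proposes a pseudo-label-based semi-supervised counting framework that replaces point-to-point (P2P) supervision with a point-to-region (P2R) scheme, resolving the problem of propagating pseudo-label confidence.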

Details

Motivation: Point detection methods perform well at localizing and counting dense crowds, but annotation is costly; the paper aims to reduce annotation needs through semi-supervised learning. Method: Proposes a point-specific activation map (PSAM) to analyze the training problem and designs the P2R scheme, which segments local regions that share the confidence of their pseudo points. Result: Experiments show that P2R outperforms P2P in semi-supervised counting and unsupervised domain adaptation, resolving the issues revealed by PSAM. Conclusion: The P2R scheme effectively solves the confidence propagation problem in pseudo-label training and improves counting performance. Abstract: Point detection has been developed to locate pedestrians in crowded scenes by training a counter through a point-to-point (P2P) supervision scheme. Despite its excellent localization and counting performance, training a point-based counter still faces challenges concerning annotation labor: hundreds to thousands of points are required to annotate a single sample capturing a dense crowd. In this paper, we integrate point-based methods into a semi-supervised counting framework based on pseudo-labeling, enabling the training of a counter with only a few annotated samples supplemented by a large volume of pseudo-labeled data. However, during implementation, the training encounters issues as the confidence for pseudo-labels fails to be propagated to background pixels via the P2P. To tackle this challenge, we devise a point-specific activation map (PSAM) to visually interpret the phenomena occurring during the ill-posed training. Observations from the PSAM suggest that the feature map is excessively activated by the loss for unlabeled data, causing the decoder to misinterpret these over-activations as pedestrians. To mitigate this issue, we propose a point-to-region (P2R) scheme to substitute P2P, which segments out local regions rather than detecting a point corresponding to a pedestrian for supervision. Consequently, pixels in the local region can share the same confidence with the corresponding pseudo points. Experimental results in both semi-supervised counting and unsupervised domain adaptation highlight the advantages of our method, illustrating that P2R can resolve issues identified in PSAM. The code is available at https://github.com/Elin24/P2RLoss.
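One way to picture pixels in a local region sharing the confidence of their pseudo point is a per-pixel weight map, sketched below. This is a simplified illustration and not the P2R loss itself: the circular region, the `radius` parameter, the nearest-point tie-breaking, and the default `background` weight are all assumptions introduced for the example.

```python
def region_confidence_map(height, width, pseudo_points, radius=2, background=1.0):
    """Build a per-pixel confidence map from pseudo points.

    pseudo_points is a list of (row, col, confidence). Every pixel within
    `radius` of a pseudo point inherits that point's confidence; where regions
    overlap, the nearest point wins. Other pixels keep `background`.
    """
    conf = [[background] * width for _ in range(height)]
    best_dist = [[float("inf")] * width for _ in range(height)]
    for pr, pc, c in pseudo_points:
        for r in range(max(0, pr - radius), min(height, pr + radius + 1)):
            for col in range(max(0, pc - radius), min(width, pc + radius + 1)):
                d = (r - pr) ** 2 + (col - pc) ** 2  # squared distance to the point
                if d <= radius ** 2 and d < best_dist[r][col]:
                    best_dist[r][col] = d
                    conf[r][col] = c
    return conf

weights = region_confidence_map(5, 5, [(2, 2, 0.6)], radius=1)
```

Such a map could weight a per-pixel loss on unlabeled images, so that supervision around a low-confidence pseudo point is down-weighted across the whole region rather than at a single pixel.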

[66] UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios

Le Thien Phuc Nguyen,Zhuoran Yu,Khoa Quang Nhat Cao,Yuwei Guo,Tu Ho Manh Pham,Tuan Tai Nguyen,Toan Ngo Duc Vo,Lucas Poon,Soochahn Lee,Yong Jae Lee

Main category: cs.CV

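TL;DR: UniTalk is a new dataset designed for active speaker detection that emphasizes challenging scenarios to improve model generalization and is more diverse than traditional datasets such as AVA.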

Details

Motivation: Traditional datasets such as AVA are built mainly from old movies and suffer from domain gaps, whereas UniTalk focuses on diverse and difficult real-world scenarios such as multiple languages, noisy backgrounds, and crowded scenes. Method: UniTalk contains 44.5 hours of video covering 48,693 speaking identities, with frame-level active speaker annotations across a range of video types. Result: Experiments show that existing models score very well on AVA but do not saturate on UniTalk, indicating that the task is unsolved under realistic conditions. Models trained on UniTalk generalize better to Talkies, ASW, and AVA. Conclusion: UniTalk provides a new benchmark for active speaker detection and an important resource for developing robust models. Abstract: We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes - such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models. Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD Code: https://github.com/plnguyen2908/UniTalk-ASD-code

[67] Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

Insu Lee,Wooje Park,Jaeyun Jang,Minyoung Noh,Kyuhong Shim,Byonghyo Shim

Main category: cs.CV

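TL;DR: The paper proposes a framework that combines first-person (egocentric) and third-person (exocentric) views to improve large vision-language models (LVLMs) on multi-view question answering, and introduces a new benchmark, E3VQA, and a training-free prompting technique, M3CoT.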

Details

Motivation: To overcome the limitations of the first-person view on spatially or contextually complex queries by adding a third-person view that supplies global scene information. Method: Proposes the E3VQA benchmark and the M3CoT prompting technique, which integrates multi-view scene graphs into a unified scene representation. Result: M3CoT significantly improves LVLM performance (+4.84% for GPT-4o and +5.94% for Gemini 2.0 Flash). Conclusion: Multi-view input is highly valuable for LVLMs in multi-view reasoning, and the evaluation also reveals their limitations. Abstract: Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where the first-person (egocentric) view captured by head-mounted cameras serves as key input. While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs. We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a unified scene representation by integrating scene graphs from three complementary perspectives. M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains (4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash) over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs.

Mengdan Zhu,Senhao Cheng,Guangji Bai,Yifei Zhang,Liang Zhao

Main category: cs.CV

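TL;DR: The paper proposes Cross-modal RAG, a new framework that decomposes queries and images into sub-dimensional components to enable subquery-aware retrieval and generation, significantly improving generation quality.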

Details

Motivation: Existing retrieval-augmented generation methods perform poorly on complex queries because no single image contains all the required elements. Method: Uses a hybrid retrieval strategy (sparse and dense retrievers) and a subquery-aware multimodal large language model for image generation. Result: Significantly outperforms existing baselines on multiple datasets while remaining efficient. Conclusion: Cross-modal RAG offers an effective solution for text-to-image generation under complex queries. Abstract: Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in both retrieval and generation quality, while maintaining high efficiency.
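The notion of a Pareto-optimal set of images can be sketched as follows, assuming each candidate image already has one relevance score per subquery. How those scores are produced (the sparse and dense retrievers) is outside this sketch, and the function names and example scores are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_optimal(candidates):
    """Return ids of images whose per-subquery score vectors are not dominated.

    candidates maps an image id to a list of scores, one per subquery.
    """
    return [
        image_id
        for image_id, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other_id, other_scores in candidates.items()
            if other_id != image_id
        )
    ]

scores = {
    "img_a": [0.9, 0.1],    # strong on subquery 1 only
    "img_b": [0.1, 0.9],    # strong on subquery 2 only
    "img_c": [0.05, 0.05],  # weak on both, dominated by every other image
    "img_d": [0.5, 0.5],    # balanced, not dominated by img_a or img_b
}
selected = pareto_optimal(scores)
```

Here `img_a` and `img_b` each cover a different subquery, so both survive even though neither would rank first under a single global relevance score for the whole query.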

[69] One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models

Senmao Li,Lei Wang,Kai Wang,Tao Liu,Jiehang Xie,Joost van de Weijer,Fahad Shahbaz Khan,Shiqi Yang,Yaxing Wang,Jian Yang

Main category: cs.CV

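TL;DR: TiUE is a new distillation method for text-to-image diffusion models that shares encoder features and samples in parallel, significantly improving inference speed and image quality.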

Details

Motivation: Existing distilled models sacrifice diversity and quality when reducing sampling steps, and the UNet encoder performs redundant computation. Method: Proposes the Time-independent Unified Encoder (TiUE), which shares encoder features, and adds a KL divergence term to regularize noise prediction. Result: TiUE outperforms LCM, SD-Turbo, and SwiftBrushv2 in diversity and realism while remaining computationally efficient. Conclusion: TiUE provides an efficient, high-quality distillation scheme for text-to-image diffusion models. Abstract: Text-to-Image (T2I) diffusion models have made remarkable advancements in generative modeling; however, they face a trade-off between inference speed and image quality, posing challenges for efficient deployment. Existing distilled T2I models can generate high-fidelity images with fewer sampling steps, but often struggle with diversity and quality, especially in one-step models. From our analysis, we observe redundant computations in the UNet encoders. Our findings suggest that, for T2I diffusion models, decoders are more adept at capturing richer and more explicit semantic information, while encoders can be effectively shared across decoders from diverse time steps. Based on these observations, we introduce the first Time-independent Unified Encoder (TiUE) for the student model UNet architecture, which is a loop-free image generation approach for distilling T2I diffusion models. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling and significantly reducing inference time complexity. In addition, we incorporate a KL divergence term to regularize noise prediction, which enhances the perceptual realism and diversity of the generated images. Experimental results demonstrate that TiUE outperforms state-of-the-art methods, including LCM, SD-Turbo, and SwiftBrushv2, producing more diverse and realistic results while maintaining computational efficiency.

[70] A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding

Mengjingcheng Mo,Xinyang Tong,Jiaxu Leng,Mingpi Tan,Jiankang Zheng,Yiran Liu,Haosheng Chen,Ji Gan,Weisheng Li,Xinbo Gao

Main category: cs.CV

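TL;DR: The paper introduces the A2Seek dataset and the A2Seek-R1 framework to address dynamic viewpoints, scale variation, and complex scenes in drone-view anomaly detection, substantially improving prediction and localization performance.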

Details

Motivation: Drone-view anomaly detection faces dynamic viewpoints, scale variation, and complex scenes; existing datasets and methods adapt poorly, which degrades performance. Method: Introduces the A2Seek dataset, with high-resolution videos and detailed annotations, and develops the A2Seek-R1 framework, which combines graph-of-thought-guided fine-tuning with the A-GRPO optimization strategy and simulates UAV flight behavior. Result: A2Seek-R1 improves prediction accuracy (AP) by up to 22.04% and anomaly localization (mIoU) by 13.9%, and generalizes well. Conclusion: The A2Seek dataset and the A2Seek-R1 framework effectively address the challenges of drone-view anomaly detection and have broad application potential. Abstract: While unmanned aerial vehicles (UAVs) offer wide-area, high-altitude coverage for anomaly detection, they face challenges such as dynamic viewpoints, scale variations, and complex scenes. Existing datasets and methods, mainly designed for fixed ground-level views, struggle to adapt to these conditions, leading to significant performance drops in drone-view scenarios. To bridge this gap, we introduce A2Seek (Aerial Anomaly Seek), a large-scale, reasoning-centric benchmark dataset for aerial anomaly understanding. This dataset covers various scenarios and environmental conditions, providing high-resolution real-world aerial videos with detailed annotations, including anomaly categories, frame-level timestamps, region-level bounding boxes, and natural language explanations for causal reasoning. Building on this dataset, we propose A2Seek-R1, a novel reasoning framework that generalizes R1-style strategies to aerial anomaly understanding, enabling a deeper understanding of "Where" anomalies occur and "Why" they happen in aerial frames. To this end, A2Seek-R1 first employs a graph-of-thought (GoT)-guided supervised fine-tuning approach to activate the model's latent reasoning capabilities on A2Seek. Then, we introduce Aerial Group Relative Policy Optimization (A-GRPO) to design rule-based reward functions tailored to aerial scenarios. Furthermore, we propose a novel "seeking" mechanism that simulates UAV flight behavior by directing the model's attention to informative regions. Extensive experiments demonstrate that A2Seek-R1 achieves up to a 22.04% improvement in AP for prediction accuracy and a 13.9% gain in mIoU for anomaly localization, exhibiting strong generalization across complex environments and out-of-distribution scenarios. Our dataset and code will be released at https://hayneyday.github.io/A2Seek/.

[71] DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model

Weiguang Zhang,Huangcheng Lu,Maizhen Ning,Xiaowei Huang,Wei Wang,Kaizhu Huang,Qiufeng Wang

Main category: cs.CV

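TL;DR: The paper proposes DvD, a diffusion-based document dewarping method that uses coordinate-level denoising and a time-variant condition refinement mechanism to better preserve document structure, achieving state-of-the-art performance on multiple benchmarks.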

Details

Motivation: Document dewarping has progressed, but preserving document structure remains challenging. Recent diffusion models offer a possible solution, yet they lack faithful control over high-resolution, complex document images. Method: DvD denoises at the coordinate level rather than the pixel level to generate a deformation-rectification mapping, and adds a time-variant condition refinement mechanism to better preserve document structure. Result: DvD achieves state-of-the-art performance on multiple benchmarks, including DocUNet, DIR300, and AnyPhotoDoc6300, with acceptable computational efficiency. Conclusion: DvD is the first diffusion-based document dewarping method; its design substantially improves performance, and the new AnyPhotoDoc6300 benchmark supports more comprehensive evaluation. Abstract: Document dewarping aims to rectify deformations in photographic document images and thus improve text readability. The task has attracted much attention and made great progress, but preserving document structures remains challenging. Given recent advances in diffusion models, it is natural for us to consider their potential applicability to document dewarping. However, it is far from straightforward to adopt diffusion models in document dewarping due to their unfaithful control on highly complex document images (e.g., 2000×3000 resolution). In this paper, we propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework. To be specific, DvD introduces a coordinate-level denoising instead of typical pixel-level denoising, generating a mapping for deformation rectification. In addition, we further propose a time-variant condition refinement mechanism to enhance the preservation of document structures. In experiments, we find that current document dewarping benchmarks cannot evaluate dewarping models comprehensively. To this end, we present AnyPhotoDoc6300, a rigorously designed large-scale document dewarping benchmark comprising 6,300 real image pairs across three distinct domains, enabling fine-grained evaluation of dewarping models. Comprehensive experiments demonstrate that our proposed DvD can achieve state-of-the-art performance with acceptable computational efficiency on multiple metrics across various benchmarks including DocUNet, DIR300, and AnyPhotoDoc6300. The new benchmark and code will be publicly available.

[72] Learning World Models for Interactive Video Generation

Taiye Chen,Xun Hu,Zihan Ding,Chi Jin

Main category: cs.CV

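TL;DR: The paper proposes video retrieval-augmented generation (VRAG), which reduces long-term compounding errors through explicit global state conditioning and improves the spatiotemporal consistency of world models.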

Details

Motivation: Existing long-video generation models lack effective world modeling because of compounding errors and insufficient memory mechanisms, which limits future planning and action selection. Method: Adds interactivity to image-to-video models through action conditioning and an autoregressive framework, and proposes VRAG. Result: VRAG significantly reduces long-term compounding errors and improves the spatiotemporal consistency of world models, whereas naive autoregressive generation and plain retrieval-augmented generation are less effective. Conclusion: The study exposes fundamental challenges in video world models and establishes a benchmark for improving video generation models. Abstract: Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and an autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.

[73] D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples

Zijing Hu,Fengda Zhang,Kun Kuang

Main category: cs.CV

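TL;DR: The paper proposes D-Fusion, which improves prompt-image alignment in diffusion models by constructing visually consistent samples, addressing the visual inconsistency that limits direct preference optimization (DPO).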

Details

Motivation: Diffusion models often produce images that are poorly aligned with their text prompts. Direct preference optimization (DPO) improves alignment but is limited by visual inconsistency between samples. Method: Proposes D-Fusion, which generates visually consistent samples through mask-guided self-attention fusion and retains their denoising trajectories so they can be used for DPO training. Result: Experiments show that D-Fusion improves prompt-image alignment when applied to different reinforcement learning algorithms. Conclusion: D-Fusion resolves the visual inconsistency problem of DPO and significantly improves the alignment of diffusion models. Abstract: The practical applications of diffusion models have been limited by the misalignment between generated images and corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to enhance the alignment of these models. However, the effectiveness of DPO is constrained by the issue of visual inconsistency, where the significant visual disparity between well-aligned and poorly-aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method to construct DPO-trainable visually consistent samples. On one hand, by performing mask-guided self-attention fusion, the resulting images are not only well-aligned, but also visually consistent with given poorly-aligned images. On the other hand, D-Fusion can retain the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.
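
For readers unfamiliar with DPO, the sketch below shows the generic pairwise DPO objective for a single preferred/rejected pair. It is not D-Fusion's implementation: in diffusion models the log-probabilities are typically accumulated over denoising steps, which is one reason the denoising trajectories matter. The function name, argument names, and the `beta` default are assumptions made for illustration.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Generic DPO loss for one (preferred w, rejected l) pair.

    loss = -log(sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]))
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# With identical policy and reference log-probs the loss is log(2).
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
# Raising the preferred sample's log-prob relative to the reference lowers the loss.
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```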

[74] Event-based Egocentric Human Pose Estimation in Dynamic Environment

Wataru Ikeda,Masashi Hatano,Ryosei Hara,Mariko Isogawa

Main category: cs.CV

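TL;DR: Proposes D-EventEgo, an event-camera-based method for human pose estimation from a front-facing egocentric view, which improves performance through head pose estimation and dynamic object segmentation.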

Details

Motivation: Existing methods rely on RGB cameras and cannot cope with low light or motion blur; event cameras can address these problems. Method: First estimates head pose, then generates body pose conditioned on it; a motion segmentation module removes interference from dynamic objects. Result: On a synthetic dataset, the method outperforms the baseline on four of five metrics in dynamic environments. Conclusion: D-EventEgo performs well on event-camera human pose estimation, especially in dynamic environments. Abstract: Estimating human pose using a front-facing egocentric camera is essential for applications such as sports motion analysis, VR/AR, and AI for wearable devices. However, many existing methods rely on RGB cameras and do not account for low-light environments or motion blur. Event-based cameras have the potential to address these challenges. In this work, we introduce a novel task of human pose estimation using a front-facing event-based camera mounted on the head and propose D-EventEgo, the first framework for this task. The proposed method first estimates the head poses, and then these are used as conditions to generate body poses. However, when estimating head poses, the presence of dynamic objects mixed with background events may reduce head pose estimation accuracy. Therefore, we introduce the Motion Segmentation Module to remove dynamic objects and extract background information. Extensive experiments on our synthetic event-based dataset derived from EgoBody demonstrate that our approach outperforms our baseline in four out of five evaluation metrics in dynamic environments.

[75] Prototype Embedding Optimization for Human-Object Interaction Detection in Livestreaming

Menghui Zhang,Jing Zhang,Lin Chen,Li Zhuo

Main category: cs.CV

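TL;DR: The paper proposes a prototype embedding optimization method (PeO-HOI) that addresses object bias in human-object interaction (HOI) detection for livestreaming and significantly improves detection performance.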

Details

Motivation: HOI detection in livestreaming suffers from object bias: it focuses too much on objects and neglects their interaction with the streamer. Method: Extracts human-object pair features through object detection and tracking, applies prototype embedding optimization to reduce object bias, and performs HOI detection after modeling spatio-temporal context. Result: On the VidHOI and BJUT-HOI datasets, PeO-HOI significantly improves detection accuracy (e.g., 37.19%@full on VidHOI). Conclusion: PeO-HOI effectively addresses object bias in livestreaming HOI detection and improves detection performance. Abstract: Livestreaming often involves interactions between streamers and objects, and detecting these interactions is critical for understanding and regulating web content. While human-object interaction (HOI) detection has made some progress in general-purpose video downstream tasks, when applied to recognize the interaction behaviors between a streamer and different objects in livestreaming, it tends to focus too much on the objects and neglects their interactions with the streamer, which leads to object bias. To solve this issue, we propose a prototype embedding optimization for human-object interaction detection (PeO-HOI). First, the livestreaming is preprocessed using object detection and tracking techniques to extract features of the human-object (HO) pairs. Then, prototype embedding optimization is adopted to mitigate the effect of object bias on HOI. Finally, after modelling the spatio-temporal context between HO pairs, the HOI detection results are obtained by the prediction head. The experimental results show that the proposed PeO-HOI method achieves detection accuracies of 37.19%@full, 51.42%@non-rare, and 26.20%@rare on the publicly available dataset VidHOI, and 45.13%@full, 62.78%@non-rare, and 30.37%@rare on the self-built dataset BJUT-HOI, which effectively improves the HOI detection performance in livestreaming.

[76] PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-aware Mechanisms

Yifei Xia,Shuchen Weng,Siqi Yang,Jingqi Liu,Chengxuan Zhu,Minggui Teng,Zijian Jia,Han Jiang,Boxin Shi

Main category: cs.CV

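TL;DR: PanoWan builds on pre-trained text-to-video models and uses latitude-aware sampling and rotated semantic denoising, among other techniques, to generate high-quality panoramic video; it also contributes PanoVid, a panoramic video dataset.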

Details

Motivation: Existing panoramic video generation models struggle to exploit pre-trained generative priors, mainly because of limited dataset scale and the gap in spatial feature representations. Method: PanoWan uses latitude-aware sampling to avoid latitudinal distortion, and rotated semantic denoising with padded pixel-wise decoding to ensure seamless transitions at longitude boundaries. Result: PanoWan achieves state-of-the-art panoramic video generation and is robust on zero-shot downstream tasks. Conclusion: PanoWan transfers pre-trained models to the panoramic domain with minimal added modules and demonstrates high-quality generation. Abstract: Panoramic video generation enables immersive 360° content creation, valuable in applications that demand scene-consistent world exploration. However, existing panoramic video generation models struggle to leverage pre-trained generative priors from conventional text-to-video models for high-quality and diverse panoramic video generation, due to limited dataset scale and the gap in spatial feature representations. In this paper, we introduce PanoWan to effectively lift pre-trained text-to-video models to the panoramic domain, equipped with minimal modules. PanoWan employs latitude-aware sampling to avoid latitudinal distortion, while its rotated semantic denoising and padded pixel-wise decoding ensure seamless transitions at longitude boundaries. To provide sufficient panoramic videos for learning these lifted representations, we contribute PanoVid, a high-quality panoramic video dataset with captions and diverse scenarios. Consequently, PanoWan achieves state-of-the-art performance in panoramic video generation and demonstrates robustness for zero-shot downstream tasks.
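
In an equirectangular panorama the leftmost and rightmost columns are neighbours in longitude, which is why seams appear when a model treats the frame as an ordinary image. The sketch below illustrates only the general idea of wrap-around (circular) padding along the width before a local operation, followed by cropping. It is an assumption about what "padded" decoding might look like, not PanoWan's actual procedure, and both function names are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # rows of pixel values, width = longitude axis


def pad_longitude(frame: Frame, pad: int) -> Frame:
    """Wrap-pad each row so border columns see their true longitude neighbours."""
    width = len(frame[0])
    if not 0 <= pad <= width:
        raise ValueError("pad must be between 0 and the frame width")
    return [row[width - pad:] + row + row[:pad] for row in frame]


def crop_longitude(padded: Frame, pad: int) -> Frame:
    """Remove the wrap padding again after processing."""
    return [row[pad:len(row) - pad] for row in padded]


frame = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
padded = pad_longitude(frame, 1)
print(padded)  # each row gains its last column on the left and first on the right
restored = crop_longitude(padded, 1)
```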

[77] GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement

Zhihong Tang,Yang Li

Main category: cs.CV

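TL;DR: GL-PGENet is a new document image enhancement network designed for multi-degraded color document images; it combines global and local refinement and uses parametric generation with a two-stage training strategy to achieve efficient, robust enhancement.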

Details

Motivation: Existing methods are limited to single-degradation restoration or grayscale processing and cannot handle multi-degraded color document images. Method: Proposes GL-PGENet, comprising a hierarchical enhancement framework, a dual-branch local refinement network, and a modified NestUNet architecture, trained with a two-stage strategy. Result: Achieves SSIM scores of 0.7721 on DocUNet and 0.9480 on RealDAE, with cross-domain adaptability and computational efficiency. Conclusion: GL-PGENet performs well on multi-degraded color document image enhancement and has practical value. Abstract: Document Image Enhancement (DIE) serves as a critical component in Document AI systems, where its performance substantially determines the effectiveness of downstream tasks. To address the limitations of existing methods confined to single-degradation restoration or grayscale image processing, we present Global with Local Parametric Generation Enhancement Network (GL-PGENet), a novel architecture designed for multi-degraded color document images, ensuring both efficiency and robustness in real-world scenarios. Our solution incorporates three key innovations: First, a hierarchical enhancement framework that integrates global appearance correction with local refinement, enabling coarse-to-fine quality improvement. Second, a Dual-Branch Local-Refine Network with parametric generation mechanisms that replaces conventional direct prediction, producing enhanced outputs through learned intermediate parametric representations rather than pixel-wise mapping. This approach enhances local consistency while improving model generalization. Finally, a modified NestUNet architecture incorporating dense blocks to effectively fuse low-level pixel features and high-level semantic features, specifically adapted for document image characteristics. In addition, to enhance generalization performance, we adopt a two-stage training strategy: large-scale pretraining on a synthetic dataset of 500,000+ samples followed by task-specific fine-tuning. Extensive experiments demonstrate the superiority of GL-PGENet, achieving state-of-the-art SSIM scores of 0.7721 on DocUNet and 0.9480 on RealDAE. The model also exhibits remarkable cross-domain adaptability and maintains computational efficiency for high-resolution images without performance degradation, confirming its practical utility in real-world scenarios.

[78] Learnable Burst-Encodable Time-of-Flight Imaging for High-Fidelity Long-Distance Depth Sensing

Manchao Bao,Shengjiang Fang,Tao Yue,Xuemei Hu

Main category: cs.CV

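TL;DR: The paper proposes BE-ToF, a new ToF imaging paradigm that emits light pulses in burst mode and jointly optimizes the coding functions and a depth reconstruction network, addressing the phase wrapping and low signal-to-noise ratio of conventional iToF.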

Details

Motivation: Long-distance depth imaging matters for autonomous driving and robotics, but direct ToF has demanding hardware requirements and indirect ToF suffers from phase wrapping and low SNR. Method: The BE-ToF system emits light pulses in burst mode and estimates the phase delay over the entire burst period; an end-to-end learnable framework jointly optimizes the coding functions and the depth reconstruction network. Result: Simulations and prototype experiments show that BE-ToF delivers high-fidelity, practical long-distance depth imaging. Conclusion: BE-ToF is an effective and practical solution for long-distance depth imaging that overcomes the limitations of conventional ToF techniques. Abstract: Long-distance depth imaging holds great promise for applications such as autonomous driving and robotics. Direct time-of-flight (dToF) imaging offers high-precision, long-distance depth sensing, yet demands ultra-short pulse light sources and high-resolution time-to-digital converters. In contrast, indirect time-of-flight (iToF) imaging often suffers from phase wrapping and low signal-to-noise ratio (SNR) as the sensing distance increases. In this paper, we introduce a novel ToF imaging paradigm, termed Burst-Encodable Time-of-Flight (BE-ToF), which facilitates high-fidelity, long-distance depth imaging. Specifically, the BE-ToF system emits light pulses in burst mode and estimates the phase delay of the reflected signal over the entire burst period, thereby effectively avoiding the phase wrapping inherent to conventional iToF systems. Moreover, to address the low SNR caused by light attenuation over increasing distances, we propose an end-to-end learnable framework that jointly optimizes the coding functions and the depth reconstruction network. A specialized double well function and first-order difference term are incorporated into the framework to ensure the hardware implementability of the coding functions. The proposed approach is rigorously validated through comprehensive simulations and real-world prototype experiments, demonstrating its effectiveness and practical applicability.
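
To make the phase-wrapping problem concrete, the sketch below uses the standard continuous-wave iToF relations: depth d = c·φ / (4π·f) and unambiguous range c / (2·f). It models a conventional iToF sensor only, not BE-ToF. The 20 MHz modulation frequency and the 10 m target are assumed values chosen for illustration.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def unambiguous_range(f_mod_hz: float) -> float:
    """Largest distance a single-frequency iToF sensor can report without wrapping."""
    return C / (2.0 * f_mod_hz)


def measured_phase(distance_m: float, f_mod_hz: float) -> float:
    """Round-trip phase delay, wrapped into [0, 2*pi) as the sensor observes it."""
    return (4.0 * math.pi * f_mod_hz * distance_m / C) % (2.0 * math.pi)


def depth_from_phase(phase_rad: float, f_mod_hz: float) -> float:
    """Depth implied by a wrapped phase measurement."""
    return C * phase_rad / (4.0 * math.pi * f_mod_hz)


f_mod = 20e6     # assumed modulation frequency (20 MHz)
target = 10.0    # assumed true distance in metres, beyond the unambiguous range
limit = unambiguous_range(f_mod)
reported = depth_from_phase(measured_phase(target, f_mod), f_mod)
print(f"unambiguous range: {limit:.3f} m, reported depth: {reported:.3f} m")
```

At 20 MHz the unambiguous range is roughly 7.5 m, so the 10 m target is reported about one full range too close. Lowering the modulation frequency extends the range but makes each unit of phase noise cost more depth error, which is the trade-off that motivates alternative coding schemes.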

[79] Guess the Age of Photos: An Interactive Web Platform for Historical Image Age Estimation

Hasan Yucedag,Adam Jatowt

Main category: cs.CV

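TL;DR: The paper introduces "Guess the Age of Photos", a web platform where users estimate the year of historical photographs in two gamified modes. The platform is built with Python and related technologies, and the evaluation shows that users are better at relative comparison than at guessing absolute years.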

Details

Motivation: Use gamification to improve users' sense of when historical photographs were taken, and provide a resource for studying how people perceive temporal cues in images. Method: The platform is built with Python, Flask, and related technologies, offers two game modes, and uses a dataset of 10,150 historical photographs (1930-1999). Dynamic scoring and leaderboards increase engagement. Result: 113 users completed 15,473 gameplays, with a satisfaction rating of 4.25/5. Users did better on relative comparisons (65.9% accuracy) than on absolute year guesses (25.6% accuracy). Conclusion: The platform serves as an educational tool that raises historical awareness, and as a resource for studying human perception of temporal cues in images and for training computer vision models. Abstract: This paper introduces Guess the Age of Photos, a web platform engaging users in estimating the years of historical photographs through two gamified modes: Guess the Year (predicting a single image's year) and Timeline Challenge (comparing two images to identify the older). Built with Python, Flask, Bootstrap, and PostgreSQL, it uses a 10,150-image subset of the Date Estimation in the Wild dataset (1930-1999). Features like dynamic scoring and leaderboards boost engagement. Evaluated with 113 users and 15,473 gameplays, the platform earned a 4.25/5 satisfaction rating. Users excelled in relative comparisons (65.9% accuracy) over absolute year guesses (25.6% accuracy), with older decades easier to identify. The platform serves as an educational tool, fostering historical awareness and analytical skills via interactive exploration of visual heritage. Furthermore, the platform provides a valuable resource for studying human perception of temporal cues in images and could be used to generate annotated data for training and evaluating computer vision models.

[80] Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

Kaiyuan Li,Xiaoyue Chen,Chen Gao,Yong Li,Xinlei Chen

Main category: cs.CV

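TL;DR: The paper proposes Balanced Token Pruning (BTP), which reduces the number of image tokens in large vision-language models (LVLMs) to lower computational overhead. It prunes in stages, balancing local and global effects, and substantially improves pruning quality.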

Details

Motivation: Existing token pruning methods usually overlook the joint effect of pruning on the current layer's output and on subsequent layers' outputs, which leads to poor pruning decisions. Method: Proposes BTP, which uses a calibration set to prune in stages: early stages emphasize the effect on subsequent layers, and deeper stages focus on keeping local outputs consistent. Result: BTP performs well on multiple benchmarks, reaching a 78% compression rate while preserving 96.7% of the original models' performance on average. Conclusion: BTP is an efficient token pruning method that significantly reduces computational overhead while maintaining model performance. Abstract: Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original models' performance on average.
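
The abstract describes earlier pruning methods as selecting tokens by attention score. A minimal sketch of that baseline (not of BTP's staged procedure) is given below, assuming one scalar score per image token; the function name, token labels, and scores are hypothetical.

```python
from typing import List, Sequence


def prune_tokens(tokens: Sequence[str], scores: Sequence[float],
                 keep_ratio: float) -> List[str]:
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k largest scores, then restored to positional order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
scores = [0.10, 0.70, 0.05, 0.90, 0.30, 0.20]  # hypothetical attention scores
kept = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept)
```

This criterion looks only at scores in the current layer, which is the purely local view that the abstract argues is insufficient on its own.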

[81] OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning

Shifang Zhao,Yiheng Lin,Lu Han,Yao Zhao,Yunchao Wei

Main category: cs.CV

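TL;DR: OmniAD is a multimodal framework that combines visual and textual reasoning for fine-grained anomaly detection and analysis; with an integrated training strategy, it performs strongly on multiple benchmarks.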

Details

Motivation: Producing detailed anomaly analyses that incorporate industrial knowledge remains challenging; OmniAD aims to fill this gap. Method: OmniAD performs multimodal reasoning that combines visual and textual reasoning, uses Text-as-Mask Encoding for threshold-free anomaly detection, and improves generalization with an integrated training strategy (SFT and GRPO). Result: Scores 79.1 on the MMAD benchmark, surpassing Qwen2.5-VL-7B and GPT-4o, and performs strongly on multiple anomaly detection benchmarks. Conclusion: Visual perception is essential for anomaly understanding, and OmniAD demonstrates the effectiveness of multimodal reasoning; code and models will be released. Abstract: While anomaly detection has made significant progress, generating detailed analyses that incorporate industrial knowledge remains a challenge. To address this gap, we introduce OmniAD, a novel framework that unifies anomaly detection and understanding for fine-grained analysis. OmniAD is a multimodal reasoner that combines visual and textual reasoning processes. The visual reasoning provides detailed inspection by leveraging Text-as-Mask Encoding to perform anomaly detection through text generation without manually selected thresholds. Following this, Visual Guided Textual Reasoning conducts comprehensive analysis by integrating visual perception. To enhance few-shot generalization, we employ an integrated training strategy that combines supervised fine-tuning (SFT) with reinforcement learning (GRPO), incorporating three sophisticated reward functions. Experimental results demonstrate that OmniAD achieves a performance of 79.1 on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. It also shows strong results across multiple anomaly detection benchmarks. These results highlight the importance of enhancing visual perception for effective reasoning in anomaly understanding. All code and models will be publicly available.
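
GRPO (Group Relative Policy Optimization) scores each sampled response relative to the other responses drawn for the same prompt, which removes the need for a learned value model. The sketch below shows only that group-normalized advantage step. The reward values are hypothetical, OmniAD's three reward functions are not described in the abstract, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average get a positive advantage and the rest a negative one; if every response in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.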

[82] LatentMove: Towards Complex Human Movement Video Generation

Ashkan Taghipour,Morteza Ghahremani,Mohammed Bennamoun,Farid Boussaid,Aref Miri Rekavandi,Zinuo Li,Qiuhong Ke,Hamid Laga

Main category: cs.CV

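TL;DR: LatentMove is a DiT-based framework for generating highly dynamic human animation; a conditional control branch and learnable face/body tokens improve video quality.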

Details

Motivation: Address the poor performance of existing I2V methods on complex, non-repetitive human motion. Method: Uses a DiT framework with a conditional control branch and learnable tokens, and trains and evaluates on the CHV dataset. Result: LatentMove substantially improves human animation quality, especially for rapid, complex motion. Conclusion: LatentMove advances I2V generation; the code, dataset, and evaluation metrics will be open-sourced. Abstract: Image-to-video (I2V) generation seeks to produce realistic motion sequences from a single reference image. Although recent methods exhibit strong temporal consistency, they often struggle when dealing with complex, non-repetitive human movements, leading to unnatural deformations. To tackle this issue, we present LatentMove, a DiT-based framework specifically tailored for highly dynamic human animation. Our architecture incorporates a conditional control branch and learnable face/body tokens to preserve consistency as well as fine-grained details across frames. We introduce Complex-Human-Videos (CHV), a dataset featuring diverse, challenging human motions designed to benchmark the robustness of I2V systems. We also introduce two metrics to assess the flow and silhouette consistency of generated videos with their ground truth. Experimental results indicate that LatentMove substantially improves human animation quality, particularly when handling rapid, intricate movements, thereby pushing the boundaries of I2V generation. The code, the CHV dataset, and the evaluation metrics will be available at https://github.com/ --.

[83] AquaMonitor: A multimodal multi-view image sequence dataset for real-life aquatic invertebrate biodiversity monitoring

Mikko Impiö,Philipp M. Rehsen,Tiina Laamanen,Arne J. Beermann,Florian Leese,Jenni Raitoharju

Main category: cs.CV

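TL;DR: AquaMonitor is the first large computer vision dataset of aquatic invertebrates for environmental monitoring. It contains 2.7M images, DNA sequences, and biometric measurements, defines three benchmark tasks, and provides baseline models.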

Details

Motivation: Existing species identification datasets lack standardized collection protocols and do not focus on aquatic invertebrates; AquaMonitor fills this gap and supports evaluating automated identification methods in real monitoring scenarios. Method: Images, DNA sequences, and biometric measurements were collected over two years of monitoring to build a multimodal dataset, with three benchmark tasks (monitoring, classification, and few-shot classification). Result: The dataset contains 2.7M images from 43,189 specimens, along with DNA sequences and biometric measurements, providing a practical tool for biodiversity monitoring. Conclusion: AquaMonitor is an important resource for aquatic biomonitoring; progress on its benchmarks can directly advance monitoring practice and support legislated water quality assessment. Abstract: This paper presents the AquaMonitor dataset, the first large computer vision dataset of aquatic invertebrates collected during routine environmental monitoring. While several large species identification datasets exist, they are rarely collected using standardized collection protocols, and none focus on aquatic invertebrates, which are particularly laborious to collect. For AquaMonitor, we imaged all specimens from two years of monitoring whenever imaging was possible given practical limitations. The dataset enables the evaluation of automated identification methods for real-life monitoring purposes using a realistically challenging and unbiased setup. The dataset has 2.7M images from 43,189 specimens, DNA sequences for 1358 specimens, and dry mass and size measurements for 1494 specimens, making it also one of the largest biological multi-view and multimodal datasets to date. We define three benchmark tasks and provide strong baselines for these: 1) Monitoring benchmark, reflecting real-life deployment challenges such as open-set recognition, distribution shift, and extreme class imbalance, 2) Classification benchmark, which follows a standard fine-grained visual categorization setup, and 3) Few-shot benchmark, which targets classes with only a few training examples from very fine-grained categories. Advancements on the Monitoring benchmark can directly translate to improvement of aquatic biodiversity monitoring, which is an important component of regular legislative water quality assessment in many countries.

[84] From Failures to Fixes: LLM-Driven Scenario Repair for Self-Evolving Autonomous Driving

Xinyu Xia,Xingjun Ma,Yunfeng Hu,Ting Qu,Hong Chen,Xun Gong

Main category: cs.CV

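TL;DR: SERA is an LLM-based framework that repairs failure cases of autonomous driving systems through targeted scenario recommendation, improving robustness and generalization.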

Details

Motivation: Existing scenario generation and selection methods lack adaptivity and semantic relevance, which limits performance gains. SERA aims to solve this by adaptively repairing failure cases. Method: SERA analyzes performance logs, identifies failure patterns, dynamically retrieves semantically aligned scenarios, refines the recommendations with an LLM, and then adapts the system through fine-tuning on a small amount of data. Result: Experiments show that SERA consistently improves key metrics across multiple autonomous driving baselines, demonstrating effectiveness and generalizability under safety-critical conditions. Conclusion: Through adaptive scenario recommendation and LLM-based refinement, SERA offers an efficient route to self-evolving autonomous driving systems. Abstract: Ensuring robust and generalizable autonomous driving requires not only broad scenario coverage but also efficient repair of failure cases, particularly those related to challenging and safety-critical scenarios. However, existing scenario generation and selection methods often lack adaptivity and semantic relevance, limiting their impact on performance improvement. In this paper, we propose SERA, an LLM-powered framework that enables autonomous driving systems to self-evolve by repairing failure cases through targeted scenario recommendation. By analyzing performance logs, SERA identifies failure patterns and dynamically retrieves semantically aligned scenarios from a structured bank. An LLM-based reflection mechanism further refines these recommendations to maximize relevance and diversity. The selected scenarios are used for few-shot fine-tuning, enabling targeted adaptation with minimal data. Experiments on the benchmark show that SERA consistently improves key metrics across multiple autonomous driving baselines, demonstrating its effectiveness and generalizability under safety-critical conditions.

[85] Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis

Hanbin Ko,Chang-Min Park

Main category: cs.CV

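TL;DR: Proposes a new method that combines clinically-enhanced dynamic soft labels with medical graphical alignment to improve the use of contrastive loss in medical vision-language processing, and adds negation-based hard negatives to improve clinical language understanding.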

Details

Motivation: Applying general-domain architectures such as CLIP directly to medical data runs into problems with negation handling and data imbalance, so improvements are needed for clinical use. Method: Integrates clinically-enhanced dynamic soft labels, medical graphical alignment, and negation-based hard negatives into the medical CLIP training pipeline. Result: Achieves state-of-the-art performance on zero-shot classification, fine-tuned classification, and report retrieval, and validates clinical language understanding with the CXR-Align benchmark. Conclusion: The method is easy to implement, generalizes well, and significantly improves medical vision-language processing and clinical language understanding. Abstract: The development of large-scale image-text pair datasets has significantly advanced self-supervised learning in Vision-Language Processing (VLP). However, directly applying general-domain architectures such as CLIP to medical data presents challenges, particularly in handling negations and addressing the inherent data imbalance of medical datasets. To address these issues, we propose a novel approach that integrates clinically-enhanced dynamic soft labels and medical graphical alignment, thereby improving clinical comprehension and the applicability of contrastive loss in medical contexts. Furthermore, we introduce negation-based hard negatives to deepen the model's understanding of the complexities of clinical language. Our approach is easily integrated into the medical CLIP training pipeline and achieves state-of-the-art performance across multiple tasks, including zero-shot, fine-tuned classification, and report retrieval. To comprehensively evaluate our model's capacity for understanding clinical language, we introduce CXR-Align, a benchmark uniquely designed to evaluate the understanding of negation and clinical information within chest X-ray (CXR) datasets. Experimental results demonstrate that our proposed methods are straightforward to implement and generalize effectively across contrastive learning frameworks, enhancing medical VLP capabilities and advancing clinical language understanding in medical imaging.

[86] MObyGaze: a film dataset of multimodal objectification densely annotated by experts

Julie Tores,Elisa Ancarani,Lucile Sassatelli,Hui-Yin Wu,Clement Bergman,Lea Andolfi,Victor Ecrement,Remy Sun,Frederic Precioso,Thierry Devars,Magali Guaresi,Virginie Julliard,Sarah Lecossais

Main category: cs.CV

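TL;DR: The paper proposes a new AI task, quantifying objectification in films through multimodal (visual, speech, audio) temporal patterns, and releases the MObyGaze dataset.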

Details

Motivation: Study objectification in gender representation in films to understand how stereotypes are perpetuated on screen. Method: Building on film studies and psychology, defines a structured thesaurus of objectification and builds the MObyGaze dataset, with 6072 densely annotated segments from 20 movies. Result: Demonstrates the feasibility of the task and compares the performance of different vision, text, and audio models. Conclusion: The work provides new tools and a dataset for multimodal objectification analysis and advances the field. Abstract: Characterizing and quantifying gender representation disparities in audiovisual storytelling content is necessary to grasp how stereotypes may be perpetuated on screen. In this article, we consider the high-level construct of objectification and introduce a new AI task to the ML community: characterize and quantify complex multimodal (visual, speech, audio) temporal patterns producing objectification in films. Building on film studies and psychology, we define the construct of objectification in a structured thesaurus involving 5 sub-constructs manifesting through 11 concepts spanning 3 modalities. We introduce the Multimodal Objectifying Gaze (MObyGaze) dataset, made of 20 movies annotated densely by experts for objectification levels and concepts over freely delimited segments: it amounts to 6072 segments over 43 hours of video with fine-grained localization and categorization. We formulate different learning tasks, propose and investigate the best ways to learn from the diversity of labels among a low number of annotators, and benchmark recent vision, text and audio models, showing the feasibility of the task. We make our code and our dataset available to the community, described in the Croissant format: https://anonymous.4open.science/r/MObyGaze-F600/.

[87] Fast Feature Matching of UAV Images via Matrix Band Reduction-based GPU Data Schedule

San Jiang,Kan You,Wanshou Jiang,Qingquan Li

Main category: cs.CV

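TL;DR: Proposes an efficient feature matching method for UAV images based on a GPU data scheduling algorithm, implemented with matrix band reduction (MBR) and GPU-accelerated cascade hashing, which substantially improves matching efficiency.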

Details

Motivation: Feature matching accounts for much of the time cost in structure from motion (SfM); the study aims to speed up feature matching of UAV images with GPU acceleration. Method: 1. Select match pairs with an image retrieval technique; 2. generate compact image blocks with an MBR-based data scheduling strategy; 3. run feature matching within a GPU-accelerated cascade hashing framework, verified with a geometric constraint and RANSAC. Result: On large-scale UAV datasets, the method is 77.0 to 100.0 times faster than KD-Tree-based matching and achieves comparable accuracy in relative and absolute BA. Conclusion: The algorithm is an efficient solution for feature matching of UAV images. Abstract: Feature matching dominates the time costs in structure from motion (SfM). The primary contribution of this study is a GPU data schedule algorithm for efficient feature matching of unmanned aerial vehicle (UAV) images. The core idea is to divide the whole dataset into blocks based on the matrix band reduction (MBR) and achieve efficient feature matching via GPU-accelerated cascade hashing. First, match pairs are selected by using an image retrieval technique, which converts images into global descriptors and searches high-dimension nearest neighbors with graph indexing. Second, compact image blocks are iteratively generated from an MBR-based data schedule strategy, which exploits image connections to avoid redundant data IO (input/output) burden and increases the usage of GPU computing power. Third, guided by the generated image blocks, feature matching is executed sequentially within the framework of GPU-accelerated cascade hashing, and initial candidate matches are refined by combining a local geometric constraint and RANSAC-based global verification. For further performance improvement, these two steps are designed to execute in parallel on the GPU and CPU. Finally, the performance of the proposed solution is evaluated by using large-scale UAV datasets. The results demonstrate that it increases the efficiency of feature matching with speedup ratios ranging from 77.0 to 100.0 compared with KD-Tree based matching methods, and achieves comparable accuracy in relative and absolute bundle adjustment (BA). The proposed algorithm is an efficient solution for feature matching of UAV images.
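
The abstract does not say which band-reduction algorithm is used. As a stand-in, the sketch below applies the classic Cuthill-McKee heuristic to a small match-pair graph: images are nodes, match pairs are edges, and a reordering with small bandwidth places connected images close together, so consecutive blocks of the ordering share most of their data. The graph, the function names, and the choice of Cuthill-McKee are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def bandwidth(edges: Sequence[Edge], order: Sequence[int]) -> int:
    """Largest distance between two matched images in the given ordering."""
    pos = {node: i for i, node in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u, v in edges), default=0)


def cuthill_mckee(n: int, edges: Sequence[Edge]) -> List[int]:
    """Breadth-first ordering that visits low-degree neighbours first."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def by_degree(i: int) -> Tuple[int, int]:
        return (len(adj[i]), i)  # tie-break on index for determinism

    order: List[int] = []
    seen: Set[int] = set()
    for start in sorted(adj, key=by_degree):  # handles disconnected components
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in sorted(adj[node] - seen, key=by_degree):
                seen.add(nxt)
                queue.append(nxt)
    return order


# Hypothetical match pairs: a flight strip whose image ids are scrambled.
pairs = [(0, 3), (3, 1), (1, 4), (4, 2)]
before = bandwidth(pairs, list(range(5)))
ordering = cuthill_mckee(5, pairs)
after = bandwidth(pairs, ordering)
print(before, ordering, after)
```

For real datasets a sparse-matrix library routine, such as SciPy's `reverse_cuthill_mckee`, would normally replace this hand-written version.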

[88] UAVPairs: A Challenging Benchmark for Match Pair Retrieval of Large-scale UAV Images

Junhuan Liu,San Jiang,Wei Ge,Wei Huang,Bingxuan Guo,Qingquan Li

Main category: cs.CV

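TL;DR: Presents the UAVPairs dataset and a training pipeline for match pair retrieval of large-scale UAV images; training samples are built using geometric similarity and the multi-scene structure of the dataset, and a ranked list loss is designed to improve the discrimination of retrieval models. Experiments show markedly better retrieval accuracy and 3D reconstruction quality.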

Details

Motivation: Addresses the high cost of generating training samples and the insufficient discrimination of existing loss functions in match pair retrieval of large-scale UAV images. Method: Builds the UAVPairs dataset and proposes a batched nontrivial sample mining strategy and a ranked list loss to improve the training pipeline. Result: The models significantly outperform existing methods in retrieval accuracy and 3D reconstruction quality, and are more robust in repetitively textured and weakly textured scenes. Conclusion: The UAVPairs dataset and training pipeline offer an efficient solution for match pair retrieval of large-scale UAV images. Abstract: The primary contribution of this paper is a challenging benchmark dataset, UAVPairs, and a training pipeline designed for match pair retrieval of large-scale UAV images. First, the UAVPairs dataset, comprising 21,622 high-resolution images across 30 diverse scenes, is constructed; the 3D points and tracks generated by SfM-based 3D reconstruction are employed to define the geometric similarity of image pairs, ensuring genuinely matchable image pairs are used for training. Second, to solve the problem of expensive mining cost for global hard negative mining, a batched nontrivial sample mining strategy is proposed, leveraging the geometric similarity and multi-scene structure of the UAVPairs to generate training samples so as to accelerate training. Third, recognizing the limitation of pair-based losses, the ranked list loss is designed to improve the discrimination of image retrieval models, which optimizes the global similarity structure constructed from the positive set and negative set. Finally, the effectiveness of the UAVPairs dataset and training pipeline is validated through comprehensive experiments on three distinct large-scale UAV datasets. The experiment results demonstrate that models trained with the UAVPairs dataset and the ranked list loss achieve significantly improved retrieval accuracy compared to models trained on existing datasets or with conventional losses. Furthermore, these improvements translate to enhanced view graph connectivity and higher quality of reconstructed 3D models. The models trained by the proposed approach perform more robustly compared with hand-crafted global features, particularly in challenging repetitively textured scenes and weakly textured scenes. For match pair retrieval of large-scale UAV images, the trained image retrieval models offer an effective solution. The dataset will be made publicly available at https://github.com/json87/UAVPairs.
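To make the contrast with pair-based losses concrete, here is a simplified sketch of a ranked-list-style loss for one query image. It is an assumption-laden illustration, not the paper's exact formulation: positives are pulled inside a radius `alpha - margin`, negatives are pushed beyond `alpha`, and only samples that violate their bound (the "nontrivial" ones) contribute. The parameter names, default values, and the plain averaging of violations are all assumptions.

```python
def ranked_list_loss(pos_dists, neg_dists, alpha=1.2, margin=0.4):
    """Simplified ranked-list-style loss for a single query.

    pos_dists / neg_dists: descriptor distances from the query to its
    positive and negative images (hypothetical inputs).
    """
    pos_bound = alpha - margin
    # Nontrivial positives lie farther away than the positive bound.
    pos_viol = [d - pos_bound for d in pos_dists if d > pos_bound]
    # Nontrivial negatives lie closer than the negative bound alpha.
    neg_viol = [alpha - d for d in neg_dists if d < alpha]
    pos_term = sum(pos_viol) / len(pos_viol) if pos_viol else 0.0
    neg_term = sum(neg_viol) / len(neg_viol) if neg_viol else 0.0
    return pos_term + neg_term


# A well-separated query incurs no loss; a confused one does.
easy = ranked_list_loss(pos_dists=[0.5], neg_dists=[1.5])
hard = ranked_list_loss(pos_dists=[1.0], neg_dists=[1.0])
```

Because the whole positive and negative sets enter one objective, the loss shapes the overall similarity structure around the query rather than a single pair or triplet at a time.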

[89] On the Transferability and Discriminability of Representation Learning in Unsupervised Domain Adaptation

Wenwen Qiang,Ziyin Gu,Lingyu Si,Jiangmeng Li,Changwen Zheng,Fuchun Sun,Hui Xiong

Main category: cs.CV

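TL;DR: Proposes RLGLC, a new adversarial unsupervised domain adaptation framework that combines domain alignment with enhanced target-domain discriminability, addressing the neglect of target-domain feature discriminability in conventional methods.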

Details

Motivation: Conventional adversarial domain adaptation relies only on distribution alignment and source-domain empirical risk minimization, neglecting the discriminability of target-domain features and yielding suboptimal performance. Method: Proposes the RLGLC framework, which combines a domain alignment objective with a discriminability-enhancing constraint, uses AR-WWD to handle class imbalance and semantic dimension weighting, and preserves fine-grained target-domain discriminative information through a local consistency mechanism. Result: Experiments on multiple benchmark datasets show that RLGLC significantly outperforms existing methods. Conclusion: The paper shows that adversarial domain adaptation must ensure both transferability and discriminability, and RLGLC provides an effective way to do so. Abstract: In this paper, we addressed the limitation of relying solely on distribution alignment and source-domain empirical risk minimization in Unsupervised Domain Adaptation (UDA). Our information-theoretic analysis showed that this standard adversarial-based framework neglects the discriminability of target-domain features, leading to suboptimal performance. To bridge this theoretical-practical gap, we defined "good representation learning" as guaranteeing both transferability and discriminability, and proved that an additional loss term targeting target-domain discriminability is necessary. Building on these insights, we proposed a novel adversarial-based UDA framework that explicitly integrates a domain alignment objective with a discriminability-enhancing constraint. Instantiated as Domain-Invariant Representation Learning with Global and Local Consistency (RLGLC), our method leverages Asymmetrically-Relaxed Wasserstein of Wasserstein Distance (AR-WWD) to address class imbalance and semantic dimension weighting, and employs a local consistency mechanism to preserve fine-grained target-domain discriminative information. Extensive experiments across multiple benchmark datasets demonstrate that RLGLC consistently surpasses state-of-the-art methods, confirming the value of our theoretical perspective and underscoring the necessity of enforcing both transferability and discriminability in adversarial-based UDA.

[90] Adapting Segment Anything Model for Power Transmission Corridor Hazard Segmentation

Hang Chen,Maoyuan Ye,Peng Yang,Haibin He,Juhua Liu,Bo Du

Main category: cs.CV

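TL;DR: Proposes ELE-SAM, which adapts SAM to power transmission corridor hazard segmentation using a context-aware prompt adapter and a high-fidelity mask decoder, and builds ELE-40K, the first large-scale real-world dataset for the task.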

Details

Motivation: Power transmission corridor hazard segmentation (PTCHS) is critical to the safety of power transmission, but the existing SAM performs poorly in complex scenes, especially on objects with fine structure. Method: Proposes ELE-SAM, comprising a Context-Aware Prompt Adapter (integrating global-local features) and a High-Fidelity Mask Decoder (leveraging multi-granularity mask features). Result: On ELE-40K, ELE-SAM improves over the baseline by an average of 16.8% mIoU and 20.6% mBIoU, and it outperforms existing methods on HQSeg-44K. Conclusion: ELE-SAM performs well on PTCHS and is also effective for high-quality generic object segmentation; the dataset and code are open source. Abstract: Power transmission corridor hazard segmentation (PTCHS) aims to separate transmission equipment and surrounding hazards from complex backgrounds, which is of great significance for maintaining electric power transmission safety. Recently, the Segment Anything Model (SAM) has emerged as a foundational vision model and pushed the boundaries of segmentation tasks. However, SAM struggles to deal with the target objects in complex transmission corridor scenarios, especially those with fine structure. In this paper, we propose ELE-SAM, adapting SAM for the PTCHS task. Technically, we develop a Context-Aware Prompt Adapter to achieve better prompt tokens via incorporating global-local features and focusing more on key regions. Subsequently, to tackle the hazard objects with fine structure in complex backgrounds, we design a High-Fidelity Mask Decoder by leveraging multi-granularity mask features and then scaling them to a higher resolution. Moreover, to train ELE-SAM and advance this field, we construct the ELE-40K benchmark, the first large-scale and real-world dataset for PTCHS including 44,094 image-mask pairs. Experimental results on ELE-40K demonstrate that ELE-SAM outperforms the baseline model with average improvements of 16.8% mIoU and 20.6% mBIoU. Moreover, compared with the state-of-the-art method on HQSeg-44K, the average 2.9% mIoU and 3.8% mBIoU absolute improvements further validate the effectiveness of our method on high-quality generic object segmentation. The source code and dataset are available at https://github.com/Hhaizee/ELE-SAM.

[91] Autoregression-free video prediction using diffusion model for mitigating error propagation

Woonho Ko,Jin Bok Park,Il Yong Chun

Main category: cs.CV

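TL;DR: Proposes an autoregression-free (ARFree) video prediction framework based on diffusion models, addressing the error propagation of conventional autoregressive methods in distant future frames.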

Details

Motivation: Existing long-term video prediction methods usually rely on an autoregressive mechanism, which is prone to error propagation in distant future frames. Method: The ARFree framework has two key components: (1) a motion prediction module that predicts future motion from motion features extracted from the context frame tuple; (2) a training method that improves motion continuity and contextual consistency between adjacent future frame tuples. Result: Experiments on two benchmark datasets show that ARFree outperforms several state-of-the-art video prediction methods. Conclusion: By predicting future frame tuples directly, ARFree avoids the error propagation of autoregressive mechanisms and significantly improves video prediction performance. Abstract: Existing long-term video prediction methods often rely on an autoregressive video prediction mechanism. However, this approach suffers from error propagation, particularly in distant future frames. To address this limitation, this paper proposes the first AutoRegression-Free (ARFree) video prediction framework using diffusion models. Different from an autoregressive video prediction mechanism, ARFree directly predicts any future frame tuples from the context frame tuple. The proposed ARFree consists of two key components: 1) a motion prediction module that predicts a future motion using motion features extracted from the context frame tuple; 2) a training method that improves motion continuity and contextual consistency between adjacent future frame tuples. Our experiments with two benchmark datasets show that the proposed ARFree video prediction framework outperforms several state-of-the-art video prediction methods.

[92] SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model

Yifan Chang,Yukang Feng,Jianwen Sun,Jiaxin Ai,Chuanhao Li,S. Kevin Zhou,Kaipeng Zhang

Main category: cs.CV

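TL;DR: SridBench is the first benchmark for scientific figure generation; it evaluates AI performance on scientific illustration generation and finds that existing models such as GPT-4o-image still lag behind humans.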

Details

Motivation: Scientific illustration generation demands high precision and domain expertise, yet no benchmark exists to evaluate it. Method: Introduces SridBench, comprising 1,120 scientific figure instances curated by experts and multimodal large language models (MLLMs) and evaluated along six dimensions. Result: Experiments show that top-tier models still fall short of humans in semantic fidelity and structural accuracy, with issues in text/visual clarity and scientific correctness. Conclusion: Reasoning-driven visual generation capabilities need further improvement. Abstract: Recent years have seen rapid advances in AI-driven image generation. Early diffusion models emphasized perceptual quality, while newer multimodal models like GPT-4o-image integrate high-level reasoning, improving semantic understanding and structural composition. Scientific illustration generation exemplifies this evolution: unlike general image synthesis, it demands accurate interpretation of technical content and transformation of abstract ideas into clear, standardized visuals. This task is significantly more knowledge-intensive and laborious, often requiring hours of manual work and specialized tools. Automating it in a controllable, intelligent manner would provide substantial practical value. Yet, no benchmark currently exists to evaluate AI on this front. To fill this gap, we introduce SridBench, the first benchmark for scientific figure generation. It comprises 1,120 instances curated from leading scientific papers across 13 natural and computer science disciplines, collected via human experts and MLLMs. Each sample is evaluated along six dimensions, including semantic fidelity and structural accuracy. Experimental results reveal that even top-tier models like GPT-4o-image lag behind human performance, with common issues in text/visual clarity and scientific correctness. These findings highlight the need for more advanced reasoning-driven visual generation capabilities.

Alejandro D. Mousist

Main category: cs.CV

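TL;DR: Proposes a blind deblurring method for mechanical defocus in Earth observation images from the IMAGIN-e mission aboard the ISS, using a GAN framework to restore image quality without reference images.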

Details

Motivation: Addresses mechanical defocus in Earth observation images under space-based edge computing constraints, improving image quality to support practical applications. Method: Estimates the defocus kernel from Sentinel-2 data and trains a restoration model within a GAN framework, without reference images. Result: On synthetically degraded Sentinel-2 images, SSIM improves by 72.47% and PSNR by 25.00%; on IMAGIN-e, NIQE and BRISQUE improve by 60.66% and 48.38% respectively. Conclusion: The method has been deployed on the IMAGIN-e mission, demonstrating its practicality and efficiency in a resource-constrained environment and supporting applications such as water body segmentation. Abstract: This work addresses mechanical defocus in Earth observation images from the IMAGIN-e mission aboard the ISS, proposing a blind deblurring approach adapted to space-based edge computing constraints. Leveraging Sentinel-2 data, our method estimates the defocus kernel and trains a restoration model within a GAN framework, effectively operating without reference images. On Sentinel-2 images with synthetic degradation, SSIM improved by 72.47% and PSNR by 25.00%, confirming the model's ability to recover lost details when the original clean image is known. On IMAGIN-e, where no reference images exist, perceptual quality metrics indicate a substantial enhancement, with NIQE improving by 60.66% and BRISQUE by 48.38%, validating real-world onboard restoration. The approach is currently deployed aboard the IMAGIN-e mission, demonstrating its practical application in an operational space environment. By efficiently handling high-resolution images under edge computing constraints, the method enables applications such as water body segmentation and contour detection while maintaining processing viability despite resource limitations.

[94] What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?

Jinhong Ni,Chang-Bin Zhang,Qiang Zhang,Jing Zhang

Main category: cs.CV

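TL;DR: Examines how fine-tuning pre-trained diffusion models such as Stable Diffusion enables 360-degree panorama generation, reveals the roles of the different matrices in the attention modules, and proposes an efficient framework, UniPano.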

Details

Motivation: Understand the internal mechanisms of pre-trained diffusion models when generating panoramic images and address the domain gap. Method: Analyzes the behavior of the different matrices in the attention modules and proposes UniPano, which focuses on the key matrices to improve performance and efficiency. Result: UniPano outperforms existing methods in performance and efficiency, significantly reducing memory usage and training time. Conclusion: The paper reveals key mechanisms of pre-trained models in panorama generation and provides a simple, efficient baseline for future research. Abstract: Recent prosperity of text-to-image diffusion models, e.g. Stable Diffusion, has stimulated research to adapt them to 360-degree panorama generation. Prior work has demonstrated the feasibility of using conventional low-rank adaptation techniques on pre-trained diffusion models to generate panoramic images. However, the substantial domain gap between perspective and panoramic images raises questions about the underlying mechanisms enabling this empirical success. We hypothesize and examine that the trainable counterparts exhibit distinct behaviors when fine-tuned on panoramic data, and such an adaptation conceals some intrinsic mechanism to leverage the prior knowledge within the pre-trained diffusion models. Our analysis reveals the following: 1) the query and key matrices in the attention modules are responsible for common information that can be shared between the panoramic and perspective domains, thus are less relevant to panorama generation; and 2) the value and output weight matrices specialize in adapting pre-trained knowledge to the panoramic domain, playing a more critical role during fine-tuning for panorama generation. We empirically verify these insights by introducing a simple framework called UniPano, with the objective of establishing an elegant baseline for future research. UniPano not only outperforms existing methods but also significantly reduces memory usage and training time compared to prior dual-branch approaches, making it scalable for end-to-end panorama generation with higher resolution. The code will be released.

[95] FaceEditTalker: Interactive Talking Head Generation with Facial Attribute Editing

Guanwen Feng,Zhiyuan Ma,Yunan Li,Junwei Jing,Jiahao Yang,Qiguang Miao

Main category: cs.CV

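TL;DR: FaceEditTalker is a unified framework that supports controllable facial attribute editing while generating high-quality, audio-synchronized talking head videos.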

Details

Motivation: Existing audio-driven talking head generation methods overlook facial attribute editing, a capability that is essential for applications such as personalization and brand adaptation. Method: The method comprises an image feature space editing module (extracting semantic and detail features) and an audio-driven video generation module (fusing the edited features with audio-guided facial landmarks). Result: Experiments show that the method outperforms existing techniques in lip-sync accuracy, video quality, and attribute controllability. Conclusion: FaceEditTalker provides a unified solution for facial attribute editing and high-quality video generation. Abstract: Recent advances in audio-driven talking head generation have achieved impressive results in lip synchronization and emotional expression. However, they largely overlook the crucial task of facial attribute editing. This capability is crucial for achieving deep personalization and expanding the range of practical applications, including user-tailored digital avatars, engaging online education content, and brand-specific digital customer service. In these key domains, the flexible adjustment of visual attributes, such as hairstyle, accessories, and subtle facial features, is essential for aligning with user preferences, reflecting diverse brand identities, and adapting to varying contextual demands. In this paper, we present FaceEditTalker, a unified framework that enables controllable facial attribute manipulation while generating high-quality, audio-synchronized talking head videos. Our method consists of two key components: an image feature space editing module, which extracts semantic and detail features and allows flexible control over attributes like expression, hairstyle, and accessories; and an audio-driven video generation module, which fuses these edited features with audio-guided facial landmarks to drive a diffusion-based generator. This design ensures temporal coherence, visual fidelity, and identity preservation across frames. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art approaches in lip-sync accuracy, video quality, and attribute controllability. Project page: https://peterfanfan.github.io/FaceEditTalker/

[96] 3D Question Answering via only 2D Vision-Language Models

Fengyun Wang,Sicheng Yu,Jiawei Wu,Jinhui Tang,Hanwang Zhang,Qianru Sun

Main category: cs.CV

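TL;DR: Proposes cdViews, which automatically selects critical and diverse 2D views and uses zero-shot inference with 2D large vision-language models (LVLMs) for 3D scene understanding, achieving state-of-the-art performance on the ScanQA and SQA benchmarks.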

Details

Motivation: Explore how to use 2D LVLMs for 3D tasks, avoiding resource-intensive training of 3D LVLMs. Method: Proposes cdViews, comprising viewSelector (selecting critical views) and viewNMS (enhancing view diversity), and answers 3D questions via zero-shot inference with 2D models. Result: Achieves state-of-the-art performance on ScanQA and SQA, validating 2D LVLMs for 3D tasks. Conclusion: 2D LVLMs are currently the most effective alternative for 3D tasks, without the complex training of 3D LVLMs. Abstract: Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Due to the limited training data in 3D, we do not train LVLMs but infer in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. When the 2D model is chosen, e.g., LLAVA-OV, the quality of sampled views matters the most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector prioritizing critical views based on their potential to provide answer-specific information, and viewNMS enhancing diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative to the resource-intensive 3D LVLMs for addressing 3D tasks.
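The abstract describes viewNMS only as removing redundant views based on spatial overlap, so the sketch below is one plausible reading rather than the authors' implementation. It assumes each view carries a criticality score (such as a viewSelector-style module might output) and a set of visible point ids, and it measures overlap with the Jaccard index. All names, the threshold, and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of visible point ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_view_nms(views, overlap_threshold=0.5, max_views=None):
    """Keep high-scoring views, dropping ones that overlap a kept view.

    views: list of (view_id, score, visible_point_ids) tuples.
    """
    kept = []
    for view in sorted(views, key=lambda v: v[1], reverse=True):
        if max_views is not None and len(kept) >= max_views:
            break
        # Suppress the view if it overlaps too much with any kept view.
        if all(jaccard(view[2], k[2]) <= overlap_threshold for k in kept):
            kept.append(view)
    return [v[0] for v in kept]


views = [
    ("v0", 0.9, {1, 2, 3, 4}),
    ("v1", 0.8, {1, 2, 3, 5}),  # overlaps v0 with Jaccard 3/5, so suppressed
    ("v2", 0.7, {7, 8, 9}),     # disjoint from v0, so kept
]
selected = greedy_view_nms(views)
```

The greedy order means the most critical view always survives, and diversity comes from refusing near-duplicates of views already selected.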

[97] Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language

Guangfu Hao,Haojie Wen,Liangxuna Guo,Yang Chen,Yanchao Bi,Shan Yu

Main category: cs.CV

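TL;DR: Proposes a framework based on low-dimensional attribute representations that combines vision and language models, substantially improving accuracy on tool selection tasks.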

Details

Motivation: Flexible tool selection is a distinctive human cognitive ability, but existing computational models of it are underdeveloped. Method: Visual encoders (ResNet or ViT) extract attributes from tool images, and language models (e.g., GPT-2, LLaMA, DeepSeek) derive the required attributes from task descriptions. Result: The model reaches 74% accuracy on tool selection, significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%) and approaching GPT-4o (73%). Conclusion: The framework offers a parameter-efficient, interpretable solution that mimics human tool cognition, advancing cognitive science and practical applications. Abstract: Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework using low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks, significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%), while approaching performance of much larger models like GPT-4o (73%) with substantially fewer parameters. Ablation studies revealed that manipulation-related attributes (graspability, hand-relatedness, elongation) consistently prove most critical across modalities. This work provides a parameter-efficient, interpretable solution that mimics human-like tool cognition, advancing both cognitive science understanding and practical applications in tool selection tasks.
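The final matching step can be pictured as comparing two vectors in the shared attribute space. The sketch below assumes cosine similarity as the matching rule, which the abstract does not specify, and uses three made-up attributes instead of the 13 in ToolNet. The attribute values, tool names, and the `select_tool` helper are hypothetical stand-ins for what the visual encoder and language model would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_tool(required, tool_attributes):
    """Return the tool whose attributes best align with the task's needs."""
    return max(tool_attributes,
               key=lambda name: cosine(required, tool_attributes[name]))


# Hypothetical attribute order: (graspability, elongation, sharpness).
tools = {
    "hammer": [0.9, 0.6, 0.1],
    "knife": [0.8, 0.7, 0.9],
    "bowl": [0.3, 0.0, 0.0],
}
# Attributes a language model might derive from "slice a loaf of bread".
required = [0.7, 0.6, 1.0]
choice = select_tool(required, tools)
```

Because both modalities are projected onto the same small set of named attributes, the decision is easy to inspect: one can read off which attributes drove the match.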

[98] Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging

Runze Xia,Shuo Feng,Renzhi Wang,Congchi Yin,Xuyun Wen,Piji Li

Main category: cs.CV

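TL;DR: Proposes FgB2I, which uses fine-grained text as a bridge to improve brain-to-image reconstruction, addressing the missing details and semantic inconsistencies of existing methods.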

Details

Motivation: Existing brain-to-image reconstruction methods often produce results with missing details and semantic inconsistencies because they lack sufficient semantic information. Method: FgB2I has three key stages: detail enhancement, decoding fine-grained text descriptions, and text-bridged brain-to-image reconstruction; it uses large vision-language models to generate fine-grained descriptions and guides decoding with three reward metrics. Result: The fine-grained text descriptions can be integrated into existing reconstruction methods to achieve finer-grained brain-to-image reconstruction. Conclusion: By introducing fine-grained text as a bridge, FgB2I markedly improves the detail and semantic consistency of brain-to-image reconstruction. Abstract: Brain-to-Image reconstruction aims to recover visual stimuli perceived by humans from brain activity. However, the reconstructed visual stimuli often lack details and exhibit semantic inconsistencies, which may be attributed to insufficient semantic information. To address this issue, we propose an approach named Fine-grained Brain-to-Image reconstruction (FgB2I), which employs fine-grained text as a bridge to improve image reconstruction. FgB2I comprises three key stages: detail enhancement, decoding fine-grained text descriptions, and text-bridged brain-to-image reconstruction. In the detail-enhancement stage, we leverage large vision-language models to generate fine-grained captions for visual stimuli and experimentally validate its importance. We propose three reward metrics (object accuracy, text-image semantic similarity, and image-image semantic similarity) to guide the language model in decoding fine-grained text descriptions from fMRI signals. The fine-grained text descriptions can be integrated into existing reconstruction methods to achieve fine-grained Brain-to-Image reconstruction.

[99] Learning A Robust RGB-Thermal Detector for Extreme Modality Imbalance

Chao Tian,Chao Yang,Guoqing Zhu,Qiang Wang,Zhenyu He

Main category: cs.CV

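TL;DR: Proposes an RGB-T object detection method built on a base-and-auxiliary detector architecture, which addresses modality imbalance through a modality interaction module and pseudo-degradation, markedly improving model robustness.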

Details

Motivation: In real-world scenarios, RGB-T data can suffer modality degradation due to environmental or technical issues, causing a distribution mismatch between training and testing. Conventional methods assume balanced modalities and cannot cope with extreme imbalance. Method: Designs a base-and-auxiliary detector architecture, introduces a modality interaction module that adaptively weighs the modalities, and uses pseudo-degradation to simulate real imbalanced data. The base detector provides a consistency constraint, while the auxiliary detector handles degraded samples. Result: Experiments show that the method substantially reduces the Missing Rate (by 55%) and improves performance across multiple baseline detectors. Conclusion: The proposed method effectively addresses modality imbalance in RGB-T object detection and makes the model more robust under degraded conditions. Abstract: RGB-Thermal (RGB-T) object detection utilizes thermal infrared (TIR) images to complement RGB data, improving robustness in challenging conditions. Traditional RGB-T detectors assume balanced training data, where both modalities contribute equally. However, in real-world scenarios, modality degradation, due to environmental factors or technical issues, can lead to extreme modality imbalance, causing out-of-distribution (OOD) issues during testing and disrupting model convergence during training. This paper addresses these challenges by proposing a novel base-and-auxiliary detector architecture. We introduce a modality interaction module to adaptively weigh modalities based on their quality and handle imbalanced samples effectively. Additionally, we leverage modality pseudo-degradation to simulate real-world imbalances in training data. The base detector, trained on high-quality pairs, provides a consistency constraint for the auxiliary detector, which receives degraded samples. This framework enhances model robustness, ensuring reliable performance even under severe modality degradation. Experimental results demonstrate the effectiveness of our method in handling extreme modality imbalances (decreasing the Missing Rate by 55%) and improving performance across various baseline detectors.

[100] Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers

Weilun Feng,Chuanguang Yang,Haotong Qin,Xiangqi Li,Yu Wang,Zhulin An,Libo Huang,Boyu Diao,Zixiang Zhao,Yongjun Xu,Michele Magno

Main category: cs.CV

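TL;DR: Q-VDiT is a quantization framework designed for video diffusion transformers (DiT); its Token-aware Quantization Estimator (TQE) and Temporal Maintenance Distillation (TMD) address the loss of information during quantization and the mismatch between optimization objectives and the requirements of video generation.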

Details

Motivation: Existing quantization methods perform poorly on video generation, mainly because of information loss during quantization and the mismatch between optimization objectives and the requirements of video generation. Method: Proposes TQE to compensate for quantization errors, and TMD to preserve the spatiotemporal correlations between frames and optimize with respect to the overall video context. Result: W3A6 Q-VDiT achieves a scene consistency of 23.40, outperforming existing quantization methods by 1.9 times. Conclusion: Q-VDiT offers an efficient quantization solution for video DiT models and significantly improves performance. Abstract: Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9$\times$. Code will be available at https://github.com/cantbebetter2/Q-VDiT.
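For readers unfamiliar with what "lowering the bit-width" costs, the following is a minimal sketch of plain uniform min-max quantization, the kind of baseline whose error methods like TQE aim to compensate. It is not Q-VDiT itself. The 3-bit setting mirrors the weight side of the W3A6 label, which presumably denotes 3-bit weights and 6-bit activations; the helper name and the toy weights are hypothetical.

```python
def quantize_uniform(values, num_bits):
    """Map floats to integer codes on a uniform grid, then map back."""
    levels = 2 ** num_bits - 1          # highest integer code
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    dequant = [lo + c * scale for c in codes]
    return codes, dequant


# Hypothetical weights quantized to 3 bits (8 representable levels).
weights = [-1.0, -0.4, 0.0, 0.3, 0.75]
codes, approx = quantize_uniform(weights, num_bits=3)
# Rounding to the nearest level bounds the error by half a step.
max_error = max(abs(w - a) for w, a in zip(weights, approx))
```

With only eight levels the grid step here is 0.25, so individual weights can move by up to 0.125; this is the information loss that becomes severe at very low bit-widths.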

[101] S2AFormer: Strip Self-Attention for Efficient Vision Transformer

Guoan Xu,Wenfeng Huang,Wenjing Jia,Jiamao Li,Guangwei Gao,Guo-Jun Qi

Main category: cs.CV

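TL;DR: S2AFormer is an efficient Vision Transformer architecture that uses Strip Self-Attention (SSA) and Hybrid Perception Blocks (HPBs) to combine the local perception of CNNs with the global modeling of Transformers, significantly reducing computational overhead while preserving accuracy.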

Details

Motivation: The computational demands of the Vision Transformer (ViT) grow quadratically with the number of tokens, limiting its practical efficiency. Existing methods combine convolution and self-attention, but the complex matrix operations of self-attention remain a bottleneck. Method: Proposes S2AFormer, which uses SSA to reduce spatial and channel dimensions and HPBs to integrate the local perception of CNNs with the global modeling of Transformers. Result: On benchmarks including ImageNet-1k, ADE20k, and COCO, S2AFormer is efficient and accurate in both GPU and non-GPU environments. Conclusion: S2AFormer balances efficiency and effectiveness and is a strong candidate for efficient vision Transformers. Abstract: Vision Transformer (ViT) has made significant advancements in computer vision, thanks to its token mixer's sophisticated ability to capture global dependencies between all tokens. However, the quadratic growth in computational demands as the number of tokens increases limits its practical efficiency. Although recent methods have combined the strengths of convolutions and self-attention to achieve better trade-offs, the expensive pairwise token affinity and complex matrix operations inherent in self-attention remain a bottleneck. To address this challenge, we propose S2AFormer, an efficient Vision Transformer architecture featuring novel Strip Self-Attention (SSA). We design simple yet effective Hybrid Perception Blocks (HPBs) to effectively integrate the local perception capabilities of CNNs with the global context modeling of Transformer's attention mechanisms. A key innovation of SSA lies in its reducing the spatial dimensions of $K$ and $V$ while compressing the channel dimensions of $Q$ and $K$. This design significantly reduces computational overhead while preserving accuracy, striking an optimal balance between efficiency and effectiveness. We evaluate the robustness and efficiency of S2AFormer through extensive experiments on multiple vision benchmarks, including ImageNet-1k for image classification, ADE20k for semantic segmentation, and COCO for object detection and instance segmentation. Results demonstrate that S2AFormer achieves significant accuracy gains with superior efficiency in both GPU and non-GPU environments, making it a strong candidate for efficient vision Transformers.

[102] Investigating Mechanisms for In-Context Vision Language Binding

Darshana Saravanan,Makarand Tapaswi,Vineet Gandhi

Main category: cs.CV

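TL;DR: Studies how vision-language models (VLMs) bind images and text, investigating whether the Binding ID mechanism associates objects in an image with their textual descriptions.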

Details

Motivation: Understand how VLMs use the Binding ID mechanism to associate images and text, with the aim of improving multimodal understanding. Method: Uses a synthetic dataset and task to analyze how VLMs assign Binding IDs to objects in an image and their textual descriptions. Result: Experiments show that VLMs assign a distinct Binding ID to an object's image tokens and its textual references, enabling in-context association. Conclusion: The Binding ID mechanism effectively supports cross-modal binding of images and text in VLMs. Abstract: To understand a prompt, Vision-Language models (VLMs) must perceive the image, comprehend the text, and build associations within and across both modalities. For instance, given an 'image of a red toy car', the model should associate this image with phrases like 'car', 'red toy', 'red object', etc. Feng and Steinhardt propose the Binding ID mechanism in LLMs, suggesting that the entity and its corresponding attribute tokens share a Binding ID in the model activations. We investigate this for image-text binding in VLMs using a synthetic dataset and task that requires models to associate 3D objects in an image with their descriptions in the text. Our experiments demonstrate that VLMs assign a distinct Binding ID to an object's image tokens and its textual references, enabling in-context association.

[103] A Survey on Training-free Open-Vocabulary Semantic Segmentation

Naomi Kombol,Ivan Martinović,Siniša Šegvić

Main category: cs.CV

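TL;DR: Surveys training-free open-vocabulary semantic segmentation, covering more than 30 approaches that are CLIP-based, rely on auxiliary visual foundation models, or use generative methods, and discusses the limitations of current research and future directions.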

Details

Motivation: Traditional semantic segmentation requires vast computational resources and annotated data, while open-vocabulary semantic segmentation asks models to classify categories they never learned, so researchers have turned to training-free methods that leverage existing multi-modal classification models. Method: The survey first defines the task, gives an overview of popular model archetypes, and groups more than 30 approaches into three branches: purely CLIP-based, those leveraging auxiliary visual foundation models, and generative methods. Result: The survey summarizes the limitations and potential problems of current research and proposes underexplored directions for future study. Conclusion: The survey aims to onboard new researchers and spark further interest in the area. Abstract: Semantic segmentation is one of the most fundamental tasks in image understanding with a long history of research, and consequently a myriad of different approaches. Traditional methods strive to train models from scratch, requiring vast amounts of computational resources and training data. With the move to open-vocabulary semantic segmentation, which asks models to classify beyond learned categories, large quantities of finely annotated data would be prohibitively expensive. Researchers have instead turned to training-free methods where they leverage existing models made for tasks where data is more easily acquired. Specifically, this survey will cover the history, nuance, idea development and the state-of-the-art in training-free open-vocabulary semantic segmentation that leverages existing multi-modal classification models. We will first give a preliminary on the task definition followed by an overview of popular model archetypes and then spotlight over 30 approaches split into broader research branches: purely CLIP-based, those leveraging auxiliary visual foundation models and ones relying on generative methods. Subsequently, we will discuss the limitations and potential problems of current research, as well as provide some underexplored ideas for future study. We believe this survey will serve as a good onboarding read to new researchers and spark increased interest in the area.

[104] Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation

Yunsoo Kim,Jinge Wu,Su-Hwan Kim,Pardeep Vasudev,Jiashu Shen,Honghan Wu

Main category: cs.CV

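TL;DR: Proposes Look & Mark (L&M), a new method that combines radiologist eye fixations and bounding box annotations to significantly improve multimodal large language models on medical image analysis and reduce clinical errors.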

Details

Motivation: Existing multimodal large language models suffer from hallucinations and clinical errors in medical image analysis, which limits their reliability in practice. Method: Proposes L&M, which combines radiologist eye fixations (Look) and bounding box annotations (Mark) and uses in-context learning to improve performance without retraining. Result: L&M significantly improves performance, e.g., a 1.2% gain in A.AVG for CXR-LLaVA and 9.2% for LLaVA-Med, and reduces clinical errors (by 0.43 errors per report on average). Conclusion: L&M is a scalable and efficient solution that could improve diagnostic workflows in AI-assisted radiology, especially in low-resource clinical settings. Abstract: Recent advancements in multimodal Large Language Models (LLMs) have significantly enhanced the automation of medical image analysis, particularly in generating radiology reports from chest X-rays (CXR). However, these models still suffer from hallucinations and clinically significant errors, limiting their reliability in real-world applications. In this study, we propose Look & Mark (L&M), a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark) into the LLM prompting framework. Unlike conventional fine-tuning, L&M leverages in-context learning to achieve substantial performance gains without retraining. When evaluated across multiple domain-specific and general-purpose models, L&M demonstrates significant gains, including a 1.2% improvement in overall metrics (A.AVG) for CXR-LLaVA compared to baseline prompting and a remarkable 9.2% boost for LLaVA-Med. General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (C.AVG), the highest among all models, even surpassing those explicitly trained for CXR report generation. Expert evaluations further confirm that L&M reduces clinically significant errors (by 0.43 average errors per report), such as false predictions and omissions, enhancing both accuracy and reliability. These findings highlight L&M's potential as a scalable and efficient solution for AI-assisted radiology, paving the way for improved diagnostic workflows in low-resource clinical settings.

[105] Hadaptive-Net: Efficient Vision Models via Adaptive Cross-Hadamard Synergy

Xuyang Zhang,Xi Zhang,Liang Chen,Hao Shi,Qingshan Guo

Main category: cs.CV

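TL;DR: The paper proposes ACH, a lightweight network module based on the Hadamard product, and Hadaptive-Net, which markedly improve the balance between inference speed and accuracy on vision tasks.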

Details

Motivation: Although the Hadamard product has great potential for enhancing network representational capacity and dimensional compression, its practical application has not been systematically explored. This work aims to exploit its advantages and apply it efficiently. Method: Analyzes the advantages of the Hadamard product in cross-channel interaction and channel expansion, proposes the Adaptive Cross-Hadamard (ACH) module, and builds the lightweight network Hadaptive-Net. Result: Experiments show that Hadaptive-Net achieves an unprecedented balance between inference speed and accuracy on vision tasks. Conclusion: With the ACH module and Hadaptive-Net, the theoretical potential of the Hadamard product is turned into practical use, offering a new direction for lightweight network design. Abstract: Recent studies have revealed the immense potential of the Hadamard product in enhancing network representational capacity and dimensional compression. However, despite its theoretical promise, this technique has not been systematically explored or effectively applied in practice, leaving its full capabilities underdeveloped. In this work, we first analyze and identify the advantages of the Hadamard product over standard convolutional operations in cross-channel interaction and channel expansion. Building upon these insights, we propose a computationally efficient module: Adaptive Cross-Hadamard (ACH), which leverages adaptive cross-channel Hadamard products for high-dimensional channel expansion. Furthermore, we introduce Hadaptive-Net (Hadamard Adaptive Network), a lightweight network backbone for visual tasks, which experiments show achieves an unprecedented balance between inference speed and accuracy through our proposed module.
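As a rough illustration of the channel expansion described in the abstract above, the following Python sketch appends the element-wise (Hadamard) product of every channel pair to the input channels, growing C channels to C + C(C-1)/2. The function names are hypothetical, and the exhaustive pairing is an assumption standing in for the adaptive selection in ACH, which the abstract does not specify; this is not the authors' implementation.

```python
from itertools import combinations


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally shaped 2D feature maps."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def cross_hadamard_expand(channels):
    """Expand C channels to C + C*(C-1)/2 by appending every pairwise product.

    Assumption: exhaustive pairing replaces the adaptive pair selection
    that ACH performs, since the abstract gives no details on it.
    """
    expanded = list(channels)
    for i, j in combinations(range(len(channels)), 2):
        expanded.append(hadamard(channels[i], channels[j]))
    return expanded


# Three toy 2x2 channels.
features = [
    [[1, 2], [3, 4]],
    [[2, 0], [1, 1]],
    [[0, 1], [1, 0]],
]
expanded = cross_hadamard_expand(features)
print(len(expanded))  # 3 original channels + 3 pairwise products
```

Each new channel needs only one multiplication per pixel, which is the sense in which such an expansion is cheap compared with a learned convolution.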

[106] GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking

Haibin He,Jing Zhang,Maoyuan Ye,Juhua Liu,Bo Du,Dacheng Tao

Main category: cs.CV

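TL;DR: GoMatching++ is an efficient method that turns an image text spotter into a video text spotter, using a lightweight tracker and a rescoring mechanism to improve performance, and sets new records on several benchmarks.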

Details

Motivation: Existing video text spotting methods underperform, particularly in recognition capability, so a more efficient and less data-hungry solution is needed. Method: Freezes the image text spotter and introduces a lightweight trainable tracker, combining a rescoring mechanism with the LST-Matcher to improve video text handling. Result: Performs strongly on benchmarks such as ICDAR15-video, DSText, and BOVText while significantly reducing training cost. Conclusion: GoMatching++ and the ArTVideo benchmark are expected to drive future progress in video text spotting. Abstract: Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter's ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.

[107] Enjoying Information Dividend: Gaze Track-based Medical Weakly Supervised Segmentation

Zhisong Wang,Yiwen Ye,Ziyang Chen,Yong Xia

Main category: cs.CV

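TL;DR: GradTrack uses physicians' gaze tracks (fixation points, durations, and temporal order) to improve weakly supervised semantic segmentation (WSSS), progressively refining features during decoding through multi-level gaze supervision.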

Details

Motivation: WSSS in medical imaging struggles to make effective use of sparse annotations, and existing gaze-based methods do not fully exploit the rich information in gaze data. Method: GradTrack has two components, Gaze Track Map Generation and Track Attention, which progressively refine features through multi-level gaze supervision. Result: On the Kvasir-SEG and NCI-ISBI datasets, GradTrack clearly outperforms existing methods, improving Dice scores by 3.21% and 2.61% respectively, and narrows the gap to fully supervised models. Conclusion: By making effective use of gaze data, GradTrack improves WSSS performance and offers a new direction for medical image segmentation. Abstract: Weakly supervised semantic segmentation (WSSS) in medical imaging struggles with effectively using sparse annotations. One promising direction for WSSS leverages gaze annotations, captured via eye trackers that record regions of interest during diagnostic procedures. However, existing gaze-based methods, such as GazeMedSeg, do not fully exploit the rich information embedded in gaze data. In this paper, we propose GradTrack, a framework that utilizes physicians' gaze track, including fixation points, durations, and temporal order, to enhance WSSS performance. GradTrack comprises two key components: Gaze Track Map Generation and Track Attention, which collaboratively enable progressive feature refinement through multi-level gaze supervision during the decoding process. Experiments on the Kvasir-SEG and NCI-ISBI datasets demonstrate that GradTrack consistently outperforms existing gaze-based methods, achieving Dice score improvements of 3.21% and 2.61%, respectively. Moreover, GradTrack significantly narrows the performance gap with fully supervised models such as nnUNet.
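For reference, the Dice score cited above measures the overlap between a predicted mask and a ground-truth mask. A minimal sketch, assuming flat binary masks and a hypothetical `dice_score` helper (not taken from the GradTrack code):

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, target)
print(round(score, 4))
```

Here two of the three predicted foreground pixels overlap the three ground-truth pixels, so the score is 2*2/(3+3).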

[108] StateSpaceDiffuser: Bringing Long Context to Diffusion World Models

Nedko Savov,Naser Kazemi,Deheng Zhang,Danda Pani Paudel,Xi Wang,Luc Van Gool

Main category: cs.CV

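TL;DR: StateSpaceDiffuser combines a diffusion model with a state-space model (Mamba) to address the loss of visual consistency in long-context tasks, substantially improving long-term memory.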

Details

Motivation: Existing diffusion models lack a persistent environment state, so visual consistency is quickly lost in long-sequence tasks. Method: Proposes StateSpaceDiffuser, which integrates a sequence representation from a state-space model into the diffusion model to restore long-term memory. Result: In 2D maze navigation and a complex 3D environment, StateSpaceDiffuser significantly outperforms a diffusion-only model and maintains visual consistency for longer. Conclusion: Incorporating state-space representations effectively improves the long-term memory of diffusion models while preserving high-fidelity synthesis. Abstract: World models have recently become promising tools for predicting realistic visuals based on actions in complex environments. However, their reliance on a short sequence of observations causes them to quickly lose track of context. As a result, visual consistency breaks down after just a few steps, and generated scenes no longer reflect information seen earlier. This limitation of the state-of-the-art diffusion-based world models comes from their lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform on long-context tasks by integrating a sequence representation from a state-space model (Mamba), representing the entire interaction history. This design restores long-term memory without sacrificing the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective in demonstrating both visual details and long-term memory.

[109] YH-MINER: Multimodal Intelligent System for Natural Ecological Reef Metric Extraction

Mingzhuang Wang,Yvyang Li,Xiyang Zhang,Fei Tan,Qi Shi,Guotao Zhang,Siqi Chen,Yufei Liu,Lei Lei,Ming Zhou,Qiang Lin,Hongqiang Yang

Main category: cs.CV

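TL;DR: The study develops the YH-OSI system, built on a multimodal large model (MLLM), which uses object detection and semantic segmentation to monitor coral reefs efficiently, reaching 88% classification accuracy.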

Details

Motivation: Coral reef ecological monitoring suffers from low efficiency and insufficient segmentation accuracy, so intelligent solutions are needed. Method: Uses a multimodal large model framework that combines an object detection module (mAP@0.5=0.78) with a semantic segmentation module to generate spatial prior boxes and complete pixel-level segmentation, then classifies via classification instructions. Result: The system achieves 88% genus-level classification accuracy in low-light and occluded scenes and can extract core ecological metrics. Conclusion: YH-OSI provides an efficient automated solution for coral reef monitoring and is extensible, supporting future integration with underwater robots. Abstract: Coral reefs, crucial for sustaining marine biodiversity and ecological processes (e.g., nutrient cycling, habitat provision), face escalating threats, underscoring the need for efficient monitoring. Coral reef ecological monitoring faces dual challenges of low efficiency in manual analysis and insufficient segmentation accuracy in complex underwater scenarios. This study develops the YH-OSI system, establishing an intelligent framework centered on the Multimodal Large Model (MLLM) for "object detection-semantic segmentation-prior input". The system uses the object detection module (mAP@0.5=0.78) to generate spatial prior boxes for coral instances, driving the segment module to complete pixel-level segmentation in low-light and densely occluded scenarios. The segmentation masks and finetuned classification instructions are fed into the Qwen2-VL-based multimodal model as prior inputs, achieving a genus-level classification accuracy of 88% and simultaneously extracting core ecological metrics. Meanwhile, the system retains the scalability of the multimodal model through standardized interfaces, laying a foundation for future integration into multimodal agent-based underwater robots and supporting the full-process automation of "image acquisition-prior generation-real-time analysis."

[110] Domain Adaptation of Attention Heads for Zero-shot Anomaly Detection

Kiyoon Jeong,Jaehyuk Heo,Junyeong Son,Pilsung Kang

Main category: cs.CV

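TL;DR: HeadCLIP is a zero-shot anomaly detection method that adapts both the text and image encoders, combining learnable prompts with dynamically adjusted attention-head weights to substantially improve detection performance.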

Details

Motivation: Existing zero-shot anomaly detection methods fall short on domain adaptation and do not fully exploit general-purpose models. Method: HeadCLIP adapts the text and image encoders to the domain via learnable prompts and dynamically adjusted attention-head weights, and introduces a joint anomaly score. Result: On industrial and medical datasets, HeadCLIP outperforms existing methods at both the pixel and image level. Conclusion: Through comprehensive domain adaptation, HeadCLIP markedly improves zero-shot anomaly detection. Abstract: Zero-shot anomaly detection (ZSAD) in images is an approach that can detect anomalies without access to normal samples, which can be beneficial in various realistic scenarios where model training is not possible. However, existing ZSAD research has shown limitations by either not considering domain adaptation of general-purpose backbone models to anomaly detection domains or by implementing only partial adaptation to some model components. In this paper, we propose HeadCLIP to overcome these limitations by effectively adapting both text and image encoders to the domain. HeadCLIP generalizes the concepts of normality and abnormality through learnable prompts in the text encoder, and introduces learnable head weights to the image encoder to dynamically adjust the features held by each attention head according to domain characteristics. Additionally, we maximize the effect of domain adaptation by introducing a joint anomaly score that utilizes domain-adapted pixel-level information for image-level anomaly detection. Experimental results using multiple real datasets in both industrial and medical domains show that HeadCLIP outperforms existing ZSAD techniques at both pixel and image levels. In the industrial domain, improvements of up to 4.9%p in pixel-level mean anomaly detection score (mAD) and up to 3.0%p in image-level mAD were achieved, with similar improvements (3.2%p, 3.1%p) in the medical domain.

[111] Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss

Wenjun Lu,Haodong Chen,Anqi Yi,Yuk Ying Chung,Zhiyong Wang,Kun Hu

Main category: cs.CV

TL;DR: HDGS is a depth-supervision framework that improves novel view synthesis under sparse views through multi-scale depth consistency.

Details

Motivation: Under sparse-view conditions, existing methods such as NeRF and 3DGS lose reconstruction quality because of limited geometric cues, and the quality of rendered depth is a key factor. Method: Introduces Hierarchical Depth-Guided Splatting (HDGS) and a Cascade Pearson Correlation Loss (CPCL), progressively refining geometry through multi-scale depth consistency. Result: On the LLFF and DTU benchmarks, HDGS achieves state-of-the-art performance under sparse views while keeping rendering efficient and high quality. Conclusion: Multi-scale depth supervision lets HDGS substantially improve structural fidelity under sparse views, providing an effective solution for novel view synthesis. Abstract: Novel view synthesis is a fundamental task in 3D computer vision that aims to reconstruct realistic images from a set of posed input views. However, reconstruction quality degrades significantly under sparse-view conditions due to limited geometric cues. Existing methods, such as Neural Radiance Fields (NeRF) and the more recent 3D Gaussian Splatting (3DGS), often suffer from blurred details and structural artifacts when trained with insufficient views. Recent works have identified the quality of rendered depth as a key factor in mitigating these artifacts, as it directly affects geometric accuracy and view consistency. In this paper, we address these challenges by introducing Hierarchical Depth-Guided Splatting (HDGS), a depth supervision framework that progressively refines geometry from coarse to fine levels. Central to HDGS is a novel Cascade Pearson Correlation Loss (CPCL), which aligns rendered and estimated monocular depths across multiple spatial scales. By enforcing multi-scale depth consistency, our method substantially improves structural fidelity in sparse-view scenarios. Extensive experiments on the LLFF and DTU benchmarks demonstrate that HDGS achieves state-of-the-art performance under sparse-view settings while maintaining efficient and high-quality rendering.
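The abstract does not give the exact form of CPCL, so the following is only a sketch of one way a multi-scale Pearson depth loss could look. It assumes 2x2 average pooling between scales, equal weighting of scales, and a per-scale loss of 1 minus the Pearson correlation; all function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: constant maps are treated as uncorrelated
    return cov / math.sqrt(vx * vy)


def avg_pool2(depth):
    """Halve resolution by averaging non-overlapping 2x2 blocks (even sizes only)."""
    return [
        [(depth[r][c] + depth[r][c + 1] + depth[r + 1][c] + depth[r + 1][c + 1]) / 4.0
         for c in range(0, len(depth[0]), 2)]
        for r in range(0, len(depth), 2)
    ]


def cascade_pearson_loss(rendered, mono, num_scales=2):
    """Mean of (1 - Pearson) between two depth maps over progressively coarser scales."""
    loss = 0.0
    for scale in range(num_scales):
        flat_r = [v for row in rendered for v in row]
        flat_m = [v for row in mono for v in row]
        loss += 1.0 - pearson(flat_r, flat_m)
        if scale + 1 < num_scales:
            rendered, mono = avg_pool2(rendered), avg_pool2(mono)
    return loss / num_scales


rendered = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]]
# A monocular estimate that differs only by an unknown scale and shift.
mono = [[2 * v + 3 for v in row] for row in rendered]
print(cascade_pearson_loss(rendered, mono))  # close to zero
```

Pearson correlation is unchanged by positive scaling and shifting, which is presumably why it suits monocular depth estimates whose absolute scale is unknown.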

[112] From Controlled Scenarios to Real-World: Cross-Domain Degradation Pattern Matching for All-in-One Image Restoration

Junyu Fan,Chuanlin Liao,Yi Lin

Main category: cs.CV

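TL;DR: The paper proposes a Unified Domain-Adaptive Image Restoration (UDAIR) framework that improves image restoration across multiple degradation patterns through a domain adaptation strategy and a contrastive learning mechanism.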

Details

Motivation: Existing All-in-One Image Restoration (AiOIR) methods degrade in real-world scenarios because the distribution gap between training data and real data weakens degradation identification. Method: Designs a codebook that learns discrete embeddings to represent degradation patterns and uses cross-sample contrastive learning to capture shared features; proposes a domain adaptation strategy that dynamically aligns the source- and target-domain codebook embeddings, refined by a test-time adaptation mechanism. Result: On 10 open-source datasets, UDAIR achieves state-of-the-art performance, and feature clustering validates degradation identification under unknown conditions. Conclusion: Through domain adaptation and contrastive learning, UDAIR substantially improves the generalization of AiOIR to real-world scenarios. Abstract: As a fundamental imaging task, All-in-One Image Restoration (AiOIR) aims to restore images affected by multiple degradation patterns via a single model with unified parameters. Although existing AiOIR approaches obtain promising performance in closed and controlled scenarios, they still suffer from considerable performance reduction in real-world scenarios, since the gap in data distribution between the training samples (source domain) and real-world test samples (target domain) can lead to inferior degradation awareness. To address this issue, a Unified Domain-Adaptive Image Restoration (UDAIR) framework is proposed to effectively achieve AiOIR by transferring the knowledge learned from the source domain to the target domain. To improve degradation identification, a codebook is designed to learn a group of discrete embeddings that denote the degradation patterns, and a cross-sample contrastive learning mechanism is further proposed to capture shared features from different samples of a given degradation. To bridge the data gap, a domain adaptation strategy is proposed to build the feature projection between the source and target domains by dynamically aligning their codebook embeddings, and a correlation alignment-based test-time adaptation mechanism is designed to fine-tune the alignment discrepancies by tightening the degradation embeddings to the corresponding cluster center in the source domain. Experimental results on 10 open-source datasets demonstrate that UDAIR achieves new state-of-the-art performance for the AiOIR task. Most importantly, the feature clustering validates the degradation identification under unknown conditions, and qualitative comparisons showcase robust generalization to real-world scenarios.

[113] Neural Restoration of Greening Defects in Historical Autochrome Photographs Based on Purely Synthetic Data

Saptarshi Neil Sinha,P. Julius Kuehn,Johannes Koppe,Arjan Kuijper,Michael Weinmann

Main category: cs.CV

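TL;DR: This paper presents an approach based on synthetic dataset generation and generative AI for automatically restoring greening defects in digitized autochrome photographs.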

Details

Motivation: Preserving early visual arts, especially color photographs, is challenged by blurring, scratches, color bleeding, and fading caused by aging and improper storage. Method: Simulates greening defects with synthetic data and applies a modified weighted loss function to the ChaIR method for restoration. Result: The method restores photographs efficiently with reduced time requirements and outperforms existing methods. Conclusion: The proposed method provides an efficient, automated solution for visual art restoration. Abstract: The preservation of early visual arts, particularly color photographs, is challenged by deterioration caused by aging and improper storage, leading to issues like blurring, scratches, color bleeding, and fading defects. In this paper, we present the first approach for the automatic removal of greening color defects in digitized autochrome photographs. Our main contributions include a method based on synthetic dataset generation and the use of generative AI with a carefully designed loss function for the restoration of visual arts. To address the lack of suitable training datasets for analyzing greening defects in damaged autochromes, we introduce a novel approach for accurately simulating such defects in synthetic data. We also propose a modified weighted loss function for the ChaIR method to account for color imbalances between defected and non-defected areas. While existing methods struggle with accurately reproducing original colors and may require significant manual effort, our method allows for efficient restoration with reduced time requirements.

[114] CADReview: Automatically Reviewing CAD Programs with Error Detection and Correction

Jiali Chen,Xusen Hei,HongFei Liu,Yuancheng Wei,Zikun Deng,Jiayuan Xie,Yi Cai,Li Qing

Main category: cs.CV

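TL;DR: The paper proposes the ReCAD framework for automatically detecting and repairing errors in CAD programs so that 3D objects match reference images, and builds the CADReview dataset of over 20K program-image pairs.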

Details

Motivation: Designers spend a great deal of time checking and correcting the consistency between CAD programs and reference images during 3D prototyping, and existing MLLMs perform poorly at recognizing multiple geometric components and performing spatial geometric operations. Method: Proposes the CAD program repairer framework ReCAD, which detects program errors and provides correction feedback, and builds the CADReview dataset. Result: Experiments show that ReCAD significantly outperforms existing MLLMs on the CAD review task, showing potential in design applications. Conclusion: By automating error detection and correction, ReCAD improves design efficiency and offers an effective solution for CAD program review. Abstract: Computer-aided design (CAD) is crucial in prototyping 3D objects through geometric instructions (i.e., CAD programs). In practical design workflows, designers often engage in time-consuming reviews and refinements of these prototypes by comparing them with reference images. To bridge this gap, we introduce the CAD review task to automatically detect and correct potential errors, ensuring consistency between the constructed 3D objects and reference images. However, recent advanced multimodal large language models (MLLMs) struggle to recognize multiple geometric components and perform spatial geometric operations within the CAD program, leading to inaccurate reviews. In this paper, we propose the CAD program repairer (ReCAD) framework to effectively detect program errors and provide helpful feedback on error correction. Additionally, we create a dataset, CADReview, consisting of over 20K program-image pairs, with diverse errors for the CAD review task. Extensive experiments demonstrate that our ReCAD significantly outperforms existing MLLMs, which shows great potential in design applications.

[115] IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth

Md Touhidul Islam,Imran Kabir,Md Alimoor Reza,Syed Masum Billah

Main category: cs.CV

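TL;DR: IKIWISI is an interactive visual pattern generator for evaluating vision-language models on video object recognition when no ground truth is available. It visualizes model outputs as a heatmap and introduces "spy objects" to detect hallucination. A user study shows that IKIWISI is easy to use and effective for assessing model reliability.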

Details

Motivation: Traditional evaluation methods struggle to assess the reliability of vision-language models without ground truth; IKIWISI fills this gap by visualizing model outputs and introducing "spy objects". Method: IKIWISI converts model outputs into a binary heatmap (green for object present, red for absent) and uses "spy objects" to check whether the model hallucinates nonexistent objects. Result: A study with 15 participants shows that IKIWISI is easy to use, that user assessments agree with objective metrics, and that users need to inspect only a small fraction of heatmap cells to reach conclusions. Conclusion: IKIWISI complements traditional evaluation methods and reveals opportunities to improve the alignment between human perception and machine understanding in vision-language systems. Abstract: We present IKIWISI ("I Know It When I See It"), an interactive visual pattern generator for assessing vision-language models in video object recognition when ground truth is unavailable. IKIWISI transforms model outputs into a binary heatmap where green cells indicate object presence and red cells indicate object absence. This visualization leverages humans' innate pattern recognition abilities to evaluate model reliability. IKIWISI introduces "spy objects": adversarial instances users know are absent, to discern models hallucinating on nonexistent items. The tool functions as a cognitive audit mechanism, surfacing mismatches between human and machine perception by visualizing where models diverge from human understanding. Our study with 15 participants found that users considered IKIWISI easy to use, made assessments that correlated with objective metrics when available, and reached informed conclusions by examining only a small fraction of heatmap cells. This approach not only complements traditional evaluation methods through visual assessment of model behavior with custom object sets, but also reveals opportunities for improving alignment between human perception and machine understanding in vision-language systems.
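The heatmap and spy-object idea from the abstract can be sketched in a few lines of Python. The data layout (one set of reported objects per frame), the function names, and the summary statistic are assumptions made for illustration, not the tool's actual interface.

```python
def build_heatmap(predictions, objects, num_frames):
    """One row per object, one cell per frame: True = present (green), False = absent (red)."""
    return {obj: [obj in predictions[f] for f in range(num_frames)] for obj in objects}


def spy_hallucination_rate(heatmap, spy_objects):
    """Fraction of spy-object cells marked present; spies are known to be absent."""
    cells = [cell for obj in spy_objects for cell in heatmap[obj]]
    return sum(cells) / len(cells) if cells else 0.0


# Hypothetical per-frame model outputs: the set of objects reported in each frame.
predictions = [{"dog", "ball"}, {"dog"}, {"dog", "unicorn"}, {"ball"}]
objects = ["dog", "ball", "unicorn"]

heatmap = build_heatmap(predictions, objects, num_frames=4)
rate = spy_hallucination_rate(heatmap, spy_objects=["unicorn"])
print(heatmap["unicorn"], rate)
```

Any green cell in a spy row is a hallucination by construction, so the rate gives a ground-truth-free signal even though the rows for real objects still need human judgment.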

[116] Learning to Infer Parameterized Representations of Plants from 3D Scans

Samara Ghrer,Christophe Godin,Stefanie Wuhrer

Main category: cs.CV

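TL;DR: Proposes a unified approach that infers a parameterized representation of a plant from its 3D scan, applicable to tasks such as reconstruction, segmentation, and skeletonization.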

Details

Motivation: 3D plant reconstruction is challenging because of complex spatial structure and self-occlusion, and existing methods either perform inverse modeling or focus on specific tasks. Method: Generates virtual plants with an L-systems-based procedural model and trains a recursive neural network to infer a parametric tree-like representation from a 3D point cloud. Result: Validated on synthetic plants, the method performs on par with the state of the art on reconstruction, segmentation, and skeletonization. Conclusion: The method offers a unified solution for 3D plant modeling that serves multiple tasks. Abstract: Faithfully reconstructing the 3D architecture of plants from unstructured observations is a challenging task. Plants frequently contain numerous organs, organized in branching systems in more or less complex spatial networks, leading to specific computational issues due to self-occlusion or spatial proximity between organs. Existing works either consider inverse modeling, where the aim is to recover the procedural rules that allow virtual plants to be simulated, or focus on specific tasks such as segmentation or skeletonization. We propose a unified approach that, given a 3D scan of a plant, infers a parameterized representation of the plant. This representation describes the plant's branching structure, contains parametric information for each plant organ, and can therefore be used directly in a variety of tasks. In this data-driven approach, we train a recursive neural network with virtual plants generated using an L-systems-based procedural model. After training, the network infers a parametric tree-like representation from an input 3D point cloud. Our method is applicable to any plant that can be represented as a binary axial tree. We evaluate our approach on Chenopodium Album plants, using experiments on synthetic plants to show that our unified framework allows for different tasks including reconstruction, segmentation and skeletonization, while achieving results on par with the state of the art for each task.

[117] Progressive Data Dropout: An Embarrassingly Simple Approach to Faster Training

Shriram M S,Xinyue Hao,Shihao Hou,Yang Lu,Laura Sevilla-Lara,Anurag Arnab,Shreyank N Gowda

Main category: cs.CV

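TL;DR: The paper proposes Progressive Data Dropout, a new training paradigm that cuts the number of effective epochs to as little as 12.4% of the baseline, saving compute while improving accuracy by up to 4.82%.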

Details

Motivation: Machine learning currently relies on training on large datasets at high cost. Model compression is well studied, but effective methods on the dataset side are still lacking. Method: Proposes Progressive Data Dropout, a simple approach that combines insights from hard-data mining and dropout and requires no change to model architecture or optimizer. Result: Effective epochs drop to as little as 12.4% of the baseline, and accuracy improves by up to 4.82%. Conclusion: The method is simple to use, fits standard training pipelines, and has potential for wide adoption. Abstract: The success of the machine learning field has reliably depended on training on large datasets. While effective, this trend comes at an extraordinary cost. This is due to two deeply intertwined factors: the size of models and the size of datasets. While promising research efforts focus on reducing the size of models, the other half of the equation remains fairly mysterious. Indeed, it is surprising that the standard approach to training remains to iterate over and over, uniformly sampling the training dataset. In this paper we explore a series of alternative training paradigms that leverage insights from hard-data-mining and dropout, simple enough to implement and use that they can become the new training standard. The proposed Progressive Data Dropout reduces the number of effective epochs to as little as 12.4% of the baseline. These savings do not come at any cost to accuracy. Surprisingly, the proposed method improves accuracy by up to 4.82%. Our approach requires no changes to model architecture or optimizer, and can be applied across standard training pipelines, thus posing an excellent opportunity for wide adoption. Code can be found here: https://github.com/bazyagami/LearningWithRevision
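The abstract does not spell out the dropout rule, so the sketch below shows one plausible variant only: after each epoch, samples the model already handles are dropped, and later epochs visit just the remaining hard samples. The callbacks `train_step` and `is_learned`, the toy learner, and the definition of effective epochs as total sample visits divided by dataset size are all assumptions for illustration.

```python
def train_with_progressive_dropout(samples, train_step, is_learned, epochs):
    """Run up to `epochs` passes, dropping already-learned samples after each pass.

    Returns the number of samples visited in each pass.
    """
    active = list(samples)
    visits = []
    for _ in range(epochs):
        if not active:
            break  # nothing hard is left to train on
        for sample in active:
            train_step(sample)
        visits.append(len(active))
        active = [s for s in active if not is_learned(s)]
    return visits


# Toy learner (hypothetical): a sample counts as learned once it has been
# seen as many times as its difficulty.
samples = [("a", 1), ("b", 2), ("c", 3)]
seen = {name: 0 for name, _ in samples}


def train_step(sample):
    seen[sample[0]] += 1


def is_learned(sample):
    name, difficulty = sample
    return seen[name] >= difficulty


visits = train_with_progressive_dropout(samples, train_step, is_learned, epochs=5)
effective_epochs = sum(visits) / len(samples)
print(visits, effective_epochs)
```

In this toy run the five-epoch budget shrinks to two effective epochs, because the easy samples stop consuming compute once they are learned.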

[118] Task-Driven Implicit Representations for Automated Design of LiDAR Systems

Nikhil Behari,Aaron Young,Akshat Dave,Ramesh Raskar

Main category: cs.CV

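TL;DR: Proposes a task-driven framework for automated LiDAR system design that learns task-specific densities in a six-dimensional design space with flow-based generative modeling, enabling efficient, constraint-aware design.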

Details

Motivation: LiDAR design is complex and time-consuming, and traditional, largely manual methods struggle to meet diverse spatial and temporal sampling requirements. Method: Represents LiDAR configurations in a six-dimensional design space, learns task-specific densities with flow-based generative modeling, and synthesizes new systems via expectation-maximization. Result: The method is validated on 3D vision tasks, supporting applications such as face scanning, robotic tracking, and object detection. Conclusion: The framework enables automated, efficient LiDAR system design across diverse tasks and constraints. Abstract: Imaging system design is a complex, time-consuming, and largely manual process; LiDAR design, ubiquitous in mobile devices, autonomous vehicles, and aerial imaging platforms, adds further complexity through unique spatial and temporal sampling requirements. In this work, we propose a framework for automated, task-driven LiDAR system design under arbitrary constraints. To achieve this, we represent LiDAR configurations in a continuous six-dimensional design space and learn task-specific implicit densities in this space via flow-based generative modeling. We then synthesize new LiDAR systems by modeling sensors as parametric distributions in 6D space and fitting these distributions to our learned implicit density using expectation-maximization, enabling efficient, constraint-aware LiDAR system design. We validate our method on diverse tasks in 3D vision, enabling automated LiDAR system design across real-world-inspired applications in face scanning, robotic tracking, and object detection.

[119] VME: A Satellite Imagery Dataset and Benchmark for Detecting Vehicles in the Middle East and Beyond

Noora Al-Emadi,Ingmar Weber,Yin Yang,Ferda Ofli

Main category: cs.CV

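TL;DR: The paper presents the VME dataset and the CDSI benchmark dataset to improve vehicle detection accuracy in satellite imagery, in the Middle East in particular and globally.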

Details

Motivation: Existing datasets are geographically biased and overlook the Middle East, so vehicle detection models perform poorly in that region. Method: Builds the VME dataset (Middle East) and the CDSI benchmark dataset (global), using both manual and semi-automated annotation. Result: VME significantly improves detection accuracy in the Middle East, and CDSI significantly improves detection performance globally. Conclusion: VME and CDSI fill gaps in existing datasets and provide a more comprehensive basis for vehicle detection. Abstract: Detecting vehicles in satellite images is crucial for traffic management, urban planning, and disaster response. However, current models struggle with real-world diversity, particularly across different regions. This challenge is amplified by geographic bias in existing datasets, which often focus on specific areas and overlook regions like the Middle East. To address this gap, we present the Vehicles in the Middle East (VME) dataset, designed explicitly for vehicle detection in high-resolution satellite images from Middle Eastern countries. Sourced from Maxar, the VME dataset spans 54 cities across 12 countries, comprising over 4,000 image tiles and more than 100,000 vehicles, annotated using both manual and semi-automated methods. Additionally, we introduce the largest benchmark dataset for Car Detection in Satellite Imagery (CDSI), combining images from multiple sources to enhance global car detection. Our experiments demonstrate that models trained on existing datasets perform poorly on Middle Eastern images, while the VME dataset significantly improves detection accuracy in this region. Moreover, state-of-the-art models trained on CDSI achieve substantial improvements in global car detection.

[120] Identity-Preserving Text-to-Image Generation via Dual-Level Feature Decoupling and Expert-Guided Fusion

Kewen Chen,Xiaobin Hu,Wenqi Ren

Main category: cs.CV

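TL;DR: Proposes a new framework that decouples identity-related from identity-unrelated features and introduces a feature fusion mechanism to improve the quality and text alignment of text-to-image generation.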

Details

Motivation: Current methods struggle to separate identity-related from identity-unrelated information in input images, leading to overfitting or failure to preserve subject identity. Method: The framework comprises an Implicit-Explicit foreground-background Decoupling Module (IEDM) and a Feature Fusion Module (FFM) based on a Mixture of Experts (MoE), combining learnable adapters with inpainting techniques. Result: Experiments show that the method markedly improves generated image quality, flexibility in scene adaptation, and output diversity. Conclusion: The new framework effectively addresses identity decoupling and improves text-to-image generation. Abstract: Recent advances in large-scale text-to-image generation models have led to a surge in subject-driven text-to-image generation, which aims to produce customized images that align with textual descriptions while preserving the identity of specific subjects. Despite significant progress, current methods struggle to disentangle identity-relevant information from identity-irrelevant details in the input images, resulting in overfitting or failure to maintain subject identity. In this work, we propose a novel framework that improves the separation of identity-related and identity-unrelated features and introduces an innovative feature fusion mechanism to improve the quality and text alignment of generated images. Our framework consists of two key components: an Implicit-Explicit foreground-background Decoupling Module (IEDM) and a Feature Fusion Module (FFM) based on a Mixture of Experts (MoE). IEDM combines learnable adapters for implicit decoupling at the feature level with inpainting techniques for explicit foreground-background separation at the image level. FFM dynamically integrates identity-irrelevant features with identity-related features, enabling refined feature representations even in cases of incomplete decoupling. In addition, we introduce three complementary loss functions to guide the decoupling process. Extensive experiments demonstrate the effectiveness of our proposed method in enhancing image generation quality, improving flexibility in scene adaptation, and increasing the diversity of generated outputs across various textual descriptions.

[121] DAM: Domain-Aware Module for Multi-Domain Dataset Condensation

Jaehyun Choi,Gyojin Han,Dong-Jae Lee,Sunghyun Baek,Junmo Kim

Main category: cs.CV

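TL;DR: The paper proposes Multi-Domain Dataset Condensation (MDDC), introducing a Domain-Aware Module (DAM) and frequency-based pseudo-domain labeling to improve dataset condensation in both single-domain and multi-domain settings.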

Details

Motivation: Modern datasets often contain heterogeneous images from multiple domains, which existing dataset condensation methods do not adequately account for, limiting condensation quality. Method: Proposes MDDC, combining the Domain-Aware Module (DAM) with frequency-based pseudo-domain labeling to embed domain features into synthetic images. Result: Experiments show that DAM outperforms baseline methods on in-domain, out-of-domain, and cross-architecture performance. Conclusion: Through the Domain-Aware Module, MDDC improves the generalization of dataset condensation and suits multi-domain scenarios. Abstract: Dataset Condensation (DC) has emerged as a promising solution to mitigate the computational and storage burdens associated with training deep learning models. However, existing DC methods largely overlook the multi-domain nature of modern datasets, which are increasingly composed of heterogeneous images spanning multiple domains. In this paper, we extend DC and introduce Multi-Domain Dataset Condensation (MDDC), which aims to condense data that generalizes across both single-domain and multi-domain settings. To this end, we propose the Domain-Aware Module (DAM), a training-time module that embeds domain-related features into each synthetic image via learnable spatial masks. As explicit domain labels are mostly unavailable in real-world datasets, we employ frequency-based pseudo-domain labeling, which leverages low-frequency amplitude statistics. DAM is only active during the condensation process, thus preserving the same images per class (IPC) as prior methods. Experiments show that DAM consistently improves in-domain, out-of-domain, and cross-architecture performance over baseline dataset condensation methods.

[122] PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models

Fan Fei,Jiajun Tang,Fei-Peng Tian,Boxin Shi,Ping Tan

Main category: cs.CV

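TL;DR: PacTure is a novel framework for generating physically-based rendering (PBR) material textures from an untextured 3D mesh, a text description, and an optional image prompt. By introducing view packing and a multi-view multi-domain generative backbone, it markedly improves generation quality and efficiency.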

Details

Motivation: Addresses the limited global consistency and resolution of existing 2D generation-based texturing methods, as well as their long inference times. Method: Proposes view packing, which casts the arrangement of multi-view maps as a 2D rectangle bin packing problem, combined with a multi-domain generative framework to raise resolution and efficiency. Result: Experiments show that PacTure outperforms existing methods in the quality of generated PBR textures and in training and inference efficiency. Conclusion: Through view packing and multi-domain generation, PacTure markedly improves texture generation and offers an efficient solution for 3D rendering. Abstract: We present PacTure, a novel framework for generating physically-based rendering (PBR) material textures from an untextured 3D mesh, a text description, and an optional image prompt. Early 2D generation-based texturing approaches generate textures sequentially from different views, resulting in long inference times and globally inconsistent textures. More recent approaches adopt multi-view generation with cross-view attention to enhance global consistency, which, however, limits the resolution for each view. In response to these weaknesses, we first introduce view packing, a novel technique that significantly increases the effective resolution for each view during multi-view generation without imposing additional inference cost, by formulating the arrangement of multi-view maps as a 2D rectangle bin packing problem. In contrast to UV mapping, it preserves the spatial proximity essential for image generation and maintains full compatibility with current 2D generative models. To further reduce the inference cost, we enable fine-grained control and multi-domain generation within the next-scale prediction autoregressive framework to create an efficient multi-view multi-domain generative backbone. Extensive experiments show that PacTure outperforms state-of-the-art methods in both quality of generated PBR textures and efficiency in training and inference.
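View packing is framed in the abstract as a 2D rectangle bin packing problem. The sketch below uses a simple shelf heuristic (tallest rectangle first) to place per-view rectangles in a fixed-width atlas; the heuristic and the `shelf_pack` name are assumptions chosen for brevity, not the packing algorithm used by PacTure.

```python
def shelf_pack(rects, bin_width):
    """Pack (width, height) rectangles into horizontal shelves of a fixed-width atlas.

    Returns (placements, atlas_height), where placements[i] is the top-left
    (x, y) of rects[i]. Rectangles are placed tallest first, left to right.
    """
    order = sorted(range(len(rects)), key=lambda i: rects[i][1], reverse=True)
    placements = [None] * len(rects)
    x = y = shelf_height = 0
    for i in order:
        w, h = rects[i]
        if w > bin_width:
            raise ValueError("rectangle wider than the atlas")
        if x + w > bin_width:
            # Current shelf is full: open a new shelf below it.
            y += shelf_height
            x = shelf_height = 0
        placements[i] = (x, y)
        x += w
        shelf_height = max(shelf_height, h)
    return placements, y + shelf_height


# Hypothetical bounding boxes of four rendered views, in pixels.
views = [(4, 3), (2, 5), (3, 2), (3, 3)]
placements, atlas_height = shelf_pack(views, bin_width=6)
print(placements, atlas_height)
```

Because each view keeps its own contiguous rectangle, neighbouring pixels inside a view stay neighbours in the atlas, which matches the spatial-proximity property the abstract contrasts with UV mapping.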

[123] Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

Xudong Li,Mengdan Zhang,Peixian Chen,Xiawu Zheng,Yan Zhang,Jingyuan Zheng,Yunhang Shen,Ke Li,Chaoyou Fu,Xing Sun,Rongrong Ji

Main category: cs.CV

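TL;DR: CcDPO is a multi-level preference optimization framework that improves multi-image understanding and reduces hallucinations by zooming into visual cues, from global sequential context down to local details.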

Details

Motivation: Multi-modal large language models perform well on single-image tasks but hallucinate in multi-image understanding because of cross-modal misalignment, and existing methods do not model the full context. Method: CcDPO optimizes at two levels: context-level optimization mitigates cognitive biases, and needle-level optimization focuses on fine-grained visual details. The MultiScope-42k dataset is constructed to support the optimization. Result: Experiments show that CcDPO significantly reduces hallucinations and yields consistent gains on single- and multi-image tasks. Conclusion: With multi-level optimization and dataset support, CcDPO improves multi-modal model performance on multi-image tasks. Abstract: Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods using Direct Preference Optimization (DPO) constrain optimization to a solitary image reference within the input sequence, neglecting holistic context modeling. We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues -- from sequential context to local details. It features: (i) Context-Level Optimization: Re-evaluates cognitive biases underlying MLLMs' multi-image context comprehension and integrates a spectrum of low-cost global sequence preferences for bias mitigation. (ii) Needle-Level Optimization: Directs attention to fine-grained visual details through region-targeted visual prompts and multimodal preference supervision. To support scalable optimization, we also construct MultiScope-42k, an automatically generated dataset with high-quality multi-level preference pairs. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks.
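
For reference (not from the paper): the abstract builds on Direct Preference Optimization but does not state the context-level or needle-level objectives. The standard DPO loss that such methods start from is shown below, where $x$ is the input, $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference policy, and $\beta$ is a temperature. How CcDPO adapts this to its two levels is not specified here and should not be inferred from this formula.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```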

[124] Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation

Jiadong Pan,Zhiyuan Ma,Kaiyan Zhang,Ning Ding,Bowen Zhou

Main category: cs.CV

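TL;DR: SRRL is a self-reflective reinforcement learning algorithm that treats the denoising trajectory as a multi-round reflection process, bringing image reasoning into generation tasks for the first time.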

Details

Motivation: Existing image generation methods perform poorly on logic-centered tasks; inspired by CoT and RL, SRRL is proposed to enable reasoning-based generation of logical images. Method: SRRL treats the denoising trajectory as a CoT step and introduces a condition-guided forward process that supports multi-round reflective iteration. Result: Experiments show that SRRL performs well on logical image generation, even outperforming GPT-4o. Conclusion: SRRL brings image reasoning into generation tasks for the first time and shows promise for logical image generation. Abstract: Diffusion models have recently demonstrated exceptional performance in image generation tasks. However, existing image generation methods still significantly suffer from the dilemma of image reasoning, especially in logic-centered image generation tasks. Inspired by the success of Chain of Thought (CoT) and Reinforcement Learning (RL) in LLMs, we propose SRRL, a self-reflective RL algorithm for diffusion models to achieve reasoning generation of logical images by performing reflection and iteration across generation trajectories. The intermediate samples in the denoising process carry noise, making accurate reward evaluation difficult. To address this challenge, SRRL treats the entire denoising trajectory as a CoT step with a multi-round reflective denoising process and introduces a condition-guided forward process, which allows for reflective iteration between CoT steps. Through SRRL-based iterative diffusion training, we introduce image reasoning through CoT into generation tasks adhering to physical laws and unconventional physical phenomena for the first time. Notably, case study results show the superior performance of our SRRL algorithm, even compared with GPT-4o. The project page is https://jadenpan0.github.io/srrl.github.io/.

[125] Frugal Incremental Generative Modeling using Variational Autoencoders

Victor Enescu,Hichem Sahbi

Main category: cs.CV

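TL;DR: Proposes a replay-free incremental learning model based on Variational Autoencoders (VAEs) that addresses data growth and catastrophic forgetting in incremental learning.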

Details

Motivation: Incremental learning has great potential in deep learning but faces catastrophic forgetting and scalability problems caused by data growth. Method: Designs a new incremental generative model with a multi-modal latent space and introduces an orthogonality criterion to reduce catastrophic forgetting in VAEs. Result: Experiments show that the method uses at least an order of magnitude less memory than related methods while reaching SOTA accuracy. Conclusion: The method provides an efficient and scalable solution for incremental learning. Abstract: Continual or incremental learning holds tremendous potential in deep learning with different challenges including catastrophic forgetting. The advent of powerful foundation and generative models has propelled this paradigm even further, making it one of the most viable solutions to train these models. However, one of the persisting issues lies in the increasing volume of data, particularly with replay-based methods. This growth introduces challenges with scalability since continuously expanding data becomes increasingly demanding as the number of tasks grows. In this paper, we attenuate this issue by devising a novel replay-free incremental learning model based on Variational Autoencoders (VAEs). The main contributions of this work include (i) a novel incremental generative modelling approach, built upon a well-designed multi-modal latent space, and (ii) an orthogonality criterion that mitigates catastrophic forgetting of the learned VAEs. The proposed method considers two variants of these VAEs: static and dynamic with no (or at most a controlled) growth in the number of parameters. Extensive experiments show that our method is (at least) an order of magnitude more "memory-frugal" compared to the closely related works while achieving SOTA accuracy scores.

[126] GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control

Anthony Chen,Wenzhao Zheng,Yida Wang,Xueyang Zhang,Kun Zhan,Peng Jia,Kurt Keutzer,Shangbang Zhang

Main category: cs.CV

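TL;DR: GeoDrive explicitly integrates 3D geometry conditions into driving world models, improving spatial understanding and action controllability and significantly outperforming existing models.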

Details

Motivation: Current methods fall short in 3D geometric consistency or occlusion handling, which undermines reliable safety assessment in autonomous driving. Method: Extracts a 3D representation from the input frame, renders it in 2D along a user-specified ego-car trajectory, and enhances the renderings with a dynamic editing module. Result: Experiments show that GeoDrive significantly outperforms existing models in action accuracy and 3D spatial awareness and generalizes to novel trajectories. Conclusion: GeoDrive provides more realistic, adaptable, and reliable scene modeling, offering a new approach to safer autonomous driving. Abstract: Recent advancements in world models have revolutionized dynamic environment simulation, allowing systems to foresee future states and assess potential actions. In autonomous driving, these capabilities help vehicles anticipate the behavior of other road users, perform risk-aware planning, accelerate training in simulation, and adapt to novel scenarios, thereby enhancing safety and reliability. Current approaches either fail to maintain robust 3D geometric consistency or accumulate artifacts during occlusion handling, both critical for reliable safety assessment in autonomous navigation tasks. To address this, we introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models to enhance spatial understanding and action controllability. Specifically, we first extract a 3D representation from the input frame and then obtain its 2D rendering based on the user-specified ego-car trajectory. To enable dynamic modeling, we propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Extensive experiments demonstrate that our method significantly outperforms existing models in both action accuracy and 3D spatial awareness, leading to more realistic, adaptable, and reliable scene modeling for safer autonomous driving. Additionally, our model can generalize to novel trajectories and offers interactive scene editing capabilities, such as object editing and object trajectory control.

[127] RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network

Van-Tin Luu,Yon-Lin Cai,Vu-Hoang Tran,Wei-Chen Chiu,Yi-Ting Chen,Ching-Chun Huang

Main category: cs.CV

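TL;DR: Proposes an online automatic radar-camera geometric calibration method that tackles the sparsity and uncertainty of radar height data. A dual-perspective representation, a selective fusion mechanism, and multi-modal cross-attention significantly improve calibration performance.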

Details

Motivation: The sparsity and uncertainty of radar height data have long made automatic calibration during system operation difficult, calling for a robust solution. Method: Uses a dual-perspective representation (frontal view and bird's-eye view), a selective fusion mechanism, a multi-modal cross-attention mechanism, and a noise-resistant matcher to improve robustness. Result: Experiments on the nuScenes dataset show that the method significantly outperforms existing radar-camera and LiDAR-camera calibration techniques. Conclusion: The method sets a new benchmark for automatic radar-camera calibration, and the code is open source. Abstract: This paper presents the first online automatic geometric calibration method for radar and camera systems. Given the significant data sparsity and measurement uncertainty in radar height data, achieving automatic calibration during system operation has long been a challenge. To address the sparsity issue, we propose a Dual-Perspective representation that gathers features from both frontal and bird's-eye views. The frontal view contains rich but sensitive height information, whereas the bird's-eye view provides robust features against height uncertainty. We thereby propose a novel Selective Fusion Mechanism to identify and fuse reliable features from both perspectives, reducing the effect of height uncertainty. Moreover, for each view, we incorporate a Multi-Modal Cross-Attention Mechanism to explicitly find location correspondences through cross-modal matching. During the training phase, we also design a Noise-Resistant Matcher to provide better supervision and enhance the robustness of the matching mechanism against sparsity and height uncertainty. Our experimental results, tested on the nuScenes dataset, demonstrate that our method significantly outperforms previous radar-camera auto-calibration methods, as well as existing state-of-the-art LiDAR-camera calibration techniques, establishing a new benchmark for future research. The code is available at https://github.com/nycu-acm/RC-AutoCalib.

[128] Zero-Shot 3D Visual Grounding from Vision-Language Models

Rong Li,Shijie Li,Lingdong Kong,Xulei Yang,Junwei Liang

Main category: cs.CV

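TL;DR: SeeGround is a zero-shot 3D visual grounding framework that uses 2D vision-language models to avoid 3D-specific training. A hybrid input format and two core modules improve localization accuracy, significantly outperforming existing zero-shot baselines.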

Details

Motivation: Existing 3D visual grounding methods depend on labeled 3D data and predefined categories, which limits scalability to open-world scenes. Method: Proposes the SeeGround framework, which combines a hybrid input format (query-aligned rendered views and spatially enriched text descriptions) with two core modules (a Perspective Adaptation Module and a Fusion Alignment Module). Result: Outperforms existing zero-shot baselines by 7.7% on ScanRefer and 7.1% on Nr3D, approaching fully supervised methods. Conclusion: SeeGround shows strong generalization in the zero-shot setting and offers an efficient solution for 3D visual grounding. Abstract: 3D Visual Grounding (3DVG) seeks to locate target objects in 3D scenes using natural language descriptions, enabling downstream applications such as augmented reality and robotics. Existing approaches typically rely on labeled 3D data and predefined categories, limiting scalability to open-world settings. We present SeeGround, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs) to bypass the need for 3D-specific training. To bridge the modality gap, we introduce a hybrid input format that pairs query-aligned rendered views with spatially enriched textual descriptions. Our framework incorporates two core components: a Perspective Adaptation Module that dynamically selects optimal viewpoints based on the query, and a Fusion Alignment Module that integrates visual and spatial signals to enhance localization precision. Extensive evaluations on ScanRefer and Nr3D confirm that SeeGround achieves substantial improvements over existing zero-shot baselines -- outperforming them by 7.7% and 7.1%, respectively -- and even rivals fully supervised alternatives, demonstrating strong generalization under challenging conditions.

[129] Distance Transform Guided Mixup for Alzheimer's Detection

Zobia Batool,Huseyin Ozkan,Erchan Aptoula

Main category: cs.CV

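TL;DR: Proposes a single-domain generalization method that extends mixup to generate diverse MRI images, addressing class imbalance and limited dataset diversity in Alzheimer's detection.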

Details

Motivation: Medical datasets suffer from class imbalance, imaging protocol variations, and limited diversity, which restrict model generalization. Method: Computes the distance transform of MRI scans, splits them spatially into layers, and combines layers from different samples to generate augmented images while preserving brain structure. Result: Experiments show improved generalization performance on both the ADNI and AIBL datasets. Conclusion: The proposed method addresses the generalization problem in medical image analysis and yields more reliable models for early Alzheimer's diagnosis. Abstract: Alzheimer's detection efforts aim to develop accurate models for early disease diagnosis. Significant advances have been achieved with convolutional neural networks and vision transformer based approaches. However, medical datasets suffer heavily from class imbalance, variations in imaging protocols, and limited dataset diversity, which hinder model generalization. To overcome these challenges, this study focuses on single-domain generalization by extending the well-known mixup method. The key idea is to compute the distance transform of MRI scans, separate them spatially into multiple layers and then combine layers stemming from distinct samples to produce augmented images. The proposed approach generates diverse data while preserving the brain's structure. Experimental results show improved generalization performance on both the ADNI and AIBL datasets.
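
Illustration (not from the paper): the key idea above, distance transform, spatial layering, then recombination across samples, can be sketched on 2D slices. Everything specific here is an assumption for the example: the images are taken to be same-shaped and spatially aligned, foreground is taken as intensity above zero, layers are equal-width bands of distance, each layer is drawn from one of the two images at random, and a city-block distance computed by repeated erosion stands in for whatever distance transform the authors use.

```python
import numpy as np


def cityblock_distance(mask: np.ndarray) -> np.ndarray:
    """City-block distance from each foreground pixel to the nearest background.

    Computed by repeated 4-neighbour erosion on a 2D mask; pixels outside
    the array are treated as background.
    """
    dist = np.zeros(mask.shape, dtype=int)
    current = mask.astype(bool)
    while current.any():
        dist += current  # every pixel still alive is one step deeper
        padded = np.pad(current, 1, constant_values=False)
        # A pixel survives erosion only if it and its 4 neighbours are foreground.
        current = (padded[1:-1, 1:-1] & padded[:-2, 1:-1] & padded[2:, 1:-1]
                   & padded[1:-1, :-2] & padded[1:-1, 2:])
    return dist


def layer_mixup(img_a, img_b, n_layers=3, rng=None):
    """Swap randomly chosen distance layers of img_a with the same pixels of img_b."""
    rng = np.random.default_rng() if rng is None else rng
    mask = img_a > 0                       # assumed foreground rule
    dist = cityblock_distance(mask)
    if dist.max() == 0:                    # no foreground: nothing to mix
        return img_a.copy()
    # Layer 0 is the outer shell, layer n_layers - 1 the innermost core.
    layer = (dist - 1) * n_layers // dist.max()
    take_from_b = rng.integers(0, 2, size=n_layers).astype(bool)
    out = img_a.copy()
    for k in np.flatnonzero(take_from_b):
        region = mask & (layer == k)
        out[region] = img_b[region]
    return out
```

Because every layer is defined by distance to the boundary of the first image, the outline of the mixed result is unchanged, which is one way to read "preserving the brain's structure".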

[130] Can NeRFs See without Cameras?

Chaitanya Amballa,Sattwik Basu,Yu-Lin Wei,Zhijian Yang,Mehmet Ergezer,Romit Roy Choudhury

Main category: cs.CV

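TL;DR: NeRFs are redesigned to learn from multipath signals such as WiFi and thereby infer the environment, for example learning an indoor floorplan from sparse WiFi measurements.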

Details

Motivation: Explores whether the environment can be inferred from multipath signals such as RF or audio, mirroring the success of NeRFs in vision. Method: Redesigns NeRFs so that they learn environmental information from multipath signals, applied to sparse WiFi measurements. Result: Learns implicit indoor floorplans and supports signal prediction and basic ray tracing. Conclusion: Redesigned NeRFs can use multipath signals to infer the environment, offering a new approach to indoor scene analysis. Abstract: Neural Radiance Fields (NeRFs) have been remarkably successful at synthesizing novel views of 3D scenes by optimizing a volumetric scene function. This scene function models how optical rays bring color information from a 3D object to the camera pixels. Radio frequency (RF) or audio signals can also be viewed as a vehicle for delivering information about the environment to a sensor. However, unlike camera pixels, an RF/audio sensor receives a mixture of signals that contain many environmental reflections (also called "multipath"). Is it still possible to infer the environment using such multipath signals? We show that with redesign, NeRFs can be taught to learn from multipath signals, and thereby "see" the environment. As a grounding application, we aim to infer the indoor floorplan of a home from sparse WiFi measurements made at multiple locations inside the home. Although this is a difficult inverse problem, our implicitly learnt floorplans look promising and enable forward applications, such as indoor signal prediction and basic ray tracing.

[131] On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation

Liyao Tang,Zhe Chen,Dacheng Tao

Main category: cs.CV

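TL;DR: Proposes GEM, a geometry-aware parameter-efficient fine-tuning module for 3D point cloud models that significantly reduces computational and storage costs.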

Details

Motivation: Existing parameter-efficient fine-tuning methods underperform on 3D point cloud tasks because they ignore local spatial structure and global geometric context. Method: GEM combines fine-grained local positional encodings with a lightweight latent attention mechanism to address the spatial and geometric distribution mismatch. Result: GEM matches or exceeds full fine-tuning while updating only 1.6% of the parameters, with significantly lower training time and memory requirements. Conclusion: GEM sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Abstract: The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks typically demands full fine-tuning, incurring high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques, successful in natural language processing and 2D vision tasks, underperform when naively applied to 3D point cloud models due to significant geometric and spatial distribution shifts. Existing PEFT methods commonly treat points as orderless tokens, neglecting important local spatial structures and global geometric contexts in 3D modeling. To bridge this gap, we introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby effectively addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to or sometimes even exceeding full fine-tuning, while only updating 1.6% of the model's parameters, fewer than other PEFT methods. With significantly reduced training time and memory requirements, our approach thus sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Code will be released.

[132] NFR: Neural Feature-Guided Non-Rigid Shape Registration

Puhua Jiang,Zhangquan Chen,Mingze Sun,Ruqi Huang

Main category: cs.CV

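TL;DR: Proposes a learning-based 3D shape registration framework that handles non-rigid deformation and partial shape matching without correspondence annotations.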

Details

Motivation: Traditional methods struggle with significant non-rigid deformation and partial shape matching; the work also aims to reduce reliance on annotated data. Method: Combines neural features extracted by deep learning with a geometric registration pipeline, dynamically updating and filtering correspondences. Result: Achieves state-of-the-art performance on several benchmarks and produces high-quality correspondences for unseen, challenging shape pairs. Conclusion: The framework performs well on non-rigid registration and is robust and generalizable. Abstract: In this paper, we propose a novel learning-based framework for 3D shape registration, which overcomes the challenges of significant non-rigid deformation and partiality among input shapes, and, remarkably, requires no correspondence annotation during training. Our key insight is to incorporate neural features learned by deep learning-based shape matching networks into an iterative, geometric shape registration pipeline. The advantage of our approach is two-fold. On one hand, neural features provide more accurate and semantically meaningful correspondence estimation than spatial features (e.g., coordinates), which is critical in the presence of large non-rigid deformations. On the other hand, the correspondences are dynamically updated according to the intermediate registrations and filtered by a consistency prior, which markedly robustifies the overall pipeline. Empirical results show that, with as few as dozens of training shapes of limited variability, our pipeline not only achieves state-of-the-art results on several benchmarks of non-rigid point cloud matching and partial shape matching across varying settings, but also delivers high-quality correspondences between unseen challenging shape pairs that undergo both significant extrinsic and intrinsic deformations, in which case neither traditional registration methods nor intrinsic methods work.

[133] Fostering Video Reasoning via Next-Event Prediction

Haonan Wang,Hongfu Liu,Xiangyan Liu,Chao Du,Kenji Kawaguchi,Ye Wang,Tianyu Pang

Main category: cs.CV

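TL;DR: Proposes next-event prediction (NEP), a self-supervised learning task that aims to improve the temporal reasoning of multimodal large language models (MLLMs) over video inputs.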

Details

Motivation: Existing tasks such as video question answering rely on annotations from humans or stronger models, while video captioning entangles temporal reasoning with spatial information. NEP fills this gap by using future video segments as a self-supervised signal. Method: Splits each video into past and future frames; the MLLM takes the past frames as input and predicts a summary of events in the future frames, which encourages temporal reasoning. The V1-33K dataset is built for this purpose, and several video instruction-tuning strategies are explored. Result: Experiments validate NEP as a scalable and effective training paradigm for fostering temporal reasoning in MLLMs. Conclusion: NEP provides a self-supervised learning framework for temporal reasoning in MLLMs, with FutureBench introduced for evaluation. Abstract: Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
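
Illustration (not from the paper): the past/future split described above can be sketched as a small data-preparation helper. The abstract does not say where the split point falls or how the future-event summary is produced, so `past_ratio`, the `summarize_future` callable, and the dictionary keys below are all hypothetical names chosen for this example.

```python
from typing import Any, Callable, Dict, Sequence


def make_nep_example(
    frames: Sequence[Any],
    summarize_future: Callable[[Sequence[Any]], str],
    past_ratio: float = 0.5,
) -> Dict[str, Any]:
    """Split a clip into past and future frames and build one training pair.

    The model input is the past frames; the target text is a summary of the
    future frames, produced by a caller-supplied (hypothetical) summarizer.
    """
    if not 0.0 < past_ratio < 1.0:
        raise ValueError("past_ratio must be strictly between 0 and 1")
    if len(frames) < 2:
        raise ValueError("need at least two frames to form a past and a future")
    # Clamp the cut so that both segments are non-empty.
    cut = min(max(int(len(frames) * past_ratio), 1), len(frames) - 1)
    past, future = list(frames[:cut]), list(frames[cut:])
    return {"input_frames": past, "target_text": summarize_future(future)}
```

The point of the construction is that the target comes from the video itself, so no human annotation is needed once a summarizer for the future segment is available.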

[134] Universal Domain Adaptation for Semantic Segmentation

Seun-An Choe,Keon-Hee Park,Jinwoo Choi,Gyeong-Moon Park

Main category: cs.CV

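TL;DR: Proposes UniMAP, a framework for universal unsupervised domain adaptation in semantic segmentation that improves performance through domain-specific prototype-based distinction and target-based image matching.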

Details

Motivation: Traditional unsupervised domain adaptation methods assume that the category settings of the source and target domains are known, but real scenarios may contain private classes, which degrades performance. Method: Proposes the UniMAP framework with two key components: Domain-Specific Prototype-based Distinction (DSPD) and Target-based Image Matching (TIM). Result: Experiments show that UniMAP significantly outperforms baselines. Conclusion: UniMAP provides an effective solution for universal domain adaptation in semantic segmentation without prior knowledge of category settings. Abstract: Unsupervised domain adaptation for semantic segmentation (UDA-SS) aims to transfer knowledge from labeled source data to unlabeled target data. However, traditional UDA-SS methods assume that category settings between source and target domains are known, which is unrealistic in real-world scenarios. This leads to performance degradation if private classes exist. To address this limitation, we propose Universal Domain Adaptation for Semantic Segmentation (UniDA-SS), achieving robust adaptation even without prior knowledge of category settings. We define the problem in the UniDA-SS scenario as low confidence scores of common classes in the target domain, which leads to confusion with private classes. To solve this problem, we propose UniMAP: UniDA-SS with Image Matching and Prototype-based Distinction, a novel framework composed of two key components. First, Domain-Specific Prototype-based Distinction (DSPD) divides each class into two domain-specific prototypes, enabling finer separation of domain-specific features and enhancing the identification of common classes across domains. Second, Target-based Image Matching (TIM) selects a source image containing the most common-class pixels based on the target pseudo-label and pairs it in a batch to promote effective learning of common classes. We also introduce a new UniDA-SS benchmark and demonstrate through various experiments that UniMAP significantly outperforms baselines. The code is available at https://github.com/KU-VGI/UniMAP.
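
Illustration (not from the paper): the abstract describes Target-based Image Matching only in one sentence, so the sketch below is one possible reading of it, not the authors' procedure. It assumes that the classes present in the target pseudo-label serve as the candidate common classes, and that the source image whose label map contains the most pixels of those classes is selected. The function name, the `ignore_index` value of 255, and the tie-breaking by first index are assumptions.

```python
from typing import Sequence

import numpy as np


def select_source_index(
    target_pseudo_label: np.ndarray,
    source_labels: Sequence[np.ndarray],
    ignore_index: int = 255,
) -> int:
    """Return the index of the source label map sharing the most pixels
    with the classes that appear in the target pseudo-label."""
    target_classes = np.unique(target_pseudo_label)
    target_classes = target_classes[target_classes != ignore_index]
    # Count, per source image, the pixels whose class also occurs in the target.
    counts = [int(np.isin(lbl, target_classes).sum()) for lbl in source_labels]
    return int(np.argmax(counts))  # ties resolve to the first maximum
```

In a training loop the selected source image would then be paired with the target image in the same batch, as the abstract states.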

[135] SHTOcc: Effective 3D Occupancy Prediction with Sparse Head and Tail Voxels

Qiucheng Yu,Yuan Xie,Xin Tan

Main category: cs.CV

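TL;DR: SHTOcc proposes sparse head-tail voxel construction to address the long-tail and geometric distribution problems in 3D occupancy prediction, significantly improving performance.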

Details

Motivation: Existing methods have not explored the essential distribution patterns of voxels, which leads to unsatisfactory results. This paper studies the inter-class and geometric distributions of voxels to address the long-tail problem and the resulting poor performance. Method: Proposes SHTOcc, which balances key voxels through sparse head-tail voxel construction and uses decoupled learning to reduce bias toward dominant classes. Result: Experiments show significant improvements on multiple baselines: GPU memory usage drops by 42.2%, inference speed rises by 58.6%, and accuracy improves by about 7%. Conclusion: The results verify the effectiveness and efficiency of SHTOcc for 3D occupancy prediction. Abstract: 3D occupancy prediction has attracted much attention in the field of autonomous driving due to its powerful geometric perception and object recognition capabilities. However, existing methods have not explored the most essential distribution patterns of voxels, resulting in unsatisfactory results. This paper first explores the inter-class distribution and geometric distribution of voxels, thereby solving the long-tail problem caused by the inter-class distribution and the poor performance caused by the geometric distribution. Specifically, this paper proposes SHTOcc (Sparse Head-Tail Occupancy), which uses sparse head-tail voxel construction to accurately identify and balance key voxels in the head and tail classes, while using decoupled learning to reduce the model's bias towards the dominant (head) category and enhance the focus on the tail class. Experiments show that significant improvements have been made on multiple baselines: SHTOcc reduces GPU memory usage by 42.2%, increases inference speed by 58.6%, and improves accuracy by about 7%, verifying its effectiveness and efficiency. The code is available at https://github.com/ge95net/SHTOcc

[136] Single Domain Generalization for Alzheimer's Detection from 3D MRIs with Pseudo-Morphological Augmentations and Contrastive Learning

Zobia Batool,Huseyin Ozkan,Erchan Aptoula

Main category: cs.CV

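TL;DR: Proposes combining learnable pseudo-morphological modules with supervised contrastive learning to improve the generalization of Alzheimer's detection from MRI, addressing class imbalance and protocol variations.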

Details

Motivation: Although deep learning has advanced Alzheimer's detection from MRI, class imbalance, protocol variations, and limited dataset diversity restrict model generalization. Method: Uses learnable pseudo-morphological modules to generate shape-aware, anatomically meaningful augmentations, combined with a supervised contrastive learning module that extracts robust class-specific representations. Result: Experiments on three datasets show better performance and generalization under class imbalance and imaging protocol variations. Conclusion: The method improves the generalization of Alzheimer's detection, especially in challenging settings. Abstract: Although Alzheimer's disease detection via MRIs has advanced significantly thanks to contemporary deep learning models, challenges such as class imbalance, protocol variations, and limited dataset diversity often hinder their generalization capacity. To address this issue, this article focuses on the single domain generalization setting, where given the data of one domain, a model is designed and developed with maximal performance w.r.t. an unseen domain of distinct distribution. Since brain morphology is known to play a crucial role in Alzheimer's diagnosis, we propose the use of learnable pseudo-morphological modules aimed at producing shape-aware, anatomically meaningful class-specific augmentations in combination with a supervised contrastive learning module to extract robust class-specific representations. Experiments conducted across three datasets show improved performance and generalization capacity, especially under class imbalance and imaging protocol variations. The source code will be made available upon acceptance at https://github.com/zobia111/SDG-Alzheimer.

[137] ProCrop: Learning Aesthetic Image Cropping from Professional Compositions

Ke Zhang,Tianyu Ding,Jiachen Jiang,Tianyi Chen,Ilya Zharkov,Vishal M. Patel,Luming Liang

Main category: cs.CV

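TL;DR: ProCrop is a retrieval-based image cropping method that uses professional photography to guide cropping decisions, significantly improving performance, and comes with a large-scale weakly-annotated dataset.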

Details

Motivation: Existing rule-based and data-driven cropping methods lack diversity or require annotated data; ProCrop aims to solve this by learning from professional compositions. Method: ProCrop fuses features from professional photographs with those of the query image to learn professional composition. A dataset of 242K weakly-annotated images is generated by out-painting professional images, providing diverse, high-quality crop proposals. Result: ProCrop significantly outperforms existing methods in both supervised and weakly-supervised settings, and matches fully supervised methods when trained on the new dataset. Conclusion: ProCrop and the dataset will be released publicly to advance research in image aesthetics and composition analysis. Abstract: Image cropping is crucial for enhancing the visual appeal and narrative impact of photographs, yet existing rule-based and data-driven approaches often lack diversity or require annotated training data. We introduce ProCrop, a retrieval-based method that leverages professional photography to guide cropping decisions. By fusing features from professional photographs with those of the query image, ProCrop learns from professional compositions, significantly boosting performance. Additionally, we present a large-scale dataset of 242K weakly-annotated images, generated by out-painting professional images and iteratively refining diverse crop proposals. This composition-aware dataset generation offers diverse high-quality crop proposals guided by aesthetic principles and becomes the largest publicly available dataset for image cropping. Extensive experiments show that ProCrop significantly outperforms existing methods in both supervised and weakly-supervised settings. Notably, when trained on the new dataset, our ProCrop surpasses previous weakly-supervised methods and even matches fully supervised approaches. Both the code and dataset will be made publicly available to advance research in image aesthetics and composition analysis.

[138] The Meeseeks Mesh: Spatially Consistent 3D Adversarial Objects for BEV Detector

Aixuan Li,Mochu Xiang,Jing Zhang,Yuchao Dai

Main category: cs.CV

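TL;DR: Studies the vulnerability of 3D object detection models to 3D adversarial attacks, proposes a method for generating non-invasive 3D adversarial objects, and validates it experimentally.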

Details

Motivation: 3D object detection is critical in autonomous driving systems, but its vulnerability to adversarial attacks is underexplored. This paper aims to evaluate model robustness and generate adversarial objects for real-world scenes. Method: Uses differentiable rendering to model the spatial relationship between adversarial objects and the target vehicle, adds an occlusion-aware module for visual consistency, and designs a BEV spatial feature-guided optimization strategy. Result: Experiments show that the method suppresses predictions from state-of-the-art 3D detectors and that the adversarial objects remain effective at different positions and distances. Conclusion: The method is a useful tool for evaluating the robustness of 3D detection models, and the generated adversarial objects generalize well. Abstract: 3D object detection is a critical component in autonomous driving systems. It allows real-time recognition and detection of vehicles, pedestrians and obstacles under varying environmental conditions. Among existing methods, 3D object detection in the Bird's Eye View (BEV) has emerged as the mainstream framework. To guarantee a safe, robust and trustworthy 3D object detection, 3D adversarial attacks are investigated, where attacks are placed in 3D environments to evaluate the model performance, e.g., putting a film on a car or clothing a pedestrian. The vulnerability of 3D object detection models to 3D adversarial attacks serves as an important indicator to evaluate the robustness of the model against perturbations. To investigate this vulnerability, we generate non-invasive 3D adversarial objects tailored for real-world attack scenarios. Our method verifies the existence of universal adversarial objects that are spatially consistent across time and camera views. Specifically, we employ differentiable rendering techniques to accurately model the spatial relationship between adversarial objects and the target vehicle. Furthermore, we introduce an occlusion-aware module to enhance visual consistency and realism under different viewpoints. To maintain attack effectiveness across multiple frames, we design a BEV spatial feature-guided optimization strategy. Experimental results demonstrate that our approach can reliably suppress vehicle predictions from state-of-the-art 3D object detectors, serving as an important tool to test robustness of 3D object detection models before deployment. Moreover, the generated adversarial objects exhibit strong generalization capabilities, retaining their effectiveness at various positions and distances in the scene.

[139] PathFL: Multi-Alignment Federated Learning for Pathology Image Segmentation

Yuan Zhang,Feng Chen,Yaolei Qi,Guanyu Yang,Huazhu Fu

Main category: cs.CV

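TL;DR: PathFL is a multi-alignment federated learning framework that addresses heterogeneity in pathology image segmentation through three-level alignment of image, feature, and model aggregation.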

Details

Motivation: In multi-center settings, pathology image segmentation faces representation bias and poor generalization caused by heterogeneity in imaging modalities, organs, and scanning equipment. Method: PathFL uses a three-level alignment strategy: an image-level collaborative style enhancement module, a feature-level adaptive feature alignment module, and a model-level stratified similarity aggregation strategy. Result: Evaluations on four sets of heterogeneous pathology image datasets show that PathFL achieves better performance and robustness than other methods. Conclusion: With multi-level alignment, PathFL addresses heterogeneity in pathology image segmentation and improves model generalization. Abstract: Pathology image segmentation across multiple centers encounters significant challenges due to diverse sources of heterogeneity including imaging modalities, organs, and scanning equipment, whose variability brings representation bias and impedes the development of generalizable segmentation models. In this paper, we propose PathFL, a novel multi-alignment Federated Learning framework for pathology image segmentation that addresses these challenges through three-level alignment strategies of image, feature, and model aggregation. Firstly, at the image level, a collaborative style enhancement module aligns and diversifies local data by facilitating style information exchange across clients. Secondly, at the feature level, an adaptive feature alignment module ensures implicit alignment in the representation space by infusing local features with global insights, promoting consistency across heterogeneous client feature learning. Finally, at the model aggregation level, a stratified similarity aggregation strategy hierarchically aligns and aggregates models on the server, using layer-specific similarity to account for client discrepancies and enhance global generalization. Comprehensive evaluations on four sets of heterogeneous pathology image datasets, encompassing cross-source, cross-modality, cross-organ, and cross-scanner variations, validate the effectiveness of our PathFL in achieving better performance and robustness against data heterogeneity.

[140] PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models

Junwen Chen,Heyang Jiang,Yanbin Wang,Keming Wu,Ji Li,Chao Zhang,Keiji Yanai,Dong Chen,Yuhui Yuan

Main category: cs.CV

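TL;DR: Introduces the PrismLayersPro dataset and the ART+ model for generating high-quality multi-layer transparent images, addressing the lack of data and improving generation quality through a synthesis pipeline and model fine-tuning.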

Details

Motivation: 解决多层透明图像生成领域缺乏高质量数据集的问题，以支持更灵活的创意编辑。 Method: 发布PrismLayersPro数据集，开发训练无关的合成管道，提出LayerFLUX和MultiLayerFLUX技术，并优化ART模型为ART+。 Result: ART+在用户研究中表现优于原ART模型，视觉质量接近FLUX.1-[dev]模型。 Conclusion: 该工作为多层透明图像生成任务奠定了数据集基础，推动了可编辑、高质量分层图像的研究与应用。 Abstract: Generating high-quality, multi-layer transparent images from text prompts can unlock a new level of creative control, allowing users to edit each layer as effortlessly as editing text outputs from LLMs. However, the development of multi-layer generative models lags behind that of conventional text-to-image models due to the absence of a large, high-quality corpus of multi-layer transparent data. In this paper, we address this fundamental challenge by: (i) releasing the first open, ultra-high-fidelity PrismLayers (PrismLayersPro) dataset of 200K (20K) multilayer transparent images with accurate alpha mattes, (ii) introducing a trainingfree synthesis pipeline that generates such data on demand using off-the-shelf diffusion models, and (iii) delivering a strong, open-source multi-layer generation model, ART+, which matches the aesthetics of modern text-to-image generation models. The key technical contributions include: LayerFLUX, which excels at generating high-quality single transparent layers with accurate alpha mattes, and MultiLayerFLUX, which composes multiple LayerFLUX outputs into complete images, guided by human-annotated semantic layout. To ensure higher quality, we apply a rigorous filtering stage to remove artifacts and semantic mismatches, followed by human selection. Fine-tuning the state-of-the-art ART model on our synthetic PrismLayersPro yields ART+, which outperforms the original ART in 60% of head-to-head user study comparisons and even matches the visual quality of images generated by the FLUX.1-[dev] model. 
We anticipate that our work will establish a solid dataset foundation for the multi-layer transparent image generation task, enabling research and applications that require precise, editable, and visually compelling layered imagery.
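
The abstract describes composing several transparent layers with alpha mattes into one image. As a minimal sketch of what flattening such layers involves, the snippet below applies the standard alpha "over" operator to a single pixel. This is the textbook compositing rule, not necessarily how MultiLayerFLUX composes its outputs; the names `over` and `flatten` are hypothetical.

```python
from typing import List, Tuple

# One pixel: (r, g, b, a), all channels in [0, 1], straight (non-premultiplied) alpha.
RGBA = Tuple[float, float, float, float]


def over(top: RGBA, bottom: RGBA) -> RGBA:
    """Composite `top` over `bottom` with the standard 'over' operator."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels fully transparent: colour is undefined, return clear black.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten(layers: List[RGBA]) -> RGBA:
    """Flatten pixels listed back-to-front into a single pixel."""
    result: RGBA = (0.0, 0.0, 0.0, 0.0)
    for layer in layers:
        result = over(layer, result)
    return result


# Half-transparent red on top of opaque blue.
pixel = flatten([(0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 0.0, 0.5)])
```

Because each layer keeps its own alpha, any single layer can be edited or removed and the image re-flattened, which is the editing benefit the abstract points to.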

[141] Thinking with Generated Images

Ethan Chern,Zhulin Hu,Steffi Chern,Siqi Kou,Jiadi Su,Yan Ma,Zhijie Deng,Pengfei Liu

Main category: cs.CV

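TL;DR: Proposes a new paradigm in which multimodal models generate intermediate visual thinking steps, letting them reason natively across text and vision and substantially improving visual reasoning.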

Details

Motivation: Visual reasoning in current multimodal models is limited to processing fixed images or to text-only chain-of-thought, with no ability to generate and iterate on visual hypotheses. Method: Two mechanisms: (1) decompose complex visual tasks into manageable subgoals that are generated step by step; (2) generate an initial visual hypothesis, critique it through textual reasoning, and refine the output. Result: Up to a 50% relative improvement over baselines (from 38% to 57%) on vision generation benchmarks, especially in complex multi-object scenes. Conclusion: The approach gives AI models human-like visual imagination and iterative refinement, with applications in biochemistry, architectural design, forensic analysis, and sports strategy. Abstract: We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi-object scenarios.
From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
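
The "50% (from 38% to 57%)" figure is a relative gain, not a gain in percentage points. A short worked check, using only the two numbers quoted in the abstract (the function name is hypothetical):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction of the baseline."""
    return (improved - baseline) / baseline


baseline, improved = 38.0, 57.0
absolute_gain = improved - baseline                  # gain in percentage points
relative_gain = relative_improvement(baseline, improved)

print(f"absolute gain: {absolute_gain:.0f} points, relative gain: {relative_gain:.0%}")
# prints "absolute gain: 19 points, relative gain: 50%"
```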

[142] RiverMamba: A State Space Model for Global River Discharge and Flood Forecasting

Mohamad Hakam Shams Eddin,Yikui Zhang,Stefan Kollet,Juergen Gall

Main category: cs.CV

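TL;DR: RiverMamba is a new deep learning model for global river discharge and flood forecasting that improves forecast skill through spatio-temporal modeling and efficient Mamba blocks.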

Details

Motivation: Existing deep learning methods for flood forecasting are confined to local-scale applications and do not exploit the spatial connections between bodies of water, so new methods are needed to improve forecasts. Method: RiverMamba is pretrained on long-term reanalysis data, integrates ECMWF HRES meteorological forecasts, and models spatio-temporal relations with Mamba blocks. Result: RiverMamba outperforms operational AI- and physics-based models at forecasting river discharge and extreme floods. Conclusion: RiverMamba provides a reliable global flood forecasting tool for scientific and operational applications. Abstract: Recent deep learning approaches for river discharge forecasting have improved the accuracy and efficiency in flood forecasting, enabling more reliable early warning systems for risk management. Nevertheless, existing deep learning approaches in hydrology remain largely confined to local-scale applications and do not leverage the inherent spatial connections of bodies of water. Thus, there is a strong need for new deep learning methodologies that are capable of modeling spatio-temporal relations to improve river discharge and flood forecasting for scientific and operational applications. To address this, we present RiverMamba, a novel deep learning model that is pretrained with long-term reanalysis data and that can forecast global river discharge and floods on a $0.05^\circ$ grid up to 7 days lead time, which is of high relevance in early warning. To achieve this, RiverMamba leverages efficient Mamba blocks that enable the model to capture global-scale channel network routing and enhance its forecast capability for longer lead times. The forecast blocks integrate ECMWF HRES meteorological forecasts, while accounting for their inaccuracies through spatio-temporal modeling. Our analysis demonstrates that RiverMamba delivers reliable predictions of river discharge, including extreme floods across return periods and lead times, surpassing both operational AI- and physics-based models.

[143] Scaling-up Perceptual Video Quality Assessment

Ziheng Jia,Zicheng Zhang,Zeyu Zhang,Yingji Liang,Xiaorong Zhu,Chunyi Li,Jinliang Han,Haoning Wu,Bin Wang,Haoran Zhang,Guanyu Zhu,Qiyong Zhao,Xiaohong Liu,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

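TL;DR: The paper proposes the OmniVQA framework, builds large-scale VQA multi-modal instruction databases (MIDBs), and introduces a complementary training strategy, achieving state-of-the-art performance on video quality assessment.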

Details

Motivation: Address the under-exploited potential of data scaling in video quality assessment (VQA), which is held back by scarce labeled data and small datasets. Method: Propose the OmniVQA framework, build the OmniVQA-Chat-400K and OmniVQA-MOS-20K datasets, and design a complementary training strategy and the OmniVQA-FG-Benchmark for evaluation. Result: The models achieve state-of-the-art performance on quality understanding and rating tasks. Conclusion: The OmniVQA framework and datasets improve performance on VQA tasks and help close the gap in exploiting data scaling. Abstract: The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of the scaling law remains largely unexplored due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose OmniVQA, a framework designed to efficiently build high-quality, human-in-the-loop VQA multi-modal instruction databases (MIDBs). We then scale up to create OmniVQA-Chat-400K, currently the largest MIDB in the VQA field. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Additionally, we have built the OmniVQA-MOS-20K dataset to enhance the model's quantitative quality rating capabilities. We then introduce a complementary training strategy that effectively leverages the knowledge from datasets for quality understanding and quality rating tasks. Furthermore, we propose the OmniVQA-FG (fine-grain)-Benchmark to evaluate the fine-grained performance of the models. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.

[144] Deep Learning-Based BMD Estimation from Radiographs with Conformal Uncertainty Quantification

Long Hui,Wai Lok Yeung

Main category: cs.CV

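TL;DR: Uses deep learning to estimate bone mineral density (BMD) from knee radiographs, with uncertainty quantification to improve clinical trustworthiness.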

Details

Motivation: Limited availability of DXA equipment hinders osteoporosis screening; widely available knee X-rays could enable opportunistic BMD estimation. Method: Train an EfficientNet model on the OAI dataset to predict BMD; compare two Test-Time Augmentation (TTA) methods; apply Split Conformal Prediction to provide statistically rigorous prediction intervals. Result: Pearson correlation of 0.68 (traditional TTA); multi-sample TTA yields slightly tighter intervals while maintaining coverage. Conclusion: Despite the anatomical mismatch between knee X-rays and standard DXA, the method lays a foundation for trustworthy AI-assisted BMD screening from routine radiographs. Abstract: Limited DXA access hinders osteoporosis screening. This proof-of-concept study proposes using widely available knee X-rays for opportunistic Bone Mineral Density (BMD) estimation via deep learning, emphasizing robust uncertainty quantification essential for clinical use. An EfficientNet model was trained on the OAI dataset to predict BMD from bilateral knee radiographs. Two Test-Time Augmentation (TTA) methods were compared: traditional averaging and a multi-sample approach. Crucially, Split Conformal Prediction was implemented to provide statistically rigorous, patient-specific prediction intervals with guaranteed coverage. Results showed a Pearson correlation of 0.68 (traditional TTA). While traditional TTA yielded better point predictions, the multi-sample approach produced slightly tighter confidence intervals (90%, 95%, 99%) while maintaining coverage. The framework appropriately expressed higher uncertainty for challenging cases. Although anatomical mismatch between knee X-rays and standard DXA limits immediate clinical use, this method establishes a foundation for trustworthy AI-assisted BMD screening using routine radiographs, potentially improving early osteoporosis detection.
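
To make the Split Conformal Prediction step concrete, here is a minimal sketch of the textbook procedure with absolute-residual scores. It assumes a held-out calibration set of (prediction, true BMD) pairs; the function names are hypothetical and the nonconformity score actually used in the paper is not stated in the abstract. Note that this plain variant gives every patient the same interval width; patient-specific widths would need a normalized score, which is not shown.

```python
import math
from typing import List, Tuple


def conformal_quantile(cal_preds: List[float], cal_targets: List[float],
                       alpha: float) -> float:
    """Half-width q such that [pred - q, pred + q] covers the truth with
    probability >= 1 - alpha, assuming calibration and test data are exchangeable."""
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this coverage level.
        return math.inf
    return scores[k - 1]


def predict_interval(pred: float, q: float) -> Tuple[float, float]:
    """Symmetric prediction interval around a point prediction."""
    return (pred - q, pred + q)


# Toy calibration set (illustrative numbers only, not from the paper).
q = conformal_quantile([0.0, 0.0, 0.0, 0.0], [0.4, 0.1, 0.3, 0.2], alpha=0.5)
low, high = predict_interval(1.0, q)
```

The model is never retrained here: the interval comes entirely from how wrong the fixed model was on the calibration set, which is why the coverage guarantee holds regardless of the network architecture.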

[145] RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Yuchi Wang,Yishuo Cai,Shuhuai Ren,Sihan Yang,Linli Yao,Yuanxin Liu,Yuanxing Zhang,Pengfei Wan,Xu Sun

Main category: cs.CV

TL;DR: RICO is a new framework that refines image captions through visual reconstruction, substantially improving caption accuracy and completeness.

Details

Motivation: Existing recaptioning methods are inaccurate because of hallucinations and incomplete because of missing details; RICO targets both problems. Method: Use a text-to-image model to reconstruct the caption into a reference image, have an MLLM identify discrepancies between the original and reconstructed images, and refine the caption iteratively. Result: Roughly a 10% improvement over most baselines on CapsBench and CompreCap. Conclusion: RICO improves caption quality, and RICO-Flash reduces the computational cost. Abstract: Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code released at https://github.com/wangyuchi369/RICO.
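
The reconstruct, compare, refine loop described in the abstract can be sketched as follows. The three callables stand in for the text-to-image model and the MLLM; their names and signatures are hypothetical, and the early exit when no discrepancy is found is an assumption rather than something the abstract states. The toy stand-ins treat an "image" as a set of words purely so the loop can run end to end.

```python
from typing import Callable, List, Set


def refine_caption(
    image: Set[str],
    caption: str,
    text_to_image: Callable[[str], Set[str]],
    find_discrepancies: Callable[[Set[str], Set[str], str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_iters: int = 3,
) -> str:
    """Iteratively refine `caption` by comparing `image` with its reconstruction."""
    for _ in range(max_iters):
        reconstruction = text_to_image(caption)           # caption -> reference image
        issues = find_discrepancies(image, reconstruction, caption)
        if not issues:                                    # assumed stopping rule
            break
        caption = revise(caption, issues)                 # fold fixes into the caption
    return caption


# Toy stand-ins (hypothetical): an "image" is just the set of things it contains.
def toy_t2i(caption: str) -> Set[str]:
    return set(caption.split())


def toy_diff(original: Set[str], recon: Set[str], caption: str) -> List[str]:
    return sorted(original - recon)


def toy_revise(caption: str, issues: List[str]) -> str:
    return caption + " " + issues[0]


refined = refine_caption({"a", "red", "car"}, "a car", toy_t2i, toy_diff, toy_revise)
```

Each iteration costs one image generation plus one MLLM call, which is the overhead RICO-Flash is meant to avoid by learning to produce the refined caption directly.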

[146] MultiFormer: A Multi-Person Pose Estimation System Based on CSI and Attention Mechanism

Yanyi Qu,Haoyang Ma,Wenhui Xiong

Main category: cs.CV

TL;DR: MultiFormer is a CSI-based wireless sensing system that achieves accurate multi-person pose estimation using a Transformer and a feature fusion network.

Details

Motivation: CSI-based human pose estimation is promising for non-intrusive monitoring but faces challenges in multi-person pose recognition and CSI feature learning. Method: A Transformer-based time-frequency dual-token feature extractor and a Multi-Stage Feature Fusion Network (MSFN) model the CSI features and pose probability heatmaps. Result: Experiments on public and self-collected datasets show that MultiFormer is more accurate than existing methods, especially for high-mobility keypoints. Conclusion: Through its feature extraction and fusion design, MultiFormer substantially improves CSI-based pose estimation. Abstract: Human pose estimation based on Channel State Information (CSI) has emerged as a promising approach for non-intrusive and precise human activity monitoring, yet faces challenges including accurate multi-person pose recognition and effective CSI feature learning. This paper presents MultiFormer, a wireless sensing system that accurately estimates human pose through CSI. The proposed system adopts a Transformer-based time-frequency dual-token feature extractor with multi-head self-attention. This feature extractor is able to model inter-subcarrier correlations and temporal dependencies of the CSI. The extracted CSI features and the pose probability heatmaps are then fused by a Multi-Stage Feature Fusion Network (MSFN) to enforce the anatomical constraints. Extensive experiments conducted on the public MM-Fi dataset and our self-collected dataset show that MultiFormer achieves higher accuracy than state-of-the-art approaches, especially for high-mobility keypoints (wrists, elbows) that are particularly difficult for previous methods to accurately estimate.

Jaehyun Choi,Jiwan Hur,Gyojin Han,Jaemyung Yu,Junmo Kim

Main category: cs.CV

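TL;DR: PRISM is a new video dataset condensation method that progressively refines and inserts frames for sparse motion, preserving the link between spatial content and temporal dynamics and outperforming existing methods.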

Details

Motivation: Address the computational challenges of large-scale video data processing while preserving the interdependence between spatial content and temporal dynamics in video. Method: PRISM condenses video data by progressively refining and inserting frames, taking the relation between per-frame gradients into account. Result: It outperforms existing methods on standard video action recognition benchmarks with lower storage requirements. Conclusion: PRISM offers an efficient video dataset condensation solution for resource-constrained environments. Abstract: Video dataset condensation has emerged as a critical technique for addressing the computational challenges associated with large-scale video data processing in deep learning applications. While significant progress has been made in image dataset condensation, the video domain presents unique challenges due to the complex interplay between spatial content and temporal dynamics. This paper introduces PRISM, Progressive Refinement and Insertion for Sparse Motion, for video dataset condensation, a novel approach that fundamentally reconsiders how video data should be condensed. Unlike the previous method that separates static content from dynamic motion, our method preserves the essential interdependence between these elements. Our approach progressively refines and inserts frames to fully accommodate the motion in an action, achieving better performance with less storage by considering the relation of gradients for each frame. Extensive experiments across standard video action recognition benchmarks demonstrate that PRISM outperforms existing disentangled approaches while maintaining compact representations suitable for resource-constrained environments.

[148] Sherlock: Self-Correcting Reasoning in Vision-Language Models

Yi Ding,Ruqi Zhang

Main category: cs.CV

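TL;DR: The paper proposes Sherlock, a self-correction framework that uses a trajectory-level self-correction objective and preference data built from visual perturbations to markedly improve the reasoning of vision-language models, achieving self-improvement with only a small amount of annotated data.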

Details

Motivation: Existing vision-language models are sensitive to reasoning errors, depend on large amounts of annotated data or verifiers, and generalize poorly; the authors explore self-correction to address this. Method: Introduce the Sherlock framework, comprising a trajectory-level self-correction objective, preference data construction based on visual perturbation, and a dynamic β for preference tuning. The model acquires self-correction ability from only 20k randomly sampled annotated examples. Result: Across eight benchmarks, Sherlock reaches an average accuracy of 64.1 (direct generation) and 65.4 (after self-correction), outperforming other models while using less than 20% of the annotated data. Conclusion: Self-correction markedly improves the reasoning and data efficiency of vision-language models, and Sherlock provides an effective tool for future research. Abstract: Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
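
To show where a $\beta$ enters preference tuning, the sketch below computes the standard Direct Preference Optimization (DPO) loss for one preference pair. This assumes Sherlock's preference tuning has the usual DPO form, which the abstract does not state; how Sherlock sets $\beta$ dynamically is also not described there, so `beta` is simply a parameter here. All names are hypothetical.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float) -> float:
    """Standard DPO loss: -log(sigmoid(beta * margin)) for one preference pair.

    Inputs are sequence log-probabilities under the policy and a frozen reference.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# With identical policy and reference, the margin is 0 and the loss is log(2).
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0, beta=0.1)
# If the policy favours the chosen response more than the reference does, loss drops.
better = dpo_loss(-4.0, -8.0, -5.0, -7.0, beta=0.1)
```

A larger `beta` makes the loss react more sharply to the same log-probability margin, which is why varying it per sample changes how strongly each preference pair pulls on the model.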

[149] Universal Visuo-Tactile Video Understanding for Embodied Interaction

Yifan Xie,Mingyang Li,Shoujie Li,Xingting Li,Guangyu Chen,Fei Ma,Fei Richard Yu,Wenbo Ding

Main category: cs.CV

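TL;DR: VTV-LLM is a multi-modal large language model that is the first to achieve universal understanding of visuo-tactile video (VTV), bridging the gap between tactile perception and natural language.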

Details

Motivation: Existing methods have advanced visual and language modalities but fail to integrate tactile information effectively, even though touch is crucial for real-world interaction. Method: Contribute the VTV150K dataset of 150,000 video frames from 100 objects and three tactile sensors; develop a three-stage training paradigm consisting of VTV enhancement, VTV-text alignment, and text prompt finetuning. Result: VTV-LLM performs strongly on tactile video understanding tasks and supports higher-level tactile reasoning such as feature assessment, comparative analysis, and scenario-based decision making. Conclusion: VTV-LLM lays a foundation for human-machine interaction in tactile domains and enables more intuitive interaction. Abstract: Tactile perception is essential for embodied agents to understand physical attributes of objects that cannot be determined through visual inspection alone. While existing approaches have made progress in visual and language modalities for physical understanding, they fail to effectively incorporate tactile information that provides crucial haptic feedback for real-world interaction. In this paper, we present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video (VTV) understanding that bridges the gap between tactile perception and natural language. To address the challenges of cross-sensor and cross-modal integration, we contribute VTV150K, a comprehensive dataset comprising 150,000 video frames from 100 diverse objects captured across three different tactile sensors (GelSight Mini, DIGIT, and Tac3D), annotated with four fundamental tactile attributes (hardness, protrusion, elasticity, and friction). We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation, VTV-text alignment for cross-modal correspondence, and text prompt finetuning for natural language generation. Our framework enables sophisticated tactile reasoning capabilities including feature assessment, comparative analysis, and scenario-based decision making. Experimental evaluations demonstrate that VTV-LLM achieves superior performance in tactile video understanding tasks, establishing a foundation for more intuitive human-machine interaction in tactile domains.

[150] ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models

Dmitrii Sorokin,Maksim Nakhodnov,Andrey Kuznetsov,Aibek Alanov

Main category: cs.CV

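TL;DR: The paper proposes two methods for the trade-off between human-preference alignment and diversity in diffusion models: (1) combined generation, which applies the reward-tuned model only in the later stages of generation; and (2) ImageReFL, which improves diversity through multiple regularizers.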

Details

Motivation: Diffusion models generate impressive images, but aligning them with human preferences often sacrifices diversity. Method: (1) A combined generation strategy; (2) ImageReFL, which trains on real images with multiple regularizers. Result: The approach outperforms conventional reward tuning on quality and diversity metrics, and a user study confirms a better balance. Conclusion: The proposed methods effectively resolve the tension between alignment and diversity; the code is open source. Abstract: Recent advances in diffusion models have led to impressive image generation capabilities, but aligning these models with human preferences remains challenging. Reward-based fine-tuning using models trained on human feedback improves alignment but often harms diversity, producing less varied outputs. In this work, we address this trade-off with two contributions. First, we introduce combined generation, a novel sampling strategy that applies a reward-tuned diffusion model only in the later stages of the generation process, while preserving the base model for earlier steps. This approach mitigates early-stage overfitting and helps retain global structure and diversity. Second, we propose ImageReFL, a fine-tuning method that improves image diversity with minimal loss in quality by training on real images and incorporating multiple regularizers, including diffusion and ReFL losses. Our approach outperforms conventional reward tuning methods on standard quality and diversity metrics. A user study further confirms that our method better balances human preference alignment and visual diversity. The source code can be found at https://github.com/ControlGenAI/ImageReFL.
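
The combined generation strategy amounts to switching denoisers partway through sampling. A minimal sketch, assuming each model exposes a one-step update `step(x, t)`; the names `base_step`, `tuned_step`, and `switch_fraction` are hypothetical, and the abstract does not say where the switch point lies.

```python
from typing import Callable, List, TypeVar

State = TypeVar("State")


def combined_generation(
    x: State,
    timesteps: List[int],
    base_step: Callable[[State, int], State],
    tuned_step: Callable[[State, int], State],
    switch_fraction: float = 0.5,
) -> State:
    """Run the base model for the early steps and the reward-tuned model afterwards.

    `timesteps` is ordered from the noisiest step to the cleanest one.
    """
    switch_at = int(len(timesteps) * switch_fraction)
    for i, t in enumerate(timesteps):
        step = base_step if i < switch_at else tuned_step
        x = step(x, t)
    return x


# Toy stand-ins that just record which model handled each step.
trace = combined_generation(
    [],
    [4, 3, 2, 1],
    base_step=lambda x, t: x + [("base", t)],
    tuned_step=lambda x, t: x + [("tuned", t)],
)
```

Early steps fix global layout, so leaving them to the base model is what preserves diversity; the reward-tuned model only shapes the finer detail added in later steps.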

[151] 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

Wenbo Hu,Yining Hong,Yanjun Wang,Leison Gao,Zibu Wei,Xingcheng Yao,Nanyun Peng,Yonatan Bitton,Idan Szpektor,Kai-Wei Chang

Main category: cs.CV

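TL;DR: To address the lack of spatial-temporal memory modeling in large language models (LLMs) operating in 3D environments, the paper introduces the 3DMem-Bench benchmark and the 3DLLM-Mem dynamic memory management model, which markedly improves task performance.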

Details

Motivation: Current LLMs plan and act poorly in dynamic, multi-room 3D environments, largely because they lack effective 3D spatial-temporal memory modeling. Method: Introduce the 3DMem-Bench benchmark and the 3DLLM-Mem model, which uses working memory tokens to selectively fuse features from past memory. Result: 3DLLM-Mem performs best on 3DMem-Bench, improving success rate by 16.5% over the strongest baselines. Conclusion: 3DLLM-Mem effectively addresses memory management for LLMs in complex 3D environments and markedly improves task performance. Abstract: Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent's ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represent current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench's most challenging in-the-wild embodied tasks.
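
Using working memory tokens as queries over episodic memory is a cross-attention pattern. The sketch below shows plain scaled dot-product attention in pure Python, with the memory features serving as both keys and values. It omits the learned projections a real model would have, and it is not the paper's architecture, only the generic mechanism; all names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_memory(working: List[Vector], episodic: List[Vector]) -> List[Vector]:
    """For each working-memory token, return an attention-weighted mix of episodic features."""
    d = len(episodic[0])
    fused = []
    for q in working:
        # Similarity of the query to every stored memory feature.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in episodic]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, episodic)) for j in range(d)])
    return fused


episodic = [[1.0, 0.0], [0.0, 1.0]]                 # two stored observations
focused = fuse_memory([[10.0, 0.0]], episodic)[0]   # query aligned with the first entry
uniform = fuse_memory([[0.0, 0.0]], episodic)[0]    # uninformative query
```

A query aligned with one memory entry pulls almost entirely from that entry, while an uninformative query averages over the whole store; that selectivity is what lets the agent focus on task-relevant history.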

[152] Tell me Habibi, is it Real or Fake?

Kartik Kuckreja,Parul Gupta,Injy Hamed,Thamar Solorio,Muhammad Haris Khan,Abhinav Dhall

Main category: cs.CV

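TL;DR: The paper introduces ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset, containing 387k videos and over 765 hours of content, to support research on multilingual and multimodal deepfake detection.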

Details

Motivation: Existing deepfake detection research focuses mostly on monolingual content and overlooks the challenges of multilingual and code-switched speech (such as mixed Arabic-English), a gap that needs filling. Method: Generate the dataset with a pipeline integrating four text-to-speech and two lip-sync models, and benchmark it against existing monolingual and multilingual datasets and detection models. Result: ArEnAV provides a new benchmark for multilingual deepfake detection and supports more comprehensive research. Conclusion: The dataset is expected to advance multilingual and multimodal deepfake detection. Abstract: Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code-switched speech, where multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses extra challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It contains 387k videos and over 765 hours of real and fake videos. Our dataset is generated using a novel pipeline integrating four Text-To-Speech and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection. We benchmark our dataset against existing monolingual and multilingual datasets, state-of-the-art deepfake detection models, and a human evaluation, highlighting its potential to advance deepfake research. The dataset can be accessed at https://huggingface.co/datasets/kartik060702/ArEnAV-Full.

[153] SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning

Jiaqi Huang,Zunnan Xu,Jun Zhou,Ting Liu,Yicheng Xiao,Mingwen Ou,Bowen Ji,Xiu Li,Kehong Yuan

Main category: cs.CV

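TL;DR: SAM-R1 is a novel multimodal large model framework that uses reinforcement learning to perform fine-grained reasoning for image segmentation, without relying on costly manually annotated reasoning data.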

Details

Motivation: Existing methods depend on time-consuming manually annotated reasoning data, whereas reinforcement learning can give models reasoning ability with less reliance on such annotation. Method: Combine fine-grained segmentation settings with task-specific rewards, using the Segment Anything Model (SAM) as the reward provider to improve the alignment between reasoning and segmentation. Result: With only 3k training samples, SAM-R1 performs strongly on multiple benchmarks. Conclusion: Reinforcement learning can effectively equip multimodal models with segmentation-oriented reasoning. Abstract: Leveraging multimodal large models for image segmentation has become a prominent research direction. However, existing approaches typically rely heavily on manually annotated datasets that include explicit reasoning processes, which are costly and time-consuming to produce. Recent advances suggest that reinforcement learning (RL) can endow large models with reasoning capabilities without requiring such reasoning-annotated data. In this paper, we propose SAM-R1, a novel framework that enables multimodal large models to perform fine-grained reasoning in image understanding tasks. Our approach is the first to incorporate fine-grained segmentation settings during the training of multimodal reasoning models. By integrating task-specific, fine-grained rewards with a tailored optimization objective, we further enhance the model's reasoning and segmentation alignment. We also leverage the Segment Anything Model (SAM) as a strong and flexible reward provider to guide the learning process. With only 3k training samples, SAM-R1 achieves strong performance across multiple benchmarks, demonstrating the effectiveness of reinforcement learning in equipping multimodal models with segmentation-oriented reasoning capabilities.

[154] Adversarially Robust AI-Generated Image Detection for Free: An Information Theoretic Perspective

Ruixuan Zhang,He Wang,Zhengyu Zhao,Zhiqing Guo,Xun Yang,Yunfeng Diao,Meng Wang

Main category: cs.CV

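TL;DR: The paper proposes TRIM, a training-free adversarial defense for detecting AI-generated images (AIGI) that uses information-theoretic measures to address feature entanglement and substantially outperforms existing defenses.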

Details

Motivation: Malicious use of AI-generated images (forgery, misinformation) is growing; existing detectors are vulnerable to adversarial attacks, and adversarial training (AT) suffers from performance collapse in AIGI detection. Method: Propose TRIM, which builds on standard detectors and quantifies feature shifts using prediction entropy and KL divergence, with no additional training. Result: Validated across multiple datasets and attacks, TRIM substantially outperforms existing defenses (for example, by 33.88% on ProGAN) while maintaining original accuracy. Conclusion: TRIM is an efficient, training-free adversarial defense that sidesteps the feature entanglement problem in AIGI detection and is broadly applicable. Abstract: Rapid advances in Artificial Intelligence Generated Images (AIGI) have facilitated malicious use, such as forgery and misinformation. Therefore, numerous methods have been proposed to detect fake images. Although such detectors have been proven to be universally vulnerable to adversarial attacks, defenses in this field are scarce. In this paper, we first identify that adversarial training (AT), widely regarded as the most effective defense, suffers from performance collapse in AIGI detection. Through an information-theoretic lens, we further attribute the cause of collapse to feature entanglement, which disrupts the preservation of feature-label mutual information. Instead, standard detectors show clear feature separation. Motivated by this difference, we propose Training-free Robust Detection via Information-theoretic Measures (TRIM), the first training-free adversarial defense for AIGI detection. TRIM builds on standard detectors and quantifies feature shifts using prediction entropy and KL divergence. Extensive experiments across multiple datasets and attacks validate the superiority of our TRIM, e.g., outperforming the state-of-the-art defense by 33.88% (28.91%) on ProGAN (GenImage), while largely maintaining original accuracy.
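
The two quantities TRIM relies on, prediction entropy and KL divergence, are easy to state concretely. The sketch below computes both for categorical distributions (natural logarithm). How TRIM turns them into a detection decision, and which pair of distributions it compares, is not described in the abstract; comparing a detector's output before and after some input change is an assumption made here for illustration.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy H(p) in nats; zero-probability terms contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is clamped at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical [real, fake] probabilities from a detector.
clean_pred = [0.05, 0.95]      # confident prediction: low entropy
shifted_pred = [0.50, 0.50]    # prediction after an input change: high entropy

uncertainty = entropy(shifted_pred)
shift = kl_divergence(clean_pred, shifted_pred)
```

Entropy measures how undecided a single prediction is, while KL divergence measures how far one prediction has moved from another; neither requires any retraining of the detector.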

[155] PS4PRO: Pixel-to-pixel Supervision for Photorealistic Rendering and Optimization

Yezhi Shen,Qiuchen Zhai,Fengqing Zhu

Main category: cs.CV

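TL;DR: Proposes PS4PRO, a data augmentation method based on video frame interpolation, to improve neural rendering reconstruction in static and dynamic scenes.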

Details

Motivation: Neural rendering methods that reconstruct 3D scenes from 2D images are limited by the number of input views, and perform especially poorly in complex dynamic scenes. Method: Design PS4PRO, a lightweight, high-quality video frame interpolation model trained on diverse video datasets, which implicitly models camera movement and real-world 3D geometry. Result: Experiments show that the method improves reconstruction performance on both static and dynamic scenes. Conclusion: Acting as an implicit world prior, PS4PRO enriches the supervision available for 3D reconstruction and offers an effective data augmentation method for neural rendering. Abstract: Neural rendering methods have gained significant attention for their ability to reconstruct 3D scenes from 2D images. The core idea is to take multiple views as input and optimize the reconstructed scene by minimizing the uncertainty in geometry and appearance across the views. However, the reconstruction quality is limited by the number of input views. This limitation is further pronounced in complex and dynamic scenes, where certain angles of objects are never seen. In this paper, we propose to use video frame interpolation as the data augmentation method for neural rendering. Furthermore, we design a lightweight yet high-quality video frame interpolation model, PS4PRO (Pixel-to-pixel Supervision for Photorealistic Rendering and Optimization). PS4PRO is trained on diverse video datasets, implicitly modeling camera movement as well as real-world 3D geometry. Our model performs as an implicit world prior, enriching the photo supervision for 3D reconstruction. By leveraging the proposed method, we effectively augment existing datasets for neural rendering methods. Our experimental results indicate that our method improves the reconstruction performance on both static and dynamic scenes.

[156] ObjectClear: Complete Object Removal via Object-Effect Attention

Jixin Zhao,Shangchen Zhou,Zhouxia Wang,Peiqing Yang,Chen Change Loy

Main category: cs.CV

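TL;DR: The paper introduces a new dataset, OBER, and a framework, ObjectClear, to address the shortcomings of diffusion-based inpainting methods in removing objects together with their effects (such as shadows and reflections).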

Details

Motivation: Diffusion-based inpainting methods produce artifacts, hallucinate content, and alter the background when removing objects and their effects, so a more precise solution is needed. Method: Introduce the OBER dataset, with paired images with and without object effects plus precise masks; propose the ObjectClear framework, which uses an object-effect attention mechanism to decouple foreground removal from background reconstruction. Result: ObjectClear performs strongly in complex scenes, markedly improving object-effect removal quality and background fidelity. Conclusion: The OBER dataset and ObjectClear framework provide effective tools for object-effect removal, particularly in complex scenes. Abstract: Object removal requires eliminating not only the target object but also its effects, such as shadows and reflections. However, diffusion-based inpainting methods often produce artifacts, hallucinate content, alter background, and struggle to remove object effects accurately. To address this challenge, we introduce a new dataset for OBject-Effect Removal, named OBER, which provides paired images with and without object effects, along with precise masks for both objects and their associated visual artifacts. The dataset comprises high-quality captured and simulated data, covering diverse object categories and complex multi-object scenes. Building on OBER, we propose a novel framework, ObjectClear, which incorporates an object-effect attention mechanism to guide the model toward the foreground removal regions by learning attention masks, effectively decoupling foreground removal from background reconstruction. Furthermore, the predicted attention map enables an attention-guided fusion strategy during inference, greatly preserving background details. Extensive experiments demonstrate that ObjectClear outperforms existing methods, achieving improved object-effect removal quality and background fidelity, especially in complex scenarios.

[157] SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation

Dekai Zhu,Yixuan Hu,Youquan Liu,Dongyue Lu,Lingdong Kong,Slobodan Ilic

Main category: cs.CV

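TL;DR: Spiral is a new LiDAR diffusion model that jointly generates depth, reflectance, and semantic maps, addressing the inability of existing range-view methods to generate labeled data, and performs strongly in experiments.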

Details

Motivation: Existing range-view methods cannot generate labeled LiDAR scenes, and relying on pretrained segmentation models leads to poor cross-modal consistency. Spiral aims to solve this while keeping the efficiency and simpler network design of range-view representations. Method: Propose Spiral, a diffusion model that simultaneously generates depth, reflectance, and semantic maps, and introduce new semantic-aware metrics to evaluate the generated data. Result: On SemanticKITTI and nuScenes, Spiral achieves state-of-the-art performance with the smallest parameter count, outperforming two-step methods that combine generative and segmentation models. Conclusion: Spiral generates labeled data efficiently and can be used for synthetic data augmentation in downstream segmentation, substantially reducing LiDAR labeling effort. Abstract: Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.

[158] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

Zhe Kong,Feng Gao,Yong Zhang,Zhuoliang Kang,Xiaoming Wei,Xunliang Cai,Guanying Chen,Wenhan Luo

Main category: cs.CV

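TL;DR: The paper proposes a new task, multi-person conversational video generation, and the MultiTalk framework, which addresses audio-person binding and instruction following in multi-person generation.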

Details

Motivation: Existing methods focus on single-person animation, struggle with multi-stream audio inputs, bind audio to the wrong person, and follow instructions poorly. Method: Propose Label Rotary Position Embedding (L-RoPE) to resolve audio-person binding, and use partial parameter training and multi-task training to preserve the base model's instruction-following ability. Result: MultiTalk outperforms other methods on several datasets, demonstrating strong generation capability. Conclusion: MultiTalk effectively addresses the key challenges of multi-person conversational video generation and has broad application potential. Abstract: Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit limitations in instruction-following capabilities. To solve this problem, in this paper, we propose a novel task: Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges during multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.

[159] VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Ce Zhang,Kaixin Ma,Tianqing Fang,Wenhao Yu,Hongming Zhang,Zhisong Zhang,Yaqi Xie,Katia Sycara,Haitao Mi,Dong Yu

Main category: cs.CV

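TL;DR: VScan is a two-stage visual token reduction framework that merges tokens through global and local scans and prunes tokens at intermediate layers of the language model, substantially accelerating inference while retaining high performance.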

Details

Motivation: Large Vision-Language Models (LVLMs) incur high computational costs because of long visual token sequences, which makes real-time deployment difficult, so token processing needs to be more efficient. Method: Proposes the VScan framework, which combines global and local scanning with token merging during visual encoding and token pruning at intermediate layers of the language model. Result: Validated on four LVLMs, VScan substantially accelerates inference (a 2.91x prefilling speedup and a 10x reduction in FLOPs) while retaining 95.4% of the original performance. Conclusion: By optimizing the token processing pipeline, VScan achieves efficient multi-modal understanding and outperforms existing methods. Abstract: Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4% of the original performance.
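
The pruning idea in the abstract above can be illustrated with a minimal sketch. It is not VScan's actual procedure: the importance scores, the `keep_ratio` parameter, and the function name `prune_tokens` are all hypothetical, and the sketch only shows the generic step of keeping the highest-scoring visual tokens while preserving their original order.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving sequence order.

    `scores` stands in for any per-token importance signal (for example an
    attention weight); how VScan computes it is not specified here.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not tokens:
        return []
    # Number of tokens to keep, never fewer than one.
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore the original order so positional structure is not scrambled.
    return [tokens[i] for i in sorted(top)]


# Toy data: six "visual tokens" with made-up importance scores.
visual_tokens = ["sky", "tree", "car", "road", "sign", "cloud"]
importance = [0.05, 0.30, 0.90, 0.40, 0.70, 0.02]
kept = prune_tokens(visual_tokens, importance, keep_ratio=0.5)
print(kept)
```

With the toy scores above, half of the tokens are dropped and the survivors stay in their original sequence order, which is what a language model consuming them would generally expect.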

[160] Training Free Stylized Abstraction

Aimon Rahman,Kartik Narayan,Vishal M. Patel

Main category: cs.CV

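TL;DR: Proposes a training-free framework that generates stylized abstractions using vision-language models and a cross-domain rectified flow inversion strategy, supporting multi-round generation without fine-tuning.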

Details

Motivation: Stylized abstraction must preserve identity cues while applying a style, which is especially hard for out-of-distribution individuals. Method: Combines identity feature extraction by a vision-language model at inference time with a cross-domain rectified flow inversion strategy that dynamically adapts structural restoration. Result: Performs well across diverse styles (e.g., LEGO, knitted dolls, South Park) and generalizes to unseen identities and styles. Conclusion: The method achieves high-fidelity reconstruction in stylized abstraction without training or fine-tuning. Abstract: Stylized abstraction synthesizes visually exaggerated yet semantically faithful representations of subjects, balancing recognizability with perceptual distortion. Unlike image-to-image translation, which prioritizes structural fidelity, stylized abstraction demands selective retention of identity cues while embracing stylistic divergence, especially challenging for out-of-distribution individuals. We propose a training-free framework that generates stylized abstractions from a single image using inference-time scaling in vision-language models (VLLMs) to extract identity-relevant features, and a novel cross-domain rectified flow inversion strategy that reconstructs structure based on style-dependent priors. Our method adapts structural restoration dynamically through style-aware temporal scheduling, enabling high-fidelity reconstructions that honor both subject and style. It supports multi-round abstraction-aware generation without fine-tuning. To evaluate this task, we introduce StyleBench, a GPT-based human-aligned metric suited for abstract styles where pixel-level similarity fails. Experiments across diverse abstraction styles (e.g., LEGO, knitted dolls, South Park) show strong generalization to unseen identities and styles in a fully open-source setup.

[161] Zero-Shot Vision Encoder Grafting via LLM Surrogates

Kaiyu Yue,Vasu Singla,Menglin Jia,John Kirchenbauer,Rifaa Qadri,Zikui Cai,Abhinav Bhatele,Furong Huang,Tom Goldstein

Main category: cs.CV

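TL;DR: The paper trains vision encoders against small "surrogate models" to lower the training cost of vision language models (VLMs), then transfers the encoders zero-shot to a large language model (LLM).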

Details

Motivation: To reduce the high compute cost that large language models (e.g., Llama-70B) bring to VLM training, the work explores training the vision encoder with a small language model first and then transferring it to the large one. Method: Builds small surrogate models that share the embedding space and representation language of the large target LLM by inheriting its shallow layers. The trained vision encoder can then be transferred directly to the large model, a process called zero-shot grafting. Result: The grafted model outperforms the encoder-surrogate pair and, on some benchmarks, matches full training with the target LLM, while cutting training cost by about 45%. Conclusion: Zero-shot grafting lowers VLM training cost while maintaining high performance, offering a cost-effective approach to training large-scale vision language models. Abstract: Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a potentially promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting -- when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder.

cs.GR [Back]

[162] RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination

Chong Zeng,Yue Dong,Pieter Peers,Hongzhi Wu,Xin Tong

Main category: cs.GR

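TL;DR: RenderFormer is a neural rendering pipeline that renders images directly from a triangle-based scene representation, supports global illumination, and requires no per-scene training.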

Details

Motivation: Traditional rendering relies on physical simulation, which is computationally complex and slow. RenderFormer aims to simplify rendering by treating it as a sequence-to-sequence transformation. Method: Uses a two-stage transformer architecture: a view-independent stage models triangle-to-triangle light transport, and a view-dependent stage converts ray bundles into pixel values. Result: RenderFormer is validated on scenes with varied shape complexity and light transport. Conclusion: RenderFormer offers an efficient rendering approach that needs no per-scene optimization and demonstrates the potential of neural rendering. Abstract: We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle-sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.

[163] Fluid Simulation on Vortex Particle Flow Maps

Sinan Wang,Junwei Zhou,Fan Feng,Zhiqi Li,Yuchen Sun,Duowen Chen,Greg Turk,Bo Zhu

Main category: cs.GR

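TL;DR: The VPFM method simulates incompressible flow on vortex particle flow maps, achieving substantially longer flow map distances than existing techniques.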

Details

Motivation: Simulate complex vortical evolution in the presence of dynamic solid boundaries by exploiting the favorable properties of vorticity on flow maps. Method: Uses a hybrid Eulerian-Lagrangian representation that evolves vorticity on particles and reconstructs velocity on a background grid, comprising a vorticity-based flow map framework, a Hessian evolution scheme, and solid boundary treatment. Result: Flow map length increases by 3-12 times, and complex vortex and turbulence phenomena are captured effectively. Conclusion: VPFM performs well at vorticity preservation and complex flow simulation, confirming its effectiveness. Abstract: We propose the Vortex Particle Flow Map (VPFM) method to simulate incompressible flow with complex vortical evolution in the presence of dynamic solid boundaries. The core insight of our approach is that vorticity is an ideal quantity for evolution on particle flow maps, enabling significantly longer flow map distances compared to other fluid quantities like velocity or impulse. To achieve this goal, we developed a hybrid Eulerian-Lagrangian representation that evolves vorticity and flow map quantities on vortex particles, while reconstructing velocity on a background grid. The method integrates three key components: (1) a vorticity-based particle flow map framework, (2) an accurate Hessian evolution scheme on particles, and (3) a solid boundary treatment for no-through and no-slip conditions in VPFM. These components collectively allow a substantially longer flow map length (3-12 times longer) than the state-of-the-art, enhancing vorticity preservation over extended spatiotemporal domains. We validated the performance of VPFM through diverse simulations, demonstrating its effectiveness in capturing complex vortex dynamics and turbulence phenomena.

[164] STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering

Zehao Li,Hao Jiang,Yujun Cai,Jianing Chen,Baolong Bi,Shuqin Gao,Honglong Zhao,Yiwei Wang,Tianlu Mao,Zhaoqi Wang

Main category: cs.GR

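TL;DR: The paper proposes the STDR module, which improves the spatio-temporal consistency of dynamic scene reconstruction by decoupling spatio-temporal probability distributions.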

Details

Motivation: Existing 3DGS methods for dynamic reconstruction suffer from incoherent initialization caused by spatio-temporal entanglement, which makes dynamic motion hard to model accurately. Method: The STDR module introduces a spatio-temporal mask, a separated deformation field, and a consistency regularization to jointly disentangle spatial and temporal patterns. Result: Experiments show that STDR notably improves reconstruction quality and spatio-temporal consistency. Conclusion: STDR provides an efficient and general solution for dynamic scene reconstruction. Abstract: Although dynamic scene reconstruction has long been a fundamental challenge in 3D vision, the recent emergence of 3D Gaussian Splatting (3DGS) offers a promising direction by enabling high-quality, real-time rendering through explicit Gaussian primitives. However, existing 3DGS-based methods for dynamic reconstruction often suffer from spatio-temporal incoherence during initialization, where canonical Gaussians are constructed by aggregating observations from multiple frames without temporal distinction. This results in spatio-temporally entangled representations, making it difficult to model dynamic motion accurately. To overcome this limitation, we propose STDR (Spatio-Temporal Decoupling for Real-time rendering), a plug-and-play module that learns spatio-temporal probability distributions for each Gaussian. STDR introduces a spatio-temporal mask, a separated deformation field, and a consistency regularization to jointly disentangle spatial and temporal patterns. Extensive experiments demonstrate that incorporating our module into existing 3DGS-based dynamic scene reconstruction frameworks leads to notable improvements in both reconstruction quality and spatio-temporal consistency across synthetic and real-world benchmarks.

[165] Neural Face Skinning for Mesh-agnostic Facial Expression Cloning

Sihun Cha,Serin Yoon,Kwanggyoon Seo,Junyong Noh

Main category: cs.GR

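TL;DR: Proposes a method that combines global and local deformation models for facial animation retargeting, capturing fine detail while allowing intuitive control.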

Details

Motivation: Existing facial expression retargeting methods struggle to provide global control and local detail at the same time. Method: Predicts skinning weights for each vertex of the target face mesh to localize the influence of the global latent code, and supervises the latent code with FACS-based blendshapes. Result: Experiments show improvements over existing methods in expression fidelity, deformation transfer accuracy, and adaptability. Conclusion: The method combines the strengths of global and local models to achieve efficient and accurate facial animation retargeting. Abstract: Accurately retargeting facial expressions to a face mesh while enabling manipulation is a key challenge in facial animation retargeting. Recent deep-learning methods address this by encoding facial expressions into a global latent code, but they often fail to capture fine-grained details in local regions. While some methods improve local accuracy by transferring deformations locally, this often complicates overall control of the facial expression. To address this, we propose a method that combines the strengths of both global and local deformation models. Our approach enables intuitive control and detailed expression cloning across diverse face meshes, regardless of their underlying structures. The core idea is to localize the influence of the global latent code on the target mesh. Our model learns to predict skinning weights for each vertex of the target face mesh through indirect supervision from predefined segmentation labels. These predicted weights localize the global latent code, enabling precise and region-specific deformations even for meshes with unseen shapes. We supervise the latent code using Facial Action Coding System (FACS)-based blendshapes to ensure interpretability and allow straightforward editing of the generated animation. Through extensive experiments, we demonstrate improved performance over state-of-the-art methods in terms of expression fidelity, deformation transfer accuracy, and adaptability across diverse mesh structures.

cs.CL [Back]

[166] More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Chengzhi Liu,Zhongxing Xu,Qingyue Wei,Juncheng Wu,James Zou,Xin Eric Wang,Yuyin Zhou,Sheng Liu

Main category: cs.CL

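TL;DR: The paper studies how visual grounding degrades in multimodal large language models during long reasoning chains, introduces the RH-AUC metric and the RH-Bench benchmark, and examines how model size and training data type affect the balance between reasoning and perception.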

Details

Motivation: Long reasoning chains weaken the visual grounding of multimodal large language models and cause hallucination, a phenomenon that needs systematic study. Method: Introduces the RH-AUC metric to quantify how perception accuracy changes with reasoning length, and releases the RH-Bench benchmark to assess the trade-off between reasoning ability and hallucination. Result: Larger models balance reasoning and perception better, and this balance depends more on the type of training data than on its volume. Conclusion: Reasoning quality and perceptual fidelity should be evaluated jointly. Abstract: Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.

[167] Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use

Titouan Parcollet,Yuan Tseng,Shucong Zhang,Rogier van Dalen

Main category: cs.CL

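TL;DR: The Loquacious Set is a 25,000-hour, commercially usable English speech dataset designed to address the limitations of existing ASR datasets, such as small size, restrictive licenses, and unreliable transcriptions.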

Details

Motivation: Existing ASR datasets such as LibriSpeech are small and contain only clean, read speech, while newer datasets fail to meet industrial and academic needs because of licensing or data quality problems. Method: Presents the Loquacious Set, 25,000 hours of data covering diverse speech types (read, spontaneous, talks, clean, noisy) and accents. Result: The Loquacious Set gives academia and industry a resource for building ASR systems for real-world scenarios. Conclusion: The Loquacious Set fills gaps left by existing ASR datasets and provides broader speech data for research and applications. Abstract: Automatic speech recognition (ASR) research is driven by the availability of common datasets between industrial researchers and academics, encouraging comparisons and evaluations. LibriSpeech, despite its long success as an ASR benchmark, is now limited by its size and focus on clean, read speech, leading to near-zero word error rates. More recent datasets, including MOSEL, YODAS, Gigaspeech, OWSM, Libriheavy or People's Speech suffer from major limitations including licenses that researchers in the industry cannot use, unreliable transcriptions, incorrect audio data, or the lack of evaluation sets. This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. Featuring hundreds of thousands of speakers with diverse accents and a wide range of speech types (read, spontaneous, talks, clean, noisy), the Loquacious Set is designed to work for academics and researchers in the industry to build ASR systems in real-world scenarios.

[168] Rethinking Data Mixture for Large Language Models: A Comprehensive Survey and New Perspectives

Yajiao Liu,Congliang Chen,Junchi Yang,Ruoyu Sun

Main category: cs.CL

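TL;DR: The paper surveys data mixture methods, proposes a fine-grained taxonomy of offline and online methods, summarizes problem formulations, algorithms, strengths, and weaknesses, and identifies open challenges.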

Details

Motivation: The core question is how to set the weights of different data domains to train the best model under limited compute. Method: Divides existing methods into offline (heuristic-based, algorithm-based, function fitting-based) and online (min-max optimization, mixing law, and others) and summarizes their problem formulations and algorithms. Result: Clarifies the relationships and distinctions among the method families and summarizes their advantages and disadvantages. Conclusion: Data mixture still faces key challenges that require further research. Abstract: Training large language models with data collected from various domains can improve their performance on downstream tasks. However, given a fixed training budget, the sampling proportions of these different domains significantly impact the model's performance. How can we determine the domain weights across different data domains to train the best-performing model within constrained computational resources? In this paper, we provide a comprehensive overview of existing data mixture methods. First, we propose a fine-grained categorization of existing methods, extending beyond the previous offline and online classification. Offline methods are further grouped into heuristic-based, algorithm-based, and function fitting-based methods. For online methods, we categorize them into three groups: online min-max optimization, online mixing law, and other approaches by drawing connections with the optimization frameworks underlying offline methods. Second, we summarize the problem formulations, representative algorithms for each subtype of offline and online methods, and clarify the relationships and distinctions among them. Finally, we discuss the advantages and disadvantages of each method and highlight key challenges in the field of data mixture.

[169] R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

Tianyu Fu,Yi Ge,Yichen You,Enshu Liu,Zhihang Yuan,Guohao Dai,Shengen Yan,Huazhong Yang,Yu Wang

Main category: cs.CL

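TL;DR: The paper proposes R2R, a neural token routing method that uses a large language model (LLM) only for critical path-divergent tokens and leaves most token generation to a small language model (SLM), improving inference efficiency.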

Details

Motivation: Large language models (LLMs) reason well but are expensive to run and hard to deploy, while small language models (SLMs) are efficient but perform worse. The authors find that only a small fraction of tokens cause the reasoning paths of LLMs and SLMs to diverge. Method: Proposes R2R, in which a lightweight router sends only critical tokens to the LLM and lets the SLM generate the rest, and develops an automatic data generation pipeline to train the router. Result: On math, coding, and QA tasks, R2R, with an average activated parameter size of 5.6B, surpasses R1-7B and approaches R1-32B while delivering a 2.8x speedup. Conclusion: R2R balances efficiency and performance and advances the Pareto frontier of test-time scaling efficiency. Abstract: Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce **Roads to Rome (R2R)**, a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
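
The routing loop described in the abstract above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released R2R code: `small_next`, `large_next`, and `needs_large` are hypothetical callables standing in for the SLM, the LLM, and the learned router, and letting the router inspect the SLM's candidate token is an assumption of this sketch.

```python
def route_generate(prompt_tokens, small_next, large_next, needs_large, max_new_tokens):
    """Generate tokens with a small model, deferring flagged steps to a large one.

    small_next / large_next: hypothetical callables mapping the token list so
        far to the next token.
    needs_large: hypothetical router; returns True when the small model's
        candidate should be replaced by the large model's token.
    Returns the full token list and the number of large-model calls.
    """
    tokens = list(prompt_tokens)
    large_calls = 0
    for _ in range(max_new_tokens):
        candidate = small_next(tokens)          # cheap proposal
        if needs_large(tokens, candidate):      # router flags a divergent step
            candidate = large_next(tokens)      # expensive correction
            large_calls += 1
        tokens.append(candidate)
    return tokens, large_calls


# Toy stand-ins: the "small model" always says "a", the "large model" says "B",
# and the "router" fires whenever the context length is a multiple of 4.
output, calls = route_generate(
    ["x"],
    small_next=lambda toks: "a",
    large_next=lambda toks: "B",
    needs_large=lambda toks, cand: len(toks) % 4 == 0,
    max_new_tokens=6,
)
print(output, calls)
```

In this toy run the large model is consulted for only one of six steps, which mirrors the abstract's premise that most tokens can be left to the small model.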

[170] How does Misinformation Affect Large Language Model Behaviors and Preferences?

Miao Peng,Nuo Chen,Jianheng Tang,Jia Li

Main category: cs.CL

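TL;DR: MisBench is a comprehensive benchmark for evaluating how large language models (LLMs) behave toward misinformation. It contains more than 10 million pieces of misinformation, reveals LLMs' vulnerability to knowledge conflicts and stylistic variations, and comes with a new method, RtD, that strengthens detection.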

Details

Motivation: Existing work lacks fine-grained analysis of how LLMs are influenced by misinformation, and MisBench aims to fill this gap. Method: Builds the MisBench benchmark of 10,346,712 pieces of misinformation, analyzes LLM behavior under knowledge conflicts and stylistic variations, and proposes the RtD method. Result: LLMs show comparable ability to discern misinformation but remain susceptible to knowledge conflicts and stylistic variations. RtD effectively strengthens their detection ability. Conclusion: MisBench offers insight into how LLMs interact with misinformation and can serve as an effective benchmark for evaluating LLM-based detectors and improving their reliability in real-world applications. Abstract: Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks, while they remain vulnerable when encountering misinformation. Existing studies have explored the role of LLMs in combating misinformation, but there is still a lack of fine-grained analysis on the specific aspects and extent to which LLMs are influenced by misinformation. To bridge this gap, we present MisBench, the current largest and most comprehensive benchmark for evaluating LLMs' behavior and knowledge preference toward misinformation. MisBench consists of 10,346,712 pieces of misinformation, which uniquely considers both knowledge-based conflicts and stylistic variations in misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they still remain susceptible to knowledge conflicts and stylistic variations. Based on these findings, we further propose a novel approach called Reconstruct to Discriminate (RtD) to strengthen LLMs' ability to detect misinformation. Our study provides valuable insights into LLMs' interactions with misinformation, and we believe MisBench can serve as an effective benchmark for evaluating LLM-based detectors and enhancing their reliability in real-world applications. Codes and data are available at https://github.com/GKNL/MisBench.

[171] Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts

Lei Zhang,Markus Stricker

Main category: cs.CL

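TL;DR: The paper proposes an iterative framework that selects diverse documents, trains Word2Vec models, and monitors the convergence of composition-property correlations in embedding space to accelerate materials discovery and optimization. The method predicts high-performing materials, and the predictions are validated experimentally.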

Details

Motivation: The combinatorial explosion in materials discovery makes data scarce, and the latent knowledge in scientific texts is underused. Method: Uses an iterative framework that combines selection of diverse documents, Word2Vec training, and monitoring of correlations in embedding space to predict high-performing materials. Result: Successfully predicts high-performing materials for the oxygen reduction, hydrogen evolution, and oxygen evolution reactions, with the predictions validated by experiment. Conclusion: Iterative corpus refinement provides an efficient, scalable tool for materials discovery, especially when data are scarce. Abstract: The discovery and optimization of materials for specific applications is hampered by the practically infinite number of possible elemental combinations and associated properties, also known as the 'combinatorial explosion'. By nature of the problem, data are scarce and all possible data sources should be used. In addition to simulations and experimental results, the latent knowledge in scientific texts is not yet used to its full potential. We present an iterative framework that refines a given scientific corpus by strategic selection of the most diverse documents, training Word2Vec models, and monitoring the convergence of composition-property correlations in embedding space. Our approach is applied to predict high-performing materials for oxygen reduction (ORR), hydrogen evolution (HER), and oxygen evolution (OER) reactions for a large number of possible candidate compositions. Our method successfully predicts the highest performing compositions among a large pool of candidates, validated by experimental measurements of the electrocatalytic performance in the lab. This work demonstrates and validates the potential of iterative corpus refinement to accelerate materials discovery and optimization, offering a scalable and efficient tool for screening large compositional spaces where reliable data are scarce or non-existent.

[172] Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations

Zeinab Dehghani,Koorosh Aslansefat,Adil Khan,Mohammed Naveed Akram

Main category: cs.CL

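TL;DR: SMILE is a new method for explaining how large language models (such as GPT, LLAMA, and Claude) respond to different parts of a prompt, using visual heat maps to show which parts matter most.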

Details

Motivation: Large language models are black boxes, and their lack of transparency can be a problem in fields where trust and accountability matter. Method: SMILE slightly perturbs the input, measures how the output changes, and produces heat maps that highlight the most influential parts of the prompt. Result: Tested on several leading LLMs, SMILE performs well on accuracy, consistency, stability, and fidelity. Conclusion: SMILE improves model interpretability and makes AI more transparent and trustworthy. Abstract: Large language models like GPT, LLAMA, and Claude have become incredibly powerful at generating text, but they are still black boxes, so it is hard to understand how they decide what to say. That lack of transparency can be problematic, especially in fields where trust and accountability matter. To help with this, we introduce SMILE, a new method that explains how these models respond to different parts of a prompt. SMILE is model-agnostic and works by slightly changing the input, measuring how the output changes, and then highlighting which words had the most impact. It creates simple visual heat maps showing which parts of a prompt matter the most. We tested SMILE on several leading LLMs and used metrics such as accuracy, consistency, stability, and fidelity to show that it gives clear and reliable explanations. By making these models easier to understand, SMILE brings us one step closer to making AI more transparent and trustworthy.
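
The perturb-and-measure idea in the abstract above can be illustrated with a minimal leave-one-word-out sketch. This is a simplification and not SMILE's actual procedure: deleting one word at a time, the absolute-difference `distance`, and the toy `model` below are all assumptions made for illustration.

```python
def word_importance(prompt, model, distance):
    """Score each word by how much the model output changes when it is removed.

    model: hypothetical callable mapping a prompt string to an output.
    distance: hypothetical callable comparing two outputs (larger = more change).
    Returns a list of (word, score) pairs in prompt order.
    """
    words = prompt.split()
    baseline = model(prompt)
    scores = []
    for i, word in enumerate(words):
        # Perturb the input by dropping exactly one word.
        perturbed = " ".join(words[:i] + words[i + 1:])
        scores.append((word, distance(baseline, model(perturbed))))
    return scores


def toy_model(text):
    # Toy stand-in for an LLM: a numeric "response" driven by two keywords.
    tokens = text.split()
    return 1.0 * ("refund" in tokens) + 0.5 * ("urgent" in tokens)


heat = word_importance(
    "urgent refund request please",
    model=toy_model,
    distance=lambda a, b: abs(a - b),
)
print(heat)
```

The resulting per-word scores are the kind of values a heat map would colour. A real text-generating model would need a distance between text outputs rather than between numbers.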

[173] Rethinking the Outlier Distribution in Large Language Models: An In-depth Study

Rahul Raman,Khushi Sharma,Sai Qian Zhang

Main category: cs.CL

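TL;DR: The paper studies outliers in large language models (LLMs), examines how they form, and proposes mitigation strategies to improve the accuracy and efficiency of quantization.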

Details

Motivation: Outliers strongly affect LLM performance in areas such as quantization and compression and can cause quantization error and degraded accuracy. Understanding their root causes and proposing remedies helps model deployment. Method: Investigates the formation mechanisms of outliers in depth and proposes effective ways to eliminate massive activations and channel-wise outliers. Result: The proposed approaches remove most outliers with minimal impact on model accuracy. Conclusion: The paper offers an in-depth analysis of outliers in LLMs and practical remedies that help improve quantization efficiency and model performance. Abstract: Investigating outliers in large language models (LLMs) is crucial due to their significant impact on various aspects of LLM performance, including quantization and compression. Outliers often cause considerable quantization errors, leading to degraded model performance. Identifying and addressing these outliers can enhance the accuracy and efficiency of the quantization process, enabling smoother deployment on edge devices or specialized hardware. Recent studies have identified two common types of outliers in LLMs: massive activations and channel-wise outliers. While numerous quantization algorithms have been proposed to mitigate their effects and maintain satisfactory accuracy, few have thoroughly explored the root causes of these outliers. In this paper, we conduct a comprehensive investigation into the formation mechanisms of these outliers and propose potential strategies to mitigate their occurrence. Ultimately, we introduce some efficient approaches to eliminate most massive activations and channel-wise outliers with minimal impact on accuracy.

[174] LLMPR: A Novel LLM-Driven Transfer Learning based Petition Ranking Model

Avijit Gayen,Somyajit Chakraborty,Mainak Sen,Soham Paul,Angshuman Jana

Main category: cs.CL

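TL;DR: Proposes the LLMPR framework, which uses large language models and machine learning to automatically assign priorities to legal cases, improving efficiency and fairness.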

Details

Motivation: The Indian judiciary has a severe backlog of unresolved cases, and manual prioritization is inefficient and prone to subjective bias. Method: Extracts text features with embedding techniques such as DistilBERT, LegalBERT, and MiniLM, combines them with quantitative indicators, and trains several machine learning models. Result: Random Forest and Decision Tree models perform best, with accuracy above 99% and a Spearman rank correlation of 0.99. Conclusion: Automated case prioritization can streamline judicial workflows, reduce backlog, and improve fairness. Abstract: The persistent accumulation of unresolved legal cases, especially within the Indian judiciary, significantly hampers the timely delivery of justice. Manual methods of prioritizing petitions are often prone to inefficiencies and subjective biases, further exacerbating delays. To address this issue, we propose LLMPR (Large Language Model-based Petition Ranking), an automated framework that utilizes transfer learning and machine learning to assign priority rankings to legal petitions based on their contextual urgency. Leveraging the ILDC dataset comprising 7,593 annotated petitions, we process unstructured legal text and extract features through various embedding techniques, including DistilBERT, LegalBERT, and MiniLM. These textual embeddings are combined with quantitative indicators such as gap days, rank scores, and word counts to train multiple machine learning models, including Random Forest, Decision Tree, XGBoost, LightGBM, and CatBoost. Our experiments demonstrate that Random Forest and Decision Tree models yield superior performance, with accuracy exceeding 99% and a Spearman rank correlation of 0.99. Notably, models using only numerical features achieve nearly optimal ranking results ($R^2$ = 0.988, $\rho$ = 0.998), while LLM-based embeddings offer only marginal gains. These findings suggest that automated petition ranking can effectively streamline judicial workflows, reduce case backlog, and improve fairness in legal prioritization.
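
Since the abstract above reports ranking quality as a Spearman rank correlation, a small self-contained sketch of that statistic may help. It uses the standard definition (Pearson correlation of the ranks, with average ranks for ties). The example scores are made up and have no connection to the ILDC data.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Made-up priority scores: reference ranking vs. a model's predicted ranking.
reference = [1, 2, 3, 4, 5]
predicted = [2, 1, 4, 3, 5]
rho = spearman(reference, predicted)
print(round(rho, 3))
```

Two adjacent swaps out of five items already lower the coefficient to about 0.8, which gives a sense of how close to a perfect ordering a value of 0.99 is.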

[175] MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs

Raoyuan Zhao,Beiduo Chen,Barbara Plank,Michael A. Hedderich

Main category: cs.CL

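TL;DR: MAKIEval is an automatic multilingual framework for evaluating the cultural awareness of large language models (LLMs) across languages, regions, and cultural topics.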

Details

Motivation: English-centric pretraining can give LLMs cross-lingual cultural biases, and existing evaluation is limited by translation quality and a shortage of benchmarks, so a more comprehensive evaluation tool is needed. Method: Uses Wikidata's multilingual structure as a cross-lingual anchor to automatically identify cultural entities in model outputs and link them to structured knowledge, with no manual annotation or translation. Result: An evaluation of 7 LLMs across 13 languages and 19 countries and regions finds that models show stronger cultural awareness in English. Conclusion: MAKIEval provides a scalable, language-agnostic evaluation method and reveals cross-lingual differences in the cultural awareness of LLMs. Abstract: Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities in cultural awareness, often resulting in biased outputs. However, comprehensive multilingual evaluation remains challenging due to limited benchmarks and questionable translation quality. To better assess these disparities, we introduce MAKIEval, an automatic multilingual framework for evaluating cultural awareness in LLMs across languages, regions, and topics. MAKIEval evaluates open-ended text generation, capturing how models express culturally grounded knowledge in natural language. Leveraging Wikidata's multilingual structure as a cross-lingual anchor, it automatically identifies cultural entities in model outputs and links them to structured knowledge, enabling scalable, language-agnostic evaluation without manual annotation or translation. We then introduce four metrics that capture complementary dimensions of cultural awareness: granularity, diversity, cultural specificity, and consensus across languages. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems, across 13 languages, 19 countries and regions, and 6 culturally salient topics (e.g., food, clothing). Notably, we find that models tend to exhibit stronger cultural awareness in English, suggesting that English prompts more effectively activate culturally grounded knowledge. We publicly release our code and data.

[176] Do We Know What LLMs Don't Know? A Study of Consistency in Knowledge Probing

Raoyuan Zhao,Abdullatif Köksal,Ali Modarressi,Michael A. Hedderich,Hinrich Schütze

Main category: cs.CL

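TL;DR: The paper proposes a new evaluation process based on input variations and quantitative metrics to assess the consistency of knowledge gap probing in large language models (LLMs), revealing high inconsistency both within and across probing methods.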

Details

Motivation: Hallucination seriously undermines the reliability of large language models (LLMs), so their knowledge gaps must be identified precisely. Existing probing methods are inconsistent and need to be evaluated and improved. Method: Proposes a new evaluation process based on input variations and quantitative metrics and analyzes two kinds of inconsistency in knowledge gap probing: intra-method and cross-method. Result: Probing methods are highly inconsistent both within a method (small prompt changes alter the results) and across methods (decision consistency as low as 7%). Conclusion: Existing probing methods have significant problems, and perturbation-robust probing frameworks are urgently needed. Abstract: The reliability of large language models (LLMs) is greatly compromised by their tendency to hallucinate, underscoring the need for precise identification of knowledge gaps within LLMs. Various methods for probing such gaps exist, ranging from calibration-based to prompting-based methods. To evaluate these probing methods, in this paper, we propose a new process based on using input variations and quantitative metrics. Through this, we expose two dimensions of inconsistency in knowledge gap probing. (1) Intra-method inconsistency: Minimal non-semantic perturbations in prompts lead to considerable variance in detected knowledge gaps within the same probing method; e.g., the simple variation of shuffling answer options can decrease agreement to around 40%. (2) Cross-method inconsistency: Probing methods contradict each other on whether a model knows the answer. Methods are highly inconsistent -- with decision consistency across methods being as low as 7% -- even though the model, dataset, and prompt are all the same. These findings challenge existing probing methods and highlight the urgent need for perturbation-robust probing frameworks.
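
To make the notion of agreement between probing runs concrete, here is a minimal sketch that computes the fraction of questions on which two sets of "knows / does not know" decisions coincide. The paper's exact consistency metrics may be defined differently, so the function name `decision_agreement` and the simple matching rule are assumptions of this sketch, and the decisions are made up.

```python
def decision_agreement(decisions_a, decisions_b):
    """Fraction of items on which two probing runs reach the same decision.

    Each list holds one boolean per question: True means the probe judged
    that the model knows the answer. This plain match rate is an assumed,
    simplified stand-in for the consistency metrics used in the paper.
    """
    if len(decisions_a) != len(decisions_b):
        raise ValueError("both runs must cover the same questions")
    if not decisions_a:
        raise ValueError("at least one question is required")
    matches = sum(a == b for a, b in zip(decisions_a, decisions_b))
    return matches / len(decisions_a)


# Made-up decisions for four questions, e.g. the same probe before and after
# shuffling answer options, or two different probing methods.
run_original = [True, False, True, True]
run_perturbed = [True, True, False, True]
agreement = decision_agreement(run_original, run_perturbed)
print(agreement)
```

The same function covers both cases named in the abstract: comparing one method against a perturbed copy of itself (intra-method) or comparing two different methods on identical inputs (cross-method).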

[177] Assessing and Refining ChatGPT's Performance in Identifying Targeting and Inappropriate Language: A Comparative Study

Barbarestani Baran,Maks Isa,Vossen Piek

Main category: cs.CL

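TL;DR: ChatGPT performs well at identifying inappropriate content in online comments, but its detection of targeting language is variable and needs further refinement.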

Details

Motivation: With the volume of user-generated content on social media growing rapidly, AI plays an increasingly important role in content moderation. This study evaluates how effective ChatGPT is in that role. Method: Compares ChatGPT's performance against crowd-sourced annotations and expert evaluations to assess accuracy, scope of detection, and consistency. Result: ChatGPT detects inappropriate content well, especially after iterative refinement (e.g., Version 6), but shows a higher false positive rate in targeting language detection. Conclusion: ChatGPT shows potential for automated content moderation, but the model and its contextual understanding need further refinement to reduce misjudgments. Abstract: This study evaluates the effectiveness of ChatGPT, an advanced AI model for natural language processing, in identifying targeting and inappropriate language in online comments. With the increasing challenge of moderating vast volumes of user-generated content on social network sites, the role of AI in content moderation has gained prominence. We compared ChatGPT's performance against crowd-sourced annotations and expert evaluations to assess its accuracy, scope of detection, and consistency. Our findings highlight that ChatGPT performs well in detecting inappropriate content, showing notable improvements in accuracy through iterative refinements, particularly in Version 6. However, its performance in targeting language detection showed variability, with higher false positive rates compared to expert judgments. This study contributes to the field by demonstrating the potential of AI models like ChatGPT to enhance automated content moderation systems while also identifying areas for further improvement. The results underscore the importance of continuous model refinement and contextual understanding to better support automated moderation and mitigate harmful online behavior.

[178] Counterfactual Simulatability of LLM Explanations for Generation Tasks

Marvin Limpijankit,Yanda Chen,Melanie Subbiah,Nicholas Deas,Kathleen McKeown

Main category: cs.CL

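TL;DR: The paper addresses the unpredictability of LLM behavior, proposes a general framework for evaluating explanations via counterfactual simulatability, and finds that its effectiveness on generation tasks varies by task type.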

Details

Motivation: Explanations of LLM behavior are critical in high-stakes settings, but counterfactual simulatability has so far been studied only for yes/no tasks and needs to be extended to generation tasks. Method: Proposes a general framework that applies counterfactual simulatability to generation tasks, using news summarization and medical suggestion as example use cases. Result: LLM explanations are effective in news summarization but leave room for improvement in medical suggestion; the evaluation appears better suited to skill-based tasks. Conclusion: The counterfactual simulatability framework applies to generation tasks but must be adapted to the task type, and it is less effective on knowledge-based tasks. Abstract: LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability, how well an explanation allows users to infer the model's output on related counterfactuals. Counterfactual simulatability has been previously studied for yes/no question answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that the evaluation for counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.

[179] BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

Yubin Kim,Zhiyuan Hu,Hyewon Jeong,Eugene Park,Shuyue Stella Li,Chanwoo Park,Shiyun Xiong,MingYu Lu,Hyeonhoon Lee,Xin Liu,Daniel McDuff,Cynthia Breazeal,Samir Tulebaev,Hae Won Park

Main category: cs.CL

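TL;DR: Introduces the BehaviorBench dataset and the BehaviorSFT training strategy for evaluating and improving the proactive behavior of large language models in clinical tasks. Experiments show that BehaviorSFT improves the models' proactivity and clinical usefulness.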

Details

Motivation: Large language models are not proactive enough in clinical tasks, and their behavioral adaptability needs to improve for them to be useful. Method: Proposes the BehaviorBench dataset to evaluate model behavior and develops the BehaviorSFT training strategy, which uses behavioral tokens to adjust model behavior dynamically. Result: BehaviorSFT improves performance on BehaviorBench (up to 97.3% Macro F1), and clinician evaluations confirm that its behavior is closer to real clinical needs. Conclusion: BehaviorSFT effectively improves the proactivity and usefulness of language models in clinical tasks and outperforms standard fine-tuning. Abstract: Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnosis reasoning), LLMs often struggle with proactive engagement, like unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal LLMs' inconsistent proactivity. To address this, we propose BehaviorSFT, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blind clinician evaluations confirmed BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a superior balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) versus standard fine-tuning or explicitly instructed agents.

[180] Calibrating LLM Confidence by Probing Perturbed Representation Stability

Reza Khanmohammadi,Erfan Miahi,Mehrsa Mardikoraem,Simerjot Kaur,Ivan Brugere,Charese H. Smiley,Kundan Thind,Mohammad M. Ghassemi

Main category: cs.CL

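TL;DR: CCPS calibrates LLM confidence by probing how stable the model's internal representations are under perturbation. It significantly outperforms existing methods and improves the reliability of confidence estimates.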

Details

Motivation: Miscalibrated confidence undermines the reliability of large language models (LLMs), so more accurate confidence estimation is needed. Method: CCPS applies targeted adversarial perturbations to the final hidden states, extracts features from the model's response, and uses a lightweight classifier to predict answer correctness. Result: Across multiple LLMs and benchmarks, CCPS substantially reduces calibration error and Brier score while improving accuracy, AUC, and related metrics. Conclusion: CCPS offers an efficient, broadly applicable, and more accurate way to estimate LLM confidence, improving trustworthiness. Abstract: Miscalibration in Large Language Models (LLMs) undermines their reliability, highlighting the need for accurate confidence estimation. We introduce CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability), a novel method analyzing internal representational stability in LLMs. CCPS applies targeted adversarial perturbations to final hidden states, extracts features reflecting the model's response to these perturbations, and uses a lightweight classifier to predict answer correctness. CCPS was evaluated on LLMs from 8B to 32B parameters (covering Llama, Qwen, and Mistral architectures) using MMLU and MMLU-Pro benchmarks in both multiple-choice and open-ended formats. Our results show that CCPS significantly outperforms current approaches. Across four LLMs and three MMLU variants, CCPS reduces Expected Calibration Error by approximately 55% and Brier score by 21%, while increasing accuracy by 5 percentage points, Area Under the Precision-Recall Curve by 4 percentage points, and Area Under the Receiver Operating Characteristic Curve by 6 percentage points, all relative to the strongest prior method. CCPS delivers an efficient, broadly applicable, and more accurate solution for estimating LLM confidence, thereby improving their trustworthiness.
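
For readers unfamiliar with the two calibration metrics quoted above, the following is a minimal sketch of how Brier score and Expected Calibration Error are commonly computed from per-answer confidences and 0/1 correctness labels. It assumes equal-width confidence bins, which is one common ECE variant; the binning used in the paper may differ, and the function names are illustrative.

```python
from typing import List, Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    total = len(confidences)
    ece = 0.0
    for idx in bins:
        if not idx:
            continue  # empty bins contribute nothing
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(acc - conf)
    return ece


# Toy example: two over-confident answers and two well-calibrated rejections.
conf = [0.9, 0.9, 0.1, 0.1]
hit = [1, 0, 0, 0]
print(brier_score(conf, hit), expected_calibration_error(conf, hit))
```

Lower is better for both metrics; a perfectly calibrated model has an ECE of zero.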

[181] GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task

Chutong Meng,Antonios Anastasopoulos

Main category: cs.CL

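TL;DR: Describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task, exploring several training paradigms and comparing their results.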

Details

Motivation: Study speech translation for low-resource language pairs and explore how fine-tuning SeamlessM4T-v2 can improve performance. Method: Fine-tunes SeamlessM4T-v2 for ASR, MT, and E2E ST, and tries direct E2E fine-tuning, multi-task training, and parameter initialization from fine-tuned ASR/MT models. Result: Direct E2E fine-tuning yields strong results; initializing with a fine-tuned ASR encoder helps on languages the model was not trained on; multi-task training helps slightly. Conclusion: Direct E2E fine-tuning is effective, and combining it with ASR encoder initialization and multi-task training can improve performance further. Abstract: This paper describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task. We trained systems for all language pairs, except for Levantine Arabic. We fine-tuned SeamlessM4T-v2 for automatic speech recognition (ASR), machine translation (MT), and end-to-end speech translation (E2E ST). The ASR and MT models are also used to form cascaded ST systems. Additionally, we explored various training paradigms for E2E ST fine-tuning, including direct E2E fine-tuning, multi-task training, and parameter initialization using components from fine-tuned ASR and/or MT models. Our results show that (1) direct E2E fine-tuning yields strong results; (2) initializing with a fine-tuned ASR encoder improves ST performance on languages SeamlessM4T-v2 has not been trained on; (3) multi-task training can be slightly helpful.

[182] VeriTrail: Closed-Domain Hallucination Detection with Traceability

Dasha Metropolitansky,Jonathan Larson

Main category: cs.CL

TL;DR: VeriTrail is a new method that detects hallucinated content in multi-step generative processes and provides traceability.

Details

Motivation: Even when instructed to follow source material, language models can generate unsubstantiated content (closed-domain hallucination), and the risk is higher in processes with multiple generative steps (MGS). Hallucinations need to be both detected and traced to where they were introduced. Method: Proposes VeriTrail and constructs datasets that include intermediate outputs and human annotations of final-output faithfulness. Result: VeriTrail outperforms baseline methods on both datasets. Conclusion: VeriTrail is the first method to provide hallucination detection with traceability for both multi-step and single-step generative processes. Abstract: Even when instructed to adhere to source material, Language Models often generate unsubstantiated content - a phenomenon known as "closed-domain hallucination." This risk is amplified in processes with multiple generative steps (MGS), compared to processes with a single generative step (SGS). However, due to the greater complexity of MGS processes, we argue that detecting hallucinations in their final outputs is necessary but not sufficient: it is equally important to trace where hallucinated content was likely introduced and how faithful content may have been derived from the source through intermediate outputs. To address this need, we present VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for both MGS and SGS processes. We also introduce the first datasets to include all intermediate outputs as well as human annotations of final outputs' faithfulness for their respective MGS processes. We demonstrate that VeriTrail outperforms baseline methods on both datasets.

[183] Revisiting Common Assumptions about Arabic Dialects in NLP

Amr Keleg,Sharon Goldwater,Walid Magdy

Main category: cs.CL

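TL;DR: The paper analyzes the diversity of Arabic dialects and tests four common assumptions about them, finding that these assumptions oversimplify reality and may hinder progress on Arabic NLP tasks.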

Details

Motivation: Arabic dialects are highly diverse, yet assumptions widely adopted in NLP (e.g., that dialects are distinguishable) have not been verified quantitatively, which may affect task performance. Method: Extends and analyzes a multi-label dataset in which the validity of each sentence in 11 country-level dialects is assessed manually. Result: The analysis shows that the four assumptions oversimplify reality and that some are not always accurate. Conclusion: The inaccuracy of these assumptions may be hindering further progress on Arabic NLP tasks. Abstract: Arabic has diverse dialects, where one dialect can be substantially different from the others. In the NLP literature, some assumptions about these dialects are widely adopted (e.g., "Arabic dialects can be grouped into distinguishable regional dialects") and are manifested in different computational tasks such as Arabic Dialect Identification (ADI). However, these assumptions are not quantitatively verified. We identify four of these assumptions and examine them by extending and analyzing a multi-label dataset, where the validity of each sentence in 11 different country-level dialects is manually assessed by speakers of these dialects. Our analysis indicates that the four assumptions oversimplify reality, and some of them are not always accurate. This in turn might be hindering further progress in different Arabic NLP tasks.

[184] Representative Language Generation

Charlotte Peale,Vinod Raman,Omer Reingold

Main category: cs.CL

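TL;DR: The paper proposes "representative generation," extending the theoretical framework for generative models to address diversity and bias. By introducing the "group closure dimension" and analyzing feasibility for infinite hypothesis classes, it provides a theoretical foundation for more representative generative models.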

Details

Motivation: Address diversity and bias shortcomings in generative models by extending the existing theoretical framework for generation. Method: Introduces the notion of "representative generation" and the "group closure dimension," and analyzes information-theoretic and computational feasibility. Result: Shows that representative generation in the limit is feasible for countably infinite hypothesis classes under certain conditions, but proves that it is not computable using only membership queries. Conclusion: Provides theoretical support for developing more diverse and representative generative models. Abstract: We introduce "representative generation," extending the theoretical framework for generation proposed by Kleinberg et al. (2024) and formalized by Li et al. (2024), to additionally address diversity and bias concerns in generative models. Our notion requires outputs of a generative model to proportionally represent groups of interest from the training data. We characterize representative uniform and non-uniform generation, introducing the "group closure dimension" as a key combinatorial quantity. For representative generation in the limit, we analyze both information-theoretic and computational aspects, demonstrating feasibility for countably infinite hypothesis classes and collections of groups under certain conditions, but proving a negative result for computability using only membership queries. This contrasts with Kleinberg et al.'s (2024) positive results for standard generation in the limit. Our findings provide a rigorous foundation for developing more diverse and representative generative models.

[185] Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries

Vishakh Padmakumar,Zichao Wang,David Arbour,Jennifer Healey

Main category: cs.CL

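TL;DR: The paper proposes a simple way to improve the source coverage of large language models in multi-document summarization: stepwise content selection (extract, select, rewrite).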

Details

Motivation: Large language models attend unevenly to long contexts (the "lost in the middle" phenomenon), which makes it hard for them to cover diverse source material in multi-document summarization. Method: Splits summarization into three steps: (1) extract key points from the documents; (2) select diverse key points with a DPP; (3) rewrite them into the final summary. Prompting steps are combined with a principled selection technique to improve coverage. Result: On the DiverseSumm benchmark, the method consistently improves source coverage across different LLMs and can also generate personalized summaries. Conclusion: Stepwise content selection combined with DPPs improves LLM performance on multi-document summarization and supports personalization. Abstract: While large language models (LLMs) are increasingly capable of handling longer contexts, recent work has demonstrated that they exhibit the "lost in the middle" phenomenon (Liu et al., 2024) of unevenly attending to different parts of the provided context. This hinders their ability to cover diverse source material in multi-document summarization, as noted in the DiverseSumm benchmark (Huang et al., 2024). In this work, we contend that principled content selection is a simple way to increase source coverage on this task. As opposed to prompting an LLM to perform the summarization in a single step, we explicitly divide the task into three steps -- (1) reducing document collections to atomic key points, (2) using determinantal point processes (DPP) to select key points that prioritize diverse content, and (3) rewriting to the final summary. By combining prompting steps, for extraction and rewriting, with principled techniques, for content selection, we consistently improve source coverage on the DiverseSumm benchmark across various LLMs. Finally, we also show that by incorporating relevance to a provided user intent into the DPP kernel, we can generate personalized summaries that cover relevant source information while retaining coverage.
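
The selection step (2) can be pictured with a small greedy sketch: given a positive semidefinite kernel over key points, repeatedly add the item that most increases the determinant of the selected submatrix, which favors items dissimilar to those already chosen. This is a generic greedy MAP approximation for DPPs, not necessarily the inference procedure used in the paper, and the kernel values below are made up for illustration.

```python
from typing import List


def det(matrix: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular (or numerically singular) matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def greedy_dpp(kernel: List[List[float]], k: int) -> List[int]:
    """Greedily pick up to k items, each time maximizing det of the chosen submatrix."""
    selected: List[int] = []
    for _ in range(min(k, len(kernel))):
        best_item, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            subset = selected + [i]
            d = det([[kernel[r][c] for c in subset] for r in subset])
            if d > best_det:
                best_item, best_det = i, d
        if best_item is None:  # every remaining item is redundant
            break
        selected.append(best_item)
    return selected


# Hypothetical kernel: key points 0 and 1 are near-duplicates, 2 is distinct.
L = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 0.8],
]
print(greedy_dpp(L, 2))
```

A common way to fold relevance into such a kernel is the quality-diversity form L_ij = q_i * s_ij * q_j, where q scores relevance and s is similarity; whether the paper uses exactly this form is not stated in the abstract.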

[186] Evaluating the Retrieval Robustness of Large Language Models

Shuyang Cao,Karthik Radhakrishnan,David Rosenberg,Steven Lu,Pengxiang Cheng,Lu Wang,Shiyue Zhang

Main category: cs.CL

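TL;DR: The paper studies the robustness of retrieval-augmented generation (RAG) in practical setups, asking whether RAG is always better than non-RAG, whether more retrieved documents always improve performance, and whether document order affects results.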

Details

Motivation: Evaluate the robustness of RAG in practical setups, where imperfect retrieval and the model's limited ability to use retrieved content can degrade performance. Method: Builds a benchmark of 1500 open-domain questions, introduces three robustness metrics, and evaluates 11 LLMs with 3 prompting strategies. Result: All LLMs show high retrieval robustness, but varying degrees of imperfect robustness keep them from fully exploiting RAG. Conclusion: RAG is fairly robust in practical setups but still needs improvement to reach its full potential. Abstract: Retrieval-augmented generation (RAG) generally enhances large language models' (LLMs) ability to solve knowledge-intensive tasks. But RAG may also lead to performance degradation due to imperfect retrieval and the model's limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; and (3) whether document order impacts results. To facilitate this study, we establish a benchmark of 1500 open-domain questions, each with retrieved documents from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit surprisingly high retrieval robustness; nonetheless, different degrees of imperfect robustness hinder them from fully utilizing the benefits of RAG.

[187] EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse

Tianyu Guo,Hande Dong,Yichong Leng,Feng Liu,Cheater Lin,Nong Xiao,Xianwei Zhang

Main category: cs.CL

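TL;DR: EFIM is an improved prompt format that, by enabling better KV cache reuse and adding a fragment tokenization training method, substantially improves the serving efficiency of large language models on infilling tasks.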

Details

Motivation: In infilling tasks, KV cache reuse is limited by the structure of the prompt format, which makes computation inefficient. Method: Proposes the EFIM prompt format to improve KV cache reuse and introduces a fragment tokenization training method to address subtoken generation problems. Result: Experiments show that EFIM lowers latency by 52% and improves throughput by 98% while preserving the original infilling capability. Conclusion: EFIM substantially improves the efficiency of infilling tasks and has practical value. Abstract: Large language models (LLMs) are often used for infilling tasks, which involve predicting or generating missing information in a given text. These tasks typically require multiple interactions with similar context. To reduce the computation of repeated historical tokens, cross-request key-value (KV) cache reuse, a technique that stores and reuses intermediate computations, has become a crucial method in multi-round interactive services. However, in infilling tasks, the KV cache reuse is often hindered by the structure of the prompt format, which typically consists of a prefix and suffix relative to the insertion point. Specifically, the KV cache of the prefix or suffix part is frequently invalidated as the other part (suffix or prefix) is incrementally generated. To address the issue, we propose EFIM, a transformed prompt format of FIM to unleash the performance potential of KV cache reuse. Although the transformed prompt can solve the inefficiency, it exposes subtoken generation problems in current LLMs, where they have difficulty generating partial words accurately. Therefore, we introduce a fragment tokenization training method which splits text into multiple fragments before tokenization during data processing. Experiments on two representative LLMs show that LLM serving with EFIM can lower the latency by 52% and improve the throughput by 98% while maintaining the original infilling capability. EFIM's source code is publicly available at https://github.com/gty111/EFIM.
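
The cache-invalidation problem described above can be illustrated with a toy sketch. It assumes a cache that can only reuse KV entries for the exact leading tokens two requests share, and a prefix-suffix-middle prompt layout with placeholder sentinel tokens (`<PRE>`, `<SUF>`, `<MID>` are stand-ins, not any model's real vocabulary). It shows the problem only and does not reproduce the EFIM format itself.

```python
from typing import List


def shared_prefix_len(old: List[str], new: List[str]) -> int:
    """Number of leading tokens two prompts share (the reusable KV entries)."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n


def fim_prompt(prefix: List[str], suffix: List[str]) -> List[str]:
    """Prefix-suffix-middle layout with placeholder sentinel tokens."""
    return ["<PRE>", *prefix, "<SUF>", *suffix, "<MID>"]


prefix = ["def", "add", "(", "a", ",", "b", ")", ":"]
suffix = ["return", "result"]
first = fim_prompt(prefix, suffix)
# The user accepts a completion, so the prefix grows by three tokens.
second = fim_prompt(prefix + ["result", "=", "a"], suffix)

reused = shared_prefix_len(first, second)
print(f"reusable tokens: {reused} of {len(second)}")
```

Because the suffix sits after the part of the prompt that changed, none of its cached entries line up with the new request, even though the suffix text itself is identical.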

[188] Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development

Rennai Qiu,Chen Qian,Ran Li,Yufan Dang,Weize Chen,Cheng Yang,Yingli Zhang,Ye Tian,Xuantang Xiong,Lei Han,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

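TL;DR: Proposes Co-Saving, a resource-aware multi-agent system that introduces "shortcuts" to cut redundant reasoning, substantially reducing resource consumption and improving code quality.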

Details

Motivation: Existing multi-agent systems consume many resources and are inefficient on complex tasks. Method: Uses "shortcuts" learned from historically successful trajectories to streamline agent collaboration and reduce redundant reasoning. Result: Compared with ChatDev, token usage falls by 50.85% on average and code quality improves by 10.06%. Conclusion: Co-Saving substantially improves resource efficiency and task completion quality. Abstract: Recent advancements in Large Language Models (LLMs) and autonomous agents have demonstrated remarkable capabilities across various domains. However, standalone agents frequently encounter limitations when handling complex tasks that demand extensive interactions and substantial computational resources. Although Multi-Agent Systems (MAS) alleviate some of these limitations through collaborative mechanisms like task decomposition, iterative communication, and role specialization, they typically remain resource-unaware, incurring significant inefficiencies due to high token consumption and excessive execution time. To address these limitations, we propose a resource-aware multi-agent system -- Co-Saving (meaning that multiple agents collaboratively engage in resource-saving activities), which leverages experiential knowledge to enhance operational efficiency and solution quality. Our key innovation is the introduction of "shortcuts" -- instructional transitions learned from historically successful trajectories -- which allow redundant reasoning agents to be bypassed and expedite the collective problem-solving process. Experiments for software development tasks demonstrate significant advantages over existing methods. Specifically, compared to the state-of-the-art MAS ChatDev, our method achieves an average reduction of 50.85% in token usage, and improves the overall code quality by 10.06%.

[189] Beyond Completion: A Foundation Model for General Knowledge Graph Reasoning

Yin Hua,Zhiqiang Liu,Mingyang Chen,Zheng Fang,Chi Man Wong,Lingxiao Li,Chi Man Vong,Huajun Chen,Wen Zhang

Main category: cs.CL

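TL;DR: MERRY is a foundation model for general knowledge graph reasoning that combines structural and textual information, improving performance on tasks both inside and outside the knowledge graph.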

Details

Motivation: Existing foundation models focus mainly on the structural information in knowledge graphs and neglect textual information, which limits their use on out-of-KG tasks. Method: Proposes a multi-perspective Conditional Message Passing (CMP) encoding architecture, a dynamic residual fusion module, and a flexible edge scoring mechanism to integrate textual and structural information. Result: Evaluations on 28 datasets show that MERRY outperforms existing baselines in most scenarios, with particularly strong results on tasks such as KG question answering (KGQA). Conclusion: MERRY shows strong in-KG reasoning ability and excellent generalization to out-of-KG tasks. Abstract: In natural language processing (NLP) and computer vision (CV), the successful application of foundation models across diverse tasks has demonstrated their remarkable potential. However, despite the rich structural and textual information embedded in knowledge graphs (KGs), existing research on foundation models for KGs has primarily focused on their structural aspects, with most efforts restricted to in-KG tasks (e.g., knowledge graph completion, KGC). This limitation has hindered progress in addressing more challenging out-of-KG tasks. In this paper, we introduce MERRY, a foundation model for general knowledge graph reasoning, and investigate its performance across two task categories: in-KG reasoning tasks (e.g., KGC) and out-of-KG tasks (e.g., KG question answering, KGQA). We utilize not only the structural information but also the textual information in KGs. Specifically, we propose a multi-perspective Conditional Message Passing (CMP) encoding architecture to bridge the gap between textual and structural modalities, enabling their seamless integration. Additionally, we introduce a dynamic residual fusion module to selectively retain relevant textual information and a flexible edge scoring mechanism to adapt to diverse downstream tasks. Comprehensive evaluations on 28 datasets demonstrate that MERRY outperforms existing baselines in most scenarios, showcasing strong reasoning capabilities within KGs and excellent generalization to out-of-KG tasks such as KGQA.

[190] RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

Zeyi Liao,Jaylen Jones,Linxi Jiang,Eric Fosler-Lussier,Yu Su,Zhiqiang Lin,Huan Sun

Main category: cs.CL

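TL;DR: RedTeamCUA is an adversarial testing framework for evaluating the vulnerability of computer-use agents (CUAs) to indirect prompt injection in hybrid web-OS environments; it reveals high attack success rates for current CUAs.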

Details

Motivation: Current evaluations of indirect prompt injection against CUAs lack realistic yet controlled environments and ignore hybrid web-OS attack scenarios. Method: Proposes the RedTeamCUA framework, whose hybrid sandbox combines a VM-based OS environment with Docker-based web platforms and supports flexible adversarial scenario configuration and starting tests directly at the injection point. Result: Claude 3.7 Sonnet | CUA shows an ASR of 42.9%, and Claude 4 Opus | CUA reaches an ASR of 48% in realistic end-to-end settings, indicating that even advanced CUAs have significant vulnerabilities. Conclusion: RedTeamCUA provides a key tool for systematic analysis of CUA vulnerabilities and underscores the urgency of strengthening defenses before deployment. Abstract: Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection. Current evaluations of this threat either lack support for realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring a novel hybrid sandbox that integrates a VM-based OS environment with Docker-based web platforms. Our sandbox supports key features tailored for red teaming, such as flexible adversarial scenario configuration, and a setting that decouples adversarial evaluation from navigational limitations of CUAs by initializing tests directly at the point of an adversarial injection. Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities. Benchmarking current frontier CUAs identifies significant vulnerabilities: Claude 3.7 Sonnet | CUA demonstrates an ASR of 42.9%, while Operator, the most secure CUA evaluated, still exhibits an ASR of 7.6%. Notably, CUAs often attempt to execute adversarial tasks with an Attempt Rate as high as 92.5%, although failing to complete them due to capability limitations. Nevertheless, we observe concerning ASRs of up to 50% in realistic end-to-end settings, with the recently released frontier Claude 4 Opus | CUA showing an alarming ASR of 48%, demonstrating that indirect prompt injection presents tangible risks for even advanced CUAs despite their capabilities and safeguards. Overall, RedTeamCUA provides an essential framework for advancing realistic, controlled, and systematic analysis of CUA vulnerabilities, highlighting the urgent need for robust defenses to indirect prompt injection prior to real-world deployment.
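
To make the gap between Attempt Rate and ASR concrete, here is a minimal sketch that scores a list of test episodes. It assumes each episode is labeled with whether the agent attempted the injected task and whether it completed it, and that both rates are taken over all episodes; the `Episode` type and these definitions are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Episode:
    attempted: bool  # the agent started pursuing the injected task
    succeeded: bool  # the agent completed the injected task


def attack_rates(episodes: Iterable[Episode]) -> Tuple[float, float]:
    """Return (attempt_rate, attack_success_rate) as fractions of all episodes."""
    runs = list(episodes)
    if not runs:
        raise ValueError("no episodes to score")
    attempt_rate = sum(e.attempted for e in runs) / len(runs)
    success_rate = sum(e.succeeded for e in runs) / len(runs)
    return attempt_rate, success_rate


# Toy data: three attempts, only one of which is carried through.
runs = [
    Episode(attempted=True, succeeded=True),
    Episode(attempted=True, succeeded=False),
    Episode(attempted=True, succeeded=False),
    Episode(attempted=False, succeeded=False),
]
ar, asr = attack_rates(runs)
print(f"Attempt Rate = {ar:.1%}, ASR = {asr:.1%}")
```

A large gap between the two numbers corresponds to the case the abstract describes, where agents try to follow the injection but fail for capability reasons.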

[191] Graph-Assisted Culturally Adaptable Idiomatic Translation for Indic Languages

Pratik Rakesh Singh,Kritarth Prasad,Mohammadi Zaki,Pankaj Wasnik

Main category: cs.CL

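TL;DR: Proposes IdiomCE, a method based on an adaptive graph neural network (GNN), to handle cultural differences in translating multi-word expressions and idioms, substantially improving translation quality.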

Details

Motivation: Traditional static knowledge graphs and prompt-based approaches struggle to capture the complex relationships involved in idiomatic translation, leading to poor translations. Method: Uses IdiomCE, an adaptive GNN-based method that learns complex mappings between idiomatic expressions and generalizes well. Result: Evaluated on multiple idiomatic translation datasets, the method significantly improves idiom translation from English to Indian languages. Conclusion: IdiomCE improves translation quality even in resource-constrained settings and is suitable for smaller models. Abstract: Translating multi-word expressions (MWEs) and idioms requires a deep understanding of the cultural nuances of both the source and target languages. This challenge is further amplified by the one-to-many nature of idiomatic translations, where a single source idiom can have multiple target-language equivalents depending on cultural references and contextual variations. Traditional static knowledge graphs (KGs) and prompt-based approaches struggle to capture these complex relationships, often leading to suboptimal translations. To address this, we propose IdiomCE, an adaptive graph neural network (GNN) based methodology that learns intricate mappings between idiomatic expressions, effectively generalizing to both seen and unseen nodes during training. Our proposed method enhances translation quality even in resource-constrained settings, facilitating improved idiomatic translation in smaller models. We evaluate our approach on multiple idiomatic translation datasets using reference-less metrics, demonstrating significant improvements in translating idioms from English to various Indian languages.

[192] RISE: Reasoning Enhancement via Iterative Self-Exploration in Multi-hop Question Answering

Bolei He,Xinran He,Mengke Chen,Xianwei Xue,Ying Zhu,Zhenhua Ling

Main category: cs.CL

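TL;DR: The RISE framework uses iterative self-exploration to improve large language models on complex reasoning tasks such as multi-hop question answering.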

Details

Motivation: Large language models perform poorly on complex reasoning tasks such as multi-hop question answering, and retrieval-augmented generation struggles with noise filtering and incomplete evidence retrieval. Method: RISE comprises three steps (question decomposition, retrieve-then-read, and self-critique) and refines reasoning paths through iterative self-exploration. Result: Experiments show that RISE significantly improves reasoning accuracy and task performance on multi-hop QA benchmarks. Conclusion: Iterative self-exploration effectively improves model ability on complex reasoning tasks. Abstract: Large Language Models (LLMs) excel in many areas but continue to face challenges with complex reasoning tasks, such as Multi-Hop Question Answering (MHQA). MHQA requires integrating evidence from diverse sources while managing intricate logical dependencies, which often leads to errors in reasoning. Retrieval-Augmented Generation (RAG), widely employed in MHQA tasks, faces challenges in effectively filtering noisy data and retrieving all necessary evidence, thereby limiting its effectiveness in addressing MHQA challenges. To address these challenges, we propose RISE: Reasoning Enhancement via Iterative Self-Exploration, a novel framework designed to enhance models' reasoning capability through iterative self-exploration. Specifically, RISE involves three key steps in addressing MHQA tasks: question decomposition, retrieve-then-read, and self-critique. By leveraging continuous self-exploration, RISE identifies accurate reasoning paths, iteratively self-improving the model's capability to integrate evidence, maintain logical consistency, and enhance performance in MHQA tasks. Extensive experiments on multiple MHQA benchmarks demonstrate that RISE significantly improves reasoning accuracy and task performance.

[193] Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation

Ashim Gupta,Vivek Srikumar

Main category: cs.CL

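TL;DR: Test-time scaling via repeated sampling works well on reasoning tasks, but its effect on multilingual generation has been underexplored. The study shows consistent quality improvements on multilingual benchmarks, with gains exceeding 35% in some cases.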

Details

Motivation: Explore whether repeated sampling, which has shown promise on reasoning tasks, is also effective for multilingual generation. Method: Evaluates repeated sampling with perplexity-based and reward-based verifiers on two multilingual benchmarks (the Aya Evaluation Suite and m-ArenaHard). Result: Quality improves consistently, with gains exceeding 35% in some cases. Perplexity-based verifiers work for open-ended tasks, while only reward-based verifiers help on tasks requiring reasoning (e.g., math, code). Conclusion: Repeated sampling is broadly useful for multilingual text generation, and choosing a verifier that fits the task is essential. Abstract: Inference-time scaling via repeated sampling has shown promise in reasoning tasks, but its effectiveness in multilingual generation remains underexplored. We evaluate this approach using perplexity- and reward-based verifiers on two multilingual benchmarks: the Aya Evaluation Suite and m-ArenaHard. Our results show consistent quality improvements, with gains exceeding 35% in some cases. While perplexity-based scoring is effective for open-ended prompts, only reward-based verifiers improve performance on tasks requiring reasoning (e.g., math, code). Our results demonstrate the broader utility of repeated sampling for multilingual text generation and underscore the importance of selecting the right verifiers for the task.
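
A minimal sketch of repeated sampling with a perplexity-based verifier follows: draw several candidate generations, score each by the perplexity of its token log-probabilities, and keep the lowest. It assumes the sampler already returns per-token natural-log probabilities; the function names and toy values are illustrative, and a reward-based verifier would simply replace the scoring function.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def best_of_n(samples: List[Tuple[str, List[float]]]) -> str:
    """Return the sampled text whose perplexity is lowest."""
    text, _ = min(samples, key=lambda s: perplexity(s[1]))
    return text


# Toy candidates: (generated text, per-token log-probabilities).
samples = [
    ("candidate A", [-1.0, -1.0]),
    ("candidate B", [-0.1, -0.3]),
    ("candidate C", [-2.0]),
]
print(best_of_n(samples))
```

Averaging over tokens keeps the score from favoring short outputs merely because they have fewer terms in the sum.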

[194] Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning

Qihuang Zhong,Liang Ding,Fei Liao,Juhua Liu,Bo Du,Dacheng Tao

Main category: cs.CL

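TL;DR: The paper proposes KDS, a knowledge-aware data selection framework that improves data selection for domain-specific instruction tuning by resolving knowledge conflicts, thereby improving model performance.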

Details

Motivation: Current data selection methods perform poorly for domain-specific instruction tuning, mainly because they ignore knowledge conflicts, which degrade model performance and cause hallucination. Method: KDS quantifies knowledge conflicts with two knowledge-aware metrics (context-memory knowledge alignment and intra-memory knowledge consistency) and selects high-quality, diverse data. Result: In medical-domain experiments, KDS clearly outperforms the other baselines, improving performance and generalization while alleviating hallucination. Conclusion: KDS is a simple and effective method that substantially improves domain-specific instruction tuning and has broad potential. Abstract: Domain-specific instruction-tuning has become the de facto standard for improving the performance of large language models (LLMs) in specialized applications, e.g., medical question answering. Since the instruction-tuning dataset might contain redundant or low-quality data, data selection (DS) is usually required to maximize the data efficiency. Despite the successes in the general domain, current DS methods often struggle to select the desired data for domain-specific instruction-tuning. One of the main reasons is that they neglect the impact of knowledge conflicts, i.e., the discrepancy between LLMs' pretrained knowledge and context knowledge of instruction data, which could damage LLMs' prior abilities and lead to hallucination. To this end, we propose a simple-yet-effective Knowledge-aware Data Selection (namely KDS) framework to select the domain-specific instruction-tuning data that meets LLMs' actual needs. The core of KDS is to leverage two knowledge-aware metrics for quantitatively measuring knowledge conflicts from two aspects: context-memory knowledge alignment and intra-memory knowledge consistency. By filtering the data with large knowledge conflicts and sampling the high-quality and diverse data, KDS can effectively stimulate the LLMs' abilities and achieve better domain-specific performance. Taking the medical domain as the testbed, we conduct extensive experiments and empirically prove that KDS surpasses the other baselines and brings significant and consistent performance gains among all LLMs. More encouragingly, KDS effectively improves the model generalization and alleviates the hallucination problem.

[195] LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents

Taro Yano,Yoichi Ishibashi,Masafumi Oyamada

Main category: cs.CL

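TL;DR: LaMDAgent is a framework that uses LLM agents to construct and optimize post-training pipelines automatically, improving task performance while reducing human intervention.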

Details

Motivation: Existing post-training approaches are mostly designed by hand or optimize a single component; automating the full pipeline remains underexplored. Method: LLM agents systematically explore model generation techniques, datasets, and hyperparameter configurations, and optimize the pipeline based on task feedback. Result: Experiments show that LaMDAgent improves tool-use accuracy by 9.0 points and discovers effective strategies that conventional approaches overlook. Conclusion: LaMDAgent offers an efficient solution for automating post-training pipelines, and scaling data size makes pipeline discovery more cost-effective. Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks. To further tailor LLMs to specific domains or applications, post-training techniques such as Supervised Fine-Tuning (SFT), Preference Learning, and model merging are commonly employed. While each of these methods has been extensively studied in isolation, the automated construction of complete post-training pipelines remains an underexplored area. Existing approaches typically rely on manual design or focus narrowly on optimizing individual components, such as data ordering or merging strategies. In this work, we introduce LaMDAgent (short for Language Model Developing Agent), a novel framework that autonomously constructs and optimizes full post-training pipelines through the use of LLM-based agents. LaMDAgent systematically explores diverse model generation techniques, datasets, and hyperparameter configurations, leveraging task-based feedback to discover high-performing pipelines with minimal human intervention. Our experiments show that LaMDAgent improves tool-use accuracy by 9.0 points while preserving instruction-following capabilities. Moreover, it uncovers effective post-training strategies that are often overlooked by conventional human-driven exploration. We further analyze the impact of data and model size scaling to reduce computational costs on the exploration, finding that scaling model size introduces new challenges, whereas scaling data size enables cost-effective pipeline discovery.

[196] Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack

Juan Ren,Mark Dras,Usman Naseem

Main category: cs.CL

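TL;DR: The paper analyzes security vulnerabilities in large vision-language models (LVLMs), proposes a two-stage evaluation framework for quantifying the effect of adversarial attacks, and defines a normative schema for idealized model behavior.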

Details

Motivation: LVLMs perform well on multimodal tasks, but visual inputs enlarge the attack surface and expose new security vulnerabilities. The study aims to explain how conventional adversarial attacks bypass LVLM safety mechanisms. Method: Through systematic representational analysis, proposes a two-stage evaluation framework: the first stage distinguishes instruction non-compliance, outright refusal, and successful attacks; the second quantifies how far the output fulfills the harmful intent and categorizes refusal behavior. Result: The study reveals how LVLMs behave under adversarial attack and proposes a normative schema for idealized behavior, giving a target for safety alignment in multimodal systems. Conclusion: The paper offers a systematic analysis of LVLM security vulnerabilities together with an evaluation framework and an idealized behavior schema that can help improve the safety of multimodal systems. Abstract: Large Vision-Language Models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, their integration of visual inputs introduces expanded attack surfaces, thereby exposing them to novel security vulnerabilities. In this work, we conduct a systematic representational analysis to uncover why conventional adversarial attacks can circumvent the safety mechanisms embedded in LVLMs. We further propose a novel two-stage evaluation framework for adversarial attacks on LVLMs. The first stage differentiates among instruction non-compliance, outright refusal, and successful adversarial exploitation. The second stage quantifies the degree to which the model's output fulfills the harmful intent of the adversarial prompt, while categorizing refusal behavior into direct refusals, soft refusals, and partial refusals that remain inadvertently helpful. Finally, we introduce a normative schema that defines idealized model behavior when confronted with harmful prompts, offering a principled target for safety alignment in multimodal systems.

[197] Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset

Fakhraddin Alwajih,Samar Mohamed Magdy,Abdellah El Mekki,Omer Nacar,Youssef Nafea,Safaa Taher Abdelfadil,Abdulfattah Mohammed Yahya,Hamzah Luqman,Nada Almarwani,Samah Aloufi,Baraah Qawasmeh,Houdaifa Atou,Serry Sibaee,Hamzah A. Alsayadi,Walid Al-Dhabyani,Maged S. Al-shaibani,Aya El aatar,Nour Qandos,Rahaf Alhamouri,Samar Ahmad,Razan Khassib,Lina Hamad,Mohammed Anwar AL-Ghrawi,Fatimah Alshamari,Cheikh Malainine,Doaa Qawasmeh,Aminetou Yacoub,Tfeil moilid,Ruwa AbuHweidi,Ahmed Aboeitta,Vatimetou Mohamed Lemin,Reem Abdel-Salam,Ahlam Bashiti,Adel Ammar,Aisha Alansari,Ahmed Ashraf,Nora Alturayeif,Sara Shatnawi,Alcides Alcoba Inciarte,AbdelRahim A. Elmadany,Mohamedou cheikh tourad,Ismail Berrada,Mustafa Jarrar,Shady Shehata,Muhammad Abdul-Mageed

Main category: cs.CL

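TL;DR: The paper introduces Pearl, a large-scale multimodal dataset and benchmark for Arabic cultural understanding, aimed at the cultural bias in mainstream vision-language models.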

Details

Motivation: Mainstream large vision-language models (LVLMs) encode cultural biases, and diverse multimodal datasets are needed to address them. Method: Pearl was built through advanced agentic workflows and human annotation by 45 annotators from across the Arab world; it contains K multimodal examples spanning ten cultural domains. Result: Evaluations show that reasoning-centric instruction alignment substantially improves models' cultural understanding compared with conventional scaling methods. Conclusion: Pearl is a foundational resource for research on culturally aware multimodal modeling; the datasets and benchmarks are publicly available. Abstract: Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce Pearl, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 45 annotators from across the Arab world, Pearl comprises over K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks, Pearl and Pearl-Lite, along with a specialized subset, Pearl-X, explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models' cultural grounding compared to conventional scaling methods. Pearl establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.

[198] Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data

Jihong Zhang,Xinya Liang,Anqi Deng,Nicole Bonge,Lin Tan,Ling Zhang,Nicole Zarrett

Main category: cs.CL

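TL;DR: The study examines whether large language models (LLMs) informed by interview data can predict human survey responses; LLMs capture overall patterns but show lower variability, and prompt design and interview content strongly affect the results.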

Details

Motivation: In mixed methods research, integrating quantitative and qualitative data is challenging; LLMs offer a potential solution by generating synthetic survey responses grounded in qualitative data. Method: Using the BREQ questionnaire and interview data, the study tests how well LLMs predict human responses under different prompt designs and interview content. Result: LLMs capture overall response patterns but with lower variability; prompt design and interview content have a marked effect, while demographic information matters less. Conclusion: LLMs have the potential to bridge qualitative and quantitative methods, but prompt design, bias mitigation, and model settings need refinement to improve data validity. Abstract: Mixed methods research integrates quantitative and qualitative data but faces challenges in aligning their distinct structures, particularly in examining measurement characteristics and individual response patterns. Advances in large language models (LLMs) offer promising solutions by generating synthetic survey responses informed by qualitative data. This study investigates whether LLMs, guided by personal interviews, can reliably predict human survey responses, using the Behavioral Regulations in Exercise Questionnaire (BREQ) and interviews from after-school program staff as a case study. Results indicate that LLMs capture overall response patterns but exhibit lower variability than humans. Incorporating interview data improves response diversity for some models (e.g., Claude, GPT), while well-crafted prompts and low-temperature settings enhance alignment between LLM and human responses. Demographic information had less impact than interview content on alignment accuracy. These findings underscore the potential of interview-informed LLMs to bridge qualitative and quantitative methodologies while revealing limitations in response variability, emotional interpretation, and psychometric fidelity. Future research should refine prompt design, explore bias mitigation, and optimize model settings to enhance the validity of LLM-generated survey data in social science research.

[199] Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate

Ashim Gupta,Maitrey Mehta,Zhichao Xu,Vivek Srikumar

Main category: cs.CL

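TL;DR: The paper proposes a framework for evaluating the cross-lingual consistency of large language models (LLMs) and reveals pronounced inconsistencies and deficits in their multilingual capabilities.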

Details

Motivation: Evaluating LLMs on multilingual tasks usually requires expensive datasets, and evaluating open-ended generation is complex. The authors therefore propose a translate-then-evaluate strategy to simplify multilingual consistency evaluation. Method: A "Translate then Evaluate" framework assesses cross-lingual consistency along two dimensions: information and empathy. Result: Popular LLMs show pronounced inconsistencies across 30 languages, with severe performance deficits for certain language families and scripts. Conclusion: Multilingual LLM evaluation needs to consider consistency along multiple dimensions; the authors invite practitioners to use the framework for future benchmarking. Abstract: Large language models (LLMs) provide detailed and impressive responses to queries in English. However, are they really consistent in responding to the same query in other languages? The popular way of evaluating the multilingual performance of LLMs requires expensive-to-collect annotated datasets. Further, evaluating tasks like open-ended generation, where multiple correct answers may exist, is nontrivial. Instead, we propose to evaluate the predictability of model responses across different languages. In this work, we propose a framework to evaluate LLMs' cross-lingual consistency based on a simple Translate then Evaluate strategy. We instantiate this evaluation framework along two dimensions of consistency: information and empathy. Our results reveal pronounced inconsistencies in popular LLM responses across thirty languages, with severe performance deficits in certain language families and scripts, underscoring critical weaknesses in their multilingual capabilities. These findings necessitate cross-lingual evaluations that are consistent along multiple dimensions. We invite practitioners to use our framework for future multilingual LLM benchmarking.

[200] Legal Assist AI: Leveraging Transformer-Based Model for Effective Legal Assistance

Jatin Gupta,Akhil Sharma,Saransh Singhania,Ali Imam Abidi

Main category: cs.CL

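TL;DR: This paper introduces Legal Assist AI, a transformer-based model that uses large language models (LLMs) to provide effective legal assistance to Indian citizens and to close the gap in access to legal information.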

Details

Motivation: Many Indian citizens cannot exercise their legal rights because of limited legal awareness and poor access to legal information, so an efficient legal assistance tool is needed. Method: The model retrieves legal information from a curated database and generates accurate answers; it is fine-tuned on large datasets from the Indian legal domain (the Indian Constitution, BNS, BNSS, and others). Result: The model scores 60.08% on the AIBE, outperforming GPT-3.5 Turbo and Mistral 7B, performs strongly in legal reasoning and accuracy, and avoids hallucination issues. Conclusion: Legal Assist AI shows potential for real-world legal scenarios; future work will expand the dataset and improve performance to cover more multilingual and case-specific queries. Abstract: The pursuit of accessible legal assistance in India faces a critical gap, as many citizens struggle to leverage their legal rights due to limited awareness and access to relevant legal information. This paper introduces Legal Assist AI, a transformer-based model designed to bridge this gap by offering effective legal assistance through large language models (LLMs). The system retrieves relevant legal information from a curated database and generates accurate responses, enabling effective assistance for diverse users, including legal professionals, scholars, and the general public. The model was fine-tuned on extensive datasets from the Indian legal domain, including the Indian Constitution, Bharatiya Nyaya Sanhita (BNS), Bharatiya Nagarik Suraksha Sanhita (BNSS) and so forth, providing a robust understanding of the complexities of Indian law. By incorporating domain-specific legal datasets, the proposed model demonstrated remarkable efficiency and specialization in legal Question-Answering. The model was evaluated against state-of-the-art models such as GPT-3.5 Turbo and Mistral 7B, achieving a 60.08% score on the AIBE, outperforming its competitors in legal reasoning and accuracy. Unlike other models, Legal Assist AI avoided common issues such as hallucinations, making it highly reliable for practical legal applications. It showcases the model's applicability in real-world legal scenarios, with future iterations aiming to enhance performance and expand its dataset to cover a broader range of multilingual and case-specific queries as well.

[201] CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models

Siqi Fan,Peng Han,Shuo Shang,Yequan Wang,Aixin Sun

Main category: cs.CL

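TL;DR: The paper proposes CoThink, which dynamically adjusts reasoning depth to reduce overthinking by large language models (LLMs) on simple problems, markedly lowering the number of generated tokens while maintaining accuracy.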

Details

Motivation: Reasoning-optimized LLMs overthink simple problems, producing verbose outputs and low token efficiency. The study aims to address this. Method: CoThink: an instruct model first drafts a solution outline, and a reasoning model then works out the details, so that reasoning depth adjusts dynamically. Result: Across three datasets, CoThink reduces token generation by 22.3% while keeping pass@1 accuracy within a 0.42% margin on average. Conclusion: CoThink effectively improves reasoning efficiency and may point to a reasoning efficiency scaling law in LLMs. Abstract: Large language models (LLMs) benefit from increased test-time compute, a phenomenon known as test-time scaling. However, reasoning-optimized models often overthink even simple problems, producing excessively verbose outputs and leading to low token efficiency. By comparing these models with equally sized instruct models, we identify two key causes of this verbosity: (1) reinforcement learning reduces the information density of forward reasoning, and (2) backward chain-of-thought training encourages redundant and often unnecessary verification steps. Since LLMs cannot assess the difficulty of a given problem, they tend to apply the same cautious reasoning strategy across all tasks, resulting in inefficient overthinking. To address this, we propose CoThink, an embarrassingly simple pipeline: an instruct model first drafts a high-level solution outline; a reasoning model then works out the solution. We observe that CoThink enables dynamic adjustment of reasoning depth based on input difficulty. Evaluated with three reasoning models (DAPO, DeepSeek-R1, and QwQ) on three datasets (GSM8K, MATH500, and AIME24), CoThink reduces total token generation by 22.3% while maintaining pass@1 accuracy within a 0.42% margin on average. With reference to the instruct model, we formally define reasoning efficiency and observe a potential reasoning efficiency scaling law in LLMs.

[202] Improving Continual Pre-training Through Seamless Data Packing

Ruicheng Yin,Xuan Gao,Changze Lv,Xiaohua Wang,Xiaoqing Zheng,Xuanjing Huang

Main category: cs.CL

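TL;DR: The paper proposes Seamless Packing (SP), a new data packing strategy that uses a sliding window and the First-Fit-Decreasing algorithm to reduce truncation and context discontinuity, markedly improving model performance.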

Details

Motivation: Conventional data packing in continual pre-training tends to cause truncation and context discontinuity, which hurts model performance. Method: SP has two stages: (1) a sliding window synchronizes overlapping tokens to preserve contextual coherence; (2) a First-Fit-Decreasing algorithm packs shorter texts into bins slightly larger than the target length, reducing padding and truncation. Result: Across various model architectures and corpus domains, SP outperforms the baseline method in 99% of settings. Conclusion: SP addresses the problems of conventional data packing and markedly improves the performance and efficiency of continual pre-training. Abstract: Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences, ensuring better continuity and contextual coherence. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation. Empirical evaluations across various model architectures and corpus domains demonstrate the effectiveness of our method, outperforming the baseline method in 99% of all settings. Code is available at https://github.com/Infernus-WIND/Seamless-Packing.
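
The second stage of this entry can be illustrated with a minimal Python sketch of First-Fit-Decreasing packing. This is not the authors' implementation: the function name `first_fit_decreasing`, the `target_len` and `slack` values, and the choice to represent each text only by its token count are assumptions made for illustration. The sliding-window stage and the final trimming of each bin back to the target length are not shown.

```python
from typing import List


def first_fit_decreasing(lengths: List[int], capacity: int) -> List[List[int]]:
    """Pack text lengths into bins holding at most `capacity` tokens each."""
    if any(n > capacity for n in lengths):
        raise ValueError("every text must fit in one bin; split long texts first")
    bins: List[List[int]] = []  # token lengths placed in each bin
    free: List[int] = []        # remaining capacity of each bin
    # "Decreasing": visit the longest texts first.
    for n in sorted(lengths, reverse=True):
        # "First fit": use the first existing bin with enough room.
        for i, room in enumerate(free):
            if n <= room:
                bins[i].append(n)
                free[i] -= n
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([n])
            free.append(capacity - n)
    return bins


target_len = 8  # hypothetical target sequence length
slack = 2       # hypothetical extra room, so bins are slightly larger than the target
packed = first_fit_decreasing([5, 4, 3, 3, 2, 1], target_len + slack)
print(packed)
```

With these toy lengths the six texts fit into two bins of capacity 10, `[[5, 4, 1], [3, 3, 2]]`, so no text is split across sequences.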

[203] VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

Qiuchen Wang,Ruixue Ding,Yu Zeng,Zehui Chen,Lin Chen,Shihang Wang,Pengjun Xie,Fei Huang,Feng Zhao

Main category: cs.CL

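TL;DR: VRAG-RL is a reinforcement learning framework for visually rich RAG tasks that optimizes how vision-language models (VLMs) interact with search engines, improving retrieval and reasoning over multimodal information.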

Details

Motivation: Traditional text-based RAG methods cannot handle visual information, and existing vision-based RAG methods perform poorly because of fixed pipelines and limited reasoning. Reinforcement learning (RL) has been shown to improve model reasoning, which motivates the VRAG-RL framework. Method: VRAG-RL optimizes VLMs using visual perception tokens and multi-turn reasoning trajectories, defines an action space for visual inputs (such as cropping and scaling), and designs a reward that combines query rewriting and retrieval performance. Result: VRAG-RL addresses insufficient reasoning token allocation and missing visual perception in multimodal RAG, and improves retrieval when the model interacts with search engines. Conclusion: By optimizing VLMs with RL strategies, VRAG-RL markedly improves retrieval and reasoning over visually rich information and suits real-world applications. Abstract: Effectively retrieving, reasoning over, and understanding visually rich information remains a challenge for RAG methods. Traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As RL has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) Prior multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) When models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at https://github.com/Alibaba-NLP/VRAG.

[204] Jailbreak Distillation: Renewable Safety Benchmarking

Jingyu Zhang,Ahmed Elgohary,Xiawei Wang,A S M Iftekhar,Ahmed Magooda,Benjamin Van Durme,Daniel Khashabi,Kyle Jackson

Main category: cs.CL

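TL;DR: JBDistill is a novel safety benchmark construction framework that "distills" jailbreak attacks into high-quality, easily updatable safety benchmarks, addressing challenges in existing safety evaluation.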

Details

Motivation: With large language models (LLMs) being rapidly deployed in critical applications, robust safety benchmarking is urgently needed. Method: JBDistill uses a small set of development models and existing jailbreak attack algorithms to generate a candidate prompt pool, then applies prompt selection algorithms to pick an effective subset as the safety benchmark. Result: Experiments show that the benchmarks generalize to 13 diverse evaluation models and significantly outperform existing safety benchmarks while maintaining high separability and diversity. Conclusion: JBDistill provides an effective, sustainable, and adaptable solution for safety evaluation. Abstract: Large language models (LLMs) are rapidly deployed in critical applications, raising urgent needs for robust safety benchmarking. We propose Jailbreak Distillation (JBDistill), a novel benchmark construction framework that "distills" jailbreak attacks into high-quality and easily-updatable safety benchmarks. JBDistill utilizes a small set of development models and existing jailbreak attack algorithms to create a candidate prompt pool, then employs prompt selection algorithms to identify an effective subset of prompts as safety benchmarks. JBDistill addresses challenges in existing safety evaluation: the use of consistent evaluation prompts across models ensures fair comparisons and reproducibility. It requires minimal human effort to rerun the JBDistill pipeline and produce updated benchmarks, alleviating concerns on saturation and contamination. Extensive experiments demonstrate our benchmarks generalize robustly to 13 diverse evaluation models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, significantly outperforming existing safety benchmarks in effectiveness while maintaining high separability and diversity. Our framework thus provides an effective, sustainable, and adaptable solution for streamlining safety evaluation.

[205] Voice Adaptation for Swiss German

Samuel Stucki,Jan Deriu,Mark Cieliebak

Main category: cs.CL

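TL;DR: The study examines voice adaptation models for Swiss German dialects; by preprocessing a large dataset of Swiss podcasts and fine-tuning the XTTSv2 model, it achieves good evaluation results.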

Details

Motivation: Adapt voice cloning technology to underrepresented languages, in particular Swiss German dialects. Method: Preprocess a dataset of Swiss podcasts, automatically transcribe it and annotate it with dialect classes, and fine-tune the XTTSv2 model. Result: The model performs well in human and automated evaluations, with CMOS scores of up to -0.28 and SMOS scores of 3.8. Conclusion: The study marks a step toward voice cloning for underrepresented languages. Abstract: This work investigates the performance of Voice Adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. For this, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5000 hours of weakly labeled training material. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect. Our work shows a step towards adapting Voice Cloning technology to underrepresented languages. The resulting model achieves CMOS scores of up to -0.28 and SMOS scores of 3.8.

[206] Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home?

Yujin Choi,Youngjoo Park,Junyoung Byun,Jaewook Lee,Jinseong Park

Main category: cs.CL

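TL;DR: Mirabel is a similarity-based MIA detection framework that protects private data in RAG systems, defending against attacks with simple detect-and-hide strategies.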

Details

Motivation: Passing private documents directly to the LLM in RAG systems exposes them to membership inference attacks (MIAs). Method: The Mirabel framework uses similarity to detect MIA queries and applies detect-and-hide strategies. Result: Experiments show that Mirabel defends against a range of state-of-the-art MIA methods while preserving data utility and system compatibility. Conclusion: Mirabel offers an efficient and adaptable MIA defense for private RAG systems. Abstract: Retrieval-augmented generation (RAG) mitigates the hallucination problem in large language models (LLMs) and has proven effective for specific, personalized applications. However, passing private retrieved documents directly to LLMs introduces vulnerability to membership inference attacks (MIAs), which try to determine whether the target datum exists in the private external database or not. Based on the insight that MIA queries typically exhibit high similarity to only one target document, we introduce Mirabel, a similarity-based MIA detection framework designed for the RAG system. With the proposed Mirabel, we show that simple detect-and-hide strategies can successfully obfuscate attackers, maintain data utility, and remain system-agnostic. We experimentally demonstrate its detection and defense against various state-of-the-art MIA methods and its adaptability to existing private RAG systems.
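
The insight stated in this abstract, that an MIA query tends to be highly similar to only one document, can be illustrated with a small Python sketch. This is not Mirabel's actual detector, which the abstract does not specify: the gap heuristic, the `gap_threshold` value, the function names, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_and_hide(
    query: Sequence[float],
    docs: Sequence[Sequence[float]],
    k: int = 3,
    gap_threshold: float = 0.3,  # hypothetical value; would need tuning
) -> Tuple[bool, List[int]]:
    """Return (suspicious, indices of the documents to pass to the LLM)."""
    scored = sorted(((cosine(query, d), i) for i, d in enumerate(docs)), reverse=True)
    top = scored[:k]
    # Assumed heuristic: flag the query when one document stands far above the runner-up.
    suspicious = len(top) >= 2 and (top[0][0] - top[1][0]) > gap_threshold
    # "Hide": drop the singled-out document from the retrieved context.
    kept = top[1:] if suspicious else top
    return suspicious, [i for _, i in kept]


# Toy embeddings, made up for the example.
docs = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
probe_flag, probe_docs = detect_and_hide([1.0, 0.05], docs)   # aims at document 0
benign_flag, benign_docs = detect_and_hide([1.0, 1.0], docs)  # close to several documents
```

In this toy setting the probing query is flagged and document 0 is withheld, while the benign query keeps all three retrieved documents.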

[207] Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO

Ran Li,Shimin Di,Yuchen Liu,Chen Jing,Yu Qiu,Lei Chen

Main category: cs.CL

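TL;DR: The paper studies LLMs on scientific information extraction and proposes a two-stage training method combining supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR), which markedly improves reasoning capacity.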

Details

Motivation: Investigate why LLMs underperform on scientific information extraction (SciIE) and propose improvements. Method: Two-stage training: 1. MimicSFT (using structured reasoning templates), 2. R$^2$GRPO (with relevance and rule-induced rewards). Result: Experiments on scientific IE benchmarks show improved reasoning capacity; R$^2$GRPO with MimicSFT surpasses baseline LLMs and specialized supervised models in relation extraction. Conclusion: SFT and RLVR can both improve LLM reasoning capacity, especially on tasks that require both memorization and reasoning. Abstract: Previous studies suggest that powerful Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR) only refine the reasoning path without improving reasoning capacity in math tasks, while supervised fine-tuning (SFT) with distillation can. We study this from the view of scientific information extraction (SciIE), where LLMs and reasoning LLMs underperform small BERT-based models. SciIE requires both reasoning and memorization. We argue that both SFT and RLVR can refine the reasoning path and improve reasoning capacity in a simple way based on SciIE. We propose two-stage training with 1. MimicSFT, using structured reasoning templates without needing high-quality chain-of-thought data, 2. R$^2$GRPO with relevance and rule-induced rewards. Experiments on scientific IE benchmarks show that both methods can improve the reasoning capacity. R$^2$GRPO with MimicSFT surpasses baseline LLMs and specialized supervised models in relation extraction. Our code is available at https://github.com/ranlislz/R2GRPO.

[208] ArgInstruct: Specialized Instruction Fine-Tuning for Computational Argumentation

Maja Stahl,Timon Ziegenbein,Joonsuk Park,Henning Wachsmuth

Main category: cs.CL

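TL;DR: This paper proposes specialized instruction fine-tuning for computational argumentation (CA), aiming to improve large language models (LLMs) on CA tasks while preserving their generalization capabilities.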

Details

Motivation: Instruction-following LLMs do well on unseen tasks but still struggle with tasks that require domain knowledge. This paper aims to address that. Method: After reviewing existing CA research, the authors crafted natural language instructions for 105 CA tasks and built a CA-specific benchmark. They synthesized 52k CA-related instructions with an adapted self-instruct process and trained a CA-specialized instruction-following LLM. Result: Experiments suggest that CA-specialized instruction fine-tuning significantly improves LLMs on CA tasks without hurting performance on general NLP tasks. Conclusion: CA-specialized instruction fine-tuning improves LLM performance in a specific domain while preserving generalization. Abstract: Training large language models (LLMs) to follow instructions has significantly enhanced their ability to tackle unseen tasks. However, despite their strong generalization capabilities, instruction-following LLMs encounter difficulties when dealing with tasks that require domain knowledge. This work introduces a specialized instruction fine-tuning for the domain of computational argumentation (CA). The goal is to enable an LLM to effectively tackle any unseen CA tasks while preserving its generalization capabilities. Reviewing existing CA research, we crafted natural language instructions for 105 CA tasks to this end. On this basis, we developed a CA-specific benchmark for LLMs that allows for a comprehensive evaluation of LLMs' capabilities in solving various CA tasks. We synthesized 52k CA-related instructions, adapting the self-instruct process to train a CA-specialized instruction-following LLM. Our experiments suggest that CA-specialized instruction fine-tuning significantly enhances the LLM on both seen and unseen CA tasks. At the same time, performance on the general NLP tasks of the SuperNI benchmark remains stable.

[209] Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning

Chunyi Peng,Zhipeng Xu,Zhenghao Liu,Yishan Li,Yukun Yan,Shuo Wang,Zhiyuan Liu,Yu Gu,Minghe Yu,Ge Yu,Maosong Sun

Main category: cs.CL

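TL;DR: R1-Router is a new MRAG framework that dynamically decides when and where to retrieve knowledge and uses the Step-GRPO algorithm to optimize reasoning behavior, markedly improving multimodal question answering.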

Details

Motivation: Existing MRAG methods use static retrieval pipelines and overlook the dynamic reasoning capabilities of MLLMs, which makes them inefficient. Method: The R1-Router framework dynamically generates queries and routes them to the most suitable knowledge base, with the Step-GRPO algorithm optimizing reasoning. Result: On multiple open-domain QA benchmarks, R1-Router improves performance by over 7% and reduces unnecessary retrievals. Conclusion: Through dynamic knowledge retrieval and a tailored optimization algorithm, R1-Router markedly improves efficiency and accuracy on multimodal tasks. Abstract: Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge during generation. Existing MRAG methods typically adopt a static retrieval pipeline that fetches relevant information from multiple Knowledge Bases (KBs), followed by a refinement step. However, these approaches overlook the reasoning and planning capabilities of MLLMs to dynamically determine how to interact with different KBs during the reasoning process. To address this limitation, we propose R1-Router, a novel MRAG framework that learns to decide when and where to retrieve knowledge based on the evolving reasoning state. Specifically, R1-Router can generate follow-up queries according to the current reasoning step, routing these intermediate queries to the most suitable KB, and integrating external knowledge into a coherent reasoning trajectory to answer the original query. Furthermore, we introduce Step-wise Group Relative Policy Optimization (Step-GRPO), a tailored reinforcement learning algorithm that assigns step-specific rewards to optimize the reasoning behavior of MLLMs. Experimental results on various open-domain QA benchmarks across multiple modalities demonstrate that R1-Router outperforms baseline models by over 7%. Further analysis shows that R1-Router can adaptively and effectively leverage diverse KBs, reducing unnecessary retrievals and improving both efficiency and accuracy.
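
The idea of step-specific rewards scored relative to a group of sampled trajectories can be sketched in Python. The abstract does not give the Step-GRPO formula, so this is an assumed reading: the group-relative normalization (reward minus group mean, divided by group standard deviation) is applied separately at each reasoning step. The function names, the use of the population standard deviation, and the equal-length trajectories are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not taken from the paper
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_advantages(step_rewards: List[List[float]]) -> List[List[float]]:
    """step_rewards[g][t] is the reward of sampled trajectory g at step t.

    Returns advantages with the same shape, normalized across the group
    separately for every step t.
    """
    n_steps = len(step_rewards[0])
    if any(len(traj) != n_steps for traj in step_rewards):
        raise ValueError("this sketch assumes trajectories of equal length")
    per_step = [
        group_relative([traj[t] for traj in step_rewards]) for t in range(n_steps)
    ]
    return [
        [per_step[t][g] for t in range(n_steps)] for g in range(len(step_rewards))
    ]


# Four sampled trajectories, two steps each (toy rewards).
rewards = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
advantages = stepwise_advantages(rewards)
```

Under this reading, a trajectory is credited at each step only for doing better than its group at that step, so a good first step is not penalized by a poor second one.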

[210] Knowledge Base Construction for Knowledge-Augmented Text-to-SQL

Jinheon Baek,Horst Samulowitz,Oktie Hassanzadeh,Dharmashankar Subramanian,Sola Shirai,Alfio Gliozzo,Debarun Bhattacharjya

Main category: cs.CL

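TL;DR: The paper proposes a knowledge-base approach to text-to-SQL: building a comprehensive knowledge base addresses the limits of LLMs on diverse and domain-specific queries and markedly improves the accuracy of generated SQL.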

Details

Motivation: Existing LLM-based text-to-SQL methods perform poorly on diverse and domain-specific queries because the parametric knowledge of LLMs is limited and cannot cover every database schema. Method: Build a comprehensive knowledge base from all available questions, their database schemas, and related knowledge, then retrieve and generate the knowledge needed for a given query. Result: Validated on multiple text-to-SQL datasets, the method substantially outperforms baselines in both overlapping and non-overlapping database scenarios. Conclusion: The proposed knowledge base addresses the limits of LLMs in text-to-SQL and improves the accuracy and generalization of generated SQL. Abstract: Text-to-SQL aims to translate natural language queries into SQL statements, which is practical as it enables anyone to easily retrieve the desired information from databases. Recently, many existing approaches tackle this problem with Large Language Models (LLMs), leveraging their strong capability in understanding user queries and generating corresponding SQL code. Yet, the parametric knowledge in LLMs may be too limited to cover all the diverse and domain-specific queries that require grounding in various database schemas, which often makes the generated SQL less accurate. To tackle this, we propose constructing the knowledge base for text-to-SQL, a foundational source of knowledge, from which we retrieve and generate the necessary knowledge for given queries. In particular, unlike existing approaches that either manually annotate knowledge or generate only a few pieces of knowledge for each query, our knowledge base is comprehensive, which is constructed based on a combination of all the available questions and their associated database schemas along with their relevant knowledge, and can be reused for unseen databases from different datasets and domains. We validate our approach on multiple text-to-SQL datasets, considering both the overlapping and non-overlapping database scenarios, where it outperforms relevant baselines substantially.

[211] MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

Zhiyu Li,Shichao Song,Hanyu Wang,Simin Niu,Ding Chen,Jiawei Yang,Chenyang Xi,Huayi Lai,Jihao Zhao,Yezhaohui Wang,Junpeng Ren,Zehao Lin,Jiahao Huo,Tianyi Chen,Kai Chen,Kehang Li,Zhiqiang Yin,Qingchen Yu,Bo Tang,Hongkang Yang,Zhi-Qin John Xu,Feiyu Xiong

Main category: cs.CL

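TL;DR: MemOS is a memory operating system for large language models that, for the first time, elevates memory to a first-class operational resource, addressing the shortcomings of current LLM memory management.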

Details

Motivation: Current LLMs rely mainly on parametric memory and ephemeral activation memory and lack a unified, structured architecture for memory management, which limits long-term knowledge evolution. Method: MemOS introduces the MemCube, a standardized memory abstraction that enables tracking, fusion, and migration of heterogeneous memory and provides structured access. Result: MemOS establishes a memory-centric execution framework with strong controllability, adaptability, and evolvability. Conclusion: MemOS fills a critical gap in current LLM infrastructure and lays the groundwork for continual adaptation, personalized intelligence, and cross-platform coordination in next-generation intelligent systems. Abstract: Large Language Models (LLMs) have emerged as foundational infrastructure in the pursuit of Artificial General Intelligence (AGI). Despite their remarkable capabilities in language perception and generation, current LLMs fundamentally lack a unified and structured architecture for handling memory. They primarily rely on parametric memory (knowledge encoded in model weights) and ephemeral activation memory (context-limited runtime states). While emerging methods like Retrieval-Augmented Generation (RAG) incorporate plaintext memory, they lack lifecycle management and multi-modal integration, limiting their capacity for long-term knowledge evolution. To address this, we introduce MemOS, a memory operating system designed for LLMs that, for the first time, elevates memory to a first-class operational resource. It builds unified mechanisms for representation, organization, and governance across three core memory types: parametric, activation, and plaintext. At its core is the MemCube, a standardized memory abstraction that enables tracking, fusion, and migration of heterogeneous memory, while offering structured, traceable access across tasks and contexts. MemOS establishes a memory-centric execution framework with strong controllability, adaptability, and evolvability. It fills a critical gap in current LLM infrastructure and lays the groundwork for continual adaptation, personalized intelligence, and cross-platform coordination in next-generation intelligent systems.

[212] Curse of High Dimensionality Issue in Transformer for Long-context Modeling

Shuhai Zhang,Zeng You,Yaofo Chen,Zhiquan Wen,Qianyue Wang,Zhijie Qiu,Yuanqing Li,Mingkui Tan

Main category: cs.CL

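TL;DR: The paper proposes Dynamic Group Attention (DGA), which uses a group coding strategy to reduce redundancy in attention computation, markedly lowering computational cost while maintaining performance.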

Details

Motivation: In conventional attention, all tokens consume equal computational resources, yet attention weights are sparse, so much of the computation is redundant. Method: Reformulate sequence modeling as a supervised learning task, analyze attention sparsity, propose a group coding strategy, and design Dynamic Group Attention (DGA). Result: DGA significantly reduces computational cost while maintaining competitive performance. Conclusion: By reducing redundant computation, DGA improves the efficiency of the attention mechanism. Abstract: Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at https://github.com/bolixinyu/DynamicGroupAttention.
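
The idea of aggregating less important tokens during attention can be sketched for a single query in plain Python. This is an illustration under stated assumptions, not the DGA implementation from the linked repository: importance scores are taken as given, the top `keep` tokens are attended individually, and the remaining tokens are mean-pooled in fixed-size groups. The function names and the pooling rule are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vec(vs: List[Vector]) -> Vector:
    return [sum(col) / len(vs) for col in zip(*vs)]


def grouped_attention(
    q: Vector,
    keys: List[Vector],
    values: List[Vector],
    importance: List[float],  # assumed to be supplied by some scoring rule
    keep: int,
    group_size: int,
) -> Vector:
    """Single-query attention where low-importance tokens are pooled into groups."""
    order = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    kept, rest = order[:keep], sorted(order[keep:])
    ks = [keys[i] for i in kept]
    vs = [values[i] for i in kept]
    # Each group of less important tokens contributes one averaged key/value pair.
    for start in range(0, len(rest), group_size):
        group = rest[start:start + group_size]
        ks.append(mean_vec([keys[i] for i in group]))
        vs.append(mean_vec([values[i] for i in group]))
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
    return [sum(w * v[d] for w, v in zip(weights, vs)) for d in range(len(vs[0]))]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
importance = [0.9, 0.1, 0.8, 0.2]
out = grouped_attention([1.0, 0.0], keys, values, importance, keep=2, group_size=2)
```

Here the query attends to three entries (two kept tokens plus one pooled group) instead of four; with long contexts and larger groups the reduction in attended entries is what lowers the cost.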

[213] THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models

Zhiyuan Li,Yi Chang,Yuan Wu

Main category: cs.CL

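TL;DR: Large reasoning models (LRMs) perform well on complex tasks but overthink, which lowers computational efficiency. The Think-Bench benchmark is proposed to evaluate the reasoning efficiency of LRMs and shows that most LRMs generate redundant reasoning chains on simple questions.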

Details

Motivation: LRMs outperform conventional large language models (LLMs) on complex tasks, but overthinking severely limits their computational efficiency; on simple tasks in particular they generate redundant content and waste resources. Method: Introduce the Think-Bench benchmark, propose new efficiency metrics, and evaluate LRMs along multiple dimensions, including the reasoning process, outcome quality, and chain-of-thought (CoT) characteristics. Result: Most LRMs overthink simple questions and generate lengthy reasoning chains; many LRMs have high CoT quality, but several suffer from low efficiency. Conclusion: Think-Bench provides a foundation for research on LRMs; reasoning efficiency can be optimized further in future work. Abstract: Large reasoning models (LRMs) have achieved impressive performance in complex tasks, often outperforming conventional large language models (LLMs). However, the prevalent issue of overthinking severely limits their computational efficiency. Overthinking occurs when models generate excessive and redundant tokens that contribute little to accurate outcomes, especially in simple tasks, resulting in a significant waste of computational resources. To systematically investigate this issue, we introduce Think-Bench, a benchmark designed to evaluate the reasoning efficiency of LRMs. We also propose novel efficiency metrics and conduct a comprehensive evaluation of various LRMs across multiple dimensions, including the reasoning process, outcome quality, and chain-of-thought (CoT) characteristics. Our analysis reveals that most LRMs exhibit overthinking in handling easy questions, generating unnecessarily lengthy reasoning chains. While many LRMs demonstrate high CoT quality, several suffer from low efficiency. We hope that Think-Bench can serve as a robust foundation for advancing research into LRMs.

[214] Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model

Jintao Zhang,Zirui Liu,Mingyue Cheng,Shilong Zhang,Tingyue Pan,Qi Liu,Yanhu Xie

Main category: cs.CL

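TL;DR: The paper proposes IOHFuseLM, a multimodal language model framework for forecasting intraoperative hypotension (IOH); a two-stage training strategy and the fusion of dynamic and static data markedly improve prediction accuracy.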

Details

Motivation: Intraoperative hypotension (IOH) is strongly linked to adverse outcomes, but event sparsity and the diversity of patient data make it hard to predict. Method: A two-stage training strategy: (1) domain-adaptive pretraining on physiological time series augmented with diffusion methods; (2) task fine-tuning on the original clinical dataset. Multimodal fusion aligns structured clinical descriptions with the physiological time series. Result: Experiments on two intraoperative datasets show that IOHFuseLM outperforms established baselines in identifying IOH events. Conclusion: IOHFuseLM shows potential for clinical decision support; the code is open source to promote reproducibility. Abstract: Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose IOHFuseLM, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain-adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model's sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at https://github.com/zjt-gpu/IOHFuseLM.

[215] Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches

Alan Ramponi,Marco Rovera,Robert Moro,Sara Tonelli

Main category: cs.CL

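TL;DR: Studies retrieval strategies for previously fact-checked claims in multilingual and crosslingual settings, focusing on negative-example selection and re-ranking, and evaluates them on a dataset covering 47 languages.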

Details

Motivation: Fact-checking resources are limited for some languages, and global narratives (e.g., pandemics, wars) call for retrieval across languages, so improving multilingual and crosslingual retrieval performance is crucial. Method: Negative-example selection (in the supervised setting) and re-ranking (in the unsupervised setting), evaluated on a dataset covering 47 languages. Result: The best results come from LLM-based re-ranking, followed by fine-tuning with negatives sampled by sentence similarity. Crosslingual retrieval has its own distinct characteristics. Conclusion: Crosslingual retrieval poses unique challenges and advantages relative to the multilingual setting; LLM re-ranking and negative-example selection are effective strategies. Abstract: Retrieval of previously fact-checked claims is a well-established task, whose automation can assist professional fact-checkers in the initial steps of information verification. Previous works have mostly tackled the task monolingually, i.e., having both the input and the retrieved claims in the same language. However, especially for languages with a limited availability of fact-checks and in case of global narratives, such as pandemics, wars, or international politics, it is crucial to be able to retrieve claims across languages. In this work, we examine strategies to improve the multilingual and crosslingual performance, namely selection of negative examples (in the supervised setting) and re-ranking (in the unsupervised setting). We evaluate all approaches on a dataset containing posts and claims in 47 languages (283 language combinations). We observe that the best results are obtained by using LLM-based re-ranking, followed by fine-tuning with negative examples sampled using a sentence similarity-based strategy. Most importantly, we show that crosslinguality is a setup with its own unique characteristics compared to the multilingual setup.

[216] LoKI: Low-damage Knowledge Implanting of Large Language Models

Runyu Wang,Peng Ping,Zhengyu Guo,Xiaoye Zhang,Quan Shi,Liting Zhou,Tianbo Ji

Main category: cs.CL

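TL;DR: LoKI is a parameter-efficient fine-tuning method that addresses catastrophic forgetting through low-damage knowledge implanting while preserving the model's general capabilities.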

Details

Motivation: Existing PEFT methods can cause catastrophic forgetting and degrade general capabilities when fine-tuning large language models. Method: Building on an understanding of how knowledge is stored in Transformer architectures, proposes LoKI, a technique for low-damage knowledge implanting. Result: Across several model types, LoKI matches or exceeds the task performance of full fine-tuning and LoRA while preserving general capabilities significantly better. Conclusion: By connecting LLM knowledge-storage mechanisms with fine-tuning objectives, LoKI achieves a state-of-the-art trade-off between task specialization and the preservation of general capabilities. Abstract: Fine-tuning adapts pretrained models for specific tasks but poses the risk of catastrophic forgetting (CF), where critical knowledge from pre-training is overwritten. Current Parameter-Efficient Fine-Tuning (PEFT) methods for Large Language Models (LLMs), while efficient, often sacrifice general capabilities. To address the issue of CF in a general-purpose PEFT framework, we propose Low-damage Knowledge Implanting (LoKI), a PEFT technique that is based on a mechanistic understanding of how knowledge is stored in transformer architectures. In two real-world scenarios, LoKI demonstrates task-specific performance that is comparable to or even surpasses that of full fine-tuning and LoRA-based methods across various model types, while significantly better preserving general capabilities. Our work connects mechanistic insights into LLM knowledge storage with practical fine-tuning objectives, achieving state-of-the-art trade-offs between task specialization and the preservation of general capabilities. Our implementation is publicly available as ready-to-use code at https://github.com/Nexround/LoKI.

[217] EULER: Enhancing the Reasoning Ability of Large Language Models through Error-Induced Learning

Zhuoyang Wu,Xinze Li,Zhenghao Liu,Yukun Yan,Zhiyuan Liu,Minghe Yu,Cheng Yang,Yu Gu,Ge Yu,Maosong Sun

Main category: cs.CL

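TL;DR: Proposes the EULER model, which strengthens the mathematical reasoning of LLMs by generating high-quality erroneous solutions; experiments show an improvement of more than 4% over baseline models.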

Details

Motivation: LLMs perform well at mathematical problem solving, but the potential of learning from errors to improve them further is underexplored, in particular how to generate erroneous solutions for every math problem. Method: EULER optimizes an error exposure model to raise the probability of generating self-made solution errors, while using solutions from a superior LLM to regularize generation quality. Result: EULER performs strongly across several math problem datasets, improving by more than 4%, and generates more challenging and educational solution errors. Conclusion: By generating high-quality errors, EULER markedly improves the mathematical reasoning of LLMs and supports both training and inference. Abstract: Large Language Models (LLMs) have demonstrated strong reasoning capabilities and achieved promising results in mathematical problem-solving tasks. Learning from errors offers the potential to further enhance the performance of LLMs during Supervised Fine-Tuning (SFT). However, the errors in synthesized solutions are typically gathered from sampling trials, making it challenging to generate solution errors for each mathematical problem. This paper introduces the Error-IndUced LEaRning (EULER) model, which aims to develop an error exposure model that generates high-quality solution errors to enhance the mathematical reasoning capabilities of LLMs. Specifically, EULER optimizes the error exposure model to increase the generation probability of self-made solution errors while utilizing solutions produced by a superior LLM to regularize the generation quality. Our experiments across various mathematical problem datasets demonstrate the effectiveness of the EULER model, achieving an improvement of over 4% compared to all baseline models. Further analysis reveals that EULER is capable of synthesizing more challenging and educational solution errors, which facilitate both the training and inference processes of LLMs. All code is available at https://github.com/NEUIR/EULER.

[218] RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding

Yuichiro Hoshino,Hideyuki Tachibana,Muneyoshi Inahara,Hiroto Takegawa

Main category: cs.CL

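TL;DR: RAD (Redundancy-Aware Distillation) is a new framework that uses self-speculative decoding to identify redundant layers in Transformers, replaces them with SSM components, and applies targeted distillation to improve performance.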

Details

Motivation: Optimize the performance and efficiency of hybrid models (Transformers combined with SSMs) by addressing redundancy in the Transformer components. Method: Uses self-speculative decoding to diagnose redundant layers, selectively replaces them with SSM components, and applies targeted distillation. Result: Substantially outperforms the baseline on mathematical and coding tasks, scoring 71.27 on GSM8K (baseline 46.17) and 28.25 on CRUX (baseline 22.75). Conclusion: RAD offers a new pathway for efficient optimization and performance enhancement of hybrid models. Abstract: Hybrid models combining Transformers and State Space Models (SSMs) are promising for balancing performance and efficiency. However, optimizing these hybrid models, particularly by addressing the potential redundancy inherent within the Transformer components, remains a significant challenge. In this paper, we propose RAD (Redundancy-Aware Distillation), a novel framework that uses self-speculative decoding as a diagnostic tool to identify redundant attention layers within the model. These identified layers are then selectively replaced with SSM components, followed by targeted (self-)distillation. Specifically, RAD focuses knowledge transfer on the components identified as redundant, considering architectural changes and specific weight initialization strategies. We experimentally demonstrate that self-distillation using RAD significantly surpasses the performance of the original base model on mathematical and coding tasks. Furthermore, RAD is also effective in standard knowledge distillation settings, achieving up to approximately 2x faster convergence compared to baseline methods. Notably, while a baseline model distilled from a Llama-3.1 70B teacher achieves scores of 46.17 on GSM8K and 22.75 on CRUX, RAD achieves significantly higher scores of 71.27 on GSM8K and 28.25 on CRUX, even when using a much smaller Llama-3.1 8B teacher. RAD offers a new pathway for efficient optimization and performance enhancement in the distillation of hybrid models.

[219] Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments

Marc Feger,Katarina Boland,Stefan Dietze

Main category: cs.CL

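TL;DR: Re-evaluates how well BERT-like models generalize at identifying arguments and finds that they rely on lexical shortcuts rather than true task alignment, although task-specific pre-training and joint training improve performance.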

Details

Motivation: Argument identification is a key task in automated discourse analysis. Prior work holds that BERT-like models perform well across diverse debate contexts, but their generalization has not been systematically evaluated. Method: Evaluates four Transformer models (three standard, one enhanced) on 17 English sentence-level datasets, focusing on generalization. Result: The models rely on lexical shortcuts: they do well on familiar datasets but degrade markedly on unseen ones; task-specific pre-training and joint training improve generalization. Conclusion: The study exposes the limitations of existing models and points to directions for improvement, stressing the importance of task-specific pre-training and joint training. Abstract: Identifying arguments is a necessary prerequisite for various tasks in automated discourse analysis, particularly within contexts such as political debates, online discussions, and scientific reasoning. In addition to theoretical advances in understanding the constitution of arguments, a significant body of research has emerged around practical argument mining, supported by a growing number of publicly available datasets. On these benchmarks, BERT-like transformers have consistently performed best, reinforcing the belief that such models are broadly applicable across diverse contexts of debate. This study offers the first large-scale re-evaluation of such state-of-the-art models, with a specific focus on their ability to generalize in identifying arguments. We evaluate four transformers, three standard and one enhanced with contrastive pre-training for better generalization, on the 17 English sentence-level datasets most relevant to the task. Our findings show that, to varying degrees, these models tend to rely on lexical shortcuts tied to content words, suggesting that apparent progress may often be driven by dataset-specific cues rather than true task alignment. While the models achieve strong results on familiar benchmarks, their performance drops markedly when applied to unseen datasets. Nonetheless, incorporating both task-specific pre-training and joint benchmark training proves effective in enhancing both robustness and generalization.

[220] InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing

Shuaiyi Li,Zhisong Zhang,Yang Deng,Chenlong Deng,Tianqing Fang,Hongming Zhang,Haitao Mi,Dong Yu,Wai Lam

Main category: cs.CL

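TL;DR: The InComeS framework improves the ability of large language models to process editing contexts through compression and selection mechanisms, addressing the limited semantic understanding of existing methods in complex scenarios.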

Details

Motivation: Existing model editing methods perform poorly in complex scenarios that need deeper semantic understanding, and in-context editing is constrained by the LLM context window. Method: Proposes InComeS, which compresses each editing context into a key-value cache and adds cross-attention modules that dynamically select information. Result: Experiments show that InComeS is effective and efficient across diverse model editing benchmarks. Conclusion: Its flexible design overcomes the limitations of existing methods and improves the performance and efficiency of model editing. Abstract: Although existing model editing methods perform well in recalling exact edit facts, they often struggle in complex scenarios that require deeper semantic understanding rather than mere knowledge regurgitation. Leveraging the strong contextual reasoning abilities of large language models (LLMs), in-context learning (ICL) becomes a promising editing method by comprehending edit information through context encoding. However, this method is constrained by the limited context window of LLMs, leading to degraded performance and efficiency as the number of edits increases. To overcome this limitation, we propose InComeS, a flexible framework that enhances LLMs' ability to process editing contexts through explicit compression and selection mechanisms. Specifically, InComeS compresses each editing context into the key-value (KV) cache of a special gist token, enabling efficient handling of multiple edits without being restricted by the model's context window. Furthermore, specialized cross-attention modules are added to dynamically select the most relevant information from the gist pools, enabling adaptive and effective utilization of edit information. We conduct experiments on diverse model editing benchmarks with various editing formats, and the results demonstrate the effectiveness and efficiency of our method.

[221] Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy

Paramita Mirza,Lucas Weber,Fabian Küch

Main category: cs.CL

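TL;DR: Proposes an efficient and general data selection method: a multi-step pipeline bins data points into groups, estimates quality, and scores difficulty, while task-based categorization and a clustering algorithm ensure diversity, enabling high-performance fine-tuning with minimal overhead.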

Details

Motivation: Existing data selection methods are computationally expensive or limited in scope; this work aims to address both problems. Method: A multi-step pipeline comprising data-point binning, quality-estimation models, and a lightweight difficulty-scoring method, combined with task-based categorization and a clustering algorithm to ensure diversity. Result: Achieves efficient and general data selection that supports high-performance fine-tuning. Conclusion: The method substantially reduces computational overhead while maintaining performance and suits the fine-tuning of multi-purpose models. Abstract: Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this paper, we demonstrate that data selection can be both efficient and universal by using a multi-step pipeline in which we efficiently bin data points into groups, estimate quality using specialized models, and score difficulty with a robust, lightweight method. Task-based categorization allows us to control the composition of our final data, which is crucial for finetuning multi-purpose models. To guarantee diversity, we improve upon previous work using embedding models and a clustering algorithm. This integrated strategy enables high-performance fine-tuning with minimal overhead.
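
The pipeline summarized above (bin by task, score quality and difficulty, then fill a budget per bin) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Example` fields, the additive score, and the proportional per-task budget are all assumptions made for this sketch, and the embedding-based clustering step is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Example:
    text: str
    task: str          # hypothetical task label, e.g. "math" or "code"
    quality: float     # hypothetical quality score in [0, 1]
    difficulty: float  # hypothetical difficulty score in [0, 1]


def stratified_select(examples, budget):
    """Pick roughly `budget` examples while keeping each task's share of the data."""
    bins = defaultdict(list)
    for ex in examples:
        bins[ex.task].append(ex)

    selected = []
    for items in bins.values():
        # Assumption: each task gets a share proportional to its size (at least 1).
        share = max(1, round(budget * len(items) / len(examples)))
        # Assumption: rank by the plain sum of quality and difficulty.
        ranked = sorted(items, key=lambda e: e.quality + e.difficulty, reverse=True)
        selected.extend(ranked[:share])
    return selected
```

Because shares are rounded per task, the total can differ slightly from `budget`; a production pipeline would need an explicit rule for distributing the remainder.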

[222] Unifying Continuous and Discrete Text Diffusion with Non-simultaneous Diffusion Processes

Bocheng Li,Zhujin Gao,Linli Xu

Main category: cs.CL

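TL;DR: NeoDiff is a new diffusion model that combines the strengths of discrete and continuous diffusion models, using a Poisson diffusion process and an adaptive time predictor for finer-grained noise control and semantics-aware text generation.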

Details

Motivation: Discrete and continuous diffusion models each have limitations: discrete models lack fine-grained control, and continuous models cannot capture semantic nuances. NeoDiff aims to combine the strengths of both. Method: NeoDiff uses a Poisson diffusion process for flexible noise control and a time predictor that adaptively modulates denoising progress, together with an optimized inference schedule to improve performance. Result: Experiments show that NeoDiff outperforms baselines on several text generation tasks, including non-autoregressive and autoregressive diffusion models. Conclusion: NeoDiff offers a more effective and theoretically unified framework for text generation, showing the potential of diffusion models for high-quality text generation. Abstract: Diffusion models have emerged as a promising approach for text generation, with recent works falling into two main categories: discrete and continuous diffusion models. Discrete diffusion models apply token corruption independently using categorical distributions, allowing for different diffusion progress across tokens but lacking fine-grained control. Continuous diffusion models map tokens to continuous spaces and apply fine-grained noise, but the diffusion progress is uniform across tokens, limiting their ability to capture semantic nuances. To address these limitations, we propose Non-simultaneous Continuous Diffusion Models (NeoDiff), a novel diffusion model that integrates the strengths of both discrete and continuous approaches. NeoDiff introduces a Poisson diffusion process for the forward process, enabling a flexible and fine-grained noising paradigm, and employs a time predictor for the reverse process to adaptively modulate the denoising progress based on token semantics. Furthermore, NeoDiff utilizes an optimized schedule for inference to ensure more precise noise control and improved performance. Our approach unifies the theories of discrete and continuous diffusion models, offering a more principled and effective framework for text generation. Experimental results on several text generation tasks demonstrate NeoDiff's superior performance compared to baselines of non-autoregressive continuous and discrete diffusion models, iterative-based methods and autoregressive diffusion-based methods. These results highlight NeoDiff's potential as a powerful tool for generating high-quality text and advancing the field of diffusion-based text generation.

[223] ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

Gili Lior,Eliya Habba,Shahar Levy,Avi Caciularu,Gabriel Stanovsky

Main category: cs.CL

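TL;DR: Proposes a stochastic framework for evaluating LLM sensitivity to prompts, and develops ReliableEval, a method for estimating the number of prompt resamplings required.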

Details

Motivation: Standard benchmarks typically evaluate LLMs with a single prompt, but LLMs are highly sensitive to prompt phrasing, which can make such evaluations unreliable. Method: Evaluates with a stochastic method of moments over the space of meaning-preserving prompt perturbations, and defines a criterion for reliable evaluation. Result: Even top models such as GPT-4o and Claude-3.7-Sonnet show substantial prompt sensitivity. Conclusion: The approach is model-, task-, and metric-agnostic, offering a robust and effective recipe for LLM evaluation. Abstract: LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
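
To make the idea of estimating the number of prompt resamplings concrete, here is a minimal sketch based on the first two sample moments of scores obtained under different paraphrases of the same prompt. It uses a textbook normal-approximation bound, n >= (z * s / epsilon)^2, and is not the paper's procedure; the function name, the `epsilon` tolerance, and the i.i.d. assumption are choices made for this illustration.

```python
import math
import statistics


def resamplings_needed(scores, epsilon, z=1.96):
    """Estimate how many prompt paraphrases are needed so that the mean score
    lies within `epsilon` of the true mean at roughly 95% confidence (z = 1.96).

    Assumption: scores are i.i.d. draws over paraphrases and the sample mean
    is approximately normal.
    """
    mean = statistics.fmean(scores)  # first moment
    std = statistics.stdev(scores)   # square root of the sample variance
    n = math.ceil((z * std / epsilon) ** 2)
    return mean, std, max(n, 1)
```

For example, pilot accuracies of 0.70, 0.74, 0.66, 0.72, and 0.68 have a sample standard deviation of about 0.032, so a tolerance of 0.01 calls for roughly 39 paraphrases under these assumptions.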

[224] Reverse Preference Optimization for Complex Instruction Following

Xiang Huang,Ting-En Lin,Feiteng Fang,Yuchuan Wu,Hangyu Li,Yuzhong Qu,Fei Huang,Yongbin Li

Main category: cs.CL

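TL;DR: Proposes Reverse Preference Optimization (RPO) to address noise in preference pairs when large language models handle complex instructions. RPO dynamically reverses constraints within the instruction so that the chosen response is perfect, and widens the gap between chosen and rejected responses, making the optimization direction clearer. Experiments show that RPO outperforms baselines on multiple benchmarks and scales across model sizes.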

Details

Motivation: On complex instructions, existing methods suffer from noisy preference pairs, so a more effective approach to multi-preference alignment is needed. Method: Proposes Reverse Preference Optimization (RPO), which dynamically reverses instruction constraints to reduce noise and widens the gap within preference pairs. Result: On the Sysbench and Multi-IF benchmarks, RPO improves on the DPO baseline by 4.6 and 2.5 points on average, respectively, and scales across model sizes (8B to 70B parameters). Conclusion: RPO is a simple and effective method that markedly improves LLM handling of complex instructions and performs well across model sizes. Abstract: Instruction following (IF) is a critical capability for large language models (LLMs). However, handling complex instructions with multiple constraints remains challenging. Previous methods typically select preference pairs based on the number of constraints they satisfy, introducing noise where chosen examples may fail to follow some constraints and rejected examples may excel in certain respects over the chosen ones. To address the challenge of aligning with multiple preferences, we propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect, alleviating the burden of extensive sampling and filtering to collect perfect responses. In addition, reversal enlarges the gap between chosen and rejected responses, thereby clarifying the optimization direction and making it more robust to noise. We evaluate RPO on two multi-turn IF benchmarks, Sysbench and Multi-IF, demonstrating average improvements over the DPO baseline of 4.6 and 2.5 points (on Llama-3.1 8B), respectively. Moreover, RPO scales effectively across model sizes (8B to 70B parameters), with the 70B RPO model surpassing GPT-4o.
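
The constraint-reversal step can be illustrated with a small sketch. It covers only the idea that flipping the constraints a response violates yields an instruction that the same response satisfies completely; how the rejected response is chosen and how the loss is computed are not shown. The `Constraint` class, its `opposite` field, and the function name are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    text: str      # e.g. "Answer in fewer than 50 words."
    opposite: str  # hypothetical hand-written reversal of `text`


def reverse_unsatisfied(constraints, satisfied):
    """Return constraint texts that the response follows without exception.

    `satisfied[i]` says whether the response followed `constraints[i]`.
    Violated constraints are swapped for their reversed form, so the same
    response satisfies every constraint in the returned list.
    """
    if len(constraints) != len(satisfied):
        raise ValueError("need one satisfaction flag per constraint")
    return [c.text if ok else c.opposite for c, ok in zip(constraints, satisfied)]
```

A response that is too long but correctly formatted would keep the formatting constraint and receive the reversed length constraint, turning it into a clean chosen example for the rewritten instruction.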

[225] TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation

Vihang Pancholi,Jainit Bafna,Tejas Anvekar,Manish Shrivastava,Vivek Gupta

Main category: cs.CL

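TL;DR: Proposes TabXEval, a table evaluation framework that combines structural alignment with semantic comparison, markedly improving the accuracy and explainability of table evaluation.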

Details

Motivation: Traditional table evaluation methods struggle to capture subtle structural and content discrepancies, so a more comprehensive approach is needed. Method: TabXEval has two phases: TabAlign performs structural alignment, and TabCompare performs semantic and syntactic comparison. TabXBench serves as the evaluation benchmark. Result: TabXEval performs well across diverse table tasks and domains, outperforming conventional methods. Conclusion: TabXEval opens a new direction for explainable table evaluation with broad application potential. Abstract: Evaluating tables qualitatively and quantitatively presents a significant challenge, as traditional metrics often fail to capture nuanced structural and content discrepancies. To address this, we introduce a novel, methodical rubric integrating multi-level structural descriptors with fine-grained contextual quantification, thereby establishing a robust foundation for comprehensive table comparison. Building on this foundation, we propose TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. TabXEval initially aligns reference tables structurally via TabAlign and subsequently conducts a systematic semantic and syntactic comparison using TabCompare; this approach clarifies the evaluation process and pinpoints subtle discrepancies overlooked by conventional methods. The efficacy of this framework is assessed using TabXBench, a novel, diverse, multi-domain benchmark we developed, featuring realistic table perturbations and human-annotated assessments. Finally, a systematic analysis of existing evaluation methods through sensitivity-specificity trade-offs demonstrates the qualitative and quantitative effectiveness of TabXEval across diverse table-related tasks and domains, paving the way for future innovations in explainable table evaluation.

[226] Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

Yudi Zhang,Weilin Zhao,Xu Han,Tiejun Zhao,Wang Xu,Hailong Cao,Conghui Zhu

Main category: cs.CL

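TL;DR: Examines combining speculative decoding with quantization to accelerate large language model inference, and proposes a hierarchical framework to optimize performance.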

Details

Motivation: Speculative decoding and quantization accelerate inference by verifying multiple tokens per pass and by compressing weights, respectively, but when they are combined, the advantage of 4-bit quantized models is offset by the computational load of speculative decoding. Method: Proposes a hierarchical framework in which a small model turns tree-style drafts into sequence drafts, exploiting the memory advantages of the quantized target model. Result: The hierarchical approach achieves a 2.78x speedup on the 4-bit quantized Llama-3-70B model, outperforming EAGLE-2 by 1.31x. Conclusion: The hierarchical framework combines speculative decoding and quantization effectively and substantially improves inference speed. Abstract: Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at https://github.com/AI9Stars/SpecMQuant.
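
For readers unfamiliar with the draft-then-verify loop that this work builds on, here is a toy sketch of one greedy speculative decoding step over a sequence draft. It is not the paper's hierarchical framework and involves no quantization or tree drafts; `draft_next` and `target_next` are hypothetical callables that return the next token for a given prefix.

```python
def speculative_step(tokens, draft_next, target_next, k=4):
    """One greedy draft-then-verify step; returns the extended token list."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    draft = []
    for _ in range(k):
        draft.append(draft_next(tokens + draft))

    # 2) The target model checks every drafted position. A real system scores
    #    all positions in one batched forward pass; this loop only mimics that.
    accepted = []
    for tok in draft:
        expected = target_next(tokens + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return tokens + accepted
        accepted.append(tok)

    # 3) Every draft token was accepted: the target contributes one extra token.
    accepted.append(target_next(tokens + accepted))
    return tokens + accepted
```

With greedy acceptance the output matches what the target model would have produced on its own; the benefit comes from emitting several tokens per target pass whenever the draft is right.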

[227] Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon

Xuchen Ma,Jianxiang Yu,Wenming Shao,Bo Pang,Xiang Li

Main category: cs.CL

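TL;DR: C$^2$TU is a training-free and prompt-free method for unveiling cloaked toxic content on Chinese social media, using substring matching and semantic filtering for effective detection.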

Details

Motivation: Cloaked toxic content (e.g., homophone substitution) is increasing on social media; existing methods mostly target English, and the Chinese setting remains unsolved. Method: Uses substring matching to identify candidate toxic words, then uses BERT and LLMs to filter out non-toxic candidates and correct the cloaks. Result: Performs strongly on two Chinese toxicity datasets, outperforming the best competitor by up to 71% in F1 score and 35% in accuracy. Conclusion: C$^2$TU offers an effective solution for detecting cloaked toxic content in Chinese, filling a gap in the field. Abstract: Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks, presenting growing challenges for content moderation. Some users evade censorship by deliberately disguising toxic words through homophonic cloak, which necessitates the task of unveiling cloaked toxicity. Existing methods are mostly designed for English texts, while Chinese cloaked toxicity unveiling has not been solved yet. To tackle the issue, we propose C$^2$TU, a novel training-free and prompt-free method for Chinese cloaked toxic content unveiling. It first employs substring matching to identify candidate toxic words based on Chinese homo-graph and toxic lexicon. Then it filters those candidates that are non-toxic and corrects cloaks to be their corresponding toxicities. Specifically, we develop two model variants for filtering, which are based on BERT and LLMs, respectively. For LLMs, we address the auto-regressive limitation in computing word occurrence probability and utilize the full semantic contexts of a text sequence to reveal cloaked toxic words. Extensive experiments demonstrate that C$^2$TU can achieve superior performance on two Chinese toxic datasets. In particular, our method outperforms the best competitor by up to 71% on the F1 score and 35% on accuracy, respectively.

[228] Let's Predict Sentence by Sentence

Hyeonbin Hwang,Byeongguk Jeon,Seungone Kim,Jiyeon Kim,Hoyeon Chang,Sohee Yang,Seungpil Won,Dohaeng Lee,Youbin Ahn,Minjoon Seo

Main category: cs.CL

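TL;DR: Examines whether pretrained language models (LMs) can move from token-based reasoning to abstract reasoning over sentences and concepts, and proposes a framework that reasons efficiently by predicting continuous sentence embeddings.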

Details

Motivation: Human reasoning operates over higher-level abstractions such as sentences and concepts, whereas LMs currently generate token by token; the study asks whether LMs can reason abstractly over semantic units. Method: Proposes a framework that adapts a pretrained LM to sentence space by autoregressively predicting continuous sentence embeddings, exploring two embedding paradigms (semantic and contextual) and two inference regimes (discretized and continuous). Result: Across four domains (mathematics, logic, commonsense, and planning), contextual embeddings under continuous inference perform on par with Chain-of-Thought (CoT) while halving inference-time FLOPs on average. Conclusion: Pretrained LMs can reason abstractly and in a structured way within latent embedding spaces, with early signs of scalability and modular adaptation. Abstract: Autoregressive language models (LMs) generate one token at a time, yet human reasoning operates over higher-level abstractions: sentences, propositions, and concepts. This contrast raises a central question: can LMs likewise learn to reason over structured semantic units rather than raw token sequences? In this work, we investigate whether pretrained LMs can be lifted into such abstract reasoning spaces by building on their learned representations. We present a framework that adapts a pretrained token-level LM to operate in sentence space by autoregressively predicting continuous embeddings of next sentences. We explore two embedding paradigms inspired by classical representation learning: 1) semantic embeddings, learned via autoencoding to preserve surface meaning; and 2) contextual embeddings, trained via next-sentence prediction to encode anticipatory structure. We evaluate both under two inference regimes: Discretized, which decodes each predicted embedding into text before re-encoding; and Continuous, which reasons entirely in embedding space for improved efficiency. Across four domains (mathematics, logic, commonsense, and planning), contextual embeddings under continuous inference show competitive performance with Chain-of-Thought (CoT) while reducing inference-time FLOPs on average by half. We also present early signs of scalability and modular adaptation. Finally, to visualize latent trajectories, we introduce SentenceLens, a diagnostic tool that decodes intermediate model states into interpretable sentences. Together, our results indicate that pretrained LMs can effectively transition to abstract, structured reasoning within latent embedding spaces.

[229] Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Mehdi Ali,Manuel Brack,Max Lübbering,Elias Wendt,Abbas Goher Khan,Richard Rutmann,Alex Jude,Maurice Kraus,Alexander Arno Weber,Felix Stollenwerk,David Kaczér,Florian Mai,Lucie Flek,Rafet Sifa,Nicolas Flores-Herr,Joachim Köhler,Patrick Schramowski,Michael Fromm,Kristian Kersting

Main category: cs.CL

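TL;DR: JQL is an efficient multilingual data filtering approach that uses lightweight annotators to improve data quality, substantially outperforming existing heuristic methods.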

Details

Motivation: High-quality multilingual training data is scarce, and existing approaches rely on heuristic filtering, which limits cross-lingual transfer and scalability. Method: JQL uses lightweight annotators built on pretrained multilingual embeddings to efficiently curate diverse, high-quality multilingual data. Result: Validated on 35 languages, JQL substantially outperforms existing methods such as Fineweb2, improving downstream model training quality and data retention rates. Conclusion: JQL offers a practical approach to multilingual data filtering and raises the standard of multilingual dataset development. Abstract: High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.

[230] A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity

Charlotte Pouw,Afra Alishahi,Willem Zuidema

Main category: cs.CL

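TL;DR: Analyzes how sensitive TTS systems are to syntactic boundaries, finding that they perform poorly on syntactically ambiguous sentences and depend on surface markers such as commas, whereas for simpler syntax they exploit deeper syntactic cues. Fine-tuning yields intonation patterns that better reflect the underlying structure.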

Details

Motivation: Explore how TTS systems use syntactic information to generate intonational phrase boundaries, especially in syntactically ambiguous sentences. Method: Uses psycholinguistically inspired methods to analyze how TTS systems place intonational phrase boundaries, and fine-tunes models to improve performance. Result: TTS systems depend on surface markers in syntactically ambiguous sentences but exploit syntactic cues for simpler syntax. After fine-tuning, the models produce more accurate intonation patterns. Conclusion: Fine-tuning can improve how TTS systems use syntactic information, yielding intonation patterns that better match the underlying structure. Abstract: We analyze the syntactic sensitivity of Text-to-Speech (TTS) systems using methods inspired by psycholinguistic research. Specifically, we focus on the generation of intonational phrase boundaries, which can often be predicted by identifying syntactic boundaries within a sentence. We find that TTS systems struggle to accurately generate intonational phrase boundaries in sentences where syntactic boundaries are ambiguous (e.g., garden path sentences or sentences with attachment ambiguity). In these cases, systems need superficial cues such as commas to place boundaries at the correct positions. In contrast, for sentences with simpler syntactic structures, we find that systems do incorporate syntactic cues beyond surface markers. Finally, we finetune models on sentences without commas at the syntactic boundary positions, encouraging them to focus on more subtle linguistic cues. Our findings indicate that this leads to more distinct intonation patterns that better reflect the underlying structure.

[231] BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain

Yunsoo Kim,Yusuf Abdulle,Honghan Wu

Main category: cs.CL

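TL;DR: BioHopR is a new benchmark for multi-hop reasoning in the biomedical domain that fills a gap in existing evaluations and exposes the difficulties current models face in multi-hop reasoning.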

Details

Motivation: Existing benchmarks do not evaluate multi-hop reasoning in the biomedical domain, especially for queries involving complex relationships. Method: Builds BioHopR from PrimeKG, with 1-hop and 2-hop reasoning tasks, and evaluates a range of state-of-the-art models. Result: O3-mini performs best on both 1-hop and 2-hop tasks, but all models decline significantly on multi-hop tasks. Conclusion: BioHopR sets a new standard for evaluating biomedical multi-hop reasoning, exposes the gap between proprietary and open-source models, and motivates future research. Abstract: Biomedical reasoning often requires traversing interconnected relationships across entities such as drugs, diseases, and proteins. Despite the increasing prominence of large language models (LLMs), existing benchmarks lack the ability to evaluate multi-hop reasoning in the biomedical domain, particularly for queries involving one-to-many and many-to-many relationships. This gap leaves the critical challenges of biomedical multi-hop reasoning underexplored. To address this, we introduce BioHopR, a novel benchmark designed to evaluate multi-hop, multi-answer reasoning in structured biomedical knowledge graphs. Built from the comprehensive PrimeKG, BioHopR includes 1-hop and 2-hop reasoning tasks that reflect real-world biomedical complexities. Evaluations of state-of-the-art models reveal that O3-mini, a proprietary reasoning-focused model, achieves 37.93% precision on 1-hop tasks and 14.57% on 2-hop tasks, outperforming proprietary models such as GPT4O and open-source biomedical models including HuatuoGPT-o1-70B and Llama-3.3-70B. However, all models exhibit significant declines in multi-hop performance, underscoring the challenges of resolving implicit reasoning steps in the biomedical domain. By addressing the lack of benchmarks for multi-hop reasoning in the biomedical domain, BioHopR sets a new standard for evaluating reasoning capabilities and highlights critical gaps between proprietary and open-source models while paving the way for future advancements in biomedical LLMs.

[232] MRT at SemEval-2025 Task 8: Maximizing Recovery from Tables with Multiple Steps

Maximiliano Hormazábal Lagos,Álvaro Bueno Saez,Héctor Cerezo-Costas,Pedro Alonso Doval,Jorge Alcalde Vesteiro

Main category: cs.CL

TL;DR: Proposes using LLMs to generate Python code for question answering over tabular data through a multi-step pipeline, achieving a score of 70.50% on subtask 1.

Details

Motivation: Address the tabular question-answering challenge of SemEval 2025 Task 8, with the goal of extracting answers from tables efficiently. Method: A multi-step pipeline: understand the table content, generate natural language instructions, translate the instructions into code, then run the code and handle errors, using open-source LLMs and optimized prompts. Result: Achieved a score of 70.50% on subtask 1. Conclusion: Through code generation and LLM prompt optimization, the approach effectively tackles question answering over tabular data. Abstract: In this paper we present our approach to the SemEval 2025 Task 8: Question-Answering over Tabular Data challenge. Our strategy leverages Python code generation with LLMs to interact with the table and get the answer to the questions. The process is composed of multiple steps: understanding the content of the table, generating natural language instructions in the form of steps to follow in order to get the answer, translating these instructions to code, running it and handling potential errors or exceptions. These steps use open source LLMs and fine-grained, optimized prompts for each task (step). With this approach, we achieved a score of 70.50% for subtask 1.

[233] Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages

Shohei Higashiyama,Masao Utiyama

Main category: cs.CL

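TL;DR: The paper addresses the processing of informal expressions by building a large-scale Japanese dataset and normalization methods based on pretrained models, and shows that both encoder-only and decoder-only approaches are effective.

Details

Motivation: Processing informal expressions in user-generated text remains challenging, and comprehensive evaluations are lacking, especially for unsegmented languages. Method: Build a large-scale, multi-domain Japanese normalization dataset, develop normalization methods based on pretrained models, and run experiments from multiple evaluation perspectives. Result: Both encoder-only and decoder-only approaches perform well in accuracy and efficiency. Conclusion: The approach offers an effective solution for normalizing unsegmented languages and demonstrates the potential of pretrained models. Abstract: Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in both accuracy and efficiency.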

[234] Natural Language Processing in Support of Evidence-based Medicine: A Scoping Review

Zihan Xu,Haotian Ma,Gongbo Zhang,Yihao Ding,Chunhua Weng,Yifan Peng

Main category: cs.CL

TL;DR: This scoping review covers 129 studies on using natural language processing (NLP) to support evidence-based medicine (EBM), examining how NLP applies to the five key steps of EBM and how it improves clinical decision-making.

Details

Motivation: The medical literature is vast and growing rapidly, and manual curation is costly, so NLP techniques are needed to automate the identification, appraisal, synthesis, and dissemination of evidence. Method: A systematic review of 129 relevant studies, analyzing how NLP is applied across the five EBM steps (Ask, Acquire, Appraise, Apply, Assess). Result: NLP shows considerable potential for evidence extraction, synthesis, appraisal, and summarization, though limitations remain. Conclusion: By streamlining evidence processing and improving data comprehensibility, NLP could transform EBM practice; future research should explore this potential further. Abstract: Evidence-based medicine (EBM) is at the forefront of modern healthcare, emphasizing the use of the best available scientific evidence to guide clinical decisions. Due to the sheer volume and rapid growth of medical literature and the high cost of curation, there is a critical need to investigate Natural Language Processing (NLP) methods to identify, appraise, synthesize, summarize, and disseminate evidence in EBM. This survey presents an in-depth review of 129 research studies on leveraging NLP for EBM, illustrating its pivotal role in enhancing clinical decision-making processes. The paper systematically explores how NLP supports the five fundamental steps of EBM -- Ask, Acquire, Appraise, Apply, and Assess. The review not only identifies current limitations within the field but also proposes directions for future research, emphasizing the potential for NLP to revolutionize EBM by refining evidence extraction, evidence synthesis, appraisal, summarization, enhancing data comprehensibility, and facilitating a more efficient clinical workflow.

[235] Compensating for Data with Reasoning: Low-Resource Machine Translation with LLMs

Samuel Frontull,Thomas Ströhle

Main category: cs.CL

TL;DR: The paper proposes Fragment-Shot Prompting, a method that segments the input and retrieves translation examples based on syntactic coverage, improving translation for low-resource languages.

Details

Motivation: LLMs perform well in multilingual machine translation but still struggle with low-resource languages, particularly where prompt engineering is concerned. Method: Propose Fragment-Shot Prompting and Pivoted Fragment-Shot, which retrieve translation examples by syntactic coverage, and evaluate them on several models (GPT-3.5, GPT-4o, and others). Result: The methods are effective for low-resource translation, and syntactic coverage correlates positively with translation quality; models with stronger reasoning abilities perform better; prompt engineering yields limited gains when translating from a low-resource to a high-resource language. Conclusion: Fragment-Shot Prompting significantly improves low-resource translation quality, especially with models that reason well. Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in multilingual machine translation, sometimes even outperforming traditional neural systems. However, previous research has highlighted the challenges of using LLMs, particularly with prompt engineering, for low-resource languages. In this work, we introduce Fragment-Shot Prompting, a novel in-context learning method that segments input and retrieves translation examples based on syntactic coverage, along with Pivoted Fragment-Shot, an extension that enables translation without direct parallel data. We evaluate these methods using GPT-3.5, GPT-4o, o1-mini, LLaMA-3.3, and DeepSeek-R1 for translation between Italian and two Ladin variants, revealing three key findings: (1) Fragment-Shot Prompting is effective for translating into and between the studied low-resource languages, with syntactic coverage positively correlating with translation quality; (2) Models with stronger reasoning abilities make more effective use of retrieved knowledge, generally produce better translations, and enable Pivoted Fragment-Shot to significantly improve translation quality between the Ladin variants; and (3) prompt engineering offers limited, if any, improvements when translating from a low-resource to a high-resource language, where zero-shot prompting already yields satisfactory results. We publicly release our code and the retrieval corpora.

[236] 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training

Haosheng Zou,Xiaowei Lv,Shousheng Jia,Xiangzheng Zhang

Main category: cs.CL

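TL;DR: The open-source project 360-LLaMA-Factory adds sequence parallelism to LLaMA-Factory and has been adopted in several models and in companies' training frameworks.

Details

Motivation: Improve the performance and scalability of LLaMA-Factory to support more efficient model training. Method: Implement sequence parallelism in LLaMA-Factory and release the result as the open-source 360-LLaMA-Factory. Result: The project has been widely recognized and is used in several models and in companies' training frameworks. Conclusion: Sequence parallelism offers an efficient solution for model training, and the adoption of 360-LLaMA-Factory demonstrates its potential. Abstract: Adding sequence parallelism into LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and has been used in models such as Light-R1 arXiv:2503.10460, TinyR1 arXiv:2503.04872, Kaggle AIMO math models and also in large companies' training frameworks. This technical report delves deeper into the different sequence parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.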

[237] Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing

Yifan Lu,Jing Li,Yigeng Zhou,Yihui Zhang,Wenya Wang,Xiucheng Li,Meishan Zhang,Fangming Liu,Jun Yu,Min Zhang

Main category: cs.CL

TL;DR: ToxEdit dynamically detects and mitigates toxicity in large language models (LLMs), routing computation through adaptive inter-layer pathways to reduce toxicity while preserving general capabilities.

Details

Motivation: Existing LLM detoxification methods rely on entity-specific localization and tend to over-edit, leaving them ineffective against adversarial inputs or degrading model performance. Method: ToxEdit detects toxic activation patterns during forward propagation and routes computation through adaptive inter-layer pathways to mitigate toxicity. Result: Experiments show that ToxEdit outperforms existing methods in both detoxification and preservation of model capabilities. Conclusion: Through dynamic toxicity detection and adaptive pathway design, ToxEdit detoxifies effectively without compromising general capabilities. Abstract: Large language models (LLMs) exhibit impressive language capabilities but remain vulnerable to malicious prompts and jailbreaking attacks. Existing knowledge editing methods for LLM detoxification face two major challenges. First, they often rely on entity-specific localization, making them ineffective against adversarial inputs without explicit entities. Second, these methods suffer from over-editing, where detoxified models reject legitimate queries, compromising overall performance. In this paper, we propose ToxEdit, a toxicity-aware knowledge editing approach that dynamically detects toxic activation patterns during forward propagation. It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively. This design ensures precise toxicity mitigation while preserving LLMs' general capabilities. To more accurately assess over-editing, we also enhance the SafeEdit benchmark by incorporating instruction-following evaluation tasks. Experimental results on multiple LLMs demonstrate that our ToxEdit outperforms previous state-of-the-art methods in both detoxification performance and safeguarding general capabilities of LLMs.

[238] If Pigs Could Fly... Can LLMs Logically Reason Through Counterfactuals?

Ishwar B Balappanawar,Vamshi Krishna Bonagiri,Anish R Joishy,Manas Gaur,Krishnaprasad Thirunarayan,Ponnurangam Kumaraguru

Main category: cs.CL

TL;DR: The paper studies how well large language models (LLMs) reason logically when the context conflicts with their knowledge, and introduces the CounterLogic dataset and the Self-Segregate prompting method, which substantially improves performance.

Details

Motivation: Investigate the degradation of LLMs' logical reasoning under knowledge conflict and propose a remedy. Method: Introduce the CounterLogic dataset and design Self-Segregate, a prompting method that uses metacognitive awareness to identify knowledge conflicts. Result: Accuracy drops by 27% on average in counterfactual scenarios; Self-Segregate narrows the gap to 11% and raises overall accuracy by 7.5%. Conclusion: The study offers practical insights for understanding and strengthening LLMs' logical reasoning in real-world applications. Abstract: Large Language Models (LLMs) demonstrate impressive reasoning capabilities in familiar contexts, but struggle when the context conflicts with their parametric knowledge. To investigate this phenomenon, we introduce CounterLogic, a dataset containing 1,800 examples across 9 logical schemas, explicitly designed to evaluate logical reasoning through counterfactual (hypothetical knowledge-conflicting) scenarios. Our systematic evaluation of 11 LLMs across 6 different datasets reveals a consistent performance degradation, with accuracies dropping by 27% on average when reasoning through counterfactual information. We propose Self-Segregate, a prompting method enabling metacognitive awareness (explicitly identifying knowledge conflicts) before reasoning. Our method dramatically narrows the average performance gaps from 27% to just 11%, while significantly increasing the overall accuracy (+7.5%). We discuss the implications of these findings and draw parallels to human cognitive processes, particularly on how humans disambiguate conflicting information during reasoning tasks. Our findings offer practical insights for understanding and enhancing LLMs' reasoning capabilities in real-world applications, especially where models must logically reason independently of their factual knowledge.

[239] Advancing Expert Specialization for Better MoE

Hongcan Guo,Haolang Lu,Guoshun Nan,Bolun Chu,Jialin Zhuang,Yuan Yang,Wenhao Che,Sicong Leng,Qimei Cui,Xudong Jiang

Main category: cs.CL

TL;DR: Proposes an improvement to MoE models that strengthens expert specialization through orthogonality and variance losses, improving performance.

Details

Motivation: The auxiliary load-balancing loss in existing MoE models causes expert overlap and overly uniform routing, which hurts expert specialization and performance. Method: Introduce an orthogonality loss and a variance loss to improve training. Result: Experiments show markedly stronger expert specialization and performance gains of up to 23.79%, with load balancing maintained. Conclusion: The method is simple and effective, improving MoE performance without architectural changes. Abstract: Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.
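
The abstract names the two objectives but does not give their formulas, so the sketch below shows one plausible reading rather than the paper's definitions. It assumes the orthogonality loss is the mean squared cosine similarity between expert output vectors and the variance loss is the negated per-token variance of the routing probabilities; both function names and both formulations are assumptions made for illustration.

```python
import math

def orthogonality_loss(expert_outputs):
    """Mean squared cosine similarity over all distinct expert pairs.

    Assumed formulation: 0.0 when expert outputs are mutually orthogonal,
    1.0 when they all point in the same direction.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    n = len(expert_outputs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cos(expert_outputs[i], expert_outputs[j]) ** 2 for i, j in pairs) / len(pairs)

def variance_loss(router_probs):
    """Negated mean per-token variance of routing probabilities.

    Assumed formulation: minimizing it pushes each token's routing
    distribution away from uniform, i.e. toward more discriminative routing.
    """
    def var(p):
        mean = sum(p) / len(p)
        return sum((x - mean) ** 2 for x in p) / len(p)

    return -sum(var(p) for p in router_probs) / len(router_probs)

# Two experts with orthogonal outputs, and one token routed decisively.
print(orthogonality_loss([[1.0, 0.0], [0.0, 1.0]]))
print(variance_loss([[1.0, 0.0]]))
```

In training, terms like these would be added to the task loss and the auxiliary load-balancing loss with tunable weights; the weights are not specified in the abstract.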

Antonia Karamolegkou,Angana Borah,Eunjung Cho,Sagnik Ray Choudhury,Martina Galletti,Rajarshi Ghosh,Pranav Gupta,Oana Ignat,Priyanka Kargupta,Neema Kotonya,Hemank Lamba,Sun-Joo Lee,Arushi Mangla,Ishani Mondal,Deniz Nazarova,Poli Nemkova,Dina Pisarevskaya,Naquee Rizwan,Nazanin Sabri,Dominik Stammbach,Anna Steinberg,David Tomás,Steven R Wilson,Bowen Yi,Jessica H Zhu,Arkaitz Zubiaga,Anders Søgaard,Alexander Fraser,Zhijing Jin,Rada Mihalcea,Joel R. Tetreault,Daryna Dementieva

Main category: cs.CL

TL;DR: The paper examines the role of NLP in addressing societal challenges, emphasizing responsible and equitable research directions.

Details

Motivation: With the rapid progress of LLMs, the NLP field needs more responsible and intentional deployment to address societal problems. Method: Identify research directions through a cross-disciplinary analysis of social goals and emerging risks. Result: Outlines the challenges and opportunities in NLP4SG research. Conclusion: NLP research must be responsible and equitable if it is to advance social good. Abstract: Recent advancements in large language models (LLMs) have unlocked unprecedented possibilities across a range of applications. However, as a community, we believe that the field of Natural Language Processing (NLP) has a growing need to approach deployment with greater intentionality and responsibility. In alignment with the broader vision of AI for Social Good (Tomašev et al., 2020), this paper examines the role of NLP in addressing pressing societal challenges. Through a cross-disciplinary analysis of social goals and emerging risks, we highlight promising research directions and outline challenges that must be addressed to ensure responsible and equitable progress in NLP4SG research.

[241] Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Lai Wei,Yuting Li,Kaipeng Zheng,Chen Wang,Yue Wang,Linghe Kong,Lichao Sun,Weiran Huang

Main category: cs.CL

TL;DR: The paper proposes a two-stage approach (supervised fine-tuning followed by reinforcement learning) to improve the reasoning of multimodal language models, achieving state-of-the-art performance on multiple benchmarks.

Details

Motivation: Examine whether self-correction patterns exist in multimodal LLMs (MLLMs) and how they relate to reasoning performance, and explore how combining supervised fine-tuning with reinforcement learning can further improve reasoning. Method: A two-stage approach: (1) supervised fine-tuning (SFT) as a cold start that introduces structured chain-of-thought reasoning patterns; (2) reinforcement learning (RL) with GRPO to refine reasoning. Result: At both 3B and 7B scales, the approach outperforms SFT-only and RL-only methods and reaches state-of-the-art performance on multiple benchmarks. Conclusion: The study offers practical guidance for building advanced multimodal reasoning models and demonstrates the value of combining SFT with RL. Abstract: Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.

[242] Text2Grad: Reinforcement Learning from Natural Language Feedback

Hanyang Wang,Lu Wang,Chaoyun Zhang,Tianjun Mao,Si Qin,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang

Main category: cs.CL

TL;DR: Text2Grad converts free-form textual feedback into fine-grained gradient signals that directly optimize a language model's policy, improving both task performance and interpretability.

Details

Motivation: Traditional RLHF uses coarse scalar rewards that hide the detailed reasons for success or failure, making learning slow and opaque. Text2Grad aims for more precise optimization through textual feedback. Method: Text2Grad has three components: a feedback-annotation pipeline, a fine-grained reward model, and a policy optimizer, which together turn textual feedback into gradient signals that directly optimize the policy. Result: On summarization, code generation, and question answering, Text2Grad outperforms scalar-reward RL and prompt-only baselines, with higher task metrics and better interpretability. Conclusion: Converting natural-language feedback into gradient signals is a powerful approach to fine-grained policy optimization, as Text2Grad demonstrates. Abstract: Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level rewards on the answer while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available at https://github.com/microsoft/Text2Grad

[243] LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High

Judith Sieker,Clara Lachenmaier,Sina Zarrieß

Main category: cs.CL

TL;DR: The paper studies how LLMs handle false presuppositions and whether certain linguistic factors affect their responses to falsely presupposed content. It finds that LLMs struggle to recognize false presuppositions and that performance varies by condition.

Details

Motivation: False presuppositions can embed misleading information, raising concerns about whether LLMs can detect and correct it, especially in political contexts. Method: A systematic approach based on linguistic presupposition analysis probes LLMs' sensitivity to false presuppositions under different conditions, using a newly created dataset and three LLMs (GPT-4-o, LLama-3-8B, Mistral-7B-v03). Result: The LLMs perform poorly at recognizing false presuppositions, and performance varies with linguistic construction, political party, and scenario probability. Conclusion: Linguistic presupposition analysis is an effective tool for uncovering the reinforcement of political misinformation in LLM responses. Abstract: This paper examines how LLMs handle false presuppositions and whether certain linguistic factors influence their responses to falsely presupposed content. Presuppositions subtly introduce information as given, making them highly effective at embedding disputable or false information. This raises concerns about whether LLMs, like humans, may fail to detect and correct misleading assumptions introduced as false presuppositions, even when the stakes of misinformation are high. Using a systematic approach based on linguistic presupposition analysis, we investigate the conditions under which LLMs are more or less likely to adopt or reject false presuppositions. Focusing on political contexts, we examine how factors like linguistic construction, political party, and scenario probability impact the recognition of false presuppositions. We conduct experiments with a newly created dataset and examine three LLMs: OpenAI's GPT-4-o, Meta's LLama-3-8B, and MistralAI's Mistral-7B-v03. Our results show that the models struggle to recognize false presuppositions, with performance varying by condition. This study highlights that linguistic presupposition analysis is a valuable tool for uncovering the reinforcement of political misinformation in LLM responses.

[244] Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition

Hanting Chen,Yasheng Wang,Kai Han,Dong Li,Lin Li,Zhenni Bi,Jinpeng Li,Haoyu Wang,Fei Mi,Mingjian Zhu,Bin Wang,Kaikai Song,Yifei Fu,Xu He,Yu Luo,Chong Zhu,Quan He,Xueyu Wu,Wei He,Hailin Hu,Yehui Tang,Dacheng Tao,Xinghao Chen,Yunhe Wang,Other Contributors

Main category: cs.CL

TL;DR: Pangu Embedded is an efficient LLM reasoner that gains fast and slow thinking modes through a two-stage training framework, substantially reducing computational cost and inference latency.

Details

Motivation: Address the high computational cost and inference latency of existing reasoning-optimized LLMs. Method: A two-stage training framework: Stage 1 refines the model through iterative distillation and reinforcement learning; Stage 2 introduces a dual-system framework supporting fast and slow reasoning modes. Result: Outperforms similar-size models such as Qwen3-8B and GLM4-9B on multiple benchmarks. Conclusion: Pangu Embedded points to a promising direction for powerful yet practical LLM reasoners. Abstract: This work presents Pangu Embedded, an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs), featuring flexible fast and slow thinking capabilities. Pangu Embedded addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. We propose a two-stage training framework for its construction. In Stage 1, the model is finetuned via an iterative distillation process, incorporating inter-iteration model merging to effectively aggregate complementary knowledge. This is followed by reinforcement learning on Ascend clusters, optimized by a latency-tolerant scheduler that combines stale synchronous parallelism with prioritized data queues. The RL process is guided by a Multi-source Adaptive Reward System (MARS), which generates dynamic, task-specific reward signals using deterministic metrics and lightweight LLM evaluators for mathematics, coding, and general problem-solving tasks. Stage 2 introduces a dual-system framework, endowing Pangu Embedded with a "fast" mode for routine queries and a deeper "slow" mode for complex inference. This framework offers both manual mode switching for user control and an automatic, complexity-aware mode selection mechanism that dynamically allocates computational resources to balance latency and reasoning depth. Experimental results on benchmarks including AIME 2024, GPQA, and LiveCodeBench demonstrate that Pangu Embedded, with 7B parameters, outperforms similar-size models like Qwen3-8B and GLM4-9B. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture, highlighting a promising direction for developing powerful yet practically deployable LLM reasoners.

[245] RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning

Kun Li,Yunxiang Li,Tianhua Zhang,Hongyin Luo,Xixin Wu,James Glass,Helen Meng

Main category: cs.CL

TL;DR: RAG-Zeval is an end-to-end framework that evaluates the faithfulness and correctness of RAG systems as a rule-guided reasoning task, training evaluators with reinforcement learning to cut computational cost and improve performance.

Details

Motivation: Current LLM-based evaluation frameworks depend on resource-intensive models and multi-stage prompts, underusing models' reasoning capabilities at high computational cost. Method: RAG-Zeval recasts evaluation as a rule-guided reasoning task, trains evaluators with reinforcement learning, and uses a ranking-based reward mechanism. Result: RAG-Zeval outperforms baselines, correlates most strongly with human judgments, and is more interpretable. Conclusion: RAG-Zeval provides an efficient, interpretable way to evaluate RAG systems at substantially lower computational cost. Abstract: Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems. However, current LLM-based evaluation frameworks predominantly rely on directly prompting resource-intensive models with complex multi-stage prompts, underutilizing models' reasoning capabilities and introducing significant computational cost. In this paper, we present RAG-Zeval (RAG-Zero Evaluator), a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments with detailed explanation in one pass. We introduce a ranking-based outcome reward mechanism, using preference judgments rather than absolute scores, to address the challenge of obtaining precise pointwise reward signals. To this end, we synthesize the ranking references by generating quality-controlled responses with zero human annotation. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments and outperforming baselines that rely on LLMs with 10-100 times more parameters. Our approach also exhibits superior interpretability in response evaluation.
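
To make the idea of a ranking-based reward concrete, the sketch below scores an evaluator's ranking of candidate responses by how many pairwise preferences it shares with a reference ranking. The abstract does not specify the actual reward function, so the pairwise-agreement measure and the name `pairwise_ranking_reward` are assumptions for illustration only.

```python
from itertools import combinations

def pairwise_ranking_reward(predicted, reference):
    """Fraction of response pairs ordered the same way in both rankings.

    Both arguments list the same response ids from best to worst.
    Returns 1.0 for identical rankings and 0.0 for a fully reversed one.
    """
    pos_pred = {item: i for i, item in enumerate(predicted)}
    pos_ref = {item: i for i, item in enumerate(reference)}
    pairs = list(combinations(reference, 2))
    agree = sum(
        (pos_pred[a] < pos_pred[b]) == (pos_ref[a] < pos_ref[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# The evaluator swaps the two weakest responses: 2 of 3 pairs still agree.
reward = pairwise_ranking_reward(["A", "C", "B"], ["A", "B", "C"])
print(reward)
```

A reward of this kind depends only on relative order, which is why it sidesteps the need for precise pointwise scores.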

Lai Wei,Yuting Li,Chen Wang,Yue Wang,Linghe Kong,Weiran Huang,Lichao Sun

Main category: cs.CL

TL;DR: The paper proposes MM-UPT, an unsupervised post-training framework that uses the GRPO algorithm and a self-rewarding mechanism to improve the reasoning of multimodal large language models without external supervision.

Details

Motivation: Conventional supervised fine-tuning depends on expensive human-annotated data, while existing unsupervised methods are complex and hard to iterate on. The paper seeks a stable, scalable unsupervised post-training method. Method: MM-UPT builds on GRPO, replaces conventional reward signals with a majority-voting self-reward, and adds synthetic question generation. Result: MM-UPT significantly improves model performance (e.g., on MathVista and We-Math), outperforms existing unsupervised baselines, and approaches supervised GRPO. Conclusion: MM-UPT offers a new paradigm for continual, autonomous improvement of multimodal LLMs without external supervision. Abstract: Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math), using standard dataset without ground truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
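
The self-rewarding mechanism described above, majority voting over sampled responses feeding GRPO, can be sketched in a few lines. This is a minimal illustration, not the released MM-UPT code: the function names are hypothetical, answers are assumed to be already extracted as comparable strings, and ties are broken by whichever answer was sampled first.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Reward 1.0 for each sampled answer equal to the most common one, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples, as GRPO does."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one question; no ground-truth label is used.
answers = ["12", "12", "7", "12"]
rewards = majority_vote_rewards(answers)
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Responses that agree with the majority receive a positive advantage and the dissenting one a negative advantage, so the policy is pushed toward its own consensus answer.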

[247] EvolveSearch: An Iterative Self-Evolving Search Agent

Dingchu Zhang,Yida Zhao,Jialong Wu,Baixuan Li,Wenbiao Yin,Liwen Zhang,Yong Jiang,Yufeng Li,Kewei Tu,Pengjun Xie,Fei Huang

Main category: cs.CL

TL;DR: EvolveSearch is a self-evolving framework that combines supervised fine-tuning (SFT) and reinforcement learning (RL), markedly improving LLM performance in open web search.

Details

Motivation: Mainstream approaches face challenges in data production and data utilization efficiency in open-search domains, calling for a solution that needs no external human-annotated data. Method: EvolveSearch, a self-evolving framework combining SFT and RL, improves search capability through iterative refinement. Result: Across seven multi-hop question-answering benchmarks, it improves average performance by 4.7% over the previous best method. Conclusion: EvolveSearch opens a new avenue for self-evolving capabilities in open web search. Abstract: The rapid advancement of large language models (LLMs) has transformed the landscape of agentic information seeking capabilities through the integration of tools such as search engines and web browsers. However, current mainstream approaches for enabling LLM web search proficiency face significant challenges: supervised fine-tuning struggles with data production in open-search domains, while RL converges quickly, limiting its data utilization efficiency. To address these issues, we propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data. Extensive experiments on seven multi-hop question-answering (MHQA) benchmarks demonstrate that EvolveSearch consistently improves performance across iterations, ultimately achieving an average improvement of 4.7% over the current state-of-the-art across seven benchmarks, opening the door to self-evolution agentic capabilities in open web search domains.

[248] Multi-MLLM Knowledge Distillation for Out-of-Context News Detection

Yimeng Gu,Zhao Tong,Ignacio Castro,Shu Wu,Gareth Tyson

Main category: cs.CL

TL;DR: The paper proposes a two-stage knowledge distillation framework in which teacher models generate labels and rationales, improving small multimodal LLMs in low-resource settings and sharply reducing the need for labeled data.

Details

Motivation: Existing methods rely on large amounts of labeled data or expensive API calls and are impractical in low-resource scenarios, so the paper studies how to improve small models more efficiently and cheaply. Method: (1) Prompt multiple teacher models to generate labels and rationales as knowledge; (2) two-stage knowledge distillation: Stage 1 fine-tunes the student with LoRA, and Stage 2 further fine-tunes it with LoRA and DPO on data points where teacher predictions conflict. Result: The method reaches state-of-the-art performance with less than 10% of the labeled data. Conclusion: The method improves small-model performance in low-resource scenarios while reducing annotation cost and computational overhead. Abstract: Multimodal out-of-context news is a type of misinformation in which the image is used outside of its original context. Many existing works have leveraged multimodal large language models (MLLMs) for detecting out-of-context news. However, observing the limited zero-shot performance of smaller MLLMs, they generally require label-rich fine-tuning and/or expensive API calls to GPT models to improve the performance, which is impractical in low-resource scenarios. In contrast, we aim to improve the performance of small MLLMs in a more label-efficient and cost-effective manner. To this end, we first prompt multiple teacher MLLMs to generate both label predictions and corresponding rationales, which collectively serve as the teachers' knowledge. We then introduce a two-stage knowledge distillation framework to transfer this knowledge to a student MLLM. In Stage 1, we apply LoRA fine-tuning to the student model using all training data. In Stage 2, we further fine-tune the student model using both LoRA fine-tuning and DPO on the data points where teachers' predictions conflict. This two-stage strategy reduces annotation costs and helps the student model uncover subtle patterns in more challenging cases. Experimental results demonstrate that our approach achieves state-of-the-art performance using less than 10% labeled data.
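
Stage 2 operates only on examples where the teachers disagree, so the data must first be partitioned by teacher agreement. The sketch below shows that selection step alone, under the assumption that each example has one predicted label per teacher; the function name, the label strings, and the example ids are hypothetical, and the LoRA and DPO training themselves are not shown.

```python
def split_by_teacher_agreement(teacher_predictions):
    """Partition example ids by whether all teachers predicted the same label.

    `teacher_predictions` maps an example id to a list of labels, one per teacher.
    Returns (agreed_ids, conflicting_ids) in input order.
    """
    agreed, conflicting = [], []
    for example_id, labels in teacher_predictions.items():
        if len(set(labels)) == 1:
            agreed.append(example_id)
        else:
            conflicting.append(example_id)
    return agreed, conflicting

predictions = {
    "news-001": ["out-of-context", "out-of-context", "out-of-context"],
    "news-002": ["out-of-context", "in-context", "out-of-context"],
    "news-003": ["in-context", "in-context", "in-context"],
}
agreed, conflicting = split_by_teacher_agreement(predictions)
print(conflicting)  # candidates for the Stage 2 subset
```

In the framework described above, Stage 1 would use all training data, while only the conflicting subset would feed the Stage 2 fine-tuning.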

[249] Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs

Changhao Song,Yazhou Zhang,Peng Zhang

Main category: cs.CL

TL;DR: Proposes a task-adaptive reasoning framework that uses variable-length reasoning chains and a composite reward function to substantially improve performance on emotion understanding tasks.

Details

Motivation: Current methods use fixed-length reasoning chains that cannot adapt to the varying complexity of emotion tasks, so a way to adjust reasoning depth dynamically is needed. Method: Combine fine-tuning with reinforcement learning and design a composite reward function that balances accuracy, reasoning-depth control, path diversity, and suppression of repetitive logic. Result: F1 and accuracy improve on sentiment, emotion, humor, and sarcasm tasks, with the largest gains on advanced tasks. Conclusion: Through adaptive-depth analysis, the framework bridges the gap between rigid reasoning and emotional complexity. Abstract: Emotion understanding includes basic tasks (e.g., sentiment/emotion classification) and advanced tasks (e.g., sarcasm/humor detection). Current methods rely on fixed-length CoT reasoning, failing to adapt to the varying complexity of emotions. We propose a task-adaptive reasoning framework that employs DeepSeek-R1 to generate variable-length reasoning chains for different emotion tasks. By combining fine-tuning with reinforcement learning, we design a composite reward function that balances four objectives: prediction accuracy, adaptive reasoning depth control, structural diversity in reasoning paths, and suppression of repetitive logic. This approach achieves dynamic context-sensitive inference while enabling LLMs to autonomously develop deep reasoning capabilities. Experimental results demonstrate consistent improvements in both Acc and F1 scores across four tasks: emotion, sentiment, humor, and sarcasm. Notably, peak enhancements reached 3.56% F1 (2.76% Acc) for basic tasks and 37.95% F1 (23.14% Acc) for advanced tasks. Our work bridges rigid CoT reasoning and emotional complexity through adaptive-depth analysis.

[250] ClaimPKG: Enhancing Claim Verification via Pseudo-Subgraph Generation with Lightweight Specialized LLM

Hoang Pham,Thanh-Do Nguyen,Khac-Hoai Nam Bui

Main category: cs.CL

TL;DR: ClaimPKG is an end-to-end framework that combines knowledge graphs (KGs) with large language models (LLMs) to strengthen reasoning in claim verification.

Details

Motivation: Existing methods rely mainly on unstructured text and make poor use of the structured knowledge in KGs, while LLMs fall short in multi-step reasoning and in adapting to KGs. Method: ClaimPKG uses a lightweight LLM to generate pseudo-subgraphs that guide subgraph retrieval, then a general-purpose LLM processes the retrieved subgraphs to produce the final result. Result: On the FactKG dataset, ClaimPKG outperforms baselines by 9%-12% and generalizes zero-shot to unstructured datasets. Conclusion: ClaimPKG successfully combines KGs with LLMs, offering an efficient and general solution for claim verification. Abstract: Integrating knowledge graphs (KGs) to enhance the reasoning capabilities of large language models (LLMs) is an emerging research challenge in claim verification. While KGs provide structured, semantically rich representations well-suited for reasoning, most existing verification methods rely on unstructured text corpora, limiting their ability to effectively leverage KGs. Additionally, despite possessing strong reasoning abilities, modern LLMs struggle with multi-step modular pipelines and reasoning over KGs without adaptation. To address these challenges, we propose ClaimPKG, an end-to-end framework that seamlessly integrates LLM reasoning with structured knowledge from KGs. Specifically, the main idea of ClaimPKG is to employ a lightweight, specialized LLM to represent the input claim as pseudo-subgraphs, guiding a dedicated subgraph retrieval module to identify relevant KG subgraphs. These retrieved subgraphs are then processed by a general-purpose LLM to produce the final verdict and justification. Extensive experiments on the FactKG dataset demonstrate that ClaimPKG achieves state-of-the-art performance, outperforming strong baselines in this research field by 9%-12% accuracy points across multiple categories. Furthermore, ClaimPKG exhibits zero-shot generalizability to unstructured datasets such as HoVer and FEVEROUS, effectively combining structured knowledge from KGs with LLM reasoning across various LLM backbones.

[251] Do Large Language Models Think Like the Brain? Sentence-Level Evidence from fMRI and Hierarchical Embeddings

Yu Lei,Xingyang Ge,Yi Zhang,Yiming Yang,Bolei Ma

Main category: cs.CL

TL;DR: The study asks whether large language models (LLMs) and the human brain share computational principles at the level of sentence comprehension, and finds that improved model performance brings representational architectures closer to the brain's hierarchy.

Details

Motivation: Whether LLMs and the human brain follow similar computational principles is an important question in cognitive neuroscience and AI. Method: Compare hierarchical representations from 14 publicly available LLMs with fMRI data from human sentence comprehension and build sentence-level neural prediction models. Result: Improved model performance brings representational architectures closer to the brain's hierarchy, with stronger functional and anatomical correspondence at higher levels of semantic abstraction. Conclusion: As performance improves, LLM representational architectures progressively align with the hierarchical structure of human language processing. Abstract: Understanding whether large language models (LLMs) and the human brain converge on similar computational principles remains a fundamental and important question in cognitive neuroscience and AI. Do the brain-like patterns observed in LLMs emerge simply from scaling, or do they reflect deeper alignment with the architecture of human language processing? This study focuses on the sentence-level neural mechanisms of language models, systematically investigating how hierarchical representations in LLMs align with the dynamic neural responses during human sentence comprehension. By comparing hierarchical embeddings from 14 publicly available LLMs with fMRI data collected from participants, who were exposed to a naturalistic narrative story, we constructed sentence-level neural prediction models to precisely identify the model layers most significantly correlated with brain region activations. Results show that improvements in model performance drive the evolution of representational architectures toward brain-like hierarchies, particularly achieving stronger functional and anatomical correspondence at higher semantic abstraction levels.

[252] Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems

Hoang Pham,Khac-Hoai Nam Bui

Main category: cs.CL

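TL;DR: Proposes Agent-UniRAG, an LLM-agent-based framework for unified retrieval-augmented generation (RAG) that supports both single-hop and multi-hop queries, and introduces the synthetic dataset SynAgent-RAG to improve the performance of small open-source LLMs.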

Details

Motivation: Existing RAG systems are mostly designed for either single-hop or multi-hop queries separately, which limits real-world use. The paper aims to develop a unified framework that improves the effectiveness and interpretability of RAG systems. Method: Designs the LLM-agent-based framework Agent-UniRAG, which processes queries step by step according to input complexity, and introduces the synthetic dataset SynAgent-RAG to support small LLMs. Result: Performance on multiple RAG benchmarks is comparable to closed-source and larger open-source LLMs. Conclusion: Agent-UniRAG offers an effective and interpretable solution for unified RAG systems and is applicable to LLMs of different sizes. Abstract: This paper presents a novel approach for unified retrieval-augmented generation (RAG) systems using the recently emerging large language model (LLM) agent concept. Specifically, Agent LLM, which utilizes LLM as fundamental controllers, has become a promising approach to enable the interpretability of RAG tasks, especially for complex reasoning question-answering systems (e.g., multi-hop queries). Nonetheless, previous works mainly focus on solving RAG systems with either single-hop or multi-hop approaches separately, which limits the application of those approaches to real-world applications. In this study, we propose a trainable agent framework called Agent-UniRAG for unified retrieval-augmented LLM systems, which enhances the effectiveness and interpretability of RAG systems. The main idea is to design an LLM agent framework to solve RAG tasks step-by-step based on the complexity of the inputs, simultaneously including single-hop and multi-hop queries in an end-to-end manner. Furthermore, we introduce SynAgent-RAG, a synthetic dataset to enable the proposed agent framework for small open-source LLMs (e.g., Llama-3-8B). The results show comparable performances with closed-source and larger open-source LLMs across various RAG benchmarks. Our source code and dataset are publicly available for further exploitation.

[253] Fusion Steering: Prompt-Specific Activation Control

Waldemar Chang,Alhassan Yasin

Main category: cs.CL

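TL;DR: Fusion Steering is an activation steering method that improves the factual accuracy of large language models on question-answering tasks by dynamically injecting prompt-specific activation deltas.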

Details

Motivation: Traditional methods are limited to single-layer or fixed-layer operations and cannot adapt flexibly to the needs of different prompts, so a more dynamic and flexible activation steering method is needed. Method: Uses full-layer and segmented steering configurations, dynamically injects activation deltas derived from reference completions, and optimizes the injection weights with Optuna to balance factual alignment and fluency. Result: On SimpleQA, segmented steering reaches 25.4% accuracy, well above the baseline (3.5%) and full-layer steering (16.2%). Conclusion: Segmented dynamic intervention and full-network activation control show promise for improving model performance, and the approach is compatible with sparse representations, pointing toward interpretable and scalable activation control. Abstract: We present Fusion Steering, an activation steering methodology that improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Unlike traditional methods constrained to single-layer or fixed-layer operations, Fusion Steering employs dynamic injection of prompt-specific activation deltas across all transformer layers. These activation deltas are derived from reference completions that combine the ground-truth answer with a model-generated explanation to facilitate semantically enriched, example-specific steering. The injection weights are optimized per prompt using Optuna, targeting a joint objective that balances token overlap (factual alignment) and perplexity (fluency proxy). Evaluation employs a composite score integrating token overlap and LLM-graded quality, encompassing factual accuracy, coherence, and relevance. Empirical results on 260 SimpleQA prompts (selected from 500 where the baseline failed) showcase the efficacy of segmented steering. Using Gemma-2-2B-IT with 8-bit quantization, segmented steering achieves an accuracy of 25.4% (outputs scoring $\geq 0.6$), outperforming the baseline at 3.5% and full-layer steering at 16.2%. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%. These findings highlight the strengths of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control.
Fusion Steering is also amenable to sparse representations, such as Neuronpedia or sparse crosscoders, suggesting a promising direction for interpretable and scalable activation-level control in LLMs.

[254] Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts

Xue Zhang,Yunlong Liang,Fandong Meng,Songming Zhang,Yufeng Chen,Jinan Xu,Jie Zhou

Main category: cs.CL

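TL;DR: Proposes LayerMoE, an expert allocation algorithm based on per-layer cross-lingual similarity, for efficiently extending multilingual LLMs to new languages while reducing parameter cost and mitigating forgetting of old languages.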

Details

Motivation: Existing methods for adding new languages are parameter-costly and still degrade performance on old languages, so a more efficient solution is needed. Method: Analyzes the similarity of language representations at each LLM layer, allocates the number of new experts per layer based on that similarity, and adds a classifier on high-similarity layers to guide routing. Result: In the single-expansion and lifelong-expansion settings, the method uses 60% and 33.3% fewer experts respectively while outperforming existing methods. Conclusion: By optimizing expert allocation through layer-wise similarity, LayerMoE markedly improves the efficiency and performance of multilingual expansion. Abstract: Continually expanding new languages for existing large language models (LLMs) is a promising yet challenging approach to building powerful multilingual LLMs. The biggest challenge is to make the model continuously learn new languages while preserving the proficient ability of old languages. To achieve this, recent work utilizes the Mixture-of-Experts (MoE) architecture to expand new languages by adding new experts and avoid catastrophic forgetting of old languages by routing corresponding tokens to the original model backbone (old experts). Although intuitive, this kind of method is parameter-costly when expanding new languages and still inevitably impacts the performance of old languages. To address these limitations, we analyze the language characteristics of different layers in LLMs and propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Specifically, we find different layers in LLMs exhibit different representation similarities between languages and then utilize the similarity as the indicator to allocate experts for each layer, i.e., the higher similarity, the fewer experts. Additionally, to further mitigate the forgetting of old languages, we add a classifier in front of the router network on the layers with higher similarity to guide the routing of old language tokens. Experimental results show that our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting and with 33.3% fewer experts in the lifelong-expansion setting, demonstrating the effectiveness of our method.
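The rule "the higher similarity, the fewer experts" can be illustrated with a minimal sketch. This is not the LayerMoE algorithm itself: the abstract does not give the allocation formula, so the function name `allocate_experts`, the proportional-to-dissimilarity weighting, the `min_per_layer` floor, and the similarity values below are all assumptions chosen for illustration.

```python
def allocate_experts(similarities, total_budget, min_per_layer=1):
    """Split a budget of new experts across layers, giving fewer experts
    to layers whose cross-lingual similarity is higher.

    Hypothetical sketch: weights are (1 - similarity), which is an assumed
    rule, not the formula from the paper.
    """
    n = len(similarities)
    if total_budget < n * min_per_layer:
        raise ValueError("budget too small for the per-layer minimum")

    # Layers with low similarity get a larger share of the remaining budget.
    weights = [1.0 - s for s in similarities]
    remaining = total_budget - n * min_per_layer
    total_w = sum(weights)
    if total_w == 0:
        shares = [remaining / n] * n
    else:
        shares = [remaining * w / total_w for w in weights]

    # Largest-remainder rounding so the allocation sums exactly to the budget.
    alloc = [min_per_layer + int(s) for s in shares]
    leftover = total_budget - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Made-up similarities for three layers, from most to least similar.
    print(allocate_experts([0.9, 0.5, 0.1], total_budget=9))
```

The largest-remainder step is only there to keep the total fixed; any monotone mapping from similarity to expert count would express the same idea.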

[255] Precise In-Parameter Concept Erasure in Large Language Models

Yoav Gur-Arieh,Clara Suslik,Yihuai Hong,Fazl Barez,Mor Geva

Main category: cs.CL

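TL;DR: PISCES is a new framework that precisely erases concepts from language models by directly editing directions in parameter space. It outperforms existing methods in efficacy, specificity, and robustness.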

Details

Motivation: Large language models (LLMs) may acquire knowledge during pretraining that is unwanted downstream (such as sensitive information or copyrighted content), and existing approaches (such as fine-tuning or fact-level editing) have limited effect. Method: PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies features tied to the target concept with automated interpretability techniques, and removes them from the model parameters. Result: Experiments on Gemma 2 and Llama 3.1 show that PISCES reduces accuracy on the target concept to as low as 7.7%, improves specificity by up to 31%, and improves robustness by up to 38%. Conclusion: Feature-based in-parameter editing provides a more precise and reliable way to erase concepts from language models. Abstract: Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.

[256] Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning

Erxin Yu,Jing Li,Ming Liao,Qi Zhu,Boyang Xue,Minghui Xu,Baojun Wang,Lanqing Hong,Fei Mi,Lifeng Shang

Main category: cs.CL

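TL;DR: The SEI framework improves the mathematical reasoning of large language models by identifying error types, generating targeted training data, and refining iteratively.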

Details

Motivation: Large language models still produce many bad cases in mathematical reasoning, and existing methods learn only from isolated cases and fail to generalize error patterns. Method: Generates keyphrases by analyzing bad cases, clusters them to identify error types, uses GPT-4o to synthesize targeted training data, and refines the data through one-shot learning. Result: Reasoning ability improves on GSM8K and MATH and generalizes to other mathematics datasets. Conclusion: The SEI framework effectively improves LLM mathematical reasoning through error generalization. Abstract: Although large language models demonstrate strong performance across various domains, they still struggle with numerous bad cases in mathematical reasoning. Previous approaches to learning from errors synthesize training data by solely extrapolating from isolated bad cases, thereby failing to generalize the extensive patterns inherent within these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized targeted training data. Specifically, we explore a target model on two mathematical datasets, GSM8K and MATH, to pinpoint bad cases. Then, we generate error keyphrases for these cases based on the instructor model's (GPT-4o) analysis and identify error types by clustering these keyphrases. Next, we sample a few bad cases during each generation for each identified error type and input them into the instructor model, which synthesizes additional training data using a self-instruct approach. This new data is refined through a one-shot learning process to ensure that only the most effective examples are kept. Finally, we use these curated data to fine-tune the target model, iteratively repeating the process to enhance performance. We apply our framework to various models and observe improvements in their reasoning abilities across both in-domain and out-of-domain mathematics datasets. These results demonstrate the effectiveness of self-error instruction in improving LLMs' mathematical reasoning through error generalization.

[257] Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Chengyue Wu,Hao Zhang,Shuchen Xue,Zhijian Liu,Shizhe Diao,Ligeng Zhu,Ping Luo,Song Han,Enze Xie

Main category: cs.CL

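TL;DR: The paper proposes a block-wise approximate KV Cache mechanism for bidirectional diffusion models and a confidence-aware parallel decoding strategy, substantially speeding up inference for Diffusion LLMs while preserving generation quality.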

Details

Motivation: Existing open-source Diffusion LLMs lag behind autoregressive models in inference speed, and generation quality drops under parallel decoding; both problems must be solved for practical deployment. Method: (1) A block-wise approximate KV Cache mechanism that enables cache reuse; (2) a confidence-aware parallel decoding strategy that selectively decodes high-confidence tokens to reduce dependency violations. Result: Experiments on the LLaDA and Dream models show up to 27.6x higher throughput with minimal accuracy loss. Conclusion: Optimizing caching and parallel decoding substantially narrows the performance gap between Diffusion LLMs and autoregressive models and paves the way for practical use. Abstract: Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6$\times$ throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
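The confidence-threshold selection step described in the abstract can be sketched as follows. This is a hedged illustration, not the Fast-dLLM implementation: the function name `select_tokens_to_decode`, the dictionary-based inputs, the default threshold of 0.9, and the fallback of committing the single most confident position when nothing clears the threshold are all assumptions.

```python
def select_tokens_to_decode(probs, masked_positions, threshold=0.9):
    """Pick which masked positions to commit in one parallel decoding step.

    probs: dict mapping position -> list of token probabilities at that position.
    Returns a dict mapping position -> chosen token id.

    Hypothetical sketch: only positions whose top probability exceeds the
    threshold are decoded together; the rest stay masked for a later step.
    """
    candidates = {}
    for pos in masked_positions:
        dist = probs[pos]
        token = max(range(len(dist)), key=dist.__getitem__)
        candidates[pos] = (token, dist[token])

    chosen = {pos: tok for pos, (tok, conf) in candidates.items() if conf > threshold}

    # Assumed fallback: always commit at least one token so decoding progresses.
    if not chosen and candidates:
        best = max(candidates, key=lambda p: candidates[p][1])
        chosen[best] = candidates[best][0]
    return chosen


if __name__ == "__main__":
    # Made-up distributions over a two-token vocabulary at three masked positions.
    step_probs = {0: [0.05, 0.95], 1: [0.6, 0.4], 2: [0.02, 0.98]}
    print(select_tokens_to_decode(step_probs, [0, 1, 2]))
```

Leaving low-confidence positions masked is what limits the dependency violations the abstract attributes to decoding many tokens under a conditional independence assumption.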

[258] Chain-of-Talkers (CoTalk): Fast Human Annotation of Dense Image Captions

Yijun Shen,Delong Chen,Fan Liu,Xingyu Wang,Chuanyi Zhang,Liang Yao,Yuhui Zheng

Main category: cs.CL

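TL;DR: CoTalk is an AI-assisted annotation method that uses sequential annotation and a multimodal interface to improve annotation efficiency, semantic coverage, and retrieval performance.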

Details

Motivation: Densely annotated image captions help vision-language alignment learning, but how to systematically optimize human annotation efficiency remains underexplored. Method: CoTalk uses sequential annotation to reduce redundant work and a multimodal interface (reading for input, speech for output) to raise efficiency. Result: Experiments show CoTalk beats the parallel method in annotation speed (0.42 vs. 0.30 units/sec) and retrieval performance (41.13% vs. 40.52%). Conclusion: Under a fixed budget, CoTalk improves annotation efficiency and semantic coverage, offering a better annotation scheme for vision-language alignment tasks. Abstract: While densely annotated image captions significantly facilitate the learning of robust vision-language alignment, methodologies for systematically optimizing human annotation efforts remain underexplored. We introduce Chain-of-Talkers (CoTalk), an AI-in-the-loop methodology designed to maximize the number of annotated samples and improve their comprehensiveness under fixed budget constraints (e.g., total human annotation time). The framework is built upon two key insights. First, sequential annotation reduces redundant workload compared to conventional parallel annotation, as subsequent annotators only need to annotate the "residual" -- the missing visual information that previous annotations have not covered. Second, humans process textual input faster by reading while outputting annotations with much higher throughput via talking; thus a multimodal interface enables optimized efficiency. We evaluate our framework from two aspects: intrinsic evaluations that assess the comprehensiveness of semantic units, obtained by parsing detailed captions into object-attribute trees and analyzing their effective connections; extrinsic evaluation measures the practical usage of the annotated captions in facilitating vision-language alignment. Experiments with eight participants show our Chain-of-Talkers (CoTalk) improves annotation speed (0.42 vs. 0.30 units/sec) and retrieval performance (41.13% vs. 40.52%) over the parallel method.

[259] Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs

Ziling Cheng,Meng Cao,Marc-Antoine Rondeau,Jackie Chi Kit Cheung

Main category: cs.CL

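TL;DR: The "stochastic parrot" behavior of LLMs shows systematic errors that take the form of class-based misgeneralization; internally, models derive answers by combining abstract classes with contextual cues.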

Details

Motivation: To examine the error patterns of LLMs when generating answers, and the internal mechanisms behind them, for a fuller understanding of their "stochastic parrot" behavior. Method: Behavioral analysis and mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types. Result: Errors stem from class-based misgeneralization; two competing circuits inside the model, one driven by the query and the other by contextual cues, determine the final output. Conclusion: Through form-based training, LLMs can generalize using abstractions, but their reliability depends on contextual cues, so they can be described as "stochastic chameleons". Abstract: The widespread success of large language models (LLMs) on NLP benchmarks has been accompanied by concerns that LLMs function primarily as stochastic parrots that reproduce texts similar to what they saw during pre-training, often erroneously. But what is the nature of their errors, and do these errors exhibit any regularities? In this work, we examine irrelevant context hallucinations, in which models integrate misleading contextual cues into their predictions. Through behavioral analysis, we show that these errors result from a structured yet flawed mechanism that we term class-based (mis)generalization, in which models combine abstract class cues with features extracted from the query or context to derive answers. Furthermore, mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types reveal that this behavior is reflected in the model's internal computations: (i) abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, (ii) feature selection is governed by two competing circuits -- one prioritizing direct query-based reasoning, the other incorporating contextual cues -- whose relative influences determine the final output. Our findings provide a more nuanced perspective on the stochastic parrot argument: through form-based training, LLMs can exhibit generalization leveraging abstractions, albeit in unreliable ways based on contextual cues -- what we term stochastic chameleons.

[260] Spatial Knowledge Graph-Guided Multimodal Synthesis

Yida Xue,Zhen Bi,Jinnan Yang,Jungang Lou,Huajun Chen,Ningyu Zhang

Main category: cs.CL

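TL;DR: SKG2Data uses spatial knowledge graphs to guide multimodal data synthesis, improving the spatial perception of multimodal large language models.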

Details

Motivation: Multimodal large language models have limited spatial perception; synthetic data can help, but ensuring that the data respects spatial common sense is challenging. Method: Proposes SKG2Data, which builds a Spatial Knowledge Graph (SKG) that emulates human spatial perception and uses it to guide multimodal data synthesis. Result: Experiments show that the synthesized data markedly improves the models' spatial perception and reasoning and generalizes well. Conclusion: Knowledge-based data synthesis is a promising route to advancing spatial intelligence. Abstract: Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, which is subsequently utilized to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.

[261] Learning Composable Chains-of-Thought

Fangcong Yin,Zeyu Leo Liu,Liu Leqi,Xi Ye,Greg Durrett

Main category: cs.CL

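TL;DR: Modifying the chain-of-thought (CoT) data format to make it composable improves the zero-shot performance of large language models on unseen reasoning tasks.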

Details

Motivation: To address the weak generalization of existing methods to tasks outside the training distribution, especially compositional reasoning tasks. Method: Rewrites the CoT data of atomic tasks into a composable format, combines the resulting models via multitask learning or model merging, and further improves them with rejection sampling fine-tuning (RFT). Result: On string operation and natural language skill composition tasks, training on Composable CoT outperforms multitask learning and continued fine-tuning baselines. Conclusion: Composable CoT effectively improves zero-shot performance on compositional tasks. Abstract: A common approach for teaching large language models (LLMs) to reason is to train on chain-of-thought (CoT) traces of in-distribution reasoning problems, but such annotated data is costly to obtain for every problem of interest. We want reasoning models to generalize beyond their training distribution, and ideally to generalize compositionally: combine atomic reasoning skills to solve harder, unseen reasoning tasks. We take a step towards compositional generalization of reasoning skills when addressing a target compositional task that has no labeled CoT data. We find that simply training models on CoT data of atomic tasks leads to limited generalization, but minimally modifying CoT formats of constituent atomic tasks to be composable can lead to improvements. We can train "atomic CoT" models on the atomic tasks with Composable CoT data and combine them with multitask learning or model merging for better zero-shot performance on the target compositional task. Such a combined model can be further bootstrapped on a small amount of compositional data using rejection sampling fine-tuning (RFT). Results on string operations and natural language skill compositions show that training LLMs on Composable CoT outperforms multitask learning and continued fine-tuning baselines within a given training data budget.

[262] Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

Hanjia Lyu,Jiebo Luo,Jian Kang,Allison Koenecke

Main category: cs.CL

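TL;DR: The study finds that large language models (LLMs) perform differently when prompted in Simplified versus Traditional Chinese, with biases that may stem from training data, character preferences, and tokenization.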

Details

Motivation: To understand how LLM performance differs between Simplified and Traditional Chinese, and so avoid cultural representation problems and potential harms in downstream decisions (such as education or hiring) caused by model bias. Method: Designs two benchmark tasks (regional term choice and regional name choice) and evaluates 11 leading commercial and open-source LLMs. Result: Most LLMs favor Simplified Chinese in regional term choice but favor Traditional Chinese in regional name choice; the bias depends on both the task and the prompting language. Conclusion: LLM biases need further analysis, and an open-source benchmark dataset is provided to support future research. Abstract: While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese.
These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).

[263] WebDancer: Towards Autonomous Information Seeking Agency

Jialong Wu,Baixuan Li,Runnan Fang,Wenbiao Yin,Liwen Zhang,Zhengwei Tao,Dingchu Zhang,Zekun Xi,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou

Main category: cs.CL

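TL;DR: The paper proposes an end-to-end training paradigm for information-seeking agents with four stages (data construction, trajectory sampling, supervised fine-tuning, and reinforcement learning), implements it in the WebDancer agent, and reports strong results.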

Details

Motivation: Solving complex real-world problems requires multi-step reasoning and in-depth information seeking. Existing agentic systems such as Deep Research show the potential of autonomous multi-step research, but a complete paradigm from a data-centric training perspective is missing. Method: Proposes a four-stage training paradigm: (1) data construction; (2) trajectory sampling; (3) supervised fine-tuning; (4) reinforcement learning. The WebDancer agent is built on the ReAct framework. Result: Strong performance on the GAIA and WebWalkerQA benchmarks validates the training paradigm. Conclusion: The paradigm offers a systematic path and practical insights for developing more capable agentic models; code and a demo are to be released. Abstract: Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectories sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in WebDancer, a web agent based on ReAct. Empirical evaluations on the challenging information seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The codes and demo will be released at https://github.com/Alibaba-NLP/WebAgent.

[264] The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

Ang Lv,Ruobing Xie,Xingwu Sun,Zhanhui Kang,Rui Yan

Main category: cs.CL

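TL;DR: The study examines how reward noise affects post-training of large language models (LLMs), finds that models are highly robust to such noise, and proposes a reward based on key reasoning phrases (RPR) that substantially improves performance.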

Details

Motivation: To explore how reward noise in realistic settings affects LLM post-training, with the aim of improving performance on reasoning tasks. Method: Experiments that manually flip reward function outputs and that introduce a reasoning pattern reward (RPR), including in combination with noisy reward models. Result: LLMs remain robust with up to 40% of rewards flipped, and RPR alone brings performance close to that of noiseless training (over 70% accuracy). Conclusion: The findings point to the importance of strengthening foundational abilities during pre-training and offer new directions for post-training techniques. Abstract: Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as "first, I need to", without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
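The two reward manipulations described in the abstract, random flipping of a correctness reward and a reward for key reasoning phrases, can be sketched as below. This is an illustrative toy, not the authors' code: the function names, the binary 0/1 reward, the per-phrase value of 0.1, the cap of 1.0, and the second phrase in the list are assumptions; only "first, I need to" and the 40% flip rate come from the abstract.

```python
import random


def noisy_reward(is_correct, flip_prob, rng):
    """Binary correctness reward whose output is flipped with probability flip_prob."""
    reward = 1.0 if is_correct else 0.0
    if rng.random() < flip_prob:
        reward = 1.0 - reward
    return reward


def reasoning_pattern_reward(response, phrases, per_phrase=0.1, cap=1.0):
    """Reward the appearance of key reasoning phrases, ignoring answer correctness.

    Hypothetical scoring: a fixed bonus per distinct phrase found, capped at `cap`.
    """
    text = response.lower()
    hits = sum(1 for phrase in phrases if phrase in text)
    return min(cap, per_phrase * hits)


if __name__ == "__main__":
    rng = random.Random(0)
    # Flip 40% of rewards, as in the example quoted in the abstract.
    rewards = [noisy_reward(True, 0.4, rng) for _ in range(10)]
    print(rewards)

    # Assumed phrase list; only the first phrase appears in the abstract.
    phrases = ["first, i need to", "let me check"]
    print(reasoning_pattern_reward("First, I need to add the two numbers.", phrases))
```

In an RL loop these functions would stand in for the reward signal; the abstract also reports combining the phrase-based reward with noisy reward models, which this sketch does not attempt to reproduce.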

[265] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

Qingchen Yu,Zifan Zheng,Ding Chen,Simin Niu,Bo Tang,Feiyu Xiong,Zhiyu Li

Main category: cs.CL

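TL;DR: GuessArena proposes a dynamic evaluation framework based on adversarial games, addressing the limited adaptability and coarse granularity of traditional static benchmarks.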

Details

Motivation: Traditional static benchmarks do not adapt to different application domains and struggle to assess domain knowledge and contextual reasoning at a fine-grained level. Method: Uses an interactive structure modeled on the "Guess Who I Am?" game, combining dynamic domain knowledge modeling with progressive reasoning assessment. Result: Empirical studies in five vertical domains (finance, healthcare, manufacturing, information technology, and education) show that GuessArena effectively distinguishes LLMs by domain knowledge coverage and reasoning chain completeness. Conclusion: GuessArena outperforms traditional benchmarks in interpretability, scalability, and scenario adaptability. Abstract: The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the Guess Who I Am? game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains (finance, healthcare, manufacturing, information technology, and education) demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.

[266] AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models

Feng Luo,Yu-Neng Chuang,Guanchu Wang,Hoang Anh Duy Le,Shaochen Zhong,Hongyi Liu,Jiayi Yuan,Yang Sui,Vladimir Braverman,Vipin Chaudhary,Xia Hu

Main category: cs.CL

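TL;DR: AutoL2S is a dynamic framework that lets large language models adjust the length of their reasoning paths to the complexity of the question, cutting unnecessary long reasoning and improving efficiency.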

Details

Motivation: To stop large language models from generating overly long reasoning paths on easy reasoning questions, reducing inference cost and latency. Method: Proposes the AutoL2S framework, which trains the model to recognize question complexity and dynamically compress its reasoning path, using a special token to indicate when lengthy reasoning can be skipped. Result: AutoL2S shortens reasoning paths by up to 57% without loss of performance. Conclusion: AutoL2S markedly improves the reasoning efficiency and scalability of large language models. Abstract: The reasoning-capable large language models (LLMs) demonstrate strong performance on complex reasoning tasks but often suffer from overthinking, generating unnecessarily long chain-of-thought (CoT) reasoning paths for easy reasoning questions, thereby increasing inference cost and latency. Recent approaches attempt to address this challenge by manually deciding when to apply long or short reasoning. However, they lack the flexibility to adapt CoT length dynamically based on question complexity. In this paper, we propose Auto Long-Short Reasoning (AutoL2S), a dynamic and model-agnostic framework that enables LLMs to dynamically compress their generated reasoning path based on the complexity of the reasoning question. AutoL2S enables a learned paradigm, in which LLMs themselves can decide when longer reasoning is necessary and when shorter reasoning suffices, by training on data annotated with our proposed method, which includes both long and short CoT paths and a special token. We then use this token to indicate when the model can skip generating lengthy CoT reasoning. This proposed annotation strategy can enhance the LLMs' ability to generate shorter CoT reasoning paths with improved quality after training. Extensive evaluation results show that AutoL2S reduces the length of reasoning generation by up to 57% without compromising performance, demonstrating the effectiveness of AutoL2S for scalable and efficient LLM reasoning.

cs.IT [Back]

[267] Synonymous Variational Inference for Perceptual Image Compression

Zijian Liang,Kai Niu,Changshuo Wang,Jin Xu,Ping Zhang

Main category: cs.IT

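TL;DR: The paper proposes synonymous variational inference (SVI), a method based on synonymous relationships for analyzing perceptual image compression, and introduces a new image compression scheme, synonymous image compression (SIC).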

Details

Motivation: Semantic information theory reveals a synonymous relationship between semantic and syntactic information, and the authors use this relationship to re-analyze perceptual image compression. Method: Proposes synonymous variational inference (SVI), which takes perceptual similarity as the synonymy criterion to build a synonymous set (Synset) and approximates the posterior of the latent synonymous representation by minimizing a partial semantic KL divergence. Result: Theory shows that the optimization direction of perceptual image compression follows a triple tradeoff, and experiments show that a single progressive SIC codec achieves comparable rate-distortion-perception performance. Conclusion: The proposed analysis is effective, and the SIC scheme confirms the feasibility of its theoretical framework. Abstract: Recent contributions of semantic information theory reveal the set-element relationship between semantic and syntactic information, represented as synonymous relationships. In this paper, we propose a synonymous variational inference (SVI) method based on this synonymity viewpoint to re-analyze the perceptual image compression problem. It takes perceptual similarity as a typical synonymous criterion to build an ideal synonymous set (Synset), and approximates the posterior of its latent synonymous representation with a parametric density by minimizing a partial semantic KL divergence. This analysis theoretically proves that the optimization direction of perceptual image compression follows a triple tradeoff that can cover the existing rate-distortion-perception schemes. Additionally, we introduce synonymous image compression (SIC), a new image compression scheme that corresponds to the analytical process of SVI, and implement a progressive SIC codec to fully leverage the model's capabilities. Experimental results demonstrate comparable rate-distortion-perception performance using a single progressive SIC codec, thus verifying the effectiveness of our proposed analysis method.

cs.RO [Back]

[268] Learning Compositional Behaviors from Demonstration and Language

Weiyu Liu,Neil Nie,Ruohan Zhang,Jiayuan Mao,Jiajun Wu

Main category: cs.RO

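TL;DR: BLADE is a framework for long-horizon robotic manipulation that combines imitation learning with model-based planning, using language-annotated demonstrations and large language models to extract abstract action knowledge and build a library of structured high-level action representations.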

Details

Motivation: To address the challenges of action planning and generalization in long-horizon robotic manipulation by combining language and demonstrations for better adaptability. Method: Integrates language-annotated demonstrations with action knowledge extracted from LLMs to build a library of structured action representations, including visually grounded preconditions and effects, with controllers implemented as neural network policies. Result: BLADE recovers the structured representations automatically, without manual labeling, and generalizes substantially better to novel situations such as new initial states, external perturbations, and new goals. Conclusion: BLADE is validated in simulation and on real robots, and it handles scenes with complex objects and geometric constraints. Abstract: We introduce Behavior from Language and Demonstration (BLADE), a framework for long-horizon robotic manipulation by integrating imitation learning and model-based planning. BLADE leverages language-annotated demonstrations, extracts abstract action knowledge from large language models (LLMs), and constructs a library of structured, high-level action representations. These representations include preconditions and effects grounded in visual perception for each high-level action, along with corresponding controllers implemented as neural network-based policies. BLADE can recover such structured representations automatically, without manually labeled states or symbolic definitions. BLADE shows significant capabilities in generalizing to novel situations, including novel initial states, external state perturbations, and novel goals. We validate the effectiveness of our approach both in simulation and on real robots with a diverse set of objects with articulated parts, partial observability, and geometric constraints.

[269] CogAD: Cognitive-Hierarchy Guided End-to-End Autonomous Driving

Zhennan Wang,Jianing Teng,Canqun Xiang,Kangliang Chen,Xing Pan,Lu Deng,Weihao Gu

Main category: cs.RO

TL;DR: CogAD is a new end-to-end autonomous driving model that emulates the hierarchical cognition of human drivers, using global-to-local context processing and intent-conditioned multi-mode trajectory generation for perception and planning that better match human cognition.

Details

Motivation: Existing end-to-end autonomous driving methods are fundamentally misaligned with human cognitive principles in both perception and planning; CogAD aims to close this gap by emulating human cognitive mechanisms. Method: CogAD uses dual hierarchical mechanisms: global-to-local context processing for perception and intent-conditioned multi-mode trajectory generation for planning, with dual-level uncertainty modeling to produce diverse trajectories. Result: On nuScenes and Bench2Drive, CogAD achieves state-of-the-art end-to-end planning performance and is particularly robust in long-tail scenarios and complex real-world driving conditions. Conclusion: By emulating human cognitive mechanisms, CogAD substantially improves perception and planning for autonomous driving, especially in complex scenarios. Abstract: While end-to-end autonomous driving has advanced significantly, prevailing methods remain fundamentally misaligned with human cognitive principles in both perception and planning. In this paper, we propose CogAD, a novel end-to-end autonomous driving model that emulates the hierarchical cognition mechanisms of human drivers. CogAD implements dual hierarchical mechanisms: global-to-local context processing for human-like perception and intent-conditioned multi-mode trajectory generation for cognitively-inspired planning. The proposed method demonstrates three principal advantages: comprehensive environmental understanding through hierarchical perception, robust planning exploration enabled by multi-level planning, and diverse yet reasonable multi-modal trajectory generation facilitated by dual-level uncertainty modeling. Extensive experiments on nuScenes and Bench2Drive demonstrate that CogAD achieves state-of-the-art performance in end-to-end planning, exhibiting particular superiority in long-tail scenarios and robust generalization to complex real-world driving conditions.

[270] Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge

Zhongyi Zhou,Yichen Zhu,Junjie Wen,Chaomin Shen,Yi Xu

Main category: cs.RO

TL;DR: ChatVLA-2 is a new mixture-of-experts VLA model whose three-stage training pipeline preserves the core capabilities of the VLM while enabling actionable reasoning. It performs strongly on a math-matching task and surpasses existing imitation learning methods.

Details

Motivation: Existing end-to-end VLA systems lose key capabilities of the pretrained VLM during fine-tuning, so a generalizable VLA model that retains and extends those capabilities is needed. Method: Proposes ChatVLA-2, which combines a mixture-of-experts architecture with a three-stage training pipeline to preserve the VLM's open-world reasoning while producing actionable steps for robotic tasks. Result: Shows strong mathematical reasoning and OCR capabilities on the math-matching task, along with strong spatial reasoning, and surpasses existing methods such as OpenVLA. Conclusion: ChatVLA-2 substantially improves the generality and reasoning ability of robotic foundation models, an important step toward truly general-purpose robot models. Abstract: Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, despite leveraging powerful pre-trained Vision-Language Models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM's core competencies: 1) Open-world embodied reasoning - the VLA should inherit the knowledge from the VLM, i.e., recognize anything that the VLM can recognize, solve math problems, and possess visual-spatial intelligence, 2) Reasoning following - effectively translating the open-world reasoning into actionable steps for the robot. In this work, we introduce ChatVLA-2, a novel mixture-of-experts VLA model coupled with a specialized three-stage training pipeline designed to preserve the VLM's original strengths while enabling actionable reasoning. To validate our approach, we design a math-matching task wherein a robot interprets math problems written on a whiteboard and picks corresponding number cards from a table to solve equations. Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, despite these abilities not being explicitly trained within the VLA. Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects. Overall, our method showcases reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and pi-zero. This work represents a substantial advancement toward developing truly generalizable robotic foundation models endowed with robust reasoning capacities.

[271] ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation

Jiawen Yu,Hairuo Liu,Qiaojun Yu,Jieji Ren,Ce Hao,Haitong Ding,Guangyu Huang,Guofan Huang,Yan Song,Panpan Cai,Cewu Lu,Wenqiang Zhang

Main category: cs.RO

TL;DR: ForceVLA is a new end-to-end robotic manipulation framework that treats external force sensing as a first-class modality in VLA systems, combining vision-language embeddings with real-time force feedback to substantially raise success rates on contact-rich tasks.

Details

Motivation: Existing VLA models perform poorly on contact-rich tasks that require fine-grained force control, especially under visual occlusion or dynamic uncertainty; ForceVLA targets this problem. Method: ForceVLA introduces the FVLMoE module, which dynamically integrates pretrained vision-language embeddings with real-time 6-axis force feedback, and uses the ForceVLA-Data dataset to support multimodal training. Result: ForceVLA improves average success on contact-rich tasks by 23.2% and reaches up to 80% success on tasks such as plug insertion. Conclusion: ForceVLA demonstrates the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control. Abstract: Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. However, they struggle with contact-rich tasks that require fine-grained control involving force, especially under visual occlusion or dynamic uncertainty. To address these limitations, we propose ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems. ForceVLA introduces FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback during action decoding. This enables context-aware routing across modality-specific experts, enhancing the robot's ability to adapt to subtle contact dynamics. We also introduce ForceVLA-Data, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2% over strong $\pi_0$-based baselines, achieving up to 80% success in tasks such as plug insertion. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control. Code and data will be released at https://sites.google.com/view/forcevla2025.

[272] LiDAR Based Semantic Perception for Forklifts in Outdoor Environments

Benjamin Serfling,Hannes Reichert,Lorenzo Bayerlein,Konrad Doll,Kati Radkhah-Lens

Main category: cs.RO

TL;DR: Proposes a dual-LiDAR semantic segmentation framework for autonomous forklifts in complex outdoor environments, combining forward-facing and downward-angled LiDAR sensors for high-precision scene understanding.

Details

Motivation: Industrial material handling requires detecting and segmenting dynamic and static obstacles so that autonomous forklifts can navigate safely in dynamic warehouse and yard environments. Method: A dual-LiDAR system captures high-resolution 3D point clouds, and a lightweight method segments them into safety-critical instance classes (such as pedestrians, vehicles, and forklifts) and environmental classes (such as driveable ground, lanes, and buildings). Result: Experimental validation shows high segmentation accuracy while meeting strict runtime requirements. Conclusion: The framework is suitable for safety-aware, fully autonomous forklift navigation in dynamic warehouse and yard environments. Abstract: In this study, we present a novel LiDAR-based semantic segmentation framework tailored for autonomous forklifts operating in complex outdoor environments. Central to our approach is the integration of a dual LiDAR system, which combines forward-facing and downward-angled LiDAR sensors to enable comprehensive scene understanding, specifically tailored for industrial material handling tasks. The dual configuration improves the detection and segmentation of dynamic and static obstacles with high spatial precision. Using high-resolution 3D point clouds captured from two sensors, our method employs a lightweight yet robust approach that segments the point clouds into safety-critical instance classes such as pedestrians, vehicles, and forklifts, as well as environmental classes such as driveable ground, lanes, and buildings. Experimental validation demonstrates that our approach achieves high segmentation accuracy while satisfying strict runtime requirements, establishing its viability for safety-aware, fully autonomous forklift navigation in dynamic warehouse and yard environments.

[273] UP-SLAM: Adaptively Structured Gaussian SLAM with Uncertainty Prediction in Dynamic Environments

Wancai Zheng,Linlin Ou,Jiajie He,Libo Zhou,Xinyi Yu,Yan Wei

Main category: cs.RO

TL;DR: UP-SLAM is a real-time RGB-D SLAM system that decouples tracking and mapping through a parallelized framework, manages Gaussian primitives with a probabilistic octree, and filters dynamic regions with a training-free method, substantially improving localization accuracy and rendering quality in dynamic environments.

Details

Motivation: Existing 3D Gaussian Splatting techniques lack real-time performance and robustness in dynamic environments; UP-SLAM aims to remove these limitations. Method: Decouples tracking and mapping with a parallelized framework, manages Gaussian primitives adaptively with a probabilistic octree, introduces a training-free uncertainty estimator that fuses multi-modal residuals, and designs a temporal encoder to improve rendering quality. Result: On multiple datasets, UP-SLAM outperforms existing methods in localization accuracy (by 59.8%) and rendering quality (by 4.57 dB PSNR) while maintaining real-time performance. Conclusion: UP-SLAM achieves efficient, robust SLAM in dynamic environments and produces reusable static maps. Abstract: Recent 3D Gaussian Splatting (3DGS) techniques for Visual Simultaneous Localization and Mapping (SLAM) have significantly progressed in tracking and high-fidelity mapping. However, their sequential optimization framework and sensitivity to dynamic objects limit real-time performance and robustness in real-world scenarios. We present UP-SLAM, a real-time RGB-D SLAM system for dynamic environments that decouples tracking and mapping through a parallelized framework. A probabilistic octree is employed to manage Gaussian primitives adaptively, enabling efficient initialization and pruning without hand-crafted thresholds. To robustly filter dynamic regions during tracking, we propose a training-free uncertainty estimator that fuses multi-modal residuals to estimate per-pixel motion uncertainty, achieving open-set dynamic object handling without reliance on semantic labels. Furthermore, a temporal encoder is designed to enhance rendering quality. Concurrently, low-dimensional features are efficiently transformed via a shallow multilayer perceptron to construct DINO features, which are then employed to enrich the Gaussian field and improve the robustness of uncertainty prediction. Extensive experiments on multiple challenging datasets suggest that UP-SLAM outperforms state-of-the-art methods in both localization accuracy (by 59.8%) and rendering quality (by 4.57 dB PSNR), while maintaining real-time performance and producing reusable, artifact-free static maps in dynamic environments. Project page: https://aczheng-cai.github.io/up_slam.github.io/

cs.MA [Back]

[274] Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation

Yu-Lun Song,Chung-En Tsern,Che-Cheng Wu,Yu-Ming Chang,Syuan-Bo Huang,Wei-Chu Chen,Michael Chia-Liang Lin,Yu-Ta Lin

Main category: cs.MA

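TL;DR: Proposes an urban mobility simulation approach that combines a large language model (LLM) with agent-based modeling (ABM) to improve agent diversity and realism, applied in an empirical study of Taipei City.

Details

Motivation: Traditional rule-based ABM lacks diversity and realism when simulating urban mobility, so the study explores using an LLM to improve the simulation. Method: Uses an LLM to generate synthetic population profiles, allocate routine and occasional locations, and simulate personalized routes, modeled with real-world data from Taipei City. Result: The simulation yields route heat maps and mode-specific indicators that support decision-making by urban planners. Conclusion: Future work will focus on establishing validation frameworks to ensure the accuracy and reliability of the simulation in urban planning applications. Abstract: This study presents an innovative approach to urban mobility simulation by integrating a Large Language Model (LLM) with Agent-Based Modeling (ABM). Unlike traditional rule-based ABM, the proposed framework leverages LLM to enhance agent diversity and realism by generating synthetic population profiles, allocating routine and occasional locations, and simulating personalized routes. Using real-world data, the simulation models individual behaviors and large-scale mobility patterns in Taipei City. Key insights, such as route heat maps and mode-specific indicators, provide urban planners with actionable information for policy-making. Future work focuses on establishing robust validation frameworks to ensure accuracy and reliability in urban planning applications.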

physics.optics [Back]

[275] Large-Area Fabrication-aware Computational Diffractive Optics

Kaixuan Wei,Hector A. Jimenez-Romero,Hadi Amata,Jipeng Sun,Qiang Fu,Felix Heide,Wolfgang Heidrich

Main category: physics.optics

TL;DR: Proposes a fabrication-aware design pipeline for mass-producing large-area diffractive optical elements, addressing the quality gap between simulation and fabrication.

Details

Motivation: Close the quality gap between simulation and fabrication in existing diffractive optical systems so that they can be used in practical applications. Method: Proposes a fabrication-aware design pipeline, including a super-resolved neural lithography model and a distributed compute framework, for optimizing diffractive optical systems. Result: Achieves adequate agreement between simulation and fabricated prototypes, and obtains high-quality images from a single-DOE imaging system. Conclusion: The work lifts the fabrication limitations on real-world use of diffractive optical systems. Abstract: Differentiable optics, as an emerging paradigm that jointly optimizes optics and (optional) image processing algorithms, has made innovative optical designs possible across a broad range of applications. Many of these systems utilize diffractive optical components (DOEs) for holography, PSF engineering, or wavefront shaping. Existing approaches have, however, mostly remained limited to laboratory prototypes, owing to a large quality gap between simulation and manufactured devices. We aim at lifting the fundamental technical barriers to the practical use of learned diffractive optical systems. To this end, we propose a fabrication-aware design pipeline for diffractive optics fabricated by direct-write grayscale lithography followed by nano-imprinting replication, which is directly suited for inexpensive mass production of large area designs. We propose a super-resolved neural lithography model that can accurately predict the 3D geometry generated by the fabrication process. This model can be seamlessly integrated into existing differentiable optics frameworks, enabling fabrication-aware, end-to-end optimization of computational optical systems. To tackle the computational challenges, we also devise a tensor-parallel compute framework centered on distributing large-scale FFT computation across many GPUs. As such, we demonstrate large scale diffractive optics designs up to 32.16 mm $\times$ 21.44 mm, simulated on grids of up to 128,640 by 85,760 feature points. We find adequate agreement between simulation and fabricated prototypes for applications such as holography and PSF engineering. We also achieve high image quality from an imaging system comprised only of a single DOE, with images processed only by a Wiener filter utilizing the simulation PSF. We believe our findings lift the fabrication limitations for real-world applications of diffractive optics and differentiable optical design.
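
The design size and grid dimensions quoted in the abstract imply a sampling pitch, which the short sketch below works out. It assumes (this is not stated in the abstract) that the grid covers the full 32.16 mm by 21.44 mm aperture with uniform spacing; the function name `sample_pitch_um` is made up for this illustration.

```python
def sample_pitch_um(length_mm: float, n_points: int) -> float:
    """Spacing between adjacent grid points in micrometres.

    Assumes the n_points samples are spread uniformly over length_mm.
    """
    return length_mm * 1000.0 / n_points


# Figures taken from the abstract; uniform full-aperture sampling is an assumption.
width_pitch = sample_pitch_um(32.16, 128_640)
height_pitch = sample_pitch_um(21.44, 85_760)
total_points = 128_640 * 85_760

print(f"pitch along width:  {width_pitch:.3f} um")
print(f"pitch along height: {height_pitch:.3f} um")
print(f"grid samples:       {total_points:,}")
```

Under that assumption both axes come out at the same sub-micrometre pitch, and the grid holds roughly eleven billion samples, which is consistent with the abstract's need to distribute the FFT across many GPUs.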

cs.LG [Back]

[276] ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools

Zhucong Li,Bowei Zhang,Jin Xiao,Zhijian Zhou,Fenglei Cao,Jiaqing Liang,Yuan Qi

Main category: cs.LG

TL;DR: Proposes ChemHAS, which optimizes agent-stacking structures to reduce the prediction errors of chemistry tools and improve performance.

Details

Motivation: LLM agents can improve performance on chemistry tasks but remain limited by the prediction errors of the chemistry tools, which need to be reduced further. Method: Proposes ChemHAS, which enhances chemistry tools by optimizing agent-stacking structures from limited data. Result: Achieves state-of-the-art performance on four fundamental chemistry tasks, effectively compensates for tool prediction errors, and identifies four agent-stacking behaviors. Conclusion: ChemHAS improves performance and interpretability, and opens new possibilities for AI agents in scientific research. Abstract: Large Language Model (LLM)-based agents have demonstrated the ability to improve performance in chemistry-related tasks by selecting appropriate tools. However, their effectiveness remains limited by the inherent prediction errors of chemistry tools. In this paper, we take a step further by exploring how LLM-based agents can, in turn, be leveraged to reduce prediction errors of the tools. To this end, we propose ChemHAS (Chemical Hierarchical Agent Stacking), a simple yet effective method that enhances chemistry tools through optimizing agent-stacking structures from limited data. ChemHAS achieves state-of-the-art performance across four fundamental chemistry tasks, demonstrating that our method can effectively compensate for prediction errors of the tools. Furthermore, we identify and characterize four distinct agent-stacking behaviors, potentially improving interpretability and revealing new possibilities for AI agent applications in scientific research. Our code and dataset are publicly available at https://anonymous.4open.science/r/ChemHAS-01E4/README.md.

[277] Revisiting Bi-Linear State Transitions in Recurrent Neural Networks

M. Reza Ebrahimi,Roland Memisevic

Main category: cs.LG

TL;DR: Examines the role of hidden units in recurrent neural networks, arguing that they are not only memory stores but also take an active part in computation through bi-linear operations, which provide a natural inductive bias for state tracking tasks.

Details

Motivation: Revisit hidden units as active participants in computation rather than passive memory stores. Method: Analyzes bi-linear operations (multiplicative interactions between hidden units and input embeddings) theoretically and empirically, showing that they naturally represent the evolution of hidden states in state tracking tasks. Result: Bi-linear state updates form a natural hierarchy that corresponds to task complexity, with linear recurrent networks such as Mamba at the lowest-complexity level. Conclusion: Bi-linear operations provide an effective inductive bias for state tracking tasks and reveal the active contribution of hidden units to computation. Abstract: The role of hidden units in recurrent neural networks is typically seen as modeling memory, with research focusing on enhancing information retention through gating mechanisms. A less explored perspective views hidden units as active participants in the computation performed by the network, rather than passive memory stores. In this work, we revisit bi-linear operations, which involve multiplicative interactions between hidden units and input embeddings. We demonstrate theoretically and empirically that they constitute a natural inductive bias for representing the evolution of hidden states in state tracking tasks. These are the simplest type of task that requires hidden units to actively contribute to the behavior of the network. We also show that bi-linear state updates form a natural hierarchy corresponding to state tracking tasks of increasing complexity, with popular linear recurrent networks such as Mamba residing at the lowest-complexity center of that hierarchy.
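
To make the idea of a bi-linear state update concrete, here is a minimal sketch in plain Python. It is an illustration only, not the authors' model: the parity task, the hand-set matrices in `W`, and all function names are assumptions chosen for the example. The update has the form h_t = (sum_k x_t[k] * W[k]) h_{t-1}, so the input embedding multiplies the hidden state instead of being added to it.

```python
# Hand-set transition matrices (hypothetical, not learned):
# token 0 leaves the state unchanged, token 1 swaps the two state slots.
IDENTITY = [[1, 0], [0, 1]]
SWAP = [[0, 1], [1, 0]]
W = [IDENTITY, SWAP]


def one_hot(token, size=2):
    """Input embedding: a one-hot vector for the token."""
    return [1 if i == token else 0 for i in range(size)]


def transition(x):
    """State-transition matrix that is linear in the input embedding x."""
    return [[sum(x[k] * W[k][i][j] for k in range(len(W))) for j in range(2)]
            for i in range(2)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def track_parity(bits):
    """Run the bi-linear recurrence; returns 0 for even parity, 1 for odd."""
    h = [1, 0]  # start in the "even" state
    for bit in bits:
        h = matvec(transition(one_hot(bit)), h)  # multiplicative interaction
    return h.index(1)


print(track_parity([1, 0, 1, 1]))  # three ones -> prints 1 (odd)
```

Parity is a small state tracking task: the hidden state has to be transformed by each input, which a purely additive update with a fixed transition matrix cannot express in this direct way.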

[278] Born a Transformer -- Always a Transformer?

Yana Veitsman,Mayank Jobanputra,Yash Sarrof,Aleksandra Bakalova,Vera Demberg,Ellie Pavlick,Michael Hahn

Main category: cs.LG

TL;DR: Investigates whether pretrained large language models (LLMs) overcome the theoretical limitations of the Transformer architecture on sequence tasks, and finds that pretraining strengthens some capabilities but does not overcome fundamental length-generalization limits.

Details

Motivation: Determine whether the theoretical limitations of the Transformer architecture persist in pretrained LLMs, and whether scale and pretraining data allow LLMs to overcome them. Method: Studies a family of retrieval and copying tasks, uses the C-RASP framework to analyze length generalization, and combines theoretical and empirical analysis. Result: Pretrained models show an induction-versus-anti-induction asymmetry on retrieval tasks, which targeted fine-tuning removes when theory guarantees length generalization. Conclusion: Pretraining selectively enhances certain Transformer capabilities but does not overcome fundamental length-generalization limits. Abstract: Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of retrieval and copying tasks inspired by Liu et al. [2024]. We use the recently proposed C-RASP framework for studying length generalization [Huang et al., 2025b] to provide guarantees for each of our settings. Empirically, we observe an induction-versus-anti-induction asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears upon targeted fine-tuning if length-generalization is guaranteed by theory. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained Transformers. We validate our findings through practical experiments on real-world tasks demonstrating reliability risks. Our results highlight that pretraining selectively enhances certain Transformer capabilities, but does not overcome fundamental length-generalization limits.

[279] From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

Stanley Yu,Vaidehi Bulusu,Oscar Yasunaga,Clayton Lau,Cole Blondin,Sean O'Brien,Kevin Zhu,Vasu Sharma

Main category: cs.LG

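TL;DR: Extends the concept cone framework to study the multi-dimensional geometry of truthfulness in LLMs, showing that the cones causally mediate truth-related behavior.

Details

Motivation: LLMs have strong conversational abilities but often generate falsehoods. Prior work suggests truthfulness can be represented as a single linear direction, which may not fully capture its geometric complexity. Method: Extends the concept cone framework to identify multi-dimensional cones, and validates them through causal interventions, cross-architecture generalization, and preservation of unrelated behavior. Result: Multi-dimensional cones reliably flip model responses to factual statements, generalize across architectures, and preserve unrelated behavior. Conclusion: Reveals the multi-dimensional structure of true/false propositions in LLMs; concept cones are a promising tool for probing abstract behaviors. Abstract: Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model's internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.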

[280] Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

Parsa Mirtaheri,Ezra Edelman,Samy Jelassi,Eran Malach,Enric Boix-Adsera

Main category: cs.LG

TL;DR: Compares sequential scaling (such as longer chains of thought) with parallel scaling (such as majority voting) for inference-time computation, and finds that sequential scaling has an exponential advantage on certain graph connectivity problems.

Details

Motivation: Understand how to allocate inference-time computation optimally to improve the reasoning of large language models. Method: Compares sequential and parallel scaling on graph connectivity problems through theoretical analysis and experiments. Result: On specific graph connectivity problems, sequential scaling significantly outperforms parallel scaling. Conclusion: Sequential scaling has a marked advantage in some reasoning settings, offering a new perspective on optimizing inference-time computation. Abstract: Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems in challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of language models, including models trained from scratch for graph connectivity with different chain of thought strategies as well as large reasoning models.
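
For readers unfamiliar with the parallel-scaling baseline named in the abstract, the sketch below shows majority voting over the final answers of several independent short chains of thought. It only illustrates the aggregation step; the sample answers are invented, and nothing here reproduces the paper's graph connectivity experiments.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among independent samples.

    Ties are resolved in favour of the answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers from five short chains of thought.
sampled = ["connected", "not connected", "connected", "connected", "not connected"]
print(majority_vote(sampled))  # prints "connected"
```

Voting helps only when each short chain is right more often than wrong; the abstract's claim is that in some settings matching one long chain would take exponentially many such short chains.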

[281] Efficient Ensemble for Fine-tuning Language Models on Multiple Datasets

Dongyue Li,Ziniu Zhang,Lu Wang,Hongyang R. Zhang

Main category: cs.LG

TL;DR: Proposes an ensemble method that fine-tunes language models with multiple small adapters rather than a single adapter, substantially improving efficiency and performance.

Details

Motivation: Existing methods such as QLoRA are efficient on a single dataset, but there is no efficient fine-tuning approach for multiple task datasets. Method: Partitions the n datasets into m groups (m<

[282] EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles (https://arxiv.org/abs/2505.21959)

Aakriti Agrawal,Mucong Ding,Zora Che,Chenghao Deng,Anirudh Satheesh,Bang An,Bayan Bruss,John Langford,Furong Huang

Main category: cs.LG

TL;DR: Proposes EnsemW2S, a method that iteratively combines multiple weak expert models to improve their generalization to super-human-level tasks so that they can supervise stronger student models. Experiments show notable gains on both in-distribution and out-of-distribution datasets.

Details

Motivation: As large language models (LLMs) approach or surpass human-level performance, methods are needed to supervise and enhance these powerful models using smaller models trained on limited human-level data. Method: Proposes EnsemW2S, a token-level ensemble strategy that iteratively combines multiple weak experts and systematically improves their generalization. Result: On in-distribution datasets, experts and student models improve by 4% and 3.2% respectively; on out-of-distribution datasets, the gains reach up to 6% and 2.28%. Conclusion: EnsemW2S performs well on weak-to-strong generalization and offers an effective way to supervise powerful models. Abstract: With Large Language Models (LLMs) rapidly approaching and potentially surpassing human-level performance, it has become imperative to develop approaches capable of effectively supervising and enhancing these powerful models using smaller, human-level models exposed to only human-level data. We address this critical weak-to-strong (W2S) generalization challenge by proposing a novel method aimed at improving weak experts, by training on the same limited human-level data, enabling them to generalize to complex, super-human-level tasks. Our approach, called EnsemW2S, employs a token-level ensemble strategy that iteratively combines multiple weak experts, systematically addressing the shortcomings identified in preceding iterations. By continuously refining these weak models, we significantly enhance their collective ability to supervise stronger student models. We extensively evaluate the generalization performance of both the ensemble of weak experts and the subsequent strong student model across in-distribution (ID) and out-of-distribution (OOD) datasets. For OOD, we specifically introduce question difficulty as an additional dimension for defining distributional shifts. Our empirical results demonstrate notable improvements, achieving 4% and 3.2% improvements on ID datasets and up to 6% and 2.28% on OOD datasets for experts and student models respectively, underscoring the effectiveness of our proposed method in advancing W2S generalization.

[283] Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning (https://arxiv.org/abs/2505.22203)

Yuzhen Huang,Weihao Zeng,Xingshan Zeng,Qi Zhu,Junxian He

Main category: cs.LG

TL;DR: Analyzes the reliability of verifiers in reinforcement learning, focusing on rule-based and model-based verifiers for mathematical reasoning, and finds that the former suffer from false negatives while the latter are vulnerable to hacking.

Details

Motivation: Verifier reliability is critical to the success of reinforcement learning, but its impact is poorly understood, especially in complex domains such as mathematical reasoning. Method: Takes mathematical reasoning as a case study and analyzes rule-based and model-based verifiers in both static evaluation and RL training scenarios. Result: Rule-based verifiers often produce false negatives because they fail to recognize equivalent answers in different formats, which hurts training; model-based verifiers reach high accuracy in static evaluation but are susceptible to hacking, which produces false positives. Conclusion: The study exposes the distinct risks of each verifier type and offers insights for building more robust reward systems for reinforcement learning. Abstract: Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Subsequently, we investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL training results imply that they are highly susceptible to hacking, where they misclassify certain patterns in responses as correct (i.e., false positives). This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the unique risks inherent to both rule-based and model-based verifiers, aiming to offer valuable insights to develop more robust reward systems in reinforcement learning.
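
The false-negative failure mode described above can be shown with a toy example. The sketch below is not one of the verifiers studied in the paper; `strict_match` and `numeric_match` are hypothetical helpers written for this illustration, and the second handles only plain numbers and simple fractions.

```python
from fractions import Fraction


def strict_match(prediction: str, reference: str) -> bool:
    """Naive rule: answers count as equal only if the strings are identical."""
    return prediction.strip() == reference.strip()


def numeric_match(prediction: str, reference: str) -> bool:
    """Compare as exact rationals when both parse; otherwise fall back to strings."""
    try:
        return Fraction(prediction.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return strict_match(prediction, reference)


print(strict_match("0.5", "1/2"))   # False: a false negative for a correct answer
print(numeric_match("0.5", "1/2"))  # True: the two formats denote the same value
```

Even the second rule covers only one family of formats; answers such as "50%" or a LaTeX fraction would still be rejected, which hints at why rule-based checks struggle to cover every equivalent format.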

[284] Train Sparse Autoencoders Efficiently by Utilizing Features Correlation (https://arxiv.org/abs/2505.22255)

Vadim Kurochkin,Yaroslav Aksenov,Daniil Laptev,Daniil Gavrilov,Nikita Balagansky

Main category: cs.LG

TL;DR: KronSAE uses Kronecker product decomposition and the mAND activation function to reduce the compute and memory cost of training large-scale sparse autoencoders (SAEs).

Details

Motivation: Sparse autoencoders (SAEs) show promise for interpreting the hidden states of language models, but training them at scale is expensive in compute and memory. Method: Proposes the KronSAE architecture, which uses Kronecker product decomposition to reduce compute and memory overhead, and introduces the mAND activation function to improve performance. Result: KronSAE substantially reduces compute and memory requirements while improving interpretability and performance. Conclusion: KronSAE offers an efficient and interpretable approach to large-scale SAE training. Abstract: Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
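
As a rough intuition for why a Kronecker-factorized latent can be cheaper, the sketch below builds a latent vector of size m*n from two small factors and compares encoder weight counts. This is a generic illustration under stated assumptions, not the KronSAE architecture: the abstract does not specify how the factors are produced, and `kron`, `encoder_weight_counts`, and the sizes used are hypothetical.

```python
def kron(u, v):
    """Kronecker product of two vectors: every pairwise product u[i] * v[j]."""
    return [a * b for a in u for b in v]


def encoder_weight_counts(d_model: int, m: int, n: int):
    """Weights in a dense encoder to m*n latents vs. two encoders to m and n factors."""
    dense = d_model * m * n
    factored = d_model * (m + n)
    return dense, factored


# Two small factor vectors expand into a latent of length 2 * 3 = 6.
print(kron([1, 2], [3, 4, 5]))  # [3, 4, 5, 6, 8, 10]

# Hypothetical sizes: 512-dimensional hidden state, 64 x 64 = 4096 latents.
dense, factored = encoder_weight_counts(512, 64, 64)
print(dense, factored)  # 2097152 65536
```

In this toy setting a latent entry is large only when both of its factors are large, which loosely resembles an AND of two conditions; how mAND is actually defined is not given in the abstract.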

[285] Skywork Open Reasoner 1 Technical Report (https://arxiv.org/abs/2505.22312)

Jujie He,Jiacai Liu,Chris Yuhao Liu,Rui Yan,Chaojie Wang,Peng Cheng,Xiaoyu Zhang,Fuxiang Zhang,Jiacheng Xu,Wei Shen,Siyuan Li,Liang Zeng,Tianwen Wei,Cheng Cheng,Bo An,Yang Liu,Yahui Zhou

Main category: cs.LG

TL;DR: Skywork-OR1 uses reinforcement learning to substantially improve the reasoning ability of large language models, performs strongly on several benchmarks, and open-sources its model weights and training code.

Details

Motivation: Explore the role of reinforcement learning in improving the reasoning of large language models and verify its scalability. Method: Builds on the DeepSeek-R1-Distill model series, applies reinforcement learning to long chain-of-thought (CoT) models, and conducts ablation studies and an analysis of entropy collapse. Result: Skywork-OR1 improves markedly on several benchmarks: average accuracy rises by 15.0 percentage points for the 32B model and by 13.9 for the 7B model. Conclusion: Reinforcement learning is an effective way to improve LLM reasoning, and mitigating entropy collapse is critical to the performance gains. Abstract: The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.

[286] Mitigating Overthinking in Large Reasoning Models via Manifold Steering (https://arxiv.org/abs/2505.22411)

Yao Huang,Huanran Chen,Shouwei Ruan,Yichi Zhang,Xingxing Wei,Yinpeng Dong

Main category: cs.LG

TL;DR: Proposes Manifold Steering, which projects the steering direction onto a low-dimensional manifold to reduce overthinking in large reasoning models, substantially cutting computational overhead while maintaining accuracy.

Details

Motivation: Large reasoning models (LRMs) often overthink during inference, which increases computational overhead; the paper studies and mitigates this from a mechanistic interpretability perspective. Method: Finds that overthinking is tied to a low-dimensional manifold in activation space and proposes Manifold Steering, which projects the intervention direction onto that manifold to reduce interference noise. Result: On DeepSeek-R1 distilled models, the method reduces output tokens by up to 71% while maintaining or even improving accuracy on mathematical benchmarks, and it transfers across domains. Conclusion: Manifold Steering effectively mitigates overthinking, substantially improves model efficiency, and applies to a range of tasks. Abstract: Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overheads. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first showcase that the tendency of overthinking can be effectively captured by a single direction in the model's activation space and the issue can be eased by intervening on the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noise introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: https://github.com/Aries-iai/Manifold_Steering.
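
The core geometric step, projecting a steering direction onto a low-dimensional subspace before intervening, can be sketched in a few lines. This is a simplified illustration and not the released implementation: it assumes the manifold is approximated by a linear subspace with a known orthonormal basis, and the names `project_onto_subspace` and `steer` as well as the toy vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_onto_subspace(direction, basis):
    """Project `direction` onto the span of `basis`.

    Assumes the basis vectors are orthonormal, so the projection is the sum
    of (direction . b) * b over the basis vectors b.
    """
    projected = [0.0] * len(direction)
    for b in basis:
        coeff = dot(direction, b)
        projected = [p + coeff * x for p, x in zip(projected, b)]
    return projected


def steer(activation, direction, strength):
    """Shift an activation against the steering direction."""
    return [a - strength * d for a, d in zip(activation, direction)]


# Toy 3-D activation space whose "manifold" is the plane of the first two axes.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
raw_direction = [0.6, 0.8, 5.0]  # large off-subspace component stands in for noise
clean_direction = project_onto_subspace(raw_direction, basis)
print(clean_direction)  # [0.6, 0.8, 0.0]
steered = steer([1.0, 1.0, 1.0], clean_direction, strength=1.0)
```

After projection the intervention no longer moves the activation along the off-subspace axis, which is the intuition behind removing interference noise from a high-dimensional steering direction.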

[287] Scaling Reasoning without Attention (https://arxiv.org/abs/2505.22425)

Xueliang Zhao,Wei Wu,Lingpeng Kong

Main category: cs.LG

TL;DR: Proposes an attention-free language model that addresses Transformer inefficiency and the lack of structured fine-tuning through architectural and data innovations, and performs strongly on several benchmarks.

Details

Motivation: Address two bottlenecks for large language models on complex reasoning tasks: the inefficiency of the Transformer architecture and the lack of structured fine-tuning. Method: Uses the state space dual (SSD) layers of Mamba-2 to eliminate self-attention and key-value caching, giving fixed-memory, constant-time inference; proposes a two-phase curriculum fine-tuning strategy that generates structured problems with the PromptCoT synthesis paradigm. Result: On several benchmarks, the model outperforms Transformer and hybrid models of comparable scale and even surpasses the much larger Gemma3-27B. Conclusion: State space models are a promising efficient and scalable alternative for high-capacity reasoning tasks. Abstract: Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce \ourmodel, an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, \ourmodel-7B outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on Livecodebench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.

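### [288] [The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models](https://arxiv.org/abs/2505.22617) *Ganqu Cui,Yuchen Zhang,Jiacheng Chen,Lifan Yuan,Zhi Wang,Yuxin Zuo,Haozhan Li,Yuchen Fan,Huayu Chen,Weize Chen,Zhiyuan Liu,Hao Peng,Lei Bai,Wanli Ouyang,Yu Cheng,Bowen Zhou,Ning Ding* Main category: cs.LG TL;DR: The paper studies the collapse of policy entropy in reinforcement learning, proposes an equation relating entropy to performance, analyzes entropy dynamics theoretically and empirically, and introduces two entropy-control methods (Clip-Cov and KL-Cov) whose effectiveness is confirmed experimentally.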

Details

Motivation: Address policy entropy collapse in reinforcement learning (RL) to improve exploration and performance. Method: Analyzes entropy dynamics theoretically and empirically, and proposes Clip-Cov and KL-Cov, which restrict the updates of high-covariance tokens. Result: Experiments show that the proposed methods avoid entropy collapse and improve downstream performance. Conclusion: Entropy management is essential for scaling RL, and the proposed methods offer an effective route to sustained exploration. Abstract: This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across many RL runs without entropy intervention: the policy entropy drops sharply at the early training stage, and the diminished exploratory ability is always accompanied by the saturation of policy performance. In practice, we establish a transformation equation $R = -a e^{H} + b$ between entropy $H$ and downstream performance $R$. This empirical law strongly indicates that policy performance is traded from policy entropy and is thus bottlenecked by its exhaustion, and that the ceiling is fully predictable: at $H = 0$, $R = -a + b$. Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Empirical study shows that the values of the covariance term and the entropy differences match exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically. Through understanding the mechanism behind entropy dynamics, we are motivated to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply a KL penalty to tokens with high covariances, respectively. Experiments show that these methods encourage exploration, thus helping the policy escape entropy collapse and achieve better downstream performance.
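
The entropy-performance law and the idea of singling out high-covariance tokens can be sketched as follows. This is an illustrative sketch only: the coefficient values, the per-token centered-product estimate of covariance, and the top-fraction selection rule are assumptions, not the paper's exact Clip-Cov or KL-Cov procedure.

```python
import math


def predicted_performance(entropy, a, b):
    """Empirical law from the abstract: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


def performance_ceiling(a, b):
    """Predicted performance once entropy is exhausted (H = 0)."""
    return predicted_performance(0.0, a, b)


def token_covariances(log_probs, advantages):
    """Per-token centered products of log-probability and advantage.

    Their mean is the sample covariance; using the per-token terms to
    rank tokens is an assumption of this sketch.
    """
    n = len(log_probs)
    mean_lp = sum(log_probs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (adv - mean_adv)
            for lp, adv in zip(log_probs, advantages)]


def high_covariance_mask(covs, fraction=0.2):
    """Flag the top `fraction` of tokens by covariance (hypothetical rule)."""
    k = max(1, int(len(covs) * fraction))
    cutoff = sorted(covs, reverse=True)[k - 1]
    return [c >= cutoff for c in covs]
```

A training loop could then exclude the flagged tokens from the gradient (in the spirit of Clip-Cov) or apply an extra KL penalty to them (in the spirit of KL-Cov).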

### [289] [Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents](https://arxiv.org/abs/2505.22655) *Michael Kirchhof,Gjergji Kasneci,Enkelejda Kasneci* Main category: cs.LG TL;DR: The traditional dichotomy of uncertainties is too limited for interactive LLM settings; new approaches are needed to enrich uncertainty quantification.

Details

Motivation: LLMs and chatbots often produce wrong outputs, and traditional uncertainty quantification breaks down in interactive settings, so new directions are needed. Method: Proposes three research directions: underspecification uncertainties, interactive learning to reduce uncertainty about the current context, and expressing uncertainty through language. Result: Popular definitions of uncertainty contradict each other and lose their meaning in interactive settings; the new directions are expected to improve transparency and trust. Conclusion: The new research directions should make LLM interactions more transparent, trustworthy, and intuitive. Abstract: Large-language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: underspecification uncertainties, for when users do not provide all information or define the exact task at the first go; interactive learning, to ask follow-up questions and reduce the uncertainty about the current context; and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.

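### [290] [Temporal Restoration and Spatial Rewiring for Source-Free Multivariate Time Series Domain Adaptation](https://arxiv.org/abs/2505.21525) *Peiliang Gong,Yucheng Wang,Min Wu,Zhenghua Chen,Xiaoli Li,Daoqiang Zhang* Main category: cs.LG TL;DR: TERSE is a source-free domain adaptation (SFDA) method for multivariate time series (MTS) that uses spatial-temporal feature encoding and dedicated restoration tasks to address existing methods' neglect of spatial correlations.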

Details

Motivation: Existing SFDA methods perform poorly on multivariate time series, mainly because they ignore the spatial correlations in the data; TERSE aims to fix this. Method: TERSE combines a spatial-temporal feature encoder with temporal restoration and spatial rewiring tasks to capture and transfer spatial-temporal dependencies across domains. Result: Experiments on three real-world time series datasets show that TERSE is both effective and versatile. Conclusion: TERSE is the first MTS-SFDA method to consider spatial and temporal consistency jointly, and it can be integrated into existing SFDA methods as a plug-in module. Abstract: Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained model from an annotated source domain to an unlabelled target domain without accessing the source data, thereby preserving data privacy. While existing SFDA methods have proven effective in reducing reliance on source data, they struggle to perform well on multivariate time series (MTS) due to their failure to consider the intrinsic spatial correlations inherent in MTS data. These spatial correlations are crucial for accurately representing MTS data and preserving invariant information across domains. To address this challenge, we propose Temporal Restoration and Spatial Rewiring (TERSE), a novel and concise SFDA method tailored for MTS data. Specifically, TERSE comprises a customized spatial-temporal feature encoder designed to capture the underlying spatial-temporal characteristics, coupled with both temporal restoration and spatial rewiring tasks to reinstate latent representations of the temporally masked time series and the spatially masked correlated structures. During the target adaptation phase, the target encoder is guided to produce spatially and temporally consistent features with the source domain by leveraging the source pre-trained temporal restoration and spatial rewiring networks. Therefore, TERSE can effectively model and transfer spatial-temporal dependencies across domains, facilitating implicit feature alignment. In addition, as the first approach to simultaneously consider spatial-temporal consistency in MTS-SFDA, TERSE can also be integrated as a versatile plug-and-play module into established SFDA methods. Extensive experiments on three real-world time series datasets demonstrate the effectiveness and versatility of our approach.

### [291] [Taming Transformer Without Using Learning Rate Warmup](https://arxiv.org/abs/2505.21910) *Xianbiao Qi,Yelin He,Jiaquan Ye,Chun-Guang Li,Bojia Zi,Xili Dai,Qin Zou,Rong Xiao* Main category: cs.LG TL;DR: The paper proposes a new optimization strategy that smooths weight updates to avoid spectral energy concentration in Transformer training, preventing model crashes without learning rate warmup.

Details

Motivation: Investigate why Transformers crash in large-scale training; spectral energy concentration is identified as the key cause of malignant entropy collapse. Method: Motivated by Weyl's inequality, proposes an optimization strategy that dynamically bounds the learning rate to smooth weight updates and prevent spectral energy concentration. Result: Experiments show that the strategy stably trains ViT, Swin-Transformer, and GPT without learning rate warmup. Conclusion: The proposed strategy addresses spectral energy concentration in Transformer training and improves training stability. Abstract: Scaling Transformers to a large scale without technical tricks such as learning rate warmup or an obviously lower learning rate is an extremely challenging task and is gaining increasing attention. In this paper, we provide a theoretical analysis of the process of training Transformers and reveal the rationale behind the model crash phenomenon in training, termed *spectral energy concentration* of $\mathbf{W}_q^{\top} \mathbf{W}_k$, which is the reason for a malignant entropy collapse, where $\mathbf{W}_q$ and $\mathbf{W}_k$ are the projection matrices for the query and the key in the Transformer, respectively. To remedy this problem, motivated by *Weyl's Inequality*, we present a novel optimization strategy, i.e., making the weight updating in successive steps smooth -- if the ratio $\frac{\sigma_{1}(\nabla \mathbf{W}_t)}{\sigma_{1}(\mathbf{W}_{t-1})}$ is larger than a threshold, we will automatically bound the learning rate to a weighted multiple of $\frac{\sigma_{1}(\mathbf{W}_{t-1})}{\sigma_{1}(\nabla \mathbf{W}_t)}$, where $\nabla \mathbf{W}_t$ is the updating quantity in step $t$. Such an optimization strategy can prevent spectral energy concentration to only a few directions, and thus can avoid the malignant entropy collapse that triggers the model crash. We conduct extensive experiments using ViT, Swin-Transformer and GPT, showing that our optimization strategy can effectively and stably train these Transformers without using learning rate warmup.
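
The learning-rate bound described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's optimizer: the `threshold` and `scale` values are hypothetical, the largest singular value is estimated with plain power iteration, and whether the ratio should already include the learning rate is left open by the abstract.

```python
import math


def spectral_norm(mat, iters=100):
    """Largest singular value of `mat` (list of rows) via power iteration."""
    n_rows, n_cols = len(mat), len(mat[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        u = [sum(mat[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(mat[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    u = [sum(mat[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in u))


def bounded_lr(base_lr, weight, update, threshold=0.1, scale=0.1):
    """Cap the learning rate when the update is spectrally large.

    If sigma_1(update) / sigma_1(weight) exceeds `threshold`, the rate is
    bounded by `scale * sigma_1(weight) / sigma_1(update)`.
    Both `threshold` and `scale` are hypothetical values.
    """
    s_w = spectral_norm(weight)
    s_u = spectral_norm(update)
    if s_w == 0.0 or s_u == 0.0:
        return base_lr
    if s_u / s_w > threshold:
        return min(base_lr, scale * s_w / s_u)
    return base_lr
```

With this rule a small update leaves the base rate untouched, while an update whose top singular value dominates that of the weight matrix shrinks the step proportionally.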

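### [292] [From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization](https://arxiv.org/abs/2505.22310) *Shoaib Ahmed Siddiqui,Adrian Weller,David Krueger,Gintare Karolina Dziugaite,Michael Curtis Mozer,Eleni Triantafillou* Main category: cs.LG TL;DR: The paper finds that existing LLM unlearning methods are vulnerable to relearning attacks: forgotten knowledge re-emerges through fine-tuning, even without forget-set data. Experiments show that weight-space properties predict resistance to such attacks, and a new class of methods is proposed to improve it.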

Details

Motivation: Study the fragility of existing unlearning methods in vision classifiers and explore why unlearned knowledge re-emerges. Method: Tests a range of unlearning methods in a controlled setting, analyzes weight-space properties ($L_2$ distance and linear mode connectivity), and proposes new methods with better resistance. Result: Forget-set accuracy can recover from about 50% to nearly 100%; weight-space properties predict resistance to relearning attacks; the new methods achieve state-of-the-art resistance. Conclusion: Weight-space properties are a key indicator of resistance to relearning attacks, and the new methods markedly improve the robustness of unlearned models. Abstract: Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a small set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set -- i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically, $L_2$-distance and linear mode connectivity between the original and the unlearned model. Leveraging this insight, we propose a new class of methods that achieve state-of-the-art resistance to relearning attacks.
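
The two weight-space quantities named in the abstract can be sketched on flat weight vectors. This is an illustration under assumptions, not the paper's measurement code: `loss_barrier` uses one common definition of linear mode connectivity (the largest excess of the loss along the straight path over the linear interpolation of the endpoint losses), and `loss_fn` and `steps` are hypothetical.

```python
import math


def l2_distance(w_a, w_b):
    """Euclidean distance between two flattened weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w_a, w_b)))


def interpolate(w_a, w_b, alpha):
    """Point on the straight line from w_a (alpha=0) to w_b (alpha=1)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]


def loss_barrier(w_a, w_b, loss_fn, steps=11):
    """Largest excess loss along the linear path between two models.

    A barrier near zero suggests the two models are linearly mode connected.
    """
    loss_a, loss_b = loss_fn(w_a), loss_fn(w_b)
    barrier = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        path_loss = loss_fn(interpolate(w_a, w_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier
```

Here `w_a` and `w_b` would stand for the original and the unlearned model; a real evaluation would compute `loss_fn` on held-out data rather than on a toy function.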

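### [293] [A Closer Look at Multimodal Representation Collapse](https://arxiv.org/abs/2505.22483) *Abhra Chaudhuri,Anjan Dutta,Tu Bui,Serban Georgescu* Main category: cs.LG TL;DR: The paper studies modality collapse and proposes to address it with cross-modal knowledge distillation and an explicit basis reallocation algorithm.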

Details

Motivation: Understand and resolve modality collapse in multimodal fusion, where a model relies on only some modalities and ignores the rest. Method: Identifies the cause of modality collapse through theoretical analysis and proposes cross-modal knowledge distillation and an explicit basis reallocation algorithm. Result: Experiments validate the theoretical analysis and demonstrate the algorithm's effectiveness on multimodal benchmarks. Conclusion: Cross-modal knowledge distillation and explicit basis reallocation prevent modality collapse and also apply to settings with missing modalities. Abstract: We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.

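### [294] [Understanding Adversarial Training with Energy-based Models](https://arxiv.org/abs/2505.22486) *Mujtaba Hussain Mirza,Maria Rosaria Briglia,Filippo Bartolucci,Senad Beadini,Giuseppe Lisanti,Iacopo Masi* Main category: cs.LG TL;DR: The paper analyzes adversarially trained classifiers through the energy-based model (EBM) framework, proposes the Delta Energy Regularizer (DER) to mitigate catastrophic overfitting (CO) and robust overfitting (RO), and examines the generative capabilities of robust classifiers.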

Details

Motivation: Understand overfitting in adversarial training from an energy perspective and explore the generative potential of robust classifiers. Method: Analyzes the energy differences between adversarial and natural samples, proposes the DER regularizer, and improves generation with local class-wise PCA and energy-based guidance. Result: DER mitigates CO and RO; the generative approach achieves Inception Score (IS) and FID competitive with hybrid models. Conclusion: The energy perspective offers new insight into adversarial training and generative modeling, and DER and the improved generation technique show practical potential. Abstract: We aim to use the Energy-based Model (EBM) framework to better understand adversarial training (AT) in classifiers, and additionally to analyze the intrinsic generative capabilities of robust classifiers. By viewing standard classifiers through an energy lens, we begin by analyzing how the energies of adversarial examples, generated by various attacks, differ from those of the natural samples. The central focus of our work is to understand the critical phenomena of Catastrophic Overfitting (CO) and Robust Overfitting (RO) in AT from an energy perspective. We analyze the impact of existing AT approaches on the energy of samples during training and observe that the behavior of the "delta energy" -- the change in energy between the original sample and its adversarial counterpart -- diverges significantly when CO or RO occurs. After a thorough analysis of these energy dynamics and their relationship with overfitting, we propose a novel regularizer, the Delta Energy Regularizer (DER), designed to smooth the energy landscape during training. We demonstrate that DER is effective in mitigating both CO and RO across multiple benchmarks. We further show that robust classifiers, when used as generative models, have limits in handling the trade-off between image quality and variability. We propose an improved technique based on a local class-wise principal component analysis (PCA) and energy-based guidance for better class-specific initialization and adaptive stopping, enhancing sample diversity and generation quality. Considering that we do not explicitly train for generative modeling, we achieve a competitive Inception Score (IS) and Fréchet inception distance (FID) compared to hybrid discriminative-generative models.
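
One way to make the "delta energy" concrete is the usual reading of a classifier as an energy-based model, where the energy of an input is the negative log-sum-exp of its logits. The abstract does not spell out its definition, so that formula, the sign convention (adversarial minus natural), and the function names below are assumptions of this sketch.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def energy(logits):
    """Marginal energy of an input under the classifier-as-EBM view."""
    return -logsumexp(logits)


def delta_energy(logits_natural, logits_adversarial):
    """Energy change from a natural sample to its adversarial counterpart."""
    return energy(logits_adversarial) - energy(logits_natural)
```

A regularizer in the spirit of DER could then penalize the magnitude of `delta_energy` over a batch; the exact form used in the paper is not given in the abstract.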

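# quant-ph [[Back]](#toc)

### [295] [Physics-inspired Generative AI models via real hardware-based noisy quantum diffusion](https://arxiv.org/abs/2505.22193) *Marco Parigi,Stefano Martina,Francesco Aldo Venturelli,Filippo Caruso* Main category: quant-ph TL;DR: Quantum diffusion models (QDMs) use quantum properties to improve generative AI. The paper proposes two physics-inspired protocols, one based on quantum stochastic walks and one that generates images by exploiting the noise of IBM quantum hardware, showing the potential of quantum noise as a resource.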

Details

Motivation: Existing quantum diffusion model algorithms scale poorly on near-term quantum devices; the paper explores how quantum dynamics and noise can improve generative models. Method: (1) Uses the quantum stochastic walk formalism, combining quantum and classical dynamics to obtain more robust models; (2) exploits the intrinsic noise of IBM quantum hardware to generate images. Result: The generated MNIST images have lower FID, and quantum noise proves to be a useful resource. Conclusion: The work opens new directions for large-scale quantum generative AI algorithms in which quantum noise is exploited rather than eliminated. Abstract: Quantum Diffusion Models (QDMs) are an emerging paradigm in Generative AI that aims to use quantum properties to improve the performance of their classical counterparts. However, existing algorithms are not easily scalable due to the limitations of near-term quantum devices. Following our previous work on QDMs, here we propose and implement two physics-inspired protocols. In the first, we use the formalism of quantum stochastic walks, showing that a specific interplay of quantum and classical dynamics in the forward process produces statistically more robust models generating sets of MNIST images with lower Fréchet Inception Distance (FID) than using totally classical dynamics. In the second approach, we realize an algorithm to generate images by exploiting the intrinsic noise of real IBM quantum hardware with only four qubits. Our work could be a starting point to pave the way for new scenarios for large-scale algorithms in quantum Generative AI, where quantum noise is neither mitigated nor corrected, but instead exploited as a useful resource.

# cs.CY [[Back]](#toc)

### [296] [From prosthetic memory to prosthetic denial: Auditing whether large language models are prone to mass atrocity denialism](https://arxiv.org/abs/2505.21753) *Roberto Ulloa,Eve M. Zucker,Daniel Bultmann,David J. Simon,Mykola Makhortykh* Main category: cs.CY TL;DR: The study examines how large language models (LLMs) affect the dissemination of historical narratives, analyzes how they represent the memory of mass atrocities, and shows that LLMs can give rise either to mediated memory or to memory denial.

Details

Motivation: Investigate the role of LLMs in transmitting historical memory, especially their potential to distort or deny atrocity memories. Method: A comparative audit of five LLMs (Claude, GPT, Llama, Mixtral, Gemini) across four historical cases (the Holodomor, the Holocaust, the Cambodian Genocide, and the genocide against the Tutsis in Rwanda). Result: LLMs respond accurately on widely documented events such as the Holocaust but are susceptible to denialist framings on less documented cases such as the Cambodian Genocide. Conclusion: LLMs extend the concept of prosthetic (mediated) memory, but unmoderated use risks reinforcing historical denialism and raises ethical concerns. Abstract: The proliferation of large language models (LLMs) can influence how historical narratives are disseminated and perceived. This study explores the implications of LLMs' responses on the representation of mass atrocity memory, examining whether generative AI systems contribute to prosthetic memory, i.e., mediated experiences of historical events, or to what we term "prosthetic denial," the AI-mediated erasure or distortion of atrocity memories. We argue that LLMs function as interfaces that can elicit prosthetic memories and, therefore, act as experiential sites for memory transmission, but also introduce risks of denialism, particularly when their outputs align with contested or revisionist narratives. To empirically assess these risks, we conducted a comparative audit of five LLMs (Claude, GPT, Llama, Mixtral, and Gemini) across four historical case studies: the Holodomor, the Holocaust, the Cambodian Genocide, and the genocide against the Tutsis in Rwanda. Each model was prompted with questions addressing common denialist claims in English and an alternative language relevant to each case (Ukrainian, German, Khmer, and French). Our findings reveal that while LLMs generally produce accurate responses for widely documented events like the Holocaust, significant inconsistencies and susceptibility to denialist framings are observed for more underrepresented cases like the Cambodian Genocide. The disparities highlight the influence of training data availability and the probabilistic nature of LLM responses on memory integrity.
We conclude that while LLMs extend the concept of prosthetic memory, their unmoderated use risks reinforcing historical denialism, raising ethical concerns for (digital) memory preservation, and potentially challenging the advantageous role of technology associated with the original values of prosthetic memory.

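### [297] [Detecting Cultural Differences in News Video Thumbnails via Computational Aesthetics](https://arxiv.org/abs/2505.21912) *Marvin Limpijankit,John Kender* Main category: cs.CY TL;DR: Proposes a two-step approach for detecting differences in image style across cultural backgrounds: cluster images by content first, then compare aesthetic features. Tested on 2,400 YouTube thumbnails, it finds notable stylistic differences between Chinese and U.S. channels.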

Details

Motivation: Study differences in image style across cultures and provide a baseline for analyzing visual propaganda. Method: Two steps: cluster images into visual themes, then compare aesthetic features. The test data are 2,400 thumbnails from U.S. and Chinese YouTube channels. Result: Chinese thumbnails are less formal and more candid; U.S. thumbnails are more deliberate and finely detailed, with differences in colorfulness, saturation, symmetry, and other features. Conclusion: Most of the differences appear to reflect cultural preferences, and the method can serve as a baseline for visual propaganda analysis. Abstract: We propose a two-step approach for detecting differences in the style of images across sources of differing cultural affinity, where images are first clustered into finer visual themes based on content before their aesthetic features are compared. We test this approach on 2,400 YouTube video thumbnails taken equally from two U.S. and two Chinese YouTube channels, and relating equally to COVID-19 and the Ukraine conflict. Our results suggest that while Chinese thumbnails are less formal and more candid, U.S. channels tend to use more deliberate, proper photographs as thumbnails. In particular, U.S. thumbnails are less colorful, more saturated, darker, more finely detailed, less symmetric, sparser, less varied, and more up close and personal than Chinese thumbnails. We suggest that most of these differences reflect cultural preferences, and that our methods and observations can serve as a baseline against which suspected visual propaganda can be computed and compared.

# eess.IV [[Back]](#toc)

### [298] [Cascaded 3D Diffusion Models for Whole-body 3D 18-F FDG PET/CT synthesis from Demographics](https://arxiv.org/abs/2505.22489) *Siyeop Yoon,Sifan Song,Pengfei Jin,Matthew Tivnan,Yujin Oh,Sekeun Kim,Dufan Wu,Xiang Li,Quanzheng Li* Main category: eess.IV TL;DR: Proposes a cascaded 3D diffusion model framework that synthesizes high-fidelity 3D PET/CT volumes directly from demographic variables, addressing the need for realistic digital twins in oncologic imaging, virtual trials, and AI-driven data augmentation.

Details

Motivation: Overcome the reliance of traditional deterministic phantoms on predefined templates and provide a more flexible, realistic way to generate synthetic images. Method: A two-stage generative process: a score-based diffusion model first produces low-resolution PET/CT volumes, then a super-resolution residual diffusion model refines the spatial resolution. Result: Synthetic images agree closely with real data in organ volume and SUV distributions; most deviations in metabolic uptake values are within 3-5%. Conclusion: Cascaded 3D diffusion models can generate anatomically and metabolically accurate PET/CT images, offering a scalable synthetic-data solution for clinical and research use. Abstract: We propose a cascaded 3D diffusion model framework to synthesize high-fidelity 3D PET/CT volumes directly from demographic variables, addressing the growing need for realistic digital twins in oncologic imaging, virtual trials, and AI-driven data augmentation. Unlike deterministic phantoms, which rely on predefined anatomical and metabolic templates, our method employs a two-stage generative process. An initial score-based diffusion model synthesizes low-resolution PET/CT volumes from demographic variables alone, providing global anatomical structures and approximate metabolic activity. This is followed by a super-resolution residual diffusion model that refines spatial resolution. Our framework was trained on 18-F FDG PET/CT scans from the AutoPET dataset and evaluated using organ-wise volume and standardized uptake value (SUV) distributions, comparing synthetic and real data between demographic subgroups. The organ-wise comparison demonstrated strong concordance between synthetic and real images. In particular, most deviations in metabolic uptake values remained within 3-5% of the ground truth in subgroup analysis. These findings highlight the potential of cascaded 3D diffusion models to generate anatomically and metabolically accurate PET/CT images, offering a robust alternative to traditional phantoms and enabling scalable, population-informed synthetic imaging for clinical and research applications.

### [299] [High-Fidelity Functional Ultrasound Reconstruction via A Visual Auto-Regressive Framework](https://arxiv.org/abs/2505.21530) *Xuhang Chen,Zhuo Li,Yanyan Shen,Mufti Mahmud,Hieu Pham,Chi-Man Pun,Shuqiang Wang* Main category: eess.IV TL;DR: fUS imaging offers high spatiotemporal resolution for neurovascular mapping, but data scarcity and signal degradation limit its application.

Details

Motivation: Address the limited dataset diversity and the fairness problems of machine learning models that result from data scarcity and signal degradation in fUS imaging. Method: Not described in the abstract text given here. Result: Not reported in the abstract text given here. Conclusion: The potential of fUS imaging is held back by data scarcity and signal degradation. Abstract: Functional ultrasound (fUS) imaging provides exceptional spatiotemporal resolution for neurovascular mapping, yet its practical application is significantly hampered by critical challenges. Foremost among these are data scarcity, arising from ethical considerations and signal degradation through the cranium, which collectively limit dataset diversity and compromise the fairness of downstream machine learning models.

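### [300] [Image denoising as a conditional expectation](https://arxiv.org/abs/2505.21546) *Sajal Chakroborty,Suddhasattwa Das* Main category: eess.IV TL;DR: The paper proposes a data-driven denoising method set in a probability space: the true image is recovered as a conditional expectation by solving a least-squares problem in an RKHS.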

Details

Motivation: Traditional denoising estimates the true image as a projection onto a subspace, which is not guaranteed to be unbiased or convergent. Method: Treats the noisy image as samples from a probability space, estimates the conditional expectation of the true image with kernel integral operators, and solves a linear equation in an RKHS. Result: The method converges as the number of pixels goes to infinity, and the convergence result guides parameter selection for images with finitely many pixels. Conclusion: The proposed method has both theoretical and practical advantages for image denoising. Abstract: All techniques for denoising involve a notion of a true (noise-free) image, and a hypothesis space. The hypothesis space may reconstruct the image directly as a grayscale valued function, or indirectly by its Fourier or wavelet spectrum. Most common techniques estimate the true image as a projection to some subspace. We propose an interpretation of a noisy image as a collection of samples drawn from a certain probability space. Within this interpretation, projection based approaches are not guaranteed to be unbiased and convergent. We present a data-driven denoising method in which the true image is recovered as a conditional expectation. Although the probability space is unknown a priori, integrals on this space can be estimated by kernel integral operators. The true image is reformulated as the least squares solution to a linear equation in a reproducing kernel Hilbert space (RKHS), and involving various kernel integral operators as linear transforms. Assuming the true image to be a continuous function on a compact planar domain, the technique is shown to be convergent as the number of pixels goes to infinity. We also show that for a picture with finite number of pixels, the convergence result can be used to choose the various parameters for an optimum denoising result.

### [301] [Taylor expansion-based Kolmogorov-Arnold network for blind image quality assessment](https://arxiv.org/abs/2505.21592) *Ze Chen,Shaode Yu* Main category: eess.IV TL;DR: TaylorKAN uses Taylor expansions as learnable activation functions to improve local approximation, and it outperforms other KAN-related models on high-dimensional features.

Details

Motivation: KAN models yield limited gains and high computational cost on high-dimensional features and need improvement. Method: Proposes TaylorKAN, which uses Taylor expansions to strengthen local approximation and integrates network depth reduction and feature dimensionality compression for efficiency. Result: On five databases TaylorKAN outperforms other KAN-related models, and inter-database experiments validate its generalization. Conclusion: TaylorKAN is an efficient and robust model for high-dimensional score regression. Abstract: Kolmogorov-Arnold Network (KAN) has attracted growing interest for its strong function approximation capability. In our previous work, KAN and its variants were explored in score regression for blind image quality assessment (BIQA). However, these models encounter challenges when processing high-dimensional features, leading to limited performance gains and increased computational cost. To address these issues, we propose TaylorKAN that leverages the Taylor expansions as learnable activation functions to enhance local approximation capability. To improve the computational efficiency, network depth reduction and feature dimensionality compression are integrated into the TaylorKAN-based score regression pipeline. On five databases (BID, CLIVE, KonIQ, SPAQ, and FLIVE) with authentic distortions, extensive experiments demonstrate that TaylorKAN consistently outperforms the other KAN-related models, indicating that the local approximation via Taylor expansions is more effective than global approximation using orthogonal functions. Its generalization capacity is validated through inter-database experiments. The findings highlight the potential of TaylorKAN as an efficient and robust model for high-dimensional score regression.
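
The idea of a Taylor expansion as a learnable activation can be sketched as a truncated polynomial whose coefficients are the trainable parameters. This is only an illustration: the expansion `center`, the polynomial order, and the KAN-style layer that sums one such function per input-output edge are assumptions, and the paper's actual parameterization may differ.

```python
def taylor_activation(x, coeffs, center=0.0):
    """Evaluate sum_k coeffs[k] * (x - center)**k with Horner's rule.

    In a trained model `coeffs` would be learnable parameters.
    """
    result = 0.0
    for c in reversed(coeffs):
        result = result * (x - center) + c
    return result


def taylor_kan_layer(inputs, coeff_table, center=0.0):
    """KAN-style layer: output j sums a separate activation per input i.

    coeff_table[j][i] holds the coefficients for the edge i -> j.
    """
    return [
        sum(taylor_activation(x, edge, center) for x, edge in zip(inputs, row))
        for row in coeff_table
    ]
```

For example, `taylor_activation(2.0, [1.0, 2.0, 3.0])` evaluates 1 + 2x + 3x^2 at x = 2.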

### [302] [Optimizing Deep Learning for Skin Cancer Classification: A Computationally Efficient CNN with Minimal Accuracy Trade-Off](https://arxiv.org/abs/2505.21597) *Abdullah Al Mamun,Pollob Chandra Ray,Md Rahat Ul Nasib,Akash Das,Jia Uddin,Md Nurul Absur* Main category: eess.IV TL;DR: Proposes a lightweight CNN that greatly reduces parameters and computation while maintaining high classification accuracy, making it suitable for resource-constrained environments.

Details

Motivation: Existing transfer-learning models such as ResNet50 are computationally expensive and hard to deploy in resource-constrained environments. Method: Designs a custom CNN with far fewer parameters and FLOPs. Result: Parameters drop by 96.7%, FLOPs fall to 30.04 million, and the accuracy deviation is below 0.022%. Conclusion: The lightweight CNN is more practical in resource-constrained settings and suits mobile and edge devices. Abstract: The rapid advancement of deep learning in medical image analysis has greatly enhanced the accuracy of skin cancer classification. However, current state-of-the-art models, especially those based on transfer learning like ResNet50, come with significant computational overhead, rendering them impractical for deployment in resource-constrained environments. This study proposes a custom CNN model that achieves a 96.7% reduction in parameters (from 23.9 million in ResNet50 to 692,000) while maintaining a classification accuracy deviation of less than 0.022%. Our empirical analysis of the HAM10000 dataset reveals that although transfer learning models provide a marginal accuracy improvement of approximately 0.022%, they result in a staggering 13,216.76% increase in FLOPs, considerably raising computational costs and inference latency. In contrast, our lightweight CNN architecture, which encompasses only 30.04 million FLOPs compared to ResNet50's 4.00 billion, significantly reduces energy consumption, memory footprint, and inference time. These findings underscore the trade-off between the complexity of deep models and their real-world feasibility, positioning our optimized CNN as a practical solution for mobile and edge-based skin cancer diagnostics.

### [303] [Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter](https://arxiv.org/abs/2505.21634) *Chengyu Yang,Chengjun Liu* Main category: eess.IV TL;DR: Proposes ULW, a method that combines a U-Net, a new loss function, and a learnable Wiener filter to remove surgical smoke from laparoscopic images and improve image quality.

Details

Motivation: Surgical smoke reduces visual clarity in laparoscopic surgery, hampering both surgeons and computer-assisted technologies. Method: A U-Net trained with a new loss that combines structural similarity, perceptual, and mean squared error terms, together with a learnable Wiener filter. Result: On a public dataset, ULW performs well in both visual clarity and metric-based evaluation. Conclusion: ULW offers a promising solution for real-time enhancement of laparoscopic images. Abstract: Laparoscopic surgeries often suffer from reduced visual clarity due to the presence of surgical smoke generated by surgical instruments, which poses significant challenges for both surgeons and vision-based computer-assisted technologies. In order to remove the surgical smoke, a novel U-Net deep learning with new loss function and integrated differentiable Wiener filter (ULW) method is presented. Specifically, the new loss function integrates the pixel, structural, and perceptual properties. Thus, the new loss function, which combines the structural similarity index measure loss, the perceptual loss, as well as the mean squared error loss, is able to enhance the quality and realism of the reconstructed images. Furthermore, the learnable Wiener filter is capable of effectively modelling the degradation process caused by the surgical smoke. The effectiveness of the proposed ULW method is evaluated using the publicly available paired laparoscopic smoke and smoke-free image dataset, which provides reliable benchmarking and quantitative comparisons. Experimental results show that the proposed ULW method excels in both visual clarity and metric-based evaluation. As a result, the proposed ULW method offers a promising solution for real-time enhancement of laparoscopic imagery. The code is available at https://github.com/chengyuyang-njit/ImageDesmoke.

### [304] [STA-Risk: A Deep Dive of Spatio-Temporal Asymmetries for Breast Cancer Risk Prediction](https://arxiv.org/abs/2505.21699) *Zhengbo Zhou,Dooman Arefan,Margarita Zuley,Jules Sumkin,Shandong Wu* Main category: eess.IV TL;DR: STA-Risk is a Transformer-based model that captures spatial and temporal asymmetries in mammograms and improves breast cancer risk prediction.

Details

Motivation: Existing breast cancer risk models have limited performance and overlook spatial and temporal detail in longitudinal imaging; STA-Risk aims to address this. Method: Proposes STA-Risk, which uses side encoding and temporal encoding to learn spatial-temporal asymmetries, regulated by a customized asymmetry loss. Result: In experiments on two independent datasets, STA-Risk outperforms four representative SOTA models for 1- to 5-year risk prediction. Conclusion: By capturing spatial and temporal asymmetries, STA-Risk improves the accuracy of breast cancer risk prediction. Abstract: Predicting the risk of developing breast cancer is an important clinical tool to guide early intervention and tailor personalized screening strategies. Early risk models have limited performance, and recently machine learning-based analysis of mammogram images has shown encouraging risk prediction effects. These models, however, are limited to the use of a single exam or tend to overlook nuanced breast tissue evolution in the spatial and temporal details of longitudinal imaging exams that are indicative of breast cancer risk. In this paper, we propose STA-Risk (Spatial and Temporal Asymmetry-based Risk Prediction), a novel Transformer-based model that captures fine-grained mammographic imaging evolution simultaneously from bilateral and longitudinal asymmetries for breast cancer risk prediction. STA-Risk is innovative in its use of side encoding and temporal encoding to learn spatial-temporal asymmetries, regulated by a customized asymmetry loss. We performed extensive experiments with two independent mammogram datasets and achieved better performance than four representative SOTA models for 1- to 5-year future risk prediction. Source code will be released upon publication of the paper.

### [305] [Privacy-Preserving Chest X-ray Report Generation via Multimodal Federated Learning with ViT and GPT-2](https://arxiv.org/abs/2505.21715) *Md. Zahid Hossain,Mustofa Ahmed,Most. Sharmin Sultana Samu,Md. Rakibul Islam* Main category: eess.IV TL;DR: The study proposes a multimodal federated learning framework for chest X-ray report generation with a ViT encoder and a GPT-2 generator, and evaluates three federated aggregation strategies, of which Krum Aggregation performs best.

Details

Motivation: Traditional centralized approaches require transferring sensitive data, which raises privacy concerns, so the study proposes a privacy-preserving federated learning framework. Method: Uses a Vision Transformer (ViT) as the encoder and GPT-2 as the report generator, and evaluates three federated aggregation strategies (FedAvg, Krum Aggregation, and L-FedAvg). Result: Krum Aggregation performs best on ROUGE, BLEU, BERTScore, and RaTEScore; federated models can match or surpass centralized models in generating clinically relevant, semantically rich reports. Conclusion: The lightweight, privacy-preserving framework enables collaborative medical AI development while protecting data confidentiality. Abstract: The automated generation of radiology reports from chest X-ray images holds significant promise in enhancing diagnostic workflows while preserving patient privacy. Traditional centralized approaches often require sensitive data transfer, posing privacy concerns. To address this, the study proposes a Multimodal Federated Learning framework for chest X-ray report generation using the IU-Xray dataset. The system utilizes a Vision Transformer (ViT) as the encoder and GPT-2 as the report generator, enabling decentralized training without sharing raw data. Three Federated Learning (FL) aggregation strategies were evaluated: FedAvg, Krum Aggregation, and a novel Loss-aware Federated Averaging (L-FedAvg). Among these, Krum Aggregation demonstrated superior performance across lexical and semantic evaluation metrics such as ROUGE, BLEU, BERTScore and RaTEScore. The results show that FL can match or surpass centralized models in generating clinically relevant and semantically rich radiology reports. This lightweight and privacy-preserving framework paves the way for collaborative medical AI development without compromising data confidentiality.
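
Two of the aggregation strategies named above, FedAvg and Krum, can be sketched on flattened client updates. This is a minimal illustration rather than the study's code: updates are plain lists of floats, `num_byzantine` (the assumed number of faulty clients) is a hypothetical parameter, and L-FedAvg is omitted because the abstract does not define it.

```python
def fedavg(updates, weights=None):
    """Weighted average of client updates (weights are typically sample counts)."""
    if weights is None:
        weights = [1.0] * len(updates)
    total = sum(weights)
    dim = len(updates[0])
    return [
        sum(w * u[d] for w, u in zip(weights, updates)) / total
        for d in range(dim)
    ]


def krum(updates, num_byzantine):
    """Return the single update closest to its nearest neighbors.

    Each update is scored by the sum of squared distances to its
    n - num_byzantine - 2 nearest other updates; the lowest score wins.
    """
    n = len(updates)
    k = n - num_byzantine - 2
    if k < 1:
        raise ValueError("Krum needs n - num_byzantine - 2 >= 1")
    best_index, best_score = 0, float("inf")
    for i, u in enumerate(updates):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(u, v))
            for j, v in enumerate(updates) if j != i
        )
        score = sum(dists[:k])
        if score < best_score:
            best_index, best_score = i, score
    return updates[best_index]
```

FedAvg blends every client, so a single extreme update shifts the result, whereas Krum selects one update and ignores outliers that sit far from the majority.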

### [306] [MAMBO-NET: Multi-Causal Aware Modeling Backdoor-Intervention Optimization for Medical Image Segmentation Network](https://arxiv.org/abs/2505.21874) *Ruiguo Yu,Yiyang Zhang,Yuan Tian,Yujie Diao,Di Jin,Witold Pedrycz* Main category: eess.IV TL;DR: The paper proposes MAMBO-NET, a multi-causal aware modeling backdoor-intervention optimization network that addresses the influence of confusion factors in medical image segmentation.

Details

Motivation: Existing medical image segmentation methods assume that the mapping from image to segmentation is unbiased and ignore confusion factors such as complex anatomical variations and imaging modality limitations, which leads to unsatisfactory results. Method: MAMBO-NET fits the confusion factors by self-modeling with multi-Gaussian distributions, introduces causal (backdoor) intervention into segmentation, and trains the distributions under posterior probability constraints. Result: Experiments on five medical image datasets show that the method significantly reduces the influence of confusion factors and improves segmentation accuracy. Conclusion: Through causal intervention and distribution modeling, MAMBO-NET effectively handles confusion factors in medical image segmentation and improves performance. Abstract: Medical image segmentation methods generally assume that the process from medical image to segmentation is unbiased, and use neural networks to establish conditional probability models to complete the segmentation task. This assumption does not consider confusion factors, which can affect medical images, such as complex anatomical variations and imaging modality limitations. Confusion factors obfuscate the relevance and causality of medical image segmentation, leading to unsatisfactory segmentation results. To address this issue, we propose a multi-causal aware modeling backdoor-intervention optimization (MAMBO-NET) network for medical image segmentation. Drawing insights from causal inference, MAMBO-NET utilizes self-modeling with multi-Gaussian distributions to fit the confusion factors and introduces causal intervention into the segmentation process. Moreover, we design appropriate posterior probability constraints to effectively train the distributions of confusion factors. For the distributions to effectively guide the segmentation and mitigate the impact of confusion factors on the segmentation, we introduce classical backdoor intervention techniques and analyze their feasibility in the segmentation task. To evaluate the effectiveness of our approach, we conducted extensive experiments on five medical image datasets. The results demonstrate that our method significantly reduces the influence of confusion factors, leading to enhanced segmentation accuracy.

### [307] [Subspecialty-Specific Foundation Model for Intelligent Gastrointestinal Pathology](https://arxiv.org/abs/2505.21928) *Lianghui Zhu,Xitong Ling,Minxi Ouyang,Xiaoping Liu,Mingxi Fu,Tian Guan,Fanglei Fu,Xuanyu Wang,Maomao Zeng,Mingxi Zhu,Yibo Jin,Liming Liu,Song Duan,Qiming He,Yizhi Wang,Luxi Xie,Houqiang Li,Yonghong He,Sufang Tian* Main category: eess.IV TL;DR: Digepath is a specialized foundation model for gastrointestinal pathology whose dual-phase iterative optimization strategy substantially improves diagnostic performance, especially in ambiguous cases.

Details

Motivation: Conventional histopathological diagnosis relies on subjective interpretation and suffers from limited reproducibility and diagnostic variability, so AI-driven precision pathology methods are needed. Method: Uses a dual-phase iterative optimization strategy (pretraining combined with fine-screening), trained on more than 353 million image patches from over 200,000 H&E-stained slides. Result: Achieves state-of-the-art performance on 33 of 34 tasks, including diagnosis and molecular prediction; the early cancer screening module reaches 99.6% sensitivity. Conclusion: Digepath fills critical gaps in histopathological practice and provides a transferable paradigm for other pathology subspecialties. Abstract: Gastrointestinal (GI) diseases represent a clinically significant burden, necessitating precise diagnostic approaches to optimize patient outcomes. Conventional histopathological diagnosis, heavily reliant on the subjective interpretation of pathologists, suffers from limited reproducibility and diagnostic variability. To overcome these limitations and address the lack of pathology-specific foundation models for GI diseases, we develop Digepath, a specialized foundation model for GI pathology. Our framework introduces a dual-phase iterative optimization strategy combining pretraining with fine-screening, specifically designed to address the detection of sparsely distributed lesion areas in whole-slide images. Digepath is pretrained on more than 353 million image patches from over 200,000 hematoxylin and eosin-stained slides of GI diseases. It attains state-of-the-art performance on 33 out of 34 tasks related to GI pathology, including pathological diagnosis, molecular prediction, gene mutation prediction, and prognosis evaluation, particularly in diagnostically ambiguous cases and resolution-agnostic tissue classification. We further translate the intelligent screening module for early GI cancer and achieve near-perfect 99.6% sensitivity across 9 independent medical institutions nationwide. The outstanding performance of Digepath highlights its potential to bridge critical gaps in histopathological practice. This work not only advances AI-driven precision pathology for GI diseases but also establishes a transferable paradigm for other pathology subspecialties.

### [308] [Risk-Sensitive Conformal Prediction for Catheter Placement Detection in Chest X-rays](https://arxiv.org/abs/2505.22496) *Long Hui* Main category: eess.IV TL;DR: Proposes a new approach that combines multi-task learning with risk-sensitive conformal prediction to detect catheter and line positions in chest X-rays, improving clinical reliability.

Details

Motivation: Clinical use demands highly reliable detection of catheter and line positions, in particular avoiding high-risk mispredictions on critical findings. Method: Uses multi-task learning to perform classification, segmentation, and landmark detection simultaneously, combined with risk-sensitive conformal prediction that yields prediction sets with statistical guarantees. Result: Experiments show 90.68% overall empirical coverage and 99.29% coverage for critical conditions, with zero high-risk mispredictions. Conclusion: The approach provides accurate predictions together with reliably quantified uncertainty, making it suitable for life-critical medical applications. Abstract: This paper presents a novel approach to catheter and line position detection in chest X-rays, combining multi-task learning with risk-sensitive conformal prediction to address critical clinical requirements. Our model simultaneously performs classification, segmentation, and landmark detection, leveraging the synergistic relationship between these tasks to improve overall performance. We further enhance clinical reliability through risk-sensitive conformal prediction, which provides statistically guaranteed prediction sets with higher reliability for clinically critical findings. Experimental results demonstrate excellent performance with 90.68% overall empirical coverage and 99.29% coverage for critical conditions, while maintaining remarkable precision in prediction sets. Most importantly, our risk-sensitive approach achieves zero high-risk mispredictions (cases where the system dangerously declares problematic tubes as confidently normal), making the system particularly suitable for clinical deployment. This work offers both accurate predictions and reliably quantified uncertainty -- essential features for life-critical medical applications.
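
One way to obtain higher coverage for critical findings is class-conditional split conformal prediction with a smaller miscoverage level for the critical classes. The sketch below shows that idea only; it is an assumption about how "risk-sensitive" could be realized, not the paper's procedure, and the function names, the `1 - p_true` nonconformity score, and the per-class `alphas` dictionary are all hypothetical.

```python
import numpy as np

def conformal_threshold(scores, alpha):
    """Finite-sample conformal quantile of calibration nonconformity scores."""
    ordered = np.sort(np.asarray(scores, dtype=float))
    n = len(ordered)
    rank = int(np.ceil((n + 1) * (1.0 - alpha)))
    if rank > n:
        # Too few calibration points for this alpha: include every label.
        return float("inf")
    return float(ordered[rank - 1])

def class_thresholds(cal_probs, cal_labels, alphas):
    """One threshold per class; smaller alpha means higher target coverage."""
    thresholds = {}
    for cls, alpha in alphas.items():
        mask = cal_labels == cls
        scores = 1.0 - cal_probs[mask, cls]  # nonconformity: 1 - p(true class)
        thresholds[cls] = conformal_threshold(scores, alpha)
    return thresholds

def prediction_set(probs, thresholds):
    """All classes whose nonconformity score falls within their threshold."""
    return [cls for cls, q in thresholds.items() if 1.0 - probs[cls] <= q]
```

For example, `alphas = {0: 0.10, 1: 0.01}` would target roughly 90% coverage for a "normal" class 0 and 99% for a critical class 1, at the cost of larger prediction sets whenever the model is unsure about the critical class.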

### [309] [Surf2CT: Cascaded 3D Flow Matching Models for Torso 3D CT Synthesis from Skin Surface](https://arxiv.org/abs/2505.22511) *Siyeop Yoon,Yujin Oh,Pengfei Jin,Sifan Song,Matthew Tivnan,Dufan Wu,Xiang Li,Quanzheng Li* Main category: eess.IV TL;DR: Surf2CT is a novel cascaded flow matching framework that generates full 3D CT volumes of the human torso from external surface scans and simple demographic data alone, without any internal imaging.

Details

Motivation: Conventional imaging techniques carry risks; Surf2CT aims to generate high-fidelity internal anatomy images from external data only, supporting home-based healthcare and preventive medicine. Method: Surf2CT has three stages, each based on a 3D flow matching model: surface completion (SDF reconstruction), coarse CT synthesis (low-resolution generation), and CT super-resolution (high-resolution refinement). Result: On 700 test cases, organ volume differences are small (-11.1% to 4.4%), muscle/fat composition correlates strongly with ground truth (0.67 to 0.96), and surface completion markedly improves metrics (Chamfer distance from 521.8 mm to 2.7 mm). Conclusion: Surf2CT opens a new paradigm for non-invasive internal anatomical imaging, with potential applications in home-based healthcare and personalized clinical assessment without the risks of conventional imaging. Abstract: We present Surf2CT, a novel cascaded flow matching framework that synthesizes full 3D computed tomography (CT) volumes of the human torso from external surface scans and simple demographic data (age, sex, height, weight). This is the first approach capable of generating realistic volumetric internal anatomy images solely based on external body shape and demographics, without any internal imaging. Surf2CT proceeds through three sequential stages: (1) Surface Completion, reconstructing a complete signed distance function (SDF) from partial torso scans using conditional 3D flow matching; (2) Coarse CT Synthesis, generating a low-resolution CT volume from the completed SDF and demographic information; and (3) CT Super-Resolution, refining the coarse volume into a high-resolution CT via a patch-wise conditional flow model. Each stage utilizes a 3D-adapted EDM2 backbone trained via flow matching. We trained our model on a combined dataset of 3,198 torso CT scans (approximately 1.13 million axial slices) sourced from Massachusetts General Hospital (MGH) and the AutoPET challenge. Evaluation on 700 paired torso surface-CT cases demonstrated strong anatomical fidelity: organ volumes exhibited small mean percentage differences (range from -11.1% to 4.4%), and muscle/fat body composition metrics matched ground truth with strong correlation (range from 0.67 to 0.96). Lung localization had minimal bias (mean difference -2.5 mm), and surface completion significantly improved metrics (Chamfer distance: from 521.8 mm to 2.7 mm; Intersection-over-Union: from 0.87 to 0.98). Surf2CT establishes a new paradigm for non-invasive internal anatomical imaging using only external data, opening opportunities for home-based healthcare, preventive medicine, and personalized clinical assessments without the risks associated with conventional imaging techniques.
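
The two surface-completion metrics quoted above can be sketched in a few lines of NumPy. Chamfer distance has several conventions in the literature (squared or unsquared distances, sum or mean of the two directions); the version below, a sum of the two directed mean nearest-neighbour distances, is one common choice and is only an assumption about what the paper reports. Function names are hypothetical.

```python
import numpy as np

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point clouds of shape (N, 3) and (M, 3).

    Sums the mean nearest-neighbour Euclidean distance in both directions.
    Uses a dense N x M distance matrix, so it suits small clouds only.
    """
    dists = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean() + dists.min(axis=0).mean())

def volume_iou(mask_a, mask_b):
    """Intersection-over-Union of two boolean occupancy masks of equal shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are treated as identical
    return float(np.logical_and(a, b).sum() / union)
```

With coordinates in millimetres, `chamfer_distance` is also in millimetres, which matches the units of the figures reported in the abstract.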

### [310] [Multipath cycleGAN for harmonization of paired and unpaired low-dose lung computed tomography reconstruction kernels](https://arxiv.org/abs/2505.22568) *Aravind R. Krishnan,Thomas Z. Li,Lucas W. Remedios,Michael E. Kim,Chenyu Gao,Gaurav Rudravaram,Elyssa M. McMaster,Adam M. Saunders,Shunxing Bao,Kaiwen Xu,Lianrui Zuo,Kim L. Sandler,Fabien Maldonado,Yuankai Huo,Bennett A. Landman* Main category: eess.IV TL;DR: Proposes a multipath cycleGAN for CT reconstruction kernel harmonization that reduces systematic variability in quantitative imaging, validated on emphysema quantification and anatomical consistency.

Details

Motivation: The choice of CT reconstruction kernel affects the consistency of quantitative imaging, emphysema quantification in particular, so a method is needed to harmonize images across kernels and reduce variability. Method: A multipath cycleGAN trained on a mixture of paired and unpaired data, with a shared latent space and a discriminator tailored to each domain; the model is trained on 42 kernel combinations from the NLST dataset. Result: The model significantly reduces emphysema quantification bias for paired kernels (p<0.05) and eliminates confounding differences for unpaired kernels (p>0.05). Dice scores for anatomical structures such as muscle and fat are high, and lung vessel overlap is reasonable. Conclusion: A multipath cycleGAN with a shared latent space harmonizes CT kernels effectively, improving emphysema quantification while preserving anatomical fidelity. Abstract: Reconstruction kernels in computed tomography (CT) affect spatial resolution and noise characteristics, introducing systematic variability in quantitative imaging measurements such as emphysema quantification. Choosing an appropriate kernel is therefore essential for consistent quantitative analysis. We propose a multipath cycleGAN model for CT kernel harmonization, trained on a mixture of paired and unpaired data from a low-dose lung cancer screening cohort. The model features domain-specific encoders and decoders with a shared latent space and uses discriminators tailored for each domain. We train the model on 42 kernel combinations using 100 scans each from seven representative kernels in the National Lung Screening Trial (NLST) dataset. To evaluate performance, 240 scans from each kernel are harmonized to a reference soft kernel, and emphysema is quantified before and after harmonization. A general linear model assesses the impact of age, sex, smoking status, and kernel on emphysema. We also evaluate harmonization from soft kernels to a reference hard kernel. To assess anatomical consistency, we compare segmentations of lung vessels, muscle, and subcutaneous adipose tissue generated by TotalSegmentator between harmonized and original images. Our model is benchmarked against traditional and switchable cycleGANs. For paired kernels, our approach reduces bias in emphysema scores, as seen in Bland-Altman plots (p<0.05). For unpaired kernels, harmonization eliminates confounding differences in emphysema (p>0.05). High Dice scores confirm preservation of muscle and fat anatomy, while lung vessel overlap remains reasonable. Overall, our shared latent space multipath cycleGAN enables robust harmonization across paired and unpaired CT kernels, improving emphysema quantification and preserving anatomical fidelity.

### [311] [Comparative Analysis of Machine Learning Models for Lung Cancer Mutation Detection and Staging Using 3D CT Scans](https://arxiv.org/abs/2505.22592) *Yiheng Li,Francisco Carrillo-Perez,Mohammed Alawad,Olivier Gevaert* Main category: eess.IV TL;DR: Compares two machine learning models for lung cancer mutation detection and staging; the supervised model performs better at mutation detection, while the self-supervised model performs better at staging.

Details

Motivation: Lung cancer is the leading cause of cancer mortality worldwide, and non-invasive methods for detecting key mutations and staging are essential for improving patient outcomes. Method: Compares FMCIB+XGBoost (a supervised model) and Dinov2+ABMIL (a self-supervised model) on 3D lung nodule data. Result: FMCIB+XGBoost performs better at KRAS and EGFR mutation detection, while Dinov2+ABMIL is stronger at cancer staging. Conclusion: Supervised models have greater clinical utility for mutation detection and self-supervised models show promise for staging, but mutation sensitivity still needs improvement. Abstract: Lung cancer is the leading cause of cancer mortality worldwide, and non-invasive methods for detecting key mutations and staging are essential for improving patient outcomes. Here, we compare the performance of two machine learning models - FMCIB+XGBoost, a supervised model with domain-specific pretraining, and Dinov2+ABMIL, a self-supervised model with attention-based multiple-instance learning - on 3D lung nodule data from the Stanford Radiogenomics and Lung-CT-PT-Dx cohorts. In the task of KRAS and EGFR mutation detection, FMCIB+XGBoost consistently outperformed Dinov2+ABMIL, achieving accuracies of 0.846 and 0.883 for KRAS and EGFR mutations, respectively. In cancer staging, Dinov2+ABMIL demonstrated competitive generalization, achieving an accuracy of 0.797 for T-stage prediction in the Lung-CT-PT-Dx cohort, suggesting SSL's adaptability across diverse datasets. Our results emphasize the clinical utility of supervised models in mutation detection and highlight the potential of SSL to improve staging generalization, while identifying areas for enhancement in mutation sensitivity.

### [312] [Chest Disease Detection In X-Ray Images Using Deep Learning Classification Method](https://arxiv.org/abs/2505.22609) *Alanna Hazlett,Naomi Ohashi,Timothy Rodriguez,Sodiq Adewole* Main category: eess.IV TL;DR: The study classifies chest X-rays with pre-trained CNN models via transfer learning, reports strong performance, and uses Grad-CAM to improve model interpretability.

Details

Motivation: To examine how multiple classification models perform at classifying chest X-rays into COVID-19, pneumonia, tuberculosis, and normal cases. Method: Applies transfer learning with pre-trained CNN models, fine-tunes them on labeled medical X-ray images, and uses Grad-CAM for model explanation. Result: Initial results show high accuracy and strong classification performance (precision, recall, and F1 score). Conclusion: The approach shows potential for clinical use, with visual explanations improving trust in and transparency of the model. Abstract: In this work, we investigate the performance of multiple classification models in classifying chest X-ray images into four categories: COVID-19, pneumonia, tuberculosis (TB), and normal cases. We leveraged transfer learning techniques with state-of-the-art pre-trained Convolutional Neural Network (CNN) models. We fine-tuned these pre-trained architectures on labeled medical X-ray images. The initial results are promising, with high accuracy and strong performance in key classification metrics such as precision, recall, and F1 score. We applied Gradient-weighted Class Activation Mapping (Grad-CAM) for model interpretability to provide visual explanations for classification decisions, improving trust and transparency in clinical applications.

# cs.IR [[Back]](#toc)

### [313] [Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking](https://arxiv.org/abs/2505.21815) *Yunyi Zhang,Ruozhen Yang,Siqi Jiao,SeongKu Kang,Jiawei Han* Main category: cs.IR TL;DR: SemRank combines LLM-guided query understanding with a concept-based semantic index and significantly improves the accuracy of scientific paper retrieval.

Details

Motivation: Existing dense retrieval methods struggle to capture fine-grained scientific concepts, while LLM-based methods lack grounding in corpus-specific knowledge and can produce unreliable results. Method: Proposes the SemRank framework, which uses an LLM to identify the core concepts of a query and indexes papers with multi-granular scientific concepts to enable precise semantic matching. Result: Experiments show that SemRank consistently improves the performance of various base retrievers, surpasses existing LLM-based baselines, and remains efficient. Conclusion: By combining an LLM with a concept index, SemRank effectively addresses fine-grained concept matching in scientific paper retrieval. Abstract: Scientific paper retrieval is essential for supporting literature discovery and research. While dense retrieval methods demonstrate effectiveness in general-purpose tasks, they often fail to capture fine-grained scientific concepts that are essential for accurate understanding of scientific queries. Recent studies also use large language models (LLMs) for query understanding; however, these methods often lack grounding in corpus-specific knowledge and may generate unreliable or unfaithful content. To overcome these limitations, we propose SemRank, an effective and efficient paper retrieval framework that combines LLM-guided query understanding with a concept-based semantic index. Each paper is indexed using multi-granular scientific concepts, including general research topics and detailed key phrases. At query time, an LLM identifies core concepts derived from the corpus to explicitly capture the query's information need. These identified concepts enable precise semantic matching, significantly enhancing retrieval accuracy. Experiments show that SemRank consistently improves the performance of various base retrievers, surpasses strong existing LLM-based baselines, and remains highly efficient.

# cs.HC [[Back]](#toc)

### [314] [UI-Evol: Automatic Knowledge Evolving for Computer Use Agents](https://arxiv.org/abs/2505.21964) *Ziyun Zhang,Xinyi Liu,Xiaoyi Zhang,Jun Wang,Gang Chen,Yan Lu* Main category: cs.HC TL;DR: The paper proposes UI-Evol, a module that evolves knowledge to close the gap between knowledge and execution in computer use agents, significantly improving task performance and agent reliability.

Details

Motivation: The analysis finds that even with 90% correct knowledge the execution success rate is only 41%, revealing a substantial gap between knowledge and execution. Method: UI-Evol has two stages: a Retrace Stage that extracts the actual action sequences, and a Critique Stage that refines knowledge by comparing those sequences against external references. Result: On the OSWorld benchmark, UI-Evol significantly improves task performance and lowers the standard deviation of agent behavior. Conclusion: UI-Evol effectively closes the knowledge-execution gap and improves the reliability and performance of computer use agents. Abstract: External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution. Our analysis shows even 90% correct knowledge yields only 41% execution success rate. To bridge this gap, we propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state-of-the-art Agent S2. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents, leading to superior performance on computer use tasks and substantially improved agent reliability.

### [315] [MapStory: LLM-Powered Text-Driven Map Animation Prototyping with Human-in-the-Loop Editing](https://arxiv.org/abs/2505.21966) *Aditya Gunturu,Ben Pearman,Keiichi Ihara,Morteza Faraji,Bryan Wang,Rubaiat Habib Kazi,Ryo Suzuki* Main category: cs.HC TL;DR: MapStory is an LLM-based map animation authoring tool that automatically generates editable animation sequences from natural language scripts.

Details

Motivation: To simplify the map animation authoring workflow, lower the technical barrier, and speed up creation. Method: Uses an agentic architecture to decompose a script into animation building blocks, combines an LLM with web search to obtain geospatial information, and provides an interactive timeline editor. Result: Users can create map animations with ease, iterate faster, and explore more creative options. Conclusion: MapStory lowers the barrier to creating map-centric stories and improves the authoring experience. Abstract: We introduce MapStory, an LLM-powered animation authoring tool that generates editable map animation sequences directly from natural language text. Given a user-written script, MapStory leverages an agentic architecture to automatically produce a scene breakdown, which decomposes the script into key animation building blocks such as camera movements, visual highlights, and animated elements. Our system includes a researcher component that accurately queries geospatial information by leveraging an LLM with web search, enabling the automatic extraction of relevant regions, paths, and coordinates while allowing users to edit and query for changes or additional information to refine the results. Additionally, users can fine-tune parameters of these blocks through an interactive timeline editor. We detail the system's design and architecture, informed by formative interviews with professional animators and an analysis of 200 existing map animation videos. Our evaluation, which includes expert interviews (N=5) and a usability study (N=12), demonstrates that MapStory enables users to create map animations with ease, facilitates faster iteration, encourages creative exploration, and lowers barriers to creating map-centric stories.

# stat.ML [[Back]](#toc)

### [316] [Higher-Order Group Synchronization](https://arxiv.org/abs/2505.21932) *Adriana L. Duncan,Joe Kileel* Main category: stat.ML TL;DR: The paper introduces a novel higher-order group synchronization problem that uses higher-order local measurements on a hypergraph to obtain global estimates, and provides the first computational framework for it together with theoretical guarantees.

Details

Motivation: Higher-order group synchronization is motivated by applications in computer vision and image processing, and targets higher-order measurements that standard pairwise group synchronization does not handle. Method: Defines the higher-order group synchronization problem and its mathematical foundations, proposes a global computational framework based on a message passing algorithm, and analyzes its convergence under noise and outliers. Result: Experiments show that in certain cases the method outperforms standard pairwise synchronization on rotational and angular synchronization, is more robust to outliers, and performs comparably to a standard package on simulated cryo-EM data. Conclusion: Higher-order group synchronization offers an effective tool for handling higher-order measurements, with broad potential applications. Abstract: Group synchronization is the problem of determining reliable global estimates from noisy local measurements on networks. The typical task for group synchronization is to assign elements of a group to the nodes of a graph in a way that respects group elements given on the edges which encode information about local pairwise relationships between the nodes. In this paper, we introduce a novel higher-order group synchronization problem which operates on a hypergraph and seeks to synchronize higher-order local measurements on the hyperedges to obtain global estimates on the nodes. Higher-order group synchronization is motivated by applications to computer vision and image processing, among other computational problems. First, we define the problem of higher-order group synchronization and discuss its mathematical foundations. Specifically, we give necessary and sufficient synchronizability conditions which establish the importance of cycle consistency in higher-order group synchronization. Then, we propose the first computational framework for general higher-order group synchronization; it acts globally and directly on higher-order measurements using a message passing algorithm. We discuss theoretical guarantees for our framework, including convergence analyses under outliers and noise. Finally, we show potential advantages of our method through numerical experiments. In particular, we show that in certain cases our higher-order method applied to rotational and angular synchronization outperforms standard pairwise synchronization methods and is more robust to outliers. We also show that our method has comparable performance on simulated cryo-electron microscopy (cryo-EM) data compared to a standard cryo-EM reconstruction package.
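
Cycle consistency, which the abstract singles out as central to synchronizability, is easiest to see in the classical pairwise angular case. The sketch below covers only that pairwise SO(2) setting, not the paper's higher-order formulation or its message passing algorithm: each edge measurement is assumed to be the angle difference `theta_i - theta_j`, so the measurements summed around any closed cycle must vanish modulo 2π when they are noise-free. The helper names and the dictionary layout are hypothetical.

```python
import numpy as np

def wrap(angle):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return float(np.angle(np.exp(1j * angle)))

def cycle_inconsistency(measurements, cycle):
    """Absolute wrapped sum of edge measurements around a closed cycle.

    measurements maps an ordered pair (i, j) to an estimate of theta_i - theta_j.
    cycle is a list of node indices; the last node connects back to the first.
    A value near zero means the cycle is consistent.
    """
    total = 0.0
    for i, j in zip(cycle, cycle[1:] + cycle[:1]):
        total += measurements[(i, j)]
    return abs(wrap(total))
```

A corrupted edge shows up directly: perturbing one measurement on a triangle by 0.5 rad raises that triangle's inconsistency to about 0.5, which is the kind of signal that outlier-robust synchronization methods exploit.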

# eess.AS [[Back]](#toc)

### [317] [VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining](https://arxiv.org/abs/2505.21527) *Jianheng Zhuo,Yifan Yang,Yiwen Shao,Yong Xu,Dong Yu,Kai Yu,Xie Chen* Main category: eess.AS TL;DR: VietASR is an automatic speech recognition (ASR) training pipeline for low-resource languages such as Vietnamese that combines large amounts of unlabeled data with a small labeled set and improves performance through self-supervised learning.

Details

Motivation: To address the scarcity of labeled data for low-resource ASR, and the shortcomings of existing systems in training cost, latency, and accessibility. Method: Uses multi-iteration ASR-biased self-supervised learning, pre-training on large-scale unlabeled data (70,000 hours) and fine-tuning on a small amount of labeled data (50 hours). Result: With a lightweight model, VietASR outperforms Whisper Large-v3 and commercial ASR systems. Conclusion: VietASR offers a cost-effective and practical solution for low-resource ASR; the code and models will be open-sourced. Abstract: Automatic speech recognition (ASR) has made remarkable progress but heavily relies on large-scale labeled data, which is scarce for low-resource languages like Vietnamese. While existing systems such as Whisper, USM, and MMS achieve promising performance, their efficacy remains inadequate in terms of training costs, latency, and accessibility. To address these issues, we propose VietASR, a novel ASR training pipeline that leverages vast amounts of unlabeled data and a small set of labeled data. Through multi-iteration ASR-biased self-supervised learning on a large-scale unlabeled dataset, VietASR offers a cost-effective and practical solution for enhancing ASR performance. Experiments demonstrate that pre-training on 70,000-hour unlabeled data and fine-tuning on merely 50-hour labeled data yield a lightweight but powerful ASR model. It outperforms Whisper Large-v3 and commercial ASR systems on real-world data. Our code and models will be open-sourced to facilitate research in low-resource ASR.

### [318] [Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition](https://arxiv.org/abs/2505.22251) *Yuan Tseng,Titouan Parcollet,Rogier van Dalen,Shucong Zhang,Sourav Bhattacharya* Main category: eess.AS TL;DR: The study finds that a substantial share of the LibriSpeech and Common Voice evaluation sets appears in LLM pretraining corpora, which calls into question findings based on these datasets. Experiments show that a contaminated LLM is more likely to generate sentences seen in training, and that speech recognizers assign higher probabilities to transcriptions seen in training.

Details

Motivation: To examine how reliable reported LLM gains on speech tasks are, in particular how data contamination affects evaluation results. Method: Compares LLMs trained with and without contamination on how readily they generate test sentences, and analyzes the error rates and probability assignments of the resulting speech recognizers. Result: Contaminated LLMs more readily generate sentences seen in training, and speech recognizers assign higher probabilities to transcriptions seen in training, although error rates differ only subtly. Conclusion: Even tiny amounts of data contamination can bias LLM outputs, which underlines the importance of evaluating LLM-based speech systems on held-out data. Abstract: Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure the impact of contamination, LLMs trained with or without contamination are compared, showing that a contaminated LLM is more likely to generate test sentences it has seen during training. Speech recognisers using contaminated LLMs show only subtle differences in error rates, but assign significantly higher probabilities to transcriptions seen during training. Results show that LLM outputs can be biased by tiny amounts of data contamination, highlighting the importance of evaluating LLM-based speech systems with held-out data.

# stat.CO [[Back]](#toc)

### [319] [tenSVD algorithm for compression](https://arxiv.org/abs/2505.21686) *Michele Gallo* Main category: stat.CO TL;DR: Proposes an efficient tensor-based image storage method that compresses data with the Tucker model to reduce the memory, bandwidth, and energy needed for storage, transmission, and processing.

Details

Motivation: Demand for managing high-dimensional data is growing and tensor analysis is widely used across domains, but storage, transmission, and energy costs remain a concern. Method: Organizes the original data into a higher-order tensor, compresses it with the Tucker model, implements the method in R, and compares it with a baseline algorithm. Result: The evaluation measures computational time and the quality of the information preserved on simulated and real datasets, with particular attention to energy consumption. Conclusion: The method is presented as an efficient and more sustainable approach to image storage, especially with respect to energy consumption. Abstract: Tensors provide a robust framework for managing high-dimensional data. Consequently, tensor analysis has emerged as an active research area in various domains, including machine learning, signal processing, computer vision, graph analysis, and data mining. This study introduces an efficient image storage approach utilizing tensors, aiming to minimize the memory needed for storage, the bandwidth needed for transmission, and the energy needed for processing. The proposed method organizes original data into a higher-order tensor and applies the Tucker model for compression. Implemented in R, this method is compared to a baseline algorithm. The evaluation focuses on the efficiency of the algorithm, measured in terms of computational time and the quality of information preserved, using both simulated and real datasets. A detailed analysis of the results is conducted, employing established quantitative metrics, with significant attention paid to sustainability in terms of energy consumption across algorithms.
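
To illustrate what Tucker compression means in practice, the sketch below computes a truncated higher-order SVD (HOSVD), a standard way to obtain a Tucker decomposition. It is written in Python with NumPy for illustration, whereas the paper's implementation is in R, and it is not the tenSVD algorithm itself, whose details the abstract does not give. All function names are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize a tensor: rows index the chosen mode, columns everything else."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply a tensor by a matrix along one mode (the n-mode product)."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def hosvd(tensor, ranks):
    """Truncated HOSVD: returns a small core tensor and one factor per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])  # leading left singular vectors of the unfolding
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project onto each factor basis
    return core, factors

def reconstruct(core, factors):
    """Expand a Tucker representation back to the full tensor."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out
```

The storage saving comes from keeping only the core and the factor matrices: for a 6 x 7 x 8 tensor compressed to ranks (2, 2, 2), that is 8 + 12 + 14 + 16 = 50 numbers instead of 336, with the reconstruction error depending on how much energy the discarded singular vectors carried.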

# physics.soc-ph [[Back]](#toc)

### [320] [Complexity counts: global and local perspectives on Indo-Aryan numeral systems](https://arxiv.org/abs/2505.21510) *Chundra Cathcart* Main category: physics.soc-ph TL;DR: The paper studies the highly non-transparent numeral systems of Indo-Aryan languages (such as Hindi, Gujarati, and Bengali) and explores their complexity and the linguistic and non-linguistic factors behind it.

Details

Motivation: In Indo-Aryan languages the forms for 1-99 are highly non-transparent and cannot be built with simple rules, unlike in most languages (e.g., English, Chinese). The paper seeks to understand the causes of this complexity and how unusual it is among the world's numeral systems. Method: Using data from multiple cross-linguistic databases, develops and applies several quantitative metrics of numeral system complexity, compares Indo-Aryan languages with other languages, and examines the influence of factors such as religion and geographic isolation. Result: Indo-Aryan numeral systems are decisively more complex than those of the world's languages as a whole, though they differ among themselves; despite this complexity they still follow general cross-linguistic pressures toward efficient communication. Conclusion: The paper calls for this overlooked dimension of complexity to be taken seriously in discussions of cross-linguistic variation in numeral systems. Abstract: The numeral systems of Indo-Aryan languages such as Hindi, Gujarati, and Bengali are highly unusual in that unlike most numeral systems (e.g., those of English, Chinese, etc.), forms referring to 1-99 are highly non-transparent and cannot be constructed using straightforward rules. As an example, Hindi/Urdu *ikyānve* '91' is not decomposable into the composite elements *ek* 'one' and *nave* 'ninety' in the way that its English counterpart is. This paper situates Indo-Aryan languages within the typology of cross-linguistic numeral systems, and explores the linguistic and non-linguistic factors that may be responsible for the persistence of complex systems in these languages. Using cross-linguistic data from multiple databases, we develop and employ a number of cross-linguistically applicable metrics to quantify the complexity of languages' numeral systems, and demonstrate that Indo-Aryan languages have decisively more complex numeral systems than the world's languages as a whole, though individual Indo-Aryan languages differ from each other in terms of the complexity of the patterns they display. We investigate the factors (e.g., religion, geographic isolation, etc.) that underlie complexity in numeral systems, with a focus on South Asia, in an attempt to develop an account of why complex numeral systems developed and persisted in certain Indo-Aryan languages but not elsewhere. Finally, we demonstrate that Indo-Aryan numeral systems adhere to certain general pressures toward efficient communication found cross-linguistically, despite their high complexity. We call for this somewhat overlooked dimension of complexity to be taken seriously when discussing general variation in cross-linguistic numeral systems.

### [321] [Fluent but Culturally Distant: Can Regional Training Teach Cultural Understanding?](https://arxiv.org/abs/2505.21548) *Dhruv Agarwal,Anya Shukla,Sunayana Sitaram,Aditya Vashistha* Main category: physics.soc-ph TL;DR: The study finds that regional large language models (LLMs) do not align more closely with local cultural values and practices than global models do.

Details

Motivation: To examine whether regional LLMs actually reflect local culture rather than only the local language. Method: Evaluates five Indic and five global LLMs along two dimensions, values and practices. Result: Indic models do not align with Indian cultural norms better than global models, and regional fine-tuning may harm cultural competence. Conclusion: More high-quality, culturally grounded data is needed to build LLMs that truly reflect local culture. Abstract: Large language models (LLMs) are used around the world but exhibit Western cultural tendencies. To address this cultural misalignment, many countries have begun developing "regional" LLMs tailored to local communities. Yet it remains unclear whether these models merely speak the language of their users or also reflect their cultural values and practices. Using India as a case study, we evaluate five Indic and five global LLMs along two key dimensions: values (via the Inglehart-Welzel map and GlobalOpinionQA) and practices (via CulturalBench and NormAd). Across all four tasks, we find that Indic models do not align more closely with Indian cultural norms than global models. In fact, an average American person is a better proxy for Indian cultural values than any Indic model. Even prompting strategies fail to meaningfully improve alignment. Ablations show that regional fine-tuning does not enhance cultural competence and may in fact hurt it by impeding recall of existing knowledge. We trace this failure to the scarcity of high-quality, untranslated, and culturally grounded pretraining and fine-tuning data. Our study positions cultural evaluation as a first-class requirement alongside multilingual benchmarks and offers a reusable methodology for developers. We call for deeper investments in culturally representative data to build and evaluate truly sovereign LLMs.

# cs.MM [[Back]](#toc)

### [322] [Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning](https://arxiv.org/abs/2505.22045) *Le Xu,Chenxing Li,Yong Ren,Yujie Chen,Yu Gu,Ruibo Fu,Shan Yang,Dong Yu* Main category: cs.MM TL;DR: Proposes an entropy-aware gated fusion framework that uses cross-modal uncertainty quantification to modulate the flow of visual information dynamically, addressing audiovisual mismatch.

Details

Motivation: Existing vision-guided audio captioning systems struggle with audiovisual misalignment in real-world scenarios such as dubbed content or off-screen sounds. Method: Uses attention entropy analysis to identify and suppress misleading visual cues, together with a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs. Result: Outperforms existing baselines on the AudioCaps benchmark, especially in mismatched modality scenarios, with an approximately 6x improvement in inference speed. Conclusion: The method improves model robustness and performance under audiovisual mismatch. Abstract: Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysis in cross-attention layers to automatically identify and suppress misleading visual cues during modal fusion. Complementing this architecture, we develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs, greatly enhancing model resilience against alignment noise. Evaluations on the AudioCaps benchmark demonstrate our system's superior performance over existing baselines, especially in mismatched modality scenarios. Furthermore, our solution demonstrates an approximately 6x improvement in inference speed compared to the baseline.
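
The idea of gating visual input by attention entropy can be sketched as follows. This is a speculative illustration of the general mechanism, not the paper's architecture: the gate formula (one minus the entropy normalized by its maximum), the additive fusion, and all names are assumptions made here. The intuition is that a cross-attention row that is nearly uniform over visual tokens has found nothing specific to attend to, so the visual context for that audio frame is down-weighted.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy of each attention row (rows are assumed to sum to 1)."""
    return -(attn * np.log(attn + eps)).sum(axis=-1)

def entropy_gate(attn):
    """Gate in [0, 1]: about 1 for a sharply peaked row, about 0 for a uniform row."""
    num_keys = attn.shape[-1]
    return 1.0 - attention_entropy(attn) / np.log(num_keys)

def gated_fusion(audio_feats, visual_context, attn):
    """Add visual context to audio features, scaled per query by the entropy gate.

    audio_feats, visual_context: arrays of shape (num_queries, dim)
    attn: cross-attention weights of shape (num_queries, num_visual_tokens)
    """
    gate = entropy_gate(attn)[:, None]
    return audio_feats + gate * visual_context
```

In a trained model the gate would more plausibly be a learned function of the entropy rather than this fixed linear map; the fixed form is used here only to keep the example self-contained.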

# cs.AI [[Back]](#toc)

### [323] [Capability-Based Scaling Laws for LLM Red-Teaming](https://arxiv.org/abs/2505.20162) *Alexander Panfilov,Paul Kassianik,Maksym Andriushchenko,Jonas Geiping* Main category: cs.AI TL;DR: The study examines how the capability gap affects red-teaming of large language models (LLMs), finds that attack success depends on the attacker-target capability gap, and derives a scaling law that predicts attack success.

Details

Motivation: As language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital, but traditional approaches may fail once targets surpass red-teamers in capability. Method: Evaluates more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers, and analyzes how the capability gap affects attack success. Result: Three trends emerge: more capable models are better attackers; attack success drops sharply once the target's capability exceeds the attacker's; and attack success rates correlate with performance on the social science splits of the MMLU-Pro benchmark. Conclusion: Fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks, and models' persuasive and manipulative abilities must be measured and controlled accurately. Abstract: As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.

### [324] [R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning](https://arxiv.org/abs/2505.21668)

*Yongchao Chen,Yueying Liu,Junwei Zhou,Yilun Hao,Jingquan Wang,Yang Zhang,Chuchu Fan*

Main category: cs.AI

TL;DR: R1-Code-Interpreter extends a text-only LLM, via multi-turn supervised fine-tuning and reinforcement learning, to autonomously generate code queries during step-by-step reasoning, substantially improving performance on complex tasks.

Details

Motivation: Addresses LLMs' weaknesses on tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, and the question of how to combine textual reasoning with code generation effectively.

Method: Trains models with multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL), studying different strategies (e.g., GRPO vs. PPO) and code output formats (masked vs. unmasked).

Result: The final model, R1-CI-14B, raises average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming text-only GPT-4o (58.6%) and approaching GPT-4o with Code Interpreter (70.9%).

Conclusion: Code Interpreter training is harder than prior narrow-domain RL because of high task diversity and expensive code execution; the SFT stage is critical; and the model exhibits emergent self-checking behavior via code generation.

Abstract: Despite advances in reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0\% to 64.1\%, outperforming GPT-4o (text-only: 58.6\%) and approaching GPT-4o with Code Interpreter (70.9\%), with the emergent self-checking behavior via code generation.
Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.

### [325] [Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation](https://arxiv.org/abs/2505.21784)

*Tharindu Kumarage,Ninareh Mehrabi,Anil Ramakrishna,Xinyan Zhao,Richard Zemel,Kai-Wei Chang,Aram Galstyan,Rahul Gupta,Charith Peris*

Main category: cs.AI

TL;DR: AIDSAFE generates high-quality safety reasoning data through iterative multi-agent deliberation, significantly improving LLM safety generalization and jailbreak robustness.

Details

Motivation: Existing safety measures suffer from over-refusal and jailbreak vulnerabilities, while safety reasoning requires high-quality policy-embedded chain-of-thought (CoT) datasets.

Method: AIDSAFE uses multi-agent deliberation to iteratively expand reasoning on safety policies, and a data refiner stage removes repetitive, redundant, and deceptive thoughts. A supplemental recipe uses belief augmentation to generate preference data.

Result: AIDSAFE-generated CoTs show superior policy adherence and reasoning quality, and fine-tuning on them significantly improves LLM safety generalization and jailbreak robustness.

Conclusion: AIDSAFE provides a data generation method for safety reasoning that supports both supervised fine-tuning and preference alignment training.

Abstract: Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need of preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE

### [326] [Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy](https://arxiv.org/abs/2505.21907)

*Saleh Afzoon,Zahra Jahanandish,Phuong Thao Huynh,Amin Beheshti,Usman Naseem*

Main category: cs.AI

TL;DR: This survey reviews research on preference optimization in AI copilots, proposes a phase-based taxonomy, and analyzes techniques for signal acquisition, intent modeling, and feedback integration.

Details

Motivation: As AI copilots grow in capability and adoption, personalization becomes key to usability, trust, and productivity, yet the related research remains fragmented and underexplored.

Method: Synthesizes existing research, proposes a phase-based taxonomy of preference optimization (pre-interaction, mid-interaction, and post-interaction), and analyzes the related techniques.

Result: Summarizes methods for acquiring preference signals, modeling user intent, and integrating feedback, providing a structured foundation for designing adaptive AI copilots.

Conclusion: The survey offers a holistic view of preference-aware AI copilot design and identifies which technical approaches suit each stage.

Abstract: AI copilots, context-aware, AI-powered systems designed to assist users in tasks such as software development and content creation, are becoming integral to modern workflows. As these systems grow in capability and adoption, personalization has emerged as a cornerstone for ensuring usability, trust, and productivity. Central to this personalization is preference optimization: the ability of AI copilots to detect, interpret, and align with individual user preferences. While personalization techniques are well-established in domains like recommender systems and dialogue agents, their adaptation to interactive, real-time systems like AI copilots remains fragmented and underexplored. This survey addresses this gap by synthesizing research on how user preferences are captured, modeled, and refined within the design of AI copilots. We introduce a unified definition of AI copilots and propose a phase-based taxonomy of preference optimization strategies, structured around pre-interaction, mid-interaction, and post-interaction stages. We analyze techniques for acquiring preference signals, modeling user intent, and integrating feedback loops, highlighting both established approaches and recent innovations. By bridging insights from AI personalization, human-AI collaboration, and large language model adaptation, this survey provides a structured foundation for designing adaptive, preference-aware AI copilots. It offers a holistic view of the available preference resources, how they can be leveraged, and which technical approaches are most suited to each stage of system design.

### [327] [Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling](https://arxiv.org/abs/2505.22290)

*Fanzeng Xia,Yidong Luo,Tinko Sebastian Bartels,Yaqi Xu,Tongxin Li*

Main category: cs.AI

TL;DR: By combining in-context search prompting with internal scaling, LLMs achieve breakthrough performance on complex reasoning tasks, challenging evaluation paradigms that underestimate their capabilities.

Details

Motivation: Existing work evaluates LLM reasoning with simple prompts, overlooking the gains available from advanced techniques and therefore underestimating LLMs' potential.

Method: Systematically explores LLM performance on super hard reasoning tasks using in-context search prompting combined with internal scaling.

Result: On controlled NP-hard tasks and real-world planning benchmarks, success rates improve by up to 30x over previously reported results; theoretically, the combination extends the complexity class of solvable reasoning problems.

Conclusion: The work calls for a reassessment of how LLM reasoning is benchmarked and for a more robust evaluation strategy that captures the models' actual capabilities.

Abstract: Recent research has highlighted that Large Language Models (LLMs), even when trained to generate extended long reasoning steps, still face significant challenges on hard reasoning problems. However, much of the existing literature relies on direct prompting with simple in-context learning examples for evaluation, which largely overlooks advanced techniques to elicit LLMs' deliberate reasoning before drawing conclusions that LLMs hit a performance ceiling. In this paper, we systematically explore the combined potential of in-context search and test-time scaling on super hard reasoning tasks. We find that by employing advanced in-context search prompting to LLMs augmented with internal scaling, one can achieve transformative performance breakthroughs on tasks previously deemed "unsolvable" (e.g., reported success rates below 5%). We provide both empirical results and theoretical analysis of how this combination can unleash LLM reasoning capabilities: i) Empirically, on controlled NP-hard tasks and complex real-world planning benchmarks, our approach achieves up to a 30x improvement in success rates compared to previously reported results without any external mechanisms; ii) Theoretically, we show that in-context search prompting, when combined with internal scaling, significantly extends the complexity class of solvable reasoning problems. These findings challenge prevailing assumptions about the limitations of LLMs on complex tasks, indicating that current evaluation paradigms systematically underestimate their true potential.
Our work calls for a critical reassessment of how LLM reasoning is benchmarked and a more robust evaluation strategy that fully captures the true capabilities of contemporary LLMs, which can lead to a better understanding of their operational reasoning boundaries in real-world deployments.

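### [328] [Efficiently Enhancing General Agents With Hierarchical-categorical Memory](https://arxiv.org/abs/2505.22006)

*Changze Qiao,Mingming Lu*

Main category: cs.AI

TL;DR: The paper proposes EHC, a general agent that learns without parameter updates, using a hierarchical memory retrieval module and a task-category oriented experience learning module to learn efficiently and adapt to new environments.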

Details

Motivation: Existing approaches either rely on computationally expensive end-to-end training or use tool-use methods that cannot learn continuously; EHC aims to address both problems.

Method: EHC comprises a Hierarchical Memory Retrieval (HMR) module for rapid retrieval of relevant memories and a Task-Category Oriented Experience Learning (TOEL) module that classifies experiences by task category.

Result: Experiments on multiple standard datasets show that EHC outperforms existing methods and achieves state-of-the-art performance.

Conclusion: EHC is an effective general agent for handling complex multi-modal tasks.

Abstract: With large language models (LLMs) demonstrating remarkable capabilities, there has been a surge in research on leveraging LLMs to build general-purpose multi-modal agents. However, existing approaches either rely on computationally expensive end-to-end training using large-scale multi-modal data or adopt tool-use methods that lack the ability to continuously learn and adapt to new environments. In this paper, we introduce EHC, a general agent capable of learning without parameter updates. EHC consists of a Hierarchical Memory Retrieval (HMR) module and a Task-Category Oriented Experience Learning (TOEL) module. The HMR module facilitates rapid retrieval of relevant memories and continuously stores new information without being constrained by memory capacity. The TOEL module enhances the agent's comprehension of various task characteristics by classifying experiences and extracting patterns across different categories. Extensive experiments conducted on multiple standard datasets demonstrate that EHC outperforms existing methods, achieving state-of-the-art performance and underscoring its effectiveness as a general agent for handling complex multi-modal tasks.

# cs.SD [[Back]](#toc)

### [329] [Visual Cues Support Robust Turn-taking Prediction in Noise](https://arxiv.org/abs/2505.22088)

*Sam O'Connor Russell,Naomi Harte*

Main category: cs.SD

TL;DR: Studies predictive turn-taking models (PTTMs) in noise, finding that noise sharply degrades performance but that a multimodal model can exploit visual cues to recover accuracy.

Details

Motivation: Explores how predictive turn-taking models perform in noise, with the aim of more naturalistic human-robot interaction.

Method: Analyzes PTTM performance under different noise types and trains a multimodal model (audio plus visual features) on noisy data.

Result: Hold/shift accuracy drops from 84% in clean speech to 52% in 10 dB music noise; the multimodal PTTM trained with noisy data reaches 72% in 10 dB music noise.

Conclusion: The multimodal PTTM outperforms the audio-only model in noise, but successful training relies on accurate transcription and the gains do not always generalise to new noise types.

Abstract: Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in types of noise likely to be encountered once deployed. Our analyses reveal PTTMs are highly sensitive to noise. Hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which includes visual features to better exploit visual cues, with 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make code publicly available for future research.
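
The "10 dB music noise" condition refers to a signal-to-noise ratio (SNR). As a rough illustration of how such a condition is commonly constructed, the sketch below scales a noise signal so that the speech-to-noise power ratio hits a target SNR before mixing. It is a generic recipe, not taken from the paper; the function names `mix_at_snr` and `measure_snr_db` are hypothetical, and both signals are assumed to be equal-length lists of samples.

```python
import math


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise to the target SNR (in dB) and add it to the speech.

    SNR_dB = 10 * log10(P_speech / P_noise), so the noise power needed is
    P_speech / 10^(SNR_dB / 10). Returns (mixture, scaled_noise).
    """
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise must not be silent")
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / p_noise)
    scaled_noise = [scale * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measure_snr_db(speech, noise):
    """SNR in dB between a speech signal and a noise signal."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))
```

At 10 dB the speech carries ten times the power of the noise, so the drop to 52% accuracy reported above occurs under a condition where speech is still clearly dominant.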

### [330] [Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis](https://arxiv.org/abs/2505.22231)

*Stefan Bleeck*

Main category: cs.SD

TL;DR: This paper presents an automatic speech recognition (ASR)-based, frequency-specific speech test for diagnosing more precisely how hearing loss affects speech understanding.

Details

Motivation: Traditional audiometry often fails to fully characterize the functional impact of hearing loss on speech understanding, particularly supra-threshold deficits and frequency-specific perception problems.

Method: Uses ASR to simulate the perceptual effects of moderate sloping hearing loss by processing speech stimuli under controlled acoustic degradation and analyzing phoneme-level confusion patterns.

Result: Simulated hearing loss produces specific phoneme confusions (e.g., substitutions of high-frequency consonants) and phoneme deletions; the resulting test battery differentiates simulated normal-hearing listeners from simulated hearing-impaired listeners.

Conclusion: The ASR-driven approach offers a route to objective, granular, frequency-specific hearing assessment tools; future work will validate the findings with human participants and explore integrating advanced AI models.

Abstract: Traditional audiometry often fails to fully characterize the functional impact of hearing loss on speech understanding, particularly supra-threshold deficits and frequency-specific perception challenges in conditions like presbycusis. This paper presents the development and simulated evaluation of a novel Automatic Speech Recognition (ASR)-based frequency-specific speech test designed to provide granular diagnostic insights. Our approach leverages ASR to simulate the perceptual effects of moderate sloping hearing loss by processing speech stimuli under controlled acoustic degradation and subsequently analyzing phoneme-level confusion patterns. Key findings indicate that simulated hearing loss introduces specific phoneme confusions, predominantly affecting high-frequency consonants (e.g., alveolar/palatal to labiodental substitutions) and leading to significant phoneme deletions, consistent with the acoustic cues degraded in presbycusis. A test battery curated from these ASR-derived confusions demonstrated diagnostic value, effectively differentiating between simulated normal-hearing and hearing-impaired listeners in a comprehensive simulation. This ASR-driven methodology offers a promising avenue for developing objective, granular, and frequency-specific hearing assessment tools that complement traditional audiometry. Future work will focus on validating these findings with human participants and exploring the integration of advanced AI models for enhanced diagnostic precision.
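
Phoneme-level confusion analysis of the kind described here is typically built on an edit-distance alignment between the reference phoneme sequence and the recognizer's output. The sketch below is one plausible way to count substitutions and deletions, assuming phonemes arrive as lists of string labels. The helper names `align` and `confusion_counts` and the example labels are hypothetical, and the paper's actual pipeline may differ.

```python
from collections import Counter


def align(ref, hyp):
    """Levenshtein-align two phoneme lists.

    Returns (ref_phone, hyp_phone) pairs; None marks a gap, so
    (p, None) is a deletion and (None, p) is an insertion.
    """
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        diag = i > 0 and j > 0 and (
            cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
        if diag:
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def confusion_counts(utterances):
    """Count substitutions and deletions over (reference, hypothesis) pairs."""
    subs, dels = Counter(), Counter()
    for ref, hyp in utterances:
        for r, h in align(ref, hyp):
            if r is not None and h is None:
                dels[r] += 1
            elif r is not None and h is not None and r != h:
                subs[(r, h)] += 1
    return subs, dels
```

For example, aligning the reference `["s", "ih", "t"]` against the output `["f", "ih"]` yields one alveolar-to-labiodental substitution (`s` to `f`) and one deletion (`t`), the two error types the abstract highlights.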

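### [331] [Effective Context in Neural Speech Models](https://arxiv.org/abs/2505.22487)

*Yen Meng,Sharon Goldwater,Hao Tang*

Main category: cs.SD

TL;DR: This paper proposes two approaches to measuring how much context speech Transformer models actually use (the effective context) and analyzes how context needs differ across tasks and models.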

Details

Motivation: Investigates how much context modern neural speech models actually use, in order to understand how context needs differ across tasks and models.

Method: Proposes two approaches to measuring effective context and applies them to supervised and self-supervised speech Transformers.

Result: For supervised models, effective context correlates with the nature of the task; for self-supervised models, it increases mainly in the early layers and remains relatively short. HuBERT can be run in streaming mode.

Conclusion: These speech models use relatively short context in practice, so streaming is possible without modifying the architecture or further fine-tuning.

Abstract: Modern neural speech models benefit from having longer context, and many approaches have been proposed to increase the maximum context a model can use. However, few have attempted to measure how much context these models actually use, i.e., the effective context. Here, we propose two approaches to measuring the effective context, and use them to analyze different speech Transformers. For supervised models, we find that the effective context correlates well with the nature of the task, with fundamental frequency tracking, phone classification, and word classification requiring increasing amounts of effective context. For self-supervised models, we find that effective context increases mainly in the early layers, and remains relatively short -- similar to the supervised phone model. Given that these models do not use a long context during prediction, we show that HuBERT can be run in streaming mode without modification to the architecture and without further fine-tuning.

### [332] [RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling](https://arxiv.org/abs/2505.22024)

*Long-Khanh Pham,Thanh V. T. Tran,Minh-Tan Pham,Van Nguyen*

Main category: cs.SD

TL;DR: RESOUND is a novel lip-to-speech (L2S) system that separates an acoustic path from a semantic path and combines speech units with mel-spectrograms to generate intelligible, expressive speech.

Details

Motivation: L2S synthesis struggles with accuracy and naturalness, mainly because of limited supervision for capturing linguistic content, accents, and prosody.

Method: Drawing on source-filter theory, RESOUND is split into an acoustic path (predicting prosody) and a semantic path (extracting linguistic features), and integrates speech units alongside mel-spectrograms for waveform generation.

Result: On two standard L2S benchmarks, RESOUND proves effective across various metrics.

Conclusion: Optimizing the acoustic and semantic representations independently improves L2S synthesis.

Abstract: Lip-to-speech (L2S) synthesis, which reconstructs speech from visual cues, faces challenges in accuracy and naturalness due to limited supervision in capturing linguistic content, accents, and prosody. In this paper, we propose RESOUND, a novel L2S system that generates intelligible and expressive speech from silent talking face videos. Leveraging source-filter theory, our method involves two components: an acoustic path to predict prosody and a semantic path to extract linguistic features. This separation simplifies learning, allowing independent optimization of each representation. Additionally, we enhance performance by integrating speech units, a proven unsupervised speech representation technique, into waveform generation alongside mel-spectrograms. This allows RESOUND to synthesize prosodic speech while preserving content and speaker identity. Experiments conducted on two standard L2S benchmarks confirm the effectiveness of the proposed method across various metrics.

# cs.CR [[Back]](#toc)

### [333] [Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models](https://arxiv.org/abs/2505.22271)

*Yongcan Yu,Yanbo Wang,Ran He,Jian Liang*

Main category: cs.CR

TL;DR: The paper proposes TIM, a universal defense framework that defends against diverse jailbreak attacks in an adaptive, self-evolving way.

Details

Motivation: Existing defenses are usually tailored to specific types of jailbreak attacks and cannot handle diverse adversarial strategies, so a more universal solution is needed.

Method: TIM trains a gist token for efficient detection and uses it to identify jailbreak activity during inference. It then performs safety fine-tuning on the detected jailbreak instructions paired with refusal answers, and decouples fine-tuning from the detection module so that parameter updates do not degrade the detector.

Result: Experiments show that TIM defends effectively on both LLMs and multimodal LLMs.

Conclusion: TIM offers an effective, universal defense against diverse jailbreak attacks.

Abstract: While (multimodal) large language models (LLMs) have attracted widespread attention due to their exceptional capabilities, they remain vulnerable to jailbreak attacks. Various defense methods are proposed to defend against jailbreak attacks, however, they are often tailored to specific types of jailbreak attacks, limiting their effectiveness against diverse adversarial strategies. For instance, rephrasing-based defenses are effective against text adversarial jailbreaks but fail to counteract image-based attacks. To overcome these limitations, we propose a universal defense framework, termed Test-time IMmunization (TIM), which can adaptively defend against various jailbreak attacks in a self-evolving way. Specifically, TIM initially trains a gist token for efficient detection, which it subsequently applies to detect jailbreak activities during inference. When jailbreak attempts are identified, TIM implements safety fine-tuning using the detected jailbreak instructions paired with refusal answers. Furthermore, to mitigate potential performance degradation in the detector caused by parameter updates during safety fine-tuning, we decouple the fine-tuning process from the detection module. Extensive experiments on both LLMs and multimodal LLMs demonstrate the efficacy of TIM.

### [334] [VideoMarkBench: Benchmarking Robustness of Video Watermarking](https://arxiv.org/abs/2505.21620)

*Zhengyuan Jiang,Moyang Guo,Kecen Li,Yuepeng Hu,Yupu Wang,Zhicong Huang,Cheng Hong,Neil Zhenqiang Gong*

Main category: cs.CR

TL;DR: The paper introduces VideoMarkBench, the first benchmark to systematically evaluate the robustness of video watermarks under removal and forgery attacks, and reveals significant vulnerabilities in current methods.

Details

Motivation: Rapid progress in video generative models has produced highly realistic synthetic videos, raising ethical concerns about disinformation and copyright infringement. Video watermarking has been proposed as a mitigation, but its robustness remains underexplored.

Method: Introduces the VideoMarkBench benchmark, built on a unified dataset generated by three video generative models across three video styles, and evaluates four watermarking methods and seven aggregation strategies under 12 types of perturbations.

Result: Current watermarking methods show significant vulnerabilities under removal and forgery attacks, and more robust solutions are urgently needed.

Conclusion: The paper underscores the urgency of developing more robust video watermarking and makes its code publicly available.

Abstract: The rapid development of video generative models has led to a surge in highly realistic synthetic videos, raising ethical concerns related to disinformation and copyright infringement. Recently, video watermarking has been proposed as a mitigation strategy by embedding invisible marks into AI-generated videos to enable subsequent detection. However, the robustness of existing video watermarking methods against both common and adversarial perturbations remains underexplored. In this work, we introduce VideoMarkBench, the first systematic benchmark designed to evaluate the robustness of video watermarks under watermark removal and watermark forgery attacks. Our study encompasses a unified dataset generated by three state-of-the-art video generative models, across three video styles, incorporating four watermarking methods and seven aggregation strategies used during detection. We comprehensively evaluate 12 types of perturbations under white-box, black-box, and no-box threat models. Our findings reveal significant vulnerabilities in current watermarking approaches and highlight the urgent need for more robust solutions. Our code is available at https://github.com/zhengyuan-jiang/VideoMarkBench.

Table of Contents

cs.CV [Back]

[1] Enhancing Vision Transformer Explainability Using Artificial Astrocytes

[2] Do DeepFake Attribution Models Generalize?

[3] CIM-NET: A Video Denoising Deep Neural Network Model Optimized for Computing-in-Memory Architectures

[4] Learning Shared Representations from Unpaired Data

[5] UniDB++: Fast Sampling of Unified Diffusion Bridge

[6] How Much Do Large Language Models Know about Human Motion? A Case Study in 3D Avatar Control

[7] EvidenceMoE: A Physics-Guided Mixture-of-Experts with Evidential Critics for Advancing Fluorescence Light Detection and Ranging in Scattering Media

[8] Self-Organizing Visual Prototypes for Non-Parametric Representation Learning

[9] Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement

[10] Caption This, Reason That: VLMs Caught in the Middle

[11] Equivariant Flow Matching for Point Cloud Assembly

[12] DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers

[13] Vision Meets Language: A RAG-Augmented YOLOv8 Framework for Coffee Disease Diagnosis and Farmer Assistance

[14] Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation

[15] Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing

[16] Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

[17] Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts

[18] Analytical Calculation of Weights Convolutional Neural Network

[19] A Novel Convolutional Neural Network-Based Framework for Complex Multiclass Brassica Seed Classification

[20] Knowledge Distillation Approach for SOS Fusion Staging: Towards Fully Automated Skeletal Maturity Assessment

[21] Multi-instance Learning as Downstream Task of Self-Supervised Learning-based Pre-trained Model

[22] Diffusion Model-based Activity Completion for AI Motion Capture from Videos

[23] EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models

[24] Thickness-aware E(3)-Equivariant 3D Mesh Neural Networks

[25] Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models

[26] Do you see what I see? An Ambiguous Optical Illusion Dataset exposing limitations of Explainable AI

[27] Any-to-Bokeh: One-Step Video Bokeh via Multi-Plane Image Guided Diffusion

[28] Object Concepts Emerge from Motion

[29] BaryIR: Learning Multi-Source Unified Representation in Continuous Barycenter Space for Generalizable All-in-One Image Restoration

[30] Geometric Feature Prompting of Image Segmentation Models

[31] QuARI: Query Adaptive Retrieval Improvement

[32] Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks

[33] Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation

[34] Scalable Segmentation for Ultra-High-Resolution Brain MR Images

[35] MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis

[36] OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

[37] Moment kernels: a simple and scalable approach for equivariance to rotations and reflections in deep convolutional networks

[38] What is Adversarial Training for Diffusion Models?

[39] Learning to See More: UAS-Guided Super-Resolution of Satellite Imagery for Precision Agriculture

[40] Visual Loop Closure Detection Through Deep Graph Consensus

[41] FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

[42] MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning

[43] Compositional Scene Understanding through Inverse Generative Modeling

[44] SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation

[45] ALTER: All-in-One Layer Pruning and Temporal Expert Routing for Efficient Diffusion Generation

[46] HDRSDR-VQA: A Subjective Video Quality Dataset for HDR and SDR Comparative Evaluation

[47] UniMoGen: Universal Motion Generation

[48] Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation

[49] RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

[50] FPAN: Mitigating Replication in Diffusion Models through the Fine-Grained Probabilistic Addition of Noise to Token Embeddings

[51] Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task

[52] Rethinking Gradient-based Adversarial Attacks on Point Cloud Classification

[53] Towards Scalable Language-Image Pre-training for 3D Medical Imaging

[54] GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning

[55] Cross-DINO: Cross the Deep MLP and Transformer for Small Object Detection

[56] EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

[57] Hyperspectral Gaussian Splatting

[58] Concentrate on Weakness: Mining Hard Prototypes for Few-Shot Medical Image Segmentation

[59] CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation

[60] Reference-Guided Identity Preserving Face Restoration

[61] AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment

[62] LiDARDustX: A LiDAR Dataset for Dusty Unstructured Road Environments

[63] BD Open LULC Map: High-resolution land use land cover mapping & benchmarking for urban development in Dhaka, Bangladesh

[64] InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective

[65] Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting

[66] UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios

[67] Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

[68] Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation

[69] One-Way Ticket:Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models

[70] A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding

[71] DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model

[72] Learning World Models for Interactive Video Generation

[73] D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples

[74] Event-based Egocentric Human Pose Estimation in Dynamic Environment

[75] Prototype Embedding Optimization for Human-Object Interaction Detection in Livestreaming

[76] PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-aware Mechanisms

[77] GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement

[78] Learnable Burst-Encodable Time-of-Flight Imaging for High-Fidelity Long-Distance Depth Sensing