cs.CV

[1] One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

Chenhao Zheng,Jieyu Zhang,Mohammadreza Salehi,Ziqi Gao,Vishnu Iyengar,Norimasa Kobori,Quan Kong,Ranjay Krishna

Main category: cs.CV

TL;DR: The paper proposes TrajViT, a video tokenization method based on object trajectories that substantially reduces redundant tokens while improving performance.

Details

Motivation: Existing video tokenization methods rely on space-time patches, which produces too many tokens and is computationally inefficient; the best token reduction strategies degrade performance and are ineffective under camera motion.

Method: Proposes grounded tokenization based on panoptic sub-object trajectories. TrajViT, trained with contrastive learning, extracts object trajectories and converts them into semantic tokens.

Result: TrajViT significantly outperforms ViT3D on multiple video understanding tasks, e.g., a 6% gain in top-5 recall on video-text retrieval with 10x fewer tokens.

Conclusion: TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, and it is robust and scalable.

Abstract: Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% in average top-5 recall on the video-text retrieval task with a 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average of 5.2% performance improvement across 6 VideoQA benchmarks while having 4x faster training and 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.

[2] Using Cross-Domain Detection Loss to Infer Multi-Scale Information for Improved Tiny Head Tracking

Jisu Kim,Alex Mattingly,Eung-Joo Lee,Benjamin S. Riggan

Main category: cs.CV

TL;DR: Proposes a framework that balances performance and efficiency to improve tiny head detection and tracking, using a cross-domain detection loss, a multi-scale module, and a small receptive field detection mechanism.

Details

Motivation: Current methods are computationally expensive, which increases latency and ties up resources; the balance between performance and efficiency needs to be optimized.

Method: Integrates a cross-domain detection loss, a multi-scale module, and a small receptive field detection mechanism.

Result: MOTA and mAP both improve on the CroHD and CrowdHuman datasets.

Conclusion: The framework effectively improves tiny head detection and tracking in crowded scenes.

Abstract: Head detection and tracking are essential for downstream tasks, but current methods often require large computational budgets, which increase latency and tie up resources (e.g., processors, memory, and bandwidth). To address this, we propose a framework to enhance tiny head detection and tracking by optimizing the balance between performance and efficiency. Our framework integrates (1) a cross-domain detection loss, (2) a multi-scale module, and (3) a small receptive field detection mechanism. These innovations enhance detection by bridging the gap between large and small detectors, capturing high-frequency details at multiple scales during training, and using filters with small receptive fields to detect tiny heads. Evaluations on the CroHD and CrowdHuman datasets show improved Multiple Object Tracking Accuracy (MOTA) and mean Average Precision (mAP), demonstrating the effectiveness of our approach in crowded scenes.

[3] Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision

Ziyue Kang,Weichuan Zhang

Main category: cs.CV

TL;DR: Proposes a hybrid deep-learning framework that combines an adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head to address data scarcity in rare animal image classification.

Details

Motivation: Rare animal image classification is hampered by data scarcity; many species have only a few labeled samples.

Method: An adaptive DCT module learns optimal frequency-band boundaries; ViT-B16 and ResNet50 extract global and local features, which are integrated by a cross-level fusion strategy and classified by a Bayesian linear classifier.

Result: On a self-built 50-class wildlife dataset, the method outperforms conventional CNN and fixed-band DCT approaches, achieving the best accuracy under sample scarcity.

Conclusion: The proposed adaptive frequency-domain selection mechanism and hybrid framework effectively improve rare animal image classification.

Abstract: A major challenge in rare animal image classification is the scarcity of data, as many species usually have only a small number of labeled samples. To address this challenge, we designed a hybrid deep-learning framework comprising a novel adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head. To our knowledge, we are the first to introduce an adaptive frequency-domain selection mechanism that learns optimal low-, mid-, and high-frequency boundaries suited to the subsequent backbones. Our network first captures image frequency-domain cues via this adaptive DCT partitioning. The adaptively filtered frequency features are then fed into ViT-B16 to model global contextual relationships, while ResNet50 concurrently extracts local, multi-scale spatial representations from the original image. A cross-level fusion strategy seamlessly integrates these frequency- and spatial-domain embeddings, and the fused features are passed through a Bayesian linear classifier to output the final category predictions. On our self-built 50-class wildlife dataset, this approach outperforms conventional CNN and fixed-band DCT pipelines, achieving state-of-the-art accuracy under extreme sample scarcity.
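The abstract for [3] does not say how the frequency boundaries are parameterized. One plausible reading, sketched here on a 1D signal for brevity, is to make the low/mid/high split differentiable with sigmoid gates so the two boundaries could be learned by gradient descent. The function names, the sigmoid gating, and the `sharpness` parameter are all assumptions for illustration, not the authors' design.

```python
import math

def dct2(signal):
    """Unnormalized 1D DCT-II of a list of floats."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]

def soft_band_masks(n, low_mid, mid_high, sharpness=10.0):
    """Soft (low, mid, high) weights for each of n >= 2 frequency bins.

    low_mid and mid_high are boundaries in [0, 1] with low_mid < mid_high;
    in a trainable model they would be learnable parameters (assumption).
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    masks = []
    for k in range(n):
        f = k / (n - 1)  # normalized frequency of bin k
        below_first = sigmoid(sharpness * (low_mid - f))
        below_second = sigmoid(sharpness * (mid_high - f))
        masks.append((below_first, below_second - below_first, 1.0 - below_second))
    return masks

def band_energies(signal, low_mid=0.3, mid_high=0.7):
    """Energy of the signal assigned to the low, mid, and high bands."""
    coeffs = dct2(signal)
    masks = soft_band_masks(len(coeffs), low_mid, mid_high)
    return tuple(
        sum(m[band] * c * c for m, c in zip(masks, coeffs)) for band in range(3)
    )

# A constant signal has all of its energy in the DC coefficient.
low, mid, high = band_energies([1.0, 1.0, 1.0, 1.0])
```

Because the three weights of every bin sum to one, the split only redistributes energy between bands and never discards it.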

[4] How Animals Dance (When You're Not Looking)

Xiaojuan Wang,Aleksander Holynski,Brian Curless,Ira Kemelmacher,Steve Seitz

Main category: cs.CV

TL;DR: Proposes a keyframe-based framework for generating music-synchronized, choreography-aware animal dance videos by optimizing the keyframe structure and using a video diffusion model.

Details

Motivation: Generate high-quality, music-synchronized animal dance videos from a small number of keyframes.

Method: Formulates dance synthesis as a graph optimization problem, combined with text-to-image generation and a video diffusion model.

Result: Produces diverse animal dance videos up to 30 seconds long from as few as six input keyframes.

Conclusion: The method is efficient and flexible, and applies to a wide range of animals and music genres.

Abstract: We present a keyframe-based framework for generating music-synchronized, choreography-aware animal dance videos. Starting from a few keyframes representing distinct animal poses -- generated via text-to-image prompting or GPT-4o -- we formulate dance synthesis as a graph optimization problem: find the optimal keyframe structure that satisfies a specified choreography pattern of beats, which can be automatically estimated from a reference dance video. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using a video diffusion model. With as few as six input keyframes, our method can produce up to 30-second dance videos across a wide range of animals and music tracks.

[5] HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

Qi Cai,Jingwen Chen,Yang Chen,Yehao Li,Fuchen Long,Yingwei Pan,Zhaofan Qiu,Yiheng Zhang,Fengbin Gao,Peihan Xu,Yimeng Wang,Kai Yu,Wenxuan Chen,Ziwei Feng,Zijian Gong,Jianzhuang Pan,Yi Peng,Rui Tian,Siyu Wang,Bo Zhao,Ting Yao,Tao Mei

Main category: cs.CV

TL;DR: HiDream-I1 is an open-source image generative foundation model with 17B parameters that achieves high-quality image generation at low latency through a sparse Diffusion Transformer (DiT) with a dynamic MoE architecture.

Details

Motivation: Address the trade-off between quality and computational efficiency in existing image generation models.

Method: Uses a dual-stream decoupled design followed by a single-stream sparse DiT structure, both with a dynamic MoE architecture, to support multi-modal interaction and efficient image generation.

Result: Achieves high-quality image generation and is provided in three variants to suit different needs.

Conclusion: HiDream-I1 and its derivatives (HiDream-E1 and HiDream-A1) provide comprehensive tools and resources for multi-modal AIGC research.

Abstract: Recent advancements in image generative foundation models have prioritized quality improvements but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is constructed with a new sparse Diffusion Transformer (DiT) structure. Specifically, it starts with a dual-stream decoupled design of sparse DiT with dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders are first involved to independently process image and text tokens. Then, a single-stream sparse DiT structure with dynamic MoE architecture is adopted to trigger multi-modal interaction for image generation in a cost-efficient manner. To support flexible accessibility with varied model capabilities, we provide HiDream-I1 in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Furthermore, we go beyond the typical text-to-image generation and remould HiDream-I1 with additional image conditions to perform precise, instruction-based editing on given images, yielding a new instruction-based image editing model, namely HiDream-E1. Ultimately, by integrating text-to-image generation and instruction-based image editing, HiDream-I1 evolves to form a comprehensive image agent (HiDream-A1) capable of fully interactive image creation and refinement. To accelerate multi-modal AIGC research, we have open-sourced all the code and model weights of HiDream-I1-Full, HiDream-I1-Dev, HiDream-I1-Fast, and HiDream-E1 through our project websites: https://github.com/HiDream-ai/HiDream-I1 and https://github.com/HiDream-ai/HiDream-E1. All features can be directly experienced via https://vivago.ai/studio.

[6] LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization

Ronghuan Wu,Wanchao Su,Jing Liao

Main category: cs.CV

TL;DR: LayerPeeler is a novel layer-wise image vectorization method that uses a progressive simplification strategy to address the shortcomings of existing tools on occluded regions.

Details

Motivation: Existing image vectorization tools handle occluded regions poorly, producing incomplete or fragmented vector graphics that hinder editability.

Method: LayerPeeler uses an autoregressive peeling strategy: vision-language models build a layer graph, and an image diffusion model with localized attention control precisely removes the identified layers.

Result: Experiments show that LayerPeeler significantly outperforms existing techniques in path semantics, geometric regularity, and visual fidelity.

Conclusion: LayerPeeler offers a high-quality image vectorization solution and a new approach to handling occluded regions.

Abstract: Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored rule-based and data-driven layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler's success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity.

[7] MIAS-SAM: Medical Image Anomaly Segmentation without thresholding

Marco Colussi,Dragan Ahmetovic,Sergio Mascetti

Main category: cs.CV

TL;DR: MIAS-SAM is a new method for segmenting anomalous regions in medical images; it stores features extracted by the SAM encoder in a patch-based memory bank and achieves accurate segmentation without thresholding.

Details

Motivation: Remove the need to set a threshold manually in medical image anomaly segmentation, improving accuracy and automation.

Method: Features of normal data are extracted with the SAM encoder and stored in a memory bank; at inference, embedding patches are compared with the memory bank to produce an anomaly map, whose center of gravity prompts the SAM decoder to produce the segmentation.

Result: Accurate anomaly segmentation, measured by DICE score, on three public datasets with different modalities (brain MRI, liver CT, and retina OCT).

Conclusion: MIAS-SAM achieves accurate anomaly segmentation without a threshold, applies to multiple medical imaging modalities, and has high practical value.

Abstract: This paper presents MIAS-SAM, a novel approach for the segmentation of anomalous regions in medical images. MIAS-SAM uses a patch-based memory bank to store relevant image features, which are extracted from normal data using the SAM encoder. At inference time, the embedding patches extracted from the SAM encoder are compared with those in the memory bank to obtain the anomaly map. Finally, MIAS-SAM computes the center of gravity of the anomaly map to prompt the SAM decoder, obtaining an accurate segmentation from the previously extracted features. Unlike prior work, MIAS-SAM does not require defining a threshold value to obtain the segmentation from the anomaly map. Experimental results on three publicly available datasets, each with a different imaging modality (Brain MRI, Liver CT, and Retina OCT), show accurate anomaly segmentation capabilities measured using DICE score. The code is available at: https://github.com/warpcut/MIAS-SAM
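The memory-bank comparison and center-of-gravity prompt described for [7] can be sketched in a few lines. This is a toy illustration under stated assumptions: the abstract does not name the distance metric, so nearest-neighbor Euclidean distance is assumed, the features are hand-made 2D tuples rather than SAM embeddings, and the SAM decoder call is left out entirely.

```python
import math

def anomaly_map(patch_embeddings, memory_bank):
    """Score each patch by its distance to the nearest memory-bank feature.

    Nearest-neighbor Euclidean distance is an assumption; the abstract only
    says that patches are "compared" with the memory bank.
    """
    return [
        [min(math.dist(patch, ref) for ref in memory_bank) for patch in row]
        for row in patch_embeddings
    ]

def center_of_gravity(scores):
    """Score-weighted mean (row, col) of a 2D anomaly map, or None if all zero."""
    total = sum(sum(row) for row in scores)
    if total == 0:
        return None  # no anomalous mass, so there is no point to prompt with
    r = sum(i * s for i, row in enumerate(scores) for s in row) / total
    c = sum(j * s for row in scores for j, s in enumerate(row)) / total
    return (r, c)

# Toy data: "normal" features sit near the origin on a 3x3 patch grid.
memory_bank = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
grid = [[(0.0, 0.0)] * 3 for _ in range(3)]
grid[2][1] = (3.0, 4.0)  # one clearly anomalous patch

scores = anomaly_map(grid, memory_bank)
prompt_point = center_of_gravity(scores)  # would be passed to the decoder as a point prompt
```

The point of the design is that a single weighted coordinate replaces the usual step of thresholding the anomaly map, so no cutoff value has to be chosen.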

[8] Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization

Yuxi Zhang,Yueting Li,Xinyu Du,Sibo Wang

Main category: cs.CV

TL;DR: The Rhet2Pix framework tackles image generation from rhetorical language through multi-step policy optimization and outperforms existing SOTA models.

Details

Motivation: Existing text-to-image models struggle with the implicit meaning of rhetorical language, so generated images skew toward literal depictions.

Method: Proposes Rhet2Pix, which uses a two-layer MDP diffusion module to incrementally elaborate sub-sentences and optimize image-generation actions.

Result: Experiments show that Rhet2Pix outperforms SOTA models such as GPT-4o on rhetorical text-to-image generation.

Conclusion: Rhet2Pix effectively addresses image generation from rhetorical language and offers a new direction for multimodal models.

Abstract: Generating images from rhetorical language remains a critical challenge for text-to-image models. Even state-of-the-art (SOTA) multimodal large language models (MLLM) fail to generate images based on the hidden meaning inherent in rhetorical language--despite such content being readily mappable to visual representations by humans. A key limitation is that current models emphasize object-level word embedding alignment, causing metaphorical expressions to steer image generation towards their literal visuals and overlook the intended semantic meaning. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem, incorporating a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes corresponding image-generation actions, constructing semantically richer visuals. In the inner layer, Rhet2Pix mitigates reward sparsity during image generation by discounting the final reward and optimizing every adjacent action pair along the diffusion denoising trajectory. Extensive experiments demonstrate the effectiveness of Rhet2Pix in rhetorical text-to-image generation. Our model outperforms SOTA MLLMs such as GPT-4o and Grok-3, as well as leading academic baselines, across both qualitative and quantitative evaluations. The code and dataset used in this work are publicly available.

[9] Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory

Srishti Yadav,Lauren Tilton,Maria Antoniak,Taylor Arnold,Jiaang Li,Siddhesh Milind Pawar,Antonia Karamolegkou,Stella Frank,Zhaochong An,Negar Rostamzadeh,Daniel Hershcovich,Serge Belongie,Ekaterina Shutova

Main category: cs.CV

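TL;DR: This paper examines the shortcomings of cultural competency evaluations of modern vision-language models (VLMs) and proposes a framework grounded in visual culture studies for systematically analyzing the cultural dimensions of images.

Details

Motivation: Because VLMs perform poorly across diverse applications, particularly in cultural competency, a systematic approach is needed for understanding and annotating cultural nuances in images.

Method: Reviews foundational methodologies from visual culture studies (cultural studies, semiotics, and visual studies) and proposes a set of five frameworks corresponding to cultural dimensions.

Result: A more comprehensive analytical framework for evaluating the cultural competency of VLMs.

Conclusion: Incorporating methodologies from visual culture studies enables more systematic analysis and improvement of VLMs' cultural competency.

Abstract: Modern vision-language models (VLMs) often fail at cultural competency evaluations and benchmarks. Given the diversity of applications built upon VLMs, there is renewed interest in understanding how they encode cultural nuances. While individual aspects of this problem have been studied, we still lack a comprehensive framework for systematically identifying and annotating the nuanced cultural dimensions present in images for VLMs. This position paper argues that foundational methodologies from visual culture studies (cultural studies, semiotics, and visual studies) are necessary for cultural analysis of images. Building upon this review, we propose a set of five frameworks, corresponding to cultural dimensions, that must be considered for a more complete analysis of the cultural competencies of VLMs.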

[10] Fast Trajectory-Independent Model-Based Reconstruction Algorithm for Multi-Dimensional Magnetic Particle Imaging

Vladyslav Gapyak,Thomas März,Andreas Weinmann

Main category: cs.CV

TL;DR: This paper presents a trajectory-independent model-based reconstruction algorithm for 2D Magnetic Particle Imaging (MPI) and develops a zero-shot Plug-and-Play (PnP) algorithm that requires no retraining on MPI data.

Details

Motivation: Traditional MPI reconstruction relies on time-consuming calibration or model-based simulation and is tied to specific scanning trajectories. The paper aims for a more flexible, general model-based reconstruction method.

Method: A trajectory-independent model-based reconstruction algorithm combined with a zero-shot PnP algorithm that uses a denoiser trained on natural images to solve the deconvolution problem.

Result: The reconstruction capability is validated on a public dataset and on custom data, showing strong adaptability across scanning scenarios.

Conclusion: The method lays the groundwork for general-purpose, flexible model-based MPI reconstruction.

Abstract: Magnetic Particle Imaging (MPI) is a promising tomographic technique for visualizing the spatio-temporal distribution of superparamagnetic nanoparticles, with applications ranging from cancer detection to real-time cardiovascular monitoring. Traditional MPI reconstruction relies on either time-consuming calibration (measured system matrix) or model-based simulation of the forward operator. Recent developments have shown the applicability of Chebyshev polynomials to multi-dimensional Lissajous Field-Free Point (FFP) scans. This method is bound to the particular choice of sinusoidal scanning trajectories. In this paper, we present the first reconstruction on real 2D MPI data with a trajectory-independent model-based MPI reconstruction algorithm. We further develop the zero-shot Plug-and-Play (PnP) algorithm of the authors -- with automatic noise level estimation -- to address the present deconvolution problem, leveraging a state-of-the-art denoiser trained on natural images without retraining on MPI-specific data. We evaluate our method on the publicly available 2D FFP MPI dataset "MPIdata: Equilibrium Model with Anisotropy", featuring scans of six phantoms acquired using a Bruker preclinical scanner. Moreover, we show reconstruction performed on custom data on a 2D scanner with additional high-frequency excitation field and partial data. Our results demonstrate strong reconstruction capabilities across different scanning scenarios -- setting a precedent for general-purpose, flexible model-based MPI reconstruction.

[11] VidText: Towards Comprehensive Evaluation for Video Text Understanding

Zhoufaran Yang,Yan Shu,Zhifei Yang,Yan Zhang,Yu Li,Keyang Lu,Gangyan Zeng,Shaohui Liu,Yu Zhou,Nicu Sebe

Main category: cs.CV

TL;DR: VidText is a new benchmark for video text understanding that fills gaps in existing video understanding and OCR benchmarks and supports multilingual, multi-level task evaluation.

Details

Motivation: Existing video understanding benchmarks overlook textual information, and OCR benchmarks are limited to static images, so neither captures the interaction between text and dynamic visual context.

Method: Proposes the VidText benchmark, which covers diverse scenarios, supports multiple languages, provides video-level, clip-level, and instance-level tasks, and introduces paired perception and reasoning tasks.

Result: Experiments show that current large multimodal models struggle on most tasks, leaving room for improvement.

Conclusion: VidText fills a gap in video understanding benchmarks and lays a foundation for future research on multimodal reasoning in dynamic environments.

Abstract: Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.

[12] IMTS is Worth Time × Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction

Zhangyi Hu,Jiemin Wu,Hua Xu,Mingqian Liao,Ninghui Feng,Bo Gao,Songning Lai,Yutao Yue

Main category: cs.CV

TL;DR: VIMTS is a visual-MAE-based framework for Irregular Multivariate Time Series (IMTS) forecasting; by complementing missing data through cross-channel dependencies and incorporating self-supervised learning, it significantly improves forecasting performance.

Details

Motivation: IMTS forecasting is challenging because multi-channel signals are unaligned and much of the data is missing. Existing methods struggle to extract reliable temporal patterns from such data, and pre-trained foundation models are typically designed only for regularly sampled time series.

Method: VIMTS divides IMTS along the timeline into feature patches at equal intervals, complements missing data using cross-channel dependencies, applies a visual MAE to the sparse multi-channel data, and generates predictions with a coarse-to-fine technique.

Result: Experiments show that VIMTS performs strongly on IMTS forecasting and has few-shot capability, extending the use of visual foundation models to time series tasks.

Conclusion: By combining a visual MAE with self-supervised learning, VIMTS effectively addresses missing data in IMTS forecasting and offers a new direction for broader time series tasks.

Abstract: Irregular Multivariate Time Series (IMTS) forecasting is challenging due to the unaligned nature of multi-channel signals and the prevalence of extensive missing data. Existing methods struggle to capture reliable temporal patterns from such data due to significant missing values. While pre-trained foundation models show potential for addressing these challenges, they are typically designed for Regularly Sampled Time Series (RTS). Motivated by the visual Masked Autoencoder's (MAE) powerful capability for modeling sparse multi-channel information and its success in RTS forecasting, we propose VIMTS, a framework adapting Visual MAE for IMTS forecasting. To mitigate the effect of missing values, VIMTS first processes IMTS along the timeline into feature patches at equal intervals. These patches are then complemented using learned cross-channel dependencies. Then it leverages visual MAE's capability in handling sparse multichannel data for patch reconstruction, followed by a coarse-to-fine technique to generate precise predictions from focused contexts. In addition, we integrate self-supervised learning for improved IMTS modeling by adapting the visual MAE to IMTS data. Extensive experiments demonstrate VIMTS's superior performance and few-shot capability, advancing the application of visual foundation models in more general time series tasks. Our code is available at https://github.com/WHU-HZY/VIMTS.
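The first step described for [12], cutting an irregular series into equal-interval time × channel patches, can be illustrated as follows. This is a minimal sketch, not the VIMTS implementation: the `patchify` name, the `(time, channel, value)` input format, and the use of a per-patch mean as the patch feature are assumptions, and the learned cross-channel completion and MAE stages are not shown.

```python
def patchify(observations, num_channels, t_start, t_end, num_patches):
    """Bin irregular observations into a channel x patch grid.

    observations: iterable of (time, channel, value) with t_start <= time <= t_end.
    Returns (means, mask): means[c][p] is the mean value observed on channel c
    in patch p (0.0 if none), and mask[c][p] says whether anything was observed.
    Mean pooling is an illustrative choice, not taken from the paper.
    """
    width = (t_end - t_start) / num_patches
    sums = [[0.0] * num_patches for _ in range(num_channels)]
    counts = [[0] * num_patches for _ in range(num_channels)]
    for t, ch, value in observations:
        # Clamp so that an observation exactly at t_end falls in the last patch.
        idx = min(int((t - t_start) / width), num_patches - 1)
        sums[ch][idx] += value
        counts[ch][idx] += 1
    means = [
        [s / c if c else 0.0 for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(sums, counts)
    ]
    mask = [[c > 0 for c in c_row] for c_row in counts]
    return means, mask

# Two channels sampled at unaligned times over the window [0, 4).
obs = [(0.5, 0, 1.0), (0.7, 0, 3.0), (2.2, 1, 5.0), (3.9, 0, 2.0)]
means, mask = patchify(obs, num_channels=2, t_start=0.0, t_end=4.0, num_patches=4)
```

The `mask` grid marks which cells are empty, which is the kind of sparsity pattern a masked autoencoder is built to reconstruct.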

[13] Improving Contrastive Learning for Referring Expression Counting

Kostas Triaridis,Panagiotis Kaliosis,E-Ro Nguyen,Jingyi Xu,Hieu Le,Dimitris Samaras

Main category: cs.CV

TL;DR: The paper proposes C-REX, a contrastive learning framework for Referring Expression Counting (REC) that strengthens discriminative representation learning and significantly improves performance.

Details

Motivation: Existing methods struggle to distinguish visually similar objects that correspond to different referring expressions, which calls for a more stable contrastive learning framework.

Method: C-REX uses supervised contrastive learning and operates entirely within the image space, avoiding the misalignment issues of image-text contrastive learning and providing a larger pool of negative samples.

Result: C-REX performs strongly on REC, improving MAE by more than 22% and RMSE by more than 10%, and also performs well on class-agnostic counting.

Conclusion: By improving the contrastive learning framework, C-REX significantly improves REC performance and demonstrates its generality.

Abstract: Object counting has progressed from class-specific models, which count only known categories, to class-agnostic models that generalize to unseen categories. The next challenge is Referring Expression Counting (REC), where the goal is to count objects based on fine-grained attributes and contextual differences. Existing methods struggle with distinguishing visually similar objects that belong to the same category but correspond to different referring expressions. To address this, we propose C-REX, a novel contrastive learning framework, based on supervised contrastive learning, designed to enhance discriminative representation learning. Unlike prior works, C-REX operates entirely within the image space, avoiding the misalignment issues of image-text contrastive learning, thus providing a more stable contrastive signal. It also guarantees a significantly larger pool of negative samples, leading to improved robustness in the learned representations. Moreover, we show that our framework is versatile and generic enough to be applied to other similar tasks like class-agnostic counting. To support our approach, we analyze the key components of SOTA detection-based models and identify that detecting object centroids instead of bounding boxes is the key common factor behind their success in counting tasks. We use this insight to design a simple yet effective detection-based baseline to build upon. Our experiments show that C-REX achieves state-of-the-art results in REC, outperforming previous methods by more than 22% in MAE and more than 10% in RMSE, while also demonstrating strong performance in class-agnostic counting. Code is available at https://github.com/cvlab-stonybrook/c-rex.
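Entry [13] builds on supervised contrastive learning. As a reference point, the generic supervised contrastive (SupCon) loss can be written out as below. This is the textbook formulation, not the C-REX loss itself: how C-REX samples image-space features and assigns them to referring expressions is not described in the abstract, so the embeddings and labels here are toy stand-ins.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch.

    For each anchor, every other sample with the same label is a positive and
    every remaining sample is a negative. Anchors with no positive are skipped.
    """
    z = [normalize(v) for v in embeddings]
    losses = []
    for i in range(len(z)):
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(len(z))
            if a != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        positives = [p for p in sims if labels[p] == labels[i]]
        if positives:
            log_probs = [sims[p] - log_denominator for p in positives]
            losses.append(-sum(log_probs) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

# Two tight clusters: labels that match the clusters give a much lower loss.
features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
loss_matching = supcon_loss(features, [0, 0, 1, 1])
loss_mismatched = supcon_loss(features, [0, 1, 0, 1])
```

Every non-positive sample in the batch appears in the denominator, which is why a larger pool of negatives, as the abstract emphasizes, sharpens the contrastive signal.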

[14] CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting

Kornel Howil,Joanna Waczyńska,Piotr Borycki,Tadeusz Dziarmaga,Marcin Mazur,Przemysław Spurek

Main category: cs.CV

TL;DR: CLIPGaussians is the first multimodal style transfer framework supporting text- and image-guided stylization; it applies to 2D images, videos, 3D objects, and 4D scenes without requiring large generative models or retraining from scratch.

Details

Motivation: Complex style transfer is difficult to achieve with current Gaussian Splatting (GS) representations, especially for 3D and 4D content.

Method: Operates directly on GS primitives and integrates into existing GS pipelines as a plug-in module, jointly optimizing color and geometry.

Result: Achieves temporal coherence in videos, preserves model size, and shows superior style fidelity and consistency across all tasks.

Conclusion: CLIPGaussians is a universal and efficient solution for multimodal style transfer.

Abstract: Gaussian Splatting (GS) has recently emerged as an efficient representation for rendering 3D scenes from 2D images and has been extended to images, videos, and dynamic 4D content. However, applying style transfer to GS-based representations, especially beyond simple color changes, remains challenging. In this work, we introduce CLIPGaussians, the first unified style transfer framework that supports text- and image-guided stylization across multiple modalities: 2D images, videos, 3D objects, and 4D scenes. Our method operates directly on Gaussian primitives and integrates into existing GS pipelines as a plug-in module, without requiring large generative models or retraining from scratch. The CLIPGaussians approach enables joint optimization of color and geometry in 3D and 4D settings, and achieves temporal coherence in videos, while preserving model size. We demonstrate superior style fidelity and consistency across all tasks, validating CLIPGaussians as a universal and efficient solution for multimodal style transfer.

[15] A Probabilistic Jump-Diffusion Framework for Open-World Egocentric Activity Recognition

Sanjoy Kundu,Shanmukha Vellamcheti,Sathyanarayanan N. Aakur

Main category: cs.CV

TL;DR: ProbRes is a Probabilistic Residual search framework based on jump-diffusion for open-world egocentric activity recognition; it navigates the search space efficiently by balancing prior-guided exploration with likelihood-driven exploitation.

Details

Motivation: Open-world egocentric activity recognition is challenging because its unconstrained nature requires models to infer unseen activities from an expansive, partially observed search space.

Method: ProbRes integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to locate high-likelihood activity labels efficiently.

Result: A systematic evaluation across openness levels (L0--L3) shows that ProbRes adapts to search space complexity and reaches state-of-the-art performance on multiple benchmark datasets.

Conclusion: Beyond improving open-world activity recognition, ProbRes establishes a clear taxonomy that provides a methodological foundation for egocentric activity understanding.

Abstract: Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to efficiently locate high-likelihood activity labels while minimizing exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0--L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding.

[16] 4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians

Hidenobu Matsuki,Gwangbin Bae,Andrew J. Davison

Main category: cs.CV

TL;DR: Proposes the first 4D tracking and mapping method based on differentiable rendering, jointly optimizing camera localization and non-rigid surface reconstruction, and introduces a new dataset to support research.

Details

Motivation: Address the challenges that the high-dimensional optimization space and non-rigid motion pose for 4D-SLAM, and fill the gap left by existing methods in dynamic scene reconstruction.

Method: Uses Gaussian surface primitives to exploit depth signals, combined with an MLP warp-field and a novel camera pose estimation technique, to achieve spatio-temporal reconstruction.

Result: Achieves accurate surface reconstruction and non-rigid motion modeling, and provides an open synthetic dataset for evaluation.

Conclusion: Provides a new method and evaluation protocols for 4D-SLAM research, advancing dynamic scene reconstruction.

Abstract: We propose the first 4D tracking and mapping method that jointly performs camera localization and non-rigid surface reconstruction via differentiable rendering. Our approach captures 4D scenes from an online stream of color images with depth measurements or predictions by jointly optimizing scene geometry, appearance, dynamics, and camera ego-motion. Although natural environments exhibit complex non-rigid motions, 4D-SLAM remains relatively underexplored due to its inherent challenges; even with 2.5D signals, the problem is ill-posed because of the high dimensionality of the optimization space. To overcome these challenges, we first introduce a SLAM method based on Gaussian surface primitives that leverages depth signals more effectively than 3D Gaussians, thereby achieving accurate surface reconstruction. To further model non-rigid deformations, we employ a warp-field represented by a multi-layer perceptron (MLP) and introduce a novel camera pose estimation technique along with surface regularization terms that facilitate spatio-temporal reconstruction. In addition to these algorithmic challenges, a significant hurdle in 4D-SLAM research is the lack of reliable ground truth and evaluation protocols, primarily due to the difficulty of 4D capture using commodity sensors. To address this, we present a novel open synthetic dataset of everyday objects with diverse motions, leveraging large-scale object models and animation modeling. In summary, we open up modern 4D-SLAM research by introducing a novel method and evaluation protocols grounded in modern vision and rendering techniques.

[17] CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

Junbo Yin,Chao Zha,Wenjia He,Chencheng Xu,Xin Gao

Main category: cs.CV

TL;DR: CFP-Gen is a novel diffusion language model for combinatorial functional protein generation that integrates multimodal conditional constraints, addressing the difficulty existing PLMs have in satisfying constraints from multiple modalities simultaneously.

Details

Motivation: Existing PLMs generate protein sequences from a single-modality condition only and struggle to satisfy multimodal constraints simultaneously, which limits the flexibility and functionality of protein design.

Method: CFP-Gen uses an Annotation-Guided Feature Modulation (AGFM) module to dynamically adjust the protein feature distribution, a Residue-Controlled Functional Encoding (RCFE) module to capture residue-wise interactions, and integrated 3D structure encoders to impose geometric constraints.

Result: CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, with a high success rate in designing multifunctional proteins.

Conclusion: CFP-Gen provides an efficient way to integrate multimodal constraints in protein design and has broad application potential.

Abstract: Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data are available at https://github.com/yinjunbo/cfpgen.

[18] 3DGS Compression with Sparsity-guided Hierarchical Transform Coding

Hao Xu,Xiaolin Wu,Xi Zhang

Main category: cs.CV

TL;DR: SHTC是一种端到端优化的变换编码框架，用于3D高斯泼溅（3DGS）压缩，通过联合优化3DGS、变换和轻量级上下文模型，显著提升了率失真性能。

Details

Motivation: 3DGS is popular for its fast, high-quality rendering, but its large memory footprint incurs high transmission and storage overhead. Existing neural compression methods do not adopt end-to-end optimized analysis-synthesis transforms, which leads to suboptimal performance. Method: The SHTC framework consists of a base layer (KLT for data decorrelation) and an enhancement layer (sparse coding that compresses the KLT residuals); the enhancement encoder learns a linear transform, and the decoder unfolds the ISTA algorithm to reconstruct the residuals. Result: SHTC significantly improves rate-distortion performance with minimal parameter and computational overhead. Conclusion: SHTC is the first end-to-end optimized transform coding framework for 3DGS compression; its interpretable design incorporates signal priors to achieve efficient compression. Abstract: 3D Gaussian Splatting (3DGS) has gained popularity for its fast and high-quality rendering, but it has a very large memory footprint, incurring high transmission and storage overhead. Recently, some neural compression methods, such as Scaffold-GS, were proposed for 3DGS, but they did not adopt the approach of end-to-end optimized analysis-synthesis transforms, which has been proven highly effective in neural signal compression. Without an appropriate analysis transform, signal correlations cannot be removed by sparse representation. Without such transforms, the only way to remove signal redundancies is through entropy coding driven by complex and expensive context modeling, which results in slower speed and suboptimal rate-distortion (R-D) performance. To overcome this weakness, we propose Sparsity-guided Hierarchical Transform Coding (SHTC), the first end-to-end optimized transform coding framework for 3DGS compression. SHTC jointly optimizes the 3DGS, transforms and a lightweight context model. This joint optimization enables the transform to produce representations that approach the best R-D performance possible. The SHTC framework consists of a base layer using KLT for data decorrelation, and a sparsity-coded enhancement layer that compresses the KLT residuals to refine the representation. The enhancement encoder learns a linear transform to project high-dimensional inputs into a low-dimensional space, while the decoder unfolds the Iterative Shrinkage-Thresholding Algorithm (ISTA) to reconstruct the residuals. All components are designed to be interpretable, allowing the incorporation of signal priors and fewer parameters than black-box transforms. This novel design significantly improves R-D performance with minimal additional parameters and computational overhead.
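
The abstract above says the enhancement decoder unfolds ISTA. As a rough illustration only, the sketch below runs plain (non-learned) ISTA on a small sparse-coding problem in pure Python. The dictionary `D`, the step size, and the threshold `lam` are hypothetical values chosen for the example; in an unfolded network each iteration would become a layer with learned parameters, which this sketch does not attempt to reproduce.

```python
import math

def soft_threshold(v, t):
    """Shrink each entry toward zero by t (proximal operator of the L1 norm)."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def ista(D, y, lam=0.1, step=0.5, n_iters=200):
    """Approximately minimise 0.5*||y - D z||^2 + lam*||z||_1.

    step should not exceed 1 / (largest eigenvalue of D^T D);
    the caller is assumed to pick a valid value.
    """
    Dt = transpose(D)
    z = [0.0] * len(D[0])
    for _ in range(n_iters):
        residual = [yi - pi for yi, pi in zip(y, matvec(D, z))]
        # Gradient step on the quadratic term, then shrinkage for the L1 term.
        grad_step = [zi + step * gi for zi, gi in zip(z, matvec(Dt, residual))]
        z = soft_threshold(grad_step, step * lam)
    return z

# Hypothetical usage: identity dictionary, so the solution is a shrunken copy of y.
identity = [[1.0, 0.0], [0.0, 1.0]]
code = ista(identity, [1.0, 0.05], lam=0.1, step=1.0, n_iters=10)
```

With an identity dictionary the small coefficient is driven to zero while the large one is reduced by `lam`, which is the sparsifying behaviour the enhancement layer relies on.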

[19] Hierarchical Material Recognition from Local Appearance

Matthew Beveridge,Shree K. Nayar

Main category: cs.CV

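TL;DR: This paper proposes a material taxonomy based on physical traits, builds a diverse dataset, and uses graph attention networks for hierarchical material recognition, achieving strong performance and adapting to challenging scenes.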

Details

Motivation: Provide a material taxonomy grounded in physical traits for vision applications, and address the challenges of material recognition in the real world. Method: A graph attention network that uses the taxonomy, together with a dataset of images and depth maps, for hierarchical material recognition. Result: The model achieves state-of-the-art performance, adapts to adverse imaging conditions, and supports few-shot learning of new materials. Conclusion: The proposed taxonomy and model perform well at material recognition and show potential for practical use. Abstract: We introduce a taxonomy of materials for hierarchical recognition from local appearance. Our taxonomy is motivated by vision applications and is arranged according to the physical traits of materials. We contribute a diverse, in-the-wild dataset with images and depth maps of the taxonomy classes. Utilizing the taxonomy and dataset, we present a method for hierarchical material recognition based on graph attention networks. Our model leverages the taxonomic proximity between classes and achieves state-of-the-art performance. We demonstrate the model's potential to generalize to adverse, real-world imaging conditions, and that novel views rendered using the depth maps can enhance this capability. Finally, we show the model's capacity to rapidly learn new materials in a few-shot learning setting.

Maksim Kolodiazhnyi,Denis Tarasov,Dmitrii Zhemchuzhnikov,Alexander Nikulin,Ilya Zisman,Anna Vorontsova,Anton Konushin,Vladislav Kurenkov,Danila Rukhovich

Main category: cs.CV

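TL;DR: This paper proposes a multi-modal CAD reconstruction model that accepts point cloud, image, and text inputs and is optimized with supervised fine-tuning and reinforcement learning, achieving strong results on the DeepCAD benchmark.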

Details

Motivation: Existing CAD reconstruction methods usually support only a single input modality, which limits generalizability and robustness. Method: A two-stage pipeline: supervised fine-tuning on large-scale procedurally generated data, followed by reinforcement learning fine-tuning with online feedback. Result: On the DeepCAD benchmark, the model outperforms existing single-modal methods on all three input modalities, and RL fine-tuning improves performance further. Conclusion: Multi-modal input combined with RL fine-tuning significantly improves the performance and generality of CAD reconstruction. Abstract: Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLM), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback, obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. In the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets new state-of-the-art on three challenging datasets, including a real-world one.
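
The abstract above names GRPO as the online RL algorithm. As a hedged illustration of the one piece of GRPO that is easy to show in isolation, the sketch below computes group-relative advantages: several candidate outputs are sampled for the same input, each receives a scalar reward, and rewards are normalised within the group. The reward values and the function name `group_relative_advantages` are hypothetical; the abstract says only that feedback is obtained programmatically and does not define the reward.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps).

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead, which is an assumption to check.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four candidate reconstructions of the same input.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Candidates that beat the group average get a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.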

[21] Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Ruichen Chen,Keith G. Mills,Liyao Jiang,Chao Gao,Di Niu

Main category: cs.CV

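TL;DR: Re-ttention is a highly sparse attention mechanism that exploits the temporal redundancy of diffusion models to preserve visual generation quality at very low compute cost.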

Details

Motivation: Address the high computational complexity of attention in Diffusion Transformers while avoiding the quality loss and added compute overhead that existing sparse attention methods suffer at extremely high sparsity. Method: Re-ttention reshapes attention scores based on the history of prior softmax distributions to overcome the probabilistic normalization shift in the attention mechanism, enabling high-quality visual generation at high sparsity. Result: Experiments show that Re-ttention needs as few as 3.1% of the tokens to preserve visual quality, outperforms methods such as FastDiTAttn on T2V/T2I models, and reduces end-to-end latency by over 45% on an H100 GPU. Conclusion: Re-ttention is an efficient, high-quality sparse attention method suited to visual generation tasks. Abstract: Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference. Further, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code available online here: https://github.com/cccrrrccc/Re-ttention
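
The "probabilistic normalization shift" mentioned above can be illustrated with a toy example: when softmax is taken over only the kept tokens, the denominator shrinks and the kept weights are inflated relative to full attention. The sketch below shows that multiplying the sparse weights by the fraction of full softmax mass on the kept tokens restores the full-attention values. This is not the authors' implementation; the function names are hypothetical, and the idea that such a ratio could be reused from an earlier denoising step is only one reading of "prior softmax distribution history".

```python
import math

def softmax_weights(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_weights(scores, kept):
    """Softmax restricted to the kept token indices (inflated weights)."""
    w = softmax_weights([scores[i] for i in kept])
    return dict(zip(kept, w))

def kept_mass_ratio(scores, kept):
    """Fraction of the full softmax mass that falls on the kept tokens."""
    full = softmax_weights(scores)
    return sum(full[i] for i in kept)

def rescaled_sparse_weights(scores, kept, ratio):
    """Undo the normalization shift by scaling sparse weights with a ratio."""
    return {i: w * ratio for i, w in sparse_weights(scores, kept).items()}

# Hypothetical scores for one query over four keys; only keys 0 and 1 are kept.
scores = [2.0, 1.0, 0.0, -1.0]
ratio = kept_mass_ratio(scores, [0, 1])
corrected = rescaled_sparse_weights(scores, [0, 1], ratio)
```

Here the ratio is computed from the full scores, which a real sparse kernel would not have; the point of the sketch is only to show why a cached or estimated ratio is enough to correct the kept weights.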

[22] Leveraging Diffusion Models for Synthetic Data Augmentation in Protein Subcellular Localization Classification

Sylvey Lin,Zhi-Yi Cao

Main category: cs.CV

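TL;DR: This study examines whether synthetic images generated by diffusion models can improve multi-label classification of protein subcellular localization, and finds that the synthetic-data models generalize poorly to the test set while conventional methods are more stable.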

Details

Motivation: Explore the potential of synthetic data in biomedical image classification, in particular whether diffusion-generated images can strengthen multi-label classification. Method: A simplified class-conditional denoising diffusion probabilistic model (DDPM) generates label-consistent samples, which are combined with real data through two strategies, Mix Loss and Mix Representation. Result: The hybrid approaches perform well on the validation set but generalize poorly to the test set; conventional ResNet-based baselines are more stable. Conclusion: Using synthetic data effectively requires more realistic data generation and stronger supervision. Abstract: We investigate whether synthetic images generated by diffusion models can enhance multi-label classification of protein subcellular localization. Specifically, we implement a simplified class-conditional denoising diffusion probabilistic model (DDPM) to produce label-consistent samples and explore their integration with real data via two hybrid training strategies: Mix Loss and Mix Representation. While these approaches yield promising validation performance, our proposed MixModel exhibits poor generalization to unseen test data, underscoring the challenges of leveraging synthetic data effectively. In contrast, baseline classifiers built on ResNet backbones with conventional loss functions demonstrate greater stability and test-time performance. Our findings highlight the importance of realistic data generation and robust supervision when incorporating generative augmentation into biomedical image classification.

[23] Fast Isotropic Median Filtering

Ben Weiss

Main category: cs.CV

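TL;DR: This paper proposes an efficient median filtering method that overcomes the limits of earlier algorithms on bit depth, kernel size, and kernel shape.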

Details

Motivation: Traditional median filtering algorithms are limited in bit depth, kernel size, and kernel shape, which restricts their practical use. Method: A new method that supports arbitrary bit depths, arbitrary kernel sizes, and arbitrary convex kernel shapes, including circular ones. Result: The method is efficient and free of the limitations of traditional algorithms. Conclusion: The new method resolves long-standing limitations of median filtering and is broadly applicable. Abstract: Median filtering is a cornerstone of computational image processing. It provides an effective means of image smoothing, with minimal blurring or softening of edges, invariance to monotonic transformations such as gamma adjustment, and robustness to noise and outliers. However, known algorithms have all suffered from practical limitations: the bit depth of the image data, the size of the filter kernel, or the kernel shape itself. Square-kernel implementations tend to produce streaky cross-hatching artifacts, and nearly all known efficient algorithms are in practice limited to square kernels. We present for the first time a method that overcomes all of these limitations. Our method operates efficiently on arbitrary bit-depth data, arbitrary kernel sizes, and arbitrary convex kernel shapes, including circular shapes.
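
To make the properties listed in the abstract concrete, the sketch below is a deliberately naive circular-kernel median filter in pure Python. It is a reference implementation for illustration only and is not the fast algorithm the paper presents: it sorts every window from scratch. The function names and the edge-replication border policy are assumptions made for the example.

```python
def disk_offsets(radius):
    """Integer (dy, dx) offsets that fall inside a disk of the given radius."""
    r2 = radius * radius
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= r2]

def median_filter_disk(image, radius):
    """Median filter with a circular kernel; borders use edge replication."""
    h, w = len(image), len(image[0])
    offsets = disk_offsets(radius)  # always an odd number of offsets
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = sorted(
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy, dx in offsets
            )
            out[y][x] = window[len(window) // 2]
    return out

# Hypothetical usage: a flat image with a single outlier pixel.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255
cleaned = median_filter_disk(noisy, 1)
```

Because the disk always contains an odd number of samples, the output is always one of the input values, which is why the filter commutes with increasing transformations such as gamma adjustment and works unchanged at any bit depth.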

[24] ATI: Any Trajectory Instruction for Controllable Video Generation

Angtian Wang,Haibin Huang,Jacob Zhiyuan Fang,Yiding Yang,Chongyang Ma

Main category: cs.CV

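TL;DR: This paper proposes a unified motion control framework for video generation that integrates camera movement, object translation, and fine-grained local motion through trajectory inputs.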

Details

Motivation: Existing methods handle motion control through separate modules or task-specific designs; this work offers a unified solution. Method: A lightweight motion injector projects user-defined trajectories into the latent space of pre-trained image-to-video generation models. Result: Strong performance across multiple video motion control tasks, including stylized motion effects, dynamic viewpoint changes, and precise local motion manipulation. Conclusion: The method offers significantly better controllability and visual quality than prior approaches and is compatible with various state-of-the-art video generation models. Abstract: We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion using trajectory-based inputs. In contrast to prior methods that address these motion types through separate modules or task-specific designs, our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models via a lightweight motion injector. Users can specify keypoints and their motion paths to control localized deformations, entire object motion, virtual camera dynamics, or combinations of these. The injected trajectory signals guide the generative process to produce temporally consistent and semantically aligned motion sequences. Our framework demonstrates superior performance across multiple video motion control tasks, including stylized motion effects (e.g., motion brushes), dynamic viewpoint changes, and precise local motion manipulation. Experiments show that our method provides significantly better controllability and visual quality compared to prior approaches and commercial solutions, while remaining broadly compatible with various state-of-the-art video generation backbones. Project page: https://anytraj.github.io/.

[25] Toward Memory-Aided World Models: Benchmarking via Spatial Consistency

Kewei Lian,Shaofei Cai,Yilun Du,Yitao Liang

Main category: cs.CV

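TL;DR: This paper presents a new dataset and benchmark to advance spatially consistent world models, with a particular focus on the design of memory modules.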

Details

Motivation: Existing datasets do not explicitly require spatial consistency, and most benchmarks overlook the need for long-range spatial consistency. Method: Sample 150 distinct locations in the Minecraft open world, collect about 250 hours (20 million frames) of loop-based navigation video, and use a curriculum design that gradually increases sequence length. Result: An extensible dataset and benchmark, on which four representative world model baselines are evaluated. Conclusion: The dataset and benchmark support future research and fill a gap in the development of spatially consistent world models. Abstract: The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. Designing a memory module is a crucial component for addressing spatial consistency: such a model must not only retain long-horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, there is no dataset designed to promote the development of memory modules by explicitly enforcing spatial consistency constraints. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we construct a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft, collecting about 250 hours (20 million frames) of loop-based navigation videos with actions. Our dataset follows a curriculum design of sequence lengths, allowing models to learn spatial consistency on increasingly complex navigation trajectories. Furthermore, our data collection pipeline is easily extensible to new Minecraft environments and modules. Four representative world model baselines are evaluated on our benchmark. Dataset, benchmark, and code are open-sourced to support future research.

[26] HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions

Shuolin Xu,Siming Zheng,Ziyi Wang,HC Yu,Jinwei Chen,Huaqi Zhang,Bo Li,Peng-Tao Jiang

Main category: cs.CV

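TL;DR: This paper introduces the Open-HyperMotionX dataset and HyperMotionX Bench for evaluating and improving pose-guided animation models under complex human motion, and proposes a DiT-based video generation baseline with a spatial low-frequency enhanced RoPE module.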

Details

Motivation: Existing methods perform poorly on complex human motion (Hypermotion), and a high-quality evaluation benchmark is lacking. Method: Introduce the Open-HyperMotionX dataset and HyperMotionX Bench, design a DiT-based video generation baseline, and add a spatial low-frequency enhanced RoPE module. Result: The method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Conclusion: The proposed dataset and method effectively improve the generation quality of image animation for complex human motion. Abstract: Recent advances in diffusion models have significantly improved conditional video generation, particularly in the pose-guided human image animation task. Although existing methods are capable of generating high-fidelity and time-consistent animation sequences in regular motions and static scenes, obvious limitations remain when facing complex human body motions (Hypermotion) that contain highly dynamic, non-standard motions, and a high-quality benchmark for evaluating complex human motion animations is lacking. To address this challenge, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Furthermore, we propose a simple yet powerful DiT-based video generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in advancing the generation quality of complex human motion image animations. Code and dataset will be made publicly available.

[27] Pose-free 3D Gaussian splatting via shape-ray estimation

Youngju Na,Taeyeon Kim,Jumin Lee,Kyu Beom Han,Woo Jae Kim,Sung-eui Yoon

Main category: cs.CV

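TL;DR: SHARE is a pose-free 3D Gaussian splatting framework that jointly estimates shape and camera rays to resolve the geometric misalignment caused by inaccurate poses.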

Details

Motivation: In real-world scenes, accurate camera poses are hard to obtain, which leads to geometric misalignment; SHARE targets this problem. Method: SHARE builds a pose-aware canonical volume representation that integrates multi-view information, and refines local geometry with anchor-aligned Gaussian prediction. Result: Experiments on diverse real-world datasets show that SHARE performs robustly in pose-free generalizable Gaussian splatting. Conclusion: Through its pose-free design, SHARE significantly improves geometric reconstruction quality when poses are inaccurate. Abstract: While generalizable 3D Gaussian splatting enables efficient, high-quality rendering of unseen scenes, it heavily depends on precise camera poses for accurate geometry. In real-world scenarios, obtaining accurate poses is challenging, leading to noisy pose estimates and geometric misalignments. To address this, we introduce SHARE, a pose-free, feed-forward Gaussian splatting framework that overcomes these ambiguities by joint shape and camera rays estimation. Instead of relying on explicit 3D transformations, SHARE builds a pose-aware canonical volume representation that seamlessly integrates multi-view information, reducing misalignment caused by inaccurate pose estimates. Additionally, anchor-aligned Gaussian prediction enhances scene reconstruction by refining local geometry around coarse anchors, allowing for more precise Gaussian placement. Extensive experiments on diverse real-world datasets show that our method achieves robust performance in pose-free generalizable Gaussian splatting.

[28] MOVi: Training-free Text-conditioned Multi-Object Video Generation

Aimon Rahman,Jiang Liu,Ze Wang,Ximeng Sun,Jialian Wu,Xiaodong Yu,Yusheng Su,Vishal M. Patel,Zicheng Liu,Emad Barsoum

Main category: cs.CV

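TL;DR: This paper proposes a training-free method for multi-object video generation that draws on the open-world knowledge of diffusion models and large language models (LLMs) to significantly improve multi-object generation.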

Details

Motivation: Existing diffusion-based text-to-video (T2V) models perform poorly at multi-object generation: they struggle to capture complex object interactions and to generate the objects specified in the prompt. Method: Use an LLM as the "director" of object trajectories, apply the trajectories through noise re-initialization, and manipulate the attention mechanism to achieve precise control and keep object features separate. Result: Experiments show a 42% absolute improvement in motion dynamics and object generation accuracy while maintaining high fidelity and motion smoothness. Conclusion: The method provides an efficient, training-free solution for multi-object video generation. Abstract: Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle with accurately capturing complex object interactions, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate multiple distinct objects as specified in the prompt, resulting in incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach for multi-object video generation that leverages the open world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation process by manipulating the attention mechanism to better capture object-specific features and motion patterns, and prevent cross-object feature interference. Extensive experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models, resulting in 42% absolute improvement in motion dynamics and object generation accuracy, while also maintaining high fidelity and motion smoothness.

[29] Synthetic Document Question Answering in Hungarian

Jonathan Li,Zoltan Csaki,Nidhi Hiremath,Etash Guha,Fenglu Hong,Edward Ma,Urmish Thakker

Main category: cs.CV

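TL;DR: This paper presents HuDocVQA and HuDocVQA-manual, document visual question answering (VQA) datasets for Hungarian that fill a data gap for lower-resource languages, with quality filtering and deduplication used to improve data quality.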

Details

Motivation: Lower-resource languages such as Hungarian lack training and evaluation data for document VQA. Method: Build two VQA datasets from Hungarian Common Crawl data, one manually curated and one synthetically generated, and apply multiple rounds of quality filtering and deduplication. Result: After fine-tuning, Llama 3.2 11B Instruct improves its accuracy on HuDocVQA by 7.2%. Conclusion: The datasets and code will be released to foster multilingual document VQA research. Abstract: Modern VLMs have achieved near-saturation accuracy in English document visual question-answering (VQA). However, this task remains challenging in lower resource languages due to a dearth of suitable training and evaluation data. In this paper we present scalable methods for curating such datasets by focusing on Hungarian, approximately the 17th highest resource language on the internet. Specifically, we present HuDocVQA and HuDocVQA-manual, document VQA datasets that modern VLMs significantly underperform on compared to English DocVQA. HuDocVQA-manual is a small manually curated dataset based on Hungarian documents from Common Crawl, while HuDocVQA is a larger synthetically generated VQA dataset from the same source. We apply multiple rounds of quality filtering and deduplication to HuDocVQA in order to match human-level quality in this dataset. We also present HuCCPDF, a dataset of 117k pages from Hungarian Common Crawl PDFs along with their transcriptions, which can be used for training a model for Hungarian OCR. To validate the quality of our datasets, we show how finetuning on a mixture of these datasets can improve accuracy on HuDocVQA for Llama 3.2 11B Instruct by +7.2%. Our datasets and code will be released to the public to foster further research in multilingual DocVQA.

[30] SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model

Bowen Chen,Keyan Chen,Mohan Yang,Zhengxia Zou,Zhenwei Shi

Main category: cs.CV

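TL;DR: This paper proposes SeG-SR, a semantic-guided super-resolution framework that uses vision-language models to extract semantic knowledge and improve super-resolution reconstruction of remote sensing images.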

Details

Motivation: Existing remote sensing super-resolution methods focus mainly on low-level features in pixel space and neglect high-level semantic understanding, so reconstructions can be semantically inconsistent. Method: The SeG-SR framework comprises a Semantic Feature Extraction Module (SFEM), a Semantic Localization Module (SLM), and a Learnable Modulation Module (LMM), which use semantic knowledge to guide the super-resolution process. Result: SeG-SR achieves state-of-the-art performance on two datasets and delivers consistent gains across several super-resolution architectures. Conclusion: By introducing high-level semantic knowledge, SeG-SR significantly improves the quality and consistency of remote sensing image super-resolution. Abstract: High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space, while neglecting the high-level understanding of remote sensing scenes. This may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work aims to explore the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline. We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on two datasets and consistently delivers performance improvements across various SR architectures. Codes can be found at https://github.com/Mr-Bamboo/SeG-SR.

[31] Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition

Shanaka Ramesh Gunasekara,Wanqing Li,Philip Ogunbona,Jack Yang

Main category: cs.CV

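TL;DR: This paper proposes a new measurement, spatial-temporal joint density (STJD), to quantify the interaction between moving and static joints in skeleton sequences, together with two self-supervised learning methods, STJD-CL and STJD-MP, that significantly improve action classification.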

Details

Motivation: Traditional methods focus mainly on the dynamic features of skeleton sequences and overlook the interaction between moving and static joints; this paper aims to exploit the discriminative potential of that interaction. Method: Propose the STJD measurement, and use "prime joints" to steer self-supervised learning through the STJD-CL contrastive learning strategy and the STJD-MP reconstruction-based framework. Result: Evaluated on the NTU RGB+D 60, NTU RGB+D 120, and PKUMMD datasets, the methods improve on state-of-the-art contrastive methods by 3.5 and 3.6 percentage points on NTU RGB+D 120 under the X-sub and X-set evaluations, respectively. Conclusion: STJD effectively exploits the interaction between moving and static joints in skeleton sequences and significantly improves action classification. Abstract: Traditional approaches in unsupervised or self-supervised learning for skeleton-based action classification have concentrated predominantly on the dynamic aspects of skeletal sequences. Yet, the intricate interaction between the moving and static elements of the skeleton presents a rarely tapped discriminative potential for action classification. This paper introduces a novel measurement, referred to as spatial-temporal joint density (STJD), to quantify such interaction. Tracking the evolution of this density throughout an action can effectively identify a subset of discriminative moving and/or static joints termed "prime joints" to steer self-supervised learning. A new contrastive learning strategy named STJD-CL is proposed to align the representation of a skeleton sequence with that of its prime joints while simultaneously contrasting the representations of prime and nonprime joints. In addition, a method called STJD-MP is developed by integrating it with a reconstruction-based framework for more effective learning. Experimental evaluations on the NTU RGB+D 60, NTU RGB+D 120, and PKUMMD datasets in various downstream tasks demonstrate that the proposed STJD-CL and STJD-MP improved performance, particularly by 3.5 and 3.6 percentage points over the state-of-the-art contrastive methods on the NTU RGB+D 120 dataset using X-sub and X-set evaluations, respectively.

[32] Towards Privacy-Preserving Fine-Grained Visual Classification via Hierarchical Learning from Label Proportions

Jinyi Chang,Dongliang Chang,Lei Chen,Bingyao Yu,Zhanyu Ma

Main category: cs.CV

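TL;DR: This paper proposes LHFGLP, a fine-grained visual classification method that needs no instance-level labels; it uses the Learning from Label Proportions (LLP) paradigm and improves accuracy through hierarchical feature refinement.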

Details

Motivation: Existing fine-grained classification methods rely on instance-level labels and are unsuitable for privacy-sensitive settings such as medical image analysis, so an accurate method that does not need direct access to instance labels is required. Method: The LHFGLP framework combines hierarchical fine-grained sparse dictionary learning with a hierarchical proportion loss and trains efficiently from bag-level labels. Result: Experiments on three fine-grained datasets show that LHFGLP outperforms existing LLP-based methods. Conclusion: The method offers an effective solution for privacy-preserving fine-grained classification, and the code and datasets will be released to foster further research. Abstract: In recent years, Fine-Grained Visual Classification (FGVC) has achieved impressive recognition accuracy, despite minimal inter-class variations. However, existing methods heavily rely on instance-level labels, making them impractical in privacy-sensitive scenarios such as medical image analysis. This paper aims to enable accurate fine-grained recognition without direct access to instance labels. To achieve this, we leverage the Learning from Label Proportions (LLP) paradigm, which requires only bag-level labels for efficient training. Unlike existing LLP-based methods, our framework explicitly exploits the hierarchical nature of fine-grained datasets, enabling progressive feature granularity refinement and improving classification accuracy. We propose Learning from Hierarchical Fine-Grained Label Proportions (LHFGLP), a framework that incorporates Unrolled Hierarchical Fine-Grained Sparse Dictionary Learning, transforming handcrafted iterative approximation into learnable network optimization. Additionally, our proposed Hierarchical Proportion Loss provides hierarchical supervision, further enhancing classification performance. Experiments on three widely-used fine-grained datasets, structured in a bag-based manner, demonstrate that our framework consistently outperforms existing LLP-based methods. We will release our code and datasets to foster further research in privacy-preserving fine-grained classification.
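
The LLP setting described above supervises a model with class proportions per bag rather than per-instance labels. The sketch below shows a generic, single-level proportion loss: the cross-entropy between a bag's known class proportions and the mean of the model's per-instance predictions. It is a minimal illustration under stated assumptions, not the paper's Hierarchical Proportion Loss; the function names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax for one instance."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proportion_loss(bag_logits, target_proportions, eps=1e-12):
    """Cross-entropy between bag-level target proportions and the mean prediction.

    bag_logits: one list of class logits per instance in the bag.
    target_proportions: known fraction of each class in the bag (sums to 1).
    """
    probs = [softmax(logits) for logits in bag_logits]
    n = len(probs)
    mean_pred = [sum(p[c] for p in probs) / n
                 for c in range(len(target_proportions))]
    return -sum(t * math.log(q + eps)
                for t, q in zip(target_proportions, mean_pred))

# Hypothetical bag of two instances with a known 50/50 class split.
matched = proportion_loss([[10.0, 0.0], [0.0, 10.0]], [0.5, 0.5])
mismatched = proportion_loss([[10.0, 0.0], [10.0, 0.0]], [0.5, 0.5])
```

The loss never looks at which instance belongs to which class, only at whether the bag-level average matches, which is what makes the setting attractive for privacy-sensitive data.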

[33] Deep Modeling and Optimization of Medical Image Classification

Yihang Wu,Muhammad Owais,Reem Kateb,Ahmad Chaddad

Main category: cs.CV

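TL;DR: This paper combines a CLIP variant, federated learning, and traditional machine learning for medical image classification, addressing data privacy and generalization.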

Details

Motivation: Deep models need large amounts of data that privacy constraints make hard to obtain in the medical domain, and the potential of CLIP in medicine has not been fully explored. Method: 1) a CLIP variant that uses CNNs and ViTs as image encoders; 2) federated learning to protect data privacy; 3) traditional ML methods to improve generalization. Result: On HAM10000, maxvit performs best (AVG = 87.03%); on ISIC2018, SVM improves the swin transformer series (AVG gain of about 2%). Conclusion: The approach is effective for medical image classification, balancing performance and data privacy while improving generalization. Abstract: Deep models, such as convolutional neural networks (CNNs) and vision transformers (ViTs), demonstrate remarkable performance in image classification. However, those deep models require large data to fine-tune, which is impractical in the medical domain due to the data privacy issue. Furthermore, despite the feasible performance of contrastive language image pre-training (CLIP) in the natural domain, the potential of CLIP has not been fully investigated in the medical field. To face these challenges, we considered three scenarios: 1) we introduce a novel CLIP variant using four CNNs and eight ViTs as image encoders for the classification of brain cancer and skin cancer, 2) we combine 12 deep models with two federated learning techniques to protect data privacy, and 3) we involve traditional machine learning (ML) methods to improve the generalization ability of those deep models in unseen domain data. The experimental results indicate that maxvit shows the highest averaged (AVG) test metrics (AVG = 87.03%) on the HAM10000 dataset with multimodal learning, while convnext_l demonstrates remarkable test performance, with an F1-score of 83.98% compared to swin_b with 81.33% in the FL model. Furthermore, the use of support vector machine (SVM) can improve the overall test metrics with AVG of about 2% for the swin transformer series on ISIC2018. Our codes are available at https://github.com/AIPMLab/SkinCancerSimulation.

[34] Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation

Jihai Zhang,Tianle Li,Linjie Li,Zhengyuan Yang,Yu Cheng

Main category: cs.CV

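TL;DR: This paper systematically studies generalization between understanding and generation tasks in unified vision-language models (VLMs), and finds that mixed training yields mutual gains that grow with more data.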

Details

Motivation: Test the hypothesis that understanding and generation tasks enhance each other in a unified architecture, filling a gap in prior work. Method: Design a dataset closely aligned with real-world scenarios, evaluate multiple unified VLM architectures, and run quantitative experiments. Result: Mixed training improves both understanding and generation; better alignment between multimodal input and output spaces leads to better generalization; knowledge from generation tasks transfers to understanding tasks. Conclusion: Unifying understanding and generation is critical for VLMs, and the findings offer guidance for model design and optimization. Abstract: Recent advancements in unified vision-language models (VLMs), which integrate both visual understanding and generation capabilities, have attracted significant attention. The underlying hypothesis is that a unified architecture with mixed training on both understanding and generation tasks can enable mutual enhancement between understanding and generation. However, this hypothesis remains underexplored in prior works on unified VLMs. To address this gap, this paper systematically investigates the generalization across understanding and generation tasks in unified VLMs. Specifically, we design a dataset closely aligned with real-world scenarios to facilitate extensive experiments and quantitative evaluations. We evaluate multiple unified VLM architectures to validate our findings. Our key findings are as follows. First, unified VLMs trained with mixed data exhibit mutual benefits in understanding and generation tasks across various architectures, and these mutual benefits can scale up with increased data. Second, better alignment between multimodal input and output spaces will lead to better generalization. Third, the knowledge acquired during generation tasks can transfer to understanding tasks, and this cross-task generalization occurs within the base language model, beyond modality adapters. Our findings underscore the critical necessity of unifying understanding and generation in VLMs, offering valuable insights for the design and optimization of unified VLMs.

[35] SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images

Yu Sheng,Jiajun Deng,Xinran Zhang,Yu Zhang,Bei Hua,Yanyong Zhang,Jianmin Ji

Main category: cs.CV

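TL;DR: SpatialSplat is a new feedforward framework that uses a dual-field semantic representation and a selective Gaussian mechanism to reduce redundancy and improve the efficiency and accuracy of semantic 3D reconstruction.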

Details

Motivation: Existing methods sacrifice expressiveness when compressing semantic features, and pixel-wise prediction introduces redundancy; SpatialSplat aims to solve both problems. Method: A dual-field semantic representation (a coarse feature field and a fine-grained feature field) plus a selective Gaussian mechanism that eliminates redundant Gaussians. Result: Experiments show a 60% reduction in parameters with performance superior to existing methods. Conclusion: With more compact Gaussians and efficient semantic encoding, SpatialSplat makes semantic 3D reconstruction more practical. Abstract: A major breakthrough in 3D reconstruction is the feedforward paradigm to generate pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model's ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce SpatialSplat, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed SpatialSplat learns accurate semantic information and a detailed instance prior with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our method, demonstrating a remarkable 60% reduction in scene representation parameters while achieving superior performance over state-of-the-art methods. The code will be made available for future investigation.

[36] Multi-Sourced Compositional Generalization in Visual Question Answering

Chuanhao Li,Wenbo Ye,Zhen Li,Yuwei Wu,Yunde Jia

Main category: cs.CV

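TL;DR: This paper studies multi-sourced compositional generalization (MSCG) in vision-and-language tasks and proposes a retrieval-augmented training framework that improves MSCG by unifying the representations of primitives from different modalities.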

Details

Motivation: Because vision-and-language tasks are multi-modal, the primitives that make up a composition come from different sources, and generalization to multi-sourced novel compositions (MSCG) has not been explored. Method: A retrieval-augmented training framework that retrieves semantically equivalent primitive features and aggregates them to learn unified cross-modal representations. Result: Experiments show the framework is effective, and a GQA-MSCG dataset is built from GQA for evaluation. Conclusion: The work fills a gap in MSCG research, and the proposed framework and dataset provide a basis for future work. Abstract: Compositional generalization is the ability to generalize to novel compositions of seen primitives, and has received much attention in vision-and-language (V&L) recently. Due to the multi-modal nature of V&L tasks, the primitives composing compositions source from different modalities, resulting in multi-sourced novel compositions. However, the generalization ability over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. Experimental results demonstrate the effectiveness of the proposed framework. We release GQA-MSCG at https://github.com/NeverMoreLCH/MSCG.

[37] Zero-P-to-3: Zero-Shot Partial-View Images to 3D Object

Yuxuan Lin,Ruihang Chu,Zhenyu Chen,Xiao Tang,Lei Ke,Haoling Li,Yingji Zhong,Zhihao Li,Shiyong Liu,Xiaofei Wu,Jianzhuang Liu,Yujiu Yang

Main category: cs.CV

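TL;DR: The paper proposes Zero-P-to-3, a training-free method that fuses local dense observations with multi-source priors to address the limited view range and inconsistent generation in partial-view 3D reconstruction.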

Details

Motivation: In partial-view 3D reconstruction, traditional interpolation fails because the views are unevenly distributed, and generated views are inconsistent with the visible regions, which degrades reconstruction quality. Method: A fusion strategy aligns multi-source priors during DDIM sampling to generate multi-view consistent images, and an iterative refinement strategy uses geometric structure to improve reconstruction quality. Result: Experiments on multiple datasets show that the method outperforms existing techniques in invisible regions. Conclusion: By fusing priors and refining iteratively, Zero-P-to-3 markedly improves generation consistency and quality in partial-view 3D reconstruction. Abstract: Generative 3D reconstruction shows strong potential in incomplete observations. While sparse-view and single-image reconstruction are well-researched, partial observation remains underexplored. In this context, dense views are accessible only from a specific angular range, with other perspectives remaining inaccessible. This task presents two main challenges: (i) Limited view range: observations confined to a narrow angular scope prevent effective traditional interpolation techniques that require evenly distributed perspectives. (ii) Inconsistent generation: views created for invisible regions often lack coherence with both visible regions and each other, compromising reconstruction consistency. To address these challenges, we propose Zero-P-to-3, a novel training-free approach that integrates the local dense observations and multi-source priors for reconstruction. Our method introduces a fusion-based strategy to effectively align these priors in DDIM sampling, thereby generating multi-view consistent images to supervise invisible views. We further design an iterative refinement strategy, which uses the geometric structures of the object to enhance reconstruction quality. Extensive experiments on multiple datasets show the superiority of our method over SOTAs, especially in invisible regions.

[38] URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration

Rui Xu,Yuzhen Niu,Yuezhou Li,Huangbiao Xu,Wenxi Liu,Yuzhong Chen

Main category: cs.CV

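TL;DR: The paper proposes URWKV, a unified model that uses a multi-state perspective to flexibly handle low-light image enhancement and deblurring, addressing dynamically coupled degradations.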

Details

Motivation: Existing low-light image enhancement (LLIE) and joint LLIE-deblurring models handle predefined degradations well but struggle with dynamically coupled degradations. Method: The URWKV model includes Luminance-adaptive Normalization (LAN) and a State-aware Selective Fusion (SSF) module, and uses a multi-state mechanism to perceive complex degradations. Result: URWKV outperforms existing methods on several benchmarks while needing significantly fewer parameters and less computation. Conclusion: Through its multi-state mechanism and dynamic fusion strategy, URWKV effectively handles dynamically coupled degradations in low-light image enhancement and deblurring. Abstract: Existing low-light image enhancement (LLIE) and joint LLIE and deblurring (LLIE-deblur) models have made strides in addressing predefined degradations, yet they are often constrained by dynamically coupled degradations. To address these challenges, we introduce a Unified Receptance Weighted Key Value (URWKV) model with multi-state perspective, enabling flexible and effective degradation restoration for low-light images. Specifically, we customize the core URWKV block to perceive and analyze complex degradations by leveraging multiple intra- and inter-stage states. First, inspired by the pupil mechanism in the human visual system, we propose Luminance-adaptive Normalization (LAN) that adjusts normalization parameters based on rich inter-stage states, allowing for adaptive, scene-aware luminance modulation. Second, we aggregate multiple intra-stage states through an exponential moving average approach, effectively capturing subtle variations while mitigating information loss inherent in the single-state mechanism. To reduce the degradation effects commonly associated with conventional skip connections, we propose the State-aware Selective Fusion (SSF) module, which dynamically aligns and integrates multi-state features across encoder stages, selectively fusing contextual information. In comparison to state-of-the-art models, our URWKV model achieves superior performance on various benchmarks, while requiring significantly fewer parameters and computational resources.
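
The abstract's exponential-moving-average aggregation of intra-stage states can be illustrated with a minimal sketch. It shows only the generic EMA recurrence on plain Python lists; the function name `ema_aggregate`, the `decay` value, and the list-of-vectors state layout are assumptions made for illustration and are not taken from the URWKV code.

```python
def ema_aggregate(states, decay=0.9):
    """Fold a list of equal-length state vectors into one vector with an EMA.

    `decay` is a hypothetical hyperparameter; the abstract does not give one.
    """
    if not states:
        raise ValueError("need at least one state")
    aggregated = list(states[0])  # start from the earliest state
    for state in states[1:]:
        # keep `decay` of the running summary, mix in (1 - decay) of the new state
        aggregated = [decay * a + (1.0 - decay) * s
                      for a, s in zip(aggregated, state)]
    return aggregated


# Three toy "states" with two features each.
example = ema_aggregate([[0.0, 2.0], [1.0, 2.0], [1.0, 4.0]], decay=0.5)
print(example)
```

Compared with keeping only the latest state, the running average retains a decaying trace of every earlier state, which is the property the abstract contrasts with a single-state mechanism.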

[39] GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

Gwanghyun Kim,Xueting Li,Ye Yuan,Koki Nagano,Tianye Li,Jan Kautz,Se Young Chun,Umar Iqbal

Main category: cs.CV

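TL;DR: GeoMan is a new architecture that combines an image model with a video diffusion model to produce accurate, temporally consistent 3D human geometry estimates from monocular video, addressing data scarcity and the challenges of depth estimation.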

Details

Motivation: Existing methods are mainly optimized for single images and suffer from temporal inconsistency and poor capture of dynamic detail; GeoMan aims to address these limitations. Method: GeoMan uses an image model to estimate depth and normals for the first frame, then a video diffusion model generates the subsequent frames, recasting the task as image-to-video generation; it also introduces a root-relative depth representation to preserve human-scale detail. Result: GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations and markedly improves temporal consistency and generalization. Conclusion: Through its design, GeoMan effectively addresses longstanding challenges in 3D human geometry estimation and shows its potential for video processing. Abstract: Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.

[40] LeMoRe: Learn More Details for Lightweight Semantic Segmentation

Mian Muhammad Naeem Abid,Nancy Mehta,Zongwei Wu,Radu Timofte

Main category: cs.CV

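TL;DR: The paper proposes a lightweight semantic segmentation method that combines explicit and implicit modeling and uses a nested attention mechanism to balance computational efficiency and representational power.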

Details

Motivation: Existing methods struggle to balance efficiency and performance because of the complexity of feature modeling, and they rely on parameter-heavy designs or computationally intensive Vision Transformer frameworks. Method: The method combines explicitly modeled Cartesian directions with implicitly inferred intermediate representations and captures global dependencies efficiently through a nested attention mechanism. Result: Experiments on ADE20K, CityScapes, and other datasets confirm the balance between performance and efficiency. Conclusion: LeMoRe balances efficiency and high performance in lightweight semantic segmentation. Abstract: Lightweight semantic segmentation is essential for many downstream vision tasks. Unfortunately, existing methods often struggle to balance efficiency and performance due to the complexity of feature modeling. Many of these existing approaches are constrained by rigid architectures and implicit representation learning, often characterized by parameter-heavy designs and a reliance on computationally intensive Vision Transformer-based frameworks. In this work, we introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity. Our method combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies through a nested attention mechanism. Extensive experiments on challenging datasets, including ADE20K, CityScapes, Pascal Context, and COCO-Stuff, demonstrate that LeMoRe strikes an effective balance between performance and efficiency.

[41] CURVE: CLIP-Utilized Reinforcement Learning for Visual Image Enhancement via Simple Image Processing

Yuka Ogino,Takahiro Toizumi,Atsushi Ito

Main category: cs.CV

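TL;DR: This paper proposes CURVE, a low-light image enhancement method based on CLIP and reinforcement learning that adjusts global tone with a Bézier curve and stays computationally efficient on high-resolution images.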

Details

Motivation: Zero-reference low-light image enhancement faces two challenges: using the CLIP model to obtain perceptually good images, and staying computationally efficient. Method: CURVE adjusts global tone with a Bézier curve and estimates the processing parameters iteratively through reinforcement learning, with rewards designed from CLIP text embeddings. Result: Experiments on low-light and multi-exposure datasets show that CURVE outperforms conventional methods in enhancement quality and processing speed. Conclusion: CURVE is an efficient, perceptually effective low-light enhancement method suited to high-resolution images. Abstract: Low-Light Image Enhancement (LLIE) is crucial for improving both human perception and computer vision tasks. This paper addresses two challenges in zero-reference LLIE: obtaining perceptually 'good' images using the Contrastive Language-Image Pre-Training (CLIP) model and maintaining computational efficiency for high-resolution images. We propose CLIP-Utilized Reinforcement learning-based Visual image Enhancement (CURVE). CURVE employs a simple image processing module which adjusts global image tone based on a Bézier curve and estimates its processing parameters iteratively. The estimator is trained by reinforcement learning with rewards designed using CLIP text embeddings. Experiments on low-light and multi-exposure datasets demonstrate the performance of CURVE in terms of enhancement quality and processing speed compared to conventional methods.
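
To make the idea of a global Bézier tone curve concrete, here is a rough sketch of one way such a curve could be applied. It assumes a cubic Bézier with end points fixed at 0 and 1 and uses the input intensity directly as the curve parameter; the names `bezier_tone`, `apply_tone`, `p1`, and `p2` are hypothetical, and CURVE's actual parameterization and its reinforcement-learning estimator are not reproduced here.

```python
def bezier_tone(x, p1, p2):
    """Map an intensity x in [0, 1] through a cubic Bezier tone curve.

    End points are fixed at 0 and 1; p1 and p2 are the two free control
    values (hypothetical names). Using x itself as the curve parameter is a
    simplifying assumption of this sketch.
    """
    t = min(max(x, 0.0), 1.0)  # clamp to the valid range
    return (3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3)


def apply_tone(image, p1, p2):
    """Apply the same global curve to every pixel of a 2D list of intensities."""
    return [[bezier_tone(px, p1, p2) for px in row] for row in image]


dark = [[0.0, 0.25], [0.5, 1.0]]
brighter = apply_tone(dark, p1=0.8, p2=1.0)  # control values above the diagonal lift shadows
print(brighter)
```

Because the curve depends on only two numbers regardless of image size, an estimator can predict it from a downsampled input and the result can be applied per pixel at full resolution, which may be one reason a global curve suits high-resolution images.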

[42] EAD: An EEG Adapter for Automated Classification

Pushapdeep Singh,Jyoti Nigam,Medicherla Vamsi Krishna,Arnav Bhavsar,Aditya Nigam

Main category: cs.CV

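TL;DR: The paper proposes the EEG Adapter (EAD) framework, which handles classification of EEG signals acquired with different devices and achieves high accuracy and generalization.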

Details

Motivation: Traditional EEG classification depends on a specific device and channel count, so data from different devices cannot be processed uniformly; a flexible framework is needed. Method: EAD adapts an EEG foundation model to learn robust representations and is compatible with any signal acquisition device. Result: It reaches accuracies of 99.33% on EEG-ImageNet and 92.31% on BrainLat and demonstrates zero-shot classification. Conclusion: EAD handles EEG data from different acquisition devices effectively, with high accuracy and generalization. Abstract: While electroencephalography (EEG) has been a popular modality for neural decoding, it often involves task-specific acquisition of the EEG data. This poses challenges for the development of a unified pipeline to learn embeddings for various EEG signal classification, which is often involved in various decoding tasks. Traditionally, EEG classification involves the step of signal preprocessing and the use of deep learning techniques, which are highly dependent on the number of EEG channels in each sample. However, the same pipeline cannot be applied even if the EEG data is collected for the same experiment but with different acquisition devices. This necessitates the development of a framework for learning EEG embeddings, which could be highly beneficial for tasks involving multiple EEG samples for the same task but with varying numbers of EEG channels. In this work, we propose EEG Adapter (EAD), a flexible framework compatible with any signal acquisition device. More specifically, we leverage a recent EEG foundational model with significant adaptations to learn robust representations from the EEG data for the classification task. We evaluate EAD on two publicly available datasets, achieving state-of-the-art accuracies of 99.33% and 92.31% on EEG-ImageNet and BrainLat respectively. This illustrates the effectiveness of the proposed framework across diverse EEG datasets containing two different perception tasks: stimulus and resting-state EEG signals. We also perform zero-shot EEG classification on the EEG-ImageNet task to demonstrate the generalization capability of the proposed approach.

[43] Identification of Patterns of Cognitive Impairment for Early Detection of Dementia

Anusha A. S.,Uma Ranjan,Medha Sharma,Siddharth Dutt

Main category: cs.CV

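TL;DR: The paper proposes a personalized cognitive testing scheme that identifies individual-specific patterns of cognitive impairment, giving a more efficient tool for early dementia detection.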

Details

Motivation: Early detection of dementia is crucial for intervention, but existing cognitive tests are time-consuming and hard to apply at scale, and patterns of impairment vary widely between individuals. Method: A two-step procedure first learns impairment patterns from a population cluster, then identifies personalized patterns through feature selection and cluster analysis. Result: The identified patterns match clinically accepted variants of mild cognitive impairment (MCI) and can be used to predict the route of impairment in pre-symptomatic people. Conclusion: The method offers a feasible route to large-scale, personalized early detection of dementia. Abstract: Early detection of dementia is crucial to devise effective interventions. Comprehensive cognitive tests, while being the most accurate means of diagnosis, are long and tedious, thus limiting their applicability to a large population, especially when periodic assessments are needed. The problem is compounded by the fact that people have differing patterns of cognitive impairment as they progress to different forms of dementia. This paper presents a novel scheme by which individual-specific patterns of impairment can be identified and used to devise personalized tests for periodic follow-up. Patterns of cognitive impairment are initially learned from a population cluster of combined normals and MCIs, using a set of standardized cognitive tests. Impairment patterns in the population are identified using a 2-step procedure involving an ensemble wrapper feature selection followed by cluster identification and analysis. These patterns have been shown to correspond to clinically accepted variants of MCI, a prodrome of dementia. The learned clusters of patterns can subsequently be used to identify the most likely route of cognitive impairment, even for pre-symptomatic and apparently normal people. Baseline data of 24,000 subjects from the NACC database was used for the study.

[44] Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving

Yunshen Wang,Yicheng Liu,Tianyuan Yuan,Yucheng Mao,Yingshi Liang,Xiuyu Yang,Honggang Zhang,Hang Zhao

Main category: cs.CV

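TL;DR: The paper proposes a diffusion-based generative approach to predicting 3D occupancy grids, addressing the weaknesses of existing discriminative methods with noisy data, incomplete observations, and complex 3D scene structure.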

Details

Motivation: Current discriminative methods for 3D occupancy prediction handle noisy data, incomplete observations, and complex scene structure poorly, which hurts the consistency and accuracy of predictions. Method: The paper recasts 3D occupancy prediction as a generative modeling task and uses diffusion models to learn the data distribution and incorporate 3D scene priors. Result: Experiments show that the diffusion-based approach outperforms existing discriminative methods in prediction consistency, noise robustness, and handling of complex 3D structure, particularly in occluded or low-visibility regions. Conclusion: The method improves the accuracy and realism of 3D occupancy prediction and significantly benefits downstream planning in autonomous driving. Abstract: Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach enhances prediction consistency and noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.

[45] TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance

Keren Ye,Ignacio Garcia Dorado,Michalis Raptis,Mauricio Delbracio,Irene Zhu,Peyman Milanfar,Hossein Talebi

Main category: cs.CV

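TL;DR: TextSR is a diffusion model designed for multilingual scene text image super-resolution; by combining text detection, OCR, and UTF-8 encoding, it markedly improves super-resolution of text regions.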

Details

Motivation: Existing diffusion models for scene text image super-resolution localize text inaccurately and model character shapes poorly, which lowers output quality. Method: TextSR uses a text detector and OCR to extract multilingual text, converts the characters into visual shapes through a UTF-8 based text encoder and cross-attention, and adds two methods that make the model robust to OCR errors. Result: It performs strongly on the TextZoom and TextVQA datasets, setting a new benchmark for STISR. Conclusion: By integrating text character priors, TextSR markedly improves the detail and legibility of super-resolved text. Abstract: While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
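
A short illustration of why a UTF-8 based encoder suits multilingual text: any OCR string, whatever its script, reduces to a sequence of byte values in 0-255, so no per-language vocabulary is needed. The sketch below shows only that byte-level view; the helper names are hypothetical, and the assumption that TextSR's encoder consumes raw byte ids in this form is ours, since the abstract says only "UTF-8 based".

```python
def utf8_token_ids(text):
    """Return the UTF-8 byte values (0-255) of `text` as a list of ints."""
    return list(text.encode("utf-8"))


def decode_token_ids(ids):
    """Inverse of utf8_token_ids: rebuild the string from its byte values."""
    return bytes(ids).decode("utf-8")


for sample in ["A", "é", "中"]:
    ids = utf8_token_ids(sample)
    # ASCII takes one byte, accented Latin two, common CJK characters three
    print(sample, ids, decode_token_ids(ids) == sample)
```

The trade-off of a byte-level view is longer sequences for non-Latin scripts in exchange for a fixed 256-entry vocabulary.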

[46] MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation

Siyuan Wang,Jiawei Liu,Wei Wang,Yeying Jin,Jinsong Du,Zhi Han

Main category: cs.CV

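TL;DR: The paper proposes a Motion Mask-Guided Two-Stage Network (MMGT) that drives synchronized co-speech gesture video generation with audio and motion masks together, addressing the failure of audio-only methods to capture large gesture movements.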

Details

Motivation: Audio alone as a control signal struggles to capture large gesture movements in video, causing visible artifacts and distortion; existing methods usually add extra prior information, which limits practical use. Method: MMGT has two stages: (1) the SMGA network generates high-quality pose videos and motion masks from audio; (2) MM-HAA is integrated into a Stabilized Diffusion video generation model to overcome the limits of traditional methods in fine-grained motion generation and region-specific detail control. Result: Experiments show improvements in video quality, lip-sync, and gesture generation. Conclusion: With motion masks and a two-stage design, MMGT generates high-quality, detailed upper-body video and overcomes the limitations of traditional methods. Abstract: Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging due to the diversity of different parts of the body in terms of amplitude of motion, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in video, leading to more pronounced artifacts and distortions. Existing approaches typically address this issue by introducing additional a priori information, but this can limit the practical application of the task. Specifically, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, as well as motion masks and motion features generated from the audio signal, to jointly drive the generation of synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio, effectively capturing large movements in key regions such as the face and gestures. In the second stage, we integrate the Motion Masked Hierarchical Audio Attention (MM-HAA) into the Stabilized Diffusion Video Generation model, overcoming limitations in fine-grained motion generation and region-specific detail control found in traditional methods. This guarantees high-quality, detailed upper-body video generation with accurate texture and motion details. Evaluations show improved video quality, lip-sync, and gesture. The model and code are available at https://github.com/SIA-IDE/MMGT.

[47] HMAD: Advancing E2E Driving with Anchored Offset Proposals and Simulation-Supervised Multi-target Scoring

Bin Wang,Pingjun Li,Jinkun Liu,Jun Cheng,Hailong Lei,Yinze Rong,Huan-ang Gao,Kangliang Chen,Xing Pan,Weihao Gu

Main category: cs.CV

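TL;DR: The HMAD framework uses BEV-based trajectory generation and a scoring module to address trajectory diversity and optimal path selection in autonomous driving, markedly improving the driving score.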

Details

Motivation: Autonomous driving is challenged both by generating diverse, rule-compliant trajectories and by selecting the best path through multi-criteria scoring. Method: HMAD combines BEV-based trajectory generation with iterative decoding and evaluates trajectories with a simulation-supervised scoring module. Result: HMAD achieves a 44.5% driving score on the CVPR 2025 private test set. Conclusion: HMAD shows the benefit of decoupling trajectory generation from safety-aware scoring for advanced autonomous driving. Abstract: End-to-end autonomous driving faces persistent challenges in both generating diverse, rule-compliant trajectories and robustly selecting the optimal path from these options via learned, multi-faceted evaluation. To address these challenges, we introduce HMAD, a framework integrating a distinctive Bird's-Eye-View (BEV) based trajectory proposal mechanism with learned multi-criteria scoring. HMAD leverages BEVFormer and employs learnable anchored queries, initialized from a trajectory dictionary and refined via iterative offset decoding (inspired by DiffusionDrive), to produce numerous diverse and stable candidate trajectories. A key innovation, our simulation-supervised scorer module, then evaluates these proposals against critical metrics including no at-fault collisions, drivable area compliance, comfortableness, and overall driving quality (i.e., extended PDM score). Demonstrating its efficacy, HMAD achieves a 44.5% driving score on the CVPR 2025 private test set. This work highlights the benefits of effectively decoupling robust trajectory generation from comprehensive, safety-aware learned scoring for advanced autonomous driving.

[48] PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents

Haoyu Chen,Keda Tao,Yizao Wang,Xinlei Wang,Lei Zhu,Jinjin Gu

Main category: cs.CV

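TL;DR: PhotoArtAgent is an intelligent system that combines vision-language models with natural language reasoning to emulate a professional artist's creative process, offering transparent and interactive photo retouching.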

Details

Motivation: Professional artists use retouching to heighten emotional expression and narrative depth, whereas existing automated tools lack interpretability and interactive transparency. Method: The system combines vision-language models with natural language reasoning to analyze artistic needs, plan a retouching strategy, send parameters to Lightroom through an API, and refine the result iteratively. Result: It outperforms existing automated tools in user studies, with results close to those of professional artists. Conclusion: Through transparent explanations and interactive control, PhotoArtAgent delivers high-quality automated photo retouching. Abstract: Photo retouching is integral to photographic art, extending far beyond simple technical fixes to heighten emotional expression and narrative depth. While artists leverage expertise to create unique visual effects through deliberate adjustments, non-professional users often rely on automated tools that produce visually pleasing results but lack interpretative depth and interactive transparency. In this paper, we introduce PhotoArtAgent, an intelligent system that combines Vision-Language Models (VLMs) with advanced natural language reasoning to emulate the creative process of a professional artist. The agent performs explicit artistic analysis, plans retouching strategies, and outputs precise parameters to Lightroom through an API. It then evaluates the resulting images and iteratively refines them until the desired artistic vision is achieved. Throughout this process, PhotoArtAgent provides transparent, text-based explanations of its creative rationale, fostering meaningful interaction and user control. Experimental results show that PhotoArtAgent not only surpasses existing automated tools in user studies but also achieves results comparable to those of professional human artists.

[49] Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing

Tongtong Su,Chengyu Wang,Jun Huang,Dongming Lu

Main category: cs.CV

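TL;DR: The paper proposes Zero-to-Hero, a reference-based video editing method that splits editing into two stages to address the ambiguity and limited control of existing text-guided methods.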

Details

Motivation: Existing text-guided video editing methods leave user intent ambiguous and offer little fine-grained control, so a more precise and consistent method is needed. Method: The method has two stages: the Zero-Stage edits an anchor frame to serve as a reference image and uses correspondences within the original frames to guide the attention mechanism; the Hero-Stage restores the video with a conditional generative model. Result: PSNR improves by 2.6 dB over the best-performing baseline. Conclusion: Zero-to-Hero achieves higher accuracy and temporal consistency in video editing and overcomes the limitations of existing methods. Abstract: Appearance editing according to user needs is a pivotal task in video editing. Existing text-guided methods often lead to ambiguities regarding user intentions and restrict fine-grained control over editing specific aspects of objects. To overcome these limitations, this paper introduces a novel approach named Zero-to-Hero, which focuses on reference-based video editing that disentangles the editing process into two distinct problems. It achieves this by first editing an anchor frame to satisfy user requirements as a reference image and then consistently propagating its appearance across other frames. We leverage correspondence within the original frames to guide the attention mechanism, which is more robust than previously proposed optical flow or temporal modules in memory-friendly video generative models, especially when dealing with objects exhibiting large motions. It offers a solid ZERO-shot initialization that ensures both accuracy and temporal consistency. However, intervention in the attention mechanism results in compounded imaging degradation with over-saturated colors and unknown blurring issues. Starting from Zero-Stage, our Hero-Stage Holistically learns a conditional generative model for vidEo RestOration. To accurately evaluate the consistency of the appearance, we construct a set of videos with multiple appearances using Blender, enabling a fine-grained and deterministic evaluation. Our method outperforms the best-performing baseline with a PSNR improvement of 2.6 dB. The project page is at https://github.com/Tonniia/Zero2Hero.
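
For readers who want to interpret the reported 2.6 dB PSNR gain, the standard definition PSNR = 10 log10(MAX^2 / MSE) implies that a gain of g dB divides the mean squared error by 10^(g/10). The sketch below works this out; the function names and the `max_value=1.0` intensity range are illustrative assumptions, and no numbers from the paper other than 2.6 dB are used.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by `gain_db` decibels."""
    return 10.0 ** (gain_db / 10.0)


ratio = mse_ratio_for_gain(2.6)
print(f"a 2.6 dB gain divides MSE by about {ratio:.2f}")

# Sanity check on a hypothetical baseline error.
baseline_mse = 0.01
print(psnr(baseline_mse / ratio) - psnr(baseline_mse))
```

Under that definition, a 2.6 dB improvement corresponds to roughly 1.8 times lower mean squared error against the reference frames.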

[50] Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning

Jinquan Guan,Qi Chen,Lizhou Liang,Yuhang Liu,Vu Minh Hieu Phan,Minh-Son To,Jian Chen,Yutong Xie

Main category: cs.CV

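TL;DR: The paper introduces the CXRTrek dataset and the CXRTrekNet model, which simulate a radiologist's multi-stage diagnostic reasoning and address the shortcomings of existing medical AI models in clinical reasoning.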

Details

Motivation: Existing medical AI models follow a simple input-to-output pattern and ignore the sequential, context-dependent nature of diagnostic reasoning, which leads to misalignment with clinical scenarios, contextless reasoning, and untraceable errors. Method: The authors build the CXRTrek dataset, with 428,966 samples and over 11 million Q&A pairs across 8 diagnostic stages, and propose CXRTrekNet, which incorporates the clinical reasoning flow into a vision-language large model framework. Result: CXRTrekNet outperforms existing medical VLLMs on the CXRTrek benchmarks and generalizes better on five external datasets. Conclusion: The CXRTrek dataset and model fill a gap in modeling clinical reasoning in medical AI and provide an important resource for future research. Abstract: Artificial intelligence (AI)-based chest X-ray (CXR) interpretation assistants have demonstrated significant progress and are increasingly being applied in clinical settings. However, contemporary medical AI models often adhere to a simplistic input-to-output paradigm, directly processing an image and an instruction to generate a result, where the instructions may be integral to the model's architecture. This approach overlooks the modeling of the inherent diagnostic reasoning in chest X-ray interpretation. Such reasoning is typically sequential, where each interpretive stage considers the images, the current task, and the contextual information from previous stages. This oversight leads to several shortcomings, including misalignment with clinical scenarios, contextless reasoning, and untraceable errors. To fill this gap, we construct CXRTrek, a new multi-stage visual question answering (VQA) dataset for CXR interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings for the first time. CXRTrek covers 8 sequential diagnostic stages, comprising 428,966 samples and over 11 million question-answer (Q&A) pairs, with an average of 26.29 Q&A pairs per sample. Building on the CXRTrek dataset, we propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the VLLM framework. CXRTrekNet effectively models the dependencies between diagnostic stages and captures reasoning patterns within the radiological context. Trained on our dataset, the model consistently outperforms existing medical VLLMs on the CXRTrek benchmarks and demonstrates superior generalization across multiple tasks on five diverse external datasets. The dataset and model can be found in our repository (https://github.com/guanjinquan/CXRTrek).
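
As a quick consistency check on the dataset statistics quoted above, multiplying the sample count by the stated average number of Q&A pairs per sample should land just above 11 million. The variable names are ours; the two input figures are the ones given in the abstract.

```python
samples = 428_966          # samples reported in the abstract
avg_pairs = 26.29          # average Q&A pairs per sample, as reported

total_pairs = samples * avg_pairs
print(f"implied Q&A pairs: about {total_pairs / 1e6:.2f} million")
```

The product is roughly 11.28 million, which agrees with the "over 11 million" figure; the small uncertainty comes from the average being rounded to two decimals.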

[51] FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing

Jeongsol Kim,Yeobin Hong,Jong Chul Ye

Main category: cs.CV

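TL;DR: FlowAlign is an inversion-free, flow-based image editing framework that introduces a flow-matching loss to make editing trajectories more stable and consistent.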

Details

Motivation: Existing methods such as FlowEdit avoid exact latent inversion but often produce unstable editing trajectories and poor consistency with the source image. Method: FlowAlign uses a flow-matching loss as a regularizer that balances semantic alignment with structural consistency. Result: Experiments show that FlowAlign outperforms existing methods in source preservation and editing controllability. Conclusion: With the flow-matching loss, FlowAlign achieves more stable, consistent image editing and supports reverse editing. Abstract: Recent inversion-free, flow-based image editing methods such as FlowEdit leverage a pre-trained noise-to-image flow model such as Stable Diffusion 3, enabling text-driven manipulation by solving an ordinary differential equation (ODE). While the lack of exact latent inversion is a core advantage of these methods, it often results in unstable editing trajectories and poor source consistency. To address this limitation, we propose FlowAlign, a novel inversion-free flow-based framework for consistent image editing with principled trajectory control. FlowAlign introduces a flow-matching loss as a regularization mechanism to promote smoother and more stable trajectories during the editing process. Notably, the flow-matching loss is shown to explicitly balance semantic alignment with the edit prompt and structural consistency with the source image along the trajectory. Furthermore, FlowAlign naturally supports reverse editing by simply reversing the ODE trajectory, highlighting the reversible and consistent nature of the transformation. Extensive experiments demonstrate that FlowAlign outperforms existing methods in both source preservation and editing controllability.

[52] PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

Xiao Yu,Yan Fang,Xiaojie Jin,Yao Zhao,Yunchao Wei

Main category: cs.CV

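TL;DR: The paper introduces the Online Audio-Visual Event Parsing (On-AVEP) task and the Predictive Future Modeling (PreFM) framework, which improves both real-time efficiency and accuracy.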

Details

Motivation: Existing methods rely on offline processing and large models, so they cannot meet real-time requirements; a lightweight, efficient online parsing method is needed. Method: The PreFM framework combines predictive multimodal future modeling with modality-agnostic robust representations to strengthen contextual understanding. Result: On the UnAV-100 and LLP datasets, PreFM clearly outperforms existing methods with fewer parameters. Conclusion: PreFM provides an efficient solution for real-time multimodal video understanding. Abstract: Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework featured by (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show PreFM outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.

[53] LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering

Jonas Kulhanek,Marie-Julie Rakotosaona,Fabian Manhardt,Christina Tsalicoglou,Michael Niemeyer,Torsten Sattler,Songyou Peng,Federico Tombari

Main category: cs.CV

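TL;DR: The paper proposes a hierarchical level-of-detail (LOD) method for 3D Gaussian Splatting that renders large-scale scenes in real time on memory-constrained devices.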

Details

Motivation: Rendering large-scale scenes in real time on memory-constrained devices is hard; the goal is to cut rendering time and GPU memory use. Method: A hierarchical LOD representation selects subsets of Gaussians by camera distance, each level is built with depth-aware 3D smoothing, importance-based pruning, and fine-tuning, and spatial chunks are loaded dynamically to reduce memory overhead. Result: The method achieves state-of-the-art performance on outdoor and indoor datasets, lowering latency and memory requirements while keeping rendering quality high. Conclusion: The method balances rendering quality against resource use and suits real-time rendering of large-scale scenes. Abstract: In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.
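
The two selection steps described in the abstract, picking a detail level from camera distance and loading only nearby spatial chunks, can be sketched in a few lines. This is a generic illustration under our own assumptions: the distance thresholds, the loading radius, and the helper names `select_lod` and `chunks_to_load` are hypothetical, and the paper's smoothing, pruning, and opacity blending are not modeled.

```python
import bisect
import math


def select_lod(distance, thresholds):
    """Return the LOD level for a camera distance.

    `thresholds` is an ascending list of distances; level 0 is the finest
    and each threshold crossed moves one level coarser.
    """
    return bisect.bisect_right(thresholds, distance)


def chunks_to_load(camera_xy, chunk_centers, radius):
    """Indices of chunks whose centers lie within `radius` of the camera."""
    return [i for i, center in enumerate(chunk_centers)
            if math.dist(camera_xy, center) <= radius]


thresholds = [10.0, 50.0, 200.0]  # hypothetical switch distances
for d in (5.0, 60.0, 500.0):
    print(d, "-> level", select_lod(d, thresholds))

print(chunks_to_load((0.0, 0.0), [(1.0, 0.0), (10.0, 0.0), (0.0, 2.0)], radius=3.0))
```

Keeping the level lookup a binary search over a short threshold list makes the per-frame selection cost negligible next to rasterization.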

[54] Implicit Inversion turns CLIP into a Decoder

Antonio D'Orazio,Maria Rosaria Briglia,Donato Crisostomi,Dario Loi,Emanuele Rodolà,Iacopo Masi

Main category: cs.CV

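TL;DR: CLIP can synthesize images without a decoder or any training; a frequency-aware implicit neural representation and stabilization techniques unlock its generative ability.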

Details

Motivation: To explore the latent generative ability of CLIP, a discriminative model, without extra training or changes to its weights. Method: The approach uses a frequency-aware implicit neural representation, adversarially robust initialization, an Orthogonal Procrustes projection, and a blending loss. Result: It enables text-to-image generation, style transfer, and image reconstruction without modifying CLIP's weights. Conclusion: Discriminative models may hold untapped generative potential. Abstract: CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.

[55] RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer

Liu Liu,Xiaofeng Wang,Guosheng Zhao,Keyu Li,Wenkang Qin,Jiaxiong Qiu,Zheng Zhu,Guan Huang,Zhizhong Su

Main category: cs.CV

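TL;DR: RoboTransfer is a diffusion-based video generation framework for robotic data synthesis that addresses the high cost of collecting real data and the difficulty of scaling simulated data in imitation learning.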

Details

Motivation: Imitation learning is important for robotic manipulation, but collecting real-world data is costly and simulated data suffers from the sim-to-real gap. Method: RoboTransfer combines multi-view geometry with control over scene components, and ensures geometric consistency through cross-view feature interactions and global depth/normal conditions. Result: Experiments show that RoboTransfer generates multi-view videos with better geometric consistency and visual fidelity; policies trained on its data achieve relative success-rate improvements of 33.3% and 251% in the DIFF-OBJ and DIFF-ALL settings, respectively. Conclusion: RoboTransfer offers an efficient and controllable approach to robotic data synthesis and significantly improves imitation learning performance. Abstract: Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: https://horizonrobotics.github.io/robot_lab/robotransfer

[56] DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

Sungjune Park,Hyunjun Kim,Junho Kim,Seongho Kim,Yong Man Ro

Main category: cs.CV

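TL;DR: This paper proposes DIP-R1, a reinforcement-learning-based framework that improves the fine-grained visual perception of multimodal large language models (MLLMs) in complex scenes. Using three reward models, DIP-R1 significantly improves performance in complex environments such as densely crowded scenes.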

Details

Motivation: Although MLLMs perform well at visual understanding, their fine-grained perception in complex real-world scenes (such as densely crowded areas) remains limited. Inspired by the success of reinforcement learning in LLMs and MLLMs, this paper explores how RL can enhance the visual perception of MLLMs. Method: The DIP-R1 framework uses three reward models: 1) a standard reasoning reward covering three steps (understanding, observing, decision-making); 2) a variance-guided looking reward that focuses on uncertain regions; 3) a weighted precision-recall reward that improves decision accuracy. Result: DIP-R1 performs well on a range of fine-grained object detection data, significantly outperforming existing baselines and supervised fine-tuning, with consistent gains in both in-domain and out-of-domain scenarios. Conclusion: Integrating RL into MLLMs shows substantial potential for improving complex real-world perception tasks. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant visual understanding capabilities, yet their fine-grained visual perception in complex real-world scenarios, such as densely crowded public areas, remains limited. Inspired by the recent success of reinforcement learning (RL) in both LLMs and MLLMs, in this paper, we explore how RL can enhance the visual perception ability of MLLMs. We then develop a novel RL-based framework, Deep Inspection and Perception with RL (DIP-R1), designed to enhance the visual perception capabilities of MLLMs by comprehending complex scenes and looking through visual instances closely. DIP-R1 guides MLLMs through detailed inspection of the visual scene via three simply designed rule-based reward models. First, we adopt a standard reasoning reward encouraging the model to include three step-by-step processes: 1) reasoning, to understand visual scenes, 2) observing, to look closely at ambiguous regions of interest, and 3) decision-making, to predict the answer. Second, a variance-guided looking reward is designed to examine uncertain regions during the second (observing) process. It explicitly enables the model to inspect ambiguous areas, improving its ability to mitigate perceptual uncertainties. Third, we model a weighted precision-recall accuracy reward that encourages accurate decision-making. We explore its effectiveness across diverse fine-grained object detection data consisting of challenging real-world environments, such as densely crowded scenes. Built upon existing MLLMs, DIP-R1 achieves consistent and significant improvement across various in-domain and out-of-domain scenarios. It also outperforms various existing baseline models and supervised fine-tuning methods. Our findings highlight the substantial potential of integrating RL into MLLMs for enhancing capabilities in complex real-world perception tasks.
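
The abstract does not give the formula for the weighted precision-recall reward, so the following is only a sketch of what a rule-based reward of that kind could look like. The greedy IoU matching, the 0.5 threshold, and the weights `w_p` and `w_r` are all assumptions made for illustration, not the paper's definition.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_reward(pred, gt, w_p=0.5, w_r=0.5, thr=0.5):
    """Hypothetical reward: w_p * precision + w_r * recall.

    Each predicted box is greedily matched to the unmatched ground-truth
    box with the highest IoU, provided that IoU is at least `thr`.
    """
    matched = set()
    true_positives = 0
    for p in pred:
        best_j, best_iou = None, thr
        for j, g in enumerate(gt):
            if j in matched:
                continue
            value = iou(p, g)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gt) if gt else 0.0
    return w_p * precision + w_r * recall
```

Raising `w_r` relative to `w_p` would favour finding every object in a crowded scene over avoiding false positives, which is one plausible reason to weight the two terms.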

Junyi Guo,Jingxuan Zhang,Fangyu Wu,Huanda Lu,Qiufeng Wang,Wenmian Yang,Eng Gee Lim,Dongming Lu

Main category: cs.CV

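TL;DR: The paper introduces a new task, FS2RG, which generates realistic garment images from flat sketches and textual guidance, and proposes the HiGarment framework to address its challenges.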

Details

Motivation: Garment synthesis has been studied for the design phase, but the production process remains underexplored; the FS2RG task aims to fill this gap. Method: The HiGarment framework comprises a multi-modal semantic enhancement mechanism and a harmonized cross-attention mechanism that balances sketch and text information. Result: Experiments and user studies validate the effectiveness of HiGarment; an open-source dataset is also collected for release. Conclusion: HiGarment addresses the challenges of the FS2RG task and provides a new approach to garment synthesis. Abstract: Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.

[58] Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks

Run Hao,Peng Ying

Main category: cs.CV

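TL;DR: Proposes an adversarial prompt generation framework based on a grammar tree and Monte Carlo tree search to evade AIGC detectors; the approach ranked first in a competition.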

Details

Motivation: Human portraits generated by text-to-image models can be misused, and existing AIGC detectors are not robust enough. Method: Uses a grammar tree structure and a Monte Carlo tree search algorithm to systematically explore the semantic prompt space and generate diverse, controllable adversarial prompts. Result: The method is validated on multiple T2I models and ranked first in a real-world competition. Conclusion: Beyond attack scenarios, the method can be used to build high-quality adversarial datasets that support the development of more robust AIGC detection and defense systems. Abstract: The rise of text-to-image (T2I) models has enabled the synthesis of photorealistic human portraits, raising serious concerns about identity misuse and the robustness of AIGC detectors. In this work, we propose an automated adversarial prompt generation framework that leverages a grammar tree structure and a variant of the Monte Carlo tree search algorithm to systematically explore the semantic prompt space. Our method generates diverse, controllable prompts that consistently evade both open-source and commercial AIGC detectors. Extensive experiments across multiple T2I models validate its effectiveness, and the approach ranked first in a real-world adversarial AIGC detection competition. Beyond attack scenarios, our method can also be used to construct high-quality adversarial datasets, providing valuable resources for training and evaluating more robust AIGC detection and defense systems.

[59] Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images

Sungjune Park,Hyunjun Kim,Beomchan Park,Yong Man Ro

Main category: cs.CV

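TL;DR: The paper proposes LANGO, a language-guided object detection framework that addresses detection challenges in aerial images caused by illumination and viewpoint variations.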

Details

Motivation: Object detection in aerial images is challenged by multiple variations (such as illumination and viewpoint), which make object localization and classification difficult. Method: A visual semantic reasoner and a relation learning loss are designed to handle scene-level and instance-level variations, respectively. Result: Experiments show that the method noticeably improves detection performance. Conclusion: The LANGO framework mitigates the effect of variations in aerial images and improves object detection. Abstract: Despite recent advancements in computer vision research, object detection in aerial images still suffers from several challenges. One primary challenge to be mitigated is the presence of multiple types of variation in aerial images, for example, illumination and viewpoint changes. These variations result in highly diverse image scenes and drastic alterations in object appearance, so that it becomes more complicated to localize objects from the whole image scene and recognize their categories. To address this problem, in this paper, we introduce a novel object detection framework in aerial images, named LANGuage-guided Object detection (LANGO). Upon the proposed language-guided learning, the proposed framework is designed to alleviate the impacts from both scene and instance-level variations. First, we are motivated by the way humans understand the semantics of scenes while perceiving environmental factors in the scenes (e.g., weather). Therefore, we design a visual semantic reasoner that comprehends visual semantics of image scenes by interpreting conditions where the given images were captured. Second, we devise a training objective, named relation learning loss, to deal with instance-level variations, such as viewpoint angle and scale changes. This training objective aims to learn relations in language representations of object categories, with the help of the robust characteristics against such variations. Through extensive experiments, we demonstrate the effectiveness of the proposed method, and our method obtains noticeable detection performance improvements.

[60] WTEFNet: Real-Time Low-Light Object Detection for Advanced Driver-Assistance Systems

Hao Wu,Junzhou Chen,Ronghui Zhang,Nengchao Lyu,Hongyu Hu,Yanyong Guo,Tony Z. Qiu

Main category: cs.CV

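TL;DR: WTEFNet is a real-time object detection framework designed for low-light scenes; it combines low-light enhancement, wavelet-based feature extraction, and adaptive fusion detection modules, and performs well on multiple datasets.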

Details

Motivation: RGB cameras degrade in low-light conditions; the goal is to improve the environmental perception of ADAS. Method: WTEFNet has three core modules: Low-Light Enhancement (LLE), Wavelet-based Feature Extraction (WFE), and Adaptive Fusion Detection (AFFD). LLE improves image quality, WFE separates high- and low-frequency features, and AFFD fuses semantic and illumination features. Result: State-of-the-art accuracy on the BDD100K, SHIFT, nuScenes, and GSN datasets, with real-time operation verified on an embedded platform. Conclusion: WTEFNet performs well under low-light conditions and is suitable for real-time ADAS applications. Abstract: Object detection is a cornerstone of environmental perception in advanced driver assistance systems (ADAS). However, most existing methods rely on RGB cameras, which suffer from significant performance degradation under low-light conditions due to poor image quality. To address this challenge, we propose WTEFNet, a real-time object detection framework specifically designed for low-light scenarios, with strong adaptability to mainstream detectors. WTEFNet comprises three core modules: a Low-Light Enhancement (LLE) module, a Wavelet-based Feature Extraction (WFE) module, and an Adaptive Fusion Detection (AFFD) module. The LLE enhances dark regions while suppressing overexposed areas; the WFE applies multi-level discrete wavelet transforms to isolate high- and low-frequency components, enabling effective denoising and structural feature retention; the AFFD fuses semantic and illumination features for robust detection. To support training and evaluation, we introduce GSN, a manually annotated dataset covering both clear and rainy night-time scenes. Extensive experiments on BDD100K, SHIFT, nuScenes, and GSN demonstrate that WTEFNet achieves state-of-the-art accuracy under low-light conditions. Furthermore, deployment on an embedded platform (NVIDIA Jetson AGX Orin) confirms the framework's suitability for real-time ADAS applications.
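
As an illustration of how a multi-level discrete wavelet transform separates low- and high-frequency components, here is a minimal 1D Haar sketch in plain Python. The abstract does not name the wavelet family or describe how the transform is applied to feature maps, so the Haar choice and the 1D setting are assumptions made purely for clarity.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """One Haar level: return (approximation, detail) for an even-length list."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert one Haar level, recovering the original samples."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


def multilevel_haar(signal, levels):
    """Apply `levels` Haar steps to the running approximation.

    The input length must be divisible by 2 ** levels. Returns the coarsest
    approximation plus the detail bands, finest first.
    """
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return approx, details
```

The approximation band carries the smooth structure, while the detail bands carry edges and noise; shrinking small detail coefficients before inverting is the usual route to wavelet denoising.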

[61] HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers

Aldino Rizaldy,Richard Gloaguen,Fabian Ewald Fassnacht,Pedram Ghamisi

Main category: cs.CV

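TL;DR: Proposes a multimodal fusion method based on 3D point clouds that uses a dual-branch Transformer to learn geometric and spectral features directly, strengthens feature fusion with cross-attention, and shows competitive results on several datasets.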

Details

Motivation: Existing methods mostly reduce 3D data to 2D for processing, which fails to exploit the potential of 3D data and limits both 3D feature learning and the generation of 3D predictions. Method: A fully 3D approach fuses multimodal data within the 3D point cloud and uses a dual-branch Transformer with cross-attention to fuse features at multiple scales. Result: On DFC2018 and other datasets, 3D fusion is competitive and can produce 3D predictions, offering more flexibility than 2D methods. Conclusion: 3D fusion delivers competitive performance and provides 3D predictions that 2D methods cannot, giving it broader application potential. Abstract: Multimodal remote sensing data, including spectral and lidar or photogrammetry, is crucial for achieving satisfactory land-use / land-cover classification results in urban scenes. So far, most studies have been conducted in a 2D context. When 3D information is available in the dataset, it is typically integrated with the 2D data by rasterizing the 3D data into 2D formats. Although this method yields satisfactory classification results, it falls short in fully exploiting the potential of 3D data by restricting the model's ability to learn 3D spatial features directly from raw point clouds. Additionally, it limits the generation of 3D predictions, as the dimensionality of the input data has been reduced. In this study, we propose a fully 3D-based method that fuses all modalities within the 3D point cloud and employs a dedicated dual-branch Transformer model to simultaneously learn geometric and spectral features. To enhance the fusion process, we introduce a cross-attention-based mechanism that fully operates on 3D points, effectively integrating features from various modalities across multiple scales. The purpose of cross-attention is to allow one modality to assess the importance of another by weighing the relevant features. We evaluated our method by comparing it against both 3D and 2D methods using the 2018 IEEE GRSS Data Fusion Contest (DFC2018) dataset. Our findings indicate that 3D fusion delivers competitive results compared to 2D methods and offers more flexibility by providing 3D predictions. These predictions can be projected onto 2D maps, a capability that is not feasible in reverse. Additionally, we evaluated our method on different datasets, specifically the ISPRS Vaihingen 3D and the IEEE 2019 Data Fusion Contest. Our code will be published here: https://github.com/aldinorizaldy/hyperpointformer.
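
The statement that cross-attention lets one modality weigh the features of another can be illustrated with generic scaled dot-product cross-attention. The sketch below is not the HyperPointFormer architecture: the single head, the projection matrices `Wq`, `Wk`, `Wv`, and the choice of which branch supplies the queries are assumptions for illustration.

```python
import numpy as np


def cross_attention(query_feats, context_feats, Wq, Wk, Wv):
    """Single-head cross-attention between two per-point feature sets.

    query_feats:   (n_q, d) features from one branch (e.g. geometric).
    context_feats: (n_c, d) features from the other branch (e.g. spectral).
    Returns the attended features (n_q, d_v) and the weights (n_q, n_c).
    """
    Q = query_feats @ Wq
    K = context_feats @ Wk
    V = context_feats @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Each row of `weights` sums to one and describes how strongly one query point draws on each point of the other modality; a dual-branch design would typically run this in both directions.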

[62] Navigating the Accuracy-Size Trade-Off with Flexible Model Merging

Akash Dhasade,Divyansh Jhunjhunwala,Milos Vujasinovic,Gauri Joshi,Anne-Marie Kermarrec

Main category: cs.CV

TL;DR: FlexMerge is a data-free model merging framework that flexibly produces merged models of different sizes to balance accuracy against deployment cost.

Details

Motivation: Merging into a single model loses accuracy, while deploying every fine-tuned model is costly. Method: Treats fine-tuned models as sequences of blocks and merges them progressively, supporting multiple merging algorithms. Result: Experiments show that slightly larger merged models can substantially improve accuracy. Conclusion: FlexMerge offers a flexible, efficient, data-free solution for diverse deployment scenarios. Abstract: Model merging has emerged as an efficient method to combine multiple single-task fine-tuned models. The merged model can enjoy multi-task capabilities without expensive training. While promising, merging into a single model often suffers from an accuracy gap with respect to individual fine-tuned models. On the other hand, deploying all individual fine-tuned models incurs high costs. We propose FlexMerge, a novel data-free model merging framework to flexibly generate merged models of varying sizes, spanning the spectrum from a single merged model to retaining all individual fine-tuned models. FlexMerge treats fine-tuned models as collections of sequential blocks and progressively merges them using any existing data-free merging method, halting at a desired size. We systematically explore the accuracy-size trade-off exhibited by different merging algorithms in combination with FlexMerge. Extensive experiments on vision and NLP benchmarks, with up to 30 tasks, reveal that even modestly larger merged models can provide substantial accuracy improvements over a single model. By offering fine-grained control over fused model size, FlexMerge provides a flexible, data-free, and high-performance solution for diverse deployment scenarios.

[63] SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection

Wenhao Xu,Shuchen Zheng,Changwei Wang,Zherui Zhang,Chuan Ren,Rongtao Xu,Shibiao Xu

Main category: cs.CV

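TL;DR: SAMamba combines SAM2's hierarchical feature learning with Mamba's selective sequence modeling and introduces the FS-Adapter, CSI, and DPCF modules, significantly improving infrared small target detection.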

Details

Motivation: Infrared small target detection is critical in military and other domains, but existing deep learning methods suffer from information loss and inefficient global context modeling. Method: The SAMamba framework includes an FS-Adapter for domain adaptation, a CSI module for efficient global context modeling, and a DPCF module for multi-scale feature fusion. Result: Significantly outperforms existing methods on multiple datasets, especially with complex backgrounds and multi-scale targets. Conclusion: Through domain adaptation, detail preservation, and efficient long-range dependency modeling, SAMamba addresses the core challenges of infrared small target detection. Abstract: Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents SAMamba, a novel framework integrating SAM2's hierarchical feature learning with Mamba's selective sequence modeling. Key innovations include: (1) A Feature Selection Adapter (FS-Adapter) for efficient natural-to-infrared domain adaptation via dual-stage selection (token-level with a learnable task embedding and channel-wise adaptive transformations); (2) A Cross-Channel State-Space Interaction (CSI) module for efficient global context modeling with linear complexity using selective state space modeling; and (3) A Detail-Preserving Contextual Fusion (DPCF) module that adaptively combines multi-scale features with a gating mechanism to balance high-resolution and low-resolution feature contributions. SAMamba addresses core ISTD challenges by bridging the domain gap, maintaining fine-grained details, and efficiently modeling long-range dependencies. Experiments on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets show SAMamba significantly outperforms state-of-the-art methods, especially in challenging scenarios with heterogeneous backgrounds and varying target scales. Code: https://github.com/zhengshuchen/SAMamba.

[64] UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes

Yixun Liang,Kunming Luo,Xiao Chen,Rui Chen,Hongyu Yan,Weiyu Li,Jiarui Liu,Ping Tan

Main category: cs.CV

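TL;DR: UniTEX is a novel two-stage 3D texture generation framework that operates directly in a unified 3D functional space, avoiding the limitations of UV mapping and producing high-quality, consistent 3D textures.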

Details

Motivation: Existing methods rely on UV mapping and image reprojection, which leads to topological ambiguity. UniTEX aims to bypass these limitations by generating textures directly in 3D space. Method: 1. Lift texture generation into 3D space via Texture Functions (TFs); 2. predict TFs from image and geometry inputs with a transformer-based Large Texturing Model (LTM); 3. adapt 2D priors with a LoRA-based strategy for high-quality multi-view texture synthesis. Result: Experiments show that UniTEX outperforms existing methods in visual quality and texture integrity, offering a scalable solution for automated 3D texture generation. Conclusion: UniTEX provides a general and efficient solution for 3D texture generation; the authors state that code will be made available. Abstract: We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based inpainting to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we propose to bypass the limitations of UV mapping by operating directly in a unified 3D functional space. Specifically, we first propose to lift texture generation into 3D space via Texture Functions (TFs)--a continuous, volumetric representation that maps any 3D point to a texture value based solely on surface proximity, independent of mesh topology. Then, we propose to predict these TFs directly from images and geometry inputs using a transformer-based Large Texturing Model (LTM). To further enhance texture quality and leverage powerful 2D priors, we develop an advanced LoRA-based strategy for efficiently adapting large-scale Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation. Code will be available at: https://github.com/YixunLiang/UniTEX.

[65] Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

Zheng Sun,Yi Wei,Long Yu

Main category: cs.CV

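TL;DR: This paper proposes a complete solution, covering both data and methodology, to improve the aesthetic reasoning of multimodal large language models (MLLMs) in medical image screening. With a dataset of 1500+ samples and a reinforcement learning method (DPA-GRPO), it significantly surpasses existing large models.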

Details

Motivation: Current MLLMs perform poorly on image screening, mainly because of a lack of data and weak aesthetic reasoning ability. This paper aims to address both problems. Method: Collects a medical image dataset of 1500+ samples and uses long chains of thought (CoT) together with a reinforcement learning method (DPA-GRPO) to improve model ability. Result: Experiments show that even state-of-the-art closed-source MLLMs (such as GPT-4o and Qwen-VL-Max) perform close to random guessing on aesthetic reasoning, while the proposed method significantly surpasses them. Conclusion: The proposed solution offers a new configuration for image aesthetic reasoning tasks and may see wide use in the future. Abstract: Multimodal Large Language Models (MLLMs) are widely applied across many domains, such as multimodal understanding and generation. With the development of diffusion models (DM) and unified MLLMs, the performance of image generation has been significantly improved. However, the study of image screening is rare, and its performance with MLLMs is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability of MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive medical image screening dataset with 1500+ samples; each sample consists of a medical image, four generated images, and a multiple-choice answer. The dataset evaluates the aesthetic reasoning ability under four aspects: (1) Appearance Deformation, (2) Principles of Physical Lighting and Shadow, (3) Placement Layout, (4) Extension Rationality. For methodology, we utilize long chains of thought (CoT) and Group Relative Policy Optimization with Dynamic Proportional Accuracy reward, called DPA-GRPO, to enhance the image aesthetic reasoning ability of MLLMs. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT-4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the reinforcement learning approach, we are able to surpass the score of both large-scale models and leading closed-source models using a much smaller model. We hope our attempt on medical image screening will serve as a regular configuration in image aesthetic reasoning in the future.
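
The "group relative" part of Group Relative Policy Optimization can be sketched briefly: several responses are sampled for the same prompt, and each response's reward is normalized against the group's mean and standard deviation. The code below shows only that normalization. The Dynamic Proportional Accuracy reward itself is not specified in the abstract and is not modelled here; the use of the population standard deviation and the `eps` term are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within one sampled group.

    advantage_i = (reward_i - mean(rewards)) / (std(rewards) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat the group average receive a positive advantage and are reinforced, while the rest are discouraged, so no separate value network is needed to provide a baseline.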

[66] Unsupervised Transcript-assisted Video Summarization and Highlight Detection

Spyros Barbakos,Charalampos Antoniadis,Gerasimos Potamianos,Gianluca Setti

Main category: cs.CV

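TL;DR: This paper proposes a multimodal reinforcement learning framework that combines video frames and text transcripts for video summarization and highlight detection, outperforming methods that rely on visual content alone.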

Details

Motivation: Video consumption is a key part of daily life, but watching entire videos can be tedious. Existing methods do not combine video frames and transcripts within a reinforcement learning framework. Method: A multimodal pipeline combines video frames and transcripts and is trained with reinforcement learning to produce diverse, representative summaries and highlight segments. Result: Experiments show that using the transcript in video summarization and highlight detection outperforms relying on visual content alone. Conclusion: The multimodal reinforcement learning framework improves video summarization and highlight detection and can be trained on large-scale unannotated data. Abstract: Video consumption is a key part of daily life, but watching entire videos can be tedious. To address this, researchers have explored video summarization and highlight detection to identify key video segments. While some works combine video frames and transcripts, and others tackle video summarization and highlight detection using Reinforcement Learning (RL), no existing work, to the best of our knowledge, integrates both modalities within an RL framework. In this paper, we propose a multimodal pipeline that leverages video frames and their corresponding transcripts to generate a more condensed version of the video and detect highlights using a modality fusion mechanism. The pipeline is trained within an RL framework, which rewards the model for generating diverse and representative summaries while ensuring the inclusion of video segments with meaningful transcript content. The unsupervised nature of the training allows for learning from large-scale unannotated datasets, overcoming the challenge posed by the limited size of existing annotated datasets. Our experiments show that using the transcript in video summarization and highlight detection achieves superior results compared to relying solely on the visual content of the video.

[67] LADA: Scalable Label-Specific CLIP Adapter for Continual Learning

Mao-Lin Luo,Zi-Hao Zhou,Tong Wei,Min-Ling Zhang

Main category: cs.CV

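TL;DR: LADA (Label-specific ADApter) appends lightweight, label-specific memory units to the frozen CLIP image encoder, avoiding the parameter-selection errors of existing CLIP-based continual learning methods, and uses feature distillation to prevent catastrophic forgetting.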

Details

Motivation: Existing CLIP-based continual learning methods assign a partial set of parameters to each task, so parameter selection at inference is error-prone and degrades performance. Method: LADA appends label-specific memory units after the frozen CLIP image encoder and uses feature distillation to keep new classes from interfering with the features of old classes. Result: LADA achieves state-of-the-art performance on continual learning tasks. Conclusion: With its lightweight design and feature distillation, LADA addresses both parameter selection and forgetting in continual learning. Abstract: Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging its transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to error that degrades performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The implementation code is available at https://github.com/MaolinLuo/LADA.

[68] Are MLMs Trapped in the Visual Room?

Yazhou Zhang,Chunwang Zou,Qimeng Liu,Lu Rong,Ben Yao,Zheng Lian,Qiuchi Li,Peng Zhang,Jing Qin

Main category: cs.CV

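TL;DR: The paper proposes the "Visual Room" argument, questioning whether multi-modal large models (MLMs) truly understand images, and tests their limitations with a two-tier evaluation framework spanning perception and cognition.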

Details

Motivation: Inspired by Searle's Chinese Room, the paper asks whether MLMs merely process visual input by algorithmic rules without genuine understanding. Method: Proposes a two-tier evaluation framework (perception and cognition) and builds a high-quality multi-modal sarcasm dataset (924 static images and 100 dynamic videos). Result: MLMs perform well on perception tasks, but show an average error rate of 16.1% in sarcasm understanding, revealing a significant gap between "seeing" and "understanding". Conclusion: The study provides empirical support for the Visual Room argument and proposes a new evaluation paradigm for MLMs. Abstract: Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the Visual Room argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. In implementation, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual contents, while the cognitive component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising 924 static images and 100 dynamic videos. All sarcasm labels are annotated by the original authors and verified by independent reviewers to ensure clarity and consistency. We evaluate eight state-of-the-art (SoTA) MLMs. Our results highlight three key findings: (1) MLMs perform well on perception tasks; (2) even with correct perception, models exhibit an average error rate of approximately 16.1% in sarcasm understanding, revealing a significant gap between seeing and understanding; (3) error analysis attributes this gap to deficiencies in emotional reasoning, commonsense inference, and context alignment. This work provides empirical grounding for the proposed Visual Room argument and offers a new evaluation paradigm for MLMs.

[69] Holistic Large-Scale Scene Reconstruction via Mixed Gaussian Splatting

Chuandong Liu,Huijiao Wang,Lei Yu,Gui-Song Xia

Main category: cs.CV

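TL;DR: MixGS is a holistic optimization framework for large-scale 3D scene reconstruction that addresses the loss of global information and the complex parameter tuning caused by existing divide-and-conquer methods.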

Details

Motivation: Existing large-scale scene reconstruction methods rely on divide-and-conquer strategies, which lose global information and require complex parameter tuning. Method: MixGS integrates camera pose and Gaussian attributes into a view-aware representation and introduces a mixing operation that jointly preserves global coherence and local fidelity. Result: Experiments show that MixGS achieves state-of-the-art rendering quality and competitive speed while significantly reducing computational requirements, allowing training on a single GPU with 24GB of VRAM. Conclusion: MixGS provides an efficient, high-quality solution for large-scale 3D scene reconstruction. Abstract: Recent advances in 3D Gaussian Splatting have shown remarkable potential for novel view synthesis. However, most existing large-scale scene reconstruction methods rely on the divide-and-conquer paradigm, which often leads to the loss of global scene information and requires complex parameter tuning due to scene partitioning and local optimization. To address these limitations, we propose MixGS, a novel holistic optimization framework for large-scale 3D scene reconstruction. MixGS models the entire scene holistically by integrating camera pose and Gaussian attributes into a view-aware representation, which is decoded into fine-detailed Gaussians. Furthermore, a novel mixing operation combines decoded and original Gaussians to jointly preserve global coherence and local fidelity. Extensive experiments on large-scale scenes demonstrate that MixGS achieves state-of-the-art rendering quality and competitive speed, while significantly reducing computational requirements, enabling large-scale scene reconstruction training on a single 24GB VRAM GPU. The code will be released at https://github.com/azhuantou/MixGS.

[70] RSFAKE-1M: A Large-Scale Dataset for Detecting Diffusion-Generated Remote Sensing Forgeries

Zhihong Tan,Jiayi Wang,Huiying Shi,Binyuan Huang,Hongchen Wei,Zhenzhong Chen

Main category: cs.CV

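TL;DR: The paper introduces RSFAKE-1M, a dataset for detecting forged remote sensing images generated by diffusion models, filling a gap in existing research.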

Details

Motivation: Remote sensing imagery is vital in areas such as environmental monitoring, but existing forgery detection work mainly targets GAN-generated or natural images, and diffusion-generated forgeries are understudied. Method: Builds RSFAKE-1M, a dataset of 500K forged and 500K real remote sensing images; the forgeries are generated by ten diffusion models under a range of generation conditions. Result: Experiments show that current methods have limited success in detecting diffusion-generated remote sensing forgeries, while models trained on RSFAKE-1M generalize better and are more robust. Conclusion: RSFAKE-1M provides an important foundation for research on remote sensing forgery detection and supports the development of next-generation detection methods. Abstract: Detecting forged remote sensing images is becoming increasingly critical, as such imagery plays a vital role in environmental monitoring, urban planning, and national security. While diffusion models have emerged as the dominant paradigm for image generation, their impact on remote sensing forgery detection remains underexplored. Existing benchmarks primarily target GAN-based forgeries or focus on natural images, limiting progress in this critical domain. To address this gap, we introduce RSFAKE-1M, a large-scale dataset of 500K forged and 500K real remote sensing images. The fake images are generated by ten diffusion models fine-tuned on remote sensing data, covering six generation conditions such as text prompts, structural guidance, and inpainting. This paper presents the construction of RSFAKE-1M along with a comprehensive experimental evaluation using both existing detectors and unified baselines. The results reveal that diffusion-based remote sensing forgeries remain challenging for current methods, and that models trained on RSFAKE-1M exhibit notably improved generalization and robustness. Our findings underscore the importance of RSFAKE-1M as a foundation for developing and evaluating next-generation forgery detection approaches in the remote sensing domain. The dataset and other supplementary materials are available at https://huggingface.co/datasets/TZHSW/RSFAKE/.

[71] GenCAD-Self-Repairing: Feasibility Enhancement for 3D CAD Generation

Chikaha Tsuji,Enrique Flores Medina,Harshit Gupta,Md Ferdous Alam

Main category: cs.CV

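TL;DR: GenCAD-Self-Repairing improves the feasibility of generated CAD models through diffusion guidance and a self-repairing pipeline, converting two-thirds of infeasible designs into feasible ones.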

Details

Motivation: About 10% of the designs generated by GenCAD are infeasible, which limits its practical use. Method: Combines guided diffusion denoising with a regression-based correction mechanism to refine CAD command sequences. Result: Converts two-thirds of infeasible designs into feasible ones while maintaining geometric accuracy. Conclusion: The method significantly improves the feasibility of generated CAD models, expands the availability of high-quality training data, and broadens the applicability of AI-driven CAD generation. Abstract: With the advancement of generative AI, research on its application to 3D model generation has gained traction, particularly in automating the creation of Computer-Aided Design (CAD) files from images. GenCAD is a notable model in this domain, leveraging an autoregressive transformer-based architecture with a contrastive learning framework to generate CAD programs. However, a major limitation of GenCAD is its inability to consistently produce feasible boundary representations (B-reps), with approximately 10% of generated designs being infeasible. To address this, we propose GenCAD-Self-Repairing, a framework that enhances the feasibility of generative CAD models through diffusion guidance and a self-repairing pipeline. This framework integrates a guided diffusion denoising process in the latent space and a regression-based correction mechanism to refine infeasible CAD command sequences while preserving geometric accuracy. Our approach successfully converted two-thirds of infeasible designs in the baseline method into feasible ones, significantly improving the feasibility rate while simultaneously maintaining a reasonable level of geometric accuracy between the point clouds of ground truth models and generated models. By significantly improving the feasibility rate of generating CAD models, our approach helps expand the availability of high-quality training data and enhances the applicability of AI-driven CAD generation in manufacturing, architecture, and product design.

[72] Federated Unsupervised Semantic Segmentation

Evangelos Charalampakis,Vasileios Mygdalis,Ioannis Pitas

Main category: cs.CV

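TL;DR: FUSS is the first fully decentralized, unsupervised federated learning framework for semantic image segmentation; it promotes consistency in feature and prototype space and significantly outperforms conventional methods.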

Details

Motivation: Address the challenge of aligning features and cluster centroids across clients in federated unsupervised semantic image segmentation, particularly under heterogeneous data distributions and without supervision. Method: Proposes the FUSS framework, which jointly optimizes local segmentation heads and shared semantic centroids to promote consistency in feature and prototype space. Result: On benchmark and real-world datasets, FUSS outperforms local-only training and classical federated learning methods. Conclusion: FUSS offers an effective solution for unsupervised federated semantic segmentation; the code will be released to support reproducibility. Abstract: This work explores the application of Federated Learning (FL) in Unsupervised Semantic image Segmentation (USS). Recent USS methods extract pixel-level features using frozen visual foundation models and refine them through self-supervised objectives that encourage semantic grouping. These features are then grouped into semantic clusters to produce segmentation masks. Extending these ideas to federated settings requires feature representation and cluster centroid alignment across distributed clients -- an inherently difficult task under heterogeneous data distributions in the absence of supervision. To address this, we propose FUSS (Federated Unsupervised image Semantic Segmentation), which is, to our knowledge, the first framework to enable fully decentralized, label-free semantic segmentation training. FUSS introduces novel federation strategies that promote global consistency in feature and prototype space, jointly optimizing local segmentation heads and shared semantic centroids. Experiments on both benchmark and real-world datasets, including binary and multi-class segmentation tasks, show that FUSS consistently outperforms local-only client trainings as well as extensions of classical FL algorithms under varying client data distributions. To support reproducibility, full code will be released upon manuscript acceptance.

[73] TRACE: Trajectory-Constrained Concept Erasure in Diffusion Models

Finn Carter

Main category: cs.CV

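TL;DR: TRACE is a new method for erasing specific concepts from diffusion models while preserving generation quality. It combines a theoretical framework with a fine-tuning procedure and performs strongly on multiple benchmarks.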

Details

Motivation: Diffusion models can generate undesirable content (e.g., pornography, sensitive identities, copyrighted styles), raising privacy, fairness, and safety concerns; concept erasure aims to address this. Method: TRACE combines a theoretical framework with a fine-tuning procedure: it modifies the cross-attention layers and introduces a trajectory-aware objective that steers away from the target concept in the late sampling stages. Result: TRACE achieves the best performance on multiple benchmarks, outperforming methods such as ANT, EraseAnything, and MACE. Conclusion: TRACE effectively erases target concepts while preserving generation quality, offering a solution for the safe use of diffusion models. Abstract: Text-to-image diffusion models have shown unprecedented generative capability, but their ability to produce undesirable concepts (e.g., pornographic content, sensitive identities, copyrighted styles) poses serious concerns for privacy, fairness, and safety. Concept erasure aims to remove or suppress specific concept information in a generative model. In this paper, we introduce TRACE (Trajectory-Constrained Attentional Concept Erasure), a novel method to erase targeted concepts from diffusion models while preserving overall generative quality. Our approach combines a rigorous theoretical framework, establishing formal conditions under which a concept can be provably suppressed in the diffusion process, with an effective fine-tuning procedure compatible with both conventional latent diffusion (Stable Diffusion) and emerging rectified flow models (e.g., FLUX). We first derive a closed-form update to the model's cross-attention layers that removes hidden representations of the target concept. We then introduce a trajectory-aware finetuning objective that steers the denoising process away from the concept only in the late sampling stages, thus maintaining the model's fidelity on unrelated content. Empirically, we evaluate TRACE on multiple benchmarks used in prior concept erasure studies (object classes, celebrity faces, artistic styles, and explicit content from the I2P dataset). TRACE achieves state-of-the-art performance, outperforming recent methods such as ANT, EraseAnything, and MACE in terms of removal efficacy and output quality.

[74] Adversarial Semantic and Label Perturbation Attack for Pedestrian Attribute Recognition

Weizhe Kong,Xiao Wang,Ruichong Gao,Chenglong Li,Yu Zhang,Xing Yang,Yaowei Wang,Jin Tang

Main category: cs.CV

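TL;DR: The paper proposes ASL-PAR, the first adversarial attack and defense framework for pedestrian attribute recognition (PAR). It combines global and patch-level attacks with a semantic offset defense strategy, and experiments confirm its effectiveness.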

Details

Motivation: Although PAR has advanced with deep neural networks, its anti-interference ability and potential vulnerabilities have not been fully studied; this paper aims to fill that gap. Method: Built on a CLIP-based PAR framework in which a multi-modal Transformer fuses vision and text features, the attack generates adversarial noise through semantic and label perturbation, and a defense strategy is designed alongside it. Result: The attack and defense strategies are validated on multiple datasets in both digital and physical domains. Conclusion: ASL-PAR provides an effective framework for adversarial attack and defense in PAR; the source code will be released. Abstract: Pedestrian Attribute Recognition (PAR) is an indispensable task in human-centered research and has made great progress in recent years with the development of deep neural networks. However, the potential vulnerability and anti-interference ability have still not been fully explored. To bridge this gap, this paper proposes the first adversarial attack and defense framework for pedestrian attribute recognition. Specifically, we exploit both global- and patch-level attacks on the pedestrian images, based on the pre-trained CLIP-based PAR framework. It first divides the input pedestrian image into non-overlapping patches and embeds them into feature embeddings using a projection layer. Meanwhile, the attribute set is expanded into sentences using prompts and embedded into attribute features using a pre-trained CLIP text encoder. A multi-modal Transformer is adopted to fuse the obtained vision and text tokens, and a feed-forward network is utilized for attribute recognition. Based on the aforementioned PAR framework, we adopt the adversarial semantic and label-perturbation to generate the adversarial noise, termed ASL-PAR. We also design a semantic offset defense strategy to suppress the influence of adversarial attacks. Extensive experiments conducted on both digital domains (i.e., PETA, PA100K, MSP60K, RAPv2) and physical domains fully validated the effectiveness of our proposed adversarial attack and defense strategies for the pedestrian attribute recognition. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR.

[75] Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

Hengyuan Cao,Yutong Feng,Biao Gong,Yijing Tian,Yunhong Lu,Chuang Liu,Bin Wang

Main category: cs.CV

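TL;DR: The paper proposes DRA-Ctrl, a paradigm for video-to-image knowledge compression and task adaptation that uses the strengths of video models (such as long-range context modeling and full attention) to support controllable image generation.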

Details

Motivation: Explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation, making use of video models' ability to capture dynamic, continuous change. Method: Proposes the DRA-Ctrl paradigm, which includes a mixup-based transition strategy to bridge the gap between video frames and image generation, and a redesigned attention structure with a masking mechanism to align text prompts with image-level control. Result: Experiments show that repurposed video models outperform models trained directly on images across a range of image generation tasks, demonstrating the potential of video models for visual applications. Conclusion: DRA-Ctrl offers a new way to reuse resource-intensive video models and lays a foundation for unified generative models across visual modalities. Abstract: Video generative models can be regarded as world simulators due to their ability to capture dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed Dimension-Reduction Attack (DRA-Ctrl), which utilizes the strengths of video models, including long-range context modeling and flattened full-attention, to perform various generation tasks. Specifically, to address the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays a foundation for future unified generative models across visual modalities. The project page is https://dra-ctrl-2025.github.io/DRA-Ctrl/.

[76] Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization

Matteo Gallici,Haitz Sáez de Ocáriz Borde

Main category: cs.CV

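TL;DR: Fine-tunes pre-trained visual autoregressive (VAR) models with reinforcement learning (RL) using GRPO, significantly improving image quality and enabling control over generation style.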

Details

Motivation: Explore how RL fine-tuning can align generative model outputs with human preferences, specifically by applying GRPO to VAR models. Method: Fine-tunes VAR models with Group Relative Policy Optimization (GRPO), using reward signals from aesthetic predictors and CLIP embeddings. Result: The method significantly improves image quality and supports generating images in styles outside the pre-training distribution. Conclusion: RL fine-tuning is efficient and effective for VAR models, whose fast inference is well suited to online sampling, an aspect that is challenging for diffusion-based alternatives. Abstract: Fine-tuning pre-trained generative models with Reinforcement Learning (RL) has emerged as an effective approach for aligning outputs more closely with nuanced human preferences. In this paper, we investigate the application of Group Relative Policy Optimization (GRPO) to fine-tune next-scale visual autoregressive (VAR) models. Our empirical results demonstrate that this approach enables alignment to intricate reward signals derived from aesthetic predictors and CLIP embeddings, significantly enhancing image quality and enabling precise control over the generation style. Interestingly, by leveraging CLIP, our method can help VAR models generalize beyond their initial ImageNet distribution: through RL-driven exploration, these models can generate images aligned with prompts referencing image styles that were absent during pre-training. In summary, we show that RL-based fine-tuning is both efficient and effective for VAR models, benefiting particularly from their fast inference speeds, which are advantageous for online sampling, an aspect that poses significant challenges for diffusion-based alternatives.
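The group-relative part of GRPO can be illustrated with a minimal sketch. It assumes the commonly used formulation in which each sample's reward is normalized by the mean and standard deviation of its own sampling group; the function name, the epsilon term, and the reward values are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    actual implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images sampled from one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Samples scoring above the group mean receive positive advantages and are reinforced; no separate value network is needed, which is one reason fast samplers fit this scheme well.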

[77] DSAGL: Dual-Stream Attention-Guided Learning for Weakly Supervised Whole Slide Image Classification

Daoxi Cao,Hangbei Cheng,Yijin Li,Ruolin Zhou,Xinyi Li,Xuehan Zhang,Binwei Li,Xuancheng Gu,Xueyu Liu,Yongfei Wu

Main category: cs.CV

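TL;DR: DSAGL is a weakly supervised classification framework that combines a teacher-student architecture with a dual-stream design, using multi-scale attention-based pseudo labels to address instance-level ambiguity and bag-level semantic consistency; it outperforms existing baselines.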

Details

Motivation: Whole-slide images (WSIs) are critical for cancer diagnosis because of their ultra-high resolution and rich semantic content, but their massive size and the scarcity of fine-grained annotations limit conventional supervised learning. Method: Proposes the DSAGL framework with a teacher-student architecture and a dual-stream design; a VSSMamba encoder models long-range dependencies, a FASA module focuses on diagnostically relevant regions, and a hybrid loss enforces consistency between the two streams. Result: On the CIFAR-10, NCT-CRC, and TCGA-Lung datasets, DSAGL outperforms existing weakly supervised multiple instance learning baselines, with stronger discriminative performance and robustness. Conclusion: Dual-stream attention-guided learning effectively addresses weak supervision in WSI classification and offers an efficient solution for cancer diagnosis. Abstract: Whole-slide images (WSIs) are critical for cancer diagnosis due to their ultra-high resolution and rich semantic content. However, their massive size and the limited availability of fine-grained annotations pose substantial challenges for conventional supervised learning. We propose DSAGL (Dual-Stream Attention-Guided Learning), a novel weakly supervised classification framework that combines a teacher-student architecture with a dual-stream design. DSAGL explicitly addresses instance-level ambiguity and bag-level semantic consistency by generating multi-scale attention-based pseudo labels and guiding instance-level learning. A shared lightweight encoder (VSSMamba) enables efficient long-range dependency modeling, while a fusion-attentive module (FASA) enhances focus on sparse but diagnostically relevant regions. We further introduce a hybrid loss to enforce mutual consistency between the two streams. Experiments on CIFAR-10, NCT-CRC, and TCGA-Lung datasets demonstrate that DSAGL consistently outperforms state-of-the-art MIL baselines, achieving superior discriminative performance and robustness under weak supervision.

[78] Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering

Sixian Wang,Zhiwei Tang,Tsung-Hui Chang

Main category: cs.CV

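TL;DR: The paper proposes CFG-Rejection, an efficient method that uses Accumulated Score Differences (ASD) along the denoising trajectory to filter low-quality samples early, without external reward signals or model retraining.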

Details

Motivation: Diffusion models produce samples of inconsistent quality because of randomness in the sampling process; existing methods (such as DDPO and inference-time alignment techniques) are computationally expensive and depend on external reward signals, which limits their adoption. Method: The authors find that sample quality correlates strongly with the Accumulated Score Differences (ASD) along the denoising trajectory and, on that basis, propose CFG-Rejection, which filters low-quality samples early to improve generation quality. Result: Experiments show that CFG-Rejection markedly improves human preference scores (HPSv2, PickScore) and results on challenging benchmarks (GenEval, DPG-Bench) in image generation. Conclusion: CFG-Rejection is an efficient, plug-and-play method applicable to a variety of generative tasks, offering a more reliable route to high-quality samples. Abstract: Diffusion models often exhibit inconsistent sample quality due to stochastic variations inherent in their sampling trajectories. Although training-based fine-tuning (e.g., DDPO [1]) and inference-time alignment techniques [2] aim to improve sample fidelity, they typically necessitate full denoising processes and external reward signals. This incurs substantial computational costs, hindering their broader applicability. In this work, we unveil an intriguing phenomenon: a previously unobserved yet exploitable link between sample quality and characteristics of the denoising trajectory during classifier-free guidance (CFG). Specifically, we identify a strong correlation between high-density regions of the sample distribution and the Accumulated Score Differences (ASD) -- the cumulative divergence between conditional and unconditional scores. Leveraging this insight, we introduce CFG-Rejection, an efficient, plug-and-play strategy that filters low-quality samples at an early stage of the denoising process, crucially without requiring external reward signals or model retraining. Importantly, our approach necessitates no modifications to model architectures or sampling schedules and maintains full compatibility with existing diffusion frameworks. We validate the effectiveness of CFG-Rejection in image generation through extensive experiments, demonstrating marked improvements on human preference scores (HPSv2, PickScore) and challenging benchmarks (GenEval, DPG-Bench). We anticipate that CFG-Rejection will offer significant advantages for diverse generative modalities beyond images, paving the way for more efficient and reliable high-quality sample generation.
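A minimal sketch of the ASD idea, under stated assumptions: ASD is taken here as the sum, over the early denoising steps, of the L2 norm of the difference between conditional and unconditional scores, and candidates are then ranked by it. The function names, the choice of norm, and the `keep_largest` flag are illustrative assumptions; the abstract does not specify the exact divergence or which side of the ranking is rejected.

```python
import math


def accumulated_score_difference(cond_scores, uncond_scores):
    """Sum per-step L2 distances between conditional and unconditional scores.

    Each argument is a list of per-step score vectors (lists of floats),
    recorded only for the early part of the trajectory.
    """
    total = 0.0
    for cond, uncond in zip(cond_scores, uncond_scores):
        total += math.sqrt(sum((c - u) ** 2 for c, u in zip(cond, uncond)))
    return total


def select_by_asd(asd_values, keep_fraction, keep_largest):
    """Return sorted indices of the candidates kept after ranking by ASD."""
    n_keep = max(1, int(len(asd_values) * keep_fraction))
    order = sorted(range(len(asd_values)),
                   key=lambda i: asd_values[i],
                   reverse=keep_largest)
    return sorted(order[:n_keep])


# Toy example: two steps, two-dimensional scores.
asd = accumulated_score_difference([[3.0, 4.0], [0.0, 0.0]],
                                   [[0.0, 0.0], [0.0, 1.0]])
kept = select_by_asd([0.5, 2.0, 1.0, 3.0], keep_fraction=0.5, keep_largest=True)
```

Because both scores are already computed at every CFG step, a statistic of this kind adds almost no cost, and rejected candidates never need to finish denoising.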

[79] Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching

Yexiong Lin,Yu Yao,Tongliang Liu

Main category: cs.CV

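TL;DR: The Flow Matching (FM) framework transports samples from a source distribution to a target distribution via a learned vector field. Conventional methods use random couplings, which cause crossing paths; optimal transport (OT) based methods reduce crossings but are not aligned with the model's preferences. This paper proposes Model-Aligned Coupling (MAC), which optimizes couplings using both geometric distance and the model's prediction error, markedly improving generation quality and efficiency.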

Details

Motivation: Conventional FM methods suffer from crossing paths and non-straight trajectories because of random couplings, while OT-based methods reduce crossings but are not aligned with the model's preferences, which makes learning difficult. Method: Proposes Model-Aligned Coupling (MAC), which optimizes couplings using geometric distance and the model's prediction error and selects the lowest-error couplings for training. Result: Experiments show that MAC clearly outperforms existing methods in few-step generation, improving both quality and efficiency. Conclusion: By combining geometric and model-aligned criteria, MAC overcomes the limitations of conventional couplings and offers a more efficient solution for FM. Abstract: Flow Matching (FM) is an effective framework for training a model to learn a vector field that transports samples from a source distribution to a target distribution. To train the model, early FM methods use random couplings, which often result in crossing paths and lead the model to learn non-straight trajectories that require many integration steps to generate high-quality samples. To address this, recent methods adopt Optimal Transport (OT) to construct couplings by minimizing geometric distances, which helps reduce path crossings. However, we observe that such geometry-based couplings do not necessarily align with the model's preferred trajectories, making it difficult to learn the vector field induced by these couplings, which prevents the model from learning straight trajectories. Motivated by this, we propose Model-Aligned Coupling (MAC), an effective method that matches training couplings based not only on geometric distance but also on alignment with the model's preferred transport directions based on its prediction error. To avoid a time-consuming matching process, MAC selects the top-$k$ fraction of couplings with the lowest error for training. Extensive experiments show that MAC significantly improves generation quality and efficiency in few-step settings compared to existing methods. Project page: https://yexionglin.github.io/mac
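The error-based selection step can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: it uses the common linear interpolation path, whose target velocity is `x1 - x0`, and scores each candidate pair by squared prediction error at a single time `t`. The names `fm_error`, `select_low_error_couplings`, and the toy `model` are hypothetical, and the geometric-distance part of MAC is omitted.

```python
def fm_error(model, x0, x1, t):
    """Squared error between the predicted velocity and the straight-line target."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the linear path
    target = [b - a for a, b in zip(x0, x1)]            # constant target velocity
    pred = model(xt, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target))


def select_low_error_couplings(model, pairs, t, k_fraction):
    """Keep the fraction of (x0, x1) pairs with the lowest prediction error."""
    scored = sorted(pairs, key=lambda pair: fm_error(model, pair[0], pair[1], t))
    n_keep = max(1, int(len(scored) * k_fraction))
    return scored[:n_keep]


# Toy model that always predicts a unit velocity in every dimension.
def model(xt, t):
    return [1.0 for _ in xt]


pairs = [([0.0], [1.0]), ([0.0], [3.0]), ([0.0], [1.5]), ([0.0], [-2.0])]
selected = select_low_error_couplings(model, pairs, t=0.5, k_fraction=0.5)
```

Ranking and truncating like this needs one forward pass per candidate pair, which is far cheaper than solving a full matching problem over the batch.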

[80] Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model

Reem AlJunaid,Muzammil Behzad

Main category: cs.CV

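TL;DR: KRCapVLM is a knowledge replay-based image captioning framework that combines a vision-language model with improved decoding and training strategies, substantially improving caption quality and knowledge recognition.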

Details

Motivation: Captions generated by existing models often lack specificity and contextual depth; KRCapVLM aims to address this. Method: Proposes the KRCapVLM framework, which combines knowledge replay, beam search decoding, attention-based modules, and training schedulers. Result: The model shows clear gains in both knowledge recognition accuracy and caption quality, producing more informative and contextually relevant captions. Conclusion: KRCapVLM effectively strengthens the model's ability to generate meaningful, knowledge-grounded image captions. Abstract: Generating informative and knowledge-rich image captions remains a challenge for many existing captioning models, which often produce generic descriptions that lack specificity and contextual depth. To address this limitation, we propose KRCapVLM, a novel knowledge replay-based image captioning framework using a vision-language model. We incorporate beam search decoding to generate more diverse and coherent captions. We also integrate attention-based modules into the image encoder to enhance feature representation. Finally, we employ training schedulers to improve stability and ensure smoother convergence during training. These proposals yield substantial gains in both caption quality and knowledge recognition. Our proposed model demonstrates clear improvements in both the accuracy of knowledge recognition and the overall quality of generated captions. It shows a stronger ability to generalize to previously unseen knowledge concepts, producing more informative and contextually relevant descriptions. These results indicate the effectiveness of our approach in enhancing the model's capacity to generate meaningful, knowledge-grounded captions across a range of scenarios.

[81] VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

Yuanxin Liu,Kun Ouyang,Haoning Wu,Yi Liu,Lin Sui,Xinhao Li,Yan Zhong,Y. Charles,Xinyu Zhou,Xu Sun

Main category: cs.CV

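TL;DR: The paper introduces VideoReasonBench, a benchmark for evaluating vision-centric complex video reasoning. It finds that existing multimodal large language models (MLLMs) perform poorly on complex video reasoning and that an extended thinking budget is essential for improving performance.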

Details

Motivation: Video understanding lacks benchmarks that can demonstrate the benefits of long chain-of-thought reasoning, and existing tasks are mostly knowledge-driven rather than driven by visual content. Method: Designs the VideoReasonBench benchmark, which contains visually rich, highly complex video reasoning tasks and evaluates three escalating levels of reasoning skill. Result: Evaluates 18 state-of-the-art MLLMs and finds that most perform poorly on complex video reasoning; Gemini-2.5-Pro performs best (56.0% accuracy). Conclusion: An extended thinking budget is essential for complex video reasoning, a benefit that existing benchmarks fail to reveal. Abstract: Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models have to precisely recall multiple operations in the video, and perform step-by-step reasoning to get correct final answers for these questions. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning, e.g., GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigations on "test-time scaling" further reveal that extended thinking budget, while offering no or minimal benefits on existing video benchmarks, is essential for improving the performance on VideoReasonBench.

[82] MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification

Yang Qiao,Xiaoyu Zhong,Xiaofeng Gu,Zhiguo Yu

Main category: cs.CV

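TL;DR: Proposes a novel Multimodal Collaborative Fusion Network (MCFNet) that improves fine-grained classification through a regularized fusion module and a hybrid attention mechanism.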

Details

Motivation: Multimodal information processing is important for improving image classification, but conventional methods struggle to capture the complex dependencies between modalities, which limits their use in high-precision classification tasks. Method: MCFNet combines a regularized fusion module with a hybrid attention mechanism to improve intra-modal feature representation and semantic alignment, and adds a multimodal decision classification module that integrates multiple loss functions through weighted voting. Result: Experiments on benchmark datasets show consistent improvements in classification accuracy. Conclusion: MCFNet effectively models subtle cross-modal semantics and is suited to high-precision classification tasks. Abstract: Multimodal information processing has become increasingly important for enhancing image classification performance. However, the intricate and implicit dependencies across different modalities often hinder conventional methods from effectively capturing fine-grained semantic interactions, thereby limiting their applicability in high-precision classification tasks. To address this issue, we propose a novel Multimodal Collaborative Fusion Network (MCFNet) designed for fine-grained classification. The proposed MCFNet architecture incorporates a regularized integrated fusion module that improves intra-modal feature representation through modality-specific regularization strategies, while facilitating precise semantic alignment via a hybrid attention mechanism. Additionally, we introduce a multimodal decision classification module, which jointly exploits inter-modal correlations and unimodal discriminative features by integrating multiple loss functions within a weighted voting paradigm. Extensive experiments and ablation studies on benchmark datasets demonstrate that the proposed MCFNet framework achieves consistent improvements in classification accuracy, confirming its effectiveness in modeling subtle cross-modal semantics.

[83] PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening

Jeonghyeok Do,Sungpyo Kim,Geunhyuk Youk,Jaehyup Lee,Munchurl Kim

Main category: cs.CV

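TL;DR: PAN-Crafter uses a modality-consistent alignment framework to address cross-modality misalignment when fusing PAN and MS images, markedly improving HRMS image quality.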

Details

Motivation: Conventional deep learning methods assume perfect pixel-wise alignment, which leads to spectral distortion and blurring; PAN-Crafter aims to address cross-modality misalignment. Method: Proposes Modality-Adaptive Reconstruction (MARs) and Cross-Modality Alignment-Aware Attention (CM3A), jointly reconstructing HRMS and PAN images. Result: Outperforms existing methods on multiple benchmark datasets, with 50.11× faster inference and 0.63× the memory size, and generalizes well. Conclusion: Through modality alignment and adaptive feature refinement, PAN-Crafter markedly improves the performance and efficiency of PAN-sharpening. Abstract: PAN-sharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multi-spectral (MS) images to generate high-resolution multi-spectral (HRMS) outputs. However, cross-modality misalignment -- caused by sensor placement, acquisition timing, and resolution disparity -- poses a fundamental challenge. Conventional deep learning methods assume perfect pixel-wise alignment and rely on per-pixel reconstruction losses, leading to spectral distortion, double edges, and blurring when misalignment is present. To address this, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment gap between PAN and MS modalities. At its core, Modality-Adaptive Reconstruction (MARs) enables a single network to jointly reconstruct HRMS and PAN images, leveraging PAN's high-frequency details as auxiliary self-supervision. Additionally, we introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel mechanism that bidirectionally aligns MS texture to PAN structure and vice versa, enabling adaptive feature refinement across modalities. Extensive experiments on multiple benchmark datasets demonstrate that our PAN-Crafter outperforms the most recent state-of-the-art method in all metrics, even with 50.11× faster inference time and 0.63× the memory size. Furthermore, it demonstrates strong generalization performance on unseen satellite datasets, showing its robustness across different conditions.

[84] UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Weijia Mao,Zhenheng Yang,Mike Zheng Shou

Main category: cs.CV

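TL;DR: UniRL is a self-improving post-training method that uses images generated by the model itself as training data, requires no external image data, and improves both generation and understanding.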

Details

Motivation: Existing unified multimodal large language models rely on large-scale data and compute, and post-training methods often need external data or are limited to specific tasks; UniRL aims to address these issues. Method: Uses self-generated images as training data and optimizes the model with supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). Result: Evaluated on Show-o and Janus, reaching GenEval scores of 0.77 and 0.65 respectively, with improved performance and less imbalance between tasks. Conclusion: UniRL needs no external data, improves multimodal task performance efficiently, and has practical potential. Abstract: Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only several additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released at https://github.com/showlab/UniRL.

[85] VModA: An Effective Framework for Adaptive NSFW Image Moderation

Han Bao,Qinying Wang,Zhi Chen,Qingming Li,Xuhong Zhang,Changjiang Li,Zonghui Wang,Shouling Ji,Wenzhi Chen

Main category: cs.CV

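TL;DR: VModA is a general and effective framework for detecting complex NSFW content that adapts to the moderation rules of different platforms and regions and significantly improves detection accuracy.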

Details

Motivation: NSFW content is rampant on social networks and seriously harms users, especially minors; existing detection methods struggle with semantically complex NSFW content. Method: Proposes the VModA framework, which adapts to diverse moderation rules and handles semantically complex NSFW content. Result: Experiments show accuracy improvements of up to 54.3%, with strong adaptability across categories, scenarios, and base VLMs. Conclusion: VModA performs well in both experiments and real-world use, addressing the shortcomings of existing NSFW detection methods. Abstract: Not Safe/Suitable for Work (NSFW) content is rampant on social networks and poses serious harm to citizens, especially minors. Current detection methods mainly rely on deep learning-based image recognition and classification. However, NSFW images are now presented in increasingly sophisticated ways, often using image details and complex semantics to obscure their true nature or attract more views. Although still understandable to humans, these images often evade existing detection methods, posing a significant threat. Further complicating the issue, varying regulations across platforms and regions create additional challenges for effective moderation, leading to detection bias and reduced accuracy. To address this, we propose VModA, a general and effective framework that adapts to diverse moderation rules and handles complex, semantically rich NSFW content across categories. Experimental results show that VModA significantly outperforms existing methods, achieving up to a 54.3% accuracy improvement across NSFW types, including those with complex semantics. Further experiments demonstrate that our method exhibits strong adaptability across categories, scenarios, and base VLMs. We also identified inconsistent and controversial label samples in public NSFW benchmark datasets, re-annotated them, and submitted corrections to the original maintainers. Two datasets have confirmed the updates so far. Additionally, we evaluate VModA in real-world scenarios to demonstrate its practical effectiveness.

[86] Robust and Annotation-Free Wound Segmentation on Noisy Real-World Pressure Ulcer Images: Towards Automated DESIGN-R® Assessment

Yun-Cheng Tsai

Main category: cs.CV

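TL;DR: Proposes a lightweight pipeline that combines a YOLOv11n detector with the FUSegNet segmentation model, achieving wound segmentation across body sites with only 500 annotated bounding boxes and no fine-tuning.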

Details

Motivation: Existing models such as FUSegNet mainly target foot ulcers and generalize poorly to other body sites, so the domain gap must be bridged. Method: Combines a YOLOv11n detector with the pre-trained FUSegNet, requiring only 500 annotated bounding boxes and no pixel-level annotation or fine-tuning. Result: Across several test sets, mean IoU improves by 23 percentage points and DESIGN-R size estimation accuracy rises from 71% to 94%. Conclusion: The method generalizes without task-specific fine-tuning, needs only a small amount of annotation, supports automated clinical wound scoring, and will release its model weights to encourage adoption. Abstract: Purpose: Accurate wound segmentation is essential for automated DESIGN-R scoring. However, existing models such as FUSegNet, which are trained primarily on foot ulcer datasets, often fail to generalize to wounds on other body sites. Methods: We propose an annotation-efficient pipeline that combines a lightweight YOLOv11n-based detector with the pre-trained FUSegNet segmentation model. Instead of relying on pixel-level annotations or retraining for new anatomical regions, our method achieves robust performance using only 500 manually labeled bounding boxes. This zero fine-tuning approach effectively bridges the domain gap and enables direct deployment across diverse wound types. This is an advance not previously demonstrated in the wound segmentation literature. Results: Evaluated on three real-world test sets spanning foot, sacral, and trochanter wounds, our YOLO plus FUSegNet pipeline improved mean IoU by 23 percentage points over vanilla FUSegNet and increased end-to-end DESIGN-R size estimation accuracy from 71 percent to 94 percent (see Table 3 for details). Conclusion: Our pipeline generalizes effectively across body sites without task-specific fine-tuning, demonstrating that minimal supervision, with 500 annotated ROIs, is sufficient for scalable, annotation-light wound segmentation. This capability paves the way for real-world DESIGN-R automation, reducing reliance on pixel-wise labeling, streamlining documentation workflows, and supporting objective and consistent wound scoring in clinical practice. We will publicly release the trained detector weights and configuration to promote reproducibility and facilitate downstream deployment.
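For readers unfamiliar with the reported metric, mean IoU over binary masks can be computed as in the sketch below. It assumes masks are nested lists of 0/1 values and that an empty prediction matching an empty ground truth counts as a perfect score; both conventions, and the function names, are assumptions rather than details from the paper.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks given as rows of 0/1 ints."""
    inter = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            inter += p & t
            union += p | t
    # Assumed convention: two empty masks agree perfectly.
    return inter / union if union else 1.0


def mean_iou(pred_masks, true_masks):
    """Average IoU over a test set of mask pairs."""
    scores = [iou(p, t) for p, t in zip(pred_masks, true_masks)]
    return sum(scores) / len(scores)


pred = [[1, 1], [0, 0]]
truth = [[1, 0], [1, 0]]
score = iou(pred, truth)  # one shared pixel out of three covered pixels
```

A gain of 23 percentage points on this scale means, for example, moving from a mean IoU of 0.50 to 0.73; the abstract does not report the absolute values.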

[87] Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

Xingguang Wei,Haomin Wang,Shenglong Ye,Ruifeng Luo,Yanting Zhang,Lixin Gu,Jifeng Dai,Yu Qiao,Wenhai Wang,Hongjie Zhang

Main category: cs.CV

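TL;DR: VecFormer proposes a line-based representation for panoptic symbol spotting in CAD drawings, addressing the high computational cost, limited generality, and loss of geometric information of existing methods; a Branch Fusion Refinement module improves prediction consistency, and the method achieves strong experimental results.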

Details

Motivation: Existing symbol spotting methods for CAD drawings rely on image rasterization, graph construction, or point-based representation, and suffer from high computational cost, limited generality, and loss of geometric information; a more efficient method that preserves geometric structure is needed. Method: VecFormer represents primitives as lines, preserving geometric continuity, and uses a Branch Fusion Refinement module to integrate instance and semantic predictions for greater consistency. Result: Experiments show that VecFormer reaches 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points in the two settings, outperforming existing methods. Conclusion: Line-based representation is an effective foundation for vector graphic understanding, and VecFormer offers a new approach to panoptic symbol spotting in CAD drawings. Abstract: We study the task of panoptic symbol spotting, which involves identifying both individual instances of countable things and the semantic regions of uncountable stuff in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on image rasterization, graph construction, or point-based representation, but these approaches often suffer from high computational costs, limited generality, and loss of geometric structural information. In this paper, we propose VecFormer, a novel method that addresses these challenges through line-based representation of primitives. This design preserves the geometric continuity of the original primitive, enabling more accurate shape representation while maintaining a computation-friendly structure, making it well-suited for vector graphic understanding tasks. To further enhance prediction reliability, we introduce a Branch Fusion Refinement module that effectively integrates instance and semantic predictions, resolving their inconsistencies for more coherent panoptic outputs. Extensive experiments demonstrate that our method establishes a new state-of-the-art, achieving 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results under settings with and without prior information, respectively, highlighting the strong potential of line-based representation as a foundation for vector graphic understanding.
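The PQ figures above refer to panoptic quality. A minimal sketch of the standard image-panoptic definition follows, where PQ is the sum of IoUs over matched segments divided by TP + 0.5·FP + 0.5·FN. This is an assumption about the general metric family: CAD symbol spotting benchmarks typically compute IoU over primitives rather than pixels, so the exact variant used in the paper may differ, and the function name and inputs here are hypothetical.

```python
def panoptic_quality(matched_ious, num_false_pos, num_false_neg):
    """PQ = sum(IoU of matched pairs) / (TP + 0.5 * FP + 0.5 * FN).

    matched_ious holds one IoU per true-positive match between a predicted
    and a ground-truth segment.
    """
    num_true_pos = len(matched_ious)
    denom = num_true_pos + 0.5 * num_false_pos + 0.5 * num_false_neg
    if denom == 0:
        return 0.0  # assumed convention when there is nothing to evaluate
    return sum(matched_ious) / denom


# Two matched symbols, one spurious prediction, one missed symbol.
pq = panoptic_quality([0.8, 0.9], num_false_pos=1, num_false_neg=1)
```

The metric rewards both finding the right segments and delineating them well, which is why inconsistent instance and semantic predictions lower it.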

[88] Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation

Sanggyun Ma,Wonjoon Choi,Jihun Park,Jaeyeul Kim,Seunghun Lee,Jiwan Seo,Sunghoon Im

Main category: cs.CV

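TL;DR: BriGeS improves monocular depth estimation by fusing geometric and semantic information, using a Bridging Gate and Attention Temperature Scaling; it has low resource requirements and generalizes well.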

Details

Motivation: Combine the complementary strengths of geometric and semantic information to improve monocular depth estimation in complex scenes. Method: Integrates depth and segmentation foundation models through a Bridging Gate, adjusts the attention mechanism with Attention Temperature Scaling, and trains only the Bridging Gate to reduce resource use. Result: Outperforms existing methods on multiple datasets, especially in scenes with intricate structures and overlapping objects. Conclusion: BriGeS is an efficient monocular depth estimation method with strong generalization. Abstract: We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling technique. It finely adjusts the focus of the attention mechanisms to prevent over-concentration on specific features, thus ensuring balanced performance across diverse inputs. BriGeS capitalizes on pre-trained foundation models and adopts a strategy that focuses on training only the Bridging Gate. This method significantly reduces resource demands and training time while maintaining the model's ability to generalize effectively. Extensive experiments across multiple challenging datasets demonstrate that BriGeS outperforms state-of-the-art methods in MDE for complex scenes, effectively handling intricate structures and overlapping objects.
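The effect of a temperature on attention weights can be sketched as below. This shows the generic mechanism only, dividing the attention logits by a temperature before the softmax; the abstract does not give the exact form of Attention Temperature Scaling, so the function name and the specific values are assumptions.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Softmax over attention logits after dividing them by a temperature.

    Temperatures above 1 flatten the distribution, which spreads attention
    over more features instead of concentrating it on a few.
    """
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 4.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)
```

With the higher temperature the largest weight shrinks and the smaller ones grow, which matches the stated goal of preventing over-concentration on specific features.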

[89] Video Editing for Audio-Visual Dubbing

Binyamin Manela,Sharon Gannot,Ethan Fetyaya

Main category: cs.CV

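TL;DR: EdiDub is a novel visual dubbing framework that treats dubbing as a content-aware editing task, addressing the shortcomings of existing methods in preserving video context and complex visual elements.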

Details

Motivation: Current visual dubbing methods have significant limitations in integrating seamlessly into the original scene and in preserving visual information such as occlusions and lighting variations. Method: EdiDub reformulates visual dubbing as a content-aware editing task and uses a specialized conditioning scheme to ensure accurate and faithful modifications. Result: On multiple benchmarks, EdiDub significantly improves identity preservation and synchronization, and it receives higher synchronization and visual naturalness scores in human evaluations. Conclusion: With its content-aware editing approach, EdiDub achieves accurate lip synchronization while maintaining complex visual elements, outperforming traditional generation or inpainting methods. Abstract: Visual dubbing, the synchronization of facial movements with new speech, is crucial for making content accessible across different languages, enabling broader global reach. However, current methods face significant limitations. Existing approaches often generate talking faces, hindering seamless integration into original scenes, or employ inpainting techniques that discard vital visual information like partial occlusions and lighting variations. This work introduces EdiDub, a novel framework that reformulates visual dubbing as a content-aware editing task. EdiDub preserves the original video context by utilizing a specialized conditioning scheme to ensure faithful and accurate modifications rather than mere copying. On multiple benchmarks, including a challenging occluded-lip dataset, EdiDub significantly improves identity preservation and synchronization. Human evaluations further confirm its superiority, achieving higher synchronization and visual naturalness scores compared to the leading methods. These results demonstrate that our content-aware editing approach outperforms traditional generation or inpainting, particularly in maintaining complex visual elements while ensuring accurate lip synchronization.

[90] UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors

Tianhang Wang,Fan Lu,Sanqing Qu,Guo Yu,Shihang Du,Ya Wu,Yuan Huang,Guang Chen

Main category: cs.CV

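TL;DR: UrbanCraft tackles extrapolated view synthesis with hierarchical semantic-geometric representations, using a coarse scene-level prior and fine instance-level priors to improve performance.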

Details

Motivation: Existing methods perform poorly on view synthesis outside the training camera distribution, which limits the generalizability of urban reconstruction. Method: Designs hierarchical semantic-geometric representations (a coarse scene-level prior and fine instance-level priors) and proposes HSG-VSD to integrate semantic and geometric constraints. Result: Qualitative and quantitative experiments validate the method's effectiveness on extrapolated view synthesis. Conclusion: UrbanCraft significantly improves extrapolated view synthesis through hierarchical representations and constrained optimization. Abstract: Existing neural rendering-based urban scene reconstruction methods mainly focus on the Interpolated View Synthesis (IVS) setting that synthesizes from views close to the training camera trajectory. However, IVS cannot guarantee on-par performance for novel views outside the training camera distribution (e.g., looking left, right, or downwards), which limits the generalizability of the urban reconstruction application. Previous methods have optimized it via image diffusion, but they fail to handle text-ambiguous or large unseen view angles due to coarse-grained control of text-only diffusion. In this paper, we design UrbanCraft, which surmounts the Extrapolated View Synthesis (EVS) problem using hierarchical sem-geometric representations serving as additional priors. Specifically, we leverage the partially observable scene to reconstruct coarse semantic and geometric primitives, establishing a coarse scene-level prior through an occupancy grid as the base representation. Additionally, we incorporate fine instance-level priors from 3D bounding boxes to enhance object-level details and spatial relationships. Building on this, we propose the Hierarchical Semantic-Geometric-Guided Variational Score Distillation (HSG-VSD), which integrates semantic and geometric constraints from pretrained UrbanCraft2D into the score distillation sampling process, forcing the distribution to be consistent with the observable scene. Qualitative and quantitative comparisons demonstrate the effectiveness of our methods on the EVS problem.

[91] Adaptive Spatial Augmentation for Semi-supervised Semantic Segmentation

Lingyan Ran,Yali Li,Tao Zhuo,Shizhou Zhang,Yanning Zhang

Main category: cs.CV

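TL;DR: The paper proposes Adaptive Spatial Augmentation (ASAug) for semi-supervised semantic segmentation (SSSS), which improves model performance by dynamically adjusting the augmentation strategy.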

Details

Motivation: Existing strong augmentation methods focus on intensity perturbations, which barely affect the semantic masks, while spatial augmentation has been overlooked in SSSS. The paper shows that spatial augmentation is effective for SSSS and proposes an adaptive strategy. Method: Adaptive Spatial Augmentation (ASAug) dynamically adjusts the augmentation for each image based on entropy and handles the mask inconsistency between weak and strong augmentations. Result: As a pluggable module, ASAug consistently improves existing methods and reaches state-of-the-art results on PASCAL VOC 2012, Cityscapes, and COCO. Conclusion: Spatial augmentation is effective for SSSS, the adaptive strategy further improves performance, and ASAug is general and efficient. Abstract: In semi-supervised semantic segmentation (SSSS), data augmentation plays a crucial role in the weak-to-strong consistency regularization framework, as it enhances diversity and improves model generalization. Recent strong augmentation methods have primarily focused on intensity-based perturbations, which have minimal impact on the semantic masks. In contrast, spatial augmentations like translation and rotation have long been acknowledged for their effectiveness in supervised semantic segmentation tasks, but they are often ignored in SSSS. In this work, we demonstrate that spatial augmentation can also contribute to model training in SSSS, despite generating inconsistent masks between the weak and strong augmentations. Furthermore, recognizing the variability among images, we propose an adaptive augmentation strategy that dynamically adjusts the augmentation for each instance based on entropy. Extensive experiments show that our proposed Adaptive Spatial Augmentation (ASAug) can be integrated as a pluggable module, consistently improving the performance of existing methods and achieving state-of-the-art results on benchmark datasets such as PASCAL VOC 2012, Cityscapes, and COCO.
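
As a rough illustration of adjusting augmentation per instance based on entropy, the sketch below computes the mean per-pixel prediction entropy of an image and maps it to an augmentation strength. The mapping, including its direction (confident images get stronger augmentation), is purely an assumption for illustration; the abstract does not specify ASAug's actual rule, and all names are hypothetical.

```python
import math


def mean_pixel_entropy(prob_maps):
    """Average Shannon entropy over pixels.

    prob_maps: list of per-pixel class-probability lists, each summing to 1.
    """
    total = 0.0
    for probs in prob_maps:
        total -= sum(p * math.log(p) for p in probs if p > 0)
    return total / len(prob_maps)


def augmentation_strength(entropy, num_classes, max_strength=1.0):
    """Hypothetical mapping: low entropy (confident prediction) gives
    strength near max_strength, maximal entropy gives strength near 0."""
    normalized = entropy / math.log(num_classes)  # scale to [0, 1]
    return max_strength * (1.0 - normalized)


confident = [[1.0, 0.0], [0.99, 0.01]]
uncertain = [[0.5, 0.5], [0.6, 0.4]]
print(augmentation_strength(mean_pixel_entropy(confident), num_classes=2))
print(augmentation_strength(mean_pixel_entropy(uncertain), num_classes=2))
```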

[92] VITON-DRR: Details Retention Virtual Try-on via Non-rigid Registration

Ben Li,Minqi Li,Jie Ren,Kaibing Zhang

Main category: cs.CV

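TL;DR: Proposes VITON-DRR, a virtual try-on method based on non-rigid registration that uses a dual-pyramid-structured feature extractor and a deformation module to improve garment detail retention and warping accuracy.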

Details

Motivation: Existing virtual try-on methods struggle to preserve details when warping garments, especially under self-occlusion or large pose differences. Method: A dual-pyramid-structured feature extractor reconstructs the human semantic segmentation; a Deformation Module extracts garment key points and warps them with a non-rigid registration algorithm; an Image Synthesis Module then generates the try-on image. Result: Experiments show that VITON-DRR outperforms existing methods in warping accuracy and detail retention. Conclusion: With non-rigid registration and a dual-pyramid structure, VITON-DRR significantly improves virtual try-on results. Abstract: Image-based virtual try-on aims to fit a target garment to a specific person image and has attracted extensive research attention because of its huge application potential in the e-commerce and fashion industries. To generate high-quality try-on results, accurately warping the clothing item to fit the human body plays a significant role, as slight misalignment may lead to unrealistic artifacts in the fitting image. Most existing methods warp the clothing by feature matching and thin-plate spline (TPS). However, this approach often fails to preserve clothing details due to self-occlusion, severe misalignment between poses, etc. To address these challenges, this paper proposes a detail retention virtual try-on method via accurate non-rigid registration (VITON-DRR) for diverse human poses. Specifically, we reconstruct a human semantic segmentation using a dual-pyramid-structured feature extractor. Then, a novel Deformation Module is designed for extracting the cloth key points and warping them through an accurate non-rigid registration algorithm. Finally, the Image Synthesis Module is designed to synthesize the deformed garment image and generate the human pose information adaptively. Compared with traditional methods, the proposed VITON-DRR can make the deformation of fitting images more accurate and retain more garment details. The experimental results demonstrate that the proposed method performs better than state-of-the-art methods.

[93] CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis

Runmin Jiang,Genpei Zhang,Yuntian Yang,Siqi Wu,Yuheng Zhang,Wanyue Feng,Yizhou Zhao,Xi Xiao,Xiao Wang,Tianyang Wang,Xingjian Li,Min Xu

Main category: cs.CV

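TL;DR: CryoCCD is a synthesis framework that combines biophysical modeling with generative techniques to produce high-quality, structurally diverse cryo-EM micrographs, uses a conditional diffusion model to simulate realistic noise, and significantly improves downstream task performance.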

Details

Motivation: High-quality annotated cryo-EM data is scarce, and existing synthesis methods struggle to capture both the structural diversity of biological specimens and the complex noise, limiting model robustness. Method: CryoCCD integrates biophysical modeling with generative techniques, using multi-scale micrograph generation, a conditional diffusion model, and contrastive learning to simulate biophysical variability and spatially adaptive noise. Result: Experiments show that CryoCCD generates structurally accurate micrographs and outperforms existing methods in particle picking and reconstruction. Conclusion: CryoCCD offers an efficient solution for cryo-EM data synthesis and significantly improves downstream analysis. Abstract: Cryo-electron microscopy (cryo-EM) offers near-atomic resolution imaging of macromolecules, but developing robust models for downstream analysis is hindered by the scarcity of high-quality annotated data. While synthetic data generation has emerged as a potential solution, existing methods often fail to capture both the structural diversity of biological specimens and the complex, spatially varying noise inherent in cryo-EM imaging. To overcome these limitations, we propose CryoCCD, a synthesis framework that integrates biophysical modeling with generative techniques. Specifically, CryoCCD produces multi-scale cryo-EM micrographs that reflect realistic biophysical variability through compositional heterogeneity, cellular context, and physics-informed imaging. To generate realistic noise, we employ a conditional diffusion model, enhanced by cycle consistency to preserve structural fidelity and mask-aware contrastive learning to capture spatially adaptive noise patterns. Extensive experiments show that CryoCCD generates structurally accurate micrographs and enhances performance in downstream tasks, outperforming state-of-the-art baselines in both particle picking and reconstruction.

[94] A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation

Shuzhou Sun,Li Liu,Tianpeng Liu,Shuaifeng Zhi,Ming-Ming Cheng,Janne Heikkilä,Yongxiang Liu

Main category: cs.CV

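TL;DR: The paper proposes a reverse causal framework (RcSGG) to address the spurious correlations that the causal chain structure induces in scene graph generation (SGG); by intervening on the confounder and enhancing reverse causality estimation, it significantly improves performance.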

Details

Motivation: The causal chain structure of existing two-stage SGG frameworks yields spurious correlations, causing biases such as tail relationships being predicted as head ones and foreground relationships being predicted as background ones. Method: Proposes the RcSGG framework, comprising Active Reverse Estimation (ARE) and Maximum Information Sampling (MIS), which intervenes on the confounder through a reverse causal structure and enhances the estimation. Result: Achieves state-of-the-art mean recall on multiple benchmarks and across different SGG frameworks. Conclusion: RcSGG effectively mitigates spurious correlations in SGG and the biases they induce, significantly improving performance. Abstract: Existing two-stage Scene Graph Generation (SGG) frameworks typically incorporate a detector to extract relationship features and a classifier to categorize these relationships; therefore, the training paradigm follows a causal chain structure, where the detector's inputs determine the classifier's inputs, which in turn influence the final predictions. However, such a causal chain structure can yield spurious correlations between the detector's inputs and the final predictions, i.e., the prediction of a certain relationship may be influenced by other relationships. This influence can induce at least two observable biases: tail relationships are predicted as head ones, and foreground relationships are predicted as background ones; notably, the latter bias is seldom discussed in the literature. To address this issue, we propose reconstructing the causal chain structure into a reverse causal structure, wherein the classifier's inputs are treated as the confounder, and both the detector's inputs and the final predictions are viewed as causal variables. Specifically, we term the reconstructed causal paradigm the Reverse causal Framework for SGG (RcSGG). RcSGG initially employs the proposed Active Reverse Estimation (ARE) to intervene on the confounder to estimate the reverse causality, i.e., the causality from final predictions to the classifier's inputs. Then, Maximum Information Sampling (MIS) is proposed to further enhance the reverse causality estimation by considering the relationship information. Theoretically, RcSGG can mitigate the spurious correlations inherent in the SGG framework, subsequently eliminating the induced biases. Comprehensive experiments on popular benchmarks and diverse SGG frameworks show state-of-the-art mean recall.

Runyi Li,Bin Chen,Jian Zhang,Radu Timofte

Main category: cs.CV

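TL;DR: The paper proposes LAFR, which aligns the latent distribution of low-quality images using a latent space adapter and combines a multi-level restoration loss with lightweight finetuning for efficient, high-fidelity blind face restoration.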

Details

Motivation: Address the semantic misalignment that existing diffusion models exhibit when encoding low-quality images, while avoiding the high computational cost of retraining the VAE. Method: Proposes the LAFR latent space adapter to align latent distributions, introduces a multi-level restoration loss, and lightly finetunes the diffusion prior. Result: Performs well on synthetic and real-world face restoration benchmarks, with 70% less training time and high fidelity. Conclusion: LAFR efficiently resolves semantic alignment in low-quality image restoration while significantly reducing computational cost. Abstract: Blind face restoration from low-quality (LQ) images is a challenging task that requires not only high-fidelity image reconstruction but also the preservation of facial identity. While diffusion models like Stable Diffusion have shown promise in generating high-quality (HQ) images, their VAE modules are typically trained only on HQ data, resulting in semantic misalignment when encoding LQ inputs. This mismatch significantly weakens the effectiveness of LQ conditions during the denoising process. Existing approaches often tackle this issue by retraining the VAE encoder, which is computationally expensive and memory-intensive. To address this limitation efficiently, we propose LAFR (Latent Alignment for Face Restoration), a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of HQ counterparts, enabling semantically consistent diffusion sampling without altering the original VAE. To further enhance identity preservation, we introduce a multi-level restoration loss that combines constraints from identity embeddings and facial structural priors. Additionally, by leveraging the inherent structural regularity of facial images, we show that lightweight finetuning of the diffusion prior on just 0.9% of the FFHQ dataset is sufficient to achieve results comparable to state-of-the-art methods, reducing training time by 70%. Extensive experiments on both synthetic and real-world face restoration benchmarks demonstrate the effectiveness and efficiency of LAFR, achieving high-quality, identity-preserving face reconstruction from severely degraded inputs.

[96] Revisiting Reweighted Risk for Calibration: AURC, Focal Loss, and Inverse Focal Loss

Han Zhou,Sebastian G. Gruber,Teodora Popordanoska,Matthew B. Blaschko

Main category: cs.CV

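TL;DR: The paper analyzes the calibration properties of reweighted risk functionals (such as focal loss, inverse focal loss, and AURC), shows that optimizing a regularized AURC improves calibration, and enables gradient-based optimization via SoftRank.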

Details

Motivation: Examine how the calibration properties of different reweighted risk functionals (such as focal loss and inverse focal loss) differ, and establish a theoretical connection between them and calibration error. Method: Proposes optimizing a regularized AURC, uses the SoftRank technique to make it differentiable, and compares different weighting strategies. Result: Experiments show that the AURC-based loss achieves competitive calibration performance across a range of datasets and model architectures. Conclusion: Optimizing the regularized AURC effectively improves model calibration, and the weighting strategy of inverse focal loss is the more principled one. Abstract: Several variants of reweighted risk functionals, such as focal loss, inverse focal loss, and the Area Under the Risk-Coverage Curve (AURC), have been proposed in the literature and claims have been made in relation to their calibration properties. However, focal loss and inverse focal loss propose vastly different weighting schemes. In this paper, we revisit a broad class of weighted risk functions commonly used in deep learning and establish a principled connection between these reweighting schemes and calibration errors. We show that minimizing calibration error is closely linked to the selective classification paradigm and demonstrate that optimizing a regularized variant of the AURC naturally leads to improved calibration. This regularized AURC shares a similar reweighting strategy with inverse focal loss, lending support to the idea that focal loss is less principled when calibration is a desired outcome. Direct AURC optimization offers greater flexibility through the choice of confidence score functions (CSFs). To enable gradient-based optimization, we introduce a differentiable formulation of the regularized AURC using the SoftRank technique. Empirical evaluations demonstrate that our AURC-based loss achieves competitive class-wise calibration performance across a range of datasets and model architectures.
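
To make the "vastly different weighting schemes" concrete, here is a minimal sketch of the two per-sample losses as they are commonly written in the literature: focal loss down-weights confident samples with a factor (1 - p)^gamma, while inverse focal loss up-weights them with (1 + p)^gamma. The abstract does not spell out these formulas, so treating them as the exact forms analyzed in the paper is an assumption.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample: -(1 - p)^gamma * log(p),
    where p is the predicted probability of the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def inverse_focal_loss(p_true, gamma=2.0):
    """Inverse focal loss for one sample: -(1 + p)^gamma * log(p)."""
    return -((1.0 + p_true) ** gamma) * math.log(p_true)


for p in (0.3, 0.6, 0.9):
    ce = -math.log(p)  # plain cross-entropy for reference
    print(p, round(focal_loss(p), 4), round(ce, 4), round(inverse_focal_loss(p), 4))
```

For any p in (0, 1) the focal loss lies below cross-entropy and the inverse focal loss lies above it, which is the opposite emphasis the abstract refers to.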

[97] A Divide-and-Conquer Approach for Global Orientation of Non-Watertight Scene-Level Point Clouds Using 0-1 Integer Optimization

Zhuodong Li,Fei Hou,Wencheng Wang,Xuequan Lu,Ying He

Main category: cs.CV

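TL;DR: DACPO is a divide-and-conquer method for orienting point clouds of large-scale, non-watertight 3D scenes, combining block-wise processing, global optimization, and visibility-based assessment for efficient orientation.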

Details

Motivation: Existing methods mainly target watertight, object-level 3D models, while the orientation of large-scale, non-watertight 3D scenes remains underexplored. Method: DACPO splits the point cloud into blocks, orients each block with a randomized greedy method refined by Poisson surface reconstruction, and then integrates the results through a graph model and global optimization. Result: On benchmark datasets, DACPO performs strongly in non-watertight scenes and outperforms existing methods. Conclusion: DACPO provides an efficient and robust solution for orienting large-scale, non-watertight point clouds. Abstract: Orienting point clouds is a fundamental problem in computer graphics and 3D vision, with applications in reconstruction, segmentation, and analysis. While significant progress has been made, existing approaches mainly focus on watertight, object-level 3D models. The orientation of large-scale, non-watertight 3D scenes remains an underexplored challenge. To address this gap, we propose DACPO (Divide-And-Conquer Point Orientation), a novel framework that leverages a divide-and-conquer strategy for scalable and robust point cloud orientation. Rather than attempting to orient an unbounded scene at once, DACPO segments the input point cloud into smaller, manageable blocks, processes each block independently, and integrates the results through a global optimization stage. For each block, we introduce a two-step process: estimating initial normal orientations by a randomized greedy method and refining them by an adapted iterative Poisson surface reconstruction. To achieve consistency across blocks, we model inter-block relationships using an undirected graph, where nodes represent blocks and edges connect spatially adjacent blocks. To reliably evaluate orientation consistency between adjacent blocks, we introduce the concept of the visible connected region, which defines the region over which visibility-based assessments are performed. The global integration is then formulated as a 0-1 integer-constrained optimization problem, with block flip states as binary variables. Despite the combinatorial nature of the problem, DACPO remains scalable by limiting the number of blocks (typically a few hundred for 3D scenes) involved in the optimization. Experiments on benchmark datasets demonstrate DACPO's strong performance, particularly in challenging large-scale, non-watertight scenarios where existing methods often fail. The source code is available at https://github.com/zd-lee/DACPO.
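
The global stage described above, with one binary flip variable per block and pairwise consistency on graph edges, can be illustrated with a toy brute-force solver. The objective used here (signed edge weights, where a positive weight means two adjacent blocks already agree) is a hypothetical stand-in; the paper's actual objective and solver are not given in the abstract, and brute force is only feasible for a handful of blocks.

```python
from itertools import product


def best_flips(num_blocks, edges):
    """Exhaustively search 0-1 flip assignments.

    edges: list of (i, j, w). w > 0 means blocks i and j are currently
    consistent (reward keeping the same flip state); w < 0 means they are
    inconsistent (reward flipping exactly one of them).
    Returns (assignment, score) maximizing the total signed agreement.
    """
    best_score, best_assignment = float("-inf"), None
    for flips in product((0, 1), repeat=num_blocks):
        score = sum(w if flips[i] == flips[j] else -w for i, j, w in edges)
        if score > best_score:
            best_score, best_assignment = score, flips
    return best_assignment, best_score


# Blocks 0 and 1 agree; block 2 disagrees with block 1.
edges = [(0, 1, 1.0), (1, 2, -1.0)]
print(best_flips(3, edges))
```

With a few hundred blocks, as the abstract mentions, exhaustive search is out of reach and a dedicated integer programming solver would be needed instead.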

[98] TimePoint: Accelerated Time Series Alignment via Self-Supervised Keypoint and Descriptor Learning

Ron Shapira Weber,Shahar Ben Ishay,Andrey Lavrinenko,Shahaf E. Finder,Oren Freifeld

Main category: cs.CV

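TL;DR: TimePoint is a self-supervised method that learns keypoints and descriptors from synthetic data, significantly accelerating DTW alignment and improving accuracy.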

Details

Motivation: Dynamic Time Warping (DTW) scales poorly and is sensitive to noise in time series alignment, so a more efficient solution is needed. Method: TimePoint generates synthetic data with 1D diffeomorphisms, extracts keypoints and descriptors with fully convolutional and wavelet convolutional architectures, and then applies DTW to the sparse representations. Result: TimePoint outperforms standard DTW in both speed and accuracy, and generalizes to real data when trained only on synthetic data. Conclusion: TimePoint offers a scalable and efficient solution for time series analysis. Abstract: Fast and scalable alignment of time series is a fundamental challenge in many domains. The standard solution, Dynamic Time Warping (DTW), struggles with poor scalability and sensitivity to noise. We introduce TimePoint, a self-supervised method that dramatically accelerates DTW-based alignment while typically improving alignment accuracy by learning keypoints and descriptors from synthetic data. Inspired by 2D keypoint detection but carefully adapted to the unique challenges of 1D signals, TimePoint leverages efficient 1D diffeomorphisms, which effectively model nonlinear time warping, to generate realistic training data. This approach, along with fully convolutional and wavelet convolutional architectures, enables the extraction of informative keypoints and descriptors. Applying DTW to these sparse representations yields major speedups and typically higher alignment accuracy than standard DTW applied to the full signals. TimePoint demonstrates strong generalization to real-world time series when trained solely on synthetic data, and further improves with fine-tuning on real data. Extensive experiments demonstrate that TimePoint consistently achieves faster and more accurate alignments than standard DTW, making it a scalable solution for time-series analysis. Our code is available at https://github.com/BGU-CS-VIL/TimePoint
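
For context on why DTW scales poorly, here is a minimal sketch of the textbook dynamic-programming DTW on two 1D sequences. This is the standard baseline, not TimePoint itself; the function name and the absolute-difference cost are choices made for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW between two 1D sequences using absolute difference
    as the local cost. Runs in O(len(a) * len(b)) time and memory."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] matched with an already used b[j-1]
                cost[i][j - 1],      # b[j-1] matched with an already used a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # time-stretched copy aligns at zero cost
```

Because the table has one cell per pair of samples, running the same recurrence on a short list of keypoints rather than on the full signals shrinks the work quadratically, which is consistent with the speedups the abstract reports.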

[99] PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views

Mohamed Rayan Barhdadi,Hasan Kurban,Hussein Alnuweiri

Main category: cs.CV

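TL;DR: PhysicsNeRF improves NeRF with physical constraints, achieving better 3D reconstruction from sparse views and significantly outperforming existing methods.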

Details

Motivation: Address the poor performance of standard NeRF under sparse views and improve the physical consistency and generalization of 3D reconstruction. Method: Combines four constraints (depth ranking, RegNeRF-style consistency, sparsity priors, and cross-view alignment) with a compact 0.67M-parameter architecture. Result: Reaches 21.4 dB average PSNR with only 8 views, outperforming existing methods, and reveals a 5.7-6.2 dB generalization gap in sparse-view reconstruction. Conclusion: PhysicsNeRF provides physically consistent 3D representations for agent interaction and simulation, and clarifies the expressiveness-generalization trade-off in constrained NeRF models. Abstract: PhysicsNeRF is a physically grounded framework for 3D reconstruction from sparse views, extending Neural Radiance Fields with four complementary constraints: depth ranking, RegNeRF-style consistency, sparsity priors, and cross-view alignment. While standard NeRFs fail under sparse supervision, PhysicsNeRF employs a compact 0.67M-parameter architecture and achieves 21.4 dB average PSNR using only 8 views, outperforming prior methods. A generalization gap of 5.7-6.2 dB is consistently observed and analyzed, revealing fundamental limitations of sparse-view reconstruction. PhysicsNeRF enables physically consistent, generalizable 3D representations for agent interaction and simulation, and clarifies the expressiveness-generalization trade-off in constrained NeRF models.
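
The dB figures above are PSNR values. As a reminder of how that metric is computed, here is a minimal sketch over flat lists of pixel intensities in [0, 1]; the function name and the flat-list input are simplifications for illustration, and reading the "generalization gap" as training-view PSNR minus held-out-view PSNR is an assumption.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0.0, 0.0, 0.0, 0.0]
rendered = [0.1, 0.1, 0.1, 0.1]  # uniform error of 0.1 gives MSE 0.01
print(psnr(reference, rendered))  # about 20 dB
```

Because the scale is logarithmic, a gap of about 6 dB corresponds to roughly a fourfold difference in mean squared error.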

[100] VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation

Shi-Xue Zhang,Hongfa Wang,Duojun Huang,Xin Li,Xiaobin Zhu,Xu-Cheng Yin

Main category: cs.CV

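TL;DR: The paper introduces VCapsBench, a fine-grained benchmark for video caption evaluation with 5K+ videos and 100K+ QA pairs, aimed at improving text-to-video generation quality.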

Details

Motivation: Existing benchmarks fall short on fine-grained evaluation, particularly in capturing the spatial-temporal details that are critical for video generation. Method: Introduces VCapsBench, with 5,677 videos and 109,796 QA pairs covering 21 fine-grained dimensions, and proposes three metrics (AR, IR, CR) along with an LLM-based automated evaluation pipeline. Result: VCapsBench verifies caption quality through contrastive QA-pair analysis and provides actionable insights for caption optimization. Conclusion: The benchmark can advance the development of robust text-to-video models; the dataset and code are open source. Abstract: Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA-pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics (Accuracy (AR), Inconsistency Rate (IR), Coverage Rate (CR)), and an automated evaluation pipeline leveraging a large language model (LLM) to verify caption quality via contrastive QA-pairs analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and code are available at https://github.com/GXYM/VCapsBench.

[101] R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation

Kaijie Chen,Zihao Lin,Zhiyang Xu,Ying Shen,Yuguang Yao,Joy Rimchala,Jiaxin Zhang,Lifu Huang

Main category: cs.CV

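TL;DR: R2I-Bench is a benchmark dedicated to evaluating the reasoning ability of text-to-image generation models; it covers multiple categories of reasoning tasks and includes a fine-grained evaluation metric, R2IScore. Experiments show that current models have limited reasoning ability.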

Details

Motivation: Existing text-to-image generation models reason poorly and lack systematic evaluation, so a dedicated benchmark and evaluation method are needed. Method: Proposes the R2I-Bench benchmark, covering multiple reasoning categories, and designs R2IScore, a QA-based metric that assesses text-image alignment, reasoning accuracy, and image quality. Result: Experiments on 16 representative models find that their reasoning ability is generally limited, indicating a need for more robust reasoning-aware architectures. Conclusion: R2I-Bench provides an important tool for evaluating and improving the reasoning ability of text-to-image generation models; more advanced reasoning-aware models are needed. Abstract: Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation, e.g., generating "a bitten apple that has been left in the air for more than a week" necessitates understanding temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises meticulously curated data instances, spanning core reasoning categories, including commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2IScore, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality. Extensive experiments with 16 representative T2I models, including a strong pipeline-based framework that decouples reasoning and generation using the state-of-the-art language and image generation models, demonstrate consistently limited reasoning performance, highlighting the need for more robust, reasoning-aware architectures in the next generation of T2I systems. Project Page: https://r2i-bench.github.io

[102] VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning

Liyun Zhu,Qixiang Chen,Xi Shen,Xiaodong Cun

Main category: cs.CV

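TL;DR: The paper proposes VAU-R1, a framework built on multimodal large language models (MLLMs) that improves reasoning for video anomaly understanding (VAU) through reinforcement fine-tuning (RFT), and introduces VAU-Bench, the first benchmark for video anomaly reasoning.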

Details

Motivation: Video anomaly understanding is essential in areas such as smart cities and security surveillance, but existing methods lack interpretability and struggle to capture the causal and contextual aspects of abnormal events, and there is no comprehensive benchmark for evaluating reasoning ability. Method: Proposes the VAU-R1 framework, which uses MLLMs and RFT to enhance anomaly reasoning, and designs the VAU-Bench benchmark, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Result: Experiments show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence. Conclusion: VAU-R1 and VAU-Bench lay a foundation for interpretable, reasoning-aware video anomaly understanding. Abstract: Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). In addition, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at https://github.com/GVCLab/VAU-R1.

[103] OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data

Fengxiang Wang,Mingshuo Chen,Xuming He,YiFan Zhang,Feng Liu,Zijie Guo,Zhenghao Hu,Jiong Wang,Jingyi Xu,Zhangrui Li,Fenghua Ling,Ben Fei,Weijia Li,Long Lan,Wenjing Yang,Wenlong Zhang,Lei Bai

Main category: cs.CV

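TL;DR: OmniEarth-Bench is a comprehensive multimodal Earth science benchmark covering all six Earth spheres and their interactions, with 100 expert-curated evaluation dimensions; current state-of-the-art models perform poorly on it.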

Details

Motivation: Existing benchmarks for Earth science multimodal learning are limited in coverage and evaluation dimensions and cannot fully assess cross-sphere interactions. Method: Uses satellite and in-situ observational data to integrate 29,779 annotations across four tiers (perception, general reasoning, scientific knowledge reasoning, and chain-of-thought reasoning), validated through a hybrid expert-crowd workflow. Result: Nine state-of-the-art multimodal large models perform poorly on the benchmark: none reaches 35% accuracy, and on some cross-sphere tasks GPT-4o's accuracy drops to 0%. Conclusion: OmniEarth-Bench sets a new standard for Earth science AI, advancing scientific discovery and practical applications in environmental monitoring. Abstract: Existing benchmarks for Earth science multimodal learning exhibit critical limitations in systematic coverage of geosystem components and cross-sphere interactions, often constrained to isolated subsystems (only in Human-activities sphere or atmosphere) with limited evaluation dimensions (less than 16 tasks). To address these gaps, we introduce OmniEarth-Bench, the first comprehensive multimodal benchmark spanning all six Earth science spheres (atmosphere, lithosphere, Oceansphere, cryosphere, biosphere and Human-activities sphere) and cross-spheres with one hundred expert-curated evaluation dimensions. Leveraging observational data from satellite sensors and in-situ measurements, OmniEarth-Bench integrates 29,779 annotations across four tiers: perception, general reasoning, scientific knowledge reasoning and chain-of-thought (CoT) reasoning. This involves the efforts of 2-5 experts per sphere to establish authoritative evaluation dimensions and curate relevant observational datasets, and 40 crowd-sourcing annotators to assist experts with annotations; finally, OmniEarth-Bench is validated via hybrid expert-crowd workflows to reduce label ambiguity. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35% accuracy. In particular, in some cross-sphere tasks, the performance of leading models like GPT-4o drops to 0.0%. OmniEarth-Bench sets a new standard for geosystem-aware AI, advancing both scientific discovery and practical applications in environmental monitoring and disaster prediction. The dataset, source code, and trained models were released.

[104] CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization

Rui Xia,Dan Jiang,Quan Zhang,Ke Zhang,Chun Yuan

Main category: cs.CV

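TL;DR: Proposes a CLIP-assisted cross-view audio-visual enhancement method for unsupervised temporal action localization, addressing existing methods' excessive focus on highly discriminative regions and their lack of multimodal information.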

Details

Motivation: Existing supervised or weakly supervised methods rely on labeled data, which is costly; unsupervised methods suffer from features that overly focus on highly discriminative regions and from a lack of multimodal information. Method: Combines visual language pre-training with classification pre-training for collaborative enhancement, incorporates audio perception to provide boundary information, and adopts a self-supervised cross-view learning paradigm. Result: Outperforms several state-of-the-art methods on two public datasets. Conclusion: The method effectively improves unsupervised temporal action localization through multimodal and cross-view learning. Abstract: Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model's superiority over several state-of-the-art competitors.

[105] Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation

Jiahao Cui,Yan Chen,Mingwang Xu,Hanlin Shang,Yuxuan Chen,Yun Zhan,Zilong Dong,Yao Yao,Jingdong Wang,Siyu Zhu

Main category: cs.CV

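TL;DR: Proposes a human-preference-aligned diffusion framework that uses direct preference optimization and temporal motion modulation to significantly improve lip-audio synchronization, expression vividness, and body motion coherence in portrait animation.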

Details

Motivation: Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion is challenging because it requires precise lip synchronization, natural expressions, and high-fidelity body motion. Method: (1) Direct preference optimization, which uses a human preference dataset to align generated outputs with perceptual metrics; (2) temporal motion modulation, which resolves spatiotemporal resolution mismatches through temporal channel redistribution and proportional feature expansion. Result: Clearly outperforms baseline methods in lip-audio synchronization, expression vividness, and body motion coherence, with notable gains in human preference metrics. Conclusion: The proposed framework effectively addresses key problems in portrait animation generation and shows superior performance in experiments. Abstract: Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet and DiT-based portrait diffusion approaches, and experiments demonstrate obvious improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: https://github.com/xyz123xyz456/hallo4.
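
As background on direct preference optimization, the sketch below shows the generic DPO loss for a single preference pair, written in terms of log-likelihoods under the trained model and a frozen reference model. This is the language-model form of the objective; the abstract does not describe how Hallo4 adapts it to diffusion (where likelihoods are usually replaced by denoising-error surrogates), so everything here is an assumption for illustration.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    margin = beta * [(log pi(win) - log pi_ref(win))
                     - (log pi(lose) - log pi_ref(lose))]
    loss   = -log(sigmoid(margin)) = log(1 + exp(-margin))
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return math.log1p(math.exp(-margin))


# Model identical to the reference: loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Model has shifted probability toward the preferred sample: lower loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```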

[106] Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications

Jan Ignatowicz,Krzysztof Kutt,Grzegorz J. Nalepa

Main category: cs.CV

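TL;DR: The paper proposes the Metadata Enrichment Model (MEM), which combines neural networks with semantic technologies to improve the accessibility and interoperability of digitized cultural heritage collections, and uses a Multilayer Vision Mechanism (MVM) to dynamically extract nested features.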

Details

Motivation: Insufficient metadata for digitized cultural heritage limits accessibility and cross-institutional collaboration, and existing visual analysis models have seen limited use on domain-specific artifacts such as manuscripts and incunabula. Method: Proposes the MEM framework, which combines computer vision models, large language models (LLMs), and knowledge graphs, and uses the MVM to dynamically detect nested features. Result: MEM is applied to an incunabula dataset from the Jagiellonian Digital Library, and an annotated dataset of 105 manuscript pages is released. Conclusion: MEM offers a flexible and extensible methodology for cultural heritage research and illustrates the practical potential of combining AI with semantic technologies. Abstract: The digitization of cultural heritage collections has opened new directions for research, yet the lack of enriched metadata poses a substantial challenge to accessibility, interoperability, and cross-institutional collaboration. In the past several years, neural network models such as YOLOv11 and Detectron2 have revolutionized visual data analysis, but their application to domain-specific cultural artifacts - such as manuscripts and incunabula - remains limited by the absence of methodologies that address structural feature extraction and semantic interoperability. In this position paper, we argue that the integration of neural networks with semantic technologies represents a paradigm shift in cultural heritage digitization processes. We present the Metadata Enrichment Model (MEM), a conceptual framework designed to enrich metadata for digitized collections by combining fine-tuned computer vision models, large language models (LLMs) and structured knowledge graphs. The Multilayer Vision Mechanism (MVM) is the key innovation of MEM. This iterative process improves visual analysis by dynamically detecting nested features, such as text within seals or images within stamps. To demonstrate MEM's potential, we apply it to a dataset of digitized incunabula from the Jagiellonian Digital Library and release a manually annotated dataset of 105 manuscript pages. We examine the practical challenges of MEM's usage in real-world GLAM institutions, including the need for domain-specific fine-tuning, the alignment of enriched metadata with Linked Data standards and computational costs. We present MEM as a flexible and extensible methodology. This paper contributes to the discussion on how artificial intelligence and semantic web technologies can advance cultural heritage research, and on how these technologies can be used in practice.

[107] Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information

Xu Chu,Xinrong Chen,Guanyu Wang,Zhijie Tan,Kui Huang,Wenyu Lv,Tong Mo,Weiping Li

Main category: cs.CV

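TL;DR: Qwen-LA introduces a vision-text reflection process that reduces hallucinations in vision-language reasoning models and strengthens attention to visual information.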

Details

Motivation: Long reasoning dilutes visual information, lowering visual attention and triggering hallucinations; text-only reflection is not enough to fix this. Method: Proposes Qwen-LA, trained with the BRPO reinforcement learning method so that the model decides on its own when to perform vision-text reflection, and uses Visual Token COPY and Visual Token ROUTE to force the model to re-attend to visual information. Result: Experiments show that Qwen-LA achieves leading accuracy on multiple visual QA datasets and hallucination metrics while reducing hallucinations. Conclusion: Through vision-text reflection and supplementary visual information, Qwen-LA improves the performance and reliability of vision-language reasoning models. Abstract: Inference time scaling drives extended reasoning to enhance the performance of Vision-Language Models (VLMs), thus forming powerful Vision-Language Reasoning Models (VLRMs). However, long reasoning dilutes visual tokens, causing visual information to receive less attention, which may trigger hallucinations. Although introducing text-only reflection processes shows promise in language models, we demonstrate that it is insufficient to suppress hallucinations in VLMs. To address this issue, we introduce Qwen-LookAgain (Qwen-LA), a novel VLRM designed to mitigate hallucinations by incorporating a vision-text reflection process that guides the model to re-attend to visual information during reasoning. We first propose a reinforcement learning method, Balanced Reflective Policy Optimization (BRPO), which guides the model to decide when to generate vision-text reflection on its own and to balance the number and length of reflections. Then, we formally prove that VLRMs lose attention to visual tokens as reasoning progresses, and demonstrate that supplementing visual information during reflection enhances visual attention. Therefore, during training and inference, Visual Token COPY and Visual Token ROUTE are introduced to force the model to re-attend to visual information at the visual level, addressing the limitations of text-only reflection. Experiments on multiple visual QA datasets and hallucination metrics indicate that Qwen-LA achieves leading accuracy performance while reducing hallucinations. Our code is available at: https://github.com/Liar406/Look_Again.
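
To make the dilution argument concrete, here is a deliberately simplified sketch that assumes attention is spread uniformly over all tokens in the context. That assumption, the function names, and the idea of modelling a COPY-like step as appending the visual tokens once more are illustrative only; they do not reproduce the paper's formal proof or its actual mechanisms.

```python
def visual_attention_share(n_visual, n_text):
    """Share of attention mass on visual tokens under uniform attention."""
    return n_visual / (n_visual + n_text)


def share_after_copy(n_visual, n_text):
    """Share after the visual tokens are appended once more to the context."""
    return 2 * n_visual / (2 * n_visual + n_text)


if __name__ == "__main__":
    # The visual share shrinks as the reasoning trace grows ...
    for n_text in (100, 1000, 4000):
        print(n_text, round(visual_attention_share(576, n_text), 3))
    # ... and re-supplying the visual tokens raises it again.
    print(round(share_after_copy(576, 4000), 3))
```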

[108] Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition

Yu Li,Jin Jiang,Jianhua Zhu,Shuai Peng,Baole Wei,Yuxuan Zhou,Liangcai Gao

Main category: cs.CV

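TL;DR: Uni-MuMER tackles handwritten mathematical expression recognition (HMER) by fine-tuning a pretrained vision-language model (VLM) with three data-driven tasks, achieving state-of-the-art performance on the CROHME and HME100K datasets.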

Details

Motivation: HMER is challenging because of free-form symbol layout and variable handwriting styles, and existing methods are hard to integrate into a unified framework. Method: Fully fine-tunes a VLM with three tasks (Tree-CoT, EDL, and SC) to inject domain knowledge. Result: On CROHME and HME100K, surpasses the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Conclusion: Uni-MuMER demonstrates the potential of VLMs for HMER and provides a unified solution. Abstract: Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER
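
As a rough illustration of what a Symbol Counting target could look like, the sketch below counts the visible symbols in a LaTeX string. The tokenization rule, the set of ignored structural tokens, and the name `count_symbols` are assumptions made for this example; the abstract does not describe the paper's actual target format.

```python
import re
from collections import Counter

# A token is either a LaTeX command such as \frac or a single
# non-whitespace character.
TOKEN = re.compile(r"\\[A-Za-z]+|\S")

# Purely structural tokens that do not correspond to a written symbol
# (an assumption made for this sketch).
STRUCTURAL = {"{", "}", "^", "_"}


def count_symbols(latex):
    """Count how often each visible symbol occurs in a LaTeX expression."""
    tokens = TOKEN.findall(latex)
    return Counter(tok for tok in tokens if tok not in STRUCTURAL)


if __name__ == "__main__":
    print(count_symbols(r"\frac{x+1}{x-1}"))
```

A count like this could serve as an auxiliary check on long expressions: a prediction whose symbol counts disagree with the counted target has dropped or duplicated something.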

[109] Weakly-supervised Localization of Manipulated Image Regions Using Multi-resolution Learned Features

Ziyong Wang,Charith Abhayaratne

Main category: cs.CV

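TL;DR: Proposes a weakly supervised method that combines activation maps from an image-level manipulation detection network with segmentation maps from pre-trained models to localize manipulated image regions without pixel-level annotations.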

Details

Motivation: Existing deep learning methods reach high image-level classification accuracy but lack interpretability and the ability to localize manipulated regions, and pixel-level annotations are rarely available in real-world settings. Method: Builds on WCBnet to produce multi-view feature maps, fuses them with segmentation maps from pre-trained segmentation models (such as DeepLab, SegmentAnything, and PSPnet), and refines the localization with Bayesian inference. Result: Experiments show that the method is effective and can localize image manipulations without pixel-level labels. Conclusion: The method offers a feasible weakly supervised solution for image manipulation localization. Abstract: The explosive growth of digital images and the widespread availability of image editing tools have made image manipulation detection an increasingly critical challenge. While current deep learning-based manipulation detection methods excel in achieving high image-level classification accuracy, they often fall short in terms of interpretability and localization of manipulated regions. Additionally, the absence of pixel-wise annotations in real-world scenarios limits the existing fully-supervised manipulation localization techniques. To address these challenges, we propose a novel weakly-supervised approach that integrates activation maps generated by image-level manipulation detection networks with segmentation maps from pre-trained models. Specifically, we build on our previous image-level work named WCBnet to produce multi-view feature maps which are subsequently fused for coarse localization. These coarse maps are then refined using detailed segmented regional information provided by pre-trained segmentation models (such as DeepLab, SegmentAnything and PSPnet), with Bayesian inference employed to enhance the manipulation localization. Experimental results demonstrate the effectiveness of our approach, highlighting the feasibility of localizing image manipulations without relying on pixel-level labels.
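
The refinement idea, snapping a coarse activation map to region boundaries supplied by a segmentation model, can be sketched as below. This stand-in simply averages the coarse scores inside each segment; it is not the Bayesian inference step the abstract mentions, and the nested-list map format and the name `refine_with_segments` are assumptions for illustration.

```python
def refine_with_segments(activation, segments):
    """Replace each pixel's score with the mean score of its segment.

    activation : 2D list of coarse manipulation scores in [0, 1]
    segments   : 2D list of segment ids with the same shape
    """
    totals, counts = {}, {}
    for act_row, seg_row in zip(activation, segments):
        for score, seg_id in zip(act_row, seg_row):
            totals[seg_id] = totals.get(seg_id, 0.0) + score
            counts[seg_id] = counts.get(seg_id, 0) + 1
    # One averaged score per segment.
    means = {seg_id: totals[seg_id] / counts[seg_id] for seg_id in totals}
    return [[means[seg_id] for seg_id in seg_row] for seg_row in segments]


if __name__ == "__main__":
    coarse = [[0.9, 0.7], [0.1, 0.3]]
    regions = [[0, 0], [1, 1]]
    print(refine_with_segments(coarse, regions))
```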

[110] Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

Zifu Wang,Junyi Zhu,Bo Tang,Zhiyu Li,Feiyu Xiong,Jiaqian Yu,Matthew B. Blaschko

Main category: cs.CV

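TL;DR: Studies rule-based visual reinforcement learning (RL) for multimodal large language models (MLLMs), using jigsaw puzzles as the experimental framework. Fine-tuning takes MLLMs from near-random guessing to near-perfect accuracy and generalizes to complex configurations. RL generalizes better than supervised fine-tuning (SFT), and an initial SFT phase can hinder RL optimization.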

Details

Motivation: To explore how rule-based RL behaves on multimodal tasks, jigsaw puzzles in particular, and to close the research gap between text-only and visual domains. Method: Uses jigsaw puzzles as a structured experimental framework, fine-tunes MLLMs, and compares RL with SFT. Result: Fine-tuning substantially improves MLLM performance and generalizes to complex tasks; RL is more effective than SFT, but an initial SFT phase can hinder RL optimization. Conclusion: Jigsaw puzzles yield important insights into rule-based visual RL; RL shows promise for multimodal learning, but the initial training strategy matters. Abstract: The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL using jigsaw puzzles as a structured experimental framework, revealing several key findings. First, we find that MLLMs, initially performing near random guessing on simple puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. Second, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Third, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. Fourth, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. Finally, our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece of jigsaw to the larger puzzle of collectively understanding rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.
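
A rule-based reward for a jigsaw task can be computed without any learned judge, which is what makes the setup attractive for RL. The sketch below is one plausible shape for such a reward: the `<answer>` tag convention, the partial-credit rule, and the name `jigsaw_reward` are assumptions for illustration, not details taken from the paper.

```python
import re

# Assumed output convention: the final ordering is wrapped in <answer> tags.
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def jigsaw_reward(response, target):
    """Return a reward in [0, 1] for a predicted piece ordering.

    response : raw model output
    target   : non-empty list with the correct piece index per position
    """
    match = ANSWER.search(response)
    if match is None:
        return 0.0  # no parsable answer: the format rule is violated
    try:
        predicted = [int(tok) for tok in match.group(1).replace(",", " ").split()]
    except ValueError:
        return 0.0  # non-numeric content inside the answer tags
    # The prediction must be a permutation of the same pieces.
    if sorted(predicted) != sorted(target):
        return 0.0
    # Partial credit: fraction of positions holding the right piece.
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)
```

For example, swapping two pieces of a four-piece puzzle yields a reward of 0.5, while a response without answer tags yields 0.0.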

[111] VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Ce Zhang,Kaixin Ma,Tianqing Fang,Wenhao Yu,Hongming Zhang,Zhisong Zhang,Yaqi Xie,Katia Sycara,Haitao Mi,Dong Yu

Main category: cs.CV

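TL;DR: VScan is a two-stage visual token reduction framework that combines global and local scans with token merging during visual encoding and pruning at intermediate layers of the language model, significantly accelerating inference while preserving performance.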

Details

Motivation: To address the high computational cost that long visual token sequences impose on Large Vision-Language Models (LVLMs), so that they can be deployed in real time. Method: Proposes VScan, which merges tokens using global and local scans during visual encoding and prunes tokens at intermediate layers of the language model. Result: Validated on four LVLMs; on LLaVA-NeXT-7B, VScan achieves a 2.91x prefilling speedup and a 10x FLOPs reduction while retaining 95.4% of the original performance. Conclusion: By optimizing how tokens are processed, VScan balances computational efficiency and model performance and outperforms existing methods. Abstract: Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91x speedup in prefilling and a 10x reduction in FLOPs, while retaining 95.4% of the original performance.
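
The pruning half of such a pipeline reduces to selecting a subset of tokens by an importance score. Below is a minimal sketch of score-based top-k selection that preserves token order; how the scores are obtained (for instance from attention weights), the `keep_ratio` parameter, and the name `prune_tokens` are assumptions, and the global/local scanning and token merging described in the abstract are not modelled.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    tokens     : list of token representations (any objects)
    scores     : one importance score per token
    keep_ratio : fraction of tokens to keep, in (0, 1]
    """
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k largest scores; ties go to the earlier token.
    top = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))[:k]
    # Re-sort the kept indices so positional order survives pruning.
    return [tokens[i] for i in sorted(top)]


if __name__ == "__main__":
    print(prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], 0.5))
```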

[112] DeepChest: Dynamic Gradient-Free Task Weighting for Effective Multi-Task Learning in Chest X-ray Classification

Youssef Mohamed,Noran Mohamed,Khaled Abouhashad,Feilong Tang,Sara Atito,Shoaib Jameel,Imran Razzak,Ahmed B. Zaky

Main category: cs.CV

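TL;DR: DeepChest is a dynamic task-weighting framework for multi-label chest X-ray classification that uses a performance-driven weighting mechanism to improve efficiency and accuracy.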

Details

Motivation: Multi-task learning (MTL) is well suited to domains such as medical imaging, but balancing task contributions remains a challenge. Method: DeepChest analyzes task-specific loss trends to adjust task weights dynamically without gradient access, reducing memory usage and speeding up training. Result: On a large-scale CXR dataset, DeepChest improves overall accuracy by 7% over existing methods and substantially reduces individual task losses. Conclusion: DeepChest offers a more efficient and practical way to deploy deep learning in medical diagnosis. Abstract: While Multi-Task Learning (MTL) offers inherent advantages in complex domains such as medical imaging by enabling shared representation learning, effectively balancing task contributions remains a significant challenge. This paper addresses this critical issue by introducing DeepChest, a novel, computationally efficient and effective dynamic task-weighting framework specifically designed for multi-label chest X-ray (CXR) classification. Unlike existing heuristic or gradient-based methods that often incur substantial overhead, DeepChest leverages a performance-driven weighting mechanism based on effective analysis of task-specific loss trends. Given a network architecture (e.g., ResNet18), our model-agnostic approach adaptively adjusts task importance without requiring gradient access, thereby significantly reducing memory usage and achieving a threefold increase in training speed. It can be easily applied to improve various state-of-the-art methods. Extensive experiments on a large-scale CXR dataset demonstrate that DeepChest not only outperforms state-of-the-art MTL methods by 7% in overall accuracy but also yields substantial reductions in individual task losses, indicating improved generalization and effective mitigation of negative transfer. The efficiency and performance gains of DeepChest pave the way for more practical and robust deployment of deep learning in critical medical diagnostic applications. The code is publicly available at https://github.com/youssefkhalil320/DeepChest-MTL
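
The abstract does not give DeepChest's weighting rule, so the sketch below substitutes a different, well-known gradient-free scheme in the spirit of Dynamic Weight Averaging: tasks whose loss is falling slowly receive more weight. It is shown only to illustrate what "weighting from loss trends without gradient access" can look like; the function name and the `temperature` default are assumptions.

```python
import math


def loss_trend_weights(prev_losses, curr_losses, temperature=2.0):
    """Task weights from per-task loss ratios; no gradients are needed.

    prev_losses : per-task losses from the previous epoch (all > 0)
    curr_losses : per-task losses from the current epoch
    Returns weights that sum to the number of tasks.
    """
    # A ratio near or above 1 means the task is improving slowly.
    ratios = [curr / prev for curr, prev in zip(curr_losses, prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    n_tasks = len(ratios)
    return [n_tasks * e / total for e in exps]


if __name__ == "__main__":
    # Task 0 halved its loss, task 1 stalled, so task 1 gets more weight.
    print(loss_trend_weights([1.0, 1.0], [0.5, 1.0]))
```

Because the rule reads only scalar losses, it adds almost no memory or compute to the training loop, which is the property the abstract emphasizes.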

[113] Bridging Classical and Modern Computer Vision: PerceptiveNet for Tree Crown Semantic Segmentation

Georgios Voulgaris

Main category: cs.CV

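TL;DR: The paper proposes PerceptiveNet, which combines a Log-Gabor convolutional layer with a wide-receptive-field backbone to significantly improve tree crown semantic segmentation accuracy, and validates it on multiple datasets.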

Details

Motivation: Accurate tree crown segmentation is essential for forest management and ecological research, but traditional methods and deep learning models struggle with complex canopy structure. Method: Proposes PerceptiveNet, which pairs a trainable Log-Gabor convolutional layer with a wide-receptive-field backbone, and compares different convolutional layers experimentally. Result: PerceptiveNet outperforms existing methods on a tree crown dataset and on benchmark aerial datasets, showing cross-domain generalization. Conclusion: PerceptiveNet's design significantly improves semantic segmentation performance and suits complex scenes. Abstract: The accurate semantic segmentation of tree crowns within remotely sensed data is crucial for scientific endeavours such as forest management, biodiversity studies, and carbon sequestration quantification. However, precise segmentation remains challenging due to complexities in the forest canopy, including shadows, intricate backgrounds, scale variations, and subtle spectral differences among tree species. Compared to the traditional methods, Deep Learning models improve accuracy by extracting informative and discriminative features, but often fall short in capturing the aforementioned complexities. To address these challenges, we propose PerceptiveNet, a novel model incorporating a Logarithmic Gabor-parameterised convolutional layer with trainable filter parameters, alongside a backbone that extracts salient features while capturing extensive context and spatial information through a wider receptive field. We investigate the impact of Log-Gabor, Gabor, and standard convolutional layers on semantic segmentation performance through extensive experimentation. Additionally, we conduct an ablation study to assess the contributions of individual layers and their combinations to overall model performance, and we evaluate PerceptiveNet as a backbone within a novel hybrid CNN-Transformer model. Our results outperform state-of-the-art models, demonstrating significant performance improvements on a tree crown dataset while generalising across domains, including two benchmark aerial scene semantic segmentation datasets with varying complexities.
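
For readers unfamiliar with Log-Gabor filters, the radial frequency response has a simple closed form: a Gaussian on a logarithmic frequency axis. The sketch below evaluates that textbook response; how PerceptiveNet parameterises and trains its layer is not described in the abstract, and the name `log_gabor` and the default bandwidth ratio are assumptions.

```python
import math


def log_gabor(f, f0, sigma_ratio=0.55):
    """Radial Log-Gabor response at frequency f for centre frequency f0.

    sigma_ratio is the ratio sigma / f0 that controls bandwidth; it must be
    positive and different from 1 (the default is an assumed typical value).
    """
    if f <= 0:
        return 0.0  # the Log-Gabor function has no DC component
    log_ratio = math.log(f / f0)
    return math.exp(-(log_ratio ** 2) / (2 * math.log(sigma_ratio) ** 2))


if __name__ == "__main__":
    for f in (0.025, 0.05, 0.1, 0.2, 0.4):
        print(f, round(log_gabor(f, f0=0.1), 4))
```

The response peaks at the centre frequency and is symmetric on a log axis, so frequencies one octave below and one octave above the centre receive the same weight.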

Shengyuan Liu,Boyun Zheng,Wenting Chen,Zhihao Peng,Zhenfei Yin,Jing Shao,Jiancong Hu,Yixuan Yuan

Main category: cs.CV

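TL;DR: EndoBench is a comprehensive benchmark for evaluating multimodal large language models (MLLMs) in endoscopic practice, covering a wide range of scenarios and tasks.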

Details

Motivation: Current endoscopy analysis benchmarks are limited and fail to cover real-world diversity or the full needs of clinical workflows. Method: EndoBench comprises 4 endoscopic scenarios, 12 clinical tasks, and 5 levels of visual prompting granularity, totaling 6,832 validated VQA pairs. Result: Proprietary MLLMs outperform open-source models but still trail human experts; medical-domain fine-tuning substantially improves accuracy; and model performance is sensitive to prompt format and task complexity. Conclusion: EndoBench sets a new standard for evaluating MLLMs in endoscopy and exposes the gap between current models and expert clinical reasoning. Abstract: Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited, as they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic scenarios and the full range of skills needed in clinical workflows. To address these issues, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities. EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets. Our multi-dimensional evaluation framework mirrors the clinical workflow--spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations--to holistically gauge the perceptual and diagnostic abilities of MLLMs in realistic scenarios. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs, and establish human clinician performance as a reference standard. Our extensive experiments reveal: (1) proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts; (2) medical-domain supervised fine-tuning substantially boosts task-specific accuracy; and (3) model performance remains sensitive to prompt format and clinical task complexity. EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning. We publicly release our benchmark and code.

[115] Color Image Set Recognition Based on Quaternionic Grassmannians

Xiang Xiang Wang,Tin-Yau Tam

Main category: cs.CV

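TL;DR: Proposes a color image set recognition method based on quaternionic Grassmannians, using quaternions to capture color information and building a classification framework on the shortest distance between points on the Grassmannian.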

Details

Motivation: Traditional methods handle color image set recognition poorly; quaternions capture color information better, and the Grassmannian provides an effective representation. Method: Represents each color image set as a point on the quaternionic Grassmannian, gives a direct formula for the shortest distance between two points, and builds a classification framework on that distance. Result: Achieves good recognition results on the ETH-80 dataset. Conclusion: The method is effective but has stability limitations that future work could address. Abstract: We propose a new method for recognizing color image sets using quaternionic Grassmannians, which use the power of quaternions to capture color information and represent each color image set as a point on the quaternionic Grassmannian. We provide a direct formula to calculate the shortest distance between two points on the quaternionic Grassmannian, and use this distance to build a new classification framework. Experiments on the ETH-80 benchmark dataset show that our method achieves good recognition results. We also discuss some limitations in stability and suggest ways the method can be improved in the future.
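
The abstract does not reproduce its distance formula. For orientation only, the geodesic distance on real and complex Grassmannians is usually written in terms of principal angles, as below; treating the quaternionic case the same way (with quaternionic conjugate transpose and singular values) is an assumption here, and the paper's direct formula may take a different form.

```latex
% X, Y: n-by-k matrices with orthonormal columns spanning the two subspaces
% \sigma_i: singular values of X^{*} Y;  \theta_i: principal angles
\theta_i = \arccos(\sigma_i), \qquad
d(X, Y) = \Bigl( \sum_{i=1}^{k} \theta_i^{2} \Bigr)^{1/2}
```

Under this reading, a nearest-neighbour classifier would assign a query image set the label of the training set whose subspace lies at the smallest distance d.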

[116] Comparing the Effects of Persistence Barcodes Aggregation and Feature Concatenation on Medical Imaging

Dashti A. Ali,Richard K. G. Do,William R. Jarnagin,Aras T. Asaad,Amber L. Simpson

Main category: cs.CV

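TL;DR: Compares two persistent-homology-based feature construction methods for medical image classification and finds that feature concatenation performs better.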

Details

Motivation: Traditional feature extraction is sensitive to input changes, whereas persistent homology (PH) extracts topological features stably; how best to build the final feature vector remains an open question. Method: Builds features either by aggregating persistence barcodes or by concatenating per-barcode feature vectors, and compares classification performance across multiple datasets. Result: Feature concatenation preserves more detailed topological information and yields better classification performance. Conclusion: Feature concatenation is recommended for constructing topological feature vectors in similar experiments. Abstract: In medical image analysis, feature engineering plays an important role in the design and performance of machine learning models. Persistent homology (PH), from the field of topological data analysis (TDA), demonstrates robustness and stability to data perturbations and addresses the limitation of traditional feature extraction approaches where a small change in input results in a large change in feature representation. Using PH, we store persistent topological and geometrical features in the form of the persistence barcode whereby large bars represent global topological features and small bars encapsulate geometrical information of the data. When multiple barcodes are computed from 2D or 3D medical images, two approaches can be used to construct the final topological feature vector in each dimension: aggregating persistence barcodes followed by featurization or concatenating topological feature vectors derived from each barcode. In this study, we conduct a comprehensive analysis across diverse medical imaging datasets to compare the effects of the two aforementioned approaches on the performance of classification models. The results of this analysis indicate that feature concatenation preserves detailed topological information from individual barcodes, yields better classification performance and is therefore a preferred approach when conducting similar experiments.
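
The difference between the two constructions is easy to see in code. In the sketch below a barcode is a list of (birth, death) pairs and the featurization is a toy summary (bar count, mean length, maximum length); the paper's actual featurization is not stated in the abstract, so these statistics and the function names are illustrative assumptions.

```python
def featurize(barcode):
    """Summarise a barcode [(birth, death), ...] as [count, mean, max] of bar lengths."""
    lengths = [death - birth for birth, death in barcode]
    if not lengths:
        return [0, 0.0, 0.0]
    return [len(lengths), sum(lengths) / len(lengths), max(lengths)]


def aggregate_then_featurize(barcodes):
    """Pool the bars of all barcodes first, then compute one feature vector."""
    merged = [bar for barcode in barcodes for bar in barcode]
    return featurize(merged)


def featurize_then_concatenate(barcodes):
    """Compute one feature vector per barcode and join them end to end."""
    return [value for barcode in barcodes for value in featurize(barcode)]


if __name__ == "__main__":
    barcodes = [[(0.0, 1.0), (0.0, 3.0)], [(1.0, 2.0)]]
    print(aggregate_then_featurize(barcodes))    # one 3-value vector
    print(featurize_then_concatenate(barcodes))  # one 3-value block per barcode
```

Aggregation yields a fixed-length vector regardless of how many barcodes there are, while concatenation grows with the number of barcodes and keeps each barcode's statistics separate, which matches the abstract's point that concatenation preserves per-barcode detail.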

[117] Radiant Triangle Soup with Soft Connectivity Forces for 3D Reconstruction and Novel View Synthesis

Nathaniel Burgdorfer,Philippos Mordohai

Main category: cs.CV

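TL;DR: Proposes an inference-time optimization framework that represents scene geometry and appearance with triangles, positioned as an alternative to the widely used Gaussian splats, with competitive photometric and geometric results.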

Details

Motivation: Triangles are more expressive than Gaussian splats and better support downstream tasks. Method: Develops a scene optimization algorithm for triangle soup (a collection of disconnected semi-transparent triangles) and introduces connectivity forces during optimization to encourage surface continuity. Result: Shows competitive photometric and geometric results on a representative 3D reconstruction dataset. Conclusion: Triangle representations offer advantages for scene optimization, enabling better surface continuity and expressiveness. Abstract: In this work, we introduce an inference-time optimization framework utilizing triangles to represent the geometry and appearance of the scene. More specifically, we develop a scene optimization algorithm for triangle soup, a collection of disconnected semi-transparent triangle primitives. Compared to the currently most widely used primitives for 3D scene representation, namely Gaussian splats, triangles allow for more expressive color interpolation, and benefit from a large algorithmic infrastructure for downstream tasks. Triangles, unlike full-rank Gaussian kernels, naturally combine to form surfaces. We formulate connectivity forces between triangles during optimization, encouraging explicit, but soft, surface continuity in 3D. We perform experiments on a representative 3D reconstruction dataset and show competitive photometric and geometric results.
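
The abstract credits triangles with more expressive color interpolation but does not say which scheme is used. The standard way to vary color across a triangle is barycentric interpolation of per-vertex colors, sketched below in 2D as an assumed stand-in; the function names are illustrative.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w_a = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w_b = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w_a, w_b, 1.0 - w_a - w_b


def interpolate_color(p, vertices, colors):
    """Blend per-vertex RGB colors at point p using barycentric weights."""
    weights = barycentric(p, *vertices)
    return tuple(
        sum(w * color[channel] for w, color in zip(weights, colors))
        for channel in range(3)
    )


if __name__ == "__main__":
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    print(interpolate_color((0.25, 0.25), vertices, colors))
```

A single triangle can therefore carry a smooth color gradient defined by three vertex colors, whereas a primitive with one color per kernel needs several overlapping kernels to approximate the same gradient.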

[118] VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Xiangdong Zhang,Jiaqi Liao,Shaofeng Zhang,Fanqing Meng,Xiangpeng Wan,Junchi Yan,Yu Cheng

Main category: cs.CV

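TL;DR: VideoREPA uses a Token Relation Distillation loss to distill physics knowledge from video understanding foundation models into text-to-video (T2V) diffusion models, significantly improving the physical plausibility of generated videos.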

Details

Motivation: Current T2V models struggle to generate physically plausible content because their understanding of physics is limited; their physics understanding lags behind that of video self-supervised learning methods. Method: Proposes VideoREPA, which uses a Token Relation Distillation (TRD) loss to align token-level relations with those of a video understanding foundation model, injecting physics knowledge into the T2V model. Result: VideoREPA substantially improves the physics commonsense of the CogVideoX baseline, achieves significant gains on relevant benchmarks, and generates videos more consistent with intuitive physics. Conclusion: VideoREPA is the first REPA method designed for T2V models; it improves the physical plausibility of generated videos and opens a new direction for physics understanding in T2V models. Abstract: Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physics-plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
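
Aligning token-level relations, rather than the token features themselves, can be sketched as matching pairwise similarity matrices. The example below compares cosine-similarity matrices of student and teacher tokens with a mean absolute difference; the similarity measure, the loss form, and the function names are assumptions, and the paper's TRD loss may differ in its details.

```python
import math


def cosine_matrix(tokens):
    """Pairwise cosine similarities between token feature vectors."""
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    return [
        [sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
         for b, norm_b in zip(tokens, norms)]
        for a, norm_a in zip(tokens, norms)
    ]


def relation_distillation_loss(student_tokens, teacher_tokens):
    """Mean absolute gap between student and teacher token-relation matrices.

    Both inputs need the same number of tokens, but their feature
    dimensions may differ because only similarities are compared.
    """
    s = cosine_matrix(student_tokens)
    t = cosine_matrix(teacher_tokens)
    n = len(s)
    return sum(abs(s[i][j] - t[i][j]) for i in range(n) for j in range(n)) / (n * n)
```

Because only relations are compared, the student and teacher do not need matching feature dimensions, which avoids a learned projection between the two feature spaces.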

[119] D-AR: Diffusion via Autoregressive Models

Ziteng Gao,Mike Zheng Shou

Main category: cs.CV

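TL;DR: D-AR recasts the image diffusion process as a standard autoregressive procedure over sequences of discrete tokens, supports previews and layout control, and achieves strong performance.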

Details

Motivation: To explore a unified autoregressive architecture that draws on the properties of diffusion models and the capabilities of large language models to simplify visual synthesis. Method: Designs a tokenizer that converts images into sequences of discrete tokens and applies an autoregressive model for next-token prediction, so that generation corresponds directly to diffusion denoising steps. Result: On the ImageNet benchmark, achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. Conclusion: D-AR demonstrates the potential of autoregressive models for visual synthesis and suggests a new direction for future research. Abstract: This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties, for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR

[120] OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

Size Wu,Zhonghua Wu,Zerui Gong,Qingyi Tao,Sheng Jin,Qinyue Li,Wei Li,Chen Change Loy

Main category: cs.CV

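TL;DR: OpenUni is a lightweight, open-source baseline for unified multimodal understanding and generation that achieves high-quality image generation and strong benchmark performance with an efficient training strategy and a simple architecture.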

Details

Motivation: Inspired by prevailing practices in unified model learning, the work aims to minimize training complexity and overhead while supporting open research and community development. Method: Bridges off-the-shelf multimodal large language models and diffusion models through a set of learnable queries and a lightweight transformer-based connector. Result: Generates high-quality, instruction-aligned images and performs strongly on standard benchmarks with only 1.1B and 3.1B activated parameters. Conclusion: OpenUni shows that a simple architecture can be highly effective, and all model weights, training code, and datasets are released to support community research. Abstract: In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging the off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a light-weight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.

[121] Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch,Snigdha Saha,Naitik Khandelwal,Ayush Jain,Michael J. Tarr,Aviral Kumar,Katerina Fragkiadaki

Main category: cs.CV

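TL;DR: ViGoRL is a vision-language model trained with reinforcement learning to explicitly anchor reasoning steps to visual coordinates, significantly improving performance on visual reasoning tasks.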

Details

Motivation: Visual reasoning requires models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence, yet existing methods lack an explicit spatial grounding mechanism. Method: ViGoRL uses a multi-turn reinforcement learning framework that dynamically zooms into task-relevant regions and uses the resulting visual feedback to refine its reasoning. Result: ViGoRL outperforms supervised fine-tuning and conventional RL baselines across multiple visual reasoning benchmarks, with notable gains on localizing small GUI elements and on visual search, reaching 86.4% on V*Bench. Conclusion: Visually grounded reinforcement learning is an effective paradigm for giving models general-purpose visual reasoning ability. Abstract: While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and BLINK for spatial reasoning, V*bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding--ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
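
The "zoom into predicted coordinates" step needs a crop window around the point the model names. A minimal sketch of that geometry is below, assuming integer pixel coordinates and a fixed crop size; the name `zoom_box` and the clamping policy are assumptions, and how ViGoRL chooses crop sizes or feeds the crop back to the model is not covered.

```python
def zoom_box(x, y, image_w, image_h, crop_w, crop_h):
    """Return (left, top, right, bottom) of a crop centred on (x, y).

    The window is shifted, not shrunk, when the point lies near an image
    edge, so the crop keeps its size whenever the image is large enough.
    """
    crop_w, crop_h = min(crop_w, image_w), min(crop_h, image_h)
    left = min(max(x - crop_w // 2, 0), image_w - crop_w)
    top = min(max(y - crop_h // 2, 0), image_h - crop_h)
    return left, top, left + crop_w, top + crop_h


if __name__ == "__main__":
    print(zoom_box(500, 400, 1000, 800, 200, 200))  # interior point
    print(zoom_box(10, 10, 1000, 800, 200, 200))    # near the top-left corner
```

The returned box can be passed to any image library's crop routine, and the crop can then be upsampled before it is handed back to the model as the next turn's visual input.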

[122] VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Tingyu Song,Tongyan Hu,Guo Gan,Yilun Zhao

Main category: cs.CV

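TL;DR: The paper proposes VF-Eval, a new benchmark for evaluating multimodal large language models (MLLMs) on AI-generated content (AIGC) videos, finds that existing models perform poorly, and shows how human feedback can improve video generation.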

Details

Motivation: Existing studies focus on natural videos and overlook the evaluation of AIGC videos, and the capabilities of MLLMs on AIGC videos remain underexplored. Method: Proposes the VF-Eval benchmark with four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) and evaluates 13 frontier MLLMs. Result: Even the best-performing model, GPT-4.1, is inconsistent across tasks, which underlines how challenging the benchmark is. The RePrompt experiment shows that aligning MLLMs more closely with human feedback benefits video generation. Conclusion: VF-Eval exposes the limitations of MLLMs on AIGC videos and shows the potential of human feedback for improving video generation. Abstract: MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.

[123] DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers

Li Ren,Chen Chen,Liqiang Wang,Kien Hua

Main category: cs.CV

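TL;DR: DA-VPT uses metric learning to study how the distribution of prompts affects fine-tuning performance, and proposes a new framework that guides prompt distributions with semantic data to improve the fine-tuning of ViT models.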

Details

Motivation: To explore the fundamental correlation and distribution between prompts and image tokens in order to improve visual prompt tuning. Method: The authors propose the DA-VPT framework, which uses metric learning to learn a distance metric from class-related semantic data and uses it to guide the distribution of the prompts. Result: On recognition and segmentation tasks, DA-VPT yields more effective fine-tuning and better performance for ViT models. Conclusion: By using semantic information to guide prompt learning, DA-VPT provides a more efficient and effective fine-tuning method for downstream vision tasks. Abstract: Visual Prompt Tuning (VPT) has become a promising Parameter-Efficient Fine-Tuning (PEFT) approach for Vision Transformer (ViT) models by partially fine-tuning learnable tokens while keeping most model parameters frozen. Recent research has explored modifying the connection structures of the prompts. However, the fundamental correlation and distribution between the prompts and image tokens remain unexplored. In this paper, we leverage metric learning techniques to investigate how the distribution of prompts affects fine-tuning performance. Specifically, we propose a novel framework, Distribution Aware Visual Prompt Tuning (DA-VPT), to guide the distributions of the prompts by learning the distance metric from their class-related semantic data. Our method demonstrates that the prompts can serve as an effective bridge to share semantic information between image patches and the class token. We extensively evaluated our approach on popular benchmarks in both recognition and segmentation tasks. The results demonstrate that our approach enables more effective and efficient fine-tuning of ViT models by leveraging semantic information to guide the learning of the prompts, leading to improved performance on various downstream vision tasks.

[124] CLDTracker: A Comprehensive Language Description for Visual Tracking

Mohamad Alansari,Sajid Javed,Iyyakutti Iyappan Ganapathi,Sara Alansari,Muzammal Naseer

Main category: cs.CV

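TL;DR: The paper proposes CLDTracker, a visual tracking framework built on comprehensive language descriptions, whose dual-branch architecture combines visual and textual features to address the limitations of traditional trackers in complex scenes.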

Details

Motivation: Visual object tracking (VOT) is challenging because of dynamic appearance changes, occlusions, and background clutter, and traditional trackers that rely on visual cues perform poorly in such scenes. The potential of vision-language models (VLMs) for semantic understanding has not been fully exploited. Method: The authors propose CLDTracker, a dual-branch architecture (a textual branch and a visual branch) that uses VLMs such as CLIP and GPT-4V to generate rich textual descriptions that add semantic and contextual information. Result: CLDTracker achieves SOTA performance on six standard VOT benchmarks, validating the benefit of combining visual and language features. Conclusion: By combining visual and language features, CLDTracker significantly improves tracking performance and offers a new direction for VOT. Abstract: VOT remains a fundamental yet challenging task in computer vision due to dynamic appearance changes, occlusions, and background clutter. Traditional trackers, relying primarily on visual cues, often struggle in such complex scenarios. Recent advancements in VLMs have shown promise in semantic understanding for tasks like open-vocabulary detection and image captioning, suggesting their potential for VOT. However, the direct application of VLMs to VOT is hindered by critical limitations: the absence of a rich and comprehensive textual representation that semantically captures the target object's nuances, limiting the effective use of language information; inefficient fusion mechanisms that fail to optimally integrate visual and textual features, preventing a holistic understanding of the target; and a lack of temporal modeling of the target's evolving appearance in the language domain, leading to a disconnect between the initial description and the object's subsequent visual changes. To bridge these gaps and unlock the full potential of VLMs for VOT, we propose CLDTracker, a novel Comprehensive Language Description framework for robust visual Tracking. Our tracker introduces a dual-branch architecture consisting of a textual and a visual branch. In the textual branch, we construct a rich bag of textual descriptions derived by harnessing powerful VLMs such as CLIP and GPT-4V, enriched with semantic and contextual cues to address the lack of rich textual representation. Experiments on six standard VOT benchmarks demonstrate that CLDTracker achieves SOTA performance, validating the effectiveness of leveraging robust and temporally-adaptive vision-language representations for tracking.
Code and models are publicly available at: https://github.com/HamadYA/CLDTracker

Dionysis Christopoulos,Sotiris Spanos,Eirini Baltzi,Valsamis Ntouskos,Konstantinos Karantzalos

Main category: cs.CV

TL;DR: SLIMP improves skin lesion classification with a nested contrastive learning approach that combines skin lesion images with patient metadata.

Details

Motivation: Melanoma detection and skin lesion classification from images alone are difficult because imaging conditions vary widely and clinical context is missing. SLIMP aims to mimic the holistic assessment of clinicians by combining images with patient metadata. Method: SLIMP uses nested contrastive learning to integrate the appearance and metadata of individual skin lesions with patient-level medical records and other clinical information. Result: SLIMP outperforms other pre-training strategies on skin lesion classification tasks, indicating higher-quality learned representations. Conclusion: By fully exploiting multimodal data, SLIMP improves skin lesion classification and provides a more reliable basis for clinical decisions. Abstract: We introduce SLIMP (Skin Lesion Image-Metadata Pre-training) for learning rich representations of skin lesions through a novel nested contrastive learning approach that captures complex relationships between images and metadata. Melanoma detection and skin lesion classification based solely on images pose significant challenges due to large variations in imaging conditions (lighting, color, resolution, distance, etc.) and lack of clinical and phenotypical context. Clinicians typically follow a holistic approach for assessing the risk level of the patient and for deciding which lesions may be malignant and need to be excised, by considering the patient's medical history as well as the appearance of other lesions of the patient. Inspired by this, SLIMP combines the appearance and the metadata of individual skin lesions with patient-level metadata relating to their medical record and other clinically relevant information. By fully exploiting all available data modalities throughout the learning process, the proposed pre-training strategy improves performance compared to other pre-training strategies on downstream skin lesion classification tasks, highlighting the quality of the learned representations.

[126] AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views

Lihan Jiang,Yucheng Mao,Linning Xu,Tao Lu,Kerui Ren,Yichen Jin,Xudong Xu,Mulin Yu,Jiangmiao Pang,Feng Zhao,Dahua Lin,Bo Dai

Main category: cs.CV

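TL;DR: AnySplat is a feed-forward network that synthesizes novel views from uncalibrated image collections, requires neither known camera poses nor per-scene optimization, and is computationally efficient.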

Details

Motivation: Traditional neural rendering methods need known camera poses and per-scene optimization, while existing feed-forward methods become computationally expensive with dense views. AnySplat aims to solve both problems and deliver efficient novel view synthesis without pose annotations. Method: A single forward pass predicts 3D Gaussian primitives (encoding scene geometry and appearance) together with the camera intrinsics and extrinsics of each input image, which makes the approach applicable to multi-view datasets. Result: In zero-shot evaluations, AnySplat matches the quality of pose-aware baselines in both sparse-view and dense-view settings, surpasses existing pose-free methods, and greatly reduces rendering latency. Conclusion: AnySplat offers an efficient solution for real-time novel view synthesis in unconstrained capture settings. Abstract: We introduce AnySplat, a feed-forward network for novel view synthesis from uncalibrated image collections. In contrast to traditional neural rendering pipelines that demand known camera poses and per-scene optimization, or recent feed-forward methods that buckle under the computational weight of dense views, our model predicts everything in one shot. A single forward pass yields a set of 3D Gaussian primitives encoding both scene geometry and appearance, and the corresponding camera intrinsics and extrinsics for each input image. This unified design scales effortlessly to casually captured, multi-view datasets without any pose annotations. In extensive zero-shot evaluations, AnySplat matches the quality of pose-aware baselines in both sparse- and dense-view scenarios while surpassing existing pose-free approaches. Moreover, it greatly reduces rendering latency compared to optimization-based neural fields, bringing real-time novel view synthesis within reach for unconstrained capture settings. Project page: https://city-super.github.io/anysplat/

[127] FMG-Det: Foundation Model Guided Robust Object Detection

Darryl Hannan,Timothy Doster,Henry Kvinge,Adam Attarian,Yijing Watkins

Main category: cs.CV

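TL;DR: The paper proposes FMG-Det, which combines a multiple instance learning framework with a pre-processing pipeline that uses foundation models to correct noisy annotations, improving object detection performance.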

Details

Motivation: The subjectivity of labeling object boundaries makes detection data quality inconsistent, and noisy annotations significantly degrade model performance, especially in few-shot settings. Method: The approach combines a multiple instance learning (MIL) framework with a pre-processing pipeline that uses foundation models to correct labels, along with slight modifications to the detector head. Result: The method achieves state-of-the-art performance on several datasets in both standard and few-shot scenarios, while being simpler and more efficient than other approaches. Conclusion: By correcting noisy annotations, FMG-Det significantly improves object detectors, with particularly strong results in few-shot settings. Abstract: Collecting high quality data for object detection tasks is challenging due to the inherent subjectivity in labeling the boundaries of an object. This makes it difficult to not only collect consistent annotations across a dataset but also to validate them, as no two annotators are likely to label the same object using the exact same coordinates. These challenges are further compounded when object boundaries are partially visible or blurred, which can be the case in many domains. Training on noisy annotations significantly degrades detector performance, rendering them unusable, particularly in few-shot settings, where just a few corrupted annotations can impact model performance. In this work, we propose FMG-Det, a simple, efficient methodology for training models with noisy annotations. More specifically, we propose combining a multiple instance learning (MIL) framework with a pre-processing pipeline that leverages powerful foundation models to correct labels prior to training. This pre-processing pipeline, along with slight modifications to the detector head, results in state-of-the-art performance across a number of datasets, for both standard and few-shot scenarios, while being much simpler and more efficient than other approaches.

[128] PixelThink: Towards Efficient Chain-of-Pixel Reasoning

Song Wang,Gongfan Fang,Lingdong Kong,Xiangtai Li,Jianyun Xu,Sheng Yang,Qiang Li,Jianke Zhu,Xinchao Wang

Main category: cs.CV

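TL;DR: PixelThink regulates reasoning length by combining task difficulty with model uncertainty, improving both reasoning efficiency and segmentation performance.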

Details

Motivation: Existing methods fall short in generalization and reasoning efficiency, producing overly long reasoning chains at high computational cost. Method: PixelThink uses externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation and optimize reasoning length. Result: Experiments show that the method improves both reasoning efficiency and segmentation performance. Conclusion: The work offers a new perspective on efficient and interpretable multimodal understanding. Abstract: Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) using image-text pairs and corresponding mask labels. However, they exhibit limited generalization to out-of-distribution scenarios without an explicit reasoning process. Although recent efforts leverage reinforcement learning through group-relative policy optimization (GRPO) to enhance reasoning ability, they often suffer from overthinking, producing uniformly verbose reasoning chains irrespective of task complexity. This results in elevated computational costs and limited control over reasoning quality. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress reasoning length in accordance with scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics designed to assess segmentation accuracy, reasoning quality, and efficiency jointly. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance. Our work contributes novel perspectives towards efficient and interpretable multimodal understanding. The code and model will be publicly available.

[129] ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS

Weijie Wang,Donny Y. Chen,Zeyu Zhang,Duochao Shi,Akide Liu,Bohan Zhuang

Main category: cs.CV

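TL;DR: The ZPressor module compresses multi-view inputs into a compact latent state Z, improving the scalability and performance of feed-forward 3D Gaussian Splatting models.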

Details

Motivation: Existing feed-forward 3D Gaussian Splatting models have limited encoder capacity and struggle with many input views, leading to degraded performance or excessive memory consumption. Method: The proposed ZPressor module partitions the input views into anchor and support sets and uses cross attention to compress the information into a latent state Z. Result: On the DL3DV-10K and RealEstate10K benchmarks, ZPressor consistently improves performance and robustness and supports more than 100 input views. Conclusion: ZPressor gives feed-forward 3D Gaussian Splatting models an efficient way to scale to dense-view settings. Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their encoders, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state $Z$ that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state $Z$. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: https://lhmd.top/zpressor.
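
The anchor/support compression described above can be illustrated with a single cross-attention step in which anchor tokens act as queries and support tokens act as keys and values. This is a minimal sketch under stated assumptions, not ZPressor's implementation: it uses one head, omits the learned projections, and the function names `softmax` and `compress` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress(anchor_tokens, support_tokens):
    """Fold support-view tokens into anchor-view tokens (one attention step).

    Assumption: tokens are plain lists of floats with the same dimension.
    The output has exactly one token per anchor token, so its size does not
    grow with the number of support tokens.
    """
    dim = len(anchor_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for query in anchor_tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in support_tokens]
        weights = softmax(scores)
        attended = [sum(w * value[d] for w, value in zip(weights, support_tokens))
                    for d in range(dim)]
        # Residual connection: keep the anchor content and add support content.
        compressed.append([q + a for q, a in zip(query, attended)])
    return compressed


anchors = [[1.0, 0.0]]
supports = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
z = compress(anchors, supports)
print(len(z), z[0])
```

The point of the sketch is the shape argument: however many support tokens are supplied, the latent state has the size of the anchor set, which is what lets the downstream model's cost stay bounded as views are added.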

[130] MAGREF: Masked Guidance for Any-Reference Video Generation

Yufan Deng,Xun Guo,Yuanyang Yin,Jacob Zhiyuan Fang,Yiding Yang,Yizhi Wang,Shenghai Yuan,Angtian Wang,Bo Liu,Haibin Huang,Chongyang Ma

Main category: cs.CV

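TL;DR: MAGREF is a unified framework for video generation from multiple reference subjects that uses masked guidance to achieve high-quality, multi-subject-consistent video synthesis.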

Details

Motivation: Video generation from multiple reference subjects still struggles with multi-subject consistency and generation quality. Method: The authors propose a region-aware dynamic masking mechanism and a pixel-wise channel concatenation mechanism to handle multiple subjects flexibly and preserve appearance features. Result: The model reaches state-of-the-art video generation quality and generalizes from single-subject training to complex multi-subject scenarios. Conclusion: MAGREF provides an effective solution for scalable, controllable, and high-fidelity multi-subject video synthesis. Abstract: Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF

[131] DarkDiff: Advancing Low-Light Raw Enhancement by Retasking Diffusion Models for Camera ISP

Amber Yijia Zheng,Yu Zhang,Jun Hu,Raymond A. Yeh,Chen Chen

Main category: cs.CV

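TL;DR: The paper proposes a new framework that uses pre-trained generative diffusion models to enhance low-light raw images, addressing the shortcomings of existing methods in detail recovery and color accuracy.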

Details

Motivation: Capturing high-quality photos in extreme low light is challenging. Traditional ISP algorithms are gradually being replaced by deep learning models, but existing regression-based models often oversmooth images or leave shadows too deep. Method: The proposed framework enhances low-light raw images by retasking pre-trained generative diffusion models with the camera ISP. Result: Experiments show that the method outperforms the state of the art in perceptual quality on three low-light raw image benchmarks. Conclusion: The method improves detail and color accuracy in low-light images and has clear practical value. Abstract: High-quality photography in extreme low-light conditions is challenging but impactful for digital cameras. With advanced computing hardware, traditional camera image signal processor (ISP) algorithms are gradually being replaced by efficient deep networks that enhance noisy raw images more intelligently. However, existing regression-based models often minimize pixel errors and result in oversmoothing of low-light photos or deep shadows. Recent work has attempted to address this limitation by training a diffusion model from scratch, yet those models still struggle to recover sharp image details and accurate colors. We introduce a novel framework to enhance low-light raw images by retasking pre-trained generative diffusion models with the camera ISP. Extensive experiments demonstrate that our method outperforms the state-of-the-art in perceptual quality across three challenging low-light raw image benchmarks.

[132] Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need

Qiang Wang,Xiang Song,Yuhang He,Jizhou Han,Chenhao Ding,Xinyuan Gao,Yihong Gong

Main category: cs.CV

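TL;DR: SOYO is a lightweight framework that improves domain selection in Parameter-Isolation Domain Incremental Learning (PIDIL) with a Gaussian Mixture Compressor (GMC) and a Domain Feature Resampler (DFR), strengthens feature extraction with a Multi-level Domain Feature Fusion Network (MDFN), and outperforms existing baselines on multiple tasks.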

Details

Motivation: To address the performance drop of DNNs under changing data distributions, and in particular the limited parameter selection accuracy of PIDIL methods. Method: The proposed SOYO framework comprises GMC, DFR, and MDFN and supports multiple PEFT methods. Result: On six benchmarks, SOYO outperforms existing baselines, demonstrating robustness and adaptability. Conclusion: SOYO shows clear advantages in complex, evolving environments, and the code will be open-sourced. Abstract: Deep neural networks (DNNs) often underperform in real-world, dynamic settings where data distributions change over time. Domain Incremental Learning (DIL) offers a solution by enabling continual model adaptation, with Parameter-Isolation DIL (PIDIL) emerging as a promising paradigm to reduce knowledge conflicts. However, existing PIDIL methods struggle with parameter selection accuracy, especially as the number of domains and corresponding classes grows. To address this, we propose SOYO, a lightweight framework that improves domain selection in PIDIL. SOYO introduces a Gaussian Mixture Compressor (GMC) and Domain Feature Resampler (DFR) to store and balance prior domain data efficiently, while a Multi-level Domain Feature Fusion Network (MDFN) enhances domain feature extraction. Our framework supports multiple Parameter-Efficient Fine-Tuning (PEFT) methods and is validated across tasks such as image classification, object detection, and speech enhancement. Experimental results on six benchmarks demonstrate SOYO's consistent superiority over existing baselines, showcasing its robustness and adaptability in complex, evolving environments. The code will be released at https://github.com/qwangcv/SOYO.

[133] To Trust Or Not To Trust Your Vision-Language Model's Prediction

Hao Dong,Moru Liu,Jian Liang,Eleni Chatzi,Olga Fink

Main category: cs.CV

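TL;DR: TrustVLM is a training-free framework for estimating when a VLM's predictions can be trusted; it uses the image embedding space to improve misclassification detection and delivers large gains.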

Details

Motivation: VLMs excel in zero-shot and transfer learning, but in safety-critical domains they tend to make confident yet incorrect predictions, with potentially severe consequences. Method: The authors propose a confidence-scoring function based on the image embedding space, motivated by the modality gap and by the observation that some concepts are more distinctly represented in that space, to improve misclassification detection. Result: Evaluated on 17 datasets, the method improves over existing baselines by up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95. Conclusion: TrustVLM improves VLM reliability without retraining, paving the way for safer deployment in real-world applications. Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when a VLM's predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code will be available at https://github.com/EPFL-IMOS/TrustVLM.
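
Two of the metrics quoted above, AUROC and FPR95, can be computed from per-sample confidence scores and correctness flags. The sketch below uses the common textbook definitions and treats correctly classified samples as the positive class, which is an assumption about the evaluation protocol rather than something stated in this digest. The function names are hypothetical, and AURC is not covered.

```python
import math


def auroc(confidences, correct):
    """Probability that a correct prediction outscores an incorrect one.

    Ties count as half a win (the rank-statistic form of AUROC).
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fpr_at_95_tpr(confidences, correct):
    """Share of misclassified samples accepted when 95% of correct ones are kept."""
    pos = sorted((c for c, ok in zip(confidences, correct) if ok), reverse=True)
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    # Lowest threshold that still keeps at least 95% of correct predictions.
    keep = math.ceil(0.95 * len(pos))
    threshold = pos[keep - 1]
    return sum(n >= threshold for n in neg) / len(neg)


scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
is_correct = [True, True, True, False, True, False]
print(auroc(scores, is_correct))
print(fpr_at_95_tpr(scores, is_correct))
```

Higher AUROC is better, while lower FPR95 is better, so an "improvement" in FPR95 means a reduction.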

[134] Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu,Fangfu Liu,Yi-Hsin Hung,Yueqi Duan

Main category: cs.CV

TL;DR: Spatial-MLLM is a new framework that performs spatial reasoning from purely 2D inputs, without relying on additional 3D or 2.5D data.

Details

Motivation: Existing 3D MLLMs depend on additional data, which limits their use when only 2D inputs are available. Method: The model uses a dual-encoder architecture (a semantic encoder and a spatial encoder) combined with a space-aware frame sampling strategy. Result: It achieves state-of-the-art performance on a wide range of spatial understanding and reasoning tasks. Conclusion: Spatial-MLLM provides an efficient spatial reasoning solution for purely 2D input settings. Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

[135] ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

Akashah Shabbir,Muhammad Akhtar Munir,Akshay Dudhane,Muhammad Umer Sheikh,Muhammad Haris Khan,Paolo Fraccaro,Juan Bernabe Moreno,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

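TL;DR: ThinkGeo is a benchmark designed to evaluate the tool-use capabilities of LLM-driven agents on remote sensing tasks; it covers a range of real-world applications and reveals differences between models in tool accuracy and planning consistency.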

Details

Motivation: Existing evaluations mostly target general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks for complex remote sensing use cases. Method: Using a ReAct-style interaction loop, the authors evaluate open-source and closed-source LLMs on 436 structured tasks, reporting both step-wise execution metrics and final answer correctness. Result: The analysis shows notable disparities between models in tool accuracy and planning consistency. Conclusion: ThinkGeo provides the first extensive testbed for evaluating the spatial reasoning of tool-augmented LLMs in remote sensing. Abstract: Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Each query is grounded in satellite or aerial imagery and requires agents to reason through a diverse toolset. We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs (e.g., GPT-4o, Qwen2.5) on 436 structured agentic tasks. The benchmark reports both step-wise execution metrics and final answer correctness. Our analysis reveals notable disparities in tool accuracy and planning consistency across models. ThinkGeo provides the first extensive testbed for evaluating how tool-enabled LLMs handle spatial reasoning in remote sensing. Our code and dataset are publicly available.

[136] Rooms from Motion: Un-posed Indoor 3D Object Detection as Localization and Mapping

Justin Lazarow,Kai Kang,Afshin Dehghan

Main category: cs.CV

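TL;DR: Rooms from Motion (RfM) is an object-centric 3D object detection method that performs localization and mapping without camera poses and outperforms existing methods that rely on point clouds or dense volumes.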

Details

Motivation: Existing 3D object detection methods rely on global information and known camera poses, whereas RfM aims to localize and map from un-posed images. Method: An object-centric matcher based on 3D boxes replaces the conventional 2D keypoint matcher to estimate camera poses and object tracks and to produce a global, semantic 3D object map. Result: On CA-1M and ScanNet++, RfM shows stronger localization and mapping performance than point-based and multi-view methods. Conclusion: RfM provides a sparse, parametric, object-centric representation that extends the capabilities of scene-level 3D object detection. Abstract: We revisit scene-level 3D object detection as the output of an object-centric framework capable of both localization and mapping using 3D oriented boxes as the underlying geometric primitive. While existing 3D object detection approaches operate globally and implicitly rely on the a priori existence of metric camera poses, our method, Rooms from Motion (RfM), operates on a collection of un-posed images. By replacing the standard 2D keypoint-based matcher of structure-from-motion with an object-centric matcher based on image-derived 3D boxes, we estimate metric camera poses, object tracks, and finally produce a global, semantic 3D object map. When a priori pose is available, we can significantly improve map quality through optimization of global 3D boxes against individual observations. RfM shows strong localization performance and subsequently produces maps of higher quality than leading point-based and multi-view 3D object detection methods on CA-1M and ScanNet++, despite these global methods relying on overparameterization through point clouds or dense volumes. Rooms from Motion achieves a general, object-centric representation which not only extends the work of Cubify Anything to full scenes but also allows for inherently sparse localization and parametric mapping proportional to the number of objects in a scene.

[137] Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models

Haohan Chi,Huan-ang Gao,Ziming Liu,Jianing Liu,Chenyu Liu,Jinwei Li,Kaisen Yang,Yangcheng Yu,Zeda Wang,Wenyi Li,Leichen Wang,Xingtao Hu,Hao Sun,Hang Zhao,Hao Zhao

Main category: cs.CV

TL;DR: Impromptu VLA introduces a new dataset that addresses the weak performance of Vision-Language-Action models for autonomous driving in unstructured scenarios.

Details

Motivation: Existing VLA models perform poorly in unstructured corner-case scenarios, and targeted benchmarks are scarce. Method: The authors build the Impromptu VLA Dataset of over 80,000 video clips, organized by a taxonomy of four unstructured scenario categories and annotated with planning-oriented question answering and action trajectories. Result: Experiments show that VLA models trained on the dataset improve substantially on several benchmarks, including closed-loop NeuroNCAP scores and collision rates and open-loop nuScenes trajectory prediction. Conclusion: The Impromptu VLA Dataset effectively improves VLA models and provides a diagnostic tool for perception, prediction, and planning. Abstract: Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips sourced from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks--improving closed-loop NeuroNCAP scores and collision rates, and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.

[138] LoRAShop: Training-Free Multi-Concept Image Generation and Editing with Rectified Flow Transformers

Yusuf Dalva,Hidir Yesiltepe,Pinar Yanardag

Main category: cs.CV

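TL;DR: LoRAShop is a multi-concept image editing framework built on LoRA models that edits efficiently by deriving disentangled latent masks and blending LoRA weights locally.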

Details

Motivation: Existing methods struggle to preserve global consistency and fine details in multi-concept image editing, which LoRAShop aims to solve. Method: The approach exploits feature interaction patterns in diffusion transformers to extract a disentangled latent mask per concept and blends the LoRA weights only within the corresponding local regions. Result: Experiments show that LoRAShop delivers better identity preservation than baseline methods. Conclusion: LoRAShop turns personalized diffusion models into a practical editing tool and advances visual creation. Abstract: We introduce LoRAShop, the first framework for multi-concept image editing with LoRA models. LoRAShop builds on a key observation about the feature interaction patterns inside Flux-style diffusion transformers: concept-specific transformer features activate spatially coherent regions early in the denoising process. We harness this observation to derive a disentangled latent mask for each concept in a prior forward pass and blend the corresponding LoRA weights only within regions bounding the concepts to be personalized. The resulting edits seamlessly integrate multiple subjects or styles into the original scene while preserving global context, lighting, and fine details. Our experiments demonstrate that LoRAShop delivers better identity preservation compared to baselines. By eliminating retraining and external constraints, LoRAShop turns personalized diffusion models into a practical 'photoshop-with-LoRAs' tool and opens new avenues for compositional visual storytelling and rapid creative iteration.
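
The idea of blending LoRA weights only inside per-concept regions can be sketched at the level of layer outputs. For a linear layer, applying an extra weight ΔW to a token is the same as adding the residual ΔW·x to that token's output, so a masked blend of weights reduces to a masked sum of residuals. The sketch below assumes the residuals and binary masks are already computed; the function name and data layout are hypothetical and are not LoRAShop's API.

```python
def blend_lora_residuals(base_tokens, lora_residuals, concept_masks):
    """Add each concept's LoRA residual only to the tokens its mask covers.

    base_tokens:    list of token vectors from the frozen base layer
    lora_residuals: {concept: list of residual vectors, one per token}
    concept_masks:  {concept: list of 0/1 flags, one per token}
    """
    blended = []
    for index, base in enumerate(base_tokens):
        token = list(base)
        for concept, residuals in lora_residuals.items():
            if concept_masks[concept][index]:
                token = [t + r for t, r in zip(token, residuals[index])]
        blended.append(token)
    return blended


base = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
residuals = {"dog": [[0.5, 0.5]] * 3, "hat": [[-1.0, 0.0]] * 3}
masks = {"dog": [1, 0, 0], "hat": [0, 0, 1]}
print(blend_lora_residuals(base, residuals, masks))
```

Tokens outside every mask keep the base model's output unchanged, which corresponds to the abstract's claim that global context is preserved outside the personalized regions.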

[139] MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Sihan Yang,Runsen Xu,Yiman Xie,Sizhe Yang,Mo Li,Jingli Lin,Chenming Zhu,Xiaochen Chen,Haodong Duan,Xiangyu Yue,Dahua Lin,Tai Wang,Jiangmiao Pang

Main category: cs.CV

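TL;DR: MMSI-Bench is a VQA benchmark dedicated to multi-image spatial intelligence; it evaluates MLLMs on 1,000 challenging questions and finds a large gap between current models and human performance.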

Details

Motivation: Existing benchmarks probe only single-image relations and cannot assess the multi-image spatial reasoning that real-world deployments require, so a new evaluation tool is needed. Method: Six 3D-vision researchers spent more than 300 hours crafting 1,000 multiple-choice questions from over 120,000 images, and 34 open-source and proprietary MLLMs were evaluated. Result: The strongest open-source model reaches roughly 30% accuracy and OpenAI's o3 model reaches 40%, while humans score 97%, underscoring how challenging MMSI-Bench is. Conclusion: MMSI-Bench exposes the difficulties of multi-image spatial reasoning and provides an error analysis pipeline that points to directions for future research. Abstract: Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench .

[140] Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch

Aneeshan Sain,Subhajit Maity,Pinaki Nath Chowdhury,Subhadeep Koley,Ayan Kumar Bhunia,Yi-Zhe Song

Main category: cs.CV

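TL;DR: The paper proposes two sketch-specific components, knowledge distillation and a dynamic canvas selector, that cut computation dramatically (FLOPs reduced by 99.37%) while retaining retrieval accuracy.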

Details

Motivation: Existing efficient lightweight models are designed for photos and cannot be applied directly to sketch data, so efficient inference methods tailored to sketches are needed. Method: (1) A cross-modal knowledge distillation network adapts efficient photo networks to sketches; (2) an RL-based dynamic canvas selector adjusts computation to the abstraction level of the sketch. Result: FLOPs fall by 99.37% (40.18G to 0.254G) while accuracy is almost unchanged (33.03% vs 32.77%). Conclusion: The proposed method achieves efficient inference on sketch data, with even fewer FLOPs than efficient photo models. Abstract: As sketch research has collectively matured over time, its adaptation for at-mass commercialisation emerges on the immediate horizon. Despite an already mature research endeavour for photos, there is no research on efficient inference specifically designed for sketch data. In this paper, we first demonstrate that existing state-of-the-art efficient light-weight models designed for photos do not work on sketches. We then propose two sketch-specific components which work in a plug-n-play manner on any photo efficient network to adapt them to work on sketch data. We specifically chose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator as the most recognised sketch problem with immediate commercial value. Technically speaking, we first propose a cross-modal knowledge distillation network to transfer existing photo efficient networks to be compatible with sketch, which brings down number of FLOPs and model parameters by 97.96% and 84.89% respectively. We then exploit the abstract trait of sketch to introduce a RL-based canvas selector that dynamically adjusts to the abstraction level which further cuts down number of FLOPs by two thirds. The end result is an overall reduction of 99.37% of FLOPs (from 40.18G to 0.254G) when compared with a full network, while retaining the accuracy (33.03% vs 32.77%) -- finally making an efficient network for the sparse sketch data that exhibits even fewer FLOPs than the best photo counterpart.
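
The FLOP figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (40.18G, 0.254G, and the 97.96% reduction from distillation); the helper name `reduction` is hypothetical, and the intermediate GFLOP value is derived here rather than reported in the digest.

```python
def reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return 1.0 - after / before


FULL_GFLOPS = 40.18    # full network, as quoted in the abstract
FINAL_GFLOPS = 0.254   # final efficient network, as quoted in the abstract

overall = reduction(FULL_GFLOPS, FINAL_GFLOPS)
# Derived value: what remains after the quoted 97.96% cut from distillation.
after_distillation = FULL_GFLOPS * (1.0 - 0.9796)
selector_cut = reduction(after_distillation, FINAL_GFLOPS)

print(f"overall reduction: {overall:.2%}")
print(f"GFLOPs after distillation: {after_distillation:.3f}")
print(f"further cut from the canvas selector: {selector_cut:.1%}")
```

The overall figure comes out at 99.37%, matching the abstract. Distillation alone leaves about 0.820 GFLOPs, and the step from there to 0.254 GFLOPs is a cut of about 69.0%, which is consistent with the abstract's rounded "two thirds".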

[141] Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Yunze Man,De-An Huang,Guilin Liu,Shiwei Sheng,Shilong Liu,Liang-Yan Gui,Jan Kautz,Yu-Xiong Wang,Zhiding Yu

Main category: cs.CV

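TL;DR: Argus improves the performance of multimodal large language models on vision-centric tasks through a visual attention grounding mechanism.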

Details

Motivation: Existing MLLMs perform poorly on tasks that require precise visual focus, and Argus aims to address this. Method: Object-centric grounding is used as a visual chain-of-thought signal, enabling goal-conditioned visual attention. Result: Across diverse benchmarks, Argus performs strongly on both multimodal reasoning and referring object grounding tasks. Conclusion: Explicit language-guided engagement with visual regions of interest is important for MLLMs, and multimodal intelligence should be advanced from a vision-centric perspective. Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective. Project page: https://yunzeman.github.io/argus/

[142] TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

Yao Xiao,Qiqian Fu,Heyi Tao,Yuqun Wu,Zhen Zhu,Derek Hoiem

Main category: cs.CV

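TL;DR: TextRegion combines the strengths of image-text models and SAM2 to generate text-aligned region tokens, supporting detailed visual understanding while preserving open-vocabulary capabilities.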

Details

Motivation: Image-text models fall short on detailed visual understanding, while SAM2 provides precise spatial boundaries; combining the two should improve performance. Method: TextRegion is a training-free framework that combines image-text models with SAM2 to generate text-aligned region tokens. Result: It achieves superior or competitive performance against state-of-the-art training-free methods on several downstream tasks and is compatible with many image-text models. Conclusion: TextRegion is a simple and effective framework for tasks such as open-world semantic segmentation, and it is highly practical and extensible. Abstract: Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.

cs.GR [Back]

[143] Quality assessment of 3D human animation: Subjective and objective evaluation

Rim Rekik,Stefanie Wuhrer,Ludovic Hoyet,Katja Zibrek,Anne-Hélène Olivier

Main category: cs.GR

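TL;DR: Proposes a data-driven quality assessment method for judging the realism of virtual human animations that are not generated with parametric body models.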

Details

Motivation: Virtual human animations are widely used in virtual and augmented reality, but assessing their quality remains challenging, especially for animations not generated with parametric models. Method: Subjective realism scores are collected in a user study, and the resulting dataset is used to train a linear regression model that predicts perceptual scores. Result: The linear regressor reaches a correlation of 90% on the dataset, outperforming an existing deep learning baseline. Conclusion: The method provides an effective tool for assessing the quality of non-parametric virtual human animations. Abstract: Virtual human animations have a wide range of applications in virtual and augmented reality. While automatic generation methods of animated virtual humans have been developed, assessing their quality remains challenging. Recently, approaches introducing task-oriented evaluation metrics have been proposed, leveraging neural network training. However, quality assessment measures for animated virtual humans that are not generated with parametric body models have yet to be developed. In this context, we introduce a first such quality assessment measure leveraging a novel data-driven framework. First, we generate a dataset of virtual human animations together with their corresponding subjective realism evaluation scores collected with a user study. Second, we use the resulting dataset to learn to predict perceptual evaluation scores. Results indicate that training a linear regressor on our dataset results in a correlation of 90%, which outperforms a state-of-the-art deep learning baseline.

[144] To Measure What Isn't There -- Visual Exploration of Missingness Structures Using Quality Metrics

Sara Johansson Fernstad,Sarah Alsufyani,Silvia Del Din,Alison Yarnall,Lynn Rochester

Main category: cs.GR

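TL;DR: The paper proposes a set of quality metrics for identifying and visualizing structured missingness in high-dimensional data. The metrics help in understanding missingness patterns and support decisions about data quality issues.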

Details

Motivation: Missing values are a common problem in high-dimensional data and can cause analysis issues. Structured missingness may reflect problems in data collection or pre-processing, but may also reveal important data characteristics. Existing research focuses mainly on statistical methods for imputing missing values; visualization has great potential for understanding missingness structures, but related work is scarce and rarely addresses scalability. Method: The paper proposes a set of quality metrics for identifying and understanding structured missingness patterns, and demonstrates their use in visual analysis through a use case from a real-life walking monitoring study. Result: The proposed metrics identify structured missingness patterns effectively and support visual exploration of, and decision-making about, missing values in high-dimensional data. Conclusion: The metrics offer a practical tool for analysing structured missingness in high-dimensional data, fill a gap in existing research, and show the potential of visualization for missing-data analysis. Abstract: This paper contributes a set of quality metrics for identification and visual analysis of structured missingness in high-dimensional data. Missing values in data are a frequent challenge in most data generating domains and may cause a range of analysis issues. Structural missingness in data may indicate issues in data collection and pre-processing, but may also highlight important data characteristics. While research into statistical methods for dealing with missing data is mainly focused on replacing missing values with plausible estimated values, visualization has great potential to support a more in-depth understanding of missingness structures in data. Nonetheless, while the interest in missing data visualization has increased in the last decade, it is still a relatively overlooked research topic with a comparably small number of publications, few of which address scalability issues. Efficient visual analysis approaches are needed to enable exploration of missingness structures in large and high-dimensional data, and to support informed decision-making in context of potential data quality issues. This paper suggests a set of quality metrics for identification of patterns of interest for understanding of structural missingness in data. These quality metrics can be used as guidance in visual analysis, as demonstrated through a use case exploring structural missingness in data from a real-life walking monitoring study. All supplemental materials for this paper are available at https://doi.org/10.25405/data.ncl.c.7741829.

cs.CL [Back]

[145] Training Language Models to Generate Quality Code with Program Analysis Feedback

Feng Yao,Zilong Wang,Liyuan Liu,Junxia Cui,Li Zhong,Xiaohan Fu,Haohui Mai,Vish Krishnan,Jianfeng Gao,Jingbo Shang

Main category: cs.CL

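TL;DR: The paper proposes REAL, a reinforcement learning framework that uses program analysis and unit test feedback to incentivize large language models to generate high-quality code, removing the reliance of existing methods on manual annotation or heuristic rules.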

Details

Motivation: Existing approaches to LLM code generation (such as supervised fine-tuning and rule-based post-processing) do not effectively ensure code quality (for example security and maintainability), and they rely on manual annotation or heuristic rules, which makes them hard to scale. Method: REAL combines program analysis (detecting security or maintainability defects) with unit tests (ensuring functional correctness) and uses reinforcement learning to incentivize the model to generate high-quality code, without manual intervention or reference code. Result: Experiments show that REAL outperforms existing methods in joint assessments of functionality and code quality, across multiple datasets and model scales. Conclusion: REAL bridges the gap between rapid prototyping and production-grade code, letting LLMs generate code quickly while maintaining quality. Abstract: Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
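The abstract names two automated signals, program analysis and unit tests, but does not say how they are combined into a reward. The sketch below shows one plausible way to blend them into a single scalar; the `Feedback` record, the binary quality signal, and the `quality_weight` parameter are all assumptions made for illustration, not the formulation used by REAL.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """Hypothetical per-sample feedback collected for one generated program."""
    defects_found: int  # issues flagged by a static analyzer (assumed count)
    tests_passed: int   # unit tests that passed
    tests_total: int    # unit tests that were run


def combined_reward(fb: Feedback, quality_weight: float = 0.5) -> float:
    """Blend a code-quality signal and a functional signal into one value in [0, 1]."""
    # Quality signal: 1.0 only when the analyzer reports no defects (assumed binary).
    quality = 1.0 if fb.defects_found == 0 else 0.0
    # Functional signal: fraction of unit tests passed, 0.0 when no tests exist.
    functional = fb.tests_passed / fb.tests_total if fb.tests_total else 0.0
    return quality_weight * quality + (1.0 - quality_weight) * functional


print(combined_reward(Feedback(defects_found=0, tests_passed=4, tests_total=4)))
print(combined_reward(Feedback(defects_found=2, tests_passed=4, tests_total=4)))
```

Both signals here are computed without a reference solution or human label, which is the property the abstract describes as reference-free supervision.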

[146] Climate Finance Bench

Rafik Mankour,Yassine Chafai,Hamada Saleh,Ghassen Ben Hassine,Thibaud Barreau,Peter Tankov

Main category: cs.CL

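TL;DR: Climate Finance Bench introduces an open question-answering benchmark over corporate climate disclosures using large language models, and compares the performance of RAG approaches.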

Details

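Motivation: To standardize question answering over corporate climate disclosures and to evaluate how RAG approaches perform on this task. Method: 33 English-language sustainability reports are collected and 330 expert-validated question-answer pairs are annotated, covering extraction, numerical reasoning, and logical reasoning; RAG approaches are then compared. Result: The retriever's ability to locate the passages containing the answer is the performance bottleneck; the paper also argues for transparent carbon reporting in AI-for-climate applications. Conclusion: Retriever performance is critical, and techniques such as weight quantization are highlighted as a way to reduce carbon emissions. Abstract: Climate Finance Bench introduces an open benchmark that targets question-answering over corporate climate disclosures using Large Language Models. We curate 33 recent sustainability reports in English drawn from companies across all 11 GICS sectors and annotate 330 expert-validated question-answer pairs that span pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we propose a comparison of RAG (retrieval-augmented generation) approaches. We show that the retriever's ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications, highlighting advantages of techniques such as Weight Quantization.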
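The bottleneck described above, whether the retriever surfaces a passage that actually contains the answer, is commonly measured with a hit rate at k. The sketch below is a generic version of that metric; the passage identifiers and the function name are made up for illustration, and the benchmark's own scoring protocol may differ.

```python
def hit_rate_at_k(retrieved, gold_passage_ids, k):
    """Fraction of questions whose top-k retrieved passages include the gold passage."""
    hits = sum(
        1
        for ranked, gold in zip(retrieved, gold_passage_ids)
        if gold in ranked[:k]
    )
    return hits / len(gold_passage_ids)


# Toy data: ranked passage ids per question, and the passage holding each answer.
retrieved = [["p3", "p1", "p9"], ["p2", "p7", "p4"], ["p5", "p6", "p8"]]
gold = ["p1", "p4", "p0"]

print(hit_rate_at_k(retrieved, gold, k=1))
print(hit_rate_at_k(retrieved, gold, k=3))
```

A low hit rate caps end-to-end accuracy regardless of the generator, because the model cannot quote a figure from a passage it never receives.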

[147] Pre-Training Curriculum for Multi-Token Prediction in Language Models

Ansar Aynetdinov,Alan Akbik

Main category: cs.CL

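TL;DR: Multi-token prediction (MTP) is a recent pre-training objective for language models; a curriculum learning strategy that introduces MTP gradually addresses the difficulty small models have with MTP training.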

Details

Motivation: Small language models (SLMs) perform poorly under the multi-token prediction (MTP) objective, so a strategy is needed to help them adapt to MTP training. Method: Two curriculum learning strategies are proposed: a forward curriculum (moving gradually from NTP to MTP) and a reverse curriculum (moving gradually from MTP to NTP). Result: The forward curriculum helps SLMs make better use of MTP, improving downstream performance and generation quality; the reverse curriculum improves performance but provides no self-speculative decoding benefit. Conclusion: The forward curriculum is an effective strategy for helping SLMs adapt to MTP training while retaining the benefits of self-speculative decoding. Abstract: Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next $k$ tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
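The forward and reverse curricula can be pictured as a schedule for how many future tokens the model is asked to predict at each training step. The sketch below assumes equal-length stages that move between $k=1$ (plain NTP) and a maximum $k$; the abstract does not specify the schedule, so the stage lengths and the function name `active_heads` are illustrative assumptions.

```python
def active_heads(step: int, total_steps: int, max_k: int, forward: bool = True) -> int:
    """Number of future tokens predicted at `step` under a stepwise curriculum."""
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    # Split training into max_k equal stages; the stage index runs from 0 to max_k - 1.
    stage = step * max_k // total_steps
    # Forward: start at k=1 (NTP) and grow to max_k. Reverse: start at max_k and shrink.
    return stage + 1 if forward else max_k - stage


forward_schedule = [active_heads(s, 8, 4, forward=True) for s in range(8)]
reverse_schedule = [active_heads(s, 8, 4, forward=False) for s in range(8)]
print(forward_schedule)  # [1, 1, 2, 2, 3, 3, 4, 4]
print(reverse_schedule)  # [4, 4, 3, 3, 2, 2, 1, 1]
```

Under this reading, a reverse curriculum ends training on pure NTP, so the extra prediction heads are no longer being trained at the end of the run. That is one plausible explanation for the missing self-speculative decoding benefit, though the abstract does not state the cause.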

[148] FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian

Sara Papi,Marco Gaido,Luisa Bentivogli,Alessio Brutti,Mauro Cettolo,Roberto Gretter,Marco Matassoni,Mohamed Nabih,Matteo Negri

Main category: cs.CL

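TL;DR: FAMA is a family of open-science speech foundation models (SFMs) that fills the open-science gap in speech, with performance competitive with existing SFMs and faster inference.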

Details

Motivation: The closed nature of existing speech foundation models (such as Whisper and SeamlessM4T) makes reproducibility and fair evaluation difficult, while other domains have advanced open science through open models and data. Method: FAMA is trained on 150k+ hours of open-source speech data, and a new dataset of 16k hours of cleaned and pseudo-labeled speech is released. Result: FAMA performs competitively with existing SFMs while being up to 8 times faster. Conclusion: FAMA and its associated resources (code, datasets, models) are released under open-source licenses, promoting openness in speech technology research. Abstract: The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature--with inaccessible training data and code--poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.

[149] StressTest: Can YOUR Speech LM Handle the Stress?

Iddo Yosha,Gallil Maimon,Yossi Adi

Main category: cs.CL

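TL;DR: The paper introduces the StressTest benchmark and the Stress17k dataset for evaluating and improving how well speech-aware language models understand sentence stress, and shows substantial gains from the finetuned StresSLM model.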

Details

Motivation: Sentence stress carries important information in speech, but the evaluation and development of speech-aware language models (SLMs) have overlooked it. Method: The StressTest benchmark evaluates a model's ability to distinguish stress patterns; a synthetic data generation pipeline produces the Stress17k training set; the StresSLM model is finetuned on it. Result: Existing SLMs perform poorly on stress tasks; StresSLM significantly outperforms existing models on both sentence stress reasoning and detection. Conclusion: Finetuning on synthetic data effectively improves SLM performance on stress tasks and fills a gap in existing research. Abstract: Sentence stress refers to emphasis placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. It is often used to imply an underlying intention that is not explicitly stated. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio, allowing models to bypass transcription and access the full richness of the speech signal and perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in evaluation and development of such models. In this work, we address this gap by introducing StressTest, a benchmark specifically designed to evaluate a model's ability to distinguish between interpretations of spoken sentences based on the stress pattern. We assess the performance of several leading SLMs and find that, despite their overall capabilities, they perform poorly on such tasks. To overcome this limitation, we propose a novel synthetic data generation pipeline, and create Stress17k, a training set that simulates change of meaning implied by stress variation. Then, we empirically show that optimizing models with this synthetic dataset aligns well with real-world recordings and enables effective finetuning of SLMs. Results suggest that our finetuned model, StresSLM, significantly outperforms existing models on both sentence stress reasoning and detection tasks. Code, models, data, and audio samples: pages.cs.huji.ac.il/adiyoss-lab/stresstest.

[150] Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems

Christopher Ormerod

Main category: cs.CL

TL;DR: Incorporating feedback-oriented annotations into automated essay scoring (AES) can improve scoring accuracy.

Details

Motivation: The study aims to improve automated essay scoring through feedback-driven annotations, such as marks for spelling and grammatical errors and for argumentative components. Method: Using the PERSUADE corpus, two kinds of feedback annotations are combined: spelling and grammar error marks and argumentative component marks. A generative language model performs spell-correction, an encoder-based token classifier identifies argumentative elements, and the annotations are incorporated into the scoring process. Result: With the annotations incorporated, encoder-based large language models show improved scoring performance. Conclusion: Feedback-driven annotations can improve the accuracy of automated essay scoring and show potential for real-world use. Abstract: This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). This approach is demonstrated with the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations -- a generative language model used for spell-correction and an encoder-based token classifier trained to identify and mark argumentative elements. By incorporating annotations into the scoring process, we demonstrate improvements in performance using encoder-based large language models fine-tuned as classifiers.

[151] Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages

Kaja Dobrovoljc

Main category: cs.CL

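TL;DR: The paper presents a treebank-driven approach that compares the syntactic structures of speech and writing using dependency-parsed corpora, and finds marked differences between the two in syntactic structure and diversity.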

Details

Motivation: To explore how speech and writing differ in syntactic structure, in order to understand how language modality shapes syntactic organization. Method: A bottom-up, inductive method extracts delexicalized dependency subtrees from English and Slovenian Universal Dependencies treebanks and analyses their size, diversity, and distribution. Result: Spoken corpora contain fewer and less diverse syntactic structures, and the overlap between spoken and written structures is limited, indicating modality-specific preferences. Conclusion: The method offers a scalable, language-independent framework for studying syntactic variation across corpora and lays groundwork for theories of grammar in use. Abstract: This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct demands of real-time interaction and elaborated writing. This contrast is further supported by a keyness analysis of the most frequent speech-specific structures, which highlights patterns associated with interactivity, context-grounding, and economy of expression. We argue that this scalable, language-independent framework offers a useful general method for systematically studying syntactic variation across corpora, laying the groundwork for more comprehensive data-driven theories of grammar in use.
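To make "delexicalized dependency (sub)trees" concrete, the sketch below renders the subtree rooted at each token of a tiny hand-written parse, keeping only the dependency relation and part-of-speech tag and dropping the word forms. The tuple layout mirrors a few CoNLL-U columns, and the bracketed serialization is an assumption chosen for illustration; the paper's exact representation may differ.

```python
from collections import Counter, defaultdict

# (id, form, upos, head, deprel) for the toy sentence "She reads books".
SENTENCE = [
    (1, "She", "PRON", 2, "nsubj"),
    (2, "reads", "VERB", 0, "root"),
    (3, "books", "NOUN", 2, "obj"),
]


def delexicalized_subtrees(tokens):
    """Return one bracketed, form-free pattern per token, rooted at that token."""
    children = defaultdict(list)
    info = {}
    for tok_id, _form, upos, head, deprel in tokens:
        children[head].append(tok_id)
        info[tok_id] = (upos, deprel)

    def render(tok_id):
        upos, deprel = info[tok_id]
        # Recurse into dependents in surface order; word forms are never emitted.
        kids = "".join(render(child) for child in sorted(children[tok_id]))
        return f"({deprel}:{upos}{kids})"

    return [render(tok_id) for tok_id in sorted(info)]


patterns = delexicalized_subtrees(SENTENCE)
inventory = Counter(patterns)
print(patterns)
```

Counting such patterns per corpus yields the kind of syntactic inventory whose size, diversity, and cross-modality overlap the paper compares.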

[152] MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators

John Mendonça,Alon Lavie,Isabel Trancoso

Main category: cs.CL

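TL;DR: MEDAL is an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks, addressing the problem that existing benchmark datasets are static, outdated, and lacking in multilingual coverage.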

Details

Motivation: Existing datasets for evaluating chatbots and LLMs are largely static, outdated, and lacking in multilingual coverage, so they cannot capture subtle linguistic and cultural differences, which hinders further progress. Method: Several state-of-the-art LLMs generate multilingual user-chatbot dialogues conditioned on varied seed contexts; GPT-4.1 performs a multidimensional performance analysis; a new multilingual benchmark is then curated and human-annotated. Result: Current LLMs struggle to detect nuanced issues (such as those involving empathy and reasoning), and noticeable cross-lingual performance differences are found. Conclusion: The MEDAL framework produces a more comprehensive multilingual evaluation benchmark, but current LLMs remain limited as evaluators of open-domain dialogue. Abstract: As the capabilities of chatbots and their underlying LLMs continue to dramatically improve, evaluating their performance has increasingly become a major blocker to their further development. A major challenge is the available benchmarking datasets, which are largely static, outdated, and lacking in multilingual coverage, limiting their ability to capture subtle linguistic and cultural variations. This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. We find that current LLMs struggle to detect nuanced issues, particularly those involving empathy and reasoning.

[153] Can Large Language Models Match the Conclusions of Systematic Reviews?

Christopher Polzak,Alejandro Lozano,Min Woo Sun,James Burgess,Yuhui Zhang,Kevin Wu,Serena Yeung-Levy

Main category: cs.CL

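TL;DR: The study asks whether large language models (LLMs) can match the conclusions of clinical experts when generating systematic reviews (SRs), and tests 24 LLMs on the MedEvidence benchmark.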

Details

Motivation: With the explosive growth of scientific literature, demand for using LLMs to automate SR generation is rising, but their ability to assess evidence and reason across multiple documents remains unclear. Method: Using the MedEvidence benchmark, which pairs 100 SRs with the studies they are based on, the study tests 24 LLMs (reasoning, non-reasoning, medical specialist, and models of varying sizes). Result: Reasoning does not necessarily improve performance, larger models do not always bring gains, and knowledge-based fine-tuning lowers accuracy. Models generally lack scientific skepticism toward low-quality evidence. Conclusion: LLMs cannot yet reliably match the conclusions of expert-written SRs, and more research is needed. The team releases its code and benchmark to support related research. Abstract: Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, medical specialist, and models across varying sizes (from 7B-700B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.

[154] Towards a More Generalized Approach in Open Relation Extraction

Qing Wang,Yuepei Li,Qiao Qiao,Kang Zhou,Qi Li

Main category: cs.CL

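TL;DR: MixORE is a two-phase framework that jointly learns relation classification and clustering on unlabeled data mixing known and novel relations, and it clearly outperforms existing baselines.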

Details

Motivation: Traditional OpenRE methods assume that unlabeled data contains only novel relations or has been pre-divided into known and novel instances, whereas in practice novel relations are arbitrarily distributed. Method: The MixORE framework integrates relation classification and clustering to jointly learn known and novel relations from mixed data. Result: On three benchmark datasets, MixORE outperforms baselines in both known relation classification and novel relation clustering. Conclusion: MixORE advances generalized OpenRE research and real-world applications. Abstract: Open Relation Extraction (OpenRE) seeks to identify and extract novel relational facts between named entities from unlabeled data without pre-defined relation schemas. Traditional OpenRE methods typically assume that the unlabeled data consists solely of novel relations or is pre-divided into known and novel instances. However, in real-world scenarios, novel relations are arbitrarily distributed. In this paper, we propose a generalized OpenRE setting that considers unlabeled data as a mixture of both known and novel instances. To address this, we propose MixORE, a two-phase framework that integrates relation classification and clustering to jointly learn known and novel relations. Experiments on three benchmark datasets demonstrate that MixORE consistently outperforms competitive baselines in known relation classification and novel relation clustering. Our findings contribute to the advancement of generalized OpenRE research and real-world applications.

[155] First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

Andrew Zhu,Evan Osgood,Chris Callison-Burch

Main category: cs.CL

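TL;DR: The paper proposes a new paradigm for LLM agents, "overhearing agents", which listen to human conversations rather than taking part in them in order to perform background tasks or offer suggestions, and studies it in depth through Dungeons & Dragons gameplay.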

Details

Motivation: To explore the potential of LLM agents in settings without direct interaction, in particular assisting users by listening to human conversations. Method: Large multimodal audio-language models serve as overhearing agents that assist a Dungeon Master, and a human evaluation examines their helpfulness. Result: Some large audio-language models show an emergent ability to perform overhearing agent tasks using implicit audio cues. Conclusion: The overhearing agents paradigm is promising, and the authors release code libraries to support further research. Abstract: Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation -- instead, they "listen in" on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.

Yingming Wang,Pepa Atanasova

Main category: cs.CL

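TL;DR: The SR-NLE framework improves the faithfulness of natural language explanations generated by large language models through self-critique and iterative refinement, without external supervision.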

Details

Motivation: Natural language explanations from current large language models often fail to reflect the model's reasoning process faithfully, and methods for improving explanation faithfulness are lacking. Method: The SR-NLE framework iteratively refines explanations using natural language self-feedback and a feedback mechanism based on feature attribution. Result: Experiments on three datasets and four LLMs show that SR-NLE significantly reduces unfaithfulness rates, with the best method lowering the average unfaithfulness rate to 36.02%. Conclusion: LLMs can refine their own explanations when guided by feedback, without additional training or fine-tuning. Abstract: With the rapid development of large language models (LLMs), natural language explanations (NLEs) have become increasingly important for understanding model predictions. However, these explanations often fail to faithfully represent the model's actual reasoning process. While existing work has demonstrated that LLMs can self-critique and refine their initial outputs for various tasks, this capability remains unexplored for improving explanation faithfulness. To address this gap, we introduce Self-critique and Refinement for Natural Language Explanations (SR-NLE), a framework that enables models to improve the faithfulness of their own explanations -- specifically, post-hoc NLEs -- through an iterative critique and refinement process without external supervision. Our framework leverages different feedback mechanisms to guide the refinement process, including natural language self-feedback and, notably, a novel feedback approach based on feature attribution that highlights important input words. Our experiments across three datasets and four state-of-the-art LLMs demonstrate that SR-NLE significantly reduces unfaithfulness rates, with our best method achieving an average unfaithfulness rate of 36.02%, compared to 54.81% for baseline -- an absolute reduction of 18.79%. These findings reveal that the investigated LLMs can indeed refine their explanations to better reflect their actual reasoning process, requiring only appropriate guidance through feedback without additional training or fine-tuning.
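The iterative critique-and-refine process described above can be sketched as a short control loop. The three callables (`explain`, `critique`, `refine`), the fixed round budget, and the early stop on empty feedback are all assumptions made for illustration; in the paper these roles are played by the LLM itself, and its stopping rule is not given in the abstract.

```python
def refine_explanation(explain, critique, refine, question, rounds=3):
    """Generate an explanation, then alternate critique and refinement steps."""
    explanation = explain(question)
    for _ in range(rounds):
        feedback = critique(question, explanation)
        if not feedback:
            # Assumed stopping rule: no feedback means nothing left to fix.
            break
        explanation = refine(question, explanation, feedback)
    return explanation


# Toy stand-ins for model calls; a real system would prompt an LLM here.
def toy_explain(question):
    return "The review is positive."


def toy_critique(question, explanation):
    return "" if "'great'" in explanation else "Mention the key word 'great'."


def toy_refine(question, explanation, feedback):
    return explanation + " The word 'great' drives the prediction."


result = refine_explanation(toy_explain, toy_critique, toy_refine, "Is it positive?")
print(result)
```

In the attribution-based variant the abstract mentions, the critique step would be replaced by a list of important input words computed from the model, which the refinement step is then asked to account for.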

[157] What Has Been Lost with Synthetic Evaluation?

Alexander Gill,Abhilasha Ravichander,Ana Marasović

Main category: cs.CL

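TL;DR: The study finds that although LLMs can generate valid evaluation benchmarks at low cost, the data they produce is less challenging for LLMs than human-authored data, so the suitability of this approach needs to be reassessed.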

Details

Motivation: To examine whether LLMs can meet the high bar set by evaluation benchmarks, which must target specific phenomena, penalize shortcut exploitation, and be challenging. Method: Two case studies compare LLM-generated reasoning-over-text benchmarks with crowdsourced human data, assessing validity and difficulty. Result: LLMs can produce data that follows the annotation guidelines at low cost, but it is less challenging for LLMs than human-authored data. Conclusion: The suitability of LLM-generated evaluation data needs to be reassessed, since some essential difficulty may be lost. Abstract: Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning-over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.

[158] Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Arthur S. Bianchessi,Rodrigo C. Barros,Lucas S. Kupssinskü

Main category: cs.CL

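TL;DR: The paper proposes the Bayesian Attention Mechanism (BAM), a theoretical framework for improving positional encoding (PE) methods that substantially improves long-context generalization.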

Details

Motivation: Existing positional encoding methods lack theoretical clarity and rely on limited evaluation metrics that do not adequately support their extrapolation claims. Method: The Bayesian Attention Mechanism (BAM) models positional encoding as a prior within a probabilistic model and unifies existing methods (such as NoPE and ALiBi). Result: BAM achieves accurate information retrieval at 500 times the training context length, outperforming existing methods while keeping comparable perplexity and adding few parameters. Conclusion: BAM gives positional encoding a theoretical foundation and substantially improves long-context generalization. Abstract: Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
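One way to read "positional encoding as a prior" is that a log-prior over relative position is added to the attention logits before the softmax, so the weights become proportional to the content score times the prior. The sketch below follows that reading: a uniform prior adds nothing (NoPE-like), a linear penalty matches the ALiBi bias, and a generalized Gaussian adds a shape exponent. The parameter names (`slope`, `alpha`, `beta`, `mu`) and this exact parametrization are assumptions and may not match the paper's formulation.

```python
import math


def log_prior(distance, kind="uniform", slope=0.5, alpha=2.0, beta=2.0, mu=0.0):
    """Unnormalized log-prior over the query-key distance (illustrative forms)."""
    if kind == "uniform":
        return 0.0  # no positional preference, as in NoPE
    if kind == "alibi":
        return -slope * abs(distance)  # linear distance penalty, as in ALiBi
    if kind == "generalized_gaussian":
        return -((abs(distance - mu) / alpha) ** beta)
    raise ValueError(f"unknown prior: {kind}")


def attention_weights(scores, query_pos, kind="uniform", **params):
    """Softmax over content scores plus the positional log-prior."""
    logits = [s + log_prior(query_pos - j, kind, **params) for j, s in enumerate(scores)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


flat = [0.0, 0.0, 0.0]
w_uniform = attention_weights(flat, query_pos=2)
w_alibi = attention_weights(flat, query_pos=2, kind="alibi", slope=0.5)
w_gg = attention_weights(flat, query_pos=2, kind="generalized_gaussian", alpha=2.0, beta=1.0)
print(w_uniform, w_alibi, w_gg)
```

With `beta=1` and `alpha = 1 / slope` the generalized Gaussian form reduces to the linear ALiBi penalty, which illustrates how a single family of priors can cover existing schemes as special cases.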

[159] LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference

Pingjun Hong,Beiduo Chen,Siyao Peng,Marie-Catherine de Marneffe,Barbara Plank

Main category: cs.CL

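TL;DR: The paper studies the problem that human annotators in natural language inference (NLI) may reason differently even when they assign the same label, proposes the LITEX taxonomy for systematically analysing free-text explanations, and validates its usefulness in explanation generation.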

Details

Motivation: To address cases in NLI where annotators agree on a label but reason differently, and to reveal the actual rationales behind annotations. Method: The LITEX taxonomy is introduced, a subset of the e-SNLI dataset is annotated with it, the taxonomy's reliability is validated, and its use in explanation generation is evaluated. Result: The LITEX taxonomy captures within-label variation effectively, and explanations generated with LITEX are closer to human explanations. Conclusion: The LITEX taxonomy not only reveals within-label variation but also narrows the gap between human and model explanations by guiding explanation generation. Abstract: There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation--cases where annotators agree on the same label but provide divergent reasoning--poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators' reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LITEX, a linguistically-informed taxonomy for categorizing free-text explanations. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy's reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy's usefulness in explanation generation, demonstrating that conditioning generation on LITEX yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.

[160] GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification

Iknoor Singh,Carolina Scarton,Kalina Bontcheva

Main category: cs.CL

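TL;DR: The paper proposes H3Prompt, a hierarchical three-step prompting method for multilingual news narrative classification, which ranked first on the English test set of SemEval 2025 Task 10 Subtask 2.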

Details

Motivation: The proliferation of online news and misinformation calls for automatic data analysis methods, making narrative classification an important task. Method: A three-step large language model prompting strategy that first classifies the article's domain, then identifies main narratives, and finally assigns sub-narratives. Result: The method ranked first on the English test set among 28 teams. Conclusion: H3Prompt is an effective method for multilingual narrative classification, and the code is open source. Abstract: The proliferation of online news and the increasing spread of misinformation necessitate robust methods for automatic data analysis. Narrative classification is emerging as an important task, since identifying what is being said online is critical for fact-checkers, policymakers and other professionals working on information studies. This paper presents our approach to SemEval 2025 Task 10 Subtask 2, which aims to classify news articles into a pre-defined two-level taxonomy of main narratives and sub-narratives across multiple languages. We propose Hierarchical Three-Step Prompting (H3Prompt) for multilingual narrative classification. Our methodology follows a three-step Large Language Model (LLM) prompting strategy, where the model first categorises an article into one of two domains (Ukraine-Russia War or Climate Change), then identifies the most relevant main narratives, and finally assigns sub-narratives. Our approach secured the top position on the English test set among 28 competing teams worldwide. The code is available at https://github.com/GateNLP/H3Prompt.
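
The three-step flow described above can be sketched as follows. This is an illustration under stated assumptions rather than the released H3Prompt code: `llm` is any caller-supplied function mapping a prompt string to a reply string, the prompt wording is invented, replies are assumed to be semicolon-separated labels, and `taxonomy` is a hypothetical nested dict of domain, main narrative, and sub-narratives.

```python
def pick_labels(reply, allowed):
    """Keep only the labels in the reply that exist in the allowed set."""
    chosen = [part.strip() for part in reply.split(";")]
    return [label for label in chosen if label in allowed]

def h3_classify(article, taxonomy, llm):
    """Hierarchical classification: domain, then main, then sub-narratives."""
    # Step 1: choose exactly one domain.
    domains = list(taxonomy)
    reply = llm(f"STEP 1. Choose one domain from {domains}.\nArticle: {article}")
    picked = pick_labels(reply, domains)
    if not picked:
        return None, {}
    domain = picked[0]

    # Step 2: choose main narratives, restricted to the chosen domain.
    mains = list(taxonomy[domain])
    reply = llm(f"STEP 2. Choose main narratives from {mains}.\nArticle: {article}")
    main_labels = pick_labels(reply, mains)

    # Step 3: for each main narrative, choose among its own sub-narratives.
    result = {}
    for main in main_labels:
        subs = taxonomy[domain][main]
        reply = llm(f"STEP 3. Choose sub-narratives of '{main}' from {subs}.\n"
                    f"Article: {article}")
        result[main] = pick_labels(reply, subs)
    return domain, result
```

Filtering every reply against the allowed labels keeps each step inside the taxonomy, so a hallucinated label at one level cannot propagate to the next.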

[161] When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy

Jirui Qi,Shan Chen,Zidi Xiong,Raquel Fernández,Danielle S. Bitterman,Arianna Bisazza

Main category: cs.CL

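TL;DR: Current large reasoning models (LRMs) perform well on English reasoning tasks, but their ability to reason in other languages is less studied. The study finds that even the most advanced models often revert to English or produce fragmented reasoning, revealing a substantial gap in multilingual reasoning.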

Details

Motivation: To study the reasoning ability of LRMs in multilingual settings, since users need to read the reasoning process in their own language for effective oversight. Method: Comprehensively evaluates two leading families of LRMs on the XReasoning benchmark, and tries prompt-based interventions and targeted post-training on a small number of examples. Result: The interventions improve readability and oversight but reduce answer accuracy; post-training partially mitigates this, though some accuracy loss remains. Conclusion: The multilingual reasoning ability of current LRMs is limited and needs further improvement. Code and data are open source. Abstract: Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.

[162] VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

Chahat Raj,Bowen Wei,Aylin Caliskan,Antonios Anastasopoulos,Ziwei Zhu

Main category: cs.CL

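TL;DR: VIGNETTE is a large-scale VQA benchmark for evaluating bias in vision-language models (VLMs) along four directions (factuality, perception, stereotyping, and decision making), revealing how models construct social meaning from visual cues.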

Details

Motivation: Existing VLM bias studies are mostly limited to portrait-style images and gender-occupation associations, overlooking broader social stereotypes and their implied harm. Method: Uses a VQA framework with 30M+ images to evaluate VLM bias along four directions, drawing on social psychology to analyze how models infer traits and roles from visual cues. Result: The study finds subtle, multifaceted, and surprising stereotypical patterns in VLMs, revealing how models construct social meaning from inputs. Conclusion: VIGNETTE offers a more comprehensive view of VLM bias and exposes the complexity and potential harm of how models construct social meaning. Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies, through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.

[163] Talent or Luck? Evaluating Attribution Bias in Large Language Models

Chahat Raj,Mahika Banerjee,Aylin Caliskan,Antonios Anastasopoulos,Ziwei Zhu

Main category: cs.CL

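TL;DR: The paper examines how LLMs attribute event outcomes to internal or external causes across demographic groups, and proposes a cognitively grounded bias evaluation framework.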

Details

Motivation: To understand how attribution affects fairness, in particular the biases LLMs may show when attributing event outcomes. Method: Proposes a cognitively grounded bias evaluation framework for identifying bias in model reasoning. Result: The study finds that LLMs' attribution behavior may reinforce biases toward certain demographic groups. Conclusion: Evaluating and addressing attribution bias in LLMs is important for ensuring fairness. Abstract: When a student fails an exam, do we tend to blame their effort or the test's difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal (e.g., effort, ability) or external (e.g., task difficulty, luck) factors. LLMs' attribution of event outcomes based on demographics carries important fairness implications. Most works exploring social biases in LLMs focus on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how models' reasoning disparities channelize biases toward demographic groups.

[164] ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room

Nikita Mehandru,Niloufar Golchini,David Bamman,Travis Zack,Melanie F. Molina,Ahmed Alaa

Main category: cs.CL

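TL;DR: ER-Reason is a benchmark for evaluating the clinical reasoning and decision-making of large language models (LLMs) in the emergency room (ER). It contains data from 3,984 patients and 25,174 clinical notes, and reveals a gap between LLM and clinician reasoning.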

Details

Motivation: Existing evaluations often depend on costly human annotation and focus on isolated tasks, failing to capture clinical reasoning or the full workflow behind medical decisions. Method: ER-Reason designs tasks around the ER workflow, covering stages such as triage, assessment, and treatment selection, and collects 72 physician-authored rationales. Result: Evaluation shows a clear gap between LLM-generated clinical reasoning and physician reasoning. Conclusion: Future research needs to bridge the gap between LLMs and clinicians in reasoning about emergency decisions. Abstract: Large language models (LLMs) have been extensively evaluated on medical question answering tasks based on licensing exams. However, real-world evaluations often depend on costly human annotators, and existing benchmarks tend to focus on isolated tasks that rarely capture the clinical reasoning or full workflow underlying medical decisions. In this paper, we introduce ER-Reason, a benchmark designed to evaluate LLM-based clinical reasoning and decision-making in the emergency room (ER)--a high-stakes setting where clinicians make rapid, consequential decisions across diverse patient presentations and medical specialties under time pressure. ER-Reason includes data from 3,984 patients, encompassing 25,174 de-identified longitudinal clinical notes spanning discharge summaries, progress notes, history and physical exams, consults, echocardiography reports, imaging notes, and ER provider documentation. The benchmark includes evaluation tasks that span key stages of the ER workflow: triage intake, initial assessment, treatment selection, disposition planning, and final diagnosis--each structured to reflect core clinical reasoning processes such as differential diagnosis via rule-out reasoning. We also collected 72 full physician-authored rationales explaining reasoning processes that mimic the teaching process used in residency training, and are typically absent from ER documentation. Evaluations of state-of-the-art LLMs on ER-Reason reveal a gap between LLM-generated and clinician-authored clinical reasoning for ER decisions, highlighting the need for future research to bridge this divide.

[165] Structured Memory Mechanisms for Stable Context Representation in Large Language Models

Yue Xing,Tao Yang,Yijiashun Qi,Minggu Wei,Yu Cheng,Honghui Xin

Main category: cs.CL

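TL;DR: The paper proposes a model architecture with a long-term memory mechanism to address the limitations of large language models in understanding long-term context, and validates it experimentally.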

Details

Motivation: To address the limitations of large language models in long-term context understanding, such as context loss and semantic drift. Method: Proposes an architecture that integrates explicit memory units, a gated writing mechanism, and an attention-based reading module, with a forgetting function that dynamically updates memory content. A joint training objective combines the main task loss with constraints on memory operations. Result: The model shows advantages in text generation consistency, stability in multi-turn question answering, and accuracy in cross-context reasoning, with strong semantic retention and contextual coherence on long-text tasks and complex question answering. Conclusion: The experiments confirm the critical role of memory mechanisms in language understanding and the feasibility and effectiveness of the proposed approach in both architectural design and performance. Abstract: This paper addresses the limitations of large language models in understanding long-term context. It proposes a model architecture equipped with a long-term memory mechanism to improve the retention and retrieval of semantic information across paragraphs and dialogue turns. The model integrates explicit memory units, gated writing mechanisms, and attention-based reading modules. A forgetting function is introduced to enable dynamic updates of memory content, enhancing the model's ability to manage historical information. To further improve the effectiveness of memory operations, the study designs a joint training objective. This combines the main task loss with constraints on memory writing and forgetting. It guides the model to learn better memory strategies during task execution. Systematic evaluation across multiple subtasks shows that the model achieves clear advantages in text generation consistency, stability in multi-turn question answering, and accuracy in cross-context reasoning. In particular, the model demonstrates strong semantic retention and contextual coherence in long-text tasks and complex question answering scenarios. It effectively mitigates the context loss and semantic drift problems commonly faced by traditional language models when handling long-term dependencies. The experiments also include analysis of different memory structures, capacity sizes, and control strategies. These results further confirm the critical role of memory mechanisms in language understanding. They demonstrate the feasibility and effectiveness of the proposed approach in both architectural design and performance outcomes.

[166] Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging

Haobo Zhang,Jiayu Zhou

Main category: cs.CL

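TL;DR: The paper proposes OSRM, which constrains the LoRA subspace to improve model merging performance, avoiding interference between tasks while preserving single-task accuracy.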

Details

Motivation: Fine-tuning large language models (LMs) for individual tasks performs well but is expensive to deploy and store. Existing merging methods work poorly on LoRA fine-tuned models, and the interplay between parameters and data distributions needs to be addressed. Method: Proposes OSRM, which constrains the LoRA subspace before fine-tuning so that updates for one task do not shift outputs for other tasks, and which is compatible with existing merging algorithms. Result: Tested on eight datasets with five LMs, OSRM improves merging performance, preserves single-task accuracy, and is more robust to merging hyperparameters. Conclusion: OSRM addresses interference in merging LoRA models, offers a plug-and-play solution, and highlights the importance of data-parameter interaction. Abstract: Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), due to significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace *prior* to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.

[167] Improving QA Efficiency with DistilBERT: Fine-Tuning and Inference on mobile Intel CPUs

Ngeyen Yinkfu

Main category: cs.CL

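TL;DR: The study presents an efficient transformer-based question answering model optimized for a 13th Gen Intel i7-1355U CPU, reaching a validation F1 score of 0.6536 on SQuAD v1.1 with an inference time of 0.1208 seconds per question.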

Details

Motivation: To provide a real-time question answering model for resource-constrained systems that balances accuracy and computational efficiency. Method: Fine-tunes a DistilBERT architecture, combined with data augmentation and hyperparameter tuning. Result: The model reaches a validation F1 score of 0.6536 with an inference time of 0.1208 seconds per question, offering a favorable accuracy-efficiency trade-off compared with a rule-based baseline (F1: 0.3124) and full BERT-based models. Conclusion: The model is suited to real-time applications on resource-constrained systems, and the study offers practical guidance on optimizing transformer models for CPU inference. Abstract: This study presents an efficient transformer-based question-answering (QA) model optimized for deployment on a 13th Gen Intel i7-1355U CPU, using the Stanford Question Answering Dataset (SQuAD) v1.1. Leveraging exploratory data analysis, data augmentation, and fine-tuning of a DistilBERT architecture, the model achieves a validation F1 score of 0.6536 with an average inference time of 0.1208 seconds per question. Compared to a rule-based baseline (F1: 0.3124) and full BERT-based models, our approach offers a favorable trade-off between accuracy and computational efficiency. This makes it well-suited for real-time applications on resource-constrained systems. The study includes systematic evaluation of data augmentation strategies and hyperparameter configurations, providing practical insights into optimizing transformer models for CPU-based inference.
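
For readers unfamiliar with the F1 figures quoted above, the following is a minimal sketch of a token-level F1 score in the style commonly used for SQuAD-type extractive QA. It is an assumption that the study uses this exact metric; the function names `normalize` and `token_f1` are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score would average `token_f1` over questions, typically taking the maximum over the available reference answers for each question.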

[168] WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning

Yuchen Zhuang,Di Jin,Jiaao Chen,Wenqi Shi,Hanrui Wang,Chao Zhang

Main category: cs.CL

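TL;DR: WorkForceAgent-R1 is an LLM-based web agent that improves single-step reasoning and planning through an R1-style reinforcement learning framework, substantially outperforming SFT baselines.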

Details

Motivation: Existing web agents based on supervised fine-tuning generalize poorly and lack robustness in dynamic web interactions, so their reasoning ability needs to improve. Method: Uses a rule-based R1-style reinforcement learning framework with a structured reward function that scores output format and action correctness, so that intermediate reasoning is learned implicitly. Result: On the WorkArena benchmark, WorkForceAgent-R1 outperforms SFT baselines by 10.26-16.59% and is competitive with gpt-4o. Conclusion: WorkForceAgent-R1 performs well on business-oriented web navigation tasks, supporting the effectiveness of the reinforcement learning framework. Abstract: Large language model (LLM)-empowered web agents enable automating complex, real-time web navigation tasks in enterprise environments. However, existing web agents relying on supervised fine-tuning (SFT) often struggle with generalization and robustness due to insufficient reasoning capabilities when handling the inherently dynamic nature of web interactions. In this study, we introduce WorkForceAgent-R1, an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework designed explicitly to enhance single-step reasoning and planning for business-oriented web navigation tasks. We employ a structured reward function that evaluates both adherence to output formats and correctness of actions, enabling WorkForceAgent-R1 to implicitly learn robust intermediate reasoning without explicit annotations or extensive expert demonstrations. Extensive experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines by 10.26-16.59%, achieving competitive performance relative to proprietary LLM-based agents (gpt-4o) in workplace-oriented web navigation tasks.
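
A rule-based reward of the kind described above, scoring format adherence and action correctness, might look roughly like this. It is a sketch under assumptions: the `<think>`/`<action>` tag layout, the 0.2/0.8 weights, and exact string matching of actions are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical output format: reasoning in <think>, one action in <action>.
OUTPUT_PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<action>(.+?)</action>$", re.DOTALL
)

def reward(output, gold_action, format_weight=0.2, action_weight=0.8):
    """Return a scalar reward from simple, checkable rules.

    A malformed output earns nothing; a well-formed output earns the format
    reward, plus the action reward if the predicted action matches the gold one.
    """
    match = OUTPUT_PATTERN.match(output.strip())
    if match is None:
        return 0.0
    predicted = match.group(2).strip()
    score = format_weight
    if predicted == gold_action:
        score += action_weight
    return score
```

Because the reasoning text inside `<think>` is never scored directly, any improvement in reasoning has to arise indirectly from its effect on the final action, which is the sense in which intermediate reasoning is learned without explicit annotations.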

[169] Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Jaewoo Ahn,Heeseung Yun,Dayoon Ko,Gunhee Kim

Main category: cs.CL

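TL;DR: The paper proposes MAC, a benchmark that uses large language models to generate deceptive text samples that expose compositional vulnerabilities in multimodal representations, and improves zero-shot performance with a self-training method.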

Details

Motivation: Pre-trained multimodal representations such as CLIP are powerful but have compositional vulnerabilities that lead to counterintuitive judgments. The study aims to expose and address these vulnerabilities. Method: Proposes the MAC benchmark, which uses LLMs to generate deceptive text samples, and improves performance with a self-training method (rejection-sampling fine-tuning with diversity-promoting filtering). Result: Using smaller language models such as Llama-3.1-8B, the method achieves higher attack success rates and sample diversity across multimodal representations (image, video, audio). Conclusion: The MAC benchmark and self-training method effectively expose compositional vulnerabilities in multimodal representations and suggest new ways to improve zero-shot methods. Abstract: While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audio.
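
The two evaluation axes named above can be illustrated with a small sketch. The definitions here are simplifying assumptions, not the benchmark's exact criteria: an attack "succeeds" when the deceptive text scores higher than the original text against the same input, and diversity is the Shannon entropy of the pooled unigram distribution over a group of samples.

```python
import math
from collections import Counter

def attack_success_rate(original_scores, adversarial_scores):
    """Fraction of samples where the deceptive text outscores the original.

    Each score is assumed to be a similarity (e.g. from a CLIP-like model)
    between the text and its paired image, video, or audio clip.
    """
    wins = sum(adv > orig
               for orig, adv in zip(original_scores, adversarial_scores))
    return wins / len(original_scores)

def group_entropy(samples):
    """Shannon entropy (in bits) of the unigram distribution over a group."""
    counts = Counter(tok for text in samples for tok in text.lower().split())
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

Reporting both numbers guards against a degenerate attacker that succeeds often by repeating one trick: such an attacker would have a high success rate but low group entropy.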

[170] OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature

Alisha Srivastava,Emir Korukluoglu,Minh Nhat Le,Duyen Tran,Chau Minh Pham,Marzena Karpinska,Mohit Iyyer

Main category: cs.CL

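TL;DR: The paper studies multilingual and cross-lingual memorization in large language models (LLMs) and finds that LLMs can recall content across languages, even for texts without a direct translation in the pretraining data.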

Details

Motivation: To examine how well LLMs memorize and recall text in multilingual and cross-lingual settings, especially for non-English languages and translated text. Method: Uses the OWL dataset (31.5K aligned excerpts in ten languages) and evaluates memorization through three tasks: direct probing, name cloze, and prefix probing. Result: LLMs recall content across languages; for example, GPT-4o identifies authors and titles 69% of the time in newly translated excerpts, and perturbations such as shuffling words only modestly reduce accuracy. Conclusion: The study shows the extent of cross-lingual memorization in LLMs and offers insights into differences between models. Abstract: Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book's title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts without direct translation in pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking characters, shuffling words) modestly reduce direct probing accuracy (7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and provide insights on the differences between the models.

[171] NegVQA: Can Vision Language Models Understand Negation?

Yuhui Zhang,Yuchang Su,Yiming Liu,Serena Yeung-Levy

Main category: cs.CL

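TL;DR: NegVQA is a new visual question answering benchmark for evaluating how well vision language models understand negation; existing models perform substantially worse on negated questions.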

Details

Motivation: Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models are deployed in high-stakes applications, assessing their understanding of negation becomes essential. Method: Builds the NegVQA benchmark of 7,379 two-choice questions by using large language models to generate negated versions of questions from existing VQA datasets. Result: An evaluation of 20 state-of-the-art vision language models shows a substantial performance drop on negated questions and a U-shaped scaling trend. Conclusion: NegVQA reveals critical gaps in negation understanding in vision language models and offers insights for future model development. Abstract: Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.

[172] StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs

Haohan Yuan,Sukhwa Hong,Haopeng Zhang

Main category: cs.CL

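TL;DR: StrucSum is a training-free prompting framework that uses sentence-level graph structures to strengthen LLMs in zero-shot summarization, improving both summary quality and factual consistency.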

Details

Motivation: Large language models (LLMs) perform well in zero-shot summarization but struggle to model document structure and identify salient information in long texts. Method: StrucSum injects structural signals through three strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Result: On ArXiv, PubMed, and Multi-News, StrucSum outperforms unsupervised baselines and vanilla prompting; on ArXiv it raises FactCC and SummaC by 19.2 and 9.7 points, respectively. Conclusion: Structure-aware prompting is a simple and effective approach to zero-shot extractive summarization that needs no additional training or task-specific tuning. Abstract: Large language models (LLMs) have shown strong performance in zero-shot summarization, but often struggle to model document structure and identify salient information in long texts. In this work, we introduce StrucSum, a training-free prompting framework that enhances LLM reasoning through sentence-level graph structures. StrucSum injects structural signals into prompts via three targeted strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Experiments on ArXiv, PubMed, and Multi-News demonstrate that StrucSum consistently improves both summary quality and factual consistency over unsupervised baselines and vanilla prompting. Notably, on ArXiv, it boosts FactCC and SummaC by 19.2 and 9.7 points, indicating stronger alignment between summaries and source content. These findings suggest that structure-aware prompting is a simple yet effective approach for zero-shot extractive summarization with LLMs, without any training or task-specific tuning.
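
As a rough illustration of centrality-guided input reduction, the sketch below builds a sentence graph and keeps only the most central sentences before prompting. Several choices are assumptions made for brevity and are not claimed to match StrucSum: edges are weighted by Jaccard word overlap, centrality is weighted degree, and `threshold` and `keep_ratio` are hypothetical parameters.

```python
import math

def jaccard(a, b):
    """Word-overlap similarity between two sentences, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def centrality_scores(sentences, threshold=0.2):
    """Weighted degree centrality on a thresholded similarity graph."""
    scores = [0.0] * len(sentences)
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            weight = jaccard(sentences[i], sentences[j])
            if weight >= threshold:  # keep only sufficiently similar pairs
                scores[i] += weight
                scores[j] += weight
    return scores

def mask_low_centrality(sentences, keep_ratio=0.5):
    """Keep the most central sentences, preserving document order."""
    scores = centrality_scores(sentences)
    k = max(1, math.ceil(len(sentences) * keep_ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in sorted(ranked[:k])]
```

The surviving sentences, optionally annotated with their scores, would then be placed in the summarization prompt, which shortens the input without any model training.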

[173] LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments

Matteo Guida,Yulia Otmakhova,Eduard Hovy,Lea Frermann

Main category: cs.CL

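TL;DR: The paper evaluates four large language models (LLMs) on three argument mining tasks and finds that large and fine-tuned models perform well overall but show systematic shortcomings on long, nuanced comments and emotionally charged language.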

Details

Motivation: To explore how well LLMs detect and understand pre-defined arguments on contested topics such as abortion, filling a gap in research on argument mining in online comments. Method: A quantitative evaluation of four state-of-the-art LLMs on three argument mining tasks, using datasets of over 2,000 comments across six polarizing topics. Result: LLMs perform well overall on the three tasks, especially large and fine-tuned models, but handle long comments and emotionally charged language poorly. Conclusion: LLMs show promise for automated argument analysis but have current limitations, particularly with complex and emotionally charged language. Abstract: Automated large-scale analysis of public discussions around contested issues like abortion requires detecting and understanding the use of arguments. While Large Language Models (LLMs) have shown promise in language processing tasks, their performance in mining topic-specific, pre-defined arguments in online comments remains underexplored. We evaluate four state-of-the-art LLMs on three argument mining tasks using datasets comprising over 2,000 opinion comments across six polarizing topics. Quantitative evaluation suggests an overall strong performance across the three tasks, especially for large and fine-tuned LLMs, albeit at a significant environmental cost. However, a detailed error analysis revealed systematic shortcomings on long and nuanced comments and emotionally charged language, raising concerns for downstream applications like content moderation or opinion analysis. Our results highlight both the promise and current limitations of LLMs for automated argument analysis in online comments.

[174] LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements

Jianwei Wang,Mengqi Wang,Yinsi Zhou,Zhenchang Xing,Qing Liu,Xiwei Xu,Wenjie Zhang,Liming Zhu

Main category: cs.CL

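TL;DR: HSE-Bench is the first benchmark dataset for evaluating large language models (LLMs) on Health, Safety, and Environment (HSE) compliance assessment. It shows that current LLMs rely on semantic matching rather than systematic legal reasoning, and it proposes a new prompting technique, RoE, to improve reasoning.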

Details

Motivation: HSE compliance assessment requires dynamic real-time decision-making, but LLMs' domain knowledge and structured legal reasoning in this area remain underexplored. Method: Builds the HSE-Bench dataset of over 1,000 questions, evaluates LLMs with an IRAC reasoning flow, and proposes the RoE prompting technique. Result: Current LLMs perform well but rely on semantic matching and lack systematic legal reasoning; RoE is proposed to guide them toward more accurate decisions. Conclusion: The study exposes reasoning gaps in LLMs for HSE compliance assessment, offers RoE as a direction for improvement, and calls for further research. Abstract: Health, Safety, and Environment (HSE) compliance assessment demands dynamic real-time decision-making under complicated regulations and complex human-machine-environment interactions. While large language models (LLMs) hold significant potential for decision intelligence and contextual dialogue, their capacity for domain-specific knowledge in HSE and structured legal reasoning remains underexplored. We introduce HSE-Bench, the first benchmark dataset designed to evaluate the HSE compliance assessment capabilities of LLMs. HSE-Bench comprises over 1,000 manually curated questions drawn from regulations, court cases, safety exams, and fieldwork videos, and integrates a reasoning flow based on Issue spotting, rule Recall, rule Application, and rule Conclusion (IRAC) to assess the holistic reasoning pipeline. We conduct extensive evaluations on different prompting strategies and more than 10 LLMs, including foundation models, reasoning models and multimodal vision models. The results show that, although current LLMs achieve good performance, their capabilities largely rely on semantic matching rather than principled reasoning grounded in the underlying HSE compliance context. Moreover, their native reasoning trace lacks the systematic legal reasoning required for rigorous HSE compliance assessment. To alleviate these, we propose a new prompting technique, Reasoning of Expert (RoE), which guides LLMs to simulate the reasoning process of different experts for compliance assessment and reach a more accurate unified decision. We hope our study highlights reasoning gaps in LLMs for HSE compliance and inspires further research on related tasks.

[175] ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

Peixuan Han,Zijia Liu,Jiaxuan You

Main category: cs.CL

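TL;DR: The paper proposes ToMAP, a new approach that incorporates two theory of mind modules to strengthen a persuader agent's awareness and analysis of its opponent, producing more effective arguments. Experiments show that ToMAP outperforms much larger baseline models.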

Details

Motivation: Existing large language models cannot dynamically model the opponent's mental state in persuasion tasks, which limits argument diversity and opponent awareness. Method: ToMAP prompts the persuader to consider possible objections, uses a text encoder with an MLP classifier to predict the opponent's stance, and applies reinforcement learning to generate more effective arguments. Result: At 3B parameters, ToMAP outperforms much larger models such as GPT-4o, with a relative gain of 39.4%, and shows more complex reasoning chains and less repetition. Conclusion: By strengthening opponent awareness, ToMAP generates more diverse and effective arguments, showing potential for developing more persuasive language agents. Abstract: Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader to learn how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents.
Code is available at: https://github.com/ulab-uiuc/ToMAP.

[176] Exploring Scaling Laws for EHR Foundation Models

Sheng Zhang,Qin Liu,Naoto Usuyama,Cliff Wong,Tristan Naumann,Hoifung Poon

Main category: cs.CL

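TL;DR: The paper presents the first empirical study of scaling laws for electronic health record (EHR) foundation models and finds predictable performance gains similar to those of large language models (LLMs).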

Details

Motivation: Electronic health records (EHRs) are a rich, sequential, and globally abundant data source, but their scaling laws have not been explored. The paper aims to fill this gap. Method: Trains transformer architectures of varying sizes and compute budgets on the MIMIC-IV database and analyzes the scaling behavior of EHR models. Result: EHR models show scaling patterns similar to LLMs, including parabolic IsoFLOPs curves and power-law relationships between compute, parameters, data size, and clinical utility. Conclusion: The results lay the groundwork for developing efficient EHR foundation models and may advance clinical prediction tasks and personalized healthcare. Abstract: The emergence of scaling laws has profoundly shaped the development of large language models (LLMs), enabling predictable performance gains through systematic increases in model size, dataset volume, and compute. Yet, these principles remain largely unexplored in the context of electronic health records (EHRs) -- a rich, sequential, and globally abundant data source that differs structurally from natural language. In this work, we present the first empirical investigation of scaling laws for EHR foundation models. By training transformer architectures on patient timeline data from the MIMIC-IV database across varying model sizes and compute budgets, we identify consistent scaling patterns, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. These findings demonstrate that EHR models exhibit scaling behavior analogous to LLMs, offering predictive insights into resource-efficient training strategies. Our results lay the groundwork for developing powerful EHR foundation models capable of transforming clinical prediction tasks and advancing personalized healthcare.
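
A parabolic IsoFLOPs curve is typically used to read off a compute-optimal model size: at a fixed compute budget, loss is plotted against the logarithm of the parameter count, a parabola is fitted, and its vertex marks the optimum. The sketch below does this exactly for three points; it is an illustration only, with made-up numbers, and a real analysis would fit many points per budget by least squares.

```python
import math

def parabola_vertex(points):
    """Vertex x-coordinate of the parabola through three (x, y) points."""
    (x1, y1), (x2, y2), (x3, y3) = points
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    # Coefficients a and b of y = a*x^2 + b*x + c.
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3**2 * (y1 - y2) + x2**2 * (y3 - y1) + x1**2 * (y2 - y3)) / denom
    if a <= 0:
        raise ValueError("points do not form an upward-opening parabola")
    return -b / (2 * a)

def optimal_model_size(param_counts, losses):
    """Compute-optimal parameter count for one IsoFLOP curve.

    Fits loss against log10(parameters) and returns the minimizing size.
    """
    points = [(math.log10(n), loss) for n, loss in zip(param_counts, losses)]
    return 10 ** parabola_vertex(points)
```

Repeating this for several compute budgets gives one optimal size per budget, and a straight-line fit of log optimal size against log compute would then yield the power-law exponent.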

[177] Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation

Hoang Pham,Thanh-Do Nguyen,Khac-Hoai Nam Bui

Main category: cs.CL

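TL;DR: VeGraph is a new framework that uses the reasoning and comprehension abilities of LLM agents for complex claim verification, working in three phases: graph representation, entity disambiguation, and verification.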

Details

Motivation: Traditional approaches to complex claim verification are limited by the lack of effective entity disambiguation strategies, which VeGraph aims to address. Method: VeGraph works in three phases: graph representation (decomposing the claim into structured triplets), entity disambiguation (interacting with the knowledge base to resolve ambiguity), and verification (completing the fact check). Result: Experiments show that VeGraph is competitive with baselines on the HoVer and FEVEROUS benchmarks. Conclusion: VeGraph effectively addresses the challenges of complex claim verification, and the code and data are open source. Abstract: Claim verification is a long-standing and challenging task that demands not only high accuracy but also explainability of the verification process. This task has become an emerging research issue in the era of large language models (LLMs) since real-world claims are often complex, featuring intricate semantic structures or obfuscated entities. Traditional approaches typically address this by decomposing claims into sub-claims and querying a knowledge base to resolve hidden or ambiguous entities. However, the absence of effective disambiguation strategies for these entities can compromise the entire verification process. To address these challenges, we propose Verify-in-the-Graph (VeGraph), a novel framework leveraging the reasoning and comprehension abilities of LLM agents. VeGraph operates in three phases: (1) Graph Representation - an input claim is decomposed into structured triplets, forming a graph-based representation that integrates both structured and unstructured information; (2) Entity Disambiguation - VeGraph iteratively interacts with the knowledge base to resolve ambiguous entities within the graph for deeper sub-claim verification; and (3) Verification - remaining triplets are verified to complete the fact-checking process. Experiments using Meta-Llama-3-70B (instruct version) show that VeGraph achieves competitive performance compared to baselines on two benchmarks, HoVer and FEVEROUS, effectively addressing claim verification challenges. Our source code and data are available for further exploitation.

[178] DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Yize Cheng,Wenxiao Wang,Mazda Moayeri,Soheil Feizi

Main category: cs.CL

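TL;DR: The DyePack framework uses backdoor attacks to detect whether a model was trained on a benchmark test set, without access to model internals, while guaranteeing a low false positive rate.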

Details

Motivation: Open benchmarks are vulnerable to test set contamination, so a detection method that needs no information about model internals is required. Method: DyePack mixes backdoor samples into the test data and uses a design with multiple backdoors and stochastic targets to compute the exact false positive rate. Result: On multiple-choice and open-ended generation tasks, DyePack detects all contaminated models with very low false positive rates. Conclusion: DyePack offers an efficient and reliable way to detect test set contamination while preventing false accusations. Abstract: Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
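
The exact FPR computation mentioned in the abstract can be illustrated with a binomial tail. This is a minimal sketch, not the paper's implementation: it assumes each backdoor target is drawn uniformly and independently from `num_targets` options, that a clean model matches each target by chance with probability 1/`num_targets`, and that a model is flagged when at least `threshold` backdoors are matched. The function name and all three parameters are hypothetical.

```python
from math import comb


def false_positive_rate(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """Probability that a clean model matches at least `threshold` backdoor targets.

    Assumption (for illustration only): each of the `num_backdoors` targets is
    matched independently with probability 1 / num_targets, so the number of
    matches follows a Binomial(num_backdoors, 1 / num_targets) distribution.
    """
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, k) * p**k * (1.0 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )


if __name__ == "__main__":
    # Sweep the flagging threshold for 8 backdoors with 10 candidate targets each.
    for tau in range(1, 9):
        print(tau, false_positive_rate(8, 10, tau))
```

Under these assumptions, raising the threshold or adding backdoors shrinks the tail probability, which is why a multi-backdoor design can bound false accusations tightly.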

[179] A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs

Chiwan Park,Wonjun Jang,Daeryong Kim,Aelim Ahn,Kichang Yang,Woosung Hwang,Jihyeon Roh,Hyerin Park,Hyosun Wang,Min Seok Kim,Jihoon Kang

Main category: cs.CL

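TL;DR: The paper examines how to apply large language models (LLMs) in industrial settings, where flexible conversation conflicts with service constraints, and presents a solution through a case study of an e-commerce conversational agent.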

Details

Motivation: In industrial applications, LLMs must balance flexible conversation against strict service constraints, which is difficult in practice. Method: The authors propose an approach that combines optimization strategies with an implementation workflow, and build a conversational agent for the e-commerce domain. Result: The case study shows that the approach helps bridge the gap between academic research and real-world application, yielding a framework for scalable, controllable, and reliable AI-driven agents. Conclusion: The paper offers a practical framework for applying LLMs in industrial settings and resolves the conflict between flexibility and constraints. Abstract: The advancement of Large Language Models (LLMs) has led to significant improvements in various service domains, including search, recommendation, and chatbot applications. However, applying state-of-the-art (SOTA) research to industrial settings presents challenges, as it requires maintaining flexible conversational abilities while also strictly complying with service-specific constraints. This can be seen as two conflicting requirements due to the probabilistic nature of LLMs. In this paper, we propose our approach to addressing this challenge and detail the strategies we employed to overcome their inherent limitations in real-world applications. We conduct a practical case study of a conversational agent designed for the e-commerce domain, detailing our implementation workflow and optimizations. Our findings provide insights into bridging the gap between academic research and real-world application, introducing a framework for developing scalable, controllable, and reliable AI-driven agents.

[180] Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models

Jinwen Chen,Hainan Zhang,Fei Sun,Qinnan Zhang,Sijia Wen,Ziwei Wang,Zhiming Zheng

Main category: cs.CL

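TL;DR: The paper proposes RFTC, a method based on reference filtration and TF-IDF clustering, to detect stealthy backdoor samples in LLM fine-tuning data, addressing the limitations of existing methods on generation tasks.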

Details

Motivation: Stealthy backdoor samples in fine-tuning datasets pose security risks for LLMs, and existing detection methods either do not apply to generation tasks or may harm generation performance. Method: Suspicious samples are first selected by comparing each sample response with a reference model's output; TF-IDF clustering is then applied to the suspicious samples, and backdoor samples are identified from differences in intra-class distance. Result: Experiments on two machine translation datasets and one QA dataset show that RFTC outperforms baselines in both backdoor detection and model performance. Conclusion: RFTC effectively detects stealthy backdoor samples, and the reference filtration mechanism is also shown to be effective. Abstract: Fine-tuning LLMs with datasets containing stealthy backdoors from publishers poses security risks to downstream applications. Mainstream detection methods either identify poisoned samples by analyzing the prediction probability of poisoned classification models or rely on the rewriting model to eliminate the stealthy triggers. However, the former cannot be applied to generation tasks, while the latter may degrade generation performance and introduce new triggers. Therefore, efficiently eliminating stealthy poisoned samples for LLMs remains an urgent problem. We observe that after applying TF-IDF clustering to the sample response, there are notable differences in the intra-class distances between clean and poisoned samples. Poisoned samples tend to cluster closely because of their specific malicious outputs, whereas clean samples are more scattered due to their more varied responses. Thus, in this paper, we propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Specifically, we first compare the sample response with the reference model's outputs and consider the sample suspicious if there's a significant discrepancy. We then perform TF-IDF clustering on these suspicious samples to identify the true poisoned samples based on the intra-class distance. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in backdoor detection and model performance. Further analysis of different reference models also confirms the effectiveness of our Reference-Filtration.
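
The intra-class distance observation can be sketched with a small standard-library TF-IDF. This is an illustration under assumptions, not the RFTC pipeline: the groups are given by hand rather than produced by clustering, the smoothed IDF formula and cosine distance are choices made here, and the example responses and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: token -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            tok: (count / len(toks)) * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        })
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_intra_distance(vectors):
    """Average pairwise cosine distance within one group of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical responses: poisoned ones share a fixed malicious output.
poisoned = ["visit evil.example now", "visit evil.example now please", "visit evil.example now"]
clean = ["the cat sat on the mat", "stock prices rose sharply today", "she enjoys hiking in spring"]
vectors = tfidf_vectors(poisoned + clean)
poisoned_spread = mean_intra_distance(vectors[:3])
clean_spread = mean_intra_distance(vectors[3:])
print(poisoned_spread, clean_spread)
```

In this toy setup the poisoned group is much tighter than the clean group, which is the signal the abstract describes; a real pipeline would first filter with a reference model and then cluster the suspicious samples.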

[181] Context Robust Knowledge Editing for Language Models

Haewon Park,Gyubin Choi,Minjun Kim,Yohan Jo

Main category: cs.CL

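TL;DR: The paper proposes the CHED benchmark and the CoRE method to evaluate and improve the context robustness of knowledge editing (KE) methods. Existing methods often fail when a preceding context is present; CoRE raises editing success by reducing the variance of hidden states.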

Details

Motivation: Current knowledge editing evaluations ignore the effect of context, so edits underperform in real applications; more robust evaluation and methods are needed. Method: The authors build the CHED benchmark to evaluate context robustness and propose CoRE, which strengthens edits by minimizing the variance of hidden states. Result: Existing KE methods perform poorly on CHED; CoRE substantially improves editing success when a context is present while preserving the model's capabilities. Conclusion: Context strongly affects knowledge editing; CoRE improves robustness and suggests new directions for future research. Abstract: Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we develop CHED -- a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that KE methods often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.

[182] Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Spac

Si Wu,Sebastian Bruch

Main category: cs.CL

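TL;DR: The paper proposes an unsupervised measure (NSM) that estimates imageability and concreteness from the sharpness of peaks in a word's neighborhood in the semantic embedding space; experiments show it outperforms existing methods.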

Details

Motivation: To test whether the text alone in an image-caption dataset carries enough signal to estimate imageability and concreteness accurately. Method: The authors propose NSM (Neighborhood Stability Measure), which quantifies the sharpness of peaks in a word's neighborhood in the semantic embedding space. Result: NSM correlates more strongly with ground-truth ratings than existing unsupervised methods and is a strong predictor of imageability and concreteness for classification. Conclusion: NSM is an effective unsupervised method for estimating imageability and concreteness. Abstract: Imageability (potential of text to evoke a mental image) and concreteness (perceptibility of text) are two psycholinguistic properties that link visual and semantic spaces. It is little surprise that computational methods that estimate them do so using parallel visual and semantic spaces, such as collections of image-caption pairs or multi-modal models. In this paper, we work on the supposition that text itself in an image-caption dataset offers sufficient signals to accurately estimate these properties. We hypothesize, in particular, that the peakedness of the neighborhood of a word in the semantic embedding space reflects its degree of imageability and concreteness. We then propose an unsupervised, distribution-free measure, which we call Neighborhood Stability Measure (NSM), that quantifies the sharpness of peaks. Extensive experiments show that NSM correlates more strongly with ground-truth ratings than existing unsupervised methods, and is a strong predictor of these properties for classification. Our code and data are available on GitHub (https://github.com/Artificial-Memory-Lab/imageability).

[183] Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset

Shruti Hegde,Mabon Manoj Ninan,Jonathan R. Dillman,Shireen Hayatghaibi,Lynn Babcock,Elanchezhian Somasundaram

Main category: cs.CL

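TL;DR: The study compares four commercial clinical NLP systems and two dedicated chest radiograph report labelers on entity extraction and assertion detection in pediatric chest radiograph reports. Performance varies widely, which underlines the need for validation before clinical deployment.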

Details

Motivation: Independent evaluations of general-purpose clinical NLP tools for labeling pediatric chest radiograph reports are limited, so the tools need to be compared directly. Method: Four commercial NLP systems and two dedicated labelers (CheXpert and CheXbert) processed 95,008 pediatric chest radiograph reports to extract entities and assertion statuses; outputs were evaluated with Fleiss Kappa and accuracy. Result: The number of extracted entities and the assertion accuracy differ significantly across systems: SparkNLP is highest (76%), AWS is lowest (50%), and the dedicated labelers reach 56%. Conclusion: Clinical NLP tools differ substantially in performance and require careful validation and review before deployment. Abstract: General-purpose clinical natural language processing (NLP) tools are increasingly used for the automatic labeling of clinical reports. However, independent evaluations for specific tasks, such as pediatric chest radiograph (CXR) report labeling, are limited. This study compares four commercial clinical NLP systems - Amazon Comprehend Medical (AWS), Google Healthcare NLP (GC), Azure Clinical NLP (AZ), and SparkNLP (SP) - for entity extraction and assertion detection in pediatric CXR reports. Additionally, CheXpert and CheXbert, two dedicated chest radiograph report labelers, were evaluated on the same task using CheXpert-defined labels. We analyzed 95,008 pediatric CXR reports from a large academic pediatric hospital. Entities and assertion statuses (positive, negative, uncertain) from the findings and impression sections were extracted by the NLP systems, with impression section entities mapped to 12 disease categories and a No Findings category. CheXpert and CheXbert extracted the same 13 categories. Outputs were compared using Fleiss Kappa and accuracy against a consensus pseudo-ground truth. Significant differences were found in the number of extracted entities and assertion distributions across NLP systems. SP extracted 49,688 unique entities, GC 16,477, AZ 31,543, and AWS 27,216. Assertion accuracy across models averaged around 62%, with SP highest (76%) and AWS lowest (50%). CheXpert and CheXbert achieved 56% accuracy. Considerable variability in performance highlights the need for careful validation and review before deploying NLP tools for clinical report labeling.
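
Fleiss Kappa, the agreement statistic used above, can be computed from a table of per-item category counts. The sketch below implements the standard formula; the example tables are made up for illustration and are not data from the study, and it assumes every item is rated by the same number of raters (at least two) and that not all ratings fall in a single category.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    that assigned item i to category j. Every row must sum to the same
    number of raters (>= 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(p_item) / n_items          # mean observed agreement
    expected = sum(p * p for p in p_cat)      # agreement expected by chance
    return (observed - expected) / (1.0 - expected)


# Hypothetical example: 4 systems label 4 reports as positive / negative / uncertain.
perfect = [[4, 0, 0], [0, 4, 0], [4, 0, 0], [0, 4, 0]]
split = [[2, 2], [2, 2]]
print(fleiss_kappa(perfect), fleiss_kappa(split))
```

A value of 1 means the labelers always agree, 0 means agreement is at chance level, and negative values mean agreement is below chance, as in the evenly split table.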

[184] Machine-Facing English: Defining a Hybrid Register Shaped by Human-AI Discourse

Hyunwoo Kim,Hanau Yi

Main category: cs.CL

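TL;DR: The paper studies Machine-Facing English (MFE), an emerging register, and analyzes its syntactic rigidity, pragmatic simplification, and hyper-explicit phrasing, features that improve machine parseability at the cost of natural fluency.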

Details

Motivation: To examine how sustained human-AI interaction shapes a new language variety (MFE) and how it affects linguistic richness and communicative efficiency. Method: Qualitative observations from bilingual (Korean/English) voice- and text-based product testing sessions, with reflexive drafting using Natural Language Declarative Prompting (NLD-P). Result: MFE shows five recurrent traits (redundant clarity, directive syntax, controlled vocabulary, flattened prosody, and single-intent structuring) that improve execution accuracy but compress expressive range. Conclusion: The evolution of MFE highlights a tension between communicative efficiency and linguistic richness, raises challenges for conversational interface design and for teaching multilingual users, and calls for fuller methodological exposition and empirical validation. Abstract: Machine-Facing English (MFE) is an emergent register shaped by the adaptation of everyday language to the expanding presence of AI interlocutors. Drawing on register theory (Halliday 1985, 2006), enregisterment (Agha 2003), audience design (Bell 1984), and interactional pragmatics (Giles & Ogay 2007), this study traces how sustained human-AI interaction normalizes syntactic rigidity, pragmatic simplification, and hyper-explicit phrasing - features that enhance machine parseability at the expense of natural fluency. Our analysis is grounded in qualitative observations from bilingual (Korean/English) voice- and text-based product testing sessions, with reflexive drafting conducted using Natural Language Declarative Prompting (NLD-P) under human curation. Thematic analysis identifies five recurrent traits - redundant clarity, directive syntax, controlled vocabulary, flattened prosody, and single-intent structuring - that improve execution accuracy but compress expressive range. MFE's evolution highlights a persistent tension between communicative efficiency and linguistic richness, raising design challenges for conversational interfaces and pedagogical considerations for multilingual users. We conclude by underscoring the need for comprehensive methodological exposition and future empirical validation.

Longyin Zhang,Bowei Zou,Ai Ti Aw

Main category: cs.CL

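TL;DR: The paper proposes CAT-G, a method based on multilingual large language models that generates aspect terms from social media comments to improve downstream NLP tasks, and contributes the first multilingual test set for the task.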

Details

Motivation: The free and varied language of social media comments makes NLP tasks such as comment clustering, summarization, and opinion analysis difficult, so finer-grained identification and generation of aspect terms is needed. Method: Multilingual large language models are fine-tuned with supervision for comment aspect term generation (CAT-G), and DPO aligns the model's predictions with human expectations. Result: The method improves the comprehension of social media discourse on two NLP tasks, and the authors contribute a multilingual test set covering English, Chinese, Malay, and Bahasa Indonesian. Conclusion: CAT-G handles the diversity of social media comments effectively, and the test set provides a basis for comparing performance across languages. Abstract: The inherent nature of social media posts, characterized by the freedom of language use with a disjointed array of diverse opinions and topics, poses significant challenges to downstream NLP tasks such as comment clustering, comment summarization, and social media opinion analysis. To address this, we propose a granular level of identifying and generating aspect terms from individual comments to guide model attention. Specifically, we leverage multilingual large language models with supervised fine-tuning for comment aspect term generation (CAT-G), further aligning the model's predictions with human expectations through DPO. We demonstrate the effectiveness of our method in enhancing the comprehension of social media discourse on two NLP tasks. Moreover, this paper contributes the first multilingual CAT-G test set on English, Chinese, Malay, and Bahasa Indonesian. As LLM capabilities vary among languages, this test set allows for a comparative analysis of performance across languages with varying levels of LLM proficiency.

[186] EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models

Yuzhen Xiao,Jiahe Song,Yongxin Xu,Ruizhe Zhang,Yiqi Xiao,Xin Lu,Runchuan Zhu,Bowen Jiang,Junfeng Zhao

Main category: cs.CL

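TL;DR: EL4NER is an ensemble learning method that aggregates the ICL outputs of multiple open-source, small-parameter LLMs to improve NER performance while lowering deployment and inference costs.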

Details

Motivation: Existing ICL-based NER methods depend on large-parameter LLMs, which brings high compute requirements, API costs, and data privacy concerns. Method: The method combines a task decomposition pipeline, a span-level sentence similarity algorithm, and a self-validation mechanism. Result: EL4NER surpasses most closed-source, large-parameter LLM-based methods on multiple datasets and reaches SOTA performance among ICL-based methods on some of them. Conclusion: EL4NER demonstrates the efficiency and feasibility of small-parameter LLMs within the ICL paradigm. Abstract: In-Context Learning (ICL) technique based on Large Language Models (LLMs) has gained prominence in Named Entity Recognition (NER) tasks for its lower computing resource consumption, less manual labeling overhead, and stronger generalizability. Nevertheless, most ICL-based NER methods depend on large-parameter LLMs: the open-source models demand substantial computational resources for deployment and inference, while the closed-source ones incur high API costs, raise data-privacy concerns, and hinder community collaboration. To address this problem, we propose an Ensemble Learning Method for Named Entity Recognition (EL4NER), which aims at aggregating the ICL outputs of multiple open-source, small-parameter LLMs to enhance overall performance in NER tasks at less deployment and inference cost. Specifically, our method comprises three key components. First, we design a task decomposition-based pipeline that facilitates deep, multi-stage ensemble learning. Second, we introduce a novel span-level sentence similarity algorithm to establish an ICL demonstration retrieval mechanism better suited for NER tasks. Third, we incorporate a self-validation mechanism to mitigate the noise introduced during the ensemble process. We evaluated EL4NER on multiple widely adopted NER datasets from diverse domains. Our experimental results indicate that EL4NER surpasses most closed-source, large-parameter LLM-based methods at a lower parameter cost and even attains state-of-the-art (SOTA) performance among ICL-based methods on certain datasets. These results show the parameter efficiency of EL4NER and underscore the feasibility of employing open-source, small-parameter LLMs within the ICL paradigm for NER tasks.

[187] Query Routing for Retrieval-Augmented Language Models

Jiarui Zhang,Xiangyu Liu,Yong Hu,Chaoyue Niu,Fan Wu,Guihai Chen

Main category: cs.CL

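TL;DR: RAGRouter is a new routing mechanism that accounts for both the retrieved documents and the capabilities of each LLM, substantially improving model selection in RAG settings.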

Details

Motivation: Existing routing methods perform poorly in RAG settings because they rely on static knowledge representations, whereas retrieved documents dynamically change an LLM's ability to answer. Method: RAGRouter uses document embeddings and RAG capability embeddings with contrastive learning to capture shifts in knowledge representation and make informed routing decisions. Result: Experiments show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. Conclusion: RAGRouter's RAG-aware routing design significantly improves both performance and efficiency on RAG tasks. Abstract: Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs' ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints.

[188] Self-Correcting Code Generation Using Small Language Models

Jeonghun Cho,Deokhyung Kang,Hyounghun Kim,Gary Geunbae Lee

Main category: cs.CL

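TL;DR: The study finds that small models perform poorly at self-correcting code generation and proposes CoCoS, which improves this ability through online reinforcement learning; experiments show substantial gains.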

Details

Motivation: To investigate whether small models can effectively correct their code through self-reflection. Method: The authors propose CoCoS, which uses an online reinforcement learning objective with an accumulated reward function and a fine-grained reward. Result: With 1B-scale models, CoCoS improves over the baselines by 35.8% on MBPP and 27.7% on HumanEval. Conclusion: CoCoS effectively improves the multi-turn code correction ability of small models. Abstract: Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage their strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This facilitates the model in enhancing initial response quality while achieving substantial improvements through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on the MBPP and 27.7% on HumanEval compared to the baselines.

Hongcheng Guo,Zheyong Xie,Shaosheng Cao,Boyang Wang,Weiting Liu,Anjie Le,Lei Li,Zhoujun Li

Main category: cs.CL

TL;DR: SNS-Bench-VL is a multimodal benchmark for evaluating vision-language large models in social media scenarios, covering 8 tasks and 4,001 question-answer pairs.

Details

Motivation: As visual and textual content merge on social media, evaluating the capabilities of multimodal large models is essential for improving user experience and platform intelligence; existing benchmarks focus mainly on text tasks and lack coverage of multimodal scenarios. Method: The authors introduce the SNS-Bench-VL benchmark, with 8 multimodal tasks and 4,001 question-answer pairs, and evaluate over 25 state-of-the-art multimodal large models. Result: The findings show that multimodal social context comprehension remains challenging. Conclusion: SNS-Bench-VL aims to encourage future research on robust, context-aware, and human-aligned multimodal intelligence. Abstract: With the increasing integration of visual and textual content in Social Networking Services (SNS), evaluating the multimodal capabilities of Large Language Models (LLMs) is crucial for enhancing user experience, content understanding, and platform intelligence. Existing benchmarks primarily focus on text-centric tasks, lacking coverage of the multimodal contexts prevalent in modern SNS ecosystems. In this paper, we introduce SNS-Bench-VL, a comprehensive multimodal benchmark designed to assess the performance of Vision-Language LLMs in real-world social media scenarios. SNS-Bench-VL incorporates images and text across 8 multimodal tasks, including note comprehension, user engagement analysis, information retrieval, and personalized recommendation. It comprises 4,001 carefully curated multimodal question-answer pairs, covering single-choice, multiple-choice, and open-ended tasks. We evaluate over 25 state-of-the-art multimodal LLMs, analyzing their performance across tasks. Our findings highlight persistent challenges in multimodal social context comprehension. We hope SNS-Bench-VL will inspire future research towards robust, context-aware, and human-aligned multimodal intelligence for next-generation social networking services.

[190] Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport

Yuu Jinnai

Main category: cs.CL

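TL;DR: The paper studies Minimum Bayes Risk (MBR) decoding for document-level text generation and proposes MBR-OT, which computes document utility with the Wasserstein distance; experiments show it outperforms standard MBR.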

Details

Motivation: Document-level text generation is harder than sentence-level generation because it requires understanding longer context, and the utility functions used in MBR decoding are mostly designed for sentences, which limits performance on document-level tasks. Method: The authors propose MBR-OT, which uses the Wasserstein distance with a sentence-level utility function to compute the utility of a document. Result: Experiments show that MBR-OT outperforms standard MBR on document-level machine translation, text simplification, and dense image captioning. Conclusion: By changing how utility is computed, MBR-OT significantly improves document-level text generation. Abstract: Document-level text generation tasks are known to be more difficult than sentence-level text generation tasks as they require the understanding of longer context to generate high-quality texts. In this paper, we investigate the adaptation of Minimum Bayes Risk (MBR) decoding for document-level text generation tasks. MBR decoding makes use of a utility function to estimate the output with the highest expected utility from a set of candidate outputs. Although MBR decoding is shown to be effective in a wide range of sentence-level text generation tasks, its performance on document-level text generation tasks is limited as many of the utility functions are designed for evaluating the utility of sentences. To this end, we propose MBR-OT, a variant of MBR decoding using Wasserstein distance to compute the utility of a document using a sentence-level utility function. The experimental results show that MBR-OT outperforms the standard MBR in document-level machine translation, text simplification, and dense image captioning tasks. Our code is available at https://github.com/jinnaiyuu/mbr-optimal-transport
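
The standard MBR step described in the abstract (pick the candidate with the highest expected utility against the other candidates) can be sketched as follows. This covers plain MBR only, not the optimal-transport variant: the unigram F1 utility is a stand-in chosen for this illustration, the candidate set is treated as the pseudo-reference set with uniform weights, and the function names and example sentences are hypothetical.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: F1 over whitespace tokens (a stand-in for a real metric)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates, utility):
    """Return the candidate with the highest average utility against all
    other candidates, which serve as pseudo-references."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat lay on the mat",
    "dogs bark loudly",
]
print(mbr_decode(candidates, unigram_f1))
```

The `utility` argument is the part a document-level method would replace, for example with a score that aligns the sentences of two documents before aggregating sentence-level utilities.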

[191] Generating Diverse Training Samples for Relation Extraction with Large Language Models

Zexuan Li,Hongliang Dai,Piji Li

Main category: cs.CL

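TL;DR: The paper examines how to use large language models (LLMs) to generate diverse and correct training data for relation extraction (RE). It proposes two approaches: prompting for diverse samples directly through In-Context Learning (ICL), and fine-tuning LLMs with Direct Preference Optimization (DPO). Experiments show that both improve the quality of the generated data, and that a non-LLM model trained on the generated data can outperform using the LLM directly for RE.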

Details

Motivation: RE training samples generated by prompting LLMs directly are structurally similar and use a narrow range of phrasing, so ways to improve both diversity and correctness are needed. Method: (1) Generate diverse samples directly through ICL prompts; (2) fine-tune LLMs with DPO to generate diverse training samples. Result: Experiments show that both approaches improve the quality of the generated data, and that a non-LLM model trained on the generated data can outperform using the LLM directly for RE. Conclusion: Improving the diversity and correctness of LLM-generated training data effectively improves relation extraction, and non-LLM models trained on such data can perform better than the LLM itself. Abstract: Using Large Language Models (LLMs) to generate training data can potentially be a preferable way to improve zero or few-shot NLP tasks. However, many problems remain to be investigated for this direction. For the task of Relation Extraction (RE), we find that samples generated by directly prompting LLMs may easily have high structural similarities with each other. They tend to use a limited variety of phrasing while expressing the relation between a pair of entities. Therefore, in this paper, we study how to effectively improve the diversity of the training samples generated with LLMs for RE, while also maintaining their correctness. We first try to make the LLMs produce dissimilar samples by directly giving instructions in In-Context Learning (ICL) prompts. Then, we propose an approach to fine-tune LLMs for diverse training sample generation through Direct Preference Optimization (DPO). Our experiments on commonly used RE datasets show that both attempts can improve the quality of the generated training data. We also find that, compared with directly performing RE with an LLM, training a non-LLM RE model with its generated samples may lead to better performance.

[192] Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data

Seohyeong Lee,Eunwon Kim,Hwaran Lee,Buru Chang

Main category: cs.CL

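TL;DR: Alignment Data Map uses GPT-4o to analyze preference data; training on only 33% of the data, chosen as high quality, matches or exceeds the performance of the full dataset.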

Details

Motivation: Collecting human preference data is expensive and inefficient, so data collection needs to become more efficient. Method: GPT-4o serves as a proxy to compute alignment scores, and an Alignment Data Map is built from their mean and variance. Result: Experiments show that using only 33% of the data, chosen as high quality, achieves performance comparable to or better than the full dataset. Conclusion: Alignment Data Map significantly improves data collection efficiency and can diagnose low-quality samples in a dataset. Abstract: Human preference data plays a critical role in aligning large language models (LLMs) with human values. However, collecting such data is often expensive and inefficient, posing a significant scalability challenge. To address this, we introduce Alignment Data Map, a GPT-4o-assisted tool for analyzing and diagnosing preference data. Using GPT-4o as a proxy for LLM alignment, we compute alignment scores for LLM-generated responses to instructions from existing preference datasets. These scores are then used to construct an Alignment Data Map based on their mean and variance. Our experiments show that using only 33 percent of the data, specifically samples in the high-mean, low-variance region, achieves performance comparable to or better than using the entire dataset. This finding suggests that the Alignment Data Map can significantly improve data collection efficiency by identifying high-quality samples for LLM alignment without requiring explicit annotations. Moreover, the Alignment Data Map can diagnose existing preference datasets. Our analysis shows that it effectively detects low-impact or potentially misannotated samples. Source code is available online.
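
The mean-and-variance map can be sketched as follows. This is a rough illustration, not the authors' tool: it assumes each instruction already has several alignment scores (in the paper these come from GPT-4o; here they are made-up numbers), and the thresholds `min_mean` and `max_var`, the score scale, and the function names are hypothetical.

```python
from statistics import mean, pvariance


def build_data_map(scores_by_sample):
    """Map each sample id to (mean, variance) of its alignment scores."""
    return {sid: (mean(s), pvariance(s)) for sid, s in scores_by_sample.items()}


def select_high_mean_low_variance(data_map, min_mean, max_var):
    """Keep sample ids in the high-mean, low-variance region of the map."""
    return sorted(
        sid for sid, (m, v) in data_map.items() if m >= min_mean and v <= max_var
    )


# Hypothetical scores for responses to three instructions.
scores = {
    "a": [9, 9, 8],   # consistently high: kept
    "b": [9, 2, 5],   # unstable: dropped
    "c": [3, 3, 2],   # consistently low: dropped
}
data_map = build_data_map(scores)
selected = select_high_mean_low_variance(data_map, min_mean=7, max_var=1)
print(selected)
```

In practice the thresholds would be tuned so that the retained region covers the desired share of the dataset, such as the 33 percent reported above.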

[193] Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios

Linjie Mu,Zhongzhen Huang,Yakun Zhu,Xiangyu Zhao,Shaoting Zhang,Xiaofan Zhang

Main category: cs.CL

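TL;DR: The paper proposes MedE², a two-stage post-training method that uses text-only and multimodal data to strengthen multimodal reasoning in the medical domain.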

Details

Motivation: Multimodal reasoning models have succeeded in mathematics and science, but their use in medical domains remains underexplored. Method: MedE² has two stages: Stage-I fine-tunes the model on 2,000 text-only samples to elicit reasoning behaviors; Stage-II further improves reasoning with 1,500 multimodal medical cases. Result: Experiments show that MedE² significantly improves the reasoning performance of medical multimodal models and outperforms baselines on multiple benchmarks. Conclusion: MedE² is efficient and practical for improving medical multimodal reasoning, and remains robust on larger models and under inference-time scaling. Abstract: Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose MedE², a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE² in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE² consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.

Yiming Lei,Zhizheng Yang,Zeming Liu,Haitao Leng,Shaoguo Liu,Tingting Gao,Qingjie Liu,Yunhong Wang

Main category: cs.CL

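TL;DR: The paper proposes ContextQFormer, a context modeling module that improves multi-modal large language models in multi-turn dialogue, and builds TMDialog, a new multi-turn multi-modal dialogue dataset. Experiments show that ContextQFormer improves the available rate by 2%-4% over baselines.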

Details

Motivation: Existing open-source multi-modal models are weak at multi-turn interaction, especially with long contexts, and need improvement. Method: The authors introduce the ContextQFormer module, which uses a memory block to strengthen the representation of contextual information, and build the TMDialog dataset for pre-training, instruction-tuning, and evaluation. Result: On TMDialog, ContextQFormer improves the available rate by 2%-4% over baselines. Conclusion: ContextQFormer and TMDialog provide effective tools for research on multi-turn multi-modal dialogue and improve model performance. Abstract: Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, the existing open-source multi-modal models suffer from the weak capability of multi-turn interaction, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports the research of multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.

[195] PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics

Atharva Naik,Darsh Agrawal,Manav Kapadnis,Yuwei An,Yash Mathur,Carolyn Rose,David Mortensen

Main category: cs.CL

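TL;DR: The paper examines how long chain-of-thought (LCoT) large language models (LLMs) perform on an inductive reasoning problem inspired by historical linguistics and finds their ability limited.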

Details

Motivation: To test whether the abstract reasoning abilities of LCoT LLMs carry over to problems of practical importance, in particular inductive reasoning in historical linguistics. Method: The authors develop a fully automated pipeline that dynamically generates benchmarks with controllable difficulty, addressing the scalability and contamination issues of existing reasoning benchmarks. Result: On the generated test set, the best model (Claude-3.7-Sonnet) reaches only a 54% pass rate, showing that LCoT LLMs still fall short on this kind of reasoning. Conclusion: The reasoning ability of LCoT LLMs in areas such as historical linguistics is not yet at a practical level and needs further improvement. Abstract: Recently, long chain-of-thought (LCoT) Large Language Models (LLMs) have taken the machine learning world by storm with their breathtaking reasoning capabilities. However, are the abstract reasoning abilities of these models general enough for problems of practical importance? Unlike past work, which has focused mainly on math, coding, and data wrangling, we focus on a historical linguistics-inspired inductive reasoning problem, formulated as Programming by Examples. We develop a fully automated pipeline for dynamically generating a benchmark for this task with controllable difficulty in order to tackle scalability and contamination issues to which many reasoning benchmarks are subject. Using our pipeline, we generate a test set with nearly 1k instances that is challenging for all state-of-the-art reasoning LLMs, with the best model (Claude-3.7-Sonnet) achieving a mere 54% pass rate, demonstrating that LCoT LLMs still struggle with a class of reasoning that is ubiquitous in historical linguistics as well as many other domains.

[196] Enhancing Large Language Models' Machine Translation via Dynamic Focus Anchoring

Qiuyu Ding,Zhiqiang Cao,Hailong Cao,Tiejun Zhao

Main category: cs.CL

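TL;DR: The paper proposes a simple and effective method that improves the machine translation ability of large language models (LLMs) by dynamically identifying and handling context-sensitive units (CSUs), with no additional training.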

Details

Motivation: LLMs struggle to translate context-sensitive units such as polysemous words, which can cause translation failures or degrade comprehension. Method: Translation difficulties are analyzed and identified dynamically, and the CSUs are then supplied to the LLM in a structured manner to avoid mistranslations caused by information flattening. Result: The method achieves competitive performance on a machine translation benchmark, works across multiple language pairs, and requires no additional training. Conclusion: The method improves the translation accuracy and robustness of LLMs with minimal resource consumption. Abstract: Large language models have demonstrated exceptional performance across multiple crosslingual NLP tasks, including machine translation (MT). However, persistent challenges remain in addressing context-sensitive units (CSUs), such as polysemous words. These CSUs not only affect the local translation accuracy of LLMs, but also affect LLMs' understanding capability for sentences and tasks, and even lead to translation failure. To address this problem, we propose a simple but effective method to enhance LLMs' MT capabilities by acquiring CSUs and applying semantic focus. Specifically, we dynamically analyze and identify translation challenges, then incorporate them into LLMs in a structured manner to mitigate mistranslations or misunderstandings of CSUs caused by information flattening. This efficiently activates LLMs to identify and apply relevant knowledge from their vast data pool, ensuring more accurate translations of difficult terms. On a benchmark dataset of MT, our proposed method achieved competitive performance compared to multiple existing open-sourced MT baseline models. It demonstrates effectiveness and robustness across multiple language pairs, including both similar language pairs and distant language pairs. Notably, the proposed method requires no additional model training and enhances LLMs' performance across multiple NLP tasks with minimal resource consumption.

[197] Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models

Qiuyu Ding,Zhiqiang Cao,Hailong Cao,Tiejun Zhao

Main category: cs.CL

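TL;DR: Proposes a new bilingual lexicon induction task that extracts domain-specific bilingual dictionaries from monolingual corpora of the general and target domains, and uses pretrained models to improve word embedding quality.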

Details

Motivation: Traditional bilingual lexicon induction methods perform poorly in specialized domains, where data is scarce, word frequencies are low, and static word embeddings cannot fully capture the influence of context. Method: Pretrained models are used to improve the word embeddings, and the Code Switch strategy is introduced to the cross-domain bilingual lexicon induction task for the first time. Result: Experiments show an average improvement of 0.78 points over strong BLI baselines on three specific domains. Conclusion: The method improves bilingual lexicon induction in specialized domains, particularly where data is scarce and word meaning is context-sensitive. Abstract: Bilingual Lexicon Induction (BLI) is generally based on common domain data to obtain monolingual word embeddings, which are then aligned to obtain the cross-lingual embeddings used to extract word translation pairs. However, it is yet to be seen whether these methods are suitable for bilingual lexicon extraction in professional fields. As we can see in Table 1, the classic and efficient BLI approaches, Muse and Vecmap, perform much worse on the Medical dataset than on the Wiki dataset. On the one hand, specialized domain datasets are generally smaller than generic domain datasets, and specialized words have a lower frequency, which directly affects the translation quality of bilingual dictionaries. On the other hand, static word embeddings are widely used for BLI, but in some specific fields the meaning of words is greatly influenced by context, so using only static word embeddings may lead to greater bias. In this paper, we propose a new task of BLI, which is to use the monolingual corpus of the general domain and target domain to extract domain-specific bilingual dictionaries. Motivated by the ability of pre-trained models, we propose a method to get better word embeddings that builds on recent work on BLI. We are the first to introduce Code Switch (Qin et al., 2020) in the cross-domain BLI task, which can match different strategies in different contexts, making the model more suitable for this task. Experimental results show that our method improves performance over robust BLI baselines on three specific domains by 0.78 points on average.
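
The generic BLI retrieval step described in the abstract (aligning embeddings, then reading off translation pairs) can be sketched as a nearest-neighbour search by cosine similarity. This is a minimal sketch that assumes the two embedding sets have already been mapped into a shared space; the toy vectors, the word lists, and the function names are invented for illustration and do not reproduce the paper's method.

```python
import math
from typing import Dict, List

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def induce_lexicon(src_emb: Dict[str, List[float]],
                   tgt_emb: Dict[str, List[float]]) -> Dict[str, str]:
    """Map each source word to its most similar target word."""
    return {
        s: max(tgt_emb, key=lambda t: cosine(vec, tgt_emb[t]))
        for s, vec in src_emb.items()
    }

# Toy embeddings, assumed to be aligned already.
src = {"heart": [0.9, 0.1], "lung": [0.1, 0.9]}
tgt = {"coeur": [0.8, 0.2], "poumon": [0.2, 0.8]}
lexicon = induce_lexicon(src, tgt)
```

With low-frequency domain terms the embeddings themselves are noisier, which is why the quality of the vectors matters more than the retrieval rule.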

[198] Tell, Don't Show: Leveraging Language Models' Abstractive Retellings to Model Literary Themes

Li Lucy,Camilla Griffiths,Sarah Levine,Jennifer L. Eberhardt,Dorottya Demszky,David Bamman

Main category: cs.CL

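TL;DR: Retell is a topic modeling method for literary text: a generative language model translates narrative content into higher-level concepts, and LDA is then run on the result to obtain more precise topics.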

Details

Motivation: Conventional topic modeling methods such as LDA struggle with literary text, which favors sensory detail over abstract description. Retell is designed to address this. Method: A generative language model retells literary passages as higher-level concepts, and LDA is run on the retellings to extract topics. Result: Retell yields more precise and informative topics than running LDA alone or directly asking a language model to list topics. Conclusion: Retell offers a useful tool for literary and cultural analysis, especially for complex narrative content. Abstract: Conventional bag-of-words approaches for topic modeling, like latent Dirichlet allocation (LDA), struggle with literary text. Literature challenges lexical methods because narrative language focuses on immersive sensory details instead of abstractive description or exposition: writers are advised to "show, don't tell." We propose Retell, a simple, accessible topic modeling approach for literature. Here, we prompt resource-efficient, generative language models (LMs) to tell what passages show, thereby translating narratives' surface forms into higher-level concepts and themes. By running LDA on LMs' retellings of passages, we can obtain more precise and informative topics than by running LDA alone or by directly asking LMs to list topics. To investigate the potential of our method for cultural analytics, we compare our method's outputs to expert-guided annotations in a case study on racial/cultural identity in high school English language arts books.

[199] ZIPA: A family of efficient models for multilingual phone recognition

Jian Zhu,Farhan Samir,Eleanor Chodroff,David R. Mortensen

Main category: cs.CL

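TL;DR: ZIPA is a family of efficient speech models that improves crosslinguistic phone recognition through large-scale multilingual data and the efficient Zipformer architecture, though it remains limited in modeling sociophonetic diversity.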

Details

Motivation: To improve crosslinguistic phone recognition while addressing the shortcomings of existing systems in parameter efficiency and data scale. Method: The models are trained on IPAPack++, a large-scale multilingual speech corpus, using Zipformer backbones (ZIPA-T and ZIPA-CR variants), and the data is scaled further through noisy student training. Result: ZIPA outperforms existing phone recognition systems with fewer parameters, but modeling sociophonetic diversity remains a challenge. Conclusion: ZIPA makes clear progress on crosslinguistic phone recognition, and future work needs to address the modeling of sociophonetic diversity. Abstract: We present ZIPA, a family of efficient speech models that advances the state-of-the-art performance of crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. With the large-scale training data, the ZIPA models, including transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverage efficient Zipformer backbones and outperform existing phone recognition systems with far fewer parameters. Further scaling via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields further improvement. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.

[200] Map&Make: Schema Guided Text to Table Generation

Naman Ahuja,Fenil Bardoliya,Chitta Baral,Vivek Gupta

Main category: cs.CL

TL;DR: Proposes Map&Make, a new method for turning complex text into interpretable tables that substantially improves text-to-table generation.

Details

Motivation: Current methods fall short in extracting complex information and inferring data from text, so a more effective way to decompose text and extract its latent schema is needed. Method: Map&Make decomposes text into propositional atomic statements, extracts the latent schema, and uses it to populate tables that capture both qualitative and quantitative information. Result: Tests on the Rotowire and Livesum datasets show significant improvements, and hallucination errors in Rotowire are identified and corrected. Conclusion: Map&Make performs well on text-to-table generation, with better interpretability and practicality. Abstract: Transforming dense, detailed, unstructured text into an interpretable and summarised table, also colloquially known as Text-to-Table generation, is an essential task for information retrieval. Current methods, however, miss out on how and what complex information to extract; they also lack the ability to infer data from the text. In this paper, we introduce a versatile approach, Map&Make, which "dissects" text into propositional atomic statements. This facilitates granular decomposition to extract the latent schema. The schema is then used to populate the tables that capture the qualitative nuances and the quantitative facts in the original text. Our approach is tested against two challenging datasets, Rotowire, renowned for its complex and multi-table schema, and Livesum, which demands numerical aggregation. By carefully identifying and correcting hallucination errors in Rotowire, we aim to achieve a cleaner and more reliable benchmark. We evaluate our method rigorously on a comprehensive suite of comparative and referenceless metrics. Our findings demonstrate significant improvements across both datasets, with better interpretability in Text-to-Table generation. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to superior performance and validate the practicality of our framework in structured summarization tasks.

[201] Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification

Wenjing Xing,Wenke Lu,Yeheng Duan,Bing Zhao,Zhenghui kang,Yaolong Wang,Kai Gao,Lei Qiao

Main category: cs.CL

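TL;DR: Infinite-Instruct is an automated framework for synthesizing high-quality question-answer pairs to improve the code generation capabilities of large language models (LLMs). Reverse Construction and Backfeeding Construction strengthen the logic of the problems, and static code analysis ensures data quality. Experiments show substantial performance gains.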

Details

Motivation: Traditional methods for synthesizing code instruction data suffer from limited diversity and weak logic, so a more effective approach is needed to improve LLM code generation. Method: Reverse Construction turns code snippets into programming problems, Backfeeding Construction builds a knowledge graph from problem keywords to strengthen problem logic, and a cross-lingual static code analysis pipeline filters out invalid samples. Result: On mainstream code generation benchmarks, 7B and 32B parameter models improve by 21.70% and 36.95% respectively, and comparable performance is reached with less than one-tenth of the instruction fine-tuning data. Conclusion: Infinite-Instruct provides a scalable solution for LLM training in programming, and the experimental datasets are open-sourced. Abstract: Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, "Reverse Construction" transforms code snippets into diverse programming problems. Then, through "Backfeeding Construction," keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieved performance comparable to Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in the experiments, including both unfiltered versions and filtered versions via static analysis. The data are available at https://github.com/xingwenjing417/Infinite-Instruct-dataset
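
The static filtering step can be pictured with a much smaller stand-in: rejecting samples whose answer does not even parse. The sketch below uses Python's standard `ast` module and covers Python syntax only; the sample dictionary layout and function names are assumptions, and the paper's cross-lingual pipeline is presumably far more thorough.

```python
import ast
from typing import Dict, List

def is_valid_python(code: str) -> bool:
    """Return True if the code parses as Python source (no execution)."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True

def filter_samples(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only question-answer pairs whose answer is syntactically valid."""
    return [s for s in samples if is_valid_python(s["answer"])]

# Hypothetical samples for illustration.
samples = [
    {"question": "Add two numbers.",
     "answer": "def add(a, b):\n    return a + b\n"},
    {"question": "Broken example.",
     "answer": "def broken(:\n    pass\n"},
]
kept = filter_samples(samples)
```

Parsing without executing keeps the filter cheap and safe to run over large synthetic datasets, which is the main appeal of static checks over test execution.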

[202] Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

Gabriele Sarti,Vilém Zouhar,Malvina Nissim,Arianna Bisazza

Main category: cs.CL

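TL;DR: Investigates language model interpretability and uncertainty quantification as efficient ways to identify translation errors, as alternatives to costly conventional WQE techniques.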

Details

Motivation: Modern word-level quality estimation (WQE) techniques are expensive, requiring large amounts of human-labeled data or large language models, which motivates the search for more efficient methods. Method: Interpretability and uncertainty quantification techniques are used to identify errors from the internals of translation models. Result: An evaluation of 14 metrics across 12 translation directions shows the potential of unsupervised metrics, the shortcomings of supervised methods under label uncertainty, and the brittleness of single-annotator evaluation. Conclusion: Unsupervised metrics have considerable untapped potential, supervised methods need to cope better with label uncertainty, and single-annotator evaluation is unreliable. Abstract: Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
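
One simple unsupervised signal of the kind discussed here is the probability the translation model assigned to each output token. The sketch below flags tokens whose probability falls under a threshold; the threshold value, the toy log-probabilities, and the function name are assumptions for illustration, and this is not claimed to be one of the paper's 14 metrics.

```python
import math
from typing import List

def flag_uncertain_tokens(tokens: List[str],
                          logprobs: List[float],
                          threshold: float = 0.5) -> List[str]:
    """Return the tokens whose model probability is below the threshold."""
    return [
        tok for tok, lp in zip(tokens, logprobs)
        if math.exp(lp) < threshold
    ]

# Toy output with made-up token log-probabilities.
tokens = ["The", "cat", "sat"]
logprobs = [math.log(0.9), math.log(0.2), math.log(0.8)]
flagged = flag_uncertain_tokens(tokens, logprobs)
```

Because the signal comes from the translation model itself, it needs no labeled error spans, which is what makes such metrics cheap compared with supervised WQE.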

[203] Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration

Yilong Li,Chen Qian,Yu Xia,Ruijie Shi,Yufan Dang,Zihao Xie,Ziming You,Weize Chen,Cheng Yang,Weichuan Liu,Ye Tian,Xuantang Xiong,Lei Han,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

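TL;DR: The MAEL framework uses cross-task experiential learning to make multi-agent collaboration more efficient, reducing redundant computation and improving generalization.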

Details

Motivation: Existing approaches treat each task in isolation, which leads to redundant computation and limited generalization. Method: On a graph-structured multi-agent collaboration network, the quality of each task-solving step is quantified and stored as experience, and at inference time agents retrieve high-reward experiences as few-shot examples. Result: Experiments show that MAEL makes effective use of prior experience, converging faster and producing higher-quality solutions. Conclusion: Experience accumulation and cross-task learning improve the performance of multi-agent systems. Abstract: Large Language Model-based multi-agent systems (MAS) have shown remarkable progress in solving complex tasks through collaborative reasoning and inter-agent critique. However, existing approaches typically treat each task in isolation, resulting in redundant computations and limited generalization across structurally similar tasks. To address this, we introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation. We model the task-solving workflow on a graph-structured multi-agent collaboration network, where agents propagate information and coordinate via explicit connectivity. During the experiential learning phase, we quantify the quality for each step in the task-solving workflow and store the resulting rewards along with the corresponding inputs and outputs into each agent's individual experience pool. During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step, thereby enabling more accurate and efficient multi-agent collaboration. Experimental results on diverse datasets demonstrate that MAEL empowers agents to learn from prior task experiences effectively, achieving faster convergence and producing higher-quality solutions on current tasks.
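
The per-agent experience pool described above can be sketched as a small store of (input, output, reward) records with a retrieval step. The sketch is illustrative only: the class names, the reward cutoff, and the use of Jaccard word overlap as the relevance measure are assumptions, since the abstract does not specify how relevance is computed.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Experience:
    task_input: str
    output: str
    reward: float

@dataclass
class ExperiencePool:
    items: List[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        self.items.append(exp)

    def retrieve(self, query: str, k: int = 2,
                 min_reward: float = 0.5) -> List[Experience]:
        """Return up to k high-reward experiences most similar to the query."""
        q = set(query.lower().split())

        def overlap(e: Experience) -> float:
            w = set(e.task_input.lower().split())
            union = q | w
            return len(q & w) / len(union) if union else 0.0

        # Drop low-reward experiences, then rank by relevance and reward.
        candidates = [e for e in self.items if e.reward >= min_reward]
        ranked = sorted(candidates, key=lambda e: (overlap(e), e.reward),
                        reverse=True)
        return ranked[:k]

pool = ExperiencePool()
pool.add(Experience("sort a list of numbers", "use sorted()", 0.9))
pool.add(Experience("sort a list of names", "use sorted(key=str.lower)", 0.2))
pool.add(Experience("parse a date string", "use datetime.strptime", 0.8))
best = pool.retrieve("sort a list of integers", k=1)
```

The retrieved records would then be formatted as few-shot examples in the agent's prompt for the current reasoning step.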

[204] ExpeTrans: LLMs Are Experiential Transfer Learners

Jinglong Gao,Xiao Ding,Lingxiao Zou,Bibo Cai,Bing Qin,Ting Liu

Main category: cs.CL

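TL;DR: Proposes an autonomous experience transfer framework that lets large language models (LLMs) transfer experience from existing tasks to new ones on their own, reducing human and time costs while improving performance.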

Details

Motivation: Existing methods depend on substantial human labor or time to gather task-solving experience, which does not scale to the growing variety of LLM task types. Method: An autonomous experience transfer framework, modeled on human cognitive intelligence, transfers experience from source tasks to target tasks automatically. Result: Experiments on 13 datasets show that the framework effectively improves LLM performance. Conclusion: The framework lowers costs and offers a new path toward LLM generalization. Abstract: Recent studies provide large language models (LLMs) with textual task-solving experiences via prompts to improve their performance. However, previous methods rely on substantial human labor or time to gather such experiences for each task, which is impractical given the growing variety of task types in user queries to LLMs. To address this issue, we design an autonomous experience transfer framework to explore whether LLMs can mimic human cognitive intelligence to autonomously transfer experience from existing source tasks to newly encountered target tasks. This not only allows the acquisition of experience without the extensive costs of previous methods, but also offers a novel path for the generalization of LLMs. Experimental results on 13 datasets demonstrate that our framework effectively improves the performance of LLMs. Furthermore, we provide a detailed analysis of each module in the framework.

[205] MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration

Zhitao He,Sandeep Polisetty,Zhiyuan Fan,Yuchen Huang,Shujin Wu,Yi R. Fung

Main category: cs.CL

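TL;DR: MMBoundary improves the knowledge boundary awareness of MLLMs by calibrating confidence at each multimodal reasoning step, substantially reducing hallucination.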

Details

Motivation: Existing methods focus on the confidence of the overall response and neglect confidence at each reasoning step, allowing hallucinations to snowball. Method: Textual and cross-modal self-rewarding signals are combined to estimate confidence step by step, and confidence is calibrated through supervised fine-tuning followed by reinforcement learning. Result: MMBoundary significantly outperforms existing methods across datasets and metrics, with an average 7.5% reduction in calibration error and up to 8.3% improvement in task performance. Conclusion: Fine-grained confidence calibration improves the reasoning and self-correction abilities of MLLMs. Abstract: In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) advanced inferencing. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on this set of self-rewarded confidence estimation signals for an initial confidence expression warm-up, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge and calibrating confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average of 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.
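
To make "confidence calibration error" concrete, the sketch below computes the standard expected calibration error (ECE) over a set of per-step confidences and correctness labels. ECE is used here as a familiar stand-in; the abstract does not say which calibration metric the paper reports, and the toy values and function name are assumptions.

```python
from typing import List, Tuple

def expected_calibration_error(confidences: List[float],
                               correct: List[int],
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))

    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / total * abs(avg_conf - accuracy)
    return ece

# Toy reasoning steps: stated confidence and whether the step was right.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
```

A well-calibrated model has a small gap in every bin, so a step stated with 0.55 confidence should be right roughly 55% of the time.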

[206] MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration

Hao Lu,Yanchi Gu,Haoyuan Huang,Yulin Zhou,Ningxin Zhu,Chen Li

Main category: cs.CL

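TL;DR: MCTSr-Zero is an MCTS framework designed for open-ended dialogue that improves psychological counseling dialogue generation through "domain alignment" and broader exploration mechanisms.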

Details

Motivation: Existing MCTS methods can produce misaligned responses in open-ended dialogue, where strict correctness criteria are ill-defined and success depends on subjective factors. Method: "Domain alignment" shifts the search objective toward trajectories that conform to domain principles, and the "Regeneration" and "Meta-Prompt Adaptation" mechanisms broaden exploration. Result: The generated dialogue data is used to fine-tune PsyLLM, which achieves state-of-the-art performance on the PsyEval benchmark. Conclusion: MCTSr-Zero generates high-quality dialogue data and addresses the difficulty LLMs have in consistently adhering to complex psychological standards. Abstract: The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict "correctness" criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is "domain alignment", which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates "Regeneration" and "Meta-Prompt Adaptation" mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero's effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.

[207] ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering

Jingxuan Wei,Nan Xu,Junnan Zhu,Yanni Hao,Gaowei Wu,Bihui Yu,Lei Wang

Main category: cs.CL

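TL;DR: ChartMind is a new chart question answering (CQA) benchmark focused on complex tasks and multilingual support, bridging the gap between real-world applications and traditional academic evaluation. The proposed ChartLLM framework extracts key contextual elements and reduces noise, significantly improving the reasoning accuracy of multimodal large language models.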

Details

Motivation: Existing CQA evaluations rely on rigid output formats and objective metrics and overlook the complex demands of real-world chart analysis, so more flexible evaluation methods and models are needed. Method: The ChartMind benchmark covers seven task categories and supports multilingual contexts and open-domain textual outputs. The ChartLLM framework extracts contextual elements and reduces noise to improve model reasoning. Result: Evaluations on ChartMind and three public benchmarks show that ChartLLM significantly outperforms three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought. Conclusion: The study underlines the importance of flexible chart understanding for real-world CQA and points to new directions for more robust chart reasoning models. Abstract: Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.

[208] Automatic Construction of Multiple Classification Dimensions for Managing Approaches in Scientific Papers

Bing Ma,Hai Zhuge

Main category: cs.CL

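TL;DR: Proposes an efficient framework for managing the approaches in scientific papers along multiple dimensions: approaches are extracted with linguistic patterns, and a multi-dimensional approach space is built by clustering on tree-structure similarity, improving query efficiency and relevance.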

Details

Motivation: There is no efficient framework for querying and managing the approaches described in scientific papers, which makes it hard for researchers to find and use relevant approaches quickly. Method: Approach patterns are identified at four linguistic levels (semantic, discourse, syntactic, and lexical) and used to extract approaches; steps are represented as tree structures whose similarity is measured; and a bottom-up clustering algorithm builds the multi-dimensional approach space. Result: The resulting approach space ensures strong relevance between queries and results and rapidly narrows the search space through a class-based query mechanism. Conclusion: The multi-dimensional approach space is a workable solution for efficiently querying and managing scientific approaches. Abstract: Approaches form the foundation for conducting scientific research. Querying approaches from a vast body of scientific papers is extremely time-consuming, and without a well-organized management framework, researchers may face significant challenges in querying and utilizing relevant approaches. Constructing multiple dimensions on approaches and managing them from these dimensions can provide an efficient solution. Firstly, this paper identifies approach patterns in a top-down way, refining the patterns through four distinct linguistic levels: semantic level, discourse level, syntactic level, and lexical level. Approaches in scientific papers are extracted based on approach patterns. Additionally, five dimensions for categorizing approaches are identified using these patterns. This paper proposes using a tree structure to represent each step and measuring the similarity between different steps with a tree-structure-based similarity measure that focuses on syntactic-level similarities. A collection similarity measure is proposed to compute the similarity between approaches. A bottom-up clustering algorithm is proposed to construct class trees for approach components within each dimension by merging each approach component or class with its most similar approach component or class in each iteration. The class labels generated during the clustering process indicate the common semantics of the step components within the approach components in each class and are used to manage the approaches within the class. The class trees of the five dimensions collectively form a multi-dimensional approach space. The application of approach queries on the multi-dimensional approach space demonstrates that querying within this space ensures strong relevance between user queries and results and rapidly reduces search space through a class-based query mechanism.

[209] The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text

Maged S. Al-Shaibani,Moataz Ahmed

Main category: cs.CL

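TL;DR: A comprehensive study of Arabic machine-generated text across multiple generation strategies and model architectures, which reveals distinctive linguistic patterns in machine-generated text and develops effective BERT-based detection models.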

Details

Motivation: Large language models (LLMs) generate highly human-like text, which threatens information integrity, especially in low-resource languages such as Arabic. Method: The study covers several generation strategies (generation from the title only, content-aware generation, and text refinement) and model architectures (ALLaM, Jais, Llama, and GPT-4), combines them with stylometric analysis, and develops BERT-based detection models. Result: Machine-generated text carries detectable linguistic signatures, and the BERT models perform very well in formal contexts (up to 99.9% F1-score). Conclusion: The study provides a comprehensive analysis of Arabic machine-generated text and lays a foundation for linguistically informed detection systems. Abstract: Large Language Models (LLMs) have achieved unprecedented capabilities in generating human-like text, posing subtle yet significant challenges for information integrity across critical domains, including education, social media, and academia, enabling sophisticated misinformation campaigns, compromising healthcare guidance, and facilitating targeted propaganda. This challenge becomes severe, particularly in under-explored and low-resource languages like Arabic. This paper presents a comprehensive investigation of Arabic machine-generated text, examining multiple generation strategies (generation from the title only, content-aware generation, and text refinement) across diverse model architectures (ALLaM, Jais, Llama, and GPT-4) in academic and social media domains. Our stylometric analysis reveals distinctive linguistic patterns differentiating human-written from machine-generated Arabic text across these varied contexts. Despite their human-like qualities, we demonstrate that LLMs produce detectable signatures in their Arabic outputs, with domain-specific characteristics that vary significantly between different contexts. Based on these insights, we developed BERT-based detection models that achieved exceptional performance in formal contexts (up to 99.9% F1-score) with strong precision across model architectures. Our cross-domain analysis confirms generalization challenges previously reported in the literature. To the best of our knowledge, this work represents the most comprehensive investigation of Arabic machine-generated text to date, uniquely combining multiple prompt generation methods, diverse model architectures, and in-depth stylometric analysis across varied textual domains, establishing a foundation for developing robust, linguistically-informed detection systems essential for preserving information integrity in Arabic-language contexts.

[210] Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective

Yong Zhang,Yanwen Huang,Ning Cheng,Yang Guo,Yun Zhu,Yanmeng Wang,Shaojun Wang,Jing Xiao

Main category: cs.CL

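TL;DR: Sentinel is a lightweight sentence-level compression framework that filters context using attention signals, requires no dedicated compression model to be trained, and matches the performance of 7B-scale compression systems.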

Details

Motivation: In retrieval-augmented generation (RAG), retrieved passages are often lengthy, noisy, or over the input limit, while conventional compression methods are costly and hard to port. Method: Decoder attention from an off-the-shelf 0.5B proxy LLM is probed with a lightweight classifier to identify sentence relevance and compress the context. Result: On the LongBench benchmark, Sentinel achieves up to 5x compression while matching the QA performance of 7B-scale compression systems. Conclusion: Probing native attention signals enables fast, effective, and question-aware context compression. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5x compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: https://github.com/yzhangchuck/Sentinel.
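
Once each sentence has a relevance score, the compression step itself reduces to keeping the top-scoring sentences in their original order. The sketch below shows only that selection step; the scores are placeholders for what Sentinel derives from the proxy model's attention, and the function name and `keep_ratio` parameter are assumptions.

```python
from typing import List

def compress_context(sentences: List[str],
                     scores: List[float],
                     keep_ratio: float = 0.2) -> List[str]:
    """Keep the highest-scoring sentences, preserving document order."""
    k = max(1, round(len(sentences) * keep_ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

# Toy context with made-up relevance scores.
sentences = ["Intro.", "The capital is Paris.", "It lies on the Seine.",
             "Unrelated aside.", "Footer."]
scores = [0.10, 0.90, 0.30, 0.20, 0.05]
kept = compress_context(sentences, scores, keep_ratio=0.4)
```

A `keep_ratio` of 0.2 corresponds roughly to 5x compression when sentences are of similar length; restoring document order keeps the surviving text readable for the downstream model.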

[211] ScEdit: Script-based Assessment of Knowledge Editing

Xinye Li,Zunwen Zheng,Qian Zhang,Dekai Zhuang,Jiabao Kang,Liyan Xu,Qingbin Liu,Xi Chen,Zhiying Tu,Dianhui Chu,Dianbo Sui

Main category: cs.CL

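TL;DR: Proposes ScEdit, a new knowledge editing (KE) benchmark that combines counterfactual and temporal edits and extends traditional fact-based evaluation to action-based evaluation, finding that existing KE methods lose performance.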

Details

Motivation: Current KE tasks are too simple and are rarely validated in real-world application scenarios, so a more comprehensive evaluation framework is needed. Method: The ScEdit benchmark combines counterfactual and temporal edits and uses both token-level and text-level evaluation. Result: All KE methods show a drop in performance, and they struggle in particular on text-level metrics. Conclusion: ScEdit offers a more comprehensive evaluation of KE and exposes the limitations of existing methods. Abstract: Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark -- ScEdit (Script-based Knowledge Editing Benchmark) -- which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based ("What"-type question) evaluation to action-based ("How"-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at https://github.com/asdfo123/ScEdit.

[212] How Does Response Length Affect Long-Form Factuality

James Xu Zhao,Jimmy Z. J. Liu,Bryan Hooi,See-Kiong Ng

Main category: cs.CL

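TL;DR: Finds that when large language models (LLMs) generate long-form text, factual precision falls as response length grows, and that the main cause is facts exhaustion rather than the other hypotheses tested.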

Details

Motivation: To investigate the underexplored relationship between the length of LLM responses and their factuality. Method: An automatic, bi-level long-form factuality evaluation framework is introduced and used in controlled experiments to test for length bias. Result: Longer responses have lower factual precision, and the main cause is facts exhaustion. Conclusion: Facts exhaustion, rather than error propagation or long context, is the primary cause of factual degradation in long responses. Abstract: Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.
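
Factual precision is usually read as the share of a response's claims that are supported. The sketch below computes it from per-claim labels; the labels are toy values invented to show the shape of the length effect and are not results from the paper, whose bi-level framework is not reproduced here.

```python
from typing import List

def factual_precision(claim_supported: List[bool]) -> float:
    """Fraction of extracted claims judged to be supported."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

# Toy labels: the longer response adds claims that are less reliable.
short_response = [True, True, True, True]
long_response = short_response + [True, False, False, True]

short_precision = factual_precision(short_response)
long_precision = factual_precision(long_response)
```

In this toy case the early claims are identical and the drop comes entirely from the later ones, which is the pattern the facts exhaustion hypothesis predicts.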

[213] EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian

Daryna Dementieva,Nikolay Babakov,Alexander Fraser

Main category: cs.CL

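TL;DR: Introduces EmoBench-UA, the first annotated dataset for emotion classification in Ukrainian, and evaluates a range of approaches on it, highlighting the challenges of emotion classification in Ukrainian.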

Details

Motivation: Ukrainian NLP has no publicly available benchmark for emotion classification, and this work fills that gap. Method: A high-quality annotated dataset is created through crowdsourcing on the Toloka.ai platform, and linguistic-based baselines, synthetic data, and large language models are evaluated on it. Result: The findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and the need for more Ukrainian-specific models and resources. Conclusion: EmoBench-UA is the first benchmark for Ukrainian emotion classification, and further Ukrainian-specific solutions are needed. Abstract: While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts. Our annotation schema is adapted from the guidelines of previous English-centric work on emotion detection (Mohammad et al., 2018; Mohammad, 2022). The dataset was created through crowdsourcing using the Toloka.ai platform, ensuring a high-quality annotation process. Then, we evaluate a range of approaches on the collected dataset, from linguistic-based baselines and synthetic data translated from English to large language models (LLMs). Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and emphasize the need for further development of Ukrainian-specific models and training resources.

[214] Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs

Julia Belikova,Konstantin Polev,Rauf Parchiev,Dmitry Simakov

Main category: cs.CL

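TL;DR: Investigates how to reduce the training data needed for hallucination detection in large language models (LLMs) and retrieval-augmented generation (RAG) systems, proposing a combination of efficient classification algorithms and dimensionality reduction that matches strong baselines with only 250 training samples.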

Details

Motivation: Existing hallucination detection methods based on LLM hidden states depend on extensively annotated data, which limits scalability in practice. This paper targets the data annotation bottleneck. Method: Efficient classification algorithms are combined with dimensionality reduction to lower sample requirements, applied to two SOTA hallucination detection frameworks: Lookback Lens and probing-based approaches. Result: On standard question-answering RAG benchmarks, performance comparable to strong baselines is achieved with only 250 training samples. Conclusion: Lightweight, data-efficient methods are promising for industrial deployment, particularly where annotation is constrained. Abstract: Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly deployed in industry applications, yet their reliability remains hampered by challenges in detecting hallucinations. While supervised state-of-the-art (SOTA) methods that leverage LLM hidden states -- such as activation tracing and representation analysis -- show promise, their dependence on extensively annotated datasets limits scalability in real-world applications. This paper addresses the critical bottleneck of data annotation by investigating the feasibility of reducing training data requirements for two SOTA hallucination detection frameworks: Lookback Lens, which analyzes attention head dynamics, and probing-based approaches, which decode internal model representations. We propose a methodology combining efficient classification algorithms with dimensionality reduction techniques to minimize sample size demands while maintaining competitive performance. Evaluations on standardized question-answering RAG benchmarks show that our approach achieves performance comparable to strong proprietary LLM-based baselines with only 250 training samples. These results highlight the potential of lightweight, data-efficient paradigms for industrial deployment, particularly in annotation-constrained scenarios.

[215] Generalized Category Discovery in Event-Centric Contexts: Latent Pattern Mining with LLMs

Yi Luo,Qiwen Wang,Junqi Yang,Luyao Tang,Zhenghao Lin,Zhenzhe Ying,Weiqiang Wang,Chen Lin

Main category: cs.CL

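TL;DR: The paper introduces the EC-GCD task to address the shortcomings of textual GCD methods in realistic settings and proposes the PaMA framework, which uses LLMs to improve cluster-class alignment and fairness for minority classes.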

Details

Motivation: Existing textual GCD methods are insufficiently validated in realistic settings and perform poorly on long texts, complex narratives, and imbalanced classes. Method: The PaMA framework uses LLMs to extract event patterns that improve cluster-class alignment, and a ranking-filtering-mining pipeline balances prototype representation. Result: On EC-GCD benchmarks, PaMA outperforms existing methods with H-score gains of up to 12.58%, while maintaining strong generalization on base GCD datasets. Conclusion: PaMA addresses cluster-class alignment and class imbalance in EC-GCD, offering a practical solution for GCD tasks in realistic settings. Abstract: Generalized Category Discovery (GCD) aims to classify both known and novel categories using partially labeled data that contains only known classes. Despite achieving strong performance on existing benchmarks, current textual GCD methods lack sufficient validation in realistic settings. We introduce Event-Centric GCD (EC-GCD), characterized by long, complex narratives and highly imbalanced class distributions, posing two main challenges: (1) divergent clustering versus classification groupings caused by subjective criteria, and (2) unfair alignment for minority classes. To tackle these, we propose PaMA, a framework leveraging LLMs to extract and refine event patterns for improved cluster-class alignment. Additionally, a ranking-filtering-mining pipeline ensures balanced representation of prototypes across imbalanced categories. Evaluations on two EC-GCD benchmarks, including a newly constructed Scam Report dataset, demonstrate that PaMA outperforms prior methods with up to 12.58% H-score gains, while maintaining strong generalization on base GCD datasets.

[216] Enhancing Marker Scoring Accuracy through Ordinal Confidence Modelling in Educational Assessments

Abhirup Chakravarty,Mark Brenchley,Trevor Breakspear,Ian Lewin,Yan Huang

Main category: cs.CL

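TL;DR: The paper studies confidence modelling in Automated Essay Scoring (AES), framing score reliability prediction as a classification task and proposing new KWOCCE loss functions, which substantially improves the accuracy and reliability of released scores.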

Details

Motivation: AES scores are not always reliable, and scores should be released only when they meet high reliability standards. Method: Confidence estimation is treated as a classification task, using score binning and new KWOCCE loss functions that exploit the ordinal structure of CEFR levels. Result: The best model achieves an F1 score of 0.97 and allows 47% of scores to be released with 100% CEFR agreement and 99% with at least 95% agreement, compared with roughly 92% agreement for the standalone AES model. Conclusion: Confidence modelling markedly improves AES reliability and provides a stricter filter for score release. Abstract: A key ethical challenge in Automated Essay Scoring (AES) is ensuring that scores are only released when they meet high reliability standards. Confidence modelling addresses this by assigning a reliability estimate, in the form of a confidence score, to each automated score. In this study, we frame confidence estimation as a classification task: predicting whether an AES-generated score correctly places a candidate in the appropriate CEFR level. While this is a binary decision, we leverage the inherent granularity of the scoring domain in two ways. First, we reformulate the task as an n-ary classification problem using score binning. Second, we introduce a set of novel Kernel Weighted Ordinal Categorical Cross Entropy (KWOCCE) loss functions that incorporate the ordinal structure of CEFR labels. Our best-performing model achieves an F1 score of 0.97, and enables the system to release 47% of scores with 100% CEFR agreement and 99% with at least 95% CEFR agreement, compared to approximately 92% CEFR agreement from the standalone AES model where we release all AM predicted scores.
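The abstract does not give the KWOCCE formula, so the following is only a minimal sketch of one way an ordinal-aware cross entropy could look, assuming a Gaussian kernel over the distance between score bins. The function names, the kernel choice, and the `bandwidth` parameter are illustrative assumptions, not the authors' definition.

```python
import math

def kernel_soft_targets(true_idx, n_classes, bandwidth=1.0):
    """Soft targets from a Gaussian kernel over ordinal distance (assumed kernel)."""
    weights = [
        math.exp(-((k - true_idx) ** 2) / (2 * bandwidth ** 2))
        for k in range(n_classes)
    ]
    total = sum(weights)
    return [w / total for w in weights]  # normalised to sum to 1

def ordinal_cross_entropy(probs, true_idx, bandwidth=1.0):
    """Cross entropy against the kernel-smoothed targets instead of a one-hot label."""
    targets = kernel_soft_targets(true_idx, len(probs), bandwidth)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

# Two predictions that put the same probability (0.10) on the true bin 2.
near_miss = [0.05, 0.05, 0.10, 0.75, 0.05]  # most mass on the adjacent bin 3
far_miss = [0.75, 0.05, 0.10, 0.05, 0.05]   # most mass on the distant bin 0

loss_near = ordinal_cross_entropy(near_miss, true_idx=2)
loss_far = ordinal_cross_entropy(far_miss, true_idx=2)
```

A plain one-hot cross entropy would score both predictions identically, since each assigns 0.10 to the true bin. The kernel-weighted variant penalises the distant error more, which is the kind of ordinal sensitivity the abstract describes.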

[217] Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO

Kaiyang Guo,Yinchuan Li,Zhitang Chen

Main category: cs.CL

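TL;DR: The paper proposes PRO, an improved direct preference optimization method that resolves the likelihood underdetermination problem in existing contrastive alignment methods, and validates its advantages experimentally.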

Details

Motivation: Existing direct alignment methods such as DPO steer large language models toward relative preferences effectively, but they lower the absolute likelihood of example responses, so outputs drift from expected patterns and can even exhibit reward hacking. Method: The paper revisits the DPO loss, derives a decomposed reformulation, and introduces PRO, which resolves likelihood underdetermination by approximating the complete regularizer. Result: Experiments show that PRO outperforms existing methods under pairwise, binary, and scalar feedback. Conclusion: By resolving likelihood underdetermination, PRO offers a unified and efficient solution for alignment with diverse feedback types. Abstract: Direct alignment methods typically optimize large language models (LLMs) by contrasting the likelihoods of preferred versus dispreferred responses. While effective in steering LLMs to match relative preference, these methods are frequently noted for decreasing the absolute likelihoods of example responses. As a result, aligned models tend to generate outputs that deviate from the expected patterns, exhibiting a reward-hacking effect even without a reward model. This undesired consequence exposes a fundamental limitation in contrastive alignment, which we characterize as likelihood underdetermination. In this work, we revisit direct preference optimization (DPO) -- the seminal direct alignment method -- and demonstrate that its loss theoretically admits a decomposed reformulation. The reformulated loss not only broadens applicability to a wider range of feedback types, but also provides novel insights into the underlying cause of likelihood underdetermination. Specifically, the standard DPO implementation implicitly oversimplifies a regularizer in the reformulated loss, and reinstating its complete version effectively resolves the underdetermination issue. Leveraging these findings, we introduce PRoximalized PReference Optimization (PRO), a unified method to align with diverse feedback types, eliminating likelihood underdetermination through an efficient approximation of the complete regularizer. Comprehensive experiments show the superiority of PRO over existing methods in scenarios involving pairwise, binary and scalar feedback.
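The likelihood underdetermination described here can be seen directly in the standard pairwise DPO loss, which depends only on the difference between the preferred and dispreferred log-likelihood ratios. The sketch below uses the widely published DPO objective; the numeric log-probabilities are made-up illustrative values, and it does not implement PRO or its regularizer.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard pairwise DPO loss for one (preferred, dispreferred) pair.

    logp_* are sequence log-likelihoods under the policy, ref_logp_* under
    the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Illustrative values only: the policy prefers the chosen response by 2 nats.
base = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)

# Lower BOTH policy log-likelihoods by 5 nats: the contrast is unchanged.
shifted = dpo_loss(logp_w=-15.0, logp_l=-17.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```

Because `base` and `shifted` are equal, the loss alone cannot tell a policy that keeps the preferred response likely from one that makes both responses much less likely, which is the gap the abstract says a complete regularizer closes.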

[218] Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors

Harish Tayyar Madabushi,Melissa Torgbi,Claire Bonial

Main category: cs.CL

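TL;DR: This position paper argues for a middle-ground view of large language model (LLM) capabilities that avoids extreme positions: LLMs extract information from their training data through context-directed extrapolation, and their reasoning capabilities are predictable and controllable.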

Details

Motivation: To examine critically a realistic view of LLM capabilities, avoiding the extremes of treating LLMs as pure stochastic parrots or as possessing unpredictable "emergent" advanced reasoning. Method: The paper proposes a "context-directed extrapolation" mechanism and, drawing on existing literature, analyzes how LLMs extract information from training data and extrapolate from it. Result: LLM reasoning capabilities go well beyond stochastic parroting but are predictable and controllable, do not amount to human-like high-level cognition, and do not scale indefinitely with additional training. Conclusion: Future research should focus on context-directed extrapolation and its interaction with training data, and explore augmenting techniques that do not rely on inherent advanced reasoning in LLMs. Abstract: In this position paper we raise critical awareness of a realistic view of LLM capabilities that eschews extreme alternative views that LLMs are either "stochastic parrots" or in possession of "emergent" advanced reasoning capabilities, which, due to their unpredictable emergence, constitute an existential threat. Our middle-ground view is that LLMs extrapolate from priors from their training data, and that a mechanism akin to in-context learning enables the targeting of the appropriate information from which to extrapolate. We call this "context-directed extrapolation." Under this view, substantiated through existing literature, while reasoning capabilities go well beyond stochastic parroting, such capabilities are predictable, controllable, not indicative of advanced reasoning akin to high-level cognitive capabilities in humans, and not infinitely scalable with additional training. As a result, fears of uncontrollable emergence of agency are allayed, while research advances are appropriately refocused on the processes of context-directed extrapolation and how this interacts with training data to produce valuable capabilities in LLMs. Future work can therefore explore alternative augmenting techniques that do not rely on inherent advanced reasoning in LLMs.

[219] Discriminative Policy Optimization for Token-Level Reward Models

Hongzhan Chen,Tao Yang,Shiping Gao,Ruijun Chen,Xiaojun Quan,Hongtao Tian,Ting Yao

Main category: cs.CL

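TL;DR: Q-RM decouples reward modeling from language generation and derives a token-level reward model from a discriminative policy, substantially improving performance and training efficiency on complex reasoning tasks.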

Details

Motivation: To resolve the conflict between generative language modeling and reward modeling, which can cause instability and inaccurate credit assignment. Method: The Q-RM model learns token-level rewards from preference data by optimizing a discriminative policy (a Q-function), without fine-grained annotations. Result: Q-RM outperforms baseline methods on multiple benchmarks, improves Pass@1 on mathematical reasoning tasks, and converges up to 12 times faster in training. Conclusion: Q-RM is an efficient and stable approach to token-level reward modeling, suited to strengthening LLMs on complex tasks. Abstract: Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM.

[220] Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation

Beiduo Chen,Yang Janet Liu,Anna Korhonen,Barbara Plank

Main category: cs.CL

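TL;DR: This paper proposes an LLM-based pipeline that uses linguistically-grounded discourse segmenters to extract supporting and opposing statements from CoTs, combined with a rank-based HLV evaluation framework, and achieves better alignment with human annotations.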

Details

Motivation: The study aims to use LLM-generated CoTs to better understand human label variation and to find a more effective way of extracting and analyzing the rationales behind the labels. Method: An LLM pipeline with linguistically-grounded discourse segmenters extracts supporting and opposing statements from CoTs, and a rank-based HLV evaluation framework is designed. Result: The method outperforms a direct generation method and baselines on three datasets, and rank-based methods align better with humans. Conclusion: The method improves the ability of LLMs to capture human label variation and offers a new direction for future research. Abstract: The recent rise of reasoning-tuned Large Language Models (LLMs)--which generate chains of thought (CoTs) before giving the final answer--has attracted significant attention and offers new opportunities for gaining insights into human label variation, which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but typically adopt a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.
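To illustrate what prioritizing the ranking of answers over exact scores can mean, here is a minimal sketch using Kendall's tau as the rank measure. The choice of Kendall's tau and the example distributions are assumptions made for illustration; the abstract does not say which rank statistic the authors use.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical label distributions over three answer options.
human = [0.6, 0.3, 0.1]
model_a = [0.9, 0.08, 0.02]  # same ordering as humans, different magnitudes
model_b = [0.3, 0.6, 0.1]    # top two options swapped

tau_a = kendall_tau(human, model_a)
tau_b = kendall_tau(human, model_b)
```

Under this measure `model_a` agrees perfectly with the human ranking even though its probabilities differ substantially, while `model_b` is penalised for swapping the two most likely options.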

[221] Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models

Mingyu Yu,Wei Wang,Yanjie Wei,Sujuan Qin

Main category: cs.CL

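TL;DR: This paper studies jailbreaking attacks on large language models (LLMs) and proposes an adaptive jailbreaking framework based on the models' semantic understanding capabilities, markedly increasing the attack success rate.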

Details

Motivation: As LLMs are widely deployed, their built-in safety and ethical constraints have become attack targets, making it important to study how jailbreaking techniques circumvent them. Method: LLMs are divided into two types by semantic understanding capability, and a tailored jailbreaking strategy is designed for each type to exploit its weaknesses. Result: Experiments show that the adaptive strategy markedly raises the jailbreaking success rate, reaching 98.9% against GPT-4o (29 May 2025 release). Conclusion: Adaptive jailbreaking strategies effectively target the weaknesses of different LLMs and point to new research directions in AI security. Abstract: Adversarial attacks on Large Language Models (LLMs) via jailbreaking techniques (methods that circumvent their built-in safety and ethical constraints) have emerged as a critical challenge in AI security. These attacks compromise the reliability of LLMs by exploiting inherent weaknesses in their comprehension capabilities. This paper investigates the efficacy of jailbreaking strategies that are specifically adapted to the diverse levels of understanding exhibited by different LLMs. We propose the Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models, a novel framework that classifies LLMs into Type I and Type II categories according to their semantic comprehension abilities. For each category, we design tailored jailbreaking strategies aimed at leveraging their vulnerabilities to facilitate successful attacks. Extensive experiments conducted on multiple LLMs demonstrate that our adaptive strategy markedly improves the success rate of jailbreaking. Notably, our approach achieves an exceptional 98.9% success rate in jailbreaking GPT-4o (29 May 2025 release).

[222] From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs

Xuan Gong,Hanbo Huang,Shiyu Liang

Main category: cs.CL

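TL;DR: This paper studies how supervised fine-tuning data affects the factuality of large language models and finds that inference-time prompts can compensate for shortcomings in the fine-tuning data.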

Details

Motivation: To understand the mechanism by which fine-tuning data affects model factuality, in particular the gap between fine-tuning on known versus unknown knowledge. Method: Systematic experiments and a theoretical analysis based on knowledge graphs examine the interaction between fine-tuning data and inference-time prompts. Result: Inference-time prompts, such as few-shot examples and Chain of Thought, can substantially reduce the factuality gap. Conclusion: In-context learning can compensate for shortcomings in fine-tuning data, so its use in evaluating fine-tuning data selection methods should be reconsidered. Abstract: Factual knowledge extraction aims to explicitly extract knowledge parameterized in pre-trained language models for application in downstream tasks. While prior work has been investigating the impact of supervised fine-tuning data on the factuality of large language models (LLMs), its mechanism remains poorly understood. We revisit this impact through systematic experiments, with a particular focus on the factuality gap that arises when fine-tuning on known versus unknown knowledge. Our findings show that this gap can be mitigated at the inference stage, either under out-of-distribution (OOD) settings or by using appropriate in-context learning (ICL) prompts (i.e., few-shot learning and Chain of Thought (CoT)). We prove this phenomenon theoretically from the perspective of knowledge graphs, showing that the test-time prompt may diminish or even overshadow the impact of fine-tuning data and play a dominant role in knowledge extraction. Ultimately, our results shed light on the interaction between fine-tuning data and test-time prompt, demonstrating that ICL can effectively compensate for shortcomings in fine-tuning data, and highlighting the need to reconsider the use of ICL prompting as a means to evaluate the effectiveness of fine-tuning data selection methods.

[223] The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence

Marco Gaido,Sara Papi,Luisa Bentivogli,Alessio Brutti,Mauro Cettolo,Roberto Gretter,Marco Matassoni,Mohamed Nabih,Matteo Negri

Main category: cs.CL

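TL;DR: The paper examines learning rate (LR) warmup strategies in large-scale speech-to-text (S2T) training, shows that such training calls for a sub-exponential warmup, and analyzes the effect on initial convergence and final performance.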

Details

Motivation: Training large-scale models is challenging in both resources and convergence, especially with complex Transformer variants such as Conformer or Branchformer. The existing double linear LR warmup works in practice but has not been compared with alternatives, and its effect on final performance has not been studied. Method: The study compares LR warmup strategies, including double linear and sub-exponential warmup, and analyzes their effect on initial convergence speed and final model performance. Result: Large-scale S2T training requires a sub-exponential LR warmup; a higher LR during warmup accelerates initial convergence but does not improve final performance. Conclusion: Sub-exponential LR warmup is the better strategy for large-scale S2T training, and the warmup LR should be chosen with the trade-off between convergence speed and final performance in mind. Abstract: Training large-scale models presents challenges not only in terms of resource requirements but also in terms of their convergence. For this reason, the learning rate (LR) is often decreased when the size of a model is increased. Such a simple solution is not enough in the case of speech-to-text (S2T) trainings, where evolved and more complex variants of the Transformer architecture -- e.g., Conformer or Branchformer -- are used in light of their better performance. As a workaround, OWSM designed a double linear warmup of the LR, increasing it to a very small value in the first phase before updating it to a higher value in the second phase. While this solution worked well in practice, it was not compared with alternative solutions, nor was the impact on the final performance of different LR warmup schedules studied. This paper fills this gap, revealing that i) large-scale S2T trainings demand a sub-exponential LR warmup, and ii) a higher LR in the warmup phase accelerates initial convergence, but it does not boost final performance.
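The double linear warmup described in the abstract can be sketched as a piecewise-linear schedule: a first ramp up to a very small LR, then a second ramp up to the peak LR. The step counts and LR values below are placeholder assumptions, not the settings used by OWSM or by the paper, and the sketch simply holds the peak LR after warmup rather than modelling any decay phase.

```python
def double_linear_warmup(step, phase1_steps, phase2_steps, low_lr, peak_lr):
    """Piecewise-linear LR warmup with two phases (illustrative sketch)."""
    if step < phase1_steps:
        # Phase 1: ramp linearly from 0 up to the small intermediate LR.
        return low_lr * step / phase1_steps
    if step < phase1_steps + phase2_steps:
        # Phase 2: ramp linearly from the intermediate LR up to the peak LR.
        frac = (step - phase1_steps) / phase2_steps
        return low_lr + (peak_lr - low_lr) * frac
    # After warmup: hold the peak LR (a real schedule would usually decay).
    return peak_lr

# Placeholder hyperparameters chosen only for the example.
schedule = [
    double_linear_warmup(s, phase1_steps=100, phase2_steps=100,
                         low_lr=1e-5, peak_lr=1e-3)
    for s in range(0, 301, 50)
]
```

The schedule is continuous at the phase boundary, so the only free choices are the two phase lengths and the two LR levels. The sub-exponential schedule the paper argues for is not reproduced here, since the abstract does not give its form.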

[224] UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions

Chuanyuan Tan,Wenbiao Shao,Hao Xiong,Tong Zhu,Zhenhua Liu,Kai Shi,Wenliang Chen

Main category: cs.CL

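TL;DR: The paper introduces UAQFact, a bilingual dataset for evaluating how well LLMs use factual knowledge when handling unanswerable questions, and defines two new tasks. Experiments show that LLM performance is inconsistent even when the models hold the relevant knowledge, and that external knowledge helps only to a limited extent.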

Details

Motivation: Existing datasets lack factual knowledge support, so they cannot fully evaluate how LLMs use factual knowledge when handling unanswerable questions. Method: The bilingual dataset UAQFact is built from a knowledge graph, and two new tasks are defined to measure the use of internal and external factual knowledge, respectively. Result: LLMs perform inconsistently on UAQFact; even when they hold the relevant knowledge they cannot fully exploit it, and external knowledge improves performance only to a limited extent. Conclusion: UAQFact provides a new benchmark for evaluating how LLMs use factual knowledge and exposes their shortcomings, especially incomplete knowledge utilization. Abstract: Handling unanswerable questions (UAQ) is crucial for LLMs, as it helps prevent misleading responses in complex situations. While previous studies have built several datasets to assess LLMs' performance on UAQ, these datasets lack factual knowledge support, which limits the evaluation of LLMs' ability to utilize their factual knowledge when handling UAQ. To address the limitation, we introduce a new unanswerable question dataset UAQFact, a bilingual dataset with auxiliary factual knowledge created from a Knowledge Graph. Based on UAQFact, we further define two new tasks to measure LLMs' ability to utilize internal and external factual knowledge, respectively. Our experimental results across multiple LLM series show that UAQFact presents significant challenges, as LLMs do not consistently perform well even when they have factual knowledge stored. Additionally, we find that incorporating external knowledge may enhance performance, but LLMs still cannot make full use of the knowledge, which may result in incorrect responses.

[225] Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons

Krithik Vishwanath,Anton Alyakin,Mrigayu Ghosh,Jin Vivian Lee,Daniel Alexander Alber,Karl L. Sangwon,Douglas Kondziolka,Eric Karl Oermann

Main category: cs.CL

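TL;DR: The study evaluates 28 large language models on neurosurgery board exam questions and measures their fragility to distractor statements, finding that some models pass the exam but that distractions substantially reduce performance.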

Details

Motivation: To assess how large language models perform on neurosurgical knowledge tests and how robust they are to distracting information, as a basis for safe clinical deployment. Method: 28 models are tested on 2,904 neurosurgery exam questions, and a distraction framework is introduced to measure the resulting performance drop. Result: 6 models pass the exam, but distractions significantly reduce accuracy, with open-source models declining more than proprietary ones. Conclusion: Large language models perform well on neurosurgery exam questions but are sensitive to distracting information, so strategies for improving robustness are needed. Abstract: The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in non-clinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. 6 of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced (by as much as 20.4%), with one model failing that had previously passed. Both general-purpose and medical open-source models experienced greater performance declines compared to proprietary variants when subjected to the added distractors. While current LLMs demonstrate an impressive ability to answer neurosurgery board-like exam questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.

[226] Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt

Keqin Peng,Liang Ding,Yuanxin Ouyang,Meng Fang,Dacheng Tao

Main category: cs.CL

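TL;DR: The paper presents a quantitative analysis of overthinking in the Long Chain-of-Thought (Long CoT) reasoning of reasoning large language models (RLLMs) and proposes a simple, effective prompting method that reduces self-doubt and trims unnecessary reasoning steps.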

Details

Motivation: To study the overthinking that RLLMs show on complex tasks, especially the redundant reasoning steps caused by self-doubt, and to derive an improvement from a quantitative analysis. Method: A prompting method first asks the model to question the validity of the input question and then to answer concisely based on that assessment, reducing over-reliance on the input question. Result: On three mathematical reasoning tasks and four datasets with missing premises, the method substantially reduces answer length and reasoning steps and improves performance on most datasets. Conclusion: The method reduces self-doubt and overthinking in RLLMs and improves reasoning efficiency across tasks and datasets. Abstract: Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks, largely due to the adoption of Long Chain-of-Thought (Long CoT) reasoning. However, they often exhibit overthinking -- performing unnecessary reasoning steps even after arriving at the correct answer. Prior work has largely focused on qualitative analyses of overthinking through sample-based observations of long CoTs. In contrast, we present a quantitative analysis of overthinking from the perspective of self-doubt, characterized by excessive token usage devoted to re-verifying already-correct answers. We find that self-doubt significantly contributes to overthinking. In response, we introduce a simple and effective prompting method to reduce the model's over-reliance on input questions, thereby avoiding self-doubt. Specifically, we first prompt the model to question the validity of the input question, and then respond concisely based on the outcome of that evaluation. Experiments on three mathematical reasoning tasks and four datasets with missing premises demonstrate that our method substantially reduces answer length and yields significant improvements across nearly all datasets upon 4 widely-used RLLMs. Further analysis demonstrates that our method effectively minimizes the number of reasoning steps and reduces self-doubt.

[227] Spoken Language Modeling with Duration-Penalized Self-Supervised Units

Nicol Visser,Herman Kamper

Main category: cs.CL

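TL;DR: The paper studies how codebook size and unit coarseness (duration) affect spoken language model (SLM) performance, finding that coarser units help in sentence resynthesis but offer little benefit at the phone and word levels.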

Details

Motivation: To explore how codebook size and unit coarseness affect SLM performance, an interaction that has not been studied before. Method: The simple duration-penalized dynamic programming (DPDP) method is used to vary codebook size and unit coarseness, with analyses at different linguistic levels. Result: At the phone and word levels coarseness provides little benefit; in sentence resynthesis coarser units perform better; in lexical and syntactic tasks coarser units give higher accuracy at lower bitrates. Conclusion: Coarser units are not always better, but DPDP is a simple and efficient method for tasks that benefit from them. Abstract: Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.
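The idea of trading quantization error against unit duration can be sketched with a small dynamic program. The version below is a simplified illustration, not the authors' implementation: it uses scalar features instead of self-supervised vectors, and it charges a fixed cost each time the code changes, which is one common way to express a duration penalty. The function name, the toy data, and the `penalty` values are assumptions.

```python
def dpdp_segment(frames, codebook, penalty):
    """Assign one code per frame, minimising squared distance to the code
    plus a fixed cost for every change of code (Viterbi-style search)."""
    n_codes = len(codebook)
    dist = [[(f - c) ** 2 for c in codebook] for f in frames]
    cost = dist[0][:]   # best cost of ending in each code at frame 0
    back = []           # back[t-1][k] = best previous code for code k at frame t
    for t in range(1, len(frames)):
        best_prev = min(range(n_codes), key=lambda k: cost[k])
        new_cost, pointers = [], []
        for k in range(n_codes):
            stay = cost[k]                      # keep the same code
            switch = cost[best_prev] + penalty  # change code, pay the penalty
            if stay <= switch:
                new_cost.append(stay + dist[t][k])
                pointers.append(k)
            else:
                new_cost.append(switch + dist[t][k])
                pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path back from the last frame.
    k = min(range(n_codes), key=lambda j: cost[j])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    return path[::-1]

frames = [0.1, 0.0, 0.9, 0.2, 1.0, 1.1]  # toy 1-D "features"
codebook = [0.0, 1.0]

fine = dpdp_segment(frames, codebook, penalty=0.0)    # nearest code per frame
coarse = dpdp_segment(frames, codebook, penalty=0.5)  # fewer, longer units
```

With no penalty each frame takes its nearest code, giving four short units. With the penalty, the brief dip at the fourth frame is absorbed into the surrounding unit, leaving two longer units at the cost of a slightly larger quantization error.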

[228] Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Liangliang Zhang,Zhuorui Jiang,Hongliang Chi,Haoyang Chen,Mohammed Elkoumy,Fali Wang,Qiong Wu,Zhengyi Zhou,Shirui Pan,Suhang Wang,Yao Ma

Main category: cs.CL

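TL;DR: KGQAGen is an LLM-based framework for generating high-quality multi-hop reasoning QA benchmarks, addressing the quality problems of existing datasets.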

Details

Motivation: Existing KGQA datasets such as WebQSP and CWQ suffer from inaccurate annotations, ambiguous or unanswerable questions, and outdated knowledge, with an average factual correctness rate of only 57%. Method: KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce verifiable QA instances. Result: The KGQAGen-10k benchmark is constructed, and current state-of-the-art models struggle on it, underlining its difficulty. Conclusion: KGQAGen provides a scalable framework for KGQA evaluation and makes the case for more rigorous benchmark construction. Abstract: Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.

[229] CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification

Nawar Turk,Eeham Khan,Leila Kosseim

Main category: cs.CL

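TL;DR: This paper describes an approach to SemEval-2025 Task 6 (PromiseEval) that uses three model architectures to address the four promise verification subtasks; the combined model with multi-objective learning performs best.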

Details

Motivation: To verify promises in corporate ESG reports, covering four subtasks: promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Method: (1) ESG-BERT with task-specific classifier heads; (2) the same architecture with linguistic features tailored to each subtask; (3) a combined model with attention pooling, document metadata augmentation, and multi-objective learning. Result: On the ML-Promise dataset, the combined model scores 0.5268, above the baseline of 0.5227. Conclusion: Linguistic feature extraction, attention pooling, and multi-objective learning are effective for promise verification, although class imbalance and limited data remain challenges. Abstract: This paper presents our approach to the SemEval-2025 Task 6 (PromiseEval), which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. We explore three model architectures to address the four subtasks of promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Our first model utilizes ESG-BERT with task-specific classifier heads, while our second model enhances this architecture with linguistic features tailored for each subtask. Our third approach implements a combined subtask model with attention-based sequence pooling, transformer representations augmented with document metadata, and multi-objective learning. Experiments on the English portion of the ML-Promise dataset demonstrate progressive improvement across our models, with our combined subtask approach achieving a leaderboard score of 0.5268, outperforming the provided baseline of 0.5227. Our work highlights the effectiveness of linguistic feature extraction, attention pooling, and multi-objective learning in promise verification tasks, despite challenges posed by class imbalance and limited training data.
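Attention-based sequence pooling, mentioned as part of the third model, generally means replacing a fixed summary (such as the first token or a plain mean) with a softmax-weighted sum of token vectors. The sketch below shows that generic technique in plain Python with a single query vector; the toy vectors and the query are assumptions, and the paper's actual pooling layer may differ.

```python
import math

def attention_pool(token_vectors, query):
    """Pool a sequence of token vectors into one vector using softmax
    attention weights from dot products with a query vector."""
    scores = [sum(t * q for t, q in zip(vec, query)) for vec in token_vectors]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(token_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(dim)
    ]
    return pooled, weights

# Toy 2-D token representations and a query that favours the first dimension.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

pooled, weights = attention_pool(tokens, query)
```

In a trained model the query would be a learned parameter, so the network can decide which tokens matter for each subtask instead of weighting them equally.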

[230] Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

Yunqiao Yang,Houxing Ren,Zimu Lu,Ke Wang,Weikang Shi,Aojun Zhou,Junting Pan,Mingjie Zhan,Hongsheng Li

Main category: cs.CL

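TL;DR: The paper proposes PCPO, a framework that improves the mathematical reasoning of LLMs by combining surface-level answer correctness with intrinsic probability consistency.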

Details

Motivation: Current outcome-based methods neglect the internal logical coherence of responses, which PCPO aims to address. Method: PCPO introduces two quantitative metrics for preference selection: surface-level answer correctness and intrinsic probability consistency. Result: Experiments show that PCPO outperforms existing methods across a range of LLMs and benchmarks. Conclusion: Combining the two metrics improves mathematical reasoning, and the code is publicly available. Abstract: Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.

[231] Translation in the Wild

Yuri Balashov

Main category: cs.CL

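TL;DR: LLMs show strong translation abilities, but the source of those abilities remains unclear; it may be related to "incidental bilingualism" in the training data and to instruction tuning.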

Details

Motivation: To examine why LLMs translate so well even though they are not trained on any translation-specific objective. Method: The paper analyzes the roles of training data and instruction tuning, puts forward a "duality" hypothesis, and discusses how it could be tested. Result: LLMs' translation abilities may originate in two different types of pre-training data that are internalized in different ways. Conclusion: The study offers a new perspective for reconceptualizing translation, human and machine, in the age of deep learning. Abstract: Large Language Models (LLMs) excel in translation among other things, demonstrating competitive performance for many language pairs in zero- and few-shot settings. But unlike dedicated neural machine translation models, LLMs are not trained on any translation-related objective. What explains their remarkable translation abilities? Are these abilities grounded in "incidental bilingualism" (Briakou et al. 2023) in training data? Does instruction tuning contribute to it? Are LLMs capable of aligning and leveraging semantically identical or similar monolingual contents from different corners of the internet that are unlikely to fit in a single context window? I offer some reflections on this topic, informed by recent studies and growing user experience. My working hypothesis is that LLMs' translation abilities originate in two different types of pre-training data that may be internalized by the models in different ways. I discuss the prospects for testing the "duality" hypothesis empirically and its implications for reconceptualizing translation, human and machine, in the age of deep learning.

[232] Understanding Refusal in Language Models with Sparse Autoencoders

Wei Jie Yeo,Nirmalendu Prakash,Clement Neo,Roy Ka-Wei Lee,Erik Cambria,Ranjan Satapathy

Main category: cs.CL

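TL;DR: The paper uses sparse autoencoders to study the mechanism of refusal in instruction-tuned LLMs, identifying latent features associated with refusal and validating their influence on generation.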

Details

Motivation: To study the internal mechanisms of refusal in language models and so improve understanding of their safety behavior. Method: Sparse autoencoders are used to identify refusal-related latent features, and intervention experiments are run on two open-source chat models. Result: The refusal features are shown to influence generation, and they improve the generalization of classifiers to adversarial samples. Conclusion: The study reveals how refusal manifests at the activation level, sheds light on adversarial jailbreaking techniques, and releases its code. Abstract: Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationship and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial samples in classification tasks. We open source our code at https://github.com/wj210/refusal_sae.

[233] Evaluating AI capabilities in detecting conspiracy theories on YouTube

Leonardo La Rocca,Francesco Corso,Francesco Pierri

Main category: cs.CL

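TL;DR: The study explores using open-weight large language models (LLMs) to identify conspiracy theory videos on YouTube. Text-based models achieve high recall but low precision, multimodal models perform worse, and RoBERTa comes close to LLM performance in a real-world setting.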

Details

Motivation: As a leading global platform, YouTube is prone to spreading harmful content and needs effective detection methods. Method: A labeled dataset is used to evaluate the zero-shot performance of several LLMs, compared against a fine-tuned RoBERTa baseline. Result: Text-based LLMs achieve high recall but low precision, and multimodal models underperform; on unlabeled data RoBERTa comes close to LLM performance. Conclusion: Current LLM-based approaches show promise for harmful content detection, but more precise and robust systems are needed. Abstract: As a leading online platform with a vast global audience, YouTube's extensive reach also makes it susceptible to hosting harmful content, including disinformation and conspiracy theories. This study explores the use of open-weight Large Language Models (LLMs), both text-only and multimodal, for identifying conspiracy theory videos shared on YouTube. Leveraging a labeled dataset of thousands of videos, we evaluate a variety of LLMs in a zero-shot setting and compare their performance to a fine-tuned RoBERTa baseline. Results show that text-based LLMs achieve high recall but lower precision, leading to increased false positives. Multimodal models lag behind their text-only counterparts, indicating limited benefits from visual data integration. To assess real-world applicability, we evaluate the most accurate models on an unlabeled dataset, finding that RoBERTa achieves performance close to LLMs with a larger number of parameters. Our work highlights the strengths and limitations of current LLM-based approaches for online harmful content detection, emphasizing the need for more precise and robust systems.

[234] Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Guangtao Zeng,Maohao Shen,Delin Chen,Zhenting Qi,Subhro Das,Dan Gutfreund,David Cox,Gregory Wornell,Wei Lu,Zhang-Wei Hong,Chuang Gan

Main category: cs.CL

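TL;DR: EvoScale is a sample-efficient test-time scaling method that refines generated outputs through an evolutionary process, reducing the number of samples needed and letting a 32B model match models with over 100B parameters.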

Details

Motivation: Smaller language models perform poorly on real-world software engineering tasks, and supervised fine-tuning on high-quality data is expensive to curate. Method: Proposes Evolutionary Test-Time Scaling (EvoScale) and trains the model with reinforcement learning to self-evolve, reducing reliance on external verifiers. Result: On SWE-Bench-Verified, the 32B model Satori-SWE-32B matches or exceeds models with over 100B parameters. Conclusion: EvoScale offers an efficient, low-cost way to improve smaller models and has practical potential. Abstract: Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead from repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using a few samples. Code, data, and models will be fully open-sourced.
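
The "selection and mutation" loop described in the abstract can be sketched on a toy problem. This is not the authors' implementation: `score` stands in for a verifier or reward model, `mutate` stands in for a language model refining an earlier output, and the integer "patches" are purely hypothetical.

```python
import random


def evolve(score, mutate, initial, iterations=5, keep=2, children=3, seed=0):
    """Toy selection-and-mutation loop over candidate outputs.

    score:  hypothetical stand-in for a verifier or reward model
    mutate: hypothetical stand-in for an LM refining an earlier output
    """
    rng = random.Random(seed)
    population = list(initial)
    best_per_iteration = []
    for _ in range(iterations):
        # Selection: keep the highest-scoring candidates.
        parents = sorted(population, key=score, reverse=True)[:keep]
        # Mutation: each parent proposes a few refined variants.
        offspring = [mutate(p, rng) for p in parents for _ in range(children)]
        # Parents are carried over, so the best score never decreases.
        population = parents + offspring
        best_per_iteration.append(max(score(c) for c in population))
    return max(population, key=score), best_per_iteration


def toy_score(x):
    """Toy objective: a 'patch' is an integer and the ideal patch is 42."""
    return -abs(x - 42)


def toy_mutate(x, rng):
    """Toy refinement: nudge the candidate by a small random amount."""
    return x + rng.randint(-5, 5)


best, history = evolve(toy_score, toy_mutate, initial=[0, 10, 90])
print(best, history)
```

Each iteration samples around the current best candidates rather than drawing fresh independent samples, which is the intuition behind shifting the output distribution toward higher-scoring regions.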

[235] Table-R1: Inference-Time Scaling for Table Reasoning

Zheyuan Yang,Lyuhao Chen,Arman Cohan,Yilun Zhao

Main category: cs.CL

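TL;DR: This is the first study of inference-time scaling for table reasoning. It proposes two post-training strategies: distillation from frontier-model reasoning traces and reinforcement learning with verifiable rewards (RLVR). In experiments, the 7B-parameter Table-R1-Zero model matches GPT-4.1 and DeepSeek-R1 and generalizes strongly.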

Details

Motivation: To explore how inference-time scaling can improve table reasoning, in particular by letting a small model match much larger ones. Method: (1) Distills reasoning traces generated by DeepSeek-R1 into the Table-R1-SFT model; (2) trains the Table-R1-Zero model with task-specific verifiable reward functions and the GRPO algorithm. Result: Table-R1-Zero performs strongly across table reasoning tasks, matching or exceeding GPT-4.1 and DeepSeek-R1, and generalizes well to out-of-domain datasets. Conclusion: Ablation and qualitative analyses show the benefits of instruction tuning, model architecture choices, and cross-task generalization, and essential table reasoning skills emerge during RL training. Abstract: In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as emergence of essential table reasoning skills during RL training.
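
A "verifiable reward" for short-form table QA can be as simple as an automatic exact-match check. The sketch below is a guess at the general shape, not the paper's reward functions: the `Answer:` output convention and the normalization rules are assumptions made for illustration.

```python
import re


def normalize(text):
    """Lowercase, trim, and collapse whitespace so formatting does not matter."""
    return re.sub(r"\s+", " ", text.strip().lower())


def extract_answer(response):
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' output convention is an assumption made for this sketch.
    """
    matches = re.findall(r"answer:\s*(.+)", response, flags=re.IGNORECASE)
    return matches[-1] if matches else None


def verifiable_reward(response, gold):
    """Return 1.0 for an exact normalized match, else 0.0 (also if no answer)."""
    predicted = extract_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0


print(verifiable_reward("Row 3 has the largest population.\nAnswer: Paris", "paris"))
print(verifiable_reward("I think it is Paris.", "Paris"))
```

Because the reward is computed by a rule rather than a learned judge, it can be checked automatically for every sampled response during RL training. Fact verification and free-form QA would need different checks.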

[236] Characterizing the Expressivity of Transformer Language Models

Jiaoda Li,Ryan Cotterell

Main category: cs.CL

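TL;DR: This paper studies the expressive power of fixed-precision transformers, shows that it is exactly that of a specific fragment of linear temporal logic, and reports experiments consistent with the theory.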

Details

Motivation: Although transformers are empirically successful, their theoretical expressive power is only partially understood. This paper aims to characterize it exactly under assumptions closer to practice, such as fixed precision and soft attention. Method: Studies transformers with fixed precision, strict future masking, and soft attention, and compares their expressive power with a fragment of linear temporal logic that contains only the past operator. Result: These transformers are exactly as expressive as that fragment, which is further related to known classes in formal language theory, automata theory, and algebra. Experiments match the theory: models generalize perfectly over lengths on languages within their theoretical capacity and fail to generalize on languages beyond it. Conclusion: The paper provides a unified theoretical framework for transformer expressivity, exposes its limits, and supports the theory empirically. Abstract: Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. Prior work often relies on idealized models with assumptions -- such as arbitrary numerical precision and hard attention -- that diverge from real-world transformers. In this work, we provide an exact characterization of fixed-precision transformers with strict future masking and soft attention, an idealization that more closely mirrors practical implementations. We show that these models are precisely as expressive as a specific fragment of linear temporal logic that includes only a single temporal operator: the past operator. We further relate this logic to established classes in formal language theory, automata theory, and algebra, yielding a rich and unified theoretical framework for understanding transformer expressivity. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their theoretical capacity generalize perfectly over lengths, while they consistently fail to generalize on languages beyond it.
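
To give a feel for a logic whose only temporal operator looks into the past, the sketch below evaluates such formulas over a string. It rests on assumptions not stated in the abstract: the past operator is read as strict ("at some earlier position"), and a word is accepted when the formula holds at its last position. The example language, "every b is preceded by some a", is chosen here for illustration.

```python
def atom(word, symbol):
    """Truth value of 'the current symbol is `symbol`' at each position."""
    return [ch == symbol for ch in word]


def neg(xs):
    return [not x for x in xs]


def conj(xs, ys):
    return [x and y for x, y in zip(xs, ys)]


def past(xs):
    """P phi holds at position i iff phi held at some j < i (strict past, assumed)."""
    result, seen = [], False
    for x in xs:
        result.append(seen)
        seen = seen or x
    return result


def accepts(word):
    """Accept iff every 'b' has some 'a' strictly before it."""
    if not word:
        return True
    # A position is "bad" if it holds a 'b' with no earlier 'a'.
    bad = conj(atom(word, "b"), neg(past(atom(word, "a"))))
    # Evaluated at the last position: no bad position now or earlier.
    ok = conj(neg(bad), neg(past(bad)))
    return ok[-1]


print(accepts("aab"), accepts("ba"))
```

The whole property is built from Boolean connectives and the single `past` operator, and each formula is evaluated in one left-to-right scan.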

[237] AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

Jiaxin Bai,Wei Fan,Qi Hu,Qing Zong,Chunyang Li,Hong Ting Tsang,Hongyu Luo,Yauwai Yim,Haoyu Huang,Xiao Zhou,Feng Qin,Tianshi Zheng,Xi Peng,Xin Yao,Huiwen Yang,Leijie Wu,Yi Ji,Gong Zhang,Renhai Chen,Yangqiu Song

Main category: cs.CL

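TL;DR: AutoSchemaKG is a fully autonomous knowledge graph construction framework that needs no predefined schema. It uses large language models to extract knowledge triples from text and induce schemas, builds the large-scale ATLAS knowledge graphs, and outperforms existing methods.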

Details

Motivation: Traditional knowledge graph construction depends on predefined schemas, which limits flexibility and scalability. AutoSchemaKG aims to remove this limitation and make construction fully autonomous. Method: Uses large language models to extract knowledge triples and induce schemas from text, with conceptualization to organize instances. Processes over 50 million documents to build the ATLAS knowledge graphs. Result: Builds ATLAS, with 900+ million nodes and 5.9 billion edges, outperforms state-of-the-art baselines on multi-hop QA, and reaches 95% semantic alignment between induced and human-crafted schemas. Conclusion: AutoSchemaKG shows that large-scale knowledge graphs with dynamically induced schemas can effectively complement the parametric knowledge of large language models and improve factuality. Abstract: We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.

[238] GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns

Enzo Doyen,Amalia Todirascu

Main category: cs.CL

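TL;DR: GeNRe is the first French gender-neutral rewriting system; it addresses gender bias in French with a rule-based system and fine-tuned models.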

Details

Motivation: Text data used in NLP exhibits gender bias, notably through masculine generics, and automatic gender neutralization has so far been studied only for English. Method: Combines a rule-based system (RBS) with two language models fine-tuned on RBS-generated data, and explores instruct-based models to improve performance. Result: Claude 3 Opus combined with the authors' dictionary achieves results close to the RBS. Conclusion: GeNRe advances gender bias mitigation techniques for French NLP. Abstract: A significant portion of the textual data used in the field of Natural Language Processing (NLP) exhibits gender biases, particularly due to the use of masculine generics (masculine words that are supposed to refer to mixed groups of men and women), which can perpetuate and amplify stereotypes. Gender rewriting, an NLP task that involves automatically detecting and replacing gendered forms with neutral or opposite forms (e.g., from masculine to feminine), can be employed to mitigate these biases. While such systems have been developed in a number of languages (English, Arabic, Portuguese, German, French), automatic use of gender neutralization techniques (as opposed to inclusive or gender-switching techniques) has only been studied for English. This paper presents GeNRe, the very first French gender-neutral rewriting system using collective nouns, which are gender-fixed in French. We introduce a rule-based system (RBS) tailored for the French language alongside two fine-tuned language models trained on data generated by our RBS. We also explore the use of instruct-based models to enhance the performance of our other systems and find that Claude 3 Opus combined with our dictionary achieves results close to our RBS. Through this contribution, we hope to promote the advancement of gender bias mitigation techniques in NLP for French.

[239] Are Reasoning Models More Prone to Hallucination?

Zijun Yao,Yantao Liu,Yanxu Chen,Jianhui Chen,Junfeng Fang,Lei Hou,Juanzi Li,Tat-Seng Chua

Main category: cs.CL

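TL;DR: This paper examines hallucination in large reasoning models (LRMs) on fact-seeking tasks, finds that different post-training pipelines affect hallucination differently, and analyzes the cognitive behaviors and model uncertainty mechanisms involved.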

Details

Motivation: To determine whether the reasoning ability of LRMs makes them more prone to hallucination on fact-seeking tasks, given conflicting reports in prior work. Method: Combines a holistic evaluation, behavior analysis, and a study of model uncertainty. Result: Cold-start supervised fine-tuning (SFT) followed by verifiable-reward reinforcement learning (RL) alleviates hallucination, whereas distillation alone and RL without cold-start fine-tuning introduce more nuanced hallucinations. Conclusion: The study gives an initial account of the causes of hallucination in LRMs and points to directions for future model improvement. Abstract: Recently evolved large reasoning models (LRMs) show powerful performance in solving complex tasks with long chain-of-thought (CoT) reasoning capability. As these LRMs are mostly developed by post-training on formal reasoning tasks, whether they generalize the reasoning capability to help reduce hallucination in fact-seeking tasks remains unclear and debated. For instance, DeepSeek-R1 reports increased performance on SimpleQA, a fact-seeking benchmark, while OpenAI-o3 observes even more severe hallucination. This discrepancy naturally raises the following research question: Are reasoning models more prone to hallucination? This paper addresses the question from three perspectives. (1) We first conduct a holistic evaluation for the hallucination in LRMs. Our analysis reveals that LRMs that undergo a full post-training pipeline with cold start supervised fine-tuning (SFT) and verifiable reward RL generally alleviate their hallucination. In contrast, both distillation alone and RL training without cold start fine-tuning introduce more nuanced hallucinations. (2) To explore why different post-training pipelines alter the impact on hallucination in LRMs, we conduct behavior analysis. We characterize two critical cognitive behaviors that directly affect the factuality of an LRM: Flaw Repetition, where the surface-level reasoning attempts repeatedly follow the same underlying flawed logic, and Think-Answer Mismatch, where the final answer fails to faithfully match the previous CoT process. (3) Further, we investigate the mechanism behind the hallucination of LRMs from the perspective of model uncertainty. We find that increased hallucination of LRMs is usually associated with the misalignment between model uncertainty and factual accuracy. Our work provides an initial understanding of the hallucination in LRMs.

[240] ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs

Mohamed Elaraby,Diane Litman

Main category: cs.CL

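TL;DR: The paper studies whether instruction-tuned large language models (LLMs) adequately preserve argument role information in summaries, proposes the Argument Representation Coverage (ARC) framework to measure it, and finds that LLMs struggle when argument information is sparsely distributed.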

Details

Motivation: Argument roles are crucial for summarization in high-stakes domains such as law, but it is unclear whether LLMs preserve this information. Method: Introduces the ARC framework and evaluates summaries from three open-weight LLMs on long legal opinions and scientific articles. Result: LLMs capture salient argument roles only to a limited extent, especially when arguments are sparsely distributed. Positional bias and role-specific preferences also affect coverage. Conclusion: More argument-aware summarization strategies are needed to improve LLMs in high-stakes domains. Abstract: Integrating structured information has long improved the quality of abstractive summarization, particularly in retaining salient content. In this work, we focus on a specific form of structure: argument roles, which are crucial for summarizing documents in high-stakes domains such as law. We investigate whether instruction-tuned large language models (LLMs) adequately preserve this information. To this end, we introduce Argument Representation Coverage (ARC), a framework for measuring how well LLM-generated summaries capture salient arguments. Using ARC, we analyze summaries produced by three open-weight LLMs in two domains where argument roles are central: long legal opinions and scientific articles. Our results show that while LLMs cover salient argument roles to some extent, critical information is often omitted in generated summaries, particularly when arguments are sparsely distributed throughout the input. Further, we use ARC to uncover behavioral patterns -- specifically, how the positional bias of LLM context windows and role-specific preferences impact the coverage of key arguments in generated summaries, emphasizing the need for more argument-aware summarization strategies.

[241] Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation

Hongxiang Zhang,Hao Chen,Tianyi Zhang,Muhao Chen

Main category: cs.CL

TL;DR: ActLCD is a new decoding strategy that uses reinforcement learning to optimize the factuality of generation and reduce hallucination.

Details

Motivation: Existing decoding methods still hallucinate, especially over longer contexts, so a better decoding strategy is needed. Method: Proposes ActLCD, which uses a reinforcement learning policy guided by a reward-aware classifier to decide when to contrast layers. Result: Outperforms state-of-the-art methods on five benchmarks and mitigates hallucination. Conclusion: ActLCD improves factuality across diverse generation scenarios. Abstract: Recent decoding methods improve the factuality of large language models (LLMs) by refining how the next token is selected during generation. These methods typically operate at the token level, leveraging internal representations to suppress superficial patterns. Nevertheless, LLMs remain prone to hallucinations, especially over longer contexts. In this paper, we propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy that actively decides when to apply contrasting layers during generation. By casting decoding as a sequential decision-making problem, ActLCD employs a reinforcement learning policy guided by a reward-aware classifier to optimize factuality beyond the token level. Our experiments demonstrate that ActLCD surpasses state-of-the-art methods across five benchmarks, showcasing its effectiveness in mitigating hallucinations in diverse generation scenarios.
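
The idea of "contrasting layers" can be illustrated with a generic layer-contrast score: the difference between the final layer's and an earlier layer's log-probabilities for each token. This is a simplified sketch, not ActLCD itself. The boolean `apply_contrast` merely stands in for the learned policy's per-step decision, the logits are made-up toy values, and safeguards used in practice (such as restricting the contrast to plausible tokens) are omitted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrast_layers(final_logits, early_logits):
    """Score tokens by how much more the final layer prefers them than an earlier layer."""
    final_lp = log_softmax(final_logits)
    early_lp = log_softmax(early_logits)
    return [f - e for f, e in zip(final_lp, early_lp)]


def next_token(final_logits, early_logits, apply_contrast):
    """Pick the next token id; `apply_contrast` stands in for a policy's decision."""
    if apply_contrast:
        scores = contrast_layers(final_logits, early_logits)
    else:
        scores = final_logits
    return max(range(len(scores)), key=scores.__getitem__)


# Toy logits over a 3-token vocabulary (hypothetical values).
final_logits = [2.0, 1.9, 0.0]
early_logits = [2.0, 0.5, 0.0]
print(next_token(final_logits, early_logits, apply_contrast=False))
print(next_token(final_logits, early_logits, apply_contrast=True))
```

Token 0 is favored by both layers, so plain decoding picks it. Token 1 gains the most between the early and final layers, so the contrast picks it instead. Deciding at which steps this switch helps is the part the abstract frames as a sequential decision problem.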

[242] ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Beong-woo Kwak,Minju Kim,Dongha Lim,Hyungjoo Chae,Dongjin Kang,Sunghwan Kim,Dongil Yang,Jinyoung Yeo

Main category: cs.CL

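TL;DR: ToolHaystack is a benchmark for testing the tool-use ability of large language models in long-term interactions; it reveals shortcomings in the long-term robustness of current models.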

Details

Motivation: Most existing evaluations assume tool use in short contexts and offer limited insight into model behavior in long-term interactions. Method: Introduces the ToolHaystack benchmark, whose continuous conversations contain multiple task execution contexts and realistic noise, to evaluate models in long-term interactions. Result: Fourteen state-of-the-art LLMs perform well in standard multi-turn settings but degrade significantly on ToolHaystack. Conclusion: ToolHaystack exposes critical gaps in the long-term robustness of current models and gives direction for future research. Abstract: Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing the tool use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often significantly struggle in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.

[243] LoLA: Low-Rank Linear Attention With Sparse Caching

Luke McDermott,Robert W. Heath Jr.,Rahul Parhi

Main category: cs.CL

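TL;DR: LoLA is a low-rank linear attention method that uses sparse caching and distributes key-value pairs across three forms of memory to address memory collisions in long-sequence inference, substantially improving recall.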

Details

Motivation: Transformers have quadratic complexity at inference on long sequences, while existing linear attention methods approximate softmax attention poorly and struggle to recall information from long contexts. Method: LoLA combines sliding window attention, a sparse global cache, and the recurrent hidden state, distributing key-value pairs across these memories to avoid memory collisions. Result: LoLA enables pass-key retrieval at context lengths up to 8K and raises the base model's accuracy from 0.6% to 97.4% at 4K context length, with a 4.6x smaller cache than Llama-3.1 8B. Conclusion: LoLA is a lightweight, efficient approach for long-sequence inference that performs strongly among subquadratic models. Abstract: Transformer-based large language models suffer from quadratic complexity at inference on long sequences. Linear attention methods are efficient alternatives; however, they fail to provide an accurate approximation of softmax attention. By additionally incorporating sliding window attention into each linear attention head, this gap can be closed for short context-length tasks. Unfortunately, these approaches cannot recall important information from long contexts due to "memory collisions". In this paper, we propose LoLA: Low-rank Linear Attention with sparse caching. LoLA separately stores additional key-value pairs that would otherwise interfere with past associative memories. Moreover, LoLA further closes the gap between linear attention models and transformers by distributing past key-value pairs into three forms of memory: (i) recent pairs in a local sliding window; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. As an inference-only strategy, LoLA enables pass-key retrieval on up to 8K context lengths on needle-in-a-haystack tasks from RULER. It boosts the accuracy of the base subquadratic model from 0.6% to 97.4% at 4K context lengths, with a 4.6x smaller cache than that of Llama-3.1 8B. LoLA demonstrates strong performance on zero-shot commonsense reasoning tasks among 1B and 8B parameter subquadratic models. Finally, LoLA is an extremely lightweight approach: Nearly all of our results can be reproduced on a single consumer GPU.

[244] Automatic classification of stop realisation with wav2vec2.0

James Tanner,Morgan Sonderegger,Jane Stuart-Smith,Jeff Mielke,Tyler Kendall

Main category: cs.CL

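TL;DR: Pre-trained wav2vec2.0 models are used to automatically classify stop realisation in speech data, with high accuracy and robustness in both English and Japanese.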

Details

Motivation: Few existing tools can annotate variable phonetic phenomena, while pre-trained self-supervised models such as wav2vec2.0 perform well on speech classification and latently encode fine-grained phonetic information. Method: Trains wav2vec2.0 models to automatically classify stop burst presence, tests them on English and Japanese, and checks robustness across finely curated and unprepared speech corpora. Result: The models classify stop burst presence with high accuracy; the automatic annotations closely follow manual annotations and replicate patterns of variability in stop realisation. Conclusion: Pre-trained speech models have potential as automatic annotation tools that can broaden the scope of phonetic research. Abstract: Modern phonetic research regularly makes use of automatic tools for the annotation of speech data; however, few tools exist for the annotation of many variable phonetic phenomena. At the same time, pre-trained self-supervised models, such as wav2vec2.0, have been shown to perform well at speech classification tasks and latently encode fine-grained phonetic information. We demonstrate that wav2vec2.0 models can be trained to automatically classify stop burst presence with high accuracy in both English and Japanese, robust across both finely-curated and unprepared speech corpora. Patterns of variability in stop realisation are replicated with the automatic annotations, and closely follow those of manual annotations. These results demonstrate the potential of pre-trained speech models as tools for the automatic annotation and processing of speech corpus data, enabling researchers to 'scale up' the scope of phonetic research with relative ease.

[245] Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models

Francesca Padovani,Jaap Jumelet,Yevgen Matusevych,Arianna Bisazza

Main category: cs.CL

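TL;DR: Language models trained on child-directed language (CDL) are in most cases outperformed by Wikipedia-trained models, and frequency effects must be controlled for to evaluate syntactic ability accurately.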

Details

Motivation: To test whether the benefits of CDL generalize across languages, model types, and evaluation settings, and to address shortcomings of existing benchmarks. Method: Compares models trained on CDL and on Wikipedia across two objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks, and introduces a new testing methodology, FIT-CLAMS. Result: CDL is outperformed by Wikipedia models in most cases, and frequency control is needed for balanced comparisons. Conclusion: Training on CDL does not yield stronger syntactic generalization, and controlling for frequency effects is essential when evaluating syntactic ability. Abstract: Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can reach similar syntactic abilities as LMs trained on much larger amounts of adult-directed written text, suggesting that CDL could provide more effective LM training material than the commonly used internet-crawled data. However, the generalizability of these results across languages, model types, and evaluation settings remains unclear. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in previous benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal pair evaluations and regression analysis we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.

[246] Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

Ziling Cheng,Meng Cao,Leila Pishdad,Yanshuai Cao,Jackie Chi Kit Cheung

Main category: cs.CL

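TL;DR: The paper argues that final-answer metrics are a limited measure of LLM math problem-solving because they conflate two sub-skills, abstract formulation and arithmetic computation. For Llama-3 and Qwen2.5, arithmetic computation is the bottleneck, and chain-of-thought (CoT) mainly helps computation rather than formulation. Mechanistically, the models follow an abstract-then-compute process.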

Details

Motivation: To examine whether existing metrics accurately reflect the mathematical reasoning ability of LLMs, and to separate the contributions of abstract formulation and arithmetic computation to the final answer. Method: Runs a disentangled evaluation of Llama-3 and Qwen2.5 on GSM8K and SVAMP, analyzes the role of CoT, and uses causal patching to verify the mechanism. Result: Arithmetic computation is the main bottleneck; CoT helps computation substantially but has limited impact on abstract formulation. The models process problems through an abstract-then-compute mechanism. Conclusion: Disentangled evaluation is needed to measure LLM reasoning accurately and to guide future improvements. Abstract: Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.
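
One way to picture a disentangled evaluation is to score the expression a model writes separately from the number it reports. The sketch below is an assumed setup, not the paper's protocol: `disentangled_scores` is a hypothetical helper, the example problem is invented, and checking formulation by comparing the expression's value with the gold answer is only a proxy for comparing the expressions themselves.

```python
import ast
import operator

# Supported binary operators for the safe arithmetic evaluator.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression):
    """Safely evaluate an arithmetic expression made of numbers and + - * /."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval"))


def disentangled_scores(gold_answer, model_expression, model_answer, tol=1e-6):
    """Score formulation and computation separately.

    formulation:  does the model's expression evaluate to the gold answer?
    computation:  did the model correctly compute its own expression?
    final_answer: the usual final-answer metric, for comparison.
    """
    value = evaluate(model_expression)
    return {
        "formulation": abs(value - gold_answer) <= tol,
        "computation": abs(value - model_answer) <= tol,
        "final_answer": abs(model_answer - gold_answer) <= tol,
    }


# Hypothetical case: the right expression, but an arithmetic slip (gold is 144).
scores = disentangled_scores(144, "(48 + 24) * 2", 134)
print(scores)
```

Here the final-answer metric alone would mark the response wrong without revealing that the formulation was correct, which is the conflation the abstract describes.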

Zixiang Xu,Yanbo Wang,Yue Huang,Jiayi Ye,Haomin Zhuang,Zirui Song,Lang Gao,Chenxi Wang,Zhaorun Chen,Yujun Zhou,Sixian Li,Wang Pan,Yue Zhao,Jieyu Zhao,Xiangliang Zhang,Xiuying Chen

Main category: cs.CL

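TL;DR: The paper introduces SocialMaze, a new benchmark for evaluating the social reasoning ability of large language models (LLMs), filling a gap in existing evaluation frameworks.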

Details

Motivation: Existing evaluation frameworks oversimplify real-world scenarios and cannot comprehensively assess the social reasoning of LLMs. Method: Proposes the SocialMaze benchmark, built around three core challenges (deep reasoning, dynamic interaction, and information uncertainty) and comprising six tasks. Result: Models vary substantially in handling dynamic interactions, reasoning degrades significantly under uncertainty, and targeted fine-tuning improves performance. Conclusion: SocialMaze is an effective tool for evaluating and improving the social reasoning of LLMs. Abstract: Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model's social reasoning ability - the capacity to interpret social contexts, infer others' mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms. Both automated and human validation are used to ensure data quality. Our evaluation reveals several key insights: models vary substantially in their ability to handle dynamic interactions and integrate temporally evolving information; models with strong chain-of-thought reasoning perform better on tasks requiring deeper inference beyond surface-level cues; and model reasoning degrades significantly under uncertainty. Furthermore, we show that targeted fine-tuning on curated reasoning examples can greatly improve model performance in complex social scenarios. The dataset is publicly available at: https://huggingface.co/datasets/MBZUAI/SocialMaze

[248] SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

Roksana Goworek,Harpal Karlcut,Muhammad Shezad,Nijaguna Darshana,Abhishek Mane,Syam Bondada,Raghav Sikka,Ulvi Mammadov,Rauf Allahverdiyev,Sriram Purighella,Paridhi Gupta,Muhinyia Ndegwa,Haim Dubossarsky

Main category: cs.CL

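TL;DR: The paper proposes a semi-automatic annotation method and creates sense-annotated datasets of polysemous words covering nine low-resource languages to support cross-lingual transfer research.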

Details

Motivation: To address the lack of high-quality evaluation datasets for low-resource languages and so advance cross-lingual transfer. Method: Uses a semi-automatic annotation method to create sense-annotated datasets of polysemous words and evaluates their utility with Word-in-Context (WiC) formatted experiments. Result: The experiments show that targeted dataset creation and evaluation are important for polysemy disambiguation and transfer studies in low-resource languages. Conclusion: The released datasets and code aim to support fairer, more robust, and truly multilingual NLP research. Abstract: This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness depends on suitable, high-quality benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning nine low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
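
For readers unfamiliar with the Word-in-Context (WiC) format, each instance pairs two sentences that share a target word and asks whether the word carries the same sense in both. The sketch below shows one plausible in-memory representation and an accuracy computation. The field names and the English examples are hypothetical and are not drawn from the released datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WiCInstance:
    """One Word-in-Context item (field names are assumptions for this sketch)."""
    target: str
    sentence_a: str
    sentence_b: str
    same_sense: bool


def wic_accuracy(instances, predict):
    """Fraction of instances where `predict(instance)` matches the gold label."""
    correct = sum(1 for inst in instances if predict(inst) == inst.same_sense)
    return correct / len(instances)


# Hypothetical English examples, for illustration only.
examples = [
    WiCInstance("bank", "She sat on the river bank.",
                "He deposited cash at the bank.", False),
    WiCInstance("bank", "The bank raised its rates.",
                "I opened an account at the bank.", True),
]


def always_same(instance):
    """Trivial baseline that predicts 'same sense' for every pair."""
    return True


print(wic_accuracy(examples, always_same))
```

Because the task is a binary decision over sentence pairs, the same scoring code applies to any language once sense-annotated sentences are paired up.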

[249] Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models

Jinzhe Li,Gengxu Li,Yi Chang,Yuan Wu

Main category: cs.CL

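TL;DR: The paper proposes PCBench to evaluate the premise critique ability of large language models (LLMs) and finds that current models have limited ability to identify flawed input premises on their own, and that reasoning ability does not consistently correlate with premise critique ability.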

Details

Motivation: Existing studies mostly assess LLM reasoning in ideal settings and overlook vulnerability to flawed premises, so premise critique ability needs to be evaluated and improved. Method: Designs PCBench, with four error types and three difficulty levels, paired with multi-faceted evaluation metrics, and systematically evaluates 15 representative LLMs. Result: Most models rely on explicit prompts to detect errors and have limited autonomous critique; premise critique ability depends on question difficulty and error type; reasoning ability does not consistently correlate with premise critique ability; flawed premises trigger overthinking in reasoning models. Conclusion: Improving LLMs' proactive evaluation of input validity is essential, and premise critique is a foundational capability for reliable, human-centric systems. Abstract: Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at https://github.com/MLGroupJLU/Premise_Critique.

[250] Label-Guided In-Context Learning for Named Entity Recognition

Fan Bai,Hamid Hassanzadeh,Ardavan Saeedi,Mark Dredze

Main category: cs.CL

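TL;DR: DEER is a new method for improving in-context learning (ICL) that uses training labels and token-level statistics to improve example selection and error correction, raising named entity recognition (NER) performance.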

Details

Motivation: Existing ICL methods for NER usually select demonstrations by semantic similarity alone and ignore training labels, which leads to suboptimal performance. Method: DEER uses a label-guided, token-based retriever to improve example selection and uses label statistics to identify and correct error-prone tokens. Result: Across five NER datasets and four LLMs, DEER outperforms existing ICL methods and approaches the performance of supervised fine-tuning. Conclusion: DEER is effective on both seen and unseen entities and robust in low-resource settings. Abstract: In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. In Named Entity Recognition (NER), demonstrations are typically selected based on semantic similarity to the test instance, ignoring training labels and resulting in suboptimal performance. We introduce DEER, a new method that leverages training labels through token-level statistics to improve ICL performance. DEER first enhances example selection with a label-guided, token-based retriever that prioritizes tokens most informative for entity recognition. It then prompts the LLM to revisit error-prone tokens, which are also identified using label statistics, and make targeted corrections. Evaluated on five NER datasets using four different LLMs, DEER consistently outperforms existing ICL methods and approaches the performance of supervised fine-tuning. Further analysis shows its effectiveness on both seen and unseen entities and its robustness in low-resource settings.

[251] ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

Zexi Liu,Jingyi Chai,Xinyu Zhu,Shuo Tang,Rui Ye,Bo Zhang,Lei Bai,Siheng Chen

Main category: cs.CL

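TL;DR: The paper proposes a reinforcement-learning-based training framework for LLM agents that learn through interactive experimentation on ML tasks, substantially improving the performance of a small model.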

Details

Motivation: Existing LLM agents rely on manual prompt engineering and cannot adapt or optimize from experience, which motivates a learning-based agentic ML paradigm. Method: Proposes a framework with exploration-enriched fine-tuning, step-wise RL, and a unified reward module, and uses it to train the 7B ML-Agent. Result: Trained on only 9 ML tasks, ML-Agent outperforms the 671B DeepSeek-R1 agent and shows cross-task generalization. Conclusion: The learning-based agentic ML framework improves the autonomous ML ability of a small model and has broad potential. Abstract: The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, most existing approaches rely heavily on manual prompt engineering, failing to adapt and optimize based on diverse experimental experiences. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Remarkably, despite being trained on merely 9 ML tasks, our 7B-sized ML-Agent outperforms the 671B-sized DeepSeek-R1 agent. Furthermore, it achieves continuous performance improvements and demonstrates exceptional cross-task generalization capabilities.

[252] Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

Mohamad Chehade,Soumya Suvra Ghosal,Souradip Chakraborty,Avinash Reddy,Dinesh Manocha,Hao Zhu,Amrit Singh Bedi

Main category: cs.CL

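TL;DR: SITAlign is an inference-time framework that addresses multi-objective alignment by maximizing a primary objective while satisfying threshold constraints on secondary criteria.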

Details

Motivation: Existing methods typically treat human preference feedback as a multi-objective optimization problem but overlook how humans actually make decisions (for example, by satisficing). Method: The SITAlign framework applies a satisficing strategy, maximizing a primary objective while satisfying threshold constraints on secondary criteria. Result: On the PKU-SafeRLHF dataset, SITAlign improves the GPT-4 win-tie rate for helpfulness reward by 22.3% over the state-of-the-art multi-objective decoding strategy while adhering to the harmlessness threshold. Conclusion: SITAlign uses satisficing to address multi-objective alignment effectively and outperforms existing methods. Abstract: Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
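The satisficing idea in the abstract above (maximize a primary objective subject to a threshold on a secondary one) can be illustrated with a small selection routine. This is only a sketch under stated assumptions, not SITAlign's decoding procedure: the `Candidate` type, the two reward fields, and `satisficing_pick` are hypothetical names, and the scores are assumed to come from external reward models.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """A candidate response with two reward scores (hypothetical scorer outputs)."""
    text: str
    helpfulness: float   # primary objective, to be maximized
    harmlessness: float  # secondary criterion, must meet a threshold


def satisficing_pick(candidates: Sequence[Candidate],
                     threshold: float) -> Optional[Candidate]:
    """Return the most helpful candidate whose harmlessness meets the threshold.

    Returns None when no candidate satisfies the constraint.
    """
    feasible = [c for c in candidates if c.harmlessness >= threshold]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.helpfulness)


pool = [
    Candidate("a", helpfulness=0.9, harmlessness=0.2),  # helpful but below threshold
    Candidate("b", helpfulness=0.7, harmlessness=0.8),
    Candidate("c", helpfulness=0.4, harmlessness=0.9),
]
best = satisficing_pick(pool, threshold=0.5)
```

Unlike a weighted sum of rewards, the constraint here is hard: a candidate below the threshold is never chosen, however helpful it is.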

[253] ATLAS: Learning to Optimally Memorize the Context at Test Time

Ali Behrouz,Zeman Li,Praneeth Kacham,Majid Daliri,Yuan Deng,Peilin Zhong,Meisam Razaviyayn,Vahab Mirrokni

Main category: cs.CL

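TL;DR: The paper proposes ATLAS, a long-term memory module that optimizes memory over both current and past tokens to address the weaknesses of modern recurrent neural networks in long-context understanding and sequence extrapolation, and builds the DeepTransformers architecture on top of it. Experiments show that ATLAS outperforms Transformers and linear recurrent models on several tasks.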

Details

Motivation: Transformers have become the dominant sequence-modeling architecture because of their effectiveness in in-context retrieval and their scalability, but their quadratic complexity limits their use on long sequences. Modern recurrent neural networks perform well on some tasks yet fall short on long-context understanding and extrapolation; the paper aims to address these shortcomings. Method: It proposes ATLAS, a long-term memory module that improves memory capacity, the online update mechanism, and the expressiveness of fixed-size memory management. Building on this, it proposes DeepTransformers, a family of architectures that strictly generalize the Transformer. Result: Experiments show that ATLAS outperforms Transformers and linear recurrent models on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks, and achieves +80% accuracy at 10M context length on the BABILong benchmark. Conclusion: By improving the design of the memory module, ATLAS substantially improves performance on long-context tasks and offers a new option for sequence modeling. Abstract: Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, bounds their applicability in longer sequences and so has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory module). Despite their recent success in diverse downstream tasks, they struggle in tasks that require long context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To enhance all these three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long context performance of Titans, achieving +80% accuracy in 10M context length of BABILong benchmark.

[254] DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning

Ziyin Zhang,Jiahao Xu,Zhiwei He,Tian Liang,Qiuzhi Liu,Yansi Li,Linfeng Song,Zhengwen Liang,Zhuosheng Zhang,Rui Wang,Zhaopeng Tu,Haitao Mi,Dong Yu

Main category: cs.CL

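TL;DR: DeepTheorem is a natural-language framework for informal theorem proving that uses a large-scale dataset and a reinforcement learning strategy to improve LLM mathematical reasoning, significantly outperforming existing approaches.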

Details

Motivation: Traditional automated theorem proving (ATP) relies on formal systems that align poorly with the natural-language knowledge of LLMs. DeepTheorem aims to use natural language to strengthen LLM mathematical reasoning. Method: The DeepTheorem framework comprises a dataset of 121K high-quality informal theorems, a reinforcement learning strategy (RL-Zero), and comprehensive evaluation metrics. Result: Experiments show that DeepTheorem significantly improves LLM theorem-proving performance, reaching state-of-the-art accuracy and reasoning quality. Conclusion: DeepTheorem has the potential to advance informal theorem proving and mathematical exploration. Abstract: Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs' strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.

[255] Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Heekyung Lee,Jiaxin Ge,Tsung-Han Wu,Minwoo Kang,Trevor Darrell,David M. Chan

Main category: cs.CL

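TL;DR: The paper studies how well current vision-language models (VLMs) solve rebus puzzles (visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution) and finds that they perform poorly on abstract reasoning and visual metaphor understanding.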

Details

Motivation: Rebus puzzles pose a distinctive challenge for VLMs because they require multi-modal abstraction, symbolic reasoning, and a grasp of cultural and linguistic puns, abilities that traditional tasks such as image captioning or question answering do not cover. Method: The authors construct a hand-generated and annotated benchmark of diverse rebus puzzles and evaluate different VLMs on it. Result: VLMs show some ability to decode simple visual clues but fall significantly short on tasks requiring abstract reasoning, lateral thinking, and understanding of visual metaphors. Conclusion: Current VLMs remain limited on complex multi-modal tasks such as rebus puzzles, especially in abstraction and metaphor understanding. Abstract: Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.

[256] From Chat Logs to Collective Insights: Aggregative Question Answering

Wentao Zhang,Woojeong Kim,Yuntian Deng

Main category: cs.CL

TL;DR: The paper introduces a new task, Aggregative Question Answering, which answers aggregative questions by analyzing large-scale user-chatbot conversation data.

Details

Motivation: Existing approaches typically treat conversations as independent events and miss key insights that emerge from aggregating and reasoning over large-scale conversation logs. Method: The authors construct the WildChat-AQA benchmark, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Result: Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs. Conclusion: New methods are needed to extract collective insights from large-scale conversational data. Abstract: Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet, existing approaches typically treat these interactions as independent and miss critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task requiring models to reason explicitly over thousands of user-chatbot interactions to answer aggregative queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.

eess.IV [Back]

[257] IRS: Incremental Relationship-guided Segmentation for Digital Pathology

Ruining Deng,Junchao Zhu,Juming Xiong,Can Cui,Tianyuan Yao,Junlin Guo,Siqi Lu,Marilyn Lionts,Mengmeng Yin,Yu Wang,Shilin Zhao,Yucheng Tang,Yihe Yang,Paul Dennis Simonson,Mert R. Sabuncu,Haichun Yang,Yuankai Huo

Main category: eess.IV

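TL;DR: The paper proposes IRS, a new incremental relationship-guided segmentation learning scheme that handles temporally acquired, partially annotated data in digital pathology while maintaining out-of-distribution continual learning capacity.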

Details

Motivation: Panoramic segmentation in digital pathology faces incomplete annotations and the need to continually learn new phenotypes, unseen diseases, and diverse populations, which calls for a unified segmentation framework. Method: IRS mathematically models anatomical relationships and uses an incremental universal proposition matrix to realize spatial-temporal out-of-distribution continual learning. Result: Experiments show that IRS handles multi-scale pathological segmentation effectively, precisely segmenting kidney structures and out-of-distribution lesions and significantly improving domain generalization. Conclusion: IRS offers a robust continual learning solution for real-world digital pathology applications. Abstract: Continual learning is rapidly emerging as a key focus in computer vision, aiming to develop AI systems capable of continuous improvement, thereby enhancing their value and practicality in diverse real-world applications. In healthcare, continual learning holds great promise for continuously acquired digital pathology data, which is collected in hospitals on a daily basis. However, panoramic segmentation on digital whole slide images (WSIs) presents significant challenges, as it is often infeasible to obtain comprehensive annotations for all potential objects, spanning from coarse structures (e.g., regions and unit objects) to fine structures (e.g., cells). This results in temporally and partially annotated data, posing a major challenge in developing a holistic segmentation framework. Moreover, an ideal segmentation model should incorporate new phenotypes, unseen diseases, and diverse populations, making this task even more complex. In this paper, we introduce a novel and unified Incremental Relationship-guided Segmentation (IRS) learning scheme to address temporally acquired, partially annotated data while maintaining out-of-distribution (OOD) continual learning capacity in digital pathology. The key innovation of IRS lies in its ability to realize a new spatial-temporal OOD continual learning paradigm by mathematically modeling anatomical relationships between existing and newly introduced classes through a simple incremental universal proposition matrix. Experimental results demonstrate that the IRS method effectively handles the multi-scale nature of pathological segmentation, enabling precise kidney segmentation across various structures (regions, units, and cells) as well as OOD disease lesions at multiple magnifications. This capability significantly enhances domain generalization, making IRS a robust approach for real-world digital pathology applications.

[258] iHDR: Iterative HDR Imaging with Arbitrary Number of Exposures

Yu Yuan,Yiheng Chi,Xingguang Zhang,Stanley Chan

Main category: eess.IV

TL;DR: The paper proposes iHDR, an iterative fusion framework for high dynamic range (HDR) imaging with a varying number of input frames.

Details

Motivation: Existing HDR imaging methods are usually designed for a fixed number of inputs and cannot adapt flexibly to scenarios with different numbers of input frames. Method: The iHDR framework combines a ghost-free dual-input HDR fusion network (DiHDR) with a physics-based domain mapping network (ToneNet) and produces the HDR image step by step through iterative fusion. Result: Experiments show that the method outperforms existing HDR deghosting approaches when given flexible numbers of input frames. Conclusion: iHDR provides an effective solution for HDR imaging with varying numbers of input frames. Abstract: High dynamic range (HDR) imaging aims to obtain a high-quality HDR image by fusing information from multiple low dynamic range (LDR) images. Numerous learning-based HDR imaging methods have been proposed to achieve this for static and dynamic scenes. However, their architectures are mostly tailored for a fixed number (e.g., three) of inputs and, therefore, cannot apply directly to situations beyond the pre-defined limited scope. To address this issue, we propose a novel framework, iHDR, for iterative fusion, which comprises a ghost-free Dual-input HDR fusion network (DiHDR) and a physics-based domain mapping network (ToneNet). DiHDR leverages a pair of inputs to estimate an intermediate HDR image, while ToneNet maps it back to the nonlinear domain and serves as the reference input for the next pairwise fusion. This process is iteratively executed until all input frames are utilized. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method as compared to existing state-of-the-art HDR deghosting approaches given flexible numbers of input frames.
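The iterative pairwise fusion described in the abstract above can be sketched as a simple loop. This is a structural illustration only: `fuse_pair` and `to_nonlinear` are hypothetical stand-ins for the DiHDR and ToneNet networks, and the toy functions used in the demo (a pixel-wise average and an identity mapping) are assumptions chosen to keep the example runnable, not the paper's models.

```python
from typing import Callable, List, Sequence

Image = List[float]  # a flattened toy "image"; real frames would be tensors


def iterative_fusion(frames: Sequence[Image],
                     fuse_pair: Callable[[Image, Image], Image],
                     to_nonlinear: Callable[[Image], Image]) -> Image:
    """Fuse any number (>= 2) of frames by repeated two-input fusion.

    Each intermediate HDR estimate is mapped back to the nonlinear domain
    and used as the reference for the next pairwise fusion.
    """
    if len(frames) < 2:
        raise ValueError("need at least two input frames")
    reference = frames[0]
    hdr = reference
    for frame in frames[1:]:
        hdr = fuse_pair(reference, frame)  # stand-in for the dual-input fusion network
        reference = to_nonlinear(hdr)      # stand-in for the domain mapping network
    return hdr


def average(a: Image, b: Image) -> Image:
    """Toy fusion: pixel-wise mean of two frames."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def identity(img: Image) -> Image:
    """Toy domain mapping: leaves the image unchanged."""
    return list(img)


result = iterative_fusion([[0.0], [1.0], [1.0]], average, identity)
```

Because the loop consumes one frame per step, the same two-input model can serve two, three, or more exposures without retraining, which is the flexibility the abstract emphasizes.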

[259] Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging

Ping Wang,Lishun Wang,Gang Qu,Xiaodong Wang,Yulun Zhang,Xin Yuan

Main category: eess.IV

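TL;DR: The paper combines the strengths of deep unrolling and plug-and-play (PnP) methods. By designing an efficient deep image restorer (DIR) and a general proximal trajectory (PT) loss, it handles varying compression ratios (CRs) flexibly in the single-pixel imaging (SPI) inverse problem while improving reconstruction accuracy and speed.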

Details

Motivation: PnP methods are limited in reconstruction accuracy and speed, while unrolling methods require fine-tuning or retraining when the compression ratio changes; the paper aims to combine the strengths of both. Method: It designs an efficient deep image restorer (DIR) for unrolling the HQS and ADMM algorithms and proposes a general proximal trajectory (PT) loss to train the networks. Result: Experiments show that the resulting proximal unrolling networks handle varying compression ratios flexibly and also outperform previous CR-specific unrolling networks in reconstruction accuracy and speed. Conclusion: The method successfully integrates the advantages of PnP and unrolling, providing an efficient and flexible solution for the single-pixel imaging inverse problem. Abstract: Deep-unrolling and plug-and-play (PnP) approaches have become the de-facto standard solvers for single-pixel imaging (SPI) inverse problem. PnP approaches, a class of iterative algorithms where regularization is implicitly performed by an off-the-shelf deep denoiser, are flexible for varying compression ratios (CRs) but are limited in reconstruction accuracy and speed. Conversely, unrolling approaches, a class of multi-stage neural networks where a truncated iterative optimization process is transformed into an end-to-end trainable network, typically achieve better accuracy with faster inference but require fine-tuning or even retraining when CR changes. In this paper, we address the challenge of integrating the strengths of both classes of solvers. To this end, we design an efficient deep image restorer (DIR) for the unrolling of HQS (half quadratic splitting) and ADMM (alternating direction method of multipliers). More importantly, a general proximal trajectory (PT) loss function is proposed to train HQS/ADMM-unrolling networks such that learned DIR approximates the proximal operator of an ideal explicit restoration regularizer. Extensive experiments demonstrate that the resulting proximal unrolling networks can not only flexibly handle varying CRs with a single model like PnP algorithms, but also outperform previous CR-specific unrolling networks in both reconstruction accuracy and speed. Source codes and models are available at https://github.com/pwangcs/ProxUnroll.
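To make the HQS structure mentioned in the abstract above concrete, here is a minimal scalar sketch of half quadratic splitting for a measurement y = a·x. It alternates a closed-form data-fidelity step with a restoration step. The restorer is the slot where a learned DIR would sit in an unrolled network; the soft-threshold used here (the proximal operator of an L1 penalty) and all names and parameter values are assumptions for illustration, not the paper's network or settings.

```python
from typing import Callable, Tuple


def soft_threshold(value: float, tau: float) -> float:
    """Proximal operator of tau * |x| (used here as a toy restorer)."""
    if value > tau:
        return value - tau
    if value < -tau:
        return value + tau
    return 0.0


def hqs_scalar(y: float, a: float, mu: float,
               restorer: Callable[[float], float],
               iterations: int = 50) -> Tuple[float, float]:
    """Run HQS on the scalar problem 0.5 * (a*x - y)**2 + R(x).

    x-step: minimize 0.5*(a*x - y)**2 + 0.5*mu*(x - z)**2 in closed form.
    z-step: apply the restorer, which plays the role of the prox of R.
    """
    x = 0.0
    z = 0.0
    for _ in range(iterations):
        x = (a * y + mu * z) / (a * a + mu)  # data-fidelity update
        z = restorer(x)                      # restoration (prox) update
    return x, z


x_hat, z_hat = hqs_scalar(y=2.0, a=1.0, mu=1.0,
                          restorer=lambda v: soft_threshold(v, 0.5))
```

An unrolled network fixes the number of iterations and trains the restorer end to end; a PnP solver keeps the same loop but plugs in a pretrained denoiser.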

[260] Advancing Image Super-resolution Techniques in Remote Sensing: A Comprehensive Survey

Yunliang Qi,Meng Lou,Yimin Liu,Lu Li,Zhen Yang,Wen Nie

Main category: eess.IV

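TL;DR: The paper systematically surveys remote sensing image super-resolution (RSISR) methods in three categories (supervised, unsupervised, and quality evaluation), analyzes the limitations of existing methods, and proposes future research directions.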

Details

Motivation: Although RSISR methods have proliferated in recent years, a systematic review is still lacking; the paper aims to fill this gap and help researchers understand current trends and challenges. Method: It reviews and categorizes RSISR methods, covering methodologies, datasets, and evaluation metrics, and analyzes their strengths and weaknesses. Result: Existing methods show significant limitations in preserving fine-grained textures and geometric structures under large-scale degradation. Conclusion: Future work should develop domain-specific architectures and robust evaluation protocols to narrow the gap between synthetic and real-world scenarios. Abstract: Remote sensing image super-resolution (RSISR) is a crucial task in remote sensing image processing, aiming to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Despite the growing number of RSISR methods proposed in recent years, a systematic and comprehensive review of these methods is still lacking. This paper presents a thorough review of RSISR algorithms, covering methodologies, datasets, and evaluation metrics. We provide an in-depth analysis of RSISR methods, categorizing them into supervised, unsupervised, and quality evaluation approaches, to help researchers understand current trends and challenges. Our review also discusses the strengths, limitations, and inherent challenges of these techniques. Notably, our analysis reveals significant limitations in existing methods, particularly in preserving fine-grained textures and geometric structures under large-scale degradation. Based on these findings, we outline future research directions, highlighting the need for domain-specific architectures and robust evaluation protocols to bridge the gap between synthetic and real-world RSISR scenarios.

[261] Can Large Language Models Challenge CNNs in Medical Image Analysis?

Shibbir Ahmed,Shahnewaz Karim Sakib,Anindya Bijoy Das

Main category: eess.IV

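TL;DR: The study presents a multimodal AI framework for precise classification of medical diagnostic images, compares the performance of CNNs and LLMs, and finds that applying additional filtering on top of LLMs can yield substantial gains.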

Details

Motivation: To improve the accuracy and efficiency of medical diagnostic image classification while accounting for environmental impact. Method: Using publicly available datasets, the study compares CNNs and LLMs on accuracy, execution efficiency, energy consumption, and carbon emissions. Result: CNN-based models can outperform various multimodal techniques, but applying additional filtering on top of LLMs can lead to substantial performance gains. Conclusion: Multimodal AI systems have the potential to improve the reliability, efficiency, and scalability of medical diagnostics. Abstract: This study presents a multimodal AI framework designed for precisely classifying medical diagnostic images. Utilizing publicly available datasets, the proposed system compares the strengths of convolutional neural networks (CNNs) and different large language models (LLMs). This in-depth comparative analysis highlights key differences in diagnostic performance, execution efficiency, and environmental impacts. Model evaluation was based on accuracy, F1-score, average execution time, average energy consumption, and estimated $CO_2$ emission. The findings indicate that although CNN-based models can outperform various multimodal techniques that incorporate both images and contextual information, applying additional filtering on top of LLMs can lead to substantial performance gains. These findings highlight the transformative potential of multimodal AI systems to enhance the reliability, efficiency, and scalability of medical diagnostics in clinical settings.

[262] PCA for Enhanced Cross-Dataset Generalizability in Breast Ultrasound Tumor Segmentation

Christian Schmidt,Heinrich Martin Overhoff

Main category: eess.IV

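TL;DR: The paper proposes a new approach based on principal component analysis (PCA) to improve the external validity of medical ultrasound image segmentation models on unseen datasets. Experiments show that PCA preprocessing significantly improves recall and Dice scores.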

Details

Motivation: Medical image segmentation models show limited external validity when deployed across datasets, especially for ultrasound images, and existing approaches such as domain adaptation and GAN-based style transfer have limited effect. Method: PCA preprocessing reduces noise while retaining approximately 90% of the variance, producing dimensionality-reduced datasets on which U-Net segmentation models are trained. Result: Using PCA-reconstructed datasets significantly improves recall (0.57 vs. 0.70) and Dice scores (0.50 vs. 0.58), and reduces the decline in recall under external validation by 33%. Conclusion: PCA reconstruction is an effective way to improve the external validity of medical image segmentation models, especially in challenging cases. Abstract: In medical image segmentation, limited external validity remains a critical obstacle when models are deployed across unseen datasets, an issue particularly pronounced in the ultrasound image domain. Existing solutions, such as domain adaptation and GAN-based style transfer, while promising, often fall short in the medical domain where datasets are typically small and diverse. This paper presents a novel application of principal component analysis (PCA) to address this limitation. PCA preprocessing reduces noise and emphasizes essential features by retaining approximately 90% of the dataset variance. We evaluate our approach across six diverse breast tumor ultrasound datasets comprising 3,983 B-mode images and corresponding expert tumor segmentation masks. For each dataset, a corresponding dimensionality-reduced PCA-dataset is created and U-Net-based segmentation models are trained on each of the twelve datasets. Each model trained on an original dataset was inferenced on the remaining five out-of-domain original datasets (baseline results), while each model trained on a PCA dataset was inferenced on five out-of-domain PCA datasets. Our experimental results indicate that using PCA reconstructed datasets, instead of original images, improves the model's recall and Dice scores, particularly for model-dataset pairs where baseline performance was lowest, achieving statistically significant gains in recall (0.57 $\pm$ 0.07 vs. 0.70 $\pm$ 0.05, $p = 0.0004$) and Dice scores (0.50 $\pm$ 0.06 vs. 0.58 $\pm$ 0.06, $p = 0.03$). Our method reduced the decline in recall values due to external validation by 33%. These findings underscore the potential of PCA reconstruction as a safeguard to mitigate declines in segmentation performance, especially in challenging cases, with implications for enhancing external validity in real-world medical applications.
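The "retain approximately 90% of the variance" step in the abstract above comes down to choosing how many principal components to keep. The sketch below shows that selection rule on a list of eigenvalues of the data covariance matrix. The function name and the example eigenvalues are hypothetical, and the abstract does not say how the authors compute or apply the decomposition, so this illustrates only the standard cumulative-variance criterion.

```python
from typing import Sequence


def components_for_variance(eigenvalues: Sequence[float],
                            target: float = 0.90) -> int:
    """Return the smallest number of leading components whose eigenvalues
    account for at least `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)


# Toy spectrum: the first two components explain 95% of the variance.
k = components_for_variance([70.0, 25.0, 5.0], target=0.90)
```

Reconstructing each image from only the first `k` components then discards the low-variance directions, which is where the noise reduction described in the abstract would come from.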

[263] ImmunoDiff: A Diffusion Model for Immunotherapy Response Prediction in Lung Cancer

Moinak Bhattacharya,Judy Huang,Amna F. Sher,Gagandeep Singh,Chao Chen,Prateek Prasanna

Main category: eess.IV

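TL;DR: ImmunoDiff is a diffusion-model-based framework that synthesizes post-treatment CT scans and incorporates clinical data, significantly improving the accuracy of immunotherapy response prediction in NSCLC.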

Details

Motivation: Accurately predicting immunotherapy response in non-small cell lung cancer (NSCLC) is an unmet clinical need; existing models rely only on pre-treatment imaging and struggle to capture the complex changes induced by treatment. Method: The ImmunoDiff framework combines anatomical priors with clinical data and uses a cbi-Adapter module for consistent multimodal integration. Result: On an NSCLC cohort, balanced accuracy for response prediction improves by 21.24% and the c-index for survival prediction increases by 0.03. Conclusion: By integrating anatomical and clinical data, ImmunoDiff significantly improves immunotherapy response prediction. Abstract: Accurately predicting immunotherapy response in Non-Small Cell Lung Cancer (NSCLC) remains a critical unmet need. Existing radiomics and deep learning-based predictive models rely primarily on pre-treatment imaging to predict categorical response outcomes, limiting their ability to capture the complex morphological and textural transformations induced by immunotherapy. This study introduces ImmunoDiff, an anatomy-aware diffusion model designed to synthesize post-treatment CT scans from baseline imaging while incorporating clinically relevant constraints. The proposed framework integrates anatomical priors, specifically lobar and vascular structures, to enhance fidelity in CT synthesis. Additionally, we introduce a novel cbi-Adapter, a conditioning module that ensures pairwise-consistent multimodal integration of imaging and clinical data embeddings, to refine the generative process. A clinical variable conditioning mechanism is also introduced, leveraging demographic data, blood-based biomarkers, and PD-L1 expression to refine the generative process. Evaluations on an in-house NSCLC cohort treated with immune checkpoint inhibitors demonstrate a 21.24% improvement in balanced accuracy for response prediction and a 0.03 increase in c-index for survival prediction. Code will be released soon.

[264] MRI Image Generation Based on Text Prompts

Xinxian Fan,Mengye Lyu

Main category: eess.IV

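TL;DR: The study explores generating MRI images from text prompts with the Stable Diffusion model to address the challenges of acquiring real MRI data, such as high cost, scarce rare-case samples, and privacy concerns.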

Details

Motivation: To address the high cost, scarcity of rare-case samples, and privacy concerns involved in acquiring real MRI datasets. Method: A pre-trained Stable Diffusion model is fine-tuned on the 3T fastMRI and 0.3T M4Raw datasets to generate brain T1, T2, and FLAIR images at different magnetic field strengths. Result: Quantitative metrics including FID and MS-SSIM show improved image quality and semantic consistency; the synthetic images effectively augment training datasets and improve MRI contrast classification. Conclusion: Text-prompted MRI image generation is feasible and can serve as a useful tool for medical AI applications. Abstract: This study explores the use of text-prompted MRI image generation with the Stable Diffusion (SD) model to address challenges in acquiring real MRI datasets, such as high costs, limited rare case samples, and privacy concerns. The SD model, pre-trained on natural images, was fine-tuned using the 3T fastMRI dataset and the 0.3T M4Raw dataset, with the goal of generating brain T1, T2, and FLAIR images across different magnetic field strengths. The performance of the fine-tuned model was evaluated using quantitative metrics, including Fréchet Inception Distance (FID) and Multi-Scale Structural Similarity (MS-SSIM), showing improvements in image quality and semantic consistency with the text prompts. To further evaluate the model's potential, a simple classification task was carried out using a small 0.35T MRI dataset, demonstrating that the synthetic images generated by the fine-tuned SD model can effectively augment training datasets and improve the performance of MRI contrast classification tasks. Overall, our findings suggest that text-prompted MRI image generation is feasible and can serve as a useful tool for medical AI applications.

[265] DeepMultiConnectome: Deep Multi-Task Prediction of Structural Connectomes Directly from Diffusion MRI Tractography

Marcus J. Vroemen,Yuqian Chen,Yui Lo,Tengfei Xu,Weidong Cai,Fan Zhang,Josien P. W. Pluim,Lauren J. O'Donnell

Main category: eess.IV

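TL;DR: DeepMultiConnectome is a deep-learning model that predicts structural connectomes directly from tractography data, avoiding the gray matter parcellation required by traditional methods; it supports multiple parcellation schemes and is fast and scalable.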

Details

Motivation: Traditional connectome generation is time-consuming and depends on gray matter parcellation, which limits the feasibility of large-scale studies. Method: Using a point-cloud-based neural network with multi-task learning, the model classifies streamlines by the brain regions they connect across two parcellation schemes that share a learned representation. Result: Predicted connectomes correlate highly with traditionally generated ones (r = 0.992 and 0.986), preserve network properties, and show comparable reproducibility in a test-retest analysis. Conclusion: DeepMultiConnectome provides a fast, scalable way to generate subject-specific connectomes across multiple parcellation schemes. Abstract: Diffusion MRI (dMRI) tractography enables in vivo mapping of brain structural connections, but traditional connectome generation is time-consuming and requires gray matter parcellation, posing challenges for large-scale studies. We introduce DeepMultiConnectome, a deep-learning model that predicts structural connectomes directly from tractography, bypassing the need for gray matter parcellation while supporting multiple parcellation schemes. Using a point-cloud-based neural network with multi-task learning, the model classifies streamlines according to their connected regions across two parcellation schemes, sharing a learned representation. We train and validate DeepMultiConnectome on tractography from the Human Connectome Project Young Adult dataset ($n = 1000$), labeled with 84-region and 164-region gray matter parcellation schemes. DeepMultiConnectome predicts multiple structural connectomes from a whole-brain tractogram containing 3 million streamlines in approximately 40 seconds. DeepMultiConnectome is evaluated by comparing predicted connectomes with traditional connectomes generated using the conventional method of labeling streamlines using a gray matter parcellation. The predicted connectomes are highly correlated with traditionally generated connectomes ($r = 0.992$ for an 84-region scheme; $r = 0.986$ for a 164-region scheme) and largely preserve network properties. A test-retest analysis of DeepMultiConnectome demonstrates reproducibility comparable to traditionally generated connectomes. The predicted connectomes perform similarly to traditionally generated connectomes in predicting age and cognitive function. Overall, DeepMultiConnectome provides a scalable, fast model for generating subject-specific connectomes across multiple parcellation schemes.
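Once each streamline has been labeled with the pair of regions it connects (whether by a parcellation or by a classifier such as the one described above), a connectome is just a symmetric count matrix over region pairs. The sketch below shows that aggregation step only. The function name and the integer region indexing are assumptions for illustration, and counting streamlines is one common edge weighting, not necessarily the one used in the paper.

```python
from typing import Iterable, List, Tuple


def build_connectome(region_pairs: Iterable[Tuple[int, int]],
                     n_regions: int) -> List[List[int]]:
    """Aggregate per-streamline region-pair labels into a symmetric
    streamline-count matrix of shape n_regions x n_regions."""
    matrix = [[0] * n_regions for _ in range(n_regions)]
    for a, b in region_pairs:
        if not (0 <= a < n_regions and 0 <= b < n_regions):
            raise ValueError(f"region index out of range: ({a}, {b})")
        matrix[a][b] += 1
        if a != b:  # avoid double-counting streamlines within a single region
            matrix[b][a] += 1
    return matrix


# Three toy streamlines: two connect regions 0 and 1, one stays within region 2.
connectome = build_connectome([(0, 1), (1, 0), (2, 2)], n_regions=3)
```

With two parcellation schemes, the same streamlines would be aggregated twice, once per label set, yielding for example an 84 x 84 and a 164 x 164 matrix.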

[266] Plug-and-Play Posterior Sampling for Blind Inverse Problems

Anqi Li,Weijie Gan,Ulugbek S. Kamilov

Main category: eess.IV

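TL;DR: Blind-PnPDM is a new framework for blind inverse problems that uses an alternating Gaussian denoising scheme with diffusion model priors, without requiring explicit priors or separate parameter estimation.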

Details

Motivation: Conventional methods rely on explicit priors or separate parameter estimation; Blind-PnPDM aims to offer a more flexible and adaptable solution. Method: Two diffusion models serve as priors, one capturing the distribution of the target image and the other that of the measurement operator, and posterior sampling is performed through alternating Gaussian denoising. Result: On blind image deblurring, Blind-PnPDM outperforms existing methods in both quantitative metrics and visual fidelity. Conclusion: Recasting blind inverse problems as a sequence of denoising subproblems and exploiting the expressive power of diffusion models is an effective approach. Abstract: We introduce Blind Plug-and-Play Diffusion Models (Blind-PnPDM) as a novel framework for solving blind inverse problems where both the target image and the measurement operator are unknown. Unlike conventional methods that rely on explicit priors or separate parameter estimation, our approach performs posterior sampling by recasting the problem into an alternating Gaussian denoising scheme. We leverage two diffusion models as learned priors: one to capture the distribution of the target image and another to characterize the parameters of the measurement operator. This PnP integration of diffusion models ensures flexibility and ease of adaptation. Our experiments on blind image deblurring show that Blind-PnPDM outperforms state-of-the-art methods in terms of both quantitative metrics and visual fidelity. Our results highlight the effectiveness of treating blind inverse problems as a sequence of denoising subproblems while harnessing the expressive power of diffusion-based priors.

[267] Synthetic Generation and Latent Projection Denoising of Rim Lesions in Multiple Sclerosis

Alexandra G. Roberts,Ha M. Luu,Mert Şişman,Alexey V. Dimov,Ceren Tozlu,Ilhami Kovanlikaya,Susan A. Gauthier,Thanh D. Nguyen,Yi Wang

Main category: eess.IV

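TL;DR: The paper proposes a method for generating synthetic quantitative susceptibility maps to improve classifier performance on paramagnetic rim lesions (PRLs) in multiple sclerosis, and introduces a new denoising approach that enlarges the minority class.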

Details

Motivation: Paramagnetic rim lesions (PRLs) are an emerging biomarker in multiple sclerosis, but because they are rare, classifiers face a class imbalance problem. Method: The authors generate synthetic quantitative susceptibility maps with a multi-channel extension and use the projection capability of the generative network for a denoising approach that increases the minority class. Result: The synthetic data and the denoising method significantly improve rim lesion detection and better approximate the true distribution. Conclusion: The method has potential value for clinical diagnosis in multiple sclerosis; the code and generated data are to be released upon publication. Abstract: Quantitative susceptibility maps from magnetic resonance images can provide both prognostic and diagnostic information in multiple sclerosis, a neurodegenerative disease characterized by the formation of lesions in white matter brain tissue. In particular, susceptibility maps provide adequate contrast to distinguish between "rim" lesions, surrounded by deposited paramagnetic iron, and "non-rim" lesion types. These paramagnetic rim lesions (PRLs) are an emerging biomarker in multiple sclerosis. Much effort has been devoted to both detection and segmentation of such lesions to monitor longitudinal change. As paramagnetic rim lesions are rare, addressing this problem requires confronting the class imbalance between rim and non-rim lesions. We produce synthetic quantitative susceptibility maps of paramagnetic rim lesions and show that inclusion of such synthetic data improves classifier performance and provide a multi-channel extension to generate accompanying contrasts and probabilistic segmentation maps. We exploit the projection capability of our trained generative network to demonstrate a novel denoising approach that allows us to train on ambiguous rim cases and substantially increase the minority class. We show that both synthetic lesion synthesis and our proposed rim lesion label denoising method best approximate the unseen rim lesion distribution and improve detection in a clinically interpretable manner. We release our code and generated data at https://github.com/agr78/PRLx-GAN upon publication.

eess.AS [Back]

[268] NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding

Vladimir Bataev,Andrei Andrusenko,Lilit Grigoryan,Aleksandr Laptev,Vitaly Lavrukhin,Boris Ginsburg

Main category: eess.AS

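TL;DR: NGPU-LM is an efficient, parallel statistical n-gram language model optimized for GPU inference that substantially improves computational efficiency while narrowing the accuracy gap between greedy decoding and beam search.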

Details

Motivation: Existing statistical n-gram language models for automatic speech recognition (ASR) are computationally inefficient because of poor parallelization, which limits industrial use. Method: The work redesigns the data structures and introduces NGPU-LM, which supports GPU-optimized inference and customizable greedy decoding for a range of ASR models. Result: With less than 7% computational overhead, NGPU-LM eliminates more than 50% of the accuracy gap between greedy decoding and beam search in out-of-domain scenarios while avoiding the significant slowdown of beam search. Conclusion: NGPU-LM offers an efficient, parallel solution for ASR, and its open-source implementation may encourage industrial adoption. Abstract: Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.
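One common way to bias greedy decoding with an n-gram language model is shallow fusion: at each step, the LM log-probability of each candidate token, scaled by a weight, is added to the ASR model's log-probability before taking the argmax. The abstract above does not spell out its scoring rule, so the sketch below is an assumption illustrating that general technique on CPU with plain dictionaries. It does not reflect NGPU-LM's GPU data structures, and all names and values are hypothetical.

```python
from typing import Dict


def fused_greedy_step(asr_logprobs: Dict[str, float],
                      lm_logprobs: Dict[str, float],
                      lm_weight: float,
                      unseen_logprob: float = -10.0) -> str:
    """Pick the token maximizing asr_logprob + lm_weight * lm_logprob.

    Tokens missing from the LM table receive `unseen_logprob`, a crude
    stand-in for proper n-gram backoff.
    """
    def score(token: str) -> float:
        return asr_logprobs[token] + lm_weight * lm_logprobs.get(token, unseen_logprob)

    return max(asr_logprobs, key=score)


# Toy step: the acoustic model slightly prefers "cap", the biasing LM prefers "cat".
asr = {"cat": -0.9, "cap": -0.7}
lm = {"cat": -0.2, "cap": -3.0}
without_lm = fused_greedy_step(asr, lm, lm_weight=0.0)
with_lm = fused_greedy_step(asr, lm, lm_weight=0.5)
```

Because the fusion happens inside a single greedy pass, no beam of hypotheses is maintained, which is consistent with the low overhead the abstract reports.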

cs.CY [Back]

[269] Conversational Alignment with Artificial Intelligence in Context

Rachel Katharine Sterken,James Ravi Kirkpatrick

Main category: cs.CY

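TL;DR: The paper examines how AI conversational agents can be aligned with human communicative norms, proposes the CONTEXT-ALIGN framework, and argues that current LLM architectures may limit full alignment.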

Details

Motivation: To study the relationship between AI conversational agents and human communicative norms, and to address alignment questions in AI design. Method: Drawing on the philosophical and linguistic literature, the authors propose the CONTEXT-ALIGN framework and analyze the limitations of LLM architectures. Result: A new framework is proposed, together with the observation that LLMs may be unable to achieve full conversational alignment. Conclusion: AI conversational agents need further improvement to align with human communicative norms. Abstract: The development of sophisticated artificial intelligence (AI) conversational agents based on large language models raises important questions about the relationship between human norms, values, and practices and AI design and performance. This article explores what it means for AI agents to be conversationally aligned to human communicative norms and practices for handling context and common ground and proposes a new framework for evaluating developers' design choices. We begin by drawing on the philosophical and linguistic literature on conversational pragmatics to motivate a set of desiderata, which we call the CONTEXT-ALIGN framework, for conversational alignment with human communicative practices. We then suggest that current large language model (LLM) architectures, constraints, and affordances may impose fundamental limitations on achieving full conversational alignment.

eess.SY

[270] CF-DETR: Coarse-to-Fine Transformer for Real-Time Object Detection

Woojin Shin,Donghwa Kang,Byeongyun Park,Brent Byunghoon Kang,Jinkyu Lee,Hyeongboo Baek

Main category: eess.SY

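TL;DR: CF-DETR combines a coarse-to-fine Transformer architecture with the real-time scheduling framework NPFP** to run multiple DETR tasks in autonomous driving systems with both real-time guarantees and high accuracy.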

Details

Motivation: In autonomous driving perception systems, multiple concurrent DETR tasks struggle to meet real-time and high-accuracy requirements at the same time, and existing scheduling methods do not exploit Transformer-specific properties. Method: The authors propose CF-DETR, which uses coarse-to-fine inference, selective fine inference, and multi-level batch inference, combined with the NPFP** scheduling framework to adjust resource allocation dynamically. Result: Experiments show that CF-DETR meets strict timing guarantees on several platforms and achieves significantly higher overall and critical-object detection accuracy. Conclusion: Through dynamic resource allocation and scheduling, CF-DETR balances real-time behavior and accuracy and is suitable for autonomous driving systems. Abstract: Detection Transformers (DETR) are increasingly adopted in autonomous vehicle (AV) perception systems due to their superior accuracy over convolutional networks. However, concurrently executing multiple DETR tasks presents significant challenges in meeting firm real-time deadlines (R1) and high accuracy requirements (R2), particularly for safety-critical objects, while navigating the inherent latency-accuracy trade-off under resource constraints. Existing real-time DNN scheduling approaches often treat models generically, failing to leverage Transformer-specific properties for efficient resource allocation. To address these challenges, we propose CF-DETR, an integrated system featuring a novel coarse-to-fine Transformer architecture and a dedicated real-time scheduling framework NPFP**. CF-DETR employs three key strategies (A1: coarse-to-fine inference, A2: selective fine inference, A3: multi-level batch inference) that exploit Transformer properties to dynamically adjust patch granularity and attention scope based on object criticality, aiming to satisfy R2. The NPFP** scheduling framework (A4) orchestrates these adaptive mechanisms A1-A3. It partitions each DETR task into a safety-critical coarse subtask for guaranteed critical object detection within its deadline (ensuring R1), and an optional fine subtask for enhanced overall accuracy (R2), while managing individual and batched execution. Our extensive evaluations on server, GPU-enabled embedded platforms, and actual AV platforms demonstrate that CF-DETR, under an NPFP** policy, successfully meets strict timing guarantees for critical operations and achieves significantly higher overall and critical object detection accuracy compared to existing baselines across diverse AV workloads.

cs.RO

[271] AMOR: Adaptive Character Control through Multi-Objective Reinforcement Learning

Lucas N. Alegre,Agon Serifi,Ruben Grandia,David Müller,Espen Knoop,Moritz Bächer

Main category: cs.RO

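TL;DR: A multi-objective reinforcement learning framework that trains a single weight-conditioned policy, removing the time-consuming tuning of reward weights in conventional RL, and demonstrates it on dynamic motions of a robot character.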

Details

Motivation: Conventional reinforcement learning relies on a weighted reward function that needs extensive tuning and is hard to adapt to real-world sim-to-real gaps. Method: The authors propose a multi-objective reinforcement learning framework that trains a single policy conditioned on a set of weights, spanning the Pareto front of reward trade-offs. Result: The framework significantly shortens iteration time, supports dynamic motion control, and allows weights to be adjusted dynamically in hierarchical settings. Conclusion: The multi-objective policy encodes diverse behaviors and adapts efficiently to new tasks, making RL more practical for robot control. Abstract: Reinforcement learning (RL) has significantly advanced the control of physics-based and robotic characters that track kinematic reference motion. However, methods typically rely on a weighted sum of conflicting reward functions, requiring extensive tuning to achieve a desired behavior. Due to the computational cost of RL, this iterative process is a tedious, time-intensive task. Furthermore, for robotics applications, the weights need to be chosen such that the policy performs well in the real world, despite inevitable sim-to-real gaps. To address these challenges, we propose a multi-objective reinforcement learning framework that trains a single policy conditioned on a set of weights, spanning the Pareto front of reward trade-offs. Within this framework, weights can be selected and tuned after training, significantly speeding up iteration time. We demonstrate how this improved workflow can be used to perform highly dynamic motions with a robot character. Moreover, we explore how weight-conditioned policies can be leveraged in hierarchical settings, using a high-level policy to dynamically select weights according to the current task. We show that the multi-objective policy encodes a diverse spectrum of behaviors, facilitating efficient adaptation to novel tasks.
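The weight-conditioning idea can be illustrated with a small sketch. It assumes linear scalarization of a reward vector and a policy whose input is the observation concatenated with the weight vector; the abstract does not spell out either detail, so both are assumptions, and all function names are hypothetical.

```python
import random

def sample_weights(n_objectives, rng):
    """Draw a weight vector uniformly from the probability simplex.

    Normalizing independent Exp(1) draws yields a Dirichlet(1, ..., 1) sample.
    """
    raw = [rng.expovariate(1.0) for _ in range(n_objectives)]
    total = sum(raw)
    return [x / total for x in raw]

def scalarize(rewards, weights):
    """Collapse a vector of per-objective rewards into one scalar (assumed linear)."""
    return sum(w * r for w, r in zip(weights, rewards))

def policy_input(observation, weights):
    """Condition the policy by appending the weights to the observation."""
    return list(observation) + list(weights)

rng = random.Random(0)
episode_weights = sample_weights(3, rng)      # resampled per training episode
example_return = scalarize([1.0, 2.0, 3.0], [0.5, 0.25, 0.25])
example_input = policy_input([0.1, 0.2], [0.5, 0.5])
```

Because the weights are an input rather than a constant baked into the reward, a trained policy of this kind can be queried with different weight vectors after training, which is the workflow change the abstract describes.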

Siddharth Ancha,Sunshine Jiang,Travis Manderson,Laura Brandt,Yilun Du,Philip R. Osteen,Nicholas Roy

Main category: cs.RO

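TL;DR: The paper proposes a pixel-wise anomaly detection method based on generative diffusion models that detects anomalies by analyzing which parts of a synthesized image were modified, aimed at robot navigation in unstructured environments.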

Details

Motivation: In unstructured environments, robots must detect anomalies that are out of distribution with respect to the training data in order to navigate safely and reliably. Method: A generative diffusion model synthesizes an edited image with anomalies removed, and the modified regions are analyzed to detect anomalies; a new guided-diffusion inference approach analyzes the ideal guidance gradient and bootstraps the diffusion model to predict guidance gradients. Result: The method needs no retraining or fine-tuning, integrates directly into existing workflows, and achieves accurate anomaly detection when combined with vision-language foundation models. Conclusion: The method provides an efficient anomaly detection solution for robot navigation in complex environments. Abstract: In order to navigate safely and reliably in off-road and unstructured environments, robots must detect anomalies that are out-of-distribution (OOD) with respect to the training data. We present an analysis-by-synthesis approach for pixel-wise anomaly detection without making any assumptions about the nature of OOD data. Given an input image, we use a generative diffusion model to synthesize an edited image that removes anomalies while keeping the remaining image unchanged. Then, we formulate anomaly detection as analyzing which image segments were modified by the diffusion model. We propose a novel inference approach for guided diffusion by analyzing the ideal guidance gradient and deriving a principled approximation that bootstraps the diffusion model to predict guidance gradients. Our editing technique operates purely at test time and can be integrated into existing workflows without the need for retraining or fine-tuning. Finally, we use a combination of vision-language foundation models to compare pixels in a learned feature space and detect semantically meaningful edits, enabling accurate anomaly detection for off-road navigation. Project website: https://siddancha.github.io/anomalies-by-diffusion-synthesis/

[273] TrackVLA: Embodied Visual Tracking in the Wild

Shaoan Wang,Jiazhao Zhang,Minghan Li,Jiahang Liu,Anqi Li,Kui Wu,Fangwei Zhong,Junzhi Yu,Zhizheng Zhang,He Wang

Main category: cs.RO

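TL;DR: TrackVLA is a Vision-Language-Action (VLA) model that couples target recognition with trajectory planning to achieve efficient visual tracking in dynamic environments.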

Details

Motivation: To overcome the performance bottleneck of existing methods that separate target recognition from trajectory planning, especially under severe occlusion and high scene dynamics. Method: TrackVLA uses a shared LLM backbone with a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. Result: It shows SOTA performance in both synthetic and real-world environments, significantly outperforms existing methods in a zero-shot manner, and remains robust at 10 FPS inference speed. Conclusion: By jointly learning target recognition and trajectory planning, TrackVLA substantially improves the performance and generalization of visual tracking. Abstract: Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect diverse difficulty levels of recognition samples, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates SOTA performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at 10 FPS inference speed. Our project page is: https://pku-epic.github.io/TrackVLA-web.

[274] Autoregressive Meta-Actions for Unified Controllable Trajectory Generation

Jianbo Zhao,Taiyu Ban,Xiyang Wang,Qibin Zhou,Hangning Zhou,Zhihao Liu,Mu Yang,Lei Liu,Bin Li

Main category: cs.RO

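TL;DR: The paper proposes Autoregressive Meta-Actions, which address the temporal misalignment between meta-actions and generated trajectories in autonomous driving by decomposing long-interval meta-actions into frame-level meta-actions, yielding more precise trajectory prediction.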

Details

Motivation: Existing autonomous driving frameworks rely on meta-actions assigned over fixed time intervals, which are temporally inconsistent with the actual trajectories and hurt task coherence and model performance. Method: The authors propose Autoregressive Meta-Actions, which decompose long-interval meta-actions into frame-level meta-actions so that meta-actions and trajectories are strictly aligned, and use staged pre-training to separate the learning of basic motion dynamics from high-level decision control. Result: Experiments show improved trajectory adaptivity and responsiveness to dynamic decision-making. Conclusion: Autoregressive Meta-Actions resolve the temporal alignment problem between meta-actions and trajectories and improve the performance of autonomous driving systems. Abstract: Controllable trajectory generation guided by high-level semantic decisions, termed meta-actions, is crucial for autonomous driving systems. A significant limitation of existing frameworks is their reliance on invariant meta-actions assigned over fixed future time intervals, causing temporal misalignment with the actual behavior trajectories. This misalignment leads to irrelevant associations between the prescribed meta-actions and the resulting trajectories, disrupting task coherence and limiting model performance. To address this challenge, we introduce Autoregressive Meta-Actions, an approach integrated into autoregressive trajectory generation frameworks that provides a unified and precise definition for meta-action-conditioned trajectory prediction. Specifically, we decompose traditional long-interval meta-actions into frame-level meta-actions, enabling a sequential interplay between autoregressive meta-action prediction and meta-action-conditioned trajectory generation. This decomposition ensures strict alignment between each trajectory segment and its corresponding meta-action, achieving a consistent and unified task formulation across the entire trajectory span and significantly reducing complexity. Moreover, we propose a staged pre-training process to decouple the learning of basic motion dynamics from the integration of high-level decision control, which offers flexibility, stability, and modularity. Experimental results validate our framework's effectiveness, demonstrating improved trajectory adaptivity and responsiveness to dynamic decision-making scenarios. We provide the video document and dataset, which are available at https://arma-traj.github.io/.

[275] Mobi-$π$: Mobilizing Your Robot Learning Policy

Jingyun Yang,Isabella Huang,Brandon Vu,Max Bajracharya,Rika Antonova,Jeannette Bohg

Main category: cs.RO

TL;DR: 论文提出了一种解决机器人策略在新环境中泛化能力不足的方法，通过优化机器人基座姿态来适配已学习的策略，无需额外训练。

Details

Motivation: Existing visuomotor policies are trained from a limited set of viewpoints, so they generalize poorly to new robot positions, which limits their use on mobile platforms. Method: The authors propose the Mobi-$\pi$ framework, comprising evaluation metrics, simulated tasks, visualization tools, and baseline methods, and optimize the base pose using 3D Gaussian Splatting and sampling-based optimization. Result: The proposed approach outperforms baselines in both simulation and real-world environments. Conclusion: Policy mobilization improves policy generalization by optimizing the base pose and remains compatible with existing methods for improving robustness. Abstract: Learned visuomotor policies are capable of performing increasingly complex manipulation tasks. However, most of these policies are trained on data collected from limited robot positions and camera viewpoints. This leads to poor generalization to novel robot positions, which limits the use of these policies on mobile platforms, especially for precise tasks like pressing buttons or turning faucets. In this work, we formulate the policy mobilization problem: find a mobile robot base pose in a novel environment that is in distribution with respect to a manipulation policy trained on a limited set of camera viewpoints. Compared to retraining the policy itself to be more robust to unseen robot base pose initializations, policy mobilization decouples navigation from manipulation and thus does not require additional demonstrations. Crucially, this problem formulation complements existing efforts to improve manipulation policy robustness to novel viewpoints and remains compatible with them. To study policy mobilization, we introduce the Mobi-$\pi$ framework, which includes: (1) metrics that quantify the difficulty of mobilizing a given policy, (2) a suite of simulated mobile manipulation tasks based on RoboCasa to evaluate policy mobilization, (3) visualization tools for analysis, and (4) several baseline methods. We also propose a novel approach that bridges navigation and manipulation by optimizing the robot's base pose to align with an in-distribution base pose for a learned policy. Our approach utilizes 3D Gaussian Splatting for novel view synthesis, a score function to evaluate pose suitability, and sampling-based optimization to identify optimal robot poses. We show that our approach outperforms baselines in both simulation and real-world environments, demonstrating its effectiveness for policy mobilization.

cs.AI

[276] Decomposing Elements of Problem Solving: What "Math" Does RL Teach?

Tian Qin,Core Francisco Park,Mujin Kwun,Aaron Walsman,Eran Malach,Nikhil Anand,Hidenori Tanaka,David Alvarez-Melis

Main category: cs.AI

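TL;DR: The paper decomposes mathematical problem solving into three basic capabilities (Plan, Execute, Verify) and finds that GRPO mainly improves execution, while RL-trained models hit a 'coverage wall' on new problems. A synthetic task confirms that RL mainly strengthens execution robustness and is used to explore conditions for overcoming the coverage wall.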

Details

Motivation: Existing accuracy metrics cannot assess LLM reasoning at a fine-grained level, so a deeper understanding is needed of how RL methods such as GRPO affect problem-solving ability. Method: The authors decompose problem solving into Plan, Execute, and Verify capabilities and use a synthetic task that mimics mathematical problem solving to examine the effect of RL. Result: GRPO mainly improves execution (a phenomenon called temperature distillation), but RL-trained models hit a 'coverage wall' on new problems because of insufficient planning skills. The synthetic task confirms that RL strengthens execution robustness and reveals conditions for overcoming the coverage wall. Conclusion: The study clarifies the role and the limits of RL in improving LLM reasoning and points to a way past the coverage wall. Abstract: Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.

[277] Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning

Massimiliano Pronesti,Michela Lorandi,Paul Flanagan,Oisin Redmon,Anya Belz,Yufang Hou

Main category: cs.AI

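TL;DR: The paper proposes a quantitative-reasoning approach that extracts structured numeric evidence and applies domain-knowledge-informed logic to improve the accuracy of study-level conclusions in medical systematic reviews.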

Details

Motivation: Prior methods rely on shallow textual cues and fail to capture the numeric reasoning behind expert assessments, which limits the accuracy and interpretability of automated systematic reviews. Method: The authors build a numeric reasoning system consisting of a numeric data extraction model and an effect estimate component, and train the extraction model with supervised fine-tuning (SFT) and reinforcement learning (RL). Result: On the CochraneForest benchmark, the RL-trained small-scale number extraction model improves F1 by up to 21% (absolute) over retrieval-based systems and by up to 9% over general-purpose LLMs of over 400B parameters. Conclusion: Reasoning-driven approaches are promising for automating systematic evidence synthesis. Abstract: Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level conclusions for specific outcomes and comparisons. Prior work has framed this problem as a textual inference task by retrieving relevant content fragments and inferring conclusions from them. However, such approaches often rely on shallow textual cues and fail to capture the underlying numeric reasoning behind expert assessments. In this work, we conceptualise the problem as one of quantitative reasoning. Rather than inferring conclusions from surface text, we extract structured numerical evidence (e.g., event counts or standard deviations) and apply domain-knowledge-informed logic to derive outcome-specific conclusions. We develop a numeric reasoning system composed of a numeric data extraction model and an effect estimate component, enabling more accurate and interpretable inference aligned with the domain expert principles. We train the numeric data extraction model using different strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model. When evaluated on the CochraneForest benchmark, our best-performing approach -- using RL to train a small-scale number extraction model -- yields up to a 21% absolute improvement in F1 score over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9%. Our results demonstrate the promise of reasoning-driven approaches for automating systematic evidence synthesis.
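As an illustration of what "extract event counts, then apply domain logic" can look like, the sketch below computes a risk ratio with a 95% confidence interval from two-arm event counts and maps it to a conclusion label. The risk ratio and its log-scale standard error are the standard textbook formulas; the decision rule, the label strings, and the assumption that events are an undesirable outcome are hypothetical and are not taken from the paper's effect estimate component.

```python
import math

def risk_ratio_ci(events_t, total_t, events_c, total_c, z=1.96):
    """Risk ratio (treatment vs control) with a Wald interval on the log scale."""
    rr = (events_t / total_t) / (events_c / total_c)
    se_log = math.sqrt(1 / events_t - 1 / total_t + 1 / events_c - 1 / total_c)
    low = math.exp(math.log(rr) - z * se_log)
    high = math.exp(math.log(rr) + z * se_log)
    return rr, low, high

def conclude(events_t, total_t, events_c, total_c):
    """Hypothetical rule: events are assumed to be an adverse outcome."""
    _, low, high = risk_ratio_ci(events_t, total_t, events_c, total_c)
    if high < 1.0:
        return "favours intervention"
    if low > 1.0:
        return "favours control"
    return "no significant difference"

rr, low, high = risk_ratio_ci(15, 100, 30, 100)
clear_case = conclude(15, 100, 30, 100)
unclear_case = conclude(20, 100, 22, 100)
```

The point of the sketch is that once the four counts are extracted correctly, the conclusion follows from arithmetic that can be inspected, rather than from surface wording in the trial report. Zero-event arms would need a continuity correction, which is omitted here.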

[278] Be.FM: Open Foundation Models for Human Behavior

Yutong Xie,Zhuoheng Li,Xiyuan Wang,Yijun Pan,Qijia Liu,Xingzhi Cui,Kuang-Yu Lo,Ruoyi Gao,Xingjian Zhang,Jin Huang,Walter Yuan,Matthew O. Jackson,Qiaozhu Mei

Main category: cs.AI

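TL;DR: Be.FM is a behavioral foundation model built on open-source large language models for understanding and predicting human behavior, and it performs well across multiple tasks.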

Details

Motivation: To explore the potential of foundation models for modeling human behavior, which remains largely unexplored. Method: Be.FM is built on open-source large language models and fine-tuned on diverse behavioral data. Result: Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge. Conclusion: Be.FM demonstrates the potential of foundation models for human behavior modeling and suggests directions for future research. Abstract: Despite their success in numerous fields, the potential of foundation models for modeling and understanding human behavior remains largely unexplored. We introduce Be.FM, one of the first open foundation models designed for human behavior modeling. Built upon open-source large language models and fine-tuned on a diverse range of behavioral data, Be.FM can be used to understand and predict human decision-making. We construct a comprehensive set of benchmark tasks for testing the capabilities of behavioral foundation models. Our results demonstrate that Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge.

[279] Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models

Zeyu Liu,Yuhang Liu,Guanghao Zhu,Congkai Xie,Zhen Li,Jianbo Yuan,Xinyao Wang,Qing Li,Shing-Chi Cheung,Shengyu Zhang,Fei Wu,Hongxia Yang

Main category: cs.AI

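TL;DR: The paper proposes the Infi-MMR framework, a three-phase curriculum that improves the reasoning ability of Multimodal Small Language Models (MSLMs) and achieves state-of-the-art results on several benchmarks.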

Details

Motivation: Although large language models (LLMs) have advanced in reasoning, Multimodal Small Language Models (MSLMs) face scarce datasets and a loss of reasoning ability caused by integrating visual processing. Method: A three-phase framework (Foundational Reasoning Activation, Cross-Modal Reasoning Adaptation, and Multimodal Reasoning Enhancement) progressively improves the reasoning ability of MSLMs. Result: Infi-MMR-3B performs strongly on multimodal math reasoning (e.g., 43.68% on MathVerse testmini) and general reasoning (e.g., 67.2% on MathVista testmini). Conclusion: The Infi-MMR framework addresses the reasoning challenges of MSLMs and offers a new approach to multimodal reasoning. Abstract: Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini).

[280] Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns

Xiang Li,Haiyang Yu,Xinghua Zhang,Ziyang Huang,Shizhu He,Kang Liu,Jun Zhao,Fei Huang,Yongbin Li

Main category: cs.AI

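TL;DR: The paper introduces Socratic-PRMBench, a new benchmark for systematically evaluating Process Reward Models (PRMs) under six reasoning patterns, and reveals deficiencies in existing PRMs.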

Details

Motivation: Existing benchmarks mainly evaluate stepwise correctness and lack a systematic evaluation of PRMs under different reasoning patterns. Method: The authors propose Socratic-PRMBench, which contains 2995 flawed reasoning paths covering six reasoning patterns. Result: Experiments show notable deficiencies in how existing PRMs evaluate reasoning under diverse patterns. Conclusion: Socratic-PRMBench provides a comprehensive testbed for systematic PRM evaluation and supports the future development of PRMs. Abstract: Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.

[281] ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Chenyu Yang,Shiqian Su,Shi Liu,Xuan Dong,Yue Yu,Weijie Su,Xuehui Wang,Zhaoyang Liu,Jinguo Zhu,Hao Li,Wenhai Wang,Yu Qiao,Xizhou Zhu,Jifeng Dai

Main category: cs.AI

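TL;DR: ZeroGUI is an online learning framework that automates GUI agent training without human annotation, improving performance through automatic task generation and reward estimation.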

Details

Motivation: Existing offline learning methods depend on manual annotation and adapt poorly to dynamic environments; ZeroGUI aims to solve both problems. Method: It combines VLM-based automatic task generation, VLM-based reward estimation, and two-stage online reinforcement learning. Result: It significantly improves the performance of UI-TARS and Aguvis. Conclusion: ZeroGUI offers an efficient, scalable solution for GUI agent training. Abstract: The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.

cs.SE

[282] SWE-bench Goes Live!

Linghao Zhang,Shilin He,Chaoyun Zhang,Yu Kang,Bowen Li,Chengxing Xie,Junhao Wang,Maoquan Wang,Yufan Huang,Shengyu Fu,Elsie Nallipogu,Qingwei Lin,Yingnong Dang,Saravan Rajmohan,Dongmei Zhang

Main category: cs.SE

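TL;DR: SWE-bench-Live is a continuously updated benchmark that addresses the limitations of static benchmarks such as SWE-bench; with an automated pipeline and a live task set, it provides a more rigorous and scalable framework for evaluating LLMs on software issue resolution.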

Details

Motivation: Existing benchmarks such as SWE-bench are not updated, cover a narrow set of repositories, and depend on manual effort, which limits scalability and risks overfitting and data contamination. Method: The authors present SWE-bench-Live, with 1,319 tasks derived from real GitHub issues across 93 repositories, each with a Docker image for reproducibility; an automated curation pipeline handles instance creation and environment setup. Result: Experiments show that, even under controlled conditions, existing LLMs and agent frameworks perform substantially worse on SWE-bench-Live than on static benchmarks. Conclusion: With a dynamic, diverse task set and an automated pipeline, SWE-bench-Live provides a more reliable, contamination-resistant benchmark for evaluating LLMs in real software development settings. Abstract: The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.

[283] Identity resolution of software metadata using Large Language Models

Eva Martín del Pico,Josep Lluís Gelpí,Salvador Capella-Gutiérrez

Main category: cs.SE

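TL;DR: The article discusses the importance of research software and evaluates instruction-tuned large language models on software metadata identity resolution, as a step toward assembling a unified collection of research software.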

Details

Motivation: Research software is essential to research but receives less attention than research data. The article aims to improve understanding of software development and sustainability by consolidating and analyzing software metadata. Method: Using metadata from platforms such as bio.tools, Bioconductor, and Galaxy ToolShed, the authors evaluate instruction-tuned large language models on software metadata identity resolution and introduce an agreement-based proxy for high-confidence automated decisions. Result: The proxy achieves high precision and statistical robustness, while also exposing the limitations of current models and the challenges of automating semantic judgment. Conclusion: Consolidating software metadata requires handling heterogeneity and scale; instruction-tuned models show promise for identity resolution but need improvement to handle semantic challenges across platforms. Abstract: Software is an essential component of research. However, little attention has been paid to it compared with that paid to research data. Recently, there has been an increase in efforts to acknowledge and highlight the importance of software in research activities. Structured metadata from platforms like bio.tools, Bioconductor, and Galaxy ToolShed offers valuable insights into research software in the Life Sciences. Although originally intended to support discovery and integration, this metadata can be repurposed for large-scale analysis of software practices. However, its quality and completeness vary across platforms, reflecting diverse documentation practices. To gain a comprehensive view of software development and sustainability, consolidating this metadata is necessary, but requires robust mechanisms to address its heterogeneity and scale. This article presents an evaluation of instruction-tuned large language models for the task of software metadata identity resolution, a critical step in assembling a cohesive collection of research software. Such a collection is the reference component for the Software Observatory at OpenEBench, a platform that aggregates metadata to monitor the FAIRness of research software in the Life Sciences. We benchmarked multiple models against a human-annotated gold standard, examined their behavior on ambiguous cases, and introduced an agreement-based proxy for high-confidence automated decisions. The proxy achieved high precision and statistical robustness, while also highlighting the limitations of current models and the broader challenges of automating semantic judgment in FAIR-aligned software metadata across registries and repositories.
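One plausible reading of an agreement-based proxy is sketched below: a pair of metadata records is resolved automatically only when enough models give the same verdict, and is otherwise deferred to a human curator. The abstract does not define the proxy, so the threshold, the label strings, and the function name are all hypothetical.

```python
from collections import Counter

def agreement_decision(votes, min_agreement=1.0):
    """Return the majority label if its share reaches min_agreement, else None.

    votes: one verdict per model for the same record pair, e.g. "same" or
    "different" (hypothetical labels). None means "defer to a human".
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return None

unanimous = agreement_decision(["same", "same", "same"])
split = agreement_decision(["same", "different", "same"])
relaxed = agreement_decision(["same", "different", "same"], min_agreement=0.6)
```

Requiring unanimity trades coverage for precision: fewer pairs are decided automatically, but those that are tend to be the unambiguous ones.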

[284] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Manish Shetty,Naman Jain,Jinjian Liu,Vijay Kethanaboyina,Koushik Sen,Ion Stoica

Main category: cs.SE

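TL;DR: GSO is a benchmark for evaluating language models on high-performance software development; an automated pipeline runs performance tests over repository histories to identify optimization tasks. Current agents perform poorly, with success rates below 5%.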

Details

Motivation: Developing high-performance software is complex and requires specialized expertise, so the capabilities of language models in this area need to be evaluated. Method: The authors build an automated pipeline that generates performance tests and analyzes repository histories to identify optimization tasks, then test whether agents can improve runtime efficiency. Result: Leading SWE-Agents achieve less than a 5% success rate, with limited improvement from inference-time scaling; qualitative analysis identifies key failure modes. Conclusion: GSO provides a benchmark and tools for future research and exposes the shortcomings of current agents in high-performance software development. Abstract: Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories, identifying 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages. An agent is provided with a codebase and performance test as a precise specification, and tasked to improve the runtime efficiency, which is measured against the expert developer optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, practicing lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.

cs.SD

[285] Nosey: Open-source hardware for acoustic nasalance

Maya Dewhurst,Jack Collins,Justin J. H. Lo,Roy Alderton,Sam Kirkham

Main category: cs.SD

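TL;DR: Nosey is a low-cost, customizable, open-source hardware system for recording acoustic nasalance data; it captures contrasts comparable to a commercial device at lower cost.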

Details

Motivation: To develop a low-cost, open-source system for recording nasalance data as an alternative to expensive commercial devices. Method: The authors design and 3D-print the hardware, compare its performance with a commercial device, and discuss customization options. Result: Nosey gives consistently higher nasalance scores than the commercial device, but the magnitude of contrast between phonological environments is comparable; the hardware can be customized flexibly. Conclusion: Nosey is a flexible, cost-effective alternative to commercial devices and is suitable for data collection. Abstract: We introduce Nosey (Nasalance Open Source Estimation sYstem), a low-cost, customizable, 3D-printed system for recording acoustic nasalance data that we have made available as open-source hardware (http://github.com/phoneticslab/nosey). We first outline the motivations and design principles behind our hardware nasalance system, and then present a comparison between Nosey and a commercial nasalance device. Nosey shows consistently higher nasalance scores than the commercial device, but the magnitude of contrast between phonological environments is comparable between systems. We also review ways of customizing the hardware to facilitate testing, such as comparison of microphones and different construction materials. We conclude that Nosey is a flexible and cost-effective alternative to commercial nasometry devices and propose some methodological considerations for its use in data collection.
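For readers unfamiliar with the measure, acoustic nasalance is conventionally the nasal channel's share of the combined nasal and oral acoustic energy, expressed as a percentage. The sketch below computes it from RMS amplitudes of two sample buffers. It is a generic illustration, not Nosey's processing pipeline: the function names are hypothetical, and the band-pass filtering and windowing that real nasometry applies are omitted.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def nasalance_percent(nasal_samples, oral_samples):
    """Nasalance = 100 * nasal / (nasal + oral), using RMS amplitude per channel."""
    nasal = rms(nasal_samples)
    oral = rms(oral_samples)
    if nasal + oral == 0:
        raise ValueError("both channels are silent")
    return 100.0 * nasal / (nasal + oral)

# Toy buffers: the nasal channel is three times louder than the oral channel.
score = nasalance_percent([0.3, -0.3, 0.3, -0.3], [0.1, -0.1, 0.1, -0.1])
```

Because the score is a ratio of the two channels, differences in microphone sensitivity or acoustic separation between devices can shift absolute values while leaving relative contrasts intact, which is consistent with the comparison reported in the abstract.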

[286] Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation

Hao Li,Ju Dai,Xin Zhao,Feng Zhou,Junjun Pan,Lei Li

Main category: cs.SD

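TL;DR: The paper proposes the Wav2Sem module, which uses semantic decoupling to mitigate the averaging effect of phonetically similar syllables in 3D speech-driven facial animation.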

Details

Motivation: In existing methods, self-supervised audio encoders couple phonetically similar syllables in feature space, which reduces the precision of lip-shape generation. Method: Proposes the Wav2Sem module, which extracts semantic features of the whole audio sequence and uses them to decouple audio encodings in feature space. Result: Experiments show that Wav2Sem significantly alleviates the averaging effect of phonetically similar syllables and improves the precision and naturalness of the animation. Conclusion: Wav2Sem effectively addresses audio feature coupling and offers a better solution for speech-driven facial animation. Abstract: In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to the averaging effect in subsequent lip motion generation. To address this issue, this paper proposes a plug-and-play semantic decorrelation module, Wav2Sem. This module extracts semantic features corresponding to the entire audio sequence, leveraging the added semantic information to decorrelate audio encodings within the feature space, thereby achieving more expressive audio features. Extensive experiments across multiple speech-driven models indicate that the Wav2Sem module effectively decouples audio features, significantly alleviating the averaging effect of phonetically similar syllables in lip shape generation, thereby enhancing the precision and naturalness of facial animations. Our source code is available at https://github.com/wslh852/Wav2Sem.git.

[287] Semantics-Aware Human Motion Generation from Audio Instructions

Zi-An Wang,Shihao Zou,Shiyao Yu,Mingyuan Zhang,Chao Dong

Main category: cs.SD

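TL;DR: The paper proposes an end-to-end framework that generates semantically aligned motion from audio signals, using a masked generative transformer and a memory-retrieval attention module to handle sparse, lengthy audio inputs.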

Details

Motivation: Audio is a natural and intuitive mode of interaction, but existing methods mostly focus on rhythm matching, leaving a weak link between audio semantics and generated motion. Method: Uses a masked generative transformer with a memory-retrieval attention module to process audio inputs, and augments datasets by converting descriptions to conversational style and generating audio with varied speaker identities. Result: Experiments show the framework is effective and efficient; audio instructions convey semantics similar to text while offering more practical, user-friendly interaction. Conclusion: Audio signals can serve as a powerful means of semantic encoding, opening a new direction for interactive technologies. Abstract: Recent advances in interactive technologies have highlighted the prominence of audio signals for semantic encoding. This paper explores a new task, where audio signals are used as conditioning inputs to generate motions that align with the semantics of the audio. Unlike text-based interactions, audio provides a more natural and intuitive communication method. However, existing methods typically focus on matching motions with music or speech rhythms, which often results in a weak connection between the semantics of the audio and generated motions. We propose an end-to-end framework using a masked generative transformer, enhanced by a memory-retrieval attention module to handle sparse and lengthy audio inputs. Additionally, we enrich existing datasets by converting descriptions into conversational style and generating corresponding audio with varied speaker identities. Experiments demonstrate the effectiveness and efficiency of the proposed framework, showing that audio instructions can convey semantics similar to text while providing more practical and user-friendly interactions.

[288] ZeroSep: Separate Anything in Audio with Zero Training

Chao Huang,Yuesheng Ma,Junxuan Huang,Susan Liang,Yunlong Tang,Jing Bi,Wenqiang Liu,Nima Mesgarani,Chenliang Xu

Main category: cs.SD

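TL;DR: ZeroSep uses a pre-trained text-guided audio diffusion model for zero-shot audio source separation; it needs no task-specific training, supports open-set scenarios, and surpasses supervised methods.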

Details

Motivation: Current supervised deep learning methods need large amounts of labeled data and generalize poorly to complex real-world acoustic scenes; inspired by generative foundation models, the paper asks whether pre-trained diffusion models can overcome these limits. Method: Inverts the mixed audio into the diffusion model's latent space and uses text conditioning to guide denoising toward individual sources, with no task-specific training or fine-tuning. Result: ZeroSep performs strongly on multiple separation benchmarks, surpassing supervised methods. Conclusion: Pre-trained text-guided audio diffusion models can perform zero-shot source separation, support open-set scenarios, and deliver strong performance. Abstract: Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.

q-bio.TO

[289] Physiology-Informed Generative Multi-Task Network for Contrast-Free CT Perfusion

Wasif Khan,Kyle B. See,Simon Kato,Ziqian Huang,Amy Lazarte,Kyle Douglas,Xiangyang Lou,Teng J. Peng,Dhanashree Rajderkar,John Rees,Pina Sanelli,Amita Singh,Ibrahim Tuna,Christina A. Wilson,Ruogu Fang

Main category: q-bio.TO

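TL;DR: Proposes MAGIC, a deep learning framework that combines generative AI with physiological information to map non-contrast CT images to contrast-free CTP maps, avoiding the risks and cost of contrast agents in conventional CTP imaging.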

Details

Motivation: Conventional CTP imaging relies on contrast agents, which can cause allergic reactions and side effects and are costly. Method: Combines generative AI with physiological information in the MAGIC framework to generate contrast-free CTP maps from non-contrast CT. Result: MAGIC shows strong image fidelity and diagnostic accuracy and compares favorably with clinical contrast-based perfusion imaging. Conclusion: MAGIC could change clinical practice by offering contrast-free, low-cost, rapid perfusion imaging. Abstract: Perfusion imaging is extensively utilized to assess hemodynamic status and tissue perfusion in various organs. Computed tomography perfusion (CTP) imaging plays a key role in the early assessment and planning of stroke treatment. While CTP provides essential perfusion parameters to identify abnormal blood flow in the brain, the use of contrast agents in CTP can lead to allergic reactions and adverse side effects, along with costing USD 4.9 billion worldwide in 2022. To address these challenges, we propose a novel deep learning framework called Multitask Automated Generation of Intermodal CT perfusion maps (MAGIC). This framework combines generative artificial intelligence and physiological information to map non-contrast computed tomography (CT) imaging to multiple contrast-free CTP imaging maps. We demonstrate enhanced image fidelity by incorporating physiological characteristics into the loss terms. Our network was trained and validated using CT image data from patients referred for stroke at UF Health and demonstrated robustness to abnormalities in brain perfusion activity. A double-blinded study was conducted involving seven experienced neuroradiologists and vascular neurologists. This study validated MAGIC's visual quality and diagnostic accuracy showing favorable performance compared to clinical perfusion imaging with intravenous contrast injection. Overall, MAGIC holds great promise in revolutionizing healthcare by offering contrast-free, cost-effective, and rapid perfusion imaging.

cs.HC

[290] Errors in Stereo Geometry Induce Distance Misperception

Raffles Xingqi Zhu,Charlie S. Burlingham,Olivier Mercier,Phillip Guan

Main category: cs.HC

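TL;DR: The paper presents a geometric framework that predicts distance-perception errors caused by inaccurate HMD perspective geometry and validates it experimentally.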

Details

Motivation: Study how rendering and viewing errors in HMDs affect the depth and distance a user perceives, in order to improve the realism of virtual reality. Method: Builds a geometric framework that predicts the errors and tests it on a Quest 3 HMD platform in a series of five experiments. Result: Errors in perspective geometry cause both over- and under-estimation of perceived distance; real-time visual feedback can dynamically recalibrate visuomotor mapping. Conclusion: The geometric framework predicts the errors effectively, and real-time feedback can compensate for perceptual error and improve HMD accuracy. Abstract: Stereoscopic head-mounted displays (HMDs) render and present binocular images to create an egocentric, 3D percept to the HMD user. Within this render and presentation pipeline there are potential rendering camera and viewing position errors that can induce deviations in the depth and distance that a user perceives compared to the underlying intended geometry. For example, rendering errors can arise when HMD render cameras are incorrectly positioned relative to the assumed centers of projections of the HMD displays and viewing errors can arise when users view stereo geometry from the incorrect location in the HMD eyebox. In this work we present a geometric framework that predicts errors in distance perception arising from inaccurate HMD perspective geometry and build an HMD platform to reliably simulate render and viewing error in a Quest 3 HMD with eye tracking to experimentally test these predictions. We present a series of five experiments to explore the efficacy of this geometric framework and show that errors in perspective geometry can induce both under- and over-estimations in perceived distance. We further demonstrate how real-time visual feedback can be used to dynamically recalibrate visuomotor mapping so that an accurate reach distance is achieved even if the perceived visual distance is negatively impacted by geometric error.

[291] Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge

Yupei Li,Shuaijie Shao,Manuel Milling,Björn W. Schuller

Main category: cs.HC

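TL;DR: The paper proposes an LLM-based approach that combines multimodal data with psychological knowledge for depression detection, improving on the baseline's error metrics.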

Details

Motivation: Existing DNN and LLM methods for depression detection handle non-textual cues poorly and do not integrate psychological knowledge, which limits real-world effectiveness. Method: Uses the DAIC-WOZ dataset, extracts audio features with Wav2Vec, and injects psychological knowledge into the LLM through a question-and-answer set. Result: MAE and RMSE both improve notably over the baseline. Conclusion: Combining multimodal data with psychological knowledge improves the accuracy of depression detection. Abstract: Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract the audio features using the pre-trained model Wav2Vec, and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question and answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to a base score proposed by the related original paper. The codes are available at https://github.com/myxp-lyp/Depression-detection.git
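The two metrics named in the abstract have standard definitions, sketched below for readers who want to reproduce such a comparison. The severity scores in the example are made up for illustration and are not taken from DAIC-WOZ or from the paper.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """MAE: average of |prediction - target|."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """RMSE: square root of the average squared error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical questionnaire scores (targets) and model predictions.
targets = [10, 12, 8]
predictions = [12, 12, 5]
mae = mean_absolute_error(targets, predictions)
rmse = root_mean_square_error(targets, predictions)
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both indicates whether an improvement comes from fewer outliers or from a uniform shift.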

[292] Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education

Boning Zhao

Main category: cs.HC

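TL;DR: The paper proposes HEAE, a human-centered AI framework that combines student narrative text with a teacher-derived "Empathy Vector" to make depression severity assessment more transparent and socially responsible.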

Details

Motivation: In sensitive settings such as special education, standardized questionnaires and automated methods struggle to assess student depression accurately and lack the individualized insight that comes from teachers' empathy. Method: HEAE integrates student narrative text with a teacher-derived 9-dimensional "Empathy Vector" and optimizes the multimodal fusion and classification architecture for 7-level severity classification. Result: The method reaches 82.74% accuracy on the 7-level classification task. Conclusion: HEAE offers a more responsible and ethical path for affective computing by structurally embedding human empathy, enhancing rather than replacing human judgment. Abstract: Assessing student depression in sensitive environments like special education is challenging. Standardized questionnaires may not fully reflect students' true situations. Furthermore, automated methods often falter with rich student narratives, lacking the crucial, individualized insights stemming from teachers' empathetic connections with students. Existing methods often fail to address this ambiguity or effectively integrate educator understanding. To address these limitations by fostering a synergistic human-AI collaboration, this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered AI framework for transparent and socially responsible depression severity assessment. Our approach uniquely integrates student narrative text with a teacher-derived, 9-dimensional "Empathy Vector" (EV), its dimensions guided by the PHQ-9 framework, to explicitly translate tacit empathetic insight into a structured AI input enhancing rather than replacing human judgment. Rigorous experiments optimized the multimodal fusion, text representation, and classification architecture, achieving 82.74% accuracy for 7-level severity classification. This work demonstrates a path toward more responsible and ethical affective computing by structurally embedding human empathy.

[293] MAC-Gaze: Motion-Aware Continual Calibration for Mobile Gaze Tracking

Yaxiong Lei,Mingyue Zhao,Yuheng Wang,Shijing He,Yusuke Sugano,Yafei Wang,Kaixing Zhao,Mohamed Khamis,Juan Ye

Main category: cs.HC

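TL;DR: MAC-Gaze proposes a motion-aware continual calibration approach that uses smartphone IMU sensors and continual learning to update the gaze tracking model dynamically, substantially improving gaze estimation accuracy in mobile scenarios.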

Details

Motivation: Traditional one-off calibration cannot adapt to changing user postures and device orientations, so performance degrades. Method: Combines a pre-trained visual gaze estimator with an IMU-based activity recognition model, uses a clustering-based hybrid decision mechanism to trigger recalibration, and applies replay-based continual learning to avoid catastrophic forgetting. Result: Gaze estimation error falls by 19.9% on RGBDGaze and by 31.7% on MotionGaze. Conclusion: MAC-Gaze provides a robust continual calibration solution for gaze tracking in mobile scenarios. Abstract: Mobile gaze tracking faces a fundamental challenge: maintaining accuracy as users naturally change their postures and device orientations. Traditional calibration approaches, such as one-off calibration, fail to adapt to these dynamic conditions, leading to degraded performance over time. We present MAC-Gaze, a Motion-Aware continual Calibration approach that leverages smartphone inertial measurement unit (IMU) sensors and continual learning techniques to automatically detect changes in user motion states and update the gaze tracking model accordingly. Our system integrates a pre-trained visual gaze estimator and an IMU-based activity recognition model with a clustering-based hybrid decision-making mechanism that triggers recalibration when motion patterns deviate significantly from previously encountered states. To enable accumulative learning of new motion conditions while mitigating catastrophic forgetting, we employ replay-based continual learning, allowing the model to maintain performance across previously encountered motion conditions. We evaluate our system through extensive experiments on the publicly available RGBDGaze dataset and our own 10-hour multimodal MotionGaze dataset (481K+ images, 800K+ IMU readings), encompassing a wide range of postures under various motion conditions including sitting, standing, lying, and walking. Results demonstrate that our method reduces gaze estimation error by 19.9% on RGBDGaze (from 1.73 cm to 1.41 cm) and by 31.7% on MotionGaze (from 2.81 cm to 1.92 cm) compared to traditional calibration approaches. Our framework provides a robust solution for maintaining gaze estimation accuracy in mobile scenarios.

cs.LG

[294] FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

Aniruddha Nrusimha,William Brandon,Mayank Mishra,Yikang Shen,Rameswar Panda,Jonathan Ragan-Kelley,Yoon Kim

Main category: cs.LG

TL;DR: FlashFormer is a kernel optimized for single-batch inference that speeds up inference for transformer-based large language models.

Details

Motivation: The compute characteristics of modern large language models make specialized kernels important, but existing kernels mainly target large-batch training and inference, leaving low-batch inference (edge deployment, latency-sensitive applications) underserved. Method: Presents FlashFormer, a proof-of-concept kernel optimized for single-batch inference. Result: Across model sizes and quantization settings, FlashFormer achieves nontrivial speedups over existing state-of-the-art inference kernels. Conclusion: FlashFormer offers an efficient option for low-batch inference with practical potential. Abstract: The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for training and inference. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest such as in edge deployment and latency-sensitive applications. This paper describes FlashFormer, a proof-of-concept kernel for accelerating single-batch inference for transformer-based large language models. Across various model sizes and quantization settings, we observe nontrivial speedups compared to existing state-of-the-art inference kernels.

[295] DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration

Tianteng Gu,Bei Liu,Bo Xiao,Ke Zeng,Jiacheng Liu,Yanmin Qian

Main category: cs.LG

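TL;DR: Proposes DenoiseRotator, a method that redistributes parameter importance to make models more robust to pruning, substantially reducing the resulting performance loss.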

Details

Motivation: Existing pruning methods focus on estimating the importance of individual weights, which makes it hard to preserve critical model capabilities, especially under semi-structured sparsity constraints. Method: Minimizes the information entropy of normalized importance scores to concentrate importance onto a smaller subset of weights, implemented with learnable orthogonal transformations (DenoiseRotator). Result: On LLaMA3, Qwen2.5, and Mistral models, DenoiseRotator consistently narrows the perplexity gap; for example, it reduces the gap by 58% on LLaMA3-70B. Conclusion: DenoiseRotator is model-agnostic, integrates with existing pruning techniques, and improves post-pruning performance. Abstract: Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation - especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model's weight matrices. Our method is model-agnostic and can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Codes are available at https://github.com/Axel-gu/DenoiseRotator.
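To make the entropy idea concrete, here is a toy sketch of how an orthogonal rotation can lower the entropy of normalized importance scores. It assumes squared weight magnitude as the importance score and a fixed 2D rotation; the paper's actual scores and its learned transformations are not described in the abstract, so both choices are illustrative assumptions.

```python
import math


def importance_entropy(scores):
    """Shannon entropy of importance scores after normalizing them to sum to 1."""
    total = sum(scores)
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log(p) for p in probs)


def rotate_pair(weights, theta):
    """Apply a 2x2 rotation (an orthogonal transform) to a pair of weights."""
    a, b = weights
    return (a * math.cos(theta) - b * math.sin(theta),
            a * math.sin(theta) + b * math.cos(theta))


# Two equally important weights: entropy is maximal (log 2).
before = (1.0, 1.0)
h_before = importance_entropy([w * w for w in before])

# Rotating by -45 degrees moves all the magnitude onto the first coordinate,
# so the second weight can be pruned with almost no loss. The norm is unchanged.
after = rotate_pair(before, -math.pi / 4)
h_after = importance_entropy([w * w for w in after])
```

The rotation preserves the vector norm while changing how importance is spread across coordinates, which is the degree of freedom the method exploits before pruning.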

[296] MAP: Revisiting Weight Decomposition for Low-Rank Adaptation

Chongjie Si,Zhiyi Shi,Yadao Wang,Xiaokang Yang,Susanto Rahardja,Wei Shen

Main category: cs.LG

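TL;DR: MAP is a new parameter-efficient fine-tuning framework that decomposes weight matrices into direction and magnitude, giving a more flexible and interpretable form of adaptation.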

Details

Motivation: Existing decompositions for parameter-efficient fine-tuning (such as DoRA, which builds on LoRA) define direction heuristically without a principled geometric foundation, which limits performance. Method: MAP normalizes the pre-trained weights, learns a directional update, and introduces two scalar coefficients that independently scale the magnitude of the base and update vectors. Result: Experiments show that MAP significantly improves existing PEFT methods. Conclusion: Given its generality and simplicity, MAP could serve as a default setting for designing future PEFT methods. Abstract: The rapid development of large language models has revolutionized natural language processing, but their fine-tuning remains computationally expensive, hindering broad deployment. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, have emerged as solutions. Recent work like DoRA attempts to further decompose weight adaptation into direction and magnitude components. However, existing formulations often define direction heuristically at the column level, lacking a principled geometric foundation. In this paper, we propose MAP, a novel framework that reformulates weight matrices as high-dimensional vectors and decouples their adaptation into direction and magnitude in a rigorous manner. MAP normalizes the pre-trained weights, learns a directional update, and introduces two scalar coefficients to independently scale the magnitude of the base and update vectors. This design enables more interpretable and flexible adaptation, and can be seamlessly integrated into existing PEFT methods. Extensive experiments show that MAP significantly improves performance when coupled with existing methods, offering a simple yet powerful enhancement to existing PEFT methods. Given the universality and simplicity of MAP, we hope it can serve as a default setting for designing future PEFT methods.
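One possible reading of the decomposition described above is sketched below: treat each matrix as a flattened vector, normalize by the Frobenius norm, and combine the base and update directions with two scalars. The exact formula, the names `alpha` and `beta`, and the function `map_style_update` are assumptions made for illustration; the abstract does not give the paper's equations.

```python
import math


def frobenius_norm(matrix):
    """Norm of the matrix viewed as one long vector."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def map_style_update(w0, delta, alpha, beta):
    """Assumed form: W = alpha * W0/||W0|| + beta * dW/||dW||.

    w0    : frozen pre-trained weights (list of rows)
    delta : learned update, e.g. the product of LoRA factors
    alpha : scalar magnitude applied to the base direction
    beta  : scalar magnitude applied to the update direction
    """
    n0 = frobenius_norm(w0)
    nd = frobenius_norm(delta)
    return [
        [alpha * w / n0 + (beta * d / nd if nd > 0 else 0.0)
         for w, d in zip(row_w, row_d)]
        for row_w, row_d in zip(w0, delta)
    ]


w0 = [[3.0, 0.0], [0.0, 4.0]]     # Frobenius norm 5
delta = [[0.0, 1.0], [0.0, 0.0]]  # Frobenius norm 1
adapted = map_style_update(w0, delta, alpha=5.0, beta=0.5)
```

Setting `alpha` to the original norm and `beta` to zero recovers the pre-trained weights, so under this reading training can start from the unmodified model.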

[297] Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs

Haokun Chen,Yueqi Zhang,Yuan Bi,Yao Zhang,Tong Liu,Jinhe Bi,Jian Lan,Jindong Gu,Claudia Grosser,Denis Krompass,Nassir Navab,Volker Tresp

Main category: cs.LG

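TL;DR: The paper proposes a comprehensive framework for evaluating machine unlearning algorithms, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods, and explores a new technique based on intermediate activation perturbation.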

Details

Motivation: Training data for large language models (LLMs) can contain sensitive or copyrighted content, and regulations such as the GDPR grant the right to have such information removed. Machine unlearning aims to avoid costly retraining, but evaluating its effectiveness remains difficult. Method: Proposes an evaluation framework with benchmark datasets, unlearning algorithms, and prompt-based auditing methods, and introduces an intermediate activation perturbation technique to complement existing audits. Result: Evaluates the effectiveness and robustness of different unlearning strategies with a range of auditing algorithms. Conclusion: The framework provides a comprehensive way to evaluate machine unlearning algorithms and addresses limitations of existing methods. Abstract: In recent years, Large Language Models (LLMs) have achieved remarkable advancements, drawing significant attention from the research community. Their capabilities are largely attributed to large-scale architectures, which require extensive training on massive datasets. However, such datasets often contain sensitive or copyrighted content sourced from the public internet, raising concerns about data privacy and ownership. Regulatory frameworks, such as the General Data Protection Regulation (GDPR), grant individuals the right to request the removal of such sensitive information. This has motivated the development of machine unlearning algorithms that aim to remove specific knowledge from models without the need for costly retraining. Despite these advancements, evaluating the efficacy of unlearning algorithms remains a challenge due to the inherent complexity and generative nature of LLMs. In this work, we introduce a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. By using various auditing algorithms, we evaluate the effectiveness and robustness of different unlearning strategies. To explore alternatives beyond prompt-based auditing, we propose a novel technique that leverages intermediate activation perturbations, addressing the limitations of auditing methods that rely solely on model inputs and outputs.

[298] Rethinking Regularization Methods for Knowledge Graph Completion

Linyu Li,Zhi Jin,Yuanpeng He,Dongming Jin,Haoran Duan,Zhengwei Tao,Xuan Zhang,Jiandong Li

Main category: cs.LG

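TL;DR: The paper rethinks how regularization is applied in knowledge graph completion (KGC) and proposes a new sparse regularization method (SPR) that improves model performance by selectively penalizing salient components.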

Details

Motivation: Existing KGC models do not fully exploit regularization; the paper studies regularization in depth to improve model performance. Method: Proposes the SPR regularizer, which selectively penalizes the salient components of embedding vectors and ignores noisy ones. Result: Experiments show that SPR outperforms other regularization methods and lets models exceed their original performance ceiling. Conclusion: Carefully designed regularization can substantially improve KGC models, and SPR is an effective option. Abstract: Knowledge graph completion (KGC) has attracted considerable attention in recent years because it is critical to improving the quality of knowledge graphs. Researchers have continuously explored various models. However, most previous efforts have neglected to examine regularization from a deeper perspective, so it has not been used to its full potential. This paper rethinks the application of regularization methods in KGC. Through extensive empirical studies on various KGC models, we find that carefully designed regularization not only alleviates overfitting and reduces variance but also enables these models to break through the upper bounds of their original performance. Furthermore, we introduce a novel sparse-regularization method that embeds the concept of rank-based selective sparsity into the KGC regularizer. The core idea is to selectively penalize those components with significant features in the embedding vector, thus effectively ignoring many components that contribute little and may only represent noise. Various comparative experiments on multiple datasets and multiple models show that the SPR regularization method is better than other regularization methods and can enable the KGC model to further break through the performance margin.

[299] Domain-Aware Tensor Network Structure Search

Giorgos Iacovides,Wuyang Zhou,Chao Li,Qibin Zhao,Danilo Mandic

Main category: cs.LG

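TL;DR: Proposes tnLLM, a framework that combines domain information with large language models (LLMs) to solve the tensor network structure search (TN-SS) problem efficiently, cutting computational cost and making the resulting structures more transparent.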

Details

Motivation: Existing TN-SS algorithms are computationally expensive, ignore domain information, and produce structures that are hard to interpret. Method: Uses a domain-aware prompting pipeline in which an LLM directly predicts suitable TN structures, using domain information to optimize the objective function. Result: tnLLM matches SOTA algorithms with far fewer function evaluations and can accelerate the convergence of other methods. Conclusion: tnLLM offers an efficient, transparent, and domain-aware solution to the TN-SS problem. Abstract: Tensor networks (TNs) provide efficient representations of high-dimensional data, yet identification of the optimal TN structures, the so-called tensor network structure search (TN-SS) problem, remains a challenge. Current state-of-the-art (SOTA) algorithms are computationally expensive as they require extensive function evaluations, which is prohibitive for real-world applications. In addition, existing methods ignore valuable domain information inherent in real-world tensor data and lack transparency in their identified TN structures. To this end, we propose a novel TN-SS framework, termed the tnLLM, which incorporates domain information about the data and harnesses the reasoning capabilities of large language models (LLMs) to directly predict suitable TN structures. The proposed framework involves a domain-aware prompting pipeline which instructs the LLM to infer suitable TN structures based on the real-world relationships between tensor modes. In this way, our approach is capable of not only iteratively optimizing the objective function, but also generating domain-aware explanations for the identified structures. Experimental results demonstrate that tnLLM achieves comparable TN-SS objective function values with much fewer function evaluations compared to SOTA algorithms. Furthermore, we demonstrate that the LLM-enabled domain information can be used to find good initializations in the search space for sampling-based SOTA methods to accelerate their convergence while preserving theoretical performance guarantees.

[300] Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Yiran Guo,Lijie Xu,Jie Liu,Dan Ye,Shuang Qiu

Main category: cs.LG

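TL;DR: The paper proposes SPO, a reinforcement learning framework that estimates advantages at an intermediate, segment-level granularity to address the shortcomings of token-level and trajectory-level estimation, and validates it in two concrete scenarios.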

Details

Motivation: Improving the reasoning ability of large language models is a key challenge, and existing methods estimate advantages at an unsuitable granularity: either too fine (e.g., PPO), which makes estimates inaccurate, or too coarse (e.g., GRPO), which makes credit assignment imprecise. Method: Proposes Segment Policy Optimization (SPO), with three components (flexible segment partition, accurate segment advantage estimation, and policy optimization using segment advantages), instantiated for short chain-of-thought (SPO-chain) and long chain-of-thought (SPO-tree). Result: SPO-chain improves accuracy by 6-12 percentage points over PPO and GRPO on GSM8K, and SPO-tree improves by 7-11 percentage points over GRPO on MATH500. Conclusion: Segment-level advantage estimation balances estimation accuracy against computational cost and improves the reasoning ability of language models. Abstract: Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: Token-level methods (e.g., PPO) aim to provide the fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving $6$-$12$ percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving $7$-$11$ percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
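A rough sketch of critic-free, segment-level advantage estimation follows. It assumes that the value at each segment boundary is the mean final reward of a few Monte Carlo rollouts continued from that boundary, and that a segment's advantage is the change in value across it. This is one plausible reading of the abstract, not the released implementation, and the function names are hypothetical.

```python
def mc_value(rollout_rewards):
    """Monte Carlo value estimate: mean final reward over sampled continuations."""
    return sum(rollout_rewards) / len(rollout_rewards)


def segment_advantages(boundary_rollouts):
    """Advantage of segment k = V(boundary k+1) - V(boundary k).

    boundary_rollouts[i] holds the final rewards of rollouts started from the
    i-th segment boundary; index 0 is the prompt, the last index is the end
    of the response.
    """
    values = [mc_value(rewards) for rewards in boundary_rollouts]
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]


# Toy response split into two segments, four rollouts per boundary (0/1 rewards).
rollouts = [
    [1, 0, 0, 0],  # from the prompt: 25% of continuations succeed
    [1, 1, 0, 0],  # after segment 1: 50% succeed
    [1, 1, 1, 1],  # after segment 2: the answer is already correct
]
advantages = segment_advantages(rollouts)
```

In this toy case every token in the second segment would share the larger advantage, which is the kind of intermediate credit assignment that sits between token-level and trajectory-level signals.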

[301] On-Policy RL with Optimal Reward Baseline

Yaru Hao,Li Dong,Xun Wu,Shaohan Huang,Zewen Chi,Furu Wei

Main category: cs.LG

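TL;DR: The paper proposes OPO, a reinforcement learning algorithm that uses exact on-policy training and an optimal reward baseline to address the training instability and computational inefficiency of existing algorithms.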

Details

Motivation: Existing reinforcement learning algorithms for large language models suffer from training instability and computational inefficiency. Method: Proposes OPO, which emphasizes exact on-policy training and introduces an optimal reward baseline to reduce gradient variance. Result: On mathematical reasoning benchmarks, OPO shows better performance and training stability, with lower policy shift and higher output entropy. Conclusion: OPO is a stable and effective reinforcement learning algorithm for alignment and reasoning tasks in large language models. Abstract: Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO introduces the optimal reward baseline that theoretically minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is provided at https://github.com/microsoft/LMOps/tree/main/opo.
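The abstract does not state the baseline's formula. For policy-gradient estimators, the classical variance-minimizing baseline weights each reward by the squared norm of its score-function gradient. The sketch below assumes, as a simplification, that this squared norm is proportional to response length, which yields a length-weighted average reward; both the simplification and the function names are assumptions rather than a description of the released code.

```python
def length_weighted_baseline(rewards, lengths):
    """Baseline b = sum(l_i * r_i) / sum(l_i) over responses to the same prompt.

    Assumes squared gradient norm is proportional to response length l_i.
    """
    return sum(l * r for r, l in zip(rewards, lengths)) / sum(lengths)


def baseline_advantages(rewards, lengths):
    """Advantage of each sampled response relative to the shared baseline."""
    baseline = length_weighted_baseline(rewards, lengths)
    return [r - baseline for r in rewards]


# Three sampled responses to one prompt: reward (0/1) and length in tokens.
rewards = [1.0, 0.0, 1.0]
lengths = [10, 30, 60]
baseline = length_weighted_baseline(rewards, lengths)  # (10 + 0 + 60) / 100
advs = baseline_advantages(rewards, lengths)
```

Compared with a plain mean reward (2/3 here), the length-weighted baseline is pulled toward the rewards of longer responses, which contribute more gradient terms under the stated assumption.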

[302] Differential Information: An Information-Theoretic Perspective on Preference Optimization

Yunjae Won,Hyunji Lee,Hyeonbin Hwang,Minjoon Seo

Main category: cs.LG

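TL;DR: By introducing the Differential Information Distribution (DID), the paper fills a gap in the theoretical basis of the reward parameterization in Direct Preference Optimization (DPO) and offers a unified view of policy learning and the structure of preference data.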

Details

Motivation: Despite DPO's empirical success, the theoretical justification for its log-ratio reward parameterization is incomplete; the paper aims to fill this gap. Method: Uses the Differential Information Distribution (DID) to analyze how preference labels encode the differential information needed to move from a reference policy to a target policy, derives the log-ratio reward as the uniquely optimal form, and examines the effect on policy behavior. Result: The condition for preference labels to encode differential information is linked to an implicit assumption of log-margin ordered policies, and an entropy analysis of the DID characterizes its effect on the policy distribution; experiments validate the theory. Conclusion: The paper gives a unified theoretical view of the DPO objective, the structure of preference data, and policy behavior, and shows that differential information of different entropy plays different roles in instruction following and knowledge-intensive tasks. Abstract: Direct Preference Optimization (DPO) has become a standard technique for aligning language models with human preferences in a supervised manner. Despite its empirical success, the theoretical justification behind its log-ratio reward parameterization remains incomplete. In this work, we address this gap by utilizing the Differential Information Distribution (DID): a distribution over token sequences that captures the information gained during policy updates. First, we show that when preference labels encode the differential information required to transform a reference policy into a target policy, the log-ratio reward in DPO emerges as the uniquely optimal form for learning the target policy via preference optimization. This result naturally yields a closed-form expression for the optimal sampling distribution over rejected responses. Second, we find that the condition for preferences to encode differential information is fundamentally linked to an implicit assumption regarding log-margin ordered policies, an inductive bias widely used in preference optimization yet previously unrecognized. Finally, by analyzing the entropy of the DID, we characterize how learning low-entropy differential information reinforces the policy distribution, while high-entropy differential information induces a smoothing effect, which explains the log-likelihood displacement phenomenon. We validate our theoretical findings in synthetic experiments and extend them to real-world instruction-following datasets. Our results suggest that learning high-entropy differential information is crucial for general instruction-following, while learning low-entropy differential information benefits knowledge-intensive question answering. Overall, our work presents a unifying perspective on the DPO objective, the structure of preference data, and resulting policy behaviors through the lens of differential information.
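For readers unfamiliar with the log-ratio reward the abstract refers to, the standard DPO objective can be sketched as follows. The reward of a response is beta times the log-ratio between the policy and reference likelihoods, and the loss is the negative log-sigmoid of the reward margin between the chosen and rejected responses. The log-probabilities in the example are made-up numbers; the paper's DID analysis itself is not reproduced here.

```python
import math


def log_ratio_reward(logp_policy, logp_ref, beta):
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """-log(sigmoid(r_w - r_l)) for one (chosen w, rejected l) pair."""
    margin = (log_ratio_reward(logp_w, logp_ref_w, beta)
              - log_ratio_reward(logp_l, logp_ref_l, beta))
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, both rewards are 0 and the loss is log 2.
loss_at_init = dpo_loss(-1.0, -1.0, -2.0, -2.0)

# Raising the chosen response's likelihood relative to the reference, and
# lowering the rejected one's, increases the margin and reduces the loss.
loss_after = dpo_loss(-1.0, -2.0, -2.0, -1.0, beta=1.0)
```

Only differences between policy and reference log-probabilities enter the loss, which is why the reward is described as a log-ratio parameterization.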

[303] Test-time augmentation improves efficiency in conformal prediction

Divya Shanmugam,Helen Lu,Swami Sankaranarayanan,John Guttag

Main category: cs.LG

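TL;DR: The paper uses test-time augmentation (TTA) to shrink the prediction sets of conformal classifiers without retraining the model, reducing set size by 10%-14% on average.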

Details

Motivation: Conformal classifiers often produce uninformatively large prediction sets, which limits their practical usefulness. Method: Applies test-time augmentation (TTA), which can be combined with any conformal score, requires no model retraining, and is flexible and computationally efficient. Result: Evaluated across three datasets, three models, and several distribution shifts, TTA reduces prediction set size by 10%-14% on average. Conclusion: Test-time augmentation is a useful addition to the conformal pipeline and effectively reduces prediction set size. Abstract: A conformal classifier produces a set of predicted classes and provides a probabilistic guarantee that the set includes the true class. Unfortunately, it is often the case that conformal classifiers produce uninformatively large sets. In this work, we show that test-time augmentation (TTA)--a technique that introduces inductive biases during inference--reduces the size of the sets produced by conformal classifiers. Our approach is flexible, computationally efficient, and effective. It can be combined with any conformal score, requires no model retraining, and reduces prediction set sizes by 10%-14% on average. We conduct an evaluation of the approach spanning three datasets, three models, two established conformal scoring methods, different guarantee strengths, and several distribution shifts to show when and why test-time augmentation is a useful addition to the conformal pipeline.
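
As a rough illustration of where TTA enters a conformal pipeline, the sketch below averages class probabilities over augmented views and then runs ordinary split conformal prediction. Plain averaging and the score `1 - p(true class)` are assumptions made here for simplicity; the abstract does not say which aggregation or scores the authors use.

```python
import math


def tta_average(prob_views):
    """Average the class-probability vectors predicted for augmented views of one input."""
    n_views = len(prob_views)
    n_classes = len(prob_views[0])
    return [sum(view[c] for view in prob_views) / n_views for c in range(n_classes)]


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal quantile of the calibration scores 1 - p(true class)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every class is kept
    return scores[rank - 1]


def prediction_set(probs, qhat):
    """Keep every class whose score 1 - p does not exceed the calibrated threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


if __name__ == "__main__":
    # Hypothetical calibration data: TTA-averaged probabilities and true labels.
    cal_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
    cal_labels = [0, 0, 1]
    qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
    test_probs = tta_average([[0.9, 0.1], [0.8, 0.2]])
    print(prediction_set(test_probs, qhat))
```

The calibration set must be scored with the same TTA-averaged probabilities as the test inputs, otherwise the coverage guarantee no longer applies to the augmented predictor.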

[304] Number of Clusters in a Dataset: A Regularized K-means Approach

Behzad Kamgar-Parsi,Behrooz Kamgar-Parsi

Main category: cs.LG

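TL;DR: The paper studies how to set the critical hyperparameter λ in regularized k-means, derives rigorous bounds on λ under an ideal-cluster assumption, and analyzes how additive and multiplicative regularization affect the solutions.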

Details

Motivation: Determining the number of meaningful clusters in an unlabeled dataset is an important problem, but there are currently no principled guidelines for setting the regularization hyperparameter λ. Method: Assuming ideal clusters (d-dimensional spheres with identical radii), derives rigorous bounds on λ and analyzes the solutions of k-means with additive and multiplicative regularizers. Result: Experiments show that additive regularization often yields multiple solutions, while the consensus between additive and multiplicative regularization reduces this ambiguity in certain cases. Conclusion: The paper provides theoretical support for setting λ and shows how regularized k-means behaves as clusters deviate from the ideal assumption. Abstract: Finding the number of meaningful clusters in an unlabeled dataset is important in many applications. The regularized k-means algorithm is a possible approach frequently used to find the correct number of distinct clusters in datasets. The most common formulation of the regularization function is the additive linear term $\lambda k$, where $k$ is the number of clusters and $\lambda$ a positive coefficient. Currently, there are no principled guidelines for setting a value for the critical hyperparameter $\lambda$. In this paper, we derive rigorous bounds for $\lambda$ assuming clusters are ideal. Ideal clusters (defined as $d$-dimensional spheres with identical radii) are close proxies for k-means clusters ($d$-dimensional spherically symmetric distributions with identical standard deviations). Experiments show that the k-means algorithm with additive regularizer often yields multiple solutions. Thus, we also analyze the k-means algorithm with multiplicative regularizer. The consensus among k-means solutions with additive and multiplicative regularizations reduces the ambiguity of multiple solutions in certain cases. We also present selected experiments that demonstrate the performance of the regularized k-means algorithms as clusters deviate from the ideal assumption.
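
The additive formulation can be sketched in a few lines: given the k-means within-cluster sum of squares for each candidate $k$, pick the $k$ that minimizes that cost plus $\lambda k$. The function name and the cost values below are made-up toy numbers, not results from the paper, and the multiplicative variant is omitted because the abstract does not give its form.

```python
def select_k_additive(sse_by_k, lam):
    """Return the k that minimizes SSE(k) + lam * k (additive linear regularizer)."""
    return min(sse_by_k, key=lambda k: sse_by_k[k] + lam * k)


if __name__ == "__main__":
    # Toy within-cluster sums of squares for k = 1..5 (illustrative values only).
    sse_by_k = {1: 100.0, 2: 40.0, 3: 10.0, 4: 8.0, 5: 7.0}
    for lam in (0.5, 5.0, 50.0):
        print(lam, select_k_additive(sse_by_k, lam))
```

On these toy numbers the selected $k$ moves from 5 to 3 to 2 as $\lambda$ grows, which is the sensitivity that motivates deriving bounds on $\lambda$.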

[305] Diverse Prototypical Ensembles Improve Robustness to Subpopulation Shift

Minh Nguyen Nhat To,Paul F R Wilson,Viet Nguyen,Mohamed Harmanani,Michael Cooper,Fahimeh Fooladgar,Purang Abolmaesumi,Parvin Mousavi,Rahul G. Krishnan

Main category: cs.LG

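TL;DR: The paper proposes Diverse Prototypical Ensembles (DPE), which replaces the linear classification head with an ensemble of diverse prototypical classifiers to handle subpopulation shift, improving worst-group accuracy.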

Details

Motivation: Subpopulation shift can significantly degrade the performance of machine learning models, and existing methods rely on assumptions about the number and nature of subpopulations and on group annotations, which are often unavailable for real-world data. Method: Replaces the standard linear classification layer with an ensemble of diverse prototypical classifiers, each focusing on different features and samples, to adaptively capture subpopulation risk. Result: Experiments on nine real-world datasets show that DPE often outperforms prior methods in worst-group accuracy. Conclusion: DPE does not depend on subpopulation annotations and handles subpopulation shift effectively. Abstract: Subpopulation shift, characterized by a disparity in subpopulation distribution between the training and target datasets, can significantly degrade the performance of machine learning models. Current solutions to subpopulation shift involve modifying empirical risk minimization with re-weighting strategies to improve generalization. This strategy relies on assumptions about the number and nature of subpopulations and annotations on group membership, which are unavailable for many real-world datasets. Instead, we propose using an ensemble of diverse classifiers to adaptively capture risk associated with subpopulations. Given a feature extractor network, we replace its standard linear classification layer with a mixture of prototypical classifiers, where each member is trained to classify the data while focusing on different features and samples from other members. In empirical evaluation on nine real-world datasets, covering diverse domains and kinds of subpopulation shift, our method of Diverse Prototypical Ensembles (DPEs) often outperforms the prior state-of-the-art in worst-group accuracy. The code is available at https://github.com/minhto2802/dpe4subpop
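
The inference side of a prototypical ensemble head can be sketched as follows. Each member holds one prototype per class, scores a feature vector by negative squared Euclidean distance, and the members' softmax outputs are averaged. The distance, the averaging rule, and all names are assumptions for illustration; the training procedure that makes members diverse is not shown and is not taken from the authors' repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_probs(feature, prototypes):
    """Class probabilities from negative squared distances to one member's prototypes."""
    logits = [-sum((f - p) ** 2 for f, p in zip(feature, proto)) for proto in prototypes]
    return softmax(logits)


def ensemble_predict(feature, members):
    """Average member probabilities and return (predicted class, averaged probabilities)."""
    all_probs = [prototype_probs(feature, protos) for protos in members]
    n_classes = len(all_probs[0])
    avg = [sum(p[c] for p in all_probs) / len(all_probs) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


if __name__ == "__main__":
    # Two hypothetical members, each with one 2-D prototype per class.
    members = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.2], [1.0, 0.8]],
    ]
    print(ensemble_predict([0.1, 0.0], members))
```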

[306] Pseudo Multi-Source Domain Generalization: Bridging the Gap Between Single and Multi-Source Domain Generalization

Shohei Enomoto

Main category: cs.LG

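TL;DR: The paper proposes PMDG, a new framework that generates multiple pseudo-domains from a single source domain through style transfer and data augmentation, addressing the dataset cost that limits multi-source domain generalization (MDG) in practice.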

Details

Motivation: Deep learning models lose performance when deployed in environments whose distribution differs from the training data. Multi-source domain generalization (MDG) is effective but costly, so a more practical approach for single-source domain generalization (SDG) is needed. Method: Proposes the PMDG framework, which uses style transfer and data augmentation to build a pseudo multi-domain dataset from a single source domain that can be used with existing MDG algorithms. Result: Experiments show that PMDG performance is positively correlated with MDG performance, and that with sufficient data pseudo-domains can match or exceed actual multi-domain performance. Conclusion: PMDG offers a practical and efficient approach to domain generalization, with room for further exploration. Abstract: Deep learning models often struggle to maintain performance when deployed on data distributions different from their training data, particularly in real-world applications where environmental conditions frequently change. While Multi-source Domain Generalization (MDG) has shown promise in addressing this challenge by leveraging multiple source domains during training, its practical application is limited by the significant costs and difficulties associated with creating multi-domain datasets. To address this limitation, we propose Pseudo Multi-source Domain Generalization (PMDG), a novel framework that enables the application of sophisticated MDG algorithms in more practical Single-source Domain Generalization (SDG) settings. PMDG generates multiple pseudo-domains from a single source domain through style transfer and data augmentation techniques, creating a synthetic multi-domain dataset that can be used with existing MDG algorithms. Through extensive experiments with PseudoDomainBed, our modified version of the DomainBed benchmark, we analyze the effectiveness of PMDG across multiple datasets and architectures. Our analysis reveals several key findings, including a positive correlation between MDG and PMDG performance and the potential of pseudo-domains to match or exceed actual multi-domain performance with sufficient data. These comprehensive empirical results provide valuable insights for future research in domain generalization. Our code is available at https://github.com/s-enmt/PseudoDomainBed.

[307] Buffer-free Class-Incremental Learning with Out-of-Distribution Detection

Srishti Gupta,Daniele Angioni,Maura Pintor,Ambra Demontis,Lea Schönherr,Battista Biggio,Fabio Roli

Main category: cs.LG

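TL;DR: The paper studies buffer-free, post-hoc OOD detection for open-world class-incremental learning and shows that it matches or outperforms buffer-based methods.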

Details

Motivation: Class-incremental learning in open-world settings must handle inputs from unknown classes; existing methods rely on a memory buffer, which raises privacy, scalability, and training-time concerns. Method: Analyzes post-hoc OOD detection methods and applies them at inference time, in place of a buffer, to detect unknown classes. Result: On CIFAR-10, CIFAR-100, and Tiny ImageNet, the buffer-free approach performs comparably to or better than buffer-based methods. Conclusion: Post-hoc OOD detection offers a new route to efficient and privacy-preserving open-world class-incremental learning. Abstract: Class-incremental learning (CIL) poses significant challenges in open-world scenarios, where models must not only learn new classes over time without forgetting previous ones but also handle inputs from unknown classes that a closed-set model would misclassify. Recent works address both issues by (i) training multi-head models using the task-incremental learning framework, and (ii) predicting the task identity employing out-of-distribution (OOD) detectors. While effective, the latter mainly relies on joint training with a memory buffer of past data, raising concerns around privacy, scalability, and increased training time. In this paper, we present an in-depth analysis of post-hoc OOD detection methods and investigate their potential to eliminate the need for a memory buffer. We uncover that these methods, when applied appropriately at inference time, can serve as a strong substitute for buffer-based OOD detection. We show that this buffer-free approach achieves comparable or superior performance to buffer-based methods both in terms of class-incremental learning and the rejection of unknown samples. Experimental results on CIFAR-10, CIFAR-100 and Tiny ImageNet datasets support our findings, offering new insights into the design of efficient and privacy-preserving CIL systems for open-world settings.
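
One way to picture post-hoc task-identity prediction with a multi-head model is sketched below: score each task head with a post-hoc OOD score, route the input to the most confident head, and reject it as unknown when even that head is unconfident. Maximum softmax probability is used only because it is a familiar post-hoc score; the abstract does not name the detectors studied, and the threshold and function names are assumptions.

```python
import math


def max_softmax(logits):
    """Maximum softmax probability, a simple post-hoc confidence score."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


def predict_open_world(head_logits, threshold):
    """Return (task_id, class_within_task), or None if the input is rejected as unknown."""
    scores = [max_softmax(logits) for logits in head_logits]
    task = max(range(len(scores)), key=scores.__getitem__)
    if scores[task] < threshold:
        return None  # no head is confident enough: treat as an unknown class
    logits = head_logits[task]
    return task, max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    # Hypothetical logits from two task heads for one input.
    print(predict_open_world([[0.1, 0.0], [5.0, 0.0]], threshold=0.9))
    print(predict_open_world([[0.1, 0.0], [0.0, 0.2]], threshold=0.9))
```

Because the score is computed from the trained heads alone, no stored samples from earlier tasks are needed at inference time.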

[308] Network Inversion for Uncertainty-Aware Out-of-Distribution Detection

Pirzada Suhail,Rehna Afroz,Amit Sethi

Main category: cs.LG

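TL;DR: Proposes a new framework that combines network inversion with classifier training to address OOD detection and uncertainty estimation together.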

Details

Motivation: Building safe machine learning systems requires handling unexpected inputs effectively; OOD detection and uncertainty estimation are key to that. Method: Introduces a "garbage" class and iterates training, inversion, and exclusion to refine the classifier's decision boundaries. Result: The model detects OOD samples by classifying them into the garbage class and also provides uncertainty estimates. Conclusion: The approach requires no external OOD data or post-hoc calibration and provides a unified solution. Abstract: Out-of-distribution (OOD) detection and uncertainty estimation (UE) are critical components for building safe machine learning systems, especially in real-world scenarios where unexpected inputs are inevitable. In this work, we propose a novel framework that combines network inversion with classifier training to simultaneously address both OOD detection and uncertainty estimation. For a standard n-class classification task, we extend the classifier to an (n+1)-class model by introducing a "garbage" class, initially populated with random Gaussian noise to represent outlier inputs. After each training epoch, we use network inversion to reconstruct input images corresponding to all output classes; these initially appear noisy and incoherent and are therefore assigned to the garbage class for retraining the classifier. This cycle of training, inversion, and exclusion continues iteratively until the inverted samples begin to resemble the in-distribution data more closely, suggesting that the classifier has learned to carve out meaningful decision boundaries while sanitising the class manifolds by pushing OOD content into the garbage class. During inference, this training scheme enables the model to effectively detect and reject OOD samples by classifying them into the garbage class. Furthermore, the confidence scores associated with each prediction can be used to estimate uncertainty for both in-distribution and OOD inputs. Our approach is scalable, interpretable, and does not require access to external OOD datasets or post-hoc calibration techniques while providing a unified solution to the dual challenges of OOD detection and uncertainty estimation.

[309] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Qingyu Shi,Jinbin Bai,Zhuoran Zhao,Wenhao Chai,Kaidong Yu,Jianzong Wu,Shuangyong Song,Yunhai Tong,Xiangtai Li,Xuelong Li,Shuicheng Yan

Main category: cs.LG

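TL;DR: Muddit is a unified discrete diffusion transformer that supports fast, parallel generation of text and images by combining pretrained visual priors with a lightweight text decoder, performing competitively with or better than significantly larger autoregressive models.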

Details

Motivation: Autoregressive unified models are slow at inference and non-autoregressive unified models generalize poorly; the paper explores discrete diffusion as an efficient backbone for unified generation. Method: Proposes Muddit, which integrates visual priors from a pretrained text-to-image backbone with a lightweight text decoder for unified multimodal generation. Result: Experiments show that Muddit is competitive with or better than significantly larger autoregressive models in both quality and efficiency. Conclusion: Discrete diffusion combined with strong visual priors is a scalable and effective backbone for unified generation. Abstract: Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

[310] Merge-Friendly Post-Training Quantization for Multi-Target Domain Adaptation

Juncheol Shin,Minsang Seok,Seonggon Kim,Eunhyeok Park

Main category: cs.LG

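TL;DR: The study proposes HDRQ, a post-training quantization method that addresses the difficulty quantization poses for model merging by keeping deviation from the source pre-trained model minimal and flattening the loss surface to ease merging.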

Details

Motivation: Quantization applied to target-specific data restricts the domain of interest and introduces discretization effects, which complicates model merging. The study analyzes how quantization affects merging and proposes a remedy. Method: Analyzes the impact of quantization through the lens of error barriers and proposes HDRQ (Hessian and distant regularizing quantization), which shapes the quantization process to support model merging for multi-target domain adaptation. Result: Extensive experiments confirm the effectiveness of HDRQ. Conclusion: According to the authors, this is the first study of the challenges of merging quantized models, and it offers a practical solution for multi-target domain adaptation. Abstract: Model merging has emerged as a powerful technique for combining task-specific weights, achieving superior performance in multi-target domain adaptation. However, when applied to practical scenarios, such as quantized models, new challenges arise. In practical scenarios, quantization is often applied to target-specific data, but this process restricts the domain of interest and introduces discretization effects, making model merging highly non-trivial. In this study, we analyze the impact of quantization on model merging through the lens of error barriers. Leveraging these insights, we propose a novel post-training quantization, HDRQ - Hessian and distant regularizing quantization - that is designed to consider model merging for multi-target domain adaptation. Our approach ensures that the quantization process incurs minimal deviation from the source pre-trained model while flattening the loss surface to facilitate smooth model merging. To our knowledge, this is the first study on this challenge, and extensive experiments confirm its effectiveness.

[311] REOrdering Patches Improves Vision Models

Declan Kutscher,David M. Chan,Yutong Bai,Trevor Darrell,Ritwik Gupta

Main category: cs.LG

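TL;DR: The paper proposes REOrder, a framework that improves sequence-model performance by optimizing the order of image patches, with notable accuracy gains on ImageNet-1K and Functional Map of the World.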

Details

Motivation: Modern long-sequence transformers are sensitive to the order of image patches, and a fixed order such as row-major can hurt performance, so task-optimal orderings are worth searching for. Method: Proposes the REOrder framework: (1) derive an information-theoretic prior by evaluating the compressibility of different patch sequences; (2) learn a policy over permutations by optimizing a Plackett-Luce policy with REINFORCE. Result: Top-1 accuracy improves by up to 3.01% on ImageNet-1K and by 13.35% on Functional Map of the World. Conclusion: REOrder finds task-optimal patch orderings and improves model performance. Abstract: Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
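
The Plackett-Luce distribution mentioned above assigns one score per patch and builds a permutation by repeatedly drawing the next patch with probability proportional to the exponentiated score among those not yet placed. The sketch below shows sampling and the log-probability that a REINFORCE update would need; the names and the use of raw per-patch scores as logits are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def pl_sample(scores, rng):
    """Draw a permutation of range(len(scores)) from a Plackett-Luce distribution."""
    remaining = list(range(len(scores)))
    perm = []
    while remaining:
        m = max(scores[j] for j in remaining)
        weights = [math.exp(scores[j] - m) for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)
        perm.append(pick)
    return perm


def pl_log_prob(perm, scores):
    """Log-probability of a full permutation under Plackett-Luce scores."""
    logp = 0.0
    for i, item in enumerate(perm):
        rest = perm[i:]  # items still unplaced when position i is filled
        m = max(scores[j] for j in rest)
        log_norm = m + math.log(sum(math.exp(scores[j] - m) for j in rest))
        logp += scores[item] - log_norm
    return logp


if __name__ == "__main__":
    scores = [0.5, -1.0, 2.0, 0.0]  # hypothetical per-patch logits
    order = pl_sample(scores, random.Random(0))
    print(order, pl_log_prob(order, scores))
```

In a REINFORCE setup, the gradient of `pl_log_prob` with respect to the scores would be weighted by the downstream reward of the sampled ordering; that training loop is outside this sketch.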

cs.DB [Back]

[312] TailorSQL: An NL2SQL System Tailored to Your Query Workload

Kapil Vaidya,Jialin Ding,Sebastian Kosak,David Kernert,Chuan Lei,Xiao Qin,Abhinav Tripathy,Ramesh Balan,Balakrishnan Narayanaswamy,Tim Kraska

Main category: cs.DB

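TL;DR: TailorSQL uses information in the past query workload to improve NL2SQL accuracy and latency, achieving up to a 2x improvement in execution accuracy on standardized benchmarks.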

Details

Motivation: Existing NL2SQL techniques do not exploit information implicit in the past query workload, such as common join paths and the semantics of tables and columns, even though it matters for accurate translation. Method: TailorSQL analyzes the past query workload, extracts useful information such as common join paths and table and column semantics, and combines it with a pre-trained large language model to generate more accurate SQL queries. Result: On standardized benchmarks, TailorSQL improves execution accuracy by up to 2x. Conclusion: Using past query workload information can substantially improve NL2SQL performance, and TailorSQL is an effective way to do so. Abstract: NL2SQL (natural language to SQL) translates natural language questions into SQL queries, thereby making structured data accessible to non-technical users, serving as the foundation for intelligent data applications. State-of-the-art NL2SQL techniques typically perform translation by retrieving database-specific information, such as the database schema, and invoking a pre-trained large language model (LLM) using the question and retrieved information to generate the SQL query. However, existing NL2SQL techniques miss a key opportunity which is present in real-world settings: NL2SQL is typically applied on existing databases which have already served many SQL queries in the past. The past query workload implicitly contains information which is helpful for accurate NL2SQL translation and is not apparent from the database schema alone, such as common join paths and the semantics of obscurely-named tables and columns. We introduce TailorSQL, an NL2SQL system that takes advantage of information in the past query workload to improve both the accuracy and latency of translating natural language questions into SQL. By specializing to a given workload, TailorSQL achieves up to 2$\times$ improvement in execution accuracy on standardized benchmarks.

q-bio.NC [Back]

[313] ConnectomeDiffuser: Generative AI Enables Brain Network Construction from Diffusion Tensor Imaging

Xuhang Chen,Michael Kwok-Po Ng,Kim-Fung Tsang,Chi-Man Pun,Shuqiang Wang

Main category: q-bio.NC

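TL;DR: ConnectomeDiffuser is a diffusion-based automated framework for constructing brain networks from DTI that overcomes limitations of existing methods and improves diagnostic accuracy.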

Details

Motivation: Existing brain network construction methods suffer from operator subjectivity, labor-intensive workflows, and a limited ability to capture complex topological features and disease-specific biomarkers. Method: Combines a Template Network, a diffusion model, and a Graph Convolutional Network classifier to extract topological features from DTI scans and generate comprehensive brain networks. Result: Validated on datasets covering two neurodegenerative conditions, with significant performance improvements over other methods. Conclusion: ConnectomeDiffuser provides a more accurate tool for the diagnosis and therapeutic monitoring of neurodegenerative diseases. Abstract: Brain network analysis plays a crucial role in diagnosing and monitoring neurodegenerative disorders such as Alzheimer's disease (AD). Existing approaches for constructing structural brain networks from diffusion tensor imaging (DTI) often rely on specialized toolkits that suffer from inherent limitations: operator subjectivity, labor-intensive workflows, and restricted capacity to capture complex topological features and disease-specific biomarkers. To overcome these challenges and advance computational neuroimaging instrumentation, ConnectomeDiffuser is proposed as a novel diffusion-based framework for automated end-to-end brain network construction from DTI. The proposed model combines three key components: (1) a Template Network that extracts topological features from 3D DTI scans using Riemannian geometric principles, (2) a diffusion model that generates comprehensive brain networks with enhanced topological fidelity, and (3) a Graph Convolutional Network classifier that incorporates disease-specific markers to improve diagnostic accuracy. ConnectomeDiffuser demonstrates superior performance by capturing a broader range of structural connectivity and pathology-related information, enabling more sensitive analysis of individual variations in brain networks. Experimental validation on datasets representing two distinct neurodegenerative conditions demonstrates significant performance improvements over other brain network methods. This work contributes to the advancement of instrumentation in the context of neurological disorders, providing clinicians and researchers with a robust, generalizable measurement framework that facilitates more accurate diagnosis, deeper mechanistic understanding, and improved therapeutic monitoring of neurodegenerative diseases such as AD.

cs.CR [Back]

[314] AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models

Jinchuan Zhang,Lu Yin,Yan Zhou,Songlin Hu

Main category: cs.CR

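TL;DR: The AgentAlign framework synthesizes safety alignment data from abstract behavior chains, substantially improving the safety of LLM agents while preserving their helpfulness.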

Details

Motivation: The growing agentic capabilities of LLMs make them more susceptible to malicious use, and existing approaches fall short on safety alignment for agentic use. Method: Uses abstract behavior chains, instantiated in simulated environments, to generate safety alignment data that balances safety and utility. Result: On AgentHarm, safety improves by 35.8% to 79.5%, with helpfulness minimally affected or even improved. Conclusion: AgentAlign effectively addresses safety alignment for LLM agents and outperforms existing prompting methods. Abstract: The acquisition of agentic capabilities has transformed LLMs from "knowledge providers" to "action executors", a trend that while expanding LLMs' capability boundaries, significantly increases their susceptibility to malicious use. Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked, indicating a deficiency in agentic use safety alignment during the post-training phase. To address this gap, we propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis. By instantiating these behavior chains in simulated environments with diverse tool instances, our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics. The framework further ensures model utility by proportionally synthesizing benign instructions through non-malicious interpretations of behavior chains, precisely calibrating the boundary between helpfulness and harmlessness. Evaluation results on AgentHarm demonstrate that fine-tuning three families of open-source models using our method substantially improves their safety (35.8% to 79.5% improvement) while minimally impacting or even positively enhancing their helpfulness, outperforming various prompting methods. The dataset and code have both been open-sourced.

Chunlong Xie,Jialing He,Shangwei Guo,Jiacheng Wang,Shudong Zhang,Tianwei Zhang,Tao Xiang

Main category: cs.CR

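TL;DR: AdvOF is a new attack framework targeting vision-and-language navigation (VLN) agents that generates adversarial 3D objects to study their effect on the VLM-based perception module.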

Details

Motivation: Existing adversarial attacks do not address service computing contexts, where reliability and quality-of-service (QoS) are paramount; AdvOF fills this gap. Method: AdvOF precisely aligns the victim object's position in 2D and 3D space, defines and renders adversarial objects, and optimizes them collaboratively through multi-view optimization and regularization. Result: Experiments show that AdvOF effectively degrades agent performance under adversarial conditions while interfering minimally with normal navigation tasks. Conclusion: AdvOF advances the understanding of service security in VLM-powered navigation systems and lays groundwork for robust service composition in physical-world deployments. Abstract: We present Adversarial Object Fusion (AdvOF), a novel attack framework targeting vision-and-language navigation (VLN) agents in service-oriented environments by generating adversarial 3D objects. While foundational models like Large Language Models (LLMs) and Vision Language Models (VLMs) have enhanced service-oriented navigation systems through improved perception and decision-making, their integration introduces vulnerabilities in mission-critical service workflows. Existing adversarial attacks fail to address service computing contexts, where reliability and quality-of-service (QoS) are paramount. We utilize AdvOF to investigate and explore the impact of adversarial environments on the VLM-based perception module of VLN agents. In particular, AdvOF first precisely aggregates and aligns the victim object positions in both 2D and 3D space, defining and rendering adversarial objects. Then, we collaboratively optimize the adversarial object with regularization between the adversarial and victim object across physical properties and VLM perceptions. By assigning importance weights to different views, the optimization proceeds stably across multiple views through iterative fusion of local updates and justifications. Our extensive evaluations demonstrate AdvOF can effectively degrade agent performance under adversarial conditions while maintaining minimal interference with normal navigation tasks. This work advances the understanding of service security in VLM-powered navigation systems, providing computational foundations for robust service composition in physical-world deployments.

Table of Contents

cs.CV [Back]

[1] One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

[2] Using Cross-Domain Detection Loss to Infer Multi-Scale Information for Improved Tiny Head Tracking

[3] Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision

[4] How Animals Dance (When You're Not Looking)

[5] HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

[6] LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization

[7] MIAS-SAM: Medical Image Anomaly Segmentation without thresholding

[8] Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization

[9] Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory

[10] Fast Trajectory-Independent Model-Based Reconstruction Algorithm for Multi-Dimensional Magnetic Particle Imaging

[11] VidText: Towards Comprehensive Evaluation for Video Text Understanding

[12] IMTS is Worth Time $\times$ Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction

[13] Improving Contrastive Learning for Referring Expression Counting

[14] CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting

[15] A Probabilistic Jump-Diffusion Framework for Open-World Egocentric Activity Recognition

[16] 4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians

[17] CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

[18] 3DGS Compression with Sparsity-guided Hierarchical Transform Coding

[19] Hierarchical Material Recognition from Local Appearance

[20] cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning

[21] Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

[22] Leveraging Diffusion Models for Synthetic Data Augmentation in Protein Subcellular Localization Classification

[23] Fast Isotropic Median Filtering

[24] ATI: Any Trajectory Instruction for Controllable Video Generation

[25] Toward Memory-Aided World Models: Benchmarking via Spatial Consistency

[26] HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions

[27] Pose-free 3D Gaussian splatting via shape-ray estimation

[28] MOVi: Training-free Text-conditioned Multi-Object Video Generation

[29] Synthetic Document Question Answering in Hungarian

[30] SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model

[31] Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition

[32] Towards Privacy-Preserving Fine-Grained Visual Classification via Hierarchical Learning from Label Proportions

[33] Deep Modeling and Optimization of Medical Image Classification

[34] Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation

[35] SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images

[36] Multi-Sourced Compositional Generalization in Visual Question Answering

[37] Zero-P-to-3: Zero-Shot Partial-View Images to 3D Object

[38] URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration

[39] GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

[40] LeMoRe: Learn More Details for Lightweight Semantic Segmentation

[41] CURVE: CLIP-Utilized Reinforcement Learning for Visual Image Enhancement via Simple Image Processing

[42] EAD: An EEG Adapter for Automated Classification

[43] Identification of Patterns of Cognitive Impairment for Early Detection of Dementia

[44] Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving

[45] TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance

[46] MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation

[47] HMAD: Advancing E2E Driving with Anchored Offset Proposals and Simulation-Supervised Multi-target Scoring

[48] PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents

[49] Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing

[50] Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning

[51] FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing

[52] PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

[53] LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering

[54] Implicit Inversion turns CLIP into a Decoder

[55] RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer

[56] DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

[57] HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image

[58] Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks

[59] Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images

[60] WTEFNet: Real-Time Low-Light Object Detection for Advanced Driver-Assistance Systems

[61] HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers

[62] Navigating the Accuracy-Size Trade-Off with Flexible Model Merging

[63] SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection

[64] UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes

[65] Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

[66] Unsupervised Transcript-assisted Video Summarization and Highlight Detection

[67] LADA: Scalable Label-Specific CLIP Adapter for Continual Learning

[68] Are MLMs Trapped in the Visual Room?

[69] Holistic Large-Scale Scene Reconstruction via Mixed Gaussian Splatting

[70] RSFAKE-1M: A Large-Scale Dataset for Detecting Diffusion-Generated Remote Sensing Forgeries

[71] GenCAD-Self-Repairing: Feasibility Enhancement for 3D CAD Generation

[72] Federated Unsupervised Semantic Segmentation

[73] TRACE: Trajectory-Constrained Concept Erasure in Diffusion Models

[74] Adversarial Semantic and Label Perturbation Attack for Pedestrian Attribute Recognition

[75] Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

[76] Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization

[77] DSAGL: Dual-Stream Attention-Guided Learning for Weakly Supervised Whole Slide Image Classification

[78] Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering