cs.CV

[1] Using Cross-Domain Detection Loss to Infer Multi-Scale Information for Improved Tiny Head Tracking

Jisu Kim,Alex Mattingly,Eung-Joo Lee,Benjamin S. Riggan

Main category: cs.CV

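TL;DR: Proposes a framework that balances performance and efficiency for tiny head detection and tracking, using a cross-domain detection loss, a multi-scale module, and a small-receptive-field detection mechanism.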

Details

Motivation: Current methods are computationally expensive, with high latency and heavy resource use, so tiny head detection and tracking needs a better balance between performance and efficiency. Method: Integrates a cross-domain detection loss, a multi-scale module, and a small-receptive-field detection mechanism to improve detection. Result: Multiple Object Tracking Accuracy (MOTA) and mean Average Precision (mAP) improve on the CroHD and CrowdHuman datasets. Conclusion: The framework improves tiny head detection and tracking in crowded scenes. Abstract: Head detection and tracking are essential for downstream tasks, but current methods often require large computational budgets, which increase latency and tie up resources (e.g., processors, memory, and bandwidth). To address this, we propose a framework to enhance tiny head detection and tracking by optimizing the balance between performance and efficiency. Our framework integrates (1) a cross-domain detection loss, (2) a multi-scale module, and (3) a small receptive field detection mechanism. These innovations enhance detection by bridging the gap between large and small detectors, capturing high-frequency details at multiple scales during training, and using filters with small receptive fields to detect tiny heads. Evaluations on the CroHD and CrowdHuman datasets show improved Multiple Object Tracking Accuracy (MOTA) and mean Average Precision (mAP), demonstrating the effectiveness of our approach in crowded scenes.

[2] Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision

Ziyue Kang,Weichuan Zhang

Main category: cs.CV

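TL;DR: Proposes a hybrid deep-learning framework that combines an adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head to address data scarcity in rare animal image classification.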

Details

Motivation: Rare animal image classification suffers from data scarcity, since many species have only a few labeled samples. Method: Designs an adaptive frequency-domain selection mechanism, extracts global and local features with ViT-B16 and ResNet50, integrates them through a cross-level fusion strategy, and classifies with a Bayesian linear classifier. Result: On a self-built 50-class wildlife dataset, the method outperforms conventional CNN and fixed-band DCT approaches and achieves the best accuracy under sample scarcity. Conclusion: The hybrid framework improves rare animal image classification through adaptive frequency-domain processing and feature fusion. Abstract: A major challenge in rare animal image classification is the scarcity of data, as many species usually have only a small number of labeled samples. To address this challenge, we designed a hybrid deep-learning framework comprising a novel adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head. To our knowledge, we are the first to introduce an adaptive frequency-domain selection mechanism that learns optimal low-, mid-, and high-frequency boundaries suited to the subsequent backbones. Our network first captures image frequency-domain cues via this adaptive DCT partitioning. The adaptively filtered frequency features are then fed into ViT-B16 to model global contextual relationships, while ResNet50 concurrently extracts local, multi-scale spatial representations from the original image. A cross-level fusion strategy seamlessly integrates these frequency- and spatial-domain embeddings, and the fused features are passed through a Bayesian linear classifier to output the final category predictions. On our self-built 50-class wildlife dataset, this approach outperforms conventional CNN and fixed-band DCT pipelines, achieving state-of-the-art accuracy under extreme sample scarcity.

[3] HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

Qi Cai,Jingwen Chen,Yang Chen,Yehao Li,Fuchen Long,Yingwei Pan,Zhaofan Qiu,Yiheng Zhang,Fengbin Gao,Peihan Xu,Yimeng Wang,Kai Yu,Wenxuan Chen,Ziwei Feng,Zijian Gong,Jianzhuang Pan,Yi Peng,Rui Tian,Siyu Wang,Bo Zhao,Ting Yao,Tao Mei

Main category: cs.CV

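TL;DR: HiDream-I1 is a 17B-parameter image generative foundation model that achieves fast, high-quality image generation with a sparse Diffusion Transformer (DiT) and a dynamic MoE architecture, and is released in three variants. It is also extended into the instruction-based editing model HiDream-E1 and the comprehensive image agent HiDream-A1.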

Details

Motivation: Existing image generation models trade quality against computational efficiency; the goal is a solution that is both efficient and high quality. Method: Uses a sparse DiT structure with a dual-stream decoupled design and a dynamic MoE architecture, supporting multi-modal interaction and efficient generation. Result: Generates high-quality images within seconds and is extended to instruction-based editing and an interactive image agent. Conclusion: HiDream-I1 and its derived models provide efficient, flexible open-source tools for multi-modal AIGC research. Abstract: Recent advancements in image generative foundation models have prioritized quality improvements but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is constructed with a new sparse Diffusion Transformer (DiT) structure. Specifically, it starts with a dual-stream decoupled design of sparse DiT with dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders are first involved to independently process image and text tokens. Then, a single-stream sparse DiT structure with dynamic MoE architecture is adopted to trigger multi-modal interaction for image generation in a cost-efficient manner. To support flexible accessibility with varied model capabilities, we provide HiDream-I1 in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Furthermore, we go beyond the typical text-to-image generation and remould HiDream-I1 with additional image conditions to perform precise, instruction-based editing on given images, yielding a new instruction-based image editing model, namely HiDream-E1. Ultimately, by integrating text-to-image generation and instruction-based image editing, HiDream-I1 evolves to form a comprehensive image agent (HiDream-A1) capable of fully interactive image creation and refinement. To accelerate multi-modal AIGC research, we have open-sourced all the code and model weights of HiDream-I1-Full, HiDream-I1-Dev, HiDream-I1-Fast, and HiDream-E1 through our project websites: https://github.com/HiDream-ai/HiDream-I1 and https://github.com/HiDream-ai/HiDream-E1.
All features can be directly experienced via https://vivago.ai/studio.

[4] MIAS-SAM: Medical Image Anomaly Segmentation without thresholding

Marco Colussi,Dragan Ahmetovic,Sergio Mascetti

Main category: cs.CV

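TL;DR: MIAS-SAM is a new method for segmenting anomalous regions in medical images. It extracts features with a patch-based memory bank and the SAM encoder, and produces accurate segmentations without a threshold.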

Details

Motivation: Existing methods need a manually set threshold to segment anomalous regions; MIAS-SAM aims to remove that requirement and make segmentation more accurate and more automated. Method: Uses the SAM encoder to extract features from normal data and stores them in a memory bank; at inference, it compares features to produce an anomaly map and prompts the SAM decoder with the map's center of gravity. Result: Experiments on three public datasets (Brain MRI, Liver CT, and Retina OCT) show high DICE scores, indicating accurate anomaly segmentation. Conclusion: MIAS-SAM segments anomalous regions accurately without a threshold, offering an efficient, automated solution for medical image analysis. Abstract: This paper presents MIAS-SAM, a novel approach for the segmentation of anomalous regions in medical images. MIAS-SAM uses a patch-based memory bank to store relevant image features, which are extracted from normal data using the SAM encoder. At inference time, the embedding patches extracted from the SAM encoder are compared with those in the memory bank to obtain the anomaly map. Finally, MIAS-SAM computes the center of gravity of the anomaly map to prompt the SAM decoder, obtaining an accurate segmentation from the previously extracted features. Unlike prior work, MIAS-SAM does not require defining a threshold value to obtain the segmentation from the anomaly map. Experimental results on three publicly available datasets, each with a different imaging modality (Brain MRI, Liver CT, and Retina OCT), show accurate anomaly segmentation capabilities measured using the DICE score. The code is available at: https://github.com/warpcut/MIAS-SAM
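
The memory-bank comparison and the center-of-gravity prompt described in the MIAS-SAM abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, the patch features are plain tuples rather than SAM encoder embeddings, and scoring each patch by its Euclidean distance to the nearest memory-bank entry is an assumption made for illustration.

```python
import math

def anomaly_map(patches, memory_bank):
    """Score each patch by its distance to the nearest 'normal' feature.

    patches: 2D grid (list of rows) of feature vectors.
    memory_bank: list of feature vectors extracted from normal data.
    Nearest-neighbor Euclidean distance is an assumption of this sketch.
    """
    return [[min(math.dist(p, m) for m in memory_bank) for p in row]
            for row in patches]

def center_of_gravity(amap):
    """Return the score-weighted mean (x, y) position of the anomaly map."""
    total = sum(sum(row) for row in amap)
    if total == 0:
        return None  # no anomalous mass, so there is nothing to prompt with
    cx = sum(x * v for row in amap for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(amap) for v in row) / total
    return (cx, cy)

# Toy 2x2 grid: only the bottom-left patch differs from the normal feature.
bank = [(0.0, 0.0)]
grid = [[(0.0, 0.0), (0.0, 0.0)],
        [(3.0, 4.0), (0.0, 0.0)]]
scores = anomaly_map(grid, bank)
prompt_point = center_of_gravity(scores)
```

In this toy case the prompt point falls on the single anomalous patch. The point would then be passed to a promptable decoder, which is why no threshold on the anomaly map is needed.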

[5] Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization

Yuxi Zhang,Yueting Li,Xinyu Du,Sibo Wang

Main category: cs.CV

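TL;DR: Rhet2Pix is a framework for generating images from rhetorical language. It uses multi-step policy optimization with a two-layer MDP diffusion module and outperforms existing models.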

Details

Motivation: Existing text-to-image models struggle to capture the implicit meaning of rhetorical language, so generated images follow the literal visuals rather than the intended semantics. Method: Proposes the Rhet2Pix framework, which uses multi-step policy optimization and a two-layer MDP diffusion module to elaborate sub-sentences incrementally and optimize image-generation actions. Result: Experiments show that Rhet2Pix outperforms SOTA models such as GPT-4o and Grok-3 on rhetorical text-to-image generation. Conclusion: Rhet2Pix addresses the challenge of generating images from rhetorical language and offers a new direction for multimodal models. Abstract: Generating images from rhetorical language remains a critical challenge for text-to-image models. Even state-of-the-art (SOTA) multimodal large language models (MLLMs) fail to generate images based on the hidden meaning inherent in rhetorical language--despite such content being readily mappable to visual representations by humans. A key limitation is that current models emphasize object-level word embedding alignment, causing metaphorical expressions to steer image generation towards their literal visuals and overlook the intended semantic meaning. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem, incorporating a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes corresponding image-generation actions, constructing semantically richer visuals. In the inner layer, Rhet2Pix mitigates reward sparsity during image generation by discounting the final reward and optimizing every adjacent action pair along the diffusion denoising trajectory. Extensive experiments demonstrate the effectiveness of Rhet2Pix in rhetorical text-to-image generation. Our model outperforms SOTA MLLMs such as GPT-4o and Grok-3, as well as leading academic baselines, across both qualitative and quantitative evaluations. The code and dataset used in this work are publicly available.
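
The inner-layer idea in the Rhet2Pix abstract, spreading a single final reward over the denoising trajectory by discounting, can be sketched as follows. The abstract does not give the exact formula, so the exponential discount, the `gamma` parameter, and both function names are assumptions of this sketch rather than the paper's method.

```python
def discounted_step_rewards(final_reward, num_steps, gamma=0.9):
    """Assign each denoising step a discounted share of the final reward.

    Steps closer to the final image get a reward closer to final_reward.
    The exponential form is an assumption made for illustration.
    """
    return [final_reward * gamma ** (num_steps - 1 - t)
            for t in range(num_steps)]

def adjacent_pairs(rewards):
    """Pair each step's reward with its successor's along the trajectory."""
    return list(zip(rewards[:-1], rewards[1:]))

# Three denoising steps, final reward 1.0, discount 0.5.
step_rewards = discounted_step_rewards(1.0, 3, gamma=0.5)
pairs = adjacent_pairs(step_rewards)
```

With a dense per-step signal like this, every adjacent pair of actions carries a training signal instead of only the last step.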

[6] Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory

Srishti Yadav,Lauren Tilton,Maria Antoniak,Taylor Arnold,Jiaang Li,Siddhesh Milind Pawar,Antonia Karamolegkou,Stella Frank,Zhaochong An,Negar Rostamzadeh,Daniel Hershcovich,Serge Belongie,Ekaterina Shutova

Main category: cs.CV

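TL;DR: The paper notes that modern vision-language models (VLMs) perform poorly on cultural competency evaluations, argues for methods from visual culture studies, and proposes five frameworks, corresponding to cultural dimensions, for systematically analyzing the cultural competencies of VLMs.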

Details

Motivation: VLMs fall short in culturally diverse applications, and there is no framework for systematically analyzing cultural dimensions. Method: Draws on methodologies from visual culture studies (cultural studies, semiotics, and visual studies) to propose five frameworks corresponding to cultural dimensions. Result: A set of frameworks for systematically analyzing the cultural competencies of VLMs. Conclusion: Incorporating methods from visual culture studies allows a more complete evaluation and improvement of the cultural competencies of VLMs. Abstract: Modern vision-language models (VLMs) often fail at cultural competency evaluations and benchmarks. Given the diversity of applications built upon VLMs, there is renewed interest in understanding how they encode cultural nuances. While individual aspects of this problem have been studied, we still lack a comprehensive framework for systematically identifying and annotating the nuanced cultural dimensions present in images for VLMs. This position paper argues that foundational methodologies from visual culture studies (cultural studies, semiotics, and visual studies) are necessary for cultural analysis of images. Building upon this review, we propose a set of five frameworks, corresponding to cultural dimensions, that must be considered for a more complete analysis of the cultural competencies of VLMs.

[7] One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

Chenhao Zheng,Jieyu Zhang,Mohammadreza Salehi,Ziqi Gao,Vishnu Iyengar,Norimasa Kobori,Quan Kong,Ranjay Krishna

Main category: cs.CV

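TL;DR: TrajViT tokenizes video based on object trajectories, which sharply reduces redundant tokens while maintaining performance, and outperforms the conventional space-time ViT.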

Details

Motivation: Current video tokenization uses fixed space-time patches, which produces redundant tokens and wasted computation and works poorly when the camera moves. Method: Proposes TrajViT, a tokenization method based on panoptic sub-object trajectories that is trained with contrastive learning and extracts semantically meaningful tokens. Result: TrajViT outperforms ViT3D on video understanding tasks, for example by 6% top-5 recall on video-text retrieval with 10x fewer tokens. Conclusion: TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, and it is robust and scalable. Abstract: Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% top-5 recall on average on the video-text retrieval task with a 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average of 5.2% performance improvement across 6 VideoQA benchmarks while having 4x faster training time and 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.

[8] Fast Trajectory-Independent Model-Based Reconstruction Algorithm for Multi-Dimensional Magnetic Particle Imaging

Vladyslav Gapyak,Thomas März,Andreas Weinmann

Main category: cs.CV

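TL;DR: Presents a trajectory-independent model-based reconstruction algorithm for 2D Magnetic Particle Imaging (MPI), combined with a zero-shot Plug-and-Play (PnP) method for the deconvolution problem, and demonstrates strong reconstruction across different scanning scenarios.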

Details

Motivation: Traditional MPI reconstruction depends on time-consuming calibration or on model-based simulation tied to a specific trajectory, which limits flexibility and generality. The paper aims to develop a model-based reconstruction algorithm that does not depend on a particular trajectory. Method: Uses a trajectory-independent model-based reconstruction algorithm together with a zero-shot PnP method (with automatic noise level estimation), applying a state-of-the-art denoiser trained on natural images to MPI data. Result: The algorithm is validated on a public 2D FFP MPI dataset and on custom data, showing strong reconstruction across different scanning scenarios. Conclusion: The method lays a foundation for general-purpose, flexible model-based MPI reconstruction with broad potential for application. Abstract: Magnetic Particle Imaging (MPI) is a promising tomographic technique for visualizing the spatio-temporal distribution of superparamagnetic nanoparticles, with applications ranging from cancer detection to real-time cardiovascular monitoring. Traditional MPI reconstruction relies on either time-consuming calibration (measured system matrix) or model-based simulation of the forward operator. Recent developments have shown the applicability of Chebyshev polynomials to multi-dimensional Lissajous Field-Free Point (FFP) scans. This method is bound to the particular choice of sinusoidal scanning trajectories. In this paper, we present the first reconstruction on real 2D MPI data with a trajectory-independent model-based MPI reconstruction algorithm. We further develop the zero-shot Plug-and-Play (PnP) algorithm of the authors -- with automatic noise level estimation -- to address the present deconvolution problem, leveraging a state-of-the-art denoiser trained on natural images without retraining on MPI-specific data. We evaluate our method on the publicly available 2D FFP MPI dataset "MPIdata: Equilibrium Model with Anisotropy", featuring scans of six phantoms acquired using a Bruker preclinical scanner. Moreover, we show reconstruction performed on custom data on a 2D scanner with an additional high-frequency excitation field and partial data. Our results demonstrate strong reconstruction capabilities across different scanning scenarios -- setting a precedent for general-purpose, flexible model-based MPI reconstruction.

[9] VidText: Towards Comprehensive Evaluation for Video Text Understanding

Zhoufaran Yang,Yan Shu,Zhifei Yang,Yan Zhang,Yu Li,Keyang Lu,Gangyan Zeng,Shaohui Liu,Yu Zhou,Nicu Sebe

Main category: cs.CV

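TL;DR: VidText is a new benchmark for video text understanding that fills gaps left by existing video understanding and OCR benchmarks and supports multilingual, multi-level task evaluation.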

Details

Motivation: Existing video understanding benchmarks overlook textual information, and OCR benchmarks are limited to static images, so neither captures the interaction between text and dynamic visual content. Method: Proposes the VidText benchmark, which covers diverse scenarios and multilingual content and introduces a hierarchical evaluation framework (video-level, clip-level, and instance-level tasks) along with paired perception and reasoning tasks. Result: Experiments show that current large multimodal models perform poorly on most tasks, leaving room for improvement. Conclusion: VidText fills a gap in video understanding benchmarks and lays a foundation for future research on multimodal reasoning in dynamic environments. Abstract: Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.

[10] IMTS is Worth Time $\times$ Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction

Zhangyi Hu,Jiemin Wu,Hua Xu,Mingqian Liao,Ninghui Feng,Bo Gao,Songning Lai,Yutao Yue

Main category: cs.CV

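TL;DR: VIMTS is a framework based on a visual MAE for Irregular Multivariate Time Series (IMTS) forecasting that improves performance through feature completion and self-supervised learning.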

Details

Motivation: IMTS forecasting is difficult because multi-channel signals are unaligned and much of the data is missing, so existing methods struggle to capture reliable temporal patterns. The visual MAE's potential for handling sparse multi-channel data motivates applying it to IMTS. Method: VIMTS splits the IMTS along the timeline into feature patches at equal intervals, completes missing values using cross-channel dependencies, reconstructs patches with a visual MAE, and generates predictions with a coarse-to-fine technique. Result: Experiments show that VIMTS achieves strong performance and few-shot capability, advancing the use of visual foundation models in broader time series tasks. Conclusion: VIMTS successfully applies a visual MAE to IMTS forecasting and offers a new approach to irregular time series. Abstract: Irregular Multivariate Time Series (IMTS) forecasting is challenging due to the unaligned nature of multi-channel signals and the prevalence of extensive missing data. Existing methods struggle to capture reliable temporal patterns from such data due to significant missing values. While pre-trained foundation models show potential for addressing these challenges, they are typically designed for Regularly Sampled Time Series (RTS). Motivated by the powerful capability of the visual Masked AutoEncoder (MAE) for modeling sparse multi-channel information and its success in RTS forecasting, we propose VIMTS, a framework adapting Visual MAE for IMTS forecasting. To mitigate the effect of missing values, VIMTS first processes IMTS along the timeline into feature patches at equal intervals. These patches are then complemented using learned cross-channel dependencies. Then it leverages visual MAE's capability in handling sparse multichannel data for patch reconstruction, followed by a coarse-to-fine technique to generate precise predictions from focused contexts. In addition, we integrate self-supervised learning for improved IMTS modeling by adapting the visual MAE to IMTS data. Extensive experiments demonstrate VIMTS's superior performance and few-shot capability, advancing the application of visual foundation models in more general time series tasks. Our code is available at https://github.com/WHU-HZY/VIMTS.
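
The first VIMTS step, cutting an irregular series into patches at equal time intervals, can be illustrated with a short sketch. It is a simplification under stated assumptions: `to_patches` is a hypothetical name, each patch holds a per-channel mean where the paper uses learned feature extraction, and cells with no observation are left as `None` for a later completion step.

```python
def to_patches(observations, t_start, t_end, num_patches, num_channels):
    """Bin irregular (time, channel, value) observations into equal intervals.

    Returns a num_patches x num_channels grid of per-bin means, with None
    where a channel has no observation in that interval. Averaging is an
    assumption of this sketch, not the paper's feature extractor.
    """
    width = (t_end - t_start) / num_patches
    sums = [[0.0] * num_channels for _ in range(num_patches)]
    counts = [[0] * num_channels for _ in range(num_patches)]
    for t, channel, value in observations:
        if not (t_start <= t <= t_end):
            continue  # ignore observations outside the window
        # Clamp so that t == t_end lands in the last patch.
        idx = min(int((t - t_start) / width), num_patches - 1)
        sums[idx][channel] += value
        counts[idx][channel] += 1
    return [[sums[i][c] / counts[i][c] if counts[i][c] else None
             for c in range(num_channels)]
            for i in range(num_patches)]

# Two channels sampled at unaligned times over the window [0, 4].
obs = [(0.5, 0, 1.0), (0.7, 0, 3.0), (2.5, 1, 4.0)]
patches = to_patches(obs, 0.0, 4.0, num_patches=2, num_channels=2)
```

The `None` cells make the sparsity explicit: they are the entries that cross-channel completion and masked reconstruction would have to fill.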

[11] How Animals Dance (When You're Not Looking)

Xiaojuan Wang,Aleksander Holynski,Brian Curless,Ira Kemelmacher,Steve Seitz

Main category: cs.CV

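TL;DR: Proposes a keyframe-based framework for generating music-synchronized, choreography-aware animal dance videos by optimizing the keyframe structure and synthesizing in-between frames with a video diffusion model.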

Details

Motivation: Generating high-quality, music-synchronized animal dance videos from a few keyframes is difficult, as is capturing the symmetry found in dance. Method: Formulates dance synthesis as a graph optimization problem, automatically estimates the choreography pattern of beats, and synthesizes in-between frames with a video diffusion model. Result: With as few as six input keyframes, the method generates animal dance videos of up to 30 seconds across many animals and music tracks. Conclusion: The method is efficient and flexible and can generate diverse music-synchronized dance videos. Abstract: We present a keyframe-based framework for generating music-synchronized, choreography-aware animal dance videos. Starting from a few keyframes representing distinct animal poses -- generated via text-to-image prompting or GPT-4o -- we formulate dance synthesis as a graph optimization problem: find the optimal keyframe structure that satisfies a specified choreography pattern of beats, which can be automatically estimated from a reference dance video. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using a video diffusion model. With as few as six input keyframes, our method can produce up to 30-second dance videos across a wide range of animals and music tracks.

[12] Improving Contrastive Learning for Referring Expression Counting

Kostas Triaridis,Panagiotis Kaliosis,E-Ro Nguyen,Jingyi Xu,Hieu Le,Dimitris Samaras

Main category: cs.CV

TL;DR: Proposes C-REX, a contrastive learning framework for the Referring Expression Counting (REC) task that substantially improves performance.

Details

Motivation: Existing methods struggle to distinguish visually similar objects that correspond to different referring expressions, so a more robust representation learning method is needed. Method: Proposes C-REX, which is based on supervised contrastive learning and operates entirely in image space, avoiding the misalignment issues of image-text contrastive learning and providing a larger pool of negative samples. Result: On REC, C-REX improves MAE by more than 22% and RMSE by more than 10%, and it also performs well on class-agnostic counting. Conclusion: C-REX is a general and efficient framework for REC and similar tasks, and detecting object centroids rather than bounding boxes further improves performance. Abstract: Object counting has progressed from class-specific models, which count only known categories, to class-agnostic models that generalize to unseen categories. The next challenge is Referring Expression Counting (REC), where the goal is to count objects based on fine-grained attributes and contextual differences. Existing methods struggle with distinguishing visually similar objects that belong to the same category but correspond to different referring expressions. To address this, we propose C-REX, a novel contrastive learning framework, based on supervised contrastive learning, designed to enhance discriminative representation learning. Unlike prior works, C-REX operates entirely within the image space, avoiding the misalignment issues of image-text contrastive learning, thus providing a more stable contrastive signal. It also guarantees a significantly larger pool of negative samples, leading to improved robustness in the learned representations. Moreover, we showcase that our framework is versatile and generic enough to be applied to other similar tasks like class-agnostic counting. To support our approach, we analyze the key components of SOTA detection-based models and identify that detecting object centroids instead of bounding boxes is the key common factor behind their success in counting tasks. We use this insight to design a simple yet effective detection-based baseline to build upon. Our experiments show that C-REX achieves state-of-the-art results in REC, outperforming previous methods by more than 22% in MAE and more than 10% in RMSE, while also demonstrating strong performance in class-agnostic counting. Code is available at https://github.com/cvlab-stonybrook/c-rex.
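
Since C-REX builds on supervised contrastive learning, a generic form of that loss may help make the idea concrete. The sketch below is the standard supervised contrastive formulation over L2-normalized embeddings, not the C-REX objective itself; the function name, the temperature value, and the toy embeddings are assumptions for illustration.

```python
import math

def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over a batch of embeddings.

    For each anchor, same-label samples are positives and all other samples
    are negatives. Anchors without any positive are skipped.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total -= sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Labels that agree with the embedding clusters give a much lower loss
# than labels that cut across them.
points = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
loss_aligned = supcon_loss(points, [0, 0, 1, 1])
loss_mixed = supcon_loss(points, [0, 1, 0, 1])
```

Every sample with a different label acts as a negative in the denominator, so a larger negative pool gives each anchor more contrast to learn from.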

[13] LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization

Ronghuan Wu,Wanchao Su,Jing Liao

Main category: cs.CV

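TL;DR: LayerPeeler is a new layer-wise image vectorization method that handles occluded regions through progressive simplification, producing vector graphics with complete paths and coherent layer structures.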

Details

Motivation: Existing image vectorization tools handle occluded regions poorly, producing incomplete or fragmented shapes that hinder editability. Method: Uses an autoregressive peeling strategy: a vision-language model builds a layer graph, a finetuned image diffusion model removes the identified non-occluded layers, and localized attention control keeps the removal precise. Result: Experiments show that LayerPeeler outperforms existing techniques in path semantics, geometric regularity, and visual fidelity. Conclusion: LayerPeeler offers a high-quality, flexible solution for image vectorization, especially for occluded regions. Abstract: Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored rule-based and data-driven layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler's success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity.
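
The peeling order implied by a layer graph can be sketched independently of the vision-language and diffusion models. In the sketch below the occlusion graph is given by hand as `(upper, lower)` pairs, and `peel_order` is a hypothetical helper; in LayerPeeler the graph comes from a vision-language model and each removal is performed by a finetuned diffusion model, neither of which is reproduced here.

```python
def peel_order(layers, occludes):
    """Group layers into peeling rounds, topmost non-occluded layers first.

    occludes: set of (upper, lower) pairs meaning `upper` covers part of
    `lower`. A layer can be peeled once nothing still present covers it.
    """
    remaining = set(layers)
    rounds = []
    while remaining:
        top = sorted(layer for layer in remaining
                     if not any(upper in remaining and lower == layer
                                for upper, lower in occludes))
        if not top:
            raise ValueError("occlusion graph contains a cycle")
        rounds.append(top)
        remaining -= set(top)
    return rounds

# A toy character: eye and hat sit on the body, which sits on the background.
graph = {("body", "bg"), ("eye", "body"), ("hat", "body")}
order = peel_order(["bg", "body", "eye", "hat"], graph)
```

Each round corresponds to one autoregressive step: remove the currently unobstructed layers, recover what was underneath, and repeat on the simplified image.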

[14] CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting

Kornel Howil,Joanna Waczyńska,Piotr Borycki,Tadeusz Dziarmaga,Marcin Mazur,Przemysław Spurek

Main category: cs.CV

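TL;DR: CLIPGaussians is a unified style transfer framework that supports multiple modalities (2D images, videos, 3D objects, and 4D scenes). It builds on the Gaussian Splatting representation and needs neither large generative models nor retraining from scratch.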

Details

Motivation: Gaussian Splatting (GS) renders 3D scenes efficiently, but style transfer on it remains challenging, especially beyond simple color changes. Method: CLIPGaussians operates directly on Gaussian primitives, integrates into existing GS pipelines as a plug-in module, jointly optimizes color and geometry in 3D and 4D, and maintains temporal coherence in videos. Result: The method shows superior style fidelity and consistency across all tasks while keeping the model size unchanged. Conclusion: CLIPGaussians is a universal and efficient solution for multimodal style transfer. Abstract: Gaussian Splatting (GS) has recently emerged as an efficient representation for rendering 3D scenes from 2D images and has been extended to images, videos, and dynamic 4D content. However, applying style transfer to GS-based representations, especially beyond simple color changes, remains challenging. In this work, we introduce CLIPGaussians, the first unified style transfer framework that supports text- and image-guided stylization across multiple modalities: 2D images, videos, 3D objects, and 4D scenes. Our method operates directly on Gaussian primitives and integrates into existing GS pipelines as a plug-in module, without requiring large generative models or retraining from scratch. The CLIPGaussians approach enables joint optimization of color and geometry in 3D and 4D settings and achieves temporal coherence in videos, while preserving the model size. We demonstrate superior style fidelity and consistency across all tasks, validating CLIPGaussians as a universal and efficient solution for multimodal style transfer.

[15] A Probabilistic Jump-Diffusion Framework for Open-World Egocentric Activity Recognition

Sanjoy Kundu,Shanmukha Vellamcheti,Sathyanarayanan N. Aakur

Main category: cs.CV

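TL;DR: The ProbRes framework addresses open-world egocentric activity recognition efficiently through probabilistic residual search based on jump-diffusion, combining commonsense priors with vision-language models to achieve strong performance.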

Details

Motivation: Open-world egocentric activity recognition is unconstrained, so models must infer unseen activities from a partially observed search space, which calls for an efficient method. Method: Proposes ProbRes, a probabilistic residual search based on jump-diffusion that combines commonsense priors with vision-language models to refine predictions adaptively. Result: Achieves state-of-the-art performance on several benchmark datasets (GTEA Gaze and others) and establishes a taxonomy for open-world recognition. Conclusion: ProbRes provides an efficient method for open-world egocentric activity understanding and clarifies the associated challenges and methodological advances. Abstract: Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to locate high-likelihood activity labels while efficiently minimizing exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0--L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding.

[16] 4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians

Hidenobu Matsuki,Gwangbin Bae,Andrew J. Davison

Main category: cs.CV

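TL;DR: Proposes the first 4D tracking and mapping method based on differentiable rendering, jointly optimizing camera localization and non-rigid surface reconstruction to tackle the high-dimensional optimization problem of 4D-SLAM.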

Details

Motivation: Complex non-rigid motion in natural environments has left 4D-SLAM underexplored, and reliable evaluation protocols are lacking. Method: Combines Gaussian surface primitives with an MLP warp-field and introduces a new camera pose estimation technique and surface regularization terms. Result: Achieves accurate surface reconstruction and releases an open synthetic dataset to support evaluation. Conclusion: The work lays a foundation for modern 4D-SLAM research by providing a new method and evaluation protocols. Abstract: We propose the first 4D tracking and mapping method that jointly performs camera localization and non-rigid surface reconstruction via differentiable rendering. Our approach captures 4D scenes from an online stream of color images with depth measurements or predictions by jointly optimizing scene geometry, appearance, dynamics, and camera ego-motion. Although natural environments exhibit complex non-rigid motions, 4D-SLAM remains relatively underexplored due to its inherent challenges; even with 2.5D signals, the problem is ill-posed because of the high dimensionality of the optimization space. To overcome these challenges, we first introduce a SLAM method based on Gaussian surface primitives that leverages depth signals more effectively than 3D Gaussians, thereby achieving accurate surface reconstruction. To further model non-rigid deformations, we employ a warp-field represented by a multi-layer perceptron (MLP) and introduce a novel camera pose estimation technique along with surface regularization terms that facilitate spatio-temporal reconstruction. In addition to these algorithmic challenges, a significant hurdle in 4D-SLAM research is the lack of reliable ground truth and evaluation protocols, primarily due to the difficulty of 4D capture using commodity sensors. To address this, we present a novel open synthetic dataset of everyday objects with diverse motions, leveraging large-scale object models and animation modeling. In summary, we open up modern 4D-SLAM research by introducing a novel method and evaluation protocols grounded in modern vision and rendering techniques.

[17] CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

Junbo Yin,Chao Zha,Wenjia He,Chencheng Xu,Xin Gao

Main category: cs.CV

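TL;DR: CFP-Gen is a new diffusion language model for combinatorial functional protein generation that performs protein design by integrating multimodal conditions.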

Details

Motivation: Existing PLMs generate protein sequences from a single-modality condition and struggle to satisfy constraints from multiple modalities at once. Method: Introduces an AGFM module that dynamically adjusts the protein feature distribution and an RCFE module that captures residue-wise interaction, and integrates 3D structure encoders to impose geometric constraints. Result: CFP-Gen efficiently generates novel proteins with functionality comparable to natural proteins and achieves a high success rate in designing multifunctional proteins. Conclusion: CFP-Gen provides an effective solution for protein design under multimodal constraints. Abstract: Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data are available at https://github.com/yinjunbo/cfpgen.

[18] 3DGS Compression with Sparsity-guided Hierarchical Transform Coding

Hao Xu,Xiaolin Wu,Xi Zhang

Main category: cs.CV

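TL;DR: SHTC is an end-to-end optimized transform coding framework for 3D Gaussian Splatting (3DGS) compression. It jointly optimizes the 3DGS, the transforms, and a lightweight context model, and substantially improves rate-distortion performance.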

Details

Motivation: 3DGS is popular for fast, high-quality rendering, but its large memory footprint makes transmission and storage costly. Existing neural compression methods do not use end-to-end optimized analysis-synthesis transforms, which leads to suboptimal performance. Method: The SHTC framework consists of a base layer that uses KLT for data decorrelation and a sparsity-coded enhancement layer, which reconstructs the residuals with a linear transform and the ISTA algorithm. All components are designed to be interpretable, which reduces the parameter count. Result: SHTC substantially improves rate-distortion performance with minimal parameter and computational overhead. Conclusion: SHTC is the first end-to-end optimized 3DGS compression framework and achieves efficient compression through interpretable design and joint optimization. Abstract: 3D Gaussian Splatting (3DGS) has gained popularity for its fast and high-quality rendering, but it has a very large memory footprint, incurring high transmission and storage overhead. Recently, some neural compression methods, such as Scaffold-GS, were proposed for 3DGS, but they did not adopt the approach of end-to-end optimized analysis-synthesis transforms, which has been proven highly effective in neural signal compression. Without an appropriate analysis transform, signal correlations cannot be removed by sparse representation. Without such transforms the only way to remove signal redundancies is through entropy coding driven by complex and expensive context modeling, which results in slower speed and suboptimal rate-distortion (R-D) performance. To overcome this weakness, we propose Sparsity-guided Hierarchical Transform Coding (SHTC), the first end-to-end optimized transform coding framework for 3DGS compression. SHTC jointly optimizes the 3DGS, transforms and a lightweight context model. This joint optimization enables the transform to produce representations that approach the best R-D performance possible. The SHTC framework consists of a base layer using KLT for data decorrelation, and a sparsity-coded enhancement layer that compresses the KLT residuals to refine the representation. The enhancement encoder learns a linear transform to project high-dimensional inputs into a low-dimensional space, while the decoder unfolds the Iterative Shrinkage-Thresholding Algorithm (ISTA) to reconstruct the residuals. All components are designed to be interpretable, allowing the incorporation of signal priors and fewer parameters than black-box transforms.
This novel design significantly improves R-D performance with minimal additional parameters and computational overhead.
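The abstract above says the enhancement decoder unfolds ISTA. As a rough illustration of the classical, non-learned iteration that such a decoder is built from, the sketch below minimizes 0.5*||x - Dz||^2 + lam*||z||_1 in pure Python. The dictionary `D`, the step size, and `lam` are hypothetical toy values chosen for this example; this is not SHTC's learned decoder.

```python
def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal map of the L1 norm)."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def ista(D, x, lam, step, n_iter=200):
    """Classical ISTA for min_z 0.5*||x - D z||^2 + lam*||z||_1.

    D is an m x n dictionary given as a list of rows, x a length-m signal.
    Convergence needs step <= 1 / (largest eigenvalue of D^T D).
    """
    Dt = transpose(D)
    z = [0.0] * len(D[0])
    for _ in range(n_iter):
        residual = [xi - yi for xi, yi in zip(x, matvec(D, z))]
        # Gradient step on the quadratic term, then shrinkage for the L1 term.
        moved = [zi + step * gi for zi, gi in zip(z, matvec(Dt, residual))]
        z = soft_threshold(moved, step * lam)
    return z


# Toy case (assumed values): with an identity dictionary the minimizer is
# simply soft_threshold(x, lam).
D = [[1.0, 0.0], [0.0, 1.0]]
z = ista(D, x=[3.0, 0.2], lam=0.5, step=1.0)  # z is [2.5, 0.0]
```

An unfolded decoder would typically fix `n_iter` to a small number and learn the step size, threshold, and dictionary per layer; how SHTC parameterizes these is not stated in the abstract.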

[19] Hierarchical Material Recognition from Local Appearance

Matthew Beveridge,Shree K. Nayar

Main category: cs.CV

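TL;DR: The paper proposes a taxonomy of materials based on physical traits and builds a diverse dataset. It performs hierarchical material recognition with graph attention networks, achieving strong performance that holds up under adverse conditions and in few-shot learning.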

Details

Motivation: To provide a material taxonomy based on physical traits for vision applications and to address the challenges of material recognition in real-world scenes. Method: Uses graph attention networks together with the taxonomy and dataset, exploiting the taxonomic relationships between classes for hierarchical material recognition. Result: The model achieves state-of-the-art performance in material recognition, adapts to adverse conditions, and generalizes better with the help of depth maps. Conclusion: Combining the taxonomy with graph attention networks performs well in material recognition and has potential for practical applications. Abstract: We introduce a taxonomy of materials for hierarchical recognition from local appearance. Our taxonomy is motivated by vision applications and is arranged according to the physical traits of materials. We contribute a diverse, in-the-wild dataset with images and depth maps of the taxonomy classes. Utilizing the taxonomy and dataset, we present a method for hierarchical material recognition based on graph attention networks. Our model leverages the taxonomic proximity between classes and achieves state-of-the-art performance. We demonstrate the model's potential to generalize to adverse, real-world imaging conditions, and that novel views rendered using the depth maps can enhance this capability. Finally, we show the model's capacity to rapidly learn new materials in a few-shot learning setting.

Maksim Kolodiazhnyi,Denis Tarasov,Dmitrii Zhemchuzhnikov,Alexander Nikulin,Ilya Zisman,Anna Vorontsova,Anton Konushin,Vladislav Kurenkov,Danila Rukhovich

Main category: cs.CV

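TL;DR: Proposes a multi-modal CAD reconstruction model that takes point cloud, image, and text inputs and is trained in two stages (supervised fine-tuning and reinforcement learning fine-tuning), outperforming existing single-modal methods on the DeepCAD benchmark.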

Details

Motivation: Existing CAD reconstruction methods typically support only a single input modality, which limits generalizability and robustness. Combining multi-modal inputs allows broader use in design applications. Method: Two-stage training: 1) supervised fine-tuning (SFT) on large-scale procedurally generated data; 2) reinforcement learning (RL) fine-tuning using online feedback (e.g., the GRPO algorithm). This is the first work to explore RL fine-tuning of LLMs for CAD tasks. Result: The SFT model outperforms all single-modal methods on the DeepCAD benchmark; after RL fine-tuning, the model sets a new state of the art on three challenging datasets, including a real-world one. Conclusion: Multi-modal input combined with two-stage training significantly improves the performance and generality of CAD reconstruction, providing a stronger tool for design applications. Abstract: Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLM), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback, obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. In the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets a new state of the art on three challenging datasets, including a real-world one.
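The RL stage above relies on GRPO, whose core idea is to score each sampled output relative to the other outputs drawn for the same input instead of using a learned value baseline. A minimal sketch of that group-relative advantage follows. The reward values are hypothetical (they could stand for any programmatic quality score of a reconstructed CAD model), and the use of the population standard deviation is an assumption of this example; the abstract does not specify cadrille's reward or its exact GRPO variant.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    better than their siblings get a positive weight and worse ones a
    negative weight.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled from the same input.
advantages = group_relative_advantages([0.9, 0.5, 0.1, 0.5])
```

In a full training loop these advantages would weight the policy-gradient term for every token of the corresponding sample; that part is omitted here.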

[21] Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Ruichen Chen,Keith G. Mills,Liyao Jiang,Chao Gao,Di Niu

Main category: cs.CV

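TL;DR: Re-ttention proposes a highly sparse attention mechanism that exploits the temporal redundancy of diffusion models to preserve visual generation quality at very low computational cost.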

Details

Motivation: To address the quadratic complexity, in resolution and video length, of the attention mechanism in diffusion transformers, while avoiding the visual quality degradation and added compute overhead that existing sparse attention techniques show at extremely high sparsity. Method: Reshapes attention scores based on the history of prior softmax distributions, enabling attention computation at very high sparsity. Result: On models such as CogVideoX and the PixArt DiTs, it outperforms existing methods while using as few as 3.1% of the tokens, and achieves over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU. Conclusion: Re-ttention substantially improves computational efficiency at very high sparsity while preserving visual generation quality, offering an efficient solution for high-resolution video and image generation. Abstract: Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference. Further, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code available online here: https://github.com/cccrrrccc/Re-ttention
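The "probabilistic normalization shift" mentioned above can be seen in a few lines: a softmax taken over only the retained tokens has a smaller denominator than the full softmax, so every retained probability is inflated. The sketch below only illustrates that shift and the rescaling that undoes it; it is not the Re-ttention algorithm. The logits and the retained indices are hypothetical, and the correction factor is computed from the full softmax purely for demonstration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 0.0, -1.0]  # hypothetical logits for one query
kept = [0, 1]                        # indices retained by a sparse mask

full = softmax(scores)
sparse = softmax([scores[i] for i in kept])

# Share of the full softmax mass carried by the retained tokens.
kept_mass = sum(full[i] for i in kept)

# Scaling the sparse probabilities by that share recovers the full values.
rescaled = [p * kept_mass for p in sparse]
```

A practical method cannot afford to compute `kept_mass` from the full softmax at every step; it would have to be estimated, for example from an earlier denoising step. How Re-ttention obtains its statistics is not described in the abstract beyond its use of the prior softmax distribution history.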

[22] Leveraging Diffusion Models for Synthetic Data Augmentation in Protein Subcellular Localization Classification

Sylvey Lin,Zhi-Yi Cao

Main category: cs.CV

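TL;DR: The study examines whether synthetic images generated by diffusion models can improve multi-label classification of protein subcellular localization, and finds that hybrid training strategies perform well on the validation set but generalize poorly to the test set.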

Details

Motivation: To explore the potential of synthetic data in biomedical image classification, in particular whether data generated by diffusion models can strengthen multi-label classification. Method: A simplified class-conditional denoising diffusion probabilistic model (DDPM) generates label-consistent samples, and two hybrid training strategies, Mix Loss and Mix Representation, integrate synthetic and real data. Result: The hybrid strategies perform well on the validation set but generalize poorly to the test set; ResNet-based baseline classifiers are more stable. Conclusion: Applying synthetic data to biomedical image classification requires attention to the realism of the generated data and the robustness of the supervision. Abstract: We investigate whether synthetic images generated by diffusion models can enhance multi-label classification of protein subcellular localization. Specifically, we implement a simplified class-conditional denoising diffusion probabilistic model (DDPM) to produce label-consistent samples and explore their integration with real data via two hybrid training strategies: Mix Loss and Mix Representation. While these approaches yield promising validation performance, our proposed MixModel exhibits poor generalization to unseen test data, underscoring the challenges of leveraging synthetic data effectively. In contrast, baseline classifiers built on ResNet backbones with conventional loss functions demonstrate greater stability and test-time performance. Our findings highlight the importance of realistic data generation and robust supervision when incorporating generative augmentation into biomedical image classification.

[23] Fast Isotropic Median Filtering

Ben Weiss

Main category: cs.CV

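TL;DR: Proposes an efficient median filtering method that overcomes the limitations of traditional algorithms on bit depth, kernel size, and kernel shape.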

Details

Motivation: Traditional median filtering algorithms are limited in bit depth, kernel size, and kernel shape, which degrades results in practice. Method: Develops a new method that supports arbitrary bit depths, kernel sizes, and convex kernel shapes, including circular ones. Result: The new method is efficient, free of the limitations of traditional algorithms, and avoids streaky cross-hatching artifacts. Conclusion: The method is the first to address the practical limitations of median filtering comprehensively and is broadly applicable. Abstract: Median filtering is a cornerstone of computational image processing. It provides an effective means of image smoothing, with minimal blurring or softening of edges, invariance to monotonic transformations such as gamma adjustment, and robustness to noise and outliers. However, known algorithms have all suffered from practical limitations: the bit depth of the image data, the size of the filter kernel, or the kernel shape itself. Square-kernel implementations tend to produce streaky cross-hatching artifacts, and nearly all known efficient algorithms are in practice limited to square kernels. We present for the first time a method that overcomes all of these limitations. Our method operates efficiently on arbitrary bit-depth data, arbitrary kernel sizes, and arbitrary convex kernel shapes, including circular shapes.
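To make concrete what a circular-kernel median filter computes, here is a brute-force reference in pure Python. It is deliberately naive (it re-sorts the whole window at every pixel) and is not the fast algorithm the paper presents; the edge-replication border handling and the function names are assumptions of this sketch.

```python
import statistics


def circular_offsets(radius):
    """Offsets (dy, dx) of all pixels inside a disc of the given radius."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def median_filter_circular(image, radius):
    """Brute-force median filter with a circular (isotropic) kernel.

    image is a list of rows of comparable values, so any bit depth works.
    Borders are handled by clamping coordinates (edge replication).
    """
    h, w = len(image), len(image[0])
    offsets = circular_offsets(radius)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                      for dy, dx in offsets]
            out[y][x] = statistics.median_low(window)
    return out


# A single outlier pixel is removed while the flat background is unchanged.
noisy = [[10, 10, 10],
         [10, 255, 10],
         [10, 10, 10]]
clean = median_filter_circular(noisy, radius=1)
```

The cost here grows with the kernel area at every pixel, which is exactly the kind of scaling that efficient median filtering algorithms are designed to avoid.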

[24] ATI: Any Trajectory Instruction for Controllable Video Generation

Angtian Wang,Haibin Huang,Jacob Zhiyuan Fang,Yiding Yang,Chongyang Ma

Main category: cs.CV

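TL;DR: Proposes a unified motion control framework for video generation that integrates camera movement, object translation, and local motion through trajectory inputs.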

Details

Motivation: Existing methods handle different types of motion control with separate modules or task-specific designs; this work aims to provide a unified solution. Method: A lightweight motion injector projects user-defined trajectories into the latent space of pre-trained image-to-video generation models. Result: Strong performance across multiple video motion control tasks, including stylized motion effects, dynamic viewpoint changes, and precise local motion manipulation. Conclusion: The method significantly outperforms prior approaches and commercial solutions in controllability and visual quality, and is compatible with various state-of-the-art video generation models. Abstract: We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion using trajectory-based inputs. In contrast to prior methods that address these motion types through separate modules or task-specific designs, our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models via a lightweight motion injector. Users can specify keypoints and their motion paths to control localized deformations, entire object motion, virtual camera dynamics, or combinations of these. The injected trajectory signals guide the generative process to produce temporally consistent and semantically aligned motion sequences. Our framework demonstrates superior performance across multiple video motion control tasks, including stylized motion effects (e.g., motion brushes), dynamic viewpoint changes, and precise local motion manipulation. Experiments show that our method provides significantly better controllability and visual quality compared to prior approaches and commercial solutions, while remaining broadly compatible with various state-of-the-art video generation backbones. Project page: https://anytraj.github.io/.

[25] Toward Memory-Aided World Models: Benchmarking via Spatial Consistency

Kewei Lian,Shaofei Cai,Yilun Du,Yitao Liang

Main category: cs.CV

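TL;DR: The paper presents a dataset and benchmark for spatially consistent world models, collecting 20 million frames of navigation video in Minecraft to help models learn long-range spatial consistency.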

Details

Motivation: Existing datasets lack explicit spatial consistency constraints, and most benchmarks focus only on visual coherence or generation quality, neglecting the need for long-range spatial consistency. Method: Builds a navigation video dataset covering 150 Minecraft locations (20 million frames), uses a curriculum design that gradually increases sequence length, and evaluates four world model baselines. Result: The dataset and benchmark support learning spatial consistency on complex navigation trajectories, and the data collection pipeline extends easily to new environments. Conclusion: The open-source dataset and benchmark fill a gap in research on spatially consistent world models and support future work. Abstract: The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. This ability enables high-quality visual generation and also ensures the reliability of world models for downstream tasks such as simulation and planning. A memory module is a crucial component for addressing spatial consistency: such a module must not only retain long-horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, there is no dataset designed to promote the development of memory modules by explicitly enforcing spatial consistency constraints. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we construct a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft, collecting about 250 hours (20 million frames) of loop-based navigation videos with actions. Our dataset follows a curriculum design of sequence lengths, allowing models to learn spatial consistency on increasingly complex navigation trajectories. Furthermore, our data collection pipeline is easily extensible to new Minecraft environments and modules. Four representative world model baselines are evaluated on our benchmark. Dataset, benchmark, and code are open-sourced to support future research.

[26] HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions

Shuolin Xu,Siming Zheng,Ziyi Wang,HC Yu,Jinwei Chen,Huaqi Zhang,Bo Li,Peng-Tao Jiang

Main category: cs.CV

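TL;DR: The paper introduces a new dataset and benchmark (the Open-HyperMotionX Dataset and HyperMotionX Bench) for evaluating and improving pose-guided animation models under complex human motion, together with a DiT-based video generation baseline and a spatial low-frequency enhanced RoPE module.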

Details

Motivation: Existing methods perform poorly on complex human motions (Hypermotion), and a high-quality evaluation benchmark is lacking. Method: Introduces the Open-HyperMotionX Dataset and HyperMotionX Bench, designs a DiT-based video generation baseline, and adds a spatial low-frequency enhanced RoPE module. Result: The method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Conclusion: The proposed dataset and method effectively raise the generation quality of image animation for complex human motion. Abstract: Recent advances in diffusion models have significantly improved conditional video generation, particularly in the pose-guided human image animation task. Although existing methods are capable of generating high-fidelity and time-consistent animation sequences in regular motions and static scenes, they still show obvious limitations when facing complex human body motions (Hypermotion) that contain highly dynamic, non-standard motions, and there is no high-quality benchmark for evaluating complex human motion animations. To address this challenge, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Furthermore, we propose a simple yet powerful DiT-based video generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in advancing the generation quality of complex human motion image animations. Code and dataset will be made publicly available.

[27] Pose-free 3D Gaussian splatting via shape-ray estimation

Youngju Na,Taeyeon Kim,Jumin Lee,Kyu Beom Han,Woo Jae Kim,Sung-eui Yoon

Main category: cs.CV

TL;DR: SHARE is a pose-free 3D Gaussian splatting framework that addresses inaccurate poses through joint estimation of shape and camera rays.

Details

Motivation: In real-world scenes, accurate camera poses are hard to obtain, which causes geometric misalignment. Method: SHARE builds a pose-aware canonical volume representation and uses anchor-aligned Gaussian prediction to reduce the impact of inaccurate poses. Result: SHARE performs well on diverse real-world datasets, achieving pose-free generalizable Gaussian splatting. Conclusion: SHARE offers an effective solution for efficient, high-quality rendering when poses are inaccurate. Abstract: While generalizable 3D Gaussian splatting enables efficient, high-quality rendering of unseen scenes, it heavily depends on precise camera poses for accurate geometry. In real-world scenarios, obtaining accurate poses is challenging, leading to noisy pose estimates and geometric misalignments. To address this, we introduce SHARE, a pose-free, feed-forward Gaussian splatting framework that overcomes these ambiguities through joint shape and camera ray estimation. Instead of relying on explicit 3D transformations, SHARE builds a pose-aware canonical volume representation that seamlessly integrates multi-view information, reducing misalignment caused by inaccurate pose estimates. Additionally, anchor-aligned Gaussian prediction enhances scene reconstruction by refining local geometry around coarse anchors, allowing for more precise Gaussian placement. Extensive experiments on diverse real-world datasets show that our method achieves robust performance in pose-free generalizable Gaussian splatting.

[28] MOVi: Training-free Text-conditioned Multi-Object Video Generation

Aimon Rahman,Jiang Liu,Ze Wang,Ximeng Sun,Jialian Wu,Xiaodong Yu,Yusheng Su,Vishal M. Patel,Zicheng Liu,Emad Barsoum

Main category: cs.CV

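TL;DR: Proposes a training-free multi-object video generation method that uses the open-world knowledge of diffusion models and large language models (LLMs), with noise re-initialization and attention manipulation, to significantly improve multi-object generation.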

Details

Motivation: Existing diffusion models struggle to capture complex object interactions in multi-object video generation, often treating objects as static background or generating incorrect features. Method: Uses an LLM as the "director" of object trajectories and applies noise re-initialization and attention manipulation to control object motion and features precisely. Result: Experiments show a 42% improvement in multi-object generation capability while maintaining high fidelity and motion smoothness. Conclusion: The method provides an efficient, training-free solution for multi-object video generation. Abstract: Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle with accurately capturing complex object interactions, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate multiple distinct objects as specified in the prompt, resulting in incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach for multi-object video generation that leverages the open world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation process by manipulating the attention mechanism to better capture object-specific features and motion patterns, and prevent cross-object feature interference. Extensive experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models, resulting in a 42% absolute improvement in motion dynamics and object generation accuracy, while also maintaining high fidelity and motion smoothness.

[29] Synthetic Document Question Answering in Hungarian

Jonathan Li,Zoltan Csaki,Nidhi Hiremath,Etash Guha,Fenglu Hong,Edward Ma,Urmish Thakker

Main category: cs.CV

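TL;DR: The paper presents a method for building document visual question answering (VQA) datasets for a lower-resource language (Hungarian) and releases the HuDocVQA and HuDocVQA-manual datasets, plus the HuCCPDF dataset for OCR training. Experiments show that fine-tuning on these datasets improves accuracy.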

Details

Motivation: To address the lack of training and evaluation data for document VQA in lower-resource languages such as Hungarian. Method: Extracts Hungarian documents from Common Crawl to build a manually curated and a synthetic VQA dataset (HuDocVQA-manual and HuDocVQA), applying multiple rounds of quality filtering and deduplication to the synthetic data. Also releases the HuCCPDF dataset for OCR training. Result: Fine-tuning Llama 3.2 11B Instruct improves accuracy on HuDocVQA by 7.2%. Conclusion: The released datasets and code will foster research on multilingual document VQA. Abstract: Modern VLMs have achieved near-saturation accuracy in English document visual question-answering (VQA). However, this task remains challenging in lower resource languages due to a dearth of suitable training and evaluation data. In this paper we present scalable methods for curating such datasets by focusing on Hungarian, approximately the 17th highest resource language on the internet. Specifically, we present HuDocVQA and HuDocVQA-manual, document VQA datasets that modern VLMs significantly underperform on compared to English DocVQA. HuDocVQA-manual is a small manually curated dataset based on Hungarian documents from Common Crawl, while HuDocVQA is a larger synthetically generated VQA dataset from the same source. We apply multiple rounds of quality filtering and deduplication to HuDocVQA in order to match human-level quality in this dataset. We also present HuCCPDF, a dataset of 117k pages from Hungarian Common Crawl PDFs along with their transcriptions, which can be used for training a model for Hungarian OCR. To validate the quality of our datasets, we show how finetuning on a mixture of these datasets can improve accuracy on HuDocVQA for Llama 3.2 11B Instruct by +7.2%. Our datasets and code will be released to the public to foster further research in multilingual DocVQA.

[30] SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model

Bowen Chen,Keyan Chen,Mohan Yang,Zhengxia Zou,Zhenwei Shi

Main category: cs.CV

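TL;DR: Proposes SeG-SR, a semantic-guided super-resolution framework that uses vision-language models to extract semantic knowledge, significantly improving remote sensing image super-resolution.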

Details

Motivation: Existing remote sensing super-resolution methods focus mainly on low-level features in pixel space and neglect high-level semantic understanding, which leads to semantically inconsistent reconstructions. Method: Designs a Semantic Feature Extraction Module (SFEM), a Semantic Localization Module (SLM), and a Learnable Modulation Module (LMM) that use semantic knowledge extracted by a vision-language model to guide the super-resolution process. Result: SeG-SR achieves state-of-the-art performance on two datasets and delivers consistent gains across various super-resolution architectures. Conclusion: By introducing high-level semantic knowledge, SeG-SR improves the quality and semantic consistency of remote sensing super-resolution. Abstract: High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space, while neglecting the high-level understanding of remote sensing scenes. This may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work aims to explore the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline. We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on two datasets and consistently delivers performance improvements across various SR architectures. Codes can be found at https://github.com/Mr-Bamboo/SeG-SR.

[31] Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition

Shanaka Ramesh Gunasekara,Wanqing Li,Philip Ogunbona,Jack Yang

Main category: cs.CV

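TL;DR: The paper proposes a new measurement, spatial-temporal joint density (STJD), to quantify the interaction between moving and static joints in skeleton sequences, and builds on it the STJD-CL contrastive learning strategy and the STJD-MP method, which significantly improve action classification.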

Details

Motivation: Traditional methods focus mainly on the dynamics of skeleton sequences and overlook the discriminative potential of the interaction between moving and static joints. This paper aims to exploit that interaction to improve action classification. Method: Proposes the STJD measurement to identify prime joints, and designs the STJD-CL contrastive learning strategy and the STJD-MP method, the latter integrated with a reconstruction-based framework. Result: On the NTU RGB+D 60, NTU RGB+D 120, and PKUMMD datasets, STJD-CL and STJD-MP significantly outperform existing methods, with gains of 3.5 and 3.6 percentage points on NTU RGB+D 120. Conclusion: By quantifying the interaction between moving and static joints, STJD provides new discriminative features for skeleton-based action classification and significantly improves performance. Abstract: Traditional approaches in unsupervised or self-supervised learning for skeleton-based action classification have concentrated predominantly on the dynamic aspects of skeletal sequences. Yet, the intricate interaction between the moving and static elements of the skeleton presents a rarely tapped discriminative potential for action classification. This paper introduces a novel measurement, referred to as spatial-temporal joint density (STJD), to quantify such interaction. Tracking the evolution of this density throughout an action can effectively identify a subset of discriminative moving and/or static joints termed "prime joints" to steer self-supervised learning. A new contrastive learning strategy named STJD-CL is proposed to align the representation of a skeleton sequence with that of its prime joints while simultaneously contrasting the representations of prime and nonprime joints. In addition, a method called STJD-MP is developed by integrating it with a reconstruction-based framework for more effective learning. Experimental evaluations on the NTU RGB+D 60, NTU RGB+D 120, and PKUMMD datasets in various downstream tasks demonstrate that the proposed STJD-CL and STJD-MP improved performance, particularly by 3.5 and 3.6 percentage points over the state-of-the-art contrastive methods on the NTU RGB+D 120 dataset using X-sub and X-set evaluations, respectively.

[32] Towards Privacy-Preserving Fine-Grained Visual Classification via Hierarchical Learning from Label Proportions

Jinyi Chang,Dongliang Chang,Lei Chen,Bingyao Yu,Zhanyu Ma

Main category: cs.CV

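TL;DR: This paper proposes a fine-grained visual classification method that needs no instance-level labels, using the Learning from Label Proportions (LLP) paradigm with hierarchical feature refinement to significantly improve classification accuracy.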

Details

Motivation: Existing fine-grained classification methods rely on instance-level labels, which makes them unsuitable for privacy-sensitive scenarios such as medical image analysis; this paper aims to solve that problem. Method: Proposes the LHFGLP framework, which combines hierarchical sparse dictionary learning with a hierarchical proportion loss to refine features progressively. Result: On three fine-grained datasets, LHFGLP outperforms existing LLP methods. Conclusion: The method offers an effective solution for privacy-preserving fine-grained classification; code and datasets will be released to foster research. Abstract: In recent years, Fine-Grained Visual Classification (FGVC) has achieved impressive recognition accuracy, despite minimal inter-class variations. However, existing methods heavily rely on instance-level labels, making them impractical in privacy-sensitive scenarios such as medical image analysis. This paper aims to enable accurate fine-grained recognition without direct access to instance labels. To achieve this, we leverage the Learning from Label Proportions (LLP) paradigm, which requires only bag-level labels for efficient training. Unlike existing LLP-based methods, our framework explicitly exploits the hierarchical nature of fine-grained datasets, enabling progressive feature granularity refinement and improving classification accuracy. We propose Learning from Hierarchical Fine-Grained Label Proportions (LHFGLP), a framework that incorporates Unrolled Hierarchical Fine-Grained Sparse Dictionary Learning, transforming handcrafted iterative approximation into learnable network optimization. Additionally, our proposed Hierarchical Proportion Loss provides hierarchical supervision, further enhancing classification performance. Experiments on three widely-used fine-grained datasets, structured in a bag-based manner, demonstrate that our framework consistently outperforms existing LLP-based methods. We will release our code and datasets to foster further research in privacy-preserving fine-grained classification.
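In LLP, supervision comes from the class proportions of each bag, not from per-instance labels. A common generic form of the training signal is a cross-entropy between the known bag proportions and the mean predicted class distribution over the bag, sketched below in pure Python. This is a plain, single-level proportion loss shown for orientation; it is not the paper's Hierarchical Proportion Loss, and the logits and proportions are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proportion_loss(bag_logits, target_proportions, eps=1e-12):
    """Cross-entropy between bag-level label proportions and the mean
    predicted class probabilities of the instances in the bag."""
    probs = [softmax(row) for row in bag_logits]
    n_classes = len(target_proportions)
    mean_probs = [sum(p[c] for p in probs) / len(probs)
                  for c in range(n_classes)]
    return -sum(t * math.log(q + eps)
                for t, q in zip(target_proportions, mean_probs))


target = [0.5, 0.5]  # the only label available: half the bag is class 0

# Predictions whose average matches the bag proportions...
matched = proportion_loss(
    [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]], target)
# ...versus predictions that put every instance in class 0.
mismatched = proportion_loss(
    [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]], target)
```

Here `matched` comes out lower than `mismatched`, which is the behavior that lets a classifier learn from bag-level proportions alone. A hierarchical variant would presumably apply a similar term at each level of the label hierarchy, though the abstract does not give the exact form.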

[33] Deep Modeling and Optimization of Medical Image Classification

Yihang Wu,Muhammad Owais,Reem Kateb,Ahmad Chaddad

Main category: cs.CV

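TL;DR: The paper proposes an improved CLIP variant that combines multiple deep models with federated learning for medical image classification, and uses traditional machine learning methods to improve generalization. Experiments show strong results on several datasets.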

Details

Motivation: To address the difficulty of large-scale fine-tuning in the medical domain caused by data privacy concerns, and to explore the potential of CLIP in the medical field. Method: 1) proposes a CLIP variant based on CNNs and ViTs; 2) combines it with federated learning to protect data privacy; 3) introduces traditional ML methods to improve generalization. Result: MaxViT performs best on the HAM10000 dataset (AVG = 87.03%), and ConvNeXt_L reaches an F1-score of 83.98% in the FL model. SVM further improves the Swin Transformer series by about 2%. Conclusion: The improved CLIP variant, combined with federated learning and traditional ML methods, performs well in medical image classification while addressing data privacy. Abstract: Deep models, such as convolutional neural networks (CNNs) and vision transformers (ViTs), demonstrate remarkable performance in image classification. However, those deep models require large data to fine-tune, which is impractical in the medical domain due to the data privacy issue. Furthermore, despite the feasible performance of contrastive language-image pre-training (CLIP) in the natural domain, the potential of CLIP has not been fully investigated in the medical field. To face these challenges, we considered three scenarios: 1) we introduce a novel CLIP variant using four CNNs and eight ViTs as image encoders for the classification of brain cancer and skin cancer, 2) we combine 12 deep models with two federated learning techniques to protect data privacy, and 3) we involve traditional machine learning (ML) methods to improve the generalization ability of those deep models in unseen domain data. The experimental results indicate that maxvit shows the highest averaged (AVG) test metrics (AVG = 87.03%) in HAM10000 dataset with multimodal learning, while convnext_l demonstrates remarkable test performance with an F1-score of 83.98% compared to swin_b with 81.33% in FL model. Furthermore, the use of support vector machine (SVM) can improve the overall test metrics with AVG of about 2% for swin transformer series in ISIC2018. Our codes are available at https://github.com/AIPMLab/SkinCancerSimulation.

[34] Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation

Jihai Zhang,Tianle Li,Linjie Li,Zhengyuan Yang,Yu Cheng

Main category: cs.CV

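TL;DR: This paper studies the mutual enhancement between understanding and generation tasks in unified vision-language models (VLMs), finds that mixed training brings significant benefits, and highlights the importance of multimodal alignment for generalization.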

Details

Motivation: To examine the hypothesis that understanding and generation tasks reinforce each other in unified vision-language models, filling a gap in prior work. Method: Designs a dataset close to real-world scenarios, evaluates multiple unified VLM architectures, and performs quantitative analysis. Result: Mixed training yields mutual gains for understanding and generation, multimodal alignment improves generalization, and knowledge from generation tasks transfers to understanding tasks. Conclusion: Unifying understanding and generation is critical for VLMs, and the findings offer guidance for model design and optimization. Abstract: Recent advancements in unified vision-language models (VLMs), which integrate both visual understanding and generation capabilities, have attracted significant attention. The underlying hypothesis is that a unified architecture with mixed training on both understanding and generation tasks can enable mutual enhancement between understanding and generation. However, this hypothesis remains underexplored in prior works on unified VLMs. To address this gap, this paper systematically investigates the generalization across understanding and generation tasks in unified VLMs. Specifically, we design a dataset closely aligned with real-world scenarios to facilitate extensive experiments and quantitative evaluations. We evaluate multiple unified VLM architectures to validate our findings. Our key findings are as follows. First, unified VLMs trained with mixed data exhibit mutual benefits in understanding and generation tasks across various architectures, and these mutual benefits can scale up with increased data. Second, better alignment between multimodal input and output spaces will lead to better generalization. Third, the knowledge acquired during generation tasks can transfer to understanding tasks, and this cross-task generalization occurs within the base language model, beyond modality adapters. Our findings underscore the critical necessity of unifying understanding and generation in VLMs, offering valuable insights for the design and optimization of unified VLMs.

[35] SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images

Yu Sheng,Jiajun Deng,Xinran Zhang,Yu Zhang,Bei Hua,Yanyong Zhang,Jianmin Ji

Main category: cs.CV

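TL;DR: SpatialSplat proposes a new 3D reconstruction framework that uses a dual-field semantic representation and a selective Gaussian mechanism to significantly reduce redundancy and improve semantic expressiveness.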

Details

Motivation: Existing methods sacrifice expressiveness when compressing semantic features, and pixel-wise prediction causes memory redundancy. SpatialSplat aims for more efficient semantic 3D reconstruction. Method: Uses a dual-field semantic representation (a coarse field and a fine-grained field) and a selective Gaussian mechanism that removes redundancy and keeps only essential Gaussians. Result: Experiments show a 60% reduction in scene representation parameters with performance superior to existing methods. Conclusion: With compact 3D Gaussians and efficient semantic encoding, SpatialSplat offers a more practical solution for semantic 3D reconstruction. Abstract: A major breakthrough in 3D reconstruction is the feedforward paradigm to generate pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model's ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce SpatialSplat, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed SpatialSplat learns accurate semantic information and detailed instance priors with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our method, demonstrating a remarkable 60% reduction in scene representation parameters while achieving superior performance over state-of-the-art methods. The code will be made available for future investigation.

[36] Multi-Sourced Compositional Generalization in Visual Question Answering

Chuanhao Li,Wenbo Ye,Zhen Li,Yuwei Wu,Yunde Jia

Main category: cs.CV

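TL;DR: The paper studies multi-sourced compositional generalization (MSCG) in vision-and-language tasks and proposes a retrieval-augmented training framework to improve the MSCG ability of visual question answering (VQA) models.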

Details

Motivation: Because vision-and-language tasks are multi-modal, the primitives that form a composition come from different modalities, and generalization to multi-sourced novel compositions has not been adequately explored. Method: Proposes a retrieval-augmented training framework that retrieves semantically equivalent primitives and aggregates their features to learn unified cross-modal representations. Result: Experiments show that the framework improves the MSCG ability of VQA models; a new GQA-MSCG evaluation dataset is built on top of GQA. Conclusion: The study fills a gap in multi-sourced compositional generalization, and the proposed framework and dataset provide a basis for future research. Abstract: Compositional generalization is the ability to generalize to novel compositions from seen primitives, and has received much attention in vision-and-language (V&L) recently. Due to the multi-modal nature of V&L tasks, the primitives composing compositions come from different modalities, resulting in multi-sourced novel compositions. However, the generalization ability over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. Experimental results demonstrate the effectiveness of the proposed framework. We release GQA-MSCG at https://github.com/NeverMoreLCH/MSCG.

[37] Zero-P-to-3: Zero-Shot Partial-View Images to 3D Object

Yuxuan Lin,Ruihang Chu,Zhenyu Chen,Xiao Tang,Lei Ke,Haoling Li,Yingji Zhong,Zhihao Li,Shiyong Liu,Xiaofei Wu,Jianzhuang Liu,Yujiu Yang

Main category: cs.CV

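TL;DR: Proposes a training-free method for 3D reconstruction from partial observations that fuses local dense observations with multi-source priors to generate multi-view consistent images.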

Details

Motivation: 3D reconstruction from partial observations is challenging because of the limited view range and inconsistent generation. Method: A fusion-based strategy aligns multi-source priors during DDIM sampling, and an iterative refinement strategy improves reconstruction quality. Result: Experiments on multiple datasets show that the method outperforms existing approaches in invisible regions. Conclusion: The method effectively addresses 3D reconstruction under partial observation and performs especially well in invisible regions. Abstract: Generative 3D reconstruction shows strong potential in incomplete observations. While sparse-view and single-image reconstruction are well-researched, partial observation remains underexplored. In this context, dense views are accessible only from a specific angular range, with other perspectives remaining inaccessible. This task presents two main challenges: (i) Limited view range: observations confined to a narrow angular scope prevent effective traditional interpolation techniques that require evenly distributed perspectives. (ii) Inconsistent generation: views created for invisible regions often lack coherence with both visible regions and each other, compromising reconstruction consistency. To address these challenges, we propose Zero-P-to-3, a novel training-free approach that integrates the local dense observations and multi-source priors for reconstruction. Our method introduces a fusion-based strategy to effectively align these priors in DDIM sampling, thereby generating multi-view consistent images to supervise invisible views. We further design an iterative refinement strategy, which uses the geometric structures of the object to enhance reconstruction quality. Extensive experiments on multiple datasets show the superiority of our method over state-of-the-art methods, especially in invisible regions.

[38] URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration

Rui Xu,Yuzhen Niu,Yuezhou Li,Huangbiao Xu,Wenxi Liu,Yuzhong Chen

Main category: cs.CV

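TL;DR: Proposes URWKV, a unified model for low-light image enhancement and deblurring that flexibly handles dynamically coupled degradations through a multi-state perspective.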

Details

Motivation: Existing low-light image enhancement and joint deblurring models are limited when handling dynamically coupled degradations, so a more flexible solution is needed. Method: Designs the core URWKV block, combining Luminance-adaptive Normalization (LAN) with a State-aware Selective Fusion (SSF) module, and analyzes degradations using multi-stage states. Result: URWKV outperforms existing methods on several benchmarks while requiring fewer parameters and less computation. Conclusion: With its multi-state perspective and dynamic fusion mechanism, URWKV effectively handles dynamically coupled degradations in low-light enhancement and deblurring. Abstract: Existing low-light image enhancement (LLIE) and joint LLIE and deblurring (LLIE-deblur) models have made strides in addressing predefined degradations, yet they are often constrained by dynamically coupled degradations. To address these challenges, we introduce a Unified Receptance Weighted Key Value (URWKV) model with a multi-state perspective, enabling flexible and effective degradation restoration for low-light images. Specifically, we customize the core URWKV block to perceive and analyze complex degradations by leveraging multiple intra- and inter-stage states. First, inspired by the pupil mechanism in the human visual system, we propose Luminance-adaptive Normalization (LAN) that adjusts normalization parameters based on rich inter-stage states, allowing for adaptive, scene-aware luminance modulation. Second, we aggregate multiple intra-stage states through an exponential moving average approach, effectively capturing subtle variations while mitigating information loss inherent in the single-state mechanism. To reduce the degradation effects commonly associated with conventional skip connections, we propose the State-aware Selective Fusion (SSF) module, which dynamically aligns and integrates multi-state features across encoder stages, selectively fusing contextual information. In comparison to state-of-the-art models, our URWKV model achieves superior performance on various benchmarks, while requiring significantly fewer parameters and computational resources.
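
The abstract aggregates several intra-stage states with an exponential moving average. A generic EMA over a list of state vectors might look like the sketch below; the `decay` value and the choice to start from the first state are assumptions, not details taken from the paper.

```python
def ema_aggregate(states, decay=0.9):
    """Fold a sequence of equal-length state vectors into one via an EMA.

    Earlier states fade geometrically, so later states dominate while the
    history still contributes (unlike keeping only the last state).
    """
    if not states:
        raise ValueError("need at least one state")
    agg = list(states[0])
    for state in states[1:]:
        agg = [decay * a + (1 - decay) * s for a, s in zip(agg, state)]
    return agg

# Three hypothetical 1-D states from successive blocks of one stage.
summary = ema_aggregate([[0.0], [1.0], [1.0]], decay=0.5)
```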

[39] GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

Gwanghyun Kim,Xueting Li,Ye Yuan,Koki Nagano,Tianye Li,Jan Kautz,Se Young Chun,Umar Iqbal

Main category: cs.CV

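TL;DR: GeoMan is a novel architecture that estimates accurate, temporally consistent 3D human geometry from monocular video, addressing the weaknesses of existing methods in temporal consistency and detail capture.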

Details

Motivation: Existing methods are mainly optimized for single images, so they suffer from temporal inconsistency and fail to capture dynamic details. GeoMan targets these problems along with the scarcity of high-quality 4D training data and the challenge of metric depth estimation. Method: GeoMan combines an image model with a video diffusion model: the image model estimates depth and normals for the first frame, and the video model focuses on generating details. A root-relative depth representation preserves human-scale detail. Result: GeoMan achieves state-of-the-art performance in qualitative and quantitative evaluations, with markedly better temporal consistency and generalization. Conclusion: Through its architecture and depth representation, GeoMan effectively addresses longstanding challenges in 3D human geometry estimation. Abstract: Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.

[40] LeMoRe: Learn More Details for Lightweight Semantic Segmentation

Mian Muhammad Naeem Abid,Nancy Mehta,Zongwei Wu,Radu Timofte

Main category: cs.CV

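TL;DR: Proposes a lightweight semantic segmentation method that combines explicit and implicit modeling to balance computational efficiency and representational power.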

Details

Motivation: Existing methods struggle to balance efficiency and performance because of the complexity of feature modeling, and they rely on parameter-heavy designs or computationally intensive Vision Transformer frameworks. Method: Combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, capturing global dependencies efficiently through a nested attention mechanism. Result: Experiments on ADE20K, CityScapes, and other datasets confirm an effective balance between performance and efficiency. Conclusion: LeMoRe balances efficiency and performance in lightweight semantic segmentation. Abstract: Lightweight semantic segmentation is essential for many downstream vision tasks. Unfortunately, existing methods often struggle to balance efficiency and performance due to the complexity of feature modeling. Many of these existing approaches are constrained by rigid architectures and implicit representation learning, often characterized by parameter-heavy designs and a reliance on computationally intensive Vision Transformer-based frameworks. In this work, we introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity. Our method combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies through a nested attention mechanism. Extensive experiments on challenging datasets, including ADE20K, CityScapes, Pascal Context, and COCO-Stuff, demonstrate that LeMoRe strikes an effective balance between performance and efficiency.

[41] CURVE: CLIP-Utilized Reinforcement Learning for Visual Image Enhancement via Simple Image Processing

Yuka Ogino,Takahiro Toizumi,Atsushi Ito

Main category: cs.CV

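TL;DR: This paper proposes CURVE, a low-light image enhancement method based on CLIP and reinforcement learning that adjusts global tone with a Bézier curve and iteratively optimizes its parameters; experiments show it outperforms conventional methods in quality and speed.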

Details

Motivation: Addresses two questions in zero-reference low-light image enhancement: how to use the CLIP model to obtain perceptually good images, and how to stay computationally efficient on high-resolution images. Method: CURVE adjusts global tone with a Bézier curve and iteratively optimizes the curve parameters with reinforcement learning, using a reward designed from CLIP text embeddings. Result: On low-light and multi-exposure datasets, CURVE outperforms conventional methods in enhancement quality and processing speed. Conclusion: By combining CLIP with reinforcement learning, CURVE effectively addresses the challenges of low-light image enhancement and has practical potential. Abstract: Low-Light Image Enhancement (LLIE) is crucial for improving both human perception and computer vision tasks. This paper addresses two challenges in zero-reference LLIE: obtaining perceptually 'good' images using the Contrastive Language-Image Pre-Training (CLIP) model and maintaining computational efficiency for high-resolution images. We propose CLIP-Utilized Reinforcement learning-based Visual image Enhancement (CURVE). CURVE employs a simple image processing module which adjusts global image tone based on a Bézier curve and estimates its processing parameters iteratively. The estimator is trained by reinforcement learning with rewards designed using CLIP text embeddings. Experiments on low-light and multi-exposure datasets demonstrate the performance of CURVE in terms of enhancement quality and processing speed compared to conventional methods.
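
A global tone adjustment based on a Bézier curve can be illustrated as below. This is a sketch under stated assumptions rather than the paper's exact parameterization: it uses a cubic curve with endpoints fixed at 0 and 1 and two free control values `p1` and `p2`, with control abscissae evenly spaced so that the input intensity can be used directly as the curve parameter.

```python
def bezier_tone(x, p1, p2):
    """Map intensity x in [0, 1] through a cubic Bezier tone curve.

    Control values are (0, p1, p2, 1) at abscissae (0, 1/3, 2/3, 1), so the
    curve parameter equals the input intensity. p1 = 1/3 and p2 = 2/3 give
    the identity mapping; larger values brighten the image.
    """
    t = min(max(x, 0.0), 1.0)
    return 3 * (1 - t) ** 2 * t * p1 + 3 * (1 - t) * t ** 2 * p2 + t ** 3

def enhance(image, p1, p2):
    """Apply the same tone curve to every pixel of a grayscale image."""
    return [[bezier_tone(px, p1, p2) for px in row] for row in image]

# A dark 2x2 image brightened with hypothetical parameters.
dark = [[0.05, 0.10], [0.25, 0.50]]
bright = enhance(dark, p1=0.6, p2=0.9)
```

Because the operation is a per-pixel lookup with only two parameters, its cost scales linearly with resolution, which fits the abstract's emphasis on efficiency for high-resolution images.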

[42] EAD: An EEG Adapter for Automated Classification

Pushapdeep Singh,Jyoti Nigam,Medicherla Vamsi Krishna,Arnav Bhavsar,Aditya Nigam

Main category: cs.CV

TL;DR: EEG Adapter (EAD) is a flexible framework for learning unified embeddings that work across different EEG devices, achieving high accuracy on the EEG-ImageNet and BrainLat datasets.

Details

Motivation: Conventional EEG classification depends on the specific task and device and cannot handle EEG data with different channel counts in a unified way, so a general framework is needed. Method: EEG Adapter (EAD) adapts an EEG foundation model to learn robust EEG representations that work across devices. Result: Reaches 99.33% and 92.31% accuracy on EEG-ImageNet and BrainLat respectively, and demonstrates zero-shot classification. Conclusion: EAD is an efficient, general framework for learning EEG embeddings across tasks and devices. Abstract: While electroencephalography (EEG) has been a popular modality for neural decoding, it often involves task-specific acquisition of the EEG data. This poses challenges for the development of a unified pipeline to learn embeddings for various EEG signal classification tasks, which are often involved in various decoding tasks. Traditionally, EEG classification involves the step of signal preprocessing and the use of deep learning techniques, which are highly dependent on the number of EEG channels in each sample. However, the same pipeline cannot be applied even if the EEG data is collected for the same experiment but with different acquisition devices. This necessitates the development of a framework for learning EEG embeddings, which could be highly beneficial for tasks involving multiple EEG samples for the same task but with varying numbers of EEG channels. In this work, we propose EEG Adapter (EAD), a flexible framework compatible with any signal acquisition device. More specifically, we leverage a recent EEG foundational model with significant adaptations to learn robust representations from the EEG data for the classification task. We evaluate EAD on two publicly available datasets, achieving state-of-the-art accuracies of 99.33% and 92.31% on EEG-ImageNet and BrainLat respectively. This illustrates the effectiveness of the proposed framework across diverse EEG datasets containing two different perception tasks: stimulus and resting-state EEG signals. We also perform zero-shot EEG classification on the EEG-ImageNet task to demonstrate the generalization capability of the proposed approach.

[43] Identification of Patterns of Cognitive Impairment for Early Detection of Dementia

Anusha A. S.,Uma Ranjan,Medha Sharma,Siddharth Dutt

Main category: cs.CV

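TL;DR: Proposes a personalized cognitive testing scheme that identifies individual-specific patterns of cognitive impairment, providing an efficient tool for early dementia detection.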

Details

Motivation: Conventional cognitive tests are time-consuming and hard to apply at scale, and patterns of cognitive impairment differ across dementia types, so a personalized approach is needed. Method: A two-step procedure: ensemble wrapper feature selection first identifies impairment patterns in the population, followed by cluster analysis; the study uses baseline data from 24,000 subjects in the NACC database. Result: The identified patterns match clinically accepted variants of mild cognitive impairment (MCI) and can be used to predict the likely route of impairment in pre-symptomatic or apparently normal people. Conclusion: Personalized testing could make early dementia detection more efficient and accurate, and it suits periodic assessment of large populations. Abstract: Early detection of dementia is crucial to devise effective interventions. Comprehensive cognitive tests, while being the most accurate means of diagnosis, are long and tedious, thus limiting their applicability to a large population, especially when periodic assessments are needed. The problem is compounded by the fact that people have differing patterns of cognitive impairment as they progress to different forms of dementia. This paper presents a novel scheme by which individual-specific patterns of impairment can be identified and used to devise personalized tests for periodic follow-up. Patterns of cognitive impairment are initially learned from a population cluster of combined normals and MCIs, using a set of standardized cognitive tests. Impairment patterns in the population are identified using a 2-step procedure involving an ensemble wrapper feature selection followed by cluster identification and analysis. These patterns have been shown to correspond to clinically accepted variants of MCI, a prodrome of dementia. The learned clusters of patterns can subsequently be used to identify the most likely route of cognitive impairment, even for pre-symptomatic and apparently normal people. Baseline data of 24,000 subjects from the NACC database was used for the study.

[44] Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving

Yunshen Wang,Yicheng Liu,Tianyuan Yuan,Yucheng Mao,Yingshi Liang,Xiuyu Yang,Honggang Zhang,Hang Zhao

Main category: cs.CV

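TL;DR: Reframes 3D occupancy prediction as a generative modeling task, using diffusion models to improve prediction consistency, noise robustness, and the handling of complex 3D structures.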

Details

Motivation: Current discriminative methods perform poorly with noisy data, incomplete observations, and complex 3D scene structure, so a better solution is needed. Method: Uses diffusion models as the generative tool to learn the underlying data distribution and incorporate 3D scene priors. Result: Diffusion models produce more realistic and accurate predictions than existing discriminative methods, especially in occluded or low-visibility regions. Conclusion: The approach significantly benefits downstream planning in autonomous driving and has practical value. Abstract: Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach improves prediction consistency and noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.

[45] TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance

Keren Ye,Ignacio Garcia Dorado,Michalis Raptis,Mauricio Delbracio,Irene Zhu,Peyman Milanfar,Hossein Talebi

Main category: cs.CV

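TL;DR: TextSR is a multimodal diffusion model for multilingual scene text image super-resolution that combines text detection and OCR and uses character-shape priors to improve super-resolution quality.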

Details

Motivation: Existing diffusion models for scene text image super-resolution localize text regions inaccurately and model character shapes poorly, which degrades output quality. Method: TextSR combines a text detector with OCR to extract multilingual text, converts the characters into visual shapes through a UTF-8 based encoder and cross-attention, and introduces two methods to make the model more robust. Result: Performs strongly on the TextZoom and TextVQA datasets, setting a new benchmark for STISR. Conclusion: By integrating character priors with the low-resolution image, TextSR significantly improves the detail and legibility of super-resolved text. Abstract: While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
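
One reason a UTF-8 based encoder suits multilingual text is that any script reduces to the same 256-symbol byte vocabulary. The sketch below shows only that tokenization idea; whether TextSR feeds raw bytes to its encoder in this way is an assumption, and `utf8_tokens` is a hypothetical helper name.

```python
def utf8_tokens(text):
    """Turn OCR output into a list of byte values in the range 0..255."""
    return list(text.encode("utf-8"))

def tokens_to_text(tokens):
    """Inverse mapping, useful for checking that nothing was lost."""
    return bytes(tokens).decode("utf-8")

latin = utf8_tokens("Aé")   # one 1-byte and one 2-byte character
cjk = utf8_tokens("中")      # one 3-byte character
```

No per-language vocabulary is needed: characters outside ASCII simply occupy more positions in the token sequence.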

[46] MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation

Siyuan Wang,Jiawei Liu,Wei Wang,Yeying Jin,Jinsong Du,Zhi Han

Main category: cs.CV

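TL;DR: The paper proposes a Motion Mask-Guided Two-Stage Network (MMGT) that combines audio, motion masks, and motion features to generate synchronized co-speech gesture videos, addressing the motion distortion that arises when conventional methods rely on audio alone.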

Details

Motivation: Audio alone, used as the control signal, fails to capture large gesture movements and causes visible distortion in the video; existing methods usually add extra prior information, which limits practical use. Method: MMGT has two stages: (1) the SMGA network generates high-quality pose videos and motion masks from audio; (2) the MM-HAA module is combined with a stabilized diffusion video generation model to improve fine-grained motion and region-specific detail control. Result: Experiments show better video quality, lip-sync, and gesture generation. Conclusion: Through its two-stage design and motion mask guidance, MMGT significantly improves co-speech gesture video generation and overcomes the limitations of conventional methods. Abstract: Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging due to the diversity of different parts of the body in terms of amplitude of motion, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in video, leading to more pronounced artifacts and distortions. Existing approaches typically address this issue by introducing additional a priori information, but this can limit the practical application of the task. Specifically, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, as well as motion masks and motion features generated from the audio signal to jointly drive the generation of synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio, effectively capturing large movements in key regions such as the face and gestures. In the second stage, we integrate the Motion Masked Hierarchical Audio Attention (MM-HAA) into the Stabilized Diffusion Video Generation model, overcoming limitations in fine-grained motion generation and region-specific detail control found in traditional methods. This guarantees high-quality, detailed upper-body video generation with accurate texture and motion details. Evaluations show improved video quality, lip-sync, and gesture. The model and code are available at https://github.com/SIA-IDE/MMGT.

[47] HMAD: Advancing E2E Driving with Anchored Offset Proposals and Simulation-Supervised Multi-target Scoring

Bin Wang,Pingjun Li,Jinkun Liu,Jun Cheng,Hailong Lei,Yinze Rong,Huan-ang Gao,Kangliang Chen,Xing Pan,Weihao Gu

Main category: cs.CV

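TL;DR: The HMAD framework combines BEV-based trajectory generation with multi-criteria scoring to address trajectory diversity and path selection in autonomous driving.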

Details

Motivation: Autonomous driving still struggles to generate diverse, rule-compliant trajectories and to select the best path through multi-dimensional scoring. Method: HMAD uses BEVFormer and learnable anchored queries to generate diverse trajectories, then evaluates them with a simulation-supervised scorer module. Result: HMAD achieves a 44.5% driving score on the CVPR 2025 private test set. Conclusion: HMAD shows the benefit of decoupling trajectory generation from safety-aware scoring for advanced autonomous driving. Abstract: End-to-end autonomous driving faces persistent challenges in both generating diverse, rule-compliant trajectories and robustly selecting the optimal path from these options via learned, multi-faceted evaluation. To address these challenges, we introduce HMAD, a framework integrating a distinctive Bird's-Eye-View (BEV) based trajectory proposal mechanism with learned multi-criteria scoring. HMAD leverages BEVFormer and employs learnable anchored queries, initialized from a trajectory dictionary and refined via iterative offset decoding (inspired by DiffusionDrive), to produce numerous diverse and stable candidate trajectories. A key innovation, our simulation-supervised scorer module, then evaluates these proposals against critical metrics including no at-fault collisions, drivable area compliance, comfort, and overall driving quality (i.e., extended PDM score). Demonstrating its efficacy, HMAD achieves a 44.5% driving score on the CVPR 2025 private test set. This work highlights the benefits of effectively decoupling robust trajectory generation from comprehensive, safety-aware learned scoring for advanced autonomous driving.
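
The propose-then-score decoupling described above can be sketched in a few lines. Everything here is illustrative: the anchors, offsets, scoring functions, and weights are made-up stand-ins for the learned trajectory dictionary, offset decoder, and simulation-supervised metric heads.

```python
def propose(anchors, offsets):
    """Add a per-waypoint (dx, dy) offset to each anchor trajectory."""
    return [
        [(ax + dx, ay + dy) for (ax, ay), (dx, dy) in zip(anchor, offset)]
        for anchor, offset in zip(anchors, offsets)
    ]

def select_best(trajectories, scorers, weights):
    """Return the trajectory with the highest weighted sum of metric scores."""
    def total(traj):
        return sum(w * score(traj) for score, w in zip(scorers, weights))
    return max(trajectories, key=total)

# Hypothetical metrics: stay within 1 m of the lane centre, and make progress.
def stays_in_lane(traj):
    return 1.0 if all(abs(y) <= 1.0 for _, y in traj) else 0.0

def progress(traj):
    return traj[-1][0]

anchors = [[(1.0, 0.0), (2.0, 0.0)], [(1.0, 1.0), (2.0, 2.0)]]
offsets = [[(0.0, 0.1), (0.5, 0.1)], [(0.0, 0.0), (1.0, 0.5)]]
candidates = propose(anchors, offsets)
best = select_best(candidates, [stays_in_lane, progress], [10.0, 1.0])
```

Keeping generation and scoring separate means new metrics can be added to the scorer without touching the proposal mechanism, which is the design point the abstract highlights.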

[48] PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents

Haoyu Chen,Keda Tao,Yizao Wang,Xinlei Wang,Lei Zhu,Jinjin Gu

Main category: cs.CV

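TL;DR: PhotoArtAgent is an intelligent system that combines vision-language models with natural language reasoning to emulate a professional artist's retouching process, offering transparent explanations and iterative refinement; it outperforms automated tools and approaches the level of professional artists.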

Details

Motivation: Non-professional users rely on automated tools that lack interpretative depth and interactive transparency; the goal is to emulate a professional artist's creative process. Method: Combines vision-language models with natural language reasoning to perform artistic analysis, plan retouching strategies, output parameters to Lightroom through an API, and iteratively refine the result while explaining its choices. Result: Outperforms existing automated tools in user studies and achieves results close to those of professional artists. Conclusion: Through transparent interaction and iterative refinement, PhotoArtAgent gives non-professional users a retouching experience close to professional quality. Abstract: Photo retouching is integral to photographic art, extending far beyond simple technical fixes to heighten emotional expression and narrative depth. While artists leverage expertise to create unique visual effects through deliberate adjustments, non-professional users often rely on automated tools that produce visually pleasing results but lack interpretative depth and interactive transparency. In this paper, we introduce PhotoArtAgent, an intelligent system that combines Vision-Language Models (VLMs) with advanced natural language reasoning to emulate the creative process of a professional artist. The agent performs explicit artistic analysis, plans retouching strategies, and outputs precise parameters to Lightroom through an API. It then evaluates the resulting images and iteratively refines them until the desired artistic vision is achieved. Throughout this process, PhotoArtAgent provides transparent, text-based explanations of its creative rationale, fostering meaningful interaction and user control. Experimental results show that PhotoArtAgent not only surpasses existing automated tools in user studies but also achieves results comparable to those of professional human artists.

[49] Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing

Tongtong Su,Chengyu Wang,Jun Huang,Dongming Lu

Main category: cs.CV

TL;DR: The paper proposes Zero-to-Hero, a reference-based video editing method that decomposes editing into two stages for finer control and better consistency.

Details

Motivation: Existing text-guided methods leave user intent ambiguous and offer too little fine-grained control. Method: Two stages: the Zero stage edits an anchor frame to serve as the reference, and the Hero stage restores video quality with a conditional generative model. Result: A PSNR improvement of 2.6 dB over the best-performing baseline. Conclusion: Zero-to-Hero achieves higher accuracy and temporal consistency in video editing. Abstract: Appearance editing according to user needs is a pivotal task in video editing. Existing text-guided methods often lead to ambiguities regarding user intentions and restrict fine-grained control over editing specific aspects of objects. To overcome these limitations, this paper introduces a novel approach named Zero-to-Hero, which focuses on reference-based video editing that disentangles the editing process into two distinct problems. It achieves this by first editing an anchor frame to satisfy user requirements as a reference image and then consistently propagating its appearance across other frames. We leverage correspondence within the original frames to guide the attention mechanism, which is more robust than previously proposed optical flow or temporal modules in memory-friendly video generative models, especially when dealing with objects exhibiting large motions. It offers a solid ZERO-shot initialization that ensures both accuracy and temporal consistency. However, intervention in the attention mechanism results in compounded imaging degradation with over-saturated colors and unknown blurring issues. Starting from Zero-Stage, our Hero-Stage Holistically learns a conditional generative model for vidEo RestOration. To accurately evaluate the consistency of the appearance, we construct a set of videos with multiple appearances using Blender, enabling a fine-grained and deterministic evaluation. Our method outperforms the best-performing baseline with a PSNR improvement of 2.6 dB. The project page is at https://github.com/Tonniia/Zero2Hero.
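
For readers less familiar with the metric, the standard PSNR definition is sketched below, along with what a 2.6 dB gain means in terms of mean squared error. The pixel values are made up; only the 2.6 dB figure comes from the abstract.

```python
import math

def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by `gain_db`."""
    return 10 ** (gain_db / 10)

example = psnr([100, 120, 140, 160], [101, 119, 142, 158])
ratio = mse_ratio_for_gain(2.6)  # roughly 1.82
```

Since PSNR is logarithmic, a 2.6 dB improvement corresponds to an MSE about 1.8 times smaller, that is, roughly a 45% reduction in squared error, all else being equal.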

[50] Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning

Jinquan Guan,Qi Chen,Lizhou Liang,Yuhang Liu,Vu Minh Hieu Phan,Minh-Son To,Jian Chen,Yutong Xie

Main category: cs.CV

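TL;DR: The paper presents the CXRTrek dataset and the CXRTrekNet model, which simulate radiologists' multi-stage diagnostic reasoning and address shortcomings of existing medical AI models.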

Details

Motivation: Existing medical AI models ignore the diagnostic reasoning process, which leads to misalignment with clinical scenarios, contextless reasoning, and untraceable errors. Method: Builds the CXRTrek dataset (8 diagnostic stages, about 429,000 samples, over 11 million Q&A pairs) and proposes CXRTrekNet, a model that incorporates the clinical reasoning flow. Result: CXRTrekNet outperforms existing medical VLLMs on the CXRTrek benchmarks and generalizes well to several external datasets. Conclusion: The CXRTrek dataset and model effectively simulate clinical diagnostic reasoning and improve the practicality and generalization of medical AI. Abstract: Artificial intelligence (AI)-based chest X-ray (CXR) interpretation assistants have demonstrated significant progress and are increasingly being applied in clinical settings. However, contemporary medical AI models often adhere to a simplistic input-to-output paradigm, directly processing an image and an instruction to generate a result, where the instructions may be integral to the model's architecture. This approach overlooks the modeling of the inherent diagnostic reasoning in chest X-ray interpretation. Such reasoning is typically sequential, where each interpretive stage considers the images, the current task, and the contextual information from previous stages. This oversight leads to several shortcomings, including misalignment with clinical scenarios, contextless reasoning, and untraceable errors. To fill this gap, we construct CXRTrek, a new multi-stage visual question answering (VQA) dataset for CXR interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings for the first time. CXRTrek covers 8 sequential diagnostic stages, comprising 428,966 samples and over 11 million question-answer (Q&A) pairs, with an average of 26.29 Q&A pairs per sample. Building on the CXRTrek dataset, we propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the VLLM framework. CXRTrekNet effectively models the dependencies between diagnostic stages and captures reasoning patterns within the radiological context. Trained on our dataset, the model consistently outperforms existing medical VLLMs on the CXRTrek benchmarks and demonstrates superior generalization across multiple tasks on five diverse external datasets. The dataset and model can be found in our repository (https://github.com/guanjinquan/CXRTrek).
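
The dataset figures quoted in the abstract are mutually consistent, as a quick calculation shows. The total below is derived from the rounded average, so it is only an estimate and not a number reported by the authors.

```python
samples = 428_966          # samples, from the abstract
avg_pairs = 26.29          # average Q&A pairs per sample, from the abstract

estimated_pairs = samples * avg_pairs        # about 11.28 million
pairs_in_millions = round(estimated_pairs / 1e6, 2)
```

The estimate of about 11.28 million pairs agrees with the stated "over 11 million question-answer (Q&A) pairs".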

[51] FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing

Jeongsol Kim,Yeobin Hong,Jong Chul Ye

Main category: cs.CV

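TL;DR: FlowAlign is an inversion-free, flow-based image editing framework that uses a flow-matching loss to obtain more stable and consistent editing trajectories.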

Details

Motivation: Existing flow-based image editing methods such as FlowEdit lack exact latent inversion, which leads to unstable editing trajectories and poor source consistency. Method: FlowAlign introduces a flow-matching loss as a regularizer that balances semantic alignment with the edit prompt against structural consistency with the source image. Result: Experiments show that FlowAlign outperforms existing methods in source preservation and editing controllability. Conclusion: With the flow-matching loss, FlowAlign achieves more stable, consistent, and reversible image editing. Abstract: Recent inversion-free, flow-based image editing methods such as FlowEdit leverage a pre-trained noise-to-image flow model such as Stable Diffusion 3, enabling text-driven manipulation by solving an ordinary differential equation (ODE). While the lack of exact latent inversion is a core advantage of these methods, it often results in unstable editing trajectories and poor source consistency. To address this limitation, we propose FlowAlign, a novel inversion-free flow-based framework for consistent image editing with principled trajectory control. FlowAlign introduces a flow-matching loss as a regularization mechanism to promote smoother and more stable trajectories during the editing process. Notably, the flow-matching loss is shown to explicitly balance semantic alignment with the edit prompt and structural consistency with the source image along the trajectory. Furthermore, FlowAlign naturally supports reverse editing by simply reversing the ODE trajectory, highlighting the reversible and consistent nature of the transformation. Extensive experiments demonstrate that FlowAlign outperforms existing methods in both source preservation and editing controllability.
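
The idea of editing by solving an ODE and undoing the edit by integrating it backwards can be shown with a scalar toy problem. This sketch uses a plain Euler integrator and a made-up velocity field `v(x, t) = -x`; it stands in for a learned flow model and is not FlowAlign's actual solver or loss.

```python
def euler_integrate(x0, velocity, t_start, t_end, steps):
    """Integrate dx/dt = velocity(x, t) from t_start to t_end with Euler steps.

    Passing t_start > t_end gives a negative step size, which traverses the
    same trajectory in reverse.
    """
    x, t = x0, t_start
    dt = (t_end - t_start) / steps
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t = t + dt
    return x

def toy_velocity(x, t):
    """Hypothetical velocity field; a real model would be a neural network."""
    return -x

source = 2.0
edited = euler_integrate(source, toy_velocity, 0.0, 1.0, steps=1000)
recovered = euler_integrate(edited, toy_velocity, 1.0, 0.0, steps=1000)
```

With a smooth velocity field and small steps, the reverse pass returns close to the source, which is the property the abstract refers to as reverse editing.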

[52] PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

Xiao Yu,Yan Fang,Xiaojie Jin,Yao Zhao,Yunchao Wei

Main category: cs.CV

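TL;DR: The paper introduces the Online Audio-Visual Event Parsing (On-AVEP) task and designs the Predictive Future Modeling (PreFM) framework for online, real-time processing, significantly outperforming existing methods.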

Details

Motivation: Existing methods rely on offline processing and very large models, which limits real-time use, so a new approach to parsing audio-visual events online is needed. Method: The PreFM framework combines predictive multimodal future modeling with modality-agnostic robust representations to improve contextual understanding and precision. Result: On the UnAV-100 and LLP datasets, PreFM significantly outperforms existing methods with fewer parameters. Conclusion: PreFM offers an effective solution for real-time multimodal video understanding with practical potential. Abstract: Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework featuring (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding, and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show PreFM outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.

[53] LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering

Jonas Kulhanek,Marie-Julie Rakotosaona,Fabian Manhardt,Christina Tsalicoglou,Michael Niemeyer,Torsten Sattler,Songyou Peng,Federico Tombari

Main category: cs.CV

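TL;DR: Proposes a level-of-detail (LOD) method for 3D Gaussian Splatting that optimizes real-time rendering through a hierarchical representation and dynamic loading.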

Details

Motivation: Real-time rendering of large-scale scenes on memory-constrained devices remains a challenge. Method: Uses a hierarchical LOD representation, depth-aware smoothing filters, importance-based pruning, and dynamic loading. Result: Achieves high-performance rendering on outdoor and indoor datasets with lower latency and memory requirements. Conclusion: The method markedly improves rendering efficiency and suits resource-constrained devices. Abstract: In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.
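
Two ideas from the abstract, choosing an LOD level from camera distance and loading only nearby spatial chunks, can be sketched as follows. The distance thresholds, chunk centers, and loading radius are hypothetical; the paper's actual selection rule and opacity blending at chunk boundaries are not reproduced here.

```python
import bisect
import math

def select_lod(distance, thresholds):
    """Pick an LOD index from camera distance.

    `thresholds` is an ascending list of distances; level 0 is the finest and
    the index grows by one each time a threshold is reached.
    """
    return bisect.bisect_right(thresholds, distance)

def chunks_to_load(camera, chunk_centers, radius):
    """Indices of chunks whose center lies within `radius` of the camera."""
    return [
        i for i, center in enumerate(chunk_centers)
        if math.dist(camera, center) <= radius
    ]

thresholds = [10.0, 30.0, 100.0]                 # made-up distances in metres
level_near = select_lod(5.0, thresholds)         # finest level
level_far = select_lod(500.0, thresholds)        # coarsest level
active = chunks_to_load((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0), (50.0, 0.0, 0.0)], 10.0)
```

Only the Gaussians of the active chunks, at the level chosen for their distance, would need to reside in GPU memory at any moment.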

[54] Implicit Inversion turns CLIP into a Decoder

Antonio D'Orazio,Maria Rosaria Briglia,Donato Crisostomi,Dario Loi,Emanuele Rodolà,Iacopo Masi

Main category: cs.CV

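TL;DR: CLIP can synthesize images without a decoder or any training: optimizing a frequency-aware implicit neural representation, together with stabilization techniques, unlocks capabilities such as text-to-image generation.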

Details

Motivation: Explores whether CLIP, a discriminative model, has untapped generative potential without additional training or a decoder. Method: Uses a frequency-aware implicit neural representation, adversarially robust initialization, an Orthogonal Procrustes projection, and a blending loss. Result: Enables text-to-image generation, style transfer, and image reconstruction without modifying CLIP's weights. Conclusion: Discriminative models may hide generative potential that has not yet been discovered. Abstract: CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
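
The Orthogonal Procrustes problem named in the abstract has a well-known closed-form solution, which explains why the projection can be called lightweight. In the statement below, the notation is introduced only for illustration: $X$ and $Y$ are assumed to be matrices whose rows are paired text and image embeddings, and $R$ is the orthogonal map being sought. How the paper selects the pairs is not specified here.

```latex
R^{*} \;=\; \arg\min_{R^{\top} R = I} \; \lVert X R - Y \rVert_{F}^{2},
\qquad
X^{\top} Y = U \Sigma V^{\top} \ \text{(SVD)}
\;\;\Longrightarrow\;\;
R^{*} = U V^{\top}.
```

A single SVD of a $d \times d$ matrix is all that is required, and because $R^{*}$ is orthogonal it preserves norms and angles within the embedding space.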

[55] RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer

Liu Liu,Xiaofeng Wang,Guosheng Zhao,Keyu Li,Wenkang Qin,Jiaxiong Qiu,Zheng Zhu,Guan Huang,Zhizhong Su

Main category: cs.CV

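TL;DR: RoboTransfer is a diffusion-based video generation framework for robotic data synthesis that addresses the high cost of real-world data collection and the sim-to-real gap in imitation learning.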

Details

Motivation: Collecting large-scale real-world robot demonstrations for imitation learning is expensive, and the sim-to-real gap is hard to overcome. Method: RoboTransfer ensures geometric consistency through multi-view geometry and control over scene components, combined with cross-view feature interactions and global depth/normal conditions. Result: Experiments show that RoboTransfer generates multi-view videos with higher geometric consistency and visual fidelity, and that policies trained on its data achieve relative success-rate improvements of 33.3% in the DIFF-OBJ setting and 251% in the DIFF-ALL setting. Conclusion: RoboTransfer offers an efficient, controllable solution for robotic data synthesis and markedly improves imitation learning performance. Abstract: Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: https://horizonrobotics.github.io/robot_lab/robotransfer
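
Because the abstract reports relative rather than absolute gains, it may help to spell out what a relative improvement means. The snippet below is a small worked sketch; the baseline success rate of 0.20 is a made-up value for illustration only, since the abstract does not give absolute success rates.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


def improved_from_relative(baseline, percent):
    """Inverse: the value implied by a relative gain given in percent."""
    return baseline * (1.0 + percent / 100.0)


# Hypothetical baseline success rate (not from the paper).
hypothetical_baseline = 0.20
implied = improved_from_relative(hypothetical_baseline, 251.0)
```

Under that assumed baseline, a 251% relative improvement corresponds to roughly 3.5 times the baseline success rate, not to an absolute gain of 251 percentage points.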

[56] DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

Sungjune Park,Hyunjun Kim,Junho Kim,Seongho Kim,Yong Man Ro

Main category: cs.CV

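TL;DR: The paper proposes DIP-R1, a reinforcement-learning-based framework that strengthens the fine-grained visual perception of multimodal large language models (MLLMs) in complex scenes.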

Details

Motivation: Although MLLMs perform well at visual understanding, their fine-grained perception in complex real-world scenes (such as dense crowds) is still limited. Inspired by the success of reinforcement learning in LLMs and MLLMs, the authors explore how RL can improve the visual perception of MLLMs. Method: The proposed DIP-R1 framework strengthens visual perception through three rule-based rewards: 1) a standard reasoning reward, 2) a variance-guided looking reward, and 3) a weighted precision-recall accuracy reward. Result: DIP-R1 performs well on diverse fine-grained object detection data and clearly outperforms existing baseline models and supervised fine-tuning methods. Conclusion: Integrating RL into MLLMs has substantial potential for improving perception in complex real-world scenes. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant visual understanding capabilities, yet their fine-grained visual perception in complex real-world scenarios, such as densely crowded public areas, remains limited. Inspired by the recent success of reinforcement learning (RL) in both LLMs and MLLMs, in this paper, we explore how RL can enhance the visual perception ability of MLLMs. We then develop a novel RL-based framework, Deep Inspection and Perception with RL (DIP-R1), designed to enhance the visual perception capabilities of MLLMs by comprehending complex scenes and looking through visual instances closely. DIP-R1 guides MLLMs through detailed inspection of the visual scene via three simply designed rule-based reward models. First, we adopt a standard reasoning reward encouraging the model to include three step-by-step processes: 1) reasoning for understanding visual scenes, 2) observing for looking through regions of interest that are ambiguous, and 3) decision-making for predicting the answer. Second, a variance-guided looking reward is designed to examine uncertain regions for the second observing process. It explicitly enables the model to inspect ambiguous areas, improving its ability to mitigate perceptual uncertainties. Third, we model a weighted precision-recall accuracy reward enhancing accurate decision-making. We explore its effectiveness across diverse fine-grained object detection data consisting of challenging real-world environments, such as densely crowded scenes. Built upon existing MLLMs, DIP-R1 achieves consistent and significant improvement across various in-domain and out-of-domain scenarios. It also outperforms various existing baseline models and supervised fine-tuning methods. Our findings highlight the substantial potential of integrating RL into MLLMs for enhancing capabilities in complex real-world perception tasks.
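
To make the idea of a weighted precision-recall reward concrete, here is a minimal sketch for box predictions. It is not the reward defined in the paper: the greedy one-to-one matching, the IoU threshold of 0.5, the weight `w_precision`, and the function names are all assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def pr_reward(pred, gt, w_precision=0.5, iou_thr=0.5):
    """Weighted sum of precision and recall (weights are assumed values)."""
    unmatched = list(gt)
    tp = 0
    for p in pred:
        # Greedily match each prediction to its best remaining ground-truth box.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= iou_thr:
            unmatched.remove(best)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return w_precision * precision + (1.0 - w_precision) * recall
```

Raising `w_precision` penalizes spurious boxes more strongly, while lowering it favors finding every instance, which is the trade-off a weighted reward of this kind exposes in crowded scenes.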

Junyi Guo,Jingxuan Zhang,Fangyu Wu,Huanda Lu,Qiufeng Wang,Wenmian Yang,Eng Gee Lim,Dongming Lu

Main category: cs.CV

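TL;DR: The paper introduces a new task, FS2RG, which generates realistic garment images from flat sketches and text, and it addresses conflicts between the two modalities.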

Details

Motivation: Garment synthesis work mostly targets the design phase, while the production process is under-studied; the paper aims to fill this gap. Method: It proposes the HiGarment framework, which comprises multi-modal semantic enhancement and a harmonized cross-attention mechanism. Result: Experiments and user studies demonstrate HiGarment's effectiveness, and the authors collect what they describe as the largest open-source dataset for garment generation. Conclusion: HiGarment addresses the FS2RG task and offers a new approach to garment generation. Abstract: Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.

[58] Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks

Run Hao,Peng Ying

Main category: cs.CV

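TL;DR: The paper proposes an adversarial prompt generation framework based on a grammar tree and Monte Carlo tree search that evades AIGC detectors and ranked first in a competition.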

Details

Motivation: Photorealistic portraits produced by text-to-image models raise identity-misuse concerns, which motivates testing the robustness of AIGC detectors. Method: A grammar tree structure and a variant of Monte Carlo tree search are used to systematically explore the semantic prompt space and generate diverse, controllable adversarial prompts. Result: The method is validated on multiple T2I models, ranked first in a real-world competition, and can also be used to build high-quality adversarial datasets. Conclusion: Beyond attacks, the method provides resources for training and evaluating more robust AIGC detection systems. Abstract: The rise of text-to-image (T2I) models has enabled the synthesis of photorealistic human portraits, raising serious concerns about identity misuse and the robustness of AIGC detectors. In this work, we propose an automated adversarial prompt generation framework that leverages a grammar tree structure and a variant of the Monte Carlo tree search algorithm to systematically explore the semantic prompt space. Our method generates diverse, controllable prompts that consistently evade both open-source and commercial AIGC detectors. Extensive experiments across multiple T2I models validate its effectiveness, and the approach ranked first in a real-world adversarial AIGC detection competition. Beyond attack scenarios, our method can also be used to construct high-quality adversarial datasets, providing valuable resources for training and evaluating more robust AIGC detection and defense systems.

[59] Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images

Sungjune Park,Hyunjun Kim,Beomchan Park,Yong Man Ro

Main category: cs.CV

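TL;DR: The paper proposes LANGO, a language-guided object detection framework that targets the detection difficulties caused by illumination and viewpoint changes in aerial images.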

Details

Motivation: Aerial images exhibit multiple kinds of variation (such as illumination and viewpoint), which complicates object localization and recognition. Inspired by how humans understand scene semantics, the authors use language-guided learning to reduce these effects. Method: A visual semantic reasoner is designed to understand the visual semantics of image scenes, and a relation learning loss is proposed to handle instance-level variations (such as viewpoint and scale). Result: Experiments show noticeable improvements in detection performance. Conclusion: Through language-guided learning, LANGO alleviates both scene-level and instance-level variations in aerial images. Abstract: Despite recent advancements in computer vision research, object detection in aerial images still suffers from several challenges. One primary challenge to be mitigated is the presence of multiple types of variation in aerial images, for example, illumination and viewpoint changes. These variations result in highly diverse image scenes and drastic alterations in object appearance, so that it becomes more complicated to localize objects from the whole image scene and recognize their categories. To address this problem, in this paper, we introduce a novel object detection framework for aerial images, named LANGuage-guided Object detection (LANGO). Building on the proposed language-guided learning, the framework is designed to alleviate the impact of both scene-level and instance-level variations. First, we are motivated by the way humans understand the semantics of scenes while perceiving environmental factors in the scenes (e.g., weather). Therefore, we design a visual semantic reasoner that comprehends visual semantics of image scenes by interpreting conditions where the given images were captured. Second, we devise a training objective, named relation learning loss, to deal with instance-level variations, such as viewpoint angle and scale changes. This training objective aims to learn relations in language representations of object categories, with the help of the robust characteristics against such variations. Through extensive experiments, we demonstrate the effectiveness of the proposed method, and our method obtains noticeable detection performance improvements.

[60] WTEFNet: Real-Time Low-Light Object Detection for Advanced Driver-Assistance Systems

Hao Wu,Junzhou Chen,Ronghui Zhang,Nengchao Lyu,Hongyu Hu,Yanyong Guo,Tony Z. Qiu

Main category: cs.CV

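TL;DR: WTEFNet is a real-time object detection framework designed for low-light scenes; it combines low-light enhancement, wavelet-based feature extraction, and adaptive fusion detection modules and performs well on several datasets.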

Details

Motivation: RGB cameras degrade under low-light conditions, so the work aims to improve the environmental perception of ADAS. Method: WTEFNet contains three core modules, Low-Light Enhancement (LLE), Wavelet-based Feature Extraction (WFE), and Adaptive Fusion Detection (AFFD), and it introduces the GSN dataset to support training and evaluation. Result: It reaches state-of-the-art low-light detection accuracy on BDD100K, SHIFT, nuScenes, and GSN and is suitable for real-time use on embedded platforms. Conclusion: WTEFNet performs well under low-light conditions and fits real-time ADAS applications. Abstract: Object detection is a cornerstone of environmental perception in advanced driver assistance systems (ADAS). However, most existing methods rely on RGB cameras, which suffer from significant performance degradation under low-light conditions due to poor image quality. To address this challenge, we propose WTEFNet, a real-time object detection framework specifically designed for low-light scenarios, with strong adaptability to mainstream detectors. WTEFNet comprises three core modules: a Low-Light Enhancement (LLE) module, a Wavelet-based Feature Extraction (WFE) module, and an Adaptive Fusion Detection (AFFD) module. The LLE enhances dark regions while suppressing overexposed areas; the WFE applies multi-level discrete wavelet transforms to isolate high- and low-frequency components, enabling effective denoising and structural feature retention; the AFFD fuses semantic and illumination features for robust detection. To support training and evaluation, we introduce GSN, a manually annotated dataset covering both clear and rainy night-time scenes. Extensive experiments on BDD100K, SHIFT, nuScenes, and GSN demonstrate that WTEFNet achieves state-of-the-art accuracy under low-light conditions. Furthermore, deployment on an embedded platform (NVIDIA Jetson AGX Orin) confirms the framework's suitability for real-time ADAS applications.
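
The multi-level discrete wavelet transform used by the WFE module separates a signal into low- and high-frequency bands. As a rough illustration, the sketch below applies an orthonormal Haar transform to a 1-D sequence. The choice of the Haar wavelet, the 1-D setting, and the function names are assumptions; the abstract does not say which wavelet is used, and the module presumably operates on 2-D image features.

```python
from math import sqrt


def haar_dwt(signal):
    """One level of the orthonormal Haar DWT on an even-length sequence."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) * s for a, b in pairs]    # local averages (low frequency)
    high = [(a - b) * s for a, b in pairs]   # local differences (high frequency)
    return low, high


def haar_multilevel(signal, levels):
    """Re-split the low band `levels` times.

    Returns (final_low, [high_level_1, ..., high_level_L]).
    """
    highs = []
    low = list(signal)
    for _ in range(levels):
        low, high = haar_dwt(low)
        highs.append(high)
    return low, highs
```

Small coefficients in the high bands often correspond to noise, which is why thresholding them is a common denoising step; whether WTEFNet does exactly this is not stated in the abstract.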

[61] HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers

Aldino Rizaldy,Richard Gloaguen,Fabian Ewald Fassnacht,Pedram Ghamisi

Main category: cs.CV

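TL;DR: The paper proposes a multimodal fusion method that works on 3D point clouds, using a dual-branch Transformer to learn geometric and spectral features directly and a cross-attention mechanism to strengthen feature fusion.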

Details

Motivation: Existing methods reduce 3D data to 2D for processing, which fails to exploit the full potential of 3D data and limits a model's ability to learn 3D spatial features and produce 3D predictions. Method: Multimodal data are fused within the 3D point cloud, and a dual-branch Transformer with cross-attention learns geometric and spectral features directly. Result: Evaluated on several datasets, the 3D fusion method is competitive with 2D methods and can produce 3D predictions. Conclusion: 3D fusion offers competitive performance and more flexibility, since it supports 3D predictions. Abstract: Multimodal remote sensing data, including spectral and lidar or photogrammetry, is crucial for achieving satisfactory land-use / land-cover classification results in urban scenes. So far, most studies have been conducted in a 2D context. When 3D information is available in the dataset, it is typically integrated with the 2D data by rasterizing the 3D data into 2D formats. Although this method yields satisfactory classification results, it falls short in fully exploiting the potential of 3D data by restricting the model's ability to learn 3D spatial features directly from raw point clouds. Additionally, it limits the generation of 3D predictions, as the dimensionality of the input data has been reduced. In this study, we propose a fully 3D-based method that fuses all modalities within the 3D point cloud and employs a dedicated dual-branch Transformer model to simultaneously learn geometric and spectral features. To enhance the fusion process, we introduce a cross-attention-based mechanism that fully operates on 3D points, effectively integrating features from various modalities across multiple scales. The purpose of cross-attention is to allow one modality to assess the importance of another by weighing the relevant features. We evaluated our method by comparing it against both 3D and 2D methods using the 2018 IEEE GRSS Data Fusion Contest (DFC2018) dataset. Our findings indicate that 3D fusion delivers competitive results compared to 2D methods and offers more flexibility by providing 3D predictions. These predictions can be projected onto 2D maps, a capability that is not feasible in reverse. Additionally, we evaluated our method on different datasets, specifically the ISPRS Vaihingen 3D and the IEEE 2019 Data Fusion Contest. Our code will be published here: https://github.com/aldinorizaldy/hyperpointformer.
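
The statement that cross-attention lets one modality weigh the features of another can be illustrated with plain scaled dot-product attention. The sketch below is a generic single-head version without learned projections, written with the standard library only; it is not the dual-branch model from the paper, and the assignment of queries to geometric features and keys/values to spectral features is an assumption for the example.

```python
from math import exp, sqrt


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    queries: features from one modality (e.g. geometric, per point).
    keys, values: features from the other modality (e.g. spectral).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

Each output row is a weighted average of the value vectors, with weights determined by how well the query from one branch matches the keys of the other branch.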

[62] Navigating the Accuracy-Size Trade-Off with Flexible Model Merging

Akash Dhasade,Divyansh Jhunjhunwala,Milos Vujasinovic,Gauri Joshi,Anne-Marie Kermarrec

Main category: cs.CV

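TL;DR: FlexMerge is a new data-free model merging framework that flexibly produces merged models of different sizes to balance accuracy against deployment cost.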

Details

Motivation: A single merged model loses accuracy, while deploying every individual fine-tuned model is costly. Method: Fine-tuned models are treated as collections of sequential blocks and merged progressively, with support for multiple data-free merging algorithms. Result: Experiments show that modestly larger merged models can improve accuracy substantially in multi-task settings. Conclusion: FlexMerge provides a flexible, efficient, data-free solution for multi-task deployment. Abstract: Model merging has emerged as an efficient method to combine multiple single-task fine-tuned models. The merged model can enjoy multi-task capabilities without expensive training. While promising, merging into a single model often suffers from an accuracy gap with respect to individual fine-tuned models. On the other hand, deploying all individual fine-tuned models incurs high costs. We propose FlexMerge, a novel data-free model merging framework to flexibly generate merged models of varying sizes, spanning the spectrum from a single merged model to retaining all individual fine-tuned models. FlexMerge treats fine-tuned models as collections of sequential blocks and progressively merges them using any existing data-free merging method, halting at a desired size. We systematically explore the accuracy-size trade-off exhibited by different merging algorithms in combination with FlexMerge. Extensive experiments on vision and NLP benchmarks, with up to 30 tasks, reveal that even modestly larger merged models can provide substantial accuracy improvements over a single model. By offering fine-grained control over fused model size, FlexMerge provides a flexible, data-free, and high-performance solution for diverse deployment scenarios.
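
The idea of merging block by block and halting at a desired size can be sketched as below. This is a toy illustration, not FlexMerge itself: blocks are plain lists of floats, element-wise averaging stands in for "any existing data-free merging method", and the front-to-back merge order is an assumption, since the abstract does not describe how the next block to merge is chosen.

```python
def average(blocks):
    """Element-wise mean of equally shaped blocks (stand-in merge method)."""
    return [sum(vals) / len(vals) for vals in zip(*blocks)]


def flex_merge(models, target_blocks):
    """Merge block positions one by one until at most `target_blocks` remain.

    models: list of models, each a list of blocks (lists of floats).
    Returns a list whose entry i holds the blocks kept at position i:
    a single shared block if merged, otherwise one block per model.
    """
    n_pos = len(models[0])
    kept = [[m[i] for m in models] for i in range(n_pos)]
    total = sum(len(k) for k in kept)
    for i in range(n_pos):              # merge order is an assumption
        if total <= target_blocks:
            break
        total -= len(kept[i]) - 1
        kept[i] = [average(kept[i])]
    return kept
```

With `target_blocks` equal to the number of block positions, the result is a single fully merged model; with a large enough target, every fine-tuned model is retained unchanged, which mirrors the spectrum described in the abstract.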

[63] SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection

Wenhao Xu,Shuchen Zheng,Changwei Wang,Zherui Zhang,Chuan Ren,Rongtao Xu,Shibiao Xu

Main category: cs.CV

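TL;DR: SAMamba combines SAM2's hierarchical feature learning with Mamba's selective sequence modeling to address information loss and global context modeling in infrared small target detection, with clear performance gains.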

Details

Motivation: Infrared small target detection is vital for military and early-warning applications, but existing methods suffer from information loss and inefficient global context modeling. Method: The SAMamba framework includes an FS-Adapter, a CSI module, and a DPCF module for domain adaptation, global context modeling, and multi-scale feature fusion, respectively. Result: It significantly outperforms existing methods on several datasets, especially with complex backgrounds and targets of varying scale. Conclusion: Through its module design, SAMamba addresses the core challenges of infrared small target detection and offers an efficient solution for practical use. Abstract: Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents SAMamba, a novel framework integrating SAM2's hierarchical feature learning with Mamba's selective sequence modeling. Key innovations include: (1) A Feature Selection Adapter (FS-Adapter) for efficient natural-to-infrared domain adaptation via dual-stage selection (token-level with a learnable task embedding and channel-wise adaptive transformations); (2) A Cross-Channel State-Space Interaction (CSI) module for efficient global context modeling with linear complexity using selective state space modeling; and (3) A Detail-Preserving Contextual Fusion (DPCF) module that adaptively combines multi-scale features with a gating mechanism to balance high-resolution and low-resolution feature contributions. SAMamba addresses core ISTD challenges by bridging the domain gap, maintaining fine-grained details, and efficiently modeling long-range dependencies. Experiments on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets show SAMamba significantly outperforms state-of-the-art methods, especially in challenging scenarios with heterogeneous backgrounds and varying target scales. Code: https://github.com/zhengshuchen/SAMamba.
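
A gating mechanism that balances high- and low-resolution features, as described for the DPCF module, is commonly written as a convex combination controlled by a sigmoid gate. The following sketch shows that generic form on flat feature vectors. It is an assumption that DPCF uses this exact formulation; in a real network the gate logits would come from a learned layer, whereas here they are passed in directly.

```python
from math import exp


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def gated_fusion(high_res, low_res, gate_logits):
    """Blend two equally sized feature vectors with a per-element gate.

    output[i] = g[i] * high_res[i] + (1 - g[i]) * low_res[i],
    where g[i] = sigmoid(gate_logits[i]).
    """
    if not (len(high_res) == len(low_res) == len(gate_logits)):
        raise ValueError("inputs must have the same length")
    fused = []
    for h, l, z in zip(high_res, low_res, gate_logits):
        g = sigmoid(z)
        fused.append(g * h + (1.0 - g) * l)
    return fused
```

A gate near 1 keeps fine detail from the high-resolution branch, which matters for targets only a few pixels wide, while a gate near 0 falls back on the coarser contextual features.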

[64] UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes

Yixun Liang,Kunming Luo,Xiao Chen,Rui Chen,Hongyu Yan,Weiyu Li,Jiarui Liu,Ping Tan

Main category: cs.CV

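TL;DR: UniTEX is a two-stage 3D texture generation framework that operates directly in a 3D functional space, avoiding the limitations of UV mapping and producing high-quality, consistent 3D textures.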

Details

Motivation: Existing methods rely on UV mapping and image reprojection, which leads to topological ambiguity. UniTEX aims to bypass these limitations by generating textures directly in 3D space. Method: 1. Texture Functions (TFs) lift texture generation into 3D space; 2. a transformer-based Large Texturing Model (LTM) predicts TFs from image and geometry inputs; 3. a LoRA-based strategy adapts 2D priors for high-quality multi-view texture synthesis. Result: Experiments show that UniTEX surpasses existing methods in visual quality and texture integrity, offering a generalizable and scalable solution for 3D texture generation. Conclusion: With its two-stage framework, UniTEX addresses topological ambiguity in 3D texture generation and provides an efficient approach to automated 3D texturing. Abstract: We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based inpainting to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we propose to bypass the limitations of UV mapping by operating directly in a unified 3D functional space. Specifically, we first lift texture generation into 3D space via Texture Functions (TFs), a continuous, volumetric representation that maps any 3D point to a texture value based solely on surface proximity, independent of mesh topology. Then, we propose to predict these TFs directly from images and geometry inputs using a transformer-based Large Texturing Model (LTM). To further enhance texture quality and leverage powerful 2D priors, we develop an advanced LoRA-based strategy for efficiently adapting large-scale Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation. Code will be available at: https://github.com/YixunLiang/UniTEX.

[65] Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

Zheng Sun,Yi Wei,Long Yu

Main category: cs.CV

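TL;DR: The paper proposes a complete solution for medical image screening, comprising a dataset and a methodology, to improve the image aesthetic reasoning ability of multimodal large language models (MLLMs).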

Details

Motivation: Current MLLMs perform poorly on image screening, mainly because data are scarce and the models' aesthetic reasoning ability is weak. Method: The authors collect a medical image dataset with 1500+ samples and propose DPA-GRPO, reinforcement learning that combines long chains of thought with a Dynamic Proportional Accuracy reward, to strengthen the aesthetic reasoning of MLLMs. Result: Experiments show that even state-of-the-art closed-source MLLMs (such as GPT-4o and Qwen-VL-Max) perform close to random guessing in image aesthetic reasoning, whereas the reinforcement learning approach lets a much smaller model surpass these large models. Conclusion: The study offers a new configuration for image aesthetic reasoning, which the authors hope will become a regular setup for medical image screening. Abstract: Multimodal Large Language Models (MLLMs) are widely applied across many domains, such as multimodal understanding and generation. With the development of diffusion models (DM) and unified MLLMs, the performance of image generation has been significantly improved. However, the study of image screening is rare, and its performance with MLLMs is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability of MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive medical image screening dataset with 1500+ samples; each sample consists of a medical image, four generated images, and a multiple-choice answer. The dataset evaluates the aesthetic reasoning ability under four aspects: (1) Appearance Deformation, (2) Principles of Physical Lighting and Shadow, (3) Placement Layout, (4) Extension Rationality. For methodology, we utilize long chains of thought (CoT) and Group Relative Policy Optimization with Dynamic Proportional Accuracy reward, called DPA-GRPO, to enhance the image aesthetic reasoning ability of MLLMs. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT-4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the reinforcement learning approach, we are able to surpass the score of both large-scale models and leading closed-source models using a much smaller model. We hope our attempt on medical image screening will serve as a regular configuration in image aesthetic reasoning in the future.
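
The "group relative" part of Group Relative Policy Optimization refers to scoring each sampled response against the other responses drawn for the same prompt. A minimal sketch of that normalization step is shown below. It covers only the standard advantage computation; the Dynamic Proportional Accuracy reward itself is not specified in the abstract and is not implemented here, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: 1.0 for a correct multiple-choice answer, 0.0 otherwise
# (a hypothetical reward scheme, used only to exercise the function).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason reward shaping matters for multiple-choice tasks like this one.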

[66] Unsupervised Transcript-assisted Video Summarization and Highlight Detection

Spyros Barbakos,Charalampos Antoniadis,Gerasimos Potamianos,Gianluca Setti

Main category: cs.CV

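TL;DR: The paper proposes a multimodal method that combines video frames and transcripts within a reinforcement learning framework to generate video summaries and detect highlights, outperforming methods that rely on visual content alone.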

Details

Motivation: Video consumption is a key part of daily life, but watching entire videos can be tedious, and no existing method combines video frames and transcripts within a reinforcement learning framework. Method: A multimodal pipeline uses video frames and transcripts and is trained with reinforcement learning to produce diverse, representative summaries that include segments with meaningful transcript content. Result: Experiments show that using transcripts for video summarization and highlight detection outperforms relying on visual content alone. Conclusion: The multimodal approach performs better on video summarization and highlight detection, and its unsupervised training suits large-scale unannotated datasets. Abstract: Video consumption is a key part of daily life, but watching entire videos can be tedious. To address this, researchers have explored video summarization and highlight detection to identify key video segments. While some works combine video frames and transcripts, and others tackle video summarization and highlight detection using Reinforcement Learning (RL), no existing work, to the best of our knowledge, integrates both modalities within an RL framework. In this paper, we propose a multimodal pipeline that leverages video frames and their corresponding transcripts to generate a more condensed version of the video and detect highlights using a modality fusion mechanism. The pipeline is trained within an RL framework, which rewards the model for generating diverse and representative summaries while ensuring the inclusion of video segments with meaningful transcript content. The unsupervised nature of the training allows for learning from large-scale unannotated datasets, overcoming the challenge posed by the limited size of existing annotated datasets. Our experiments show that using the transcript in video summarization and highlight detection achieves superior results compared to relying solely on the visual content of the video.

[67] LADA: Scalable Label-Specific CLIP Adapter for Continual Learning

Mao-Lin Luo,Zi-Hao Zhou,Tong Wei,Min-Ling Zhang

Main category: cs.CV

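TL;DR: LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, which removes the parameter-selection problem of existing CLIP-based continual learning methods, and it uses feature distillation to prevent catastrophic forgetting, reaching state-of-the-art performance.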

Details

Motivation: Existing CLIP-based continual learning methods must select a subset of parameters for each task, which is error-prone and degrades performance. Method: LADA appends lightweight, label-specific memory units after the frozen CLIP image encoder and uses feature distillation to prevent catastrophic forgetting. Result: LADA achieves state-of-the-art performance on continual learning tasks. Conclusion: With a lightweight design and feature distillation, LADA addresses these problems in continual learning. Abstract: Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging their transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to error that degrades performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The implementation code is available at https://github.com/MaolinLuo/LADA.

[68] Are MLMs Trapped in the Visual Room?

Yazhou Zhang,Chunwang Zou,Qimeng Liu,Lu Rong,Ben Yao,Zheng Lian,Qiuchi Li,Peng Zhang,Jing Qin

Main category: cs.CV

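TL;DR: The paper asks whether multi-modal large models (MLMs) truly "understand" images, proposes the "Visual Room" argument, and tests model behavior with a two-tier evaluation framework covering perception and cognition.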

Details

Motivation: The work challenges the assumption that perceptual ability equals genuine understanding, using the "Visual Room" argument to question whether MLMs understand at a deeper level. Method: It proposes a two-tier evaluation framework (perception and cognition), builds a high-quality multi-modal sarcasm dataset, and evaluates eight SoTA MLMs. Result: MLMs perform well on perception tasks, but their sarcasm-understanding error rate is about 16.1% even when perception is correct, revealing a significant gap between perceiving and understanding. Conclusion: The results empirically support the "Visual Room" argument, offer a new evaluation paradigm for MLMs, and highlight deficiencies in emotional reasoning and commonsense inference. Abstract: Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the Visual Room argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. In implementation, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual contents, while the cognition component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising 924 static images and 100 dynamic videos. All sarcasm labels are annotated by the original authors and verified by independent reviewers to ensure clarity and consistency. We evaluate eight state-of-the-art (SoTA) MLMs. Our results highlight three key findings: (1) MLMs perform well on perception tasks; (2) even with correct perception, models exhibit an average error rate of ~16.1% in sarcasm understanding, revealing a significant gap between seeing and understanding; (3) error analysis attributes this gap to deficiencies in emotional reasoning, commonsense inference, and context alignment. This work provides empirical grounding for the proposed Visual Room argument and offers a new evaluation paradigm for MLMs.
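
The second finding is a conditional quantity: the sarcasm error rate is measured only on samples where perception was already correct. A small sketch of how such a rate could be computed from per-sample results follows; the record format and the function name are hypothetical, and the toy data are invented purely to exercise the function.

```python
def sarcasm_error_given_correct_perception(records):
    """Error rate on the cognition task among perception-correct samples.

    records: iterable of (perception_correct, sarcasm_correct) booleans.
    """
    seen = [sarcasm_ok for perception_ok, sarcasm_ok in records if perception_ok]
    if not seen:
        raise ValueError("no sample with correct perception")
    return sum(1 for ok in seen if not ok) / len(seen)


# Toy data (invented): four perception-correct samples, one sarcasm error.
toy = [(True, True), (True, False), (False, False), (True, True), (True, True)]
rate = sarcasm_error_given_correct_perception(toy)
```

Conditioning on correct perception is what separates a failure to see from a failure to understand; an unconditional error rate would mix the two.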

[69] Holistic Large-Scale Scene Reconstruction via Mixed Gaussian Splatting

Chuandong Liu,Huijiao Wang,Lei Yu,Gui-Song Xia

Main category: cs.CV

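TL;DR: MixGS proposes a holistic optimization framework for 3D Gaussian Splatting that avoids the loss of global information and the complex parameter tuning caused by divide-and-conquer strategies, achieving high-quality rendering with efficient computation.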

Details

Motivation: Existing large-scale scene reconstruction methods rely on divide-and-conquer strategies, which lose global information and require complex parameter tuning; MixGS aims to solve these problems. Method: MixGS integrates camera pose and Gaussian attributes into a view-aware representation and uses a mixing operation that combines decoded and original Gaussians to preserve both global coherence and local fidelity. Result: Experiments show that MixGS achieves state-of-the-art rendering quality with efficient computation on large-scale scenes and can be trained on a single GPU with 24GB of VRAM. Conclusion: MixGS offers an efficient, high-quality solution for large-scale 3D scene reconstruction, and the code will be open-sourced. Abstract: Recent advances in 3D Gaussian Splatting have shown remarkable potential for novel view synthesis. However, most existing large-scale scene reconstruction methods rely on the divide-and-conquer paradigm, which often leads to the loss of global scene information and requires complex parameter tuning due to scene partitioning and local optimization. To address these limitations, we propose MixGS, a novel holistic optimization framework for large-scale 3D scene reconstruction. MixGS models the entire scene holistically by integrating camera pose and Gaussian attributes into a view-aware representation, which is decoded into fine-detailed Gaussians. Furthermore, a novel mixing operation combines decoded and original Gaussians to jointly preserve global coherence and local fidelity. Extensive experiments on large-scale scenes demonstrate that MixGS achieves state-of-the-art rendering quality and competitive speed, while significantly reducing computational requirements, enabling large-scale scene reconstruction training on a single 24GB VRAM GPU. The code will be released at https://github.com/azhuantou/MixGS.

[70] RSFAKE-1M: A Large-Scale Dataset for Detecting Diffusion-Generated Remote Sensing Forgeries

Zhihong Tan,Jiayi Wang,Huiying Shi,Binyuan Huang,Hongchen Wei,Zhenzhong Chen

Main category: cs.CV

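TL;DR: The paper introduces the RSFAKE-1M dataset for detecting diffusion-generated forged remote sensing images and shows that it improves the generalization and robustness of detection methods.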

Details

Motivation: Remote sensing imagery is vital in areas such as environmental monitoring, but existing benchmarks mainly target GAN-generated forgeries or natural images, leaving diffusion-generated remote sensing forgeries under-studied. Method: The authors build RSFAKE-1M, which contains 500K forged and 500K real remote sensing images, with forgeries generated by ten diffusion models under a range of generation conditions. Result: Experiments show that current methods have limited success detecting diffusion-based remote sensing forgeries, while models trained on RSFAKE-1M perform notably better. Conclusion: RSFAKE-1M provides an important foundation for developing next-generation forgery detection methods for remote sensing images. Abstract: Detecting forged remote sensing images is becoming increasingly critical, as such imagery plays a vital role in environmental monitoring, urban planning, and national security. While diffusion models have emerged as the dominant paradigm for image generation, their impact on remote sensing forgery detection remains underexplored. Existing benchmarks primarily target GAN-based forgeries or focus on natural images, limiting progress in this critical domain. To address this gap, we introduce RSFAKE-1M, a large-scale dataset of 500K forged and 500K real remote sensing images. The fake images are generated by ten diffusion models fine-tuned on remote sensing data, covering six generation conditions such as text prompts, structural guidance, and inpainting. This paper presents the construction of RSFAKE-1M along with a comprehensive experimental evaluation using both existing detectors and unified baselines. The results reveal that diffusion-based remote sensing forgeries remain challenging for current methods, and that models trained on RSFAKE-1M exhibit notably improved generalization and robustness. Our findings underscore the importance of RSFAKE-1M as a foundation for developing and evaluating next-generation forgery detection approaches in the remote sensing domain. The dataset and other supplementary materials are available at https://huggingface.co/datasets/TZHSW/RSFAKE/.

[71] GenCAD-Self-Repairing: Feasibility Enhancement for 3D CAD Generation

Chikaha Tsuji,Enrique Flores Medina,Harshit Gupta,Md Ferdous Alam

Main category: cs.CV

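TL;DR: GenCAD-Self-Repairing uses diffusion guidance and a self-repairing pipeline to improve the feasibility of generated CAD models, addressing GenCAD's tendency to produce infeasible B-reps.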

Details

Motivation: About 10% of the CAD designs generated by GenCAD are infeasible, which limits practical use. Method: A guided diffusion denoising process is combined with a regression-based correction mechanism to refine infeasible CAD command sequences. Result: Two-thirds of the infeasible designs from the baseline method are converted into feasible ones, significantly raising the feasibility rate. Conclusion: The method improves the quality of generated CAD models and broadens the applicability of AI-driven CAD generation. Abstract: With the advancement of generative AI, research on its application to 3D model generation has gained traction, particularly in automating the creation of Computer-Aided Design (CAD) files from images. GenCAD is a notable model in this domain, leveraging an autoregressive transformer-based architecture with a contrastive learning framework to generate CAD programs. However, a major limitation of GenCAD is its inability to consistently produce feasible boundary representations (B-reps), with approximately 10% of generated designs being infeasible. To address this, we propose GenCAD-Self-Repairing, a framework that enhances the feasibility of generative CAD models through diffusion guidance and a self-repairing pipeline. This framework integrates a guided diffusion denoising process in the latent space and a regression-based correction mechanism to refine infeasible CAD command sequences while preserving geometric accuracy. Our approach successfully converted two-thirds of infeasible designs in the baseline method into feasible ones, significantly improving the feasibility rate while simultaneously maintaining a reasonable level of geometric accuracy between the point clouds of ground truth models and generated models. By significantly improving the feasibility rate of generating CAD models, our approach helps expand the availability of high-quality training data and enhances the applicability of AI-driven CAD generation in manufacturing, architecture, and product design.

[72] Federated Unsupervised Semantic Segmentation

Evangelos Charalampakis,Vasileios Mygdalis,Ioannis Pitas

Main category: cs.CV

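TL;DR: Proposes FUSS, described as the first framework for fully decentralized, unsupervised federated training of semantic image segmentation; by promoting consistency in feature and prototype space, it clearly outperforms conventional approaches.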

Details

Motivation: Explores federated learning for unsupervised semantic image segmentation, where distributed clients must align features under heterogeneous data distributions. Method: Proposes the FUSS framework, whose new federation strategies jointly optimize local segmentation heads and shared semantic centroids to keep feature and prototype space globally consistent. Result: On benchmark and real-world datasets, FUSS outperforms local-only training and classical federated learning algorithms on both binary and multi-class segmentation. Conclusion: FUSS offers an effective solution for unsupervised federated semantic segmentation; the code will be released to support reproducibility. Abstract: This work explores the application of Federated Learning (FL) in Unsupervised Semantic image Segmentation (USS). Recent USS methods extract pixel-level features using frozen visual foundation models and refine them through self-supervised objectives that encourage semantic grouping. These features are then grouped into semantic clusters to produce segmentation masks. Extending these ideas to federated settings requires feature representation and cluster centroid alignment across distributed clients -- an inherently difficult task under heterogeneous data distributions in the absence of supervision. To address this, we propose FUSS (Federated Unsupervised image Semantic Segmentation) which is, to our knowledge, the first framework to enable fully decentralized, label-free semantic segmentation training. FUSS introduces novel federation strategies that promote global consistency in feature and prototype space, jointly optimizing local segmentation heads and shared semantic centroids. Experiments on both benchmark and real-world datasets, including binary and multi-class segmentation tasks, show that FUSS consistently outperforms local-only client trainings as well as extensions of classical FL algorithms under varying client data distributions. To support reproducibility, full code will be released upon manuscript acceptance.

[73] TRACE: Trajectory-Constrained Concept Erasure in Diffusion Models

Finn Carter

Main category: cs.CV

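TL;DR: TRACE is a new method for erasing specific concepts from diffusion models while preserving generation quality. It combines a theoretical framework with a fine-tuning procedure and performs strongly on multiple benchmarks.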

Details

Motivation: Diffusion models can generate undesirable content (such as pornography, sensitive identities, and copyrighted styles), raising privacy, fairness, and safety concerns. Concept erasure aims to address this. Method: TRACE erases concepts through a theoretical framework and a fine-tuning procedure, comprising a closed-form update to the cross-attention layers and a trajectory-aware fine-tuning objective. Result: TRACE achieves the best results on multiple benchmarks, outperforming methods such as ANT, EraseAnything, and MACE. Conclusion: TRACE makes notable progress on both concept erasure and generation quality, providing an effective tool for the safe use of diffusion models. Abstract: Text-to-image diffusion models have shown unprecedented generative capability, but their ability to produce undesirable concepts (e.g., pornographic content, sensitive identities, copyrighted styles) poses serious concerns for privacy, fairness, and safety. Concept erasure aims to remove or suppress specific concept information in a generative model. In this paper, we introduce TRACE (Trajectory-Constrained Attentional Concept Erasure), a novel method to erase targeted concepts from diffusion models while preserving overall generative quality. Our approach combines a rigorous theoretical framework, establishing formal conditions under which a concept can be provably suppressed in the diffusion process, with an effective fine-tuning procedure compatible with both conventional latent diffusion (Stable Diffusion) and emerging rectified flow models (e.g., FLUX). We first derive a closed-form update to the model's cross-attention layers that removes hidden representations of the target concept. We then introduce a trajectory-aware finetuning objective that steers the denoising process away from the concept only in the late sampling stages, thus maintaining the model's fidelity on unrelated content. Empirically, we evaluate TRACE on multiple benchmarks used in prior concept erasure studies (object classes, celebrity faces, artistic styles, and explicit content from the I2P dataset). TRACE achieves state-of-the-art performance, outperforming recent methods such as ANT, EraseAnything, and MACE in terms of removal efficacy and output quality.

[74] Adversarial Semantic and Label Perturbation Attack for Pedestrian Attribute Recognition

Weizhe Kong,Xiao Wang,Ruichong Gao,Chenglong Li,Yu Zhang,Xing Yang,Yaowei Wang,Jin Tang

Main category: cs.CV

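TL;DR: Proposes ASL-PAR, presented as the first adversarial attack and defense framework for pedestrian attribute recognition (PAR); it generates adversarial noise through global and local (patch-level) attacks, adds a semantic offset defense strategy, and validates both experimentally.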

Details

Motivation: Although deep learning has advanced pedestrian attribute recognition, its potential vulnerability and robustness to interference have not been studied thoroughly. This paper aims to fill that gap. Method: Builds on a pre-trained CLIP-based PAR framework that splits the image into patches and embeds them as features, expands attributes into sentences and embeds them as text features, fuses vision and text tokens with a multi-modal Transformer, and recognizes attributes with a feed-forward network; on top of it, the paper generates adversarial noise (ASL-PAR) and designs a defense strategy. Result: Experiments in the digital domain (PETA, PA100K, and others) and the physical domain validate the attack and defense strategies. Conclusion: ASL-PAR provides effective adversarial attack and defense methods for pedestrian attribute recognition, filling a research gap. Abstract: Pedestrian Attribute Recognition (PAR) is an indispensable task in human-centered research and has made great progress in recent years with the development of deep neural networks. However, the potential vulnerability and anti-interference ability have still not been fully explored. To bridge this gap, this paper proposes the first adversarial attack and defense framework for pedestrian attribute recognition. Specifically, we exploit both global- and patch-level attacks on the pedestrian images, based on the pre-trained CLIP-based PAR framework. It first divides the input pedestrian image into non-overlapping patches and embeds them into feature embeddings using a projection layer. Meanwhile, the attribute set is expanded into sentences using prompts and embedded into attribute features using a pre-trained CLIP text encoder. A multi-modal Transformer is adopted to fuse the obtained vision and text tokens, and a feed-forward network is utilized for attribute recognition. Based on the aforementioned PAR framework, we adopt the adversarial semantic and label-perturbation to generate the adversarial noise, termed ASL-PAR. We also design a semantic offset defense strategy to suppress the influence of adversarial attacks. Extensive experiments conducted on both digital domains (i.e., PETA, PA100K, MSP60K, RAPv2) and physical domains fully validated the effectiveness of our proposed adversarial attack and defense strategies for pedestrian attribute recognition. The source code of this paper will be released at https://github.com/Event-AHU/OpenPAR.

[75] Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

Hengyuan Cao,Yutong Feng,Biao Gong,Yijing Tian,Yunhong Lu,Chuang Liu,Bin Wang

Main category: cs.CV

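TL;DR: Proposes DRA-Ctrl, a paradigm for video-to-image knowledge compression and task adaptation that uses the strengths of video models (such as long-range context modeling and full attention) to support controllable image generation tasks.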

Details

Motivation: Explores whether a trained high-dimensional video generative model can effectively support lower-dimensional tasks such as controllable image generation, tapping the potential of video models. Method: Proposes the DRA-Ctrl paradigm, which includes a mixup-based transition strategy and a redesigned attention structure to bridge the gap between video frames and image generation. Result: Experiments show that the repurposed video models outperform models trained directly on images across a variety of image generation tasks. Conclusion: DRA-Ctrl offers a new way to reuse resource-intensive video models and lays a foundation for unified generative models across visual modalities. Abstract: Video generative models can be regarded as world simulators due to their ability to capture dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed Dimension-Reduction Attack (DRA-Ctrl), which utilizes the strengths of video models, including long-range context modeling and flattened full attention, to perform various generation tasks. Specifically, to address the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays a foundation for future unified generative models across visual modalities. The project page is https://dra-ctrl-2025.github.io/DRA-Ctrl/.

[76] Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization

Matteo Gallici,Haitz Sáez de Ocáriz Borde

Main category: cs.CV

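TL;DR: Studies reinforcement learning (RL) fine-tuning of pre-trained generative models, specifically applying Group Relative Policy Optimization (GRPO) to visual autoregressive (VAR) models; it significantly improves image quality and enables control over generation style.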

Details

Motivation: Fine-tune generative models with reinforcement learning so that outputs match human preferences more closely, in particular by using GRPO to optimize VAR models and address the shortcomings of conventional methods under complex reward signals. Method: Fine-tunes VAR models with GRPO, using reward signals from aesthetic predictors and CLIP embeddings to improve image quality and control generation style. Result: Experiments show that the method significantly improves image quality and, through CLIP, generates images beyond the pre-training data distribution. Conclusion: RL fine-tuning is efficient and effective for VAR models, benefiting in particular from their fast inference, which gives them an edge over alternatives such as diffusion models. Abstract: Fine-tuning pre-trained generative models with Reinforcement Learning (RL) has emerged as an effective approach for aligning outputs more closely with nuanced human preferences. In this paper, we investigate the application of Group Relative Policy Optimization (GRPO) to fine-tune next-scale visual autoregressive (VAR) models. Our empirical results demonstrate that this approach enables alignment to intricate reward signals derived from aesthetic predictors and CLIP embeddings, significantly enhancing image quality and enabling precise control over the generation style. Interestingly, by leveraging CLIP, our method can help VAR models generalize beyond their initial ImageNet distribution: through RL-driven exploration, these models can generate images aligned with prompts referencing image styles that were absent during pre-training. In summary, we show that RL-based fine-tuning is both efficient and effective for VAR models, benefiting particularly from their fast inference speeds, which are advantageous for online sampling, an aspect that poses significant challenges for diffusion-based alternatives.
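The group-relative part of GRPO can be illustrated with a minimal sketch. It assumes one group of images sampled for the same prompt, a reward that is a weighted sum of an aesthetic score and a CLIP similarity, and standardization of rewards within the group. The function names, the equal weights, and the toy scores are hypothetical and are not taken from the paper, and the policy update itself is omitted.

```python
from statistics import mean, pstdev


def combined_reward(aesthetic_score, clip_similarity, w_aesthetic=0.5, w_clip=0.5):
    """Hypothetical scalar reward: a weighted sum of the two signals."""
    return w_aesthetic * aesthetic_score + w_clip * clip_similarity


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no
    separate value network is needed to form a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; using the sample std is also common
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy (aesthetic, CLIP) scores for four images sampled from one prompt.
scores = [(0.2, 0.4), (0.6, 0.4), (0.8, 1.0), (0.4, 0.4)]
rewards = [combined_reward(a, c) for a, c in scores]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive positive advantages and are reinforced; those below the mean are discouraged. A group with identical rewards yields all-zero advantages and therefore no learning signal.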

[77] DSAGL: Dual-Stream Attention-Guided Learning for Weakly Supervised Whole Slide Image Classification

Daoxi Cao,Hangbei Cheng,Yijin Li,Ruolin Zhou,Xinyi Li,Xuehan Zhang,Binwei Li,Xuancheng Gu,Xueyu Liu,Yongfei Wu

Main category: cs.CV

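TL;DR: DSAGL is a novel weakly supervised classification framework that uses a dual-stream design and attention mechanisms to address instance-level ambiguity and semantic consistency in whole-slide image classification.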

Details

Motivation: Whole-slide images (WSIs) are critical for cancer diagnosis because of their ultra-high resolution and rich semantic content, but their massive size and the scarcity of fine-grained annotations challenge conventional supervised learning. Method: Proposes DSAGL, which combines a teacher-student architecture with a dual-stream design, generates multi-scale attention-based pseudo labels to guide instance-level learning, and uses a lightweight encoder (VSSMamba) and a fusion-attentive module (FASA). Result: On the CIFAR-10, NCT-CRC, and TCGA-Lung datasets, DSAGL outperforms existing MIL baselines with better discriminative performance and robustness. Conclusion: DSAGL addresses WSI classification effectively under weak supervision and provides an efficient tool for cancer diagnosis. Abstract: Whole-slide images (WSIs) are critical for cancer diagnosis due to their ultra-high resolution and rich semantic content. However, their massive size and the limited availability of fine-grained annotations pose substantial challenges for conventional supervised learning. We propose DSAGL (Dual-Stream Attention-Guided Learning), a novel weakly supervised classification framework that combines a teacher-student architecture with a dual-stream design. DSAGL explicitly addresses instance-level ambiguity and bag-level semantic consistency by generating multi-scale attention-based pseudo labels and guiding instance-level learning. A shared lightweight encoder (VSSMamba) enables efficient long-range dependency modeling, while a fusion-attentive module (FASA) enhances focus on sparse but diagnostically relevant regions. We further introduce a hybrid loss to enforce mutual consistency between the two streams. Experiments on CIFAR-10, NCT-CRC, and TCGA-Lung datasets demonstrate that DSAGL consistently outperforms state-of-the-art MIL baselines, achieving superior discriminative performance and robustness under weak supervision.

[78] Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering

Sixian Wang,Zhiwei Tang,Tsung-Hui Chang

Main category: cs.CV

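TL;DR: Proposes CFG-Rejection, an efficient method that filters low-quality samples early by analyzing the Accumulated Score Differences (ASD) along the denoising trajectory, without external reward signals or model retraining.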

Details

Motivation: Diffusion models produce samples of inconsistent quality, and existing remedies (such as DDPO and inference-time alignment techniques) are computationally expensive and depend on external reward signals, which limits their wider use. Method: Finds that sample quality correlates strongly with the accumulated difference between conditional and unconditional scores (ASD) along the denoising trajectory, and proposes CFG-Rejection to filter low-quality samples early in denoising. Result: Experiments show that CFG-Rejection markedly improves human preference scores (HPSv2, PickScore) and results on challenging benchmarks (GenEval, DPG-Bench) in image generation. Conclusion: CFG-Rejection is an efficient way to generate high-quality samples that is compatible with existing frameworks, and it may extend to other generative modalities. Abstract: Diffusion models often exhibit inconsistent sample quality due to stochastic variations inherent in their sampling trajectories. Although training-based fine-tuning (e.g., DDPO [1]) and inference-time alignment techniques [2] aim to improve sample fidelity, they typically necessitate full denoising processes and external reward signals. This incurs substantial computational costs, hindering their broader applicability. In this work, we unveil an intriguing phenomenon: a previously unobserved yet exploitable link between sample quality and characteristics of the denoising trajectory during classifier-free guidance (CFG). Specifically, we identify a strong correlation between high-density regions of the sample distribution and the Accumulated Score Differences (ASD)--the cumulative divergence between conditional and unconditional scores. Leveraging this insight, we introduce CFG-Rejection, an efficient, plug-and-play strategy that filters low-quality samples at an early stage of the denoising process, crucially without requiring external reward signals or model retraining. Importantly, our approach necessitates no modifications to model architectures or sampling schedules and maintains full compatibility with existing diffusion frameworks. We validate the effectiveness of CFG-Rejection in image generation through extensive experiments, demonstrating marked improvements on human preference scores (HPSv2, PickScore) and challenging benchmarks (GenEval, DPG-Bench). We anticipate that CFG-Rejection will offer significant advantages for diverse generative modalities beyond images, paving the way for more efficient and reliable high-quality sample generation.
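A rough sketch of the filtering idea follows. It assumes ASD is accumulated as the sum of per-step L2 norms of the gap between conditional and unconditional scores over the first few denoising steps, and that candidates are then ranked by that value. The abstract does not give the exact norm, the number of early steps, or which end of the ranking is kept, so `keep_high` and `keep_fraction` are hypothetical parameters and the toy scores are made up.

```python
import math


def accumulated_score_difference(cond_scores, uncond_scores):
    """Sum, over the observed steps, of the L2 norm of (conditional - unconditional)."""
    total = 0.0
    for cond, uncond in zip(cond_scores, uncond_scores):
        total += math.sqrt(sum((c - u) ** 2 for c, u in zip(cond, uncond)))
    return total


def keep_indices(asd_values, keep_fraction=0.5, keep_high=True):
    """Indices of the samples that survive filtering (at least one is kept)."""
    n_keep = max(1, int(len(asd_values) * keep_fraction))
    order = sorted(range(len(asd_values)), key=lambda i: asd_values[i], reverse=keep_high)
    return sorted(order[:n_keep])


# Three candidate samples, two early steps each, 2-D toy score vectors.
zero_steps = [[0.0, 0.0], [0.0, 0.0]]
cond_per_sample = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[3.0, 4.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
asd = [accumulated_score_difference(cond, zero_steps) for cond in cond_per_sample]
survivors = keep_indices(asd, keep_fraction=0.5)
```

Because both scores are already computed at every step under classifier-free guidance, tracking this statistic adds little overhead; the savings come from not finishing the rejected trajectories.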

[79] Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching

Yexiong Lin,Yu Yao,Tongliang Liu

Main category: cs.CV

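TL;DR: Within the Flow Matching (FM) framework, proposes Model-Aligned Coupling (MAC), which optimizes the training couplings and significantly improves generation quality and efficiency.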

Details

Motivation: Early FM methods use random couplings, which cause crossing paths, while OT-based couplings built on geometric distance do not align with the model's preferences; MAC aims to resolve this. Method: MAC combines geometric distance with the model's prediction error and trains on the couplings with the lowest error, avoiding a time-consuming matching process. Result: Experiments show that MAC significantly outperforms existing methods in few-step generation. Conclusion: By optimizing couplings to align with the model, MAC improves the generation efficiency and quality of the FM framework. Abstract: Flow Matching (FM) is an effective framework for training a model to learn a vector field that transports samples from a source distribution to a target distribution. To train the model, early FM methods use random couplings, which often result in crossing paths and lead the model to learn non-straight trajectories that require many integration steps to generate high-quality samples. To address this, recent methods adopt Optimal Transport (OT) to construct couplings by minimizing geometric distances, which helps reduce path crossings. However, we observe that such geometry-based couplings do not necessarily align with the model's preferred trajectories, making it difficult to learn the vector field induced by these couplings, which prevents the model from learning straight trajectories. Motivated by this, we propose Model-Aligned Coupling (MAC), an effective method that matches training couplings based not only on geometric distance but also on alignment with the model's preferred transport directions based on its prediction error. To avoid a time-consuming matching process, MAC selects the top-$k$ fraction of couplings with the lowest error for training. Extensive experiments show that MAC significantly improves generation quality and efficiency in few-step settings compared to existing methods. Project page: https://yexionglin.github.io/mac
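The selection step can be sketched as follows, assuming the common linear interpolation path x_t = (1 - t) x0 + t x1 with target velocity x1 - x0, and a squared-error measure of how well the current model fits each candidate coupling. How candidate couplings are formed is not stated in the abstract, so the candidate list, `toy_model`, and all function names here are hypothetical stand-ins.

```python
def flow_matching_error(model, x0, x1, t):
    """Squared error between the model's velocity and the straight-line target."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = model(xt, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target))


def select_couplings(model, pairs, t, k_fraction):
    """Keep the k_fraction of candidate (x0, x1) pairs with the lowest error."""
    errors = [flow_matching_error(model, x0, x1, t) for x0, x1 in pairs]
    n_keep = max(1, int(len(pairs) * k_fraction))
    order = sorted(range(len(pairs)), key=lambda i: errors[i])
    return [pairs[i] for i in order[:n_keep]]


def toy_model(xt, t):
    """Stand-in for the learned vector field: always predicts velocity +1."""
    return [1.0 for _ in xt]


# Candidate couplings in one dimension; errors are 0, 4, 0.25 and 4 respectively.
pairs = [([0.0], [1.0]), ([0.0], [3.0]), ([2.0], [2.5]), ([1.0], [0.0])]
selected = select_couplings(toy_model, pairs, t=0.5, k_fraction=0.5)
```

Ranking by error needs only one forward pass per candidate and a sort, which is cheaper than solving an assignment problem over the batch. The selected pairs would then be used for the ordinary flow matching loss.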

[80] Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model

Reem AlJunaid,Muzammil Behzad

Main category: cs.CV

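TL;DR: KRCapVLM is a knowledge replay-based image captioning framework that combines a vision-language model, beam search decoding, and attention modules to improve caption quality and knowledge recognition.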

Details

Motivation: Captions from existing image captioning models usually lack specificity and contextual depth; KRCapVLM aims to address this. Method: Combines a vision-language model, beam search decoding, attention modules, and training schedulers to improve feature representation and training stability. Result: The model improves both knowledge recognition accuracy and caption quality, producing more informative and contextually relevant captions. Conclusion: KRCapVLM strengthens the model's ability to generate meaningful, knowledge-grounded image captions. Abstract: Generating informative and knowledge-rich image captions remains a challenge for many existing captioning models, which often produce generic descriptions that lack specificity and contextual depth. To address this limitation, we propose KRCapVLM, a novel knowledge replay-based image captioning framework built on a vision-language model. We incorporate beam search decoding to generate more diverse and coherent captions. We also integrate attention-based modules into the image encoder to enhance feature representation. Finally, we employ training schedulers to improve stability and ensure smoother convergence during training. These proposals yield substantial gains in both caption quality and knowledge recognition. Our proposed model demonstrates clear improvements in both the accuracy of knowledge recognition and the overall quality of generated captions. It shows a stronger ability to generalize to previously unseen knowledge concepts, producing more informative and contextually relevant descriptions. These results indicate the effectiveness of our approach in enhancing the model's capacity to generate meaningful, knowledge-grounded captions across a range of scenarios.

[81] VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

Yuanxin Liu,Kun Ouyang,Haoning Wu,Yi Liu,Lin Sui,Xinhao Li,Yan Zhong,Y. Charles,Xinyu Zhou,Xu Sun

Main category: cs.CV

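TL;DR: Proposes VideoReasonBench, a benchmark for vision-centric, complex video reasoning that fills the gap left by existing video reasoning tasks, which lack visual depth and reasoning complexity.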

Details

Motivation: Existing video understanding benchmarks lack the reasoning depth needed to show the benefit of long chain-of-thought reasoning, and their tasks are mostly knowledge-driven rather than driven by visual content. Method: Designs VideoReasonBench, with visually rich, highly complex videos, to evaluate three escalating levels of video reasoning: recalling visual information, inferring latent states, and predicting information beyond the video. Result: An evaluation of 18 multimodal large language models (MLLMs) finds that most perform poorly on complex video reasoning (GPT-4o reaches only 6.9% accuracy), while Gemini-2.5-Pro leads clearly with 56.0% accuracy. Conclusion: An extended thinking budget is essential for improving performance on VideoReasonBench, whereas it has limited effect on existing video benchmarks. Abstract: Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under such a task setting, models have to precisely recall multiple operations in the video and perform step-by-step reasoning to get correct final answers for these questions. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning, e.g., GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigations on "test-time scaling" further reveal that an extended thinking budget, while offering no or minimal benefits on existing video benchmarks, is essential for improving the performance on VideoReasonBench.

[82] MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification

Yang Qiao,Xiaoyu Zhong,Xiaofeng Gu,Zhiguo Yu

Main category: cs.CV

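TL;DR: Proposes a Multimodal Collaborative Fusion Network (MCFNet) for fine-grained classification, which improves accuracy through modality-specific regularization and a hybrid attention mechanism.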

Details

Motivation: Multimodal information processing is important for improving image classification, but conventional methods struggle to capture fine-grained semantic interactions. Method: MCFNet comprises a regularized fusion module with a hybrid attention mechanism and a multimodal decision classification module. Result: On benchmark datasets, MCFNet consistently improves classification accuracy. Conclusion: MCFNet effectively models subtle cross-modal semantic relationships. Abstract: Multimodal information processing has become increasingly important for enhancing image classification performance. However, the intricate and implicit dependencies across different modalities often hinder conventional methods from effectively capturing fine-grained semantic interactions, thereby limiting their applicability in high-precision classification tasks. To address this issue, we propose a novel Multimodal Collaborative Fusion Network (MCFNet) designed for fine-grained classification. The proposed MCFNet architecture incorporates a regularized integrated fusion module that improves intra-modal feature representation through modality-specific regularization strategies, while facilitating precise semantic alignment via a hybrid attention mechanism. Additionally, we introduce a multimodal decision classification module, which jointly exploits inter-modal correlations and unimodal discriminative features by integrating multiple loss functions within a weighted voting paradigm. Extensive experiments and ablation studies on benchmark datasets demonstrate that the proposed MCFNet framework achieves consistent improvements in classification accuracy, confirming its effectiveness in modeling subtle cross-modal semantics.

[83] PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening

Jeonghyeok Do,Sungpyo Kim,Geunhyuk Youk,Jaehyup Lee,Munchurl Kim

Main category: cs.CV

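TL;DR: PAN-Crafter is a new framework that tackles misalignment between PAN and MS images through modality-adaptive reconstruction and a cross-modality alignment-aware attention mechanism, significantly improving fusion quality.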

Details

Motivation: Address the misalignment between PAN and MS images caused by sensor placement, acquisition timing, and resolution differences, and avoid the spectral distortion and blurring that arise when conventional methods assume perfect alignment. Method: Proposes Modality-Adaptive Reconstruction (MARs) and Cross-Modality Alignment-Aware Attention (CM3A) to jointly reconstruct HRMS and PAN images and to align texture and structure bidirectionally. Result: Outperforms existing methods on multiple benchmark datasets with 50.11× faster inference and 0.63× the memory size, and generalizes well to unseen satellite datasets. Conclusion: By explicitly addressing cross-modality misalignment, PAN-Crafter significantly improves the accuracy and efficiency of image fusion and is broadly applicable. Abstract: PAN-sharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multi-spectral (MS) images to generate high-resolution multi-spectral (HRMS) outputs. However, cross-modality misalignment -- caused by sensor placement, acquisition timing, and resolution disparity -- poses a fundamental challenge. Conventional deep learning methods assume perfect pixel-wise alignment and rely on per-pixel reconstruction losses, leading to spectral distortion, double edges, and blurring when misalignment is present. To address this, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment gap between PAN and MS modalities. At its core, Modality-Adaptive Reconstruction (MARs) enables a single network to jointly reconstruct HRMS and PAN images, leveraging PAN's high-frequency details as auxiliary self-supervision. Additionally, we introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel mechanism that bidirectionally aligns MS texture to PAN structure and vice versa, enabling adaptive feature refinement across modalities. Extensive experiments on multiple benchmark datasets demonstrate that our PAN-Crafter outperforms the most recent state-of-the-art method in all metrics, even with 50.11$\times$ faster inference time and 0.63$\times$ the memory size. Furthermore, it demonstrates strong generalization performance on unseen satellite datasets, showing its robustness across different conditions.

[84] UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Weijia Mao,Zhenheng Yang,Mike Zheng Shou

Main category: cs.CV

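TL;DR: UniRL is a self-improving post-training method that needs no external image data: the model generates its own images as training data, and the generation and understanding tasks reinforce each other to improve performance.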

Details

Motivation: Existing multimodal large language models depend on large-scale datasets and substantial compute, and post-training methods usually require external data or are limited to specific tasks. UniRL aims to address these issues. Method: Uses self-generated images as training data and optimizes the model with supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), so that generation and understanding reinforce each other. Result: Evaluated on Show-o and Janus, UniRL reaches GenEval scores of 0.77 and 0.65, respectively. Conclusion: UniRL needs no external data, improves task performance while reducing the imbalance between tasks, and requires only a few additional training steps during post-training. Abstract: Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only several additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released at https://github.com/showlab/UniRL.

[85] VModA: An Effective Framework for Adaptive NSFW Image Moderation

Han Bao,Qinying Wang,Zhi Chen,Qingming Li,Xuhong Zhang,Changjiang Li,Zonghui Wang,Shouling Ji,Wenzhi Chen

Main category: cs.CV

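TL;DR: VModA is a general NSFW content detection framework that adapts to diverse moderation rules and handles semantically complex content, significantly improving detection accuracy.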

Details

Motivation: NSFW content is rampant on social networks, and existing detection methods struggle with complex semantics and diverse rules, so improvement is needed. Method: Proposes the VModA framework, which adapts to the rules of different platforms and regions and handles semantically complex NSFW content. Result: Experiments show that VModA improves accuracy by up to 54.3% across NSFW types and adapts well. Conclusion: VModA effectively addresses complex semantics and rule diversity in NSFW detection and has practical value. Abstract: Not Safe/Suitable for Work (NSFW) content is rampant on social networks and poses serious harm to citizens, especially minors. Current detection methods mainly rely on deep learning-based image recognition and classification. However, NSFW images are now presented in increasingly sophisticated ways, often using image details and complex semantics to obscure their true nature or attract more views. Although still understandable to humans, these images often evade existing detection methods, posing a significant threat. Further complicating the issue, varying regulations across platforms and regions create additional challenges for effective moderation, leading to detection bias and reduced accuracy. To address this, we propose VModA, a general and effective framework that adapts to diverse moderation rules and handles complex, semantically rich NSFW content across categories. Experimental results show that VModA significantly outperforms existing methods, achieving up to a 54.3% accuracy improvement across NSFW types, including those with complex semantics. Further experiments demonstrate that our method exhibits strong adaptability across categories, scenarios, and base VLMs. We also identified inconsistent and controversial label samples in public NSFW benchmark datasets, re-annotated them, and submitted corrections to the original maintainers. Two datasets have confirmed the updates so far. Additionally, we evaluate VModA in real-world scenarios to demonstrate its practical effectiveness.

[86] Robust and Annotation-Free Wound Segmentation on Noisy Real-World Pressure Ulcer Images: Towards Automated DESIGN-R® Assessment

Yun-Cheng Tsai

Main category: cs.CV

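TL;DR: Proposes a lightweight pipeline built on a YOLOv11n detector and the FUSegNet segmentation model that segments wounds across body sites using only 500 annotated bounding boxes and no fine-tuning.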

Details

Motivation: Existing models such as FUSegNet generalize poorly to non-foot wounds, so an efficient and general solution is needed. Method: Combines a YOLOv11n detector with the pre-trained FUSegNet using only 500 annotated bounding boxes, with no pixel-level annotation or fine-tuning. Result: On three wound test sets, mean IoU improves by 23 percentage points and DESIGN-R size estimation accuracy rises from 71% to 94%. Conclusion: The method segments wounds efficiently across body sites, offers a practical route to automated clinical scoring, and releases model weights to encourage adoption. Abstract: Purpose: Accurate wound segmentation is essential for automated DESIGN-R scoring. However, existing models such as FUSegNet, which are trained primarily on foot ulcer datasets, often fail to generalize to wounds on other body sites. Methods: We propose an annotation-efficient pipeline that combines a lightweight YOLOv11n-based detector with the pre-trained FUSegNet segmentation model. Instead of relying on pixel-level annotations or retraining for new anatomical regions, our method achieves robust performance using only 500 manually labeled bounding boxes. This zero fine-tuning approach effectively bridges the domain gap and enables direct deployment across diverse wound types. This is an advance not previously demonstrated in the wound segmentation literature. Results: Evaluated on three real-world test sets spanning foot, sacral, and trochanter wounds, our YOLO plus FUSegNet pipeline improved mean IoU by 23 percentage points over vanilla FUSegNet and increased end-to-end DESIGN-R size estimation accuracy from 71 percent to 94 percent (see Table 3 for details). Conclusion: Our pipeline generalizes effectively across body sites without task-specific fine-tuning, demonstrating that minimal supervision, with 500 annotated ROIs, is sufficient for scalable, annotation-light wound segmentation. This capability paves the way for real-world DESIGN-R automation, reducing reliance on pixel-wise labeling, streamlining documentation workflows, and supporting objective and consistent wound scoring in clinical practice. We will publicly release the trained detector weights and configuration to promote reproducibility and facilitate downstream deployment.
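One plausible reading of a detector-plus-segmenter pipeline is "crop to the detected box, segment the crop, paste the mask back". The sketch below illustrates that reading together with the IoU metric quoted in the results. The abstract does not say how the two models are actually coupled, so the cropping step is an assumption, `threshold_segmenter` is a toy stand-in for the segmentation network, and the image, box, and ground truth are invented.

```python
def segment_in_box(image, box, segmenter):
    """Segment only inside box = (x0, y0, x1, y1), end-exclusive, and paste
    the result into a full-size mask that is zero everywhere else."""
    x0, y0, x1, y1 = box
    crop = [row[x0:x1] for row in image[y0:y1]]
    crop_mask = segmenter(crop)
    full = [[0] * len(image[0]) for _ in range(len(image))]
    for dy, mask_row in enumerate(crop_mask):
        for dx, value in enumerate(mask_row):
            full[y0 + dy][x0 + dx] = value
    return full


def mask_iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    inter = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    return inter / union if union else 1.0


def threshold_segmenter(crop):
    """Toy stand-in for a segmentation network: threshold pixel intensity."""
    return [[1 if v > 0.5 else 0 for v in row] for row in crop]


image = [
    [0.9, 0.0, 0.0, 0.0],  # bright distractor pixel outside the wound
    [0.0, 0.8, 0.7, 0.0],
    [0.0, 0.6, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
box = (1, 1, 3, 3)  # hypothetical detector output

iou_whole = mask_iou(threshold_segmenter(image), truth)
iou_boxed = mask_iou(segment_in_box(image, box, threshold_segmenter), truth)
```

In this toy case, restricting segmentation to the detected region removes the false positive outside the box, which raises IoU from 0.8 to 1.0. That is the kind of effect a detector front end is meant to provide when the segmenter was trained on a narrower domain.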

[87] Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

Xingguang Wei,Haomin Wang,Shenglong Ye,Ruifeng Luo,Yanting Zhang,Lixin Gu,Jifeng Dai,Yu Qiao,Wenhai Wang,Hongjie Zhang

Main category: cs.CV

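TL;DR: VecFormer is a line-based method for panoptic symbol spotting in CAD drawings that addresses the high computational cost, limited generality, and loss of geometric information in existing methods.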

Details

Motivation: Existing methods for panoptic symbol spotting in CAD drawings suffer from high computational cost, limited generality, and loss of geometric information. Method: VecFormer represents the original primitives as lines, preserving geometric continuity, and introduces a Branch Fusion Refinement module that integrates instance and semantic predictions. Result: Experiments show that VecFormer reaches 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points with and without prior information, respectively. Conclusion: Line-based representation is an effective foundation for vector graphic understanding, and VecFormer offers a new approach to panoptic symbol spotting in CAD drawings. Abstract: We study the task of panoptic symbol spotting, which involves identifying both individual instances of countable things and the semantic regions of uncountable stuff in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on image rasterization, graph construction, or point-based representation, but these approaches often suffer from high computational costs, limited generality, and loss of geometric structural information. In this paper, we propose VecFormer, a novel method that addresses these challenges through line-based representation of primitives. This design preserves the geometric continuity of the original primitive, enabling more accurate shape representation while maintaining a computation-friendly structure, making it well-suited for vector graphic understanding tasks. To further enhance prediction reliability, we introduce a Branch Fusion Refinement module that effectively integrates instance and semantic predictions, resolving their inconsistencies for more coherent panoptic outputs. Extensive experiments demonstrate that our method establishes a new state-of-the-art, achieving 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results under settings with and without prior information, respectively, highlighting the strong potential of line-based representation as a foundation for vector graphic understanding.

[88] Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation

Sanggyun Ma,Wonjoon Choi,Jihun Park,Jaeyeul Kim,Seunghun Lee,Jiwan Seo,Sunghoon Im

Main category: cs.CV

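TL;DR: BriGeS is a depth estimation method that fuses geometric and semantic information, using a Bridging Gate and Attention Temperature Scaling to improve performance while reducing resource demands.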

Details

Motivation: Improve monocular depth estimation (MDE) in complex scenes by combining the complementary strengths of geometric and semantic information. Method: Builds on pre-trained foundation models, trains only the Bridging Gate, and uses Attention Temperature Scaling to balance the attention mechanism. Result: Outperforms existing methods on multiple datasets, especially in scenes with intricate structures and overlapping objects. Conclusion: By fusing geometric and semantic information efficiently, BriGeS significantly improves the generalization and performance of MDE. Abstract: We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling technique. It finely adjusts the focus of the attention mechanisms to prevent over-concentration on specific features, thus ensuring balanced performance across diverse inputs. BriGeS capitalizes on pre-trained foundation models and adopts a strategy that focuses on training only the Bridging Gate. This method significantly reduces resource demands and training time while maintaining the model's ability to generalize effectively. Extensive experiments across multiple challenging datasets demonstrate that BriGeS outperforms state-of-the-art methods in MDE for complex scenes, effectively handling intricate structures and overlapping objects.
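The abstract does not define Attention Temperature Scaling precisely. The sketch below shows the standard temperature-scaled softmax, which is one common way to keep attention from concentrating on a few features; treating it as the paper's mechanism is an assumption, and the logits and temperature values are illustrative only.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; a temperature above 1 flattens the weights."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


attention_logits = [4.0, 1.0, 0.0]
sharp = softmax_with_temperature(attention_logits, temperature=1.0)
smooth = softmax_with_temperature(attention_logits, temperature=2.0)
```

With these logits, raising the temperature from 1 to 2 lowers the largest attention weight from roughly 0.94 to roughly 0.74, so the remaining features contribute more to the fused representation.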

[89] Video Editing for Audio-Visual Dubbing

Binyamin Manela,Sharon Gannot,Ethan Fetyaya

Main category: cs.CV

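TL;DR: EdiDub is a novel visual dubbing framework that reformulates dubbing as a content-aware editing task, significantly improving identity preservation and lip synchronization.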

Details

Motivation: Current visual dubbing methods struggle to integrate seamlessly into the original scene or to preserve complex visual elements such as partial occlusions and lighting variations. Method: EdiDub reformulates visual dubbing as a content-aware editing task and uses a specialized conditioning scheme to preserve the original video context. Result: On multiple benchmarks, EdiDub improves identity preservation and synchronization, and human evaluations confirm higher synchronization and visual naturalness than existing methods. Conclusion: Content-aware editing outperforms traditional generation or inpainting at preserving complex visual elements while ensuring accurate lip synchronization. Abstract: Visual dubbing, the synchronization of facial movements with new speech, is crucial for making content accessible across different languages, enabling broader global reach. However, current methods face significant limitations. Existing approaches often generate talking faces, hindering seamless integration into original scenes, or employ inpainting techniques that discard vital visual information like partial occlusions and lighting variations. This work introduces EdiDub, a novel framework that reformulates visual dubbing as a content-aware editing task. EdiDub preserves the original video context by utilizing a specialized conditioning scheme to ensure faithful and accurate modifications rather than mere copying. On multiple benchmarks, including a challenging occluded-lip dataset, EdiDub significantly improves identity preservation and synchronization. Human evaluations further confirm its superiority, achieving higher synchronization and visual naturalness scores compared to the leading methods. These results demonstrate that our content-aware editing approach outperforms traditional generation or inpainting, particularly in maintaining complex visual elements while ensuring accurate lip synchronization.

[90] UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors

Tianhang Wang,Fan Lu,Sanqing Qu,Guo Yu,Shihang Du,Ya Wu,Yuan Huang,Guang Chen

Main category: cs.CV

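TL;DR: UrbanCraft tackles Extrapolated View Synthesis (EVS) by using hierarchical semantic-geometric representations as priors, combined with HSG-VSD to improve novel view generation.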

Details

Motivation: Existing neural rendering methods perform poorly on views outside the training camera distribution (e.g., looking left, right, or downwards), which limits the generalizability of urban reconstruction. Method: Reconstructs coarse semantic and geometric primitives from the partially observable scene, using an occupancy grid as a scene-level prior and 3D bounding boxes as instance-level priors to enhance detail. Proposes HSG-VSD, which integrates semantic and geometric constraints into score distillation sampling. Result: Qualitative and quantitative experiments validate UrbanCraft on the EVS problem. Conclusion: With hierarchical priors and HSG-VSD, UrbanCraft significantly improves extrapolated view synthesis. Abstract: Existing neural rendering-based urban scene reconstruction methods mainly focus on the Interpolated View Synthesis (IVS) setting that synthesizes from views close to the training camera trajectory. However, IVS cannot guarantee on-par performance for novel views outside the training camera distribution (e.g., looking left, right, or downwards), which limits the generalizability of the urban reconstruction application. Previous methods have optimized it via image diffusion, but they fail to handle text-ambiguous or large unseen view angles due to coarse-grained control of text-only diffusion. In this paper, we design UrbanCraft, which surmounts the Extrapolated View Synthesis (EVS) problem using hierarchical sem-geometric representations serving as additional priors. Specifically, we leverage the partially observable scene to reconstruct coarse semantic and geometric primitives, establishing a coarse scene-level prior through an occupancy grid as the base representation. Additionally, we incorporate fine instance-level priors from 3D bounding boxes to enhance object-level details and spatial relationships. Building on this, we propose the Hierarchical Semantic-Geometric-Guided Variational Score Distillation (HSG-VSD), which integrates semantic and geometric constraints from pretrained UrbanCraft2D into the score distillation sampling process, forcing the distribution to be consistent with the observable scene. Qualitative and quantitative comparisons demonstrate the effectiveness of our method on the EVS problem.

[91] Adaptive Spatial Augmentation for Semi-supervised Semantic Segmentation

Lingyan Ran,Yali Li,Tao Zhuo,Shizhou Zhang,Yanning Zhang

Main category: cs.CV

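TL;DR: The paper proposes Adaptive Spatial Augmentation (ASAug) for semi-supervised semantic segmentation (SSSS), which dynamically adjusts the augmentation strategy to improve model performance.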

Details

Motivation: Existing strong augmentation methods focus on intensity-based perturbations, which have little effect on semantic masks, while spatial augmentation is overlooked in SSSS. The work aims to validate spatial augmentation and propose an adaptive strategy. Method: Proposes Adaptive Spatial Augmentation (ASAug), which adjusts the augmentation for each image based on entropy and handles the mask inconsistency between weak and strong augmentations. Result: Experiments show that ASAug improves existing methods and achieves state-of-the-art results on PASCAL VOC 2012, Cityscapes, and COCO. Conclusion: Spatial augmentation is effective in SSSS, and the adaptive strategy further improves generalization. Abstract: In semi-supervised semantic segmentation (SSSS), data augmentation plays a crucial role in the weak-to-strong consistency regularization framework, as it enhances diversity and improves model generalization. Recent strong augmentation methods have primarily focused on intensity-based perturbations, which have minimal impact on the semantic masks. In contrast, spatial augmentations like translation and rotation have long been acknowledged for their effectiveness in supervised semantic segmentation tasks, but they are often ignored in SSSS. In this work, we demonstrate that spatial augmentation can also contribute to model training in SSSS, despite generating inconsistent masks between the weak and strong augmentations. Furthermore, recognizing the variability among images, we propose an adaptive augmentation strategy that dynamically adjusts the augmentation for each instance based on entropy. Extensive experiments show that our proposed Adaptive Spatial Augmentation (ASAug) can be integrated as a pluggable module, consistently improving the performance of existing methods and achieving state-of-the-art results on benchmark datasets such as PASCAL VOC 2012, Cityscapes, and COCO.
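
The abstract says the augmentation is adjusted per instance based on entropy but does not give the rule. The sketch below is only an illustration under stated assumptions: it computes the mean per-pixel Shannon entropy of a predicted class distribution and, as a hypothetical choice not confirmed by the source, gives confident (low-entropy) images a stronger spatial augmentation.

```python
import math

def mean_pixel_entropy(prob_map):
    """Average Shannon entropy (in nats) over per-pixel class distributions."""
    total = 0.0
    for probs in prob_map:
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prob_map)

def augmentation_strength(prob_map, num_classes, max_strength=1.0):
    """Hypothetical rule: lower entropy (more confident) -> stronger augmentation.

    The direction of this mapping is an assumption for illustration only.
    """
    normalized = mean_pixel_entropy(prob_map) / math.log(num_classes)
    return max_strength * (1.0 - normalized)

# Two toy "images", each with two pixels and three classes.
confident = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]
uncertain = [[0.34, 0.33, 0.33], [0.40, 0.30, 0.30]]
strong = augmentation_strength(confident, num_classes=3)
weak = augmentation_strength(uncertain, num_classes=3)
```

The returned value could scale, for example, a rotation angle or translation range; which spatial transform is scaled is left open here.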

[92] VITON-DRR: Details Retention Virtual Try-on via Non-rigid Registration

Ben Li,Minqi Li,Jie Ren,Kaibing Zhang

Main category: cs.CV

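TL;DR: Proposes VITON-DRR, a virtual try-on method based on non-rigid registration that uses a dual-pyramid-structured feature extractor and a deformation module to achieve more accurate garment warping and detail retention.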

Details

Motivation: Existing methods struggle to preserve details when warping garments, which produces unrealistic try-on results and limits the potential of virtual try-on in e-commerce and fashion. Method: Reconstructs a human semantic segmentation with a dual-pyramid-structured feature extractor, designs a Deformation Module that extracts garment key points and warps them with a non-rigid registration algorithm, and generates the try-on image with an Image Synthesis Module. Result: Experiments show that VITON-DRR outperforms existing methods in garment warping and detail retention. Conclusion: The method significantly improves the accuracy and detail retention of virtual try-on and has practical value. Abstract: Image-based virtual try-on aims to fit a target garment to a specific person image and has attracted extensive research attention because of its huge application potential in the e-commerce and fashion industries. To generate high-quality try-on results, accurately warping the clothing item to fit the human body plays a significant role, as slight misalignment may lead to unrealistic artifacts in the fitting image. Most existing methods warp the clothing by feature matching and thin-plate spline (TPS). However, this approach often fails to preserve clothing details due to self-occlusion, severe misalignment between poses, etc. To address these challenges, this paper proposes a detail retention virtual try-on method via accurate non-rigid registration (VITON-DRR) for diverse human poses. Specifically, we reconstruct a human semantic segmentation using a dual-pyramid-structured feature extractor. Then, a novel Deformation Module is designed for extracting the cloth key points and warping them through an accurate non-rigid registration algorithm. Finally, the Image Synthesis Module is designed to synthesize the deformed garment image and generate the human pose information adaptively. Compared with traditional methods, the proposed VITON-DRR can make the deformation of fitting images more accurate and retain more garment details. The experimental results demonstrate that the proposed method performs better than state-of-the-art methods.

[93] CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis

Runmin Jiang,Genpei Zhang,Yuntian Yang,Siqi Wu,Yuheng Zhang,Wanyue Feng,Yizhou Zhao,Xi Xiao,Xiao Wang,Tianyang Wang,Xingjian Li,Min Xu

Main category: cs.CV

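TL;DR: CryoCCD is a synthesis framework that combines biophysical modeling with generative techniques to produce multi-scale cryo-EM micrographs, addressing the shortcomings of existing methods in structural diversity and noise complexity.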

Details

Motivation: Cryo-electron microscopy (cryo-EM) offers near-atomic-resolution imaging, but the scarcity of high-quality annotated data hinders the development of robust models. Existing synthetic data generation methods struggle to capture both the structural diversity of biological specimens and the complex, spatially varying noise. Method: CryoCCD integrates biophysical modeling with generative techniques to produce multi-scale micrographs that reflect realistic biophysical variability. It generates noise with a conditional diffusion model, using cycle consistency and mask-aware contrastive learning to make the noise more realistic. Result: Experiments show that CryoCCD generates structurally accurate micrographs and outperforms existing baselines on downstream tasks such as particle picking and reconstruction. Conclusion: CryoCCD offers an effective answer to data scarcity and noise complexity in cryo-EM, improving both image generation and downstream task performance. Abstract: Cryo-electron microscopy (cryo-EM) offers near-atomic resolution imaging of macromolecules, but developing robust models for downstream analysis is hindered by the scarcity of high-quality annotated data. While synthetic data generation has emerged as a potential solution, existing methods often fail to capture both the structural diversity of biological specimens and the complex, spatially varying noise inherent in cryo-EM imaging. To overcome these limitations, we propose CryoCCD, a synthesis framework that integrates biophysical modeling with generative techniques. Specifically, CryoCCD produces multi-scale cryo-EM micrographs that reflect realistic biophysical variability through compositional heterogeneity, cellular context, and physics-informed imaging. To generate realistic noise, we employ a conditional diffusion model, enhanced by cycle consistency to preserve structural fidelity and mask-aware contrastive learning to capture spatially adaptive noise patterns. Extensive experiments show that CryoCCD generates structurally accurate micrographs and enhances performance in downstream tasks, outperforming state-of-the-art baselines in both particle picking and reconstruction.

[94] A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation

Shuzhou Sun,Li Liu,Tianpeng Liu,Shuaifeng Zhi,Ming-Ming Cheng,Janne Heikkilä,Yongxiang Liu

Main category: cs.CV

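TL;DR: The paper proposes a reverse causal framework (RcSGG) that restructures the causal chain to address bias caused by spurious correlations in existing two-stage Scene Graph Generation (SGG) frameworks.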

Details

Motivation: The causal chain structure of existing two-stage SGG frameworks can yield spurious correlations, e.g., tail relationships predicted as head ones or foreground relationships predicted as background ones. The paper aims to remove these biases. Method: Proposes RcSGG, which uses Active Reverse Estimation (ARE) to intervene on the confounder and estimate the reverse causality, and Maximum Information Sampling (MIS) to enhance the estimation. Result: Achieves state-of-the-art mean recall across multiple benchmarks and different SGG frameworks. Conclusion: By reconstructing the causal structure, RcSGG effectively mitigates spurious correlations and significantly improves SGG performance. Abstract: Existing two-stage Scene Graph Generation (SGG) frameworks typically incorporate a detector to extract relationship features and a classifier to categorize these relationships; therefore, the training paradigm follows a causal chain structure, where the detector's inputs determine the classifier's inputs, which in turn influence the final predictions. However, such a causal chain structure can yield spurious correlations between the detector's inputs and the final predictions, i.e., the prediction of a certain relationship may be influenced by other relationships. This influence can induce at least two observable biases: tail relationships are predicted as head ones, and foreground relationships are predicted as background ones; notably, the latter bias is seldom discussed in the literature. To address this issue, we propose reconstructing the causal chain structure into a reverse causal structure, wherein the classifier's inputs are treated as the confounder, and both the detector's inputs and the final predictions are viewed as causal variables. Specifically, we term the reconstructed causal paradigm as the Reverse causal Framework for SGG (RcSGG). RcSGG initially employs the proposed Active Reverse Estimation (ARE) to intervene on the confounder to estimate the reverse causality, i.e., the causality from final predictions to the classifier's inputs. Then, the Maximum Information Sampling (MIS) is suggested to enhance the reverse causality estimation further by considering the relationship information. Theoretically, RcSGG can mitigate the spurious correlations inherent in the SGG framework, subsequently eliminating the induced biases. Comprehensive experiments on popular benchmarks and diverse SGG frameworks show the state-of-the-art mean recall rate.

Runyi Li,Bin Chen,Jian Zhang,Radu Timofte

Main category: cs.CV

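TL;DR: LAFR proposes a codebook-based latent space adapter that aligns the latent distribution of low-quality images, enabling high-quality face restoration at reduced computational cost.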

Details

Motivation: Address the semantic misalignment of existing diffusion models when encoding low-quality images, while avoiding the high computational cost of retraining the VAE. Method: Aligns latent distributions with a codebook, combined with a multi-level restoration loss and lightweight finetuning of the diffusion prior. Result: Achieves high-quality, identity-preserving face restoration on synthetic and real-world datasets, with training time reduced by 70%. Conclusion: LAFR is efficient and effective for face restoration from severely degraded inputs. Abstract: Blind face restoration from low-quality (LQ) images is a challenging task that requires not only high-fidelity image reconstruction but also the preservation of facial identity. While diffusion models like Stable Diffusion have shown promise in generating high-quality (HQ) images, their VAE modules are typically trained only on HQ data, resulting in semantic misalignment when encoding LQ inputs. This mismatch significantly weakens the effectiveness of LQ conditions during the denoising process. Existing approaches often tackle this issue by retraining the VAE encoder, which is computationally expensive and memory-intensive. To address this limitation efficiently, we propose LAFR (Latent Alignment for Face Restoration), a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of HQ counterparts, enabling semantically consistent diffusion sampling without altering the original VAE. To further enhance identity preservation, we introduce a multi-level restoration loss that combines constraints from identity embeddings and facial structural priors. Additionally, by leveraging the inherent structural regularity of facial images, we show that lightweight finetuning of the diffusion prior on just 0.9% of the FFHQ dataset is sufficient to achieve results comparable to state-of-the-art methods while reducing training time by 70%. Extensive experiments on both synthetic and real-world face restoration benchmarks demonstrate the effectiveness and efficiency of LAFR, achieving high-quality, identity-preserving face reconstruction from severely degraded inputs.
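
The abstract describes the adapter only as codebook-based. As a rough illustration of what a codebook lookup does in general, the sketch below implements plain nearest-neighbor vector quantization: each latent vector is replaced by its closest codebook entry. This is a generic technique shown under that assumption, not the authors' adapter, and the codebook values are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latents, codebook):
    """Map each latent vector to the index and value of its nearest codebook entry."""
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]
    return indices, [codebook[k] for k in indices]

# Hypothetical 2-D codebook (e.g., learned from high-quality latents) and
# two latents standing in for encodings of low-quality inputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
lq_latents = [[0.2, -0.1], [3.6, 4.2]]
indices, aligned = quantize(lq_latents, codebook)
```

In this toy setup the lookup snaps off-distribution latents onto entries drawn from the target distribution, which is the intuition behind using a codebook for alignment.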

[96] Revisiting Reweighted Risk for Calibration: AURC, Focal Loss, and Inverse Focal Loss

Han Zhou,Sebastian G. Gruber,Teodora Popordanoska,Matthew B. Blaschko

Main category: cs.CV

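TL;DR: The paper revisits weighted risk functions commonly used in deep learning, connects reweighting schemes to calibration error, and proposes an AURC-based loss made differentiable with the SoftRank technique; experiments show strong calibration performance.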

Details

Motivation: Study the calibration properties of different reweighted risk functionals (such as focal loss, inverse focal loss, and AURC) and their relationship to calibration error, in order to improve model calibration. Method: Proposes a regularized AURC-based loss, made differentiable with the SoftRank technique, with added flexibility from the choice of confidence score functions (CSFs). Result: The AURC-based loss achieves competitive class-wise calibration across a range of datasets and model architectures. Conclusion: Optimizing the regularized AURC improves calibration; the reweighting strategy of inverse focal loss is better aligned with calibration, whereas focal loss is less suited to it. Abstract: Several variants of reweighted risk functionals, such as focal loss, inverse focal loss, and the Area Under the Risk-Coverage Curve (AURC), have been proposed in the literature and claims have been made in relation to their calibration properties. However, focal loss and inverse focal loss propose vastly different weighting schemes. In this paper, we revisit a broad class of weighted risk functions commonly used in deep learning and establish a principled connection between these reweighting schemes and calibration errors. We show that minimizing calibration error is closely linked to the selective classification paradigm and demonstrate that optimizing a regularized variant of the AURC naturally leads to improved calibration. This regularized AURC shares a similar reweighting strategy with inverse focal loss, lending support to the idea that focal loss is less principled when calibration is a desired outcome. Direct AURC optimization offers greater flexibility through the choice of confidence score functions (CSFs). To enable gradient-based optimization, we introduce a differentiable formulation of the regularized AURC using the SoftRank technique. Empirical evaluations demonstrate that our AURC-based loss achieves competitive class-wise calibration performance across a range of datasets and model architectures.
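
For readers unfamiliar with the AURC, the following sketch computes the usual empirical, non-differentiable version: samples are ranked by confidence, the selective risk is the mean loss over the k most confident samples, and the AURC averages that risk over all coverage levels k. It does not include the regularization or the SoftRank relaxation from the paper, and the toy inputs are hypothetical.

```python
def aurc(confidences, losses):
    """Empirical Area Under the Risk-Coverage Curve.

    Samples are ranked from most to least confident; at coverage k/n the
    selective risk is the mean loss of the top-k samples.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    running_loss = 0.0
    total_risk = 0.0
    for k, i in enumerate(order, start=1):
        running_loss += losses[i]
        total_risk += running_loss / k  # selective risk at coverage k/n
    return total_risk / len(order)

# 0/1 losses for four samples; the first ranking puts errors last, the second first.
losses = [0.0, 0.0, 1.0, 1.0]
good_ranking = aurc([0.9, 0.8, 0.6, 0.3], losses)
bad_ranking = aurc([0.3, 0.6, 0.8, 0.9], losses)
```

A confidence score that ranks the misclassified samples last yields a lower AURC than one that ranks them first, which is why the quantity depends on the chosen confidence score function.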

[97] A Divide-and-Conquer Approach for Global Orientation of Non-Watertight Scene-Level Point Clouds Using 0-1 Integer Optimization

Zhuodong Li,Fei Hou,Wencheng Wang,Xuequan Lu,Ying He

Main category: cs.CV

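TL;DR: DACPO proposes a divide-and-conquer approach to orienting large-scale non-watertight point clouds, combining block-wise processing with global optimization for efficient and robust orientation.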

Details

Motivation: Existing methods mainly target watertight, object-level 3D models, while orienting large-scale, non-watertight 3D scenes remains underexplored; DACPO aims to fill this gap. Method: DACPO splits the point cloud into blocks, estimates initial normal orientations with a randomized greedy method, refines them with an adapted iterative Poisson surface reconstruction, and integrates the results through a graph model and global optimization. Result: Experiments show that DACPO performs strongly on large-scale, non-watertight scenes and outperforms existing methods. Conclusion: DACPO provides an efficient and robust solution for orienting large-scale non-watertight point clouds. Abstract: Orienting point clouds is a fundamental problem in computer graphics and 3D vision, with applications in reconstruction, segmentation, and analysis. While significant progress has been made, existing approaches mainly focus on watertight, object-level 3D models. The orientation of large-scale, non-watertight 3D scenes remains an underexplored challenge. To address this gap, we propose DACPO (Divide-And-Conquer Point Orientation), a novel framework that leverages a divide-and-conquer strategy for scalable and robust point cloud orientation. Rather than attempting to orient an unbounded scene at once, DACPO segments the input point cloud into smaller, manageable blocks, processes each block independently, and integrates the results through a global optimization stage. For each block, we introduce a two-step process: estimating initial normal orientations by a randomized greedy method and refining them by an adapted iterative Poisson surface reconstruction. To achieve consistency across blocks, we model inter-block relationships using an undirected graph, where nodes represent blocks and edges connect spatially adjacent blocks. To reliably evaluate orientation consistency between adjacent blocks, we introduce the concept of the visible connected region, which defines the region over which visibility-based assessments are performed. The global integration is then formulated as a 0-1 integer-constrained optimization problem, with block flip states as binary variables. Despite the combinatorial nature of the problem, DACPO remains scalable by limiting the number of blocks (typically a few hundred for 3D scenes) involved in the optimization. Experiments on benchmark datasets demonstrate DACPO's strong performance, particularly in challenging large-scale, non-watertight scenarios where existing methods often fail. The source code is available at https://github.com/zd-lee/DACPO.
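
To make the 0-1 formulation concrete, here is a toy sketch under stated assumptions: each block has a binary flip variable, each graph edge carries a hypothetical consistency weight (positive if the two blocks currently appear to agree, negative if they appear opposed), and the best flip assignment is found by exhaustive search. The weights, the objective, and the brute-force search are illustrative stand-ins; the paper derives consistency from its visible connected region and would need a real integer solver at a few hundred blocks.

```python
from itertools import product

def best_flips(num_blocks, edges):
    """Exhaustively search block flip states (0 = keep, 1 = flip).

    edges: list of (i, j, w), where w > 0 means blocks i and j currently agree
    and w < 0 means they disagree (hypothetical weights). The score rewards
    agreement after flipping: sum of w * s_i * s_j with s = 1 - 2 * flip.
    """
    best_score, best_state = float("-inf"), None
    for tail in product((0, 1), repeat=num_blocks - 1):
        state = (0,) + tail  # fix block 0 to remove the global sign ambiguity
        signs = [1 - 2 * x for x in state]
        score = sum(w * signs[i] * signs[j] for i, j, w in edges)
        if score > best_score:
            best_score, best_state = score, state
    return best_state, best_score

# Three blocks: block 0 disagrees with blocks 1 and 2, which agree with each other.
edges = [(0, 1, -1.0), (1, 2, 1.0), (0, 2, -0.5)]
state, score = best_flips(3, edges)
```

Here flipping blocks 1 and 2 together makes every edge consistent, so the search returns that assignment.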

[98] TimePoint: Accelerated Time Series Alignment via Self-Supervised Keypoint and Descriptor Learning

Ron Shapira Weber,Shahar Ben Ishay,Andrey Lavrinenko,Shahaf E. Finder,Oren Freifeld

Main category: cs.CV

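TL;DR: TimePoint is a self-supervised method that learns keypoints and descriptors from synthetic data, substantially accelerating DTW alignment while improving accuracy.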

Details

Motivation: Dynamic Time Warping (DTW) scales poorly and is sensitive to noise in time series alignment, so a more efficient and accurate method is needed. Method: TimePoint generates synthetic data with 1D diffeomorphisms, extracts keypoints and descriptors with fully convolutional and wavelet convolutional architectures, and applies DTW to the resulting sparse representations. Result: TimePoint is faster and more accurate than standard DTW and generalizes to real time series when trained only on synthetic data. Conclusion: TimePoint offers a scalable, efficient solution for time-series analysis, and the code is open source. Abstract: Fast and scalable alignment of time series is a fundamental challenge in many domains. The standard solution, Dynamic Time Warping (DTW), struggles with poor scalability and sensitivity to noise. We introduce TimePoint, a self-supervised method that dramatically accelerates DTW-based alignment while typically improving alignment accuracy by learning keypoints and descriptors from synthetic data. Inspired by 2D keypoint detection but carefully adapted to the unique challenges of 1D signals, TimePoint leverages efficient 1D diffeomorphisms, which effectively model nonlinear time warping, to generate realistic training data. This approach, along with fully convolutional and wavelet convolutional architectures, enables the extraction of informative keypoints and descriptors. Applying DTW to these sparse representations yields major speedups and typically higher alignment accuracy than standard DTW applied to the full signals. TimePoint demonstrates strong generalization to real-world time series when trained solely on synthetic data, and further improves with fine-tuning on real data. Extensive experiments demonstrate that TimePoint consistently achieves faster and more accurate alignments than standard DTW, making it a scalable solution for time-series analysis. Our code is available at https://github.com/BGU-CS-VIL/TimePoint
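
The speedup argument can be illustrated with a small sketch: classic DTW fills an n-by-m cost table, so running it on a handful of keypoints instead of every sample shrinks the table. The keypoint detector below (endpoints plus strict local extrema) is a hand-written stand-in chosen for illustration; TimePoint learns its keypoints and descriptors, which this sketch does not attempt to reproduce.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def local_extrema(signal):
    """Stand-in keypoint detector: endpoints plus strict local maxima and minima."""
    if len(signal) < 2:
        return list(range(len(signal)))
    keep = [0]
    for t in range(1, len(signal) - 1):
        left, mid, right = signal[t - 1], signal[t], signal[t + 1]
        if (mid > left and mid > right) or (mid < left and mid < right):
            keep.append(t)
    keep.append(len(signal) - 1)
    return keep

# y is x with its first sample repeated, i.e. a simple time warp of x.
x = [0, 1, 2, 3, 2, 1, 0, 1, 2]
y = [0, 0, 1, 2, 3, 2, 1, 0, 1, 2]
kx = [x[t] for t in local_extrema(x)]
ky = [y[t] for t in local_extrema(y)]
full_cost = dtw_distance(x, y)
sparse_cost = dtw_distance(kx, ky)
```

In this toy case both alignments have zero cost, but the sparse one fills a 4-by-4 table instead of a 9-by-10 one.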

[99] PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views

Mohamed Rayan Barhdadi,Hasan Kurban,Hussein Alnuweiri

Main category: cs.CV

TL;DR: PhysicsNeRF improves NeRF with four physics-guided constraints, achieving better 3D reconstruction from sparse views.

Details

Motivation: Address the poor performance of standard NeRF under sparse views and improve the physical consistency and generalization of 3D reconstruction. Method: Combines four constraints (depth ranking, RegNeRF-style consistency, sparsity priors, and cross-view alignment) with a compact 0.67M-parameter architecture. Result: Reaches 21.4 dB average PSNR with only 8 views, outperforming prior methods, and reveals a 5.7-6.2 dB generalization gap in sparse-view reconstruction. Conclusion: PhysicsNeRF offers a new direction for physically consistent 3D representations and clarifies the expressiveness-generalization trade-off in constrained NeRF models. Abstract: PhysicsNeRF is a physically grounded framework for 3D reconstruction from sparse views, extending Neural Radiance Fields with four complementary constraints: depth ranking, RegNeRF-style consistency, sparsity priors, and cross-view alignment. While standard NeRFs fail under sparse supervision, PhysicsNeRF employs a compact 0.67M-parameter architecture and achieves 21.4 dB average PSNR using only 8 views, outperforming prior methods. A generalization gap of 5.7-6.2 dB is consistently observed and analyzed, revealing fundamental limitations of sparse-view reconstruction. PhysicsNeRF enables physically consistent, generalizable 3D representations for agent interaction and simulation, and clarifies the expressiveness-generalization trade-off in constrained NeRF models.
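
Since the results are reported in dB of PSNR, a short sketch of the standard definition may help in reading them: PSNR is 10 times the base-10 logarithm of the squared peak value divided by the mean squared error. The pixel values below are toy numbers assumed to lie in [0, 1]; they are not from the paper.

```python
import math

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: every pixel is off by 0.1, so the MSE is 0.01.
value_db = psnr([0.5, 0.5], [0.4, 0.6])
```

Because the scale is logarithmic, a 6 dB gap like the one reported between training and held-out views corresponds to roughly a fourfold difference in mean squared error.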

[100] VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation

Shi-Xue Zhang,Hongfa Wang,Duojun Huang,Xin Li,Xiaobin Zhu,Xu-Cheng Yin

Main category: cs.CV

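TL;DR: The paper presents VCapsBench, the first large-scale fine-grained benchmark for evaluating video captions, with 5K+ videos and 100K+ QA pairs, aimed at improving the quality of text-to-video generation.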

Details

Motivation: Existing benchmarks fall short on fine-grained evaluation, especially of spatial-temporal details, which affects the semantic coherence and visual fidelity of generated videos. Method: Introduces the VCapsBench benchmark with 5,677 videos and 109,796 QA pairs annotated across 21 fine-grained dimensions, and proposes three metrics (AR, IR, CR) together with an LLM-based automated evaluation pipeline. Result: Caption quality is verified through contrastive QA-pair analysis, yielding actionable suggestions for caption optimization. Conclusion: VCapsBench helps advance robust text-to-video models; the dataset and code are open source. Abstract: Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA-pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics (Accuracy (AR), Inconsistency Rate (IR), Coverage Rate (CR)), and an automated evaluation pipeline leveraging a large language model (LLM) to verify caption quality via contrastive QA-pairs analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and codes are available at https://github.com/GXYM/VCapsBench.

[101] R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation

Kaijie Chen,Zihao Lin,Zhiyang Xu,Ying Shen,Yuguang Yao,Joy Rimchala,Jiaxin Zhang,Lifu Huang

Main category: cs.CV

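TL;DR: R2I-Bench is a benchmark dedicated to evaluating reasoning in text-to-image generation. It covers multiple categories of reasoning tasks and introduces a fine-grained metric, R2IScore. Experiments show that current models still fall short in reasoning.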

Details

Motivation: Existing text-to-image models are weak at reasoning and lack systematic evaluation, so a dedicated reasoning benchmark is needed. Method: Designs the R2I-Bench benchmark, covering several reasoning categories (commonsense, mathematical, logical, and others), and develops the R2IScore metric, which uses question answering to assess text-image alignment, reasoning accuracy, and image quality. Result: An evaluation of 16 representative models shows generally limited reasoning ability, indicating the need for stronger reasoning-aware architectures. Conclusion: R2I-Bench is an important tool for evaluating and improving reasoning in text-to-image models; more robust reasoning-aware models are needed. Abstract: Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation, e.g., generating "a bitten apple that has been left in the air for more than a week" necessitates understanding temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises meticulously curated data instances, spanning core reasoning categories, including commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2IScore, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality. Extensive experiments with 16 representative T2I models, including a strong pipeline-based framework that decouples reasoning and generation using the state-of-the-art language and image generation models, demonstrate consistently limited reasoning performance, highlighting the need for more robust, reasoning-aware architectures in the next generation of T2I systems. Project Page: https://r2i-bench.github.io

[102] VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning

Liyun Zhu,Qixiang Chen,Xi Shen,Xiaodong Cun

Main category: cs.CV

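TL;DR: VAU-R1 is a data-efficient framework built on multimodal large language models that improves video anomaly reasoning through reinforcement fine-tuning; the paper also proposes VAU-Bench, the first Chain-of-Thought benchmark for this task.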

Details

Motivation: Video Anomaly Understanding (VAU) is essential for applications such as smart cities and security surveillance, but existing methods lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. Method: Proposes the VAU-R1 framework, which uses multimodal large language models and Reinforcement Fine-Tuning (RFT) to enhance anomaly reasoning, and builds the VAU-Bench benchmark with multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Result: Experiments show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence. Conclusion: VAU-R1 and VAU-Bench lay a foundation for interpretable, reasoning-aware video anomaly understanding. Abstract: Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). In addition, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at https://github.com/GVCLab/VAU-R1.

[103] OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data

Fengxiang Wang,Mingshuo Chen,Xuming He,YiFan Zhang,Feng Liu,Zijie Guo,Zhenghao Hu,Jiong Wang,Jingyi Xu,Zhangrui Li,Fenghua Ling,Ben Fei,Weijia Li,Long Lan,Wenjing Yang,Wenlong Zhang,Lei Bai

Main category: cs.CV

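TL;DR: OmniEarth-Bench is a comprehensive multimodal benchmark for Earth science that covers six Earth science spheres and their interactions, with 100 expert-curated evaluation dimensions, designed to address the limitations of existing benchmarks.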

Details

Motivation: Existing benchmarks for Earth science multimodal learning are limited in coverage and evaluation dimensions and cannot fully reflect the complexity of the Earth system. Method: Uses satellite sensor and in-situ observational data to assemble 29,779 annotations across four tiers (perception, general reasoning, scientific knowledge reasoning, and chain-of-thought reasoning), with expert and crowd-sourced collaboration to reduce label ambiguity. Result: Experiments show that even advanced MLLMs perform poorly on the benchmark, with the best accuracy below 35%, and GPT-4o dropping to 0% on some cross-sphere tasks. Conclusion: OmniEarth-Bench sets a new standard for Earth science AI and advances scientific discovery and practical applications in environmental monitoring; the data and models have been released. Abstract: Existing benchmarks for Earth science multimodal learning exhibit critical limitations in systematic coverage of geosystem components and cross-sphere interactions, often constrained to isolated subsystems (only in Human-activities sphere or atmosphere) with limited evaluation dimensions (less than 16 tasks). To address these gaps, we introduce OmniEarth-Bench, the first comprehensive multimodal benchmark spanning all six Earth science spheres (atmosphere, lithosphere, Oceansphere, cryosphere, biosphere and Human-activities sphere) and cross-spheres with one hundred expert-curated evaluation dimensions. Leveraging observational data from satellite sensors and in-situ measurements, OmniEarth-Bench integrates 29,779 annotations across four tiers: perception, general reasoning, scientific knowledge reasoning and chain-of-thought (CoT) reasoning. This involves the efforts of 2-5 experts per sphere to establish authoritative evaluation dimensions and curate relevant observational datasets, 40 crowd-sourcing annotators to assist experts for annotations, and finally, OmniEarth-Bench is validated via hybrid expert-crowd workflows to reduce label ambiguity. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35% accuracy. Especially, in some cross-spheres tasks, the performance of leading models like GPT-4o drops to 0.0%. OmniEarth-Bench sets a new standard for geosystem-aware AI, advancing both scientific discovery and practical applications in environmental monitoring and disaster prediction. The dataset, source code, and trained models have been released.

[104] CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization

Rui Xia,Dan Jiang,Quan Zhang,Ke Zhang,Chun Yuan

Main category: cs.CV

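TL;DR: Proposes a CLIP-based cross-modal method for unsupervised temporal action localization that uses vision-language pre-training and audio perception to enrich contextual boundary information, with no additional annotations.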

Details

Motivation: Existing methods rely on labeled data, their visual features focus too heavily on highly discriminative regions, and they lack multimodal information. Method: Combines vision-language pre-training with classification pre-training for collaborative enhancement, adds audio perception, and adopts a self-supervised cross-view learning paradigm. Result: Outperforms existing methods on two public datasets. Conclusion: The method effectively addresses feature over-focus and contextual boundary ambiguity in unsupervised temporal action localization. Abstract: Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model's superiority over several state-of-the-art competitors.

[105] Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation

Jiahao Cui,Yan Chen,Mingwang Xu,Hanlin Shang,Yuxuan Chen,Yun Zhan,Zilong Dong,Yao Yao,Jingdong Wang,Siyu Zhu

Main category: cs.CV

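TL;DR: Proposes a human-preference-aligned diffusion framework for generating highly dynamic, photorealistic portrait animation, addressing lip synchronization, natural expressions, and high-fidelity body motion dynamics.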

Details

Motivation: Generating highly dynamic, photorealistic portrait animation remains challenging with respect to lip synchronization, natural expressions, and body motion dynamics. Method: Uses human preference optimization and temporal motion modulation, which reshapes motion conditions into dimensionally aligned latent features and preserves high-frequency motion details. Result: Experiments show improvements over baselines in lip-audio synchronization, expression vividness, and body motion coherence, along with notable gains in human preference metrics. Conclusion: The proposed framework performs well in portrait animation generation; the code and model are open source. Abstract: Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet and DiT-based portrait diffusion approaches, and experiments demonstrate obvious improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: https://github.com/xyz123xyz456/hallo4.

[106] Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications

Jan Ignatowicz,Krzysztof Kutt,Grzegorz J. Nalepa

Main category: cs.CV

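TL;DR: Proposes the Metadata Enrichment Model (MEM), which combines neural networks with semantic technologies to improve the accessibility and interoperability of digitized cultural heritage collections.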

Details

Motivation: Insufficient metadata for digitized cultural heritage limits accessibility and cross-institutional collaboration, and existing visual analysis models adapt poorly to specific domains such as manuscripts. Method: Proposes the MEM framework, which combines computer vision models, large language models (LLMs), and knowledge graphs, and uses a Multilayer Vision Mechanism (MVM) to dynamically detect nested features. Result: Demonstrates MEM's potential on a dataset of digitized incunabula from the Jagiellonian Digital Library and releases a manually annotated dataset of 105 manuscript pages. Conclusion: MEM offers a flexible and extensible methodology for cultural heritage research and illustrates the practical potential of artificial intelligence combined with semantic technologies. Abstract: The digitization of cultural heritage collections has opened new directions for research, yet the lack of enriched metadata poses a substantial challenge to accessibility, interoperability, and cross-institutional collaboration. In the past several years, neural network models such as YOLOv11 and Detectron2 have revolutionized visual data analysis, but their application to domain-specific cultural artifacts, such as manuscripts and incunabula, remains limited by the absence of methodologies that address structural feature extraction and semantic interoperability. In this position paper, we argue that the integration of neural networks with semantic technologies represents a paradigm shift in cultural heritage digitization processes. We present the Metadata Enrichment Model (MEM), a conceptual framework designed to enrich metadata for digitized collections by combining fine-tuned computer vision models, large language models (LLMs) and structured knowledge graphs. The Multilayer Vision Mechanism (MVM) is the key innovation of MEM. This iterative process improves visual analysis by dynamically detecting nested features, such as text within seals or images within stamps. To demonstrate MEM's potential, we apply it to a dataset of digitized incunabula from the Jagiellonian Digital Library and release a manually annotated dataset of 105 manuscript pages. We examine the practical challenges of MEM's usage in real-world GLAM institutions, including the need for domain-specific fine-tuning, the alignment of enriched metadata with Linked Data standards, and computational costs. We present MEM as a flexible and extensible methodology. This paper contributes to the discussion on how artificial intelligence and semantic web technologies can advance cultural heritage research and how these technologies can be used in practice.

[107] Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information

Xu Chu,Xinrong Chen,Guanyu Wang,Zhijie Tan,Kui Huang,Wenyu Lv,Tong Mo,Weiping Li

Main category: cs.CV

TL;DR: Qwen-LA introduces a vision-text reflection process that reduces hallucinations in vision-language reasoning models and improves performance.

Details

Motivation: Long reasoning chains cause visual information to be neglected, which triggers hallucinations, and text-only reflection is not enough to solve the problem. Method: Proposes the BRPO reinforcement learning method and combines it with Visual Token COPY and Visual Token ROUTE to force the model to re-attend to visual information. Result: Achieves strong results on multiple visual QA datasets while reducing hallucinations. Conclusion: Qwen-LA's vision-text reflection effectively strengthens visual attention and reduces hallucinations. Abstract: Inference time scaling drives extended reasoning to enhance the performance of Vision-Language Models (VLMs), thus forming powerful Vision-Language Reasoning Models (VLRMs). However, long reasoning dilutes visual tokens, which causes visual information to receive less attention and may trigger hallucinations. Although introducing text-only reflection processes shows promise in language models, we demonstrate that it is insufficient to suppress hallucinations in VLMs. To address this issue, we introduce Qwen-LookAgain (Qwen-LA), a novel VLRM designed to mitigate hallucinations by incorporating a vision-text reflection process that guides the model to re-attend to visual information during reasoning. We first propose a reinforcement learning method, Balanced Reflective Policy Optimization (BRPO), which guides the model to decide on its own when to generate vision-text reflection and to balance the number and length of reflections. Then, we formally prove that VLRMs lose attention to visual tokens as reasoning progresses, and demonstrate that supplementing visual information during reflection enhances visual attention. Therefore, during training and inference, Visual Token COPY and Visual Token ROUTE are introduced to force the model to re-attend to visual information at the visual level, addressing the limitations of text-only reflection. Experiments on multiple visual QA datasets and hallucination metrics indicate that Qwen-LA achieves leading accuracy while reducing hallucinations. Our code is available at: https://github.com/Liar406/Look_Again.

[108] Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition

Yu Li,Jin Jiang,Jianhua Zhu,Shuai Peng,Baole Wei,Yuxuan Zhou,Liangcai Gao

Main category: cs.CV

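TL;DR: Uni-MuMER fine-tunes a pretrained vision-language model (VLM) for handwritten mathematical expression recognition (HMER), using three data-driven tasks to improve performance and reaching state-of-the-art results on the CROHME and HME100K datasets.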

Details

Motivation: HMER is challenging because of free-form symbol layout and highly variable handwriting styles, and existing methods are hard to integrate into a unified framework. The cross-task generalization of pretrained VLMs makes a unified solution possible. Method: Uni-MuMER fully fine-tunes a VLM without modifying its architecture and integrates three tasks: Tree-CoT (structured spatial reasoning), EDL (reducing confusion among similar characters), and SC (improving consistency on long expressions). Result: On CROHME and HME100K, Uni-MuMER surpasses SSAN by 16.31% and Gemini2.5-flash by 24.42% in the zero-shot setting, reaching a new state of the art. Conclusion: By fine-tuning a VLM and integrating data-driven tasks, Uni-MuMER provides an efficient, unified solution for HMER with substantially improved performance. Abstract: Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER

[109] Weakly-supervised Localization of Manipulated Image Regions Using Multi-resolution Learned Features

Ziyong Wang,Charith Abhayaratne

Main category: cs.CV

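TL;DR: Proposes a weakly-supervised image manipulation localization method that combines an image-level detection network with pre-trained segmentation models to localize manipulated regions without pixel-level annotations.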

Details

Motivation: The spread of digital images and editing tools makes manipulation detection a critical challenge; existing methods fall short on interpretability and localization, and pixel-level annotations are often unavailable. Method: Builds on WCBnet to generate multi-view feature maps, combines them with pre-trained segmentation models (such as DeepLab), and refines the localization with Bayesian inference. Result: Experiments show that the method localizes manipulated regions effectively without relying on pixel-level labels. Conclusion: Weakly-supervised manipulation localization is feasible and addresses the shortage of annotations. Abstract: The explosive growth of digital images and the widespread availability of image editing tools have made image manipulation detection an increasingly critical challenge. Although current deep learning-based manipulation detection methods achieve high image-level classification accuracy, they often fall short in terms of interpretability and localization of manipulated regions. Additionally, the absence of pixel-wise annotations in real-world scenarios limits the existing fully-supervised manipulation localization techniques. To address these challenges, we propose a novel weakly-supervised approach that integrates activation maps generated by image-level manipulation detection networks with segmentation maps from pre-trained models. Specifically, we build on our previous image-level work, WCBnet, to produce multi-view feature maps which are subsequently fused for coarse localization. These coarse maps are then refined using detailed segmented regional information provided by pre-trained segmentation models (such as DeepLab, SegmentAnything and PSPnet), with Bayesian inference employed to enhance the manipulation localization. Experimental results demonstrate the effectiveness of our approach, highlighting the feasibility of localizing image manipulations without relying on pixel-level labels.

[110] Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

Zifu Wang,Junyi Zhu,Bo Tang,Zhiyu Li,Feiyu Xiong,Jiaqian Yu,Matthew B. Blaschko

Main category: cs.CV

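TL;DR: Studies rule-based visual reinforcement learning (RL) for multimodal large language models (MLLMs) using jigsaw puzzles as the experimental framework. Fine-tuned MLLMs improve markedly and generalize to complex tasks, and RL generalizes more effectively than supervised fine-tuning (SFT).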

Details

Motivation: To explore how MLLMs perform on visual tasks, and in particular the potential of rule-based RL in multimodal learning. Method: Uses jigsaw puzzles as the experimental framework, compares RL with SFT, and analyzes the generalization ability and reasoning patterns of MLLMs. Result: After fine-tuning, MLLMs reach near-perfect accuracy on jigsaw puzzles and generalize to other visual tasks; RL outperforms SFT, and an initial SFT phase can hinder subsequent RL optimization. Conclusion: The study offers useful insight into rule-based visual RL and shows its potential in multimodal learning, although results may vary by task. Abstract: The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL using jigsaw puzzles as a structured experimental framework, revealing several key findings. First, we find that MLLMs, initially performing near random guessing on simple puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. Second, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Third, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. Fourth, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. Finally, our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece of jigsaw to the larger puzzle of collective understanding of rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.
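Rule-based RL relies on a reward that can be checked mechanically rather than learned. The abstract does not spell out the reward used for the jigsaw task, so the sketch below is only a hypothetical illustration of the idea: the model is assumed to output an ordering of piece indices, and the function returns an exact-match reward plus a partial-credit score. The name `jigsaw_reward` and the two-part return value are assumptions, not the authors' design.

```python
def jigsaw_reward(predicted, target):
    """Return (exact, partial) rewards for a predicted piece ordering.

    `predicted` and `target` are lists of piece indices. An answer that is
    not a permutation of the target pieces is treated as a format
    violation and receives zero on both scores.
    """
    if not target or sorted(predicted) != sorted(target):
        return 0.0, 0.0
    # Count positions where the predicted piece matches the ground truth.
    correct = sum(p == t for p, t in zip(predicted, target))
    exact = 1.0 if correct == len(target) else 0.0
    return exact, correct / len(target)
```

For example, `jigsaw_reward([0, 2, 1, 3], [2, 0, 1, 3])` places two of four pieces correctly and returns `(0.0, 0.5)`.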

[111] DeepChest: Dynamic Gradient-Free Task Weighting for Effective Multi-Task Learning in Chest X-ray Classification

Youssef Mohamed,Noran Mohamed,Khaled Abouhashad,Feilong Tang,Sara Atito,Shoaib Jameel,Imran Razzak,Ahmed B. Zaky

Main category: cs.CV

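TL;DR: DeepChest is a dynamic task-weighting framework for multi-label chest X-ray classification that uses a performance-driven weighting mechanism to improve efficiency and accuracy.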

Details

Motivation: Multi-task learning (MTL) has advantages in domains such as medical imaging, but balancing task contributions remains a challenge. Method: DeepChest adjusts task weights dynamically by analyzing task-specific loss trends, requires no gradient access, and thereby significantly reduces memory usage and speeds up training. Result: On a large-scale CXR dataset, DeepChest improves overall accuracy by 7% over existing MTL methods and substantially reduces individual task losses. Conclusion: DeepChest offers a more efficient and robust approach to deep learning for medical diagnosis. Abstract: While Multi-Task Learning (MTL) offers inherent advantages in complex domains such as medical imaging by enabling shared representation learning, effectively balancing task contributions remains a significant challenge. This paper addresses this critical issue by introducing DeepChest, a novel, computationally efficient and effective dynamic task-weighting framework specifically designed for multi-label chest X-ray (CXR) classification. Unlike existing heuristic or gradient-based methods that often incur substantial overhead, DeepChest leverages a performance-driven weighting mechanism based on effective analysis of task-specific loss trends. Given a network architecture (e.g., ResNet18), our model-agnostic approach adaptively adjusts task importance without requiring gradient access, thereby significantly reducing memory usage and achieving a threefold increase in training speed. It can be easily applied to improve various state-of-the-art methods. Extensive experiments on a large-scale CXR dataset demonstrate that DeepChest not only outperforms state-of-the-art MTL methods by 7% in overall accuracy but also yields substantial reductions in individual task losses, indicating improved generalization and effective mitigation of negative transfer. The efficiency and performance gains of DeepChest pave the way for more practical and robust deployment of deep learning in critical medical diagnostic applications. The code is publicly available at https://github.com/youssefkhalil320/DeepChest-MTL
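To make the idea of gradient-free weighting from loss trends concrete, here is a minimal sketch in the style of Dynamic Weight Average, a different and older heuristic that also needs only per-task loss values. It is not DeepChest's update rule, which the abstract does not specify. The function name and the default `temperature` are illustrative assumptions.

```python
import math


def trend_based_weights(prev_losses, curr_losses, temperature=2.0):
    """Compute task weights from the ratio of consecutive epoch losses.

    A task whose loss is falling slowly (ratio close to or above 1) gets
    a larger weight than a task whose loss is falling quickly. Only loss
    values are used, so no gradient access is required. Previous losses
    are assumed to be strictly positive.
    """
    ratios = [curr / prev for prev, curr in zip(prev_losses, curr_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    n_tasks = len(ratios)
    # Softmax over the ratios, rescaled so the weights sum to n_tasks.
    return [n_tasks * e / total for e in exps]
```

With two tasks whose losses moved from 1.0 to 0.5 and from 1.0 to 0.9, the second task improves more slowly and therefore receives the larger weight, while the two weights still sum to 2.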

[112] Bridging Classical and Modern Computer Vision: PerceptiveNet for Tree Crown Semantic Segmentation

Georgios Voulgaris

Main category: cs.CV

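TL;DR: PerceptiveNet is a new deep learning model that combines a Log-Gabor convolutional layer with a wide-receptive-field backbone to improve the accuracy of tree crown semantic segmentation, and it performs well on multiple datasets.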

Details

Motivation: Accurate tree crown semantic segmentation is crucial for forest management, biodiversity studies, and carbon sequestration quantification, but traditional methods and existing deep learning models struggle with the complexity of forest canopies. Method: Proposes PerceptiveNet, which combines a Log-Gabor-parameterised convolutional layer with a wide-receptive-field backbone to extract salient features and capture context. Experiments compare the effects of different convolutional layers and evaluate the model as a backbone within a hybrid CNN-Transformer model. Result: PerceptiveNet outperforms existing state-of-the-art models on a tree crown dataset and generalizes well to benchmark datasets of varying complexity. Conclusion: Through its convolutional layer design and wide-receptive-field backbone, PerceptiveNet substantially improves tree crown semantic segmentation accuracy and has broad application potential. Abstract: The accurate semantic segmentation of tree crowns within remotely sensed data is crucial for scientific endeavours such as forest management, biodiversity studies, and carbon sequestration quantification. However, precise segmentation remains challenging due to complexities in the forest canopy, including shadows, intricate backgrounds, scale variations, and subtle spectral differences among tree species. Compared to traditional methods, Deep Learning models improve accuracy by extracting informative and discriminative features, but often fall short in capturing the aforementioned complexities. To address these challenges, we propose PerceptiveNet, a novel model incorporating a Logarithmic Gabor-parameterised convolutional layer with trainable filter parameters, alongside a backbone that extracts salient features while capturing extensive context and spatial information through a wider receptive field. We investigate the impact of Log-Gabor, Gabor, and standard convolutional layers on semantic segmentation performance through extensive experimentation. Additionally, we conduct an ablation study to assess the contributions of individual layers and their combinations to overall model performance, and we evaluate PerceptiveNet as a backbone within a novel hybrid CNN-Transformer model. Our results outperform state-of-the-art models, demonstrating significant performance improvements on a tree crown dataset while generalising across domains, including two benchmark aerial scene semantic segmentation datasets with varying complexities.

Shengyuan Liu,Boyun Zheng,Wenting Chen,Zhihao Peng,Zhenfei Yin,Jing Shao,Jiancong Hu,Yixuan Yuan

Main category: cs.CV

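TL;DR: EndoBench is a comprehensive benchmark for evaluating the multi-dimensional capabilities of multi-modal large language models (MLLMs) in endoscopic practice. It covers a range of scenarios and tasks and reveals the gap between current models and clinical experts.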

Details

Motivation: Existing benchmarks are limited to specific endoscopic scenarios and a small number of clinical tasks, so they fail to reflect real-world diversity and the full demands of clinical workflows. Method: EndoBench comprises 4 endoscopic scenarios, 12 clinical tasks with 12 subtasks, and 5 levels of visual prompting granularity, for a total of 6,832 validated VQA pairs, and uses a multi-dimensional evaluation framework. Result: Experiments show that proprietary MLLMs outperform open-source and medical-specialized models but still trail human experts; medical-domain supervised fine-tuning substantially improves task accuracy; and model performance is sensitive to prompt format and task complexity. Conclusion: EndoBench sets a new standard for evaluating and advancing MLLMs in endoscopy, and the benchmark and code are publicly released. Abstract: Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited, as they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic scenarios and the full range of skills needed in clinical workflows. To address these issues, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities. EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets. Our multi-dimensional evaluation framework mirrors the clinical workflow (spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations) to holistically gauge the perceptual and diagnostic abilities of MLLMs in realistic scenarios. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs, and establish human clinician performance as a reference standard. Our extensive experiments reveal: (1) proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts; (2) medical-domain supervised fine-tuning substantially boosts task-specific accuracy; and (3) model performance remains sensitive to prompt format and clinical task complexity. EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning. We publicly release our benchmark and code.

[114] VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Ce Zhang,Kaixin Ma,Tianqing Fang,Wenhao Yu,Hongming Zhang,Zhisong Zhang,Yaqi Xie,Katia Sycara,Haitao Mi,Dong Yu

Main category: cs.CV

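TL;DR: VScan is a two-stage visual token reduction framework that combines global and local scans with token merging, plus pruning at intermediate layers of the language model, to accelerate inference substantially while preserving performance.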

Details

Motivation: Large Vision-Language Models (LVLMs) incur high computational costs because of long visual token sequences, which makes real-time deployment difficult and calls for more efficient token processing. Method: VScan combines global and local scans with token merging during visual encoding and introduces pruning at intermediate layers of the language model. Result: Validated on four LVLMs, VScan significantly accelerates inference (for example, a 2.91× prefilling speedup and a 10× FLOPs reduction on LLaVA-NeXT-7B) while retaining 95.4% of the original performance. Conclusion: By optimizing token processing, VScan improves efficiency substantially while maintaining performance, outperforming existing methods. Abstract: Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91× speedup in prefilling and a 10× reduction in FLOPs, while retaining 95.4% of the original performance.
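Token pruning in general can be illustrated with a score-based top-k filter. The sketch below is a generic example of that building block, not VScan's scanning, merging, or layer-selection logic. How importance scores are obtained (for instance from attention weights) is left open, and the names `prune_tokens` and `keep_ratio` are hypothetical.

```python
import math


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their order.

    `tokens` is any sequence of token representations, `scores` holds one
    importance value per token, and `keep_ratio` is the fraction to keep
    (rounded up, with at least one token kept when any exist).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank token indices by score, then restore the original sequence order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept]
```

Restoring the original order after ranking matters because positional information in the remaining sequence would otherwise be scrambled.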

[115] Color Image Set Recognition Based on Quaternionic Grassmannians

Xiang Xiang Wang,Tin-Yau Tam

Main category: cs.CV

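TL;DR: Proposes a color image set recognition method based on quaternionic Grassmannians: quaternions capture color information, each image set is represented as a point on the Grassmannian, and classification uses the shortest distance between points.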

Details

Motivation: To use the strengths of quaternions to capture color image information more effectively and improve recognition. Method: Represents each color image set as a point on the quaternionic Grassmannian, computes the shortest distance between points, and builds a classification framework on that distance. Result: Achieves good recognition results on the ETH-80 dataset. Conclusion: The method is effective, but its stability has limitations that future work could address. Abstract: We propose a new method for recognizing color image sets using quaternionic Grassmannians, which use the power of quaternions to capture color information and represent each color image set as a point on the quaternionic Grassmannian. We provide a direct formula to calculate the shortest distance between two points on the quaternionic Grassmannian, and use this distance to build a new classification framework. Experiments on the ETH-80 benchmark dataset show that our method achieves good recognition results. We also discuss some limitations in stability and suggest ways the method can be improved in the future.

[116] Comparing the Effects of Persistence Barcodes Aggregation and Feature Concatenation on Medical Imaging

Dashti A. Ali,Richard K. G. Do,William R. Jarnagin,Aras T. Asaad,Amber L. Simpson

Main category: cs.CV

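TL;DR: Compares two ways of building persistent-homology feature vectors for medical image analysis and finds that feature concatenation outperforms barcode aggregation, preserving more topological information and improving classification performance.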

Details

Motivation: Traditional feature extraction is sensitive to small changes in the input, whereas persistent homology (PH) provides stable topological and geometrical features; how best to build a feature vector from multiple persistence barcodes still needs study. Method: Compares the classification performance of two approaches (aggregating persistence barcodes and then featurizing, versus concatenating the topological feature vectors of each barcode) across a range of medical imaging datasets. Result: Feature concatenation preserves more detailed topological information and yields better classification performance. Conclusion: Feature concatenation is the preferred choice in similar experiments. Abstract: In medical image analysis, feature engineering plays an important role in the design and performance of machine learning models. Persistent homology (PH), from the field of topological data analysis (TDA), demonstrates robustness and stability to data perturbations and addresses the limitation of traditional feature extraction approaches where a small change in input results in a large change in feature representation. Using PH, we store persistent topological and geometrical features in the form of the persistence barcode whereby large bars represent global topological features and small bars encapsulate geometrical information of the data. When multiple barcodes are computed from 2D or 3D medical images, two approaches can be used to construct the final topological feature vector in each dimension: aggregating persistence barcodes followed by featurization or concatenating topological feature vectors derived from each barcode. In this study, we conduct a comprehensive analysis across diverse medical imaging datasets to compare the effects of the two aforementioned approaches on the performance of classification models. The results of this analysis indicate that feature concatenation preserves detailed topological information from individual barcodes, yields better classification performance and is therefore a preferred approach when conducting similar experiments.
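The difference between the two constructions compared above can be seen on a toy example. In the sketch below each barcode is a list of `(birth, death)` pairs, and the featurization (bar count, total persistence, longest bar) is a deliberately simple stand-in chosen for illustration; the study's actual featurization methods are not described in the abstract.

```python
def featurize(barcode):
    """Summarize one barcode as [bar count, total persistence, longest bar]."""
    lifetimes = [death - birth for birth, death in barcode]
    if not lifetimes:
        return [0, 0.0, 0.0]
    return [len(lifetimes), sum(lifetimes), max(lifetimes)]


def aggregate_then_featurize(barcodes):
    """Pool the bars of all barcodes into one barcode, then featurize once."""
    merged = [bar for barcode in barcodes for bar in barcode]
    return featurize(merged)


def featurize_then_concatenate(barcodes):
    """Featurize each barcode separately and concatenate the vectors."""
    return [value for barcode in barcodes for value in featurize(barcode)]


# Two toy barcodes, e.g. computed from two slices of the same scan.
barcodes = [[(0.0, 1.0), (0.5, 0.75)], [(0.0, 3.0)]]
aggregated = aggregate_then_featurize(barcodes)      # [3, 4.25, 3.0]
concatenated = featurize_then_concatenate(barcodes)  # [2, 1.25, 1.0, 1, 3.0, 3.0]
```

The aggregated vector has a fixed length but no longer records which barcode contributed which bars, while the concatenated vector grows with the number of barcodes and keeps the per-barcode detail, which is the property the abstract credits for the better classification performance.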

[117] Radiant Triangle Soup with Soft Connectivity Forces for 3D Reconstruction and Novel View Synthesis

Nathaniel Burgdorfer,Philippos Mordohai

Main category: cs.CV

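TL;DR: Proposes a triangle-based inference-time optimization framework for representing scene geometry and appearance, and argues that triangles have advantages over the widely used Gaussian splats.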

Details

Motivation: Triangles allow more expressive color interpolation, benefit from a large body of algorithmic infrastructure for downstream tasks, and naturally combine to form surfaces. Method: Develops a scene optimization algorithm for triangle soup (a collection of disconnected semi-transparent triangles) and introduces connectivity forces during optimization to encourage surface continuity. Result: Shows competitive photometric and geometric results on a representative 3D reconstruction dataset. Conclusion: Triangles are advantageous as scene representation primitives, particularly for color interpolation and surface formation. Abstract: In this work, we introduce an inference-time optimization framework utilizing triangles to represent the geometry and appearance of the scene. More specifically, we develop a scene optimization algorithm for triangle soup, a collection of disconnected semi-transparent triangle primitives. Compared to the current most widely used primitives for 3D scene representation, namely Gaussian splats, triangles allow for more expressive color interpolation, and benefit from a large algorithmic infrastructure for downstream tasks. Triangles, unlike full-rank Gaussian kernels, naturally combine to form surfaces. We formulate connectivity forces between triangles during optimization, encouraging explicit, but soft, surface continuity in 3D. We perform experiments on a representative 3D reconstruction dataset and show competitive photometric and geometric results.

[118] VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Xiangdong Zhang,Jiaqi Liao,Shaofeng Zhang,Fanqing Meng,Xiangpeng Wan,Junchi Yan,Yu Cheng

Main category: cs.CV

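TL;DR: VideoREPA uses a Token Relation Distillation loss to distill physics knowledge from video understanding foundation models into T2V models, substantially improving the physical plausibility of generated videos.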

Details

Motivation: Current T2V models struggle to generate physically plausible content, and their physics understanding lags behind video self-supervised learning methods. Method: Proposes the VideoREPA framework, which uses a Token Relation Distillation loss with spatio-temporal alignment to fine-tune pre-trained T2V models. Result: VideoREPA substantially improves the physics commonsense of the baseline CogVideoX and performs strongly on relevant benchmarks. Conclusion: VideoREPA is the first REPA method designed for fine-tuning T2V models, and it improves the physical consistency of generated videos. Abstract: Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physically plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
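Aligning token-level relations, rather than the token features themselves, can be sketched as follows. This is an assumption-laden illustration, not the paper's TRD loss: it takes pairwise cosine similarity as the "relation" and a mean absolute difference as the penalty, and all function names are hypothetical. Token vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relation_matrix(tokens):
    """Pairwise cosine similarities between all tokens of one model."""
    return [[cosine(u, v) for v in tokens] for u in tokens]


def relation_distillation_loss(student_tokens, teacher_tokens):
    """Mean absolute difference between student and teacher relations."""
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("both models must provide the same number of tokens")
    rel_s = relation_matrix(student_tokens)
    rel_t = relation_matrix(teacher_tokens)
    n = len(rel_s)
    total = sum(abs(rel_s[i][j] - rel_t[i][j])
                for i in range(n) for j in range(n))
    return total / (n * n)
```

One consequence of comparing relation matrices is that the student and teacher features may have different dimensionalities, since only the token-by-token similarity structure is matched.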

[119] D-AR: Diffusion via Autoregressive Models

Ziteng Gao,Mike Zheng Shou

Main category: cs.CV

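TL;DR: D-AR recasts the image diffusion process as a standard autoregressive procedure by tokenizing images into discrete sequences and exploiting diffusion properties to generate from coarse to fine.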

Details

Motivation: To explore a unified autoregressive architecture that uses diffusion properties to simplify visual synthesis. Method: Designs a tokenizer that converts images into sequences of discrete tokens, predicts the tokens with an autoregressive model, and maps them directly to diffusion denoising steps. Result: Achieves 2.09 FID on ImageNet and supports previews and zero-shot layout-controlled synthesis. Conclusion: D-AR offers a new direction for unified autoregressive architectures for visual synthesis, especially in combination with large language models. Abstract: This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties; for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR

[120] OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

Size Wu,Zhonghua Wu,Zerui Gong,Qingyi Tao,Sheng Jin,Qinyue Li,Wei Li,Chen Change Loy

Main category: cs.CV

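TL;DR: OpenUni is a lightweight, open-source baseline for unified multimodal understanding and generation that achieves high-quality image generation and strong benchmark performance with an efficient training strategy and a simple architecture.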

Details

Motivation: Inspired by prevailing practices in unified model learning, the work aims to reduce training complexity and overhead while supporting multimodal tasks. Method: Bridges off-the-shelf multimodal large language models and diffusion models with learnable queries and a lightweight transformer connector. Result: Generates high-quality, instruction-aligned images and performs well on benchmarks such as GenEval, DPG-Bench, and WISE with only 1.1B and 3.1B activated parameters. Conclusion: OpenUni shows that a simple architecture can be efficient, and the model weights, training code, and datasets are open-sourced to support community research. Abstract: In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging the off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a light-weight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.

[121] Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch,Snigdha Saha,Naitik Khandelwal,Ayush Jain,Michael J. Tarr,Aviral Kumar,Katerina Fragkiadaki

Main category: cs.CV

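TL;DR: ViGoRL is a vision-language model trained with reinforcement learning to anchor each reasoning step to visual coordinates, which substantially improves performance on visual reasoning tasks.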

Details

Motivation: Visual reasoning requires models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence, and conventional methods lack explicit grounding mechanisms. Method: ViGoRL produces spatially grounded reasoning traces and uses a multi-turn RL framework to zoom dynamically into predicted coordinates. Result: Across multiple visual reasoning benchmarks, ViGoRL outperforms supervised fine-tuning and conventional RL baselines; multi-turn RL with zoomed-in feedback improves localization of small GUI elements and visual search, reaching 86.4% on V*Bench. Conclusion: Visually grounded RL is an effective paradigm for giving models general-purpose visual reasoning ability. Abstract: While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks (including SAT-2 and BLINK for spatial reasoning, V*Bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding), ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
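Zooming into a predicted coordinate amounts to choosing a crop window around that point. The following sketch shows one plausible way to compute such a window; the abstract does not describe how ViGoRL selects its crops, so the function name, the fixed `zoom` factor, and the clamping behaviour are all assumptions made for illustration.

```python
def zoom_box(x, y, image_w, image_h, zoom=2.0):
    """Return (left, top, right, bottom) of a crop centred near (x, y).

    The crop is 1/zoom of the image in each dimension and is shifted as
    needed so that it stays entirely inside the image bounds.
    """
    crop_w = image_w / zoom
    crop_h = image_h / zoom
    # Centre the window on the point, then clamp it to the image.
    left = min(max(x - crop_w / 2, 0), image_w - crop_w)
    top = min(max(y - crop_h / 2, 0), image_h - crop_h)
    return (left, top, left + crop_w, top + crop_h)
```

A point in the middle of a 100 by 100 image yields the central window `(25.0, 25.0, 75.0, 75.0)`, while a point at a corner yields the quadrant touching that corner instead of a window that spills outside the image.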

[122] VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Tingyu Song,Tongyan Hu,Guo Gan,Yilun Zhao

Main category: cs.CV

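TL;DR: Proposes VF-Eval, a new benchmark for comprehensively evaluating multimodal large language models (MLLMs) on AI-generated content (AIGC) videos. Existing models perform poorly, and an experiment illustrates the benchmark's potential use in video generation.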

Details

Motivation: Existing work focuses on natural videos and overlooks the evaluation of synthetic videos such as AIGC, and the ability of MLLMs to interpret AIGC videos remains underexplored. Method: Proposes the VF-Eval benchmark with four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) and evaluates 13 frontier MLLMs. Result: Even the best-performing model, GPT-4.1, struggles to perform consistently well across all tasks, which shows how challenging the benchmark is. The RePrompt experiment shows that aligning MLLMs more closely with human feedback can improve video generation. Conclusion: VF-Eval exposes the limitations of MLLMs on AIGC video tasks and suggests new ways to apply them in video generation. Abstract: MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.

[123] DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers

Li Ren,Chen Chen,Liqiang Wang,Kien Hua

Main category: cs.CV

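TL;DR: The paper proposes DA-VPT, a framework that uses metric learning to study how the distribution of prompts affects fine-tuning performance and uses semantic information to guide prompt learning, improving ViT models on downstream vision tasks.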

Details

Motivation: To explore the fundamental correlation and distribution between prompts and image tokens in order to improve Visual Prompt Tuning (VPT). Method: The DA-VPT framework guides the distribution of the prompts by learning a distance metric from class-related semantic data. Result: Experiments on recognition and segmentation tasks validate the method and show that DA-VPT fine-tunes ViT models more effectively and efficiently. Conclusion: By guiding prompt learning with semantic information, DA-VPT improves performance on downstream vision tasks. Abstract: Visual Prompt Tuning (VPT) has become a promising Parameter-Efficient Fine-Tuning (PEFT) approach for Vision Transformer (ViT) models, fine-tuning a small set of learnable tokens while keeping most model parameters frozen. Recent research has explored modifying the connection structures of the prompts. However, the fundamental correlation and distribution between the prompts and image tokens remain unexplored. In this paper, we leverage metric learning techniques to investigate how the distribution of prompts affects fine-tuning performance. Specifically, we propose a novel framework, Distribution Aware Visual Prompt Tuning (DA-VPT), to guide the distributions of the prompts by learning the distance metric from their class-related semantic data. Our method demonstrates that the prompts can serve as an effective bridge to share semantic information between image patches and the class token. We extensively evaluated our approach on popular benchmarks in both recognition and segmentation tasks. The results demonstrate that our approach enables more effective and efficient fine-tuning of ViT models by leveraging semantic information to guide the learning of the prompts, leading to improved performance on various downstream vision tasks.

[124] CLDTracker: A Comprehensive Language Description for Visual Tracking

Mohamad Alansari,Sajid Javed,Iyyakutti Iyappan Ganapathi,Sara Alansari,Muzammal Naseer

Main category: cs.CV

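TL;DR: The paper proposes CLDTracker, a visual tracking framework built on comprehensive language descriptions; its dual-branch architecture combines visual and language information to address the limitations of conventional trackers in complex scenes.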

Details

Motivation: Visual object tracking (VOT) is challenging because of dynamic appearance changes, occlusions, and background clutter, and conventional trackers that rely on visual cues alone have limited effectiveness. The semantic understanding of vision-language models (VLMs) offers a new angle on VOT, but applying them directly is hindered by insufficient textual representations, inefficient feature fusion, and a lack of temporal modeling. Method: CLDTracker uses a dual-branch architecture (a textual branch and a visual branch) and draws on VLMs such as CLIP and GPT-4V to generate rich textual descriptions that add semantic and contextual information. Result: The tracker achieves SOTA performance on six standard VOT benchmarks, validating the combination of visual and language information. Conclusion: With comprehensive language descriptions and temporally adaptive modeling, CLDTracker improves VOT performance and points to a new direction for vision-language models in tracking. Abstract: VOT remains a fundamental yet challenging task in computer vision due to dynamic appearance changes, occlusions, and background clutter. Traditional trackers, relying primarily on visual cues, often struggle in such complex scenarios. Recent advancements in VLMs have shown promise in semantic understanding for tasks like open-vocabulary detection and image captioning, suggesting their potential for VOT. However, the direct application of VLMs to VOT is hindered by critical limitations: the absence of a rich and comprehensive textual representation that semantically captures the target object's nuances, limiting the effective use of language information; inefficient fusion mechanisms that fail to optimally integrate visual and textual features, preventing a holistic understanding of the target; and a lack of temporal modeling of the target's evolving appearance in the language domain, leading to a disconnect between the initial description and the object's subsequent visual changes. To bridge these gaps and unlock the full potential of VLMs for VOT, we propose CLDTracker, a novel Comprehensive Language Description framework for robust visual Tracking. Our tracker introduces a dual-branch architecture consisting of a textual and a visual branch. In the textual branch, we construct a rich bag of textual descriptions derived by harnessing powerful VLMs such as CLIP and GPT-4V, enriched with semantic and contextual cues to address the lack of rich textual representation. Experiments on six standard VOT benchmarks demonstrate that CLDTracker achieves SOTA performance, validating the effectiveness of leveraging robust and temporally-adaptive vision-language representations for tracking. Code and models are publicly available at: https://github.com/HamadYA/CLDTracker

Dionysis Christopoulos,Sotiris Spanos,Eirini Baltzi,Valsamis Ntouskos,Konstantinos Karantzalos

Main category: cs.CV

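TL;DR: SLIMP combines skin lesion images with metadata through a new nested contrastive learning approach, improving performance on skin lesion classification tasks.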

Details

Motivation: To address the difficulty of melanoma detection and skin lesion classification from images alone, which stems from large variations in imaging conditions and a lack of clinical context. Method: Nested contrastive learning that combines lesion images, lesion-level metadata, and patient-level metadata. Result: SLIMP outperforms other pre-training strategies on downstream classification tasks. Conclusion: By making full use of multimodal data, SLIMP improves skin lesion classification. Abstract: We introduce SLIMP (Skin Lesion Image-Metadata Pre-training) for learning rich representations of skin lesions through a novel nested contrastive learning approach that captures complex relationships between images and metadata. Melanoma detection and skin lesion classification based solely on images pose significant challenges due to large variations in imaging conditions (lighting, color, resolution, distance, etc.) and lack of clinical and phenotypical context. Clinicians typically follow a holistic approach for assessing the risk level of the patient and for deciding which lesions may be malignant and need to be excised, by considering the patient's medical history as well as the appearance of other lesions of the patient. Inspired by this, SLIMP combines the appearance and the metadata of individual skin lesions with patient-level metadata relating to their medical record and other clinically relevant information. By fully exploiting all available data modalities throughout the learning process, the proposed pre-training strategy improves performance compared to other pre-training strategies on downstream skin lesion classification tasks, highlighting the quality of the learned representations.
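
The abstract does not spell out the nested objective, so the following is only a generic building block and not SLIMP's loss: a one-directional InfoNCE term that pulls each image embedding toward the metadata embedding at the same index. The function name `info_nce`, the `temperature` value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def info_nce(image_emb, meta_emb, temperature=0.1):
    """Image-to-metadata InfoNCE loss; row i of each input is a matched pair."""

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    images = [normalize(v) for v in image_emb]
    metas = [normalize(v) for v in meta_emb]

    total = 0.0
    for i, img in enumerate(images):
        # Cosine similarity of image i against every metadata vector.
        logits = [sum(a * b for a, b in zip(img, m)) / temperature for m in metas]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(images)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

A nested variant would presumably apply a term like this at more than one level (lesion and patient), but how the levels are combined is not stated in the abstract.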

[126] AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views

Lihan Jiang,Yucheng Mao,Linning Xu,Tao Lu,Kerui Ren,Yichen Jin,Xudong Xu,Mulin Yu,Jiangmiao Pang,Feng Zhao,Dahua Lin,Bo Dai

Main category: cs.CV

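TL;DR: AnySplat is a feed-forward network for novel view synthesis from uncalibrated image collections; it needs neither known camera poses nor per-scene optimization and is computationally efficient.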

Details

Motivation: Traditional neural rendering methods require known camera poses and per-scene optimization, while existing feed-forward methods become computationally heavy with dense views. AnySplat aims to address both problems. Method: A single forward pass predicts 3D Gaussian primitives (encoding scene geometry and appearance) together with the camera intrinsics and extrinsics of each input image. Result: In zero-shot evaluations, AnySplat matches the quality of pose-aware baselines in both sparse- and dense-view settings, surpasses existing pose-free methods, and greatly reduces rendering latency. Conclusion: AnySplat offers an efficient route to real-time novel view synthesis in unconstrained capture settings. Abstract: We introduce AnySplat, a feed-forward network for novel view synthesis from uncalibrated image collections. In contrast to traditional neural rendering pipelines that demand known camera poses and per-scene optimization, or recent feed-forward methods that buckle under the computational weight of dense views, our model predicts everything in one shot. A single forward pass yields a set of 3D Gaussian primitives encoding both scene geometry and appearance, and the corresponding camera intrinsics and extrinsics for each input image. This unified design scales effortlessly to casually captured, multi-view datasets without any pose annotations. In extensive zero-shot evaluations, AnySplat matches the quality of pose-aware baselines in both sparse- and dense-view scenarios while surpassing existing pose-free approaches. Moreover, it greatly reduces rendering latency compared to optimization-based neural fields, bringing real-time novel view synthesis within reach for unconstrained capture settings. Project page: https://city-super.github.io/anysplat/

[127] FMG-Det: Foundation Model Guided Robust Object Detection

Darryl Hannan,Timothy Doster,Henry Kvinge,Adam Attarian,Yijing Watkins

Main category: cs.CV

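TL;DR: FMG-Det is a simple, efficient method that combines multiple instance learning (MIL) with a pre-processing pipeline to handle noisy annotations and improve object detection performance.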

Details

Motivation: In object detection, the subjectivity of labeling object boundaries leads to inconsistent data quality, and noisy annotations significantly degrade model performance, especially in few-shot settings. Method: Combine a multiple instance learning framework with a pre-processing pipeline that uses foundation models to correct labels, plus slight modifications to the detector head. Result: State-of-the-art performance on several datasets in both standard and few-shot scenarios, with a simpler and more efficient method. Conclusion: By correcting noisy annotations and adjusting the model structure, FMG-Det improves the performance and robustness of object detection. Abstract: Collecting high-quality data for object detection tasks is challenging due to the inherent subjectivity in labeling the boundaries of an object. This makes it difficult to not only collect consistent annotations across a dataset but also to validate them, as no two annotators are likely to label the same object using the exact same coordinates. These challenges are further compounded when object boundaries are partially visible or blurred, which can be the case in many domains. Training on noisy annotations significantly degrades detector performance, rendering them unusable, particularly in few-shot settings, where just a few corrupted annotations can impact model performance. In this work, we propose FMG-Det, a simple, efficient methodology for training models with noisy annotations. More specifically, we propose combining a multiple instance learning (MIL) framework with a pre-processing pipeline that leverages powerful foundation models to correct labels prior to training. This pre-processing pipeline, along with slight modifications to the detector head, results in state-of-the-art performance across a number of datasets, for both standard and few-shot scenarios, while being much simpler and more efficient than other approaches.

[128] PixelThink: Towards Efficient Chain-of-Pixel Reasoning

Song Wang,Gongfan Fang,Lingdong Kong,Xiangtai Li,Jianyun Xu,Sheng Yang,Qiang Li,Jianke Zhu,Xinchao Wang

Main category: cs.CV

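TL;DR: PixelThink regulates reasoning generation using task difficulty and model uncertainty, improving both reasoning efficiency and segmentation performance.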

Details

Motivation: Existing methods generalize poorly, reason inefficiently, and offer no explicit control over the reasoning process. Method: PixelThink regulates reasoning generation using externally estimated task difficulty and internally measured model uncertainty. Result: Experiments show improved reasoning efficiency and segmentation performance. Conclusion: The work offers a new perspective on efficient and interpretable multimodal understanding. Abstract: Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) using image-text pairs and corresponding mask labels. However, they exhibit limited generalization to out-of-distribution scenarios without an explicit reasoning process. Although recent efforts leverage reinforcement learning through group-relative policy optimization (GRPO) to enhance reasoning ability, they often suffer from overthinking, producing uniformly verbose reasoning chains irrespective of task complexity. This results in elevated computational costs and limited control over reasoning quality. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress reasoning length in accordance with scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics designed to assess segmentation accuracy, reasoning quality, and efficiency jointly. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance. Our work contributes novel perspectives towards efficient and interpretable multimodal understanding. The code and model will be publicly available.

[129] ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS

Weijie Wang,Donny Y. Chen,Zeyu Zhang,Duochao Shi,Akide Liu,Bohan Zhuang

Main category: cs.CV

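TL;DR: ZPressor is a lightweight module that compresses multi-view inputs following the Information Bottleneck principle, improving the scalability and performance of feed-forward 3D Gaussian Splatting models.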

Details

Motivation: Existing feed-forward 3D Gaussian Splatting models have limited encoder capacity and struggle with many input views, leading to degraded performance or excessive memory consumption. Method: ZPressor partitions the views into anchor and support sets and uses cross attention to compress information into a compact latent state Z. Result: ZPressor lets models process more than 100 views on an 80GB GPU, with improved performance and robustness. Conclusion: ZPressor gives feed-forward 3DGS models an efficient multi-view compression scheme that improves scalability and performance. Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their encoders, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state $Z$ that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state $Z$. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: https://lhmd.top/zpressor.
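
The anchor/support compression described above can be pictured with a small sketch. It assumes one feature vector per view, evenly spaced anchors, no learned projections, and a residual update; none of these details come from the abstract, and `split_views` and `compress` are hypothetical names rather than the ZPressor API.

```python
import math


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def split_views(num_views, num_anchors):
    """Pick evenly spaced anchor indices (an assumption); the rest are support views."""
    step = num_views / num_anchors
    anchors = sorted({int(i * step) for i in range(num_anchors)})
    support = [i for i in range(num_views) if i not in anchors]
    return anchors, support


def compress(features, anchors, support):
    """Cross attention: anchors are queries, support views are keys and values.

    Requires at least one support view. Returns one vector per anchor,
    i.e. a latent state Z whose size depends on the anchors only.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    latent = []
    for a in anchors:
        query = features[a]
        logits = [scale * sum(q * k for q, k in zip(query, features[s])) for s in support]
        weights = softmax(logits)
        pooled = [sum(w * features[s][d] for w, s in zip(weights, support)) for d in range(dim)]
        # Residual update: the anchor keeps its own content and absorbs support information.
        latent.append([q + p for q, p in zip(query, pooled)])
    return latent


features = [[1.0, 0.0], [2.0, 2.0], [2.0, 2.0], [0.0, 1.0], [2.0, 2.0], [2.0, 2.0]]
anchors, support = split_views(num_views=6, num_anchors=2)
z = compress(features, anchors, support)
print(anchors, support, len(z))
```

The point of the design is that downstream cost scales with the number of anchors rather than the number of input views, which is consistent with the abstract's claim of scaling to over 100 views.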

[130] MAGREF: Masked Guidance for Any-Reference Video Generation

Yufan Deng,Xun Guo,Yuanyang Yin,Jacob Zhiyuan Fang,Yiding Yang,Yizhi Wang,Shenghai Yuan,Angtian Wang,Bo Liu,Haibin Huang,Chongyang Ma

Main category: cs.CV

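TL;DR: MAGREF is a unified framework for video generation from multiple reference subjects that uses masked guidance to achieve high-quality, consistent multi-subject video synthesis.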

Details

Motivation: Video generation from multiple reference subjects still struggles to maintain multi-subject consistency and generation quality. Method: (1) A region-aware dynamic masking mechanism that handles different subjects flexibly; (2) a pixel-wise channel concatenation mechanism that preserves appearance features. Result: The model performs well in multi-subject scenarios, with generation quality above existing open-source and commercial baselines. Conclusion: MAGREF provides an effective approach to scalable, controllable, high-fidelity multi-subject video synthesis. Abstract: Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF

[131] DarkDiff: Advancing Low-Light Raw Enhancement by Retasking Diffusion Models for Camera ISP

Amber Yijia Zheng,Yu Zhang,Jun Hu,Raymond A. Yeh,Chen Chen

Main category: cs.CV

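TL;DR: The paper proposes a framework that retasks pre-trained generative diffusion models with the camera ISP to improve raw image quality in extreme low light, outperforming existing methods.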

Details

Motivation: High-quality photography in extreme low light matters for digital cameras, but existing methods produce blurry images and inaccurate colors. Method: Retask a pre-trained generative diffusion model with the camera ISP to enhance low-light raw images. Result: The method outperforms the state of the art in perceptual quality on three low-light raw image benchmarks. Conclusion: The method addresses blur and color distortion in low-light image enhancement and has clear practical value. Abstract: High-quality photography in extreme low-light conditions is challenging but impactful for digital cameras. With advanced computing hardware, traditional camera image signal processor (ISP) algorithms are gradually being replaced by efficient deep networks that enhance noisy raw images more intelligently. However, existing regression-based models often minimize pixel errors and result in oversmoothing of low-light photos or deep shadows. Recent work has attempted to address this limitation by training a diffusion model from scratch, yet those models still struggle to recover sharp image details and accurate colors. We introduce a novel framework to enhance low-light raw images by retasking pre-trained generative diffusion models with the camera ISP. Extensive experiments demonstrate that our method outperforms the state-of-the-art in perceptual quality across three challenging low-light raw image benchmarks.

[132] Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need

Qiang Wang,Xiang Song,Yuhang He,Jizhou Han,Chenhao Ding,Xinyuan Gao,Yihong Gong

Main category: cs.CV

TL;DR: SOYO is a lightweight framework that improves domain selection in PIDIL, addressing the performance drop of DNNs in dynamic environments.

Details

Motivation: DNNs underperform in dynamic environments where the data distribution changes over time. PIDIL enables continual model adaptation, but existing methods struggle with parameter selection accuracy. Method: SOYO introduces a GMC and a DFR to store and balance prior domain data efficiently, an MDFN to enhance domain feature extraction, and support for multiple PEFT methods. Result: On six benchmarks, SOYO outperforms existing baselines, showing robustness and adaptability in complex environments. Conclusion: By improving domain selection and feature extraction, SOYO offers an effective approach for DNNs in dynamic environments. Abstract: Deep neural networks (DNNs) often underperform in real-world, dynamic settings where data distributions change over time. Domain Incremental Learning (DIL) offers a solution by enabling continual model adaptation, with Parameter-Isolation DIL (PIDIL) emerging as a promising paradigm to reduce knowledge conflicts. However, existing PIDIL methods struggle with parameter selection accuracy, especially as the number of domains and corresponding classes grows. To address this, we propose SOYO, a lightweight framework that improves domain selection in PIDIL. SOYO introduces a Gaussian Mixture Compressor (GMC) and Domain Feature Resampler (DFR) to store and balance prior domain data efficiently, while a Multi-level Domain Feature Fusion Network (MDFN) enhances domain feature extraction. Our framework supports multiple Parameter-Efficient Fine-Tuning (PEFT) methods and is validated across tasks such as image classification, object detection, and speech enhancement. Experimental results on six benchmarks demonstrate SOYO's consistent superiority over existing baselines, showcasing its robustness and adaptability in complex, evolving environments. The code will be released at https://github.com/qwangcv/SOYO.

[133] To Trust Or Not To Trust Your Vision-Language Model's Prediction

Hao Dong,Moru Liu,Jian Liang,Eleni Chatzi,Olga Fink

Main category: cs.CV

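TL;DR: TrustVLM is a training-free framework for estimating when a VLM's predictions can be trusted; it uses the image embedding space to improve misclassification detection and makes the model more reliable.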

Details

Motivation: Although VLMs perform well on multimodal tasks, their misclassifications can have severe consequences in safety-critical domains, so a way to assess the trustworthiness of predictions is needed. Method: A confidence-scoring function based on the image embedding space, which exploits the modality gap and the fact that some concepts are more distinctly represented in that space to detect misclassifications. Result: Evaluated on 17 datasets with 4 architectures and 2 VLMs, TrustVLM reaches SOTA performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95. Conclusion: TrustVLM improves VLM reliability without retraining, paving the way for safer deployment in real-world applications. Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when a VLM's predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code will be available at https://github.com/EPFL-IMOS/TrustVLM.
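
For readers unfamiliar with two of the reported metrics, the sketch below computes AUROC and FPR95 for misclassification detection using their usual definitions, with correct predictions treated as the positive class. This is not the TrustVLM evaluation code; the function names and the toy confidence scores are assumptions made for illustration.

```python
import math


def auroc(correct_scores, wrong_scores):
    """Probability that a correct prediction gets a higher confidence than a wrong one."""
    wins = 0.0
    for c in correct_scores:
        for w in wrong_scores:
            if c > w:
                wins += 1.0
            elif c == w:
                wins += 0.5  # ties count as half a win
    return wins / (len(correct_scores) * len(wrong_scores))


def fpr_at_tpr(correct_scores, wrong_scores, tpr=0.95):
    """Share of wrong predictions still accepted when `tpr` of correct ones are accepted."""
    ranked = sorted(correct_scores, reverse=True)
    keep = math.ceil(tpr * len(ranked))
    threshold = ranked[keep - 1]  # lowest confidence among the accepted correct predictions
    accepted_wrong = sum(1 for w in wrong_scores if w >= threshold)
    return accepted_wrong / len(wrong_scores)


# Toy confidences: four correct predictions and two misclassifications.
correct = [0.9, 0.8, 0.7, 0.6]
wrong = [0.65, 0.3]
print(auroc(correct, wrong), fpr_at_tpr(correct, wrong))
```

Higher AUROC is better, while lower FPR95 is better, so the two "improvements" quoted in the abstract move in opposite numerical directions.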

[134] Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu,Fangfu Liu,Yi-Hsin Hung,Yueqi Duan

Main category: cs.CV

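TL;DR: Spatial-MLLM is a framework for visual spatial reasoning from purely 2D observations; a dual-encoder architecture and a space-aware frame sampling strategy substantially improve spatial understanding.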

Details

Motivation: Existing 3D multimodal large language models depend on additional 3D or 2.5D data, which limits their use when only 2D inputs such as images or videos are available. This paper addresses that limitation. Method: A dual-encoder architecture in which a semantic encoder extracts semantic features and a spatial encoder, initialized from a visual geometry model, extracts 3D structure features; a connector integrates the two. A space-aware frame sampling strategy is also proposed. Result: On a range of real-world datasets, Spatial-MLLM achieves state-of-the-art performance in visual spatial understanding and reasoning tasks. Conclusion: Spatial-MLLM achieves effective spatial reasoning from purely 2D inputs and offers a new direction for the field. Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
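
One plausible reading of "selecting the spatially informative frames" is a coverage problem: pick the frames that together see the most of the scene. The sketch below uses a greedy maximum-coverage heuristic and assumes each frame has already been mapped to a set of voxel ids; both the heuristic and the voxel mapping are assumptions, and `select_frames` is a hypothetical name rather than the paper's implementation.

```python
def select_frames(frame_voxels, budget):
    """Greedily pick up to `budget` frames that cover the most distinct voxels.

    frame_voxels: list of sets, one set of voxel ids per video frame.
    Returns the chosen frame indices in temporal order.
    """
    covered = set()
    chosen = []
    remaining = set(range(len(frame_voxels)))
    while remaining and len(chosen) < budget:
        # Frame adding the most unseen voxels; ties go to the earliest frame.
        best = max(sorted(remaining), key=lambda i: len(frame_voxels[i] - covered))
        if not frame_voxels[best] - covered:
            break  # no remaining frame adds new coverage
        chosen.append(best)
        covered |= frame_voxels[best]
        remaining.remove(best)
    return sorted(chosen)


frames = [{1, 2, 3}, {2, 3}, {4, 5}, {1, 2, 3, 4}]
picked = select_frames(frames, budget=2)
print(picked)
```

Compared with uniform temporal sampling, a coverage-driven choice spends a limited token budget on frames that reveal new parts of the scene instead of near-duplicate viewpoints.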

[135] ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

Akashah Shabbir,Muhammad Akhtar Munir,Akshay Dudhane,Muhammad Umer Sheikh,Muhammad Haris Khan,Paolo Fraccaro,Juan Bernabe Moreno,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

TL;DR: ThinkGeo is a benchmark for evaluating how well LLM-driven agents use tools on remote sensing tasks, filling a gap in domain-specific evaluation.

Details

Motivation: Existing evaluations mostly target general-purpose or multimodal scenarios and lack domain-specific benchmarks for complex remote sensing use cases. Method: ThinkGeo contains human-curated queries covering a range of real-world applications and uses a ReAct-style interaction loop to evaluate open- and closed-source LLMs. Result: The analysis reveals notable disparities in tool accuracy and planning consistency across models. Conclusion: ThinkGeo provides the first extensive testbed for evaluating the spatial reasoning of tool-augmented LLMs in remote sensing. Abstract: Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Each query is grounded in satellite or aerial imagery and requires agents to reason through a diverse toolset. We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs (e.g., GPT-4o, Qwen2.5) on 436 structured agentic tasks. The benchmark reports both step-wise execution metrics and final answer correctness. Our analysis reveals notable disparities in tool accuracy and planning consistency across models. ThinkGeo provides the first extensive testbed for evaluating how tool-enabled LLMs handle spatial reasoning in remote sensing. Our code and dataset are publicly available.

[136] Rooms from Motion: Un-posed Indoor 3D Object Detection as Localization and Mapping

Justin Lazarow,Kai Kang,Afshin Dehghan

Main category: cs.CV

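TL;DR: Rooms from Motion (RfM) is an object-centric framework for scene-level 3D object detection that estimates camera poses and object tracks from un-posed images and produces a global, semantic 3D object map.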

Details

Motivation: Existing 3D object detection methods operate globally and rely on known camera poses; RfM aims to handle un-posed images with an object-centric matcher for more flexible localization and mapping. Method: RfM replaces the standard 2D keypoint-based matcher with an object-centric matcher based on 3D boxes, estimates camera poses and object tracks, and optimizes global 3D boxes to improve map quality. Result: On CA-1M and ScanNet++, RfM outperforms point-based and multi-view 3D object detection methods and produces higher-quality maps. Conclusion: RfM provides a general, object-centric representation that extends Cubify Anything and enables sparse localization and parametric mapping. Abstract: We revisit scene-level 3D object detection as the output of an object-centric framework capable of both localization and mapping using 3D oriented boxes as the underlying geometric primitive. While existing 3D object detection approaches operate globally and implicitly rely on the a priori existence of metric camera poses, our method, Rooms from Motion (RfM), operates on a collection of un-posed images. By replacing the standard 2D keypoint-based matcher of structure-from-motion with an object-centric matcher based on image-derived 3D boxes, we estimate metric camera poses, object tracks, and finally produce a global, semantic 3D object map. When a priori pose is available, we can significantly improve map quality through optimization of global 3D boxes against individual observations. RfM shows strong localization performance and subsequently produces maps of higher quality than leading point-based and multi-view 3D object detection methods on CA-1M and ScanNet++, despite these global methods relying on overparameterization through point clouds or dense volumes. Rooms from Motion achieves a general, object-centric representation which not only extends the work of Cubify Anything to full scenes but also allows for inherently sparse localization and parametric mapping proportional to the number of objects in a scene.

[137] Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models

Haohan Chi,Huan-ang Gao,Ziming Liu,Jianing Liu,Chenyu Liu,Jinwei Li,Kaisen Yang,Yangcheng Yu,Zeda Wang,Wenyi Li,Leichen Wang,Xingtao Hu,Hao Sun,Hang Zhao,Hao Zhao

Main category: cs.CV

TL;DR: Impromptu VLA introduces a new dataset for improving Vision-Language-Action models in autonomous driving, particularly in unstructured scenarios.

Details

Motivation: Existing VLA models perform poorly in unstructured corner-case scenarios and lack targeted benchmark datasets. Method: Build the Impromptu VLA Dataset of over 80,000 video clips, organized by a taxonomy of four unstructured scenario categories and annotated with planning-oriented question answering and action trajectories. Result: VLA models trained on the dataset improve substantially on several benchmarks, including closed-loop NeuroNCAP scores and collision rates and L2 accuracy in open-loop nuScenes trajectory prediction. Conclusion: The Impromptu VLA Dataset improves VLA model performance and serves as a diagnostic for perception, prediction, and planning. Abstract: Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips sourced from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks: improving closed-loop NeuroNCAP scores and collision rates, and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.

[138] LoRAShop: Training-Free Multi-Concept Image Generation and Editing with Rectified Flow Transformers

Yusuf Dalva,Hidir Yesiltepe,Pinar Yanardag

Main category: cs.CV

TL;DR: LoRAShop is a multi-concept image editing framework that uses LoRA models for personalized edits without retraining.

Details

Motivation: To build a practical tool for multi-concept image editing that addresses the weaknesses of existing methods in identity preservation and global consistency. Method: Analyze feature interaction patterns in diffusion transformers to derive a disentangled latent mask per concept, then blend LoRA weights only within the corresponding regions. Result: Experiments show that LoRAShop preserves identity better than the baselines. Conclusion: LoRAShop offers a practical tool for visual creation and rapid iteration and broadens the use of LoRA models. Abstract: We introduce LoRAShop, the first framework for multi-concept image editing with LoRA models. LoRAShop builds on a key observation about the feature interaction patterns inside Flux-style diffusion transformers: concept-specific transformer features activate spatially coherent regions early in the denoising process. We harness this observation to derive a disentangled latent mask for each concept in a prior forward pass and blend the corresponding LoRA weights only within regions bounding the concepts to be personalized. The resulting edits seamlessly integrate multiple subjects or styles into the original scene while preserving global context, lighting, and fine details. Our experiments demonstrate that LoRAShop delivers better identity preservation compared to baselines. By eliminating retraining and external constraints, LoRAShop turns personalized diffusion models into a practical "photoshop-with-LoRAs" tool and opens new avenues for compositional visual storytelling and rapid creative iteration.
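
The idea of applying each LoRA only inside its concept's region can be illustrated with a toy masked blend. The sketch works on a flattened feature map and adds each concept's LoRA residual only where that concept's binary mask is 1; treating the blend as a per-position residual, and the names `blend_lora_residuals`, `deltas`, and `masks`, are assumptions made for illustration, not LoRAShop's implementation.

```python
def blend_lora_residuals(base, deltas, masks):
    """Add each concept's LoRA residual to the base features inside its mask only.

    base:   list of floats, the base model's features (flattened spatial positions)
    deltas: dict mapping concept name -> per-position LoRA residual
    masks:  dict mapping concept name -> per-position 0/1 mask
    """
    out = list(base)
    for concept, delta in deltas.items():
        mask = masks[concept]
        for i, (d, m) in enumerate(zip(delta, mask)):
            out[i] += m * d  # positions outside the mask keep the base features
    return out


base = [1.0, 1.0, 1.0, 1.0]
deltas = {"subject_a": [0.5, 0.5, 0.5, 0.5], "subject_b": [-1.0, -1.0, -1.0, -1.0]}
masks = {"subject_a": [1, 1, 0, 0], "subject_b": [0, 0, 0, 1]}
edited = blend_lora_residuals(base, deltas, masks)
print(edited)
```

Because the masks are disjoint, the two concepts never write to the same position, which is one way to read the abstract's emphasis on disentangled masks and preserved global context.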

[139] Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch

Aneeshan Sain,Subhajit Maity,Pinaki Nath Chowdhury,Subhadeep Koley,Ayan Kumar Bhunia,Yi-Zhe Song

Main category: cs.CV

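TL;DR: The paper proposes two components for efficient inference on sketch data, a cross-modal knowledge distillation network and an RL-based canvas selector, which cut computation sharply (a 99.37% reduction in FLOPs) while retaining accuracy.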

Details

Motivation: Existing efficient lightweight models work for photos but not for sketches, and efficient inference designed for sketch data has not been studied. Method: A cross-modal knowledge distillation network and an RL-based canvas selector that adapts dynamically to the abstraction level of a sketch. Result: FLOPs fall by 99.37% (from 40.18G to 0.254G) with accuracy nearly unchanged (33.03% vs 32.77%). Conclusion: The work delivers an efficient network for sparse sketch data whose FLOPs are even lower than those of the best photo counterpart. Abstract: As sketch research has collectively matured over time, its adaptation for at-mass commercialisation emerges on the immediate horizon. Despite an already mature research endeavour for photos, there is no research on the efficient inference specifically designed for sketch data. In this paper, we first demonstrate existing state-of-the-art efficient light-weight models designed for photos do not work on sketches. We then propose two sketch-specific components which work in a plug-n-play manner on any photo efficient network to adapt them to work on sketch data. We specifically chose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator as the most recognised sketch problem with immediate commercial value. Technically speaking, we first propose a cross-modal knowledge distillation network to transfer existing photo efficient networks to be compatible with sketch, which brings down number of FLOPs and model parameters by 97.96% and 84.89% respectively. We then exploit the abstract trait of sketch to introduce a RL-based canvas selector that dynamically adjusts to the abstraction level which further cuts down number of FLOPs by two thirds. The end result is an overall reduction of 99.37% of FLOPs (from 40.18G to 0.254G) when compared with a full network, while retaining the accuracy (33.03% vs 32.77%), finally making an efficient network for the sparse sketch data that exhibits even fewer FLOPs than the best photo counterpart.
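
The reported reductions can be cross-checked with a few lines of arithmetic. The figures below are taken from the abstract; the variable names are illustrative, and the check assumes the two reductions compose multiplicatively.

```python
full_gflops = 40.18    # full network, from the abstract
final_gflops = 0.254   # after distillation and canvas selection, from the abstract
kd_reduction = 0.9796  # FLOPs reduction attributed to knowledge distillation

# Overall reduction implied by the two endpoint figures.
overall_pct = (1 - final_gflops / full_gflops) * 100

# FLOPs left after distillation alone, and the extra cut the canvas selector
# must contribute to reach the final figure.
after_kd_gflops = full_gflops * (1 - kd_reduction)
selector_cut = 1 - final_gflops / after_kd_gflops

print(round(overall_pct, 2), round(after_kd_gflops, 3), round(selector_cut, 2))
```

Under that assumption the selector's implied share is about 69%, consistent with the abstract's "two thirds" as a rounded figure.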

[140] MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Sihan Yang,Runsen Xu,Yiman Xie,Sizhe Yang,Mo Li,Jingli Lin,Chenming Zhu,Xiaochen Chen,Haodong Duan,Xiangyu Yue,Dahua Lin,Tai Wang,Jiangmiao Pang

Main category: cs.CV

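TL;DR: MMSI-Bench is a VQA benchmark dedicated to multi-image spatial intelligence, designed to evaluate MLLMs operating in the complex physical world. Experiments reveal a wide gap between existing models and human performance.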

Details

Motivation: Existing benchmarks probe only single-image relations and cannot assess the multi-image spatial reasoning that real-world deployments demand. Method: Six 3D-vision researchers spent more than 300 hours crafting 1,000 challenging multiple-choice questions from over 120,000 images, and 34 open-source and proprietary MLLMs were evaluated. Result: The strongest open-source model reaches roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. Conclusion: MMSI-Bench shows how challenging multi-image spatial reasoning is and provides an automated error analysis pipeline for future research. Abstract: Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench.

[141] Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Yunze Man,De-An Huang,Guilin Liu,Shiwei Sheng,Shilong Liu,Liang-Yan Gui,Jan Kautz,Yu-Xiong Wang,Zhiding Yu

Main category: cs.CV

TL;DR: Argus improves multimodal large language models on vision-centric tasks through a visual attention grounding mechanism.

Details

Motivation: Existing MLLMs perform poorly on tasks that require precise visual focus, so their visual attention mechanism needs improvement. Method: Uses object-centric grounding as visual chain-of-thought signals to strengthen goal-conditioned visual attention. Result: Argus performs strongly on both multimodal reasoning and referring object grounding tasks. Conclusion: Explicit language-guided engagement with visual regions of interest is important for MLLMs, and multimodal intelligence should be advanced from a vision-centric perspective. Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective. Project page: https://yunzeman.github.io/argus/

[142] TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

Yao Xiao,Qiqian Fu,Heyi Tao,Yuqun Wu,Zhen Zhu,Derek Hoiem

Main category: cs.CV

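TL;DR: TextRegion combines the strengths of image-text models and SAM2 to generate text-aligned region tokens that support detailed visual understanding while preserving open-vocabulary capabilities.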

Details

Motivation: Address the weakness of image-text models in detailed visual understanding while preserving their open-vocabulary capabilities. Method: Proposes the TextRegion framework, which combines image-text models with SAM2 to generate text-aligned region tokens. Result: Performs strongly on tasks such as open-world semantic segmentation, matching or exceeding existing training-free methods. Conclusion: TextRegion is practical, easily extensible, and applicable to a range of downstream tasks. Abstract: Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.

cs.GR [Back]

[143] Quality assessment of 3D human animation: Subjective and objective evaluation

Rim Rekik,Stefanie Wuhrer,Ludovic Hoyet,Katja Zibrek,Anne-Hélène Olivier

Main category: cs.GR

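TL;DR: Proposes a data-driven quality assessment method for virtual human animation: a user study yields a dataset on which a linear regression model is trained, outperforming an existing deep learning baseline.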

Details

Motivation: Quality assessment measures are lacking for animated virtual humans that are not generated with parametric body models, so a new evaluation method is needed. Method: (1) Generate a dataset of virtual human animations and collect subjective realism scores; (2) train a linear regression model on the dataset to predict those scores. Result: The linear regressor's predictions reach a correlation of 90% with the collected scores, outperforming an existing deep learning baseline. Conclusion: The proposed data-driven framework is an effective tool for assessing virtual human animation quality and clearly outperforms existing methods. Abstract: Virtual human animations have a wide range of applications in virtual and augmented reality. While automatic generation methods of animated virtual humans have been developed, assessing their quality remains challenging. Recently, approaches introducing task-oriented evaluation metrics have been proposed, leveraging neural network training. However, quality assessment measures for animated virtual humans that are not generated with parametric body models have yet to be developed. In this context, we introduce a first such quality assessment measure leveraging a novel data-driven framework. First, we generate a dataset of virtual human animations together with their corresponding subjective realism evaluation scores collected with a user study. Second, we use the resulting dataset to learn to predict perceptual evaluation scores. Results indicate that training a linear regressor on our dataset results in a correlation of 90%, which outperforms a state-of-the-art deep learning baseline.

[144] To Measure What Isn't There -- Visual Exploration of Missingness Structures Using Quality Metrics

Sara Johansson Fernstad,Sarah Alsufyani,Silvia Del Din,Alison Yarnall,Lynn Rochester

Main category: cs.GR

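TL;DR: Proposes a set of quality metrics for identifying and visually analyzing structured missingness in high-dimensional data, filling a gap in missing-data visualization research.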

Details

Motivation: Missing values are a common problem in high-dimensional data and can cause analysis issues. Structured missingness may point to problems in data collection or pre-processing, but may also reveal important data characteristics. Visualization can support a deeper understanding of missingness structures, yet related work is scarce and rarely addresses scalability. Method: Proposes a set of quality metrics for identifying and understanding structured missingness patterns, demonstrated on a use case from a real-life walking monitoring study. Result: The quality metrics can guide visual analysis, help explore missingness structures in high-dimensional data, and support decisions about data quality issues. Conclusion: The quality metrics fill a gap in missing-data visualization research and offer an effective tool for analyzing missingness structures in large, high-dimensional data. Abstract: This paper contributes a set of quality metrics for identification and visual analysis of structured missingness in high-dimensional data. Missing values in data are a frequent challenge in most data generating domains and may cause a range of analysis issues. Structural missingness in data may indicate issues in data collection and pre-processing, but may also highlight important data characteristics. While research into statistical methods for dealing with missing data is mainly focused on replacing missing values with plausible estimated values, visualization has great potential to support a more in-depth understanding of missingness structures in data. Nonetheless, while the interest in missing data visualization has increased in the last decade, it is still a relatively overlooked research topic with a comparably small number of publications, few of which address scalability issues. Efficient visual analysis approaches are needed to enable exploration of missingness structures in large and high-dimensional data, and to support informed decision-making in the context of potential data quality issues. This paper suggests a set of quality metrics for identification of patterns of interest for understanding of structural missingness in data. These quality metrics can be used as guidance in visual analysis, as demonstrated through a use case exploring structural missingness in data from a real-life walking monitoring study. All supplemental materials for this paper are available at https://doi.org/10.25405/data.ncl.c.7741829.

cs.CL [Back]

[145] Training Language Models to Generate Quality Code with Program Analysis Feedback

Feng Yao,Zilong Wang,Liyuan Liu,Junxia Cui,Li Zhong,Xiaohan Fu,Haohui Mai,Vish Krishnan,Jianfeng Gao,Jingbo Shang

Main category: cs.CL

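TL;DR: REAL is a reinforcement learning framework that uses program analysis and unit test feedback to incentivize large language models to generate high-quality code, addressing the scalability and effectiveness limits of existing methods.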

Details

Motivation: Existing methods (such as supervised fine-tuning and rule-based post-processing) depend on manual annotation or brittle heuristics and struggle to ensure code quality and security. Method: REAL combines program analysis (detecting security or maintainability defects) with unit tests (ensuring functional correctness), without manual intervention. Result: Experiments show that REAL outperforms existing methods in joint assessments of functionality and code quality. Conclusion: REAL bridges the gap between rapid prototyping and production-ready code, balancing speed and quality. Abstract: Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
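To make the idea of combining two automated signals concrete, here is a minimal sketch of a reward that mixes a static check with a unit test pass rate. It is not REAL's actual reward: the analyzer below is a toy stand-in that only counts missing type annotations, and the names `missing_annotations`, `unit_test_pass_rate`, `hybrid_reward`, and the `weight` parameter are all hypothetical.

```python
import ast
from typing import Any, Callable, Dict, List


def missing_annotations(source: str) -> int:
    """Toy program analysis: count unannotated parameters and return types."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            count += sum(1 for p in params if p.annotation is None)
            if node.returns is None:
                count += 1
    return count


def unit_test_pass_rate(
    source: str, tests: List[Callable[[Dict[str, Any]], bool]]
) -> float:
    """Run each test against the executed source; a crash counts as a failure."""
    namespace: Dict[str, Any] = {}
    exec(source, namespace)  # Unsafe for untrusted code; sandbox this in practice.
    passed = 0
    for test in tests:
        try:
            passed += bool(test(namespace))
        except Exception:
            pass
    return passed / len(tests)


def hybrid_reward(
    source: str, tests: List[Callable[[Dict[str, Any]], bool]], weight: float = 0.5
) -> float:
    """Hypothetical mix: `weight` on functionality, the rest on the quality check."""
    quality = 1.0 if missing_annotations(source) == 0 else 0.0
    return weight * unit_test_pass_rate(source, tests) + (1.0 - weight) * quality


typed = "def add(a: int, b: int) -> int:\n    return a + b\n"
untyped = "def add(a, b):\n    return a + b\n"
tests = [lambda ns: ns["add"](1, 2) == 3]
print(hybrid_reward(typed, tests), hybrid_reward(untyped, tests))
```

Both snippets pass the unit test, so only the quality term separates them. That is the property the abstract emphasizes: functionally correct code can still be penalized for maintainability defects, and neither signal needs a reference solution.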

[146] Climate Finance Bench

Rafik Mankour,Yassine Chafai,Hamada Saleh,Ghassen Ben Hassine,Thibaud Barreau,Peter Tankov

Main category: cs.CL

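TL;DR: Climate Finance Bench is an open benchmark for evaluating large language models on question answering over corporate climate disclosures, and it compares the performance of RAG approaches.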

Details

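Motivation: Provide standardized evaluation for question answering over corporate climate disclosures, and promote transparent carbon reporting in AI-for-climate applications. Method: Collects 33 English-language sustainability reports, annotates 330 expert-validated question-answer pairs, compares RAG approaches, and analyzes the retriever as a performance bottleneck. Result: The retriever's ability to locate the passages containing the answer is the main performance bottleneck; the authors also advocate techniques such as weight quantization to reduce carbon footprint. Conclusion: The benchmark offers an evaluation tool for question answering over climate disclosures and underlines the importance of carbon transparency in AI applications. Abstract: Climate Finance Bench introduces an open benchmark that targets question-answering over corporate climate disclosures using Large Language Models. We curate 33 recent sustainability reports in English drawn from companies across all 11 GICS sectors and annotate 330 expert-validated question-answer pairs that span pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we propose a comparison of RAG (retrieval-augmented generation) approaches. We show that the retriever's ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications, highlighting advantages of techniques such as Weight Quantization.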

[147] Pre-Training Curriculum for Multi-Token Prediction in Language Models

Ansar Aynetdinov,Alan Akbik

Main category: cs.CL

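TL;DR: Multi-token prediction (MTP) is a recent pre-training objective for language models; curriculum strategies (forward and reverse) help small models adapt to MTP and improve performance.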

Details

Motivation: Address the poor performance of small language models (SLMs) under the multi-token prediction (MTP) objective. Method: Proposes two curriculum learning strategies: a forward curriculum (moving gradually from NTP to MTP) and a reverse curriculum (moving gradually from MTP to NTP). Result: The forward curriculum improves downstream NTP performance and generation quality while retaining the benefits of self-speculative decoding; the reverse curriculum improves performance further but provides no self-speculative decoding benefit. Conclusion: The forward curriculum suits SLMs better, balancing performance and efficiency, while the reverse curriculum fits only specific needs. Abstract: Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next $k$ tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
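One way to picture the two curricula is as a schedule for how many prediction heads contribute to the loss at each training step. The sketch below assumes equal-length stages that add or remove one head at a time; the abstract does not specify the schedule, so the staging rule and the name `active_heads` are illustrative assumptions.

```python
def active_heads(step: int, total_steps: int, k_max: int, direction: str = "forward") -> int:
    """Number of prediction heads trained at `step` under a staged curriculum.

    forward: 1 head (plain NTP) growing to k_max heads (full MTP).
    reverse: k_max heads shrinking to 1 head.
    Stages are assumed to be of equal length; this is not taken from the paper.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    if direction not in ("forward", "reverse"):
        raise ValueError("direction must be 'forward' or 'reverse'")
    stage = step * k_max // total_steps  # integer in 0 .. k_max - 1
    return stage + 1 if direction == "forward" else k_max - stage


# Print the schedule at a few checkpoints for k_max = 4 heads.
for step in (0, 25, 50, 75, 99):
    forward = active_heads(step, 100, 4, "forward")
    reverse = active_heads(step, 100, 4, "reverse")
    print(f"step {step:3d}: forward={forward} reverse={reverse}")
```

Under this schedule the forward curriculum finishes training with all heads active, which matches the abstract's observation that it keeps the extra heads usable for self-speculative decoding, whereas the reverse curriculum ends on plain next-token prediction.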

[148] FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian

Sara Papi,Marco Gaido,Luisa Bentivogli,Alessio Brutti,Mauro Cettolo,Roberto Gretter,Marco Matassoni,Mohamed Nabih,Matteo Negri

Main category: cs.CL

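TL;DR: FAMA is the first family of speech foundation models built on open-source data, filling the open-science gap in speech; its performance is competitive with existing models and it runs faster.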

Details

Motivation: The closed nature of existing speech foundation models (such as Whisper and SeamlessM4T) makes reproducibility and fair evaluation difficult, and open-science efforts in speech remain limited. Method: Develops the FAMA model family, trained on more than 150k hours of open-source speech data, and introduces a 16k-hour dataset of cleaned and pseudo-labeled speech. Result: FAMA is competitive with existing models while being up to 8 times faster. Conclusion: FAMA and its open-source resources advance openness and transparency in speech technology research. Abstract: The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature--with inaccessible training data and code--poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.

[149] StressTest: Can YOUR Speech LM Handle the Stress?

Iddo Yosha,Gallil Maimon,Yossi Adi

Main category: cs.CL

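TL;DR: Introduces StressTest, a benchmark for evaluating how well speech-aware language models (SLMs) distinguish meanings based on sentence stress, and builds the synthetic dataset Stress17k to improve model performance.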

Details

Motivation: Sentence stress plays a key role in conveying intent and meaning, yet it is often overlooked in the evaluation and development of SLMs. Method: Introduces the StressTest benchmark to evaluate SLMs, proposes a synthetic data generation pipeline that produces the Stress17k dataset, and finetunes a model named StresSLM. Result: Existing SLMs perform poorly on stress tasks, while StresSLM significantly outperforms other models. Conclusion: Finetuning SLMs on synthetic data effectively improves their performance on sentence stress tasks. Abstract: Sentence stress refers to emphasis placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. It is often used to imply an underlying intention that is not explicitly stated. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio, allowing models to bypass transcription and access the full richness of the speech signal and perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in evaluation and development of such models. In this work, we address this gap by introducing StressTest, a benchmark specifically designed to evaluate a model's ability to distinguish between interpretations of spoken sentences based on the stress pattern. We assess the performance of several leading SLMs and find that, despite their overall capabilities, they perform poorly on such tasks. To overcome this limitation, we propose a novel synthetic data generation pipeline, and create Stress17k, a training set that simulates change of meaning implied by stress variation. Then, we empirically show that optimizing models with this synthetic dataset aligns well with real-world recordings and enables effective finetuning of SLMs. Results suggest that our finetuned model, StresSLM, significantly outperforms existing models on both sentence stress reasoning and detection tasks. Code, models, data, and audio samples are available at pages.cs.huji.ac.il/adiyoss-lab/stresstest.

[150] Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems

Christopher Ormerod

Main category: cs.CL

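TL;DR: Integrating feedback-oriented annotations (such as spelling and grammar errors and argumentative component markers) into automated essay scoring (AES) can improve scoring accuracy.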

Details

Motivation: Improve the accuracy of automated essay scoring by bringing feedback-driven annotations into the scoring pipeline. Method: Uses the PERSUADE corpus, integrating annotations for spelling and grammatical errors and for argumentative components, with two LLMs (a generative language model and an encoder-based token classifier) generating the annotations. Result: Incorporating the annotations into the scoring process improves the performance of encoder-based large language models fine-tuned as classifiers. Conclusion: Feedback-driven annotations can effectively improve automated essay scoring and show potential for real-world use. Abstract: This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). This approach is demonstrated with the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations -- a generative language model used for spell-correction and an encoder-based token classifier trained to identify and mark argumentative elements. By incorporating annotations into the scoring process, we demonstrate improvements in performance using encoder-based large language models fine-tuned as classifiers.

[151] Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages

Kaja Dobrovoljc

Main category: cs.CL

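TL;DR: Proposes a treebank-driven method that compares the syntactic structures of speech and writing using dependency-parsed corpora, finding notable differences in syntactic diversity, distribution, and modality specificity.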

Details

Motivation: Explore how speech and writing differ in syntactic structure in order to understand how modality shapes syntactic organization. Method: Takes a bottom-up, inductive approach, extracting delexicalized dependency subtrees from English and Slovenian Universal Dependencies treebanks and analyzing their size, diversity, and distribution. Result: Spoken corpora contain fewer and less diverse syntactic structures, and the overlap between spoken and written structures is limited, indicating modality-specific preferences. Conclusion: This scalable, language-independent framework offers a general method for systematically studying syntactic variation across corpora and lays groundwork for data-driven theories of grammar. Abstract: This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct demands of real-time interaction and elaborated writing. This contrast is further supported by a keyness analysis of the most frequent speech-specific structures, which highlights patterns associated with interactivity, context-grounding, and economy of expression. We argue that this scalable, language-independent framework offers a useful general method for systematically studying syntactic variation across corpora, laying the groundwork for more comprehensive data-driven theories of grammar in use.

[152] MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators

John Mendonça,Alon Lavie,Isabel Trancoso

Main category: cs.CL

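TL;DR: MEDAL is an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks, addressing the static, outdated, and monolingual nature of existing benchmark datasets.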

Details

Motivation: Existing benchmark datasets for chatbots and LLMs are largely static, outdated, and lacking in multilingual coverage, so they miss subtle linguistic and cultural variation and hold back further development. Method: Uses several state-of-the-art LLMs to generate multilingual user-chatbot dialogues conditioned on varied seed contexts, applies GPT-4.1 for a multidimensional performance analysis, and curates a new multilingual meta-evaluation benchmark. Result: Current LLMs struggle to detect nuanced issues (such as those involving empathy and reasoning), and the analysis reveals noticeable cross-lingual performance differences. Conclusion: The MEDAL framework can produce more comprehensive evaluation benchmarks, but current LLMs remain limited as evaluators of open-domain dialogue. Abstract: As the capabilities of chatbots and their underlying LLMs continue to dramatically improve, evaluating their performance has increasingly become a major blocker to their further development. A major challenge is the available benchmarking datasets, which are largely static, outdated, and lacking in multilingual coverage, limiting their ability to capture subtle linguistic and cultural variations. This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. We find that current LLMs struggle to detect nuanced issues, particularly those involving empathy and reasoning.

[153] Can Large Language Models Match the Conclusions of Systematic Reviews?

Christopher Polzak,Alejandro Lozano,Min Woo Sun,James Burgess,Yuhui Zhang,Kevin Wu,Serena Yeung-Levy

Main category: cs.CL

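TL;DR: Examines whether large language models (LLMs) can automate systematic reviews (SRs) and finds that current LLMs still cannot match clinical experts in evidence assessment and multi-document reasoning.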

Details

Motivation: With the explosive growth of scientific literature, demand for LLM-automated SR generation is rising, but the capability has not been adequately validated. Method: Presents the MedEvidence benchmark and compares 24 LLMs on 100 SRs, assessing the effects of reasoning, model size, and fine-tuning. Result: Reasoning does not necessarily improve performance, larger models do not consistently yield gains, knowledge-based fine-tuning lowers accuracy, and LLMs generally lack scientific skepticism. Conclusion: Current LLMs cannot yet reliably match the conclusions of expert-written SRs, and further research is needed. Abstract: Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, and medical specialist models across varying sizes (7B to 700B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.

[154] Towards a More Generalized Approach in Open Relation Extraction

Qing Wang,Yuepei Li,Qiao Qiao,Kang Zhou,Qi Li

Main category: cs.CL

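TL;DR: MixORE is a two-phase framework that jointly learns relation classification and clustering on unlabeled data mixing known and novel relations, clearly outperforming existing baselines.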

Details

Motivation: In real-world scenarios novel relations are arbitrarily distributed, whereas traditional OpenRE methods assume the data contains only novel relations or has been pre-divided, so they cannot be applied directly. Method: Proposes the MixORE framework, a two-phase approach combining relation classification and clustering to handle data that mixes known and novel relations. Result: On three benchmark datasets, MixORE outperforms baselines in both known relation classification and novel relation clustering. Conclusion: MixORE offers an effective solution for generalized OpenRE research and real-world applications. Abstract: Open Relation Extraction (OpenRE) seeks to identify and extract novel relational facts between named entities from unlabeled data without pre-defined relation schemas. Traditional OpenRE methods typically assume that the unlabeled data consists solely of novel relations or is pre-divided into known and novel instances. However, in real-world scenarios, novel relations are arbitrarily distributed. In this paper, we propose a generalized OpenRE setting that considers unlabeled data as a mixture of both known and novel instances. To address this, we propose MixORE, a two-phase framework that integrates relation classification and clustering to jointly learn known and novel relations. Experiments on three benchmark datasets demonstrate that MixORE consistently outperforms competitive baselines in known relation classification and novel relation clustering. Our findings contribute to the advancement of generalized OpenRE research and real-world applications.

[155] First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

Andrew Zhu,Evan Osgood,Chris Callison-Burch

Main category: cs.CL

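TL;DR: Proposes a new LLM interaction paradigm called "overhearing agents", which listen in on human conversations to provide background task support, with an empirical study on Dungeons & Dragons gameplay.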

Details

Motivation: Explore the potential of LLM agents in settings where they are not direct conversation partners, in particular providing assistance by listening to human conversations. Method: Uses large multimodal audio-language models as overhearing agents to assist a Dungeon Master, and examines their helpfulness through a human evaluation. Result: Some large audio-language models can use implicit audio cues to perform overhearing agent tasks. Conclusion: The overhearing agents paradigm shows promise, and the authors release tools and code to support further exploration. Abstract: Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation -- instead, they "listen in" on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.

Yingming Wang,Pepa Atanasova

Main category: cs.CL

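TL;DR: The SR-NLE framework improves the faithfulness of language model explanations through self-critique and iterative refinement, without external supervision.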

Details

Motivation: Natural language explanations (NLEs) from current large language models (LLMs) often fail to faithfully reflect the model's reasoning process, so their faithfulness needs improvement. Method: Proposes the SR-NLE framework, which iteratively refines explanations using natural language self-feedback and a novel feedback mechanism based on feature attribution. Result: Experiments show that SR-NLE significantly lowers unfaithfulness rates, with the best method reaching an average of 36.02% (versus 54.81% for the baseline). Conclusion: With appropriate feedback, LLMs can improve the faithfulness of their explanations without additional training or fine-tuning. Abstract: With the rapid development of large language models (LLMs), natural language explanations (NLEs) have become increasingly important for understanding model predictions. However, these explanations often fail to faithfully represent the model's actual reasoning process. While existing work has demonstrated that LLMs can self-critique and refine their initial outputs for various tasks, this capability remains unexplored for improving explanation faithfulness. To address this gap, we introduce Self-critique and Refinement for Natural Language Explanations (SR-NLE), a framework that enables models to improve the faithfulness of their own explanations -- specifically, post-hoc NLEs -- through an iterative critique and refinement process without external supervision. Our framework leverages different feedback mechanisms to guide the refinement process, including natural language self-feedback and, notably, a novel feedback approach based on feature attribution that highlights important input words. Our experiments across three datasets and four state-of-the-art LLMs demonstrate that SR-NLE significantly reduces unfaithfulness rates, with our best method achieving an average unfaithfulness rate of 36.02%, compared to 54.81% for the baseline -- an absolute reduction of 18.79%. These findings reveal that the investigated LLMs can indeed refine their explanations to better reflect their actual reasoning process, requiring only appropriate guidance through feedback without additional training or fine-tuning.
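The headline numbers are easy to verify, and it helps to separate the absolute reduction quoted in the abstract from the relative one, which the abstract does not state. The short check below derives both from the two reported rates; the relative figure is a derived quantity, not a claim made by the authors.

```python
# Average unfaithfulness rates quoted in the abstract, in percent.
BASELINE_RATE = 54.81
BEST_METHOD_RATE = 36.02

# Absolute reduction, in percentage points (the abstract's "18.79%").
absolute_reduction = BASELINE_RATE - BEST_METHOD_RATE

# Relative reduction: the share of baseline unfaithful explanations removed.
relative_reduction = 100.0 * absolute_reduction / BASELINE_RATE

print(f"absolute reduction: {absolute_reduction:.2f} percentage points")
print(f"relative reduction: {relative_reduction:.1f}% of the baseline rate")
```

The absolute figure matches the abstract, and the relative figure shows that roughly a third of the baseline's unfaithful explanations are removed.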

[157] What Has Been Lost with Synthetic Evaluation?

Alexander Gill,Abhilasha Ravichander,Ana Marasović

Main category: cs.CL

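TL;DR: Examines the feasibility of using large language models (LLMs) to generate evaluation benchmarks, finding that the approach is cheap and often valid but yields benchmarks that are less challenging for LLMs than human-authored ones.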

Details

Motivation: Assess whether LLM-generated data can meet the requirements of high-quality evaluation benchmarks: targeting specific phenomena, being challenging, and penalizing shortcut exploitation. Method: Through two case studies (the CondaQA and DROP datasets), compares LLM-generated benchmarks with human-authored ones in terms of validity and difficulty. Result: LLM-generated benchmarks come close to human-authored ones in validity but are less challenging and are more easily solved by LLMs themselves. Conclusion: The widespread practice of generating evaluation benchmarks with LLMs should be critically reassessed, since it may sacrifice difficulty. Abstract: Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning-over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.

[158] Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Arthur S. Bianchessi,Rodrigo C. Barros,Lucas S. Kupssinskü

Main category: cs.CL

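TL;DR: Proposes BAM, a theoretical framework that models positional encoding as a prior within a probabilistic model, unifying existing methods and substantially improving long-context generalization.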

Details

Motivation: Existing positional encoding methods lack theoretical clarity and rely on limited evaluation metrics to support their extrapolation claims. Method: Proposes the BAM framework, which treats positional encoding as a prior in a probabilistic model, and introduces a Generalized Gaussian positional prior. Result: BAM retrieves information accurately at 500 times the training context length, outperforming existing methods while keeping comparable perplexity and adding few parameters. Conclusion: BAM gives positional encoding a theoretical footing and substantially improves long-context generalization. Abstract: Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
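The "positional encoding as a prior" view can be illustrated with a small sketch: adding a log-prior over token distance to the attention logits before the softmax is equivalent to multiplying content-based attention by that prior and renormalizing. The parametrization below, an unnormalized generalized Gaussian `exp(-(|d| / scale) ** shape)`, and the function names are assumptions for illustration; the paper's exact formulation may differ.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def log_positional_prior(distance: int, scale: float, shape: float) -> float:
    """Unnormalized log of a generalized Gaussian prior over token distance."""
    return -((abs(distance) / scale) ** shape)


def attention_weights(
    logits: List[float], query_pos: int, scale: float, shape: float
) -> List[float]:
    """Softmax over content logits plus the log-prior for each key position."""
    biased = [
        logit + log_positional_prior(query_pos - key_pos, scale, shape)
        for key_pos, logit in enumerate(logits)
    ]
    return softmax(biased)


# Equal content scores, so the prior alone decides where attention goes.
weights = attention_weights([0.0, 0.0, 0.0], query_pos=2, scale=2.0, shape=1.0)
print([round(w, 3) for w in weights])
```

With `shape = 1` the bias is linear in distance, the same form as ALiBi's bias with slope `1 / scale`, and as `scale` grows the prior flattens toward no positional preference at all. That is one way to read the abstract's claim that the framework covers both ALiBi and NoPE as special cases.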

[159] LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference

Pingjun Hong,Beiduo Chen,Siyao Peng,Marie-Catherine de Marneffe,Barbara Plank

Main category: cs.CL

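TL;DR: Studies "within-label variation" in natural language inference (NLI), where annotators agree on a label but give different reasoning, introduces the LITEX taxonomy, and validates its usefulness for explanation generation.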

Details

Motivation: Address cases in NLI where annotators agree on a label but reason differently, and uncover the actual rationales behind labels. Method: Introduces the LITEX taxonomy, annotates a subset of the e-SNLI dataset, validates the taxonomy's reliability, and analyzes how it aligns with labels, highlights, and explanations. Result: Explanations generated with LITEX are linguistically closer to human explanations than those generated from labels or highlights alone. Conclusion: LITEX not only captures within-label variation but also narrows the gap between human and model explanations through taxonomy-guided generation. Abstract: There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation--cases where annotators agree on the same label but provide divergent reasoning--poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators' reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LITEX, a linguistically-informed taxonomy for categorizing free-text explanations. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy's reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy's usefulness in explanation generation, demonstrating that conditioning generation on LITEX yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.

[160] GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification

Iknoor Singh,Carolina Scarton,Kalina Bontcheva

Main category: cs.CL

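TL;DR: Proposes H3Prompt, a hierarchical three-step prompting method for multilingual news narrative classification, which ranked first on the English test set of SemEval 2025 Task 10 Subtask 2.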

Details

Motivation: The surge in online news and the spread of misinformation call for automatic data analysis methods, making narrative classification a key task. Method: Applies a hierarchical three-step prompting strategy in which a large language model (LLM) classifies a news article's domain, main narratives, and sub-narratives in turn. Result: The method ranked first on the English test set among 28 competing teams worldwide. Conclusion: H3Prompt performs strongly on multilingual narrative classification, and the code is open source. Abstract: The proliferation of online news and the increasing spread of misinformation necessitate robust methods for automatic data analysis. Narrative classification is emerging as an important task, since identifying what is being said online is critical for fact-checkers, policy makers and other professionals working on information studies. This paper presents our approach to SemEval 2025 Task 10 Subtask 2, which aims to classify news articles into a pre-defined two-level taxonomy of main narratives and sub-narratives across multiple languages. We propose Hierarchical Three-Step Prompting (H3Prompt) for multilingual narrative classification. Our methodology follows a three-step Large Language Model (LLM) prompting strategy, where the model first categorises an article into one of two domains (Ukraine-Russia War or Climate Change), then identifies the most relevant main narratives, and finally assigns sub-narratives. Our approach secured the top position on the English test set among 28 competing teams worldwide. The code is available at https://github.com/GateNLP/H3Prompt.
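The three-step strategy described above can be sketched as a small pipeline in which each step narrows the label set offered to the next. Everything here is illustrative: `ask` stands in for an LLM call and is a hypothetical callable, the prompt wording is invented, and the toy taxonomy entries are placeholders rather than the task's real labels. See the authors' repository for the actual implementation.

```python
from typing import Callable, Dict, List, Tuple

# domain -> main narrative -> sub-narratives
Taxonomy = Dict[str, Dict[str, List[str]]]
# Hypothetical LLM interface: (prompt, allowed options) -> chosen options.
Ask = Callable[[str, List[str]], List[str]]


def classify(article: str, taxonomy: Taxonomy, ask: Ask) -> Tuple[str, Dict[str, List[str]]]:
    """Step 1: pick a domain. Step 2: pick main narratives. Step 3: pick sub-narratives."""
    domain = ask(f"Which domain does this article belong to?\n\n{article}", list(taxonomy))[0]
    mains = ask(
        f"Which main narratives of '{domain}' appear in this article?\n\n{article}",
        list(taxonomy[domain]),
    )
    narratives: Dict[str, List[str]] = {}
    for main in mains:
        # Only the sub-narratives of an already selected main narrative are offered.
        narratives[main] = ask(
            f"Which sub-narratives of '{main}' appear in this article?\n\n{article}",
            taxonomy[domain][main],
        )
    return domain, narratives


# Placeholder taxonomy and a stub "model" that always picks the first option.
toy_taxonomy: Taxonomy = {
    "Climate Change": {"Main A": ["Sub A1", "Sub A2"], "Main B": ["Sub B1"]},
    "Ukraine-Russia War": {"Main C": ["Sub C1"]},
}
first_option: Ask = lambda prompt, options: options[:1]
print(classify("Example article text.", toy_taxonomy, first_option))
```

The point of the hierarchy is that the model never has to choose among all sub-narratives at once: each prompt presents a short, relevant option list, which keeps prompts compact and rules out label combinations that are inconsistent with the taxonomy.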

[161] When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy

Jirui Qi,Shan Chen,Zidi Xiong,Raquel Fernández,Danielle S. Bitterman,Arianna Bisazza

Main category: cs.CL

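TL;DR: The study finds that current large reasoning models (LRMs) fall substantially short in multilingual reasoning, particularly in non-English languages. Prompt interventions and a small amount of targeted training help, but a trade-off between accuracy and readability remains.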

Details

Motivation: Evaluate how LRMs perform on multilingual reasoning, since users need reasoning traces in their own language for effective oversight. Method: Comprehensively evaluates two leading LRM families on the XReasoning benchmark, and applies prompt interventions and targeted training (100 examples) to improve multilingual reasoning. Result: Even the most advanced models often revert to English or produce fragmented reasoning; prompt interventions improve readability but reduce accuracy, and targeted training partially mitigates the problem. Conclusion: Current LRMs have limited multilingual reasoning ability, and further work is needed to balance accuracy and readability. Abstract: Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.

[162] VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

Chahat Raj,Bowen Wei,Aylin Caliskan,Antonios Anastasopoulos,Ziwei Zhu

Main category: cs.CL

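TL;DR: VIGNETTE is a large-scale VQA benchmark for evaluating bias in vision-language models (VLMs) along four directions (factuality, perception, stereotyping, and decision making), revealing how models construct social meaning from visual identity cues.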

Details

Motivation: Existing VLM bias studies mostly focus on portrait-style images and gender-occupation associations, overlooking broader, more complex social stereotypes and their potential harm. Method: Evaluates VLM bias through a VQA framework with 30M+ images, drawing on social psychology to analyze how models infer traits and capabilities from visual cues. Result: VLMs exhibit subtle, multifaceted, and surprising stereotypical patterns, showing how models encode social hierarchies and make discriminatory selections. Conclusion: VIGNETTE offers a new perspective on how VLMs construct social meaning from inputs and underlines the importance of more comprehensive bias evaluation. Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.

[163] Talent or Luck? Evaluating Attribution Bias in Large Language Models

Chahat Raj,Mahika Banerjee,Aylin Caliskan,Antonios Anastasopoulos,Ziwei Zhu

Main category: cs.CL

TL;DR: The paper examines how humans and LLMs attribute event outcomes and proposes a cognitively grounded bias evaluation framework.

Details

Motivation: Study how Attribution Theory applies to LLMs, revealing how models attribute event outcomes based on demographics and what this implies for fairness. Method: Proposes a cognitively grounded bias evaluation framework that analyzes how disparities in model reasoning lead to bias against particular demographic groups. Result: LLMs show demographic-based bias when attributing event outcomes. Conclusion: The study stresses the importance of evaluating and addressing attribution bias in LLMs to ensure fairness. Abstract: When a student fails an exam, do we tend to blame their effort or the test's difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal (e.g., effort, ability) or external (e.g., task difficulty, luck) factors. LLMs' attribution of event outcomes based on demographics carries important fairness implications. Most works exploring social biases in LLMs focus on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how models' reasoning disparities channel biases toward demographic groups.

[164] ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room

Nikita Mehandru,Niloufar Golchini,David Bamman,Travis Zack,Melanie F. Molina,Ahmed Alaa

Main category: cs.CL

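TL;DR: ER-Reason is a benchmark for evaluating the clinical reasoning and decision-making of large language models in the emergency room; it contains data from 3,984 patients and 25,174 clinical notes, and reveals a gap between model and physician reasoning.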

Details

Motivation: Existing evaluations often depend on costly human annotation and focus on isolated tasks, failing to reflect clinical reasoning or the full medical decision workflow, especially in the high-stakes emergency room setting. Method: Introduces the ER-Reason benchmark, which covers key stages of the ER workflow (such as triage and treatment selection), and collects 72 physician-authored rationales that mimic the teaching process. Result: Evaluations show a clear gap between the clinical reasoning of state-of-the-art LLMs and that of physicians for ER decisions. Conclusion: Future research needs to close the gap between model and physician clinical reasoning and improve model applicability in high-stakes settings such as the ER. Abstract: Large language models (LLMs) have been extensively evaluated on medical question answering tasks based on licensing exams. However, real-world evaluations often depend on costly human annotators, and existing benchmarks tend to focus on isolated tasks that rarely capture the clinical reasoning or full workflow underlying medical decisions. In this paper, we introduce ER-Reason, a benchmark designed to evaluate LLM-based clinical reasoning and decision-making in the emergency room (ER)--a high-stakes setting where clinicians make rapid, consequential decisions across diverse patient presentations and medical specialties under time pressure. ER-Reason includes data from 3,984 patients, encompassing 25,174 de-identified longitudinal clinical notes spanning discharge summaries, progress notes, history and physical exams, consults, echocardiography reports, imaging notes, and ER provider documentation. The benchmark includes evaluation tasks that span key stages of the ER workflow: triage intake, initial assessment, treatment selection, disposition planning, and final diagnosis--each structured to reflect core clinical reasoning processes such as differential diagnosis via rule-out reasoning. We also collected 72 full physician-authored rationales explaining reasoning processes that mimic the teaching process used in residency training and that are typically absent from ER documentation. Evaluations of state-of-the-art LLMs on ER-Reason reveal a gap between LLM-generated and clinician-authored clinical reasoning for ER decisions, highlighting the need for future research to bridge this divide.

[165] Structured Memory Mechanisms for Stable Context Representation in Large Language Models

Yue Xing,Tao Yang,Yijiashun Qi,Minggu Wei,Yu Cheng,Honghui Xin

Main category: cs.CL

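TL;DR: The paper proposes a model architecture with a long-term memory mechanism to address the limitations of large language models in understanding long-term context, and validates it experimentally.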

Details

Motivation: Address the context loss and semantic drift that traditional language models commonly suffer when handling long-term dependencies. Method: The model integrates explicit memory units, a gated writing mechanism, and an attention-based reading module, with a joint training objective designed to optimize memory strategies. Result: The model shows advantages in text generation consistency, multi-turn question answering stability, and cross-context reasoning accuracy, especially in long-text tasks and complex question answering scenarios. Conclusion: Memory mechanisms play a critical role in language understanding, and both the architectural design and the performance results support the feasibility and effectiveness of the approach. Abstract: This paper addresses the limitations of large language models in understanding long-term context. It proposes a model architecture equipped with a long-term memory mechanism to improve the retention and retrieval of semantic information across paragraphs and dialogue turns. The model integrates explicit memory units, gated writing mechanisms, and attention-based reading modules. A forgetting function is introduced to enable dynamic updates of memory content, enhancing the model's ability to manage historical information. To further improve the effectiveness of memory operations, the study designs a joint training objective. This combines the main task loss with constraints on memory writing and forgetting. It guides the model to learn better memory strategies during task execution. Systematic evaluation across multiple subtasks shows that the model achieves clear advantages in text generation consistency, stability in multi-turn question answering, and accuracy in cross-context reasoning. In particular, the model demonstrates strong semantic retention and contextual coherence in long-text tasks and complex question answering scenarios. It effectively mitigates the context loss and semantic drift problems commonly faced by traditional language models when handling long-term dependencies. The experiments also include analysis of different memory structures, capacity sizes, and control strategies. These results further confirm the critical role of memory mechanisms in language understanding. They demonstrate the feasibility and effectiveness of the proposed approach in both architectural design and performance outcomes.

[166] Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging

Haobo Zhang,Jiayu Zhou

Main category: cs.CL

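TL;DR: The paper proposes OSRM, which constrains the LoRA subspace to improve model merging, reducing interference between tasks while preserving single-task accuracy.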

Details

Motivation: Fine-tuning large language models (LMs) per task is costly to deploy and store, and existing merging methods work poorly for LoRA-fine-tuned models, so the interaction between parameters and data distributions needs to be addressed. Method: Proposes OSRM, which constrains the LoRA subspace before fine-tuning to avoid interference between tasks and is compatible with existing merging algorithms. Result: On eight datasets, OSRM improves merging performance, preserves single-task accuracy, and is more robust to merging hyperparameters. Conclusion: OSRM offers a plug-and-play solution for merging LoRA models and highlights the importance of data-parameter interaction. Abstract: Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), due to significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace *prior* to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.

[167] Improving QA Efficiency with DistilBERT: Fine-Tuning and Inference on mobile Intel CPUs

Ngeyen Yinkfu

Main category: cs.CL

TL;DR: The study presents an efficient transformer-based question-answering model optimized for a 13th Gen Intel i7-1355U CPU and evaluated on SQuAD v1.1.

Details

Motivation: Develop an efficient question-answering model that runs in real time on resource-constrained systems, balancing accuracy and computational efficiency. Method: Uses data augmentation, exploratory data analysis, and fine-tuning of a DistilBERT architecture, with a systematic evaluation of data augmentation strategies and hyperparameter configurations. Result: A validation F1 score of 0.6536 with an average inference time of 0.1208 seconds per question, a favorable accuracy-efficiency trade-off compared with a rule-based baseline (F1: 0.3124) and full BERT-based models. Conclusion: The model achieves a good balance between accuracy and efficiency for CPU inference and is suitable for real-time applications. Abstract: This study presents an efficient transformer-based question-answering (QA) model optimized for deployment on a 13th Gen Intel i7-1355U CPU, using the Stanford Question Answering Dataset (SQuAD) v1.1. Leveraging exploratory data analysis, data augmentation, and fine-tuning of a DistilBERT architecture, the model achieves a validation F1 score of 0.6536 with an average inference time of 0.1208 seconds per question. Compared to a rule-based baseline (F1: 0.3124) and full BERT-based models, our approach offers a favorable trade-off between accuracy and computational efficiency. This makes it well-suited for real-time applications on resource-constrained systems. The study includes systematic evaluation of data augmentation strategies and hyperparameter configurations, providing practical insights into optimizing transformer models for CPU-based inference.
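
For readers unfamiliar with the F1 metric quoted above, the following is a simplified sketch of token-overlap F1 as commonly used for extractive QA. It assumes whitespace tokenization and lowercasing only; the official SQuAD evaluation script additionally strips punctuation and articles, and it is an assumption that the study used that script.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall (simplified)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score is then typically the mean of this value over all questions, taking the maximum over reference answers when a question has several.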

[168] WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning

Yuchen Zhuang,Di Jin,Jiaao Chen,Wenqi Shi,Hanrui Wang,Chao Zhang

Main category: cs.CL

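TL;DR: WorkForceAgent-R1 is an LLM-based web agent that improves single-step reasoning through an R1-style reinforcement learning framework and substantially outperforms SFT baselines.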

Details

Motivation: Existing SFT-based web agents generalize poorly and lack robustness in dynamic web interactions, and need stronger reasoning capabilities. Method: Uses a rule-based R1-style reinforcement learning framework with a structured reward function, requiring no explicit annotations or extensive expert demonstrations. Result: On the WorkArena benchmark, performance improves over SFT baselines by 10.26-16.59% and is competitive with GPT-4o. Conclusion: WorkForceAgent-R1 performs well on business-oriented web navigation tasks, supporting the effectiveness of the reinforcement learning framework. Abstract: Large language model (LLM)-empowered web agents enable automating complex, real-time web navigation tasks in enterprise environments. However, existing web agents relying on supervised fine-tuning (SFT) often struggle with generalization and robustness due to insufficient reasoning capabilities when handling the inherently dynamic nature of web interactions. In this study, we introduce WorkForceAgent-R1, an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework designed explicitly to enhance single-step reasoning and planning for business-oriented web navigation tasks. We employ a structured reward function that evaluates both adherence to output formats and correctness of actions, enabling WorkForceAgent-R1 to implicitly learn robust intermediate reasoning without explicit annotations or extensive expert demonstrations. Extensive experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines by 10.26-16.59%, achieving competitive performance relative to proprietary LLM-based agents (gpt-4o) in workplace-oriented web navigation tasks.
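
A rule-based reward that scores both output format and action correctness, as the abstract describes, might look roughly like the sketch below. The `<think>`/`<action>` tag layout, the exact-match comparison, and the `format_weight` value are assumptions for illustration; the abstract does not specify them.

```python
import re

# Hypothetical output format: reasoning in <think>, the chosen step in <action>.
PATTERN = re.compile(r"^<think>(.+?)</think>\s*<action>(.+?)</action>$", re.DOTALL)


def reward(output: str, gold_action: str, format_weight: float = 0.2) -> float:
    """Return 0 for malformed output, partial credit for format, full for a correct action."""
    match = PATTERN.match(output.strip())
    if match is None:
        return 0.0  # format violated: no reward at all
    action = match.group(2).strip()
    correct = 1.0 if action == gold_action.strip() else 0.0
    return format_weight + (1.0 - format_weight) * correct
```

Because only the final action is checked, the content of the reasoning span is never supervised directly, which matches the abstract's point that intermediate reasoning is learned implicitly.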

[169] Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Jaewoo Ahn,Heeseung Yun,Dayoon Ko,Gunhee Kim

Main category: cs.CL

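TL;DR: The paper proposes the Multimodal Adversarial Compositionality (MAC) benchmark, which uses large language models to generate deceptive text that probes compositional vulnerabilities in multimodal representations, and improves zero-shot performance through self-training.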

Details

Motivation: Pre-trained multimodal representations such as CLIP show strong capabilities but have compositional vulnerabilities that lead to counterintuitive judgments; the study aims to expose and address them. Method: Proposes the MAC benchmark, which uses LLMs to generate deceptive text, and a self-training approach (rejection-sampling fine-tuning with diversity-promoting filtering) that raises attack success rate and sample diversity. Result: With smaller language models such as Llama-3.1-8B, the approach performs strongly at revealing compositional vulnerabilities across multimodal representations (image, video, audio). Conclusion: The MAC benchmark and self-training approach effectively expose compositional vulnerabilities in multimodal representations and improve on zero-shot methods. Abstract: While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audio.

[170] OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature

Alisha Srivastava,Emir Korukluoglu,Minh Nhat Le,Duyen Tran,Chau Minh Pham,Marzena Karpinska,Mohit Iyyer

Main category: cs.CL

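TL;DR: The study investigates multilingual and cross-lingual memorization in large language models (LLMs) and finds that models can recall content across languages even when the text has no direct translation in the pretraining data.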

Details

Motivation: LLMs are known to memorize English text, but how far this extends to non-English languages or transfers across languages is unclear. Method: Uses the OWL dataset (31.5K aligned excerpts in ten languages) and three tasks (direct probing, name cloze, prefix probing) to evaluate memorization. Result: LLMs recall content across languages; for example, GPT-4o identifies authors and titles 69% of the time in newly translated excerpts. Perturbations such as shuffling words modestly reduce accuracy. Conclusion: The study shows the extent of cross-lingual memorization in LLMs and offers insights into differences between models. Abstract: Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book's title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts without direct translation in pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking characters, shuffling words) modestly reduce direct probing accuracy (7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and provide insights on the differences between the models.

[171] NegVQA: Can Vision Language Models Understand Negation?

Yuhui Zhang,Yuchang Su,Yiming Liu,Serena Yeung-Levy

Main category: cs.CL

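TL;DR: NegVQA is a new visual question answering benchmark for evaluating how well vision language models understand negation; existing models show a substantial performance drop on negated questions.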

Details

Motivation: Assess how well vision language models understand negation, a fundamental linguistic phenomenon that can entirely reverse a sentence's meaning. Method: Uses large language models to generate negated versions of questions from existing VQA datasets, building the NegVQA benchmark of 7,379 questions. Result: Across 20 state-of-the-art vision language models, performance drops substantially on negated questions and follows a U-shaped scaling trend. Conclusion: NegVQA exposes critical gaps in negation understanding and offers direction for future model development. Abstract: Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.

[172] StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs

Haohan Yuan,Sukhwa Hong,Haopeng Zhang

Main category: cs.CL

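TL;DR: StrucSum is a training-free prompting framework that uses sentence-level graph structures to strengthen LLMs in zero-shot summarization, improving summary quality and factual consistency.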

Details

Motivation: Large language models (LLMs) perform well in zero-shot summarization but struggle to model document structure and identify salient information in long texts. Method: StrucSum injects structural signals through three strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Result: Experiments on ArXiv, PubMed, and Multi-News show that StrucSum improves both summary quality and factual consistency over unsupervised baselines and vanilla prompting; on ArXiv, FactCC and SummaC rise by 19.2 and 9.7 points, respectively. Conclusion: Structure-aware prompting is a simple and effective approach to zero-shot extractive summarization that needs no training or task-specific tuning. Abstract: Large language models (LLMs) have shown strong performance in zero-shot summarization, but often struggle to model document structure and identify salient information in long texts. In this work, we introduce StrucSum, a training-free prompting framework that enhances LLM reasoning through sentence-level graph structures. StrucSum injects structural signals into prompts via three targeted strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Experiments on ArXiv, PubMed, and Multi-News demonstrate that StrucSum consistently improves both summary quality and factual consistency over unsupervised baselines and vanilla prompting. Notably, on ArXiv, it boosts FactCC and SummaC by 19.2 and 9.7 points, indicating stronger alignment between summaries and source content. These findings suggest that structure-aware prompting is a simple yet effective approach for zero-shot extractive summarization with LLMs, without any training or task-specific tuning.
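
To make the idea of centrality-aware prompting concrete, here is a minimal sketch that scores sentences by weighted degree in a similarity graph and then keeps only the most central ones in the prompt. The Jaccard word-overlap similarity, the `threshold`, the `top_k` cutoff, and the prompt wording are all assumptions; the abstract does not say how StrucSum builds its graph or formats its prompts.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def centrality_scores(sentences: List[str], threshold: float = 0.1) -> List[float]:
    """Weighted degree of each sentence in a word-overlap similarity graph."""
    bags = [set(s.lower().split()) for s in sentences]
    scores = [0.0] * len(bags)
    for i in range(len(bags)):
        for j in range(i + 1, len(bags)):
            weight = jaccard(bags[i], bags[j])
            if weight >= threshold:  # keep only sufficiently similar pairs as edges
                scores[i] += weight
                scores[j] += weight
    return scores


def build_prompt(sentences: List[str], top_k: int = 2) -> str:
    """Keep the top_k most central sentences, in document order, with their scores."""
    scores = centrality_scores(sentences)
    keep = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:top_k])
    lines = [f"[{i}] (centrality={scores[i]:.2f}) {sentences[i]}" for i in keep]
    return "Select the most salient sentences:\n" + "\n".join(lines)
```

Dropping low-centrality sentences before prompting shortens the input, which is the general motivation behind a masking step such as CGM.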

[173] LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments

Matteo Guida,Yulia Otmakhova,Eduard Hovy,Lea Frermann

Main category: cs.CL

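TL;DR: The paper evaluates four state-of-the-art large language models (LLMs) on three argument mining tasks and finds strong overall performance on online comments, with systematic weaknesses on long comments and emotionally charged language.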

Details

Motivation: Explore how well LLMs detect and analyze pre-defined arguments on contested topics such as abortion, an underexplored application in online comments. Method: A quantitative evaluation of four LLMs on three argument mining tasks, using datasets of over 2,000 comments across six polarizing topics. Result: Large and fine-tuned LLMs perform strongly, but they show weaknesses on long comments and emotionally charged language and carry a significant environmental cost. Conclusion: LLMs show promise for automated argument analysis but still need improvement to handle complex language and emotionally charged content. Abstract: Automated large-scale analysis of public discussions around contested issues like abortion requires detecting and understanding the use of arguments. While Large Language Models (LLMs) have shown promise in language processing tasks, their performance in mining topic-specific, pre-defined arguments in online comments remains underexplored. We evaluate four state-of-the-art LLMs on three argument mining tasks using datasets comprising over 2,000 opinion comments across six polarizing topics. Quantitative evaluation suggests an overall strong performance across the three tasks, especially for large and fine-tuned LLMs, albeit at a significant environmental cost. However, a detailed error analysis revealed systematic shortcomings on long and nuanced comments and emotionally charged language, raising concerns for downstream applications like content moderation or opinion analysis. Our results highlight both the promise and current limitations of LLMs for automated argument analysis in online comments.

[174] LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements

Jianwei Wang,Mengqi Wang,Yinsi Zhou,Zhenchang Xing,Qing Liu,Xiwei Xu,Wenjie Zhang,Liming Zhu

Main category: cs.CL

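TL;DR: HSE-Bench is the first benchmark dataset for evaluating large language models (LLMs) on HSE compliance assessment, with over 1,000 questions and an IRAC reasoning flow. Current LLMs rely on semantic matching rather than principled reasoning, and the paper proposes the RoE prompting technique to improve this.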

Details

Motivation: Explore the potential of LLMs for HSE compliance assessment and fill the research gap on their domain knowledge and structured legal reasoning. Method: Builds the HSE-Bench dataset, evaluates LLMs with an IRAC reasoning flow, and proposes the RoE prompting technique to simulate expert reasoning. Result: Current LLMs perform well but rely on semantic matching and lack systematic legal reasoning; RoE guides them toward a more accurate unified decision. Conclusion: The study exposes reasoning gaps in LLMs for HSE compliance assessment, and RoE offers a direction for improvement. Abstract: Health, Safety, and Environment (HSE) compliance assessment demands dynamic real-time decision-making under complicated regulations and complex human-machine-environment interactions. While large language models (LLMs) hold significant potential for decision intelligence and contextual dialogue, their capacity for domain-specific knowledge in HSE and structured legal reasoning remains underexplored. We introduce HSE-Bench, the first benchmark dataset designed to evaluate the HSE compliance assessment capabilities of LLMs. HSE-Bench comprises over 1,000 manually curated questions drawn from regulations, court cases, safety exams, and fieldwork videos, and integrates a reasoning flow based on Issue spotting, rule Recall, rule Application, and rule Conclusion (IRAC) to assess the holistic reasoning pipeline. We conduct extensive evaluations on different prompting strategies and more than 10 LLMs, including foundation models, reasoning models and multimodal vision models. The results show that, although current LLMs achieve good performance, their capabilities largely rely on semantic matching rather than principled reasoning grounded in the underlying HSE compliance context. Moreover, their native reasoning trace lacks the systematic legal reasoning required for rigorous HSE compliance assessment. To alleviate these issues, we propose a new prompting technique, Reasoning of Expert (RoE), which guides LLMs to simulate the reasoning process of different experts for compliance assessment and reach a more accurate unified decision. We hope our study highlights reasoning gaps in LLMs for HSE compliance and inspires further research on related tasks.

[175] ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

Peixuan Han,Zijia Liu,Jiaxuan You

Main category: cs.CL

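TL;DR: The paper proposes ToMAP, which integrates two theory of mind modules to improve an LLM persuader's awareness and analysis of the opponent's mental state, substantially improving persuasion.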

Details

Motivation: Existing LLMs cannot dynamically model an opponent's mental state in persuasion tasks, which limits diversity and opponent awareness. Method: ToMAP combines theory of mind modules with reinforcement learning, predicting the opponent's stance in order to generate more effective arguments. Result: With only 3B parameters, ToMAP outperforms much larger models such as GPT-4o with a relative gain of 39.4%, and produces more diverse and logical arguments. Conclusion: ToMAP is effective for persuasion tasks and well suited to long conversations and logical, opponent-aware strategies. Abstract: Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader to learn how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: https://github.com/ulab-uiuc/ToMAP.

[176] Exploring Scaling Laws for EHR Foundation Models

Sheng Zhang,Qin Liu,Naoto Usuyama,Cliff Wong,Tristan Naumann,Hoifung Poon

Main category: cs.CL

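TL;DR: The paper presents the first exploration of scaling laws for electronic health records (EHRs), finding behavior analogous to large language models (LLMs) and offering predictive insights for resource-efficient training.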

Details

Motivation: Electronic health records (EHRs) are a rich and structurally distinctive data source, but their scaling laws have not been studied, even though they matter for building efficient EHR foundation models. Method: Trains transformer architectures on the MIMIC-IV database across varying model sizes and compute budgets and analyzes the scaling behavior of EHR models. Result: Finds scaling patterns similar to those of LLMs, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. Conclusion: The results lay the groundwork for efficient EHR foundation models and may advance clinical prediction tasks and personalized healthcare. Abstract: The emergence of scaling laws has profoundly shaped the development of large language models (LLMs), enabling predictable performance gains through systematic increases in model size, dataset volume, and compute. Yet, these principles remain largely unexplored in the context of electronic health records (EHRs) -- a rich, sequential, and globally abundant data source that differs structurally from natural language. In this work, we present the first empirical investigation of scaling laws for EHR foundation models. By training transformer architectures on patient timeline data from the MIMIC-IV database across varying model sizes and compute budgets, we identify consistent scaling patterns, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. These findings demonstrate that EHR models exhibit scaling behavior analogous to LLMs, offering predictive insights into resource-efficient training strategies. Our results lay the groundwork for developing powerful EHR foundation models capable of transforming clinical prediction tasks and advancing personalized healthcare.
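
A power-law relationship of the kind mentioned above, loss ≈ a · C^b with b < 0, becomes a straight line in log-log space, so it can be fitted with ordinary least squares. The sketch below shows that fit; the function name and the data points are synthetic and purely illustrative, not values from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(compute: Sequence[float], loss: Sequence[float]) -> Tuple[float, float]:
    """Fit loss = a * compute**b by linear regression on (log compute, log loss)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the log-log line is the power-law exponent b.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept of the log-log line is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic example: points generated from loss = 5.0 * C**-0.05 (made-up values).
compute_budgets = [1e15, 1e16, 1e17, 1e18]
losses = [5.0 * c ** -0.05 for c in compute_budgets]
a, b = fit_power_law(compute_budgets, losses)
```

With a fitted exponent in hand, one can extrapolate the expected loss at a larger compute budget, which is the sense in which scaling laws give predictive insight into training cost.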

[177] Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation

Hoang Pham,Thanh-Do Nguyen,Khac-Hoai Nam Bui

Main category: cs.CL

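TL;DR: VeGraph is a new LLM-based framework that verifies complex claims in three phases (graph representation, entity disambiguation, and verification) and performs competitively on the HoVer and FEVEROUS benchmarks.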

Details

Motivation: Traditional approaches to complex claim verification are limited by weak entity disambiguation, so LLM reasoning is needed to improve accuracy and explainability. Method: VeGraph has three phases: graph representation (decomposing the claim into triplets), entity disambiguation (interacting with a knowledge base to resolve entities), and verification (completing the fact check). Result: Experiments with Meta-Llama-3-70B show that VeGraph is competitive with baselines on the HoVer and FEVEROUS benchmarks. Conclusion: VeGraph addresses complex claim verification using LLMs and graph structure, and the code and data are publicly available. Abstract: Claim verification is a long-standing and challenging task that demands not only high accuracy but also explainability of the verification process. This task has become an emerging research issue in the era of large language models (LLMs) since real-world claims are often complex, featuring intricate semantic structures or obfuscated entities. Traditional approaches typically address this by decomposing claims into sub-claims and querying a knowledge base to resolve hidden or ambiguous entities. However, the absence of effective disambiguation strategies for these entities can compromise the entire verification process. To address these challenges, we propose Verify-in-the-Graph (VeGraph), a novel framework leveraging the reasoning and comprehension abilities of LLM agents. VeGraph operates in three phases: (1) Graph Representation - an input claim is decomposed into structured triplets, forming a graph-based representation that integrates both structured and unstructured information; (2) Entity Disambiguation - VeGraph iteratively interacts with the knowledge base to resolve ambiguous entities within the graph for deeper sub-claim verification; and (3) Verification - remaining triplets are verified to complete the fact-checking process. Experiments using Meta-Llama-3-70B (instruct version) show that VeGraph achieves competitive performance compared to baselines on two benchmarks, HoVer and FEVEROUS, effectively addressing claim verification challenges. Our source code and data are available for further exploitation.

[178] DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Yize Cheng,Wenxiao Wang,Mazda Moayeri,Soheil Feizi

Main category: cs.CL

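TL;DR: DyePack is a framework that uses backdoor attacks to detect whether a model was trained on benchmark test sets, without access to model internals and with provable protection against false accusations.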

Details

Motivation: Open benchmarks are easily contaminated, and a method is needed to detect contamination without relying on model internals. Method: DyePack mixes backdoor samples into the test data and uses multiple backdoors with stochastic targets, which allows the false positive rate to be computed exactly. Result: Across multiple datasets and tasks, DyePack detects all contaminated models with very low false positive rates. Conclusion: DyePack offers an efficient and reliable way to detect benchmark test set contamination across a range of tasks. Abstract: Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
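
One way to see why stochastic backdoor targets permit an exact false positive rate is the binomial-tail sketch below. It rests on assumptions not spelled out in the abstract: each backdoor's target is drawn uniformly from `num_targets` options, a clean model matches each target independently with probability 1/`num_targets`, and a model is flagged when it matches at least `threshold` backdoors. The function name and parameters are illustrative.

```python
from math import comb


def false_positive_rate(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """P(a clean model matches >= threshold of num_backdoors random targets)."""
    p = 1.0 / num_targets  # chance of matching one random target by luck
    return sum(
        comb(num_backdoors, k) * p**k * (1.0 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )


# Eight backdoors, ten possible targets, flag at seven or more matches.
fpr = false_positive_rate(num_backdoors=8, num_targets=10, threshold=7)
print(f"{fpr:.2e}")  # 7.30e-07, i.e. about 0.000073%
```

Under these assumed settings the value is numerically consistent with the MMLU-Pro figure quoted in the abstract, although the authors' actual target space and flagging threshold are not stated there.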

[179] A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs

Chiwan Park,Wonjun Jang,Daeryong Kim,Aelim Ahn,Kichang Yang,Woosung Hwang,Jihyeon Roh,Hyerin Park,Hyosun Wang,Min Seok Kim,Jihoon Kang

Main category: cs.CL

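TL;DR: The paper examines how to apply state-of-the-art large language models (LLMs) in industrial settings, where flexible conversational ability conflicts with service constraints, and presents a solution through a case study of an e-commerce conversational agent.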

Details

Motivation: In industrial applications, LLMs must remain conversationally flexible while strictly complying with service constraints; the two requirements conflict, and a solution is needed. Method: Proposes an approach that combines strategies and optimizations to address the limitations of LLMs in industrial applications, validated through a case study of an e-commerce conversational agent. Result: The study provides a framework for developing scalable, controllable, and reliable AI-driven agents, bridging the gap between academic research and real-world application. Conclusion: The proposed approach addresses the challenges of using LLMs in industrial applications and gives a practical framework for building AI agents in real-world settings. Abstract: The advancement of Large Language Models (LLMs) has led to significant improvements in various service domains, including search, recommendation, and chatbot applications. However, applying state-of-the-art (SOTA) research to industrial settings presents challenges, as it requires maintaining flexible conversational abilities while also strictly complying with service-specific constraints. This can be seen as two conflicting requirements due to the probabilistic nature of LLMs. In this paper, we propose our approach to addressing this challenge and detail the strategies we employed to overcome their inherent limitations in real-world applications. We conduct a practical case study of a conversational agent designed for the e-commerce domain, detailing our implementation workflow and optimizations. Our findings provide insights into bridging the gap between academic research and real-world application, introducing a framework for developing scalable, controllable, and reliable AI-driven agents.

[180] Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models

Jinwen Chen,Hainan Zhang,Fei Sun,Qinnan Zhang,Sijia Wen,Ziwei Wang,Zhiming Zheng

Main category: cs.CL

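TL;DR: The paper proposes RFTC, a stealthy backdoor sample detection method based on reference filtration and TF-IDF clustering that efficiently identifies poisoned samples for LLMs; experiments show it outperforms baselines.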

Details

Motivation: Existing detection methods either cannot be applied to generation tasks or may degrade generation performance, so an efficient way to eliminate stealthy poisoned samples is needed. Method: Identifies poisoned samples by comparing sample responses with a reference model's outputs and then applying TF-IDF clustering. Result: On machine translation and QA datasets, RFTC outperforms baselines in backdoor detection and model performance. Conclusion: RFTC effectively addresses stealthy backdoor sample detection, and the reference filtration mechanism is confirmed to be effective. Abstract: Fine-tuning LLMs with datasets containing stealthy backdoors from publishers poses security risks to downstream applications. Mainstream detection methods either identify poisoned samples by analyzing the prediction probability of poisoned classification models or rely on the rewriting model to eliminate the stealthy triggers. However, the former cannot be applied to generation tasks, while the latter may degrade generation performance and introduce new triggers. Therefore, efficiently eliminating stealthy poisoned samples for LLMs remains an urgent problem. We observe that after applying TF-IDF clustering to the sample response, there are notable differences in the intra-class distances between clean and poisoned samples. Poisoned samples tend to cluster closely because of their specific malicious outputs, whereas clean samples are more scattered due to their more varied responses. Thus, in this paper, we propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Specifically, we first compare the sample response with the reference model's outputs and consider the sample suspicious if there is a significant discrepancy. We then perform TF-IDF clustering on these suspicious samples to identify the true poisoned samples based on the intra-class distance. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in backdoor detection and model performance. Further analysis of different reference models also confirms the effectiveness of our Reference-Filtration.

[181] Context Robust Knowledge Editing for Language Models

Haewon Park,Gyubin Choi,Minjun Kim,Yohan Jo

Main category: cs.CL

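TL;DR: CHED is a benchmark for evaluating the context robustness of knowledge editing (KE) methods; it shows that existing KE methods perform poorly when preceding contexts are present. CoRE reduces context-sensitive variance in hidden states, raising editing success while preserving model capabilities.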

Details

Motivation: Existing KE evaluations usually ignore the effect of preceding contexts on knowledge retrieval, which limits how well edits hold up in real-world use. Method: Develops the CHED benchmark to evaluate the context robustness of KE methods and proposes CoRE, which reduces context-sensitive variance in hidden states. Result: CHED shows that existing KE methods fail often when preceding contexts are present; CoRE significantly improves the editing success rate. Conclusion: CoRE effectively addresses the robustness of KE methods under preceding contexts while preserving model performance. Abstract: Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we develop CHED -- a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that they often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.

[182] Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Space

Si Wu,Sebastian Bruch

Main category: cs.CL

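TL;DR: This paper proposes NSM, an unsupervised measure that estimates the imageability and concreteness of text from the peakedness of a word's neighborhood in the semantic embedding space; experiments show it outperforms existing unsupervised methods.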

Details

Motivation: To explore whether the text alone in an image-caption dataset is enough to estimate imageability and concreteness accurately, avoiding reliance on multimodal data. Method: Proposes NSM (Neighborhood Stability Measure), an unsupervised, distribution-free measure that quantifies the peakedness of a word's neighborhood in the semantic embedding space. Result: NSM correlates more strongly with ground-truth ratings than existing unsupervised methods and performs well on classification. Conclusion: NSM is an effective unsupervised method for estimating the imageability and concreteness of text. Abstract: Imageability (potential of text to evoke a mental image) and concreteness (perceptibility of text) are two psycholinguistic properties that link visual and semantic spaces. It is little surprise that computational methods that estimate them do so using parallel visual and semantic spaces, such as collections of image-caption pairs or multi-modal models. In this paper, we work on the supposition that text itself in an image-caption dataset offers sufficient signals to accurately estimate these properties. We hypothesize, in particular, that the peakedness of the neighborhood of a word in the semantic embedding space reflects its degree of imageability and concreteness. We then propose an unsupervised, distribution-free measure, which we call Neighborhood Stability Measure (NSM), that quantifies the sharpness of peaks. Extensive experiments show that NSM correlates more strongly with ground-truth ratings than existing unsupervised methods, and is a strong predictor of these properties for classification. Our code and data are available on GitHub (https://github.com/Artificial-Memory-Lab/imageability).

[183] Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset

Shruti Hegde,Mabon Manoj Ninan,Jonathan R. Dillman,Shireen Hayatghaibi,Lynn Babcock,Elanchezhian Somasundaram

Main category: cs.CL

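TL;DR: This study compares four commercial clinical NLP systems (AWS, GC, AZ, SP) and two dedicated chest radiograph report labelers (CheXpert, CheXbert) on entity extraction and assertion detection in pediatric chest radiograph reports, and finds significant differences in performance.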

Details

Motivation: General-purpose clinical NLP tools are widely used for automatic labeling of clinical reports, but independent evaluations for specific tasks such as pediatric chest radiograph report labeling are limited. Method: Analyzes 95,008 pediatric chest radiograph reports, comparing four NLP systems and two dedicated labelers on entity extraction and assertion detection using Fleiss Kappa and accuracy. Result: The systems differ significantly in the number of extracted entities and in assertion accuracy; SP performs best (76%) and AWS worst (50%). CheXpert and CheXbert reach 56% accuracy. Conclusion: Clinical NLP tools vary considerably in performance and need careful validation and review before deployment. Abstract: General-purpose clinical natural language processing (NLP) tools are increasingly used for the automatic labeling of clinical reports. However, independent evaluations for specific tasks, such as pediatric chest radiograph (CXR) report labeling, are limited. This study compares four commercial clinical NLP systems - Amazon Comprehend Medical (AWS), Google Healthcare NLP (GC), Azure Clinical NLP (AZ), and SparkNLP (SP) - for entity extraction and assertion detection in pediatric CXR reports. Additionally, CheXpert and CheXbert, two dedicated chest radiograph report labelers, were evaluated on the same task using CheXpert-defined labels. We analyzed 95,008 pediatric CXR reports from a large academic pediatric hospital. Entities and assertion statuses (positive, negative, uncertain) from the findings and impression sections were extracted by the NLP systems, with impression section entities mapped to 12 disease categories and a No Findings category. CheXpert and CheXbert extracted the same 13 categories. Outputs were compared using Fleiss Kappa and accuracy against a consensus pseudo-ground truth. Significant differences were found in the number of extracted entities and assertion distributions across NLP systems. SP extracted 49,688 unique entities, GC 16,477, AZ 31,543, and AWS 27,216. Assertion accuracy across models averaged around 62%, with SP highest (76%) and AWS lowest (50%). CheXpert and CheXbert achieved 56% accuracy. Considerable variability in performance highlights the need for careful validation and review before deploying NLP tools for clinical report labeling.
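The study names Fleiss Kappa as its agreement statistic without showing the calculation. The sketch below implements the standard Fleiss' kappa formula for a table of per-item category counts. The example counts (four hypothetical labelers assigning three reports to positive, negative, or uncertain) are made up for illustration and are not data from the paper, and treating the all-one-category case as perfect agreement is a convention assumed here.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters placing item i in category j.

    Every item must be rated by the same number of raters (at least two).
    """
    num_items = len(counts)
    num_raters = sum(counts[0])
    if num_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != num_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: mean fraction of agreeing rater pairs per item.
    per_item = [
        (sum(c * c for c in row) - num_raters) / (num_raters * (num_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / num_items

    # Chance agreement from the overall share of ratings in each category.
    total_ratings = num_items * num_raters
    shares = [
        sum(row[j] for row in counts) / total_ratings
        for j in range(len(counts[0]))
    ]
    expected = sum(s * s for s in shares)

    if expected == 1.0:
        # Degenerate case: all ratings fall in one category (assumed convention).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical counts: rows are reports, columns are (positive, negative, uncertain).
example_counts = [[4, 0, 0], [2, 2, 0], [1, 1, 2]]
example_kappa = fleiss_kappa(example_counts)
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values indicate agreement below chance.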

[184] Machine-Facing English: Defining a Hybrid Register Shaped by Human-AI Discourse

Hyunwoo Kim,Hanau Yi

Main category: cs.CL

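TL;DR: The paper studies Machine-Facing English (MFE), an emergent register, examining how language becomes simpler and more explicit in human-AI interaction and how this affects communicative efficiency and linguistic richness.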

Details

Motivation: To understand how people adjust their language during sustained interaction with AI to suit machine parsing, at the expense of natural fluency. Method: Qualitative observations from bilingual (Korean/English) voice- and text-based product testing, with reflexive drafting using Natural Language Declarative Prompting (NLD-P). Result: MFE shows five traits: redundant clarity, directive syntax, controlled vocabulary, flattened prosody, and single-intent structuring. These improve execution accuracy but compress expressive range. Conclusion: The evolution of MFE highlights the tension between communicative efficiency and linguistic richness, raises challenges for conversational interface design and for teaching multilingual users, and calls for fuller methodological exposition and empirical validation in future work. Abstract: Machine-Facing English (MFE) is an emergent register shaped by the adaptation of everyday language to the expanding presence of AI interlocutors. Drawing on register theory (Halliday 1985, 2006), enregisterment (Agha 2003), audience design (Bell 1984), and interactional pragmatics (Giles & Ogay 2007), this study traces how sustained human-AI interaction normalizes syntactic rigidity, pragmatic simplification, and hyper-explicit phrasing - features that enhance machine parseability at the expense of natural fluency. Our analysis is grounded in qualitative observations from bilingual (Korean/English) voice- and text-based product testing sessions, with reflexive drafting conducted using Natural Language Declarative Prompting (NLD-P) under human curation. Thematic analysis identifies five recurrent traits - redundant clarity, directive syntax, controlled vocabulary, flattened prosody, and single-intent structuring - that improve execution accuracy but compress expressive range. MFE's evolution highlights a persistent tension between communicative efficiency and linguistic richness, raising design challenges for conversational interfaces and pedagogical considerations for multilingual users. We conclude by underscoring the need for comprehensive methodological exposition and future empirical validation.

Longyin Zhang,Bowei Zou,Ai Ti Aw

Main category: cs.CL

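TL;DR: Proposes a fine-grained approach based on multilingual large language models for generating aspect terms from social media comments (CAT-G), aligning model predictions with human expectations through DPO to improve understanding of social media discourse.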

Details

Motivation: Social media comments use language freely and cover scattered topics, which makes NLP tasks such as comment clustering, summarization, and opinion analysis difficult. Method: Applies supervised fine-tuning to multilingual large language models for comment aspect term generation (CAT-G) and aligns model predictions through DPO. Result: The method improves comprehension of social media comments on two NLP tasks and contributes the first multilingual CAT-G test set (English, Chinese, Malay, Indonesian). Conclusion: CAT-G effectively handles the diversity of social media comments, and the test set provides a basis for comparing performance across languages. Abstract: The inherent nature of social media posts, characterized by the freedom of language use with a disjointed array of diverse opinions and topics, poses significant challenges to downstream NLP tasks such as comment clustering, comment summarization, and social media opinion analysis. To address this, we propose a granular level of identifying and generating aspect terms from individual comments to guide model attention. Specifically, we leverage multilingual large language models with supervised fine-tuning for comment aspect term generation (CAT-G), further aligning the model's predictions with human expectations through DPO. We demonstrate the effectiveness of our method in enhancing the comprehension of social media discourse on two NLP tasks. Moreover, this paper contributes the first multilingual CAT-G test set on English, Chinese, Malay, and Bahasa Indonesian. As LLM capabilities vary among languages, this test set allows for a comparative analysis of performance across languages with varying levels of LLM proficiency.

[186] EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models

Yuzhen Xiao,Jiahe Song,Yongxin Xu,Ruizhe Zhang,Yiqi Xiao,Xin Lu,Runchuan Zhu,Bowen Jiang,Junfeng Zhao

Main category: cs.CL

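TL;DR: The paper proposes EL4NER, an ensemble learning method that combines the ICL outputs of multiple open-source, small-parameter LLMs to improve NER performance while lowering deployment and inference costs.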

Details

Motivation: To address the reliance of existing ICL-based NER methods on large-parameter LLMs, which brings high compute requirements, API costs, data-privacy concerns, and barriers to collaboration. Method: (1) Designs a task decomposition-based pipeline; (2) introduces a span-level sentence similarity algorithm to improve ICL demonstration retrieval; (3) adds a self-validation mechanism to reduce ensemble noise. Result: EL4NER outperforms most closed-source, large-parameter LLM-based methods on multiple NER datasets and reaches SOTA performance among ICL-based methods on some of them. Conclusion: EL4NER shows that small-parameter LLMs are efficient and feasible within the ICL paradigm. Abstract: In-Context Learning (ICL) technique based on Large Language Models (LLMs) has gained prominence in Named Entity Recognition (NER) tasks for its lower computing resource consumption, less manual labeling overhead, and stronger generalizability. Nevertheless, most ICL-based NER methods depend on large-parameter LLMs: the open-source models demand substantial computational resources for deployment and inference, while the closed-source ones incur high API costs, raise data-privacy concerns, and hinder community collaboration. To address this issue, we propose an Ensemble Learning Method for Named Entity Recognition (EL4NER), which aims at aggregating the ICL outputs of multiple open-source, small-parameter LLMs to enhance overall performance in NER tasks at lower deployment and inference cost. Specifically, our method comprises three key components. First, we design a task decomposition-based pipeline that facilitates deep, multi-stage ensemble learning. Second, we introduce a novel span-level sentence similarity algorithm to establish an ICL demonstration retrieval mechanism better suited for NER tasks. Third, we incorporate a self-validation mechanism to mitigate the noise introduced during the ensemble process. We evaluated EL4NER on multiple widely adopted NER datasets from diverse domains. Our experimental results indicate that EL4NER surpasses most closed-source, large-parameter LLM-based methods at a lower parameter cost and even attains state-of-the-art (SOTA) performance among ICL-based methods on certain datasets. These results show the parameter efficiency of EL4NER and underscore the feasibility of employing open-source, small-parameter LLMs within the ICL paradigm for NER tasks.

[187] Query Routing for Retrieval-Augmented Language Models

Jiarui Zhang,Xiangyu Liu,Yong Hu,Chaoyue Niu,Fan Wu,Guihai Chen

Main category: cs.CL

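TL;DR: RAGRouter is a routing mechanism that accounts for the influence of retrieved documents when choosing among multiple LLMs in RAG settings, significantly improving task performance.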

Details

Motivation: Existing routing methods perform poorly in RAG settings because they rely on static knowledge representations and cannot adapt to how retrieved documents change an LLM's ability to answer. Method: Proposes RAGRouter, which uses document embeddings and RAG capability embeddings with contrastive learning to capture shifts in knowledge representation and make informed routing decisions. Result: RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%, and it balances performance and efficiency under low-latency constraints. Conclusion: RAGRouter's RAG-aware routing design effectively addresses LLM selection in RAG settings and significantly improves task performance. Abstract: Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs' ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints.

[188] Self-Correcting Code Generation Using Small Language Models

Jeonghun Cho,Deokhyung Kang,Hyounghun Kim,Gary Geunbae Lee

Main category: cs.CL

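TL;DR: The paper examines whether small models can improve their code generation output through self-correction, finds that they struggle, and proposes CoCoS, which uses reinforcement learning to strengthen multi-turn correction in small models with substantial gains.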

Details

Motivation: To study whether small models can revise their code output through self-reflection, a question left unexplored by prior work. Method: Proposes CoCoS, which uses an online reinforcement learning objective with an accumulated reward function and a fine-grained reward to improve multi-turn correction. Result: With 1B-scale models, CoCoS improves over the baselines by 35.8% on MBPP and 27.7% on HumanEval. Conclusion: CoCoS effectively improves the self-correction ability of small models in code generation and suggests a new direction for applying small models. Abstract: Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage their strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This facilitates the model in enhancing initial response quality while achieving substantial improvements through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on MBPP and 27.7% on HumanEval compared to the baselines.

Hongcheng Guo,Zheyong Xie,Shaosheng Cao,Boyang Wang,Weiting Liu,Anjie Le,Lei Li,Zhoujun Li

Main category: cs.CL

TL;DR: SNS-Bench-VL is a multimodal benchmark for evaluating vision-language LLMs in social media scenarios, covering 8 tasks and 4,001 question-answer pairs.

Details

Motivation: Existing benchmarks focus mainly on text tasks and do not cover the multimodal content common in modern social media, so a new evaluation tool is needed. Method: Designs SNS-Bench-VL, with 8 multimodal tasks and 4,001 question-answer pairs, and evaluates over 25 state-of-the-art multimodal LLMs. Result: Multimodal understanding of social context remains challenging. Conclusion: SNS-Bench-VL is intended to encourage future research on robust, context-aware, and human-aligned multimodal intelligence. Abstract: With the increasing integration of visual and textual content in Social Networking Services (SNS), evaluating the multimodal capabilities of Large Language Models (LLMs) is crucial for enhancing user experience, content understanding, and platform intelligence. Existing benchmarks primarily focus on text-centric tasks, lacking coverage of the multimodal contexts prevalent in modern SNS ecosystems. In this paper, we introduce SNS-Bench-VL, a comprehensive multimodal benchmark designed to assess the performance of Vision-Language LLMs in real-world social media scenarios. SNS-Bench-VL incorporates images and text across 8 multimodal tasks, including note comprehension, user engagement analysis, information retrieval, and personalized recommendation. It comprises 4,001 carefully curated multimodal question-answer pairs, covering single-choice, multiple-choice, and open-ended tasks. We evaluate over 25 state-of-the-art multimodal LLMs, analyzing their performance across tasks. Our findings highlight persistent challenges in multimodal social context comprehension. We hope SNS-Bench-VL will inspire future research towards robust, context-aware, and human-aligned multimodal intelligence for next-generation social networking services.

[190] Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport

Yuu Jinnai

Main category: cs.CL

TL;DR: This paper studies how to apply Minimum Bayes Risk (MBR) decoding to document-level text generation and proposes MBR-OT, a variant based on Wasserstein distance.

Details

Motivation: Document-level text generation is harder than sentence-level generation because it requires understanding longer context. Existing MBR decoding is limited on document-level tasks because most utility functions are designed for sentences. Method: Proposes MBR-OT, which uses Wasserstein distance with a sentence-level utility function to compute document-level utility. Result: MBR-OT outperforms standard MBR on document-level machine translation, text simplification, and dense image captioning. Conclusion: By changing how utility is computed, MBR-OT improves performance on document-level text generation. Abstract: Document-level text generation tasks are known to be more difficult than sentence-level text generation tasks as they require the understanding of longer context to generate high-quality texts. In this paper, we investigate the adaptation of Minimum Bayes Risk (MBR) decoding for document-level text generation tasks. MBR decoding makes use of a utility function to estimate the output with the highest expected utility from a set of candidate outputs. Although MBR decoding is shown to be effective in a wide range of sentence-level text generation tasks, its performance on document-level text generation tasks is limited as many of the utility functions are designed for evaluating the utility of sentences. To this end, we propose MBR-OT, a variant of MBR decoding using Wasserstein distance to compute the utility of a document using a sentence-level utility function. Experimental results show that MBR-OT outperforms standard MBR in document-level machine translation, text simplification, and dense image captioning tasks. Our code is available at https://github.com/jinnaiyuu/mbr-optimal-transport
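The abstract describes MBR decoding as picking, from a set of candidates, the output with the highest expected utility. A minimal sketch of standard sample-based MBR follows, in which the other candidates serve as pseudo-references. The token-overlap utility is a toy stand-in chosen for illustration, and the Wasserstein-based document utility of MBR-OT is not reproduced here; the official implementation is at the repository linked in the abstract.

```python
def token_jaccard(hypothesis: str, reference: str) -> float:
    """Toy utility (an assumption for illustration): Jaccard overlap of token sets."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    if not hyp and not ref:
        return 1.0
    return len(hyp & ref) / len(hyp | ref)


def mbr_decode(candidates, utility=token_jaccard):
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    best_candidate, best_score = candidates[0], float("-inf")
    for i, hypothesis in enumerate(candidates):
        # Every other candidate acts as a pseudo-reference for this hypothesis.
        others = candidates[:i] + candidates[i + 1:]
        score = sum(utility(hypothesis, ref) for ref in others) / len(others)
        if score > best_score:
            best_candidate, best_score = hypothesis, score
    return best_candidate


# Hypothetical candidate set sampled from a model.
samples = ["the cat sat", "the cat sat down", "the cat", "a dog ran"]
chosen = mbr_decode(samples)
```

Because the utility is passed in as a parameter, a document-level utility built from a sentence-level one can be swapped in without changing the selection loop.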

[191] Generating Diverse Training Samples for Relation Extraction with Large Language Models

Zexuan Li,Hongliang Dai,Piji Li

Main category: cs.CL

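TL;DR: The study examines how to use large language models (LLMs) to generate diverse and correct training data for relation extraction (RE), via instruction prompts and Direct Preference Optimization (DPO) fine-tuning; experiments show both improve the quality of the generated data.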

Details

Motivation: RE training samples generated by directly prompting LLMs are structurally similar and use a narrow range of phrasing, so their diversity and correctness need improvement. Method: (1) Instructs the LLM directly through In-Context Learning (ICL) prompts to produce diverse samples; (2) fine-tunes the LLM with DPO to generate diverse samples. Result: Both approaches improve the quality of the generated data, and a non-LLM model trained on the generated data can outperform performing RE directly with the LLM. Conclusion: Instruction prompts and DPO fine-tuning improve the diversity and quality of RE training data, and training a non-LLM model on that data can work better than using the LLM directly. Abstract: Using Large Language Models (LLMs) to generate training data can potentially be a preferable way to improve zero or few-shot NLP tasks. However, many problems remain to be investigated for this direction. For the task of Relation Extraction (RE), we find that samples generated by directly prompting LLMs may easily have high structural similarities with each other. They tend to use a limited variety of phrasing while expressing the relation between a pair of entities. Therefore, in this paper, we study how to effectively improve the diversity of the training samples generated with LLMs for RE, while also maintaining their correctness. We first try to make the LLMs produce dissimilar samples by directly giving instructions in In-Context Learning (ICL) prompts. Then, we propose an approach to fine-tune LLMs for diverse training sample generation through Direct Preference Optimization (DPO). Our experiments on commonly used RE datasets show that both attempts can improve the quality of the generated training data. We also find that, compared with directly performing RE with an LLM, training a non-LLM RE model with its generated samples may lead to better performance.

[192] Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data

Seohyeong Lee,Eunwon Kim,Hwaran Lee,Buru Chang

Main category: cs.CL

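TL;DR: Alignment Data Map uses GPT-4o to analyze preference data, computing alignment scores and building a data map; using only 33% of the data, chosen for quality, matches or exceeds full-dataset performance.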

Details

Motivation: Collecting human preference data is expensive and inefficient, which limits the scalability of aligning LLMs with human values. Method: Uses GPT-4o as a proxy for LLM alignment to compute alignment scores, then builds an Alignment Data Map from their mean and variance. Result: Using only 33% of the data (the high-mean, low-variance region) matches or exceeds the performance of the full dataset. Conclusion: The Alignment Data Map significantly improves data collection efficiency and can diagnose low-impact or misannotated samples in existing preference datasets. Abstract: Human preference data plays a critical role in aligning large language models (LLMs) with human values. However, collecting such data is often expensive and inefficient, posing a significant scalability challenge. To address this, we introduce Alignment Data Map, a GPT-4o-assisted tool for analyzing and diagnosing preference data. Using GPT-4o as a proxy for LLM alignment, we compute alignment scores for LLM-generated responses to instructions from existing preference datasets. These scores are then used to construct an Alignment Data Map based on their mean and variance. Our experiments show that using only 33 percent of the data, specifically samples in the high-mean, low-variance region, achieves performance comparable to or better than using the entire dataset. This finding suggests that the Alignment Data Map can significantly improve data collection efficiency by identifying high-quality samples for LLM alignment without requiring explicit annotations. Moreover, the Alignment Data Map can diagnose existing preference datasets. Our analysis shows that it effectively detects low-impact or potentially misannotated samples. Source code is available online.
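The data map described above places each sample by the mean and variance of its alignment scores and keeps the high-mean, low-variance region. A minimal sketch of that bookkeeping follows, assuming each sample already has several numeric alignment scores. The sample identifiers, score values, and the two thresholds are hypothetical; the abstract does not state how the region boundaries are chosen.

```python
from statistics import mean, pvariance


def build_data_map(scores_by_sample):
    """Map each sample id to (mean, population variance) of its alignment scores."""
    return {
        sample_id: (mean(scores), pvariance(scores))
        for sample_id, scores in scores_by_sample.items()
    }


def select_region(data_map, min_mean, max_variance):
    """Keep samples in the high-mean, low-variance region (thresholds are assumptions)."""
    return [
        sample_id
        for sample_id, (avg, var) in data_map.items()
        if avg >= min_mean and var <= max_variance
    ]


# Hypothetical alignment scores for three instruction samples.
scores = {
    "a": [0.9, 0.9, 0.8],  # consistently high
    "b": [0.9, 0.1, 0.5],  # unstable
    "c": [0.2, 0.2, 0.3],  # consistently low
}
selected = select_region(build_data_map(scores), min_mean=0.7, max_variance=0.05)
```

The same map can be read in the other direction: samples with low means or high variance are candidates for the low-impact or possibly misannotated cases the abstract mentions.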

[193] Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios

Linjie Mu,Zhongzhen Huang,Yakun Zhu,Xiangyu Zhao,Shaoting Zhang,Xiaofan Zhang

Main category: cs.CL

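TL;DR: The paper proposes MedE², a two-stage post-training method that improves multimodal reasoning in the medical domain; training on text and then multimodal data significantly improves model performance.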

Details

Motivation: Clinical decision-making relies on multimodal reasoning over multiple sources of evidence, yet existing reasoning models remain underexplored in medicine, so a dedicated method for improving medical multimodal reasoning is needed. Method: MedE² has two stages: Stage I fine-tunes the model on 2,000 text-only samples to elicit reasoning behavior; Stage II uses 1,500 multimodal medical cases to further strengthen reasoning. Result: MedE² significantly improves the reasoning performance of medical multimodal models, outperforms baselines on multiple benchmarks, and remains robust on larger models and under inference-time scaling. Conclusion: MedE² is an effective and reliable method for medical multimodal reasoning with practical potential. Abstract: Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose MedE², a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE² in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE² consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.

Yiming Lei,Zhizheng Yang,Zeming Liu,Haitao Leng,Shaoguo Liu,Tingting Gao,Qingjie Liu,Yunhong Wang

Main category: cs.CL

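TL;DR: The paper proposes ContextQFormer, a context modeling module that improves multi-modal large language models in multi-turn interaction, and builds TMDialog, a new multi-turn multi-modal dialogue dataset. ContextQFormer improves the available rate by 2%-4% over baselines.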

Details

Motivation: Existing open-source multi-modal models are weak at multi-turn interaction, especially with long contexts, and need improvement. Method: Introduces the ContextQFormer module, which uses a memory block to strengthen the representation of contextual information, and builds the TMDialog dataset for training and evaluation. Result: ContextQFormer outperforms the baselines on TMDialog, improving the available rate by 2%-4%. Conclusion: ContextQFormer and TMDialog are effective tools for multi-turn multi-modal dialogue research; the dataset will be open-sourced and the work refined further. Abstract: Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, the existing open-source multi-modal models suffer from the weak capability of multi-turn interaction, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the representation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports the research of multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.

[195] PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics

Atharva Naik,Darsh Agrawal,Manav Kapadnis,Yuwei An,Yash Mathur,Carolyn Rose,David Mortensen

Main category: cs.CL

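TL;DR: The paper examines how long chain-of-thought (LCoT) large language models (LLMs) perform on inductive reasoning problems inspired by historical linguistics and finds their abilities limited.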

Details

Motivation: To test whether the abstract reasoning abilities of LCoT LLMs are general enough for problems of practical importance, specifically Programming by Examples problems inspired by historical linguistics. Method: Develops a fully automated pipeline that dynamically generates a benchmark with controllable difficulty, addressing scalability and contamination issues. Result: The generated test set is challenging for state-of-the-art reasoning LLMs; the best model (Claude-3.7-Sonnet) reaches only a 54% pass rate. Conclusion: LCoT LLMs still struggle with a kind of reasoning that is common in historical linguistics and other domains. Abstract: Recently, long chain of thought (LCoT) Large Language Models (LLMs) have taken the machine learning world by storm with their breathtaking reasoning capabilities. However, are the abstract reasoning abilities of these models general enough for problems of practical importance? Unlike past work, which has focused mainly on math, coding, and data wrangling, we focus on a historical linguistics-inspired inductive reasoning problem, formulated as Programming by Examples. We develop a fully automated pipeline for dynamically generating a benchmark for this task with controllable difficulty in order to tackle scalability and contamination issues to which many reasoning benchmarks are subject. Using our pipeline, we generate a test set with nearly 1k instances that is challenging for all state-of-the-art reasoning LLMs, with the best model (Claude-3.7-Sonnet) achieving a mere 54% pass rate, demonstrating that LCoT LLMs still struggle with a class of reasoning that is ubiquitous in historical linguistics as well as many other domains.

[196] Enhancing Large Language Models' Machine Translation via Dynamic Focus Anchoring

Qiuyu Ding,Zhiqiang Cao,Hailong Cao,Tiejun Zhao

Main category: cs.CL

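TL;DR: Proposes a simple, effective method that dynamically identifies context-sensitive units (CSUs) and applies semantic focus to improve the machine translation performance of large language models (LLMs) without additional training.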

Details

Motivation: LLMs perform well on multilingual tasks but still struggle with context-sensitive units such as polysemous words, which hurts translation accuracy and the model's understanding. Method: Dynamically analyzes and identifies translation difficulties, then feeds them to the LLM in a structured way to avoid mistranslations or misunderstandings caused by information flattening and to activate the model's relevant knowledge. Result: On an MT benchmark dataset, the method performs competitively against multiple open-source baselines and works for both similar and distant language pairs. Conclusion: The method needs no additional training, consumes few resources, and improves LLM performance across multiple tasks, showing robustness and broad applicability. Abstract: Large language models have demonstrated exceptional performance across multiple crosslingual NLP tasks, including machine translation (MT). However, persistent challenges remain in addressing context-sensitive units (CSUs), such as polysemous words. These CSUs not only affect the local translation accuracy of LLMs, but also affect LLMs' understanding capability for sentences and tasks, and even lead to translation failure. To address this problem, we propose a simple but effective method to enhance LLMs' MT capabilities by acquiring CSUs and applying semantic focus. Specifically, we dynamically analyze and identify translation challenges, then incorporate them into LLMs in a structured manner to mitigate mistranslations or misunderstandings of CSUs caused by information flattening. This efficiently activates LLMs to identify and apply relevant knowledge from their vast data pools, ensuring more accurate translations of difficult terms. On a benchmark dataset of MT, our proposed method achieved competitive performance compared to multiple existing open-sourced MT baseline models. It demonstrates effectiveness and robustness across multiple language pairs, including both similar language pairs and distant language pairs. Notably, the proposed method requires no additional model training and enhances LLMs' performance across multiple NLP tasks with minimal resource consumption.

[197] Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models

Qiuyu Ding,Zhiqiang Cao,Hailong Cao,Tiejun Zhao

Main category: cs.CL

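TL;DR: This paper proposes a cross-domain bilingual lexicon extraction task based on general-domain and target-domain monolingual corpora, uses pre-trained models to improve word embeddings, and validates the approach experimentally.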

Details

Motivation: Traditional bilingual lexicon induction (BLI) methods perform poorly in specialized domains, mainly because domain data is small, word frequencies are low, and static word embeddings are limited. Method: Improves word embeddings with pre-trained models and, for the first time in the cross-domain BLI task, introduces the Code Switch strategy to adapt to different contexts. Result: The method improves by 0.78 points on average across three specific domains, outperforming traditional BLI baselines. Conclusion: The method offers a better solution for bilingual lexicon extraction in specialized domains, though its generality needs further validation. Abstract: Bilingual Lexicon Induction (BLI) generally uses common-domain data to obtain monolingual word embeddings, then aligns them to obtain cross-lingual embeddings from which word translation pairs are extracted. It is yet to be seen whether these methods are suitable for bilingual lexicon extraction in professional fields. As we can see in table 1, the classic and efficient BLI approaches, Muse and Vecmap, perform much worse on the Medical dataset than on the Wiki dataset. On one hand, specialized-domain datasets are generally smaller than generic-domain datasets, and specialized words have a lower frequency, which directly affects the translation quality of bilingual dictionaries. On the other hand, static word embeddings are widely used for BLI; however, in some specific fields the meaning of words is greatly influenced by context, and in this case using only static word embeddings may lead to greater bias. In this paper, we propose a new task of BLI, which is to use the monolingual corpus of the general domain and target domain to extract domain-specific bilingual dictionaries. Motivated by the ability of pre-trained models, we propose a method to get better word embeddings that builds on recent work on BLI. In this way, we are the first to introduce Code Switch (Qin et al., 2020) in the cross-domain BLI task, which can match different strategies in different contexts, making the model more suitable for this task. Experimental results show that our method improves over robust BLI baselines on three specific domains by 0.78 points on average.

[198] Tell, Don't Show: Leveraging Language Models' Abstractive Retellings to Model Literary Themes

Li Lucy,Camilla Griffiths,Sarah Levine,Jennifer L. Eberhardt,Dorottya Demszky,David Bamman

Main category: cs.CL

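TL;DR: Retell is a topic modeling method for literary text that uses generative language models to translate narrative content into higher-level concepts and then applies LDA to improve topic quality.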

Details

Motivation: Traditional bag-of-words methods such as LDA struggle with literary text, which favors sensory detail over abstract description; Retell aims to address this. Method: Prompt generative language models to turn narrative content into higher-level concepts, then run LDA on their output, combining the strengths of both. Result: Retell produces more precise and informative topics than running LDA alone or directly asking a language model to list topics. Conclusion: Retell shows promise for cultural analytics and extracts themes from literary text more effectively. Abstract: Conventional bag-of-words approaches for topic modeling, like latent Dirichlet allocation (LDA), struggle with literary text. Literature challenges lexical methods because narrative language focuses on immersive sensory details instead of abstractive description or exposition: writers are advised to "show, don't tell." We propose Retell, a simple, accessible topic modeling approach for literature. Here, we prompt resource-efficient, generative language models (LMs) to tell what passages show, thereby translating narratives' surface forms into higher-level concepts and themes. By running LDA on LMs' retellings of passages, we can obtain more precise and informative topics than by running LDA alone or by directly asking LMs to list topics. To investigate the potential of our method for cultural analytics, we compare our method's outputs to expert-guided annotations in a case study on racial/cultural identity in high school English language arts books.

[199] ZIPA: A family of efficient models for multilingual phone recognition

Jian Zhu,Farhan Samir,Eleanor Chodroff,David R. Mortensen

Main category: cs.CL

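TL;DR: ZIPA is a family of efficient speech models that, using large-scale multilingual data and the efficient Zipformer architecture, substantially improves crosslinguistic phone recognition, although modeling sociophonetic diversity remains a challenge.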

Details

Motivation: Improve crosslinguistic phone recognition and address the shortcomings of existing systems in parameter efficiency and data scale. Method: Train on the IPAPack++ dataset (17,132 hours of normalized phone transcriptions) with the Zipformer architecture (ZIPA-T and ZIPA-CR variants), and scale the data further through noisy student training on 11,000 hours of pseudo-labeled data. Result: ZIPA outperforms existing phone recognition systems with fewer parameters, and noisy student training yields further gains. Conclusion: ZIPA makes substantial progress on crosslinguistic phone recognition, but modeling sociophonetic diversity remains a challenge for future research. Abstract: We present ZIPA, a family of efficient speech models that advances the state-of-the-art performance of crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. With the large-scale training data, ZIPA, including transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverages the efficient Zipformer backbones and outperforms existing phone recognition systems with far fewer parameters. Further scaling via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields further improvement. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.

[200] Map&Make: Schema Guided Text to Table Generation

Naman Ahuja,Fenil Bardoliya,Chitta Baral,Vivek Gupta

Main category: cs.CL

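TL;DR: This paper proposes Map&Make, a new method that decomposes complex text into atomic propositions and generates tables from them, substantially improving performance on the Text-to-Table task.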

Details

Motivation: Current methods fall short in extracting complex information and inferring data, so a more effective way to convert text into tables is needed. Method: Map&Make decomposes text into atomic propositions, extracts the latent schema, and populates tables, while also correcting hallucination errors in the benchmark. Result: On the Rotowire and Livesum datasets, the method substantially improves performance and offers better interpretability. Conclusion: The Map&Make framework performs well on structured summarization tasks, and experiments confirm its advantages and practicality. Abstract: Transforming dense, detailed, unstructured text into an interpretable and summarised table, also colloquially known as Text-to-Table generation, is an essential task for information retrieval. Current methods, however, miss out on how and what complex information to extract; they also lack the ability to infer data from the text. In this paper, we introduce a versatile approach, Map&Make, which "dissects" text into propositional atomic statements. This facilitates granular decomposition to extract the latent schema. The schema is then used to populate the tables that capture the qualitative nuances and the quantitative facts in the original text. Our approach is tested against two challenging datasets, Rotowire, renowned for its complex and multi-table schema, and Livesum, which demands numerical aggregation. By carefully identifying and correcting hallucination errors in Rotowire, we aim to achieve a cleaner and more reliable benchmark. We evaluate our method rigorously on a comprehensive suite of comparative and referenceless metrics. Our findings demonstrate significant improvements across both datasets with better interpretability in Text-to-Table generation. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to superior performance and validate the practicality of our framework in structured summarization tasks.

[201] Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification

Wenjing Xing,Wenke Lu,Yeheng Duan,Bing Zhao,Zhenghui kang,Yaolong Wang,Kai Gao,Lei Qiao

Main category: cs.CL

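TL;DR: Infinite-Instruct is an automated framework for synthesizing high-quality question-answer pairs to improve the code generation capabilities of large language models (LLMs). Reverse construction and knowledge-graph-based reconstruction strengthen problem logic and code quality, and experiments show substantial performance gains.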

Details

Motivation: Traditional code instruction data synthesis methods lack diversity and logical coherence, so a more effective way to improve LLM code generation is needed. Method: The framework uses "Reverse Construction" to turn code snippets into programming problems, uses "Backfeeding Construction" to rebuild problem logic through a knowledge graph, and filters invalid samples with cross-lingual static code analysis. Result: On mainstream code generation benchmarks, performance improves by 21.70% for 7B-parameter models and 36.95% for 32B-parameter models, and comparable performance is reached with less data. Conclusion: Infinite-Instruct provides a scalable solution for LLM training in programming, and the experimental datasets are open-sourced. Abstract: Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, "Reverse Construction" transforms code snippets into diverse programming problems. Then, through "Backfeeding Construction," keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieved performance comparable to Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in the experiments, including both unfiltered versions and filtered versions via static analysis. The data are available at https://github.com/xingwenjing417/Infinite-Instruct-dataset
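
The static-verification step can be pictured with a small sketch. This is not the authors' pipeline, which is cross-lingual and whose checks the abstract does not describe; it assumes Python-only samples and uses a bare syntax check as the only filter. The sample layout (`question`/`answer` keys) and the function names are hypothetical.

```python
import ast

def is_statically_valid(code: str) -> bool:
    """Return True if the snippet parses as Python source (syntax check only)."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True

def filter_samples(samples):
    """Keep only question-answer pairs whose answer code passes the check."""
    return [s for s in samples if is_statically_valid(s["answer"])]

# Hypothetical synthesized samples: the second one has a syntax error.
samples = [
    {"question": "Add two numbers.", "answer": "def add(a, b):\n    return a + b\n"},
    {"question": "Broken sample.", "answer": "def add(a, b:\n    return a + b\n"},
]
kept = filter_samples(samples)
```

A real filter would likely add further checks (undefined names, unused results, per-language linters); the point here is only that invalid samples can be rejected without executing them.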

[202] Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

Gabriele Sarti,Vilém Zouhar,Malvina Nissim,Arianna Bisazza

Main category: cs.CL

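TL;DR: This paper investigates unsupervised methods that use language model interpretability and uncertainty quantification to identify translation errors efficiently, and evaluates them on multilingual translation tasks.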

Details

Motivation: Modern word-level quality estimation (WQE) techniques are expensive, typically requiring large amounts of human-labeled data or prompting of large language models. This paper explores more efficient alternatives. Method: Use language model interpretability and uncertainty quantification techniques to identify errors from the inner workings of translation models. Result: In an evaluation of 14 metrics across 12 translation directions, unsupervised metrics show promise, while supervised methods perform poorly under label uncertainty. Conclusion: Unsupervised methods have potential, but current single-annotator evaluation practices are brittle and need improvement. Abstract: Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.

[203] Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration

Yilong Li,Chen Qian,Yu Xia,Ruijie Shi,Yufan Dang,Zihao Xie,Ziming You,Weize Chen,Cheng Yang,Weichuan Liu,Ye Tian,Xuantang Xiong,Lei Han,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

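TL;DR: The MAEL framework uses cross-task experiential learning to make multi-agent collaboration more efficient, reducing redundant computation and improving generalization.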

Details

Motivation: Existing methods treat each task in isolation, which causes redundant computation and limits generalization. Method: Build a graph-structured multi-agent collaboration network, quantify the quality of each step in the task-solving workflow, and store high-reward experiences for reference in later tasks. Result: Experiments show that MAEL converges faster and produces higher-quality solutions. Conclusion: By accumulating experience, MAEL substantially improves the collaboration efficiency and solution quality of multi-agent systems. Abstract: Large Language Model-based multi-agent systems (MAS) have shown remarkable progress in solving complex tasks through collaborative reasoning and inter-agent critique. However, existing approaches typically treat each task in isolation, resulting in redundant computations and limited generalization across structurally similar tasks. To address this, we introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation. We model the task-solving workflow on a graph-structured multi-agent collaboration network, where agents propagate information and coordinate via explicit connectivity. During the experiential learning phase, we quantify the quality for each step in the task-solving workflow and store the resulting rewards along with the corresponding inputs and outputs into each agent's individual experience pool. During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step, thereby enabling more accurate and efficient multi-agent collaboration. Experimental results on diverse datasets demonstrate that MAEL empowers agents to learn from prior task experiences effectively, achieving faster convergence and producing higher-quality solutions on current tasks.
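
The per-agent experience pool described above can be sketched as follows. This is a minimal illustration and not the MAEL implementation: the class names, the `min_reward` threshold, and the word-overlap relevance score are all assumptions made for the example, and the abstract does not say how relevance is actually computed.

```python
from dataclasses import dataclass, field

@dataclass
class Experience:
    task_input: str
    output: str
    reward: float

@dataclass
class ExperiencePool:
    items: list = field(default_factory=list)

    def add(self, task_input: str, output: str, reward: float) -> None:
        """Store one step's input, output, and reward."""
        self.items.append(Experience(task_input, output, reward))

    def retrieve(self, query: str, k: int = 2, min_reward: float = 0.5):
        """Return up to k high-reward experiences, most relevant first."""
        query_words = set(query.lower().split())

        def overlap(exp: Experience) -> float:
            # Jaccard word overlap: a stand-in (assumption) for a real relevance model.
            words = set(exp.task_input.lower().split())
            return len(query_words & words) / max(len(query_words | words), 1)

        candidates = [e for e in self.items if e.reward >= min_reward]
        candidates.sort(key=lambda e: (overlap(e), e.reward), reverse=True)
        return candidates[:k]

pool = ExperiencePool()
pool.add("sort a list of integers", "use merge sort", 0.9)
pool.add("sort a list of strings", "use bubble sort", 0.2)  # low reward, filtered out
pool.add("parse a json file", "use json.load", 0.8)
top = pool.retrieve("sort a list of floats", k=1)
```

The retrieved entries would then be formatted as few-shot examples in the agent's prompt; that formatting step is omitted here.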

[204] ExpeTrans: LLMs Are Experiential Transfer Learners

Jinglong Gao,Xiao Ding,Lingxiao Zou,Bibo Cai,Bing Qin,Ting Liu

Main category: cs.CL

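TL;DR: The paper proposes an autonomous experience transfer framework that lets large language models (LLMs) transfer experience from source tasks to target tasks on their own, reducing human and time costs while improving model performance.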

Details

Motivation: Existing methods rely on substantial human labor or time to collect task-solving experience, which does not scale to the variety of task types LLMs face. Method: Design an autonomous experience transfer framework that mimics human cognitive intelligence to transfer experience without manual intervention. Result: Experiments on 13 datasets show that the framework effectively improves LLM performance. Conclusion: The framework offers a new path for LLM generalization, and a detailed analysis of each module supports its effectiveness. Abstract: Recent studies provide large language models (LLMs) with textual task-solving experiences via prompts to improve their performance. However, previous methods rely on substantial human labor or time to gather such experiences for each task, which is impractical given the growing variety of task types in user queries to LLMs. To address this issue, we design an autonomous experience transfer framework to explore whether LLMs can mimic human cognitive intelligence to autonomously transfer experience from existing source tasks to newly encountered target tasks. This not only allows the acquisition of experience without extensive costs of previous methods, but also offers a novel path for the generalization of LLMs. Experimental results on 13 datasets demonstrate that our framework effectively improves the performance of LLMs. Furthermore, we provide a detailed analysis of each module in the framework.

[205] MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration

Zhitao He,Sandeep Polisetty,Zhiyuan Fan,Yuchen Huang,Shujin Wu,Yi R. Fung

Main category: cs.CL

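TL;DR: The MMBoundary framework calibrates the confidence of each reasoning step in multimodal large language models (MLLMs), improving their awareness of their knowledge boundaries and reducing hallucination.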

Details

Motivation: Existing methods for estimating model confidence look only at the overall response and ignore confidence at each reasoning step, so hallucinations accumulate. Method: Propose MMBoundary, which combines textual and cross-modal self-rewarding signals to calibrate step-level confidence and refines it further with supervised fine-tuning and reinforcement learning. Result: Experiments show that MMBoundary significantly outperforms existing methods across datasets from diverse domains and across metrics, reducing confidence calibration error by 7.5% on average and improving task performance by up to 8.3%. Conclusion: Fine-grained calibration of reasoning-step confidence effectively improves the reasoning and self-correction abilities of MLLMs. Abstract: In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) advanced inferencing. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on this set of self-rewarded confidence estimation signals for initial confidence expression warm-up, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge and calibrating confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average of 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.
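
For readers unfamiliar with calibration error, one common way to quantify it is the expected calibration error (ECE), sketched below for step-level confidences. The abstract does not say which calibration metric the authors report, so ECE is an assumption here, as are the function name and the equal-width binning.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

# Four reasoning steps stated with 0.9 confidence, three of them correct:
# mean confidence 0.9 versus accuracy 0.75 gives a gap of 0.15.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means the stated confidence tracks how often the steps are actually right.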

[206] MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration

Hao Lu,Yanchi Gu,Haoyuan Huang,Yulin Zhou,Ningxin Zhu,Chen Li

Main category: cs.CL

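TL;DR: The MCTSr-Zero framework combines MCTS with LLMs for open-ended dialogue tasks such as psychological counseling, improving dialogue quality through domain alignment and broader exploration mechanisms.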

Details

Motivation: Traditional MCTS methods can produce misaligned responses in open-ended dialogue because they rely on objective correctness, whereas tasks such as psychological counseling depend on subjective factors like empathy and ethics. Method: Propose the MCTSr-Zero framework, which introduces domain alignment, Regeneration, and Meta-Prompt Adaptation mechanisms to improve the generation of dialogue trajectories. Result: Experiments show that PsyLLM, trained on dialogue data generated with MCTSr-Zero, performs strongly on the PsyEval benchmark. Conclusion: MCTSr-Zero effectively addresses the challenges LLMs face in open-ended dialogue and generates high-quality dialogue data that meets psychological standards. Abstract: The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict "correctness" criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is "domain alignment", which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates "Regeneration" and "Meta-Prompt Adaptation" mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero's effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.

[207] ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering

Jingxuan Wei,Nan Xu,Junnan Zhu,Yanni Hao,Gaowei Wu,Bihui Yu,Lei Wang

Main category: cs.CL

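TL;DR: ChartMind is a new benchmark for complex chart question answering (CQA) that supports multilingual contexts and open-domain outputs, bridging the gap between real-world applications and academic benchmarks. The proposed ChartLLM framework significantly outperforms existing methods.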

Details

Motivation: Existing CQA evaluations rely too heavily on fixed output formats and objective metrics, overlooking the complex demands of real-world chart analysis. Method: Propose the ChartMind benchmark and the ChartLLM framework, which focuses on extracting key contextual elements, reducing noise, and strengthening the reasoning of multimodal large language models. Result: Evaluations on ChartMind and three public benchmarks show that ChartLLM significantly outperforms three common existing CQA paradigms. Conclusion: The study highlights the importance of flexible chart understanding and suggests new directions for developing more robust chart reasoning. Abstract: Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.

[208] Automatic Construction of Multiple Classification Dimensions for Managing Approaches in Scientific Papers

Bing Ma,Hai Zhuge

Main category: cs.CL

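TL;DR: This paper proposes a multi-dimensional framework for managing scientific approaches, using linguistic pattern recognition and a tree-structure similarity measure to support efficient querying and classification of approaches.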

Details

Motivation: There is no efficient framework for querying and managing the approaches described in scientific papers, which makes it hard for researchers to find and use relevant ones. Method: Identify approach patterns at four linguistic levels (semantic, discourse, syntactic, and lexical), extract approaches, and categorize them along five dimensions; represent steps as trees and measure step similarity with a syntax-focused tree similarity measure; build class trees with a bottom-up clustering algorithm. Result: The resulting multi-dimensional approach space substantially improves query relevance and narrows the search space quickly through a class-based mechanism. Conclusion: The multi-dimensional approach space is a viable solution for efficient querying and management of scientific approaches. Abstract: Approaches form the foundation for conducting scientific research. Querying approaches from a vast body of scientific papers is extremely time-consuming, and without a well-organized management framework, researchers may face significant challenges in querying and utilizing relevant approaches. Constructing multiple dimensions on approaches and managing them from these dimensions can provide an efficient solution. Firstly, this paper identifies approach patterns in a top-down way, refining the patterns through four distinct linguistic levels: semantic level, discourse level, syntactic level, and lexical level. Approaches in scientific papers are extracted based on approach patterns. Additionally, five dimensions for categorizing approaches are identified using these patterns. This paper proposes using a tree structure to represent each step and measuring the similarity between different steps with a tree-structure-based similarity measure that focuses on syntactic-level similarities. A collection similarity measure is proposed to compute the similarity between approaches. A bottom-up clustering algorithm is proposed to construct class trees for approach components within each dimension by merging each approach component or class with its most similar approach component or class in each iteration. The class labels generated during the clustering process indicate the common semantics of the step components within the approach components in each class and are used to manage the approaches within the class. The class trees of the five dimensions collectively form a multi-dimensional approach space. The application of approach queries on the multi-dimensional approach space demonstrates that querying within this space ensures strong relevance between user queries and results and rapidly reduces search space through a class-based query mechanism.
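
The bottom-up construction of a class tree can be sketched in simplified form. This is not the paper's algorithm: it merges only the single most similar pair per iteration, uses Jaccard word overlap in place of the tree-structure-based similarity measure, and takes the shared words as the class label. All names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Word-set overlap, used here as a stand-in similarity (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def bottom_up_cluster(components):
    """Merge the most similar pair until a single class tree remains.

    Returns (label_words, tree), where tree is a nested tuple of strings.
    """
    if not components:
        raise ValueError("need at least one component")
    nodes = [(set(c.lower().split()), c) for c in components]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                score = jaccard(nodes[i][0], nodes[j][0])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        (words_i, tree_i), (words_j, tree_j) = nodes[i], nodes[j]
        # The class label is the set of words shared by both members.
        merged = (words_i & words_j, (tree_i, tree_j))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

label, tree = bottom_up_cluster([
    "train a neural network",
    "train a neural classifier",
    "parse the syntax tree",
])
```

Here the two "train a neural ..." components are merged first under the label `{"train", "a", "neural"}`, and the unrelated component joins at the root with an empty label.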

[209] The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text

Maged S. Al-Shaibani,Moataz Ahmed

Main category: cs.CL

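TL;DR: This paper presents a comprehensive study of Arabic machine-generated text across multiple generation strategies and model architectures, reveals distinctive linguistic patterns in machine-generated text, and develops highly effective BERT-based detection models.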

Details

Motivation: Large language models (LLMs) excel at generating human-like text, which threatens information integrity, especially in low-resource languages such as Arabic. Method: Examine multiple generation strategies (title-only generation, content-aware generation, and text refinement) and model architectures (ALLaM, Jais, Llama, and GPT-4), combined with stylometric analysis. Result: Machine-generated Arabic text carries detectable signatures, and BERT-based detection models perform very well in formal contexts (up to 99.9% F1-score). Conclusion: The study lays a foundation for robust detection systems for Arabic, which matters for preserving information integrity. Abstract: Large Language Models (LLMs) have achieved unprecedented capabilities in generating human-like text, posing subtle yet significant challenges for information integrity across critical domains, including education, social media, and academia, enabling sophisticated misinformation campaigns, compromising healthcare guidance, and facilitating targeted propaganda. This challenge becomes severe, particularly in under-explored and low-resource languages like Arabic. This paper presents a comprehensive investigation of Arabic machine-generated text, examining multiple generation strategies (generation from the title only, content-aware generation, and text refinement) across diverse model architectures (ALLaM, Jais, Llama, and GPT-4) in academic and social media domains. Our stylometric analysis reveals distinctive linguistic patterns differentiating human-written from machine-generated Arabic text across these varied contexts. Despite their human-like qualities, we demonstrate that LLMs produce detectable signatures in their Arabic outputs, with domain-specific characteristics that vary significantly between different contexts. Based on these insights, we developed BERT-based detection models that achieved exceptional performance in formal contexts (up to 99.9% F1-score) with strong precision across model architectures. Our cross-domain analysis confirms generalization challenges previously reported in the literature. To the best of our knowledge, this work represents the most comprehensive investigation of Arabic machine-generated text to date, uniquely combining multiple prompt generation methods, diverse model architectures, and in-depth stylometric analysis across varied textual domains, establishing a foundation for developing robust, linguistically-informed detection systems essential for preserving information integrity in Arabic-language contexts.

[210] Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective

Yong Zhang,Yanwen Huang,Ning Cheng,Yang Guo,Yun Zhu,Yanmeng Wang,Shaojun Wang,Jing Xiao

Main category: cs.CL

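TL;DR: Sentinel is a lightweight sentence-level compression framework that uses decoder attention signals from an off-the-shelf small LLM to achieve efficient, question-aware context compression without training a dedicated compression model.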

Details

Motivation: In retrieval-augmented generation (RAG), retrieved context is often lengthy, noisy, or over the input limit, and traditional compression methods require training dedicated models, which is costly and hard to port. Method: Sentinel uses a lightweight classifier to probe decoder attention signals from a 0.5B proxy LLM, identifies which sentences are relevant, and filters the context accordingly. Result: On the LongBench benchmark, Sentinel achieves up to 5× compression while matching the QA performance of 7B-scale compression systems. Conclusion: Probing native attention signals enables fast, effective, and question-aware context compression. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5× compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: https://github.com/yzhangchuck/Sentinel.
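
The idea of attention-based sentence filtering can be sketched as below. This is a simplified illustration, not Sentinel itself: Sentinel trains a lightweight classifier on attention features, whereas this sketch simply ranks sentences by the mean attention mass their tokens receive. The function name, the `keep_ratio` parameter, and the toy attention values are assumptions.

```python
import math

def select_sentences(sentences, token_attention, keep_ratio=0.5):
    """Keep the sentences whose tokens receive the most attention on average.

    sentences: list of (text, n_tokens) pairs in context order.
    token_attention: flat list with one attention weight per context token.
    """
    scored = []
    pos = 0
    for idx, (_, n_tokens) in enumerate(sentences):
        span = token_attention[pos:pos + n_tokens]
        scored.append((sum(span) / max(n_tokens, 1), idx))
        pos += n_tokens

    n_keep = max(1, math.ceil(len(sentences) * keep_ratio))
    top = sorted(scored, reverse=True)[:n_keep]
    # Restore the original order so the compressed context still reads naturally.
    return [sentences[i][0] for i in sorted(idx for _, idx in top)]

# Toy example: token counts and attention weights are made up for illustration.
context = [
    ("Paris is the capital of France.", 3),
    ("The weather was mild.", 2),
    ("France is in Europe.", 2),
]
attention = [0.3, 0.2, 0.1, 0.01, 0.02, 0.1, 0.2]
compressed = select_sentences(context, attention, keep_ratio=0.5)
```

In practice the attention weights would come from the proxy model's decoder when it reads the query after the context; extracting them is model-specific and omitted here.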

[211] ScEdit: Script-based Assessment of Knowledge Editing

Xinye Li,Zunwen Zheng,Qian Zhang,Dekai Zhuang,Jiabao Kang,Liyan Xu,Qingbin Liu,Xi Chen,Zhiying Tu,Dianhui Chu,Dianbo Sui

Main category: cs.CL

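TL;DR: The paper introduces ScEdit, a new knowledge editing benchmark that covers both counterfactual and temporal edits, and finds that existing methods struggle on text-level metrics.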

Details

Motivation: Current knowledge editing tasks are too simple and are rarely integrated into real-world application scenarios, so a more comprehensive evaluation framework is needed. Method: Introduce the ScEdit benchmark, which covers counterfactual and temporal edits and combines token-level and text-level evaluation. Result: All knowledge editing methods drop in performance on established metrics and face challenges on text-level metrics. Conclusion: ScEdit provides a more comprehensive evaluation of knowledge editing and exposes the limitations of existing methods. Abstract: Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark -- ScEdit (Script-based Knowledge Editing Benchmark) -- which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based ("What"-type question) evaluation to action-based ("How"-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at https://github.com/asdfo123/ScEdit.

[212] How Does Response Length Affect Long-Form Factuality

James Xu Zhao,Jimmy Z. J. Liu,Bryan Hooi,See-Kiong Ng

Main category: cs.CL

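TL;DR: The study finds that when large language models (LLMs) generate long-form text, factual accuracy falls as length grows, mainly because the model exhausts its reliable knowledge.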

Details

Motivation: Examine how response length affects factual accuracy in long-form generation, a gap in existing research. Method: Propose an automatic bi-level long-form factuality evaluation framework and run controlled experiments to test three hypotheses (error propagation, long context, and facts exhaustion). Result: Longer responses have lower factual precision, and facts exhaustion is the main cause. Conclusion: Facts exhaustion, rather than error propagation or long context, is the primary driver of declining factuality in long responses. Abstract: Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.

[213] EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian

Daryna Dementieva,Nikolay Babakov,Alexander Fraser

Main category: cs.CL

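TL;DR: EmoBench-UA is the first emotion classification dataset for Ukrainian, filling a gap in the field; the paper also evaluates a range of approaches on it.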

Details

Motivation: Research on Ukrainian emotion classification lacks a public benchmark dataset, which holds the field back. Method: Annotate data through crowdsourcing and evaluate linguistic-based baselines, approaches using synthetic data, and large language models. Result: The study shows how challenging emotion classification is for non-mainstream languages such as Ukrainian and points to the need for dedicated models and resources. Conclusion: EmoBench-UA is an important resource for Ukrainian emotion classification, and more language-specific research is needed. Abstract: While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts. Our annotation schema is adapted from the guidelines of previous English-centric work on emotion detection (Mohammad et al., 2018; Mohammad, 2022). The dataset was created through crowdsourcing using the Toloka.ai platform, ensuring a high-quality annotation process. Then, we evaluate a range of approaches on the collected dataset, starting from linguistic-based baselines, synthetic data translated from English, to large language models (LLMs). Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and emphasize the need for further development of Ukrainian-specific models and training resources.

[214] Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs

Julia Belikova,Konstantin Polev,Rauf Parchiev,Dmitry Simakov

Main category: cs.CL

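TL;DR: This paper studies how to reduce the training data needed for hallucination detection, combining efficient classification algorithms with dimensionality reduction to match existing methods with only 250 samples.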

Details

Motivation: The reliability of large language models (LLMs) and retrieval-augmented generation (RAG) systems in industry is limited by the difficulty of detecting hallucinations, and existing methods depend on large annotated datasets that are hard to scale. Method: Combine efficient classification algorithms with dimensionality reduction to cut training data requirements, and evaluate the approach within the Lookback Lens and probing frameworks. Result: On standard question-answering RAG benchmarks, the approach matches strong baselines with only 250 training samples. Conclusion: Lightweight, data-efficient methods are promising for industrial deployment, especially where annotation is limited. Abstract: Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly deployed in industry applications, yet their reliability remains hampered by challenges in detecting hallucinations. While supervised state-of-the-art (SOTA) methods that leverage LLM hidden states -- such as activation tracing and representation analysis -- show promise, their dependence on extensively annotated datasets limits scalability in real-world applications. This paper addresses the critical bottleneck of data annotation by investigating the feasibility of reducing training data requirements for two SOTA hallucination detection frameworks: Lookback Lens, which analyzes attention head dynamics, and probing-based approaches, which decode internal model representations. We propose a methodology combining efficient classification algorithms with dimensionality reduction techniques to minimize sample size demands while maintaining competitive performance. Evaluations on standardized question-answering RAG benchmarks show that our approach achieves performance comparable to strong proprietary LLM-based baselines with only 250 training samples. These results highlight the potential of lightweight, data-efficient paradigms for industrial deployment, particularly in annotation-constrained scenarios.

[215] Generalized Category Discovery in Event-Centric Contexts: Latent Pattern Mining with LLMs

Yi Luo,Qiwen Wang,Junqi Yang,Luyao Tang,Zhenghao Lin,Zhenzhe Ying,Weiqiang Wang,Chen Lin

Main category: cs.CL

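TL;DR: The paper introduces the EC-GCD task, which involves long texts and class imbalance, and proposes the PaMA framework, which uses LLMs to better align clustering with classification and substantially improves performance.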

Details

Motivation: Existing textual GCD methods are insufficiently validated in realistic settings and perform poorly on long texts and imbalanced classes. Method: Propose the PaMA framework, which uses LLMs to extract event patterns and improve cluster-class alignment, and apply a ranking-filtering-mining pipeline to balance class prototypes. Result: On EC-GCD benchmarks, PaMA improves performance by up to 12.58% and performs well on the newly constructed Scam Report dataset. Conclusion: PaMA effectively addresses the challenges of EC-GCD and outperforms existing methods in both performance and generalization. Abstract: Generalized Category Discovery (GCD) aims to classify both known and novel categories using partially labeled data that contains only known classes. Despite achieving strong performance on existing benchmarks, current textual GCD methods lack sufficient validation in realistic settings. We introduce Event-Centric GCD (EC-GCD), characterized by long, complex narratives and highly imbalanced class distributions, posing two main challenges: (1) divergent clustering versus classification groupings caused by subjective criteria, and (2) unfair alignment for minority classes. To tackle these, we propose PaMA, a framework leveraging LLMs to extract and refine event patterns for improved cluster-class alignment. Additionally, a ranking-filtering-mining pipeline ensures balanced representation of prototypes across imbalanced categories. Evaluations on two EC-GCD benchmarks, including a newly constructed Scam Report dataset, demonstrate that PaMA outperforms prior methods with up to 12.58% H-score gains, while maintaining strong generalization on base GCD datasets.

[216] Enhancing Marker Scoring Accuracy through Ordinal Confidence Modelling in Educational Assessments

Abhirup Chakravarty,Mark Brenchley,Trevor Breakspear,Ian Lewin,Yan Huang

Main category: cs.CL

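TL;DR: The paper proposes confidence modelling to make Automated Essay Scoring (AES) more reliable, framing it as a classification task that predicts whether a score is accurate and introducing a new loss function, KWOCCE, which substantially improves scoring reliability.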

Details

Motivation: Address the insufficient reliability of automated essay scores and ensure that scores are released only when they meet high reliability standards. Method: Treat confidence estimation as a classification task, using score binning and a new KWOCCE loss function that exploits the ordinal structure of CEFR labels. Result: The best model reaches an F1 score of 0.97; 47% of scores can be released with 100% CEFR agreement and 99% with at least 95% agreement, compared with about 92% for the standalone AES model. Conclusion: Confidence modelling and the KWOCCE loss effectively improve AES reliability and give stronger assurance for practical use. Abstract: A key ethical challenge in Automated Essay Scoring (AES) is ensuring that scores are only released when they meet high reliability standards. Confidence modelling addresses this by assigning a reliability estimate, in the form of a confidence score, to each automated score. In this study, we frame confidence estimation as a classification task: predicting whether an AES-generated score correctly places a candidate in the appropriate CEFR level. While this is a binary decision, we leverage the inherent granularity of the scoring domain in two ways. First, we reformulate the task as an n-ary classification problem using score binning. Second, we introduce a set of novel Kernel Weighted Ordinal Categorical Cross Entropy (KWOCCE) loss functions that incorporate the ordinal structure of CEFR labels. Our best-performing model achieves an F1 score of 0.97, and enables the system to release 47% of scores with 100% CEFR agreement and 99% with at least 95% CEFR agreement, compared to approximately 92% CEFR agreement from the standalone AES model where we release all AM predicted scores.

[217] Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO

Kaiyang Guo,Yinchuan Li,Zhitang Chen

Main category: cs.CL

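TL;DR: The paper analyzes the limitations of direct alignment methods such as DPO for optimizing large language models and proposes a new method, PRO, that resolves likelihood underdetermination and performs well across multiple feedback types.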

Details

Motivation: Direct alignment methods such as DPO suffer from likelihood underdetermination when optimizing large language models, causing model outputs to deviate from expected patterns. Method: Re-decomposes the DPO loss and proposes PRO, which reinstates the complete regularizer to resolve likelihood underdetermination. Result: PRO outperforms existing methods in pairwise, binary, and scalar feedback scenarios. Conclusion: PRO is a unified and efficient alignment method that addresses a core problem of direct alignment. Abstract: Direct alignment methods typically optimize large language models (LLMs) by contrasting the likelihoods of preferred versus dispreferred responses. While effective in steering LLMs to match relative preference, these methods are frequently noted for decreasing the absolute likelihoods of example responses. As a result, aligned models tend to generate outputs that deviate from the expected patterns, exhibiting a reward-hacking effect even without a reward model. This undesired consequence exposes a fundamental limitation in contrastive alignment, which we characterize as likelihood underdetermination. In this work, we revisit direct preference optimization (DPO) -- the seminal direct alignment method -- and demonstrate that its loss theoretically admits a decomposed reformulation. The reformulated loss not only broadens applicability to a wider range of feedback types, but also provides novel insights into the underlying cause of likelihood underdetermination. Specifically, the standard DPO implementation implicitly oversimplifies a regularizer in the reformulated loss, and reinstating its complete version effectively resolves the underdetermination issue. Leveraging these findings, we introduce PRoximalized PReference Optimization (PRO), a unified method to align with diverse feedback types, eliminating likelihood underdetermination through an efficient approximation of the complete regularizer. Comprehensive experiments show the superiority of PRO over existing methods in scenarios involving pairwise, binary and scalar feedback.
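The contrast-only behaviour described in the abstract can be seen in the standard DPO loss, sketched here in plain Python for a single preference pair. This is the widely used DPO objective, not the PRO regularizer from the paper; the function names, the scalar log-probability inputs, and the value of `beta` are illustrative assumptions.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


# Same margin, very different absolute likelihoods (toy numbers).
loss_a = dpo_loss(-10.0, -12.0, -11.0, -11.0)
loss_b = dpo_loss(-20.0, -22.0, -11.0, -11.0)
print(loss_a, loss_b)
```

Because the loss depends only on the difference between the two log-ratios, lowering both policy log-probabilities by the same amount leaves it unchanged. That is one simple way to see why a purely contrastive objective does not pin down absolute likelihoods.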

[218] Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors

Harish Tayyar Madabushi,Melissa Torgbi,Claire Bonial

Main category: cs.CL

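TL;DR: The paper takes a middle-ground position on LLM capabilities: LLMs draw on information from their training data through context-directed extrapolation, rather than being "stochastic parrots" or possessing "emergent" advanced reasoning, as the two extreme views hold.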

Details

Motivation: Critically examine a realistic view of LLM capabilities that avoids extreme positions (that LLMs are mindless "stochastic parrots" or that they have unpredictable "emergent" advanced reasoning). Method: Proposes the mechanism of "context-directed extrapolation" and, supported by existing literature, analyzes how LLMs retrieve and extend information from their training data. Result: LLM reasoning goes beyond stochastic parroting but is predictable and controllable, does not amount to human-like high-level cognition, and does not scale indefinitely. Conclusion: Research should focus on the mechanism of context-directed extrapolation and its interaction with training data; future work can explore augmenting techniques that do not rely on inherent advanced reasoning in LLMs. Abstract: In this position paper we raise critical awareness of a realistic view of LLM capabilities that eschews extreme alternative views that LLMs are either "stochastic parrots" or in possession of "emergent" advanced reasoning capabilities, which, due to their unpredictable emergence, constitute an existential threat. Our middle-ground view is that LLMs extrapolate from priors from their training data, and that a mechanism akin to in-context learning enables the targeting of the appropriate information from which to extrapolate. We call this "context-directed extrapolation." Under this view, substantiated through existing literature, while reasoning capabilities go well beyond stochastic parroting, such capabilities are predictable, controllable, not indicative of advanced reasoning akin to high-level cognitive capabilities in humans, and not infinitely scalable with additional training. As a result, fears of uncontrollable emergence of agency are allayed, while research advances are appropriately refocused on the processes of context-directed extrapolation and how this interacts with training data to produce valuable capabilities in LLMs. Future work can therefore explore alternative augmenting techniques that do not rely on inherent advanced reasoning in LLMs.

[219] Discriminative Policy Optimization for Token-Level Reward Models

Hongzhan Chen,Tao Yang,Shiping Gao,Ruijun Chen,Xiaojun Quan,Hongtao Tian,Ting Yao

Main category: cs.CL

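TL;DR: Q-RM is a token-level reward model for optimizing policy models that decouples reward modeling from language generation, markedly improving performance and training efficiency on complex reasoning tasks.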

Details

Motivation: Resolve the inaccurate credit assignment caused by the conflict between generative language modeling and reward modeling, and improve the stability and performance of token-level reward models. Method: Proposes Q-RM, which learns token-level rewards from preference data by optimizing a discriminative policy (a Q-function), without fine-grained annotations. Result: Q-RM significantly outperforms baseline methods on mathematical reasoning tasks, with clear Pass@1 gains, and greatly improves training efficiency (up to 12 times faster convergence). Conclusion: Q-RM offers an efficient and stable approach to token-level reward modeling for complex reasoning tasks, with potential for broad application. Abstract: Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM.

[220] Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation

Beiduo Chen,Yang Janet Liu,Anna Korhonen,Barbara Plank

Main category: cs.CL

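TL;DR: This paper proposes an LLM-based pipeline that uses linguistically grounded discourse segmenters to extract statements supporting or opposing each answer option from chains of thought, and designs a rank-based evaluation framework; the method outperforms a direct generation method and baselines.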

Details

Motivation: Study how LLM-generated chains of thought (CoTs) can be used to better understand human label variation and to improve the match between model predictions and human label distributions. Method: Proposes an LLM pipeline that uses discourse segmenters to extract statements supporting or opposing each answer from CoTs, and designs a rank-based evaluation framework. Result: The method outperforms a direct generation method and baselines on three datasets, and the ranking methods align better with human annotations. Conclusion: The method effectively improves the alignment between model predictions and human label distributions and shows the potential of chains of thought for understanding label variation. Abstract: The recent rise of reasoning-tuned Large Language Models (LLMs)--which generate chains of thought (CoTs) before giving the final answer--has attracted significant attention and offers new opportunities for gaining insights into human label variation, which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but typically adopts a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.

[221] Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models

Mingyu Yu,Wei Wang,Yanjie Wei,Sujuan Qin

Main category: cs.CL

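TL;DR: The paper studies jailbreak attacks on large language models (LLMs) and proposes an adaptive jailbreaking framework based on semantic understanding capabilities that markedly raises attack success rates.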

Details

Motivation: Jailbreak attacks, which bypass the safety and ethical constraints of LLMs, have become a key challenge in AI security. The paper explores attack strategies adapted to the differing comprehension abilities of LLMs. Method: Proposes a classification framework that divides LLMs into Type I and Type II and designs tailored jailbreaking strategies according to their semantic understanding capabilities. Result: Experiments show that the adaptive strategy markedly improves the jailbreak success rate, reaching 98.9% on GPT-4o. Conclusion: The study offers a new perspective on LLM security and shows how effective adaptive strategies are in attacks. Abstract: Adversarial attacks on Large Language Models (LLMs) via jailbreaking techniques, methods that circumvent their built-in safety and ethical constraints, have emerged as a critical challenge in AI security. These attacks compromise the reliability of LLMs by exploiting inherent weaknesses in their comprehension capabilities. This paper investigates the efficacy of jailbreaking strategies that are specifically adapted to the diverse levels of understanding exhibited by different LLMs. We propose the Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models, a novel framework that classifies LLMs into Type I and Type II categories according to their semantic comprehension abilities. For each category, we design tailored jailbreaking strategies aimed at leveraging their vulnerabilities to facilitate successful attacks. Extensive experiments conducted on multiple LLMs demonstrate that our adaptive strategy markedly improves the success rate of jailbreaking. Notably, our approach achieves an exceptional 98.9% success rate in jailbreaking GPT-4o (29 May 2025 release).

[222] From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs

Xuan Gong,Hanbo Huang,Shiyu Liang

Main category: cs.CL

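TL;DR: This paper studies how supervised fine-tuning data affects the factuality of large language models (LLMs), identifies an interaction between fine-tuning data and test-time prompts, and shows that in-context learning (ICL) can compensate for shortcomings in fine-tuning data.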

Details

Motivation: Explore the mechanism by which supervised fine-tuning data affects LLM factuality, especially the gap between fine-tuning on known and unknown knowledge. Method: Through systematic experiments, analyzes the interaction between fine-tuning data and test-time prompts (such as few-shot learning and chain of thought) and gives a theoretical proof from the perspective of knowledge graphs. Result: Test-time prompts can mitigate shortcomings in fine-tuning data and can even dominate the knowledge extraction process. Conclusion: In-context learning (ICL) can effectively compensate for deficiencies in fine-tuning data, and its role in evaluating fine-tuning data selection methods needs to be reconsidered. Abstract: Factual knowledge extraction aims to explicitly extract knowledge parameterized in pre-trained language models for application in downstream tasks. While prior work has been investigating the impact of supervised fine-tuning data on the factuality of large language models (LLMs), its mechanism remains poorly understood. We revisit this impact through systematic experiments, with a particular focus on the factuality gap that arises when fine-tuning on known versus unknown knowledge. Our findings show that this gap can be mitigated at the inference stage, either under out-of-distribution (OOD) settings or by using appropriate in-context learning (ICL) prompts (i.e., few-shot learning and Chain of Thought (CoT)). We prove this phenomenon theoretically from the perspective of knowledge graphs, showing that the test-time prompt may diminish or even overshadow the impact of fine-tuning data and play a dominant role in knowledge extraction. Ultimately, our results shed light on the interaction between fine-tuning data and the test-time prompt, demonstrating that ICL can effectively compensate for shortcomings in fine-tuning data, and highlighting the need to reconsider the use of ICL prompting as a means to evaluate the effectiveness of fine-tuning data selection methods.

[223] The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence

Marco Gaido,Sara Papi,Luisa Bentivogli,Alessio Brutti,Mauro Cettolo,Roberto Gretter,Marco Matassoni,Mohamed Nabih,Matteo Negri

Main category: cs.CL

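TL;DR: This paper fills a gap in the study of learning rate (LR) warmup tuning for large-scale speech-to-text (S2T) training, examining how sub-exponential warmup and a higher initial learning rate affect final performance.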

Details

Motivation: In large-scale S2T training, conventional learning rate adjustment is not enough, so better warmup strategies need to be studied. Method: Proposes and compares different learning rate warmup schedules, including double linear warmup and sub-exponential warmup. Result: Sub-exponential warmup is better suited to large-scale S2T training; a higher initial learning rate speeds up convergence but does not improve final performance. Conclusion: Large-scale S2T training calls for a sub-exponential learning rate warmup, and a higher initial learning rate only helps convergence speed. Abstract: Training large-scale models presents challenges not only in terms of resource requirements but also in terms of their convergence. For this reason, the learning rate (LR) is often decreased when the size of a model is increased. Such a simple solution is not enough in the case of speech-to-text (S2T) trainings, where evolved and more complex variants of the Transformer architecture -- e.g., Conformer or Branchformer -- are used in light of their better performance. As a workaround, OWSM designed a double linear warmup of the LR, increasing it to a very small value in the first phase before updating it to a higher value in the second phase. While this solution worked well in practice, it was not compared with alternative solutions, nor was the impact on the final performance of different LR warmup schedules studied. This paper fills this gap, revealing that i) large-scale S2T trainings demand a sub-exponential LR warmup, and ii) a higher LR in the warmup phase accelerates initial convergence, but it does not boost final performance.
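The double linear warmup described in the abstract (a first linear ramp to a very small LR, then a second ramp to a higher one) can be sketched as a piecewise-linear schedule. The function name, the phase lengths, and the two learning rate values below are hypothetical placeholders, not the settings used by OWSM or by the paper, and any decay after warmup is left out.

```python
def double_linear_warmup(step, low_lr=2.5e-5, peak_lr=2.5e-4,
                         phase1_steps=1000, phase2_steps=1000):
    """Piecewise-linear warmup with two phases.

    Phase 1: ramp linearly from 0 to low_lr over phase1_steps.
    Phase 2: ramp linearly from low_lr to peak_lr over phase2_steps.
    Afterwards the LR is held at peak_lr (decay omitted in this sketch).
    """
    if step < phase1_steps:
        return low_lr * step / phase1_steps
    if step < phase1_steps + phase2_steps:
        frac = (step - phase1_steps) / phase2_steps
        return low_lr + frac * (peak_lr - low_lr)
    return peak_lr


for step in (0, 500, 1000, 1500, 2000, 5000):
    print(step, double_linear_warmup(step))
```

A function of this shape can be wrapped in whatever scheduler interface a training framework provides; alternative warmup curves, such as the sub-exponential one the paper favours, would replace the two linear ramps with a different monotone function of `step`.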

[224] UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions

Chuanyuan Tan,Wenbiao Shao,Hao Xiong,Tong Zhu,Zhenhua Liu,Kai Shi,Wenliang Chen

Main category: cs.CL

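TL;DR: The paper introduces UAQFact, a new dataset of unanswerable questions for evaluating how well LLMs use factual knowledge when handling such questions, and finds that LLMs perform poorly on this task.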

Details

Motivation: Existing datasets lack factual knowledge support, which limits the evaluation of how LLMs use factual knowledge when handling unanswerable questions. Method: Introduces UAQFact, a bilingual dataset built from a knowledge graph, and defines two new tasks that measure LLMs' ability to use internal and external factual knowledge, respectively. Result: Experiments show that LLMs perform poorly on UAQFact and cannot fully use factual knowledge even when they have it stored. Conclusion: External knowledge may improve performance, but LLMs still struggle to make full use of it, which leads to incorrect responses. Abstract: Handling unanswerable questions (UAQ) is crucial for LLMs, as it helps prevent misleading responses in complex situations. While previous studies have built several datasets to assess LLMs' performance on UAQ, these datasets lack factual knowledge support, which limits the evaluation of LLMs' ability to utilize their factual knowledge when handling UAQ. To address the limitation, we introduce a new unanswerable question dataset UAQFact, a bilingual dataset with auxiliary factual knowledge created from a Knowledge Graph. Based on UAQFact, we further define two new tasks to measure LLMs' ability to utilize internal and external factual knowledge, respectively. Our experimental results across multiple LLM series show that UAQFact presents significant challenges, as LLMs do not consistently perform well even when they have factual knowledge stored. Additionally, we find that incorporating external knowledge may enhance performance, but LLMs still cannot make full use of the knowledge, which may result in incorrect responses.

[225] Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons

Krithik Vishwanath,Anton Alyakin,Mrigayu Ghosh,Jin Vivian Lee,Daniel Alexander Alber,Karl L. Sangwon,Douglas Kondziolka,Eric Karl Oermann

Main category: cs.CL

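TL;DR: The study evaluates 28 large language models on neurosurgery exam questions and finds that some models pass the exam, but their performance is easily degraded by distracting information.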

Details

Motivation: Evaluate how large language models perform on neurosurgical knowledge and how robust they are to distracting information. Method: Tests 28 models on 2,904 neurosurgery exam questions and introduces a distraction framework to assess model fragility. Result: Six models pass the exam, but distracting information causes significant performance drops of up to 20.4%. Conclusion: Current large language models perform well on neurosurgery exam questions but are sensitive to distracting information, so new strategies are needed to improve robustness. Abstract: The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in non-clinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. Six of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced, by as much as 20.4%, with one model failing that had previously passed. Both general-purpose and medical open-source models experienced greater performance declines compared to proprietary variants when subjected to the added distractors. While current LLMs demonstrate an impressive ability to answer neurosurgery board-like exam questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.

[226] Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt

Keqin Peng,Liang Ding,Yuanxin Ouyang,Meng Fang,Dacheng Tao

Main category: cs.CL

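TL;DR: The paper presents a quantitative analysis of overthinking in long chain-of-thought reasoning by Reasoning Large Language Models (RLLMs), finds that self-doubt is a major cause, and proposes a simple and effective prompting method that reduces over-reliance on the input question.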

Details

Motivation: Existing work on overthinking in long chain-of-thought reasoning is mostly qualitative and lacks quantitative study. This paper analyzes overthinking quantitatively from the perspective of self-doubt and explores a remedy. Method: Proposes a prompting method that first asks the model to question the validity of the input question and then to answer concisely based on that assessment. Experiments cover three mathematical reasoning tasks and four datasets with missing premises. Result: The method substantially reduces answer length and the number of reasoning steps and yields significant improvements on four widely used RLLMs. Further analysis shows that it effectively reduces self-doubt. Conclusion: By quantifying the effect of self-doubt on overthinking and proposing a simple and effective prompting method, the paper markedly improves model behaviour. Abstract: Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks, largely due to the adoption of Long Chain-of-Thought (Long CoT) reasoning. However, they often exhibit overthinking -- performing unnecessary reasoning steps even after arriving at the correct answer. Prior work has largely focused on qualitative analyses of overthinking through sample-based observations of long CoTs. In contrast, we present a quantitative analysis of overthinking from the perspective of self-doubt, characterized by excessive token usage devoted to re-verifying already-correct answers. We find that self-doubt significantly contributes to overthinking. In response, we introduce a simple and effective prompting method to reduce the model's over-reliance on input questions, thereby avoiding self-doubt. Specifically, we first prompt the model to question the validity of the input question, and then respond concisely based on the outcome of that evaluation. Experiments on three mathematical reasoning tasks and four datasets with missing premises demonstrate that our method substantially reduces answer length and yields significant improvements across nearly all datasets on four widely used RLLMs. Further analysis demonstrates that our method effectively minimizes the number of reasoning steps and reduces self-doubt.

[227] Spoken Language Modeling with Duration-Penalized Self-Supervised Units

Nicol Visser,Herman Kamper

Main category: cs.CL

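TL;DR: The paper studies how codebook size and unit coarseness (duration) affect spoken language model (SLM) performance, finding that coarser units do better on sentence resynthesis but have little effect at the phone and word levels.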

Details

Motivation: Explore the interaction between codebook size and unit coarseness in SLM performance, which has not been studied before. Method: Uses the simple duration-penalized dynamic programming (DPDP) method to vary codebook size and unit coarseness, with analyses at different linguistic levels. Result: At the phone and word levels, coarseness has little effect; in sentence resynthesis, coarser units perform better. In lexical and syntactic tasks, coarser units give higher accuracy at lower bitrates. Conclusion: Coarser units are not always better, but DPDP is a simple and efficient method for tasks that benefit from them. Abstract: Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.

[228] Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Liangliang Zhang,Zhuorui Jiang,Hongliang Chi,Haoyang Chen,Mohammed Elkoumy,Fali Wang,Qiong Wu,Zhengyi Zhou,Shirui Pan,Suhang Wang,Yao Ma

Main category: cs.CL

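TL;DR: KGQAGen is an LLM-based framework for generating high-quality multi-hop question answering datasets, addressing quality problems in existing benchmark datasets.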

Details

Motivation: Existing KGQA benchmark datasets (such as WebQSP and CWQ) suffer from annotation errors, ambiguous or unanswerable questions, and outdated knowledge, with an average factual correctness rate of only 57%. Method: KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Result: The KGQAGen-10k benchmark is constructed, and experiments show that even state-of-the-art models struggle on it. Conclusion: KGQAGen provides a scalable framework for KGQA evaluation and calls for more rigorous benchmark construction. Abstract: Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand-scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.

[229] CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification

Nawar Turk,Eeham Khan,Leila Kosseim

Main category: cs.CL

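TL;DR: This paper describes an approach to SemEval-2025 Task 6 (PromiseEval) that addresses the four promise-verification subtasks with three model architectures; the combined model performs best.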

Details

Motivation: Verify promises in corporate ESG reports, covering four subtasks: promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Method: 1. An ESG-BERT model; 2. An enhanced ESG-BERT with linguistic features; 3. A combined subtask model using attention pooling, document metadata augmentation, and multi-objective learning. Result: On the ML-Promise dataset, the combined model scores 0.5268, above the baseline of 0.5227. Conclusion: Linguistic feature extraction, attention pooling, and multi-objective learning are effective for promise verification, despite the challenges of class imbalance and limited data. Abstract: This paper presents our approach to the SemEval-2025 Task 6 (PromiseEval), which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. We explore three model architectures to address the four subtasks of promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Our first model utilizes ESG-BERT with task-specific classifier heads, while our second model enhances this architecture with linguistic features tailored for each subtask. Our third approach implements a combined subtask model with attention-based sequence pooling, transformer representations augmented with document metadata, and multi-objective learning. Experiments on the English portion of the ML-Promise dataset demonstrate progressive improvement across our models, with our combined subtask approach achieving a leaderboard score of 0.5268, outperforming the provided baseline of 0.5227. Our work highlights the effectiveness of linguistic feature extraction, attention pooling, and multi-objective learning in promise verification tasks, despite challenges posed by class imbalance and limited training data.

[230] Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

Yunqiao Yang,Houxing Ren,Zimu Lu,Ke Wang,Weikang Shi,Aojun Zhou,Junting Pan,Mingjie Zhan,Hongsheng Li

Main category: cs.CL

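TL;DR: The paper proposes a new framework, PCPO, that improves the mathematical reasoning of language models by combining answer correctness with internal probability consistency.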

Details

Motivation: Existing methods rely only on answer correctness or consistency and overlook internal logical coherence; PCPO aims to address this. Method: Proposes the PCPO framework, which performs preference optimization using two quantitative metrics: answer correctness and token-level probability consistency. Result: Experiments show that PCPO outperforms existing methods across a range of LLMs and benchmarks. Conclusion: By combining surface-level and intrinsic consistency metrics, PCPO markedly improves the mathematical reasoning of language models. Abstract: Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.

[231] Translation in the Wild

Yuri Balashov

Main category: cs.CL

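TL;DR: LLMs show strong translation abilities even though their training objectives are not translation-related. This paper explores where those abilities come from, proposes a "duality" hypothesis, and discusses its implications for the concept of translation.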

Details

Motivation: Investigate why LLMs show excellent translation abilities without being trained specifically on translation. Method: Proposes the "duality" hypothesis by analyzing the roles of pre-training data and instruction tuning. Result: LLMs' translation abilities may originate in two different types of pre-training data. Conclusion: The "duality" hypothesis offers a new perspective for reconceptualizing translation in the age of deep learning. Abstract: Large Language Models (LLMs) excel in translation among other things, demonstrating competitive performance for many language pairs in zero- and few-shot settings. But unlike dedicated neural machine translation models, LLMs are not trained on any translation-related objective. What explains their remarkable translation abilities? Are these abilities grounded in "incidental bilingualism" (Briakou et al. 2023) in training data? Does instruction tuning contribute to it? Are LLMs capable of aligning and leveraging semantically identical or similar monolingual contents from different corners of the internet that are unlikely to fit in a single context window? I offer some reflections on this topic, informed by recent studies and growing user experience. My working hypothesis is that LLMs' translation abilities originate in two different types of pre-training data that may be internalized by the models in different ways. I discuss the prospects for testing the "duality" hypothesis empirically and its implications for reconceptualizing translation, human and machine, in the age of deep learning.

[232] Understanding Refusal in Language Models with Sparse Autoencoders

Wei Jie Yeo,Nirmalendu Prakash,Clement Neo,Roy Ka-Wei Lee,Erik Cambria,Ranjan Satapathy

Main category: cs.CL

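TL;DR: The paper uses sparse autoencoders to study refusal behavior in instruction-tuned LLMs, revealing its internal mechanisms and validating the effect of refusal-related features on generation.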

Details

Motivation: Study the internal mechanisms of refusal in language models in order to improve model safety and the understanding of adversarial attacks. Method: Uses sparse autoencoders to identify latent features of refusal and runs intervention experiments on two open-source chat models. Result: Validates the influence of refusal features on generation and shows that they generalize to classification of adversarial samples. Conclusion: The study provides a fine-grained analysis of how refusal works and supports improved defenses against adversarial attacks. Abstract: Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationship and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial samples in classification tasks. We open source our code at https://github.com/wj210/refusal_sae.

[233] Evaluating AI capabilities in detecting conspiracy theories on YouTube

Leonardo La Rocca,Francesco Corso,Francesco Pierri

Main category: cs.CL

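TL;DR: This study evaluates text-only and multimodal large language models (LLMs) for identifying conspiracy theory videos on YouTube, finding that text models achieve high recall but low precision and that multimodal models perform worse. On unlabeled data, RoBERTa performs close to LLMs with more parameters.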

Details

Motivation: As a leading global platform, YouTube is prone to spreading harmful content such as conspiracy theories. The study explores the potential of LLMs for detecting such content. Method: Evaluates the zero-shot performance of several LLMs on a labeled dataset and compares them with a fine-tuned RoBERTa baseline, testing both text-only and multimodal models. Result: Text-based LLMs achieve high recall but low precision; multimodal models lag behind text-only models. On unlabeled data, RoBERTa comes close to LLM performance. Conclusion: Current LLM-based approaches have limited effectiveness for harmful content detection, and more precise and robust systems are needed. Abstract: As a leading online platform with a vast global audience, YouTube's extensive reach also makes it susceptible to hosting harmful content, including disinformation and conspiracy theories. This study explores the use of open-weight Large Language Models (LLMs), both text-only and multimodal, for identifying conspiracy theory videos shared on YouTube. Leveraging a labeled dataset of thousands of videos, we evaluate a variety of LLMs in a zero-shot setting and compare their performance to a fine-tuned RoBERTa baseline. Results show that text-based LLMs achieve high recall but lower precision, leading to increased false positives. Multimodal models lag behind their text-only counterparts, indicating limited benefits from visual data integration. To assess real-world applicability, we evaluate the most accurate models on an unlabeled dataset, finding that RoBERTa achieves performance close to LLMs with a larger number of parameters. Our work highlights the strengths and limitations of current LLM-based approaches for online harmful content detection, emphasizing the need for more precise and robust systems.

[234] Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Guangtao Zeng,Maohao Shen,Delin Chen,Zhenting Qi,Subhro Das,Dan Gutfreund,David Cox,Gregory Wornell,Wei Lu,Zhang-Wei Hong,Chuang Gan

Main category: cs.CL

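TL;DR: EvoScale is a sample-efficient method that refines language model outputs through an evolutionary process, reducing the number of samples needed so that a small model (e.g., 32B) matches or exceeds larger models (>100B).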

Details

Motivation: Small language models perform poorly on real-world software engineering tasks, yet they are more practical because of their lower computational cost. Existing approaches such as supervised fine-tuning are expensive, and test-time scaling strategies are inefficient. Method: Proposes EvoScale, which treats generation as an evolutionary process, iteratively refines outputs through selection and mutation, and uses reinforcement learning to train the model to self-evolve. Result: On SWE-Bench-Verified, the 32B model Satori-SWE-32B matches or exceeds models with over 100B parameters while needing few samples. Conclusion: EvoScale gives small models an efficient path to improvement; code, data, and models will be open-sourced. Abstract: Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead from repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using a few samples. Code, data, and models will be fully open-sourced.
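The selection-and-mutation loop that the abstract describes can be sketched generically. This is a toy illustration of evolutionary refinement, not the EvoScale implementation: candidates are plain numbers, and `score` and `mutate` are hypothetical stand-ins for a verifier and for the language model rewriting its own earlier outputs.

```python
import random


def evolve(initial, score, mutate, iterations=5, keep=2,
           children_per_parent=3, rng=None):
    """Iteratively refine a population by selection and mutation.

    Each round keeps the `keep` best candidates (selection) and adds
    mutated copies of them (mutation). Returns the best candidate and
    the best score seen after each round.
    """
    rng = rng or random.Random(0)
    population = list(initial)
    history = []
    for _ in range(iterations):
        parents = sorted(population, key=score, reverse=True)[:keep]
        children = [mutate(p, rng)
                    for p in parents
                    for _ in range(children_per_parent)]
        # Parents are retained, so the best score never decreases.
        population = parents + children
        history.append(max(score(c) for c in population))
    return max(population, key=score), history


# Toy problem (hypothetical): find a number close to 3.
def toy_score(x):
    return -(x - 3.0) ** 2


def toy_mutate(x, rng):
    return x + rng.gauss(0.0, 0.5)


best, history = evolve([0.0, 1.0, 6.0], toy_score, toy_mutate)
print(best, history)
```

Compared with drawing many independent samples and picking the best one, each round here conditions on the current best candidates, which is the intuition behind shifting the output distribution toward higher-scoring regions with fewer samples.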

[235] Table-R1: Inference-Time Scaling for Table Reasoning

Zheyuan Yang,Lyuhao Chen,Arman Cohan,Yilun Zhao

Main category: cs.CL

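TL;DR: This paper presents the first study of inference-time scaling for table reasoning and proposes two post-training strategies: distillation from frontier-model reasoning traces and reinforcement learning with verifiable rewards (RLVR). Experiments show that the Table-R1-Zero model matches GPT-4.1 and DeepSeek-R1 while using only 7B parameters.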

Details

Motivation: Explore the feasibility of inference-time scaling for table reasoning tasks and improve model performance. Method: 1. Distillation: fine-tune LLMs on a large-scale dataset of reasoning traces generated by DeepSeek-R1 to obtain the Table-R1-SFT model. 2. RLVR: design task-specific verifiable reward functions and apply the GRPO algorithm to train the Table-R1-Zero model. Result: Table-R1-Zero performs strongly across a range of table reasoning tasks, matching or exceeding GPT-4.1 and DeepSeek-R1, and generalizes well. Conclusion: Instruction tuning, model architecture choices, and cross-task generalization all matter for table reasoning, and essential reasoning skills emerge during RL training. Abstract: In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as emergence of essential table reasoning skills during RL training.
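To make "verifiable reward" concrete, here is a minimal sketch of what such a function could look like for short-form table QA. It is an assumption for illustration, not the reward used in the paper: the `Answer:` output convention, the normalization rules, and the function names are all hypothetical.

```python
import re


def normalize(text):
    """Lowercase, trim, and collapse whitespace so trivial formatting
    differences do not affect the comparison."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_match_reward(model_output, gold_answers):
    """Return 1.0 if the final 'Answer:' line matches any gold answer
    after normalization, otherwise 0.0 (including when no such line
    exists)."""
    answer_lines = [line for line in model_output.splitlines()
                    if line.strip().lower().startswith("answer:")]
    if not answer_lines:
        return 0.0
    predicted = normalize(answer_lines[-1].split(":", 1)[1])
    gold = {normalize(g) for g in gold_answers}
    return 1.0 if predicted in gold else 0.0


output = "The 2019 row has the largest total.\nAnswer:  2019 "
print(exact_match_reward(output, ["2019"]))
```

A reward of this kind can be checked automatically against a reference answer, which is what allows it to serve as the training signal in an RL loop; fact verification or free-form QA would need different comparison rules.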

[236] Characterizing the Expressivity of Transformer Language Models

Jiaoda Li,Ryan Cotterell

Main category: cs.CL

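TL;DR: The paper studies the theoretical expressivity of fixed-precision transformers, shows that it is exactly that of a specific fragment of linear temporal logic, and reports experiments consistent with the theory.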

Details

Motivation: Although transformers are empirically successful, their theoretical expressive power remains unclear, especially under the fixed precision and soft attention used in practical implementations. Method: Analyzes the expressivity of an idealized fixed-precision transformer with strict future masking and soft attention. Result: Proves that these models are exactly as expressive as a fragment of linear temporal logic containing only the past operator, and relates this logic to formal language theory, automata theory, and algebra. Conclusion: The theory agrees with experiments: transformers generalize perfectly on languages within their theoretical capacity and fail to generalize on languages beyond it. Abstract: Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. Prior work often relies on idealized models with assumptions -- such as arbitrary numerical precision and hard attention -- that diverge from real-world transformers. In this work, we provide an exact characterization of fixed-precision transformers with strict future masking and soft attention, an idealization that more closely mirrors practical implementations. We show that these models are precisely as expressive as a specific fragment of linear temporal logic that includes only a single temporal operator: the past operator. We further relate this logic to established classes in formal language theory, automata theory, and algebra, yielding a rich and unified theoretical framework for understanding transformer expressivity. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their theoretical capacity generalize perfectly over lengths, while they consistently fail to generalize on languages beyond it.

[237] AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

Jiaxin Bai,Wei Fan,Qi Hu,Qing Zong,Chunyang Li,Hong Ting Tsang,Hongyu Luo,Yauwai Yim,Haoyu Huang,Xiao Zhou,Feng Qin,Tianshi Zheng,Xi Peng,Xin Yao,Huiwen Yang,Leijie Wu,Yi Ji,Gong Zhang,Renhai Chen,Yangqiu Song

Main category: cs.CL

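TL;DR: AutoSchemaKG is a fully autonomous knowledge graph construction framework that needs no predefined schema; it uses large language models to extract knowledge triples and induce schemas from text, and was used to build the large-scale ATLAS knowledge graphs.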

Details

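Motivation: Traditional knowledge graph construction depends on predefined schemas, which limits flexibility and scalability. AutoSchemaKG aims to remove this constraint and make construction fully autonomous. Method: Uses large language models to extract knowledge triples and induce schemas simultaneously, applying conceptualization to organize instances into semantic categories; more than 50 million documents were processed. Result: Builds the ATLAS knowledge graphs, with over 900 million nodes and 5.9 billion edges; the approach outperforms baselines on multi-hop QA and improves LLM factuality, and the induced schemas reach 95% semantic alignment with human-crafted schemas. Conclusion: Large-scale knowledge graphs with dynamically induced schemas can effectively complement the parametric knowledge of large language models, demonstrating that fully autonomous knowledge graph construction is feasible. Abstract: We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.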

[238] GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns

Enzo Doyen,Amalia Todirascu

Main category: cs.CL

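TL;DR: The paper presents GeNRe, the first French gender-neutral rewriting system, which uses collective nouns to address gender bias and combines a rule-based system with fine-tuned language models.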

Details

Motivation: Textual data used in NLP exhibits gender bias, particularly through masculine generics in French, which gender-neutral rewriting can help mitigate. Method: Develops a rule-based system (RBS) and two fine-tuned language models, and explores instruct-based models to improve performance. Result: Claude 3 Opus combined with the dictionary achieves results close to the RBS. Conclusion: GeNRe advances gender bias mitigation techniques in NLP for French. Abstract: A significant portion of the textual data used in the field of Natural Language Processing (NLP) exhibits gender biases, particularly due to the use of masculine generics (masculine words that are supposed to refer to mixed groups of men and women), which can perpetuate and amplify stereotypes. Gender rewriting, an NLP task that involves automatically detecting and replacing gendered forms with neutral or opposite forms (e.g., from masculine to feminine), can be employed to mitigate these biases. While such systems have been developed in a number of languages (English, Arabic, Portuguese, German, French), automatic use of gender neutralization techniques (as opposed to inclusive or gender-switching techniques) has only been studied for English. This paper presents GeNRe, the very first French gender-neutral rewriting system using collective nouns, which are gender-fixed in French. We introduce a rule-based system (RBS) tailored for the French language alongside two fine-tuned language models trained on data generated by our RBS. We also explore the use of instruct-based models to enhance the performance of our other systems and find that Claude 3 Opus combined with our dictionary achieves results close to our RBS. Through this contribution, we hope to promote the advancement of gender bias mitigation techniques in NLP for French.

[239] Are Reasoning Models More Prone to Hallucination?

Zijun Yao,Yantao Liu,Yanxu Chen,Jianhui Chen,Junfeng Fang,Lei Hou,Juanzi Li,Tat-Seng Chua

Main category: cs.CL

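TL;DR: Large reasoning models (LRMs) perform strongly on complex tasks, but whether they reduce hallucination on fact-seeking tasks is still debated. This paper examines from three perspectives whether reasoning models are more prone to hallucination.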

Details

Motivation: To study hallucination in large reasoning models (LRMs) on fact-seeking tasks and resolve the disagreement in existing work over whether they reduce or worsen it. Method: (1) A holistic evaluation of hallucination in LRMs; (2) an analysis of how different post-training pipelines affect hallucination; (3) an investigation of the hallucination mechanism from the perspective of model uncertainty. Result: (1) A full post-training pipeline (cold-start SFT plus verifiable-reward RL) alleviates hallucination; (2) distillation alone and RL training without cold-start fine-tuning worsen it; (3) increased hallucination is associated with a misalignment between model uncertainty and factual accuracy. Conclusion: The paper provides an initial understanding of hallucination in LRMs and points to directions for future work. Abstract: Recently evolved large reasoning models (LRMs) show powerful performance in solving complex tasks with long chain-of-thought (CoT) reasoning capability. As these LRMs are mostly developed by post-training on formal reasoning tasks, whether they generalize the reasoning capability to help reduce hallucination in fact-seeking tasks remains unclear and debated. For instance, DeepSeek-R1 reports increased performance on SimpleQA, a fact-seeking benchmark, while OpenAI-o3 observes even more severe hallucination. This discrepancy naturally raises the following research question: Are reasoning models more prone to hallucination? This paper addresses the question from three perspectives. (1) We first conduct a holistic evaluation for the hallucination in LRMs. Our analysis reveals that LRMs that undergo a full post-training pipeline with cold start supervised fine-tuning (SFT) and verifiable reward RL generally alleviate their hallucination. In contrast, both distillation alone and RL training without cold start fine-tuning introduce more nuanced hallucinations. (2) To explore why different post-training pipelines alter the impact on hallucination in LRMs, we conduct behavior analysis. We characterize two critical cognitive behaviors that directly affect the factuality of an LRM: Flaw Repetition, where the surface-level reasoning attempts repeatedly follow the same underlying flawed logic, and Think-Answer Mismatch, where the final answer fails to faithfully match the previous CoT process. (3) Further, we investigate the mechanism behind the hallucination of LRMs from the perspective of model uncertainty. We find that increased hallucination of LRMs is usually associated with the misalignment between model uncertainty and factual accuracy. Our work provides an initial understanding of the hallucination in LRMs.

[240] ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs

Mohamed Elaraby,Diane Litman

Main category: cs.CL

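TL;DR: The paper investigates whether instruction-tuned large language models (LLMs) adequately preserve key argument-role information in summaries, and proposes the Argument Representation Coverage (ARC) framework to evaluate LLM-generated summaries.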

Details

Motivation: In high-stakes domains such as law, argument roles are crucial for document summarization, but it is unclear whether LLMs preserve this information. Method: Introduces the ARC framework and uses it to analyze summaries from three open-weight LLMs on long legal opinions and scientific articles. Result: LLMs cover salient argument roles only partially; critical information is often omitted, especially when arguments are sparsely distributed. Positional bias and role-specific preferences affect coverage. Conclusion: More argument-aware summarization strategies are needed to improve LLM performance in high-stakes domains. Abstract: Integrating structured information has long improved the quality of abstractive summarization, particularly in retaining salient content. In this work, we focus on a specific form of structure: argument roles, which are crucial for summarizing documents in high-stakes domains such as law. We investigate whether instruction-tuned large language models (LLMs) adequately preserve this information. To this end, we introduce Argument Representation Coverage (ARC), a framework for measuring how well LLM-generated summaries capture salient arguments. Using ARC, we analyze summaries produced by three open-weight LLMs in two domains where argument roles are central: long legal opinions and scientific articles. Our results show that while LLMs cover salient argument roles to some extent, critical information is often omitted in generated summaries, particularly when arguments are sparsely distributed throughout the input. Further, we use ARC to uncover behavioral patterns -- specifically, how the positional bias of LLM context windows and role-specific preferences impact the coverage of key arguments in generated summaries, emphasizing the need for more argument-aware summarization strategies.

[241] Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation

Hongxiang Zhang,Hao Chen,Tianyi Zhang,Muhao Chen

Main category: cs.CL

TL;DR: ActLCD is a new decoding strategy that uses reinforcement learning to improve the factuality of generated text and reduce hallucination.

Details

Motivation: Existing decoding methods remain prone to hallucination over long contexts and need improvement. Method: Proposes ActLCD, which uses a reinforcement learning policy guided by a reward-aware classifier to decide dynamically when to apply contrasting layers. Result: Surpasses existing methods on five benchmarks and effectively reduces hallucination. Conclusion: ActLCD improves factuality across diverse generation scenarios. Abstract: Recent decoding methods improve the factuality of large language models (LLMs) by refining how the next token is selected during generation. These methods typically operate at the token level, leveraging internal representations to suppress superficial patterns. Nevertheless, LLMs remain prone to hallucinations, especially over longer contexts. In this paper, we propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy that actively decides when to apply contrasting layers during generation. By casting decoding as a sequential decision-making problem, ActLCD employs a reinforcement learning policy guided by a reward-aware classifier to optimize factuality beyond the token level. Our experiments demonstrate that ActLCD surpasses state-of-the-art methods across five benchmarks, showcasing its effectiveness in mitigating hallucinations in diverse generation scenarios.
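
A single step of layer-contrastive decoding can be sketched as follows, under the common formulation in which tokens are scored by the final layer's log-probabilities minus those of an earlier layer. The `apply_contrast` flag is only a placeholder for the decision that ActLCD delegates to its learned policy; the policy, the reward-aware classifier, and the plausibility filtering that practical methods add are not modeled, and the logits are invented toy values.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    # Numerically stable log-softmax over a small vocabulary.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token(
    final_logits: List[float],
    early_logits: List[float],
    apply_contrast: bool,
) -> int:
    """Greedy token choice, optionally contrasting final vs. early layer."""
    final_lp = log_softmax(final_logits)
    if apply_contrast:
        early_lp = log_softmax(early_logits)
        # Favor tokens whose probability grew between the early and final layer.
        scores = [f - e for f, e in zip(final_lp, early_lp)]
    else:
        scores = final_lp
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens (hypothetical logits).
final = [2.0, 1.9, 0.1]
early = [2.0, 0.5, 0.1]
plain = next_token(final, early, apply_contrast=False)
contrast = next_token(final, early, apply_contrast=True)
```

In this toy case plain greedy decoding picks token 0, while the contrast picks token 1, the only token whose score rose relative to the early layer.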

[242] ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Beong-woo Kwak,Minju Kim,Dongha Lim,Hyungjoo Chae,Dongjin Kang,Sunghwan Kim,Dongil Yang,Jinyoung Yeo

Main category: cs.CL

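TL;DR: ToolHaystack is a benchmark for testing tool-use capabilities in long-term interactions; it reveals that current large language models lack long-term robustness.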

Details

Motivation: Most existing evaluations assume tool use in short contexts and offer little insight into model behavior during long-term interactions. Method: Introduces the ToolHaystack benchmark, whose instances combine multiple task execution contexts with realistic noise in a continuous conversation, to evaluate models in long-term interactions. Result: Across 14 state-of-the-art LLMs, models that perform well in standard multi-turn settings degrade significantly on ToolHaystack. Conclusion: ToolHaystack exposes critical gaps in long-term robustness that previous tool benchmarks did not reveal. Abstract: Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing the tool use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often significantly struggle in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.

[243] LoLA: Low-Rank Linear Attention With Sparse Caching

Luke McDermott,Robert W. Heath Jr.,Rahul Parhi

Main category: cs.CL

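TL;DR: LoLA is a low-rank linear attention method that uses sparse caching and three forms of memory to address the memory-collision problem of linear attention over long contexts, substantially improving performance.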

Details

Motivation: Transformer inference on long sequences has quadratic complexity, while linear attention is efficient but approximates softmax attention poorly. LoLA aims to close the performance gap between linear attention and transformers. Method: LoLA combines sliding window attention, a sparse global cache, and the recurrent hidden state, distributing key-value pairs across these three forms of memory to avoid memory collisions. Result: LoLA enables pass-key retrieval at context lengths up to 8K, raises the base model's accuracy from 0.6% to 97.4% at 4K context length with a cache 4.6x smaller than that of Llama-3.1 8B, and performs strongly on zero-shot commonsense reasoning tasks. Conclusion: LoLA is a lightweight, efficient approach that markedly improves linear attention models on long-context tasks. Abstract: Transformer-based large language models suffer from quadratic complexity at inference on long sequences. Linear attention methods are efficient alternatives; however, they fail to provide an accurate approximation of softmax attention. By additionally incorporating sliding window attention into each linear attention head, this gap can be closed for short context-length tasks. Unfortunately, these approaches cannot recall important information from long contexts due to "memory collisions". In this paper, we propose LoLA: Low-rank Linear Attention with sparse caching. LoLA separately stores additional key-value pairs that would otherwise interfere with past associative memories. Moreover, LoLA further closes the gap between linear attention models and transformers by distributing past key-value pairs into three forms of memory: (i) recent pairs in a local sliding window; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. As an inference-only strategy, LoLA enables pass-key retrieval on up to 8K context lengths on needle-in-a-haystack tasks from RULER. It boosts the accuracy of the base subquadratic model from 0.6% to 97.4% at 4K context lengths, with a 4.6x smaller cache than that of Llama-3.1 8B. LoLA demonstrates strong performance on zero-shot commonsense reasoning tasks among 1B and 8B parameter subquadratic models. Finally, LoLA is an extremely lightweight approach: Nearly all of our results can be reproduced on a single consumer GPU.

[244] Automatic classification of stop realisation with wav2vec2.0

James Tanner,Morgan Sonderegger,Jane Stuart-Smith,Jeff Mielke,Tyler Kendall

Main category: cs.CL

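TL;DR: Uses pre-trained wav2vec2.0 models to automatically classify stop burst presence in speech data, demonstrating high accuracy and robustness in English and Japanese.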

Details

Motivation: Phonetic research lacks automatic annotation tools for many variable phonetic phenomena, while pre-trained self-supervised models such as wav2vec2.0 perform well on speech classification tasks. Method: Trains wav2vec2.0 models to classify stop burst presence in English and Japanese, and tests them on both finely curated and unprepared speech corpora. Result: The automatic classification of stop burst presence is highly accurate, closely follows manual annotations, and reproduces known patterns of variability in stop realisation. Conclusion: Pre-trained speech models show promise as automatic annotation tools that can broaden the scope of phonetic research. Abstract: Modern phonetic research regularly makes use of automatic tools for the annotation of speech data; however, few tools exist for the annotation of many variable phonetic phenomena. At the same time, pre-trained self-supervised models, such as wav2vec2.0, have been shown to perform well at speech classification tasks and latently encode fine-grained phonetic information. We demonstrate that wav2vec2.0 models can be trained to automatically classify stop burst presence with high accuracy in both English and Japanese, robust across both finely-curated and unprepared speech corpora. Patterns of variability in stop realisation are replicated with the automatic annotations, and closely follow those of manual annotations. These results demonstrate the potential of pre-trained speech models as tools for the automatic annotation and processing of speech corpus data, enabling researchers to 'scale up' the scope of phonetic research with relative ease.

[245] Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models

Francesca Padovani,Jaap Jumelet,Yevgen Matusevych,Arianna Bisazza

Main category: cs.CL

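TL;DR: The study finds that models trained on child-directed language (CDL) are in most cases outperformed by models trained on Wikipedia, and that frequency effects must be controlled to assess syntactic ability accurately.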

Details

Motivation: To test whether the benefits of CDL generalize across languages, model types, and evaluation settings, and to address shortcomings in existing benchmarks. Method: Compares models trained on CDL and on Wikipedia across two objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Result: CDL is in most cases outperformed by Wikipedia models; a frequency-controlled design (FIT-CLAMS) is introduced to enable balanced comparisons. Conclusion: Training on CDL does not yield stronger syntactic generalization, and controlling for frequency effects is essential when evaluating syntactic ability. Abstract: Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can reach syntactic abilities similar to those of LMs trained on much larger amounts of adult-directed written text, suggesting that CDL could provide more effective LM training material than the commonly used internet-crawled data. However, the generalizability of these results across languages, model types, and evaluation settings remains unclear. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in previous benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal pair evaluations and regression analysis we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.
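
For readers unfamiliar with minimal-pair evaluation, the scoring rule is simple: a model is counted as correct on a pair when it assigns a higher log-probability to the grammatical sentence than to the ungrammatical one. The sketch below assumes a `logprob` callable supplied by the user; the lookup table standing in for a model, and its scores, are invented for illustration and are not taken from any of the benchmarks in the paper.

```python
from typing import Callable, List, Tuple

# Each pair is (grammatical sentence, ungrammatical sentence).
MinimalPair = Tuple[str, str]


def minimal_pair_accuracy(
    pairs: List[MinimalPair],
    logprob: Callable[[str], float],
) -> float:
    """Fraction of pairs where the grammatical sentence scores strictly higher."""
    correct = sum(logprob(good) > logprob(bad) for good, bad in pairs)
    return correct / len(pairs)


# Hypothetical sentence scores standing in for a language model.
toy_scores = {
    "the dogs bark": -5.0,
    "the dogs barks": -7.5,
    "the cat sleeps": -4.0,
    "the cat sleep": -3.5,
}

pairs = [
    ("the dogs bark", "the dogs barks"),
    ("the cat sleeps", "the cat sleep"),
]
accuracy = minimal_pair_accuracy(pairs, toy_scores.__getitem__)
```

Because the comparison depends on raw sentence probabilities, differences in word frequency between training corpora can shift the result independently of syntax, which is the confound a frequency-controlled design is meant to remove.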

[246] Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

Ziling Cheng,Meng Cao,Leila Pishdad,Yanshuai Cao,Jackie Chi Kit Cheung

Main category: cs.CL

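TL;DR: The paper shows that final-answer accuracy on math word problems is limited mainly by arithmetic computation rather than abstract formulation, and that chain-of-thought (CoT) mainly helps computation rather than abstraction.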

Details

Motivation: To examine whether LLM performance on math word problems truly reflects reasoning ability, and to expose the difference between two sub-skills that final-answer metrics conflate: abstract formulation and arithmetic computation. Method: Runs a disentangled evaluation of Llama-3 and Qwen2.5 on GSM8K and SVAMP, analyzes the role of CoT, and uses causal patching to verify the abstraction mechanism. Result: Arithmetic computation is the bottleneck; CoT helps computation substantially but has limited impact on abstract formulation; models operate through an abstract-then-compute mechanism. Conclusion: Disentangled evaluation is needed to assess LLM reasoning accurately and to guide future improvements. Abstract: Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.

Zixiang Xu,Yanbo Wang,Yue Huang,Jiayi Ye,Haomin Zhuang,Zirui Song,Lang Gao,Chenxi Wang,Zhaorun Chen,Yujun Zhou,Sixian Li,Wang Pan,Yue Zhao,Jieyu Zhao,Xiangliang Zhang,Xiuying Chen

Main category: cs.CL

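TL;DR: The paper introduces SocialMaze, a new benchmark for evaluating the social reasoning ability of large language models (LLMs), addressing the shortcomings of existing evaluation frameworks.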

Details

Motivation: Existing evaluation frameworks oversimplify real-world scenarios and cannot comprehensively assess the social reasoning ability of LLMs. Method: SocialMaze incorporates three core challenges (deep reasoning, dynamic interaction, and information uncertainty) and six tasks spanning social reasoning games, daily-life interactions, and digital community platforms. Result: Models differ substantially in handling dynamic interactions and integrating evolving information; models with strong chain-of-thought reasoning do better on tasks requiring deeper inference; and reasoning degrades significantly under uncertainty. Conclusion: Targeted fine-tuning can greatly improve model performance in complex social scenarios; the SocialMaze dataset is publicly available. Abstract: Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model's social reasoning ability - the capacity to interpret social contexts, infer others' mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms. Both automated and human validation are used to ensure data quality. Our evaluation reveals several key insights: models vary substantially in their ability to handle dynamic interactions and integrate temporally evolving information; models with strong chain-of-thought reasoning perform better on tasks requiring deeper inference beyond surface-level cues; and model reasoning degrades significantly under uncertainty. Furthermore, we show that targeted fine-tuning on curated reasoning examples can greatly improve model performance in complex social scenarios. The dataset is publicly available at: https://huggingface.co/datasets/MBZUAI/SocialMaze

[248] SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

Roksana Goworek,Harpal Karlcut,Muhammad Shezad,Nijaguna Darshana,Abhishek Mane,Syam Bondada,Raghav Sikka,Ulvi Mammadov,Rauf Allahverdiyev,Sriram Purighella,Paridhi Gupta,Muhinyia Ndegwa,Haim Dubossarsky

Main category: cs.CL

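TL;DR: The paper presents a semi-automatic annotation method and uses it to create sense-annotated datasets of polysemous words covering nine low-resource languages, for cross-lingual transfer research.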

Details

Motivation: To address the lack of high-quality evaluation datasets in low-resource languages and thereby advance cross-lingual transfer. Method: Uses a semi-automatic annotation method to create sense-annotated datasets of polysemous words, and evaluates them in Word-in-Context (WiC) formatted experiments. Result: The results show that targeted dataset creation and evaluation are important for polysemy disambiguation and transfer studies in low-resource languages. Conclusion: The released datasets and code aim to support fairer, more robust, and truly multilingual NLP research. Abstract: This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness depends on suitable, high-quality benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning nine low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.

[249] Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models

Jinzhe Li,Gengxu Li,Yi Chang,Yuan Wu

Main category: cs.CL

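TL;DR: The paper proposes the PCBench evaluation framework, which reveals that large language models are weak at critiquing premises: they depend on explicit prompts, and this ability does not consistently correlate with reasoning ability.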

Details

Motivation: Large language models are fragile when input premises are flawed and lack the ability to critique them autonomously; improving premise critique is needed for reliability. Method: Designs PCBench, covering four error types across three difficulty levels, and evaluates 15 representative LLMs. Result: Models rely on explicit prompts; critique ability depends on difficulty and error type; reasoning ability does not consistently correlate with critique ability; flawed premises trigger overthinking. Conclusion: Highlights the need to strengthen models' autonomous premise critique as a foundation for reliable systems. Abstract: Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at https://github.com/MLGroupJLU/Premise_Critique.

[250] Label-Guided In-Context Learning for Named Entity Recognition

Fan Bai,Hamid Hassanzadeh,Ardavan Saeedi,Mark Dredze

Main category: cs.CL

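TL;DR: DEER improves in-context learning (ICL) for named entity recognition (NER) by exploiting token-level statistics from training labels, consistently outperforming existing ICL methods.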

Details

Motivation: Existing ICL methods for NER select demonstrations by semantic similarity alone and ignore training labels, which leads to suboptimal performance. Method: DEER combines a label-guided, token-based retriever with an error-correction step, improving example selection and making targeted corrections. Result: On five NER datasets and four LLMs, DEER outperforms existing ICL methods and approaches supervised fine-tuning. Conclusion: DEER is effective on both seen and unseen entities and robust in low-resource settings. Abstract: In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. In Named Entity Recognition (NER), demonstrations are typically selected based on semantic similarity to the test instance, ignoring training labels and resulting in suboptimal performance. We introduce DEER, a new method that leverages training labels through token-level statistics to improve ICL performance. DEER first enhances example selection with a label-guided, token-based retriever that prioritizes tokens most informative for entity recognition. It then prompts the LLM to revisit error-prone tokens, which are also identified using label statistics, and make targeted corrections. Evaluated on five NER datasets using four different LLMs, DEER consistently outperforms existing ICL methods and approaches the performance of supervised fine-tuning. Further analysis shows its effectiveness on both seen and unseen entities and its robustness in low-resource settings.
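
One way to picture "token-level statistics from training labels" is the rate at which each token appears inside an entity span, used to weight token overlap when ranking demonstrations. The abstract does not give DEER's actual statistics or retriever, so the functions `entity_rates` and `retrieve`, the BIO-tagged toy data, and the scoring rule below are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A training sentence is a list of (token, BIO tag) pairs.
Example = List[Tuple[str, str]]


def entity_rates(train: List[Example]) -> Dict[str, float]:
    """For each token, the fraction of its occurrences tagged as part of an entity."""
    total: Counter = Counter()
    inside: Counter = Counter()
    for sentence in train:
        for token, tag in sentence:
            total[token.lower()] += 1
            if tag != "O":
                inside[token.lower()] += 1
    return {tok: inside[tok] / total[tok] for tok in total}


def retrieve(
    query: List[str],
    train: List[Example],
    rates: Dict[str, float],
    k: int = 1,
) -> List[int]:
    """Rank training examples by label-weighted token overlap with the query."""
    query_tokens = {q.lower() for q in query}

    def score(example: Example) -> float:
        tokens = {tok.lower() for tok, _ in example}
        # Shared tokens count in proportion to how entity-like they are.
        return sum(rates.get(q, 0.0) for q in query_tokens & tokens)

    ranked = sorted(range(len(train)), key=lambda i: score(train[i]), reverse=True)
    return ranked[:k]


train = [
    [("Paris", "B-LOC"), ("is", "O"), ("nice", "O")],
    [("He", "O"), ("is", "O"), ("nice", "O")],
    [("Visit", "O"), ("Paris", "B-LOC"), ("soon", "O")],
]
rates = entity_rates(train)
top = retrieve(["Paris", "is", "big"], train, rates, k=2)
```

Here the function word "is" contributes nothing to the ranking because it never occurs inside an entity, so the two sentences mentioning "Paris" are retrieved ahead of the one that only shares "is".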

[251] ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

Zexi Liu,Jingyi Chai,Xinyu Zhu,Shuo Tang,Rui Ye,Bo Zhang,Lei Bai,Siheng Chen

Main category: cs.CL

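TL;DR: The paper proposes a learning-based agentic machine learning framework that optimizes an LLM agent with online reinforcement learning, improving efficiency and cross-task generalization.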

Details

Motivation: Existing approaches depend on manual prompt engineering and cannot adapt based on experimental experience, which motivates a learning-based paradigm for agentic ML. Method: Proposes a framework with exploration-enriched fine-tuning, step-wise reinforcement learning, and a unified reward module, and uses it to train the 7B-sized ML-Agent. Result: Trained on only 9 tasks, ML-Agent outperforms the 671B-sized DeepSeek-R1 agent and shows continuous improvement and cross-task generalization. Conclusion: The framework offers an efficient, adaptive approach to agentic ML and demonstrates the potential of small models. Abstract: The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, most existing approaches rely heavily on manual prompt engineering, failing to adapt and optimize based on diverse experimental experiences. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Remarkably, despite being trained on merely 9 ML tasks, our 7B-sized ML-Agent outperforms the 671B-sized DeepSeek-R1 agent. Furthermore, it achieves continuous performance improvements and demonstrates exceptional cross-task generalization capabilities.

[252] Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

Mohamad Chehade,Soumya Suvra Ghosal,Souradip Chakraborty,Avinash Reddy,Dinesh Manocha,Hao Zhu,Amrit Singh Bedi

Main category: cs.CL

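TL;DR: SITAlign is an inference-time framework that addresses multi-objective alignment by maximizing a primary objective while satisfying threshold constraints on secondary criteria, outperforming existing methods.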

Details

Motivation: Existing approaches frame preference feedback as a multi-objective optimization problem but overlook how humans actually decide (for example, by satisficing), so an alignment method closer to human decision making is needed. Method: Proposes SITAlign, which maximizes a primary objective at inference time subject to threshold constraints on secondary criteria, and derives theoretical sub-optimality bounds. Result: On the PKU-SafeRLHF dataset, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by 22.3% in GPT-4 win-tie rate for helpfulness while adhering to the harmlessness threshold. Conclusion: SITAlign addresses multi-objective alignment through satisficing, with both theoretical and empirical support. Abstract: Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
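
The satisficing idea (maximize one objective subject to a threshold on another) can be illustrated at the level of whole responses. This is a simplification: SITAlign operates during decoding, whereas the sketch below merely picks among finished candidates, and the function `satisficing_select`, the candidate records, and their scores are hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def satisficing_select(
    candidates: Sequence[T],
    primary: Callable[[T], float],
    secondary: Callable[[T], float],
    threshold: float,
) -> Optional[T]:
    """Best candidate by the primary score among those meeting the threshold."""
    # Constraint first: discard candidates below the secondary threshold.
    feasible = [c for c in candidates if secondary(c) >= threshold]
    if not feasible:
        return None  # No candidate is acceptable; the caller decides what to do.
    # Then optimize the primary objective over what remains.
    return max(feasible, key=primary)


# Hypothetical reward-model scores for three candidate responses.
responses = [
    {"text": "A", "helpful": 0.9, "harmless": 0.2},
    {"text": "B", "helpful": 0.7, "harmless": 0.8},
    {"text": "C", "helpful": 0.5, "harmless": 0.95},
]


def helpful(r: dict) -> float:
    return r["helpful"]


def harmless(r: dict) -> float:
    return r["harmless"]


choice = satisficing_select(responses, helpful, harmless, threshold=0.5)
```

Response A is the most helpful but fails the harmlessness threshold, so B is chosen. A weighted sum of the two scores would instead need a trade-off weight and could still select A for some weights.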

[253] ATLAS: Learning to Optimally Memorize the Context at Test Time

Ali Behrouz,Zeman Li,Praneeth Kacham,Majid Daliri,Yuan Deng,Peilin Zhong,Meisam Razaviyayn,Vahab Mirrokni

Main category: cs.CL

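TL;DR: The paper proposes ATLAS, a high-capacity long-term memory module that optimizes memory over both current and past tokens, addressing the weaknesses of modern recurrent neural networks on long-context understanding and sequence extrapolation.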

Details

Motivation: Transformers are limited on long sequences by quadratic complexity, while modern recurrent neural networks underperform on long-context tasks, mainly because of limited memory capacity, online updates, and inexpressive management of fixed-size memory. Method: Proposes the ATLAS module to improve memory management and, building on it, designs DeepTransformers, which generalize the original Transformer architecture. Result: ATLAS surpasses Transformers and recent linear recurrent models on language modeling, common-sense reasoning, recall-intensive, and long-context tasks, and achieves +80% accuracy at 10M context length on the BABILong benchmark. Conclusion: By improving memory management, ATLAS markedly improves long-context performance and offers a new direction for sequence modeling. Abstract: Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, bound their applicability in longer sequences and so have motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, they struggle in tasks that require long context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To enhance all these three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long context performance of Titans, achieving +80% accuracy in 10M context length of BABILong benchmark.

[254] DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning

Ziyin Zhang,Jiahao Xu,Zhiwei He,Tian Liang,Qiuzhi Liu,Yansi Li,Linfeng Song,Zhengwen Liang,Zhuosheng Zhang,Rui Wang,Zhaopeng Tu,Haitao Mi,Dong Yu

Main category: cs.CL

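TL;DR: DeepTheorem is an informal theorem-proving framework that uses natural language to enhance the mathematical reasoning of large language models. It includes a large-scale dataset and a reinforcement learning strategy, and significantly improves theorem-proving performance.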

Details

Motivation: Traditional automated theorem proving relies on formal proof systems, which align poorly with the natural-language knowledge of large language models, so an informal theorem-proving framework better suited to their strengths is needed. Method: Proposes the DeepTheorem framework, comprising a dataset of 121K high-quality informal theorems and proofs and a reinforcement learning strategy (RL-Zero) designed specifically for informal theorem proving. Result: Experiments show that DeepTheorem significantly improves the theorem-proving performance of large language models, achieving state-of-the-art accuracy and reasoning quality. Conclusion: DeepTheorem has the potential to fundamentally advance informal theorem proving and mathematical exploration. Abstract: Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs' strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.

[255] Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Heekyung Lee,Jiaxin Ge,Tsung-Han Wu,Minwoo Kang,Trevor Darrell,David M. Chan

Main category: cs.CL

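TL;DR: This paper studies the ability of vision-language models (VLMs) to solve rebus puzzles and finds significant weaknesses in abstract reasoning and in understanding visual metaphors.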

Details

Motivation: To explore how VLMs handle multimodal abstraction, symbolic reasoning, and cultural and linguistic puns, filling a research gap in complex vision-language tasks. Method: Builds a hand-generated and annotated benchmark of English-language rebus puzzles, ranging from simple pictographic substitutions to spatially dependent cues. Result: VLMs show surprising capability in decoding simple visual clues but perform poorly on tasks requiring abstract reasoning, lateral thinking, and understanding of visual metaphors. Conclusion: Current VLMs remain limited in solving complex visual puzzles and need stronger abstract reasoning and multimodal understanding. Abstract: Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.

[256] From Chat Logs to Collective Insights: Aggregative Question Answering

Wentao Zhang,Woojeong Kim,Yuntian Deng

Main category: cs.CL

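TL;DR: The paper proposes a new task, Aggregative Question Answering, which aims to answer aggregative questions (such as identifying the concerns of specific demographics) by analyzing large-scale user-chatbot conversation data. The authors build the WildChat-AQA benchmark and find that existing methods either struggle to reason effectively or incur prohibitive computational costs.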

Details

Motivation: Existing approaches typically treat user-chatbot interactions as independent events and cannot extract collective insights from large-scale conversation data. Method: Proposes the Aggregative Question Answering task and builds the WildChat-AQA benchmark, which contains 6,027 aggregative questions. Result: Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs. Conclusion: New methods are needed to extract collective insights from large-scale conversation data. Abstract: Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet, existing approaches typically treat these interactions as independent and miss critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task requiring models to reason explicitly over thousands of user-chatbot interactions to answer aggregative queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.

cs.AI

[257] Decomposing Elements of Problem Solving: What "Math" Does RL Teach?

Tian Qin,Core Francisco Park,Mujin Kwun,Aaron Walsman,Eran Malach,Nikhil Anand,Hidenori Tanaka,David Alvarez-Melis

Main category: cs.AI

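TL;DR: The paper decomposes mathematical problem solving into three basic capabilities (Plan, Execute, and Verify) and finds that GRPO mainly strengthens execution through what the authors call temperature distillation, while RL-trained models hit a 'coverage wall' on new problems because of insufficient planning skills. A synthetic task confirms that RL mainly improves execution robustness and is used to explore the conditions under which the coverage wall can be overcome.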

Details

Motivation: RL-based methods such as GRPO perform well on mathematical reasoning tasks, but accuracy metrics alone do not allow a fine-grained assessment of model capabilities, in particular of which problem-solving skills have been acquired. Method: Decomposes problem solving into three capabilities (Plan, Execute, and Verify) and analyzes the effect of RL on each through experiments and a synthetic task. Result: GRPO strengthens execution through temperature distillation, but RL-trained models hit a 'coverage wall' on new problems because of insufficient planning skills. The synthetic task confirms that RL mainly improves execution robustness and reveals conditions under which the coverage wall may be overcome. Conclusion: The study clarifies the role and the limitations of RL in improving LLM reasoning and points to ways of overcoming those limitations. Abstract: Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.

[258] Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning

Massimiliano Pronesti,Michela Lorandi,Paul Flanagan,Oisin Redmon,Anya Belz,Yufang Hou

Main category: cs.AI

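TL;DR: The paper proposes a quantitative-reasoning approach that extracts structured numeric evidence and applies domain-knowledge logic to improve the accuracy of conclusion inference in medical systematic reviews.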

Details

Motivation: Extracting numeric evidence and inferring conclusions is a bottleneck in automating medical systematic reviews. Existing methods rely on shallow textual cues and fail to capture the numeric reasoning behind expert assessments. Method: Develops a numeric reasoning system consisting of a numeric data extraction model and an effect estimate component, trained with supervised fine-tuning and reinforcement learning. Result: On the CochraneForest benchmark, the best approach (a small-scale number extraction model trained with RL) improves F1 by up to 21% absolute over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9%. Conclusion: The study shows the promise of reasoning-driven approaches for automating systematic evidence synthesis. Abstract: Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level conclusions for specific outcomes and comparisons. Prior work has framed this problem as a textual inference task by retrieving relevant content fragments and inferring conclusions from them. However, such approaches often rely on shallow textual cues and fail to capture the underlying numeric reasoning behind expert assessments. In this work, we conceptualise the problem as one of quantitative reasoning. Rather than inferring conclusions from surface text, we extract structured numerical evidence (e.g., event counts or standard deviations) and apply domain-knowledge-informed logic to derive outcome-specific conclusions. We develop a numeric reasoning system composed of a numeric data extraction model and an effect estimate component, enabling more accurate and interpretable inference aligned with the domain expert principles. We train the numeric data extraction model using different strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model. When evaluated on the CochraneForest benchmark, our best-performing approach, which uses RL to train a small-scale number extraction model, yields up to a 21% absolute improvement in F1 score over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9%. Our results demonstrate the promise of reasoning-driven approaches for automating systematic evidence synthesis.
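
As a rough illustration of the "extract numbers, then apply domain logic" idea in the abstract above, the sketch below turns extracted event counts into a risk ratio with a 95% confidence interval and maps it to a study-level conclusion. The function names, the choice of risk ratio as the effect measure, and the decision rule are assumptions made for illustration; the abstract does not specify the paper's effect estimate component.

```python
import math


def risk_ratio(events_t, total_t, events_c, total_c, z=1.96):
    """Risk ratio (treatment vs. control) with a Wald CI on the log scale.

    Assumes non-zero event counts in both arms; zero cells would need a
    continuity correction, which this sketch does not apply.
    """
    rr = (events_t / total_t) / (events_c / total_c)
    se = math.sqrt(1 / events_t - 1 / total_t + 1 / events_c - 1 / total_c)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper


def conclude(lower, upper):
    """Hypothetical decision rule: a CI that excludes 1 counts as a difference."""
    if upper < 1:
        return "fewer events in treatment arm"
    if lower > 1:
        return "more events in treatment arm"
    return "no significant difference"


# Example: 10/100 events under treatment vs. 20/100 under control.
rr, lower, upper = risk_ratio(10, 100, 20, 100)
print(round(rr, 2), conclude(lower, upper))
```

The point of the design is that the conclusion follows from the extracted numbers through an inspectable rule rather than from surface wording in the paper text.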

[259] Be.FM: Open Foundation Models for Human Behavior

Yutong Xie,Zhuoheng Li,Xiyuan Wang,Yijun Pan,Qijia Liu,Xingzhi Cui,Kuang-Yu Lo,Ruoyi Gao,Xingjian Zhang,Jin Huang,Walter Yuan,Matthew O. Jackson,Qiaozhu Mei

Main category: cs.AI

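TL;DR: Be.FM is a behavioral foundation model built on open-source large language models for understanding and predicting human behavior, and it performs well across behavioral tasks.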

Details

Motivation: To explore the potential of foundation models for modeling human behavior. Method: Builds Be.FM by fine-tuning open-source large language models on diverse behavioral data. Result: Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge. Conclusion: Be.FM demonstrates the considerable potential of foundation models for human behavior modeling. Abstract: Despite their success in numerous fields, the potential of foundation models for modeling and understanding human behavior remains largely unexplored. We introduce Be.FM, one of the first open foundation models designed for human behavior modeling. Built upon open-source large language models and fine-tuned on a diverse range of behavioral data, Be.FM can be used to understand and predict human decision-making. We construct a comprehensive set of benchmark tasks for testing the capabilities of behavioral foundation models. Our results demonstrate that Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge.

[260] Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models

Zeyu Liu,Yuhang Liu,Guanghao Zhu,Congkai Xie,Zhen Li,Jianbo Yuan,Xinyao Wang,Qing Li,Shing-Chi Cheung,Shengyu Zhang,Fei Wu,Hongxia Yang

Main category: cs.AI

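TL;DR: The paper proposes the Infi-MMR framework, which improves the reasoning ability of multimodal small language models (MSLMs) through three phases and achieves state-of-the-art results on several benchmarks.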

Details

Motivation: Although large language models (LLMs) have made progress in reasoning, multimodal small language models (MSLMs) face scarce datasets, reasoning degradation caused by visual processing, and the risk that reinforcement learning produces incorrect reasoning. Method: Designs the Infi-MMR framework with three phases (Foundational Reasoning Activation, Cross-Modal Reasoning Adaptation, and Multimodal Reasoning Enhancement) and proposes the model Infi-MMR-3B. Result: Infi-MMR-3B performs strongly on multimodal math reasoning benchmarks (MathVerse, MathVision, and OlympiadBench) and on general reasoning (MathVista). Conclusion: The Infi-MMR framework addresses the reasoning challenges of MSLMs and significantly improves their multimodal reasoning ability. Abstract: Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini).

[261] Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns

Xiang Li,Haiyang Yu,Xinghua Zhang,Ziyang Huang,Shizhu He,Kang Liu,Jun Zhao,Fei Huang,Yongbin Li

Main category: cs.AI

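TL;DR: Socratic-PRMBench is a new benchmark for systematically evaluating process reward models (PRMs) under six reasoning patterns, filling a gap left by existing benchmarks.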

Details

Motivation: Existing benchmarks focus mainly on stepwise correctness and lack a systematic evaluation of how well PRMs identify errors under different reasoning patterns. Method: Introduces Socratic-PRMBench, which contains 2995 flawed reasoning paths covering six reasoning patterns. Result: Experiments reveal notable deficiencies in existing PRMs across reasoning patterns. Conclusion: Socratic-PRMBench provides a testbed for comprehensive evaluation of PRMs and supports their future development. Abstract: Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.

[262] ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Chenyu Yang,Shiqian Su,Shi Liu,Xuan Dong,Yue Yu,Weijie Su,Xuehui Wang,Zhaoyang Liu,Jinguo Zhu,Hao Li,Wenhai Wang,Yu Qiao,Xizhou Zhu,Jifeng Dai

Main category: cs.AI

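TL;DR: ZeroGUI proposes an online learning framework for training GUI agents without human annotation, improving performance through automatic task generation and reward estimation.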

Details

Motivation: Existing GUI agent methods depend on manual annotation and adapt poorly to dynamic environments. ZeroGUI aims to address both problems. Method: Combines VLM-based automatic task generation and reward estimation with two-stage online reinforcement learning. Result: Significantly improves the performance of UI-TARS and Aguvis. Conclusion: ZeroGUI offers an efficient, scalable solution for training GUI agents. Abstract: The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.

cs.HC

[263] Errors in Stereo Geometry Induce Distance Misperception

Raffles Xingqi Zhu,Charlie S. Burlingham,Olivier Mercier,Phillip Guan

Main category: cs.HC

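TL;DR: The paper proposes a geometric framework for predicting distance-perception errors caused by inaccurate HMD perspective geometry and validates it experimentally.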

Details

Motivation: To study how rendering-camera and viewing-position errors in HMD rendering and display affect users' perception of depth and distance. Method: Builds a geometric framework that predicts these errors and validates it experimentally on a Quest 3 HMD platform. Result: Errors in perspective geometry cause both over- and under-estimation of perceived distance, and real-time visual feedback can dynamically recalibrate visuomotor mapping. Conclusion: The geometric framework predicts the errors effectively, and dynamic recalibration can mitigate distance-perception problems in HMDs. Abstract: Stereoscopic head-mounted displays (HMDs) render and present binocular images to create an egocentric, 3D percept for the HMD user. Within this render and presentation pipeline there are potential rendering camera and viewing position errors that can induce deviations in the depth and distance that a user perceives compared to the underlying intended geometry. For example, rendering errors can arise when HMD render cameras are incorrectly positioned relative to the assumed centers of projections of the HMD displays and viewing errors can arise when users view stereo geometry from the incorrect location in the HMD eyebox. In this work we present a geometric framework that predicts errors in distance perception arising from inaccurate HMD perspective geometry and build an HMD platform to reliably simulate render and viewing error in a Quest 3 HMD with eye tracking to experimentally test these predictions. We present a series of five experiments to explore the efficacy of this geometric framework and show that errors in perspective geometry can induce both under- and over-estimations in perceived distance. We further demonstrate how real-time visual feedback can be used to dynamically recalibrate visuomotor mapping so that an accurate reach distance is achieved even if the perceived visual distance is negatively impacted by geometric error.

[264] Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge

Yupei Li,Shuaijie Shao,Manuel Milling,Björn W. Schuller

Main category: cs.HC

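TL;DR: The paper proposes an LLM-based multimodal depression detection method that combines audio features with psychological knowledge and markedly improves diagnostic accuracy.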

Details

Motivation: Existing DNNs and LLMs have limited effectiveness in depression detection, particularly because they do not integrate non-textual cues and psychological knowledge. Method: Extracts audio features with Wav2Vec, maps them to a text-based LLM for processing, and augments the model with psychological knowledge in question-and-answer form. Result: On the DAIC-WOZ dataset, MAE and RMSE improve notably over the baseline. Conclusion: A multimodal LLM approach that incorporates psychological knowledge can effectively improve depression detection. Abstract: Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract the audio features using the pre-trained model Wav2Vec and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question and answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to a base score proposed by the related original paper. The codes are available at https://github.com/myxp-lyp/Depression-detection.git
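
For readers unfamiliar with the two metrics named in the abstract above, the following minimal sketch shows how MAE and RMSE are computed from predicted and reference severity scores. The score values are made up for illustration and are not taken from DAIC-WOZ or from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical reference and predicted depression severity scores.
reference = [10, 4]
predicted = [8, 8]
print(mae(reference, predicted), rmse(reference, predicted))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, which is why the two are usually reported together.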

[265] Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education

Boning Zhao

Main category: cs.HC

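TL;DR: The paper proposes HEAE, a human-centered AI framework that combines student narrative text with a teacher-derived "Empathy Vector" to make depression severity assessment more transparent and socially responsible.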

Details

Motivation: In sensitive settings such as special education, standardized questionnaires and automated methods struggle to assess student depression accurately and lack the individualized insight that comes from teachers' empathy. Method: The HEAE framework integrates student narrative text with a 9-dimensional "Empathy Vector" based on the PHQ-9 framework, and classifies depression severity after optimizing the multimodal fusion, text representation, and classification architecture. Result: Experiments show that HEAE reaches 82.74% accuracy on 7-level depression severity classification. Conclusion: HEAE offers a more responsible and ethical path for affective computing by structurally embedding human empathy, enhancing rather than replacing human judgment. Abstract: Assessing student depression in sensitive environments like special education is challenging. Standardized questionnaires may not fully reflect students' true situations. Furthermore, automated methods often falter with rich student narratives, lacking the crucial, individualized insights stemming from teachers' empathetic connections with students. Existing methods often fail to address this ambiguity or effectively integrate educator understanding. To address these limitations by fostering a synergistic human-AI collaboration, this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered AI framework for transparent and socially responsible depression severity assessment. Our approach uniquely integrates student narrative text with a teacher-derived, 9-dimensional "Empathy Vector" (EV), its dimensions guided by the PHQ-9 framework, to explicitly translate tacit empathetic insight into a structured AI input, enhancing rather than replacing human judgment. Rigorous experiments optimized the multimodal fusion, text representation, and classification architecture, achieving 82.74% accuracy for 7-level severity classification. This work demonstrates a path toward more responsible and ethical affective computing by structurally embedding human empathy.

[266] MAC-Gaze: Motion-Aware Continual Calibration for Mobile Gaze Tracking

Yaxiong Lei,Mingyue Zhao,Yuheng Wang,Shijing He,Yusuke Sugano,Yafei Wang,Kaixing Zhao,Mohamed Khamis,Juan Ye

Main category: cs.HC

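TL;DR: MAC-Gaze is a motion-aware continual calibration method that uses smartphone IMU sensors and continual learning to adapt the gaze tracking model dynamically to changes in user posture, significantly improving accuracy.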

Details

Motivation: Traditional one-off calibration cannot adapt to dynamic posture changes, which degrades performance, so a solution that adapts continually is needed. Method: Combines a pre-trained visual gaze estimator with an IMU-based activity recognition model, uses a clustering-based hybrid decision mechanism to trigger recalibration, and applies replay-based continual learning to avoid catastrophic forgetting. Result: Gaze estimation error drops by 19.9% on RGBDGaze and by 31.7% on MotionGaze. Conclusion: MAC-Gaze provides a robust continual calibration solution for gaze tracking in mobile scenarios. Abstract: Mobile gaze tracking faces a fundamental challenge: maintaining accuracy as users naturally change their postures and device orientations. Traditional calibration approaches, such as one-off calibration, fail to adapt to these dynamic conditions, leading to degraded performance over time. We present MAC-Gaze, a Motion-Aware continual Calibration approach that leverages smartphone inertial measurement unit (IMU) sensors and continual learning techniques to automatically detect changes in user motion states and update the gaze tracking model accordingly. Our system integrates a pre-trained visual gaze estimator and an IMU-based activity recognition model with a clustering-based hybrid decision-making mechanism that triggers recalibration when motion patterns deviate significantly from previously encountered states. To enable accumulative learning of new motion conditions while mitigating catastrophic forgetting, we employ replay-based continual learning, allowing the model to maintain performance across previously encountered motion conditions. We evaluate our system through extensive experiments on the publicly available RGBDGaze dataset and our own 10-hour multimodal MotionGaze dataset (481K+ images, 800K+ IMU readings), encompassing a wide range of postures under various motion conditions including sitting, standing, lying, and walking. Results demonstrate that our method reduces gaze estimation error by 19.9% on RGBDGaze (from 1.73 cm to 1.41 cm) and by 31.7% on MotionGaze (from 2.81 cm to 1.92 cm) compared to traditional calibration approaches. Our framework provides a robust solution for maintaining gaze estimation accuracy in mobile scenarios.
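
The recalibration trigger described in the abstract above (recalibrate when motion patterns deviate from previously encountered states) could be sketched roughly as follows. This is a simplified nearest-centroid check, not the paper's clustering-based hybrid mechanism, whose details the abstract does not give; the class name, the distance threshold, and the three-dimensional IMU feature vectors are all hypothetical.

```python
import math


class MotionStateMonitor:
    """Tracks centroids of motion states seen so far (hypothetical helper)."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed maximum distance to a known state
        self.centroids = []         # one feature centroid per known motion state

    def needs_recalibration(self, feature):
        """Return True when the feature is far from every known motion state."""
        if not self.centroids:
            return True  # nothing calibrated yet
        nearest = min(math.dist(feature, c) for c in self.centroids)
        return nearest > self.threshold

    def register_state(self, feature):
        """Remember a motion state after the gaze model was recalibrated for it."""
        self.centroids.append(tuple(feature))


monitor = MotionStateMonitor(threshold=1.0)
sitting = (0.0, 0.0, 9.8)  # made-up mean accelerometer reading
if monitor.needs_recalibration(sitting):
    monitor.register_state(sitting)
print(monitor.needs_recalibration((0.1, 0.0, 9.8)))  # close to a known state
print(monitor.needs_recalibration((5.0, 0.0, 8.0)))  # far from all known states
```

In a fuller system the registered states would also index the replay buffer used for continual learning, so that recalibrating for a new state does not overwrite what was learned for earlier ones.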

cs.CY

[267] Conversational Alignment with Artificial Intelligence in Context

Rachel Katharine Sterken,James Ravi Kirkpatrick

Main category: cs.CY

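TL;DR: The paper examines how AI conversational agents can align with human communicative norms, proposes the CONTEXT-ALIGN framework, and argues that current LLM architectures may limit full alignment.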

Details

Motivation: To study the relationship between AI conversational agents and human communicative norms so that AI design reflects human values and practices. Method: Draws on the philosophical and linguistic literature to propose the CONTEXT-ALIGN framework for evaluating LLM design choices. Result: Current LLM architectures may be unable to achieve full alignment with human communicative norms. Conclusion: Further research is needed to overcome the limitations of LLMs in conversational alignment. Abstract: The development of sophisticated artificial intelligence (AI) conversational agents based on large language models raises important questions about the relationship between human norms, values, and practices and AI design and performance. This article explores what it means for AI agents to be conversationally aligned to human communicative norms and practices for handling context and common ground and proposes a new framework for evaluating developers' design choices. We begin by drawing on the philosophical and linguistic literature on conversational pragmatics to motivate a set of desiderata, which we call the CONTEXT-ALIGN framework, for conversational alignment with human communicative practices. We then suggest that current large language model (LLM) architectures, constraints, and affordances may impose fundamental limitations on achieving full conversational alignment.

cs.SD

[268] Nosey: Open-source hardware for acoustic nasalance

Maya Dewhurst,Jack Collins,Justin J. H. Lo,Roy Alderton,Sam Kirkham

Main category: cs.SD

TL;DR: Nosey is a low-cost, customizable, open-source system for recording nasalance data that compares well with a commercial device.

Details

Motivation: To develop low-cost, open-source hardware as an alternative to commercial nasalance measurement devices. Method: Designs a 3D-printed hardware system and compares it against a commercial device. Result: Nosey yields consistently higher nasalance scores, but the magnitude of contrast between phonological environments is comparable to the commercial device. Conclusion: Nosey is a flexible, cost-effective alternative to commercial devices and is suitable for data collection. Abstract: We introduce Nosey (Nasalance Open Source Estimation sYstem), a low-cost, customizable, 3D-printed system for recording acoustic nasalance data that we have made available as open-source hardware (http://github.com/phoneticslab/nosey). We first outline the motivations and design principles behind our hardware nasalance system, and then present a comparison between Nosey and a commercial nasalance device. Nosey shows consistently higher nasalance scores than the commercial device, but the magnitude of contrast between phonological environments is comparable between systems. We also review ways of customizing the hardware to facilitate testing, such as comparison of microphones and different construction materials. We conclude that Nosey is a flexible and cost-effective alternative to commercial nasometry devices and propose some methodological considerations for its use in data collection.
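
As background for the nasalance scores compared in the abstract above: nasalance is commonly defined as nasal acoustic energy as a percentage of combined nasal and oral energy. The sketch below computes that ratio from two sample buffers using RMS amplitude. It is a generic illustration under that assumed definition, not Nosey's processing pipeline, and the sample values are invented.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of one channel's samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def nasalance(nasal_samples, oral_samples):
    """Nasalance in percent: nasal / (nasal + oral) RMS amplitude.

    Raises ZeroDivisionError if both channels are completely silent.
    """
    nasal = rms(nasal_samples)
    oral = rms(oral_samples)
    return 100.0 * nasal / (nasal + oral)


# Invented two-channel frames: a nasal-microphone and an oral-microphone buffer.
print(nasalance([0.2, -0.2, 0.2, -0.2], [0.6, -0.6, 0.6, -0.6]))
```

Since the score is a ratio between two microphone channels, a constant gain difference between devices shifts absolute scores while leaving contrasts between phonological environments largely comparable, which is consistent with the comparison reported in the abstract.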

[269] Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation

Hao Li,Ju Dai,Xin Zhao,Feng Zhou,Junjun Pan,Lei Li

Main category: cs.SD

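TL;DR: The paper proposes the Wav2Sem module, which uses semantic decoupling to resolve the feature coupling caused by phonetically similar syllables in 3D speech-driven facial animation, improving animation precision and naturalness.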

Details

Motivation: Existing methods use self-supervised audio models as encoders, but phonetically similar syllables are strongly coupled in the feature space, which causes an averaging effect in lip-shape generation. Method: Proposes the Wav2Sem module, which extracts semantic features of the audio sequence and uses them to decorrelate audio encodings in the feature space. Result: Experiments show that Wav2Sem effectively decouples audio features and significantly reduces the averaging effect in lip-shape generation. Conclusion: The Wav2Sem module improves the precision and naturalness of facial animation. Abstract: In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to the averaging effect in subsequent lip motion generation. To address this issue, this paper proposes a plug-and-play semantic decorrelation module, Wav2Sem. This module extracts semantic features corresponding to the entire audio sequence, leveraging the added semantic information to decorrelate audio encodings within the feature space, thereby achieving more expressive audio features. Extensive experiments across multiple speech-driven models indicate that the Wav2Sem module effectively decouples audio features, significantly alleviating the averaging effect of phonetically similar syllables in lip shape generation, thereby enhancing the precision and naturalness of facial animations. Our source code is available at https://github.com/wslh852/Wav2Sem.git.

[270] Semantics-Aware Human Motion Generation from Audio Instructions

Zi-An Wang,Shihao Zou,Shiyao Yu,Mingyuan Zhang,Chao Dong

Main category: cs.SD

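TL;DR: This paper proposes a new task in which audio signals serve as conditioning inputs for generating motions aligned with the semantics of the audio, and improves performance with a masked generative transformer and a memory-retrieval attention module.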

Details

Motivation: Audio is more natural and intuitive than text, but existing methods usually only match motion to music or speech rhythm, leaving a weak link between audio semantics and generated motion. Method: Uses an end-to-end framework that combines a masked generative transformer with a memory-retrieval attention module to handle sparse, lengthy audio inputs, and enriches existing datasets. Result: Experiments show the framework is effective and efficient, and that audio instructions can convey semantics similar to text while offering more practical and user-friendly interaction. Conclusion: Audio as a conditioning input can generate semantically aligned motion and opens a new direction for interactive technologies. Abstract: Recent advances in interactive technologies have highlighted the prominence of audio signals for semantic encoding. This paper explores a new task, where audio signals are used as conditioning inputs to generate motions that align with the semantics of the audio. Unlike text-based interactions, audio provides a more natural and intuitive communication method. However, existing methods typically focus on matching motions with music or speech rhythms, which often results in a weak connection between the semantics of the audio and generated motions. We propose an end-to-end framework using a masked generative transformer, enhanced by a memory-retrieval attention module to handle sparse and lengthy audio inputs. Additionally, we enrich existing datasets by converting descriptions into conversational style and generating corresponding audio with varied speaker identities. Experiments demonstrate the effectiveness and efficiency of the proposed framework, showing that audio instructions can convey semantics similar to text while providing more practical and user-friendly interactions.

[271] ZeroSep: Separate Anything in Audio with Zero Training

Chao Huang,Yuesheng Ma,Junxuan Huang,Susan Liang,Yunlong Tang,Jing Bi,Wenqiang Liu,Nima Mesgarani,Chenliang Xu

Main category: cs.SD

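TL;DR: The paper proposes ZeroSep, which uses a pre-trained text-guided audio diffusion model to perform zero-shot audio source separation without task-specific training.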

Details

Motivation: Current supervised deep learning methods need large amounts of task-specific labeled data and generalize poorly to complex real-world acoustic scenes. Inspired by generative foundation models, the authors investigate whether pre-trained diffusion models can overcome these problems. Method: Inverts the mixed audio into the diffusion model's latent space and uses text conditioning to guide the denoising process toward individual sources, without fine-tuning. Result: ZeroSep performs strongly on multiple separation benchmarks, surpassing even supervised methods. Conclusion: Pre-trained text-guided diffusion models can be used successfully for zero-shot audio source separation and support open-set scenarios. Abstract: Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.

cs.DB [Back]

[272] TailorSQL: An NL2SQL System Tailored to Your Query Workload

Kapil Vaidya,Jialin Ding,Sebastian Kosak,David Kernert,Chuan Lei,Xiao Qin,Abhinav Tripathy,Ramesh Balan,Balakrishnan Narayanaswamy,Tim Kraska

Main category: cs.DB

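TL;DR: TailorSQL uses information from the past query workload to improve NL2SQL accuracy and latency, achieving up to a 2× improvement in execution accuracy over existing methods.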

Details

Motivation: Existing NL2SQL techniques do not make full use of the past query workload already present in a database, even though it carries information (such as common join paths and table/column semantics) that is crucial for accurate translation. Method: TailorSQL analyzes the past query workload, extracts useful information (such as common join paths and table/column semantics), and combines it with a pre-trained large language model to generate more accurate SQL queries. Result: TailorSQL achieves up to a 2$\times$ improvement in execution accuracy on standardized benchmarks. Conclusion: Exploiting the past query workload can substantially improve NL2SQL performance, and TailorSQL offers an effective way to do so. Abstract: NL2SQL (natural language to SQL) translates natural language questions into SQL queries, thereby making structured data accessible to non-technical users, serving as the foundation for intelligent data applications. State-of-the-art NL2SQL techniques typically perform translation by retrieving database-specific information, such as the database schema, and invoking a pre-trained large language model (LLM) using the question and retrieved information to generate the SQL query. However, existing NL2SQL techniques miss a key opportunity which is present in real-world settings: NL2SQL is typically applied on existing databases which have already served many SQL queries in the past. The past query workload implicitly contains information which is helpful for accurate NL2SQL translation and is not apparent from the database schema alone, such as common join paths and the semantics of obscurely-named tables and columns. We introduce TailorSQL, an NL2SQL system that takes advantage of information in the past query workload to improve both the accuracy and latency of translating natural language questions into SQL. By specializing to a given workload, TailorSQL achieves up to 2$\times$ improvement in execution accuracy on standardized benchmarks.
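The abstract mentions common join paths as one kind of information hidden in a past query workload. The following is a minimal sketch of how such paths could be counted, not TailorSQL's actual pipeline: the queries, table names, regular expressions, and the `schema_hints` helper are all assumptions made for illustration, and the regex parsing only handles simple `ON a.x = b.y` conditions.

```python
import re
from collections import Counter

# Hypothetical past workload; every table, column, and alias here is invented.
PAST_QUERIES = [
    "SELECT * FROM orders o JOIN customers c ON o.cust_id = c.id",
    "SELECT c.name FROM customers c JOIN orders o ON o.cust_id = c.id WHERE o.total > 10",
    "SELECT * FROM orders o JOIN items i ON i.order_id = o.id",
]

KEYWORDS = r"(?:JOIN|ON|WHERE|INNER|LEFT|RIGHT|GROUP|ORDER)\b"
# Matches "FROM table alias" or "JOIN table AS alias"; the alias is optional.
ALIAS_RE = re.compile(
    r"\b(?:FROM|JOIN)\s+(\w+)(?:\s+(?:AS\s+)?(?!" + KEYWORDS + r")(\w+))?",
    re.IGNORECASE,
)
# Matches simple equi-join conditions of the form "ON a.col = b.col".
JOIN_COND_RE = re.compile(r"\bON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", re.IGNORECASE)


def join_edges(sql):
    """Return the (table.column, table.column) pairs joined in one query."""
    aliases = {alias or table: table for table, alias in ALIAS_RE.findall(sql)}
    edges = []
    for a1, c1, a2, c2 in JOIN_COND_RE.findall(sql):
        left = (aliases.get(a1, a1), c1)
        right = (aliases.get(a2, a2), c2)
        edges.append(tuple(sorted([left, right])))  # order-independent key
    return edges


def common_join_paths(queries):
    """Count how often each join edge appears across the workload."""
    counts = Counter()
    for query in queries:
        counts.update(join_edges(query))
    return counts


def schema_hints(counts, top_k=2):
    """Format the most frequent join edges as text that could be added to a prompt."""
    return [
        f"{t1}.{c1} = {t2}.{c2} (seen {n}x)"
        for ((t1, c1), (t2, c2)), n in counts.most_common(top_k)
    ]


if __name__ == "__main__":
    for hint in schema_hints(common_join_paths(PAST_QUERIES)):
        print(hint)
```

A production system would use a real SQL parser rather than regular expressions; the point here is only that frequency statistics over past joins are cheap to collect and are not visible in the schema alone.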

eess.SY [Back]

[273] CF-DETR: Coarse-to-Fine Transformer for Real-Time Object Detection

Woojin Shin,Donghwa Kang,Byeongyun Park,Brent Byunghoon Kang,Jinkyu Lee,Hyeongboo Baek

Main category: eess.SY

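TL;DR: CF-DETR combines a coarse-to-fine Transformer architecture with the real-time scheduling framework NPFP** to address the real-time and accuracy challenges of running multiple DETR tasks in autonomous-vehicle perception systems.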

Details

Motivation: Existing real-time DNN scheduling methods do not exploit Transformer-specific properties and struggle to meet the dual demands of real-time operation and high accuracy in autonomous-vehicle perception. Method: CF-DETR combines coarse-to-fine inference, selective fine inference, and multi-level batch inference with the NPFP** scheduling framework to adjust resource allocation dynamically. Result: Validated on multiple platforms, CF-DETR guarantees the timing of critical operations and significantly improves overall and critical-object detection accuracy. Conclusion: CF-DETR offers an efficient, reliable solution for autonomous-vehicle perception that balances real-time performance and accuracy. Abstract: Detection Transformers (DETR) are increasingly adopted in autonomous vehicle (AV) perception systems due to their superior accuracy over convolutional networks. However, concurrently executing multiple DETR tasks presents significant challenges in meeting firm real-time deadlines (R1) and high accuracy requirements (R2), particularly for safety-critical objects, while navigating the inherent latency-accuracy trade-off under resource constraints. Existing real-time DNN scheduling approaches often treat models generically, failing to leverage Transformer-specific properties for efficient resource allocation. To address these challenges, we propose CF-DETR, an integrated system featuring a novel coarse-to-fine Transformer architecture and a dedicated real-time scheduling framework NPFP**. CF-DETR employs three key strategies (A1: coarse-to-fine inference, A2: selective fine inference, A3: multi-level batch inference) that exploit Transformer properties to dynamically adjust patch granularity and attention scope based on object criticality, aiming to satisfy R2. The NPFP** scheduling framework (A4) orchestrates these adaptive mechanisms A1-A3. It partitions each DETR task into a safety-critical coarse subtask for guaranteed critical object detection within its deadline (ensuring R1), and an optional fine subtask for enhanced overall accuracy (R2), while managing individual and batched execution. Our extensive evaluations on server, GPU-enabled embedded platforms, and actual AV platforms demonstrate that CF-DETR, under an NPFP** policy, successfully meets strict timing guarantees for critical operations and achieves significantly higher overall and critical object detection accuracy compared to existing baselines across diverse AV workloads.

eess.AS [Back]

[274] NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding

Vladimir Bataev,Andrei Andrusenko,Lilit Grigoryan,Aleksandr Laptev,Vitaly Lavrukhin,Boris Ginsburg

Main category: eess.AS

TL;DR: NGPU-LM is an efficient, parallel statistical n-gram language model for context biasing in ASR that substantially improves computational efficiency.

Details

Motivation: Existing statistical n-gram language models are computationally inefficient in ASR because they parallelize poorly, which limits industrial use. Method: The data structures are redesigned and customizable greedy decoding is introduced, enabling GPU-optimized inference across multiple ASR model types. Result: Computational overhead is below 7%, and more than 50% of the accuracy gap between greedy and beam search is closed in out-of-domain scenarios. Conclusion: With efficient parallelization and low overhead, NGPU-LM markedly improves context biasing in ASR, and the implementation is open-sourced. Abstract: Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.
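To illustrate what context biasing with an n-gram model means during greedy decoding, here is a toy, label-synchronous sketch of shallow fusion with a bigram table. It says nothing about NGPU-LM's GPU data structures, which are the paper's actual contribution; the scores, the `UNSEEN` fallback, the weight `alpha`, and all function names are assumptions chosen for the example.

```python
# Log-score given to any bigram missing from the table (a stand-in for backoff).
UNSEEN = -5.0

# Hypothetical bigram log-scores that bias decoding toward "call john".
BIGRAM_LM = {
    ("<s>", "call"): -0.5,
    ("call", "john"): -0.2,
    ("call", "jon"): -3.0,
}

# Hypothetical per-step ASR log-scores over candidate tokens.
ASR_STEPS = [
    {"call": -0.1, "cole": -2.5},
    {"jon": -0.6, "john": -0.9},
]


def greedy_decode(asr_steps, lm, alpha):
    """Pick, at each step, the token maximizing asr_score + alpha * lm_score."""
    prev, output = "<s>", []
    for scores in asr_steps:
        best = max(
            scores,
            key=lambda tok: scores[tok] + alpha * lm.get((prev, tok), UNSEEN),
        )
        output.append(best)
        prev = best  # the chosen token becomes the LM context for the next step
    return output


if __name__ == "__main__":
    print(greedy_decode(ASR_STEPS, BIGRAM_LM, alpha=0.0))  # ASR scores only
    print(greedy_decode(ASR_STEPS, BIGRAM_LM, alpha=0.5))  # with LM fusion
```

With `alpha=0.0` the decoder follows the acoustic scores alone and picks "jon"; with `alpha=0.5` the bigram score for ("call", "john") outweighs the small acoustic deficit, so the biased spelling wins without any beam search.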

cs.RO [Back]

[275] AMOR: Adaptive Character Control through Multi-Objective Reinforcement Learning

Lucas N. Alegre,Agon Serifi,Ruben Grandia,David Müller,Espen Knoop,Moritz Bächer

Main category: cs.RO

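TL;DR: Proposes a multi-objective reinforcement learning framework that trains a single weight-conditioned policy, removing the time-consuming tuning of reward weights in conventional RL, and demonstrates it on dynamic motions of a robot.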

Details

Motivation: Conventional reinforcement learning relies on a weighted reward function that needs extensive tuning and adapts poorly to real-world sim-to-real gaps. Method: A multi-objective reinforcement learning framework trains a single weight-conditioned policy that spans the Pareto front of reward trade-offs. Result: The framework significantly shortens iteration time, supports dynamic weight selection, and adapts efficiently to new tasks. Conclusion: The multi-objective policy encodes diverse behaviors, giving a flexible and efficient solution for robot control. Abstract: Reinforcement learning (RL) has significantly advanced the control of physics-based and robotic characters that track kinematic reference motion. However, methods typically rely on a weighted sum of conflicting reward functions, requiring extensive tuning to achieve a desired behavior. Due to the computational cost of RL, this iterative process is a tedious, time-intensive task. Furthermore, for robotics applications, the weights need to be chosen such that the policy performs well in the real world, despite inevitable sim-to-real gaps. To address these challenges, we propose a multi-objective reinforcement learning framework that trains a single policy conditioned on a set of weights, spanning the Pareto front of reward trade-offs. Within this framework, weights can be selected and tuned after training, significantly speeding up iteration time. We demonstrate how this improved workflow can be used to perform highly dynamic motions with a robot character. Moreover, we explore how weight-conditioned policies can be leveraged in hierarchical settings, using a high-level policy to dynamically select weights according to the current task. We show that the multi-objective policy encodes a diverse spectrum of behaviors, facilitating efficient adaptation to novel tasks.
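The two ideas in the abstract, a weighted sum of conflicting rewards and a Pareto front of trade-offs, can be made concrete with a small sketch. This is not AMOR's training code: the reward pairs, the objective names in the comments, and the helper functions are hypothetical, and a real weight-conditioned policy would be a neural network that takes the weights as extra input.

```python
def scalarize(rewards, weights):
    """Weighted sum of per-objective rewards (the usual single-objective RL signal)."""
    return sum(w * r for w, r in zip(weights, rewards))


def policy_input(observation, weights):
    """A weight-conditioned policy sees the weights appended to its observation."""
    return list(observation) + list(weights)


def pareto_front(points):
    """Keep the points that no other point dominates (higher is better everywhere)."""
    def dominates(p, q):
        return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (tracking reward, smoothness reward) outcomes of different behaviors.
OUTCOMES = [(0.9, 0.2), (0.7, 0.6), (0.4, 0.8), (0.5, 0.5), (0.3, 0.3)]


def best_for(weights, outcomes):
    """Select, after training, the behavior that a given weighting prefers."""
    return max(pareto_front(outcomes), key=lambda o: scalarize(o, weights))


if __name__ == "__main__":
    print(pareto_front(OUTCOMES))
    print(best_for((0.8, 0.2), OUTCOMES))
    print(best_for((0.2, 0.8), OUTCOMES))
```

Changing the weights only changes which point on the front is selected, which is why a policy that already covers the front lets weights be tuned after training instead of retraining for each choice.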

Siddharth Ancha,Sunshine Jiang,Travis Manderson,Laura Brandt,Yilun Du,Philip R. Osteen,Nicholas Roy

Main category: cs.RO

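TL;DR: Proposes a pixel-wise anomaly detection method based on a generative diffusion model that makes no assumptions about the anomalous data; it edits the image and detects anomalies from the regions that were modified.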

Details

Motivation: In unstructured environments, robots must detect anomalies that differ from the training data distribution in order to navigate safely. Method: A generative diffusion model edits the input image to remove anomalies, and anomalies are detected by analyzing which regions were modified; a new guided-diffusion inference approach is proposed. Result: The method needs no retraining or fine-tuning, integrates directly into existing workflows, and combines vision-language foundation models for accurate anomaly detection. Conclusion: The method offers an effective anomaly detection solution for robot navigation in unstructured environments. Abstract: In order to navigate safely and reliably in off-road and unstructured environments, robots must detect anomalies that are out-of-distribution (OOD) with respect to the training data. We present an analysis-by-synthesis approach for pixel-wise anomaly detection without making any assumptions about the nature of OOD data. Given an input image, we use a generative diffusion model to synthesize an edited image that removes anomalies while keeping the remaining image unchanged. Then, we formulate anomaly detection as analyzing which image segments were modified by the diffusion model. We propose a novel inference approach for guided diffusion by analyzing the ideal guidance gradient and deriving a principled approximation that bootstraps the diffusion model to predict guidance gradients. Our editing technique is purely test-time and can be integrated into existing workflows without the need for retraining or fine-tuning. Finally, we use a combination of vision-language foundation models to compare pixels in a learned feature space and detect semantically meaningful edits, enabling accurate anomaly detection for off-road navigation. Project website: https://siddancha.github.io/anomalies-by-diffusion-synthesis/
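The final step described in the abstract, comparing the input and the edited image in a learned feature space, can be sketched with per-pixel cosine distance. This is only an illustration of the comparison idea: the tiny hand-written feature grids stand in for features from a vision-language model, and the threshold and function names are assumptions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-12)  # epsilon guards zero vectors


def anomaly_map(feats_input, feats_edited, threshold=0.5):
    """Flag pixels whose features changed a lot between input and edited image."""
    return [
        [cosine_distance(u, v) > threshold for u, v in zip(row_in, row_ed)]
        for row_in, row_ed in zip(feats_input, feats_edited)
    ]


# Hypothetical 2x2 feature grids: the edit changed only the bottom-left pixel.
FEATS_INPUT = [[(1.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 0.0)]]
FEATS_EDITED = [[(1.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (1.0, 0.0)]]

if __name__ == "__main__":
    for row in anomaly_map(FEATS_INPUT, FEATS_EDITED):
        print(row)
```

Comparing in feature space rather than raw pixel space is what lets small photometric differences introduced by the generative edit be ignored while semantically meaningful changes are flagged.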

[277] TrackVLA: Embodied Visual Tracking in the Wild

Shaoan Wang,Jiazhao Zhang,Minghan Li,Jiahang Liu,Anqi Li,Kui Wu,Fangwei Zhong,Junzhi Yu,Zhizheng Zhang,He Wang

Main category: cs.RO

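TL;DR: TrackVLA is a Vision-Language-Action (VLA) model that combines target recognition and trajectory planning to perform visual tracking efficiently in dynamic environments.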

Details

Motivation: Existing methods separate target recognition from trajectory planning and therefore struggle with occlusion and dynamic scenes. Method: A shared LLM backbone is combined with a language modeling head for recognition and an anchor-based diffusion model for planning. Result: TrackVLA shows SOTA performance in synthetic and real-world environments, outperforms existing methods in a zero-shot setting, and runs at 10 FPS inference speed. Conclusion: TrackVLA demonstrates strong generalization and robustness in highly dynamic and occluded scenes. Abstract: Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect diverse difficulty levels of recognition samples, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates SOTA performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at 10 FPS inference speed. Our project page is: https://pku-epic.github.io/TrackVLA-web.

[278] Autoregressive Meta-Actions for Unified Controllable Trajectory Generation

Jianbo Zhao,Taiyu Ban,Xiyang Wang,Qibin Zhou,Hangning Zhou,Zhihao Liu,Mu Yang,Lei Liu,Bin Li

Main category: cs.RO

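TL;DR: The paper proposes Autoregressive Meta-Actions to fix the temporal misalignment between meta-actions and trajectories in autonomous driving systems; by decomposing long-interval meta-actions into frame-level meta-actions, it strictly aligns trajectory generation with decisions.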

Details

Motivation: Existing autonomous driving frameworks rely on meta-actions assigned over fixed time intervals, which misaligns the meta-actions with the actual trajectories and hurts task coherence and model performance. Method: Autoregressive Meta-Actions decompose long-interval meta-actions into frame-level meta-actions within an autoregressive trajectory generation framework, and a staged pre-training process separates learning basic motion dynamics from learning high-level decision control. Result: Experiments show improved trajectory adaptivity and responsiveness to dynamic decision-making scenarios. Conclusion: By strictly aligning meta-actions with trajectories, the method significantly reduces complexity and improves autonomous driving performance. Abstract: Controllable trajectory generation guided by high-level semantic decisions, termed meta-actions, is crucial for autonomous driving systems. A significant limitation of existing frameworks is their reliance on invariant meta-actions assigned over fixed future time intervals, causing temporal misalignment with the actual behavior trajectories. This misalignment leads to irrelevant associations between the prescribed meta-actions and the resulting trajectories, disrupting task coherence and limiting model performance. To address this challenge, we introduce Autoregressive Meta-Actions, an approach integrated into autoregressive trajectory generation frameworks that provides a unified and precise definition for meta-action-conditioned trajectory prediction. Specifically, we decompose traditional long-interval meta-actions into frame-level meta-actions, enabling a sequential interplay between autoregressive meta-action prediction and meta-action-conditioned trajectory generation. This decomposition ensures strict alignment between each trajectory segment and its corresponding meta-action, achieving a consistent and unified task formulation across the entire trajectory span and significantly reducing complexity. Moreover, we propose a staged pre-training process to decouple the learning of basic motion dynamics from the integration of high-level decision control, which offers flexibility, stability, and modularity. Experimental results validate our framework's effectiveness, demonstrating improved trajectory adaptivity and responsiveness to dynamic decision-making scenarios. We provide the video document and dataset, which are available at https://arma-traj.github.io/.

[279] Mobi-$π$: Mobilizing Your Robot Learning Policy

Jingyun Yang,Isabella Huang,Brandon Vu,Max Bajracharya,Rika Antonova,Jeannette Bohg

Main category: cs.RO

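TL;DR: The paper addresses the poor generalization of visuomotor policies to new environments by optimizing the robot's base pose so that it falls within the policy's training distribution, with no policy retraining required.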

Details

Motivation: Existing visuomotor policies are trained on a limited set of robot positions and camera viewpoints, so they generalize poorly to new environments, especially for precise tasks. Method: The Mobi-$\pi$ framework provides metrics that quantify how hard a policy is to mobilize, simulated tasks, visualization tools, and baseline methods; the paper also proposes a base-pose optimization method built on 3D Gaussian Splatting and sampling-based optimization. Result: The proposed method outperforms the baselines in both simulation and real-world environments. Conclusion: Policy mobilization, by optimizing the base pose, markedly improves a policy's generalization to new environments and remains compatible with existing methods for improving policy robustness. Abstract: Learned visuomotor policies are capable of performing increasingly complex manipulation tasks. However, most of these policies are trained on data collected from limited robot positions and camera viewpoints. This leads to poor generalization to novel robot positions, which limits the use of these policies on mobile platforms, especially for precise tasks like pressing buttons or turning faucets. In this work, we formulate the policy mobilization problem: find a mobile robot base pose in a novel environment that is in distribution with respect to a manipulation policy trained on a limited set of camera viewpoints. Compared to retraining the policy itself to be more robust to unseen robot base pose initializations, policy mobilization decouples navigation from manipulation and thus does not require additional demonstrations. Crucially, this problem formulation complements existing efforts to improve manipulation policy robustness to novel viewpoints and remains compatible with them. To study policy mobilization, we introduce the Mobi-$\pi$ framework, which includes: (1) metrics that quantify the difficulty of mobilizing a given policy, (2) a suite of simulated mobile manipulation tasks based on RoboCasa to evaluate policy mobilization, (3) visualization tools for analysis, and (4) several baseline methods. We also propose a novel approach that bridges navigation and manipulation by optimizing the robot's base pose to align with an in-distribution base pose for a learned policy. Our approach utilizes 3D Gaussian Splatting for novel view synthesis, a score function to evaluate pose suitability, and sampling-based optimization to identify optimal robot poses. We show that our approach outperforms baselines in both simulation and real-world environments, demonstrating its effectiveness for policy mobilization.
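The abstract describes scoring candidate base poses and searching over them with sampling-based optimization. The sketch below uses the cross-entropy method as one such optimizer; it is an assumption that this resembles the authors' optimizer, and the `score` function here is a hypothetical stand-in (negative squared distance to an invented target pose) for a score computed from rendered views.

```python
import math
import random
import statistics

# Hypothetical in-distribution base pose (x, y, yaw); invented for the example.
TARGET = (1.5, -0.5, 0.8)


def score(pose):
    """Stand-in suitability score: higher means closer to the target pose."""
    x, y, yaw = pose
    # Wrap the yaw error into (-pi, pi] so that equivalent angles score equally.
    dyaw = math.atan2(math.sin(yaw - TARGET[2]), math.cos(yaw - TARGET[2]))
    return -((x - TARGET[0]) ** 2 + (y - TARGET[1]) ** 2 + dyaw ** 2)


def optimize_base_pose(score_fn, iters=15, n_samples=200, n_elite=20, seed=0):
    """Cross-entropy method: sample poses, keep the best, refit the sampler."""
    rng = random.Random(seed)
    mean, std = [0.0, 0.0, 0.0], [2.0, 2.0, 1.5]
    best = tuple(mean)
    for _ in range(iters):
        samples = [
            tuple(rng.gauss(m, s) for m, s in zip(mean, std))
            for _ in range(n_samples)
        ]
        samples.sort(key=score_fn, reverse=True)
        elites = samples[:n_elite]
        if score_fn(elites[0]) > score_fn(best):
            best = elites[0]  # remember the best pose seen so far
        # Refit a diagonal Gaussian to the elite poses, with a floor on the spread.
        mean = [statistics.mean(dim) for dim in zip(*elites)]
        std = [max(statistics.pstdev(dim), 1e-3) for dim in zip(*elites)]
    return best


if __name__ == "__main__":
    pose = optimize_base_pose(score)
    print("best pose:", pose, "score:", score(pose))
```

Because the optimizer only needs to evaluate `score_fn`, the manipulation policy itself is never retrained, which matches the decoupling of navigation from manipulation described in the abstract.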

cs.SE [Back]

[280] SWE-bench Goes Live!

Linghao Zhang,Shilin He,Chaoyun Zhang,Yu Kang,Bowen Li,Chengxing Xie,Junhao Wang,Maoquan Wang,Yufan Huang,Shengyu Fu,Elsie Nallipogu,Qingwei Lin,Yingnong Dang,Saravan Rajmohan,Dongmei Zhang

Main category: cs.SE

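TL;DR: SWE-bench-Live is a continuously updated benchmark designed to address the limitations of existing static benchmarks, such as stale data, narrow coverage, and reliance on manual effort. Using an automated pipeline and live GitHub issues, it provides a more rigorous evaluation setting for LLMs and agent frameworks.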

Details

Motivation: Existing benchmarks such as SWE-bench suffer from stale data, limited coverage, and reliance on manual effort, which limits their scalability and evaluation value; SWE-bench-Live aims to solve these problems through live updates and automation. Method: SWE-bench-Live is built from live GitHub issues and contains 1,319 tasks spanning 93 repositories. Each task comes with a Docker image for reproducibility, and an automated pipeline (\method) handles instance creation and environment setup. Result: LLMs and agent frameworks evaluated on SWE-bench-Live perform substantially worse than on static benchmarks, revealing a performance gap; further analysis relates the gap to repository origin, issue recency, and task difficulty. Conclusion: SWE-bench-Live provides a live, diverse, and executable benchmark that supports rigorous evaluation of LLMs and agents in real software development settings while resisting data contamination. Abstract: The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is \method, an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.

[281] Identity resolution of software metadata using Large Language Models

Eva Martín del Pico,Josep Lluís Gelpí,Salvador Capella-Gutiérrez

Main category: cs.SE

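TL;DR: This article discusses the importance of research software and the problem of consolidating its metadata, and evaluates instruction-tuned large language models on the task of software metadata identity resolution.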

Details

Motivation: Research software is increasingly important to science, but the quality and completeness of its metadata vary, so the metadata must be consolidated to support large-scale analysis. Method: Instruction-tuned large language models are evaluated against a human-annotated gold standard, and an agreement-based proxy for high-confidence automated decisions is introduced. Result: The proxy achieves high precision and statistical robustness, while also exposing the limitations of current models and the challenges of automating semantic judgment. Conclusion: The study underlines the need to consolidate software metadata and shows both the potential and the limits of automated approaches, pointing to directions for improvement. Abstract: Software is an essential component of research. However, little attention has been paid to it compared with that paid to research data. Recently, there has been an increase in efforts to acknowledge and highlight the importance of software in research activities. Structured metadata from platforms like bio.tools, Bioconductor, and Galaxy ToolShed offers valuable insights into research software in the Life Sciences. Although originally intended to support discovery and integration, this metadata can be repurposed for large-scale analysis of software practices. However, its quality and completeness vary across platforms, reflecting diverse documentation practices. To gain a comprehensive view of software development and sustainability, consolidating this metadata is necessary, but requires robust mechanisms to address its heterogeneity and scale. This article presents an evaluation of instruction-tuned large language models for the task of software metadata identity resolution, a critical step in assembling a cohesive collection of research software. Such a collection is the reference component for the Software Observatory at OpenEBench, a platform that aggregates metadata to monitor the FAIRness of research software in the Life Sciences. We benchmarked multiple models against a human-annotated gold standard, examined their behavior on ambiguous cases, and introduced an agreement-based proxy for high-confidence automated decisions. The proxy achieved high precision and statistical robustness, while also highlighting the limitations of current models and the broader challenges of automating semantic judgment in FAIR-aligned software metadata across registries and repositories.
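One plausible reading of an agreement-based proxy is that a decision is automated only when enough models give the same answer, and is otherwise left for human review. The sketch below follows that reading; the label names, the `min_agreement` threshold, and the function itself are assumptions, and the abstract does not specify the rule the authors actually used.

```python
from collections import Counter


def agreement_proxy(votes, min_agreement=1.0):
    """Return the majority label if enough models agree, else None (needs review).

    votes: one label per model, e.g. "same" or "different" for a pair of
           metadata records (hypothetical label names).
    min_agreement: required fraction of models voting for the majority label.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


if __name__ == "__main__":
    print(agreement_proxy(["same", "same", "same"]))       # unanimous
    print(agreement_proxy(["same", "different", "same"]))  # disagreement
    print(agreement_proxy(["same", "different", "same"], min_agreement=0.6))
```

Requiring unanimity trades coverage for precision: fewer record pairs are decided automatically, but those that are decided rest on agreement between independent models.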

[282] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Manish Shetty,Naman Jain,Jinjian Liu,Vijay Kethanaboyina,Koushik Sen,Ion Stoica

Main category: cs.SE

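TL;DR: GSO is a benchmark for evaluating language models' ability to develop high-performance software; it uses an automated pipeline to generate performance tests and analyze repository histories, and finds that existing SWE-Agents succeed less than 5% of the time.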

Details

Motivation: Developing high-performance software requires specialized expertise, and how well language models do in this area is unclear, so a benchmark is needed to measure it. Method: An automated pipeline generates performance tests and analyzes repository histories to identify 102 optimization tasks; the agent receives a codebase and a performance test as the specification and must improve runtime efficiency. Result: Leading SWE-Agents achieve less than a 5% success rate, with limited gains from inference-time scaling; qualitative analysis identifies key failure modes involving low-level languages, lazy optimization strategies, and bottleneck localization. Conclusion: GSO exposes the limits of language models in high-performance software development and releases code and trajectory data for future research. Abstract: Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories to identify 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages. An agent is provided with a codebase and performance test as a precise specification, and tasked to improve the runtime efficiency, which is measured against the expert developer optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, practicing lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
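The abstract states that an agent's runtime improvement is measured against the expert developer's optimization. A minimal sketch of such a comparison follows; the `fraction=0.95` threshold, the timing values, and the function names are assumptions for illustration and are not taken from the abstract.

```python
def speedup(base_seconds, new_seconds):
    """Speedup of a patched codebase relative to the unoptimized baseline."""
    return base_seconds / new_seconds


def matches_expert(base_seconds, agent_seconds, expert_seconds, fraction=0.95):
    """True if the agent reaches at least `fraction` of the expert's speedup."""
    agent = speedup(base_seconds, agent_seconds)
    expert = speedup(base_seconds, expert_seconds)
    return agent >= fraction * expert


if __name__ == "__main__":
    # Hypothetical timings in seconds for one performance test.
    print(matches_expert(10.0, 2.05, 2.0))  # agent nearly matches the expert
    print(matches_expert(10.0, 4.0, 2.0))   # agent is faster than base, but far behind
```

Measuring against the expert patch rather than against the baseline alone means a superficial optimization that yields a small gain does not count as a success.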

q-bio.NC [Back]

[283] ConnectomeDiffuser: Generative AI Enables Brain Network Construction from Diffusion Tensor Imaging

Xuhang Chen,Michael Kwok-Po Ng,Kim-Fung Tsang,Chi-Man Pun,Shuqiang Wang

Main category: q-bio.NC

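TL;DR: ConnectomeDiffuser is a diffusion-based automated framework for constructing brain networks from DTI that overcomes limitations of existing methods and improves diagnostic accuracy.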

Details

Motivation: Existing methods for building brain networks from DTI suffer from subjectivity, laborious workflows, and an inability to capture complex topological features, so a more efficient, automated solution is needed. Method: A Template Network (extracting topological features), a diffusion model (generating high-fidelity brain networks), and a Graph Convolutional Network classifier (incorporating disease markers) are combined for end-to-end automated construction. Result: Validated on datasets for two neurodegenerative conditions, the method clearly outperforms other approaches and enables more sensitive analysis of individual differences in brain networks. Conclusion: ConnectomeDiffuser provides a more accurate diagnostic tool for neurodegenerative diseases and advances neuroimaging instrumentation. Abstract: Brain network analysis plays a crucial role in diagnosing and monitoring neurodegenerative disorders such as Alzheimer's disease (AD). Existing approaches for constructing structural brain networks from diffusion tensor imaging (DTI) often rely on specialized toolkits that suffer from inherent limitations: operator subjectivity, labor-intensive workflows, and restricted capacity to capture complex topological features and disease-specific biomarkers. To overcome these challenges and advance computational neuroimaging instrumentation, ConnectomeDiffuser is proposed as a novel diffusion-based framework for automated end-to-end brain network construction from DTI. The proposed model combines three key components: (1) a Template Network that extracts topological features from 3D DTI scans using Riemannian geometric principles, (2) a diffusion model that generates comprehensive brain networks with enhanced topological fidelity, and (3) a Graph Convolutional Network classifier that incorporates disease-specific markers to improve diagnostic accuracy. ConnectomeDiffuser demonstrates superior performance by capturing a broader range of structural connectivity and pathology-related information, enabling more sensitive analysis of individual variations in brain networks. Experimental validation on datasets representing two distinct neurodegenerative conditions demonstrates significant performance improvements over other brain network methods. This work contributes to the advancement of instrumentation in the context of neurological disorders, providing clinicians and researchers with a robust, generalizable measurement framework that facilitates more accurate diagnosis, deeper mechanistic understanding, and improved therapeutic monitoring of neurodegenerative diseases such as AD.

cs.CR [Back]

[284] AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models

Jinchuan Zhang,Lu Yin,Yan Zhou,Songlin Hu

Main category: cs.CR

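TL;DR: The AgentAlign framework uses abstract behavior chains to improve the safety of LLM agents, sharply reducing the execution of malicious tasks while preserving utility.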

Details

Motivation: Growing agentic capabilities of LLMs raise the risk of malicious use, and existing approaches fall short on safety alignment. Method: Abstract behavior chains are used to synthesize safety alignment data; simulated environments generate authentic, executable instructions, and safety is balanced against utility. Result: On the AgentHarm evaluation, safety improves by 35.8% to 79.5%, with minimal impact on, or even gains in, utility. Conclusion: AgentAlign effectively addresses safety alignment for LLM agents and outperforms existing prompting methods. Abstract: The acquisition of agentic capabilities has transformed LLMs from "knowledge providers" to "action executors", a trend that while expanding LLMs' capability boundaries, significantly increases their susceptibility to malicious use. Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked, indicating a deficiency in agentic use safety alignment during the post-training phase. To address this gap, we propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis. By instantiating these behavior chains in simulated environments with diverse tool instances, our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics. The framework further ensures model utility by proportionally synthesizing benign instructions through non-malicious interpretations of behavior chains, precisely calibrating the boundary between helpfulness and harmlessness. Evaluation results on AgentHarm demonstrate that fine-tuning three families of open-source models using our method substantially improves their safety (35.8% to 79.5% improvement) while minimally impacting or even positively enhancing their helpfulness, outperforming various prompting methods. The dataset and code have both been open-sourced.

Chunlong Xie,Jialing He,Shangwei Guo,Jiacheng Wang,Shudong Zhang,Tianwei Zhang,Tao Xiang

Main category: cs.CR

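TL;DR: AdvOF is an attack framework targeting vision-and-language navigation (VLN) agents that generates adversarial 3D objects to study their effect on the VLM-based perception module.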

Details

Motivation: The work studies security vulnerabilities of VLM-based navigation systems in service-oriented environments, where existing attacks do not account for the reliability requirements of service computing. Method: AdvOF precisely aggregates and aligns the target object in 2D and 3D space, defines and renders adversarial objects, and optimizes them collaboratively with multi-view weighting and regularization. Result: AdvOF effectively degrades agent performance while minimally interfering with normal navigation tasks. Conclusion: The study advances the understanding of service security in VLM-based navigation systems and provides computational foundations for robust service composition in physical-world deployments. Abstract: We present Adversarial Object Fusion (AdvOF), a novel attack framework targeting vision-and-language navigation (VLN) agents in service-oriented environments by generating adversarial 3D objects. While foundational models like Large Language Models (LLMs) and Vision Language Models (VLMs) have enhanced service-oriented navigation systems through improved perception and decision-making, their integration introduces vulnerabilities in mission-critical service workflows. Existing adversarial attacks fail to address service computing contexts, where reliability and quality-of-service (QoS) are paramount. We use AdvOF to investigate the impact of adversarial environments on the VLM-based perception module of VLN agents. In particular, AdvOF first precisely aggregates and aligns the victim object positions in both 2D and 3D space, defining and rendering adversarial objects. Then, we collaboratively optimize the adversarial object with regularization between the adversarial and victim object across physical properties and VLM perceptions. By assigning importance weights to different views, the optimization proceeds stably across multiple views through iterative fusion of local updates and justifications. Our extensive evaluations demonstrate AdvOF can effectively degrade agent performance under adversarial conditions while maintaining minimal interference with normal navigation tasks. This work advances the understanding of service security in VLM-powered navigation systems, providing computational foundations for robust service composition in physical-world deployments.

q-bio.TO [Back]

[286] Physiology-Informed Generative Multi-Task Network for Contrast-Free CT Perfusion

Wasif Khan,Kyle B. See,Simon Kato,Ziqian Huang,Amy Lazarte,Kyle Douglas,Xiangyang Lou,Teng J. Peng,Dhanashree Rajderkar,John Rees,Pina Sanelli,Amita Singh,Ibrahim Tuna,Christina A. Wilson,Ruogu Fang

Main category: q-bio.TO

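TL;DR: Proposes MAGIC, a deep learning framework that uses generative AI and physiological information to map non-contrast CT images to multiple contrast-free CTP maps, avoiding the contrast agents required by conventional CTP.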

Details

Motivation: Conventional CTP imaging depends on contrast agents, which can cause allergic reactions and are costly, so a contrast-free alternative is needed. Method: The MAGIC framework combines generative AI with physiological information, uses physiology-informed loss terms to improve image fidelity, and is trained and validated on data from stroke patients. Result: MAGIC shows strong visual quality and diagnostic accuracy, comparing favorably with conventional contrast-based CTP imaging. Conclusion: MAGIC is a promising contrast-free, cost-effective approach to perfusion imaging with potential for clinical use. Abstract: Perfusion imaging is extensively utilized to assess hemodynamic status and tissue perfusion in various organs. Computed tomography perfusion (CTP) imaging plays a key role in the early assessment and planning of stroke treatment. While CTP provides essential perfusion parameters to identify abnormal blood flow in the brain, the use of contrast agents in CTP can lead to allergic reactions and adverse side effects, and cost USD 4.9 billion worldwide in 2022. To address these challenges, we propose a novel deep learning framework called Multitask Automated Generation of Intermodal CT perfusion maps (MAGIC). This framework combines generative artificial intelligence and physiological information to map non-contrast computed tomography (CT) imaging to multiple contrast-free CTP imaging maps. We demonstrate enhanced image fidelity by incorporating physiological characteristics into the loss terms. Our network was trained and validated using CT image data from patients referred for stroke at UF Health and demonstrated robustness to abnormalities in brain perfusion activity. A double-blinded study was conducted involving seven experienced neuroradiologists and vascular neurologists. This study validated MAGIC's visual quality and diagnostic accuracy, showing favorable performance compared to clinical perfusion imaging with intravenous contrast injection. Overall, MAGIC holds great promise in revolutionizing healthcare by offering contrast-free, cost-effective, and rapid perfusion imaging.

cs.LG [Back]

[287] FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

Aniruddha Nrusimha,William Brandon,Mayank Mishra,Yikang Shen,Rameswar Panda,Jonathan Ragan-Kelley,Yoon Kim

Main category: cs.LG

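TL;DR: FlashFormer is a kernel optimized for single-batch inference of Transformer-based large language models, aimed at edge deployment and latency-sensitive applications.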

Details

Motivation: Existing kernels are mainly optimized for large-batch training and inference, while low-batch inference still faces challenges from memory bandwidth and kernel launch overheads, especially in edge deployment and latency-sensitive applications. Method: The authors develop FlashFormer, a kernel optimized for single-batch inference of Transformer-based large language models. Result: Across different model sizes and quantization settings, FlashFormer achieves nontrivial speedups over existing state-of-the-art inference kernels. Conclusion: FlashFormer offers an efficient solution for low-batch inference in edge and latency-sensitive scenarios. Abstract: The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for training and inference. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, a proof-of-concept kernel for accelerating single-batch inference for transformer-based large language models. Across various model sizes and quantization settings, we observe nontrivial speedups compared to existing state-of-the-art inference kernels.

[288] DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration

Tianteng Gu,Bei Liu,Bo Xiao,Ke Zeng,Jiacheng Liu,Yanmin Qian

Main category: cs.LG

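TL;DR: The paper proposes DenoiseRotator, which redistributes parameter importance to make models more robust to pruning, substantially reducing the resulting performance loss.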

Details

Motivation: Existing pruning methods focus on estimating the importance of individual weights, which limits how well critical model capabilities are preserved and leads to significant performance loss. Method: The approach minimizes the information entropy of normalized importance scores to concentrate importance onto a smaller subset of weights, and DenoiseRotator does this by applying learnable orthogonal transformations to the weight matrices. Result: On LLaMA3, Qwen2.5, and Mistral models, DenoiseRotator markedly narrows the perplexity gap; on LLaMA3-70B, for example, the gap shrinks by 58%. Conclusion: DenoiseRotator is model-agnostic, integrates with existing pruning techniques, and significantly improves post-pruning performance. Abstract: Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation - especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model's weight matrices. Our method is model-agnostic and can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Code is available at https://github.com/Axel-gu/DenoiseRotator.
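A two-dimensional toy case shows why an orthogonal transformation can lower the entropy of normalized importance scores. The sketch assumes squared magnitude as the importance score and a fixed 45-degree rotation; in the paper the transformations are learned and the score depends on the pruning method, so everything below is illustrative only.

```python
import math


def normalized_importance(weights):
    """Importance scores (here: squared magnitude) normalized to sum to 1."""
    scores = [w * w for w in weights]
    total = sum(scores)
    return [s / total for s in scores]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rotate(weights, theta):
    """Apply a 2-D rotation, the simplest orthogonal transformation."""
    x, y = weights
    c, s = math.cos(theta), math.sin(theta)
    return [c * x - s * y, s * x + c * y]


if __name__ == "__main__":
    w = [1.0, 1.0]                     # importance spread evenly over two weights
    w_rot = rotate(w, -math.pi / 4.0)  # rotation concentrates it on one weight
    print("entropy before:", entropy(normalized_importance(w)))
    print("entropy after: ", entropy(normalized_importance(w_rot)))
    print("norm before/after:", math.hypot(*w), math.hypot(*w_rot))
```

The rotation preserves the vector's norm while moving nearly all importance onto a single coordinate, so removing the other coordinate loses almost nothing. In a network, the layer's function would be preserved by applying the inverse rotation to the adjacent activations or weights.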

[289] MAP: Revisiting Weight Decomposition for Low-Rank Adaptation

Chongjie Si,Zhiyi Shi,Yadao Wang,Xiaokang Yang,Susanto Rahardja,Wei Shen

Main category: cs.LG

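TL;DR: MAP is a new parameter-efficient fine-tuning framework that decomposes weight matrices into direction and magnitude, giving a more rigorous formulation of fine-tuning.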

Details

Motivation: Existing parameter-efficient fine-tuning methods that decompose weights into direction and magnitude (such as DoRA) define direction without a principled geometric foundation, which limits their flexibility and interpretability. Method: MAP treats weight matrices as high-dimensional vectors, normalizes the pre-trained weights, learns a directional update, and introduces two scalar coefficients that independently scale the magnitudes of the base and update vectors. Result: Experiments show that MAP significantly improves performance when combined with existing methods. Conclusion: Given its universality and simplicity, MAP could serve as a default setting for designing future parameter-efficient fine-tuning methods. Abstract: The rapid development of large language models has revolutionized natural language processing, but their fine-tuning remains computationally expensive, hindering broad deployment. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, have emerged as solutions. Recent work like DoRA attempts to further decompose weight adaptation into direction and magnitude components. However, existing formulations often define direction heuristically at the column level, lacking a principled geometric foundation. In this paper, we propose MAP, a novel framework that reformulates weight matrices as high-dimensional vectors and decouples their adaptation into direction and magnitude in a rigorous manner. MAP normalizes the pre-trained weights, learns a directional update, and introduces two scalar coefficients to independently scale the magnitude of the base and update vectors. This design enables more interpretable and flexible adaptation, and can be seamlessly integrated into existing PEFT methods. Extensive experiments show that MAP significantly improves performance when coupled with existing methods, offering a simple yet powerful enhancement to existing PEFT methods. Given the universality and simplicity of MAP, we hope it can serve as a default setting for designing future PEFT methods.
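
One possible reading of the decomposition described above, as a minimal sketch. The abstract does not give the exact formula, so the combination rule below (unit-norm base plus unit-norm update, each scaled by its own scalar) is an assumption, and `map_style_adapt`, `alpha`, and `beta` are hypothetical names.

```python
import math

def frobenius_norm(matrix):
    """Norm of the matrix viewed as one flattened high-dimensional vector."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def map_style_adapt(w0, delta, alpha, beta):
    """Combine a normalized base and a normalized update with two scalar magnitudes.

    Assumed form: W = alpha * W0 / ||W0|| + beta * dW / ||dW||.
    """
    n0 = frobenius_norm(w0)
    nd = frobenius_norm(delta)
    out = []
    for row0, rowd in zip(w0, delta):
        new_row = []
        for a, d in zip(row0, rowd):
            update = beta * d / nd if nd > 0 else 0.0  # a zero update contributes nothing
            new_row.append(alpha * a / n0 + update)
        out.append(new_row)
    return out

w0 = [[3.0, 0.0], [0.0, 4.0]]      # stand-in for pre-trained weights, norm 5
delta = [[0.0, 1.0], [1.0, 0.0]]   # stand-in for a learned directional update

# alpha = ||W0|| and beta = 0 reproduce the pre-trained weights.
unchanged = map_style_adapt(w0, delta, alpha=5.0, beta=0.0)
# A nonzero beta mixes in the update direction with its own magnitude.
adapted = map_style_adapt(w0, delta, alpha=5.0, beta=0.5)
print(unchanged)
print(adapted)
```

In a PEFT setting the update direction would typically come from a low-rank product, as in LoRA, with `alpha` and `beta` trained alongside it.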

[290] Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs

Haokun Chen,Yueqi Zhang,Yuan Bi,Yao Zhang,Tong Liu,Jinhe Bi,Jian Lan,Jindong Gu,Claudia Grosser,Denis Krompass,Nassir Navab,Volker Tresp

Main category: cs.LG

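TL;DR: The paper proposes a comprehensive auditing framework for evaluating unlearning algorithms in large language models (LLMs), comprising benchmark datasets, unlearning algorithms, and auditing methods, and introduces a new technique based on perturbing intermediate activations.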

Details

Motivation: LLM training data may contain sensitive or copyrighted content, yet the efficacy of existing unlearning algorithms is hard to evaluate, so more comprehensive auditing methods are needed. Method: A framework comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods, plus a new technique based on intermediate activation perturbations. Result: Various auditing algorithms are used to evaluate the effectiveness and robustness of different unlearning strategies; the new technique addresses the limitations of auditing methods that rely solely on model inputs and outputs. Conclusion: The framework offers a more comprehensive toolkit for evaluating unlearning algorithms, and the new technique provides an alternative for future research. Abstract: In recent years, Large Language Models (LLMs) have achieved remarkable advancements, drawing significant attention from the research community. Their capabilities are largely attributed to large-scale architectures, which require extensive training on massive datasets. However, such datasets often contain sensitive or copyrighted content sourced from the public internet, raising concerns about data privacy and ownership. Regulatory frameworks, such as the General Data Protection Regulation (GDPR), grant individuals the right to request the removal of such sensitive information. This has motivated the development of machine unlearning algorithms that aim to remove specific knowledge from models without the need for costly retraining. Despite these advancements, evaluating the efficacy of unlearning algorithms remains a challenge due to the inherent complexity and generative nature of LLMs. In this work, we introduce a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. By using various auditing algorithms, we evaluate the effectiveness and robustness of different unlearning strategies. To explore alternatives beyond prompt-based auditing, we propose a novel technique that leverages intermediate activation perturbations, addressing the limitations of auditing methods that rely solely on model inputs and outputs.

[291] Rethinking Regularization Methods for Knowledge Graph Completion

Linyu Li,Zhi Jin,Yuanpeng He,Dongming Jin,Haoran Duan,Zhengwei Tao,Xuan Zhang,Jiandong Li

Main category: cs.LG

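TL;DR: This paper rethinks the use of regularization in knowledge graph completion (KGC) and proposes a novel sparse regularization method (SPR) that improves model performance by selectively penalizing the embedding components that carry significant features.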

Details

Motivation: Existing KGC models do not fully exploit the potential of regularization; the paper explores its deeper effect on model performance. Method: SPR, a sparse regularizer that selectively penalizes the significant feature components of embedding vectors and ignores noise. Result: Experiments show that SPR outperforms other regularization methods and helps KGC models exceed their previous performance ceiling. Conclusion: Carefully designed regularization not only alleviates overfitting but can also markedly improve KGC model performance. Abstract: Knowledge graph completion (KGC) has attracted considerable attention in recent years because it is critical to improving the quality of knowledge graphs. Researchers have continuously explored various models. However, most previous efforts have neglected to examine regularization from a deeper perspective and therefore have not used it to its full potential. This paper rethinks the application of regularization methods in KGC. Through extensive empirical studies on various KGC models, we find that carefully designed regularization not only alleviates overfitting and reduces variance but also enables these models to break through the upper bounds of their original performance. Furthermore, we introduce a novel sparse-regularization method (SPR) that embeds the concept of rank-based selective sparsity into the KGC regularizer. The core idea is to selectively penalize those components with significant features in the embedding vector, thus effectively ignoring many components that contribute little and may only represent noise. Various comparative experiments on multiple datasets and multiple models show that the SPR regularization method is better than other regularization methods and can enable the KGC model to further break through the performance margin.

[292] Domain-Aware Tensor Network Structure Search

Giorgos Iacovides,Wuyang Zhou,Chao Li,Qibin Zhao,Danilo Mandic

Main category: cs.LG

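TL;DR: Proposes tnLLM, a new framework that uses large language models (LLMs) and domain information to directly predict suitable tensor network structures, substantially reducing computational cost.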

Details

Motivation: Current tensor network structure search (TN-SS) methods are computationally expensive, lack transparency, and make no use of domain information. Method: A domain-aware prompting pipeline instructs the LLM to infer structures from the relationships between tensor modes and to generate explanations. Result: Experiments show that tnLLM matches SOTA algorithms with far fewer function evaluations and can accelerate the convergence of other methods. Conclusion: tnLLM tackles the TN-SS problem efficiently by using LLMs and domain information, combining performance with interpretability. Abstract: Tensor networks (TNs) provide efficient representations of high-dimensional data, yet identification of the optimal TN structures, the so-called tensor network structure search (TN-SS) problem, remains a challenge. Current state-of-the-art (SOTA) algorithms are computationally expensive as they require extensive function evaluations, which is prohibitive for real-world applications. In addition, existing methods ignore valuable domain information inherent in real-world tensor data and lack transparency in their identified TN structures. To this end, we propose a novel TN-SS framework, termed the tnLLM, which incorporates domain information about the data and harnesses the reasoning capabilities of large language models (LLMs) to directly predict suitable TN structures. The proposed framework involves a domain-aware prompting pipeline which instructs the LLM to infer suitable TN structures based on the real-world relationships between tensor modes. In this way, our approach is capable of not only iteratively optimizing the objective function, but also generating domain-aware explanations for the identified structures. Experimental results demonstrate that tnLLM achieves comparable TN-SS objective function values with far fewer function evaluations compared to SOTA algorithms. Furthermore, we demonstrate that the LLM-enabled domain information can be used to find good initializations in the search space for sampling-based SOTA methods to accelerate their convergence while preserving theoretical performance guarantees.

[293] Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Yiran Guo,Lijie Xu,Jie Liu,Dan Ye,Shuang Qiu

Main category: cs.LG

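TL;DR: The paper proposes SPO, a new RL framework whose segment-level advantage estimation at an intermediate granularity balances the trade-offs of fine- and coarse-grained methods and markedly improves the reasoning ability of language models.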

Details

Motivation: Existing methods sit at two extremes of advantage-estimation granularity: fine-grained (token-level) methods such as PPO estimate advantages inaccurately because an accurate critic model is hard to train, while coarse-grained (trajectory-level) methods such as GRPO rely only on the final reward, which gives imprecise credit assignment. SPO aims to address both problems. Method: SPO uses segment-level advantage estimation with three novel components: flexible segment partition, accurate segment advantage estimation, and policy optimization using segment advantages (including a probability-mask strategy). SPO-chain and SPO-tree instantiate it for short and long chain-of-thought reasoning, respectively. Result: SPO improves accuracy by 6-12 percentage points over PPO and GRPO on GSM8K and by 7-11 percentage points over GRPO on MATH500, while significantly reducing the cost of MC estimation. Conclusion: Through intermediate-granularity advantage estimation, SPO effectively improves the reasoning ability of language models and performs well across different scenarios. Abstract: Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: Token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. At the other extreme, trajectory-level methods (e.g., GRPO) rely solely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving 6-12 percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving 7-11 percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
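
Segment-level credit assignment can be sketched as follows. This is an illustration under stated assumptions rather than the SPO code: the advantage of a segment is taken to be the change in estimated value between its two boundaries, and `value_fn` is a hypothetical stand-in for a Monte Carlo estimate (the average final reward of rollouts continued from a prefix).

```python
from typing import Callable, List, Sequence

def segment_advantages(
    tokens: Sequence[str],
    cutpoints: Sequence[int],
    value_fn: Callable[[Sequence[str]], float],
) -> List[float]:
    """Advantage of each segment as the value difference across its boundaries.

    `cutpoints` are interior token indices (strictly between 0 and len(tokens)),
    in increasing order, where one segment ends and the next begins.
    """
    boundaries = [0] + list(cutpoints) + [len(tokens)]
    values = [value_fn(tokens[:b]) for b in boundaries]
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]

# Toy value table keyed by prefix length. In practice each entry would be the
# mean reward over several rollouts sampled from the policy at that prefix.
toy_values = {0: 0.25, 2: 0.5, 5: 0.5, 6: 1.0}

def toy_value_fn(prefix):
    return toy_values[len(prefix)]

response = ["t0", "t1", "t2", "t3", "t4", "t5"]
advantages = segment_advantages(response, cutpoints=[2, 5], value_fn=toy_value_fn)
print(advantages)
```

Only one value estimate per boundary is needed, so the number of Monte Carlo estimation points grows with the number of segments rather than the number of tokens, and the per-segment advantages sum to the value change over the whole response.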

[294] On-Policy RL with Optimal Reward Baseline

Yaru Hao,Li Dong,Xun Wu,Shaohan Huang,Zewen Chi,Furu Wei

Main category: cs.LG

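TL;DR: This paper proposes OPO, a new reinforcement learning algorithm that uses exact on-policy training and an optimal reward baseline to address the training instability and computational inefficiency of existing algorithms.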

Details

Motivation: Current reinforcement learning algorithms suffer from training instability and computational inefficiency when used for large language model alignment and reasoning. Method: OPO emphasizes exact on-policy training and introduces an optimal reward baseline that reduces gradient variance. Result: On mathematical reasoning benchmarks, OPO shows better performance and training stability, along with lower policy shifts and higher output entropy. Conclusion: OPO is a stable and efficient reinforcement learning algorithm suited to large language model alignment and reasoning tasks. Abstract: Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO introduces the optimal reward baseline that theoretically minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is provided at https://github.com/microsoft/LMOps/tree/main/opo.

[295] Differential Information: An Information-Theoretic Perspective on Preference Optimization

Yunjae Won,Hyunji Lee,Hyeonbin Hwang,Minjoon Seo

Main category: cs.LG

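TL;DR: By introducing the Differential Information Distribution (DID), this paper fills the theoretical gap around the log-ratio reward parameterization in Direct Preference Optimization (DPO), shows that it is uniquely optimal, and analyzes the relationship between preference data and policy behavior.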

Details

Motivation: Despite DPO's empirical success, the theoretical justification for its log-ratio reward parameterization remains incomplete. The paper aims to fill this gap and offer a deeper understanding through the DID. Method: The DID is used to analyze how preference labels encode the differential information needed to turn a reference policy into a target policy, to derive the unique optimality of the log-ratio reward, and to study its effect on policy behavior. Result: The condition for preference data to encode differential information is linked to an implicit assumption of log-margin ordered policies, and an entropy analysis of the DID reveals the different effects of low- and high-entropy differential information on the policy. Conclusion: The work offers a unified theoretical perspective on the DPO objective, the structure of preference data, and policy behavior; its experiments suggest that high-entropy differential information matters for general instruction following, while low-entropy differential information benefits knowledge-intensive question answering. Abstract: Direct Preference Optimization (DPO) has become a standard technique for aligning language models with human preferences in a supervised manner. Despite its empirical success, the theoretical justification behind its log-ratio reward parameterization remains incomplete. In this work, we address this gap by utilizing the Differential Information Distribution (DID): a distribution over token sequences that captures the information gained during policy updates. First, we show that when preference labels encode the differential information required to transform a reference policy into a target policy, the log-ratio reward in DPO emerges as the uniquely optimal form for learning the target policy via preference optimization. This result naturally yields a closed-form expression for the optimal sampling distribution over rejected responses. Second, we find that the condition for preferences to encode differential information is fundamentally linked to an implicit assumption regarding log-margin ordered policies, an inductive bias widely used in preference optimization yet previously unrecognized. Finally, by analyzing the entropy of the DID, we characterize how learning low-entropy differential information reinforces the policy distribution, while high-entropy differential information induces a smoothing effect, which explains the log-likelihood displacement phenomenon. We validate our theoretical findings in synthetic experiments and extend them to real-world instruction-following datasets. Our results suggest that learning high-entropy differential information is crucial for general instruction-following, while learning low-entropy differential information benefits knowledge-intensive question answering. Overall, our work presents a unifying perspective on the DPO objective, the structure of preference data, and resulting policy behaviors through the lens of differential information.

[296] Test-time augmentation improves efficiency in conformal prediction

Divya Shanmugam,Helen Lu,Swami Sankaranarayanan,John Guttag

Main category: cs.LG

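TL;DR: This paper uses test-time augmentation (TTA) to shrink the prediction sets of conformal classifiers without retraining the model, reducing set size by 10%-14% on average.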

Details

Motivation: Conformal classifiers provide a probabilistic guarantee but often produce prediction sets that are too large to be informative. Method: Test-time augmentation (TTA) is combined with any conformal score and requires no model retraining. Result: Across three datasets, three models, and several distribution shifts, TTA reduces prediction set size by 10%-14% on average. Conclusion: Test-time augmentation is a useful addition to the conformal pipeline, particularly for reducing prediction set size. Abstract: A conformal classifier produces a set of predicted classes and provides a probabilistic guarantee that the set includes the true class. Unfortunately, it is often the case that conformal classifiers produce uninformatively large sets. In this work, we show that test-time augmentation (TTA), a technique that introduces inductive biases during inference, reduces the size of the sets produced by conformal classifiers. Our approach is flexible, computationally efficient, and effective. It can be combined with any conformal score, requires no model retraining, and reduces prediction set sizes by 10%-14% on average. We conduct an evaluation of the approach spanning three datasets, three models, two established conformal scoring methods, different guarantee strengths, and several distribution shifts to show when and why test-time augmentation is a useful addition to the conformal pipeline.
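
A minimal sketch of how TTA can slot into split conformal prediction, under stated assumptions: the conformal score is one minus the probability of a class, the augmented views are aggregated by a plain average (the paper's aggregation may differ), and the function names are hypothetical.

```python
import math

def average_probs(prob_views):
    """Aggregate class probabilities from several augmented views by simple averaging."""
    n_views = len(prob_views)
    n_classes = len(prob_views[0])
    return [sum(view[c] for view in prob_views) / n_views for c in range(n_classes)]

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: the set must contain every class
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, qhat):
    """All classes whose score 1 - p does not exceed the calibrated threshold."""
    return [c for c, p in enumerate(probs) if 1 - p <= qhat]

# Calibration scores 1 - p(true class); they should be computed from the same
# TTA-averaged probabilities that are used at test time.
cal_scores = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.9]
qhat = conformal_threshold(cal_scores, alpha=0.25)

# Two augmented views of one test image, three classes.
views = [[0.5, 0.3, 0.2], [0.7, 0.1, 0.2]]
tta_probs = average_probs(views)
print(qhat, tta_probs, prediction_set(tta_probs, qhat))
```

Nothing in the classifier changes: only the probabilities fed into the conformal score are replaced by their TTA aggregate, which matches the abstract's claim that no retraining is required.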

[297] Number of Clusters in a Dataset: A Regularized K-means Approach

Behzad Kamgar-Parsi,Behrooz Kamgar-Parsi

Main category: cs.LG

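TL;DR: This paper studies how to set the key hyperparameter λ in regularized k-means, derives bounds on λ under an ideal-cluster assumption, and analyzes how additive and multiplicative regularization affect the solutions.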

Details

Motivation: Determining the number of meaningful clusters in an unlabeled dataset matters in many applications, but there are currently no principled guidelines for setting the regularization hyperparameter λ. Method: Rigorous bounds on λ are derived assuming ideal spherical clusters, and the solutions of k-means with additive and multiplicative regularizers are analyzed. Result: Experiments show that additive regularization often yields multiple solutions, and that the consensus between additive and multiplicative solutions reduces this ambiguity in certain cases. Conclusion: The paper provides a theoretical basis for setting λ and demonstrates how regularized k-means performs on non-ideal clusters. Abstract: Finding the number of meaningful clusters in an unlabeled dataset is important in many applications. The regularized k-means algorithm is a possible approach frequently used to find the correct number of distinct clusters in datasets. The most common formulation of the regularization function is the additive linear term $\lambda k$, where $k$ is the number of clusters and $\lambda$ a positive coefficient. Currently, there are no principled guidelines for setting a value for the critical hyperparameter $\lambda$. In this paper, we derive rigorous bounds for $\lambda$ assuming clusters are ideal. Ideal clusters (defined as $d$-dimensional spheres with identical radii) are close proxies for k-means clusters ($d$-dimensional spherically symmetric distributions with identical standard deviations). Experiments show that the k-means algorithm with additive regularizer often yields multiple solutions. Thus, we also analyze the k-means algorithm with multiplicative regularizer. The consensus among k-means solutions with additive and multiplicative regularizations reduces the ambiguity of multiple solutions in certain cases. We also present selected experiments that demonstrate the performance of the regularized k-means algorithms as clusters deviate from the ideal assumption.
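
The additive regularizer described above amounts to choosing the k that minimizes the k-means distortion plus $\lambda k$. The sketch below illustrates that selection rule on 1-D data with a basic Lloyd iteration and a deterministic initialization; the data, the helper names, and the values of `lam` are illustrative assumptions, not taken from the paper.

```python
def kmeans_1d_sse(points, k, iters=50):
    """Run Lloyd's algorithm on 1-D data and return the sum of squared errors."""
    data = sorted(points)
    n = len(data)
    # Deterministic initialization: k evenly spaced order statistics.
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            groups[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        new_centers = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min((x - c) ** 2 for c in centers) for x in data)

def select_k(points, lam, k_max):
    """Pick k minimizing the additively regularized objective SSE(k) + lam * k."""
    costs = {k: kmeans_1d_sse(points, k) + lam * k for k in range(1, k_max + 1)}
    return min(costs, key=costs.get)

# Two tight, well-separated groups.
data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
print(select_k(data, lam=1.0, k_max=4))     # moderate penalty
print(select_k(data, lam=1000.0, k_max=4))  # very large penalty
```

With a moderate penalty the rule recovers the two groups, while a very large penalty collapses everything into one cluster, which is why the choice of $\lambda$ is the critical question the paper addresses.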

[298] Diverse Prototypical Ensembles Improve Robustness to Subpopulation Shift

Minh Nguyen Nhat To,Paul F. R. Wilson,Viet Nguyen,Mohamed Harmanani,Michael Cooper,Fahimeh Fooladgar,Purang Abolmaesumi,Parvin Mousavi,Rahul G. Krishnan

Main category: cs.LG

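TL;DR: The paper proposes Diverse Prototypical Ensembles (DPEs), which use an ensemble of diverse prototypical classifiers to handle subpopulation shift and improve worst-group accuracy over prior methods.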

Details

Motivation: Subpopulation shift can significantly degrade the performance of machine learning models, and existing methods rely on assumptions about the number and nature of subpopulations and on group annotations, which are often unavailable for real-world data. Method: The standard linear classification layer is replaced with an ensemble of diverse prototypical classifiers, each focusing on different features and samples, so that subpopulation risk is captured adaptively. Result: On nine real-world datasets, DPEs often outperform the prior state of the art in worst-group accuracy. Conclusion: DPEs do not depend on subpopulation annotations and handle subpopulation shift effectively. Abstract: Subpopulation shift, characterized by a disparity in subpopulation distribution between the training and target datasets, can significantly degrade the performance of machine learning models. Current solutions to subpopulation shift involve modifying empirical risk minimization with re-weighting strategies to improve generalization. This strategy relies on assumptions about the number and nature of subpopulations and annotations on group membership, which are unavailable for many real-world datasets. Instead, we propose using an ensemble of diverse classifiers to adaptively capture risk associated with subpopulations. Given a feature extractor network, we replace its standard linear classification layer with a mixture of prototypical classifiers, where each member is trained to classify the data while focusing on different features and samples from other members. In empirical evaluation on nine real-world datasets, covering diverse domains and kinds of subpopulation shift, our method of Diverse Prototypical Ensembles (DPEs) often outperforms the prior state-of-the-art in worst-group accuracy. The code is available at https://github.com/minhto2802/dpe4subpop

[299] Pseudo Multi-Source Domain Generalization: Bridging the Gap Between Single and Multi-Source Domain Generalization

Shohei Enomoto

Main category: cs.LG

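TL;DR: The paper proposes PMDG, a new framework that generates multiple pseudo-domains from a single source domain via style transfer and data augmentation, addressing the high dataset-construction cost that limits multi-source domain generalization (MDG) in practice. Experiments show that PMDG performance correlates positively with MDG performance and that, with sufficient data, pseudo-domains can match or exceed real multi-domain performance.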

Details

Motivation: Deep learning models lose performance on shifted data distributions, and multi-source domain generalization (MDG) is hard to apply in practice because multi-domain datasets are costly to build. Method: The PMDG framework generates multiple pseudo-domains from a single source domain through style transfer and data augmentation, builds a synthetic multi-domain dataset, and trains on it with existing MDG algorithms. Result: Experiments show that PMDG performance correlates positively with MDG performance and that pseudo-domains can match or exceed real multi-domain performance given sufficient data. Conclusion: PMDG offers a practical approach to single-source domain generalization (SDG) and useful insights for future domain generalization research. Abstract: Deep learning models often struggle to maintain performance when deployed on data distributions different from their training data, particularly in real-world applications where environmental conditions frequently change. While Multi-source Domain Generalization (MDG) has shown promise in addressing this challenge by leveraging multiple source domains during training, its practical application is limited by the significant costs and difficulties associated with creating multi-domain datasets. To address this limitation, we propose Pseudo Multi-source Domain Generalization (PMDG), a novel framework that enables the application of sophisticated MDG algorithms in more practical Single-source Domain Generalization (SDG) settings. PMDG generates multiple pseudo-domains from a single source domain through style transfer and data augmentation techniques, creating a synthetic multi-domain dataset that can be used with existing MDG algorithms. Through extensive experiments with PseudoDomainBed, our modified version of the DomainBed benchmark, we analyze the effectiveness of PMDG across multiple datasets and architectures. Our analysis reveals several key findings, including a positive correlation between MDG and PMDG performance and the potential of pseudo-domains to match or exceed actual multi-domain performance with sufficient data. These comprehensive empirical results provide valuable insights for future research in domain generalization. Our code is available at https://github.com/s-enmt/PseudoDomainBed.

[300] Buffer-free Class-Incremental Learning with Out-of-Distribution Detection

Srishti Gupta,Daniele Angioni,Maura Pintor,Ambra Demontis,Lea Schönherr,Battista Biggio,Fabio Roli

Main category: cs.LG

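TL;DR: The paper proposes a post-hoc OOD detection approach that needs no memory buffer for open-world class-incremental learning, performing comparably to or better than buffer-based methods.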

Details

Motivation: Address the privacy, scalability, and training-time concerns of open-world class-incremental learning by avoiding reliance on a buffer of past data. Method: Post-hoc OOD detection methods are analyzed and applied at inference time in place of buffer-based OOD detection to handle unknown classes. Result: On the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, performance is comparable or superior to buffer-based methods. Conclusion: Post-hoc OOD detection offers an efficient and privacy-preserving solution for open-world class-incremental learning. Abstract: Class-incremental learning (CIL) poses significant challenges in open-world scenarios, where models must not only learn new classes over time without forgetting previous ones but also handle inputs from unknown classes that a closed-set model would misclassify. Recent works address both issues by (i) training multi-head models using the task-incremental learning framework, and (ii) predicting the task identity employing out-of-distribution (OOD) detectors. While effective, the latter mainly relies on joint training with a memory buffer of past data, raising concerns around privacy, scalability, and increased training time. In this paper, we present an in-depth analysis of post-hoc OOD detection methods and investigate their potential to eliminate the need for a memory buffer. We uncover that these methods, when applied appropriately at inference time, can serve as a strong substitute for buffer-based OOD detection. We show that this buffer-free approach achieves comparable or superior performance to buffer-based methods both in terms of class-incremental learning and the rejection of unknown samples. Experimental results on CIFAR-10, CIFAR-100 and Tiny ImageNet datasets support our findings, offering new insights into the design of efficient and privacy-preserving CIL systems for open-world settings.

[301] Network Inversion for Uncertainty-Aware Out-of-Distribution Detection

Pirzada Suhail,Rehna Afroz,Amit Sethi

Main category: cs.LG

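TL;DR: Proposes a new framework that combines network inversion with classifier training to address OOD detection and uncertainty estimation simultaneously.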

Details

Motivation: Building safe machine learning systems requires handling unexpected inputs effectively, for which OOD detection and uncertainty estimation are key. Method: A "garbage" class is introduced, and the classifier's decision boundaries are refined through iterative training, inversion, and exclusion. Result: The model detects OOD samples by classifying them into the garbage class while also providing uncertainty estimates. Conclusion: The method requires no external OOD data or post-hoc calibration and provides a unified solution for OOD detection and uncertainty estimation. Abstract: Out-of-distribution (OOD) detection and uncertainty estimation (UE) are critical components for building safe machine learning systems, especially in real-world scenarios where unexpected inputs are inevitable. In this work, we propose a novel framework that combines network inversion with classifier training to simultaneously address both OOD detection and uncertainty estimation. For a standard n-class classification task, we extend the classifier to an (n+1)-class model by introducing a "garbage" class, initially populated with random Gaussian noise to represent outlier inputs. After each training epoch, we use network inversion to reconstruct input images for all output classes; these initially appear noisy and incoherent and are therefore assigned to the garbage class for retraining the classifier. This cycle of training, inversion, and exclusion continues iteratively until the inverted samples begin to resemble the in-distribution data more closely, suggesting that the classifier has learned to carve out meaningful decision boundaries while sanitising the class manifolds by pushing OOD content into the garbage class. During inference, this training scheme enables the model to effectively detect and reject OOD samples by classifying them into the garbage class. Furthermore, the confidence scores associated with each prediction can be used to estimate uncertainty for both in-distribution and OOD inputs. Our approach is scalable, interpretable, and does not require access to external OOD datasets or post-hoc calibration techniques while providing a unified solution to the dual challenges of OOD detection and uncertainty estimation.

[302] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Qingyu Shi,Jinbin Bai,Zhuoran Zhao,Wenhao Chai,Kaidong Yu,Jianzong Wu,Shuangyong Song,Yunhai Tong,Xiangtai Li,Xuelong Li,Shuicheng Yan

Main category: cs.LG

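TL;DR: Muddit is a unified discrete diffusion transformer that combines a pretrained text-to-image backbone with a lightweight text decoder to generate text and images quickly and in parallel, performing competitively with or better than much larger autoregressive models.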

Details

Motivation: Autoregressive unified models are slow at inference and non-autoregressive unified models generalize weakly; the paper explores discrete diffusion as an efficient backbone for unified generation. Method: Muddit combines a pretrained text-to-image backbone with a lightweight text decoder to enable parallel multimodal generation. Result: Muddit is competitive with or superior to significantly larger autoregressive models in both quality and efficiency, showing the potential of discrete diffusion. Conclusion: Discrete diffusion equipped with strong visual priors can serve as a scalable and efficient backbone for unified generation. Abstract: Unified generation models aim to handle diverse tasks across modalities (such as text generation, image generation, and vision-language reasoning) within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

[303] Merge-Friendly Post-Training Quantization for Multi-Target Domain Adaptation

Juncheol Shin,Minsang Seok,Seonggon Kim,Eunhyeok Park

Main category: cs.LG

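TL;DR: The study analyzes how quantization affects model merging and proposes HDRQ, a new post-training quantization method that supports model merging for multi-target domain adaptation.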

Details

Motivation: Quantization applied on target-specific data restricts the domain of interest and introduces discretization effects, which complicates model merging. Method: The impact of quantization is analyzed through error barriers; HDRQ (Hessian and distant regularizing quantization) keeps the quantized model close to the source pre-trained model while flattening the loss surface. Result: Experiments confirm HDRQ's effectiveness for model merging in multi-target domain adaptation. Conclusion: To the authors' knowledge this is the first study of merging quantized models, and HDRQ effectively addresses the associated challenges. Abstract: Model merging has emerged as a powerful technique for combining task-specific weights, achieving superior performance in multi-target domain adaptation. However, when applied to practical scenarios, such as quantized models, new challenges arise. In practical scenarios, quantization is often applied to target-specific data, but this process restricts the domain of interest and introduces discretization effects, making model merging highly non-trivial. In this study, we analyze the impact of quantization on model merging through the lens of error barriers. Leveraging these insights, we propose a novel post-training quantization, HDRQ (Hessian and distant regularizing quantization), that is designed to consider model merging for multi-target domain adaptation. Our approach ensures that the quantization process incurs minimal deviation from the source pre-trained model while flattening the loss surface to facilitate smooth model merging. To our knowledge, this is the first study on this challenge, and extensive experiments confirm its effectiveness.

[304] REOrdering Patches Improves Vision Models

Declan Kutscher,David M. Chan,Yutong Bai,Trevor Darrell,Ritwik Gupta

Main category: cs.LG

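TL;DR: The paper proposes REOrder, a framework that improves transformer performance by optimizing the order of image patches, with experiments showing notable accuracy gains.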

Details

Motivation: Modern long-sequence transformer models are sensitive to patch order, and a fixed order can hurt performance, so task-optimal orderings are worth finding. Method: A two-stage framework: (1) an information-theoretic prior obtained by evaluating the compressibility of patch sequences; (2) a Plackett-Luce policy optimized with REINFORCE to learn the ordering. Result: REOrder improves top-1 accuracy over row-major ordering by up to 3.01% on ImageNet-1K and by 13.35% on Functional Map of the World. Conclusion: By optimizing patch order, REOrder notably improves model performance and suggests a new direction for sequence model design. Abstract: Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
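
The Plackett-Luce policy mentioned above defines a distribution over permutations: items are drawn one at a time without replacement, each with probability proportional to the exponential of its score. A minimal sketch of sampling an ordering and computing its log-probability follows; the per-patch `scores` are hypothetical learnable logits, and the REINFORCE update itself is not shown.

```python
import math
import random

def sample_plackett_luce(scores, rng):
    """Sample a permutation by repeatedly drawing from a softmax over the remaining items."""
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        order.append(pick)
        remaining.remove(pick)
    return order

def plackett_luce_log_prob(scores, order):
    """Log-probability of an ordering under the Plackett-Luce model."""
    remaining = list(order)
    log_prob = 0.0
    for item in order:
        log_norm = math.log(sum(math.exp(scores[i]) for i in remaining))
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob

rng = random.Random(0)
scores = [0.5, -1.0, 2.0, 0.0, 1.0]  # one logit per image patch (toy values)
order = sample_plackett_luce(scores, rng)
print(order, plackett_luce_log_prob(scores, order))
```

In a REINFORCE setup, the gradient of this log-probability with respect to the scores would be weighted by a reward such as downstream accuracy, which lets the policy be trained without enumerating the factorially many permutations.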

eess.IV [Back]

[305] IRS: Incremental Relationship-guided Segmentation for Digital Pathology

Ruining Deng,Junchao Zhu,Juming Xiong,Can Cui,Tianyuan Yao,Junlin Guo,Siqi Lu,Marilyn Lionts,Mengmeng Yin,Yu Wang,Shilin Zhao,Yucheng Tang,Yihe Yang,Paul Dennis Simonson,Mert R. Sabuncu,Haichun Yang,Yuankai Huo

Main category: eess.IV

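TL;DR: This paper proposes a novel Incremental Relationship-guided Segmentation (IRS) learning scheme for temporally acquired, partially annotated data in digital pathology that retains out-of-distribution (OOD) continual learning capacity.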

Details

Motivation: Panoramic segmentation in digital pathology faces incomplete annotations and the need to continually learn new classes; IRS targets both problems. Method: IRS mathematically models anatomical relationships and uses an incremental universal proposition matrix to realize spatial-temporal OOD continual learning. Result: Experiments show that IRS handles multi-scale pathological segmentation effectively, enabling precise kidney segmentation and recognition of OOD disease lesions. Conclusion: IRS significantly enhances domain generalization and is suitable for real-world digital pathology applications. Abstract: Continual learning is rapidly emerging as a key focus in computer vision, aiming to develop AI systems capable of continuous improvement, thereby enhancing their value and practicality in diverse real-world applications. In healthcare, continual learning holds great promise for continuously acquired digital pathology data, which is collected in hospitals on a daily basis. However, panoramic segmentation on digital whole slide images (WSIs) presents significant challenges, as it is often infeasible to obtain comprehensive annotations for all potential objects, spanning from coarse structures (e.g., regions and unit objects) to fine structures (e.g., cells). This results in temporally and partially annotated data, posing a major challenge in developing a holistic segmentation framework. Moreover, an ideal segmentation model should incorporate new phenotypes, unseen diseases, and diverse populations, making this task even more complex. In this paper, we introduce a novel and unified Incremental Relationship-guided Segmentation (IRS) learning scheme to address temporally acquired, partially annotated data while maintaining out-of-distribution (OOD) continual learning capacity in digital pathology. The key innovation of IRS lies in its ability to realize a new spatial-temporal OOD continual learning paradigm by mathematically modeling anatomical relationships between existing and newly introduced classes through a simple incremental universal proposition matrix. Experimental results demonstrate that the IRS method effectively handles the multi-scale nature of pathological segmentation, enabling precise kidney segmentation across various structures (regions, units, and cells) as well as OOD disease lesions at multiple magnifications. This capability significantly enhances domain generalization, making IRS a robust approach for real-world digital pathology applications.

[306] iHDR: Iterative HDR Imaging with Arbitrary Number of Exposures

Yu Yuan,Yiheng Chi,Xingguang Zhang,Stanley Chan

Main category: eess.IV

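TL;DR: Proposes iHDR, a novel framework that generates high-quality HDR images by iteratively fusing multiple low dynamic range (LDR) images, removing the fixed-input-count limitation of existing methods.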

Details

Motivation: Existing HDR imaging methods are usually designed for a fixed number of inputs and cannot flexibly handle varying numbers of input frames. Method: The iHDR framework comprises a ghost-free dual-input HDR fusion network (DiHDR) and a physics-based domain mapping network (ToneNet), and produces the HDR image progressively through iterative fusion. Result: Experiments show that iHDR outperforms existing HDR deghosting methods when the number of input frames is flexible. Conclusion: The iHDR framework offers a flexible and efficient solution for HDR imaging in dynamic scenes. Abstract: High dynamic range (HDR) imaging aims to obtain a high-quality HDR image by fusing information from multiple low dynamic range (LDR) images. Numerous learning-based HDR imaging methods have been proposed to achieve this for static and dynamic scenes. However, their architectures are mostly tailored for a fixed number (e.g., three) of inputs and, therefore, cannot be applied directly to situations beyond the pre-defined limited scope. To address this issue, we propose a novel framework, iHDR, for iterative fusion, which comprises a ghost-free Dual-input HDR fusion network (DiHDR) and a physics-based domain mapping network (ToneNet). DiHDR leverages a pair of inputs to estimate an intermediate HDR image, while ToneNet maps it back to the nonlinear domain and serves as the reference input for the next pairwise fusion. This process is iteratively executed until all input frames are utilized. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method as compared to existing state-of-the-art HDR deghosting approaches given flexible numbers of input frames.
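
The pairwise loop described in the abstract can be sketched as below. This is only an illustration of the control flow, not the authors' implementation: `iterative_fusion`, `fuse_pair` (standing in for DiHDR), and `to_nonlinear` (standing in for ToneNet) are hypothetical names, and the choice of the first frame as the initial reference is an assumption.

```python
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")


def iterative_fusion(
    frames: Sequence[Image],
    fuse_pair: Callable[[Image, Image], Image],   # hypothetical stand-in for DiHDR
    to_nonlinear: Callable[[Image], Image],       # hypothetical stand-in for ToneNet
) -> Image:
    """Fuse any number (>= 2) of LDR frames two at a time."""
    if len(frames) < 2:
        raise ValueError("need at least two input frames")

    # First pairwise fusion: assumed to use the first two frames.
    hdr = fuse_pair(frames[0], frames[1])

    # Each remaining frame is fused with the previous result, which is first
    # mapped back to the nonlinear domain so it can act as the reference input.
    for frame in frames[2:]:
        reference = to_nonlinear(hdr)
        hdr = fuse_pair(reference, frame)
    return hdr


if __name__ == "__main__":
    # Symbolic stand-ins that only record the order of operations.
    fuse = lambda a, b: f"F({a},{b})"
    tone = lambda h: f"T({h})"
    print(iterative_fusion(["a", "b", "c"], fuse, tone))  # F(T(F(a,b)),c)
```

Because the loop only ever calls a two-input fusion step, the same pair of networks serves any number of exposures, which is the flexibility the abstract emphasizes.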

[307] Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging

Ping Wang,Lishun Wang,Gang Qu,Xiaodong Wang,Yulun Zhang,Xin Yuan

Main category: eess.IV

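TL;DR: The paper proposes a single-pixel imaging (SPI) inverse-problem solver that combines deep unrolling with the plug-and-play (PnP) approach; by designing an efficient deep image restorer (DIR) and a general proximal trajectory (PT) loss function, it achieves accurate, fast reconstruction under varying compression ratios (CRs).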

Details

Motivation: PnP methods are flexible but limited in accuracy and speed, whereas unrolling methods are accurate but must be adjusted for each CR. The paper aims to combine the strengths of both. Method: An efficient DIR is designed for unrolling HQS and ADMM, and a PT loss function is proposed to train the networks so that the DIR approximates the proximal operator of an ideal explicit restoration regularizer. Result: Experiments show that the method not only handles different CRs flexibly but also outperforms previous CR-specific unrolling networks in reconstruction accuracy and speed. Conclusion: The paper successfully integrates the strengths of PnP and unrolling, providing an efficient and flexible solver for SPI inverse problems. Abstract: Deep-unrolling and plug-and-play (PnP) approaches have become the de-facto standard solvers for the single-pixel imaging (SPI) inverse problem. PnP approaches, a class of iterative algorithms where regularization is implicitly performed by an off-the-shelf deep denoiser, are flexible for varying compression ratios (CRs) but are limited in reconstruction accuracy and speed. Conversely, unrolling approaches, a class of multi-stage neural networks where a truncated iterative optimization process is transformed into an end-to-end trainable network, typically achieve better accuracy with faster inference but require fine-tuning or even retraining when the CR changes. In this paper, we address the challenge of integrating the strengths of both classes of solvers. To this end, we design an efficient deep image restorer (DIR) for the unrolling of HQS (half quadratic splitting) and ADMM (alternating direction method of multipliers). More importantly, a general proximal trajectory (PT) loss function is proposed to train HQS/ADMM-unrolling networks such that the learned DIR approximates the proximal operator of an ideal explicit restoration regularizer. Extensive experiments demonstrate that the resulting proximal unrolling networks can not only flexibly handle varying CRs with a single model like PnP algorithms, but also outperform previous CR-specific unrolling networks in both reconstruction accuracy and speed. Source codes and models are available at https://github.com/pwangcs/ProxUnroll.
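
For readers unfamiliar with HQS, the textbook iteration that an unrolling network truncates to a fixed number of stages can be written as follows. The notation is an assumption made for illustration and is not taken from the abstract: a linear measurement model $y = Ax$, a regularizer $R$ with weight $\lambda$, a penalty parameter $\mu$, and an auxiliary variable $z$.

```latex
\begin{aligned}
&\min_{x}\ \tfrac{1}{2}\|y - Ax\|_2^2 + \lambda R(x) \\[4pt]
x^{k+1} &= \arg\min_{x}\ \tfrac{1}{2}\|y - Ax\|_2^2 + \tfrac{\mu}{2}\|x - z^{k}\|_2^2
         = (A^\top A + \mu I)^{-1}\bigl(A^\top y + \mu z^{k}\bigr) \\[4pt]
z^{k+1} &= \operatorname{prox}_{(\lambda/\mu) R}\bigl(x^{k+1}\bigr)
         = \arg\min_{z}\ \tfrac{\mu}{2}\|z - x^{k+1}\|_2^2 + \lambda R(z)
\end{aligned}
```

In PnP and unrolling solvers the $z$-update is replaced by a network; per the abstract, the PT loss is meant to make the learned DIR behave like this proximal operator for some ideal explicit regularizer.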

[308] Advancing Image Super-resolution Techniques in Remote Sensing: A Comprehensive Survey

Yunliang Qi,Meng Lou,Yimin Liu,Lu Li,Zhen Yang,Wen Nie

Main category: eess.IV

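TL;DR: This paper systematically surveys methods, datasets, and evaluation metrics for remote sensing image super-resolution (RSISR), analyzes the strengths and weaknesses of existing techniques, and identifies future research directions.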

Details

Motivation: Despite the growing number of RSISR methods in recent years, a systematic review is lacking; this paper aims to fill that gap. Method: RSISR methods are categorized into supervised, unsupervised, and quality evaluation approaches and analyzed in detail. Result: Existing methods show significant shortcomings in preserving fine-grained textures and geometric structures under large-scale degradation. Conclusion: Domain-specific architectures and more robust evaluation protocols are needed to narrow the gap between synthetic and real-world scenarios. Abstract: Remote sensing image super-resolution (RSISR) is a crucial task in remote sensing image processing, aiming to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Despite the growing number of RSISR methods proposed in recent years, a systematic and comprehensive review of these methods is still lacking. This paper presents a thorough review of RSISR algorithms, covering methodologies, datasets, and evaluation metrics. We provide an in-depth analysis of RSISR methods, categorizing them into supervised, unsupervised, and quality evaluation approaches, to help researchers understand current trends and challenges. Our review also discusses the strengths, limitations, and inherent challenges of these techniques. Notably, our analysis reveals significant limitations in existing methods, particularly in preserving fine-grained textures and geometric structures under large-scale degradation. Based on these findings, we outline future research directions, highlighting the need for domain-specific architectures and robust evaluation protocols to bridge the gap between synthetic and real-world RSISR scenarios.

[309] Can Large Language Models Challenge CNNS in Medical Image Analysis?

Shibbir Ahmed,Shahnewaz Karim Sakib,Anindya Bijoy Das

Main category: eess.IV

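TL;DR: A multimodal AI framework for medical image classification compares CNNs and LLMs on performance, efficiency, and environmental impact, and finds that filtering applied on top of LLMs can significantly improve performance.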

Details

Motivation: Improve the reliability, efficiency, and scalability of medical image diagnosis. Method: Using publicly available datasets, compare CNNs and LLMs on accuracy, F1-score, execution time, energy consumption, and CO2 emissions. Result: CNNs outperform multimodal techniques in some respects, but filtering applied on top of LLMs can significantly improve performance. Conclusion: Multimodal AI systems have transformative potential in medical diagnostics. Abstract: This study presents a multimodal AI framework designed for precisely classifying medical diagnostic images. Utilizing publicly available datasets, the proposed system compares the strengths of convolutional neural networks (CNNs) and different large language models (LLMs). This in-depth comparative analysis highlights key differences in diagnostic performance, execution efficiency, and environmental impacts. Model evaluation was based on accuracy, F1-score, average execution time, average energy consumption, and estimated $CO_2$ emission. The findings indicate that although CNN-based models can outperform various multimodal techniques that incorporate both images and contextual information, applying additional filtering on top of LLMs can lead to substantial performance gains. These findings highlight the transformative potential of multimodal AI systems to enhance the reliability, efficiency, and scalability of medical diagnostics in clinical settings.

[310] PCA for Enhanced Cross-Dataset Generalizability in Breast Ultrasound Tumor Segmentation

Christian Schmidt,Heinrich Martin Overhoff

Main category: eess.IV

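TL;DR: The paper proposes a new method based on principal component analysis (PCA) to improve the external validity of medical ultrasound image segmentation models on unseen datasets. Experiments show that PCA preprocessing significantly improves recall and Dice scores.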

Details

Motivation: Medical image segmentation models have insufficient external validity when deployed across datasets, especially for ultrasound images. Existing approaches (such as domain adaptation and GANs) have limited effect on small, diverse datasets. Method: PCA preprocessing reduces noise while retaining 90% of the variance, producing PCA-reconstructed datasets. U-Net models are trained on six breast tumor ultrasound datasets, and performance on the original and PCA datasets is compared. Result: The PCA method significantly improves recall (0.57 to 0.70) and Dice score (0.50 to 0.58), and reduces the recall decline caused by external validation by 33%. Conclusion: PCA reconstruction is an effective way to improve the external validity of medical image segmentation models, particularly in challenging cases. Abstract: In medical image segmentation, limited external validity remains a critical obstacle when models are deployed across unseen datasets, an issue particularly pronounced in the ultrasound image domain. Existing solutions, such as domain adaptation and GAN-based style transfer, while promising, often fall short in the medical domain where datasets are typically small and diverse. This paper presents a novel application of principal component analysis (PCA) to address this limitation. PCA preprocessing reduces noise and emphasizes essential features by retaining approximately 90\% of the dataset variance. We evaluate our approach across six diverse breast tumor ultrasound datasets comprising 3,983 B-mode images and corresponding expert tumor segmentation masks. For each dataset, a corresponding dimensionality-reduced PCA dataset is created and U-Net-based segmentation models are trained on each of the twelve datasets. Each model trained on an original dataset was run on the remaining five out-of-domain original datasets (baseline results), while each model trained on a PCA dataset was run on five out-of-domain PCA datasets. Our experimental results indicate that using PCA reconstructed datasets, instead of original images, improves the model's recall and Dice scores, particularly for model-dataset pairs where baseline performance was lowest, achieving statistically significant gains in recall (0.57 $\pm$ 0.07 vs. 0.70 $\pm$ 0.05, $p = 0.0004$) and Dice scores (0.50 $\pm$ 0.06 vs. 0.58 $\pm$ 0.06, $p = 0.03$). Our method reduced the decline in recall values due to external validation by $33\%$. These findings underscore the potential of PCA reconstruction as a safeguard to mitigate declines in segmentation performance, especially in challenging cases, with implications for enhancing external validity in real-world medical applications.
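
As a rough illustration of PCA reconstruction that keeps about 90% of the variance, the sketch below flattens each image into a row vector, fits PCA across the dataset via SVD, and projects back using the smallest number of components that reaches the target. The function name `pca_reconstruct` and the choice to treat every whole image as one sample are assumptions; the abstract does not specify how the authors build the PCA basis.

```python
import numpy as np


def pca_reconstruct(images: np.ndarray, variance_to_keep: float = 0.90):
    """Return (reconstructed images, number of components kept).

    images: array of shape (n_images, height, width).
    """
    n = images.shape[0]
    flat = images.reshape(n, -1).astype(np.float64)

    # Center the data; PCA directions are the right singular vectors.
    mean = flat.mean(axis=0)
    centered = flat - mean
    u, s, vt = np.linalg.svd(centered, full_matrices=False)

    explained = s ** 2
    total = explained.sum()
    if total == 0.0:
        # All images identical: nothing to compress.
        return images.astype(np.float64), 0

    # Smallest k whose cumulative explained-variance ratio reaches the target.
    ratio = np.cumsum(explained) / total
    k = min(int(np.searchsorted(ratio, variance_to_keep)) + 1, len(s))

    # Project onto the first k components and map back to pixel space.
    recon = (u[:, :k] * s[:k]) @ vt[:k] + mean
    return recon.reshape(images.shape), k


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.random((20, 8, 8))  # synthetic stand-in for B-mode images
    recon, k = pca_reconstruct(demo)
    print(recon.shape, k)
```

Dropping the low-variance components acts as a denoising step, which is the effect the abstract credits for the improved cross-dataset recall.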

[311] ImmunoDiff: A Diffusion Model for Immunotherapy Response Prediction in Lung Cancer

Moinak Bhattacharya,Judy Huang,Amna F. Sher,Gagandeep Singh,Chao Chen,Prateek Prasanna

Main category: eess.IV

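TL;DR: ImmunoDiff is a diffusion-based framework that synthesizes post-treatment CT scans and incorporates clinical data, significantly improving the accuracy of immunotherapy response prediction in NSCLC.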

Details

Motivation: Accurately predicting immunotherapy response in NSCLC is an unmet clinical need; existing models rely on pre-treatment imaging and cannot capture the complex changes after treatment. Method: Proposes ImmunoDiff, which combines anatomical priors and clinical data embeddings and refines the generative process through a cbi-Adapter module and multimodal integration. Result: On an NSCLC cohort, balanced accuracy for response prediction improves by 21.24% and the c-index for survival prediction increases by 0.03. Conclusion: ImmunoDiff significantly improves immunotherapy prediction performance through multimodal data integration and generative modeling. Abstract: Accurately predicting immunotherapy response in Non-Small Cell Lung Cancer (NSCLC) remains a critical unmet need. Existing radiomics and deep learning-based predictive models rely primarily on pre-treatment imaging to predict categorical response outcomes, limiting their ability to capture the complex morphological and textural transformations induced by immunotherapy. This study introduces ImmunoDiff, an anatomy-aware diffusion model designed to synthesize post-treatment CT scans from baseline imaging while incorporating clinically relevant constraints. The proposed framework integrates anatomical priors, specifically lobar and vascular structures, to enhance fidelity in CT synthesis. Additionally, we introduce a novel cbi-Adapter, a conditioning module that ensures pairwise-consistent multimodal integration of imaging and clinical data embeddings, to refine the generative process. Furthermore, a clinical variable conditioning mechanism is introduced, leveraging demographic data, blood-based biomarkers, and PD-L1 expression to refine the generative process. Evaluations on an in-house NSCLC cohort treated with immune checkpoint inhibitors demonstrate a 21.24% improvement in balanced accuracy for response prediction and a 0.03 increase in c-index for survival prediction. Code will be released soon.

[312] MRI Image Generation Based on Text Prompts

Xinxian Fan,Mengye Lyu

Main category: eess.IV

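TL;DR: The study uses the Stable Diffusion model to generate MRI images from text prompts, addressing the high cost, scarcity of rare cases, and privacy concerns of acquiring real MRI datasets, and validates the approach through a classification task.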

Details

Motivation: Address the high cost, lack of rare-case samples, and privacy concerns in acquiring real MRI datasets, and explore generative models for medical AI. Method: Fine-tune the Stable Diffusion model on the 3T fastMRI and 0.3T M4Raw datasets to generate brain T1, T2, and FLAIR images at different magnetic field strengths, and evaluate performance with FID and MS-SSIM. Result: The fine-tuned model performs better in image quality and semantic consistency, and the synthetic images effectively augment training datasets and improve performance on MRI contrast classification. Conclusion: Text-prompted MRI image generation is feasible and can serve as a useful tool for medical AI. Abstract: This study explores the use of text-prompted MRI image generation with the Stable Diffusion (SD) model to address challenges in acquiring real MRI datasets, such as high costs, limited rare case samples, and privacy concerns. The SD model, pre-trained on natural images, was fine-tuned using the 3T fastMRI dataset and the 0.3T M4Raw dataset, with the goal of generating brain T1, T2, and FLAIR images across different magnetic field strengths. The performance of the fine-tuned model was evaluated using quantitative metrics, including Fréchet Inception Distance (FID) and Multi-Scale Structural Similarity (MS-SSIM), showing improvements in image quality and semantic consistency with the text prompts. To further evaluate the model's potential, a simple classification task was carried out using a small 0.35T MRI dataset, demonstrating that the synthetic images generated by the fine-tuned SD model can effectively augment training datasets and improve the performance of MRI contrast classification tasks. Overall, our findings suggest that text-prompted MRI image generation is feasible and can serve as a useful tool for medical AI applications.

[313] DeepMultiConnectome: Deep Multi-Task Prediction of Structural Connectomes Directly from Diffusion MRI Tractography

Marcus J. Vroemen,Yuqian Chen,Yui Lo,Tengfei Xu,Weidong Cai,Fan Zhang,Josien P. W. Pluim,Lauren J. O'Donnell

Main category: eess.IV

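TL;DR: DeepMultiConnectome is a deep learning model that predicts structural connectomes directly from tractography without gray matter parcellation; it supports multiple parcellation schemes, is fast, and produces results highly correlated with those of traditional methods.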

Details

Motivation: Traditional connectome generation is time-consuming and depends on gray matter parcellation, making it hard to apply to large-scale studies. Method: Using a point-cloud-based neural network with multi-task learning, the model classifies streamlines according to two parcellation schemes while sharing a learned representation. Result: Predicted connectomes are highly correlated with traditionally generated ones (r=0.992 and r=0.986) and preserve network properties. Conclusion: DeepMultiConnectome provides a fast, scalable method that supports multiple parcellation schemes and is suitable for large-scale studies. Abstract: Diffusion MRI (dMRI) tractography enables in vivo mapping of brain structural connections, but traditional connectome generation is time-consuming and requires gray matter parcellation, posing challenges for large-scale studies. We introduce DeepMultiConnectome, a deep-learning model that predicts structural connectomes directly from tractography, bypassing the need for gray matter parcellation while supporting multiple parcellation schemes. Using a point-cloud-based neural network with multi-task learning, the model classifies streamlines according to their connected regions across two parcellation schemes, sharing a learned representation. We train and validate DeepMultiConnectome on tractography from the Human Connectome Project Young Adult dataset ($n = 1000$), labeled with 84-region and 164-region gray matter parcellation schemes. DeepMultiConnectome predicts multiple structural connectomes from a whole-brain tractogram containing 3 million streamlines in approximately 40 seconds. DeepMultiConnectome is evaluated by comparing predicted connectomes with traditional connectomes generated using the conventional method of labeling streamlines using a gray matter parcellation. The predicted connectomes are highly correlated with traditionally generated connectomes ($r = 0.992$ for an 84-region scheme; $r = 0.986$ for a 164-region scheme) and largely preserve network properties. A test-retest analysis of DeepMultiConnectome demonstrates reproducibility comparable to traditionally generated connectomes. The predicted connectomes perform similarly to traditionally generated connectomes in predicting age and cognitive function. Overall, DeepMultiConnectome provides a scalable, fast model for generating subject-specific connectomes across multiple parcellation schemes.
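
To make the evaluation concrete, the sketch below shows one way per-streamline region-pair labels could be aggregated into a connectome matrix and compared with a reference matrix by Pearson correlation. The helper names, the use of raw streamline counts, and the restriction to the upper triangle are all assumptions; the abstract does not state how the reported $r$ values were computed.

```python
import numpy as np


def connectome_from_labels(pairs, n_regions: int) -> np.ndarray:
    """Count streamlines per region pair into a symmetric matrix.

    pairs: iterable of (region_a, region_b) indices, one per streamline.
    """
    matrix = np.zeros((n_regions, n_regions), dtype=np.int64)
    for a, b in pairs:
        matrix[a, b] += 1
        if a != b:
            matrix[b, a] += 1  # keep the matrix symmetric
    return matrix


def connectome_correlation(c1: np.ndarray, c2: np.ndarray) -> float:
    """Pearson r over the upper triangle (including the diagonal)."""
    iu = np.triu_indices_from(c1)
    return float(np.corrcoef(c1[iu], c2[iu])[0, 1])


if __name__ == "__main__":
    # Hypothetical labels for a 3-region toy parcellation.
    predicted = connectome_from_labels([(0, 1), (1, 0), (2, 2), (0, 2)], 3)
    reference = connectome_from_labels([(0, 1), (0, 1), (2, 2), (1, 2)], 3)
    print(connectome_correlation(predicted, reference))
```

Running the same aggregation once per parcellation scheme would yield one matrix per scheme, mirroring the multi-task output the abstract describes.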

[314] Plug-and-Play Posterior Sampling for Blind Inverse Problems

Anqi Li,Weijie Gan,Ulugbek S. Kamilov

Main category: eess.IV

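TL;DR: Blind-PnPDM is a new framework for blind inverse problems that needs neither explicit priors nor separate parameter estimation; instead, it performs posterior sampling through an alternating Gaussian denoising scheme.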

Details

Motivation: Conventional methods rely on explicit priors or separate parameter estimation, whereas Blind-PnPDM aims to solve blind inverse problems flexibly by using diffusion models as learned priors. Method: Two diffusion models capture the target image distribution and the measurement operator parameters, respectively, and posterior sampling is carried out through an alternating Gaussian denoising scheme. Result: In blind image deblurring experiments, Blind-PnPDM outperforms existing methods in both quantitative metrics and visual fidelity. Conclusion: Blind-PnPDM achieves strong results by recasting blind inverse problems as a sequence of denoising subproblems and exploiting the expressive power of diffusion models. Abstract: We introduce Blind Plug-and-Play Diffusion Models (Blind-PnPDM) as a novel framework for solving blind inverse problems where both the target image and the measurement operator are unknown. Unlike conventional methods that rely on explicit priors or separate parameter estimation, our approach performs posterior sampling by recasting the problem into an alternating Gaussian denoising scheme. We leverage two diffusion models as learned priors: one to capture the distribution of the target image and another to characterize the parameters of the measurement operator. This PnP integration of diffusion models ensures flexibility and ease of adaptation. Our experiments on blind image deblurring show that Blind-PnPDM outperforms state-of-the-art methods in terms of both quantitative metrics and visual fidelity. Our results highlight the effectiveness of treating blind inverse problems as a sequence of denoising subproblems while harnessing the expressive power of diffusion-based priors.

[315] Synthetic Generation and Latent Projection Denoising of Rim Lesions in Multiple Sclerosis

Alexandra G. Roberts,Ha M. Luu,Mert Şişman,Alexey V. Dimov,Ceren Tozlu,Ilhami Kovanlikaya,Susan A. Gauthier,Thanh D. Nguyen,Yi Wang

Main category: eess.IV

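TL;DR: The paper proposes a method for synthesizing quantitative susceptibility maps to improve the classification of rim lesions in multiple sclerosis, with a generative adversarial network (GAN) extended to produce multi-channel contrasts and probabilistic segmentation maps.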

Details

Motivation: Paramagnetic rim lesions (PRLs) in multiple sclerosis are an emerging biomarker, but because they are rare, classifiers face a class imbalance problem. Method: Generate synthetic quantitative susceptibility maps, use a generative adversarial network (GAN) for the multi-channel extension and probabilistic segmentation maps, and propose a new denoising method to handle ambiguous rim lesions. Result: The synthetic data and the denoising method significantly improve rim lesion detection and better approximate the true distribution. Conclusion: The method detects rim lesions in multiple sclerosis in a clinically interpretable manner, and the code and generated data are released. Abstract: Quantitative susceptibility maps from magnetic resonance images can provide both prognostic and diagnostic information in multiple sclerosis, a neurodegenerative disease characterized by the formation of lesions in white matter brain tissue. In particular, susceptibility maps provide adequate contrast to distinguish between "rim" lesions, surrounded by deposited paramagnetic iron, and "non-rim" lesion types. These paramagnetic rim lesions (PRLs) are an emerging biomarker in multiple sclerosis. Much effort has been devoted to both detection and segmentation of such lesions to monitor longitudinal change. As paramagnetic rim lesions are rare, addressing this problem requires confronting the class imbalance between rim and non-rim lesions. We produce synthetic quantitative susceptibility maps of paramagnetic rim lesions and show that inclusion of such synthetic data improves classifier performance, and we provide a multi-channel extension to generate accompanying contrasts and probabilistic segmentation maps. We exploit the projection capability of our trained generative network to demonstrate a novel denoising approach that allows us to train on ambiguous rim cases and substantially increase the minority class. We show that both synthetic lesion synthesis and our proposed rim lesion label denoising method best approximate the unseen rim lesion distribution and improve detection in a clinically interpretable manner. We release our code and generated data at https://github.com/agr78/PRLx-GAN upon publication.

Table of Contents

cs.CV [Back]

[1] Using Cross-Domain Detection Loss to Infer Multi-Scale Information for Improved Tiny Head Tracking

[2] Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision

[3] HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

[4] MIAS-SAM: Medical Image Anomaly Segmentation without thresholding

[5] Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization

[6] Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory

[7] One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

[8] Fast Trajectory-Independent Model-Based Reconstruction Algorithm for Multi-Dimensional Magnetic Particle Imaging

[9] VidText: Towards Comprehensive Evaluation for Video Text Understanding

[10] IMTS is Worth Time $\times$ Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction

[11] How Animals Dance (When You're Not Looking)

[12] Improving Contrastive Learning for Referring Expression Counting

[13] LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization

[14] CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting

[15] A Probabilistic Jump-Diffusion Framework for Open-World Egocentric Activity Recognition

[16] 4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians

[17] CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

[18] 3DGS Compression with Sparsity-guided Hierarchical Transform Coding

[19] Hierarchical Material Recognition from Local Appearance

[20] cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning

[21] Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

[22] Leveraging Diffusion Models for Synthetic Data Augmentation in Protein Subcellular Localization Classification

[23] Fast Isotropic Median Filtering

[24] ATI: Any Trajectory Instruction for Controllable Video Generation

[25] Toward Memory-Aided World Models: Benchmarking via Spatial Consistency

[26] HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions

[27] Pose-free 3D Gaussian splatting via shape-ray estimation

[28] MOVi: Training-free Text-conditioned Multi-Object Video Generation

[29] Synthetic Document Question Answering in Hungarian

[30] SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model

[31] Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition

[32] Towards Privacy-Preserving Fine-Grained Visual Classification via Hierarchical Learning from Label Proportions

[33] Deep Modeling and Optimization of Medical Image Classification

[34] Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation

[35] SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images

[36] Multi-Sourced Compositional Generalization in Visual Question Answering

[37] Zero-P-to-3: Zero-Shot Partial-View Images to 3D Object

[38] URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration

[39] GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

[40] LeMoRe: Learn More Details for Lightweight Semantic Segmentation

[41] CURVE: CLIP-Utilized Reinforcement Learning for Visual Image Enhancement via Simple Image Processing

[42] EAD: An EEG Adapter for Automated Classification

[43] Identification of Patterns of Cognitive Impairment for Early Detection of Dementia

[44] Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving

[45] TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance

[46] MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation

[47] HMAD: Advancing E2E Driving with Anchored Offset Proposals and Simulation-Supervised Multi-target Scoring

[48] PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents

[49] Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing

[50] Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning

[51] FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing

[52] PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

[53] LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering

[54] Implicit Inversion turns CLIP into a Decoder

[55] RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer

[56] DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

[57] HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image

[58] Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks

[59] Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images

[60] WTEFNet: Real-Time Low-Light Object Detection for Advanced Driver-Assistance Systems

[61] HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers

[62] Navigating the Accuracy-Size Trade-Off with Flexible Model Merging

[63] SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection

[64] UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes

[65] Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

[66] Unsupervised Transcript-assisted Video Summarization and Highlight Detection

[67] LADA: Scalable Label-Specific CLIP Adapter for Continual Learning

[68] Are MLMs Trapped in the Visual Room?

[69] Holistic Large-Scale Scene Reconstruction via Mixed Gaussian Splatting

[70] RSFAKE-1M: A Large-Scale Dataset for Detecting Diffusion-Generated Remote Sensing Forgeries

[71] GenCAD-Self-Repairing: Feasibility Enhancement for 3D CAD Generation

[72] Federated Unsupervised Semantic Segmentation

[73] TRACE: Trajectory-Constrained Concept Erasure in Diffusion Models

[74] Adversarial Semantic and Label Perturbation Attack for Pedestrian Attribute Recognition

[75] Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

[76] Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization

[77] DSAGL: Dual-Stream Attention-Guided Learning for Weakly Supervised Whole Slide Image Classification

[78] Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering