cs.CV

[1] One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

Chenhao Zheng,Jieyu Zhang,Mohammadreza Salehi,Ziqi Gao,Vishnu Iyengar,Norimasa Kobori,Quan Kong,Ranjay Krishna

Main category: cs.CV

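TL;DR: The paper proposes TrajViT, a video tokenization method based on object trajectories that substantially reduces redundant tokens and improves performance over the conventional space-time ViT (ViT3D).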

Details Motivation: Current video tokenization methods use fixed space-time patches, which produces too many tokens and wastes computation, while existing token-reduction strategies degrade performance or work poorly when the camera moves. Method: Proposes TrajViT, a tokenization approach based on panoptic sub-object trajectories; trained with contrastive learning, it extracts object trajectories and converts them into semantic tokens. Result: On video-text retrieval, TrajViT beats ViT3D by 6% top-5 recall with 10x fewer tokens; on VideoQA it improves by 5.2% on average, with 4x faster training and 18x fewer inference FLOPs. Conclusion: TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, and it is robust and scalable. Abstract: Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% top-5 recall on average on the video-text retrieval task with a 10x token reduction. We also show that TrajViT is a stronger model than ViT3D as the video encoder for modern VideoLLMs, obtaining an average of 5.2% performance improvement across 6 VideoQA benchmarks while having 4x faster training time and 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
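The claim that token count should follow scene complexity rather than video duration can be illustrated with a minimal sketch. This is not TrajViT's implementation: it assumes per-patch features and a hypothetical per-patch trajectory ID map (for instance from an off-the-shelf segmenter and tracker), and it uses plain average pooling, which is an assumption made for illustration only.

```python
import numpy as np

def pool_by_trajectory(patch_feats, traj_ids):
    """Average all patch features that share a trajectory ID into one token.

    patch_feats: (T, N, D) features for T frames with N patches per frame.
    traj_ids:    (T, N) integer trajectory ID per patch (hypothetical input).
    Returns a (K, D) array with one token per distinct trajectory, ordered by ID.
    """
    feats = patch_feats.reshape(-1, patch_feats.shape[-1])
    ids = traj_ids.reshape(-1)
    unique_ids, inverse = np.unique(ids, return_inverse=True)
    sums = np.zeros((len(unique_ids), feats.shape[1]))
    np.add.at(sums, inverse, feats)          # scatter-add features per trajectory
    counts = np.bincount(inverse, minlength=len(unique_ids))
    return sums / counts[:, None]

# Toy clip: 8 frames, 16 patches each, two trajectories covering half the frame each.
T, N, D = 8, 16, 4
feats = np.ones((T, N, D))
ids = np.zeros((T, N), dtype=int)
ids[:, N // 2:] = 1
tokens = pool_by_trajectory(feats, ids)
print(tokens.shape)  # (2, 4): 2 tokens instead of T * N = 128 patch tokens
```

Doubling the number of frames in this toy clip leaves the output at two tokens, which is the property the abstract describes.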

[2] Using Cross-Domain Detection Loss to Infer Multi-Scale Information for Improved Tiny Head Tracking

Jisu Kim,Alex Mattingly,Eung-Joo Lee,Benjamin S. Riggan

Main category: cs.CV

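TL;DR: Proposes a framework that balances performance and efficiency to improve tiny head detection and tracking, using a cross-domain detection loss, a multi-scale module, and a small-receptive-field detection mechanism.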

Details Motivation: Current methods are computationally expensive, which causes latency and ties up resources, so the balance between performance and efficiency needs to be optimized. Method: Integrates a cross-domain detection loss, a multi-scale module, and a small-receptive-field detection mechanism to improve detection. Result: MOTA and mAP both improve on the CroHD and CrowdHuman datasets. Conclusion: The framework effectively improves tiny head detection and tracking in crowded scenes. Abstract: Head detection and tracking are essential for downstream tasks, but current methods often require large computational budgets, which increases latency and ties up resources (e.g., processors, memory, and bandwidth). To address this, we propose a framework to enhance tiny head detection and tracking by optimizing the balance between performance and efficiency. Our framework integrates (1) a cross-domain detection loss, (2) a multi-scale module, and (3) a small receptive field detection mechanism. These innovations enhance detection by bridging the gap between large and small detectors, capturing high-frequency details at multiple scales during training, and using filters with small receptive fields to detect tiny heads. Evaluations on the CroHD and CrowdHuman datasets show improved Multiple Object Tracking Accuracy (MOTA) and mean Average Precision (mAP), demonstrating the effectiveness of our approach in crowded scenes.
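For readers unfamiliar with MOTA, the sketch below computes it from per-frame error counts using the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT. The function name and input layout are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy from per-frame counts.

    Each argument is a sequence with one entry per frame:
    missed targets, false detections, identity switches, and ground-truth objects.
    """
    total_gt = sum(num_gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / total_gt

# Two frames with 5 ground-truth heads each, one miss and one false positive overall.
score = mota([1, 0], [0, 1], [0, 0], [5, 5])
```

Because the errors are summed before dividing, MOTA can be negative when a tracker makes more errors than there are ground-truth objects.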

[3] Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision

Ziyue Kang,Weichuan Zhang

Main category: cs.CV

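TL;DR: Proposes a hybrid deep-learning framework that combines an adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head to address data scarcity in rare animal image classification.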

Details Motivation: Rare animal image classification suffers from data scarcity, since many species have only a few labeled samples. Method: Designs an adaptive frequency selection mechanism, uses ViT-B16 and ResNet50 to extract global and local features, merges them with a cross-level fusion strategy, and classifies with a Bayesian linear classifier. Result: On a self-built 50-class wildlife dataset, the method outperforms conventional CNN and fixed-band DCT approaches and reaches the best accuracy under sample scarcity. Conclusion: The proposed adaptive frequency selection mechanism and hybrid framework effectively improve rare animal image classification. Abstract: A major challenge in rare animal image classification is the scarcity of data, as many species usually have only a small number of labeled samples. To address this challenge, we designed a hybrid deep-learning framework comprising a novel adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head. To our knowledge, we are the first to introduce an adaptive frequency-domain selection mechanism that learns optimal low-, mid-, and high-frequency boundaries suited to the subsequent backbones. Our network first captures image frequency-domain cues via this adaptive DCT partitioning. The adaptively filtered frequency features are then fed into ViT-B16 to model global contextual relationships, while ResNet50 concurrently extracts local, multi-scale spatial representations from the original image. A cross-level fusion strategy seamlessly integrates these frequency- and spatial-domain embeddings, and the fused features are passed through a Bayesian linear classifier to output the final category predictions. On our self-built 50-class wildlife dataset, this approach outperforms conventional CNN and fixed-band DCT pipelines, achieving state-of-the-art accuracy under extreme sample scarcity.

[4] How Animals Dance (When You're Not Looking)

Xiaojuan Wang,Aleksander Holynski,Brian Curless,Ira Kemelmacher,Steve Seitz

Main category: cs.CV

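TL;DR: Proposes a keyframe-based framework for generating music-synchronized, choreography-aware animal dance videos. Keyframes are generated via text-to-image prompting or GPT-4o, dance synthesis is formulated as a graph optimization problem, and in-between frames are produced by a video diffusion model.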

Details Motivation: Addresses the challenge of generating high-quality, music-synchronized animal dance videos from a small number of keyframes while capturing the symmetry found in dance. Method: Generates keyframes via text-to-image prompting or GPT-4o, formulates dance synthesis as a graph optimization problem, automatically estimates the choreography beat pattern, and synthesizes in-between frames with a video diffusion model. Result: With as few as six input keyframes, the method produces animal dance videos of up to 30 seconds across many animals and music tracks. Conclusion: The method efficiently generates music-synchronized animal dance videos and shows the potential of keyframes and optimization for dance synthesis. Abstract: We present a keyframe-based framework for generating music-synchronized, choreography-aware animal dance videos. Starting from a few keyframes representing distinct animal poses -- generated via text-to-image prompting or GPT-4o -- we formulate dance synthesis as a graph optimization problem: find the optimal keyframe structure that satisfies a specified choreography pattern of beats, which can be automatically estimated from a reference dance video. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using a video diffusion model. With as few as six input keyframes, our method can produce up to 30-second dance videos across a wide range of animals and music tracks.

[5] HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

Qi Cai,Jingwen Chen,Yang Chen,Yehao Li,Fuchen Long,Yingwei Pan,Zhaofan Qiu,Yiheng Zhang,Fengbin Gao,Peihan Xu,Yimeng Wang,Kai Yu,Wenxuan Chen,Ziwei Feng,Zijian Gong,Jianzhuang Pan,Yi Peng,Rui Tian,Siyu Wang,Bo Zhao,Ting Yao,Tao Mei

Main category: cs.CV

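TL;DR: HiDream-I1 is an open-source image generative foundation model with 17B parameters that uses a sparse Diffusion Transformer (DiT) with a dynamic MoE architecture to generate high-quality images within seconds, and it comes in three variants. In addition, HiDream-E1 supports instruction-based image editing, and together they form the interactive image agent HiDream-A1.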

Details Motivation: Addresses the added computational complexity and latency that come with quality improvements in image generative foundation models. Method: Uses a sparse DiT structure with a dual-stream decoupled design and a dynamic MoE architecture, supporting multi-modal interaction and efficient image generation. Result: Achieves high-quality image generation and instruction-based image editing, forming a comprehensive image agent. Conclusion: The HiDream model family provides efficient, flexible tools for multi-modal AIGC research, with code and model weights open-sourced. Abstract: Recent advancements in image generative foundation models have prioritized quality improvements but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is constructed with a new sparse Diffusion Transformer (DiT) structure. Specifically, it starts with a dual-stream decoupled design of sparse DiT with dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders are first involved to independently process image and text tokens. Then, a single-stream sparse DiT structure with dynamic MoE architecture is adopted to trigger multi-modal interaction for image generation in a cost-efficient manner. To support flexible accessibility with varied model capabilities, we provide HiDream-I1 in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Furthermore, we go beyond the typical text-to-image generation and remould HiDream-I1 with additional image conditions to perform precise, instruction-based editing on given images, yielding a new instruction-based image editing model, namely HiDream-E1. Ultimately, by integrating text-to-image generation and instruction-based image editing, HiDream-I1 evolves to form a comprehensive image agent (HiDream-A1) capable of fully interactive image creation and refinement. To accelerate multi-modal AIGC research, we have open-sourced all the code and model weights of HiDream-I1-Full, HiDream-I1-Dev, HiDream-I1-Fast, and HiDream-E1 through our project websites: https://github.com/HiDream-ai/HiDream-I1 and https://github.com/HiDream-ai/HiDream-E1. All features can be directly experienced via https://vivago.ai/studio.

[6] LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization

Ronghuan Wu,Wanchao Su,Jing Liao

Main category: cs.CV

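TL;DR: LayerPeeler is a novel layer-wise image vectorization method that uses a progressive simplification strategy to overcome existing tools' weaknesses on occluded regions, producing vector graphics with complete paths and coherent layer structures.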

Details Motivation: Existing image vectorization tools handle occluded regions poorly, producing incomplete or fragmented shapes that limit editability. Method: Uses an autoregressive peeling strategy: a vision-language model builds a layer graph that captures occlusion relationships, a finetuned image diffusion model removes the identified layers, and localized attention control keeps the removal precise. Result: LayerPeeler significantly outperforms existing techniques in path semantics, geometric regularity, and visual fidelity. Conclusion: LayerPeeler solves the problem of vectorizing occluded regions and offers an effective way to generate high-quality vector graphics. Abstract: Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored rule-based and data-driven layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler's success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity.

[7] MIAS-SAM: Medical Image Anomaly Segmentation without thresholding

Marco Colussi,Dragan Ahmetovic,Sergio Mascetti

Main category: cs.CV

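TL;DR: MIAS-SAM is a new method for segmenting anomalous regions in medical images; it uses a patch-based memory bank and the SAM encoder to extract features and achieves accurate segmentation without a manually set threshold.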

Details Motivation: Segmenting anomalous regions in medical images usually requires a manually set threshold; MIAS-SAM aims to remove that requirement and make segmentation more accurate and more automated. Method: Uses the SAM encoder to extract features from normal data and stores them in a memory bank; at inference time, features are compared against the bank to produce an anomaly map, and the center of gravity of that map prompts the SAM decoder to produce the segmentation. Result: Experiments on three public datasets with different modalities (Brain MRI, Liver CT, and Retina OCT) show accurate anomaly segmentation as measured by DICE score. Conclusion: MIAS-SAM achieves accurate anomaly segmentation without a manually set threshold, offering an efficient and automated solution for medical image analysis. Abstract: This paper presents MIAS-SAM, a novel approach for the segmentation of anomalous regions in medical images. MIAS-SAM uses a patch-based memory bank to store relevant image features, which are extracted from normal data using the SAM encoder. At inference time, the embedding patches extracted from the SAM encoder are compared with those in the memory bank to obtain the anomaly map. Finally, MIAS-SAM computes the center of gravity of the anomaly map to prompt the SAM decoder, obtaining an accurate segmentation from the previously extracted features. Unlike prior work, MIAS-SAM does not require defining a threshold value to obtain the segmentation from the anomaly map. Experimental results on three publicly available datasets, each with a different imaging modality (Brain MRI, Liver CT, and Retina OCT), show accurate anomaly segmentation capabilities measured using DICE score. The code is available at: https://github.com/warpcut/MIAS-SAM
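The center-of-gravity step can be sketched as an intensity-weighted mean of pixel coordinates. This is a minimal illustration rather than the code in the MIAS-SAM repository; the function name is hypothetical, and treating the result as a single (x, y) point prompt for the decoder is an assumption about how the prompt is passed.

```python
import numpy as np

def anomaly_center_of_gravity(anomaly_map):
    """Return the (x, y) intensity-weighted centroid of a 2D anomaly map."""
    a = np.asarray(anomaly_map, dtype=float)
    total = a.sum()
    if total <= 0:
        raise ValueError("anomaly map has no positive mass")
    rows, cols = np.indices(a.shape)   # per-pixel row and column coordinates
    cy = (rows * a).sum() / total
    cx = (cols * a).sum() / total
    return cx, cy

# Two equally anomalous pixels in column 3, rows 1 and 3.
amap = np.zeros((5, 5))
amap[1, 3] = 1.0
amap[3, 3] = 1.0
point = anomaly_center_of_gravity(amap)  # midpoint between the two pixels
```

No threshold appears anywhere in this computation, which matches the abstract's claim that the anomaly map is used directly. Note that for a map with several separated anomalous regions, a single centroid can land between them.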

[8] Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization

Yuxi Zhang,Yueting Li,Xinyu Du,Sibo Wang

Main category: cs.CV

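TL;DR: Rhet2Pix is a framework for generating images from rhetorical language; using multi-step policy optimization and a two-layer MDP diffusion module, it significantly outperforms existing models.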

Details Motivation: Current text-to-image models struggle with the implicit meaning of rhetorical language, so the generated images lean toward the literal wording rather than the intended semantics. Method: Proposes the Rhet2Pix framework, which uses multi-step policy optimization and a two-layer MDP diffusion module to incrementally elaborate sub-sentences and optimize image-generation actions. Result: Rhet2Pix outperforms SOTA models such as GPT-4o and Grok-3 in both qualitative and quantitative evaluations. Conclusion: Rhet2Pix effectively addresses the challenge of generating images from rhetorical language and offers a new direction for the field. Abstract: Generating images from rhetorical language remains a critical challenge for text-to-image models. Even state-of-the-art (SOTA) multimodal large language models (MLLMs) fail to generate images based on the hidden meaning inherent in rhetorical language--despite such content being readily mappable to visual representations by humans. A key limitation is that current models emphasize object-level word embedding alignment, causing metaphorical expressions to steer image generation towards their literal visuals and overlook the intended semantic meaning. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem, incorporating a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes corresponding image-generation actions, constructing semantically richer visuals. In the inner layer, Rhet2Pix mitigates reward sparsity during image generation by discounting the final reward and optimizing every adjacent action pair along the diffusion denoising trajectory. Extensive experiments demonstrate the effectiveness of Rhet2Pix in rhetorical text-to-image generation. Our model outperforms SOTA MLLMs such as GPT-4o and Grok-3, as well as leading academic baselines, across both qualitative and quantitative evaluations. The code and dataset used in this work are publicly available.

[9] Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory

Srishti Yadav,Lauren Tilton,Maria Antoniak,Taylor Arnold,Jiaang Li,Siddhesh Milind Pawar,Antonia Karamolegkou,Stella Frank,Zhaochong An,Negar Rostamzadeh,Daniel Hershcovich,Serge Belongie,Ekaterina Shutova

Main category: cs.CV

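TL;DR: Modern vision-language models (VLMs) perform poorly on cultural competency evaluations and benchmarks; a systematic framework is needed to analyze the cultural dimensions present in images.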

Details Motivation: Given the diversity of VLM applications, understanding how these models encode cultural nuances has become important, yet a systematic analysis framework is still missing. Method: Draws on foundational methodologies from visual culture studies (cultural studies, semiotics, and visual studies) to propose five frameworks corresponding to cultural dimensions. Result: Presents a systematic set of frameworks for a more complete analysis of the cultural competencies of VLMs. Conclusion: Methodologies from visual culture studies are essential for cultural analysis of VLMs, and the proposed frameworks give direction for future research. Abstract: Modern vision-language models (VLMs) often fail at cultural competency evaluations and benchmarks. Given the diversity of applications built upon VLMs, there is renewed interest in understanding how they encode cultural nuances. While individual aspects of this problem have been studied, we still lack a comprehensive framework for systematically identifying and annotating the nuanced cultural dimensions present in images for VLMs. This position paper argues that foundational methodologies from visual culture studies (cultural studies, semiotics, and visual studies) are necessary for cultural analysis of images. Building upon this review, we propose a set of five frameworks, corresponding to cultural dimensions, that must be considered for a more complete analysis of the cultural competencies of VLMs.

[10] Fast Trajectory-Independent Model-Based Reconstruction Algorithm for Multi-Dimensional Magnetic Particle Imaging

Vladyslav Gapyak,Thomas März,Andreas Weinmann

Main category: cs.CV

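TL;DR: Proposes a trajectory-independent model-based reconstruction algorithm, combined with a zero-shot PnP method, that enables flexible reconstruction of 2D MPI data.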

Details Motivation: Traditional MPI reconstruction relies on time-consuming calibration or trajectory-specific simulation, which limits flexibility. Method: Uses a trajectory-independent model-based reconstruction algorithm together with a zero-shot PnP method that leverages a denoiser trained on natural images. Result: Shows strong reconstruction capability on both a public dataset and custom data. Conclusion: Lays the groundwork for general-purpose, flexible model-based MPI reconstruction. Abstract: Magnetic Particle Imaging (MPI) is a promising tomographic technique for visualizing the spatio-temporal distribution of superparamagnetic nanoparticles, with applications ranging from cancer detection to real-time cardiovascular monitoring. Traditional MPI reconstruction relies on either time-consuming calibration (measured system matrix) or model-based simulation of the forward operator. Recent developments have shown the applicability of Chebyshev polynomials to multi-dimensional Lissajous Field-Free Point (FFP) scans. This method is bound to the particular choice of sinusoidal scanning trajectories. In this paper, we present the first reconstruction on real 2D MPI data with a trajectory-independent model-based MPI reconstruction algorithm. We further develop the zero-shot Plug-and-Play (PnP) algorithm of the authors -- with automatic noise level estimation -- to address the present deconvolution problem, leveraging a state-of-the-art denoiser trained on natural images without retraining on MPI-specific data. We evaluate our method on the publicly available 2D FFP MPI dataset "MPIdata: Equilibrium Model with Anisotropy", featuring scans of six phantoms acquired using a Bruker preclinical scanner. Moreover, we show reconstruction performed on custom data on a 2D scanner with additional high-frequency excitation field and partial data. Our results demonstrate strong reconstruction capabilities across different scanning scenarios -- setting a precedent for general-purpose, flexible model-based MPI reconstruction.

[11] VidText: Towards Comprehensive Evaluation for Video Text Understanding

Zhoufaran Yang,Yan Shu,Zhifei Yang,Yan Zhang,Yu Li,Keyang Lu,Gangyan Zeng,Shaohui Liu,Yu Zhou,Nicu Sebe

Main category: cs.CV

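TL;DR: VidText is a new benchmark for video text understanding that fills gaps left by existing video understanding and OCR benchmarks. It supports multilingual content and diverse scenarios and introduces a hierarchical evaluation framework and paired perception-reasoning tasks. Experiments show that current models still have substantial room for improvement.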

Details Motivation: Existing video understanding benchmarks overlook textual information, and OCR benchmarks are limited to static images, so neither captures the interaction between text and dynamic visual content. Method: Proposes the VidText benchmark, which covers multilingual content and diverse scenarios and introduces a hierarchical evaluation framework (video-level, clip-level, and instance-level tasks) along with paired perception-reasoning tasks. Result: Experiments show that current models perform poorly on most tasks and leave considerable room for improvement. Model-intrinsic factors (such as input resolution and OCR capability) and external factors (auxiliary information and reasoning strategies) both have a notable impact. Conclusion: VidText fills a gap in video text understanding benchmarks and provides a foundation for future research on multimodal reasoning in dynamic environments. Abstract: Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception-reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.

[12] IMTS is Worth Time $\times$ Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction

Zhangyi Hu,Jiemin Wu,Hua Xu,Mingqian Liao,Ninghui Feng,Bo Gao,Songning Lai,Yutao Yue

Main category: cs.CV

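TL;DR: The VIMTS framework applies a visual MAE to irregular multivariate time series (IMTS), using cross-channel dependencies and self-supervised learning to improve forecasting performance.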

Details Motivation: IMTS forecasting is difficult because multi-channel signals are unaligned and much of the data is missing, and existing methods struggle to capture reliable temporal patterns. Visual MAE handles sparse multi-channel data well but needs to be adapted to IMTS. Method: VIMTS splits IMTS into feature patches at equal intervals, completes missing values using cross-channel dependencies, reconstructs patches with a visual MAE, and generates predictions with a coarse-to-fine technique. Result: Experiments show that VIMTS achieves superior performance and few-shot capability, advancing the use of visual foundation models in more general time series tasks. Conclusion: VIMTS successfully adapts visual MAE to IMTS forecasting and offers a new approach to handling irregular time series. Abstract: Irregular Multivariate Time Series (IMTS) forecasting is challenging due to the unaligned nature of multi-channel signals and the prevalence of extensive missing data. Existing methods struggle to capture reliable temporal patterns from such data due to significant missing values. While pre-trained foundation models show potential for addressing these challenges, they are typically designed for Regularly Sampled Time Series (RTS). Motivated by the visual Masked AutoEncoder's (MAE) powerful capability for modeling sparse multi-channel information and its success in RTS forecasting, we propose VIMTS, a framework adapting Visual MAE for IMTS forecasting. To mitigate the effect of missing values, VIMTS first processes IMTS along the timeline into feature patches at equal intervals. These patches are then complemented using learned cross-channel dependencies. Then it leverages visual MAE's capability in handling sparse multi-channel data for patch reconstruction, followed by a coarse-to-fine technique to generate precise predictions from focused contexts. In addition, we integrate self-supervised learning for improved IMTS modeling by adapting the visual MAE to IMTS data. Extensive experiments demonstrate VIMTS's superior performance and few-shot capability, advancing the application of visual foundation models in more general time series tasks. Our code is available at https://github.com/WHU-HZY/VIMTS.
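The first step, turning irregular observations into equal-interval patches, can be sketched as a time-binning operation that yields a channel-by-patch grid plus a mask of empty cells. This is an illustrative assumption about the data layout, not the VIMTS code: the function name is hypothetical, and summarizing each patch by its mean stands in for whatever learned patch features the paper uses.

```python
import numpy as np

def to_equal_interval_patches(times, values, t_start, t_end, num_patches):
    """Bin irregular per-channel observations into equal-width time patches.

    times, values: one 1D sequence per channel (channels may differ in length).
    Returns (means, mask), both of shape (num_channels, num_patches);
    mask is False where a patch holds no observation for that channel.
    """
    num_channels = len(times)
    means = np.zeros((num_channels, num_patches))
    mask = np.zeros((num_channels, num_patches), dtype=bool)
    width = (t_end - t_start) / num_patches
    for c in range(num_channels):
        t = np.asarray(times[c], dtype=float)
        v = np.asarray(values[c], dtype=float)
        keep = (t >= t_start) & (t < t_end)
        idx = ((t[keep] - t_start) // width).astype(int)
        idx = np.minimum(idx, num_patches - 1)  # guard against float rounding
        sums = np.bincount(idx, weights=v[keep], minlength=num_patches)
        counts = np.bincount(idx, minlength=num_patches)
        mask[c] = counts > 0
        means[c, mask[c]] = sums[mask[c]] / counts[mask[c]]
    return means, mask

# Two channels sampled at unaligned times over [0, 3), split into 3 patches.
means, mask = to_equal_interval_patches(
    times=[[0.1, 0.2, 2.5], [1.5]],
    values=[[1.0, 3.0, 5.0], [7.0]],
    t_start=0.0, t_end=3.0, num_patches=3,
)
```

The resulting grid resembles a sparse image, with the mask marking the cells that a later stage would need to fill, for example from other channels as the abstract describes.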

[13] Improving Contrastive Learning for Referring Expression Counting

Kostas Triaridis,Panagiotis Kaliosis,E-Ro Nguyen,Jingyi Xu,Hieu Le,Dimitris Samaras

Main category: cs.CV

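TL;DR: The paper proposes C-REX, a contrastive learning framework for Referring Expression Counting (REC) that strengthens discriminative representation learning and significantly improves performance.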

Details Motivation: Existing methods struggle to distinguish visually similar objects that correspond to different referring expressions, so a more effective solution is needed. Method: C-REX builds on supervised contrastive learning and operates entirely in image space, avoiding the misalignment issues of image-text contrastive learning and providing a larger pool of negative samples. Result: C-REX achieves state-of-the-art results on REC, improving MAE by more than 22% and RMSE by more than 10%, and also performs strongly on class-agnostic counting. Conclusion: C-REX addresses a key problem in REC and is general enough to apply to other similar tasks. Abstract: Object counting has progressed from class-specific models, which count only known categories, to class-agnostic models that generalize to unseen categories. The next challenge is Referring Expression Counting (REC), where the goal is to count objects based on fine-grained attributes and contextual differences. Existing methods struggle with distinguishing visually similar objects that belong to the same category but correspond to different referring expressions. To address this, we propose C-REX, a novel contrastive learning framework, based on supervised contrastive learning, designed to enhance discriminative representation learning. Unlike prior works, C-REX operates entirely within the image space, avoiding the misalignment issues of image-text contrastive learning, thus providing a more stable contrastive signal. It also guarantees a significantly larger pool of negative samples, leading to improved robustness in the learned representations. Moreover, we showcase that our framework is versatile and generic enough to be applied to other similar tasks like class-agnostic counting. To support our approach, we analyze the key components of SOTA detection-based models and identify that detecting object centroids instead of bounding boxes is the key common factor behind their success in counting tasks. We use this insight to design a simple yet effective detection-based baseline to build upon. Our experiments show that C-REX achieves state-of-the-art results in REC, outperforming previous methods by more than 22% in MAE and more than 10% in RMSE, while also demonstrating strong performance in class-agnostic counting. Code is available at https://github.com/cvlab-stonybrook/c-rex.
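Since C-REX is described as building on supervised contrastive learning, a generic version of that loss may help as background. The sketch below follows the widely used supervised contrastive formulation over a batch of embeddings with labels; it is not the C-REX loss itself, and the assumption that each embedding corresponds to an image-space feature labeled by its referring expression is made here purely for illustration.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over L2-normalized embeddings.

    For each anchor, samples with the same label are positives and all
    other samples in the batch act as negatives.
    """
    z = np.asarray(embeddings, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    labels = np.asarray(labels)
    n = len(labels)
    sim = z @ z.T / temperature
    sim = sim - sim.max(axis=1, keepdims=True)        # numerical stability
    not_self = ~np.eye(n, dtype=bool)
    exp_sim = np.exp(sim) * not_self                  # exclude the anchor itself
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_counts = positives.sum(axis=1)
    valid = pos_counts > 0                            # anchors with at least one positive
    if not valid.any():
        raise ValueError("no anchor has a positive sample in this batch")
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[valid] / pos_counts[valid]
    return float(-mean_log_prob_pos.mean())

# Two tight clusters: labels that match the clusters give a much lower loss.
feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
loss_matched = supcon_loss(feats, [0, 0, 1, 1])
loss_mismatched = supcon_loss(feats, [0, 1, 0, 1])
```

Every non-positive sample in the batch contributes to the denominator, which is one way to read the abstract's point about a larger pool of negatives.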

[14] CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting

Kornel Howil,Joanna Waczyńska,Piotr Borycki,Tadeusz Dziarmaga,Marcin Mazur,Przemysław Spurek

Main category: cs.CV

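TL;DR: CLIPGaussians is a unified style transfer framework that supports multiple modalities (2D images, videos, 3D objects, and 4D scenes); it operates directly on Gaussian primitives and needs neither large generative models nor retraining from scratch.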

Details Motivation: Gaussian Splatting (GS) renders 3D scenes efficiently, but style transfer on GS representations, especially for complex styles, remains challenging. Method: CLIPGaussians operates directly on Gaussian primitives, integrates into existing GS pipelines as a plug-in module, and jointly optimizes color and geometry. Result: Achieves style fidelity and consistency in 3D and 4D scenes and temporal coherence in videos, without changing the model size. Conclusion: CLIPGaussians is a universal and efficient solution for multimodal style transfer. Abstract: Gaussian Splatting (GS) has recently emerged as an efficient representation for rendering 3D scenes from 2D images and has been extended to images, videos, and dynamic 4D content. However, applying style transfer to GS-based representations, especially beyond simple color changes, remains challenging. In this work, we introduce CLIPGaussians, the first unified style transfer framework that supports text- and image-guided stylization across multiple modalities: 2D images, videos, 3D objects, and 4D scenes. Our method operates directly on Gaussian primitives and integrates into existing GS pipelines as a plug-in module, without requiring large generative models or retraining from scratch. The CLIPGaussians approach enables joint optimization of color and geometry in 3D and 4D settings, and achieves temporal coherence in videos, while preserving the model size. We demonstrate superior style fidelity and consistency across all tasks, validating CLIPGaussians as a universal and efficient solution for multimodal style transfer.

[15] A Probabilistic Jump-Diffusion Framework for Open-World Egocentric Activity Recognition

Sanjoy Kundu,Shanmukha Vellamcheti,Sathyanarayanan N. Aakur

Main category: cs.CV

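TL;DR: The ProbRes framework uses probabilistic residual search based on jump-diffusion to tackle open-world egocentric activity recognition efficiently, combining commonsense priors with vision-language models to reach strong performance.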

Details Motivation: Open-world egocentric activity recognition is unconstrained, so models must infer unseen activities from a vast, partially observed search space, which calls for an efficient method. Method: ProbRes integrates commonsense priors to build a semantically coherent search space, uses vision-language models to adaptively refine predictions, and applies a stochastic search mechanism to locate high-likelihood activity labels efficiently. Result: Achieves state-of-the-art performance on multiple benchmark datasets (GTEA Gaze and others) and establishes a clear taxonomy for open-world recognition. Conclusion: ProbRes is a methodological advance for egocentric activity understanding and clarifies the challenges and solutions in open-world recognition. Abstract: Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to efficiently locate high-likelihood activity labels while minimizing exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0--L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding.

[16] 4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians

Hidenobu Matsuki,Gwangbin Bae,Andrew J. Davison

Main category: cs.CV

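TL;DR: Proposes the first 4D tracking and mapping method, which jointly optimizes camera localization and non-rigid surface reconstruction via differentiable rendering to address the high-dimensional optimization problem of 4D-SLAM.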

Details Motivation: Natural environments contain complex non-rigid motion, yet 4D-SLAM remains underexplored and lacks reliable evaluation protocols. Method: A SLAM method based on Gaussian surface primitives, combined with an MLP warp field, a camera pose estimation technique, and surface regularization terms. Result: Achieves accurate surface reconstruction and releases a novel open synthetic dataset to support evaluation. Conclusion: Provides a new method and evaluation protocols for modern 4D-SLAM research. Abstract: We propose the first 4D tracking and mapping method that jointly performs camera localization and non-rigid surface reconstruction via differentiable rendering. Our approach captures 4D scenes from an online stream of color images with depth measurements or predictions by jointly optimizing scene geometry, appearance, dynamics, and camera ego-motion. Although natural environments exhibit complex non-rigid motions, 4D-SLAM remains relatively underexplored due to its inherent challenges; even with 2.5D signals, the problem is ill-posed because of the high dimensionality of the optimization space. To overcome these challenges, we first introduce a SLAM method based on Gaussian surface primitives that leverages depth signals more effectively than 3D Gaussians, thereby achieving accurate surface reconstruction. To further model non-rigid deformations, we employ a warp-field represented by a multi-layer perceptron (MLP) and introduce a novel camera pose estimation technique along with surface regularization terms that facilitate spatio-temporal reconstruction. In addition to these algorithmic challenges, a significant hurdle in 4D-SLAM research is the lack of reliable ground truth and evaluation protocols, primarily due to the difficulty of 4D capture using commodity sensors. To address this, we present a novel open synthetic dataset of everyday objects with diverse motions, leveraging large-scale object models and animation modeling. In summary, we open up modern 4D-SLAM research by introducing a novel method and evaluation protocols grounded in modern vision and rendering techniques.

[17] CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

Junbo Yin,Chao Zha,Wenjia He,Chencheng Xu,Xin Gao

Main category: cs.CV

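TL;DR: CFP-Gen is a novel diffusion language model for combinatorial functional protein generation that designs proteins by integrating multimodal conditions (functional, sequence, and structural constraints).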

Details Motivation: Existing PLMs generate protein sequences from a single-modality condition and struggle to satisfy multiple constraints across modalities at once. Method: Introduces an AGFM module that dynamically adjusts the protein feature distribution and an RCFE module that captures residue-wise interactions, and integrates 3D structure encoders to impose geometric constraints. Result: CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins and achieves a high success rate when designing multifunctional proteins. Conclusion: CFP-Gen offers an efficient way to integrate multimodal constraints in protein design. Abstract: Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data are available at https://github.com/yinjunbo/cfpgen.

[18] 3DGS Compression with Sparsity-guided Hierarchical Transform Coding

Hao Xu,Xiaolin Wu,Xi Zhang

Main category: cs.CV

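TL;DR: SHTC is an end-to-end optimized transform coding framework for 3DGS compression; by jointly optimizing the 3DGS, the transforms, and a lightweight context model, it significantly improves rate-distortion performance.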

Details Motivation: 3DGS因其快速高质量渲染而流行,但内存占用大,传输和存储开销高。现有神经压缩方法未采用端到端优化的分析-合成变换,导致性能不足。 Method: SHTC框架包括使用KLT进行数据去相关的基础层和稀疏编码的增强层,增强层通过线性变换和ISTA重建残差。 Result: SHTC显著提升了率失真性能,同时参数和计算开销最小。 Conclusion: SHTC是首个端到端优化的3DGS压缩框架,通过可解释的设计和联合优化,实现了高性能和低开销。 Abstract: 3D Gaussian Splatting (3DGS) has gained popularity for its fast and high-quality rendering, but it has a very large memory footprint incurring high transmission and storage overhead. Recently, some neural compression methods, such as Scaffold-GS, were proposed for 3DGS but they did not adopt the approach of end-to-end optimized analysis-synthesis transforms which has been proven highly effective in neural signal compression. Without an appropriate analysis transform, signal correlations cannot be removed by sparse representation. Without such transforms the only way to remove signal redundancies is through entropy coding driven by a complex and expensive context modeling, which results in slower speed and suboptimal rate-distortion (R-D) performance. To overcome this weakness, we propose Sparsity-guided Hierarchical Transform Coding (SHTC), the first end-to-end optimized transform coding framework for 3DGS compression. SHTC jointly optimizes the 3DGS, transforms and a lightweight context model. This joint optimization enables the transform to produce representations that approach the best R-D performance possible. The SHTC framework consists of a base layer using KLT for data decorrelation, and a sparsity-coded enhancement layer that compresses the KLT residuals to refine the representation. The enhancement encoder learns a linear transform to project high-dimensional inputs into a low-dimensional space, while the decoder unfolds the Iterative Shrinkage-Thresholding Algorithm (ISTA) to reconstruct the residuals. All components are designed to be interpretable, allowing the incorporation of signal priors and fewer parameters than black-box transforms. 
This novel design significantly improves R-D performance with minimal additional parameters and computational overhead.
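
The abstract says the enhancement decoder unfolds ISTA. As a rough illustration of the algorithm being unfolded, here is a minimal sketch of plain ISTA for the sparse-coding objective 0.5*||y - D z||^2 + lam*||z||_1. The dictionary `D`, the step size, and the threshold are fixed, hypothetical values chosen for the example; an unfolded decoder would typically treat each iteration as a layer with learned parameters, and nothing here is taken from the SHTC implementation.

```python
def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal step of the L1 norm)."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def ista(D, y, lam=0.1, step=0.5, n_iters=50):
    """Run a fixed number of ISTA iterations for 0.5*||y - D z||^2 + lam*||z||_1.

    `step` should not exceed 1 / (largest eigenvalue of D^T D) for convergence.
    """
    Dt = transpose(D)
    z = [0.0] * len(D[0])
    for _ in range(n_iters):
        # Gradient step on the quadratic data term.
        residual = [yi - ri for yi, ri in zip(y, matvec(D, z))]
        moved = [zi + step * gi for zi, gi in zip(z, matvec(Dt, residual))]
        # Shrinkage step that promotes sparsity.
        z = soft_threshold(moved, step * lam)
    return z


# Toy usage with an identity dictionary (an assumption for illustration only).
code = ista([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.05])
```

With an identity dictionary the solution is simply the input shrunk by `lam`, so the small second entry is driven to zero, which is the sparsifying behaviour the enhancement layer relies on.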

[19] Hierarchical Material Recognition from Local Appearance

Matthew Beveridge,Shree K. Nayar

Main category: cs.CV

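TL;DR: Proposes a taxonomy of materials organized by physical traits, builds a diverse dataset, and uses graph attention networks for hierarchical material recognition, achieving strong performance and generalization.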

Details Motivation: To provide a materials taxonomy based on physical traits for vision applications, and to address the challenges of material recognition in real-world scenes. Method: Uses graph attention networks that combine the taxonomic relationships with depth-map data for hierarchical material recognition. Result: The model achieves state-of-the-art performance, generalizes to adverse imaging conditions, and supports few-shot learning. Conclusion: The method performs well in material recognition and has potential for practical applications. Abstract: We introduce a taxonomy of materials for hierarchical recognition from local appearance. Our taxonomy is motivated by vision applications and is arranged according to the physical traits of materials. We contribute a diverse, in-the-wild dataset with images and depth maps of the taxonomy classes. Utilizing the taxonomy and dataset, we present a method for hierarchical material recognition based on graph attention networks. Our model leverages the taxonomic proximity between classes and achieves state-of-the-art performance. We demonstrate the model's potential to generalize to adverse, real-world imaging conditions, and that novel views rendered using the depth maps can enhance this capability. Finally, we show the model's capacity to rapidly learn new materials in a few-shot learning setting.

[20] cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning

Maksim Kolodiazhnyi,Denis Tarasov,Dmitrii Zhemchuzhnikov,Alexander Nikulin,Ilya Zisman,Anna Vorontsova,Anton Konushin,Vladislav Kurenkov,Danila Rukhovich

Main category: cs.CV

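TL;DR: Proposes a multi-modal CAD reconstruction model that handles point cloud, image, and text inputs and is optimized with supervised fine-tuning followed by reinforcement learning, performing strongly on the DeepCAD benchmark.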

Details Motivation: Existing CAD reconstruction methods typically support only a single input modality, which limits generalizability and robustness; the work aims to improve performance through multi-modal inputs and reinforcement learning. Method: Two-stage training: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning with online feedback using the GRPO algorithm. Result: The SFT model outperforms single-modal methods on the DeepCAD benchmark, and RL fine-tuning further sets a new state of the art on several datasets. Conclusion: Combining multi-modal inputs with reinforcement learning significantly improves the performance and generality of CAD reconstruction. Abstract: Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLM), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback, obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. In the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets a new state of the art on three challenging datasets, including a real-world one.
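
The RL stage relies on GRPO, whose core idea is to score each sampled output relative to the other outputs drawn for the same input rather than against a learned value function. The sketch below shows only that group-relative normalization. The reward values are made up, the function name is hypothetical, and the choice of population standard deviation is an assumption, since implementations differ on this detail; it is not taken from the cadrille code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four CAD programs sampled for the same input,
# e.g. 1.0 if the program executes and matches the target shape well, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Samples that beat the group average receive positive advantages and the rest negative ones, so the policy update needs only a programmatic reward, which fits the abstract's description of feedback obtained programmatically.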

[21] Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Ruichen Chen,Keith G. Mills,Liyao Jiang,Chao Gao,Di Niu

Main category: cs.CV

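TL;DR: Proposes Re-ttention, which exploits the temporal redundancy of diffusion models to achieve extremely sparse attention, substantially reducing compute cost while preserving visual generation quality.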

Details Motivation: Existing sparse attention techniques fail to preserve visual quality at extremely high sparsity levels and may introduce extra compute overhead, so a new approach is needed. Method: Re-ttention reshapes attention scores based on the history of prior softmax distributions, overcoming the probabilistic normalization shift in the attention mechanism and enabling efficient sparse attention. Result: Experiments show that Re-ttention needs as few as 3.1% of the tokens during inference while outperforming existing methods, and attains a 45% end-to-end and 92% self-attention latency reduction on an H100 GPU. Conclusion: Re-ttention is an efficient, high-quality sparse attention method for visual generation models that significantly improves computational efficiency. Abstract: Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference. Further, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code available online here: https://github.com/cccrrrccc/Re-ttention
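
To make the "probabilistic normalization shift" concrete, the sketch below compares a full softmax with a softmax computed over a kept subset of tokens. Dropping tokens shrinks the denominator, so every kept token's probability is inflated; multiplying by the share of mass the kept tokens held under full attention undoes the inflation. The scores, the kept indices, and the idea of reusing a cached ratio `rho` from an earlier denoising step are assumptions made for illustration. This is not the Re-ttention algorithm itself, whose details are in the paper and repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_softmax(scores, keep):
    """Softmax over the kept indices only; dropped tokens get probability 0."""
    probs = softmax([scores[i] for i in keep])
    out = [0.0] * len(scores)
    for i, p in zip(keep, probs):
        out[i] = p
    return out


def kept_mass(scores, keep):
    """Share of the full softmax mass that falls on the kept tokens."""
    full_probs = softmax(scores)
    return sum(full_probs[i] for i in keep)


scores = [2.0, 1.0, 0.5, 0.0, -1.0]  # hypothetical attention logits for one query
keep = [0, 1]                        # hypothetical sparse pattern
full = softmax(scores)
sparse = sparse_softmax(scores, keep)
rho = kept_mass(scores, keep)        # in practice this would have to come from history
rescaled = [p * rho for p in sparse]
```

Here `sparse[0]` exceeds `full[0]`, while `rescaled` matches the full-attention probabilities on the kept tokens. The practical difficulty is that `rho` is not available without computing full attention, which is presumably why a method would estimate it from earlier steps.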

[22] Leveraging Diffusion Models for Synthetic Data Augmentation in Protein Subcellular Localization Classification

Sylvey Lin,Zhi-Yi Cao

Main category: cs.CV

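TL;DR: Investigates whether synthetic images generated by diffusion models can improve multi-label classification of protein subcellular localization, and finds that the synthetic-data approach generalizes poorly to the test set.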

Details Motivation: To explore the potential value of synthetic data in biomedical image classification, in particular whether diffusion-generated data can strengthen multi-label classification. Method: A simplified class-conditional denoising diffusion probabilistic model (DDPM) generates label-consistent samples, which are combined with real data through two hybrid training strategies (Mix Loss and Mix Representation). Result: The hybrid approaches perform well on the validation set but generalize poorly to the test set; ResNet-based baseline classifiers are more stable. Conclusion: Using synthetic data in biomedical image classification requires more realistic data generation and robust supervision. Abstract: We investigate whether synthetic images generated by diffusion models can enhance multi-label classification of protein subcellular localization. Specifically, we implement a simplified class-conditional denoising diffusion probabilistic model (DDPM) to produce label-consistent samples and explore their integration with real data via two hybrid training strategies: Mix Loss and Mix Representation. While these approaches yield promising validation performance, our proposed MixModel exhibits poor generalization to unseen test data, underscoring the challenges of leveraging synthetic data effectively. In contrast, baseline classifiers built on ResNet backbones with conventional loss functions demonstrate greater stability and test-time performance. Our findings highlight the importance of realistic data generation and robust supervision when incorporating generative augmentation into biomedical image classification.

[23] Fast Isotropic Median Filtering

Ben Weiss

Main category: cs.CV

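TL;DR: Presents a new median filtering method that overcomes the limits of traditional algorithms on image bit depth, kernel size, and kernel shape, and efficiently supports arbitrary convex kernel shapes such as circles.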

Details Motivation: Traditional median filtering algorithms are limited in image bit depth, kernel size, and kernel shape; square kernels in particular tend to produce streaky artifacts, so a more flexible and efficient solution is needed. Method: Proposes a new method that operates efficiently on data of arbitrary bit depth, with arbitrary kernel sizes and arbitrary convex kernel shapes (for example, circles). Result: The method overcomes the limitations of traditional algorithms and processes arbitrary convex kernel shapes efficiently. Conclusion: The new method offers a more flexible and efficient approach to median filtering, suitable for a wider range of image processing scenarios. Abstract: Median filtering is a cornerstone of computational image processing. It provides an effective means of image smoothing, with minimal blurring or softening of edges, invariance to monotonic transformations such as gamma adjustment, and robustness to noise and outliers. However, known algorithms have all suffered from practical limitations: the bit depth of the image data, the size of the filter kernel, or the kernel shape itself. Square-kernel implementations tend to produce streaky cross-hatching artifacts, and nearly all known efficient algorithms are in practice limited to square kernels. We present for the first time a method that overcomes all of these limitations. Our method operates efficiently on arbitrary bit-depth data, arbitrary kernel sizes, and arbitrary convex kernel shapes, including circular shapes.
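
For readers unfamiliar with what a circular-kernel median filter computes, the brute-force sketch below defines the operation directly. It is deliberately naive (it re-sorts every window, costing roughly O(r^2 log r) per pixel) and is not the efficient algorithm the paper presents; the border handling and function names are assumptions made for the example.

```python
from statistics import median


def disk_offsets(radius):
    """Integer offsets (dy, dx) that fall inside a circle of the given radius."""
    r = int(radius)
    return [(dy, dx)
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)
            if dy * dy + dx * dx <= radius * radius]


def median_filter_disk(image, radius):
    """Brute-force median over a circular window.

    `image` is a list of rows of numbers. Near the border only the in-bounds
    part of the window is used (one possible convention among several).
    """
    h, w = len(image), len(image[0])
    offsets = disk_offsets(radius)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[y + dy][x + dx]
                      for dy, dx in offsets
                      if 0 <= y + dy < h and 0 <= x + dx < w]
            out[y][x] = median(window)
    return out


# A flat 5x5 image with one outlier pixel in the middle.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255
clean = median_filter_disk(noisy, 1)
```

The single outlier is removed entirely, illustrating the robustness to outliers that the abstract mentions. Because the window is a disk rather than a square, the result does not favour the horizontal and vertical axes.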

[24] ATI: Any Trajectory Instruction for Controllable Video Generation

Angtian Wang,Haibin Huang,Jacob Zhiyuan Fang,Yiding Yang,Chongyang Ma

Main category: cs.CV

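TL;DR: Proposes a unified motion control framework for video generation that integrates camera movement, object translation, and fine-grained local motion through trajectory inputs.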

Details Motivation: Existing methods handle different motion types through separate modules or task-specific designs; this work offers a more unified solution. Method: A lightweight motion injector projects user-defined trajectories into the latent space of pre-trained image-to-video generation models. Result: Strong performance across several video motion control tasks, including stylized motion effects, dynamic viewpoint changes, and precise local motion manipulation. Conclusion: The method provides significantly better controllability and visual quality than prior approaches and is compatible with a range of state-of-the-art video generation models. Abstract: We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion using trajectory-based inputs. In contrast to prior methods that address these motion types through separate modules or task-specific designs, our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models via a lightweight motion injector. Users can specify keypoints and their motion paths to control localized deformations, entire object motion, virtual camera dynamics, or combinations of these. The injected trajectory signals guide the generative process to produce temporally consistent and semantically aligned motion sequences. Our framework demonstrates superior performance across multiple video motion control tasks, including stylized motion effects (e.g., motion brushes), dynamic viewpoint changes, and precise local motion manipulation. Experiments show that our method provides significantly better controllability and visual quality compared to prior approaches and commercial solutions, while remaining broadly compatible with various state-of-the-art video generation backbones. Project page: https://anytraj.github.io/.

[25] Toward Memory-Aided World Models: Benchmarking via Spatial Consistency

Kewei Lian,Shaofei Cai,Yilun Du,Yitao Liang

Main category: cs.CV

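TL;DR: Presents a dataset and benchmark for improving spatial consistency in world models, built from 20 million frames of navigation video collected in Minecraft, and evaluates four world model baselines.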

Details Motivation: Existing datasets and benchmarks focus mainly on visual coherence or generation quality and neglect long-range spatial consistency, so a new dataset is needed to drive the development of memory modules. Method: Samples 150 distinct locations in Minecraft and collects about 250 hours (20 million frames) of loop-based navigation videos, with a curriculum design that gradually increases sequence length. Result: An extensible dataset and benchmark, on which four world model baselines are evaluated, to support future research. Conclusion: The dataset fills a gap in existing work and provides an important resource for developing spatially consistent world models. Abstract: The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. Designing a memory module is a crucial component for addressing spatial consistency: such a model must not only retain long-horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, there is no dataset designed to promote the development of memory modules by explicitly enforcing spatial consistency constraints. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we construct a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft, collecting about 250 hours (20 million frames) of loop-based navigation videos with actions. Our dataset follows a curriculum design of sequence lengths, allowing models to learn spatial consistency on increasingly complex navigation trajectories. Furthermore, our data collection pipeline is easily extensible to new Minecraft environments and modules. Four representative world model baselines are evaluated on our benchmark. Dataset, benchmark, and code are open-sourced to support future research.

[26] HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions

Shuolin Xu,Siming Zheng,Ziyi Wang,HC Yu,Jinwei Chen,Huaqi Zhang,Bo Li,Peng-Tao Jiang

Main category: cs.CV

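TL;DR: Introduces the Open-HyperMotionX dataset and the HyperMotionX benchmark for evaluating and improving pose-guided animation under complex human motion, along with a DiT-based video generation baseline whose spatial low-frequency enhanced RoPE module improves generation quality on highly dynamic sequences.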

Details Motivation: Existing methods perform poorly on complex human motion (Hypermotion), and a high-quality evaluation benchmark is lacking. Method: Introduces the Open-HyperMotionX dataset and the HyperMotionX benchmark, designs a DiT-based video generation baseline, and adds a spatial low-frequency enhanced RoPE module. Result: The method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Conclusion: The proposed dataset and method improve the generation quality of complex human motion animation; code and dataset will be released. Abstract: Recent advances in diffusion models have significantly improved conditional video generation, particularly in the pose-guided human image animation task. Although existing methods are capable of generating high-fidelity and time-consistent animation sequences in regular motions and static scenes, there are still obvious limitations when facing complex human body motions (Hypermotion) that contain highly dynamic, non-standard motions, and there is no high-quality benchmark for evaluating complex human motion animations. To address this challenge, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Furthermore, we propose a simple yet powerful DiT-based video generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in advancing the generation quality of complex human motion image animations. Code and dataset will be made publicly available.

[27] Pose-free 3D Gaussian splatting via shape-ray estimation

Youngju Na,Taeyeon Kim,Jumin Lee,Kyu Beom Han,Woo Jae Kim,Sung-eui Yoon

Main category: cs.CV

TL;DR: SHARE is a pose-free 3D Gaussian splatting framework that addresses inaccurate poses through joint shape and camera ray estimation.

Details Motivation: In real-world scenes, accurate camera poses are hard to obtain, which leads to geometric misalignment; SHARE aims to solve this problem. Method: SHARE builds a pose-aware canonical volume representation and refines local geometry through anchor-aligned Gaussian prediction. Result: Experiments show that SHARE performs robustly in pose-free generalizable Gaussian splatting. Conclusion: Through joint estimation and a canonical representation, SHARE effectively resolves the geometric misalignment caused by inaccurate poses. Abstract: While generalizable 3D Gaussian splatting enables efficient, high-quality rendering of unseen scenes, it heavily depends on precise camera poses for accurate geometry. In real-world scenarios, obtaining accurate poses is challenging, leading to noisy pose estimates and geometric misalignments. To address this, we introduce SHARE, a pose-free, feed-forward Gaussian splatting framework that overcomes these ambiguities by joint shape and camera rays estimation. Instead of relying on explicit 3D transformations, SHARE builds a pose-aware canonical volume representation that seamlessly integrates multi-view information, reducing misalignment caused by inaccurate pose estimates. Additionally, anchor-aligned Gaussian prediction enhances scene reconstruction by refining local geometry around coarse anchors, allowing for more precise Gaussian placement. Extensive experiments on diverse real-world datasets show that our method achieves robust performance in pose-free generalizable Gaussian splatting.

[28] MOVi: Training-free Text-conditioned Multi-Object Video Generation

Aimon Rahman,Jiang Liu,Ze Wang,Ximeng Sun,Jialian Wu,Xiaodong Yu,Yusheng Su,Vishal M. Patel,Zicheng Liu,Emad Barsoum

Main category: cs.CV

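TL;DR: Proposes a training-free method for multi-object video generation that combines the open-world knowledge of diffusion models and large language models (LLMs), significantly improving multi-object generation.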

Details Motivation: Existing diffusion models struggle to capture complex object interactions in multi-object video generation, often treating objects as static background or mixing their features. Method: Uses an LLM as the "director" of object trajectories, applying them through noise re-initialization and manipulating the attention mechanism to achieve precise control and feature separation. Result: Experiments show a 42% absolute improvement in motion dynamics and object generation accuracy while maintaining high fidelity and motion smoothness. Conclusion: The method offers an efficient, training-free solution for multi-object video generation. Abstract: Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle with accurately capturing complex object interactions, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate multiple distinct objects as specified in the prompt, resulting in incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach for multi-object video generation that leverages the open world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation process by manipulating the attention mechanism to better capture object-specific features and motion patterns, and prevent cross-object feature interference. Extensive experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models, resulting in 42% absolute improvement in motion dynamics and object generation accuracy, while also maintaining high fidelity and motion smoothness.

[29] Synthetic Document Question Answering in Hungarian

Jonathan Li,Zoltan Csaki,Nidhi Hiremath,Etash Guha,Fenglu Hong,Edward Ma,Urmish Thakker

Main category: cs.CV

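TL;DR: Presents a method for building document visual question answering (VQA) datasets for a lower-resource language (Hungarian), including manually curated and synthetically generated datasets, with quality filtering and deduplication to raise data quality.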

Details Motivation: Lower-resource languages such as Hungarian lack training and evaluation data for document VQA. Method: Builds two datasets, HuDocVQA (synthetically generated) and HuDocVQA-manual (manually curated), applying quality filtering and deduplication; also releases the HuCCPDF dataset for OCR training. Result: Fine-tuning improves accuracy on HuDocVQA by 7.2%. Conclusion: The datasets and methods support research on multilingual document VQA; datasets and code will be released. Abstract: Modern VLMs have achieved near-saturation accuracy in English document visual question-answering (VQA). However, this task remains challenging in lower resource languages due to a dearth of suitable training and evaluation data. In this paper we present scalable methods for curating such datasets by focusing on Hungarian, approximately the 17th highest resource language on the internet. Specifically, we present HuDocVQA and HuDocVQA-manual, document VQA datasets that modern VLMs significantly underperform on compared to English DocVQA. HuDocVQA-manual is a small manually curated dataset based on Hungarian documents from Common Crawl, while HuDocVQA is a larger synthetically generated VQA dataset from the same source. We apply multiple rounds of quality filtering and deduplication to HuDocVQA in order to match human-level quality in this dataset. We also present HuCCPDF, a dataset of 117k pages from Hungarian Common Crawl PDFs along with their transcriptions, which can be used for training a model for Hungarian OCR. To validate the quality of our datasets, we show how finetuning on a mixture of these datasets can improve accuracy on HuDocVQA for Llama 3.2 11B Instruct by +7.2%. Our datasets and code will be released to the public to foster further research in multilingual DocVQA.

[30] SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model

Bowen Chen,Keyan Chen,Mohan Yang,Zhengxia Zou,Zhenwei Shi

Main category: cs.CV

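TL;DR: Proposes a semantic-guided super-resolution framework (SeG-SR) that uses vision-language models to extract semantic knowledge and guide remote sensing image super-resolution, significantly improving performance.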

Details Motivation: Existing remote sensing image super-resolution methods neglect high-level semantic understanding, which leads to semantically inconsistent reconstructions; this work explores how high-level semantic knowledge can improve super-resolution performance. Method: Proposes the SeG-SR framework, comprising a Semantic Feature Extraction Module (SFEM), a Semantic Localization Module (SLM), and a Learnable Modulation Module (LMM), which uses a vision-language model to extract semantic knowledge and guide the super-resolution process. Result: SeG-SR achieves state-of-the-art performance on two datasets and delivers consistent gains across various super-resolution architectures. Conclusion: By introducing high-level semantic knowledge, SeG-SR significantly improves remote sensing image super-resolution and provides an efficient solution for related applications. Abstract: High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space, while neglecting the high-level understanding of remote sensing scenes. This may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work aims to explore the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline.
We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on two datasets and consistently delivers performance improvements across various SR architectures. Codes can be found at https://github.com/Mr-Bamboo/SeG-SR.

[31] Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition

Shanaka Ramesh Gunasekara,Wanqing Li,Philip Ogunbona,Jack Yang

Main category: cs.CV

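TL;DR: Introduces a new measurement, spatial-temporal joint density (STJD), that quantifies the interaction between moving and static joints in skeleton sequences, and proposes the STJD-CL contrastive learning strategy and the STJD-MP method, which significantly improve action classification.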

Details Motivation: Traditional methods focus mainly on the dynamic aspects of skeleton sequences and overlook the interaction between moving and static joints; the paper aims to exploit the discriminative potential of this interaction. Method: Proposes the STJD measurement to identify key joints ("prime joints"), and designs the STJD-CL contrastive learning strategy and the STJD-MP method. Result: On the NTU RGB+D 60, NTU RGB+D 120, and PKUMMD datasets, STJD-CL and STJD-MP improve performance, in particular by 3.5 and 3.6 percentage points over existing contrastive methods on NTU RGB+D 120. Conclusion: STJD effectively exploits the interaction between moving and static joints, significantly improving action classification and offering a new direction for self-supervised learning. Abstract: Traditional approaches in unsupervised or self-supervised learning for skeleton-based action classification have concentrated predominantly on the dynamic aspects of skeletal sequences. Yet, the intricate interaction between the moving and static elements of the skeleton presents a rarely tapped discriminative potential for action classification. This paper introduces a novel measurement, referred to as spatial-temporal joint density (STJD), to quantify such interaction. Tracking the evolution of this density throughout an action can effectively identify a subset of discriminative moving and/or static joints termed "prime joints" to steer self-supervised learning. A new contrastive learning strategy named STJD-CL is proposed to align the representation of a skeleton sequence with that of its prime joints while simultaneously contrasting the representations of prime and nonprime joints. In addition, a method called STJD-MP is developed by integrating it with a reconstruction-based framework for more effective learning. Experimental evaluations on the NTU RGB+D 60, NTU RGB+D 120, and PKUMMD datasets in various downstream tasks demonstrate that the proposed STJD-CL and STJD-MP improved performance, particularly by 3.5 and 3.6 percentage points over the state-of-the-art contrastive methods on the NTU RGB+D 120 dataset using X-sub and X-set evaluations, respectively.

[32] Towards Privacy-Preserving Fine-Grained Visual Classification via Hierarchical Learning from Label Proportions

Jinyi Chang,Dongliang Chang,Lei Chen,Bingyao Yu,Zhanyu Ma

Main category: cs.CV

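TL;DR: Proposes LHFGLP, a fine-grained visual classification method that needs no instance-level labels; it uses Learning from Label Proportions (LLP) with hierarchical supervision and significantly improves classification accuracy.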

Details Motivation: Existing fine-grained classification methods depend on instance-level labels, which makes them unsuitable for privacy-sensitive scenarios such as medical image analysis; methods that do not need direct labels are required. Method: Proposes the LHFGLP framework, which combines hierarchical sparse dictionary learning with a hierarchical proportion loss and trains efficiently from bag-level labels. Result: Experiments on three fine-grained datasets show that LHFGLP outperforms existing LLP methods. Conclusion: LHFGLP offers an effective solution for privacy-preserving fine-grained classification; code and datasets will be open-sourced. Abstract: In recent years, Fine-Grained Visual Classification (FGVC) has achieved impressive recognition accuracy, despite minimal inter-class variations. However, existing methods heavily rely on instance-level labels, making them impractical in privacy-sensitive scenarios such as medical image analysis. This paper aims to enable accurate fine-grained recognition without direct access to instance labels. To achieve this, we leverage the Learning from Label Proportions (LLP) paradigm, which requires only bag-level labels for efficient training. Unlike existing LLP-based methods, our framework explicitly exploits the hierarchical nature of fine-grained datasets, enabling progressive feature granularity refinement and improving classification accuracy. We propose Learning from Hierarchical Fine-Grained Label Proportions (LHFGLP), a framework that incorporates Unrolled Hierarchical Fine-Grained Sparse Dictionary Learning, transforming handcrafted iterative approximation into learnable network optimization. Additionally, our proposed Hierarchical Proportion Loss provides hierarchical supervision, further enhancing classification performance. Experiments on three widely-used fine-grained datasets, structured in a bag-based manner, demonstrate that our framework consistently outperforms existing LLP-based methods. We will release our code and datasets to foster further research in privacy-preserving fine-grained classification.
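
In the LLP setting the only supervision for a bag of images is the proportion of each class in it, so a common training signal is a cross-entropy between those proportions and the bag-averaged predictions. The sketch below shows that generic proportion loss and one plausible way to add a coarser hierarchy level by summing fine-class probabilities into parent classes. The function names, the `parent_of` mapping, and the unweighted sum of the two levels are assumptions for illustration; the paper's Hierarchical Proportion Loss may be defined differently.

```python
import math


def bag_proportion_loss(instance_probs, bag_proportions, eps=1e-12):
    """Cross-entropy between a bag's class proportions and its mean predicted probabilities."""
    n = len(instance_probs)
    k = len(bag_proportions)
    mean_pred = [sum(p[c] for p in instance_probs) / n for c in range(k)]
    return -sum(q * math.log(mean_pred[c] + eps) for c, q in enumerate(bag_proportions))


def coarsen(probs, parent_of, n_parents):
    """Sum fine-class probabilities into their parent classes."""
    out = [0.0] * n_parents
    for c, p in enumerate(probs):
        out[parent_of[c]] += p
    return out


def hierarchical_proportion_loss(instance_probs, fine_proportions, parent_of, n_parents):
    """Fine-level proportion loss plus the same loss at the parent level (assumed form)."""
    fine = bag_proportion_loss(instance_probs, fine_proportions)
    coarse_probs = [coarsen(p, parent_of, n_parents) for p in instance_probs]
    coarse_props = coarsen(fine_proportions, parent_of, n_parents)
    return fine + bag_proportion_loss(coarse_probs, coarse_props)


# Hypothetical bag of two images, three fine classes grouped under two parents.
parent_of = [0, 0, 1]
good = hierarchical_proportion_loss([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                                    [0.5, 0.0, 0.5], parent_of, 2)
bad = hierarchical_proportion_loss([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
                                   [0.5, 0.0, 0.5], parent_of, 2)
```

Predictions that match the bag proportions at both levels give a much smaller loss than predictions that put all mass on an absent class, and no individual image label is ever needed.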

[33] Deep Modeling and Optimization of Medical Image Classification

Yihang Wu,Muhammad Owais,Reem Kateb,Ahmad Chaddad

Main category: cs.CV

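TL;DR: Proposes an approach that combines a CLIP variant, federated learning, and traditional machine learning to address data privacy and generalization in medical image classification.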

Details Motivation: Deep models need large amounts of data that privacy constraints make hard to obtain in the medical domain, and the potential of CLIP in medicine has not been fully explored. Method: 1) a CLIP variant using CNNs and ViTs as image encoders; 2) federated learning to protect data privacy; 3) traditional ML methods to improve generalization. Result: MaxViT performs best on the HAM10000 dataset (AVG = 87.03%), ConvNeXt_L reaches an F1-score of 83.98% in the FL model, and SVM improves the Swin Transformer series by about 2%. Conclusion: The approach balances performance and privacy protection in medical image classification while improving generalization to unseen data. Abstract: Deep models, such as convolutional neural networks (CNNs) and vision transformers (ViTs), demonstrate remarkable performance in image classification. However, those deep models require large amounts of data to fine-tune, which is impractical in the medical domain due to the data privacy issue. Furthermore, despite the feasible performance of contrastive language image pre-training (CLIP) in the natural domain, the potential of CLIP has not been fully investigated in the medical field. To face these challenges, we considered three scenarios: 1) we introduce a novel CLIP variant using four CNNs and eight ViTs as image encoders for the classification of brain cancer and skin cancer, 2) we combine 12 deep models with two federated learning techniques to protect data privacy, and 3) we involve traditional machine learning (ML) methods to improve the generalization ability of those deep models in unseen domain data. The experimental results indicate that maxvit shows the highest averaged (AVG) test metrics (AVG = 87.03%) in HAM10000 dataset with multimodal learning, while convnext_l demonstrates remarkable test performance with an F1-score of 83.98% compared to swin_b with 81.33% in FL model. Furthermore, the use of support vector machine (SVM) can improve the overall test metrics with AVG of ~2% for swin transformer series in ISIC2018. Our codes are available at https://github.com/AIPMLab/SkinCancerSimulation.
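
The abstract does not name the two federated learning techniques it uses. As background on how such methods keep raw images on each site, the sketch below shows the aggregation step of federated averaging (FedAvg), a widely used baseline that is swapped in here purely for illustration. The flat parameter lists and client sizes are hypothetical.

```python
def fed_avg(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by each client's sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
            for j in range(dim)]


# Two hypothetical hospitals: the second holds three times as many images,
# so its locally trained parameters count three times as much.
global_weights = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only model parameters cross the network in this scheme; the server never sees patient images, which is the privacy property the abstract appeals to.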

[34] Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation

Jihai Zhang,Tianle Li,Linjie Li,Zhengyuan Yang,Yu Cheng

Main category: cs.CV

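TL;DR: Examines whether understanding and generation tasks reinforce each other in unified vision-language models (VLMs), finding that mixed training brings clear benefits and revealing the mechanisms behind cross-task generalization.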

Details Motivation: To study whether understanding and generation tasks can enhance each other in unified VLMs, filling a gap in existing research. Method: Designs a dataset closely aligned with real-world scenarios and evaluates multiple unified VLM architectures with quantitative analysis. Result: Mixed training improves task performance; cross-modal alignment and knowledge transfer are the key factors. Conclusion: Unifying understanding and generation is critical for VLMs, and the findings offer new insights for model design and optimization. Abstract: Recent advancements in unified vision-language models (VLMs), which integrate both visual understanding and generation capabilities, have attracted significant attention. The underlying hypothesis is that a unified architecture with mixed training on both understanding and generation tasks can enable mutual enhancement between understanding and generation. However, this hypothesis remains underexplored in prior works on unified VLMs. To address this gap, this paper systematically investigates the generalization across understanding and generation tasks in unified VLMs. Specifically, we design a dataset closely aligned with real-world scenarios to facilitate extensive experiments and quantitative evaluations. We evaluate multiple unified VLM architectures to validate our findings. Our key findings are as follows. First, unified VLMs trained with mixed data exhibit mutual benefits in understanding and generation tasks across various architectures, and these mutual benefits can scale up with increased data. Second, better alignment between multimodal input and output spaces will lead to better generalization. Third, the knowledge acquired during generation tasks can transfer to understanding tasks, and this cross-task generalization occurs within the base language model, beyond modality adapters. Our findings underscore the critical necessity of unifying understanding and generation in VLMs, offering valuable insights for the design and optimization of unified VLMs.

[35] SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images

Yu Sheng,Jiajun Deng,Xinran Zhang,Yu Zhang,Bei Hua,Yanyong Zhang,Jianmin Ji

Main category: cs.CV

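TL;DR: SpatialSplat is a new feedforward framework that uses a dual-field semantic representation and a selective Gaussian mechanism to reduce redundancy and improve semantic 3D reconstruction.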

Details Motivation: Existing methods sacrifice expressiveness when compressing semantic features, and pixel-wise prediction introduces redundancy that causes high memory overhead. Method: Uses a dual-field semantic representation (a coarse feature field and a fine-grained feature field) and a selective Gaussian mechanism that eliminates redundant Gaussians. Result: Experiments show a 60% reduction in parameters with performance superior to existing methods. Conclusion: SpatialSplat achieves a more compact and efficient representation for semantic 3D reconstruction. Abstract: A major breakthrough in 3D reconstruction is the feedforward paradigm to generate pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model's ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce SpatialSplat, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed SpatialSplat learns accurate semantic information and detailed instance priors with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our method, demonstrating a remarkable 60% reduction in scene representation parameters while achieving superior performance over state-of-the-art methods. The code will be made available for future investigation.

[36] Multi-Sourced Compositional Generalization in Visual Question Answering

Chuanhao Li,Wenbo Ye,Zhen Li,Yuwei Wu,Yunde Jia

Main category: cs.CV

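TL;DR: Studies multi-sourced compositional generalization (MSCG) in visual question answering (VQA) and proposes a retrieval-augmented training framework that improves MSCG by learning unified representations for primitives from different modalities.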

Details Motivation: Because vision-and-language (V&L) tasks are multi-modal, the primitives that make up a composition come from different modalities, and generalization to such multi-sourced novel compositions (MSCG) has not been explored. Method: Proposes a retrieval-augmented training framework that retrieves semantically equivalent primitives and aggregates their features to learn unified cross-modal primitive representations. Result: Experiments show the framework improves the MSCG ability of VQA models; a new GQA-MSCG dataset is built on GQA for evaluation. Conclusion: The study fills a gap in MSCG research, and the proposed framework and dataset provide a foundation for future work. Abstract: Compositional generalization is the ability to generalize to novel compositions from seen primitives, and has received much attention in vision-and-language (V&L) recently. Due to the multi-modal nature of V&L tasks, the primitives composing compositions source from different modalities, resulting in multi-sourced novel compositions. However, the generalization ability over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. Experimental results demonstrate the effectiveness of the proposed framework. We release GQA-MSCG at https://github.com/NeverMoreLCH/MSCG.

[37] Zero-P-to-3: Zero-Shot Partial-View Images to 3D Object

Yuxuan Lin,Ruihang Chu,Zhenyu Chen,Xiao Tang,Lei Ke,Haoling Li,Yingji Zhong,Zhihao Li,Shiyong Liu,Xiaofei Wu,Jianzhuang Liu,Yujiu Yang

Main category: cs.CV

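TL;DR: The paper proposes Zero-P-to-3, a training-free method that tackles 3D reconstruction from partial views by fusing local dense observations with multi-source priors, generating multi-view consistent images to improve reconstruction quality.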

Details Motivation: 3D reconstruction from partial views is challenging because of the limited view range and inconsistent generated content, and traditional interpolation methods perform poorly. Method: Fuses local dense observations with multi-source priors, aligns the priors during DDIM sampling, and applies an iterative refinement strategy that uses geometric structure to improve reconstruction quality. Result: Experiments on multiple datasets show that the method outperforms prior work in invisible regions. Conclusion: By fusing multi-source priors with iterative refinement, Zero-P-to-3 significantly improves 3D reconstruction from partial views. Abstract: Generative 3D reconstruction shows strong potential for incomplete observations. While sparse-view and single-image reconstruction are well-researched, partial observation remains underexplored. In this context, dense views are accessible only from a specific angular range, with other perspectives remaining inaccessible. This task presents two main challenges: (i) Limited view range: observations confined to a narrow angular scope prevent effective traditional interpolation techniques that require evenly distributed perspectives. (ii) Inconsistent generation: views created for invisible regions often lack coherence with both visible regions and each other, compromising reconstruction consistency. To address these challenges, we propose Zero-P-to-3, a novel training-free approach that integrates the local dense observations and multi-source priors for reconstruction. Our method introduces a fusion-based strategy to effectively align these priors in DDIM sampling, thereby generating multi-view consistent images to supervise invisible views. We further design an iterative refinement strategy, which uses the geometric structures of the object to enhance reconstruction quality. Extensive experiments on multiple datasets show the superiority of our method over SOTAs, especially in invisible regions.

[38] URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration

Rui Xu,Yuzhen Niu,Yuezhou Li,Huangbiao Xu,Wenxi Liu,Yuzhong Chen

Main category: cs.CV

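TL;DR: Proposes URWKV, a unified model with a multi-state perspective, for flexible and effective restoration of dynamically coupled degradations in low-light images.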

Details Motivation: Existing low-light image enhancement and deblurring models are limited when degradations are dynamically coupled, so a more flexible solution is needed. Method: Designs the URWKV block with multi-stage state awareness, and proposes the Luminance-adaptive Normalization (LAN) and State-aware Selective Fusion (SSF) modules. Result: URWKV performs strongly on multiple benchmarks while requiring fewer parameters and less computation. Conclusion: Through its multi-state perspective and dynamic fusion mechanism, URWKV significantly improves low-light image enhancement and deblurring. Abstract: Existing low-light image enhancement (LLIE) and joint LLIE and deblurring (LLIE-deblur) models have made strides in addressing predefined degradations, yet they are often constrained by dynamically coupled degradations. To address these challenges, we introduce a Unified Receptance Weighted Key Value (URWKV) model with multi-state perspective, enabling flexible and effective degradation restoration for low-light images. Specifically, we customize the core URWKV block to perceive and analyze complex degradations by leveraging multiple intra- and inter-stage states. First, inspired by the pupil mechanism in the human visual system, we propose Luminance-adaptive Normalization (LAN) that adjusts normalization parameters based on rich inter-stage states, allowing for adaptive, scene-aware luminance modulation. Second, we aggregate multiple intra-stage states through an exponential moving average approach, effectively capturing subtle variations while mitigating information loss inherent in the single-state mechanism. To reduce the degradation effects commonly associated with conventional skip connections, we propose the State-aware Selective Fusion (SSF) module, which dynamically aligns and integrates multi-state features across encoder stages, selectively fusing contextual information. In comparison to state-of-the-art models, our URWKV model achieves superior performance on various benchmarks, while requiring significantly fewer parameters and computational resources.
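
The abstract mentions aggregating multiple intra-stage states with an exponential moving average. The sketch below shows a generic EMA over a list of state vectors; the `decay` value, the choice to initialize from the first state, and the function name are assumptions, and the paper's actual state shapes and update rule may differ.

```python
def ema_aggregate(states, decay=0.9):
    """Hypothetical helper: fold a sequence of state vectors into one
    vector with an exponential moving average (oldest state first)."""
    if not states:
        raise ValueError("need at least one state")
    agg = list(states[0])
    for state in states[1:]:
        # Keep `decay` of the running aggregate, mix in the rest from the new state.
        agg = [decay * a + (1.0 - decay) * s for a, s in zip(agg, state)]
    return agg
```

A decay close to 1 keeps the aggregate dominated by earlier states, while a smaller decay makes it track the most recent state more closely.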

[39] GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

Gwanghyun Kim,Xueting Li,Ye Yuan,Koki Nagano,Tianye Li,Jan Kautz,Se Young Chun,Umar Iqbal

Main category: cs.CV

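TL;DR: GeoMan is a new architecture that produces accurate, temporally consistent 3D human geometry estimates from monocular video, addressing the temporal inconsistency and weak detail capture of existing methods.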

Details Motivation: Existing methods are optimized mainly for single images, which leads to temporal inconsistency and poor capture of dynamic details in video. GeoMan aims to solve these problems. Method: GeoMan combines an image model with a video diffusion model: the image model estimates depth and normals for the first frame, the video model generates the subsequent frames, and a root-relative depth representation preserves human-scale detail. Result: GeoMan reaches state-of-the-art performance in both qualitative and quantitative evaluations, with clearly improved temporal consistency and generalization. Conclusion: GeoMan's design addresses key challenges in 3D human geometry estimation and performs well on video. Abstract: Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.
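
The abstract does not define the root-relative depth representation precisely. One plausible reading, sketched here purely as an assumption, is that each pixel's depth is expressed as an offset from the depth of a root point on the body (for example the pelvis), so that human-scale differences are kept in metric units while the absolute camera distance is factored out. The function names are hypothetical.

```python
def to_root_relative(depth_map, root_depth):
    """Assumed definition: per-pixel metric depth minus the root point's depth."""
    return [[d - root_depth for d in row] for row in depth_map]

def to_metric(relative_map, root_depth):
    """Inverse of to_root_relative: add the root depth back to every pixel."""
    return [[r + root_depth for r in row] for row in relative_map]
```

Under this reading, two people of the same size at different distances from the camera would produce similar relative maps, which is one way such a representation could be easier to predict than absolute metric depth.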

[40] LeMoRe: Learn More Details for Lightweight Semantic Segmentation

Mian Muhammad Naeem Abid,Nancy Mehta,Zongwei Wu,Radu Timofte

Main category: cs.CV

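TL;DR: The paper proposes LeMoRe, a lightweight semantic segmentation method that combines explicit and implicit modeling to balance computational efficiency with representational power.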

Details Motivation: Existing methods struggle to balance efficiency and performance because of the complexity of feature modeling, and they rely on parameter-heavy designs and computationally intensive frameworks. Method: Combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, capturing global dependencies efficiently through a nested attention mechanism. Result: Experiments on ADE20K, CityScapes, and other datasets confirm that LeMoRe balances performance and efficiency effectively. Conclusion: LeMoRe offers an efficient, high-performing solution for lightweight semantic segmentation. Abstract: Lightweight semantic segmentation is essential for many downstream vision tasks. Unfortunately, existing methods often struggle to balance efficiency and performance due to the complexity of feature modeling. Many of these existing approaches are constrained by rigid architectures and implicit representation learning, often characterized by parameter-heavy designs and a reliance on computationally intensive Vision Transformer-based frameworks. In this work, we introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity. Our method combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies through a nested attention mechanism. Extensive experiments on challenging datasets, including ADE20K, CityScapes, Pascal Context, and COCO-Stuff, demonstrate that LeMoRe strikes an effective balance between performance and efficiency.

[41] CURVE: CLIP-Utilized Reinforcement Learning for Visual Image Enhancement via Simple Image Processing

Yuka Ogino,Takahiro Toizumi,Atsushi Ito

Main category: cs.CV

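TL;DR: This paper proposes CURVE, a low-light image enhancement method based on CLIP and reinforcement learning that adjusts global tone with a Bézier curve, and reports experiments in which it outperforms conventional methods.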

Details Motivation: Addresses how zero-reference low-light image enhancement can use the CLIP model to produce perceptually good images while remaining computationally efficient on high-resolution images. Method: Adjusts global image tone with a Bézier curve and estimates the processing parameters iteratively via reinforcement learning, with rewards designed from CLIP text embeddings. Result: On low-light and multi-exposure datasets, CURVE outperforms conventional methods in enhancement quality and processing speed. Conclusion: CURVE is an efficient, high-performing low-light image enhancement method. Abstract: Low-Light Image Enhancement (LLIE) is crucial for improving both human perception and computer vision tasks. This paper addresses two challenges in zero-reference LLIE: obtaining perceptually 'good' images using the Contrastive Language-Image Pre-Training (CLIP) model and maintaining computational efficiency for high-resolution images. We propose CLIP-Utilized Reinforcement learning-based Visual image Enhancement (CURVE). CURVE employs a simple image processing module which adjusts global image tone based on a Bézier curve and estimates its processing parameters iteratively. The estimator is trained by reinforcement learning with rewards designed using CLIP text embeddings. Experiments on low-light and multi-exposure datasets demonstrate the performance of CURVE in terms of enhancement quality and processing speed compared to conventional methods.
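
To make the idea of Bézier-curve tone adjustment concrete, here is a minimal sketch. It assumes a cubic curve whose end points are pinned at 0 and 1 and whose two inner control values `p1` and `p2` are the tunable parameters, with the input intensity used directly as the curve parameter. The abstract does not give CURVE's exact parameterization, so the curve degree, parameter names, and clamping are assumptions.

```python
def bezier_tone(t, p1, p2):
    """Cubic Bezier tone value for an input intensity t in [0, 1].
    End points are fixed at 0 and 1; p1 and p2 are the free control values."""
    u = 1.0 - t
    return 3.0 * u * u * t * p1 + 3.0 * u * t * t * p2 + t ** 3

def apply_tone_curve(image, p1, p2):
    """Apply the same global curve to every pixel of a grayscale image
    given as nested lists of floats in [0, 1]; results are clamped to [0, 1]."""
    return [[min(1.0, max(0.0, bezier_tone(px, p1, p2))) for px in row]
            for row in image]
```

Setting `p1 = 1/3` and `p2 = 2/3` yields the identity mapping, while larger control values lift mid-tones, which is the direction a low-light enhancer would move. Because the curve is global, its cost per pixel is constant regardless of image content.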

[42] EAD: An EEG Adapter for Automated Classification

Pushapdeep Singh,Jyoti Nigam,Medicherla Vamsi Krishna,Arnav Bhavsar,Aditya Nigam

Main category: cs.CV

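TL;DR: The paper proposes EEG Adapter (EAD), a flexible framework compatible with any EEG acquisition device that learns robust representations for classification and reaches state-of-the-art accuracy on two public datasets.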

Details Motivation: Traditional EEG classification depends on task-specific data acquisition and on the number of channels, which makes it hard to handle data from different devices in a unified way, so a general framework is needed. Method: Proposes the EAD framework, which adapts an EEG foundation model to learn robust representations and supports classification of EEG data with varying numbers of channels. Result: Achieves 99.33% and 92.31% accuracy on EEG-ImageNet and BrainLat respectively, and demonstrates generalization in zero-shot classification. Conclusion: EAD addresses device dependence and task diversity in EEG classification, with broad applicability and good generalization. Abstract: While electroencephalography (EEG) has been a popular modality for neural decoding, it often involves task-specific acquisition of the EEG data. This poses challenges for the development of a unified pipeline to learn embeddings for various EEG signal classification, which is often involved in various decoding tasks. Traditionally, EEG classification involves the step of signal preprocessing and the use of deep learning techniques, which are highly dependent on the number of EEG channels in each sample. However, the same pipeline cannot be applied even if the EEG data is collected for the same experiment but with different acquisition devices. This necessitates the development of a framework for learning EEG embeddings, which could be highly beneficial for tasks involving multiple EEG samples for the same task but with varying numbers of EEG channels. In this work, we propose EEG Adapter (EAD), a flexible framework compatible with any signal acquisition device. More specifically, we leverage a recent EEG foundational model with significant adaptations to learn robust representations from the EEG data for the classification task. We evaluate EAD on two publicly available datasets, achieving state-of-the-art accuracies of 99.33% and 92.31% on EEG-ImageNet and BrainLat respectively. This illustrates the effectiveness of the proposed framework across diverse EEG datasets containing two different perception tasks: stimulus and resting-state EEG signals. We also perform zero-shot EEG classification on the EEG-ImageNet task to demonstrate the generalization capability of the proposed approach.

[43] Identification of Patterns of Cognitive Impairment for Early Detection of Dementia

Anusha A. S.,Uma Ranjan,Medha Sharma,Siddharth Dutt

Main category: cs.CV

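TL;DR: Proposes a personalized cognitive testing scheme that identifies individual-specific patterns of cognitive impairment, enabling efficient early dementia detection across large populations.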

Details Motivation: Early detection of dementia is crucial for intervention, but traditional tests are time-consuming and hard to apply to large populations, especially when periodic assessment is needed. Method: Learns patterns of cognitive impairment from a population of cognitively normal people and people with mild cognitive impairment (MCI), identifies the patterns with a two-step procedure (ensemble feature selection followed by cluster analysis), and uses them to design personalized tests. Result: The identified patterns match clinically accepted MCI variants and can be used to predict the likely route of cognitive impairment in pre-symptomatic or apparently normal people. Conclusion: The method provides an efficient, personalized solution for large-scale early dementia detection. Abstract: Early detection of dementia is crucial to devise effective interventions. Comprehensive cognitive tests, while being the most accurate means of diagnosis, are long and tedious, thus limiting their applicability to a large population, especially when periodic assessments are needed. The problem is compounded by the fact that people have differing patterns of cognitive impairment as they progress to different forms of dementia. This paper presents a novel scheme by which individual-specific patterns of impairment can be identified and used to devise personalized tests for periodic follow-up. Patterns of cognitive impairment are initially learned from a population cluster of combined normals and MCIs, using a set of standardized cognitive tests. Impairment patterns in the population are identified using a 2-step procedure involving an ensemble wrapper feature selection followed by cluster identification and analysis. These patterns have been shown to correspond to clinically accepted variants of MCI, a prodrome of dementia. The learned clusters of patterns can subsequently be used to identify the most likely route of cognitive impairment, even for pre-symptomatic and apparently normal people. Baseline data of 24,000 subjects from the NACC database was used for the study.

[44] Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving

Yunshen Wang,Yicheng Liu,Tianyuan Yuan,Yucheng Mao,Yingshi Liang,Xiuyu Yang,Honggang Zhang,Hang Zhao

Main category: cs.CV

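TL;DR: The paper proposes a diffusion-based generative approach to 3D occupancy grid prediction, addressing the limitations of traditional discriminative methods with noisy data, incomplete observations, and complex 3D structure.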

Details Motivation: Current discriminative methods for 3D occupancy grid prediction struggle with noisy data, incomplete observations, and complex structures, so a more robust approach is needed. Method: Treats the problem as a generative modeling task with diffusion models, which learn the data distribution and incorporate 3D scene priors to improve prediction consistency and noise robustness. Result: Experiments show that diffusion models outperform existing discriminative methods in occluded or low-visibility regions, with more realistic and accurate predictions. Conclusion: Diffusion models make 3D occupancy prediction considerably more useful in practice and benefit downstream planning in autonomous driving. Abstract: Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach enhances prediction consistency and noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.

[45] TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance

Keren Ye,Ignacio Garcia Dorado,Michalis Raptis,Mauricio Delbracio,Irene Zhu,Peyman Milanfar,Hossein Talebi

Main category: cs.CV

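TL;DR: TextSR is a multimodal diffusion model for multilingual scene text image super-resolution that combines text detection, OCR, and character-shape priors to markedly improve the sharpness and legibility of text images.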

Details Motivation: Existing diffusion-based image super-resolution methods perform poorly on scene text images: they localize text inaccurately and model character shapes inadequately, which leads to unsatisfactory results. Method: TextSR uses a text detector and OCR to locate text regions and extract their multilingual characters, converts the characters into visual shapes through a UTF-8 encoder and cross-attention, and adds two methods that improve robustness. Result: The model performs strongly on the TextZoom and TextVQA datasets, setting a new benchmark for scene text image super-resolution. Conclusion: By integrating text character priors with the low-resolution image, TextSR improves the detail and legibility of super-resolved text, which supports the effectiveness of the approach. Abstract: While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.

[46] MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation

Siyuan Wang,Jiawei Liu,Wei Wang,Yeying Jin,Jinsong Du,Zhi Han

Main category: cs.CV

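TL;DR: The paper proposes a Motion Mask-Guided Two-Stage Network (MMGT) that drives synchronized co-speech gesture video generation jointly from audio and motion masks, addressing the weakness of traditional methods in capturing large movements and controlling detail.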

Details Motivation: Traditional methods that rely on audio alone struggle to capture large movements and tend to produce distortions, while introducing extra prior information limits practical use. Method: Proposes the two-stage MMGT network: the SMGA network generates high-quality pose videos and motion masks, and MM-HAA is integrated with a Stabilized Diffusion video generation model to improve control over detail. Result: Experiments show better video quality, lip-sync, and gesture generation. Conclusion: Through its two-stage design and the use of motion masks, MMGT significantly improves the quality of generated co-speech gesture video. Abstract: Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging due to the diversity of different parts of the body in terms of amplitude of motion, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in video, leading to more pronounced artifacts and distortions. Existing approaches typically address this issue by introducing additional a priori information, but this can limit the practical application of the task. Specifically, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, as well as motion masks and motion features generated from the audio signal to jointly drive the generation of synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio, effectively capturing large movements in key regions such as the face and gestures. In the second stage, we integrate the Motion Masked Hierarchical Audio Attention (MM-HAA) into the Stabilized Diffusion Video Generation model, overcoming limitations in fine-grained motion generation and region-specific detail control found in traditional methods. This guarantees high-quality, detailed upper-body video generation with accurate texture and motion details. Evaluations show improved video quality, lip-sync, and gesture. The model and code are available at https://github.com/SIA-IDE/MMGT.

[47] HMAD: Advancing E2E Driving with Anchored Offset Proposals and Simulation-Supervised Multi-target Scoring

Bin Wang,Pingjun Li,Jinkun Liu,Jun Cheng,Hailong Lei,Yinze Rong,Huan-ang Gao,Kangliang Chen,Xing Pan,Weihao Gu

Main category: cs.CV

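TL;DR: The HMAD framework addresses trajectory diversity and optimal path selection in autonomous driving through BEV-based trajectory generation and a multi-criteria scoring mechanism.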

Details Motivation: Autonomous driving faces challenges in generating diverse, rule-compliant trajectories and in selecting the optimal path through multi-dimensional scoring. Method: HMAD combines BEV trajectory generation with a simulation-supervised scoring module, using BEVFormer and learnable anchored queries to generate trajectories and the scorer to evaluate safety, comfort, and other metrics. Result: HMAD achieves a 44.5% driving score on the CVPR 2025 test set. Conclusion: HMAD shows that decoupling trajectory generation from safety-aware scoring is effective for advanced autonomous driving. Abstract: End-to-end autonomous driving faces persistent challenges in both generating diverse, rule-compliant trajectories and robustly selecting the optimal path from these options via learned, multi-faceted evaluation. To address these challenges, we introduce HMAD, a framework integrating a distinctive Bird's-Eye-View (BEV) based trajectory proposal mechanism with learned multi-criteria scoring. HMAD leverages BEVFormer and employs learnable anchored queries, initialized from a trajectory dictionary and refined via iterative offset decoding (inspired by DiffusionDrive), to produce numerous diverse and stable candidate trajectories. A key innovation, our simulation-supervised scorer module, then evaluates these proposals against critical metrics including no at-fault collisions, drivable area compliance, comfortableness, and overall driving quality (i.e., extended PDM score). Demonstrating its efficacy, HMAD achieves a 44.5% driving score on the CVPR 2025 private test set. This work highlights the benefits of effectively decoupling robust trajectory generation from comprehensive, safety-aware learned scoring for advanced autonomous driving.
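
The decoupling of proposal generation from scoring can be sketched in a few lines. In this toy version the anchors, offsets, and per-criterion scores are plain numbers supplied by the caller; in HMAD they would come from learned modules, which are not reproduced here. The function names, the weighted-sum combination, and the criterion keys are assumptions.

```python
def propose(anchors, offsets):
    """Refine each anchor trajectory (a list of (x, y) waypoints) by adding
    the matching per-waypoint offsets; returns the candidate trajectories."""
    return [[(x + dx, y + dy) for (x, y), (dx, dy) in zip(anchor, offset)]
            for anchor, offset in zip(anchors, offsets)]

def select_best(candidates, scores, weights):
    """Combine per-criterion scores with a weighted sum (an assumed rule)
    and return the index and total score of the best candidate."""
    totals = [sum(weights[name] * s[name] for name in weights) for s in scores]
    best = max(range(len(candidates)), key=totals.__getitem__)
    return best, totals[best]
```

Because the scorer only sees finished candidates, the proposal stage can be swapped or retrained without changing how candidates are ranked.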

[48] PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents

Haoyu Chen,Keda Tao,Yizao Wang,Xinlei Wang,Lei Zhu,Jinjin Gu

Main category: cs.CV

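TL;DR: PhotoArtAgent is an intelligent system that combines vision-language models with natural language reasoning to emulate a professional artist's creative process, offering transparent and interactive photo retouching.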

Details Motivation: Addresses the lack of interpretative depth and interactive transparency that non-professional users face when relying on automated tools, while emulating a professional artist's creative process. Method: Combines Vision-Language Models (VLMs) with natural language reasoning to perform artistic analysis, plan retouching strategies, output precise parameters to Lightroom through an API, and iteratively refine the results. Result: Outperforms existing automated tools in user studies, with results close to those of professional artists. Conclusion: Through transparent explanations and iterative refinement, PhotoArtAgent achieves high-quality, user-controllable photo retouching. Abstract: Photo retouching is integral to photographic art, extending far beyond simple technical fixes to heighten emotional expression and narrative depth. While artists leverage expertise to create unique visual effects through deliberate adjustments, non-professional users often rely on automated tools that produce visually pleasing results but lack interpretative depth and interactive transparency. In this paper, we introduce PhotoArtAgent, an intelligent system that combines Vision-Language Models (VLMs) with advanced natural language reasoning to emulate the creative process of a professional artist. The agent performs explicit artistic analysis, plans retouching strategies, and outputs precise parameters to Lightroom through an API. It then evaluates the resulting images and iteratively refines them until the desired artistic vision is achieved. Throughout this process, PhotoArtAgent provides transparent, text-based explanations of its creative rationale, fostering meaningful interaction and user control. Experimental results show that PhotoArtAgent not only surpasses existing automated tools in user studies but also achieves results comparable to those of professional human artists.

[49] Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing

Tongtong Su,Chengyu Wang,Jun Huang,Dongming Lu

Main category: cs.CV

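TL;DR: The paper proposes Zero-to-Hero, a reference-based video editing method that splits editing into two stages, addressing the ambiguity and limited fine-grained control of existing text-guided methods.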

Details Motivation: Existing text-guided video editing methods suffer from ambiguous user intent and limited fine-grained control, so a more precise and consistent approach is needed. Method: Works in two stages: the Zero stage edits an anchor frame to serve as a reference image and propagates its appearance, using correspondences within the original frames to guide the attention mechanism; the Hero stage restores the video with a conditional generative model. Result: Improves PSNR by 2.6 dB over the best baseline, with appearance consistency evaluated on a set of videos built in Blender. Conclusion: Zero-to-Hero achieves higher accuracy and temporal consistency in video editing, overcoming the limitations of existing methods. Abstract: Appearance editing according to user needs is a pivotal task in video editing. Existing text-guided methods often lead to ambiguities regarding user intentions and restrict fine-grained control over editing specific aspects of objects. To overcome these limitations, this paper introduces a novel approach named Zero-to-Hero, which focuses on reference-based video editing that disentangles the editing process into two distinct problems. It achieves this by first editing an anchor frame to satisfy user requirements as a reference image and then consistently propagating its appearance across other frames. We leverage correspondence within the original frames to guide the attention mechanism, which is more robust than previously proposed optical flow or temporal modules in memory-friendly video generative models, especially when dealing with objects exhibiting large motions. It offers a solid ZERO-shot initialization that ensures both accuracy and temporal consistency. However, intervention in the attention mechanism results in compounded imaging degradation with over-saturated colors and unknown blurring issues. Starting from Zero-Stage, our Hero-Stage Holistically learns a conditional generative model for vidEo RestOration. To accurately evaluate the consistency of the appearance, we construct a set of videos with multiple appearances using Blender, enabling a fine-grained and deterministic evaluation. Our method outperforms the best-performing baseline with a PSNR improvement of 2.6 dB. The project page is at https://github.com/Tonniia/Zero2Hero.

[50] Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning

Jinquan Guan,Qi Chen,Lizhou Liang,Yuhang Liu,Vu Minh Hieu Phan,Minh-Son To,Jian Chen,Yutong Xie

Main category: cs.CV

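TL;DR: The paper introduces the CXRTrek dataset and the CXRTrekNet model, which simulate a radiologist's multi-stage diagnostic reasoning and address the limitations of existing medical AI models.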

Details Motivation: Existing medical AI models follow a simple input-to-output pattern that ignores the sequential, context-dependent nature of diagnostic reasoning, which leads to misalignment with clinical scenarios, contextless reasoning, and untraceable errors. Method: Builds the CXRTrek dataset, covering 8 diagnostic stages with about 428,000 samples and 11 million Q&A pairs, and proposes CXRTrekNet, which incorporates the clinical reasoning flow into a vision-language large model framework. Result: CXRTrekNet outperforms existing medical VLLMs on the CXRTrek benchmarks and generalizes better on five external datasets. Conclusion: The CXRTrek dataset and model fill a gap in modeling diagnostic reasoning for medical AI and provide tools better suited to clinical scenarios. Abstract: Artificial intelligence (AI)-based chest X-ray (CXR) interpretation assistants have demonstrated significant progress and are increasingly being applied in clinical settings. However, contemporary medical AI models often adhere to a simplistic input-to-output paradigm, directly processing an image and an instruction to generate a result, where the instructions may be integral to the model's architecture. This approach overlooks the modeling of the inherent diagnostic reasoning in chest X-ray interpretation. Such reasoning is typically sequential, where each interpretive stage considers the images, the current task, and the contextual information from previous stages. This oversight leads to several shortcomings, including misalignment with clinical scenarios, contextless reasoning, and untraceable errors. To fill this gap, we construct CXRTrek, a new multi-stage visual question answering (VQA) dataset for CXR interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings for the first time. CXRTrek covers 8 sequential diagnostic stages, comprising 428,966 samples and over 11 million question-answer (Q&A) pairs, with an average of 26.29 Q&A pairs per sample. Building on the CXRTrek dataset, we propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the VLLM framework. CXRTrekNet effectively models the dependencies between diagnostic stages and captures reasoning patterns within the radiological context. Trained on our dataset, the model consistently outperforms existing medical VLLMs on the CXRTrek benchmarks and demonstrates superior generalization across multiple tasks on five diverse external datasets. The dataset and model can be found in our repository (https://github.com/guanjinquan/CXRTrek).

[51] FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing

Jeongsol Kim,Yeobin Hong,Jong Chul Ye

Main category: cs.CV

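TL;DR: FlowAlign is an inversion-free, flow-based image editing framework that introduces a flow-matching loss to improve the stability of the editing trajectory and consistency with the source image.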

Details Motivation: Existing flow-based methods such as FlowEdit lack exact latent inversion, which leads to unstable editing trajectories and poor source consistency. Method: Proposes the FlowAlign framework, which uses a flow-matching loss as a regularizer to balance semantic alignment with the edit prompt against structural consistency with the source image. Result: Experiments show that FlowAlign outperforms existing methods in source preservation and editing controllability. Conclusion: With the flow-matching loss, FlowAlign achieves more stable and consistent image editing and supports reverse editing. Abstract: Recent inversion-free, flow-based image editing methods such as FlowEdit leverage a pre-trained noise-to-image flow model such as Stable Diffusion 3, enabling text-driven manipulation by solving an ordinary differential equation (ODE). While the lack of exact latent inversion is a core advantage of these methods, it often results in unstable editing trajectories and poor source consistency. To address this limitation, we propose FlowAlign, a novel inversion-free flow-based framework for consistent image editing with principled trajectory control. FlowAlign introduces a flow-matching loss as a regularization mechanism to promote smoother and more stable trajectories during the editing process. Notably, the flow-matching loss is shown to explicitly balance semantic alignment with the edit prompt and structural consistency with the source image along the trajectory. Furthermore, FlowAlign naturally supports reverse editing by simply reversing the ODE trajectory, highlighting the reversible and consistent nature of the transformation. Extensive experiments demonstrate that FlowAlign outperforms existing methods in both source preservation and editing controllability.
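
The claim that an edit can be undone by reversing the ODE trajectory can be illustrated with a toy example. The sketch below integrates a hand-written scalar velocity field with explicit Euler steps, first forward in time and then backward. The velocity field, the step count, and the function names are assumptions; a real system would use a trained flow model over image latents, and Euler integration recovers the start point only approximately.

```python
def integrate(x, velocity, t_start, t_end, steps=1000):
    """Explicit Euler integration of dx/dt = velocity(x, t) from t_start
    to t_end. Passing t_start > t_end integrates backward in time."""
    dt = (t_end - t_start) / steps
    for i in range(steps):
        t = t_start + i * dt
        x = x + dt * velocity(x, t)
    return x

def toy_velocity(x, t):
    """Stand-in for a learned flow: pulls x toward the value 2.0."""
    return 2.0 - x

edited = integrate(0.0, toy_velocity, 0.0, 1.0)        # forward "edit"
recovered = integrate(edited, toy_velocity, 1.0, 0.0)  # reverse the trajectory
```

Here `recovered` returns close to the starting value 0.0, with a small discretization error that shrinks as `steps` grows.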

[52] PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

Xiao Yu,Yan Fang,Xiaojie Jin,Yao Zhao,Yunchao Wei

Main category: cs.CV

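TL;DR: The paper introduces Online Audio-Visual Event Parsing (On-AVEP), a new paradigm, and the Predictive Future Modeling (PreFM) framework for real-time, efficient multimodal video understanding.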

Details Motivation: Existing methods rely on offline processing and large models, which limits real-time use. Method: Proposes the PreFM framework, which combines predictive multimodal future modeling with modality-agnostic robust representations to improve contextual understanding and real-time efficiency. Result: Clearly outperforms existing methods on the UnAV-100 and LLP datasets with fewer parameters. Conclusion: PreFM offers an effective solution for real-time multimodal video understanding. Abstract: Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework featuring (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show PreFM outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.

[53] LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering

Jonas Kulhanek,Marie-Julie Rakotosaona,Fabian Manhardt,Christina Tsalicoglou,Michael Niemeyer,Torsten Sattler,Songyou Peng,Federico Tombari

Main category: cs.CV

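TL;DR: Proposes a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices.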

Details Motivation: Addresses the challenge of rendering large-scale scenes in real time on memory-constrained devices. Method: Uses a hierarchical LOD representation that selects subsets of Gaussians based on camera distance, combined with depth-aware 3D smoothing, importance-based pruning, and fine-tuning, while loading spatial chunks dynamically. Result: Achieves state-of-the-art performance on outdoor and indoor datasets with lower latency and memory requirements. Conclusion: The method balances rendering quality against resource consumption efficiently. Abstract: In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.
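
Distance-based LOD selection and chunk loading can be sketched as follows. The distance thresholds, the convention that level 0 is the finest, the spherical loading radius, and the function names are assumptions for illustration; the abstract does not specify how LODGE chooses levels or chunk visibility, and the opacity blending at chunk boundaries is omitted.

```python
import bisect
import math

def select_lod(distance, thresholds):
    """Map a camera distance to an LOD index. `thresholds` is sorted
    ascending; level 0 is the finest and higher levels are coarser."""
    return bisect.bisect_right(thresholds, distance)

def chunks_to_load(camera, chunk_centers, radius):
    """Return the indices of spatial chunks whose centers lie within
    `radius` of the camera position (both given as 3D tuples)."""
    return [i for i, center in enumerate(chunk_centers)
            if math.dist(camera, center) <= radius]
```

For example, with thresholds `[10, 30, 90]` a Gaussian 50 units from the camera falls in level 2, so only the pruned subset built for that level needs to be resident for it.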

[54] Implicit Inversion turns CLIP into a Decoder

Antonio D'Orazio,Maria Rosaria Briglia,Donato Crisostomi,Dario Loi,Emanuele Rodolà,Iacopo Masi

Main category: cs.CV

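TL;DR: Without any decoder or training, CLIP can synthesize images by optimizing a frequency-aware implicit neural representation, revealing the generative potential of discriminative models.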

Details Motivation: Explores whether CLIP can be used directly for image synthesis without additional training or a decoder. Method: Optimizes a frequency-aware implicit neural representation, combined with adversarially robust initialization, an Orthogonal Procrustes projection, and a blending loss. Result: Enables text-to-image generation, style transfer, and image reconstruction without modifying CLIP's weights. Conclusion: Discriminative models may hold untapped generative potential. Abstract: CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone, without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.

[55] RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer

Liu Liu,Xiaofeng Wang,Guosheng Zhao,Keyu Li,Wenkang Qin,Jiaxiong Qiu,Zheng Zhu,Guan Huang,Zhizhong Su

Main category: cs.CV

TL;DR: RoboTransfer is a diffusion-based video generation framework for robotic data synthesis that addresses the sim-to-real gap and improves the geometric consistency and visual fidelity of generated data through multi-view geometry and scene-component control.

Details Motivation: Collecting large-scale real-world robot demonstrations is expensive, and simulators suffer from the sim-to-real gap, so an efficient data synthesis method is needed. Method: RoboTransfer combines multi-view geometry with scene-component control, using cross-view feature interactions and global depth/normal conditions to ensure geometric consistency, and supports background edits and object swaps. Result: RoboTransfer generates multi-view videos with higher geometric consistency and visual fidelity, and policies trained on its data achieve relative success-rate improvements of 33.3% in the DIFF-OBJ setting and 251% in the DIFF-ALL setting. Conclusion: RoboTransfer offers an efficient and controllable solution for robotic data synthesis that improves the usefulness of synthetic data for policy training. Abstract: Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: https://horizonrobotics.github.io/robot_lab/robotransfer
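The reported gains are relative improvements, which are easy to misread as absolute percentage points. The short sketch below shows the arithmetic. The success rates in the usage note are hypothetical values chosen only to reproduce the percentages; the abstract does not report the underlying absolute rates.

```python
def relative_improvement(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0
```

For example, a hypothetical success rate rising from 0.30 to 0.40 is a relative improvement of about 33.3%, even though the absolute gain is only 10 percentage points.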

[56] DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

Sungjune Park,Hyunjun Kim,Junho Kim,Seongho Kim,Yong Man Ro

Main category: cs.CV

TL;DR: The paper proposes DIP-R1, a reinforcement learning (RL) based framework that aims to improve the fine-grained visual perception of multimodal large language models (MLLMs) in complex scenes.

Details Motivation: Although MLLMs perform well at visual understanding, their fine-grained perception in complex real-world scenes, such as dense crowds, remains limited. Method: DIP-R1 guides MLLMs to understand complex scenes step by step using three rule-based rewards: a standard reasoning reward, a variance-guided looking reward, and a weighted precision-recall accuracy reward. Result: DIP-R1 performs well across diverse fine-grained object detection data and outperforms existing baseline models and supervised fine-tuning. Conclusion: Integrating RL into MLLMs can substantially improve their ability on complex real-world perception tasks. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant visual understanding capabilities, yet their fine-grained visual perception in complex real-world scenarios, such as densely crowded public areas, remains limited. Inspired by the recent success of reinforcement learning (RL) in both LLMs and MLLMs, in this paper, we explore how RL can enhance the visual perception ability of MLLMs. We then develop a novel RL-based framework, Deep Inspection and Perception with RL (DIP-R1), designed to enhance the visual perception capabilities of MLLMs by comprehending complex scenes and looking through visual instances closely. DIP-R1 guides MLLMs through detailed inspection of the visual scene via three simply designed rule-based reward models. First, we adopt a standard reasoning reward encouraging the model to include three step-by-step processes: 1) reasoning for understanding visual scenes, 2) observing for looking through interesting but ambiguous regions, and 3) decision-making for predicting the answer. Second, a variance-guided looking reward is designed to examine uncertain regions for the second observing process. It explicitly enables the model to inspect ambiguous areas, improving its ability to mitigate perceptual uncertainties. Third, we model a weighted precision-recall accuracy reward enhancing accurate decision-making. We explore its effectiveness across diverse fine-grained object detection data consisting of challenging real-world environments, such as densely crowded scenes. Built upon existing MLLMs, DIP-R1 achieves consistent and significant improvement across various in-domain and out-of-domain scenarios. It also outperforms various existing baseline models and supervised fine-tuning methods. Our findings highlight the substantial potential of integrating RL into MLLMs for enhancing capabilities in complex real-world perception tasks.
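To make the idea of a weighted precision-recall reward concrete, here is a minimal sketch. It assumes the reward is a convex combination of detection precision and recall; the abstract does not give the exact formula, so the function name, the `precision_weight` parameter, and the linear weighting are illustrative assumptions rather than the authors' definition.

```python
def precision_recall_reward(tp, fp, fn, precision_weight=0.5):
    """Rule-based reward from detection counts (assumed linear weighting).

    tp, fp, fn: true positives, false positives, false negatives.
    precision_weight: hypothetical knob in [0, 1]; recall gets the remainder.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision_weight * precision + (1.0 - precision_weight) * recall
```

Raising `precision_weight` would penalize spurious boxes more heavily, while lowering it would favor finding every instance in a crowded scene.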

[57] HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image

Junyi Guo,Jingxuan Zhang,Fangyu Wu,Huanda Lu,Qiufeng Wang,Wenmian Yang,Eng Gee Lim,Dongming Lu

Main category: cs.CV

TL;DR: The paper introduces a new task, FS2RG, which generates realistic garment images from flat sketches and textual guidance, and proposes the HiGarment framework to address its challenges.

Details Motivation: To fill the research gap in diffusion-based synthesis for the garment production process, and to address both the insufficient description of fabric characteristics by text guidance and conflicts between sketch and text information. Method: HiGarment comprises a multi-modal semantic enhancement mechanism and a harmonized cross-attention mechanism, which strengthen fabric representation and dynamically balance sketch and text information. Result: Experiments and user studies demonstrate the effectiveness of HiGarment in garment synthesis; the authors also collect an open-source dataset. Conclusion: HiGarment addresses the challenges of the FS2RG task and provides a new tool and dataset for garment generation. Abstract: Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.

[58] Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks

Run Hao,Peng Ying

Main category: cs.CV

TL;DR: Proposes an adversarial prompt generation framework based on a grammar tree and Monte Carlo tree search for evading AIGC detectors, and validates its effectiveness experimentally.

Details Motivation: As text-to-image (T2I) models advance, identity misuse and the robustness of AIGC detectors are growing concerns, and a systematic way to test and improve detectors is needed. Method: Uses a grammar tree structure and a variant of Monte Carlo tree search to systematically explore the semantic prompt space and generate diverse, controllable adversarial prompts. Result: The method is effective across multiple T2I models and ranked first in a real-world adversarial AIGC detection competition. Conclusion: Beyond attack scenarios, the method can be used to build high-quality adversarial datasets for training and evaluating more robust AIGC detection systems. Abstract: The rise of text-to-image (T2I) models has enabled the synthesis of photorealistic human portraits, raising serious concerns about identity misuse and the robustness of AIGC detectors. In this work, we propose an automated adversarial prompt generation framework that leverages a grammar tree structure and a variant of the Monte Carlo tree search algorithm to systematically explore the semantic prompt space. Our method generates diverse, controllable prompts that consistently evade both open-source and commercial AIGC detectors. Extensive experiments across multiple T2I models validate its effectiveness, and the approach ranked first in a real-world adversarial AIGC detection competition. Beyond attack scenarios, our method can also be used to construct high-quality adversarial datasets, providing valuable resources for training and evaluating more robust AIGC detection and defense systems.

[59] Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images

Sungjune Park,Hyunjun Kim,Beomchan Park,Yong Man Ro

Main category: cs.CV

TL;DR: The paper proposes LANGO, a language-guided object detection framework that addresses detection difficulties in aerial images caused by illumination and viewpoint variations.

Details Motivation: Aerial images exhibit multiple kinds of variation, such as illumination and viewpoint, which cause large differences in object appearance and make detection and recognition difficult. Inspired by how humans understand scene semantics, the authors aim to mitigate these problems through language-guided learning. Method: Designs a visual semantic reasoner that comprehends the visual semantics of image scenes, and proposes a relation learning loss to handle instance-level variations such as viewpoint and scale. Result: Experiments show noticeable improvements in detection performance. Conclusion: LANGO uses language-guided learning to mitigate both scene-level and instance-level variations in aerial images, improving object detection. Abstract: Despite recent advancements in computer vision research, object detection in aerial images still suffers from several challenges. One primary challenge to be mitigated is the presence of multiple types of variation in aerial images, for example, illumination and viewpoint changes. These variations result in highly diverse image scenes and drastic alterations in object appearance, so that it becomes more complicated to localize objects from the whole image scene and recognize their categories. To address this problem, in this paper, we introduce a novel object detection framework in aerial images, named LANGuage-guided Object detection (LANGO). Upon the proposed language-guided learning, the proposed framework is designed to alleviate the impacts from both scene and instance-level variations. First, we are motivated by the way humans understand the semantics of scenes while perceiving environmental factors in the scenes (e.g., weather). Therefore, we design a visual semantic reasoner that comprehends visual semantics of image scenes by interpreting conditions where the given images were captured. Second, we devise a training objective, named relation learning loss, to deal with instance-level variations, such as viewpoint angle and scale changes. This training objective aims to learn relations in language representations of object categories, with the help of the robust characteristics against such variations. Through extensive experiments, we demonstrate the effectiveness of the proposed method, and our method obtains noticeable detection performance improvements.

[60] WTEFNet: Real-Time Low-Light Object Detection for Advanced Driver-Assistance Systems

Hao Wu,Junzhou Chen,Ronghui Zhang,Nengchao Lyu,Hongyu Hu,Yanyong Guo,Tony Z. Qiu

Main category: cs.CV

TL;DR: WTEFNet is a real-time object detection framework designed for low-light scenes; its low-light enhancement, wavelet-based feature extraction, and adaptive fusion detection modules improve detection performance under low-light conditions.

Details Motivation: Existing RGB-camera-based object detectors degrade severely under low-light conditions, so an adaptable and efficient solution is needed. Method: WTEFNet comprises three core modules, Low-Light Enhancement (LLE), Wavelet-based Feature Extraction (WFE), and Adaptive Fusion Detection (AFFD), and introduces the GSN dataset to support training and evaluation. Result: On BDD100K, SHIFT, nuScenes, and GSN, WTEFNet achieves state-of-the-art detection accuracy under low-light conditions, and real-time operation is confirmed on an embedded platform. Conclusion: WTEFNet provides an efficient and adaptable object detection solution for ADAS applications in low-light environments. Abstract: Object detection is a cornerstone of environmental perception in advanced driver assistance systems (ADAS). However, most existing methods rely on RGB cameras, which suffer from significant performance degradation under low-light conditions due to poor image quality. To address this challenge, we propose WTEFNet, a real-time object detection framework specifically designed for low-light scenarios, with strong adaptability to mainstream detectors. WTEFNet comprises three core modules: a Low-Light Enhancement (LLE) module, a Wavelet-based Feature Extraction (WFE) module, and an Adaptive Fusion Detection (AFFD) module. The LLE enhances dark regions while suppressing overexposed areas; the WFE applies multi-level discrete wavelet transforms to isolate high- and low-frequency components, enabling effective denoising and structural feature retention; the AFFD fuses semantic and illumination features for robust detection. To support training and evaluation, we introduce GSN, a manually annotated dataset covering both clear and rainy night-time scenes. Extensive experiments on BDD100K, SHIFT, nuScenes, and GSN demonstrate that WTEFNet achieves state-of-the-art accuracy under low-light conditions. Furthermore, deployment on an embedded platform (NVIDIA Jetson AGX Orin) confirms the framework's suitability for real-time ADAS applications.
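The way a discrete wavelet transform separates low- and high-frequency components can be shown with a one-level Haar transform on a 1D signal. This is a sketch under stated assumptions: the abstract does not name the wavelet family or give the WFE module's design, so the Haar wavelet and both function names are illustrative choices.

```python
import math


def haar_dwt_1d(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficient lists.

    The approximation holds low-frequency content (scaled pairwise sums);
    the detail holds high-frequency content (scaled pairwise differences).
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d: reconstructs the original signal."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out
```

A multi-level transform repeats `haar_dwt_1d` on the approximation coefficients. Shrinking small detail coefficients before inverting is one common way such a decomposition supports denoising.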

[61] HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers

Aldino Rizaldy,Richard Gloaguen,Fabian Ewald Fassnacht,Pedram Ghamisi

Main category: cs.CV

TL;DR: The paper proposes a fully 3D point-cloud-based multimodal fusion method that uses a dual-branch Transformer to learn geometric and spectral features and fuses them across multiple scales via cross-attention.

Details Motivation: Existing methods typically rasterize 3D data into 2D formats, which fails to exploit the full potential of 3D data and limits the model's ability to learn 3D spatial features directly from raw point clouds. Method: A 3D point-cloud multimodal fusion method using a dual-branch Transformer and a cross-attention mechanism to fuse geometric and spectral features at multiple scales. Result: Experiments on several datasets show that 3D fusion is competitive with 2D methods while providing more flexible 3D predictions. Conclusion: 3D fusion is not only competitive in accuracy but also produces 3D predictions that 2D methods cannot, offering greater flexibility. Abstract: Multimodal remote sensing data, including spectral and lidar or photogrammetry, is crucial for achieving satisfactory land-use / land-cover classification results in urban scenes. So far, most studies have been conducted in a 2D context. When 3D information is available in the dataset, it is typically integrated with the 2D data by rasterizing the 3D data into 2D formats. Although this method yields satisfactory classification results, it falls short in fully exploiting the potential of 3D data by restricting the model's ability to learn 3D spatial features directly from raw point clouds. Additionally, it limits the generation of 3D predictions, as the dimensionality of the input data has been reduced. In this study, we propose a fully 3D-based method that fuses all modalities within the 3D point cloud and employs a dedicated dual-branch Transformer model to simultaneously learn geometric and spectral features. To enhance the fusion process, we introduce a cross-attention-based mechanism that fully operates on 3D points, effectively integrating features from various modalities across multiple scales. The purpose of cross-attention is to allow one modality to assess the importance of another by weighing the relevant features. We evaluated our method by comparing it against both 3D and 2D methods using the 2018 IEEE GRSS Data Fusion Contest (DFC2018) dataset. Our findings indicate that 3D fusion delivers competitive results compared to 2D methods and offers more flexibility by providing 3D predictions. These predictions can be projected onto 2D maps, a capability that is not feasible in reverse. Additionally, we evaluated our method on different datasets, specifically the ISPRS Vaihingen 3D and the IEEE 2019 Data Fusion Contest. Our code will be published here: https://github.com/aldinorizaldy/hyperpointformer.
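The statement that cross-attention lets one modality weigh the features of another can be made concrete with plain scaled dot-product cross-attention. In this sketch, queries would come from one branch (for example, geometric features) and keys and values from the other (for example, spectral features). Learned projection matrices, multiple heads, and multi-scale handling are omitted, so this is a generic illustration and not the paper's architecture.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention over lists of feature vectors.

    queries: vectors from modality A; keys/values: vectors from modality B.
    Each output is a weighted mix of B's values, weighted by relevance to A.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs
```

A query that matches no key in particular spreads its attention evenly, so the output is simply the mean of the other modality's values.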

[62] Navigating the Accuracy-Size Trade-Off with Flexible Model Merging

Akash Dhasade,Divyansh Jhunjhunwala,Milos Vujasinovic,Gauri Joshi,Anne-Marie Kermarrec

Main category: cs.CV

TL;DR: FlexMerge is a data-free model merging framework that flexibly generates merged models of different sizes to balance accuracy against deployment cost.

Details Motivation: Merging into a single model leaves an accuracy gap, while deploying every fine-tuned model is costly. Method: Treats fine-tuned models as sequences of blocks and merges them progressively, supporting multiple merging algorithms and flexible control over the merged size. Result: Experiments show that even modestly larger merged models substantially improve accuracy across a range of tasks. Conclusion: FlexMerge offers a flexible, efficient, data-free solution for diverse deployment scenarios. Abstract: Model merging has emerged as an efficient method to combine multiple single-task fine-tuned models. The merged model can enjoy multi-task capabilities without expensive training. While promising, merging into a single model often suffers from an accuracy gap with respect to individual fine-tuned models. On the other hand, deploying all individual fine-tuned models incurs high costs. We propose FlexMerge, a novel data-free model merging framework to flexibly generate merged models of varying sizes, spanning the spectrum from a single merged model to retaining all individual fine-tuned models. FlexMerge treats fine-tuned models as collections of sequential blocks and progressively merges them using any existing data-free merging method, halting at a desired size. We systematically explore the accuracy-size trade-off exhibited by different merging algorithms in combination with FlexMerge. Extensive experiments on vision and NLP benchmarks, with up to 30 tasks, reveal that even modestly larger merged models can provide substantial accuracy improvements over a single model. By offering fine-grained control over fused model size, FlexMerge provides a flexible, data-free, and high-performance solution for diverse deployment scenarios.
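The idea of merging block by block and halting at a size budget can be sketched as follows. Several things here are assumptions made for illustration: blocks are merged in front-to-back order, merging is plain parameter averaging (the abstract allows any data-free merging method), and `flex_merge` and `max_blocks` are hypothetical names. The abstract does not say how FlexMerge picks which blocks to merge next.

```python
def average_blocks(blocks):
    """Merge same-shaped blocks (lists of floats) by element-wise averaging."""
    n = len(blocks)
    return [sum(vals) / n for vals in zip(*blocks)]


def flex_merge(models, max_blocks):
    """Progressively merge block positions until at most `max_blocks` remain.

    models: list of fine-tuned models, each a list of blocks (lists of floats).
    Returns (layout, total), where layout[p] holds either one shared block
    or one block per task for block position p.
    """
    num_tasks = len(models)
    num_positions = len(models[0])
    # Start fully unmerged: every position keeps one block per task.
    layout = [[m[p] for m in models] for p in range(num_positions)]
    total = num_tasks * num_positions
    for p in range(num_positions):
        if total <= max_blocks:
            break  # desired size reached; keep remaining blocks task-specific
        layout[p] = [average_blocks(layout[p])]
        total -= num_tasks - 1
    return layout, total
```

Setting `max_blocks` to the number of block positions yields a single fully merged model, while setting it to tasks times positions keeps every fine-tuned model intact.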

[63] SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection

Wenhao Xu,Shuchen Zheng,Changwei Wang,Zherui Zhang,Chuan Ren,Rongtao Xu,Shibiao Xu

Main category: cs.CV

TL;DR: SAMamba is a new framework that combines SAM2's hierarchical feature learning with Mamba's selective sequence modeling; its FS-Adapter, CSI, and DPCF modules address information loss and global context modeling in infrared small target detection, and it significantly outperforms existing methods.

Details Motivation: Infrared small target detection is vital in military, maritime, and early warning applications, but existing deep learning methods suffer from information loss and inefficient global context modeling. Method: Proposes the SAMamba framework, with an FS-Adapter for natural-to-infrared domain adaptation, a CSI module for efficient global context modeling, and a DPCF module that balances multi-scale features. Result: On NUAA-SIRST and other datasets, SAMamba significantly outperforms existing methods, especially with complex backgrounds and targets of varying scale. Conclusion: SAMamba addresses the core challenges of infrared small target detection through domain adaptation, detail preservation, and long-range dependency modeling. Abstract: Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents SAMamba, a novel framework integrating SAM2's hierarchical feature learning with Mamba's selective sequence modeling. Key innovations include: (1) A Feature Selection Adapter (FS-Adapter) for efficient natural-to-infrared domain adaptation via dual-stage selection (token-level with a learnable task embedding and channel-wise adaptive transformations); (2) A Cross-Channel State-Space Interaction (CSI) module for efficient global context modeling with linear complexity using selective state space modeling; and (3) A Detail-Preserving Contextual Fusion (DPCF) module that adaptively combines multi-scale features with a gating mechanism to balance high-resolution and low-resolution feature contributions. SAMamba addresses core ISTD challenges by bridging the domain gap, maintaining fine-grained details, and efficiently modeling long-range dependencies. Experiments on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets show SAMamba significantly outperforms state-of-the-art methods, especially in challenging scenarios with heterogeneous backgrounds and varying target scales. Code: https://github.com/zhengshuchen/SAMamba.
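A gating mechanism that balances high-resolution and low-resolution features is commonly written as a sigmoid-weighted blend, sketched below. This is a generic form and an assumption on my part: the abstract does not give the DPCF module's equations, and in a real network the gate logits would be predicted from the features rather than passed in by hand.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(high_res, low_res, gate_logits):
    """Blend two feature vectors element-wise: g * high + (1 - g) * low.

    gate_logits are hypothetical per-element logits; g = sigmoid(logit).
    """
    fused = []
    for h, l, logit in zip(high_res, low_res, gate_logits):
        g = sigmoid(logit)
        fused.append(g * h + (1.0 - g) * l)
    return fused
```

A gate near 1 keeps fine detail from the high-resolution branch, which matters for tiny targets, while a gate near 0 defers to the coarser context.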

[64] UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes

Yixun Liang,Kunming Luo,Xiao Chen,Rui Chen,Hongyu Yan,Weiyu Li,Jiarui Liu,Ping Tan

Main category: cs.CV

TL;DR: UniTEX is a novel two-stage 3D texture generation framework that operates directly in a unified 3D functional space, avoiding the limitations of UV mapping and producing high-quality, consistent 3D textures.

Details Motivation: Existing methods rely on UV-based inpainting to refine textures, which suffers from topological ambiguity; UniTEX aims to bypass these limitations by generating textures directly in 3D space. Method: 1. Proposes Texture Functions (TFs) to lift texture generation into 3D space; 2. Uses a transformer-based Large Texturing Model (LTM) to predict TFs directly from image and geometry inputs; 3. Adopts a LoRA-based strategy to efficiently adapt large-scale Diffusion Transformers (DiTs) for multi-view texture synthesis. Result: Experiments show that UniTEX surpasses existing methods in visual quality and texture integrity, offering a scalable solution for automated 3D texture generation. Conclusion: UniTEX achieves high-quality 3D texture generation through its 3D functional space and an efficient model adaptation strategy; the code is to be released. Abstract: We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based inpainting to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we propose to bypass the limitations of UV mapping by operating directly in a unified 3D functional space. Specifically, we first lift texture generation into 3D space via Texture Functions (TFs), a continuous, volumetric representation that maps any 3D point to a texture value based solely on surface proximity, independent of mesh topology. Then, we propose to predict these TFs directly from images and geometry inputs using a transformer-based Large Texturing Model (LTM). To further enhance texture quality and leverage powerful 2D priors, we develop an advanced LoRA-based strategy for efficiently adapting large-scale Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation. Code will be available at: https://github.com/YixunLiang/UniTEX.

[65] Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

Zheng Sun,Yi Wei,Long Yu

Main category: cs.CV

TL;DR: The paper proposes a complete solution, covering both data and methodology, for the weak performance of multimodal large language models (MLLMs) on medical image screening. By collecting a dataset and introducing a reinforcement learning method (DPA-GRPO), it improves the models' image aesthetic reasoning ability.

Details Motivation: Current MLLMs perform poorly on image screening, mainly because of a lack of data and weak aesthetic reasoning ability; the paper aims to address both through improved data and methodology. Method: Collects a medical image screening dataset with 1500+ samples and proposes the DPA-GRPO reinforcement learning method, combined with long chains of thought (CoT), to strengthen aesthetic reasoning. Result: Experiments show that even advanced closed-source MLLMs such as GPT-4o perform close to random guessing on aesthetic reasoning, whereas the proposed method surpasses these large models using a much smaller model. Conclusion: The proposed solution offers a workable configuration for image aesthetic reasoning that the authors hope will become standard practice. Abstract: Multimodal Large Language Models (MLLMs) are widely applied across many domains, such as multimodal understanding and generation. With the development of diffusion models (DM) and unified MLLMs, the performance of image generation has been significantly improved. However, the study of image screening is rare, and its performance with MLLMs is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability of MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive medical image screening dataset with 1500+ samples; each sample consists of a medical image, four generated images, and a multiple-choice answer. The dataset evaluates the aesthetic reasoning ability under four aspects: (1) Appearance Deformation, (2) Principles of Physical Lighting and Shadow, (3) Placement Layout, (4) Extension Rationality. For methodology, we utilize long chains of thought (CoT) and Group Relative Policy Optimization with Dynamic Proportional Accuracy reward, called DPA-GRPO, to enhance the image aesthetic reasoning ability of MLLMs. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT-4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the reinforcement learning approach, we are able to surpass the score of both large-scale models and leading closed-source models using a much smaller model. We hope our attempt on medical image screening will serve as a regular configuration in image aesthetic reasoning in the future.

[66] Unsupervised Transcript-assisted Video Summarization and Highlight Detection

Spyros Barbakos,Charalampos Antoniadis,Gerasimos Potamianos,Gianluca Setti

Main category: cs.CV

TL;DR: The paper proposes a reinforcement-learning-based multimodal method for video summarization and highlight detection that combines video frames with transcripts to produce a more condensed video and detect highlights.

Details Motivation: Watching videos is a key part of daily life, but watching entire videos can be tedious. Existing methods do not integrate both video frames and transcripts within a reinforcement learning framework. Method: Proposes a multimodal pipeline that uses video frames and transcripts with a modality fusion mechanism to generate video summaries and detect highlights, trained within a reinforcement learning framework. Result: Experiments show that using the transcript for video summarization and highlight detection outperforms relying on visual content alone. Conclusion: Unsupervised training overcomes the limited size of annotated datasets, and multimodal fusion yields better results. Abstract: Video consumption is a key part of daily life, but watching entire videos can be tedious. To address this, researchers have explored video summarization and highlight detection to identify key video segments. While some works combine video frames and transcripts, and others tackle video summarization and highlight detection using Reinforcement Learning (RL), no existing work, to the best of our knowledge, integrates both modalities within an RL framework. In this paper, we propose a multimodal pipeline that leverages video frames and their corresponding transcripts to generate a more condensed version of the video and detect highlights using a modality fusion mechanism. The pipeline is trained within an RL framework, which rewards the model for generating diverse and representative summaries while ensuring the inclusion of video segments with meaningful transcript content. The unsupervised nature of the training allows for learning from large-scale unannotated datasets, overcoming the challenge posed by the limited size of existing annotated datasets. Our experiments show that using the transcript in video summarization and highlight detection achieves superior results compared to relying solely on the visual content of the video.

[67] LADA: Scalable Label-Specific CLIP Adapter for Continual Learning

Mao-Lin Luo,Zi-Hao Zhou,Tong Wei,Min-Ling Zhang

Main category: cs.CV

TL;DR: LADA (Label-specific ADApter) is a continual learning method for CLIP that appends lightweight label-specific memory units, avoiding parameter partitioning, and uses feature distillation to prevent catastrophic forgetting, achieving efficient training and state-of-the-art performance.

Details Motivation: Existing CLIP-based methods partition parameters across tasks in continual learning, which causes parameter-selection errors at inference and degrades performance. LADA aims to solve this problem. Method: LADA appends label-specific memory units after the frozen CLIP image encoder, uses feature distillation to prevent catastrophic forgetting, and blocks gradient flow to the CLIP parameters. Result: LADA achieves state-of-the-art performance on continual learning tasks. Conclusion: Through its lightweight design and feature distillation, LADA addresses both parameter partitioning and forgetting in continual learning. Abstract: Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging its transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to error that degrades performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The implementation code is available at https://github.com/MaolinLuo/LADA.

[68] Are MLMs Trapped in the Visual Room?

Yazhou Zhang,Chunwang Zou,Qimeng Liu,Lu Rong,Ben Yao,Zheng Lian,Qiuchi Li,Peng Zhang,Jing Qin

Main category: cs.CV

TL;DR: The paper asks whether multi-modal large models (MLMs) can truly "understand" images, proposes the "Visual Room" argument, and exposes their limitations through a two-tier evaluation framework covering perception and cognition.

Details Motivation: To challenge the prevailing assumption that perceptual ability equals genuine understanding, using the "Visual Room" argument to question whether MLMs truly understand. Method: Proposes a two-tier evaluation framework (perception and cognition), builds a high-quality multi-modal sarcasm dataset, and evaluates eight SoTA MLMs. Result: MLMs perform well on perception tasks but show an error rate of about 16.1% in sarcasm understanding, revealing a significant gap between perception and understanding. Conclusion: The results empirically support the "Visual Room" argument and offer a new evaluation paradigm for MLMs, highlighting deficiencies in emotional reasoning and commonsense inference. Abstract: Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the Visual Room argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. In implementation, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual contents, while the cognitive component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising both 924 static images and 100 dynamic videos. All sarcasm labels are annotated by the original authors and verified by independent reviewers to ensure clarity and consistency. We evaluate eight state-of-the-art (SoTA) MLMs. Our results highlight three key findings: (1) MLMs perform well on perception tasks; (2) even with correct perception, models exhibit an average error rate of approximately 16.1% in sarcasm understanding, revealing a significant gap between seeing and understanding; (3) error analysis attributes this gap to deficiencies in emotional reasoning, commonsense inference, and context alignment. This work provides empirical grounding for the proposed Visual Room argument and offers a new evaluation paradigm for MLMs.

[69] Holistic Large-Scale Scene Reconstruction via Mixed Gaussian Splatting

Chuandong Liu,Huijiao Wang,Lei Yu,Gui-Song Xia

Main category: cs.CV

TL;DR: MixGS proposes a holistically optimized 3D Gaussian Splatting framework that addresses the loss of global information and the parameter complexity caused by divide-and-conquer strategies, achieving high-quality rendering with efficient computation.

Details Motivation: Existing large-scale scene reconstruction methods rely on divide-and-conquer strategies, which lose global information and require complex parameter tuning; MixGS aims to solve these problems. Method: MixGS integrates camera pose and Gaussian attributes into a view-aware representation, decodes it into fine-detailed Gaussians, and uses a mixing operation to preserve global coherence and local fidelity. Result: Experiments show that MixGS achieves state-of-the-art rendering quality and competitive speed on large-scale scenes while significantly reducing computational requirements. Conclusion: MixGS is an efficient method for large-scale 3D scene reconstruction that supports training on a single GPU; the code will be released. Abstract: Recent advances in 3D Gaussian Splatting have shown remarkable potential for novel view synthesis. However, most existing large-scale scene reconstruction methods rely on the divide-and-conquer paradigm, which often leads to the loss of global scene information and requires complex parameter tuning due to scene partitioning and local optimization. To address these limitations, we propose MixGS, a novel holistic optimization framework for large-scale 3D scene reconstruction. MixGS models the entire scene holistically by integrating camera pose and Gaussian attributes into a view-aware representation, which is decoded into fine-detailed Gaussians. Furthermore, a novel mixing operation combines decoded and original Gaussians to jointly preserve global coherence and local fidelity. Extensive experiments on large-scale scenes demonstrate that MixGS achieves state-of-the-art rendering quality and competitive speed, while significantly reducing computational requirements, enabling large-scale scene reconstruction training on a single 24GB VRAM GPU. The code will be released at https://github.com/azhuantou/MixGS.

[70] RSFAKE-1M: A Large-Scale Dataset for Detecting Diffusion-Generated Remote Sensing Forgeries

Zhihong Tan,Jiayi Wang,Huiying Shi,Binyuan Huang,Hongchen Wei,Zhenzhong Chen

Main category: cs.CV

TL;DR: The paper introduces RSFAKE-1M, a dataset for detecting diffusion-generated forged remote sensing images, and shows that it improves the generalization and robustness of detection methods.

Details Motivation: Remote sensing imagery is vital in areas such as environmental monitoring and urban planning, but existing benchmarks mainly target GAN-based forgeries or natural images, leaving diffusion-generated forgeries underexplored. Method: Builds RSFAKE-1M, a dataset of 500K forged and 500K real remote sensing images covering six generation conditions, including text prompts and structural guidance, and evaluates it experimentally. Result: Current methods struggle to detect diffusion-generated remote sensing forgeries, while models trained on RSFAKE-1M perform notably better. Conclusion: RSFAKE-1M provides an important foundation for remote sensing forgery detection; the dataset is publicly available. Abstract: Detecting forged remote sensing images is becoming increasingly critical, as such imagery plays a vital role in environmental monitoring, urban planning, and national security. While diffusion models have emerged as the dominant paradigm for image generation, their impact on remote sensing forgery detection remains underexplored. Existing benchmarks primarily target GAN-based forgeries or focus on natural images, limiting progress in this critical domain. To address this gap, we introduce RSFAKE-1M, a large-scale dataset of 500K forged and 500K real remote sensing images. The fake images are generated by ten diffusion models fine-tuned on remote sensing data, covering six generation conditions such as text prompts, structural guidance, and inpainting. This paper presents the construction of RSFAKE-1M along with a comprehensive experimental evaluation using both existing detectors and unified baselines. The results reveal that diffusion-based remote sensing forgeries remain challenging for current methods, and that models trained on RSFAKE-1M exhibit notably improved generalization and robustness. Our findings underscore the importance of RSFAKE-1M as a foundation for developing and evaluating next-generation forgery detection approaches in the remote sensing domain. The dataset and other supplementary materials are available at https://huggingface.co/datasets/TZHSW/RSFAKE/.

[71] GenCAD-Self-Repairing: Feasibility Enhancement for 3D CAD Generation

Chikaha Tsuji,Enrique Flores Medina,Harshit Gupta,Md Ferdous Alam

Main category: cs.CV

TL;DR: GenCAD-Self-Repairing uses diffusion guidance and a self-repairing pipeline to improve the feasibility of generated CAD models, addressing GenCAD's tendency to produce infeasible boundary representations.

Details Motivation: About 10% of the designs GenCAD generates are infeasible, which limits its practical use; the work aims to raise feasibility by improving the framework. Method: Combines a diffusion-guided denoising process in the latent space with a regression-based correction mechanism to refine infeasible CAD command sequences while preserving geometric accuracy. Result: Converts two-thirds of the baseline's infeasible designs into feasible ones, raising the feasibility rate while maintaining reasonable geometric accuracy. Conclusion: The method improves the applicability of AI-driven CAD generation and helps provide higher-quality training data for manufacturing, architecture, and product design. Abstract: With the advancement of generative AI, research on its application to 3D model generation has gained traction, particularly in automating the creation of Computer-Aided Design (CAD) files from images. GenCAD is a notable model in this domain, leveraging an autoregressive transformer-based architecture with a contrastive learning framework to generate CAD programs. However, a major limitation of GenCAD is its inability to consistently produce feasible boundary representations (B-reps), with approximately 10% of generated designs being infeasible. To address this, we propose GenCAD-Self-Repairing, a framework that enhances the feasibility of generative CAD models through diffusion guidance and a self-repairing pipeline. This framework integrates a guided diffusion denoising process in the latent space and a regression-based correction mechanism to refine infeasible CAD command sequences while preserving geometric accuracy. Our approach successfully converted two-thirds of infeasible designs in the baseline method into feasible ones, significantly improving the feasibility rate while simultaneously maintaining a reasonable level of geometric accuracy between the point clouds of ground truth models and generated models. By significantly improving the feasibility rate of generating CAD models, our approach helps expand the availability of high-quality training data and enhances the applicability of AI-driven CAD generation in manufacturing, architecture, and product design.
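The two figures in this abstract (roughly 10% infeasible designs, two-thirds of them repaired) imply an overall feasibility rate that a quick calculation makes explicit. This is a back-of-the-envelope sketch, not a reported result: it assumes the repair step leaves already-feasible designs unchanged, and the function name is hypothetical.

```python
def feasibility_after_repair(infeasible_rate, repaired_fraction):
    """Overall feasible fraction after repairing part of the infeasible set.

    Assumes designs that were already feasible stay feasible.
    """
    remaining_infeasible = infeasible_rate * (1.0 - repaired_fraction)
    return 1.0 - remaining_infeasible
```

Under these assumptions, `feasibility_after_repair(0.10, 2 / 3)` rises from a baseline of 0.90 to about 0.967, so roughly 3.3% of designs would remain infeasible.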

[72] Federated Unsupervised Semantic Segmentation

Evangelos Charalampakis,Vasileios Mygdalis,Ioannis Pitas

Main category: cs.CV

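TL;DR: The paper proposes FUSS, the first framework for fully decentralized, label-free federated semantic image segmentation. Its strategies for global consistency in feature and prototype space let it clearly outperform local-only training and classical federated learning methods.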

Details Motivation: Explore federated learning for unsupervised semantic image segmentation, where feature representations and cluster centroids must be aligned across distributed clients. Method: FUSS combines local segmentation heads with shared semantic centroids and optimizes for global consistency in feature and prototype space. Result: On benchmark and real-world datasets, FUSS outperforms local-only training and classical federated learning methods on both binary and multi-class tasks. Conclusion: FUSS offers an effective solution for unsupervised federated semantic segmentation; code will be released to support reproducibility. Abstract: This work explores the application of Federated Learning (FL) in Unsupervised Semantic image Segmentation (USS). Recent USS methods extract pixel-level features using frozen visual foundation models and refine them through self-supervised objectives that encourage semantic grouping. These features are then grouped into semantic clusters to produce segmentation masks. Extending these ideas to federated settings requires feature representation and cluster centroid alignment across distributed clients -- an inherently difficult task under heterogeneous data distributions in the absence of supervision. To address this, we propose FUSS (Federated Unsupervised image Semantic Segmentation), which is, to our knowledge, the first framework to enable fully decentralized, label-free semantic segmentation training. FUSS introduces novel federation strategies that promote global consistency in feature and prototype space, jointly optimizing local segmentation heads and shared semantic centroids. Experiments on both benchmark and real-world datasets, including binary and multi-class segmentation tasks, show that FUSS consistently outperforms local-only client training as well as extensions of classical FL algorithms under varying client data distributions. To support reproducibility, full code will be released upon manuscript acceptance.
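For context, the "classical FL" baseline that the abstract compares against can be pictured as a FedAvg-style, size-weighted average of each client's cluster centroids. The sketch below illustrates only that baseline idea, not the FUSS federation strategy, which the abstract does not spell out. The function name `fedavg_centroids` is hypothetical, and the code assumes that cluster index `k` means the same thing on every client, which is exactly the alignment problem the abstract calls difficult.

```python
def fedavg_centroids(client_centroids, client_sizes):
    """Size-weighted average of per-client centroid sets (FedAvg-style).

    client_centroids: list over clients of [n_clusters][dim] float lists.
    client_sizes: number of local samples per client, used as weights.
    Assumption: cluster k is the same semantic cluster on every client.
    """
    total = sum(client_sizes)
    n_clusters = len(client_centroids[0])
    dim = len(client_centroids[0][0])
    merged = [[0.0] * dim for _ in range(n_clusters)]
    for centroids, size in zip(client_centroids, client_sizes):
        weight = size / total
        for k in range(n_clusters):
            for d in range(dim):
                merged[k][d] += weight * centroids[k][d]
    return merged


if __name__ == "__main__":
    # Two clients, one 2-D centroid each; the second client holds 3x the data.
    print(fedavg_centroids([[[0.0, 0.0]], [[4.0, 8.0]]], [1, 3]))
```

Under heterogeneous client data the index-alignment assumption generally fails, which is one way to read the abstract's motivation for promoting consistency in prototype space.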

[73] TRACE: Trajectory-Constrained Concept Erasure in Diffusion Models

Finn Carter

Main category: cs.CV

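TL;DR: TRACE is a new method for erasing specific concepts from diffusion models while preserving generation quality. It pairs a theoretical framework with a fine-tuning procedure and performs strongly on multiple benchmarks.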

Details Motivation: Diffusion models can generate undesirable content (e.g., pornography, sensitive identities, copyrighted styles), raising privacy, fairness, and safety concerns. Concept erasure aims to remove or suppress such content. Method: TRACE establishes formal conditions for concept suppression through a theoretical framework and combines them with a fine-tuning procedure that modifies the cross-attention layers to remove hidden representations of the target concept. Result: TRACE performs strongly on multiple benchmarks, outperforming ANT, EraseAnything, and MACE. Conclusion: TRACE is an effective concept erasure method that improves the safety and practicality of diffusion models. Abstract: Text-to-image diffusion models have shown unprecedented generative capability, but their ability to produce undesirable concepts (e.g., pornographic content, sensitive identities, copyrighted styles) poses serious concerns for privacy, fairness, and safety. Concept erasure aims to remove or suppress specific concept information in a generative model. In this paper, we introduce TRACE (Trajectory-Constrained Attentional Concept Erasure), a novel method to erase targeted concepts from diffusion models while preserving overall generative quality. Our approach combines a rigorous theoretical framework, establishing formal conditions under which a concept can be provably suppressed in the diffusion process, with an effective fine-tuning procedure compatible with both conventional latent diffusion (Stable Diffusion) and emerging rectified flow models (e.g., FLUX). We first derive a closed-form update to the model's cross-attention layers that removes hidden representations of the target concept. We then introduce a trajectory-aware finetuning objective that steers the denoising process away from the concept only in the late sampling stages, thus maintaining the model's fidelity on unrelated content. Empirically, we evaluate TRACE on multiple benchmarks used in prior concept erasure studies (object classes, celebrity faces, artistic styles, and explicit content from the I2P dataset). TRACE achieves state-of-the-art performance, outperforming recent methods such as ANT, EraseAnything, and MACE in terms of removal efficacy and output quality.

[74] Adversarial Semantic and Label Perturbation Attack for Pedestrian Attribute Recognition

Weizhe Kong,Xiao Wang,Ruichong Gao,Chenglong Li,Yu Zhang,Xing Yang,Yaowei Wang,Jin Tang

Main category: cs.CV

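TL;DR: The paper proposes the first adversarial attack and defense framework for pedestrian attribute recognition (PAR), combining global and patch-level attacks with a defense strategy whose effectiveness is validated experimentally.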

Details Motivation: Although PAR has progressed with deep neural networks, its robustness to interference and its potential vulnerabilities remain underexplored. Method: Built on a CLIP-based PAR framework that fuses visual and text features with a multi-modal Transformer, the paper proposes an adversarial semantic and label perturbation attack (ASL-PAR) and a semantic offset defense strategy. Result: The attack and defense strategies are validated on multiple datasets in both digital and physical domains. Conclusion: The proposed framework exposes PAR vulnerabilities and offers a defense that suppresses the influence of adversarial attacks; code will be released. Abstract: Pedestrian Attribute Recognition (PAR) is an indispensable task in human-centered research and has made great progress in recent years with the development of deep neural networks. However, its potential vulnerability and anti-interference ability have still not been fully explored. To bridge this gap, this paper proposes the first adversarial attack and defense framework for pedestrian attribute recognition. Specifically, we exploit both global- and patch-level attacks on the pedestrian images, based on the pre-trained CLIP-based PAR framework. It first divides the input pedestrian image into non-overlapping patches and embeds them into feature embeddings using a projection layer. Meanwhile, the attribute set is expanded into sentences using prompts and embedded into attribute features using a pre-trained CLIP text encoder. A multi-modal Transformer is adopted to fuse the obtained vision and text tokens, and a feed-forward network is utilized for attribute recognition. Based on the aforementioned PAR framework, we adopt adversarial semantic and label perturbation to generate the adversarial noise, termed ASL-PAR. We also design a semantic offset defense strategy to suppress the influence of adversarial attacks. Extensive experiments conducted on both digital domains (i.e., PETA, PA100K, MSP60K, RAPv2) and physical domains fully validated the effectiveness of our proposed adversarial attack and defense strategies for pedestrian attribute recognition. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR.

[75] Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

Hengyuan Cao,Yutong Feng,Biao Gong,Yijing Tian,Yunhong Lu,Chuang Liu,Bin Wang

Main category: cs.CV

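TL;DR: The paper proposes DRA-Ctrl, a paradigm for video-to-image knowledge compression and task adaptation that exploits the strengths of video models to support controllable image generation tasks.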

Details Motivation: Investigate whether a fully trained, high-dimensional video generative model can effectively support lower-dimensional tasks such as controllable image generation, in order to tap the potential of video models. Method: The DRA-Ctrl paradigm includes a mixup-based transition strategy and a redesigned attention structure to bridge the gap between video frames and image generation. Result: Experiments show that repurposed video models outperform image models trained directly on images across a range of image generation tasks. Conclusion: DRA-Ctrl demonstrates the potential of large-scale video generators for broader visual applications and lays a foundation for unified generative models across visual modalities. Abstract: Video generative models can be regarded as world simulators due to their ability to capture dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed Dimension-Reduction Attack (DRA-Ctrl), which utilizes the strengths of video models, including long-range context modeling and flattened full attention, to perform various generation tasks. Specifically, to address the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays a foundation for future unified generative models across visual modalities. The project page is https://dra-ctrl-2025.github.io/DRA-Ctrl/.

[76] Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization

Matteo Gallici,Haitz Sáez de Ocáriz Borde

Main category: cs.CV

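TL;DR: GRPO is used to fine-tune visual autoregressive models; the RL procedure improves image quality, gives control over generation style, and generalizes to styles absent from the pre-training data.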

Details Motivation: Study how RL fine-tuning of pre-trained generative models can better match human preferences, with a focus on visual autoregressive models. Method: Group Relative Policy Optimization (GRPO) fine-tunes visual autoregressive models, with reward signals derived from CLIP embeddings and aesthetic predictors. Result: Image quality improves significantly, generation style can be controlled precisely, and the models generalize to styles outside the pre-training data. Conclusion: RL fine-tuning is efficient and effective for visual autoregressive models, whose fast inference suits online sampling, an aspect that is challenging for diffusion-based alternatives. Abstract: Fine-tuning pre-trained generative models with Reinforcement Learning (RL) has emerged as an effective approach for aligning outputs more closely with nuanced human preferences. In this paper, we investigate the application of Group Relative Policy Optimization (GRPO) to fine-tune next-scale visual autoregressive (VAR) models. Our empirical results demonstrate that this approach enables alignment to intricate reward signals derived from aesthetic predictors and CLIP embeddings, significantly enhancing image quality and enabling precise control over the generation style. Interestingly, by leveraging CLIP, our method can help VAR models generalize beyond their initial ImageNet distribution: through RL-driven exploration, these models can generate images aligned with prompts referencing image styles that were absent during pre-training. In summary, we show that RL-based fine-tuning is both efficient and effective for VAR models, benefiting particularly from their fast inference speeds, which are advantageous for online sampling, an aspect that poses significant challenges for diffusion-based alternatives.
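The core of GRPO is that each sample's advantage is computed relative to the other samples drawn for the same prompt, so no learned value function is needed. A minimal sketch of that normalization step follows. The function name, the use of the population standard deviation, and the `eps` constant are illustrative assumptions; the abstract does not give the paper's exact reward weighting or normalization.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their siblings get a positive learning signal.
    Assumption: population std and eps=1e-8; implementations differ here.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four images sampled from one prompt, e.g. a
    # combination of a CLIP similarity score and an aesthetic score.
    print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

Because the statistics are recomputed per group, the approach needs several samples per prompt at every update, which is where the fast inference of VAR models mentioned in the abstract becomes relevant.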

[77] DSAGL: Dual-Stream Attention-Guided Learning for Weakly Supervised Whole Slide Image Classification

Daoxi Cao,Hangbei Cheng,Yijin Li,Ruolin Zhou,Xinyi Li,Xuehan Zhang,Binwei Li,Xuancheng Gu,Xueyu Liu,Yongfei Wu

Main category: cs.CV

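TL;DR: DSAGL is a novel weakly supervised classification framework that combines a teacher-student architecture with a dual-stream design. Multi-scale attention-based pseudo labels address instance-level ambiguity and bag-level semantic consistency, and the method outperforms existing approaches on several datasets.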

Details Motivation: Whole-slide images (WSIs) are critical for cancer diagnosis because of their ultra-high resolution and rich semantic content, but their massive size and the scarcity of fine-grained annotations challenge conventional supervised learning. Method: DSAGL uses a teacher-student architecture with a dual-stream design, generates multi-scale attention-based pseudo labels, employs a lightweight encoder (VSSMamba) and a fusion-attentive module (FASA), and adds a hybrid loss to keep the two streams consistent. Result: On CIFAR-10, NCT-CRC, and TCGA-Lung, DSAGL consistently outperforms state-of-the-art MIL baselines, with stronger discriminative performance and robustness. Conclusion: Dual-stream attention-guided learning effectively addresses the weak-supervision challenges of WSI classification and offers a new direction for medical image analysis. Abstract: Whole-slide images (WSIs) are critical for cancer diagnosis due to their ultra-high resolution and rich semantic content. However, their massive size and the limited availability of fine-grained annotations pose substantial challenges for conventional supervised learning. We propose DSAGL (Dual-Stream Attention-Guided Learning), a novel weakly supervised classification framework that combines a teacher-student architecture with a dual-stream design. DSAGL explicitly addresses instance-level ambiguity and bag-level semantic consistency by generating multi-scale attention-based pseudo labels and guiding instance-level learning. A shared lightweight encoder (VSSMamba) enables efficient long-range dependency modeling, while a fusion-attentive module (FASA) enhances focus on sparse but diagnostically relevant regions. We further introduce a hybrid loss to enforce mutual consistency between the two streams. Experiments on CIFAR-10, NCT-CRC, and TCGA-Lung datasets demonstrate that DSAGL consistently outperforms state-of-the-art MIL baselines, achieving superior discriminative performance and robustness under weak supervision.

[78] Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering

Sixian Wang,Zhiwei Tang,Tsung-Hui Chang

Main category: cs.CV

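TL;DR: The paper proposes CFG-Rejection, which filters low-quality samples early by analyzing the Accumulated Score Differences (ASD) along the denoising trajectory, without external reward signals or model retraining.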

Details Motivation: Randomness in diffusion sampling leads to inconsistent sample quality, and existing remedies (such as DDPO and inference-time alignment) are computationally expensive and depend on external reward signals. Method: The authors find that sample quality correlates with the accumulated difference between conditional and unconditional scores (ASD) along the denoising trajectory, and propose CFG-Rejection to filter low-quality samples early. Result: Experiments show that CFG-Rejection markedly improves human preference scores (HPSv2, PickScore) and benchmark results (GenEval, DPG-Bench) in image generation. Conclusion: CFG-Rejection offers a new route to efficient, high-quality sample generation and is expected to extend to generative modalities beyond images. Abstract: Diffusion models often exhibit inconsistent sample quality due to stochastic variations inherent in their sampling trajectories. Although training-based fine-tuning (e.g., DDPO [1]) and inference-time alignment techniques [2] aim to improve sample fidelity, they typically necessitate full denoising processes and external reward signals. This incurs substantial computational costs, hindering their broader applicability. In this work, we unveil an intriguing phenomenon: a previously unobserved yet exploitable link between sample quality and characteristics of the denoising trajectory during classifier-free guidance (CFG). Specifically, we identify a strong correlation between high-density regions of the sample distribution and the Accumulated Score Differences (ASD)--the cumulative divergence between conditional and unconditional scores. Leveraging this insight, we introduce CFG-Rejection, an efficient, plug-and-play strategy that filters low-quality samples at an early stage of the denoising process, crucially without requiring external reward signals or model retraining. Importantly, our approach necessitates no modifications to model architectures or sampling schedules and maintains full compatibility with existing diffusion frameworks. We validate the effectiveness of CFG-Rejection in image generation through extensive experiments, demonstrating marked improvements on human preference scores (HPSv2, PickScore) and challenging benchmarks (GenEval, DPG-Bench). We anticipate that CFG-Rejection will offer significant advantages for diverse generative modalities beyond images, paving the way for more efficient and reliable high-quality sample generation.
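The filtering idea can be sketched as follows: accumulate, over the first few denoising steps, the gap between the conditional and unconditional score predictions that CFG already computes, then keep only a fraction of candidates ranked by that quantity. Everything concrete here is an assumption for illustration: the L2 norm as the per-step "difference", the function names, and especially the `keep_high` flag, since the abstract states a correlation between ASD and high-density regions without saying which direction should be kept.

```python
import math


def accumulated_score_difference(cond_scores, uncond_scores):
    """Sum, over the observed steps, of the L2 norm of (cond - uncond).

    cond_scores / uncond_scores: per-step score vectors (lists of floats).
    Assumption: L2 norm is the per-step measure of divergence.
    """
    total = 0.0
    for cond, uncond in zip(cond_scores, uncond_scores):
        total += math.sqrt(sum((c - u) ** 2 for c, u in zip(cond, uncond)))
    return total


def keep_indices(asd_values, keep_fraction, keep_high=True):
    """Return sorted indices of the candidates that survive filtering.

    keep_high is a labeled assumption about which end of the ASD ranking
    corresponds to better samples.
    """
    n_keep = max(1, round(len(asd_values) * keep_fraction))
    ranked = sorted(range(len(asd_values)),
                    key=lambda i: asd_values[i], reverse=keep_high)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    asd = accumulated_score_difference([[1.0, 0.0], [0.0, 2.0]],
                                       [[0.0, 0.0], [0.0, 0.0]])
    print(asd, keep_indices([0.5, 3.0, 1.0, 2.0], keep_fraction=0.5))
```

Since both score predictions are already produced at every CFG step, a statistic of this kind adds almost no compute, which is consistent with the abstract's plug-and-play claim.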

[79] Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching

Yexiong Lin,Yu Yao,Tongliang Liu

Main category: cs.CV

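TL;DR: The paper proposes Model-Aligned Coupling (MAC), which selects training couplings using both geometric distance and the model's prediction error, significantly improving generation quality and efficiency.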

Details Motivation: Existing couplings based on geometric distance (such as Optimal Transport) may not match the model's preferred trajectories, making straight trajectories hard to learn. Method: MAC trains on the couplings with the lowest prediction error, avoiding a time-consuming matching process. Result: Experiments show that MAC clearly outperforms existing methods in few-step generation. Conclusion: MAC's model-aligned coupling strategy effectively improves generation performance. Abstract: Flow Matching (FM) is an effective framework for training a model to learn a vector field that transports samples from a source distribution to a target distribution. To train the model, early FM methods use random couplings, which often result in crossing paths and lead the model to learn non-straight trajectories that require many integration steps to generate high-quality samples. To address this, recent methods adopt Optimal Transport (OT) to construct couplings by minimizing geometric distances, which helps reduce path crossings. However, we observe that such geometry-based couplings do not necessarily align with the model's preferred trajectories, making it difficult to learn the vector field induced by these couplings, which prevents the model from learning straight trajectories. Motivated by this, we propose Model-Aligned Coupling (MAC), an effective method that matches training couplings based not only on geometric distance but also on alignment with the model's preferred transport directions based on its prediction error. To avoid a time-consuming matching process, MAC selects the top-$k$ fraction of couplings with the lowest error for training. Extensive experiments show that MAC significantly improves generation quality and efficiency in few-step settings compared to existing methods. Project page: https://yexionglin.github.io/mac
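The selection step described in the abstract can be sketched in a few lines: score every candidate coupling by the model's current prediction error and train only on the lowest-error top-$k$ fraction. The error definition below assumes the common linear-path flow-matching target $x_1 - x_0$; the function names and the `model(x_t, t)` callable are hypothetical, and the abstract does not say how candidate couplings are produced in the first place.

```python
def fm_prediction_error(model, x0, x1, t):
    """Squared error between the model's velocity and the straight-line target.

    Assumption: linear interpolation path x_t = (1 - t) * x0 + t * x1 with
    target velocity x1 - x0. `model` is a hypothetical callable (x_t, t) -> v.
    """
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = model(xt, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target))


def select_low_error_couplings(couplings, errors, k_fraction):
    """Keep the k_fraction of couplings with the lowest prediction error."""
    n_keep = max(1, int(len(couplings) * k_fraction))
    order = sorted(range(len(couplings)), key=lambda i: errors[i])
    return [couplings[i] for i in order[:n_keep]]


if __name__ == "__main__":
    zero_model = lambda xt, t: [0.0 for _ in xt]  # stand-in for a network
    pairs = [([0.0, 0.0], [1.0, 2.0]), ([0.0, 0.0], [0.1, 0.1])]
    errs = [fm_prediction_error(zero_model, x0, x1, 0.5) for x0, x1 in pairs]
    print(select_low_error_couplings(pairs, errs, k_fraction=0.5))
```

Ranking by error costs one sort per batch, which is far cheaper than solving an assignment problem over the batch; that appears to be the trade-off behind the top-$k$ shortcut.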

[80] Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model

Reem AlJunaid,Muzammil Behzad

Main category: cs.CV

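TL;DR: KRCapVLM is a knowledge-replay-based image captioning framework that combines a vision-language model, beam search decoding, and attention modules to improve the knowledge content and quality of captions.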

Details Motivation: Existing captioning models often produce generic descriptions that lack depth and specificity; KRCapVLM targets this problem. Method: Beam search decoding generates more diverse captions, attention-based modules strengthen the image feature representation, and training schedulers improve stability. Result: The model improves both knowledge recognition accuracy and caption quality, and generalizes better to new knowledge concepts. Conclusion: KRCapVLM strengthens the model's ability to generate knowledge-rich, contextually relevant captions. Abstract: Generating informative and knowledge-rich image captions remains a challenge for many existing captioning models, which often produce generic descriptions that lack specificity and contextual depth. To address this limitation, we propose KRCapVLM, a novel knowledge-replay-based image captioning framework built on a vision-language model. We incorporate beam search decoding to generate more diverse and coherent captions. We also integrate attention-based modules into the image encoder to enhance feature representation. Finally, we employ training schedulers to improve stability and ensure smoother convergence during training. Together, these components yield substantial gains in both caption quality and knowledge recognition. Our proposed model demonstrates clear improvements in both the accuracy of knowledge recognition and the overall quality of generated captions. It shows a stronger ability to generalize to previously unseen knowledge concepts, producing more informative and contextually relevant descriptions. These results indicate the effectiveness of our approach in enhancing the model's capacity to generate meaningful, knowledge-grounded captions across a range of scenarios.

[81] VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

Yuanxin Liu,Kun Ouyang,Haoning Wu,Yi Liu,Lin Sui,Xinhao Li,Yan Zhong,Y. Charles,Xinyu Zhou,Xu Sun

Main category: cs.CV

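TL;DR: VideoReasonBench is a new video understanding benchmark focused on vision-centric complex reasoning, filling the gap left by existing benchmarks that demand little reasoning depth.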

Details Motivation: Existing video understanding benchmarks lack the reasoning depth needed to show the benefit of long chain-of-thought reasoning, so a new benchmark is needed to evaluate vision-centric complex video reasoning. Method: VideoReasonBench contains visually rich, highly complex video tasks that require step-by-step reasoning; 18 multimodal large language models are evaluated. Result: Most models perform poorly on complex video reasoning: GPT-4o reaches only 6.9% accuracy, while Gemini-2.5-Pro leads by a wide margin at 56.0%. An extended thinking budget is essential for improving performance. Conclusion: VideoReasonBench shows how challenging complex video reasoning is and that an extended thinking budget matters for performance, pointing to new directions for future research. Abstract: Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models have to precisely recall multiple operations in the video and perform step-by-step reasoning to reach correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning, e.g., GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigations on "test-time scaling" further reveal that an extended thinking budget, while offering no or minimal benefits on existing video benchmarks, is essential for improving the performance on VideoReasonBench.

[82] MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification

Yang Qiao,Xiaoyu Zhong,Xiaofeng Gu,Zhiguo Yu

Main category: cs.CV

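TL;DR: The paper proposes a novel Multimodal Collaborative Fusion Network (MCFNet) that improves fine-grained classification through modality-specific regularization and a hybrid attention mechanism.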

Details Motivation: Multimodal information processing is important for improving image classification, but conventional methods struggle to capture the complex dependencies between modalities, which limits their use in high-precision classification. Method: MCFNet comprises a regularized integrated fusion module (which improves intra-modal feature representation) and a multimodal decision classification module (which combines multiple loss functions with a weighted voting mechanism). Result: Experiments on benchmark datasets show consistent gains in classification accuracy, confirming the method's effectiveness at modeling cross-modal semantics. Conclusion: Through fine-grained cross-modal semantic alignment and feature fusion, MCFNet improves performance on fine-grained classification tasks. Abstract: Multimodal information processing has become increasingly important for enhancing image classification performance. However, the intricate and implicit dependencies across different modalities often hinder conventional methods from effectively capturing fine-grained semantic interactions, thereby limiting their applicability in high-precision classification tasks. To address this issue, we propose a novel Multimodal Collaborative Fusion Network (MCFNet) designed for fine-grained classification. The proposed MCFNet architecture incorporates a regularized integrated fusion module that improves intra-modal feature representation through modality-specific regularization strategies, while facilitating precise semantic alignment via a hybrid attention mechanism. Additionally, we introduce a multimodal decision classification module, which jointly exploits inter-modal correlations and unimodal discriminative features by integrating multiple loss functions within a weighted voting paradigm. Extensive experiments and ablation studies on benchmark datasets demonstrate that the proposed MCFNet framework achieves consistent improvements in classification accuracy, confirming its effectiveness in modeling subtle cross-modal semantics.

[83] PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening

Jeonghyeok Do,Sungpyo Kim,Geunhyuk Youk,Jaehyup Lee,Munchurl Kim

Main category: cs.CV

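TL;DR: PAN-Crafter proposes a modality-consistent alignment framework that uses modality-adaptive reconstruction and cross-modality alignment-aware attention to resolve cross-modality misalignment in PAN and MS image fusion, significantly improving performance.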

Details Motivation: Cross-modality misalignment in PAN and MS image fusion (from differences in sensor placement, acquisition timing, and resolution) causes conventional deep learning methods, which assume pixel-wise alignment, to produce spectral distortion, double edges, and blurring. Method: PAN-Crafter uses Modality-Adaptive Reconstruction (MARs) to jointly reconstruct HRMS and PAN images, and introduces Cross-Modality Alignment-Aware Attention (CM3A) to bidirectionally align MS texture and PAN structure. Result: On multiple benchmark datasets, PAN-Crafter beats the most recent state-of-the-art method on all metrics, with 50.11$\times$ faster inference and 0.63$\times$ the memory size, and generalizes well to unseen satellite datasets. Conclusion: The modality-consistent alignment framework effectively addresses cross-modality misalignment and improves both the quality and the efficiency of image fusion. Abstract: PAN-sharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multi-spectral (MS) images to generate high-resolution multi-spectral (HRMS) outputs. However, cross-modality misalignment -- caused by sensor placement, acquisition timing, and resolution disparity -- poses a fundamental challenge. Conventional deep learning methods assume perfect pixel-wise alignment and rely on per-pixel reconstruction losses, leading to spectral distortion, double edges, and blurring when misalignment is present. To address this, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment gap between PAN and MS modalities. At its core, Modality-Adaptive Reconstruction (MARs) enables a single network to jointly reconstruct HRMS and PAN images, leveraging PAN's high-frequency details as auxiliary self-supervision. Additionally, we introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel mechanism that bidirectionally aligns MS texture to PAN structure and vice versa, enabling adaptive feature refinement across modalities. Extensive experiments on multiple benchmark datasets demonstrate that our PAN-Crafter outperforms the most recent state-of-the-art method in all metrics, even with 50.11$\times$ faster inference time and 0.63$\times$ the memory size. Furthermore, it demonstrates strong generalization performance on unseen satellite datasets, showing its robustness across different conditions.

[84] UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Weijia Mao,Zhenheng Yang,Mike Zheng Shou

Main category: cs.CV

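TL;DR: UniRL is a self-improving post-training method that uses model-generated images as training data, needs no external image data, and optimizes generation and understanding tasks together.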

Details Motivation: Existing unified multimodal large language models depend on large-scale data and compute, and post-training methods often need external data or are limited to specific tasks. UniRL aims to address these issues. Method: Using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), the model generates images and trains on them, so that generation and understanding reinforce each other. Result: Evaluated on Show-o and Janus, UniRL reaches GenEval scores of 0.77 and 0.65, respectively. Conclusion: UniRL needs no external data, improves task performance, reduces the imbalance between tasks, and requires only a few additional training steps. Abstract: Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only several additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released at https://github.com/showlab/UniRL.

[85] VModA: An Effective Framework for Adaptive NSFW Image Moderation

Han Bao,Qinying Wang,Zhi Chen,Qingming Li,Xuhong Zhang,Changjiang Li,Zonghui Wang,Shouling Ji,Wenzhi Chen

Main category: cs.CV

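TL;DR: VModA is a general and effective framework for detecting semantically complex NSFW content. It adapts to the moderation rules of different platforms and regions and significantly improves detection accuracy.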

Details Motivation: NSFW content is rampant on social networks, and existing detection methods struggle with complex semantics and diverse rules, so a more effective solution is needed. Method: The VModA framework adapts to diverse moderation rules and handles semantically complex content; its performance is verified experimentally. Result: VModA achieves up to a 54.3% accuracy improvement across NSFW types and shows strong adaptability across categories and scenarios. Conclusion: VModA provides an efficient and adaptable solution for NSFW content detection, and its effectiveness is validated in real-world scenarios. Abstract: Not Safe/Suitable for Work (NSFW) content is rampant on social networks and poses serious harm to citizens, especially minors. Current detection methods mainly rely on deep learning-based image recognition and classification. However, NSFW images are now presented in increasingly sophisticated ways, often using image details and complex semantics to obscure their true nature or attract more views. Although still understandable to humans, these images often evade existing detection methods, posing a significant threat. Further complicating the issue, varying regulations across platforms and regions create additional challenges for effective moderation, leading to detection bias and reduced accuracy. To address this, we propose VModA, a general and effective framework that adapts to diverse moderation rules and handles complex, semantically rich NSFW content across categories. Experimental results show that VModA significantly outperforms existing methods, achieving up to a 54.3% accuracy improvement across NSFW types, including those with complex semantics. Further experiments demonstrate that our method exhibits strong adaptability across categories, scenarios, and base VLMs. We also identified inconsistent and controversial label samples in public NSFW benchmark datasets, re-annotated them, and submitted corrections to the original maintainers. Two datasets have confirmed the updates so far. Additionally, we evaluate VModA in real-world scenarios to demonstrate its practical effectiveness.

[86] Robust and Annotation-Free Wound Segmentation on Noisy Real-World Pressure Ulcer Images: Towards Automated DESIGN-R® Assessment

Yun-Cheng Tsai

Main category: cs.CV

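TL;DR: The paper proposes a lightweight pipeline that pairs a YOLOv11n detector with the FUSegNet segmentation model and achieves effective wound segmentation across body sites using only 500 labeled bounding boxes.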

Details Motivation: Existing models such as FUSegNet are trained mainly on foot ulcers and generalize poorly to other body sites; this domain gap needs to be closed. Method: A YOLOv11n detector is combined with the pre-trained FUSegNet; without pixel-level annotations or retraining, 500 labeled bounding boxes suffice for cross-domain generalization. Result: On test sets of foot, sacral, and trochanter wounds, mean IoU improves by 23 percentage points and DESIGN-R size estimation accuracy rises from 71% to 94%. Conclusion: The method generalizes to different body sites without task-specific fine-tuning, offering an efficient, scalable route to automating clinical wound scoring. Abstract: Purpose: Accurate wound segmentation is essential for automated DESIGN-R scoring. However, existing models such as FUSegNet, which are trained primarily on foot ulcer datasets, often fail to generalize to wounds on other body sites. Methods: We propose an annotation-efficient pipeline that combines a lightweight YOLOv11n-based detector with the pre-trained FUSegNet segmentation model. Instead of relying on pixel-level annotations or retraining for new anatomical regions, our method achieves robust performance using only 500 manually labeled bounding boxes. This zero fine-tuning approach effectively bridges the domain gap and enables direct deployment across diverse wound types. This is an advance not previously demonstrated in the wound segmentation literature. Results: Evaluated on three real-world test sets spanning foot, sacral, and trochanter wounds, our YOLO plus FUSegNet pipeline improved mean IoU by 23 percentage points over vanilla FUSegNet and increased end-to-end DESIGN-R size estimation accuracy from 71 percent to 94 percent (see Table 3 for details). Conclusion: Our pipeline generalizes effectively across body sites without task-specific fine-tuning, demonstrating that minimal supervision, with 500 annotated ROIs, is sufficient for scalable, annotation-light wound segmentation. This capability paves the way for real-world DESIGN-R automation, reducing reliance on pixel-wise labeling, streamlining documentation workflows, and supporting objective and consistent wound scoring in clinical practice. We will publicly release the trained detector weights and configuration to promote reproducibility and facilitate downstream deployment.
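The headline metric here, mean IoU, is the intersection-over-union between predicted and ground-truth wound masks averaged over images. A minimal sketch for binary masks stored as nested lists follows; the function names are illustrative, and scoring two empty masks as 1.0 is a convention chosen for this example, not something the abstract specifies.

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention (assumption): two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (prediction, ground truth) pairs."""
    return sum(mask_iou(p, t) for p, t in mask_pairs) / len(mask_pairs)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    truth = [[1, 0], [0, 0]]
    print(mask_iou(pred, truth))  # one shared pixel out of two covered
```

A gain of 23 percentage points on this scale means, for instance, moving from a mean IoU of 0.50 to 0.73; the abstract reports only the difference, not the absolute values.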

[87] Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

Xingguang Wei,Haomin Wang,Shenglong Ye,Ruifeng Luo,Yanting Zhang,Lixin Gu,Jifeng Dai,Yu Qiao,Wenhai Wang,Hongjie Zhang

Main category: cs.CV

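TL;DR: VecFormer proposes a line-based representation for panoptic symbol spotting in CAD drawings, addressing the high computational cost, limited generality, and loss of geometric information in existing methods, and adds a Branch Fusion Refinement module to make predictions more consistent.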

Details Motivation: Existing methods for panoptic symbol spotting in CAD drawings suffer from high computational cost, limited generality, and loss of geometric information, so a more efficient method that preserves geometric structure is needed. Method: VecFormer uses a line-based representation that preserves the geometric continuity of the original primitives, together with a Branch Fusion Refinement module that integrates instance and semantic predictions. Result: VecFormer reaches 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results with and without prior information, respectively. Conclusion: Line-based representation shows strong potential for vector graphic understanding, and VecFormer offers a new solution for the field. Abstract: We study the task of panoptic symbol spotting, which involves identifying both individual instances of countable things and the semantic regions of uncountable stuff in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on image rasterization, graph construction, or point-based representation, but these approaches often suffer from high computational costs, limited generality, and loss of geometric structural information. In this paper, we propose VecFormer, a novel method that addresses these challenges through line-based representation of primitives. This design preserves the geometric continuity of the original primitive, enabling more accurate shape representation while maintaining a computation-friendly structure, making it well-suited for vector graphic understanding tasks. To further enhance prediction reliability, we introduce a Branch Fusion Refinement module that effectively integrates instance and semantic predictions, resolving their inconsistencies for more coherent panoptic outputs. Extensive experiments demonstrate that our method establishes a new state-of-the-art, achieving 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results under settings with and without prior information, respectively, highlighting the strong potential of line-based representation as a foundation for vector graphic understanding.
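For readers unfamiliar with the PQ figures quoted above, Panoptic Quality in its general form is the sum of IoUs over matched (true-positive) segments divided by TP + 0.5·FP + 0.5·FN, where a match requires IoU above 0.5. The sketch below implements that general definition. It is an assumption that the CAD benchmark follows it closely; symbol-spotting benchmarks may compute IoU over primitives rather than pixels, and the abstract does not say. The function name and input layout are illustrative.

```python
def panoptic_quality(matched_ious, num_pred, num_gt, iou_threshold=0.5):
    """General Panoptic Quality: sum(IoU of TPs) / (TP + 0.5*FP + 0.5*FN).

    matched_ious: IoU of each candidate prediction/ground-truth pair.
    num_pred / num_gt: total predicted and ground-truth segments.
    Pairs at or below the threshold are not counted as true positives.
    """
    tp_ious = [iou for iou in matched_ious if iou > iou_threshold]
    tp = len(tp_ious)
    fp = num_pred - tp
    fn = num_gt - tp
    denominator = tp + 0.5 * fp + 0.5 * fn
    return sum(tp_ious) / denominator if denominator else 0.0


if __name__ == "__main__":
    # Three predictions, four ground-truth segments, two matches above 0.5.
    print(panoptic_quality([0.8, 0.6, 0.4], num_pred=3, num_gt=4))
```

PQ is usually reported on a 0 to 100 scale, so the 91.1 PQ quoted above corresponds to 0.911 in this function's units.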

[88] Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation

Sanggyun Ma,Wonjoon Choi,Jihun Park,Jaeyeul Kim,Seunghun Lee,Jiwan Seo,Sunghoon Im

Main category: cs.CV

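TL;DR: BriGeS is a depth estimation method that fuses geometric and semantic information. Its Bridging Gate and Attention Temperature Scaling improve performance while reducing resource consumption.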

Details Motivation: Combine geometric and semantic information to improve the accuracy of monocular depth estimation (MDE), especially in complex scenes. Method: BriGeS builds on pre-trained foundation models and trains only the Bridging Gate, using Attention Temperature Scaling to balance the attention mechanism. Result: It outperforms existing methods on multiple datasets and handles intricate structures and overlapping objects effectively. Conclusion: By fusing geometric and semantic information efficiently, BriGeS improves the performance and generalization of MDE. Abstract: We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling technique. It finely adjusts the focus of the attention mechanisms to prevent over-concentration on specific features, thus ensuring balanced performance across diverse inputs. BriGeS capitalizes on pre-trained foundation models and adopts a strategy that focuses on training only the Bridging Gate. This method significantly reduces resource demands and training time while maintaining the model's ability to generalize effectively. Extensive experiments across multiple challenging datasets demonstrate that BriGeS outperforms state-of-the-art methods in MDE for complex scenes, effectively handling intricate structures and overlapping objects.
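The effect that temperature has on attention can be seen with a plain softmax: dividing the attention logits by a temperature greater than 1 flattens the distribution, which is one generic way to prevent over-concentration on a few features. This is a sketch of standard temperature-scaled softmax, not the paper's exact Attention Temperature Scaling rule, which the abstract does not give; the function name and the example temperature are assumptions.

```python
import math


def attention_weights(scores, temperature=1.0):
    """Softmax over attention logits divided by a temperature.

    temperature > 1 spreads weight more evenly across positions;
    temperature < 1 sharpens the distribution. The max is subtracted
    before exponentiating for numerical stability.
    """
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]


if __name__ == "__main__":
    logits = [4.0, 0.0, 0.0]
    print(attention_weights(logits, temperature=1.0))  # sharply peaked
    print(attention_weights(logits, temperature=4.0))  # noticeably flatter
```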

[89] Video Editing for Audio-Visual Dubbing

Binyamin Manela,Sharon Gannot,Ethan Fetyaya

Main category: cs.CV

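TL;DR: EdiDub is a novel visual dubbing framework that improves on existing methods by treating dubbing as a content-aware editing task, markedly improving identity preservation and lip synchronization.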

Details Motivation: Current visual dubbing methods are limited in preserving the original video context and complex visual elements; EdiDub aims to address this. Method: EdiDub reformulates visual dubbing as a content-aware editing task and uses a specialized conditioning scheme to keep modifications accurate and faithful. Result: On multiple benchmarks, EdiDub performs strongly on identity preservation and synchronization, and human evaluations confirm better synchronization and visual naturalness than existing methods. Conclusion: EdiDub's content-aware editing approach achieves accurate lip synchronization while preserving complex visual elements, outperforming traditional generation or inpainting techniques. Abstract: Visual dubbing, the synchronization of facial movements with new speech, is crucial for making content accessible across different languages, enabling broader global reach. However, current methods face significant limitations. Existing approaches often generate talking faces, hindering seamless integration into original scenes, or employ inpainting techniques that discard vital visual information like partial occlusions and lighting variations. This work introduces EdiDub, a novel framework that reformulates visual dubbing as a content-aware editing task. EdiDub preserves the original video context by utilizing a specialized conditioning scheme to ensure faithful and accurate modifications rather than mere copying. On multiple benchmarks, including a challenging occluded-lip dataset, EdiDub significantly improves identity preservation and synchronization. Human evaluations further confirm its superiority, achieving higher synchronization and visual naturalness scores compared to the leading methods. These results demonstrate that our content-aware editing approach outperforms traditional generation or inpainting, particularly in maintaining complex visual elements while ensuring accurate lip synchronization.

[90] UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors

Tianhang Wang,Fan Lu,Sanqing Qu,Guo Yu,Shihang Du,Ya Wu,Yuan Huang,Guang Chen

Main category: cs.CV

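TL;DR: UrbanCraft addresses the Extrapolated View Synthesis (EVS) problem by using hierarchical semantic-geometric representations as additional priors, combined with the HSG-VSD technique to improve performance.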

Details Motivation: Existing neural rendering methods perform poorly when synthesizing views outside the training camera distribution (e.g., looking left, right, or downwards), which limits the generalizability of urban reconstruction. Method: Reconstructs coarse semantic and geometric primitives from the partially observable scene to build a scene-level prior (occupancy grid), and adds instance-level priors (3D bounding boxes). Proposes HSG-VSD, which incorporates semantic and geometric constraints into score distillation sampling. Result: Qualitative and quantitative experiments show that UrbanCraft performs well on the EVS problem. Conclusion: Through hierarchical priors and HSG-VSD, UrbanCraft substantially improves extrapolated view synthesis and addresses the limitations of existing methods. Abstract: Existing neural rendering-based urban scene reconstruction methods mainly focus on the Interpolated View Synthesis (IVS) setting that synthesizes from views close to training camera trajectory. However, IVS cannot guarantee the on-par performance of the novel view outside the training camera distribution (e.g., looking left, right, or downwards), which limits the generalizability of the urban reconstruction application. Previous methods have optimized it via image diffusion, but they fail to handle text-ambiguous or large unseen view angles due to coarse-grained control of text-only diffusion. In this paper, we design UrbanCraft, which surmounts the Extrapolated View Synthesis (EVS) problem using hierarchical sem-geometric representations serving as additional priors. Specifically, we leverage the partially observable scene to reconstruct coarse semantic and geometric primitives, establishing a coarse scene-level prior through an occupancy grid as the base representation. Additionally, we incorporate fine instance-level priors from 3D bounding boxes to enhance object-level details and spatial relationships. Building on this, we propose the Hierarchical Semantic-Geometric-Guided Variational Score Distillation (HSG-VSD), which integrates semantic and geometric constraints from pretrained UrbanCraft2D into the score distillation sampling process, forcing the distribution to be consistent with the observable scene. Qualitative and quantitative comparisons demonstrate the effectiveness of our methods on the EVS problem.

[91] Adaptive Spatial Augmentation for Semi-supervised Semantic Segmentation

Lingyan Ran,Yali Li,Tao Zhuo,Shizhou Zhang,Yanning Zhang

Main category: cs.CV

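TL;DR: The paper proposes Adaptive Spatial Augmentation (ASAug) for semi-supervised semantic segmentation (SSSS), which improves model performance by dynamically adjusting the augmentation strategy.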

Details Motivation: Existing strong augmentation methods rely mainly on intensity perturbations, which have limited effect on semantic masks; spatial augmentation is effective in supervised tasks but has been overlooked in SSSS. Method: Proposes Adaptive Spatial Augmentation (ASAug), which dynamically adjusts the spatial augmentation for each image based on entropy. Result: As a pluggable module, ASAug consistently improves existing methods and reaches state-of-the-art results on PASCAL VOC 2012, Cityscapes, and COCO. Conclusion: Spatial augmentation is effective in semi-supervised semantic segmentation, and the adaptive strategy further improves generalization. Abstract: In semi-supervised semantic segmentation (SSSS), data augmentation plays a crucial role in the weak-to-strong consistency regularization framework, as it enhances diversity and improves model generalization. Recent strong augmentation methods have primarily focused on intensity-based perturbations, which have minimal impact on the semantic masks. In contrast, spatial augmentations like translation and rotation have long been acknowledged for their effectiveness in supervised semantic segmentation tasks, but they are often ignored in SSSS. In this work, we demonstrate that spatial augmentation can also contribute to model training in SSSS, despite generating inconsistent masks between the weak and strong augmentations. Furthermore, recognizing the variability among images, we propose an adaptive augmentation strategy that dynamically adjusts the augmentation for each instance based on entropy. Extensive experiments show that our proposed Adaptive Spatial Augmentation (ASAug) can be integrated as a pluggable module, consistently improving the performance of existing methods and achieving state-of-the-art results on benchmark datasets such as PASCAL VOC 2012, Cityscapes, and COCO.
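
To make "adjusts the augmentation for each instance based on entropy" concrete, here is a rough sketch of one way such a rule could look. The abstract does not say how entropy is mapped to augmentation, so the direction of the mapping (confident, low-entropy images get stronger spatial augmentation), the linear form, and all names below are assumptions for illustration only.

```python
import math

def mean_entropy(prob_map):
    """Average Shannon entropy (in nats) over per-pixel class distributions."""
    total = 0.0
    for probs in prob_map:
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prob_map)

def augmentation_strength(prob_map, num_classes, max_strength=1.0):
    """Map normalized entropy in [0, 1] to a strength in [0, max_strength].

    Assumption (not from the paper): low-entropy images tolerate stronger
    spatial augmentation, so strength decreases linearly with entropy.
    """
    normalized = mean_entropy(prob_map) / math.log(num_classes)
    return max_strength * (1.0 - normalized)

# Hypothetical softmax outputs for four pixels of two images (3 classes).
confident = [[0.98, 0.01, 0.01]] * 4
uncertain = [[1 / 3, 1 / 3, 1 / 3]] * 4
print(augmentation_strength(confident, 3), augmentation_strength(uncertain, 3))
```

The returned strength could then scale, for example, the maximum rotation angle or translation offset applied to the strongly augmented view.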

[92] VITON-DRR: Details Retention Virtual Try-on via Non-rigid Registration

Ben Li,Minqi Li,Jie Ren,Kaibing Zhang

Main category: cs.CV

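TL;DR: Proposes VITON-DRR, a virtual try-on method based on non-rigid registration that uses a dual-pyramid-structured feature extractor and a deformation module to improve garment detail retention and warping accuracy.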

Details Motivation: Address the loss of garment details and inaccurate warping that existing virtual try-on methods suffer from under self-occlusion or pose differences. Method: Reconstructs a human semantic segmentation with a dual-pyramid-structured feature extractor, designs a deformation module that extracts garment key points and warps them with a non-rigid registration algorithm, and finally generates the try-on image with an image synthesis module. Result: Experiments show that VITON-DRR outperforms existing methods in warping accuracy and garment detail retention. Conclusion: Through non-rigid registration and the dual-pyramid structure, VITON-DRR substantially improves the quality and detail retention of virtual try-on. Abstract: Image-based virtual try-on aims to fit a target garment to a specific person image and has attracted extensive research attention because of its huge application potential in the e-commerce and fashion industries. To generate high-quality try-on results, accurately warping the clothing item to fit the human body plays a significant role, as slight misalignment may lead to unrealistic artifacts in the fitting image. Most existing methods warp the clothing by feature matching and thin-plate spline (TPS). However, this approach often fails to preserve clothing details due to self-occlusion, severe misalignment between poses, etc. To address these challenges, this paper proposes a detail retention virtual try-on method via accurate non-rigid registration (VITON-DRR) for diverse human poses. Specifically, we reconstruct a human semantic segmentation using a dual-pyramid-structured feature extractor. Then, a novel Deformation Module is designed for extracting the cloth key points and warping them through an accurate non-rigid registration algorithm. Finally, the Image Synthesis Module is designed to synthesize the deformed garment image and generate the human pose information adaptively. Compared with traditional methods, the proposed VITON-DRR can make the deformation of fitting images more accurate and retain more garment details. The experimental results demonstrate that the proposed method performs better than state-of-the-art methods.

[93] CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis

Runmin Jiang,Genpei Zhang,Yuntian Yang,Siqi Wu,Yuheng Zhang,Wanyue Feng,Yizhou Zhao,Xi Xiao,Xiao Wang,Tianyang Wang,Xingjian Li,Min Xu

Main category: cs.CV

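TL;DR: CryoCCD is a synthesis framework that combines biophysical modeling with generative techniques to produce multi-scale cryo-EM micrographs, addressing the shortcomings of existing methods in structural diversity and noise simulation.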

Details Motivation: High-quality annotated cryo-electron microscopy (cryo-EM) data is scarce, and existing synthesis methods struggle to capture the structural diversity of biological specimens and the complex noise, which hinders model development. Method: CryoCCD generates images through biophysical variability (such as compositional heterogeneity and cellular context) and physics-informed imaging simulation, and produces realistic noise with a conditional diffusion model and contrastive learning. Result: Experiments show that CryoCCD generates structurally accurate micrographs and outperforms existing methods in downstream tasks such as particle picking and reconstruction. Conclusion: CryoCCD provides a more realistic solution for cryo-EM data synthesis and improves analysis performance. Abstract: Cryo-electron microscopy (cryo-EM) offers near-atomic resolution imaging of macromolecules, but developing robust models for downstream analysis is hindered by the scarcity of high-quality annotated data. While synthetic data generation has emerged as a potential solution, existing methods often fail to capture both the structural diversity of biological specimens and the complex, spatially varying noise inherent in cryo-EM imaging. To overcome these limitations, we propose CryoCCD, a synthesis framework that integrates biophysical modeling with generative techniques. Specifically, CryoCCD produces multi-scale cryo-EM micrographs that reflect realistic biophysical variability through compositional heterogeneity, cellular context, and physics-informed imaging. To generate realistic noise, we employ a conditional diffusion model, enhanced by cycle consistency to preserve structural fidelity and mask-aware contrastive learning to capture spatially adaptive noise patterns. Extensive experiments show that CryoCCD generates structurally accurate micrographs and enhances performance in downstream tasks, outperforming state-of-the-art baselines in both particle picking and reconstruction.

[94] A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation

Shuzhou Sun,Li Liu,Tianpeng Liu,Shuaifeng Zhi,Ming-Ming Cheng,Janne Heikkilä,Yongxiang Liu

Main category: cs.CV

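TL;DR: The paper proposes a reverse causal framework (RcSGG) that restructures the causal chain to address spurious correlations in existing Scene Graph Generation (SGG) frameworks, markedly improving mean recall.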

Details Motivation: Existing two-stage SGG frameworks have a causal chain structure that produces spurious correlations, which show up as tail relationships predicted as head ones and foreground relationships predicted as background ones. Method: Proposes the RcSGG framework, which uses Active Reverse Estimation (ARE) to intervene on the confounder and Maximum Information Sampling (MIS) to strengthen the reverse causality estimation. Result: Achieves state-of-the-art mean recall across multiple benchmarks and different SGG frameworks. Conclusion: RcSGG effectively removes spurious correlations in SGG frameworks and resolves the associated biases. Abstract: Existing two-stage Scene Graph Generation (SGG) frameworks typically incorporate a detector to extract relationship features and a classifier to categorize these relationships; therefore, the training paradigm follows a causal chain structure, where the detector's inputs determine the classifier's inputs, which in turn influence the final predictions. However, such a causal chain structure can yield spurious correlations between the detector's inputs and the final predictions, i.e., the prediction of a certain relationship may be influenced by other relationships. This influence can induce at least two observable biases: tail relationships are predicted as head ones, and foreground relationships are predicted as background ones; notably, the latter bias is seldom discussed in the literature. To address this issue, we propose reconstructing the causal chain structure into a reverse causal structure, wherein the classifier's inputs are treated as the confounder, and both the detector's inputs and the final predictions are viewed as causal variables. Specifically, we term the reconstructed causal paradigm as the Reverse causal Framework for SGG (RcSGG). RcSGG initially employs the proposed Active Reverse Estimation (ARE) to intervene on the confounder to estimate the reverse causality, i.e., the causality from final predictions to the classifier's inputs. Then, the Maximum Information Sampling (MIS) is suggested to enhance the reverse causality estimation further by considering the relationship information. Theoretically, RcSGG can mitigate the spurious correlations inherent in the SGG framework, subsequently eliminating the induced biases. Comprehensive experiments on popular benchmarks and diverse SGG frameworks show state-of-the-art mean recall rates.

[95] LAFR: Efficient Diffusion-based Blind Face Restoration via Latent Codebook Alignment Adapter

Runyi Li,Bin Chen,Jian Zhang,Radu Timofte

Main category: cs.CV

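TL;DR: LAFR proposes a codebook-based latent space adapter that aligns the latent distribution of low-quality images, enabling high-quality face restoration at lower computational cost.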

Details Motivation: Existing methods suffer from semantic misalignment when restoring low-quality images, and retraining the VAE encoder is computationally expensive. Method: Proposes the LAFR adapter to align latent distributions, combined with a multi-level restoration loss and lightweight finetuning of the diffusion prior. Result: Achieves high-quality, identity-preserving restoration on synthetic and real-world datasets, with training time reduced by 70%. Conclusion: LAFR efficiently resolves the semantic alignment problem in low-quality image restoration and outperforms existing methods. Abstract: Blind face restoration from low-quality (LQ) images is a challenging task that requires not only high-fidelity image reconstruction but also the preservation of facial identity. While diffusion models like Stable Diffusion have shown promise in generating high-quality (HQ) images, their VAE modules are typically trained only on HQ data, resulting in semantic misalignment when encoding LQ inputs. This mismatch significantly weakens the effectiveness of LQ conditions during the denoising process. Existing approaches often tackle this issue by retraining the VAE encoder, which is computationally expensive and memory-intensive. To address this limitation efficiently, we propose LAFR (Latent Alignment for Face Restoration), a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of HQ counterparts, enabling semantically consistent diffusion sampling without altering the original VAE. To further enhance identity preservation, we introduce a multi-level restoration loss that combines constraints from identity embeddings and facial structural priors. Additionally, by leveraging the inherent structural regularity of facial images, we show that lightweight finetuning of the diffusion prior on just 0.9% of the FFHQ dataset is sufficient to achieve results comparable to state-of-the-art methods while reducing training time by 70%. Extensive experiments on both synthetic and real-world face restoration benchmarks demonstrate the effectiveness and efficiency of LAFR, achieving high-quality, identity-preserving face reconstruction from severely degraded inputs.

[96] Revisiting Reweighted Risk for Calibration: AURC, Focal Loss, and Inverse Focal Loss

Han Zhou,Sebastian G. Gruber,Teodora Popordanoska,Matthew B. Blaschko

Main category: cs.CV

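TL;DR: This paper revisits weighted risk functions commonly used in deep learning, establishes a theoretical connection between these reweighting schemes and calibration error, and shows that optimizing a regularized AURC improves calibration.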

Details Motivation: Study the calibration properties of different weighted risk functions (such as focal loss and inverse focal loss) and explore their relationship to calibration error. Method: Proposes a differentiable optimization of the regularized AURC based on the SoftRank technique and analyzes its connection to the selective classification paradigm. Result: Empirically, the AURC-based loss achieves competitive class-wise calibration performance across a range of datasets and model architectures. Conclusion: Optimizing the regularized AURC is a more flexible and effective route to calibration, whereas focal loss may lack a principled basis when calibration is the goal. Abstract: Several variants of reweighted risk functionals, such as focal loss, inverse focal loss, and the Area Under the Risk-Coverage Curve (AURC), have been proposed in the literature and claims have been made in relation to their calibration properties. However, focal loss and inverse focal loss propose vastly different weighting schemes. In this paper, we revisit a broad class of weighted risk functions commonly used in deep learning and establish a principled connection between these reweighting schemes and calibration errors. We show that minimizing calibration error is closely linked to the selective classification paradigm and demonstrate that optimizing a regularized variant of the AURC naturally leads to improved calibration. This regularized AURC shares a similar reweighting strategy with inverse focal loss, lending support to the idea that focal loss is less principled when calibration is a desired outcome. Direct AURC optimization offers greater flexibility through the choice of confidence score functions (CSFs). To enable gradient-based optimization, we introduce a differentiable formulation of the regularized AURC using the SoftRank technique. Empirical evaluations demonstrate that our AURC-based loss achieves competitive class-wise calibration performance across a range of datasets and model architectures.
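
The claim that focal loss and inverse focal loss "propose vastly different weighting schemes" can be seen from a small numeric sketch. The abstract does not write the losses out, so the forms below are the ones commonly used in the literature as far as I am aware: focal loss weights the log-loss by (1 - p)^gamma, and inverse focal loss by (1 + p)^gamma, where p is the predicted probability of the true class. The choice gamma = 2 is an assumption.

```python
import math

def cross_entropy(p):
    """Unweighted log-loss for true-class probability p."""
    return -math.log(p)

def focal_loss(p, gamma=2.0):
    """Focal loss: down-weights confident (high-p) examples."""
    return -((1.0 - p) ** gamma) * math.log(p)

def inverse_focal_loss(p, gamma=2.0):
    """Inverse focal loss (assumed form): up-weights confident examples."""
    return -((1.0 + p) ** gamma) * math.log(p)

for p in (0.3, 0.6, 0.9):
    print(p, cross_entropy(p), focal_loss(p), inverse_focal_loss(p))
```

For a confident prediction such as p = 0.9, the focal weight is 0.01 while the inverse focal weight is 3.61, so the two schemes pull in opposite directions on exactly the examples that matter for calibration.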

[97] A Divide-and-Conquer Approach for Global Orientation of Non-Watertight Scene-Level Point Clouds Using 0-1 Integer Optimization

Zhuodong Li,Fei Hou,Wencheng Wang,Xuequan Lu,Ying He

Main category: cs.CV

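TL;DR: DACPO proposes a divide-and-conquer framework for point cloud orientation that handles large-scale non-watertight 3D scenes through segmentation, independent per-block processing, and global optimization.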

Details Motivation: Existing methods mainly target watertight, object-level 3D models, while orienting large-scale non-watertight 3D scenes remains underexplored. Method: DACPO segments the point cloud into small blocks, estimates initial normals with a randomized greedy method and iterative Poisson reconstruction, and then integrates the results through a graph model and global optimization. Result: Experiments show that DACPO performs strongly on large-scale non-watertight scenes and outperforms existing methods. Conclusion: DACPO provides an efficient and robust solution for orienting large-scale non-watertight point clouds. Abstract: Orienting point clouds is a fundamental problem in computer graphics and 3D vision, with applications in reconstruction, segmentation, and analysis. While significant progress has been made, existing approaches mainly focus on watertight, object-level 3D models. The orientation of large-scale, non-watertight 3D scenes remains an underexplored challenge. To address this gap, we propose DACPO (Divide-And-Conquer Point Orientation), a novel framework that leverages a divide-and-conquer strategy for scalable and robust point cloud orientation. Rather than attempting to orient an unbounded scene at once, DACPO segments the input point cloud into smaller, manageable blocks, processes each block independently, and integrates the results through a global optimization stage. For each block, we introduce a two-step process: estimating initial normal orientations by a randomized greedy method and refining them by an adapted iterative Poisson surface reconstruction. To achieve consistency across blocks, we model inter-block relationships using an undirected graph, where nodes represent blocks and edges connect spatially adjacent blocks. To reliably evaluate orientation consistency between adjacent blocks, we introduce the concept of the visible connected region, which defines the region over which visibility-based assessments are performed. The global integration is then formulated as a 0-1 integer-constrained optimization problem, with block flip states as binary variables. Despite the combinatorial nature of the problem, DACPO remains scalable by limiting the number of blocks (typically a few hundred for 3D scenes) involved in the optimization. Experiments on benchmark datasets demonstrate DACPO's strong performance, particularly in challenging large-scale, non-watertight scenarios where existing methods often fail. The source code is available at https://github.com/zd-lee/DACPO.
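
The global integration step (binary flip state per block, edges between adjacent blocks) can be illustrated with a toy version. This is a sketch under stated assumptions, not DACPO's formulation: the edge tuple layout `(i, j, agree, weight)`, the cost (total weight of edges left inconsistent), and the brute-force search are all hypothetical choices; a few hundred blocks would need a real integer programming solver rather than enumeration.

```python
from itertools import product

def solve_flips(num_blocks, edges):
    """Choose a 0/1 flip per block minimizing the weight of inconsistent edges.

    Each edge is (i, j, agree, weight): `agree` is True when blocks i and j are
    currently oriented consistently across their shared boundary (assumed to
    come from some pairwise consistency test). Flipping exactly one endpoint
    toggles the consistency of that edge.
    """
    best_cost, best_flips = None, None
    for rest in product((0, 1), repeat=num_blocks - 1):
        flips = (0,) + rest  # fix block 0 to remove the global sign ambiguity
        cost = 0.0
        for i, j, agree, weight in edges:
            same = flips[i] == flips[j]
            if same != agree:  # the edge ends up inconsistent
                cost += weight
        if best_cost is None or cost < best_cost:
            best_cost, best_flips = cost, flips
    return best_flips, best_cost

# Hypothetical 4-block cycle: blocks 2 and 3 are oriented opposite to 0 and 1.
edges = [(0, 1, True, 1.0), (1, 2, False, 1.0), (2, 3, True, 1.0), (0, 3, False, 1.0)]
flips, cost = solve_flips(4, edges)
print(flips, cost)  # (0, 0, 1, 1) 0.0
```

Fixing block 0 only makes the solution unique up to a global flip; deciding which of the two global orientations points outward is a separate question not covered by this sketch.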

[98] TimePoint: Accelerated Time Series Alignment via Self-Supervised Keypoint and Descriptor Learning

Ron Shapira Weber,Shahar Ben Ishay,Andrey Lavrinenko,Shahaf E. Finder,Oren Freifeld

Main category: cs.CV

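TL;DR: TimePoint is a self-supervised method that learns keypoints and descriptors from synthetic data, substantially accelerating DTW alignment and improving accuracy.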

Details Motivation: Dynamic Time Warping (DTW) scales poorly and is sensitive to noise when aligning time series, so a more efficient solution is needed. Method: TimePoint uses 1D diffeomorphisms to generate synthetic training data, extracts keypoints and descriptors with fully convolutional and wavelet convolutional architectures, and then applies DTW to the resulting sparse representations. Result: TimePoint is faster and more accurate than standard DTW and generalizes well to real data when trained only on synthetic data. Conclusion: TimePoint offers a scalable solution for time-series analysis that clearly outperforms traditional DTW. Abstract: Fast and scalable alignment of time series is a fundamental challenge in many domains. The standard solution, Dynamic Time Warping (DTW), struggles with poor scalability and sensitivity to noise. We introduce TimePoint, a self-supervised method that dramatically accelerates DTW-based alignment while typically improving alignment accuracy by learning keypoints and descriptors from synthetic data. Inspired by 2D keypoint detection but carefully adapted to the unique challenges of 1D signals, TimePoint leverages efficient 1D diffeomorphisms, which effectively model nonlinear time warping, to generate realistic training data. This approach, along with fully convolutional and wavelet convolutional architectures, enables the extraction of informative keypoints and descriptors. Applying DTW to these sparse representations yields major speedups and typically higher alignment accuracy than standard DTW applied to the full signals. TimePoint demonstrates strong generalization to real-world time series when trained solely on synthetic data, and further improves with fine-tuning on real data. Extensive experiments demonstrate that TimePoint consistently achieves faster and more accurate alignments than standard DTW, making it a scalable solution for time-series analysis. Our code is available at https://github.com/BGU-CS-VIL/TimePoint
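
For context on why sparse keypoints speed things up, here is a minimal sketch of textbook DTW, whose dynamic program fills an n-by-m table. This is the standard algorithm on raw scalar samples, not TimePoint itself: the learned keypoint detector and descriptors are not reproduced, and the example signals are made up.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two 1D sequences, O(len(a) * len(b)) time."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match to b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a match to a[i-1]
                                 cost[i - 1][j - 1])  # one-to-one step
    return cost[n][m]

# Hypothetical signals: b is a time-warped copy of a (first sample held longer).
a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 2, 1, 0]
print(dtw_distance(a, b))  # 0.0
```

Because the table has n times m cells, running the same recursion on k keypoints per signal instead of n samples cuts the work by roughly (n / k) squared, which is the source of the speedup described above.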

[99] PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views

Mohamed Rayan Barhdadi,Hasan Kurban,Hussein Alnuweiri

Main category: cs.CV

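TL;DR: PhysicsNeRF improves NeRF by introducing physical constraints, achieving better 3D reconstruction from sparse views and clearly outperforming existing methods.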

Details Motivation: Standard NeRF performs poorly with sparse views; PhysicsNeRF aims to improve the accuracy and generalization of sparse-view reconstruction through physical constraints. Method: Combines four constraints (depth ranking, RegNeRF-style consistency, sparsity priors, and cross-view alignment) in a compact 0.67M-parameter architecture. Result: Reaches 21.4 dB average PSNR with only 8 views, with a generalization gap of 5.7-6.2 dB that reveals the limitations of sparse-view reconstruction. Conclusion: PhysicsNeRF offers a new direction for physically consistent 3D representations and clarifies the trade-off between expressiveness and generalization in constrained NeRF models. Abstract: PhysicsNeRF is a physically grounded framework for 3D reconstruction from sparse views, extending Neural Radiance Fields with four complementary constraints: depth ranking, RegNeRF-style consistency, sparsity priors, and cross-view alignment. While standard NeRFs fail under sparse supervision, PhysicsNeRF employs a compact 0.67M-parameter architecture and achieves 21.4 dB average PSNR using only 8 views, outperforming prior methods. A generalization gap of 5.7-6.2 dB is consistently observed and analyzed, revealing fundamental limitations of sparse-view reconstruction. PhysicsNeRF enables physically consistent, generalizable 3D representations for agent interaction and simulation, and clarifies the expressiveness-generalization trade-off in constrained NeRF models.
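
To help read the dB figures, here is a small sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE). The pixel values and the assumption that intensities lie in [0, 1] are illustrative and are not taken from the paper's evaluation code.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 3-pixel "images" with intensities in [0, 1].
reference = [0.0, 0.5, 1.0]
rendered = [0.1, 0.5, 0.9]
print(round(psnr(reference, rendered), 2))
```

Since PSNR is logarithmic in the mean squared error, a gap of 5.7 to 6.2 dB corresponds to an MSE ratio of about 10^0.57 to 10^0.62, that is, roughly 3.7x to 4.2x higher error on held-out views.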

[100] VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation

Shi-Xue Zhang,Hongfa Wang,Duojun Huang,Xin Li,Xiaobin Zhu,Xu-Cheng Yin

Main category: cs.CV

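TL;DR: The paper presents VCapsBench, the first large-scale fine-grained benchmark for video caption evaluation, comprising 5,677 videos and 109,796 question-answer pairs, aimed at improving caption quality for text-to-video generation.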

Details Motivation: Existing benchmarks fall short on fine-grained evaluation, especially of spatial-temporal details, which affects video generation quality. Method: Builds the VCapsBench benchmark with QA pairs across 21 fine-grained dimensions, and introduces three metrics (AR, IR, CR) together with an LLM-based automated evaluation pipeline. Result: VCapsBench provides actionable insights for caption optimization and helps improve the robustness of text-to-video models. Conclusion: VCapsBench fills the gap in fine-grained video caption evaluation and advances text-to-video generation. Abstract: Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA-pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics (Accuracy (AR), Inconsistency Rate (IR), Coverage Rate (CR)), and an automated evaluation pipeline leveraging a large language model (LLM) to verify caption quality via contrastive QA-pairs analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and codes are available at https://github.com/GXYM/VCapsBench.

[101] R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation

Kaijie Chen,Zihao Lin,Zhiyang Xu,Ying Shen,Yuguang Yao,Joy Rimchala,Jiaxin Zhang,Lifu Huang

Main category: cs.CV

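TL;DR: The paper presents R2I-Bench, a benchmark dedicated to evaluating reasoning in text-to-image generation, and designs R2IScore as a fine-grained evaluation metric. Experiments show that current models have limited reasoning ability.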

Details Motivation: Existing text-to-image models are weak at reasoning and lack systematic evaluation, so a dedicated evaluation tool is needed. Method: Designs the R2I-Bench benchmark covering multiple reasoning categories, develops the R2IScore metric, and evaluates 16 representative models. Result: Current models show generally insufficient reasoning ability, especially on complex reasoning tasks. Conclusion: Future work needs text-to-image architectures with stronger reasoning capability. Abstract: Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation, e.g., generating "a bitten apple that has been left in the air for more than a week" necessitates understanding temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises meticulously curated data instances, spanning core reasoning categories, including commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2IScore, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality. Extensive experiments with 16 representative T2I models, including a strong pipeline-based framework that decouples reasoning and generation using the state-of-the-art language and image generation models, demonstrate consistently limited reasoning performance, highlighting the need for more robust, reasoning-aware architectures in the next generation of T2I systems. Project Page: https://r2i-bench.github.io

[102] VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning

Liyun Zhu,Qixiang Chen,Xi Shen,Xiaodong Cun

Main category: cs.CV

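TL;DR: VAU-R1 is a data-efficient framework built on multimodal large language models that improves video anomaly understanding through reinforcement fine-tuning; the paper also introduces VAU-Bench, the first benchmark for video anomaly reasoning.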

Details Motivation: Video anomaly understanding is essential for smart cities and security surveillance, but existing methods lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. Method: Proposes the VAU-R1 framework, which combines multimodal large language models with reinforcement fine-tuning, and designs the VAU-Bench benchmark with multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Result: VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence. Conclusion: VAU-R1 and VAU-Bench lay a foundation for interpretable and reasoning-aware video anomaly understanding. Abstract: Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). In addition, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at https://github.com/GVCLab/VAU-R1.

[103] OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data

Fengxiang Wang,Mingshuo Chen,Xuming He,YiFan Zhang,Feng Liu,Zijie Guo,Zhenghao Hu,Jiong Wang,Jingyi Xu,Zhangrui Li,Fenghua Ling,Ben Fei,Weijia Li,Long Lan,Wenjing Yang,Wenlong Zhang,Lei Bai

Main category: cs.CV

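TL;DR: OmniEarth-Bench is a comprehensive multimodal benchmark covering the six spheres of Earth science and their interactions, with 100 expert-curated evaluation dimensions; existing state-of-the-art models perform poorly on it.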

Details Motivation: Existing benchmarks for Earth science multimodal learning have limited coverage and cannot comprehensively evaluate cross-sphere interactions, so a more systematic evaluation tool is needed. Method: Uses satellite and in-situ observational data to integrate 29,779 annotations across four tiers (perception, general reasoning, scientific knowledge reasoning, and chain-of-thought reasoning), with expert and crowd-sourced collaboration to reduce label ambiguity. Result: Nine state-of-the-art multimodal models were tested and all performed poorly: none reaches 35% accuracy, and on some cross-sphere tasks GPT-4o drops to 0%. Conclusion: OmniEarth-Bench sets a new standard for Earth science AI and advances scientific discovery and practical applications; the dataset and models have been released. Abstract: Existing benchmarks for Earth science multimodal learning exhibit critical limitations in systematic coverage of geosystem components and cross-sphere interactions, often constrained to isolated subsystems (only in Human-activities sphere or atmosphere) with limited evaluation dimensions (less than 16 tasks). To address these gaps, we introduce OmniEarth-Bench, the first comprehensive multimodal benchmark spanning all six Earth science spheres (atmosphere, lithosphere, Oceansphere, cryosphere, biosphere and Human-activities sphere) and cross-spheres with one hundred expert-curated evaluation dimensions. Leveraging observational data from satellite sensors and in-situ measurements, OmniEarth-Bench integrates 29,779 annotations across four tiers: perception, general reasoning, scientific knowledge reasoning and chain-of-thought (CoT) reasoning. This involves the efforts of 2-5 experts per sphere to establish authoritative evaluation dimensions and curate relevant observational datasets, 40 crowd-sourcing annotators to assist experts for annotations, and finally, OmniEarth-Bench is validated via hybrid expert-crowd workflows to reduce label ambiguity. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35% accuracy. In particular, in some cross-spheres tasks, the performance of leading models like GPT-4o drops to 0.0%. OmniEarth-Bench sets a new standard for geosystem-aware AI, advancing both scientific discovery and practical applications in environmental monitoring and disaster prediction. The dataset, source code, and trained models have been released.

[104] CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization

Rui Xia,Dan Jiang,Quan Zhang,Ke Zhang,Chun Yuan

Main category: cs.CV

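TL;DR: Proposes a CLIP-assisted cross-view audio-visual enhancement method for unsupervised temporal action localization, addressing existing methods' over-reliance on highly discriminative regions and on the visual modality alone.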

Details Motivation: Existing unsupervised temporal action localization methods rely on highly discriminative regions and lack multimodal information, which makes boundary localization difficult. Method: Combines visual language pre-training with classification pre-training, adds audio perception, and achieves multi-view enhancement through self-supervised cross-view learning. Result: Outperforms existing methods on two public datasets. Conclusion: The method improves unsupervised temporal action localization through multimodal and cross-view learning. Abstract: Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model's superiority over several state-of-the-art competitors.

[105] Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation

Jiahao Cui,Yan Chen,Mingwang Xu,Hanlin Shang,Yuxuan Chen,Yun Zhan,Zilong Dong,Yao Yao,Jingdong Wang,Siyu Zhu

Main category: cs.CV

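TL;DR: Proposes a human-preference-aligned diffusion framework that uses direct preference optimization and temporal motion modulation to markedly improve lip-audio synchronization, expression vividness, and body motion coherence in portrait animation.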

Details Motivation: Generating highly dynamic and photorealistic portrait animation is challenging because it requires lip synchronization, natural expressions, and high-fidelity body motion. Method: Uses a human-preference-aligned diffusion framework comprising direct preference optimization and temporal motion modulation. Result: Experiments show improvements over baseline methods in lip-audio synchronization, expression vividness, and body motion coherence, along with notable gains in human preference metrics. Conclusion: The proposed method effectively addresses key problems in portrait animation generation and yields clear improvements on multiple metrics. Abstract: Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet and DiT-based portrait diffusion approaches, and experiments demonstrate clear improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: https://github.com/xyz123xyz456/hallo4.

[106] Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications

Jan Ignatowicz,Krzysztof Kutt,Grzegorz J. Nalepa

Main category: cs.CV

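TL;DR: Proposes the Metadata Enrichment Model (MEM), which combines neural networks with semantic technologies to improve the accessibility and interoperability of digitized cultural heritage collections.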

Details Motivation: Insufficient metadata for digitized cultural heritage limits accessibility and cross-institutional collaboration, and existing visual analysis models see limited use on domain-specific cultural artifacts. Method: Proposes the MEM framework, which combines computer vision models, large language models, and knowledge graphs, and uses a Multilayer Vision Mechanism (MVM) to dynamically detect nested features. Result: MEM is applied to a dataset of digitized incunabula from the Jagiellonian Digital Library, and a manually annotated dataset of 105 manuscript pages is released. Conclusion: MEM offers a flexible, extensible methodology for cultural heritage research and shows the practical potential of combining AI with semantic technologies. Abstract: The digitization of cultural heritage collections has opened new directions for research, yet the lack of enriched metadata poses a substantial challenge to accessibility, interoperability, and cross-institutional collaboration. In the past several years, neural network models such as YOLOv11 and Detectron2 have revolutionized visual data analysis, but their application to domain-specific cultural artifacts, such as manuscripts and incunabula, remains limited by the absence of methodologies that address structural feature extraction and semantic interoperability. In this position paper, we argue that the integration of neural networks with semantic technologies represents a paradigm shift in cultural heritage digitization processes. We present the Metadata Enrichment Model (MEM), a conceptual framework designed to enrich metadata for digitized collections by combining fine-tuned computer vision models, large language models (LLMs) and structured knowledge graphs. The Multilayer Vision Mechanism (MVM) is the key innovation of MEM. This iterative process improves visual analysis by dynamically detecting nested features, such as text within seals or images within stamps. To demonstrate MEM's potential, we apply it to a dataset of digitized incunabula from the Jagiellonian Digital Library and release a manually annotated dataset of 105 manuscript pages. We examine the practical challenges of MEM's usage in real-world GLAM institutions, including the need for domain-specific fine-tuning, the alignment of enriched metadata with Linked Data standards and computational costs. We present MEM as a flexible and extensible methodology. This paper contributes to the discussion on how artificial intelligence and semantic web technologies can advance cultural heritage research, and on how these technologies can be used in practice.

[107] Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information

Xu Chu,Xinrong Chen,Guanyu Wang,Zhijie Tan,Kui Huang,Wenyu Lv,Tong Mo,Weiping Li

Main category: cs.CV

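TL;DR: Qwen-LA is a vision-language reasoning model that reduces hallucinations and strengthens visual attention by adding a vision-text reflection process.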

Details Motivation: Long reasoning dilutes visual information and leads to hallucinations, and text-only reflection is not enough to fix this. Method: Proposes the BRPO reinforcement learning method together with Visual Token COPY and Visual Token ROUTE, which force the model to re-attend to visual information. Result: Performs strongly on multiple visual QA datasets and reduces hallucinations. Conclusion: Vision-text reflection lets Qwen-LA improve performance and reduce hallucinations. Abstract: Inference time scaling drives extended reasoning to enhance the performance of Vision-Language Models (VLMs), thus forming powerful Vision-Language Reasoning Models (VLRMs). However, long reasoning dilutes visual tokens, so visual information receives less attention, which may trigger hallucinations. Although introducing text-only reflection processes shows promise in language models, we demonstrate that it is insufficient to suppress hallucinations in VLMs. To address this issue, we introduce Qwen-LookAgain (Qwen-LA), a novel VLRM designed to mitigate hallucinations by incorporating a vision-text reflection process that guides the model to re-attend to visual information during reasoning. We first propose a reinforcement learning method, Balanced Reflective Policy Optimization (BRPO), which guides the model to decide on its own when to generate vision-text reflection and to balance the number and length of reflections. Then, we formally prove that VLRMs lose attention to visual tokens as reasoning progresses, and demonstrate that supplementing visual information during reflection enhances visual attention. Therefore, during training and inference, Visual Token COPY and Visual Token ROUTE are introduced to force the model to re-attend to visual information at the visual level, addressing the limitations of text-only reflection. Experiments on multiple visual QA datasets and hallucination metrics indicate that Qwen-LA achieves leading accuracy while reducing hallucinations. Our code is available at: https://github.com/Liar406/Look_Again.

[108] Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition

Yu Li,Jin Jiang,Jianhua Zhu,Shuai Peng,Baole Wei,Yuxuan Zhou,Liangcai Gao

Main category: cs.CV

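TL;DR: Uni-MuMER fine-tunes a pretrained vision-language model (VLM) for handwritten mathematical expression recognition (HMER) without modifying its architecture, combining the Tree-CoT, EDL, and SC tasks to reach state-of-the-art performance on the CROHME and HME100K datasets.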

Details Motivation: HMER is challenging because of free-form symbol layout and variable handwriting styles; existing methods are hard to integrate into a unified framework, while the cross-task generalization of VLMs makes such a framework possible. Method: Uni-MuMER fine-tunes a VLM on three tasks: Tree-CoT (structured spatial reasoning), EDL (reducing confusion among similar characters), and SC (consistency in long expressions). Result: On CROHME and HME100K, Uni-MuMER surpasses SSAN by 16.31% and Gemini2.5-flash by 24.42%, reaching state-of-the-art performance. Conclusion: Uni-MuMER shows the potential of VLMs for HMER, achieving high performance through data-driven task integration; code and models are open-sourced. Abstract: Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER

[109] Weakly-supervised Localization of Manipulated Image Regions Using Multi-resolution Learned Features

Ziyong Wang,Charith Abhayaratne

Main category: cs.CV

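TL;DR: Proposes a weakly supervised method that combines an image-level detection network with pre-trained segmentation models to localize image manipulations without pixel-level annotations.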

Details Motivation: Addresses the limited interpretability and localization ability of existing deep learning methods for image manipulation detection, and the lack of pixel-level annotations. Method: Builds on WCBnet to produce multi-view feature maps, then combines them with pre-trained segmentation models (such as DeepLab and SegmentAnything) and Bayesian inference to localize manipulated regions. Result: Experiments show the method is effective and can localize manipulations without pixel-level annotations. Conclusion: Weakly supervised methods are promising for manipulation localization and address the shortage of annotation resources. Abstract: The explosive growth of digital images and the widespread availability of image editing tools have made image manipulation detection an increasingly critical challenge. Although current deep learning-based manipulation detection methods achieve high image-level classification accuracy, they often fall short in terms of interpretability and localization of manipulated regions. Additionally, the absence of pixel-wise annotations in real-world scenarios limits the existing fully-supervised manipulation localization techniques. To address these challenges, we propose a novel weakly-supervised approach that integrates activation maps generated by image-level manipulation detection networks with segmentation maps from pre-trained models. Specifically, we build on our previous image-level work named WCBnet to produce multi-view feature maps which are subsequently fused for coarse localization. These coarse maps are then refined using detailed segmented regional information provided by pre-trained segmentation models (such as DeepLab, SegmentAnything and PSPnet), with Bayesian inference employed to enhance the manipulation localization. Experimental results demonstrate the effectiveness of our approach, highlighting the feasibility of localizing image manipulations without relying on pixel-level labels.

[110] Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

Zifu Wang,Junyi Zhu,Bo Tang,Zhiyu Li,Feiyu Xiong,Jiaqian Yu,Matthew B. Blaschko

Main category: cs.CV

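TL;DR: Studies rule-based visual reinforcement learning (RL) on jigsaw puzzles, finding that fine-tuning substantially improves multimodal large language models (MLLMs) and that RL generalizes better than supervised fine-tuning (SFT).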

Details Motivation: To explore how MLLMs perform on visual tasks under a rule-based RL framework, using jigsaw puzzles as the experimental platform. Method: Uses jigsaw puzzles as a structured experimental framework to compare RL and SFT and to analyze generalization and reasoning patterns. Result: After fine-tuning, MLLMs reach near-perfect accuracy on jigsaw puzzles and generalize to complex configurations; RL generalizes better than SFT; reasoning patterns appear to be pre-existing rather than emergent. Conclusion: The study offers insights into rule-based visual RL and its potential in multimodal learning, though results may vary by task. Abstract: The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL using jigsaw puzzles as a structured experimental framework, revealing several key findings. Firstly, we find that MLLMs, initially performing near random guessing on simple puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. Secondly, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Thirdly, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. Fourthly, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. Finally, our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece of jigsaw to the larger puzzle of collective understanding of rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.

[111] VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Ce Zhang,Kaixin Ma,Tianqing Fang,Wenhao Yu,Hongming Zhang,Zhisong Zhang,Yaqi Xie,Katia Sycara,Haitao Mi,Dong Yu

Main category: cs.CV

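TL;DR: VScan is a two-stage visual token reduction framework that combines global and local scans with token merging during visual encoding and adds pruning at intermediate layers of the language model, substantially accelerating inference while retaining high performance.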

Details Motivation: Long visual token sequences make large vision-language models (LVLMs) computationally expensive and hard to deploy in real time, so visual token processing needs to be optimized. Method: Proposes the VScan framework, which combines global and local scans with token merging during visual encoding and introduces pruning at intermediate layers of the language model. Result: Validated on four LVLMs; on LLaVA-NeXT-7B, VScan achieves a 2.91× prefilling speedup and a 10× reduction in FLOPs while retaining 95.4% of the original performance. Conclusion: By optimizing visual token processing, VScan balances computational efficiency and model performance and outperforms existing methods. Abstract: Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91× speedup in prefilling and a 10× reduction in FLOPs, while retaining 95.4% of the original performance.
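The pruning step mentioned in the abstract can be illustrated with a generic score-based top-k filter. This is only a minimal sketch and not VScan's actual procedure: the function name `prune_tokens`, the `keep_ratio` parameter, and the use of one precomputed importance score per token are all assumptions made for illustration.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their original order.

    `scores` is a hypothetical per-token importance value (for example, the
    attention a token receives); how it is computed is outside this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Number of tokens to keep; always keep at least one when any exist.
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore positional order so downstream layers see a consistent sequence.
    return [tokens[i] for i in sorted(top)]


tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
kept = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept)
```

Sorting the surviving indices keeps the remaining tokens in their original positional order, which is a common choice when later layers rely on position.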

[112] DeepChest: Dynamic Gradient-Free Task Weighting for Effective Multi-Task Learning in Chest X-ray Classification

Youssef Mohamed,Noran Mohamed,Khaled Abouhashad,Feilong Tang,Sara Atito,Shoaib Jameel,Imran Razzak,Ahmed B. Zaky

Main category: cs.CV

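TL;DR: DeepChest is a dynamic task-weighting framework for multi-label chest X-ray classification whose performance-driven weighting mechanism substantially improves training speed and accuracy.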

Details Motivation: Multi-task learning (MTL) offers advantages in domains such as medical imaging, but balancing task contributions remains a key challenge. Method: DeepChest uses a model-agnostic dynamic task-weighting approach based on analysis of task-specific loss trends; it needs no gradient access, which reduces memory use and speeds up training. Result: On a large-scale CXR dataset, DeepChest improves overall accuracy by 7% over existing MTL methods and substantially reduces task losses. Conclusion: DeepChest's efficiency and performance gains make deep learning more practical for real-world medical diagnosis. Abstract: While Multi-Task Learning (MTL) offers inherent advantages in complex domains such as medical imaging by enabling shared representation learning, effectively balancing task contributions remains a significant challenge. This paper addresses this critical issue by introducing DeepChest, a novel, computationally efficient and effective dynamic task-weighting framework specifically designed for multi-label chest X-ray (CXR) classification. Unlike existing heuristic or gradient-based methods that often incur substantial overhead, DeepChest leverages a performance-driven weighting mechanism based on effective analysis of task-specific loss trends. Given a network architecture (e.g., ResNet18), our model-agnostic approach adaptively adjusts task importance without requiring gradient access, thereby significantly reducing memory usage and achieving a threefold increase in training speed. It can be easily applied to improve various state-of-the-art methods. Extensive experiments on a large-scale CXR dataset demonstrate that DeepChest not only outperforms state-of-the-art MTL methods by 7% in overall accuracy but also yields substantial reductions in individual task losses, indicating improved generalization and effective mitigation of negative transfer. The efficiency and performance gains of DeepChest pave the way for more practical and robust deployment of deep learning in critical medical diagnostic applications. The code is publicly available at https://github.com/youssefkhalil320/DeepChest-MTL
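To make the idea of gradient-free, loss-trend-based weighting concrete, here is a minimal sketch. It is not DeepChest's rule, which the abstract does not specify; it uses a softmax over loss ratios, similar in spirit to dynamic weight averaging, and the function name and `temperature` parameter are assumptions.

```python
import math


def loss_trend_weights(prev_losses, curr_losses, temperature=2.0):
    """Assign larger weights to tasks whose loss is improving more slowly.

    Only scalar loss values are used, so no gradient access is needed.
    This rule is an illustrative stand-in, not the method from the paper.
    """
    if len(prev_losses) != len(curr_losses):
        raise ValueError("loss lists must have the same length")
    if any(p <= 0 for p in prev_losses):
        raise ValueError("previous losses must be positive")
    # Ratio below 1 means the task improved; above 1 means it got worse.
    ratios = [c / p for p, c in zip(prev_losses, curr_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    n = len(ratios)
    # Normalize so the weights sum to the number of tasks.
    return [n * e / total for e in exps]


prev = [0.80, 0.60, 0.50]  # losses at the previous epoch (made-up values)
curr = [0.40, 0.57, 0.55]  # losses at the current epoch (made-up values)
weights = loss_trend_weights(prev, curr)
print(weights)
```

In this example the third task, whose loss rose, receives the largest weight, and the first task, whose loss halved, receives the smallest.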

[113] Bridging Classical and Modern Computer Vision: PerceptiveNet for Tree Crown Semantic Segmentation

Georgios Voulgaris

Main category: cs.CV

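TL;DR: Proposes PerceptiveNet, which combines a Log-Gabor convolutional layer with a wide-receptive-field backbone to substantially improve tree crown semantic segmentation accuracy, and validates its generalization on datasets from multiple domains.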

Details Motivation: Accurate semantic segmentation of tree crowns is essential for forest management, biodiversity studies, and carbon sequestration quantification, but traditional methods and deep learning models struggle with complex canopy structure. Method: Proposes PerceptiveNet, which includes a trainable Log-Gabor convolutional layer and a wide-receptive-field backbone; compares different convolutional layers experimentally and evaluates the model as a backbone in a hybrid CNN-Transformer model. Result: PerceptiveNet outperforms existing models on a tree crown dataset and generalizes to aerial scene datasets of varying complexity. Conclusion: Combining Log-Gabor convolution with a wide-receptive-field backbone improves both the accuracy and the generalization of tree crown semantic segmentation. Abstract: The accurate semantic segmentation of tree crowns within remotely sensed data is crucial for scientific endeavours such as forest management, biodiversity studies, and carbon sequestration quantification. However, precise segmentation remains challenging due to complexities in the forest canopy, including shadows, intricate backgrounds, scale variations, and subtle spectral differences among tree species. Compared to traditional methods, Deep Learning models improve accuracy by extracting informative and discriminative features, but often fall short in capturing the aforementioned complexities. To address these challenges, we propose PerceptiveNet, a novel model incorporating a Logarithmic Gabor-parameterised convolutional layer with trainable filter parameters, alongside a backbone that extracts salient features while capturing extensive context and spatial information through a wider receptive field. We investigate the impact of Log-Gabor, Gabor, and standard convolutional layers on semantic segmentation performance through extensive experimentation. Additionally, we conduct an ablation study to assess the contributions of individual layers and their combinations to overall model performance, and we evaluate PerceptiveNet as a backbone within a novel hybrid CNN-Transformer model. Our model outperforms state-of-the-art models, demonstrating significant performance improvements on a tree crown dataset while generalising across domains, including two benchmark aerial scene semantic segmentation datasets with varying complexities.

[114] A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

Shengyuan Liu,Boyun Zheng,Wenting Chen,Zhihao Peng,Zhenfei Yin,Jing Shao,Jiancong Hu,Yixuan Yuan

Main category: cs.CV

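TL;DR: EndoBench is a comprehensive benchmark for multi-modal large language models (MLLMs) that evaluates multi-dimensional capabilities in endoscopic practice across diverse scenarios and tasks, revealing the gap between current models and expert clinical reasoning.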

Details Motivation: Existing benchmarks are limited to specific endoscopic scenarios and a small set of clinical tasks, so they fail to reflect real-world diversity and the full needs of clinical workflows. Method: EndoBench covers 4 endoscopic scenarios, 12 clinical tasks with subtasks, and 5 levels of visual prompting granularity, for a total of 6,832 validated VQA pairs, and evaluates 23 state-of-the-art models. Result: Proprietary MLLMs outperform open-source and medical-specialized models but still trail human experts; medical-domain supervised fine-tuning substantially improves task accuracy; model performance is sensitive to prompt format and task complexity. Conclusion: EndoBench sets a new standard for evaluating and advancing MLLMs in endoscopy, exposes the limitations of current models, and is released with its code. Abstract: Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited, as they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic scenarios and the full range of skills needed in clinical workflows. To address these issues, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities. EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets. Our multi-dimensional evaluation framework mirrors the clinical workflow (spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations) to holistically gauge the perceptual and diagnostic abilities of MLLMs in realistic scenarios. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs, and establish human clinician performance as a reference standard. Our extensive experiments reveal: (1) proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts; (2) medical-domain supervised fine-tuning substantially boosts task-specific accuracy; and (3) model performance remains sensitive to prompt format and clinical task complexity. EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning. We publicly release our benchmark and code.

[115] Color Image Set Recognition Based on Quaternionic Grassmannians

Xiang Xiang Wang,Tin-Yau Tam

Main category: cs.CV

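TL;DR: Proposes a color image set recognition method based on quaternionic Grassmannians, using quaternions to capture color information and representing each image set as a point on the Grassmannian.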

Details Motivation: To use the strengths of quaternions to represent and process color information more effectively and improve image set recognition. Method: Maps each color image set to a point on the quaternionic Grassmannian and computes the shortest distance between points for classification. Result: Achieves good recognition results on the ETH-80 dataset. Conclusion: The method is effective, but its stability needs improvement and it can be further optimized in future work. Abstract: We propose a new method for recognizing color image sets using quaternionic Grassmannians, which use the power of quaternions to capture color information and represent each color image set as a point on the quaternionic Grassmannian. We provide a direct formula to calculate the shortest distance between two points on the quaternionic Grassmannian, and use this distance to build a new classification framework. Experiments on the ETH-80 benchmark dataset show that our method achieves good recognition results. We also discuss some limitations in stability and suggest ways the method can be improved in the future.

[116] Comparing the Effects of Persistence Barcodes Aggregation and Feature Concatenation on Medical Imaging

Dashti A. Ali,Richard K. G. Do,William R. Jarnagin,Aras T. Asaad,Amber L. Simpson

Main category: cs.CV

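TL;DR: Compares two ways of constructing persistent-homology-based feature vectors for medical image analysis and finds that feature concatenation outperforms aggregation.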

Details Motivation: Traditional feature extraction is sensitive to input changes, whereas persistent homology (PH) provides stable topological features; how best to construct the final feature vector remains an open question. Method: Builds features either by aggregating persistence barcodes or by concatenating topological feature vectors, and compares the two approaches on multiple medical imaging datasets. Result: Feature concatenation preserves more detailed topological information and yields better classification performance. Conclusion: Feature concatenation is the preferred choice for similar experiments. Abstract: In medical image analysis, feature engineering plays an important role in the design and performance of machine learning models. Persistent homology (PH), from the field of topological data analysis (TDA), demonstrates robustness and stability to data perturbations and addresses the limitation of traditional feature extraction approaches where a small change in input results in a large change in feature representation. Using PH, we store persistent topological and geometrical features in the form of the persistence barcode whereby large bars represent global topological features and small bars encapsulate geometrical information of the data. When multiple barcodes are computed from 2D or 3D medical images, two approaches can be used to construct the final topological feature vector in each dimension: aggregating persistence barcodes followed by featurization or concatenating topological feature vectors derived from each barcode. In this study, we conduct a comprehensive analysis across diverse medical imaging datasets to compare the effects of the two aforementioned approaches on the performance of classification models. The results of this analysis indicate that feature concatenation preserves detailed topological information from individual barcodes, yields better classification performance and is therefore a preferred approach when conducting similar experiments.
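The difference between the two constructions can be seen in a small sketch. It assumes each barcode is a list of `(birth, death)` pairs in one homology dimension and uses four simple bar-length statistics as the featurization; the paper's actual featurization is not given in the abstract, so `featurize` and both helper names are hypothetical.

```python
def featurize(barcode):
    """Summarize a barcode by bar count, total, maximum, and mean bar length."""
    lengths = [death - birth for birth, death in barcode]
    if not lengths:
        return [0.0, 0.0, 0.0, 0.0]
    total = sum(lengths)
    return [float(len(lengths)), total, max(lengths), total / len(lengths)]


def aggregate_then_featurize(barcodes):
    """Merge all bars into one barcode, then featurize once (fixed-size output)."""
    merged = [bar for barcode in barcodes for bar in barcode]
    return featurize(merged)


def featurize_then_concatenate(barcodes):
    """Featurize each barcode separately and join the vectors end to end."""
    features = []
    for barcode in barcodes:
        features.extend(featurize(barcode))
    return features


# Two made-up barcodes, e.g. from two slices of the same scan.
slice_a = [(0.0, 1.0), (0.5, 3.0)]
slice_b = [(0.0, 0.5)]
aggregated = aggregate_then_featurize([slice_a, slice_b])
concatenated = featurize_then_concatenate([slice_a, slice_b])
print(len(aggregated), len(concatenated))
```

Aggregation yields one vector of fixed length regardless of how many barcodes there are, while concatenation grows with the number of barcodes and keeps the per-barcode statistics separate, which is consistent with the abstract's remark that concatenation preserves detail from individual barcodes.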

[117] Radiant Triangle Soup with Soft Connectivity Forces for 3D Reconstruction and Novel View Synthesis

Nathaniel Burgdorfer,Philippos Mordohai

Main category: cs.CV

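TL;DR: Proposes an inference-time optimization framework that represents scene geometry and appearance with triangles, arguing that triangles offer advantages over Gaussian splats.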

Details Motivation: Triangles are more expressive than Gaussian splats and better support downstream tasks. Method: Develops an optimization algorithm for triangle soup (disconnected semi-transparent triangles) and introduces connectivity forces to encourage surface continuity. Result: Shows competitive photometric and geometric results on a 3D reconstruction dataset. Conclusion: Triangle representations have advantages for scene optimization, particularly in expressiveness and downstream task support. Abstract: In this work, we introduce an inference-time optimization framework utilizing triangles to represent the geometry and appearance of the scene. More specifically, we develop a scene optimization algorithm for triangle soup, a collection of disconnected semi-transparent triangle primitives. Compared to the current most widely used primitives for 3D scene representation, namely Gaussian splats, triangles allow for more expressive color interpolation, and benefit from a large algorithmic infrastructure for downstream tasks. Triangles, unlike full-rank Gaussian kernels, naturally combine to form surfaces. We formulate connectivity forces between triangles during optimization, encouraging explicit, but soft, surface continuity in 3D. We perform experiments on a representative 3D reconstruction dataset and show competitive photometric and geometric results.

[118] VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Xiangdong Zhang,Jiaqi Liao,Shaofeng Zhang,Fanqing Meng,Xiangpeng Wan,Junchi Yan,Yu Cheng

Main category: cs.CV

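TL;DR: The VideoREPA framework uses a Token Relation Distillation (TRD) loss to inject physics knowledge from video understanding foundation models into text-to-video (T2V) models, substantially improving the physical plausibility of generated videos.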

Details Motivation: Current T2V models struggle to generate physically plausible content, and their physics understanding lags behind video self-supervised learning methods. Method: Proposes the VideoREPA framework, which uses a Token Relation Distillation (TRD) loss to align token-level relations between video understanding foundation models and T2V models. Result: VideoREPA substantially improves the physics commonsense of the baseline CogVideoX and performs well on relevant benchmarks. Conclusion: VideoREPA is the first REPA method designed for finetuning T2V models and injecting physical knowledge, and it improves the physical plausibility of generated videos. Abstract: Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physics-plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
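Aligning token-level relations, rather than the token features themselves, can be sketched as follows. This is an illustrative stand-in and not the paper's TRD loss, whose exact form the abstract does not give: it assumes pairwise cosine similarity as the "relation" and a mean absolute difference as the penalty, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def relation_matrix(tokens):
    """Pairwise similarity between every pair of tokens in one model."""
    return [[cosine(u, v) for v in tokens] for u in tokens]


def relation_distillation_loss(student_tokens, teacher_tokens):
    """Mean absolute difference between student and teacher relation matrices."""
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("both models must provide the same number of tokens")
    s = relation_matrix(student_tokens)
    t = relation_matrix(teacher_tokens)
    n = len(s)
    return sum(abs(s[i][j] - t[i][j]) for i in range(n) for j in range(n)) / (n * n)


# Made-up features: the teacher uses 3 dimensions, the student only 2.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
student_aligned = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
student_collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
loss_aligned = relation_distillation_loss(student_aligned, teacher)
loss_collapsed = relation_distillation_loss(student_collapsed, teacher)
print(loss_aligned, loss_collapsed)
```

Because only the token-by-token relation matrices are compared, the student and teacher features may have different dimensions, as in the example, and no projection layer between the two feature spaces is needed in this sketch.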

[119] D-AR: Diffusion via Autoregressive Models

Ziteng Gao,Mike Zheng Shou

Main category: cs.CV

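TL;DR: D-AR recasts image diffusion as a standard autoregressive process, generating images from sequences of discrete tokens and supporting previews and zero-shot layout-controlled synthesis.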

Details Motivation: To explore a unified autoregressive architecture that exploits diffusion properties for image generation while remaining compatible with large language models. Method: Designs a tokenizer that converts images into sequences of discrete tokens and uses an autoregressive model to predict them, with tokens corresponding directly to diffusion denoising steps. Result: On the ImageNet benchmark, reaches 2.09 FID with a 775M Llama backbone and 256 tokens. Conclusion: D-AR offers a new direction for unified autoregressive architectures in visual synthesis and could be combined with large language models in the future. Abstract: This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties; for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR

[120] OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

Size Wu,Zhonghua Wu,Zerui Gong,Qingyi Tao,Sheng Jin,Qinyue Li,Wei Li,Chen Change Loy

Main category: cs.CV

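TL;DR: OpenUni is a lightweight, open-source baseline for unified multimodal understanding and generation that achieves high-quality image generation and strong performance with an efficient training strategy and a simple architecture.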

Details Motivation: Inspired by prevailing practices in unified model learning, it aims to reduce training complexity while improving performance on multimodal tasks. Method: Bridges off-the-shelf multimodal large language models and diffusion models with learnable queries and a lightweight transformer connector. Result: Generates high-quality images and performs strongly on standard benchmarks with only 1.1B and 3.1B activated parameters. Conclusion: OpenUni supports open research and community development by releasing model weights, code, and datasets. Abstract: In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging the off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a light-weight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.

[121] Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch,Snigdha Saha,Naitik Khandelwal,Ayush Jain,Michael J. Tarr,Aviral Kumar,Katerina Fragkiadaki

Main category: cs.CV

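TL;DR: ViGoRL is a model trained with reinforcement learning to explicitly anchor each reasoning step to visual coordinates in visual reasoning tasks, substantially improving performance.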

Details Motivation: Visual reasoning requires a model to direct visual attention, interpret perceptual inputs, and ground its reasoning in spatial evidence, and conventional methods lack an explicit grounding mechanism. Method: ViGoRL uses a multi-turn reinforcement learning framework to zoom into visual coordinates dynamically and produce spatially grounded reasoning traces. Result: Across multiple visual reasoning benchmarks, ViGoRL outperforms supervised fine-tuning and conventional RL baselines, with notable gains on visual search and small GUI element localization, reaching 86.4% on V*Bench. Conclusion: Visually grounded reinforcement learning is an effective paradigm for general-purpose visual reasoning. Abstract: While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks (including SAT-2 and BLINK for spatial reasoning, V*Bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding), ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.

[122] VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Tingyu Song,Tongyan Hu,Guo Gan,Yilun Zhao

Main category: cs.CV

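TL;DR: Proposes VF-Eval, a new benchmark for evaluating multimodal large language models (MLLMs) on AI-generated content (AIGC) videos; finds that existing models perform poorly and shows how human feedback can improve video generation.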

Details Motivation: Existing work focuses on natural videos and overlooks the evaluation of AI-generated (AIGC) videos, and the abilities of MLLMs on AIGC videos remain underexplored. Method: Proposes the VF-Eval benchmark with four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) and evaluates 13 frontier MLLMs. Result: Even the best-performing model, GPT-4.1, struggles to perform consistently well across all tasks, which shows how challenging the benchmark is. Conclusion: VF-Eval exposes the limitations of MLLMs on AIGC videos, and the RePrompt experiment shows the value of human feedback for improving video generation. Abstract: MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.

[123] DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers

Li Ren,Chen Chen,Liqiang Wang,Kien Hua

Main category: cs.CV

TL;DR: DA-VPT uses metric learning to study how the distribution of prompts affects fine-tuning performance, and proposes a framework that guides the prompt distribution with class-related semantic data to improve ViT fine-tuning.

Details Motivation: To study the fundamental correlation and distribution between prompts and image tokens in order to improve visual prompt tuning. Method: Proposes Distribution Aware Visual Prompt Tuning (DA-VPT), which guides the prompt distribution by learning a distance metric from class-related semantic data. Result: The method is validated on recognition and segmentation tasks and improves ViT performance on downstream vision tasks. Conclusion: By using semantic information to guide prompt learning, DA-VPT makes ViT fine-tuning more efficient and more effective. Abstract: Visual Prompt Tuning (VPT) has become a promising Parameter-Efficient Fine-Tuning (PEFT) approach for Vision Transformer (ViT) models, partially fine-tuning learnable tokens while keeping most model parameters frozen. Recent research has explored modifying the connection structures of the prompts. However, the fundamental correlation and distribution between the prompts and image tokens remain unexplored. In this paper, we leverage metric learning techniques to investigate how the distribution of prompts affects fine-tuning performance. Specifically, we propose a novel framework, Distribution Aware Visual Prompt Tuning (DA-VPT), to guide the distributions of the prompts by learning the distance metric from their class-related semantic data. Our method demonstrates that the prompts can serve as an effective bridge to share semantic information between image patches and the class token. We extensively evaluated our approach on popular benchmarks in both recognition and segmentation tasks. The results demonstrate that our approach enables more effective and efficient fine-tuning of ViT models by leveraging semantic information to guide the learning of the prompts, leading to improved performance on various downstream vision tasks.

[124] CLDTracker: A Comprehensive Language Description for Visual Tracking

Mohamad Alansari,Sajid Javed,Iyyakutti Iyappan Ganapathi,Sara Alansari,Muzammal Naseer

Main category: cs.CV

TL;DR: CLDTracker proposes a dual-branch vision-language architecture that uses rich textual descriptions and efficient vision-language feature fusion to handle appearance changes and semantic understanding in visual object tracking (VOT), reaching SOTA performance on several benchmarks.

Details Motivation: Traditional trackers perform poorly in complex scenes, while vision-language models (VLMs) show promise for semantic understanding but suffer from inadequate textual representations, inefficient feature fusion, and a lack of temporal modeling. Method: Uses a dual-branch architecture (a textual branch and a visual branch) and draws on CLIP and GPT-4V to generate rich textual descriptions enriched with semantic and contextual cues. Result: Achieves SOTA performance on six standard VOT benchmarks. Conclusion: With efficient and temporally adaptive vision-language representations, CLDTracker substantially improves the robustness and accuracy of VOT. Abstract: Visual object tracking (VOT) remains a fundamental yet challenging task in computer vision due to dynamic appearance changes, occlusions, and background clutter. Traditional trackers, relying primarily on visual cues, often struggle in such complex scenarios. Recent advancements in vision-language models (VLMs) have shown promise in semantic understanding for tasks like open-vocabulary detection and image captioning, suggesting their potential for VOT. However, the direct application of VLMs to VOT is hindered by critical limitations: the absence of a rich and comprehensive textual representation that semantically captures the target object's nuances, limiting the effective use of language information; inefficient fusion mechanisms that fail to optimally integrate visual and textual features, preventing a holistic understanding of the target; and a lack of temporal modeling of the target's evolving appearance in the language domain, leading to a disconnect between the initial description and the object's subsequent visual changes. To bridge these gaps and unlock the full potential of VLMs for VOT, we propose CLDTracker, a novel Comprehensive Language Description framework for robust visual Tracking. Our tracker introduces a dual-branch architecture consisting of a textual and a visual branch. In the textual branch, we construct a rich bag of textual descriptions derived by harnessing powerful VLMs such as CLIP and GPT-4V, enriched with semantic and contextual cues to address the lack of rich textual representation. Experiments on six standard VOT benchmarks demonstrate that CLDTracker achieves state-of-the-art (SOTA) performance, validating the effectiveness of leveraging robust and temporally-adaptive vision-language representations for tracking.
Code and models are publicly available at: https://github.com/HamadYA/CLDTracker

[125] Skin Lesion Phenotyping via Nested Multi-modal Contrastive Learning

Dionysis Christopoulos,Sotiris Spanos,Eirini Baltzi,Valsamis Ntouskos,Konstantinos Karantzalos

Main category: cs.CV

TL;DR: SLIMP proposes a novel nested contrastive learning approach that combines skin lesion images with metadata, improving performance on skin lesion classification.

Details Motivation: Existing methods rely on image data alone and ignore clinical and phenotypical context, whereas clinicians usually weigh the patient's medical history and the appearance of the patient's other lesions. SLIMP aims to improve classification by integrating multi-modal data. Method: Uses nested contrastive learning to combine skin lesion images, lesion-level metadata, and patient-level metadata (such as medical records), exploiting all available data modalities. Result: Compared with other pre-training strategies, SLIMP performs better on downstream skin lesion classification tasks, which supports the quality of the learned representations. Conclusion: Through multi-modal data integration and contrastive learning, SLIMP improves skin lesion classification and offers more comprehensive support for clinical diagnosis. Abstract: We introduce SLIMP (Skin Lesion Image-Metadata Pre-training) for learning rich representations of skin lesions through a novel nested contrastive learning approach that captures complex relationships between images and metadata. Melanoma detection and skin lesion classification based solely on images pose significant challenges due to large variations in imaging conditions (lighting, color, resolution, distance, etc.) and lack of clinical and phenotypical context. Clinicians typically follow a holistic approach for assessing the risk level of the patient and for deciding which lesions may be malignant and need to be excised, by considering the patient's medical history as well as the appearance of other lesions of the patient. Inspired by this, SLIMP combines the appearance and the metadata of individual skin lesions with patient-level metadata relating to their medical record and other clinically relevant information. By fully exploiting all available data modalities throughout the learning process, the proposed pre-training strategy improves performance compared to other pre-training strategies on downstream skin lesion classification tasks, highlighting the quality of the learned representations.

[126] AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views

Lihan Jiang,Yucheng Mao,Linning Xu,Tao Lu,Kerui Ren,Yichen Jin,Xudong Xu,Mulin Yu,Jiangmiao Pang,Feng Zhao,Dahua Lin,Bo Dai

Main category: cs.CV

TL;DR: AnySplat is a feed-forward network for novel view synthesis from uncalibrated image collections; it needs neither known camera poses nor per-scene optimization and is computationally efficient.

Details Motivation: Traditional neural rendering methods require known camera poses and per-scene optimization, and existing feed-forward methods become computationally heavy with dense views. AnySplat aims to solve both problems and deliver efficient novel view synthesis without pose annotations. Method: A single forward pass predicts 3D Gaussian primitives (encoding scene geometry and appearance) together with the camera intrinsics and extrinsics of every input image, which suits uncalibrated multi-view datasets. Result: In zero-shot evaluations, AnySplat matches the quality of pose-aware baselines in both sparse and dense view settings, surpasses existing pose-free methods, and greatly reduces rendering latency. Conclusion: AnySplat offers an efficient solution for real-time novel view synthesis in unconstrained capture settings. Abstract: We introduce AnySplat, a feed-forward network for novel view synthesis from uncalibrated image collections. In contrast to traditional neural rendering pipelines that demand known camera poses and per-scene optimization, or recent feed-forward methods that buckle under the computational weight of dense views, our model predicts everything in one shot. A single forward pass yields a set of 3D Gaussian primitives encoding both scene geometry and appearance, and the corresponding camera intrinsics and extrinsics for each input image. This unified design scales effortlessly to casually captured, multi-view datasets without any pose annotations. In extensive zero-shot evaluations, AnySplat matches the quality of pose-aware baselines in both sparse and dense view scenarios while surpassing existing pose-free approaches. Moreover, it greatly reduces rendering latency compared to optimization-based neural fields, bringing real-time novel view synthesis within reach for unconstrained capture settings. Project page: https://city-super.github.io/anysplat/

[127] FMG-Det: Foundation Model Guided Robust Object Detection

Darryl Hannan,Timothy Doster,Henry Kvinge,Adam Attarian,Yijing Watkins

Main category: cs.CV

TL;DR: FMG-Det proposes an efficient method that combines multiple instance learning (MIL) with a label-correcting pre-processing pipeline to handle noisy annotations in object detection, substantially improving model performance.

Details Motivation: Labeling object boundaries is inherently subjective, so data quality is inconsistent; noisy annotations severely degrade detector performance, especially in few-shot settings. Method: Combines a multiple instance learning (MIL) framework with a foundation-model-based pre-processing pipeline that corrects labels before training, plus slight modifications to the detector head. Result: Achieves state-of-the-art performance on a number of datasets, in both standard and few-shot scenarios, while being simpler and more efficient than other approaches. Conclusion: FMG-Det is a simple and efficient solution to the noisy-annotation problem that substantially improves the robustness of object detectors. Abstract: Collecting high quality data for object detection tasks is challenging due to the inherent subjectivity in labeling the boundaries of an object. This makes it difficult to not only collect consistent annotations across a dataset but also to validate them, as no two annotators are likely to label the same object using the exact same coordinates. These challenges are further compounded when object boundaries are partially visible or blurred, which can be the case in many domains. Training on noisy annotations significantly degrades detector performance, rendering detectors unusable, particularly in few-shot settings, where just a few corrupted annotations can impact model performance. In this work, we propose FMG-Det, a simple, efficient methodology for training models with noisy annotations. More specifically, we propose combining a multiple instance learning (MIL) framework with a pre-processing pipeline that leverages powerful foundation models to correct labels prior to training. This pre-processing pipeline, along with slight modifications to the detector head, results in state-of-the-art performance across a number of datasets, for both standard and few-shot scenarios, while being much simpler and more efficient than other approaches.

[128] PixelThink: Towards Efficient Chain-of-Pixel Reasoning

Song Wang,Gongfan Fang,Lingdong Kong,Xiangtai Li,Jianyun Xu,Sheng Yang,Qiang Li,Jianke Zhu,Xinchao Wang

Main category: cs.CV

TL;DR: PixelThink regulates reasoning generation using task difficulty and model uncertainty, improving both reasoning efficiency and segmentation performance.

Details Motivation: Existing methods generalize poorly to out-of-distribution scenarios, and their reasoning chains are verbose and computationally expensive. Method: Proposes PixelThink, which combines externally estimated task difficulty with internally measured model uncertainty to adjust reasoning length dynamically. Result: Experiments show that the method improves both reasoning efficiency and segmentation performance. Conclusion: The work offers a new perspective on efficient and interpretable multimodal understanding. Abstract: Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) using image-text pairs and corresponding mask labels. However, they exhibit limited generalization to out-of-distribution scenarios without an explicit reasoning process. Although recent efforts leverage reinforcement learning through group-relative policy optimization (GRPO) to enhance reasoning ability, they often suffer from overthinking: producing uniformly verbose reasoning chains irrespective of task complexity. This results in elevated computational costs and limited control over reasoning quality. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress reasoning length in accordance with scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics designed to assess segmentation accuracy, reasoning quality, and efficiency jointly. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance. Our work contributes novel perspectives towards efficient and interpretable multimodal understanding. The code and model will be publicly available.

[129] ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS

Weijie Wang,Donny Y. Chen,Zeyu Zhang,Duochao Shi,Akide Liu,Bohan Zhuang

Main category: cs.CV

TL;DR: ZPressor is a lightweight module that compresses multi-view inputs into a compact latent state Z, improving the scalability and performance of feed-forward 3D Gaussian Splatting models.

Details Motivation: Existing feed-forward 3D Gaussian Splatting models have limited encoder capacity and struggle as the number of input views grows, which leads to degraded performance or excessive memory consumption. Method: Proposes the ZPressor module, which partitions the views into anchor and support sets and uses cross attention to compress the information into a latent state Z. Result: On the DL3DV-10K and RealEstate10K benchmarks, ZPressor consistently improves performance and robustness and supports more than 100 input views. Conclusion: ZPressor gives feed-forward 3D Gaussian Splatting models an efficient compression scheme and widens their range of application. Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their encoders, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state $Z$ that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state $Z$. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks, DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: https://lhmd.top/zpressor.
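The anchor/support compression described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the ZPressor implementation: each view is reduced to a single feature vector, anchors are picked at evenly spaced indices, and one unparameterised attention head with a residual connection stands in for the learned cross-attention block. The names `compress_views` and `num_anchors` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def compress_views(view_tokens, num_anchors):
    """Fold support-view features into anchor-view features.

    view_tokens: one feature vector per input view (a simplification).
    Returns one compressed vector per anchor view, playing the role of Z.
    """
    n = len(view_tokens)
    # Hypothetical anchor choice: evenly spaced view indices.
    step = n / num_anchors
    anchor_ids = sorted({int(i * step) for i in range(num_anchors)})
    anchors = [view_tokens[i] for i in anchor_ids]
    support = [v for i, v in enumerate(view_tokens) if i not in anchor_ids]
    if not support:
        return anchors
    dim = len(anchors[0])
    z = []
    for query in anchors:
        # Anchors act as queries; support views act as keys and values.
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in support])
        attended = [sum(w * v[d] for w, v in zip(weights, support))
                    for d in range(dim)]
        # Residual connection keeps the anchor's own information.
        z.append([q + a for q, a in zip(query, attended)])
    return z

views = [[1.0, 0.0], [2.0, 2.0], [0.0, 1.0], [2.0, 2.0]]
latent = compress_views(views, num_anchors=2)
print(len(views), "views ->", len(latent), "latent vectors")
```

The size of the output depends only on the number of anchors, so adding more support views increases the attention cost but not the size of the state handed to the downstream model.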

[130] MAGREF: Masked Guidance for Any-Reference Video Generation

Yufan Deng,Xun Guo,Yuanyang Yin,Jacob Zhiyuan Fang,Yiding Yang,Yizhi Wang,Shenghai Yuan,Angtian Wang,Bo Liu,Haibin Huang,Chongyang Ma

Main category: cs.CV

TL;DR: MAGREF is a unified masked-guidance framework for multi-reference video generation; a dynamic masking mechanism and pixel-wise channel concatenation yield high-quality video synthesis with multi-subject consistency.

Details Motivation: Multi-subject video generation struggles to maintain subject consistency and generation quality, and existing methods cannot flexibly handle diverse reference images and text prompts. Method: Proposes a region-aware dynamic masking mechanism and a pixel-wise channel concatenation mechanism, letting a single model handle multiple kinds of subjects (humans, objects, backgrounds) without architectural changes. Result: The model reaches SOTA video generation quality, generalizes from single-subject training to complex multi-subject scenarios, and outperforms open-source and commercial baselines. Conclusion: MAGREF is an effective solution for scalable, controllable, high-fidelity multi-subject video synthesis, and the paper also introduces a new multi-subject video benchmark. Abstract: Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle inference with various subjects, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF

[131] DarkDiff: Advancing Low-Light Raw Enhancement by Retasking Diffusion Models for Camera ISP

Amber Yijia Zheng,Yu Zhang,Jun Hu,Raymond A. Yeh,Chen Chen

Main category: cs.CV

TL;DR: The paper proposes a new framework built on pre-trained generative diffusion models for enhancing low-light raw images; it outperforms existing methods in perceptual quality.

Details Motivation: High-quality photography in extreme low light is challenging but valuable. Existing regression models tend to oversmooth images or produce deep shadows, and diffusion models trained from scratch struggle to recover sharp details and accurate colors. Method: Retasks a pre-trained generative diffusion model with the camera ISP to enhance low-light raw images. Result: On three low-light raw image benchmarks, the method outperforms prior work in perceptual quality. Conclusion: The framework is a better solution for low-light image enhancement, particularly for recovering detail and color accuracy. Abstract: High-quality photography in extreme low-light conditions is challenging but impactful for digital cameras. With advanced computing hardware, traditional camera image signal processor (ISP) algorithms are gradually being replaced by efficient deep networks that enhance noisy raw images more intelligently. However, existing regression-based models often minimize pixel errors and result in oversmoothing of low-light photos or deep shadows. Recent work has attempted to address this limitation by training a diffusion model from scratch, yet those models still struggle to recover sharp image details and accurate colors. We introduce a novel framework to enhance low-light raw images by retasking pre-trained generative diffusion models with the camera ISP. Extensive experiments demonstrate that our method outperforms the state-of-the-art in perceptual quality across three challenging low-light raw image benchmarks.

[132] Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need

Qiang Wang,Xiang Song,Yuhang He,Jizhou Han,Chenhao Ding,Xinyuan Gao,Yihong Gong

Main category: cs.CV

TL;DR: SOYO is a lightweight framework that improves domain selection in parameter-isolation domain incremental learning with a Gaussian Mixture Compressor and a Domain Feature Resampler, outperforming existing baselines across several tasks.

Details Motivation: Deep neural networks underperform in dynamic environments. Parameter-Isolation Domain Incremental Learning (PIDIL) reduces knowledge conflicts, but existing methods select parameters inaccurately. Method: SOYO introduces a Gaussian Mixture Compressor (GMC) and a Domain Feature Resampler (DFR) to store and balance prior domain data efficiently, and uses a Multi-level Domain Feature Fusion Network (MDFN) to enhance domain feature extraction. Result: On six benchmarks, SOYO outperforms existing baselines, showing robustness and adaptability in complex, evolving environments. Conclusion: SOYO is an efficient solution for parameter-isolation domain incremental learning; the code will be open-sourced. Abstract: Deep neural networks (DNNs) often underperform in real-world, dynamic settings where data distributions change over time. Domain Incremental Learning (DIL) offers a solution by enabling continual model adaptation, with Parameter-Isolation DIL (PIDIL) emerging as a promising paradigm to reduce knowledge conflicts. However, existing PIDIL methods struggle with parameter selection accuracy, especially as the number of domains and corresponding classes grows. To address this, we propose SOYO, a lightweight framework that improves domain selection in PIDIL. SOYO introduces a Gaussian Mixture Compressor (GMC) and Domain Feature Resampler (DFR) to store and balance prior domain data efficiently, while a Multi-level Domain Feature Fusion Network (MDFN) enhances domain feature extraction. Our framework supports multiple Parameter-Efficient Fine-Tuning (PEFT) methods and is validated across tasks such as image classification, object detection, and speech enhancement. Experimental results on six benchmarks demonstrate SOYO's consistent superiority over existing baselines, showcasing its robustness and adaptability in complex, evolving environments. The code will be released at https://github.com/qwangcv/SOYO.

[133] To Trust Or Not To Trust Your Vision-Language Model's Prediction

Hao Dong,Moru Liu,Jian Liang,Eleni Chatzi,Olga Fink

Main category: cs.CV

TL;DR: TrustVLM is a training-free framework for estimating when a VLM's predictions can be trusted; it uses the image embedding space to improve misclassification detection and delivers significant gains.

Details Motivation: VLMs perform well on multimodal tasks, but in safety-critical domains their confident yet wrong predictions can have severe consequences, so the trustworthiness of their predictions needs to improve. Method: Proposes a confidence-scoring function based on the image embedding space, exploiting the modality gap and the fact that certain concepts are more distinctly represented in that space to improve misclassification detection. Result: Evaluated on 17 datasets with 4 architectures and 2 VLMs, TrustVLM significantly outperforms existing baselines on AURC, AUROC, and FPR95. Conclusion: TrustVLM improves VLM reliability without retraining, paving the way for safer deployment in real-world applications. Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when a VLM's predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code will be available at https://github.com/EPFL-IMOS/TrustVLM.
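Two of the metrics quoted in this abstract, AUROC and FPR95, can be computed from per-sample confidence scores as in the sketch below. It follows the common misclassification-detection convention, which is assumed here rather than taken from the paper: correctly classified samples are the positive class, and FPR95 is the share of misclassified samples still accepted at the threshold that keeps at least 95% of correct ones. The function names and toy scores are hypothetical.

```python
import math

def auroc(confidences, is_correct):
    """Probability that a correct prediction outscores an incorrect one.

    Ties count as half a win. This pairwise form is O(n^2), which is
    fine for a sketch but slow for large evaluation sets.
    """
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fpr_at_95_tpr(confidences, is_correct):
    """False positive rate at the threshold that accepts >= 95% of positives."""
    pos = sorted((c for c, ok in zip(confidences, is_correct) if ok),
                 reverse=True)
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    needed = math.ceil(0.95 * len(pos))  # positives that must be accepted
    threshold = pos[needed - 1]
    return sum(n >= threshold for n in neg) / len(neg)

# Toy confidence scores; True marks a correctly classified sample.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
correct = [True, True, False, True, False, False]
print("AUROC:", auroc(scores, correct))
print("FPR95:", fpr_at_95_tpr(scores, correct))
```

Higher AUROC is better and lower FPR95 is better, so improvements on the two metrics point in opposite numerical directions.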

[134] Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu,Fangfu Liu,Yi-Hsin Hung,Yueqi Duan

Main category: cs.CV

TL;DR: Spatial-MLLM is a new framework that performs visual spatial reasoning from purely 2D inputs, without relying on 3D or 2.5D data.

Details Motivation: Existing 3D multimodal large language models rely on additional 3D or 2.5D data, which limits their use when only 2D inputs are available. Method: Proposes a dual-encoder architecture, with a semantic encoder and a spatial encoder, combined with a space-aware frame sampling strategy. Result: Performs strongly on a variety of real-world datasets and reaches the state of the art on visual spatial understanding and reasoning tasks. Conclusion: Spatial-MLLM is an effective solution for spatial reasoning from purely 2D inputs. Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder (initialized from the backbone of the visual geometry model) to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

[135] ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

Akashah Shabbir,Muhammad Akhtar Munir,Akshay Dudhane,Muhammad Umer Sheikh,Muhammad Haris Khan,Paolo Fraccaro,Juan Bernabe Moreno,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

TL;DR: ThinkGeo is a benchmark designed to evaluate the tool-use abilities of LLM-driven agents on remote sensing tasks, covering a range of real-world application scenarios.

Details Motivation: Existing evaluations focus on general-purpose or multimodal scenarios and lack domain-specific benchmarks for complex remote sensing use cases. Method: Implements a ReAct-style interaction loop and evaluates open-source and closed-source LLMs on 436 structured tasks. Result: The analysis shows notable differences between models in tool accuracy and planning consistency. Conclusion: ThinkGeo is the first extensive testbed for evaluating how tool-augmented LLMs handle spatial reasoning in remote sensing. Abstract: Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Each query is grounded in satellite or aerial imagery and requires agents to reason through a diverse toolset. We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs (e.g., GPT-4o, Qwen2.5) on 436 structured agentic tasks. The benchmark reports both step-wise execution metrics and final answer correctness. Our analysis reveals notable disparities in tool accuracy and planning consistency across models. ThinkGeo provides the first extensive testbed for evaluating how tool-enabled LLMs handle spatial reasoning in remote sensing. Our code and dataset are publicly available.
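A ReAct-style interaction loop of the kind mentioned in this abstract alternates a thought and tool call with an observation until the agent commits to an answer. The sketch below is a toy illustration, not the benchmark's harness: the tools `count_objects` and `measure_area`, the image dictionary, and `scripted_policy` (a fixed stand-in for an LLM) are all hypothetical.

```python
def count_objects(image, label):
    """Toy tool: count annotated objects with the given label."""
    return sum(1 for obj in image["objects"] if obj["label"] == label)

def measure_area(image, label):
    """Toy tool: total footprint in square metres for the given label."""
    return sum(obj["area_m2"] for obj in image["objects"]
               if obj["label"] == label)

TOOLS = {"count_objects": count_objects, "measure_area": measure_area}

def scripted_policy(question, history):
    """Stand-in for an LLM: one tool call, then answer with its result."""
    if not history:
        return {"thought": "I need the number of airplanes.",
                "action": "count_objects", "args": {"label": "airplane"}}
    return {"thought": "The observation answers the question.",
            "final_answer": history[-1]["observation"]}

def run_react(question, image, policy, max_steps=5):
    """Run thought/action/observation steps until a final answer appears."""
    history = []
    for _ in range(max_steps):
        step = policy(question, history)
        if "final_answer" in step:
            return step["final_answer"], history
        observation = TOOLS[step["action"]](image, **step["args"])
        history.append({**step, "observation": observation})
    return None, history  # step budget exhausted without an answer

scene = {"objects": [
    {"label": "airplane", "area_m2": 900.0},
    {"label": "airplane", "area_m2": 850.0},
    {"label": "hangar", "area_m2": 4000.0},
]}
answer, trace = run_react("How many airplanes are parked?", scene,
                          scripted_policy)
print("answer:", answer, "| tool calls:", [s["action"] for s in trace])
```

Returning the trace alongside the answer is what makes both kinds of scoring possible: the recorded tool calls can be compared against a reference plan for step-wise metrics, and the answer can be checked separately for final correctness.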

[136] Rooms from Motion: Un-posed Indoor 3D Object Detection as Localization and Mapping

Justin Lazarow,Kai Kang,Afshin Dehghan

Main category: cs.CV

TL;DR: The paper proposes Rooms from Motion (RfM), an object-centric 3D object detection method that performs localization and mapping without camera poses and outperforms existing methods.

Details Motivation: Existing 3D object detection methods operate globally and depend on known camera poses; RfM aims to perform localization and mapping on un-posed image collections with an object-centric approach. Method: RfM replaces the standard 2D keypoint-based matcher with an object-centric matcher based on image-derived 3D boxes, estimates camera poses and object tracks, and finally produces a global, semantic 3D object map. Result: On CA-1M and ScanNet++, RfM outperforms leading point-based and multi-view 3D object detection methods. Conclusion: RfM provides a general, object-centric representation that extends scene understanding and enables sparse localization and parametric mapping. Abstract: We revisit scene-level 3D object detection as the output of an object-centric framework capable of both localization and mapping using 3D oriented boxes as the underlying geometric primitive. While existing 3D object detection approaches operate globally and implicitly rely on the a priori existence of metric camera poses, our method, Rooms from Motion (RfM), operates on a collection of un-posed images. By replacing the standard 2D keypoint-based matcher of structure-from-motion with an object-centric matcher based on image-derived 3D boxes, we estimate metric camera poses, object tracks, and finally produce a global, semantic 3D object map. When a priori pose is available, we can significantly improve map quality through optimization of global 3D boxes against individual observations. RfM shows strong localization performance and subsequently produces maps of higher quality than leading point-based and multi-view 3D object detection methods on CA-1M and ScanNet++, despite these global methods relying on overparameterization through point clouds or dense volumes. Rooms from Motion achieves a general, object-centric representation which not only extends the work of Cubify Anything to full scenes but also allows for inherently sparse localization and parametric mapping proportional to the number of objects in a scene.

[137] Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models

Haohan Chi,Huan-ang Gao,Ziming Liu,Jianing Liu,Chenyu Liu,Jinwei Li,Kaisen Yang,Yangcheng Yu,Zeda Wang,Wenyi Li,Leichen Wang,Xingtao Hu,Hao Sun,Hang Zhao,Hao Zhao

Main category: cs.CV

TL;DR: Impromptu VLA introduces a new dataset for improving Vision-Language-Action models in autonomous driving, particularly in unstructured scenarios.

Details Motivation: Existing VLA models perform poorly in unstructured corner-case scenarios, and targeted benchmarks are scarce. Method: Builds the Impromptu VLA Dataset of over 80,000 video clips, organized around four categories of unstructured scenarios and annotated with planning-oriented question-answer pairs and action trajectories. Result: Experiments show that VLA models trained on the dataset improve substantially on several benchmarks, including closed-loop NeuroNCAP scores, collision rates, and open-loop trajectory prediction. Conclusion: The Impromptu VLA Dataset effectively improves VLA model performance and serves as a diagnostic for gains in perception, prediction, and planning. Abstract: Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips drawn from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks--improving closed-loop NeuroNCAP scores and collision rates, and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.

[138] LoRAShop: Training-Free Multi-Concept Image Generation and Editing with Rectified Flow Transformers

Yusuf Dalva,Hidir Yesiltepe,Pinar Yanardag

Main category: cs.CV

TL;DR: LoRAShop is a multi-concept image editing framework built on LoRA models; by analyzing feature interaction patterns in diffusion transformers, it blends multiple concepts seamlessly.

Details Motivation: Existing image editing methods struggle to preserve the details of multiple concepts and the global context at the same time; LoRAShop aims to solve this. Method: Exploits the early activation of concept-specific features in diffusion transformers to derive disentangled latent masks, then blends LoRA weights only within the corresponding regions. Result: Experiments show that LoRAShop preserves identity better than baseline methods and requires no retraining. Conclusion: LoRAShop is a practical tool for multi-concept image editing and opens new avenues for visual creation. Abstract: We introduce LoRAShop, the first framework for multi-concept image editing with LoRA models. LoRAShop builds on a key observation about the feature interaction patterns inside Flux-style diffusion transformers: concept-specific transformer features activate spatially coherent regions early in the denoising process. We harness this observation to derive a disentangled latent mask for each concept in a prior forward pass and blend the corresponding LoRA weights only within regions bounding the concepts to be personalized. The resulting edits seamlessly integrate multiple subjects or styles into the original scene while preserving global context, lighting, and fine details. Our experiments demonstrate that LoRAShop delivers better identity preservation compared to baselines. By eliminating retraining and external constraints, LoRAShop turns personalized diffusion models into a practical 'photoshop-with-LoRAs' tool and opens new avenues for compositional visual storytelling and rapid creative iteration.

[139] MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Sihan Yang,Runsen Xu,Yiman Xie,Sizhe Yang,Mo Li,Jingli Lin,Chenming Zhu,Xiaochen Chen,Haodong Duan,Xiangyu Yue,Dahua Lin,Tai Wang,Jiangmiao Pang

Main category: cs.CV

TL;DR: MMSI-Bench is a VQA benchmark focused on multi-image spatial intelligence; it contains 1,000 challenging questions, evaluates 34 MLLMs, and finds a wide gap to human performance.

Details Motivation: Existing benchmarks probe only single-image relations and cannot assess the multi-image spatial reasoning that real-world deployments demand. Method: Six 3D-vision researchers spent more than 300 hours crafting 1,000 multiple-choice questions from over 120,000 images, and 34 open-source and proprietary MLLMs are evaluated. Result: The strongest open-source model reaches roughly 30% accuracy and OpenAI's o3 model reaches 40%, while humans score 97%. Conclusion: MMSI-Bench shows how challenging multi-image spatial intelligence is and provides an automated error analysis pipeline for future research. Abstract: Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench.

[140] Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch

Aneeshan Sain,Subhajit Maity,Pinaki Nath Chowdhury,Subhadeep Koley,Ayan Kumar Bhunia,Yi-Zhe Song

Main category: cs.CV

TL;DR: The paper proposes two sketch-specific components, cross-modal knowledge distillation and an RL-based canvas selector, that cut computation sharply (99.37% fewer FLOPs) while retaining accuracy.

Details Motivation: Existing lightweight models designed for photos do not transfer directly to sketch data, so efficient inference methods designed specifically for sketches are needed. Method: (1) A cross-modal knowledge distillation network that adapts efficient photo networks to sketch data; (2) an RL-based canvas selector that adjusts dynamically to the abstraction level. Result: FLOPs drop from 40.18G to 0.254G (a 99.37% reduction) and parameters drop by 84.89%, while accuracy is almost unchanged (33.03% vs 32.77%). Conclusion: The proposed method achieves efficient inference on sketch data, with even fewer FLOPs than the best photo model. Abstract: As sketch research has collectively matured over time, its adaptation for mass commercialisation emerges on the immediate horizon. Despite an already mature research endeavour for photos, there is no research on efficient inference specifically designed for sketch data. In this paper, we first demonstrate that existing state-of-the-art efficient light-weight models designed for photos do not work on sketches. We then propose two sketch-specific components which work in a plug-n-play manner on any efficient photo network to adapt it to sketch data. We specifically chose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator, as the most recognised sketch problem with immediate commercial value. Technically speaking, we first propose a cross-modal knowledge distillation network to transfer existing efficient photo networks to be compatible with sketch, which brings down the number of FLOPs and model parameters by 97.96% and 84.89% respectively. We then exploit the abstract trait of sketch to introduce an RL-based canvas selector that dynamically adjusts to the abstraction level, which further cuts down the number of FLOPs by two thirds. The end result is an overall reduction of 99.37% of FLOPs (from 40.18G to 0.254G) when compared with a full network, while retaining the accuracy (33.03% vs 32.77%) -- finally making an efficient network for the sparse sketch data that exhibits even fewer FLOPs than the best photo counterpart.
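The FLOPs figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the assumption that the 97.96% distillation cut applies to the full 40.18G network is an interpretation of the abstract, not something it states explicitly.

```python
# Figures quoted in the abstract (GFLOPs and fractional reductions).
FULL_GFLOPS = 40.18
FINAL_GFLOPS = 0.254
DISTILLATION_CUT = 0.9796  # FLOPs reduction attributed to distillation

def reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return 1.0 - after / before

overall = reduction(FULL_GFLOPS, FINAL_GFLOPS)

# Assumed reading: the distillation cut is measured against the full network.
after_distillation = FULL_GFLOPS * (1.0 - DISTILLATION_CUT)

# Extra cut the canvas selector must then supply to reach the final figure.
selector_cut = reduction(after_distillation, FINAL_GFLOPS)

print(f"overall reduction:  {overall:.2%}")
print(f"after distillation: {after_distillation:.3f} GFLOPs")
print(f"selector cut:       {selector_cut:.1%}")
```

Under that reading the overall figure reproduces the stated 99.37%, and the selector's share comes out close to the "two thirds" the abstract describes.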

[141] Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Yunze Man,De-An Huang,Guilin Liu,Shiwei Sheng,Shilong Liu,Liang-Yan Gui,Jan Kautz,Yu-Xiong Wang,Zhiding Yu

Main category: cs.CV

TL;DR: Argus improves multimodal large language models on vision-centric tasks through a visual attention grounding mechanism.

Details Motivation: Multimodal large language models perform poorly on tasks that require precise visual focus; Argus aims to address this. Method: Uses object-centric grounding as visual chain-of-thought signals to enable goal-conditioned visual attention. Result: Performs strongly on diverse benchmarks, and further analysis validates the design choices. Conclusion: Explicit language-guided engagement with visual regions of interest is important for advancing multimodal intelligence. Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective. Project page: https://yunzeman.github.io/argus/

[142] TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

Yao Xiao,Qiqian Fu,Heyi Tao,Yuqun Wu,Zhen Zhu,Derek Hoiem

Main category: cs.CV

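TL;DR: TextRegion combines the strengths of image-text models and SAM2 to generate text-aligned region tokens that support detailed visual understanding and perform strongly across multiple tasks.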

Details Motivation: Address the weakness of image-text models in detailed visual understanding by drawing on SAM2's precise spatial boundaries. Method: Proposes TextRegion, a training-free framework that combines image-text models with SAM2 to generate text-aligned region tokens. Result: Strong performance on tasks such as open-vocabulary semantic segmentation, with compatibility across many image-text models. Conclusion: TextRegion is a simple and effective framework that is highly practical and extensible. Abstract: Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.

cs.GR [Back]

[143] Quality assessment of 3D human animation: Subjective and objective evaluation

Rim Rekik,Stefanie Wuhrer,Ludovic Hoyet,Katja Zibrek,Anne-Hélène Olivier

Main category: cs.GR

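TL;DR: Proposes a data-driven quality assessment method for virtual human animation: a user study yields a dataset on which a linear regression model is trained, outperforming an existing deep learning baseline.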

Details Motivation: Quality assessment of virtual human animation lacks measures for animations that are not generated with parametric body models, and existing task-oriented metrics do not apply. Method: Generate a dataset of virtual human animations, collect subjective ratings, and train a linear regression model to predict perceptual scores. Result: The linear regression model reaches a correlation of 90% with user ratings, outperforming a deep learning baseline. Conclusion: Data-driven methods show promise for assessing virtual human animation quality, and the linear regression model performs well. Abstract: Virtual human animations have a wide range of applications in virtual and augmented reality. While automatic generation methods of animated virtual humans have been developed, assessing their quality remains challenging. Recently, approaches introducing task-oriented evaluation metrics have been proposed, leveraging neural network training. However, quality assessment measures for animated virtual humans that are not generated with parametric body models have yet to be developed. In this context, we introduce a first such quality assessment measure leveraging a novel data-driven framework. First, we generate a dataset of virtual human animations together with their corresponding subjective realism evaluation scores collected with a user study. Second, we use the resulting dataset to learn to predict perceptual evaluation scores. Results indicate that training a linear regressor on our dataset results in a correlation of 90%, which outperforms a state-of-the-art deep learning baseline.
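
To make the evaluation protocol concrete, here is a minimal sketch of fitting a linear regressor to subjective ratings and reporting the correlation between predicted and held-out scores. Everything in it is an assumption for illustration: the data are synthetic, a single made-up feature stands in for whatever animation descriptors the authors use, and the resulting correlation has no relation to the 90% reported in the abstract.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


# Synthetic stand-in data: one hypothetical feature per animation and a
# noisy "realism rating" on a roughly 1-5 scale.
rng = random.Random(0)
features = [rng.uniform(0, 1) for _ in range(200)]
ratings = [1 + 4 * f + rng.gauss(0, 0.3) for f in features]

# Fit on the first 150 animations, evaluate on the remaining 50.
slope, intercept = fit_line(features[:150], ratings[:150])
predicted = [slope * f + intercept for f in features[150:]]
r = pearson(predicted, ratings[150:])
print(f"correlation on held-out ratings: {r:.2f}")
```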

[144] To Measure What Isn't There -- Visual Exploration of Missingness Structures Using Quality Metrics

Sara Johansson Fernstad,Sarah Alsufyani,Silvia Del Din,Alison Yarnall,Lynn Rochester

Main category: cs.GR

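TL;DR: The paper proposes a set of quality metrics for identifying and visually analysing structured missingness in high-dimensional data, filling a gap in existing research.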

Details Motivation: Missing values are a common problem that can cause analysis issues or reveal important data characteristics. Existing research focuses mostly on statistical imputation, while the potential of visual analysis remains underexplored. Method: Proposes a set of quality metrics for identifying and understanding structured missingness patterns, demonstrated through a use case from a real-life walking monitoring study. Result: The quality metrics effectively guide visual analysis and help explore missingness structures in high-dimensional data. Conclusion: The paper fills a gap in missing-data visualization research and provides practical tools for analysing missingness structures in large, high-dimensional data. Abstract: This paper contributes a set of quality metrics for identification and visual analysis of structured missingness in high-dimensional data. Missing values in data are a frequent challenge in most data generating domains and may cause a range of analysis issues. Structural missingness in data may indicate issues in data collection and pre-processing, but may also highlight important data characteristics. While research into statistical methods for dealing with missing data is mainly focused on replacing missing values with plausible estimated values, visualization has great potential to support a more in-depth understanding of missingness structures in data. Nonetheless, while the interest in missing data visualization has increased in the last decade, it is still a relatively overlooked research topic with a comparably small number of publications, few of which address scalability issues. Efficient visual analysis approaches are needed to enable exploration of missingness structures in large and high-dimensional data, and to support informed decision-making in the context of potential data quality issues. This paper suggests a set of quality metrics for identification of patterns of interest for understanding of structural missingness in data. These quality metrics can be used as guidance in visual analysis, as demonstrated through a use case exploring structural missingness in data from a real-life walking monitoring study. All supplemental materials for this paper are available at https://doi.org/10.25405/data.ncl.c.7741829.

cs.CL [Back]

[145] Training Language Models to Generate Quality Code with Program Analysis Feedback

Feng Yao,Zilong Wang,Liyuan Liu,Junxia Cui,Li Zhong,Xiaohan Fu,Haohui Mai,Vish Krishnan,Jianfeng Gao,Jingbo Shang

Main category: cs.CL

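TL;DR: The paper proposes REAL, a reinforcement learning framework that uses program analysis and unit test feedback to improve the quality of LLM-generated code, addressing the reliance of existing methods on manual annotation or heuristic rules.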

Details Motivation: Existing LLM-based code generation (often called vibe coding) struggles to guarantee code quality such as security and maintainability, and existing remedies depend on manual annotation or heuristic rules, which limits their scalability and effectiveness. Method: Proposes the REAL framework, which uses program analysis to detect defects and unit tests to verify functional correctness, and applies reinforcement learning to incentivize high-quality code without manual intervention. Result: Experiments show that REAL outperforms existing methods in assessments of functionality and code quality across different datasets and model scales. Conclusion: REAL bridges the gap between rapid prototyping and production-ready code, letting LLMs deliver both speed and quality. Abstract: Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
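
The two automated signals described above can be illustrated with a toy reward function. This is a sketch under stated assumptions, not REAL's actual reward: the abstract does not give the formulation, so the equal weighting, the choice of "missing type annotations" as the only analysed defect, and all function names here are hypothetical. It also runs candidate code with `exec`, which is only acceptable for trusted toy inputs.

```python
import ast


def count_unannotated_functions(source: str) -> int:
    """Toy 'program analysis': count functions missing any type annotation."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            missing_arg = any(a.annotation is None for a in node.args.args)
            if missing_arg or node.returns is None:
                count += 1
    return count


def pass_rate(source: str, tests) -> float:
    """Toy 'unit tests': fraction of (function name, args, expected) cases passed."""
    namespace = {}
    exec(source, namespace)  # unsafe outside a sandbox; toy use only
    passed = sum(1 for name, args, expected in tests
                 if namespace[name](*args) == expected)
    return passed / len(tests)


def reward(source: str, tests, quality_weight: float = 0.5) -> float:
    """Hypothetical mix of the functionality signal and the quality signal."""
    quality = 1.0 if count_unannotated_functions(source) == 0 else 0.0
    return (1 - quality_weight) * pass_rate(source, tests) + quality_weight * quality


typed = "def add(a: int, b: int) -> int:\n    return a + b\n"
untyped = "def add(a, b):\n    return a + b\n"
cases = [("add", (1, 2), 3)]
print(reward(typed, cases), reward(untyped, cases))
```

Both candidates pass the unit test, but only the annotated one earns the quality half of the reward, which is the kind of distinction a purely test-based signal would miss.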

[146] Climate Finance Bench

Rafik Mankour,Yassine Chafai,Hamada Saleh,Ghassen Ben Hassine,Thibaud Barreau,Peter Tankov

Main category: cs.CL

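TL;DR: Climate Finance Bench introduces an open benchmark for evaluating large language models on question answering over corporate climate disclosures. Using 33 English-language sustainability reports and 330 expert-validated question-answer pairs, it compares RAG approaches, finds that retriever performance is the key bottleneck, and argues for transparent carbon reporting in AI-for-climate applications.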

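Details Motivation: Provide standardized evaluation for question answering over corporate climate disclosures and promote transparent use of AI in the climate domain. Method: Collect 33 sustainability reports across sectors, annotate 330 question-answer pairs, compare RAG approaches, and analyse performance bottlenecks. Result: The retriever's ability to locate answer-bearing passages is the main performance bottleneck; the authors also argue for transparent carbon reporting in AI-for-climate applications, noting the advantages of techniques such as weight quantization. Conclusion: An open benchmark and transparent reporting can improve both the performance and the sustainability of AI for climate-disclosure question answering. Abstract: Climate Finance Bench introduces an open benchmark that targets question-answering over corporate climate disclosures using Large Language Models. We curate 33 recent sustainability reports in English drawn from companies across all 11 GICS sectors and annotate 330 expert-validated question-answer pairs that span pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we propose a comparison of RAG (retrieval-augmented generation) approaches. We show that the retriever's ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications, highlighting advantages of techniques such as Weight Quantization.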
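
One common way to quantify the retriever bottleneck described above is a hit rate at k: the fraction of questions for which at least one of the top-k retrieved passages contains the answer. The sketch below is illustrative only. The passages, questions, and word-overlap retriever are made up, and the benchmark's own retrievers and metrics are not specified in the abstract.

```python
def overlap_score(question: str, passage: str) -> int:
    """Number of lowercase whitespace tokens shared by question and passage."""
    return len(set(question.lower().split()) & set(passage.lower().split()))


def retrieve(question: str, passages, k: int):
    """Return the k passages with the highest overlap (ties keep input order)."""
    ranked = sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)
    return ranked[:k]


def hit_rate(examples, passages, k: int) -> float:
    """Fraction of (question, answer) pairs whose answer appears in the top k."""
    hits = sum(any(answer in p for p in retrieve(q, passages, k))
               for q, answer in examples)
    return hits / len(examples)


# Hypothetical report snippets and questions.
passages = [
    "Scope 1 emissions were 120 kt CO2e in 2023.",
    "The board has nine members.",
    "Renewable electricity share reached 64% in 2023.",
]
examples = [
    ("What were scope 1 emissions in 2023?", "120 kt"),
    ("What share of electricity was renewable?", "64%"),
    ("Who oversees governance?", "nine members"),  # no lexical overlap
]
print(hit_rate(examples, passages, k=1), hit_rate(examples, passages, k=2))
```

The third question shares no words with the passage that answers it, so the lexical retriever misses it at k=1; no generator downstream can recover an answer that was never retrieved.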

[147] Pre-Training Curriculum for Multi-Token Prediction in Language Models

Ansar Aynetdinov,Alan Akbik

Main category: cs.CL

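TL;DR: The paper proposes a multi-token prediction (MTP) training strategy for small language models (SLMs) that improves performance through forward and reverse curricula.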

Details Motivation: Small language models struggle with the multi-token prediction objective and need an effective training strategy. Method: Forward and reverse curriculum strategies that gradually shift the training objective from next-token to multi-token prediction (forward) or the other way round (reverse). Result: The forward curriculum improves output quality and downstream performance while retaining the benefits of self-speculative decoding; the reverse curriculum achieves stronger performance but loses the self-speculative decoding benefits. Conclusion: The forward curriculum is an effective strategy for small language models under multi-token prediction, balancing performance gains with retained functionality. Abstract: Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next $k$ tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
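
A curriculum of this kind can be pictured as a schedule that decides how many future tokens the model is trained to predict at each step. The sketch below is one plausible reading, not the authors' schedule: it assumes equal-length stages and a step-wise change of one head per stage, and the function name `active_heads` is hypothetical.

```python
def active_heads(step: int, total_steps: int, k_max: int, reverse: bool = False) -> int:
    """Number of prediction heads trained at `step`.

    Forward curriculum: starts at 1 (plain next-token prediction) and grows to k_max.
    Reverse curriculum: starts at k_max and shrinks to 1.
    Assumes training is split into k_max equal-length stages.
    """
    stage = min(step * k_max // total_steps, k_max - 1)  # 0 .. k_max - 1
    k = stage + 1
    return k_max + 1 - k if reverse else k


# With 100 steps and k_max = 4, each stage lasts 25 steps.
forward = [active_heads(s, 100, 4) for s in (0, 25, 50, 99)]
backward = [active_heads(s, 100, 4, reverse=True) for s in (0, 25, 50, 99)]
print(forward, backward)
```

Under this reading the forward variant ends training with all $k$ heads active, which is consistent with it retaining self-speculative decoding, while the reverse variant ends on next-token prediction alone.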

[148] FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian

Sara Papi,Marco Gaido,Luisa Bentivogli,Alessio Brutti,Mauro Cettolo,Roberto Gretter,Marco Matassoni,Mohamed Nabih,Matteo Negri

Main category: cs.CL

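TL;DR: FAMA is the first family of speech foundation models built on open-source data and code, filling an open-science gap in speech; it performs competitively with existing models while running faster.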

Details Motivation: The closed nature of existing speech foundation models such as Whisper and SeamlessM4T makes reproducibility and fair evaluation difficult, while other domains have already advanced through open models. Method: FAMA is trained on 150k+ hours of open-source speech data, and a new dataset of 16k hours of cleaned and pseudo-labeled speech is provided. Result: FAMA performs competitively with existing models and is up to 8 times faster. Conclusion: FAMA and its open-source artifacts promote openness and transparency in speech technology research. Abstract: The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature--with inaccessible training data and code--poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.

[149] StressTest: Can YOUR Speech LM Handle the Stress?

Iddo Yosha,Gallil Maimon,Yossi Adi

Main category: cs.CL

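TL;DR: The paper introduces the StressTest benchmark and the Stress17k dataset for evaluating and improving speech-aware language models (SLMs) on sentence stress reasoning.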

Details Motivation: Sentence stress carries important information in speech, yet existing SLMs perform poorly on stress reasoning, and suitable evaluation and training data are lacking. Method: Introduces the StressTest benchmark and the synthetic Stress17k dataset, and finetunes models on it to improve stress reasoning. Result: The finetuned model, StresSLM, significantly outperforms existing models on both stress reasoning and stress detection. Conclusion: Synthetic data generation and finetuning effectively improve SLM performance on stress reasoning. Abstract: Sentence stress refers to emphasis placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. It is often used to imply an underlying intention that is not explicitly stated. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio, allowing models to bypass transcription and access the full richness of the speech signal and perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in evaluation and development of such models. In this work, we address this gap by introducing StressTest, a benchmark specifically designed to evaluate a model's ability to distinguish between interpretations of spoken sentences based on the stress pattern. We assess the performance of several leading SLMs and find that, despite their overall capabilities, they perform poorly on such tasks. To overcome this limitation, we propose a novel synthetic data generation pipeline, and create Stress17k, a training set that simulates change of meaning implied by stress variation. Then, we empirically show that optimizing models with this synthetic dataset aligns well with real-world recordings and enables effective finetuning of SLMs. Results suggest that our finetuned model, StresSLM, significantly outperforms existing models on both sentence stress reasoning and detection tasks. Code, models, data, and audio samples are available at pages.cs.huji.ac.il/adiyoss-lab/stresstest.

[150] Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems

Christopher Ormerod

Main category: cs.CL

TL;DR: Integrating feedback-oriented annotations into the automated essay scoring (AES) pipeline can improve scoring accuracy.

Details Motivation: Explore how feedback-driven annotations, such as spelling and grammar error marks and argumentative component labels, can improve automated essay scoring. Method: Using the PERSUADE corpus, combine two kinds of feedback annotations: spelling and grammatical error annotations, and argumentative component annotations. A generative language model performs spell correction, and an encoder-based token classifier identifies argumentative elements. Result: Integrating the annotations into the scoring pipeline improves the performance of encoder-based large language models fine-tuned as classifiers. Conclusion: Feedback-driven annotations effectively improve automated essay scoring accuracy, particularly with respect to spelling, grammar, and argumentative structure. Abstract: This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). This approach is demonstrated with the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations -- a generative language model used for spell-correction and an encoder-based token classifier trained to identify and mark argumentative elements. By incorporating annotations into the scoring process, we demonstrate improvements in performance using encoder-based large language models fine-tuned as classifiers.

[151] Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages

Kaja Dobrovoljc

Main category: cs.CL

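TL;DR: The paper proposes a treebank-driven method that compares syntactic structures in speech and writing using dependency-parsed corpora, and finds marked differences between the two modalities.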

Details Motivation: Explore how speech and writing differ in syntactic structure, to understand how the distinct demands of real-time interaction and written expression shape syntactic organization. Method: A bottom-up, inductive approach that extracts delexicalized dependency subtrees from English and Slovenian Universal Dependencies treebanks and analyses their size, diversity, and distribution. Result: Spoken corpora contain fewer and less diverse syntactic structures, and the overlap between spoken and written structures is limited, pointing to modality-specific syntactic preferences. Conclusion: The framework offers a scalable, language-independent method for studying syntactic variation across corpora and lays groundwork for data-driven theories of grammar. Abstract: This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct demands of real-time interaction and elaborated writing. This contrast is further supported by a keyness analysis of the most frequent speech-specific structures, which highlights patterns associated with interactivity, context-grounding, and economy of expression. We argue that this scalable, language-independent framework offers a useful general method for systematically studying syntactic variation across corpora, laying the groundwork for more comprehensive data-driven theories of grammar in use.
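
To show what "delexicalized" means in practice, the sketch below strips word forms from toy dependency parses and keeps only part-of-speech tags and relation labels, so that sentences with different words but the same structure map to the same pattern. It is a simplified assumption of the idea: it handles only one-level subtrees (a head plus its direct dependents), the tuple layout and pattern string are made up for illustration, and the paper's own extraction procedure may differ.

```python
from collections import Counter, defaultdict

# Each token: (id, form, upos, head id, dependency relation); head 0 is the root.
SENT_A = [(1, "I", "PRON", 2, "nsubj"), (2, "see", "VERB", 0, "root"),
          (3, "trees", "NOUN", 2, "obj")]
SENT_B = [(1, "She", "PRON", 2, "nsubj"), (2, "reads", "VERB", 0, "root"),
          (3, "books", "NOUN", 2, "obj")]


def delexicalized_subtrees(sentence):
    """Return one pattern per head: head POS plus (relation:POS) of its dependents."""
    upos = {}
    children = defaultdict(list)
    for tok_id, _form, pos, head, deprel in sentence:
        upos[tok_id] = pos
        children[head].append((tok_id, deprel))
    patterns = []
    for head, deps in children.items():
        if head == 0:  # skip the artificial root node
            continue
        parts = [f"{rel}:{upos[dep]}" for dep, rel in sorted(deps)]
        patterns.append(f"{upos[head]} > " + " ".join(parts))
    return patterns


inventory = Counter()
for sent in (SENT_A, SENT_B):
    inventory.update(delexicalized_subtrees(sent))
print(inventory)
```

Building one such inventory per corpus and intersecting the key sets gives a simple measure of how much structure two modalities share.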

[152] MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators

John Mendonça,Alon Lavie,Isabel Trancoso

Main category: cs.CL

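TL;DR: The paper proposes MEDAL, a framework for generating, evaluating, and curating multilingual open-domain dialogue evaluation benchmarks, and finds that current LLMs fall short in detecting nuanced issues such as empathy and reasoning.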

Details Motivation: Existing evaluation datasets are static, outdated, and lacking in multilingual coverage, which limits their ability to capture linguistic and cultural nuance. Method: Uses a multi-agent framework with state-of-the-art LLMs to generate multilingual dialogues, analyses chatbot performance along multiple dimensions with GPT-4.1, and builds a new multilingual meta-evaluation benchmark. Result: LLMs show cross-lingual performance differences and struggle to detect nuanced issues such as empathy and reasoning. Conclusion: Current LLMs remain limited as evaluators of open-domain dialogue and need further improvement. Abstract: As the capabilities of chatbots and their underlying LLMs continue to dramatically improve, evaluating their performance has increasingly become a major blocker to their further development. A major challenge is the available benchmarking datasets, which are largely static, outdated, and lacking in multilingual coverage, limiting their ability to capture subtle linguistic and cultural variations. This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. We find that current LLMs struggle to detect nuanced issues, particularly those involving empathy and reasoning.

[153] Can Large Language Models Match the Conclusions of Systematic Reviews?

Christopher Polzak,Alejandro Lozano,Min Woo Sun,James Burgess,Yuhui Zhang,Kevin Wu,Serena Yeung-Levy

Main category: cs.CL

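TL;DR: The study asks whether large language models (LLMs) can match clinical experts in generating systematic reviews (SRs), and finds that LLMs fall short in critical appraisal and multi-document reasoning.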

Details Motivation: With the explosive growth of scientific literature, demand for using LLMs to automate SR generation is rising, but their capability remains poorly characterized. Method: Uses the MedEvidence benchmark to evaluate 24 LLMs on 100 SRs and the studies they are based on. Result: Reasoning does not necessarily improve performance, model size is not directly tied to performance, and knowledge-based fine-tuning degrades accuracy. Models are generally overconfident and lack scientific skepticism. Conclusion: LLMs cannot yet reliably match expert-written SRs, and more work is needed. The code and benchmark are publicly released. Abstract: Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, medical specialist, and models across varying sizes (from 7B-700B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.

[154] Towards a More Generalized Approach in Open Relation Extraction

Qing Wang,Yuepei Li,Qiao Qiao,Kang Zhou,Qi Li

Main category: cs.CL

TL;DR: MixORE is a two-phase framework for data that mixes known and novel relations, and it outperforms baselines on the OpenRE task.

Details Motivation: In real-world scenarios novel relations are arbitrarily distributed, whereas traditional OpenRE methods assume the data contain only novel relations or are already split into known and novel instances. Method: Proposes the MixORE framework, which integrates relation classification and clustering to jointly learn known and novel relations. Result: On three benchmark datasets, MixORE outperforms baselines in both known relation classification and novel relation clustering. Conclusion: MixORE advances generalized OpenRE research and real-world applications. Abstract: Open Relation Extraction (OpenRE) seeks to identify and extract novel relational facts between named entities from unlabeled data without pre-defined relation schemas. Traditional OpenRE methods typically assume that the unlabeled data consists solely of novel relations or is pre-divided into known and novel instances. However, in real-world scenarios, novel relations are arbitrarily distributed. In this paper, we propose a generalized OpenRE setting that considers unlabeled data as a mixture of both known and novel instances. To address this, we propose MixORE, a two-phase framework that integrates relation classification and clustering to jointly learn known and novel relations. Experiments on three benchmark datasets demonstrate that MixORE consistently outperforms competitive baselines in known relation classification and novel relation clustering. Our findings contribute to the advancement of generalized OpenRE research and real-world applications.

[155] First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

Andrew Zhu,Evan Osgood,Chris Callison-Burch

Main category: cs.CL

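TL;DR: The paper proposes "overhearing agents", a new paradigm for interacting with LLMs in which the agent listens to human conversations and provides background support, studied through Dungeons & Dragons gameplay.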

Details Motivation: Explore the potential of LLM agents outside direct conversation, in particular assisting by listening in on human-to-human dialogue. Method: Uses large multimodal audio-language models as overhearing agents to assist a Dungeon Master, and validates their helpfulness through a human evaluation. Result: Some large audio-language models can perform overhearing agent tasks using implicit audio cues. Conclusion: The overhearing agents paradigm shows promise, and the code is released to support further research. Abstract: Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation -- instead, they "listen in" on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.

[156] Self-Critique and Refinement for Faithful Natural Language Explanations

Yingming Wang,Pepa Atanasova

Main category: cs.CL

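TL;DR: The paper proposes SR-NLE, a framework that improves the faithfulness of natural language explanations generated by large language models through self-critique and iterative refinement, without external supervision.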

Details Motivation: Natural language explanations generated by large language models often fail to reflect the model's actual reasoning faithfully, and the ability to self-critique and refine has not yet been used to improve explanation faithfulness. Method: SR-NLE improves explanation faithfulness through an iterative self-critique and refinement process guided by natural language feedback and a new feedback mechanism based on feature attribution. Result: Experiments show that SR-NLE significantly reduces unfaithfulness rates; the best method reaches an average unfaithfulness rate of 36.02%, compared with 54.81% for the baseline, an absolute reduction of 18.79%. Conclusion: With appropriate feedback, large language models can refine their explanations to better reflect their actual reasoning, without additional training or fine-tuning. Abstract: With the rapid development of large language models (LLMs), natural language explanations (NLEs) have become increasingly important for understanding model predictions. However, these explanations often fail to faithfully represent the model's actual reasoning process. While existing work has demonstrated that LLMs can self-critique and refine their initial outputs for various tasks, this capability remains unexplored for improving explanation faithfulness. To address this gap, we introduce Self-critique and Refinement for Natural Language Explanations (SR-NLE), a framework that enables models to improve the faithfulness of their own explanations -- specifically, post-hoc NLEs -- through an iterative critique and refinement process without external supervision. Our framework leverages different feedback mechanisms to guide the refinement process, including natural language self-feedback and, notably, a novel feedback approach based on feature attribution that highlights important input words. Our experiments across three datasets and four state-of-the-art LLMs demonstrate that SR-NLE significantly reduces unfaithfulness rates, with our best method achieving an average unfaithfulness rate of 36.02%, compared to 54.81% for the baseline -- an absolute reduction of 18.79%. These findings reveal that the investigated LLMs can indeed refine their explanations to better reflect their actual reasoning process, requiring only appropriate guidance through feedback without additional training or fine-tuning.
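
The control flow of an iterative critique-and-refine loop with attribution-based feedback can be sketched as follows. This is a toy illustration under heavy assumptions: in the framework described above, both the critique and the refinement would be produced by the LLM itself and the important words would come from a feature attribution method, whereas here they are replaced by simple string operations and a hand-written word list. All function names are hypothetical.

```python
def attribution_feedback(explanation: str, important_words):
    """Toy feedback: important input words the explanation does not mention."""
    text = explanation.lower()
    return [w for w in important_words if w.lower() not in text]


def refine(explanation: str, missing):
    """Toy refinement: append the missing evidence (an LLM call in practice)."""
    return explanation + " Key evidence: " + ", ".join(missing) + "."


def self_refine(explanation: str, important_words, max_rounds: int = 3) -> str:
    """Alternate feedback and refinement until nothing is missing or rounds run out."""
    for _ in range(max_rounds):
        missing = attribution_feedback(explanation, important_words)
        if not missing:
            break
        explanation = refine(explanation, missing)
    return explanation


final = self_refine("The review is negative.", ["boring", "waste"])
print(final)

# The headline numbers are an absolute difference in percentage points.
absolute_reduction = round(54.81 - 36.02, 2)
print(absolute_reduction)
```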

[157] What Has Been Lost with Synthetic Evaluation?

Alexander Gill,Abhilasha Ravichander,Ana Marasović

Main category: cs.CL

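TL;DR: The paper examines whether large language models (LLMs) can generate evaluation benchmarks, and finds that although the results are cheap and often valid, they are less challenging for LLMs than human-created benchmarks.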

Details Motivation: Study whether LLMs can generate evaluation benchmarks, as an answer to the high cost of traditional crowdsourcing. Method: Through two case studies (CondaQA and DROP), compare the validity and difficulty of LLM-generated benchmarks with their human-crowdsourced counterparts. Result: LLM-generated benchmarks are cheaper and often valid, but they are not challenging enough for LLMs. Conclusion: The widespread use of LLM-generated benchmarks should be critically reassessed, since it may make evaluation less challenging. Abstract: Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning-over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.

[158] Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Arthur S. Bianchessi,Rodrigo C. Barros,Lucas S. Kupssinskü

Main category: cs.CL

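TL;DR: The paper proposes the Bayesian Attention Mechanism (BAM), a theoretical framework that models positional encoding as a prior within a probabilistic model, unifies existing methods, and introduces a new Generalized Gaussian positional prior that substantially improves long-context generalization.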

Details Motivation: Existing positional encoding methods lack theoretical clarity and rely on limited evaluation metrics to support their extrapolation claims. Method: Proposes the Bayesian Attention Mechanism (BAM), which treats positional encoding as a prior within a probabilistic model, and introduces a Generalized Gaussian positional prior. Result: BAM achieves accurate information retrieval at 500 times the training context length, outperforming existing methods while maintaining comparable perplexity and adding minimal parameters. Conclusion: BAM provides a theoretical framework for positional encoding and substantially improves long-context generalization. Abstract: Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
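
One way to read "positional encoding as a prior" is that a log-prior over relative positions is added to the content-based attention logits before the softmax. The sketch below illustrates that reading with the standard generalized Gaussian form, whose unnormalized log-density is $-(|d - \mu| / \alpha)^{\beta}$. The parameterization, the names `alpha`, `beta`, and `mu`, and the way the prior enters attention are assumptions for illustration; the abstract does not spell out BAM's exact formulation.

```python
import math


def log_prior(distance: float, alpha: float, beta: float, mu: float = 0.0) -> float:
    """Unnormalized log-density of a generalized Gaussian over relative position."""
    return -((abs(distance - mu) / alpha) ** beta)


def attention_weights(scores, query_pos: int, alpha: float, beta: float):
    """Softmax over content scores plus the positional log-prior.

    `scores[j]` is the content score for key position j, for j = 0..query_pos.
    """
    logits = [s + log_prior(query_pos - j, alpha, beta) for j, s in enumerate(scores)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# With identical content scores, the prior alone decides the weights:
# nearer keys receive more attention.
weights = attention_weights([0.0, 0.0, 0.0, 0.0], query_pos=3, alpha=2.0, beta=1.0)
print([round(w, 3) for w in weights])
```

With `beta = 1` the log-prior is linear in distance, the same shape as an ALiBi-style bias with slope `1 / alpha`, while a constant log-prior would leave the logits unchanged, as with no positional encoding at all. That gives an intuition for how a single prior family could cover both.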

[159] LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference

Pingjun Hong,Beiduo Chen,Siyao Peng,Marie-Catherine de Marneffe,Barbara Plank

Main category: cs.CL

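TL;DR: The paper examines Human Label Variation (HLV) in Natural Language Inference (NLI), focusing on "within-label variation", where annotators agree on a label but give different reasoning. The authors propose the LITEX taxonomy for analysing free-text explanations, validate its reliability on the e-SNLI dataset, and show its usefulness for explanation generation.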

Details Motivation: Study differences in annotators' reasoning in NLI, especially within-label variation, to fill a gap in existing research. Method: Proposes the LITEX taxonomy, annotates a subset of the e-SNLI dataset, validates the taxonomy's reliability, and analyses how it relates to NLI labels, highlights, and explanations. Result: LITEX captures within-label variation effectively, and explanations generated with LITEX are closer to human explanations. Conclusion: LITEX not only addresses within-label variation but also gives models an effective way to generate explanations closer to human reasoning. Abstract: There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation--cases where annotators agree on the same label but provide divergent reasoning--poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators' reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LITEX, a linguistically-informed taxonomy for categorizing free-text explanations. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy's reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy's usefulness in explanation generation, demonstrating that conditioning generation on LITEX yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.

[160] GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification

Iknoor Singh,Carolina Scarton,Kalina Bontcheva

Main category: cs.CL

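TL;DR: The paper proposes H3Prompt, a hierarchical three-step prompting method for multilingual news narrative classification, which took the top position on the English test set of the SemEval 2025 task.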

Details Motivation: The proliferation of online news and misinformation calls for automatic data analysis; narrative classification matters to fact-checkers and policy makers. Method: A three-step LLM prompting strategy that first classifies the article's domain, then identifies the main narratives, and finally assigns sub-narratives. Result: The approach ranked first on the English test set among 28 teams. Conclusion: H3Prompt is an effective method for multilingual narrative classification, and the code is open source. Abstract: The proliferation of online news and the increasing spread of misinformation necessitate robust methods for automatic data analysis. Narrative classification is emerging as an important task, since identifying what is being said online is critical for fact-checkers, policy makers and other professionals working on information studies. This paper presents our approach to SemEval 2025 Task 10 Subtask 2, which aims to classify news articles into a pre-defined two-level taxonomy of main narratives and sub-narratives across multiple languages. We propose Hierarchical Three-Step Prompting (H3Prompt) for multilingual narrative classification. Our methodology follows a three-step Large Language Model (LLM) prompting strategy, where the model first categorises an article into one of two domains (Ukraine-Russia War or Climate Change), then identifies the most relevant main narratives, and finally assigns sub-narratives. Our approach secured the top position on the English test set among 28 competing teams worldwide. The code is available at https://github.com/GateNLP/H3Prompt.
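The three-step flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `TAXONOMY` entries, the prompt wording, the helper `pick_labels`, and the `ask` callable (standing in for any LLM client) are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical two-level taxonomy; the real label set is defined by the shared task.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "Ukraine-Russia War": {
        "Blaming others": ["Blaming the West", "Blaming Ukraine"],
    },
    "Climate Change": {
        "Criticism of climate policies": ["Policies are ineffective"],
    },
}


def pick_labels(reply: str, allowed: List[str]) -> List[str]:
    """Keep only the allowed labels that appear as a full line in the reply."""
    lines = [line.strip() for line in reply.splitlines()]
    return [label for label in allowed if label in lines]


def classify(article: str, ask: Callable[[str], str]) -> Dict[str, object]:
    """Run three prompts in sequence, narrowing the label space at each step."""
    # Step 1: choose a single domain.
    domains = list(TAXONOMY)
    domain = ask(f"Step 1. Choose exactly one domain from {domains}.\n\n{article}").strip()
    if domain not in TAXONOMY:
        raise ValueError(f"unexpected domain: {domain!r}")

    # Step 2: choose main narratives, restricted to the chosen domain.
    mains = list(TAXONOMY[domain])
    chosen = pick_labels(
        ask(f"Step 2. List relevant main narratives from {mains}, one per line.\n\n{article}"),
        mains,
    )

    # Step 3: choose sub-narratives, restricted to each chosen main narrative.
    narratives: Dict[str, List[str]] = {}
    for main in chosen:
        subs = TAXONOMY[domain][main]
        narratives[main] = pick_labels(
            ask(f"Step 3. List relevant sub-narratives of '{main}' from {subs}, one per line.\n\n{article}"),
            subs,
        )
    return {"domain": domain, "narratives": narratives}
```

Filtering each reply against the allowed labels keeps a free-form model answer from introducing labels outside the taxonomy; how the actual system handles this is not stated in the abstract.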

[161] When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy

Jirui Qi,Shan Chen,Zidi Xiong,Raquel Fernández,Danielle S. Bitterman,Arianna Bisazza

Main category: cs.CL

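TL;DR: Current Large Reasoning Models (LRMs) perform well on English reasoning tasks, but their ability to reason in other languages is less studied. The study finds that even the most advanced models often revert to English or produce fragmented reasoning, revealing a substantial gap in multilingual reasoning. Prompt-based interventions improve readability but reduce accuracy, while a small amount of targeted post-training partly mitigates the problem.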

Details Motivation: Multilingual reasoning matters because users need reasoning traces in their own language for effective oversight. Method: Comprehensively evaluate two leading families of LRMs on the XReasoning benchmark, using prompt-based interventions and targeted post-training. Result: Models show a substantial gap in multilingual reasoning; prompt-based interventions improve readability but reduce accuracy, and targeted post-training partly mitigates the problem. Conclusion: Current LRMs have limited multilingual reasoning capabilities and need further research. Code and data are open source. Abstract: Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.

[162] VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

Chahat Raj,Bowen Wei,Aylin Caliskan,Antonios Anastasopoulos,Ziwei Zhu

Main category: cs.CL

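TL;DR: VIGNETTE is a large-scale visual question answering benchmark for evaluating bias in vision-language models (VLMs) along four directions (factuality, perception, stereotyping, and decision making), revealing implicit social stereotypes and patterns of discrimination.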

Details Motivation: Existing studies of VLM bias mostly focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their potential harm. Method: Using a VQA benchmark with 30M+ images (VIGNETTE) and drawing on social psychology, analyze how VLMs infer traits and roles from visual identity cues and encode social hierarchies. Result: VLMs exhibit subtle, multifaceted, and surprising stereotypical patterns, showing how models construct social meaning from inputs. Conclusion: VIGNETTE offers a new perspective on bias in VLMs and underlines the need to evaluate and address their social impact. Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies, through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.

[163] Talent or Luck? Evaluating Attribution Bias in Large Language Models

Chahat Raj,Mahika Banerjee,Aylin Caliskan,Antonios Anastasopoulos,Ziwei Zhu

Main category: cs.CL

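TL;DR: The paper proposes a cognitively grounded bias evaluation framework, based on Attribution Theory, for analyzing how LLMs attribute event outcomes (for example, blaming a failed exam on the student's effort or on the test's difficulty) and how those attributions are biased across demographic groups.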

Details Motivation: Study how LLMs assign responsibility for event outcomes based on demographics, and what this implies for fairness. Method: Propose a cognitively grounded bias evaluation framework that analyzes how disparities in model reasoning lead to bias toward particular demographic groups. Result: The framework reveals bias in LLMs' attribution reasoning, in particular a reliance on demographic factors. Conclusion: The framework provides a cognitive basis for evaluating and improving LLM fairness and can help reduce bias in models. Abstract: When a student fails an exam, do we tend to blame their effort or the test's difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal (e.g., effort, ability) or external (e.g., task difficulty, luck) factors. LLMs' attribution of event outcomes based on demographics carries important fairness implications. Most works exploring social biases in LLMs focus on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how models' reasoning disparities channelize biases toward demographic groups.

[164] ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room

Nikita Mehandru,Niloufar Golchini,David Bamman,Travis Zack,Melanie F. Molina,Ahmed Alaa

Main category: cs.CL

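TL;DR: ER-Reason is a benchmark for evaluating the clinical reasoning and decision-making of large language models (LLMs) in the emergency room (ER). It contains data from 3,984 patients and 25,174 clinical notes, and reveals a gap between LLM and clinician reasoning.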

Details Motivation: Existing evaluations on medical question answering depend on costly human annotation and fail to capture the full clinical reasoning workflow, especially in the ER, a high-stakes setting that demands rapid decisions. Method: Build the ER-Reason benchmark, which includes tasks spanning multiple stages of the ER workflow (such as triage, diagnosis, and treatment selection) and 72 physician-authored rationales, and use it to evaluate LLMs. Result: Current LLMs show a gap relative to clinicians' decisions in ER clinical reasoning. Conclusion: Future research needs to narrow the gap between LLMs and clinicians in reasoning about ER decisions. Abstract: Large language models (LLMs) have been extensively evaluated on medical question answering tasks based on licensing exams. However, real-world evaluations often depend on costly human annotators, and existing benchmarks tend to focus on isolated tasks that rarely capture the clinical reasoning or full workflow underlying medical decisions. In this paper, we introduce ER-Reason, a benchmark designed to evaluate LLM-based clinical reasoning and decision-making in the emergency room (ER)--a high-stakes setting where clinicians make rapid, consequential decisions across diverse patient presentations and medical specialties under time pressure. ER-Reason includes data from 3,984 patients, encompassing 25,174 de-identified longitudinal clinical notes spanning discharge summaries, progress notes, history and physical exams, consults, echocardiography reports, imaging notes, and ER provider documentation. The benchmark includes evaluation tasks that span key stages of the ER workflow: triage intake, initial assessment, treatment selection, disposition planning, and final diagnosis--each structured to reflect core clinical reasoning processes such as differential diagnosis via rule-out reasoning. We also collected 72 full physician-authored rationales explaining reasoning processes that mimic the teaching process used in residency training, and are typically absent from ER documentation. Evaluations of state-of-the-art LLMs on ER-Reason reveal a gap between LLM-generated and clinician-authored clinical reasoning for ER decisions, highlighting the need for future research to bridge this divide.

[165] Structured Memory Mechanisms for Stable Context Representation in Large Language Models

Yue Xing,Tao Yang,Yijiashun Qi,Minggu Wei,Yu Cheng,Honghui Xin

Main category: cs.CL

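TL;DR: The paper proposes a model architecture with a long-term memory mechanism to address the limitations of large language models in understanding long-term context, using explicit memory units and dynamic updates to improve the retention and retrieval of semantic information.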

Details Motivation: Address the context loss and semantic drift that traditional language models commonly show on tasks with long-term dependencies. Method: Integrate explicit memory units, a gated writing mechanism, and an attention-based reading module; introduce a forgetting function to update memory content dynamically; and design a joint training objective to optimize memory strategies. Result: The model performs well on text generation consistency, stability in multi-turn question answering, and accuracy in cross-context reasoning, with strong semantic retention and contextual coherence in long-text tasks and complex question answering scenarios. Conclusion: The experiments confirm the critical role of memory mechanisms in language understanding and demonstrate the feasibility and effectiveness of the proposed approach in both architectural design and performance. Abstract: This paper addresses the limitations of large language models in understanding long-term context. It proposes a model architecture equipped with a long-term memory mechanism to improve the retention and retrieval of semantic information across paragraphs and dialogue turns. The model integrates explicit memory units, gated writing mechanisms, and attention-based reading modules. A forgetting function is introduced to enable dynamic updates of memory content, enhancing the model's ability to manage historical information. To further improve the effectiveness of memory operations, the study designs a joint training objective. This combines the main task loss with constraints on memory writing and forgetting. It guides the model to learn better memory strategies during task execution. Systematic evaluation across multiple subtasks shows that the model achieves clear advantages in text generation consistency, stability in multi-turn question answering, and accuracy in cross-context reasoning. In particular, the model demonstrates strong semantic retention and contextual coherence in long-text tasks and complex question answering scenarios. It effectively mitigates the context loss and semantic drift problems commonly faced by traditional language models when handling long-term dependencies. The experiments also include analysis of different memory structures, capacity sizes, and control strategies. These results further confirm the critical role of memory mechanisms in language understanding. They demonstrate the feasibility and effectiveness of the proposed approach in both architectural design and performance outcomes.

[166] Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging

Haobo Zhang,Jiayu Zhou

Main category: cs.CL

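TL;DR: The paper proposes OSRM, a method that addresses the performance degradation seen when merging models fine-tuned with low-rank adaptation (LoRA) by constraining the LoRA subspaces to be orthogonal, which reduces interference among tasks.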

Details Motivation: Existing model merging methods perform poorly on LoRA fine-tuned models, mainly because the interplay between parameters and data distributions has been overlooked. Method: Propose OSRM, which constrains the LoRA subspace before fine-tuning to reduce interference among tasks and is compatible with existing merging algorithms. Result: Experiments on eight datasets and five language models show that OSRM boosts merging performance while preserving single-task accuracy. Conclusion: OSRM offers a plug-and-play solution for merging LoRA models and highlights the importance of data-parameter interaction. Abstract: Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), due to significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace *prior* to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.
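The kind of interference described in the abstract can be illustrated with a short derivation. This is a sketch in notation assumed here (pretrained weights $W_0$, a LoRA update $B_t A_t$ for task $t$, and plain additive merging of two tasks); it is not necessarily the formulation used in the paper.

```latex
% Assumed notation: W_0 = pretrained weights, B_t A_t = LoRA update for task t.
W_{\text{merged}} = W_0 + B_1 A_1 + B_2 A_2

% Output of the merged model on an input x drawn from task 2:
W_{\text{merged}}\, x
  = \underbrace{(W_0 + B_2 A_2)\, x}_{\text{task-2 model}}
  + \underbrace{B_1 A_1 x}_{\text{interference from task 1}}

% If the rows of A_1 are (approximately) orthogonal to task-2 inputs, then
A_1 x \approx 0 \;\Longrightarrow\; B_1 A_1 x \approx 0 \quad \text{for any } B_1 .
```

Under this reading, the interference term depends on how the subspace spanned by the rows of $A_1$ relates to the data of the other task, which is one way to interpret the abstract's point about the interplay between parameters and data distributions.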

[167] Improving QA Efficiency with DistilBERT: Fine-Tuning and Inference on mobile Intel CPUs

Ngeyen Yinkfu

Main category: cs.CL

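TL;DR: The study presents an efficient transformer-based question-answering model optimized for deployment on a 13th Gen Intel i7-1355U CPU. Using SQuAD v1.1 with data augmentation and DistilBERT fine-tuning, it reaches a validation F1 score of 0.6536 and an average inference time of 0.1208 seconds per question.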

Details Motivation: Develop an efficient question-answering model that runs in real time on resource-constrained systems, balancing accuracy and computational efficiency. Method: Use exploratory data analysis, data augmentation, and fine-tuning of a DistilBERT architecture, with a systematic evaluation of data augmentation strategies and hyperparameter configurations. Result: The model reaches a validation F1 score of 0.6536 with an inference time of 0.1208 seconds per question, a favorable accuracy-efficiency trade-off compared with a rule-based baseline (F1: 0.3124) and full BERT-based models. Conclusion: The model strikes a good balance between accuracy and efficiency, suits real-time applications on resource-constrained systems, and offers practical insights into optimizing transformer models. Abstract: This study presents an efficient transformer-based question-answering (QA) model optimized for deployment on a 13th Gen Intel i7-1355U CPU, using the Stanford Question Answering Dataset (SQuAD) v1.1. Leveraging exploratory data analysis, data augmentation, and fine-tuning of a DistilBERT architecture, the model achieves a validation F1 score of 0.6536 with an average inference time of 0.1208 seconds per question. Compared to a rule-based baseline (F1: 0.3124) and full BERT-based models, our approach offers a favorable trade-off between accuracy and computational efficiency. This makes it well-suited for real-time applications on resource-constrained systems. The study includes systematic evaluation of data augmentation strategies and hyperparameter configurations, providing practical insights into optimizing transformer models for CPU-based inference.

[168] WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning

Yuchen Zhuang,Di Jin,Jiaao Chen,Wenqi Shi,Hanrui Wang,Chao Zhang

Main category: cs.CL

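TL;DR: WorkForceAgent-R1 is an LLM-based web agent trained with a rule-based R1-style reinforcement learning framework to improve single-step reasoning and planning. It substantially outperforms SFT baselines and approaches GPT-4o performance.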

Details Motivation: Existing web agents based on supervised fine-tuning generalize poorly and lack robustness in dynamic web interactions, so their reasoning capabilities need strengthening. Method: Use a rule-based R1-style reinforcement learning framework with a structured reward function that evaluates output format and action correctness, without explicit annotations or expert demonstrations. Result: On the WorkArena benchmark, WorkForceAgent-R1 outperforms SFT baselines by 10.26-16.59% and approaches GPT-4o performance. Conclusion: WorkForceAgent-R1 uses reinforcement learning to improve reasoning and performance on web navigation tasks. Abstract: Large language model (LLM)-empowered web agents enable automating complex, real-time web navigation tasks in enterprise environments. However, existing web agents relying on supervised fine-tuning (SFT) often struggle with generalization and robustness due to insufficient reasoning capabilities when handling the inherently dynamic nature of web interactions. In this study, we introduce WorkForceAgent-R1, an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework designed explicitly to enhance single-step reasoning and planning for business-oriented web navigation tasks. We employ a structured reward function that evaluates both adherence to output formats and correctness of actions, enabling WorkForceAgent-R1 to implicitly learn robust intermediate reasoning without explicit annotations or extensive expert demonstrations. Extensive experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines by 10.26-16.59%, achieving competitive performance relative to proprietary LLM-based agents (gpt-4o) in workplace-oriented web navigation tasks.

[169] Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Jaewoo Ahn,Heeseung Yun,Dayoon Ko,Gunhee Kim

Main category: cs.CL

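TL;DR: The paper introduces the MAC benchmark, which uses LLMs to generate deceptive text samples that probe compositional vulnerabilities in multimodal representations, and proposes a self-training approach that improves on zero-shot methods.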

Details Motivation: Pre-trained multimodal representations such as CLIP have compositional vulnerabilities that lead to counterintuitive judgments, which calls for systematic evaluation and improvement. Method: Introduce the MAC benchmark, which uses LLMs to generate deceptive samples, and apply a self-training approach that combines rejection-sampling fine-tuning with diversity-promoting filtering. Result: The approach performs well with smaller models such as Llama-3.1-8B and reveals compositional vulnerabilities across image, video, and audio representations. Conclusion: The MAC benchmark and the self-training approach are effective at exposing compositional vulnerabilities in multimodal representations. Abstract: While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audios.

[170] OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature

Alisha Srivastava,Emir Korukluoglu,Minh Nhat Le,Duyen Tran,Chau Minh Pham,Marzena Karpinska,Mohit Iyyer

Main category: cs.CL

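TL;DR: The paper studies multilingual and cross-lingual memorization in large language models (LLMs) and finds that models recall content across languages, even for texts without a direct translation in the pretraining data.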

Details Motivation: Investigate whether LLM memorization extends to non-English languages and transfers across languages. Method: Use the OWL dataset (31.5K aligned excerpts in ten languages) and evaluate models with direct probing, name cloze, and prefix probing tasks. Result: LLMs recall content across languages; GPT-4o identifies authors and titles 69% of the time in newly translated excerpts, and perturbations modestly reduce performance. Conclusion: LLMs show substantial cross-lingual memorization, with differences between models. Abstract: Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book's title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts without direct translation in pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking characters, shuffling words) modestly reduce direct probing accuracy (7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and provide insights on the differences between the models.

[171] NegVQA: Can Vision Language Models Understand Negation?

Yuhui Zhang,Yuchang Su,Yiming Liu,Serena Yeung-Levy

Main category: cs.CL

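TL;DR: NegVQA is a new visual question answering benchmark for evaluating how well vision language models understand negation; existing models show a substantial performance drop on negated questions.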

Details Motivation: Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models are deployed in high-stakes applications, assessing their ability to understand negation becomes essential. Method: Use large language models to generate negated versions of questions from existing VQA datasets, build the NegVQA benchmark of 7,379 two-choice questions, and evaluate 20 state-of-the-art vision language models. Result: Models show a substantial performance drop on negated questions, and performance follows a U-shaped trend with model size. Conclusion: NegVQA reveals critical gaps in vision language models' understanding of negation and offers direction for future model development. Abstract: Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.

[172] StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs

Haohan Yuan,Sukhwa Hong,Haopeng Zhang

Main category: cs.CL

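TL;DR: StrucSum is a training-free, structure-aware prompting framework that uses sentence-level graph structures to strengthen LLMs in zero-shot summarization, improving both summary quality and factual consistency.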

Details Motivation: Large language models (LLMs) perform well in zero-shot summarization but struggle to model document structure and identify key information in long texts. Method: StrucSum injects structural signals through three strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Result: On ArXiv, PubMed, and Multi-News, StrucSum improves over unsupervised baselines and vanilla prompting; on ArXiv, for example, FactCC and SummaC rise by 19.2 and 9.7 points. Conclusion: Structure-aware prompting is a simple and effective approach to zero-shot extractive summarization that needs no additional training or task-specific tuning. Abstract: Large language models (LLMs) have shown strong performance in zero-shot summarization, but often struggle to model document structure and identify salient information in long texts. In this work, we introduce StrucSum, a training-free prompting framework that enhances LLM reasoning through sentence-level graph structures. StrucSum injects structural signals into prompts via three targeted strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Experiments on ArXiv, PubMed, and Multi-News demonstrate that StrucSum consistently improves both summary quality and factual consistency over unsupervised baselines and vanilla prompting. Notably, on ArXiv, it boosts FactCC and SummaC by 19.2 and 9.7 points, indicating stronger alignment between summaries and source content. These findings suggest that structure-aware prompting is a simple yet effective approach for zero-shot extractive summarization with LLMs, without any training or task-specific tuning.
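The idea of scoring sentences on a graph and masking the least central ones can be sketched as below. This is only an illustration of the general technique: the bag-of-words cosine edge weights, the weighted-degree centrality, and the `keep_ratio` parameter are assumptions made here, and the abstract does not say how StrucSum builds its graph or computes centrality.

```python
import math
import re
from collections import Counter
from typing import List


def bow(sentence: str) -> Counter:
    """Lowercased bag-of-words vector for one sentence."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centrality(sentences: List[str]) -> List[float]:
    """Weighted degree of each sentence in a fully connected similarity graph."""
    vecs = [bow(s) for s in sentences]
    n = len(vecs)
    return [sum(cosine(vecs[i], vecs[j]) for j in range(n) if j != i) for i in range(n)]


def mask_low_centrality(sentences: List[str], keep_ratio: float = 0.5) -> List[str]:
    """Keep the most central sentences, preserving their original order."""
    scores = centrality(sentences)
    k = max(1, math.ceil(len(sentences) * keep_ratio))
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]
```

The shortened sentence list would then be placed in the prompt in place of the full document; centrality scores could also be shown to the model directly, in the spirit of the importance-estimation strategy the abstract describes.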

[173] LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments

Matteo Guida,Yulia Otmakhova,Eduard Hovy,Lea Frermann

Main category: cs.CL

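TL;DR: The paper evaluates four large language models (LLMs) on three argument mining tasks and finds strong overall performance, especially for large and fine-tuned models, but systematic shortcomings on long, nuanced comments and emotionally charged language.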

Details Motivation: Explore how well LLMs detect and understand arguments on contested topics such as abortion, filling a gap in research on topic-specific argument mining. Method: Quantitatively evaluate four state-of-the-art LLMs on datasets of over 2,000 comments across six polarizing topics, followed by a detailed error analysis. Result: LLMs perform well overall on the three tasks, especially large and fine-tuned models, but do poorly on long comments and emotionally charged language. Conclusion: LLMs show promise for automated argument analysis but still have limitations, particularly on complex and emotional content. Abstract: Automated large-scale analysis of public discussions around contested issues like abortion requires detecting and understanding the use of arguments. While Large Language Models (LLMs) have shown promise in language processing tasks, their performance in mining topic-specific, pre-defined arguments in online comments remains underexplored. We evaluate four state-of-the-art LLMs on three argument mining tasks using datasets comprising over 2,000 opinion comments across six polarizing topics. Quantitative evaluation suggests an overall strong performance across the three tasks, especially for large and fine-tuned LLMs, albeit at a significant environmental cost. However, a detailed error analysis revealed systematic shortcomings on long and nuanced comments and emotionally charged language, raising concerns for downstream applications like content moderation or opinion analysis. Our results highlight both the promise and current limitations of LLMs for automated argument analysis in online comments.

[174] LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements

Jianwei Wang,Mengqi Wang,Yinsi Zhou,Zhenchang Xing,Qing Liu,Xiwei Xu,Wenjie Zhang,Liming Zhu

Main category: cs.CL

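TL;DR: HSE-Bench is the first benchmark dataset for evaluating large language models (LLMs) on Health, Safety, and Environment (HSE) compliance assessment. It contains over 1,000 questions and shows that current LLMs rely on semantic matching rather than systematic legal reasoning.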

Details Motivation: HSE compliance assessment requires dynamic real-time decision-making, but LLMs' domain knowledge and structured legal reasoning in this area remain underexplored. Method: Introduce the HSE-Bench dataset, evaluate LLMs with an IRAC-based reasoning flow, and propose a new prompting technique, RoE, that simulates expert reasoning. Result: Current LLMs perform well but rely on semantic matching and lack systematic legal reasoning; RoE leads to more accurate decisions. Conclusion: The study exposes reasoning gaps in LLMs for HSE compliance assessment and points to directions for improvement. Abstract: Health, Safety, and Environment (HSE) compliance assessment demands dynamic real-time decision-making under complicated regulations and complex human-machine-environment interactions. While large language models (LLMs) hold significant potential for decision intelligence and contextual dialogue, their capacity for domain-specific knowledge in HSE and structured legal reasoning remains underexplored. We introduce HSE-Bench, the first benchmark dataset designed to evaluate the HSE compliance assessment capabilities of LLMs. HSE-Bench comprises over 1,000 manually curated questions drawn from regulations, court cases, safety exams, and fieldwork videos, and integrates a reasoning flow based on Issue spotting, rule Recall, rule Application, and rule Conclusion (IRAC) to assess the holistic reasoning pipeline. We conduct extensive evaluations on different prompting strategies and more than 10 LLMs, including foundation models, reasoning models and multimodal vision models. The results show that, although current LLMs achieve good performance, their capabilities largely rely on semantic matching rather than principled reasoning grounded in the underlying HSE compliance context. Moreover, their native reasoning trace lacks the systematic legal reasoning required for rigorous HSE compliance assessment. To alleviate these, we propose a new prompting technique, Reasoning of Expert (RoE), which guides LLMs to simulate the reasoning process of different experts for compliance assessment and reach a more accurate unified decision. We hope our study highlights reasoning gaps in LLMs for HSE compliance and inspires further research on related tasks.

[175] ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

Peixuan Han,Zijia Liu,Jiaxuan You

Main category: cs.CL

TL;DR: ToMAP strengthens persuader agents by incorporating two theory of mind modules, yielding more effective and more diverse persuasion.

Details Motivation: Current large language models struggle with theory of mind reasoning, which limits the diversity of their persuasion and their awareness of the opponent. Method: ToMAP prompts the persuader to consider possible objections, uses a text encoder with an MLP classifier to predict the opponent's stance, and applies reinforcement learning to generate more effective arguments. Result: With only 3B parameters, ToMAP outperforms much larger baselines such as GPT-4o with a relative gain of 39.4%, and generates more diverse and effective arguments. Conclusion: ToMAP shows potential for developing more persuasive language agents and is well suited to long conversations and more logical strategies. Abstract: Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader to learn how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents.
Code is available at: https://github.com/ulab-uiuc/ToMAP.

[176] Exploring Scaling Laws for EHR Foundation Models

Sheng Zhang,Qin Liu,Naoto Usuyama,Cliff Wong,Tristan Naumann,Hoifung Poon

Main category: cs.CL

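TL;DR: This paper presents the first study of scaling laws for electronic health record (EHR) foundation models and finds that, like large language models (LLMs), they show predictable performance gains with scale.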

Details Motivation: Explore scaling laws for EHR data, which remain largely unexplored, and inform resource-efficient training strategies for clinical prediction tasks. Method: Train transformer architectures of varying sizes and compute budgets on patient timeline data from the MIMIC-IV database and analyze the scaling patterns. Result: EHR models show scaling behavior analogous to LLMs, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. Conclusion: The results lay the groundwork for developing powerful EHR foundation models and may help advance personalized healthcare. Abstract: The emergence of scaling laws has profoundly shaped the development of large language models (LLMs), enabling predictable performance gains through systematic increases in model size, dataset volume, and compute. Yet, these principles remain largely unexplored in the context of electronic health records (EHRs) -- a rich, sequential, and globally abundant data source that differs structurally from natural language. In this work, we present the first empirical investigation of scaling laws for EHR foundation models. By training transformer architectures on patient timeline data from the MIMIC-IV database across varying model sizes and compute budgets, we identify consistent scaling patterns, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. These findings demonstrate that EHR models exhibit scaling behavior analogous to LLMs, offering predictive insights into resource-efficient training strategies. Our results lay the groundwork for developing powerful EHR foundation models capable of transforming clinical prediction tasks and advancing personalized healthcare.

[177] Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation

Hoang Pham,Thanh-Do Nguyen,Khac-Hoai Nam Bui

Main category: cs.CL

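TL;DR: VeGraph is an LLM-based framework that verifies complex claims in three phases (graph representation, entity disambiguation, and verification) and achieves competitive results in experiments.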

Details Motivation: Traditional approaches to complex claim verification are limited by the lack of effective entity disambiguation strategies; VeGraph aims to address this using the capabilities of LLMs. Method: VeGraph works in three phases: graph representation (decomposing the claim into triplets), entity disambiguation (interacting with the knowledge base to resolve ambiguous entities), and verification (completing the fact-checking process). Result: Experiments show that VeGraph is competitive with baselines on the HoVer and FEVEROUS benchmarks. Conclusion: By combining LLMs with a graph representation, VeGraph addresses both the accuracy and the explainability of complex claim verification. Abstract: Claim verification is a long-standing and challenging task that demands not only high accuracy but also explainability of the verification process. This task becomes an emerging research issue in the era of large language models (LLMs) since real-world claims are often complex, featuring intricate semantic structures or obfuscated entities. Traditional approaches typically address this by decomposing claims into sub-claims and querying a knowledge base to resolve hidden or ambiguous entities. However, the absence of effective disambiguation strategies for these entities can compromise the entire verification process. To address these challenges, we propose Verify-in-the-Graph (VeGraph), a novel framework leveraging the reasoning and comprehension abilities of LLM agents. VeGraph operates in three phases: (1) Graph Representation - an input claim is decomposed into structured triplets, forming a graph-based representation that integrates both structured and unstructured information; (2) Entity Disambiguation - VeGraph iteratively interacts with the knowledge base to resolve ambiguous entities within the graph for deeper sub-claim verification; and (3) Verification - remaining triplets are verified to complete the fact-checking process. Experiments using Meta-Llama-3-70B (instruct version) show that VeGraph achieves competitive performance compared to baselines on two benchmarks, HoVer and FEVEROUS, effectively addressing claim verification challenges. Our source code and data are available for further exploitation.

[178] DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Yize Cheng,Wenxiao Wang,Mazda Moayeri,Soheil Feizi

Main category: cs.CL

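TL;DR: DyePack is a framework that uses backdoor attacks to detect whether a model was trained on benchmark test sets, without requiring access to the model's internals.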

Details Motivation: Open benchmarks are easily contaminated, so a transparent and reproducible way to detect whether a model trained on test data is needed. Method: DyePack mixes backdoor samples into the test data and uses a design with multiple backdoors and stochastic targets, which allows the false positive rate (FPR) to be computed exactly. Result: On multiple-choice tasks, DyePack detects all contaminated models with very low FPRs; it performs similarly well on open-ended generation tasks. Conclusion: DyePack provides an efficient and reliable way to detect test set contamination and has broad potential for application. Abstract: Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
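One way an exact false positive rate can fall out of "multiple backdoors with stochastic targets" is a binomial tail, sketched below. The model here is an assumption, not taken from the abstract: each backdoor's target is drawn uniformly from `num_targets` candidates, an uncontaminated model matches each target independently with probability `1 / num_targets`, and a model is flagged when at least `threshold` backdoors match. The function and parameter names are hypothetical.

```python
from math import comb


def false_positive_rate(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """Probability that a clean model matches at least `threshold` backdoor targets.

    Assumes independent, uniformly random targets, so the number of chance
    matches follows Binomial(num_backdoors, 1 / num_targets).
    """
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, k) * p**k * (1.0 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )


# Eight backdoors, ten candidate targets, flag on seven or more matches.
rate = false_positive_rate(num_backdoors=8, num_targets=10, threshold=7)
print(f"{rate * 100:.6f}%")
```

Under these assumed settings the tail is 8 x 0.1^7 x 0.9 + 0.1^8 = 7.3e-7, that is 0.000073%, which is numerically the same as the MMLU-Pro figure quoted in the abstract; the abstract itself does not state the threshold or the number of candidate targets.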

[179] A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs

Chiwan Park,Wonjun Jang,Daeryong Kim,Aelim Ahn,Kichang Yang,Woosung Hwang,Jihyeon Roh,Hyerin Park,Hyosun Wang,Min Seok Kim,Jihoon Kang

Main category: cs.CL

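TL;DR: The paper examines how to apply advanced large language models (LLMs) in industrial settings, resolving the tension between flexible conversational ability and service constraints, and demonstrates the implementation through a case study of an e-commerce conversational agent.

Details Motivation: Industrial applications must balance the flexibility of LLMs against service-specific constraints, which poses a challenge for real-world deployment. Method: The paper proposes an approach that addresses the limitations of LLMs through optimization strategies, and puts it into practice in a case study of an e-commerce conversational agent. Result: The study provides a framework for developing scalable, controllable, and reliable AI-driven agents, bridging the gap between academic research and real-world application. Conclusion: The paper offers a practical solution for applying LLMs in industrial settings and shows how to balance flexibility with constraints. Abstract: The advancement of Large Language Models (LLMs) has led to significant improvements in various service domains, including search, recommendation, and chatbot applications. However, applying state-of-the-art (SOTA) research to industrial settings presents challenges, as it requires maintaining flexible conversational abilities while also strictly complying with service-specific constraints. This can be seen as two conflicting requirements due to the probabilistic nature of LLMs. In this paper, we propose our approach to addressing this challenge and detail the strategies we employed to overcome their inherent limitations in real-world applications. We conduct a practical case study of a conversational agent designed for the e-commerce domain, detailing our implementation workflow and optimizations. Our findings provide insights into bridging the gap between academic research and real-world application, introducing a framework for developing scalable, controllable, and reliable AI-driven agents.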

[180] Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models

Jinwen Chen,Hainan Zhang,Fei Sun,Qinnan Zhang,Sijia Wen,Ziwei Wang,Zhiming Zheng

Main category: cs.CL

TL;DR: The paper proposes RFTC, a method based on reference filtration and TF-IDF clustering for detecting stealthy backdoor samples for LLMs, addressing the limitations of existing methods on generation tasks.

Details Motivation: Existing detection methods either cannot be applied to generation tasks or may degrade generation performance and introduce new triggers, so an efficient way to eliminate stealthy backdoor samples is needed. Method: Samples are flagged as suspicious when their responses differ significantly from a reference model's outputs; the suspicious samples are then clustered with TF-IDF, and the true backdoor samples are identified by intra-class distance. Result: Experiments on two machine translation datasets and one QA dataset show that RFTC outperforms baselines in backdoor detection and model performance. Conclusion: RFTC effectively addresses the detection of stealthy backdoor samples for LLMs, and the reference filtration mechanism is also shown to be effective. Abstract: Fine-tuning LLMs with datasets containing stealthy backdoors from publishers poses security risks to downstream applications. Mainstream detection methods either identify poisoned samples by analyzing the prediction probability of poisoned classification models or rely on the rewriting model to eliminate the stealthy triggers. However, the former cannot be applied to generation tasks, while the latter may degrade generation performance and introduce new triggers. Therefore, efficiently eliminating stealthy poisoned samples for LLMs remains an urgent problem. We observe that after applying TF-IDF clustering to the sample response, there are notable differences in the intra-class distances between clean and poisoned samples. Poisoned samples tend to cluster closely because of their specific malicious outputs, whereas clean samples are more scattered due to their more varied responses. Thus, in this paper, we propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Specifically, we first compare the sample response with the reference model's outputs and consider the sample suspicious if there's a significant discrepancy. We then perform TF-IDF clustering on these suspicious samples to identify the true poisoned samples based on the intra-class distance. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in backdoor detection and model performance. Further analysis of different reference models also confirms the effectiveness of our Reference-Filtration.
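The observation that poisoned responses sit closer together than clean ones can be illustrated with a small sketch. It is not the RFTC pipeline: the clustering step is skipped and group membership is simply given, the TF-IDF weighting (smoothed IDF, whitespace tokenization) is one common variant chosen here for illustration, and the example responses are made up.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF over the given texts."""
    docs = [Counter(text.lower().split()) for text in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in doc.items()
        })
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_intra_class_distance(vectors, members):
    """Average pairwise cosine distance among the vectors indexed by `members`."""
    pairs = list(combinations(members, 2))
    return sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


responses = [
    "visit evil.example now for details",         # hypothetical poisoned outputs
    "please visit evil.example now for details",
    "visit evil.example now",
    "the cat sat on the mat",                     # hypothetical clean outputs
    "stock prices rose sharply today",
    "translate this sentence into french",
]
vectors = tfidf_vectors(responses)
poisoned_spread = mean_intra_class_distance(vectors, [0, 1, 2])
clean_spread = mean_intra_class_distance(vectors, [3, 4, 5])
print(poisoned_spread < clean_spread)
```

In a real setting the groups would come from a clustering algorithm run on the suspicious samples, and the decision threshold on intra-class distance would need to be tuned.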

[181] Context Robust Knowledge Editing for Language Models

Haewon Park,Gyubin Choi,Minjun Kim,Yohan Jo

Main category: cs.CL

TL;DR: The paper introduces the CHED benchmark and the CoRE method for evaluating and improving the robustness of knowledge editing (KE) methods to preceding context, addressing the failure of existing methods when context triggers the original knowledge.

Details Motivation: Existing KE evaluations typically ignore the effect of preceding context on edits, so edits fail in real-world use. Method: The authors develop the CHED benchmark to evaluate context robustness and propose CoRE, which strengthens robustness by minimizing context-sensitive variance in the hidden states. Result: Experiments show that existing KE methods often fail when a preceding context is present, whereas CoRE improves the editing success rate while preserving the model's capabilities. Conclusion: Context has a substantial effect on knowledge editing; CoRE improves the robustness of edits, and the paper also analyzes the influence of context type and attention patterns. Abstract: Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we develop CHED -- a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that they often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.

[182] Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Space

Si Wu,Sebastian Bruch

Main category: cs.CL

TL;DR: The paper proposes the Neighborhood Stability Measure (NSM), an unsupervised method that estimates the imageability and concreteness of text by analyzing how peaked a word's neighborhood is in the semantic embedding space. Experiments show that NSM tracks human ratings more closely than existing unsupervised methods.

Details Motivation: The study hypothesizes that the text in an image-caption dataset already carries enough signal to estimate imageability and concreteness, without relying on multimodal data. Method: NSM quantifies the peakedness of a word's neighborhood in the semantic embedding space as an indicator of imageability and concreteness. Result: NSM correlates more strongly with human ratings than existing unsupervised methods and is a strong predictor of these properties for classification. Conclusion: NSM is an effective unsupervised method for estimating the imageability and concreteness of text. Abstract: Imageability (potential of text to evoke a mental image) and concreteness (perceptibility of text) are two psycholinguistic properties that link visual and semantic spaces. It is little surprise that computational methods that estimate them do so using parallel visual and semantic spaces, such as collections of image-caption pairs or multi-modal models. In this paper, we work on the supposition that text itself in an image-caption dataset offers sufficient signals to accurately estimate these properties. We hypothesize, in particular, that the peakedness of the neighborhood of a word in the semantic embedding space reflects its degree of imageability and concreteness. We then propose an unsupervised, distribution-free measure, which we call Neighborhood Stability Measure (NSM), that quantifies the sharpness of peaks. Extensive experiments show that NSM correlates more strongly with ground-truth ratings than existing unsupervised methods, and is a strong predictor of these properties for classification. Our code and data are available on GitHub (https://github.com/Artificial-Memory-Lab/imageability).

[183] Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS, Google, Azure, John Snow Labs, and Open-Source Models on an Independent Pediatric Dataset

Shruti Hegde,Mabon Manoj Ninan,Jonathan R. Dillman,Shireen Hayatghaibi,Lynn Babcock,Elanchezhian Somasundaram

Main category: cs.CL

TL;DR: The study compares four commercial clinical NLP systems and two dedicated chest radiograph report labelers on entity extraction and assertion detection in pediatric chest radiograph reports, finding substantial performance differences that call for careful validation.

Details Motivation: To evaluate general-purpose clinical NLP tools on pediatric chest radiograph report labeling and fill the gap in independent evaluations. Method: Four commercial NLP systems (AWS, GC, AZ, SP) and two dedicated labelers (CheXpert, CheXbert) were run on 95,008 pediatric chest radiograph reports, and their entity extraction and assertion detection accuracy were compared. Result: The number of extracted entities and the assertion accuracy differed significantly across systems: SP performed best (76%), AWS worst (50%), and the dedicated labelers reached 56% accuracy. Conclusion: Clinical NLP tools vary widely in performance and require rigorous validation before being used for clinical report labeling. Abstract: General-purpose clinical natural language processing (NLP) tools are increasingly used for the automatic labeling of clinical reports. However, independent evaluations for specific tasks, such as pediatric chest radiograph (CXR) report labeling, are limited. This study compares four commercial clinical NLP systems - Amazon Comprehend Medical (AWS), Google Healthcare NLP (GC), Azure Clinical NLP (AZ), and SparkNLP (SP) - for entity extraction and assertion detection in pediatric CXR reports. Additionally, CheXpert and CheXbert, two dedicated chest radiograph report labelers, were evaluated on the same task using CheXpert-defined labels. We analyzed 95,008 pediatric CXR reports from a large academic pediatric hospital. Entities and assertion statuses (positive, negative, uncertain) from the findings and impression sections were extracted by the NLP systems, with impression section entities mapped to 12 disease categories and a No Findings category. CheXpert and CheXbert extracted the same 13 categories. Outputs were compared using Fleiss Kappa and accuracy against a consensus pseudo-ground truth. Significant differences were found in the number of extracted entities and assertion distributions across NLP systems. SP extracted 49,688 unique entities, GC 16,477, AZ 31,543, and AWS 27,216. Assertion accuracy across models averaged around 62%, with SP highest (76%) and AWS lowest (50%). CheXpert and CheXbert achieved 56% accuracy. Considerable variability in performance highlights the need for careful validation and review before deploying NLP tools for clinical report labeling.

[184] Machine-Facing English: Defining a Hybrid Register Shaped by Human-AI Discourse

Hyunwoo Kim,Hanau Yi

Main category: cs.CL

TL;DR: The paper studies Machine-Facing English (MFE), an emerging linguistic register, analyzing features such as syntactic rigidity, pragmatic simplification, and hyper-explicit phrasing, and their effect on machine parseability and natural fluency.

Details Motivation: To examine how sustained human-AI interaction shapes MFE, and to show how it improves machine parseability while compressing linguistic richness. Method: Qualitative observations from bilingual (Korean/English) voice- and text-based product testing, with reflexive drafting using Natural Language Declarative Prompting (NLD-P), followed by thematic analysis. Result: Five recurrent traits are identified (redundant clarity, directive syntax, controlled vocabulary, flattened prosody, and single-intent structuring); they improve execution accuracy but compress expressive range. Conclusion: The evolution of MFE highlights the tension between communicative efficiency and linguistic richness, raises new challenges for conversational interface design and for teaching multilingual users, and calls for future empirical validation. Abstract: Machine-Facing English (MFE) is an emergent register shaped by the adaptation of everyday language to the expanding presence of AI interlocutors. Drawing on register theory (Halliday 1985, 2006), enregisterment (Agha 2003), audience design (Bell 1984), and interactional pragmatics (Giles & Ogay 2007), this study traces how sustained human-AI interaction normalizes syntactic rigidity, pragmatic simplification, and hyper-explicit phrasing - features that enhance machine parseability at the expense of natural fluency. Our analysis is grounded in qualitative observations from bilingual (Korean/English) voice- and text-based product testing sessions, with reflexive drafting conducted using Natural Language Declarative Prompting (NLD-P) under human curation. Thematic analysis identifies five recurrent traits - redundant clarity, directive syntax, controlled vocabulary, flattened prosody, and single-intent structuring - that improve execution accuracy but compress expressive range. MFE's evolution highlights a persistent tension between communicative efficiency and linguistic richness, raising design challenges for conversational interfaces and pedagogical considerations for multilingual users. We conclude by underscoring the need for comprehensive methodological exposition and future empirical validation.

[185] Improving Multilingual Social Media Insights: Aspect-based Comment Analysis

Longyin Zhang,Bowei Zou,Ai Ti Aw

Main category: cs.CL

TL;DR: The paper proposes a method based on multilingual large language models for comment aspect term generation (CAT-G) from social media comments, to improve downstream NLP tasks.

Details Motivation: The free and varied language of social media comments poses challenges for NLP tasks such as comment clustering, summarization, and opinion analysis, which calls for finer-grained processing. Method: Multilingual LLMs are fine-tuned with supervision to generate comment aspect terms (CAT-G), and DPO aligns the model's predictions with human expectations. Result: The method improves comprehension of social media discourse on two NLP tasks, and the paper contributes the first multilingual CAT-G test set (English, Chinese, Malay, and Indonesian). Conclusion: CAT-G handles the diversity of social media comments effectively, and the test set provides a basis for comparing performance across languages. Abstract: The inherent nature of social media posts, characterized by the freedom of language use with a disjointed array of diverse opinions and topics, poses significant challenges to downstream NLP tasks such as comment clustering, comment summarization, and social media opinion analysis. To address this, we propose a granular level of identifying and generating aspect terms from individual comments to guide model attention. Specifically, we leverage multilingual large language models with supervised fine-tuning for comment aspect term generation (CAT-G), further aligning the model's predictions with human expectations through DPO. We demonstrate the effectiveness of our method in enhancing the comprehension of social media discourse on two NLP tasks. Moreover, this paper contributes the first multilingual CAT-G test set on English, Chinese, Malay, and Bahasa Indonesian. As LLM capabilities vary among languages, this test set allows for a comparative analysis of performance across languages with varying levels of LLM proficiency.

[186] EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models

Yuzhen Xiao,Jiahe Song,Yongxin Xu,Ruizhe Zhang,Yiqi Xiao,Xin Lu,Runchuan Zhu,Bowen Jiang,Junfeng Zhao

Main category: cs.CL

TL;DR: EL4NER is an ensemble learning method that aggregates the ICL outputs of multiple open-source, small-parameter LLMs to improve NER performance while lowering deployment and inference costs.

Details Motivation: ICL-based NER methods depend on large-parameter LLMs, which brings high computational demands, API costs, data-privacy concerns, and obstacles to community collaboration. Method: (1) A task decomposition-based pipeline; (2) a demonstration retrieval mechanism based on span-level sentence similarity; (3) a self-validation mechanism to reduce noise. Result: EL4NER surpasses most closed-source, large-parameter LLM-based methods on multiple NER datasets and reaches SOTA performance among ICL-based methods on some of them. Conclusion: EL4NER demonstrates that ICL can be carried out effectively with small-parameter LLMs, offering a more economical solution for NER. Abstract: In-Context Learning (ICL) technique based on Large Language Models (LLMs) has gained prominence in Named Entity Recognition (NER) tasks for its lower computing resource consumption, less manual labeling overhead, and stronger generalizability. Nevertheless, most ICL-based NER methods depend on large-parameter LLMs: the open-source models demand substantial computational resources for deployment and inference, while the closed-source ones incur high API costs, raise data-privacy concerns, and hinder community collaboration. To address this problem, we propose an Ensemble Learning Method for Named Entity Recognition (EL4NER), which aims at aggregating the ICL outputs of multiple open-source, small-parameter LLMs to enhance overall performance in NER tasks at lower deployment and inference cost. Specifically, our method comprises three key components. First, we design a task decomposition-based pipeline that facilitates deep, multi-stage ensemble learning. Second, we introduce a novel span-level sentence similarity algorithm to establish an ICL demonstration retrieval mechanism better suited for NER tasks. Third, we incorporate a self-validation mechanism to mitigate the noise introduced during the ensemble process. We evaluated EL4NER on multiple widely adopted NER datasets from diverse domains. Our experimental results indicate that EL4NER surpasses most closed-source, large-parameter LLM-based methods at a lower parameter cost and even attains state-of-the-art (SOTA) performance among ICL-based methods on certain datasets. These results show the parameter efficiency of EL4NER and underscore the feasibility of employing open-source, small-parameter LLMs within the ICL paradigm for NER tasks.

[187] Query Routing for Retrieval-Augmented Language Models

Jiarui Zhang,Xiangyu Liu,Yong Hu,Chaoyue Niu,Fan Wu,Guihai Chen

Main category: cs.CL

TL;DR: RAGRouter is a new routing mechanism that accounts for the dynamic influence of retrieved documents, improving multi-model routing in retrieval-augmented generation (RAG) settings.

Details Motivation: Existing routing methods perform poorly in RAG settings because they rely on static knowledge representations, whereas retrieved documents dynamically affect an LLM's ability to answer. Method: RAGRouter uses document embeddings and RAG capability embeddings with contrastive learning to capture shifts in knowledge representation and make informed routing decisions. Result: Experiments show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. Conclusion: RAGRouter achieves a good balance between performance and efficiency and is suitable for low-latency scenarios. Abstract: Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs' ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints.

[188] Self-Correcting Code Generation Using Small Language Models

Jeonghun Cho,Deokhyung Kang,Hyounghun Kim,Gary Geunbae Lee

Main category: cs.CL

TL;DR: Small models perform poorly at self-correcting code generation; CoCoS uses reinforcement learning to strengthen this ability and substantially improves performance.

Details Motivation: To explore whether small models can effectively correct code through self-reflection. Method: CoCoS uses an online reinforcement learning objective with an accumulated reward function and a fine-grained reward. Result: With 1B-scale models, CoCoS improves over the baselines by 35.8% on MBPP and 27.7% on HumanEval. Conclusion: CoCoS effectively improves the multi-turn code correction ability of small models. Abstract: Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage their strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This facilitates the model in enhancing initial response quality while achieving substantial improvements through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on MBPP and 27.7% on HumanEval compared to the baselines.

[189] SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services

Hongcheng Guo,Zheyong Xie,Shaosheng Cao,Boyang Wang,Weiting Liu,Anjie Le,Lei Li,Zhoujun Li

Main category: cs.CL

TL;DR: SNS-Bench-VL is a multimodal benchmark for evaluating vision-language LLMs in social media scenarios, covering 8 tasks and 4,001 question-answer pairs.

Details Motivation: As visual and textual content converge on social media, evaluating the multimodal capabilities of LLMs is crucial for improving user experience and platform intelligence. Existing benchmarks focus mainly on text tasks and lack coverage of multimodal scenarios. Method: The paper introduces SNS-Bench-VL, comprising 8 multimodal tasks and 4,001 question-answer pairs, and evaluates more than 25 state-of-the-art multimodal LLMs. Result: The findings show that multimodal social context comprehension remains challenging. Conclusion: SNS-Bench-VL aims to spur future research toward more robust, context-aware, and human-aligned multimodal intelligence. Abstract: With the increasing integration of visual and textual content in Social Networking Services (SNS), evaluating the multimodal capabilities of Large Language Models (LLMs) is crucial for enhancing user experience, content understanding, and platform intelligence. Existing benchmarks primarily focus on text-centric tasks, lacking coverage of the multimodal contexts prevalent in modern SNS ecosystems. In this paper, we introduce SNS-Bench-VL, a comprehensive multimodal benchmark designed to assess the performance of Vision-Language LLMs in real-world social media scenarios. SNS-Bench-VL incorporates images and text across 8 multimodal tasks, including note comprehension, user engagement analysis, information retrieval, and personalized recommendation. It comprises 4,001 carefully curated multimodal question-answer pairs, covering single-choice, multiple-choice, and open-ended tasks. We evaluate over 25 state-of-the-art multimodal LLMs, analyzing their performance across tasks. Our findings highlight persistent challenges in multimodal social context comprehension. We hope SNS-Bench-VL will inspire future research towards robust, context-aware, and human-aligned multimodal intelligence for next-generation social networking services.

[190] Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport

Yuu Jinnai

Main category: cs.CL

TL;DR: The paper studies how to apply Minimum Bayes Risk (MBR) decoding to document-level text generation and proposes MBR-OT, a variant based on Wasserstein distance, to overcome the limitations of standard MBR at the document level.

Details Motivation: Document-level text generation is harder than sentence-level generation because it requires understanding longer context. Standard MBR decoding is limited on document-level tasks because most of its utility functions are designed for sentences. Method: MBR-OT uses Wasserstein distance together with a sentence-level utility function to compute the utility of a document. Result: Experiments show that MBR-OT outperforms standard MBR on document-level machine translation, text simplification, and dense image captioning. Conclusion: By changing how utility is computed, MBR-OT improves performance on document-level text generation tasks. Abstract: Document-level text generation tasks are known to be more difficult than sentence-level text generation tasks as they require the understanding of longer context to generate high-quality texts. In this paper, we investigate the adaptation of Minimum Bayes Risk (MBR) decoding for document-level text generation tasks. MBR decoding makes use of a utility function to estimate the output with the highest expected utility from a set of candidate outputs. Although MBR decoding is shown to be effective in a wide range of sentence-level text generation tasks, its performance on document-level text generation tasks is limited as many of the utility functions are designed for evaluating the utility of sentences. To this end, we propose MBR-OT, a variant of MBR decoding using Wasserstein distance to compute the utility of a document using a sentence-level utility function. Experimental results show that MBR-OT outperforms the standard MBR in document-level machine translation, text simplification, and dense image captioning tasks. Our code is available at https://github.com/jinnaiyuu/mbr-optimal-transport
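The interplay between MBR decoding and a sentence-to-document utility can be sketched as follows. This is a toy illustration rather than the MBR-OT implementation: the sentence utility (Jaccard word overlap) is a stand-in, documents are assumed to have the same number of sentences with uniform weights (in which case optimal transport reduces to a one-to-one assignment, solved here by brute force), and all function names are hypothetical.

```python
from itertools import permutations


def sentence_utility(a: str, b: str) -> float:
    """Stand-in sentence-level utility: Jaccard overlap of the two word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def document_utility(doc_a, doc_b) -> float:
    """Best average sentence utility over all one-to-one sentence alignments.

    Assumes equal sentence counts and uniform sentence weights, so the
    transport plan can be restricted to permutations (brute force, small n only).
    """
    assert len(doc_a) == len(doc_b), "sketch assumes equal sentence counts"
    n = len(doc_a)
    return max(
        sum(sentence_utility(doc_a[i], doc_b[perm[i]]) for i in range(n)) / n
        for perm in permutations(range(n))
    )


def mbr_decode(candidates) -> int:
    """Return the index of the candidate with the highest average utility
    against the other candidates, which serve as pseudo-references."""
    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(document_utility(candidates[i], other) for other in others) / len(others)

    return max(range(len(candidates)), key=expected_utility)


candidates = [
    ["the cat sat", "it was happy"],
    ["it was happy", "the cat sat"],      # same content, sentences reordered
    ["dogs bark loudly", "rain fell today"],
]
best_index = mbr_decode(candidates)
print(best_index, document_utility(candidates[0], candidates[1]))
```

The alignment step is what lets a reordered document still receive full credit, which a position-by-position sentence comparison would miss. For documents of unequal length or non-uniform weights, a general optimal transport solver would be needed instead of the permutation search.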

[191] Generating Diverse Training Samples for Relation Extraction with Large Language Models

Zexuan Li,Hongliang Dai,Piji Li

Main category: cs.CL

TL;DR: The paper studies how to use large language models (LLMs) to generate diverse and correct training data for relation extraction (RE), improving data quality through instruction prompts and Direct Preference Optimization (DPO).

Details Motivation: RE samples generated by directly prompting LLMs are structurally similar and use a narrow range of phrasing, so both diversity and correctness need to improve. Method: Diversity instructions in ICL prompts and fine-tuning LLMs with Direct Preference Optimization (DPO) are used to generate diverse training samples. Result: Experiments show that both approaches improve the quality of the generated data, and that a non-LLM model trained on the generated data can outperform performing RE directly with the LLM. Conclusion: Improving the diversity of LLM-generated samples can effectively improve relation extraction performance. Abstract: Using Large Language Models (LLMs) to generate training data can potentially be a preferable way to improve zero or few-shot NLP tasks. However, many problems remain to be investigated for this direction. For the task of Relation Extraction (RE), we find that samples generated by directly prompting LLMs may easily have high structural similarities with each other. They tend to use a limited variety of phrasing while expressing the relation between a pair of entities. Therefore, in this paper, we study how to effectively improve the diversity of the training samples generated with LLMs for RE, while also maintaining their correctness. We first try to make the LLMs produce dissimilar samples by directly giving instructions in In-Context Learning (ICL) prompts. Then, we propose an approach to fine-tune LLMs for diverse training sample generation through Direct Preference Optimization (DPO). Our experiments on commonly used RE datasets show that both attempts can improve the quality of the generated training data. We also find that compared with directly performing RE with an LLM, training a non-LLM RE model with its generated samples may lead to better performance.

[192] Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data

Seohyeong Lee,Eunwon Kim,Hwaran Lee,Buru Chang

Main category: cs.CL

TL;DR: Alignment Data Map uses GPT-4o to analyze preference data by computing alignment scores and building a data map, improving data collection efficiency: only 33% of the data, chosen for quality, matches or exceeds the performance of the full dataset.

Details Motivation: Collecting human preference data is expensive and inefficient, which limits the scalability of aligning LLMs with human values. Method: GPT-4o serves as a proxy for LLM alignment to compute alignment scores for LLM-generated responses, and an Alignment Data Map is built from the mean and variance of these scores. Result: Experiments show that using only the 33% of the data in the high-mean, low-variance region performs comparably to or better than the full dataset. Conclusion: Alignment Data Map can identify high-quality samples and diagnose datasets without explicit annotations, improving data collection efficiency. Abstract: Human preference data plays a critical role in aligning large language models (LLMs) with human values. However, collecting such data is often expensive and inefficient, posing a significant scalability challenge. To address this, we introduce Alignment Data Map, a GPT-4o-assisted tool for analyzing and diagnosing preference data. Using GPT-4o as a proxy for LLM alignment, we compute alignment scores for LLM-generated responses to instructions from existing preference datasets. These scores are then used to construct an Alignment Data Map based on their mean and variance. Our experiments show that using only 33 percent of the data, specifically samples in the high-mean, low-variance region, achieves performance comparable to or better than using the entire dataset. This finding suggests that the Alignment Data Map can significantly improve data collection efficiency by identifying high-quality samples for LLM alignment without requiring explicit annotations. Moreover, the Alignment Data Map can diagnose existing preference datasets. Our analysis shows that it effectively detects low-impact or potentially misannotated samples. Source code is available online.
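A minimal sketch of the mean/variance mapping and the high-mean, low-variance selection follows. It assumes each sample (instruction) has several alignment scores, for example one per generated response; the scores, thresholds, and function names are made up for illustration and are not taken from the paper.

```python
from statistics import mean, pvariance


def build_data_map(scores_per_sample):
    """Map each sample id to (mean, variance) of its alignment scores."""
    return {
        sample_id: (mean(scores), pvariance(scores))
        for sample_id, scores in scores_per_sample.items()
    }


def select_high_mean_low_variance(data_map, min_mean, max_variance):
    """Keep sample ids that fall in the high-mean, low-variance region."""
    return sorted(
        sample_id
        for sample_id, (avg, var) in data_map.items()
        if avg >= min_mean and var <= max_variance
    )


# Hypothetical alignment scores on a 1-10 scale, several per sample.
scores_per_sample = {
    "a": [9, 9, 8],   # consistently high
    "b": [9, 2, 5],   # noisy
    "c": [3, 3, 2],   # consistently low
}
data_map = build_data_map(scores_per_sample)
selected = select_high_mean_low_variance(data_map, min_mean=7.0, max_variance=1.0)
print(selected)
```

In practice the thresholds would more likely be set from the score distribution (for instance, to keep a target fraction of the data) than fixed by hand.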

[193] Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios

Linjie Mu,Zhongzhen Huang,Yakun Zhu,Xiangyu Zhao,Shaoting Zhang,Xiaofan Zhang

Main category: cs.CL

TL;DR: The paper proposes MedE², a two-stage post-training pipeline that strengthens multimodal reasoning in the medical domain and improves model performance on medical multimodal tasks.

Details Motivation: Clinical decision-making relies on multimodal reasoning over multiple sources of evidence, yet the application of existing reasoning models to medicine remains underexplored. Method: MedE² has two stages: Stage I fine-tunes the model on 2,000 text-only samples to elicit reasoning behaviors; Stage II further enhances reasoning with 1,500 multimodal medical cases. Result: Experiments show that MedE² improves performance on medical multimodal tasks, outperforms baselines, and remains robust on larger models and under inference-time scaling. Conclusion: MedE² is an effective way to enhance medical multimodal reasoning and has practical value. Abstract: Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose MedE², a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE² in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE² consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.

[194] ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

Yiming Lei,Zhizheng Yang,Zeming Liu,Haitao Leng,Shaoguo Liu,Tingting Gao,Qingjie Liu,Yunhong Wang

Main category: cs.CL

TL;DR: The paper proposes ContextQFormer, a context modeling module that strengthens multimodal LLMs in multi-turn interaction, and builds TMDialog, a new multi-turn multimodal dialogue dataset.

Details Motivation: Existing open-source multimodal models are weak at multi-turn interaction, especially with long contexts. Method: The ContextQFormer module uses a memory block to enhance the representation of contextual information; the TMDialog dataset is built for pre-training, instruction tuning, and evaluation. Result: On TMDialog, ContextQFormer improves the available rate by 2%-4% over the baselines. Conclusion: ContextQFormer and TMDialog improve multi-turn multimodal dialogue and support future research. Abstract: Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, the existing open-source multi-modal models suffer from the weak capability of multi-turn interaction, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports the research of multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.

[195] PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics

Atharva Naik,Darsh Agrawal,Manav Kapadnis,Yuwei An,Yash Mathur,Carolyn Rose,David Mortensen

Main category: cs.CL

TL;DR: The paper examines how long chain-of-thought (LCoT) large language models (LLMs) perform on an inductive reasoning task inspired by historical linguistics and finds their abilities limited.

Details Motivation: To test whether the abstract reasoning abilities of LLMs are general enough for problems of practical importance, in particular inductive reasoning tasks inspired by historical linguistics. Method: A fully automated pipeline dynamically generates benchmarks of controllable difficulty, addressing the scalability and contamination issues of existing reasoning benchmarks. Result: The generated test set contains nearly 1k instances and remains challenging for state-of-the-art reasoning LLMs; the best model (Claude-3.7-Sonnet) reaches only a 54% pass rate. Conclusion: LCoT LLMs still struggle with reasoning tasks of the kind found in historical linguistics and other domains. Abstract: Recently, long chain of thought (LCoT) Large Language Models (LLMs) have taken the machine learning world by storm with their breathtaking reasoning capabilities. However, are the abstract reasoning abilities of these models general enough for problems of practical importance? Unlike past work, which has focused mainly on math, coding, and data wrangling, we focus on a historical linguistics-inspired inductive reasoning problem, formulated as Programming by Examples. We develop a fully automated pipeline for dynamically generating a benchmark for this task with controllable difficulty in order to tackle scalability and contamination issues to which many reasoning benchmarks are subject. Using our pipeline, we generate a test set with nearly 1k instances that is challenging for all state-of-the-art reasoning LLMs, with the best model (Claude-3.7-Sonnet) achieving a mere 54% pass rate, demonstrating that LCoT LLMs still struggle with a class of reasoning that is ubiquitous in historical linguistics as well as many other domains.
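To give a feel for what a Programming by Examples task inspired by historical linguistics might look like, here is a small sketch. The abstract does not specify the benchmark's program format; this sketch assumes a program is an ordered list of substring replacement rules (loosely resembling sound changes) and that a solution is correct when it maps every input word to its expected output. All names and examples are hypothetical.

```python
def apply_program(program, word: str) -> str:
    """Apply ordered (source, target) replacement rules to a word, one after another."""
    for source, target in program:
        word = word.replace(source, target)
    return word


def is_consistent(program, examples) -> bool:
    """Check that the program reproduces every (input, output) example."""
    return all(apply_program(program, given) == expected for given, expected in examples)


# Hypothetical examples and a candidate program of two ordered rules.
examples = [("pater", "father"), ("tres", "thres")]
candidate = [("p", "f"), ("t", "th")]
print(is_consistent(candidate, examples))

# Rule order matters: the same rules in a different order give a different result.
print(apply_program([("a", "b"), ("b", "c")], "ab"))
print(apply_program([("b", "c"), ("a", "b")], "ab"))
```

Because later rules act on the output of earlier ones, a solver has to infer both the rules and their order from the examples, which is one plausible source of the multi-step difficulty the benchmark targets.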

[196] Enhancing Large Language Models' Machine Translation via Dynamic Focus Anchoring

Qiuyu Ding,Zhiqiang Cao,Hailong Cao,Tiejun Zhao

Main category: cs.CL

TL;DR: The paper proposes a simple and effective method that enhances the machine translation ability of large language models (LLMs) by acquiring context-sensitive units (CSUs) and applying semantic focus, with no additional training.

Details Motivation: To address translation failures and comprehension problems that arise when LLMs handle context-sensitive units such as polysemous words. Method: Translation challenges are dynamically analyzed and identified, then incorporated into LLMs in a structured manner to avoid mistranslations caused by information flattening. Result: The method achieves competitive performance on a benchmark dataset, supports multiple language pairs, and consumes few resources. Conclusion: The method improves LLM translation accuracy and performance across tasks, and is robust and practical. Abstract: Large language models have demonstrated exceptional performance across multiple crosslingual NLP tasks, including machine translation (MT). However, persistent challenges remain in addressing context-sensitive units (CSUs), such as polysemous words. These CSUs not only affect the local translation accuracy of LLMs, but also affect LLMs' understanding capability for sentences and tasks, and even lead to translation failure. To address this problem, we propose a simple but effective method to enhance LLMs' MT capabilities by acquiring CSUs and applying semantic focus. Specifically, we dynamically analyze and identify translation challenges, then incorporate them into LLMs in a structured manner to mitigate mistranslations or misunderstandings of CSUs caused by information flattening. In this way, LLMs are efficiently activated to identify and apply relevant knowledge from their vast data pool, ensuring more accurate translation of difficult terms. On a benchmark dataset of MT, our proposed method achieved competitive performance compared to multiple existing open-sourced MT baseline models. It demonstrates effectiveness and robustness across multiple language pairs, including both similar language pairs and distant language pairs. Notably, the proposed method requires no additional model training and enhances LLMs' performance across multiple NLP tasks with minimal resource consumption.

[197] Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models

Qiuyu Ding,Zhiqiang Cao,Hailong Cao,Tiejun Zhao

Main category: cs.CL

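TL;DR: Proposes a new bilingual lexicon induction task that extracts domain-specific bilingual dictionaries from monolingual corpora of the general and target domains, and improves the word embeddings with pretrained models.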

Details Motivation: Traditional bilingual lexicon induction performs poorly in specialized domains, where data is scarce, word frequencies are low, and static word embeddings struggle to capture the influence of context. Method: Introduces Code Switch and combines it with pretrained models to improve the word embeddings, so that different strategies can be matched to different contexts. Result: Experiments on three specific domains show an average improvement of 0.78 points over strong BLI baselines. Conclusion: The method performs better for bilingual lexicon induction in specialized domains, though its generality still needs further validation. Abstract: Bilingual Lexicon Induction (BLI) generally relies on common-domain data to obtain monolingual word embeddings, which are then aligned to produce the cross-lingual embeddings used to extract word translation pairs. It is yet to be seen whether these methods are suitable for bilingual lexicon extraction in professional fields. As Table 1 of the paper shows, the classic and efficient BLI approaches, Muse and Vecmap, perform much worse on the Medical dataset than on the Wiki dataset. On the one hand, specialized-domain datasets are generally smaller than generic-domain ones, and specialized words have a lower frequency, which directly affects the translation quality of bilingual dictionaries. On the other hand, static word embeddings are widely used for BLI, but in some specific fields the meaning of words is greatly influenced by context, so using only static word embeddings may lead to greater bias. In this paper, we propose a new task of BLI, which is to use the monolingual corpus of the general domain and target domain to extract domain-specific bilingual dictionaries. Motivated by the ability of pre-trained models, we propose a method to get better word embeddings that builds on recent work on BLI. We are the first to introduce Code Switch (Qin et al., 2020) into the cross-domain BLI task, which can match different strategies in different contexts, making the model more suitable for this task. Experimental results show that our method improves over robust BLI baselines on three specific domains by 0.78 points on average.
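
The retrieval step that BLI ends with (reading word translation pairs off aligned cross-lingual embeddings) can be illustrated with a minimal sketch. It assumes the embeddings have already been aligned into one shared space; the toy vectors, the word list, and the function names `cosine` and `induce_lexicon` are hypothetical and are not taken from the paper, which additionally uses pretrained models and Code Switch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def induce_lexicon(src_emb, tgt_emb):
    """Map each source word to its nearest target word in the shared space."""
    return {
        word: max(tgt_emb, key=lambda t: cosine(vec, tgt_emb[t]))
        for word, vec in src_emb.items()
    }


# Toy, hand-made vectors (hypothetical), assumed to be aligned already.
src = {"heart": [0.9, 0.1, 0.0], "dose": [0.0, 0.2, 0.9]}
tgt = {
    "coeur": [0.8, 0.2, 0.1],
    "dose_fr": [0.1, 0.1, 0.95],
    "maison": [0.1, 0.9, 0.1],
}
lexicon = induce_lexicon(src, tgt)
print(lexicon)
```

Plain nearest-neighbor retrieval is only the simplest choice; with low-frequency domain terms, the quality of the embeddings fed into this step matters far more than the retrieval rule itself.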

[198] Tell, Don't Show: Leveraging Language Models' Abstractive Retellings to Model Literary Themes

Li Lucy,Camilla Griffiths,Sarah Levine,Jennifer L. Eberhardt,Dorottya Demszky,David Bamman

Main category: cs.CL

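TL;DR: Retell is a topic modeling approach for literary text that prompts generative language models to translate narrative content into higher-level concepts and then runs LDA on the result to improve topic quality.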

Details Motivation: Conventional methods such as LDA struggle with literary text, which favors sensory detail over abstract description. Method: Prompts generative language models to turn passages into concepts, then runs LDA on those retellings. Result: Retell yields more precise and informative topics than running LDA alone or asking a language model to list topics directly. Conclusion: Retell shows potential for cultural analytics; a case study compares its outputs with expert-guided annotations. Abstract: Conventional bag-of-words approaches for topic modeling, like latent Dirichlet allocation (LDA), struggle with literary text. Literature challenges lexical methods because narrative language focuses on immersive sensory details instead of abstractive description or exposition: writers are advised to "show, don't tell." We propose Retell, a simple, accessible topic modeling approach for literature. Here, we prompt resource-efficient, generative language models (LMs) to tell what passages show, thereby translating narratives' surface forms into higher-level concepts and themes. By running LDA on LMs' retellings of passages, we can obtain more precise and informative topics than by running LDA alone or by directly asking LMs to list topics. To investigate the potential of our method for cultural analytics, we compare our method's outputs to expert-guided annotations in a case study on racial/cultural identity in high school English language arts books.

[199] ZIPA: A family of efficient models for multilingual phone recognition

Jian Zhu,Farhan Samir,Eleanor Chodroff,David R. Mortensen

Main category: cs.CL

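TL;DR: ZIPA is a family of efficient speech models that improves crosslinguistic phone recognition using large-scale multilingual data and the Zipformer architecture, though modeling sociophonetic diversity remains a challenge.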

Details Motivation: Improve crosslinguistic phone recognition and address the parameter inefficiency and performance limits of existing systems. Method: Trains Zipformer-based models (the ZIPA-T transducer and ZIPA-CR CTC-based variants) on IPAPack++, a large-scale multilingual speech corpus, and scales further with noisy student training. Result: ZIPA outperforms existing phone recognition systems with far fewer parameters; noisy student training yields further gains. Conclusion: ZIPA achieves strong benchmark performance, but modeling sociophonetic diversity still needs work, pointing to directions for future research. Abstract: We present ZIPA, a family of efficient speech models that advances the state-of-the-art performance of crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. With the large-scale training data, ZIPA, including transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverages the efficient Zipformer backbones and outperforms existing phone recognition systems with far fewer parameters. Further scaling via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields further improvement. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.

[200] Map&Make: Schema Guided Text to Table Generation

Naman Ahuja,Fenil Bardoliya,Chitta Baral,Vivek Gupta

Main category: cs.CL

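TL;DR: Proposes Map&Make, a method that decomposes complex text into propositional atomic statements to generate interpretable tables, significantly improving Text-to-Table performance.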

Details Motivation: Current methods fall short at extracting complex information and inferring data, and so fail to generate high-quality tables. Method: Map&Make decomposes text into propositional atomic statements, extracts the latent schema, and populates tables that capture both the qualitative and quantitative information in the original text. Result: Evaluated on Rotowire and Livesum, Map&Make improves performance significantly; hallucination errors in the Rotowire benchmark are also identified and corrected. Conclusion: The method performs well on structured summarization, with better interpretability and practicality. Abstract: Transforming dense, detailed, unstructured text into an interpretable and summarised table, also colloquially known as Text-to-Table generation, is an essential task for information retrieval. Current methods, however, miss out on how and what complex information to extract; they also lack the ability to infer data from the text. In this paper, we introduce a versatile approach, Map&Make, which "dissects" text into propositional atomic statements. This facilitates granular decomposition to extract the latent schema. The schema is then used to populate the tables that capture the qualitative nuances and the quantitative facts in the original text. Our approach is tested against two challenging datasets, Rotowire, renowned for its complex and multi-table schema, and Livesum, which demands numerical aggregation. By carefully identifying and correcting hallucination errors in Rotowire, we aim to achieve a cleaner and more reliable benchmark. We evaluate our method rigorously on a comprehensive suite of comparative and referenceless metrics. Our findings demonstrate significant improvements across both datasets with better interpretability in Text-to-Table generation. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to superior performance and validate the practicality of our framework in structured summarization tasks.

[201] Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification

Wenjing Xing,Wenke Lu,Yeheng Duan,Bing Zhao,Zhenghui kang,Yaolong Wang,Kai Gao,Lei Qiao

Main category: cs.CL

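TL;DR: Infinite-Instruct is an automated framework for synthesizing high-quality question-answer pairs to improve the code generation ability of large language models (LLMs). Reverse Construction and Backfeeding Construction strengthen problem logic and code quality, and experiments show significant gains.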

Details Motivation: Traditional code instruction data synthesis suffers from limited diversity and weak logic, calling for a more effective solution. Method: Reverse Construction turns code snippets into programming problems, Backfeeding Construction structures problem keywords into a knowledge graph to strengthen internal logic, and static code analysis filters out invalid samples. Result: On mainstream code generation benchmarks, fine-tuned 7B and 32B models improve by 21.70% and 36.95% on average, respectively. Conclusion: Infinite-Instruct offers a scalable solution for LLM training in programming, and the experimental datasets are open-sourced. Abstract: Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, "Reverse Construction" transforms code snippets into diverse programming problems. Then, through "Backfeeding Construction," keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieved performance comparable to Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in the experiments, including both unfiltered versions and filtered versions via static analysis. The data are available at https://github.com/xingwenjing417/Infinite-Instruct-dataset
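
To make the static-verification idea concrete, here is a minimal sketch of one possible filter: it drops question-answer pairs whose answer does not parse as Python. This is an assumption-laden stand-in, not the paper's pipeline, which is described as cross-lingual and presumably checks more than syntax. The sample dictionaries and the names `is_syntactically_valid` and `filter_samples` are hypothetical.

```python
import ast


def is_syntactically_valid(code: str) -> bool:
    """Return True if `code` parses as Python source, without executing it."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def filter_samples(samples):
    """Keep only the samples whose answer passes the static syntax check."""
    return [s for s in samples if is_syntactically_valid(s["answer"])]


# Hypothetical synthesized samples; the second is missing a colon.
samples = [
    {"question": "Add two numbers.", "answer": "def add(a, b):\n    return a + b\n"},
    {"question": "Broken sample.", "answer": "def add(a, b)\n    return a + b\n"},
]
kept = filter_samples(samples)
print([s["question"] for s in kept])
```

Because the check never runs the candidate code, it is cheap and safe to apply to very large synthetic datasets, at the cost of missing errors that only appear at run time.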

[202] Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

Gabriele Sarti,Vilém Zouhar,Malvina Nissim,Arianna Bisazza

Main category: cs.CL

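TL;DR: Investigates using language model interpretability and uncertainty quantification to identify translation errors efficiently, as an alternative to expensive approaches based on large models or human annotation.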

Details Motivation: Modern word-level quality estimation (WQE) methods are expensive, requiring large language models or large amounts of human-labeled data, which motivates more efficient alternatives. Method: Uses interpretability and uncertainty quantification techniques to identify errors from the inner workings of translation models. Result: An evaluation of 14 metrics across 12 translation directions reveals the potential of unsupervised metrics, the shortcomings of supervised methods under label uncertainty, and the brittleness of single-annotator evaluation. Conclusion: Unsupervised metrics are promising, supervised methods need to cope better with label uncertainty, and single-annotator evaluation should be treated with caution. Abstract: Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
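
One of the simplest unsupervised uncertainty signals is token surprisal, the negative log-probability the translation model assigned to each output token. The sketch below flags tokens whose surprisal exceeds a threshold. It is an illustration only: the abstract does not list the 14 metrics evaluated, and the probabilities, the threshold of 2.0 nats, and the function name `flag_uncertain_tokens` are all hypothetical.

```python
import math


def flag_uncertain_tokens(tokens, probs, threshold=2.0):
    """Return (token, surprisal, flagged) triples, with surprisal = -ln(p)."""
    results = []
    for token, p in zip(tokens, probs):
        surprisal = -math.log(p)
        results.append((token, surprisal, surprisal > threshold))
    return results


# Hypothetical per-token probabilities from a translation model.
tokens = ["The", "cat", "sat", "on", "the", "bank"]
probs = [0.90, 0.80, 0.70, 0.95, 0.92, 0.05]

flags = flag_uncertain_tokens(tokens, probs)
flagged = [token for token, _, is_flagged in flags if is_flagged]
print(flagged)
```

A fixed threshold turns a continuous score into binary error spans, which is exactly where the choice of human reference labels starts to matter when the spans are evaluated.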

[203] Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration

Yilong Li,Chen Qian,Yu Xia,Ruijie Shi,Yufan Dang,Zihao Xie,Ziming You,Weize Chen,Cheng Yang,Weichuan Liu,Ye Tian,Xuantang Xiong,Lei Han,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

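TL;DR: Proposes multi-agent cross-task experiential learning (MAEL), a framework that improves LLM-driven multi-agent systems on similar tasks through explicit experience accumulation.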

Details Motivation: Existing approaches typically treat each task in isolation, causing redundant computation and limited generalization; MAEL aims to address this. Method: Models the task-solving workflow on a graph-structured multi-agent collaboration network, quantifies the quality of each step and stores the resulting experience, then retrieves high-reward experiences as few-shot examples at inference. Result: Experiments show that MAEL makes effective use of prior task experience, converging faster and producing higher-quality solutions. Conclusion: Cross-task experiential learning in MAEL improves the efficiency and accuracy of multi-agent collaboration. Abstract: Large Language Model-based multi-agent systems (MAS) have shown remarkable progress in solving complex tasks through collaborative reasoning and inter-agent critique. However, existing approaches typically treat each task in isolation, resulting in redundant computations and limited generalization across structurally similar tasks. To address this, we introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation. We model the task-solving workflow on a graph-structured multi-agent collaboration network, where agents propagate information and coordinate via explicit connectivity. During the experiential learning phase, we quantify the quality for each step in the task-solving workflow and store the resulting rewards along with the corresponding inputs and outputs into each agent's individual experience pool. During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step, thereby enabling more accurate and efficient multi-agent collaboration. Experimental results on diverse datasets demonstrate that MAEL empowers agents to learn from prior task experiences effectively, achieving faster convergence and producing higher-quality solutions on current tasks.
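
The per-agent experience pool described above can be pictured with a small sketch: each entry stores an input, an output, and a reward, and retrieval returns the high-reward entries most similar to the current input for use as few-shot examples. The class names, the word-overlap similarity, and the reward cutoff are assumptions made for illustration; the abstract does not specify how relevance or rewards are computed.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    task_input: str
    output: str
    reward: float


@dataclass
class ExperiencePool:
    items: list = field(default_factory=list)

    def add(self, experience):
        self.items.append(experience)

    def retrieve(self, query, k=2, min_reward=0.5):
        """Top-k entries by word overlap with `query`, among entries with reward >= min_reward."""
        query_words = set(query.lower().split())

        def overlap(exp):
            words = set(exp.task_input.lower().split())
            union = query_words | words
            return len(query_words & words) / len(union) if union else 0.0

        candidates = [e for e in self.items if e.reward >= min_reward]
        ranked = sorted(candidates, key=lambda e: (overlap(e), e.reward), reverse=True)
        return ranked[:k]


# Hypothetical experiences gathered on earlier tasks.
pool = ExperiencePool()
pool.add(Experience("sort a list of integers", "use sorted(xs)", 0.9))
pool.add(Experience("sort a list of strings", "use sorted(xs, key=str.lower)", 0.3))
pool.add(Experience("parse a json file", "use json.load(f)", 0.8))

shots = pool.retrieve("sort a list of floats", k=1)
print(shots[0].output)
```

Filtering on reward before ranking by similarity keeps a closely matching but low-quality experience (the second entry here) from being replayed as an example.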

[204] ExpeTrans: LLMs Are Experiential Transfer Learners

Jinglong Gao,Xiao Ding,Lingxiao Zou,Bibo Cai,Bing Qin,Ting Liu

Main category: cs.CL

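TL;DR: Proposes an autonomous experience transfer framework in which large language models (LLMs) mimic human cognitive intelligence to transfer experience from existing tasks to new ones, cutting human labor and time costs while improving performance.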

Details Motivation: Existing methods need substantial human labor or time to gather task-solving experience, which is impractical given the variety of task types in user queries to LLMs. Method: Designs an autonomous experience transfer framework to explore whether LLMs can mimic human cognitive intelligence and transfer experience from existing tasks to new ones on their own. Result: Experiments on 13 datasets show that the framework effectively improves LLM performance. Conclusion: The framework lowers the cost of acquiring experience and offers a new path for LLM generalization; each module of the framework is also analyzed in detail. Abstract: Recent studies provide large language models (LLMs) with textual task-solving experiences via prompts to improve their performance. However, previous methods rely on substantial human labor or time to gather such experiences for each task, which is impractical given the growing variety of task types in user queries to LLMs. To address this issue, we design an autonomous experience transfer framework to explore whether LLMs can mimic human cognitive intelligence to autonomously transfer experience from existing source tasks to newly encountered target tasks. This not only allows the acquisition of experience without extensive costs of previous methods, but also offers a novel path for the generalization of LLMs. Experimental results on 13 datasets demonstrate that our framework effectively improves the performance of LLMs. Furthermore, we provide a detailed analysis of each module in the framework.

[205] MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration

Zhitao He,Sandeep Polisetty,Zhiyuan Fan,Yuchen Huang,Shujin Wu,Yi R. Fung

Main category: cs.CL

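TL;DR: MMBoundary calibrates confidence at each step of multimodal reasoning, improving the knowledge boundary awareness of multimodal large language models and reducing hallucination.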

Details Motivation: Multimodal large language models lack confidence estimates for individual steps in multi-level, multi-granular reasoning, which lets hallucinations snowball. Method: Proposes the MMBoundary framework, which combines textual and cross-modal self-rewarding signals to estimate confidence at each reasoning step, and optimizes with supervised fine-tuning followed by reinforcement learning. Result: MMBoundary significantly outperforms existing methods across datasets and metrics, with an average 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance. Conclusion: Fine-grained confidence calibration improves the accuracy and robustness of multimodal reasoning. Abstract: In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) advanced inferencing. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on this set of self-rewarded confidence estimation signals for initial confidence expression warm-up, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge and calibrating confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average of 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.
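
Calibration error has several definitions, and the abstract does not say which one is reported. As a reference point, the sketch below computes the widely used expected calibration error (ECE) over per-step confidences: steps are grouped into equal-width confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The step confidences and correctness labels are made up for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical confidences for five reasoning steps, and whether each step was right.
step_confidence = [0.95, 0.90, 0.65, 0.55, 0.30]
step_correct = [1, 1, 1, 0, 0]

ece = expected_calibration_error(step_confidence, step_correct)
print(round(ece, 3))
```

Scoring each reasoning step separately, rather than the final answer alone, is what lets an overconfident intermediate step show up in the metric at all.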

[206] MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration

Hao Lu,Yanchi Gu,Haoyuan Huang,Yulin Zhou,Ningxin Zhu,Chen Li

Main category: cs.CL

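TL;DR: Proposes MCTSr-Zero, a framework that combines MCTS with LLMs for open-ended dialogue tasks such as psychological counseling, improving dialogue quality through "domain alignment" and broader exploration mechanisms.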

Details Motivation: Existing MCTS methods underperform in open-ended dialogue because there is no clear correctness criterion; they must accommodate subjective factors such as empathy and ethics. Method: MCTSr-Zero introduces "domain alignment" to redirect the search objective, and uses "Regeneration" and "Meta-Prompt Adaptation" to broaden exploration. Result: PsyLLM, fine-tuned on dialogue data generated with MCTSr-Zero, achieves state-of-the-art performance on the PsyEval benchmark. Conclusion: MCTSr-Zero addresses the difficulty LLMs have in adhering to complex psychological standards and supplies high-quality data for open-ended dialogue tasks. Abstract: The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict "correctness" criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is "domain alignment", which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates "Regeneration" and "Meta-Prompt Adaptation" mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero's effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.

[207] ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering

Jingxuan Wei,Nan Xu,Junnan Zhu,Yanni Hao,Gaowei Wu,Bihui Yu,Lei Wang

Main category: cs.CL

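TL;DR: ChartMind is a new chart question answering (CQA) benchmark focused on complex tasks and multilingual settings; the accompanying model-agnostic framework, ChartLLM, significantly outperforms existing approaches.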

Details Motivation: Existing CQA evaluations rely too heavily on rigid output formats and objective metrics, ignoring the complex demands of real chart analysis. Method: Introduces the ChartMind benchmark and the ChartLLM framework, which extracts key context and reduces noise to improve the reasoning of multimodal models. Result: On ChartMind and three public benchmarks, ChartLLM significantly outperforms three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought. Conclusion: ChartMind and ChartLLM suggest new directions for building more robust chart reasoning. Abstract: Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.

[208] Automatic Construction of Multiple Classification Dimensions for Managing Approaches in Scientific Papers

Bing Ma,Hai Zhuge

Main category: cs.CL

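TL;DR: Proposes a multi-dimensional framework for managing approaches in scientific papers, using linguistic pattern recognition and a tree-structure similarity measure to support fast querying and classification.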

Details Motivation: Querying approaches in scientific papers is time-consuming and lacks organized management, so an efficient multi-dimensional management framework is needed. Method: Identifies approach patterns at four linguistic levels (semantic, discourse, syntactic, and lexical), extracts approaches, and categorizes them along five dimensions; represents steps as trees and measures the similarity of steps and approaches with syntax-based measures; builds class trees with a bottom-up clustering algorithm. Result: Builds a multi-dimensional approach space; queries on it return highly relevant results and narrow the search space quickly. Conclusion: The multi-dimensional approach space improves the efficiency and accuracy of approach queries. Abstract: Approaches form the foundation for conducting scientific research. Querying approaches from a vast body of scientific papers is extremely time-consuming, and without a well-organized management framework, researchers may face significant challenges in querying and utilizing relevant approaches. Constructing multiple dimensions on approaches and managing them from these dimensions can provide an efficient solution. Firstly, this paper identifies approach patterns in a top-down way, refining the patterns through four distinct linguistic levels: semantic level, discourse level, syntactic level, and lexical level. Approaches in scientific papers are extracted based on approach patterns. Additionally, five dimensions for categorizing approaches are identified using these patterns. This paper proposes using a tree structure to represent a step and measuring the similarity between different steps with a tree-structure-based similarity measure that focuses on syntactic-level similarities. A collection similarity measure is proposed to compute the similarity between approaches. A bottom-up clustering algorithm is proposed to construct class trees for approach components within each dimension by merging each approach component or class with its most similar approach component or class in each iteration. The class labels generated during the clustering process indicate the common semantics of the step components within the approach components in each class and are used to manage the approaches within the class. The class trees of the five dimensions collectively form a multi-dimensional approach space. The application of approach queries on the multi-dimensional approach space demonstrates that querying within this space ensures strong relevance between user queries and results and rapidly reduces search space through a class-based query mechanism.
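
The bottom-up construction of class trees can be sketched as a simple agglomerative loop. This is a simplified variant under stated assumptions: it merges the single most similar pair per iteration, uses Jaccard overlap of word sets as a stand-in for the paper's tree-structure-based similarity, and labels a merged class with the words its members share. The item texts, the similarity cutoff, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bottom_up_cluster(items, min_similarity=0.3):
    """Merge the most similar pair of clusters until none is similar enough.

    Returns a list of class trees: a leaf is a name, a class is a pair of subtrees.
    """
    clusters = [(name, set(text.lower().split())) for name, text in items.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < min_similarity:
            break
        # The merged class is labelled by the words its two members have in common.
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] & clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [tree for tree, _ in clusters]


# Hypothetical approach components.
items = {
    "A": "train a neural network classifier",
    "B": "train a neural network regressor",
    "C": "collect survey responses",
}
trees = bottom_up_cluster(items)
print(trees)
```

Keeping only the shared words as the class label mirrors the idea that a class label should express the common semantics of its members, which is what makes class-based narrowing of a query possible.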

[209] The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text

Maged S. Al-Shaibani,Moataz Ahmed

Main category: cs.CL

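TL;DR: A comprehensive study of Arabic machine-generated text across multiple generation strategies and model architectures that reveals detectable signatures in machine-generated text and develops effective BERT-based detection models.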

Details Motivation: Large language models (LLMs) excel at generating human-like text but threaten information integrity, especially in low-resource languages such as Arabic; this paper aims to fill that research gap. Method: Examines multiple generation strategies (title-only generation, content-aware generation, and text refinement) and model architectures (ALLaM, Jais, Llama, GPT-4), applies stylometric analysis, and develops BERT-based detection models. Result: Machine-generated Arabic text carries detectable signatures; the BERT-based models perform very well in formal contexts (up to 99.9% F1-score) but generalize poorly across domains. Conclusion: The study lays a foundation for robust Arabic detection systems and highlights the importance of linguistic features for protecting information integrity. Abstract: Large Language Models (LLMs) have achieved unprecedented capabilities in generating human-like text, posing subtle yet significant challenges for information integrity across critical domains, including education, social media, and academia, enabling sophisticated misinformation campaigns, compromising healthcare guidance, and facilitating targeted propaganda. This challenge becomes severe, particularly in under-explored and low-resource languages like Arabic. This paper presents a comprehensive investigation of Arabic machine-generated text, examining multiple generation strategies (generation from the title only, content-aware generation, and text refinement) across diverse model architectures (ALLaM, Jais, Llama, and GPT-4) in academic and social media domains. Our stylometric analysis reveals distinctive linguistic patterns differentiating human-written from machine-generated Arabic text across these varied contexts. Despite their human-like qualities, we demonstrate that LLMs produce detectable signatures in their Arabic outputs, with domain-specific characteristics that vary significantly between different contexts. Based on these insights, we developed BERT-based detection models that achieved exceptional performance in formal contexts (up to 99.9% F1-score) with strong precision across model architectures. Our cross-domain analysis confirms generalization challenges previously reported in the literature. To the best of our knowledge, this work represents the most comprehensive investigation of Arabic machine-generated text to date, uniquely combining multiple prompt generation methods, diverse model architectures, and in-depth stylometric analysis across varied textual domains, establishing a foundation for developing robust, linguistically-informed detection systems essential for preserving information integrity in Arabic-language contexts.

[210] Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective

Yong Zhang,Yanwen Huang,Ning Cheng,Yang Guo,Yun Zhu,Yanmeng Wang,Shaojun Wang,Jing Xiao

Main category: cs.CL

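TL;DR: Sentinel is a lightweight sentence-level compression framework that uses decoder attention signals from an off-the-shelf small LLM to achieve efficient, low-cost, question-aware context compression.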

Details Motivation: In retrieval-augmented generation (RAG), retrieved passages are often lengthy, noisy, or over the input limit, while conventional compression methods require training dedicated models, which is costly and hurts portability. Method: Sentinel probes the decoder attention of a small proxy LLM with a lightweight classifier to identify sentence relevance, without training a dedicated compression model. Result: On LongBench, Sentinel achieves up to 5× compression while matching the QA performance of 7B-scale compression systems. Conclusion: Probing native attention signals enables fast, effective, question-aware context compression. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5× compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: https://github.com/yzhangchuck/Sentinel.
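
Once every sentence has a relevance score, sentence-level compression reduces to a selection problem. The sketch below keeps the highest-scoring sentences that fit a word budget and returns them in their original order. The scores here are invented; in Sentinel they would come from a classifier over the proxy model's attention, which this sketch does not model. The function name `compress_context` and the `keep_ratio` parameter are hypothetical (a ratio of 0.2 would correspond to 5× compression).

```python
def compress_context(sentences, relevance, keep_ratio=0.2):
    """Keep the most relevant sentences within a word budget, in original order."""
    budget = keep_ratio * sum(len(s.split()) for s in sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: relevance[i], reverse=True)

    kept, used = [], 0
    for i in ranked:
        cost = len(sentences[i].split())
        if used + cost <= budget:
            kept.append(i)
            used += cost
    return [sentences[i] for i in sorted(kept)]


# Hypothetical retrieved context with made-up relevance scores for one query.
sentences = [
    "Paris is the capital of France.",
    "The city hosts many museums.",
    "France uses the euro.",
    "Weather was mild last week.",
]
relevance = [0.9, 0.2, 0.6, 0.1]

compressed = compress_context(sentences, relevance, keep_ratio=0.5)
print(compressed)
```

Restoring the original order after selection keeps the compressed context readable for the downstream model, since relevance ranking alone would scramble the passage.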

[211] ScEdit: Script-based Assessment of Knowledge Editing

Xinye Li,Zunwen Zheng,Qian Zhang,Dekai Zhuang,Jiabao Kang,Liyan Xu,Qingbin Liu,Xi Chen,Zhiying Tu,Dianhui Chu,Dianbo Sui

Main category: cs.CL

TL;DR: Introduces ScEdit, a script-based knowledge editing benchmark for evaluating knowledge editing methods more comprehensively in realistic scenarios.

Details Motivation: Current knowledge editing tasks are too simple and poorly matched to real-world applications, so a more comprehensive evaluation framework is needed. Method: Introduces the ScEdit benchmark, which covers counterfactual and temporal edits and integrates token-level and text-level evaluation. Result: All knowledge editing methods drop in performance on established metrics and struggle on text-level metrics. Conclusion: ScEdit offers a more comprehensive evaluation tool for knowledge editing and exposes the limitations of existing methods. Abstract: Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark -- ScEdit (Script-based Knowledge Editing Benchmark) -- which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based ("What"-type question) evaluation to action-based ("How"-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at https://github.com/asdfo123/ScEdit.

[212] How Does Response Length Affect Long-Form Factuality

James Xu Zhao,Jimmy Z. J. Liu,Bryan Hooi,See-Kiong Ng

Main category: cs.CL

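TL;DR: Studies the factual accuracy of long-form text generated by large language models (LLMs) and finds that longer responses have lower factual precision, mainly because of facts exhaustion.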

Details Motivation: LLMs are widely used for long-form generation, but factual errors undermine their reliability, and the effect of response length on factuality remains underexplored. Method: Introduces an automatic, bi-level long-form factuality evaluation framework and runs controlled experiments that confirm a length bias. Result: Longer responses have lower factual precision, primarily because of facts exhaustion rather than error propagation or long context. Conclusion: Facts exhaustion is the main cause of factual degradation, which points to directions for improving LLM factuality. Abstract: Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.

[213] EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian

Daryna Dementieva,Nikolay Babakov,Alexander Fraser

Main category: cs.CL

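TL;DR: Introduces EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian, evaluates a range of approaches on it, and highlights the challenges of emotion classification in Ukrainian.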

Details Motivation: Emotion classification in Ukrainian is underexplored and has no publicly available benchmark. Method: Builds a high-quality annotated dataset through crowdsourcing on the Toloka.ai platform and evaluates linguistic-based baselines, synthetic data, and large language models. Result: Emotion classification in non-mainstream languages such as Ukrainian remains challenging, and more Ukrainian-specific models and resources are needed. Conclusion: EmoBench-UA fills a gap in Ukrainian emotion classification and calls for further resource development. Abstract: While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts. Our annotation schema is adapted from the guidelines of previous English-centric work on emotion detection (Mohammad et al., 2018; Mohammad, 2022). The dataset was created through crowdsourcing using the Toloka.ai platform, ensuring a high-quality annotation process. Then, we evaluate a range of approaches on the collected dataset, from linguistic-based baselines and synthetic data translated from English to large language models (LLMs). Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and emphasize the need for further development of Ukrainian-specific models and training resources.

[214] Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs

Julia Belikova,Konstantin Polev,Rauf Parchiev,Dmitry Simakov

Main category: cs.CL

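TL;DR: Explores how combining efficient classification algorithms with dimensionality reduction can cut the training data needed for hallucination detection in large language model (LLM) and retrieval-augmented generation (RAG) systems while maintaining performance.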

Details Motivation: The reliability of LLM and RAG systems in industry is limited by the difficulty of detecting hallucinations, and existing methods depend on large annotated datasets, which limits scalability. Method: Combines efficient classification algorithms with dimensionality reduction to lower training data requirements, applied to two SOTA frameworks: Lookback Lens and probing-based approaches. Result: On standardized question-answering RAG benchmarks, the approach matches strong baselines with only 250 training samples. Conclusion: Lightweight, data-efficient methods show promise for industrial deployment, especially in annotation-constrained scenarios. Abstract: Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly deployed in industry applications, yet their reliability remains hampered by challenges in detecting hallucinations. While supervised state-of-the-art (SOTA) methods that leverage LLM hidden states -- such as activation tracing and representation analysis -- show promise, their dependence on extensively annotated datasets limits scalability in real-world applications. This paper addresses the critical bottleneck of data annotation by investigating the feasibility of reducing training data requirements for two SOTA hallucination detection frameworks: Lookback Lens, which analyzes attention head dynamics, and probing-based approaches, which decode internal model representations. We propose a methodology combining efficient classification algorithms with dimensionality reduction techniques to minimize sample size demands while maintaining competitive performance. Evaluations on standardized question-answering RAG benchmarks show that our approach achieves performance comparable to strong proprietary LLM-based baselines with only 250 training samples. These results highlight the potential of lightweight, data-efficient paradigms for industrial deployment, particularly in annotation-constrained scenarios.

[215] Generalized Category Discovery in Event-Centric Contexts: Latent Pattern Mining with LLMs

Yi Luo,Qiwen Wang,Junqi Yang,Luyao Tang,Zhenghao Lin,Zhenzhe Ying,Weiqiang Wang,Chen Lin

Main category: cs.CL

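TL;DR: The paper introduces the EC-GCD task, characterized by long texts and class imbalance, and proposes the PaMA framework, which uses LLMs to improve cluster-class alignment and performs strongly on a new dataset.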

Details Motivation: Existing textual GCD methods are insufficiently validated in realistic settings and perform poorly on long texts and imbalanced classes. Method: The PaMA framework uses LLMs to extract event patterns and improve cluster-class alignment, and a ranking-filtering-mining pipeline balances prototype representation. Result: On EC-GCD benchmarks, PaMA outperforms existing methods with H-score gains of up to 12.58%, and it generalizes well to base GCD datasets. Conclusion: PaMA effectively addresses cluster-class alignment and class imbalance in EC-GCD and has practical potential. Abstract: Generalized Category Discovery (GCD) aims to classify both known and novel categories using partially labeled data that contains only known classes. Despite achieving strong performance on existing benchmarks, current textual GCD methods lack sufficient validation in realistic settings. We introduce Event-Centric GCD (EC-GCD), characterized by long, complex narratives and highly imbalanced class distributions, posing two main challenges: (1) divergent clustering versus classification groupings caused by subjective criteria, and (2) unfair alignment for minority classes. To tackle these, we propose PaMA, a framework leveraging LLMs to extract and refine event patterns for improved cluster-class alignment. Additionally, a ranking-filtering-mining pipeline ensures balanced representation of prototypes across imbalanced categories. Evaluations on two EC-GCD benchmarks, including a newly constructed Scam Report dataset, demonstrate that PaMA outperforms prior methods with up to 12.58% H-score gains, while maintaining strong generalization on base GCD datasets.

[216] Enhancing Marker Scoring Accuracy through Ordinal Confidence Modelling in Educational Assessments

Abhirup Chakravarty,Mark Brenchley,Trevor Breakspear,Ian Lewin,Yan Huang

Main category: cs.CL

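TL;DR: The study frames confidence estimation as a classification task and introduces new KWOCCE loss functions, substantially improving the reliability of Automated Essay Scoring (AES) and allowing 47% of scores to be released with 100% CEFR agreement.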

Details Motivation: Address insufficient score reliability in Automated Essay Scoring (AES) so that scores are released only when they meet high reliability standards. Method: Confidence estimation is treated as a classification task, reformulated as an n-ary classification problem via score binning, with KWOCCE loss functions that exploit the ordinal structure of CEFR labels. Result: The best model reaches an F1 score of 0.97; 47% of scores are released with 100% CEFR agreement and 99% with at least 95% agreement, compared with about 92% for the standalone AES model. Conclusion: Confidence modelling with the KWOCCE loss substantially improves AES reliability and offers an effective way to release high-confidence scores. Abstract: A key ethical challenge in Automated Essay Scoring (AES) is ensuring that scores are only released when they meet high reliability standards. Confidence modelling addresses this by assigning a reliability estimate measure, in the form of a confidence score, to each automated score. In this study, we frame confidence estimation as a classification task: predicting whether an AES-generated score correctly places a candidate in the appropriate CEFR level. While this is a binary decision, we leverage the inherent granularity of the scoring domain in two ways. First, we reformulate the task as an n-ary classification problem using score binning. Second, we introduce a set of novel Kernel Weighted Ordinal Categorical Cross Entropy (KWOCCE) loss functions that incorporate the ordinal structure of CEFR labels. Our best-performing model achieves an F1 score of 0.97, and enables the system to release 47% of scores with 100% CEFR agreement and 99% with at least 95% CEFR agreement, compared to approximately 92% CEFR agreement from the standalone AES model where we release all AM predicted scores.
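
The release rule described in this abstract (publish a score only when its confidence is high enough, then report how many scores were released and how often they agree with the true CEFR level) can be illustrated with a minimal sketch. The function name, the threshold, and the toy data are assumptions made for illustration; this is not the authors' model or their KWOCCE loss.

```python
def selective_release(confidences, correct, threshold):
    """Release only scores whose confidence meets the threshold.

    Returns (coverage, agreement): the fraction of all scores released, and
    the fraction of released scores that placed the candidate in the right
    level. All names here are hypothetical.
    """
    released = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(released) / len(confidences) if confidences else 0.0
    agreement = sum(released) / len(released) if released else float("nan")
    return coverage, agreement


# Toy data: per-score confidence and whether the automated level was correct.
confidences = [0.99, 0.95, 0.80, 0.60, 0.40]
correct = [True, True, True, False, True]

print(selective_release(confidences, correct, threshold=0.9))  # strict release
print(selective_release(confidences, correct, threshold=0.0))  # release everything
```

Raising the threshold trades coverage for agreement, which is the trade-off behind the paired figures quoted in the abstract.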

[217] Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO

Kaiyang Guo,Yinchuan Li,Zhitang Chen

Main category: cs.CL

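TL;DR: The paper examines the limitations of direct alignment methods such as DPO for optimizing large language models and proposes PRO, which resolves the likelihood underdetermination problem and performs well across multiple feedback types.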

Details Motivation: Direct alignment methods such as DPO can optimize large language models, but they cause generations to deviate from expected patterns, a problem of likelihood underdetermination. Method: The DPO loss is reformulated in decomposed form, and PRO is proposed, which uses the complete regularizer to resolve likelihood underdetermination. Result: PRO outperforms existing methods in pairwise, binary, and scalar feedback scenarios. Conclusion: PRO effectively resolves likelihood underdetermination in direct alignment and applies to multiple feedback types. Abstract: Direct alignment methods typically optimize large language models (LLMs) by contrasting the likelihoods of preferred versus dispreferred responses. While effective in steering LLMs to match relative preference, these methods are frequently noted for decreasing the absolute likelihoods of example responses. As a result, aligned models tend to generate outputs that deviate from the expected patterns, exhibiting a reward-hacking effect even without a reward model. This undesired consequence exposes a fundamental limitation in contrastive alignment, which we characterize as likelihood underdetermination. In this work, we revisit direct preference optimization (DPO) -- the seminal direct alignment method -- and demonstrate that its loss theoretically admits a decomposed reformulation. The reformulated loss not only broadens applicability to a wider range of feedback types, but also provides novel insights into the underlying cause of likelihood underdetermination. Specifically, the standard DPO implementation implicitly oversimplifies a regularizer in the reformulated loss, and reinstating its complete version effectively resolves the underdetermination issue. Leveraging these findings, we introduce PRoximalized PReference Optimization (PRO), a unified method to align with diverse feedback types, eliminating likelihood underdetermination through an efficient approximation of the complete regularizer. Comprehensive experiments show the superiority of PRO over existing methods in scenarios involving pairwise, binary and scalar feedback.
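
To make the underdetermination point concrete, here is a minimal sketch of the standard per-pair DPO loss (not PRO, whose regularizer is not specified in this abstract). The function name, the `beta` default, and the log-probability values are assumptions chosen for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    The margin contrasts the policy-vs-reference log-ratio of the preferred
    response with that of the dispreferred response.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Same relative preference, very different absolute likelihoods:
high = dpo_loss(-10.0, -12.0, ref_logp_chosen=-10.0, ref_logp_rejected=-10.0)
low = dpo_loss(-50.0, -52.0, ref_logp_chosen=-10.0, ref_logp_rejected=-10.0)
print(high, low)
```

Both calls return the same loss because only the difference between the two log-ratios enters the margin, so the loss alone cannot tell apart a policy that keeps the preferred response likely from one that has pushed both responses far down.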

[218] Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors

Harish Tayyar Madabushi,Melissa Torgbi,Claire Bonial

Main category: cs.CL

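TL;DR: The paper takes a middle-ground position: LLMs draw on their training data through context-directed extrapolation, so they are neither mere stochastic parrots nor in possession of unpredictably "emergent" advanced reasoning.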

Details Motivation: Counter extreme views of LLM capabilities, which treat LLMs either as pure stochastic parrots or as a threat with uncontrollable advanced reasoning. Method: The paper proposes "context-directed extrapolation" as the underlying mechanism and, drawing on existing literature, argues that LLM capabilities are predictable and controllable. Result: LLM capabilities go beyond stochastic parroting but do not amount to human-like high-level cognition, and they do not scale indefinitely with additional training. Conclusion: Future research should focus on context-directed extrapolation and its interaction with training data, and explore augmentation techniques that do not rely on inherent advanced reasoning in LLMs. Abstract: In this position paper we raise critical awareness of a realistic view of LLM capabilities that eschews extreme alternative views that LLMs are either "stochastic parrots" or in possession of "emergent" advanced reasoning capabilities, which, due to their unpredictable emergence, constitute an existential threat. Our middle-ground view is that LLMs extrapolate from priors from their training data, and that a mechanism akin to in-context learning enables the targeting of the appropriate information from which to extrapolate. We call this "context-directed extrapolation." Under this view, substantiated through existing literature, while reasoning capabilities go well beyond stochastic parroting, such capabilities are predictable, controllable, not indicative of advanced reasoning akin to high-level cognitive capabilities in humans, and not infinitely scalable with additional training. As a result, fears of uncontrollable emergence of agency are allayed, while research advances are appropriately refocused on the processes of context-directed extrapolation and how this interacts with training data to produce valuable capabilities in LLMs. Future work can therefore explore alternative augmenting techniques that do not rely on inherent advanced reasoning in LLMs.

[219] Discriminative Policy Optimization for Token-Level Reward Models

Hongzhan Chen,Tao Yang,Shiping Gao,Ruijun Chen,Xiaojun Quan,Hongtao Tian,Ting Yao

Main category: cs.CL

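TL;DR: Q-RM decouples reward modeling from language generation and optimizes a discriminative policy, markedly improving LLM performance and training efficiency on complex reasoning tasks.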

Details Motivation: Conventional process reward models (PRMs) are unstable when assigning fine-grained rewards; Q-RM aims to solve this. Method: Q-RM learns fine-grained rewards from preference data by optimizing a discriminative policy (a Q-function), without relying on a generative model. Result: Q-RM clearly outperforms baseline methods on mathematical reasoning tasks and substantially improves training efficiency, converging up to 12 times faster. Conclusion: Q-RM offers an efficient and stable approach to fine-grained reward modeling for complex reasoning tasks. Abstract: Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM.

[220] Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation

Beiduo Chen,Yang Janet Liu,Anna Korhonen,Barbara Plank

Main category: cs.CL

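TL;DR: The paper proposes an LLM-based pipeline that uses linguistically grounded discourse segmenters to extract supporting and opposing statements from CoTs, together with a rank-based HLV evaluation framework, and it outperforms direct generation.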

Details Motivation: Study how LLM-generated CoTs can improve our understanding of human label variation, moving beyond the reverse paradigm of existing methods, which generate explanations from given answers. Method: An LLM pipeline combined with discourse segmenters extracts supporting and opposing statements from CoTs, and a rank-based HLV evaluation framework is designed. Result: The method outperforms direct generation and baselines on three datasets, and ranking-based methods align better with human annotations. Conclusion: The approach improves LLM performance in human label variation research and demonstrates the potential of CoTs. Abstract: The recent rise of reasoning-tuned Large Language Models (LLMs) -- which generate chains of thought (CoTs) before giving the final answer -- has attracted significant attention and offers new opportunities for gaining insights into human label variation, which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but typically adopts a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.

[221] Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models

Mingyu Yu,Wei Wang,Yanjie Wei,Sujuan Qin

Main category: cs.CL

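TL;DR: The paper studies jailbreak attacks on large language models (LLMs) and proposes an adaptive jailbreaking framework based on semantic understanding capabilities, which markedly raises attack success rates.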

Details Motivation: Jailbreak attacks bypass the safety and ethical constraints of LLMs and have become a key challenge in AI security; the paper explores attack strategies adapted to the comprehension abilities of different LLMs. Method: A classification framework divides LLMs into Type I and Type II according to their semantic comprehension abilities, and tailored jailbreaking strategies are designed for each type. Result: Experiments show that the adaptive strategy markedly improves jailbreak success rates, reaching 98.9% on GPT-4o (29 May 2025 release). Conclusion: The study offers a new perspective on LLM security vulnerabilities, and the adaptive strategy is highly effective as an attack. Abstract: Adversarial attacks on Large Language Models (LLMs) via jailbreaking techniques -- methods that circumvent their built-in safety and ethical constraints -- have emerged as a critical challenge in AI security. These attacks compromise the reliability of LLMs by exploiting inherent weaknesses in their comprehension capabilities. This paper investigates the efficacy of jailbreaking strategies that are specifically adapted to the diverse levels of understanding exhibited by different LLMs. We propose the Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models, a novel framework that classifies LLMs into Type I and Type II categories according to their semantic comprehension abilities. For each category, we design tailored jailbreaking strategies aimed at leveraging their vulnerabilities to facilitate successful attacks. Extensive experiments conducted on multiple LLMs demonstrate that our adaptive strategy markedly improves the success rate of jailbreaking. Notably, our approach achieves an exceptional 98.9% success rate in jailbreaking GPT-4o (29 May 2025 release).

[222] From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs

Xuan Gong,Hanbo Huang,Shiyu Liang

Main category: cs.CL

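TL;DR: The paper studies how supervised fine-tuning data affects the factuality of large language models and finds that inference-time prompts (such as few-shot learning and chain of thought) can compensate for shortcomings in the fine-tuning data.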

Details Motivation: Understand the mechanism by which fine-tuning data affects model factuality, especially the gap between fine-tuning on known and unknown knowledge. Method: Systematic experiments and a theoretical analysis based on knowledge graphs examine the interaction between fine-tuning data and inference-time prompts. Result: Inference-time prompts (such as few-shot learning and chain of thought) can substantially reduce the factuality gap and can even dominate knowledge extraction. Conclusion: Inference-time prompting can effectively compensate for shortcomings in fine-tuning data, so its use as a means of evaluating fine-tuning data selection methods needs to be reconsidered. Abstract: Factual knowledge extraction aims to explicitly extract knowledge parameterized in pre-trained language models for application in downstream tasks. While prior work has been investigating the impact of supervised fine-tuning data on the factuality of large language models (LLMs), its mechanism remains poorly understood. We revisit this impact through systematic experiments, with a particular focus on the factuality gap that arises when fine-tuning on known versus unknown knowledge. Our findings show that this gap can be mitigated at the inference stage, either under out-of-distribution (OOD) settings or by using appropriate in-context learning (ICL) prompts (i.e., few-shot learning and Chain of Thought (CoT)). We prove this phenomenon theoretically from the perspective of knowledge graphs, showing that the test-time prompt may diminish or even overshadow the impact of fine-tuning data and play a dominant role in knowledge extraction. Ultimately, our results shed light on the interaction between fine-tuning data and test-time prompt, demonstrating that ICL can effectively compensate for shortcomings in fine-tuning data, and highlighting the need to reconsider the use of ICL prompting as a means to evaluate the effectiveness of fine-tuning data selection methods.

[223] The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence

Marco Gaido,Sara Papi,Luisa Bentivogli,Alessio Brutti,Mauro Cettolo,Roberto Gretter,Marco Matassoni,Mohamed Nabih,Matteo Negri

Main category: cs.CL

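TL;DR: The paper examines how learning rate (LR) warmup strategies affect large-scale speech-to-text (S2T) training, finding that such training calls for a sub-exponential warmup and that a higher LR during warmup speeds up initial convergence without improving final performance.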

Details Motivation: In large-scale S2T training, the established LR schedule (a double linear warmup) has not been compared with alternatives, and the effect of warmup schedules on final performance has not been studied in depth. Method: Different LR warmup strategies, including double linear and sub-exponential warmup, are compared in large-scale S2T training. Result: Sub-exponential warmup suits large-scale S2T training; a higher LR during warmup speeds up initial convergence but gives no notable gain in final performance. Conclusion: Large-scale S2T training should use a sub-exponential LR warmup. Abstract: Training large-scale models presents challenges not only in terms of resource requirements but also in terms of their convergence. For this reason, the learning rate (LR) is often decreased when the size of a model is increased. Such a simple solution is not enough in the case of speech-to-text (S2T) trainings, where evolved and more complex variants of the Transformer architecture -- e.g., Conformer or Branchformer -- are used in light of their better performance. As a workaround, OWSM designed a double linear warmup of the LR, increasing it to a very small value in the first phase before updating it to a higher value in the second phase. While this solution worked well in practice, it was not compared with alternative solutions, nor was the impact on the final performance of different LR warmup schedules studied. This paper fills this gap, revealing that i) large-scale S2T trainings demand a sub-exponential LR warmup, and ii) a higher LR in the warmup phase accelerates initial convergence, but it does not boost final performance.
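
The double linear warmup mentioned in this abstract (a first linear ramp to a very small LR, then a second linear ramp to a higher one) could be sketched as below. The function name, the phase lengths, the LR values, and holding the LR constant after warmup are all assumptions for illustration; the abstract gives neither OWSM's actual settings nor the form of the sub-exponential schedule.

```python
def double_linear_warmup(step, low_lr, peak_lr, phase1_steps, phase2_steps):
    """Piecewise-linear warmup with two phases (hypothetical parameters).

    Phase 1 ramps from 0 to low_lr; phase 2 ramps from low_lr to peak_lr.
    After warmup the LR is held at peak_lr as a placeholder for whatever
    decay schedule a real training run would use.
    """
    if step < phase1_steps:
        return low_lr * step / phase1_steps
    if step < phase1_steps + phase2_steps:
        frac = (step - phase1_steps) / phase2_steps
        return low_lr + (peak_lr - low_lr) * frac
    return peak_lr


# Print the LR at a few steps of an illustrative 200-step warmup.
for step in (0, 50, 100, 150, 200):
    print(step, double_linear_warmup(step, low_lr=1e-5, peak_lr=1e-3,
                                     phase1_steps=100, phase2_steps=100))
```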

[224] UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions

Chuanyuan Tan,Wenbiao Shao,Hao Xiong,Tong Zhu,Zhenhua Liu,Kai Shi,Wenliang Chen

Main category: cs.CL

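TL;DR: The paper introduces UAQFact, a new dataset for evaluating how well large language models (LLMs) use factual knowledge when handling unanswerable questions (UAQ); experiments show that LLMs perform poorly on this task.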

Details Motivation: Existing datasets lack factual knowledge support and so cannot evaluate how LLMs use factual knowledge when handling UAQ. Method: The bilingual dataset UAQFact is built with auxiliary factual knowledge from a knowledge graph, and two new tasks are defined to evaluate LLMs' use of internal and external knowledge respectively. Result: LLMs perform poorly on UAQFact and fail to make full use of factual knowledge even when they have it; external knowledge improves performance but results remain unsatisfactory. Conclusion: UAQFact provides a new benchmark for evaluating how LLMs use factual knowledge on UAQ and exposes their limitations in this area. Abstract: Handling unanswerable questions (UAQ) is crucial for LLMs, as it helps prevent misleading responses in complex situations. While previous studies have built several datasets to assess LLMs' performance on UAQ, these datasets lack factual knowledge support, which limits the evaluation of LLMs' ability to utilize their factual knowledge when handling UAQ. To address the limitation, we introduce a new unanswerable question dataset UAQFact, a bilingual dataset with auxiliary factual knowledge created from a Knowledge Graph. Based on UAQFact, we further define two new tasks to measure LLMs' ability to utilize internal and external factual knowledge, respectively. Our experimental results across multiple LLM series show that UAQFact presents significant challenges, as LLMs do not consistently perform well even when they have factual knowledge stored. Additionally, we find that incorporating external knowledge may enhance performance, but LLMs still cannot make full use of the knowledge, which may result in incorrect responses.

[225] Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons

Krithik Vishwanath,Anton Alyakin,Mrigayu Ghosh,Jin Vivian Lee,Daniel Alexander Alber,Karl L. Sangwon,Douglas Kondziolka,Eric Karl Oermann

Main category: cs.CL

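TL;DR: The study evaluates 28 large language models on neurosurgery exam questions and their fragility to distracting information, finding that some models pass the exam but that performance is easily degraded by distractors.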

Details Motivation: Assess LLM performance on neurosurgical knowledge tests and their robustness to distracting information. Method: 28 models are tested on 2,904 neurosurgery exam questions, and a distraction framework is introduced to evaluate model performance. Result: Six models pass the exam, but distractions reduce accuracy substantially, by as much as 20.4%; open-source models are affected more than proprietary ones. Conclusion: LLMs perform well on neurosurgery exams but are sensitive to distracting information, so strategies for improving robustness are needed. Abstract: The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in non-clinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. Six of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced -- by as much as 20.4% -- with one model failing that had previously passed. Both general-purpose and medical open-source models experienced greater performance declines compared to proprietary variants when subjected to the added distractors. While current LLMs demonstrate an impressive ability to answer neurosurgery board-like exam questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.

[226] Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt

Keqin Peng,Liang Ding,Yuanxin Ouyang,Meng Fang,Dacheng Tao

Main category: cs.CL

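TL;DR: The paper presents a quantitative analysis of "overthinking" in the long chain-of-thought reasoning of reasoning large language models (RLLMs) and shows that a prompting method that reduces self-doubt markedly improves model performance.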

Details Motivation: Prior work analyzes overthinking in long chain-of-thought reasoning mostly qualitatively; this paper analyzes it quantitatively from the perspective of self-doubt and proposes a remedy. Method: A simple prompting method first asks the model to question the validity of the input question and then to answer concisely based on that assessment; experiments cover three mathematical reasoning tasks and four datasets with missing premises. Result: The method substantially shortens answers and improves four widely used RLLMs on nearly all datasets, while reducing the number of reasoning steps and self-doubt. Conclusion: Reducing self-doubt effectively curbs overthinking and improves model efficiency and performance. Abstract: Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks, largely due to the adoption of Long Chain-of-Thought (Long CoT) reasoning. However, they often exhibit overthinking -- performing unnecessary reasoning steps even after arriving at the correct answer. Prior work has largely focused on qualitative analyses of overthinking through sample-based observations of long CoTs. In contrast, we present a quantitative analysis of overthinking from the perspective of self-doubt, characterized by excessive token usage devoted to re-verifying already-correct answers. We find that self-doubt significantly contributes to overthinking. In response, we introduce a simple and effective prompting method to reduce the model's over-reliance on input questions, thereby avoiding self-doubt. Specifically, we first prompt the model to question the validity of the input question, and then respond concisely based on the outcome of that evaluation. Experiments on three mathematical reasoning tasks and four datasets with missing premises demonstrate that our method substantially reduces answer length and yields significant improvements across nearly all datasets upon 4 widely-used RLLMs. Further analysis demonstrates that our method effectively minimizes the number of reasoning steps and reduces self-doubt.

[227] Spoken Language Modeling with Duration-Penalized Self-Supervised Units

Nicol Visser,Herman Kamper

Main category: cs.CL

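TL;DR: The paper studies the interaction between codebook size and unit coarseness (duration) in spoken language models (SLMs), finds that the effect of coarseness differs across tasks, and shows that DPDP is a simple and efficient way to obtain coarser units.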

Details Motivation: Explore how codebook size and unit coarseness affect SLM performance, an interaction not previously studied. Method: The duration-penalized dynamic programming (DPDP) method is used to vary codebook size and unit coarseness, with analyses at different linguistic levels. Result: At the phone and word levels, coarseness makes little difference; in sentence resynthesis, coarser units perform better; in lexical and syntactic tasks, coarser units give higher accuracy at lower bitrates. Conclusion: Coarser units are not always better, but DPDP is a simple and efficient way to obtain them where they help. Abstract: Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.
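
The idea behind duration-penalized segmentation can be sketched as a small dynamic program: choose segment boundaries and a code per segment so that quantization error plus a per-segment penalty is minimal, so a larger penalty yields fewer, longer (coarser) units. This is a simplified sketch under stated assumptions: scalar features instead of self-supervised feature vectors, a constant penalty per segment, and hypothetical names throughout. It is not the authors' implementation.

```python
def dpdp_segment(features, codebook, penalty):
    """Segment scalar features into code-labelled units by dynamic programming.

    Cost of a segment = summed squared distance of its frames to one code
    + `penalty` (an assumed constant per-segment penalty).
    Returns a list of (start, end, code_index) with end exclusive.
    """
    n = len(features)
    # prefix[k][t]: summed squared distance of features[:t] to codebook[k].
    prefix = [[0.0] * (n + 1) for _ in codebook]
    for k, code in enumerate(codebook):
        for t, x in enumerate(features):
            prefix[k][t + 1] = prefix[k][t] + (x - code) ** 2

    best = [0.0] + [float("inf")] * n   # best[t]: minimal cost of features[:t]
    back = [None] * (n + 1)             # back[t]: (segment start, code index)
    for end in range(1, n + 1):
        for start in range(end):
            for k in range(len(codebook)):
                cost = best[start] + prefix[k][end] - prefix[k][start] + penalty
                if cost < best[end]:
                    best[end] = cost
                    back[end] = (start, k)

    # Trace back from the end to recover the chosen segments.
    units = []
    end = n
    while end > 0:
        start, k = back[end]
        units.append((start, end, k))
        end = start
    return units[::-1]


features = [0.1, 0.0, 0.9, 0.2, 1.0, 1.1]
codebook = [0.0, 1.0]
print(dpdp_segment(features, codebook, penalty=0.0))  # follows nearest codes
print(dpdp_segment(features, codebook, penalty=1.0))  # fewer, longer units
```

With no penalty the segmentation follows the nearest code frame by frame; with a penalty, brief excursions are absorbed into neighbouring units, which is the coarseness control the abstract refers to.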

[228] Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Liangliang Zhang,Zhuorui Jiang,Hongliang Chi,Haoyang Chen,Mohammed Elkoumy,Fali Wang,Qiong Wu,Zhengyi Zhou,Shirui Pan,Suhang Wang,Yao Ma

Main category: cs.CL

TL;DR: KGQAGen is an LLM-based framework that addresses quality problems in KGQA datasets and is used to build the high-quality KGQAGen-10k benchmark.

Details Motivation: Existing KGQA datasets such as WebQSP and CWQ suffer from inaccurate annotations and ambiguous or unanswerable questions, with an average factual correctness rate of only 57%. Method: KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce verifiable QA instances. Result: Experiments show that even SOTA models struggle on the KGQAGen-10k benchmark, underlining how challenging it is. Conclusion: KGQAGen provides a scalable framework for KGQA evaluation and calls for more rigorous benchmark construction. Abstract: Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.

[229] CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification

Nawar Turk,Eeham Khan,Leila Kosseim

Main category: cs.CL

TL;DR: The paper presents three model architectures for SemEval-2025 Task 6 (PromiseEval), which verifies promises in corporate ESG reports; the combined-subtask model performs best.

Details Motivation: Verify promises in corporate ESG reports across four subtasks: promise identification, evidence assessment, clarity evaluation, and verification timing. Method: (1) an ESG-BERT model; (2) an enhanced model with linguistic features; (3) a combined-subtask model with attention pooling. Result: The combined-subtask model scores 0.5268 on the ML-Promise dataset, above the baseline of 0.5227. Conclusion: Linguistic feature extraction, attention pooling, and multi-objective learning are effective for promise verification, though class imbalance and limited training data remain challenges. Abstract: This paper presents our approach to the SemEval-2025 Task 6 (PromiseEval), which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. We explore three model architectures to address the four subtasks of promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Our first model utilizes ESG-BERT with task-specific classifier heads, while our second model enhances this architecture with linguistic features tailored for each subtask. Our third approach implements a combined subtask model with attention-based sequence pooling, transformer representations augmented with document metadata, and multi-objective learning. Experiments on the English portion of the ML-Promise dataset demonstrate progressive improvement across our models, with our combined subtask approach achieving a leaderboard score of 0.5268, outperforming the provided baseline of 0.5227. Our work highlights the effectiveness of linguistic feature extraction, attention pooling, and multi-objective learning in promise verification tasks, despite challenges posed by class imbalance and limited training data.

[230] Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

Yunqiao Yang,Houxing Ren,Zimu Lu,Ke Wang,Weikang Shi,Aojun Zhou,Junting Pan,Mingjie Zhan,Hongsheng Li

Main category: cs.CL

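TL;DR: The paper proposes PCPO, a new framework that improves the mathematical reasoning of LLMs using two quantitative metrics, answer correctness and intrinsic probability consistency, and outperforms existing methods.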

Details Motivation: Existing preference optimization methods consider only answer correctness or consistency and neglect internal logical coherence. Method: The PCPO framework combines surface-level answer correctness with intrinsic probability consistency as dual metrics. Result: PCPO outperforms existing methods across a range of LLMs and benchmarks. Conclusion: With its dual metrics, PCPO markedly improves the mathematical reasoning ability of LLMs. Abstract: Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.

[231] Translation in the Wild

Yuri Balashov

Main category: cs.CL

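TL;DR: LLMs show strong translation abilities, but where these abilities come from remains unclear; they may be linked to "incidental bilingualism" in the training data and to instruction tuning.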

Details Motivation: Ask why LLMs translate so well without dedicated translation training, and analyze the possible mechanisms. Method: Drawing on recent studies and user experience, the paper puts forward a "duality" hypothesis: LLM translation abilities originate in two different types of pre-training data. Result: LLM translation abilities may be related to multilingual content in the training data and to instruction tuning; the exact mechanism needs further testing. Conclusion: LLM translation abilities may stem from the diversity of pre-training data, which has implications for reconceptualizing translation, human and machine. Abstract: Large Language Models (LLMs) excel in translation among other things, demonstrating competitive performance for many language pairs in zero- and few-shot settings. But unlike dedicated neural machine translation models, LLMs are not trained on any translation-related objective. What explains their remarkable translation abilities? Are these abilities grounded in "incidental bilingualism" (Briakou et al. 2023) in training data? Does instruction tuning contribute to it? Are LLMs capable of aligning and leveraging semantically identical or similar monolingual contents from different corners of the internet that are unlikely to fit in a single context window? I offer some reflections on this topic, informed by recent studies and growing user experience. My working hypothesis is that LLMs' translation abilities originate in two different types of pre-training data that may be internalized by the models in different ways. I discuss the prospects for testing the "duality" hypothesis empirically and its implications for reconceptualizing translation, human and machine, in the age of deep learning.

[232] Understanding Refusal in Language Models with Sparse Autoencoders

Wei Jie Yeo,Nirmalendu Prakash,Clement Neo,Roy Ka-Wei Lee,Erik Cambria,Ranjan Satapathy

Main category: cs.CL

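TL;DR: The paper uses sparse autoencoders to study the mechanism of refusal in instruction-tuned LLMs, identifying latent features that causally mediate refusal and validating their influence on generation.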

Details Motivation: Refusal is a key safety behavior in aligned language models, but its internal mechanisms remain unclear. Method: Sparse autoencoders are used to identify latent features behind refusal, with intervention experiments on two open-source chat models. Result: Refusal-related features are shown to influence generation, and they help linear probes generalize to adversarial samples in classification tasks. Conclusion: The study shows how refusal manifests at the activation level, examines the mechanisms of adversarial jailbreaking techniques, and releases its code. Abstract: Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationship and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial samples in classification tasks. We open source our code at https://github.com/wj210/refusal_sae.

[233] Evaluating AI capabilities in detecting conspiracy theories on YouTube

Leonardo La Rocca,Francesco Corso,Francesco Pierri

Main category: cs.CL

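TL;DR: The study explores using open-weight large language models (LLMs) to identify conspiracy theory videos on YouTube, finding that text-based models have high recall but low precision, that multimodal models lag behind text-only ones, and that RoBERTa comes close to much larger LLMs on unlabeled data.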

Details Motivation: As a leading global platform, YouTube is prone to spreading harmful content such as conspiracy theories, so effective detection methods are needed. Method: A labeled dataset is used to evaluate the zero-shot performance of several LLMs (text-only and multimodal), compared against a fine-tuned RoBERTa baseline. Result: Text-based LLMs achieve high recall but low precision; multimodal models perform worse; RoBERTa comes close to LLM performance on unlabeled data. Conclusion: Current LLM approaches to harmful content detection have limitations, and more precise and robust systems are needed. Abstract: As a leading online platform with a vast global audience, YouTube's extensive reach also makes it susceptible to hosting harmful content, including disinformation and conspiracy theories. This study explores the use of open-weight Large Language Models (LLMs), both text-only and multimodal, for identifying conspiracy theory videos shared on YouTube. Leveraging a labeled dataset of thousands of videos, we evaluate a variety of LLMs in a zero-shot setting and compare their performance to a fine-tuned RoBERTa baseline. Results show that text-based LLMs achieve high recall but lower precision, leading to increased false positives. Multimodal models lag behind their text-only counterparts, indicating limited benefits from visual data integration. To assess real-world applicability, we evaluate the most accurate models on an unlabeled dataset, finding that RoBERTa achieves performance close to LLMs with a larger number of parameters. Our work highlights the strengths and limitations of current LLM-based approaches for online harmful content detection, emphasizing the need for more precise and robust systems.
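
The "high recall but lower precision" pattern reported in this abstract is easiest to see in a small worked example. The labels and predictions below are invented for illustration and are not taken from the paper's data.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = conspiracy video)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# A classifier that flags generously: it catches every true positive
# (recall 1.0) but also flags two harmless videos (precision 0.5).
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]
print(precision_recall(y_true, y_pred))
```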

[234] Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Guangtao Zeng,Maohao Shen,Delin Chen,Zhenting Qi,Subhro Das,Dan Gutfreund,David Cox,Gregory Wornell,Wei Lu,Zhang-Wei Hong,Chuang Gan

Main category: cs.CL

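TL;DR: EvoScale is a sample-efficient test-time scaling method that refines language model outputs through an evolutionary process, reducing the number of samples required and improving the performance of small models.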

Details Motivation: Small language models perform poorly on real-world software engineering tasks, and the alternatives (curating high-quality data, heavy sampling with costly verification) are expensive. Method: Proposes EvoScale, which combines an evolutionary procedure with reinforcement learning to refine outputs iteratively and reduce the number of samples needed. Result: The 32B model Satori-SWE-32B matches or exceeds models with over 100B parameters on SWE-Bench-Verified. Conclusion: EvoScale offers an efficient, low-cost way to improve the performance of small models. Abstract: Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead from repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using a few samples. Code, data, and models will be fully open-sourced.
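The selection-and-mutation idea can be illustrated with a toy loop. This is only a sketch under strong assumptions: integers stand in for candidate patches, `toy_score` stands in for a verifier, and `toy_mutate` stands in for the model's refinement step. None of these names come from the paper, and the RL-trained self-evolution is not modeled.

```python
import random

def evolve(population, score, mutate, generations=5, keep=2,
           children_per_parent=3, seed=0):
    """Toy selection-and-mutation loop over candidate outputs."""
    rng = random.Random(seed)
    for _ in range(generations):
        # Selection: keep the highest-scoring candidates as parents.
        parents = sorted(population, key=score, reverse=True)[:keep]
        # Mutation: derive new candidates from each parent.
        children = [mutate(p, rng) for p in parents
                    for _ in range(children_per_parent)]
        # Parents survive, so the best score never decreases.
        population = parents + children
    return max(population, key=score)

# Hypothetical stand-ins: closeness to 42 plays the role of a verifier score.
def toy_score(x):
    return -abs(x - 42)

def toy_mutate(x, rng):
    return x + rng.randint(-5, 5)

initial = [0, 10, 20]
best = evolve(initial, toy_score, toy_mutate)
```

Because the parents are carried over each generation, the best score in the population is non-decreasing, which mirrors the abstract's description of shifting the output distribution toward higher-scoring regions.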

[235] Table-R1: Inference-Time Scaling for Table Reasoning

Zheyuan Yang,Lyuhao Chen,Arman Cohan,Yilun Zhao

Main category: cs.CL

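TL;DR: This paper presents the first study of inference-time scaling for table reasoning and proposes two post-training strategies: distillation and reinforcement learning with verifiable rewards (RLVR). The Table-R1-Zero model matches the performance of GPT-4.1 while using only a 7B-parameter LLM.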

Details Motivation: Explore inference-time scaling on table reasoning tasks to improve model performance. Method: (1) Distillation: fine-tune models on a large-scale set of reasoning traces generated by DeepSeek-R1; (2) RLVR: design verifiable reward functions and train the model with the GRPO algorithm. Result: Table-R1-Zero performs strongly across tasks, matching GPT-4.1, and generalizes well. Conclusion: Instruction tuning, model architecture choices, and cross-task generalization matter for table reasoning, and RL training improves reasoning ability. Abstract: In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as emergence of essential table reasoning skills during RL training.
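As a rough illustration of RLVR with GRPO, the sketch below pairs a verifiable reward with group-relative advantages. The exact-match reward is an assumption (the abstract only says the rewards are task-specific), and normalizing by the population standard deviation follows the commonly described GRPO formulation rather than anything stated in this abstract.

```python
import statistics

def exact_match_reward(prediction: str, gold: str) -> float:
    """Hypothetical verifiable reward: 1.0 on a normalized exact match."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical sampled answers to one short-form table question.
gold = "Canada"
samples = ["Canada", "Mexico", "Brazil", " canada "]
rewards = [exact_match_reward(s, gold) for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct samples receive positive advantages and incorrect ones negative, without a separate value network, because each sample is judged only relative to its own group.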

[236] Characterizing the Expressivity of Transformer Language Models

Jiaoda Li,Ryan Cotterell

Main category: cs.CL

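TL;DR: This paper studies the theoretical expressive power of fixed-precision transformers, shows that it is exactly that of the fragment of linear temporal logic containing only the past operator, and supports the theory with experiments.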

Details Motivation: Although transformer models perform well in practice, their theoretical expressive power is not fully understood, particularly under the fixed precision and soft attention used in real implementations. Method: Analyzes the expressive power of an idealized fixed-precision transformer with strict future masking and soft attention. Result: Proves that these models are exactly as expressive as the fragment of linear temporal logic containing only the past operator, and connects this logic to formal language theory, automata theory, and algebra. Conclusion: The theory agrees with the experiments: transformers generalize perfectly over lengths on languages within their theoretical capacity and fail to generalize on languages beyond it. Abstract: Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. Prior work often relies on idealized models with assumptions -- such as arbitrary numerical precision and hard attention -- that diverge from real-world transformers. In this work, we provide an exact characterization of fixed-precision transformers with strict future masking and soft attention, an idealization that more closely mirrors practical implementations. We show that these models are precisely as expressive as a specific fragment of linear temporal logic that includes only a single temporal operator: the past operator. We further relate this logic to established classes in formal language theory, automata theory, and algebra, yielding a rich and unified theoretical framework for understanding transformer expressivity. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their theoretical capacity generalize perfectly over lengths, while they consistently fail to generalize on languages beyond it.
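To give a feel for a temporal logic with only a past operator, here is a small evaluator over strings. The tuple encoding of formulas and the strict reading of "past" (some strictly earlier position) are assumptions made for this sketch; the paper's exact syntax and semantics may differ.

```python
def holds(formula, word, i):
    """Check whether `formula` holds at position i of `word`."""
    op = formula[0]
    if op == "sym":    # the symbol at position i equals formula[1]
        return word[i] == formula[1]
    if op == "not":
        return not holds(formula[1], word, i)
    if op == "and":
        return holds(formula[1], word, i) and holds(formula[2], word, i)
    if op == "past":   # assumed strict: true at some position j < i
        return any(holds(formula[1], word, j) for j in range(i))
    raise ValueError(f"unknown operator: {op}")

def accepts(formula, word):
    """A non-empty word is accepted if the formula holds at its last position."""
    return holds(formula, word, len(word) - 1)

# "The last symbol is b, and an a occurred somewhere earlier."
b_after_a = ("and", ("sym", "b"), ("past", ("sym", "a")))
```

For example, `accepts(b_after_a, "aab")` is true while `accepts(b_after_a, "bbb")` is false. The formula can look arbitrarily far back but cannot count, which hints at why such a logic captures only a limited class of languages.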

[237] AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

Jiaxin Bai,Wei Fan,Qi Hu,Qing Zong,Chunyang Li,Hong Ting Tsang,Hongyu Luo,Yauwai Yim,Haoyu Huang,Xiao Zhou,Feng Qin,Tianshi Zheng,Xi Peng,Xin Yao,Huiwen Yang,Leijie Wu,Yi Ji,Gong Zhang,Renhai Chen,Yangqiu Song

Main category: cs.CL

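TL;DR: AutoSchemaKG is a framework for autonomous knowledge graph construction that needs no predefined schema. It uses large language models to extract knowledge triples and induce schemas directly from text, yielding the large-scale ATLAS knowledge graphs and outperforming existing methods.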

Details Motivation: Traditional knowledge graph construction depends on predefined schemas, which limits flexibility and scalability. AutoSchemaKG aims to remove this constraint with fully autonomous schema induction and knowledge extraction. Method: Uses large language models to extract knowledge triples and induce schemas from text at the same time, organizing instances into semantic categories through conceptualization, over more than 50 million documents. Result: Builds the ATLAS family of knowledge graphs with 900+ million nodes and 5.9 billion edges, outperforms baselines on multi-hop QA, and reaches 95% semantic alignment between induced and human-crafted schemas. Conclusion: Large-scale knowledge graphs with dynamically induced schemas can effectively complement the parametric knowledge of large language models, improving factuality and performance. Abstract: We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.

[238] GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns

Enzo Doyen,Amalia Todirascu

Main category: cs.CL

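TL;DR: GeNRe is the first French gender-neutral rewriting system. It uses collective nouns to address gender bias in French, combining a rule-based system with fine-tuned language models.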

Details Motivation: Much of the text data used in NLP exhibits gender bias, notably through masculine generics in French, which can reinforce stereotypes. Method: Develops a rule-based system (RBS) and two fine-tuned language models, and explores instruction-based models to improve performance. Result: Claude 3 Opus combined with the dictionary achieves results close to the RBS, and the system rewrites gendered expressions in French effectively. Conclusion: GeNRe advances gender bias mitigation techniques in French NLP. Abstract: A significant portion of the textual data used in the field of Natural Language Processing (NLP) exhibits gender biases, particularly due to the use of masculine generics (masculine words that are supposed to refer to mixed groups of men and women), which can perpetuate and amplify stereotypes. Gender rewriting, an NLP task that involves automatically detecting and replacing gendered forms with neutral or opposite forms (e.g., from masculine to feminine), can be employed to mitigate these biases. While such systems have been developed in a number of languages (English, Arabic, Portuguese, German, French), automatic use of gender neutralization techniques (as opposed to inclusive or gender-switching techniques) has only been studied for English. This paper presents GeNRe, the very first French gender-neutral rewriting system using collective nouns, which are gender-fixed in French. We introduce a rule-based system (RBS) tailored for the French language alongside two fine-tuned language models trained on data generated by our RBS. We also explore the use of instruct-based models to enhance the performance of our other systems and find that Claude 3 Opus combined with our dictionary achieves results close to our RBS. Through this contribution, we hope to promote the advancement of gender bias mitigation techniques in NLP for French.

[239] Are Reasoning Models More Prone to Hallucination?

Zijun Yao,Yantao Liu,Yanxu Chen,Jianhui Chen,Junfeng Fang,Lei Hou,Juanzi Li,Tat-Seng Chua

Main category: cs.CL

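TL;DR: The paper examines hallucination in large reasoning models (LRMs) on fact-seeking tasks through a holistic evaluation, behavior analysis, and a study of model uncertainty, and finds that different post-training pipelines affect hallucination differently.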

Details Motivation: Investigate whether the reasoning ability of LRMs makes them more prone to hallucination on fact-seeking tasks, resolving inconsistent conclusions in existing reports. Method: Analyzes the question from three angles: a holistic evaluation of hallucination in LRMs, a behavior analysis of how cognitive behaviors affect factuality, and an investigation of the mechanism from the perspective of model uncertainty. Result: A full post-training pipeline (cold-start SFT plus verifiable-reward RL) alleviates hallucination, whereas distillation alone and RL training without cold-start fine-tuning introduce more hallucination; Flaw Repetition and Think-Answer Mismatch are the two critical behaviors; increased hallucination is associated with misalignment between model uncertainty and factual accuracy. Conclusion: The work offers an initial account of the causes of hallucination in LRMs and points to directions for improvement. Abstract: Recently evolved large reasoning models (LRMs) show powerful performance in solving complex tasks with long chain-of-thought (CoT) reasoning capability. As these LRMs are mostly developed by post-training on formal reasoning tasks, whether they generalize the reasoning capability to help reduce hallucination in fact-seeking tasks remains unclear and debated. For instance, DeepSeek-R1 reports increased performance on SimpleQA, a fact-seeking benchmark, while OpenAI-o3 observes even more severe hallucination. This discrepancy naturally raises the following research question: Are reasoning models more prone to hallucination? This paper addresses the question from three perspectives. (1) We first conduct a holistic evaluation for the hallucination in LRMs. Our analysis reveals that LRMs that undergo a full post-training pipeline with cold start supervised fine-tuning (SFT) and verifiable reward RL generally alleviate their hallucination. In contrast, both distillation alone and RL training without cold start fine-tuning introduce more nuanced hallucinations. (2) To explore why different post-training pipelines alter the impact on hallucination in LRMs, we conduct behavior analysis. We characterize two critical cognitive behaviors that directly affect the factuality of an LRM: Flaw Repetition, where the surface-level reasoning attempts repeatedly follow the same underlying flawed logic, and Think-Answer Mismatch, where the final answer fails to faithfully match the previous CoT process. (3) Further, we investigate the mechanism behind the hallucination of LRMs from the perspective of model uncertainty. We find that increased hallucination of LRMs is usually associated with the misalignment between model uncertainty and factual accuracy. Our work provides an initial understanding of the hallucination in LRMs.

[240] ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs

Mohamed Elaraby,Diane Litman

Main category: cs.CL

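TL;DR: The paper investigates whether instruction-tuned large language models (LLMs) adequately preserve key argument roles when summarizing, introduces the ARC measurement framework, and finds that LLM-generated summaries often omit critical information.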

Details Motivation: Examine whether LLMs preserve argument roles when summarizing documents in high-stakes domains such as law, with the aim of improving summary quality. Method: Introduces the Argument Representation Coverage (ARC) framework and analyzes summaries produced by three open-weight LLMs in the legal and scientific domains. Result: LLMs cover salient argument roles to some extent but tend to omit information when arguments are sparsely distributed in the input, and they show positional bias and role-specific preferences. Conclusion: More argument-aware summarization strategies are needed to improve LLM performance in high-stakes domains. Abstract: Integrating structured information has long improved the quality of abstractive summarization, particularly in retaining salient content. In this work, we focus on a specific form of structure: argument roles, which are crucial for summarizing documents in high-stakes domains such as law. We investigate whether instruction-tuned large language models (LLMs) adequately preserve this information. To this end, we introduce Argument Representation Coverage (ARC), a framework for measuring how well LLM-generated summaries capture salient arguments. Using ARC, we analyze summaries produced by three open-weight LLMs in two domains where argument roles are central: long legal opinions and scientific articles. Our results show that while LLMs cover salient argument roles to some extent, critical information is often omitted in generated summaries, particularly when arguments are sparsely distributed throughout the input. Further, we use ARC to uncover behavioral patterns -- specifically, how the positional bias of LLM context windows and role-specific preferences impact the coverage of key arguments in generated summaries, emphasizing the need for more argument-aware summarization strategies.

[241] Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation

Hongxiang Zhang,Hao Chen,Tianyi Zhang,Muhao Chen

Main category: cs.CL

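TL;DR: ActLCD is a new decoding strategy that uses reinforcement learning to optimize the factuality of generated text and reduce hallucination.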

Details Motivation: Existing methods remain prone to hallucination over longer contexts, so the decoding strategy needs improvement. Method: Proposes ActLCD, which uses a reinforcement learning policy guided by a reward-aware classifier to decide when to apply contrasting layers. Result: Surpasses existing methods on five benchmarks and effectively reduces hallucination. Conclusion: ActLCD improves factuality across diverse generation scenarios. Abstract: Recent decoding methods improve the factuality of large language models (LLMs) by refining how the next token is selected during generation. These methods typically operate at the token level, leveraging internal representations to suppress superficial patterns. Nevertheless, LLMs remain prone to hallucinations, especially over longer contexts. In this paper, we propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy that actively decides when to apply contrasting layers during generation. By casting decoding as a sequential decision-making problem, ActLCD employs a reinforcement learning policy guided by a reward-aware classifier to optimize factuality beyond the token level. Our experiments demonstrate that ActLCD surpasses state-of-the-art methods across five benchmarks, showcasing its effectiveness in mitigating hallucinations in diverse generation scenarios.
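The general idea of contrasting layers, applied only when a gate says so, might be sketched as follows. The logits are made-up numbers, and the boolean `apply_contrast` is a hypothetical stand-in for ActLCD's learned policy, which is not specified in the abstract. Practical contrastive decoding schemes usually add further safeguards that are omitted here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def next_token(final_logits, early_logits, apply_contrast):
    """Pick the next token id, optionally contrasting final vs. early layer."""
    final_lp = log_softmax(final_logits)
    if apply_contrast:
        early_lp = log_softmax(early_logits)
        # Reward tokens whose probability grew between the early and final layer.
        scores = [f - e for f, e in zip(final_lp, early_lp)]
    else:
        scores = final_lp
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical logits over a 3-token vocabulary.
final_logits = [2.0, 1.8, 0.1]
early_logits = [2.0, 0.5, 0.1]
plain_choice = next_token(final_logits, early_logits, apply_contrast=False)
contrast_choice = next_token(final_logits, early_logits, apply_contrast=True)
```

Without contrast, token 0 wins on raw probability. With contrast, token 1 wins because its probability increased most between the early and final layers. Deciding per step which of the two behaviors to use is the part ActLCD frames as a sequential decision problem.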

[242] ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Beong-woo Kwak,Minju Kim,Dongha Lim,Hyungjoo Chae,Dongjin Kang,Sunghwan Kim,Dongil Yang,Jinyoung Yeo

Main category: cs.CL

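TL;DR: ToolHaystack is a benchmark for evaluating how well large language models use tools in long-term interactions, and it exposes shortcomings in the long-term robustness of existing models.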

Details Motivation: Existing evaluations focus mostly on tool use in short contexts and offer little insight into model behavior during long-term interactions. Method: Introduces the ToolHaystack benchmark, whose continuous conversations contain multiple task execution contexts and realistic noise, to test context retention and resistance to disruption. Result: Tests on 14 state-of-the-art models show that they perform well in standard multi-turn settings but degrade markedly on ToolHaystack. Conclusion: ToolHaystack reveals insufficient robustness in long-term tool use and provides a direction for future research. Abstract: Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing the tool use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often significantly struggle in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.

[243] LoLA: Low-Rank Linear Attention With Sparse Caching

Luke McDermott,Robert W. Heath Jr.,Rahul Parhi

Main category: cs.CL

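TL;DR: LoLA is a low-rank linear attention method that uses sparse caching and a tiered memory scheme to substantially improve performance on long-sequence tasks while remaining computationally efficient.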

Details Motivation: Address the quadratic inference cost of transformers on long sequences, along with the poor softmax approximation of existing linear attention methods on short contexts and their forgetting of information in long contexts. Method: Combines sliding window attention, a sparse global cache, and a recurrent hidden state, distributing key-value pairs across these tiers to avoid memory collisions. Result: Enables pass-key retrieval at context lengths up to 8K; at 4K context length, accuracy rises from 0.6% to 97.4% with a cache 4.6x smaller than that of Llama-3.1 8B; nearly all results can be reproduced on a single consumer GPU. Conclusion: LoLA is a lightweight, efficient approach that substantially improves linear attention models on long-sequence tasks. Abstract: Transformer-based large language models suffer from quadratic complexity at inference on long sequences. Linear attention methods are efficient alternatives; however, they fail to provide an accurate approximation of softmax attention. By additionally incorporating sliding window attention into each linear attention head, this gap can be closed for short context-length tasks. Unfortunately, these approaches cannot recall important information from long contexts due to "memory collisions". In this paper, we propose LoLA: Low-rank Linear Attention with sparse caching. LoLA separately stores additional key-value pairs that would otherwise interfere with past associative memories. Moreover, LoLA further closes the gap between linear attention models and transformers by distributing past key-value pairs into three forms of memory: (i) recent pairs in a local sliding window; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. As an inference-only strategy, LoLA enables pass-key retrieval on up to 8K context lengths on needle-in-a-haystack tasks from RULER. It boosts the accuracy of the base subquadratic model from 0.6% to 97.4% at 4K context lengths, with a 4.6x smaller cache than that of Llama-3.1 8B. LoLA demonstrates strong performance on zero-shot commonsense reasoning tasks among 1B and 8B parameter subquadratic models. Finally, LoLA is an extremely lightweight approach: Nearly all of our results can be reproduced on a single consumer GPU.
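The three forms of memory can be pictured with a small routing sketch. Everything below is an assumption made for illustration: the class name `ThreeTierMemory`, scalar values, and the rule that a pair's "difficulty" is the recurrent state's prediction error at the moment the pair leaves the window. The abstract does not give LoLA's actual scoring rule.

```python
from collections import deque

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class ThreeTierMemory:
    """Toy router: sliding window -> sparse cache -> recurrent state."""

    def __init__(self, dim, window=2, cache_size=1):
        self.window = deque()       # (i) recent key-value pairs
        self.window_size = window
        self.cache = []             # (ii) (error, key, value), hardest first
        self.cache_size = cache_size
        self.state = [0.0] * dim    # (iii) linear-attention state: sum of v * k

    def add(self, key, value):
        self.window.append((key, value))
        if len(self.window) <= self.window_size:
            return
        old_key, old_value = self.window.popleft()
        # Assumed difficulty score: how badly the state predicts this value.
        error = abs(dot(self.state, old_key) - old_value)
        self.cache.append((error, old_key, old_value))
        self.cache.sort(key=lambda item: item[0], reverse=True)
        if len(self.cache) > self.cache_size:
            # The easiest cached pair is folded into the recurrent state.
            _, k, v = self.cache.pop()
            self.state = [s + v * x for s, x in zip(self.state, k)]

mem = ThreeTierMemory(dim=2, window=1, cache_size=1)
mem.add([1.0, 0.0], 1.0)
mem.add([0.0, 1.0], 2.0)
mem.add([1.0, 0.0], 5.0)
```

After these three insertions the newest pair sits in the window, the pair with the larger error stays in the sparse cache, and the easier pair has been absorbed into the recurrent state. Memory stays bounded by the window and cache sizes regardless of sequence length.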

[244] Automatic classification of stop realisation with wav2vec2.0

James Tanner,Morgan Sonderegger,Jane Stuart-Smith,Jeff Mielke,Tyler Kendall

Main category: cs.CL

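TL;DR: Pre-trained wav2vec2.0 models are trained to classify stop burst presence in speech data automatically, with high accuracy and robustness in both English and Japanese.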

Details Motivation: Phonetic research lacks automatic annotation tools for many variable phonetic phenomena, while pre-trained self-supervised models such as wav2vec2.0 perform well on speech classification tasks. Method: Trains wav2vec2.0 models to classify stop burst presence automatically and tests their robustness on finely curated and unprepared speech corpora in English and Japanese. Result: The automatic annotations agree closely with manual annotations and replicate the patterns of variability in stop realisation. Conclusion: Pre-trained speech models show promise for automatic annotation and processing of speech data, making it easier to scale up phonetic research. Abstract: Modern phonetic research regularly makes use of automatic tools for the annotation of speech data; however, few tools exist for the annotation of many variable phonetic phenomena. At the same time, pre-trained self-supervised models, such as wav2vec2.0, have been shown to perform well at speech classification tasks and latently encode fine-grained phonetic information. We demonstrate that wav2vec2.0 models can be trained to automatically classify stop burst presence with high accuracy in both English and Japanese, robust across both finely-curated and unprepared speech corpora. Patterns of variability in stop realisation are replicated with the automatic annotations, and closely follow those of manual annotations. These results demonstrate the potential of pre-trained speech models as tools for the automatic annotation and processing of speech corpus data, enabling researchers to 'scale up' the scope of phonetic research with relative ease.

[245] Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models

Francesca Padovani,Jaap Jumelet,Yevgen Matusevych,Arianna Bisazza

Main category: cs.CL

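TL;DR: Although child-directed language (CDL) can occasionally beat adult-directed text, models trained on Wikipedia perform better in most cases. The study also introduces a new testing methodology, FIT-CLAMS, and stresses the importance of controlling for frequency.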

Details Motivation: Test whether the conclusions of Huebner et al. (2021) about training language models (LMs) on CDL hold across languages, model types, and evaluation settings. Method: Compares models trained on CDL and on Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Result: CDL is outperformed by Wikipedia models in most cases, and earlier benchmarks are shown to have shortcomings. The new FIT-CLAMS methodology shows that CDL does not yield stronger syntactic generalization. Conclusion: CDL training offers no consistent advantage for syntax learning, and controlling for frequency is essential when evaluating syntactic ability. Abstract: Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can reach similar syntactic abilities as LMs trained on much larger amounts of adult-directed written text, suggesting that CDL could provide more effective LM training material than the commonly used internet-crawled data. However, the generalizability of these results across languages, model types, and evaluation settings remains unclear. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in previous benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal pair evaluations and regression analysis we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.

[246] Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

Ziling Cheng,Meng Cao,Leila Pishdad,Yanshuai Cao,Jackie Chi Kit Cheung

Main category: cs.CL

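TL;DR: The paper argues that final-answer metrics are a limited measure of the math problem-solving ability of large language models (LLMs) because they conflate two sub-skills, abstract formulation and arithmetic computation. Arithmetic computation is the main bottleneck, and chain-of-thought (CoT) mainly helps computation rather than formulation.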

Details Motivation: Examine whether existing metrics accurately reflect LLM reasoning ability, in particular by separating the effects of abstract formulation and arithmetic computation. Method: Runs a disentangled evaluation of Llama-3 and Qwen2.5 (1B-32B) on GSM8K and SVAMP, analyzes the role of CoT, and verifies the mechanism with causal patching. Result: Arithmetic computation is the main bottleneck; CoT helps computation substantially but has limited effect on abstract formulation; models operate through an abstract-then-compute mechanism. Conclusion: Disentangled evaluation is needed to assess LLM reasoning accurately and to guide future improvements. Abstract: Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.
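One way to picture a disentangled evaluation is to score the model's expression separately from the number it reports. The sketch below is an assumption about how such scoring could look, not the paper's protocol. It evaluates the expression with a small, safe arithmetic interpreter and compares three quantities.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str) -> float:
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def disentangled_scores(expression, stated_answer, gold_answer, tol=1e-9):
    """Score formulation, computation, and final answer separately."""
    value = evaluate(expression)
    return {
        "formulation": abs(value - gold_answer) <= tol,    # right expression?
        "computation": abs(value - stated_answer) <= tol,  # evaluated correctly?
        "final": abs(stated_answer - gold_answer) <= tol,  # usual metric
    }

# Hypothetical model output: correct expression, wrong arithmetic.
scores = disentangled_scores("(48 + 24) * 3", stated_answer=206, gold_answer=216)
```

In this example a final-answer metric would simply mark the response wrong, while the disentangled view shows that the formulation was correct and only the computation failed.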

[247] SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models

Zixiang Xu,Yanbo Wang,Yue Huang,Jiayi Ye,Haomin Zhuang,Zirui Song,Lang Gao,Chenxi Wang,Zhaorun Chen,Yujun Zhou,Sixian Li,Wang Pan,Yue Zhao,Jieyu Zhao,Xiangliang Zhang,Xiuying Chen

Main category: cs.CL

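TL;DR: The paper introduces SocialMaze, a new benchmark for evaluating the social reasoning ability of large language models, filling a gap left by existing evaluation frameworks.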

Details Motivation: Existing evaluation frameworks oversimplify real-world scenarios and cannot comprehensively assess the social reasoning ability of large language models. Method: SocialMaze uses six tasks to evaluate three core challenges of social reasoning systematically: deep reasoning, dynamic interaction, and information uncertainty. Result: Models differ substantially in handling dynamic interaction and integrating information, uncertainty significantly degrades reasoning, and targeted fine-tuning improves performance. Conclusion: SocialMaze is an effective tool for evaluating and improving the social reasoning ability of large language models. Abstract: Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model's social reasoning ability - the capacity to interpret social contexts, infer others' mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms. Both automated and human validation are used to ensure data quality. Our evaluation reveals several key insights: models vary substantially in their ability to handle dynamic interactions and integrate temporally evolving information; models with strong chain-of-thought reasoning perform better on tasks requiring deeper inference beyond surface-level cues; and model reasoning degrades significantly under uncertainty. Furthermore, we show that targeted fine-tuning on curated reasoning examples can greatly improve model performance in complex social scenarios. The dataset is publicly available at: https://huggingface.co/datasets/MBZUAI/SocialMaze

[248] SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

Roksana Goworek,Harpal Karlcut,Muhammad Shezad,Nijaguna Darshana,Abhishek Mane,Syam Bondada,Raghav Sikka,Ulvi Mammadov,Rauf Allahverdiyev,Sriram Purighella,Paridhi Gupta,Muhinyia Ndegwa,Haim Dubossarsky

Main category: cs.CL

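TL;DR: Addressing the need for high-quality evaluation datasets in low-resource languages, the paper proposes a semi-automatic annotation method and releases sense-annotated datasets covering nine low-resource languages to support cross-lingual transfer research.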

Details Motivation: Address the lack of high-quality evaluation datasets for low-resource languages and advance cross-lingual transfer. Method: Proposes a semi-automatic annotation method and creates sense-annotated datasets of sentences containing polysemous words. Result: Experiments in the WiC format demonstrate the utility of the datasets and highlight the importance of targeted dataset creation for polysemy disambiguation. Conclusion: The released datasets and code aim to support fairer, more robust, and truly multilingual NLP research. Abstract: This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness depends on high-quality, suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning nine low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.

[249] Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models

Jinzhe Li,Gengxu Li,Yi Chang,Yuan Wu

Main category: cs.CL

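TL;DR: The paper proposes the PCBench evaluation framework to test how well large language models (LLMs) identify errors in input premises. Existing models rely on explicit prompts, have limited autonomous critique ability, and show no consistent link between reasoning ability and premise critique ability.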

Details Motivation: LLMs are fragile when input premises contain errors and lack the ability to identify and correct them proactively, which hurts reliability and reasoning efficiency. Method: Designs PCBench, covering four error types across three difficulty levels, and evaluates 15 representative LLMs. Result: Models depend on explicit prompts and critique poorly on their own; premise critique ability does not consistently correlate with reasoning ability; flawed premises cause reasoning models to overthink. Conclusion: LLMs need stronger proactive evaluation of input validity, and premise critique is a key capability for building reliable systems. Abstract: Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at https://github.com/MLGroupJLU/Premise_Critique.

[250] Label-Guided In-Context Learning for Named Entity Recognition

Fan Bai,Hamid Hassanzadeh,Ardavan Saeedi,Mark Dredze

Main category: cs.CL

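TL;DR: DEER improves in-context learning (ICL) for NER by exploiting token-level statistics from training labels, consistently outperforming existing methods.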

Details Motivation: Existing ICL methods for NER select demonstrations by semantic similarity alone and ignore training labels, which leads to suboptimal performance. Method: DEER improves example selection with a label-guided, token-based retriever and prompts the LLM to correct error-prone tokens. Result: On five NER datasets, DEER outperforms existing ICL methods and approaches the performance of supervised fine-tuning. Conclusion: DEER is effective on both seen and unseen entities and robust in low-resource settings. Abstract: In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. In Named Entity Recognition (NER), demonstrations are typically selected based on semantic similarity to the test instance, ignoring training labels and resulting in suboptimal performance. We introduce DEER, a new method that leverages training labels through token-level statistics to improve ICL performance. DEER first enhances example selection with a label-guided, token-based retriever that prioritizes tokens most informative for entity recognition. It then prompts the LLM to revisit error-prone tokens, which are also identified using label statistics, and make targeted corrections. Evaluated on five NER datasets using four different LLMs, DEER consistently outperforms existing ICL methods and approaches the performance of supervised fine-tuning. Further analysis shows its effectiveness on both seen and unseen entities and its robustness in low-resource settings.
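A simple example of a token-level label statistic is the fraction of a token's occurrences that fall inside an entity span. The abstract does not say which statistics DEER uses, so the `entity_rate` function and the tiny BIO-tagged corpus below are purely illustrative assumptions.

```python
from collections import Counter

def entity_rate(tagged_sentences):
    """Fraction of each token's occurrences that are tagged as part of an entity."""
    total, in_entity = Counter(), Counter()
    for sentence in tagged_sentences:
        for token, tag in sentence:
            total[token] += 1
            if tag != "O":          # BIO scheme: anything but "O" is an entity
                in_entity[token] += 1
    return {token: in_entity[token] / total[token] for token in total}

# Hypothetical training data in BIO format.
train = [
    [("May", "B-PER"), ("visited", "O"), ("Paris", "B-LOC")],
    [("In", "O"), ("May", "O"), ("we", "O"), ("visited", "O"), ("Paris", "B-LOC")],
]
rates = entity_rate(train)
```

Tokens with a rate near 1.0 ("Paris") are strong entity cues, tokens near 0.0 ("visited") are uninformative, and tokens in between ("May" at 0.5) are ambiguous. Ambiguous tokens are plausibly the kind of error-prone cases a label-aware method would want to revisit.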

[251] ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

Zexi Liu,Jingyi Chai,Xinyu Zhu,Shuo Tang,Rui Ye,Bo Zhang,Lei Bai,Siheng Chen

Main category: cs.CL

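TL;DR: The paper proposes a learning-based framework for large language model (LLM) agents that optimizes ML tasks through online reinforcement learning (RL), substantially improving the performance and efficiency of autonomous machine learning (ML).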

Details Motivation: Existing approaches depend on manual prompt engineering and cannot adapt based on diverse experimental experience, which motivates a learning-based agentic ML paradigm. Method: Proposes a framework with three key components: exploration-enriched fine-tuning, step-wise RL, and an agentic ML-specific reward module. Result: The resulting 7B ML-Agent, trained on 9 ML tasks, outperforms the 671B DeepSeek-R1 agent and shows continuous improvement and cross-task generalization. Conclusion: The framework offers an efficient, adaptive, and scalable approach to autonomous ML. Abstract: The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, most existing approaches rely heavily on manual prompt engineering, failing to adapt and optimize based on diverse experimental experiences. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Remarkably, despite being trained on merely 9 ML tasks, our 7B-sized ML-Agent outperforms the 671B-sized DeepSeek-R1 agent. Furthermore, it achieves continuous performance improvements and demonstrates exceptional cross-task generalization capabilities.

[252] Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

Mohamad Chehade,Soumya Suvra Ghosal,Souradip Chakraborty,Avinash Reddy,Dinesh Manocha,Hao Zhu,Amrit Singh Bedi

Main category: cs.CL

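TL;DR: SITAlign is an inference-time framework that addresses multi-objective alignment by maximizing a primary objective while satisfying threshold constraints on secondary criteria, outperforming existing methods.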

Details Motivation: Existing approaches overlook how humans actually make decisions; human decision making typically follows satisficing strategies (optimizing a primary objective while ensuring secondary objectives meet thresholds). Method: Proposes the SITAlign framework, which adopts a satisficing strategy: maximize the primary objective subject to threshold constraints on secondary criteria. Result: On the PKU-SafeRLHF dataset, SITAlign exceeds the state-of-the-art multi-objective decoding strategy by 22.3% in GPT-4 win-tie rate for helpfulness reward while adhering to the harmlessness threshold. Conclusion: SITAlign effectively addresses multi-objective alignment through satisficing and outperforms conventional multi-objective optimization approaches. Abstract: Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
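
The satisficing rule described in the abstract (maximize a primary objective subject to a threshold on a secondary one) can be illustrated with a small candidate-selection sketch. This is not SITAlign's decoding procedure, which the abstract does not specify; the function name `satisficing_select`, the per-candidate scores, and the fallback when no candidate meets the threshold are all assumptions made for illustration.

```python
from typing import Sequence


def satisficing_select(
    primary: Sequence[float],
    secondary: Sequence[float],
    threshold: float,
) -> int:
    """Return the index of the candidate with the highest primary score
    among those whose secondary score meets the threshold.

    Hypothetical helper: if no candidate is feasible, fall back to the
    candidate with the highest secondary score (an assumed policy).
    """
    if len(primary) != len(secondary) or not primary:
        raise ValueError("score lists must be non-empty and equally long")
    feasible = [i for i, s in enumerate(secondary) if s >= threshold]
    if not feasible:
        return max(range(len(secondary)), key=lambda i: secondary[i])
    return max(feasible, key=lambda i: primary[i])


# Example: helpfulness is the primary score, harmlessness the secondary one.
helpfulness = [0.9, 0.7, 0.4]
harmlessness = [0.2, 0.6, 0.9]
best = satisficing_select(helpfulness, harmlessness, threshold=0.5)
```

Here the most helpful candidate (index 0) is rejected because it falls below the harmlessness threshold, so the rule picks the best of the remaining feasible candidates.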

[253] ATLAS: Learning to Optimally Memorize the Context at Test Time

Ali Behrouz,Zeman Li,Praneeth Kacham,Majid Daliri,Yuan Deng,Peilin Zhong,Meisam Razaviyayn,Vahab Mirrokni

Main category: cs.CL

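TL;DR: The paper proposes ATLAS, a high-capacity long-term memory module that addresses the shortcomings of modern recurrent neural networks in long-context understanding and sequence extrapolation, and shows experimentally that it outperforms Transformers and linear recurrent models.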

Details Motivation: Transformers are limited on long sequences by their quadratic complexity, while modern recurrent neural networks perform poorly on long-context understanding and extrapolation; the paper aims to address both issues. Method: Proposes the ATLAS module, which improves memory capacity and the update mechanism, and builds the DeepTransformers architecture family on top of it. Result: ATLAS surpasses Transformers and linear recurrent models on language modeling, common-sense reasoning, and related tasks, with significant gains on long-context tasks. Conclusion: By improving memory-module design, ATLAS effectively improves performance on long-sequence tasks and offers a new direction for sequence modeling. Abstract: Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, bound their applicability in longer sequences and so has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, they struggle in tasks that require long-context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To enhance all three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long-context performance of Titans, achieving +80% accuracy at 10M context length on the BABILong benchmark.

[254] DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning

Ziyin Zhang,Jiahao Xu,Zhiwei He,Tian Liang,Qiuzhi Liu,Yansi Li,Linfeng Song,Zhengwen Liang,Zhuosheng Zhang,Rui Wang,Zhaopeng Tu,Haitao Mi,Dong Yu

Main category: cs.CL

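TL;DR: DeepTheorem is an informal theorem-proving framework that uses natural language to enhance the mathematical reasoning of large language models; it includes a large-scale dataset and a novel reinforcement learning strategy and significantly improves theorem-proving performance.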

Details Motivation: Traditional automated theorem proving relies on formal systems that align poorly with the natural-language knowledge of large language models, limiting their reasoning ability. Method: Proposes the DeepTheorem framework, comprising a dataset of 121K high-quality informal theorems and the RL-Zero reinforcement learning strategy, which uses verified theorem variants to incentivize reasoning. Result: Experiments show that DeepTheorem significantly improves theorem-proving performance, achieving state-of-the-art accuracy and reasoning quality. Conclusion: DeepTheorem has the potential to advance informal theorem proving and mathematical exploration. Abstract: Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs' strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.

[255] Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Heekyung Lee,Jiaxin Ge,Tsung-Han Wu,Minwoo Kang,Trevor Darrell,David M. Chan

Main category: cs.CL

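TL;DR: This paper studies the ability of current vision-language models (VLMs) to solve rebus puzzles and finds that they perform poorly at abstract reasoning and understanding visual metaphors.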

Details Motivation: Rebus puzzles encode language through imagery, spatial arrangement, and symbolic substitution, posing a unique challenge to VLMs. This paper aims to evaluate VLM performance on such tasks. Method: Constructs a hand-generated and annotated benchmark of diverse rebus puzzles, ranging from simple pictographic substitutions to spatially dependent cues. Result: VLMs do reasonably well at decoding simple visual clues but fall significantly short on tasks requiring abstract reasoning, lateral thinking, and understanding of visual metaphors. Conclusion: Current VLMs have clear limitations in solving rebus puzzles, especially in complex reasoning and metaphor understanding, and need further improvement. Abstract: Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.

[256] From Chat Logs to Collective Insights: Aggregative Question Answering

Wentao Zhang,Woojeong Kim,Yuntian Deng

Main category: cs.CL

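TL;DR: The paper introduces a new task, Aggregative Question Answering, which answers aggregative questions by analyzing large-scale user-chatbot conversation data, and constructs the WildChat-AQA benchmark dataset.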

Details Motivation: Existing approaches typically treat conversations as independent events, overlooking the potential value of aggregating and reasoning across large-scale conversation logs. Method: Introduces the Aggregative Question Answering task and constructs the WildChat-AQA dataset of 6,027 aggregative questions. Result: Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs. Conclusion: New methods are needed to extract collective insights from large-scale conversational data. Abstract: Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet, existing approaches typically treat these interactions as independent and miss critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task requiring models to reason explicitly over thousands of user-chatbot interactions to answer aggregative queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.

cs.CR

[257] AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models

Jinchuan Zhang,Lu Yin,Yan Zhou,Songlin Hu

Main category: cs.CR

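TL;DR: The AgentAlign framework synthesizes safety alignment data from abstract behavior chains, substantially improving the safety of LLM agents while preserving their helpfulness.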

Details Motivation: The expansion of LLM agent capabilities increases the risk of malicious use, and existing approaches fall short on safety alignment. Method: Instantiates abstract behavior chains in simulated environments to generate safety alignment data, balancing helpfulness and harmlessness. Result: On the AgentHarm evaluation, safety improves by 35.8% to 79.5%, with helpfulness minimally affected or even improved. Conclusion: AgentAlign effectively addresses safety alignment for LLM agents and outperforms existing prompting methods. Abstract: The acquisition of agentic capabilities has transformed LLMs from "knowledge providers" to "action executors", a trend that, while expanding LLMs' capability boundaries, significantly increases their susceptibility to malicious use. Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked, indicating a deficiency in agentic use safety alignment during the post-training phase. To address this gap, we propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis. By instantiating these behavior chains in simulated environments with diverse tool instances, our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics. The framework further ensures model utility by proportionally synthesizing benign instructions through non-malicious interpretations of behavior chains, precisely calibrating the boundary between helpfulness and harmlessness. Evaluation results on AgentHarm demonstrate that fine-tuning three families of open-source models using our method substantially improves their safety (35.8% to 79.5% improvement) while minimally impacting or even positively enhancing their helpfulness, outperforming various prompting methods. The dataset and code have both been open-sourced.

[258] Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion

Chunlong Xie,Jialing He,Shangwei Guo,Jiacheng Wang,Shudong Zhang,Tianwei Zhang,Tao Xiang

Main category: cs.CR

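TL;DR: AdvOF is an attack framework targeting vision-and-language navigation (VLN) agents that generates adversarial 3D objects to study their impact in service-oriented environments.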

Details Motivation: Existing adversarial attacks do not account for reliability and quality of service (QoS) in service computing contexts; AdvOF fills this gap. Method: AdvOF precisely aggregates and aligns victim object positions in 2D and 3D space, defines and renders adversarial objects, and achieves a stable attack through multi-view optimization and iterative fusion. Result: Experiments show that AdvOF effectively degrades agent performance while interfering minimally with normal navigation tasks. Conclusion: The work deepens the understanding of service security in VLM-powered navigation systems and provides computational foundations for robust service composition in physical-world deployments. Abstract: We present Adversarial Object Fusion (AdvOF), a novel attack framework targeting vision-and-language navigation (VLN) agents in service-oriented environments by generating adversarial 3D objects. While foundational models like Large Language Models (LLMs) and Vision Language Models (VLMs) have enhanced service-oriented navigation systems through improved perception and decision-making, their integration introduces vulnerabilities in mission-critical service workflows. Existing adversarial attacks fail to address service computing contexts, where reliability and quality-of-service (QoS) are paramount. We utilize AdvOF to investigate and explore the impact of adversarial environments on the VLM-based perception module of VLN agents. In particular, AdvOF first precisely aggregates and aligns the victim object positions in both 2D and 3D space, defining and rendering adversarial objects. Then, we collaboratively optimize the adversarial object with regularization between the adversarial and victim object across physical properties and VLM perceptions. By assigning importance weights to varying views, the optimization proceeds stably across multiple views through iterative fusion of local updates and justifications. Our extensive evaluations demonstrate AdvOF can effectively degrade agent performance under adversarial conditions while maintaining minimal interference with normal navigation tasks. This work advances the understanding of service security in VLM-powered navigation systems, providing computational foundations for robust service composition in physical-world deployments.

eess.IV

[259] IRS: Incremental Relationship-guided Segmentation for Digital Pathology

Ruining Deng,Junchao Zhu,Juming Xiong,Can Cui,Tianyuan Yao,Junlin Guo,Siqi Lu,Marilyn Lionts,Mengmeng Yin,Yu Wang,Shilin Zhao,Yucheng Tang,Yihe Yang,Paul Dennis Simonson,Mert R. Sabuncu,Haichun Yang,Yuankai Huo

Main category: eess.IV

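TL;DR: The paper proposes IRS, an incremental relationship-guided segmentation learning scheme for temporally acquired, partially annotated data in digital pathology that maintains out-of-distribution continual learning capacity.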

Details Motivation: Panoramic segmentation in digital pathology faces incomplete annotations and the need to continually learn new classes, calling for a unified framework that handles multi-scale structures and out-of-distribution data. Method: IRS mathematically models anatomical relationships and uses an incremental universal proposition matrix to realize spatial-temporal out-of-distribution continual learning. Result: Experiments show that IRS effectively handles multi-scale pathological segmentation, enabling precise kidney segmentation and recognition of out-of-distribution disease lesions. Conclusion: IRS significantly enhances domain generalization and is suited to real-world digital pathology applications. Abstract: Continual learning is rapidly emerging as a key focus in computer vision, aiming to develop AI systems capable of continuous improvement, thereby enhancing their value and practicality in diverse real-world applications. In healthcare, continual learning holds great promise for continuously acquired digital pathology data, which is collected in hospitals on a daily basis. However, panoramic segmentation on digital whole slide images (WSIs) presents significant challenges, as it is often infeasible to obtain comprehensive annotations for all potential objects, spanning from coarse structures (e.g., regions and unit objects) to fine structures (e.g., cells). This results in temporally and partially annotated data, posing a major challenge in developing a holistic segmentation framework. Moreover, an ideal segmentation model should incorporate new phenotypes, unseen diseases, and diverse populations, making this task even more complex. In this paper, we introduce a novel and unified Incremental Relationship-guided Segmentation (IRS) learning scheme to address temporally acquired, partially annotated data while maintaining out-of-distribution (OOD) continual learning capacity in digital pathology. The key innovation of IRS lies in its ability to realize a new spatial-temporal OOD continual learning paradigm by mathematically modeling anatomical relationships between existing and newly introduced classes through a simple incremental universal proposition matrix. Experimental results demonstrate that the IRS method effectively handles the multi-scale nature of pathological segmentation, enabling precise kidney segmentation across various structures (regions, units, and cells) as well as OOD disease lesions at multiple magnifications. This capability significantly enhances domain generalization, making IRS a robust approach for real-world digital pathology applications.

[260] iHDR: Iterative HDR Imaging with Arbitrary Number of Exposures

Yu Yuan,Yiheng Chi,Xingguang Zhang,Stanley Chan

Main category: eess.IV

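TL;DR: The paper proposes iHDR, a new framework that generates high-quality HDR images by iteratively fusing multiple low dynamic range (LDR) images, removing the fixed-input-count limitation of existing methods.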

Details Motivation: Existing HDR imaging methods usually work only for a fixed number of inputs (e.g., three) and cannot flexibly handle varying numbers of input frames. Method: The iHDR framework comprises a ghost-free dual-input HDR fusion network (DiHDR) and a physics-based domain mapping network (ToneNet), and generates the HDR image by iteratively fusing the input frames. Result: Experiments show that the method outperforms existing HDR deghosting approaches when the number of input frames is flexible. Conclusion: The iHDR framework offers a flexible and efficient solution for HDR imaging. Abstract: High dynamic range (HDR) imaging aims to obtain a high-quality HDR image by fusing information from multiple low dynamic range (LDR) images. Numerous learning-based HDR imaging methods have been proposed to achieve this for static and dynamic scenes. However, their architectures are mostly tailored for a fixed number (e.g., three) of inputs and, therefore, cannot apply directly to situations beyond the pre-defined limited scope. To address this issue, we propose a novel framework, iHDR, for iterative fusion, which comprises a ghost-free Dual-input HDR fusion network (DiHDR) and a physics-based domain mapping network (ToneNet). DiHDR leverages a pair of inputs to estimate an intermediate HDR image, while ToneNet maps it back to the nonlinear domain and serves as the reference input for the next pairwise fusion. This process is iteratively executed until all input frames are utilized. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method as compared to existing state-of-the-art HDR deghosting approaches given flexible numbers of input frames.
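
The iterative pairwise fusion described in the abstract can be sketched as a simple loop. This is only a control-flow illustration under stated assumptions: `fuse_pair` stands in for DiHDR and `tone_map` for ToneNet, both passed in as hypothetical callables, and using the first frame as the initial reference is an assumption the abstract does not confirm.

```python
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")


def iterative_fusion(
    frames: Sequence[Image],
    fuse_pair: Callable[[Image, Image], Image],
    tone_map: Callable[[Image], Image],
) -> Image:
    """Fuse any number (>= 2) of LDR frames by repeated pairwise fusion.

    fuse_pair(reference, frame) -> intermediate HDR estimate (DiHDR stand-in)
    tone_map(hdr) -> nonlinear-domain image used as the next reference
                     (ToneNet stand-in)
    """
    if len(frames) < 2:
        raise ValueError("need at least two input frames")
    reference = frames[0]  # assumed choice of initial reference
    hdr = None
    for frame in frames[1:]:
        hdr = fuse_pair(reference, frame)  # pairwise fusion step
        reference = tone_map(hdr)          # becomes the next reference
    return hdr


# Toy usage with numbers in place of images: averaging as "fusion",
# identity as "tone mapping".
result = iterative_fusion([1.0, 3.0, 5.0], lambda a, b: (a + b) / 2, lambda x: x)
```

Because each step consumes exactly one new frame, the same two networks can serve any number of exposures, which is the flexibility the abstract emphasizes.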

[261] Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging

Ping Wang,Lishun Wang,Gang Qu,Xiaodong Wang,Yulun Zhang,Xin Yuan

Main category: eess.IV

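TL;DR: The paper combines the strengths of deep unrolling and plug-and-play (PnP) methods: by designing an efficient deep image restorer (DIR) and proposing a general proximal trajectory (PT) loss function, it handles varying compression ratios (CRs) flexibly in the single-pixel imaging (SPI) inverse problem while improving reconstruction accuracy and speed.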

Details Motivation: Addresses the limited reconstruction accuracy and speed of PnP methods and the need for deep-unrolling methods to be fine-tuned or retrained when the compression ratio changes. Method: Designs an efficient deep image restorer (DIR) for unrolling the HQS and ADMM algorithms and proposes a general proximal trajectory (PT) loss function for training the networks. Result: Experiments show that the proposed proximal unrolling networks not only handle varying compression ratios flexibly but also outperform previous CR-specific unrolling networks in reconstruction accuracy and speed. Conclusion: The method successfully integrates the strengths of PnP and deep unrolling, providing a more efficient solution to the single-pixel imaging inverse problem. Abstract: Deep-unrolling and plug-and-play (PnP) approaches have become the de-facto standard solvers for the single-pixel imaging (SPI) inverse problem. PnP approaches, a class of iterative algorithms where regularization is implicitly performed by an off-the-shelf deep denoiser, are flexible for varying compression ratios (CRs) but are limited in reconstruction accuracy and speed. Conversely, unrolling approaches, a class of multi-stage neural networks where a truncated iterative optimization process is transformed into an end-to-end trainable network, typically achieve better accuracy with faster inference but require fine-tuning or even retraining when CR changes. In this paper, we address the challenge of integrating the strengths of both classes of solvers. To this end, we design an efficient deep image restorer (DIR) for the unrolling of HQS (half quadratic splitting) and ADMM (alternating direction method of multipliers). More importantly, a general proximal trajectory (PT) loss function is proposed to train HQS/ADMM-unrolling networks such that learned DIR approximates the proximal operator of an ideal explicit restoration regularizer. Extensive experiments demonstrate that the resulting proximal unrolling networks can not only flexibly handle varying CRs with a single model like PnP algorithms, but also outperform previous CR-specific unrolling networks in both reconstruction accuracy and speed. Source codes and models are available at https://github.com/pwangcs/ProxUnroll.
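
For readers unfamiliar with HQS, the textbook form of the iteration that such networks unroll is given below. The notation is an assumption, not taken from the paper: $y$ is the measurement, $A$ the sensing matrix, $R$ the regularizer, $\lambda$ its weight, $\mu$ the penalty parameter, and $z$ the auxiliary variable.

$$\min_{x}\ \tfrac{1}{2}\|y - Ax\|_2^2 + \lambda R(x)$$

$$x^{k+1} = \arg\min_{x}\ \tfrac{1}{2}\|y - Ax\|_2^2 + \tfrac{\mu}{2}\|x - z^{k}\|_2^2 = (A^{\top}A + \mu I)^{-1}\,(A^{\top}y + \mu z^{k})$$

$$z^{k+1} = \operatorname{prox}_{(\lambda/\mu)R}\big(x^{k+1}\big) = \arg\min_{z}\ \tfrac{\mu}{2}\|z - x^{k+1}\|_2^2 + \lambda R(z)$$

In an unrolled network, a fixed number of these stages is stacked and the proximal step is replaced by a learned module; per the abstract, the PT loss trains the DIR so that it approximates this proximal operator.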

[262] Advancing Image Super-resolution Techniques in Remote Sensing: A Comprehensive Survey

Yunliang Qi,Meng Lou,Yimin Liu,Lu Li,Zhen Yang,Wen Nie

Main category: eess.IV

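TL;DR: This paper systematically reviews remote sensing image super-resolution (RSISR) methods, covering methodologies, datasets, and evaluation metrics, analyzes the limitations of existing methods, and proposes future research directions.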

Details Motivation: Although the number of RSISR methods has grown in recent years, a systematic review is lacking; this paper aims to fill that gap and help researchers understand current trends and challenges. Method: Categorizes RSISR methods into supervised, unsupervised, and quality evaluation approaches and analyzes them in depth. Result: Finds significant limitations in existing methods in preserving fine-grained textures and geometric structures under large-scale degradation. Conclusion: Future research should focus on domain-specific architectures and robust evaluation protocols to bridge the gap between synthetic and real-world scenarios. Abstract: Remote sensing image super-resolution (RSISR) is a crucial task in remote sensing image processing, aiming to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Despite the growing number of RSISR methods proposed in recent years, a systematic and comprehensive review of these methods is still lacking. This paper presents a thorough review of RSISR algorithms, covering methodologies, datasets, and evaluation metrics. We provide an in-depth analysis of RSISR methods, categorizing them into supervised, unsupervised, and quality evaluation approaches, to help researchers understand current trends and challenges. Our review also discusses the strengths, limitations, and inherent challenges of these techniques. Notably, our analysis reveals significant limitations in existing methods, particularly in preserving fine-grained textures and geometric structures under large-scale degradation. Based on these findings, we outline future research directions, highlighting the need for domain-specific architectures and robust evaluation protocols to bridge the gap between synthetic and real-world RSISR scenarios.

[263] Can Large Language Models Challenge CNNs in Medical Image Analysis?

Shibbir Ahmed,Shahnewaz Karim Sakib,Anindya Bijoy Das

Main category: eess.IV

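TL;DR: The study presents a multimodal AI framework for precisely classifying medical diagnostic images, compares the performance of CNNs and LLMs, and finds that applying additional filtering on top of LLMs can substantially improve performance.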

Details Motivation: Improve the accuracy and efficiency of medical diagnostic image classification and explore the potential of multimodal AI in clinical diagnosis. Method: Using publicly available datasets, compares CNNs and LLMs on accuracy, execution efficiency, and environmental impact. Result: CNN-based models can outperform multimodal techniques in some respects, but additional filtering on top of LLMs can substantially improve performance. Conclusion: Multimodal AI systems have the potential to improve the reliability, efficiency, and scalability of medical diagnostics. Abstract: This study presents a multimodal AI framework designed for precisely classifying medical diagnostic images. Utilizing publicly available datasets, the proposed system compares the strengths of convolutional neural networks (CNNs) and different large language models (LLMs). This in-depth comparative analysis highlights key differences in diagnostic performance, execution efficiency, and environmental impacts. Model evaluation was based on accuracy, F1-score, average execution time, average energy consumption, and estimated CO$_2$ emission. The findings indicate that although CNN-based models can outperform various multimodal techniques that incorporate both images and contextual information, applying additional filtering on top of LLMs can lead to substantial performance gains. These findings highlight the transformative potential of multimodal AI systems to enhance the reliability, efficiency, and scalability of medical diagnostics in clinical settings.

[264] PCA for Enhanced Cross-Dataset Generalizability in Breast Ultrasound Tumor Segmentation

Christian Schmidt,Heinrich Martin Overhoff

Main category: eess.IV

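TL;DR: The paper proposes a new principal component analysis (PCA)-based approach to improve the external validity of medical ultrasound image segmentation models on unseen datasets. PCA preprocessing reduces noise while retaining key features, and experiments show significant gains in recall and Dice scores.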

Details Motivation: Medical image segmentation models suffer from limited external validity when deployed across datasets, especially in the ultrasound domain. Existing solutions (such as domain adaptation and GANs) have limited effect in the medical domain, where datasets are small and diverse. Method: Preprocesses six breast tumor ultrasound datasets with PCA to create dimensionality-reduced PCA datasets and trains U-Net segmentation models. Compares cross-domain test performance on original and PCA datasets. Result: PCA-reconstructed datasets significantly improve recall (0.57→0.70) and Dice scores (0.50→0.58) and reduce the decline in recall under external validation by 33%. Conclusion: PCA reconstruction is an effective way to improve the external validity of medical image segmentation models, particularly in challenging cases. Abstract: In medical image segmentation, limited external validity remains a critical obstacle when models are deployed across unseen datasets, an issue particularly pronounced in the ultrasound image domain. Existing solutions, such as domain adaptation and GAN-based style transfer, while promising, often fall short in the medical domain where datasets are typically small and diverse. This paper presents a novel application of principal component analysis (PCA) to address this limitation. PCA preprocessing reduces noise and emphasizes essential features by retaining approximately 90% of the dataset variance. We evaluate our approach across six diverse breast tumor ultrasound datasets comprising 3,983 B-mode images and corresponding expert tumor segmentation masks. For each dataset, a corresponding dimensionality-reduced PCA dataset is created and U-Net-based segmentation models are trained on each of the twelve datasets. Each model trained on an original dataset was evaluated on the remaining five out-of-domain original datasets (baseline results), while each model trained on a PCA dataset was evaluated on five out-of-domain PCA datasets. Our experimental results indicate that using PCA reconstructed datasets, instead of original images, improves the model's recall and Dice scores, particularly for model-dataset pairs where baseline performance was lowest, achieving statistically significant gains in recall (0.57 $\pm$ 0.07 vs. 0.70 $\pm$ 0.05, $p = 0.0004$) and Dice scores (0.50 $\pm$ 0.06 vs. 0.58 $\pm$ 0.06, $p = 0.03$). Our method reduced the decline in recall values due to external validation by $33\%$. These findings underscore the potential of PCA reconstruction as a safeguard to mitigate declines in segmentation performance, especially in challenging cases, with implications for enhancing external validity in real-world medical applications.
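
As a rough illustration of PCA reconstruction that retains about 90% of the dataset variance, the sketch below projects flattened images onto the leading principal components and maps them back to image space. The abstract does not give the authors' exact pipeline, so the function name `pca_reconstruct`, the per-dataset flattening, and the SVD-based implementation are assumptions.

```python
import numpy as np


def pca_reconstruct(images: np.ndarray, variance_to_keep: float = 0.90):
    """Reconstruct images from the fewest principal components whose
    cumulative explained variance reaches `variance_to_keep`.

    images: array of shape (n_images, height, width)
    Returns (reconstructed images with the same shape, number of components).
    """
    X = images.reshape(len(images), -1).astype(np.float64)
    mean = X.mean(axis=0)
    Xc = X - mean  # center each pixel across the dataset
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2
    total = explained.sum()
    if total == 0:  # all images identical: nothing to reduce
        return images.astype(np.float64).copy(), 0
    ratio = np.cumsum(explained) / total
    # First component count whose cumulative ratio reaches the target.
    k = min(int(np.searchsorted(ratio, variance_to_keep)) + 1, len(S))
    X_rec = (Xc @ Vt[:k].T) @ Vt[:k] + mean  # project, then map back
    return X_rec.reshape(images.shape), k


# Toy usage on random data standing in for B-mode images.
rng = np.random.default_rng(0)
toy_images = rng.random((20, 8, 8))
reconstructed, n_components = pca_reconstruct(toy_images, 0.90)
```

Discarding the low-variance components is what removes fine-grained noise; the reconstructed images keep the original resolution, so they can be fed to a segmentation model unchanged.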

[265] ImmunoDiff: A Diffusion Model for Immunotherapy Response Prediction in Lung Cancer

Moinak Bhattacharya,Judy Huang,Amna F. Sher,Gagandeep Singh,Chao Chen,Prateek Prasanna

Main category: eess.IV

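TL;DR: ImmunoDiff is a diffusion-based framework that synthesizes post-treatment CT from baseline CT, incorporating anatomical priors and clinical data, and significantly improves the accuracy of immunotherapy response prediction in NSCLC.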

Details Motivation: Accurately predicting immunotherapy response in non-small cell lung cancer (NSCLC) is an unmet clinical need, and existing models struggle to capture the complex morphological and textural changes induced by treatment. Method: Proposes ImmunoDiff, which combines anatomical priors (such as lobar and vascular structures) with clinical data (such as PD-L1 expression) and integrates multimodal data through a cbi-Adapter module. Result: On an NSCLC cohort, balanced accuracy for response prediction improves by 21.24% and the c-index for survival prediction increases by 0.03. Conclusion: By integrating anatomical and clinical data, ImmunoDiff significantly improves immunotherapy response prediction. Abstract: Accurately predicting immunotherapy response in Non-Small Cell Lung Cancer (NSCLC) remains a critical unmet need. Existing radiomics and deep learning-based predictive models rely primarily on pre-treatment imaging to predict categorical response outcomes, limiting their ability to capture the complex morphological and textural transformations induced by immunotherapy. This study introduces ImmunoDiff, an anatomy-aware diffusion model designed to synthesize post-treatment CT scans from baseline imaging while incorporating clinically relevant constraints. The proposed framework integrates anatomical priors, specifically lobar and vascular structures, to enhance fidelity in CT synthesis. Additionally, we introduce a novel cbi-Adapter, a conditioning module that ensures pairwise-consistent multimodal integration of imaging and clinical data embeddings, to refine the generative process. Furthermore, a clinical variable conditioning mechanism is introduced, leveraging demographic data, blood-based biomarkers, and PD-L1 expression to refine the generative process. Evaluations on an in-house NSCLC cohort treated with immune checkpoint inhibitors demonstrate a 21.24% improvement in balanced accuracy for response prediction and a 0.03 increase in c-index for survival prediction. Code will be released soon.

[266] MRI Image Generation Based on Text Prompts

Xinxian Fan,Mengye Lyu

Main category: eess.IV

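TL;DR: The study explores the feasibility of generating MRI images with a text-prompted Stable Diffusion model to address the high cost, scarcity of rare-case samples, and privacy concerns of acquiring real MRI datasets.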

Details Motivation: Address the challenges of acquiring real MRI datasets, such as high cost, limited rare-case samples, and privacy concerns. Method: Fine-tunes a pre-trained SD model on the 3T fastMRI and 0.3T M4Raw datasets to generate brain T1, T2, and FLAIR images at different magnetic field strengths, and evaluates performance with FID and MS-SSIM. Result: The fine-tuned model improves image quality and semantic consistency, and the synthetic images effectively augment training datasets and improve performance on an MRI contrast classification task. Conclusion: Text-prompted MRI image generation is feasible and can serve as a useful tool for medical AI applications. Abstract: This study explores the use of text-prompted MRI image generation with the Stable Diffusion (SD) model to address challenges in acquiring real MRI datasets, such as high costs, limited rare case samples, and privacy concerns. The SD model, pre-trained on natural images, was fine-tuned using the 3T fastMRI dataset and the 0.3T M4Raw dataset, with the goal of generating brain T1, T2, and FLAIR images across different magnetic field strengths. The performance of the fine-tuned model was evaluated using quantitative metrics, including Fréchet Inception Distance (FID) and Multi-Scale Structural Similarity (MS-SSIM), showing improvements in image quality and semantic consistency with the text prompts. To further evaluate the model's potential, a simple classification task was carried out using a small 0.35T MRI dataset, demonstrating that the synthetic images generated by the fine-tuned SD model can effectively augment training datasets and improve the performance of MRI contrast classification tasks. Overall, our findings suggest that text-prompted MRI image generation is feasible and can serve as a useful tool for medical AI applications.

[267] DeepMultiConnectome: Deep Multi-Task Prediction of Structural Connectomes Directly from Diffusion MRI Tractography

Marcus J. Vroemen,Yuqian Chen,Yui Lo,Tengfei Xu,Weidong Cai,Fan Zhang,Josien P. W. Pluim,Lauren J. O'Donnell

Main category: eess.IV

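TL;DR: DeepMultiConnectome is a deep learning model that predicts structural connectomes directly from tractography without gray matter parcellation, supports multiple parcellation schemes, runs quickly, and yields results highly correlated with those of the traditional method.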

Details Motivation: Traditional connectome generation is time-consuming and requires gray matter parcellation, which limits large-scale studies. Method: Using a point-cloud-based neural network with multi-task learning, the model classifies streamlines under two parcellation schemes while sharing a learned representation. Result: Predicted connectomes are highly correlated with traditionally generated ones (r = 0.992 and 0.986) and preserve network properties; test-retest analysis shows good reproducibility. Conclusion: DeepMultiConnectome provides a scalable, fast solution for generating subject-specific connectomes across multiple parcellation schemes. Abstract: Diffusion MRI (dMRI) tractography enables in vivo mapping of brain structural connections, but traditional connectome generation is time-consuming and requires gray matter parcellation, posing challenges for large-scale studies. We introduce DeepMultiConnectome, a deep-learning model that predicts structural connectomes directly from tractography, bypassing the need for gray matter parcellation while supporting multiple parcellation schemes. Using a point-cloud-based neural network with multi-task learning, the model classifies streamlines according to their connected regions across two parcellation schemes, sharing a learned representation. We train and validate DeepMultiConnectome on tractography from the Human Connectome Project Young Adult dataset ($n = 1000$), labeled with 84-region and 164-region gray matter parcellation schemes. DeepMultiConnectome predicts multiple structural connectomes from a whole-brain tractogram containing 3 million streamlines in approximately 40 seconds. DeepMultiConnectome is evaluated by comparing predicted connectomes with traditional connectomes generated using the conventional method of labeling streamlines using a gray matter parcellation. The predicted connectomes are highly correlated with traditionally generated connectomes ($r = 0.992$ for an 84-region scheme; $r = 0.986$ for a 164-region scheme) and largely preserve network properties. A test-retest analysis of DeepMultiConnectome demonstrates reproducibility comparable to traditionally generated connectomes. The predicted connectomes perform similarly to traditionally generated connectomes in predicting age and cognitive function. Overall, DeepMultiConnectome provides a scalable, fast model for generating subject-specific connectomes across multiple parcellation schemes.
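
To make the evaluation concrete, the sketch below shows how per-streamline region labels can be turned into a connectome matrix and how two connectomes can be compared with a Pearson correlation. It assumes each streamline already carries a pair of endpoint region indices (whether predicted or derived from a parcellation); the function names and the choice to correlate upper-triangle entries are illustrative assumptions, not the paper's stated protocol.

```python
import numpy as np


def build_connectome(endpoint_labels, n_regions: int) -> np.ndarray:
    """Count streamlines per region pair.

    endpoint_labels: (n_streamlines, 2) integer array of region indices
    in [0, n_regions). Returns a symmetric (n_regions, n_regions) matrix.
    """
    labels = np.asarray(endpoint_labels, dtype=np.int64)
    conn = np.zeros((n_regions, n_regions), dtype=np.int64)
    np.add.at(conn, (labels[:, 0], labels[:, 1]), 1)  # unbuffered counting
    # Symmetrize without double-counting self-connections on the diagonal.
    return conn + conn.T - np.diag(np.diag(conn))


def connectome_correlation(pred: np.ndarray, ref: np.ndarray) -> float:
    """Pearson r between the upper-triangle entries of two connectomes."""
    iu = np.triu_indices_from(pred)
    return float(np.corrcoef(pred[iu], ref[iu])[0, 1])


# Toy usage: four streamlines over three regions.
toy_labels = [[0, 1], [1, 0], [2, 2], [0, 2]]
toy_connectome = build_connectome(toy_labels, n_regions=3)
self_r = connectome_correlation(toy_connectome, toy_connectome)
```

With this representation, supporting a second parcellation scheme only requires a second set of labels per streamline, which is consistent with the multi-task classification described in the abstract.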

[268] Plug-and-Play Posterior Sampling for Blind Inverse Problems

Anqi Li,Weijie Gan,Ulugbek S. Kamilov

Main category: eess.IV

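TL;DR: Blind-PnPDM is a new framework for blind inverse problems in which both the target and the measurement operator are unknown; it performs posterior sampling through an alternating Gaussian denoising scheme and outperforms existing methods.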

Details Motivation: Conventional methods rely on explicit priors or separate parameter estimation and cannot adapt flexibly to blind inverse problems. Method: Uses two diffusion models as learned priors, one capturing the distribution of the target image and the other characterizing the measurement operator parameters, and performs posterior sampling via alternating Gaussian denoising. Result: In blind image deblurring experiments, Blind-PnPDM outperforms existing methods in quantitative metrics and visual fidelity. Conclusion: Treating blind inverse problems as a sequence of denoising subproblems, combined with the expressive power of diffusion priors, is an effective approach. Abstract: We introduce Blind Plug-and-Play Diffusion Models (Blind-PnPDM) as a novel framework for solving blind inverse problems where both the target image and the measurement operator are unknown. Unlike conventional methods that rely on explicit priors or separate parameter estimation, our approach performs posterior sampling by recasting the problem into an alternating Gaussian denoising scheme. We leverage two diffusion models as learned priors: one to capture the distribution of the target image and another to characterize the parameters of the measurement operator. This PnP integration of diffusion models ensures flexibility and ease of adaptation. Our experiments on blind image deblurring show that Blind-PnPDM outperforms state-of-the-art methods in terms of both quantitative metrics and visual fidelity. Our results highlight the effectiveness of treating blind inverse problems as a sequence of denoising subproblems while harnessing the expressive power of diffusion-based priors.

[269] Synthetic Generation and Latent Projection Denoising of Rim Lesions in Multiple Sclerosis

Alexandra G. Roberts,Ha M. Luu,Mert Şişman,Alexey V. Dimov,Ceren Tozlu,Ilhami Kovanlikaya,Susan A. Gauthier,Thanh D. Nguyen,Yi Wang

Main category: eess.IV

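TL;DR: The paper proposes generating synthetic quantitative susceptibility maps to improve classification of paramagnetic rim lesions (PRLs) in multiple sclerosis, and uses a denoising method to enlarge the minority class.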

Details Motivation: Paramagnetic rim lesions (PRLs) are an emerging biomarker in multiple sclerosis, but their rarity makes classification a class-imbalance problem. Method: Generate synthetic quantitative susceptibility maps, extend them to multi-channel contrasts, and propose a denoising method that enlarges the minority class. Result: Synthetic data and the denoising method markedly improve rim lesion classification and yield clinically interpretable detections. Conclusion: The method is an effective tool for detecting rim lesions in multiple sclerosis; code and generated data are released. Abstract: Quantitative susceptibility maps from magnetic resonance images can provide both prognostic and diagnostic information in multiple sclerosis, a neurodegenerative disease characterized by the formation of lesions in white matter brain tissue. In particular, susceptibility maps provide adequate contrast to distinguish between "rim" lesions, surrounded by deposited paramagnetic iron, and "non-rim" lesion types. These paramagnetic rim lesions (PRLs) are an emerging biomarker in multiple sclerosis. Much effort has been devoted to both detection and segmentation of such lesions to monitor longitudinal change. As paramagnetic rim lesions are rare, addressing this problem requires confronting the class imbalance between rim and non-rim lesions. We produce synthetic quantitative susceptibility maps of paramagnetic rim lesions and show that including such synthetic data improves classifier performance. We also provide a multi-channel extension to generate accompanying contrasts and probabilistic segmentation maps. We exploit the projection capability of our trained generative network to demonstrate a novel denoising approach that allows us to train on ambiguous rim cases and substantially increase the minority class. We show that both synthetic lesion synthesis and our proposed rim lesion label denoising method best approximate the unseen rim lesion distribution and improve detection in a clinically interpretable manner. We release our code and generated data at https://github.com/agr78/PRLx-GAN upon publication.

cs.AI [Back]

[270] Decomposing Elements of Problem Solving: What "Math" Does RL Teach?

Tian Qin,Core Francisco Park,Mujin Kwun,Aaron Walsman,Eran Malach,Nikhil Anand,Hidenori Tanaka,David Alvarez-Melis

Main category: cs.AI

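TL;DR: The paper studies how reinforcement learning (RL) improves the mathematical reasoning of LLMs. It finds that RL mainly strengthens execution, while weak planning makes models hit a 'coverage wall' on new problems. A synthetic task confirms this and points to possible ways past the limitation.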

Details Motivation: Existing evaluations rely on accuracy alone and cannot assess LLM problem-solving ability at a fine grain. The study decomposes problem solving to reveal what RL actually changes in LLM reasoning. Method: Decompose problem solving into three basic capabilities (Plan, Execute, Verify) and analyze the role of RL through empirical studies and a synthetic task. Result: RL mainly improves execution (the temperature distillation phenomenon) but is limited on new problems by weak planning. The synthetic task confirms this limitation and suggests possible improvements. Conclusion: RL's gains for LLM reasoning are concentrated in execution, and planning is the bottleneck. Future work may overcome this through better exploration and generalization. Abstract: Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.

[271] Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning

Massimiliano Pronesti,Michela Lorandi,Paul Flanagan,Oisin Redmon,Anya Belz,Yufang Hou

Main category: cs.AI

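TL;DR: The paper proposes a quantitative-reasoning approach that extracts structured numeric evidence and applies domain-informed logic, improving automated evidence synthesis for medical systematic reviews.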

Details Motivation: The bottleneck in automating numeric evidence extraction and study-level conclusions for medical systematic reviews is that existing methods rely on shallow textual cues and miss the numeric reasoning behind expert assessments. Method: Build a numeric reasoning system with a numeric data extraction model and an effect estimate component, training the model with supervised fine-tuning and reinforcement learning. Result: On the CochraneForest benchmark, the best approach (a small numeric extraction model trained with RL) improves F1 by up to 21% over retrieval-based systems and beats LLMs of over 400B parameters by up to 9%. Conclusion: Reasoning-driven approaches show promise for automating systematic evidence synthesis. Abstract: Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level conclusions for specific outcomes and comparisons. Prior work has framed this problem as a textual inference task by retrieving relevant content fragments and inferring conclusions from them. However, such approaches often rely on shallow textual cues and fail to capture the underlying numeric reasoning behind expert assessments. In this work, we conceptualise the problem as one of quantitative reasoning. Rather than inferring conclusions from surface text, we extract structured numerical evidence (e.g., event counts or standard deviations) and apply domain-knowledge-informed logic to derive outcome-specific conclusions. We develop a numeric reasoning system composed of a numeric data extraction model and an effect estimate component, enabling more accurate and interpretable inference aligned with the domain expert principles. We train the numeric data extraction model using different strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model. When evaluated on the CochraneForest benchmark, our best-performing approach, which uses RL to train a small-scale number extraction model, yields up to a 21% absolute improvement in F1 score over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9%. Our results demonstrate the promise of reasoning-driven approaches for automating systematic evidence synthesis.
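
To make the "extract counts, then apply domain logic" idea concrete, here is a minimal sketch of one possible effect estimate step for a binary outcome. It is not the paper's component: the function name `risk_ratio_conclusion`, the choice of a risk ratio with a log-method 95% confidence interval, and the three conclusion labels are all assumptions made for illustration, and the sketch assumes non-zero event counts in both arms.

```python
import math


def risk_ratio_conclusion(events_t, n_t, events_c, n_c, z=1.96):
    """Hypothetical effect-estimate step for a binary outcome.

    Takes extracted event counts for treatment (t) and control (c) arms,
    returns (risk_ratio, ci_low, ci_high, label).
    Assumes events_t > 0 and events_c > 0 (no continuity correction).
    """
    risk_t = events_t / n_t
    risk_c = events_c / n_c
    rr = risk_t / risk_c

    # Standard error of log(RR), the usual large-sample approximation.
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    ci_low = math.exp(math.log(rr) - z * se)
    ci_high = math.exp(math.log(rr) + z * se)

    # Illustrative decision rule: the conclusion depends on whether the
    # confidence interval excludes 1 (no effect).
    if ci_high < 1.0:
        label = "fewer events with intervention"
    elif ci_low > 1.0:
        label = "more events with intervention"
    else:
        label = "no significant difference"
    return rr, ci_low, ci_high, label


if __name__ == "__main__":
    print(risk_ratio_conclusion(15, 100, 30, 100))
    print(risk_ratio_conclusion(20, 100, 22, 100))
```

The point of the sketch is that once the counts are extracted correctly, the conclusion follows from arithmetic rather than from surface wording, which is the property the abstract argues for.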

[272] Be.FM: Open Foundation Models for Human Behavior

Yutong Xie,Zhuoheng Li,Xiyuan Wang,Yijun Pan,Qijia Liu,Xingzhi Cui,Kuang-Yu Lo,Ruoyi Gao,Xingjian Zhang,Jin Huang,Walter Yuan,Matthew O. Jackson,Qiaozhu Mei

Main category: cs.AI

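TL;DR: Be.FM is a behavioral foundation model built on open-source large language models for understanding and predicting human decision-making, and it performs well across multiple tasks.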

Details Motivation: Explore the potential of foundation models for modeling human behavior and fill the gap in this area. Method: Build Be.FM by fine-tuning open-source large language models on diverse behavioral data. Result: Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge. Conclusion: Be.FM is an effective tool for human behavior modeling and demonstrates the potential of foundation models in this field. Abstract: Despite their success in numerous fields, the potential of foundation models for modeling and understanding human behavior remains largely unexplored. We introduce Be.FM, one of the first open foundation models designed for human behavior modeling. Built upon open-source large language models and fine-tuned on a diverse range of behavioral data, Be.FM can be used to understand and predict human decision-making. We construct a comprehensive set of benchmark tasks for testing the capabilities of behavioral foundation models. Our results demonstrate that Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge.

[273] Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models

Zeyu Liu,Yuhang Liu,Guanghao Zhu,Congkai Xie,Zhen Li,Jianbo Yuan,Xinyao Wang,Qing Li,Shing-Chi Cheung,Shengyu Zhang,Fei Wu,Hongxia Yang

Main category: cs.AI

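TL;DR: The paper proposes the Infi-MMR framework, which improves the multimodal reasoning of MSLMs through a three-phase curriculum and achieves state-of-the-art results on several benchmarks.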

Details Motivation: Although LLMs have advanced in reasoning, extending this to MLLMs and MSLMs faces problems such as scarce datasets and reasoning degradation caused by visual processing. Method: A three-phase framework (Foundational Reasoning Activation, Cross-Modal Reasoning Adaptation, Multimodal Reasoning Enhancement) that builds up the model's capabilities step by step. Result: Infi-MMR-3B performs strongly on multimodal math reasoning (e.g., MathVerse) and general reasoning (e.g., MathVista) benchmarks. Conclusion: The Infi-MMR framework addresses the challenges MSLMs face in multimodal reasoning and markedly improves performance. Abstract: Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. 
Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini).

[274] Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns

Xiang Li,Haiyang Yu,Xinghua Zhang,Ziyang Huang,Shizhu He,Kang Liu,Jun Zhao,Fei Huang,Yongbin Li

Main category: cs.AI

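TL;DR: Socratic-PRMBench is a new benchmark for systematically evaluating process reward models (PRMs) under six reasoning patterns, filling a gap left by existing benchmarks.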

Details Motivation: Existing benchmarks focus mainly on stepwise correctness and lack a systematic evaluation of PRM errors under different reasoning patterns. Method: Introduce Socratic-PRMBench, which contains 2995 flawed reasoning paths covering six reasoning patterns. Result: Experiments reveal notable deficiencies in existing PRMs across reasoning patterns. Conclusion: Socratic-PRMBench offers a comprehensive testbed for systematic PRM evaluation and supports future PRM development. Abstract: Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.

[275] ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Chenyu Yang,Shiqian Su,Shi Liu,Xuan Dong,Yue Yu,Weijie Su,Xuehui Wang,Zhaoyang Liu,Jinguo Zhu,Hao Li,Wenhai Wang,Yu Qiao,Xizhou Zhu,Jifeng Dai

Main category: cs.AI

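TL;DR: ZeroGUI is an online learning framework that needs no human annotation; it uses a VLM to generate tasks and rewards automatically and significantly improves GUI agent performance.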

Details Motivation: Existing GUI agent methods depend on high-quality manual annotation and adapt poorly to dynamic environments; ZeroGUI aims to solve both problems. Method: Combine VLM-based automatic task generation, VLM-based reward estimation, and two-stage online reinforcement learning to train at zero human cost. Result: Validated on UI-TARS and Aguvis, ZeroGUI performs strongly in the OSWorld and AndroidLab environments. Conclusion: ZeroGUI offers an efficient, scalable solution for training GUI agents. Abstract: The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.

cs.SE [Back]

[276] SWE-bench Goes Live!

Linghao Zhang,Shilin He,Chaoyun Zhang,Yu Kang,Bowen Li,Chengxing Xie,Junhao Wang,Maoquan Wang,Yufan Huang,Shengyu Fu,Elsie Nallipogu,Qingwei Lin,Yingnong Dang,Saravan Rajmohan,Dongmei Zhang

Main category: cs.SE

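TL;DR: SWE-bench-Live is a continuously updated benchmark that addresses the limitations of static benchmarks such as SWE-bench, including stale data, narrow coverage, and reliance on manual effort.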

Details Motivation: Existing benchmarks such as SWE-bench suffer from stale data, limited coverage, and reliance on manual effort, which hinder scalability and risk overfitting and data contamination. Method: Propose SWE-bench-Live, which builds tasks from live GitHub issues through an automated pipeline and ships a Docker image with each task for reproducibility. Result: Even under controlled conditions, existing LLMs and agent frameworks show a substantial performance gap on SWE-bench-Live compared with static benchmarks. Conclusion: SWE-bench-Live provides a fresh, diverse, executable benchmark for dynamic, real-world software development and supports rigorous evaluation of LLMs and agents. Abstract: The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. 
By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.

[277] Identity resolution of software metadata using Large Language Models

Eva Martín del Pico,Josep Lluís Gelpí,Salvador Capella-Gutiérrez

Main category: cs.SE

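TL;DR: The article discusses the importance of research software and evaluates instruction-tuned large language models on software metadata identity resolution, with the aim of improving the FAIRness of research software in the Life Sciences.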

Details Motivation: Research software is essential to science, yet its importance is often overlooked. The article aims to support the sustainability of research software by consolidating and analyzing software metadata. Method: Use structured metadata from bioinformatics platforms (such as bio.tools, Bioconductor, and Galaxy ToolShed) to evaluate instruction-tuned large language models on software metadata identity resolution. Result: The models were examined on ambiguous cases, and an agreement-based proxy achieved high precision and statistical robustness while exposing the limitations of current models. Conclusion: Automating semantic judgment over FAIR-aligned software metadata across registries and repositories remains challenging and calls for better models and methods. Abstract: Software is an essential component of research. However, little attention has been paid to it compared with that paid to research data. Recently, there has been an increase in efforts to acknowledge and highlight the importance of software in research activities. Structured metadata from platforms like bio.tools, Bioconductor, and Galaxy ToolShed offers valuable insights into research software in the Life Sciences. Although originally intended to support discovery and integration, this metadata can be repurposed for large-scale analysis of software practices. However, its quality and completeness vary across platforms, reflecting diverse documentation practices. To gain a comprehensive view of software development and sustainability, consolidating this metadata is necessary, but requires robust mechanisms to address its heterogeneity and scale. This article presents an evaluation of instruction-tuned large language models for the task of software metadata identity resolution, a critical step in assembling a cohesive collection of research software. Such a collection is the reference component for the Software Observatory at OpenEBench, a platform that aggregates metadata to monitor the FAIRness of research software in the Life Sciences. We benchmarked multiple models against a human-annotated gold standard, examined their behavior on ambiguous cases, and introduced an agreement-based proxy for high-confidence automated decisions. 
The proxy achieved high precision and statistical robustness, while also highlighting the limitations of current models and the broader challenges of automating semantic judgment in FAIR-aligned software metadata across registries and repositories.

[278] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Manish Shetty,Naman Jain,Jinjian Liu,Vijay Kethanaboyina,Koushik Sen,Ion Stoica

Main category: cs.SE

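TL;DR: GSO is a benchmark for evaluating language models on high-performance software development. An automated pipeline generates and runs performance tests to identify 102 optimization tasks. Leading SWE-Agents perform poorly, with a success rate below 5%.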

Details Motivation: Developing high-performance software requires specialized expertise, and the ability of current language models in this area has not been adequately evaluated. Method: Build an automated pipeline that generates performance tests, analyzes repository commit histories, and identifies optimization tasks, giving the agent a codebase and a performance test as the specification. Result: Leading SWE-Agents achieve a success rate below 5% with limited improvement. Qualitative analysis reveals failure modes such as lazy optimization strategies and difficulty localizing bottlenecks. Conclusion: The GSO benchmark exposes the limitations of language models in high-performance software development and provides data and tools for future research. Abstract: Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories, identifying 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages. An agent is provided with a codebase and performance test as a precise specification, and tasked with improving the runtime efficiency, which is measured against the expert developer optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.

cs.LG [Back]

[279] FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

Aniruddha Nrusimha,William Brandon,Mayank Mishra,Yikang Shen,Rameswar Panda,Jonathan Ragan-Kelley,Yoon Kim

Main category: cs.LG

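TL;DR: FlashFormer is a kernel optimized for single-batch inference that accelerates transformer-based large language models and outperforms existing kernels in low-batch inference.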

Details Motivation: Existing kernels are optimized mainly for large-batch training and inference, leaving low-batch inference (e.g., edge deployment and latency-sensitive applications) underserved. Method: Develop FlashFormer, an optimized kernel designed specifically for single-batch inference. Result: Across model sizes and quantization settings, FlashFormer achieves notable speedups over existing state-of-the-art inference kernels. Conclusion: FlashFormer shows the potential of kernel optimization for low-batch inference and offers an effective solution for edge and latency-sensitive applications. Abstract: The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for training and inference. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, a proof-of-concept kernel for accelerating single-batch inference for transformer-based large language models. Across various model sizes and quantization settings, we observe nontrivial speedups compared to existing state-of-the-art inference kernels.

[280] DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration

Tianteng Gu,Bei Liu,Bo Xiao,Ke Zeng,Jiacheng Liu,Yanmin Qian

Main category: cs.LG

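TL;DR: The paper proposes DenoiseRotator, a new pruning method that redistributes parameter importance to make pruning more robust and significantly reduces performance degradation.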

Details Motivation: Existing pruning methods focus on estimating the importance of individual weights, which leads to noticeable performance degradation, especially under semi-structured sparsity constraints. Method: Minimize the information entropy of normalized importance scores, using learnable orthogonal transformations (DenoiseRotator) to concentrate importance onto a smaller subset of weights. Result: On LLaMA3, Qwen2.5, and Mistral models, DenoiseRotator markedly narrows the perplexity gap (e.g., by 58% for LLaMA3-70B). Conclusion: DenoiseRotator is model-agnostic, integrates seamlessly with existing pruning techniques, and markedly improves pruning results. Abstract: Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation, especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model's weight matrices. Our method is model-agnostic and can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Codes are available at https://github.com/Axel-gu/DenoiseRotator.
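
The two ingredients named in the abstract, an entropy over normalized importance scores and an orthogonal transformation of the weights, can be illustrated with a small NumPy sketch. This is not the DenoiseRotator implementation: the helper `importance_entropy` is hypothetical, squared weight magnitude is only a stand-in importance score, and the rotation below is random rather than learned, so it merely shows that an orthogonal rotation changes the entropy while leaving the layer output unchanged.

```python
import numpy as np


def importance_entropy(scores):
    """Shannon entropy of importance scores normalized to sum to 1.

    Lower entropy means importance is concentrated on fewer weights.
    """
    p = np.asarray(scores, dtype=float).ravel()
    p = p / p.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))  # toy weight matrix
x = rng.normal(size=8)       # toy input

# A random orthogonal matrix (a learned one would be trained to lower entropy).
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

# Rotate the weights and counter-rotate the input: (W Q)(Q^T x) = W x.
W_rot = W @ Q
x_rot = Q.T @ x

if __name__ == "__main__":
    print("output preserved:", np.allclose(W @ x, W_rot @ x_rot))
    # Stand-in importance score: squared magnitude of each weight.
    print("entropy before:", importance_entropy(W ** 2))
    print("entropy after: ", importance_entropy(W_rot ** 2))
```

Because the output is invariant to the rotation, the rotation can in principle be optimized purely for how prunable the rotated weights are, which is the freedom the method exploits.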

[281] MAP: Revisiting Weight Decomposition for Low-Rank Adaptation

Chongjie Si,Zhiyi Shi,Yadao Wang,Xiaokang Yang,Susanto Rahardja,Wei Shen

Main category: cs.LG

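TL;DR: The paper proposes MAP, a new framework that treats weight matrices as high-dimensional vectors and strictly decouples direction and magnitude adaptation, improving existing parameter-efficient fine-tuning (PEFT) methods.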

Details Motivation: Existing PEFT methods such as LoRA and DoRA lack a geometric foundation for directional adaptation, which limits their flexibility and interpretability. Method: MAP normalizes the pre-trained weights, learns a directional update, and introduces two scalar coefficients that independently scale the magnitudes of the base and update vectors. Result: Experiments show that MAP significantly improves existing PEFT methods and is easy to integrate. Conclusion: Given its universality and simplicity, MAP could become a default setting for designing future PEFT methods. Abstract: The rapid development of large language models has revolutionized natural language processing, but their fine-tuning remains computationally expensive, hindering broad deployment. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, have emerged as solutions. Recent work like DoRA attempts to further decompose weight adaptation into direction and magnitude components. However, existing formulations often define direction heuristically at the column level, lacking a principled geometric foundation. In this paper, we propose MAP, a novel framework that reformulates weight matrices as high-dimensional vectors and decouples their adaptation into direction and magnitude in a rigorous manner. MAP normalizes the pre-trained weights, learns a directional update, and introduces two scalar coefficients to independently scale the magnitude of the base and update vectors. This design enables more interpretable and flexible adaptation, and can be seamlessly integrated into existing PEFT methods. Extensive experiments show that MAP significantly improves performance when coupled with existing methods, offering a simple yet powerful enhancement to existing PEFT methods. Given the universality and simplicity of MAP, we hope it can serve as a default setting for designing future PEFT methods.
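
One way to read the description above (normalize the base weights, learn a direction, scale each with its own scalar) is sketched below in NumPy. This is an interpretation under stated assumptions, not the authors' formulation: the function name `map_style_weight`, the use of a LoRA-style low-rank product `B @ A` as the directional update, and the normalization of that update to unit norm are all illustrative choices.

```python
import numpy as np


def map_style_weight(W0, B, A, alpha, beta, eps=1e-12):
    """Hypothetical direction/magnitude decomposition of an adapted weight.

    W0    : frozen pre-trained weight matrix, treated as one long vector
    B, A  : low-rank factors of the directional update (assumption)
    alpha : scalar magnitude applied to the base direction
    beta  : scalar magnitude applied to the update direction
    """
    # The Frobenius norm of a matrix equals the Euclidean norm of its
    # flattened vector, so this normalizes W0 "as a high-dimensional vector".
    base_dir = W0 / (np.linalg.norm(W0) + eps)

    delta = B @ A
    delta_dir = delta / (np.linalg.norm(delta) + eps)

    return alpha * base_dir + beta * delta_dir


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W0 = rng.normal(size=(6, 4))
    B = rng.normal(size=(6, 2))
    A = rng.normal(size=(2, 4))

    # With alpha = ||W0|| and beta = 0 the original weights are recovered,
    # a natural starting point before any adaptation.
    W_init = map_style_weight(W0, B, A, alpha=np.linalg.norm(W0), beta=0.0)
    print(np.allclose(W_init, W0))
```

In this reading only `alpha`, `beta`, `B`, and `A` would be trainable, so the two scalars add almost no parameters on top of the underlying low-rank method.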

[282] Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs

Haokun Chen,Yueqi Zhang,Yuan Bi,Yao Zhang,Tong Liu,Jinhe Bi,Jian Lan,Jindong Gu,Claudia Grosser,Denis Krompass,Nassir Navab,Volker Tresp

Main category: cs.LG

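TL;DR: The paper proposes an auditing framework for evaluating unlearning algorithms in large language models (LLMs), covering datasets, algorithms, and auditing methods, and introduces a new technique based on intermediate activation perturbations.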

Details Motivation: LLM training data may contain sensitive or copyrighted content, and existing evaluations of unlearning algorithms are limited, so a more comprehensive auditing framework is needed. Method: Propose a framework with three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods, and introduce a new technique based on intermediate activation perturbations. Result: The effectiveness and robustness of different unlearning strategies are evaluated with a variety of auditing algorithms. Conclusion: The proposed framework and technique provide a more comprehensive way to evaluate LLM unlearning algorithms. Abstract: In recent years, Large Language Models (LLMs) have achieved remarkable advancements, drawing significant attention from the research community. Their capabilities are largely attributed to large-scale architectures, which require extensive training on massive datasets. However, such datasets often contain sensitive or copyrighted content sourced from the public internet, raising concerns about data privacy and ownership. Regulatory frameworks, such as the General Data Protection Regulation (GDPR), grant individuals the right to request the removal of such sensitive information. This has motivated the development of machine unlearning algorithms that aim to remove specific knowledge from models without the need for costly retraining. Despite these advancements, evaluating the efficacy of unlearning algorithms remains a challenge due to the inherent complexity and generative nature of LLMs. In this work, we introduce a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. By using various auditing algorithms, we evaluate the effectiveness and robustness of different unlearning strategies. To explore alternatives beyond prompt-based auditing, we propose a novel technique that leverages intermediate activation perturbations, addressing the limitations of auditing methods that rely solely on model inputs and outputs.

[283] Rethinking Regularization Methods for Knowledge Graph Completion

Linyu Li,Zhi Jin,Yuanpeng He,Dongming Jin,Haoran Duan,Zhengwei Tao,Xuan Zhang,Jiandong Li

Main category: cs.LG

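TL;DR: The paper rethinks how regularization is applied in knowledge graph completion (KGC) and proposes a novel sparse regularization method (SPR) that selectively penalizes the significant components of embedding vectors, markedly improving model performance.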

Details Motivation: Existing KGC models do not exploit the full potential of regularization; the paper explores its deeper role in KGC and proposes an improved method. Method: Analyze the effect of regularization on KGC models through empirical studies, and propose SPR, which selectively penalizes the significant components of the embedding vector. Result: SPR outperforms other regularization methods and raises the performance ceiling of KGC models. Conclusion: Carefully designed regularization such as SPR not only alleviates overfitting but also breaks through the performance bottleneck of KGC models. Abstract: Knowledge graph completion (KGC) has attracted considerable attention in recent years because it is critical to improving the quality of knowledge graphs. Researchers have continuously explored various models. However, most previous efforts have not examined regularization from a deeper perspective, so regularization has not been used to its full potential. This paper rethinks the application of regularization methods in KGC. Through extensive empirical studies on various KGC models, we find that carefully designed regularization not only alleviates overfitting and reduces variance but also enables these models to break through the upper bounds of their original performance. Furthermore, we introduce a novel sparse-regularization method (SPR) that embeds the concept of rank-based selective sparsity into the KGC regularizer. The core idea is to selectively penalize those components with significant features in the embedding vector, thus effectively ignoring many components that contribute little and may only represent noise. Various comparative experiments on multiple datasets and multiple models show that the SPR regularization method is better than other regularization methods and can enable the KGC model to further break through the performance margin.

Giorgos Iacovides,Wuyang Zhou,Chao Li,Qibin Zhao,Danilo Mandic

Main category: cs.LG

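TL;DR: The paper proposes tnLLM, a new framework that uses large language models (LLMs) and domain information to directly predict suitable tensor network structures, reducing computational cost and improving transparency.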

Details Motivation: Current tensor network structure search (TN-SS) methods are computationally expensive and lack domain information and structural transparency, so a more efficient solution is needed. Method: A domain-aware prompting pipeline guides the LLM to infer suitable tensor network structures and to generate domain-aware explanations. Result: tnLLM matches the performance of SOTA algorithms with fewer function evaluations and can accelerate the convergence of other methods. Conclusion: By combining LLMs with domain information, tnLLM markedly improves the efficiency and interpretability of TN-SS. Abstract: Tensor networks (TNs) provide efficient representations of high-dimensional data, yet identification of the optimal TN structures, the so-called tensor network structure search (TN-SS) problem, remains a challenge. Current state-of-the-art (SOTA) algorithms are computationally expensive as they require extensive function evaluations, which is prohibitive for real-world applications. In addition, existing methods ignore valuable domain information inherent in real-world tensor data and lack transparency in their identified TN structures. To this end, we propose a novel TN-SS framework, termed the tnLLM, which incorporates domain information about the data and harnesses the reasoning capabilities of large language models (LLMs) to directly predict suitable TN structures. The proposed framework involves a domain-aware prompting pipeline which instructs the LLM to infer suitable TN structures based on the real-world relationships between tensor modes. In this way, our approach is capable of not only iteratively optimizing the objective function, but also generating domain-aware explanations for the identified structures. Experimental results demonstrate that tnLLM achieves comparable TN-SS objective function values with much fewer function evaluations compared to SOTA algorithms. Furthermore, we demonstrate that the LLM-enabled domain information can be used to find good initializations in the search space for sampling-based SOTA methods to accelerate their convergence while preserving theoretical performance guarantees.

[285] Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Yiran Guo,Lijie Xu,Jie Liu,Dan Ye,Shuang Qiu

Main category: cs.LG

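TL;DR: The paper proposes SPO, a reinforcement learning framework that estimates advantages at an intermediate, segment-level granularity, balancing the weaknesses of fine-grained and coarse-grained methods and improving the reasoning ability of language models.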

Details Motivation: Existing methods sit at two extremes of advantage estimation granularity: token-level methods (e.g., PPO) estimate inaccurately because an accurate critic model is hard to train, while trajectory-level methods (e.g., GRPO) rely only on the final reward and assign credit imprecisely. Method: SPO comprises three novel strategies: flexible segment partition, accurate segment advantage estimation, and policy optimization using segment advantages (including a probability-mask strategy). It is instantiated as SPO-chain (short chain-of-thought) and SPO-tree (long chain-of-thought). Result: On GSM8K, SPO-chain improves on PPO and GRPO by 6-12 percentage points; on MATH500 under 2K and 4K context evaluation, SPO-tree improves on GRPO by 7-11 percentage points. Conclusion: With intermediate-granularity advantage estimation, SPO markedly improves the reasoning ability of language models while avoiding the drawbacks of fine-grained and coarse-grained methods. Abstract: Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: Token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving $6$-$12$ percentage point improvements in accuracy over PPO and GRPO on GSM8K. 
(2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving $7$-$11$ percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
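The idea of critic-free, segment-level advantages can be illustrated with a small sketch. It assumes (this is not stated in the abstract) that the value at each segment boundary is estimated as the mean final reward of a few sampled completions, and that a segment's advantage is the change in that value across the segment. All function names and numbers are hypothetical.

```python
from statistics import mean


def mc_value(rollout_rewards):
    """Monte Carlo value estimate: mean final reward over sampled completions."""
    return mean(rollout_rewards)


def segment_advantages(boundary_values):
    """Advantage of segment k = V(after segment k) - V(before segment k)."""
    return [after - before
            for before, after in zip(boundary_values, boundary_values[1:])]


def broadcast_to_tokens(advantages, segment_lengths):
    """Give every token in a segment that segment's advantage."""
    token_advantages = []
    for adv, length in zip(advantages, segment_lengths):
        token_advantages.extend([adv] * length)
    return token_advantages


# Toy data (made up): 0/1 final rewards of 4 rollouts started at each of the
# 4 boundaries of a response split into 3 segments.
rollouts_at_boundaries = [
    [0, 0, 1, 0],  # before segment 1
    [0, 1, 1, 0],  # after segment 1
    [1, 1, 0, 0],  # after segment 2
    [1, 1, 1, 1],  # after segment 3 (end of response)
]
values = [mc_value(r) for r in rollouts_at_boundaries]
advs = segment_advantages(values)
token_advs = broadcast_to_tokens(advs, segment_lengths=[3, 2, 4])
```

In this toy run the second segment gets zero advantage because the estimated value does not change across it, which is the kind of credit assignment a single trajectory-level reward cannot express.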

[286] On-Policy RL with Optimal Reward Baseline

Yaru Hao,Li Dong,Xun Wu,Shaohan Huang,Zewen Chi,Furu Wei

Main category: cs.LG

TL;DR: The paper proposes OPO, an algorithm that addresses training instability and computational inefficiency in reinforcement learning through exact on-policy training and an optimal reward baseline.

Details Motivation: Current reinforcement learning algorithms for training large language models suffer from instability and low computational efficiency. Method: OPO emphasizes exact on-policy training and introduces an optimal reward baseline to reduce gradient variance. Result: OPO performs strongly on mathematical reasoning benchmarks and trains stably without additional models or regularization terms. Conclusion: OPO offers a stable and efficient direction for reinforcement learning in large language model alignment and reasoning tasks. Abstract: Reinforcement learning algorithms are fundamental to align large language models with human preferences and to enhance their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO introduces the optimal reward baseline that theoretically minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is provided at https://github.com/microsoft/LMOps/tree/main/opo.
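The abstract does not spell out the baseline's formula. A classical result for score-function gradient estimators is that the variance-minimizing scalar baseline is a weighted mean of rewards, with each sample weighted by the squared norm of its log-probability gradient. The sketch below implements that weighted mean. Using response length as a stand-in for the squared gradient norm is an assumption of this example, to be checked against the linked implementation, and the rewards and lengths are made up.

```python
def optimal_baseline(rewards, weights):
    """Weighted mean of rewards: sum(w_i * r_i) / sum(w_i).

    With w_i = squared norm of the log-probability gradient of sample i,
    this is the classical variance-minimizing scalar baseline.
    """
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


def baseline_advantages(rewards, weights):
    """Advantage of each sampled response relative to the weighted baseline."""
    b = optimal_baseline(rewards, weights)
    return [r - b for r in rewards]


# Four sampled responses to one prompt (hypothetical values).
rewards = [1.0, 0.0, 1.0, 0.0]
lengths = [10, 30, 20, 40]  # assumed proxy for squared gradient norms
baseline = optimal_baseline(rewards, lengths)
advs = baseline_advantages(rewards, lengths)
```

With uniform weights the baseline reduces to the plain mean reward (0.5 here); the length weighting pulls it toward the rewards of the longer responses.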

[287] Differential Information: An Information-Theoretic Perspective on Preference Optimization

Yunjae Won,Hyunji Lee,Hyeonbin Hwang,Minjoon Seo

Main category: cs.LG

TL;DR: Using the Differential Information Distribution (DID), this paper fills a theoretical gap in the reward parameterization of Direct Preference Optimization (DPO), establishing its optimality and its effect on policy behavior.

Details Motivation: Although DPO is empirically successful, the theoretical justification for its reward parameterization remains incomplete. The paper aims to fill this gap and to examine how preference data relates to policy behavior. Method: The DID is used to analyze how preference labels encode information, to derive the optimal form of the DPO reward, and to characterize policy behavior through entropy analysis. Result: Low-entropy differential information reinforces the policy distribution, high-entropy differential information has a smoothing effect, and preference data must satisfy an implicit assumption of log-margin ordered policies. Conclusion: The paper offers a unified theoretical view of the DPO objective, the structure of preference data, and policy behavior, and validates it on real-world datasets. Abstract: Direct Preference Optimization (DPO) has become a standard technique for aligning language models with human preferences in a supervised manner. Despite its empirical success, the theoretical justification behind its log-ratio reward parameterization remains incomplete. In this work, we address this gap by utilizing the Differential Information Distribution (DID): a distribution over token sequences that captures the information gained during policy updates. First, we show that when preference labels encode the differential information required to transform a reference policy into a target policy, the log-ratio reward in DPO emerges as the uniquely optimal form for learning the target policy via preference optimization. This result naturally yields a closed-form expression for the optimal sampling distribution over rejected responses. Second, we find that the condition for preferences to encode differential information is fundamentally linked to an implicit assumption regarding log-margin ordered policies-an inductive bias widely used in preference optimization yet previously unrecognized. Finally, by analyzing the entropy of the DID, we characterize how learning low-entropy differential information reinforces the policy distribution, while high-entropy differential information induces a smoothing effect, which explains the log-likelihood displacement phenomenon. We validate our theoretical findings in synthetic experiments and extend them to real-world instruction-following datasets. Our results suggest that learning high-entropy differential information is crucial for general instruction-following, while learning low-entropy differential information benefits knowledge-intensive question answering. Overall, our work presents a unifying perspective on the DPO objective, the structure of preference data, and resulting policy behaviors through the lens of differential information.
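For readers unfamiliar with the log-ratio reward the abstract refers to, here is a minimal sketch of the standard DPO objective for a single preference pair. It shows only the well-known loss, not the paper's DID analysis. The sequence log-probabilities and the value of beta are made-up inputs.

```python
import math


def log_ratio_reward(logp_policy, logp_ref, beta=0.1):
    """DPO implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w, logp_w_ref, logp_l, logp_l_ref, beta=0.1):
    """Negative log-sigmoid of the reward margin between chosen (w) and rejected (l)."""
    margin = (log_ratio_reward(logp_w, logp_w_ref, beta)
              - log_ratio_reward(logp_l, logp_l_ref, beta))
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is zero and the loss is log(2).
loss_at_init = dpo_loss(-5.0, -5.0, -7.0, -7.0)
# Policy raised the chosen response and lowered the rejected one: loss drops.
loss_after_update = dpo_loss(-4.0, -5.0, -8.0, -7.0)
```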

[288] Test-time augmentation improves efficiency in conformal prediction

Divya Shanmugam,Helen Lu,Swami Sankaranarayanan,John Guttag

Main category: cs.LG

TL;DR: Test-time augmentation (TTA) reduces the size of the prediction sets produced by conformal classifiers, improving efficiency without retraining the model.

Details Motivation: The prediction sets produced by conformal classifiers are often too large to be informative, so a method is needed to shrink them. Method: TTA is combined with different conformal scoring methods and evaluated across multiple datasets and models. Result: TTA reduces prediction set sizes by 10%-14% on average and remains applicable under different distribution shifts and guarantee strengths. Conclusion: TTA is an effective addition to the conformal classification pipeline that yields noticeably more compact prediction sets. Abstract: A conformal classifier produces a set of predicted classes and provides a probabilistic guarantee that the set includes the true class. Unfortunately, it is often the case that conformal classifiers produce uninformatively large sets. In this work, we show that test-time augmentation (TTA)--a technique that introduces inductive biases during inference--reduces the size of the sets produced by conformal classifiers. Our approach is flexible, computationally efficient, and effective. It can be combined with any conformal score, requires no model retraining, and reduces prediction set sizes by 10%-14% on average. We conduct an evaluation of the approach spanning three datasets, three models, two established conformal scoring methods, different guarantee strengths, and several distribution shifts to show when and why test-time augmentation is a useful addition to the conformal pipeline.
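A minimal sketch of how TTA can slot into split conformal prediction, under simplifying assumptions that are not taken from the paper: the conformal score is one minus the probability of a class, and TTA is a plain average of class probabilities over augmented views (the paper may aggregate views differently). All numbers are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: include every class
    return sorted(cal_scores)[k - 1]


def tta_probs(view_probs):
    """Average class probabilities over augmented views of one input."""
    n_views = len(view_probs)
    return [sum(column) / n_views for column in zip(*view_probs)]


def prediction_set(probs, threshold):
    """Keep every class whose score (1 - probability) is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]


# Calibration scores: 1 - p(true class) on held-out data (made-up values).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.90]
q = conformal_threshold(cal_scores, alpha=0.25)

# Three augmented views of one test image, three classes (made-up values).
views = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
pred = prediction_set(tta_probs(views), q)
```

For the coverage guarantee to hold, the calibration scores would need to be computed from the same TTA-averaged probabilities as the test scores.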

[289] Number of Clusters in a Dataset: A Regularized K-means Approach

Behzad Kamgar-Parsi,Behrooz Kamgar-Parsi

Main category: cs.LG

TL;DR: This paper studies how to set the key hyperparameter λ in regularized k-means, derives bounds on λ under an ideal-cluster assumption, and analyzes how additive and multiplicative regularization affect the solutions.

Details Motivation: Finding the number of meaningful clusters in an unlabeled dataset matters in many applications, yet there is no principled guidance for setting the regularization hyperparameter λ. Method: Assuming ideal spherical clusters, the paper derives rigorous bounds on λ and analyzes the multiplicity of solutions for k-means with additive and multiplicative regularizers. Result: Experiments show that additive regularization often yields multiple solutions, and that the consensus between multiplicative and additive regularization reduces this ambiguity in certain cases. Conclusion: The paper provides bounds on λ under the ideal assumption and demonstrates how regularized k-means performs as clusters deviate from it. Abstract: Finding the number of meaningful clusters in an unlabeled dataset is important in many applications. The regularized k-means algorithm is a possible approach frequently used to find the correct number of distinct clusters in datasets. The most common formulation of the regularization function is the additive linear term $\lambda k$, where $k$ is the number of clusters and $\lambda$ a positive coefficient. Currently, there are no principled guidelines for setting a value for the critical hyperparameter $\lambda$. In this paper, we derive rigorous bounds for $\lambda$ assuming clusters are ideal. Ideal clusters (defined as $d$-dimensional spheres with identical radii) are close proxies for k-means clusters ($d$-dimensional spherically symmetric distributions with identical standard deviations). Experiments show that the k-means algorithm with additive regularizer often yields multiple solutions. Thus, we also analyze k-means algorithm with multiplicative regularizer. The consensus among k-means solutions with additive and multiplicative regularizations reduces the ambiguity of multiple solutions in certain cases. We also present selected experiments that demonstrate the performance of the regularized k-means algorithms as clusters deviate from the ideal assumption.
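The additive formulation can be made concrete with a short sketch: for each candidate k, add $\lambda k$ to the within-cluster sum of squared errors and pick the k with the smallest total. The SSE values below are invented for illustration (they would normally come from running k-means for each k), and the paper's bounds on $\lambda$ and its multiplicative regularizer are not reproduced here.

```python
def regularized_cost(sse, k, lam):
    """Additive regularized k-means objective: SSE(k) + lambda * k."""
    return sse + lam * k


def select_k(sse_by_k, lam):
    """Return the k that minimizes the regularized objective (smallest k on ties)."""
    return min(sorted(sse_by_k),
               key=lambda k: regularized_cost(sse_by_k[k], k, lam))


# Hypothetical within-cluster SSE after running k-means for k = 1..5.
sse_by_k = {1: 100.0, 2: 40.0, 3: 10.0, 4: 8.0, 5: 7.0}

k_small_lambda = select_k(sse_by_k, lam=5.0)
k_large_lambda = select_k(sse_by_k, lam=40.0)
```

The two calls return different cluster counts for the same data, which illustrates why the choice of $\lambda$ is critical.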

[290] Diverse Prototypical Ensembles Improve Robustness to Subpopulation Shift

Minh Nguyen Nhat To,Paul F R Wilson,Viet Nguyen,Mohamed Harmanani,Michael Cooper,Fahimeh Fooladgar,Purang Abolmaesumi,Parvin Mousavi,Rahul G. Krishnan

Main category: cs.LG

TL;DR: The paper proposes Diverse Prototypical Ensembles (DPEs), which use a diverse ensemble of prototypical classifiers to adapt to subpopulation shift and improve worst-group accuracy.

Details Motivation: Differences in subpopulation distributions between training and target datasets can significantly degrade the performance of machine learning models. Existing methods rely on assumptions about the number and nature of subpopulations and on group-membership annotations, which are unavailable for many real-world datasets. Method: The standard linear classification layer is replaced with a mixture of prototypical classifiers, forming a diverse ensemble in which each member focuses on different features and samples. Result: Experiments on nine real-world datasets show that DPEs often outperform the prior state of the art in worst-group accuracy. Conclusion: DPEs handle subpopulation shift effectively through a diverse classifier ensemble, without relying on subpopulation assumptions or annotations. Abstract: Subpopulation shift, characterized by a disparity in subpopulation distribution between the training and target datasets, can significantly degrade the performance of machine learning models. Current solutions to subpopulation shift involve modifying empirical risk minimization with re-weighting strategies to improve generalization. This strategy relies on assumptions about the number and nature of subpopulations and annotations on group membership, which are unavailable for many real-world datasets. Instead, we propose using an ensemble of diverse classifiers to adaptively capture risk associated with subpopulations. Given a feature extractor network, we replace its standard linear classification layer with a mixture of prototypical classifiers, where each member is trained to classify the data while focusing on different features and samples from other members. In empirical evaluation on nine real-world datasets, covering diverse domains and kinds of subpopulation shift, our method of Diverse Prototypical Ensembles (DPEs) often outperforms the prior state-of-the-art in worst-group accuracy. The code is available at https://github.com/minhto2802/dpe4subpop

[291] Pseudo Multi-Source Domain Generalization: Bridging the Gap Between Single and Multi-Source Domain Generalization

Shohei Enomoto

Main category: cs.LG

TL;DR: The paper proposes PMDG, a framework that generates multiple pseudo-domains from a single source domain through style transfer and data augmentation, addressing the high dataset-construction cost that limits multi-source domain generalization (MDG) in practice. Experiments show that PMDG performance correlates positively with MDG performance and that pseudo-domains can rival real multi-domain data.

Details Motivation: Deep learning models lose performance when the data distribution changes, and MDG is hard to apply in practice because multi-domain datasets are costly to build. Method: PMDG uses style transfer and data augmentation to generate multiple pseudo-domains from a single source domain, creating a synthetic multi-domain dataset that works with existing MDG algorithms. Result: PMDG performance correlates positively with MDG performance, and with sufficient data pseudo-domains can match or exceed real multi-domain performance. Conclusion: PMDG offers a practical solution for single-source domain generalization (SDG) and provides useful insights for future domain generalization research. Abstract: Deep learning models often struggle to maintain performance when deployed on data distributions different from their training data, particularly in real-world applications where environmental conditions frequently change. While Multi-source Domain Generalization (MDG) has shown promise in addressing this challenge by leveraging multiple source domains during training, its practical application is limited by the significant costs and difficulties associated with creating multi-domain datasets. To address this limitation, we propose Pseudo Multi-source Domain Generalization (PMDG), a novel framework that enables the application of sophisticated MDG algorithms in more practical Single-source Domain Generalization (SDG) settings. PMDG generates multiple pseudo-domains from a single source domain through style transfer and data augmentation techniques, creating a synthetic multi-domain dataset that can be used with existing MDG algorithms. Through extensive experiments with PseudoDomainBed, our modified version of the DomainBed benchmark, we analyze the effectiveness of PMDG across multiple datasets and architectures. Our analysis reveals several key findings, including a positive correlation between MDG and PMDG performance and the potential of pseudo-domains to match or exceed actual multi-domain performance with sufficient data. These comprehensive empirical results provide valuable insights for future research in domain generalization. Our code is available at https://github.com/s-enmt/PseudoDomainBed.

[292] Buffer-free Class-Incremental Learning with Out-of-Distribution Detection

Srishti Gupta,Daniele Angioni,Maura Pintor,Ambra Demontis,Lea Schönherr,Battista Biggio,Fabio Roli

Main category: cs.LG

TL;DR: The paper shows that post-hoc OOD detection, which needs no memory buffer, can be used for class-incremental learning in open-world settings and performs on par with or better than buffer-based methods.

Details Motivation: Class-incremental learning in open-world scenarios is challenging, and memory buffers raise privacy, scalability, and training-time concerns. Method: The paper analyzes post-hoc OOD detection methods and applies them at inference time in place of buffer-based OOD detection. Result: On CIFAR-10, CIFAR-100, and Tiny ImageNet, the buffer-free approach performs comparably to or better than buffer-based methods. Conclusion: Post-hoc OOD detection opens a path to efficient, privacy-preserving open-world class-incremental learning systems. Abstract: Class-incremental learning (CIL) poses significant challenges in open-world scenarios, where models must not only learn new classes over time without forgetting previous ones but also handle inputs from unknown classes that a closed-set model would misclassify. Recent works address both issues by (i) training multi-head models using the task-incremental learning framework, and (ii) predicting the task identity employing out-of-distribution (OOD) detectors. While effective, the latter mainly relies on joint training with a memory buffer of past data, raising concerns around privacy, scalability, and increased training time. In this paper, we present an in-depth analysis of post-hoc OOD detection methods and investigate their potential to eliminate the need for a memory buffer. We uncover that these methods, when applied appropriately at inference time, can serve as a strong substitute for buffer-based OOD detection. We show that this buffer-free approach achieves comparable or superior performance to buffer-based methods both in terms of class-incremental learning and the rejection of unknown samples. Experimental results on CIFAR-10, CIFAR-100 and Tiny ImageNet datasets support our findings, offering new insights into the design of efficient and privacy-preserving CIL systems for open-world settings.

[293] Network Inversion for Uncertainty-Aware Out-of-Distribution Detection

Pirzada Suhail,Rehna Afroz,Amit Sethi

Main category: cs.LG

TL;DR: The paper proposes a framework that combines network inversion with classifier training to address OOD detection and uncertainty estimation at the same time.

Details Motivation: Unexpected inputs are inevitable in real-world settings, so OOD detection and uncertainty estimation are essential for building safe machine learning systems. Method: A "garbage" class is added and initially populated with Gaussian noise to represent outlier inputs; the classifier is then refined by alternating network inversion with retraining. Result: The model detects OOD samples by classifying them into the garbage class and uses confidence scores to estimate uncertainty. Conclusion: The method needs no external OOD data or post-hoc calibration and offers a unified solution for OOD detection and uncertainty estimation. Abstract: Out-of-distribution (OOD) detection and uncertainty estimation (UE) are critical components for building safe machine learning systems, especially in real-world scenarios where unexpected inputs are inevitable. In this work, we propose a novel framework that combines network inversion with classifier training to simultaneously address both OOD detection and uncertainty estimation. For a standard n-class classification task, we extend the classifier to an (n+1)-class model by introducing a "garbage" class, initially populated with random Gaussian noise to represent outlier inputs. After each training epoch, we use network inversion to reconstruct input images corresponding to all output classes that initially appear as noisy and incoherent and are therefore excluded to the garbage class for retraining the classifier. This cycle of training, inversion, and exclusion continues iteratively till the inverted samples begin to resemble the in-distribution data more closely, suggesting that the classifier has learned to carve out meaningful decision boundaries while sanitising the class manifolds by pushing OOD content into the garbage class. During inference, this training scheme enables the model to effectively detect and reject OOD samples by classifying them into the garbage class. Furthermore, the confidence scores associated with each prediction can be used to estimate uncertainty for both in-distribution and OOD inputs. Our approach is scalable, interpretable, and does not require access to external OOD datasets or post-hoc calibration techniques while providing a unified solution to the dual challenges of OOD detection and uncertainty estimation.

[294] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Qingyu Shi,Jinbin Bai,Zhuoran Zhao,Wenhao Chai,Kaidong Yu,Jianzong Wu,Shuangyong Song,Yunhai Tong,Xiangtai Li,Xuelong Li,Shuicheng Yan

Main category: cs.LG

TL;DR: Muddit is a unified discrete diffusion transformer that generates text and images quickly and in parallel. It combines a pretrained text-to-image backbone with a lightweight text decoder and matches or outperforms much larger autoregressive models.

Details Motivation: Autoregressive unified models are slow at inference, and non-autoregressive unified models generalize poorly. Method: Muddit combines a pretrained text-to-image backbone with a lightweight text decoder to enable unified generation across modalities. Result: Muddit achieves competitive or superior quality and efficiency compared with significantly larger autoregressive models, showing the potential of discrete diffusion as a unified generation backbone. Conclusion: Discrete diffusion equipped with strong visual priors is a scalable and effective approach to unified generation. Abstract: Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

[295] Merge-Friendly Post-Training Quantization for Multi-Target Domain Adaptation

Juncheol Shin,Minsang Seok,Seonggon Kim,Eunhyeok Park

Main category: cs.LG

TL;DR: The paper proposes HDRQ, a post-training quantization method that accounts for the effect of quantization on model merging. It keeps the quantized model close to the source pre-trained model and flattens the loss surface so that models merge smoothly.

Details Motivation: Quantizing on target-specific data restricts the domain of interest and introduces discretization effects, which complicates model merging. The study analyzes the impact of quantization on model merging through error barriers. Method: HDRQ (Hessian and distant regularizing quantization) is a post-training quantization method designed with model merging for multi-target domain adaptation in mind. Result: In experiments, HDRQ reduces the deviation from the source model introduced by quantization and flattens the loss surface. Conclusion: According to the authors, this is the first study of the challenge of merging quantized models, and experiments confirm the method's effectiveness. Abstract: Model merging has emerged as a powerful technique for combining task-specific weights, achieving superior performance in multi-target domain adaptation. However, when applied to practical scenarios, such as quantized models, new challenges arise. In practical scenarios, quantization is often applied to target-specific data, but this process restricts the domain of interest and introduces discretization effects, making model merging highly non-trivial. In this study, we analyze the impact of quantization on model merging through the lens of error barriers. Leveraging these insights, we propose a novel post-training quantization, HDRQ - Hessian and distant regularizing quantization - that is designed to consider model merging for multi-target domain adaptation. Our approach ensures that the quantization process incurs minimal deviation from the source pre-trained model while flattening the loss surface to facilitate smooth model merging. To our knowledge, this is the first study on this challenge, and extensive experiments confirm its effectiveness.

[296] REOrdering Patches Improves Vision Models

Declan Kutscher,David M. Chan,Yutong Bai,Trevor Darrell,Ritwik Gupta

Main category: cs.LG

TL;DR: The paper proposes REOrder, a framework that improves sequence-model performance by optimizing the order of image patches, raising top-1 accuracy by up to 3.01% on ImageNet-1K and by 13.35% on Functional Map of the World.

Details Motivation: Long-sequence models such as efficient transformers are sensitive to patch ordering, and a fixed order (e.g., row-major) can hurt performance. Method: REOrder is a two-stage framework: (1) derive an information-theoretic prior by evaluating the compressibility of patch sequences; (2) learn an ordering by optimizing a Plackett-Luce policy with REINFORCE. Result: Top-1 accuracy improves over row-major ordering by up to 3.01% on ImageNet-1K and by 13.35% on Functional Map of the World. Conclusion: Optimizing patch order significantly improves model performance and suggests a new direction for sequence-model design. Abstract: Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
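A Plackett-Luce policy over permutations can be sketched in a few lines: each patch has a learnable score, and an ordering is sampled by repeatedly drawing one of the remaining patches with probability proportional to the exponential of its score. The sketch shows sampling and the log-probability that a REINFORCE update would differentiate. The scores are made up, and nothing here reproduces the paper's training setup or its compressibility prior.

```python
import itertools
import math
import random


def sample_permutation(scores, rng):
    """Draw an ordering: pick remaining items one at a time with softmax probabilities."""
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        order.append(pick)
        remaining.remove(pick)
    return order


def log_prob(scores, order):
    """Log-probability of an ordering under the Plackett-Luce model."""
    total = 0.0
    for pos in range(len(order)):
        tail = [scores[i] for i in order[pos:]]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        total += scores[order[pos]] - log_norm
    return total


scores = [2.0, 0.5, -1.0, 0.0]  # hypothetical per-patch scores
rng = random.Random(0)
order = sample_permutation(scores, rng)

# Sanity check: probabilities of all 4! orderings sum to one.
total_mass = sum(math.exp(log_prob(scores, list(p)))
                 for p in itertools.permutations(range(len(scores))))
```

In a REINFORCE setup, the gradient of `log_prob` with respect to the scores would be scaled by the downstream reward (for example, task accuracy minus a baseline) for each sampled ordering.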

cs.RO [Back]

[297] AMOR: Adaptive Character Control through Multi-Objective Reinforcement Learning

Lucas N. Alegre,Agon Serifi,Ruben Grandia,David Müller,Espen Knoop,Moritz Bächer

Main category: cs.RO

TL;DR: The paper proposes a multi-objective reinforcement learning framework that trains a single policy conditioned on reward weights, reducing tuning time and improving adaptability.

Details Motivation: Conventional reinforcement learning relies on a weighted reward function, which is time-consuming to tune and hard to adapt to real-world uncertainty. Method: A multi-objective reinforcement learning framework trains one weight-conditioned policy that spans the Pareto front of reward trade-offs. Result: The approach significantly speeds up iteration and, in hierarchical settings, allows weights to be adjusted dynamically to adapt to new tasks. Conclusion: The multi-objective policy encodes a diverse range of behaviors, improving task adaptability and efficiency. Abstract: Reinforcement learning (RL) has significantly advanced the control of physics-based and robotic characters that track kinematic reference motion. However, methods typically rely on a weighted sum of conflicting reward functions, requiring extensive tuning to achieve a desired behavior. Due to the computational cost of RL, this iterative process is a tedious, time-intensive task. Furthermore, for robotics applications, the weights need to be chosen such that the policy performs well in the real world, despite inevitable sim-to-real gaps. To address these challenges, we propose a multi-objective reinforcement learning framework that trains a single policy conditioned on a set of weights, spanning the Pareto front of reward trade-offs. Within this framework, weights can be selected and tuned after training, significantly speeding up iteration time. We demonstrate how this improved workflow can be used to perform highly dynamic motions with a robot character. Moreover, we explore how weight-conditioned policies can be leveraged in hierarchical settings, using a high-level policy to dynamically select weights according to the current task. We show that the multi-objective policy encodes a diverse spectrum of behaviors, facilitating efficient adaptation to novel tasks.

[298] Anomalies by Synthesis: Anomaly Detection using Generative Diffusion Models for Off-Road Navigation

Siddharth Ancha,Sunshine Jiang,Travis Manderson,Laura Brandt,Yilun Du,Philip R. Osteen,Nicholas Roy

Main category: cs.RO

TL;DR: The paper proposes a pixel-wise anomaly detection method based on generative diffusion models. It makes no assumptions about the anomalous data and detects anomalies by analyzing what changes when the image is edited.

Details Motivation: To navigate safely and reliably in unstructured environments, robots must detect anomalies that are out of distribution with respect to the training data. Method: A generative diffusion model edits the input image to remove anomalies, and the resulting changes are analyzed; a new inference approach for guided diffusion is derived by analyzing the guidance gradient. Result: The method needs no retraining or fine-tuning, can be integrated directly into existing workflows, and uses vision-language foundation models to achieve accurate anomaly detection. Conclusion: The method offers robot navigation an efficient, assumption-free approach to anomaly detection. Abstract: In order to navigate safely and reliably in off-road and unstructured environments, robots must detect anomalies that are out-of-distribution (OOD) with respect to the training data. We present an analysis-by-synthesis approach for pixel-wise anomaly detection without making any assumptions about the nature of OOD data. Given an input image, we use a generative diffusion model to synthesize an edited image that removes anomalies while keeping the remaining image unchanged. Then, we formulate anomaly detection as analyzing which image segments were modified by the diffusion model. We propose a novel inference approach for guided diffusion by analyzing the ideal guidance gradient and deriving a principled approximation that bootstraps the diffusion model to predict guidance gradients. Our editing technique operates purely at test time and can be integrated into existing workflows without the need for retraining or fine-tuning. Finally, we use a combination of vision-language foundation models to compare pixels in a learned feature space and detect semantically meaningful edits, enabling accurate anomaly detection for off-road navigation. Project website: https://siddancha.github.io/anomalies-by-diffusion-synthesis/

[299] TrackVLA: Embodied Visual Tracking in the Wild

Shaoan Wang,Jiazhao Zhang,Minghan Li,Jiahang Liu,Anqi Li,Kui Wu,Fangwei Zhong,Junzhi Yu,Zhizheng Zhang,He Wang

Main category: cs.RO

TL;DR: TrackVLA is a Vision-Language-Action (VLA) model that combines target recognition with trajectory planning to achieve efficient visual tracking in dynamic environments.

Details Motivation: Existing methods separate target recognition from trajectory planning, which creates a performance bottleneck; the goal is better tracking under occlusion and high scene dynamics. Method: A shared LLM backbone feeds a language modeling head for recognition and an anchor-based diffusion model for trajectory planning; the model is trained on data built around the EVT-Bench benchmark. Result: TrackVLA achieves SOTA performance in both synthetic and real-world environments, outperforms existing methods zero-shot, and stays robust to high dynamics and occlusion at 10 FPS. Conclusion: TrackVLA demonstrates the synergy between target recognition and trajectory planning and offers an efficient solution for visual tracking in Embodied AI. Abstract: Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect diverse difficulty levels of recognition samples, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates SOTA performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at 10 FPS inference speed. Our project page is: https://pku-epic.github.io/TrackVLA-web.

[300] Autoregressive Meta-Actions for Unified Controllable Trajectory Generation

Jianbo Zhao,Taiyu Ban,Xiyang Wang,Qibin Zhou,Hangning Zhou,Zhihao Liu,Mu Yang,Lei Liu,Bin Li

Main category: cs.RO

TL;DR: The paper proposes Autoregressive Meta-Actions to fix the temporal misalignment between meta-actions and trajectories in autonomous driving systems. Long-interval meta-actions are decomposed into frame-level meta-actions so that trajectory generation stays strictly aligned with decisions.

Details Motivation: Existing autonomous driving frameworks rely on meta-actions assigned over fixed time intervals, which misaligns them with the actual trajectories and harms task coherence and model performance. Method: Long-interval meta-actions are decomposed into frame-level meta-actions within an autoregressive trajectory generation framework to achieve strict alignment; a staged pre-training process separates motion dynamics from high-level decision control. Result: Experiments show improved trajectory adaptivity and responsiveness to dynamic decision-making. Conclusion: The method resolves the temporal misalignment between meta-actions and trajectories and improves the performance of autonomous driving systems. Abstract: Controllable trajectory generation guided by high-level semantic decisions, termed meta-actions, is crucial for autonomous driving systems. A significant limitation of existing frameworks is their reliance on invariant meta-actions assigned over fixed future time intervals, causing temporal misalignment with the actual behavior trajectories. This misalignment leads to irrelevant associations between the prescribed meta-actions and the resulting trajectories, disrupting task coherence and limiting model performance. To address this challenge, we introduce Autoregressive Meta-Actions, an approach integrated into autoregressive trajectory generation frameworks that provides a unified and precise definition for meta-action-conditioned trajectory prediction. Specifically, we decompose traditional long-interval meta-actions into frame-level meta-actions, enabling a sequential interplay between autoregressive meta-action prediction and meta-action-conditioned trajectory generation. This decomposition ensures strict alignment between each trajectory segment and its corresponding meta-action, achieving a consistent and unified task formulation across the entire trajectory span and significantly reducing complexity. Moreover, we propose a staged pre-training process to decouple the learning of basic motion dynamics from the integration of high-level decision control, which offers flexibility, stability, and modularity. Experimental results validate our framework's effectiveness, demonstrating improved trajectory adaptivity and responsiveness to dynamic decision-making scenarios. We provide the video document and dataset, which are available at https://arma-traj.github.io/.

[301] Mobi-$π$: Mobilizing Your Robot Learning Policy

Jingyun Yang,Isabella Huang,Brandon Vu,Max Bajracharya,Rika Antonova,Jeannette Bohg

Main category: cs.RO

TL;DR: The paper addresses the poor generalization of robot visuomotor policies to new environments by optimizing the robot's base pose so that observations match the training distribution, with no policy retraining.

Details Motivation: Existing visuomotor policies work well for the limited robot positions and viewpoints seen in training but generalize poorly to new environments, which limits their use on mobile platforms. Method: The Mobi-π framework provides evaluation metrics, simulated tasks, visualization tools, and baseline methods; the proposed approach optimizes the base pose using 3D Gaussian Splatting and sampling-based optimization. Result: The approach outperforms baselines in both simulation and real-world environments, confirming its effectiveness for policy mobilization. Conclusion: Optimizing the robot's base pose transfers visuomotor policies to new environments efficiently and suggests a new direction for mobile manipulation. Abstract: Learned visuomotor policies are capable of performing increasingly complex manipulation tasks. However, most of these policies are trained on data collected from limited robot positions and camera viewpoints. This leads to poor generalization to novel robot positions, which limits the use of these policies on mobile platforms, especially for precise tasks like pressing buttons or turning faucets. In this work, we formulate the policy mobilization problem: find a mobile robot base pose in a novel environment that is in distribution with respect to a manipulation policy trained on a limited set of camera viewpoints. Compared to retraining the policy itself to be more robust to unseen robot base pose initializations, policy mobilization decouples navigation from manipulation and thus does not require additional demonstrations. Crucially, this problem formulation complements existing efforts to improve manipulation policy robustness to novel viewpoints and remains compatible with them. To study policy mobilization, we introduce the Mobi-$\pi$ framework, which includes: (1) metrics that quantify the difficulty of mobilizing a given policy, (2) a suite of simulated mobile manipulation tasks based on RoboCasa to evaluate policy mobilization, (3) visualization tools for analysis, and (4) several baseline methods. We also propose a novel approach that bridges navigation and manipulation by optimizing the robot's base pose to align with an in-distribution base pose for a learned policy. Our approach utilizes 3D Gaussian Splatting for novel view synthesis, a score function to evaluate pose suitability, and sampling-based optimization to identify optimal robot poses. We show that our approach outperforms baselines in both simulation and real-world environments, demonstrating its effectiveness for policy mobilization.

q-bio.NC [Back]

[302] ConnectomeDiffuser: Generative AI Enables Brain Network Construction from Diffusion Tensor Imaging

Xuhang Chen,Michael Kwok-Po Ng,Kim-Fung Tsang,Chi-Man Pun,Shuqiang Wang

Main category: q-bio.NC

TL;DR: ConnectomeDiffuser is a diffusion-based framework for automated, end-to-end brain network construction that overcomes the limitations of existing methods and improves diagnostic accuracy.

Details Motivation: Existing methods for constructing brain networks from DTI suffer from operator subjectivity, labor-intensive workflows, and limited ability to capture topological features, so a more efficient and accurate tool is needed. Method: A Template Network, a diffusion model, and a Graph Convolutional Network classifier are combined to extract topological features from DTI scans and generate brain networks. Result: On datasets covering two neurodegenerative conditions, the method outperforms other approaches and analyzes individual variations more sensitively. Conclusion: ConnectomeDiffuser provides a more accurate tool for diagnosing and monitoring neurodegenerative diseases. Abstract: Brain network analysis plays a crucial role in diagnosing and monitoring neurodegenerative disorders such as Alzheimer's disease (AD). Existing approaches for constructing structural brain networks from diffusion tensor imaging (DTI) often rely on specialized toolkits that suffer from inherent limitations: operator subjectivity, labor-intensive workflows, and restricted capacity to capture complex topological features and disease-specific biomarkers. To overcome these challenges and advance computational neuroimaging instrumentation, ConnectomeDiffuser is proposed as a novel diffusion-based framework for automated end-to-end brain network construction from DTI. The proposed model combines three key components: (1) a Template Network that extracts topological features from 3D DTI scans using Riemannian geometric principles, (2) a diffusion model that generates comprehensive brain networks with enhanced topological fidelity, and (3) a Graph Convolutional Network classifier that incorporates disease-specific markers to improve diagnostic accuracy. ConnectomeDiffuser demonstrates superior performance by capturing a broader range of structural connectivity and pathology-related information, enabling more sensitive analysis of individual variations in brain networks. Experimental validation on datasets representing two distinct neurodegenerative conditions demonstrates significant performance improvements over other brain network methods. This work contributes to the advancement of instrumentation in the context of neurological disorders, providing clinicians and researchers with a robust, generalizable measurement framework that facilitates more accurate diagnosis, deeper mechanistic understanding, and improved therapeutic monitoring of neurodegenerative diseases such as AD.

eess.SY [Back]

[303] CF-DETR: Coarse-to-Fine Transformer for Real-Time Object Detection

Woojin Shin,Donghwa Kang,Byeongyun Park,Brent Byunghoon Kang,Jinkyu Lee,Hyeongboo Baek

Main category: eess.SY

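TL;DR: CF-DETR combines a coarse-to-fine Transformer architecture with the real-time scheduling framework NPFP** to run multiple DETR tasks in autonomous driving systems with both real-time guarantees and high accuracy.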

Details Motivation: In autonomous driving perception systems, DETR's high accuracy requirements conflict with real-time requirements, and existing scheduling methods do not exploit Transformer-specific properties. Method: Proposes the CF-DETR system, which comprises coarse-to-fine inference, selective fine inference, and multi-level batch inference, combined with the NPFP** scheduling framework to adjust resource allocation dynamically. Result: Validated on several platforms, CF-DETR meets real-time requirements while significantly improving both overall and critical-object detection accuracy. Conclusion: Through dynamic adjustment and dedicated scheduling, CF-DETR balances the real-time and accuracy requirements of autonomous driving perception systems. Abstract: Detection Transformers (DETR) are increasingly adopted in autonomous vehicle (AV) perception systems due to their superior accuracy over convolutional networks. However, concurrently executing multiple DETR tasks presents significant challenges in meeting firm real-time deadlines (R1) and high accuracy requirements (R2), particularly for safety-critical objects, while navigating the inherent latency-accuracy trade-off under resource constraints. Existing real-time DNN scheduling approaches often treat models generically, failing to leverage Transformer-specific properties for efficient resource allocation. To address these challenges, we propose CF-DETR, an integrated system featuring a novel coarse-to-fine Transformer architecture and a dedicated real-time scheduling framework NPFP**. CF-DETR employs three key strategies (A1: coarse-to-fine inference, A2: selective fine inference, A3: multi-level batch inference) that exploit Transformer properties to dynamically adjust patch granularity and attention scope based on object criticality, aiming to satisfy R2. The NPFP** scheduling framework (A4) orchestrates these adaptive mechanisms A1-A3. It partitions each DETR task into a safety-critical coarse subtask for guaranteed critical object detection within its deadline (ensuring R1), and an optional fine subtask for enhanced overall accuracy (R2), while managing individual and batched execution.
Our extensive evaluations on server, GPU-enabled embedded platforms, and actual AV platforms demonstrate that CF-DETR, under an NPFP** policy, successfully meets strict timing guarantees for critical operations and achieves significantly higher overall and critical object detection accuracy compared to existing baselines across diverse AV workloads.

cs.HC [Back]

[304] Errors in Stereo Geometry Induce Distance Misperception

Raffles Xingqi Zhu,Charlie S. Burlingham,Olivier Mercier,Phillip Guan

Main category: cs.HC

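TL;DR: The paper presents a geometric framework that predicts distance misperception caused by inaccurate HMD perspective geometry and validates it experimentally.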

Details Motivation: To study how HMD rendering and viewing-position errors affect the depth and distance that users perceive, and to propose a remedy. Method: Builds a geometric framework and an HMD platform that simulates render and viewing errors, and tests the framework in five experiments. Result: Errors in perspective geometry cause both over- and under-estimation of perceived distance, but real-time visual feedback can dynamically recalibrate visuomotor mapping. Conclusion: The geometric framework effectively predicts how HMD errors affect perception, and real-time feedback can improve accuracy. Abstract: Stereoscopic head-mounted displays (HMDs) render and present binocular images to create an egocentric, 3D percept for the HMD user. Within this render and presentation pipeline there are potential rendering camera and viewing position errors that can induce deviations in the depth and distance that a user perceives compared to the underlying intended geometry. For example, rendering errors can arise when HMD render cameras are incorrectly positioned relative to the assumed centers of projection of the HMD displays, and viewing errors can arise when users view stereo geometry from the incorrect location in the HMD eyebox. In this work we present a geometric framework that predicts errors in distance perception arising from inaccurate HMD perspective geometry and build an HMD platform to reliably simulate render and viewing error in a Quest 3 HMD with eye tracking to experimentally test these predictions. We present a series of five experiments to explore the efficacy of this geometric framework and show that errors in perspective geometry can induce both under- and over-estimations in perceived distance. We further demonstrate how real-time visual feedback can be used to dynamically recalibrate visuomotor mapping so that an accurate reach distance is achieved even if the perceived visual distance is negatively impacted by geometric error.

[305] Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge

Yupei Li,Shuaijie Shao,Manuel Milling,Björn W. Schuller

Main category: cs.HC

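TL;DR: The paper proposes a multimodal, LLM-based depression detection method that combines audio features with psychological knowledge and notably improves diagnostic accuracy.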

Details Motivation: Existing DNNs and LLMs have limited effectiveness for depression detection, in particular because they handle non-textual cues (such as voice and behaviour) poorly and lack psychological expertise. Method: Extracts audio features with Wav2Vec, passes them to text-based LLMs for processing, and introduces psychological knowledge through a question-and-answer set. Result: On the DAIC-WOZ dataset, MAE and RMSE improve notably over the baseline. Conclusion: A multimodal LLM approach that incorporates psychological knowledge shows promise for depression detection. Abstract: Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract the audio features using the pre-trained model Wav2Vec and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question and answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to a base score proposed by the related original paper. The code is available at https://github.com/myxp-lyp/Depression-detection.git
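For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how MAE and RMSE are computed from predicted and reference severity scores. The function names and the sample scores are made up for illustration and are not taken from the paper or its repository.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference; penalizes large misses more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical depression-severity scores (illustrative only, not real data).
true_scores = [2, 10, 15, 7]
pred_scores = [4, 9, 11, 7]

mae = mean_absolute_error(true_scores, pred_scores)
rmse = root_mean_square_error(true_scores, pred_scores)
print(f"MAE={mae:.2f} RMSE={rmse:.2f}")
```

Lower is better for both metrics. RMSE is never smaller than MAE on the same data, and the gap widens when a few predictions are far off.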

[306] Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education

Boning Zhao

Main category: cs.HC

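TL;DR: The paper proposes HEAE, a human-centered AI framework that combines student narratives with a teacher-derived empathy vector to make depression severity assessment more transparent and socially responsible.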

Details Motivation: In sensitive settings such as special education, standardized questionnaires and automated methods struggle to assess student depression accurately and lack the individualized insight that comes from teachers' empathy. Method: The HEAE framework integrates student narrative text with a teacher-derived 9-dimensional Empathy Vector (EV) and optimizes the multimodal fusion and classification architecture for 7-level depression severity classification. Result: Experiments show 82.74% accuracy on 7-level severity classification. Conclusion: HEAE offers a more responsible and ethical path for affective computing by structurally embedding human empathy, enhancing rather than replacing human judgment. Abstract: Assessing student depression in sensitive environments like special education is challenging. Standardized questionnaires may not fully reflect students' true situations. Furthermore, automated methods often falter with rich student narratives, lacking the crucial, individualized insights stemming from teachers' empathetic connections with students. Existing methods often fail to address this ambiguity or effectively integrate educator understanding. To address these limitations by fostering a synergistic human-AI collaboration, this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered AI framework for transparent and socially responsible depression severity assessment. Our approach uniquely integrates student narrative text with a teacher-derived, 9-dimensional "Empathy Vector" (EV), its dimensions guided by the PHQ-9 framework, to explicitly translate tacit empathetic insight into a structured AI input, enhancing rather than replacing human judgment. Rigorous experiments optimized the multimodal fusion, text representation, and classification architecture, achieving 82.74% accuracy for 7-level severity classification. This work demonstrates a path toward more responsible and ethical affective computing by structurally embedding human empathy.

[307] MAC-Gaze: Motion-Aware Continual Calibration for Mobile Gaze Tracking

Yaxiong Lei,Mingyue Zhao,Yuheng Wang,Shijing He,Yusuke Sugano,Yafei Wang,Kaixing Zhao,Mohamed Khamis,Juan Ye

Main category: cs.HC

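TL;DR: MAC-Gaze is a motion-aware continual calibration method that uses smartphone IMU sensors and continual learning to adapt the gaze tracking model to changes in user posture and device orientation.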

Details Motivation: Traditional one-off calibration cannot adapt to changes in user posture and device orientation under dynamic conditions, so performance degrades. Method: Combines a pre-trained visual gaze estimator with an IMU-based activity recognition model, uses a clustering-based hybrid decision mechanism to trigger recalibration, and applies replay-based continual learning to avoid catastrophic forgetting. Result: Gaze estimation error falls by 19.9% on RGBDGaze and by 31.7% on MotionGaze. Conclusion: MAC-Gaze provides a robust solution for gaze tracking in mobile scenarios. Abstract: Mobile gaze tracking faces a fundamental challenge: maintaining accuracy as users naturally change their postures and device orientations. Traditional calibration approaches, such as one-off calibration, fail to adapt to these dynamic conditions, leading to degraded performance over time. We present MAC-Gaze, a Motion-Aware continual Calibration approach that leverages smartphone inertial measurement unit (IMU) sensors and continual learning techniques to automatically detect changes in user motion states and update the gaze tracking model accordingly. Our system integrates a pre-trained visual gaze estimator and an IMU-based activity recognition model with a clustering-based hybrid decision-making mechanism that triggers recalibration when motion patterns deviate significantly from previously encountered states. To enable accumulative learning of new motion conditions while mitigating catastrophic forgetting, we employ replay-based continual learning, allowing the model to maintain performance across previously encountered motion conditions. We evaluate our system through extensive experiments on the publicly available RGBDGaze dataset and our own 10-hour multimodal MotionGaze dataset (481K+ images, 800K+ IMU readings), encompassing a wide range of postures under various motion conditions including sitting, standing, lying, and walking. Results demonstrate that our method reduces gaze estimation error by 19.9% on RGBDGaze (from 1.73 cm to 1.41 cm) and by 31.7% on MotionGaze (from 2.81 cm to 1.92 cm) compared to traditional calibration approaches. Our framework provides a robust solution for maintaining gaze estimation accuracy in mobile scenarios.
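The idea of triggering recalibration when motion patterns deviate from previously seen states can be illustrated with a toy distance-to-centroid check. This is only a sketch under stated assumptions: the abstract does not specify the hybrid decision mechanism, so the class name `MotionStateMonitor`, the Euclidean distance, the fixed threshold, and the three-value IMU feature vectors are all hypothetical.

```python
import math


class MotionStateMonitor:
    """Toy trigger: flag recalibration when an IMU feature vector is far
    from every previously seen motion-state centroid (assumed design)."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical distance threshold
        self.centroids = []         # one stored vector per known motion state

    def needs_recalibration(self, feature):
        # The first observation always requires an initial calibration.
        if not self.centroids:
            self.centroids.append(list(feature))
            return True
        nearest = min(math.dist(feature, c) for c in self.centroids)
        if nearest > self.threshold:
            # Unseen motion state: remember it and ask for recalibration.
            self.centroids.append(list(feature))
            return True
        return False


monitor = MotionStateMonitor(threshold=1.0)
first = monitor.needs_recalibration([0.0, 0.0, 9.8])    # initial state
similar = monitor.needs_recalibration([0.1, 0.0, 9.8])  # close to a known state
novel = monitor.needs_recalibration([3.0, 0.0, 9.8])    # far from known states
```

A replay-based scheme would additionally keep calibration samples from each stored state and mix them into later updates, which this sketch does not attempt.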

cs.SD [Back]

[308] Nosey: Open-source hardware for acoustic nasalance

Maya Dewhurst,Jack Collins,Justin J. H. Lo,Roy Alderton,Sam Kirkham

Main category: cs.SD

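TL;DR: Nosey is a low-cost, customizable, open-source hardware system for recording acoustic nasalance data; it gives higher scores than a commercial device but comparable contrasts between phonological environments.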

Details Motivation: To develop a low-cost, open-source, customizable nasalance measurement system as an alternative to expensive commercial devices. Method: Designs and 3D-prints the hardware system, compares it against a commercial device, and discusses customization options. Result: Nosey produces higher nasalance scores than the commercial device, but the contrast between phonological environments is comparable across the two systems. Conclusion: Nosey is a flexible and cost-effective alternative to commercial nasometry devices and is suitable for data collection. Abstract: We introduce Nosey (Nasalance Open Source Estimation sYstem), a low-cost, customizable, 3D-printed system for recording acoustic nasalance data that we have made available as open-source hardware (http://github.com/phoneticslab/nosey). We first outline the motivations and design principles behind our hardware nasalance system, and then present a comparison between Nosey and a commercial nasalance device. Nosey shows consistently higher nasalance scores than the commercial device, but the magnitude of contrast between phonological environments is comparable between systems. We also review ways of customizing the hardware to facilitate testing, such as comparison of microphones and different construction materials. We conclude that Nosey is a flexible and cost-effective alternative to commercial nasometry devices and propose some methodological considerations for its use in data collection.
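As background for the nasalance scores discussed above, nasalance is conventionally expressed as the nasal signal level divided by the sum of nasal and oral levels, times 100. The sketch below assumes RMS amplitude as the level measure and uses made-up sample values; it is not taken from the Nosey repository, whose processing may differ.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of one microphone channel."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def nasalance_percent(nasal_samples, oral_samples):
    """Nasalance (%) = nasal / (nasal + oral) * 100, using RMS amplitude
    as the level measure (an assumption made for this illustration)."""
    nasal, oral = rms(nasal_samples), rms(oral_samples)
    total = nasal + oral
    if total == 0:
        return 0.0  # silent frame: define the score as zero
    return 100.0 * nasal / total


# Hypothetical frames from the nasal and oral microphones.
nasal_frame = [0.5, -0.5, 0.5, -0.5]
oral_frame = [1.5, -1.5, 1.5, -1.5]
score = nasalance_percent(nasal_frame, oral_frame)
print(f"nasalance = {score:.1f}%")
```

Because the score is a ratio, differences in microphone sensitivity or acoustic separation between channels shift absolute values, which is one plausible reason two devices can disagree on absolute scores while agreeing on contrasts.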

[309] Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation

Hao Li,Ju Dai,Xin Zhao,Feng Zhou,Junjun Pan,Lei Li

Main category: cs.SD

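TL;DR: The paper proposes the Wav2Sem module, which uses semantic decoupling to counter the lip-shape averaging caused by near-homophone syllables in speech-driven facial animation.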

Details Motivation: Existing methods use self-supervised audio models as encoders, but near-homophone syllables are strongly coupled in the feature space, which produces an averaging effect in lip-shape generation. Method: Proposes the Wav2Sem module, which extracts semantic features for the whole audio sequence and uses them to decouple audio encodings in the feature space. Result: Experiments show that Wav2Sem effectively decouples audio features and significantly improves the precision and naturalness of lip-shape generation. Conclusion: The Wav2Sem module improves the expressiveness of speech-driven facial animation. Abstract: In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to the averaging effect in subsequent lip motion generation. To address this issue, this paper proposes a plug-and-play semantic decorrelation module, Wav2Sem. This module extracts semantic features corresponding to the entire audio sequence, leveraging the added semantic information to decorrelate audio encodings within the feature space, thereby achieving more expressive audio features. Extensive experiments across multiple speech-driven models indicate that the Wav2Sem module effectively decouples audio features, significantly alleviating the averaging effect of phonetically similar syllables in lip shape generation, thereby enhancing the precision and naturalness of facial animations. Our source code is available at https://github.com/wslh852/Wav2Sem.git.

[310] Semantics-Aware Human Motion Generation from Audio Instructions

Zi-An Wang,Shihao Zou,Shiyao Yu,Mingyuan Zhang,Chao Dong

Main category: cs.SD

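TL;DR: The paper introduces a new task of generating semantically aligned motion conditioned on audio signals, and improves performance with a masked generative transformer and a memory-retrieval attention module.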

Details Motivation: Audio is more natural and intuitive than text, but existing methods mostly match motion to music or speech rhythm, leaving a weak link between audio semantics and the generated motion. Method: Uses an end-to-end framework that combines a masked generative transformer with a memory-retrieval attention module to handle sparse, lengthy audio inputs, and enriches existing datasets. Result: Experiments show the framework is effective and efficient, and that audio instructions can convey semantics similar to text while offering more practical and user-friendly interaction. Conclusion: Audio signals can serve as an effective medium for semantic encoding and open a new direction for interactive technologies. Abstract: Recent advances in interactive technologies have highlighted the prominence of audio signals for semantic encoding. This paper explores a new task, where audio signals are used as conditioning inputs to generate motions that align with the semantics of the audio. Unlike text-based interactions, audio provides a more natural and intuitive communication method. However, existing methods typically focus on matching motions with music or speech rhythms, which often results in a weak connection between the semantics of the audio and generated motions. We propose an end-to-end framework using a masked generative transformer, enhanced by a memory-retrieval attention module to handle sparse and lengthy audio inputs. Additionally, we enrich existing datasets by converting descriptions into conversational style and generating corresponding audio with varied speaker identities. Experiments demonstrate the effectiveness and efficiency of the proposed framework, showing that audio instructions can convey semantics similar to text while providing more practical and user-friendly interactions.

[311] ZeroSep: Separate Anything in Audio with Zero Training

Chao Huang,Yuesheng Ma,Junxuan Huang,Susan Liang,Yunlong Tang,Jing Bi,Wenqiang Liu,Nima Mesgarani,Chenliang Xu

Main category: cs.SD

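TL;DR: ZeroSep uses a pre-trained text-guided audio diffusion model for zero-shot audio source separation, with no task-specific training and with support for open-set scenarios.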

Details Motivation: Current supervised deep learning methods need large amounts of labeled data and generalize poorly to open-set, real-world scenes, which motivates asking whether pre-trained generative models can overcome these limits. Method: Inverts the mixed audio into the diffusion model's latent space and uses text conditioning to guide the denoising process toward individual sources. Result: ZeroSep performs strongly on multiple separation benchmarks, surpassing even supervised methods. Conclusion: Pre-trained text-guided diffusion models can be successfully repurposed for audio source separation, with open-set support and strong performance. Abstract: Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.

q-bio.TO [Back]

[312] Physiology-Informed Generative Multi-Task Network for Contrast-Free CT Perfusion

Wasif Khan,Kyle B. See,Simon Kato,Ziqian Huang,Amy Lazarte,Kyle Douglas,Xiangyang Lou,Teng J. Peng,Dhanashree Rajderkar,John Rees,Pina Sanelli,Amita Singh,Ibrahim Tuna,Christina A. Wilson,Ruogu Fang

Main category: q-bio.TO

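TL;DR: The paper proposes MAGIC, a deep learning framework that uses generative AI and physiological information to map non-contrast CT images to contrast-free CTP maps, avoiding the contrast agents required by conventional CTP.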

Details Motivation: Contrast agents used in conventional CTP imaging can cause allergic reactions and are costly, so a contrast-free alternative is needed. Method: Develops the MAGIC framework, which combines generative AI with physiological information to produce contrast-free CTP maps from non-contrast CT images. Result: MAGIC performs well in image fidelity and diagnostic accuracy, and a double-blinded study supports its clinical value. Conclusion: MAGIC is a promising route to contrast-free, low-cost, and rapid perfusion imaging. Abstract: Perfusion imaging is extensively utilized to assess hemodynamic status and tissue perfusion in various organs. Computed tomography perfusion (CTP) imaging plays a key role in the early assessment and planning of stroke treatment. While CTP provides essential perfusion parameters to identify abnormal blood flow in the brain, the use of contrast agents in CTP can lead to allergic reactions and adverse side effects, and cost USD 4.9 billion worldwide in 2022. To address these challenges, we propose a novel deep learning framework called Multitask Automated Generation of Intermodal CT perfusion maps (MAGIC). This framework combines generative artificial intelligence and physiological information to map non-contrast computed tomography (CT) imaging to multiple contrast-free CTP imaging maps. We demonstrate enhanced image fidelity by incorporating physiological characteristics into the loss terms. Our network was trained and validated using CT image data from patients referred for stroke at UF Health and demonstrated robustness to abnormalities in brain perfusion activity. A double-blinded study was conducted involving seven experienced neuroradiologists and vascular neurologists. This study validated MAGIC's visual quality and diagnostic accuracy, showing favorable performance compared to clinical perfusion imaging with intravenous contrast injection. Overall, MAGIC holds great promise in revolutionizing healthcare by offering contrast-free, cost-effective, and rapid perfusion imaging.

cs.DB [Back]

[313] TailorSQL: An NL2SQL System Tailored to Your Query Workload

Kapil Vaidya,Jialin Ding,Sebastian Kosak,David Kernert,Chuan Lei,Xiao Qin,Abhinav Tripathy,Ramesh Balan,Balakrishnan Narayanaswamy,Tim Kraska

Main category: cs.DB

TL;DR: TailorSQL uses information in the past query workload to improve NL2SQL translation, significantly improving both accuracy and latency.

Details Motivation: Existing NL2SQL techniques do not exploit the implicit information in past query workloads (such as common join paths and the semantics of tables and columns), even though this information is important for accurate translation. Method: TailorSQL analyzes the past query workload, extracts useful information (such as common join paths and table and column semantics), and combines it with a pre-trained large language model to generate more accurate SQL queries. Result: TailorSQL achieves up to a 2× improvement in execution accuracy on standardized benchmarks. Conclusion: By using past query workload information, TailorSQL significantly improves NL2SQL translation performance and offers a more effective tool for intelligent data applications. Abstract: NL2SQL (natural language to SQL) translates natural language questions into SQL queries, thereby making structured data accessible to non-technical users, serving as the foundation for intelligent data applications. State-of-the-art NL2SQL techniques typically perform translation by retrieving database-specific information, such as the database schema, and invoking a pre-trained large language model (LLM) using the question and retrieved information to generate the SQL query. However, existing NL2SQL techniques miss a key opportunity which is present in real-world settings: NL2SQL is typically applied on existing databases which have already served many SQL queries in the past. The past query workload implicitly contains information which is helpful for accurate NL2SQL translation and is not apparent from the database schema alone, such as common join paths and the semantics of obscurely-named tables and columns. We introduce TailorSQL, an NL2SQL system that takes advantage of information in the past query workload to improve both the accuracy and latency of translating natural language questions into SQL. By specializing to a given workload, TailorSQL achieves up to 2× improvement in execution accuracy on standardized benchmarks.
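To make the notion of "common join paths" in a past workload concrete, the sketch below counts how often each pair of table columns is equated across a list of logged queries. It is an illustration only, not TailorSQL's method: the regular expression, the function name `count_join_paths`, and the sample queries are hypothetical, and the sketch assumes queries reference full table names instead of aliases.

```python
import re
from collections import Counter

# Matches equality conditions of the form table_a.col_x = table_b.col_y.
JOIN_CONDITION = re.compile(
    r"([A-Za-z_]\w*)\.([A-Za-z_]\w*)\s*=\s*([A-Za-z_]\w*)\.([A-Za-z_]\w*)"
)


def count_join_paths(queries):
    """Count how often each (table.column, table.column) pair is joined."""
    counts = Counter()
    for query in queries:
        for lt, lc, rt, rc in JOIN_CONDITION.findall(query):
            if lt.lower() == rt.lower():
                continue  # same-table comparison, not a join path
            # Sort the two sides so a=b and b=a count as the same path.
            pair = tuple(sorted([(lt.lower(), lc.lower()), (rt.lower(), rc.lower())]))
            counts[pair] += 1
    return counts


# Hypothetical past workload.
workload = [
    "SELECT * FROM orders JOIN customers ON orders.cust_id = customers.id",
    "SELECT name FROM customers JOIN orders ON customers.id = orders.cust_id "
    "WHERE orders.total > 10",
    "SELECT * FROM orders JOIN items ON orders.id = items.order_id",
]
join_counts = count_join_paths(workload)
top_path, top_count = join_counts.most_common(1)[0]
```

The most frequent paths could then be added to the prompt as hints alongside the schema. A production system would more likely use a real SQL parser than a regular expression, since aliases, subqueries, and `USING` clauses are not handled here.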

eess.AS [Back]

[314] NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding

Vladimir Bataev,Andrei Andrusenko,Lilit Grigoryan,Aleksandr Laptev,Vitaly Lavrukhin,Boris Ginsburg

Main category: eess.AS

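TL;DR: NGPU-LM proposes an efficient parallelization approach for statistical n-gram language models that greatly speeds up GPU inference, works with multiple ASR model types, and is open-sourced.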

Details Motivation: Existing statistical n-gram language model implementations parallelize poorly, which limits their use in industrial settings. Method: Redesigns the data structures and introduces customizable greedy decoding with GPU-optimized inference for multiple ASR model types. Result: Computational overhead stays below 7%, and more than 50% of the accuracy gap between greedy and beam search is closed in out-of-domain scenarios, without the significant slowdown of beam search. Conclusion: NGPU-LM offers an efficient and general solution for ASR tasks and is open-sourced to support further research. Abstract: Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.
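One common way to bias greedy decoding with an n-gram model is to add a weighted language-model score to the recognizer's score at every step before taking the argmax. The sketch below shows that idea on the CPU with a toy bigram table. It does not reproduce NGPU-LM's GPU data structures or its released code: the function name, the score dictionaries, the `lm_weight` parameter, and the back-off constant are all assumptions made for illustration.

```python
def greedy_decode_with_lm(step_log_probs, bigram_log_probs, lm_weight,
                          unseen_log_prob=-10.0):
    """Greedy decoding where each step's token score is the recognizer
    log-probability plus lm_weight times a bigram LM log-probability."""
    previous = "<s>"  # start-of-sentence marker
    output = []
    for scores in step_log_probs:
        best = max(
            scores,
            key=lambda tok: scores[tok]
            + lm_weight * bigram_log_probs.get((previous, tok), unseen_log_prob),
        )
        output.append(best)
        previous = best
    return output


# Hypothetical per-step recognizer scores (log-probabilities).
steps = [
    {"new": -0.1, "knew": -2.0},
    {"york": -1.0, "yolk": -0.7},
]
# Hypothetical bigram LM favouring the in-context phrase "new york".
bigrams = {("<s>", "new"): -1.0, ("new", "york"): -0.5, ("new", "yolk"): -6.0}

plain = greedy_decode_with_lm(steps, bigrams, lm_weight=0.0)
biased = greedy_decode_with_lm(steps, bigrams, lm_weight=0.5)
```

With the weight set to zero the decoder follows the recognizer alone and picks "yolk"; with a positive weight the bigram score tips the second step to "york". Each step still evaluates only one hypothesis, which is why this kind of biasing avoids the cost of beam search.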

cs.CY [Back]

[315] Conversational Alignment with Artificial Intelligence in Context

Rachel Katharine Sterken,James Ravi Kirkpatrick

Main category: cs.CY

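TL;DR: The article examines how AI conversational agents can align with human communicative norms, proposes the CONTEXT-ALIGN framework, and points out limitations of current large language models in achieving full alignment.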

Details Motivation: To study the relationship between AI conversational agents and human communicative norms in order to improve their design and performance. Method: Draws on the philosophical and linguistic literature to propose the CONTEXT-ALIGN framework for evaluating developers' design choices. Result: Current large language models may face fundamental limitations in achieving full conversational alignment. Conclusion: Further research is needed to overcome the limitations of large language models in conversational alignment. Abstract: The development of sophisticated artificial intelligence (AI) conversational agents based on large language models raises important questions about the relationship between human norms, values, and practices and AI design and performance. This article explores what it means for AI agents to be conversationally aligned to human communicative norms and practices for handling context and common ground and proposes a new framework for evaluating developers' design choices. We begin by drawing on the philosophical and linguistic literature on conversational pragmatics to motivate a set of desiderata, which we call the CONTEXT-ALIGN framework, for conversational alignment with human communicative practices. We then suggest that current large language model (LLM) architectures, constraints, and affordances may impose fundamental limitations on achieving full conversational alignment.