cs.CV

[1] Dynamic Epsilon Scheduling: A Multi-Factor Adaptive Perturbation Budget for Adversarial Training

Alan Mitkiy,James Smith,Hana Satou,Hiroshi Tanaka,Emily Johnson,F Monkey

Main category: cs.CV

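TL;DR: DES is a new framework that dynamically adjusts the perturbation budget in adversarial training, combining gradient-based proxies, prediction confidence, and model uncertainty for more effective adversarial learning.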

Details Motivation: Existing adversarial training methods rely on a fixed perturbation budget and cannot adapt to instance-specific robustness characteristics. Method: DES combines gradient-based proxies, prediction confidence, and model uncertainty to adjust the perturbation budget dynamically. Result: On CIFAR-10 and CIFAR-100, DES improves both adversarial robustness and standard accuracy over fixed-epsilon baselines and prior adaptive methods. Conclusion: DES opens a new avenue for instance-aware, data-driven adversarial training. Abstract: Adversarial training is among the most effective strategies for defending deep neural networks against adversarial examples. A key limitation of existing adversarial training approaches lies in their reliance on a fixed perturbation budget, which fails to account for instance-specific robustness characteristics. While prior works such as IAAT and MMA introduce instance-level adaptations, they often rely on heuristic or static approximations of data robustness. In this paper, we propose Dynamic Epsilon Scheduling (DES), a novel framework that adaptively adjusts the adversarial perturbation budget per instance and per training iteration. DES integrates three key factors: (1) the distance to the decision boundary approximated via gradient-based proxies, (2) prediction confidence derived from softmax entropy, and (3) model uncertainty estimated via Monte Carlo dropout. By combining these cues into a unified scheduling strategy, DES tailors the perturbation budget dynamically to guide more effective adversarial learning. Experimental results on CIFAR-10 and CIFAR-100 show that our method consistently improves both adversarial robustness and standard accuracy compared to fixed-epsilon baselines and prior adaptive methods. Moreover, we provide theoretical insights into the stability and convergence of our scheduling policy. This work opens a new avenue for instance-aware, data-driven adversarial training methods.
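The three cues named in the abstract can be combined in many ways, and the abstract does not give the actual scheduling rule. The following is only a minimal sketch under stated assumptions: the function names, the weighted-average combination, the scale constants, and the choice to give larger budgets to confident, far-from-boundary, low-uncertainty samples are all hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Softmax entropy divided by log(num_classes), so it lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def mc_dropout_uncertainty(prob_samples):
    """Mean per-class variance across stochastic (dropout) forward passes."""
    n, k = len(prob_samples), len(prob_samples[0])
    total = 0.0
    for c in range(k):
        vals = [s[c] for s in prob_samples]
        mean = sum(vals) / n
        total += sum((v - mean) ** 2 for v in vals) / n
    return total / k

def boundary_distance_proxy(logits, grad_norm):
    """First-order proxy: top-2 logit margin divided by the input-gradient norm."""
    top = sorted(logits, reverse=True)
    return (top[0] - top[1]) / (grad_norm + 1e-12)

def des_epsilon(logits, grad_norm, prob_samples,
                eps_min=2 / 255, eps_max=12 / 255,
                weights=(1.0, 1.0, 1.0), dist_scale=1.0, unc_scale=0.05):
    """Hypothetical per-instance budget: weighted average of three scores in [0, 1]."""
    dist = min(boundary_distance_proxy(logits, grad_norm) / dist_scale, 1.0)
    conf = 1.0 - normalized_entropy(softmax(logits))
    cert = 1.0 - min(mc_dropout_uncertainty(prob_samples) / unc_scale, 1.0)
    w_d, w_c, w_u = weights
    score = (w_d * dist + w_c * conf + w_u * cert) / (w_d + w_c + w_u)
    # Assumed direction: higher score (easier sample) gets a larger budget.
    return eps_min + score * (eps_max - eps_min)
```

In a training loop, a function like this would be called per sample and per iteration, with `grad_norm` and `prob_samples` recomputed from the current model, which is what makes the budget both instance-specific and time-varying.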

[2] RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought

Yi Lu,Jiawang Cao,Yongliang Wu,Bozheng Li,Licheng Tang,Yangguang Ji,Chong Wu,Jay Wu,Wenbo Zhu

Main category: cs.CV

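TL;DR: RSVP is a new framework that combines multimodal reasoning with visual segmentation; its two-stage design performs reasoning-driven localization followed by segmentation refinement and significantly improves performance.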

Details Motivation: Multi-modal Large Language Models (MLLMs) reason well but lack explicit mechanisms for visual grounding and segmentation, leaving a gap between cognitive reasoning and visual perception. Method: RSVP uses a two-stage framework: the reasoning stage generates region proposals through multimodal chain-of-thought visual prompting; the segmentation stage refines those proposals with a Vision-Language Segmentation Module (VLSM) to produce precise segmentation masks. Result: RSVP surpasses existing methods on ReasonSeg (by up to +6.5 gIoU and +9.2 cIoU) and reaches 49.7 mAP on SegInW in the zero-shot setting. Conclusion: RSVP offers an effective and scalable framework for combining cognitive reasoning with structured visual understanding. Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability but lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a two-stage structuralized framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In the segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), seamlessly integrating textual and visual cues to produce precise segmentation masks. By explicitly modelling the interaction between multimodal reasoning and segmentation, RSVP introduces a new paradigm for interpretable reasoning segmentation. It exploits MLLMs' inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations. Our extensive experiments demonstrate that RSVP achieves state-of-the-art performance, surpassing state-of-the-art methods by up to +6.5 gIoU and +9.2 cIoU on ReasonSeg, and achieves 49.7 mAP on SegInW under zero-shot settings. These results validate RSVP as an effective and scalable framework for integrating cognitive reasoning with structured visual understanding.
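The abstract reports gIoU and cIoU without defining them. As they are commonly reported for ReasonSeg (an assumption here, not something the abstract states), gIoU is the mean of per-image IoUs and cIoU is the cumulative intersection over the cumulative union, so cIoU is weighted toward large objects. A minimal sketch with masks given as flat 0/1 lists; the function names are hypothetical:

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union

def giou_ciou(preds, gts):
    """gIoU: mean of per-image IoUs. cIoU: summed intersections / summed unions."""
    per_image = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = iou_counts(pred, gt)
        # Convention assumed here: two empty masks count as a perfect match.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou

# One small object segmented poorly, one large object segmented perfectly.
preds = [[1, 1, 0, 0], [1] * 8]
gts = [[1, 0, 0, 0], [1] * 8]
print(giou_ciou(preds, gts))  # (0.75, 0.9)
```

The example shows why the two numbers can move differently: the small, poorly segmented object pulls gIoU down to 0.75, while cIoU stays at 0.9 because the large object dominates the pixel counts.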

[3] Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

Ziming Cheng,Binrui Xu,Lisheng Gong,Zuhe Song,Tianshuo Zhou,Shiqi Zhong,Siyu Ren,Mingxiang Chen,Xiangchao Meng,Yuxin Zhang,Yanlin Li,Lei Ren,Wei Chen,Zhiyuan Huang,Mingjie Zhan,Xiaojie Wang,Fangxiang Feng

Main category: cs.CV

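TL;DR: The paper introduces MMRB, the first multi-image reasoning benchmark for evaluating the structured visual reasoning of multimodal large language models over multi-image inputs, and reveals a significant gap between open-source and commercial models.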

Details Motivation: Existing MLLM benchmarks focus on single-image reasoning or on final-answer evaluation of multi-image tasks, leaving reasoning over multi-image inputs largely unexplored. Method: The authors build the MMRB benchmark with 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. They also propose a sentence-level matching framework based on open-source LLMs for fast evaluation. Result: Experiments show that open-source MLLMs lag significantly behind commercial models on multi-image reasoning, and that current multimodal reward models are nearly incapable of handling multi-image reward ranking. Conclusion: MMRB fills a gap in multi-image reasoning evaluation, provides an important tool for future research, and exposes the weaknesses of open-source models on multi-image tasks. Abstract: With enhanced capabilities and widespread applications, Multimodal Large Language Models (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the Multimodal Multi-image Reasoning Benchmark (MMRB), the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios. To support fast and scalable evaluation, we propose a sentence-level matching framework using open-source LLMs. Extensive baseline experiments on 40 MLLMs, including 9 reasoning-specific models and 8 reward models, demonstrate that open-source MLLMs still lag significantly behind commercial MLLMs in multi-image reasoning tasks. Furthermore, current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.

[4] HuGeDiff: 3D Human Generation via Diffusion with Gaussian Splatting

Maksym Ivashechkin,Oscar Mendez,Richard Bowden

Main category: cs.CV

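TL;DR: The paper proposes a weakly supervised pipeline that uses an image diffusion model to generate a human image dataset with controllable attributes, maps the images to 3D point clouds with a transformer-based architecture, and finally trains a point-cloud diffusion model, significantly improving the speed, realism, and text alignment of 3D human generation.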

Details Motivation: Current 3D human generation methods fall short in detail, realism, and controllability, and the lack of diverse, annotated data hinders the development of a foundation model. Method: (1) Generate a photorealistic human image dataset with controllable attributes using an image diffusion model; (2) map image features to 3D point clouds with a transformer-based architecture; (3) train a point-cloud diffusion model to close the loop. Result: Compared with existing methods, the pipeline achieves orders-of-magnitude speed-ups and significantly better text alignment, realism, and rendering quality. Conclusion: The proposed pipeline addresses key challenges in 3D human generation and offers an efficient, controllable solution for future research. Abstract: 3D human generation is an important problem with a wide range of applications in computer vision and graphics. Despite recent progress in generative AI such as diffusion models or rendering methods like Neural Radiance Fields or Gaussian Splatting, controlling the generation of accurate 3D humans from text prompts remains an open challenge. Current methods struggle with fine detail, accurate rendering of hands and faces, human realism, and controllability over appearance. The lack of diversity, realism, and annotation in human image data also remains a challenge, hindering the development of a foundational 3D human model. We present a weakly supervised pipeline that tries to address these challenges. In the first step, we generate a photorealistic human image dataset with controllable attributes such as appearance, race, gender, etc., using a state-of-the-art image diffusion model. Next, we propose an efficient mapping approach from image features to 3D point clouds using a transformer-based architecture. Finally, we close the loop by training a point-cloud diffusion model that is conditioned on the same text prompts used to generate the original samples. We demonstrate orders-of-magnitude speed-ups in 3D human generation compared to the state-of-the-art approaches, along with significantly improved text-prompt alignment, realism, and rendering quality. We will make the code and dataset available.

[5] Photoreal Scene Reconstruction from an Egocentric Device

Zhaoyang Lv,Maurizio Monge,Ka Chen,Yufeng Zhu,Michael Goesele,Jakob Engel,Zhao Dong,Richard Newcombe

Main category: cs.CV

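TL;DR: The paper studies the challenges of photorealistic, high-dynamic-range scene reconstruction from an egocentric device and proposes high-frequency trajectory calibration based on visual-inertial bundle adjustment (VIBA) together with a physical image formation model built into Gaussian Splatting, which significantly improves reconstruction quality.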

Details Motivation: Existing methods typically assume frame-rate 6DoF poses estimated by the device's visual-inertial odometry system, which may miss details that are crucial for pixel-accurate reconstruction. Method: (1) Use VIBA to calibrate the timestamps and motion of the rolling-shutter RGB camera; (2) incorporate a physical image formation model into Gaussian Splatting to account for sensor characteristics. Result: Experiments show that VIBA and the image formation model each contribute about +1 dB PSNR, clearly improving reconstruction quality. Conclusion: The method performs well under diverse lighting conditions and applies to the widely used variants of the Gaussian Splatting representation. Abstract: In this paper, we investigate the challenges associated with using egocentric devices to photorealistically reconstruct the scene in high dynamic range. Existing methodologies typically assume using frame-rate 6DoF pose estimated from the device's visual-inertial odometry system, which may neglect crucial details necessary for pixel-accurate reconstruction. This study presents two significant findings. Firstly, in contrast to mainstream work treating the RGB camera as a global-shutter frame-rate camera, we emphasize the importance of employing visual-inertial bundle adjustment (VIBA) to calibrate the precise timestamps and movement of the rolling shutter RGB sensing camera in a high frequency trajectory format, which ensures an accurate calibration of the physical properties of the rolling-shutter camera. Secondly, we incorporate a physical image formation model into Gaussian Splatting, which effectively addresses the sensor characteristics, including the rolling-shutter effect of RGB cameras and the dynamic ranges measured by sensors. Our proposed formulation is applicable to the widely-used variants of Gaussian Splats representation. We conduct a comprehensive evaluation of our pipeline using the open-source Project Aria device under diverse indoor and outdoor lighting conditions, and further validate it on a Meta Quest3 device. Across all experiments, we observe a consistent visual enhancement of +1 dB in PSNR by incorporating VIBA, with an additional +1 dB achieved through our proposed image formation model. Our complete implementation, evaluation datasets, and recording profile are available at http://www.projectaria.com/photoreal-reconstruction/
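To put the reported "+1 dB in PSNR" gains in perspective, the standard PSNR definition, 10 * log10(MAX^2 / MSE), can be inverted to see how much the mean squared error must fall. This is a small worked example of that textbook formula, not code from the paper; the function names are illustrative.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to gain `gain_db` dB of PSNR."""
    return 10.0 ** (gain_db / 10.0)

base_mse = 0.01
for gain in (1.0, 2.0):
    ratio = mse_ratio_for_gain(gain)
    print(f"+{gain:.0f} dB: MSE / {ratio:.3f} "
          f"({100 * (1 - 1 / ratio):.1f}% lower), "
          f"PSNR {psnr(base_mse):.1f} -> {psnr(base_mse / ratio):.1f} dB")
```

Under this definition, each +1 dB corresponds to roughly a 21% reduction in MSE, so the two gains reported in the abstract together (+2 dB) correspond to roughly a 37% reduction.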

[6] ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

Ankit Pal,Jung-Oh Lee,Xiaoman Zhang,Malaikannan Sankarasubbu,Seunghyeon Roh,Won Jung Kim,Meesun Lee,Pranav Rajpurkar

Main category: cs.CV

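TL;DR: ReXVQA is the largest and most comprehensive benchmark for chest X-ray visual question answering (VQA), with about 696,000 questions and 160,000 chest X-ray studies. Eight multimodal large language models are evaluated; MedGemma performs best (83.24% overall accuracy) and, in a reader study, outperforms the best human reader (77.27%).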

Details Motivation: Provide diverse, clinically authentic evaluation tasks for chest X-ray VQA and push AI toward expert-level radiological reasoning. Method: Introduce five core radiological reasoning tasks, evaluate eight multimodal large language models, and run a comparative reader study with three radiology residents. Result: MedGemma performs best; in the reader study it reaches 83.84% accuracy, exceeding the best radiology resident (77.27%), and the study reveals different performance patterns for AI models and human readers. Conclusion: ReXVQA sets a new standard for evaluating generalist radiological AI systems and lays the groundwork for next-generation systems; the dataset will be open-sourced. Abstract: We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-ray studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template-based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving 3 radiology residents on 200 randomly sampled cases. Our evaluation demonstrates that MedGemma achieved superior performance (83.84% accuracy) compared to human readers (best radiology resident: 77.27%), representing a significant milestone where AI performance exceeds expert human evaluation on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists while showing more variable agreement patterns between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. 
This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXVQA

[7] WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning

Delong Chen,Willy Chung,Yejin Bang,Ziwei Ji,Pascale Fung

Main category: cs.CV

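TL;DR: The paper introduces WorldPrediction, a video-based benchmark for evaluating the world modeling and procedural planning capabilities of AI models, with an emphasis on actions that involve temporal and semantic abstraction.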

Details Motivation: Humans have an internal "world model" that supports action planning, but it is unclear how current AI models, especially generative models, learn this capability. Method: Two tasks, WorldPrediction-WM and WorldPrediction-PP, require a model to pick out the correct action or the correctly ordered action sequence; states and actions are represented by visual observations, and "action equivalents" are introduced to keep models from exploiting low-level cues. Result: Frontier models reach only 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, far below the perfect performance of humans. Conclusion: WorldPrediction provides a reliable benchmark for evaluating world modeling and planning in AI and exposes the shortcomings of current models. Abstract: Humans are known to have an internal "world model" that enables us to carry out action planning based on world states. AI agents need to have such a world model for action planning as well. It is not clear how current AI models, especially generative models, are able to learn such world models and carry out procedural planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative task setup enables us to evaluate different types of world models and planners and realize a thorough comparison across different hypotheses. The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" - identical actions observed in different contexts - as candidates for selection. This benchmark is grounded in a formal framework of partially observable semi-MDP, ensuring better reliability and robustness of the evaluation. 
We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP whereas humans are able to solve both tasks perfectly.

[8] Ice Hockey Puck Localization Using Contextual Cues

Liam Salass,Jerrin Bright,Amir Nazemi,Yuhao Chen,John Zelek,David Clausi

Main category: cs.CV

TL;DR: PLUCC uses player behaviour as a contextual cue in a novel puck detection method that significantly improves detection performance.

Details Motivation: The puck is hard to detect in video, and existing methods do not make full use of the contextual cues provided by player behaviour. Method: PLUCC consists of a contextual encoder, a feature pyramid encoder, and a gating decoder, combining player orientation and positioning with multiscale features. Result: On PuckDataset, PLUCC improves average precision by 12.2% and RSLE average precision by 25% over previous baselines. Conclusion: Contextual understanding is critical for puck detection and has broad implications for automated sports analysis. Abstract: Puck detection in ice hockey broadcast videos poses significant challenges due to the puck's small size, frequent occlusions, motion blur, broadcast artifacts, and scale inconsistencies due to varying camera zoom and broadcast camera viewpoints. Prior works focus on appearance-based or motion-based cues of the puck without explicitly modelling the cues derived from player behaviour. Players consistently turn their bodies and direct their gaze toward the puck. Motivated by this strong contextual cue, we propose Puck Localization Using Contextual Cues (PLUCC), a novel approach for scale-aware and context-driven single-frame puck detections. PLUCC consists of three components: (a) a contextual encoder, which utilizes player orientations and positioning as helpful priors; (b) a feature pyramid encoder, which extracts multiscale features from the dual encoders; and (c) a gating decoder that combines latent features with a channel gating mechanism. For evaluation, in addition to standard average precision, we propose Rink Space Localization Error (RSLE), a scale-invariant homography-based metric for removing perspective bias from rink space evaluation. The experimental results of PLUCC on PuckDataset demonstrated state-of-the-art detection performance, surpassing previous baseline methods by an average precision improvement of 12.2% and RSLE average precision of 25%. Our research demonstrates the critical role of contextual understanding in improving puck detection performance, with broad implications for automated sports analysis.
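The abstract describes RSLE only as a homography-based metric that removes perspective bias; it does not give the formula. One plausible reading, sketched below purely as an assumption, is to map both the predicted and the ground-truth puck positions from image pixels to rink coordinates with a 3x3 homography and measure the distance there. The matrix `H` and the function names are hypothetical.

```python
import math

def apply_homography(H, point):
    """Map an image point (x, y) to rink coordinates using a 3x3 homography."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by w to leave homogeneous coordinates

def rink_space_error(H, pred_px, gt_px):
    """Euclidean distance between prediction and ground truth in rink space."""
    px, py = apply_homography(H, pred_px)
    gx, gy = apply_homography(H, gt_px)
    return math.hypot(px - gx, py - gy)

# Illustrative homography with a perspective term in the last row.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.001, 1.0]]

# The same 10-pixel image error at two different image rows.
err_a = rink_space_error(H, (110.0, 100.0), (100.0, 100.0))
err_b = rink_space_error(H, (110.0, 500.0), (100.0, 500.0))
print(round(err_a, 3), round(err_b, 3))  # 9.091 6.667
```

The two printed values differ even though the pixel error is identical, which is the perspective bias a rink-space metric is meant to remove: errors are compared in a common physical frame instead of in pixels.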

[9] Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

Jubayer Ahmed Bhuiyan Shawon,Hasan Mahmud,Kamrul Hasan

Main category: cs.CV

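TL;DR: The study fine-tunes video transformer architectures (VideoMAE, ViViT, and TimeSformer) on the BdSLW60 and BdSLW401 datasets and achieves high-accuracy Bangla Sign Language recognition, with VideoMAE performing best.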

Details Motivation: Improve the accuracy and scalability of Bangla Sign Language (BdSL) recognition to make communication more accessible for the hearing-impaired community. Method: Fine-tune and evaluate video transformer architectures (VideoMAE, ViViT, TimeSformer) on BdSLW60 and BdSLW401, using data augmentation and 10-fold stratified cross-validation. Result: VideoMAE reaches 95.5% accuracy on BdSLW60 and 81.04% on BdSLW401, significantly outperforming traditional approaches. Conclusion: Video transformer models perform strongly on Bangla Sign Language recognition; dataset size, video quality, and model architecture are key factors affecting performance. Abstract: Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures -- VideoMAE, ViViT, and TimeSformer -- on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. 
Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame rate corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401 -- demonstrating strong potential for scalable and accurate BdSL recognition.

[10] Visualizing and Controlling Cortical Responses Using Voxel-Weighted Activation Maximization

Matthew W. Shinkle,Mark D. Lescroart

Main category: cs.CV

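TL;DR: The paper applies activation maximization to DNN-based encoding models of the human visual system in order to interpret the responses they predict, and validates the approach experimentally.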

Details Motivation: DNN-based encoding models can predict brain responses to visual stimuli, but they offer little insight into the specific features that drive those responses. Method: Extract and downsample activations from a pretrained Inception V3 network, predict fMRI responses with linear regression, and apply activation maximization to generate optimized images. Result: The generated images correspond to known selectivity and drive activity in the targeted brain regions. Conclusion: Activation maximization can be applied successfully to DNN-based encoding models, providing a flexible way to characterize and modulate responses across the human visual system. Abstract: Deep neural networks (DNNs) trained on visual tasks develop feature representations that resemble those in the human visual system. Although DNN-based encoding models can accurately predict brain responses to visual stimuli, they offer limited insight into the specific features driving these responses. Here, we demonstrate that activation maximization -- a technique designed to interpret vision DNNs -- can be applied to DNN-based encoding models of the human brain. We extract and adaptively downsample activations from multiple layers of a pretrained Inception V3 network, then use linear regression to predict fMRI responses. This yields a full image-computable model of brain responses. Next, we apply activation maximization to generate images optimized for predicted responses in individual cortical voxels. We find that these images contain visual characteristics that qualitatively correspond with known selectivity and enable exploration of selectivity across the visual cortex. We further extend our method to whole regions of interest (ROIs) of the brain and validate its efficacy by presenting these images to human participants in an fMRI study. We find that the generated images reliably drive activity in targeted regions across both low- and high-level visual areas and across subjects. These results demonstrate that activation maximization can be successfully applied to DNN-based encoding models. By addressing key limitations of alternative approaches that require natively generative models, our approach enables flexible characterization and modulation of responses across the human visual system.

[11] Is Perturbation-Based Image Protection Disruptive to Image Editing?

Qiuyu Tang,Bonor Ayambem,Mooi Choo Chuah,Aparna Bharati

Main category: cs.CV

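TL;DR: The study finds that existing perturbation-based image protection methods cannot fully prevent diffusion models from editing images and may even make the edits better.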

Details Motivation: Examine the misuse risks of diffusion models such as Stable Diffusion in image generation and the shortcomings of existing image protection methods. Method: Experimentally evaluate several perturbation-based image protection methods across domains (natural scene images and artworks) and editing tasks (image-to-image generation and style editing). Result: Perturbation-based protection fails to fully prevent editing and may instead make the edited output adhere more closely to the text prompt. Conclusion: Perturbation-based methods are not sufficient to provide robust image protection against diffusion-based editing. Abstract: The remarkable image generation capabilities of state-of-the-art diffusion models, such as Stable Diffusion, can also be misused to spread misinformation and plagiarize copyrighted materials. To mitigate the potential risks associated with image editing, current image protection methods rely on adding imperceptible perturbations to images to obstruct diffusion-based editing. A fully successful protection for an image implies that the output of editing attempts is an undesirable, noisy image which is completely unrelated to the reference image. In our experiments with various perturbation-based image protection methods across multiple domains (natural scene images and artworks) and editing tasks (image-to-image generation and style editing), we discover that such protection does not achieve this goal completely. In most scenarios, diffusion-based editing of protected images generates a desirable output image which adheres precisely to the guidance prompt. Our findings suggest that adding noise to images may paradoxically increase their association with given text prompts during the generation process, leading to unintended consequences such as better resultant edits. Hence, we argue that perturbation-based methods may not provide a sufficient solution for robust image protection against diffusion-based editing.

[12] Normalize Filters! Classical Wisdom for Deep Vision

Gustavo Perez,Stella X. Yu

Main category: cs.CV

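TL;DR: The paper proposes filter normalization followed by learnable scaling and shifting, which fixes the distorted responses of learned convolutional filters under atmospheric transfer and significantly improves performance.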

Details Motivation: Classical image filters are carefully normalized for consistency and interpretability, whereas convolutional filters learned in deep networks lack such constraints, so their responses become distorted under atmospheric transfer. Method: Normalize the filters and follow the normalization with learnable scaling and shifting, akin to batch normalization, so that the filters are atmosphere-equivariant. Result: The method brings significant improvements on artificial and natural intensity-variation benchmarks; a ResNet34 even outperforms CLIP by a large margin. Conclusion: Filter normalization not only removes the distortion but also regularizes learning, promotes diversity, and improves robustness and generalization. Abstract: Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.

[13] HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

Hermann Kumbong,Xian Liu,Tsung-Yi Lin,Ming-Yu Liu,Xihui Liu,Ziwei Liu,Daniel Y. Fu,Christopher Ré,David W. Romero

Main category: cs.CV

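TL;DR: HMAR is a new image generation algorithm that addresses the problems of VAR's parallel generation, producing higher-quality images with faster sampling.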

Details Motivation: VAR suffers from reduced quality when generating all tokens of a scale in parallel, from sequence lengths that grow superlinearly with resolution, and from a fixed sampling schedule; HMAR aims to solve these problems. Method: HMAR treats next-scale prediction as a Markovian process and uses multi-step masked generation to produce the image scale by scale, with optimized attention kernels. Result: On ImageNet benchmarks, HMAR matches or outperforms VAR, diffusion, and autoregressive baselines while training and sampling faster and using less memory. Conclusion: HMAR improves image generation quality and adds flexibility, including zero-shot application to image editing tasks. Abstract: Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the sampling schedule. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256x256 and 512x512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve faster training and inference times over VAR by over 2.5x and 1.75x respectively, as well as over 3x lower inference memory footprint. 
Finally, HMAR yields additional flexibility over VAR; its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.

[14] Towards Large-Scale Pose-Invariant Face Recognition Using Face Defrontalization

Patrik Mesec,Alan Jović

Main category: cs.CV

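TL;DR: The paper proposes "face defrontalization", a method that augments the training dataset to improve pose-invariant face recognition and outperforms existing methods.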

Details Motivation: Tackle face recognition under extreme head poses and avoid the overfitting of existing methods to small datasets. Method: (1) Train a face defrontalization method based on the FFWM model; (2) train a ResNet-50 model with ArcFace loss on a mix of raw and defrontalized data. Result: The method outperforms existing approaches on LFW, AgeDB, and CFP, but not on Multi-PIE at extreme poses. Conclusion: Face defrontalization is effective, and some existing methods may be overfitted to small datasets. Abstract: Face recognition under extreme head poses is a challenging task. Ideally, a face recognition system should perform well across different head poses, which is known as pose-invariant face recognition. To achieve pose invariance, current approaches rely on sophisticated methods, such as face frontalization and various facial feature extraction model architectures. However, these methods are somewhat impractical in real-life settings and are typically evaluated on small scientific datasets, such as Multi-PIE. In this work, we propose the inverse method of face frontalization, called face defrontalization, to augment the training dataset of the facial feature extraction model. The method does not introduce any time overhead during the inference step. The method is composed of: 1) training an adapted face defrontalization FFWM model on a frontal-profile pairs dataset, which has been preprocessed using our proposed face alignment method; 2) training a ResNet-50 facial feature extraction model based on ArcFace loss on a raw and randomly defrontalized large-scale dataset, where defrontalization was performed with our previously trained face defrontalization model. Our method was compared with the existing approaches on four open-access datasets: LFW, AgeDB, CFP, and Multi-PIE. Defrontalization shows improved results compared to models without defrontalization, while the proposed adjustments show clear superiority over the state-of-the-art face frontalization FFWM method on three larger open-access datasets, but not on the small Multi-PIE dataset for extreme poses (75 and 90 degrees). The results suggest that at least some of the current methods may be overfitted to small datasets.

[15] FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

Shizhong Han,Hsin-Pai Cheng,Hong Cai,Jihad Masri,Soyeb Nagori,Fatih Porikli

Main category: cs.CV

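TL;DR: FALO is a hardware-friendly LiDAR 3D detection method that combines state-of-the-art detection accuracy with fast inference, making it suitable for resource-constrained edge devices.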

Details Motivation: Existing LiDAR 3D detection methods rely on sparse convolutions or transformers, which are hard to run on resource-constrained devices; FALO aims to solve this problem. Method: FALO arranges sparse 3D voxels into a 1D sequence and processes it with ConvDotMix blocks (large-kernel convolutions, Hadamard products, and linear layers), introducing implicit grouping for efficiency. Result: FALO achieves competitive performance on the nuScenes and Waymo benchmarks and runs 1.6~9.8x faster than the latest SOTA. Conclusion: FALO is an efficient LiDAR 3D detection method suited to edge devices. Abstract: Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms and the proposed FALO can be readily deployed on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6~9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).

[16] AuthGuard: Generalizable Deepfake Detection via Language Guidance

Guangyu Shen,Zhihua Li,Xiang Xu,Tianchen Zhao,Zheng Zhang,Dongsheng An,Zhuowen Tu,Yifan Xing,Qin Zhang

Main category: cs.CV

TL;DR: AuthGuard combines language guidance with a vision encoder to improve the generalization and accuracy of deepfake detection.

Details Motivation: Existing deepfake detection techniques rely on statistical artifacts learned during training and struggle with new forgery methods. Method: Combine discriminative classification with image-text contrastive learning, use MLLMs to generate the text guidance, and integrate data uncertainty learning. Result: AUC improves by 6.15% on DFDC and 16.68% on DF40, and performance on DDVQA improves by 24.69%. Conclusion: Through language guidance and uncertainty learning, AuthGuard achieves more generalizable and interpretable deepfake detection. Abstract: Existing deepfake detection techniques struggle to keep up with the ever-evolving novel, unseen forgery methods. This limitation stems from their reliance on statistical artifacts learned during training, which are often tied to specific generation processes that may not be representative of samples from new, unseen deepfake generation methods encountered at test time. We propose that incorporating language guidance can improve deepfake detection generalization by integrating human-like commonsense reasoning -- such as recognizing logical inconsistencies and perceptual anomalies -- alongside statistical cues. To achieve this, we train an expert deepfake vision encoder by combining discriminative classification with image-text contrastive learning, where the text is generated by generalist MLLMs using few-shot prompting. This allows the encoder to extract both language-describable, commonsense deepfake artifacts and statistical forgery artifacts from pixel-level distributions. To further enhance robustness, we integrate data uncertainty learning into vision-language contrastive learning, mitigating noise in image-text supervision. Our expert vision encoder seamlessly interfaces with an LLM, further enabling more generalized and interpretable deepfake detection while also boosting accuracy. The resulting framework, AuthGuard, achieves state-of-the-art deepfake detection accuracy in both in-distribution and out-of-distribution settings, achieving AUC gains of 6.15% on the DFDC dataset and 16.68% on the DF40 dataset. Additionally, AuthGuard significantly enhances deepfake reasoning, improving performance by 24.69% on the DDVQA dataset.

[17] Pruning Everything, Everywhere, All at Once

Gustavo Henrique do Nascimento,Ian Pons,Anna Helena Reali Costa,Artur Jordao

Main category: cs.CV

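TL;DR: Proposes a new method that prunes neurons and layers simultaneously, selecting the best subnetwork by representation similarity, which substantially reduces computational complexity while preserving model performance.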

Details Motivation: Deep learning models are complex and computationally expensive, and existing pruning methods target either neurons or layers, not both at once. Method: Selects the best subnetwork by representation similarity (Centered Kernel Alignment) and iteratively prunes neurons and layers. Result: Maintains or improves model performance at high FLOPs reduction; ResNet56 and ResNet110 reach 86.37% and 95.82% FLOPs reduction, respectively, and carbon emissions fall by up to 83.31%. Conclusion: The method opens a new direction for pruning and substantially improves computational efficiency and model robustness. Abstract: Deep learning stands as the modern paradigm for solving cognitive tasks. However, as the problem complexity increases, models grow deeper and computationally prohibitive, hindering advancements in real-world and resource-constrained applications. Extensive studies reveal that pruning structures in these models efficiently reduces model complexity and improves computational efficiency. Successful strategies in this sphere include removing neurons (i.e., filters, heads) or layers, but not both together. Therefore, simultaneously pruning different structures remains an open problem. To fill this gap and leverage the benefits of eliminating neurons and layers at once, we propose a new method capable of pruning different structures within a model as follows. Given two candidate subnetworks (pruned models), one from layer pruning and the other from neuron pruning, our method decides which to choose by selecting the one with the highest representation similarity to its parent (the network that generates the subnetworks) using the Centered Kernel Alignment metric. Iteratively repeating this process provides highly sparse models that preserve the original predictive ability. Throughout extensive experiments on standard architectures and benchmarks, we confirm the effectiveness of our approach and show that it outperforms state-of-the-art layer and filter pruning techniques. At high levels of Floating Point Operations reduction, most state-of-the-art methods degrade accuracy, whereas our approach either improves it or experiences only a minimal drop. Notably, on the popular ResNet56 and ResNet110, we achieve a milestone of 86.37% and 95.82% FLOPs reduction. Besides, our pruned models obtain robustness to adversarial and out-of-distribution samples and take an important step towards GreenAI, reducing carbon emissions by up to 83.31%. Overall, we believe our work opens a new chapter in pruning.
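
The selection rule described in this abstract (keep whichever candidate subnetwork is most similar to its parent under Centered Kernel Alignment) can be sketched in a few lines. The sketch below assumes the linear form of CKA computed on small feature matrices, with rows as examples and columns as features. The function names, the toy feature values, and the choice of linear rather than kernel CKA are illustrative assumptions, not details taken from the paper.

```python
import math

def center_columns(X):
    """Subtract the per-column mean so every feature has zero mean."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]

def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y for two matrices with the same row count."""
    total = 0.0
    for a in range(len(X[0])):
        for b in range(len(Y[0])):
            s = sum(X[i][a] * Y[i][b] for i in range(len(X)))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features.

    Assumes neither feature matrix is constant across examples
    (otherwise the denominator is zero).
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return numerator / denominator

def pick_subnetwork(parent_features, candidates):
    """Return the name of the candidate whose features are most similar to the parent's."""
    return max(candidates, key=lambda name: linear_cka(parent_features, candidates[name]))

# Hypothetical features for four inputs (values chosen only for illustration).
parent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
candidates = {
    "layer_pruned": [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [4.0, 0.0]],  # rescaled parent
    "neuron_pruned": [[1.0], [-1.0], [1.0], [-1.0]],                   # unrelated feature
}
chosen = pick_subnetwork(parent, candidates)
```

Because CKA compares representations through their example-by-example similarity structure, the two candidates may have different feature widths, which is what lets a layer-pruned and a neuron-pruned model be scored against the same parent.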

[18] EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention

Shuo Zhang

Main category: cs.CV

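TL;DR: Proposes EECD-Net, a multi-stage road crack detection method that uses SRCNN to raise image resolution, an SCU to cut energy consumption, and a GAT module to improve detection robustness, reaching 98.6% detection accuracy at a low energy cost of 5.6 mJ.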

Details Motivation: Smart terminal devices struggle with real-time monitoring because of limited energy and low-resolution imaging, so both detection accuracy and energy efficiency need to improve. Method: Uses SRCNN to enhance image resolution, an SCU to convert images into sparse pulse sequences to reduce energy consumption, and a GAT module to fuse multi-scale features for more robust detection. Result: On the CrackVision12K benchmark, EECD-Net reaches 98.6% detection accuracy while consuming only 5.6 mJ, outperforming existing methods. Conclusion: EECD-Net offers an efficient, low-power solution for real-time, large-scale infrastructure monitoring in resource-constrained environments. Abstract: Crack detection on road surfaces is a critical measurement technology in the instrumentation domain, essential for ensuring infrastructure safety and transportation reliability. However, due to limited energy and low-resolution imaging, smart terminal devices struggle to maintain real-time monitoring performance. To overcome these challenges, this paper proposes a multi-stage detection approach for road crack detection, EECD-Net, to enhance accuracy and energy efficiency of instrumentation. Specifically, the sophisticated Super-Resolution Convolutional Neural Network (SRCNN) is employed to address the inherent challenges of low-quality images, which effectively enhances image resolution while preserving critical structural details. Meanwhile, a Spike Convolution Unit (SCU) with Continuous Integrate-and-Fire (CIF) neurons is proposed to convert these images into sparse pulse sequences, significantly reducing power consumption. Additionally, a Gated Attention Transformer (GAT) module is designed to strategically fuse multi-scale feature representations through adaptive attention mechanisms, effectively capturing both long-range dependencies and intricate local crack patterns, and significantly enhancing detection robustness across varying crack morphologies. The experiments on the CrackVision12K benchmark demonstrate that EECD-Net achieves a remarkable 98.6% detection accuracy, surpassing state-of-the-art counterparts such as Hybrid-Segmentor by a significant 1.5%. Notably, the EECD-Net maintains exceptional energy efficiency, consuming merely 5.6 mJ, which is a substantial 33% reduction compared to baseline implementations. This work pioneers a transformative approach in instrumentation-based crack detection, offering a scalable, low-power solution for real-time, large-scale infrastructure monitoring in resource-constrained environments.
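
To illustrate how an intensity can be turned into a sparse pulse sequence, here is a minimal sketch of a generic integrate-and-fire neuron. It is not the paper's Continuous Integrate-and-Fire (CIF) neuron or its Spike Convolution Unit, whose exact definitions the abstract does not give. The function names, the threshold of 1.0, the soft reset, and the constant-input encoding are all assumptions made for illustration.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Accumulate input into a membrane potential and emit 1 whenever it reaches threshold.

    After a spike the threshold is subtracted (a soft reset), so leftover
    charge carries over to the next time step.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane += current
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold
        else:
            spikes.append(0)
    return spikes

def encode_pixel(intensity, steps=8, threshold=1.0):
    """Present a constant intensity in [0, 1] for `steps` time steps and return the spike train."""
    return integrate_and_fire([intensity] * steps, threshold)

# Hypothetical pixel intensities: dark pixels yield few or no spikes, bright ones many.
dark_train = encode_pixel(0.0)
mid_train = encode_pixel(0.25)
bright_train = encode_pixel(1.0)
```

Under this encoding the number of spikes grows with intensity, so mostly dark images produce mostly zeros. Sparse activity of this kind is the property that spiking approaches rely on to reduce power consumption.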

[19] Enhancing Frequency for Single Image Super-Resolution with Learnable Separable Kernels

Heng Tian

Main category: cs.CV

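TL;DR: Proposes Learnable Separable Kernels (LSKs), a plug-and-play module that improves single-image super-resolution (SISR) by directly enhancing image frequency components while substantially reducing parameters and computation.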

Details Motivation: Existing methods usually improve SISR indirectly (for example, through specialized loss functions), whereas LSKs optimize image quality directly from a frequency perspective. Method: LSKs are designed as rank-one matrices that decompose into orthogonal, mergeable one-dimensional kernels, reducing parameter and computation requirements. Result: Experiments show that LSKs cut parameters and computation by more than 60% while improving model performance, especially at higher upscaling factors. Conclusion: LSKs are an efficient and interpretable SISR enhancement module suited to practical applications. Abstract: Existing approaches often enhance the performance of single-image super-resolution (SISR) methods by incorporating auxiliary structures, such as specialized loss functions, to indirectly boost the quality of low-resolution images. In this paper, we propose a plug-and-play module called Learnable Separable Kernels (LSKs), which are formally rank-one matrices designed to directly enhance image frequency components. We begin by explaining why LSKs are particularly suitable for SISR tasks from a frequency perspective. Baseline methods incorporating LSKs demonstrate a significant reduction of over 60% in both the number of parameters and computational requirements. This reduction is achieved through the decomposition of LSKs into orthogonal and mergeable one-dimensional kernels. Additionally, we perform an interpretable analysis of the feature maps generated by LSKs. Visualization results reveal the capability of LSKs to enhance image frequency components effectively. Extensive experiments show that incorporating LSKs not only reduces the number of parameters and computational load but also improves overall model performance. Moreover, these experiments demonstrate that models utilizing LSKs exhibit superior performance, particularly as the upscaling factor increases.
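
The parameter saving from a rank-one kernel follows from a standard identity: a 2D filter that is the outer product of two vectors gives the same output as a vertical 1D pass followed by a horizontal 1D pass. The sketch below checks that identity on a toy image. The function names, kernel values, and image are hypothetical, and nothing here reproduces the paper's learnable module or its training.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding ('valid' output size)."""
    kh, kw = len(kernel), len(kernel[0])
    height, width = len(image), len(image[0])
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(width - kw + 1)
        ]
        for i in range(height - kh + 1)
    ]

def outer(u, v):
    """Rank-one matrix u v^T built from two 1D kernels."""
    return [[a * b for b in v] for a in u]

def separable_conv2d(image, u, v):
    """Apply u as a column (vertical) kernel, then v as a row (horizontal) kernel."""
    column_kernel = [[a] for a in u]
    row_kernel = [list(v)]
    return conv2d_valid(conv2d_valid(image, column_kernel), row_kernel)

# Hypothetical 3-tap kernels and a 4x4 integer image, chosen only for illustration.
u = [1, 2, 1]
v = [-1, 0, 1]
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 8, 7, 6],
    [5, 4, 3, 2],
]
full_output = conv2d_valid(image, outer(u, v))
separable_output = separable_conv2d(image, u, v)

# A k x k kernel stores k*k weights; its two 1D factors store 2*k.
full_params = len(u) * len(v)
separable_params = len(u) + len(v)
```

For a 3x3 kernel the two factors hold 6 weights instead of 9, and the gap widens with kernel size. The reduction of over 60% reported in the abstract refers to whole models and should not be read off this toy count.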

[20] Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

Yunhao Gou,Kai Chen,Zhili Liu,Lanqing Hong,Xin Jin,Zhenguo Li,James T. Kwok,Yu Zhang

Main category: cs.CV

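TL;DR: Proposes RACRO, which uses reinforcement learning to optimize the captions produced by a visual extractor so that they support complex reasoning in multi-modal large language models.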

Details Motivation: Upgrading the reasoning ability of multi-modal large language models currently incurs a high vision-language alignment cost, and naive vision-language decoupling can yield inaccurate or insufficient descriptions. Method: Proposes RACRO, which optimizes the visual extractor's caption generation with reinforcement learning guided by the reasoning objective, aligning perception with reasoning. Result: Experiments show that RACRO achieves state-of-the-art performance on multi-modal math and science benchmarks and supports low-cost adaptation to more advanced reasoning models. Conclusion: By optimizing visual caption generation, RACRO substantially improves the performance and scalability of multi-modal reasoning. Abstract: Recent advances in slow-thinking language models (e.g., OpenAI-o1 and DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks by emulating human-like reflective cognition. However, extending such capabilities to multi-modal large language models (MLLMs) remains challenging due to the high cost of retraining vision-language alignments when upgrading the underlying reasoner LLMs. A straightforward solution is to decouple perception from reasoning, i.e., converting visual inputs into language representations (e.g., captions) that are then passed to a powerful text-only reasoner. However, this decoupling introduces a critical challenge: the visual extractor must generate descriptions that are both faithful to the image and informative enough to support accurate downstream reasoning. To address this, we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO) - a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective. By closing the perception-reasoning loop via reward-based optimization, RACRO significantly enhances visual grounding and extracts reasoning-optimized representations. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance while enabling superior scalability and plug-and-play adaptation to more advanced reasoning LLMs without the necessity for costly multi-modal re-alignment.

[21] LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation

Biao Guo,Fangmin Guo,Guibo Luo,Xiaonan Luo,Feng Zhang

Main category: cs.CV

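TL;DR: Proposes LGM-Pose, a single-branch lightweight global modeling network that extracts global information with a Lightweight Attentional Representation Module (LARM) and a Non-Parametric Transformation Operation (NPT-Op), and fuses multi-scale information with a novel Shuffle-Integrated Fusion Module (SFusion), substantially improving accuracy and processing speed.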

Details Motivation: Current lightweight multi-person pose estimation methods based on multi-branch parallel CNNs struggle to capture global context and suffer from high latency, so a more efficient single-branch architecture is needed. Method: Designs a lightweight MobileViM Block and the LARM module, uses NPT-Op to extract global information, and introduces the SFusion module to integrate multi-scale information. Result: On the COCO and MPII datasets, the method reduces parameter count while achieving better performance and faster processing. Conclusion: Through its single-branch structure and new module designs, LGM-Pose effectively addresses global modeling and performance degradation in lightweight pose estimation. Abstract: Most of the current top-down multi-person pose estimation lightweight methods are based on multi-branch parallel pure CNN network architectures, which often struggle to capture the global context required for detecting semantically complex keypoints and are hindered by high latency due to their intricate and redundant structures. In this article, an approximate single-branch lightweight global modeling network (LGM-Pose) is proposed to address these challenges. In the network, a lightweight MobileViM Block is designed with a proposed Lightweight Attentional Representation Module (LARM), which integrates information within and between patches using the Non-Parametric Transformation Operation (NPT-Op) to extract global information. Additionally, a novel Shuffle-Integrated Fusion Module (SFusion) is introduced to effectively integrate multi-scale information, mitigating performance degradation often observed in single-branch structures. Experimental evaluations on the COCO and MPII datasets demonstrate that our approach not only reduces the number of parameters compared to existing mainstream lightweight methods but also achieves superior performance and faster processing speeds.

[22] Follow-Your-Creation: Empowering 4D Creation through Video Inpainting

Yue Ma,Kunyu Feng,Xinhua Zhang,Hongyu Liu,David Junhao Zhang,Jinbo Xing,Yinhan Zhang,Ayden Yang,Zeyu Wang,Qifeng Chen

Main category: cs.CV

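TL;DR: Follow-Your-Creation is a novel 4D video generation and editing framework that uses a video inpainting model to generate and edit 4D content, supports prompt-based editing, and outperforms existing methods in quality and versatility.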

Details Motivation: Addresses the challenge of generating and editing 4D content from monocular video by using the prior of a video inpainting model to achieve multi-view consistency and user-defined edits. Method: Recasts 4D video generation as a video inpainting problem, produces invisibility masks via depth-based point cloud rendering, combines them with user editing masks to build a composite mask dataset, and uses a self-iterative tuning strategy to strengthen temporal consistency. Result: The generated 4D videos are multi-view consistent, support flexible content editing, and surpass existing methods in quality and versatility. Conclusion: The framework effectively exploits the prior knowledge of the video inpainting model to achieve high-quality 4D video generation and editing, with broad application potential. Abstract: We introduce Follow-Your-Creation, a novel 4D video creation framework capable of both generating and editing 4D content from a single monocular video input. By leveraging a powerful video inpainting foundation model as a generative prior, we reformulate 4D video creation as a video inpainting task, enabling the model to fill in missing content caused by camera trajectory changes or user edits. To facilitate this, we generate composite masked inpainting video data to effectively fine-tune the model for 4D video generation. Given an input video and its associated camera trajectory, we first perform depth-based point cloud rendering to obtain invisibility masks that indicate the regions that should be completed. Simultaneously, editing masks are introduced to specify user-defined modifications, and these are combined with the invisibility masks to create a composite mask dataset. During training, we randomly sample different types of masks to construct diverse and challenging inpainting scenarios, enhancing the model's generalization and robustness in various 4D editing and generation tasks. To handle temporal consistency under large camera motion, we design a self-iterative tuning strategy that gradually increases the viewing angles during training, where the model is used to generate the next-stage training data after each fine-tuning iteration. Moreover, we introduce a temporal packaging module during inference to enhance generation quality. Our method effectively leverages the prior knowledge of the base model without degrading its original performance, enabling the generation of 4D videos with consistent multi-view coherence. In addition, our approach supports prompt-based content editing, demonstrating strong flexibility and significantly outperforming state-of-the-art methods in both quality and versatility.

[23] Hierarchical-Task-Aware Multi-modal Mixture of Incremental LoRA Experts for Embodied Continual Learning

Ziqi Jia,Anmin Wang,Xiaoyang Qu,Xiaowen Yang,Jianzong Wang

Main category: cs.CV

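TL;DR: Proposes a hierarchical continual learning setup (HEC) and a Task-aware Mixture of Incremental LoRA Experts (Task-aware MoILE) method, which reduce catastrophic forgetting through hierarchical learning and a dual-router mechanism; experiments confirm their effectiveness.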

Details Motivation: Existing continual learning methods overlook high-level planning and multi-level knowledge learning. Method: The HEC hierarchical learning setup and Task-aware MoILE, which combine visual-text embedding clustering with a dual-router mechanism and use SVD to preserve key parameters. Result: Experiments show the method reduces forgetting of old tasks better than other methods and effectively supports continual learning. Conclusion: HEC and Task-aware MoILE perform well at hierarchical learning and forgetting reduction, offering a new approach to continual learning. Abstract: Previous continual learning setups for embodied intelligence focused on executing low-level actions based on human commands, neglecting the ability to learn high-level planning and multi-level knowledge. To address these issues, we propose the Hierarchical Embodied Continual Learning Setups (HEC) that divide the agent's continual learning process into two layers: high-level instructions and low-level actions, and define five embodied continual learning sub-setups. Building on these setups, we introduce the Task-aware Mixture of Incremental LoRA Experts (Task-aware MoILE) method. This approach achieves task recognition by clustering visual-text embeddings and uses both a task-level router and a token-level router to select the appropriate LoRA experts. To effectively address the issue of catastrophic forgetting, we apply Singular Value Decomposition (SVD) to the LoRA parameters obtained from prior tasks, preserving key components while orthogonally training the remaining parts. The experimental results show that our method stands out in reducing the forgetting of old tasks compared to other methods, effectively supporting agents in retaining prior knowledge while continuously learning new tasks.

[24] SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

Alexander Huang-Menders,Xinhang Liu,Andy Xu,Yuyao Zhang,Chi-Keung Tang,Yu-Wing Tai

Main category: cs.CV

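TL;DR: SmartAvatar is a vision-language-agent-driven framework that generates animation-ready 3D human avatars from a single photo or text prompt, using large models and parametric generators for high-quality customization.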

Details Motivation: Existing diffusion models struggle to control identity, body shape, and animation readiness precisely in 3D human avatar generation; SmartAvatar aims to solve this. Method: Combines vision-language models with parametric generators, iteratively refines generation parameters through an autonomous verification loop (render, evaluate, adjust), and supports refinement through natural-language interaction. Result: The generated avatars are high quality and animation-ready, and outperform existing methods in mesh quality, identity fidelity, and animation readiness. Conclusion: SmartAvatar provides a high-quality, customizable avatar generation tool for consumer-grade hardware. Abstract: SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars from a single photo or textual prompt. While diffusion-based methods have made progress in general 3D object generation, they continue to struggle with precise control over human identity, body shape, and animation readiness. In contrast, SmartAvatar leverages the commonsense reasoning capabilities of large vision-language models (VLMs) in combination with off-the-shelf parametric human generators to deliver high-quality, customizable avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars, evaluates facial similarity, anatomical plausibility, and prompt alignment, and iteratively adjusts generation parameters for convergence. This interactive, AI-guided refinement process promotes fine-grained control over both facial and body features, enabling users to iteratively refine their avatars via natural-language conversations. Unlike diffusion models that rely on static pre-trained datasets and offer limited flexibility, SmartAvatar brings users into the modeling loop and ensures continuous improvement through an LLM-driven procedural generation and verification system. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance, making them suitable for downstream animation and interactive applications. Quantitative benchmarks and user studies demonstrate that SmartAvatar outperforms recent text- and image-driven avatar generation systems in terms of reconstructed mesh quality, identity fidelity, attribute accuracy, and animation readiness, making it a versatile tool for realistic, customizable avatar creation on consumer-grade hardware.

[25] Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth

Jinyoung Jun,Lei Chu,Jiahao Li,Yan Lu,Chang-Su Kim

Main category: cs.CV

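TL;DR: Proposes Perfecting Depth, a two-stage framework that combines a stochastic diffusion model with deterministic refinement to improve the reliability and accuracy of sensor depth maps.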

Details Motivation: Detect unreliable regions in sensor depth maps while preserving geometric cues, providing more reliable depth data for autonomous driving, robotics, and immersive technologies. Method: Two stages: 1) a stochastic estimation stage that uses the training-inference domain gap to identify unreliable regions and infer geometric structure; 2) a deterministic refinement stage that uses the uncertainty map to enforce structural consistency and pixel-level accuracy. Result: Experiments show the method produces dense, artifact-free depth maps and performs well across diverse real-world scenes. Conclusion: The framework sets a new baseline for sensor depth enhancement and has broad application potential. Abstract: We propose a novel two-stage framework for sensor depth enhancement, called Perfecting Depth. This framework leverages the stochastic nature of diffusion models to automatically detect unreliable depth regions while preserving geometric cues. In the first stage (stochastic estimation), the method identifies unreliable measurements and infers geometric structure by leveraging a training-inference domain gap. In the second stage (deterministic refinement), it enforces structural consistency and pixel-level accuracy using the uncertainty map derived from the first stage. By combining stochastic uncertainty modeling with deterministic refinement, our method yields dense, artifact-free depth maps with improved reliability. Experimental results demonstrate its effectiveness across diverse real-world scenarios. Furthermore, theoretical analysis, various experiments, and qualitative visualizations validate its robustness and scalability. Our framework sets a new baseline for sensor depth enhancement, with potential applications in autonomous driving, robotics, and immersive technologies.

[26] Deep Learning Reforms Image Matching: A Survey and Outlook

Shihua Zhang,Zizhuo Li,Kaining Zhang,Yifan Lu,Yuxin Deng,Linfeng Tang,Xingyu Jiang,Jiayi Ma

Main category: cs.CV

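TL;DR: This survey reviews how deep learning has incrementally transformed the traditional image matching pipeline, both by replacing individual steps with learnable modules and by merging multiple steps into end-to-end learnable modules, and evaluates representative methods on several tasks.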

Details Motivation: Traditional image matching pipelines perform poorly in challenging scenes, while advances in deep learning have markedly improved robustness and accuracy. The paper aims to review comprehensively how deep learning has reshaped image matching. Method: Categorizes and evaluates deep-learning-driven strategies, including learnable detector-descriptors, outlier filters, and geometric estimators, as well as end-to-end learnable modules. Result: Benchmarks representative methods on relative pose recovery, homography estimation, and visual localization. Conclusion: Summarizes current challenges and proposes future research directions, giving a clear overview of the field and pointing to directions for innovation. Abstract: Image matching, which establishes correspondences between two-view images to recover 3D structure and camera geometry, serves as a cornerstone in computer vision and underpins a wide range of applications, including visual localization, 3D reconstruction, and simultaneous localization and mapping (SLAM). Traditional pipelines composed of "detector-descriptor, feature matcher, outlier filter, and geometric estimator" falter in challenging scenarios. Recent deep-learning advances have significantly boosted both robustness and accuracy. This survey adopts a unique perspective by comprehensively reviewing how deep learning has incrementally transformed the classical image matching pipeline. Our taxonomy aligns closely with the traditional pipeline in two key aspects: i) the replacement of individual steps in the traditional pipeline with learnable alternatives, including learnable detector-descriptor, outlier filter, and geometric estimator; and ii) the merging of multiple steps into end-to-end learnable modules, encompassing middle-end sparse matcher, end-to-end semi-dense/dense matcher, and pose regressor. We first examine the design principles, advantages, and limitations of both aspects, and then benchmark representative methods on relative pose recovery, homography estimation, and visual localization tasks. Finally, we discuss open challenges and outline promising directions for future research. By systematically categorizing and evaluating deep learning-driven strategies, this survey offers a clear overview of the evolving image matching landscape and highlights key avenues for further innovation.

[27] Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

Linjie Li,Mahtab Bigverdi,Jiawei Gu,Zixian Ma,Yinuo Yang,Ziang Li,Yejin Choi,Ranjay Krishna

Main category: cs.CV

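TL;DR: STARE is a benchmark for evaluating multimodal large language models on spatial cognition tasks, covering geometric transformations, spatial reasoning, and real-world problems. Models do well on simple tasks but perform near chance on complex ones and fail to make effective use of visual simulations.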

Details Motivation: Existing AI benchmarks focus mainly on verbal reasoning and neglect the complexity of non-verbal, multi-step visual simulation. STARE aims to fill this gap by evaluating models on spatial cognition tasks. Method: STARE contains 4K tasks covering 2D/3D geometric transformations, cube net folding, tangram puzzles, and real-world spatial reasoning. It compares human and model performance to analyze the role of visual simulation. Result: Models excel at simple 2D tasks but perform near chance on complex tasks such as 3D cube net folding and tangram puzzles. Humans speed up significantly with visual simulations, whereas models show inconsistent gains. Conclusion: Models remain limited on complex spatial cognition tasks and do not use visual simulations effectively. STARE provides an important benchmark for future research. Abstract: Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks like 3D cube net folding and tangram puzzles that require multi-step visual simulations. Humans achieve near-perfect accuracy but take considerable time (up to 28.9s) on complex tasks, and speed up significantly (by 7.5 seconds on average) with intermediate visual simulations. In contrast, models exhibit inconsistent performance gains from visual simulations, improving on most tasks but declining in specific cases like tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.

[28] Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders

Qiming Hu,Linlong Fan,Yiyan Luo,Yuhang Yu,Xiaojie Guo,Qingnan Fan

Main category: cs.CV

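TL;DR: TADiSR is a diffusion-based super-resolution framework that uses text-aware attention and joint segmentation decoders to improve the structural fidelity of text regions in real-world images.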

Details Motivation: Generative models perform well in image super-resolution but often distort text structures. This paper aims to fix that. Method: Proposes the TADiSR framework, which combines text-aware attention with joint segmentation decoders, and develops a complete pipeline for synthesizing high-quality images. Result: Experiments show that TADiSR substantially improves text legibility in super-resolved images and achieves state-of-the-art results on multiple evaluation metrics. Conclusion: TADiSR generalizes strongly to real-world scenes, and the code is open source. Abstract: The introduction of generative models has significantly advanced image super-resolution (SR) in handling real-world degradations. However, they often incur fidelity-related issues, particularly distorting textual structures. In this paper, we introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders to recover not only natural details but also the structural fidelity of text regions in degraded real-world images. Moreover, we propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content. Extensive experiments demonstrate that our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics and exhibiting strong generalization to real-world scenarios. Our code is available at https://github.com/mingcv/TADiSR.

[29] FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

Akide Liu,Zeyu Zhang,Zhexin Li,Xuehai Bai,Yizeng Han,Jiasheng Tang,Yuanjie Xing,Jichao Wu,Mingyang Yang,Weihua Chen,Jiahao He,Yuanyu He,Fan Wang,Gholamreza Haffari,Bohan Zhuang

Main category: cs.CV

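TL;DR: FPSAttention is a training-aware co-design of FP8 quantization and sparsity for video generation that substantially speeds up inference without sacrificing generation quality.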

Details Motivation: Diffusion generative models excel at high-quality video generation, but slow inference and high computational demands limit practical use. Method: Proposes FPSAttention, which jointly optimizes quantization and sparsity through a unified 3D tile-wise granularity, a denoising step-aware strategy, and a hardware-friendly kernel. Result: On the VBench benchmark, FPSAttention achieves a 7.09x speedup for attention operations and a 4.96x end-to-end speedup for video generation. Conclusion: FPSAttention substantially improves video generation efficiency while preserving generation quality, offering a practical path to deployment. Abstract: Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.

[30] Feature-Based Lie Group Transformer for Real-World Applications

Takayuki Komatsu,Yoshiyuki Ohmura,Kayato Nishitsunoi,Yasuo Kuniyoshi

Main category: cs.CV

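TL;DR: The paper proposes a new representation learning method that combines feature extraction with object segmentation to apply group decomposition theory to more realistic scenarios, addressing the inability of conventional approaches to handle conditional independence.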

Details Motivation: Conventional representation learning assumes that disentangled, independent feature axes make a good representation, but this cannot account for conditional independence. The authors seek a more general representation that can handle complex real-world data. Method: Combines feature extraction with object segmentation, replaces pixel translation with feature translation, and treats object segmentation as grouping features under the same transformation. Builds on group decomposition in Galois algebra theory. Result: Validated on a dataset containing real-world objects and backgrounds, showing the method applies to more realistic scenarios. Conclusion: The method offers a new perspective on how human object recognition develops in the real world and may advance representation learning. Abstract: The main goal of representation learning is to acquire meaningful representations from real-world sensory inputs without supervision. Representation learning explains some aspects of human development. Various neural network (NN) models have been proposed that acquire empirically good representations. However, the formulation of a good representation has not been established. We recently proposed a method for categorizing changes between a pair of sensory inputs. A unique feature of this approach is that transformations between two sensory inputs are learned to satisfy algebraic structural constraints. Conventional representation learning often assumes that disentangled independent feature axes are a good representation; however, we found that such a representation cannot account for conditional independence. To overcome this problem, we proposed a new method using group decomposition in Galois algebra theory. Although this method is promising for defining a more general representation, it assumes pixel-to-pixel translation without feature extraction, and can only process low-resolution images with no background, which prevents real-world application. In this study, we provide a simple method to apply our group decomposition theory to a more realistic scenario by combining feature extraction and object segmentation. We replace pixel translation with feature translation and formulate object segmentation as grouping features under the same transformation. We validated the proposed method on a practical dataset containing both real-world objects and backgrounds. We believe that our model will lead to a better understanding of human development of object recognition in the real world.

[31] Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts

Zhong Ji,Rongshuai Wei,Jingren Liu,Yanwei Pang,Jungong Han

Main category: cs.CV

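TL;DR: The paper proposes a Few-Shot Prototypical Concept Classification (FSPCC) framework that uses parameter-efficient adaptation and multi-level feature fusion to address the poor performance of Self-Explainable Models (SEMs) in data-scarce settings.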

Details Motivation: Self-Explainable Models (SEMs) perform poorly in data-scarce settings, mainly because of parametric imbalance and representation misalignment. Method: Uses a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, combined with cross-module concept guidance and a multi-level feature fusion strategy, and introduces a geometry-aware concept discrimination loss. Result: On six benchmarks, FSPCC clearly outperforms existing SEMs, with 4.2%-8.7% relative gains in 5-way 5-shot classification. Conclusion: By coupling concept learning with few-shot adaptation, FSPCC achieves higher accuracy and better interpretability, offering a new approach to transparent visual recognition systems. Abstract: Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone's feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
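
One common way to encourage orthogonality among concept vectors is to penalize the squared cosine similarity between every pair. The sketch below shows that generic penalty. The abstract does not give the exact form of the paper's geometry-aware concept discrimination loss, so the function names, the pairwise squared-cosine formulation, and the toy vectors are assumptions for illustration only.

```python
import math

def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def orthogonality_penalty(concepts):
    """Sum of squared cosine similarities over all distinct pairs of concept vectors.

    The penalty is 0 when all concepts are mutually orthogonal and grows
    as concepts overlap in direction.
    """
    unit = [normalize(c) for c in concepts]
    total = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cosine = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cosine * cosine
    return total

# Hypothetical 2D concept vectors, chosen only for illustration.
orthogonal_concepts = [[1.0, 0.0], [0.0, 1.0]]
overlapping_concepts = [[1.0, 0.0], [1.0, 1.0]]
penalty_orthogonal = orthogonality_penalty(orthogonal_concepts)
penalty_overlapping = orthogonality_penalty(overlapping_concepts)
```

Normalizing first makes the penalty depend only on direction, so it cannot be reduced simply by shrinking the vectors. In training, a term like this would typically be added to the classification loss with a weighting coefficient.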

[32] Gen-n-Val: Agentic Image Data Generation and Validation

Jing-En Huang,I-Sheng Fang,Tzuhsuan Huang,Chih-Yu Wang,Jun-Cheng Chen

Main category: cs.CV

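TL;DR: Gen-n-Val is a new data generation framework that uses Layer Diffusion, LLMs, and VLLMs to produce high-quality single-object masks and diverse backgrounds, substantially reducing invalid data and improving model performance.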

Details Motivation: Addresses data scarcity and label noise in computer vision tasks; current synthetic data generation methods suffer from multiple objects per mask, inaccurate segmentation, and incorrect labels. Method: Gen-n-Val comprises two agents: an LD prompt agent that optimizes prompts to generate high-quality foreground images and masks, and a data validation agent that filters out low-quality data. System prompts are refined with TextGrad, and image harmonization is used. Result: Reduces invalid synthetic data from 50% to 7%, improves COCO instance segmentation by 1% mAP, and improves open-vocabulary object detection by 7.1% mAP. Conclusion: Gen-n-Val substantially improves synthetic data quality and model performance for instance segmentation and object detection. Abstract: Recently, Large Language Models (LLMs) and Vision Large Language Models (VLLMs) have demonstrated impressive performance as agents across various tasks while data scarcity and label noise remain significant challenges in computer vision tasks, such as object detection and instance segmentation. A common solution for resolving these issues is to generate synthetic data. However, current synthetic data generation methods struggle with issues, such as multiple objects per mask, inaccurate segmentation, and incorrect category labels, limiting their effectiveness. To address these issues, we introduce Gen-n-Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), LLMs, and VLLMs to produce high-quality, single-object masks and diverse backgrounds. Gen-n-Val consists of two agents: (1) The LD prompt agent, an LLM, optimizes prompts for LD to generate high-quality foreground instance images and segmentation masks. These optimized prompts ensure the generation of single-object synthetic data with precise instance masks and clean backgrounds. (2) The data validation agent, a VLLM, filters out low-quality synthetic instance images. The system prompts for both agents are refined through TextGrad. Additionally, we use image harmonization to combine multiple instances within scenes. Compared to state-of-the-art synthetic data approaches like MosaicFusion, our approach reduces invalid synthetic data from 50% to 7% and improves performance by 1% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Furthermore, Gen-n-Val shows significant improvements (7.1% mAP) over YOLO-Worldv2-M in open-vocabulary object detection benchmarks with YOLO11m. Moreover, Gen-n-Val improves the performance of YOLOv9 and YOLO11 families in instance segmentation and object detection.

[33] MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements

Chuyun Deng,Na Liu,Wei Xie,Lianming Xu,Li Wang

Main category: cs.CV

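TL;DR: MARS is a multi-scale aware radio map super-resolution method that combines CNNs and Transformers. It improves reconstruction accuracy through multi-scale feature fusion and residual connections and outperforms baseline models.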

Details Motivation: Radio maps are essential for applications such as smart cities and IoT, but accurate reconstruction from sparse measurements remains challenging. Traditional methods lack environmental awareness, and deep learning approaches depend on detailed scene data, which limits generalization. Method: MARS combines CNNs and Transformers with multi-scale feature fusion and residual connections, attending to both global and local feature extraction. Result: In experiments across different scenes and antenna locations, MARS outperforms baseline models on MSE and SSIM at low computational cost. Conclusion: MARS shows strong practical potential for reconstructing radio maps efficiently and accurately. Abstract: Radio maps reflect the spatial distribution of signal strength and are essential for applications like smart cities, IoT, and wireless network planning. However, reconstructing accurate radio maps from sparse measurements remains challenging. Traditional interpolation and inpainting methods lack environmental awareness, while many deep learning approaches depend on detailed scene data, limiting generalization. To address this, we propose MARS, a Multi-scale Aware Radiomap Super-resolution method that combines CNNs and Transformers with multi-scale feature fusion and residual connections. MARS focuses on both global and local feature extraction, enhancing feature representation across different receptive fields and improving reconstruction accuracy. Experiments across different scenes and antenna locations show that MARS outperforms baseline models in both MSE and SSIM, while maintaining low computational cost, demonstrating strong practical potential.

[34] HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Youngwan Lee,Kangsan Kim,Kwanyong Park,Ilcahe Jung,Soojin Jang,Seanie Lee,Yong-Ju Lee,Sung Ju Hwang

Main category: cs.CV

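TL;DR: The paper proposes the HoliSafe dataset and the SafeLLaVA model to address gaps in existing vision-language model (VLM) safety. By comprehensively covering safe/unsafe image-text combinations and introducing a learnable safety meta token, it substantially improves model safety.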

Details Motivation: Existing VLM safety approaches suffer from incomplete dataset coverage and limited architectural innovation, leaving models vulnerable to attack. Method: The paper proposes the HoliSafe dataset and the SafeLLaVA model; the latter includes a learnable safety meta token and a dedicated safety head. Result: SafeLLaVA achieves state-of-the-art safety performance on multiple VLM benchmarks, and the HoliSafe dataset exposes vulnerabilities in existing models. Conclusion: HoliSafe and SafeLLaVA offer a new direction for VLM safety research and advance multimodal alignment. Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation. We further propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head. The meta token encodes harmful visual cues during training, intrinsically guiding the language model toward safer responses, while the safety head offers interpretable harmfulness classification aligned with refusal rationales. Experiments show that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe benchmark itself reveals critical vulnerabilities in existing models. We hope that HoliSafe and SafeLLaVA will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

[35] Line of Sight: On Linear Representations in VLLMs

Achyuta Rajaram,Sarah Schwettmann,Jacob Andreas,Arthur Conmy

Main category: cs.CV

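TL;DR: The paper examines how image concepts are represented in the multimodal language model LlaVA-Next. It finds that linearly decodable features in the residual stream represent ImageNet classes and verifies their causality by editing model outputs. Training sparse autoencoders (SAEs) increases the diversity of the features and shows that representations across modalities become increasingly shared in deeper layers.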

Details Motivation: To understand how multimodal language models represent image concepts in their hidden activations, and so better understand their internal mechanisms. Method: Using LlaVA-Next, the authors analyze linearly decodable features in the residual stream and verify their causality by editing model outputs. They train multimodal sparse autoencoders (SAEs) to increase feature diversity. Result: ImageNet classes are represented via linearly decodable features, and these features are causal. Representations across modalities become increasingly shared in deeper layers. Conclusion: Representations in multimodal language models differ across modalities but gradually merge in deeper layers, and sparse autoencoders help improve feature diversity and interpretability. Abstract: Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LlaVA-Next, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that the features are causal by performing targeted edits on the model output. In order to increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.

[36] Robust Few-Shot Vision-Language Model Adaptation

Hanxin Wang,Tian Liu,Shu Kong

Main category: cs.CV

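TL;DR: The study proposes SRAPF, a method that partially finetunes the visual encoder and combines retrieval augmentation with adversarial perturbation, substantially improving ID and OOD accuracy in few-shot adaptation.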

Details Motivation: Pretrained vision-language models (VLMs) perform well in few-shot adaptation, but their performance degrades on out-of-distribution (OOD) data, so their OOD generalization needs to improve. Method: By comparing adaptation methods, the authors find that partial finetuning of the visual encoder works best, and they propose SRAPF, which combines retrieval augmentation and adversarial perturbation in stages. Result: SRAPF achieves state-of-the-art ID and OOD accuracy on the ImageNet OOD benchmarks. Conclusion: Partial finetuning of the visual encoder combined with retrieval augmentation and adversarial perturbation is an effective way to improve few-shot adaptation. Abstract: Pretrained VLMs achieve strong performance on downstream tasks when adapted with just a few labeled examples. As the adapted models inevitably encounter out-of-distribution (OOD) test data that deviates from the in-distribution (ID) task-specific training data, enhancing OOD generalization in few-shot adaptation is critically important. We study robust few-shot VLM adaptation, aiming to increase both ID and OOD accuracy. By comparing different adaptation methods (e.g., prompt tuning, linear probing, contrastive finetuning, and full finetuning), we uncover three key findings: (1) finetuning with proper hyperparameters significantly outperforms the popular VLM adaptation methods prompt tuning and linear probing; (2) visual encoder-only finetuning achieves better efficiency and accuracy than contrastively finetuning both visual and textual encoders; (3) finetuning the top layers of the visual encoder provides the best balance between ID and OOD accuracy. Building on these findings, we propose partial finetuning of the visual encoder empowered with two simple augmentation techniques: (1) retrieval augmentation which retrieves task-relevant data from the VLM's pretraining dataset to enhance adaptation, and (2) adversarial perturbation which promotes robustness during finetuning. Results show that the former/latter boosts OOD/ID accuracy while slightly sacrificing the ID/OOD accuracy. Yet, perhaps understandably, naively combining the two does not maintain their best OOD/ID accuracy. We address this dilemma with the developed SRAPF, Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning. SRAPF consists of two stages: (1) partial finetuning the visual encoder using both ID and retrieved data, and (2) adversarial partial finetuning with few-shot ID data. Extensive experiments demonstrate that SRAPF achieves the state-of-the-art ID and OOD accuracy on the ImageNet OOD benchmarks.

[37] Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model

Zelu Qi,Ping Shi,Chaoyang Zhang,Shuqi Wang,Fei Zhao,Da Pan,Zefeng Ying

Main category: cs.CV

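TL;DR: The paper proposes an automatic visual quality assessment method for AI-generated videos (AIGVs) based on multi-dimensional features and a large language model (LLM). The method took second place in the NTIRE 2025 challenge.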

Details Motivation: AIGVs suffer from visual quality defects (noise, blurriness, frame jitter, and so on) that harm the viewing experience, so an effective automatic quality assessment method is needed. Method: AIGV visual quality is decomposed into three dimensions: technical quality, motion quality, and video semantics. A corresponding encoder is designed for each, and an LLM is introduced as the quality regression module, together with a multi-modal prompt engineering framework and LoRA fine-tuning. Result: The method took second place in the NTIRE 2025 challenge, which supports its effectiveness. Conclusion: Combining multi-dimensional features with an LLM markedly improves the accuracy of AIGV visual quality assessment. Abstract: The development of AI-Generated Video (AIGV) technology has been remarkable in recent years, significantly transforming the paradigm of video content production. However, AIGVs still suffer from noticeable visual quality defects, such as noise, blurriness, frame jitter and low dynamic degree, which severely impact the user's viewing experience. Therefore, an effective automatic visual quality assessment is of great importance for AIGV content regulation and generative model improvement. In this work, we decompose the visual quality of AIGVs into three dimensions: technical quality, motion quality, and video semantics. For each dimension, we design a corresponding encoder to achieve effective feature representation. Moreover, considering the outstanding performance of large language models (LLMs) in various vision and language tasks, we introduce an LLM as the quality regression module. To better enable the LLM to establish reasoning associations between multi-dimensional features and visual quality, we propose a specially designed multi-modal prompt engineering framework. Additionally, we incorporate LoRA fine-tuning technology during the training phase, allowing the LLM to better adapt to specific tasks. Our proposed method achieved second place in the NTIRE 2025 Quality Assessment of AI-Generated Content Challenge: Track 2 AI Generated video, demonstrating its effectiveness. Codes can be obtained at https://github.com/QiZelu/AIGVEval.

[38] Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion

Hongyu Wang,Yonghao Long,Yueyao Chen,Hon-Chi Yip,Markus Scheppach,Philip Wai-Yan Chiu,Yeung Yam,Helen Mei-Ling Meng,Qi Dou

Main category: cs.CV

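TL;DR: The paper proposes iDPOE, a new method that improves trajectory prediction in Endoscopic Submucosal Dissection (ESD) through an implicit diffusion policy and equivariant representations, improving surgical skill training.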

Details Motivation: Predicting dissection trajectories in ESD videos matters for improving surgical skill training and simplifying the learning process, but existing methods fall short in handling uncertain future movements, geometric symmetries, and generalization. Method: The approach combines an implicit diffusion policy with equivariant representations, models expert behavior through a joint state-action distribution, and uses a diffusion model to improve policy learning and sampling. Result: On a dataset of nearly 2000 ESD video clips, iDPOE surpasses existing methods in trajectory prediction. Conclusion: iDPOE is the first method to apply imitation learning to trajectory prediction for surgical skill development and has considerable practical potential. Abstract: Endoscopic Submucosal Dissection (ESD) is a well-established technique for removing epithelial lesions. Predicting dissection trajectories in ESD videos offers significant potential for enhancing surgical skill training and simplifying the learning process, yet this area remains underexplored. While imitation learning has shown promise in acquiring skills from expert demonstrations, challenges persist in handling uncertain future movements, learning geometric symmetries, and generalizing to diverse surgical scenarios. To address these, we introduce a novel approach: Implicit Diffusion Policy with Equivariant Representations for Imitation Learning (iDPOE). Our method models expert behavior through a joint state action distribution, capturing the stochastic nature of dissection trajectories and enabling robust visual representation learning across various endoscopic views. By incorporating a diffusion model into policy learning, iDPOE ensures efficient training and sampling, leading to more accurate predictions and better generalization. Additionally, we enhance the model's ability to generalize to geometric symmetries by embedding equivariance into the learning process. To address state mismatches, we develop a forward-process guided action inference strategy for conditional sampling. Using an ESD video dataset of nearly 2000 clips, experimental results show that our approach surpasses state-of-the-art methods, both explicit and implicit, in trajectory prediction. To the best of our knowledge, this is the first application of imitation learning to surgical skill development for dissection trajectory prediction.

[39] Using In-Context Learning for Automatic Defect Labelling of Display Manufacturing Data

Babar Hussain,Qiang Liu,Gang Chen,Bihai She,Dahai Yu

Main category: cs.CV

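TL;DR: This paper proposes an AI-assisted auto-labeling system for display panel defect detection that leverages in-context learning. An enhanced SegGPT architecture and a scribble-based annotation mechanism substantially improve labeling efficiency. Experiments show strong results on industrial datasets, and models trained on the auto-labeled data perform close to those trained on human-labeled data.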

Details Motivation: To reduce manual annotation effort in industrial inspection systems and improve the efficiency and accuracy of defect detection. Method: An enhanced SegGPT architecture is combined with domain-specific training techniques and a scribble-based annotation mechanism, using a two-stage training approach. Result: On industrial datasets, average IoU rises by 0.22 and recall improves by 14%, with about 60% auto-labeling coverage. Conclusion: The system offers an efficient auto-labeling solution for industrial inspection and substantially reduces the need for manual annotation. Abstract: This paper presents an AI-assisted auto-labeling system for display panel defect detection that leverages in-context learning capabilities. We adopt and enhance the SegGPT architecture with several domain-specific training techniques and introduce a scribble-based annotation mechanism to streamline the labeling process. Our two-stage training approach, validated on industrial display panel datasets, demonstrates significant improvements over the baseline model, achieving an average IoU increase of 0.22 and a 14% improvement in recall across multiple product types, while maintaining approximately 60% auto-labeling coverage. Experimental results show that models trained on our auto-labeled data match the performance of those trained on human-labeled data, offering a practical solution for reducing manual annotation efforts in industrial inspection systems.
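
For readers less familiar with the two metrics quoted in this entry, IoU and recall for a binary defect mask can be computed as in the sketch below. This is not the paper's evaluation code: the function names and the flat 0/1 list format for masks are assumptions made for illustration.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat lists of 0/1."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks agree perfectly, so define their IoU as 1.
    return inter / union if union else 1.0


def recall(pred, target):
    """Fraction of true defect pixels that the prediction also marks."""
    true_pos = sum(1 for p, t in zip(pred, target) if p and t)
    positives = sum(1 for t in target if t)
    return true_pos / positives if positives else 1.0


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(iou(pred, target), recall(pred, target))
```

An "average IoU increase of 0.22" then means the per-image (or per-product) mean of `iou` rose by 0.22 in absolute terms over the baseline.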

[40] Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets

Mikhail Kennerley,Angelica Alives-Reviro,Carola-Bibiane Schönlieb,Robby T. Tan

Main category: cs.CV

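TL;DR: LAT is a label-aligned transfer framework that uses pseudo-label generation and semantic feature fusion to resolve annotation inconsistencies across datasets and improve target-domain detection performance.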

Details Motivation: Joint training on multiple datasets can improve generalization, but inconsistencies in class semantics and bounding boxes get in the way. Existing methods need a shared label space or manual relabeling, which limits flexibility. Method: LAT trains dataset-specific detectors to generate pseudo-labels and combines them with a Privileged Proposal Generator (PPG) and a Semantic Feature Fusion (SFF) module to align label spaces. Result: Across multiple benchmarks, LAT clearly improves target-domain detection performance, with gains of up to +4.8 AP over semi-supervised baselines. Conclusion: LAT needs neither a shared label space nor manual annotation, effectively resolves class-level and spatial annotation inconsistencies, and suits joint training on heterogeneous datasets. Abstract: Combining multiple object detection datasets offers a path to improved generalisation but is hindered by inconsistencies in class semantics and bounding box annotations. Some methods to address this assume shared label taxonomies and address only spatial inconsistencies; others require manual relabelling, or produce a unified label space, which may be unsuitable when a fixed target label space is required. We propose Label-Aligned Transfer (LAT), a label transfer framework that systematically projects annotations from diverse source datasets into the label space of a target dataset. LAT begins by training dataset-specific detectors to generate pseudo-labels, which are then combined with ground-truth annotations via a Privileged Proposal Generator (PPG) that replaces the region proposal network in two-stage detectors. To further refine region features, a Semantic Feature Fusion (SFF) module injects class-aware context and features from overlapping proposals using a confidence-weighted attention mechanism. This pipeline preserves dataset-specific annotation granularity while enabling many-to-one label space transfer across heterogeneous datasets, resulting in a semantically and spatially aligned representation suitable for training a downstream detector. LAT thus jointly addresses both class-level misalignments and bounding box inconsistencies without relying on shared label spaces or manual annotations. Across multiple benchmarks, LAT demonstrates consistent improvements in target-domain detection performance, achieving gains of up to +4.8 AP over semi-supervised baselines.

[41] SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs

Shuhan Xu,Siyuan Liang,Hongling Zheng,Yong Luo,Aishan Liu,Dacheng Tao

Main category: cs.CV

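TL;DR: The paper proposes Semantic Reward Defense (SRD), a reinforcement learning framework that defends vision-language models (VLMs) against backdoor attacks without prior knowledge of the trigger.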

Details Motivation: Recent studies show that VLMs are vulnerable to backdoor attacks in image captioning: attackers inject imperceptible perturbations (such as local pixel triggers or global semantic phrases) to make the model generate malicious captions. These attacks are hard to detect and defend against. Method: SRD uses a Deep Q-Network to learn a policy for applying discrete perturbations (e.g., occlusion, color masking) to sensitive image regions in order to disrupt the activation of malicious pathways. A semantic fidelity score serves as the reward signal and evaluates the semantic consistency and linguistic fluency of the output. Result: Experiments show that SRD reduces attack success rates to 5.6% while preserving caption quality on clean inputs, with a performance drop of less than 10%. Conclusion: SRD offers an interpretable defense paradigm that needs no prior knowledge of triggers and counters stealthy backdoor threats in multimodal generative models. Abstract: Vision-Language Models (VLMs) have achieved remarkable performance in image captioning, but recent studies show they are vulnerable to backdoor attacks. Attackers can inject imperceptible perturbations, such as local pixel triggers or global semantic phrases, into the training data, causing the model to generate malicious, attacker-controlled captions for specific inputs. These attacks are hard to detect and defend due to their stealthiness and cross-modal nature. By analyzing attack samples, we identify two key vulnerabilities: (1) abnormal attention concentration on specific image regions, and (2) semantic drift and incoherence in generated captions. To counter this, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations (e.g., occlusion, color masking) to sensitive image regions, aiming to disrupt the activation of malicious pathways. We design a semantic fidelity score as the reward signal, which jointly evaluates semantic consistency and linguistic fluency of the output, guiding the agent toward generating robust yet faithful captions. Experiments across mainstream VLMs and datasets show SRD reduces attack success rates to 5.6%, while preserving caption quality on clean inputs with less than 10% performance drop. SRD offers a trigger-agnostic, interpretable defense paradigm against stealthy backdoor threats in multimodal generative models.

[42] Physics Informed Capsule Enhanced Variational AutoEncoder for Underwater Image Enhancement

Niki Martinel,Rita Pucci

Main category: cs.CV

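TL;DR: The paper proposes a novel dual-stream architecture that combines a physical model with capsule clustering-based feature learning and achieves state-of-the-art underwater image enhancement.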

Details Motivation: To address the difficulty of honoring both the physical model and semantic features in underwater image enhancement. Method: A dual-stream architecture estimates the transmission map and background light with a physics estimator and extracts entity-level features with capsule clustering, optimizing for physical constraints and perceptual quality together. Result: Across six benchmarks, PSNR improves by 0.5 dB while computational complexity drops by two thirds. Conclusion: The method beats existing techniques in both performance and efficiency, and the enhancement is parameter-free. Abstract: We present a novel dual-stream architecture that achieves state-of-the-art underwater image enhancement by explicitly integrating the Jaffe-McGlamery physical model with capsule clustering-based feature representation learning. Our method simultaneously estimates transmission maps and spatially-varying background light through a dedicated physics estimator while extracting entity-level features via capsule clustering in a parallel stream. This physics-guided approach enables parameter-free enhancement that respects underwater formation constraints while preserving semantic structures and fine-grained details. Our approach also features a novel optimization objective ensuring both physical adherence and perceptual quality across multiple spatial frequencies. To validate our approach, we conducted extensive experiments across six challenging benchmarks. Results demonstrate consistent improvements of +0.5 dB PSNR over the best existing methods while requiring only one-third of their computational complexity (FLOPs), or alternatively, more than +1 dB PSNR improvement when compared to methods with similar computational budgets. Code and data will be available at https://github.com/iN1k1/.
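
To make the roles of the transmission map and background light concrete, the sketch below uses the simplified per-pixel image formation model that is common in underwater and dehazing work, I = J·t + B·(1 − t). This is an assumption for illustration: the abstract names the Jaffe-McGlamery model but gives no equations, and the full model also includes a forward-scattering term that is ignored here. The function names and the `t_min` clamp are hypothetical.

```python
def degrade(j, t, b):
    """Simplified formation model: observed = scene * transmission + background * (1 - transmission)."""
    return j * t + b * (1.0 - t)


def restore(i, t, b, t_min=0.1):
    """Invert the simplified model for one pixel channel.

    t_min (an illustrative choice) keeps the division stable where
    the estimated transmission is close to zero.
    """
    t = max(t, t_min)
    return (i - b * (1.0 - t)) / t


scene, transmission, background = 0.8, 0.5, 0.2
observed = degrade(scene, transmission, background)
print(observed, restore(observed, transmission, background))
```

Under this model, once `t` and `b` are estimated per pixel, recovering the scene radiance needs no further learned parameters, which is one way to read the abstract's "parameter-free enhancement".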

[43] Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

Shenshen Li,Kaiyuan Deng,Lei Wang,Hao Yang,Chong Peng,Peng Yan,Fumin Shen,Heng Tao Shen,Xing Xu

Main category: cs.CV

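TL;DR: The paper proposes RAP, a data selection method that identifies high-value cognitive samples. Using only 9.3% of the training data, it outperforms full-data training on multi-modal reasoning tasks while cutting computational cost by 43%.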

Details Motivation: Conventional wisdom holds that multi-modal large language models (MLLMs) need large amounts of training data to improve reasoning, but the authors find that only a sparse set of cognitive samples actually triggers multi-modal reasoning, while the rest contribute little. Method: RAP identifies cognitive samples with two complementary estimators (CDE and ACE) and adds a Difficulty-aware Replacement Module (DRM) to raise data complexity. Result: Experiments on six datasets show that RAP achieves better performance with only 9.3% of the training data while reducing computational cost by over 43%. Conclusion: A small, high-quality dataset can replace the full corpus, offering an efficient and economical route to multi-modal reasoning. Abstract: While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, inevitably leading to data redundancy and substantial computational costs. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning by two complementary estimators: 1) Causal Discrepancy Estimator (CDE) based on the potential outcome model principle, eliminates samples that overly rely on language priors by comparing outputs between multi-modal and text-only inputs; 2) Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) to substitute trivial instances with cognitively challenging ones, thereby ensuring complexity for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%. Our code is available at https://github.com/Leo-ssl/RAP.

[44] Toward Better SSIM Loss for Unsupervised Monocular Depth Estimation

Yijun Cao,Fuya Luo,Yongjie Li

Main category: cs.CV

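TL;DR: This paper proposes a new form of SSIM that combines the luminance, contrast, and structural similarity components by addition instead of multiplication, to improve training in unsupervised monocular depth learning.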

Details Motivation: The SSIM function used in conventional unsupervised monocular depth learning ignores how its components and their hyperparameters affect training, which leads to non-smooth gradients and limited performance. Method: The paper proposes a new form of SSIM that combines its components by addition instead of multiplication, and runs extensive experiments to find the best parameter combination. Result: Built on MonoDepth, the optimized SSIM loss substantially outperforms the baseline on the KITTI-2015 dataset. Conclusion: The new SSIM form produces smoother gradients and improves unsupervised depth estimation. Abstract: Unsupervised monocular depth learning generally relies on the photometric relation among temporally adjacent images. Most previous works use both mean absolute error (MAE) and the structural similarity index measure (SSIM) in its conventional form as the training loss. However, they ignore the effect of the different components in the SSIM function, and of the corresponding hyperparameters, on training. To address these issues, this work proposes a new form of SSIM. Compared with the original SSIM function, the proposed form uses addition rather than multiplication to combine the luminance, contrast, and structural similarity related components in SSIM. The loss function constructed with this scheme yields smoother gradients and achieves higher performance on unsupervised depth estimation. We conduct extensive experiments to determine the relatively optimal combination of parameters for our new SSIM. Based on the popular MonoDepth approach, the optimized SSIM loss function can remarkably outperform the baseline on the KITTI-2015 outdoor dataset.
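
The difference between the two ways of combining SSIM components can be sketched on a single pair of patches. The luminance, contrast, and structure terms below follow the standard SSIM definition; the additive combination with equal weights is only an assumption, since the abstract does not give the paper's exact formula or its tuned parameters.

```python
import math


def ssim_components(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Return (luminance, contrast, structure) for two equal-length patches with values in [0, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx, sy = math.sqrt(vx), math.sqrt(vy)
    c3 = c2 / 2  # conventional choice in the standard SSIM definition
    lum = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    con = (2 * sx * sy + c2) / (vx + vy + c2)
    stc = (cov + c3) / (sx * sy + c3)
    return lum, con, stc


def ssim_multiplicative(x, y):
    """Conventional SSIM: product of the three components (all exponents set to 1)."""
    lum, con, stc = ssim_components(x, y)
    return lum * con * stc


def ssim_additive(x, y, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Additive variant: weighted sum of the components. The weights are hypothetical."""
    lum, con, stc = ssim_components(x, y)
    return weights[0] * lum + weights[1] * con + weights[2] * stc


a = [0.1, 0.5, 0.9, 0.3]
b = [0.2, 0.4, 0.8, 0.5]
print(ssim_multiplicative(a, b), ssim_additive(a, b))
```

One intuition for the smoother gradients claimed in the abstract: in a product, the gradient with respect to one component is scaled by the other two, so a single small component suppresses the whole signal, while in a sum each component contributes independently.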

[45] HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition

Suhan Woo,Seongwon Lee,Jinwoo Jang,Euntai Kim

Main category: cs.CV

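TL;DR: HypeVPR is a novel hierarchical embedding framework in hyperbolic space that addresses the challenges of P2E VPR. Hierarchical feature aggregation and an efficient search strategy substantially improve retrieval speed and accuracy.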

Details Motivation: Real-world Visual Place Recognition (VPR) must handle query images from many viewpoints, which makes the P2E formulation a natural choice, but existing methods do not fully exploit the hierarchical structure of panoramic images. Method: HypeVPR uses hyperbolic space to represent hierarchical feature relationships, with a hierarchical feature aggregation mechanism and a coarse-to-fine search strategy. Result: HypeVPR outperforms existing methods on multiple benchmark datasets, with retrieval up to 5x faster. Conclusion: Through hyperbolic space and a hierarchical design, HypeVPR provides an efficient and accurate solution for P2E VPR. Abstract: When applying Visual Place Recognition (VPR) to real-world mobile robots and similar applications, perspective-to-equirectangular (P2E) formulation naturally emerges as a suitable approach to accommodate diverse query images captured from various viewpoints. In this paper, we introduce HypeVPR, a novel hierarchical embedding framework in hyperbolic space, designed to address the unique challenges of P2E VPR. The key idea behind HypeVPR is that visual environments captured by panoramic views exhibit inherent hierarchical structures. To leverage this property, we employ hyperbolic space to represent hierarchical feature relationships and preserve distance properties within the feature space. To achieve this, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Additionally, HypeVPR adopts an efficient coarse-to-fine search strategy, optimally balancing speed and accuracy to ensure robust matching, even between descriptors from different image types. This approach enables HypeVPR to outperform state-of-the-art methods while significantly reducing retrieval time, achieving up to 5x faster retrieval across diverse benchmark datasets. The code and models will be released at https://github.com/suhan-woo/HypeVPR.git.
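
As background on why hyperbolic space suits hierarchies, the sketch below computes the geodesic distance in the Poincaré ball with curvature −1. The abstract does not say which model of hyperbolic space HypeVPR uses, so the Poincaré ball is an assumption here, and `poincare_distance` is a hypothetical helper name.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball (curvature -1)."""
    def sq_norm(w):
        return sum(c * c for c in w)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


origin = [0.0, 0.0]
print(poincare_distance(origin, [0.5, 0.0]))   # equals ln(3)
print(poincare_distance([0.9, 0.0], [0.0, 0.9]))
```

Distances grow rapidly toward the boundary of the ball, so points near the origin can act as coarse "parents" and points near the boundary as fine-grained "children", which is the property hierarchical embeddings usually exploit.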

[46] Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations

Gaia Di Lorenzo,Federico Tombari,Marc Pollefeys,Daniel Barath

Main category: cs.CV

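TL;DR: Object-X is a multi-modal 3D object representation framework that encodes rich information and decodes it into geometric and visual reconstructions. It supports a range of downstream tasks with very low storage requirements.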

Details Motivation: Existing methods are usually designed for a specific task and cannot support geometric reconstruction and cross-task reuse at the same time, so a more general multi-modal 3D representation is needed. Method: Object-X geometrically grounds multi-modal information in a 3D voxel grid and learns an unstructured embedding that fuses the voxels with object attributes, enabling reconstruction based on 3D Gaussian Splatting. Result: On real-world datasets, Object-X achieves high-fidelity novel-view synthesis and improved geometric accuracy, performs well in scene alignment and localization, and requires far less storage. Conclusion: Object-X is an efficient, scalable solution for multi-modal 3D scene representation that suits many applications. Abstract: Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g. images, point cloud, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding fusing the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves competitive performance with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.

[47] LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table

Yusuke Matsui

Main category: cs.CV

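TL;DR: LotusFilter is a post-processing module that diversifies approximate nearest neighbor search (ANNS) results. It precomputes a table of nearby vectors and deletes redundant ones, and it is fast enough for practical applications.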

Details Motivation: ANNS can return results that are too similar to one another, while some scenarios need results that are both similar to the query and diverse. Method: A table of nearby vectors is precomputed, and during filtering it is looked up greedily to delete redundant vectors. Result: LotusFilter runs fast (0.02 ms/query) in settings resembling real-world RAG applications. Conclusion: LotusFilter is an efficient, practical way to diversify ANNS results, and the code is open source. Abstract: Approximate nearest neighbor search (ANNS) is an essential building block for applications like RAG but can sometimes yield results that are overly similar to each other. In certain scenarios, search results should be similar to the query and yet diverse. We propose LotusFilter, a post-processing module to diversify ANNS results. We precompute a cutoff table summarizing vectors that are close to each other. During the filtering, LotusFilter greedily looks up the table to delete redundant vectors from the candidates. We demonstrated that the LotusFilter operates fast (0.02 ms/query) in settings resembling real-world RAG applications, utilizing features such as OpenAI embeddings. Our code is publicly available at https://github.com/matsui528/lotf.
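
The precompute-then-filter idea described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the API of the `lotf` library: the function names, the squared-distance threshold `eps`, and the brute-force table construction are all hypothetical (a real system would build the table with an ANN index and cap its size).

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_cutoff_table(vectors, eps):
    """For each vector id, list the ids of other vectors closer than eps (squared distance)."""
    return {
        i: [j for j, v in enumerate(vectors) if j != i and sq_dist(u, v) < eps]
        for i, u in enumerate(vectors)
    }


def diversify(candidates, table, k):
    """Greedy filter over candidate ids sorted by closeness to the query.

    Keep the best remaining candidate, then mark everything the table lists
    as near it as redundant. No distances are computed at query time.
    """
    removed, kept = set(), []
    for c in candidates:
        if c in removed:
            continue
        kept.append(c)
        if len(kept) == k:
            break
        removed.update(table[c])
    return kept


vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
table = build_cutoff_table(vectors, eps=1.0)
print(diversify([0, 1, 2, 3, 4], table, k=3))  # near-duplicates 1 and 3 are skipped
```

Because the filter only performs table lookups and set operations, its cost per query is independent of the vector dimension, which is consistent with the sub-millisecond latency reported in the abstract.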

[48] SupeRANSAC: One RANSAC to Rule Them All

Daniel Barath

Main category: cs.CV

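TL;DR: SupeRANSAC is a unified RANSAC pipeline that aims to make geometric model estimation in computer vision more robust and consistent, and it outperforms existing methods.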

Details Motivation: RANSAC and its variants are widely used in computer vision, but their performance varies across tasks and depends heavily on implementation details and optimizations. Method: SupeRANSAC analyzes and integrates the techniques that make RANSAC effective for specific vision tasks, including homography, fundamental/essential matrix, and absolute/rigid pose estimation. Result: SupeRANSAC clearly outperforms existing methods across multiple tasks and datasets, for example by 6 AUC points on average for fundamental matrix estimation. Conclusion: SupeRANSAC provides a consistent and efficient solution for a range of geometric model estimation tasks. Abstract: Robust estimation is a cornerstone in computer vision, particularly for tasks like Structure-from-Motion and Simultaneous Localization and Mapping. RANSAC and its variants are the gold standard for estimating geometric models (e.g., homographies, relative/absolute poses) from outlier-contaminated data. Despite RANSAC's apparent simplicity, achieving consistently high performance across different problems is challenging. While recent research often focuses on improving specific RANSAC components (e.g., sampling, scoring), overall performance is frequently more influenced by the "bells and whistles" (i.e., the implementation details and problem-specific optimizations) within a given library. Popular frameworks like OpenCV and PoseLib demonstrate varying performance, excelling in some tasks but lagging in others. We introduce SupeRANSAC, a novel unified RANSAC pipeline, and provide a detailed analysis of the techniques that make RANSAC effective for specific vision tasks, including homography, fundamental/essential matrix, and absolute/rigid pose estimation. SupeRANSAC is designed for consistent accuracy across these tasks, improving upon the best existing methods by, for example, 6 AUC points on average for fundamental matrix estimation. We demonstrate significant performance improvements over the state-of-the-art on multiple problems and datasets. Code: https://github.com/danini/superansac
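
For context, the textbook RANSAC loop that pipelines like this build on can be sketched with 2D line fitting. This is the plain baseline algorithm, not SupeRANSAC itself: it has none of the sampling, scoring, or refinement improvements the abstract refers to, and the function names, fixed iteration count, and vertical-residual inlier test are illustrative choices.

```python
import random


def fit_line(p, q):
    """Minimal solver: the line y = a*x + b through two points, or None if it is vertical."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        return None
    a = (y2 - y1) / (x2 - x1)
    return a, y1 - a * x1


def ransac_line(points, threshold, iterations=200, seed=0):
    """Sample minimal sets, score each model by inlier count, and keep the best one."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue
        a, b = model
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# Ten points on y = 2x + 1 plus three gross outliers.
points = [(x, 2 * x + 1) for x in range(10)] + [(1, 20), (3, -7), (8, 40)]
model, inliers = ransac_line(points, threshold=0.1)
print(model, len(inliers))
```

Production implementations typically replace the fixed iteration count with an adaptive stopping criterion and refit the model on all inliers; such details are the kind of "bells and whistles" the abstract says dominate real-world performance.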

[49] MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

Yuyi Zhang,Yongxin Shi,Peirong Zhang,Yixin Zhao,Zhenhua Yang,Lianwen Jin

Main category: cs.CV

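TL;DR: The paper introduces the MegaHan97K dataset, which supports the GB18030-2022 standard and covers 97,455 Chinese character categories. It addresses the limited scale and long-tail distribution of existing datasets and reveals new challenges.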

Details Motivation: Chinese character recognition is crucial for cultural heritage preservation and digital applications, but existing datasets are too small to meet the need. Method: The paper introduces the MegaHan97K dataset, with handwritten, historical, and synthetic subsets that balance the sample distribution. Result: The dataset is far larger than existing resources and reveals new challenges such as storage demands, recognition of morphologically similar characters, and zero-shot learning. Conclusion: MegaHan97K is an important resource for OCR and pattern recognition and supports future research. Abstract: Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) It effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical and synthetic subsets; (3) Comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, morphologically similar character recognition, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes, not only in the field of OCR but possibly also in the broader domain of pattern recognition. The dataset is available at https://github.com/SCUT-DLVCLab/MegaHan97K.

[50] Spike-TBR: a Noise Resilient Neuromorphic Event Representation

Gabriele Magrini,Federico Becattini,Luca Cultrera,Lorenzo Berlincioni,Pietro Pala,Alberto Del Bimbo

Main category: cs.CV

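TL;DR: Proposes Spike-TBR, an event encoding strategy based on Temporal Binary Representation (TBR) that integrates spiking neurons to improve noise robustness, and validates its performance on multiple datasets.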

Details Motivation: Event cameras offer high temporal resolution and low latency, but converting event streams efficiently and noise-robustly into formats that fit standard computer vision pipelines remains challenging. Method: Proposes Spike-TBR, which combines the frame-based advantages of TBR with the noise-filtering capability of spiking neural networks, and designs four variants that use different spiking neurons. Result: Performs strongly in noisy scenarios while also improving results on clean data, which supports the robustness of the method. Conclusion: Spike-TBR bridges the gap between spike-based and frame-based processing and offers a simple, noise-resilient solution for event-driven vision applications. Abstract: Event cameras offer significant advantages over traditional frame-based sensors, including higher temporal resolution, lower latency, and higher dynamic range. However, efficiently converting event streams into formats compatible with standard computer vision pipelines remains a challenging problem, particularly in the presence of noise. In this paper, we propose Spike-TBR, a novel event-based encoding strategy based on Temporal Binary Representation (TBR), addressing its vulnerability to noise by integrating spiking neurons. Spike-TBR combines the frame-based advantages of TBR with the noise-filtering capabilities of spiking neural networks, creating a more robust representation of event streams. We evaluate four variants of Spike-TBR, each using different spiking neurons, across multiple datasets, demonstrating superior performance in noise-affected scenarios while improving the results on clean data. Our method bridges the gap between spike-based and frame-based processing, offering a simple noise-resilient solution for event-driven vision applications.
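The combination described in the abstract can be illustrated with a small Python sketch. It assumes the commonly described TBR encoding (N binary event frames packed per pixel into one N-bit value, scaled by 2^N) and places a simple leaky integrate-and-fire neuron in front of it as the noise filter. The function names and the `leak` and `threshold` parameters are hypothetical and are not taken from the paper, which evaluates four different spiking neuron variants.

```python
import numpy as np

def lif_filter(binary_frames, leak=0.5, threshold=1.5):
    """Pass per-pixel binary event frames through a leaky integrate-and-fire neuron.

    binary_frames: array of shape (N, H, W) with values in {0, 1}.
    Returns an array of the same shape holding the output spikes.
    leak and threshold are illustrative values, not the paper's settings.
    """
    potential = np.zeros(binary_frames.shape[1:], dtype=np.float64)
    spikes = np.zeros_like(binary_frames, dtype=np.uint8)
    for t, frame in enumerate(binary_frames):
        potential = leak * potential + frame   # decay, then integrate the input
        fired = potential >= threshold         # pixels whose neuron crosses the threshold
        spikes[t] = fired
        potential[fired] = 0.0                 # reset the neurons that fired
    return spikes

def tbr_encode(binary_frames):
    """Pack N binary frames into one frame: an N-bit integer per pixel, scaled to [0, 1)."""
    n = binary_frames.shape[0]
    weights = 2.0 ** np.arange(n - 1, -1, -1)  # the first frame is the most significant bit
    packed = np.tensordot(weights, binary_frames.astype(np.float64), axes=1)
    return packed / (2.0 ** n)

# One isolated (noise-like) event at pixel (0, 0), sustained activity at pixel (1, 1).
frames = np.zeros((4, 2, 2), dtype=np.uint8)
frames[1, 0, 0] = 1
frames[:, 1, 1] = 1
plain = tbr_encode(frames)
filtered = tbr_encode(lif_filter(frames))
```

With these illustrative settings the isolated event never reaches the firing threshold and is dropped from the filtered encoding, while the pixel with sustained activity still produces spikes.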

[51] Fool the Stoplight: Realistic Adversarial Patch Attacks on Traffic Light Detectors

Svetlana Pavlitska,Jamie Robb,Nikolai Polley,Melih Yazgan,J. Marius Zöllner

Main category: cs.CV

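TL;DR: The paper proposes an adversarial patch attack on CNN-based traffic light detectors and shows that printed patches can achieve label-flipping and classification attacks in realistic scenarios.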

Details Motivation: Adversarial attacks on camera-based perception tasks of autonomous vehicles are well studied, but attacks on traffic light detection are not; this work fills that gap. Method: Proposes a threat model in which a printed patch placed under the traffic light attacks the CNN detector, and describes a training strategy. Result: Experiments show successful label-flipping (red to green) and classification attacks, validated both in the lab and in real-world settings. Conclusion: The study demonstrates the practical threat of adversarial patch attacks on traffic light detection and provides a reference for defenses. Abstract: Realistic adversarial attacks on various camera-based perception tasks of autonomous vehicles have been successfully demonstrated so far. However, only a few works considered attacks on traffic light detectors. This work shows how CNNs for traffic light detection can be attacked with printed patches. We propose a threat model, where each instance of a traffic light is attacked with a patch placed under it, and describe a training strategy. We demonstrate successful adversarial patch attacks in universal settings. Our experiments show realistic targeted red-to-green label-flipping attacks and attacks on pictogram classification. Finally, we perform a real-world evaluation with printed patches and demonstrate attacks in a lab setting with a mobile traffic light for construction sites and in a test area with stationary traffic lights. Our code is available at https://github.com/KASTEL-MobilityLab/attacks-on-traffic-light-detection.

[52] DualX-VSR: Dual Axial Spatial×Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation

Shuo Cao,Yihao Liu,Xiaohui Li,Yuanting Gao,Yu Zhou,Chao Dong

Main category: cs.CV

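TL;DR: The paper proposes DualX-VSR, which uses a dual axial spatial×temporal attention mechanism to address pixel-level precision in video super-resolution without motion compensation and achieves superior performance.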

Details Motivation: Existing Transformer-based video super-resolution models lose pixel-level precision because of tokenization and sequential attention, and they depend on optical flow alignment, which limits them in real-world settings. Method: Proposes DualX-VSR, which uses a dual axial spatial×temporal attention mechanism to integrate spatial and temporal information along orthogonal directions, with no motion compensation. Result: DualX-VSR achieves high fidelity and superior performance on real-world video super-resolution. Conclusion: With a simplified structure and a new attention mechanism, DualX-VSR addresses the limitations of existing models and improves video super-resolution quality. Abstract: Transformer-based models like ViViT and TimeSformer have advanced video understanding by effectively modeling spatiotemporal dependencies. Recent video generation models, such as Sora and Vidu, further highlight the power of transformers in long-range feature extraction and holistic spatiotemporal modeling. However, directly applying these models to real-world video super-resolution (VSR) is challenging, as VSR demands pixel-level precision, which can be compromised by tokenization and sequential attention mechanisms. While recent transformer-based VSR models attempt to address these issues using smaller patches and local attention, they still face limitations such as restricted receptive fields and dependence on optical flow-based alignment, which can introduce inaccuracies in real-world settings. To overcome these issues, we propose Dual Axial Spatial×Temporal Transformer for Real-World Video Super-Resolution (DualX-VSR), which introduces a novel dual axial spatial×temporal attention mechanism that integrates spatial and temporal information along orthogonal directions. DualX-VSR eliminates the need for motion compensation, offering a simplified structure that provides a cohesive representation of spatiotemporal information. As a result, DualX-VSR achieves high fidelity and superior performance in real-world VSR tasks.

[53] OpenMaskDINO3D: Reasoning 3D Segmentation via Large Language Model

Kunshen Zhang

Main category: cs.CV

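TL;DR: OpenMaskDINO3D is an LLM-based model for 3D understanding and segmentation that processes point cloud data and text prompts to produce instance segmentation masks, filling a gap in 3D reasoning segmentation.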

Details Motivation: Perception systems are mature for 2D reasoning segmentation, but no comparable framework exists for 3D tasks; OpenMaskDINO3D aims to address this. Method: The model introduces a SEG token and an object identifier and generates high-precision 3D segmentation masks from point cloud data and text prompts. Result: Experiments on the ScanNet dataset validate the model's effectiveness across various 3D tasks. Conclusion: OpenMaskDINO3D provides an efficient solution for 3D reasoning segmentation and produces segmentation results directly from natural language instructions. Abstract: Although perception systems have made remarkable advancements in recent years, particularly in 2D reasoning segmentation, these systems still rely on explicit human instruction or pre-defined categories to identify target objects before executing visual recognition tasks. Such systems have matured significantly, demonstrating the ability to reason and comprehend implicit user intentions in two-dimensional contexts, producing accurate segmentation masks based on complex and implicit query text. However, a comparable framework and structure for 3D reasoning segmentation remain absent. This paper introduces OpenMaskDINO3D, an LLM designed for comprehensive 3D understanding and segmentation. OpenMaskDINO3D processes point cloud data and text prompts to produce instance segmentation masks, excelling in many 3D tasks. By introducing a SEG token and object identifier, we achieve high-precision 3D segmentation mask generation, enabling the model to directly produce accurate point cloud segmentation results from natural language instructions. Experimental results on large-scale ScanNet datasets validate the effectiveness of our OpenMaskDINO3D across various tasks.

[54] Geological Field Restoration through the Lens of Image Inpainting

Vladislav Trifonov,Ivan Oseledets,Ekaterina Muravleva

Main category: cs.CV

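TL;DR: The paper proposes a method for reconstructing geological fields from sparse observations based on the low-rank structure of multidimensional tensors, which outperforms ordinary kriging.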

Details Motivation: Inspired by deterministic image inpainting techniques, the work addresses the reconstruction of multidimensional geological fields from sparse observations. Method: Combines tensor completion and geostatistics in an optimization framework that enforces a global low-rank structure. Result: In experiments on synthetic geological fields, the tensor completion method reconstructs significantly more accurately than ordinary kriging. Conclusion: The method offers a more accurate solution for geological field reconstruction. Abstract: We present a new viewpoint on reconstructing multidimensional geological fields from sparse observations. Drawing inspiration from deterministic image inpainting techniques, we model a partially observed spatial field as a multidimensional tensor and recover missing values by enforcing a global low-rank structure. Our approach combines ideas from tensor completion and geostatistics, providing a robust optimization framework. Experiments on synthetic geological fields demonstrate that the tensor completion method used yields significant improvements in reconstruction accuracy over ordinary kriging for various percentages of observed data.
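To make the low-rank recovery idea concrete, here is a minimal Python sketch for the two-dimensional (matrix) special case. It uses a hard-impute style iteration (truncated SVD alternated with re-imposing the observed values), which is a swapped-in textbook technique and not the authors' tensor completion method; the function name, the synthetic rank-2 field, and the 70% observation rate are all assumptions made for illustration.

```python
import numpy as np

def complete_low_rank(observed, mask, rank, n_iters=200):
    """Fill missing entries of a 2D field by iterating truncated SVD (hard-impute style).

    observed: 2D array; values where mask is False are ignored.
    mask: boolean array, True where a value was observed.
    rank: assumed rank of the underlying field.
    """
    estimate = np.where(mask, observed, 0.0)
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(estimate, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]   # best rank-r approximation
        estimate = np.where(mask, observed, low_rank)      # keep observed values fixed
    return estimate

rng = np.random.default_rng(0)
field = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 30))  # synthetic rank-2 field
mask = rng.random(field.shape) < 0.7                           # roughly 70% of cells observed
restored = complete_low_rank(field, mask, rank=2)
rel_error = np.linalg.norm((restored - field)[~mask]) / np.linalg.norm(field[~mask])
print(f"relative error on unobserved cells: {rel_error:.3e}")
```

The sketch only shows why a global low-rank prior lets unobserved cells be inferred from observed ones; a comparison against ordinary kriging, as reported in the abstract, would need a separate geostatistical baseline.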

[55] Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking

Yu-Feng Chen,Tzuhsuan Huang,Pin-Yen Chiu,Jun-Cheng Chen

Main category: cs.CV

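TL;DR: The paper proposes a new backdoor attack framework that embeds invisible triggers into the image editing process through poisoned training data, using deep watermarking models to carry the attack.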

Details Motivation: Existing research mostly targets backdoor attacks on image generation; backdoor attacks on image editing are less studied, and existing methods mostly rely on visible triggers, which are impractical. Method: Uses off-the-shelf deep watermarking models to encode imperceptible watermarks as backdoor triggers and mounts the attack through poisoned training data. Result: Extensive experiments across different watermarking models show promising attack success rates, and an analysis of watermark characteristics further supports the method's effectiveness. Conclusion: The method achieves invisible backdoor attacks in image editing and opens a new research direction for the field. Abstract: Diffusion models have achieved remarkable progress in both image generation and editing. However, recent studies have revealed their vulnerability to backdoor attacks, in which specific patterns embedded in the input can manipulate the model's behavior. Most existing research in this area has proposed attack frameworks focused on the image generation pipeline, leaving backdoor attacks in image editing relatively unexplored. Among the few studies targeting image editing, most utilize visible triggers, which are impractical because they introduce noticeable alterations to the input image before editing. In this paper, we propose a novel attack framework that embeds invisible triggers into the image editing process via poisoned training data. We leverage off-the-shelf deep watermarking models to encode imperceptible watermarks as backdoor triggers. Our goal is to make the model produce the predefined backdoor target when it receives watermarked inputs, while editing clean images normally according to the given prompt. With extensive experiments across different watermarking models, the proposed method achieves promising attack success rates. In addition, the analysis results of the watermark characteristics in terms of backdoor attacks further support the effectiveness of our approach. The code is available at: https://github.com/aiiu-lab/BackdoorImageEditing

[56] Learning to Plan via Supervised Contrastive Learning and Strategic Interpolation: A Chess Case Study

Andrew Hamara,Greg Hamerly,Pablo Rivas,Andrew C. Freeman

Main category: cs.CV

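TL;DR: The paper proposes an intuition-driven planning method that trains a Transformer encoder with contrastive learning to embed board states into a latent space, enabling move selection without deep search.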

Details Motivation: Modern chess engines rely on deep tree search and regressive evaluation, whereas human players use intuition to pick candidate moves and then validate them with a shallow search. The paper aims to model this intuition-driven planning process. Method: Trains a Transformer encoder with supervised contrastive learning to embed board states into a latent space structured by positional evaluation. In this space, distance reflects evaluative similarity, and moves are selected by advancing toward favorable regions. Result: Using only a 6-ply beam search, the model reaches an estimated Elo rating of 2593. Performance improves with model size and embedding dimensionality, suggesting that latent planning may be a viable alternative to traditional search. Conclusion: The method applies beyond chess and can be generalized to other perfect-information games. All source code is open source. Abstract: Modern chess engines achieve superhuman performance through deep tree search and regressive evaluation, while human players rely on intuition to select candidate moves followed by a shallow search to validate them. To model this intuition-driven planning process, we train a transformer encoder using supervised contrastive learning to embed board states into a latent space structured by positional evaluation. In this space, distance reflects evaluative similarity, and visualized trajectories display interpretable transitions between game states. We demonstrate that move selection can occur entirely within this embedding space by advancing toward favorable regions, without relying on deep search. Despite using only a 6-ply beam search, our model achieves an estimated Elo rating of 2593. Performance improves with both model size and embedding dimensionality, suggesting that latent planning may offer a viable alternative to traditional search. Although we focus on chess, the proposed embedding-based planning method can be generalized to other perfect-information games where state evaluations are learnable. All source code is available at https://github.com/andrewhamara/SOLIS.
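The phrase "advancing toward favorable regions" can be pictured with a toy Python sketch. It assumes that each candidate move leads to a position with a known embedding and that a single direction vector in the latent space stands for "favorable"; both assumptions, along with the function name `pick_move`, are hypothetical simplifications, and the paper's trained encoder and beam search are not reproduced here.

```python
import numpy as np

def pick_move(candidate_embeddings, favorable_direction):
    """Return the index of the candidate most aligned with the favorable direction.

    candidate_embeddings: array of shape (num_moves, dim), one row per resulting position.
    favorable_direction: array of shape (dim,), a hypothetical 'advantage' direction.
    """
    cands = np.asarray(candidate_embeddings, dtype=np.float64)
    goal = np.asarray(favorable_direction, dtype=np.float64)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)  # unit-length rows
    goal = goal / np.linalg.norm(goal)
    scores = cands @ goal                                         # cosine similarities
    return int(np.argmax(scores)), scores

# Three hypothetical successor positions in a 2D latent space.
best, scores = pick_move([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 2.0])
```

Here the second candidate is chosen because its embedding points exactly along the assumed favorable direction; a shallow search over several such steps would sit on top of this scoring.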

[57] From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang,Zhuofan Zhang,Ziyu Zhu,Yue Fan,Jing Xiong,Pengxiang Li,Xiaojian Ma,Qing Li

Main category: cs.CV

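TL;DR: Anywhere3D-Bench is a holistic 3D visual grounding benchmark covering four grounding levels; it reveals marked weaknesses of current models on space-level and part-level tasks.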

Details Motivation: To explore visual grounding in 3D scenes beyond the object level and fill a gap in existing research. Method: Builds the Anywhere3D-Bench benchmark and evaluates a range of state-of-the-art 3D visual grounding methods as well as LLMs and MLLMs. Result: Space-level and part-level tasks are the hardest; the best model, OpenAI o4-mini, reaches only 23.57% accuracy on space-level tasks. Conclusion: Current models show a clear deficit in understanding and reasoning about 3D scenes at the space and part levels. Abstract: 3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.57% accuracy on space-level tasks and 33.94% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.

[58] Generating Synthetic Stereo Datasets using 3D Gaussian Splatting and Expert Knowledge Transfer

Filip Slezak,Magnus K. Gjerde,Joakim B. Haurum,Ivan Nikolov,Morten S. Laursen,Thomas B. Moeslund

Main category: cs.CV

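TL;DR: The paper proposes a 3D Gaussian Splatting (3DGS)-based method for generating stereo datasets as an alternative to Neural Radiance Fields (NeRF)-based methods, and uses geometric reconstruction and depth estimates in an expert knowledge transfer setup to improve stereo models.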

Details Motivation: To explore an efficient, low-cost way to generate stereo datasets as an alternative to NeRF-based methods and to improve the zero-shot generalization of stereo models. Method: Combines geometry reconstructed by 3DGS with depth estimates from the FoundationStereo model, using expert knowledge transfer to fine-tune stereo models. Result: Geometry reconstructed by 3DGS is noisy, whereas disparity estimates from FoundationStereo are cleaner and lead to better zero-shot generalization. Conclusion: The 3DGS approach shows potential for low-cost, high-fidelity dataset generation and fast fine-tuning, but its robustness in challenging scenes needs further study. Abstract: In this paper, we introduce a 3D Gaussian Splatting (3DGS)-based pipeline for stereo dataset generation, offering an efficient alternative to Neural Radiance Fields (NeRF)-based methods. To obtain useful geometry estimates, we explore utilizing the reconstructed geometry from the explicit 3D representations as well as depth estimates from the FoundationStereo model in an expert knowledge transfer setup. We find that stereo models fine-tuned on 3DGS-generated datasets achieve competitive performance on zero-shot generalization benchmarks. When using the reconstructed geometry directly, we observe that it is often noisy and contains artifacts, which propagate noise to the trained model. In contrast, we find that the disparity estimates from FoundationStereo are cleaner and consequently result in a better performance on the zero-shot generalization benchmarks. Our method highlights the potential for low-cost, high-fidelity dataset creation and fast fine-tuning for deep stereo models. Moreover, we also reveal that while the latest Gaussian Splatting based methods have achieved superior performance on established benchmarks, their robustness falls short in challenging in-the-wild settings, warranting further exploration.

[59] Light and 3D: a methodological exploration of digitisation techniques adapted to a selection of objects from the Musée d'Archéologie Nationale

Antoine Laurent,Jean Mélou,Catherine Schwab,Rolande Simon-Millot,Sophie Féret,Thomas Sagory,Carole Fritz,Jean-Denis Durou

Main category: cs.CV

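TL;DR: The article examines the diversity of photographic 3D methods for digitizing cultural heritage and stresses that no single method suits every object; the tool should be chosen according to the object's characteristics and the intended use of its digital twin.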

Details Motivation: The importance of digitizing cultural heritage is widely recognized, but 3D digitization methods are diverse and the best one must be chosen per object. Method: Uses objects from the collections of the French national archaeology museum as case studies to analyze the suitability of different photographic 3D digitization methods. Result: Each object may require adapting existing tools, so 3D digitization methods cannot be ranked in absolute terms. Conclusion: The digitization tool should be chosen according to the object's characteristics and the intended use of its digital twin. Abstract: The need to digitize heritage objects is now widely accepted. This article presents the very fashionable context of the creation of "digital twins". It illustrates the diversity of photographic 3D digitization methods, but this is not its only objective. Using a selection of objects from the collections of the musée d'Archéologie nationale, it shows that no single method is suitable for all cases. Rather, the method to be recommended for a given object should be the result of a concerted choice between those involved in heritage and those involved in the digital domain, as each new object may require the adaptation of existing tools. It would therefore be pointless to attempt an absolute classification of 3D digitization methods. On the contrary, we need to find the digital tool best suited to each object, taking into account not only its characteristics, but also the future use of its digital twin.

[60] CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx

Lukas Picek,Elisa Belotti,Michal Bojda,Ludek Bufka,Vojtech Cermak,Martin Dula,Rostislav Dvorak,Luboslav Hrdy,Miroslav Jirik,Vaclav Kocourek,Josefa Krausova,Jiri Labuda,Jakub Straka,Ludek Toman,Vlado Trulik,Martin Vana,Miroslav Kutal

Main category: cs.CV

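TL;DR: CzechLynx is a large-scale, open-access dataset for individual identification, 2D pose estimation, and instance segmentation of the Eurasian lynx, with more than 30k camera trap images and more than 100k synthetic images.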

Details Motivation: To provide a high-quality, diverse dataset for lynx individual identification and pose estimation that supports testing generalization across spatial and temporal domains. Method: The dataset contains real images and synthetic images, the latter generated with a Unity-based pipeline and diffusion-driven modeling; three evaluation protocols are defined to test generalization. Result: The dataset covers 219 unique individuals and 15 years of monitoring data and provides three evaluation protocols. Conclusion: CzechLynx will support benchmarking and the development of new methods, not only for individual animal re-identification. Abstract: We introduce CzechLynx, the first large-scale, open-access dataset for individual identification, 2D pose estimation, and instance segmentation of the Eurasian lynx (Lynx lynx). CzechLynx includes more than 30k camera trap images annotated with segmentation masks, identity labels, and 20-point skeletons and covers 219 unique individuals across 15 years of systematic monitoring in two geographically distinct regions: Southwest Bohemia and the Western Carpathians. To increase the data variability, we create a complementary synthetic set with more than 100k photorealistic images generated via a Unity-based pipeline and diffusion-driven text-to-texture modeling, covering diverse environments, poses, and coat-pattern variations. To allow testing generalization across spatial and temporal domains, we define three tailored evaluation protocols/splits: (i) geo-aware, (ii) time-aware open-set, and (iii) time-aware closed-set. The dataset is intended to support the benchmarking of state-of-the-art models and the development of novel methods, not only for individual animal re-identification.

[61] Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining

Yong Sun,Yipeng Wang,Junyu Shi,Zhiyuan Zhang,Yanmei Xiao,Lei Zhu,Manxi Jiang,Qiang Nie

Main category: cs.CV

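TL;DR: Proposes a video-based embryo grading task that predicts embryo quality from full-length time-lapse monitoring videos, and designs a Complementary Spatial-Temporal Pattern Mining (CoSTeM) framework.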

Details Motivation: Existing embryo assessment methods either lack holistic evaluation or are confounded by extra-embryonic factors, which limits clinical use. Method: Proposes the CoSTeM framework, which combines a morphological branch and a morphokinetic branch to extract local structural features and global developmental trajectories, respectively. Result: Experiments show the design is superior, providing a methodological framework for AI-assisted embryo selection. Conclusion: The work fills a gap in embryo assessment; the dataset and code will be made public. Abstract: Artificial intelligence has recently shown promise in automated embryo selection for In-Vitro Fertilization (IVF). However, current approaches either address partial embryo evaluation lacking holistic quality assessment or target clinical outcomes inevitably confounded by extra-embryonic factors, both limiting clinical utility. To bridge this gap, we propose a new task called Video-Based Embryo Grading - the first paradigm that directly utilizes full-length time-lapse monitoring (TLM) videos to predict embryologists' overall quality assessments. To support this task, we curate a real-world clinical dataset comprising over 2,500 TLM videos, each annotated with a grading label indicating the overall quality of embryos. Grounded in clinical decision-making principles, we propose a Complementary Spatial-Temporal Pattern Mining (CoSTeM) framework that conceptually replicates embryologists' evaluation process. The CoSTeM comprises two branches: (1) a morphological branch using a Mixture of Cross-Attentive Experts layer and a Temporal Selection Block to select discriminative local structural features, and (2) a morphokinetic branch employing a Temporal Transformer to model global developmental trajectories, synergistically integrating static and dynamic determinants for grading embryos. Extensive experimental results demonstrate the superiority of our design. This work provides a valuable methodological framework for AI-assisted embryo selection. The dataset and source code will be publicly available upon acceptance.

[62] Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations

Igor Meleshin,Anna Chistyakova,Anastasia Antsiferova,Dmitriy Vatolin

Main category: cs.CV

TL;DR: Proposes improving the robustness of image quality assessment models through architectural design rather than data-driven defenses.

Details Motivation: Conventional IQA models are vulnerable to adversarial attacks and data-driven defenses have limited effect, so the work explores improving robustness through architecture design. Method: Reshapes the model's internal structure with orthogonal information flow and norm-preserving operations, and further stabilizes the system through pruning and fine-tuning. Result: A robust IQA architecture that withstands adversarial attacks without adversarial training. Conclusion: The authors suggest shifting from optimizing robustness through data to engineering it through design. Abstract: Image Quality Assessment (IQA) models are increasingly relied upon to evaluate image quality in real-world systems -- from compression and enhancement to generation and streaming. Yet their adoption brings a fundamental risk: these models are inherently unstable. Adversarial manipulations can easily fool them, inflating scores and undermining trust. Traditionally, such vulnerabilities are addressed through data-driven defenses -- adversarial retraining, regularization, or input purification. But what if this is the wrong lens? What if robustness in perceptual models is not something to learn but something to design? In this work, we propose a provocative idea: robustness as an architectural prior. Rather than training models to resist perturbations, we reshape their internal structure to suppress sensitivity from the ground up. We achieve this by enforcing orthogonal information flow, constraining the network to norm-preserving operations -- and further stabilizing the system through pruning and fine-tuning. The result is a robust IQA architecture that withstands adversarial attacks without requiring adversarial training or significant changes to the original model. This approach suggests a shift in perspective: from optimizing robustness through data to engineering it through design.
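The appeal of norm-preserving operations can be checked numerically with a short Python sketch. It assumes a single linear layer with an orthogonal weight matrix (built here from a QR decomposition of a random matrix) and shows that an input perturbation cannot be amplified by that layer. The helper name `random_orthogonal` and the layer size are illustrative assumptions; how the paper enforces orthogonality inside a full IQA network is not described in the abstract.

```python
import numpy as np

def random_orthogonal(dim, seed=0):
    """Build an orthogonal matrix from the QR decomposition of a random Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

weights = random_orthogonal(64)                           # hypothetical layer weights
x = np.random.default_rng(1).normal(size=64)              # clean input features
delta = 1e-3 * np.random.default_rng(2).normal(size=64)   # small input perturbation

in_gap = np.linalg.norm(delta)                                   # perturbation size at the input
out_gap = np.linalg.norm(weights @ (x + delta) - weights @ x)    # perturbation size at the output
print(in_gap, out_gap)
```

Because an orthogonal matrix preserves Euclidean norms, the two printed values agree up to floating-point error, which means such a layer is 1-Lipschitz with respect to the Euclidean norm.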

[63] APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

Hong Gao,Yiming Bao,Xuezhan Tu,Bin Zhong,Minling Zhang

Main category: cs.CV

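TL;DR: The APVR framework uses hierarchical visual information retrieval to overcome computational limits in video understanding, handling hour-level videos without training.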

Details Motivation: Current video multimodal large language models struggle with hour-level videos because of computational constraints and inefficient information extraction. Method: APVR uses two components: Pivot Frame Retrieval (semantic expansion and multi-modal confidence scoring) and Pivot Token Retrieval (query-aware, attention-driven token selection). Result: Validated on LongVideoBench and VideoMME, with significant performance gains and state-of-the-art results. Conclusion: APVR offers plug-and-play integration with existing MLLM architectures and addresses a key challenge in video understanding. Abstract: Current video-based multimodal large language models struggle with hour-level video understanding due to computational constraints and inefficient information extraction from extensive temporal sequences. We propose APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that addresses the memory wall limitation through hierarchical visual information retrieval. APVR operates via two complementary components: Pivot Frame Retrieval employs semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware attention-driven token selection within the pivot frames. This dual granularity approach enables processing of hour-long videos while maintaining semantic fidelity. Experimental validation on LongVideoBench and VideoMME demonstrates significant performance improvements, establishing state-of-the-art results for not only training-free but also training-based approaches while providing plug-and-play integration capability with existing MLLM architectures.
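The two-level retrieval described in the abstract can be sketched in Python as a coarse-to-fine selection. The sketch assumes that per-frame relevance scores and per-token relevance scores for a query are already available as plain arrays; how APVR computes them (semantic expansion, multi-modal confidence scoring, attention) is not reproduced, and the function name and arguments are hypothetical.

```python
import numpy as np

def retrieve_pivots(frame_scores, token_scores, top_frames, top_tokens):
    """Select the highest-scoring frames, then the highest-scoring tokens inside each.

    frame_scores: shape (num_frames,), assumed query relevance per frame.
    token_scores: shape (num_frames, tokens_per_frame), assumed relevance per token.
    Returns the pivot frame indices and a dict mapping each pivot frame to its token indices.
    """
    frame_scores = np.asarray(frame_scores)
    token_scores = np.asarray(token_scores)
    # Rank frames by score, keep the best ones, then restore temporal order.
    pivot_frames = np.sort(np.argsort(frame_scores)[::-1][:top_frames])
    pivot_tokens = {
        int(f): np.sort(np.argsort(token_scores[f])[::-1][:top_tokens]).tolist()
        for f in pivot_frames
    }
    return pivot_frames.tolist(), pivot_tokens

frames, tokens = retrieve_pivots(
    frame_scores=[0.1, 0.9, 0.3, 0.8],
    token_scores=[[0.3, 0.2, 0.1], [0.2, 0.7, 0.5], [0.6, 0.4, 0.1], [0.9, 0.1, 0.4]],
    top_frames=2,
    top_tokens=2,
)
```

Selecting frames first and tokens second keeps the number of visual tokens passed to the language model bounded by `top_frames * top_tokens`, regardless of video length.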

[64] FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

Huihan Wang,Zhiwen Yang,Hui Zhang,Dan Zhao,Bingzheng Wei,Yan Xu

Main category: cs.CV

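TL;DR: FEAT is an efficient full-dimensional attention Transformer that addresses dynamic medical video synthesis through spatial-temporal-channel attention, a linear-complexity design, and a residual value guidance module, and outperforms existing methods.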

Details Motivation: Dynamic medical video synthesis must model both spatial consistency and temporal dynamics; existing Transformer methods fall short in channel interaction, computational complexity, and handling of noise levels. Method: Proposes FEAT with three innovations: (1) a spatial-temporal-channel attention mechanism; (2) a linear-complexity design; (3) a residual value guidance module. Result: FEAT-S has only 23% of Endora's parameters and matches or exceeds its performance; FEAT-L surpasses all compared methods on multiple datasets. Conclusion: FEAT is efficient and scalable for dynamic medical video synthesis; the code is open source. Abstract: Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.

[65] Bringing SAM to new heights: Leveraging elevation data for tree crown segmentation from drone imagery

Mélisande Teng,Arthur Ouaknine,Etienne Laliberté,Yoshua Bengio,David Rolnick,Hugo Larochelle

Main category: cs.CV

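TL;DR: The paper compares methods that use the Segment Anything Model (SAM) for tree crown instance segmentation in drone imagery and presents BalSAM, a model that incorporates elevation data and outperforms other methods in specific settings.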

Details Motivation: Traditional forest monitoring is costly and time-consuming; drone remote sensing and computer vision offer potential for mapping individual trees at scale. Method: Compares SAM-based methods across three forest types, studies the integration of Digital Surface Model (DSM) data, and presents the BalSAM model. Result: SAM used out-of-the-box does not outperform a custom Mask R-CNN, but end-to-end tuning of SAM and integrating DSM data (BalSAM) perform better in specific settings. Conclusion: End-to-end tuning of SAM and integrating DSM data are promising directions for tree crown instance segmentation models. Abstract: Information on trees at the individual level is crucial for monitoring forest ecosystems and planning forest management. Current monitoring methods involve ground measurements, requiring extensive cost, time and labor. Advances in drone remote sensing and computer vision offer great potential for mapping individual trees from aerial imagery at broad scale. Large pre-trained vision models, such as the Segment Anything Model (SAM), represent a particularly compelling choice given limited labeled data. In this work, we compare methods leveraging SAM for the task of automatic tree crown instance segmentation in high-resolution drone imagery in three use cases: 1) boreal plantations, 2) temperate forests and 3) tropical forests. We also study the integration of elevation data into models, in the form of Digital Surface Model (DSM) information, which can readily be obtained at no additional cost from RGB drone imagery. We present BalSAM, a model leveraging SAM and DSM information, which shows potential over other methods, particularly in the context of plantations. We find that methods using SAM out-of-the-box do not outperform a custom Mask R-CNN, even with well-designed prompts. However, efficiently tuning SAM end-to-end and integrating DSM information are both promising avenues for tree crown instance segmentation models.

[66] TextVidBench: A Benchmark for Long Video Scene Text Understanding

Yangyang Zhong,Ji Qi,Yuan Yao,Pengxin Luo,Yunfeng Yan,Donglian Qi,Zhiyuan Liu,Tat-Seng Chua

Main category: cs.CV

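TL;DR: The paper introduces TextVidBench, the first benchmark focused on long-video text question answering, addressing the short video duration and narrow evaluation scope of existing datasets.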

Details Motivation: Existing datasets cannot adequately assess the long-video understanding of multimodal large language models (MLLMs), so a more comprehensive evaluation tool is needed. Method: (1) Introduces TextVidBench, covering long videos from 9 domains; (2) designs a three-stage evaluation framework; (3) provides high-quality annotations. Also proposes the IT-Rope mechanism, non-uniform positional encoding, and lightweight fine-tuning. Result: Experiments show that TextVidBench is challenging for existing models and that the proposed method improves long-video text understanding. Conclusion: TextVidBench offers a more comprehensive evaluation standard for long-video text question answering, and the proposed method suggests new ways to improve model capability. Abstract: Despite recent progress on the short-video Text-Visual Question Answering (ViteVQA) task - largely driven by benchmarks such as M4-ViteVQA - existing datasets still suffer from limited video duration and narrow evaluation scopes, making it difficult to adequately assess the growing capabilities of powerful multimodal large language models (MLLMs). To address these limitations, we introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions: 1) Cross-domain long-video coverage: Spanning 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. 2) A three-stage evaluation framework: "Text Needle-in-Haystack -> Temporal Grounding -> Text Dynamics Captioning". 3) High-quality fine-grained annotations: Containing over 5,000 question-answer pairs with detailed semantic labeling. Furthermore, we propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on video-text data. Extensive experiments on multiple public datasets as well as TextVidBench demonstrate that our new benchmark presents significant challenges to existing models, while our proposed method offers valuable insights into improving long-video scene text understanding capabilities.

[67] Multi-scale Image Super Resolution with a Single Auto-Regressive Model

Enrique Sanchez,Isma Hadji,Adrian Bulat,Christos Tzelepis,Brais Martinez,Georgios Tzimiropoulos

Main category: cs.CV

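TL;DR: The paper proposes an image super-resolution (ISR) method based on Visual Auto-Regressive (VAR) modeling that uses multi-scale image tokenization and Direct Preference Optimization (DPO) to overcome the limitations of existing methods, super-resolving in a single forward pass and reaching state-of-the-art results with a small model and no external data.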

Details Motivation: The existing VARSR method works only at a fixed resolution and depends on a large model and a large dataset; this paper aims to remove these limitations with a more efficient solution that needs no external data. Method: Uses Hierarchical Image Tokenization and a DPO regularization term, training the quantizer to produce semantically consistent residuals at different scales and training the VAR model with preference optimization. Result: The model denoises and super-resolves in a single forward pass, using a small model (300M parameters) and no external data, and achieves state-of-the-art performance. Conclusion: The proposed method improves on existing techniques in both efficiency and performance and offers a more practical solution for ISR. Abstract: In this paper we tackle Image Super Resolution (ISR), using recent advances in Visual Auto-Regressive (VAR) modeling. VAR iteratively estimates the residual in latent space between gradually increasing image scales, a process referred to as next-scale prediction. Thus, the strong priors learned during pre-training align well with the downstream task (ISR). To our knowledge, only VARSR has exploited this synergy so far, showing promising results. However, due to the limitations of existing residual quantizers, VARSR works only at a fixed resolution, i.e. it fails to map intermediate outputs to the corresponding image scales. Additionally, it relies on a 1B transformer architecture (VAR-d24), and leverages a large-scale private dataset to achieve state-of-the-art results. We address these limitations through two novel components: a) a Hierarchical Image Tokenization approach with a multi-scale image tokenizer that progressively represents images at different scales while simultaneously enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the LR and HR tokenizations, encourages the transformer to produce the latter over the former. To the best of our knowledge, this is the first time a quantizer is trained to force semantically consistent residuals at different scales, and the first time that preference-based optimization is used to train a VAR. Using these two components, our model can denoise the LR image and super-resolve at half and full target upscale factors in a single forward pass. Additionally, we achieve state-of-the-art results on ISR, while using a small model (300M params vs ~1B params of VARSR), and without using external training data.

[68] PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

Edoardo Bianchi,Antonio Liotta

Main category: cs.CV

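TL;DR: PATS is a novel temporal sampling strategy for automated sports skill assessment that improves assessment accuracy by preserving complete fundamental movement segments.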

Details Motivation: Current video sampling methods break the temporal continuity that sports skill assessment needs, which hampers the distinction between experts and novices. Method: Proposes PATS, which adaptively segments videos so that each analyzed portion contains a full execution of the critical movements, repeating the sampling across multiple segments to maximize information coverage. Result: On the EgoExo4D benchmark, PATS outperforms existing methods in all viewing configurations (+0.65% to +3.05%), with notable gains in challenging domains such as bouldering, music, and basketball. Conclusion: As an adaptive temporal sampling approach, PATS adapts to diverse activity characteristics and advances automated skill assessment for real-world applications. Abstract: Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses the state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics-from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills-demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications.
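The difference between sparse uniform sampling and sampling continuous temporal segments can be sketched in plain Python. The sketch assumes fixed-length segments whose start positions are spread evenly over the video; the adaptive, proficiency-aware choice of segment length and count that PATS makes is not specified in the abstract and is not reproduced, and the function name and arguments are hypothetical.

```python
def segment_indices(num_frames, num_segments, frames_per_segment):
    """Return frame indices for evenly spaced, internally contiguous segments."""
    if num_segments < 1 or frames_per_segment < 1:
        raise ValueError("num_segments and frames_per_segment must be positive")
    if num_segments * frames_per_segment > num_frames:
        raise ValueError("video is too short for the requested segments")
    # Latest start index that still leaves room for a whole segment.
    last_start = num_frames - frames_per_segment
    if num_segments == 1:
        starts = [last_start // 2]  # a single segment is centered in the video
    else:
        starts = [i * last_start // (num_segments - 1) for i in range(num_segments)]
    return [list(range(s, s + frames_per_segment)) for s in starts]

# A 100-frame clip sampled as three contiguous 4-frame segments.
segments = segment_indices(num_frames=100, num_segments=3, frames_per_segment=4)
```

Each returned segment holds consecutive frames, so a short movement that spans several frames is seen whole, whereas picking the same 12 frames uniformly across the clip would leave gaps of several frames between neighbors.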

[69] Beyond Cropped Regions: New Benchmark and Corresponding Baseline for Chinese Scene Text Retrieval in Diverse Layouts

Gengluo Li,Huawen Shen,Yu Zhou

Main category: cs.CV

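TL;DR: This paper proposes CSTR-CLIP, a new model for Chinese scene text retrieval that substantially improves performance by combining global visual information with multi-granularity alignment training.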

Details Motivation: Chinese scene text retrieval is highly challenging because of complex and diverse layouts, and existing methods mostly inherit solutions designed for English, with unsatisfactory results. Method: Proposes the CSTR-CLIP model, which uses two-stage training and combines global visual information with multi-granularity alignment. Result: On an existing benchmark, CSTR-CLIP improves accuracy by 18.82% and offers faster inference. Conclusion: CSTR-CLIP handles diverse text layouts effectively; the dataset and code will be released publicly to facilitate research. Abstract: Chinese scene text retrieval is a practical task that aims to search for images containing visual instances of a Chinese query text. This task is extremely challenging because Chinese text often features complex and diverse layouts in real-world scenes. Current efforts tend to inherit the solution for English scene text retrieval, failing to achieve satisfactory performance. In this paper, we establish a Diversified Layout benchmark for Chinese Street View Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval performance across various text layouts, including vertical, cross-line, and partial alignments. To address the limitations in existing methods, we propose Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates global visual information with multi-granularity alignment training. CSTR-CLIP applies a two-stage training process to overcome previous limitations, such as the exclusion of visual features outside the text region and reliance on single-granularity alignment, thereby enabling the model to effectively handle diverse text layouts. Experiments on an existing benchmark show that CSTR-CLIP outperforms the previous state-of-the-art model by 18.82% accuracy and also provides faster inference speed. Further analysis on DL-CSVTR confirms the superior performance of CSTR-CLIP in handling various text layouts. The dataset and code will be publicly available to facilitate research in Chinese scene text retrieval.

[70] Structure-Aware Radar-Camera Depth Estimation

Fuyi Zhang,Zhu Yu,Chunhao Li,Runmin Zhang,Xiaokai Bai,Zili Zhou,Si-Yuan Cao,Wang Wang,Hui-Liang Shen

Main category: cs.CV

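TL;DR: The paper discusses progress in monocular depth estimation, focusing on the applications and challenges of deep learning in this field, in particular generalization to unseen domains.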

Details Motivation: Monocular depth estimation aims to determine the depth of each pixel from an RGB image captured by a monocular camera. Although deep learning has advanced the field, generalizing to unseen domains remains a challenge. Method: The paper reviews several approaches, including multi-scale fusion networks, recasting the regression task as classification, incorporating additional priors, and improving objective functions. Recent methods use affine-invariant losses to enable joint training on multiple datasets. Result: Depth Anything leads in zero-shot monocular depth estimation and excels at extracting structural information from unseen images, but still falls short on metric depth estimation. Conclusion: Despite progress in monocular depth estimation, generalization and metric depth accuracy need further research. Abstract: Monocular depth estimation aims to determine the depth of each pixel from an RGB image captured by a monocular camera. The development of deep learning has significantly advanced this field by facilitating the learning of depth features from some well-annotated datasets \cite{Geiger_Lenz_Stiller_Urtasun_2013,silberman2012indoor}. Eigen et al. \cite{eigen2014depth} first introduce a multi-scale fusion network for depth regression. Following this, subsequent improvements have come from reinterpreting the regression task as a classification problem \cite{bhat2021adabins,Li_Wang_Liu_Jiang_2022}, incorporating additional priors \cite{shao2023nddepth,yang2023gedepth}, and developing more effective objective functions \cite{xian2020structure,Yin_Liu_Shen_Yan_2019}. Despite these advances, generalizing to unseen domains remains a challenge. Recently, several methods have employed affine-invariant loss to enable multi-dataset joint training \cite{MiDaS,ZeroDepth,guizilini2023towards,Dany}. Among them, Depth Anything \cite{Dany} has shown leading performance in zero-shot monocular depth estimation. While it struggles to estimate accurate metric depth due to the lack of explicit depth cues, it excels at extracting structural information from unseen images, producing structure-detailed monocular depth.
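An affine-invariant loss compares a prediction with the ground truth only up to an unknown scale and shift, which is what lets datasets with different depth conventions be mixed. The sketch below shows one common way to do this, a least-squares alignment followed by an L1 error; it is an assumption for illustration, not the exact loss of any of the cited methods, and both function names are hypothetical.

```python
import numpy as np

def align_scale_shift(pred, target):
    """Least-squares scale s and shift t so that s * pred + t approximates target."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    design = np.stack([pred, np.ones_like(pred)], axis=1)  # columns: pred, 1
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    s, t = solution
    return float(s), float(t)

def affine_invariant_l1(pred, target):
    """Mean absolute error after the best scale-and-shift alignment."""
    s, t = align_scale_shift(pred, target)
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    return float(np.mean(np.abs(s * pred + t - target)))
```

A prediction that differs from the target only by a scale and an offset gets a loss of (numerically) zero, which also illustrates why such a model recovers structure but not metric depth.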

[71] Point Cloud Segmentation of Agricultural Vehicles using 3D Gaussian Splatting

Alfred T. Christiansen,Andreas H. Højrup,Morten K. Stephansen,Md Ibtihaj A. Sakib,Taman S. Poojary,Filip Slezak,Morten S. Laursen,Thomas B. Moeslund,Joakim B. Haurum

Main category: cs.CV

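TL;DR: Proposes a pipeline that uses 3D Gaussian Splatting and Gaussian Opacity Fields to generate synthetic data for 3D point cloud semantic segmentation, and validates the positive effect of synthetic data on the models.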

Details Motivation: Acquiring and annotating real point cloud data is costly and time-consuming, so a method for generating realistic synthetic data is needed. Method: 3D Gaussian Splatting and Gaussian Opacity Fields are used to generate 3D assets of agricultural vehicles, and point clouds are produced with a simulated LiDAR. Result: Trained only on synthetic data, the PTv3 model reaches an mIoU of 91.35%, and in some cases even outperforms models trained on real data. Conclusion: Synthetic data can substitute for real data, and the models can generalize to semantic classes they were not trained on. Abstract: Training neural networks for tasks such as 3D point cloud semantic segmentation demands extensive datasets, yet obtaining and annotating real-world point clouds is costly and labor-intensive. This work aims to introduce a novel pipeline for generating realistic synthetic data, by leveraging 3D Gaussian Splatting (3DGS) and Gaussian Opacity Fields (GOF) to generate 3D assets of multiple different agricultural vehicles instead of using generic models. These assets are placed in a simulated environment, where the point clouds are generated using a simulated LiDAR. This is a flexible approach that allows changing the LiDAR specifications without incurring additional costs. We evaluated the impact of synthetic data on segmentation models such as PointNet++, Point Transformer V3, and OACNN, by training and validating the models only on synthetic data. Remarkably, the PTv3 model had an mIoU of 91.35%, a noteworthy result given that the model had neither been trained nor validated on any real data. Further studies even suggested that in certain scenarios the models trained only on synthetically generated data performed better than models trained on real-world data. Finally, experiments demonstrated that the models can generalize across semantic classes, enabling accurate predictions on mesh models they were never trained on.
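For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how it is typically computed from per-point labels. The function name is hypothetical, and the choice to skip classes absent from both labels and predictions is an assumption; the paper's evaluation code may differ.

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union across classes for integer label arrays."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Confusion matrix: rows are true classes, columns are predicted classes.
    conf = np.bincount(num_classes * y_true + y_pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0  # assumed convention: ignore classes that never occur
    return float((intersection[valid] / union[valid]).mean())
```

For example, with true labels `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.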

[72] UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery using Gaussian Splatting

Jaehoon Choi,Dongki Jung,Christopher Maxey,Yonghan Lee,Sungmin Eum,Dinesh Manocha,Heesung Kwon

Main category: cs.CV

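TL;DR: The UAV4D framework addresses the rendering of dynamic scenes captured by UAVs: by combining a 3D foundation model with a human mesh reconstruction model, it reconstructs dynamic scenes with multiple pedestrians from monocular video.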

Details Motivation: Existing dynamic neural rendering methods fail to address the unique challenges of UAV-captured scenes, such as monocular cameras, top-down viewpoints, and multiple small moving pedestrians. Method: Combines a 3D foundation model with a human mesh reconstruction model, resolves scale ambiguity via human-scene contact points, and initializes Gaussian splats from the SMPL model and a background mesh. Result: Tested on three complex UAV datasets, the method improves PSNR by 1.5 dB and gives better visual sharpness than existing methods. Conclusion: The UAV4D framework performs well in rendering dynamic UAV scenes and overcomes the limitations of existing methods. Abstract: Despite significant advancements in dynamic neural rendering, existing methods fail to address the unique challenges posed by UAV-captured scenarios, particularly those involving monocular camera setups, top-down perspective, and multiple small, moving humans, which are not adequately represented in existing datasets. In this work, we introduce UAV4D, a framework for enabling photorealistic rendering for dynamic real-world scenes captured by UAVs. Specifically, we address the challenge of reconstructing dynamic scenes with multiple moving pedestrians from monocular video data without the need for additional sensors. We use a combination of a 3D foundation model and a human mesh reconstruction model to reconstruct both the scene background and humans. We propose a novel approach to resolve the scene scale ambiguity and place both humans and the scene in world coordinates by identifying human-scene contact points. Additionally, we exploit the SMPL model and background mesh to initialize Gaussian splats, enabling holistic scene rendering. We evaluated our method on three complex UAV-captured datasets: VisDrone, Manipal-UAV, and Okutama-Action, each with distinct characteristics and 10-50 humans. Our results demonstrate the benefits of our approach over existing methods in novel view synthesis, achieving a 1.5 dB PSNR improvement and superior visual sharpness.
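To put the reported 1.5 dB gain in context, the standard PSNR definition can be sketched as follows. The function name and the `max_value` default (images scaled to [0, 1]) are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Because PSNR is logarithmic in the mean squared error, a 1.5 dB improvement corresponds to an MSE ratio of 10^(-0.15), that is, roughly 29% lower mean squared error.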

[73] Physical Annotation for Automated Optical Inspection: A Concept for In-Situ, Pointer-Based Trainingdata Generation

Oliver Krumpek,Oliver Heimann,Jörg Krüger

Main category: cs.CV

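TL;DR: Proposes a novel physical annotation system for generating training data for automated optical inspection, improving annotation efficiency and accuracy through pointer-based interaction and a projected interface.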

Details Motivation: Conventional screen-based annotation is inefficient and unintuitive, and cannot make full use of the expertise of human inspectors. Method: Calibrated, tracked pointers and a projected interface capture trajectories and contours directly on the object and convert them into standardized annotation formats. Result: A preliminary evaluation confirms that the system is feasible, captures detailed annotation trajectories, and integrates with CVAT to streamline the ML workflow. Conclusion: The system bridges the gap between human expertise and automated data generation and makes it possible for non-IT experts to take part in ML training. Abstract: This paper introduces a novel physical annotation system designed to generate training data for automated optical inspection. The system uses pointer-based in-situ interaction to transfer the valuable expertise of trained inspection personnel directly into a machine learning (ML) training pipeline. Unlike conventional screen-based annotation methods, our system captures physical trajectories and contours directly on the object, providing a more intuitive and efficient way to label data. The core technology uses calibrated, tracked pointers to accurately record user input and transform these spatial interactions into standardised annotation formats that are compatible with open-source annotation software. Additionally, a simple projector-based interface projects visual guidance onto the object to assist users during the annotation process, ensuring greater accuracy and consistency. The proposed concept bridges the gap between human expertise and automated data generation, enabling non-IT experts to contribute to the ML training pipeline and preventing the loss of valuable training samples. Preliminary evaluation results confirm the feasibility of capturing detailed annotation trajectories and demonstrate that integration with CVAT streamlines the workflow for subsequent ML tasks. This paper details the system architecture, calibration procedures and interface design, and discusses its potential contribution to future ML data generation for automated optical inspection.

[74] FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

Guangzhao Li,Yanming Yang,Chenxi Song,Chi Zhang

Main category: cs.CV

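TL;DR: FlowDirector is an inversion-free video editing framework that evolves the video directly in data space via an ODE, preserving temporal coherence and structural detail, and combines attention-guided masking with a guidance-enhanced strategy to achieve efficient, consistent video editing.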

Details Motivation: To address the temporal inconsistency and structural degradation caused by existing inversion-based video editing methods. Method: Proposes the FlowDirector framework, which evolves the video directly in data space via an ODE and combines attention-guided masking with a guidance-enhanced strategy. Result: Achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation. Conclusion: FlowDirector offers a new paradigm for efficient, coherent, inversion-free video editing. Abstract: Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.
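The idea of a mask modulating an ODE velocity field can be sketched with a plain Euler integrator. This is a toy illustration under stated assumptions, not FlowDirector itself: the mask is passed in directly rather than derived from attention maps, `velocity_fn` stands in for a learned model, and all names are hypothetical.

```python
import numpy as np

def masked_euler_edit(x0, velocity_fn, mask, num_steps=50):
    """Integrate dx/dt = mask * v(x, t) from t = 0 to t = 1 with Euler steps.

    Entries where mask == 0 receive zero velocity and therefore stay exactly
    at their starting values, which mimics preserving non-target regions.
    """
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x = x + dt * mask * velocity_fn(x, t)
    return x
```

With a constant velocity pointing from the source toward a target, the masked entries arrive at the target after integration while the unmasked entries are left untouched.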

[75] A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions

Anh Le,Thanh Lam,Dung Nguyen

Main category: cs.CV

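TL;DR: This paper surveys the state of Vietnamese document analysis and recognition (DAR), discusses challenges such as complex diacritics and data scarcity, and considers the potential of large language models (LLMs).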

Details Motivation: Vietnamese DAR is crucial for digitization, information retrieval, and automation, but faces unique challenges such as complex diacritics and data scarcity. Method: Surveys existing techniques, analyzes the limitations of traditional OCR and deep learning, and explores the use of LLMs and vision-language models. Result: LLMs show promise for text recognition and document understanding, but domain adaptation, multimodal learning, and computational efficiency remain open problems. Conclusion: Future research directions include dataset development, model optimization, and the integration of multimodal approaches to advance Vietnamese DAR. Abstract: Vietnamese document analysis and recognition (DAR) is a crucial field with applications in digitization, information retrieval, and automation. Despite advancements in OCR and NLP, Vietnamese text recognition faces unique challenges due to its complex diacritics, tonal variations, and lack of large-scale annotated datasets. Traditional OCR methods often struggle with real-world document variations, while deep learning approaches have shown promise but remain limited by data scarcity and generalization issues. Recently, large language models (LLMs) and vision-language models have demonstrated remarkable improvements in text recognition and document understanding, offering a new direction for Vietnamese DAR. However, challenges such as domain adaptation, multimodal learning, and computational efficiency persist. This survey provides a comprehensive review of existing techniques in Vietnamese document recognition, highlights key limitations, and explores how LLMs can revolutionize the field. We discuss future research directions, including dataset development, model optimization, and the integration of multimodal approaches for improved document intelligence. By addressing these gaps, we aim to foster advancements in Vietnamese DAR and encourage community-driven solutions.

[76] SeedEdit 3.0: Fast and High-Quality Generative Image Editing

Peng Wang,Yichun Shi,Xiaochen Lian,Zhonghua Zhai,Xin Xia,Xuefeng Xiao,Weilin Huang,Jianchao Yang

Main category: cs.CV

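TL;DR: SeedEdit 3.0 is the companion tool to Seedream 3.0 and significantly improves edit instruction following and image content preservation. An improved data curation pipeline and a joint learning pipeline yield a higher usability rate.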

Details Motivation: To improve the instruction following and content preservation of image editing tools, especially on real image inputs. Method: 1. An improved data curation pipeline with a meta-info embedding strategy; 2. A joint learning pipeline that combines a diffusion loss and a reward loss. Result: On real image editing tests, SeedEdit 3.0 achieves a usability rate of 56.1%, higher than the previous version and other tools. Conclusion: SeedEdit 3.0 achieves the best trade-off across multiple aspects, significantly improving the practicality and quality of image editing. Abstract: We introduce SeedEdit 3.0, in companion with our T2I model Seedream 3.0 [22], which significantly improves over our previous version [27] in both aspects of edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. In addition to model upgrading with T2I, in this report, we present several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and meta information helps connect the VLM with the diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and a reward loss. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks, for real image editing, where it achieves the best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT4o (37.1%) and Gemini 2.0 (30.3%).

[77] Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics

HaoTian Lan

Main category: cs.CV

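TL;DR: The study proposes a Multimodal Street Evaluation Framework (MSEF) that combines vision and language models to assess streetscapes interpretably while capturing contradictions between subjective perceptions and objective features.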

Details Motivation: Conventional streetscape metrics cannot adequately capture subjective perceptions, which are essential to inclusive urban design. Method: MSEF is built from a vision transformer (VisualGLM-6B) and a large language model (GPT-4) and is fine-tuned parameter-efficiently with LoRA and P-Tuning v2 on more than 15,000 street-view images from Harbin. Result: The model achieves an F1 score of 0.84 on objective features and 89.3% agreement with resident perceptions, and reveals nonlinear perceptual patterns (such as the contradictory effects of commercial activity on vibrancy and comfort). Conclusion: MSEF offers methodological innovation in urban perception modeling and helps planning systems balance infrastructural precision with residents' lived experience. Abstract: While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy while simultaneously reducing pedestrian comfort. It also identifies nonlinear and semantically contingent patterns -- such as the divergent perceptual effects of architectural transparency across residential and commercial zones -- revealing the limits of universal spatial heuristics. By generating natural-language rationales grounded in attention mechanisms, the framework bridges sensory data with socio-affective inference, enabling transparent diagnostics aligned with SDG 11. This work offers both methodological innovation in urban perception modeling and practical utility for planning systems seeking to reconcile infrastructural precision with lived experience.

[78] FG 2025 TrustFAA: the First Workshop on Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)

Jiaee Cheong,Yang Liu,Harold Soh,Hatice Gunes

Main category: cs.CV

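TL;DR: The first TrustFAA workshop focuses on improving the trustworthiness of facial affect analysis (FAA), covering topics such as fairness, explainability, and safety, to address the challenges of current technology.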

Details Motivation: As Emotion AI-powered FAA tools become widely deployed, concerns about their trustworthiness are growing, and research from multiple angles is needed to address issues such as bias and privacy. Method: A workshop that brings together researchers to discuss challenges in FAA such as interpretability, uncertainty, bias, and privacy. Result: The workshop supports FG2025's ethics goals and promotes research and discussion on trustworthy FAA. Conclusion: TrustFAA aims to advance the ethics and technology of FAA and improve system trustworthiness. Abstract: With the increasing prevalence and deployment of Emotion AI-powered facial affect analysis (FAA) tools, concerns about the trustworthiness of these systems have become more prominent. This first workshop on "Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)" aims to bring together researchers who are investigating different challenges in relation to trustworthiness (such as interpretability, uncertainty, biases, and privacy) across various facial affect analysis tasks, including macro/micro-expression recognition, facial action unit detection, other corresponding applications such as pain and depression detection, as well as human-robot interaction and collaboration. In alignment with FG2025's emphasis on ethics, as demonstrated by the inclusion of an Ethical Impact Statement requirement for this year's submissions, this workshop supports FG2025's efforts by encouraging research, discussion and dialogue on trustworthy FAA.

[79] Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers

Haosong Liu,Yuge Cheng,Zihan Liu,Aiyue Chen,Yiwu Yao,Chen Chen,Jingwen Leng,Yu Feng,Minyi Guo

Main category: cs.CV

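TL;DR: ASTRAEA is an automatic framework for optimizing the configuration of video diffusion transformers (vDiTs); using lightweight token selection and an efficient sparse attention strategy, it substantially speeds up inference while preserving generation quality.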

Details Motivation: Existing video diffusion transformers are computationally demanding, and existing acceleration methods rely on heuristics, which limits their applicability. Method: Proposes a lightweight token selection mechanism and a GPU-parallel sparse attention strategy, and designs an evolutionary-algorithm-based search framework to optimize the allocation of the token budget. Result: Up to 2.4x inference speedup on a single GPU and up to 13.2x on 8 GPUs, with minimal loss in video quality (<0.5% loss on the VBench score). Conclusion: ASTRAEA substantially increases speed while maintaining high video generation quality, showing potential for practical deployment. Abstract: Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).

[80] DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models

Revant Teotia,Candace Ross,Karen Ullrich,Sumit Chopra,Adriana Romero-Soriano,Melissa Hall,Matthew J. Muckley

Main category: cs.CV

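TL;DR: The paper proposes the DIM-CIM framework for reference-free evaluation of the diversity and generalization of text-to-image models, and finds that scaling up models can sacrifice default-mode diversity.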

Details Motivation: Existing evaluation methods either depend on reference images or lack a specific definition of the type of diversity measured, limiting adaptability and interpretability. Method: Introduces the DIM-CIM framework and uses the COCO-DIMCIM benchmark to test models' default-mode diversity and generalization capacity. Result: When scaling from 1.5B to 8.1B parameters, models improve in generalization but lose default-mode diversity; the correlation between training data diversity and default-mode diversity is 0.85. Conclusion: DIM-CIM provides a flexible, interpretable tool for evaluating T2I models and supports a more comprehensive understanding of model performance. Abstract: Recent advances in text-to-image (T2I) models have achieved impressive quality and consistency. However, this has come at the cost of representation diversity. While automatic evaluation methods exist for benchmarking model diversity, they either require reference image datasets or lack specificity about the kind of diversity measured, limiting their adaptability and interpretability. To address this gap, we introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity ("Does" the model generate images with expected attributes?) and generalization capacity ("Can" the model generate diverse attributes for a particular concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO concepts and captions and augmented by a large language model. With COCO-DIMCIM, we find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. DIMCIM also identifies fine-grained failure cases, such as attributes that are generated with generic prompts but are rarely generated when explicitly requested. Finally, we use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity. Our work provides a flexible and interpretable framework for assessing T2I model diversity and generalization, enabling a more comprehensive understanding of model performance.

[81] Practical Manipulation Model for Robust Deepfake Detection

Benedikt Hopf,Radu Timofte

Main category: cs.CV

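TL;DR: The paper proposes a Practical Manipulation Model (PMM) that significantly improves the robustness and performance of deepfake detection models by extending the space of pseudo-fakes and adding stronger degradations to training images.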

Details Motivation: Existing deepfake detectors perform unstably under non-ideal conditions and are easy to circumvent, so more robust detection methods are needed. Method: Develops the PMM, which extends the space of pseudo-fakes using Poisson blending, more diverse masks, generator artifacts, and distractors, and adds strong degradations to training images to improve detector generalization. Result: AUC improves by 3.51% and 6.21% on the DFDC and DFDCP datasets, respectively, with significantly better robustness and performance. Conclusion: The PMM effectively addresses the insufficient robustness of existing detectors and offers a more practical solution for deepfake detection. Abstract: Modern deepfake detection models have achieved strong performance even on the challenging cross-dataset task. However, detection performance under non-ideal conditions remains very unstable, limiting success on some benchmark datasets and making it easy to circumvent detection. Inspired by the move to a more real-world degradation model in the area of image super-resolution, we have developed a Practical Manipulation Model (PMM) that covers a larger set of possible forgeries. We extend the space of pseudo-fakes by using Poisson blending, more diverse masks, generator artifacts, and distractors. Additionally, we improve the detectors' generality and robustness by adding strong degradations to the training images. We demonstrate that these changes not only significantly enhance the model's robustness to common image degradations but also improve performance on standard benchmark datasets. Specifically, we show clear increases of 3.51% and 6.21% AUC on the DFDC and DFDCP datasets, respectively, over the state-of-the-art LAA backbone. Furthermore, we highlight the lack of robustness in previous detectors and our improvements in this regard. Code can be found at https://github.com/BenediktHopf/PMM

[82] CIVET: Systematic Evaluation of Understanding in VLMs

Massimo Rizzoli,Simone Alghisi,Olha Khomyn,Gabriel Roccabruna,Seyed Mahed Mousavi,Giuseppe Riccardi

Main category: cs.CV

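TL;DR: The CIVET framework systematically evaluates how well vision-language models (VLMs) understand scene structure and semantics; it finds that current VLMs are limited in understanding object properties and relations, that their performance depends on object position, and that they fall short of human-level accuracy.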

Details Motivation: To study how well VLMs understand scene structure and semantics, filling a gap in existing evaluation methods. Method: Proposes the CIVET framework, which evaluates VLMs systematically with controlled stimuli, avoiding noise and bias. Result: Current VLMs recognize only a limited set of basic properties, their performance depends on object position, their understanding of relations is weak, and they do not reach human-level accuracy. Conclusion: VLMs still have substantial room for improvement in scene understanding, and further research is needed. Abstract: While Vision-Language Models (VLMs) have achieved competitive performance in various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate the understanding of VLMs, we study their capability regarding object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli. CIVET addresses the lack of standardized systematic evaluation for assessing VLMs' understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of achieving human-level accuracy.

[83] FRED: The Florence RGB-Event Drone Dataset

Gabriele Magrini,Niccolò Marini,Federico Becattini,Lorenzo Berlincioni,Niccolò Biondi,Pietro Pala,Alberto Del Bimbo

Main category: cs.CV

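TL;DR: The paper introduces the Florence RGB-Event Drone dataset (FRED), designed for drone detection, tracking, and trajectory forecasting, which combines RGB video and event streams to address the limitations of conventional RGB cameras in perceiving high-speed drones.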

Details Motivation: Conventional RGB cameras are limited in capturing fast-moving drones, especially under challenging lighting. Event cameras offer high temporal resolution and dynamic range, but existing datasets lack fine temporal resolution or drone-specific motion patterns. Method: The authors present the FRED dataset, which contains more than 7 hours of densely annotated drone trajectories covering 5 drone models and challenging scenarios such as rain and adverse lighting, together with evaluation protocols and standard metrics. Result: FRED provides multimodal data for drone detection, tracking, and trajectory forecasting, filling gaps in existing datasets. Conclusion: FRED is expected to advance research in high-speed drone perception and multimodal spatiotemporal understanding. Abstract: Small, fast, and lightweight drones present significant challenges for traditional RGB cameras due to their limitations in capturing fast-moving objects, especially under challenging lighting conditions. Event cameras offer an ideal solution, providing high temporal definition and dynamic range, yet existing benchmarks often lack fine temporal resolution or drone-specific motion patterns, hindering progress in these areas. This paper introduces the Florence RGB-Event Drone dataset (FRED), a novel multimodal dataset specifically designed for drone detection, tracking, and trajectory forecasting, combining RGB video and event streams. FRED features more than 7 hours of densely annotated drone trajectories, using 5 different drone models and including challenging scenarios such as rain and adverse lighting conditions. We provide detailed evaluation protocols and standard metrics for each task, facilitating reproducible benchmarking. The authors hope FRED will advance research in high-speed drone perception and multimodal spatiotemporal understanding.

[84] Through-the-Wall Radar Human Activity Recognition WITHOUT Using Neural Networks

Weicheng Gao

Main category: cs.CV

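TL;DR: The paper proposes a through-the-wall radar human activity recognition method that does not use neural networks, based on template matching and topological similarity computation.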

Details Motivation: The author argues that the field relies too heavily on neural network training and overlooks the physical interpretability and signal-processing foundations of early template-matching methods, and wants to return to that original path. Method: Range-time and Doppler-time maps are generated; corner detection and a multiphase active contour model segment the micro-Doppler signature, which is discretized into a point cloud; topological similarity is then computed with the Mapper algorithm. Result: Numerical simulations and measured experiments validate the method. Conclusion: The method shows that intelligent recognition is achievable without neural networks; the code is open source. Abstract: After a few years of research in the field of through-the-wall radar (TWR) human activity recognition (HAR), I found that we seem to be stuck in the mindset of training on radar image data through neural network models. The earliest related works in this field based on template matching did not require a training process, and I believe they have never died, because these methods possess strong physical interpretability and are closer to the basis of theoretical signal processing research. In this paper, I would like to try to return to the original path by attempting to eschew neural networks to achieve the TWR HAR task and to take on the challenge of achieving intelligent recognition as neural network models do. In detail, the range-time map and Doppler-time map of TWR are first generated. Then, the initial regions of the human target foreground and noise background on the maps are determined using a corner detection method, and the micro-Doppler signature is segmented using the multiphase active contour model. The micro-Doppler segmentation feature is discretized into a two-dimensional point cloud. Finally, the topological similarity between the resulting point cloud and the point clouds of the template data is calculated using the Mapper algorithm to obtain the recognition results. The effectiveness of the proposed method is demonstrated by numerically simulated and measured experiments. The open-source code of this work is released at: https://github.com/JoeyBGOfficial/Through-the-Wall-Radar-Human-Activity-Recognition-Without-Using-Neural-Networks.

[85] Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline

Yuzhi Huang,Chenxin Li,Haitao Zhang,Zixu Lin,Yunlong Lin,Hengyu Liu,Wuyang Li,Xinyu Liu,Jiechao Gao,Yue Huang,Xinghao Ding,Yixuan Yuan

Main category: cs.CV

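TL;DR: Proposes TAO, a new framework for fine-grained video anomaly detection that for the first time unifies multi-granularity anomalous object detection in one framework and achieves more precise anomaly localization through pixel-level tracking.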

Details Motivation: Existing methods focus mainly on anomalous frames or objects and neglect pixel-level anomaly analysis, which limits the range of anomalies they can detect. Method: Recasts anomaly detection as pixel-level tracking and couples it with segmentation and tracking tasks, avoiding threshold tuning. Result: Experiments show that TAO sets new benchmarks in accuracy and robustness. Conclusion: The TAO framework offers a more precise and unified solution for video anomaly detection. Abstract: Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalous objects in videos -- either by identifying anomalous frames or objects -- they often neglect finer-grained analysis, such as anomalous pixels, which limits their ability to capture a broader range of anomalies. To address this challenge, we propose a new framework called Track Any Anomalous Object (TAO), which introduces a granular video anomaly detection pipeline that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified framework. Unlike methods that assign anomaly scores to every pixel, our approach transforms the problem into pixel-level tracking of anomalous objects. By linking anomaly scores to downstream tasks such as segmentation and tracking, our method removes the need for threshold tuning and achieves more precise anomaly localization in long and complex video sequences. Experiments demonstrate that TAO sets new benchmarks in accuracy and robustness. Project page available online.

[86] Single GPU Task Adaptation of Pathology Foundation Models for Whole Slide Image Analysis

Neeraj Kumar,Swaraj Nanda,Siddharth Singi,Jamal Benhamida,David Kim,Jie-Fu Chen,Amir Momeni-Boroujeni,Gregory M. Goldgof,Gabriele Campanella,Chad Vanderbilt

Main category: cs.CV

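TL;DR: This paper proposes TAPFM, a new method for task adaptation of pathology foundation models (PFMs) on a single GPU, which uses vision transformer attention to optimize feature representations and attention weights, significantly improving whole slide image (WSI) analysis.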

Details Motivation: Although PFMs perform well at analyzing WSIs, adapting them to specific clinical tasks remains challenging because only weak (WSI-level) labels are available and multiple instance learning (MIL) is required. Method: TAPFM uses vision transformer attention for MIL aggregation while optimizing both feature representations and attention weights, and keeps separate computational graphs for the MIL aggregator and the PFM to achieve stable end-to-end training. Result: On mutation prediction tasks for bladder cancer and lung adenocarcinoma, TAPFM outperforms conventional approaches and also handles multi-label classification effectively. Conclusion: TAPFM makes it practical to adapt pretrained PFMs on standard hardware for a variety of clinical applications. Abstract: Pathology foundation models (PFMs) have emerged as powerful tools for analyzing whole slide images (WSIs). However, adapting these pretrained PFMs for specific clinical tasks presents considerable challenges, primarily due to the availability of only weak (WSI-level) labels for gigapixel images, necessitating a multiple instance learning (MIL) paradigm for effective WSI analysis. This paper proposes a novel approach for single-GPU Task Adaptation of PFMs (TAPFM) that uses vision transformer (ViT) attention for MIL aggregation while optimizing both for feature representations and attention weights. The proposed approach maintains separate computational graphs for MIL aggregator and the PFM to create stable training dynamics that align with downstream task objectives during end-to-end adaptation. Evaluated on mutation prediction tasks for bladder cancer and lung adenocarcinoma across institutional and TCGA cohorts, TAPFM consistently outperforms conventional approaches, with H-Optimus-0 (TAPFM) outperforming the benchmarks. TAPFM effectively handles multi-label classification of actionable mutations as well. Thus, TAPFM makes adaptation of powerful pre-trained PFMs practical on standard hardware for various clinical applications.
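MIL aggregation turns many tile-level features into one slide-level embedding, which is what allows training from WSI-level labels only. The sketch below shows a generic attention-weighted pooling with a single linear scoring vector; it is an illustrative assumption and not TAPFM's ViT-attention aggregator, and the function name and the parameter `w` are hypothetical.

```python
import numpy as np

def attention_mil_pool(tile_features, w):
    """Pool tile features of shape (num_tiles, dim) into one slide embedding.

    Each tile is scored by a dot product with w (an assumed learned vector);
    a softmax over tiles turns the scores into attention weights.
    """
    scores = tile_features @ w
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    slide_embedding = weights @ tile_features
    return slide_embedding, weights
```

With a zero scoring vector every tile gets the same weight and the pooling reduces to a plain mean; a trained scoring vector would instead concentrate weight on the tiles most relevant to the slide-level label.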

[87] MokA: Multimodal Low-Rank Adaptation for MLLMs

Yake Wei,Yu Miao,Dongzhan Zhou,Di Hu

Main category: cs.CV

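TL;DR: This paper proposes Multimodal low-rank Adaptation (MokA), which addresses the limitations of current multimodal fine-tuning methods by compressing unimodal information with modality-specific parameters and enhancing cross-modal interaction, markedly improving the fine-tuning of multimodal large language models.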

Details Motivation: Current efficient multimodal fine-tuning methods are mostly borrowed directly from large language models (LLMs) and overlook the intrinsic differences of multimodal scenarios, so they fail to make full use of all modalities. This paper aims to address that problem. Method: Proposes MokA, which compresses unimodal information with modality-specific parameters and explicitly enhances cross-modal interaction, achieving joint unimodal and cross-modal adaptation. Result: Experiments across several multimodal scenarios (audio-visual-text, visual-text, speech-text) and several LLM backbones (LLaMA2/3, Qwen2, etc.) all show significant improvements. Conclusion: MokA offers a more targeted solution for efficient adaptation of multimodal large language models and lays a foundation for future research. Abstract: In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts for the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully assess our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at https://gewu-lab.github.io/MokA.

[88] Vision-Based Autonomous MM-Wave Reflector Using ArUco-Driven Angle-of-Arrival Estimation

Josue Marroquin,Nan Inzali,Miles Dillon Lantz,Campbell Freeman,Amod Ashtekar,Ajinkya Umesh Mulik,Mohammed E Eltayeb

Main category: cs.CV

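TL;DR: The paper proposes a vision-aided autonomous reflector system that improves millimeter-wave communication performance under non-line-of-sight conditions.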

Details Motivation: Reliable millimeter-wave communication under non-line-of-sight conditions is a major challenge for both military and civilian use, especially in urban or infrastructure-limited environments. Method: The system uses a monocular camera to detect ArUco markers, estimates angles of arrival, and dynamically adjusts the reflector orientation with a motorized metallic plate to redirect the signal in real time. Result: Experiments at 60 GHz show an average gain of 23 dB in received signal strength and a 0.89 probability of keeping signal reception above -65 dB in an indoor environment. Conclusion: The system shows strong adaptability and reliability for millimeter-wave communication in complex, dynamic environments. Abstract: Reliable millimeter-wave (mmWave) communication in non-line-of-sight (NLoS) conditions remains a major challenge for both military and civilian operations, especially in urban or infrastructure-limited environments. This paper presents a vision-aided autonomous reflector system designed to enhance mmWave link performance by dynamically steering signal reflections using a motorized metallic plate. The proposed system leverages a monocular camera to detect ArUco markers on allied transmitter and receiver nodes, estimate their angles of arrival, and align the reflector in real time for optimal signal redirection. This approach enables selective beam coverage by serving only authenticated targets with visible markers and reduces the risk of unintended signal exposure. The designed prototype, built on a Raspberry Pi 4 and low-power hardware, operates autonomously without reliance on external infrastructure or GPS. Experimental results at 60 GHz demonstrate a 23 dB average gain in received signal strength and a 0.89 probability of maintaining signal reception above a target threshold of -65 dB in an indoor environment, far exceeding the static and no-reflector baselines. These results demonstrate the system's potential for resilient and adaptive mmWave connectivity in complex and dynamic environments.
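
The alignment step can be illustrated with basic geometry. The abstract does not give the estimator, so the following is a minimal sketch under stated assumptions: a pinhole camera co-located with the plate, marker centres already detected as pixel columns, and a flat plate whose normal must bisect the directions to the two nodes for specular reflection. All names and the intrinsics `cx` and `fx` are hypothetical.

```python
import math


def bearing_from_pixel(u: float, cx: float, fx: float) -> float:
    """Horizontal bearing in degrees of a marker centre at pixel column u.

    Assumes a pinhole camera with principal point cx and focal length fx,
    both in pixels (hypothetical intrinsics).
    """
    return math.degrees(math.atan2(u - cx, fx))


def reflector_normal_bearing(theta_tx_deg: float, theta_rx_deg: float) -> float:
    """Bearing of the plate normal that bisects the two node directions.

    For a flat specular reflector the angle of incidence equals the angle of
    reflection, so the normal lies halfway between transmitter and receiver.
    Assumes both bearings share one frame and lie within (-90, 90) degrees.
    """
    return 0.5 * (theta_tx_deg + theta_rx_deg)


# Transmitter marker left of centre, receiver marker right of centre.
theta_tx = bearing_from_pixel(u=120.0, cx=320.0, fx=600.0)
theta_rx = bearing_from_pixel(u=920.0, cx=320.0, fx=600.0)
plate_normal = reflector_normal_bearing(theta_tx, theta_rx)
```

A motor controller would then rotate the plate until its normal matches `plate_normal`; that control loop is outside this sketch.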

[89] Quantifying Cross-Modality Memorization in Vision-Language Models

Yuxin Wen,Yangsibo Huang,Tom Goldstein,Ravi Kumar,Badih Ghazi,Chiyuan Zhang

Main category: cs.CV

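TL;DR: This paper studies the characteristics of cross-modality memorization, using a synthetic dataset to quantify factual knowledge memorization and cross-modal transferability in vision-language models; it finds a significant gap between modalities and proposes a method to mitigate it.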

Details Motivation: To study the memorization behavior of neural networks in multimodal settings, with a view to both sensitive-information leakage and the efficiency of knowledge acquisition. Method: Introduces a synthetic dataset, trains models on a single modality, evaluates their performance in the other modality, and analyzes the memorization gap. Result: Knowledge transfers between modalities, but a significant gap exists, and the gap persists across a variety of scenarios. Conclusion: Proposes a baseline method to mitigate the gap and calls for future research on more robust cross-modal transferability. Abstract: Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. Finally, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.

[90] Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding

Yani Zhang,Dongming Wu,Hao Shi,Yingfei Liu,Tiancai Wang,Haoqiang Fan,Xingping Dong

Main category: cs.CV

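TL;DR: The study finds that existing 3D detection models, without any training on language instructions, achieve better grounding performance than purpose-trained 3D grounding models, indicating that the 3D grounding task is still not well solved. The authors propose the DEGround framework, which shares DETR queries and introduces regional activation and query modulation modules, significantly improving performance.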

Details Motivation: To examine whether 3D grounding truly benefits from detection models; the study finds that existing methods perform poorly even on category-level grounding and need improvement. Method: Proposes the DEGround framework, which shares DETR queries between detection and grounding and introduces a regional activation module and a query modulation module to strengthen understanding of language context. Result: DEGround's overall accuracy on the EmbodiedScan validation set is 7.52% higher than that of the previous best model, BIP3D. Conclusion: By combining detection and grounding, DEGround significantly improves 3D grounding performance, demonstrating its effectiveness. Abstract: Embodied 3D grounding aims to localize target objects described in human instructions from an egocentric viewpoint. Most methods typically follow a two-stage paradigm where a trained 3D detector's optimized backbone parameters are used to initialize a grounding model. In this study, we explore a fundamental question: Does embodied 3D grounding benefit enough from detection? To answer this question, we assess the grounding performance of detection models using predicted boxes filtered by the target category. Surprisingly, these detection models without any instruction-specific training outperform the grounding models explicitly trained with language instructions. This indicates that even category-level embodied 3D grounding may not be well resolved, let alone more fine-grained context-aware grounding. Motivated by this finding, we propose DEGround, which shares DETR queries as object representation for both DEtection and Grounding and enables the grounding to benefit from basic category classification and box detection. Based on this framework, we further introduce a regional activation grounding module that highlights instruction-related regions and a query-wise modulation module that incorporates sentence-level semantics into the query representation, strengthening the context-aware understanding of language instructions. Remarkably, DEGround outperforms the state-of-the-art model BIP3D by 7.52% in overall accuracy on the EmbodiedScan validation set. The source code will be publicly available at https://github.com/zyn213/DEGround.
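
The detection-only baseline described in this abstract (predicted boxes filtered by the target category) can be sketched in a few lines. The dictionary layout and the tie-breaking rule of taking the highest-scoring remaining box are assumptions made for illustration; the abstract states only that boxes are filtered by category.

```python
from typing import Dict, List, Optional


def ground_by_category(detections: List[Dict],
                       target_category: str) -> Optional[Dict]:
    """Return the most confident predicted box of the target category.

    `detections` is a hypothetical list of dicts with keys "category",
    "score", and "box". No language instruction is used beyond the
    category name. Returns None if no box of that category was predicted.
    """
    candidates = [d for d in detections if d["category"] == target_category]
    return max(candidates, key=lambda d: d["score"], default=None)


detections = [
    {"category": "chair", "score": 0.91, "box": (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)},
    {"category": "chair", "score": 0.55, "box": (2.0, 0.0, 0.0, 3.0, 1.0, 1.0)},
    {"category": "table", "score": 0.97, "box": (4.0, 0.0, 0.0, 6.0, 2.0, 1.0)},
]
best_chair = ground_by_category(detections, "chair")
```

Such a baseline cannot separate two chairs by context (for example "the chair next to the window"), which is the fine-grained case the abstract says remains harder still.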

[91] OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View

Yanbo Wang,Ziyi Wang,Wenzhao Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

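TL;DR: OGGSplat is a method based on open Gaussian growing that expands the field of view through a semantically consistent inpainting module, enabling 3D scene reconstruction from sparse views.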

Details Motivation: Existing per-scene methods require dense views and are computationally expensive, while generalizable methods struggle to reconstruct regions outside the field of view. Method: Uses the semantic attributes of open Gaussians for image extrapolation, combining an RGB-semantic consistent inpainting module with bidirectionally controlled diffusion models. Result: Performs well on the Gaussian Outpainting benchmark and supports reconstruction from two views captured with a smartphone. Conclusion: OGGSplat achieves semantically consistent and visually plausible 3D scene reconstruction from sparse views. Abstract: Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In this paper, we propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation, enabling both semantic consistency and visual plausibility. Specifically, once open Gaussians are initialized from sparse views, we introduce an RGB-semantic consistent inpainting module applied to selected rendered views. This module enforces bidirectional control between an image diffusion model and a semantic diffusion model. The inpainted regions are then lifted back into 3D space for efficient and progressive Gaussian parameter optimization. To evaluate our method, we establish a Gaussian Outpainting (GO) benchmark that assesses both semantic and generative quality of reconstructed open-vocabulary scenes. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two view images captured directly from a smartphone camera.

[92] Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning

Yue Ma,Yulong Liu,Qiyuan Zhu,Ayden Yang,Kunyu Feng,Xinhua Zhang,Zhifeng Li,Sirui Han,Chenyang Qi,Qifeng Chen

Main category: cs.CV

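TL;DR: The paper proposes the Follow-Your-Motion framework, which uses spatial-temporal decoupled LoRA and sparse motion sampling to improve video motion transfer, addressing the motion inconsistency and tuning inefficiency of existing methods.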

Details Motivation: Existing LoRA-based motion transfer methods suffer from motion inconsistency and low tuning efficiency when applied to large video diffusion transformers. Method: Proposes spatial-temporal decoupled LoRA, sparse motion sampling, and adaptive RoPE, finetuning the video diffusion transformer in two stages. Result: The superiority of Follow-Your-Motion is verified on MotionBench. Conclusion: Follow-Your-Motion significantly improves the consistency and efficiency of motion transfer. Abstract: Recently, breakthroughs in the video diffusion transformer have shown remarkable capabilities in diverse motion generation. As for the motion-transfer task, current methods mainly use two-stage Low-Rank Adaptations (LoRAs) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos due to the inherent spatial-temporal coupling in the 3D attention operator. Additionally, it requires time-consuming fine-tuning processes in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing. During the second training stage, we design the sparse motion sampling and adaptive RoPE to accelerate the tuning speed. To address the lack of a benchmark for this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion, including creative camera motion, single object motion, multiple object motion, and complex human motion. We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.

[93] Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation

Jan Ackermann,Kiyohiro Nakayama,Guandao Yang,Tong Wu,Gordon Wetzstein

Main category: cs.CV

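TL;DR: The VLG model generates garments by combining visual and language information, demonstrating the potential of multimodal foundation models in specialized domains such as fashion design.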

Details Motivation: To explore how well multimodal foundation models transfer knowledge to specialized domains such as garment generation. Method: Proposes the VLG model, which synthesizes garments from textual descriptions and visual imagery, and evaluates its zero-shot generalization. Result: Preliminary results show that VLG transfers knowledge reasonably well to unseen garment styles and prompts. Conclusion: Multimodal foundation models have the potential to adapt to specialized domains such as fashion design. Abstract: Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG's zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.

[94] DSG-World: Learning a 3D Gaussian World Model from Dual State Videos

Wenhao Hu,Xuexiang Wen,Xi Li,Gaoang Wang

Main category: cs.CV

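TL;DR: DSG-World proposes an end-to-end framework based on dual-state observations that addresses occlusion by explicitly constructing a 3D Gaussian world model, enabling efficient real-to-simulation transfer.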

Details Motivation: Existing world modeling methods are hard to train, lack 3D or physical consistency, or require multi-stage processing. DSG-World aims to address these problems with dual-state observations. Method: Builds dual segmentation-aware Gaussian fields from dual-state observations, enforces bidirectional photometric and semantic consistency, and introduces a pseudo intermediate state for symmetric alignment and refinement of geometric completeness. Result: Experiments show that DSG-World generalizes strongly to novel views and scene states and supports high-fidelity rendering and object-level scene manipulation. Conclusion: DSG-World is an efficient and physically consistent world modeling method suited to real-world 3D reconstruction and simulation. Abstract: Building an efficient and physically consistent world model from limited observations is a long-standing challenge in vision and robotics. Many existing world modeling pipelines are based on implicit generative models, which are hard to train and often lack 3D or physical consistency. On the other hand, explicit 3D methods built from a single state often require multi-stage processing (such as segmentation, background completion, and inpainting) due to occlusions. To address this, we leverage two perturbed observations of the same scene under different object configurations. These dual states offer complementary visibility, alleviating occlusion issues during state transitions and enabling more stable and complete reconstruction. In this paper, we present DSG-World, a novel end-to-end framework that explicitly constructs a 3D Gaussian World model from Dual State observations. Our approach builds dual segmentation-aware Gaussian fields and enforces bidirectional photometric and semantic consistency. We further introduce a pseudo intermediate state for symmetric alignment and design collaborative co-pruning strategies to refine geometric completeness. DSG-World enables efficient real-to-simulation transfer purely in the explicit Gaussian representation space, supporting high-fidelity rendering and object-level scene manipulation without relying on dense observations or multi-stage pipelines. Extensive experiments demonstrate strong generalization to novel views and scene states, highlighting the effectiveness of our approach for real-world 3D reconstruction and simulation.

[95] MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm

Zhang Li,Yuliang Liu,Qiang Liu,Zhiyin Ma,Ziyang Zhang,Shuo Zhang,Zidun Guo,Jiarui Zhang,Xinyu Wang,Xiang Bai

Main category: cs.CV

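TL;DR: MonkeyOCR is a vision-language model for document parsing based on the SRR triplet paradigm; it simplifies a complex pipeline, improves efficiency, and significantly outperforms existing methods.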

Details Motivation: To address the complexity and inefficiency of conventional document parsing approaches, such as modular pipelines or large multimodal models. Method: Adopts the SRR (Structure-Recognition-Relation) triplet paradigm, decomposing document parsing into three core questions that handle layout, content, and logical relations respectively. Result: Outperforms MinerU by 5.1% on average, with notable gains on formulas and tables and faster processing. Conclusion: MonkeyOCR reaches state-of-the-art accuracy and efficiency and is suitable for practical deployment. Abstract: We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU's modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions - "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation) - corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.

[96] SAM-aware Test-time Adaptation for Universal Medical Image Segmentation

Jianghao Wu,Yicheng Wu,Yutong Xie,Wenjia Bai,You Zhang,Feilong Tang,Yulong Li,Yasmeen George,Imran Razzak

Main category: cs.CV

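TL;DR: SAM-TTA is a new test-time adaptation framework designed to improve SAM's performance on medical image segmentation while preserving its generalization ability.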

Details Motivation: To address SAM's limited adaptability to medical image segmentation while avoiding the reduced generalization of existing adaptations such as MedSAM. Method: Proposes the SAM-TTA framework, comprising SBCT (Self-adaptive Bezier Curve-based Transformation) and DUMT (Dual-scale Uncertainty-driven Mean Teacher adaptation), which address input-level and semantic-level discrepancies respectively. Result: Experiments on five public datasets show that SAM-TTA outperforms existing TTA methods and in some scenarios even surpasses the fully fine-tuned MedSAM. Conclusion: SAM-TTA offers a new paradigm for universal medical image segmentation that combines high performance with generalization. Abstract: Universal medical image segmentation using the Segment Anything Model (SAM) remains challenging due to its limited adaptability to medical domains. Existing adaptations, such as MedSAM, enhance SAM's performance in medical imaging but at the cost of reduced generalization to unseen data. Therefore, in this paper, we propose SAM-aware Test-Time Adaptation (SAM-TTA), a fundamentally different pipeline that preserves the generalization of SAM while improving its segmentation performance in medical imaging via a test-time framework. SAM-TTA tackles two key challenges: (1) input-level discrepancies caused by differences in image acquisition between natural and medical images and (2) semantic-level discrepancies due to fundamental differences in object definition between natural and medical domains (e.g., clear boundaries vs. ambiguous structures). Specifically, our SAM-TTA framework comprises (1) Self-adaptive Bezier Curve-based Transformation (SBCT), which adaptively converts single-channel medical images into three-channel SAM-compatible inputs while maintaining structural integrity, to mitigate the input gap between medical and natural images, and (2) Dual-scale Uncertainty-driven Mean Teacher adaptation (DUMT), which employs consistency learning to align SAM's internal representations to medical semantics, enabling efficient adaptation without auxiliary supervision or expensive retraining. Extensive experiments on five public datasets demonstrate that our SAM-TTA outperforms existing TTA approaches and even surpasses fully fine-tuned models such as MedSAM in certain scenarios, establishing a new paradigm for universal medical image segmentation. Code can be found at https://github.com/JianghaoWu/SAM-TTA.

[97] Spatiotemporal Contrastive Learning for Cross-View Video Localization in Unstructured Off-road Terrains

Zhiyun Deng,Dongmyeong Lee,Amanda Adkins,Jesse Quattrociocchi,Christian Ellis,Joydeep Biswas

Main category: cs.CV

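TL;DR: MoViX is a self-supervised cross-view video localization framework for 3-DoF localization in GPS-denied, off-road environments; it improves localization accuracy by learning viewpoint- and season-invariant representations.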

Details Motivation: In GPS-denied off-road environments, repetitive vegetation, unstructured terrain, and seasonal changes make visual localization difficult, and conventional methods struggle to align observations with outdated satellite imagery. Method: MoViX uses a pose-dependent positive sampling strategy, temporally aligned hard negative mining, a motion-informed frame sampler, and a lightweight temporal aggregator, and runs inference within a Monte Carlo Localization framework. Result: On the TartanDrive 2.0 dataset, with under 30 minutes of training data and 12.29 km of testing, MoViX localizes within 25 meters 93% of the time and within 50 meters 100% of the time, outperforming existing methods. Conclusion: MoViX performs well under visual ambiguity and generalizes to a real-world off-road dataset from a different geographic site and robot platform. Abstract: Robust cross-view 3-DoF localization in GPS-denied, off-road environments remains challenging due to (1) perceptual ambiguities from repetitive vegetation and unstructured terrain, and (2) seasonal shifts that significantly alter scene appearance, hindering alignment with outdated satellite imagery. To address this, we introduce MoViX, a self-supervised cross-view video localization framework that learns viewpoint- and season-invariant representations while preserving directional awareness essential for accurate localization. MoViX employs a pose-dependent positive sampling strategy to enhance directional discrimination and temporally aligned hard negative mining to discourage shortcut learning from seasonal cues. A motion-informed frame sampler selects spatially diverse frames, and a lightweight temporal aggregator emphasizes geometrically aligned observations while downweighting ambiguous ones. At inference, MoViX runs within a Monte Carlo Localization framework, using a learned cross-view matching module in place of handcrafted models. Entropy-guided temperature scaling enables robust multi-hypothesis tracking and confident convergence under visual ambiguity. We evaluate MoViX on the TartanDrive 2.0 dataset, training on under 30 minutes of data and testing over 12.29 km. Despite outdated satellite imagery, MoViX localizes within 25 meters of ground truth 93% of the time, and within 50 meters 100% of the time in unseen regions, outperforming state-of-the-art baselines without environment-specific tuning. We further demonstrate generalization on a real-world off-road dataset from a geographically distinct site with a different robot platform.

[98] LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs

Xiaodong Wang,Jinfa Huang,Li Yuan,Peixi Peng

Main category: cs.CV

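TL;DR: The paper proposes LeanPO, which redefines the reward and adds a dynamic label smoothing strategy to address the unintended increase in the probability of non-target responses caused by preference alignment in Video-LLMs.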

Details Motivation: When existing Video-LLMs use preference alignment techniques such as DPO, the log-probabilities of both the winning and the losing responses often decrease together, which raises the probability of non-target responses. Method: Proposes LeanPO, which redefines the reward as the average likelihood under the policy model and uses a dynamic label smoothing strategy to reduce the effect of noise. Result: Experiments show that LeanPO significantly improves Video-LLM performance with little additional training overhead. Conclusion: LeanPO provides a simple and effective preference alignment scheme for Video-LLMs, improving model reliability and efficiency. Abstract: Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO (Rafailov et al., 2024), to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose Lean Preference Optimization (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness correlated self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward reliable and efficient Video-LLMs.
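
The reference-free reward can be made concrete with a small sketch. The abstract says only that the implicit reward is "the average likelihood of the response with respect to the policy model"; reading that as the length-normalised sum of per-token log-probabilities is an assumption here, as are the function names and the toy numbers.

```python
from typing import List


def average_log_likelihood(token_logprobs: List[float]) -> float:
    """Length-normalised log-likelihood of one response.

    `token_logprobs` holds log pi_theta(y_t | x, y_<t) for each response
    token, as produced by the policy model (hypothetical inputs here).
    No reference model appears in the computation.
    """
    return sum(token_logprobs) / len(token_logprobs)


def reward_margin(logprobs_winning: List[float],
                  logprobs_losing: List[float]) -> float:
    """Margin between the rewards of the winning and losing responses."""
    return (average_log_likelihood(logprobs_winning)
            - average_log_likelihood(logprobs_losing))


# The losing response is longer, but averaging removes the length effect.
margin = reward_margin([-1.0, -1.0], [-2.0, -2.0, -2.0])
```

Because the reward depends directly on the policy's own likelihood of the winning response, a margin of this form cannot be widened merely by pushing both log-probabilities down by the same amount, which is the failure mode the abstract attributes to DPO. The label smoothing component is not covered by this sketch.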

[99] Can Foundation Models Generalise the Presentation Attack Detection Capabilities on ID Cards?

Juan E. Tapia,Christoph Busch

Main category: cs.CV

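TL;DR: This paper explores using foundation models (FMs) to improve the generalization of presentation attack detection (PAD) on ID cards, testing zero-shot and fine-tuning approaches on several ID card datasets.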

Details Motivation: Because of privacy protection constraints, current PAD systems can be trained on only a few ID card types, so they perform poorly on ID cards from new countries. FMs, trained on huge datasets, may improve generalization. Method: Uses zero-shot and fine-tuning protocols, tested on a private dataset based on Chilean IDs and an open dataset of Finnish, Spanish, and Slovak IDs. Result: The study finds that bona fide images are the key to generalization. Conclusion: FMs show promise for PAD on ID cards, especially when bona fide images are used to improve generalization. Abstract: Nowadays, one of the main challenges in presentation attack detection (PAD) on ID cards is obtaining generalisation capabilities for a diversity of countries that are issuing ID cards. Most PAD systems are trained on one, two, or three ID documents because of privacy protection concerns. As a result, they do not obtain competitive results for commercial purposes when tested on an ID card from an unknown new country. In this scenario, Foundation Models (FM) trained on huge datasets can help to improve generalisation capabilities. This work intends to improve and benchmark the capabilities of FM and how to use them to adapt the generalisation on PAD of ID Documents. Different test protocols were used, considering zero-shot and fine-tuning and two different ID card datasets: one private dataset based on Chilean IDs and one open-set dataset based on IDs from three countries (Finland, Spain, and Slovakia). Our findings indicate that bona fide images are the key to generalisation.

[100] From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Animesh Gupta,Jay Parmar,Ishan Rajendrakumar Dave,Mubarak Shah

Main category: cs.CV

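TL;DR: TF-CoVR is a new benchmark focused on temporally fine-grained composed video retrieval; its query-modification pairs are generated with an LLM, and the accompanying TF-CoVR-Base framework significantly improves retrieval performance.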

Details Motivation: Existing CoVR benchmarks do not test the ability to capture fast, subtle temporal differences, so a new benchmark dedicated to temporal fine granularity is needed. Method: Proposes the TF-CoVR benchmark, with 180K triplets built from FineGym and FineDiving, and designs the TF-CoVR-Base framework, which pre-trains a video encoder and aligns queries with candidates through contrastive learning. Result: TF-CoVR-Base raises mAP@50 from 5.92 to 7.51 in the zero-shot setting and from 19.83 to 25.82 after fine-tuning. Conclusion: TF-CoVR fills a gap in temporally fine-grained video retrieval, and the TF-CoVR-Base framework significantly improves performance. Abstract: Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that focus on temporal aspects link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.
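
Since each TF-CoVR query has several valid targets, the reported mAP@50 has to credit every relevant video in the ranking. The abstract does not spell out the formula, so the sketch below assumes one common definition: precision is accumulated at each rank where a relevant video appears, then normalised by min(K, number of relevant videos). Function names and video ids are hypothetical.

```python
from typing import Iterable, List, Sequence


def average_precision_at_k(ranked_ids: Sequence[str],
                           relevant_ids: Iterable[str],
                           k: int = 50) -> float:
    """AP@K for one query with possibly several relevant videos.

    Assumes `ranked_ids` contains no duplicates.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, video_id in enumerate(ranked_ids[:k], start=1):
        if video_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(rankings: List[Sequence[str]],
                                targets: List[Iterable[str]],
                                k: int = 50) -> float:
    """Mean of AP@K over all queries."""
    scores = [average_precision_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return sum(scores) / len(scores)


# Relevant videos at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision_at_k(["a", "b", "c", "d"], {"a", "c"})
```

Other normalisations exist (for example dividing by the number of hits found), so absolute values are only comparable when the same definition is used throughout.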

[101] Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting

Nan Wang,Yuantao Chen,Lixing Xiao,Weiqing Xiao,Bohan Li,Zhaoxi Chen,Chongjie Ye,Shaocong Xu,Saining Zhang,Ziyang Yan,Pierre Merriaux,Lei Lei,Tianfan Xue,Hao Zhao

Main category: cs.CV

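TL;DR: Proposes a multi-scale bilateral grid that unifies appearance codes and bilateral grids, significantly improving geometric accuracy in dynamic autonomous driving scene reconstruction.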

Details Motivation: Perfect photometric consistency is hard to guarantee in real-world scenes, and the existing remedies (appearance codes and bilateral grids) each have limitations, so a more effective solution is needed. Method: Proposes a multi-scale bilateral grid that unifies appearance codes and bilateral grids and improves pixel-wise color mapping. Result: Performs strongly on four datasets including Waymo, with significantly better geometric accuracy and fewer artifacts caused by photometric inconsistency. Conclusion: The multi-scale bilateral grid performs well for autonomous driving scene reconstruction, where improved geometric accuracy is important for obstacle avoidance and control. Abstract: Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.

[102] Rectified Point Flow: Generic Point Cloud Pose Estimation

Tao Sun,Liyuan Zhu,Shengyu Huang,Shuran Song,Iro Armeni

Main category: cs.CV

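TL;DR: Rectified Point Flow is a unified parameterization that treats point cloud registration and multi-part shape assembly as a conditional generative problem; it learns a continuous point-wise velocity field to move points to their target positions, learns symmetries without symmetry labels, and achieves state-of-the-art performance on several benchmarks.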

Details Motivation: To address the reliance of conventional point cloud registration and shape assembly methods on ad-hoc symmetry handling and separate training, by proposing a unified framework that learns shared geometric priors. Method: Learns a point-wise velocity field that transports noisy points to their target positions and combines it with a self-supervised encoder focused on overlapping points, so that symmetries are learned automatically. Result: Achieves state-of-the-art performance on six benchmarks; the unified framework supports joint training across datasets, which improves accuracy. Conclusion: Rectified Point Flow offers an efficient, unified solution that learns symmetries automatically and improves point cloud registration and shape assembly. Abstract: We introduce Rectified Point Flow, a unified parameterization that formulates pairwise point cloud registration and multi-part shape assembly as a single conditional generative problem. Given unposed point clouds, our method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered. In contrast to prior work that regresses part-wise poses with ad-hoc symmetry handling, our method intrinsically learns assembly symmetries without symmetry labels. Together with a self-supervised encoder focused on overlapping points, our method achieves a new state-of-the-art performance on six benchmarks spanning pairwise registration and shape assembly. Notably, our unified formulation enables effective joint training on diverse datasets, facilitating the learning of shared geometric priors and consequently boosting accuracy. Project page: https://rectified-pointflow.github.io/.
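
The idea of transporting noisy points along a velocity field can be sketched for a single point. The abstract does not state the training objective, so this sketch assumes the standard rectified-flow construction: straight-line interpolation between a noise sample and the target, whose velocity is their difference, integrated with fixed-step Euler. The oracle velocity function stands in for the learned network and is hypothetical.

```python
from typing import Callable, List

Point = List[float]


def interpolate(x0: Point, x1: Point, t: float) -> Point:
    """Point on the straight path from noise x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Point, x1: Point) -> Point:
    """Regression target for the velocity field on a straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_transport(x0: Point,
                    velocity_fn: Callable[[Point, float], Point],
                    steps: int = 10) -> Point:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise, target = [0.0, 0.0, 0.0], [1.0, 2.0, 3.0]
# Oracle field standing in for a trained network (assumption).
oracle = lambda x, t: target_velocity(noise, target)
moved = euler_transport(noise, oracle, steps=10)
```

In the method described by the abstract, part poses are then recovered from the transported points; that pose-fitting step is not shown here.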

[103] Video World Models with Long-term Spatial Memory

Tong Wu,Shuai Yang,Ryan Po,Yinghao Xu,Ziwei Liu,Dahua Lin,Gordon Wetzstein

Main category: cs.CV

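TL;DR: Proposes a framework based on geometry-grounded spatial memory to improve the long-term consistency of video world models, addressing the scene forgetting that limited temporal context windows cause in existing models.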

Details Motivation: Because their temporal context windows are limited, existing video world models struggle to maintain scene consistency and tend to forget previously generated environments on revisits. Method: Introduces a geometry-grounded long-term spatial memory framework with mechanisms for storing and retrieving information, and trains and evaluates models with 3D memory mechanisms on custom datasets. Result: Experiments show that the method outperforms baselines in quality, consistency, and context length. Conclusion: The framework offers an effective path toward long-term consistent video world generation. Abstract: Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.

[104] RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion

Bardienus P. Duisterhof,Jan Oberst,Bowen Wen,Stan Birchfield,Deva Ramanan,Jeffrey Ichnowski

Main category: cs.CV

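TL;DR: RaySt3R recasts 3D shape completion as a novel view synthesis problem, predicting depth maps, object masks, and confidence scores from a single RGB-D image and query rays to achieve efficient and consistent 3D reconstruction.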

Details Motivation: Existing 3D shape completion methods lack consistency, are computationally expensive, and struggle to capture sharp boundaries; RaySt3R aims to address these issues. Method: Given a single RGB-D image and query rays, a feedforward transformer is trained to predict depth maps, object masks, and confidence scores, and complete 3D shapes are reconstructed by fusing predictions across multiple views. Result: Strong performance on synthetic and real-world datasets, improving on baselines by up to 44% in 3D Chamfer distance. Conclusion: By treating completion as novel view synthesis, RaySt3R substantially improves the performance and efficiency of 3D shape completion. Abstract: 3D shape completion has broad applications in robotics, digital twin reconstruction, and extended reality (XR). Although recent advances in 3D object and scene completion have achieved impressive results, existing methods lack 3D consistency, are computationally expensive, and struggle to capture sharp object boundaries. Our work (RaySt3R) addresses these limitations by recasting 3D shape completion as a novel view synthesis problem. Specifically, given a single RGB-D image and a novel viewpoint (encoded as a collection of query rays), we train a feedforward transformer to predict depth maps, object masks, and per-pixel confidence scores for those query rays. RaySt3R fuses these predictions across multiple query views to reconstruct complete 3D shapes. We evaluate RaySt3R on synthetic and real-world datasets, and observe that it achieves state-of-the-art performance, outperforming the baselines on all datasets by up to 44% in 3D Chamfer distance. Project page: https://rayst3r.github.io
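
One way the multi-view fusion step could look is sketched below, assuming each query view yields per-ray tuples of (3D point, mask probability, confidence) and that a point is kept only when both scores pass a threshold. The tuple layout, the threshold values, and the function name are hypothetical; the abstract does not describe the actual fusion rule.

```python
def fuse_views(view_predictions, mask_threshold=0.5, confidence_threshold=0.5):
    """Merge per-view predictions into one point set, dropping low-scoring rays."""
    fused = []
    for prediction in view_predictions:
        for point, mask_prob, confidence in prediction:
            # Keep a ray only if it likely hits the object and the depth is trusted.
            if mask_prob >= mask_threshold and confidence >= confidence_threshold:
                fused.append(point)
    return fused


view_a = [((0.0, 0.0, 1.0), 0.9, 0.8), ((0.1, 0.0, 1.2), 0.2, 0.9)]
view_b = [((0.0, 0.1, 0.9), 0.95, 0.3), ((0.2, 0.2, 1.1), 0.7, 0.7)]
points = fuse_views([view_a, view_b])
```

Filtering before merging keeps background rays and uncertain depths out of the completed shape, which is the role the predicted masks and confidence scores appear to play.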

[105] Stable Vision Concept Transformers for Medical Diagnosis

Lijie Hu,Songning Lai,Yuan Hua,Shu Yang,Jingfeng Zhang,Di Wang

Main category: cs.CV

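TL;DR: The paper proposes the VCT and SVCT models to address transparency and stability in XAI for medicine, combining concept features with image features to improve performance, and validates them experimentally.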

Details Motivation: The demand for transparency in medicine has driven research on explainable AI (XAI), but existing Concept Bottleneck Models (CBMs) ignore image features and give unstable explanations, which limits their use. Method: Proposes the Vision Concept Transformer (VCT) and the Stable Vision Concept Transformer (SVCT), which combine a ViT backbone with a conceptual layer, fuse concept and image features, and improve stability through Denoised Diffusion Smoothing. Result: Experiments on four medical datasets show that VCT and SVCT remain interpretable while maintaining accuracy, and that SVCT still provides stable explanations under perturbations. Conclusion: VCT and SVCT address the transparency and stability problems of medical XAI and offer a workable option for practical use. Abstract: Transparency is a paramount concern in the medical field, prompting researchers to delve into the realm of explainable AI (XAI). Among these XAI methods, Concept Bottleneck Models (CBMs) aim to restrict the model's latent space to human-understandable high-level concepts by generating a conceptual layer for extracting conceptual features, which has drawn much attention recently. However, existing methods rely solely on concept features to determine the model's predictions, overlooking the intrinsic feature embeddings within medical images. To address this utility gap between the original models and concept-based models, we propose Vision Concept Transformer (VCT). Furthermore, despite their benefits, CBMs have been found to negatively impact model performance and fail to provide stable explanations when faced with input perturbations, which limits their application in the medical field. To address this faithfulness issue, this paper further proposes the Stable Vision Concept Transformer (SVCT) based on VCT, which leverages the vision transformer (ViT) as its backbone and incorporates a conceptual layer. SVCT employs conceptual features to enhance decision-making capabilities by fusing them with image features and ensures model faithfulness through the integration of Denoised Diffusion Smoothing. Comprehensive experiments on four medical datasets demonstrate that our VCT and SVCT maintain accuracy while remaining interpretable compared to baselines. Furthermore, even when subjected to perturbations, our SVCT model consistently provides faithful explanations, thus meeting the needs of the medical field.

[106] EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Yuqian Yuan,Ronghao Dang,Long Li,Wentong Li,Dian Jiao,Xin Li,Deli Zhao,Fan Wang,Wenqiao Zhang,Jun Xiao,Yueting Zhuang

Main category: cs.CV

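TL;DR: EOC-Bench is an innovative benchmark for evaluating object-centric cognition in dynamic egocentric scenarios, filling the gap left by existing benchmarks in assessing dynamic interactions.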

Details Motivation: Existing benchmarks focus mainly on static scenes and neglect the dynamic changes caused by user interactions, so a new evaluation tool is needed. Method: EOC-Bench contains 3,277 annotated QA pairs in three temporal categories (Past, Present, and Future), covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. It uses a mixed-format human-in-the-loop annotation framework and a multi-scale temporal accuracy metric. Result: Comprehensive evaluations of a range of MLLMs demonstrate the benchmark's usefulness in dynamic scenarios. Conclusion: EOC-Bench is an important tool for improving the object cognition capabilities of MLLMs and lays a foundation for developing reliable core models for embodied systems. Abstract: The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework with four types of questions and design a novel multi-scale temporal accuracy metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognitive capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.

[107] AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model

Pingyu Wu,Kai Zhu,Yu Liu,Longxiang Tang,Jian Yang,Yansong Peng,Wei Zhai,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

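TL;DR: Proposes AliTok, a new image tokenizer that uses a causal decoder to establish unidirectional dependencies, which improves autoregressive modeling and yields strong generation performance on the ImageNet-256 benchmark.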

Details Motivation: The bidirectional dependencies that existing image tokenizers encode during compression hinder effective modeling by autoregressive models. Method: Uses a causal decoder to establish unidirectional dependencies, and combines prefix tokens with two-stage training to improve reconstruction consistency. Result: On ImageNet-256, AliTok with 177M parameters achieves a gFID of 1.50 and an IS of 305.9; with 662M parameters it achieves a gFID of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling. Conclusion: AliTok performs well in both generation-friendliness and reconstruction, providing an efficient solution for autoregressive image generation. Abstract: Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during the compression process, which hinders effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which utilizes a causal decoder to establish unidirectional dependencies among encoded tokens, thereby aligning the token modeling approach between the tokenizer and autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves great reconstruction performance while being generation-friendly. On the ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
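
Unidirectional dependencies of the kind mentioned above are commonly enforced with a lower-triangular attention mask, in which token i may only attend to tokens 0..i. The sketch below illustrates that general idea only; whether AliTok's causal decoder uses exactly this masking is an assumption, and the helper names are hypothetical.

```python
def causal_mask(num_tokens):
    """mask[i][j] is True when token i is allowed to attend to token j (j <= i)."""
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


def visible_tokens(mask, i):
    """Indices of the tokens that token i can see under the mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


mask = causal_mask(4)
context_for_last = visible_tokens(mask, 3)
```

Under such a mask no token depends on a later one, which matches the left-to-right order in which an autoregressive generator predicts tokens.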

[108] SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Jianyi Wang,Shanchuan Lin,Zhijie Lin,Yuxi Ren,Meng Wei,Zongsheng Yue,Shangchen Zhou,Hao Chen,Yang Zhao,Ceyuan Yang,Xuefeng Xiao,Chen Change Loy,Lu Jiang

Main category: cs.CV

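TL;DR: SeedVR2 is a one-step diffusion-based video restoration model that uses adversarial training and an adaptive window attention mechanism to substantially reduce computational cost while maintaining high-quality output.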

Details Motivation: Existing diffusion-based video restoration methods are computationally expensive, and one-step restoration methods perform poorly on high-resolution video. Method: Proposes SeedVR2, which adopts an adaptive window attention mechanism and adversarial training, and refines the loss functions to improve training stability. Result: Experiments show that SeedVR2 matches or outperforms existing methods in a single step. Conclusion: SeedVR2 provides an efficient, high-quality solution for high-resolution video restoration. Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding the window inconsistency observed under high-resolution VR when using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss, without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.
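
To illustrate why a predefined window size can become inconsistent at new resolutions, the sketch below compares a fixed window with one derived from the feature-map size. It assumes a fixed number of windows per axis, which is a hypothetical rule chosen for illustration; the abstract does not state how SeedVR2 picks its window size.

```python
import math


def adaptive_window_size(height, width, windows_per_axis=4):
    """Pick a window size so each axis is covered by `windows_per_axis` windows."""
    return (math.ceil(height / windows_per_axis), math.ceil(width / windows_per_axis))


def partition_sizes(length, window):
    """Sizes of the consecutive windows needed to cover `length` positions."""
    full, rest = divmod(length, window)
    return [window] * full + ([rest] if rest else [])


fixed = partition_sizes(90, 64)                   # fixed 64-wide window on a 90-tall map
win_h, win_w = adaptive_window_size(90, 160)      # window derived from the resolution
adaptive = partition_sizes(90, win_h)
```

With the fixed window the last partition is much smaller than the first, while the resolution-derived window gives partitions of nearly equal size.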

[109] Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

Weifeng Lin,Xinyu Wei,Ruichuan An,Tianhe Ren,Tingwei Chen,Renrui Zhang,Ziyu Guo,Wentao Zhang,Lei Zhang,Hongsheng Li

Main category: cs.CV

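TL;DR: PAM is an efficient region-level visual understanding framework that combines SAM 2 with LLMs, supporting object segmentation together with diverse semantic outputs while running fast with low memory use.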

Details Motivation: To make region-level visual understanding more comprehensive and efficient by combining a segmentation model with language models to produce multimodal semantic outputs. Method: A Semantic Perceiver converts SAM 2's visual features into multi-modal tokens, and a data refinement and augmentation pipeline produces a high-quality dataset. Result: PAM performs strongly across multiple tasks, runs 1.2-2.4x faster, and uses less GPU memory. Conclusion: PAM provides an efficient and practical solution for region-level visual understanding and can serve as a baseline for future research. Abstract: We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region-specific semantic outputs, including categories, label definitions, functional explanations, and detailed captions. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of 1.5M image and 0.6M video region-semantic annotations, including novel region-level streaming video caption data. PAM is designed to be lightweight and efficient, while also demonstrating strong performance across a diverse range of region understanding tasks. It runs 1.2-2.4x faster and consumes less GPU memory than prior approaches, offering a practical solution for real-world applications. We believe that our effective approach will serve as a strong baseline for future research in region-level visual understanding.

[110] Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

Olaf Dünkel,Thomas Wimmer,Christian Theobalt,Christian Rupprecht,Adam Kortylewski

Main category: cs.CV

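TL;DR: The paper proposes improving semantic correspondence estimation with 3D-aware pseudo-labels, which substantially improves performance and reduces annotation requirements.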

Details Motivation: To resolve the ambiguities that symmetric objects or repeated parts cause in semantic matching. Method: Trains an adapter to refine off-the-shelf features, using pseudo-labels from 3D-aware chaining, relaxed cyclic consistency, and 3D spherical prototype mapping constraints. Result: Achieves an absolute gain of over 4% on SPair-71k, and over 7% against methods with similar supervision requirements. Conclusion: The approach is general, extends easily to other data sources, and substantially reduces the need for dataset-specific annotations. Abstract: Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features using pseudo-labels obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency, and 3D spherical prototype mapping constraints. While reducing the need for dataset-specific annotations compared to prior work, we set a new state-of-the-art on SPair-71k by over 4% absolute gain and by over 7% against methods with similar supervision requirements. The generality of our proposed approach simplifies extension of training to other data sources, which we demonstrate in our experiments.

[111] MARBLE: Material Recomposition and Blending in CLIP-Space

Ta-Ying Cheng,Prafull Sharma,Mark Boss,Varun Jampani

Main category: cs.CV

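TL;DR: MARBLE is a method based on material embeddings in CLIP-space that controls pre-trained text-to-image models to blend materials and recompose fine-grained material properties.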

Details Motivation: To improve exemplar-based material editing through more precise control and blending of material properties, increasing the flexibility and quality of image editing. Method: Uses material embeddings in CLIP-space, locates the block in the denoising UNet responsible for material attribution, and uses a shallow network to predict the direction of a material attribute change. Result: Qualitative and quantitative analysis shows that MARBLE blends materials and controls fine-grained attributes effectively, and supports multiple edits in a single forward pass as well as application to paintings. Conclusion: MARBLE provides an efficient and flexible approach to material editing and broadens the range of image editing applications. Abstract: Editing materials of objects in images based on exemplar images is an active area of research in computer vision and graphics. We propose MARBLE, a method for performing material blending and recomposing fine-grained material properties by finding material embeddings in CLIP-space and using them to control pre-trained text-to-image models. We improve exemplar-based material editing by finding a block in the denoising UNet responsible for material attribution. Given two material exemplar images, we find directions in the CLIP-space for blending the materials. Further, we can achieve parametric control over fine-grained material attributes such as roughness, metallic, transparency, and glow using a shallow network to predict the direction for the desired material attribute change. We perform qualitative and quantitative analysis to demonstrate the efficacy of our proposed method. We also present the ability of our method to perform multiple edits in a single forward pass and its applicability to painting. Project Page: https://marblecontrol.github.io/

[112] ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation

Daniel Rho,Jun Myeong Choi,Biswadip Dey,Roni Sengupta

Main category: cs.CV

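TL;DR: ProJo4D proposes a progressive joint optimization framework for estimating physical parameters from sparse multi-view videos, addressing the error accumulation that existing methods suffer under sparse input.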

Details Motivation: Existing methods perform poorly on sparse multi-view video input and accumulate errors, limiting their practical value for applications such as physically accurate digital twins. Method: ProJo4D uses a progressive joint optimization strategy that gradually enlarges the set of jointly optimized parameters, ending in fully joint optimization over geometry, appearance, physical state, and material properties. Result: On the PAC-NeRF and Spring-Gaus datasets, ProJo4D outperforms existing methods in 4D future state prediction, novel view rendering of future states, and material parameter estimation. Conclusion: ProJo4D offers an effective solution for physically grounded 4D scene understanding under sparse input. Abstract: Neural rendering has made significant strides in 3D reconstruction and novel view synthesis. Integrated with physics, it opens up new applications. The inverse problem of estimating physics from visual data, however, remains challenging, limiting its effectiveness for applications like physically accurate digital twin creation in robotics and XR. Existing methods that incorporate physics into neural rendering frameworks typically require dense multi-view videos as input, making them impractical for scalable, real-world use. When presented with sparse multi-view videos, the sequential optimization strategy used by existing approaches introduces significant error accumulation, e.g., poor initial 3D reconstruction leads to bad material parameter estimation in subsequent stages. Instead of sequential optimization, directly optimizing all parameters at the same time also fails due to the highly non-convex and often non-differentiable nature of the problem. We propose ProJo4D, a progressive joint optimization framework that gradually increases the set of jointly optimized parameters guided by their sensitivity, leading to fully joint optimization over geometry, appearance, physical state, and material property. Evaluations on PAC-NeRF and Spring-Gaus datasets show that ProJo4D outperforms prior work in 4D future state prediction, novel view rendering of future state, and material parameter estimation, demonstrating its effectiveness in physically grounded 4D scene understanding. For demos, please visit the project webpage: https://daniel03c1.github.io/ProJo4D/
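
The progressive schedule can be pictured as a growing set of parameter groups, as in the sketch below. It assumes groups are unlocked one per stage in order of decreasing sensitivity; both that ordering and the numeric sensitivity values are hypothetical placeholders, since the abstract only says the schedule is guided by sensitivity.

```python
def progressive_schedule(sensitivities):
    """Return, for each stage, the list of parameter groups optimized jointly."""
    # Assumption: the most sensitive group is optimized first.
    ordered = sorted(sensitivities, key=sensitivities.get, reverse=True)
    return [ordered[: stage + 1] for stage in range(len(ordered))]


# Hypothetical sensitivity scores, chosen only to make the example concrete.
example_sensitivities = {
    "geometry": 0.9,
    "appearance": 0.6,
    "physical_state": 0.4,
    "material": 0.2,
}
stages = progressive_schedule(example_sensitivities)
```

The first stage optimizes a single group and the last stage optimizes all four together, which sits between purely sequential optimization and optimizing everything at once from the start.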

[113] Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs

Haoyuan Li,Yanpeng Zhou,Yufei Gao,Tao Tang,Jianhua Han,Yujie Yuan,Dave Zhenyu Chen,Jiawang Bian,Hang Xu,Xiaodan Liang

Main category: cs.CV

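TL;DR: The paper analyzes the performance gap of 3D vision-language models (VLMs), finds that 3D scene-centric models rely only weakly on their 3D encoder, and proposes a new dataset to encourage genuine 3D scene understanding.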

Details Motivation: The success of 2D vision-language models has spurred interest in extending them to 3D scenes, but 3D scene-centric models underperform, which calls for an analysis of the causes and proposals for improvement. Method: Categorizes and compares the encoder designs of 3D VLMs (3D object-centric, 2D image-based, and 3D scene-centric), analyzes the causes of the performance gap, and introduces a new dataset. Result: 3D scene-centric models rely too little on the 3D encoder, benefit less from pre-training, and show limited gains from data scaling. Conclusion: Better evaluation strategies and model designs are needed to improve the 3D scene understanding of 3D VLMs. Abstract: Remarkable progress in 2D Vision-Language Models (VLMs) has spurred interest in extending them to 3D settings for tasks like 3D Question Answering, Dense Captioning, and Visual Grounding. Unlike 2D VLMs that typically process images through an image encoder, 3D scenes, with their intricate spatial structures, allow for diverse model architectures. Based on their encoder design, this paper categorizes recent 3D VLMs into 3D object-centric, 2D image-based, and 3D scene-centric approaches. Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited lower performance than the latest 3D object-centric and 2D image-based approaches. To understand this gap, we conduct an in-depth analysis, revealing that 3D scene-centric VLMs show limited reliance on the 3D scene encoder, and the pre-train stage appears less effective than in 2D VLMs. Furthermore, we observe that data scaling benefits are less pronounced on larger datasets. Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions, thereby diminishing the effective utilization of the 3D encoder. To address these limitations and encourage genuine 3D scene understanding, we introduce a novel 3D Relevance Discrimination QA dataset designed to disrupt shortcut learning and improve 3D understanding. Our findings highlight the need for advanced evaluation and improved strategies for better 3D understanding in 3D VLMs.

[114] Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting

Duochao Shi,Weijie Wang,Donny Y. Chen,Zeyu Zhang,Jia-Wang Bian,Bohan Zhuang,Chunhua Shen

Main category: cs.CV

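TL;DR: The paper proposes PM-Loss, a pointmap-based regularization loss that improves geometric smoothness at object boundaries in depth maps and thereby the rendering quality of 3D Gaussian Splatting.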

Details Motivation: Depth maps are commonly used in 3D Gaussian Splatting to produce 3D point clouds, but depth discontinuities at object boundaries lead to sparse or fragmented point clouds and degrade rendering quality. Method: Introduces PM-Loss, a regularization loss built on a pointmap predicted by a pre-trained transformer, to enforce geometric smoothness. Result: The improved depth maps significantly improve 3D Gaussian Splatting rendering across a range of architectures and scenes. Conclusion: PM-Loss effectively addresses the limitations of depth maps at object boundaries and improves the stability and quality of 3D rendering. Abstract: Depth maps are widely used in feed-forward 3D Gaussian Splatting (3DGS) pipelines by unprojecting them into 3D point clouds for novel view synthesis. This approach offers advantages such as efficient training, the use of known camera poses, and accurate geometry estimation. However, depth discontinuities at object boundaries often lead to fragmented or sparse point clouds, degrading rendering quality -- a well-known limitation of depth-based representations. To tackle this issue, we introduce PM-Loss, a novel regularization loss based on a pointmap predicted by a pre-trained transformer. Although the pointmap itself may be less accurate than the depth map, it effectively enforces geometric smoothness, especially around object boundaries. With the improved depth map, our method significantly improves the feed-forward 3DGS across various architectures and scenes, delivering consistently better rendering results. Our project page: https://aim-uofa.github.io/PMLoss

[115] AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Lidong Lu,Guo Chen,Zhiqi Li,Yicheng Liu,Tong Lu

Main category: cs.CV

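TL;DR: The paper introduces the CG-AV-Counting benchmark and the AV-Reasoner model to address counting in video understanding, achieving state-of-the-art results on multiple benchmarks.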

Details Motivation: Existing video understanding models perform poorly on counting tasks, and existing benchmarks suffer from short videos, closed-set queries, missing clue annotations, and weak multimodal coverage. Method: Introduces the CG-AV-Counting benchmark with 1,027 multimodal questions and 5,845 annotated clues, and designs the AV-Reasoner model, which uses GRPO and curriculum learning to improve counting ability. Result: AV-Reasoner achieves state-of-the-art performance on multiple benchmarks, but reasoning in the language space brings no gains on out-of-domain benchmarks. Conclusion: CG-AV-Counting provides a comprehensive testbed for counting tasks and AV-Reasoner shows the effectiveness of reinforcement learning, though out-of-domain performance needs further improvement. Abstract: Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.

[116] MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Xinyan Chen,Renrui Zhang,Dongzhi Jiang,Aojun Zhou,Shilin Yan,Weifeng Lin,Hongsheng Li

Main category: cs.CV

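TL;DR: MINT-CoT proposes dynamically interleaving visual tokens into textual reasoning steps, addressing the visual perception and region selection problems in multimodal mathematical reasoning.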

Details Motivation: Existing methods for multimodal mathematical reasoning are limited by coarse visual region selection, vision encoders with limited perception of math content, and reliance on external capabilities for visual modification. Method: MINT-CoT uses an Interleave Token to dynamically select visual regions of any shape within math figures, builds a dataset of 54K math problems, and adopts a three-stage training strategy (text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL). Result: MINT-CoT-7B outperforms the baseline model by 34.08% on MathVista, 28.78% on GeoQA, and 23.2% on MMStar. Conclusion: MINT-CoT performs strongly in multimodal mathematical reasoning and substantially improves visually interleaved reasoning. Abstract: Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT

[117] Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Jingyang Lin,Jialian Wu,Ximeng Sun,Ze Wang,Jiang Liu,Yusheng Su,Xiaodong Yu,Hao Chen,Jiebo Luo,Zicheng Liu,Emad Barsoum

Main category: cs.CV

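TL;DR: The paper introduces the VideoMarathon dataset and the Hour-LLaVA model, which fill the gap in long-video training data and perform strongly on multiple long video-language benchmarks.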

Details Motivation: Well-annotated long videos are scarce, which holds back the development of long-video large multimodal models (Video-LMMs). Method: Presents the VideoMarathon dataset (9,700 hours of long video and 3.3M QA pairs) and the Hour-LLaVA model (which supports 1-FPS sampling through a memory augmentation module). Result: Hour-LLaVA achieves the best performance on multiple long video-language benchmarks. Conclusion: VideoMarathon and Hour-LLaVA provide a high-quality resource and an effective method for long video-language understanding. Abstract: Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LLMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporal-informative semantics from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
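
To give a sense of scale for 1-FPS sampling, the sketch below computes which frames would be kept from a video of a given duration. The function name and the 30 FPS native frame rate are assumptions made for illustration; the abstract states only the 1-FPS sampling rate and the 3 to 60 minute range.

```python
def sample_indices(duration_s, native_fps, target_fps=1.0):
    """Frame indices kept when a video is subsampled to `target_fps`."""
    step = native_fps / target_fps          # native frames between two kept frames
    count = int(duration_s * target_fps)    # number of frames kept in total
    return [int(round(i * step)) for i in range(count)]


short_clip = sample_indices(3, 30)               # a 3-second clip at an assumed 30 FPS
hour_long = sample_indices(60 * 60, 30)          # a 60-minute video
frames_in_hour = len(hour_long)
```

A one-hour video therefore contributes thousands of frames even at 1 FPS, which is the context length the memory augmentation module has to work with.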

[118] VideoMolmo: Spatio-Temporal Grounding Meets Pointing

Ghazi Shazan Ahmad,Ahmed Heakl,Hanan Gani,Abdelrahman Shaker,Zhiqiang Shen,Ranjay Krishna,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

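TL;DR: VideoMolmo is a large multimodal model for fine-grained spatio-temporal pointing conditioned on text descriptions; it combines a temporal module with mask fusion to substantially improve spatio-temporal consistency and reasoning.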

Details Motivation: Current video-based spatio-temporal localization methods lack the reasoning capabilities of large language models, which limits their contextual understanding and generalization. Method: VideoMolmo builds on the Molmo architecture, adding a temporal module and a mask fusion pipeline with bidirectional point propagation, and works in two steps to produce precise coordinates and then coherent segmentation. Result: Across multiple real-world scenarios and tasks, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Conclusion: By combining a language model with a temporal module, VideoMolmo provides an efficient and interpretable solution for spatio-temporal localization. Abstract: Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Our code and models are publicly available at https://github.com/mbzuai-oryx/VideoMolmo.

[119] Defurnishing with X-Ray Vision: Joint Removal of Furniture from Panoramas and Mesh

Alan Dolhasz,Chen Ma,Dave Gausebeck,Kevin Chen,Gregor Miller,Lucas Hayne,Gunnar Hovden,Azwad Sabik,Olaf Brandt,Mira Slavcheva

Main category: cs.CV

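TL;DR: Proposes a pipeline for generating defurnished indoor scenes from textured meshes and multi-view panoramic images, using a simplified defurnished mesh (SDM) and ControlNet inpainting to produce high-quality defurnished scene assets.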

Details Motivation: Existing methods (such as neural radiance fields or RGB-D inpainting) produce blurry, low-resolution, or hallucinated results when generating defurnished scenes, so a higher-quality approach is needed. Method: First segments and removes furniture from the mesh to obtain an SDM that serves as an "X-ray" of the scene structure; then extracts Canny edges from the SDM and uses them to guide ControlNet inpainting of the panoramas; finally re-textures the mesh with the inpainted images. Result: The resulting defurnished scene assets are of higher quality than those from methods relying on neural radiance fields or RGB-D inpainting, avoiding blur and hallucinations. Conclusion: The proposed pipeline efficiently produces high-quality defurnished scene assets and outperforms existing techniques. Abstract: We present a pipeline for generating defurnished replicas of indoor spaces represented as textured meshes and corresponding multi-view panoramic images. To achieve this, we first segment and remove furniture from the mesh representation, extend planes, and fill holes, obtaining a simplified defurnished mesh (SDM). This SDM acts as an "X-ray" of the scene's underlying structure, guiding the defurnishing process. We extract Canny edges from depth and normal images rendered from the SDM. We then use these as a guide to remove the furniture from panorama images via ControlNet inpainting. This control signal ensures the availability of global geometric information that may be hidden from a particular panoramic view by the furniture being removed. The inpainted panoramas are used to texture the mesh. We show that our approach produces higher quality assets than methods that rely on neural radiance fields, which tend to produce blurry low-resolution images, or RGB-D inpainting, which is highly susceptible to hallucinations.

[120] Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning

Xingjian Ran,Yixuan Li,Linning Xu,Mulin Yu,Bo Dai

Main category: cs.CV

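TL;DR: DirectLayout is a framework based on large language models that generates 3D indoor scene layouts directly from text descriptions, addressing the shortcomings of existing methods in open-vocabulary generation and alignment with fine-grained user instructions.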

Details Motivation: 3D indoor scene synthesis is vital for embodied AI and digital content creation, but existing layout generation methods are held back by limited datasets and a lack of flexibility. Method: DirectLayout generates layouts in three stages (BEV layout, 3D lifting, and object refinement), combining Chain-of-Thought Activation with a reward mechanism and using an LLM for spatial reasoning. Result: Experiments show that DirectLayout performs well in semantic consistency, generalization, and physical plausibility. Conclusion: DirectLayout provides an effective solution for open-vocabulary 3D scene synthesis with fine-grained control. Abstract: Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layouts, which sacrifices flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using the generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization, and physical plausibility.

[121] Refer to Anything with Vision-Language Prompts

Shengcao Cao,Zijun Wei,Jason Kuen,Kangning Liu,Lingzhi Zhang,Jiuxiang Gu,HyunJoon Jung,Liang-Yan Gui,Yu-Xiong Wang

Main category: cs.CV

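TL;DR: The paper proposes a new task, Omnimodal Referring Expression Segmentation (ORES), in which a group of masks is produced from arbitrary prompts given as text or as text plus visual entities, and presents the RAS framework to strengthen the multimodal interaction capabilities of segmentation models.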

Details Motivation: Existing image segmentation models cannot provide comprehensive semantic understanding for complex queries that combine language and vision, which limits their effectiveness in user-friendly interactive applications. Method: Proposes the RAS framework, which augments segmentation models with multimodal interaction and comprehension through a mask-centric large multimodal model, and creates the MaskGroups-2M and MaskGroups-HQ datasets for training and evaluation. Result: RAS shows superior performance on the ORES task as well as on the classic RES and GRES tasks. Conclusion: RAS handles segmentation under multimodal prompts and offers an effective solution for complex interactive applications. Abstract: Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: https://Ref2Any.github.io.

[122] ContentV: Efficient Training of Video Generation Models with Limited Compute

Wenfeng Lin,Renjie Chen,Boyuan Liu,Shiyue Yan,Ruoyu Feng,Jiangchuan Wei,Yichen Zhang,Yimeng Zhou,Chao Feng,Jiao Ran,Qi Wu,Zuotao Liu,Mingyu Guo

Main category: cs.CV

TL;DR: ContentV is an 8B-parameter text-to-video model that achieves efficient training and high-quality video generation through three key innovations.

Details Motivation: As video generation advances, computational costs are rising sharply, which calls for more efficient training methods. Method: (1) A minimalist architecture that reuses pre-trained image generation models; (2) a multi-stage training strategy that uses flow matching to improve efficiency; (3) a low-cost reinforcement learning framework that needs no additional human annotations. Result: Achieves state-of-the-art performance of 85.14 on VBench and supports video generation at multiple resolutions and durations. Conclusion: ContentV demonstrates the potential of efficient training for high-performance video generation; the code and models are open source. Abstract: Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io.

[123] SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

Jiahui Wang,Zuyan Liu,Yongming Rao,Jiwen Lu

Main category: cs.CV

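TL;DR: The study finds that only a small fraction (about 5%) of attention heads in MLLMs contribute to visual understanding; these are termed visual heads. A training-free framework identifies these heads, and the proposed SparseMM optimization strategy significantly improves inference efficiency.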

Details Motivation: To explore how MLLMs process visual inputs and to expose the sparsity of their attention mechanisms, so that computation can be allocated more efficiently. Method: A training-free framework quantifies the visual relevance of each head; the SparseMM strategy then allocates computation budgets based on these visual scores. Result: On mainstream multimodal benchmarks, SparseMM achieves 1.38x real-time acceleration and a 52% memory reduction while maintaining performance. Conclusion: The sparsity of visual heads opens a new direction for optimizing MLLMs, and SparseMM performs well in both efficiency and accuracy. Abstract: Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than approximately 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads for accelerating the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual information, SparseMM prioritizes stressing and retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity on the efficiency test. Our project is open sourced at https://github.com/CR400AF-A/SparseMM.
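As a rough illustration of what "asymmetric computation budgets based on visual scores" could look like, the sketch below splits a total KV-cache budget across heads in proportion to their scores, on top of a fixed per-head floor. The function name, the floor, and the proportional rule are all assumptions made for illustration; the abstract does not specify the allocation rule SparseMM actually uses.

```python
import numpy as np

def allocate_kv_budgets(visual_scores, total_budget, min_per_head):
    """Hypothetical helper: split a KV-cache budget across attention heads.

    Every head gets `min_per_head` entries; the remainder is shared in
    proportion to each head's visual score (an assumed rule, not the paper's).
    """
    scores = np.asarray(visual_scores, dtype=float)
    n_heads = scores.size
    if total_budget < n_heads * min_per_head:
        raise ValueError("total budget cannot cover the per-head minimum")
    remaining = total_budget - n_heads * min_per_head
    if scores.sum() > 0:
        weights = scores / scores.sum()
    else:
        weights = np.full(n_heads, 1.0 / n_heads)
    extra = np.floor(remaining * weights).astype(int)
    # Entries lost to flooring go to the highest-scoring heads first.
    leftover = int(remaining - extra.sum())
    order = np.argsort(-scores)
    extra[order[:leftover]] += 1
    return extra + min_per_head

budgets = allocate_kv_budgets([4, 2, 1, 1], total_budget=1000, min_per_head=50)
print(budgets)  # [450 250 150 150]
```

With a handful of high-scoring heads, most of the budget concentrates on them while the remaining heads keep only the floor, which is the qualitative behavior the abstract describes.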

[124] Neural Inverse Rendering from Propagating Light

Anagh Malik,Benjamin Attal,Andrew Xie,Matthew O'Toole,David B. Lindell

Main category: cs.CV

TL;DR: Presents the first physically based neural inverse rendering system for multi-viewpoint videos of propagating light.

Details Motivation: To solve inverse rendering from multi-viewpoint videos of propagating light, in particular the handling of indirect light. Method: A time-resolved extension of neural radiance caching that stores infinite-bounce radiance. Result: Achieves high-accuracy 3D reconstruction and supports view synthesis of propagating light, decomposition into direct and indirect components, and multi-view time-resolved relighting. Conclusion: The method performs well in scenes with complex light transport and provides a new tool for analyzing light propagation. Abstract: We present the first system for physically based, neural inverse rendering from multi-viewpoint videos of propagating light. Our approach relies on a time-resolved extension of neural radiance caching -- a technique that accelerates inverse rendering by storing infinite-bounce radiance arriving at any point from any direction. The resulting model accurately accounts for direct and indirect light transport effects and, when applied to captured measurements from a flash lidar system, enables state-of-the-art 3D reconstruction in the presence of strong indirect light. Further, we demonstrate view synthesis of propagating light, automatic decomposition of captured measurements into direct and indirect components, as well as novel capabilities such as multi-view time-resolved relighting of captured scenes.

[125] FreeTimeGS: Free Gaussians at Anytime and Anywhere for Dynamic Scene Reconstruction

Yifan Wang,Peishan Yang,Zhen Xu,Jiaming Sun,Zhanhua Zhang,Yong Chen,Hujun Bao,Sida Peng,Xiaowei Zhou

Main category: cs.CV

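TL;DR: Proposes FreeTimeGS, a 4D representation for dynamic 3D scene reconstruction that addresses the difficulty of optimizing deformation fields in scenes with complex motion.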

Details Motivation: Existing methods perform poorly on scenes with complex motion because deformation fields are hard to optimize. Method: A 4D representation that lets Gaussian primitives appear at arbitrary times and locations, with a motion function attached to each primitive to reduce temporal redundancy. Result: Experiments show that the method clearly outperforms existing methods in rendering quality. Conclusion: Through its flexible 4D representation and motion functions, FreeTimeGS significantly improves the modeling of dynamic 3D scenes. Abstract: This paper addresses the challenge of reconstructing dynamic 3D scenes with complex motions. Some recent works define 3D Gaussian primitives in the canonical space and use deformation fields to map canonical primitives to observation spaces, achieving real-time dynamic view synthesis. However, these methods often struggle to handle scenes with complex motions due to the difficulty of optimizing deformation fields. To overcome this problem, we propose FreeTimeGS, a novel 4D representation that allows Gaussian primitives to appear at arbitrary times and locations. In contrast to canonical Gaussian primitives, our representation possesses strong flexibility, thus improving the ability to model dynamic 3D scenes. In addition, we endow each Gaussian primitive with a motion function, allowing it to move to neighboring regions over time, which reduces the temporal redundancy. Experimental results on several datasets show that the rendering quality of our method outperforms recent methods by a large margin.
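To make the idea of a primitive that lives at its own time and moves to neighboring regions more concrete, here is a minimal sketch. It assumes a linear motion function and a Gaussian-shaped temporal window on opacity; both choices, and every parameter name, are illustrative assumptions, since the abstract does not state the functional forms.

```python
import numpy as np

def gaussian_state_at(t, center_xyz, center_time, velocity, duration, base_opacity):
    """Hypothetical per-primitive query: where the primitive is at time t
    and how visible it is.

    Assumed forms (not taken from the abstract):
      position(t) = center_xyz + velocity * (t - center_time)
      opacity(t)  = base_opacity * exp(-0.5 * ((t - center_time) / duration)^2)
    """
    center_xyz = np.asarray(center_xyz, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    dt = t - center_time
    position = center_xyz + velocity * dt
    temporal_weight = np.exp(-0.5 * (dt / duration) ** 2)
    return position, base_opacity * temporal_weight

pos, alpha = gaussian_state_at(
    t=1.5, center_xyz=[0.0, 0.0, 0.0], center_time=1.0,
    velocity=[2.0, 0.0, 0.0], duration=0.5, base_opacity=0.8,
)
```

Under these assumptions each primitive only has to explain a short time window around its own `center_time`, which is one plausible reading of how a motion function reduces temporal redundancy.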

[126] VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Hanoona Rasheed,Abdelrahman Shaker,Anqi Tang,Muhammad Maaz,Ming-Hsuan Yang,Salman Khan,Fahad Khan

Main category: cs.CV

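TL;DR: VideoMathQA is a benchmark for evaluating models' cross-modal mathematical reasoning in videos, covering 10 mathematical domains with multimodal content and demanding reasoning challenges.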

Details Motivation: Mathematical reasoning in real-world videos differs from reasoning over static images or text: it requires integrating visual, audio, and textual information, and existing methods perform poorly on such tasks. Method: Introduces the VideoMathQA benchmark, with diverse videos and question types annotated by experts and built around three core reasoning challenges: direct problem solving, conceptual transfer, and deep instructional comprehension. Result: The benchmark exposes the limitations of existing approaches and provides a systematic evaluation framework. Conclusion: VideoMathQA offers a high-quality evaluation tool for cross-modal mathematical reasoning and advances research on model reasoning in complex settings. Abstract: Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities.
Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA

[127] Contrastive Flow Matching

George Stoica,Vivek Ramanujan,Xiang Fan,Ali Farhadi,Ranjay Krishna,Judy Hoffman

Main category: cs.CV

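TL;DR: The paper proposes Contrastive Flow Matching, which addresses the loss of flow uniqueness in conditional settings and significantly improves the training speed and performance of generative models.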

Details Motivation: In conditional settings (e.g., class-conditioned models), the uniqueness of flows is no longer guaranteed, which leads to ambiguous generations. Method: Adds a contrastive term to the flow matching objective that enforces uniqueness across flows from different conditions. Result: Experiments show that Contrastive Flow Matching improves training speed by up to 9x, requires up to 5x fewer de-noising steps, and lowers FID by up to 8.9. Conclusion: Contrastive Flow Matching significantly improves the efficiency and performance of conditional generative models. Abstract: Unconditional flow-matching trains diffusion models to transport samples from a source distribution to a target distribution by enforcing that the flows between sample pairs are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed--flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying model architectures on both class-conditioned (ImageNet-1k) and text-to-image (CC3M) benchmarks. Notably, we find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer de-noising steps and (3) lowers FID by up to 8.9 compared to training the same models with flow matching. We release our code at: https://github.com/gstoica27/DeltaFM.git.
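One way to read "a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs" is as a second squared-error term, with a negative sign, measured against the flow of a different pair in the batch. The sketch below follows that reading. The straight-line target `x1 - x0`, the weight `lam`, and the choice of negatives by rolling the batch are assumptions for illustration, not details confirmed by the abstract.

```python
import numpy as np

def contrastive_flow_matching_loss(pred_velocity, x0, x1, lam=0.05):
    """Hypothetical loss sketch on a batch of (source, target) pairs.

    pred_velocity: model output at the interpolated points, shape (B, D)
    x0, x1:        source and target samples, shape (B, D)
    lam:           assumed weight of the contrastive term
    """
    target = x1 - x0                        # flow of each pair (assumes a linear path)
    negative = np.roll(target, 1, axis=0)   # flow of another pair in the batch
    fm_term = np.mean(np.sum((pred_velocity - target) ** 2, axis=1))
    contrast_term = np.mean(np.sum((pred_velocity - negative) ** 2, axis=1))
    # Pull predictions toward their own flow, push them away from other flows.
    return fm_term - lam * contrast_term

x0 = np.zeros((2, 2))
x1 = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = contrastive_flow_matching_loss(pred_velocity=x1 - x0, x0=x0, x1=x1, lam=0.5)
print(loss)  # -1.0
```

Setting `lam=0` recovers the plain flow matching term, so the sketch makes it easy to see that the contrastive part is a pure add-on to the base objective.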

cs.GR

[128] SSIMBaD: Sigma Scaling with SSIM-Guided Balanced Diffusion for AnimeFace Colorization

Junpyo Seo,Hanbin Koo,Jieun Yook,Byung-Ro Moon

Main category: cs.GR

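TL;DR: Proposes a diffusion-based framework for automatic colorization of anime-style facial sketches, using SSIMBaD to achieve structural fidelity and style transfer.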

Details Motivation: Traditional methods rely on predefined noise schedules, which can compromise perceptual consistency, so a more balanced diffusion model is needed. Method: Builds on continuous-time diffusion models and introduces SSIMBaD, which applies a sigma-space transformation to align perceptual degradation. Result: Outperforms existing methods on a large-scale anime face dataset, improving both pixel accuracy and perceptual quality. Conclusion: The SSIMBaD framework shows strong performance and generalization on anime face colorization. Abstract: We propose a novel diffusion-based framework for automatic colorization of Anime-style facial sketches. Our method preserves the structural fidelity of the input sketch while effectively transferring stylistic attributes from a reference image. Unlike traditional approaches that rely on predefined noise schedules -- which often compromise perceptual consistency -- our framework builds on continuous-time diffusion models and introduces SSIMBaD (Sigma Scaling with SSIM-Guided Balanced Diffusion). SSIMBaD applies a sigma-space transformation that aligns perceptual degradation, as measured by structural similarity (SSIM), in a linear manner. This scaling ensures uniform visual difficulty across timesteps, enabling more balanced and faithful reconstructions. Experiments on a large-scale Anime face dataset demonstrate that our method outperforms state-of-the-art models in both pixel accuracy and perceptual quality, while generalizing to diverse styles. Code is available at github.com/Giventicket/SSIMBaD-Sigma-Scaling-with-SSIM-Guided-Balanced-Diffusion-for-AnimeFace-Colorization

[129] Handle-based Mesh Deformation Guided By Vision Language Model

Xingpeng Sun,Shiyang Jia,Zherong Pan,Kui Wu,Aniket Bera

Main category: cs.GR

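TL;DR: Proposes a training-free, handle-based mesh deformation method that uses a vision-language model (VLM), via prompt engineering, to select the deformable sub-parts and handles, and reduces uncertainty through multi-view voting.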

Details Motivation: Existing mesh deformation methods suffer from low output quality, need manual tuning, or depend on data-intensive training. Method: Cone singularity detection identifies candidate handles, a VLM selects the deformable parts and handles, and multi-view voting refines the result. Result: Across several benchmarks, the method aligns more closely with user intent while introducing low distortion. Conclusion: The method is training-free, highly automated, and consistently produces high-quality mesh deformations. Abstract: Mesh deformation is a fundamental tool in 3D content manipulation. Despite extensive prior research, existing approaches often suffer from low output quality, require significant manual tuning, or depend on data-intensive training. To address these limitations, we introduce a training-free, handle-based mesh deformation method. Our core idea is to leverage a Vision-Language Model (VLM) to interpret and manipulate a handle-based interface through prompt engineering. We begin by applying cone singularity detection to identify a sparse set of potential handles. The VLM is then prompted to select both the deformable sub-parts of the mesh and the handles that best align with user instructions. Subsequently, we query the desired deformed positions of the selected handles in screen space. To reduce uncertainty inherent in VLM predictions, we aggregate the results from multiple camera views using a novel multi-view voting scheme. Across a suite of benchmarks, our method produces deformations that align more closely with user intent, as measured by CLIP and GPTEval3D scores, while introducing low distortion -- quantified via membrane energy. In summary, our approach is training-free, highly automated, and consistently delivers high-quality mesh deformations.

[130] VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection

Wuyang Li,Zhu Yu,Alexandre Alahi

Main category: cs.GR

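TL;DR: VoxDet reformulates voxel-level semantic occupancy prediction as an instance-level detection task, addressing the neglect of instance discriminability in existing methods and achieving state-of-the-art performance and efficiency.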

Details Motivation: Existing methods treat 3D semantic occupancy prediction as a dense segmentation task and neglect instance-level discriminability, which leads to incomplete instances and ambiguity between adjacent ones. Method: Proposes the VoxDet framework, comprising the Voxel-to-Instance (VoxNT) trick and a task-decoupled dense predictor that achieves instance-aware prediction through offset regression and semantic prediction. Result: VoxDet performs strongly with both camera and LiDAR input and reaches 63.0 IoU on the SemanticKITTI test set, ranking first. Conclusion: VoxDet's instance-level formulation significantly improves 3D semantic occupancy prediction and suggests a new direction for future research. Abstract: 3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel. However, this paradigm neglects critical instance-centric discriminability, leading to instance-level incompleteness and adjacent ambiguities. To address this, we highlight a free lunch of occupancy labels: the voxel-level class label implicitly provides insight at the instance level, which is overlooked by the community. Motivated by this observation, we first introduce a training-free Voxel-to-Instance (VoxNT) trick: a simple yet effective method that freely converts voxel-level class labels into instance-level offset labels. Building on this, we further propose VoxDet, an instance-centric framework that reformulates the voxel-level occupancy prediction as dense object detection by decoupling it into two sub-tasks: offset regression and semantic prediction. Specifically, based on the lifted 3D volume, VoxDet first uses (a) Spatially-decoupled Voxel Encoder to generate disentangled feature volumes for the two sub-tasks, which learn task-specific spatial deformation in the densely projected tri-perceptive space. Then, we deploy (b) Task-decoupled Dense Predictor to address this task via dense detection. Here, we first regress a 4D offset field to estimate distances (6 directions) between voxels and object borders in the voxel space. The regressed offsets are then used to guide the instance-level aggregation in the classification branch, achieving instance-aware prediction.
Experiments show that VoxDet can be deployed on both camera and LiDAR input, jointly achieving state-of-the-art results on both benchmarks. VoxDet is not only highly efficient, but also achieves 63.0 IoU on the SemanticKITTI test set, ranking 1st on the online leaderboard.
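The step of converting voxel-level class labels into offset labels along 6 directions can be sketched as a run-length computation: for each voxel, count how many steps it can move along an axis before the class changes or the grid ends. This is only one plausible reading of the VoxNT trick. The function names are hypothetical, and treating an unbroken run of one class as a single instance is an assumption, not something the abstract states.

```python
import numpy as np

def distance_to_border(labels, axis, forward=True):
    """Steps each voxel can take along `axis` before its class label changes.

    A voxel on the last cell of a same-class run (or on the grid edge) gets 0.
    """
    lab = np.moveaxis(np.asarray(labels), axis, 0)
    if not forward:
        lab = lab[::-1]
    dist = np.zeros(lab.shape, dtype=int)
    # Sweep from the far end so each slice can reuse its neighbor's count.
    for i in range(lab.shape[0] - 2, -1, -1):
        same_class = lab[i] == lab[i + 1]
        dist[i] = np.where(same_class, dist[i + 1] + 1, 0)
    if not forward:
        dist = dist[::-1]
    return np.moveaxis(dist, 0, axis)

def voxel_offsets(labels):
    """Stack the 6 directional distances (+x, -x, +y, -y, +z, -z) per voxel."""
    channels = [
        distance_to_border(labels, axis, forward)
        for axis in range(3)
        for forward in (True, False)
    ]
    return np.stack(channels, axis=-1)

labels = np.array([1, 1, 2, 2, 2]).reshape(5, 1, 1)
offsets = voxel_offsets(labels)
print(offsets[:, 0, 0, 0])  # [1 0 2 1 0]
print(offsets[:, 0, 0, 1])  # [0 1 0 1 2]
```

The output has shape `(X, Y, Z, 6)`, one channel per direction, and needs no training, which matches the "training-free" description of the label conversion.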

[131] A Fast Unsupervised Scheme for Polygonal Approximation

Bimal Kumar Ray

Main category: cs.GR

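TL;DR: Proposes a fast, unsupervised method for polygonal approximation of closed digital curves that is faster than existing techniques and performs well on Rosin's measure and in aesthetic quality.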

Details Motivation: Existing methods fall short in speed and aesthetic quality, so a more efficient and visually pleasing approximation scheme is needed. Method: Three phases followed by a final step: initial segmentation, iterative vertex insertion, and iterative merging, then vertex adjustment. Initial segmentation detects high-curvature vertices, iterative insertion adds low-curvature vertices, merging removes redundant vertices, and adjustment improves the aesthetic look. Result: The method is fast, performs well on Rosin's measure, and is robust to geometric transformations. Conclusion: The method outperforms existing techniques in speed and aesthetics and is suited to polygonal approximation of closed digital curves. Abstract: This paper proposes a fast and unsupervised scheme for a polygonal approximation of a closed digital curve. It is demonstrated that the approximation scheme is faster than state-of-the-art approximation schemes and is competitive with them in Rosin's measure and in its aesthetic aspect. The scheme comprises three phases: initial segmentation, iterative vertex insertion, and iterative merging, followed by vertex adjustment. The initial segmentation is used to detect sharp turnings - the vertices that seemingly have high curvature. It is likely that some important vertices with low curvature might have been missed out at the first phase and so iterative vertex insertion is used to add vertices in a region where the curvature changes slowly but steadily. The initial phase may pick up some undesirable vertices and so merging is used to eliminate the redundant vertices. Finally, vertex adjustment is used to facilitate enhancement in the aesthetic look of the approximation. The quality of the approximations is measured using Rosin's measure. The robustness of the proposed scheme with respect to geometric transformation is observed.

[132] Midplane based 3D single pass unbiased segment-to-segment contact interaction using penalty method

Indrajeet Sahu,Nik Petrinic

Main category: cs.GR

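TL;DR: Proposes an unbiased contact interaction method that avoids designating master and slave surfaces, evaluates contact tractions in a single pass with respect to a midplane, maintains traction equilibrium, and applies to a variety of geometric configurations and dynamic problems.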

Details Motivation: Traditional contact methods require designating master and slave surfaces, which can introduce bias. This work aims to develop an unbiased method suited to complex geometries and dynamic contact problems. Method: Contact tractions are evaluated in a single pass with respect to a midplane and are based on penalizing true interpenetration; the possible 3D geometric configurations are examined in detail, and accuracy is validated on several contact scenarios. Result: The method is validated on the contact patch test, two-beam bending, Hertzian contact, and the flat punch test, and it extends to self-contact and dynamic collision problems. Conclusion: The method needs no master-slave designation, is accurate and robust, and applies to complex contact problems including self-contact and dynamic collisions. Abstract: This work introduces a contact interaction methodology for an unbiased treatment of contacting surfaces without assigning surfaces as master and slave. The contact tractions between interacting discrete segments are evaluated with respect to a midplane in a single pass, inherently maintaining the equilibrium of tractions. These tractions are based on the penalisation of true interpenetration between opposite surfaces, and the procedure of their integral for discrete contacting segments is described in this paper. A meticulous examination of the different possible geometric configurations of interacting 3D segments is presented to develop visual understanding and better traction evaluation accuracy. The accuracy and robustness of the proposed method are validated against the analytical solutions of the contact patch test, two-beam bending, Hertzian contact, and flat punch test, thus proving the capability to reproduce contact between flat surfaces, curved surfaces, and sharp corners in contact, respectively. The method passes the contact patch test with the uniform transmission of contact pressure matching the accuracy levels of finite elements. It converges towards the analytical solution with mesh refinement and a suitably high penalty factor. The effectiveness of the proposed algorithm also extends to self-contact problems and has been tested for self-contact between flat and curved surfaces with inelastic material. Dynamic problems of elastic and inelastic collisions between bars, as well as oblique collisions of cylinders, are also presented.
The ability of the algorithm to resolve contacts between flat and curved surfaces for nonconformal meshes with high accuracy demonstrates its versatility in general contact problems.

[133] Towards the target and not beyond: 2d vs 3d visual aids in mr-based neurosurgical simulation

Pasquale Cascarano,Andrea Loretti,Matteo Martinoni,Luca Zanuttini,Alessio Di Pasquale,Gustavo Marfia

Main category: cs.GR

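TL;DR: NeuroMix is a mixed reality (MR) simulator for training external ventricular drain (EVD) placement. The study finds that training with combined 2D and 3D visual aids significantly improves procedural precision in unaided conditions, at the cost of longer operation times.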

Details Motivation: In neurosurgery, mentally reconstructing complex 3D anatomy from 2D slices is challenging. MR technologies can improve surgical performance, but their limited clinical availability calls for training systems that foster skill retention in unaided conditions. Method: The study compares three training modalities: no visual aids, 2D aids only, and combined 2D and 3D aids. After training on digital objects, 48 participants perform freehand EVD placement without MR aids. Result: The group trained with combined 2D and 3D aids improves precision by 44% in unaided testing, substantially more than the other groups. All training modalities receive high usability and technology acceptance ratings, but the combined-aid group takes longer to operate. Conclusion: Training with combined 2D and 3D visual aids significantly improves procedural precision without affecting cognitive load, but the gain must be weighed against longer operation times. Abstract: Neurosurgery increasingly uses Mixed Reality (MR) technologies for intraoperative assistance. The greatest challenge in this area is mentally reconstructing complex 3D anatomical structures from 2D slices with millimetric precision, which is required in procedures like External Ventricular Drain (EVD) placement. MR technologies have shown great potential in improving surgical performance; however, their limited availability in clinical settings underscores the need for training systems that foster skill retention in unaided conditions. In this paper, we introduce NeuroMix, an MR-based simulator for EVD placement. We conduct a study with 48 participants to assess the impact of 2D and 3D visual aids on usability, cognitive load, technology acceptance, and procedure precision and execution time. Three training modalities are compared: one without visual aids, one with 2D aids only, and one combining both 2D and 3D aids. The training phase takes place entirely on digital objects, followed by a freehand EVD placement testing phase performed with a physical catheter and a physical phantom without MR aids. We then compare the participants' performance with that of a control group that does not undergo training. Our findings show that participants trained with both 2D and 3D aids achieve a 44% improvement in precision during unaided testing compared to the control group, substantially higher than the improvement observed in the other groups. All three training modalities receive high usability and technology acceptance ratings, with significant equivalence across groups.
The combination of 2D and 3D visual aids does not significantly increase cognitive workload, though it leads to longer operation times during freehand testing compared to the control group.

[134] Uniform Sampling of Surfaces by Casting Rays

Selena Ling,Abhishek Madan,Nicholas Sharp,Alec Jacobson

Main category: cs.GR

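TL;DR: Proposes a simple, general scheme for sampling points on surfaces based on the intersections of random rays with the surface; it suits implicit surfaces, requires no intermediate mesh extraction, is efficient, and guarantees uniformity.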

Details Motivation: Sampling points on surfaces is a basic operation in geometry processing, but it is difficult on representations other than explicit meshes, such as implicit surfaces, and existing methods are inefficient. Method: Points are sampled at the intersections of random rays with the surface; for implicit signed distance functions, sphere marching makes ray casting and point sampling efficient. Result: Experiments show that the method is more efficient than alternative strategies on a variety of representations and that it extends to blue noise and stratified sampling. Conclusion: The method provides an efficient, general sampling scheme for implicit surfaces, with applications such as deforming neural implicit surfaces and moment estimation. Abstract: Randomly sampling points on surfaces is an essential operation in geometry processing. This sampling is computationally straightforward on explicit meshes, but it is much more difficult on other shape representations, such as widely-used implicit surfaces. This work studies a simple and general scheme for sampling points on a surface, which is derived from a connection to the intersections of random rays with the surface. Concretely, given a subroutine to cast a ray against a surface and find all intersections, we can use that subroutine to uniformly sample white noise points on the surface. This approach is particularly effective in the context of implicit signed distance functions, where sphere marching allows us to efficiently cast rays and sample points, without needing to extract an intermediate mesh. We analyze the basic method to show that it guarantees uniformity, and find experimentally that it is significantly more efficient than alternative strategies on a variety of representations. Furthermore, we show extensions to blue noise sampling and stratified sampling, and applications to deforming neural implicit surfaces as well as moment estimation.
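The connection between random rays and uniform surface samples can be checked numerically on a shape whose face areas are known. The sketch below casts uniformly random lines (uniform direction, uniform offset in the perpendicular disk of a bounding sphere) at an axis-aligned box and keeps every intersection point. For brevity it swaps the sphere marching mentioned in the abstract for an analytic ray-box slab test, and the function name and box shape are illustrative assumptions.

```python
import numpy as np

def sample_box_surface(half_extents, n_lines, rng):
    """Sample points on the surface of a box centered at the origin by
    intersecting uniformly random lines with it (analytic slab test)."""
    h = np.asarray(half_extents, dtype=float)
    radius = np.linalg.norm(h)  # bounding sphere that contains the box

    # Uniformly random line directions on the unit sphere.
    d = rng.normal(size=(n_lines, 3))
    d /= np.linalg.norm(d, axis=1, keepdims=True)

    # Orthonormal basis (u, v) of the plane perpendicular to each direction.
    helper = np.where(np.abs(d[:, :1]) < 0.9, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
    u = np.cross(d, helper)
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v = np.cross(d, u)

    # Uniform offset inside the bounding disk of that plane.
    r = radius * np.sqrt(rng.random(n_lines))
    phi = 2.0 * np.pi * rng.random(n_lines)
    origin = (r * np.cos(phi))[:, None] * u + (r * np.sin(phi))[:, None] * v

    # Slab method: parameter interval where the line is inside the box.
    t_a = (-h - origin) / d
    t_b = (h - origin) / d
    t_near = np.minimum(t_a, t_b).max(axis=1)
    t_far = np.maximum(t_a, t_b).min(axis=1)
    hit = t_near < t_far

    entry = origin[hit] + t_near[hit, None] * d[hit]
    exit_ = origin[hit] + t_far[hit, None] * d[hit]
    return np.concatenate([entry, exit_])  # keep all intersections

rng = np.random.default_rng(0)
h = np.array([1.0, 0.5, 0.25])
points = sample_box_surface(h, 200_000, rng)
face_axis = np.argmax(np.abs(points) / h, axis=1)
fractions = np.bincount(face_axis, minlength=3) / len(points)
```

For this box the x, y, and z face pairs have areas 1, 2, and 4, so `fractions` should land near 1/7, 2/7, and 4/7 if the samples are uniform by area. Keeping every intersection of each line, not just the first hit, is what makes the density proportional to area.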

cs.CL

[135] GEM: Empowering LLM for both Embedding Generation and Language Understanding

Caojin Zhang,Qiang Zhang,Ke Li,Sai Vidyaranya Nuthalapati,Benyu Zhang,Jason Liu,Serena Li,Lizhu Zhang,Xiangjun Fan

Main category: cs.CL

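TL;DR: Proposes GEM, a self-supervised method that enables large decoder-only language models (LLMs) to generate high-quality text embeddings while retaining their original text generation and reasoning capabilities.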

Details Motivation: In current applications, retrieval augmented generation (RAG) relies on separate embedding models, which complicates the system and causes inconsistent understanding of the query. Method: Special tokens are inserted and the attention mask is manipulated to produce a summarization embedding of the text; the method integrates easily into the post-training or fine-tuning stage of existing LLMs. Result: Validated on LLMs from 1B to 8B parameters, the method significantly improves MTEB performance with minimal impact on MMLU. Conclusion: GEM gives LLMs state-of-the-art text embedding capabilities while preserving their original NLP performance. Abstract: Large decoder-only language models (LLMs) have achieved remarkable success in generation and reasoning tasks, where they generate text responses given instructions. However, many applications, e.g., retrieval augmented generation (RAG), still rely on separate embedding models to generate text embeddings, which can complicate the system and introduce discrepancies in understanding of the query between the embedding model and LLMs. To address this limitation, we propose a simple self-supervised approach, Generative Embedding large language Model (GEM), that enables any large decoder-only LLM to generate high-quality text embeddings while maintaining its original text generation and reasoning capabilities. Our method inserts new special token(s) into a text body, and generates a summarization embedding of the text by manipulating the attention mask. This method could be easily integrated into post-training or fine-tuning stages of any existing LLMs. We demonstrate the effectiveness of our approach by applying it to two popular LLM families, ranging from 1B to 8B parameters, and evaluating the transformed models on both text embedding benchmarks (MTEB) and NLP benchmarks (MMLU). The results show that our proposed method significantly improves the original LLMs on MTEB while having a minimal impact on MMLU. Our strong results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.
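The abstract does not spell out how the attention mask is manipulated, so the sketch below shows one assumed design: a causal mask in which tokens after the inserted special tokens can no longer attend to the text before them, which would force that text's information to pass through the special tokens. The function name, the sequence layout, and the masking rule are all hypothetical.

```python
import numpy as np

def bottleneck_causal_mask(seq_len, text_end, n_special):
    """Boolean attention mask (True = query row may attend to key column).

    Assumed layout: positions [0, text_end) hold the text,
    [text_end, text_end + n_special) hold the special tokens, and the rest
    are later tokens that only see the text through the special tokens.
    """
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # standard causal mask
    first_later = text_end + n_special
    mask[first_later:, :text_end] = False  # hide the raw text from later tokens
    return mask

mask = bottleneck_causal_mask(seq_len=6, text_end=3, n_special=1)
print(mask.astype(int))
```

In this layout the hidden states at the special-token positions would be natural candidates for the text embedding, since they are the only path between the text and everything that follows.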

[136] Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR

Zheng-Xin Yong,Vineel Pratap,Michael Auli,Jean Maillard

Main category: cs.CL

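TL;DR: Studies how the number of speakers, the audio duration per speaker, and accent diversity in training data affect ASR robustness to unseen accents, and finds that increasing the number of speakers is more effective than increasing the audio duration per speaker.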

Details Motivation: Building an automatic speech recognition (ASR) system that works across the world's accents requires understanding how training data variables affect robustness to unseen accents. Method: A systematic study of three training data variables (number of speakers, audio duration per speaker, and accent diversity) and their effect on ASR robustness. Result: For a fixed number of training hours, adding speakers helps more than adding hours per speaker; with many speakers, adding training hours improves performance; accent diversity brings limited gains. Conclusion: ASR training data should prioritize a larger speaker count over more audio per speaker or greater accent diversity. Abstract: To build an automatic speech recognition (ASR) system that can serve everyone in the world, the ASR needs to be robust to a wide range of accents including unseen accents. We systematically study how three different variables in training data -- the number of speakers, the audio duration per individual speaker, and the diversity of accents -- affect ASR robustness towards unseen accents in a low-resource training regime. We observe that for a fixed number of ASR training hours, it is more beneficial to increase the number of speakers (which means each speaker contributes less) than the number of hours contributed per speaker. We also observe that more speakers enable ASR performance gains from scaling the number of hours. Surprisingly, we observe minimal benefits to prioritizing speakers with different accents when the number of speakers is controlled. Our work suggests that practitioners should prioritize increasing the speaker count in ASR training data composition for new languages.

[137] Mechanistic Decomposition of Sentence Representations

Matthieu Tehenan,Vikram Natarajan,Jonathan Michala,Milton Lin,Juri Opitz

Main category: cs.CL

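TL;DR: Proposes a new method that uses dictionary learning to decompose sentence embeddings into interpretable components, revealing their internal structure.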

Details Motivation: Sentence embeddings are central to modern NLP and AI systems, but their internal structure is opaque and hard to interpret. Method: Dictionary learning decomposes token-level representations, and the analysis traces how pooling compresses these features into sentence representations. Result: Many semantic and syntactic features are found to be linearly encoded in the embeddings. Conclusion: The method makes sentence embeddings more transparent and controllable. Abstract: Sentence embeddings are central to modern NLP and AI systems, yet little is known about their internal structure. While we can compare these embeddings using measures such as cosine similarity, the contributing features are not human-interpretable, and the content of an embedding seems untraceable, as it is masked by complex neural transformations and a final pooling operation that combines individual token embeddings. To alleviate this issue, we propose a new method to mechanistically decompose sentence embeddings into interpretable components, by using dictionary learning on token-level representations. We analyze how pooling compresses these features into sentence representations, and assess the latent features that reside in a sentence embedding. This bridges token-level mechanistic interpretability with sentence-level analysis, making for more transparent and controllable representations. In our studies, we obtain several interesting insights into the inner workings of sentence embedding spaces, for instance, that many semantic and syntactic aspects are linearly encoded in the embeddings.
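A small numerical check can show why a token-level dictionary carries over to the sentence level when pooling is linear. The sketch assumes mean pooling and an exact sparse decomposition of each token embedding; the dictionary, the codes, and the sizes are synthetic stand-ins rather than anything learned from a real model, and a learned dictionary would leave a reconstruction residual.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_atoms, dim = 6, 16, 8

# Synthetic dictionary: each row is one feature direction ("atom").
dictionary = rng.normal(size=(n_atoms, dim))

# Synthetic sparse codes: most atoms are inactive for any given token.
active = rng.random((n_tokens, n_atoms)) < 0.2
codes = rng.random((n_tokens, n_atoms)) * active

# Token embeddings built exactly from the dictionary (an idealization).
token_embeddings = codes @ dictionary

# Mean pooling turns token embeddings into one sentence embedding.
sentence_embedding = token_embeddings.mean(axis=0)

# Because mean pooling is linear, pooling the codes gives the same vector.
pooled_codes = codes.mean(axis=0)
reconstructed = pooled_codes @ dictionary

print(np.allclose(sentence_embedding, reconstructed))  # True
```

Under these assumptions the pooled codes act as sentence-level feature activations, which is one simple way to see how pooling compresses token features into a sentence representation.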

[138] Hierarchical Text Classification Using Contrastive Learning Informed Path Guided Hierarchy

Neeraj Agrawal,Saurabh Kumar,Priyanka Bhatt,Tanishka Agarwal

Main category: cs.CL

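TL;DR: Proposes HTC-CLIP, a model that combines contrastive learning with a path-guided hierarchy; by jointly learning text and hierarchy representations, it significantly improves classification performance.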

Details Motivation: Existing HTC models handle text and the label hierarchy separately and fail to exploit how the two complement each other. Method: HTC-CLIP uses contrastive learning to jointly learn text representations and path-guided hierarchy representations, and combines the two probability distributions at inference time. Result: On two public datasets, HTC-CLIP improves the Macro F1 score by 0.99-2.37% over the existing state-of-the-art models. Conclusion: HTC-CLIP effectively combines the strengths of the two existing approaches and significantly improves classification performance. Abstract: Hierarchical Text Classification (HTC) has recently gained traction given the ability to handle complex label hierarchy. This has found applications in domains like e-commerce, customer care and medicine industry among other real-world applications. Existing HTC models either encode label hierarchy separately and mix it with text encoding or guide the label hierarchy structure in the text encoder. Both approaches capture different characteristics of label hierarchy and are complementary to each other. In this paper, we propose a Hierarchical Text Classification using Contrastive Learning Informed Path guided hierarchy (HTC-CLIP), which learns hierarchy-aware text representation and text informed path guided hierarchy representation using contrastive learning. During the training of HTC-CLIP, we learn two different sets of class probability distributions and during inference, we use the pooled output of both probabilities for each class to get the best of both representations. Our results show that the two previous approaches can be effectively combined into one architecture to achieve improved performance. Tests on two public benchmark datasets showed an improvement of 0.99-2.37% in Macro F1 score using HTC-CLIP over the existing state-of-the-art models.

[139] MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP

Kurt Micallef,Claudia Borg

Main category: cs.CL

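TL;DR: Evaluates 55 publicly available LLMs on Maltese, a low-resource language, and finds that most models perform poorly, especially on generative tasks, while smaller fine-tuned models perform better. Exposure to Maltese during pre-training and instruction-tuning is the key factor.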

Details Motivation: To study how LLMs perform on low-resource languages such as Maltese, expose their limitations, and explore ways to improve. Method: 55 publicly available LLMs are tested on a new benchmark of 11 discriminative and generative tasks, followed by a multidimensional analysis. Result: Most models perform poorly and smaller fine-tuned models perform better; exposure to Maltese during pre-training and instruction-tuning is the key factor. Conclusion: Researchers working on low-resource languages are advised to consider more traditional language modelling approaches; fine-tuning costs more up front but gives better results. Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various Natural Language Processing (NLP) tasks, largely due to their generalisability and ability to perform tasks without additional training. However, their effectiveness for low-resource languages remains limited. In this study, we evaluate the performance of 55 publicly available LLMs on Maltese, a low-resource language, using a newly introduced benchmark covering 11 discriminative and generative tasks. Our experiments highlight that many models perform poorly, particularly on generative tasks, and that smaller fine-tuned models often perform better across all tasks. From our multidimensional analysis, we investigate various factors impacting performance. We conclude that prior exposure to Maltese during pre-training and instruction-tuning emerges as the most important factor. We also examine the trade-offs between fine-tuning and prompting, highlighting that while fine-tuning requires a higher initial cost, it yields better performance and lower inference costs. Through this work, we aim to highlight the need for more inclusive language technologies and recommend that researchers working with low-resource languages consider more "traditional" language modelling approaches.

[140] Building a Few-Shot Cross-Domain Multilingual NLU Model for Customer Care

Saurabh Kumar,Sourav Bansal,Neeraj Agrawal,Priyanka Bhatt

Main category: cs.CL

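TL;DR: Proposes an embedder-plus-classifier architecture that extends domain-specific models to other domains with only a few labeled samples, improving cross-domain intent classification.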

Details Motivation: Scarce cross-domain data limits how well intent classifiers generalize; solving this would make customer care more efficient. Method: Train a domain-specific embedder with supervised fine-tuning and isotropic regularizers, then generalize it across domains with a multilingual knowledge distillation strategy. Result: On Canada and Mexico e-commerce customer care datasets, few-shot intent detection accuracy improves by 20-23% over existing SOTA models. Conclusion: The proposed architecture achieves cross-domain intent classification from a small amount of labeled data and has practical value. Abstract: Customer care is an essential pillar of the e-commerce shopping experience with companies spending millions of dollars each year, employing automation and human agents, across geographies (like US, Canada, Mexico, Chile), channels (like Chat, Interactive Voice Response (IVR)), and languages (like English, Spanish). SOTA pre-trained models like multilingual-BERT, fine-tuned on annotated data, have shown good performance in downstream tasks relevant to Customer Care. However, model performance is largely subject to the availability of sufficient annotated domain-specific data. Cross-domain availability of data remains a bottleneck, thus building an intent classifier that generalizes across domains (defined by channel, geography, and language) with only a few annotations is of great practical value. In this paper, we propose an embedder-cum-classifier model architecture which extends state-of-the-art domain-specific models to other domains with only a few labeled samples. We adopt a supervised fine-tuning approach with isotropic regularizers to train a domain-specific sentence embedder and a multilingual knowledge distillation strategy to generalize this embedder across multiple domains. The trained embedder, further augmented with a simple linear classifier, can be deployed for new domains. Experiments on Canada and Mexico e-commerce Customer Care datasets with few-shot intent detection show an increase in accuracy by 20-23% against the existing state-of-the-art pre-trained models.

[141] MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

Ran Xu,Yuchen Zhuang,Yishan Zhong,Yue Yu,Xiangru Tang,Hang Wu,May D. Wang,Peifeng Ruan,Donghan Yang,Tao Wang,Guanghua Xiao,Carl Yang,Yang Xie,Wenqi Shi

Main category: cs.CL

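TL;DR: MedAgentGYM is a publicly available training environment for improving the code-based medical reasoning of large language models (LLMs). It contains 72,413 task instances across 129 categories of real-world biomedical scenarios and provides executable environments, feedback mechanisms, and scalable training trajectories. Experiments show a performance gap between commercial API models and open-source models, and Med-Copilot-7B improves substantially through fine-tuning and reinforcement learning.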

Details Motivation: Build a publicly available training environment that strengthens LLMs' code-based medical reasoning and fills a gap in existing resources. Method: Construct the MedAgentGYM environment with 72,413 task instances, offering detailed task descriptions, interactive feedback, and scalable training trajectories. Benchmark more than 30 LLMs and optimize Med-Copilot-7B with supervised fine-tuning and reinforcement learning. Result: The gap between commercial API models and open-source models is notable. Med-Copilot-7B gains 36.44% from fine-tuning and 42.47% from reinforcement learning, becoming competitive with gpt-4o. Conclusion: MedAgentGYM offers an integrated, accessible, and expandable platform for developing LLM coding assistants for biomedical research and practice. Abstract: We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.

[142] Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning

Wesley Scivetti,Tatsuya Aoyama,Ethan Wilcox,Nathan Schneider

Main category: cs.CL

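TL;DR: Human-scale language models handle the form of a rare grammatical construction well but fall short on its meaning, an asymmetry not seen in human learners.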

Details Motivation: Ask whether language models can, like humans, generalize from frequent to rare constructions and also grasp their meaning. Method: Test the models' knowledge of the form and meaning of the rare English LET-ALONE construction using a bespoke synthetic benchmark. Result: The models are sensitive to form but do not generalize correctly about meaning. Conclusion: Current architectures show an asymmetry in sample efficiency between linguistic form and meaning that human learners do not. Abstract: Humans have a remarkable ability to acquire and understand grammatical phenomena that are seen rarely, if ever, during childhood. Recent evidence suggests that language models with human-scale pretraining data may possess a similar ability by generalizing from frequent to rare constructions. However, it remains an open question how widespread this generalization ability is, and to what extent this knowledge extends to meanings of rare constructions, as opposed to just their forms. We fill this gap by testing human-scale transformer language models on their knowledge of both the form and meaning of the (rare and quirky) English LET-ALONE construction. To evaluate our LMs we construct a bespoke synthetic benchmark that targets syntactic and semantic properties of the construction. We find that human-scale LMs are sensitive to form, even when related constructions are filtered from the dataset. However, human-scale LMs do not make correct generalizations about LET-ALONE's meaning. These results point to an asymmetry in the current architectures' sample efficiency between language form and meaning, something which is not present in human language learners.

[143] Empaths at SemEval-2025 Task 11: Retrieval-Augmented Approach to Perceived Emotions Prediction

Lev Morozov,Aleksandr Mogilevskii,Alexander Shirnin

Main category: cs.CL

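TL;DR: EmoRAG is a multi-label emotion detection system that needs no additional model training and relies only on an ensemble of models to deliver efficient, scalable emotion prediction.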

Details Motivation: Address Subtask A of SemEval-2025 Task 11: detecting the speaker's perceived emotions from text. Method: Use an ensemble of models, with no additional training, to predict emotion labels (such as joy and sadness) directly. Result: Performance is comparable to the best systems while being more efficient, scalable, and easier to implement. Conclusion: EmoRAG is an efficient and practical solution for emotion detection. Abstract: This paper describes EmoRAG, a system designed to detect perceived emotions in text for SemEval-2025 Task 11, Subtask A: Multi-label Emotion Detection. We focus on predicting the perceived emotions of the speaker from a given text snippet, labeling it with emotions such as joy, sadness, fear, anger, surprise, and disgust. Our approach does not require additional model training and only uses an ensemble of models to predict emotions. EmoRAG achieves results comparable to the best performing systems, while being more efficient, scalable, and easier to implement.

[144] Zero-Shot Open-Schema Entity Structure Discovery

Xueqiang Xu,Jinfeng Xiao,James Barry,Mohab Elkaref,Jiaru Zou,Pengcheng Jiang,Yunyi Zhang,Max Giammona,Geeth de Mel,Jiawei Han

Main category: cs.CL

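TL;DR: Proposes ZOES, a zero-shot open-schema entity structure discovery method that needs no predefined schema or annotated data and uses enrichment, refinement, and unification to make extracted entity structures more complete and accurate.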

Details Motivation: Existing LLM-based methods depend on predefined entity attribute schemas or annotated data, which leads to incomplete extraction. Method: ZOES applies enrichment, refinement, and unification, exploiting the mutually reinforcing relationship between an entity and its structure. Result: Experiments show that ZOES substantially improves the completeness and generalizability of LLM-extracted entity structures across three different domains. Conclusion: The ZOES mechanism is an effective way to improve the quality of LLM-based entity structure discovery in a range of scenarios. Abstract: Entity structure extraction, which aims to extract entities and their associated attribute-value structures from text, is an essential task for text understanding and knowledge graph construction. Existing methods based on large language models (LLMs) typically rely heavily on predefined entity attribute schemas or annotated datasets, often leading to incomplete extraction results. To address these challenges, we introduce Zero-Shot Open-schema Entity Structure Discovery (ZOES), a novel approach to entity structure extraction that does not require any schema or annotated samples. ZOES operates via a principled mechanism of enrichment, refinement, and unification, based on the insight that an entity and its associated structure are mutually reinforcing. Experiments demonstrate that ZOES consistently enhances LLMs' ability to extract more complete entity structures across three different domains, showcasing both the effectiveness and generalizability of the method. These findings suggest that such an enrichment, refinement, and unification mechanism may serve as a principled approach to improving the quality of LLM-based entity structure discovery in various scenarios.

[145] Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Apurv Verma,NhatHai Phan,Shubhendu Trivedi

Main category: cs.CL

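TL;DR: Systematically analyzes how two popular watermarking methods (Gumbel and KGW) affect the truthfulness, safety, and helpfulness of large language models (LLMs), and proposes Alignment Resampling (AR), an inference-time sampling method that restores alignment.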

Details Motivation: The effect of watermarking on LLM output quality, and in particular on truthfulness, safety, and helpfulness, is underexamined. This paper aims to fill that gap. Method: Experimentally analyze the effect of two watermarking methods (Gumbel and KGW) on four aligned LLMs, and propose AR, which uses an external reward model to restore alignment at inference time. Result: Watermarking causes two degradation patterns (guard attenuation and guard amplification), and AR recovers or surpasses baseline alignment scores with only 2-4 samples. Conclusion: AR balances watermark strength against model alignment and offers a simple route to deploying watermarked LLMs responsibly. Abstract: Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches (Gumbel and KGW) affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.
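
The idea behind Alignment Resampling, as far as the abstract describes it, resembles best-of-n selection: draw a handful of watermarked generations and keep the one an external reward model scores highest. The sketch below assumes exactly that reading; the function name `alignment_resample` and the `generate` and `reward` callables are hypothetical stand-ins for a watermarked sampler and a reward model, not part of the paper's code.

```python
from typing import Callable, Tuple


def alignment_resample(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: returns one watermarked sample
    reward: Callable[[str, str], float],  # hypothetical: external reward model score
    n: int = 4,                           # the abstract reports 2-4 samples suffice
) -> Tuple[str, float]:
    """Return the highest-reward response among n watermarked generations."""
    if n < 1:
        raise ValueError("n must be at least 1")

    # Draw n independent watermarked candidates for the same prompt.
    candidates = [generate(prompt) for _ in range(n)]

    # Score every candidate with the external reward model.
    scores = [reward(prompt, candidate) for candidate in candidates]

    # Keep the best-scoring candidate (the first one wins on ties).
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

Because every candidate is itself a watermarked generation, selecting among them leaves the chosen output watermarked, which is consistent with the abstract's claim that detectability is maintained.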

[146] Aligning Large Language Models with Implicit Preferences from User-Generated Content

Zhaoxuan Tan,Zheng Li,Tianyi Liu,Haodong Wang,Hyokun Yun,Ming Zeng,Pei Chen,Zhihan Zhang,Yifan Gao,Ruijie Wang,Priyanka Nigam,Bing Yin,Meng Jiang

Main category: cs.CL

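TL;DR: The PUGC framework generates preference data from the implicit preferences in unlabeled user-generated content (UGC) and significantly improves model performance.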

Details Motivation: Existing preference learning methods depend on data annotated by humans or advanced LLMs, which is costly and hard to scale; PUGC aims to solve this with UGC. Method: PUGC turns UGC into user queries, generates responses, and scores them using the UGC as reference text, aligning the model with the implicit preferences. Result: Models trained with DPO and PUGC improve by 9.37%, reaching a state-of-the-art 35.93% length-controlled win rate with Mistral-7B-Instruct. Conclusion: PUGC is an efficient, scalable preference learning method that significantly improves alignment and response quality. Abstract: Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that have the potential to address readers' questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a 35.93% state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at https://zhaoxuan.info/PUGC.github.io/

[147] SQLens: An End-to-End Framework for Error Detection and Correction in Text-to-SQL

Yue Gong,Chuan Lei,Xiao Qin,Kapil Vaidya,Balakrishnan Narayanaswamy,Tim Kraska

Main category: cs.CL

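TL;DR: SQLens is an end-to-end framework for fine-grained detection and correction of semantic errors in LLM-generated SQL, significantly improving error detection and query execution accuracy.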

Details Motivation: LLM-generated SQL queries can be syntactically valid yet semantically wrong, with no assessment of their reliability. Method: SQLens combines error signals from the database and the LLM to detect semantic errors in SQL clauses and to guide correction. Result: On two public benchmarks, SQLens beats the best LLM-based self-evaluation method by 25.78% in error-detection F1 and raises execution accuracy by up to 20%. Conclusion: SQLens effectively improves the semantic correctness and reliability of LLM-generated SQL. Abstract: Text-to-SQL systems translate natural language (NL) questions into SQL queries, enabling non-technical users to interact with structured data. While large language models (LLMs) have shown promising results on the text-to-SQL task, they often produce semantically incorrect yet syntactically valid queries, with limited insight into their reliability. We propose SQLens, an end-to-end framework for fine-grained detection and correction of semantic errors in LLM-generated SQL. SQLens integrates error signals from both the underlying database and the LLM to identify potential semantic errors within SQL clauses. It further leverages these signals to guide query correction. Empirical results on two public benchmarks show that SQLens outperforms the best LLM-based self-evaluation method by 25.78% in F1 for error detection, and improves execution accuracy of out-of-the-box text-to-SQL systems by up to 20%.

[148] DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation

Kun Zhao,Bohao Yang,Chen Tang,Siyuan Dai,Haoteng Tang,Chenghua Lin,Liang Zhan

Main category: cs.CL

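TL;DR: SLIDE combines the strengths of small and large language models through adaptive weighting to make dialogue evaluation more reliable. The follow-up DRE method adds a two-stage refinement that significantly improves evaluation accuracy.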

Details Motivation: Large language models (LLMs) are unreliable in ambiguous scenarios, while small language models (SLMs) do well on positive examples but are easily misled. Combining their complementary strengths can make evaluation tools more reliable. Method: Propose SLIDE, which integrates SLMs and LLMs through adaptive weighting, and then DRE, which refines in two stages: SLM-generated insights guide the LLM's initial evaluation, and SLM-derived adjustments refine the LLM's scores. Result: Experiments show that DRE outperforms existing methods across diverse benchmarks and aligns better with human judgment. Conclusion: Combining small and large models can significantly improve reliability on open-ended tasks such as dialogue evaluation. Abstract: Large Language Models (LLMs) excel at many tasks but struggle with ambiguous scenarios where multiple valid responses exist, often yielding unreliable results. Conversely, Small Language Models (SLMs) demonstrate robustness in such scenarios but are susceptible to misleading or adversarial inputs. We observed that LLMs handle negative examples effectively, while SLMs excel with positive examples. To leverage their complementary strengths, we introduce SLIDE (Small and Large Integrated for Dialogue Evaluation), a method integrating SLMs and LLMs via adaptive weighting. Building on SLIDE, we further propose a Dual-Refinement Evaluation (DRE) method to enhance SLM-LLM integration: (1) SLM-generated insights guide the LLM to produce initial evaluations; (2) SLM-derived adjustments refine the LLM's scores for improved accuracy. Experiments demonstrate that DRE outperforms existing methods, showing stronger alignment with human judgment across diverse benchmarks. This work illustrates how combining small and large models can yield more reliable evaluation tools, particularly for open-ended tasks such as dialogue evaluation.

[149] Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation

Di Wu,Seth Aycock,Christof Monz

Main category: cs.CL

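TL;DR: In translation, step-by-step Chain-of-Thought (CoT) reasoning by LLMs shows no clear performance benefit, and simply prompting the model to "translate again" works better.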

Details Motivation: Examine whether step-by-step (CoT) reasoning actually helps LLM translation. Method: Experimentally compare step-by-step prompting with simple prompts such as "translate again". Result: Step-by-step reasoning does not clearly improve translation quality, and the simple prompt performs better. Conclusion: The role of step-by-step reasoning in translation remains unclear, and the factors behind CoT's effectiveness need further study. Abstract: Large Language Models (LLMs) demonstrate strong reasoning capabilities for many tasks, often by explicitly decomposing the task via Chain-of-Thought (CoT) reasoning. Recent work on LLM-based translation designs hand-crafted prompts to decompose translation, or trains models to incorporate intermediate steps. Translating Step-by-step (Briakou et al., 2024), for instance, introduces a multi-step prompt with decomposition and refinement of translation with LLMs, which achieved state-of-the-art results on WMT24. In this work, we scrutinise this strategy's effectiveness. Empirically, we find no clear evidence that performance gains stem from explicitly decomposing the translation process, at least for the models on test; and we show that simply prompting LLMs to "translate again" yields even better results than human-like step-by-step prompting. Our analysis does not rule out the role of reasoning, but instead invites future work exploring the factors for CoT's effectiveness in the context of translation.

[150] Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs

William Sheffield,Kanishka Misra,Valentina Pyatkin,Ashwini Deo,Kyle Mahowald,Junyi Jessy Li

Main category: cs.CL

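TL;DR: Investigates whether large language models (LLMs) can distinguish the fine-grained senses of the English particle "just", and finds that they separate the broad categories but struggle with subtler differences.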

Details Motivation: Probe how well LLMs understand polyfunctional discourse particles, using "just" as a case study because its range of senses makes it challenging. Method: Evaluate LLMs' ability to distinguish the sense categories of "just" on an expert-annotated dataset. Result: LLMs separate the broad categories of "just" but fall short on finer semantic distinctions. Conclusion: LLMs are limited in their grasp of the subtle semantics of discourse particles and need further improvement. Abstract: Discourse particles are crucial elements that subtly shape the meaning of text. These words, often polyfunctional, give rise to nuanced and often quite disparate semantic/discourse effects, as exemplified by the diverse uses of the particle "just" (e.g., exclusive, temporal, emphatic). This work investigates the capacity of LLMs to distinguish the fine-grained senses of English "just", a well-studied example in formal semantics, using data meticulously created and labeled by expert linguists. Our findings reveal that while LLMs exhibit some ability to differentiate between broader categories, they struggle to fully capture more subtle nuances, highlighting a gap in their understanding of discourse particles.

[151] BSBench: will your LLM find the largest prime number?

K. O. T. Erziev

Main category: cs.CL

TL;DR: Proposes a way to test LLMs on questions that cannot be answered and finds that existing models are far from perfect on them.

Details Motivation: Examine how LLMs behave when faced with unanswerable questions and expose their limitations. Method: Present a benchmark and a method for modifying existing datasets for this kind of testing. Result: Existing models perform poorly on unanswerable questions. Conclusion: The study highlights LLMs' weakness on questions with no solution and releases the associated code and data. Abstract: We propose that benchmarking LLMs on questions which have no reasonable answer actually isn't as silly as it sounds. We also present a benchmark that allows such testing and a method to modify existing datasets, and discover that existing models demonstrate performance far from perfect on such questions. Our code and data artifacts are available at https://github.com/L3G5/impossible-bench

[152] SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?

Senyu Li,Jiayi Wang,Felermino D. M. A. Ali,Colin Cherry,Daniel Deutsch,Eleftheria Briakou,Rui Sousa-Silva,Henrique Lopes Cardoso,Pontus Stenetorp,David Ifeoluwa Adelani

Main category: cs.CL

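TL;DR: Introduces the SSA-MTE dataset and the SSA-COMET evaluation metrics, which significantly improve machine translation quality evaluation for under-resourced African languages.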

Details Motivation: Existing machine translation evaluation metrics perform poorly on under-resourced African languages, and public datasets and evaluation tools are lacking. Method: Build SSA-MTE, a large-scale human-annotated dataset covering 13 African language pairs, and use it to develop the SSA-COMET and SSA-COMET-QE metrics. Result: SSA-COMET significantly outperforms AfriCOMET and is competitive with the strongest LLM (Gemini 2.5 Pro) on low-resource languages such as Twi, Luo, and Yoruba. Conclusion: SSA-MTE and SSA-COMET are effective tools for evaluating machine translation for under-resourced African languages, and all resources are openly released. Abstract: Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.

[153] Demonstrations of Integrity Attacks in Multi-Agent Systems

Can Zheng,Yuhan Cao,Xiaoning Dong,Tianxing He

Main category: cs.CL

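TL;DR: Examines four types of attack that malicious agents can mount in multi-agent systems through prompt manipulation, exposing the limits of current detection mechanisms.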

Details Motivation: Multi-agent systems (MAS) are vulnerable to malicious agents that use subtle prompt manipulation to steer system behavior for their own benefit. Method: Study four attack types (Scapegoater, Boaster, Self-Dealer, and Free-Rider) and verify their effectiveness experimentally. Result: Malicious agents can mislead evaluation systems and even bypass monitors based on advanced LLMs such as GPT-4o-mini. Conclusion: MAS need stronger security protocols and content validation mechanisms to counter these risks. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, code generation, and complex planning. Simultaneously, Multi-Agent Systems (MAS) have garnered attention for their potential to enable cooperation among distributed agents. However, from a multi-party perspective, MAS could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality. This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits. Four types of attacks are examined: Scapegoater, who misleads the system monitor to underestimate other agents' contributions; Boaster, who misleads the system monitor to overestimate their own performance; Self-Dealer, who manipulates other agents to adopt certain tools; and Free-Rider, who hands off its own task to others. We demonstrate that strategically crafted prompts can introduce systematic biases in MAS behavior and executable instructions, enabling malicious agents to effectively mislead evaluation systems and manipulate collaborative agents. Furthermore, our attacks can bypass advanced LLM-based monitors, such as GPT-4o-mini and o3-mini, highlighting the limitations of current detection mechanisms. Our findings underscore the critical need for MAS architectures with robust security protocols and content validation mechanisms, alongside monitoring systems capable of comprehensive risk scenario assessment.

[154] Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis

Dimitris Vamvourellis,Dhagash Mehta

Main category: cs.CL

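TL;DR: Studies large language models (LLMs) on zero-shot financial sentiment analysis and finds that reasoning does not improve performance; GPT-4o without reasoning prompts performs best.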

Details Motivation: Evaluate LLMs on financial sentiment analysis and test whether reasoning improves results. Method: On the Financial PhraseBank dataset, compare different LLMs (such as GPT-4o and FinBERT) and prompting strategies that simulate fast, intuitive or slow, deliberate thinking. Result: Reasoning does not improve performance; GPT-4o without reasoning prompts is the most accurate, and fast, intuitive thinking is closer to human judgment. Conclusion: For financial sentiment analysis, fast intuitive thinking beats slow reasoning, challenging the assumption that more reasoning always improves LLM decisions. Abstract: We investigate the effectiveness of large language models (LLMs), including reasoning-based and non-reasoning models, in performing zero-shot financial sentiment analysis. Using the Financial PhraseBank dataset annotated by domain experts, we evaluate how various LLMs and prompting strategies align with human-labeled sentiment in a financial context. We compare three proprietary LLMs (GPT-4o, GPT-4.1, o3-mini) under different prompting paradigms that simulate System 1 (fast and intuitive) or System 2 (slow and deliberate) thinking and benchmark them against two smaller models (FinBERT-Prosus, FinBERT-Tone) fine-tuned on financial sentiment analysis. Our findings suggest that reasoning, either through prompting or inherent model design, does not improve performance on this task. Surprisingly, the most accurate and human-aligned combination of model and method was GPT-4o without any Chain-of-Thought (CoT) prompting. We further explore how performance is impacted by linguistic complexity and annotation agreement levels, uncovering that reasoning may introduce overthinking, leading to suboptimal predictions. This suggests that for financial sentiment classification, fast, intuitive "System 1"-like thinking aligns more closely with human judgment compared to "System 2"-style slower, deliberative reasoning simulated by reasoning models or CoT prompting. Our results challenge the default assumption that more reasoning always leads to better LLM decisions, particularly in high-stakes financial applications.

[155] Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?

Qingchuan Li,Jiatong Li,Zirui Liu,Mingyue Cheng,Yuting Zeng,Qi Liu,Tongxuan Liu

Main category: cs.CL

TL;DR: Proposes the SCALe benchmark and the MenTaL method to address LLMs' weakness in handling lexical diversity during logic translation.

Details Motivation: LLMs struggle with lexical diversity when translating text into logic, and existing benchmarks lack lexical diversity, which hides the problem. Method: Propose the SCALe benchmark, which evaluates LLMs through logic-invariant lexical diversification, and design MenTaL, which improves translation by first building a table that unifies diverse expressions. Result: Experiments confirm that LLMs fall short on lexically diversified translation and that MenTaL significantly improves performance. Conclusion: SCALe and MenTaL address the lexical-diversity problem in LLM logic translation and point to improvements for practical use. Abstract: Neuro-symbolic approaches combining large language models (LLMs) with solvers excel at logical reasoning problems that need long reasoning chains. In this paradigm, LLMs serve as translators, converting natural language reasoning problems into formal logic formulas. Then reliable symbolic solvers return correct solutions. Despite their success, we find that LLMs, as translators, struggle to handle lexical diversification, a common linguistic phenomenon, indicating that LLMs as logic translators are unreliable in real-world scenarios. Moreover, existing logical reasoning benchmarks lack lexical diversity, failing to challenge LLMs' ability to translate such text and thus obscuring this issue. In this work, we propose SCALe, a benchmark designed to address this significant gap through logic-invariant lexical diversification. By using LLMs to transform original benchmark datasets into lexically diversified but logically equivalent versions, we evaluate LLMs' ability to consistently map diverse expressions to uniform logical symbols on these new datasets. Experiments using SCALe further confirm that current LLMs exhibit deficiencies in this capability. Building directly on the deficiencies identified through our benchmark, we propose a new method, MenTaL, to address this limitation. This method guides LLMs to first construct a table unifying diverse expressions before performing translation. Applying MenTaL through in-context learning and supervised fine-tuning (SFT) significantly improves the performance of LLM translators on lexically diversified text. Our code is now available at https://github.com/wufeiwuwoshihua/LexicalDiver.

[156] Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching

Jianfei Zhang,Bei Li,Jun Bai,Rumei Li,Yanmeng Wang,Chenghua Lin,Wenge Rong

Main category: cs.CL

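TL;DR: Proposes a gradient matching method for selecting demonstrations in many-shot in-context learning (ICL) with large language models (LLMs), significantly outperforming random selection.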

Details Motivation: Existing many-shot ICL work selects demonstrations at random, which limits its effectiveness. The authors hypothesize that ICL and fine-tuning have analogous data requirements and use gradient matching to improve selection. Method: Select demonstrations by aligning the fine-tuning gradients of the chosen examples with those of the target task's entire training set, approximating the learning effect of the full set. Result: From 4-shot to 128-shot, the method consistently beats random selection on 9 datasets, for example by 4% on Qwen2.5-72B and Llama3-70B and by around 2% on 5 closed-source LLMs. Conclusion: The method makes many-shot ICL more reliable and effective, supporting its broader use. Abstract: In-Context Learning (ICL) empowers Large Language Models (LLMs) for rapid task adaptation without Fine-Tuning (FT), but its reliance on demonstration selection remains a critical challenge. While many-shot ICL shows promising performance through scaled demonstrations, the selection method for many-shot demonstrations remains limited to random selection in existing work. Since the conventional instance-level retrieval is not suitable for many-shot scenarios, we hypothesize that the data requirements for in-context learning and fine-tuning are analogous. To this end, we introduce a novel gradient matching approach that selects demonstrations by aligning fine-tuning gradients between the entire training set of the target task and the selected examples, so as to approach the learning effect on the entire training set within the selected examples. Through gradient matching on relatively small models, e.g., Qwen2.5-3B or Llama3-8B, our method consistently outperforms random selection on larger LLMs from 4-shot to 128-shot scenarios across 9 diverse datasets. For instance, it surpasses random selection by 4% on Qwen2.5-72B and Llama3-70B, and by around 2% on 5 closed-source LLMs. This work unlocks more reliable and effective many-shot ICL, paving the way for its broader application.
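
One way to picture gradient matching is as a subset-selection problem: pick k examples whose average gradient is as close as possible to the average gradient of the whole training set. The sketch below uses a simple greedy search for that objective. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not specify the optimizer or distance, so the greedy loop, the squared Euclidean distance, and the name `select_by_gradient_matching` are all hypothetical, and per-example gradients are assumed to be precomputed as plain lists of floats.

```python
from typing import List


def select_by_gradient_matching(gradients: List[List[float]], k: int) -> List[int]:
    """Greedily pick k example indices whose mean gradient tracks the full-set mean."""
    if not 0 < k <= len(gradients):
        raise ValueError("k must be between 1 and the number of examples")

    dim = len(gradients[0])
    count = len(gradients)
    # Target: the mean gradient over the entire training set.
    target = [sum(g[d] for g in gradients) / count for d in range(dim)]

    selected: List[int] = []
    running_sum = [0.0] * dim
    remaining = set(range(count))

    for _ in range(k):
        best_idx, best_err = -1, float("inf")
        size = len(selected) + 1
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            # Squared distance between the would-be subset mean and the target.
            err = sum(
                ((running_sum[d] + gradients[i][d]) / size - target[d]) ** 2
                for d in range(dim)
            )
            if err < best_err:
                best_idx, best_err = i, err
        selected.append(best_idx)
        remaining.remove(best_idx)
        running_sum = [running_sum[d] + gradients[best_idx][d] for d in range(dim)]

    return selected
```

The returned indices would then identify the demonstrations to place in the prompt. Each greedy step costs one pass over the remaining examples, so the search is O(k · n · dim) for n examples.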

[157] SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing

Hongjun Liu,Yilun Zhao,Arman Cohan,Chen Zhao

Main category: cs.CL

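TL;DR: Proposes SUCEA, a training-free method that decomposes and rephrases adversarial claims to improve the retrieval and label-prediction accuracy of fact-checking systems.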

Details Motivation: Fact-checking systems built on retrieval-augmented language models struggle with adversarial claims. Method: The SUCEA framework has three steps: claim segmentation and decontextualization, iterative evidence retrieval and claim editing, and evidence aggregation and label prediction. Result: On two datasets, it significantly improves retrieval and label-prediction accuracy, outperforming four baselines. Conclusion: SUCEA effectively improves fact-checking of adversarial claims. Abstract: Automatic fact-checking has recently received more attention as a means of combating misinformation. Despite significant advancements, fact-checking systems based on retrieval-augmented language models still struggle to tackle adversarial claims, which are intentionally designed by humans to challenge fact-checking systems. To address these challenges, we propose a training-free method designed to rephrase the original claim, making it easier to locate supporting evidence. Our modular framework, SUCEA, decomposes the task into three steps: 1) Claim Segmentation and Decontextualization that segments adversarial claims into independent sub-claims; 2) Iterative Evidence Retrieval and Claim Editing that iteratively retrieves evidence and edits the subclaim based on the retrieved evidence; 3) Evidence Aggregation and Label Prediction that aggregates all retrieved evidence and predicts the entailment label. Experiments on two challenging fact-checking datasets demonstrate that our framework significantly improves on both retrieval and entailment label accuracy, outperforming four strong claim-decomposition-based baselines.

[158] MuSciClaims: Multimodal Scientific Claim Verification

Yash Kumar Lal,Manikanta Bandham,Mohammad Saqib Hasan,Apoorva Kashi,Mahnaz Koupaee,Niranjan Balasubramanian

Main category: cs.CL

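TL;DR: Introduces MuSciClaims, a multimodal benchmark for testing figure-based claim verification in scientific literature, and finds that existing vision-language models perform poorly on the task.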

Details Motivation: Multimodal data in scientific literature, such as figures, is essential for verifying claims, yet no benchmark tests this ability directly. Method: Build MuSciClaims by automatically extracting supported claims from scientific articles and manually perturbing them into contradicted claims, and design diagnostic tasks to analyze why models fail. Result: Existing vision-language models perform poorly (0.3-0.5 F1), the best model reaches only 0.77 F1, and models are biased towards judging claims as supported. Conclusion: Models are markedly weak at localizing evidence, aggregating information across modalities, and understanding basic figure components, and need further improvement. Abstract: Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark MuSciClaims accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models are poor (~0.3-0.5 F1), with even the best model only achieving 0.77 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.

[159] LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models

Wen Ding,Fan Qian

Main category: cs.CL

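TL;DR: The LESS framework uses large language models (LLMs) to refine pseudo labels in semi-supervised learning, substantially improving speech recognition and speech translation performance.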

Details Motivation: Pseudo labels in semi-supervised learning are often of low quality; the work refines them with an LLM and improves knowledge-transfer efficiency. Method: An LLM corrects the pseudo labels, and a data filtering strategy optimizes LLM knowledge transfer. Result: On Mandarin ASR and Spanish-to-English AST, WER drops by an absolute 3.77%, and BLEU reaches 34.0 and 64.7. Conclusion: LESS adapts well and remains effective across languages, tasks, and domains. Abstract: We introduce LESS (Large Language Model Enhanced Semi-supervised Learning), a versatile framework that leverages Large Language Models (LLMs) to correct pseudo labels generated from in-the-wild data. Within the LESS framework, pseudo-labeled text from Automatic Speech Recognition (ASR) or Automatic Speech Translation (AST) of the unsupervised data is refined by an LLM, and augmented by a data filtering strategy to optimize LLM knowledge transfer efficiency. Experiments on both Mandarin ASR and Spanish-to-English AST tasks show that LESS achieves a notable absolute WER reduction of 3.77% on the Wenet Speech test set, as well as BLEU scores of 34.0 and 64.7 on Callhome and Fisher test sets respectively. These results validate the adaptability of LESS across different languages, tasks, and domains. Ablation studies conducted with various LLMs and prompt configurations provide novel insights into leveraging LLM-derived knowledge for speech processing applications.
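
The abstract reports its ASR gain as an absolute WER reduction, that is, the baseline WER minus the new WER. As a reminder of how WER is usually computed, here is a minimal sketch based on word-level edit distance. The function name and the whitespace tokenization are assumptions made for illustration; this is not the evaluation code used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Calling `word_error_rate` on a pseudo label before and after LLM correction, against the same reference, gives the two values whose difference is the absolute reduction.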

[160] Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification

Chengwu Liu,Ye Yuan,Yichun Yin,Yan Xu,Xin Xu,Zaoyu Chen,Yasheng Wang,Lifeng Shang,Qun Liu,Ming Zhang

Main category: cs.CL

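TL;DR: The paper proposes $Safe$, a retrospective, step-aware formal verification framework that uses the formal mathematical language Lean 4 to verify LLM-generated reasoning steps and reduce hallucinations in CoT.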

Details Motivation: Existing methods such as PRMs and self-consistency detect hallucinations without providing checkable evidence, which limits their effectiveness. Inspired by mathematical proof, the authors use a formal language to supply verifiable evidence. Method: At each reasoning step, the mathematical claim is stated in the formal mathematical language Lean 4 and a formal proof is provided, so that hallucinations can be identified. Result: Evaluated across multiple LLMs and mathematical datasets, $Safe$ improves performance significantly while offering interpretable and verifiable evidence. The paper also proposes the $FormalStep$ benchmark with 30,809 formal statements. Conclusion: This is the first work to use the formal mathematical language Lean 4 to verify natural language content generated by LLMs, offering a reliable approach to hallucinations. Abstract: Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that "the gold standard for supporting a mathematical claim is to provide a proof". We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework $Safe$ across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose $FormalStep$ as a benchmark for step correctness theorem proving with $30,809$ formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.
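
To make the idea of stating a reasoning step in Lean 4 concrete, here is a small sketch. The arithmetic step and the theorem name are hypothetical and are not taken from the paper or from $FormalStep$; the sketch only shows what a checkable statement for one natural-language step might look like, and that a hallucinated variant of the step can be refuted.

```lean
-- Hypothetical reasoning step: "12 boxes with 3 apples each hold 36 apples."
-- If this compiles, the step is formally verified.
theorem step_one : 12 * 3 = 36 := rfl

-- A hallucinated variant of the same step ("... hold 35 apples") is refutable.
example : ¬ (12 * 3 = 35) := by decide
```

Real reasoning steps are rarely this simple, so a practical pipeline would presumably need automated formalization and proof search; the sketch above covers only the final, checkable artifact.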

[161] A MISMATCHED Benchmark for Scientific Natural Language Inference

Firoz Shaik,Mobashir Sadat,Nikita Gautam,Doina Caragea,Cornelia Caragea

Main category: cs.CL

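TL;DR: The paper introduces MISMATCHED, a benchmark for scientific natural language inference (NLI) covering non-computer-science domains, and reports baseline performance of pre-trained language models.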

Details Motivation: Existing scientific NLI datasets cover only computer science and ignore other disciplines, so broader datasets are needed. Method: The authors build the MISMATCHED benchmark of 2,700 annotated sentence pairs covering psychology, engineering, and public health, and establish baselines with pre-trained small and large language models. Result: The best baseline reaches a Macro F1 of 78.17%, leaving substantial headroom. Adding sentence pairs with an implicit scientific NLI relation to training also improves model performance. Conclusion: MISMATCHED fills the gap in scientific NLI for non-CS domains and provides new directions and data for future research. Abstract: Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains (PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH) and contains 2,700 human-annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17%, illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub.
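
The headline number here is Macro F1, which averages per-class F1 scores so that every class counts equally regardless of its frequency. The following is a minimal sketch of that computation; the function name and label values are illustrative assumptions, not the benchmark's own evaluation script.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because each class contributes equally, a model that ignores a rare relation class is penalized more under Macro F1 than under plain accuracy.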

[162] Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning

Ho-Lam Chung,Teng-Yun Hsiao,Hsiao-Ying Huang,Chunerh Cho,Jian-Ren Lin,Zhang Ziwei,Yun-Nung Chen

Main category: cs.CL

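TL;DR: Test-time scaling (TTS) improves LLM performance by allocating extra compute at inference; the ADAPT method uses diversity-focused prefix fine-tuning to cut the required compute substantially.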

Details Motivation: Reasoning-optimized models produce less diverse outputs, which limits the effectiveness of TTS. Method: The paper proposes ADAPT, a prefix fine-tuning method combined with a diversity-focused data strategy. Result: On mathematical reasoning tasks, ADAPT reaches 80% accuracy with eight times less compute. Conclusion: Generative diversity is essential for maximizing TTS effectiveness. Abstract: Test-Time Scaling (TTS) improves the reasoning performance of Large Language Models (LLMs) by allocating additional compute during inference. We conduct a structured survey of TTS methods and categorize them into sampling-based, search-based, and trajectory optimization strategies. We observe that reasoning-optimized models often produce less diverse outputs, which limits TTS effectiveness. To address this, we propose ADAPT (A Diversity Aware Prefix fine-Tuning), a lightweight method that applies prefix tuning with a diversity-focused data strategy. Experiments on mathematical reasoning tasks show that ADAPT reaches 80% accuracy using eight times less compute than strong baselines. Our findings highlight the essential role of generative diversity in maximizing TTS effectiveness.
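
One common sampling-based TTS strategy is majority voting over several sampled answers. The sketch below shows that aggregation step together with a simple diversity measure, to illustrate why low output diversity limits what extra samples can add. The function names and the distinct-answer ratio are assumptions for illustration; they are not ADAPT itself or the survey's exact taxonomy.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest sampled one."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def distinct_ratio(answers):
    """Share of unique answers among the samples (1.0 means all differ)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return len(set(answers)) / len(answers)
```

If every sample returns the same answer, `distinct_ratio` approaches its minimum and additional samples cannot change the vote, which is the intuition behind encouraging more diverse generations.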

[163] Subjective Perspectives within Learned Representations Predict High-Impact Innovation

Likun Cao,Rui Pan,James Evans

Main category: cs.CL

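TL;DR: The paper uses machine learning to quantify innovators' subjective perspectives and background diversity, finding that perspective diversity promotes innovation while background diversity does the opposite.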

Details Motivation: To study how innovators' subjective perspectives and background diversity affect innovation outcomes. Method: Dynamic language representations are used to model innovators' perspectives and background diversity; the authors analyze data on scientists, inventors, and others, and complement it with a natural experiment and AI simulations. Result: Perspective diversity predicts creative success, while background diversity predicts the opposite; common language is key to successful collaboration. Conclusion: Perspective diversity is crucial for innovation, and team assembly and research policy should prioritize perspective diversity over background diversity. Abstract: Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators' personal perspectives and interpersonal innovation opportunities as a function of their prior trajectories of experience. We theorize and then quantify subjective perspectives and innovation opportunities based on innovator positions within the geometric space of concepts inscribed by dynamic language representations. Using data on millions of scientists, inventors, writers, entrepreneurs, and Wikipedia contributors across the creative domains of science, technology, film, entrepreneurship, and Wikipedia, here we show that measured subjective perspectives anticipate what ideas individuals and groups creatively attend to and successfully combine in the future. When perspective and background diversity are decomposed as the angular difference between collaborators' perspectives on their creation and between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite, across all cases and time periods examined. We analyze a natural experiment and simulate creative collaborations between AI (large language model) agents designed with various perspective and background diversity, which are consistent with our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experience obtained through trajectories of prior work that converge to provoke one another and innovate. We explore the importance of these findings for team assembly and research policy.
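
The abstract measures diversity as the angular difference between vectors in an embedding space. A minimal sketch of that quantity for two vectors follows; how the paper actually constructs perspective and background vectors is not described here, so the inputs are simply assumed to be given as lists of floats.

```python
import math


def angular_difference(u, v):
    """Angle in radians between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)
```

An angle near 0 means two collaborators point in nearly the same direction in concept space, while an angle near pi/2 means their vectors are orthogonal.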

[164] Static Word Embeddings for Sentence Semantic Representation

Takashi Wada,Yuki Hirakawa,Ryotaro Shimizu,Takahiro Kawashima,Yuki Saito

Main category: cs.CL

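TL;DR: The paper proposes static word embeddings optimized for sentence semantic representation. Pre-trained word embeddings are improved with principal component analysis followed by knowledge distillation or contrastive learning, and sentences are represented by simply averaging word embeddings. The approach is computationally cheap, outperforms existing static models, and rivals SimCSE on some datasets.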

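Details Motivation: To improve static word embeddings so that they represent sentence semantics better while remaining computationally efficient. Method: Word embeddings are extracted from a pre-trained Sentence Transformer, refined with sentence-level principal component analysis, and then trained with knowledge distillation or contrastive learning. At inference, a sentence is represented by averaging its word embeddings. Result: The model substantially outperforms existing static models on monolingual and cross-lingual tasks and approaches SimCSE on some datasets. Analyses show that it removes components irrelevant to sentence semantics and adjusts word vector norms. Conclusion: The method effectively improves sentence semantic representation with static word embeddings, is computationally efficient, and comes close to dynamic models. Abstract: We propose new static word embeddings optimised for sentence semantic representation. We first extract word embeddings from a pre-trained Sentence Transformer, and improve them with sentence-level principal component analysis, followed by either knowledge distillation or contrastive learning. During inference, we represent sentences by simply averaging word embeddings, which requires little computational cost. We evaluate models on both monolingual and cross-lingual tasks and show that our model substantially outperforms existing static models on sentence semantic tasks, and even rivals a basic Sentence Transformer model (SimCSE) on some data sets. Lastly, we perform a variety of analyses and show that our method successfully removes word embedding components that are irrelevant to sentence semantics, and adjusts the vector norms based on the influence of words on sentence semantics.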
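
The inference step described above, averaging static word vectors to obtain a sentence vector, can be sketched in a few lines. The toy embedding table, the whitespace tokenizer, and the function names below are assumptions for illustration only; the paper's trained embeddings and preprocessing are not reproduced.

```python
import math


def sentence_vector(sentence, embeddings):
    """Average the static vectors of all in-vocabulary words."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no in-vocabulary words in sentence")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical two-dimensional embedding table.
toy_embeddings = {"cats": [1.0, 0.0], "purr": [0.0, 1.0], "dogs": [0.8, 0.2]}
similarity = cosine(
    sentence_vector("cats purr", toy_embeddings),
    sentence_vector("dogs purr", toy_embeddings),
)
```

Since the only work at inference is a table lookup and a mean, the cost is tiny compared with a forward pass through a Transformer encoder, which is the efficiency argument the abstract makes.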

[165] Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning

Zhiyuan Ma,Jiayu Liu,Xianzhen Luo,Zhenya Huang,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

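TL;DR: Tool-MVR uses multi-agent meta-verification and exploration-based reflection learning to address LLM weaknesses in tool planning and reflection, substantially improving performance.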

Details Motivation: Current LLMs are unreliable at tool planning and invocation, and their reflection ability is weak, leaving most errors uncorrected. Method: The paper proposes Tool-MVR, which uses Multi-Agent Meta-Verification (MAMV) to build the high-quality dataset ToolBench-V and Exploration-based Reflection Learning (EXPLORE) to produce ToolBench-R. Result: Tool-MVR surpasses ToolLLM and GPT-4 on StableToolBench, reduces API calls by 31.4%, and achieves a 58.9% error correction rate on RefineToolBench. Conclusion: Through systematic verification and dynamic learning, Tool-MVR substantially improves LLMs' tool utilization. Abstract: Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low-quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench-V, a new high-quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic "Error -> Reflection -> Correction" learning paradigm, resulting in our reflection dataset ToolBench-R and addressing the critical weakness in tool reflection. Finally, we obtain Tool-MVR by finetuning open-source LLMs (e.g., Qwen-7B) on both ToolBench-V and ToolBench-R. Our experiments demonstrate that Tool-MVR achieves state-of-the-art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT-4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool-MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM's 9.1%.

[166] ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition

Thai-Binh Nguyen,Thi Van Nguyen,Quoc Truong Do,Chi Mai Luong

Main category: cs.CL

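TL;DR: This paper presents a practical method for generating audio-visual speech recognition (AVSR) datasets from raw video and validates it with a Vietnamese baseline model.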

Details Motivation: AVSR models are limited by scarce datasets, especially for languages other than English; the paper proposes automated data collection as a remedy. Method: Existing techniques are refined to generate AVSR datasets efficiently from raw video, and a Vietnamese baseline model is developed. Result: The automatically collected dataset supports a strong baseline that is competitive with ASR in clean conditions and significantly better than ASR in noisy environments. Conclusion: The method offers a practical path to extend AVSR to more languages, particularly under-resourced ones. Abstract: Audio-Visual Speech Recognition (AVSR) has gained significant attention recently due to its robustness against noise, which often challenges conventional speech recognition systems that rely solely on audio features. Despite this advantage, AVSR models remain limited by the scarcity of extensive datasets, especially for most languages beyond English. Automated data collection offers a promising solution. This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese. Experiments show the automatically collected dataset enables a strong baseline, achieving competitive performance with robust ASR systems in clean conditions and significantly outperforming them in noisy environments like cocktail parties. This efficient method provides a pathway to expand AVSR to more languages, particularly under-resourced ones.

[167] TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering

Vinay Joshi,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum

Main category: cs.CL

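TL;DR: TaDA is a training-free KV cache compression method that uses adaptive quantization precision and mean centering to remove the need for separate outlier handling, substantially reducing memory footprint while maintaining accuracy.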

Details Motivation: The memory demand of the KV cache in Transformer models grows sharply with sequence length, which limits scalable deployment of large language models. Method: The paper proposes TaDA, which combines adaptive quantization precision with mean centering so that outliers need no separate handling. Result: Experiments show that TaDA reduces the KV cache memory footprint to 27% of the original 16-bit baseline while maintaining accuracy. Conclusion: TaDA opens a path to scalable, high-performance inference for language models, supporting longer contexts and longer reasoning chains. Abstract: The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and a mean centering to eliminate separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements, a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chain of thoughts.
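
To illustrate why mean centering can help low-bit quantization, the following sketch compares a uniform quantizer applied with and without subtracting the mean first. It is a toy, single-vector illustration under assumed settings (4 bits, a symmetric uniform quantizer, hypothetical function names and values), not TaDA's actual recipe, which also adapts precision across layers.

```python
def quantize(values, bits):
    """Symmetric uniform quantization: return (integer codes, scale)."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def plain_roundtrip(values, bits):
    """Quantize and dequantize the raw values."""
    codes, scale = quantize(values, bits)
    return [c * scale for c in codes]


def mean_centered_roundtrip(values, bits):
    """Quantize the residual around the mean, then add the mean back."""
    mean = sum(values) / len(values)
    codes, scale = quantize([v - mean for v in values], bits)
    return [c * scale + mean for c in codes]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical activations that share a large common offset.
activations = [10.0, 10.1, 9.9, 10.2, 9.8]
plain_err = max_error(activations, plain_roundtrip(activations, bits=4))
centered_err = max_error(activations, mean_centered_roundtrip(activations, bits=4))
```

With a shared offset, the plain quantizer spends its few levels covering the offset itself, whereas the centered version only has to cover the small residuals, so `centered_err` comes out much smaller than `plain_err` in this toy case. The price is storing one extra mean value per quantized group.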

[168] Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents

Juhyun Oh,Eunsu Kim,Alice Oh

Main category: cs.CL

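TL;DR: The paper introduces Flex-TravelPlanner, a benchmark for evaluating language models' flexible reasoning in dynamic planning scenarios, and finds that current models handle multi-turn tasks and constraint prioritization poorly.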

Details Motivation: Current benchmarks for language models' planning ability focus on static, single-turn scenarios, whereas real-world planning requires adapting to change and balancing competing constraints. Method: Building on the TravelPlanner dataset, the paper introduces two new evaluation settings: sequential constraint introduction across multiple turns, and scenarios with explicitly prioritized competing constraints. Result: Single-turn performance does not predict a model's ability to adapt across multiple turns, and both the order in which constraints are introduced and the handling of priorities significantly affect performance. Conclusion: The work underlines the importance of evaluating language models in more realistic, dynamic scenarios and suggests specific directions for improving performance on complex planning tasks. Abstract: Real-world planning problems require constant adaptation to changing requirements and balancing of competing constraints. However, current benchmarks for evaluating LLMs' planning capabilities primarily focus on static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios. Building on the TravelPlanner dataset (Xie et al., 2024), we introduce two novel evaluation settings: (1) sequential constraint introduction across multiple turns, and (2) scenarios with explicitly prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings: models' performance on single-turn tasks poorly predicts their ability to adapt plans across multiple turns; constraint introduction order significantly affects performance; and models struggle with constraint prioritization, often incorrectly favoring newly introduced lower-priority preferences over existing higher-priority constraints. These findings highlight the importance of evaluating LLMs in more realistic, dynamic planning scenarios and suggest specific directions for improving model performance on complex planning tasks. The code and dataset for our framework are publicly available at https://github.com/juhyunohh/FlexTravelBench.

[169] Normative Conflicts and Shallow AI Alignment

Raphaël Millière

Main category: cs.CL

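TL;DR: The paper examines the value alignment problem for large language models (LLMs), argues that current alignment strategies cannot effectively prevent misuse, and traces this vulnerability to the lack of a deep capacity for normative reasoning.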

Details Motivation: As AI systems such as LLMs advance, their safe deployment becomes an increasingly pressing concern. The paper aims to expose the shortcomings of existing alignment methods, especially their vulnerability to adversarial attacks. Method: Drawing on research in human moral psychology, the paper contrasts the normative reasoning capacities of LLMs and humans and argues that LLMs lack deep alignment. Result: Existing alignment methods only reinforce shallow behavioral dispositions and do not give LLMs a genuine capacity for normative reasoning, leaving them open to manipulation. Conclusion: Current alignment methods are insufficient for the potential risks posed by increasingly capable AI systems, and deeper alignment strategies are needed. Abstract: The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing on research in moral psychology, I show how humans' ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics. LLMs, by contrast, lack a robust capacity to detect and rationally resolve normative conflicts, leaving them susceptible to manipulation; even recent advances in reasoning-focused LLMs have not addressed this vulnerability. This "shallow alignment" problem carries significant implications for AI safety and regulation, suggesting that current approaches are insufficient for mitigating potential harms posed by increasingly capable AI systems.

[170] MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models

Gio Paik,Geewook Kim,Jinbae Im

Main category: cs.CL

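TL;DR: MMRefine is a multimodal refinement benchmark that evaluates the error refinement capabilities of multimodal large language models (MLLMs), analyzing performance bottlenecks across six scenarios and six error types.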

Details Motivation: As the emphasis shifts toward enhancing reasoning during inference, a framework is needed that evaluates how well MLLMs detect and correct errors, rather than only comparing final accuracy before and after refinement. Method: MMRefine evaluates refinement ability across six distinct scenarios and six error types, with experiments on open and closed MLLMs. Result: The experiments reveal bottlenecks and factors that impede refinement performance and point to areas for improvement in reasoning enhancement. Conclusion: MMRefine provides an evaluation framework for the error refinement capabilities of MLLMs, with code and dataset publicly released. Abstract: This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs' abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at https://github.com/naver-ai/MMRefine.

[171] Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Thao Nguyen,Yang Li,Olga Golovneva,Luke Zettlemoyer,Sewoong Oh,Ludwig Schmidt,Xian Li

Main category: cs.CL

TL;DR: The paper proposes REWIRE, which enriches pre-training data by rewriting low-quality text and substantially improves model performance.

Details Motivation: Pre-training data is running short, and high-quality text is especially scarce; the paper explores how to make use of low-quality data that filtering discards. Method: REWIRE rewrites low-quality documents so that they become useful for training, increasing the share of synthetic data in the pre-training set. Result: In experiments at 1B, 3B, and 7B scales, mixing high-quality raw text with rewritten text yields gains of 1.0, 1.3, and 2.5 percentage points respectively. Conclusion: Rewriting low-quality text is a simple and effective way to scale pre-training data and outperforms other synthetic data generation methods. Abstract: Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art. To address the "data wall" of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts leads to improvements of 1.0, 1.3 and 2.5 percentage points respectively across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed-in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches of generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesizing and knowledge extraction. These results suggest that recycling web texts holds the potential for being a simple and effective approach for scaling pre-training data.

[172] Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification

Lu Wei,Liangzhi Li,Tong Xiang,Xiao Liu,Noa Garcia

Main category: cs.CL

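TL;DR: The paper proposes a new taxonomy for implicit hate speech (im-HS) detection that defines six encoding strategies (codetypes), and improves detection with two methods: prompting LLMs directly for classification and embedding codetypes in the encoding process. Experiments validate the approach on Chinese and English datasets.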

Details Motivation: Implicit hate speech (im-HS) on the internet challenges automatic detection methods, which struggle with its subtle forms; it threatens societal harmony and individual well-being. Method: The paper defines six encoding strategies (codetypes) and uses them in two ways: 1) prompting large language models (LLMs) directly for classification, and 2) embedding codetypes in the LLM encoding process. Result: Using codetypes significantly improves implicit hate speech detection on both Chinese and English datasets. Conclusion: The approach offers an effective solution for implicit hate speech detection and applies across languages. Abstract: The internet has become a hotspot for hate speech (HS), threatening societal harmony and individual well-being. While automatic detection methods perform well in identifying explicit hate speech (ex-HS), they struggle with more subtle forms, such as implicit hate speech (im-HS). We tackle this problem by introducing a new taxonomy for im-HS detection, defining six encoding strategies named codetypes. We present two methods for integrating codetypes into im-HS detection: 1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and 2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that the use of codetypes improves im-HS detection in both Chinese and English datasets, validating the effectiveness of our approach across different languages.

[173] Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Woomin Song,Saket Dingliwal,Sai Muralidhar Jayanthi,Bhavana Ganesh,Jinwoo Shin,Aram Galstyan,Sravan Babu Bodapati

Main category: cs.CL

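TL;DR: STAND is a model-free speculative decoding method that exploits redundancy in reasoning trajectories to accelerate inference substantially while maintaining accuracy.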

Details Motivation: Existing test-time scaling methods such as best-of-N sampling and tree search need substantial compute; STAND targets the trade-off between performance and efficiency. Method: STAND uses stochastic adaptive N-gram drafting, combined with efficient Gumbel-Top-K sampling and data-driven tree construction, to raise token acceptance rates. Result: Across several reasoning tasks, STAND reduces inference latency by 60-65%, exceeds existing methods by 14-28% in throughput, and reduces latency by 48-58% in single-trajectory scenarios. Conclusion: STAND is a plug-and-play solution that needs no additional training, applies to any existing language model, and substantially accelerates inference. Abstract: Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that leverages the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis reveals that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND outperforms state-of-the-art speculative decoding methods by 14-28% in throughput and shows strong performance even in single-trajectory scenarios, reducing inference latency by 48-58%. As a model-free approach, STAND can be applied to any existing language model without additional training, being a powerful plug-and-play solution for accelerating language model reasoning.
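
The core idea of model-free drafting, proposing the next tokens from an N-gram table built over earlier text instead of from a separate draft model, can be sketched as follows. This is a deliberately simplified, deterministic variant with hypothetical function names: it drafts greedily from raw counts, whereas STAND is described as stochastic and logit-based, and it omits the verification step in which the target model accepts or rejects drafted tokens.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, n=2):
    """Map each n-token context to counts of the token that followed it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        table[context][tokens[i + n]] += 1
    return table


def draft(table, prefix, n=2, max_tokens=4):
    """Greedily propose up to max_tokens continuations of the prefix."""
    proposed = []
    context = tuple(prefix[-n:])
    for _ in range(max_tokens):
        if context not in table:
            break  # no earlier occurrence of this context: stop drafting
        next_token = table[context].most_common(1)[0][0]
        proposed.append(next_token)
        context = context[1:] + (next_token,)
    return proposed
```

In a full speculative decoding loop, the target model would score the drafted tokens in one forward pass and keep the longest accepted prefix, so repetitive reasoning traces translate into several tokens accepted per step.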

[174] IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation

Bhavana Akkiraju,Aishwarya Pothula,Santosh Kesiraju,Anil Kumar Vuppala

Main category: cs.CL

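TL;DR: For the IWSLT 2025 low-resource Bhojpuri-Hindi speech translation task, the IIITH-BUT team substantially improved the SeamlessM4T model through hyperparameter optimization and data augmentation.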

Details Motivation: To study how hyperparameter optimization and data augmentation can improve speech translation for the low-resource Bhojpuri-Hindi language pair. Method: The team systematically studies learning rate schedules, number of update steps, warm-up steps, label smoothing, and batch size, applies speed perturbation and SpecAugment, and explores cross-lingual signal through joint training on Marathi and Bhojpuri. Result: Careful hyperparameter selection and simple yet effective data augmentation significantly improve translation quality in low-resource settings. Conclusion: Hyperparameter optimization and data augmentation are key to performance in low-resource speech translation, and error analysis helps explain what affects translation quality. Abstract: This paper presents the submission of IIITH-BUT to the IWSLT 2025 shared task on speech translation for the low-resource Bhojpuri-Hindi language pair. We explored the impact of hyperparameter optimisation and data augmentation techniques on the performance of the SeamlessM4T model fine-tuned for this specific task. We systematically investigated a range of hyperparameters including learning rate schedules, number of update steps, warm-up steps, label smoothing, and batch sizes; and report their effect on translation quality. To address data scarcity, we applied speed perturbation and SpecAugment and studied their effect on translation quality. We also examined the use of cross-lingual signal through joint training with Marathi and Bhojpuri speech data. Our experiments reveal that careful selection of hyperparameters and the application of simple yet effective augmentation techniques significantly improve performance in low-resource settings. We also analysed the translation hypotheses to understand various kinds of errors that impacted the translation quality in terms of BLEU.

[175] SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat

Yuru Jiang,Wenxuan Ding,Shangbin Feng,Greg Durrett,Yulia Tsvetkov

Main category: cs.CL

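TL;DR: SPARTA ALIGNMENT is an algorithm that collectively aligns multiple LLMs through competition and combat, using competition and mutual evaluation among models to increase diversity and reduce bias.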

Details Motivation: A single model lacks diversity in generation and is biased in evaluation. Method: Multiple LLMs form a "sparta tribe", compete and evaluate one another to produce preference pairs, and have their judging weights adjusted through an Elo-ranking-based system. Result: The method outperforms the initial models and 4 self-alignment baselines on 10 of 12 tasks and datasets, with a 7.0% average improvement. Conclusion: SPARTA ALIGNMENT lets models evolve through collective competition, improving generalization and output quality. Abstract: We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To complement a single model's lack of diversity in generation and biases in evaluation, multiple LLMs form a "sparta tribe" to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking-based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
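
The reputation system is described as an adapted Elo ranking. For orientation, here is the classic Elo update after a single duel; the paper's adaptation is not specified in the abstract, so the constants below (K-factor of 32, scale of 400) and the function names are the conventional chess defaults, used here as assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b); score_a is 1 win, 0.5 draw, 0 loss."""
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

For two equally rated models, `elo_update(1000, 1000, 1.0)` moves the winner up by 16 points and the loser down by 16. A reputation-weighted judging scheme could then weight each model's evaluation score by a function of its current rating.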

[176] Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection

Ziyi Zhou,Xiaoming Zhang,Litian Zhang,Yibo Zhang,Zhenyu Guan,Chaozhuo Li,Philip S. Yu

Main category: cs.CL

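TL;DR: The paper proposes C²EFND, a framework that combines the strengths of large language models (LLMs) and small language models (SLMs) through multi-round collaborative learning and knowledge-update modules, substantially improving the accuracy and adaptability of fake news detection.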

Details Motivation: Fake news spreads widely on social media with serious social impact, and existing methods based on SLMs or LLMs struggle because of data scarcity and outdated knowledge. Method: A multi-round collaborative learning framework combines the generalization power of LLMs with the classification expertise of SLMs, and adds a lifelong knowledge editing module and a replay-based continual learning method. Result: Experiments on the Pheme and Twitter16 datasets show that C²EFND significantly outperforms existing methods. Conclusion: The C²EFND framework addresses dynamic adaptation in fake news detection and improves detection performance. Abstract: The widespread dissemination of fake news on social media has significantly impacted society, resulting in serious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from extensive supervised training requirements and difficulties adapting to evolving news environments due to data scarcity and distribution shifts. Large language models (LLMs), despite robust zero-shot capabilities, fall short in accurately detecting fake news owing to outdated knowledge and the absence of suitable demonstrations. In this paper, we propose a novel Continuous Collaborative Emergent Fake News Detection (C$^2$EFND) framework to address these challenges. The C$^2$EFND framework strategically leverages both LLMs' generalization power and SLMs' classification expertise via a multi-round collaborative learning framework. We further introduce a lifelong knowledge editing module based on a Mixture-of-Experts architecture to incrementally update LLMs and a replay-based continual learning method to ensure SLMs retain prior knowledge without retraining entirely. Extensive experiments on Pheme and Twitter16 datasets demonstrate that C$^2$EFND significantly outperforms existing methods, effectively improving detection accuracy and adaptability in continuous emergent fake news scenarios.

[177] Identifying Reliable Evaluation Metrics for Scientific Text Revision

Léane Jourdan,Florian Boudin,Richard Dufour,Nicolas Hernandez

Main category: cs.CL

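TL;DR: This paper studies how to evaluate text revision in scientific writing, analyzes the limitations of traditional metrics, and explores alternatives that align better with human judgments.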

Details Motivation: Traditional metrics such as ROUGE and BERTScore focus on similarity rather than on the quality of the improvement, so they evaluate revisions poorly. Method: The authors run a manual annotation study of revision quality, explore reference-free evaluation metrics, and analyze the ability of LLMs to act as judges. Result: LLMs assess instruction-following effectively but struggle with correctness; domain-specific metrics provide complementary information. Conclusion: A hybrid approach combining LLM-as-a-judge evaluation with task-specific metrics gives the most reliable assessment of revision quality. Abstract: Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.
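
To see why overlap metrics measure similarity rather than improvement, consider a simplified unigram-overlap F1 in the spirit of ROUGE-1. This sketch uses whitespace tokens with no stemming and hypothetical function names; it is not the official ROUGE implementation.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped per-word counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because a gold revision usually shares most of its words with the unrevised sentence, submitting the original text unchanged can already score highly under such a metric, even though nothing was improved.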

[178] Fine-Grained Interpretation of Political Opinions in Large Language Models

Jingyu Hu,Mengyue Yang,Mengnan Du,Weiru Liu

Main category: cs.CL

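TL;DR: The paper studies the political opinions of LLMs, noting that their open-ended responses are misaligned with their internal intentions, and proposes a multi-dimensional political learning framework with interpretable representation engineering techniques to disentangle political concept confounds.

Details Motivation: LLMs' open-ended responses are misaligned with their internal intentions, and existing analyses mostly rely on single-axis concepts, which easily leads to confounds. Method: The authors design a four-dimensional political learning framework, construct a corresponding dataset, and run experiments with three representation engineering techniques. Result: The learned vectors disentangle political concept confounds, detection tasks validate their semantic meaning, and intervention experiments show they can shift the political leaning of LLM responses. Conclusion: A multi-dimensional framework combined with representation engineering makes the learning of LLMs' political concepts more transparent and enables intervention. Abstract: Studies of LLMs' political opinions mainly rely on evaluations of their open-ended responses. Recent work indicates that there is a misalignment between LLMs' responses and their internal intentions. This motivates us to probe LLMs' internal mechanisms and help uncover their internal political states. Additionally, we found that the analysis of LLMs' political opinions often relies on single-axis concepts, which can lead to concept confounds. In this work, we extend the single-axis to multi-dimensions and apply interpretable representation engineering techniques for more transparent LLM political concept learning. Specifically, we designed a four-dimensional political learning framework and constructed a corresponding dataset for fine-grained political concept vector learning. These vectors can be used to detect and intervene in LLM internals. Experiments are conducted on eight open-source LLMs with three representation engineering techniques. Results show these vectors can disentangle political concept confounds. Detection tasks validate the semantic meaning of the vectors and show good generalization and robustness in OOD settings. Intervention experiments show these vectors can intervene in LLMs to generate responses with different political leanings.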

[179] MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Dingdong Wang,Jincenzi Wu,Junan Li,Dongchao Yang,Xueyuan Chen,Tianhua Zhang,Helen Meng

Main category: cs.CL

TL;DR: MMSU is a comprehensive benchmark for spoken language understanding, comprising 5,000 audio-question-answer triplets across 47 tasks, designed to evaluate and improve the fine-grained perception and complex reasoning of speech large language models.

Details Motivation: Speech carries rich acoustic information, yet the ability of current speech large language models to perform fine-grained perception and complex reasoning on natural speech remains largely unexplored. Method: The authors introduce the MMSU benchmark, which systematically covers linguistic phenomena including phonetics, prosody, rhetoric, syntax, semantics, and paralinguistics, and evaluate 14 advanced speech large language models on it. Result: Existing models show substantial room for improvement, which points to directions for future optimization. Conclusion: MMSU sets a new standard for comprehensive assessment of spoken language understanding and offers valuable insights for building more sophisticated human-AI speech interaction systems. Abstract: Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU_Bench.

[180] Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques

Jisu An,Junseok Lee,Jeoungeun Lee,Yongseok Son

Main category: cs.CL

TL;DR: This survey systematically analyzes multimodal large language models (MLLMs), proposes a classification framework based on three dimensions, and summarizes trends across 125 models.

Details Motivation: Existing literature lacks a comprehensive account of how different modalities connect to the language backbone; this survey aims to fill that gap. Method: By analyzing 125 MLLMs, the authors propose a classification framework covering architectural strategies, representation learning techniques, and training paradigms. Result: The survey summarizes current trends in modality integration techniques and offers guidance for future model development. Conclusion: The classification framework gives researchers a structured view of the field and supports more robust integration in future multimodal models. Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has transformed the AI landscape. These models combine pre-trained LLMs with various modality encoders. This integration requires a systematic understanding of how different modalities connect to the language backbone. Our survey presents an LLM-centric analysis of current approaches. We examine methods for transforming and aligning diverse modal inputs into the language embedding space. This addresses a significant gap in existing literature. We propose a classification framework for MLLMs based on three key dimensions. First, we examine architectural strategies for modality integration. This includes both the specific integration mechanisms and the fusion level. Second, we categorize representation learning techniques as either joint or coordinate representations. Third, we analyze training paradigms, including training strategies and objective functions. By examining 125 MLLMs developed between 2021 and 2025, we identify emerging patterns in the field. Our taxonomy provides researchers with a structured overview of current integration techniques. These insights aim to guide the development of more robust multimodal integration strategies for future models built on pre-trained foundations.

[181] Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Yujun Zhou,Jiayi Ye,Zipeng Ling,Yufei Han,Yue Huang,Haomin Zhuang,Zhenwen Liang,Kehan Guo,Taicheng Guo,Xiangqi Wang,Xiangliang Zhang

Main category: cs.CL

TL;DR: FineLogic is a fine-grained evaluation framework that assesses LLM logical reasoning along three dimensions (accuracy, stepwise soundness, and representation-level alignment) and studies how different supervision formats affect reasoning ability.

Details Motivation: Existing benchmarks rely only on final-answer accuracy and cannot capture the quality and structure of the reasoning process, so a more comprehensive evaluation method is needed. Method: The authors propose the FineLogic framework, construct four supervision styles (one natural language and three symbolic variants), train LLMs under each, and analyze their reasoning behavior. Result: Natural language supervision generalizes strongly, while symbolic styles promote more structured and atomic inference chains; fine-tuning improves reasoning behavior mainly through step-by-step generation. Conclusion: FineLogic provides a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs. Abstract: Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study on the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Further, our representation-level probing shows that fine-tuning primarily improves reasoning behaviors through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.

[182] Design of intelligent proofreading system for English translation based on CNN and BERT

Feijun Liu,Huifeng Wang,Kun Wang,Yizhen Wang

Main category: cs.CL

TL;DR: This paper proposes a hybrid CNN-BERT approach to machine translation proofreading that significantly improves proofreading quality through semantic extraction and context modeling.

Details Motivation: Automatic translations often contain errors that require human proofreading, and existing methods have limited effect, so more efficient proofreading techniques are needed. Method: A CNN extracts local n-gram patterns and BERT models context; these are combined with error detection and correction modules and optimized through end-to-end training. Result: Experiments report 90% accuracy, 89.37% F1, and 16.24% MSE, exceeding recent techniques by more than 10%. Conclusion: The method reaches state-of-the-art performance in identifying and correcting translation errors and significantly improves proofreading quality. Abstract: Since automatic translations can contain errors that require substantial human post-editing, machine translation proofreading is essential for improving quality. This paper proposes a novel hybrid approach for robust proofreading that combines convolutional neural networks (CNN) with Bidirectional Encoder Representations from Transformers (BERT). In order to extract semantic information from phrases and expressions, CNN uses a variety of convolution kernel filters to capture local n-gram patterns. Meanwhile, BERT creates context-rich representations of whole sequences by utilizing stacked bidirectional transformer encoders. Using BERT's attention processes, the integrated error detection component relates tokens to spot translation irregularities including word order problems and omissions. The correction module then uses parallel English-German alignment and GRU decoder models in conjunction with translation memory to propose logical modifications that maintain original meaning. A unified end-to-end training process optimized for post-editing performance is applied to the whole pipeline. The multi-domain collection of WMT and the conversational dialogues of Open-Subtitles are two of the English-German parallel corpora used to train the model. Multiple loss functions supervise detection and correction capabilities. Experiments attain 90% accuracy, 89.37% F1, and 16.24% MSE, exceeding recent proofreading techniques by over 10% overall. Comparative benchmarking demonstrates state-of-the-art performance in identifying and coherently rectifying mistranslations and omissions.

[183] Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms

Nurul Aisyah,Muhammad Dehan Al Kautsar,Arif Hidayat,Raqib Chowdhury,Fajri Koto

Main category: cs.CL

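TL;DR: The study evaluates a vision-language model (VLM) and large language models (LLMs) on handwritten exams from Indonesian fourth-grade students, finding that the VLM struggles with handwriting recognition, which degrades the LLMs' grading accuracy, although LLM-generated feedback remains somewhat useful.

Details Motivation: The effectiveness of VLMs and LLMs in real classroom settings, especially in under-resourced educational contexts, is underexplored; this study addresses that gap. Method: A VLM and several LLMs grade 646 handwritten exam sheets from Indonesian fourth graders (more than 14K answers) and generate personalized feedback. Result: The VLM recognizes handwriting poorly, which causes grading errors in the downstream LLMs; LLM-generated feedback still has some value, though its personalization and contextual relevance are limited. Conclusion: VLMs and LLMs show potential for educational assessment, but handwriting recognition and feedback quality must improve before they are practical. Abstract: Although vision-language and large language models (VLM and LLM) offer promising opportunities for AI-driven educational assessment, their effectiveness in real-world classroom settings, particularly in underrepresented educational contexts, remains underexplored. In this study, we evaluated the performance of a state-of-the-art VLM and several LLMs on 646 handwritten exam responses from grade 4 students in six Indonesian schools, covering two subjects: Mathematics and English. These sheets contain more than 14K student answers that span multiple choice, short answer, and essay questions. Assessment tasks include grading these responses and generating personalized feedback. Our findings show that the VLM often struggles to accurately recognize student handwriting, leading to error propagation in downstream LLM grading. Nevertheless, LLM-generated feedback retains some utility, even when derived from imperfect input, although limitations in personalization and contextual relevance persist.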

[184] A Reasoning-Based Approach to Cryptic Crossword Clue Solving

Martin Andrews,Sam Witteveen

Main category: cs.CL

TL;DR: The paper presents an LLM-based system for solving cryptic crossword clues that hypothesizes answers, proposes wordplay explanations, and verifies the reasoning steps, achieving state-of-the-art performance on the Cryptonite dataset.

Details Motivation: Cryptic crossword clues are complex language tasks that existing methods solve poorly, which calls for an interpretable, high-performing system. Method: The system works in three steps: hypothesize answers, propose wordplay explanations, and check the reasoning steps with a verifier. Result: The system sets a new state of the art on the Cryptonite dataset, and its solutions are expressed as Python code that can be inspected. Conclusion: The system offers an effective and interpretable approach to cryptic crosswords and illustrates the potential of LLMs on complex language tasks. Abstract: Cryptic crossword clues are challenging language tasks for which new test sets are released daily by major newspapers on a global basis. Each cryptic clue contains both the definition of the answer to be placed in the crossword grid (in common with regular crosswords), and 'wordplay' that proves that the answer is correct (i.e. a human solver can be confident that an answer is correct without needing crossing words as confirmation). This work describes an LLM-based reasoning system built from open-licensed components that solves cryptic clues by (i) hypothesising answers; (ii) proposing wordplay explanations; and (iii) using a verifier system that operates on codified reasoning steps. Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection.

[185] Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models

Changyue Wang,Weihang Su,Qingyao Ai,Yiqun Liu

Main category: cs.CL

TL;DR: The paper proposes RACE, a framework for detecting hallucinations in large reasoning models (LRMs) that analyzes reasoning steps and answer consistency and outperforms existing methods.

Details Motivation: Existing hallucination detection methods focus on answer-level uncertainty and struggle to detect redundancy or inconsistency in the reasoning process, even though the reasoning traces of LRMs are an important source of potential hallucination. Method: RACE extracts essential reasoning steps and computes four diagnostic signals (consistency of reasoning traces across samples, answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning) for fine-grained detection. Result: Experiments show that RACE outperforms existing baselines across multiple datasets and LLMs. Conclusion: RACE offers a robust and generalizable solution for evaluating LRMs, and the code is open source. Abstract: Large Reasoning Models (LRMs) extend large language models with explicit, multi-step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, making them a new source of hallucination that is difficult to detect. Existing hallucination detection methods focus primarily on answer-level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model's reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model's decision-making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. This joint analysis enables fine-grained hallucination detection even when the final answer appears correct. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. Our code is available at: https://github.com/bebr2/RACE.
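
One of the four signals named in the abstract, entropy-based answer uncertainty, can be illustrated with a small sketch. This is not the RACE implementation: the function name `answer_entropy` and the grouping of answers by normalized exact match are assumptions made here for illustration, and the paper may cluster answers differently.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution over sampled answers.

    Assumption (not from the paper): two answers count as the same
    if they match after stripping whitespace and lowercasing.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    # Sum of -p * log2(p) over the distinct answers.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions, samples that all agree give an entropy of zero, and the value grows as the sampled answers disagree more, which is the intuition behind using it as an uncertainty signal.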

[186] MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Dávid Javorský,Ondřej Bojar,François Yvon

Main category: cs.CL

TL;DR: The paper introduces the MockConf dataset and the InterAlign tool for aligning and analyzing simultaneous interpreting, filling a gap left by existing parallel corpora.

Details Motivation: Existing parallel corpora and alignment algorithms cannot adequately model long-range interactions in simultaneous interpreting or specific types of divergence such as simplification and functional generalization. Method: The authors collect MockConf, a dataset of student interpreting from mock conferences (7 hours of recordings in 5 languages), and develop InterAlign, a web-based annotation tool. Result: The dataset and tool are released, together with evaluation metrics and a baseline for automatic alignment. Conclusion: MockConf and InterAlign provide new resources and tools for research on simultaneous interpreting. Abstract: In simultaneous interpreting, an interpreter renders a source speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need dedicated datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g., shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we introduce MockConf, a student interpreting dataset that was collected from Mock Conferences run as part of the students' curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools are released to the community.

[187] Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights

Giorgio Biancini,Alessio Ferrato,Carla Limongelli

Main category: cs.CL

TL;DR: This paper explores the potential of large language models (LLMs) to generate multiple-choice questions (MCQs) automatically, comparing Llama 2, Mistral, and GPT-3.5; GPT-3.5 performs best, but acceptance of AI in education still lags.

Details Motivation: Writing MCQs by hand is time-consuming and laborious, and LLMs may offer an efficient alternative. Method: Knowledge is injected into the prompt rather than drawn from the LLM's own knowledge; the MCQs generated by the three models are compared and evaluated by 21 educators. Result: MCQs generated by GPT-3.5 score best on several metrics, but barriers to accepting AI in education remain. Conclusion: LLMs show potential for generating MCQs, but acceptance of AI in the educational field needs further encouragement. Abstract: Integrating Artificial Intelligence (AI) in educational settings has brought new learning approaches, transforming the practices of both students and educators. Among the various technologies driving this transformation, Large Language Models (LLMs) have emerged as powerful tools for creating educational materials and question answering, but there is still room for new applications. Educators commonly use Multiple-Choice Questions (MCQs) to assess student knowledge, but manually generating these questions is resource-intensive and requires significant time and cognitive effort. In our opinion, LLMs offer a promising solution to these challenges. This paper presents a novel comparative analysis of three widely known LLMs - Llama 2, Mistral, and GPT-3.5 - to explore their potential for creating informative and challenging MCQs. In our approach, we do not rely on the knowledge of the LLM, but we inject the knowledge into the prompt to counter hallucinations, giving the educators control over the test's source text, too. Our experiment involving 21 educators shows that GPT-3.5 generates the most effective MCQs across several known metrics. Additionally, it shows that there is still some reluctance to adopt AI in the educational field. This study sheds light on the potential of LLMs to generate MCQs and improve the educational experience, providing valuable insights for the future.

[188] Prompting LLMs: Length Control for Isometric Machine Translation

Dávid Javorský,Ondřej Bojar,François Yvon

Main category: cs.CL

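TL;DR: The study examines isometric machine translation across several language pairs and finds that aligning instructions with demonstrations is crucial for controlling output length, that few-shot prompting improves translation quality with diminishing returns, and that selecting among multiple outputs improves the length-quality tradeoff.

Details Motivation: The authors explore how effective isometric machine translation is across language pairs under the conditions of the IWSLT Isometric Shared Task 2022, and how prompting strategy, number of demonstrations, and demonstration selection affect translation quality and length control. Method: Eight open-source large language models (LLMs) of varying sizes are tested under different prompting strategies, few-shot counts, and demonstration selections. Result: Aligning instructions with demonstrations is crucial for length control; extreme examples shorten translations, whereas isometric demonstrations often lead models to disregard length constraints; few-shot prompting improves quality with diminishing returns; selecting among multiple outputs improves the length-quality tradeoff. Conclusion: In isometric machine translation, the design of instructions and demonstrations is key to length control, few-shot prompting gives limited further gains, and selecting among multiple outputs notably improves performance. Abstract: In this study, we explore the effectiveness of isometric machine translation across multiple language pairs (En$\to$De, En$\to$Fr, and En$\to$Es) under the conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source large language models (LLMs) of varying sizes, we investigate how different prompting strategies, varying numbers of few-shot examples, and demonstration selection influence translation quality and length control. We discover that the phrasing of instructions, when aligned with the properties of the provided demonstrations, plays a crucial role in controlling the output length. Our experiments show that LLMs tend to produce shorter translations only when presented with extreme examples, while isometric demonstrations often lead to the models disregarding length constraints. While few-shot prompting generally enhances translation quality, further improvements are marginal across 5, 10, and 20-shot settings. Finally, considering multiple outputs notably improves the overall tradeoff between length and quality, yielding state-of-the-art performance for some language pairs.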
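
The idea of choosing among multiple outputs under a length constraint can be sketched as follows. The sketch assumes a character-level length ratio with a ±10% tolerance band as the compliance test, and it takes quality scores as given inputs; the helper names and the selection rule are illustrative assumptions, not the authors' procedure.

```python
def length_ratio(source, translation):
    """Character-length ratio of translation to source."""
    return len(translation) / len(source)


def is_isometric(source, translation, tolerance=0.10):
    """True if the translation length is within the tolerance band (assumed +/-10%)."""
    return abs(length_ratio(source, translation) - 1.0) <= tolerance


def pick_candidate(source, candidates, tolerance=0.10):
    """Pick one translation from (translation, quality_score) pairs.

    Prefer the highest-quality candidate that satisfies the length band;
    if none does, fall back to the candidate closest in length to the source.
    The quality scores are hypothetical inputs from any external metric.
    """
    compliant = [c for c in candidates if is_isometric(source, c[0], tolerance)]
    if compliant:
        return max(compliant, key=lambda c: c[1])[0]
    return min(candidates, key=lambda c: abs(length_ratio(source, c[0]) - 1.0))[0]
```

The fallback branch is a design choice made for this sketch: when no candidate fits the band, it trades quality for the smallest length violation.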

[189] Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies

Wenxi Li

Main category: cs.CL

TL;DR: The paper examines the effect of integrating Universal Dependencies (UD) into pretrained language models and finds that UD significantly improves performance on a cross-lingual adversarial paraphrase identification task.

Details Motivation: UD is a framework for cross-lingual syntactic representation whose effectiveness has not been fully explored; this paper aims to fill that gap. Method: UD is integrated into pretrained language models, which are then evaluated on a cross-lingual adversarial paraphrase identification task. Result: Adding UD significantly improves accuracy and F1, with average gains of 3.85% and 6.08% respectively, and outperforms large language models on some language pairs. Conclusion: UD is valid and promising for out-of-domain tasks, and a language's UD-based similarity to English correlates positively with model performance in that language. Abstract: Universal Dependencies (UD), while widely regarded as the most successful linguistic framework for cross-lingual syntactic representation, remains underexplored in terms of its effectiveness. This paper addresses this gap by integrating UD into pretrained language models and assesses if UD can improve their performance on a cross-lingual adversarial paraphrase identification task. Experimental results show that incorporation of UD yields significant improvements in accuracy and $F_1$ scores, with average gains of 3.85% and 6.08% respectively. These enhancements reduce the performance gap between pretrained models and large language models in some language pairs, and even outperform the latter in some others. Furthermore, the UD-based similarity score between a given language and English is positively correlated to the performance of models in that language. Both findings highlight the validity and potential of UD in out-of-domain tasks.

[190] ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

Shiyi Xu,Yiwen Hu,Yingqian Min,Zhipeng Chen,Wayne Xin Zhao,Ji-Rong Wen

Main category: cs.CL

TL;DR: The paper proposes ICPC-Eval, a new benchmark for evaluating the coding ability of large language models in competition settings, addressing shortcomings of existing benchmarks and metrics.

Details Motivation: Existing benchmarks such as LiveCodeBench and CodeElo cannot adequately evaluate the coding ability of large language models in real competition environments, and current metrics such as Pass@K fail to capture a model's reflective abilities. Method: ICPC-Eval contains 118 problems from 11 ICPC contests and provides a realistic competition scenario, an efficient local evaluation toolkit, and a new evaluation metric, Refine@K. Result: Top-tier reasoning models such as DeepSeek-R1 need multiple rounds of code feedback to reach their potential and still lag behind top human teams. Conclusion: ICPC-Eval is an effective tool for evaluating complex reasoning and exposes the limits of current models in competition settings. Abstract: With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose ICPC-Eval, a top-level competitive coding benchmark designed to probe the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge in evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
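
The abstract describes Refine@K only as a metric that allows iterative repair based on execution feedback. One plausible reading, sketched below, counts a problem as solved if any of up to K attempts passes, where each attempt after the first sees the feedback from the previous one. The callables `generate` and `judge` are hypothetical stand-ins for a model and a local judge, and the exact definition in the paper may differ.

```python
def refine_at_k(problems, generate, judge, k):
    """Fraction of problems solved within k feedback-driven attempts.

    generate(problem, feedback) -> candidate solution (feedback is None at first).
    judge(problem, solution)    -> (passed: bool, feedback for the next attempt).
    Both callables are hypothetical placeholders, not part of ICPC-Eval.
    """
    if not problems:
        raise ValueError("need at least one problem")
    solved = 0
    for problem in problems:
        feedback = None
        for _ in range(k):
            solution = generate(problem, feedback)
            passed, feedback = judge(problem, solution)
            if passed:
                solved += 1
                break  # stop refining once a solution passes
    return solved / len(problems)
```

The difference from Pass@K under this reading is that attempts are sequential and conditioned on feedback rather than sampled independently.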

[191] Verbose ListOps (VLO): Beyond Long Context -- Unmasking LLM's Reasoning Blind Spots

Alex Pan,Mary-Anne Williams

Main category: cs.CL

TL;DR: Verbose ListOps is a new benchmark that turns ListOps computations into long stories to test large language models (LLMs) on nested narrative reasoning, exposing their limitations in state management.

Details Motivation: Existing benchmarks do not adequately test nested reasoning in LLMs and fail to separate context length from reasoning complexity, which masks a fundamental LLM limitation. Method: ListOps computations are programmatically converted into long stories, forcing the model to compute internally and manage state while narrative length and reasoning difficulty are controlled independently. Result: Leading LLMs perform poorly on Verbose ListOps even though they solve raw ListOps problems with ease. Conclusion: Verbose ListOps exposes a weakness of LLMs in nested reasoning and points to targeted reasoning improvements, a key step toward automating knowledge work. Abstract: Large Language Models (LLMs), whilst great at extracting facts from text, struggle with nested narrative reasoning. Existing long context and multi-hop QA benchmarks inadequately test this, lacking realistic distractors or failing to decouple context length from reasoning complexity, masking a fundamental LLM limitation. We introduce Verbose ListOps, a novel benchmark that programmatically transposes ListOps computations into lengthy, coherent stories. This uniquely forces internal computation and state management of nested reasoning problems by withholding intermediate results, and offers fine-grained controls for both narrative size and reasoning difficulty. Whilst benchmarks like LongReason (2025) advance approaches for synthetically expanding the context size of multi-hop QA problems, Verbose ListOps pinpoints a specific LLM vulnerability: difficulty in state management for nested sub-reasoning amongst semantically-relevant, distracting narrative. Our experiments show that leading LLMs (e.g., OpenAI o4, Gemini 2.5 Pro) collapse in performance on Verbose ListOps at modest (~10k token) narrative lengths, despite effortlessly solving raw ListOps equations. Addressing this failure is paramount for real-world text interpretation which requires identifying key reasoning points, tracking conceptual intermediate results, and filtering irrelevant information. Verbose ListOps and its extensible generation framework thus enable targeted reasoning enhancements beyond mere context-window expansion; a critical step to automating the world's knowledge work.
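
For readers unfamiliar with the underlying task, a raw ListOps expression is a nested prefix-notation list operation over single digits. The sketch below evaluates such expressions with an explicit stack, which makes visible the intermediate state a solver must track. It assumes the commonly used operator set (MAX, MIN, MED, and SM for sum modulo 10) and truncates the median to an integer; the benchmark's own generator may use a different operator set or conventions.

```python
import statistics

# Assumed operator set; SM is sum modulo 10, MED is the median truncated to int.
OPS = {
    "MAX": max,
    "MIN": min,
    "MED": lambda xs: int(statistics.median(xs)),
    "SM": lambda xs: sum(xs) % 10,
}


def evaluate(expr):
    """Evaluate a ListOps expression such as '[MAX 2 9 [MIN 4 7 ] 0 ]'."""
    tokens = expr.replace("[", " [ ").replace("]", " ] ").split()
    stack = []  # one frame per open bracket: [operator, arg, arg, ...]
    for tok in tokens:
        if tok == "[":
            stack.append([])
        elif tok == "]":
            frame = stack.pop()
            value = OPS[frame[0]](frame[1:])
            if not stack:
                return value  # outermost list closed
            stack[-1].append(value)  # pass the result up to the parent frame
        elif tok in OPS:
            stack[-1].append(tok)
        else:
            stack[-1].append(int(tok))
    raise ValueError("unbalanced expression")
```

Each frame on the stack corresponds to an intermediate result that, in the narrative version described above, is withheld from the text and must be tracked by the model.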

[192] A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic

Ondřej Klejch,William Lamb,Peter Bell

Main category: cs.CL

TL;DR: This paper challenges the common view that fine-tuning a multilingual end-to-end model is the best way to build ASR systems for low-resource languages, and proposes combining hybrid HMMs with self-supervised models, which performs better with limited training data.

Details Motivation: The paper examines the limits of fine-tuning multilingual end-to-end models for low-resource ASR and proposes a more effective alternative. Method: Hybrid HMMs are combined with self-supervised models, and continued self-supervised pre-training and semi-supervised training make fuller use of the available data. Result: On Scottish Gaelic, the approach reduces WER by 32% relative to the best fine-tuned Whisper model. Conclusion: Combining hybrid HMMs with self-supervised models outperforms the conventional fine-tuning approach for low-resource ASR. Abstract: An effective approach to the development of ASR systems for low-resource languages is to fine-tune an existing multilingual end-to-end model. When the original model has been trained on large quantities of data from many languages, fine-tuning can be effective with limited training data, even when the language in question was not present in the original training data. The fine-tuning approach has been encouraged by the availability of public-domain E2E models and is widely believed to lead to state-of-the-art results. This paper, however, challenges that belief. We show that an approach combining hybrid HMMs with self-supervised models can yield substantially better performance with limited training data. This combination allows better utilisation of all available speech and text data through continued self-supervised pre-training and semi-supervised training. We benchmark our approach on Scottish Gaelic, achieving WER reductions of 32% relative over our best fine-tuned Whisper model.
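
To make the "32% relative" figure concrete, the sketch below computes word error rate as word-level edit distance divided by reference length, and a relative reduction as the drop in WER divided by the baseline WER. The function names are illustrative, and no text normalization is applied, which real ASR scoring usually includes.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction, e.g. 0.32 for a 32% relative improvement."""
    return (baseline_wer - new_wer) / baseline_wer
```

With hypothetical numbers, a baseline WER of 25% and a new WER of 17% give a relative reduction of 32%, although the absolute drop is only 8 points.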

[193] Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback

Junior Cedric Tonga,KV Aditya Srivatsa,Kaushal Kumar Maurya,Fajri Koto,Ekaterina Kochmar

Main category: cs.CL

TL;DR: The study examines the effectiveness of large language models (LLMs) in multilingual education, using simulated tutor-student interactions to assess how multilingual hints affect learning outcomes.

Details Motivation: The authors assess how well LLMs provide instructional support across languages, especially on mathematical reasoning tasks, addressing a gap in the development of multilingual educational tools. Method: Tutor-student interactions are simulated with a stronger model acting as the tutor that generates hints and a weaker model acting as the student, covering 11 languages and multiple prompting strategies. Result: Multilingual hints significantly improve learning outcomes, particularly in low-resource languages when the feedback is in the student's native language. Conclusion: The study offers practical insights for developing effective and inclusive multilingual LLM-based educational tools. Abstract: Large language models (LLMs) have demonstrated the ability to generate formative feedback and instructional hints in English, making them increasingly relevant for AI-assisted education. However, their ability to provide effective instructional support across different languages, especially for mathematically grounded reasoning tasks, remains largely unexamined. In this work, we present the first large-scale simulation of multilingual tutor-student interactions using LLMs. A stronger model plays the role of the tutor, generating feedback in the form of hints, while a weaker model simulates the student. We explore 352 experimental settings across 11 typologically diverse languages, four state-of-the-art LLMs, and multiple prompting strategies to assess whether language-specific feedback leads to measurable learning gains. Our study examines how student input language, teacher feedback language, model choice, and language resource level jointly influence performance. Results show that multilingual hints can significantly improve learning outcomes, particularly in low-resource languages when feedback is aligned with the student's native language. These findings offer practical insights for developing multilingual, LLM-based educational tools that are both effective and inclusive.

[194] ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT

Mikołaj Pokrywka,Wojciech Kusa,Mieszko Rutkowski,Mikołaj Koszowski

Main category: cs.CL

TL;DR: The study explores how adding contextual information (such as images and product metadata) improves neural machine translation for e-commerce, and releases ConECT, a new Czech-to-Polish dataset.

Details Motivation: Neural machine translation in e-commerce suffers from word ambiguity and insufficient context, especially when data quality is poor. Method: The authors create ConECT, a dataset of 11,400 sentence pairs coupled with images and product metadata, and test vision-language and text-to-text models on it. Result: Visual context and additional information, such as the product's category path or image descriptions, improve translation quality. Conclusion: Integrating contextual information improves machine translation quality, and the new dataset is publicly available. Abstract: Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often have problems with unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end we create ConECT -- a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata consisting of 11,400 sentence pairs. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product's category path or image descriptions. The results of our study demonstrate that the incorporation of contextual information leads to an improvement in the quality of machine translation. We make the new dataset publicly available.

[195] From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation

Adrian Marius Dumitran,Theodor-Pierre Moroianu,Vasile Paul Alexe

Main category: cs.CL

TL;DR: This paper evaluates large language models (LLMs) on university-level algorithms exams and finds that the latest models perform very well but still have difficulty with graph-related tasks.

Details Motivation: The authors study how well LLMs solve complex algorithmic problems and what potential they have in education. Method: Multiple models are tested on a Romanian exam and its English translation to analyze problem-solving ability, consistency, and multilingual performance. Result: The most recent models score close to top students and show strong reasoning on complex, multi-step algorithmic tasks, though graph tasks remain challenging. Conclusion: LLMs have potential in education: they can generate high-quality editorial content to support feedback to students, paving the way for further integration of generative AI in algorithm education. Abstract: This paper presents a comprehensive evaluation of the performance of state-of-the-art Large Language Models (LLMs) on challenging university-level algorithms exams. By testing multiple models on both a Romanian exam and its high-quality English translation, we analyze LLMs' problem-solving capabilities, consistency, and multilingual performance. Our empirical study reveals that the most recent models not only achieve scores comparable to top-performing students but also demonstrate robust reasoning skills on complex, multi-step algorithmic challenges, even though difficulties remain with graph-based tasks. Building on these findings, we explore the potential of LLMs to support educational environments through the generation of high-quality editorial content, offering instructors a powerful tool to enhance student feedback. The insights and best practices discussed herein pave the way for further integration of generative AI in advanced algorithm education.

[196] Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering

Andres Carofilis,Pradeep Rangappa,Srikanth Madikeri,Shashi Kumar,Sergio Burdisso,Jeena Prakash,Esau Villatoro-Tello,Petr Motlicek,Bidisha Sharma,Kadri Hacioglu,Shankar Venkatesan,Saurabh Vyas,Andreas Stolcke

Main category: cs.CL

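TL;DR: The paper proposes an incremental semi-supervised learning pipeline that combines a small in-domain labeled set with auxiliary data from a related domain and filters pseudo-labels by multi-model consensus or named entity recognition, significantly improving ASR performance.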

Details Motivation: To improve ASR models when in-domain labeled data is scarce by exploiting unlabeled audio and labeled data from related domains. Method: Proposes an incremental semi-supervised learning pipeline that combines a small in-domain labeled set with an auxiliary dataset, then selects and iteratively refines pseudo-labels using multi-model consensus or NER filtering. Result: On the Wow call center and Fisher English corpora, consensus-based filtering yields up to 22.3% and 24.8% relative improvement, respectively, over single-step fine-tuning with random selection; NER is the second-best filter but is computationally cheaper. Conclusion: Consensus filtering performs best, NER filtering is more attractive in computational cost, and the incremental pipeline clearly outperforms single-step fine-tuning. Abstract: Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce, but unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, it outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
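
To make the idea of consensus-based pseudo-label filtering concrete, here is a minimal Python sketch. The abstract does not say how agreement between models is measured, so exact-match voting on normalized transcripts, the function name `consensus_filter`, and the `min_agreement` parameter are all assumptions for illustration and not the authors' implementation.

```python
from collections import Counter


def consensus_filter(hypotheses_per_utt, min_agreement=1.0):
    """Keep utterances whose model transcripts agree (hypothetical helper).

    hypotheses_per_utt: dict mapping utterance id -> list of transcripts,
        one transcript per ASR model.
    min_agreement: fraction of models that must produce the same
        (normalized) transcript for the utterance to be kept.
    Returns a dict mapping utterance id -> agreed pseudo-label.
    """
    selected = {}
    for utt_id, hyps in hypotheses_per_utt.items():
        if not hyps:
            continue  # no hypotheses, nothing to vote on
        # Assumed normalization: lowercase and collapse whitespace.
        normalized = [" ".join(h.lower().split()) for h in hyps]
        text, votes = Counter(normalized).most_common(1)[0]
        if votes / len(normalized) >= min_agreement:
            selected[utt_id] = text
    return selected
```

Utterances that pass the filter would then be added to the training set for the next fine-tuning round. A softer criterion, such as a pairwise word-error-rate threshold, could replace exact matching.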

[197] SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View

Yongjie Xiao,Hongru Liang,Peixin Qin,Yao Zhang,Wenqiang Lei

Main category: cs.CL

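TL;DR: The paper proposes SCOP, a framework that evaluates the comprehension ability of large language models (LLMs) from a cognitive view; it finds that LLMs still fall short of expert level and can be unreliable, and it suggests directions for improvement.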

Details Motivation: Despite the great potential of LLMs in machine comprehension, there is no sound explanation of whether their comprehension process matches that of experts, so a systematic evaluation is needed. Method: Proposes the SCOP framework, which defines five comprehension skills, constructs test data for them, and analyzes open-source and closed-source LLMs in detail. Result: LLMs struggle to reach expert-level comprehension and can be unreliable, reaching correct answers through flawed comprehension processes. Conclusion: Work on improving LLMs should pay more attention to the comprehension process and ensure that all comprehension skills are developed during training. Abstract: Despite the great potential of large language models (LLMs) in machine comprehension, it is still unsettling to rely on them fully in real-world scenarios. This is probably because there is no rational explanation for whether the comprehension process of LLMs is aligned with that of experts. In this paper, we propose SCOP to carefully examine how LLMs perform during the comprehension process from a cognitive view. Specifically, it is equipped with a systematic definition of five requisite skills during the comprehension process, a strict framework to construct testing data for these skills, and a detailed analysis of advanced open-source and closed-source LLMs using the testing data. With SCOP, we find that it is still challenging for LLMs to perform an expert-level comprehension process. Even so, we notice that LLMs share some similarities with experts, e.g., performing better at comprehending local information than global information. Further analysis reveals that LLMs can be somewhat unreliable -- they might reach correct answers through flawed comprehension processes. Based on SCOP, we suggest that one direction for improving LLMs is to focus more on the comprehension process, ensuring all comprehension skills are thoroughly developed during training.

[198] ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

Zhenran Xu,Xue Yang,Yiyu Wang,Qingli Hu,Zijiao Wu,Longyue Wang,Weihua Luo,Kaifu Zhang,Baotian Hu,Min Zhang

Main category: cs.CL

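TL;DR: ComfyUI-Copilot is an LLM-powered plugin that aims to improve the usability and efficiency of the ComfyUI platform, addressing the challenges newcomers face through intelligent node recommendation and one-click workflow construction.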

Details Motivation: ComfyUI is flexible and user-friendly, but newcomers face limited documentation and complex model configuration; ComfyUI-Copilot aims to address these problems. Method: Uses a hierarchical multi-agent framework with a central assistant agent and specialized worker agents, backed by knowledge bases that support debugging and deployment. Result: Offline evaluation and user feedback show that the plugin recommends nodes accurately and speeds up workflow development, lowering the entry barrier for beginners and improving efficiency for experienced users. Conclusion: ComfyUI-Copilot effectively addresses ComfyUI's usability problems and provides substantial help to users at different skill levels. Abstract: We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.

[199] Controlling Summarization Length Through EOS Token Weighting

Zeno Belligoli,Emmanouil Stergiadis,Eran Fainman,Ilya Gusev

Main category: cs.CL

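TL;DR: Proposes a simple method for controlling the length of generated text by adjusting the importance of the EOS token in the cross-entropy loss; it works across multiple models and decoding algorithms.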

Details Motivation: Existing methods usually require complex model modifications, which limits compatibility with pre-trained models, so a more general solution is needed. Method: Controls generation length by increasing the importance of correctly predicting the EOS token in the cross-entropy loss, without changing the model architecture. Result: The method controls generation length across several models (encoder-decoder and GPT-style LLMs), usually without harming summary quality. Conclusion: The method is simple and general, and it is orthogonal to existing inference-time techniques. Abstract: Controlling the length of generated text can be crucial in various text-generation tasks, including summarization. Existing methods often require complex model alterations, limiting compatibility with pre-trained models. We address these limitations by developing a simple approach for controlling the length of automatic text summaries by increasing the importance of correctly predicting the EOS token in the cross-entropy loss computation. The proposed methodology is agnostic to architecture and decoding algorithms and orthogonal to other inference-time techniques to control the generation length. We tested it with encoder-decoder and modern GPT-style LLMs, and show that this method can control generation length, often without affecting the quality of the summary.
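
As an illustration of what "increasing the importance of the EOS token in the cross-entropy loss" can look like, the sketch below computes a token-level cross-entropy in plain Python in which positions whose target is EOS receive a larger weight. The weight value, the normalization by the sum of weights, and the names `weighted_cross_entropy`, `eos_id`, and `eos_weight` are assumptions; the abstract does not specify the exact weighting scheme.

```python
import math


def weighted_cross_entropy(logits, targets, eos_id, eos_weight=5.0):
    """Cross-entropy where EOS targets count eos_weight times (a sketch).

    logits: list of per-position lists of raw scores over the vocabulary.
    targets: list of gold token ids, same length as logits.
    Returns the weighted mean negative log-likelihood.
    """
    total, weight_sum = 0.0, 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        nll = log_z - scores[target]
        weight = eos_weight if target == eos_id else 1.0
        total += weight * nll
        weight_sum += weight
    # Assumed normalization: divide by the total weight.
    return total / weight_sum
```

With `eos_weight=1.0` this reduces to the ordinary mean cross-entropy. Larger values penalize mistakes at the position where the sequence should end more heavily, which is the lever the abstract describes for steering length.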

[200] Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers

Yutao Hou,Zeguan Xiao,Fei Yu,Yihan Jiang,Xuetao Wei,Hailiang Huang,Yun Chen,Guanhua Chen

Main category: cs.CL

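TL;DR: AR-Checker is a framework that automatically generates variants of mathematical problems to stress-test the robustness of large language models while avoiding data contamination.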

Details Motivation: Large language models perform well on complex reasoning tasks but can fail unexpectedly on simple ones, and existing evaluation methods risk data contamination. Method: Generates mathematical problem variants that preserve the original semantics but may cause the LLM to fail, using multi-round parallel LLM-based rewriting and verification. Result: Performs strongly on mathematical tasks such as GSM8K and MATH-500, and its effectiveness is also confirmed on non-mathematical benchmarks such as MMLU. Conclusion: AR-Checker generates test variants dynamically, making robustness evaluation considerably more reliable. Abstract: Large language models (LLMs) have achieved distinguished performance on various reasoning-intensive tasks. However, LLMs might still face robustness issues and fail unexpectedly in some simple reasoning tasks. Previous works evaluate LLM robustness with hand-crafted templates or a limited set of perturbation rules, which leaves them exposed to potential data contamination in pre-training or fine-tuning datasets. In this work, inspired by stress testing in software engineering, we propose a novel framework, Automatic Robustness Checker (AR-Checker), to generate mathematical problem variants that maintain the semantic meaning of the original problem but may cause the LLMs to fail. The AR-Checker framework generates mathematical problem variants through multi-round parallel streams of LLM-based rewriting and verification. Our framework can generate benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the strong performance of AR-Checker on mathematical tasks. We also evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro, and CommonsenseQA, where it also achieves strong performance, further proving the effectiveness of AR-Checker.

[201] TALL -- A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages

Moshe Ofer,Orel Zamler,Amos Azaria

Main category: cs.CL

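TL;DR: The TALL architecture combines an LLM with bilingual translation models, significantly improving performance on low-resource languages while remaining computationally efficient.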

Details Motivation: To address the poor performance of LLMs on low-resource languages caused by limited training data. Method: Integrates an LLM with bilingual translation models, converts low-resource inputs into high-resource representations through dimension alignment layers and custom transformers, and adopts a parameter-efficient training strategy. Result: In experiments on Hebrew, TALL significantly outperforms baselines including direct use, naive translation, and fine-tuning. Conclusion: TALL offers an efficient and effective solution for low-resource languages. Abstract: Large Language Models (LLMs) excel in high-resource languages but struggle with low-resource languages due to limited training data. This paper presents TALL (Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages), which integrates an LLM with two bilingual translation models. TALL transforms low-resource inputs into high-resource representations, leveraging the LLM's capabilities while preserving linguistic features through dimension alignment layers and custom transformers. Our experiments on Hebrew demonstrate significant improvements over several baselines, including direct use, naive translation, and fine-tuning approaches. The architecture employs a parameter-efficient strategy, freezing pre-trained components while training only lightweight adapter modules, balancing computational efficiency with performance gains.

[202] Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Noy Sternlicht,Ariel Gera,Roy Bar-Haim,Tom Hope,Noam Slonim

Main category: cs.CL

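TL;DR: The paper introduces debate speech evaluation as a new benchmark for assessing LLM judges, analyzes how LLMs perform when evaluating debate speeches, and compares them with human judges.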

Details Motivation: Evaluating debate speeches requires deep, multi-level understanding, yet systematic evaluation of LLMs on this ability has been limited. Method: Uses a dataset of over 600 annotated debate speeches to analyze how state-of-the-art LLMs differ from human judges. Result: Larger models approximate human judgments in some respects, but their overall judgment behavior differs substantially; frontier LLMs can generate persuasive speeches at a human level. Conclusion: Debate speech evaluation offers a new perspective on LLM capabilities, revealing both the gaps between model and human judgment and the models' potential. Abstract: We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.

[203] Does It Make Sense to Speak of Introspection in Large Language Models?

Iulia Comşa,Murray Shanahan

Main category: cs.CL

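TL;DR: The paper examines whether the self-reports of large language models (LLMs) can be regarded as introspection, and assesses the question through two examples.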

Details Motivation: As the linguistic and cognitive capabilities of LLMs grow, their self-reports raise the question of whether they have introspective ability, by analogy with the link between introspection and consciousness in humans. Method: Analyzes two examples of LLM self-report: one describing the process behind its "creative" writing, the other inferring the value of its own temperature parameter. Result: The first example is not considered valid introspection, whereas the second is considered a minimal example of introspection, albeit without conscious experience. Conclusion: Some LLM behavior can be regarded as introspection, but how it differs in nature from human introspection needs further study. Abstract: Large language models (LLMs) exhibit compelling linguistic behaviour, and sometimes offer self-reports, that is to say statements about their own nature, inner workings, or behaviour. In humans, such reports are often attributed to a faculty of introspection and are typically linked to consciousness. This raises the question of how to interpret self-reports produced by LLMs, given their increasing linguistic fluency and cognitive capabilities. To what extent (if any) can the concept of introspection be meaningfully applied to LLMs? Here, we present and critique two examples of apparent introspective self-report from LLMs. In the first example, an LLM attempts to describe the process behind its own "creative" writing, and we argue this is not a valid example of introspection. In the second example, an LLM correctly infers the value of its own temperature parameter, and we argue that this can be legitimately considered a minimal example of introspection, albeit one that is (presumably) not accompanied by conscious experience.

[204] RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

Tianjiao Li,Mengran Yu,Chenyu Shi,Yanjun Zhao,Xiaojing Liu,Qiang Zhang,Qi Zhang,Xuanjing Huang,Jiayin Wang

Main category: cs.CL

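TL;DR: The paper proposes RIVAL, a framework that uses adversarial training to address the performance drop LLMs suffer on colloquial subtitle translation because of distributional shift.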

Details Motivation: LLMs trained with RLHF perform poorly on colloquial subtitle translation because the offline reward model gradually diverges from the online LLM under distributional shift. Method: Proposes RIVAL, an adversarial training framework that models the optimization of the reward model and the LLM as a min-max game and adds a quantitative preference reward (e.g., BLEU) to improve stability. Result: Experiments show that RIVAL significantly improves translation performance. Conclusion: RIVAL effectively addresses distributional shift through adversarial training and improves LLM performance on colloquial subtitle translation. Abstract: Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to enhance its translations to close this gap. To stabilize training and improve generalizability, we also incorporate quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.

[205] Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation

Soumitra Ghosh,Gopendra Vikram Singh,Shambhavi,Sabarna Choudhury,Asif Ekbal

Main category: cs.CL

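TL;DR: The paper proposes a method that improves large language models' (LLMs') understanding of self-harm intent through the subtle interplay between language and emojis, and releases the CESM-100 and SHINES resources.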

Details Motivation: Self-harm detection on social media is critical for early intervention and mental health support, but current LLMs struggle to interpret implicit expressions. Method: Enriches inputs with CESM-100, fine-tunes LLMs with multi-task learning, and generates explainable rationales for self-harm predictions. Result: Evaluated on three LLMs, combining intent differentiation with contextual cues significantly improves performance on both detection and explanation tasks. Conclusion: The method effectively addresses the ambiguity of self-harm signals, and the resources are publicly available. Abstract: Self-harm detection on social media is critical for early intervention and mental health support, yet remains challenging due to the subtle, context-dependent nature of such expressions. Identifying self-harm intent aids suicide prevention by enabling timely responses, but current large language models (LLMs) struggle to interpret implicit cues in casual language and emojis. This work enhances LLMs' comprehension of self-harm by distinguishing intent through nuanced language-emoji interplay. We present the Centennial Emoji Sensitivity Matrix (CESM-100), a curated set of 100 emojis with contextual self-harm interpretations, and the Self-Harm Identification aNd intent Extraction with Supportive emoji sensitivity (SHINES) dataset, offering detailed annotations for self-harm labels, casual mentions (CMs), and serious intents (SIs). Our unified framework: a) enriches inputs using CESM-100; b) fine-tunes LLMs for multi-task learning: self-harm detection (primary) and CM/SI span detection (auxiliary); c) generates explainable rationales for self-harm predictions. We evaluate the framework on three state-of-the-art LLMs (Llama 3, Mental-Alpaca, and MentalLlama) across zero-shot, few-shot, and fine-tuned scenarios. By coupling intent differentiation with contextual cues, our approach commendably enhances LLM performance in both detection and explanation tasks, effectively addressing the inherent ambiguity in self-harm signals. The SHINES dataset, CESM-100 and codebase are publicly available at: https://www.iitp.ac.in/~ai-nlp-ml/resources.html#SHINES

[206] Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin

HaoTian Lan

Main category: cs.CL

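TL;DR: The study proposes an interpretable, image-based framework for analyzing how the commercial vitality of community streets in Chinese cities relates to vehicular accessibility, environmental quality, and pedestrian perception; it finds that parked vehicle density, greenery, and street width significantly affect retail performance and user satisfaction.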

Details Motivation: To examine how the commercial vitality of community streets is shaped by complex interactions between vehicular accessibility, environmental quality, and pedestrian perception, providing an evidence base for urban design and planning. Method: Uses street view imagery and a multimodal large language model (VisualGLM-6B), builds a Community Commercial Vitality Index (CCVI) from Meituan and Dianping data, and analyzes spatial attributes extracted with GPT-4-based perception modeling. Result: Moderate vehicle presence can improve commercial access, but excessive parking reduces walkability and satisfaction; greenery and cleanliness significantly raise satisfaction but are only weakly associated with pricing; street width moderates the effect of vehicles. Conclusion: The study shows the value of combining AI-assisted perception with urban morphological analysis, with theoretical and practical implications for the non-linear drivers of community commercial vitality. Abstract: The commercial vitality of community-scale streets in Chinese cities is shaped by complex interactions between vehicular accessibility, environmental quality, and pedestrian perception. This study proposes an interpretable, image-based framework to examine how street-level features -- including parked vehicle density, greenery, cleanliness, and street width -- impact retail performance and user satisfaction in Harbin, China. Leveraging street view imagery and a multimodal large language model (VisualGLM-6B), we construct a Community Commercial Vitality Index (CCVI) from Meituan and Dianping data and analyze its relationship with spatial attributes extracted via GPT-4-based perception modeling. Our findings reveal that while moderate vehicle presence may enhance commercial access, excessive on-street parking -- especially in narrow streets -- erodes walkability and reduces both satisfaction and shop-level pricing. In contrast, streets with higher perceived greenery and cleanliness show significantly greater satisfaction scores but only weak associations with pricing. Street width moderates the effects of vehicle presence, underscoring the importance of spatial configuration. These results demonstrate the value of integrating AI-assisted perception with urban morphological analysis to capture non-linear and context-sensitive drivers of commercial success. This study advances both theoretical and methodological frontiers by highlighting the conditional role of vehicle activity in neighborhood commerce and demonstrating the feasibility of multimodal AI for perceptual urban diagnostics. The implications extend to urban design, parking management, and scalable planning tools for community revitalization.

[207] CL-ISR: A Contrastive Learning and Implicit Stance Reasoning Framework for Misleading Text Detection on Social Media

Tianyi Huang,Zikun Cui,Cuiqianhe Du,Chia-En Chiang

Main category: cs.CL

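TL;DR: The paper proposes CL-ISR, a new framework that combines contrastive learning with implicit stance reasoning to improve the accuracy of misleading-text detection on social media.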

Details Motivation: Misleading texts on social media can cause public misunderstanding, social panic, and economic losses, so detecting them is an important research direction. Method: Uses contrastive learning to strengthen the model's ability to learn semantic differences between truthful and misleading texts, and adds an implicit stance reasoning module that analyzes latent stance tendencies in the text and their relationship to related topics. Result: The CL-ISR framework significantly improves detection of misleading texts, performing especially well in linguistically complex cases. Conclusion: By combining contrastive learning and implicit stance reasoning, CL-ISR effectively improves misleading-text detection on social media. Abstract: Misleading text detection on social media platforms is a critical research area, as these texts can lead to public misunderstanding, social panic and even economic losses. This paper proposes a novel framework, CL-ISR (Contrastive Learning and Implicit Stance Reasoning), which combines contrastive learning and implicit stance reasoning to improve the detection accuracy of misleading texts on social media. First, we use a contrastive learning algorithm to improve the model's ability to learn semantic differences between truthful and misleading texts. By constructing positive and negative sample pairs, contrastive learning helps the model capture the distinguishing features between categories more effectively, particularly in linguistically complicated situations. Second, we introduce an implicit stance reasoning module to explore the potential stance tendencies in the text and their relationships with related topics. This method is effective for identifying content that misleads through stance shifting or emotional manipulation, because it can capture the implicit information behind the text. Finally, we integrate these two components into a new framework, CL-ISR, which leverages the discriminative power of contrastive learning and the interpretive depth of stance reasoning to significantly improve detection performance.

[208] The NTNU System at the S&I Challenge 2025 SLA Open Track

Hong-Yun Lin,Tien-Hong Lo,Yu-Hsuan Fang,Jhen-Ke Lin,Chung-Chun Wang,Hao-Chien Lu,Berlin Chen

Main category: cs.CL

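TL;DR: The study proposes a spoken language assessment system that combines wav2vec 2.0 with the Phi-4 multimodal large language model, addressing the respective limitations of BERT and W2V, and places second in the competition.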

Details Motivation: BERT and W2V each have limitations in spoken language assessment: BERT relies on ASR transcripts and cannot capture speech features, while W2V models acoustic features well but lacks semantic interpretability. The study aims to combine the strengths of both modalities. Method: Integrates W2V with the Phi-4 multimodal large language model through a score fusion strategy. Result: The system achieves an RMSE of 0.375 on the test set of the Speak & Improve Challenge 2025, ranking second. Conclusion: The approach effectively combines the strengths of different modalities and improves the accuracy of spoken language assessment. Abstract: A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
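
The abstract names a "score fusion strategy" and reports RMSE but gives no formula for the fusion. The sketch below shows one common form, a weighted average of two systems' scores, together with the RMSE metric. The interpolation weight `alpha` and the function names are assumptions for illustration, not the NTNU system's actual configuration.

```python
import math


def fuse_scores(w2v_scores, mllm_scores, alpha=0.5):
    """Linearly interpolate two systems' proficiency scores (assumed form)."""
    if len(w2v_scores) != len(mllm_scores):
        raise ValueError("score lists must have the same length")
    return [alpha * a + (1.0 - alpha) * b
            for a, b in zip(w2v_scores, mllm_scores)]


def rmse(predictions, references):
    """Root mean square error between predicted and reference scores."""
    if not references or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predictions, references))
    return math.sqrt(squared / len(references))
```

In practice `alpha` would be tuned on a development set by picking the value that minimizes `rmse(fuse_scores(...), dev_references)`.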

[209] DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning

Tanmay Parekh,Kartik Mehta,Ninareh Mehrabi,Kai-Wei Chang,Nanyun Peng

Main category: cs.CL

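TL;DR: The DiCoRe framework improves zero-shot event detection through divergent-convergent reasoning (Dreamer and Grounder) combined with LLM-Judge verification, raising average F1 by 4-7%.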

Details Motivation: To address the challenges of understanding complex event ontologies and extracting domain-specific triggers in zero-shot event detection, and so make large language models (LLMs) more useful for the task. Method: Proposes the DiCoRe framework, comprising Dreamer for divergent reasoning (open-ended event discovery) and Grounder for convergent reasoning (finite-state machine guided constrained decoding), with an LLM-Judge verifying the outputs. Result: In experiments on six datasets across five domains, DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, with average F1 gains of 4-7%. Conclusion: DiCoRe is a strong zero-shot event detection framework whose divergent-convergent reasoning and verification mechanism markedly improve performance. Abstract: Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately overloads and limits the utility of Large Language Models (LLMs) for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent reasoning framework that decouples the task of ED using Dreamer and Grounder. Dreamer encourages divergent reasoning through open-ended event discovery, which helps to boost event coverage. Conversely, Grounder introduces convergent reasoning to align the free-form predictions with the task-specific instructions using finite-state machine guided constrained decoding. Additionally, an LLM-Judge verifies the final outputs to ensure high precision. Through extensive experiments on six datasets across five domains and nine LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, achieving 4-7% average F1 gains over the best baseline -- establishing DiCoRe as a strong zero-shot ED framework.

[210] Information Locality as an Inductive Bias for Neural Language Models

Taiga Someya,Anej Svete,Brian DuSell,Timothy J. O'Donnell,Mario Giulianelli,Ryan Cotterell

Main category: cs.CL

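TL;DR: The paper proposes a quantitative framework that measures the inductive bias of language models via $m$-local entropy, finding that languages with higher $m$-local entropy are harder for Transformers and LSTMs to learn.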

Details Motivation: To examine whether the inductive biases of neural language models align with human processing constraints. Method: Introduces $m$-local entropy, an information-theoretic measure of the local uncertainty of a language. Result: Experiments show that languages with higher $m$-local entropy are more challenging for Transformers and LSTMs. Conclusion: Neural language models, like humans, are highly sensitive to the local statistical structure of a language. Abstract: Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce $m$-local entropy, an information-theoretic measure derived from average lossy-context surprisal, that captures the local uncertainty of a language by quantifying how effectively the $m-1$ preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSAs), we show that languages with higher $m$-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.
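
The abstract describes $m$-local entropy as quantifying how well the $m-1$ preceding symbols disambiguate the next symbol. One way to read this is as the conditional entropy of the next symbol given its $m-1$ predecessors. The sketch below computes a plug-in (count-based) estimate of that quantity from sample sequences. This reading, the estimator, and the function name `m_local_entropy` are assumptions; the paper's formal definition may treat sequence boundaries or normalization differently.

```python
import math
from collections import Counter


def m_local_entropy(sequences, m):
    """Plug-in estimate, in bits, of H(next symbol | previous m-1 symbols)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    context_counts = Counter()
    joint_counts = Counter()
    for seq in sequences:
        seq = tuple(seq)
        # Only positions with a full (m-1)-symbol context are counted.
        for i in range(m - 1, len(seq)):
            context = seq[i - m + 1:i]
            joint_counts[(context, seq[i])] += 1
            context_counts[context] += 1
    total = sum(joint_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for (context, _symbol), n in joint_counts.items():
        # p(context, symbol) * log2 p(symbol | context)
        entropy -= (n / total) * math.log2(n / context_counts[context])
    return entropy
```

For a strictly alternating string such as "abababab", the estimate with `m=1` is one bit (two equally frequent symbols), while with `m=2` it drops to zero because the previous symbol fully determines the next. That is the sense in which a larger local window can disambiguate the continuation.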

[211] AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models

Chih-Kai Yang,Neo Ho,Yi-Jyun Lee,Hung-yi Lee

Main category: cs.CL

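TL;DR: This paper presents the first in-depth analysis of how large audio-language models (LALMs) internally perceive and recognize auditory attributes. It finds that attribute information decreases with layer depth when recognition fails and that resolving attributes at earlier layers correlates with higher accuracy, and it proposes a method to enhance LALMs.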

Details Motivation: Understanding the internal mechanisms of LALMs is crucial for interpreting their behavior and improving performance. Method: Applies vocabulary projection to three state-of-the-art LALMs and tracks how attribute information evolves across layers and token positions. Result: Attribute information decreases with layer depth when recognition fails, resolving attributes at earlier layers correlates positively with accuracy, and LALMs rely on querying auditory inputs rather than aggregating information in hidden states. Conclusion: The study offers new insights into auditory attribute processing and lays the groundwork for future improvements to LALMs. Abstract: Understanding the internal mechanisms of large audio-language models (LALMs) is crucial for interpreting their behavior and improving performance. This work presents the first in-depth analysis of how LALMs internally perceive and recognize auditory attributes. By applying vocabulary projection on three state-of-the-art LALMs, we track how attribute information evolves across layers and token positions. We find that attribute information generally decreases with layer depth when recognition fails, and that resolving attributes at earlier layers correlates with better accuracy. Moreover, LALMs heavily rely on querying auditory inputs for predicting attributes instead of aggregating necessary information in hidden states at attribute-mentioning positions. Based on our findings, we demonstrate a method to enhance LALMs. Our results offer insights into auditory attribute processing, paving the way for future improvements.
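
Vocabulary projection is commonly understood as mapping an intermediate hidden state through the model's output (unembedding) matrix to see which tokens it already favors. The toy sketch below follows that reading in plain Python, tracking the probability assigned to one attribute token at each layer. It is a simplified illustration under that assumption: real models usually apply a final normalization layer before the projection, and the names `project_to_vocab` and `attribute_probability_by_layer` are hypothetical.

```python
import math


def project_to_vocab(hidden, unembedding):
    """Softmax over vocabulary logits obtained from one hidden state.

    hidden: list of floats (model dimension).
    unembedding: list of rows, one per vocabulary item, each of model dimension.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def attribute_probability_by_layer(hidden_states, unembedding, token_id):
    """Probability of the attribute token at each layer, in layer order."""
    return [project_to_vocab(h, unembedding)[token_id] for h in hidden_states]
```

Plotting the returned list against layer index gives the kind of curve the abstract describes, where one can see whether attribute information rises, persists, or fades with depth.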

[212] Do Large Language Models Judge Error Severity Like Humans?

Diege Sun,Guanyi Chen,Fan Zhao,Xiaorong Cheng,Tingting He

Main category: cs.CL

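TL;DR: The study compares how humans and large language models (LLMs) assess the severity of errors in image descriptions, finding that LLM ratings of gender and colour errors diverge from human ratings and that only a few models partially emulate human judgments.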

Details Motivation: To examine whether LLMs can accurately replicate human judgments of error severity in natural language generation. Method: Extends the experimental framework of van Miltenburg et al. to unimodal and multimodal settings, evaluating four error types (age, gender, clothing type, and clothing colour). Result: Humans assign different severity to different error types, and visual context significantly affects the perception of colour and type errors; most LLMs rate gender and colour errors differently from humans, with only Doubao and DeepSeek-V3 performing relatively well. Conclusion: LLMs are limited in emulating human judgments of error severity; some models do better than others, but there is still room for improvement. Abstract: Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons. This suggests that these models may have internalised social norms influencing gender judgments but lack the perceptual grounding to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao, replicates the human-like ranking of error severity, but it fails to distinguish between error types as clearly as humans. Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments across both unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.

[213] Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation

Chenyu Lin,Yilin Wen,Du Su,Fei Sun,Muhan Chen,Chenfu Bao,Zhonghou Lv

Main category: cs.CL

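TL;DR: The paper proposes Knowledgeable-r1, which uses joint sampling and multi-policy distributions to explore knowledge capability, counteracting the over-reliance on retrieved context in retrieval-augmented generation (RAG) and improving robustness and reasoning accuracy.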

Details Motivation: Current RAG systems rely too heavily on retrieved context and may overlook the model's inherent knowledge, performing poorly when the retrieved information is misleading or excessive. Method: Proposes Knowledgeable-r1, which uses joint sampling and multi-policy distributions for knowledge capability exploration, encouraging the model to integrate its parametric and contextual knowledge. Result: Experiments show that Knowledgeable-r1 significantly improves robustness and reasoning accuracy, outperforming baselines by 17.07% in counterfactual scenarios and showing consistent gains across RAG tasks. Conclusion: Knowledgeable-r1 effectively addresses RAG systems' over-reliance on retrieved context and improves model performance. Abstract: Retrieval-augmented generation (RAG) is a mainstream method for improving performance on knowledge-intensive tasks. However, current RAG systems often place too much emphasis on retrieved contexts. This can lead to reliance on inaccurate sources and overlook the model's inherent knowledge, especially when dealing with misleading or excessive information. To resolve this imbalance, we propose Knowledgeable-r1, which uses joint sampling and defines multi-policy distributions in knowledge capability exploration to stimulate large language models' self-integrated utilization of parametric and contextual knowledge. Experiments show that Knowledgeable-r1 significantly enhances robustness and reasoning accuracy on both parametric/contextual knowledge conflict tasks and general RAG tasks, especially outperforming baselines by 17.07% in counterfactual scenarios and demonstrating consistent gains across RAG tasks. Our code is available at https://github.com/lcy80366872/knowledgeable-r1.

[214] Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

Bhavik Chandna,Zubair Bashir,Procheta Sen

Main category: cs.CL

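TL;DR: The paper uses mechanistic interpretability to analyze social, demographic, and gender biases in GPT-2 and Llama2, finding that bias-related computation is concentrated in a small number of layers and that removing those components reduces biased outputs but also affects other NLP tasks.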

Details Motivation: Large language models (LLMs) often exhibit social, demographic, and gender biases inherited from their training data; the study uses mechanistic interpretability to reveal how these biases are structurally represented. Method: Focusing on demographic and gender biases, uses different metrics to identify the internal structures responsible for biased behavior, assesses their stability, localization, and generalizability, and validates them through systematic ablations. Result: Bias-related computation is highly concentrated in a small subset of layers and changes across fine-tuning settings; removing these components reduces biased outputs but also affects other NLP tasks. Conclusion: Bias is localized within the model, and removing bias components can harm other tasks, so the trade-off must be weighed carefully. Abstract: Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment, because important components are shared with these tasks.

[215] ECoRAG: Evidentiality-guided Compression for Long Context RAG

Yeonseok Jeong,Jinsu Kim,Dohyeon Lee,Seung-won Hwang

Main category: cs.CL

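TL;DR: The ECoRAG framework compresses retrieved documents based on evidentiality, improving LLM performance on open-domain question answering while reducing cost.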

Details Motivation: Existing compression methods do not filter out non-evidential information, which limits LLM performance in RAG. Method: Proposes the ECoRAG framework, which compresses documents based on evidentiality and continues retrieving when the evidence is insufficient. Result: Experiments show that ECoRAG outperforms existing compression methods on ODQA tasks at lower cost. Conclusion: By optimizing evidentiality-based compression, ECoRAG significantly improves LLM performance and efficiency. Abstract: Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce the RAG overhead that comes from longer contexts, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits performance in LLM-based RAG. We thus propose the Evidentiality-guided RAG framework, or ECoRAG. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, ensuring that answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects on whether the compressed content provides sufficient evidence and, if not, retrieves more until it is sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the necessary information to generate the correct answer. Code is available at https://github.com/ldilab/ECoRAG.
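
The compress-then-reflect loop described in the abstract can be sketched as a small control-flow skeleton. Everything here is hypothetical scaffolding: the callables `retrieve`, `evidentiality`, and `is_sufficient`, the score threshold, and the batch-wise retrieval schedule are assumptions chosen to show the loop structure, not ECoRAG's actual components or API.

```python
def compress_with_reflection(question, retrieve, evidentiality, is_sufficient,
                             threshold=0.5, batch_size=5, max_rounds=3):
    """Keep only evidential passages, retrieving more until evidence suffices.

    retrieve(question, offset=..., limit=...) -> list of passages (hypothetical).
    evidentiality(question, passage) -> score in [0, 1] (hypothetical).
    is_sufficient(question, kept) -> bool, the reflection step (hypothetical).
    """
    kept = []
    for round_index in range(max_rounds):
        passages = retrieve(question,
                            offset=round_index * batch_size,
                            limit=batch_size)
        if not passages:
            break  # nothing left to retrieve
        # Compression step: drop passages judged non-evidential.
        kept.extend(p for p in passages
                    if evidentiality(question, p) >= threshold)
        # Reflection step: stop as soon as the kept evidence is enough.
        if is_sufficient(question, kept):
            break
    return kept
```

The returned passages would form the compressed context passed to the LLM. Capping the loop with `max_rounds` is a design choice of this sketch to bound latency when no sufficient evidence is ever found.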

[216] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang,Mingxin Li,Dingkun Long,Xin Zhang,Huan Lin,Baosong Yang,Pengjun Xie,An Yang,Dayiheng Liu,Junyang Lin,Fei Huang,Jingren Zhou

Main category: cs.CL

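TL;DR: The Qwen3 Embedding series provides text embedding and reranking models built on the Qwen3 foundation models; a multi-stage training pipeline and model merging yield models in several sizes that perform strongly across multilingual and diverse tasks.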

Details Motivation: Improve text embedding and reranking capabilities, meet the needs of multilingual and multi-domain tasks, and offer efficient model deployment options. Method: Combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets, uses the Qwen3 LLMs to synthesize diverse training data, and applies model merging to improve robustness. Result: Achieves state-of-the-art results on multilingual benchmarks such as MTEB and on retrieval tasks, supporting applications including code retrieval and cross-lingual retrieval. Conclusion: Through its training pipeline and model design, the Qwen3 Embedding series performs strongly across multilingual and diverse tasks, and the models are openly released to support community research. Abstract: In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.

[217] Counterfactual reasoning: an analysis of in-context emergence

Moritz Miller,Bernhard Schölkopf,Siyuan Guo

Main category: cs.CL

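TL;DR: The paper studies in-context counterfactual reasoning in large-scale neural language models (LMs), that is, predicting the consequences of changes under hypothetical scenarios. It verifies this ability on a linear regression task and examines how self-attention, model depth, and data diversity affect performance.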

Details Motivation: Explore the counterfactual reasoning ability of language models, particularly in the in-context learning setting without parameter updates. Method: Uses a linear regression task as a synthetic testbed in which accurate prediction requires inferring and copying noise, and analyzes the effects of self-attention, model depth, and data diversity. Result: Language models can reason counterfactually in the controlled setup, and this ability extends to a broader class of functions; Transformers can also perform noise abduction on sequential data. Conclusion: Language models show potential for counterfactual reasoning, providing preliminary evidence for tasks such as counterfactual story generation. Abstract: Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning: the ability to learn from and reason about the input context on the fly without parameter update. This work studies in-context counterfactual reasoning in language models, that is, to predict the consequences of changes under hypothetical scenarios. We focus on studying a well-defined synthetic setup: a linear regression task that requires noise abduction, where accurate prediction is based on inferring and copying the contextual noise from factual observations. We show that language models are capable of counterfactual reasoning in this controlled setup and provide insights that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; we find self-attention, model depth, and data diversity in pre-training drive performance in Transformers. More interestingly, our findings extend beyond regression tasks and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under https://github.com/moXmiller/counterfactual-reasoning.git .

[218] RELIC: Evaluating Compositional Instruction Following via Language Recognition

Jackson Petty,Michael Y. Hu,Wentao Wang,Shauli Ravfogel,William Merrill,Tal Linzen

Main category: cs.CL

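TL;DR: The RELIC framework evaluates LLM instruction following through language recognition tasks and finds that current state-of-the-art LLMs perform near chance on complex grammars and samples and tend to rely on shallow heuristics.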

Details Motivation: Evaluate the ability of LLMs to perform tasks based only on an in-context task specification (instruction following) while avoiding data contamination. Method: Introduces the RELIC framework, which uses language recognition tasks generated from formal grammars whose complexity can be adjusted. Result: LLM accuracy can be predicted from grammar and sample complexity, and performance is near chance on complex tasks. Conclusion: Current LLMs are limited on complex instruction-following tasks and tend to fall back on heuristics rather than deep reasoning. Abstract: Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context, without examples of inputs and outputs; this ability is referred to as instruction following. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition: the task of determining if a string is generated by a formal grammar. Unlike many standard evaluations of LLMs' ability to use their context, this task requires composing together a large number of instructions (grammar productions) retrieved from the context. Because the languages are synthetic, the task can be increased in complexity as LLMs' skills improve, and new instances can be automatically generated, mitigating data contamination. We evaluate state-of-the-art LLMs on RELIC and find that their accuracy can be reliably predicted from the complexity of the grammar and the individual example strings, and that even the most advanced LLMs currently available show near-chance performance on more complex grammars and samples, in line with theoretical expectations. We also use RELIC to diagnose how LLMs attempt to solve increasingly difficult reasoning tasks, finding that as the complexity of the language recognition task increases, models switch to relying on shallow heuristics instead of following complex instructions.
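
To make the language recognition task concrete, the sketch below decides whether a string is generated by a context-free grammar using the classic CYK algorithm. It assumes the grammar is given in Chomsky normal form; the function name, the rule encoding, and the toy grammar for a^n b^n are illustrative choices and are not taken from the RELIC paper, whose grammars and ground-truth procedure may differ.

```python
from itertools import product


def cyk_recognize(tokens, unary_rules, binary_rules, start="S"):
    """Return True if `tokens` is generated by a grammar in Chomsky normal form.

    unary_rules:  dict mapping a terminal to the set of nonterminals A with A -> terminal
    binary_rules: dict mapping a pair (B, C) to the set of nonterminals A with A -> B C
    """
    n = len(tokens)
    if n == 0:
        return False  # this sketch does not handle the empty string
    # table[i][j] holds the nonterminals that derive tokens[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = set(unary_rules.get(tok, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for b, c in product(left, right):
                    table[i][span - 1] |= binary_rules.get((b, c), set())
    return start in table[0][n - 1]


# Toy grammar for { a^n b^n : n >= 1 }:  S -> A B | A T,  T -> S B,  A -> 'a',  B -> 'b'
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}

in_language = cyk_recognize(list("aabb"), UNARY, BINARY)
not_in_language = cyk_recognize(list("aab"), UNARY, BINARY)
```

A recognizer of this kind gives exact labels for automatically generated strings, which is what lets a synthetic benchmark scale in difficulty without manual annotation.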

[219] The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Nikhil Kandpal,Brian Lester,Colin Raffel,Sebastian Majstorovic,Stella Biderman,Baber Abbasi,Luca Soldaini,Enrico Shippole,A. Feder Cooper,Aviya Skowron,John Kirchenbauer,Shayne Longpre,Lintang Sutawika,Alon Albalak,Zhenlin Xu,Guilherme Penedo,Loubna Ben Allal,Elie Bakouch,John David Pressman,Honglu Fan,Dashiell Stander,Guangyu Song,Aaron Gokaslan,Tom Goldstein,Brian R. Bartoldson,Bhavya Kailkhura,Tyler Murray

Main category: cs.CL

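TL;DR: The paper presents Common Pile v0.1, an 8TB dataset of openly licensed text for training large language models (LLMs), addressing the small size or low quality of prior openly licensed datasets. Two 7B-parameter LLMs trained on it (Comma v0.1-1T and Comma v0.1-2T) perform comparably to models trained on unlicensed text.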

Details Motivation: Address the intellectual property and ethical concerns raised by training LLMs on unlicensed text, and fill the gap left by openly licensed datasets that are too small. Method: Collects, curates, and releases the Common Pile v0.1 dataset, comprising diverse content from 30 sources, and trains two 7B-parameter LLMs (Comma v0.1-1T and Comma v0.1-2T) to validate it. Result: The Comma v0.1 models perform comparably to similarly sized models trained on unlicensed text, such as Llama 1 and 2 7B. Conclusion: Common Pile v0.1 offers a viable, high-quality dataset for training open LLMs; the dataset, code, and model checkpoints are released to advance open LLM development. Abstract: Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

[220] Improving Low-Resource Morphological Inflection via Self-Supervised Objectives

Adam Wiemerslage,Katharina von der Wense

Main category: cs.CL

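TL;DR: The paper studies the effectiveness of self-supervised auxiliary tasks for morphological inflection (a character-level task) in extremely low-resource settings. Autoencoding performs best when data is very scarce, while character masked language modeling (CMLM) becomes more effective as data increases. Sampling masks based on known morpheme boundaries consistently improves performance.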

Details Motivation: Explore the potential of self-supervised objectives for character-level tasks such as morphological inflection, especially in low-resource languages, to advance tasks relevant to language documentation. Method: Trains encoder-decoder transformers across 19 languages and 13 auxiliary objectives, comparing self-supervised tasks such as autoencoding and CMLM. Result: Autoencoding performs best when data is very scarce, CMLM becomes more effective as data increases, and sampling masks based on morpheme boundaries consistently improves performance. Conclusion: Self-supervised tasks show promise for low-resource morphological modeling, especially masking strategies that use morpheme boundary information, pointing to a direction for future work. Abstract: Self-supervised objectives have driven major advances in NLP by leveraging large-scale unlabeled data, but such resources are scarce for many of the world's languages. Surprisingly, they have not been explored much for character-level tasks, where smaller amounts of data have the potential to be beneficial. We investigate the effectiveness of self-supervised auxiliary tasks for morphological inflection -- a character-level task highly relevant for language documentation -- in extremely low-resource settings, training encoder-decoder transformers for 19 languages and 13 auxiliary objectives. Autoencoding yields the best performance when unlabeled data is very limited, while character masked language modeling (CMLM) becomes more effective as data availability increases. Though objectives with stronger inductive biases influence model predictions intuitively, they rarely outperform standard CMLM. However, sampling masks based on known morpheme boundaries consistently improves performance, highlighting a promising direction for low-resource morphological modeling.

[221] Towards a Unified System of Representation for Continuity and Discontinuity in Natural Language

Ratna Kandala,Prakash Mondal

Main category: cs.CL

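TL;DR: This paper proposes a unified system for representing continuous and discontinuous structures in natural language, combining features of Phrase Structure Grammar, Dependency Grammar, and Categorial Grammar.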

Details Motivation: Address the fact that different grammar frameworks analyze discontinuous structures independently and without converging. Method: Combines the constituency of Phrase Structure Grammar, the head-dependent relations of Dependency Grammar, and the functor-argument relations of Categorial Grammar into a unified mathematical derivation. Result: Shows that continuous and discontinuous structures can be analyzed within a single unified system. Conclusion: Argues that the three grammar formalisms can be used together to analyze linguistic structure. Abstract: Syntactic discontinuity is a grammatical phenomenon in which a constituent is split into more than one part because of the insertion of an element which is not part of the constituent. This is observed in many languages across the world such as Turkish, Russian, Japanese, Warlpiri, Navajo, Hopi, Dyirbal, Yidiny etc. Different formalisms/frameworks in current linguistic theory approach the problem of discontinuous structures in different ways. Each framework/formalism has widely been viewed as an independent and non-converging system of analysis. In this paper, we propose a unified system of representation for both continuity and discontinuity in structures of natural languages by taking into account three formalisms, in particular, Phrase Structure Grammar (PSG) for its widely used notion of constituency, Dependency Grammar (DG) for its head-dependent relations, and Categorial Grammar (CG) for its focus on functor-argument relations. We attempt to show that discontinuous expressions as well as continuous structures can be analysed through a unified mathematical derivation incorporating the representations of linguistic structure in these three grammar formalisms.

[222] CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection

Ron Eliav,Arie Cattan,Eran Hirsch,Shahaf Bassan,Elias Stengel-Eskin,Mohit Bansal,Ido Dagan

Main category: cs.CL

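TL;DR: The paper proposes a systematic reasoning approach that decomposes text and verifies each sub-fact, improving LLM performance on hallucination detection.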

Details Motivation: Existing methods cast hallucination detection as a natural language inference task, but such complex reasoning calls for a more explicit reasoning process. Method: Defines a three-step reasoning process (claim decomposition, sub-claim attribution and entailment classification, and aggregated classification) and introduces metrics that assess the quality of the intermediate steps. Result: Experiments show that systematic reasoning significantly improves hallucination detection accuracy. Conclusion: Guiding models toward finer-grained reasoning effectively improves hallucination detection performance. Abstract: A common approach to hallucination detection casts it as a natural language inference (NLI) task, often using LLMs to classify whether the generated text is entailed by corresponding reference texts. Since entailment classification is a complex reasoning task, one would expect that LLMs could benefit from generating an explicit reasoning process, as in CoT reasoning or the explicit "thinking" of recent reasoning models. In this work, we propose that guiding such models to perform a systematic and comprehensive reasoning process -- one that both decomposes the text into smaller facts and also finds evidence in the source for each fact -- allows models to execute much finer-grained and accurate entailment decisions, leading to increased performance. To that end, we define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection. Following this reasoning framework, we introduce an analysis scheme, consisting of several metrics that measure the quality of the intermediate reasoning steps, which provided additional empirical evidence for the improved quality of our guided reasoning scheme.

[223] Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning

Nan Huo,Jinyang Li,Bowen Qin,Ge Qu,Xiaolong Li,Xiaodong Li,Chenhao Ma,Reynold Cheng

Main category: cs.CL

TL;DR: The Micro-Act framework addresses knowledge conflicts in RAG systems through a hierarchical action space, significantly improving QA accuracy.

Details Motivation: In RAG systems, conflicts between external knowledge and the LLM's internal knowledge hurt downstream task performance, and existing methods are of limited effectiveness because of lengthy contexts. Method: Proposes the Micro-Act framework, which perceives context complexity and decomposes each knowledge source into fine-grained comparison steps. Result: Significantly outperforms existing methods on five benchmark datasets, especially on temporal and semantic conflict types. Conclusion: Micro-Act is robust on both conflict and non-conflict questions and has practical value. Abstract: Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge Conflicts, where retrieved external knowledge contradicts the inherent, parametric knowledge of large language models (LLMs). It adversely affects performance on downstream tasks such as question answering (QA). Existing approaches often attempt to mitigate conflicts by directly comparing two knowledge sources in a side-by-side manner, but this can overwhelm LLMs with extraneous or lengthy contexts, ultimately hindering their ability to identify and mitigate inconsistencies. To address this issue, we propose Micro-Act, a framework with a hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context. Through extensive experiments on five benchmark datasets, Micro-Act consistently achieves a significant increase in QA accuracy over state-of-the-art baselines across all 5 datasets and 3 conflict types, especially in temporal and semantic types where all baselines fail significantly. More importantly, Micro-Act exhibits robust performance on non-conflict questions simultaneously, highlighting its practical value in real-world RAG applications.

[224] ProRefine: Inference-time Prompt Refinement with Textual Feedback

Deepak Pandita,Tharindu Cyril Weerasooriya,Ankit Parag Shah,Christopher M. Homan,Wei Wei

Main category: cs.CL

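TL;DR: ProRefine is a prompt optimization method that dynamically refines prompts for multi-step reasoning tasks, significantly improving the accuracy and performance of collaborating AI agents.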

Details Motivation: Workflows in which multiple AI agents collaborate often suffer from error propagation and poor performance because of poorly designed prompts, limiting system reliability and scalability. Method: ProRefine uses textual feedback from large language models to dynamically refine prompts for multi-step reasoning tasks, without additional training or ground-truth labels. Result: On five mathematical reasoning benchmark datasets, ProRefine improves on zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. Conclusion: ProRefine not only improves accuracy but also lets smaller models match the performance of larger ones, showing potential for efficient, scalable deployment and broader access to high-performing AI. Abstract: Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, are becoming increasingly prevalent. However, these workflows often suffer from error propagation and sub-optimal performance, largely due to poorly designed prompts that fail to effectively guide individual agents. This is a critical problem because it limits the reliability and scalability of these powerful systems. We introduce ProRefine, an innovative inference-time prompt optimization method that leverages textual feedback from large language models (LLMs) to address this challenge. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to match the performance of larger ones, highlighting its potential for efficient and scalable AI deployment, and democratizing access to high-performing AI.

[225] Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models

Taha Entesari,Arman Hatami,Rinat Khaziev,Anil Ramakrishna,Mahyar Fazlyab

Main category: cs.CL

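TL;DR: This paper proposes a new LLM unlearning method that uses constrained optimization and a novel loss function to address the instability and performance degradation existing methods suffer when forgetting sensitive information.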

Details Motivation: LLMs deployed in the real world need to unlearn sensitive, outdated, or proprietary information, but existing methods usually trade off forgetting and retention through regularization, leading to unstable optimization and degraded performance. Method: Reformulates LLM unlearning as a constrained optimization problem, enforcing forgetting with a novel logit-margin flattening loss and preserving retention through a hard constraint on a retain set; the problem is solved with a scalable primal-dual algorithm. Result: On the TOFU and MUSE benchmarks, across diverse LLM architectures, the method matches or exceeds existing baselines, effectively removing targeted information while preserving downstream utility. Conclusion: The method offers an efficient and stable approach to LLM unlearning that compares favorably with existing techniques. Abstract: Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
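
The primal-dual idea in the abstract (minimize a forgetting objective subject to a hard retention constraint, with a dual variable pricing the constraint) can be illustrated on a one-dimensional toy problem. The sketch below is generic projected gradient descent-ascent on a Lagrangian, not the paper's algorithm or loss; the function names, step sizes, and the quadratic stand-in objective are all assumptions made for illustration.

```python
def primal_dual(objective_grad, constraint, constraint_grad, x0,
                steps=5000, lr=0.05, dual_lr=0.05):
    """Minimize f(x) subject to g(x) <= 0 by descent-ascent on L(x, lam) = f(x) + lam * g(x).

    Hypothetical helper: f plays the role of a forgetting loss and g of a
    retention loss minus its allowed tolerance.
    """
    x, lam = x0, 0.0
    for _ in range(steps):
        # Primal step: gradient descent on the Lagrangian with respect to x.
        x -= lr * (objective_grad(x) + lam * constraint_grad(x))
        # Dual step: gradient ascent on lam, projected onto lam >= 0.
        lam = max(0.0, lam + dual_lr * constraint(x))
    return x, lam


# Toy problem: minimize (x - 2)^2 subject to x - 1 <= 0.
# The constrained optimum is x = 1 with multiplier lam = 2.
x_star, lam_star = primal_dual(
    objective_grad=lambda x: 2.0 * (x - 2.0),
    constraint=lambda x: x - 1.0,
    constraint_grad=lambda x: 1.0,
    x0=0.0,
)
```

The dual variable grows only while the constraint is violated and settles at the value where the two gradients balance, which is the sense in which its dynamics expose the trade-off between the objective and the constraint.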

[226] Search Arena: Analyzing Search-Augmented LLMs

Mihran Miroyan,Tsung-Han Wu,Logan King,Tianle Li,Jiayi Pan,Xinyan Hu,Wei-Lin Chiang,Anastasios N. Angelopoulos,Trevor Darrell,Narges Norouzi,Joseph E. Gonzalez

Main category: cs.CL

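TL;DR: The paper introduces the Search Arena dataset for evaluating user preferences for search-augmented language models, finds that preferences are influenced by the number and source of citations, and analyzes model performance across settings.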

Details Motivation: Existing datasets are small in scale and narrow in scope, making it hard to analyze search-augmented language models comprehensively, so a larger and more diverse dataset is needed. Method: Crowd-sources the Search Arena dataset of over 24,000 paired multi-turn user interactions, spanning diverse intents and languages, with around 12,000 human preference votes. Result: User preferences are influenced by the number and source of citations; search-augmented models perform well in non-search settings, whereas relying solely on parametric knowledge significantly hurts quality in search settings. Conclusion: The Search Arena dataset supports future research and reveals new findings on user preferences and model performance; the data and code are open-sourced. Abstract: Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: https://github.com/lmarena/search-arena.

[227] Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models

Anirudh Bharadwaj,Chaitanya Malaviya,Nitish Joshi,Mark Yatskar

Main category: cs.CL

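TL;DR: Language models used as proxies for human preference judgments show systematic bias, over-relying on superficial features such as length and structure, which makes evaluation unreliable. The study traces these biases to training data and proposes a post-training method based on counterfactual data augmentation that effectively reduces them.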

Details Motivation: Language models show systematic bias in judging human preferences, such as over-reliance on superficial features, which makes evaluation unreliable; the study aims to quantify these biases and propose a remedy. Method: Uses controlled counterfactual pairs to quantify how strongly preference models rely on biased features, and proposes a post-training method based on counterfactual data augmentation (CDA). Result: Preference models favor the biased features in more than 60% of instances and show roughly 40% miscalibration relative to human preferences; CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%. Conclusion: Counterfactual data augmentation effectively reduces language model bias and improves the reliability of preference models. Abstract: Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. Evidence suggests these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding this preference occurs in >60% of instances, and model preferences show high miscalibration (~40%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean r_human = -0.12) but show moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Finetuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.

[228] HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Zhaolu Kang,Junhao Gong,Jiaxu Yan,Wanke Xia,Yian Wang,Ziwen Wang,Huaxuan Ding,Zhuo Cheng,Wenhao Cao,Zhiyuan Feng,Siqi He,Shannan Yan,Junzhe Chen,Xiaomin He,Chaoya Jiang,Wei Ye,Kaidong Yu,Xuelong Li

Main category: cs.CL

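TL;DR: HSSBench is a new benchmark dedicated to evaluating multimodal large language models (MLLMs) on Humanities and Social Sciences (HSS) tasks, filling a gap in existing evaluations.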

Details Motivation: Current MLLM evaluations focus mainly on STEM fields and overlook the distinct needs of HSS, such as interdisciplinary thinking and linking abstract concepts with visual representations. Method: Proposes HSSBench, which contains over 13,000 samples across six categories, with data generated through collaboration between domain experts and automated agents. Result: Benchmarking more than 20 mainstream MLLMs shows that even state-of-the-art models face significant challenges. Conclusion: HSSBench is expected to encourage improvements in the cross-disciplinary reasoning abilities of MLLMs and research on knowledge integration. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.

cs.RO [Back]

[229] RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Enshen Zhou,Jingkun An,Cheng Chi,Yi Han,Shanyu Rong,Chi Zhang,Pengwei Wang,Zhongyuan Wang,Tiejun Huang,Lu Sheng,Shanghang Zhang

Main category: cs.RO

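TL;DR: RoboRefer is a 3D-aware vision-language model that improves spatial understanding and reasoning through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), achieving significant gains on the RefSpatial-Bench benchmark.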

Details Motivation: Existing vision-language models struggle to accurately understand and reason about spatial instructions in complex 3D scenes; RoboRefer aims to address this. Method: Combines SFT and RFT, using a dedicated depth encoder and metric-sensitive process reward functions, and builds the large-scale RefSpatial dataset. Result: SFT-trained RoboRefer reaches an average success rate of 89.6% on spatial understanding, and the RFT-trained model surpasses Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Conclusion: RoboRefer substantially improves spatial reasoning and can be applied to dynamic tasks across diverse robots in complex scenes. Abstract: Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.

[230] Learning Smooth State-Dependent Traversability from Dense Point Clouds

Zihao Dong,Alan Papalia,Leonard Jung,Alenna Spiro,Philip R. Osteen,Christa S. Robison,Michael Everett

Main category: cs.RO

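TL;DR: SPARTA estimates approach-angle-conditioned terrain traversability from point clouds by outputting a smooth analytical function that predicts risk distribution, significantly improving computational efficiency and generalization.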

Details Motivation: In off-road autonomy, terrain traversability depends on the vehicle's state, and some obstacles are traversable only from certain directions, while conventional approaches demand large amounts of data and computation. Method: Builds a smooth analytical function from Fourier basis functions that predicts the risk distribution for any approach angle, avoiding the cost of repeated model inference. Result: In high-fidelity simulation, SPARTA achieves a 91% success rate crossing a 40 m boulder field (versus 73% for the baseline) and demonstrates generalization in hardware experiments. Conclusion: By imposing geometric structure through Fourier basis functions, SPARTA significantly improves the efficiency and generalization of traversability prediction. Abstract: A key open challenge in off-road autonomy is that the traversability of terrain often depends on the vehicle's state. In particular, some obstacles are only traversable from some orientations. However, learning this interaction by encoding the angle of approach as a model input demands a large and diverse training dataset and is computationally inefficient during planning due to repeated model inference. To address these challenges, we present SPARTA, a method for estimating approach angle conditioned traversability from point clouds. Specifically, we impose geometric structure into our network by outputting a smooth analytical function over the 1-Sphere that predicts risk distribution for any angle of approach with minimal overhead and can be reused for subsequent queries. The function is composed of Fourier basis functions, which have important advantages for generalization due to their periodic nature and smoothness. We demonstrate SPARTA both in a high-fidelity simulation platform, where our model achieves a 91% success rate crossing a 40m boulder field (compared to 73% for the baseline), and on hardware, illustrating the generalization ability of the model to real-world settings.
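
The benefit of outputting a function rather than a single value can be seen in a small sketch: once a set of Fourier coefficients is available for a terrain patch, risk at any approach angle is a cheap closed-form evaluation with no further network inference. The code below assumes a truncated real Fourier series and a scalar risk value for simplicity; the function name and coefficient values are hypothetical, and the paper's actual parameterization (which predicts a risk distribution) may differ.

```python
import math


def fourier_risk(theta, a0, cos_coeffs, sin_coeffs):
    """Evaluate a0 + sum_k (a_k * cos(k * theta) + b_k * sin(k * theta)) at angle theta (radians).

    Illustrative only: in a learned system the coefficients would come from the
    network's output for one terrain patch and be reused for every angle query.
    """
    risk = a0
    for k, (a_k, b_k) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        risk += a_k * math.cos(k * theta) + b_k * math.sin(k * theta)
    return risk


# Hypothetical coefficients: higher risk when approaching head-on (theta = 0)
# than from the opposite direction (theta = pi).
A0, COS, SIN = 0.5, [0.3], [0.0]
risk_head_on = fourier_risk(0.0, A0, COS, SIN)
risk_opposite = fourier_risk(math.pi, A0, COS, SIN)
```

Because every basis function is 2π-periodic, the prediction at an angle θ and at θ + 2π agree by construction, which is the periodicity property the abstract points to.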

[231] MineInsight: A Multi-sensor Dataset for Humanitarian Demining Robotics in Off-Road Environments

Mario Malizia,Charles Hamesse,Ken Hasselmann,Geert De Cubber,Nikolaos Tsiogkas,Eric Demeester,Rob Haelterman

Main category: cs.RO

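TL;DR: MineInsight is a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection, covering a range of targets and environmental conditions to provide a diverse testbed for algorithm validation.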

Details Motivation: The lack of diverse and realistic datasets makes reliable validation of landmine detection algorithms a challenge for the research community. Method: The dataset integrates dual-view sensor scans from an Unmanned Ground Vehicle and its robotic arm, including LiDAR and multi-spectral images (RGB, VIS-SWIR, LWIR), recorded in both daylight and nighttime conditions. Result: The dataset contains 35 targets (15 landmines and 20 common objects) and around 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames. Conclusion: MineInsight provides a benchmark for developing and evaluating landmine detection algorithms, filling a gap in existing datasets. Abstract: The use of robotics in humanitarian demining increasingly involves computer vision techniques to improve landmine detection capabilities. However, in the absence of diverse and realistic datasets, the reliable validation of algorithms remains a challenge for the research community. In this paper, we introduce MineInsight, a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection. The dataset features 35 different targets (15 landmines and 20 commonly found objects) distributed along three distinct tracks, providing a diverse and realistic testing environment. MineInsight is, to the best of our knowledge, the first dataset to integrate dual-view sensor scans from both an Unmanned Ground Vehicle and its robotic arm, offering multiple viewpoints to mitigate occlusions and improve spatial awareness. It features two LiDARs, as well as images captured at diverse spectral ranges, including visible (RGB, monochrome), visible short-wave infrared (VIS-SWIR), and long-wave infrared (LWIR). Additionally, the dataset comes with an estimation of the location of the targets, offering a benchmark for evaluating detection algorithms. We recorded approximately one hour of data in both daylight and nighttime conditions, resulting in around 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames. MineInsight serves as a benchmark for developing and evaluating landmine detection algorithms. Our dataset is available at https://github.com/mariomlz99/MineInsight.

[232] Synthetic Dataset Generation for Autonomous Mobile Robots Using 3D Gaussian Splatting for Vision Training

Aneesh Deogan,Wout Beks,Peter Teurlings,Koen de Vos,Mark van den Brand,Rene van de Molengraft

Main category: cs.RO

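TL;DR: Proposes a method for automatically generating annotated synthetic data in Unreal Engine, addressing the time cost and limited diversity of manually annotated datasets, with strong results in robot soccer.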

Details Motivation: Manually annotating datasets is time-consuming, error-prone, and limited in diversity, a problem that is especially pronounced in the dynamic scenarios of robotics. Method: Uses Unreal Engine and photorealistic 3D Gaussian splats to rapidly generate synthetic data, and combines it with real data to improve detection performance. Result: A detector trained on synthetic data performs on par with one trained on real data in robot soccer scenarios, and combining synthetic and real data works better still. Conclusion: The method offers an efficient, scalable way to generate datasets for robotics and significantly reduces the need for manual annotation. Abstract: Annotated datasets are critical for training neural networks for object detection, yet their manual creation is time- and labour-intensive, subject to human error, and often limited in diversity. This challenge is particularly pronounced in the domain of robotics, where diverse and dynamic scenarios further complicate the creation of representative datasets. To address this, we propose a novel method for automatically generating annotated synthetic data in Unreal Engine. Our approach leverages photorealistic 3D Gaussian splats for rapid synthetic data generation. We demonstrate that synthetic datasets can achieve performance comparable to that of real-world datasets while significantly reducing the time required to generate and annotate data. Additionally, combining real-world and synthetic data significantly increases object detection performance by leveraging the quality of real-world images with the easier scalability of synthetic data. To our knowledge, this is the first application of synthetic data for training object detection algorithms in the highly dynamic and varied environment of robot soccer. Validation experiments reveal that a detector trained on synthetic images performs on par with one trained on manually annotated real-world images when tested on robot soccer match scenarios. Our method offers a scalable and comprehensive alternative to traditional dataset creation, eliminating the labour-intensive error-prone manual annotation process. By generating datasets in a simulator where all elements are intrinsically known, we ensure accurate annotations while significantly reducing manual effort, which makes it particularly valuable for robotics applications requiring diverse and scalable training data.

cs.CY [Back]

[233] Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems

Emma Harvey,Emily Sheng,Su Lin Blodgett,Alexandra Chouldechova,Jean Garcia-Gathright,Alexandra Olteanu,Hanna Wallach

Main category: cs.CY

TL;DR: Existing NLP instruments for measuring representational harms caused by LLM-based systems often fail to meet practitioner needs, either because they are not useful or because barriers impede their uptake.

Details Motivation: Examines whether publicly available NLP instruments meet the needs of practitioners who evaluate representational harms caused by LLM-based systems. Method: Semi-structured interviews with 12 practitioners, analyzing the usefulness of the instruments and the barriers to using them. Result: Practitioners are often unable to use existing instruments, either because the instruments are not useful or because of practical and institutional barriers. Conclusion: Recommends improving instruments by drawing on measurement theory and pragmatic measurement, to better meet practitioner needs. Abstract: The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments - even useful instruments - are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.

cs.LG [Back]

[234] Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey

Ivan Vegner,Sydelle de Souza,Valentin Forch,Martha Lewis,Leonidas A. A. Doumas

Main category: cs.LG

TL;DR: The paper discusses the importance of systematicity in ML models, distinguishes behavioural from representational systematicity, and analyzes the limitations of existing benchmarks.

Details Motivation: Systematicity is a core aspect of compositionality and enables strong generalization to novel contexts. Existing work mostly addresses behavioural systematicity and neglects representational systematicity. Method: Building on Hadley's (1994) taxonomy, analyzes the extent to which key benchmarks across language and vision test behavioural systematicity. Result: Existing benchmarks focus primarily on behavioural systematicity and do not adequately assess representational systematicity. Conclusion: Emphasizes the importance of assessing representational systematicity and highlights ways of doing so drawn from mechanistic interpretability. Abstract: A core aspect of compositionality, systematicity is a desirable property in ML models as it enables strong generalization to novel contexts. This has led to numerous studies proposing benchmarks to assess systematic generalization, as well as models and training regimes designed to enhance it. Many of these efforts are framed as addressing the challenge posed by Fodor and Pylyshyn. However, while they argue for systematicity of representations, existing benchmarks and models primarily focus on the systematicity of behaviour. We emphasize the crucial nature of this distinction. Furthermore, building on Hadley's (1994) taxonomy of systematic generalization, we analyze the extent to which behavioural systematicity is tested by key benchmarks in the literature across language and vision. Finally, we highlight ways of assessing systematicity of representations in ML models as practiced in the field of mechanistic interpretability.

[235] Clustering and Median Aggregation Improve Differentially Private Inference

Kareem Amin,Salman Avestimehr,Sara Babakniya,Alex Bie,Weiwei Kong,Natalia Ponomareva,Umar Syed

Main category: cs.LG

TL;DR: Proposes an improved method for differentially private language model inference that clusters the input data and aggregates token predictions with medians, producing higher-quality synthetic text at lower privacy cost.

Details Motivation: Existing methods generate differentially private text by sampling sensitive inputs uniformly at random, which works poorly when the inputs cover heterogeneous topics and degrades the quality of the generated text. Method: Improves batch selection by clustering the input data, and aggregates token predictions with medians rather than averages, which lowers local sensitivity and yields a data-dependent DP guarantee. Result: Experiments show improvements on representativeness metrics (e.g., MAUVE) and on downstream task performance, producing high-quality synthetic data at lower privacy cost. Conclusion: Combining clustering with median aggregation improves differentially private language model inference and offers a new route to high-quality private text generation. Abstract: Differentially private (DP) language model inference is an approach for generating private synthetic text. A sensitive input example is used to prompt an off-the-shelf large language model (LLM) to produce a similar example. Multiple examples can be aggregated together to formally satisfy the DP guarantee. Prior work creates inference batches by sampling sensitive inputs uniformly at random. We show that uniform sampling degrades the quality of privately generated text, especially when the sensitive examples concern heterogeneous topics. We remedy this problem by clustering the input data before selecting inference batches. Next, we observe that clustering also leads to more similar next-token predictions across inferences. We use this insight to introduce a new algorithm that aggregates next token statistics by privately computing medians instead of averages. This approach leverages the fact that the median has decreased local sensitivity when next token predictions are similar, allowing us to state a data-dependent and ex-post DP guarantee about the privacy properties of this algorithm. Finally, we demonstrate improvements in terms of representativeness metrics (e.g., MAUVE) as well as downstream task performance. We show that our method produces high-quality synthetic data at significantly lower privacy cost than a previous state-of-the-art method.
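The local-sensitivity argument in the abstract above can be illustrated with a small, non-private sketch. It compares how far the mean and the median of per-example next-token scores move when a single example in the batch is replaced. The toy scores, the `aggregate` helper, and the batch shape are assumptions made for illustration; the sketch omits the noise mechanism and privacy accounting that the paper's algorithm relies on.

```python
import numpy as np

def aggregate(scores, how):
    """Aggregate per-example next-token scores (one row per example)."""
    if how == "mean":
        return scores.mean(axis=0)
    return np.median(scores, axis=0)

# Hypothetical batch: five similar examples scoring a 3-token vocabulary.
batch = np.array([[2.0, 0.1, 0.0],
                  [2.1, 0.0, 0.1],
                  [1.9, 0.2, 0.0],
                  [2.0, 0.1, 0.1],
                  [2.0, 0.0, 0.0]])

# Neighbouring batch: one example replaced by a very different one.
changed = batch.copy()
changed[0] = [-5.0, 5.0, 0.0]

mean_shift = np.abs(aggregate(changed, "mean") - aggregate(batch, "mean")).max()
median_shift = np.abs(aggregate(changed, "median") - aggregate(batch, "median")).max()
print(mean_shift, median_shift)
```

When the remaining rows agree closely, as they would within a well-formed cluster, the median barely moves while the mean shifts in proportion to the outlier.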

[236] Urania: Differentially Private Insights into AI Use

Daogao Liu,Edith Cohen,Badih Ghazi,Peter Kairouz,Pritish Kamath,Alexander Knop,Ravi Kumar,Pasin Manurangsi,Adam Sealfon,Da Yu,Chiyuan Zhang

Main category: cs.LG

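TL;DR: Urania is a framework for generating insights about LLM chatbot interactions under rigorous differential privacy (DP) guarantees. It combines a private clustering mechanism with keyword extraction methods and uses DP tools to provide end-to-end privacy protection.

Details Motivation: Aims to extract meaningful insights from LLM chatbot interactions while protecting user privacy. Method: Uses a private clustering mechanism and several keyword extraction methods (frequency-based, TF-IDF-based, LLM-guided), together with DP tools (clustering, partition selection, histogram-based summarization), to achieve privacy protection. Result: Evaluated against a non-private Clio-inspired pipeline, the framework preserves lexical and semantic content and pair similarity while providing rigorous privacy protection, and a simple empirical privacy evaluation shows the DP pipeline to be more robust. Conclusion: Urania balances data utility with privacy preservation and extracts meaningful conversational insights. Abstract: We introduce $Urania$, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, $Urania$ provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework's ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.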

[237] From EHRs to Patient Pathways: Scalable Modeling of Longitudinal Health Trajectories with LLMs

Chantal Pellegrini,Ege Özsoy,David Bani-Harouni,Matthias Keicher,Nassir Navab

Main category: cs.LG

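TL;DR: Proposes EHR2Path, which transforms diverse electronic health record data into a structured representation and uses a holistic pathway prediction model to forecast patients' future health trajectories. It outperforms competitive baselines and is more token-efficient than text-only models.

Details Motivation: Healthcare systems struggle to manage and interpret large volumes of heterogeneous patient data for personalized care. Existing approaches are typically confined to narrow use cases with a limited feature space and overlook the complex, longitudinal interactions that shape patient health. Method: Transforms electronic health record data into a structured representation, designs the EHR2Path pathway prediction model, and introduces a novel summary mechanism that embeds long-term temporal context into topic-specific summary tokens. Result: EHR2Path performs strongly in both next time-step prediction and longitudinal simulation, outperforms competitive baselines, and simulates patient trajectories in detail across diverse evaluation tasks. Conclusion: EHR2Path opens a path towards predictive and personalized healthcare by forecasting patient health trajectories efficiently. Abstract: Healthcare systems face significant challenges in managing and interpreting vast, heterogeneous patient data for personalized care. Existing approaches often focus on narrow use cases with a limited feature space, overlooking the complex, longitudinal interactions needed for a holistic understanding of patient health. In this work, we propose a novel approach to patient pathway modeling by transforming diverse electronic health record (EHR) data into a structured representation and designing a holistic pathway prediction model, EHR2Path, optimized to predict future health trajectories. Further, we introduce a novel summary mechanism that embeds long-term temporal context into topic-specific summary tokens, improving performance over text-only models, while being much more token-efficient. EHR2Path demonstrates strong performance in both next time-step prediction and longitudinal simulation, outperforming competitive baselines. It enables detailed simulations of patient trajectories, inherently targeting diverse evaluation tasks, such as forecasting vital signs, lab test results, or length-of-stay, opening a path towards predictive and personalized healthcare.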

[238] Dissecting Long Reasoning Models: An Empirical Study

Yongyu Mu,Jiali Zeng,Bei Li,Xinyan Guan,Fandong Meng,Jie Zhou,Tong Xiao,Jingbo Zhu

Main category: cs.LG

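TL;DR: The paper studies three key issues in reinforcement learning for long-context reasoning models: the roles of positive and negative samples, data efficiency, and performance stability.

Details Motivation: Despite progress in training long-context reasoning models with reinforcement learning, counterintuitive behaviors and open questions remain and call for systematic analysis. Method: Analyzes the roles of positive and negative samples, explores strategies for improving data efficiency, and investigates the causes of unstable performance and how to mitigate them. Result: Negative samples significantly enhance generalization and robustness; relative length rewards and offline sample injection make better use of zero-advantage samples; multiple evaluation runs mitigate performance instability. Conclusion: The study clarifies key issues in RL training of reasoning models and offers practical remedies that point to directions for future work. Abstract: Despite recent progress in training long-context reasoning models via reinforcement learning (RL), several open questions and counterintuitive behaviors remain. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in RL, revealing that positive samples mainly facilitate data fitting, whereas negative samples significantly enhance generalization and robustness. Interestingly, training solely on negative samples can rival standard RL training performance. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address this, we explore two straightforward strategies, including relative length rewards and offline sample injection, to better leverage these data and enhance reasoning efficiency and capability. (3) We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes, and demonstrate that multiple evaluation runs mitigate this issue.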
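The zero-advantage observation in point (2) of the abstract above can be illustrated with a minimal sketch. It assumes the commonly used group-relative normalization, in which each rollout's reward is centred on the group mean and divided by the group standard deviation; the function name and the `eps` constant are illustrative and not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    # eps (an assumed constant) avoids division by zero for uniform groups.
    return (r - r.mean()) / (r.std() + eps)

# A group with mixed outcomes produces a non-zero learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same reward produces zero advantage
# for every sample, so those rollouts contribute no gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed, uniform)
```

Under this assumption, any prompt that the policy always solves or always fails yields a uniform group, which is one way a large share of samples can end up with zero advantage.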

[239] Mitigating Degree Bias Adaptively with Hard-to-Learn Nodes in Graph Contrastive Learning

Jingyu Hu,Hongbo Bo,Jun Hong,Xiaowei Liu,Weiru Liu

Main category: cs.LG

TL;DR: The paper proposes HAR, a contrastive loss that mitigates degree bias in GNNs by adding positive pairs and adaptively weighting pairs, and develops the SHARP experimental framework to validate it.

Details Motivation: GNNs suffer from degree bias in node classification, and existing GCL methods leave low-degree nodes with insufficient and noisy information because positive pairs are scarce and all positives and negatives are weighted equally. Method: Proposes the HAR contrastive loss, which uses node labels to add positive pairs and adaptively weights positive and negative pairs by learning hardness. Develops the SHARP framework to extend HAR to a broader range of scenarios. Result: Experiments on four datasets show that SHARP outperforms baselines at both the global and the degree level. Conclusion: HAR and SHARP mitigate degree bias in GNNs and improve model performance. Abstract: Graph Neural Networks (GNNs) often suffer from degree bias in node classification tasks, where prediction performance varies across nodes with different degrees. Several approaches, which adopt Graph Contrastive Learning (GCL), have been proposed to mitigate this bias. However, the limited number of positive pairs and the equal weighting of all positives and negatives in GCL still lead to low-degree nodes acquiring insufficient and noisy information. This paper proposes the Hardness Adaptive Reweighted (HAR) contrastive loss to mitigate degree bias. It adds more positive pairs by leveraging node labels and adaptively weights positive and negative pairs based on their learning hardness. In addition, we develop an experimental framework named SHARP to extend HAR to a broader range of scenarios. Both our theoretical analysis and experiments validate the effectiveness of SHARP. The experimental results across four datasets show that SHARP achieves better performance against baselines at both global and degree levels.

[240] Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Danil Sivtsov,Ivan Rodkin,Gleb Kuzmin,Yuri Kuratov,Ivan Oseledets

Main category: cs.LG

TL;DR: Diagonal Batching is a scheduling scheme that unlocks parallelism across segments in RMTs, substantially improving long-context inference efficiency.

Details Motivation: Transformer models struggle with long-context inference because of their quadratic time and linear memory complexity. RMTs reduce this cost to linear time and constant memory, but their sequential memory updates become a performance bottleneck. Method: The Diagonal Batching scheduling scheme parallelizes RMT execution across segments while preserving exact recurrence. Result: On a LLaMA-1B ARMT model with 131,072-token sequences, Diagonal Batching gives a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation, reducing inference cost and latency. Conclusion: Diagonal Batching removes the sequential bottleneck and strengthens RMTs as a practical solution for real-world long-context applications. Abstract: Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.
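The reordering idea in the abstract above can be sketched as a scheduling problem on a grid of (segment, layer) cells. The sketch assumes that each cell depends only on the same segment at the previous layer and on the previous segment at the same layer; that dependency pattern, the function name, and the grid sizes are assumptions for illustration, not the paper's implementation.

```python
def diagonal_schedule(num_segments, num_layers):
    """Group (segment, layer) cells by anti-diagonal index segment + layer.

    Under the assumed dependencies, every cell on a diagonal depends only on
    cells from the previous diagonal, so each group could run as one batch.
    """
    steps = []
    for d in range(num_segments + num_layers - 1):
        group = [(s, d - s) for s in range(num_segments) if 0 <= d - s < num_layers]
        steps.append(group)
    return steps

schedule = diagonal_schedule(num_segments=4, num_layers=3)
for step, group in enumerate(schedule):
    print(step, group)
```

With S segments and L layers, this grouping needs S + L - 1 batched steps rather than S × L sequential cell evaluations, and the arithmetic performed per cell is unchanged.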

[241] MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Johannes von Oswald,Nino Scherrer,Seijin Kobayashi,Luca Versari,Songlin Yang,Maximilian Schlegel,Kaitlin Maile,Yanick Schimpf,Oliver Sieberling,Alexander Meulemans,Rif A. Saurous,Guillaume Lajoie,Charlotte Frenkel,Razvan Pascanu,Blaise Agüera y Arcas,João Sacramento

Main category: cs.LG

TL;DR: The paper introduces a numerically stable, chunkwise parallelizable Mesa layer for language modeling that minimizes an in-context loss to optimality at every time step, improving performance over previous RNNs.

Details Motivation: Addresses the linear growth of memory and compute during transformer inference while improving the performance of RNN models on long-context tasks. Method: Introduces a numerically stable, chunkwise parallelizable Mesa layer that minimizes an in-context loss with a fast conjugate gradient solver. Result: In language modeling at the billion-parameter scale, the method achieves lower perplexity and higher downstream benchmark performance than previous RNNs. Conclusion: Spending additional inference-time compute on solving sequential optimization problems inside the network improves performance, especially on tasks requiring long-context understanding. Abstract: Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
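To make the "minimized to optimality using a conjugate gradient solver" step concrete, here is a generic sketch of conjugate gradient applied to a ridge-regularized linear system. The system form `(K^T K + lam * I) x = q`, the variable names, and the problem sizes are assumptions chosen for illustration; the sketch does not reproduce the paper's chunkwise parallel or numerically stabilized layer.

```python
import numpy as np

def conjugate_gradient(A, b, num_steps=50, tol=1e-20):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(num_steps):
        if rs < tol:       # squared residual norm small enough: stop
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 4))       # hypothetical keys seen so far in context
q = rng.normal(size=4)             # hypothetical query
lam = 1.0                          # assumed ridge regularizer
A = K.T @ K + lam * np.eye(4)      # symmetric positive definite by construction
x_cg = conjugate_gradient(A, q)
x_direct = np.linalg.solve(A, q)
```

Each iteration needs only matrix-vector products with `A`, which is why an iterative solver can be attractive when a direct solve at every time step would be too costly.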

[242] Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay

Yifan Sun,Jingyan Shen,Yibin Wang,Tianyu Chen,Zhendong Wang,Mingyuan Zhou,Huan Zhang

Main category: cs.LG

TL;DR: Proposes two techniques, difficulty-targeted online data selection and rollout replay, to improve the data efficiency of reinforcement learning (RL) fine-tuning for large language models (LLMs), substantially reducing training time.

Details Motivation: RL fine-tuning of LLMs is resource-intensive, and the problem of data efficiency has been largely overlooked. Method: (1) Online data selection guided by adaptive difficulty, prioritizing questions of moderate difficulty; (2) an attention-based framework that estimates difficulty efficiently from a small reference set; (3) a rollout replay mechanism that reuses recent rollouts. Result: Across 6 LLM-dataset combinations, the method reduces RL fine-tuning time by 25% to 65% while reaching the same performance as the original GRPO algorithm. Conclusion: The proposed techniques improve the efficiency of RL fine-tuning and offer a practical remedy for a resource-intensive task. Abstract: Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism that reuses recent rollouts, lowering per-step computation while maintaining stable updates. Extensive experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 25% to 65% to reach the same level of performance as the original GRPO algorithm.
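The selection step described above, prioritizing questions of moderate difficulty, can be sketched in a few lines. The sketch assumes that difficulty is summarized by an estimated success rate per question and that "moderate" means closest to a target rate of 0.5; both the target value and the `select_moderate` helper are hypothetical, and the attention-based difficulty estimator from the paper is not reproduced.

```python
def select_moderate(success_rates, batch_size, target=0.5):
    """Pick the questions whose estimated success rate is closest to the target."""
    ranked = sorted(success_rates, key=lambda q: (abs(success_rates[q] - target), q))
    return ranked[:batch_size]

# Hypothetical success-rate estimates under the current policy.
estimated = {"q1": 0.0, "q2": 0.45, "q3": 1.0, "q4": 0.6, "q5": 0.9}
batch = select_moderate(estimated, batch_size=2)
print(batch)
```

Questions the policy always solves or always fails are ranked last here, which matches the intuition that they are less likely to yield informative learning signals.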

[243] Kinetics: Rethinking Test-Time Scaling Laws

Ranajoy Sadhukhan,Zhuoming Chen,Haizhong Zheng,Yang Zhou,Emma Strubell,Beidi Chen

Main category: cs.LG

TL;DR: The paper rethinks test-time scaling laws, proposes a more efficient way to allocate resources, and finds that sparse attention models outperform dense ones.

Details Motivation: Prior work grounded in compute-optimality overlooks the memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-N, long CoTs), which leads to overestimating the effectiveness of smaller models. Method: Analyzes models from 0.6B to 32B parameters, derives a new Kinetics Scaling Law that accounts for both computation and memory access costs, and proposes a scaling paradigm centered on sparse attention. Result: Sparse attention models consistently outperform dense counterparts, with gains in AIME problem-solving accuracy of over 60 points in low-cost regimes and over 5 points in high-cost regimes. Conclusion: Sparse attention is essential for realizing the full potential of test-time scaling, because test-time accuracy continues to improve with increased generation. Abstract: We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.

[244] Inference-Time Hyper-Scaling with KV Cache Compression

Adrian Łańcucki,Konrad Staniszewski,Piotr Nawrot,Edoardo M. Ponti

Main category: cs.LG

TL;DR: Achieves inference-time hyper-scaling by compressing the KV cache, improving reasoning accuracy within the same compute budget.

Details Motivation: In Transformer LLMs, generation cost is bottlenecked by the size of the KV cache rather than by the number of generated tokens. Compressing the KV cache allows more tokens to be generated within the same compute budget, which improves reasoning accuracy. Method: Proposes Dynamic Memory Sparsification (DMS), a new method for sparsifying KV caches that needs only 1K training steps to reach 8x compression while maintaining better accuracy than training-free sparse attention. Result: DMS is validated on multiple LLM families; for example, on Qwen-R1 32B it improves AIME 24 by an average of 9.1 points, GPQA by 7.6, and LiveCodeBench by 9.6 across compute budgets. Conclusion: By delaying token eviction and implicitly merging representations, DMS compresses the KV cache efficiently and makes inference-time hyper-scaling practical. Abstract: Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8$\times$ compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.

[245] You Only Train Once

Christos Sakaridis

Main category: cs.LG

TL;DR: The paper proposes YOTO, a method that optimizes loss weight hyperparameters automatically in a single training run, avoiding the cost of a conventional grid search.

Details Motivation: Conventional approaches need many training runs to tune loss weight hyperparameters, which is inefficient. YOTO aims to do this in a single run. Method: YOTO treats the loss weight hyperparameters as regular network parameters and learns them by gradient-based optimization, using a softmax parameterization and a regularization loss to keep the optimization well behaved. Result: On 3D estimation and semantic segmentation, YOTO consistently outperforms the best grid-search model on unseen test data. Conclusion: YOTO optimizes loss weights automatically in one shot and outperforms brute-force grid search. Abstract: The title of this paper is perhaps an overclaim. Of course, the process of creating and optimizing a learned model inevitably involves multiple training runs which potentially feature different architectural designs, input and output encodings, and losses. However, our method, You Only Train Once (YOTO), indeed contributes to limiting training to one shot for the latter aspect of losses selection and weighting. We achieve this by automatically optimizing loss weight hyperparameters of learned models in one shot via standard gradient-based optimization, treating these hyperparameters as regular parameters of the networks and learning them. To this end, we leverage the differentiability of the composite loss formulation which is widely used for optimizing multiple empirical losses simultaneously and model it as a novel layer which is parameterized with a softmax operation that satisfies the inherent positivity constraints on loss hyperparameters while avoiding degenerate empirical gradients. We complete our joint end-to-end optimization scheme by defining a novel regularization loss on the learned hyperparameters, which models a uniformity prior among the employed losses while ensuring boundedness of the identified optima. We evidence the efficacy of YOTO in jointly optimizing loss hyperparameters and regular model parameters in one shot by comparing it to the commonly used brute-force grid search across state-of-the-art networks solving two key problems in computer vision, i.e. 3D estimation and semantic segmentation, and showing that it consistently outperforms the best grid-search model on unseen test data. Code will be made publicly available.
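The softmax parameterization mentioned in the abstract above can be sketched as follows. Unconstrained logits are mapped through a softmax to obtain strictly positive loss weights, which are then used to combine the individual losses. The function names, the example logits, and the loss values are hypothetical; the paper's regularization loss and any rescaling of the weights are not reproduced here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def composite_loss(logits, losses):
    """Weight individual losses by softmax(logits) and sum them."""
    weights = softmax(np.asarray(logits, dtype=float))
    total = float(weights @ np.asarray(losses, dtype=float))
    return total, weights

# Hypothetical learnable logits and per-loss values for one training step.
total, weights = composite_loss(logits=[0.0, 1.0, -1.0], losses=[2.0, 0.5, 4.0])
print(total, weights)
```

Because the logits are unconstrained real numbers, they can be updated by the same gradient-based optimizer as the network parameters while the resulting weights stay positive.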

[246] StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation

Ranjith Merugu,Bryan Bo Cao,Shubham Jain

Main category: cs.LG

TL;DR: StatsMerging is a lightweight, statistics-guided model merging method that uses SVD and task-specific weight distributions to guide merging, without requiring ground-truth labels or test samples.

Details Motivation: Addresses the problem of merging multiple large models within a constrained memory budget while improving generalization and adaptation. Method: Uses SVD to capture task-specific weight distributions, models those distributions with a lightweight learner, StatsMergeLearner, and introduces Task-Specific Teacher Distillation. Result: Performs strongly across eight tasks and outperforms state-of-the-art techniques in overall accuracy, generalization to unseen tasks, and robustness to image quality variations. Conclusion: StatsMerging is an efficient and general model merging method that also applies to models with heterogeneous architectures. Abstract: Model merging has emerged as a promising solution to accommodate multiple large models within constrained memory budgets. We present StatsMerging, a novel lightweight learning-based model merging method guided by weight distribution statistics without requiring ground truth labels or test samples. StatsMerging offers three key advantages: (1) It uniquely leverages singular values from singular value decomposition (SVD) to capture task-specific weight distributions, serving as a proxy for task importance to guide task coefficient prediction; (2) It employs a lightweight learner StatsMergeLearner to model the weight distributions of task-specific pre-trained models, improving generalization and enhancing adaptation to unseen samples; (3) It introduces Task-Specific Teacher Distillation for merging vision models with heterogeneous architectures, a merging learning paradigm that avoids costly ground-truth labels by task-specific teacher distillation. Notably, we present two types of knowledge distillation, (a) distilling knowledge from task-specific models to StatsMergeLearner; and (b) distilling knowledge from models with heterogeneous architectures prior to merging. Extensive experiments across eight tasks demonstrate the effectiveness of StatsMerging. Our results show that StatsMerging outperforms state-of-the-art techniques in terms of overall accuracy, generalization to unseen tasks, and robustness to image quality variations.

[247] Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

Marianna Nezhurina,Tomer Porian,Giovanni Pucceti,Tommie Kerssies,Romain Beaumont,Mehdi Cherti,Jenia Jitsev

Main category: cs.LG

TL;DR: The paper shows how scaling laws can be used to compare models and datasets in order to guide pre-training choices, and derives full scaling laws for two language-vision learning procedures, CLIP and MaMMUT, for the first time.

Details Motivation: Scaling law derivation can serve not only to predict model performance but also to compare models and datasets, informing the choice of pre-training procedure. Method: Derives scaling laws from dense measurements across a wide span of model and samples-seen scales, compares CLIP and MaMMUT, and checks the trends on downstream classification, retrieval, and segmentation tasks. Result: MaMMUT improves more strongly with scale and is more sample-efficient than CLIP, with consistent trends across datasets and tasks. Conclusion: Accurate scaling law derivation provides a means of comparing models and datasets across scales, avoids misleading conclusions drawn from a single reference scale, and supports systematic improvement of open foundation models and datasets. Abstract: In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, allowing one to decide which procedure is to be preferred for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, that use either contrastive only or contrastive and captioning text generative loss. Ensuring sufficient prediction accuracy for held out points, we use derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen validity of the comparison, we show scaling laws for various downstream tasks, classification, retrieval, and segmentation, and for different open datasets, DataComp, DFN and Re-LAION, observing consistently the same trends. We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws thus provides a means to perform model and dataset comparison across scale spans, avoiding misleading conclusions based on measurements from single reference scales only, paving the road for systematic comparison and improvement of open foundation models and datasets for their creation.
We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves $80.3\%$ zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing experiments in the paper and raw experiments data can be found at https://github.com/LAION-AI/scaling-laws-for-comparison.

[248] Exploring bidirectional bounds for minimax-training of Energy-based models

Cong Geng,Jia Wang,Li Chen,Zhiyong Gao,Jes Frellsen,Søren Hauberg

Main category: cs.LG

TL;DR: The paper proposes training energy-based models (EBMs) with bidirectional bounds to address the instability of conventional training, and demonstrates the approach on density estimation and sample generation.

Details Motivation: Energy-based models (EBMs) are well suited to estimating unnormalized densities, but their training is generally unstable. The paper aims to improve training stability with bidirectional bounds. Method: Trains EBMs with bidirectional bounds (maximizing a lower bound and minimizing an upper bound) and studies four bounds on the log-likelihood: lower bounds based on the singular values of the generator Jacobian and on mutual information, and upper bounds based on a gradient penalty and on diffusion processes. Result: Experiments show that bidirectional bounds stabilize EBM training and yield high-quality density estimation and sample generation. Conclusion: Bidirectional bounds offer a stable and efficient approach to training EBMs. Abstract: Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks, by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning that we maximize a lower bound and minimize an upper bound when training the EBM. We investigate four different bounds on the log-likelihood derived from different perspectives. We derive lower bounds based on the singular values of the generator Jacobian and on mutual information. To upper bound the negative log-likelihood, we consider a gradient penalty-like bound, as well as one based on diffusion processes. In all cases, we provide algorithms for evaluating the bounds. We compare the different bounds to investigate the pros and cons of the different approaches. Finally, we demonstrate that the use of bidirectional bounds stabilizes EBM training and yields high-quality density estimation and sample generation.

[249] Identifying and Understanding Cross-Class Features in Adversarial Training

Zeming Wei,Yiwen Guo,Yisen Wang

Main category: cs.LG

TL;DR: The paper studies adversarial training (AT) through class-wise feature attribution and finds that cross-class features drive robustness in the early stage of AT, while later stages shift toward class-specific features, offering a new view of AT mechanisms.

Details Motivation: Adversarial training (AT) is an effective way to make deep neural networks robust against adversarial attacks, but its training mechanisms and dynamics remain unclear. The paper analyzes AT through class-wise feature attribution. Method: Studies AT via features shared by multiple classes (cross-class features), gives theoretical support through a synthetic data model, and runs systematic studies across multiple model architectures and settings. Result: In the initial stage of AT the model tends to learn more cross-class features, up to the best robustness checkpoint; as training continues and robust overfitting sets in, the model relies more on class-specific features. Conclusion: The study offers a new perspective on AT mechanisms and a unified explanation of the advantage of soft-label training and of robust overfitting. Abstract: Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, which we offer theoretical evidence to illustrate through a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, the model tends to make decisions based on more class-specific features. Based on these discoveries, we further provide a unified view of two existing properties of AT, including the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives on studying them. Our code is available at https://github.com/PKU-ML/Cross-Class-Features-AT.

[250] Aligning Latent Spaces with Flow Priors

Yizhuo Li,Yuying Ge,Yixiao Ge,Ying Shan,Ping Luo

Main category: cs.LG

TL;DR: Proposes a framework that uses flow-based generative models as priors to align learnable latent spaces with target distributions. The method avoids expensive likelihood evaluations and ODE solving, and is validated both theoretically and empirically.

Details Motivation: Latent space alignment is an important problem in generative modeling, but existing methods are computationally expensive or inefficient. The paper aims for a more efficient and theoretically grounded alignment method. Method: Pretrains a flow model to capture the target distribution, then regularizes the latent space with an alignment loss that reformulates the flow matching objective to treat the latents as optimization targets. Result: Proves that minimizing the alignment loss is a tractable surrogate for maximizing a variational lower bound on the log-likelihood of the latents under the target distribution. Experiments on ImageNet validate the approach. Conclusion: The framework offers a new and efficient route to latent space alignment, with both theoretical and empirical support. Abstract: This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape closely approximates the negative log-likelihood of the target distribution. We further validate the effectiveness of our approach through large-scale image generation experiments on ImageNet with diverse target distributions, accompanied by detailed discussions and ablation studies. With both theoretical and empirical validation, our framework paves a new way for latent space alignment.

cs.IR [Back]

[251] Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion

Lingyuan Liu,Mengxiang Zhang

Main category: cs.IR

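TL;DR: The paper proposes Exp4Fuse, a framework that improves sparse retrieval through zero-shot LLM query expansion, combining two retrieval routes with a rank fusion method; it outperforms existing methods on multiple datasets.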

Details Motivation: Existing query expansion methods based on LLM-generated documents are costly and computationally complex, and sparse retrieval performance needs improvement. Method: Proposes the Exp4Fuse framework, which combines two retrieval routes, one for the original query and one for the LLM-expanded query, and fuses them with a modified reciprocal rank fusion method. Result: Outperforms existing LLM-based query expansion methods on multiple datasets and reaches SOTA performance when combined with advanced sparse retrievers. Conclusion: Exp4Fuse substantially improves sparse retrieval performance and is both efficient and effective. Abstract: Large Language Models (LLMs) have shown potential in generating hypothetical documents for query expansion, thereby enhancing information retrieval performance. However, the efficacy of this method is highly dependent on the quality of the generated documents, which often requires complex prompt strategies and the integration of advanced dense retrieval techniques. This can be both costly and computationally intensive. To mitigate these limitations, we explore the use of zero-shot LLM-based query expansion to improve sparse retrieval, particularly for learned sparse retrievers. We introduce a novel fusion ranking framework, Exp4Fuse, which enhances the performance of sparse retrievers through an indirect application of zero-shot LLM-based query expansion. Exp4Fuse operates by simultaneously considering two retrieval routes, one based on the original query and the other on the LLM-augmented query. It then generates two ranked lists using a sparse retriever and fuses them using a modified reciprocal rank fusion method. We conduct extensive evaluations of Exp4Fuse against leading LLM-based query expansion methods and advanced retrieval techniques on three MS MARCO-related datasets and seven low-resource datasets. Experimental results reveal that Exp4Fuse not only surpasses existing LLM-based query expansion methods in enhancing sparse retrievers but also, when combined with advanced sparse retrievers, achieves SOTA results on several benchmarks. This highlights the superior performance and effectiveness of Exp4Fuse in improving query expansion for sparse retrieval.
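
The fusion step described above can be illustrated with a minimal sketch. The abstract does not specify how Exp4Fuse modifies reciprocal rank fusion, so the code below implements only the standard formulation, score(d) = sum over lists of w / (k + rank(d)). The function name, the smoothing constant `k=60`, the optional `weights` argument, and the document ids are all assumptions made for illustration, not details from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, weights=None):
    """Fuse ranked lists of document ids with standard (unmodified) RRF.

    Each list contributes w / (k + rank) to a document's score, where
    rank starts at 1. `k` and `weights` are illustrative assumptions.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked lists from the two retrieval routes.
original_route = ["d1", "d2", "d3"]   # retrieved with the original query
expanded_route = ["d3", "d1", "d4"]   # retrieved with the LLM-expanded query
fused = reciprocal_rank_fusion([original_route, expanded_route])
print(fused)
```

Documents that appear near the top of both lists (here `d1` and `d3`) rise above documents found by only one route, which is the intuition behind fusing the original and expanded queries.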

[252] GOLFer: Smaller LM-Generated Documents Hallucination Filter & Combiner for Query Expansion in Information Retrieval

Lingyuan Liu,Mengxiang Zhang

Main category: cs.IR

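TL;DR: GOLFer is a new method that uses smaller open-source language models for query expansion; by filtering hallucinated content and combining documents, it addresses the high cost and computational intensity of large language models.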

Details Motivation: Large language models (LLMs) perform well at query expansion, but the reliance on large-scale models makes the approach costly, computationally intensive, and of limited accessibility. Method: GOLFer consists of two modules: a hallucination filter, which detects and removes non-factual and inconsistent content in generated documents, and a document combiner, which balances the query and the filtered content using a weight vector. Result: Experiments on three web search and ten low-resource datasets show that GOLFer outperforms other methods that use smaller LMs and is competitive with methods that use large LLMs. Conclusion: GOLFer offers an efficient, low-cost query expansion solution for smaller language models, with performance close to that of large models. Abstract: Large language model (LLM)-based query expansion for information retrieval augments queries with hypothetical documents generated by LLMs. However, its performance relies heavily on the scale of the language models (LMs), necessitating larger, more advanced LLMs. This approach is costly, computationally intensive, and often has limited accessibility. To address these limitations, we introduce GOLFer (Smaller LMs-Generated Documents Hallucination Filter & Combiner), a novel method leveraging smaller open-source LMs for query expansion. GOLFer comprises two modules: a hallucination filter and a document combiner. The former detects and removes non-factual and inconsistent sentences in generated documents, a common issue with smaller LMs, while the latter combines the filtered content with the query using a weight vector to balance their influence. We evaluate GOLFer alongside dominant LLM-based query expansion methods on three web search and ten low-resource datasets. Experimental results demonstrate that GOLFer consistently outperforms other methods using smaller LMs, and maintains competitive performance against methods using large-size LLMs, demonstrating its effectiveness.

[253] Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings

Yubo Ma,Jinsong Li,Yuhang Zang,Xiaobao Wu,Xiaoyi Dong,Pan Zhang,Yuhang Cao,Haodong Duan,Jiaqi Wang,Yixin Cao,Aixin Sun

Main category: cs.IR

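TL;DR: The study examines ways to reduce the number of patch embeddings per page in visual document retrieval (VDR), considering two strategies: token pruning and token merging. Random pruning unexpectedly outperforms more sophisticated pruning methods, but merging is more effective; the resulting Light-ColPali/ColQwen2 retains high performance while greatly reducing memory usage.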

Details Motivation: ColPali/ColQwen2 performs strongly in VDR, but encoding each page into multiple patch-level embeddings causes excessive memory usage. The study aims to reduce the number of embeddings while minimizing performance loss. Method: Evaluates two strategies, token pruning and token merging; finds pruning unsuitable for VDR, while merging achieves efficient compression by optimizing the combination of strategies across three dimensions. Result: Light-ColPali/ColQwen2 retains 98.2% of retrieval performance with only 11.8% of the original memory usage, and still preserves 94.6% effectiveness at a 2.8% memory footprint. Conclusion: The study provides valuable insights and a competitive baseline for efficient VDR, and Light-ColPali/ColQwen2 demonstrates the potential of merging strategies. Abstract: Despite the strong performance of ColPali/ColQwen2 in Visualized Document Retrieval (VDR), it encodes each page into multiple patch-level embeddings and leads to excessive memory usage. This empirical study investigates methods to reduce patch embeddings per page with minimal performance degradation. We evaluate two token-reduction strategies: token pruning and token merging. Regarding token pruning, we surprisingly observe that a simple random strategy outperforms other sophisticated pruning methods, though still far from satisfactory. Further analysis reveals that pruning is inherently unsuitable for VDR as it requires removing certain page embeddings without query-specific information. Turning to token merging (more suitable for VDR), we search for the optimal combinations of merging strategy across three dimensions and develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance with only 11.8% of original memory usage, and preserves 94.6% effectiveness at 2.8% memory footprint. We expect our empirical findings and the resulting Light-ColPali/ColQwen2 to offer valuable insights and establish a competitive baseline for future research towards efficient VDR.
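
To make the idea of token merging concrete, here is a minimal sketch under a strong simplifying assumption: consecutive patch embeddings are averaged in fixed-size groups. The abstract does not say which merging strategy Light-ColPali/ColQwen2 selects across its three dimensions, so this is only one simple possibility; the function name, the `factor` parameter, and the toy vectors are hypothetical.

```python
def merge_patch_embeddings(embeddings, factor):
    """Average each run of `factor` consecutive patch embeddings.

    `embeddings` is a list of equal-length float vectors (one per patch).
    Returns roughly len(embeddings) / factor merged vectors, so the
    storage footprint shrinks by about that factor.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    merged = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        # Component-wise mean over the vectors in this group.
        merged.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return merged


# Toy page with four 2-dimensional patch embeddings (hypothetical values).
page = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
merged = merge_patch_embeddings(page, factor=2)
print(merged)                    # two vectors remain
print(len(merged) / len(page))   # fraction of the original storage
```

Unlike pruning, merging keeps a contribution from every patch, which matches the abstract's observation that discarding embeddings without query-specific information is risky.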

eess.IV [Back]

[254] Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning

Hasin Us Sami,Swapneel Sen,Amit K. Roy-Chowdhury,Srikanth V. Krishnamurthy,Basak Guler

Main category: eess.IV

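TL;DR: The paper studies the privacy risks of parameter-efficient fine-tuning (PEFT) in federated learning, showing that with a maliciously designed pretrained model and adapter modules, an attacker can reconstruct users' local data through gradient inversion attacks.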

Details Motivation: Investigates privacy issues of PEFT in federated learning and reveals how an attacker can steal user data via gradient inversion attacks. Method: Designs a malicious pretrained model and adapter modules, then uses gradient inversion attacks to reconstruct users' local data. Result: Experiments show that an attacker can reconstruct large batches of fine-tuning images with high fidelity. Conclusion: The study highlights the need for privacy-preserving mechanisms for PEFT and points to future research directions. Abstract: Federated learning (FL) allows multiple data-owners to collaboratively train machine learning models by exchanging local gradients, while keeping their private data on-device. To simultaneously enhance privacy and training efficiency, parameter-efficient fine-tuning (PEFT) of large-scale pretrained models has recently gained substantial attention in FL. While keeping a pretrained (backbone) model frozen, each user fine-tunes only a few lightweight modules to be used in conjunction, to fit specific downstream applications. Accordingly, only the gradients with respect to these lightweight modules are shared with the server. In this work, we investigate how the privacy of the fine-tuning data of the users can be compromised via a malicious design of the pretrained model and trainable adapter modules. We demonstrate gradient inversion attacks on a popular PEFT mechanism, the adapter, which allow an attacker to reconstruct local data samples of a target user, using only the accessible adapter gradients. Via extensive experiments, we demonstrate that a large batch of fine-tuning images can be retrieved with high fidelity. Our attack highlights the need for privacy-preserving mechanisms for PEFT, while opening up several future directions. Our code is available at https://github.com/info-ucr/PEFTLeak.

[255] A Poisson-Guided Decomposition Network for Extreme Low-Light Image Enhancement

Isha Rao,Sanjay Ghosh

Main category: eess.IV

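TL;DR: Proposes a lightweight deep learning method that combines Retinex decomposition with Poisson denoising for image denoising and enhancement under extreme low-light conditions.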

Details Motivation: The traditional Gaussian noise assumption does not hold in low-light imaging, where real noise is better described by a Poisson distribution; signal-dependent noise must be addressed. Method: Designs a unified encoder-decoder network that integrates Retinex decomposition with Poisson denoising, without requiring prior reflectance and illumination information. Result: Substantially improves visibility and brightness in low-light conditions while preserving image structure and color constancy. Conclusion: The method is effective and practical under extreme low-light conditions and addresses signal-dependent noise. Abstract: Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in most cases. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a lightweight deep learning-based method that integrates Retinex-based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without prior requirement for reflectance and illumination, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without causing any form of color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.
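
The abstract mentions a Poisson denoising loss without giving its form. A common choice for signal-dependent Poisson noise is the Poisson negative log-likelihood, sketched below on plain Python lists. Whether the paper uses exactly this form is an assumption; the function name, the `eps` clamp, and the example values are hypothetical.

```python
import math


def poisson_nll(predicted, observed, eps=1e-8):
    """Mean Poisson negative log-likelihood of `observed` counts.

    For a predicted rate lam and an observed count k the per-pixel term
    is lam - k * log(lam); the log(k!) term is dropped because it does
    not depend on the prediction. `eps` guards against log(0).
    """
    total = 0.0
    for lam, k in zip(predicted, observed):
        lam = max(lam, eps)
        total += lam - k * math.log(lam)
    return total / len(predicted)


# Hypothetical photon counts for three pixels.
observed = [2.0, 0.0, 5.0]
print(poisson_nll([2.0, 0.1, 5.0], observed))  # close prediction, lower loss
print(poisson_nll([4.0, 1.0, 1.0], observed))  # poor prediction, higher loss
```

For each pixel the term is minimized when the predicted rate equals the observed count, and the penalty depends on the signal level, which is the property that distinguishes it from a Gaussian (squared-error) loss.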

[256] DACN: Dual-Attention Convolutional Network for Hyperspectral Image Super-Resolution

Usman Muhammad,Jorma Laaksonen

Main category: eess.IV

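TL;DR: DACN is a dual-attention convolutional network for hyperspectral image super-resolution; it uses multi-head attention and channel-spatial attention to address local dependence and data scarcity, and improves performance with a custom loss function.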

Details Motivation: Existing 2D CNNs for hyperspectral image super-resolution rely on local neighborhoods, lack global contextual understanding, and are limited by band correlation and data scarcity. Method: DACN uses augmented convolutions with multi-head attention to capture local and global feature dependencies, infers channel and spatial attention maps to decide where to focus, and uses a custom loss that combines L2 regularization with a spatial-spectral gradient loss. Result: Experiments on two hyperspectral datasets show that combining multi-head attention with channel attention outperforms either mechanism alone. Conclusion: DACN effectively improves hyperspectral image super-resolution through its dual-attention mechanism and custom loss function. Abstract: 2D convolutional neural networks (CNNs) have attracted significant attention for hyperspectral image super-resolution tasks. However, a key limitation is their reliance on local neighborhoods, which leads to a lack of global contextual understanding. Moreover, band correlation and data scarcity continue to limit their performance. To mitigate these issues, we introduce DACN, a dual-attention convolutional network for hyperspectral image super-resolution. Specifically, the model first employs augmented convolutions, integrating multi-head attention to effectively capture both local and global feature dependencies. Next, we infer separate attention maps for the channel and spatial dimensions to determine where to focus across different channels and spatial positions. Furthermore, a custom optimized loss function is proposed that combines L2 regularization with spatial-spectral gradient loss to ensure accurate spectral fidelity. Experimental results on two hyperspectral datasets demonstrate that the combination of multi-head attention and channel attention outperforms either attention mechanism used individually.

[257] PixCell: A generative foundation model for digital histopathology images

Srikar Yellapragada,Alexandros Graikos,Zilinghan Li,Kostas Triaridis,Varun Belagali,Saarthak Kapse,Tarak Nath Nandi,Ravi K Madduri,Prateek Prasanna,Tahsin Kurc,Rajarsi R. Gupta,Joel Saltz,Dimitris Samaras

Main category: eess.IV

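TL;DR: PixCell is a diffusion-based generative foundation model for histopathology that generates diverse, high-quality images and addresses data scarcity, privacy preservation, and virtual staining.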

Details Motivation: Addresses the challenges of data scarcity, privacy preservation, and generative tasks such as virtual staining in pathology. Method: Trains the diffusion model PixCell on the PanCan-30M dataset without annotations, using a progressive training strategy and self-supervision-based conditioning. Result: PixCell generates high-quality images that can be used for training discriminative models, data sharing, data augmentation, and education, and it supports inference of molecular markers. Conclusion: PixCell provides a powerful generative tool for computational pathology, advancing research and applications. Abstract: The digitization of histology slides has revolutionized pathology, providing massive datasets for cancer diagnosis and research. Contrastive self-supervised and vision-language models have been shown to effectively mine large pathology datasets to learn discriminative representations. On the other hand, generative models, capable of synthesizing realistic and diverse images, present a compelling solution to address unique problems in pathology that involve synthesizing images: overcoming annotated data scarcity, enabling privacy-preserving data sharing, and performing inherently generative tasks, such as virtual staining. We introduce PixCell, the first diffusion-based generative foundation model for histopathology. We train PixCell on PanCan-30M, a vast, diverse dataset derived from 69,184 H&E-stained whole slide images covering various cancer types. We employ a progressive training strategy and a self-supervision-based conditioning that allows us to scale up training without any annotated data. PixCell generates diverse and high-quality images across multiple cancer types, which we find can be used in place of real data to train a self-supervised discriminative model. Synthetic images shared between institutions are subject to fewer regulatory barriers than would be the case with real clinical images. Furthermore, we showcase the ability to precisely control image generation using a small set of annotated images, which can be used for both data augmentation and educational purposes. Testing on a cell segmentation task, a mask-guided PixCell enables targeted data augmentation, improving downstream performance. Finally, we demonstrate PixCell's ability to use H&E structural staining to infer results from molecular marker studies; we use this capability to infer IHC staining from H&E images. Our trained models are publicly released to accelerate research in computational pathology.

[258] DM-SegNet: Dual-Mamba Architecture for 3D Medical Image Segmentation with Global Context Modeling

Hangyu Ji

Main category: eess.IV

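TL;DR: DM-SegNet proposes a Dual-Mamba architecture that uses directional state transitions and anatomy-aware hierarchical decoding to reconcile global context modeling with spatial topology preservation in medical image segmentation.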

Details Motivation: Existing medical state space models (SSMs) suffer from encoder-decoder incompatibility: 1D sequence flattening damages spatial structure, and conventional decoders cannot exploit Mamba's state propagation. Method: DM-SegNet uses a Mamba module with four-directional 3D scanning to maintain spatial coherence, a gated spatial convolution layer to enhance feature representation, and a Mamba-driven decoding framework for cross-scale state synchronization. Result: Achieves a DSC of 85.44% on the Synapse dataset (abdominal organ segmentation) and 90.22% on the BraTS2023 dataset (brain tumor segmentation). Conclusion: DM-SegNet achieves state-of-the-art performance on medical image segmentation through these design innovations. Abstract: Accurate 3D medical image segmentation demands architectures capable of reconciling global context modeling with spatial topology preservation. While State Space Models (SSMs) like Mamba show potential for sequence modeling, existing medical SSMs suffer from encoder-decoder incompatibility: the encoder's 1D sequence flattening compromises spatial structures, while conventional decoders fail to leverage Mamba's state propagation. We present DM-SegNet, a Dual-Mamba architecture integrating directional state transitions with anatomy-aware hierarchical decoding. The core innovations include a quadri-directional spatial Mamba module employing four-directional 3D scanning to maintain anatomical spatial coherence, a gated spatial convolution layer that enhances spatially sensitive feature representation prior to state modeling, and a Mamba-driven decoding framework enabling bidirectional state synchronization across scales. Extensive evaluation on two clinically significant benchmarks demonstrates the efficacy of DM-SegNet: achieving state-of-the-art Dice Similarity Coefficient (DSC) of 85.44% on the Synapse dataset for abdominal organ segmentation and 90.22% on the BraTS2023 dataset for brain tumor segmentation.
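
For readers unfamiliar with the metric reported above, the Dice Similarity Coefficient for a binary mask is DSC = 2 |P ∩ T| / (|P| + |T|). The sketch below computes it for flattened binary masks; the function name and the toy masks are illustrative assumptions, and benchmark results such as those on Synapse are typically averaged over organs and cases, which this sketch does not do.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two flat binary masks.

    `pred` and `target` are equal-length sequences of 0/1 voxel labels.
    Two empty masks are treated as a perfect match (DSC = 1.0).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 4-voxel masks: one voxel overlaps, each mask has two voxels.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(dice_coefficient(pred, target))  # 2 * 1 / (2 + 2)
```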

eess.AS [Back]

[259] Can we reconstruct a dysarthric voice with the large speech model Parler TTS?

Ariadna Sanchez,Simon King

Main category: eess.AS

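TL;DR: The paper explores using the large speech model Parler TTS to reconstruct the pre-onset voice of a speaker with dysarthria, as a communication aid. After fine-tuning on a curated dataset, the model can generate speech close to the original voice but struggles to control intelligibility and maintain consistent speaker identity.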

Details Motivation: People with dysarthria have difficulty communicating, and personalized speech synthesis is a potential solution. The study aims to reconstruct the pre-onset voice with a speech model. Method: Fine-tunes the Parler TTS model on an annotated dataset and evaluates the intelligibility and speaker-identity consistency of the generated speech. Result: The model can generate speech close to the original voice but performs poorly at controlling intelligibility and identity consistency. Conclusion: Future work needs to improve the controllability of such models to better support voice reconstruction. Abstract: Speech disorders can make communication hard or even impossible for those who develop them. Personalised Text-to-Speech is an attractive option as a communication aid. We attempt voice reconstruction using a large speech model, with which we generate an approximation of a dysarthric speaker's voice prior to the onset of their condition. In particular, we investigate whether a state-of-the-art large speech model, Parler TTS, can generate intelligible speech while maintaining speaker identity. We curate a dataset and annotate it with relevant speaker and intelligibility information, and use this to fine-tune the model. Our results show that the model can indeed learn to generate from the distribution of this challenging data, but struggles to control intelligibility and to maintain consistent speaker identity. We propose future directions to improve controllability of this class of model, for the voice reconstruction task.

[260] Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

Haibin Wu,Yuxuan Hu,Ruchao Fan,Xiaofei Wang,Kenichi Kumatani,Bo Ren,Jianwei Yu,Heng Lu,Lijuan Wang,Yao Qian,Jinyu Li

Main category: eess.AS

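TL;DR: The paper compares joint speech-text decoding strategies (interleaved versus parallel generation) and finds that the interleaved approach gives the best alignment but slow inference. It proposes a new early-stop interleaved (ESI) pattern that accelerates decoding and improves performance, and it improves speech QA with curated high-quality QA datasets.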

Details Motivation: Studies how joint speech-text decoding strategies affect performance, efficiency, and alignment quality, with the aim of improving decoding for speech language models. Method: Systematically compares interleaved and parallel generation decoding strategies under the same base language model, speech tokenizer, and training data, and proposes the early-stop interleaved (ESI) pattern. Result: The interleaved approach gives the best alignment but slow inference; the ESI pattern significantly accelerates decoding with slightly better performance; high-quality QA datasets further improve speech QA. Conclusion: Interleaved decoding performs best for speech-text alignment, and the ESI pattern addresses its inference speed problem, offering an efficient solution for spoken dialogue systems. Abstract: Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model, offering a promising direction for spoken dialogue systems. The choice of joint speech-text decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference due to long token sequences. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.

[261] EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition

Yi-Cheng Lin,Huang-Cheng Chou,Yu-Hsuan Li Liang,Hung-yi Lee

Main category: eess.AS

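TL;DR: The paper presents EMO-Debias, which compares 13 debiasing methods for multi-label speech emotion recognition and analyzes the trade-off between fairness and accuracy.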

Details Motivation: The effectiveness and robustness of existing debiasing methods in multi-label speech emotion recognition have not been fully studied, and gender bias persists. Method: Compares 13 debiasing methods spanning pre-processing, regularization, adversarial learning, biased learners, and distributionally robust optimization, using WavLM and XLSR representations, evaluated under gender imbalance. Result: Experiments quantify the fairness-accuracy trade-off and identify methods that reduce gender performance gaps without harming overall performance. Conclusion: The study gives practical guidance for choosing effective debiasing strategies and highlights the impact of dataset distributions. Abstract: Speech emotion recognition (SER) systems often exhibit gender bias. However, the effectiveness and robustness of existing debiasing methods in multi-label scenarios remain underexplored. To address this gap, we present EMO-Debias, a large-scale comparison of 13 debiasing methods applied to multi-label SER. Our study encompasses techniques from pre-processing, regularization, adversarial learning, biased learners, and distributionally robust optimization. Experiments conducted on acted and naturalistic emotion datasets, using WavLM and XLSR representations, evaluate each method under conditions of gender imbalance. Our analysis quantifies the trade-offs between fairness and accuracy, identifying which approaches consistently reduce gender performance gaps without compromising overall model performance. The findings provide actionable insights for selecting effective debiasing strategies and highlight the impact of dataset distributions.

astro-ph.SR [Back]

[262] Deep learning image burst stacking to reconstruct high-resolution ground-based solar observations

Christoph Schirninger,Robert Jarolim,Astrid M. Veronig,Christoph Kuckein

Main category: astro-ph.SR

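TL;DR: The paper proposes a deep-learning-based real-time image reconstruction method to address image degradation caused by atmospheric turbulence in ground-based solar telescope observations.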

Details Motivation: High-resolution observations with ground-based solar telescopes are limited by atmospheric turbulence, and existing reconstruction methods struggle with strong turbulence and high computational cost. Method: Uses unpaired image-to-image translation, with a deep learning model that reconstructs 100 short-exposure images into a single high-quality image in real time. Result: The method is more robust in terms of perceptual quality, especially when the reference images contain artifacts, and makes full use of the information in the image sequence. Conclusion: The deep learning model achieves the best reconstructions when given the full image burst, offering an efficient solution for solar observation. Abstract: Large-aperture ground-based solar telescopes allow the solar atmosphere to be resolved in unprecedented detail. However, observations are limited by Earth's turbulent atmosphere, requiring post-image corrections. Current reconstruction methods using short-exposure bursts face challenges with strong turbulence and high computational costs. We introduce a deep learning approach that reconstructs 100 short-exposure images into one high-quality image in real time. Using unpaired image-to-image translation, our model is trained on degraded bursts with speckle reconstructions as references, improving robustness and generalization. Our method shows an improved robustness in terms of perceptual quality, especially when speckle reconstructions show artifacts. An evaluation with a varying number of images per burst demonstrates that our method makes efficient use of the combined image information and achieves the best reconstructions when provided with the full image burst.

cs.SD [Back]

[263] Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning

Hien Ohnaka,Yuma Shirahata,Byeongseon Park,Ryuichi Yamamoto

Main category: cs.SD

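TL;DR: Proposes a model that makes phonemic and prosodic labels of speech coherent with graphemes through implicit and explicit conditioning, significantly improving consistency, with effectiveness verified on an accent estimation task.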

Details Motivation: Existing methods generate labels only by fine-tuning a pre-trained ASR model and lack a direct link to graphemes, which limits label accuracy and downstream use. Method: (1) Implicitly conditions on graphemes through a prompt encoder using pre-trained BERT features; (2) explicitly prunes label hypotheses inconsistent with the graphemes during inference. Result: Significantly improves consistency between graphemes and predicted labels, and improves accuracy on an accent estimation task. Conclusion: The parallel data produced by the model effectively supports downstream tasks such as text-to-speech and accent estimation. Abstract: We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions the label generation on corresponding graphemes by two methods: 1) Add implicit grapheme conditioning through a prompt encoder using pre-trained BERT features. 2) Explicitly prune the label hypotheses inconsistent with the grapheme during inference. These methods enable obtaining parallel data of speech, the labels, and graphemes, which is applicable to various downstream tasks such as text-to-speech and accent estimation from text. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels. Further, experiments on an accent estimation task confirmed that the parallel data created by the proposed method effectively improve the estimation accuracy.

[264] LLM-based phoneme-to-grapheme for phoneme-based speech recognition

Te Ma,Min Bi,Saierdaer Yusuyin,Hao Huang,Zhijian Ou

Main category: cs.SD

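TL;DR: The paper proposes a large language model (LLM)-based phoneme-to-grapheme (P2G) decoding method for phoneme-based automatic speech recognition (ASR). It addresses information loss with data augmentation and randomized training strategies, and outperforms conventional WFST methods in crosslingual ASR for Polish and German.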

Details Motivation: Conventional WFST-based decoding has a complex pipeline and cannot leverage large language models, so a more efficient and better-performing decoding method is needed. Method: Proposes LLM-P2G decoding, comprising speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G) stages, and improves the model with data augmentation with noisy phonemes (DANP) and randomized top-K marginalized (TKM) training. Result: LLM-P2G achieves relative WER reductions of 3.6% and 6.9% in crosslingual ASR for Polish and German, respectively. Conclusion: LLM-P2G outperforms conventional WFST methods in data efficiency and performance, offering an effective solution for phoneme-based ASR. Abstract: In automatic speech recognition (ASR), phoneme-based multilingual pre-training and crosslingual fine-tuning is attractive for its high data efficiency and competitive results compared to subword-based models. However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G). A challenge is that there seems to be information loss in cascading S2P and P2G. To address this challenge, we propose two training strategies: data augmentation with noisy phonemes (DANP), and randomized top-$K$ marginalized (TKM) training and decoding. Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and German, by relative WER reductions of 3.6% and 6.9% respectively.
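
Because the results above are stated as relative rather than absolute WER reductions, a short arithmetic sketch may help. The abstract does not report the underlying WER values, so the numbers below are purely hypothetical and chosen only so that the relative reduction comes out at 3.6%; the function name is likewise an assumption.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction: (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values, NOT taken from the paper: a baseline WER of 10.00%
# and a new WER of 9.64% correspond to a 3.6% relative reduction,
# even though the absolute difference is only 0.36 percentage points.
reduction = relative_wer_reduction(10.00, 9.64)
print(f"{reduction:.1%}")
```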

cs.MA [Back]

[265] Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games

Niv Eckhaus,Uri Berger,Gabriel Stanovsky

Main category: cs.MA

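TL;DR: The paper proposes an adaptive asynchronous LLM agent that decides when to speak in order to model realistic asynchronous settings, and it performs on par with human players in online Mafia games.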

Details Motivation: LLMs are mainly used in synchronous communication, while many real-world settings, such as group chats or team meetings, are asynchronous; agents that can decide when to speak are therefore needed. Method: Develops an adaptive asynchronous LLM agent and collects a dataset of online Mafia games for evaluation. Result: The agent is on par with humans in game performance and in blending in with human players; its timing behavior is close to human patterns, but message content differs. Conclusion: The work paves the way for applying LLMs in realistic asynchronous settings and releases data and code to support further research. Abstract: LLMs are used predominantly in synchronous communication, where a human user and a model communicate in alternating turns. In contrast, many real-world settings are inherently asynchronous. For example, in group chats, online team meetings, or social games, there is no inherent notion of turns; therefore, the decision of when to speak forms a crucial part of the participant's decision making. In this work, we develop an adaptive asynchronous LLM-agent which, in addition to determining what to say, also decides when to say it. To evaluate our agent, we collect a unique dataset of online Mafia games, including both human participants, as well as our asynchronous agent. Overall, our agent performs on par with human players, both in game performance, as well as in its ability to blend in with the other human players. Our analysis shows that the agent's behavior in deciding when to speak closely mirrors human patterns, although differences emerge in message content. We release all our data and code to support and encourage further research for more realistic asynchronous communication between LLM agents. This work paves the way for integration of LLMs into realistic human group settings, from assistance in team discussions to educational and professional environments where complex social dynamics must be navigated.

cs.AI [Back]

[266] Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Guangchen Lan,Huseyin A. Inan,Sahar Abdelnabi,Janardhan Kulkarni,Lukas Wutschitz,Reza Shokri,Christopher G. Brinton,Robert Sim

Main category: cs.AI

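TL;DR: The paper proposes a method that uses explicit reasoning and a reinforcement learning framework to ensure that autonomous agents maintain contextual integrity when making decisions, substantially reducing inappropriate information disclosure.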

Details Motivation: As autonomous agents increasingly make decisions on behalf of users, ensuring contextual integrity (CI), that is, judging what information is appropriate to share for a given task, becomes a central question. Method: First prompts LLMs to reason explicitly about CI, then develops a reinforcement learning framework that further trains models to achieve CI; the method is validated on a synthetic dataset of about 700 examples. Result: The method substantially reduces inappropriate information disclosure while maintaining task performance, and the improvements transfer to human-annotated CI benchmarks such as PrivacyLens. Conclusion: Explicit reasoning combined with reinforcement learning can effectively improve the contextual integrity of autonomous agents and reduce the risk of privacy leakage. Abstract: As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) -- what is the appropriate information to share while carrying out a certain task -- becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only $\sim700$ examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.

[267] A Graph-Retrieval-Augmented Generation Framework Enhances Decision-Making in the Circular Economy

Yang Zhao,Chengxiao Dai,Dusit Niyato,Chuan Fu Tan,Keyi Xiang,Yueyang Wang,Zhiquan Yeo,Daren Tan Zong Loong,Jonathan Low Zhaozhi,Eugene H. Z. HO

Main category: cs.AI

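TL;DR: CircuGraphRAG is a knowledge-graph-based retrieval-augmented generation framework for improving the accuracy and reliability of large language models in the circular economy domain.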

Details Motivation: Large language models used in sustainable manufacturing often produce incorrect industrial codes and emission factors, which undermines decisions. CircuGraphRAG aims to solve this problem. Method: Using a knowledge graph that connects about 117,000 industrial and waste entities, translates natural language queries into SPARQL and retrieves verified subgraphs to ensure accuracy. Result: Performs strongly in single-hop and multi-hop question answering, with ROUGE-L F1 scores up to 1.0, halved response time, and 16% lower token usage. Conclusion: CircuGraphRAG provides reliable support for circular economy planning and advances low-carbon resource decision making. Abstract: Large language models (LLMs) hold promise for sustainable manufacturing, but often hallucinate industrial codes and emission factors, undermining regulatory and investment decisions. We introduce CircuGraphRAG, a retrieval-augmented generation (RAG) framework that grounds LLM outputs in a domain-specific knowledge graph for the circular economy. This graph connects 117,380 industrial and waste entities with classification codes and GWP100 emission data, enabling structured multi-hop reasoning. Natural language queries are translated into SPARQL and verified subgraphs are retrieved to ensure accuracy and traceability. Compared with Standalone LLMs and Naive RAG, CircuGraphRAG achieves superior performance in single-hop and multi-hop question answering, with ROUGE-L F1 scores up to 1.0, while baseline scores remain below 0.08. It also improves efficiency, halving the response time and reducing token usage by 16% in representative tasks. CircuGraphRAG provides fact-checked, regulatory-ready support for circular economy planning, advancing reliable, low-carbon resource decision making.
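
The ROUGE-L F1 scores quoted above are based on the longest common subsequence (LCS) between a generated answer and a reference. The following is a minimal sketch of that computation, assuming whitespace tokenization, no stemming, and an equal weighting of precision and recall; the paper's exact evaluation settings are not stated in the abstract, and the function names and example strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [0] * (len(b) + 1)          # dp[j]: LCS of a[:i] and b[:j]
    for x in a:
        prev_diag = 0                # value of dp[j-1] from the previous row
        for j, y in enumerate(b, start=1):
            above = dp[j]            # value of dp[j] from the previous row
            dp[j] = prev_diag + 1 if x == y else max(dp[j], dp[j - 1])
            prev_diag = above
    return dp[len(b)]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 with whitespace tokenization (a simplifying assumption)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# An exact match scores 1.0; a partial answer is penalized through recall.
print(rouge_l_f1("the cat sat on the mat", "the cat sat on the mat"))
print(rouge_l_f1("the cat sat", "the cat sat on the mat"))
```

A score of 1.0 therefore means the generated answer reproduces the reference token sequence exactly under this tokenization, which is plausible for short factual answers such as classification codes.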

[268] Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science

Peter Jansen,Samiah Hassan,Ruoyao Wang

Main category: cs.AI

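TL;DR: The paper introduces the Matter-of-Fact dataset for assessing the feasibility of hypotheses, with the aim of accelerating scientific discovery.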

Details Motivation: Automated experiments are costly, so hypotheses need to be filtered by feasibility to make discovery more efficient. Method: Builds a dataset of 8.4k scientific claims spanning several materials science topics and tests the performance of retrieval-augmented generation and code generation baselines. Result: Current models do not exceed 72% (chance is 50%), while expert verification suggests the problems are solvable. Conclusion: The task is challenging for current models, but progress on it could significantly advance scientific discovery. Abstract: Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e. thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale, while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics, including superconductors, semiconductors, batteries, and aerospace materials, while including qualitative and quantitative claims from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all are solvable -- highlighting both the difficulty of this task for current models, and the potential to accelerate scientific discovery by making near-term progress.

[269] Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design

Lin Sun,Weihong Lin,Jinzhu Wu,Yongfu Zhu,Xiaoqi Jian,Guangxiang Zhao,Change Jia,Linglin Zhang,Sai-er Hu,Yuhan Wu,Xiangzheng Zhang

Main category: cs.AI

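TL;DR: The study finds that benchmark results for the Deepseek-R1-Distill series of models are affected by many factors and fluctuate significantly, and it calls for a more rigorous evaluation paradigm.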

Details Motivation: When evaluating open-source reasoning models such as the Deepseek-R1-Distill series and QwQ-32B, results fluctuate widely and the claimed performance improvements are hard to reproduce. Method: Analyzes performance fluctuations of the Deepseek-R1-Distill series through empirical evaluation. Result: Subtle differences in evaluation conditions cause substantial changes in results, and performance improvements are difficult to reproduce reliably. Conclusion: A more rigorous paradigm for model performance evaluation is needed to ensure trustworthy results. Abstract: Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community due to their strong performance in mathematics, science, programming, and other domains. However, our study reveals that their benchmark evaluation results are subject to significant fluctuations caused by various factors. Subtle differences in evaluation conditions can lead to substantial variations in results. Similar phenomena are observed in other open-source inference models fine-tuned based on the Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their claimed performance improvements difficult to reproduce reliably. Therefore, we advocate for the establishment of a more rigorous paradigm for model performance evaluation and present our empirical assessments of the Deepseek-R1-Distill series models.

[270] When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

Kai Wang,Yihao Zhang,Meng Sun

Main category: cs.AI

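TL;DR: The paper studies strategic deception in large language models (LLMs). Using representation engineering and Linear Artificial Tomography (LAT), it reaches 89% deception-detection accuracy, and through activation steering it achieves a 40% success rate in eliciting deception without explicit prompts.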

Details Motivation: As LLMs with chain-of-thought (CoT) reasoning advance, they may strategically deceive humans, a problem that is harder than traditional hallucination and needs study. Method: Uses representation engineering, extracting "deception vectors" via Linear Artificial Tomography (LAT) and controlling deceptive behavior through activation steering. Result: Achieves 89% deception-detection accuracy and a 40% success rate in eliciting deception without explicit prompts. Conclusion: The study exposes an honesty issue specific to reasoning models and provides tools for trustworthy AI alignment. Abstract: The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty issues in LLMs, which could possibly be explained as some kind of hallucination, those models' explicit thought paths enable us to study strategic deception: goal-driven, intentional misinformation where reasoning contradicts outputs. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) for 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, unveiling the specific honesty-related issue of reasoning models and providing tools for trustworthy AI alignment.

[271] LLM-First Search: Self-Guided Exploration of the Solution Space

Nathan Herr,Tim Rocktäschel,Roberta Raileanu

Main category: cs.AI

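TL;DR: The paper proposes LLM-First Search (LFS), a method that lets the large language model (LLM) control the search process itself, without a predefined search strategy, for more flexible and efficient performance on reasoning and planning tasks.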

Details Motivation: Existing search methods such as MCTS rely on fixed exploration hyperparameters, which makes them hard to adapt to tasks of varying difficulty and limits their practical use. LFS aims to address this through the LLM's own ability to explore. Method: LFS is an LLM self-guided search method in which the LLM uses its internal scoring to decide whether to continue the current search path or explore other branches, without external heuristics or hardcoded policies. Result: On Countdown and Sudoku, LFS outperforms ToT-BFS, BestFS, and MCTS, especially on the more challenging tasks; it is more computationally efficient and makes better use of stronger models and larger compute budgets. Conclusion: LLM self-guided search gives LFS more flexible and efficient reasoning without manual tuning, works across tasks of varying difficulty, and scales with stronger models and larger compute budgets. Abstract: Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute, often by framing problem-solving as a search process. While methods like Monte Carlo Tree Search (MCTS) have proven effective in some domains, their reliance on fixed exploration hyperparameters limits their adaptability across tasks of varying difficulty, rendering them impractical or expensive in certain settings. In this paper, we propose LLM-First Search (LFS), a novel LLM Self-Guided Search method that removes the need for pre-defined search strategies by empowering the LLM to autonomously control the search process via self-guided exploration. Rather than relying on external heuristics or hardcoded policies, the LLM evaluates whether to pursue the current search path or explore alternative branches based on its internal scoring mechanisms. This enables more flexible and context-sensitive reasoning without requiring manual tuning or task-specific adaptation. We evaluate LFS on Countdown and Sudoku against three classic widely-used search algorithms, Tree-of-Thoughts' Breadth First Search (ToT-BFS), Best First Search (BestFS), and MCTS, each of which has been used to achieve SotA results on a range of challenging reasoning tasks. We found that LFS (1) performs better on more challenging tasks without additional tuning, (2) is more computationally efficient compared to the other methods, especially when powered by a stronger model, (3) scales better with stronger models, due to its LLM-First design, and (4) scales better with increased compute budget.
Our code is publicly available at https://github.com/NathanHerr/LLM-First-Search.
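
The pursue-or-explore decision described in the abstract can be illustrated with a small loop. This is a hedged sketch and not the authors' implementation (see their repository for that): the names `self_guided_search`, `score`, and `prefer_alternative` are hypothetical, and the two callbacks that would be LLM calls in practice are replaced here by simple deterministic functions on a toy number puzzle.

```python
import heapq
import itertools

def self_guided_search(start, expand, is_goal, score, prefer_alternative,
                       max_steps=1000):
    """Follow the current path unless the guide prefers a saved branch.

    score(state) and prefer_alternative(child_score, alt_score) stand in
    for model calls; both are assumptions made for this sketch.
    """
    frontier = []            # saved alternative branches, best score first
    tie = itertools.count()  # tie-breaker so the heap never compares states
    current = start
    for _ in range(max_steps):
        if is_goal(current):
            return current
        children = sorted(expand(current), key=score, reverse=True)
        best_alt = -frontier[0][0] if frontier else None
        # Pursue the best child unless the guide prefers the best saved branch.
        pursue = bool(children) and (
            best_alt is None
            or not prefer_alternative(score(children[0]), best_alt)
        )
        parked = children[1:] if pursue else children
        for child in parked:
            heapq.heappush(frontier, (-score(child), next(tie), child))
        if pursue:
            current = children[0]
        elif frontier:
            current = heapq.heappop(frontier)[2]
        else:
            return None      # nothing left to explore
    return None

# Toy problem: reach 10 from 1 using "+1" or "*2" moves.
TARGET = 10
result = self_guided_search(
    start=1,
    expand=lambda n: [n + 1, n * 2],
    is_goal=lambda n: n == TARGET,
    score=lambda n: -abs(TARGET - n),                 # stand-in for model scoring
    prefer_alternative=lambda child, alt: alt > child,
)
print(result)
```

The point of the sketch is that no exploration constant appears anywhere: whether to keep going or switch branches is delegated entirely to the two guide callbacks. The sketch does not deduplicate states, which a practical version would likely need.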

[272] Ontology-based knowledge representation for bone disease diagnosis: a foundation for safe and sustainable medical artificial intelligence systems

Loan Dao,Ngoc Quoc Ly

Main category: cs.AI

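TL;DR: The study proposes an ontology-based framework for bone disease diagnosis that combines a hierarchical neural network, visual language models, and multimodal deep learning, with the aim of making AI diagnosis more reliable.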

Details Motivation: Medical AI systems often lack systematic integration of domain expertise, which can undermine diagnostic reliability, so a more structured approach is needed. Method: Develops an ontology-based framework comprising a hierarchical neural network, a visual question answering system, and a multimodal deep learning model that combines imaging, clinical, and laboratory data. Result: The framework shows potential for extension, but experimental validation is still pending because of dataset and computational resource limitations. Conclusion: Future work will focus on expanding the clinical dataset and validating the system comprehensively to confirm the framework's practical value. Abstract: Medical artificial intelligence (AI) systems frequently lack systematic domain expertise integration, potentially compromising diagnostic reliability. This study presents an ontology-based framework for bone disease diagnosis, developed in collaboration with Ho Chi Minh City Hospital for Traumatology and Orthopedics. The framework introduces three theoretical contributions: (1) a hierarchical neural network architecture guided by bone disease ontology for segmentation-classification tasks, incorporating Visual Language Models (VLMs) through prompts, (2) an ontology-enhanced Visual Question Answering (VQA) system for clinical reasoning, and (3) a multimodal deep learning model that integrates imaging, clinical, and laboratory data through ontological relationships. The methodology maintains clinical interpretability through systematic knowledge digitization, standardized medical terminology mapping, and modular architecture design. The framework demonstrates potential for extension beyond bone diseases through its standardized structure and reusable components. While theoretical foundations are established, experimental validation remains pending due to current dataset and computational resource limitations. Future work will focus on expanding the clinical dataset and conducting comprehensive system validation.

cs.CR [Back]

[273] Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Lei Hsiung,Tianyu Pang,Yung-Chen Tang,Linyue Song,Tsung-Yi Ho,Pin-Yu Chen,Yaoqing Yang

Main category: cs.CR

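TL;DR: The paper examines how representation similarity between upstream safety-alignment data and downstream fine-tuning tasks affects the degradation of safety guardrails. High similarity weakens the guardrails, while low similarity makes models substantially more robust.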

Details Motivation: Existing mitigation strategies mostly focus on after-the-fact remediation or on removing harmful gradients during fine-tuning, and overlook the key role of the upstream safety-alignment data. Method: Experimentally analyzes how representation similarity between upstream alignment datasets and downstream fine-tuning tasks affects safety guardrails. Result: High similarity significantly weakens the guardrails, whereas low similarity yields more robust models, reducing the harmfulness score by up to 10.33%. Conclusion: Upstream dataset design is crucial for building durable safety guardrails, and the findings offer practical guidance for fine-tuning service providers. Abstract: Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and thus reduces the harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in the building of durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.
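
The abstract does not say how representation similarity is computed. As a rough illustration only, one common choice is the cosine similarity between the mean representation vectors of two datasets. The helper names and toy two-dimensional vectors below are assumptions made for this sketch, not the paper's metric or data.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length representation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dataset_similarity(reps_a, reps_b):
    """Similarity of two datasets via their mean representations."""
    return cosine_similarity(mean_vector(reps_a), mean_vector(reps_b))

# Toy 2-D "representations"; real ones would come from a model's hidden states.
alignment = [[1.0, 0.0], [1.0, 0.0]]
task_close = [[2.0, 0.0], [4.0, 0.0]]   # points the same way as alignment
task_far = [[0.0, 1.0], [0.0, 3.0]]     # orthogonal to alignment

sim_close = dataset_similarity(alignment, task_close)
sim_far = dataset_similarity(alignment, task_far)
print(sim_close, sim_far)
```

Under the paper's finding, a fine-tuning task that scores like `task_close` against the alignment data would be the riskier one for guardrail degradation, and one that scores like `task_far` the safer one.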

cs.HC [Back]

[274] Beyond the Desktop: XR-Driven Segmentation with Meta Quest 3 and MX Ink

Lisle Faray de Paiva,Gijs Luijten,Ana Sofia Ferreira Santos,Moon Kim,Behrus Puladi,Jens Kleesiek,Jan Egger

Main category: cs.HC

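TL;DR: The study develops an extended reality (XR)-based medical image segmentation tool that pairs the Meta Quest 3 headset with the Logitech MX Ink stylus, aiming to simplify manual annotation in clinical practice.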

Details Motivation: Medical image segmentation is essential in clinical practice, but manual annotation is time-consuming and laborious, so more efficient tools are needed. Method: Develops an immersive interface for real-time interaction with 2D and 3D medical imaging data, combining stylus-driven annotation with instant 3D volumetric rendering. Result: A user study shows foundational viability (SUS score of 66); participants found the controls intuitive (4.1/5 for self-descriptiveness on ISONORM), but task precision and error management need improvement. Conclusion: The XR-stylus paradigm is a promising foundation for immersive segmentation tools; future work should refine haptic feedback and workflow personalization. Abstract: Medical imaging segmentation is essential in clinical settings for diagnosing diseases, planning surgeries, and other procedures. However, manual annotation is a cumbersome and effortful task. To mitigate these aspects, this study implements and evaluates the usability and clinical applicability of an extended reality (XR)-based segmentation tool for anatomical CT scans, using the Meta Quest 3 headset and Logitech MX Ink stylus. We develop an immersive interface enabling real-time interaction with 2D and 3D medical imaging data in a customizable workspace designed to mitigate workflow fragmentation and cognitive demands inherent to conventional manual segmentation tools. The platform combines stylus-driven annotation, mirroring traditional pen-on-paper workflows, with instant 3D volumetric rendering. A user study with a public craniofacial CT dataset demonstrated the tool's foundational viability, achieving a System Usability Scale (SUS) score of 66, within the expected range for medical applications. Participants highlighted the system's intuitive controls (scoring 4.1/5 for self-descriptiveness on ISONORM metrics) and spatial interaction design, with qualitative feedback highlighting strengths in hybrid 2D/3D navigation and realistic stylus ergonomics. While users identified opportunities to enhance task-specific precision and error management, the platform's core workflow enabled dynamic slice adjustment, reducing cognitive load compared to desktop tools. Results position the XR-stylus paradigm as a promising foundation for immersive segmentation tools, with iterative refinements targeting haptic feedback calibration and workflow personalization to advance adoption in preoperative planning.
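
For readers unfamiliar with how a System Usability Scale figure such as the 66 above is derived, here is a short sketch of the standard SUS scoring rule for a single respondent. The function name `sus_score` and the example responses are illustrative and not taken from the study; a study-level score is typically the mean over respondents.

```python
def sus_score(responses):
    """Score one respondent's ten SUS answers (each 1-5) on a 0-100 scale."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for item, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("each answer must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (answer - 1) if item % 2 == 1 else (5 - answer)
    return total * 2.5

# Illustrative answers only, not data from the study.
neutral = sus_score([3] * 10)
best = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])
print(neutral, best)
```

Because each item contributes 0 to 4 points and the sum is scaled by 2.5, the result always falls between 0 and 100, though it is not a percentage.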

[275] From Screen to Space: Evaluating Siemens' Cinematic Reality

Gijs Luijten,Lisle Faray de Paiva,Sebastian Krueger,Alexander Brost,Laura Mazilescu,Ana Sofia Ferreira Santos,Peter Hoyer,Jens Kleesiek,Sophia Marie-Therese Schmitz,Ulf Peter Neumann,Jan Egger

Main category: cs.HC

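TL;DR: The research team evaluates the usability and clinical potential of Siemens' Cinematic Reality on the Apple Vision Pro, using feedback from medical experts to establish its feasibility and areas for improvement.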

Details Motivation: To explore the potential of Cinematic Reality for immersive rendering of medical images and to support its adoption in real clinical work. Method: Using images from the CHAOS and MRCP_DLRecon datasets, 14 medical experts assessed the system's usability and clinical potential through questionnaires and open-ended feedback. Result: The feedback identified the system's feasibility, its strengths, and the features that still need work, providing input for clinical workflows. Conclusion: Cinematic Reality shows potential for medical imaging but needs further refinement to meet clinical needs. Abstract: As one of the first research teams with full access to Siemens' Cinematic Reality, we evaluate its usability and clinical potential for cinematic volume rendering on the Apple Vision Pro. We visualized venous-phase liver computed tomography and magnetic resonance cholangiopancreatography scans from the CHAOS and MRCP_DLRecon datasets. Fourteen medical experts assessed usability and anticipated clinical integration potential using the System Usability Scale, ISONORM 9242-110-S questionnaire, and an open-ended survey. Their feedback identified feasibility, key usability strengths, and required features to catalyze the adaptation in real-world clinical workflows. The findings provide insights into the potential of immersive cinematic rendering in medical imaging.

cs.MM [Back]

[276] CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection

Fanxiao Li,Jiaying Wu,Canyuan He,Wei Zhou

Main category: cs.MM

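TL;DR: The paper proposes CMIE, a new framework for detecting out-of-context (image-text mismatch) misinformation, which improves detection through coexistence relationship generation and an association scoring mechanism.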

Details Motivation: Existing multimodal large language models (MLLMs) struggle to capture deeper semantic links when detecting out-of-context misinformation, and noise in the evidence hurts their accuracy. Method: Proposes the CMIE framework, which includes a Coexistence Relationship Generation (CRG) strategy and an Association Scoring (AS) mechanism to identify the underlying relationships between image and text and to use evidence selectively. Result: Experiments show that CMIE outperforms existing methods. Conclusion: By improving image-text relationship modeling and evidence use, CMIE effectively raises misinformation detection accuracy. Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities in visual reasoning and text generation. While previous studies have explored the application of MLLMs for detecting out-of-context (OOC) misinformation, our empirical analysis reveals two persisting challenges of this paradigm. Evaluating the representative GPT-4o model on direct reasoning and evidence-augmented reasoning, results indicate that MLLMs struggle to capture the deeper relationships, specifically cases in which the image and text are not directly connected but are associated through underlying semantic links. Moreover, noise in the evidence further impairs detection accuracy. To address these challenges, we propose CMIE, a novel OOC misinformation detection framework that incorporates a Coexistence Relationship Generation (CRG) strategy and an Association Scoring (AS) mechanism. CMIE identifies the underlying coexistence relationships between images and text, and selectively utilizes relevant evidence to enhance misinformation detection. Experimental results demonstrate that our approach outperforms existing methods.