cs.CV

[1] Dynamic Epsilon Scheduling: A Multi-Factor Adaptive Perturbation Budget for Adversarial Training

Alan Mitkiy,James Smith,Hana Satou,Hiroshi Tanaka,Emily Johnson,F Monkey

Main category: cs.CV

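TL;DR: The paper proposes Dynamic Epsilon Scheduling (DES), which dynamically adjusts the perturbation budget used in adversarial training by combining distance to the decision boundary, prediction confidence, and model uncertainty, improving both adversarial robustness and standard accuracy.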

Motivation: Existing adversarial training methods rely on a fixed perturbation budget, which cannot adapt to instance-specific robustness requirements and limits their effectiveness.

Method: The DES framework adjusts the perturbation budget per instance and per training iteration, combining gradient-based proxies, softmax entropy, and Monte Carlo dropout.

Result: On CIFAR-10 and CIFAR-100, DES outperforms fixed-budget baselines and existing adaptive methods in both adversarial robustness and standard accuracy.

Conclusion: DES offers a new direction for instance-aware, data-driven adversarial training, supported by both theoretical and experimental results.

Abstract: Adversarial training is among the most effective strategies for defending deep neural networks against adversarial examples. A key limitation of existing adversarial training approaches lies in their reliance on a fixed perturbation budget, which fails to account for instance-specific robustness characteristics. While prior works such as IAAT and MMA introduce instance-level adaptations, they often rely on heuristic or static approximations of data robustness. In this paper, we propose Dynamic Epsilon Scheduling (DES), a novel framework that adaptively adjusts the adversarial perturbation budget per instance and per training iteration. DES integrates three key factors: (1) the distance to the decision boundary approximated via gradient-based proxies, (2) prediction confidence derived from softmax entropy, and (3) model uncertainty estimated via Monte Carlo dropout. By combining these cues into a unified scheduling strategy, DES tailors the perturbation budget dynamically to guide more effective adversarial learning. Experimental results on CIFAR-10 and CIFAR-100 show that our method consistently improves both adversarial robustness and standard accuracy compared to fixed-epsilon baselines and prior adaptive methods. Moreover, we provide theoretical insights into the stability and convergence of our scheduling policy. This work opens a new avenue for instance-aware, data-driven adversarial training methods.
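The three cues named in the abstract can be combined in many ways, and the abstract does not give the actual scheduling rule. The following is only a minimal NumPy sketch of one possible combination. The function name `dynamic_epsilon`, the equal weights, the budget range, and the choice to give larger budgets to confident, certain, far-from-boundary samples are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dynamic_epsilon(logits, grad_norm, mc_probs,
                    eps_min=2 / 255, eps_max=12 / 255,
                    weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical per-instance budget built from three cues in [0, 1].

    logits:    (N, C) model outputs for N samples.
    grad_norm: (N,) norm of the input gradient of the logit margin.
    mc_probs:  (T, N, C) softmax outputs from T dropout-enabled passes.
    """
    probs = softmax(logits)
    num_classes = logits.shape[-1]

    # Cue 1: first-order distance-to-boundary proxy, margin / gradient norm.
    top2 = np.sort(logits, axis=-1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    dist = margin / (grad_norm + 1e-12)
    dist_score = dist / (1.0 + dist)

    # Cue 2: confidence as one minus the normalized softmax entropy.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    conf_score = 1.0 - entropy / np.log(num_classes)

    # Cue 3: Monte Carlo dropout uncertainty as predictive variance.
    # A probability has variance at most 0.25, so this stays in [0, 1].
    certainty = 1.0 - mc_probs.var(axis=0).mean(axis=-1) / 0.25

    w1, w2, w3 = weights
    score = np.clip(w1 * dist_score + w2 * conf_score + w3 * certainty, 0.0, 1.0)
    return eps_min + (eps_max - eps_min) * score

logits = np.array([[8.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
mc = np.stack([softmax(logits)] * 4)
print(dynamic_epsilon(logits, np.ones(2), mc))
```

In this sketch the first sample (large margin, low entropy) receives a larger budget than the second, and every budget stays inside `[eps_min, eps_max]`.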

[2] RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought

Yi Lu,Jiawang Cao,Yongliang Wu,Bozheng Li,Licheng Tang,Yangguang Ji,Chong Wu,Jay Wu,Wenbo Zhu

Main category: cs.CV

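TL;DR: RSVP is a new framework that combines multimodal reasoning with visual segmentation, using a two-stage design to improve visual grounding and segmentation accuracy.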

Motivation: Multi-modal Large Language Models (MLLMs) lack explicit mechanisms for visual grounding and segmentation, leaving a gap between cognitive reasoning and visual perception.

Method: RSVP uses two stages: a reasoning stage that generates region proposals through multimodal chain-of-thought visual prompts, and a segmentation stage that refines the masks with a Vision-Language Segmentation Module (VLSM).

Result: RSVP surpasses prior methods by up to 6.5 gIoU and 9.2 cIoU on ReasonSeg and reaches 49.7 mAP on SegInW in the zero-shot setting.

Conclusion: RSVP provides an effective and scalable framework for combining cognitive reasoning with structured visual understanding.

Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability but lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a two-stage structuralized framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In the segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), which seamlessly integrates textual and visual cues to produce precise segmentation masks. By explicitly modelling the interaction between multimodal reasoning and segmentation, RSVP introduces a new paradigm for interpretable reasoning segmentation. It exploits MLLMs' inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations. Our extensive experiments demonstrate that RSVP achieves state-of-the-art performance, surpassing prior methods by up to +6.5 gIoU and +9.2 cIoU on ReasonSeg, and achieves 49.7 mAP on SegInW under zero-shot settings. These results validate RSVP as an effective and scalable framework for integrating cognitive reasoning with structured visual understanding.
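The abstract reports gIoU and cIoU without defining them. The sketch below uses the convention common in reasoning-segmentation evaluation, where gIoU is the mean of per-image IoUs and cIoU is the cumulative intersection divided by the cumulative union (so gIoU here is not the "generalized IoU" used for bounding boxes). The function name `giou_ciou` and the rule of scoring an empty prediction against an empty target as 1.0 are assumptions.

```python
import numpy as np

def giou_ciou(pred_masks, gt_masks):
    """Return (gIoU, cIoU) for paired lists of binary masks."""
    inters, unions = [], []
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inters.append(np.logical_and(pred, gt).sum())
        unions.append(np.logical_or(pred, gt).sum())
    inters = np.array(inters, dtype=float)
    unions = np.array(unions, dtype=float)

    # gIoU: average the IoU of each image (assumed 1.0 when both masks are empty).
    per_image = np.where(unions > 0, inters / np.maximum(unions, 1.0), 1.0)
    giou = per_image.mean()

    # cIoU: pool pixels over the whole dataset before dividing.
    ciou = inters.sum() / max(unions.sum(), 1.0)
    return giou, ciou

small_pred, small_gt = [[1, 1], [0, 0]], [[1, 0], [0, 0]]  # IoU = 1/2
large_pred = large_gt = np.ones((4, 4))                      # IoU = 16/16
print(giou_ciou([small_pred, large_pred], [small_gt, large_gt]))
```

Because cIoU pools pixels, large objects dominate it, while gIoU weights every image equally. That is why the two numbers can move by different amounts for the same model.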

[3] Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

Ziming Cheng,Binrui Xu,Lisheng Gong,Zuhe Song,Tianshuo Zhou,Shiqi Zhong,Siyu Ren,Mingxiang Chen,Xiangchao Meng,Yuxin Zhang,Yanlin Li,Lei Ren,Wei Chen,Zhiyuan Huang,Mingjie Zhan,Xiaojie Wang,Fangxiang Feng

Main category: cs.CV

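TL;DR: MMRB is the first benchmark for evaluating structured visual reasoning across multiple images. It contains 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o. Experiments show that open-source MLLMs lag significantly behind commercial MLLMs on multi-image reasoning tasks.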

Motivation: Existing MLLM benchmarks focus on single-image reasoning or on final-answer evaluation of multi-image understanding tasks, and do not probe reasoning over multi-image inputs in depth.

Method: The MMRB benchmark comprises 92 sub-tasks with annotations generated by GPT-4o and refined by human experts. A derived subset evaluates multimodal reward models, and a sentence-level matching framework based on open-source LLMs supports evaluation.

Result: Open-source MLLMs lag significantly behind commercial MLLMs on multi-image reasoning tasks, and current multimodal reward models are nearly incapable of handling multi-image reward ranking.

Conclusion: MMRB fills a gap in multi-image reasoning evaluation and exposes the shortcomings of open-source MLLMs and multimodal reward models on multi-image tasks.

Abstract: With enhanced capabilities and widespread applications, Multimodal Large Language Models (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the Multimodal Multi-image Reasoning Benchmark (MMRB), the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios. To support fast and scalable evaluation, we propose a sentence-level matching framework using open-source LLMs. Extensive baseline experiments on 40 MLLMs, including 9 reasoning-specific models and 8 reward models, demonstrate that open-source MLLMs still lag significantly behind commercial MLLMs in multi-image reasoning tasks. Furthermore, current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.

[4] HuGeDiff: 3D Human Generation via Diffusion with Gaussian Splatting

Maksym Ivashechkin,Oscar Mendez,Richard Bowden

Main category: cs.CV

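TL;DR: The paper proposes a weakly supervised pipeline that uses an image diffusion model to generate a photorealistic human image dataset with controllable attributes, maps image features to 3D point clouds with a transformer-based architecture, and finally trains a point-cloud diffusion model, significantly improving the speed, text alignment, realism, and rendering quality of 3D human generation.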

Motivation: Current 3D human generation methods fall short on fine detail, rendering of hands and faces, realism, and controllability, and human image data lacks diversity and annotation. The paper aims to address these problems.

Method: (1) Generate a photorealistic human image dataset with controllable attributes using an image diffusion model; (2) map image features to 3D point clouds with a transformer-based architecture; (3) train a point-cloud diffusion model to close the loop.

Result: Compared with existing methods, the pipeline achieves orders-of-magnitude speed-ups and significantly better text alignment, realism, and rendering quality.

Conclusion: The proposed weakly supervised pipeline addresses key challenges in 3D human generation, and the code and dataset will be released to support future research.

Abstract: 3D human generation is an important problem with a wide range of applications in computer vision and graphics. Despite recent progress in generative AI such as diffusion models or rendering methods like Neural Radiance Fields or Gaussian Splatting, controlling the generation of accurate 3D humans from text prompts remains an open challenge. Current methods struggle with fine detail, accurate rendering of hands and faces, human realism, and controllability over appearance. The lack of diversity, realism, and annotation in human image data also remains a challenge, hindering the development of a foundational 3D human model. We present a weakly supervised pipeline that tries to address these challenges. In the first step, we generate a photorealistic human image dataset with controllable attributes such as appearance, race, gender, etc., using a state-of-the-art image diffusion model. Next, we propose an efficient mapping approach from image features to 3D point clouds using a transformer-based architecture. Finally, we close the loop by training a point-cloud diffusion model that is conditioned on the same text prompts used to generate the original samples. We demonstrate orders-of-magnitude speed-ups in 3D human generation compared to the state-of-the-art approaches, along with significantly improved text-prompt alignment, realism, and rendering quality. We will make the code and dataset available.

[5] Photoreal Scene Reconstruction from an Egocentric Device

Zhaoyang Lv,Maurizio Monge,Ka Chen,Yufeng Zhu,Michael Goesele,Jakob Engel,Zhao Dong,Richard Newcombe

Main category: cs.CV

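TL;DR: This paper studies the challenges of photorealistic, high-dynamic-range scene reconstruction from egocentric devices and proposes two improvements: visual-inertial bundle adjustment (VIBA) and a physical image formation model for Gaussian Splatting. Experiments show consistent PSNR gains.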

Motivation: Existing methods typically assume frame-rate 6DoF poses estimated by the device's visual-inertial odometry system, which may neglect details needed for pixel-accurate reconstruction.

Method: Use VIBA to calibrate the timestamps and motion of the rolling-shutter RGB camera, and incorporate a physical image formation model into Gaussian Splatting to account for sensor characteristics.

Result: Across diverse lighting conditions, VIBA adds about +1 dB PSNR, and the image formation model adds a further +1 dB.

Conclusion: The proposed method improves the accuracy of photorealistic reconstruction and applies to widely used variants of the Gaussian Splatting representation.

Abstract: In this paper, we investigate the challenges associated with using egocentric devices to photorealistically reconstruct the scene in high dynamic range. Existing methodologies typically assume using frame-rate 6DoF pose estimated from the device's visual-inertial odometry system, which may neglect crucial details necessary for pixel-accurate reconstruction. This study presents two significant findings. Firstly, in contrast to mainstream work treating the RGB camera as a global-shutter frame-rate camera, we emphasize the importance of employing visual-inertial bundle adjustment (VIBA) to calibrate the precise timestamps and movement of the rolling-shutter RGB sensing camera in a high-frequency trajectory format, which ensures an accurate calibration of the physical properties of the rolling-shutter camera. Secondly, we incorporate a physical image formation model into Gaussian Splatting, which effectively addresses the sensor characteristics, including the rolling-shutter effect of RGB cameras and the dynamic ranges measured by sensors. Our proposed formulation is applicable to the widely used variants of the Gaussian Splats representation. We conduct a comprehensive evaluation of our pipeline using the open-source Project Aria device under diverse indoor and outdoor lighting conditions, and further validate it on a Meta Quest3 device. Across all experiments, we observe a consistent visual enhancement of +1 dB in PSNR by incorporating VIBA, with an additional +1 dB achieved through our proposed image formation model. Our complete implementation, evaluation datasets, and recording profile are available at http://www.projectaria.com/photoreal-reconstruction/
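To put the reported +1 dB gains in context, recall the standard definition PSNR = 10 log10(MAX^2 / MSE): a gain of 1 dB corresponds to cutting the mean squared error by a factor of 10^(-0.1), roughly 21%. The short sketch below checks this arithmetic on synthetic data. The function name `psnr` and the constant-error test images are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((8, 8))
baseline = np.full((8, 8), 0.1)                    # MSE = 0.01
improved = np.full((8, 8), 0.1 * 10 ** (-0.05))    # MSE = 0.01 * 10**(-0.1)

gain_db = psnr(reference, improved) - psnr(reference, baseline)
mse_ratio = 10 ** (-0.1)
print(f"gain: {gain_db:.3f} dB, MSE ratio: {mse_ratio:.3f}")
```

Two stacked +1 dB improvements therefore correspond to an MSE ratio of 10^(-0.2), about 0.63 of the original error.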

[6] ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

Ankit Pal,Jung-Oh Lee,Xiaoman Zhang,Malaikannan Sankarasubbu,Seunghyeon Roh,Won Jung Kim,Meesun Lee,Pranav Rajpurkar

Main category: cs.CV

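TL;DR: ReXVQA is the largest comprehensive benchmark for visual question answering (VQA) in chest radiology, with about 696,000 questions and 160,000 chest X-ray studies. Of the eight multimodal large language models evaluated, MedGemma performs best (83.24% overall accuracy) and outperforms the best radiology resident in a reader study (83.84% vs. 77.27%).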

Motivation: Fill the gap in chest X-ray VQA by providing diverse, clinically authentic tasks and evaluating how AI performs at radiological reasoning.

Method: Build the large-scale ReXVQA dataset covering five radiological reasoning skills, evaluate eight state-of-the-art models, and run a comparison study with three radiology residents.

Result: MedGemma performs best (83.24% overall accuracy) and, in the reader study, exceeds the best radiology resident (83.84% vs. 77.27%). The study also reveals differences between AI and human radiological reasoning.

Conclusion: ReXVQA sets a new standard for evaluating radiological AI systems and supports next-generation systems that mimic expert-level clinical reasoning. The dataset will be open-sourced.

Abstract: We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-ray studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template-based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving 3 radiology residents on 200 randomly sampled cases. Our evaluation demonstrates that MedGemma achieved superior performance (83.84% accuracy) compared to human readers (best radiology resident: 77.27%), representing a significant milestone where AI performance exceeds expert human evaluation on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists while showing more variable agreement patterns between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXVQA

[7] WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning

Delong Chen,Willy Chung,Yejin Bang,Ziwei Ji,Pascale Fung

Main category: cs.CV

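TL;DR: WorldPrediction is a video-based benchmark for evaluating the world modeling and procedural planning capabilities of AI models, emphasizing actions with temporal and semantic abstraction. Current frontier models perform far below human level.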

Motivation: Study how AI models learn and use world models for action planning, and fill the gap left by existing benchmarks in evaluating high-level world modeling and planning.

Method: The WorldPrediction benchmark comprises WorldPrediction-WM (selecting the proper action) and WorldPrediction-PP (selecting the properly ordered action sequence). States and actions are represented by visual observations, and "action equivalents" prevent models from exploiting low-level continuity cues.

Result: Current frontier models reach 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, far below the perfect performance of humans.

Conclusion: WorldPrediction provides a reliable benchmark for world modeling and planning and reveals a large gap between current models and humans.

Abstract: Humans are known to have an internal "world model" that enables us to carry out action planning based on world states. AI agents need to have such a world model for action planning as well. It is not clear how current AI models, especially generative models, are able to learn such world models and carry out procedural planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative task setup enables us to evaluate different types of world models and planners and realize a thorough comparison across different hypotheses. The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" - identical actions observed in different contexts - as candidates for selection. This benchmark is grounded in a formal framework of partially observable semi-MDP, ensuring better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP whereas humans are able to solve both tasks perfectly.

[8] Ice Hockey Puck Localization Using Contextual Cues

Liam Salass,Jerrin Bright,Amir Nazemi,Yuhao Chen,John Zelek,David Clausi

Main category: cs.CV

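TL;DR: PLUCC is a puck detection method based on contextual cues that uses player behaviour and pose as priors to significantly improve detection performance.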

Motivation: Pucks are hard to detect in video, and prior methods do not fully exploit contextual cues from player behaviour.

Method: PLUCC comprises a contextual encoder, a feature pyramid encoder, and a gating decoder, combining multiscale features with a channel gating mechanism.

Result: On PuckDataset, PLUCC improves average precision by 12.2% and RSLE average precision by 25%.

Conclusion: Contextual understanding is critical for puck detection and has broad implications for automated sports analysis.

Abstract: Puck detection in ice hockey broadcast videos poses significant challenges due to the puck's small size, frequent occlusions, motion blur, broadcast artifacts, and scale inconsistencies due to varying camera zoom and broadcast camera viewpoints. Prior works focus on appearance-based or motion-based cues of the puck without explicitly modelling the cues derived from player behaviour. Players consistently turn their bodies and direct their gaze toward the puck. Motivated by this strong contextual cue, we propose Puck Localization Using Contextual Cues (PLUCC), a novel approach for scale-aware and context-driven single-frame puck detections. PLUCC consists of three components: (a) a contextual encoder, which utilizes player orientations and positioning as helpful priors; (b) a feature pyramid encoder, which extracts multiscale features from the dual encoders; and (c) a gating decoder that combines latent features with a channel gating mechanism. For evaluation, in addition to standard average precision, we propose Rink Space Localization Error (RSLE), a scale-invariant homography-based metric for removing perspective bias from rink space evaluation. The experimental results of PLUCC on the PuckDataset dataset demonstrated state-of-the-art detection performance, surpassing previous baseline methods by an average precision improvement of 12.2% and RSLE average precision of 25%. Our research demonstrates the critical role of contextual understanding in improving puck detection performance, with broad implications for automated sports analysis.
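The abstract describes RSLE only as a scale-invariant, homography-based metric, so its exact definition is not available here. The sketch below shows just the underlying idea under that description: map predicted and ground-truth pixel locations into rink coordinates with a 3x3 homography and measure the error there rather than in pixels. The function names and the toy homography are hypothetical.

```python
import numpy as np

def to_rink(points_px, homography):
    """Map (N, 2) pixel coordinates to rink coordinates with a 3x3 homography."""
    pts = np.asarray(points_px, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])   # to homogeneous coordinates
    mapped = homog @ np.asarray(homography, dtype=float).T
    return mapped[:, :2] / mapped[:, 2:3]              # divide out the scale term

def rink_space_error(pred_px, true_px, homography):
    """Euclidean distance between prediction and ground truth in rink space."""
    return np.linalg.norm(
        to_rink(pred_px, homography) - to_rink(true_px, homography), axis=1
    )

# Toy homography: 10 pixels per rink unit, no perspective distortion.
H = np.diag([0.1, 0.1, 1.0])
print(rink_space_error([[100.0, 50.0]], [[110.0, 50.0]], H))
```

With a real broadcast homography, the same pixel error maps to different rink distances depending on zoom and viewpoint, which is the perspective bias the metric is meant to remove.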

[9] Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

Jubayer Ahmed Bhuiyan Shawon,Hasan Mahmud,Kamrul Hasan

Main category: cs.CV

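TL;DR: The study fine-tunes video transformer architectures (VideoMAE, ViViT, and TimeSformer) on Bangla Sign Language (BdSL) datasets, significantly improving the accuracy and scalability of sign recognition.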

Motivation: Improve accessibility for the hearing-impaired community by automatically recognizing and classifying sign gestures and converting them into text or speech.

Method: Fine-tune video transformer architectures (VideoMAE, ViViT, and TimeSformer) on the BdSLW60 and BdSLW401 datasets, with data augmentation and 10-fold stratified cross-validation.

Result: VideoMAE reaches 95.5% accuracy on BdSLW60 and 81.04% on BdSLW401, significantly outperforming traditional approaches.

Conclusion: Video transformer models perform strongly on sign language recognition, with performance depending on factors such as dataset size, video quality, and model architecture.

Abstract: Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures -- VideoMAE, ViViT, and TimeSformer -- on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame-rate-corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401 -- demonstrating strong potential for scalable and accurate BdSL recognition.

[10] Visualizing and Controlling Cortical Responses Using Voxel-Weighted Activation Maximization

Matthew W. Shinkle,Mark D. Lescroart

Main category: cs.CV

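TL;DR: The paper applies activation maximization to interpret DNN-based encoding models, generating images optimized for predicted brain responses, and validates the approach experimentally.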

Motivation: Existing DNN-based encoding models can predict brain responses but do not reveal which specific features drive them.

Method: Extract and downsample activations from a pretrained Inception V3 network, predict fMRI responses with linear regression, then generate images by activation maximization.

Result: The generated images qualitatively match the known selectivity of brain regions, and an fMRI experiment confirms that they drive the targeted regions.

Conclusion: Activation maximization can be applied successfully to DNN-based encoding models, providing a flexible tool for studying the human visual system.

Abstract: Deep neural networks (DNNs) trained on visual tasks develop feature representations that resemble those in the human visual system. Although DNN-based encoding models can accurately predict brain responses to visual stimuli, they offer limited insight into the specific features driving these responses. Here, we demonstrate that activation maximization -- a technique designed to interpret vision DNNs -- can be applied to DNN-based encoding models of the human brain. We extract and adaptively downsample activations from multiple layers of a pretrained Inception V3 network, then use linear regression to predict fMRI responses. This yields a full image-computable model of brain responses. Next, we apply activation maximization to generate images optimized for predicted responses in individual cortical voxels. We find that these images contain visual characteristics that qualitatively correspond with known selectivity and enable exploration of selectivity across the visual cortex. We further extend our method to whole regions of interest (ROIs) of the brain and validate its efficacy by presenting these images to human participants in an fMRI study. We find that the generated images reliably drive activity in targeted regions across both low- and high-level visual areas and across subjects. These results demonstrate that activation maximization can be successfully applied to DNN-based encoding models. By addressing key limitations of alternative approaches that require natively generative models, our approach enables flexible characterization and modulation of responses across the human visual system.

[11] Is Perturbation-Based Image Protection Disruptive to Image Editing?

Qiuyu Tang,Bonor Ayambem,Mooi Choo Chuah,Aparna Bharati

Main category: cs.CV

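TL;DR: The study finds that existing perturbation-based image protection methods cannot fully prevent diffusion models from editing images and may even make the edits better.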

Motivation: Examine the misuse risks of diffusion models such as Stable Diffusion in image generation and the limitations of existing image protection methods.

Method: Experimentally evaluate several perturbation-based image protection methods across domains (natural scene images and artworks) and editing tasks (image-to-image generation and style editing).

Result: The protection methods fail to block editing completely and can make the edited output adhere more closely to the prompt, producing unintended effects.

Conclusion: Perturbation-based methods are not sufficient for robust protection against diffusion-based editing, and other solutions need to be explored.

Abstract: The remarkable image generation capabilities of state-of-the-art diffusion models, such as Stable Diffusion, can also be misused to spread misinformation and plagiarize copyrighted materials. To mitigate the potential risks associated with image editing, current image protection methods rely on adding imperceptible perturbations to images to obstruct diffusion-based editing. A fully successful protection for an image implies that the output of editing attempts is an undesirable, noisy image which is completely unrelated to the reference image. In our experiments with various perturbation-based image protection methods across multiple domains (natural scene images and artworks) and editing tasks (image-to-image generation and style editing), we discover that such protection does not achieve this goal completely. In most scenarios, diffusion-based editing of protected images generates a desirable output image which adheres precisely to the guidance prompt. Our findings suggest that adding noise to images may paradoxically increase their association with given text prompts during the generation process, leading to unintended consequences such as better resultant edits. Hence, we argue that perturbation-based methods may not provide a sufficient solution for robust image protection against diffusion-based editing.

[12] Normalize Filters! Classical Wisdom for Deep Vision

Gustavo Perez,Stella X. Yu

Main category: cs.CV

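TL;DR: The paper proposes filter normalization followed by learnable scaling and shifting, which addresses the distorted responses of learned convolutional filters when images undergo atmospheric transfer and yields significant performance gains.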

Motivation: Classical image filters are carefully normalized for consistency and interpretability, whereas convolutional filters in deep networks lack such constraints, so their responses become distorted under atmospheric transfer.

Method: Apply filter normalization followed by learnable scaling and shifting, akin to batch normalization, so that the filters are atmosphere-equivariant.

Result: The method improves significantly on artificial and natural intensity variation benchmarks, and a ResNet34 even outperforms CLIP by a large margin.

Conclusion: Filter normalization regularizes learning and improves diversity, robustness, and generalization.

Abstract: Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.

[13] HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

Hermann Kumbong,Xian Liu,Tsung-Yi Lin,Ming-Yu Liu,Xihui Liu,Ziwei Liu,Daniel Y. Fu,Christopher Ré,David W. Romero

Main category: cs.CV

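TL;DR: HMAR is a new image generation algorithm that addresses the problems of parallel generation in VAR, producing higher-quality images with faster sampling.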

Motivation: VAR suffers from reduced image quality under parallel generation, sequence lengths that scale superlinearly, and a fixed sampling schedule. HMAR aims to resolve these issues.

Method: HMAR treats next-scale prediction as a Markovian process that conditions only on the tokens of the preceding resolution, and uses a controllable multi-step masked generation procedure within each scale.

Result: On ImageNet 256x256 and 512x512, HMAR matches or outperforms VAR, diffusion, and autoregressive baselines, with faster training and inference.

Conclusion: HMAR improves image quality while offering more flexibility and efficiency, and can be applied to zero-shot image editing.

Abstract: Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the sampling schedule. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256x256 and 512x512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve faster training and inference times over VAR by over 2.5x and 1.75x respectively, as well as over 3x lower inference memory footprint. Finally, HMAR yields additional flexibility over VAR; its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.
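The difference between conditioning on all previous scales and conditioning only on the immediate predecessor can be made concrete by counting tokens. The sketch below does this for a hypothetical scale schedule of token-map side lengths (1, 2, 4, 8, 16). The schedule and the function name `conditioning_tokens` are assumptions for illustration and are not taken from either model's configuration.

```python
def conditioning_tokens(patch_sides):
    """Count the tokens each scale is conditioned on under two schemes.

    patch_sides: side length of the square token map at each scale, coarse to fine.
    Returns (tokens per scale, all-previous-scales context, previous-scale-only context).
    """
    tokens = [side * side for side in patch_sides]
    # Conditioning on every coarser scale: context is the running sum.
    all_previous = [sum(tokens[:i]) for i in range(len(tokens))]
    # Markovian conditioning: context is only the immediately preceding scale.
    markov = [tokens[i - 1] if i > 0 else 0 for i in range(len(tokens))]
    return tokens, all_previous, markov

tokens, all_previous, markov = conditioning_tokens((1, 2, 4, 8, 16))
for t, a, m in zip(tokens, all_previous, markov):
    print(f"scale with {t:3d} tokens: all-previous context {a:3d}, Markov context {m:3d}")
```

For this schedule the finest scale sees 85 conditioning tokens under the all-previous scheme and 64 under the Markovian one, and the gap widens as more scales are added.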

[14] Towards Large-Scale Pose-Invariant Face Recognition Using Face Defrontalization

Patrik Mesec,Alan Jović

Main category: cs.CV

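TL;DR: The paper proposes "face defrontalization", which augments the training dataset to improve facial feature extraction models, and validates it on several public datasets.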

Motivation: Address face recognition under extreme head poses while avoiding the overfitting of existing methods to small datasets.

Method: (1) Train an adapted face defrontalization model (FFWM); (2) train a ResNet-50 model with ArcFace loss on a dataset that combines raw and randomly defrontalized images.

Result: The method outperforms existing approaches on LFW, AgeDB, and CFP, but not on Multi-PIE at extreme poses.

Conclusion: Face defrontalization performs better on large-scale datasets, and the results suggest that some current methods may be overfitted to small datasets.

Abstract: Face recognition under extreme head poses is a challenging task. Ideally, a face recognition system should perform well across different head poses, which is known as pose-invariant face recognition. To achieve pose invariance, current approaches rely on sophisticated methods, such as face frontalization and various facial feature extraction model architectures. However, these methods are somewhat impractical in real-life settings and are typically evaluated on small scientific datasets, such as Multi-PIE. In this work, we propose the inverse method of face frontalization, called face defrontalization, to augment the training dataset of the facial feature extraction model. The method does not introduce any time overhead during the inference step. The method is composed of: 1) training an adapted face defrontalization FFWM model on a frontal-profile pairs dataset, which has been preprocessed using our proposed face alignment method; 2) training a ResNet-50 facial feature extraction model based on ArcFace loss on a raw and randomly defrontalized large-scale dataset, where defrontalization was performed with our previously trained face defrontalization model. Our method was compared with the existing approaches on four open-access datasets: LFW, AgeDB, CFP, and Multi-PIE. Defrontalization shows improved results compared to models without defrontalization, while the proposed adjustments show clear superiority over the state-of-the-art face frontalization FFWM method on three larger open-access datasets, but not on the small Multi-PIE dataset for extreme poses (75 and 90 degrees). The results suggest that at least some of the current methods may be overfitted to small datasets.

[15] FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

Shizhong Han,Hsin-Pai Cheng,Hong Cai,Jihad Masri,Soyeb Nagori,Fatih Porikli

Main category: cs.CV

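TL;DR: FALO is a hardware-friendly LiDAR 3D detection method that combines high accuracy with fast inference and is suited to resource-constrained edge devices.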

Motivation: Existing LiDAR 3D detection methods rely on sparse convolutions or transformers, whose high computational cost and irregular memory access make them hard to run on edge devices.

Method: FALO arranges sparse 3D voxels into a 1D sequence, processes it with ConvDotMix blocks (large-kernel convolutions, Hadamard products, and linear layers), and introduces implicit grouping for more efficient inference.

Result: FALO achieves competitive performance on the nuScenes and Waymo benchmarks and runs 1.6~9.8x faster than the latest SOTA methods.

Conclusion: FALO delivers much faster inference while keeping high accuracy, making it suitable for deployment on resource-constrained devices.

Abstract: Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms, and the proposed FALO can be readily deployed on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6~9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).

[16] AuthGuard: Generalizable Deepfake Detection via Language Guidance

Guangyu Shen,Zhihua Li,Xiang Xu,Tianchen Zhao,Zheng Zhang,Dongsheng An,Zhuowen Tu,Yifan Xing,Qin Zhang

Main category: cs.CV

TL;DR: AuthGuard combines language guidance with a vision encoder to improve the generalization and accuracy of deepfake detection.

Motivation: Existing deepfake detection techniques struggle with constantly evolving forgery methods because they rely on statistical artifacts tied to specific generation processes.

Method: Combine discriminative classification with image-text contrastive learning, use MLLM-generated text as guidance, and integrate data uncertainty learning.

Result: AuthGuard improves AUC by 6.15% on DFDC and 16.68% on DF40, and improves performance by 24.69% on DDVQA.

Conclusion: Through language guidance and vision-language contrastive learning, AuthGuard achieves more generalizable and interpretable deepfake detection.

Abstract: Existing deepfake detection techniques struggle to keep up with the ever-evolving novel, unseen forgery methods. This limitation stems from their reliance on statistical artifacts learned during training, which are often tied to specific generation processes that may not be representative of samples from new, unseen deepfake generation methods encountered at test time. We propose that incorporating language guidance can improve deepfake detection generalization by integrating human-like commonsense reasoning -- such as recognizing logical inconsistencies and perceptual anomalies -- alongside statistical cues. To achieve this, we train an expert deepfake vision encoder by combining discriminative classification with image-text contrastive learning, where the text is generated by generalist MLLMs using few-shot prompting. This allows the encoder to extract both language-describable, commonsense deepfake artifacts and statistical forgery artifacts from pixel-level distributions. To further enhance robustness, we integrate data uncertainty learning into vision-language contrastive learning, mitigating noise in image-text supervision. Our expert vision encoder seamlessly interfaces with an LLM, further enabling more generalized and interpretable deepfake detection while also boosting accuracy. The resulting framework, AuthGuard, achieves state-of-the-art deepfake detection accuracy in both in-distribution and out-of-distribution settings, achieving AUC gains of 6.15% on the DFDC dataset and 16.68% on the DF40 dataset. Additionally, AuthGuard significantly enhances deepfake reasoning, improving performance by 24.69% on the DDVQA dataset.

[17] Pruning Everything, Everywhere, All at Once

Gustavo Henrique do Nascimento,Ian Pons,Anna Helena Reali Costa,Artur Jordao

Main category: cs.CV

TL;DR: Proposes a method that prunes neurons and layers simultaneously, selecting the best subnetwork by representation similarity, which substantially cuts computation while preserving model performance.

Details Motivation: Deep learning models are complex and computationally expensive, and existing pruning methods target either neurons or layers but not both at once. Method: Uses the Centered Kernel Alignment metric to select the subnetwork whose representations are most similar to those of its parent network, iteratively pruning neurons and layers. Result: Achieves 86.37% and 95.82% FLOPs reduction on ResNet56 and ResNet110 while maintaining or improving accuracy, and improves robustness to adversarial and out-of-distribution samples. Conclusion: The method opens a new direction in pruning, substantially reducing computational resources and carbon emissions and advancing GreenAI. Abstract: Deep learning stands as the modern paradigm for solving cognitive tasks. However, as the problem complexity increases, models grow deeper and computationally prohibitive, hindering advancements in real-world and resource-constrained applications. Extensive studies reveal that pruning structures in these models efficiently reduces model complexity and improves computational efficiency. Successful strategies in this sphere include removing neurons (i.e., filters, heads) or layers, but not both together. Therefore, simultaneously pruning different structures remains an open problem. To fill this gap and leverage the benefits of eliminating neurons and layers at once, we propose a new method capable of pruning different structures within a model as follows. Given two candidate subnetworks (pruned models), one from layer pruning and the other from neuron pruning, our method decides which to choose by selecting the one with the highest representation similarity to its parent (the network that generates the subnetworks) using the Centered Kernel Alignment metric. Iteratively repeating this process provides highly sparse models that preserve the original predictive ability. Throughout extensive experiments on standard architectures and benchmarks, we confirm the effectiveness of our approach and show that it outperforms state-of-the-art layer and filter pruning techniques. At high levels of Floating Point Operations reduction, most state-of-the-art methods degrade accuracy, whereas our approach either improves it or experiences only a minimal drop. Notably, on the popular ResNet56 and ResNet110, we achieve a milestone of 86.37% and 95.82% FLOPs reduction. Besides, our pruned models obtain robustness to adversarial and out-of-distribution samples and take an important step towards GreenAI, reducing carbon emissions by up to 83.31%. Overall, we believe our work opens a new chapter in pruning.
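
Illustrative sketch (not the authors' code): the selection step described in the abstract, choosing between a layer-pruned and a neuron-pruned candidate by similarity to the parent, can be approximated with linear Centered Kernel Alignment. The function names, the toy feature matrices, and the choice of the linear CKA variant are assumptions made for this example; the abstract does not say which CKA variant or which layer's features are used.

```python
import math

def center_columns(m):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of A^T B, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator

def pick_candidate(parent_feats, candidates):
    """Return the name of the candidate most similar to the parent, plus all scores."""
    scores = {name: linear_cka(parent_feats, feats) for name, feats in candidates.items()}
    return max(scores, key=scores.get), scores

# Hypothetical features for 4 inputs (rows) from the parent and two pruned candidates.
parent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
candidates = {
    "neuron_pruned": [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [4.0, 0.0]],
    "layer_pruned": [[1.0], [-1.0], [-1.0], [1.0]],
}
best, scores = pick_candidate(parent, candidates)
print(best, scores)
```

Linear CKA is invariant to isotropic scaling of the features, so the scaled copy scores as fully similar here; the candidates may also have different feature widths, which is what allows a layer-pruned and a neuron-pruned network to be compared against the same parent.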

[18] EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention

Shuo Zhang

Main category: cs.CV

TL;DR: Proposes EECD-Net, a multi-stage road crack detection method that combines SRCNN, SCU, and GAT modules to improve both detection accuracy and energy efficiency.

Details Motivation: Smart terminal devices struggle with real-time monitoring because of limited energy and low-resolution imaging, so road crack detection needs better accuracy and energy efficiency. Method: Uses SRCNN to enhance image resolution, an SCU to reduce power consumption, and a GAT module to fuse multi-scale features for more robust detection. Result: Reaches 98.6% accuracy on the CrackVision12K benchmark while consuming only 5.6 mJ, a 33% reduction relative to the baseline. Conclusion: EECD-Net offers a scalable, low-power solution for real-time crack detection in resource-constrained environments. Abstract: Crack detection on road surfaces is a critical measurement technology in the instrumentation domain, essential for ensuring infrastructure safety and transportation reliability. However, due to limited energy and low-resolution imaging, smart terminal devices struggle to maintain real-time monitoring performance. To overcome these challenges, this paper proposes a multi-stage detection approach for road crack detection, EECD-Net, to enhance accuracy and energy efficiency of instrumentation. Specifically, the sophisticated Super-Resolution Convolutional Neural Network (SRCNN) is employed to address the inherent challenges of low-quality images, which effectively enhances image resolution while preserving critical structural details. Meanwhile, a Spike Convolution Unit (SCU) with Continuous Integrate-and-Fire (CIF) neurons is proposed to convert these images into sparse pulse sequences, significantly reducing power consumption. Additionally, a Gated Attention Transformer (GAT) module is designed to strategically fuse multi-scale feature representations through adaptive attention mechanisms, effectively capturing both long-range dependencies and intricate local crack patterns, and significantly enhancing detection robustness across varying crack morphologies. The experiments on the CrackVision12K benchmark demonstrate that EECD-Net achieves a remarkable 98.6% detection accuracy, surpassing state-of-the-art counterparts such as Hybrid-Segmentor by a significant 1.5%. Notably, the EECD-Net maintains exceptional energy efficiency, consuming merely 5.6 mJ, which is a substantial 33% reduction compared to baseline implementations. This work pioneers a transformative approach in instrumentation-based crack detection, offering a scalable, low-power solution for real-time, large-scale infrastructure monitoring in resource-constrained environments.
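
Illustrative sketch (not the paper's CIF neuron): to give a feel for how a spiking unit turns intensities into sparse pulse sequences, the snippet below implements a plain integrate-and-fire neuron with a soft reset. The threshold value, the soft-reset rule, and the constant-input encoding are assumptions for illustration only; the Continuous Integrate-and-Fire neuron named in the abstract is not specified there and may differ.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Accumulate input into a membrane potential and emit 1 when it crosses the threshold."""
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane += current
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold  # soft reset: keep any charge above the threshold
        else:
            spikes.append(0)
    return spikes

def encode_pixel(intensity, steps=8):
    """Present a constant intensity in [0, 1] for a number of time steps (rate coding)."""
    return integrate_and_fire([intensity] * steps)

# A dim pixel fires rarely, a bright pixel fires often; most entries stay zero.
print(encode_pixel(0.25))
print(encode_pixel(0.5))
```

Because downstream computation is only triggered by the nonzero entries, sparser spike trains generally mean fewer operations, which is the usual argument for the energy savings of spiking networks.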

[19] Enhancing Frequency for Single Image Super-Resolution with Learnable Separable Kernels

Heng Tian

Main category: cs.CV

TL;DR: Proposes Learnable Separable Kernels (LSKs), a plug-and-play module that improves single-image super-resolution (SISR) by directly enhancing image frequency components.

Details Motivation: Existing methods usually improve SISR indirectly, for example through specialized loss functions, whereas LSKs aim to optimize frequency components directly. Method: LSKs are designed as rank-one matrices that decompose into orthogonal, mergeable one-dimensional kernels, which substantially reduces parameters and computation. Result: Experiments show that LSKs cut parameters and computation by more than 60% while improving model performance, with larger gains at higher upscaling factors. Conclusion: LSKs are an efficient and effective enhancement module for SISR, offering both parameter and computational efficiency. Abstract: Existing approaches often enhance the performance of single-image super-resolution (SISR) methods by incorporating auxiliary structures, such as specialized loss functions, to indirectly boost the quality of low-resolution images. In this paper, we propose a plug-and-play module called Learnable Separable Kernels (LSKs), which are formally rank-one matrices designed to directly enhance image frequency components. We begin by explaining why LSKs are particularly suitable for SISR tasks from a frequency perspective. Baseline methods incorporating LSKs demonstrate a significant reduction of over 60% in both the number of parameters and computational requirements. This reduction is achieved through the decomposition of LSKs into orthogonal and mergeable one-dimensional kernels. Additionally, we perform an interpretable analysis of the feature maps generated by LSKs. Visualization results reveal the capability of LSKs to enhance image frequency components effectively. Extensive experiments show that incorporating LSKs not only reduces the number of parameters and computational load but also improves overall model performance. Moreover, these experiments demonstrate that models utilizing LSKs exhibit superior performance, particularly as the upscaling factor increases.
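
Illustrative sketch (not the authors' implementation): a rank-one 2D kernel is the outer product of a vertical and a horizontal 1D kernel, so filtering with it is equivalent to a horizontal 1D pass followed by a vertical 1D pass. The snippet checks that equivalence on a toy image. The helper names, the fixed example kernels, and the use of "valid" cross-correlation are assumptions; in the paper the 1D kernels are learned.

```python
def outer(v, h):
    """Rank-one 2D kernel K[a][b] = v[a] * h[b]."""
    return [[a * b for b in h] for a in v]

def correlate2d_valid(img, k):
    """Plain 'valid' 2D cross-correlation on nested lists."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [
        [sum(img[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]

def separable_correlate(img, v, h):
    """Apply the horizontal 1D kernel h, then the vertical 1D kernel v."""
    rows = correlate2d_valid(img, [h])                # 1 x len(h) pass
    return correlate2d_valid(rows, [[a] for a in v])  # len(v) x 1 pass

img = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8], [9, 7, 9, 3]]
v, h = [1, 2, 1], [1, 0, -1]  # example 1D kernels (hypothetical values)

full = correlate2d_valid(img, outer(v, h))
separable = separable_correlate(img, v, h)
print(full == separable)
```

A full k x k kernel stores k * k weights and costs k * k multiplications per output pixel, whereas the separable form needs only 2 * k of each, which is where savings in parameters and computation come from.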

[20] Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

Yunhao Gou,Kai Chen,Zhili Liu,Lanqing Hong,Xin Jin,Zhenguo Li,James T. Kwok,Yu Zhang

Main category: cs.CV

TL;DR: RACRO uses reinforcement learning to optimize the captions produced by a visual extractor so that multimodal large language models can reason efficiently without costly multimodal re-alignment.

Details Motivation: Address the high cost of vision-language alignment in multimodal large language models while ensuring that the extractor's descriptions are both faithful to the image and sufficient for accurate reasoning. Method: Proposes RACRO, a reasoning-guided reinforcement learning strategy that aligns the visual extractor's captioning behavior with the reasoning objective. Result: On multimodal math and science benchmarks, RACRO achieves state-of-the-art average performance and supports plug-and-play adaptation to more advanced reasoning models. Conclusion: By decoupling perception from reasoning and optimizing the link between them, RACRO markedly improves visual grounding while reducing the cost of multimodal re-alignment. Abstract: Recent advances in slow-thinking language models (e.g., OpenAI-o1 and DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks by emulating human-like reflective cognition. However, extending such capabilities to multi-modal large language models (MLLMs) remains challenging due to the high cost of retraining vision-language alignments when upgrading the underlying reasoner LLMs. A straightforward solution is to decouple perception from reasoning, i.e., converting visual inputs into language representations (e.g., captions) that are then passed to a powerful text-only reasoner. However, this decoupling introduces a critical challenge: the visual extractor must generate descriptions that are both faithful to the image and informative enough to support accurate downstream reasoning. To address this, we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO), a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective. By closing the perception-reasoning loop via reward-based optimization, RACRO significantly enhances visual grounding and extracts reasoning-optimized representations. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance while enabling superior scalability and plug-and-play adaptation to more advanced reasoning LLMs without the necessity for costly multi-modal re-alignment.

[21] LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation

Biao Guo,Fangmin Guo,Guibo Luo,Xiaonan Luo,Feng Zhang

Main category: cs.CV

TL;DR: Proposes LGM-Pose, a lightweight global modeling network whose single-branch structure and new modules (LARM and SFusion) address the weak global-context capture and high latency of multi-branch CNNs, with strong results on COCO and MPII.

Details Motivation: Current multi-branch CNN architectures for multi-person pose estimation struggle to capture global context and suffer from high latency. Method: Designs a lightweight MobileViM Block with the LARM module, which uses NPT-Op to extract global information, and introduces the SFusion module to integrate multi-scale information. Result: On COCO and MPII, the method uses fewer parameters, performs better, and runs faster. Conclusion: With its single-branch structure and new modules, LGM-Pose overcomes the limitations of existing methods and delivers efficient, lightweight pose estimation. Abstract: Most current lightweight top-down multi-person pose estimation methods are based on multi-branch parallel pure-CNN architectures, which often struggle to capture the global context required for detecting semantically complex keypoints and are hindered by high latency due to their intricate and redundant structures. In this article, an approximate single-branch lightweight global modeling network (LGM-Pose) is proposed to address these challenges. In the network, a lightweight MobileViM Block is designed with a proposed Lightweight Attentional Representation Module (LARM), which integrates information within and between patches using the Non-Parametric Transformation Operation (NPT-Op) to extract global information. Additionally, a novel Shuffle-Integrated Fusion Module (SFusion) is introduced to effectively integrate multi-scale information, mitigating performance degradation often observed in single-branch structures. Experimental evaluations on the COCO and MPII datasets demonstrate that our approach not only reduces the number of parameters compared to existing mainstream lightweight methods but also achieves superior performance and faster processing speeds.

[22] Follow-Your-Creation: Empowering 4D Creation through Video Inpainting

Yue Ma,Kunyu Feng,Xinhua Zhang,Hongyu Liu,David Junhao Zhang,Jinbo Xing,Yinhan Zhang,Ayden Yang,Zeyu Wang,Qifeng Chen

Main category: cs.CV

TL;DR: Follow-Your-Creation is a novel 4D video generation and editing framework that casts 4D content creation from a monocular video as a video inpainting task.

Details Motivation: Meet the need to generate and edit 4D content from a monocular video by using a video inpainting foundation model as a generative prior to fill in content missing because of camera trajectory changes or user edits. Method: Uses depth-based point cloud rendering to produce invisibility masks, combines them with user editing masks into a composite mask dataset, and improves generalization through random mask sampling and a self-iterative tuning strategy. Result: Generates 4D videos with multi-view consistency, supports prompt-based content editing, and significantly outperforms existing methods in quality and flexibility. Conclusion: The framework effectively exploits the base model's prior knowledge to achieve high-quality, flexible 4D video generation and editing. Abstract: We introduce Follow-Your-Creation, a novel 4D video creation framework capable of both generating and editing 4D content from a single monocular video input. By leveraging a powerful video inpainting foundation model as a generative prior, we reformulate 4D video creation as a video inpainting task, enabling the model to fill in missing content caused by camera trajectory changes or user edits. To facilitate this, we generate composite masked inpainting video data to effectively fine-tune the model for 4D video generation. Given an input video and its associated camera trajectory, we first perform depth-based point cloud rendering to obtain invisibility masks that indicate the regions that should be completed. Simultaneously, editing masks are introduced to specify user-defined modifications, and these are combined with the invisibility masks to create a composite masks dataset. During training, we randomly sample different types of masks to construct diverse and challenging inpainting scenarios, enhancing the model's generalization and robustness in various 4D editing and generation tasks. To handle temporal consistency under large camera motion, we design a self-iterative tuning strategy that gradually increases the viewing angles during training, where the model is used to generate the next-stage training data after each fine-tuning iteration. Moreover, we introduce a temporal packaging module during inference to enhance generation quality. Our method effectively leverages the prior knowledge of the base model without degrading its original performance, enabling the generation of 4D videos with consistent multi-view coherence. In addition, our approach supports prompt-based content editing, demonstrating strong flexibility and significantly outperforming state-of-the-art methods in both quality and versatility.

[23] Hierarchical-Task-Aware Multi-modal Mixture of Incremental LoRA Experts for Embodied Continual Learning

Ziqi Jia,Anmin Wang,Xiaoyang Qu,Xiaowen Yang,Jianzong Wang

Main category: cs.CV

TL;DR: The paper proposes Hierarchical Embodied Continual Learning Setups (HEC) and a Task-aware Mixture of Incremental LoRA Experts (Task-aware MoILE) method to address existing approaches' neglect of high-level planning and multi-level knowledge learning.

Details Motivation: Existing continual learning methods focus on executing low-level actions from human commands and lack the ability to learn high-level planning and multi-level knowledge. Method: Proposes the HEC setups, which split learning into two layers (high-level instructions and low-level actions) and define five sub-setups. Task-aware MoILE recognizes tasks by clustering visual-text embeddings and uses task-level and token-level routers to select LoRA experts. SVD is applied to the LoRA parameters to reduce catastrophic forgetting. Result: Experiments show that the method outperforms other methods in reducing forgetting of old tasks, helping agents retain prior knowledge while continually learning new tasks. Conclusion: HEC and Task-aware MoILE offer a new approach to continual learning and markedly improve agents' learning ability and knowledge retention. Abstract: Previous continual learning setups for embodied intelligence focused on executing low-level actions based on human commands, neglecting the ability to learn high-level planning and multi-level knowledge. To address these issues, we propose the Hierarchical Embodied Continual Learning Setups (HEC) that divide the agent's continual learning process into two layers: high-level instructions and low-level actions, and define five embodied continual learning sub-setups. Building on these setups, we introduce the Task-aware Mixture of Incremental LoRA Experts (Task-aware MoILE) method. This approach achieves task recognition by clustering visual-text embeddings and uses both a task-level router and a token-level router to select the appropriate LoRA experts. To effectively address the issue of catastrophic forgetting, we apply Singular Value Decomposition (SVD) to the LoRA parameters obtained from prior tasks, preserving key components while orthogonally training the remaining parts. The experimental results show that our method stands out in reducing the forgetting of old tasks compared to other methods, effectively supporting agents in retaining prior knowledge while continuously learning new tasks.

[24] SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

Alexander Huang-Menders,Xinhang Liu,Andy Xu,Yuyao Zhang,Chi-Keung Tang,Yu-Wing Tai

Main category: cs.CV

TL;DR: SmartAvatar is a vision-language-agent-driven framework that generates fully rigged, animation-ready 3D human avatars from a single photo or text prompt. It combines large vision-language models (VLMs) with off-the-shelf parametric human generators and an autonomous verification loop to produce high-quality, customizable avatars.

Details Motivation: Existing diffusion-based methods for 3D human avatar generation struggle with precise control over identity, body shape, and animation readiness, so a more flexible and controllable solution is needed. Method: SmartAvatar combines the commonsense reasoning of vision-language models with parametric human generators, iteratively refining results through an autonomous verification loop (render, evaluate, adjust parameters) and supporting fine-grained control via natural-language conversation. Result: The generated avatars are high quality, identity consistent, and animation ready, outperforming existing methods in mesh quality, identity fidelity, attribute accuracy, and animation readiness. Conclusion: SmartAvatar is a versatile tool that runs on consumer-grade hardware and produces realistic, customizable 3D human avatars. Abstract: SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars from a single photo or textual prompt. While diffusion-based methods have made progress in general 3D object generation, they continue to struggle with precise control over human identity, body shape, and animation readiness. In contrast, SmartAvatar leverages the commonsense reasoning capabilities of large vision-language models (VLMs) in combination with off-the-shelf parametric human generators to deliver high-quality, customizable avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars, evaluates facial similarity, anatomical plausibility, and prompt alignment, and iteratively adjusts generation parameters for convergence. This interactive, AI-guided refinement process promotes fine-grained control over both facial and body features, enabling users to iteratively refine their avatars via natural-language conversations. Unlike diffusion models that rely on static pre-trained datasets and offer limited flexibility, SmartAvatar brings users into the modeling loop and ensures continuous improvement through an LLM-driven procedural generation and verification system. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance, making them suitable for downstream animation and interactive applications. Quantitative benchmarks and user studies demonstrate that SmartAvatar outperforms recent text- and image-driven avatar generation systems in terms of reconstructed mesh quality, identity fidelity, attribute accuracy, and animation readiness, making it a versatile tool for realistic, customizable avatar creation on consumer-grade hardware.

[25] Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth

Jinyoung Jun,Lei Chu,Jiahao Li,Yan Lu,Chang-Su Kim

Main category: cs.CV

TL;DR: Proposes Perfecting Depth, a two-stage framework for sensor depth enhancement that combines a stochastic diffusion model with deterministic refinement to produce highly reliable depth maps.

Details Motivation: Automatically detect unreliable regions in sensor depth data while preserving geometric cues, improving the reliability and usefulness of depth maps. Method: Stage one (stochastic estimation) exploits a training-inference domain gap to identify unreliable regions and infer geometric structure; stage two (deterministic refinement) uses the resulting uncertainty map to enforce structural consistency and pixel-level accuracy. Result: Experiments show that the method produces dense, artifact-free depth maps and performs well across diverse real-world scenarios. Conclusion: The framework sets a new baseline for sensor depth enhancement, with applications in autonomous driving, robotics, and immersive technologies. Abstract: We propose a novel two-stage framework for sensor depth enhancement, called Perfecting Depth. This framework leverages the stochastic nature of diffusion models to automatically detect unreliable depth regions while preserving geometric cues. In the first stage (stochastic estimation), the method identifies unreliable measurements and infers geometric structure by leveraging a training-inference domain gap. In the second stage (deterministic refinement), it enforces structural consistency and pixel-level accuracy using the uncertainty map derived from the first stage. By combining stochastic uncertainty modeling with deterministic refinement, our method yields dense, artifact-free depth maps with improved reliability. Experimental results demonstrate its effectiveness across diverse real-world scenarios. Furthermore, theoretical analysis, various experiments, and qualitative visualizations validate its robustness and scalability. Our framework sets a new baseline for sensor depth enhancement, with potential applications in autonomous driving, robotics, and immersive technologies.
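
Illustrative sketch (an assumption, not the paper's procedure): one common way to turn a stochastic model into an uncertainty map is to draw several samples for the same input and measure how much they disagree per pixel. The snippet computes a per-pixel mean and variance over a set of depth samples and flags high-variance pixels as unreliable. The function names, the variance criterion, and the threshold are hypothetical; the abstract does not state how the uncertainty map is actually derived.

```python
from statistics import mean, pvariance

def uncertainty_map(samples):
    """Per-pixel mean and population variance over a list of same-sized depth maps."""
    height, width = len(samples[0]), len(samples[0][0])
    mean_map = [[mean([s[i][j] for s in samples]) for j in range(width)]
                for i in range(height)]
    var_map = [[pvariance([s[i][j] for s in samples]) for j in range(width)]
               for i in range(height)]
    return mean_map, var_map

def unreliable_mask(var_map, threshold):
    """Mark pixels whose sample variance exceeds the (hypothetical) threshold."""
    return [[v > threshold for v in row] for row in var_map]

# Three hypothetical stochastic depth estimates of a 1 x 2 image (metres).
samples = [[[2.0, 1.0]], [[2.0, 3.0]], [[2.0, 5.0]]]
mean_map, var_map = uncertainty_map(samples)
mask = unreliable_mask(var_map, threshold=0.5)
print(mean_map, var_map, mask)
```

In a two-stage design like the one described, a mask of this kind could tell a refinement stage which measurements to trust and which to re-estimate from surrounding geometry.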

[26] Deep Learning Reforms Image Matching: A Survey and Outlook

Shihua Zhang,Zizhuo Li,Kaining Zhang,Yifan Lu,Yuxin Deng,Linfeng Tang,Xingyu Jiang,Jiayi Ma

Main category: cs.CV

TL;DR: This survey reviews how deep learning has incrementally transformed the classical image matching pipeline, both by replacing individual steps with learnable modules and by merging multiple steps into end-to-end modules, and it benchmarks representative methods.

Details Motivation: Traditional image matching pipelines perform poorly in challenging scenes, while advances in deep learning have markedly improved robustness and accuracy. The paper aims to review comprehensively how deep learning has changed image matching. Method: Categorizes and evaluates deep-learning-driven strategies, including learnable detector-descriptors, outlier filters, and geometric estimators, as well as end-to-end modules. Result: Benchmarks representative methods on relative pose recovery, homography estimation, and visual localization. Conclusion: Summarizes current challenges and points to promising directions for future research, giving a clear overview of the image matching field. Abstract: Image matching, which establishes correspondences between two-view images to recover 3D structure and camera geometry, serves as a cornerstone in computer vision and underpins a wide range of applications, including visual localization, 3D reconstruction, and simultaneous localization and mapping (SLAM). Traditional pipelines composed of "detector-descriptor, feature matcher, outlier filter, and geometric estimator" falter in challenging scenarios. Recent deep-learning advances have significantly boosted both robustness and accuracy. This survey adopts a unique perspective by comprehensively reviewing how deep learning has incrementally transformed the classical image matching pipeline. Our taxonomy closely aligns with the traditional pipeline in two key aspects: i) the replacement of individual steps in the traditional pipeline with learnable alternatives, including learnable detector-descriptor, outlier filter, and geometric estimator; and ii) the merging of multiple steps into end-to-end learnable modules, encompassing middle-end sparse matcher, end-to-end semi-dense/dense matcher, and pose regressor. We first examine the design principles, advantages, and limitations of both aspects, and then benchmark representative methods on relative pose recovery, homography estimation, and visual localization tasks. Finally, we discuss open challenges and outline promising directions for future research. By systematically categorizing and evaluating deep learning-driven strategies, this survey offers a clear overview of the evolving image matching landscape and highlights key avenues for further innovation.

[27] Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

Linjie Li,Mahtab Bigverdi,Jiawei Gu,Zixian Ma,Yinuo Yang,Ziang Li,Yejin Choi,Ranjay Krishna

Main category: cs.CV

TL;DR: STARE is a benchmark for evaluating multimodal large language models on spatial cognition tasks, covering geometric transformations, spatial reasoning, and real-world scenarios. Models do well on simple 2D tasks but perform near chance on complex 3D tasks and fail to exploit visual simulations effectively.

Details Motivation: Existing AI benchmarks focus on verbal reasoning and overlook the complexity of non-verbal, multi-step visual simulation; STARE is designed to fill this gap. Method: STARE contains 4K tasks spanning geometric transformations, integrated spatial reasoning, and real-world spatial reasoning, and evaluates models through multi-step visual simulation. Result: Models excel at simple 2D tasks but perform near chance on complex tasks such as 3D cube net folding and tangram puzzles, and visual simulations give them only limited help. Conclusion: Models fall short on complex spatial cognition tasks and need to become better at using visual simulations. Abstract: Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks like 3D cube net folding and tangram puzzles that require multi-step visual simulations. Humans achieve near-perfect accuracy but take considerable time (up to 28.9s) on complex tasks, significantly speeding up (down by 7.5 seconds on average) with intermediate visual simulations. In contrast, models exhibit inconsistent performance gains from visual simulations, improving on most tasks but declining in specific cases like tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.

[28] Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders

Qiming Hu,Linlong Fan,Yiyan Luo,Yuhang Yu,Xiaojie Guo,Qingnan Fan

Main category: cs.CV

TL;DR: TADiSR is a diffusion-based super-resolution framework that uses text-aware attention and joint segmentation decoders to improve the structural fidelity of text regions in images.

Details Motivation: When generative models handle real-world degradations in image super-resolution, they often distort textual structures; TADiSR aims to solve this problem. Method: Proposes the TADiSR framework, which combines text-aware attention with joint segmentation decoders, and designs a complete pipeline for synthesizing high-quality training images. Result: Experiments show that TADiSR substantially improves text legibility in super-resolved images and achieves state-of-the-art performance on multiple metrics. Conclusion: TADiSR generalizes well to real-world scenarios and provides an effective solution for image super-resolution. Abstract: The introduction of generative models has significantly advanced image super-resolution (SR) in handling real-world degradations. However, they often incur fidelity-related issues, particularly distorting textual structures. In this paper, we introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders to recover not only natural details but also the structural fidelity of text regions in degraded real-world images. Moreover, we propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content. Extensive experiments demonstrate that our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics and exhibiting strong generalization to real-world scenarios. Our code is available at https://github.com/mingcv/TADiSR.

[29] FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

Akide Liu,Zeyu Zhang,Zhexin Li,Xuehai Bai,Yizeng Han,Jiasheng Tang,Yuanjie Xing,Jichao Wu,Mingyang Yang,Weihua Chen,Jiahao He,Yuanyu He,Fan Wang,Gholamreza Haffari,Bohan Zhuang

Main category: cs.CV

TL;DR: FPSAttention is a training-aware co-design of FP8 quantization and sparsity for video generation that substantially speeds up inference while preserving generation quality.

Details Motivation: Diffusion generative models excel at high-quality video generation, but slow inference and high computational demands limit practical use. Method: Proposes FPSAttention, which comprises a unified 3D tile-wise granularity, a denoising step-aware strategy, and a hardware-friendly kernel implementation. Result: Tested on 1.3B and 14B models, FPSAttention achieves a 7.09x speedup for attention operations and a 4.96x end-to-end speedup for video generation without loss of quality. Conclusion: By jointly optimizing quantization and sparsity, FPSAttention markedly improves the efficiency of video generation and offers a practical path to deployment. Abstract: Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.

[30] Feature-Based Lie Group Transformer for Real-World Applications

Takayuki Komatsu,Yoshiyuki Ohmura,Kayato Nishitsunoi,Yasuo Kuniyoshi

Main category: cs.CV

TL;DR: The paper proposes a method that combines feature extraction with object segmentation to apply group decomposition theory to more realistic scenarios and improve representation learning.

Details Motivation: Conventional representation learning assumes that disentangled, independent feature axes make a good representation, but such a representation cannot account for conditional independence. The paper aims to address this with group decomposition theory and apply it to real-world scenarios. Method: Combines feature extraction and object segmentation, replacing pixel translation with feature translation and treating object segmentation as grouping features under the same transformation. Result: The method is validated on a dataset containing real-world objects and backgrounds. Conclusion: The method is expected to improve understanding of how humans develop object recognition in the real world. Abstract: The main goal of representation learning is to acquire meaningful representations from real-world sensory inputs without supervision. Representation learning explains some aspects of human development. Various neural network (NN) models have been proposed that acquire empirically good representations. However, the formulation of a good representation has not been established. We recently proposed a method for categorizing changes between a pair of sensory inputs. A unique feature of this approach is that transformations between two sensory inputs are learned to satisfy algebraic structural constraints. Conventional representation learning often assumes that disentangled independent feature axes are a good representation; however, we found that such a representation cannot account for conditional independence. To overcome this problem, we proposed a new method using group decomposition in Galois algebra theory. Although this method is promising for defining a more general representation, it assumes pixel-to-pixel translation without feature extraction, and can only process low-resolution images with no background, which prevents real-world application. In this study, we provide a simple method to apply our group decomposition theory to a more realistic scenario by combining feature extraction and object segmentation. We replace pixel translation with feature translation and formulate object segmentation as grouping features under the same transformation. We validated the proposed method on a practical dataset containing both real-world objects and backgrounds. We believe that our model will lead to a better understanding of human development of object recognition in the real world.

[31] Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts

Zhong Ji,Rongshuai Wei,Jingren Liu,Yanwei Pang,Jungong Han

Main category: cs.CV

TL;DR: The paper proposes a Few-Shot Prototypical Concept Classification (FSPCC) framework that uses parameter-efficient adaptation and multi-level feature fusion to address the poor performance of Self-Explainable Models (SEMs) in data-scarce settings, markedly improving classification accuracy and interpretability.

Details Motivation: Self-Explainable Models (SEMs) perform poorly when data is scarce, mainly because of parametric imbalance and representation misalignment. The paper aims to solve these problems with the FSPCC framework. Method: Uses a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, combines cross-module concept guidance with a multi-level feature fusion strategy, and strengthens concept separation with a geometry-aware concept discrimination loss. Result: On six benchmark datasets, FSPCC outperforms existing SEMs with 4.2%-8.7% relative gains in 5-way 5-shot classification. Conclusion: By coupling concept learning with few-shot adaptation, FSPCC markedly improves accuracy and interpretability and points toward more transparent visual recognition systems. Abstract: Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone's feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
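
Illustrative sketch (a generic stand-in, not the paper's loss): a loss that "enforces orthogonality among concepts" can be approximated by penalizing the squared cosine similarity between every pair of concept vectors, which is zero exactly when all concepts are mutually orthogonal. The function names and the specific penalty form are assumptions; the abstract does not give the formula of the geometry-aware concept discrimination loss.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def orthogonality_penalty(concepts):
    """Sum of squared cosine similarities over all distinct concept pairs."""
    unit = [normalize(c) for c in concepts]
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cosine = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cosine * cosine
    return penalty

# Hypothetical concept prototypes: the first set is orthogonal, the second overlaps.
separated = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
overlapping = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
print(orthogonality_penalty(separated), orthogonality_penalty(overlapping))
```

Added to a classification loss with a small weight, a term like this pushes concept prototypes apart so that each one responds to a distinct pattern.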

[32] Gen-n-Val: Agentic Image Data Generation and Validation

Jing-En Huang,I-Sheng Fang,Tzuhsuan Huang,Chih-Yu Wang,Jun-Cheng Chen

Main category: cs.CV

TL;DR: Gen-n-Val is a new data generation framework that uses Layer Diffusion, LLMs, and VLLMs to produce high-quality single-object masks and diverse backgrounds, sharply reducing invalid data and improving downstream performance.

Details Motivation: Address data scarcity and label noise in computer vision tasks; current synthetic data generation methods suffer from multiple objects per mask, inaccurate segmentation, and incorrect category labels. Method: Gen-n-Val comprises two agents: the LD prompt agent (an LLM) optimizes prompts to generate high-quality foreground instances and masks, and the data validation agent (a VLLM) filters out low-quality data. The system prompts are refined through TextGrad, and image harmonization is used to combine multiple instances. Result: Compared with MosaicFusion, Gen-n-Val reduces invalid data from 50% to 7%, improves COCO instance segmentation by 1% mAP on rare classes, and improves open-vocabulary object detection by 7.1% mAP. Conclusion: Gen-n-Val markedly improves synthetic data quality and task performance, offering a more effective solution for computer vision tasks. Abstract: Recently, Large Language Models (LLMs) and Vision Large Language Models (VLLMs) have demonstrated impressive performance as agents across various tasks, while data scarcity and label noise remain significant challenges in computer vision tasks such as object detection and instance segmentation. A common solution for resolving these issues is to generate synthetic data. However, current synthetic data generation methods struggle with issues such as multiple objects per mask, inaccurate segmentation, and incorrect category labels, limiting their effectiveness. To address these issues, we introduce Gen-n-Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), LLMs, and VLLMs to produce high-quality, single-object masks and diverse backgrounds. Gen-n-Val consists of two agents: (1) The LD prompt agent, an LLM, optimizes prompts for LD to generate high-quality foreground instance images and segmentation masks. These optimized prompts ensure the generation of single-object synthetic data with precise instance masks and clean backgrounds. (2) The data validation agent, a VLLM, filters out low-quality synthetic instance images. The system prompts for both agents are refined through TextGrad. Additionally, we use image harmonization to combine multiple instances within scenes. Compared to state-of-the-art synthetic data approaches like MosaicFusion, our approach reduces invalid synthetic data from 50% to 7% and improves performance by 1% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Furthermore, Gen-n-Val shows significant improvements (7.1% mAP) over YOLO-Worldv2-M in open-vocabulary object detection benchmarks with YOLO11m. Moreover, Gen-n-Val improves the performance of YOLOv9 and YOLO11 families in instance segmentation and object detection.

[33] MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements

Chuyun Deng,Na Liu,Wei Xie,Lianming Xu,Li Wang

Main category: cs.CV

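TL;DR: MARS is a multi-scale-aware radio map super-resolution method that combines CNNs and Transformers, improving reconstruction accuracy through multi-scale feature fusion and residual connections.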

Details Motivation: Radio maps are essential in smart cities and the IoT, but accurate reconstruction from sparse measurements remains challenging. Traditional methods lack environmental awareness, while deep learning approaches depend on detailed scene data. Method: The authors propose MARS, which combines CNNs and Transformers with multi-scale feature fusion and residual connections, attending to both global and local feature extraction. Result: Experiments show that MARS outperforms baseline models on MSE and SSIM while keeping computational cost low. Conclusion: MARS is highly practical and applies across different scenes and antenna locations. Abstract: Radio maps reflect the spatial distribution of signal strength and are essential for applications like smart cities, IoT, and wireless network planning. However, reconstructing accurate radio maps from sparse measurements remains challenging. Traditional interpolation and inpainting methods lack environmental awareness, while many deep learning approaches depend on detailed scene data, limiting generalization. To address this, we propose MARS, a Multi-scale Aware Radiomap Super-resolution method that combines CNNs and Transformers with multi-scale feature fusion and residual connections. MARS focuses on both global and local feature extraction, enhancing feature representation across different receptive fields and improving reconstruction accuracy. Experiments across different scenes and antenna locations show that MARS outperforms baseline models in both MSE and SSIM, while maintaining low computational cost, demonstrating strong practical potential.

[34] HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Youngwan Lee,Kangsan Kim,Kwanyong Park,Ilcahe Jung,Soojin Jang,Seanie Lee,Yong-Ju Lee,Sung Ju Hwang

Main category: cs.CV

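TL;DR: The paper proposes the HoliSafe dataset and the SafeLLaVA model to address shortcomings in the safety of existing vision-language models (VLMs). By covering all safe/unsafe image-text combinations and introducing a learnable safety meta token and a dedicated safety head, it substantially improves model safety.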

Details Motivation: Existing VLM safety methods only partially account for harmful content arising from image-text interactions and rely on data-centric tuning with little architectural innovation, leaving models vulnerable in unseen configurations. Method: The authors propose the HoliSafe dataset, covering all five safe/unsafe image-text combinations, and design SafeLLaVA, which includes a learnable safety meta token and a dedicated safety head to encode harmful visual cues and classify harmfulness. Result: SafeLLaVA achieves state-of-the-art safety performance on multiple VLM benchmarks; HoliSafe reveals critical vulnerabilities in existing models. Conclusion: HoliSafe and SafeLLaVA offer a more comprehensive and interpretable approach to VLM safety research and advance future work on multimodal alignment. Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation. We further propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head. The meta token encodes harmful visual cues during training, intrinsically guiding the language model toward safer responses, while the safety head offers interpretable harmfulness classification aligned with refusal rationales. Experiments show that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe benchmark itself reveals critical vulnerabilities in existing models. We hope that HoliSafe and SafeLLaVA will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

[35] Line of Sight: On Linear Representations in VLLMs

Achyuta Rajaram,Sarah Schwettmann,Jacob Andreas,Arthur Conmy

Main category: cs.CV

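TL;DR: The paper studies how multimodal language models represent image concepts, finds linearly decodable features in the residual stream, and trains sparse autoencoders to increase the diversity of the features studied.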

Details Motivation: To explore how multimodal language models such as LlaVA-Next represent image concepts in their hidden activations. Method: The authors analyze linearly decodable features in the residual stream and verify their causality through targeted edits; they also train multimodal Sparse Autoencoders (SAEs) to increase feature diversity. Result: They find linearly decodable features for ImageNet classes, and model representations become increasingly shared across modalities. Conclusion: Representations in multimodal models are largely separate across modalities but become increasingly shared in deeper layers, and sparse autoencoders can improve feature interpretability. Abstract: Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LlaVA-Next, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that the features are causal by performing targeted edits on the model output. In order to increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.

[36] Robust Few-Shot Vision-Language Model Adaptation

Hanxin Wang,Tian Liu,Shu Kong

Main category: cs.CV

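TL;DR: The paper studies how partial finetuning of the visual encoder, combined with two augmentation techniques (retrieval augmentation and adversarial perturbation), improves the in-distribution (ID) and out-of-distribution (OOD) accuracy of pretrained vision-language models (VLMs) in few-shot adaptation.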

Details Motivation: Pretrained VLMs perform well in few-shot adaptation, but their performance drops on OOD test data, so their OOD generalization needs to be improved. Method: By comparing adaptation methods (such as prompt tuning, linear probing, and contrastive finetuning), the authors find that partially finetuning the visual encoder works best, and they propose SRAPF, a two-stage method that combines retrieval augmentation and adversarial perturbation. Result: SRAPF achieves state-of-the-art ID and OOD accuracy on the ImageNet OOD benchmarks. Conclusion: Partial finetuning of the visual encoder, combined with retrieval augmentation and adversarial perturbation, is an effective way to improve few-shot VLM adaptation. Abstract: Pretrained VLMs achieve strong performance on downstream tasks when adapted with just a few labeled examples. As the adapted models inevitably encounter out-of-distribution (OOD) test data that deviates from the in-distribution (ID) task-specific training data, enhancing OOD generalization in few-shot adaptation is critically important. We study robust few-shot VLM adaptation, aiming to increase both ID and OOD accuracy. By comparing different adaptation methods (e.g., prompt tuning, linear probing, contrastive finetuning, and full finetuning), we uncover three key findings: (1) finetuning with proper hyperparameters significantly outperforms the popular VLM adaptation methods prompt tuning and linear probing; (2) visual encoder-only finetuning achieves better efficiency and accuracy than contrastively finetuning both visual and textual encoders; (3) finetuning the top layers of the visual encoder provides the best balance between ID and OOD accuracy. Building on these findings, we propose partial finetuning of the visual encoder empowered with two simple augmentation techniques: (1) retrieval augmentation, which retrieves task-relevant data from the VLM's pretraining dataset to enhance adaptation, and (2) adversarial perturbation, which promotes robustness during finetuning. Results show that the former/latter boosts OOD/ID accuracy while slightly sacrificing the ID/OOD accuracy. Yet, perhaps understandably, naively combining the two does not maintain their best OOD/ID accuracy. We address this dilemma with the developed SRAPF, Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning. SRAPF consists of two stages: (1) partial finetuning of the visual encoder using both ID and retrieved data, and (2) adversarial partial finetuning with few-shot ID data. Extensive experiments demonstrate that SRAPF achieves state-of-the-art ID and OOD accuracy on the ImageNet OOD benchmarks.

[37] Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model

Zelu Qi,Ping Shi,Chaoyang Zhang,Shuqi Wang,Fei Zhao,Da Pan,Zefeng Ying

Main category: cs.CV

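TL;DR: The paper proposes an automatic visual quality assessment method for AI-generated videos (AIGVs) based on multi-dimensional features and a large language model (LLM), and it took second place in the NTIRE 2025 challenge.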

Details Motivation: AIGV technology is advancing rapidly, but the videos still suffer from visual quality defects (such as noise, blur, and frame jitter) that degrade the user experience, so an effective automatic quality assessment method is needed. Method: The authors decompose AIGV visual quality into three dimensions (technical quality, motion quality, and video semantics), design a corresponding encoder to extract features for each, and introduce an LLM as the quality regression module, combined with multi-modal prompt engineering and LoRA fine-tuning. Result: The method took second place in the AIGV quality assessment track of the NTIRE 2025 challenge, supporting its effectiveness. Conclusion: Combining multi-dimensional features with an LLM provides an effective solution for AIGV quality assessment and can be further optimized and extended in future work. Abstract: The development of AI-Generated Video (AIGV) technology has been remarkable in recent years, significantly transforming the paradigm of video content production. However, AIGVs still suffer from noticeable visual quality defects, such as noise, blurriness, frame jitter and low dynamic degree, which severely impact the user's viewing experience. Therefore, an effective automatic visual quality assessment is of great importance for AIGV content regulation and generative model improvement. In this work, we decompose the visual quality of AIGVs into three dimensions: technical quality, motion quality, and video semantics. For each dimension, we design a corresponding encoder to achieve effective feature representation. Moreover, considering the outstanding performance of large language models (LLMs) in various vision and language tasks, we introduce an LLM as the quality regression module. To better enable the LLM to establish reasoning associations between multi-dimensional features and visual quality, we propose a specially designed multi-modal prompt engineering framework. Additionally, we incorporate LoRA fine-tuning technology during the training phase, allowing the LLM to better adapt to specific tasks. Our proposed method achieved second place in the NTIRE 2025 Quality Assessment of AI-Generated Content Challenge: Track 2 AI Generated video, demonstrating its effectiveness. Codes can be obtained at https://github.com/QiZelu/AIGVEval.

[38] Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion

Hongyu Wang,Yonghao Long,Yueyao Chen,Hon-Chi Yip,Markus Scheppach,Philip Wai-Yan Chiu,Yeung Yam,Helen Mei-Ling Meng,Qi Dou

Main category: cs.CV

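TL;DR: The paper proposes iDPOE, a new method that uses an implicit diffusion policy and equivariant representations to improve trajectory prediction in Endoscopic Submucosal Dissection (ESD) videos, with the aim of improving surgical skill training.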

Details Motivation: Predicting dissection trajectories in ESD videos matters for improving surgical skill training and simplifying the learning process, yet the area is underexplored. Existing imitation learning methods struggle with uncertain future movements, geometric symmetries, and diverse surgical scenarios. Method: The authors propose iDPOE, which models expert behavior through a joint state-action distribution and combines a diffusion model with equivariant representations to improve visual representation learning and generalization. Result: On a dataset of nearly 2000 ESD video clips, iDPOE surpasses existing methods in trajectory prediction. Conclusion: iDPOE is the first study to apply imitation learning to dissection trajectory prediction for surgical skill development, and it shows potential for improving prediction accuracy and generalization. Abstract: Endoscopic Submucosal Dissection (ESD) is a well-established technique for removing epithelial lesions. Predicting dissection trajectories in ESD videos offers significant potential for enhancing surgical skill training and simplifying the learning process, yet this area remains underexplored. While imitation learning has shown promise in acquiring skills from expert demonstrations, challenges persist in handling uncertain future movements, learning geometric symmetries, and generalizing to diverse surgical scenarios. To address these, we introduce a novel approach: Implicit Diffusion Policy with Equivariant Representations for Imitation Learning (iDPOE). Our method models expert behavior through a joint state-action distribution, capturing the stochastic nature of dissection trajectories and enabling robust visual representation learning across various endoscopic views. By incorporating a diffusion model into policy learning, iDPOE ensures efficient training and sampling, leading to more accurate predictions and better generalization. Additionally, we enhance the model's ability to generalize to geometric symmetries by embedding equivariance into the learning process. To address state mismatches, we develop a forward-process guided action inference strategy for conditional sampling. Using an ESD video dataset of nearly 2000 clips, experimental results show that our approach surpasses state-of-the-art methods, both explicit and implicit, in trajectory prediction. To the best of our knowledge, this is the first application of imitation learning to surgical skill development for dissection trajectory prediction.

[39] Using In-Context Learning for Automatic Defect Labelling of Display Manufacturing Data

Babar Hussain,Qiang Liu,Gang Chen,Bihai She,Dahai Yu

Main category: cs.CV

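TL;DR: The paper proposes an AI-based auto-labeling system for display panel defect detection that uses in-context learning and substantially improves labeling efficiency and model performance.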

Details Motivation: To reduce the manual annotation workload in industrial inspection systems and improve the efficiency and accuracy of defect detection. Method: The authors adopt and enhance the SegGPT architecture, introduce a scribble-based annotation mechanism, and use a two-stage training approach. Result: Validated on industrial datasets, the method raises average IoU by 0.22 and recall by 14%, with auto-labeling coverage of about 60%. Conclusion: Models trained on auto-labeled data match the performance of models trained on human-labeled data, offering a practical solution for industrial inspection. Abstract: This paper presents an AI-assisted auto-labeling system for display panel defect detection that leverages in-context learning capabilities. We adopt and enhance the SegGPT architecture with several domain-specific training techniques and introduce a scribble-based annotation mechanism to streamline the labeling process. Our two-stage training approach, validated on industrial display panel datasets, demonstrates significant improvements over the baseline model, achieving an average IoU increase of 0.22 and a 14% improvement in recall across multiple product types, while maintaining approximately 60% auto-labeling coverage. Experimental results show that models trained on our auto-labeled data match the performance of those trained on human-labeled data, offering a practical solution for reducing manual annotation efforts in industrial inspection systems.
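
The IoU figure quoted in this entry is the standard intersection-over-union between a predicted mask and a reference mask. As a minimal sketch of that metric only (the function name `mask_iou`, the nested-list mask format, and the empty-mask convention are assumptions for illustration, not taken from the paper):

```python
from typing import Sequence


def mask_iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 shared pixels / 4 pixels in the union = 0.5
```

An "average IoU increase of 0.22" would then be a difference between two such scores averaged over the evaluation set.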

[40] Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets

Mikhail Kennerley,Angelica Alives-Reviro,Carola-Bibiane Schönlieb,Robby T. Tan

Main category: cs.CV

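TL;DR: The LAT framework uses label alignment and feature fusion to resolve semantic and spatial inconsistencies in multi-dataset object detection, substantially improving target-domain detection performance.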

Details Motivation: Multi-dataset object detection is limited by inconsistent class semantics and bounding box annotations, and existing methods cannot address semantic and spatial alignment at the same time. Method: The authors propose the LAT framework, comprising pseudo-label generation, a Privileged Proposal Generator (PPG), and a Semantic Feature Fusion (SFF) module, to align label spaces and refine features. Result: Across multiple benchmarks, LAT improves over semi-supervised baselines by up to 4.8 AP. Conclusion: LAT addresses class-level and spatial misalignment without shared label spaces or manual annotation, improving detection performance. Abstract: Combining multiple object detection datasets offers a path to improved generalisation but is hindered by inconsistencies in class semantics and bounding box annotations. Some methods to address this assume shared label taxonomies and address only spatial inconsistencies; others require manual relabelling, or produce a unified label space, which may be unsuitable when a fixed target label space is required. We propose Label-Aligned Transfer (LAT), a label transfer framework that systematically projects annotations from diverse source datasets into the label space of a target dataset. LAT begins by training dataset-specific detectors to generate pseudo-labels, which are then combined with ground-truth annotations via a Privileged Proposal Generator (PPG) that replaces the region proposal network in two-stage detectors. To further refine region features, a Semantic Feature Fusion (SFF) module injects class-aware context and features from overlapping proposals using a confidence-weighted attention mechanism. This pipeline preserves dataset-specific annotation granularity while enabling many-to-one label space transfer across heterogeneous datasets, resulting in a semantically and spatially aligned representation suitable for training a downstream detector. LAT thus jointly addresses both class-level misalignments and bounding box inconsistencies without relying on shared label spaces or manual annotations. Across multiple benchmarks, LAT demonstrates consistent improvements in target-domain detection performance, achieving gains of up to +4.8 AP over semi-supervised baselines.

[41] SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs

Shuhan Xu,Siyuan Liang,Hongling Zheng,Yong Luo,Aishan Liu,Dacheng Tao

Main category: cs.CV

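TL;DR: The paper proposes Semantic Reward Defense (SRD), a defense against backdoor attacks on vision-language models (VLMs) that uses a reinforcement learning framework to reduce malicious behavior without prior knowledge of the trigger.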

Details Motivation: VLMs are vulnerable to backdoor attacks in image captioning: attackers inject subtle perturbations that make the model output malicious captions, and these attacks are hard to detect and defend against. Method: The authors propose the SRD framework, which uses a Deep Q-Network to learn to apply discrete perturbations (such as occlusion and color masking) to sensitive image regions in order to disrupt the activation of malicious pathways, with a semantic fidelity score as the reward signal. Result: Experiments show that SRD reduces the attack success rate to 5.6% while preserving caption quality on clean inputs, with a performance drop below 10%. Conclusion: SRD offers a trigger-agnostic, interpretable defense paradigm against stealthy backdoor threats in multimodal generative models. Abstract: Vision-Language Models (VLMs) have achieved remarkable performance in image captioning, but recent studies show they are vulnerable to backdoor attacks. Attackers can inject imperceptible perturbations (such as local pixel triggers or global semantic phrases) into the training data, causing the model to generate malicious, attacker-controlled captions for specific inputs. These attacks are hard to detect and defend against due to their stealthiness and cross-modal nature. By analyzing attack samples, we identify two key vulnerabilities: (1) abnormal attention concentration on specific image regions, and (2) semantic drift and incoherence in generated captions. To counter this, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations (e.g., occlusion, color masking) to sensitive image regions, aiming to disrupt the activation of malicious pathways. We design a semantic fidelity score as the reward signal, which jointly evaluates semantic consistency and linguistic fluency of the output, guiding the agent toward generating robust yet faithful captions. Experiments across mainstream VLMs and datasets show SRD reduces attack success rates to 5.6%, while preserving caption quality on clean inputs with less than 10% performance drop. SRD offers a trigger-agnostic, interpretable defense paradigm against stealthy backdoor threats in multimodal generative models.

[42] Physics Informed Capsule Enhanced Variational AutoEncoder for Underwater Image Enhancement

Niki Martinel,Rita Pucci

Main category: cs.CV

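TL;DR: The paper proposes a novel dual-stream architecture that combines a physical model with capsule clustering-based feature learning for underwater image enhancement, outperforming existing methods.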

Details Motivation: To address the challenge of respecting physical constraints while preserving semantic structure in underwater image enhancement. Method: A dual-stream architecture estimates transmission maps and background light in one stream and extracts features via capsule clustering in the other, with an optimization objective that incorporates the physical model. Result: The method performs strongly on six benchmarks, improving PSNR by 0.5 dB while cutting computational complexity by two-thirds. Conclusion: The method clearly outperforms existing techniques in both performance and efficiency and is well suited to underwater image enhancement. Abstract: We present a novel dual-stream architecture that achieves state-of-the-art underwater image enhancement by explicitly integrating the Jaffe-McGlamery physical model with capsule clustering-based feature representation learning. Our method simultaneously estimates transmission maps and spatially-varying background light through a dedicated physics estimator while extracting entity-level features via capsule clustering in a parallel stream. This physics-guided approach enables parameter-free enhancement that respects underwater formation constraints while preserving semantic structures and fine-grained details. Our approach also features a novel optimization objective ensuring both physical adherence and perceptual quality across multiple spatial frequencies. To validate our approach, we conducted extensive experiments across six challenging benchmarks. Results demonstrate consistent improvements of +0.5 dB PSNR over the best existing methods while requiring only one-third of their computational complexity (FLOPs), or alternatively, more than +1 dB PSNR improvement when compared to methods with similar computational budgets. Code and data will be available at https://github.com/iN1k1/.

[43] Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

Shenshen Li,Kaiyuan Deng,Lei Wang,Hao Yang,Chong Peng,Peng Yan,Fumin Shen,Heng Tao Shen,Xing Xu

Main category: cs.CV

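TL;DR: By identifying high-value cognitive samples, the RAP method surpasses full-dataset performance using only 9.3% of the training data while reducing computational cost by 43%.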

Details Motivation: Multi-modal large language models are conventionally assumed to need large amounts of training data, which causes data redundancy and high computational cost. The paper challenges this assumption, arguing that a sparse set of high-value samples is enough to trigger effective multi-modal reasoning. Method: The authors propose the RAP paradigm, in which a Causal Discrepancy Estimator (CDE) and an Attention Confidence Estimator (ACE) identify cognitive samples and a Difficulty-aware Replacement Module (DRM) replaces trivial samples. Result: On six datasets, RAP achieves better performance with only 9.3% of the data and reduces computational cost by 43%. Conclusion: RAP demonstrates the potential of small, high-value datasets for multi-modal reasoning and suggests a new direction for efficient training. Abstract: While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, inevitably leading to data redundancy and substantial computational costs. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning using two complementary estimators: 1) the Causal Discrepancy Estimator (CDE), which is based on the potential outcome model principle and eliminates samples that overly rely on language priors by comparing outputs between multi-modal and text-only inputs; 2) the Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) to substitute trivial instances with cognitively challenging ones, thereby ensuring complexity for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%. Our code is available at https://github.com/Leo-ssl/RAP.

[44] Toward Better SSIM Loss for Unsupervised Monocular Depth Estimation

Yijun Cao,Fuya Luo,Yongjie Li

Main category: cs.CV

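TL;DR: The paper proposes a new form of SSIM that combines its components by addition rather than multiplication, improving training for unsupervised monocular depth learning.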

Details Motivation: Conventional methods ignore how the different components of the SSIM function and their hyperparameters affect training, which limits performance. Method: The authors propose a new form of SSIM that combines the luminance, contrast, and structural similarity components by addition instead of multiplication, and they tune the parameter combination. Result: On the KITTI-2015 dataset, the optimized SSIM loss clearly outperforms the baseline. Conclusion: The new SSIM form yields smoother gradients and improves unsupervised depth estimation. Abstract: Unsupervised monocular depth learning generally relies on the photometric relation among temporally adjacent images. Most previous works use both mean absolute error (MAE) and the structural similarity index measure (SSIM) in its conventional form as the training loss. However, they ignore the effect of the different components in the SSIM function and the corresponding hyperparameters on the training. To address these issues, this work proposes a new form of SSIM. Compared with the original SSIM function, the proposed new form uses addition rather than multiplication to combine the luminance, contrast, and structural similarity related components in SSIM. The loss function constructed with this scheme helps produce smoother gradients and achieve higher performance on unsupervised depth estimation. We conduct extensive experiments to determine the relatively optimal combination of parameters for our new SSIM. Based on the popular MonoDepth approach, the optimized SSIM loss function can remarkably outperform the baseline on the KITTI-2015 outdoor dataset.
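
To make the multiplication-versus-addition distinction concrete, here is a rough sketch. The three components follow the conventional SSIM definitions (luminance, contrast, structure). The additive combination, its equal weights, and the use of global statistics over a flat pixel list instead of local windows are all assumptions for illustration; the abstract does not give the authors' exact formula or parameters.

```python
import math
from typing import Sequence, Tuple


def ssim_components(x: Sequence[float], y: Sequence[float],
                    dynamic_range: float = 1.0) -> Tuple[float, float, float]:
    """Luminance, contrast and structure terms of conventional SSIM (global statistics)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    # Usual stabilising constants: C1 = (0.01 L)^2, C2 = (0.03 L)^2, C3 = C2 / 2.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    c3 = c2 / 2
    luminance = (2 * mu_x * mu_y + c1) / (mu_x * mu_x + mu_y * mu_y + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return luminance, contrast, structure


def ssim_multiplicative(x: Sequence[float], y: Sequence[float]) -> float:
    """Conventional form: product of the three components."""
    l, c, s = ssim_components(x, y)
    return l * c * s


def ssim_additive(x: Sequence[float], y: Sequence[float],
                  weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Hypothetical additive form: weighted sum of the components (weights are assumed)."""
    l, c, s = ssim_components(x, y)
    w_l, w_c, w_s = weights
    return w_l * l + w_c * c + w_s * s


patch_a = [0.1, 0.5, 0.9, 0.3]
patch_b = [0.2, 0.4, 0.8, 0.5]
print("multiplicative loss:", 1 - ssim_multiplicative(patch_a, patch_b))
print("additive loss:      ", 1 - ssim_additive(patch_a, patch_b))
```

One plausible reading of the "smoother gradients" claim is that in a product each component's gradient is scaled by the other two factors, whereas in a sum each term contributes independently.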

[45] HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition

Suhan Woo,Seongwon Lee,Jinwoo Jang,Euntai Kim

Main category: cs.CV

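TL;DR: HypeVPR is a hierarchical embedding framework in hyperbolic space for the perspective-to-equirectangular (P2E) problem in Visual Place Recognition (VPR). With hierarchical feature aggregation and a coarse-to-fine search strategy, it substantially improves retrieval speed and accuracy.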

Details Motivation: Real-world applications such as mobile robots must handle query images from many viewpoints, which makes P2E a natural choice, but existing methods do not fully exploit the hierarchical structure of panoramic images. Method: The authors propose HypeVPR, which uses hyperbolic space to represent hierarchical feature relationships and adopts a hierarchical feature aggregation mechanism and a coarse-to-fine search strategy. Result: HypeVPR outperforms existing methods on multiple benchmark datasets, with retrieval up to 5x faster. Conclusion: Through hierarchical representation in hyperbolic space and an efficient search strategy, HypeVPR substantially improves the performance and efficiency of P2E VPR. Abstract: When applying Visual Place Recognition (VPR) to real-world mobile robots and similar applications, the perspective-to-equirectangular (P2E) formulation naturally emerges as a suitable approach to accommodate diverse query images captured from various viewpoints. In this paper, we introduce HypeVPR, a novel hierarchical embedding framework in hyperbolic space, designed to address the unique challenges of P2E VPR. The key idea behind HypeVPR is that visual environments captured by panoramic views exhibit inherent hierarchical structures. To leverage this property, we employ hyperbolic space to represent hierarchical feature relationships and preserve distance properties within the feature space. To achieve this, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Additionally, HypeVPR adopts an efficient coarse-to-fine search strategy, optimally balancing speed and accuracy to ensure robust matching, even between descriptors from different image types. This approach enables HypeVPR to outperform state-of-the-art methods while significantly reducing retrieval time, achieving up to 5x faster retrieval across diverse benchmark datasets. The code and models will be released at https://github.com/suhan-woo/HypeVPR.git.
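
For readers unfamiliar with hyperbolic embeddings, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. Whether HypeVPR uses this model (rather than, say, the Lorentz model) is not stated in the abstract, so treat the choice of model and the function name `poincare_distance` as assumptions.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) * (a - b) for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


origin = (0.0, 0.0)
near_centre = (0.5, 0.0)
near_edge_a = (0.95, 0.0)
near_edge_b = (0.0, 0.95)

print(poincare_distance(origin, near_centre))       # equals ln(3), about 1.0986
print(poincare_distance(near_edge_a, near_edge_b))  # far larger than the Euclidean gap
```

Distances grow rapidly toward the boundary of the ball, which is why tree-like (coarse near the centre, fine near the edge) structures embed with low distortion in such spaces.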

[46] Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations

Gaia Di Lorenzo,Federico Tombari,Marc Pollefeys,Daniel Barath

Main category: cs.CV

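TL;DR: Object-X is a multi-modal 3D object representation framework that encodes rich object embeddings and decodes them into geometric and visual reconstructions, supporting a range of downstream tasks with very low storage requirements.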

Details Motivation: Existing methods usually rely on task-specific embeddings that cannot both be decoded into explicit geometry and be reused across tasks, so a more general multi-modal object representation framework is needed. Method: Object-X geometrically grounds the modalities in a 3D voxel grid and learns an unstructured embedding that fuses voxel information with object attributes, supporting 3D Gaussian Splatting-based reconstruction and downstream tasks. Result: On real-world datasets, Object-X achieves high-fidelity novel-view synthesis and improved geometric accuracy, with storage requirements 3-4 orders of magnitude lower than traditional approaches. Conclusion: Object-X is a scalable and practical solution for multi-modal 3D scene representation, with performance comparable to specialized methods. Abstract: Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g. images, point cloud, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding fusing the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves competitive performance with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.

[47] LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table

Yusuke Matsui

Main category: cs.CV

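TL;DR: LotusFilter is a post-processing module that diversifies approximate nearest neighbor search (ANNS) results, using a precomputed table of nearby vectors to delete redundant vectors quickly.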

Details Motivation: In some scenarios ANNS results need to be both similar to the query and diverse, but existing methods can return results that are too similar to each other. Method: A table of nearby vectors is precomputed, and during filtering redundant vectors are removed by greedy table lookups. Result: In settings resembling real-world RAG applications, LotusFilter runs fast (0.02 ms/query). Conclusion: LotusFilter is an efficient way to diversify ANNS results and is suitable for practical applications. Abstract: Approximate nearest neighbor search (ANNS) is an essential building block for applications like RAG but can sometimes yield results that are overly similar to each other. In certain scenarios, search results should be similar to the query and yet diverse. We propose LotusFilter, a post-processing module to diversify ANNS results. We precompute a cutoff table summarizing vectors that are close to each other. During the filtering, LotusFilter greedily looks up the table to delete redundant vectors from the candidates. We demonstrated that the LotusFilter operates fast (0.02 ms/query) in settings resembling real-world RAG applications, utilizing features such as OpenAI embeddings. Our code is publicly available at https://github.com/matsui528/lotf.
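
The precompute-then-greedily-filter idea can be sketched as follows. This is an illustration of the mechanism as the abstract describes it, not the author's implementation: the names `build_cutoff_table` and `diversify`, the squared-distance threshold `epsilon`, and the brute-force O(N²) table construction are all assumptions (a practical system would presumably build the table with a nearest-neighbor index).

```python
from typing import Dict, List, Sequence, Set

Vector = Sequence[float]


def build_cutoff_table(vectors: Sequence[Vector], epsilon: float) -> Dict[int, Set[int]]:
    """Offline step: for each vector id, store the ids whose squared distance is below epsilon."""
    table: Dict[int, Set[int]] = {i: set() for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            if sq_dist < epsilon:
                table[i].add(j)
                table[j].add(i)
    return table


def diversify(candidates: Sequence[int], table: Dict[int, Set[int]], k: int) -> List[int]:
    """Online step: greedily keep the closest surviving candidate and drop its near-duplicates.

    `candidates` must already be sorted by increasing distance to the query.
    """
    removed: Set[int] = set()
    kept: List[int] = []
    for idx in candidates:
        if idx in removed:
            continue
        kept.append(idx)
        removed |= table[idx]  # a table lookup replaces any distance computation here
        if len(kept) == k:
            break
    return kept


vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
table = build_cutoff_table(vectors, epsilon=1.0)
ann_result = [0, 1, 2, 3, 4]  # pretend ANNS output, sorted by distance to the query
print(diversify(ann_result, table, k=3))  # [0, 2, 4]
```

Because the online step only performs set lookups, its cost does not depend on the vector dimensionality, which is consistent with the sub-millisecond timing reported in the abstract.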

[48] SupeRANSAC: One RANSAC to Rule Them All

Daniel Barath

Main category: cs.CV

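TL;DR: SupeRANSAC is a novel unified RANSAC pipeline designed to improve the robustness and accuracy of geometric model estimation, outperforming existing methods.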

Details Motivation: Although RANSAC and its variants are the gold standard for geometric model estimation in computer vision, their performance is inconsistent across tasks and depends heavily on implementation details and problem-specific optimizations. Method: The author proposes SupeRANSAC, a unified RANSAC pipeline, and analyzes in detail the techniques that make it effective for specific vision tasks, including homography, fundamental/essential matrix, and absolute/rigid pose estimation. Result: SupeRANSAC significantly outperforms existing methods across multiple tasks and datasets, for example by 6 AUC points on average for fundamental matrix estimation. Conclusion: Through a unified pipeline and optimized techniques, SupeRANSAC achieves consistently high accuracy across tasks and provides a new tool for robust estimation in computer vision. Abstract: Robust estimation is a cornerstone in computer vision, particularly for tasks like Structure-from-Motion and Simultaneous Localization and Mapping. RANSAC and its variants are the gold standard for estimating geometric models (e.g., homographies, relative/absolute poses) from outlier-contaminated data. Despite RANSAC's apparent simplicity, achieving consistently high performance across different problems is challenging. While recent research often focuses on improving specific RANSAC components (e.g., sampling, scoring), overall performance is frequently more influenced by the "bells and whistles" (i.e., the implementation details and problem-specific optimizations) within a given library. Popular frameworks like OpenCV and PoseLib demonstrate varying performance, excelling in some tasks but lagging in others. We introduce SupeRANSAC, a novel unified RANSAC pipeline, and provide a detailed analysis of the techniques that make RANSAC effective for specific vision tasks, including homography, fundamental/essential matrix, and absolute/rigid pose estimation. SupeRANSAC is designed for consistent accuracy across these tasks, improving upon the best existing methods by, for example, 6 AUC points on average for fundamental matrix estimation. We demonstrate significant performance improvements over the state-of-the-art on multiple problems and datasets. Code: https://github.com/danini/superansac
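
As background for this entry, the following is a bare-bones textbook RANSAC loop (sample a minimal set, fit a model, count inliers, keep the best), applied to 2D line fitting. It deliberately contains none of the "bells and whistles" the abstract refers to and is not SupeRANSAC; the function names, threshold, and iteration count are illustrative assumptions.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Line = Tuple[float, float, float]  # (a, b, c) with a*x + b*y + c = 0 and a^2 + b^2 = 1


def fit_line(p: Point, q: Point) -> Line:
    """Minimal solver: the line through two distinct points, with a unit-length normal."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = math.hypot(a, b)
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)


def ransac_line(points: Sequence[Point], threshold: float, iterations: int,
                seed: int = 0) -> Tuple[Optional[Line], List[Point]]:
    """Return the line with the most inliers found in `iterations` random trials."""
    rng = random.Random(seed)
    best_model: Optional[Line] = None
    best_inliers: List[Point] = []
    for _ in range(iterations):
        p, q = rng.sample(list(points), 2)
        if p == q:  # degenerate sample (duplicate points), skip it
            continue
        a, b, c = fit_line(p, q)
        # Point-to-line distance is |a*x + b*y + c| because the normal has unit length.
        inliers = [(x, y) for x, y in points if abs(a * x + b * y + c) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b, c), inliers
    return best_model, best_inliers


# Ten points on y = 2x + 1 plus three gross outliers.
data: List[Point] = [(float(x), 2.0 * x + 1.0) for x in range(10)]
data += [(3.0, 20.0), (7.0, -5.0), (1.0, 12.0)]
model, inliers = ransac_line(data, threshold=0.1, iterations=200)
print(len(inliers), "inliers; slope is about", -model[0] / model[1])
```

Production pipelines typically add components such as smarter sampling, robust scoring, local optimization, and degeneracy checks on top of this loop, which is where, according to the abstract, most of the practical performance differences arise.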

[49] MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

Yuyi Zhang,Yongxin Shi,Peirong Zhang,Yixin Zhao,Zhenhua Yang,Lianwen Jin

Main category: cs.CV

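TL;DR: The paper introduces the MegaHan97K dataset, which supports the 87,887 Chinese character categories of the GB18030-2022 standard, is likely the largest OCR dataset by number of categories, addresses the long-tail distribution problem, and reveals new challenges.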

Details Motivation: The set of Chinese character categories is vast and still expanding, and existing OCR datasets cannot meet this need, which holds back cultural heritage preservation and digital applications. Method: The authors build the MegaHan97K dataset, which contains 97,455 Chinese character categories split into handwritten, historical, and synthetic subsets to balance the sample distribution. Result: The dataset supports the latest standard, has at least six times as many categories as existing datasets, addresses the long-tail problem, and reveals new challenges such as storage demands, recognition of morphologically similar characters, and zero-shot learning. Conclusion: MegaHan97K provides an important resource for OCR and pattern recognition and supports future research. Abstract: Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) It effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical and synthetic subsets; (3) Comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, morphologically similar character recognition, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes, not only in the field of OCR but possibly also in the broader domain of pattern recognition. The dataset is available at https://github.com/SCUT-DLVCLab/MegaHan97K.

[50] Spike-TBR: a Noise Resilient Neuromorphic Event Representation

Gabriele Magrini,Federico Becattini,Luca Cultrera,Lorenzo Berlincioni,Pietro Pala,Alberto Del Bimbo

Main category: cs.CV

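TL;DR: Spike-TBR is a new event-based encoding strategy that combines TBR with the noise-filtering ability of spiking neural networks to make event-stream representations more robust.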

Details Motivation: Conventional methods for converting event streams perform poorly under noise, so a more robust encoding strategy is needed. Method: Spike-TBR combines the frame-based advantages of TBR with the noise-filtering capability of spiking neural networks; four variants, each with a different spiking neuron, are evaluated. Result: It performs better on both noisy and clean data, giving a simple, noise-resilient solution for event-driven vision applications. Conclusion: Spike-TBR bridges the gap between spike-based and frame-based processing and offers an effective solution for event-driven vision applications. Abstract: Event cameras offer significant advantages over traditional frame-based sensors, including higher temporal resolution, lower latency, and higher dynamic range. However, efficiently converting event streams into formats compatible with standard computer vision pipelines remains a challenging problem, particularly in the presence of noise. In this paper, we propose Spike-TBR, a novel event-based encoding strategy based on Temporal Binary Representation (TBR), addressing its vulnerability to noise by integrating spiking neurons. Spike-TBR combines the frame-based advantages of TBR with the noise-filtering capabilities of spiking neural networks, creating a more robust representation of event streams. We evaluate four variants of Spike-TBR, each using different spiking neurons, across multiple datasets, demonstrating superior performance in noise-affected scenarios while improving the results on clean data. Our method bridges the gap between spike-based and frame-based processing, offering a simple noise-resilient solution for event-driven vision applications.
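
As a rough illustration of the idea in this abstract, the sketch below passes one pixel's per-bin event counts through a leaky integrate-and-fire neuron and then packs the resulting spikes into a single intensity, in the spirit of a temporal binary encoding. The leak factor, threshold, reset rule, and the division by 2^N are assumptions made for illustration only; the abstract does not specify the neuron models or the normalization the authors use.

```python
def lif_filter(bins, leak=0.5, threshold=1.5):
    """Pass per-bin event counts through a leaky integrate-and-fire neuron.

    Returns one bit per temporal bin: 1 if the neuron spiked, else 0.
    (leak and threshold are hypothetical values chosen for this sketch.)
    """
    potential = 0.0
    bits = []
    for count in bins:
        potential = leak * potential + count  # decay, then integrate input
        if potential >= threshold:
            bits.append(1)
            potential = 0.0  # reset the membrane after a spike
        else:
            bits.append(0)
    return bits


def binary_to_intensity(bits):
    """Read the bits as a binary number and scale it into [0, 1)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value / (1 << len(bits))


noisy_pixel = [1, 0, 0, 0]   # a single isolated event
active_pixel = [1, 1, 0, 2]  # sustained activity

noisy_intensity = binary_to_intensity(lif_filter(noisy_pixel))    # 0.0
active_intensity = binary_to_intensity(lif_filter(active_pixel))  # 0.3125
```

The isolated event never pushes the membrane potential over the threshold, so the noisy pixel encodes to zero, while the pixel with sustained activity keeps a nonzero value.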

[51] Fool the Stoplight: Realistic Adversarial Patch Attacks on Traffic Light Detectors

Svetlana Pavlitska,Jamie Robb,Nikolai Polley,Melih Yazgan,J. Marius Zöllner

Main category: cs.CV

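TL;DR: The paper shows how CNN-based traffic light detectors can be attacked with printed adversarial patches, proposes a threat model and a training strategy, and validates the attacks in real-world settings.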

Details Motivation: Existing work focuses mostly on other perception tasks of autonomous vehicles, and attacks on traffic light detection have received little attention; this work fills that gap. Method: A threat model places an adversarial patch under each traffic light, and a training strategy enables attacks in universal settings. Result: The experiments achieve targeted label-flipping attacks (e.g., red to green) and attacks on pictogram classification, and the attacks are validated in real-world settings. Conclusion: The study demonstrates the threat that adversarial patches pose to traffic light detection and provides a reference for research on defenses. Abstract: Realistic adversarial attacks on various camera-based perception tasks of autonomous vehicles have been successfully demonstrated so far. However, only a few works considered attacks on traffic light detectors. This work shows how CNNs for traffic light detection can be attacked with printed patches. We propose a threat model, where each instance of a traffic light is attacked with a patch placed under it, and describe a training strategy. We demonstrate successful adversarial patch attacks in universal settings. Our experiments show realistic targeted red-to-green label-flipping attacks and attacks on pictogram classification. Finally, we perform a real-world evaluation with printed patches and demonstrate attacks in lab settings with a mobile traffic light for construction sites and in a test area with stationary traffic lights. Our code is available at https://github.com/KASTEL-MobilityLab/attacks-on-traffic-light-detection.

[52] DualX-VSR: Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation

Shuo Cao,Yihao Liu,Xiaohui Li,Yuanting Gao,Yu Zhou,Chao Dong

Main category: cs.CV

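TL;DR: DualX-VSR introduces a dual axial spatial-temporal attention mechanism that addresses the limitations of conventional Transformer models in video super-resolution and achieves high performance without motion compensation.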

Details Motivation: Existing Transformer models perform poorly on video super-resolution because of the need for pixel-level precision and the dependence on motion compensation; a more effective way to model spatiotemporal information is needed. Method: DualX-VSR uses a dual axial spatial-temporal attention mechanism that integrates spatial and temporal information along orthogonal directions, removing the need for motion compensation. Result: DualX-VSR achieves high fidelity and superior performance on real-world video super-resolution. Conclusion: Through its dual axial attention mechanism, DualX-VSR simplifies the architecture and improves performance, offering an effective solution for video super-resolution. Abstract: Transformer-based models like ViViT and TimeSformer have advanced video understanding by effectively modeling spatiotemporal dependencies. Recent video generation models, such as Sora and Vidu, further highlight the power of transformers in long-range feature extraction and holistic spatiotemporal modeling. However, directly applying these models to real-world video super-resolution (VSR) is challenging, as VSR demands pixel-level precision, which can be compromised by tokenization and sequential attention mechanisms. While recent transformer-based VSR models attempt to address these issues using smaller patches and local attention, they still face limitations such as restricted receptive fields and dependence on optical flow-based alignment, which can introduce inaccuracies in real-world settings. To overcome these issues, we propose Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution (DualX-VSR), which introduces a novel dual axial spatial$\times$temporal attention mechanism that integrates spatial and temporal information along orthogonal directions. DualX-VSR eliminates the need for motion compensation, offering a simplified structure that provides a cohesive representation of spatiotemporal information. As a result, DualX-VSR achieves high fidelity and superior performance in real-world VSR tasks.

[53] OpenMaskDINO3D: Reasoning 3D Segmentation via Large Language Model

Kunshen Zhang

Main category: cs.CV

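TL;DR: OpenMaskDINO3D is an LLM-based framework for 3D understanding and segmentation that processes point clouds and text prompts to produce instance segmentation masks, filling the gap in 3D reasoning segmentation.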

Details Motivation: 2D perception systems can already segment by reasoning about implicit user intent, but no comparable framework exists for 3D. This paper aims to fill that gap and enable 3D segmentation from natural language instructions. Method: A SEG token and an object identifier are introduced and combined with point cloud data and text prompts to directly generate high-precision 3D segmentation masks. Result: Experiments on the ScanNet dataset validate the effectiveness of OpenMaskDINO3D across a variety of tasks. Conclusion: OpenMaskDINO3D offers an efficient solution for 3D reasoning segmentation and supports natural-language-driven 3D segmentation. Abstract: Although perception systems have made remarkable advancements in recent years, particularly in 2D reasoning segmentation, these systems still rely on explicit human instruction or pre-defined categories to identify target objects before executing visual recognition tasks. Such systems have matured significantly, demonstrating the ability to reason and comprehend implicit user intentions in two-dimensional contexts, producing accurate segmentation masks based on complex and implicit query text. However, a comparable framework and structure for 3D reasoning segmentation remain absent. This paper introduces OpenMaskDINO3D, an LLM designed for comprehensive 3D understanding and segmentation. OpenMaskDINO3D processes point cloud data and text prompts to produce instance segmentation masks, excelling in many 3D tasks. By introducing a SEG token and object identifier, we achieve high-precision 3D segmentation mask generation, enabling the model to directly produce accurate point cloud segmentation results from natural language instructions. Experimental results on large-scale ScanNet datasets validate the effectiveness of our OpenMaskDINO3D across various tasks.

[54] Geological Field Restoration through the Lens of Image Inpainting

Vladislav Trifonov,Ivan Oseledets,Ekaterina Muravleva

Main category: cs.CV

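TL;DR: Proposes a method for reconstructing geological fields from sparse observations based on the low-rank structure of a multidimensional tensor; it outperforms ordinary kriging.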

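Details Motivation: Drawing inspiration from deterministic image inpainting techniques, the work addresses the reconstruction of multidimensional geological fields from sparse observations. Method: Combines tensor completion with geostatistics in an optimization framework that enforces a global low-rank structure. Result: In experiments on synthetic geological fields, the tensor completion method significantly outperforms ordinary kriging across different fractions of observed data. Conclusion: The method provides a more accurate and robust solution for geological field reconstruction. Abstract: We present a new viewpoint on reconstructing multidimensional geological fields from sparse observations. Drawing inspiration from deterministic image inpainting techniques, we model a partially observed spatial field as a multidimensional tensor and recover missing values by enforcing a global low-rank structure. Our approach combines ideas from tensor completion and geostatistics, providing a robust optimization framework. Experiments on synthetic geological fields demonstrate that the tensor completion method used yields significant improvements in reconstruction accuracy over ordinary kriging for various percentages of observed data.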
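
To make "recover missing values by enforcing a global low-rank structure" concrete, here is a minimal sketch of the simplest special case: a 2D grid assumed to be exactly rank 1, completed by alternating least squares over the observed cells only. This is an illustrative stand-in, not the authors' method; the function name `complete_rank1`, the rank-1 assumption, and the toy grid are all hypothetical.

```python
def complete_rank1(grid, iters=200):
    """Fill None entries of a 2D grid with a rank-1 fit u[i] * v[j].

    Alternating least squares: fix v and solve each u[i] on the observed
    cells of row i, then fix u and solve each v[j] on column j.
    Assumes every row and column has at least one observed cell.
    """
    rows, cols = len(grid), len(grid[0])
    u = [1.0] * rows
    v = [1.0] * cols
    for _ in range(iters):
        for i in range(rows):
            obs = [j for j in range(cols) if grid[i][j] is not None]
            u[i] = sum(grid[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        for j in range(cols):
            obs = [i for i in range(rows) if grid[i][j] is not None]
            v[j] = sum(grid[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]


# Toy field equal to the outer product of [1, 2, 3] and [1, 2, 4],
# with two cells hidden (true values 4.0 and 3.0).
observed = [
    [1.0, 2.0, None],
    [2.0, 4.0, 8.0],
    [None, 6.0, 12.0],
]
filled = complete_rank1(observed)
```

Real fields are higher-dimensional and only approximately low-rank, so a practical solver would use a tensor factorization with a chosen rank and regularization rather than this exact-fit toy.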

[55] Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking

Yu-Feng Chen,Tzuhsuan Huang,Pin-Yen Chiu,Jun-Cheng Chen

Main category: cs.CV

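TL;DR: The paper proposes a new backdoor attack framework that embeds invisible triggers into the image editing process via poisoned training data, using deep watermarking models to keep the attack stealthy.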

Details Motivation: Existing research mostly targets backdoor attacks on image generation; backdoor attacks on image editing are less studied and mostly use visible triggers, which are impractical. Method: Off-the-shelf deep watermarking models encode imperceptible watermarks as backdoor triggers, which are embedded into the image editing process via poisoned training data. Result: Experiments across different watermarking models show relatively high attack success rates, and an analysis of watermark characteristics further supports the method's effectiveness. Conclusion: The proposed method performs well for backdoor attacks on image editing and offers a new approach to stealthy attacks. Abstract: Diffusion models have achieved remarkable progress in both image generation and editing. However, recent studies have revealed their vulnerability to backdoor attacks, in which specific patterns embedded in the input can manipulate the model's behavior. Most existing research in this area has proposed attack frameworks focused on the image generation pipeline, leaving backdoor attacks in image editing relatively unexplored. Among the few studies targeting image editing, most utilize visible triggers, which are impractical because they introduce noticeable alterations to the input image before editing. In this paper, we propose a novel attack framework that embeds invisible triggers into the image editing process via poisoned training data. We leverage off-the-shelf deep watermarking models to encode imperceptible watermarks as backdoor triggers. Our goal is to make the model produce the predefined backdoor target when it receives watermarked inputs, while editing clean images normally according to the given prompt. With extensive experiments across different watermarking models, the proposed method achieves promising attack success rates. In addition, the analysis of watermark characteristics in terms of backdoor attacks further supports the effectiveness of our approach. The code is available at https://github.com/aiiu-lab/BackdoorImageEditing

[56] Learning to Plan via Supervised Contrastive Learning and Strategic Interpolation: A Chess Case Study

Andrew Hamara,Greg Hamerly,Pablo Rivas,Andrew C. Freeman

Main category: cs.CV

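TL;DR: The paper proposes an intuition-driven planning approach that trains a Transformer encoder with contrastive learning to embed board states into a latent space, enabling move selection without deep search.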

Details Motivation: Modern chess engines rely on deep tree search and regressive evaluation, whereas human players use intuition to pick candidate moves and then validate them with a shallow search. The paper aims to model this intuition-driven planning process. Method: A Transformer encoder is trained with supervised contrastive learning to embed board states into a latent space structured by positional evaluation, in which distance reflects evaluative similarity. Result: Using only a 6-ply beam search, the model reaches an estimated Elo rating of 2593, and performance improves with model size and embedding dimensionality. Conclusion: Latent-space planning may be a viable alternative to traditional search, and the method can be generalized to other perfect-information games. Abstract: Modern chess engines achieve superhuman performance through deep tree search and regressive evaluation, while human players rely on intuition to select candidate moves followed by a shallow search to validate them. To model this intuition-driven planning process, we train a transformer encoder using supervised contrastive learning to embed board states into a latent space structured by positional evaluation. In this space, distance reflects evaluative similarity, and visualized trajectories display interpretable transitions between game states. We demonstrate that move selection can occur entirely within this embedding space by advancing toward favorable regions, without relying on deep search. Despite using only a 6-ply beam search, our model achieves an estimated Elo rating of 2593. Performance improves with both model size and embedding dimensionality, suggesting that latent planning may offer a viable alternative to traditional search. Although we focus on chess, the proposed embedding-based planning method can be generalized to other perfect-information games where state evaluations are learnable. All source code is available at https://github.com/andrewhamara/SOLIS.
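
The phrase "advancing toward favorable regions" of the embedding space can be pictured with a one-step sketch: embed the position reached by each candidate move and keep the move whose embedding points most toward a reference "favorable" direction. The 2D vectors, the move names, and the use of cosine similarity are assumptions for illustration; the paper's encoder, similarity measure, and 6-ply beam search are not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_move(candidates, goal):
    """Return the move whose resulting-position embedding best matches goal.

    candidates maps a move label to the (hypothetical) embedding of the
    position reached after playing that move.
    """
    return max(candidates, key=lambda move: cosine(candidates[move], goal))


goal = [1.0, 0.0]  # stand-in for the "winning" direction in latent space
candidates = {
    "e4": [0.9, 0.1],
    "a3": [0.2, 0.8],
    "h4": [-0.5, 0.5],
}
best = pick_move(candidates, goal)  # "e4"
```

A beam search would repeat this scoring over several plies, keeping the few best lines at each depth instead of a single greedy move.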

[57] From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang,Zhuofan Zhang,Ziyu Zhu,Yue Fan,Jing Xiong,Pengxiang Li,Xiaojian Ma,Qing Li

Main category: cs.CV

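TL;DR: Anywhere3D-Bench is a holistic 3D visual grounding benchmark covering four grounding levels; it reveals significant weaknesses of current models on space-level and part-level tasks.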

Details Motivation: To explore visual grounding beyond the object level in 3D scenes and fill a gap in existing research. Method: Introduces the Anywhere3D-Bench benchmark and evaluates a range of state-of-the-art 3D visual grounding methods and large language models on it. Result: Space-level and part-level tasks are the hardest: the best model, OpenAI o4-mini, reaches only 23.57% accuracy on space-level tasks and 33.94% on part-level tasks. Conclusion: Current models fall significantly short in holistic understanding of and reasoning about 3D scenes and need further improvement. Abstract: 3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.57% accuracy on space-level tasks and 33.94% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.

[58] Generating Synthetic Stereo Datasets using 3D Gaussian Splatting and Expert Knowledge Transfer

Filip Slezak,Magnus K. Gjerde,Joakim B. Haurum,Ivan Nikolov,Morten S. Laursen,Thomas B. Moeslund

Main category: cs.CV

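TL;DR: Proposes a 3D Gaussian Splatting (3DGS)-based method for generating stereo datasets as an alternative to NeRF-based methods, combining geometric reconstruction with FoundationStereo depth estimates to improve zero-shot generalization.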

Details Motivation: To explore low-cost, high-fidelity dataset generation and improve fast fine-tuning of stereo models. Method: Datasets are generated with 3DGS, using reconstructed geometry and FoundationStereo depth estimates in an expert knowledge transfer setup. Result: Models fine-tuned on 3DGS-generated datasets perform competitively in zero-shot generalization; the reconstructed geometry is noisy, whereas FoundationStereo depth estimates are cleaner and give better performance. Conclusion: The 3DGS approach shows clear potential for low-cost dataset generation and fast fine-tuning, but its robustness in complex scenes still needs improvement. Abstract: In this paper, we introduce a 3D Gaussian Splatting (3DGS)-based pipeline for stereo dataset generation, offering an efficient alternative to Neural Radiance Fields (NeRF)-based methods. To obtain useful geometry estimates, we explore utilizing the reconstructed geometry from the explicit 3D representations as well as depth estimates from the FoundationStereo model in an expert knowledge transfer setup. We find that stereo models fine-tuned on 3DGS-generated datasets achieve competitive performance on zero-shot generalization benchmarks. When using the reconstructed geometry directly, we observe that it is often noisy and contains artifacts, which propagate noise to the trained model. In contrast, we find that the disparity estimates from FoundationStereo are cleaner and consequently result in better performance on the zero-shot generalization benchmarks. Our method highlights the potential for low-cost, high-fidelity dataset creation and fast fine-tuning for deep stereo models. Moreover, we also reveal that while the latest Gaussian Splatting based methods have achieved superior performance on established benchmarks, their robustness falls short in challenging in-the-wild settings, warranting further exploration.

[59] Light and 3D: a methodological exploration of digitisation techniques adapted to a selection of objects from the Musée d'Archéologie Nationale

Antoine Laurent,Jean Mélou,Catherine Schwab,Rolande Simon-Millot,Sophie Féret,Thomas Sagory,Carole Fritz,Jean-Denis Durou

Main category: cs.CV

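TL;DR: The article examines the diversity of photographic 3D digitization methods for cultural heritage and stresses that no single method suits every object; the tool should be chosen according to the object's characteristics and the intended future use.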

Details Motivation: The need to digitize cultural heritage is widespread, but existing methods are diverse and there is no unified standard, so the choice must be adapted to each object. Method: Using objects from the collections of the musée d'Archéologie nationale as case studies, the article analyzes several photographic 3D digitization methods and stresses the need to combine the views of heritage and digital specialists. Result: Each method must be tailored to the object's characteristics and intended use; an absolute classification is not possible. Conclusion: Cultural heritage digitization calls for a case-by-case choice of tools rather than a one-size-fits-all approach, with attention to actual needs. Abstract: The need to digitize heritage objects is now widely accepted. This article presents the very fashionable context of the creation of "digital twins". It illustrates the diversity of photographic 3D digitization methods, but this is not its only objective. Using a selection of objects from the collections of the musée d'Archéologie nationale, it shows that no single method is suitable for all cases. Rather, the method to be recommended for a given object should be the result of a concerted choice between those involved in heritage and those involved in the digital domain, as each new object may require the adaptation of existing tools. It would therefore be pointless to attempt an absolute classification of 3D digitization methods. On the contrary, we need to find the digital tool best suited to each object, taking into account not only its characteristics, but also the future use of its digital twin.

[60] CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx

Lukas Picek,Elisa Belotti,Michal Bojda,Ludek Bufka,Vojtech Cermak,Martin Dula,Rostislav Dvorak,Luboslav Hrdy,Miroslav Jirik,Vaclav Kocourek,Josefa Krausova,Jirí Labuda,Jakub Straka,Ludek Toman,Vlado Trulík,Martin Vana,Miroslav Kutal

Main category: cs.CV

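TL;DR: CzechLynx is a large-scale, open-access dataset for individual identification, 2D pose estimation, and instance segmentation of the Eurasian lynx; it contains real and synthetic images and defines three evaluation protocols.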

Details Motivation: To provide the first large-scale dataset for Eurasian lynx research, supporting individual identification, pose estimation, and segmentation, and to address data scarcity and limited diversity. Method: The dataset contains more than 30k real camera trap images covering 219 individuals and 15 years of monitoring, plus more than 100k synthetic images generated with Unity and diffusion models. Three evaluation protocols are defined: geo-aware, time-aware open-set, and time-aware closed-set. Result: The dataset provides rich annotations (segmentation masks, identity labels, 20-point skeletons) and supports testing generalization across spatial and temporal domains. Conclusion: CzechLynx is intended as a tool for benchmarking and for developing new methods, not only for individual animal re-identification. Abstract: We introduce CzechLynx, the first large-scale, open-access dataset for individual identification, 2D pose estimation, and instance segmentation of the Eurasian lynx (Lynx lynx). CzechLynx includes more than 30k camera trap images annotated with segmentation masks, identity labels, and 20-point skeletons and covers 219 unique individuals across 15 years of systematic monitoring in two geographically distinct regions: Southwest Bohemia and the Western Carpathians. To increase the data variability, we create a complementary synthetic set with more than 100k photorealistic images generated via a Unity-based pipeline and diffusion-driven text-to-texture modeling, covering diverse environments, poses, and coat-pattern variations. To allow testing generalization across spatial and temporal domains, we define three tailored evaluation protocols/splits: (i) geo-aware, (ii) time-aware open-set, and (iii) time-aware closed-set. This dataset is intended to be instrumental in benchmarking state-of-the-art models and in developing novel methods, not only for individual animal re-identification.

[61] Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining

Yong Sun,Yipeng Wang,Junyu Shi,Zhiyuan Zhang,Yanmei Xiao,Lei Zhu,Manxi Jiang,Qiang Nie

Main category: cs.CV

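TL;DR: Proposes a video-based embryo grading task that predicts embryo quality from full-length time-lapse monitoring videos, and designs the Complementary Spatial-Temporal Pattern Mining (CoSTeM) framework to mimic embryologists' evaluation process.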

Details Motivation: Current embryo selection approaches are limited: they either lack a holistic quality assessment or are confounded by extra-embryonic factors, so a new approach that predicts embryo quality directly from video data is needed. Method: The CoSTeM framework has a morphological branch (selection of local structural features) and a morphokinetic branch (modeling of global developmental trajectories), combining static and dynamic features for grading. Result: Experiments show that CoSTeM outperforms existing methods, providing an effective tool for AI-assisted embryo selection. Conclusion: The work provides a new methodological framework for embryo selection; the dataset and source code will be made public. Abstract: Artificial intelligence has recently shown promise in automated embryo selection for In-Vitro Fertilization (IVF). However, current approaches either address partial embryo evaluation lacking holistic quality assessment or target clinical outcomes inevitably confounded by extra-embryonic factors, both limiting clinical utility. To bridge this gap, we propose a new task called Video-Based Embryo Grading - the first paradigm that directly utilizes full-length time-lapse monitoring (TLM) videos to predict embryologists' overall quality assessments. To support this task, we curate a real-world clinical dataset comprising over 2,500 TLM videos, each annotated with a grading label indicating the overall quality of embryos. Grounded in clinical decision-making principles, we propose a Complementary Spatial-Temporal Pattern Mining (CoSTeM) framework that conceptually replicates embryologists' evaluation process. The CoSTeM comprises two branches: (1) a morphological branch using a Mixture of Cross-Attentive Experts layer and a Temporal Selection Block to select discriminative local structural features, and (2) a morphokinetic branch employing a Temporal Transformer to model global developmental trajectories, synergistically integrating static and dynamic determinants for grading embryos. Extensive experimental results demonstrate the superiority of our design. This work provides a valuable methodological framework for AI-assisted embryo selection. The dataset and source code will be publicly available upon acceptance.

[62] Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations

Igor Meleshin,Anna Chistyakova,Anastasia Antsiferova,Dmitriy Vatolin

Main category: cs.CV

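TL;DR: The paper proposes achieving robustness in image quality assessment models through design rather than training, using orthogonal information flow and architectural adjustments to resist adversarial attacks.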

Details Motivation: Conventional data-driven defenses such as adversarial training may not be the best route to robust IQA models; the authors argue that robustness should be achieved through design. Method: A robust IQA architecture is built by enforcing orthogonal information flow, constraining the network to norm-preserving operations, and applying pruning and fine-tuning. Result: The method markedly improves resistance to adversarial attacks without adversarial training or significant changes to the original model. Conclusion: The paper advocates a shift from optimizing robustness through data to engineering it through design, offering a new perspective on IQA robustness. Abstract: Image Quality Assessment (IQA) models are increasingly relied upon to evaluate image quality in real-world systems -- from compression and enhancement to generation and streaming. Yet their adoption brings a fundamental risk: these models are inherently unstable. Adversarial manipulations can easily fool them, inflating scores and undermining trust. Traditionally, such vulnerabilities are addressed through data-driven defenses -- adversarial retraining, regularization, or input purification. But what if this is the wrong lens? What if robustness in perceptual models is not something to learn but something to design? In this work, we propose a provocative idea: robustness as an architectural prior. Rather than training models to resist perturbations, we reshape their internal structure to suppress sensitivity from the ground up. We achieve this by enforcing orthogonal information flow, constraining the network to norm-preserving operations -- and further stabilizing the system through pruning and fine-tuning. The result is a robust IQA architecture that withstands adversarial attacks without requiring adversarial training or significant changes to the original model. This approach suggests a shift in perspective: from optimizing robustness through data to engineering it through design.
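
Why norm-preserving operations limit sensitivity can be checked numerically: an orthogonal linear map cannot enlarge the gap between a clean input and a perturbed one, while an unconstrained map can. The sketch below uses a 2D rotation as the orthogonal layer and a plain scaling as the unconstrained one; both layers and all values are illustrative assumptions, not the architecture proposed in the paper.

```python
import math


def rotate(theta, v):
    """Apply a 2D rotation, which is an orthogonal (norm-preserving) map."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def scale(factor, v):
    """Apply a uniform scaling, which is not norm-preserving for factor != 1."""
    return [factor * v[0], factor * v[1]]


def norm(v):
    return math.hypot(v[0], v[1])


def gap(a, b):
    """Euclidean distance between two 2D vectors."""
    return norm([a[0] - b[0], a[1] - b[1]])


x = [3.0, 4.0]                       # clean input
delta = [0.01, -0.02]                # small adversarial perturbation
x_adv = [x[0] + delta[0], x[1] + delta[1]]

orthogonal_gap = gap(rotate(0.7, x_adv), rotate(0.7, x))   # same size as delta
amplified_gap = gap(scale(10.0, x_adv), scale(10.0, x))    # ten times larger
```

Stacking orthogonal layers keeps this property, since a product of orthogonal maps is itself orthogonal; the nonlinearities between layers would need their own Lipschitz bound for the argument to carry through a full network.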

[63] APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

Hong Gao,Yiming Bao,Xuezhan Tu,Bin Zhong,Minling Zhang

Main category: cs.CV

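TL;DR: The APVR framework uses hierarchical visual information retrieval to overcome computational limits in video understanding, handling hour-long videos without training.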

Details Motivation: Existing video multimodal large language models struggle with hour-level video understanding because of computational constraints and inefficient information extraction from long temporal sequences. Method: APVR has two components: Pivot Frame Retrieval identifies key frames through semantic expansion and multi-modal confidence scoring, and Pivot Token Retrieval performs query-aware, attention-driven token selection within those frames. Result: Experiments on LongVideoBench and VideoMME show significant performance gains, with state-of-the-art results among both training-free and training-based methods. Conclusion: APVR offers an efficient, training-free solution for video understanding that is compatible with existing MLLM architectures. Abstract: Current video-based multimodal large language models struggle with hour-level video understanding due to computational constraints and inefficient information extraction from extensive temporal sequences. We propose APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that addresses the memory wall limitation through hierarchical visual information retrieval. APVR operates via two complementary components: Pivot Frame Retrieval employs semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware attention-driven token selection within the pivot frames. This dual granularity approach enables processing of hour-long videos while maintaining semantic fidelity. Experimental validation on LongVideoBench and VideoMME demonstrates significant performance improvements, establishing state-of-the-art results for not only training-free but also training-based approaches while providing plug-and-play integration capability with existing MLLM architectures.
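
The two retrieval granularities described above can be sketched as two rounds of top-k selection: first over per-frame relevance scores, then over per-token scores inside the frames that survive. The scores below are made-up numbers and `top_k` is a hypothetical helper; how APVR actually computes its confidence and attention scores is not described in enough detail here to reproduce.

```python
def top_k(scores, k):
    """Return the indices of the k largest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical query-relevance score for each of six frames.
frame_scores = [0.1, 0.8, 0.3, 0.9, 0.2, 0.7]

# Hypothetical query-aware attention score for each token of each frame.
token_scores = [
    [0.2, 0.1],
    [0.6, 0.3],
    [0.1, 0.1],
    [0.2, 0.9],
    [0.5, 0.5],
    [0.4, 0.8],
]

# Stage 1: keep the three most relevant frames (the "pivot" frames).
pivot_frames = top_k(frame_scores, 3)  # [1, 3, 5]

# Stage 2: inside each pivot frame, keep only the best-scoring token.
pivot_tokens = {f: top_k(token_scores[f], 1) for f in pivot_frames}
# {1: [0], 3: [1], 5: [1]}
```

Only the selected tokens would then be passed to the language model, which is what keeps the context length manageable for long videos.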

[64] FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

Huihan Wang,Zhiwen Yang,Hui Zhang,Dan Zhao,Bingzheng Wei,Yan Xu

Main category: cs.CV

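TL;DR: FEAT is an efficient full-dimensional attention Transformer that tackles dynamic medical video synthesis with spatial-temporal-channel attention, a linear-complexity design, and a residual value guidance module, outperforming existing methods.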

Details Motivation: Dynamic medical video synthesis must model both spatial consistency and temporal dynamics, and existing Transformer approaches fall short in channel interaction, computational complexity, and noise handling. Method: FEAT uses spatial-temporal-channel attention, a linear-complexity design, and a residual value guidance module to model global dependencies efficiently and adapt to noise at a fine-grained level. Result: FEAT-S, with only 23% of Endora's parameters, performs comparably or better; FEAT-L surpasses all compared methods on multiple datasets. Conclusion: FEAT is efficient and scalable for dynamic medical video synthesis and offers a new solution for the field. Abstract: Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.

[65] Bringing SAM to new heights: Leveraging elevation data for tree crown segmentation from drone imagery

Mélisande Teng,Arthur Ouaknine,Etienne Laliberté,Yoshua Bengio,David Rolnick,Hugo Larochelle

Main category: cs.CV

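TL;DR: Compares methods that use the Segment Anything Model (SAM) for tree crown instance segmentation in drone imagery and proposes BalSAM, a model that incorporates Digital Surface Model (DSM) information and outperforms other methods in specific settings.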

Details Motivation: Conventional forest monitoring is costly and time-consuming; drone remote sensing and computer vision offer potential for mapping individual trees at large scale. Method: SAM-based methods are compared across three forest types, the integration of DSM data is studied, and the BalSAM model is proposed. Result: BalSAM performs well in specific settings such as plantations, but out-of-the-box SAM does not outperform a custom Mask R-CNN. Conclusion: Tuning SAM end-to-end and integrating DSM data are both promising ways to improve tree crown instance segmentation models. Abstract: Information on trees at the individual level is crucial for monitoring forest ecosystems and planning forest management. Current monitoring methods involve ground measurements, requiring extensive cost, time and labor. Advances in drone remote sensing and computer vision offer great potential for mapping individual trees from aerial imagery at broad scale. Large pre-trained vision models, such as the Segment Anything Model (SAM), represent a particularly compelling choice given limited labeled data. In this work, we compare methods leveraging SAM for the task of automatic tree crown instance segmentation in high resolution drone imagery in three use cases: 1) boreal plantations, 2) temperate forests and 3) tropical forests. We also study the integration of elevation data into models, in the form of Digital Surface Model (DSM) information, which can readily be obtained at no additional cost from RGB drone imagery. We present BalSAM, a model leveraging SAM and DSM information, which shows potential over other methods, particularly in the context of plantations. We find that methods using SAM out-of-the-box do not outperform a custom Mask R-CNN, even with well-designed prompts. However, efficiently tuning SAM end-to-end and integrating DSM information are both promising avenues for tree crown instance segmentation models.

[66] TextVidBench: A Benchmark for Long Video Scene Text Understanding

Yangyang Zhong,Ji Qi,Yuan Yao,Pengxin Luo,Yunfeng Yan,Donglian Qi,Zhiyuan Liu,Tat-Seng Chua

Main category: cs.CV

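TL;DR: The paper introduces TextVidBench, the first benchmark designed specifically for long-video text question answering, addressing the limited video duration and narrow evaluation scope of existing datasets.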

Details Motivation: Existing short-video Text-Visual Question Answering (ViteVQA) datasets are limited in video duration and evaluation scope and cannot adequately assess the capabilities of multimodal large language models (MLLMs). Method: 1) cross-domain long-video coverage; 2) a three-stage evaluation framework; 3) high-quality fine-grained annotations. The paper also proposes ways to improve large models, including the IT-Rope mechanism, non-uniform positional encoding, and lightweight fine-tuning. Result: TextVidBench poses significant challenges to existing models, and the proposed method offers valuable insights into improving long-video scene text understanding. Conclusion: TextVidBench fills the gap in long-video text question answering benchmarks, and the proposed method offers an effective route to better long-video understanding. Abstract: Despite recent progress on the short-video Text-Visual Question Answering (ViteVQA) task - largely driven by benchmarks such as M4-ViteVQA - existing datasets still suffer from limited video duration and narrow evaluation scopes, making it difficult to adequately assess the growing capabilities of powerful multimodal large language models (MLLMs). To address these limitations, we introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions: 1) Cross-domain long-video coverage: Spanning 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. 2) A three-stage evaluation framework: "Text Needle-in-Haystack -> Temporal Grounding -> Text Dynamics Captioning". 3) High-quality fine-grained annotations: Containing over 5,000 question-answer pairs with detailed semantic labeling. Furthermore, we propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on video-text data. Extensive experiments on multiple public datasets as well as TextVidBench demonstrate that our new benchmark presents significant challenges to existing models, while our proposed method offers valuable insights into improving long-video scene text understanding capabilities.

[67] Multi-scale Image Super Resolution with a Single Auto-Regressive Model

Enrique Sanchez,Isma Hadji,Adrian Bulat,Christos Tzelepis,Brais Martinez,Georgios Tzimiropoulos

Main category: cs.CV

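TL;DR: The paper proposes an image super-resolution (ISR) method based on Visual Auto-Regressive (VAR) modeling that uses multi-scale image tokenization and preference optimization to remove the fixed-resolution limitation and large-model dependence of existing methods.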

Details Motivation: The existing VARSR method supports only a fixed resolution and depends on a large model and dataset, which limits its applicability. This paper aims to improve flexibility and efficiency through better tokenization and optimization. Method: Two new components: 1) a multi-scale image tokenization approach that enforces token overlap across scales; 2) a loss term based on Direct Preference Optimization (DPO) that encourages the model to produce high-resolution tokens. Result: The model denoises and super-resolves in a single forward pass and achieves state-of-the-art results with a small model (300M parameters) and no external training data. Conclusion: Multi-scale tokenization and preference optimization significantly improve the performance and flexibility of VAR on ISR. Abstract: In this paper we tackle Image Super Resolution (ISR), using recent advances in Visual Auto-Regressive (VAR) modeling. VAR iteratively estimates the residual in latent space between gradually increasing image scales, a process referred to as next-scale prediction. Thus, the strong priors learned during pre-training align well with the downstream task (ISR). To our knowledge, only VARSR has exploited this synergy so far, showing promising results. However, due to the limitations of existing residual quantizers, VARSR works only at a fixed resolution, i.e. it fails to map intermediate outputs to the corresponding image scales. Additionally, it relies on a 1B transformer architecture (VAR-d24), and leverages a large-scale private dataset to achieve state-of-the-art results. We address these limitations through two novel components: a) a Hierarchical Image Tokenization approach with a multi-scale image tokenizer that progressively represents images at different scales while simultaneously enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the LR and HR tokenizations, encourages the transformer to produce the latter over the former. To the best of our knowledge, this is the first time a quantizer is trained to force semantically consistent residuals at different scales, and the first time that preference-based optimization is used to train a VAR. Using these two components, our model can denoise the LR image and super-resolve at half and full target upscale factors in a single forward pass. Additionally, we achieve state-of-the-art results on ISR, while using a small model (300M params vs ~1B params of VARSR), and without using external training data.

[68] PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

Edoardo Bianchi,Antonio Liotta

Main category: cs.CV

TL;DR: PATS is a new video sampling strategy that preserves complete fundamental movements to improve the accuracy of multi-view skill assessment.

Details Motivation: Current video sampling methods break the temporal continuity that skill assessment requires, making it harder to distinguish expert from novice performance. Method: PATS adaptively segments videos so that each analyzed portion contains the full execution of critical movements, and repeats this across multiple segments to maximize information coverage. Result: On the EgoExo4D benchmark, PATS outperforms existing methods in all viewing configurations (+0.65% to +3.05%) and does particularly well in challenging domains (e.g., +26.22% on bouldering). Conclusion: PATS adapts to the characteristics of different activities and is an effective adaptive temporal sampling method that advances automated skill assessment for real-world applications. Abstract: Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses the state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics, from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills, demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications.
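
The contrast between scattered frame sampling and sampling "continuous temporal segments" can be illustrated with a small index generator: the video is split into equal spans and a contiguous window of frames is taken from the middle of each span. This is a simplified, fixed-size variant written for illustration; the function name, the centering rule, and the parameters are assumptions, and the adaptive part of PATS (choosing segment counts and lengths per activity) is not modeled.

```python
def segment_indices(num_frames, num_segments, frames_per_segment):
    """Return one contiguous run of frame indices per segment.

    The video is divided into num_segments equal spans and a window of
    frames_per_segment consecutive frames is centered inside each span.
    """
    stride = num_frames // num_segments
    segments = []
    for s in range(num_segments):
        # Center the window in its span; never start before the span does.
        offset = max(0, (stride - frames_per_segment) // 2)
        start = s * stride + offset
        end = min(start + frames_per_segment, num_frames)
        segments.append(list(range(start, end)))
    return segments


windows = segment_indices(num_frames=30, num_segments=3, frames_per_segment=4)
# [[3, 4, 5, 6], [13, 14, 15, 16], [23, 24, 25, 26]]
```

Each window keeps neighboring frames together, so a short movement that falls inside it is seen in full rather than as isolated snapshots.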

[69] Beyond Cropped Regions: New Benchmark and Corresponding Baseline for Chinese Scene Text Retrieval in Diverse Layouts

Gengluo Li,Huawen Shen,Yu Zhou

Main category: cs.CV

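TL;DR: The paper proposes CSTR-CLIP, a new model for Chinese scene text retrieval that substantially improves performance through global visual information and multi-granularity alignment training.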

Details Motivation: Existing methods are mostly built on English scene text retrieval and struggle with the complex layouts of Chinese text, so a dedicated Chinese solution is needed. Method: The CSTR-CLIP model is proposed; it uses two-stage training that combines global visual information with multi-granularity alignment to handle diverse text layouts. Result: On an existing benchmark, CSTR-CLIP improves accuracy by 18.82% over the previous best model and offers faster inference. Conclusion: CSTR-CLIP performs well across diverse text layouts; the dataset and code will be released publicly to facilitate research. Abstract: Chinese scene text retrieval is a practical task that aims to search for images containing visual instances of a Chinese query text. This task is extremely challenging because Chinese text often features complex and diverse layouts in real-world scenes. Current efforts tend to inherit the solution for English scene text retrieval, failing to achieve satisfactory performance. In this paper, we establish a Diversified Layout benchmark for Chinese Street View Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval performance across various text layouts, including vertical, cross-line, and partial alignments. To address the limitations in existing methods, we propose Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates global visual information with multi-granularity alignment training. CSTR-CLIP applies a two-stage training process to overcome previous limitations, such as the exclusion of visual features outside the text region and reliance on single-granularity alignment, thereby enabling the model to effectively handle diverse text layouts. Experiments on an existing benchmark show that CSTR-CLIP outperforms the previous state-of-the-art model by 18.82% accuracy and also provides faster inference speed. Further analysis on DL-CSVTR confirms the superior performance of CSTR-CLIP in handling various text layouts. The dataset and code will be publicly available to facilitate research in Chinese scene text retrieval.

[70] Structure-Aware Radar-Camera Depth Estimation

Fuyi Zhang,Zhu Yu,Chunhao Li,Runmin Zhang,Xiaokai Bai,Zili Zhou,Si-Yuan Cao,Wang Wang,Hui-Liang Shen

Main category: cs.CV

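TL;DR: The paper discusses progress in monocular depth estimation, focusing on deep-learning-based methods and the challenge of generalizing to unseen domains.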

Details Motivation: Monocular depth estimation aims to determine the depth of each pixel from an RGB image captured by a monocular camera; advances in deep learning have driven progress in this field. Method: The paper reviews several approaches, including multi-scale fusion networks, recasting the regression task as a classification problem, introducing additional priors, and more effective objective functions. Result: Despite this progress, generalizing to unseen domains remains challenging. Depth Anything leads in zero-shot monocular depth estimation, but the lack of explicit depth cues limits its ability to estimate accurate metric depth. Conclusion: Monocular depth estimation excels at extracting structural information but still needs improvement in generalization and accurate metric depth estimation. Abstract: Monocular depth estimation aims to determine the depth of each pixel from an RGB image captured by a monocular camera. The development of deep learning has significantly advanced this field by facilitating the learning of depth features from some well-annotated datasets \cite{Geiger_Lenz_Stiller_Urtasun_2013,silberman2012indoor}. Eigen et al. \cite{eigen2014depth} first introduced a multi-scale fusion network for depth regression. Following this, subsequent improvements have come from reinterpreting the regression task as a classification problem \cite{bhat2021adabins,Li_Wang_Liu_Jiang_2022}, incorporating additional priors \cite{shao2023nddepth,yang2023gedepth}, and developing more effective objective functions \cite{xian2020structure,Yin_Liu_Shen_Yan_2019}. Despite these advances, generalizing to unseen domains remains a challenge. Recently, several methods have employed affine-invariant loss to enable multi-dataset joint training \cite{MiDaS,ZeroDepth,guizilini2023towards,Dany}. Among them, Depth Anything \cite{Dany} has shown leading performance in zero-shot monocular depth estimation. While it struggles to estimate accurate metric depth due to the lack of explicit depth cues, it excels at extracting structural information from unseen images, producing structure-detailed monocular depth.
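
The affine-invariant loss mentioned above can be illustrated with one common formulation: each depth map is normalized by subtracting its median and dividing by its mean absolute deviation from that median, and the loss is the mean absolute difference of the normalized maps. The sketch below assumes this formulation and flat lists of depth values; the function names are hypothetical, and the cited methods may differ in detail.

```python
import statistics


def normalize_depth(depth):
    """Remove shift (median) and scale (mean absolute deviation) from a depth map."""
    shift = statistics.median(depth)
    scale = sum(abs(d - shift) for d in depth) / len(depth)
    if scale == 0:
        raise ValueError("depth map is constant; scale is undefined")
    return [(d - shift) / scale for d in depth]


def affine_invariant_loss(pred, target):
    """Mean absolute error between shift- and scale-normalized depth maps."""
    p = normalize_depth(pred)
    g = normalize_depth(target)
    return sum(abs(a - b) for a, b in zip(p, g)) / len(p)


if __name__ == "__main__":
    target = [1.0, 2.0, 4.0, 8.0]
    # A prediction that differs from the target only by a positive scale and a shift.
    pred = [3.0 * d + 5.0 for d in target]
    print(affine_invariant_loss(pred, target))
```

Because the loss ignores a positive scale and a shift, datasets with different depth units or unknown offsets can be mixed in training, which is also why such a model recovers structure but not metric depth.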

[71] Point Cloud Segmentation of Agricultural Vehicles using 3D Gaussian Splatting

Alfred T. Christiansen,Andreas H. Højrup,Morten K. Stephansen,Md Ibtihaj A. Sakib,Taman S. Poojary,Filip Slezak,Morten S. Laursen,Thomas B. Moeslund,Joakim B. Haurum

Main category: cs.CV

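TL;DR: Proposes a pipeline that uses 3D Gaussian Splatting and Gaussian Opacity Fields to generate synthetic data for 3D point cloud semantic segmentation, avoiding the high cost of annotating real data.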

Details Motivation: Acquiring and annotating real point cloud data is costly and time-consuming, so a low-cost, efficient way to generate synthetic data is needed. Method: 3D Gaussian Splatting (3DGS) and Gaussian Opacity Fields (GOF) are combined to generate 3D assets of agricultural vehicles, and point clouds are generated with a simulated LiDAR in a simulated environment. Result: Models trained only on synthetic data (such as PTv3) reach an mIoU of 91.35%, in some cases even outperforming models trained on real data. Conclusion: Synthetic data can replace real data for training point cloud segmentation models, performs better in some cases, and generalizes across semantic classes. Abstract: Training neural networks for tasks such as 3D point cloud semantic segmentation demands extensive datasets, yet obtaining and annotating real-world point clouds is costly and labor-intensive. This work aims to introduce a novel pipeline for generating realistic synthetic data, by leveraging 3D Gaussian Splatting (3DGS) and Gaussian Opacity Fields (GOF) to generate 3D assets of multiple different agricultural vehicles instead of using generic models. These assets are placed in a simulated environment, where the point clouds are generated using a simulated LiDAR. This is a flexible approach that allows changing the LiDAR specifications without incurring additional costs. We evaluated the impact of synthetic data on segmentation models such as PointNet++, Point Transformer V3, and OACNN, by training and validating the models only on synthetic data. Remarkably, the PTv3 model had an mIoU of 91.35%, a noteworthy result given that the model had neither been trained nor validated on any real data. Further studies even suggested that in certain scenarios the models trained only on synthetically generated data performed better than models trained on real-world data. Finally, experiments demonstrated that the models can generalize across semantic classes, enabling accurate predictions on mesh models they were never trained on.
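
For readers unfamiliar with the reported metric, mean intersection-over-union (mIoU) averages the per-class ratio TP / (TP + FP + FN) over the classes that occur. The sketch below computes it for per-point labels; the function name and the choice to skip classes absent from both labels and predictions are assumptions, and the paper's evaluation protocol may differ.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over classes that appear in the labels or the predictions."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / union)
    if not ious:
        raise ValueError("no class is present in labels or predictions")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Four points, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```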

[72] UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery using Gaussian Splatting

Jaehoon Choi,Dongki Jung,Christopher Maxey,Yonghan Lee,Sungmin Eum,Dinesh Manocha,Heesung Kwon

Main category: cs.CV

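TL;DR: The UAV4D framework addresses dynamic scene reconstruction from monocular UAV video, combining a 3D foundation model and a human mesh reconstruction model with a method for resolving scene scale ambiguity to achieve high-quality rendering.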

Details Motivation: Existing dynamic neural rendering methods are not optimized for monocular UAV video scenarios (e.g. top-down views, multiple moving pedestrians), and relevant datasets are lacking. Method: A 3D foundation model and a human mesh reconstruction model are combined; the SMPL model and a background mesh are used to initialize Gaussian splats, and the scene scale ambiguity is resolved. Result: Tested on three UAV datasets, the method improves PSNR by 1.5 dB and gives better visual sharpness than existing methods. Conclusion: UAV4D performs well in dynamic UAV scene rendering and offers a new approach to reconstruction from monocular video. Abstract: Despite significant advancements in dynamic neural rendering, existing methods fail to address the unique challenges posed by UAV-captured scenarios, particularly those involving monocular camera setups, top-down perspective, and multiple small, moving humans, which are not adequately represented in existing datasets. In this work, we introduce UAV4D, a framework for enabling photorealistic rendering for dynamic real-world scenes captured by UAVs. Specifically, we address the challenge of reconstructing dynamic scenes with multiple moving pedestrians from monocular video data without the need for additional sensors. We use a combination of a 3D foundation model and a human mesh reconstruction model to reconstruct both the scene background and humans. We propose a novel approach to resolve the scene scale ambiguity and place both humans and the scene in world coordinates by identifying human-scene contact points. Additionally, we exploit the SMPL model and background mesh to initialize Gaussian splats, enabling holistic scene rendering. We evaluated our method on three complex UAV-captured datasets: VisDrone, Manipal-UAV, and Okutama-Action, each with distinct characteristics and 10 to 50 humans. Our results demonstrate the benefits of our approach over existing methods in novel view synthesis, achieving a 1.5 dB PSNR improvement and superior visual sharpness.
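
To put the reported 1.5 dB gain in context, the standard definition PSNR = 10 * log10(MAX^2 / MSE) implies that, for a single image, a gain of g dB corresponds to the mean squared error shrinking by a factor of 10^(g/10). The sketch below works this out; the function names are illustrative, and because benchmark PSNR is usually averaged over many images, the factor is only indicative.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db (single image)."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    ratio = mse_ratio_for_gain(1.5)
    print(f"A 1.5 dB gain divides the MSE by about {ratio:.2f}")
    baseline_mse = 50.0  # arbitrary illustrative value
    print(psnr(baseline_mse / ratio) - psnr(baseline_mse))
```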

[73] Physical Annotation for Automated Optical Inspection: A Concept for In-Situ, Pointer-Based Trainingdata Generation

Oliver Krumpek,Oliver Heimann,Jörg Krüger

Main category: cs.CV

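TL;DR: Proposes a novel physical annotation system for generating training data for automated optical inspection; pointer-based interaction captures physical trajectories directly, improving annotation efficiency and accuracy.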

Details Motivation: Conventional screen-based annotation is inefficient and unintuitive and cannot make full use of inspectors' expertise. The system aims to turn expert experience directly into machine learning training data through physical interaction. Method: Calibrated, tracked pointers record user input, a projected interface provides visual guidance, and the spatial interactions are converted into standardized annotation formats. Result: A preliminary evaluation confirms the system's feasibility: it captures detailed annotation trajectories, and integration with CVAT streamlines subsequent machine learning tasks. Conclusion: The system bridges the gap between human experts and automated data generation, makes it possible for non-IT experts to take part in machine learning training, and improves the accuracy and consistency of data annotation. Abstract: This paper introduces a novel physical annotation system designed to generate training data for automated optical inspection. The system uses pointer-based in-situ interaction to transfer the valuable expertise of trained inspection personnel directly into a machine learning (ML) training pipeline. Unlike conventional screen-based annotation methods, our system captures physical trajectories and contours directly on the object, providing a more intuitive and efficient way to label data. The core technology uses calibrated, tracked pointers to accurately record user input and transform these spatial interactions into standardised annotation formats that are compatible with open-source annotation software. Additionally, a simple projector-based interface projects visual guidance onto the object to assist users during the annotation process, ensuring greater accuracy and consistency. The proposed concept bridges the gap between human expertise and automated data generation, enabling non-IT experts to contribute to the ML training pipeline and preventing the loss of valuable training samples. Preliminary evaluation results confirm the feasibility of capturing detailed annotation trajectories and demonstrate that integration with CVAT streamlines the workflow for subsequent ML tasks. This paper details the system architecture, calibration procedures and interface design, and discusses its potential contribution to future ML data generation for automated optical inspection.

[74] FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

Guangzhao Li,Yanming Yang,Chenxi Song,Chi Zhang

Main category: cs.CV

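TL;DR: FlowDirector proposes an inversion-free video editing framework that uses an ODE to guide the video's evolution directly in data space, preserving temporal consistency and structural detail.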

Details Motivation: To address the temporal inconsistency and structural degradation caused by existing inversion-based video editing methods. Method: An ODE evolves the video directly in data space, combined with an attention-guided masking mechanism and a guidance-enhanced editing strategy. Result: State-of-the-art performance in instruction adherence, temporal consistency, and background preservation. Conclusion: FlowDirector offers a new paradigm for efficient and coherent video editing. Abstract: Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.
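
The combination of an ODE in data space with a mask that modulates the velocity field can be sketched with an explicit Euler integrator. Everything here is an assumption made for illustration: the flattened list representation, the element-wise product of mask and velocity, the step count, and `velocity_fn`, which stands in for whatever edit velocity a pre-trained model would provide. The guidance-enhanced strategy from the abstract is not modeled.

```python
def masked_euler_edit(x0, velocity_fn, mask, num_steps=10):
    """Integrate dx/dt = mask * v(x, t) from t=0 to t=1 with explicit Euler.

    x0          : flattened video values (list of floats)
    velocity_fn : hypothetical callable (x, t) -> list of per-element velocities
    mask        : per-element weights in [0, 1]; 0 freezes a non-target element
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        # Elements with mask 0 receive no update, so they stay equal to x0.
        x = [xi + dt * mi * vi for xi, mi, vi in zip(x, mask, v)]
    return x


if __name__ == "__main__":
    # Toy velocity that pushes every element toward the value 1.0.
    toward_one = lambda x, t: [1.0 - xi for xi in x]
    print(masked_euler_edit([0.0, 0.0, 0.0], toward_one, mask=[1.0, 0.0, 0.5]))
```

In this toy run the second element never changes, which mirrors the stated goal of preserving non-target regions while the rest of the video follows the edit trajectory.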

[75] A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions

Anh Le,Thanh Lam,Dung Nguyen

Main category: cs.CV

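TL;DR: This paper surveys the state of Vietnamese document analysis and recognition (DAR), discusses the challenges facing traditional OCR and deep learning methods, and considers the potential of large language models (LLMs) and vision-language models.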

Details Motivation: Vietnamese text recognition faces unique challenges due to complex diacritics, tonal variations, and the lack of large-scale annotated datasets, and new techniques are needed. Method: Surveys existing Vietnamese document recognition techniques, analyzes the limitations of traditional OCR and deep learning, and explores the use of LLMs and vision-language models. Result: LLMs and vision-language models show promise in text recognition and document understanding, but domain adaptation, multimodal learning, and computational efficiency remain open problems. Conclusion: Future research directions include dataset development, model optimization, and the integration of multimodal approaches to advance Vietnamese DAR. Abstract: Vietnamese document analysis and recognition (DAR) is a crucial field with applications in digitization, information retrieval, and automation. Despite advancements in OCR and NLP, Vietnamese text recognition faces unique challenges due to its complex diacritics, tonal variations, and lack of large-scale annotated datasets. Traditional OCR methods often struggle with real-world document variations, while deep learning approaches have shown promise but remain limited by data scarcity and generalization issues. Recently, large language models (LLMs) and vision-language models have demonstrated remarkable improvements in text recognition and document understanding, offering a new direction for Vietnamese DAR. However, challenges such as domain adaptation, multimodal learning, and computational efficiency persist. This survey provides a comprehensive review of existing techniques in Vietnamese document recognition, highlights key limitations, and explores how LLMs can revolutionize the field. We discuss future research directions, including dataset development, model optimization, and the integration of multimodal approaches for improved document intelligence. By addressing these gaps, we aim to foster advancements in Vietnamese DAR and encourage community-driven solutions.

[76] SeedEdit 3.0: Fast and High-Quality Generative Image Editing

Peng Wang,Yichun Shi,Xiaochen Lian,Zhonghua Zhai,Xin Xia,Xuefeng Xiao,Weilin Huang,Jianchao Yang

Main category: cs.CV

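TL;DR: SeedEdit 3.0, paired with the T2I model Seedream 3.0, substantially improves edit instruction following and image content preservation, with further gains from an enhanced data curation pipeline and a joint learning pipeline.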

Details Motivation: To improve how well existing image editing tools follow instructions and preserve content, and to make data scaling and model cooperation more efficient. Method: 1. An enhanced data curation pipeline with a meta-info paradigm and embedding strategy; 2. a joint learning pipeline that combines a diffusion loss and a reward loss. Result: In real image editing tests, SeedEdit 3.0 achieves a high usability rate of 56.1%, ahead of earlier versions and competing products. Conclusion: Through the enhanced data pipeline and joint learning, SeedEdit 3.0 substantially improves the practicality and quality of image editing. Abstract: We introduce SeedEdit 3.0, a companion to our T2I model Seedream 3.0 [22], which significantly improves over our previous version [27] in both edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. In addition to upgrading the model with T2I, in this report, we present several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and meta information is helpful for connecting the VLM with the diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and a reward loss. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks for real image editing, where it achieves the best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT4o (37.1%) and Gemini 2.0 (30.3%).

[77] Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics

HaoTian Lan

Main category: cs.CV

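TL;DR: The study proposes a Multimodal Street Evaluation Framework (MSEF) that combines vision and language models to assess both objective features and subjective perceptions of streetscapes, and validates it on a dataset from Harbin, China.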

Details Motivation: Traditional street metrics cannot adequately capture subjective perception, which is essential to inclusive urban design; a method that fuses objective data with subjective evaluation is therefore needed. Method: MSEF is built from a vision transformer (VisualGLM-6B) and a large language model (GPT-4) and fine-tuned parameter-efficiently with LoRA and P-Tuning v2. The dataset contains over 15,000 street-view images from Harbin. Result: The model reaches an F1 score of 0.84 on objective features and 89.3% agreement with resident perceptions. The framework also reveals context-dependent contradictions and nonlinear patterns. Conclusion: MSEF offers methodological innovation for urban perception modeling and a practical tool for planning systems that balances infrastructural precision with residents' experience. Abstract: While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy while simultaneously reducing pedestrian comfort. It also identifies nonlinear and semantically contingent patterns -- such as the divergent perceptual effects of architectural transparency across residential and commercial zones -- revealing the limits of universal spatial heuristics. By generating natural-language rationales grounded in attention mechanisms, the framework bridges sensory data with socio-affective inference, enabling transparent diagnostics aligned with SDG 11. This work offers both methodological innovation in urban perception modeling and practical utility for planning systems seeking to reconcile infrastructural precision with lived experience.

[78] FG 2025 TrustFAA: the First Workshop on Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)

Jiaee Cheong,Yang Liu,Harold Soh,Hatice Gunes

Main category: cs.CV

TL;DR: The workshop addresses the trustworthiness of Emotion AI facial affect analysis (FAA) tools, focusing on fairness, explainability, and safety.

Details Motivation: As Emotion AI facial affect analysis tools are deployed more widely, concerns about their trustworthiness have grown, and challenges such as interpretability, uncertainty, bias, and privacy need to be addressed. Method: A workshop brings together researchers to discuss trustworthiness in FAA tasks, including macro/micro-expression recognition, facial action unit detection, and related applications. Result: The workshop supports FG2025's ethics goals and promotes research and discussion on trustworthy FAA. Conclusion: The workshop aims to foster research on trustworthiness in FAA and to provide ethical and practical guidance for future technology development. Abstract: With the increasing prevalence and deployment of Emotion AI-powered facial affect analysis (FAA) tools, concerns about the trustworthiness of these systems have become more prominent. This first workshop on "Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)" aims to bring together researchers who are investigating different challenges in relation to trustworthiness (such as interpretability, uncertainty, biases, and privacy) across various facial affect analysis tasks, including macro/micro-expression recognition, facial action unit detection, other corresponding applications such as pain and depression detection, as well as human-robot interaction and collaboration. In alignment with FG2025's emphasis on ethics, as demonstrated by the inclusion of an Ethical Impact Statement requirement for this year's submissions, this workshop supports FG2025's efforts by encouraging research, discussion and dialogue on trustworthy FAA.

[79] Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers

Haosong Liu,Yuge Cheng,Zihan Liu,Aiyue Chen,Yiwu Yao,Chen Chen,Jingwen Leng,Yu Feng,Minyi Guo

Main category: cs.CV

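TL;DR: ASTRAEA is an automatic framework that substantially speeds up inference of video diffusion transformers through lightweight token selection and an efficient sparse attention strategy while preserving generation quality.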

Details Motivation: Video diffusion transformers (vDiTs) perform well in text-to-video generation but are computationally demanding, and existing acceleration methods rely on heuristics, which limits their applicability. Method: A lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy are proposed, together with an evolutionary algorithm that automatically optimizes the allocation of the token budget. Result: Inference is up to 2.4x faster on a single GPU and up to 13.2x faster on 8 GPUs, with minimal loss in video quality (<0.5% loss on the VBench score). Conclusion: ASTRAEA substantially accelerates vDiT inference while keeping generation quality high, and has potential for practical deployment. Abstract: Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).
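
Token selection under a per-timestep budget can be pictured as keeping only the highest-scoring tokens at each denoising step. The sketch below is a generic top-k selection, not ASTRAEA's mechanism: the abstract does not say how token importance is computed, so `scores` and the example `budgets` are hypothetical inputs, and the evolutionary search that would choose the budgets is omitted.

```python
def select_tokens(scores, budget):
    """Return the indices of the `budget` highest-scoring tokens, in original order."""
    # Keep at least one token and never more than are available.
    k = max(1, min(len(scores), budget))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so downstream attention sees tokens in sequence.
    return sorted(ranked[:k])


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.4, 0.7]  # hypothetical per-token importance
    budgets = [4, 3, 2]            # hypothetical token budget per timestep
    for step, budget in enumerate(budgets):
        print(step, select_tokens(scores, budget))
```

If attention cost scales with the number of kept tokens, shrinking the budget at some timesteps reduces compute roughly in proportion, which is the trade-off a budget search would explore.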

[80] DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models

Revant Teotia,Candace Ross,Karen Ullrich,Sumit Chopra,Adriana Romero-Soriano,Melissa Hall,Matthew J. Muckley

Main category: cs.CV

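TL;DR: The paper proposes the DIM-CIM framework for reference-free measurement of the diversity and generalization of text-to-image models, and finds with the COCO-DIMCIM benchmark that scaling up models can come at the cost of default-mode diversity.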

Details Motivation: Existing evaluation methods either depend on reference image datasets or lack specificity about the kind of diversity measured, which limits their adaptability and interpretability. Method: The DIM-CIM framework is introduced, and the COCO-DIMCIM benchmark is used to evaluate models' default-mode diversity and generalization capacity. Result: Scaling models from 1.5B to 8.1B parameters can improve generalization at the cost of default-mode diversity; DIM-CIM also identifies fine-grained failure cases. Conclusion: DIM-CIM provides a flexible and interpretable framework for evaluating text-to-image models and supports a more complete understanding of model performance. Abstract: Recent advances in text-to-image (T2I) models have achieved impressive quality and consistency. However, this has come at the cost of representation diversity. While automatic evaluation methods exist for benchmarking model diversity, they either require reference image datasets or lack specificity about the kind of diversity measured, limiting their adaptability and interpretability. To address this gap, we introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity ("Does" the model generate images with expected attributes?) and generalization capacity ("Can" the model generate diverse attributes for a particular concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO concepts and captions and augmented by a large language model. With COCO-DIMCIM, we find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. DIMCIM also identifies fine-grained failure cases, such as attributes that are generated with generic prompts but are rarely generated when explicitly requested. Finally, we use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity. Our work provides a flexible and interpretable framework for assessing T2I model diversity and generalization, enabling a more comprehensive understanding of model performance.

[81] Practical Manipulation Model for Robust Deepfake Detection

Benedikt Hopf,Radu Timofte

Main category: cs.CV

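TL;DR: The paper proposes a Practical Manipulation Model (PMM) that substantially improves the robustness and performance of deepfake detection by expanding the space of pseudo-fakes and adding stronger degradations to training images.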

Details Motivation: Existing deepfake detection models are unstable under non-ideal conditions and easy to circumvent; a degradation model closer to real-world scenarios is needed. Method: PMM is developed, expanding the space of pseudo-fakes with Poisson blending, more diverse masks, generator artifacts, and distractors, and adding strong degradations to the training images. Result: AUC improves by 3.51% and 6.21% on the DFDC and DFDCP datasets, respectively, and model robustness improves substantially. Conclusion: PMM effectively addresses the robustness problems of existing detectors and achieves clear performance gains on standard datasets. Abstract: Modern deepfake detection models have achieved strong performance even on the challenging cross-dataset task. However, detection performance under non-ideal conditions remains very unstable, limiting success on some benchmark datasets and making it easy to circumvent detection. Inspired by the move to a more real-world degradation model in the area of image super-resolution, we have developed a Practical Manipulation Model (PMM) that covers a larger set of possible forgeries. We extend the space of pseudo-fakes by using Poisson blending, more diverse masks, generator artifacts, and distractors. Additionally, we improve the detectors' generality and robustness by adding strong degradations to the training images. We demonstrate that these changes not only significantly enhance the model's robustness to common image degradations but also improve performance on standard benchmark datasets. Specifically, we show clear increases of 3.51% and 6.21% AUC on the DFDC and DFDCP datasets, respectively, over the state-of-the-art LAA backbone. Furthermore, we highlight the lack of robustness in previous detectors and our improvements in this regard. Code can be found at https://github.com/BenediktHopf/PMM

[82] CIVET: Systematic Evaluation of Understanding in VLMs

Massimo Rizzoli,Simone Alghisi,Olha Khomyn,Gabriel Roccabruna,Seyed Mahed Mousavi,Giuseppe Riccardi

Main category: cs.CV

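TL;DR: The CIVET framework systematically evaluates how well vision-language models (VLMs) understand scene structure and semantics, finding that current models are limited in recognizing object properties, depend on object position, struggle with relations, and remain far from human-level performance.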

Details Motivation: To study how well VLMs understand scene structure and semantics and to fill gaps in existing evaluation methods. Method: The CIVET framework is proposed; it systematically evaluates five state-of-the-art VLMs with controlled stimuli, excluding noise and bias. Result: VLMs recognize only a limited set of basic properties, their performance depends on object position, they struggle to understand relations among objects, and they fall short of humans. Conclusion: VLMs still have significant limitations in scene understanding and need further improvement to approach human-level performance. Abstract: While Vision-Language Models (VLMs) have achieved competitive performance in various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate the understanding of VLMs, we study their capability regarding object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli. CIVET addresses the lack of standardized systematic evaluation for assessing VLMs' understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of achieving human-level accuracy.

[83] FRED: The Florence RGB-Event Drone Dataset

Gabriele Magrini,Niccolò Marini,Federico Becattini,Lorenzo Berlincioni,Niccolò Biondi,Pietro Pala,Alberto Del Bimbo

Main category: cs.CV

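TL;DR: The paper introduces the FRED dataset, designed for drone detection, tracking, and trajectory forecasting; it combines RGB video and event streams and contains 7 hours of densely annotated data.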

Details Motivation: Traditional RGB cameras perform poorly with fast-moving objects and difficult lighting. Event cameras have advantages here, but existing benchmarks lack fine temporal resolution or drone-specific motion patterns. Method: The Florence RGB-Event Drone dataset (FRED) is presented; it contains multimodal data (RGB and event streams) and covers multiple drone models and challenging scenarios such as rain and adverse lighting. Result: FRED provides 7 hours of densely annotated data, detailed evaluation protocols, and standard metrics, supporting reproducible benchmarking. Conclusion: FRED is expected to advance research in high-speed drone perception and multimodal spatiotemporal understanding. Abstract: Small, fast, and lightweight drones present significant challenges for traditional RGB cameras due to their limitations in capturing fast-moving objects, especially under challenging lighting conditions. Event cameras offer an ideal solution, providing high temporal definition and dynamic range, yet existing benchmarks often lack fine temporal resolution or drone-specific motion patterns, hindering progress in these areas. This paper introduces the Florence RGB-Event Drone dataset (FRED), a novel multimodal dataset specifically designed for drone detection, tracking, and trajectory forecasting, combining RGB video and event streams. FRED features more than 7 hours of densely annotated drone trajectories, using 5 different drone models and including challenging scenarios such as rain and adverse lighting conditions. We provide detailed evaluation protocols and standard metrics for each task, facilitating reproducible benchmarking. The authors hope FRED will advance research in high-speed drone perception and multimodal spatiotemporal understanding.

[84] Through-the-Wall Radar Human Activity Recognition WITHOUT Using Neural Networks

Weicheng Gao

Main category: cs.CV

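TL;DR: The paper proposes a through-the-wall radar (TWR) human activity recognition method that uses no neural networks and is based on physically interpretable signal processing.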

Details Motivation: Current TWR human activity recognition relies heavily on training neural networks, whereas early template-matching methods offered stronger physical interpretability. The author wants to return to that original path and challenge the performance of neural network models. Method: Range-time and Doppler-time maps of the TWR are generated first; corner detection determines the target foreground and noise background; a multiphase active contour model segments the micro-Doppler signature, which is discretized into a two-dimensional point cloud; finally, the Mapper algorithm computes the topological similarity between this point cloud and the template data. Result: The method's effectiveness is verified through numerical simulations and measured experiments. Conclusion: The method shows that TWR human activity recognition is feasible without neural networks, and open-source code is provided. Abstract: After a few years of research in the field of through-the-wall radar (TWR) human activity recognition (HAR), I found that we seem to be stuck in the mindset of training neural network models on radar image data. The earliest related works in this field, based on template matching, did not require a training process, and I believe they have never died, because these methods possess strong physical interpretability and are closer to the basis of theoretical signal processing research. In this paper, I would like to return to the original path by eschewing neural networks for the TWR HAR task, aiming to match the intelligent recognition achieved by neural network models. In detail, the range-time map and Doppler-time map of TWR are first generated. Then, the initial regions of the human target foreground and noise background on the maps are determined using a corner detection method, and the micro-Doppler signature is segmented using the multiphase active contour model. The micro-Doppler segmentation feature is discretized into a two-dimensional point cloud. Finally, the topological similarity between the resulting point cloud and the point clouds of the template data is calculated using the Mapper algorithm to obtain the recognition results. The effectiveness of the proposed method is demonstrated by numerically simulated and measured experiments. The open-source code of this work is released at: https://github.com/JoeyBGOfficial/Through-the-Wall-Radar-Human-Activity-Recognition-Without-Using-Neural-Networks.

[85] Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline

Yuzhi Huang,Chenxin Li,Haitao Zhang,Zixu Lin,Yunlong Lin,Hengyu Liu,Wuyang Li,Xinyu Liu,Jiechao Gao,Yue Huang,Xinghao Ding,Yixuan Yuan

Main category: cs.CV

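TL;DR: TAO is a new video anomaly detection framework that tracks anomalous objects at the pixel level, enabling finer-grained detection without threshold tuning and showing higher accuracy and robustness in experiments.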

Details Motivation: Existing methods focus mainly on anomalous frames or objects and neglect finer-grained analysis such as anomalous pixels, which limits the range of anomalies they can detect. TAO aims to address this. Method: TAO proposes a granular video anomaly detection pipeline that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified framework, tracking anomalous objects at the pixel level and linking to segmentation and tracking tasks. Result: Experiments show that TAO sets new benchmarks in accuracy and robustness. Conclusion: Through pixel-level tracking and task integration, TAO achieves more precise anomaly localization and suits complex video sequences. Abstract: Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalous objects in videos -- either by identifying anomalous frames or objects -- they often neglect finer-grained analysis, such as anomalous pixels, which limits their ability to capture a broader range of anomalies. To address this challenge, we propose a new framework called Track Any Anomalous Object (TAO), which introduces a granular video anomaly detection pipeline that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified framework. Unlike methods that assign anomaly scores to every pixel, our approach transforms the problem into pixel-level tracking of anomalous objects. By linking anomaly scores to downstream tasks such as segmentation and tracking, our method removes the need for threshold tuning and achieves more precise anomaly localization in long and complex video sequences. Experiments demonstrate that TAO sets new benchmarks in accuracy and robustness. Project page available online.

[86] Single GPU Task Adaptation of Pathology Foundation Models for Whole Slide Image Analysis

Neeraj Kumar,Swaraj Nanda,Siddharth Singi,Jamal Benhamida,David Kim,Jie-Fu Chen,Amir Momeni-Boroujeni,Gregory M. Goldgof,Gabriele Campanella,Chad Vanderbilt

Main category: cs.CV

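TL;DR: TAPFM is a new method for single-GPU task adaptation of pathology foundation models (PFMs) that uses ViT attention for MIL aggregation and optimizes both feature representations and attention weights, substantially improving performance on clinical tasks.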

Details Motivation: Adapting pretrained PFMs to specific clinical tasks is challenging, mainly because only weak WSI-level labels are available and MIL is required. Method: The TAPFM method is proposed; it uses ViT attention for MIL aggregation, optimizes features and attention weights, and keeps separate computational graphs for the MIL aggregator and the PFM to achieve stable training. Result: On mutation prediction tasks for bladder cancer and lung adenocarcinoma, TAPFM outperforms conventional approaches and handles multi-label classification effectively. Conclusion: TAPFM makes adapting pretrained PFMs practical on standard hardware for a range of clinical applications. Abstract: Pathology foundation models (PFMs) have emerged as powerful tools for analyzing whole slide images (WSIs). However, adapting these pretrained PFMs for specific clinical tasks presents considerable challenges, primarily due to the availability of only weak (WSI-level) labels for gigapixel images, necessitating a multiple instance learning (MIL) paradigm for effective WSI analysis. This paper proposes a novel approach for single-GPU Task Adaptation of PFMs (TAPFM) that uses vision transformer (ViT) attention for MIL aggregation while optimizing both for feature representations and attention weights. The proposed approach maintains separate computational graphs for MIL aggregator and the PFM to create stable training dynamics that align with downstream task objectives during end-to-end adaptation. Evaluated on mutation prediction tasks for bladder cancer and lung adenocarcinoma across institutional and TCGA cohorts, TAPFM consistently outperforms conventional approaches, with H-Optimus-0 (TAPFM) outperforming the benchmarks. TAPFM effectively handles multi-label classification of actionable mutations as well. Thus, TAPFM makes adaptation of powerful pre-trained PFMs practical on standard hardware for various clinical applications.
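
Attention-based MIL aggregation, in its generic form, turns per-tile attention scores into softmax weights and takes the weighted sum of tile embeddings as the slide-level representation. The sketch below shows only that generic step. How TAPFM derives the scores from ViT attention, and how it separates the two computational graphs, is not described in the abstract, so `attention_scores` is a hypothetical input and the function name is illustrative.

```python
import math


def attention_mil_pool(tile_features, attention_scores):
    """Pool tile embeddings into one slide embedding using softmax attention weights."""
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(attention_scores)
    exps = [math.exp(s - top) for s in attention_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tile_features[0])
    # Weighted sum over tiles, computed independently for each feature dimension.
    slide = [sum(w * f[j] for w, f in zip(weights, tile_features)) for j in range(dim)]
    return slide, weights


if __name__ == "__main__":
    tiles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy tile embeddings
    scores = [0.2, 0.2, 2.0]                      # hypothetical attention scores
    slide, weights = attention_mil_pool(tiles, scores)
    print(weights)
    print(slide)
```

Because only a slide-level label is available, the classifier sees the pooled vector, and the weights indicate which tiles contributed most to the prediction.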

[87] MokA: Multimodal Low-Rank Adaptation for MLLMs

Yake Wei,Yu Miao,Dongzhan Zhou,Di Hu

Main category: cs.CV

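TL;DR: Current efficient multimodal fine-tuning methods are limited because they are borrowed directly from LLMs and ignore multimodal characteristics. This paper proposes MokA, a multimodal-aware fine-tuning strategy that compresses unimodal information with modality-specific parameters while enhancing cross-modal interaction.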

Details Motivation: Existing methods are borrowed directly from LLMs and do not adequately account for the differences of multimodal scenarios, which hampers full use of all modalities. Method: MokA combines unimodal and cross-modal adaptation, compressing unimodal information with modality-specific parameters while enhancing cross-modal interaction. Result: Experiments across multiple multimodal scenarios and LLM backbones show consistent improvements; ablation studies and an efficiency evaluation support the method's effectiveness. Conclusion: MokA offers a targeted solution for efficient adaptation of multimodal large models and paves the way for further exploration. Abstract: In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts for the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully assess our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at https://gewu-lab.github.io/MokA.

[88] Vision-Based Autonomous MM-Wave Reflector Using ArUco-Driven Angle-of-Arrival Estimation

Josue Marroquin,Nan Inzali,Miles Dillon Lantz,Campbell Freeman,Amod Ashtekar,Ajinkya Umesh Mulik,Mohammed E Eltayeb

Main category: cs.CV

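TL;DR: The paper presents a vision-aided autonomous reflector system that improves mmWave communication under non-line-of-sight conditions by dynamically adjusting a reflector plate to optimize the signal.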

Details Motivation: Reliable mmWave communication under non-line-of-sight conditions is a major challenge for both military and civilian use, especially in urban or infrastructure-limited environments. Method: The system uses a monocular camera to detect ArUco markers, estimates angles of arrival, and adjusts the reflector plate in real time to optimize signal reflection. Result: Experiments at 60 GHz show a 23 dB average gain in received signal strength and a 0.89 probability of signal reception above the -65 dB threshold. Conclusion: The system demonstrates resilient and adaptive mmWave communication in complex, dynamic environments. Abstract: Reliable millimeter-wave (mmWave) communication in non-line-of-sight (NLoS) conditions remains a major challenge for both military and civilian operations, especially in urban or infrastructure-limited environments. This paper presents a vision-aided autonomous reflector system designed to enhance mmWave link performance by dynamically steering signal reflections using a motorized metallic plate. The proposed system leverages a monocular camera to detect ArUco markers on allied transmitter and receiver nodes, estimate their angles of arrival, and align the reflector in real time for optimal signal redirection. This approach enables selective beam coverage by serving only authenticated targets with visible markers and reduces the risk of unintended signal exposure. The designed prototype, built on a Raspberry Pi 4 and low-power hardware, operates autonomously without reliance on external infrastructure or GPS. Experimental results at 60 GHz demonstrate a 23 dB average gain in received signal strength and a 0.89 probability of maintaining signal reception above a target threshold of -65 dB in an indoor environment, far exceeding the static and no-reflector baselines. These results demonstrate the system's potential for resilient and adaptive mmWave connectivity in complex and dynamic environments.
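
The geometry behind aligning a flat reflector can be sketched as follows. For specular reflection, the plate normal must bisect the directions to the transmitter and the receiver. The sketch assumes a pinhole camera mounted at the plate's pivot, works only in the horizontal plane, and takes the marker's pixel column as already detected; `bearing_from_pixel`, `reflector_normal_bearing`, and the parameters `cx` and `fx` are hypothetical names, not taken from the paper.

```python
import math


def bearing_from_pixel(u, cx, fx):
    """Horizontal bearing (radians) of a marker centered at pixel column u.

    cx is the principal-point column and fx the focal length in pixels
    (pinhole-camera assumption; lens distortion is ignored).
    """
    return math.atan2(u - cx, fx)


def reflector_normal_bearing(bearing_tx, bearing_rx):
    """Bearing of the plate normal that bisects the two node directions."""
    # Summing the two unit vectors yields their bisector and avoids
    # wrap-around problems that averaging raw angles would have.
    x = math.cos(bearing_tx) + math.cos(bearing_rx)
    y = math.sin(bearing_tx) + math.sin(bearing_rx)
    return math.atan2(y, x)
```

For example, nodes seen at +30 and -10 degrees call for a plate normal at +10 degrees. A real system would also need camera calibration and a motor control loop, which are outside this sketch.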

[89] Quantifying Cross-Modality Memorization in Vision-Language Models

Yuxin Wen,Yangsibo Huang,Tom Goldstein,Ravi Kumar,Badih Ghazi,Chiyuan Zhang

Main category: cs.CV

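TL;DR: The study examines cross-modality memorization in multimodal models, finding that information learned in one modality transfers to the other, though with a significant gap, and proposes a mitigation method.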

Details Motivation: Understanding how neural networks memorize during training, and in particular the characteristics of cross-modality memorization in multimodal models, matters for both privacy protection and knowledge acquisition. Method: A synthetic persona dataset is introduced to quantify how models trained on one modality perform in the other and to analyze cross-modal transferability. Result: A significant gap in cross-modal information transfer is found, and it persists across scenarios. Conclusion: A baseline method is proposed to mitigate the cross-modal transfer gap, in the hope of encouraging more robust multimodal learning techniques. Abstract: Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. Finally, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.

[90] Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding

Yani Zhang,Dongming Wu,Hao Shi,Yingfei Liu,Tiancai Wang,Haoqiang Fan,Xingping Dong

Main category: cs.CV

TL;DR: DEGround shares DETR queries to jointly optimize detection and grounding, substantially improving 3D grounding performance.

Details Motivation: Detection models with no instruction-specific training are found to outperform grounding models trained explicitly for the task, indicating that current 3D grounding methods leave room for improvement. Method: DEGround shares DETR queries between detection and grounding, and adds a regional activation grounding module and a query-wise modulation module to strengthen language understanding. Result: DEGround improves overall accuracy over BIP3D by 7.52% on the EmbodiedScan validation set. Conclusion: By jointly optimizing detection and grounding, DEGround substantially improves 3D grounding performance and offers a new direction for future work. Abstract: Embodied 3D grounding aims to localize target objects described in human instructions from an ego-centric viewpoint. Most methods typically follow a two-stage paradigm where a trained 3D detector's optimized backbone parameters are used to initialize a grounding model. In this study, we explore a fundamental question: Does embodied 3D grounding benefit enough from detection? To answer this question, we assess the grounding performance of detection models using predicted boxes filtered by the target category. Surprisingly, these detection models without any instruction-specific training outperform the grounding models explicitly trained with language instructions. This indicates that even category-level embodied 3D grounding may not be well resolved, let alone more fine-grained context-aware grounding. Motivated by this finding, we propose DEGround, which shares DETR queries as object representation for both DEtection and Grounding and enables the grounding to benefit from basic category classification and box detection. Based on this framework, we further introduce a regional activation grounding module that highlights instruction-related regions and a query-wise modulation module that incorporates sentence-level semantics into the query representation, strengthening the context-aware understanding of language instructions. Remarkably, DEGround outperforms the state-of-the-art model BIP3D by 7.52% in overall accuracy on the EmbodiedScan validation set. The source code will be publicly available at https://github.com/zyn213/DEGround.
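
The detection-only baseline mentioned in the abstract (grounding with predicted boxes filtered by the target category) might look like the sketch below. The abstract does not say how a single box is chosen among the filtered candidates, so picking the highest-scoring one is an assumption, as are the dictionary keys `label`, `score`, and `box` and the function name `ground_by_category`.

```python
def ground_by_category(detections, target_category):
    """Return the highest-scoring detection whose label matches the target.

    detections: list of dicts with hypothetical keys "label", "score", "box".
    Returns None when no predicted box has the target category.
    """
    # Keep only boxes predicted as the category named in the instruction.
    candidates = [d for d in detections if d["label"] == target_category]
    if not candidates:
        return None
    # Assumption: the most confident remaining box is taken as the answer.
    return max(candidates, key=lambda d: d["score"])
```

This baseline ignores everything in the instruction except the category, so it cannot tell apart two chairs described by different spatial context, which is the kind of context-aware grounding the abstract describes as still harder.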

[91] OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View

Yanbo Wang,Ziyi Wang,Wenzhao Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

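TL;DR: OGGSplat is an open Gaussian growing method for reconstructing semantic-aware 3D scenes from sparse views, using an RGB-semantic consistent inpainting module with bidirectionally controlled diffusion models for efficient optimization.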

Details Motivation: Existing methods need dense input views and are computationally expensive, while generalizable methods struggle to reconstruct regions outside the input view cone, so an efficient and semantically consistent reconstruction method is needed. Method: OGGSplat uses the semantic attributes of open Gaussians for image extrapolation, combining an RGB-semantic consistent inpainting module with bidirectionally controlled diffusion models to progressively optimize Gaussian parameters. Result: A Gaussian Outpainting (GO) benchmark is established to assess OGGSplat's semantic and generative quality, and semantic-aware reconstruction is demonstrated on two-view scenes captured with a smartphone. Conclusion: OGGSplat achieves semantically consistent and visually plausible 3D scene reconstruction from sparse views, offering an efficient solution for applications such as virtual reality and embodied AI. Abstract: Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In this paper, we propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation, enabling both semantic consistency and visual plausibility. Specifically, once open Gaussians are initialized from sparse views, we introduce an RGB-semantic consistent inpainting module applied to selected rendered views. This module enforces bidirectional control between an image diffusion model and a semantic diffusion model. The inpainted regions are then lifted back into 3D space for efficient and progressive Gaussian parameter optimization. To evaluate our method, we establish a Gaussian Outpainting (GO) benchmark that assesses both semantic and generative quality of reconstructed open-vocabulary scenes. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two-view images captured directly from a smartphone camera.

[92] Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning

Yue Ma,Yulong Liu,Qiyuan Zhu,Ayden Yang,Kunyu Feng,Xinhua Zhang,Zhifeng Li,Sirui Han,Chenyang Qi,Qifeng Chen

Main category: cs.CV

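TL;DR: Follow-Your-Motion is an efficient two-stage video motion transfer framework that uses spatial-temporal decoupled LoRA and sparse motion sampling to address the motion inconsistency and tuning inefficiency of existing methods.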

Details Motivation: Existing LoRA-based motion transfer methods suffer from motion inconsistency and tuning inefficiency on large video diffusion transformers. Method: A spatial-temporal decoupled LoRA separates spatial appearance processing from temporal motion processing; in the second stage, sparse motion sampling and adaptive RoPE accelerate tuning. Result: The method's advantage is verified on the proposed MotionBench benchmark. Conclusion: Follow-Your-Motion significantly improves the efficiency and consistency of motion transfer. Abstract: Recently, breakthroughs in the video diffusion transformer have shown remarkable capabilities in diverse motion generation. As for the motion-transfer task, current methods mainly use two-stage Low-Rank Adaptations (LoRAs) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos due to the inherent spatial-temporal coupling in the 3D attention operator. Additionally, it requires time-consuming fine-tuning in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing. During the second training stage, we design the sparse motion sampling and adaptive RoPE to accelerate the tuning speed. To address the lack of a benchmark for this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion, including creative camera motion, single object motion, multiple object motion, and complex human motion. We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.

[93] Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation

Jan Ackermann,Kiyohiro Nakayama,Guandao Yang,Tong Wu,Gordon Wetzstein

Main category: cs.CV

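TL;DR: VLG generates garments from textual and visual inputs, showing the potential of multimodal foundation models to adapt to specialized domains such as fashion design.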

Details Motivation: To explore how well multimodal foundation models transfer knowledge to specialized domains such as garment generation. Method: VLG generates garments from textual descriptions and visual imagery, and its zero-shot generalization is evaluated. Result: Preliminary results show good transfer to unseen garment styles and prompts. Conclusion: Multimodal foundation models show promise for adapting effectively to specialized domains such as fashion design. Abstract: Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG's zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.

[94] DSG-World: Learning a 3D Gaussian World Model from Dual State Videos

Wenhao Hu,Xuexiang Wen,Xi Li,Gaoang Wang

Main category: cs.CV

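TL;DR: DSG-World is a 3D Gaussian world modeling framework built on dual-state observations that achieves efficient reconstruction through bidirectional photometric and semantic consistency.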

Details Motivation: To address the difficulty of training implicit generative models and their lack of 3D consistency, as well as the multi-stage processing that explicit 3D methods need because of occlusions. Method: Dual-state observations are used to build dual segmentation-aware Gaussian fields, with a pseudo intermediate state for symmetric alignment and a collaborative co-pruning strategy. Result: Experiments show that DSG-World supports high-fidelity rendering and object-level manipulation for novel views and scene states. Conclusion: DSG-World offers an efficient and consistent approach to real-world 3D reconstruction and simulation. Abstract: Building an efficient and physically consistent world model from limited observations is a long-standing challenge in vision and robotics. Many existing world modeling pipelines are based on implicit generative models, which are hard to train and often lack 3D or physical consistency. On the other hand, explicit 3D methods built from a single state often require multi-stage processing (such as segmentation, background completion, and inpainting) due to occlusions. To address this, we leverage two perturbed observations of the same scene under different object configurations. These dual states offer complementary visibility, alleviating occlusion issues during state transitions and enabling more stable and complete reconstruction. In this paper, we present DSG-World, a novel end-to-end framework that explicitly constructs a 3D Gaussian World model from Dual State observations. Our approach builds dual segmentation-aware Gaussian fields and enforces bidirectional photometric and semantic consistency. We further introduce a pseudo intermediate state for symmetric alignment and design collaborative co-pruning strategies to refine geometric completeness. DSG-World enables efficient real-to-simulation transfer purely in the explicit Gaussian representation space, supporting high-fidelity rendering and object-level scene manipulation without relying on dense observations or multi-stage pipelines. Extensive experiments demonstrate strong generalization to novel views and scene states, highlighting the effectiveness of our approach for real-world 3D reconstruction and simulation.

[95] MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm

Zhang Li,Yuliang Liu,Qiang Liu,Zhiyin Ma,Ziyang Zhang,Shuo Zhang,Zidun Guo,Jiarui Zhang,Xinyu Wang,Xiang Bai

Main category: cs.CV

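TL;DR: MonkeyOCR is a vision-language model for document parsing based on the SRR triplet paradigm; by simplifying the pipeline and processing efficiently, it outperforms existing methods in both accuracy and speed.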

Details Motivation: To overcome the complex multi-tool pipelines and inefficient full-page processing of existing document parsing methods. Method: The SRR (Structure-Recognition-Relation) triplet paradigm decomposes document parsing into three problems: layout analysis, content identification, and logical ordering. Result: Trained and evaluated with the MonkeyDoc dataset, MonkeyOCR improves average performance by 5.1%, runs faster (0.84 pages per second), and its 3B-parameter model surpasses much larger models. Conclusion: MonkeyOCR provides an efficient, accurate, and scalable solution for document parsing. Abstract: We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU's modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions - "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation) - corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.

[96] SAM-aware Test-time Adaptation for Universal Medical Image Segmentation

Jianghao Wu,Yicheng Wu,Yutong Xie,Wenjia Bai,You Zhang,Feilong Tang,Yulong Li,Yasmeen George,Imran Razzak

Main category: cs.CV

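TL;DR: SAM-TTA is a test-time adaptation framework that addresses the input-level and semantic-level discrepancies SAM faces in medical image segmentation; its SBCT and DUMT components improve performance and outperform existing methods in experiments.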

Details Motivation: To overcome the performance limits that input and semantic discrepancies impose on SAM in medical image segmentation while preserving its generalization. Method: The SAM-TTA framework comprises SBCT, which adaptively converts medical images into SAM-compatible inputs, and DUMT, a dual-scale uncertainty-driven mean teacher adaptation. Result: On five public datasets, SAM-TTA outperforms existing TTA methods and in some scenarios even surpasses fully fine-tuned models such as MedSAM. Conclusion: SAM-TTA offers a new paradigm for universal medical image segmentation that combines improved performance with generalization. Abstract: Universal medical image segmentation using the Segment Anything Model (SAM) remains challenging due to its limited adaptability to medical domains. Existing adaptations, such as MedSAM, enhance SAM's performance in medical imaging but at the cost of reduced generalization to unseen data. Therefore, in this paper, we propose SAM-aware Test-Time Adaptation (SAM-TTA), a fundamentally different pipeline that preserves the generalization of SAM while improving its segmentation performance in medical imaging via a test-time framework. SAM-TTA tackles two key challenges: (1) input-level discrepancies caused by differences in image acquisition between natural and medical images and (2) semantic-level discrepancies due to fundamental differences in object definition between natural and medical domains (e.g., clear boundaries vs. ambiguous structures). Specifically, our SAM-TTA framework comprises (1) Self-adaptive Bezier Curve-based Transformation (SBCT), which adaptively converts single-channel medical images into three-channel SAM-compatible inputs while maintaining structural integrity, to mitigate the input gap between medical and natural images, and (2) Dual-scale Uncertainty-driven Mean Teacher adaptation (DUMT), which employs consistency learning to align SAM's internal representations to medical semantics, enabling efficient adaptation without auxiliary supervision or expensive retraining. Extensive experiments on five public datasets demonstrate that our SAM-TTA outperforms existing TTA approaches and even surpasses fully fine-tuned models such as MedSAM in certain scenarios, establishing a new paradigm for universal medical image segmentation. Code can be found at https://github.com/JianghaoWu/SAM-TTA.

[97] Spatiotemporal Contrastive Learning for Cross-View Video Localization in Unstructured Off-road Terrains

Zhiyun Deng,Dongmyeong Lee,Amanda Adkins,Jesse Quattrociocchi,Christian Ellis,Joydeep Biswas

Main category: cs.CV

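TL;DR: MoViX is a self-supervised cross-view video localization framework for GPS-denied off-road environments that achieves accurate localization by learning viewpoint- and season-invariant representations.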

Details Motivation: To address localization challenges in GPS-denied environments caused by repetitive vegetation, unstructured terrain, and seasonal change. Method: MoViX uses pose-dependent positive sampling, temporally aligned hard negative mining, a motion-informed frame sampler, and a lightweight temporal aggregator, and runs within a Monte Carlo Localization framework. Result: On the TartanDrive 2.0 dataset, localization error is under 25 meters 93% of the time and under 50 meters 100% of the time, outperforming existing methods. Conclusion: MoViX performs well in complex environments without environment-specific tuning and generalizes to a real-world scenario at a geographically distinct site. Abstract: Robust cross-view 3-DoF localization in GPS-denied, off-road environments remains challenging due to (1) perceptual ambiguities from repetitive vegetation and unstructured terrain, and (2) seasonal shifts that significantly alter scene appearance, hindering alignment with outdated satellite imagery. To address this, we introduce MoViX, a self-supervised cross-view video localization framework that learns viewpoint- and season-invariant representations while preserving directional awareness essential for accurate localization. MoViX employs a pose-dependent positive sampling strategy to enhance directional discrimination and temporally aligned hard negative mining to discourage shortcut learning from seasonal cues. A motion-informed frame sampler selects spatially diverse frames, and a lightweight temporal aggregator emphasizes geometrically aligned observations while downweighting ambiguous ones. At inference, MoViX runs within a Monte Carlo Localization framework, using a learned cross-view matching module in place of handcrafted models. Entropy-guided temperature scaling enables robust multi-hypothesis tracking and confident convergence under visual ambiguity. We evaluate MoViX on the TartanDrive 2.0 dataset, training on under 30 minutes of data and testing over 12.29 km. Despite outdated satellite imagery, MoViX localizes within 25 meters of ground truth 93% of the time, and within 50 meters 100% of the time in unseen regions, outperforming state-of-the-art baselines without environment-specific tuning. We further demonstrate generalization on a real-world off-road dataset from a geographically distinct site with a different robot platform.

[98] LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs

Xiaodong Wang,Jinfa Huang,Li Yuan,Peixi Peng

Main category: cs.CV

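TL;DR: The paper proposes LeanPO, which redefines the reward and adds a dynamic label smoothing strategy to address the unintended increase in the probability of non-target responses caused by preference alignment in Video-LLMs.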

Details Motivation: When Video-LLMs use preference alignment techniques such as DPO, the log-probabilities of both the winning and the losing response often decrease, inadvertently raising the probability of non-target responses. Method: LeanPO redefines the reward as the average likelihood of the response under the policy model, combined with a self-generated preference data pipeline and a dynamic label smoothing strategy. Result: Experiments show that LeanPO significantly improves Video-LLM performance with little additional training overhead. Conclusion: LeanPO provides a simple and effective preference alignment solution for Video-LLMs, improving reliability and efficiency. Abstract: Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO (Rafailov et al.), to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose Lean Preference Optimization (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness correlated self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward reliable and efficient Video-LLMs.
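
To make the idea of a reference-free, length-averaged reward concrete, here is a minimal sketch. It is not LeanPO's actual objective, which the abstract does not spell out. It assumes that "average likelihood" means the mean per-token log-probability under the policy, and it uses a fixed label-smoothing weight `eps` where the paper describes a dynamic one; `beta`, `eps`, and all function names are hypothetical.

```python
import math


def avg_log_likelihood(token_logprobs):
    """Mean per-token log-probability of a response under the policy."""
    return sum(token_logprobs) / len(token_logprobs)


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def smoothed_preference_loss(logp_win, logp_lose, beta=1.0, eps=0.1):
    """Label-smoothed logistic loss on the reward margin (assumed form).

    logp_win / logp_lose: per-token log-probs of the winning / losing response.
    eps: probability that the preference label is flipped (fixed here).
    """
    margin = beta * (avg_log_likelihood(logp_win) - avg_log_likelihood(logp_lose))
    return -(1.0 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)
```

Because the reward depends only on the policy's own likelihoods, no frozen reference model is needed, and averaging over tokens keeps long responses from being penalized purely for their length.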

[99] Can Foundation Models Generalise the Presentation Attack Detection Capabilities on ID Cards?

Juan E. Tapia,Christoph Busch

Main category: cs.CV

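TL;DR: The study investigates how foundation models (FMs) can improve the generalization of presentation attack detection (PAD) on ID cards, particularly for ID cards from different countries.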

Details Motivation: Because of privacy restrictions, current PAD systems are usually trained on only a few ID cards and perform poorly on ID cards from new countries. Foundation models, trained on very large datasets, may improve generalization. Method: Zero-shot and fine-tuning protocols are tested on a private dataset of Chilean IDs and an open-set dataset of IDs from Finland, Spain, and Slovakia. Result: Bona fide images are found to be the key to improving generalization. Conclusion: Foundation models show potential for ID card PAD, and bona fide image data is essential for generalization. Abstract: Nowadays, one of the main challenges in presentation attack detection (PAD) on ID cards is obtaining generalisation capabilities for a diversity of countries that are issuing ID cards. Most PAD systems are trained on one, two, or three ID documents because of privacy protection concerns. As a result, they do not obtain competitive results for commercial purposes when tested on ID cards from an unknown new country. In this scenario, Foundation Models (FM) trained on huge datasets can help to improve generalisation capabilities. This work intends to improve and benchmark the capabilities of FMs and how to use them to adapt the generalisation on PAD of ID documents. Different test protocols were used, considering zero-shot and fine-tuning settings and two different ID card datasets: a private dataset based on Chilean IDs and an open-set dataset covering three countries (Finland, Spain, and Slovakia). Our findings indicate that bona fide images are the key to generalisation.

[100] From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Animesh Gupta,Jay Parmar,Ishan Rajendrakumar Dave,Mubarak Shah

Main category: cs.CV

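TL;DR: TF-CoVR is a new benchmark for temporally fine-grained composed video retrieval, accompanied by TF-CoVR-Base, a two-stage training framework that substantially improves retrieval performance.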

Details Motivation: Existing CoVR benchmarks do not test the ability to capture subtle, fast-paced temporal differences, so a new benchmark and method are needed. Method: A two-stage framework first pre-trains a video encoder to obtain temporally discriminative embeddings, then aligns queries with candidate videos through contrastive learning. Result: TF-CoVR-Base improves performance in both the zero-shot and fine-tuned settings, reaching mAP@50 of 7.51 and 25.82 respectively. Conclusion: TF-CoVR fills a gap in temporally fine-grained video retrieval, and TF-CoVR-Base provides an effective baseline for related tasks. Abstract: Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks focusing on the temporal aspect link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.
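
Since each TF-CoVR query has several valid targets, a ranking metric such as mAP@50 is a natural fit. The sketch below shows one common way to compute it. The abstract does not give the benchmark's exact formula, so the normalization by `min(k, number of relevant items)` is an assumption, as are the function names.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=50):
    """AP@k for one query with possibly multiple valid targets.

    ranked_ids: retrieved video ids, best first (assumed free of duplicates).
    relevant_ids: ids of all valid target videos for the query.
    """
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, video_id in enumerate(ranked_ids[:k], start=1):
        if video_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    # Assumed normalization: the best achievable number of hits in the top k.
    denom = min(k, len(relevant))
    return precision_sum / denom if denom else 0.0


def mean_ap_at_k(queries, k=50):
    """Mean of AP@k over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in queries) / len(queries)
```

For a ranking `["a", "b", "c"]` with valid targets `{"a", "c"}`, the hits fall at ranks 1 and 3, giving AP = (1/1 + 2/3) / 2 = 5/6.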

[101] Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting

Nan Wang,Yuantao Chen,Lixing Xiao,Weiqing Xiao,Bohan Li,Zhaoxi Chen,Chongjie Ye,Shaocong Xu,Saining Zhang,Ziyang Yan,Pierre Merriaux,Lei Lei,Tianfan Xue,Hao Zhao

Main category: cs.CV

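TL;DR: A multi-scale bilateral grid that unifies appearance codes and bilateral grids significantly improves geometric accuracy in dynamic autonomous driving scene reconstruction.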

Details Motivation: Perfect photometric consistency is hard to guarantee in real-world scenes, and existing remedies (appearance codes and bilateral grids) are limited in modeling capacity and in how well they can be optimized. Method: A multi-scale bilateral grid unifies appearance codes and bilateral grids to improve geometric reconstruction. Result: The method performs strongly on four datasets including Waymo, with markedly better geometric accuracy and fewer artifacts caused by photometric inconsistency. Conclusion: The multi-scale bilateral grid effectively handles photometric inconsistency, which is crucial for autonomous driving scene reconstruction. Abstract: Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.

[102] Rectified Point Flow: Generic Point Cloud Pose Estimation

Tao Sun,Liyuan Zhu,Shengyu Huang,Shuran Song,Iro Armeni

Main category: cs.CV

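TL;DR: Rectified Point Flow is a unified parameterization that treats point cloud registration and multi-part shape assembly as a conditional generative problem, learning a point-wise velocity field that moves points to their target positions; it learns symmetries without symmetry labels and achieves state-of-the-art performance.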

Details Motivation: To remove the reliance of earlier point cloud registration and shape assembly methods on ad-hoc symmetry handling and part-wise pose regression by proposing a more general generative approach. Method: A continuous point-wise velocity field transports noisy points to their target positions; combined with a self-supervised encoder focused on overlapping points, this enables symmetry learning and pose recovery. Result: The method achieves state-of-the-art performance on six benchmarks, and the unified framework supports joint training across datasets, improving the learning of geometric priors. Conclusion: Rectified Point Flow provides an efficient, unified solution that substantially improves point cloud registration and shape assembly. Abstract: We introduce Rectified Point Flow, a unified parameterization that formulates pairwise point cloud registration and multi-part shape assembly as a single conditional generative problem. Given unposed point clouds, our method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered. In contrast to prior work that regresses part-wise poses with ad-hoc symmetry handling, our method intrinsically learns assembly symmetries without symmetry labels. Together with a self-supervised encoder focused on overlapping points, our method achieves a new state-of-the-art performance on six benchmarks spanning pairwise registration and shape assembly. Notably, our unified formulation enables effective joint training on diverse datasets, facilitating the learning of shared geometric priors and consequently boosting accuracy. Project page: https://rectified-pointflow.github.io/.
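
The "velocity field that transports noisy points toward their target positions" can be illustrated with the standard rectified-flow recipe: interpolate linearly between a noise sample and the target, regress the constant velocity between them, and integrate with Euler steps at inference. This is a generic sketch, not the paper's code. The time convention (t = 0 is noise, t = 1 is the target), the function names, and the use of an oracle velocity in place of a trained network are all assumptions; recovering part poses from the transported points is not shown.

```python
def interpolate(x0, x1, t):
    """Points on the straight path from noise x0 (t=0) to target x1 (t=1)."""
    return [[(1 - t) * a + t * b for a, b in zip(p0, p1)] for p0, p1 in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity field on the straight path: x1 - x0."""
    return [[b - a for a, b in zip(p0, p1)] for p0, p1 in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = [list(p) for p in x0]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)  # a trained network would be called here
        x = [[c + dt * vc for c, vc in zip(p, vp)] for p, vp in zip(x, v)]
    return x
```

With the oracle velocity `x1 - x0`, Euler integration lands on the target for any number of steps, which is the straight-path property that rectified flows aim for.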

[103] Video World Models with Long-term Spatial Memory

Tong Wu,Shuai Yang,Ryan Po,Yinghao Xu,Ziwei Liu,Dahua Lin,Gordon Wetzstein

Main category: cs.CV

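TL;DR: A new framework based on geometry-grounded spatial memory improves the long-term consistency of video world models, addressing forgetting when scenes are revisited.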

Details Motivation: Because of limited temporal context windows, existing models struggle to stay consistent when scenes are revisited, which leads to forgetting. The new method is inspired by human memory mechanisms. Method: A geometry-grounded long-term spatial memory is introduced, with mechanisms to store and retrieve information, and custom datasets are used to train and evaluate the models. Result: Evaluations show improved quality, consistency, and context length compared with baselines. Conclusion: The framework offers a new direction for long-term consistent world generation. Abstract: Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework to enhance long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.

[104] RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion

Bardienus P. Duisterhof,Jan Oberst,Bowen Wen,Stan Birchfield,Deva Ramanan,Jeffrey Ichnowski

Main category: cs.CV

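TL;DR: RaySt3R recasts 3D shape completion as a novel view synthesis problem: from a single RGB-D image and query rays for novel viewpoints, it predicts depth maps, object masks, and confidence scores, yielding efficient and consistent 3D reconstruction.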

Details Motivation: Existing 3D shape completion methods lack 3D consistency, are computationally expensive, and struggle to capture sharp boundaries; RaySt3R targets these problems. Method: Given a single RGB-D image and query rays from multiple viewpoints, a feedforward transformer is trained to predict depth maps, object masks, and confidence scores, and the predictions are fused across views to reconstruct the 3D shape. Result: On synthetic and real-world datasets, RaySt3R outperforms the baselines by up to 44% in 3D chamfer distance, achieving state-of-the-art performance. Conclusion: By treating completion as novel view synthesis, RaySt3R substantially improves the quality and efficiency of 3D shape completion and has broad application potential. Abstract: 3D shape completion has broad applications in robotics, digital twin reconstruction, and extended reality (XR). Although recent advances in 3D object and scene completion have achieved impressive results, existing methods lack 3D consistency, are computationally expensive, and struggle to capture sharp object boundaries. Our work (RaySt3R) addresses these limitations by recasting 3D shape completion as a novel view synthesis problem. Specifically, given a single RGB-D image and a novel viewpoint (encoded as a collection of query rays), we train a feedforward transformer to predict depth maps, object masks, and per-pixel confidence scores for those query rays. RaySt3R fuses these predictions across multiple query views to reconstruct complete 3D shapes. We evaluate RaySt3R on synthetic and real-world datasets, and observe that it achieves state-of-the-art performance, outperforming the baselines on all datasets by up to 44% in 3D chamfer distance. Project page: https://rayst3r.github.io
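Since the headline result is reported in 3D chamfer distance, a small reference computation may help. The sketch below uses one common symmetric definition (mean nearest-neighbor Euclidean distance in both directions). Chamfer distance has several variants (squared distances, sums instead of means), and which one the paper uses is not stated here, so treat this choice as an assumption.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two non-empty point sets (brute force)."""
    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Toy point sets (assumed for illustration only).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
score = chamfer_distance(predicted, reference)
```

The brute-force nearest-neighbor search is quadratic in the number of points. Real evaluations typically use a spatial index or a GPU implementation.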

[105] Stable Vision Concept Transformers for Medical Diagnosis

Lijie Hu,Songning Lai,Yuan Hua,Shu Yang,Jingfeng Zhang,Di Wang

Main category: cs.CV

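TL;DR: The paper proposes the VCT and SVCT models to address the utility and stability problems of existing Concept Bottleneck Models (CBMs) in the medical domain.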

Details Motivation: The medical field's need for transparency has motivated research on explainable AI (XAI), but existing CBMs rely only on concept features, ignore the intrinsic features of medical images, and are not stable under input perturbations. Method: Proposes the Vision Concept Transformer (VCT) and the Stable Vision Concept Transformer (SVCT), which fuse concept features with image features and use Denoised Diffusion Smoothing to improve stability. Result: Experiments on four medical datasets show that VCT and SVCT remain interpretable while maintaining accuracy, and that SVCT still provides stable explanations under perturbations. Conclusion: VCT and SVCT meet the medical field's requirements for model transparency and stability and have practical application potential. Abstract: Transparency is a paramount concern in the medical field, prompting researchers to delve into the realm of explainable AI (XAI). Among these XAI methods, Concept Bottleneck Models (CBMs) aim to restrict the model's latent space to human-understandable high-level concepts by generating a conceptual layer for extracting conceptual features, which has drawn much attention recently. However, existing methods rely solely on concept features to determine the model's predictions, which overlooks the intrinsic feature embeddings within medical images. To address this utility gap between the original models and concept-based models, we propose Vision Concept Transformer (VCT). Furthermore, despite their benefits, CBMs have been found to negatively impact model performance and fail to provide stable explanations when faced with input perturbations, which limits their application in the medical field. To address this faithfulness issue, this paper further proposes the Stable Vision Concept Transformer (SVCT) based on VCT, which leverages the vision transformer (ViT) as its backbone and incorporates a conceptual layer. SVCT employs conceptual features to enhance decision-making capabilities by fusing them with image features and ensures model faithfulness through the integration of Denoised Diffusion Smoothing. Comprehensive experiments on four medical datasets demonstrate that our VCT and SVCT maintain accuracy while remaining interpretable compared to baselines. Furthermore, even when subjected to perturbations, our SVCT model consistently provides faithful explanations, thus meeting the needs of the medical field.

[106] EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Yuqian Yuan,Ronghao Dang,Long Li,Wentong Li,Dian Jiao,Xin Li,Deli Zhao,Fan Wang,Wenqiao Zhang,Jun Xiao,Yueting Zhuang

Main category: cs.CV

TL;DR: EOC-Bench is a new benchmark for evaluating object-centric cognition in dynamic egocentric scenarios, filling a gap left by existing benchmarks.

Details Motivation: Existing benchmarks focus mainly on static scenes and overlook the dynamic changes caused by user interaction, so a new evaluation tool is needed. Method: Builds EOC-Bench, which contains 3,277 annotated QA pairs covering 11 evaluation dimensions and 3 visual object referencing types, using a mixed-format annotation framework and a multi-scale temporal accuracy metric. Result: A comprehensive evaluation of a range of MLLMs is carried out, and EOC-Bench is positioned as a key tool for improving the object cognition capabilities of MLLMs. Conclusion: EOC-Bench lays a solid foundation for developing reliable core models for embodied systems. Abstract: The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework with four types of questions and design a novel multi-scale temporal accuracy metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognitive capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.

[107] AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model

Pingyu Wu,Kai Zhu,Yu Liu,Longxiang Tang,Jian Yang,Yansong Peng,Wei Zhai,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

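TL;DR: Proposes a new Aligned Tokenizer (AliTok) that uses unidirectional token dependencies to improve autoregressive image generation, achieving strong results on ImageNet-256.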

Details Motivation: Existing image tokenizers introduce bidirectional dependencies during compression, which hinders effective modeling by autoregressive models. Method: Uses a causal decoder to establish unidirectional dependencies, combined with prefix tokens and two-stage tokenizer training to improve reconstruction consistency. Result: On ImageNet-256, AliTok achieves a gFID of 1.50 (177M parameters) and 1.35 (662M parameters), with sampling 10x faster than the state-of-the-art diffusion method. Conclusion: AliTok performs well on image generation, combining efficiency with strong results. Abstract: Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during the compression process, which hinders effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which utilizes a causal decoder to establish unidirectional dependencies among encoded tokens, thereby aligning the token modeling approach between the tokenizer and autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves great reconstruction performance while being generation-friendly. On the ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
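The "unidirectional dependency" that a causal decoder enforces can be pictured as a lower-triangular attention mask: position i may look at positions 0..i and nothing later. The sketch below shows only this generic masking pattern; `causal_mask` is a hypothetical helper, and how AliTok wires such a mask into its decoder is not described in this summary.

```python
def causal_mask(num_tokens):
    """Return mask[i][j] == True when token i is allowed to attend to token j."""
    return [[i >= j for j in range(num_tokens)] for i in range(num_tokens)]


def visible_positions(mask, i):
    """List the positions that token i can see under the mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


mask = causal_mask(4)
seen_by_third = visible_positions(mask, 2)
```

A bidirectional tokenizer would correspond to an all-`True` mask, where every token depends on every other one. The causal variant matches the left-to-right order in which an autoregressive generator later predicts the tokens.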

[108] SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Jianyi Wang,Shanchuan Lin,Zhijie Lin,Yuxi Ren,Meng Wei,Zongsheng Yue,Shangchen Zhou,Hao Chen,Yang Zhao,Ceyuan Yang,Xuefeng Xiao,Chen Change Loy,Lu Jiang

Main category: cs.CV

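TL;DR: SeedVR2 is a diffusion-based one-step video restoration model that uses adversarial training and an adaptive window attention mechanism to cut computational cost substantially while performing well on high-resolution video restoration.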

Details Motivation: Existing diffusion-based video restoration methods are computationally expensive, and one-step restoration methods fall short on high-resolution video; SeedVR2 targets both problems. Method: Uses adversarial training and an attention mechanism whose window size is adjusted dynamically, together with a feature matching loss, to improve the model architecture and training procedure. Result: In a single step, SeedVR2 matches or exceeds existing methods, particularly on high-resolution video. Conclusion: SeedVR2 offers an efficient, high-performing one-step solution for high-resolution video restoration. Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet incur a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding the window inconsistency observed under high-resolution VR when using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss, without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.

[109] Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

Weifeng Lin,Xinyu Wei,Ruichuan An,Tianhe Ren,Tingwei Chen,Renrui Zhang,Ziyu Guo,Wentao Zhang,Lei Zhang,Hongsheng Li

Main category: cs.CV

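TL;DR: PAM is an efficient region-level visual understanding framework that combines SAM 2 with LLMs to deliver object segmentation together with diverse semantic outputs.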

Details Motivation: Addresses the limited efficiency and limited multimodal output of existing region-level visual understanding methods. Method: Integrates SAM 2 with LLMs, introduces a Semantic Perceiver to transform visual features, and develops a data refinement and augmentation pipeline. Result: PAM is faster and uses less memory than existing methods and supports multi-granularity understanding tasks. Conclusion: PAM provides an efficient and practical solution for region-level visual understanding and can serve as a baseline for future research. Abstract: We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region-specific semantic outputs, including categories, label definitions, functional explanations, and detailed captions. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of 1.5M image and 0.6M video region-semantic annotations, including novel region-level streaming video caption data. PAM is designed to be lightweight and efficient, while also demonstrating strong performance across a diverse range of region understanding tasks. It runs 1.2-2.4x faster and consumes less GPU memory than prior approaches, offering a practical solution for real-world applications. We believe that our effective approach will serve as a strong baseline for future research in region-level visual understanding.

[110] Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

Olaf Dünkel,Thomas Wimmer,Christian Theobalt,Christian Rupprecht,Adam Kortylewski

Main category: cs.CV

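TL;DR: Proposes a 3D-aware pseudo-labeling method that improves semantic matching, reduces reliance on dataset-specific annotations, and achieves a notable gain on SPair-71k.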

Details Motivation: Resolves the ambiguities that symmetric objects and repeated parts cause in semantic matching by using 3D information to improve matching accuracy. Method: Trains an adapter to refine off-the-shelf features, using pseudo-labels obtained via 3D-aware chaining and filtering wrong labels through relaxed cyclic consistency and 3D spherical prototype mapping constraints. Result: Improves performance on SPair-71k by over 4%, and by over 7% against methods with similar supervision requirements. Conclusion: The method is general, easy to extend to other data sources, and markedly improves the accuracy of semantic matching. Abstract: Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features using pseudo-labels obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency, and 3D spherical prototype mapping constraints. While reducing the need for dataset-specific annotations compared to prior work, we set a new state-of-the-art on SPair-71k by over 4% absolute gain and by over 7% against methods with similar supervision requirements. The generality of our proposed approach simplifies extension of training to other data sources, which we demonstrate in our experiments.
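Filtering pseudo-labels by cyclic consistency can be sketched generically: match a point from image A to image B, match the result back to A, and keep the pair only if the round trip returns close to the start. The version below is "relaxed" in the sense that it accepts a tolerance radius instead of demanding an exact return. The function name, the tolerance, and the toy matchers are all assumptions for illustration, and the paper's actual criterion may differ.

```python
import math


def filter_cycle_consistent(points_a, match_a_to_b, match_b_to_a, tolerance):
    """Keep (p, q) pairs whose A -> B -> A round trip lands within `tolerance` of p."""
    kept = []
    for p in points_a:
        q = match_a_to_b(p)
        p_back = match_b_to_a(q)
        if tolerance >= math.dist(p, p_back):
            kept.append((p, q))
    return kept


# Toy matchers (hypothetical): B is A shifted by +10 in x, except that the
# backward matcher confuses a point at x = 13 with a symmetric twin far away.
def forward(p):
    return (p[0] + 10.0, p[1])


def backward(q):
    if q[0] == 13.0:
        return (50.0, q[1])  # simulated mismatch on a symmetric part
    return (q[0] - 10.0 + 0.5, q[1])  # small drift that the relaxation tolerates


pairs = filter_cycle_consistent(
    [(1.0, 1.0), (3.0, 2.0)], forward, backward, tolerance=1.0
)
```

A tolerance of zero would recover strict cyclic consistency and would reject the first pair here as well, because of the 0.5 drift.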

[111] MARBLE: Material Recomposition and Blending in CLIP-Space

Ta-Ying Cheng,Prafull Sharma,Mark Boss,Varun Jampani

Main category: cs.CV

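TL;DR: MARBLE is a method for editing the materials of objects in images using material embeddings in CLIP-space and pre-trained text-to-image models; it supports material blending and fine-grained attribute control.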

Details Motivation: The goal is to improve exemplar-based material editing by using material embeddings in CLIP-space and pre-trained models to achieve finer control over material properties. Method: Locates the block in the denoising UNet responsible for material attribution, blends materials along directions in CLIP-space, and uses a shallow network to predict the direction for fine-grained material attribute changes. Result: Qualitative and quantitative analyses demonstrate the method's effectiveness, including multiple edits in a single forward pass and application to paintings. Conclusion: MARBLE performs well at material editing and fine-grained attribute control and has broad application potential. Abstract: Editing materials of objects in images based on exemplar images is an active area of research in computer vision and graphics. We propose MARBLE, a method for performing material blending and recomposing fine-grained material properties by finding material embeddings in CLIP-space and using them to control pre-trained text-to-image models. We improve exemplar-based material editing by finding a block in the denoising UNet responsible for material attribution. Given two material exemplar images, we find directions in the CLIP-space for blending the materials. Further, we can achieve parametric control over fine-grained material attributes such as roughness, metallic, transparency, and glow using a shallow network to predict the direction for the desired material attribute change. We perform qualitative and quantitative analysis to demonstrate the efficacy of our proposed method. We also present the ability of our method to perform multiple edits in a single forward pass and its applicability to painting. Project Page: https://marblecontrol.github.io/

[112] ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation

Daniel Rho,Jun Myeong Choi,Biswadip Dey,Roni Sengupta

Main category: cs.CV

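TL;DR: ProJo4D proposes a progressive joint optimization framework for estimating physical parameters from sparse multi-view videos, addressing the error accumulation and highly non-convex optimization that limit existing methods.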

Details Motivation: With sparse multi-view videos, existing methods perform poorly because of error accumulation and a highly non-convex problem, which limits applications such as physically accurate digital twins. Method: ProJo4D uses a progressive joint optimization strategy that gradually enlarges the set of jointly optimized parameters until geometry, appearance, physical state, and material properties are all optimized jointly. Result: On the PAC-NeRF and Spring-Gaus datasets, ProJo4D outperforms existing methods in 4D future state prediction, novel view rendering of future states, and material parameter estimation. Conclusion: ProJo4D is effective for physically grounded 4D scene understanding and offers a more practical solution for real-world use. Abstract: Neural rendering has made significant strides in 3D reconstruction and novel view synthesis. Its integration with physics opens up new applications. The inverse problem of estimating physics from visual data, however, remains challenging, limiting its effectiveness for applications like physically accurate digital twin creation in robotics and XR. Existing methods that incorporate physics into neural rendering frameworks typically require dense multi-view videos as input, making them impractical for scalable, real-world use. When presented with sparse multi-view videos, the sequential optimization strategy used by existing approaches introduces significant error accumulation, e.g., poor initial 3D reconstruction leads to bad material parameter estimation in subsequent stages. Instead of sequential optimization, directly optimizing all parameters at the same time also fails due to the highly non-convex and often non-differentiable nature of the problem. We propose ProJo4D, a progressive joint optimization framework that gradually increases the set of jointly optimized parameters guided by their sensitivity, leading to fully joint optimization over geometry, appearance, physical state, and material property. Evaluations on PAC-NeRF and Spring-Gaus datasets show that ProJo4D outperforms prior work in 4D future state prediction, novel view rendering of future state, and material parameter estimation, demonstrating its effectiveness in physically grounded 4D scene understanding. For demos, please visit the project webpage: https://daniel03c1.github.io/ProJo4D/
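The contrast between sequential, fully joint, and progressive joint optimization comes down to which parameter groups are unfrozen at each stage. The sketch below builds such a stage schedule from per-group sensitivity scores. The scores, the group names, and the choice to unfreeze the most sensitive group first are all assumptions made for illustration; the abstract says only that the growth of the set is "guided by their sensitivity".

```python
def progressive_schedule(sensitivity):
    """Return a list of stages; each stage lists the parameter groups optimized jointly."""
    # Assumed ordering: highest sensitivity first. The paper may order differently.
    order = sorted(sensitivity, key=sensitivity.get, reverse=True)
    return [order[:k] for k in range(1, len(order) + 1)]


# Hypothetical sensitivity scores, not taken from the paper.
scores = {"geometry": 0.9, "physical_state": 0.6, "appearance": 0.4, "material": 0.1}
stages = progressive_schedule(scores)
```

Each stage keeps every previously unfrozen group trainable, so the final stage is the fully joint problem. A purely sequential scheme would instead optimize one group per stage and freeze it afterwards, which is where the error accumulation described above comes from.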

[113] Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs

Haoyuan Li,Yanpeng Zhou,Yufei Gao,Tao Tang,Jianhua Han,Yujie Yuan,Dave Zhenyu Chen,Jiawang Bian,Hang Xu,Xiaodan Liang

Main category: cs.CV

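TL;DR: The paper examines performance problems in 3D vision-language models (VLMs), finds that 3D scene-centric models rely only weakly on the 3D encoder, and introduces a new dataset to improve 3D understanding.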

Details Motivation: Investigates why 3D VLMs underperform 2D VLMs and proposes ways to improve them. Method: Categorizes 3D VLMs into 3D object-centric, 2D image-based, and 3D scene-centric approaches and analyzes the performance differences among them. Result: 3D scene-centric models rely too little on the 3D encoder, benefit less from pre-training, and tend to overfit to linguistic cues. Conclusion: Introduces a new dataset to encourage genuine 3D scene understanding and stresses the need for better evaluation and training strategies. Abstract: Remarkable progress in 2D Vision-Language Models (VLMs) has spurred interest in extending them to 3D settings for tasks like 3D Question Answering, Dense Captioning, and Visual Grounding. Unlike 2D VLMs that typically process images through an image encoder, 3D scenes, with their intricate spatial structures, allow for diverse model architectures. Based on their encoder design, this paper categorizes recent 3D VLMs into 3D object-centric, 2D image-based, and 3D scene-centric approaches. Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited lower performance than the latest 3D object-centric and 2D image-based approaches. To understand this gap, we conduct an in-depth analysis, revealing that 3D scene-centric VLMs show limited reliance on the 3D scene encoder, and the pre-train stage appears less effective than in 2D VLMs. Furthermore, we observe that data scaling benefits are less pronounced on larger datasets. Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions, thereby diminishing the effective utilization of the 3D encoder. To address these limitations and encourage genuine 3D scene understanding, we introduce a novel 3D Relevance Discrimination QA dataset designed to disrupt shortcut learning and improve 3D understanding. Our findings highlight the need for advanced evaluation and improved strategies for better 3D understanding in 3D VLMs.

[114] Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting

Duochao Shi,Weijie Wang,Donny Y. Chen,Zeyu Zhang,Jia-Wang Bian,Bohan Zhuang,Chunhua Shen

Main category: cs.CV

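TL;DR: The paper proposes PM-Loss, a pointmap-based regularization loss that improves the geometric smoothness of depth maps at object boundaries and thereby raises the rendering quality of 3D Gaussian Splatting.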

Details Motivation: Depth maps are commonly used in 3D Gaussian Splatting to produce 3D point clouds, but depth discontinuities at object boundaries lead to sparse or fragmented point clouds and degrade rendering quality. Method: Introduces PM-Loss, which regularizes the depth map with a pointmap predicted by a pre-trained transformer to enforce geometric smoothness. Result: The improved depth maps substantially improve 3D Gaussian Splatting rendering across a range of architectures and scenes. Conclusion: By enforcing geometric smoothness, PM-Loss addresses the limitation of depth maps at object boundaries and improves rendering quality. Abstract: Depth maps are widely used in feed-forward 3D Gaussian Splatting (3DGS) pipelines by unprojecting them into 3D point clouds for novel view synthesis. This approach offers advantages such as efficient training, the use of known camera poses, and accurate geometry estimation. However, depth discontinuities at object boundaries often lead to fragmented or sparse point clouds, degrading rendering quality, a well-known limitation of depth-based representations. To tackle this issue, we introduce PM-Loss, a novel regularization loss based on a pointmap predicted by a pre-trained transformer. Although the pointmap itself may be less accurate than the depth map, it effectively enforces geometric smoothness, especially around object boundaries. With the improved depth map, our method significantly improves the feed-forward 3DGS across various architectures and scenes, delivering consistently better rendering results. Our project page: https://aim-uofa.github.io/PMLoss
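Unprojecting a depth map into a point cloud, the step this entry builds on, follows the standard pinhole camera model: a pixel (u, v) with depth d maps to ((u - cx) d / fx, (v - cy) d / fy, d) in camera coordinates. The sketch below applies that formula to a tiny depth map; the intrinsics and depth values are made up for illustration, and PM-Loss itself is not implemented here.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Convert a row-major depth map into camera-space 3D points (pinhole model)."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d > 0:  # skip invalid or missing depth
                points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


# One image row: two foreground pixels at depth 1 and a background pixel at depth 5.
depth_row = [[1.0, 1.0, 5.0]]
cloud = unproject_depth(depth_row, fx=1.0, fy=1.0, cx=1.0, cy=0.0)
```

In this toy row the two foreground points sit one unit apart, while the neighboring background pixel lands far away in both x and z. Gaps like this at depth discontinuities are what produce the fragmented or sparse point clouds mentioned in the abstract.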

[115] AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Lidong Lu,Guo Chen,Zhiqi Li,Yicheng Liu,Tong Lu

Main category: cs.CV

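TL;DR: The paper introduces the CG-AV-Counting benchmark and the AV-Reasoner model to improve counting in videos, achieving state-of-the-art results on multiple benchmarks.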

Details Motivation: Current multimodal large language models perform poorly on counting tasks, and existing benchmarks suffer from short videos, closed-set queries, and missing clue annotations. Method: Introduces the CG-AV-Counting benchmark with 1,027 multimodal questions and 5,845 annotated clues, and proposes the AV-Reasoner model, trained with GRPO and curriculum learning. Result: AV-Reasoner achieves state-of-the-art results on multiple benchmarks, but reasoning in the language space brings no gains on out-of-domain benchmarks. Conclusion: CG-AV-Counting and AV-Reasoner are effective tools for counting tasks, but out-of-domain performance needs further study. Abstract: Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve a model's counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.

[116] MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Xinyan Chen,Renrui Zhang,Dongzhi Jiang,Aojun Zhou,Shilin Yan,Weifeng Lin,Hongsheng Li

Main category: cs.CV

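TL;DR: MINT-CoT proposes a new visual reasoning method that dynamically interleaves visual tokens into textual reasoning steps, addressing key challenges in multimodal mathematical reasoning.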

Details Motivation: Existing attempts to extend Chain-of-Thought (CoT) to multimodal domains face three main limitations: reliance on coarse-grained image regions, vision encoders with limited perception of math content, and dependence on external capabilities for visual modification. Method: MINT-CoT uses an Interleave Token to dynamically select visual regions of any shape within math figures, and builds a dataset of 54K math problems. Training proceeds in three stages: text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL. Result: MINT-CoT-7B outperforms the baseline model by 34.08% on MathVista, 28.78% on GeoQA, and 23.2% on MMStar. Conclusion: MINT-CoT performs strongly in multimodal mathematical reasoning and offers an effective approach to reasoning that combines vision and text. Abstract: Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT

[117] Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Jingyang Lin,Jialian Wu,Ximeng Sun,Ze Wang,Jiang Liu,Yusheng Su,Xiaodong Yu,Hao Chen,Jiebo Luo,Zicheng Liu,Emad Barsoum

Main category: cs.CV

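TL;DR: VideoMarathon is a large-scale long-video instruction-following dataset with about 9,700 hours of long videos and 3.3M high-quality QA pairs, supporting 22 tasks. Built on it, the Hour-LLaVA model performs strongly in long video-language modeling.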

Details Motivation: The scarcity of datasets for long video-language understanding has held back the development of large video models, so this gap needs to be filled. Method: Presents the VideoMarathon dataset and builds the Hour-LLaVA model on top of it, using a memory augmentation module to enable training and inference on long videos. Result: Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, validating both the dataset and the model. Conclusion: VideoMarathon and Hour-LLaVA provide high-quality data and an efficient solution for long video-language modeling. Abstract: Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LLMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporal-informative semantics from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.

[118] VideoMolmo: Spatio-Temporal Grounding Meets Pointing

Ghazi Shazan Ahmad,Ahmed Heakl,Hanan Gani,Abdelrahman Shaker,Zhiqiang Shen,Ranjay Krishna,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

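TL;DR: VideoMolmo is a large multimodal model for fine-grained spatio-temporal pointing conditioned on text descriptions; it combines an LLM with a temporal module to improve accuracy and consistency.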

Details Motivation: Existing video localization methods lack the reasoning capabilities of language models, which limits their contextual understanding and generalization. Method: Builds on the Molmo architecture with a temporal module, using an attention mechanism and SAM2-based bidirectional point propagation to ensure temporal consistency. Result: Substantially improves spatio-temporal pointing accuracy and reasoning capability across several real-world scenarios and benchmarks. Conclusion: By simplifying the task and improving interpretability, VideoMolmo provides an efficient solution for spatio-temporal localization. Abstract: Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Our code and models are publicly available at https://github.com/mbzuai-oryx/VideoMolmo.

[119] Defurnishing with X-Ray Vision: Joint Removal of Furniture from Panoramas and Mesh

Alan Dolhasz,Chen Ma,Dave Gausebeck,Kevin Chen,Gregor Miller,Lucas Hayne,Gunnar Hovden,Azwad Sabik,Olaf Brandt,Mira Slavcheva

Main category: cs.CV

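TL;DR: Proposes a pipeline that generates defurnished indoor spaces from textured meshes and multi-view panoramic images, achieving high-quality results through mesh simplification, edge extraction, and ControlNet inpainting.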

Details Motivation: Existing methods (such as neural radiance fields or RGB-D inpainting) suffer from blur or hallucinations when generating defurnished scenes, so a higher-quality solution is needed. Method: 1. Segment and remove furniture from the mesh to obtain a simplified defurnished mesh (SDM); 2. Extract Canny edges from the SDM; 3. Inpaint the panoramic images with ControlNet; 4. Re-texture the mesh with the inpainted images. Result: Compared with neural radiance fields and RGB-D inpainting, the method produces higher-quality defurnished scenes and avoids blur and hallucinations. Conclusion: By combining geometric guidance with image inpainting, the method markedly improves the quality of generated defurnished scenes. Abstract: We present a pipeline for generating defurnished replicas of indoor spaces represented as textured meshes and corresponding multi-view panoramic images. To achieve this, we first segment and remove furniture from the mesh representation, extend planes, and fill holes, obtaining a simplified defurnished mesh (SDM). This SDM acts as an "X-ray" of the scene's underlying structure, guiding the defurnishing process. We extract Canny edges from depth and normal images rendered from the SDM. We then use these as a guide to remove the furniture from panorama images via ControlNet inpainting. This control signal ensures the availability of global geometric information that may be hidden from a particular panoramic view by the furniture being removed. The inpainted panoramas are used to texture the mesh. We show that our approach produces higher quality assets than methods that rely on neural radiance fields, which tend to produce blurry low-resolution images, or RGB-D inpainting, which is highly susceptible to hallucinations.

[120] Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning

Xingjian Ran,Yixuan Li,Linning Xu,Mulin Yu,Bo Dai

Main category: cs.CV

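TL;DR: DirectLayout is an LLM-based framework that generates 3D indoor scene layouts directly from text descriptions, addressing the shortcomings of existing methods in open-vocabulary generation and alignment with fine-grained user instructions.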

Details Motivation: Layout generation for 3D indoor scene synthesis is challenging because datasets are limited; existing methods either overfit or depend on predefined constraints at the cost of flexibility. Method: DirectLayout generates layouts in three stages (Bird's-Eye View generation, lifting into 3D space, and object placement refinement), combining Chain-of-Thought Activation with a reward mechanism to strengthen spatial reasoning. Result: Experiments show that DirectLayout performs well in semantic consistency, generalization, and physical plausibility. Conclusion: By generating and refining layouts directly, DirectLayout markedly improves the flexibility and quality of 3D scene synthesis. Abstract: Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layouts, which sacrifices flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using the generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design a CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization, and physical plausibility.

[121] Refer to Anything with Vision-Language Prompts

Shengcao Cao,Zijun Wei,Jason Kuen,Kangning Liu,Lingzhi Zhang,Jiuxiang Gu,HyunJoon Jung,Liang-Yan Gui,Yu-Xiong Wang

Main category: cs.CV

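TL;DR: The paper introduces a new task, omnimodal referring expression segmentation (ORES), and proposes a framework called RAS that strengthens segmentation models through multimodal interaction.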

Details Motivation: Existing image segmentation models cannot handle complex queries that combine language and vision, which limits their use in user-friendly interaction. Method: Proposes the RAS framework, which uses a mask-centric large multimodal model to enhance the multimodal interaction and comprehension of segmentation models. Result: RAS shows superior performance on the ORES task as well as on the classic RES and GRES tasks. Conclusion: The RAS framework provides an effective solution for image segmentation driven by multimodal interaction. Abstract: Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: https://Ref2Any.github.io.

[122] ContentV: Efficient Training of Video Generation Models with Limited Compute

Wenfeng Lin,Renjie Chen,Boyuan Liu,Shiyue Yan,Ruoyu Feng,Jiangchuan Wei,Yichen Zhang,Yimeng Zhou,Chao Feng,Jiao Ran,Qi Wu,Zuotao Liu,Mingyu Guo

Main category: cs.CV

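TL;DR: ContentV is an 8B-parameter text-to-video model that achieves efficient training and high-quality video generation through three innovations; it was trained in only four weeks on 256 NPUs and reaches state-of-the-art performance.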

Details Motivation: The training cost of video generation keeps rising, so more efficient training methods are needed to reduce it. Method: (1) A minimalist architecture that reuses pre-trained image generation models; (2) a multi-stage training strategy that uses flow matching to improve efficiency; (3) a low-cost reinforcement learning framework that improves generation quality without additional human annotations. Result: Scores 85.14 on VBench and supports video generation at multiple resolutions and durations. Conclusion: ContentV achieves efficient, high-quality video generation through these techniques; the code and models are open source. Abstract: Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io.

[123] SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

Jiahui Wang,Zuyan Liu,Yongming Rao,Jiwen Lu

Main category: cs.CV

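TL;DR: The study finds that only a small fraction of attention heads (about 5%) in multimodal large language models (MLLMs) contribute to visual understanding, proposes a training-free framework to identify these heads, and builds on it the KV-Cache optimization strategy SparseMM, which significantly improves inference speed and memory usage.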

Details Motivation: To explore how MLLMs process visual inputs and to expose the sparsity in their attention mechanisms, with the aim of improving model efficiency. Method: Identifies visual heads by analyzing the attention mechanism, designs a training-free framework that quantifies each head's visual relevance, and develops the KV-Cache optimization strategy SparseMM. Result: On mainstream multimodal benchmarks, SparseMM achieves 1.38x real-time acceleration and a 52% memory reduction while maintaining performance. Conclusion: The sparsity of visual heads opens a new direction for MLLM efficiency optimization, and SparseMM demonstrates significant performance gains. Abstract: Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than approximately 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads for accelerating the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual inputs, SparseMM prioritizes retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity on the efficiency test. Our project is open sourced at https://github.com/CR400AF-A/SparseMM.
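The asymmetric budget idea in the SparseMM abstract can be illustrated with a minimal sketch. The abstract does not state the allocation rule, so everything below is an assumption made for illustration: the function name `allocate_kv_budget`, the uniform per-head floor, and the score-proportional split of the remaining budget are hypothetical and are not taken from the paper or its repository.

```python
def allocate_kv_budget(visual_scores, total_budget, min_per_head):
    """Split a total KV-cache token budget across attention heads.

    Hypothetical rule (an assumption, not the paper's): every head gets a
    uniform floor, and the remaining tokens are shared in proportion to each
    head's visual score.
    """
    n = len(visual_scores)
    if total_budget < n * min_per_head:
        raise ValueError("total_budget is too small for the per-head floor")

    remainder = total_budget - n * min_per_head
    score_sum = sum(visual_scores)
    if score_sum <= 0:
        # No head stands out: fall back to an even split.
        shares = [remainder / n] * n
    else:
        shares = [remainder * s / score_sum for s in visual_scores]

    # Round each share down, then give the tokens lost to rounding
    # to the highest-scoring heads, one token each.
    budgets = [min_per_head + int(share) for share in shares]
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: visual_scores[i], reverse=True)
    for i in order[:max(leftover, 0)]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # Four heads, one of which has a much higher (made-up) visual score.
    print(allocate_kv_budget([5, 3, 1, 1], total_budget=100, min_per_head=10))
```

Under this rule a head with a high visual score keeps more cached tokens than a head with a low score, while the floor keeps every head from being starved entirely.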

[124] Neural Inverse Rendering from Propagating Light

Anagh Malik,Benjamin Attal,Andrew Xie,Matthew O'Toole,David B. Lindell

Main category: cs.CV

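TL;DR: The first physically based neural inverse rendering system that captures propagating light from multi-viewpoint videos and, using time-resolved neural radiance caching, achieves high-accuracy 3D reconstruction and decomposition of light transport effects.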

Details Motivation: To address the challenge of 3D reconstruction under strong indirect light and to capture and decompose propagating light. Method: Extends neural radiance caching to the time-resolved domain and combines it with multi-viewpoint video data to capture light propagation. Result: Achieves high-accuracy 3D reconstruction and supports view synthesis of propagating light, decomposition into direct and indirect components, and multi-view time-resolved relighting. Conclusion: The system provides a new tool for capturing and analyzing propagating light and performs especially well under complex lighting conditions. Abstract: We present the first system for physically based, neural inverse rendering from multi-viewpoint videos of propagating light. Our approach relies on a time-resolved extension of neural radiance caching -- a technique that accelerates inverse rendering by storing infinite-bounce radiance arriving at any point from any direction. The resulting model accurately accounts for direct and indirect light transport effects and, when applied to captured measurements from a flash lidar system, enables state-of-the-art 3D reconstruction in the presence of strong indirect light. Further, we demonstrate view synthesis of propagating light, automatic decomposition of captured measurements into direct and indirect components, as well as novel capabilities such as multi-view time-resolved relighting of captured scenes.

[125] FreeTimeGS: Free Gaussians at Anytime and Anywhere for Dynamic Scene Reconstruction

Yifan Wang,Peishan Yang,Zhen Xu,Jiaming Sun,Zhanhua Zhang,Yong Chen,Hujun Bao,Sida Peng,Xiaowei Zhou

Main category: cs.CV

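TL;DR: Proposes FreeTimeGS, a new 4D representation for reconstructing dynamic 3D scenes with complex motions; it lets Gaussian primitives appear at arbitrary times and locations and gives each one a motion function, which significantly improves rendering quality.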

Details Motivation: Existing methods perform poorly on dynamic 3D scenes with complex motions because deformation fields are difficult to optimize. Method: Proposes FreeTimeGS, a 4D representation that lets Gaussian primitives appear at arbitrary times and locations and gives each one a motion function to reduce temporal redundancy. Result: Experiments show that the method's rendering quality is significantly better than that of existing methods on several datasets. Conclusion: With its flexible 4D representation and motion functions, FreeTimeGS improves the ability to reconstruct dynamic 3D scenes. Abstract: This paper addresses the challenge of reconstructing dynamic 3D scenes with complex motions. Some recent works define 3D Gaussian primitives in the canonical space and use deformation fields to map canonical primitives to observation spaces, achieving real-time dynamic view synthesis. However, these methods often struggle to handle scenes with complex motions due to the difficulty of optimizing deformation fields. To overcome this problem, we propose FreeTimeGS, a novel 4D representation that allows Gaussian primitives to appear at arbitrary times and locations. In contrast to canonical Gaussian primitives, our representation offers strong flexibility, thus improving the ability to model dynamic 3D scenes. In addition, we endow each Gaussian primitive with a motion function, allowing it to move to neighboring regions over time, which reduces the temporal redundancy. Experimental results on several datasets show that the rendering quality of our method outperforms recent methods by a large margin.

[126] VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Hanoona Rasheed,Abdelrahman Shaker,Anqi Tang,Muhammad Maaz,Ming-Hsuan Yang,Salman Khan,Fahad Khan

Main category: cs.CV

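TL;DR: VideoMathQA is a new benchmark for evaluating cross-modal mathematical reasoning in videos; it covers 10 mathematical domains and emphasizes temporally extended, multimodal integration.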

Details Motivation: Mathematical reasoning in real-world videos differs from reasoning over static images or text: it requires integrating visual, audio, and textual information, and existing methods fall short on such tasks. Method: Designs the VideoMathQA benchmark, with diverse videos and question types annotated by experts, covering direct problem solving, conceptual transfer, and deep instructional comprehension. Result: The benchmark provides high-quality annotations, exposes the limitations of existing methods, and offers a systematic evaluation framework. Conclusion: VideoMathQA is an important tool for cross-modal mathematical reasoning and supports the development of models for complex video settings. Abstract: Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities.
Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA

[127] Contrastive Flow Matching

George Stoica,Vivek Ramanujan,Xiang Fan,Ali Farhadi,Ranjay Krishna,Judy Hoffman

Main category: cs.CV

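TL;DR: The paper proposes Contrastive Flow Matching, a method that addresses the loss of flow uniqueness in conditional settings and improves both generation quality and training efficiency.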

Details Motivation: In conditional settings (e.g., class-conditioned models), the uniqueness of flows in flow matching is no longer guaranteed, which leads to ambiguous generations; the paper aims to strengthen condition separation through Contrastive Flow Matching. Method: Adds a contrastive objective that maximizes the dissimilarity between the predicted flows of sample pairs, so that conditional flows are explicitly separated. Result: On the ImageNet-1k and CC3M benchmarks, Contrastive Flow Matching improves training speed by up to 9x, requires up to 5x fewer denoising steps, and lowers FID by up to 8.9. Conclusion: Contrastive Flow Matching is an effective extension that significantly improves the performance and efficiency of conditional generative models. Abstract: Unconditional flow-matching trains diffusion models to transport samples from a source distribution to a target distribution by enforcing that the flows between sample pairs are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed--flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying model architectures on both class-conditioned (ImageNet-1k) and text-to-image (CC3M) benchmarks. Notably, we find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer de-noising steps and (3) lowers FID by up to 8.9 compared to training the same models with flow matching. We release our code at: https://github.com/gstoica27/DeltaFM.git.
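To make the idea of a contrastive term on top of flow matching concrete, here is a minimal sketch. The abstract only says that the added objective maximizes dissimilarities between predicted flows from arbitrary sample pairs; the specific form below (a squared-error flow matching term minus a weighted squared distance to the target flow of another, randomly paired sample) is an assumption, and the names `contrastive_fm_loss`, `negative_flows`, and `weight` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_fm_loss(pred_flows, target_flows, negative_flows, weight=0.05):
    """Batch mean of ||v - u||^2 - weight * ||v - u_neg||^2.

    pred_flows:     model-predicted flows v, one vector per sample
    target_flows:   the matching target flows u (plain flow matching term)
    negative_flows: target flows u_neg of other samples in the batch
                    (assumed choice of "negative"; not given in the abstract)
    weight:         hypothetical trade-off coefficient
    """
    total = 0.0
    for v, u, u_neg in zip(pred_flows, target_flows, negative_flows):
        attract = squared_distance(v, u)     # pull toward the true flow
        repel = squared_distance(v, u_neg)   # push away from another flow
        total += attract - weight * repel
    return total / len(pred_flows)


if __name__ == "__main__":
    pred = [[1.0, 0.0]]
    target = [[1.0, 0.0]]
    negative = [[0.0, 1.0]]
    print(contrastive_fm_loss(pred, target, negative, weight=0.5))
```

Setting `weight` to zero recovers the ordinary flow matching loss in this sketch, which makes the contrastive term easy to ablate.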

cs.GR [Back]

[128] SSIMBaD: Sigma Scaling with SSIM-Guided Balanced Diffusion for AnimeFace Colorization

Junpyo Seo,Hanbin Koo,Jieun Yook,Byung-Ro Moon

Main category: cs.GR

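TL;DR: Proposes a diffusion-based framework for automatic colorization of anime-style facial sketches, using the SSIMBaD technique to achieve structural fidelity and style transfer.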

Details Motivation: Traditional methods rely on predefined noise schedules, which can harm perceptual consistency, so a more balanced and faithful approach is needed. Method: Builds on continuous-time diffusion models and introduces SSIMBaD, which applies a sigma-space transformation so that perceptual degradation is aligned linearly. Result: On a large-scale anime face dataset, the method outperforms existing techniques in both pixel accuracy and perceptual quality. Conclusion: The SSIMBaD framework performs well on anime face colorization and adapts to a wide range of styles. Abstract: We propose a novel diffusion-based framework for automatic colorization of Anime-style facial sketches. Our method preserves the structural fidelity of the input sketch while effectively transferring stylistic attributes from a reference image. Unlike traditional approaches that rely on predefined noise schedules -- which often compromise perceptual consistency -- our framework builds on continuous-time diffusion models and introduces SSIMBaD (Sigma Scaling with SSIM-Guided Balanced Diffusion). SSIMBaD applies a sigma-space transformation that aligns perceptual degradation, as measured by structural similarity (SSIM), in a linear manner. This scaling ensures uniform visual difficulty across timesteps, enabling more balanced and faithful reconstructions. Experiments on a large-scale Anime face dataset demonstrate that our method outperforms state-of-the-art models in both pixel accuracy and perceptual quality, while generalizing to diverse styles. Code is available at github.com/Giventicket/SSIMBaD-Sigma-Scaling-with-SSIM-Guided-Balanced-Diffusion-for-AnimeFace-Colorization

[129] Handle-based Mesh Deformation Guided By Vision Language Model

Xingpeng Sun,Shiyang Jia,Zherong Pan,Kui Wu,Aniket Bera

Main category: cs.GR

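TL;DR: Proposes a training-free, handle-based mesh deformation method that uses a vision-language model (VLM), driven by prompt engineering, to produce high-quality deformations.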

Details Motivation: Existing mesh deformation methods suffer from low output quality, require manual tuning, or depend on data-driven training. Method: Identifies handles through cone singularity detection, uses a VLM to select the deformable sub-parts and handles, and reduces uncertainty through multi-view voting. Result: Across several benchmarks, the method produces deformations that match user intent more closely while introducing low distortion. Conclusion: The method is training-free, highly automated, and consistently delivers high-quality mesh deformations. Abstract: Mesh deformation is a fundamental tool in 3D content manipulation. Despite extensive prior research, existing approaches often suffer from low output quality, require significant manual tuning, or depend on data-intensive training. To address these limitations, we introduce a training-free, handle-based mesh deformation method. Our core idea is to leverage a Vision-Language Model (VLM) to interpret and manipulate a handle-based interface through prompt engineering. We begin by applying cone singularity detection to identify a sparse set of potential handles. The VLM is then prompted to select both the deformable sub-parts of the mesh and the handles that best align with user instructions. Subsequently, we query the desired deformed positions of the selected handles in screen space. To reduce uncertainty inherent in VLM predictions, we aggregate the results from multiple camera views using a novel multi-view voting scheme. Across a suite of benchmarks, our method produces deformations that align more closely with user intent, as measured by CLIP and GPTEval3D scores, while introducing low distortion -- quantified via membrane energy. In summary, our approach is training-free, highly automated, and consistently delivers high-quality mesh deformations.

[130] VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection

Wuyang Li,Zhu Yu,Alexandre Alahi

Main category: cs.GR

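TL;DR: The paper proposes VoxDet, an instance-aware method for 3D semantic occupancy prediction that decouples voxel-level classification into two sub-tasks, offset regression and semantic prediction, and significantly improves instance-level completeness and discriminability.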

Details Motivation: Existing methods treat 3D semantic occupancy prediction as a dense segmentation task and neglect instance-level discriminability, which leads to incomplete instances and ambiguity between adjacent objects. The paper observes that voxel-level labels implicitly carry instance information and proposes to exploit this free information to improve prediction. Method: Proposes the VoxDet framework, consisting of a spatially-decoupled voxel encoder and a task-decoupled dense predictor, which achieves instance-aware 3D occupancy prediction through offset regression and semantic prediction. Result: VoxDet achieves state-of-the-art results with both camera and LiDAR input, reaching 63.0 IoU on the SemanticKITTI test set and ranking first. Conclusion: Through its instance-aware design, VoxDet significantly improves 3D semantic occupancy prediction and confirms the value of the instance information implicit in voxel-level labels. Abstract: 3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel. However, this paradigm neglects critical instance-centric discriminability, leading to instance-level incompleteness and adjacent ambiguities. To address this, we highlight a free lunch of occupancy labels: the voxel-level class label implicitly provides insight at the instance level, which is overlooked by the community. Motivated by this observation, we first introduce a training-free Voxel-to-Instance (VoxNT) trick: a simple yet effective method that freely converts voxel-level class labels into instance-level offset labels. Building on this, we further propose VoxDet, an instance-centric framework that reformulates the voxel-level occupancy prediction as dense object detection by decoupling it into two sub-tasks: offset regression and semantic prediction. Specifically, based on the lifted 3D volume, VoxDet first uses (a) Spatially-decoupled Voxel Encoder to generate disentangled feature volumes for the two sub-tasks, which learn task-specific spatial deformation in the densely projected tri-perceptive space. Then, we deploy (b) Task-decoupled Dense Predictor to address this task via dense detection. Here, we first regress a 4D offset field to estimate distances (6 directions) between voxels and object borders in the voxel space. The regressed offsets are then used to guide the instance-level aggregation in the classification branch, achieving instance-aware prediction.
Experiments show that VoxDet can be deployed on both camera and LiDAR input, jointly achieving state-of-the-art results on both benchmarks. VoxDet is not only highly efficient, but also achieves 63.0 IoU on the SemanticKITTI test set, ranking 1st on the online leaderboard.

[131] A Fast Unsupervised Scheme for Polygonal Approximation

Bimal Kumar Ray

Main category: cs.GR

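TL;DR: The paper proposes a fast, unsupervised scheme for polygonal approximation of closed digital curves; it is faster than existing techniques and competitive with them in Rosin's measure and in aesthetic quality.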

Details Motivation: Existing polygonal approximation methods fall short in speed and aesthetic quality, so a more efficient and visually pleasing solution is needed. Method: The scheme has three phases, initial segmentation, iterative vertex insertion, and iterative merging, followed by vertex adjustment. Initial segmentation detects high-curvature vertices, iterative insertion adds low-curvature vertices, merging removes redundant vertices, and adjustment improves the aesthetic result. Result: The scheme performs well under Rosin's measure and is robust to geometric transformations. Conclusion: The proposed scheme improves on existing methods in speed and aesthetics and is suited to polygonal approximation of closed digital curves. Abstract: This paper proposes a fast and unsupervised scheme for a polygonal approximation of a closed digital curve. It is demonstrated that the approximation scheme is faster than state-of-the-art approximation and is competitive with the same in Rosin's measure and in its aesthetic aspect. The scheme comprises three phases: initial segmentation, iterative vertex insertion, and iterative merging, followed by vertex adjustment. The initial segmentation is used to detect sharp turnings - the vertices that seemingly have high curvature. It is likely that some important vertices with low curvature might have been missed in the first phase, and so iterative vertex insertion is used to add vertices in a region where the curvature changes slowly but steadily. The initial phase may pick up some undesirable vertices, and so merging is used to eliminate the redundant vertices. Finally, vertex adjustment is used to enhance the aesthetic look of the approximation. The quality of the approximations is measured using Rosin's measure. The robustness of the proposed scheme with respect to geometric transformation is observed.

[132] Midplane based 3D single pass unbiased segment-to-segment contact interaction using penalty method

Indrajeet Sahu,Nik Petrinic

Main category: cs.GR

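TL;DR: Proposes an unbiased contact interaction method that avoids designating master and slave surfaces; contact forces are computed in a single pass with respect to a midplane, which preserves force equilibrium, and the method applies to a variety of geometries and dynamic problems.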

Details Motivation: Traditional contact methods require designating master and slave surfaces, which can introduce bias. The paper aims to provide a fairer, more efficient, and more accurate way to compute contact forces. Method: A single-pass, midplane-based computation that evaluates contact forces by penalizing true interpenetration, with a detailed analysis of 3D geometric configurations to improve accuracy. Result: The method's accuracy and robustness are validated on a range of contact problems (flat, curved, and sharp-corner contact), and it supports nonconformal meshes and high-accuracy dynamic problems. Conclusion: The method performs well across many contact scenarios, is broadly applicable and highly accurate, and suits both static and dynamic problems. Abstract: This work introduces a contact interaction methodology for an unbiased treatment of contacting surfaces without assigning surfaces as master and slave. The contact tractions between interacting discrete segments are evaluated with respect to a midplane in a single pass, inherently maintaining the equilibrium of tractions. These tractions are based on the penalisation of true interpenetration between opposite surfaces, and the procedure of their integral for discrete contacting segments is described in this paper. A meticulous examination of the different possible geometric configurations of interacting 3D segments is presented to develop visual understanding and better traction evaluation accuracy. The accuracy and robustness of the proposed method are validated against the analytical solutions of the contact patch test, two-beam bending, Hertzian contact, and flat punch test, thus proving the capability to reproduce contact between flat surfaces, curved surfaces, and sharp corners in contact, respectively. The method passes the contact patch test with the uniform transmission of contact pressure matching the accuracy levels of finite elements. It converges towards the analytical solution with mesh refinement and a suitably high penalty factor. The effectiveness of the proposed algorithm also extends to self-contact problems and has been tested for self-contact between flat and curved surfaces with inelastic material. Dynamic problems of elastic and inelastic collisions between bars, as well as oblique collisions of cylinders, are also presented. The ability of the algorithm to resolve contacts between flat and curved surfaces for nonconformal meshes with high accuracy demonstrates its versatility in general contact problems.

[133] Towards the target and not beyond: 2d vs 3d visual aids in mr-based neurosurgical simulation

Pasquale Cascarano,Andrea Loretti,Matteo Martinoni,Luca Zanuttini,Alessio Di Pasquale,Gustavo Marfia

Main category: cs.GR

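TL;DR: NeuroMix is a mixed reality (MR) based simulator for training External Ventricular Drain (EVD) placement. The study shows that training with combined 2D and 3D visual aids significantly improves surgical precision in unaided conditions without affecting cognitive load or technology acceptance.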

Details Motivation: In neurosurgery, reconstructing complex 3D anatomy from 2D slices is challenging, and the limited availability of MR technology in clinical settings highlights the need for training systems that foster skill retention in unaided conditions. Method: The study compares three training modalities: no visual aids, 2D aids only, and combined 2D and 3D aids. After training on digital objects, 48 participants performed a freehand EVD placement test without aids. Result: The group trained with combined 2D and 3D aids was 44% more precise than the control group in unaided testing, and all training modalities received high usability and technology acceptance ratings. Conclusion: Training with combined 2D and 3D visual aids significantly improves surgical precision without affecting cognitive load or technology acceptance, although operation times are longer. Abstract: Neurosurgery increasingly uses Mixed Reality (MR) technologies for intraoperative assistance. The greatest challenge in this area is mentally reconstructing complex 3D anatomical structures from 2D slices with millimetric precision, which is required in procedures like External Ventricular Drain (EVD) placement. MR technologies have shown great potential in improving surgical performance; however, their limited availability in clinical settings underscores the need for training systems that foster skill retention in unaided conditions. In this paper, we introduce NeuroMix, an MR-based simulator for EVD placement. We conduct a study with 48 participants to assess the impact of 2D and 3D visual aids on usability, cognitive load, technology acceptance, and procedure precision and execution time. Three training modalities are compared: one without visual aids, one with 2D aids only, and one combining both 2D and 3D aids. The training phase takes place entirely on digital objects, followed by a freehand EVD placement testing phase performed with a physical catheter and a physical phantom without MR aids. We then compare the participants' performance with that of a control group that does not undergo training. Our findings show that participants trained with both 2D and 3D aids achieve a 44% improvement in precision during unaided testing compared to the control group, substantially higher than the improvement observed in the other groups. All three training modalities receive high usability and technology acceptance ratings, with significant equivalence across groups.
The combination of 2D and 3D visual aids does not significantly increase cognitive workload, though it leads to longer operation times during freehand testing compared to the control group.

[134] Uniform Sampling of Surfaces by Casting Rays

Selena Ling,Abhishek Madan,Nicholas Sharp,Alec Jacobson

Main category: cs.GR

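TL;DR: Proposes a simple, general method based on the intersections of random rays with a surface for uniformly sampling points on implicit surfaces, with no need to extract an intermediate mesh.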

Details Motivation: In geometry processing, sampling points on explicit meshes is computationally simple, but sampling on other representations such as implicit surfaces is difficult. Method: Samples points at the intersections of random rays with the surface; the approach suits implicit signed distance functions, where sphere marching makes it efficient. Result: Experiments show that the method is uniform and efficient; it extends to blue noise and stratified sampling, with applications to deforming neural implicit surfaces and to moment estimation. Conclusion: The method provides an efficient and general solution for sampling implicit surfaces. Abstract: Randomly sampling points on surfaces is an essential operation in geometry processing. This sampling is computationally straightforward on explicit meshes, but it is much more difficult on other shape representations, such as widely-used implicit surfaces. This work studies a simple and general scheme for sampling points on a surface, which is derived from a connection to the intersections of random rays with the surface. Concretely, given a subroutine to cast a ray against a surface and find all intersections, we can use that subroutine to uniformly sample white noise points on the surface. This approach is particularly effective in the context of implicit signed distance functions, where sphere marching allows us to efficiently cast rays and sample points, without needing to extract an intermediate mesh. We analyze the basic method to show that it guarantees uniformity, and find experimentally that it is significantly more efficient than alternative strategies on a variety of representations. Furthermore, we show extensions to blue noise sampling and stratified sampling, and applications to deform neural implicit surfaces as well as moment estimation.
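The ray-casting idea can be sketched on a toy signed distance function. This is an illustration under stated assumptions, not the authors' implementation: the line distribution used here (an isotropic direction plus a uniform offset in the perpendicular disk of a bounding sphere) is one standard way to draw uniformly random lines, the marcher is a basic sphere-marching loop with a minimum step, and all function names and the `eps` value are hypothetical.

```python
import math
import random


def unit_sphere_sdf(p):
    """Signed distance to the unit sphere (toy surface for the demo)."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - 1.0


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def normalize(a):
    n = math.sqrt(sum(c * c for c in a))
    return [c / n for c in a]


def random_line(rng, bound_radius):
    """Draw a random line crossing the bounding sphere of given radius."""
    while True:
        d = [rng.gauss(0.0, 1.0) for _ in range(3)]
        if sum(c * c for c in d) > 1e-24:
            break
    d = normalize(d)                       # isotropic direction
    helper = [1.0, 0.0, 0.0] if abs(d[0]) < 0.9 else [0.0, 1.0, 0.0]
    u = normalize(cross(d, helper))        # u, v span the plane normal to d
    v = cross(d, u)
    r = bound_radius * math.sqrt(rng.random())   # uniform point in the disk
    phi = 2.0 * math.pi * rng.random()
    origin = [r * math.cos(phi) * u[i] + r * math.sin(phi) * v[i]
              - bound_radius * d[i] for i in range(3)]
    return origin, d


def cast_ray_all_hits(sdf, origin, d, t_max, eps=1e-3):
    """Sphere-march along the ray and record every sign change of the SDF."""
    hits = []
    t_prev, f_prev = 0.0, sdf(origin)
    while t_prev < t_max:
        t = t_prev + max(abs(f_prev), eps)   # safe step, never smaller than eps
        f = sdf([origin[i] + t * d[i] for i in range(3)])
        if (f_prev < 0.0) != (f < 0.0):
            # Linear interpolation of the zero crossing inside the last step.
            t_hit = t_prev + (t - t_prev) * f_prev / (f_prev - f)
            hits.append([origin[i] + t_hit * d[i] for i in range(3)])
        t_prev, f_prev = t, f
    return hits


def sample_surface(sdf, n_lines, bound_radius, seed=0):
    """Collect all ray-surface intersections of n_lines random lines."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_lines):
        origin, d = random_line(rng, bound_radius)
        points.extend(cast_ray_all_hits(sdf, origin, d, 2.0 * bound_radius))
    return points


if __name__ == "__main__":
    pts = sample_surface(unit_sphere_sdf, n_lines=200, bound_radius=1.5)
    print(len(pts), "points sampled on the unit sphere")
```

The number of returned points is random rather than fixed, since each line contributes as many points as it has intersections; a caller who needs exactly N samples would keep casting lines until enough points have been collected.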

cs.CL [Back]

[135] GEM: Empowering LLM for both Embedding Generation and Language Understanding

Caojin Zhang,Qiang Zhang,Ke Li,Sai Vidyaranya Nuthalapati,Benyu Zhang,Jason Liu,Serena Li,Lizhu Zhang,Xiangjun Fan

Main category: cs.CL

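TL;DR: The paper proposes GEM, a self-supervised method that enables decoder-only large language models (LLMs) to generate high-quality text embeddings while retaining their original text generation and reasoning capabilities.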

Details Motivation: To address the system complexity and inconsistent query understanding that arise when applications rely on a separate embedding model. Method: Generates a summarization embedding of the text by inserting special tokens and manipulating the attention mask; the method can be applied in the post-training or fine-tuning stage of any existing LLM. Result: Significantly improves performance on the MTEB text embedding benchmark while having minimal impact on the MMLU NLP benchmark. Conclusion: GEM gives LLMs advanced text embedding capabilities without harming their original NLP performance. Abstract: Large decoder-only language models (LLMs) have achieved remarkable success in generation and reasoning tasks, where they generate text responses given instructions. However, many applications, e.g., retrieval augmented generation (RAG), still rely on separate embedding models to generate text embeddings, which can complicate the system and introduce discrepancies in understanding of the query between the embedding model and LLMs. To address this limitation, we propose a simple self-supervised approach, Generative Embedding large language Model (GEM), that enables any large decoder-only LLM to generate high-quality text embeddings while maintaining its original text generation and reasoning capabilities. Our method inserts new special token(s) into a text body, and generates a summarization embedding of the text by manipulating the attention mask. This method could be easily integrated into post-training or fine-tuning stages of any existing LLMs. We demonstrate the effectiveness of our approach by applying it to two popular LLM families, ranging from 1B to 8B parameters, and evaluating the transformed models on both text embedding benchmarks (MTEB) and NLP benchmarks (MMLU). The results show that our proposed method significantly improves the original LLMs on MTEB while having a minimal impact on MMLU. Our strong results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.
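One way to picture "manipulating the attention mask" around inserted special tokens is the sketch below. The abstract does not describe the mask, so the layout is purely an assumption: the sequence is taken to be a text prefix, then the special token(s), then a suffix, and suffix tokens are blocked from attending to the prefix so that information has to pass through the special tokens. The function name `build_bottleneck_mask` and its parameters are hypothetical.

```python
def build_bottleneck_mask(seq_len, special_start, num_special):
    """Return a seq_len x seq_len boolean mask; mask[i][j] is True when
    position i may attend to position j.

    Assumed layout (not confirmed by the abstract):
      [0, special_start)                      text prefix
      [special_start, special_start + k)      k = num_special special tokens
      [special_start + k, seq_len)            suffix
    Rule: ordinary causal attention, except that suffix positions cannot
    see the prefix directly and must read it through the special tokens.
    """
    special_end = special_start + num_special
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            allowed = j <= i                      # causal constraint
            if i >= special_end and j < special_start:
                allowed = False                   # suffix cannot see prefix
            row.append(allowed)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 2 prefix tokens, 2 special tokens, 2 suffix tokens.
    for row in build_bottleneck_mask(6, special_start=2, num_special=2):
        print("".join("x" if a else "." for a in row))
```

Under this assumed rule, the hidden states at the special-token positions are natural candidates for the text embedding, because they are the only route by which the prefix can influence the suffix.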

[136] Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR

Zheng-Xin Yong,Vineel Pratap,Michael Auli,Jean Maillard

Main category: cs.CL

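TL;DR: Studies how the number of speakers, the audio duration, and the accent diversity in training data affect an ASR system's robustness to unseen accents, and finds that adding speakers is more effective than adding audio per speaker.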

Details Motivation: Building ASR systems that serve users worldwide requires better robustness to unseen accents. Method: Systematically studies how the number of speakers, the audio duration, and the accent diversity in training data affect ASR performance. Result: Increasing the number of speakers is more effective than increasing the audio duration per speaker, and accent diversity brings only limited gains. Conclusion: Recommends prioritizing a higher speaker count in ASR training data. Abstract: To build an automatic speech recognition (ASR) system that can serve everyone in the world, the ASR needs to be robust to a wide range of accents including unseen accents. We systematically study how three different variables in training data -- the number of speakers, the audio duration per individual speaker, and the diversity of accents -- affect ASR robustness towards unseen accents in a low-resource training regime. We observe that for a fixed number of ASR training hours, it is more beneficial to increase the number of speakers (which means each speaker contributes less) than the number of hours contributed per speaker. We also observe that more speakers enable ASR performance gains from scaling the number of hours. Surprisingly, we observe minimal benefits to prioritizing speakers with different accents when the number of speakers is controlled. Our work suggests that practitioners should prioritize increasing the speaker count in ASR training data composition for new languages.

[137] Mechanistic Decomposition of Sentence Representations

Matthieu Tehenan,Vikram Natarajan,Jonathan Michala,Milton Lin,Juri Opitz

Main category: cs.CL

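TL;DR: Proposes a new method that uses dictionary learning to decompose sentence embeddings into interpretable components, revealing their internal structure and features.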

Details Motivation: Sentence embeddings are central to modern NLP and AI systems, but their internal structure is opaque and hard to interpret. Method: Applies dictionary learning to token-level representations and analyzes how the pooling operation compresses features into sentence representations. Result: Finds that many semantic and syntactic features are linearly encoded in the embeddings, shedding light on the inner workings of sentence embedding spaces. Conclusion: The method makes sentence embeddings more transparent and controllable and offers a new perspective on their internal structure. Abstract: Sentence embeddings are central to modern NLP and AI systems, yet little is known about their internal structure. While we can compare these embeddings using measures such as cosine similarity, the contributing features are not human-interpretable, and the content of an embedding seems untraceable, as it is masked by complex neural transformations and a final pooling operation that combines individual token embeddings. To alleviate this issue, we propose a new method to mechanistically decompose sentence embeddings into interpretable components, by using dictionary learning on token-level representations. We analyze how pooling compresses these features into sentence representations, and assess the latent features that reside in a sentence embedding. This bridges token-level mechanistic interpretability with sentence-level analysis, making for more transparent and controllable representations. In our studies, we obtain several interesting insights into the inner workings of sentence embedding spaces, for instance, that many semantic and syntactic aspects are linearly encoded in the embeddings.

[138] Hierarchical Text Classification Using Contrastive Learning Informed Path Guided Hierarchy

Neeraj Agrawal,Saurabh Kumar,Priyanka Bhatt,Tanishka Agarwal

Main category: cs.CL

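TL;DR: Proposes HTC-CLIP, a model that combines contrastive learning with a path-guided hierarchy and jointly learns text and label-hierarchy representations, significantly improving classification performance.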

Details Motivation: Existing HTC models handle the label hierarchy and the text encoding separately and do not fully exploit how the two complement each other. Method: HTC-CLIP uses contrastive learning to jointly optimize the text representation and the path-guided hierarchy representation, and combines the two probability distributions at inference time. Result: On two public datasets, the Macro F1 score improves by 0.99-2.37% over the existing best models. Conclusion: HTC-CLIP effectively combines the strengths of the two existing approaches and significantly improves classification performance. Abstract: Hierarchical Text Classification (HTC) has recently gained traction given the ability to handle complex label hierarchy. This has found applications in domains like e-commerce, customer care and the medical industry, among other real-world applications. Existing HTC models either encode label hierarchy separately and mix it with text encoding or guide the label hierarchy structure in the text encoder. Both approaches capture different characteristics of label hierarchy and are complementary to each other. In this paper, we propose a Hierarchical Text Classification using Contrastive Learning Informed Path guided hierarchy (HTC-CLIP), which learns hierarchy-aware text representation and text informed path guided hierarchy representation using contrastive learning. During the training of HTC-CLIP, we learn two different sets of class probability distributions and during inference, we use the pooled output of both probabilities for each class to get the best of both representations. Our results show that the two previous approaches can be effectively combined into one architecture to achieve improved performance. Tests on two public benchmark datasets showed an improvement of 0.99 - 2.37% in Macro F1 score using HTC-CLIP over the existing state-of-the-art models.
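The inference step described in the HTC-CLIP abstract, pooling two sets of per-class probabilities, can be sketched as follows. The abstract does not say which pooling operator is used or how labels are selected, so the element-wise mean and the fixed decision threshold below are assumptions, and `pool_class_probabilities` is a hypothetical name.

```python
def pool_class_probabilities(probs_a, probs_b, threshold=0.5):
    """Combine two per-class probability lists and pick predicted labels.

    probs_a / probs_b: probabilities from the two branches (for example the
    hierarchy-aware text branch and the path-guided hierarchy branch).
    Pooling by mean and thresholding at 0.5 are illustrative assumptions.
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("both branches must score the same set of classes")
    pooled = [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]
    labels = [i for i, p in enumerate(pooled) if p >= threshold]
    return pooled, labels


if __name__ == "__main__":
    pooled, labels = pool_class_probabilities([0.9, 0.2, 0.6], [0.7, 0.4, 0.3])
    print(pooled, labels)
```

A class is predicted here only when the pooled score clears the threshold, so a confident vote from one branch can be vetoed by a low score from the other; a max-pooling variant would behave differently.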

[139] MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP

Kurt Micallef,Claudia Borg

Main category: cs.CL

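TL;DR: Evaluates 55 publicly available large language models on Maltese, a low-resource language, and finds that smaller fine-tuned models perform better, with exposure during pre-training and instruction-tuning having the largest impact on performance.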

Details Motivation: Large language models perform poorly on low-resource languages, so their applicability and ways to improve them need to be explored. Method: Tests 55 models on a benchmark of 11 tasks and analyzes the effects of pre-training, instruction-tuning, and fine-tuning. Result: Smaller fine-tuned models perform better, and pre-training and instruction-tuning are the key factors. Conclusion: Recommends that research on low-resource languages consider traditional language modelling approaches and pay attention to pre-training and instruction-tuning. Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various Natural Language Processing (NLP) tasks, largely due to their generalisability and ability to perform tasks without additional training. However, their effectiveness for low-resource languages remains limited. In this study, we evaluate the performance of 55 publicly available LLMs on Maltese, a low-resource language, using a newly introduced benchmark covering 11 discriminative and generative tasks. Our experiments highlight that many models perform poorly, particularly on generative tasks, and that smaller fine-tuned models often perform better across all tasks. From our multidimensional analysis, we investigate various factors impacting performance. We conclude that prior exposure to Maltese during pre-training and instruction-tuning emerges as the most important factor. We also examine the trade-offs between fine-tuning and prompting, highlighting that while fine-tuning requires a higher initial cost, it yields better performance and lower inference costs. Through this work, we aim to highlight the need for more inclusive language technologies and recommend that researchers working with low-resource languages consider more "traditional" language modelling approaches.

[140] Building a Few-Shot Cross-Domain Multilingual NLU Model for Customer Care

Saurabh Kumar,Sourav Bansal,Neeraj Agrawal,Priyanka Bhatt

Main category: cs.CL

TL;DR: Proposes an embedder-cum-classifier architecture that extends domain-specific models to other domains with only a few labeled samples, improving cross-domain intent classification accuracy.

Details Motivation: Customer care is central to the e-commerce experience, but scarce cross-domain data limits model performance. The paper aims to address this with a method that generalizes to new domains from only a few annotations. Method: Trains a domain-specific sentence embedder by supervised fine-tuning with isotropic regularizers, and generalizes it across domains with a multilingual knowledge distillation strategy. The final embedder is deployed together with a linear classifier. Result: On Canada and Mexico e-commerce customer care datasets, few-shot intent detection accuracy improves by 20-23% over existing SOTA pre-trained models. Conclusion: The proposed architecture performs well on cross-domain intent classification and offers an efficient solution for practical use. Abstract: Customer care is an essential pillar of the e-commerce shopping experience with companies spending millions of dollars each year, employing automation and human agents, across geographies (like US, Canada, Mexico, Chile), channels (like Chat, Interactive Voice Response (IVR)), and languages (like English, Spanish). SOTA pre-trained models like multilingual-BERT, fine-tuned on annotated data, have shown good performance in downstream tasks relevant to Customer Care. However, model performance is largely subject to the availability of sufficient annotated domain-specific data. Cross-domain availability of data remains a bottleneck, thus building an intent classifier that generalizes across domains (defined by channel, geography, and language) with only a few annotations is of great practical value. In this paper, we propose an embedder-cum-classifier model architecture which extends state-of-the-art domain-specific models to other domains with only a few labeled samples. We adopt a supervised fine-tuning approach with isotropic regularizers to train a domain-specific sentence embedder and a multilingual knowledge distillation strategy to generalize this embedder across multiple domains. The trained embedder, further augmented with a simple linear classifier, can be deployed for new domains. Experiments on Canada and Mexico e-commerce Customer Care datasets with few-shot intent detection show an increase in accuracy by 20-23% against the existing state-of-the-art pre-trained models.

[141] MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

Ran Xu,Yuchen Zhuang,Yishan Zhong,Yue Yu,Xiangru Tang,Hang Wu,May D. Wang,Peifeng Ruan,Donghan Yang,Tao Wang,Guanghua Xiao,Carl Yang,Yang Xie,Wenqi Shi

Main category: cs.CL

TL;DR: MedAgentGYM is a publicly available training environment for improving the coding-based medical reasoning of LLM agents, comprising 72,413 task instances across 129 categories. In experiments, Med-Copilot-7B shows substantial performance gains.

Details Motivation: To develop a scalable, privacy-preserving coding assistant for the medical domain and fill the gap left by the lack of public training environments. Method: Builds coding environments with task descriptions, feedback mechanisms, and ground-truth annotations, and benchmarks over 30 LLMs. Result: Med-Copilot-7B gains +36.44% from supervised fine-tuning and +42.47% from continued reinforcement learning, becoming competitive with GPT-4o. Conclusion: MedAgentGYM provides an integrated platform for biomedical research and practice that supports the development of efficient, privacy-preserving coding assistants. Abstract: We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.

[142] Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning

Wesley Scivetti,Tatsuya Aoyama,Ethan Wilcox,Nathan Schneider

Main category: cs.CL

TL;DR: Compares humans and language models on a rare grammatical construction and finds that language models generalize in form but fall short on meaning.

Details Motivation: To examine whether human-scale language models can understand and generalize both the form and the meaning of rare grammatical phenomena as humans do. Method: Tests language models on their syntactic and semantic knowledge of the rare English LET-ALONE construction, evaluated with a synthetic benchmark. Result: Language models are sensitive to form but generalize poorly on meaning, an asymmetry not seen in human learners. Conclusion: Current language model architectures show an asymmetry in sample efficiency between linguistic form and meaning, unlike human learners. Abstract: Humans have a remarkable ability to acquire and understand grammatical phenomena that are seen rarely, if ever, during childhood. Recent evidence suggests that language models with human-scale pretraining data may possess a similar ability by generalizing from frequent to rare constructions. However, it remains an open question how widespread this generalization ability is, and to what extent this knowledge extends to meanings of rare constructions, as opposed to just their forms. We fill this gap by testing human-scale transformer language models on their knowledge of both the form and meaning of the (rare and quirky) English LET-ALONE construction. To evaluate our LMs we construct a bespoke synthetic benchmark that targets syntactic and semantic properties of the construction. We find that human-scale LMs are sensitive to form, even when related constructions are filtered from the dataset. However, human-scale LMs do not make correct generalizations about LET-ALONE's meaning. These results point to an asymmetry in the current architectures' sample efficiency between language form and meaning, something which is not present in human language learners.

[143] Empaths at SemEval-2025 Task 11: Retrieval-Augmented Approach to Perceived Emotions Prediction

Lev Morozov,Aleksandr Mogilevskii,Alexander Shirnin

Main category: cs.CL

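TL;DR: EmoRAG is a system for multi-label emotion detection that requires no additional model training and predicts emotions using only an ensemble of models; it is efficient and easy to implement.

Details Motivation: The goal is to detect the perceived emotions of the speaker in a text, such as joy, sadness, and fear, as a solution for Subtask A of SemEval-2025 Task 11. Method: Uses an ensemble of models, without additional training, to predict multi-label emotions directly from the text. Result: EmoRAG performs comparably to the best systems while being more efficient, scalable, and easier to implement. Conclusion: EmoRAG offers an efficient and practical solution for emotion detection. Abstract: This paper describes EmoRAG, a system designed to detect perceived emotions in text for SemEval-2025 Task 11, Subtask A: Multi-label Emotion Detection. We focus on predicting the perceived emotions of the speaker from a given text snippet, labeling it with emotions such as joy, sadness, fear, anger, surprise, and disgust. Our approach does not require additional model training and only uses an ensemble of models to predict emotions. EmoRAG achieves results comparable to the best performing systems, while being more efficient, scalable, and easier to implement.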

[144] Zero-Shot Open-Schema Entity Structure Discovery

Xueqiang Xu,Jinfeng Xiao,James Barry,Mohab Elkaref,Jiaru Zou,Pengcheng Jiang,Yunyi Zhang,Max Giammona,Geeth de Mel,Jiawei Han

Main category: cs.CL

TL;DR: Proposes ZOES, a zero-shot open-schema entity structure discovery method that needs no predefined schema or annotated data and improves LLM entity structure extraction through enrichment, refinement, and unification.

Details Motivation: Existing LLM-based entity structure extraction methods depend on predefined schemas or annotated data, which leads to incomplete extraction results. Method: ZOES applies a mechanism of enrichment, refinement, and unification that exploits the mutually reinforcing relationship between an entity and its structure. Result: Experiments show that ZOES markedly improves the completeness and generalizability of LLM entity structure extraction across three different domains. Conclusion: The ZOES mechanism offers a principled way to improve the quality of LLM-based entity structure discovery. Abstract: Entity structure extraction, which aims to extract entities and their associated attribute-value structures from text, is an essential task for text understanding and knowledge graph construction. Existing methods based on large language models (LLMs) typically rely heavily on predefined entity attribute schemas or annotated datasets, often leading to incomplete extraction results. To address these challenges, we introduce Zero-Shot Open-schema Entity Structure Discovery (ZOES), a novel approach to entity structure extraction that does not require any schema or annotated samples. ZOES operates via a principled mechanism of enrichment, refinement, and unification, based on the insight that an entity and its associated structure are mutually reinforcing. Experiments demonstrate that ZOES consistently enhances LLMs' ability to extract more complete entity structures across three different domains, showcasing both the effectiveness and generalizability of the method. These findings suggest that such an enrichment, refinement, and unification mechanism may serve as a principled approach to improving the quality of LLM-based entity structure discovery in various scenarios.

[145] Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Apurv Verma,NhatHai Phan,Shubhendu Trivedi

Main category: cs.CL

TL;DR: Systematically analyzes how two watermarking methods (Gumbel and KGW) affect the truthfulness, safety, and helpfulness of large language models (LLMs), and proposes Alignment Resampling (AR), a sampling method that restores alignment.

Details Motivation: To study how watermarking affects the core alignment properties of LLMs, a gap in existing research. Method: Analyzes the effects of the two watermarking methods experimentally and proposes AR, which uses an external reward model to restore alignment at inference time. Result: Experiments show that AR needs only 2-4 samples to recover or surpass the unwatermarked baseline alignment scores while keeping the watermark detectable. Conclusion: Reveals the balance between watermark strength and model alignment and provides a simple, effective solution for deploying watermarked LLMs in practice. Abstract: Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.
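The inference-time idea in this abstract (draw a few watermarked generations and keep the one an external reward model scores highest) reads like best-of-n selection, which can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `generate`, `reward`, `make_toy_generator`, and `toy_reward` are hypothetical stand-ins, and a real setup would call a watermarked LLM sampler and a trained reward model instead.

```python
from typing import Callable, Iterable


def alignment_resample(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidate responses and return the highest-reward one.

    `generate` is assumed to return one watermarked response per call;
    `reward` is assumed to score a (prompt, response) pair.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))


def make_toy_generator(responses: Iterable[str]) -> Callable[[str], str]:
    """Toy stand-in for a sampler: replays a fixed list of responses."""
    iterator = iter(responses)
    return lambda prompt: next(iterator)


def toy_reward(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model: longer responses score higher."""
    return float(len(response))
```

Because every candidate comes from the watermarked sampler, the selected response still carries the watermark; the abstract reports that 2-4 samples were enough in the authors' experiments, which is why the default here is `n=4`.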

[146] Aligning Large Language Models with Implicit Preferences from User-Generated Content

Zhaoxuan Tan,Zheng Li,Tianyi Liu,Haodong Wang,Hyokun Yun,Ming Zeng,Pei Chen,Zhihan Zhang,Yifan Gao,Ruijie Wang,Priyanka Nigam,Bing Yin,Meng Jiang

Main category: cs.CL

TL;DR: The PUGC framework generates preference data from implicit preferences in unlabeled user-generated content (UGC), substantially reducing cost and improving model performance.

Details Motivation: Existing preference learning methods depend on costly data annotated by humans or advanced LLMs and are hard to scale. PUGC aims to solve this by using the implicit preferences in UGC. Method: PUGC turns UGC into user queries, generates responses, and scores them using the UGC as reference text, aligning the model with the implicit preferences. Result: Experiments show that PUGC combined with DPO improves performance on Alpaca Eval 2 by 9.37% and reaches a state-of-the-art win rate of 35.93%. Conclusion: PUGC offers an efficient, scalable solution for preference learning, with particular strengths in domain-specific alignment and robustness. Abstract: Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that have the potential to address readers' questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a 35.93% state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at https://zhaoxuan.info/PUGC.github.io/

[147] SQLens: An End-to-End Framework for Error Detection and Correction in Text-to-SQL

Yue Gong,Chuan Lei,Xiao Qin,Kapil Vaidya,Balakrishnan Narayanaswamy,Tim Kraska

Main category: cs.CL

TL;DR: SQLens is an end-to-end framework for detecting and correcting semantic errors in LLM-generated SQL, markedly improving error detection and query execution accuracy.

Details Motivation: LLM-generated SQL queries can be syntactically valid yet semantically wrong, and there is little insight into their reliability. Method: SQLens combines error signals from the database and the LLM to identify semantic errors within SQL clauses and to guide correction. Result: On two public benchmarks, SQLens outperforms the best LLM-based self-evaluation method by 25.78% in error detection F1 and raises query execution accuracy by up to 20%. Conclusion: SQLens effectively improves the reliability and accuracy of LLM-generated SQL. Abstract: Text-to-SQL systems translate natural language (NL) questions into SQL queries, enabling non-technical users to interact with structured data. While large language models (LLMs) have shown promising results on the text-to-SQL task, they often produce semantically incorrect yet syntactically valid queries, with limited insight into their reliability. We propose SQLens, an end-to-end framework for fine-grained detection and correction of semantic errors in LLM-generated SQL. SQLens integrates error signals from both the underlying database and the LLM to identify potential semantic errors within SQL clauses. It further leverages these signals to guide query correction. Empirical results on two public benchmarks show that SQLens outperforms the best LLM-based self-evaluation method by 25.78% in F1 for error detection, and improves execution accuracy of out-of-the-box text-to-SQL systems by up to 20%.

[148] DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation

Kun Zhao,Bohao Yang,Chen Tang,Siyuan Dai,Haoteng Tang,Chenghua Lin,Liang Zhan

Main category: cs.CL

TL;DR: SLIDE and DRE combine the strengths of small and large language models, using adaptive weighting and dual refinement to make dialogue evaluation more reliable.

Details Motivation: Large language models (LLMs) are unreliable in ambiguous scenarios, while small language models (SLMs) are susceptible to misleading inputs, yet each has its own strength in handling positive or negative examples. Method: Proposes SLIDE, which integrates SLMs and LLMs through adaptive weighting, and then DRE, in which SLM-generated insights guide the LLM's evaluation and SLM-derived adjustments refine the LLM's scores. Result: Experiments show that DRE outperforms existing methods and aligns more closely with human judgment across multiple benchmarks. Conclusion: Combining small and large models can markedly improve reliability on open-ended tasks such as dialogue evaluation. Abstract: Large Language Models (LLMs) excel at many tasks but struggle with ambiguous scenarios where multiple valid responses exist, often yielding unreliable results. Conversely, Small Language Models (SLMs) demonstrate robustness in such scenarios but are susceptible to misleading or adversarial inputs. We observed that LLMs handle negative examples effectively, while SLMs excel with positive examples. To leverage their complementary strengths, we introduce SLIDE (Small and Large Integrated for Dialogue Evaluation), a method integrating SLMs and LLMs via adaptive weighting. Building on SLIDE, we further propose a Dual-Refinement Evaluation (DRE) method to enhance SLM-LLM integration: (1) SLM-generated insights guide the LLM to produce initial evaluations; (2) SLM-derived adjustments refine the LLM's scores for improved accuracy. Experiments demonstrate that DRE outperforms existing methods, showing stronger alignment with human judgment across diverse benchmarks. This work illustrates how combining small and large models can yield more reliable evaluation tools, particularly for open-ended tasks such as dialogue evaluation.

[149] Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation

Di Wu,Seth Aycock,Christof Monz

Main category: cs.CL

TL;DR: Finds that explicitly decomposing the translation process (as in CoT) brings no clear performance gain in translation, and that simply prompting the LLM to "translate again" works better.

Details Motivation: To examine how effective Chain-of-Thought (CoT) is for translation and to check whether explicit decomposition really improves translation performance. Method: Compares explicitly decomposed translation (such as multi-step prompting) with simple prompting (such as "translate again") in experiments. Result: Explicit decomposition does not clearly improve performance; the simple "translate again" prompt performs better. Conclusion: The effectiveness of CoT in translation needs further study, and explicit decomposition may not be the key factor. Abstract: Large Language Models (LLMs) demonstrate strong reasoning capabilities for many tasks, often by explicitly decomposing the task via Chain-of-Thought (CoT) reasoning. Recent work on LLM-based translation designs hand-crafted prompts to decompose translation, or trains models to incorporate intermediate steps. Translating Step-by-step (Briakou et al., 2024), for instance, introduces a multi-step prompt with decomposition and refinement of translation with LLMs, which achieved state-of-the-art results on WMT24. In this work, we scrutinise this strategy's effectiveness. Empirically, we find no clear evidence that performance gains stem from explicitly decomposing the translation process, at least for the models tested; and we show that simply prompting LLMs to "translate again" yields even better results than human-like step-by-step prompting. Our analysis does not rule out the role of reasoning, but instead invites future work exploring the factors for CoT's effectiveness in the context of translation.

[150] Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs

William Sheffield,Kanishka Misra,Valentina Pyatkin,Ashwini Deo,Kyle Mahowald,Junyi Jessy Li

Main category: cs.CL

TL;DR: LLMs can broadly categorize English discourse particle 'just' but struggle with subtle nuances, showing limitations in understanding discourse particles.

Details Motivation: To assess LLMs' ability to distinguish fine-grained senses of the polyfunctional discourse particle 'just'. Method: Using expert-labeled data to evaluate LLMs' performance in differentiating semantic/discourse effects of 'just'. Result: LLMs show partial success in broader categorization but fail to capture subtle nuances. Conclusion: LLMs have a gap in understanding nuanced discourse particles like 'just'. Abstract: Discourse particles are crucial elements that subtly shape the meaning of text. These words, often polyfunctional, give rise to nuanced and often quite disparate semantic/discourse effects, as exemplified by the diverse uses of the particle "just" (e.g., exclusive, temporal, emphatic). This work investigates the capacity of LLMs to distinguish the fine-grained senses of English "just", a well-studied example in formal semantics, using data meticulously created and labeled by expert linguists. Our findings reveal that while LLMs exhibit some ability to differentiate between broader categories, they struggle to fully capture more subtle nuances, highlighting a gap in their understanding of discourse particles.

[151] BSBench: will your LLM find the largest prime number?

K. O. T. Erziev

Main category: cs.CL

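TL;DR: Proposes benchmarking LLMs on questions that have no reasonable answer and finds that existing models perform far from perfectly on them.

Details Motivation: To examine how LLMs behave when faced with questions that cannot reasonably be answered, as a way to assess their real capabilities. Method: Presents a benchmark for such testing and a method for modifying existing datasets. Result: Existing models perform poorly on unanswerable questions. Conclusion: Stresses the importance of testing LLMs on questions with no answer and releases the related code and data. Abstract: We propose that benchmarking LLMs on questions which have no reasonable answer actually isn't as silly as it sounds. We also present a benchmark that allows such testing and a method to modify the existing datasets, and discover that existing models demonstrate a performance far from perfect on such questions. Our code and data artifacts are available at https://github.com/L3G5/impossible-bench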

[152] SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?

Senyu Li,Jiayi Wang,Felermino D. M. A. Ali,Colin Cherry,Daniel Deutsch,Eleftheria Briakou,Rui Sousa-Silva,Henrique Lopes Cardoso,Pontus Stenetorp,David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: Introduces the SSA-MTE dataset and the SSA-COMET evaluation metrics, which substantially improve machine translation quality evaluation for under-resourced African languages.

Details Motivation: Existing evaluation metrics perform poorly on under-resourced African languages and lack public datasets and language coverage. Method: Builds the SSA-MTE dataset covering 13 African language pairs, develops the SSA-COMET and SSA-COMET-QE metrics, and also tests LLM prompting approaches. Result: SSA-COMET significantly outperforms AfriCOMET and performs strongly on low-resource languages such as Twi, Luo, and Yoruba. Conclusion: SSA-MTE and SSA-COMET provide effective tools for MT evaluation in under-resourced African languages, and all resources are openly released. Abstract: Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.

[153] Demonstrations of Integrity Attacks in Multi-Agent Systems

Can Zheng,Yuhan Cao,Xiaoning Dong,Tianxing He

Main category: cs.CL

TL;DR: Examines four types of attack that malicious agents launch through prompt manipulation in multi-agent systems (MAS), exposes the limits of current detection mechanisms, and calls for stronger MAS security protocols and monitoring systems.

Details Motivation: Multi-agent systems (MAS) show promise for distributed cooperation but are vulnerable to manipulation by malicious agents, who can serve their own interests without disrupting core functionality. Method: Studies four attack types (Scapegoater, Boaster, Self-Dealer, Free-Rider), manipulates MAS behavior with carefully crafted prompts, and tests how well the attacks evade advanced LLM-based monitors. Result: Experiments show that malicious agents can mislead evaluation systems and manipulate collaborating agents, and can bypass advanced monitors such as GPT-4o-mini. Conclusion: Current MAS architectures need stronger security protocols and content validation mechanisms, along with monitoring systems capable of more comprehensive risk assessment. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, code generation, and complex planning. Simultaneously, Multi-Agent Systems (MAS) have garnered attention for their potential to enable cooperation among distributed agents. However, from a multi-party perspective, MAS could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality. This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits. Four types of attacks are examined: Scapegoater, who misleads the system monitor to underestimate other agents' contributions; Boaster, who misleads the system monitor to overestimate their own performance; Self-Dealer, who manipulates other agents to adopt certain tools; and Free-Rider, who hands off its own task to others. We demonstrate that strategically crafted prompts can introduce systematic biases in MAS behavior and executable instructions, enabling malicious agents to effectively mislead evaluation systems and manipulate collaborative agents. Furthermore, our attacks can bypass advanced LLM-based monitors, such as GPT-4o-mini and o3-mini, highlighting the limitations of current detection mechanisms. Our findings underscore the critical need for MAS architectures with robust security protocols and content validation mechanisms, alongside monitoring systems capable of comprehensive risk scenario assessment.

[154] Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis

Dimitris Vamvourellis,Dhagash Mehta

Main category: cs.CL

TL;DR: Studies large language models (LLMs) on zero-shot financial sentiment analysis and finds that reasoning does not improve task performance; GPT-4o without reasoning prompts performs best.

Details Motivation: To test how effective reasoning and non-reasoning LLMs are at financial sentiment analysis and to challenge the default assumption that reasoning improves model performance. Method: Uses the Financial PhraseBank dataset to compare different LLMs and prompting strategies (simulating System 1 or System 2 thinking) and benchmarks them against fine-tuned FinBERT models. Result: Reasoning does not improve performance; GPT-4o without reasoning prompts is the most accurate and the closest to human judgment. Conclusion: For financial sentiment classification, fast intuitive thinking beats slow deliberate reasoning, challenging the assumption that reasoning always improves decisions. Abstract: We investigate the effectiveness of large language models (LLMs), including reasoning-based and non-reasoning models, in performing zero-shot financial sentiment analysis. Using the Financial PhraseBank dataset annotated by domain experts, we evaluate how various LLMs and prompting strategies align with human-labeled sentiment in a financial context. We compare three proprietary LLMs (GPT-4o, GPT-4.1, o3-mini) under different prompting paradigms that simulate System 1 (fast and intuitive) or System 2 (slow and deliberate) thinking and benchmark them against two smaller models (FinBERT-Prosus, FinBERT-Tone) fine-tuned on financial sentiment analysis. Our findings suggest that reasoning, either through prompting or inherent model design, does not improve performance on this task. Surprisingly, the most accurate and human-aligned combination of model and method was GPT-4o without any Chain-of-Thought (CoT) prompting. We further explore how performance is impacted by linguistic complexity and annotation agreement levels, uncovering that reasoning may introduce overthinking, leading to suboptimal predictions. This suggests that for financial sentiment classification, fast, intuitive "System 1"-like thinking aligns more closely with human judgment compared to "System 2"-style slower, deliberative reasoning simulated by reasoning models or CoT prompting. Our results challenge the default assumption that more reasoning always leads to better LLM decisions, particularly in high-stakes financial applications.

[155] Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?

Qingchuan Li,Jiatong Li,Zirui Liu,Mingyue Cheng,Yuting Zeng,Qi Liu,Tongxuan Liu

Main category: cs.CL

TL;DR: Proposes the SCALe benchmark and the MenTaL method to address the unreliability of LLMs as logic translators under lexical diversification.

Details Motivation: Existing LLMs struggle with lexical diversity when translating into logic, and existing benchmarks lack lexical diversity, which hides the problem. Method: Proposes SCALe, a benchmark that evaluates LLM translation through logic-invariant lexical diversification, and MenTaL, a method that improves translation by first building a table that unifies diverse expressions. Result: Experiments confirm that LLMs fall short on lexically diversified translation and that MenTaL significantly improves translation performance. Conclusion: SCALe and MenTaL provide effective tools for addressing lexical diversity in LLM logic translation. Abstract: Neuro-symbolic approaches that combine large language models (LLMs) with solvers excel at logical reasoning problems that need long reasoning chains. In this paradigm, LLMs serve as translators, converting natural language reasoning problems into formal logic formulas. Reliable symbolic solvers then return correct solutions. Despite their success, we find that LLMs, as translators, struggle to handle lexical diversification, a common linguistic phenomenon, indicating that LLMs as logic translators are unreliable in real-world scenarios. Moreover, existing logical reasoning benchmarks lack lexical diversity, failing to challenge LLMs' ability to translate such text and thus obscuring this issue. In this work, we propose SCALe, a benchmark designed to address this significant gap through logic-invariant lexical diversification. By using LLMs to transform original benchmark datasets into lexically diversified but logically equivalent versions, we evaluate LLMs' ability to consistently map diverse expressions to uniform logical symbols on these new datasets. Experiments using SCALe further confirm that current LLMs exhibit deficiencies in this capability. Building directly on the deficiencies identified through our benchmark, we propose a new method, MenTaL, to address this limitation. This method guides LLMs to first construct a table unifying diverse expressions before performing translation. Applying MenTaL through in-context learning and supervised fine-tuning (SFT) significantly improves the performance of LLM translators on lexically diversified text. Our code is now available at https://github.com/wufeiwuwoshihua/LexicalDiver.

[156] Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching

Jianfei Zhang,Bei Li,Jun Bai,Rumei Li,Yanmeng Wang,Chenghua Lin,Wenge Rong

Main category: cs.CL

TL;DR: Proposes a gradient matching method for selecting demonstrations in many-shot in-context learning (ICL) with large language models (LLMs), clearly outperforming random selection.

Details Motivation: Existing many-shot ICL work selects demonstrations at random, and instance-level retrieval does not suit many-shot scenarios. The authors hypothesize that ICL and fine-tuning have analogous data requirements and therefore propose gradient matching. Method: Aligns the fine-tuning gradients of the selected examples with those of the target task's entire training set, so that the selected examples approximate the learning effect of the full set. Result: From 4-shot to 128-shot scenarios, the method outperforms random selection across 9 datasets, for example by 4% on Qwen2.5-72B and Llama3-70B and by around 2% on 5 closed-source LLMs. Conclusion: The method makes many-shot ICL more reliable and effective and broadens its range of application. Abstract: In-Context Learning (ICL) empowers Large Language Models (LLMs) for rapid task adaptation without Fine-Tuning (FT), but its reliance on demonstration selection remains a critical challenge. While many-shot ICL shows promising performance through scaled demonstrations, the selection method for many-shot demonstrations remains limited to random selection in existing work. Since the conventional instance-level retrieval is not suitable for many-shot scenarios, we hypothesize that the data requirements for in-context learning and fine-tuning are analogous. To this end, we introduce a novel gradient matching approach that selects demonstrations by aligning fine-tuning gradients between the entire training set of the target task and the selected examples, so as to approach the learning effect on the entire training set within the selected examples. Through gradient matching on relatively small models, e.g., Qwen2.5-3B or Llama3-8B, our method consistently outperforms random selection on larger LLMs from 4-shot to 128-shot scenarios across 9 diverse datasets. For instance, it surpasses random selection by 4% on Qwen2.5-72B and Llama3-70B, and by around 2% on 5 closed-source LLMs. This work unlocks more reliable and effective many-shot ICL, paving the way for its broader application.
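One way to picture gradient matching for demonstration selection is a greedy search for a subset whose mean gradient stays close to the mean gradient of the whole training set. The sketch below is an assumption-laden illustration, not the paper's algorithm: the abstract does not give the exact objective or search procedure, so the squared-Euclidean distance, the greedy loop, and the names `mean_vector` and `select_by_gradient_matching` are all hypothetical. Per-example gradients are assumed to be precomputed (for instance on a small proxy model) and flattened into plain lists of floats.

```python
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def select_by_gradient_matching(
    grads: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Greedily pick k example indices whose mean gradient tracks the full-set mean.

    Assumption: squared Euclidean distance between the subset mean and the
    full-set mean is the matching criterion.
    """
    if not 1 <= k <= len(grads):
        raise ValueError("k must be between 1 and the number of examples")
    target = mean_vector(grads)
    dim = len(target)
    selected: List[int] = []
    running_sum = [0.0] * dim
    for step in range(1, k + 1):
        best_index, best_dist = -1, float("inf")
        for i, g in enumerate(grads):
            if i in selected:
                continue
            # Distance between the target and the subset mean if example i were added.
            dist = sum(
                ((running_sum[j] + g[j]) / step - target[j]) ** 2
                for j in range(dim)
            )
            if dist < best_dist:
                best_index, best_dist = i, dist
        selected.append(best_index)
        running_sum = [running_sum[j] + grads[best_index][j] for j in range(dim)]
    return selected
```

For the toy gradients `[[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]`, the full-set mean is `[1.0, 1.0]`, so a 1-shot selection picks index 2, the single example whose gradient already equals that mean. The selected indices would then be used to build the many-shot prompt for the larger model.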

[157] SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing

Hongjun Liu,Yilun Zhao,Arman Cohan,Chen Zhao

Main category: cs.CL

TL;DR: Proposes SUCEA, a training-free method that decomposes adversarial claims into sub-claims, iteratively retrieves evidence, and edits the claims, significantly improving retrieval and label accuracy in fact-checking.

Details Motivation: Adversarial claims are deliberately designed to challenge fact-checking systems, and existing systems based on retrieval-augmented language models struggle with them. Method: The SUCEA framework has three steps: claim segmentation and decontextualization, iterative evidence retrieval and claim editing, and evidence aggregation and label prediction. Result: On two datasets, it significantly outperforms four baselines and improves both retrieval and label accuracy. Conclusion: SUCEA handles adversarial claims effectively and improves the performance of fact-checking systems. Abstract: Automatic fact-checking has recently received more attention as a means of combating misinformation. Despite significant advancements, fact-checking systems based on retrieval-augmented language models still struggle to tackle adversarial claims, which are intentionally designed by humans to challenge fact-checking systems. To address these challenges, we propose a training-free method designed to rephrase the original claim, making it easier to locate supporting evidence. Our modular framework, SUCEA, decomposes the task into three steps: 1) Claim Segmentation and Decontextualization that segments adversarial claims into independent sub-claims; 2) Iterative Evidence Retrieval and Claim Editing that iteratively retrieves evidence and edits the subclaim based on the retrieved evidence; 3) Evidence Aggregation and Label Prediction that aggregates all retrieved evidence and predicts the entailment label. Experiments on two challenging fact-checking datasets demonstrate that our framework significantly improves on both retrieval and entailment label accuracy, outperforming four strong claim-decomposition-based baselines.

[158] MuSciClaims: Multimodal Scientific Claim Verification

Yash Kumar Lal,Manikanta Bandham,Mohammad Saqib Hasan,Apoorva Kashi,Mahnaz Koupaee,Niranjan Balasubramanian

Main category: cs.CL

TL;DR: Introduces MuSciClaims, a new benchmark for claim verification over multimodal data in scientific literature, and finds that existing vision-language models perform poorly on the task.

Details Motivation: Existing scientific QA and multimodal reasoning tasks over chart data offer no benchmark that directly tests claim verification, so this gap needs to be filled. Method: Builds MuSciClaims by automatically extracting supported claims from scientific articles and manually perturbing them into contradicted claims, and designs diagnostic tasks to analyze why models fail. Result: Most vision-language models perform poorly (F1 of 0.3-0.5), and the best model reaches only 0.77. Models are biased towards judging claims as supported and struggle to understand subtle perturbations. Conclusion: Models show clear weaknesses in localizing multimodal evidence, aggregating information across modalities, and understanding basic figure components, and need further improvement. Abstract: Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark MuSciClaims accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models are poor (~0.3-0.5 F1), with even the best model only achieving 0.77 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.

[159] LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models

Wen Ding,Fan Qian

Main category: cs.CL

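TL;DR: The LESS framework uses large language models (LLMs) to correct pseudo labels and combines this with a data filtering strategy, significantly improving speech recognition and translation performance.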

Details Motivation: Pseudo labels generated from unsupervised data are noisy; the work aims to reduce this noise and improve the accuracy of speech recognition and translation. Method: An LLM corrects the pseudo labels produced by ASR/AST, and a data filtering strategy optimizes the efficiency of knowledge transfer. Result: On Mandarin ASR and Spanish-to-English AST, LESS achieves an absolute WER reduction of 3.77% on the Wenet Speech test set and BLEU scores of 34.0 and 64.7 on the Callhome and Fisher test sets, respectively. Conclusion: LESS shows strong adaptability and effectiveness across languages, tasks, and domains. Abstract: We introduce LESS (Large Language Model Enhanced Semi-supervised Learning), a versatile framework that leverages Large Language Models (LLMs) to correct pseudo labels generated from in-the-wild data. Within the LESS framework, pseudo-labeled text from Automatic Speech Recognition (ASR) or Automatic Speech Translation (AST) of the unsupervised data is refined by an LLM, and augmented by a data filtering strategy to optimize LLM knowledge transfer efficiency. Experiments on both Mandarin ASR and Spanish-to-English AST tasks show that LESS achieves a notable absolute WER reduction of 3.77% on the Wenet Speech test set, as well as BLEU scores of 34.0 and 64.7 on Callhome and Fisher test sets respectively. These results validate the adaptability of LESS across different languages, tasks, and domains. Ablation studies conducted with various LLMs and prompt configurations provide novel insights into leveraging LLM-derived knowledge for speech processing applications.
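For readers unfamiliar with the metric behind the reported 3.77% absolute reduction, the following is a minimal sketch of word error rate (WER) as word-level edit distance divided by reference length. The function name and the whitespace tokenization are assumptions for illustration; the abstract does not describe the evaluation tooling actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

An "absolute" reduction is a difference between two such rates, for example from 0.20 to 0.16, rather than a relative percentage of the baseline.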

[160] Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification

Chengwu Liu,Ye Yuan,Yichun Yin,Yan Xu,Xin Xu,Zaoyu Chen,Yasheng Wang,Lifeng Shang,Qun Liu,Ming Zhang

Main category: cs.CL

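TL;DR: The paper proposes $Safe$, a framework that verifies LLM reasoning steps in the formal mathematical language Lean 4 to reduce hallucinations and provide checkable evidence.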

Details Motivation: Current methods for handling hallucinations in CoT reasoning, such as PRMs or self-consistency, are not verifiable, which makes hallucinations hard to detect. Inspired by mathematical proof, the authors address this through formal verification. Method: A retrospective, step-aware formal verification framework, $Safe$, states the claim made at each reasoning step in Lean 4 and supplies a formal proof. Result: Evaluated across multiple LLMs and mathematical datasets, $Safe$ yields significant performance improvements along with interpretable evidence. The paper also introduces the $FormalStep$ benchmark with 30,809 formal statements. Conclusion: $Safe$ is, to the authors' knowledge, the first framework to use Lean 4 to verify LLM-generated content, offering a verifiable approach to the hallucination problem. Abstract: Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that "the gold standard for supporting a mathematical claim is to provide a proof". We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework $Safe$ across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose $FormalStep$ as a benchmark for step correctness theorem proving with $30,809$ formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.
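To make the idea of checking a single reasoning step concrete, here is a toy Lean 4 illustration. It is only a sketch: the step text is hypothetical, and the abstract does not show how $Safe$ actually translates steps or which tactics it uses.

```lean
-- Hypothetical chain-of-thought step: "3 boxes of 4 apples make 12 apples."
-- Stated as a proposition over natural numbers and closed by computation.
example : 3 * 4 = 12 := rfl

-- A hallucinated step such as "3 boxes of 4 apples make 13 apples" would be
-- stated as `3 * 4 = 13`; no proof of it exists, so the step is flagged.
```

The appeal of this style is that acceptance depends on the proof checker rather than on an opaque score.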

[161] A MISMATCHED Benchmark for Scientific Natural Language Inference

Firoz Shaik,Mobashir Sadat,Nikita Gautam,Doina Caragea,Cornelia Caragea

Main category: cs.CL

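TL;DR: The paper introduces MISMATCHED, a new evaluation benchmark for scientific natural language inference (NLI) that covers three non-computer-science domains (psychology, engineering, and public health) and contains 2,700 human-annotated sentence pairs. Baselines are established with pre-trained small and large language models; the best reaches a Macro F1 of 78.17%, leaving substantial room for improvement. In addition, including sentence pairs with an implicit scientific NLI relation in training improves model performance.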

Details Motivation: Existing scientific NLI datasets come mainly from computer science, and non-CS domains are entirely ignored, so a new evaluation benchmark is needed to fill the gap. Method: The MISMATCHED benchmark covers three non-CS domains and contains 2,700 human-annotated sentence pairs. Baselines are established with pre-trained small and large language models. Result: The best baseline reaches a Macro F1 of 78.17%, indicating substantial room for improvement. Adding sentence pairs with an implicit scientific NLI relation to training improves model performance. Conclusion: MISMATCHED fills the gap in scientific NLI data for non-CS domains and shows the potential for future improvement. Abstract: Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains (PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH) and contains 2,700 human-annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17%, illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub.
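The headline number above is a Macro F1, which averages per-class F1 scores so that every class counts equally regardless of frequency. The sketch below shows that computation; the function name and the toy labels are illustrative assumptions and are not taken from the MISMATCHED release.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["a", "a", "b", "b"]  # hypothetical class labels
    pred = ["a", "b", "b", "b"]
    print(macro_f1(gold, pred))
```

Because the mean is unweighted, a model that neglects a rare class is penalized more than it would be under plain accuracy.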

[162] Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning

Ho-Lam Chung,Teng-Yun Hsiao,Hsiao-Ying Huang,Chunerh Cho,Jian-Ren Lin,Zhang Ziwei,Yun-Nung Chen

Main category: cs.CL

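TL;DR: Test-Time Scaling (TTS) improves LLM reasoning performance by allocating additional compute at inference time. ADAPT substantially reduces the compute required through diversity-focused prefix tuning.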

Details Motivation: Reasoning-optimized models produce less diverse outputs, which limits the effectiveness of TTS. Method: ADAPT, a prefix tuning method with a diversity-focused data strategy, is proposed. Result: On mathematical reasoning tasks, ADAPT reaches 80% accuracy with eight times less compute than strong baselines. Conclusion: Generative diversity is essential to maximizing TTS effectiveness. Abstract: Test-Time Scaling (TTS) improves the reasoning performance of Large Language Models (LLMs) by allocating additional compute during inference. We conduct a structured survey of TTS methods and categorize them into sampling-based, search-based, and trajectory optimization strategies. We observe that reasoning-optimized models often produce less diverse outputs, which limits TTS effectiveness. To address this, we propose ADAPT (A Diversity Aware Prefix fine-Tuning), a lightweight method that applies prefix tuning with a diversity-focused data strategy. Experiments on mathematical reasoning tasks show that ADAPT reaches 80% accuracy using eight times less compute than strong baselines. Our findings highlight the essential role of generative diversity in maximizing TTS effectiveness.
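As a rough illustration of why output diversity matters for sampling-based TTS, the sketch below aggregates several sampled answers by majority vote and reports how many distinct answers were produced. This is a generic self-consistency-style aggregator, not ADAPT itself; the function names and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def distinct_ratio(answers):
    """Fraction of samples that are unique: a crude diversity measure."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return len(set(answers)) / len(answers)


if __name__ == "__main__":
    samples = ["12", "15", "12", "12", "9"]  # hypothetical final answers
    print(majority_vote(samples), distinct_ratio(samples))
```

If every sample is identical, the distinct ratio falls to 1/N and extra samples add compute without adding information, which is the limitation the abstract attributes to reasoning-optimized models.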

[163] Subjective Perspectives within Learned Representations Predict High-Impact Innovation

Likun Cao,Rui Pan,James Evans

Main category: cs.CL

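TL;DR: Using machine learning methods, the paper quantifies innovators' subjective perspectives and background diversity, finding that perspective diversity promotes innovation while background diversity may hinder it.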

Details Motivation: The study examines how innovators' subjective perspectives and background diversity affect their capacity to innovate, addressing prior research's neglect of personal perspectives. Method: Dynamic language representations are used to model innovators' subjective perspectives and background diversity; the authors analyze data on millions of scientists, inventors, and others, and complement this with a natural experiment and AI simulations. Result: Perspective diversity consistently predicts creative achievement, whereas background diversity predicts the opposite; successful collaborators use common language to weave together diverse experience. Conclusion: The findings offer new insights for team assembly and research policy, emphasizing the importance of perspective diversity. Abstract: Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators' personal perspectives and interpersonal innovation opportunities as a function of their prior trajectories of experience. We theorize then quantify subjective perspectives and innovation opportunities based on innovator positions within the geometric space of concepts inscribed by dynamic language representations. Using data on millions of scientists, inventors, writers, entrepreneurs, and Wikipedia contributors across the creative domains of science, technology, film, entrepreneurship, and Wikipedia, here we show that measured subjective perspectives anticipate what ideas individuals and groups creatively attend to and successfully combine in future. When perspective and background diversity are decomposed as the angular difference between collaborators' perspectives on their creation and between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite, across all cases and time periods examined. We analyze a natural experiment and simulate creative collaborations between AI (large language model) agents designed with various perspective and background diversity, which are consistent with our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experience obtained through trajectories of prior work that converge to provoke one another and innovate. We explore the importance of these findings for team assembly and research policy.
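The abstract measures diversity as the angular difference between collaborators' vectors in an embedding space. A minimal sketch of that quantity follows, assuming each collaborator is already summarized by a single vector; how the paper builds those vectors is not specified here, and the function name is hypothetical.

```python
import math


def angular_difference(u, v):
    """Angle in degrees between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


if __name__ == "__main__":
    print(angular_difference([1.0, 0.0], [1.0, 1.0]))  # moderately different
    print(angular_difference([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Under this reading, the same function would be applied twice per team: once to perspective vectors and once to background (experience) vectors.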

[164] Static Word Embeddings for Sentence Semantic Representation

Takashi Wada,Yuki Hirakawa,Ryotaro Shimizu,Takahiro Kawashima,Yuki Saito

Main category: cs.CL

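TL;DR: The paper proposes static word embeddings optimised for sentence semantic representation: pre-trained word embeddings are improved with sentence-level principal component analysis followed by knowledge distillation or contrastive learning, giving sentence representations at low computational cost.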

Details Motivation: The goal is to improve how well static word embeddings represent sentence semantics, bringing them closer to the performance of dynamic models. Method: Word embeddings are extracted from a pre-trained Sentence Transformer and refined with sentence-level principal component analysis followed by knowledge distillation or contrastive learning; a sentence is represented as the average of its word embeddings. Result: The model substantially outperforms existing static models on monolingual and cross-lingual tasks and rivals SimCSE on some data sets. Conclusion: The method removes word embedding components that are irrelevant to sentence semantics and adjusts vector norms according to each word's influence on sentence semantics. Abstract: We propose new static word embeddings optimised for sentence semantic representation. We first extract word embeddings from a pre-trained Sentence Transformer, and improve them with sentence-level principal component analysis, followed by either knowledge distillation or contrastive learning. During inference, we represent sentences by simply averaging word embeddings, which requires little computational cost. We evaluate models on both monolingual and cross-lingual tasks and show that our model substantially outperforms existing static models on sentence semantic tasks, and even rivals a basic Sentence Transformer model (SimCSE) on some data sets. Lastly, we perform a variety of analyses and show that our method successfully removes word embedding components that are irrelevant to sentence semantics, and adjusts the vector norms based on the influence of words on sentence semantics.
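The inference step described above, averaging static word vectors, is cheap enough to sketch in a few lines. The toy two-dimensional vocabulary below is invented for illustration, and the PCA and distillation stages that produce the real embeddings are not reproduced.

```python
import math


def sentence_vector(tokens, embeddings):
    """Average the vectors of all tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has a known embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


if __name__ == "__main__":
    toy = {  # hypothetical 2-d embeddings, for illustration only
        "cat": [1.0, 0.0],
        "dog": [0.8, 0.2],
        "runs": [0.0, 1.0],
    }
    a = sentence_vector(["the", "cat", "runs"], toy)  # "the" is skipped
    b = sentence_vector(["dog", "runs"], toy)
    print(a, cosine(a, b))
```

With plain averaging, words whose vectors have larger norms pull the sentence vector more strongly, which is why the norm adjustment mentioned in the abstract acts as an implicit word weighting.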

[165] Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning

Zhiyuan Ma,Jiayu Liu,Xianzhen Luo,Zhenya Huang,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

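TL;DR: Tool-MVR improves the tool-use capabilities of large language models through Multi-Agent Meta-Verification (MAMV) and Exploration-based Reflection Learning (EXPLORE), significantly outperforming existing methods.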

Details Motivation: Current large language models fall short in tool planning and tool reflection; Tool-MVR is designed to address both weaknesses. Method: MAMV is used to build ToolBench-V, a high-quality instruction dataset, and the dynamic EXPLORE learning paradigm produces the reflection dataset ToolBench-R; open-source models are then fine-tuned on both. Result: Tool-MVR surpasses ToolLLM and GPT-4 on StableToolBench while reducing API calls by 31.4%, and reaches a 58.9% error correction rate on RefineToolBench. Conclusion: Through systematic verification and dynamic reflection learning, Tool-MVR significantly improves tool-use capability and generalizes broadly. Abstract: Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low-quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench-V, a new high-quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic "Error -> Reflection -> Correction" learning paradigm, resulting in our reflection dataset ToolBench-R and addressing the critical weakness in tool reflection. Finally, we obtain Tool-MVR by finetuning open-source LLMs (e.g., Qwen-7B) on both ToolBench-V and ToolBench-R. Our experiments demonstrate that Tool-MVR achieves state-of-the-art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT-4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool-MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM's 9.1%.

[166] ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition

Thai-Binh Nguyen,Thi Van Nguyen,Quoc Truong Do,Chi Mai Luong

Main category: cs.CL

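TL;DR: The paper presents a practical method for generating AVSR datasets from raw video and develops a baseline model for Vietnamese that performs strongly in noisy environments.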

Details Motivation: AVSR models are limited by the scarcity of datasets, especially for languages other than English. Method: AVSR datasets are generated from raw video through automated data collection, refining existing techniques for better efficiency and accessibility. Result: The automatically collected dataset supports a strong baseline model that matches ASR in clean conditions and significantly outperforms it in noisy environments. Conclusion: The method offers an effective route to extending AVSR to more languages, particularly under-resourced ones. Abstract: Audio-Visual Speech Recognition (AVSR) has gained significant attention recently due to its robustness against noise, which often challenges conventional speech recognition systems that rely solely on audio features. Despite this advantage, AVSR models remain limited by the scarcity of extensive datasets, especially for most languages beyond English. Automated data collection offers a promising solution. This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese. Experiments show the automatically collected dataset enables a strong baseline, achieving competitive performance with robust ASR systems in clean conditions and significantly outperforming them in noisy environments like cocktail parties. This efficient method provides a pathway to expand AVSR to more languages, particularly under-resourced ones.

[167] TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering

Vinay Joshi,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum

Main category: cs.CL

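TL;DR: TaDA is a training-free KV cache compression method that uses adaptive quantization precision and mean centering to eliminate separate outlier handling, substantially reducing memory usage while preserving model accuracy.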

Details Motivation: The memory demands of the KV cache in transformer models grow sharply with sequence length, limiting scalable deployment of large language models. Existing quantization methods still have to handle sparse, noncontiguous outliers separately. Method: TaDA compresses the KV cache using adaptive quantization precision and mean centering, with no separate outlier management. Result: Experiments show that TaDA reduces KV cache memory to 27% of the original 16-bit baseline while maintaining comparable accuracy. Conclusion: TaDA provides scalable, high-performance inference for language models, supporting longer contexts and longer reasoning chains. Abstract: The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and a mean centering to eliminate separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements -- a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chain of thoughts.
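To illustrate why mean centering can help low-bit quantization, the sketch below applies a symmetric uniform quantizer with and without subtracting the mean first. The quantizer design, the 4-bit width, and the toy values are assumptions for illustration; the abstract does not specify TaDA's quantizer, the axis along which the mean is taken, or how per-layer precision is chosen.

```python
def quantize(values, bits=4, center=True):
    """Symmetric uniform quantization, optionally after subtracting the mean."""
    mean = sum(values) / len(values) if center else 0.0
    shifted = [v - mean for v in values]
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for signed 4-bit codes
    max_abs = max(abs(s) for s in shifted)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp so rounding can never produce a code outside [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(s / scale))) for s in shifted]
    return codes, scale, mean


def dequantize(codes, scale, mean):
    """Map integer codes back to approximate floating-point values."""
    return [q * scale + mean for q in codes]


def max_error(values, **kwargs):
    """Largest absolute reconstruction error after a quantize/dequantize round trip."""
    codes, scale, mean = quantize(values, **kwargs)
    restored = dequantize(codes, scale, mean)
    return max(abs(v - r) for v, r in zip(values, restored))


if __name__ == "__main__":
    activations = [10.0, 10.5, 11.0, 9.5, 9.0]  # hypothetical values with a large offset
    print("centered:  ", max_error(activations, center=True))
    print("uncentered:", max_error(activations, center=False))
```

When values share a large common offset, centering shrinks the range the integer codes must cover, so the same bit width yields a finer step size; the only extra storage in this sketch is one mean per quantized group.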

[168] Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents

Juhyun Oh,Eunsu Kim,Alice Oh

Main category: cs.CL

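TL;DR: Flex-TravelPlanner is a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios, revealing weaknesses in multi-turn tasks and in handling constraint priorities.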

Details Motivation: Existing benchmarks focus mainly on static, single-turn scenarios and do not capture the dynamic demands of real-world planning. Method: Building on the TravelPlanner dataset, two new evaluation settings are introduced: sequential constraint introduction across multiple turns, and explicitly prioritized competing constraints. Result: Performance on single-turn tasks correlates poorly with performance on multi-turn tasks, and both the order of constraint introduction and priority handling significantly affect performance. Conclusion: The work underlines the importance of evaluating LLMs in dynamic scenarios and suggests specific directions for improving performance on complex planning tasks. Abstract: Real-world planning problems require constant adaptation to changing requirements and balancing of competing constraints. However, current benchmarks for evaluating LLMs' planning capabilities primarily focus on static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios. Building on the TravelPlanner dataset (Xie et al., 2024), we introduce two novel evaluation settings: (1) sequential constraint introduction across multiple turns, and (2) scenarios with explicitly prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings: models' performance on single-turn tasks poorly predicts their ability to adapt plans across multiple turns; constraint introduction order significantly affects performance; and models struggle with constraint prioritization, often incorrectly favoring newly introduced lower-priority preferences over existing higher-priority constraints. These findings highlight the importance of evaluating LLMs in more realistic, dynamic planning scenarios and suggest specific directions for improving model performance on complex planning tasks. The code and dataset for our framework are publicly available at https://github.com/juhyunohh/FlexTravelBench.

[169] Normative Conflicts and Shallow AI Alignment

Raphaël Millière

Main category: cs.CL

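TL;DR: The paper examines the value alignment problem for large language models (LLMs), arguing that current alignment strategies cannot effectively prevent misuse and exposing their fundamental limitations.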

Details Motivation: As AI systems such as LLMs advance, concerns about their safe deployment grow more pressing. The author aims to expose the shortcomings of current alignment methods and to point toward the need for deeper alignment. Method: Drawing on research in human moral psychology, the paper contrasts the normative reasoning abilities of LLMs and humans, arguing that LLMs lack a capacity for deep normative deliberation. Result: Current alignment strategies only reinforce shallow behavioral dispositions and cannot withstand adversarial attacks, and recent reasoning-focused LLMs have not resolved the problem. Conclusion: Current "shallow alignment" approaches are insufficient to address the harms that increasingly capable AI systems may pose; deeper alignment mechanisms need to be explored. Abstract: The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing on research in moral psychology, I show how humans' ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics. LLMs, by contrast, lack a robust capacity to detect and rationally resolve normative conflicts, leaving them susceptible to manipulation; even recent advances in reasoning-focused LLMs have not addressed this vulnerability. This "shallow alignment" problem carries significant implications for AI safety and regulation, suggesting that current approaches are insufficient for mitigating potential harms posed by increasingly capable AI systems.

[170] MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models

Gio Paik,Geewook Kim,Jinbae Im

Main category: cs.CL

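TL;DR: MMRefine is a multimodal refinement benchmark for evaluating the error refinement capabilities of multimodal large language models (MLLMs), covering six scenarios and six error types.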

Details Motivation: As emphasis shifts toward enhancing reasoning during inference, MLLMs need to be evaluated on their ability to detect and correct errors, not merely by comparing final accuracy before and after refinement. Method: MMRefine evaluates refinement ability across six distinct scenarios and six error types, with experiments on both open and closed MLLMs. Result: The experiments reveal bottlenecks and factors that impede refinement performance, pointing to areas for improvement in reasoning enhancement. Conclusion: MMRefine provides an evaluation framework for the error refinement capabilities of MLLMs, and its code and dataset are publicly available. Abstract: This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs' abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at https://github.com/naver-ai/MMRefine.

[171] Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Thao Nguyen,Yang Li,Olga Golovneva,Luke Zettlemoyer,Sewoong Oh,Ludwig Schmidt,Xian Li

Main category: cs.CL

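TL;DR: The paper proposes REWIRE, a method that rewrites low-quality web text to expand pre-training data; experiments show that mixing high-quality raw text with rewritten text significantly improves model performance.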

Details Motivation: Pre-training data is running short, and high-quality text is especially scarce; the work explores how to make use of low-quality data that filtering would discard. Method: REWIRE uses guided rewriting to turn low-quality documents into usable training data. Result: At the 1B, 3B, and 7B scales, mixing in rewritten text improves performance across 22 tasks by 1.0, 1.3, and 2.5 percentage points respectively, outperforming training on filtered data alone or on 2x the raw web data. Conclusion: Rewriting low-quality web text is a simple and effective way to scale pre-training data, and REWIRE outperforms other synthetic data generation techniques. Abstract: Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art performance. To address the "data wall" of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts leads to 1.0, 1.3 and 2.5 percentage points improvement respectively across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed-in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches of generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesizing and knowledge extraction. These results suggest that recycling web texts holds the potential for being a simple and effective approach for scaling pre-training data.

[172] Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification

Lu Wei,Liangzhi Li,Tong Xiang,Xiao Liu,Noa Garcia

Main category: cs.CL

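TL;DR: The paper proposes a new taxonomy for implicit hate speech (im-HS) detection built on six encoding strategies (codetypes), and validates its effectiveness on Chinese and English datasets.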

Details Motivation: Hate speech (HS) on the internet threatens societal harmony and individual well-being, and existing methods detect implicit hate speech (im-HS) poorly. Method: Six encoding strategies (codetypes) are defined and used in two ways: 1) prompting large language models (LLMs) directly to classify text; 2) embedding codetypes into the encoding process when LLMs serve as encoders. Result: Experiments show that codetypes improve im-HS detection on both Chinese and English datasets. Conclusion: The approach offers an effective solution for implicit hate speech detection across languages. Abstract: The internet has become a hotspot for hate speech (HS), threatening societal harmony and individual well-being. While automatic detection methods perform well in identifying explicit hate speech (ex-HS), they struggle with more subtle forms, such as implicit hate speech (im-HS). We tackle this problem by introducing a new taxonomy for im-HS detection, defining six encoding strategies named codetypes. We present two methods for integrating codetypes into im-HS detection: 1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and 2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that the use of codetypes improves im-HS detection in both Chinese and English datasets, validating the effectiveness of our approach across different languages.

[173] Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Woomin Song,Saket Dingliwal,Sai Muralidhar Jayanthi,Bhavana Ganesh,Jinwoo Shin,Aram Galstyan,Sravan Babu Bodapati

Main category: cs.CL

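TL;DR: STAND is a model-free speculative decoding method that exploits redundancy in reasoning trajectories to substantially accelerate inference while maintaining accuracy.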

Details Motivation: Existing test-time scaling methods such as best-of-N sampling and tree search demand substantial compute, creating a trade-off between performance and efficiency. Method: STAND uses stochastic adaptive N-gram drafting, combined with efficient Gumbel-Top-K sampling and data-driven tree construction, to raise token acceptance rates. Result: Across multiple models and tasks, STAND reduces inference latency by 60-65% and outperforms existing speculative decoding methods by 14-28% in throughput. Conclusion: STAND is an efficient plug-and-play solution that can be applied to any language model without additional training. Abstract: Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that leverages the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis reveals that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND outperforms state-of-the-art speculative decoding methods by 14-28% in throughput and shows strong performance even in single-trajectory scenarios, reducing inference latency by 48-58%. As a model-free approach, STAND can be applied to any existing language model without additional training, being a powerful plug-and-play solution for accelerating language model reasoning.
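One ingredient named above, Gumbel-Top-K sampling, draws k distinct items from a softmax distribution by adding Gumbel noise to the logits and keeping the k largest perturbed values. The sketch below shows only that generic trick; the function name is hypothetical, and how STAND optimizes it or wires it into N-gram drafting and tree construction is not described in the abstract.

```python
import math
import random


def gumbel_top_k(logits, k, rng):
    """Sample k distinct indices without replacement via Gumbel-perturbed logits."""
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    perturbed = []
    for index, logit in enumerate(logits):
        # rng.random() lies in [0, 1); the floor keeps log() away from zero.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise, index))
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


if __name__ == "__main__":
    rng = random.Random(0)
    draft_logits = [2.0, 0.5, 1.0, -1.0]  # hypothetical next-token logits
    print(gumbel_top_k(draft_logits, k=2, rng=rng))
```

Because the k candidates are guaranteed to be distinct, a drafter can propose several different continuations in one pass while still respecting the underlying probabilities.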

[174] IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation

Bhavana Akkiraju,Aishwarya Pothula,Santosh Kesiraju,Anil Kumar Vuppala

Main category: cs.CL

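TL;DR: This paper describes the IIITH-BUT submission to the IWSLT 2025 low-resource Bhojpuri-Hindi speech translation task, studying the effect of hyperparameter optimisation and data augmentation on the SeamlessM4T model.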

Details Motivation: The work explores how hyperparameter optimisation and data augmentation can improve speech translation for a low-resource language pair (Bhojpuri-Hindi). Method: The authors systematically study hyperparameters (such as learning rate schedules, number of update steps, and warm-up steps) and data augmentation techniques (speed perturbation and SpecAugment), and analyse the effect of cross-lingual signal through joint training on Marathi and Bhojpuri data. Result: Careful hyperparameter selection and simple yet effective augmentation significantly improve translation performance in the low-resource setting. Conclusion: In low-resource scenarios, hyperparameter optimisation and data augmentation are key to improving translation quality. Abstract: This paper presents the submission of IIITH-BUT to the IWSLT 2025 shared task on speech translation for the low-resource Bhojpuri-Hindi language pair. We explored the impact of hyperparameter optimisation and data augmentation techniques on the performance of the SeamlessM4T model fine-tuned for this specific task. We systematically investigated a range of hyperparameters including learning rate schedules, number of update steps, warm-up steps, label smoothing, and batch sizes; and report their effect on translation quality. To address data scarcity, we applied speed perturbation and SpecAugment and studied their effect on translation quality. We also examined the use of cross-lingual signal through joint training with Marathi and Bhojpuri speech data. Our experiments reveal that careful selection of hyperparameters and the application of simple yet effective augmentation techniques significantly improve performance in low-resource settings. We also analysed the translation hypotheses to understand various kinds of errors that impacted the translation quality in terms of BLEU.

[175] SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat

Yuru Jiang,Wenxuan Ding,Shangbin Feng,Greg Durrett,Yulia Tsvetkov

Main category: cs.CL

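TL;DR: SPARTA ALIGNMENT is an algorithm that collectively aligns multiple LLMs through competition and combat, using peer competition and mutual evaluation to improve generation diversity and reduce bias.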

Details Motivation: A single model lacks diversity in generation and is biased in evaluation; competition and mutual evaluation among multiple models are used to enable self-evolution. Method: Multiple LLMs form a "sparta tribe" that competes and evaluates one another to produce preference pairs; scores are aggregated through an adapted Elo-ranking system, and the models learn from the preference pairs. Result: SPARTA ALIGNMENT outperforms the initial models and 4 baselines on 10 of 12 tasks and datasets, with a 7.0% average improvement, and generalizes better to unseen tasks. Conclusion: Through collective competition and mutual evaluation, SPARTA ALIGNMENT effectively improves LLM performance and generalization. Abstract: We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To complement a single model's lack of diversity in generation and biases in evaluation, multiple LLMs form a "sparta tribe" to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking-based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
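The duel-and-judge loop described above can be sketched with a textbook Elo update in which judges' scores are weighted by their current ratings. This is a loose illustration only: the K-factor of 32, the 400-point scale, the rating-proportional weighting, and all function names are assumptions, since the abstract does not give the details of the adapted reputation system.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def weighted_verdict(scores_a, scores_b, judge_ratings):
    """Aggregate judge scores weighted by judge rating; 1.0 means A wins."""
    total = sum(judge_ratings)
    avg_a = sum(w * s for w, s in zip(judge_ratings, scores_a)) / total
    avg_b = sum(w * s for w, s in zip(judge_ratings, scores_b)) / total
    if avg_a == avg_b:
        return 0.5
    return 1.0 if avg_a > avg_b else 0.0


def elo_update(rating_a, rating_b, outcome_a, k=32.0):
    """Return updated (A, B) ratings; outcome_a is 1.0, 0.5, or 0.0."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - exp_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Hypothetical duel: three peer judges score two responses on a 1-10 scale.
    outcome = weighted_verdict([7, 6, 9], [8, 5, 4], judge_ratings=[1200, 1000, 800])
    print(outcome, elo_update(1000.0, 1000.0, outcome))
```

In this sketch the winning response would also become the preferred side of a preference pair, and the duelists' updated ratings would change how much their votes count when they later act as judges.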

[176] Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection

Ziyi Zhou,Xiaoming Zhang,Litian Zhang,Yibo Zhang,Zhenyu Guan,Chaozhuo Li,Philip S. Yu

Main category: cs.CL

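TL;DR: The paper proposes C$^2$EFND, a new framework that combines the strengths of large language models (LLMs) and small language models (SLMs) through multi-round collaborative learning and knowledge update modules, significantly improving the accuracy and adaptability of fake news detection.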

Details Motivation: The spread of fake news on social media has serious social consequences, and existing methods based on SLMs or LLMs struggle because of data scarcity, outdated knowledge, and related problems. Method: The C$^2$EFND framework combines the generalization power of LLMs with the classification expertise of SLMs through multi-round collaborative learning, and adds a lifelong knowledge editing module and a replay-based continual learning method. Result: Experiments on the Pheme and Twitter16 datasets show that C$^2$EFND significantly outperforms existing methods, improving detection accuracy and adaptability. Conclusion: C$^2$EFND offers an effective solution to the dynamics and knowledge-update problems in fake news detection. Abstract: The widespread dissemination of fake news on social media has significantly impacted society, resulting in serious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from extensive supervised training requirements and difficulties adapting to evolving news environments due to data scarcity and distribution shifts. Large language models (LLMs), despite robust zero-shot capabilities, fall short in accurately detecting fake news owing to outdated knowledge and the absence of suitable demonstrations. In this paper, we propose a novel Continuous Collaborative Emergent Fake News Detection (C$^2$EFND) framework to address these challenges. The C$^2$EFND framework strategically leverages both LLMs' generalization power and SLMs' classification expertise via a multi-round collaborative learning framework. We further introduce a lifelong knowledge editing module based on a Mixture-of-Experts architecture to incrementally update LLMs and a replay-based continual learning method to ensure SLMs retain prior knowledge without retraining entirely. Extensive experiments on Pheme and Twitter16 datasets demonstrate that C$^2$EFND significantly outperforms existing methods, effectively improving detection accuracy and adaptability in continuous emergent fake news scenarios.

[177] Identifying Reliable Evaluation Metrics for Scientific Text Revision

Léane Jourdan,Florian Boudin,Richard Dufour,Nicolas Hernandez

Main category: cs.CL

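TL;DR: The paper analyses the limitations of traditional text revision evaluation metrics and explores alternatives that align better with human judgment, including manual annotation, reference-free metrics, and LLM-based evaluation, ultimately recommending a hybrid approach that combines LLM evaluation with task-specific metrics.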

Details Motivation: Traditional metrics such as ROUGE and BERTScore focus on similarity and fail to capture whether a revision is an actual improvement, so evaluation methods that align better with human judgment are needed. Method: A manual annotation study assesses revision quality; reference-free metrics and LLM-as-a-judge approaches are then examined, with and without a gold reference. Result: LLMs assess instruction-following effectively but struggle with correctness, while domain-specific metrics provide complementary information. The hybrid approach gives the best evaluation. Conclusion: A hybrid approach combining LLM evaluation with task-specific metrics assesses revision quality more reliably. Abstract: Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.

[178] Fine-Grained Interpretation of Political Opinions in Large Language Models

Jingyu Hu,Mengyue Yang,Mengnan Du,Weiru Liu

Main category: cs.CL

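TL;DR: The paper studies the political opinions of LLMs, noting that their open-ended responses can be misaligned with their internal intentions, and proposes a multi-dimensional political learning framework with interpretable representation engineering techniques for more transparent learning of LLM political concepts.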

Details Motivation: Existing studies analyze LLMs' political opinions along a single axis, which easily leads to concept confounds, and LLM responses are misaligned with their internal intentions, so the internal mechanisms need to be probed. Method: A four-dimensional political learning framework is designed and a dataset is built to learn fine-grained political concept vectors; three representation engineering techniques are applied to eight open-source LLMs. Result: The vectors disentangle political concept confounds, detection tasks validate their semantic meaning, and intervention experiments show they can change the political leaning of generated responses. Conclusion: The multi-dimensional framework and representation engineering techniques can reveal and intervene in LLMs' internal political states, improving transparency and controllability. Abstract: Studies of LLMs' political opinions mainly rely on evaluations of their open-ended responses. Recent work indicates that there is a misalignment between LLMs' responses and their internal intentions. This motivates us to probe LLMs' internal mechanisms and help uncover their internal political states. Additionally, we found that the analysis of LLMs' political opinions often relies on single-axis concepts, which can lead to concept confounds. In this work, we extend the single-axis to multi-dimensions and apply interpretable representation engineering techniques for more transparent LLM political concept learning. Specifically, we designed a four-dimensional political learning framework and constructed a corresponding dataset for fine-grained political concept vector learning. These vectors can be used to detect and intervene in LLM internals. Experiments are conducted on eight open-source LLMs with three representation engineering techniques. Results show these vectors can disentangle political concept confounds. Detection tasks validate the semantic meaning of the vectors and show good generalization and robustness in OOD settings. Intervention experiments show these vectors can intervene in LLMs to generate responses with different political leanings.

[179] MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Dingdong Wang,Jincenzi Wu,Junan Li,Dongchao Yang,Xueyuan Chen,Tianhua Zhang,Helen Meng

Main category: cs.CL

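TL;DR: MMSU is a comprehensive benchmark for spoken language understanding and reasoning, containing 5,000 audio-question-answer triplets across 47 tasks; an evaluation of 14 advanced SpeechLLMs shows that existing models still have room for improvement.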

Details Motivation: Spoken language understanding requires integrating semantic, paralinguistic, and phonological features, yet the ability of current SpeechLLMs to perform fine-grained perception and complex reasoning remains underexplored. Method: The MMSU benchmark is built by systematically incorporating a range of linguistic phenomena (phonetics, prosody, rhetoric, syntax, semantics, and paralinguistics), and 14 advanced models are evaluated on it. Result: The evaluation shows substantial room for improvement in spoken language understanding and reasoning. Conclusion: MMSU provides a new evaluation standard for spoken language understanding and points the way toward more sophisticated human-AI speech interaction systems. Abstract: Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. The MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation code is available at https://github.com/dingdongwang/MMSU_Bench.

[180] Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques

Jisu An,Junseok Lee,Jeoungeun Lee,Yongseok Son

Main category: cs.CL

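TL;DR: This survey systematically analyzes multimodal large language models (MLLMs), proposes a classification framework based on architectural strategies, representation learning techniques, and training paradigms, and summarizes trends across 125 models.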

Details Motivation: To fill a gap in the literature on how multimodal inputs connect to the language backbone, and to give researchers a structured view. Method: By analyzing 125 MLLMs, the survey proposes a classification framework along three dimensions: modality integration architecture, representation learning techniques, and training paradigms. Result: It summarizes current development patterns in MLLMs and offers guidance on multimodal integration strategies for future models. Conclusion: The taxonomy gives researchers a systematic tool that can support the development of more robust multimodal integration. Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has transformed the AI landscape. These models combine pre-trained LLMs with various modality encoders. This integration requires a systematic understanding of how different modalities connect to the language backbone. Our survey presents an LLM-centric analysis of current approaches. We examine methods for transforming and aligning diverse modal inputs into the language embedding space. This addresses a significant gap in existing literature. We propose a classification framework for MLLMs based on three key dimensions. First, we examine architectural strategies for modality integration. This includes both the specific integration mechanisms and the fusion level. Second, we categorize representation learning techniques as either joint or coordinate representations. Third, we analyze training paradigms, including training strategies and objective functions. By examining 125 MLLMs developed between 2021 and 2025, we identify emerging patterns in the field. Our taxonomy provides researchers with a structured overview of current integration techniques. These insights aim to guide the development of more robust multimodal integration strategies for future models built on pre-trained foundations.

[181] Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Yujun Zhou,Jiayi Ye,Zipeng Ling,Yufei Han,Yue Huang,Haomin Zhuang,Zhenwen Liang,Kehan Guo,Taicheng Guo,Xiangqi Wang,Xiangliang Zhang

Main category: cs.CL

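TL;DR: FineLogic is a fine-grained framework for evaluating the logical reasoning of large language models along three dimensions: accuracy, stepwise soundness, and representation-level alignment. The study finds that natural language supervision generalizes strongly, while symbolic supervision promotes structured reasoning chains.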

Details Motivation: Existing benchmarks rely only on final-answer accuracy and cannot fully assess the quality and structure of the reasoning process, so a finer-grained evaluation method is needed. Method: The FineLogic framework evaluates logical reasoning along three dimensions, and the study examines how four supervision formats (one natural language and three symbolic variants) affect reasoning ability. Result: Natural language supervision performs well on generalization tasks, while symbolic supervision is better at structured reasoning. Fine-tuning improves reasoning behavior mainly through step-by-step generation. Conclusion: FineLogic offers a more rigorous and interpretable way to evaluate and improve logical reasoning in LLMs. Abstract: Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study on the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Further, our representation-level probing shows that fine-tuning primarily improves reasoning behaviors through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.

[182] Design of intelligent proofreading system for English translation based on CNN and BERT

Feijun Liu,Huifeng Wang,Kun Wang,Yizhen Wang

Main category: cs.CL

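TL;DR: Proposes a hybrid approach combining CNN and BERT for machine translation proofreading, trained end to end for error detection and correction, and reports performance above existing techniques.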

Details Motivation: Automatic translations often contain errors that require human proofreading, and existing proofreading techniques are not efficient enough, so a more robust method is needed to improve quality. Method: A CNN extracts local n-gram features while BERT produces contextual representations; error detection and correction modules are integrated, and training uses parallel corpora and a GRU decoder. Result: Experiments report 90% accuracy, 89.37% F1, and 16.24% MSE, exceeding existing techniques by more than 10%. Conclusion: The method performs well at error detection and correction and offers an efficient solution for machine translation proofreading. Abstract: Since automatic translations can contain errors that require substantial human post-editing, machine translation proofreading is essential for improving quality. This paper proposes a novel hybrid approach for robust proofreading that combines convolutional neural networks (CNN) with Bidirectional Encoder Representations from Transformers (BERT). In order to extract semantic information from phrases and expressions, CNN uses a variety of convolution kernel filters to capture local n-gram patterns. Meanwhile, BERT creates context-rich representations of whole sequences by utilizing stacked bidirectional transformer encoders. Using BERT's attention mechanisms, the integrated error detection component relates tokens to spot translation irregularities including word order problems and omissions. The correction module then uses parallel English-German alignment and GRU decoder models in conjunction with translation memory to propose logical modifications that maintain original meaning. A unified end-to-end training process optimized for post-editing performance is applied to the whole pipeline. The multi-domain collection of WMT and the conversational dialogues of Open-Subtitles are two of the English-German parallel corpora used to train the model. Multiple loss functions supervise detection and correction capabilities. Experiments attain 90% accuracy, 89.37% F1, and 16.24% MSE, exceeding recent proofreading techniques by over 10% overall. Comparative benchmarking demonstrates state-of-the-art performance in identifying and coherently rectifying mistranslations and omissions.

[183] Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms

Nurul Aisyah,Muhammad Dehan Al Kautsar,Arif Hidayat,Raqib Chowdhury,Fajri Koto

Main category: cs.CL

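TL;DR: Studies how a vision-language model (VLM) and large language models (LLMs) perform on handwritten exams from Indonesian fourth-grade students; the VLM's limited handwriting recognition degrades LLM grading, but LLM-generated feedback still has some practical value.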

Details Motivation: To explore the effectiveness of VLMs and LLMs in real classroom settings, particularly in underrepresented educational contexts. Method: A VLM and several LLMs are evaluated on 646 handwritten exam responses from Indonesian fourth-grade students, covering Mathematics and English and including multiple-choice, short-answer, and essay questions. Result: The VLM performs poorly at handwriting recognition, which causes LLM grading errors; LLM-generated feedback nevertheless retains some value, although personalization and contextual relevance are limited. Conclusion: VLMs and LLMs show potential for educational assessment, but handwriting recognition and feedback quality need to improve. Abstract: Although vision-language and large language models (VLM and LLM) offer promising opportunities for AI-driven educational assessment, their effectiveness in real-world classroom settings, particularly in underrepresented educational contexts, remains underexplored. In this study, we evaluated the performance of a state-of-the-art VLM and several LLMs on 646 handwritten exam responses from grade 4 students in six Indonesian schools, covering two subjects: Mathematics and English. These sheets contain more than 14K student answers that span multiple choice, short answer, and essay questions. Assessment tasks include grading these responses and generating personalized feedback. Our findings show that the VLM often struggles to accurately recognize student handwriting, leading to error propagation in downstream LLM grading. Nevertheless, LLM-generated feedback retains some utility, even when derived from imperfect input, although limitations in personalization and contextual relevance persist.

[184] A Reasoning-Based Approach to Cryptic Crossword Clue Solving

Martin Andrews,Sam Witteveen

Main category: cs.CL

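TL;DR: Proposes an LLM-based system for solving cryptic crossword clues by hypothesising answers, proposing wordplay explanations, and verifying the reasoning steps, achieving state-of-the-art performance on the Cryptonite dataset.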

Details Motivation: Cryptic crossword clues are a complex language task for which new test sets are released daily, calling for an efficient and interpretable solver. Method: The system combines an LLM with open-licensed components and solves clues by hypothesising answers, proposing wordplay explanations, and verifying them. Result: The system sets a new state of the art on the Cryptonite dataset, and its solutions are expressed as Python code that can be inspected. Conclusion: The system provides an efficient and interpretable solver for cryptic crosswords and illustrates the potential of LLMs on complex language tasks. Abstract: Cryptic crossword clues are challenging language tasks for which new test sets are released daily by major newspapers on a global basis. Each cryptic clue contains both the definition of the answer to be placed in the crossword grid (in common with regular crosswords), and 'wordplay' that proves that the answer is correct (i.e. a human solver can be confident that an answer is correct without needing crossing words as confirmation). This work describes an LLM-based reasoning system built from open-licensed components that solves cryptic clues by (i) hypothesising answers; (ii) proposing wordplay explanations; and (iii) using a verifier system that operates on codified reasoning steps. Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection.

[185] Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models

Changyue Wang,Weihang Su,Qingyao Ai,Yiqun Liu

Main category: cs.CL

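TL;DR: Proposes RACE, a framework for detecting hallucinations in Large Reasoning Models (LRMs) by analyzing the consistency of reasoning steps, answer uncertainty, semantic alignment, and internal coherence; it outperforms existing methods.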

Details Motivation: The reasoning traces of large reasoning models can be redundant or inconsistent, making them a new source of hallucination that existing methods struggle to detect. Method: The RACE framework extracts essential reasoning steps and computes four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. Result: Experiments show that RACE outperforms existing hallucination detection baselines across multiple datasets and different LLMs. Conclusion: RACE provides a robust and generalizable solution for evaluating LRMs. Abstract: Large Reasoning Models (LRMs) extend large language models with explicit, multi-step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, making them a new source of hallucination that is difficult to detect. Existing hallucination detection methods focus primarily on answer-level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model's reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model's decision-making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. This joint analysis enables fine-grained hallucination detection even when the final answer appears correct. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. Our code is available at: https://github.com/bebr2/RACE.
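
As an illustration of one of the four signals named above, entropy-based answer uncertainty can be sketched as the Shannon entropy of the empirical distribution over answers sampled for the same question. This is not the RACE implementation: the function name `answer_entropy` and the grouping of answers by normalized exact match are assumptions made only for this sketch.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution.

    Assumption for this sketch: answers are grouped by case-insensitive
    exact match after stripping whitespace.
    """
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Higher entropy means the sampled answers disagree more.
agreeing = answer_entropy(["Paris", "paris", " Paris "])
split = answer_entropy(["Paris", "Lyon"])
```

Under this reading, a question whose sampled answers all coincide scores zero, and the score grows as the samples spread over more distinct answers.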

[186] MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Dávid Javorský,Ondřej Bojar,François Yvon

Main category: cs.CL

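TL;DR: Introduces the MockConf dataset and the InterAlign tool to support research on simultaneous interpreting, filling gaps left by existing parallel corpora and tools.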

Details Motivation: Existing parallel corpora and alignment algorithms cannot adequately model the long-range interactions and specific types of divergence (such as simplification and functional generalization) found in simultaneous interpreting, so dedicated datasets and tools are needed. Method: The MockConf student interpreting dataset is collected (7 hours of recordings in 5 European languages), and the InterAlign tool is developed for annotation and alignment. Result: The dataset and tool are released, together with proposed evaluation metrics and a baseline for automatic alignment. Conclusion: MockConf and InterAlign are important resources for research on simultaneous interpreting, supporting automatic analysis and evaluation. Abstract: In simultaneous interpreting, an interpreter renders a source speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need dedicated datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g., shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we introduce MockConf, a student interpreting dataset that was collected from Mock Conferences run as part of the students' curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools are released to the community.

[187] Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights

Giorgio Biancini,Alessio Ferrato,Carla Limongelli

Main category: cs.CL

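TL;DR: Explores the use of large language models (LLMs) to generate multiple-choice questions (MCQs), comparing Llama 2, Mistral, and GPT-3.5; GPT-3.5 performs best, and the study also finds that acceptance of AI in education still has room to grow.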

Details Motivation: Writing MCQs by hand is time-consuming and laborious; LLMs could ease this burden and improve educational efficiency. Method: Knowledge is injected into the prompt instead of relying on the LLM's own knowledge; the MCQs generated by three models are compared and assessed by 21 educators. Result: GPT-3.5 performs best on several metrics, but acceptance of AI in education remains limited. Conclusion: LLMs have great potential for generating MCQs, but the adoption of AI in education needs further encouragement. Abstract: Integrating Artificial Intelligence (AI) in educational settings has brought new learning approaches, transforming the practices of both students and educators. Among the various technologies driving this transformation, Large Language Models (LLMs) have emerged as powerful tools for creating educational materials and question answering, but there is still room for new applications. Educators commonly use Multiple-Choice Questions (MCQs) to assess student knowledge, but manually generating these questions is resource-intensive and requires significant time and cognitive effort. In our opinion, LLMs offer a promising solution to these challenges. This paper presents a novel comparative analysis of three widely known LLMs - Llama 2, Mistral, and GPT-3.5 - to explore their potential for creating informative and challenging MCQs. In our approach, we do not rely on the knowledge of the LLM, but we inject the knowledge into the prompt to counter hallucinations, which also gives educators control over the test's source text. Our experiment involving 21 educators shows that GPT-3.5 generates the most effective MCQs across several known metrics. Additionally, it shows that there is still some reluctance to adopt AI in the educational field. This study sheds light on the potential of LLMs to generate MCQs and improve the educational experience, providing valuable insights for the future.

[188] Prompting LLMs: Length Control for Isometric Machine Translation

Dávid Javorský,Ondřej Bojar,François Yvon

Main category: cs.CL

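TL;DR: Examines isometric machine translation across several language pairs (En→De, En→Fr, En→Es), analyzing how prompting strategies, the number of few-shot examples, and demonstration selection affect translation quality and length control.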

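Details Motivation: To explore how prompting strategies and demonstration selection can improve translation quality and length control under the conditions of the IWSLT 2022 Isometric Shared Task. Method: Eight open-source large language models (LLMs) of varying sizes are used to test different prompting strategies, numbers of few-shot examples, and demonstration selection. Result: Aligning the wording of the instruction with the properties of the demonstrations is crucial for length control; extreme examples lead to shorter translations, whereas isometric demonstrations often cause the model to ignore length constraints; few-shot prompting improves translation quality, but gains across 5, 10, and 20 shots are marginal; considering multiple outputs improves the tradeoff between length and quality. Conclusion: Prompting strategy and demonstration selection strongly affect translation quality and length control, and the multiple-output approach reaches state-of-the-art performance for some language pairs. Abstract: In this study, we explore the effectiveness of isometric machine translation across multiple language pairs (En$\to$De, En$\to$Fr, and En$\to$Es) under the conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source large language models (LLMs) of varying sizes, we investigate how different prompting strategies, varying numbers of few-shot examples, and demonstration selection influence translation quality and length control. We discover that the phrasing of instructions, when aligned with the properties of the provided demonstrations, plays a crucial role in controlling the output length. Our experiments show that LLMs tend to produce shorter translations only when presented with extreme examples, while isometric demonstrations often lead to the models disregarding length constraints. While few-shot prompting generally enhances translation quality, further improvements are marginal across 5, 10, and 20-shot settings. Finally, considering multiple outputs notably improves the overall tradeoff between length and quality, yielding state-of-the-art performance for some language pairs.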

[189] Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies

Wenxi Li

Main category: cs.CL

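TL;DR: Integrating Universal Dependencies (UD) into pretrained language models significantly improves performance on a cross-lingual adversarial paraphrase identification task.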

Details Motivation: Although UD is a successful framework for cross-lingual syntactic representation, its effectiveness has not been fully explored; this paper aims to fill that gap. Method: UD is integrated into pretrained language models, which are then evaluated on a cross-lingual adversarial paraphrase identification task. Result: Adding UD significantly improves accuracy and F1 (average gains of 3.85% and 6.08%, respectively), narrowing the gap between pretrained models and large language models and even surpassing the latter on some language pairs. Conclusion: UD is valid and promising for out-of-domain tasks, and the UD-based similarity between a language and English correlates positively with model performance in that language. Abstract: Universal Dependencies (UD), while widely regarded as the most successful linguistic framework for cross-lingual syntactic representation, remains underexplored in terms of its effectiveness. This paper addresses this gap by integrating UD into pretrained language models and assesses whether UD can improve their performance on a cross-lingual adversarial paraphrase identification task. Experimental results show that incorporation of UD yields significant improvements in accuracy and $F_1$ scores, with average gains of 3.85% and 6.08% respectively. These enhancements reduce the performance gap between pretrained models and large language models in some language pairs, and even outperform the latter in some others. Furthermore, the UD-based similarity score between a given language and English is positively correlated with the performance of models in that language. Both findings highlight the validity and potential of UD in out-of-domain tasks.

[190] ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

Shiyi Xu,Yiwen Hu,Yingqian Min,Zhipeng Chen,Wayne Xin Zhao,Ji-Rong Wen

Main category: cs.CL

TL;DR: Proposes ICPC-Eval, a new benchmark for evaluating the reasoning ability of large language models in competitive programming settings.

Details Motivation: Existing benchmarks and evaluation metrics cannot adequately assess the coding and reflective abilities of large language models in real competition environments. Method: ICPC-Eval contains 118 curated problems and provides a realistic contest scenario, a local evaluation toolkit, and a new evaluation metric, Refine@K. Result: Top reasoning models still need multi-turn feedback to reach their potential and still lag behind human teams. Conclusion: ICPC-Eval is an effective tool for evaluating complex reasoning ability and exposes the limitations of current models. Abstract: With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose ICPC-Eval, a top-level competitive coding benchmark designed to probe the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge in evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
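
The abstract describes Refine@K only as a metric that "allows iterative repair of solutions based on execution feedback". One plausible reading, offered as an assumption and not as the benchmark's definition, is that a problem counts as solved if any of up to K attempts passes, where each attempt after the first sees the judge's feedback on the previous one. The callables `generate` and `judge` below are hypothetical placeholders for a model and a local evaluator.

```python
from typing import Callable, Optional, Tuple


def refine_at_k(
    generate: Callable[[Optional[str]], str],
    judge: Callable[[str], Tuple[bool, str]],
    k: int,
) -> bool:
    """Return True if some attempt within a budget of k passes the judge.

    generate(feedback) -> candidate solution (feedback is None on the first try)
    judge(solution)    -> (passed, feedback text for the next attempt)
    Both callables are hypothetical stand-ins for this sketch.
    """
    feedback = None
    for _ in range(k):
        solution = generate(feedback)
        passed, feedback = judge(solution)
        if passed:
            return True
    return False


# Toy stand-ins: the "model" only gets it right after seeing feedback.
def toy_generate(feedback):
    return "v0" if feedback is None else "v1"


def toy_judge(solution):
    return (solution == "v1", "wrong answer on test 1")
```

With these toy stand-ins, a budget of one attempt fails and a budget of two succeeds; a benchmark-level score would then be the fraction of problems for which `refine_at_k` returns True.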

[191] Verbose ListOps (VLO): Beyond Long Context -- Unmasking LLM's Reasoning Blind Spots

Alex Pan,Mary-Anne Williams

Main category: cs.CL

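TL;DR: Verbose ListOps is a new benchmark that turns ListOps computations into long stories to test LLMs' state management in nested narrative reasoning, exposing their limitations.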

Details Motivation: Existing benchmarks do not effectively test LLMs on nested narrative reasoning, which masks a fundamental limitation; Verbose ListOps aims to fill this gap. Method: ListOps computations are programmatically converted into long stories, forcing LLMs to perform internal computation and state management, with separate control over narrative length and reasoning difficulty. Result: Leading LLMs (e.g., OpenAI o4, Gemini 2.5 Pro) perform poorly on Verbose ListOps, collapsing at modest narrative lengths (about 10k tokens). Conclusion: Verbose ListOps exposes LLMs' weakness in state management and points to directions for improving reasoning, a key step toward automating knowledge work. Abstract: Large Language Models (LLMs), whilst great at extracting facts from text, struggle with nested narrative reasoning. Existing long context and multi-hop QA benchmarks inadequately test this, lacking realistic distractors or failing to decouple context length from reasoning complexity, masking a fundamental LLM limitation. We introduce Verbose ListOps, a novel benchmark that programmatically transposes ListOps computations into lengthy, coherent stories. This uniquely forces internal computation and state management of nested reasoning problems by withholding intermediate results, and offers fine-grained controls for both narrative size and reasoning difficulty. Whilst benchmarks like LongReason (2025) advance approaches for synthetically expanding the context size of multi-hop QA problems, Verbose ListOps pinpoints a specific LLM vulnerability: difficulty in state management for nested sub-reasoning amongst semantically-relevant, distracting narrative. Our experiments show that leading LLMs (e.g., OpenAI o4, Gemini 2.5 Pro) collapse in performance on Verbose ListOps at modest (~10k token) narrative lengths, despite effortlessly solving raw ListOps equations. Addressing this failure is paramount for real-world text interpretation which requires identifying key reasoning points, tracking conceptual intermediate results, and filtering irrelevant information. Verbose ListOps and its extensible generation framework thus enable targeted reasoning enhancements beyond mere context-window expansion; a critical step to automating the world's knowledge work.
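
For readers unfamiliar with the "raw ListOps equations" mentioned above, the following minimal sketch evaluates a bracketed ListOps expression with the standard MAX, MIN, MED, and SM (sum modulo 10) operators. It illustrates the underlying task only, not the benchmark's story generator; the function name `evaluate` and the use of the lower median for even-length lists are assumptions made for this sketch.

```python
import statistics

# Operators of the ListOps task; each maps a list of ints to one int.
OPS = {
    "MAX": max,
    "MIN": min,
    "MED": statistics.median_low,    # assumption: lower median on ties
    "SM": lambda xs: sum(xs) % 10,   # sum modulo 10
}


def evaluate(expr: str) -> int:
    """Evaluate a nested ListOps expression such as '[MAX 2 9 [MIN 4 7 ] 0 ]'."""
    tokens = expr.replace("[", " [ ").replace("]", " ] ").split()
    stack = []  # one frame per open bracket: [operator, arg, arg, ...]
    for tok in tokens:
        if tok == "[":
            stack.append([])
        elif tok == "]":
            frame = stack.pop()
            value = OPS[frame[0]](frame[1:])
            if not stack:          # closed the outermost bracket
                return value
            stack[-1].append(value)
        elif tok in OPS:
            stack[-1].append(tok)
        else:
            stack[-1].append(int(tok))
    raise ValueError("unbalanced expression")
```

Each closing bracket collapses a sub-list into a single intermediate value. Those intermediate values are the state that, according to the abstract, the narrative version withholds and the model must track internally.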

[192] A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic

Ondřej Klejch,William Lamb,Peter Bell

Main category: cs.CL

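TL;DR: Challenges the prevailing view and proposes combining hybrid HMMs with self-supervised models, which outperforms fine-tuned multilingual end-to-end models for low-resource ASR.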

Details Motivation: Fine-tuning multilingual end-to-end models is widely believed to work best for low-resource languages, including languages absent from the original training data, but this paper finds that the approach falls short. Method: Hybrid HMMs are combined with self-supervised models, using continued self-supervised pre-training and semi-supervised training to make full use of the available data. Result: On Scottish Gaelic, WER is reduced by 32% relative to the best fine-tuned Whisper model. Conclusion: Combining hybrid HMMs with self-supervised models is the stronger option for low-resource ASR, with substantially better performance. Abstract: An effective approach to the development of ASR systems for low-resource languages is to fine-tune an existing multilingual end-to-end model. When the original model has been trained on large quantities of data from many languages, fine-tuning can be effective with limited training data, even when the language in question was not present in the original training data. The fine-tuning approach has been encouraged by the availability of public-domain E2E models and is widely believed to lead to state-of-the-art results. This paper, however, challenges that belief. We show that an approach combining hybrid HMMs with self-supervised models can yield substantially better performance with limited training data. This combination allows better utilisation of all available speech and text data through continued self-supervised pre-training and semi-supervised training. We benchmark our approach on Scottish Gaelic, achieving WER reductions of 32% relative over our best fine-tuned Whisper model.
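
To make the "32% relative" figure concrete, the sketch below computes word error rate as word-level edit distance divided by reference length, and a relative reduction as the drop in WER divided by the baseline WER. The function names and the example WER values are illustrative assumptions; the abstract does not report absolute WERs.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances for the empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                  # deletion
                cur[j - 1] + 1,               # insertion
                prev[j - 1] + (r != h),       # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: going from 25% WER to 17% WER is a 32% relative reduction.
example = relative_reduction(0.25, 0.17)
```

A relative reduction is therefore a statement about the ratio of the two error rates; a 32% relative reduction does not mean the WER fell by 32 absolute points.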

[193] Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback

Junior Cedric Tonga,KV Aditya Srivatsa,Kaushal Kumar Maurya,Fajri Koto,Ekaterina Kochmar

Main category: cs.CL

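TL;DR: Studies how effective multilingual large language models (LLMs) are at giving instructional feedback on math reasoning tasks, finding that multilingual hints significantly improve learning outcomes, especially in low-resource languages.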

Details Motivation: To examine whether LLMs can give effective instructional feedback across languages, filling a gap in multilingual educational support. Method: Multilingual tutor-student interactions are simulated, with a stronger model generating hint feedback and a weaker model playing the student, across 11 languages and multiple prompting strategies. Result: Multilingual hints significantly improve learning outcomes, particularly in low-resource languages when the feedback language matches the student's native language. Conclusion: The study offers practical insights for building multilingual LLM-based educational tools and underlines the importance of aligning feedback language with the student's native language. Abstract: Large language models (LLMs) have demonstrated the ability to generate formative feedback and instructional hints in English, making them increasingly relevant for AI-assisted education. However, their ability to provide effective instructional support across different languages, especially for mathematically grounded reasoning tasks, remains largely unexamined. In this work, we present the first large-scale simulation of multilingual tutor-student interactions using LLMs. A stronger model plays the role of the tutor, generating feedback in the form of hints, while a weaker model simulates the student. We explore 352 experimental settings across 11 typologically diverse languages, four state-of-the-art LLMs, and multiple prompting strategies to assess whether language-specific feedback leads to measurable learning gains. Our study examines how student input language, teacher feedback language, model choice, and language resource level jointly influence performance. Results show that multilingual hints can significantly improve learning outcomes, particularly in low-resource languages when feedback is aligned with the student's native language. These findings offer practical insights for developing multilingual, LLM-based educational tools that are both effective and inclusive.

[194] ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT

Mikołaj Pokrywka,Wojciech Kusa,Mieszko Rutkowski,Mikołaj Koszowski

Main category: cs.CL

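TL;DR: Explores improving neural machine translation quality on e-commerce data by adding contextual information (such as images and product metadata), and releases a new Czech-to-Polish dataset.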

Details Motivation: Neural machine translation (NMT) still struggles with word ambiguity and insufficient context in domain-specific applications, especially on e-commerce data. Method: ConECT, a Czech-to-Polish dataset of 11,400 sentence pairs with images and product metadata, is created, and different approaches using a vision-language model (VLM) and text-to-text models are tested. Result: Adding visual context and information such as the product category path improves translation quality. Conclusion: Integrating contextual information effectively improves machine translation quality; the new dataset is publicly released. Abstract: Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often have problems with unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end we create ConECT -- a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata consisting of 11,400 sentence pairs. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product's category path or image descriptions. The results of our study demonstrate that the incorporation of contextual information leads to an improvement in the quality of machine translation. We make the new dataset publicly available.

[195] From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation

Adrian Marius Dumitran,Theodor-Pierre Moroianu,Vasile Paul Alexe

Main category: cs.CL

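TL;DR: Evaluates large language models (LLMs) on university-level algorithms exams, finding that the latest models perform very well but still struggle with graph-related tasks.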

Details Motivation: To study how well LLMs solve complex algorithmic problems and their potential for use in education. Method: Multiple models are tested on a Romanian exam and its high-quality English translation, and their problem-solving ability, consistency, and multilingual performance are analyzed. Result: The latest models score close to top students and show complex multi-step reasoning, but still have difficulty with graph tasks. Conclusion: LLMs have potential in education, can be used to generate high-quality feedback content, and can support further integration of generative AI in algorithm education. Abstract: This paper presents a comprehensive evaluation of the performance of state-of-the-art Large Language Models (LLMs) on challenging university-level algorithms exams. By testing multiple models on both a Romanian exam and its high-quality English translation, we analyze LLMs' problem-solving capabilities, consistency, and multilingual performance. Our empirical study reveals that the most recent models not only achieve scores comparable to top-performing students but also demonstrate robust reasoning skills on complex, multi-step algorithmic challenges, even though difficulties remain with graph-based tasks. Building on these findings, we explore the potential of LLMs to support educational environments through the generation of high-quality editorial content, offering instructors a powerful tool to enhance student feedback. The insights and best practices discussed herein pave the way for further integration of generative AI in advanced algorithm education.

[196] Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering

Andres Carofilis,Pradeep Rangappa,Srikanth Madikeri,Shashi Kumar,Sergio Burdisso,Jeena Prakash,Esau Villatoro-Tello,Petr Motlicek,Bidisha Sharma,Kadri Hacioglu,Shankar Venkatesan,Saurabh Vyas,Andreas Stolcke

Main category: cs.CL

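TL;DR: The paper proposes an incremental semi-supervised learning pipeline that combines a small in-domain labeled set with auxiliary data from a related domain and filters pseudo-labels via multi-model consensus or named entity recognition, substantially improving ASR performance when labeled data is scarce.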

Details Motivation: Scarce labeled data is a challenge when fine-tuning pretrained ASR models for a specific domain, but unlabeled audio and labeled data from related domains are often available. Method: An incremental semi-supervised pipeline first integrates a small in-domain labeled set with auxiliary data from a related domain, then selects and iteratively refines pseudo-labels using multi-model consensus or NER-based filtering. Result: On the Wow call center and Fisher English corpora, the method outperforms single-step fine-tuning; consensus filtering gives relative improvements of up to 22.3% (Wow) and 24.8% (Fisher). NER filtering is second best at a lower computational cost. Conclusion: Consensus filtering is the best-performing method; NER filtering balances performance and cost and suits compute-limited settings. Abstract: Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, it outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
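
The abstract does not spell out how multi-model consensus is computed, so the following is only a minimal sketch of one plausible reading: an utterance is kept as a pseudo-label when every pair of model transcripts disagrees by at most a small word-level edit rate. The function names, the `max_disagreement` threshold, and the choice to keep the first model's transcript are all assumptions made for illustration, not the authors' pipeline.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def disagreement(hyp_a, hyp_b):
    """Word-level disagreement rate, normalised by the longer transcript."""
    a, b = hyp_a.split(), hyp_b.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return word_edit_distance(a, b) / longest


def consensus_filter(hypotheses, max_disagreement=0.1):
    """Keep utterances whose model transcripts all agree closely.

    hypotheses: dict mapping utterance id -> list of transcripts, one per model.
    Returns a dict of utterance id -> pseudo-label (hypothetical choice: the
    first model's transcript).
    """
    selected = {}
    for utt_id, hyps in hypotheses.items():
        pairs = [
            (hyps[i], hyps[j])
            for i in range(len(hyps))
            for j in range(i + 1, len(hyps))
        ]
        if all(disagreement(a, b) <= max_disagreement for a, b in pairs):
            selected[utt_id] = hyps[0]
    return selected


toy_hypotheses = {
    "utt1": ["hello world", "hello world"],
    "utt2": ["book a flight", "cook a light"],
}
pseudo_labels = consensus_filter(toy_hypotheses)
```

In an incremental setup, the selected pseudo-labels would be added to the training set, the models retrained, and the filter applied again to the remaining unlabeled audio.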

[197] SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View

Yongjie Xiao,Hongru Liang,Peixin Qin,Yao Zhang,Wenqiang Lei

Main category: cs.CL

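TL;DR: The paper proposes SCOP, a framework that evaluates the comprehension ability of large language models (LLMs) from a cognitive view, finds a gap between LLMs and the expert comprehension process, and suggests directions for improvement.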

Details Motivation: Despite the great potential of LLMs in machine comprehension, there is no rational explanation of whether their comprehension process aligns with that of experts, so a systematic evaluation is needed. Method: SCOP defines five requisite comprehension skills, constructs test data for them, and analyzes open-source and closed-source LLMs in detail. Result: LLMs struggle to reach expert-level comprehension; they do better on local information, but can reach correct answers through flawed comprehension processes. Conclusion: Work on improving LLMs should focus more on the comprehension process, ensuring that all comprehension skills are thoroughly developed during training. Abstract: Despite the great potential of large language models (LLMs) in machine comprehension, it is still disturbing to fully count on them in real-world scenarios. This is probably because there is no rational explanation for whether the comprehension process of LLMs is aligned with that of experts. In this paper, we propose SCOP to carefully examine how LLMs perform during the comprehension process from a cognitive view. Specifically, it is equipped with a systematic definition of five requisite skills during the comprehension process, a strict framework to construct testing data for these skills, and a detailed analysis of advanced open-source and closed-source LLMs using the testing data. With SCOP, we find that it is still challenging for LLMs to perform an expert-level comprehension process. Even so, we notice that LLMs share some similarities with experts, e.g., performing better at comprehending local information than global information. Further analysis reveals that LLMs can be somewhat unreliable -- they might reach correct answers through flawed comprehension processes. Based on SCOP, we suggest that one direction for improving LLMs is to focus more on the comprehension process, ensuring all comprehension skills are thoroughly developed during training.

[198] ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

Zhenran Xu,Xue Yang,Yiyu Wang,Qingli Hu,Zijiao Wu,Longyue Wang,Weihua Luo,Kaifu Zhang,Baotian Hu,Min Zhang

Main category: cs.CL

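TL;DR: ComfyUI-Copilot is an LLM-powered plugin designed to improve the usability and efficiency of ComfyUI, addressing the challenges newcomers face through intelligent node recommendation and one-click workflow construction.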

Details Motivation: ComfyUI is flexible and user-friendly, but newcomers may face limited documentation, model misconfigurations, and complex workflow design; ComfyUI-Copilot aims to address these problems. Method: Uses a hierarchical multi-agent framework with a central assistant agent and specialized worker agents, supported by knowledge bases, to provide intelligent node recommendation and one-click workflow construction. Result: Offline evaluation and user feedback show that the plugin accurately recommends nodes and accelerates workflow development, lowering the entry barrier for beginners and improving efficiency for experienced users. Conclusion: ComfyUI-Copilot effectively addresses ComfyUI's usability problems and suits users at different levels; its installation package and demo video are publicly available. Abstract: We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.

[199] Controlling Summarization Length Through EOS Token Weighting

Zeno Belligoli,Emmanouil Stergiadis,Eran Fainman,Ilya Gusev

Main category: cs.CL

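TL;DR: Proposes a simple method for controlling the length of generated text by adjusting the weight of the EOS token in the cross-entropy loss, applicable across models and decoding algorithms.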

Details Motivation: Existing methods usually require complex model modifications, which limits compatibility with pre-trained models. Method: Controls the length of generated text by increasing the importance of EOS-token prediction in the cross-entropy loss. Result: The method effectively controls generation length, usually without affecting summary quality. Conclusion: The method is simple and general, and applies to a variety of models and decoding algorithms. Abstract: Controlling the length of generated text can be crucial in various text-generation tasks, including summarization. Existing methods often require complex model alterations, limiting compatibility with pre-trained models. We address these limitations by developing a simple approach for controlling the length of automatic text summaries by increasing the importance of correctly predicting the EOS token in the cross-entropy loss computation. The proposed methodology is agnostic to architecture and decoding algorithms and orthogonal to other inference-time techniques to control the generation length. We tested it with encoder-decoder and modern GPT-style LLMs, and show that this method can control generation length, often without affecting the quality of the summary.
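
To make the idea of up-weighting the EOS token concrete, here is a minimal sketch of a token-weighted cross-entropy in plain Python. The abstract does not give the exact loss, so the weight value, the normalisation by the sum of weights, and the names `eos_weighted_cross_entropy`, `eos_id`, and `eos_weight` are assumptions chosen for illustration.

```python
import math


def eos_weighted_cross_entropy(token_probs, targets, eos_id, eos_weight=5.0):
    """Weighted mean negative log-likelihood over a target sequence.

    token_probs: one probability distribution per position, indexable by token id.
    targets: gold token ids, one per position.
    Positions whose gold token is EOS count eos_weight times as much
    (assumed weighting scheme, not necessarily the paper's).
    """
    total, weight_sum = 0.0, 0.0
    for probs, target in zip(token_probs, targets):
        weight = eos_weight if target == eos_id else 1.0
        total += -weight * math.log(probs[target])
        weight_sum += weight
    return total / weight_sum


# Toy two-token vocabulary: id 0 is a word, id 1 is EOS.
toy_probs = [[0.5, 0.5], [0.25, 0.75]]
toy_targets = [0, 1]
plain_loss = eos_weighted_cross_entropy(toy_probs, toy_targets, eos_id=1, eos_weight=1.0)
weighted_loss = eos_weighted_cross_entropy(toy_probs, toy_targets, eos_id=1, eos_weight=3.0)
```

With `eos_weight=1.0` this reduces to ordinary mean cross-entropy; larger values make errors at the EOS position dominate the loss, which is the lever the abstract describes for steering where generation stops.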

[200] Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers

Yutao Hou,Zeguan Xiao,Fei Yu,Yihan Jiang,Xuetao Wei,Hailiang Huang,Yun Chen,Guanhua Chen

Main category: cs.CL

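TL;DR: AR-Checker is a framework that automatically generates variants of mathematical problems to test the robustness of large language models (LLMs) while avoiding data contamination.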

Details Motivation: LLMs may fail unexpectedly on simple reasoning tasks, and existing evaluation methods risk data contamination, so test cases need to be generated dynamically. Method: Generates problem variants that are semantically equivalent to the original but may cause LLMs to fail, through multi-round parallel LLM-based rewriting and verification. Result: Strong performance on mathematical tasks such as GSM8K and MATH-500 and on non-mathematical tasks such as MMLU and CommonsenseQA. Conclusion: AR-Checker effectively evaluates LLM robustness, avoids data contamination, and applies to a range of tasks. Abstract: Large language models (LLMs) have achieved distinguished performance on various reasoning-intensive tasks. However, LLMs might still face the challenges of robustness issues and fail unexpectedly in some simple reasoning tasks. Previous works evaluate the LLM robustness with hand-crafted templates or a limited set of perturbation rules, indicating potential data contamination in pre-training or fine-tuning datasets. In this work, inspired by stress testing in software engineering, we propose a novel framework, Automatic Robustness Checker (AR-Checker), to generate mathematical problem variants that maintain the semantic meanings of the original one but might cause the LLMs to fail. The AR-Checker framework generates mathematical problem variants through multi-round parallel streams of LLM-based rewriting and verification. Our framework can generate benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the strong performance of AR-Checker on mathematical tasks. We also evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro, and CommonsenseQA, where it also achieves strong performance, further proving the effectiveness of AR-Checker.

[201] TALL -- A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages

Moshe Ofer,Orel Zamler,Amos Azaria

Main category: cs.CL

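TL;DR: The TALL architecture improves LLM performance in low-resource languages using bilingual translation models and dimension alignment layers; experiments show significant gains over baselines.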

Details Motivation: To address the poor performance of LLMs in low-resource languages caused by limited training data. Method: Combines an LLM with bilingual translation models and improves performance through dimension alignment and lightweight adapter modules. Result: In experiments on Hebrew, TALL significantly outperforms direct use, naive translation, and fine-tuning approaches. Conclusion: TALL balances computational efficiency and performance gains in a parameter-efficient way and suits low-resource language settings. Abstract: Large Language Models (LLMs) excel in high-resource languages but struggle with low-resource languages due to limited training data. This paper presents TALL (Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages), which integrates an LLM with two bilingual translation models. TALL transforms low-resource inputs into high-resource representations, leveraging the LLM's capabilities while preserving linguistic features through dimension alignment layers and custom transformers. Our experiments on Hebrew demonstrate significant improvements over several baselines, including direct use, naive translation, and fine-tuning approaches. The architecture employs a parameter-efficient strategy, freezing pre-trained components while training only lightweight adapter modules, balancing computational efficiency with performance gains.

[202] Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Noy Sternlicht,Ariel Gera,Roy Bar-Haim,Tom Hope,Noam Slonim

Main category: cs.CL

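TL;DR: The paper proposes debate speech evaluation as a new benchmark for assessing LLM judges, analyzes how well LLMs understand debate speeches at multiple levels, and compares them with human judges.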

Details Motivation: Evaluating debate speeches requires deep understanding at multiple levels, yet LLM ability in this area has not been systematically assessed. Method: Uses a dataset of over 600 annotated debate speeches to analyze how state-of-the-art LLMs differ from human judges. Result: Larger models approach human judges in some respects but differ markedly in overall judgment behavior; frontier LLMs can generate persuasive speeches at a human level. Conclusion: Debate speech evaluation offers a new perspective on LLM capabilities and reveals both the differences between models and human judges and the models' potential. Abstract: We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.

[203] Does It Make Sense to Speak of Introspection in Large Language Models?

Iulia Comşa,Murray Shanahan

Main category: cs.CL

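TL;DR: The paper examines whether self-reports from large language models (LLMs) can be regarded as introspection, analyzing the question through two examples.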

Details Motivation: As the linguistic and cognitive capabilities of LLMs grow, whether their self-reports count as introspection becomes a question worth examining. Method: Analyzes two examples of LLM self-report to assess whether they demonstrate introspection. Result: The first example (describing a creative writing process) does not constitute introspection; the second (inferring the model's own temperature parameter) can be viewed as a minimal form of introspection, though without conscious experience. Conclusion: Some LLM self-reports can be regarded as introspection, but they lack the backing of consciousness and should be interpreted with care. Abstract: Large language models (LLMs) exhibit compelling linguistic behaviour, and sometimes offer self-reports, that is to say statements about their own nature, inner workings, or behaviour. In humans, such reports are often attributed to a faculty of introspection and are typically linked to consciousness. This raises the question of how to interpret self-reports produced by LLMs, given their increasing linguistic fluency and cognitive capabilities. To what extent (if any) can the concept of introspection be meaningfully applied to LLMs? Here, we present and critique two examples of apparent introspective self-report from LLMs. In the first example, an LLM attempts to describe the process behind its own "creative" writing, and we argue this is not a valid example of introspection. In the second example, an LLM correctly infers the value of its own temperature parameter, and we argue that this can be legitimately considered a minimal example of introspection, albeit one that is (presumably) not accompanied by conscious experience.

[204] RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

Tianjiao Li,Mengran Yu,Chenyu Shi,Yanjun Zhao,Xiaojing Liu,Qiang Zhang,Qi Zhang,Xuanjing Huang,Jiayin Wang

Main category: cs.CL

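TL;DR: The paper proposes RIVAL, an adversarial training framework that addresses the performance degradation caused by distributional shift when LLMs translate colloquial subtitles.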

Details Motivation: LLMs trained with RLHF perform poorly on colloquial subtitle translation because the offline reward model gradually diverges from the online LLM under distributional shift. Method: Proposes the RIVAL adversarial training framework, which models the reward model (RM) and the LLM as a min-max game and optimizes with both qualitative and quantitative rewards. Result: Experiments show that RIVAL significantly improves translation performance. Conclusion: The RIVAL framework effectively addresses the distributional shift problem and improves translation quality. Abstract: Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to enhance its translation for closing this gap. To stabilize training and improve generalizability, we also incorporate quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.

[205] Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation

Soumitra Ghosh,Gopendra Vikram Singh,Shambhavi,Sabarna Choudhury,Asif Ekbal

Main category: cs.CL

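TL;DR: The paper proposes a method that improves large language models' (LLMs') understanding of self-harm intent through the nuanced interplay of language and emojis, and releases the CESM-100 and SHINES resources.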

Details Motivation: Self-harm detection on social media is critical for early intervention and mental health support, but existing LLMs struggle to interpret implicit expressions. Method: Proposes a unified framework that 1) enriches inputs using CESM-100; 2) fine-tunes LLMs with multi-task learning; 3) generates explainable rationales for self-harm predictions. Result: Evaluated on three LLMs, the method significantly improves performance on both detection and explanation tasks. Conclusion: Coupling intent differentiation with contextual cues effectively addresses the ambiguity of self-harm signals; the dataset and code are publicly released. Abstract: Self-harm detection on social media is critical for early intervention and mental health support, yet remains challenging due to the subtle, context-dependent nature of such expressions. Identifying self-harm intent aids suicide prevention by enabling timely responses, but current large language models (LLMs) struggle to interpret implicit cues in casual language and emojis. This work enhances LLMs' comprehension of self-harm by distinguishing intent through nuanced language-emoji interplay. We present the Centennial Emoji Sensitivity Matrix (CESM-100), a curated set of 100 emojis with contextual self-harm interpretations, and the Self-Harm Identification aNd intent Extraction with Supportive emoji sensitivity (SHINES) dataset, offering detailed annotations for self-harm labels, casual mentions (CMs), and serious intents (SIs). Our unified framework: a) enriches inputs using CESM-100; b) fine-tunes LLMs for multi-task learning: self-harm detection (primary) and CM/SI span detection (auxiliary); c) generates explainable rationales for self-harm predictions. We evaluate the framework on three state-of-the-art LLMs (Llama 3, Mental-Alpaca, and MentalLlama) across zero-shot, few-shot, and fine-tuned scenarios. By coupling intent differentiation with contextual cues, our approach commendably enhances LLM performance in both detection and explanation tasks, effectively addressing the inherent ambiguity in self-harm signals. The SHINES dataset, CESM-100 and codebase are publicly available at: https://www.iitp.ac.in/~ai-nlp-ml/resources.html#SHINES.

[206] Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin

HaoTian Lan

Main category: cs.CL

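TL;DR: The study proposes an image-based framework for analyzing how the commercial vitality of community streets in Harbin, China relates to vehicular accessibility, environmental quality, and pedestrian perception, finding that moderate vehicle presence helps commerce while excessive parking lowers satisfaction.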

Details Motivation: To examine what drives the commercial vitality of community streets, in particular the complex interactions among vehicles, environment, and pedestrian perception. Method: Uses street view imagery and a multimodal large language model (VisualGLM-6B), builds a Community Commercial Vitality Index (CCVI) from Meituan and Dianping data, and analyzes it against spatial attributes extracted via GPT-4-based perception modeling. Result: Moderate vehicle presence improves commercial access, but excessive parking lowers satisfaction; greenery and cleanliness significantly raise satisfaction but are only weakly associated with pricing; street width moderates the effect of vehicles. Conclusion: The study shows the value of combining AI-assisted perception with urban morphological analysis and provides theoretical and tool support for community revitalization. Abstract: The commercial vitality of community-scale streets in Chinese cities is shaped by complex interactions between vehicular accessibility, environmental quality, and pedestrian perception. This study proposes an interpretable, image-based framework to examine how street-level features -- including parked vehicle density, greenery, cleanliness, and street width -- impact retail performance and user satisfaction in Harbin, China. Leveraging street view imagery and a multimodal large language model (VisualGLM-6B), we construct a Community Commercial Vitality Index (CCVI) from Meituan and Dianping data and analyze its relationship with spatial attributes extracted via GPT-4-based perception modeling. Our findings reveal that while moderate vehicle presence may enhance commercial access, excessive on-street parking -- especially in narrow streets -- erodes walkability and reduces both satisfaction and shop-level pricing. In contrast, streets with higher perceived greenery and cleanliness show significantly greater satisfaction scores but only weak associations with pricing. Street width moderates the effects of vehicle presence, underscoring the importance of spatial configuration. These results demonstrate the value of integrating AI-assisted perception with urban morphological analysis to capture non-linear and context-sensitive drivers of commercial success. This study advances both theoretical and methodological frontiers by highlighting the conditional role of vehicle activity in neighborhood commerce and demonstrating the feasibility of multimodal AI for perceptual urban diagnostics. The implications extend to urban design, parking management, and scalable planning tools for community revitalization.

[207] CL-ISR: A Contrastive Learning and Implicit Stance Reasoning Framework for Misleading Text Detection on Social Media

Tianyi Huang,Zikun Cui,Cuiqianhe Du,Chia-En Chiang

Main category: cs.CL

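TL;DR: The paper proposes CL-ISR, a new framework combining contrastive learning and implicit stance reasoning to improve the detection accuracy of misleading text on social media.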

Details Motivation: Misleading text on social media can cause public misunderstanding, social panic, and economic losses, so more effective detection methods are needed. Method: Uses a contrastive learning algorithm to strengthen the model's learning of semantic differences, and introduces an implicit stance reasoning module to analyze latent stance tendencies in the text. Result: The CL-ISR framework significantly improves misleading-text detection, performing especially well in linguistically complex situations. Conclusion: By combining contrastive learning with implicit stance reasoning, CL-ISR offers a more effective solution for detecting misleading text on social media. Abstract: Misleading text detection on social media platforms is a critical research area, as these texts can lead to public misunderstanding, social panic and even economic losses. This paper proposes a novel framework, CL-ISR (Contrastive Learning and Implicit Stance Reasoning), which combines contrastive learning and implicit stance reasoning to improve the detection accuracy of misleading texts on social media. First, we use the contrastive learning algorithm to improve the model's ability to learn semantic differences between truthful and misleading texts. By constructing positive and negative sample pairs, contrastive learning helps the model capture the distinguishing features between categories more effectively, particularly in linguistically complicated situations. Second, we introduce the implicit stance reasoning module to explore the potential stance tendencies in the text and their relationships with related topics. This method is effective for identifying content that misleads through stance shifting or emotional manipulation, because it can capture the implicit information behind the text. Finally, we integrate these two algorithms to form a new framework, CL-ISR, which leverages the discriminative power of contrastive learning and the interpretive depth of stance reasoning to significantly improve detection performance.

[208] The NTNU System at the S&I Challenge 2025 SLA Open Track

Hong-Yun Lin,Tien-Hong Lo,Yu-Hsuan Fang,Jhen-Ke Lin,Chung-Chun Wang,Hao-Chien Lu,Berlin Chen

Main category: cs.CL

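TL;DR: The study proposes a system combining wav2vec 2.0 with the Phi-4 multimodal large language model for spoken language assessment, addressing the respective limitations of BERT and W2V and placing second in the competition.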

Details Motivation: BERT and wav2vec 2.0 each have limitations in spoken language assessment: BERT relies on ASR transcripts and cannot capture acoustic cues, while W2V lacks semantic interpretability. Method: Integrates W2V with the Phi-4 multimodal large language model through a score fusion strategy. Result: On the official test set of the Speak & Improve Challenge 2025, the system reaches an RMSE of 0.375, ranking second. Conclusion: The proposed system effectively combines the strengths of both models and improves the accuracy of spoken language assessment. Abstract: A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
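
The abstract names a score fusion strategy and reports RMSE but does not give the fusion rule, so the sketch below assumes the simplest option, a fixed weighted average of two per-speaker scores, followed by the standard RMSE formula. The function names, the `acoustic_weight` parameter, and the toy scores are hypothetical.

```python
import math


def fuse_scores(acoustic_scores, llm_scores, acoustic_weight=0.5):
    """Weighted average of two systems' proficiency scores (assumed fusion rule)."""
    if len(acoustic_scores) != len(llm_scores):
        raise ValueError("both systems must score the same set of speakers")
    return [
        acoustic_weight * a + (1.0 - acoustic_weight) * b
        for a, b in zip(acoustic_scores, llm_scores)
    ]


def rmse(predictions, references):
    """Root mean square error between predicted and reference scores."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("need two non-empty score lists of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predictions, references)]
    return math.sqrt(sum(squared) / len(squared))


# Toy scores for two speakers; the values are made up for illustration.
fused = fuse_scores([4.0, 3.0], [5.0, 3.0], acoustic_weight=0.5)
error = rmse(fused, [4.0, 3.0])
```

In practice the fusion weight would be tuned on a development set; a learned regressor over both scores is another common option.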

[209] DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning

Tanmay Parekh,Kartik Mehta,Ninareh Mehrabi,Kai-Wei Chang,Nanyun Peng

Main category: cs.CL

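TL;DR: The DiCoRe framework improves zero-shot event detection through divergent-convergent reasoning (Dreamer and Grounder) combined with LLM-Judge verification, outperforming baselines on multiple datasets.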

Details Motivation: Zero-shot event detection identifies events without any training data, but complex event ontologies and domain-specific triggers limit the usefulness of large language models (LLMs). Method: Proposes the DiCoRe framework, comprising Dreamer (divergent reasoning) and Grounder (convergent reasoning), with outputs verified by an LLM-Judge. Result: Across six datasets, DiCoRe achieves average F1 gains of 4-7% over the best baseline. Conclusion: DiCoRe is a strong zero-shot event detection framework. Abstract: Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately overloads and limits the utility of Large Language Models (LLMs) for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent reasoning framework that decouples the task of ED using Dreamer and Grounder. Dreamer encourages divergent reasoning through open-ended event discovery, which helps to boost event coverage. Conversely, Grounder introduces convergent reasoning to align the free-form predictions with the task-specific instructions using finite-state machine guided constrained decoding. Additionally, an LLM-Judge verifies the final outputs to ensure high precision. Through extensive experiments on six datasets across five domains and nine LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, achieving 4-7% average F1 gains over the best baseline -- establishing DiCoRe as a strong zero-shot ED framework.
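
Finite-state machine guided constrained decoding can be illustrated with a toy greedy decoder that only ever considers tokens the current FSM state allows. This is a generic sketch of the technique, not DiCoRe's Grounder: the transition table, the scoring function, and the event types and triggers are invented for the example, and a real system would score tokens with an LLM.

```python
def constrained_greedy_decode(score_fn, transitions, start_state, accept_states, max_steps=20):
    """Greedy decoding restricted to the tokens an FSM allows.

    transitions: dict state -> dict token -> next state.
    score_fn(prefix, token): model score for emitting token after prefix.
    Returns the emitted tokens and whether an accepting state was reached.
    """
    state = start_state
    output = []
    for _ in range(max_steps):
        if state in accept_states:
            break
        allowed = transitions.get(state, {})
        if not allowed:
            break
        # Pick the best-scoring token among those the FSM permits here.
        token = max(allowed, key=lambda tok: score_fn(output, tok))
        output.append(token)
        state = allowed[token]
    return output, state in accept_states


# Hypothetical output grammar: <event type> ":" <trigger word>.
toy_transitions = {
    "start": {"Attack": "sep", "Transport": "sep"},
    "sep": {":": "trigger"},
    "trigger": {"bombed": "done", "moved": "done"},
}
# Made-up scores; "banana" scores highest but is never a legal token.
toy_scores = {"banana": 9.0, "Attack": 2.0, "Transport": 1.0, ":": 0.5, "bombed": 3.0, "moved": 0.1}


def toy_score_fn(prefix, token):
    return toy_scores.get(token, 0.0)


decoded, accepted = constrained_greedy_decode(toy_score_fn, toy_transitions, "start", {"done"})
```

The point of the constraint is that free-form model preferences (here, the out-of-grammar token) cannot leak into the structured output, since the decoder only ranks tokens that keep the output inside the allowed format.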

[210] Information Locality as an Inductive Bias for Neural Language Models

Taiga Someya,Anej Svete,Brian DuSell,Timothy J. O'Donnell,Mario Giulianelli,Ryan Cotterell

Main category: cs.CL

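TL;DR: The paper proposes a quantitative framework that measures the local uncertainty of a language via $m$-local entropy, finding that neural language models, like humans, are highly sensitive to a language's local statistical structure.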

Details Motivation: Whether the inductive biases of neural language models align with human processing constraints is debated, and their learning behavior needs quantitative study. Method: Introduces $m$-local entropy, an information-theoretic measure derived from lossy-context surprisal, to analyze the local uncertainty of a language. Result: Experiments show that languages with higher $m$-local entropy are harder for Transformer and LSTM models to learn. Conclusion: Neural language models, like humans, are highly sensitive to the local statistical structure of a language. Abstract: Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce $m$-local entropy, an information-theoretic measure derived from average lossy-context surprisal, that captures the local uncertainty of a language by quantifying how effectively the $m-1$ preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSAs), we show that languages with higher $m$-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.
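
As a rough illustration of what a quantity like $m$-local entropy measures, the sketch below computes a plug-in estimate of the conditional entropy of the next symbol given the $m-1$ preceding symbols, using n-gram counts from a corpus. This is an assumption about the general shape of the measure based on the abstract's description; the paper's formal definition (via average lossy-context surprisal) may differ, and the function name is hypothetical.

```python
import math
from collections import Counter


def m_local_entropy(sequences, m):
    """Empirical H(next symbol | previous m-1 symbols), in bits.

    sequences: iterable of symbol sequences (strings or lists).
    Uses raw n-gram counts with no smoothing (a simplifying assumption).
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    joint = Counter()    # counts of (context, next symbol)
    context = Counter()  # counts of context alone
    for seq in sequences:
        symbols = tuple(seq)
        for t in range(m - 1, len(symbols)):
            ctx = symbols[t - m + 1:t]
            joint[(ctx, symbols[t])] += 1
            context[ctx] += 1
    total = sum(joint.values())
    if total == 0:
        return 0.0
    return -sum(
        (n / total) * math.log2(n / context[ctx])
        for (ctx, _), n in joint.items()
    )


# In the alternating string, one preceding symbol fully determines the next.
unigram_bits = m_local_entropy(["abababab"], m=1)
bigram_bits = m_local_entropy(["abababab"], m=2)
```

On the toy string, the estimate drops from one bit with no context to zero bits with one symbol of context, which matches the intuition that lower values mean the local context disambiguates the next symbol more effectively.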

[211] AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models

Chih-Kai Yang,Neo Ho,Yi-Jyun Lee,Hung-yi Lee

Main category: cs.CL

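TL;DR: This paper presents the first in-depth analysis of how large audio-language models (LALMs) internally perceive and recognize auditory attributes, finding that attribute information decreases with layer depth when recognition fails and that resolving attributes at earlier layers correlates with higher accuracy, and it proposes a method to enhance LALMs.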

Details Motivation: Understanding the internal mechanisms of LALMs is crucial for interpreting their behavior and improving performance. Method: Applies vocabulary projection to three state-of-the-art LALMs, tracking how attribute information evolves across layers and token positions. Result: Attribute information decreases with layer depth when recognition fails; resolving attributes at earlier layers correlates with higher accuracy; and LALMs rely on querying auditory inputs rather than aggregating information in hidden states. Conclusion: The findings offer new insights into auditory attribute processing and lay the groundwork for future improvements to LALMs. Abstract: Understanding the internal mechanisms of large audio-language models (LALMs) is crucial for interpreting their behavior and improving performance. This work presents the first in-depth analysis of how LALMs internally perceive and recognize auditory attributes. By applying vocabulary projection on three state-of-the-art LALMs, we track how attribute information evolves across layers and token positions. We find that attribute information generally decreases with layer depth when recognition fails, and that resolving attributes at earlier layers correlates with better accuracy. Moreover, LALMs heavily rely on querying auditory inputs for predicting attributes instead of aggregating necessary information in hidden states at attribute-mentioning positions. Based on our findings, we demonstrate a method to enhance LALMs. Our results offer insights into auditory attribute processing, paving the way for future improvements.

[212] Do Large Language Models Judge Error Severity Like Humans?

Diege Sun,Guanyi Chen,Fan Zhao,Xiaorong Cheng,Tingting He

Main category: cs.CL

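TL;DR: The study compares human and LLM assessments of the severity of semantic errors in image descriptions, finding significant differences in severity judgments, particularly in how colour and gender errors are perceived.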

Details Motivation: To examine whether LLMs can accurately replicate human judgments of error severity, particularly in multimodal settings. Method: Extends the experimental framework of van Miltenburg et al., comparing human and LLM assessments of four error types (age, gender, clothing type, colour) in unimodal and multimodal settings. Result: Humans perceive different error types as differently severe, while LLM judgments of gender and colour errors diverge from human ones. DeepSeek-V3 aligns most closely with human judgments in both unimodal and multimodal conditions. Conclusion: LLMs have limitations in judging error severity; some models may have internalised social norms but lack perceptual grounding. DeepSeek-V3 performs best but still falls short of human-level judgment. Abstract: Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons. This suggests that these models may have internalised social norms influencing gender judgments but lack the perceptual grounding to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao, replicates the human-like ranking of error severity, but it fails to distinguish between error types as clearly as humans. Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments across both unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.

[213] Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation

Chenyu Lin,Yilin Wen,Du Su,Fei Sun,Muhan Chen,Chenfu Bao,Zhonghou Lv

Main category: cs.CL

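TL;DR: The paper proposes Knowledgeable-r1, which uses joint sampling and multi-policy distributions to address RAG systems' over-reliance on retrieved context, significantly improving performance on knowledge-conflict tasks and general RAG tasks.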

Details Motivation: Current RAG systems rely too heavily on retrieved context, which can lead to dependence on inaccurate information or neglect of the model's inherent knowledge, especially with misleading or redundant information. Method: Proposes Knowledgeable-r1, which uses joint sampling and multi-policy distributions to stimulate the model's self-integrated use of parametric and contextual knowledge. Result: Experiments show that Knowledgeable-r1 significantly improves robustness and reasoning accuracy on conflict tasks and general RAG tasks, outperforming baselines by 17.07% in counterfactual scenarios. Conclusion: Knowledgeable-r1 effectively balances the model's use of parametric and contextual knowledge and improves RAG performance. Abstract: Retrieval-augmented generation (RAG) is a mainstream method for improving performance on knowledge-intensive tasks. However, current RAG systems often place too much emphasis on retrieved contexts. This can lead to reliance on inaccurate sources and overlook the model's inherent knowledge, especially when dealing with misleading or excessive information. To resolve this imbalance, we propose Knowledgeable-r1, which uses joint sampling and defines multi-policy distributions in knowledge capability exploration to stimulate large language models' self-integrated utilization of parametric and contextual knowledge. Experiments show that Knowledgeable-r1 significantly enhances robustness and reasoning accuracy in both parametric-contextual knowledge conflict tasks and general RAG tasks, especially outperforming baselines by 17.07% in counterfactual scenarios and demonstrating consistent gains across RAG tasks. Our code is available at https://github.com/lcy80366872/knowledgeable-r1.

[214] Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

Bhavik Chandna,Zubair Bashir,Procheta Sen

Main category: cs.CL

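TL;DR: Using mechanistic interpretability, the paper analyzes the internal structure of social, demographic, and gender biases in GPT-2 and Llama2, finding that bias computation is highly concentrated in a few layers and that removing these components affects other NLP tasks.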

Details Motivation: Large language models (LLMs) often exhibit social, demographic, and gender biases inherited from training data; the study aims to reveal how these biases are structurally represented in the model. Method: Uses mechanistic interpretability to analyze internal edges of the model, assesses the stability, localization, and generalizability of bias, and validates the findings with systematic ablations. Result: Bias computation is concentrated in a few layers and changes across fine-tuning settings; removing the bias components reduces biased outputs but affects other NLP tasks. Conclusion: Bias is highly localized in the model, and removing bias components requires care to avoid harming other tasks. Abstract: Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment, because of the sharing of important components with these tasks.

[215] ECoRAG: Evidentiality-guided Compression for Long Context RAG

Yeonseok Jeong,Jinsu Kim,Dohyeon Lee,Seung-won Hwang

Main category: cs.CL

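TL;DR: The ECoRAG framework compresses retrieved documents based on evidentiality, improving LLM performance in open-domain question answering while reducing cost.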

Details Motivation: Existing compression methods do not filter out non-evidential information, which limits LLM performance in RAG. Method: Proposes the ECoRAG framework, which compresses documents based on evidentiality and keeps retrieving when the evidence is insufficient. Result: Experiments show that ECoRAG outperforms existing methods on ODQA tasks at lower cost. Conclusion: By retaining only the necessary information, ECoRAG improves performance and reduces cost. Abstract: Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce the RAG overhead from longer contexts, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits the performance in LLM-based RAG. We thus propose Evidentiality-guided RAG, or the ECoRAG framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, checking whether answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects on whether the compressed content provides sufficient evidence, and if not, retrieves more until it is sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the necessary information to generate the correct answer. Code is available at https://github.com/ldilab/ECoRAG.

[216] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang,Mingxin Li,Dingkun Long,Xin Zhang,Huan Lin,Baosong Yang,Pengjun Xie,An Yang,Dayiheng Liu,Junyang Lin,Fei Huang,Jingren Zhou

Main category: cs.CL

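TL;DR: The Qwen3 Embedding series is a family of text embedding and reranking models built on the Qwen3 foundation models; using multi-stage training and model merging, it offers several model sizes and performs strongly on multilingual and retrieval tasks.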

Details Motivation: Improve text embedding and reranking capabilities to meet multilingual and multi-domain needs. Method: Combines large-scale unsupervised pre-training with supervised fine-tuning, uses the Qwen3 LLMs to generate high-quality training data, and applies model merging strategies. Result: Achieves state-of-the-art performance on multilingual benchmarks such as MTEB and on retrieval tasks. Conclusion: The Qwen3 Embedding series is efficient and flexible, suits a range of deployment scenarios, and is openly released to support community research. Abstract: In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.

[217] Counterfactual reasoning: an analysis of in-context emergence

Moritz Miller,Bernhard Schölkopf,Siyuan Guo

Main category: cs.CL

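TL;DR: The paper studies in-context counterfactual reasoning in large-scale neural language models, using a linear regression task to show that models can predict consequences under hypothetical scenarios.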

Details Motivation: Explore how language models perform at counterfactual reasoning, particularly their ability to infer and copy noise. Method: Uses a linear regression task as a synthetic setup to study model behavior in noise abduction and counterfactual reasoning. Result: Language models can perform counterfactual reasoning in a controlled setting, and self-attention, model depth, and pre-training data diversity are the key factors. Conclusion: The results suggest that Transformers have potential for counterfactual reasoning; in particular, their ability to perform noise abduction on sequential data provides preliminary evidence for counterfactual story generation. Abstract: Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning: the ability to learn from and reason over the input context on the fly without parameter updates. This work studies in-context counterfactual reasoning in language models, that is, to predict the consequences of changes under hypothetical scenarios. We focus on studying a well-defined synthetic setup: a linear regression task that requires noise abduction, where accurate prediction is based on inferring and copying the contextual noise from factual observations. We show that language models are capable of counterfactual reasoning in this controlled setup and provide insights that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; we find self-attention, model depth, and data diversity in pre-training drive performance in Transformers. More interestingly, our findings extend beyond regression tasks and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under https://github.com/moXmiller/counterfactual-reasoning.git .

[218] RELIC: Evaluating Compositional Instruction Following via Language Recognition

Jackson Petty,Michael Y. Hu,Wentao Wang,Shauli Ravfogel,William Merrill,Tal Linzen

Main category: cs.CL

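TL;DR: The RELIC framework evaluates the instruction-following ability of large language models (LLMs) through language recognition tasks, and finds that current state-of-the-art LLMs perform near chance on complex grammars and samples.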

Details Motivation: Evaluate the ability of LLMs to perform a task based only on an in-context task specification (instruction following), and provide a scalable evaluation framework. Method: Introduces the RELIC framework, which uses language recognition over formal grammars and automatically generates complex task instances to avoid data contamination. Result: LLM accuracy can be predicted from grammar and sample complexity; on complex tasks, models rely on shallow heuristics rather than following the complex instructions. Conclusion: RELIC exposes the limitations of LLMs on complex instruction-following tasks and provides a diagnostic tool for future improvement. Abstract: Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context, without examples of inputs and outputs; this ability is referred to as instruction following. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition: the task of determining if a string is generated by a formal grammar. Unlike many standard evaluations of LLMs' ability to use their context, this task requires composing together a large number of instructions (grammar productions) retrieved from the context. Because the languages are synthetic, the task can be increased in complexity as LLMs' skills improve, and new instances can be automatically generated, mitigating data contamination. We evaluate state-of-the-art LLMs on RELIC and find that their accuracy can be reliably predicted from the complexity of the grammar and the individual example strings, and that even the most advanced LLMs currently available show near-chance performance on more complex grammars and samples, in line with theoretical expectations. We also use RELIC to diagnose how LLMs attempt to solve increasingly difficult reasoning tasks, finding that as the complexity of the language recognition task increases, models switch to relying on shallow heuristics instead of following complex instructions.
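
To make the language recognition task concrete, here is a minimal sketch of a ground-truth recognizer using the classic CYK algorithm. It assumes the grammar is given in Chomsky normal form; the rule encoding and the example grammar (strings of the form a^n b^n) are illustrative choices and are not taken from the RELIC benchmark.

```python
def cyk_recognize(tokens, unary_rules, binary_rules, start="S"):
    """Return True if `tokens` is derivable from `start` under a CNF grammar.

    unary_rules:  list of (lhs, terminal) pairs, e.g. ("A", "a")
    binary_rules: list of (lhs, (left, right)) pairs, e.g. ("S", ("A", "B"))
    """
    n = len(tokens)
    if n == 0:
        return False  # this sketch does not handle the empty string
    # table[i][l - 1] holds the nonterminals that derive tokens[i:i + l].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = {lhs for lhs, terminal in unary_rules if terminal == tok}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for lhs, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[i][span - 1].add(lhs)
    return start in table[0][n - 1]


# Illustrative grammar for a^n b^n (n >= 1): S -> A B | A T, T -> S B, A -> a, B -> b
unary = [("A", "a"), ("B", "b")]
binary = [("S", ("A", "B")), ("S", ("A", "T")), ("T", ("S", "B"))]

in_language = cyk_recognize(list("aabb"), unary, binary)
not_in_language = cyk_recognize(list("aab"), unary, binary)
```

A recognizer of this kind gives exact labels for automatically generated strings, which is what allows instances to be produced at arbitrary grammar and string complexity.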

[219] The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Nikhil Kandpal,Brian Lester,Colin Raffel,Sebastian Majstorovic,Stella Biderman,Baber Abbasi,Luca Soldaini,Enrico Shippole,A. Feder Cooper,Aviya Skowron,John Kirchenbauer,Shayne Longpre,Lintang Sutawika,Alon Albalak,Zhenlin Xu,Guilherme Penedo,Loubna Ben Allal,Elie Bakouch,John David Pressman,Honglu Fan,Dashiell Stander,Guangyu Song,Aaron Gokaslan,Tom Goldstein,Brian R. Bartoldson,Bhavya Kailkhura,Tyler Murray

Main category: cs.CL

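TL;DR: The paper introduces the Common Pile v0.1, an 8TB dataset of openly licensed text for training large language models (LLMs), and validates it by training two 7B-parameter models.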

Details Motivation: Address the intellectual property and ethical concerns raised by training LLMs on unlicensed text, and fill the gap left by openly licensed datasets that are too small or too low in quality. Method: Collects, curates, and releases the Common Pile v0.1, which covers diverse content from 30 sources, and trains two 7B-parameter LLMs (Comma v0.1-1T and Comma v0.1-2T). Result: The trained models perform comparably to similarly sized LLMs trained on unlicensed text, such as Llama 1 and 2 7B. Conclusion: The Common Pile v0.1 offers a viable path to training LLMs on openly licensed text; the dataset, code, and model checkpoints are released. Abstract: Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

[220] Improving Low-Resource Morphological Inflection via Self-Supervised Objectives

Adam Wiemerslage,Katharina von der Wense

Main category: cs.CL

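TL;DR: Studies the effectiveness of self-supervised auxiliary tasks for morphological inflection in extremely low-resource settings, finding that autoencoding works best when data is very scarce while character masked language modeling becomes more effective as data increases.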

Details Motivation: Explore the potential of self-supervised objectives for character-level tasks such as morphological inflection, especially for low-resource languages. Method: Trains encoder-decoder transformers across 19 languages and 13 auxiliary objectives. Result: Autoencoding performs best when data is very limited, and character masked language modeling becomes more effective as data increases; sampling masks based on known morpheme boundaries consistently improves performance. Conclusion: Self-supervised auxiliary tasks show promise for low-resource morphological modeling, particularly mask sampling informed by morpheme boundaries. Abstract: Self-supervised objectives have driven major advances in NLP by leveraging large-scale unlabeled data, but such resources are scarce for many of the world's languages. Surprisingly, they have not been explored much for character-level tasks, where smaller amounts of data have the potential to be beneficial. We investigate the effectiveness of self-supervised auxiliary tasks for morphological inflection -- a character-level task highly relevant for language documentation -- in extremely low-resource settings, training encoder-decoder transformers for 19 languages and 13 auxiliary objectives. Autoencoding yields the best performance when unlabeled data is very limited, while character masked language modeling (CMLM) becomes more effective as data availability increases. Though objectives with stronger inductive biases influence model predictions intuitively, they rarely outperform standard CMLM. However, sampling masks based on known morpheme boundaries consistently improves performance, highlighting a promising direction for low-resource morphological modeling.

[221] Towards a Unified System of Representation for Continuity and Discontinuity in Natural Language

Ratna Kandala,Prakash Mondal

Main category: cs.CL

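TL;DR: Proposes a unified system for representing both continuous and discontinuous structures in natural language, combining features of phrase structure grammar, dependency grammar, and categorial grammar.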

Details Motivation: Address the lack of convergence among linguistic theories in their analyses of discontinuous structures by providing a unified framework. Method: Combines the constituency of Phrase Structure Grammar (PSG), the head-dependent relations of Dependency Grammar (DG), and the functor-argument relations of Categorial Grammar (CG) into a unified mathematical derivation. Result: Shows that continuous and discontinuous structures can be analysed through a single mathematical derivation. Conclusion: The three grammar formalisms can jointly represent natural language structure, offering a new perspective on discontinuity. Abstract: Syntactic discontinuity is a grammatical phenomenon in which a constituent is split into more than one part because of the insertion of an element which is not part of the constituent. This is observed in many languages across the world such as Turkish, Russian, Japanese, Warlpiri, Navajo, Hopi, Dyirbal, Yidiny etc. Different formalisms/frameworks in current linguistic theory approach the problem of discontinuous structures in different ways. Each framework/formalism has widely been viewed as an independent and non-converging system of analysis. In this paper, we propose a unified system of representation for both continuity and discontinuity in structures of natural languages by taking into account three formalisms, in particular, Phrase Structure Grammar (PSG) for its widely used notion of constituency, Dependency Grammar (DG) for its head-dependent relations, and Categorial Grammar (CG) for its focus on functor-argument relations. We attempt to show that discontinuous expressions as well as continuous structures can be analysed through a unified mathematical derivation incorporating the representations of linguistic structure in these three grammar formalisms.

[222] CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection

Ron Eliav,Arie Cattan,Eran Hirsch,Shahaf Bassan,Elias Stengel-Eskin,Mohit Bansal,Ido Dagan

Main category: cs.CL

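TL;DR: The paper proposes a hallucination detection method based on systematic reasoning: it decomposes the text and verifies evidence for each sub-claim, improving performance on the entailment task.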

Details Motivation: Existing hallucination detection methods usually cast the task as natural language inference (NLI), but a complex reasoning task of this kind may benefit from a more explicit reasoning process. Method: Defines a three-step reasoning framework (claim decomposition, sub-claim attribution and entailment classification, aggregated classification) and validates it with metrics on the quality of the intermediate steps. Result: Experiments show that the systematic reasoning framework improves the accuracy and fine-grained performance of hallucination detection. Conclusion: Systematic reasoning effectively improves hallucination detection, and the quality of the intermediate steps supports this finding. Abstract: A common approach to hallucination detection casts it as a natural language inference (NLI) task, often using LLMs to classify whether the generated text is entailed by corresponding reference texts. Since entailment classification is a complex reasoning task, one would expect that LLMs could benefit from generating an explicit reasoning process, as in CoT reasoning or the explicit "thinking" of recent reasoning models. In this work, we propose that guiding such models to perform a systematic and comprehensive reasoning process -- one that both decomposes the text into smaller facts and also finds evidence in the source for each fact -- allows models to execute much finer-grained and accurate entailment decisions, leading to increased performance. To that end, we define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection. Following this reasoning framework, we introduce an analysis scheme, consisting of several metrics that measure the quality of the intermediate reasoning steps, which provides additional empirical evidence for the improved quality of our guided reasoning scheme.

[223] Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning

Nan Huo,Jinyang Li,Bowen Qin,Ge Qu,Xiaolong Li,Xiaodong Li,Chenhao Ma,Reynold Cheng

Main category: cs.CL

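TL;DR: The paper proposes the Micro-Act framework, which uses a hierarchical action space to resolve knowledge conflicts in RAG systems and significantly improves accuracy on QA tasks.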

Details Motivation: In RAG systems, conflicts between externally retrieved knowledge and the LLM's internal knowledge hurt downstream task performance, and existing methods are limited because lengthy contexts overwhelm the model. Method: Proposes the Micro-Act framework, which perceives context complexity and decomposes each knowledge source into fine-grained comparison steps to enable deeper reasoning. Result: On five benchmark datasets, Micro-Act significantly outperforms existing methods, especially on temporal and semantic conflict types. Conclusion: Micro-Act resolves knowledge conflicts while remaining robust on non-conflict questions, giving it practical value. Abstract: Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge Conflicts, where retrieved external knowledge contradicts the inherent, parametric knowledge of large language models (LLMs). This adversely affects performance on downstream tasks such as question answering (QA). Existing approaches often attempt to mitigate conflicts by directly comparing two knowledge sources in a side-by-side manner, but this can overwhelm LLMs with extraneous or lengthy contexts, ultimately hindering their ability to identify and mitigate inconsistencies. To address this issue, we propose Micro-Act, a framework with a hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context. Through extensive experiments on five benchmark datasets, Micro-Act consistently achieves a significant increase in QA accuracy over state-of-the-art baselines across all 5 datasets and 3 conflict types, especially in temporal and semantic types where all baselines fail significantly. More importantly, Micro-Act exhibits robust performance on non-conflict questions simultaneously, highlighting its practical value in real-world RAG applications.

[224] ProRefine: Inference-time Prompt Refinement with Textual Feedback

Deepak Pandita,Tharindu Cyril Weerasooriya,Ankit Parag Shah,Christopher M. Homan,Wei Wei

Main category: cs.CL

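TL;DR: ProRefine is an inference-time prompt optimization method that uses textual feedback from LLMs to dynamically refine prompts for multi-step reasoning tasks, significantly improving performance.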

Details Motivation: In workflows where multiple AI agents collaborate, poorly designed prompts cause error propagation and degraded performance, limiting system reliability and scalability. Method: ProRefine uses textual feedback from LLMs to refine prompts dynamically, without additional training or ground truth labels. Result: On five mathematical reasoning benchmark datasets, ProRefine improves over zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. Conclusion: ProRefine not only raises accuracy but also lets smaller models match the performance of larger ones, supporting efficient and scalable AI deployment. Abstract: Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, are becoming increasingly prevalent. However, these workflows often suffer from error propagation and sub-optimal performance, largely due to poorly designed prompts that fail to effectively guide individual agents. This is a critical problem because it limits the reliability and scalability of these powerful systems. We introduce ProRefine, an innovative inference-time prompt optimization method that leverages textual feedback from large language models (LLMs) to address this challenge. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to match the performance of larger ones, highlighting its potential for efficient and scalable AI deployment, and democratizing access to high-performing AI.

[225] Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models

Taha Entesari,Arman Hatami,Rinat Khaziev,Anil Ramakrishna,Mahyar Fazlyab

Main category: cs.CL

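TL;DR: The paper proposes a new approach to LLM unlearning that formulates it as a constrained optimization problem, forgetting sensitive information while retaining useful data.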

Details Motivation: Real-world LLMs need to forget sensitive, outdated, or proprietary information, but existing methods struggle to balance forgetting and retention, which degrades performance. Method: Formulates unlearning as a constrained optimization problem in which a logit-margin flattening loss enforces forgetting and a hard constraint preserves the retain set, solved with a primal-dual algorithm. Result: On the TOFU and MUSE benchmarks, the method matches or exceeds existing baselines, effectively removing targeted information while preserving downstream utility. Conclusion: The method performs well on LLM unlearning and offers a stable, efficient solution. Abstract: Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
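
The general shape of a primal-dual method for "minimize a forget objective subject to a retain constraint" can be sketched on a toy scalar problem. The code below is only an illustration of gradient descent-ascent on a Lagrangian; the quadratic objective, the linear constraint, and the step sizes are assumptions and have nothing to do with the paper's logit-margin flattening loss or its LLM-scale algorithm.

```python
def primal_dual(forget_grad, retain_value, retain_grad,
                theta=0.0, lam=0.0, lr_primal=0.05, lr_dual=0.05, steps=5000):
    """Gradient descent-ascent on L(theta, lam) = forget(theta) + lam * retain(theta),
    where the constraint is retain(theta) <= 0 and lam >= 0 is the dual variable."""
    for _ in range(steps):
        # Primal step: descend the Lagrangian in theta.
        theta -= lr_primal * (forget_grad(theta) + lam * retain_grad(theta))
        # Dual step: ascend in lam, projected onto lam >= 0.
        lam = max(0.0, lam + lr_dual * retain_value(theta))
    return theta, lam


# Toy stand-ins (assumptions): "forget" objective (theta - 3)^2,
# "retain" constraint theta - 1 <= 0. The constrained optimum is theta = 1, lam = 4.
theta, lam = primal_dual(
    forget_grad=lambda t: 2.0 * (t - 3.0),
    retain_value=lambda t: t - 1.0,
    retain_grad=lambda t: 1.0,
)
```

The dual variable grows while the constraint is violated and settles once it is satisfied, so its trajectory reflects how strongly the retention constraint is binding, which is the trade-off the abstract says the dual dynamics expose.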

[226] Search Arena: Analyzing Search-Augmented LLMs

Mihran Miroyan,Tsung-Han Wu,Logan King,Tianle Li,Jiayi Pan,Xinyan Hu,Wei-Lin Chiang,Anastasios N. Angelopoulos,Trevor Darrell,Narges Norouzi,Joseph E. Gonzalez

Main category: cs.CL

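TL;DR: Search Arena is a large-scale, multi-turn dataset for evaluating user preferences and the performance of search-augmented language models; it reveals how the number and source of citations influence user preference.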

Details Motivation: Existing datasets are small and narrow in scope, which makes it hard to analyze search-augmented language models comprehensively, so a more complete dataset is needed. Method: Crowd-sources over 24,000 paired multi-turn user interactions, including around 12,000 human preference votes, and conducts cross-setting analyses. Result: User preferences are influenced by the number and source of citations; search-augmented models perform well in non-search settings, while relying only on parametric knowledge in search settings significantly hurts quality. Conclusion: Search Arena supports future research and sheds light on the relationship between user preference and model performance. Abstract: Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: https://github.com/lmarena/search-arena.

[227] Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models

Anirudh Bharadwaj,Chaitanya Malaviya,Nitish Joshi,Mark Yatskar

Main category: cs.CL

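TL;DR: Language models used for preference evaluation show systematic bias, over-relying on superficial features instead of substantive content. The study finds that biases in the training data cause model preferences to diverge from human preferences, and proposes a post-training method based on counterfactual data augmentation that reduces the bias.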

Details Motivation: Language models show systematic bias when judging human preferences, such as over-reliance on superficial features like length and structure, which leads to reward hacking and unreliable evaluation. The study investigates the relationship between training data biases and preference model miscalibration. Method: Uses controlled counterfactual pairs to quantify how strongly preference models favor responses with magnified biases, and proposes a post-training debiasing method based on counterfactual data augmentation (CDA). Result: Model preferences are substantially miscalibrated relative to human preferences (about 40%), and models rely heavily on spurious cues. CDA reduces average miscalibration from 39.4% to 32.5% and the average absolute skew difference from 20.5% to 10.0%. Conclusion: Counterfactual data augmentation effectively reduces preference bias in language models and improves evaluation reliability while maintaining overall performance. Abstract: Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. Evidence suggests these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding this preference occurs in >60% of instances, and model preferences show high miscalibration (~40%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean r_human = -0.12) but show moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Finetuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.

[228] HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Zhaolu Kang,Junhao Gong,Jiaxu Yan,Wanke Xia,Yian Wang,Ziwen Wang,Huaxuan Ding,Zhuo Cheng,Wenhao Cao,Zhiyuan Feng,Siqi He,Shannan Yan,Junzhe Chen,Xiaomin He,Chaoya Jiang,Wei Ye,Kaidong Yu,Xuelong Li

Main category: cs.CL

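TL;DR: HSSBench is a benchmark designed to evaluate multimodal large language models (MLLMs) on Humanities and Social Sciences (HSS) tasks, filling a gap in existing evaluations.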

Details Motivation: Current MLLM benchmarks focus mainly on STEM knowledge and reasoning and overlook the HSS need for interdisciplinary thinking and knowledge integration. Method: Builds HSSBench, with over 13,000 samples covering the six official languages of the United Nations and six key categories, using a data generation pipeline in which multiple domain experts and automated agents collaborate. Result: An evaluation of more than 20 mainstream MLLMs shows that HSSBench poses significant challenges for existing models. Conclusion: HSSBench is expected to advance research on the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to integrate and connect knowledge. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.

cs.HC

[229] Beyond the Desktop: XR-Driven Segmentation with Meta Quest 3 and MX Ink

Lisle Faray de Paiva,Gijs Luijten,Ana Sofia Ferreira Santos,Moon Kim,Behrus Puladi,Jens Kleesiek,Jan Egger

Main category: cs.HC

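TL;DR: The study develops an extended reality (XR)-based medical image segmentation tool that combines the Meta Quest 3 headset with the Logitech MX Ink stylus to simplify manual annotation. A user study indicates clinical potential, though improvements are still needed.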

Details Motivation: Medical image segmentation is essential in clinical practice, but manual annotation is tedious and time-consuming. The study aims to streamline this process with XR technology. Method: Develops an immersive interface that supports real-time interaction with 2D/3D medical imaging data, combining stylus-driven annotation with instant 3D rendering, and runs a user study on a public craniofacial CT dataset. Result: The tool scored 66 on the System Usability Scale (SUS); users found its controls intuitive (4.1/5 for self-descriptiveness on ISONORM) but saw room to improve task-specific precision and error management. Conclusion: The XR-stylus paradigm provides a foundation for immersive segmentation tools; future work should refine haptic feedback and workflow personalization to advance clinical adoption. Abstract: Medical imaging segmentation is essential in clinical settings for diagnosing diseases, planning surgeries, and other procedures. However, manual annotation is a cumbersome and effortful task. To mitigate these aspects, this study implements and evaluates the usability and clinical applicability of an extended reality (XR)-based segmentation tool for anatomical CT scans, using the Meta Quest 3 headset and Logitech MX Ink stylus. We develop an immersive interface enabling real-time interaction with 2D and 3D medical imaging data in a customizable workspace designed to mitigate workflow fragmentation and cognitive demands inherent to conventional manual segmentation tools. The platform combines stylus-driven annotation, mirroring traditional pen-on-paper workflows, with instant 3D volumetric rendering. A user study with a public craniofacial CT dataset demonstrated the tool's foundational viability, achieving a System Usability Scale (SUS) score of 66, within the expected range for medical applications. Participants highlighted the system's intuitive controls (scoring 4.1/5 for self-descriptiveness on ISONORM metrics) and spatial interaction design, with qualitative feedback highlighting strengths in hybrid 2D/3D navigation and realistic stylus ergonomics. While users identified opportunities to enhance task-specific precision and error management, the platform's core workflow enabled dynamic slice adjustment, reducing cognitive load compared to desktop tools. Results position the XR-stylus paradigm as a promising foundation for immersive segmentation tools, with iterative refinements targeting haptic feedback calibration and workflow personalization to advance adoption in preoperative planning.
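
For readers unfamiliar with how a SUS score such as the 66 reported above is obtained, the standard scoring rule for the ten-item questionnaire can be sketched as follows. The sample responses are hypothetical and are not data from the study; a study-level score is typically the mean of the per-participant scores.

```python
def sus_score(responses):
    """Standard System Usability Scale score for one participant.

    `responses` holds ten answers on a 1-5 scale, in questionnaire order.
    Odd-numbered items contribute (answer - 1), even-numbered items (5 - answer);
    the sum is scaled by 2.5 to give a value between 0 and 100.
    """
    if len(responses) != 10:
        raise ValueError("SUS expects exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item, answer in enumerate(responses, start=1):
        total += (answer - 1) if item % 2 == 1 else (5 - answer)
    return total * 2.5


# Hypothetical participant: agrees (4) with positive items, disagrees (2) with negative ones.
example = sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])
```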

[230] From Screen to Space: Evaluating Siemens' Cinematic Reality

Gijs Luijten,Lisle Faray de Paiva,Sebastian Krueger,Alexander Brost,Laura Mazilescu,Ana Sofia Ferreira Santos,Peter Hoyer,Jens Kleesiek,Sophia Marie-Therese Schmitz,Ulf Peter Neumann,Jan Egger

Main category: cs.HC

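TL;DR: The research team evaluates the usability and clinical potential of Siemens' Cinematic Reality on the Apple Vision Pro, using feedback from medical experts to assess feasibility and identify directions for improvement.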

Details Motivation: Explore the potential of Cinematic Reality for immersive rendering of medical images, to support its use in real clinical work. Method: Visualizes scans from the CHAOS and MRCP_DLRecon datasets and analyzes usability and clinical potential through expert assessment (System Usability Scale, ISONORM 9242-110-S questionnaire, and an open-ended survey). Result: Expert feedback confirmed feasibility and identified key strengths as well as features that need improvement before use in clinical workflows. Conclusion: Immersive cinematic rendering has potential in medical imaging but needs further refinement to support clinical integration. Abstract: As one of the first research teams with full access to Siemens' Cinematic Reality, we evaluate its usability and clinical potential for cinematic volume rendering on the Apple Vision Pro. We visualized venous-phase liver computed tomography and magnetic resonance cholangiopancreatography scans from the CHAOS and MRCP_DLRecon datasets. Fourteen medical experts assessed usability and anticipated clinical integration potential using the System Usability Scale, ISONORM 9242-110-S questionnaire, and an open-ended survey. Their feedback identified feasibility, key usability strengths, and required features to catalyze the adaptation in real-world clinical workflows. The findings provide insights into the potential of immersive cinematic rendering in medical imaging.

cs.SD

[231] Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning

Hien Ohnaka,Yuma Shirahata,Byeongseon Park,Ryuichi Yamamoto

Main category: cs.SD

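TL;DR: Proposes a model that produces phonemic and prosodic labels for speech and uses implicit and explicit grapheme conditioning to keep the labels consistent with the graphemes, significantly improving label prediction and proving effective on an accent estimation task.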

Details Motivation: Address the weak consistency between speech labels and graphemes in existing methods, and provide more reliable data for downstream tasks such as text-to-speech and accent estimation. Method: 1) Implicit grapheme conditioning through a prompt encoder using pre-trained BERT features; 2) explicit pruning, at inference time, of label hypotheses that are inconsistent with the graphemes. Result: Significantly improves the consistency between labels and graphemes, and improves accuracy on the accent estimation task. Conclusion: The method effectively produces parallel data of speech, labels, and graphemes that is applicable to a range of downstream tasks. Abstract: We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions the label generation on corresponding graphemes by two methods: 1) Add implicit grapheme conditioning through a prompt encoder using pre-trained BERT features. 2) Explicitly prune the label hypotheses inconsistent with the grapheme during inference. These methods enable obtaining parallel data of speech, the labels, and graphemes, which is applicable to various downstream tasks such as text-to-speech and accent estimation from text. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels. Further, experiments on the accent estimation task confirmed that the parallel data created by the proposed method effectively improves the estimation accuracy.

[232] LLM-based phoneme-to-grapheme for phoneme-based speech recognition

Te Ma,Min Bi,Saierdaer Yusuyin,Hao Huang,Zhijian Ou

Main category: cs.SD

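TL;DR: The paper proposes LLM-based phoneme-to-grapheme decoding (LLM-P2G) for phoneme-based automatic speech recognition (ASR), addresses information loss with data augmentation and randomized training strategies, and outperforms conventional WFST-based decoding.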

Details Motivation: Conventional WFST-based decoding for phoneme-based ASR has a complex pipeline and cannot leverage large language models, so a more effective approach is needed. Method: Proposes LLM-P2G decoding, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G) stages, and addresses information loss with data augmentation with noisy phonemes (DANP) and randomized top-K marginalized (TKM) training and decoding. Result: Experiments show that LLM-P2G achieves relative WER reductions of 3.6% and 6.9% in crosslingual ASR for Polish and German, respectively. Conclusion: LLM-P2G performs well for phoneme-based ASR and offers an efficient solution for crosslingual speech recognition. Abstract: In automatic speech recognition (ASR), phoneme-based multilingual pre-training and crosslingual fine-tuning is attractive for its high data efficiency and competitive results compared to subword-based models. However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G). A challenge is that there seems to be information loss in cascading S2P and P2G. To address this challenge, we propose two training strategies: data augmentation with noisy phonemes (DANP), and randomized top-$K$ marginalized (TKM) training and decoding. Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and German, by relative WER reductions of 3.6% and 6.9% respectively.
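
The "relative WER reduction" figures quoted above are relative to the baseline error rate, not absolute percentage points. A minimal sketch of the arithmetic follows; the baseline and new WER values are hypothetical, chosen only so that the result lands near the 3.6% figure, and are not numbers from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in word error rate, as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values: a drop from 10.00% to 9.64% WER is only 0.36 points in
# absolute terms, but about 3.6% relative to the baseline.
reduction = relative_wer_reduction(10.0, 9.64)
```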

astro-ph.SR

[233] Deep learning image burst stacking to reconstruct high-resolution ground-based solar observations

Christoph Schirninger,Robert Jarolim,Astrid M. Veronig,Christoph Kuckein

Main category: astro-ph.SR

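TL;DR: Proposes a deep learning method for real-time image reconstruction that addresses the image degradation caused by atmospheric turbulence in ground-based solar telescope observations.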

Details Motivation: Ground-based solar telescope observations are degraded by atmospheric turbulence, and existing reconstruction methods struggle with strong turbulence and high computational cost. Method: Uses a deep learning model based on unpaired image-to-image translation to reconstruct 100 short-exposure images into one high-quality image in real time. Result: The model shows better perceptual quality and is more robust, especially when the reference images contain artifacts. Conclusion: The method makes efficient use of the combined image information and achieves the best reconstructions when given the full image burst. Abstract: Large-aperture ground-based solar telescopes allow the solar atmosphere to be resolved in unprecedented detail. However, observations are limited by Earth's turbulent atmosphere, requiring post-image corrections. Current reconstruction methods using short-exposure bursts face challenges with strong turbulence and high computational costs. We introduce a deep learning approach that reconstructs 100 short-exposure images into one high-quality image in real time. Using unpaired image-to-image translation, our model is trained on degraded bursts with speckle reconstructions as references, improving robustness and generalization. Our method shows an improved robustness in terms of perceptual quality, especially when speckle reconstructions show artifacts. An evaluation with a varying number of images per burst demonstrates that our method makes efficient use of the combined image information and achieves the best reconstructions when provided with the full image burst.

cs.MA [Back]

[234] Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games

Niv Eckhaus,Uri Berger,Gabriel Stanovsky

Main category: cs.MA

TL;DR: Proposes an asynchronous LLM agent that decides when to speak, not only what to say. Evaluated in online Mafia games, it performs on par with human players.

Details Motivation: LLMs are used mainly for synchronous communication, whereas many real-world settings (such as group chats and meetings) are asynchronous. The study aims to develop an agent that can operate in asynchronous environments. Method: Develops an adaptive asynchronous LLM agent that also decides when to speak, and evaluates it alongside human players in online Mafia games. Result: The agent is on par with humans both in game performance and in blending in with human players; its timing closely mirrors human patterns, although message content differs. Conclusion: The work lays a foundation for integrating LLMs into realistic asynchronous settings (such as team discussions and education), and releases data and code to support further research. Abstract: LLMs are used predominantly in synchronous communication, where a human user and a model communicate in alternating turns. In contrast, many real-world settings are inherently asynchronous. For example, in group chats, online team meetings, or social games, there is no inherent notion of turns; therefore, the decision of when to speak forms a crucial part of the participant's decision making. In this work, we develop an adaptive asynchronous LLM-agent which, in addition to determining what to say, also decides when to say it. To evaluate our agent, we collect a unique dataset of online Mafia games, including both human participants, as well as our asynchronous agent. Overall, our agent performs on par with human players, both in game performance, as well as in its ability to blend in with the other human players. Our analysis shows that the agent's behavior in deciding when to speak closely mirrors human patterns, although differences emerge in message content. We release all our data and code to support and encourage further research for more realistic asynchronous communication between LLM agents. This work paves the way for integration of LLMs into realistic human group settings, from assistance in team discussions to educational and professional environments where complex social dynamics must be navigated.

eess.AS [Back]

[235] Can we reconstruct a dysarthric voice with the large speech model Parler TTS?

Ariadna Sanchez,Simon King

Main category: eess.AS

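TL;DR: Uses the Parler TTS model for voice reconstruction, attempting to generate the pre-onset voice of speakers with dysarthria; the model struggles to control intelligibility and to maintain consistent speaker identity.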

Details Motivation: People with dysarthria have difficulty communicating, and personalised speech synthesis can serve as a communication aid by reconstructing the voice they had before the onset of their condition. Method: Fine-tunes the Parler TTS model on an annotated dataset, then generates speech and evaluates its intelligibility and speaker-identity consistency. Result: The model learns to generate speech from this data, but struggles to control intelligibility and to keep speaker identity consistent. Conclusion: Future work should improve the controllability of this class of model for the voice reconstruction task. Abstract: Speech disorders can make communication hard or even impossible for those who develop them. Personalised Text-to-Speech is an attractive option as a communication aid. We attempt voice reconstruction using a large speech model, with which we generate an approximation of a dysarthric speaker's voice prior to the onset of their condition. In particular, we investigate whether a state-of-the-art large speech model, Parler TTS, can generate intelligible speech while maintaining speaker identity. We curate a dataset and annotate it with relevant speaker and intelligibility information, and use this to fine-tune the model. Our results show that the model can indeed learn to generate from the distribution of this challenging data, but struggles to control intelligibility and to maintain consistent speaker identity. We propose future directions to improve controllability of this class of model, for the voice reconstruction task.

[236] Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

Haibin Wu,Yuxuan Hu,Ruchao Fan,Xiaofei Wang,Kenichi Kumatani,Bo Ren,Jianwei Yu,Heng Lu,Lijuan Wang,Yao Qian,Jinyu Li

Main category: eess.AS

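TL;DR: Compares joint speech-text decoding strategies and proposes a new early-stop interleaved (ESI) pattern that significantly accelerates decoding and slightly improves performance.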

Details Motivation: To study how the choice of joint speech-text decoding strategy affects performance, efficiency, and alignment quality, in order to improve speech language models. Method: Systematically compares the interleaved and parallel generation paradigms under the same base model, speech tokenizer, and training data, and proposes the ESI pattern. Result: The interleaved approach gives the best alignment but is slow at inference; the ESI pattern significantly accelerates decoding with slightly better performance. Conclusion: ESI is an efficient joint decoding strategy; curated QA datasets further improve speech question-answering performance. Abstract: Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model, offering a promising direction for spoken dialogue systems. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference due to long token sequence length. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.

[237] EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition

Yi-Cheng Lin,Huang-Cheng Chou,Yu-Hsuan Li Liang,Hung-yi Lee

Main category: eess.AS

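TL;DR: Studies gender bias in speech emotion recognition (SER), comparing 13 debiasing methods on multi-label SER and analyzing the trade-off between fairness and accuracy.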

Details Motivation: SER systems often exhibit gender bias, and the effectiveness and robustness of existing debiasing methods in multi-label settings remain underexplored. Method: Presents EMO-Debias, a large-scale comparison of 13 debiasing methods spanning pre-processing, regularization, adversarial learning, biased learners, and distributionally robust optimization. Result: Experiments show that some methods reduce the gender performance gap without lowering overall performance, and that dataset distribution has a significant effect on the results. Conclusion: The study offers practical guidance for choosing effective debiasing strategies and highlights the importance of dataset distribution. Abstract: Speech emotion recognition (SER) systems often exhibit gender bias. However, the effectiveness and robustness of existing debiasing methods in such multi-label scenarios remain underexplored. To address this gap, we present EMO-Debias, a large-scale comparison of 13 debiasing methods applied to multi-label SER. Our study encompasses techniques from pre-processing, regularization, adversarial learning, biased learners, and distributionally robust optimization. Experiments conducted on acted and naturalistic emotion datasets, using WavLM and XLSR representations, evaluate each method under conditions of gender imbalance. Our analysis quantifies the trade-offs between fairness and accuracy, identifying which approaches consistently reduce gender performance gaps without compromising overall model performance. The findings provide actionable insights for selecting effective debiasing strategies and highlight the impact of dataset distributions.

eess.IV [Back]

[238] Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning

Hasin Us Sami,Swapneel Sen,Amit K. Roy-Chowdhury,Srikanth V. Krishnamurthy,Basak Guler

Main category: eess.IV

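TL;DR: Studies the privacy risk of parameter-efficient fine-tuning (PEFT) in federated learning: with a maliciously designed pretrained model and adapter modules, an attacker can reconstruct users' local data through gradient inversion attacks.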

Details Motivation: To examine the privacy of PEFT in federated learning and show how an attacker can obtain user data through gradient inversion. Method: Maliciously designs the pretrained model and adapter modules, then applies gradient inversion attacks to reconstruct user data from adapter gradients. Result: Experiments show that an attacker can reconstruct a large batch of fine-tuning images with high fidelity. Conclusion: The study underlines the need for privacy-preserving mechanisms for PEFT and points to future research directions. Abstract: Federated learning (FL) allows multiple data-owners to collaboratively train machine learning models by exchanging local gradients, while keeping their private data on-device. To simultaneously enhance privacy and training efficiency, recently parameter-efficient fine-tuning (PEFT) of large-scale pretrained models has gained substantial attention in FL. While keeping a pretrained (backbone) model frozen, each user fine-tunes only a few lightweight modules to be used in conjunction, to fit specific downstream applications. Accordingly, only the gradients with respect to these lightweight modules are shared with the server. In this work, we investigate how the privacy of the fine-tuning data of the users can be compromised via a malicious design of the pretrained model and trainable adapter modules. We demonstrate gradient inversion attacks on a popular PEFT mechanism, the adapter, which allow an attacker to reconstruct local data samples of a target user, using only the accessible adapter gradients. Via extensive experiments, we demonstrate that a large batch of fine-tuning images can be retrieved with high fidelity. Our attack highlights the need for privacy-preserving mechanisms for PEFT, while opening up several future directions. Our code is available at https://github.com/info-ucr/PEFTLeak.

[239] A Poisson-Guided Decomposition Network for Extreme Low-Light Image Enhancement

Isha Rao,Sanjay Ghosh

Main category: eess.IV

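TL;DR: Proposes a lightweight deep learning method that combines Retinex decomposition with Poisson denoising for image denoising and enhancement under extreme low-light conditions.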

Details Motivation: To handle signal-dependent Poisson noise in low-light conditions, where the conventional Gaussian noise assumption does not apply. Method: An encoder-decoder network that integrates Retinex decomposition with Poisson denoising and introduces a Poisson denoising loss. Result: Significantly improves visibility and brightness in low light while preserving image structure and color constancy. Conclusion: The method is effective and practical for low-light image enhancement. Abstract: Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in most cases. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a light-weight deep learning-based method that integrates Retinex based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without prior requirement for reflectance and illumination, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without causing any form of color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.
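The abstract does not give the form of its Poisson denoising loss. As a point of reference only, the sketch below shows the standard Poisson negative log-likelihood, which is one common way to penalize signal-dependent noise; the function name `poisson_nll` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def poisson_nll(predicted, observed, eps=1e-8):
    """Mean Poisson negative log-likelihood over flat lists of intensities.

    The log(k!) term is dropped because it does not depend on the prediction.
    This is the textbook loss, not necessarily the one used in the paper.
    """
    total = 0.0
    for lam, k in zip(predicted, observed):
        lam = max(lam, eps)  # keep the rate strictly positive before log
        total += lam - k * math.log(lam)
    return total / len(predicted)


# Hypothetical photon counts for three pixels.
observed = [2.0, 0.0, 5.0]
loss_at_truth = poisson_nll(observed, observed)
loss_off = poisson_nll([3.0, 1.0, 4.0], observed)
```

Each term `lam - k * log(lam)` is minimized at `lam == k`, so a prediction that matches the observed counts gives a lower loss than one that is off by a count per pixel.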

[240] DACN: Dual-Attention Convolutional Network for Hyperspectral Image Super-Resolution

Usman Muhammad,Jorma Laaksonen

Main category: eess.IV

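TL;DR: DACN is a dual-attention convolutional network that addresses the lack of global context in 2D CNNs for super-resolution, improving performance through multiple attention mechanisms and a tailored loss function.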

Details Motivation: In hyperspectral image super-resolution, 2D CNNs rely on local neighborhoods and lack global context, and they are further limited by band correlation and data scarcity. Method: DACN combines augmented convolutions with multi-head attention to capture local and global feature dependencies, uses separate channel and spatial attention maps to decide where to focus, and proposes a loss that combines L2 regularization with a spatial-spectral gradient loss. Result: Experiments show that combining multi-head attention with channel attention outperforms either mechanism used alone. Conclusion: With its dual-attention design and tailored loss, DACN improves hyperspectral image super-resolution performance. Abstract: 2D convolutional neural networks (CNNs) have attracted significant attention for hyperspectral image super-resolution tasks. However, a key limitation is their reliance on local neighborhoods, which leads to a lack of global contextual understanding. Moreover, band correlation and data scarcity continue to limit their performance. To mitigate these issues, we introduce DACN, a dual-attention convolutional network for hyperspectral image super-resolution. Specifically, the model first employs augmented convolutions, integrating multi-head attention to effectively capture both local and global feature dependencies. Next, we infer separate attention maps for the channel and spatial dimensions to determine where to focus across different channels and spatial positions. Furthermore, a custom optimized loss function is proposed that combines L2 regularization with spatial-spectral gradient loss to ensure accurate spectral fidelity. Experimental results on two hyperspectral datasets demonstrate that the combination of multi-head attention and channel attention outperforms either attention mechanism used individually.

[241] PixCell: A generative foundation model for digital histopathology images

Srikar Yellapragada,Alexandros Graikos,Zilinghan Li,Kostas Triaridis,Varun Belagali,Saarthak Kapse,Tarak Nath Nandi,Ravi K Madduri,Prateek Prasanna,Tahsin Kurc,Rajarsi R. Gupta,Joel Saltz,Dimitris Samaras

Main category: eess.IV

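TL;DR: PixCell is a diffusion-based generative foundation model for histopathology that generates diverse, high-quality images and addresses data scarcity, privacy-preserving data sharing, and virtual staining.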

Details Motivation: To address data scarcity, privacy protection, and inherently generative tasks (such as virtual staining) in pathology. Method: Trains the diffusion model PixCell on the PanCan-30M dataset with a progressive training strategy and self-supervision-based conditioning. Result: PixCell generates high-quality images that can be used to train self-supervised models, for data sharing, data augmentation, and education, and it supports inferring molecular marker results. Conclusion: PixCell provides a powerful generative tool for computational pathology that accelerates research and addresses practical problems. Abstract: The digitization of histology slides has revolutionized pathology, providing massive datasets for cancer diagnosis and research. Contrastive self-supervised and vision-language models have been shown to effectively mine large pathology datasets to learn discriminative representations. On the other hand, generative models, capable of synthesizing realistic and diverse images, present a compelling solution to address unique problems in pathology that involve synthesizing images: overcoming annotated data scarcity, enabling privacy-preserving data sharing, and performing inherently generative tasks, such as virtual staining. We introduce PixCell, the first diffusion-based generative foundation model for histopathology. We train PixCell on PanCan-30M, a vast, diverse dataset derived from 69,184 H&E-stained whole slide images covering various cancer types. We employ a progressive training strategy and a self-supervision-based conditioning that allows us to scale up training without any annotated data. PixCell generates diverse and high-quality images across multiple cancer types, which we find can be used in place of real data to train a self-supervised discriminative model. Synthetic images shared between institutions are subject to fewer regulatory barriers than would be the case with real clinical images. Furthermore, we showcase the ability to precisely control image generation using a small set of annotated images, which can be used for both data augmentation and educational purposes. Testing on a cell segmentation task, a mask-guided PixCell enables targeted data augmentation, improving downstream performance.
Finally, we demonstrate PixCell's ability to use H&E structural staining to infer results from molecular marker studies; we use this capability to infer IHC staining from H&E images. Our trained models are publicly released to accelerate research in computational pathology.

[242] DM-SegNet: Dual-Mamba Architecture for 3D Medical Image Segmentation with Global Context Modeling

Hangyu Ji

Main category: eess.IV

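TL;DR: DM-SegNet is a Dual-Mamba architecture that combines directional state transitions with anatomy-aware hierarchical decoding, addressing the loss of spatial structure and the decoder incompatibility of existing medical SSMs.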

Details Motivation: Existing medical SSMs damage spatial structure when flattening volumes into 1D sequences, and conventional decoders cannot exploit Mamba's state propagation. Method: Uses a Mamba module with four-directional 3D scanning, a gated spatial convolution layer, and a Mamba-driven decoding framework that enables bidirectional state synchronization. Result: Achieves DSC of 85.44% on Synapse and 90.22% on BraTS2023. Conclusion: Through its architectural design, DM-SegNet balances global context modeling with spatial topology preservation in medical image segmentation. Abstract: Accurate 3D medical image segmentation demands architectures capable of reconciling global context modeling with spatial topology preservation. While State Space Models (SSMs) like Mamba show potential for sequence modeling, existing medical SSMs suffer from encoder-decoder incompatibility: the encoder's 1D sequence flattening compromises spatial structures, while conventional decoders fail to leverage Mamba's state propagation. We present DM-SegNet, a Dual-Mamba architecture integrating directional state transitions with anatomy-aware hierarchical decoding. The core innovations include a quadri-directional spatial Mamba module employing four-directional 3D scanning to maintain anatomical spatial coherence, a gated spatial convolution layer that enhances spatially sensitive feature representation prior to state modeling, and a Mamba-driven decoding framework enabling bidirectional state synchronization across scales. Extensive evaluation on two clinically significant benchmarks demonstrates the efficacy of DM-SegNet: achieving state-of-the-art Dice Similarity Coefficient (DSC) of 85.44% on the Synapse dataset for abdominal organ segmentation and 90.22% on the BraTS2023 dataset for brain tumor segmentation.
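For readers unfamiliar with the metric reported above, the Dice Similarity Coefficient for two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch on flattened masks; the function name and the toy masks are hypothetical, and benchmark evaluations typically average the score over organs or tumor sub-regions rather than computing it once.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary (0/1) masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


# Toy masks: 3 predicted voxels, 3 true voxels, 2 in common.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
```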

cs.IR [Back]

[243] Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion

Lingyuan Liu,Mengxiang Zhang

Main category: cs.IR

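TL;DR: Exp4Fuse is a new rank fusion framework that improves sparse retrieval through zero-shot LLM-based query expansion and outperforms existing methods.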

Details Motivation: Existing LLM-based query expansion depends on high-quality generated documents, which is costly and computationally intensive; sparse retrieval needs a cheaper way to benefit from it. Method: Proposes Exp4Fuse, which runs two retrieval routes (the original query and the LLM-augmented query) and merges the results with a modified reciprocal rank fusion method. Result: Across multiple datasets, Exp4Fuse surpasses existing LLM-based query expansion methods and reaches SOTA when combined with advanced sparse retrievers. Conclusion: Exp4Fuse substantially improves query expansion for sparse retrieval, with both efficiency and strong performance. Abstract: Large Language Models (LLMs) have shown potential in generating hypothetical documents for query expansion, thereby enhancing information retrieval performance. However, the efficacy of this method is highly dependent on the quality of the generated documents, which often requires complex prompt strategies and the integration of advanced dense retrieval techniques. This can be both costly and computationally intensive. To mitigate these limitations, we explore the use of zero-shot LLM-based query expansion to improve sparse retrieval, particularly for learned sparse retrievers. We introduce a novel fusion ranking framework, Exp4Fuse, which enhances the performance of sparse retrievers through an indirect application of zero-shot LLM-based query expansion. Exp4Fuse operates by simultaneously considering two retrieval routes: one based on the original query and the other on the LLM-augmented query. It then generates two ranked lists using a sparse retriever and fuses them using a modified reciprocal rank fusion method. We conduct extensive evaluations of Exp4Fuse against leading LLM-based query expansion methods and advanced retrieval techniques on three MS MARCO-related datasets and seven low-resource datasets. Experimental results reveal that Exp4Fuse not only surpasses existing LLM-based query expansion methods in enhancing sparse retrievers but also, when combined with advanced sparse retrievers, achieves SOTA results on several benchmarks. This highlights the superior performance and effectiveness of Exp4Fuse in improving query expansion for sparse retrieval.
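The abstract says the two ranked lists are fused with a modified reciprocal rank fusion method but does not describe the modification. The sketch below therefore shows only the standard, unmodified RRF, in which each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the commonly used default, and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs with standard RRF.

    Each document receives sum(1 / (k + rank)) over every list that
    contains it, with ranks starting at 1. Ties break on document ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the original query and the LLM-augmented query.
original_route = ["d1", "d2", "d3"]
expanded_route = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([original_route, expanded_route])
```

Documents that appear in both lists (`d1`, `d3`) rise above those found by only one route, which is the basic reason fusion can help even when the expanded query is noisy.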

[244] GOLFer: Smaller LM-Generated Documents Hallucination Filter & Combiner for Query Expansion in Information Retrieval

Lingyuan Liu,Mengxiang Zhang

Main category: cs.IR

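TL;DR: GOLFer is a new query expansion method that uses smaller open-source language models, filtering hallucinated content and combining documents to avoid the high cost and compute of large language models.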

Details Motivation: LLM-based query expansion depends on model scale, which makes it costly and computationally intensive. GOLFer aims to address this with smaller open-source models. Method: GOLFer has two modules: a hallucination filter, which detects and removes non-factual and inconsistent sentences, and a documents combiner, which uses a weight vector to balance the query against the filtered content. Result: Experiments show that GOLFer outperforms other methods that use smaller LMs and remains competitive with methods that use large LLMs. Conclusion: GOLFer demonstrates that smaller language models can be effective for query expansion, offering an efficient and economical alternative. Abstract: Large language model (LLM)-based query expansion for information retrieval augments queries with hypothetical documents generated by LLMs. However, its performance relies heavily on the scale of the language models (LMs), necessitating larger, more advanced LLMs. This approach is costly, computationally intensive, and often has limited accessibility. To address these limitations, we introduce GOLFer (Smaller LMs-Generated Documents Hallucination Filter & Combiner), a novel method leveraging smaller open-source LMs for query expansion. GOLFer comprises two modules: a hallucination filter and a documents combiner. The former detects and removes non-factual and inconsistent sentences in generated documents, a common issue with smaller LMs, while the latter combines the filtered content with the query using a weight vector to balance their influence. We evaluate GOLFer alongside dominant LLM-based query expansion methods on three web search and ten low-resource datasets. Experimental results demonstrate that GOLFer consistently outperforms other methods using smaller LMs, and maintains competitive performance against methods using large-size LLMs, demonstrating its effectiveness.

[245] Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings

Yubo Ma,Jinsong Li,Yuhang Zang,Xiaobao Wu,Xiaoyi Dong,Pan Zhang,Yuhang Cao,Haodong Duan,Jiaqi Wang,Yixin Cao,Aixin Sun

Main category: cs.IR

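TL;DR: Studies how to reduce the number of patch embeddings per page in visual document retrieval (VDR) using two strategies, token pruning and token merging. Random pruning beats more sophisticated pruning methods, but pruning itself is unsuitable for VDR. Token merging works better, and the resulting Light-ColPali/ColQwen2 retains 98.2% of retrieval performance with only 11.8% of the memory.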

Details Motivation: ColPali/ColQwen2 performs strongly in VDR, but encoding each page into multiple patch-level embeddings causes excessive memory usage; the study aims to reduce the number of embeddings with minimal performance loss. Method: Evaluates token pruning and token merging, finds pruning unsuitable for VDR, and then optimizes the merging strategy to develop Light-ColPali/ColQwen2. Result: Light-ColPali/ColQwen2 retains 98.2% of retrieval performance with only 11.8% of the memory, and in its smallest configuration keeps 94.6% of performance at a 2.8% memory footprint. Conclusion: The study offers useful insights and a competitive baseline for efficient VDR, and Light-ColPali/ColQwen2 shows substantial memory savings. Abstract: Despite the strong performance of ColPali/ColQwen2 in Visualized Document Retrieval (VDR), it encodes each page into multiple patch-level embeddings and leads to excessive memory usage. This empirical study investigates methods to reduce patch embeddings per page at minimum performance degradation. We evaluate two token-reduction strategies: token pruning and token merging. Regarding token pruning, we surprisingly observe that a simple random strategy outperforms other sophisticated pruning methods, though still far from satisfactory. Further analysis reveals that pruning is inherently unsuitable for VDR as it requires removing certain page embeddings without query-specific information. Turning to token merging (more suitable for VDR), we search for the optimal combinations of merging strategy across three dimensions and develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance with only 11.8% of original memory usage, and preserves 94.6% effectiveness at 2.8% memory footprint. We expect our empirical findings and resulting Light-ColPali/ColQwen2 offer valuable insights and establish a competitive baseline for future research towards efficient VDR.
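The abstract does not say which merging strategy Light-ColPali/ColQwen2 settles on. To illustrate the general idea of token merging only, the sketch below averages consecutive groups of patch embeddings, so a page with N embeddings is stored as roughly N / factor vectors. The grouping rule, the function name, and the toy 2-dimensional embeddings are assumptions made for illustration.

```python
def merge_patch_embeddings(embeddings, factor):
    """Mean-pool consecutive groups of `factor` patch embeddings.

    `embeddings` is a list of equal-length float lists. The last group may
    be smaller if the count is not divisible by `factor`.
    """
    merged = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        merged.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return merged


# Toy page with four 2-dimensional patch embeddings, merged pairwise.
page = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
light_page = merge_patch_embeddings(page, factor=2)
```

Unlike pruning, merging keeps a contribution from every patch, which fits the abstract's observation that dropping embeddings without knowing the query is inherently risky.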

cs.CR [Back]

[246] Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Lei Hsiung,Tianyu Pang,Yung-Chen Tang,Linyue Song,Tsung-Yi Ho,Pin-Yu Chen,Yaoqing Yang

Main category: cs.CR

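TL;DR: Examines how the similarity between upstream safety-alignment data and downstream fine-tuning datasets affects LLM safety guardrails, finding that high similarity weakens safety while low similarity makes models substantially more robust.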

Details Motivation: Existing mitigation strategies focus on after-the-fact handling or on reinforcing safety during fine-tuning, and overlook the role of the upstream safety-alignment data. Method: Analyzes the representation similarity between upstream alignment datasets and downstream fine-tuning tasks and studies its effect on safety guardrails. Result: High similarity significantly weakens safety guardrails, while low similarity reduces the harmfulness score by up to 10.33%. Conclusion: Upstream dataset design is critical for building durable safety guardrails and reducing real-world vulnerability, and the findings offer practical guidance for fine-tuning service providers. Abstract: Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and thus reduces the harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in the building of durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.

cs.AI [Back]

[247] Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Guangchen Lan,Huseyin A. Inan,Sahar Abdelnabi,Janardhan Kulkarni,Lukas Wutschitz,Reza Shokri,Christopher G. Brinton,Robert Sim

Main category: cs.AI

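TL;DR: Proposes combining contextual reasoning with a reinforcement learning framework to reduce inappropriate information disclosure by autonomous agents while maintaining task performance.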

Details Motivation: As autonomous agents increasingly make decisions on behalf of users, ensuring contextual integrity (CI) becomes a central problem: the agent must reason about its operating context to decide what information is appropriate to share. Method: First prompts LLMs to reason explicitly about CI, then develops a reinforcement learning framework that further trains models to achieve CI, validated with a synthetic dataset of only about 700 examples. Result: The method substantially reduces inappropriate information disclosure while maintaining task performance, and the gains transfer to human-annotated CI benchmarks such as PrivacyLens. Conclusion: Contextual reasoning combined with reinforcement learning can effectively improve the contextual integrity of autonomous agents' information sharing. Abstract: As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI), that is, what is the appropriate information to share while carrying out a certain task, becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only about 700 examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.

[248] A Graph-Retrieval-Augmented Generation Framework Enhances Decision-Making in the Circular Economy

Yang Zhao,Chengxiao Dai,Dusit Niyato,Chuan Fu Tan,Keyi Xiang,Yueyang Wang,Zhiquan Yeo,Daren Tan Zong Loong,Jonathan Low Zhaozhi,Eugene H. Z. HO

Main category: cs.AI

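TL;DR: CircuGraphRAG is a knowledge-graph-based retrieval-augmented generation framework that improves the accuracy of large language models in sustainable manufacturing and reduces hallucination.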

Details Motivation: To address LLM hallucination of industrial codes and emission factors, in support of sustainable manufacturing and low-carbon decision making. Method: Grounds outputs in a domain knowledge graph, using SPARQL queries and multi-hop reasoning to ensure accuracy and traceability. Result: Performs strongly on single-hop and multi-hop question answering, with ROUGE-L F1 scores up to 1.0, while also improving efficiency by reducing response time and token usage. Conclusion: CircuGraphRAG provides reliable, fact-checked support for the circular economy and advances low-carbon resource decision making. Abstract: Large language models (LLMs) hold promise for sustainable manufacturing, but often hallucinate industrial codes and emission factors, undermining regulatory and investment decisions. We introduce CircuGraphRAG, a retrieval-augmented generation (RAG) framework that grounds LLM outputs in a domain-specific knowledge graph for the circular economy. This graph connects 117,380 industrial and waste entities with classification codes and GWP100 emission data, enabling structured multi-hop reasoning. Natural language queries are translated into SPARQL and verified subgraphs are retrieved to ensure accuracy and traceability. Compared with Standalone LLMs and Naive RAG, CircuGraphRAG achieves superior performance in single-hop and multi-hop question answering, with ROUGE-L F1 scores up to 1.0, while baseline scores fall below 0.08. It also improves efficiency, halving the response time and reducing token usage by 16% in representative tasks. CircuGraphRAG provides fact-checked, regulatory-ready support for circular economy planning, advancing reliable, low-carbon resource decision making.
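To make the reported metric concrete: ROUGE-L F1 is built on the longest common subsequence (LCS) between a candidate answer and a reference, so a score of 1.0 means the two token sequences match exactly. The sketch below is a minimal whitespace-tokenized version with equal weight on precision and recall; the example strings are hypothetical, and published evaluations usually rely on a standard ROUGE package with its own tokenization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 with precision = LCS/|candidate|, recall = LCS/|reference|."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers to a code-lookup question.
exact = rouge_l_f1("the code is 1234", "the code is 1234")
wrong_code = rouge_l_f1("the code is 9999", "the code is 1234")
```

A hallucinated code still shares most surrounding words with the reference, which is why very low baseline scores indicate answers that differ in far more than a single token.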

[249] Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science

Peter Jansen,Samiah Hassan,Ruoyao Wang

Main category: cs.AI

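TL;DR: Introduces Matter-of-Fact, a dataset for assessing the feasibility of hypotheses framed as scientific claims, with the goal of making scientific discovery systems more efficient.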

Details Motivation: Scientific discovery systems generate large numbers of hypotheses, and testing them with automated experiments is costly, especially at scale. Filtering hypotheses by feasibility could make discovery more efficient. Method: Builds a dataset of 8.4k scientific claims covering four areas of materials science and tests baselines including retrieval-augmented generation and code generation. Result: Baselines perform poorly (at most 72% accuracy against a 50% chance level), yet expert verification suggests nearly all claims are solvable, showing both the difficulty for current models and the room for improvement. Conclusion: The task is challenging for existing models, and progress on it could substantially accelerate scientific discovery. Abstract: Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e. thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale, while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics, including superconductors, semiconductors, batteries, and aerospace materials, while including qualitative and quantitative claims from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all are solvable, highlighting both the difficulty of this task for current models, and the potential to accelerate scientific discovery by making near-term progress.

[250] Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design

Lin Sun,Weihong Lin,Jinzhu Wu,Yongfu Zhu,Xiaoqi Jian,Guangxiang Zhao,Change Jia,Linglin Zhang,Sai-er Hu,Yuhan Wu,Xiangzheng Zhang

Main category: cs.AI

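TL;DR: Finds that benchmark results for the Deepseek-R1-Distill series and its derivative models fluctuate significantly, and calls for a more rigorous evaluation paradigm.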

Details Motivation: Evaluation results of current reasoning models fluctuate widely, making claimed performance gains hard to reproduce reliably, so evaluation methods need to improve. Method: Empirically evaluates the Deepseek-R1-Distill series and analyzes how subtle differences in evaluation conditions affect the results. Result: Results depend strongly on evaluation conditions, and performance gains are difficult to reproduce consistently. Conclusion: A more rigorous paradigm for model performance evaluation is needed to ensure reliable results. Abstract: Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community due to their strong performance in mathematics, science, programming, and other domains. However, our study reveals that their benchmark evaluation results are subject to significant fluctuations caused by various factors. Subtle differences in evaluation conditions can lead to substantial variations in results. Similar phenomena are observed in other open-source inference models fine-tuned based on the Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their claimed performance improvements difficult to reproduce reliably. Therefore, we advocate for the establishment of a more rigorous paradigm for model performance evaluation and present our empirical assessments of the Deepseek-R1-Distill series models.

[251] When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

Kai Wang,Yihao Zhang,Meng Sun

Main category: cs.AI

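TL;DR: Studies strategic deception in large language models (LLMs), proposing a representation-engineering approach to detect and control deception, with high detection accuracy and the ability to induce deceptive behavior.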

Details Motivation: As LLMs with chain-of-thought (CoT) reasoning advance, they may deliberately deceive humans; this strategic deception differs from ordinary hallucination and calls for systematic study. Method: Uses representation engineering and Linear Artificial Tomography (LAT) to extract "deception vectors", and uses activation steering to induce deceptive behavior. Result: Achieves 89% deception detection accuracy and a 40% success rate in eliciting deception without explicit prompts. Conclusion: The study exposes an honesty problem specific to reasoning models and provides tools for trustworthy AI alignment. Abstract: The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty issues in LLMs, which could possibly be explained as some kind of hallucination, those models' explicit thought paths enable us to study strategic deception: goal-driven, intentional misinformation where reasoning contradicts outputs. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) for 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, unveiling the specific honesty-related issue of reasoning models and providing tools for trustworthy AI alignment.
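The abstract names LAT and activation steering without giving details. The sketch below is a simplified stand-in, not the paper's procedure: it builds a direction as the difference between the mean activation of "deceptive" and "honest" examples, scores an activation by projecting onto that direction (detection), and shifts an activation along it (steering). All function names and the 2-dimensional toy activations are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length float lists."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def difference_of_means(deceptive, honest):
    """Direction pointing from the honest mean toward the deceptive mean."""
    md, mh = mean_vector(deceptive), mean_vector(honest)
    return [a - b for a, b in zip(md, mh)]


def project(activation, direction):
    """Dot product: a larger value means further along the direction."""
    return sum(a * b for a, b in zip(activation, direction))


def steer(activation, direction, alpha):
    """Shift an activation by alpha times the direction."""
    return [a + alpha * b for a, b in zip(activation, direction)]


# Toy hidden states; real ones would come from a model layer.
honest = [[0.0, 1.0], [0.2, 0.8]]
deceptive = [[1.0, 1.0], [1.2, 0.8]]
vec = difference_of_means(deceptive, honest)
steered = steer(honest[0], vec, alpha=1.0)
```

In practice the projection would be thresholded or fed to a classifier for detection, and steering would be applied inside the model's forward pass; both steps are outside what this toy example covers.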

[252] LLM-First Search: Self-Guided Exploration of the Solution Space

Nathan Herr,Tim Rocktäschel,Roberta Raileanu

Main category: cs.AI

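TL;DR: LLM-First Search (LFS) is a new LLM self-guided search method that lets the LLM control the search process itself, with no predefined strategy, improving flexibility and computational efficiency.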

Details Motivation: Conventional search methods such as MCTS rely on fixed hyperparameters and adapt poorly to tasks of varying difficulty, which limits their practical use. Method: LFS uses the LLM's internal scoring to decide whether to continue the current search path or explore alternative branches, without external heuristics or hardcoded policies. Result: On Countdown and Sudoku, LFS outperforms ToT-BFS, BestFS, and MCTS, is more computationally efficient, and scales better with stronger models and larger compute budgets. Conclusion: LFS shows the potential of LLM self-guided search and offers a more flexible and efficient approach to complex reasoning tasks. Abstract: Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute, often by framing problem-solving as a search process. While methods like Monte Carlo Tree Search (MCTS) have proven effective in some domains, their reliance on fixed exploration hyperparameters limits their adaptability across tasks of varying difficulty, rendering them impractical or expensive in certain settings. In this paper, we propose LLM-First Search (LFS), a novel LLM Self-Guided Search method that removes the need for pre-defined search strategies by empowering the LLM to autonomously control the search process via self-guided exploration. Rather than relying on external heuristics or hardcoded policies, the LLM evaluates whether to pursue the current search path or explore alternative branches based on its internal scoring mechanisms. This enables more flexible and context-sensitive reasoning without requiring manual tuning or task-specific adaptation. We evaluate LFS on Countdown and Sudoku against three classic widely-used search algorithms, Tree-of-Thoughts' Breadth First Search (ToT-BFS), Best First Search (BestFS), and MCTS, each of which has been used to achieve SotA results on a range of challenging reasoning tasks. We found that LFS (1) performs better on more challenging tasks without additional tuning, (2) is more computationally efficient compared to the other methods, especially when powered by a stronger model, (3) scales better with stronger models, due to its LLM-First design, and (4) scales better with increased compute budget. Our code is publicly available at https://github.com/NathanHerr/LLM-First-Search.

[253] Ontology-based knowledge representation for bone disease diagnosis: a foundation for safe and sustainable medical artificial intelligence systems

Loan Dao,Ngoc Quoc Ly

Main category: cs.AI

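TL;DR: This study proposes an ontology-based framework for bone disease diagnosis that combines a hierarchical neural network, a visual question answering system, and a multimodal deep learning model, aiming to improve the diagnostic reliability of medical AI systems.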

Details Motivation: Medical AI systems often lack systematic integration of domain knowledge, which can compromise diagnostic reliability. This study aims to address the problem with an ontology-based framework. Method: Develops a hierarchical neural network architecture, an ontology-enhanced visual question answering system, and a multimodal deep learning model that integrates imaging, clinical, and laboratory data. Result: The framework shows potential through its standardized structure and reusable components, but experimental validation is still pending because of dataset and computational resource limitations. Conclusion: Future work will expand the clinical dataset and carry out comprehensive system validation to further establish the framework's practical value. Abstract: Medical artificial intelligence (AI) systems frequently lack systematic domain expertise integration, potentially compromising diagnostic reliability. This study presents an ontology-based framework for bone disease diagnosis, developed in collaboration with Ho Chi Minh City Hospital for Traumatology and Orthopedics. The framework introduces three theoretical contributions: (1) a hierarchical neural network architecture guided by bone disease ontology for segmentation-classification tasks, incorporating Visual Language Models (VLMs) through prompts, (2) an ontology-enhanced Visual Question Answering (VQA) system for clinical reasoning, and (3) a multimodal deep learning model that integrates imaging, clinical, and laboratory data through ontological relationships. The methodology maintains clinical interpretability through systematic knowledge digitization, standardized medical terminology mapping, and modular architecture design. The framework demonstrates potential for extension beyond bone diseases through its standardized structure and reusable components. While theoretical foundations are established, experimental validation remains pending due to current dataset and computational resource limitations. Future work will focus on expanding the clinical dataset and conducting comprehensive system validation.

cs.RO [Back]

[254] RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Enshen Zhou,Jingkun An,Cheng Chi,Yi Han,Shanyu Rong,Chi Zhang,Pengwei Wang,Zhongyuan Wang,Tiejun Huang,Lu Sheng,Shanghang Zhang

Main category: cs.RO

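TL;DR: RoboRefer is a 3D-aware vision-language model that achieves precise spatial understanding and multi-step reasoning through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), and performs strongly on RefSpatial-Bench.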

Details Motivation: Existing methods fall short in understanding complex 3D scenes and in dynamic spatial reasoning, so stronger models are needed to support robot interaction. Method: Proposes RoboRefer, which combines SFT and RFT, is trained on the RefSpatial dataset (20M QA pairs), and is evaluated on the newly introduced RefSpatial-Bench. Result: SFT-trained RoboRefer reaches an average spatial-understanding success rate of 89.6%; the RFT-trained model outperforms all other baselines on RefSpatial-Bench, surpassing Gemini-2.5-Pro by 17.4% in average accuracy. Conclusion: RoboRefer performs strongly on spatial reasoning tasks and can be integrated with a variety of robot control policies for use in complex scenes. Abstract: Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.

[255] Learning Smooth State-Dependent Traversability from Dense Point Clouds

Zihao Dong,Alan Papalia,Leonard Jung,Alenna Spiro,Philip R. Osteen,Christa S. Robison,Michael Everett

Main category: cs.RO

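TL;DR: SPARTA estimates approach-angle-conditioned traversability from point clouds, outputting a smooth risk distribution built from Fourier basis functions, and significantly improves off-road autonomy.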

Details Motivation: In off-road autonomy, terrain traversability often depends on the vehicle's state (such as its angle of approach), and conventional methods need large amounts of data and repeated model inference. Method: SPARTA takes a point cloud as input and outputs a smooth risk distribution function composed of Fourier basis functions, which predicts risk for any angle of approach efficiently. Result: In high-fidelity simulation, SPARTA achieves a 91% success rate crossing a 40 m boulder field, compared with 73% for the baseline, and hardware tests demonstrate its generalization ability. Conclusion: By imposing geometric structure and using Fourier basis functions, SPARTA handles angle-dependent traversability efficiently and shows potential for practical use. Abstract: A key open challenge in off-road autonomy is that the traversability of terrain often depends on the vehicle's state. In particular, some obstacles are only traversable from some orientations. However, learning this interaction by encoding the angle of approach as a model input demands a large and diverse training dataset and is computationally inefficient during planning due to repeated model inference. To address these challenges, we present SPARTA, a method for estimating approach angle conditioned traversability from point clouds. Specifically, we impose geometric structure into our network by outputting a smooth analytical function over the 1-Sphere that predicts risk distribution for any angle of approach with minimal overhead and can be reused for subsequent queries. The function is composed of Fourier basis functions, which have important advantages for generalization due to their periodic nature and smoothness. We demonstrate SPARTA both in a high-fidelity simulation platform, where our model achieves a 91% success rate crossing a 40m boulder field (compared to 73% for the baseline), and on hardware, illustrating the generalization ability of the model to real-world settings.
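As a rough illustration of the idea in the abstract, a truncated Fourier series over the approach angle can be evaluated in closed form once its coefficients are known, so a single network output can answer any number of angle queries. The sketch below is not SPARTA's implementation: the function name `risk_at_angle`, the coefficient layout, and the scalar risk output are assumptions made for illustration (the abstract describes a risk distribution and does not give the parameterization).

```python
import math

def risk_at_angle(coeffs, theta):
    """Evaluate a truncated Fourier series at approach angle theta (radians).

    coeffs = (a0, [(a1, b1), (a2, b2), ...]) is a hypothetical layout for
    what a network might predict for one terrain patch.
    """
    a0, harmonics = coeffs
    value = a0
    for k, (a_k, b_k) in enumerate(harmonics, start=1):
        value += a_k * math.cos(k * theta) + b_k * math.sin(k * theta)
    return value

# Hypothetical coefficients for one patch: predicted once, queried many times.
coeffs = (0.5, [(0.3, 0.0), (0.0, 0.1)])
for theta in (0.0, math.pi / 2, math.pi):
    print(round(risk_at_angle(coeffs, theta), 3))
```

Because every basis function has period 2π, the predicted risk is automatically consistent when the angle wraps around, which is one way to read the abstract's remark about periodicity and smoothness.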

[256] MineInsight: A Multi-sensor Dataset for Humanitarian Demining Robotics in Off-Road Environments

Mario Malizia,Charles Hamesse,Ken Hasselmann,Geert De Cubber,Nikolaos Tsiogkas,Eric Demeester,Rob Haelterman

Main category: cs.RO

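TL;DR: MineInsight is a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection, integrating dual-view sensor scans and data across multiple spectral ranges.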

Details Motivation: Because diverse and realistic datasets are lacking, reliably validating landmine detection algorithms remains a challenge for the research community. Method: The dataset contains 35 different targets (15 landmines and 20 commonly found objects) distributed along three distinct tracks, and integrates dual-view sensor scans (from an Unmanned Ground Vehicle and its robotic arm), two LiDARs, and images in several spectral ranges (RGB, VIS-SWIR, LWIR). Result: The dataset contains around 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames, and provides a benchmark for landmine detection algorithms. Conclusion: MineInsight provides diverse and realistic data for developing and evaluating landmine detection algorithms, filling a gap left by existing datasets. Abstract: The use of robotics in humanitarian demining increasingly involves computer vision techniques to improve landmine detection capabilities. However, in the absence of diverse and realistic datasets, the reliable validation of algorithms remains a challenge for the research community. In this paper, we introduce MineInsight, a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection. The dataset features 35 different targets (15 landmines and 20 commonly found objects) distributed along three distinct tracks, providing a diverse and realistic testing environment. MineInsight is, to the best of our knowledge, the first dataset to integrate dual-view sensor scans from both an Unmanned Ground Vehicle and its robotic arm, offering multiple viewpoints to mitigate occlusions and improve spatial awareness. It features two LiDARs, as well as images captured at diverse spectral ranges, including visible (RGB, monochrome), visible short-wave infrared (VIS-SWIR), and long-wave infrared (LWIR). Additionally, the dataset comes with an estimation of the location of the targets, offering a benchmark for evaluating detection algorithms. We recorded approximately one hour of data in both daylight and nighttime conditions, resulting in around 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames. MineInsight serves as a benchmark for developing and evaluating landmine detection algorithms. Our dataset is available at https://github.com/mariomlz99/MineInsight.

[257] Synthetic Dataset Generation for Autonomous Mobile Robots Using 3D Gaussian Splatting for Vision Training

Aneesh Deogan,Wout Beks,Peter Teurlings,Koen de Vos,Mark van den Brand,Rene van de Molengraft

Main category: cs.RO

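TL;DR: Proposes a new method for automatically generating annotated synthetic data in Unreal Engine, addressing the high cost and low efficiency of manual dataset creation, and validates it in a robot soccer scenario.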

Details Motivation: Manually creating annotated datasets is time-consuming, labour-intensive, and error-prone; in robotics in particular, diverse and dynamic scenarios make representative datasets even harder to build. Method: Uses 3D Gaussian splatting to generate photorealistic synthetic data quickly, and combines it with real-world data to improve object detection performance. Result: Synthetic datasets perform on par with real-world datasets, and combining the two significantly improves detection performance. Conclusion: The method offers an efficient, scalable way to generate datasets for robotics and reduces the need for manual annotation. Abstract: Annotated datasets are critical for training neural networks for object detection, yet their manual creation is time- and labour-intensive, subject to human error, and often limited in diversity. This challenge is particularly pronounced in the domain of robotics, where diverse and dynamic scenarios further complicate the creation of representative datasets. To address this, we propose a novel method for automatically generating annotated synthetic data in Unreal Engine. Our approach leverages photorealistic 3D Gaussian splats for rapid synthetic data generation. We demonstrate that synthetic datasets can achieve performance comparable to that of real-world datasets while significantly reducing the time required to generate and annotate data. Additionally, combining real-world and synthetic data significantly increases object detection performance by leveraging the quality of real-world images with the easier scalability of synthetic data. To our knowledge, this is the first application of synthetic data for training object detection algorithms in the highly dynamic and varied environment of robot soccer. Validation experiments reveal that a detector trained on synthetic images performs on par with one trained on manually annotated real-world images when tested on robot soccer match scenarios. Our method offers a scalable and comprehensive alternative to traditional dataset creation, eliminating the labour-intensive error-prone manual annotation process. By generating datasets in a simulator where all elements are intrinsically known, we ensure accurate annotations while significantly reducing manual effort, which makes it particularly valuable for robotics applications requiring diverse and scalable training data.

cs.CY [Back]

[258] Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems

Emma Harvey,Emily Sheng,Su Lin Blodgett,Alexandra Chouldechova,Jean Garcia-Gathright,Alexandra Olteanu,Hanna Wallach

Main category: cs.CY

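TL;DR: The study finds that existing publicly available instruments often fail to meet the needs of practitioners evaluating representational harms in large language model (LLM)-based systems, either because the instruments are not useful or because barriers impede their use.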

Details Motivation: To examine whether publicly available instruments meet the needs of practitioners evaluating representational harms caused by LLM-based systems. Method: Semi-structured interviews with 12 practitioners, analyzing how such instruments are used. Result: Practitioners are often unable to use publicly available instruments, either because the instruments are not useful or because of barriers to their uptake. Conclusion: Drawing on measurement theory and pragmatic measurement, the paper recommends improvements to instruments so that they better meet practitioner needs. Abstract: The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments - even useful instruments - are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.

cs.MM [Back]

[259] CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection

Fanxiao Li,Jiaying Wu,Canyuan He,Wei Zhou

Main category: cs.MM

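TL;DR: The paper proposes the CMIE framework, which uses a coexistence relationship generation strategy and an association scoring mechanism to improve multimodal large language models at detecting out-of-context misinformation.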

Details Motivation: Existing MLLMs struggle to capture deeper semantic associations when detecting out-of-context misinformation, and noise in the evidence hurts accuracy. Method: Proposes the CMIE framework, comprising a coexistence relationship generation strategy and an association scoring mechanism that selectively uses evidence to strengthen detection. Result: Experiments show that CMIE outperforms existing methods. Conclusion: CMIE addresses the challenges MLLMs face in misinformation detection and improves performance. Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities in visual reasoning and text generation. While previous studies have explored the application of MLLMs for detecting out-of-context (OOC) misinformation, our empirical analysis reveals two persisting challenges of this paradigm. Evaluating the representative GPT-4o model on direct reasoning and evidence-augmented reasoning, results indicate that MLLMs struggle to capture the deeper relationships, specifically cases in which the image and text are not directly connected but are associated through underlying semantic links. Moreover, noise in the evidence further impairs detection accuracy. To address these challenges, we propose CMIE, a novel OOC misinformation detection framework that incorporates a Coexistence Relationship Generation (CRG) strategy and an Association Scoring (AS) mechanism. CMIE identifies the underlying coexistence relationships between images and text, and selectively utilizes relevant evidence to enhance misinformation detection. Experimental results demonstrate that our approach outperforms existing methods.

cs.LG [Back]

[260] Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey

Ivan Vegner,Sydelle de Souza,Valentin Forch,Martha Lewis,Leonidas A. A. Doumas

Main category: cs.LG

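TL;DR: The paper discusses why systematicity matters in ML models, distinguishes behavioural systematicity from representational systematicity, and analyzes the limitations of existing benchmarks.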

Details Motivation: Systematicity is a desirable property in ML models, but existing work mostly addresses behavioural systematicity and neglects representational systematicity; the paper aims to highlight this distinction. Method: Building on Hadley's (1994) taxonomy, analyzes the extent to which key benchmarks in language and vision test behavioural systematicity. Result: Existing benchmarks mainly test behavioural systematicity and do not adequately assess representational systematicity. Conclusion: The paper calls for attention to assessing representational systematicity, drawing on methods from mechanistic interpretability. Abstract: A core aspect of compositionality, systematicity is a desirable property in ML models as it enables strong generalization to novel contexts. This has led to numerous studies proposing benchmarks to assess systematic generalization, as well as models and training regimes designed to enhance it. Many of these efforts are framed as addressing the challenge posed by Fodor and Pylyshyn. However, while they argue for systematicity of representations, existing benchmarks and models primarily focus on the systematicity of behaviour. We emphasize the crucial nature of this distinction. Furthermore, building on Hadley's (1994) taxonomy of systematic generalization, we analyze the extent to which behavioural systematicity is tested by key benchmarks in the literature across language and vision. Finally, we highlight ways of assessing systematicity of representations in ML models as practiced in the field of mechanistic interpretability.

[261] Clustering and Median Aggregation Improve Differentially Private Inference

Kareem Amin,Salman Avestimehr,Sara Babakniya,Alex Bie,Weiwei Kong,Natalia Ponomareva,Umar Syed

Main category: cs.LG

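TL;DR: The paper proposes an improved method for differentially private language model inference that clusters the input data and privately computes medians, improving the quality of generated text while lowering the privacy cost.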

Details Motivation: Existing methods generate private text by sampling sensitive inputs uniformly at random, which works poorly when the inputs cover heterogeneous topics. The paper aims to fix this. Method: Selects inference batches by clustering the input data, and introduces a new algorithm based on median aggregation that exploits the similarity of predictions to reduce local sensitivity. Result: Experiments show that the method beats existing methods on representativeness metrics and downstream task performance, at a lower privacy cost. Conclusion: The method significantly improves the quality of private synthetic data while reducing privacy cost. Abstract: Differentially private (DP) language model inference is an approach for generating private synthetic text. A sensitive input example is used to prompt an off-the-shelf large language model (LLM) to produce a similar example. Multiple examples can be aggregated together to formally satisfy the DP guarantee. Prior work creates inference batches by sampling sensitive inputs uniformly at random. We show that uniform sampling degrades the quality of privately generated text, especially when the sensitive examples concern heterogeneous topics. We remedy this problem by clustering the input data before selecting inference batches. Next, we observe that clustering also leads to more similar next-token predictions across inferences. We use this insight to introduce a new algorithm that aggregates next token statistics by privately computing medians instead of averages. This approach leverages the fact that the median has decreased local sensitivity when next token predictions are similar, allowing us to state a data-dependent and ex-post DP guarantee about the privacy properties of this algorithm. Finally, we demonstrate improvements in terms of representativeness metrics (e.g., MAUVE) as well as downstream task performance. We show that our method produces high-quality synthetic data at significantly lower privacy cost than a previous state-of-the-art method.

[262] Urania: Differentially Private Insights into AI Use

Daogao Liu,Edith Cohen,Badih Ghazi,Peter Kairouz,Pritish Kamath,Alexander Knop,Ravi Kumar,Pasin Manurangsi,Adam Sealfon,Da Yu,Chiyuan Zhang

Main category: cs.LG

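TL;DR: Urania is a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. It achieves end-to-end privacy protection through private clustering and innovative keyword extraction methods combined with DP tools.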

Details Motivation: To provide a privacy-preserving way of generating insights about LLM chatbot interactions, ensuring that user data is not leaked. Method: A private clustering mechanism, several keyword extraction methods (frequency-based, TF-IDF-based, LLM-guided), and DP tools (clustering, partition selection, histogram-based summarization). Result: The evaluation shows that the framework preserves semantic content and similarity while providing rigorous privacy protection, and an empirical privacy evaluation confirms its robustness. Conclusion: Urania balances data utility against privacy protection and extracts meaningful conversational insights. Abstract: We introduce Urania, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, Urania provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework's ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.

[263] From EHRs to Patient Pathways: Scalable Modeling of Longitudinal Health Trajectories with LLMs

Chantal Pellegrini,Ege Özsoy,David Bani-Harouni,Matthias Keicher,Nassir Navab

Main category: cs.LG

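TL;DR: The paper proposes EHR2Path, a new approach that transforms diverse electronic health record data into a structured representation and uses a holistic pathway prediction model to predict patients' future health trajectories.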

Details Motivation: Healthcare systems struggle to manage and interpret large volumes of heterogeneous patient data for personalized care, and existing approaches are often confined to narrow use cases and a limited feature space. Method: Transforms electronic health record data into a structured representation, designs a pathway prediction model called EHR2Path, and introduces a novel summary mechanism that embeds long-term temporal context. Result: EHR2Path performs strongly in both next time-step prediction and longitudinal simulation, outperforms baseline models, and can simulate patient trajectories in detail. Conclusion: EHR2Path opens a path towards predictive and personalized healthcare and supports predictions for a range of evaluation tasks. Abstract: Healthcare systems face significant challenges in managing and interpreting vast, heterogeneous patient data for personalized care. Existing approaches often focus on narrow use cases with a limited feature space, overlooking the complex, longitudinal interactions needed for a holistic understanding of patient health. In this work, we propose a novel approach to patient pathway modeling by transforming diverse electronic health record (EHR) data into a structured representation and designing a holistic pathway prediction model, EHR2Path, optimized to predict future health trajectories. Further, we introduce a novel summary mechanism that embeds long-term temporal context into topic-specific summary tokens, improving performance over text-only models, while being much more token-efficient. EHR2Path demonstrates strong performance in both next time-step prediction and longitudinal simulation, outperforming competitive baselines. It enables detailed simulations of patient trajectories, inherently targeting diverse evaluation tasks, such as forecasting vital signs, lab test results, or length-of-stay, opening a path towards predictive and personalized healthcare.

[264] Dissecting Long Reasoning Models: An Empirical Study

Yongyu Mu,Jiali Zeng,Bei Li,Xinyan Guan,Fandong Meng,Jie Zhou,Tong Xiao,Jingbo Zhu

Main category: cs.LG

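TL;DR: The paper analyzes the roles of positive and negative samples in reinforcement learning and finds that negative samples significantly improve generalization; it explores strategies to address data inefficiency; and it examines the causes of unstable model performance and how to mitigate them.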

Details Motivation: To study the roles of positive and negative samples in reinforcement learning, the problem of data inefficiency, and unstable model performance. Method: Systematically analyzes the roles of positive and negative samples; proposes relative length rewards and offline sample injection; mitigates performance instability through multiple evaluation runs. Result: Training on negative samples alone can rival standard RL; data efficiency improves; multiple evaluation runs reduce performance fluctuations. Conclusion: Negative samples are crucial for generalization; making better use of data and running multiple evaluations improve model stability and performance. Abstract: Despite recent progress in training long-context reasoning models via reinforcement learning (RL), several open questions and counterintuitive behaviors remain. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in RL, revealing that positive samples mainly facilitate data fitting, whereas negative samples significantly enhance generalization and robustness. Interestingly, training solely on negative samples can rival standard RL training performance. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address this, we explore two straightforward strategies, including relative length rewards and offline sample injection, to better leverage these data and enhance reasoning efficiency and capability. (3) We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes, and demonstrate that multiple evaluation runs mitigate this issue.
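The "zero advantage" observation in point (2) follows from how group-relative advantages are usually computed: each rollout's reward is normalized against the other rollouts for the same prompt. The sketch below assumes the commonly cited mean/standard-deviation normalization and a hypothetical helper name, `group_relative_advantages`; the exact variant used in the paper is not stated in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (assumed GRPO-style form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]

# All rollouts solved the prompt: every advantage is zero, so the group
# contributes no gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]

# Mixed outcomes: correct rollouts get positive advantage, incorrect negative.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

The same thing happens when every rollout fails, which is why prompts that are uniformly too easy or too hard for the current policy are wasted compute under this scheme.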

[265] Mitigating Degree Bias Adaptively with Hard-to-Learn Nodes in Graph Contrastive Learning

Jingyu Hu,Hongbo Bo,Jun Hong,Xiaowei Liu,Weiru Liu

Main category: cs.LG

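TL;DR: The paper proposes HAR, a contrastive loss that adds more positive pairs and weights pairs adaptively to mitigate degree bias in GNN node classification, and validates it through the SHARP framework.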

Details Motivation: GNNs suffer from degree bias in node classification, and existing GCL-based methods leave low-degree nodes underperforming because positive pairs are scarce and the information they receive is noisy. Method: Proposes the HAR contrastive loss, which uses node labels to add positive pairs and adaptively weights positive and negative pairs according to their learning hardness; develops the SHARP experimental framework to extend HAR to more scenarios. Result: Experiments on four datasets show that SHARP outperforms baselines at both the global and degree levels. Conclusion: HAR and SHARP mitigate degree bias in GNNs and improve node classification performance. Abstract: Graph Neural Networks (GNNs) often suffer from degree bias in node classification tasks, where prediction performance varies across nodes with different degrees. Several approaches, which adopt Graph Contrastive Learning (GCL), have been proposed to mitigate this bias. However, the limited number of positive pairs and the equal weighting of all positives and negatives in GCL still lead to low-degree nodes acquiring insufficient and noisy information. This paper proposes the Hardness Adaptive Reweighted (HAR) contrastive loss to mitigate degree bias. It adds more positive pairs by leveraging node labels and adaptively weights positive and negative pairs based on their learning hardness. In addition, we develop an experimental framework named SHARP to extend HAR to a broader range of scenarios. Both our theoretical analysis and experiments validate the effectiveness of SHARP. The experimental results across four datasets show that SHARP achieves better performance against baselines at both global and degree levels.

[266] Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Danil Sivtsov,Ivan Rodkin,Gleb Kuzmin,Yuri Kuratov,Ivan Oseledets

Main category: cs.LG

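TL;DR: Diagonal Batching is a scheduling scheme that parallelizes segment processing in RMTs, removing the performance bottleneck of sequential execution and substantially improving the efficiency of long-context inference.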

Details Motivation: Transformer models handle long-context inference poorly because of their high time and memory complexity; RMTs lower the complexity, but their sequential execution creates a performance bottleneck. Method: Proposes Diagonal Batching, which parallelizes across segments by reordering computation at run time, with no retraining of existing RMT models. Result: On a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and substantially reduces inference cost and latency. Conclusion: Diagonal Batching removes the sequential bottleneck of RMTs, making them a practical option for real-world long-context applications. Abstract: Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.
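One way to picture a diagonal schedule is as a wavefront over a grid of (segment, layer) cells. The sketch below assumes a dependency structure that the abstract does not spell out: cell (s, l) needs the output of the same segment at layer l-1 and the memory state of segment s-1 at layer l. Under that assumption, all cells with the same s + l are independent and can be batched together. The function name `diagonal_schedule` is hypothetical.

```python
def diagonal_schedule(num_segments, num_layers):
    """Group (segment, layer) cells into steps that can run in parallel.

    Assumed dependencies: (s, l) waits on (s, l - 1) and (s - 1, l),
    so every cell on the anti-diagonal s + l == d is ready at step d.
    """
    steps = []
    for d in range(num_segments + num_layers - 1):
        group = [(s, d - s) for s in range(num_segments)
                 if 0 <= d - s < num_layers]
        steps.append(group)
    return steps

for step, group in enumerate(diagonal_schedule(num_segments=3, num_layers=2)):
    print(step, group)
```

With S segments and L layers, this reordering needs S + L - 1 batched steps rather than S × L sequential cell evaluations, and it computes exactly the same values because only the execution order changes.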

[267] MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Johannes von Oswald,Nino Scherrer,Seijin Kobayashi,Luca Versari,Songlin Yang,Maximilian Schlegel,Kaitlin Maile,Yanick Schimpf,Oliver Sieberling,Alexander Meulemans,Rif A. Saurous,Guillaume Lajoie,Charlotte Frenkel,Razvan Pascanu,Blaise Agüera y Arcas,João Sacramento

Main category: cs.LG

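TL;DR: The paper presents a numerically stable, parallelizable Mesa layer that minimizes an in-context loss to optimality, improving language model performance, particularly on long-context tasks, at the cost of additional compute during inference.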

Details Motivation: Causal transformer architectures dominate sequence modeling, but their memory and compute requirements scale linearly during inference, which limits efficiency. Recent work linearized the softmax operation to obtain efficient RNN models, and this paper aims to improve the performance of such models further. Method: Introduces a numerically stable, chunkwise parallelizable Mesa layer that minimizes an in-context loss to optimality at every time point using a fast conjugate gradient solver. Result: Experiments show lower language modeling perplexity and higher downstream task performance than previous RNNs, especially on tasks requiring long-context understanding, at the cost of extra compute during inference. Conclusion: The work shows the potential of improving performance by spending more compute at inference time, here by solving sequential optimization problems inside the neural network itself, and offers a new direction for efficient sequence modeling. Abstract: Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional FLOPs spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
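To make the "conjugate gradient solver" step concrete, the sketch below solves a small regularized least-squares system of the form (Σᵢ kᵢkᵢᵀ + λI) x = q without ever forming the matrix. This is only an illustration of the numerical technique under assumed notation: the keys `keys`, the regularizer `lam`, and both function names are hypothetical, and the actual Mesa layer formulation and its chunkwise parallel implementation are not given in the abstract.

```python
def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def regularized_gram_matvec(keys, lam):
    """Return v -> (sum_i k_i k_i^T + lam * I) v, computed key by key."""
    def matvec(v):
        out = [lam * vi for vi in v]
        for k in keys:
            dot = sum(ki * vi for ki, vi in zip(k, v))
            for j, kj in enumerate(k):
                out[j] += dot * kj
        return out
    return matvec

keys = [[1.0, 0.0], [0.0, 2.0]]      # toy in-context keys (assumed)
x = conjugate_gradient(regularized_gram_matvec(keys, lam=1.0), [2.0, 5.0])
print([round(v, 6) for v in x])
```

Here the system matrix is diag(2, 5), so the exact solution is (1, 1). Each iteration costs one matrix-vector product, which is one way to see where the extra inference-time FLOPs mentioned in the abstract could come from.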

[268] Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay

Yifan Sun,Jingyan Shen,Yibin Wang,Tianyu Chen,Zhendong Wang,Mingyuan Zhou,Huan Zhang

Main category: cs.LG

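TL;DR: The paper proposes two techniques, difficulty-targeted online data selection and rollout replay, to improve the data efficiency of reinforcement learning fine-tuning for large language models (LLMs), substantially reducing training time.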

Details Motivation: Existing RL fine-tuning methods are resource-intensive and largely overlook data efficiency. Method: Introduces the notion of adaptive difficulty to guide online data selection, develops an attention-based framework to estimate that difficulty, and proposes a rollout replay mechanism that reuses recent rollouts. Result: Across 6 LLM-dataset combinations, the method reduces fine-tuning time by 25% to 65% while reaching the same performance as the original GRPO algorithm. Conclusion: The proposed techniques significantly improve data efficiency and offer a practical approach to RL fine-tuning of LLMs. Abstract: Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism that reuses recent rollouts, lowering per-step computation while maintaining stable updates. Extensive experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 25% to 65% to reach the same level of performance as the original GRPO algorithm.
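The selection step can be sketched very simply once some per-question difficulty estimate exists. The example below assumes the estimate is a predicted pass rate under the current policy and that "moderate difficulty" means closest to a target of 0.5; both choices, and the name `select_moderate`, are illustrative assumptions rather than the paper's definitions. How the estimate is obtained (the attention-based framework in the abstract) is not modeled here.

```python
def select_moderate(pass_rates, k, target=0.5):
    """Pick the k questions whose estimated pass rate is closest to target.

    pass_rates maps a question id to an estimated probability of success
    under the current policy (hypothetical input format).
    """
    ranked = sorted(pass_rates, key=lambda q: abs(pass_rates[q] - target))
    return ranked[:k]

estimates = {"q1": 0.0, "q2": 0.5, "q3": 0.9, "q4": 0.4}
print(select_moderate(estimates, k=2))  # ['q2', 'q4']
```

Questions the policy always solves or always fails (pass rate near 1 or 0) are ranked last, which matches the abstract's motivation that moderately difficult questions are more likely to yield informative learning signals.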

[269] Kinetics: Rethinking Test-Time Scaling Laws

Ranajoy Sadhukhan,Zhuoming Chen,Haizhong Zheng,Yang Zhou,Emma Strubell,Beidi Chen

Main category: cs.LG

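TL;DR: The study finds that the test-time effectiveness of smaller models is overestimated, and proposes a new Kinetics Scaling Law that highlights the importance of sparse attention.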

Details Motivation: Prior work is grounded in compute-optimality and overlooks memory access bottlenecks at test time, so more complete guidance on resource allocation is needed. Method: Analyzes models from 0.6B to 32B parameters, proposes the Kinetics Scaling Law, and designs a scaling paradigm centered on sparse attention. Result: Sparse attention models outperform dense counterparts in both low-cost and high-cost regimes, with substantial accuracy gains on AIME problem solving. Conclusion: Sparse attention is key to realizing the potential of test-time scaling; the code is open source. Abstract: We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-N, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than on smaller ones. A key reason is that in test-time scaling (TTS), attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.

[270] Inference-Time Hyper-Scaling with KV Cache Compression

Adrian Łańcucki,Konrad Staniszewski,Piotr Nawrot,Edoardo M. Ponti

Main category: cs.LG

TL;DR: Compressing the KV cache enables inference-time hyper-scaling, improving reasoning accuracy without increasing compute cost.

Details Motivation: In Transformer LLMs, generation cost is bottlenecked by the size of the KV cache rather than by the number of generated tokens, so the paper explores improving efficiency by compressing the KV cache. Method: Proposes Dynamic Memory Sparsification (DMS), which delays token eviction and implicitly merges representations, achieving 8× compression with only 1K training steps. Result: DMS is shown to be effective across multiple LLM families; for example, Qwen-R1 32B gains significantly in accuracy on several benchmarks. Conclusion: DMS offers a practical route to inference-time hyper-scaling, substantially improving accuracy without adding to the compute burden. Abstract: Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8× compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.
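A back-of-the-envelope calculation shows why cache size limits generation length. The sketch below uses the standard accounting for a decoder-only Transformer (one key and one value vector per layer, per KV head, per token) and models compression simply as each cached token costing 1/ratio of its uncompressed size. The model dimensions and the 1 GiB budget are made-up illustrative values, not figures from the paper, and the function names are hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Uncompressed KV cache size for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

def tokens_within_budget(budget_bytes, layers, kv_heads, head_dim,
                         bytes_per_elem=2, compression=1):
    """Tokens that fit in the budget if the cache shrinks by `compression`."""
    per_token = kv_cache_bytes(layers, kv_heads, head_dim, 1, bytes_per_elem)
    return (budget_bytes * compression) // per_token

# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit values.
budget = 1 << 30  # 1 GiB of cache memory (assumed)
print(tokens_within_budget(budget, 32, 8, 128))
print(tokens_within_budget(budget, 32, 8, 128, compression=8))
```

Under this simplified accounting, an 8× compression ratio lets the same memory hold eight times as many tokens, which is the headroom that hyper-scaling spends on longer or more parallel generations.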

[271] You Only Train Once

Christos Sakaridis

Main category: cs.LG

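TL;DR: The paper proposes YOTO, a method that automatically optimizes loss-weight hyperparameters in a single training run, avoiding the tedious grid search used traditionally.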

Details Motivation: Conventional practice needs many training runs to tune loss-weight hyperparameters, which is inefficient and time-consuming. YOTO aims to solve this in a single training run. Method: YOTO treats loss-weight hyperparameters as network parameters and learns them by gradient-based optimization. It uses the differentiable composite loss with a softmax layer that guarantees positive weights, and adds a regularization loss to avoid degenerate solutions. Result: On 3D estimation and semantic segmentation, YOTO outperforms conventional grid search on test data. Conclusion: YOTO optimizes loss weights efficiently in one shot, improves model performance, and offers a new approach to hyperparameter optimization. Abstract: The title of this paper is perhaps an overclaim. Of course, the process of creating and optimizing a learned model inevitably involves multiple training runs which potentially feature different architectural designs, input and output encodings, and losses. However, our method, You Only Train Once (YOTO), indeed contributes to limiting training to one shot for the latter aspect of loss selection and weighting. We achieve this by automatically optimizing loss weight hyperparameters of learned models in one shot via standard gradient-based optimization, treating these hyperparameters as regular parameters of the networks and learning them. To this end, we leverage the differentiability of the composite loss formulation, which is widely used for optimizing multiple empirical losses simultaneously, and model it as a novel layer which is parameterized with a softmax operation that satisfies the inherent positivity constraints on loss hyperparameters while avoiding degenerate empirical gradients. We complete our joint end-to-end optimization scheme by defining a novel regularization loss on the learned hyperparameters, which models a uniformity prior among the employed losses while ensuring boundedness of the identified optima. We evidence the efficacy of YOTO in jointly optimizing loss hyperparameters and regular model parameters in one shot by comparing it to the commonly used brute-force grid search across state-of-the-art networks solving two key problems in computer vision, i.e., 3D estimation and semantic segmentation, and showing that it consistently outperforms the best grid-search model on unseen test data. Code will be made publicly available.
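The softmax parameterization described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' layer: unconstrained logits are mapped through a softmax so that every loss weight is positive and the weights sum to one. The abstract does not give the form of the regularization loss, so `uniformity_penalty` below uses the KL divergence from the uniform distribution purely as an assumed stand-in for a "uniformity prior". All function names are hypothetical.

```python
import math


def softmax(logits):
    """Map unconstrained logits to positive weights that sum to one."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def composite_loss(logits, losses):
    """Weighted sum of individual losses, with weights given by softmax(logits)."""
    weights = softmax(logits)
    return sum(w * l for w, l in zip(weights, losses))


def uniformity_penalty(logits):
    """Assumed stand-in regularizer: KL(weights || uniform). Zero only for equal weights."""
    weights = softmax(logits)
    n = len(weights)
    return sum(w * math.log(w * n) for w in weights)


# Equal logits give equal weights, so the composite loss is the plain average.
print(composite_loss([0.0, 0.0], [1.0, 3.0]))
print(uniformity_penalty([0.0, 0.0, 0.0]), uniformity_penalty([2.0, 0.0, 0.0]))
```

In an actual training loop the logits would be trainable parameters updated by the same optimizer as the network weights; an autodiff framework is needed for that step and is omitted here.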

[272] StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation

Ranjith Merugu,Bryan Bo Cao,Shubham Jain

Main category: cs.LG

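TL;DR: StatsMerging is a lightweight, learning-based model merging method that uses weight distribution statistics and SVD singular values to guide task coefficient prediction, without requiring ground-truth labels or test samples.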

Details Motivation: Merge multiple large models under a limited memory budget while avoiding any dependence on ground-truth labels or test samples. Method: Uses SVD singular values to capture task-specific weight distributions, models these distributions with a lightweight learner called StatsMergeLearner, and introduces Task-Specific Teacher Distillation to handle models with heterogeneous architectures. Result: Experiments on eight tasks show that StatsMerging outperforms existing techniques in overall accuracy, generalization, and robustness. Conclusion: StatsMerging provides an efficient, lightweight, label-free solution for model merging that suits heterogeneous architectures and multi-task settings. Abstract: Model merging has emerged as a promising solution to accommodate multiple large models within constrained memory budgets. We present StatsMerging, a novel lightweight learning-based model merging method guided by weight distribution statistics without requiring ground truth labels or test samples. StatsMerging offers three key advantages: (1) It uniquely leverages singular values from singular value decomposition (SVD) to capture task-specific weight distributions, serving as a proxy for task importance to guide task coefficient prediction; (2) It employs a lightweight learner StatsMergeLearner to model the weight distributions of task-specific pre-trained models, improving generalization and enhancing adaptation to unseen samples; (3) It introduces Task-Specific Teacher Distillation for merging vision models with heterogeneous architectures, a merging learning paradigm that avoids costly ground-truth labels by task-specific teacher distillation. Notably, we present two types of knowledge distillation, (a) distilling knowledge from task-specific models to StatsMergeLearner; and (b) distilling knowledge from models with heterogeneous architectures prior to merging. Extensive experiments across eight tasks demonstrate the effectiveness of StatsMerging. Our results show that StatsMerging outperforms state-of-the-art techniques in terms of overall accuracy, generalization to unseen tasks, and robustness to image quality variations.

[273] Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

Marianna Nezhurina,Tomer Porian,Giovanni Pucceti,Tommie Kerssies,Romain Beaumont,Mehdi Cherti,Jenia Jitsev

Main category: cs.LG

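TL;DR: The paper examines how scaling laws can be used to compare models and datasets in order to improve pre-training. Comparing two language-vision learning procedures, CLIP and MaMMUT, it finds that MaMMUT improves more strongly with scale and is more sample-efficient than CLIP.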

Details Motivation: Use scaling laws to compare models and datasets, avoiding misleading conclusions drawn from a single reference scale, and thereby improve open foundation models and datasets systematically. Method: Derives scaling laws from dense measurements across model and sample scales, compares CLIP and MaMMUT, and validates the results on downstream tasks including classification, retrieval, and segmentation. Result: MaMMUT improves more strongly with scale and is more sample-efficient than CLIP, with consistent trends across datasets and tasks. Conclusion: Accurately derived scaling laws are an effective means of comparing models and datasets across scales and support the systematic improvement of open foundation models. Abstract: In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, making it possible to decide which procedure is to be preferred for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples-seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, that use either a contrastive-only loss or a contrastive plus captioning (text-generative) loss. Ensuring sufficient prediction accuracy for held-out points, we use the derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen the validity of the comparison, we show scaling laws for various downstream tasks (classification, retrieval, and segmentation) and for different open datasets (DataComp, DFN, and Re-LAION), consistently observing the same trends. We show that the comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws thus provides a means to perform model and dataset comparison across scale spans, avoiding misleading conclusions based on measurements from single reference scales only, paving the road for systematic comparison and improvement of open foundation models and datasets for their creation. 
We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves $80.3\%$ zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing the experiments in the paper and the raw experimental data can be found at https://github.com/LAION-AI/scaling-laws-for-comparison.
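As a rough illustration of how a scaling law can be derived and then checked on a held-out point, the sketch below fits a pure power law $E(C) = a \cdot C^{-b}$ by least squares in log-log space. The functional form is an assumption made for simplicity (the abstract does not state the form the authors fit), the data is synthetic, and the function names are hypothetical; none of the numbers are results from the paper.

```python
import math


def fit_power_law(scales, errors):
    """Least-squares fit of log(error) = log(a) - b * log(scale). Returns (a, b)."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(a, b, scale):
    """Error predicted by the fitted power law at a given scale."""
    return a * scale ** (-b)


# Synthetic measurements generated from a known law (a = 2.0, b = 0.5).
scales = [1e6, 1e7, 1e8, 1e9]
errors = [2.0 * s ** -0.5 for s in scales]

a, b = fit_power_law(scales, errors)
held_out_prediction = predict(a, b, 1e10)  # extrapolate to a scale not used in the fit
print(a, b, held_out_prediction)
```

Comparing two training procedures then amounts to fitting one curve per procedure on the same scale grid and comparing the fitted exponents and the extrapolated errors, rather than comparing two single measurements.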

[274] Exploring bidirectional bounds for minimax-training of Energy-based models

Cong Geng,Jia Wang,Li Chen,Zhiyong Gao,Jes Frellsen,Søren Hauberg

Main category: cs.LG

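TL;DR: The paper proposes training energy-based models (EBMs) with bidirectional bounds to address the instability of conventional training, and compares four different bounds.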

Details Motivation: Energy-based models (EBMs) offer an elegant framework for density estimation but are hard to train. Existing approaches train through a variational lower bound, which leads to instability. Method: Proposes bidirectional-bound training: maximize a lower bound and minimize an upper bound. Four bounds derived from different perspectives are studied: lower bounds based on the singular values of the generator Jacobian and on mutual information, and upper bounds based on a gradient penalty and on diffusion processes. Result: Bidirectional bounds stabilize EBM training and yield high-quality density estimation and sample generation. Conclusion: Bidirectional-bound training effectively addresses the instability of EBM training and offers a new approach for density estimation and generation tasks. Abstract: Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks, by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning that we maximize a lower bound and minimize an upper bound when training the EBM. We investigate four different bounds on the log-likelihood derived from different perspectives. We derive lower bounds based on the singular values of the generator Jacobian and on mutual information. To upper bound the negative log-likelihood, we consider a gradient penalty-like bound, as well as one based on diffusion processes. In all cases, we provide algorithms for evaluating the bounds. We compare the different bounds to investigate the pros and cons of the different approaches. Finally, we demonstrate that the use of bidirectional bounds stabilizes EBM training and yields high-quality density estimation and sample generation.

[275] Identifying and Understanding Cross-Class Features in Adversarial Training

Zeming Wei,Yiwen Guo,Yisen Wang

Main category: cs.LG

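TL;DR: The paper studies adversarial training (AT) through the lens of class-wise feature attribution, finds that cross-class features are crucial for robustness, and reveals how feature learning changes over the course of AT.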

Details Motivation: Adversarial training (AT) is an effective way to make deep neural networks robust against adversarial attacks, but its training mechanisms and dynamics remain unclear. This paper analyzes them through class-wise features. Method: Studies AT through features shared by multiple classes (cross-class features), supported theoretically by a synthetic data model, with systematic studies across model architectures and settings. Result: In the initial stage of AT the model tends to learn more cross-class features until the best robustness checkpoint; afterwards, as robust overfitting sets in, the model relies more on class-specific features. Conclusion: The study offers a new perspective on AT mechanisms, unifies two existing properties of AT (the advantage of soft-label training and robust overfitting), and deepens the understanding of AT. Abstract: Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, a claim we support with theoretical evidence from a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, the model tends to make decisions based on more class-specific features. Based on these discoveries, we further provide a unified view of two existing properties of AT, including the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives on studying them. Our code is available at https://github.com/PKU-ML/Cross-Class-Features-AT.

[276] Aligning Latent Spaces with Flow Priors

Yizhuo Li,Yuying Ge,Yixiao Ge,Ying Shan,Ping Luo

Main category: cs.LG

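TL;DR: Proposes a new framework that uses flow-based generative models as priors to align learnable latent spaces with arbitrary target distributions. The method avoids expensive likelihood evaluations and ODE solving.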

Details Motivation: Existing methods for latent space alignment are computationally expensive and struggle with complex target distributions. Method: A flow model is pretrained to capture the target distribution; the latent space is then regularized with an alignment loss whose minimization serves as a surrogate for maximizing a variational lower bound. Result: The alignment loss is proven to be a computationally tractable surrogate objective, and experiments on ImageNet confirm its effectiveness. Conclusion: The framework offers a new approach to latent space alignment, backed by both theoretical and empirical evidence. Abstract: This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape closely approximates the negative log-likelihood of the target distribution. We further validate the effectiveness of our approach through large-scale image generation experiments on ImageNet with diverse target distributions, accompanied by detailed discussions and ablation studies. With both theoretical and empirical validation, our framework paves a new way for latent space alignment.
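The idea of reusing a flow matching objective with the latent as the optimization target can be sketched on a toy problem. Everything below is an assumption made for illustration: the linear path $x_t = (1-t)\,\epsilon + t\,y$ with velocity target $y - \epsilon$ is one common flow matching convention and may differ from the paper's; the frozen "flow model" is a closed-form velocity field for a point-mass target at `mu`, standing in for a pretrained network; and $t$ is drawn from $[0, 0.9]$ only to keep that closed-form field well-conditioned. All names are hypothetical.

```python
import random


def alignment_loss(latent, velocity_fn, rng, num_samples=64, t_max=0.9):
    """Monte Carlo flow matching loss with the latent as the data endpoint.

    velocity_fn is treated as frozen; only the latent would be optimized.
    """
    dim = len(latent)
    total = 0.0
    for _ in range(num_samples):
        t = rng.uniform(0.0, t_max)
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Point on the straight path from noise (t = 0) to the latent (t = 1).
        x_t = [(1.0 - t) * n + t * y for n, y in zip(noise, latent)]
        # Velocity of that straight path.
        target = [y - n for n, y in zip(noise, latent)]
        pred = velocity_fn(x_t, t)
        total += sum((p - g) ** 2 for p, g in zip(pred, target))
    return total / num_samples


def make_point_mass_velocity(mu):
    """Closed-form velocity field whose straight paths all end at mu (toy stand-in)."""
    def velocity(x_t, t):
        return [(m - x) / (1.0 - t) for m, x in zip(mu, x_t)]
    return velocity


mu = [1.0, -1.0]
velocity = make_point_mass_velocity(mu)

# A latent that matches the target incurs (numerically) zero loss; one that does not is penalized.
loss_at_target = alignment_loss(mu, velocity, random.Random(0))
loss_off_target = alignment_loss([0.0, 0.0], velocity, random.Random(0))
print(loss_at_target, loss_off_target)
```

The loss needs only forward evaluations of the frozen velocity field at sampled times, with no likelihood computation and no ODE integration, which matches the computational advantage the abstract describes. In practice the latent would come from a trainable encoder and gradients would flow through `x_t` and `target` via an autodiff framework.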